Compatibility with newer spaCy versions #204
We have not yet investigated what it would take to make medaCy compatible with the latest versions of its dependencies.
Fair enough. Do you have a documented process of how to train the clinical notes model so I can do the work?
That model is trained on the n2c2 2018 track 2 dataset, which is described here. One can then train a model over it using the command line interface. Updating the package's dependencies will likely be a future project.
Thanks for your links on how to train the model. Also, thank you for writing this software and making it available to the public--it is well written. I have generated two models with an updated set of dependencies.
Changes
Re-Training
The automated process for training is in the subdirectory
The downside is that each of these steps requires a command line invocation, since many things can go wrong and the process must be "babysat". However, they are short and (as mentioned) documented.
Action Items
From here, I can either:
Please let me know what you want.
Future Work
I plan on first incorporating this work into my research, then more than likely adding a model for the N2C2 2014 Deidentification & Heart Disease data to tag PHI. Thanks again for this great software.
@plandes It sounds like you put a lot of work into adapting medaCy for your needs, and I really appreciate that you've clearly documented the workflow that you used. Am I correct in understanding that you have a copy of n2c2 2018, that you were able to train a CRF model with no source code modifications using spaCy 2.3.5, and that you received an IndexError when attempting to train a BERT model with a certain version of BERT that may not be compatible with the version of transformers used in medaCy? If so, we can probably set a range of permissible versions of spaCy that include the currently required version up to 2.3.5 (following regression tests).
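The range of permissible spaCy versions mentioned above could be checked with something like the following. This is only a minimal sketch, not medaCy code: the function names `parse_version` and `spacy_version_supported` are hypothetical, and the bounds assume the pin named in the issue (2.2.2) as the lower end and 2.3.5 as the upper end.

```python
import re


def parse_version(text):
    """Turn a string such as '2.3.5' into a tuple of ints, e.g. (2, 3, 5).

    Parsing stops at the first dot-separated piece that does not start
    with a digit; trailing suffixes such as 'rc1' are ignored.
    """
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def spacy_version_supported(installed, lowest="2.2.2", highest="2.3.5"):
    """Return True if `installed` lies within the inclusive range.

    The default bounds are assumptions taken from this thread, not
    values read from medaCy's packaging metadata.
    """
    return parse_version(lowest) <= parse_version(installed) <= parse_version(highest)
```

In a real package the same constraint would more likely be declared as a dependency version specifier and left to pip to enforce; the function above only illustrates the comparison.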
@swfarnsworth Correct, this was trained on the n2c2 2018 task 2 corpus. Also, the Clinical BERT embeddings, to be specific (see the paper), were trained with 10 epochs and showed better performance than the default cased (at least on the folds trained on the default 3 epochs before I stopped it). Also correct on the
I also forgot to mention that PyTorch 1.8 doesn't appear to be stable either (along with spaCy 3.0), so I backed off to 1.7.
Update: I have:
The newly trained model has very similar scores to the BERT+CRF model, with the non-CRF model scoring slightly higher. I trained it for 3 epochs instead of 10, but I doubt that makes a difference. I'll be taking down the medaCy_bertcrf_model_clinical_notes at some point because it uses a lot of Git LFS space. A better long-term strategy might be to make the models pip-installable packages because they take so much space. Speaking of which, will you please indicate whether you plan to incorporate these changes? If not, that's fine, but I'll change all repo URLs to my forked repos for the main source and models. Then I'll release medaCy to PyPI. Thanks again. This will be helpful in my own research.
@plandes, I am graduating at the end of this semester and I don't know if I will be able to review any changes to the package before then, though I will see if any of my colleagues might be able to. That being said, NLP@VCU will continue to support medaCy after my departure, and we appreciate how thoroughly you've documented your workflow. We will need to discuss internally what changes you've made and which we can merge into the main repository. Am I to understand that you were planning to publish your fork of medaCy to PyPI, or this one?
@swfarnsworth Congrats on graduating! Seems like a dream to me at this point. Yes, I totally understand--take the time you need. There's no reason I can't publish what I need for my own purposes, and we can all fold the changes back into NLP@VCU later. Yes, I'd publish medaCy with the BERT fix and updated dependencies under my own namespace (zensols), along with the models (assuming PyPI doesn't have size constraints), for my own purposes. However, if you can review the changes somewhat soon, I'll hold off and wait for that integration. If we can get everything merged back under NLP@VCU, then I'll take down my work from GitHub.
What problem does your feature solve?
Add instructions on how to retrain the models, or better, one robust, easy-to-run script, for new versions of packages (specifically spaCy 2.3.5, and later 3.0).
Describe the solution you'd like
I'd like to have an easy, reproducible way to retrain the model on an updated set of packages (numpy/msgpack/msgpack-numpy, torch, etc.), as I'm using this package with newer versions of its dependencies, specifically those packages pinned to a version (e.g. spaCy 2.2.2).
Describe alternatives you've considered
Using current versions of the packages works, but with warnings, and I don't trust the results, given that the word vectors and other data serialized to (for example) medacy-model-clinical-notes might have changed.
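To make the concern concrete, here is a rough sketch of how one might list installed packages whose versions differ from the pins before trusting a serialized model. It is an illustration only: the names `PINNED`, `installed_version` and `find_mismatches` are hypothetical, and only the spaCy pin comes from this issue; the remaining pins would have to be copied from the package's own requirements.

```python
from importlib import metadata

# Only the spaCy pin is stated in this issue; any other entries must be
# taken from the package's own requirements rather than guessed.
PINNED = {"spacy": "2.2.2"}


def installed_version(name):
    """Return the installed version of `name`, or None if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose installed version differs from its pin
    to a (pinned, installed) pair."""
    mismatches = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(PINNED).items():
        print(f"{name}: pinned {wanted}, installed {found}")
```

A mismatch reported here does not by itself mean the model is broken; it only flags where retraining or a regression test would be worth running.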
Additional context
If you can point me to the resources, I can write a script/process to do this automatically.