phenotypeCR

HPO-based phenotype concept recognition using language models

Installation

git clone https://github.com/at-cg/phenotypeCR.git
cd phenotypeCR
pip install requirements.txt

Download files from here

move files to the following directories

mv Downloads/embeddings/* .

else run the following commands

gdown 1ED1gqeqnyvX_Sk5_KA_W5XxMVDrWk2tS

After this step the directory structure should look like this

---phenotypeCR
   |---2022
        |---HPO_embeddings_38k.csv
        |---hpo2022.txt
    |---2024
        |---HPO_embeddings_40k.csv
        |---child_parent_dict_merged.json
        |---hpo_dict.txt 
        |---phenotype_to_genes.txt
    |---Evaluation
    |---test
    |---create_embeddings.ipynb
    |---gpt_models.ipynb
    |---README.md
    |---requirements.txt

We have provided a jupyter notebook gpt_models.ipynb which demonstrates how to use the models for phenotype concept recognition and normalisation. Besides, we have provided a test directory which contains the sample data and the output of the models. The Evaluation directory contains the datasets, Eval.ipynb file for reproducing the results.

Updating and Creating HPO Term Embeddings

HPO updates frequently with new terms. To handle this, we provide a method to generate embeddings for new HPO terms.

Steps to create embeddings:
1. Download the latest `hp.obo` file from:
   [http://purl.obolibrary.org/obo/hp.obo](http://purl.obolibrary.org/obo/hp.obo)

2. Open the `create_embeddings.ipynb` notebook.

3. Update the file path in the notebook to point to the downloaded `hp.obo` file.

4. Run the notebook to generate embeddings.

5. The resulting CSV file will contain embeddings for the new HPO terms.

Model Options

You can select either a finetuned or base model depending on your evaluation needs.

Finetuned Models

For improved accuracy, use the following finetuned models:

1. GPT4o-mini-2024-07-18:
   - Identifier: ft:gpt-4o-mini-2024-07-18:iisc-bangalore::AYf5TC9S

2. GPT4o-2024-08-06:
   - Identifier: ft:gpt-4o-2024-08-06:iisc-bangalore::AZ03ME6y

Base Models

For zero-shot evaluation, use one of these base models:

1. GPT4o-mini-2024-07-18:
   - Identifier: gpt-4o-mini-2024-07-18

2. GPT4o-2024-08-06:
   - Identifier: gpt-4o-2024-08-06

How to Use the Models

To use these models, you need an OpenAI API key to access the GPT4o models. If you do not have an API key, you can use the alternative BioMED_NER model.

Steps to use the models:

1. Refer to the `gpt_models.ipynb` notebook for guidance.

2. If using a custom dataset, refer to the `test` directory for a sample file format.

3. Ensure the following:
   - Update paths in the notebook to match your data and model setup.
   - Follow the sample file format in the `test` directory to ensure consistent results.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

phenotypeCR

Installation

move files to the following directories

After this step the directory structure should look like this

Updating and Creating HPO Term Embeddings

Model Options

How to Use the Models

About

Releases

Packages

Contributors 3

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
Evaluation		Evaluation
test		test
README.md		README.md
create_embeddings.ipynb		create_embeddings.ipynb
gpt_models.ipynb		gpt_models.ipynb
requirements.txt		requirements.txt

at-cg/phenotypeCR

Folders and files

Latest commit

History

Repository files navigation

phenotypeCR

Installation

move files to the following directories

After this step the directory structure should look like this

Updating and Creating HPO Term Embeddings

Model Options

How to Use the Models

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages