CAPITAL is a toolchain for extracting, generating, tagging, and linking scientific data from Europe PMC to various knowledge bases. It consists of four key stages: `extract_data`, `generate_data`, `tagging`, and `linking`. This README guides you through installation, gives an overview of each stage, and explains how to get started.
## Installation

To install the dependencies for the entire pipeline, use the provided `requirements.txt`. Run the following command to install the required packages:

```bash
pip install -r requirements.txt
```
For individual components, refer to their respective folders for additional installation details and requirements.
Make sure to check for any duplicate dependencies before proceeding.
## extract_data

This stage extracts data from Europe PMC using both the Annotations API and the Articles API. The extracted data serves as the input for the following stages.
Steps:
- Retrieve annotations and articles from the Europe PMC APIs.
- Parse the data to prepare it for the next stage (see the request sketch below).
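For illustration, here is a minimal sketch of calling the two Europe PMC APIs with `requests`. The query, article ID, and annotation type are placeholder values, and the actual extraction scripts may use different parameters, pagination, and batching:

```python
# Minimal sketch: the example article ID, annotation type, and query
# are placeholders; the real extract_data code may differ.
import requests

ANNOTATIONS_API = "https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds"
ARTICLES_API = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"

def fetch_annotations(article_id: str, annotation_type: str) -> list:
    """Fetch annotations (e.g. named entities) for one article."""
    params = {"articleIds": article_id, "type": annotation_type, "format": "JSON"}
    resp = requests.get(ANNOTATIONS_API, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()

def search_articles(query: str, page_size: int = 25) -> dict:
    """Search the Articles API for article metadata matching a query."""
    params = {"query": query, "format": "json", "pageSize": page_size}
    resp = requests.get(ARTICLES_API, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    # "MED:28585529" and "Gene_Proteins" are illustrative placeholders.
    print(fetch_annotations("MED:28585529", "Gene_Proteins"))
    print(search_articles("CRISPR", page_size=5))
```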
## generate_data

In this stage, the extracted data is converted into the format required by the machine learning models. Specifically, input of the form `[sentence, [[token, ner, span_start, span_end], ...]]` is transformed into IOB format for training the ML classifier.
Steps:
- Convert tokenized sentences and named entity recognition (NER) data into IOB format.
- The output is ready for training machine learning models in the tagging stage (a conversion sketch follows).
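As a sketch of the conversion, the snippet below turns the nested structure into IOB2 tags. The field semantics and the `GP`/`DS` labels are assumptions for illustration; the actual conversion code may handle edge cases such as overlapping entities differently:

```python
# Minimal IOB2 conversion sketch. Assumption: each token row is
# [token, ner_label, span_start, span_end], with ner_label == "O"
# for non-entity tokens, as in the format described above.
def to_iob(sentence_entry):
    sentence, token_rows = sentence_entry
    tags = []
    prev_label = "O"
    for token, label, start, end in token_rows:
        if label == "O":
            tags.append((token, "O"))
        elif label == prev_label:
            # Continuation of the same entity type: inside tag.
            tags.append((token, f"I-{label}"))
        else:
            # First token of a new entity: begin tag.
            tags.append((token, f"B-{label}"))
        prev_label = label
    return tags

example = ["BRCA1 mutations cause cancer",
           [["BRCA1", "GP", 0, 5],      # "GP"/"DS" are hypothetical labels
            ["mutations", "O", 6, 15],
            ["cause", "O", 16, 21],
            ["cancer", "DS", 22, 28]]]
print(to_iob(example))
# [('BRCA1', 'B-GP'), ('mutations', 'O'), ('cause', 'O'), ('cancer', 'B-DS')]
```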
## tagging

This stage trains the machine learning classifier. We use HuggingFace libraries and models for the training process, along with Weights and Biases (wandb) for hyperparameter tuning and tracking.
Steps:
- Train the ML classifier on the IOB-formatted data.
- Utilize HuggingFace models for NER.
- Perform hyperparameter tuning and tracking using `wandb` (a condensed training sketch follows).
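Below is a condensed training sketch using the `transformers` Trainer. The toy one-sentence dataset, label set, model checkpoint, and hyperparameters are all placeholders, and `report_to="wandb"` assumes wandb is installed and logged in:

```python
# Condensed training sketch; everything here is illustrative, not the
# repository's exact training configuration.
from datasets import Dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

labels = ["O", "B-GP", "I-GP", "B-DS", "I-DS"]  # hypothetical IOB label set
model_name = "bert-base-cased"                   # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(labels))

def encode(example):
    # Tokenize pre-split words and align IOB labels to subword pieces,
    # labelling only the first piece of each word (-100 is ignored by the loss).
    enc = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
    aligned, prev = [], None
    for wid in enc.word_ids():
        aligned.append(-100 if wid is None or wid == prev
                       else example["ner_tags"][wid])
        prev = wid
    enc["labels"] = aligned
    return enc

raw = Dataset.from_dict({  # one-sentence toy corpus
    "tokens": [["BRCA1", "mutations", "cause", "cancer"]],
    "ner_tags": [[1, 0, 0, 3]],  # indices into `labels`
})
train_ds = raw.map(encode, remove_columns=raw.column_names)

args = TrainingArguments(
    output_dir="tagging_model",
    learning_rate=2e-5,
    num_train_epochs=1,
    per_device_train_batch_size=8,
    report_to="wandb",       # stream metrics to Weights and Biases
    run_name="capital-ner",  # hypothetical run name
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                  data_collator=DataCollatorForTokenClassification(tokenizer))
trainer.train()
```

For hyperparameter tuning, wandb sweeps can wrap a training function like this one; see the tagging folder for the actual configuration.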
## linking

In the final stage, the entities tagged by the machine learning classifier are linked to a knowledge base, creating connections between the identified entities and structured data resources.
Steps:
- Link the tagged entities to a predefined knowledge base.
- Create relationships between the classified entities and external data sources (a linking sketch follows).
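The details depend on the knowledge base in use, so the sketch below only illustrates the general idea: normalize each tagged mention and look it up in an index mapping names and synonyms to KB identifiers. The mini knowledge base and its identifiers are invented for illustration:

```python
# Illustrative entity-linking sketch. The mini knowledge base and its
# identifiers are invented; a real lookup against the target knowledge
# base (database or REST API) would replace the dictionary below.
def normalize(mention: str) -> str:
    """Case-fold and strip a mention so surface variants match."""
    return mention.strip().lower()

# name/synonym -> knowledge-base identifier (toy example)
KB_INDEX = {
    "brca1": "KB:0001",
    "breast cancer 1": "KB:0001",
    "cancer": "KB:0042",
}

def link_entities(tagged_entities):
    """Attach a KB id to each (mention, label) pair, or None if unmatched."""
    return [{"mention": mention, "label": label,
             "kb_id": KB_INDEX.get(normalize(mention))}
            for mention, label in tagged_entities]

print(link_entities([("BRCA1", "GP"), ("cancer", "DS"), ("TP53", "GP")]))
# TP53 has no entry in the toy index, so its kb_id comes back as None.
```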
## Workflow Overview

Here’s a general overview of the workflow:
- Extract Data: Use APIs to pull articles and annotations.
- Generate Data: Convert the data into a suitable format for machine learning.
- Tagging: Train and tune a classifier using the generated data.
- Linking: Connect the entities tagged by the classifier to a knowledge base.
For more details, feel free to check the individual folders for each stage or reach out via GitHub Issues.