- To classify software mentions in text extracted from biomedical and social science research articles.
- To provide a basis for building a Software Knowledge Graph.
- Usage: the researcher actually uses the software.
- Mention: the researcher only mentions, discloses, or refers to the software but does not actually use it.
- Creation: the researcher develops the software.
- Deposition: the researcher creates the software and then deposits it somewhere for future availability.
- Software mentions were annotated using the Brat annotation tool
- 1,727 files were annotated
- 5,309 sentences were annotated and extracted to build the feature space
- Convert Brat Standoff Format to BIO-Encoded Format
- Relevant Sentence Extraction
- Sentence Tokenization
- BIO Encoding
- Replace Software Mentions with a Placeholder
- Extract Contextual Features of Software Mentions
- Find the Position of Each Software Mention
- Extract Contextual Words Using a window_size of 3
- Apply Padding if Needed
- Generate Word Embeddings of Contextual Words
- Used Pre-trained Model (wikipedia-pubmed-and-PMC-w2v.bin)
- Generate Word Embeddings for POS tags of Contextual Words
- Generate Class-Specific Features
- Frequent Words, Frequent Tags, etc.
- Feature Concatenation
- Chose Random Forest Classifier (RF) from scikit-learn
- Considered the Best Classical Machine Learning Algorithm for this task
- Expected to give good performance and better predictability
- Hyperparameters in RF
- n_estimators: number of trees in the forest
- max_depth: maximum depth to which each tree is grown
- criterion: measure of split quality at each node (e.g. information gain)
- max_features: number of features to consider when looking for the best split at a node
- min_samples_leaf: minimum number of samples required at a leaf node
- Training dataset: 70%, test dataset: 30%
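
The placeholder, context-window, and padding steps listed above can be sketched in plain Python. This is only an illustration: the `<SOFTWARE>` and `<PAD>` tokens, the helper names, and the example sentence are hypothetical, and the window is assumed to take 3 words on each side of the mention, which the list does not state explicitly.

```python
from typing import List

PLACEHOLDER = "<SOFTWARE>"  # hypothetical placeholder token
PAD = "<PAD>"               # hypothetical padding token


def replace_mention(tokens: List[str], start: int, end: int) -> List[str]:
    """Replace the mention tokens[start:end] with a single placeholder."""
    return tokens[:start] + [PLACEHOLDER] + tokens[end:]


def context_window(tokens: List[str], position: int, window_size: int = 3) -> List[str]:
    """Return window_size words on each side of position, padded at the edges."""
    left = tokens[max(0, position - window_size):position]
    right = tokens[position + 1:position + 1 + window_size]
    # Pad on the outside so every window has exactly 2 * window_size words.
    left = [PAD] * (window_size - len(left)) + left
    right = right + [PAD] * (window_size - len(right))
    return left + right


# Made-up sentence; "SPSS Statistics" occupies tokens 2 and 3.
tokens = "We used SPSS Statistics to analyse the survey data".split()
masked = replace_mention(tokens, 2, 4)
position = masked.index(PLACEHOLDER)
window = context_window(masked, position)
print(window)  # ['<PAD>', 'We', 'used', 'to', 'analyse', 'the']
```

Each word in the resulting window would then be mapped to its word embedding and POS-tag embedding before concatenation.
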
Data | F1 Score(%) |
---|---|
Training | 97.55 |
Test | 60.37 |
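
A minimal sketch of the training setup, assuming scikit-learn is installed. The data here is synthetic (`make_classification`) and stands in for the concatenated feature vectors; the hyperparameter values and the weighted F1 averaging are illustrative assumptions, not the settings behind the scores in the table.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic 4-class data standing in for the real feature vectors
# (Usage, Mention, Creation, Deposition).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=8, n_classes=4, random_state=0
)

# 70% training, 30% test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# Hyperparameter values below are placeholders, not tuned settings.
clf = RandomForestClassifier(
    n_estimators=100,
    max_depth=None,
    criterion="entropy",   # information gain
    max_features="sqrt",
    min_samples_leaf=1,
    random_state=0,
)
clf.fit(X_train, y_train)

# The averaging mode is an assumption; the table does not say which one was used.
train_f1 = f1_score(y_train, clf.predict(X_train), average="weighted")
test_f1 = f1_score(y_test, clf.predict(X_test), average="weighted")
print(f"train F1: {train_f1:.2%}, test F1: {test_f1:.2%}")
```

A large gap between training and test F1, like the one in the table, usually points to overfitting. Limiting `max_depth` or raising `min_samples_leaf` is a common first thing to try.
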