Context-based-Classification-of-Software-Mentions-in-Scientific-Data(Bio-medical & Life Sciences)

Motivation

To find out which software and frameworks researchers use most in their work

To understand software availability and the attribution of software developers

Objective

To classify the software mentions that appear in text data extracted from biomedical and social sciences articles and research papers.
To provide a basis for building a software knowledge graph.

Classification

Each software mention is assigned to one of four classes according to its context.

Usage: the researcher actually used the software.
Mention: the researcher only mentions, discloses, or refers to the software without actually using it.
Creation: the researcher developed the software.
Deposition: the researcher created the software and then deposited it somewhere for future availability.

Dataset Copyrights

This dataset was originally created by developers at Rostock University and is their property. For commercial re-use of the data, contact the university administration.

Data Pre-processing

Software mentions were annotated using the brat annotation tool

1,727 files were annotated
5,309 sentences were annotated and extracted for the feature space

Conversion from brat standoff format to BIO-encoded format
- Relevant sentence extraction
- Sentence tokenization
- BIO encoding
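
The conversion steps above can be sketched roughly as follows. This is a minimal illustration, not the code from the notebook: it handles a single, already extracted sentence, uses a simple regex tokenizer, and assumes the annotation type is one of the class names (here `Usage`). The function names `parse_brat_annotations` and `bio_encode` are hypothetical.

```python
import re

def parse_brat_annotations(ann_text):
    """Collect (start, end, label) for text-bound annotations ("T" lines)."""
    spans = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, attributes, notes, etc.
        _ann_id, type_and_offsets, _surface = line.split("\t", 2)
        if ";" in type_and_offsets:
            continue  # discontinuous spans are ignored in this sketch
        label, start, end = type_and_offsets.split()
        spans.append((int(start), int(end), label))
    return sorted(spans)

def bio_encode(text, spans):
    """Tokenize text and tag each token as B-<label>, I-<label>, or O."""
    encoded = []
    for match in re.finditer(r"\w+|[^\w\s]", text):
        tag = "O"
        for start, end, label in spans:
            overlaps = match.start() < end and match.end() > start
            if overlaps:
                prefix = "B-" if match.start() <= start else "I-"
                tag = prefix + label
                break
        encoded.append((match.group(), tag))
    return encoded

# Toy sentence and annotation (invented for illustration, not from the dataset)
sentence = "We analysed the data with SPSS Statistics."
mention = "SPSS Statistics"
start = sentence.index(mention)
end = start + len(mention)
ann = f"T1\tUsage {start} {end}\t{mention}"

bio = bio_encode(sentence, parse_brat_annotations(ann))
for token, tag in bio:
    print(token, tag)
```

The brat offsets are character positions in the text file, so the first token that starts at the span start receives the `B-` tag and every later overlapping token receives `I-`.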

Feature Engineering

Replace software mentions with a placeholder
Extract contextual features of software mentions
- Find the position of each software mention
- Extract context words using a window_size of 3
- Pad the window if needed
Generate word embeddings for the context words
- Used a pre-trained model (wikipedia pubmed and PMC w2v.bin)
Generate word embeddings for the POS tags of the context words
Generate class-specific features
- Frequent words, frequent tags, etc.
Concatenate the features
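
The placeholder replacement and context-window extraction can be sketched as below. This is an illustration under assumptions: the placeholder string `SOFTWARE`, the padding token `<PAD>`, and the function name `context_windows` are hypothetical, and the sketch assumes the placeholder itself is left out of the window and that the window covers 3 words on each side.

```python
PLACEHOLDER = "SOFTWARE"  # hypothetical placeholder token
PAD = "<PAD>"             # hypothetical padding token

def context_windows(tokens, tags, window_size=3):
    """Return one padded context window per software mention.

    Each multi-token mention (B- followed by I- tags) is collapsed into a
    single placeholder, then window_size words are taken on each side.
    """
    collapsed, positions = [], []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            positions.append(len(collapsed))  # position of this mention
            collapsed.append(PLACEHOLDER)
        elif tag.startswith("I-"):
            continue  # remainder of the mention is absorbed by the placeholder
        else:
            collapsed.append(token)

    windows = []
    for pos in positions:
        left = collapsed[max(0, pos - window_size):pos]
        right = collapsed[pos + 1:pos + 1 + window_size]
        # Pad so that every window has exactly 2 * window_size entries
        left = [PAD] * (window_size - len(left)) + left
        right = right + [PAD] * (window_size - len(right))
        windows.append(left + right)
    return windows

tokens = ["We", "analysed", "the", "data", "with", "SPSS", "Statistics", "."]
tags = ["O", "O", "O", "O", "O", "B-Usage", "I-Usage", "O"]

windows = context_windows(tokens, tags, window_size=3)
print(windows)
```

Fixed-length windows matter because each context word is later mapped to an embedding vector and the vectors are concatenated, which requires the same number of positions for every mention.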

Modeling

The Random Forest classifier (RF) from scikit-learn was chosen
- Judged the best classical machine learning algorithm for this task
- Expected to give good performance and better predictability
Hyperparameters of the RF
- n_estimators: number of trees in the forest
- max_depth: maximum depth of each tree
- criterion: split-quality criterion (e.g. information gain) applied at each node split
- max_features: number of features to consider when looking for the best split at a node
- min_samples_leaf: minimum number of samples required at a leaf node
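
A search over the hyperparameters listed above might look like the following sketch. It is not the configuration used in the notebook: the grid values, the use of `GridSearchCV`, the macro-averaged F1 scoring, and the synthetic four-class data from `make_classification` are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the concatenated feature matrix (4 classes)
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=8,
    n_classes=4, random_state=0,
)

# Illustrative grid over the five hyperparameters named above
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [5, None],
    "criterion": ["gini", "entropy"],
    "max_features": ["sqrt", "log2"],
    "min_samples_leaf": [1, 3],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1_macro",  # assumed averaging; the README does not state it
    cv=3,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated F1:", round(search.best_score_, 3))
```

Limiting `max_depth` and raising `min_samples_leaf` both restrict how closely individual trees can fit the training samples, which is why they are common choices when tuning a forest.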

Results

The data was split into 70% training and 30% test.

Dataset	F1 score (%)
Training	97.55
Test	60.37
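
A 70/30 split with F1 computed on both parts can be sketched as follows. The synthetic data, the stratified split, and the macro averaging are assumptions for illustration; the README does not state which F1 averaging produced the numbers above. The gap between the reported training and test scores is consistent with overfitting, and comparing the two scores in this way is a common check for it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix and class labels
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=8,
    n_classes=4, random_state=0,
)

# 70% training, 30% test; stratify keeps class proportions similar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0,
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Macro averaging is an assumption, not taken from the README
train_f1 = f1_score(y_train, model.predict(X_train), average="macro")
test_f1 = f1_score(y_test, model.predict(X_test), average="macro")

print(f"Training F1: {100 * train_f1:.2f}%")
print(f"Test F1:     {100 * test_f1:.2f}%")
```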


ZohaibRamzan/Context-based-Classification-of-Software-Mentions-in-Scientific-Data
