In recent years, we have witnessed growing interest from academia and industry in applying data science technologies to analyze large amounts of data. While a myriad of artifacts (datasets, pipeline scripts, etc.) are created in this process, there has so far been no systematic attempt to holistically collect and exploit all the knowledge and experience implicitly contained in those artifacts. Instead, data scientists resort to recovering information and experience from colleagues or learn via trial and error. Hence, this paper presents a scalable system, KGLiDS, that employs machine learning and knowledge graph technologies to abstract and capture the semantics of data science artifacts and their connections. Based on this information, KGLiDS enables a variety of downstream applications, such as data discovery and pipeline automation. Our comprehensive evaluation covers use cases in data discovery, data cleaning, transformation, and AutoML and shows that KGLiDS is significantly faster and has a lower memory footprint than the state of the art while achieving comparable or better accuracy.
Try out our KGLiDS Colab Demo and KGLiDS DataPrep Demo, which demonstrate our APIs on Kaggle data!
To learn more about Linked Data Science and its applications, please watch Dr. Mansour's talk at the Waterloo DSG Seminar (here).
- Conda
- Java (make sure the `$JAVA_HOME` environment variable is set; see tutorial here)
- Docker
Run the project initialization script, which does the following:

- Creates the `kglids` Conda environment.
- Downloads the necessary pip packages.
- Downloads the necessary word embedding models.
- Downloads and runs the GraphDB docker container.
- Installs PostgreSQL and pgvector (sets the default password to `postgres`).

```bash
bash init.sh
```
KGLiDS expects the pipeline scripts and CSV files to be organized in the following directory structure:
```
<DATA SOURCE NAME>/
├── <DATASET 1 ID>/
│   ├── data/
│   │   ├── <TABLE NAME>.csv
│   │   ├── <TABLE NAME>.csv
│   │   └── ...
│   └── notebooks/
│       ├── <NOTEBOOK 1 ID>/
│       │   ├── <NOTEBOOK PYTHON FILE>.py
│       │   └── pipeline_info.json
│       ├── <NOTEBOOK 2 ID>/
│       └── ...
├── <DATASET 2 ID>/
└── ...
```
The following is an example dataset under the `kaggle_small` data source:

```
kaggle_small
└── antfarol.car-sale-advertisements
    ├── data
    │   └── car_ad.csv
    └── notebooks
        └── aidenchoi-notebooke85a1481e0
            ├── notebooke85a1481e0.py
            └── pipeline_info.json
```
The `pipeline_info.json` files contain the Kaggle pipeline details (see `setup_kaggle_data.py`).
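For a concrete feel, here is a minimal sketch that writes such a file. All field names are hypothetical assumptions for illustration; the authoritative schema is whatever `setup_kaggle_data.py` produces for each Kaggle notebook.

```python
import json

# Hypothetical sketch of a pipeline_info.json file. The field names below are
# illustrative assumptions; see setup_kaggle_data.py for the actual schema.
pipeline_info = {
    "url": "https://www.kaggle.com/code/aidenchoi/notebooke85a1481e0",  # assumed field
    "title": "notebooke85a1481e0",                                      # assumed field
    "author": "aidenchoi",                                              # assumed field
    "votes": 3,                                                         # assumed field
    "date": "2021-06-01",                                               # assumed field
}

with open("pipeline_info.json", "w") as f:
    json.dump(pipeline_info, f, indent=2)
```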
- Add the data sources to `kglids_config.py` (a hypothetical example entry is sketched below, after these steps).
- Run KGLiDS:

```bash
python run_kglids.py
```
This script does the following:
- Profiles the datasets and creates the dataset graph.
- Analyzes pipeline scripts and creates pipeline graphs.
- Loads the constructed graphs into GraphDB.
- Loads the dataset embeddings into pgvector.
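As a rough illustration of the first step above, a data source entry in `kglids_config.py` might look like the sketch below. The `DataSource` class and its fields are assumptions for illustration, not the actual config structure; consult the file itself for the real format.

```python
# Hypothetical sketch of registering a data source in kglids_config.py.
# The DataSource class and its fields are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class DataSource:
    name: str       # data source name, e.g. "kaggle_small"
    path: str       # root directory laid out as described above
    file_type: str  # format of the table files, e.g. "csv"


data_sources = [
    DataSource(name="kaggle_small", path="/path/to/kaggle_small", file_type="csv"),
]
```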
KGLiDS provides predefined operations in the form of Python APIs that allow seamless integration with a conventional data science pipeline. Check out the full list of KGLiDS APIs.
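To give a flavor of how these APIs slot into a pipeline, the sketch below shows a hypothetical data discovery call. The import path, class, and method names are assumptions; refer to the linked API list for the real interface.

```python
import pandas as pd

# Hypothetical usage sketch: the import path, class, and method names are
# illustrative assumptions, not the actual KGLiDS API.
from kglids import KGLiDS

client = KGLiDS(endpoint="http://localhost:7200")  # GraphDB started by init.sh

# Discover tables that are likely unionable with the table we are working on,
# then load the top candidate to extend our training data.
candidates = client.get_unionable_tables("car_ad.csv", k=5)  # assumed method
df = pd.read_csv(candidates[0])                              # assumed: returns paths
print(df.head())
```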
To store the created knowledge graph in a standardized and well-structured way,
we developed an ontology for linked data science: the LiDS Ontology.
Check out the LiDS Ontology!
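Since the constructed graph conforms to LiDS and is served from GraphDB, it can also be queried with standard SPARQL tooling. Below is a minimal sketch using the SPARQLWrapper library, assuming the default local GraphDB endpoint; the repository name and the LiDS class/property IRIs are illustrative assumptions.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Query the GraphDB repository loaded by run_kglids.py. The repository name
# and the LiDS class/property IRIs below are illustrative assumptions.
sparql = SPARQLWrapper("http://localhost:7200/repositories/kglids")
sparql.setQuery("""
    SELECT ?table ?name WHERE {
        ?table a <http://kglids.org/ontology/Table> ;     # assumed class IRI
               <http://kglids.org/ontology/name> ?name .  # assumed property IRI
    }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["table"]["value"], row["name"]["value"])
```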
The following benchmark datasets were used to evaluate KGLiDS:
- Data Discovery: Table Union Search
- Kaggle
If you find our work useful, please cite it in your research.
```bibtex
@INPROCEEDINGS{kglids,
  author={Helali, Mossad and Monjazeb, Niki and Vashisth, Shubham and Carrier, Philippe and Helal, Ahmed and Cavalcante, Antonio and Ammar, Khaled and Hose, Katja and Mansour, Essam},
  booktitle={2024 IEEE 40th International Conference on Data Engineering (ICDE)},
  title={KGLiDS: A Platform for Semantic Abstraction, Linking, and Automation of Data Science},
  year={2024},
  pages={179-192},
  url={https://doi.org/10.1109/ICDE60146.2024.00021},
  ISSN={2375-026X},
}
```
We encourage contributions and bug fixes. Please don't hesitate to open a PR or create an issue if you encounter any bugs.
For any questions, please contact us: