This framework provides APIs, called data transformers, to represent popular data-transformation patterns on a pandas DataFrame, a 2D data structure consisting of rows and labeled columns. You can construct a machine-learning pipeline with the data transformers and then export it, together with a trained ML model, into a file in the ONNX format, a standard for representing ML models and data transformations.
The easiest way to use the dataframe pipeline is to build a Docker image that includes all of the dependencies. If you prefer to install it in your native environment, follow the steps in docker/Dockerfile.
1. Set up Kaggle API credentials by following the procedure described in the Kaggle API documentation
This step is needed to download the datasets used by our benchmarks. If you do not plan to run the benchmarks, you can skip this step and comment out the lines in docker/Dockerfile that copy the Kaggle API credentials. After this step, you should have a JSON file containing your API key under ~/.kaggle.
# git clone https://github.com/IBM/dataframe-pipeline.git
# cd dataframe-pipeline
# cd docker
# ./build-docker.sh
If the image builds successfully, you will see an image named dfp in the output of `docker images`. You can then use the dataframe pipeline in a Docker container by running `docker run -it dfp bash`.
Note that docker/Dockerfile builds our extended version of the ONNX Runtime. You can export an ML pipeline in the ONNX format as shown in the following steps. However, for now, the standard ONNX operator set is not sufficient to represent all of the data transformations available in the dataframe pipeline. Therefore, we extended the ONNX Runtime with the additional operators needed for those transformations.
import dfpipeline as dfp
pipeline = dfp.DataframePipeline(steps=[
    dfp.ComplementLabelEncoder(inputs=['emaildomain', 'card'], outputs=['emaildomain', 'card']),
    dfp.FrequencyEncoder(inputs=['emaildomain', 'card'], outputs=['emaildomain_fe', 'card_fe']),
    dfp.Aggregator(inputs=['Amt'], groupby=['card'], outputs=['Amt_card_mean'], func='mean'),
])
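To illustrate what these transformers compute, here is a pure-pandas sketch of frequency encoding and per-group aggregation (assuming FrequencyEncoder maps each value to its occurrence count and Aggregator broadcasts a per-group statistic back to each row; the actual dfpipeline semantics may differ in detail):

```python
import pandas as pd

df = pd.DataFrame({
    'card': ['a', 'a', 'b'],
    'Amt': [10.0, 20.0, 30.0],
})

# FrequencyEncoder-style transform: replace each value with its frequency.
df['card_fe'] = df['card'].map(df['card'].value_counts())

# Aggregator-style transform: per-group mean broadcast back to every row.
df['Amt_card_mean'] = df.groupby('card')['Amt'].transform('mean')
```

Encoding categorical columns this way produces purely numeric features, which is what a downstream model such as XGBoost expects.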
import pandas as pd
train_df = pd.read_csv('training.csv')
train_df = pipeline.fit_transform(train_df)
import xgboost as xgb
clf = xgb.XGBClassifier(...)
clf.fit(train_df, y)  # y: the label column of your training data
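XGBClassifier.fit expects features and labels as separate arguments, so the label column must be split out of the training frame first; a minimal pandas sketch (the 'target' column name here is a placeholder for your dataset's label):

```python
import pandas as pd

train_df = pd.DataFrame({'f1': [0.1, 0.2], 'target': [0, 1]})

# Separate features from the label before calling clf.fit(X, y);
# 'target' is a hypothetical name for your dataset's label column.
X = train_df.drop(columns=['target'])
y = train_df['target']
```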
from onnxmltools.convert import convert_xgboost
from onnxmltools.convert.common.data_types import FloatTensorType
initial_type = [('dense_input', FloatTensorType([None, len(pipeline.output_columns)]))]
onnx_ml_model = convert_xgboost(clf, initial_types=initial_type)
input_columns_to_onnx = pipeline.export('dense_input', [onnx_ml_model], 'pipeline.onnx')
import onnxruntime as rt
test_df = pd.read_csv('test.csv')
sess = rt.InferenceSession('pipeline.onnx')
tensors = dfp.DataframePipeline.convert_to_tensors(test_df, input_columns_to_onnx)
preds = sess.run(None, tensors)
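The convert_to_tensors helper presumably maps DataFrame columns to named float32 arrays matching the graph's inputs; a minimal pure-NumPy sketch of such a conversion (the input name 'dense_input' and the single-tensor layout are assumptions, and the real helper may lay out tensors differently):

```python
import numpy as np
import pandas as pd

def to_feed_dict(df, input_columns):
    # Stack the selected columns into one float32 tensor keyed by the
    # ONNX graph's (hypothetical) input name.
    return {'dense_input': df[input_columns].to_numpy(dtype=np.float32)}

feed = to_feed_dict(pd.DataFrame({'x': [1, 2], 'y': [3, 4]}), ['x', 'y'])
```

A dict of this shape is exactly what onnxruntime's InferenceSession.run accepts as its feed argument.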
We developed benchmarks to evaluate the performance of ML pipelines in Python and on the ONNX Runtime, referring to the following use cases.
# cd /git/dataframe-pipeline/benchmarks
# ./download_inputs.sh
# ./run.sh
Follow our contribution guidelines.