This framework provides APIs, called data transformers, to represent popular data-transformation patterns on a pandas DataFrame, a 2D data structure consisting of rows and labeled columns. You can construct a machine-learning pipeline with the data transformers and then export it, together with a trained ML model, into a file in the ONNX format, a standard for representing ML models and data transformations.
The easiest way to use the dataframe pipeline is to build a Docker image that includes all of the dependencies. If you prefer to install it in your native environment, follow the steps in docker/Dockerfile.
1. Set up Kaggle API credentials by following the procedure described in the Kaggle API documentation
This step is needed to download the datasets used by our benchmarks. If you do not plan to run the benchmarks, you can skip this step and comment out the lines in docker/Dockerfile that copy the Kaggle API credentials. After this step, you should have a JSON file containing your API key under ~/.kaggle.
# git clone https://github.com/IBM/dataframe-pipeline.git
# cd dataframe-pipeline
# cd docker
# ./build-docker.sh
If the image builds successfully, you will see an image named dfp in the output of `docker images`. You can then use the dataframe pipeline in a Docker container by running `docker run -it dfp bash`.
Note that docker/Dockerfile builds our extended version of the ONNX Runtime. You can export an ML pipeline in the ONNX format as shown in the following steps. However, for now, the standard ONNX operator set is not sufficient to represent all of the data transformations available in the dataframe pipeline. Therefore, we extended the ONNX Runtime with the additional operators needed for those transformations.
import dfpipeline as dfp
pipeline = dfp.DataframePipeline(steps=[
    dfp.ComplementLabelEncoder(inputs=['emaildomain', 'card'], outputs=['emaildomain', 'card']),
    dfp.FrequencyEncoder(inputs=['emaildomain', 'card'], outputs=['emaildomain_fe', 'card_fe']),
    dfp.Aggregator(inputs=['Amt'], groupby=['card'], outputs=['Amt_card_mean'], func='mean'),
])
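To illustrate what these transformers compute, here is a pure-pandas sketch of frequency encoding and per-group aggregation (assuming FrequencyEncoder maps each value to its occurrence count and Aggregator broadcasts a per-group statistic back to each row; the actual dfpipeline semantics may differ in detail):

```python
import pandas as pd

df = pd.DataFrame({
    'card': ['a', 'a', 'b'],
    'Amt': [10.0, 20.0, 30.0],
})

# FrequencyEncoder-style transform: replace each value with its frequency.
df['card_fe'] = df['card'].map(df['card'].value_counts())

# Aggregator-style transform: per-group mean broadcast back to every row.
df['Amt_card_mean'] = df.groupby('card')['Amt'].transform('mean')
```

Encoding categorical columns this way produces purely numeric features, which is what a downstream model such as XGBoost expects.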
import pandas as pd
train_df = pd.read_csv('training.csv')
train_df = pipeline.fit_transform(train_df)
import xgboost as xgb
clf = xgb.XGBClassifier(...)
clf.fit(train_df, y)  # y: the label column of your training data
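XGBClassifier.fit expects features and labels as separate arguments, so the label column must be split out of the training frame first; a minimal pandas sketch (the 'target' column name here is a placeholder for your dataset's label):

```python
import pandas as pd

train_df = pd.DataFrame({'f1': [0.1, 0.2], 'target': [0, 1]})

# Separate features from the label before calling clf.fit(X, y);
# 'target' is a hypothetical name for your dataset's label column.
X = train_df.drop(columns=['target'])
y = train_df['target']
```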
from onnxmltools.convert import convert_xgboost
from onnxmltools.convert.common.data_types import FloatTensorType
initial_type = [('dense_input', FloatTensorType([None, len(pipeline.output_columns)]))]
onnx_ml_model = convert_xgboost(clf, initial_types=initial_type)
input_columns_to_onnx = pipeline.export('dense_input', [onnx_ml_model], 'pipeline.onnx')
import onnxruntime as rt
test_df = pd.read_csv('test.csv')
sess = rt.InferenceSession('pipeline.onnx')
tensors = dfp.DataframePipeline.convert_to_tensors(test_df, input_columns_to_onnx)
preds = sess.run(None, tensors)
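The convert_to_tensors helper presumably maps DataFrame columns to named float32 arrays matching the graph's inputs; a minimal pure-NumPy sketch of such a conversion (the input name 'dense_input' and the single-tensor layout are assumptions, and the real helper may lay out tensors differently):

```python
import numpy as np
import pandas as pd

def to_feed_dict(df, input_columns):
    # Stack the selected columns into one float32 tensor keyed by the
    # ONNX graph's (hypothetical) input name.
    return {'dense_input': df[input_columns].to_numpy(dtype=np.float32)}

feed = to_feed_dict(pd.DataFrame({'x': [1, 2], 'y': [3, 4]}), ['x', 'y'])
```

A dict of this shape is exactly what onnxruntime's InferenceSession.run accepts as its feed argument.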
We developed benchmarks to evaluate the performance of ML pipelines in Python and on the ONNX Runtime, referring to the following use cases.
# cd /git/dataframe-pipeline/benchmarks
# ./download_inputs.sh
# ./run.sh
Follow our contribution guidelines.