Understanding PipelineML and pipeline_ml #16
TL; DR: make sure that A bit of context: what is the
|
Hi @Galileo-Galilei, thanks for your great work! My understanding of

from kedro.pipeline import Pipeline, node

# the node functions (split_data, fit_label_binarizer, ...) are listed under "Nodes:" below


def create_pipeline(**kwargs):
    return Pipeline(
        [
            node(
                name="Split Data",
                func=split_data,
                inputs=["text_samples", "parameters"],
                outputs=["X_train", "X_test", "y_train", "y_test"],
                tags=["training"],
            ),
            node(
                name="Fit MultiLabelBinarizer",
                func=fit_label_binarizer,
                inputs="y_train",
                outputs="mlb",
                tags=["training"],
            ),
            node(
                name="Transform Labels",
                func=transform_labels,
                inputs=["mlb", "y_train", "y_test"],
                outputs=["Y_train", "Y_test"],
                tags=["training"],
            ),
            node(
                name="Train Model",
                func=train_model,
                inputs=["X_train", "Y_train"],
                outputs="classifier",
                tags=["training"],
            ),
            node(
                name="Evaluate Model",
                func=evaluate_model,
                inputs=["classifier", "X_test", "Y_test"],
                outputs=None,
                tags=["evaluation"],
            ),
            node(
                name="Make Prediction",
                func=make_prediction,
                inputs=["classifier", "mlb", "features"],
                outputs=None,
                tags=["inference"],
            ),
        ]
    )

Nodes:

import logging
from typing import Dict, List

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline  # scikit-learn Pipeline, not the kedro one
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC


def split_data(text_samples: pd.DataFrame, parameters: Dict) -> List:
    # extract features
    X = text_samples["features"].values
    # extract labels
    y = text_samples["labels"].values
    # split dataset into train and test data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=parameters["test_size"], random_state=parameters["random_state"]
    )
    return [X_train, X_test, y_train, y_test]


def fit_label_binarizer(y_train: np.ndarray) -> MultiLabelBinarizer:
    # multi label binarizer to transform data labels
    mlb = MultiLabelBinarizer()
    # fit the mlb on the train label data
    mlb.fit(y_train)
    return mlb


def transform_labels(mlb: MultiLabelBinarizer, y_train: np.ndarray, y_test: np.ndarray) -> List:
    # transform train label data
    Y_train = mlb.transform(y_train)
    # transform test label data
    Y_test = mlb.transform(y_test)
    return [Y_train, Y_test]


def train_model(X_train: np.ndarray, Y_train: np.ndarray) -> Pipeline:
    # scikit-learn classifier pipeline
    classifier = Pipeline(
        [
            ("vectorizer", CountVectorizer()),
            ("tfidf", TfidfTransformer()),
            ("clf", OneVsRestClassifier(LinearSVC())),
        ]
    )
    # fit the classifier on the train data
    classifier.fit(X_train, Y_train)
    return classifier


def evaluate_model(classifier: Pipeline, X_test: np.ndarray, Y_test: np.ndarray):
    # make prediction with test data
    predicted = classifier.predict(X_test)
    # accuracy score of the trained classifier
    accu = accuracy_score(Y_test, predicted)
    # log accuracy
    logger = logging.getLogger(__name__)
    logger.info("Model has an accuracy of %.3f", accu)


def make_prediction(classifier: Pipeline, mlb: MultiLabelBinarizer, features: np.ndarray) -> List:
    # model inference on features
    predicted = classifier.predict(features)
    # inverse transform prediction matrix back to string labels
    all_labels = mlb.inverse_transform(predicted)
    # map input values to predicted label
    predictions = []
    # iterate over the input features (the original looped over an undefined name, `values`)
    for item, labels in zip(features, all_labels):
        predictions.append({"value": item, "label": labels})
    # return predictions as list of dicts
    return predictions

The inference part needs access not only to the trained model but also to the fitted MultiLabelBinarizer. Is this supported by |
Hello @laurids-reichardt, glad to see that you're playing around with the plugin! Regarding your question, access to data other than the ML model (encoder, binarizer, ...) is not only "possible" but exactly what A few remarks on
The good news is that I think your code should almost work "as is". Can you check the following items:
In case these elements are not enough to solve your problem, what error message do you get? Can you share a sample of the data you use to make the problem reproducible?
EDIT: It is a bug. The
EDIT2: It should be fixed. Apply all the modifications above and install:
pip install --upgrade git+https://github.com/Galileo-Galilei/kedro-mlflow.git@hotfix-pipeline-ml |
Yeah, thanks for the quick answer! The hotfix works. For reference, the url is: git+https://github.com/Galileo-Galilei/kedro-mlflow.git@hotfix-pipeline-ml |
Indeed, wrong copy-pasting, sorry. Does "it works" only mean that the model is properly stored in MLflow, or did you set up the API and try it out? |
Here's my current implementation: https://github.com/laurids-reichardt/kedro-examples/blob/kedro-mlflow-hotfix2/text-classification/src/text_classification/pipelines/pipeline.py
However, I get the following error while trying to make some predictions:
My current guess would be that the issue stems from the fact that I use pyenv+venv to resolve my dependencies instead of conda. I'll investigate further and report back. |
It sounds like your own package is not installed as a python package. What does cd src
pip install -e . |
You're right, now it works without issues. Thanks for your great support!
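As a quick way to see the difference that `pip install -e .` makes, the sketch below checks whether a top-level package can be found on the import path. It uses only the standard library. The package name `text_classification` is an assumption based on the repository path linked above, and the check is an illustration rather than part of kedro or kedro-mlflow.

```python
import importlib.util


def is_importable(package_name: str) -> bool:
    """Return True if a top-level package can be located on sys.path."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # Assumed package name. Run this from a directory other than src/,
    # because the current directory may itself be on sys.path and hide
    # the fact that the package is not installed.
    print(is_importable("text_classification"))
```

If this prints False outside `src/` but the imports work inside it, the project is probably being resolved through the working directory rather than through an installed package.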
|
I converted the scikit-learn classifier pipeline to a kedro pipeline as well: https://github.com/laurids-reichardt/kedro-examples/blob/master/text-classification/docs/kedro-pipeline.svg |
…uced by the training pipeline was an intermediary output instead of a terminal one
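The fix referenced above concerns an artifact that is an intermediary output of the training pipeline rather than a terminal one. The following is a minimal pure-Python sketch of that distinction, not kedro-mlflow's actual implementation: the `Node` tuple and the helper functions are hypothetical stand-ins for Kedro's pipeline objects, and it assumes `features` is the dataset supplied at prediction time. The dataset names are taken from the pipeline posted earlier in this thread.

```python
from typing import List, NamedTuple, Set


class Node(NamedTuple):
    """Hypothetical stand-in for a Kedro node: a name, its datasets and a tag."""
    name: str
    inputs: List[str]
    outputs: List[str]
    tag: str


NODES = [
    Node("Split Data", ["text_samples", "parameters"],
         ["X_train", "X_test", "y_train", "y_test"], "training"),
    Node("Fit MultiLabelBinarizer", ["y_train"], ["mlb"], "training"),
    Node("Transform Labels", ["mlb", "y_train", "y_test"],
         ["Y_train", "Y_test"], "training"),
    Node("Train Model", ["X_train", "Y_train"], ["classifier"], "training"),
    Node("Evaluate Model", ["classifier", "X_test", "Y_test"], [], "evaluation"),
    Node("Make Prediction", ["classifier", "mlb", "features"], [], "inference"),
]


def all_inputs(nodes: List[Node]) -> Set[str]:
    """Every dataset consumed by at least one node."""
    return {name for n in nodes for name in n.inputs}


def all_outputs(nodes: List[Node]) -> Set[str]:
    """Every dataset produced by at least one node."""
    return {name for n in nodes for name in n.outputs}


def terminal_outputs(nodes: List[Node]) -> Set[str]:
    """Datasets that are produced but never consumed by another node."""
    return all_outputs(nodes) - all_inputs(nodes)


training = [n for n in NODES if n.tag == "training"]
inference = [n for n in NODES if n.tag == "inference"]

# Datasets that inference needs from training, assuming "features" is
# the data provided by the caller at prediction time.
needed = all_inputs(inference) - all_outputs(inference) - {"features"}

# "mlb" is consumed by "Transform Labels", so it is not a terminal output.
missing_with_terminal = needed - terminal_outputs(training)
missing_with_all = needed - all_outputs(training)

print(sorted(missing_with_terminal))  # ['mlb']
print(sorted(missing_with_all))       # []
```

In this toy model, looking only at terminal outputs loses the binarizer, while looking at every output of the training nodes finds both the classifier and the binarizer.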
I have just added unit tests, updated the changelog and merged this fix to develop. It will be released to PyPI soon. I am leaving the issue open because, apart from this bugfix, it is the best documentation of the
Short digression on module
Regarding the module, I always recommend installing your kedro package as a module. If you don't, you can perform relative imports between your scripts, but they depend on your working directory. This may lead to annoying issues because:
For all these reasons, I find it much more stable to install your project as a Python package with pip. The A better way to specify environment in
|
This issue is closed since:
Feel free to reopen if needed. |
Hi @Galileo-Galilei. As I mentioned in another issue, I'm currently working on integrating my training and inference pipelines with PipelineML. Unfortunately, I'm confused about how to handle inputs and outputs; I can't wrap my head around it.
Context
My training pipeline is built from three other pipelines: de_pipeline (data engineering), fe_pipeline (feature engineering) and md_pipeline (training, a.k.a. modeling).
My inference pipeline is built from the same pipelines but with a predict argument which changes their behavior (they use previously saved models for the imputer and for prediction).
In my current implementation it looks like this:
My pipelines also take parameters as input, obtained from the kedro configuration (by that I mean conf/base/parameters.yaml).
When I try to glue them together with:
and run my training pipeline, I get:
I understand the issue here, but I don't know how to proceed with it ("un-free" inputs which should be obtained (automatically?) using Kedro features). I would be glad for any tips.
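The "un-free" inputs mentioned above can be pictured as a set difference: whatever the inference pipeline consumes that is neither the designated input dataset nor something produced during training. The sketch below is a hedged illustration of that idea, not the plugin's real check or error message. The function name and the dataset names `raw_data`, `imputer` and `model` are hypothetical; only `parameters` comes from the post.

```python
from typing import Iterable, Set


def unresolved_inference_inputs(
    inference_inputs: Iterable[str],
    training_outputs: Iterable[str],
    input_name: str,
) -> Set[str]:
    """Return the inference inputs that nothing is known to provide."""
    return set(inference_inputs) - set(training_outputs) - {input_name}


# Hypothetical dataset names, for illustration only.
training_outputs = {"imputer", "model"}
inference_inputs = {"raw_data", "imputer", "model", "parameters"}

leftover = unresolved_inference_inputs(
    inference_inputs, training_outputs, input_name="raw_data"
)
print(sorted(leftover))  # ['parameters']
```

In this toy setup, `parameters` is left over because it is neither the designated input nor produced during training, which resembles the situation described in the post.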