
Kedro hooks do not work inside packaged pipeline for serving #404

Open
Lasica opened this issue Feb 6, 2023 · 7 comments
Assignees
Labels
documentation Improvements or additions to documentation


@Lasica

Lasica commented Feb 6, 2023

Description

When I package a kedro pipeline for serving with `kedro mlflow modelify`, my kedro hooks no longer work.

Context

I was using hooks to initialize some structures and had to make workarounds because the hooks are not called. I'd like the `before_pipeline_run` or `after_catalog_created` hooks to be called once when the server starts serving.
https://kedro.readthedocs.io/en/stable/hooks/introduction.html

Steps to Reproduce

  1. Make a pipeline with hooks
  2. use kedro mlflow modelify ... to package it for serving
  3. start serving with mlflow models serve ...
  4. observe that hooks are not run

I may publish minimal example code later so this can be checked.

Expected Result

Hooks should run.

Actual Result

They do not.

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • kedro and kedro-mlflow version used (pip show kedro and pip show kedro-mlflow): 0.18.4, 0.11.7
  • Python version used (python -V): 3.10.9
  • Operating system and version: archlinux 64bit

Does the bug also happen with the last version on master?

Yes.

@Galileo-Galilei
Owner

Galileo-Galilei commented Feb 6, 2023

Hi @Lasica, you are right that hooks do not run when the pipeline is stored as an mlflow model, but this is intended. In fact, I think most hooks should not be packaged with the pipeline, e.g.:

  1. hooks which need to read a configuration file for prediction, because the model is supposed to be configuration independent
  2. hooks which intend to write data to a third party filesystem / database, for instance for logging or versioning data, because the model must be able to run in a location where it has no write access
  3. hooks which are not linked to the inference pipeline (hooks are declared for the entire kedro project, and most of them may not be useful for inference)

That said, I understand some hooks may be useful (e.g. if you have a custom hook which validates your data between nodes).

I think the solution may be to enable passing hooks manually to the CLI to specify the ones you want to store with the model. Can you tell me more about your use case?

@Galileo-Galilei Galileo-Galilei self-assigned this Feb 6, 2023
@Galileo-Galilei Galileo-Galilei added enhancement New feature or request need-design-decision Several ways of implementation are possible and one must be chosen labels Feb 6, 2023
@Lasica
Author

Lasica commented Feb 7, 2023

Hmm, I store configuration as params in kedro and they are packaged together with the other dependencies, so I'm not sure the 1st point fully applies: such configuration is saved together with the model.

My use case is this: I'm attempting to add model monitoring (in particular data drift) to my inference pipeline used for serving, exposing those metrics to Prometheus to be scraped (or pushed to a pushgateway). Such a class needs some initialization with a reference dataset, the address of the pushgateway and other values. In the inference pipeline I've added a node at the end that passes the data to the monitoring metrics calculation. I wanted to make a `before_pipeline_run` or `after_catalog_created` hook that would initialize this monitoring class, so that it's done once and I can call the monitoring node with the minimal necessary arguments.

Currently I need a workaround: I add the initialization arguments to the monitoring node and check on every call whether the monitoring class singleton is already initialized.
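The workaround described above can be sketched as a lazily initialized singleton. All names here are illustrative, not from the actual project:

```python
# Illustrative sketch of the workaround: since no hook runs when the packaged
# pipeline starts serving, the monitoring class is a singleton that is
# lazily initialized on the first node call.

class MonitoringClient:
    _instance = None

    def __init__(self, reference_data, pushgateway_url):
        self.reference_data = reference_data
        self.pushgateway_url = pushgateway_url

    @classmethod
    def get(cls, reference_data=None, pushgateway_url=None):
        # initialize once; subsequent calls reuse the same instance
        if cls._instance is None:
            cls._instance = cls(reference_data, pushgateway_url)
        return cls._instance


def monitoring_node(predictions, reference_data, pushgateway_url):
    # every call must carry the init arguments, in case this is the first one
    client = MonitoringClient.get(reference_data, pushgateway_url)
    return client  # a real node would compute drift metrics and push them
```

The annoyance is visible in the node signature: the init arguments must flow through every call even though they are only used once.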

I want this hook to only run for inference pipeline.

@Galileo-Galilei
Owner

I understand the use case, but I think it is very bad design to hide such processing inside the mlflow model. The mlflow model is supposed to be a pure data pipeline which takes a single input (a dataframe / an array) and returns a single output (a series / dataframe / array with predictions).

The main drawback of hiding your monitoring node inside the model is that you make your model very dependent on the environment where you deploy it. If your model runs in an environment that has no access to Prometheus, has credential issues, has timeout issues, or should run on a cluster with restricted permissions... you will need to recreate an mlflow model for each situation. The model is supposed to be completely environment independent.

IMHO, the correct way to do this is the following:

  • log the inference pipeline in mlflow (i.e. without the monitoring node and hooks)
  • create another pipeline (say monitoring) which reads the first pipeline from mlflow (either with MlflowModelLoggerDataSet or MlflowModelRegistryDataSet) and performs the check.
  • serve this monitoring pipeline directly

This is described in this tutorial and I think it makes sense for your use case, what do you think?
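A catalog entry along these lines could wire the monitoring pipeline to the logged model. This is a hypothetical fragment: the run id is a placeholder and the exact options should be checked against the kedro-mlflow documentation:

```yaml
# catalog.yml -- illustrative entry, not copied from the tutorial
inference_model:
  type: kedro_mlflow.io.models.MlflowModelLoggerDataSet
  flavor: mlflow.pyfunc
  run_id: <run-id-of-the-logged-inference-pipeline>
```

The monitoring pipeline then consumes `inference_model` like any other catalog dataset, keeping the logged model itself untouched.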

@Lasica
Author

Lasica commented Feb 14, 2023

You make valid points about the flexibility of this inference pipeline model with co-located monitoring. I wasn't aware it's that easy to re-use a pipeline as a node in that way. Thanks for that!

I think this monitoring solution has pros and cons: what you suggest does not remove the model's dependency on the environment, you will just have two pipelines and one of them will still be environment dependent. Whether that dependency is an issue depends on how fail-safe the monitoring is written.

This is, however, a bit off the subject of including hooks in the pipeline, as I still think there should be a way to include some of them in the model. I see that you look at this functionality of packaging pipelines as models in light of an ideal model abstraction that does just calculations. However, this functionality gives us extra flexibility to add some utilities around the model and an easy way to access serving, if one wishes so. I think that hooks are part of that flexibility and it would be nice to have a way to whitelist them.

@Galileo-Galilei
Owner

Galileo-Galilei commented Feb 14, 2023

what you suggest does not help address that the model will be dependent on environment

It's true, but this is actually still more flexible and robust to change. Say you want to change your monitoring node:

  • with your solution, you need to have access to all of the data (the model, its artifacts...) to execute the modelify command again. This may be hard if you want to make this change after a couple of weeks, especially if your artifacts are stored locally and you may have accidentally overwritten them while experimenting.
  • with my solution, the model is "fixed" (it is stored in a given run once and for all), and you may change your extra node as much as you want with no need to execute the modelify command again.

➡️ If I understand correctly, the modelify command is not really what you need. You just want to leverage mlflow capabilities to create an API automatically from a kedro pipeline. I understand this and AFAIK, there is no real better solution for now. I hope that kedro-serving will close the gap one day, but it's not even close for now. In an ideal world, you'd be able to run kedro serve --pipeline and it would be automatically served as an API, and be more flexible than what mlflow offers.

a bit off the subject of including hooks in the pipeline however, as I still think there should be a way to include some of them in the model

My opinion here is the following:

  • in most situations, you don't want hooks for serving (e.g. you do not want your inference pipeline to log to mlflow each time it is run), so this should not be included by default
  • in some rare situations, it may be needed to execute a hook during inference (e.g. to validate data, or to process data at runtime outside nodes for some complex reasons), so I need to offer support for this use case, at least to be consistent with kedro and offer the same functionalities. I will likely make it available in the python API first, and then see how to make it available in the CLI API (i.e. the modelify command).

If you want to help through a PR:

  1. Add a hooks argument (maybe a tuple of Hook) here and store it as an attribute:

     ```python
     def __init__(
         self,
         pipeline: Pipeline,
         catalog: DataCatalog,
         input_name: str,
         runner: Optional[AbstractRunner] = None,
         copy_mode: Optional[Union[Dict[str, str], str]] = None,
     ):
     ```

  2. Change the hook registration logic here:

     ```python
     # we create an empty hook manager but do NOT register hooks
     # because we want this model be executable outside of a kedro project
     hook_manager = _create_hook_manager()
     ```

     We should check how kedro registers hooks to the new hook_manager created at runtime.
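As a rough illustration of the proposed change, here is a self-contained stand-in (deliberately not kedro's actual hook manager API) showing what "accept a whitelist of hooks and register them on the freshly created hook manager" could look like:

```python
# Self-contained sketch of the idea, with a toy hook manager standing in
# for kedro's pluggy-based one. All names here are hypothetical.

class SimpleHookManager:
    def __init__(self):
        self._hooks = []

    def register(self, hook):
        self._hooks.append(hook)

    def call(self, name, **kwargs):
        # invoke the named hook method on every registered hook object
        return [getattr(h, name)(**kwargs) for h in self._hooks if hasattr(h, name)]


class DataValidationHooks:
    # example of a hook one might want packaged with the model
    def before_pipeline_run(self, run_params):
        return f"validated inputs for {run_params['pipeline_name']}"


def create_hook_manager_with_hooks(hooks=()):
    # analogous to _create_hook_manager(), plus registering the whitelist
    # of hooks passed to the packaged model instead of leaving it empty
    manager = SimpleHookManager()
    for hook in hooks:
        manager.register(hook)
    return manager


manager = create_hook_manager_with_hooks(hooks=(DataValidationHooks(),))
results = manager.call("before_pipeline_run", run_params={"pipeline_name": "inference"})
```

The real implementation would instead call `register` on the pluggy manager returned by `_create_hook_manager()`, mirroring how kedro registers project hooks at session creation.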

@Lasica
Author

Lasica commented Feb 14, 2023

Thanks for the well thought response.

In some rare situations, it may be needed to execute a hook during inference (e.g. to validate data, or to process data at runtime ouside node for some complex reasons), so I need to offer support for this use case,

That's all I was looking for as confirmation. I'll look at the code and try to start a PR later this/next week.

With my solution, the model is "fixed" (it is stored in a given run once for all), and you may change your extra node as much as you want with no need to execute the modelify comand again

I'm not sure I understand this. Wouldn't the monitoring pipeline also need to be packaged as a model to be served (for example by MLFlow)?

@Galileo-Galilei Galileo-Galilei moved this from 🆕 New to 📋 Backlog in kedro-mlflow roadmap Oct 28, 2023
@Galileo-Galilei Galileo-Galilei added the documentation Improvements or additions to documentation label Aug 26, 2024
@Galileo-Galilei
Owner

Galileo-Galilei commented Aug 26, 2024

I'm not sure I understand this. Wouldn't the monitoring pipeline also need to be packaged as a model to be served (for example by MLFlow)?

I've never answered this last question, sorry. My idea is that an mlflow model should only be used to serve the inference pipeline of an ML model, not all pipelines. There are many limitations, and it does not make sense to retrain a full model each time you change the code of a pipeline. There are better plugins to serve other pipelines, like monitoring pipelines, e.g. kedro-boot.

I'll just document it.

@Galileo-Galilei Galileo-Galilei removed enhancement New feature or request need-design-decision Several ways of implementation are possible and one must be chosen labels Aug 26, 2024