MLFlow conflict when using Databricks #610

Open
diegoliraQB opened this issue Nov 21, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@diegoliraQB

Description

When running kedro-mlflow on Databricks, parallelized code occasionally triggers new runs in the experiment instead of logging to the run started by kedro-mlflow. This happens because Databricks enables autologging (at least in recent runtimes); the extra runs might be due to an mlflow bug.

Proposed solution: Add a new hook to disable autolog, or include it in the current hook.

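import mlflow
from kedro.framework.hooks import hook_impl

class DisableMLFlowAutoLogger:
    @hook_impl(tryfirst=True)
    def after_context_created(self, context) -> None:
        # Turn off MLflow autologging (enabled on recent Databricks runtimes)
        mlflow.autolog(disable=True)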

Although I encountered this because of Databricks, I can't imagine a context where you would want autologging enabled alongside the plugin. If you want to keep it flexible, this could be a parameter in mlflow.yml.

Context

See conversation for context:
https://kedro-org.slack.com/archives/C03RKP2LW64/p1732141412790889

Steps to Reproduce

  1. Start a Kedro pipeline using kedro-mlflow in a Databricks interactive notebook
  2. Use some parallelized code to trigger a new run. Minimal example with Optuna (`objective` and `my_data` are your own function and data):
    import optuna
    study = optuna.create_study()
    study.optimize(lambda trial: objective(my_data, trial), n_trials=100, n_jobs=-1)

This triggers roughly 4-6 new runs when the objective uses LightGBM.

Expected Result

Results should be in the run started by kedro-mlflow.

Actual Result

New runs are triggered.

Your Environment

Databricks Runtime 15.4 ML
Kedro 0.19.9
kedro-mlflow 0.13.3

Does the bug also happen with the last version on master?

Yes

@Galileo-Galilei
Owner

Galileo-Galilei commented Dec 14, 2024

Hi,

Can you check which mlflow version you used? If it is above 2.18.0, note that mlflow recently changed its internals to be thread-safe and avoid race conditions when running in parallel. This change affects kedro-mlflow (#613, #615) and pycaret (pycaret/pycaret#4100); optuna used to have a workaround (optuna/optuna#4088) that may or may not be affected (I didn't look at their code).
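Why parallel trials can escape the kedro-mlflow run can be illustrated with a toy model. It assumes that the thread-safety change makes the "active run" state per-thread; this is an assumption, and the code is not MLflow's. `ToyTracker`, `start_run`, `active_run` and `log_metric` are hypothetical stand-ins. A worker thread does not see the run opened in the main thread, so autolog-style logic starts a fresh run in each worker.

```python
from __future__ import annotations

import threading
from concurrent.futures import ThreadPoolExecutor


class ToyTracker:
    """Hypothetical tracking client whose active-run stack is thread-local."""

    def __init__(self) -> None:
        self._local = threading.local()  # each thread sees its own attributes
        self._lock = threading.Lock()
        self.runs_created: list[str] = []
        self.metrics: dict[str, list[tuple[str, float]]] = {}

    def _stack(self) -> list[str]:
        if not hasattr(self._local, "stack"):
            self._local.stack = []  # first access from this thread
        return self._local.stack

    def start_run(self, name: str) -> str:
        with self._lock:
            self.runs_created.append(name)
        self._stack().append(name)
        return name

    def active_run(self) -> str | None:
        stack = self._stack()
        return stack[-1] if stack else None

    def log_metric(self, key: str, value: float) -> str:
        # Autolog-like behaviour: no active run in *this* thread -> start one
        run = self.active_run()
        if run is None:
            run = self.start_run(f"auto-run-{threading.get_ident()}")
        with self._lock:
            self.metrics.setdefault(run, []).append((key, value))
        return run


tracker = ToyTracker()
tracker.start_run("kedro-mlflow-run")  # run opened in the main thread


def trial(i: int) -> str:
    # Stands in for one optuna trial executed on a worker thread
    return tracker.log_metric("score", float(i))


with ThreadPoolExecutor(max_workers=4) as pool:
    runs_used = list(pool.map(trial, range(8)))

print("runs created:", tracker.runs_created)
print("main-thread run got metrics:", "kedro-mlflow-run" in tracker.metrics)
```

In this toy, all eight trials log into one or more `auto-run-*` runs and none into `kedro-mlflow-run`, which resembles the symptom reported in this issue.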

It should be fixed in kedro-mlflow. Can you install kedro-mlflow>0.13.4 and confirm whether you are still experiencing the bug before I make the change?

@diegoliraQB
Author

diegoliraQB commented Dec 16, 2024

I can confirm that the problem still happens with mlflow 2.19.0 and kedro-mlflow 0.13.4.
Disabling autologging solves it.

@Galileo-Galilei
Owner

Galileo-Galilei commented Dec 16, 2024

Thanks for confirming, I'll investigate and make sure to fix this soon!
