Add DataFrame::map utility .map function for DataFrame for modifying internal LogicalPlan #14317

Open
phisn opened this issue Jan 27, 2025 · 4 comments
Labels
enhancement New feature or request

Comments

@phisn
Contributor

phisn commented Jan 27, 2025

Is your feature request related to a problem or challenge?

Currently, when applying custom LogicalPlans (extensions or transformation functions) to DataFrames, I need to convert the DataFrame to a LogicalPlan and then back to a DataFrame, which makes the code cumbersome.

Describe the solution you'd like

It would be very helpful to have a .map function that transforms the LogicalPlan inside a DataFrame without having to reconstruct the DataFrame by hand.

Describe alternatives you've considered

No response

Additional context

No response

@phisn phisn added the enhancement New feature or request label Jan 27, 2025
@alamb alamb changed the title Utility .map function for DataFrame Utility .map function for DataFrame for modifying internal LogicalPlan Jan 28, 2025
@alamb
Contributor

alamb commented Jan 28, 2025

This seems like a good idea to me

As I understand it, it would allow things like

let df = ctx.sql("SELECT * from foo").await?;
// `map` is the method proposed in this issue; it does not exist yet
let df = df.map(my_awesome_rewrite)?;
...

fn my_awesome_rewrite(plan: LogicalPlan) -> Result<LogicalPlan> {
   ...
  Ok(new_plan)
}
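To make the shape of the proposal concrete, here is a minimal std-only sketch. `Plan`, `State`, `Frame`, and `limit_to` are hypothetical stand-ins invented for illustration, not DataFusion types, and the `map` signature is only one possible design: it takes ownership of the frame, passes the owned plan to a fallible closure, and carries the state over unchanged.

```rust
// Stand-in types: NOT the DataFusion API, only a std-only model of the idea.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan(String),
    Limit(usize, Box<Plan>),
}

// Stand-in for the session state a DataFrame carries alongside its plan.
#[derive(Debug, Clone, PartialEq)]
struct State {
    name: String,
}

#[derive(Debug, Clone, PartialEq)]
struct Frame {
    state: State,
    plan: Plan,
}

impl Frame {
    /// Hypothetical `map`: hand the owned plan to `f` and keep the state untouched.
    fn map<F, E>(self, f: F) -> Result<Frame, E>
    where
        F: FnOnce(Plan) -> Result<Plan, E>,
    {
        let Frame { state, plan } = self;
        Ok(Frame { state, plan: f(plan)? })
    }
}

/// Example rewrite: wrap any plan in a limit and reject a zero limit.
fn limit_to(n: usize) -> impl FnOnce(Plan) -> Result<Plan, String> {
    move |plan| {
        if n == 0 {
            Err("limit must be positive".to_string())
        } else {
            Ok(Plan::Limit(n, Box::new(plan)))
        }
    }
}
```

With this model, `frame.map(limit_to(10))` returns a new frame whose plan is wrapped in a limit, and no clone of the plan is needed because the closure receives it by value.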

Any thoughts @timsaucer or @Omega359 ?

@alamb alamb changed the title Utility .map function for DataFrame for modifying internal LogicalPlan Add DataFrame::map utility .map function for DataFrame for modifying internal LogicalPlan Jan 28, 2025
@timsaucer
Contributor

We have something similar in datafusion-python here. It lets you do something like what you're describing but it operates on a dataframe and not a logical plan.

In python you can currently do something like this (this is from our unit tests)

    def add_string_col(df_internal) -> DataFrame:
        return df_internal.with_column("string_col", literal("string data"))

    def add_with_parameter(df_internal, value: Any) -> DataFrame:
        return df_internal.with_column("new_col", literal(value))

    df = df.transform(add_string_col).transform(add_with_parameter, 3)

If you had a DataFrame df then this would add in two more columns. It's a trivial example, but it allows for some really nice chaining of operations like in the last line. I would definitely support adding something like that on the rust side.
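For comparison, a rough Rust analogue of that chaining pattern might look like the sketch below. `Frame` here is a hypothetical stand-in that only records added columns, not the DataFusion `DataFrame`, and `transform` is an assumed name borrowed from the Python side. Since Rust has no variadic arguments, the extra parameter is passed by wrapping the function in a closure.

```rust
// Stand-in type: NOT the DataFusion `DataFrame`, only a std-only model.
#[derive(Debug, Clone, PartialEq)]
struct Frame {
    // (column name, literal value) pairs added so far
    columns: Vec<(String, String)>,
}

impl Frame {
    /// Append a literal column and return the frame for further chaining.
    fn with_column(mut self, name: &str, value: &str) -> Frame {
        self.columns.push((name.to_string(), value.to_string()));
        self
    }

    /// Hypothetical `transform`: apply `f` to the whole frame so calls chain.
    fn transform<F>(self, f: F) -> Frame
    where
        F: FnOnce(Frame) -> Frame,
    {
        f(self)
    }
}

fn add_string_col(df: Frame) -> Frame {
    df.with_column("string_col", "string data")
}

fn add_with_parameter(df: Frame, value: i64) -> Frame {
    df.with_column("new_col", &value.to_string())
}
```

Chaining then reads as `df.transform(add_string_col).transform(|d| add_with_parameter(d, 3))`, which mirrors the last line of the Python example.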

But what I'm doing here doesn't exactly match what the issue requests, which is to work on the LogicalPlan. So we could do something similar, such as a df.transform_plan that converts the DataFrame to a plan, executes the function/closure, and converts the result back to a DataFrame.

@phisn would that meet your needs?

@Omega359
Contributor

I personally haven't had the need to go into a LogicalPlan from a dataframe and back again but I could see it being useful.

@phisn
Contributor Author

phisn commented Jan 29, 2025

@timsaucer My specific use case is what @alamb described. The problem with the transform approach is that I am forced to call .logical_plan().clone() and create a new DataFrame each time, since my use case is about directly manipulating the LogicalPlan.

The problem arises when applying extensions and when using functions that are not always called with a DataFrame, for example when the context is not always available.
