Is your feature request related to a problem? Please describe.
In doing query optimization we run into situations where we need to estimate the cardinality of a join (will it increase the row count by 10x or reduce it to 1/10th?), a filter, an aggregation, etc. This comes into play quite often when we do things like memory planning on the GPU. We can use heuristics to guess, and in some cases we could use AQE and possibly come up with estimates ourselves, but AQE only runs after the first shuffle, which may be too late for some optimizations.
As such, I would like to propose that we add a set of configs that can be used to give fuzzy hints about specific operations in queries. For example: when we read table foo with the pushed-down predicates a > 5 and a <= 10, we ended up materializing 2 MiB of data per task (possibly with min, median, and max values). I am not quite sure yet what an ideal estimate would look like.
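As a purely illustrative sketch of what such a hint might look like on the wire, one option is a single config key/value pair that the auto-tuner writes and the plugin parses. Every name below (the key, the JSON fields) is hypothetical, not an existing config:

```python
import json

# Hypothetical config key/value for a scan hint. None of these names
# exist today; they only illustrate the shape of the proposal.
HINT_KEY = "spark.rapids.sql.hints.scan.foo"
HINT_VALUE = json.dumps({
    "predicates": ["a > 5", "a <= 10"],
    # Observed bytes materialized per task across past runs.
    "bytesPerTask": {
        "min": 1 * 1024 * 1024,
        "median": 2 * 1024 * 1024,
        "max": 4 * 1024 * 1024,
    },
})

def parse_hint(value: str) -> dict:
    """Decode a hint value back into a dict the planner can consume."""
    return json.loads(value)

hint = parse_hint(HINT_VALUE)
```

Keeping min/median/max rather than a single number would let the planner pick a conservative or optimistic estimate depending on how costly a misprediction is.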
The auto-tuner could then look at various parts of an application and encode, as configs, the observations that are outliers from what we would expect.
The plugin, when trying to make a decision, would then read these configs (doing a fuzzy match to see if it can find relevant historical information) and use that information as input to planning.
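The fuzzy-match step could be as simple as scoring each historical hint key against a normalized description of the current operation and taking the best match above a threshold. A minimal sketch, assuming string similarity is good enough for a first cut (the signatures and hint contents here are made up):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] of how alike two operation signatures look."""
    return SequenceMatcher(None, a, b).ratio()

def best_hint(signature: str, hints: dict, threshold: float = 0.6):
    """Return the historical hint whose key best matches `signature`,
    or None if nothing scores above `threshold`."""
    best_key, best_score = None, threshold
    for key in hints:
        score = similarity(signature, key)
        if score > best_score:
            best_key, best_score = key, score
    return hints.get(best_key)

# Historical hints keyed by a normalized description of the operation.
hints = {
    "scan foo a > 5 AND a <= 10": {"medianBytesPerTask": 2 * 1024 * 1024},
    "join store_sales x dates on d_index": {"rowRatio": 0.1},
}

# A query with slightly different literals still matches the scan hint.
match = best_hint("scan foo a > 5 AND a <= 12", hints)
```

In practice the signature would probably need smarter normalization (e.g. stripping literal values before matching), since a predicate with different constants can have very different selectivity.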
We might even be able to encode higher-level data, like "column d_index in the dates table appears to be a primary key for a dimension table." The auto-tuner could, in theory, detect this by looking at multiple join/aggregation operations and seeing how they behave. But this is a bit more advanced than matching what we saw before.
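One classic heuristic the auto-tuner could use for the "looks like a primary key" guess is comparing a column's observed distinct count to the table's row count across runs. A rough sketch under that assumption (the function and stats are illustrative, not an existing API):

```python
def looks_like_primary_key(distinct_count: int, row_count: int,
                           tolerance: float = 0.99) -> bool:
    """Heuristic: a column whose distinct count is (nearly) equal to the
    table's row count behaves like a primary key / unique index."""
    if row_count == 0:
        return False
    return distinct_count / row_count >= tolerance

# e.g. stats gathered from several join/aggregation runs on `dates`:
# d_index had 73049 distinct values over 73049 rows -> key-like;
# d_quarter had 120 distinct values over 73049 rows -> not key-like.
```

A tolerance slightly below 1.0 leaves room for approximate distinct counts (e.g. from sketches like HyperLogLog) rather than exact ones.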