Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Float and Decimal coercion #14273

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

findepi
Copy link
Member

@findepi findepi commented Jan 24, 2025

Which issue does this PR close?

Rationale for this change

Before the change, Float* and Decimal* would coerce to a decimal type. However, decimal cannot store all the float values: different range, NaN, and infinity. Coercing to floating point is more desirable and also what other systems typically do.

What changes are included in this PR?

Numeric coercion rules change

Are these changes tested?

yes

Are there any user-facing changes?

yes

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jan 24, 2025
@findepi findepi force-pushed the findepi/fix-float-and-decimal-coercion-b58e78 branch from 4e8e850 to 0f69629 Compare January 24, 2025 14:52
@github-actions github-actions bot added the core Core DataFusion crate label Jan 24, 2025
@findepi findepi force-pushed the findepi/fix-float-and-decimal-coercion-b58e78 branch from 0f69629 to 056e2d0 Compare January 24, 2025 14:57
@findepi findepi force-pushed the findepi/fix-float-and-decimal-coercion-b58e78 branch from 056e2d0 to 52f19ee Compare January 24, 2025 21:05
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @findepi -- this area is a bit out of my league here to understand the repercussions of this change

@jonahgao @berkaysynnada and @ozankabak and I have been discussing coercion related to @nuno-faria s (very) recent change here:

Perhaps they can weigh in here. I feel like there is a tension here and that we are in serious danger of the behavior oscillating 🤔

I feel like in some past version of DataFusion we used to coerce to Floats and that was changed to Decimal for some reason

Maybe what we can do is write down

  1. What the current coercion behavior is
  2. What we want it to be (and why we want to change it)
  3. Get agreement on to goal
  4. Then make the code match the agreement

Before the change, Float* and Decimal* would coerce to a decimal type.
However, decimal cannot store all the float values: different range,
NaN, and infinity. Coercing to floating point is more desirable and also
what other systems typically do.
@findepi findepi force-pushed the findepi/fix-float-and-decimal-coercion-b58e78 branch from 52f19ee to bed8a43 Compare January 24, 2025 21:34
@findepi
Copy link
Member Author

findepi commented Jan 24, 2025

@alamb thanks for your feedback. I agree it's important to avoid back-and-forth with a change, so the broader review the better.

What the current coercion behavior is

#14272

What we want it to be (and why we want to change it)

I don't think this particular change should be controversial though.

See issue, but also:

  1. we as the project primarily follow PostreSQL behavior and this is what this PR does (i think this rule is kind of well established, even if not well documented, but i see Document SQL dialect guidance #13706 is not merged)
  2. PostgreSQL aside, the implicit coercions should generally not fail (but alas, i've seen systems violating this common sense rule). Casting from float to decimal obviously may mail

Get agreement on to goal

let's do it!

Copy link
Contributor

@jayzhan211 jayzhan211 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Convert to float is consistent with DuckDB too. It makes sense to me.

@ozankabak
Copy link
Contributor

This change makes sense to me. However, I fully agree with @alamb on avoiding being trigger happy on partial changes to coercion behavior. Let's follow the 4-step process laid out in his comment before merging.

I don't remember the previous change from floats to decimals, but we should understand what happened then to avoid churn. I want to say it was about Spark compatibility, but I might be misremembering. If that is the case, we'd need to choose PostgreSQL compatibility over Spark compatibility (which is in line with project goals).

@alamb
Copy link
Contributor

alamb commented Jan 25, 2025

I have been thinking about this more. Is it a crazy idea if we tried to implement "user defined coercion rules" -- that way we can have whatever coercion rules are in the DataFusion core but when they inevitably are not quite what users want (as @findepi has pointed out many examples) that way users can customize them however they want

I feel like with the current requirement for a single set of coercion rules, we'll always end up with conflicting requiements such as

Rather than having to go back/forth in the core and find some compromise an API would give us an escape hatch.

It would also force us to design the type coercion APIs more coherently which I think would be good for the code quality in general

🤔

If you think it is an idea worth exploring I will file a ticket
Update: filed #14296

@ozankabak
Copy link
Contributor

I have been thinking about this more. Is it a crazy idea if we tried to implement "user defined coercion rules"?

I find this interesting and we should think about it. It is also in line with DF's vision of being fully customizable and extensible. However, doing this should be a follow-up to finalizing/merging this PR (after going through the steps you mention) and fixing the regression introduced by #14233. I don't think the steps to resolve either are in the "either-break-this-or-break-that" category.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have been thinking about this more. Is it a crazy idea if we tried to implement "user defined coercion rules"?

I find this interesting and we should think about it. It is also in line with DF's vision of being fully customizable and extensible.

❤️

However, doing this should be a follow-up to finalizing/merging this PR (after going through the steps you mention) and fixing the regression introduced by #14233. I don't think the steps to resolve either are in the "either-break-this-or-break-that" category.

I am almost sure that this switches the behavior back to something that was previously changed.

It also seems like this PR will regress pruning / filtering for data that comes in as Decimal (the columns are now cast as float)

@@ -49,7 +49,7 @@ limit 10;
logical_plan
01)Sort: value DESC NULLS FIRST, fetch=10
02)--Projection: partsupp.ps_partkey, sum(partsupp.ps_supplycost * partsupp.ps_availqty) AS value
03)----Inner Join: Filter: CAST(sum(partsupp.ps_supplycost * partsupp.ps_availqty) AS Decimal128(38, 15)) > __scalar_sq_1.sum(partsupp.ps_supplycost * partsupp.ps_availqty) * Float64(0.0001)
03)----Inner Join: Filter: CAST(sum(partsupp.ps_supplycost * partsupp.ps_availqty) AS Float64) > __scalar_sq_1.sum(partsupp.ps_supplycost * partsupp.ps_availqty) * Float64(0.0001)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vaguely remember the use of Decimal here was important for TPCH results (maybe for correctness or something 🤔 )

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar situation with Q6

@@ -184,7 +184,13 @@ impl TestOutput {
/// and the appropriate scenario
impl ContextWithParquet {
async fn new(scenario: Scenario, unit: Unit) -> Self {
Self::with_config(scenario, unit, SessionConfig::new()).await
let mut session_config = SessionConfig::new();
// TODO (https://github.com/apache/datafusion/issues/12817) once this is the default behavior, remove from here
Copy link
Contributor

@alamb alamb Jan 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this means that DataFusion will no longer prune predicates like decimal_col = 5.0?

If so, this likely a significant regression / issue for anyone who relies on decimal types (like comet for example)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if this will affect pruning - but it'd be great if @andygrove can take a look

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will look at this today.

@@ -31,13 +31,13 @@ logical_plan
01)Projection: sum(lineitem.l_extendedprice * lineitem.l_discount) AS revenue
02)--Aggregate: groupBy=[[]], aggr=[[sum(lineitem.l_extendedprice * lineitem.l_discount)]]
03)----Projection: lineitem.l_extendedprice, lineitem.l_discount
04)------Filter: lineitem.l_shipdate >= Date32("1994-01-01") AND lineitem.l_shipdate < Date32("1995-01-01") AND lineitem.l_discount >= Decimal128(Some(5),15,2) AND lineitem.l_discount <= Decimal128(Some(7),15,2) AND lineitem.l_quantity < Decimal128(Some(2400),15,2)
05)--------TableScan: lineitem projection=[l_quantity, l_extendedprice, l_discount, l_shipdate], partial_filters=[lineitem.l_shipdate >= Date32("1994-01-01"), lineitem.l_shipdate < Date32("1995-01-01"), lineitem.l_discount >= Decimal128(Some(5),15,2), lineitem.l_discount <= Decimal128(Some(7),15,2), lineitem.l_quantity < Decimal128(Some(2400),15,2)]
04)------Filter: lineitem.l_shipdate >= Date32("1994-01-01") AND lineitem.l_shipdate < Date32("1995-01-01") AND CAST(lineitem.l_discount AS Float64) >= Float64(0.049999999999999996) AND CAST(lineitem.l_discount AS Float64) <= Float64(0.06999999999999999) AND lineitem.l_quantity < Decimal128(Some(2400),15,2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will likely cause a performance regression as it will cast the entire lineitem.l_discount column to Float before comparison where previously it could compare to a constant.

@ozankabak
Copy link
Contributor

We definitely shouldn't merge anything that might introduce regressions without going through the very reasonable process you suggested. It didn't seem to me this would cause regressions, but if that's the case, we should wait and understand better.

This week, I will start off by thinking about what we can do about the correctness issue the other PR is focusing on -- without regressions. It'd be great if we can find a neat fix for the issue that doesn't have the performance implication instead of just reverting for now. I have the feeling that there is a way, but we'll see.

@alamb
Copy link
Contributor

alamb commented Jan 26, 2025

We definitely shouldn't merge anything that might introduce regressions without going through the very reasonable process you suggested. It didn't seem to me this would cause regressions, but if that's the case, we should wait and understand better.

👍

This week, I will start off by thinking about what we can do about the correctness issue the other PR is focusing on -- without regressions. It'd be great if we can find a neat fix for the issue that doesn't have the performance implication instead of just reverting for now. I have the feeling that there is a way, but we'll see.

I think @jayzhan211 has a good idea here:

This is another version of a similiar idea I think

@findepi
Copy link
Member Author

findepi commented Jan 29, 2025

I am OK having DF coercions extensible (#14296), but first and foremost I want them to be optional. At SDF dbt, we create fully resolved and coerced plans, and so DF's coercion rules cannot do any good.
For our needs the best would be clear separation of SQL analysis layer, where coercions should exist, and formal plan processing layer, where coercions should no longer exist. This was outlined in #12723.

Regardless of future apt plans and potential customizability via new or existing extension points, we need to decide what is the "out of the box" behavior. We have a rule to follow PostgreSQL in principle and this is what this PR does. I consider it non-controversial. The existing comments revolve our "let's make sure it's properly reviewed" and by now it's obviously is. Will merge tomorrow unless there are actionable objections.

@ozankabak
Copy link
Contributor

ozankabak commented Jan 29, 2025

As I said, I am OK with this change so merging is fine from my perspective -- but I don't think it is a great idea to pull the trigger before achieving consensus with people who left comments (in this case @alamb). It is not obvious this is the case -- maybe he is on board already, but the (unanswered) comments are at least suggestive of some possible concerns.

@alamb
Copy link
Contributor

alamb commented Jan 30, 2025

We have a rule to follow PostgreSQL in principle and this is what this PR does. I consider it non-controversial.

I really worry that that PR will have unintended downstream impacts on systems that rely on Decimal (like maybe Comet) but I don't know enough to know for sure

It won't impact InfluxDB as we don't use decimal

I am sorry I don't have a more specific concern -- perhaps we can merge this PR and see what happens downstream (and if the old coercion behavior is important for some people that would be more motivation to make them configurable)

@alamb
Copy link
Contributor

alamb commented Jan 30, 2025

TLDR is i have some vague concern about this PR but that is not a reason to avoid merging it. It is fine with me to merge

1 similar comment
@alamb
Copy link
Contributor

alamb commented Jan 30, 2025

TLDR is i have some vague concern about this PR but that is not a reason to avoid merging it. It is fine with me to merge

@findepi
Copy link
Member Author

findepi commented Jan 30, 2025

I believe Comet doesn't rely on any coercion logic from DataFusion? cc @andygrove

@@ -27,4 +27,4 @@ where
and l_discount between 0.06 - 0.01 and 0.06 + 0.01
and l_quantity < 24;
----
11803420.2534
7115406.7008
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT this result changes because the literals are floats, and l_discount is a decimal. Is this a correctness problem? @andygrove

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SQL literals should be decimal not float per ANSI SQL specification

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is a correctness problem. I explained in #14273 (comment)

@andygrove
Copy link
Member

Thanks for the pings. I agree with not rushing this in. Decimals are exact types and floats are not so we need to be careful.

There was an earlier issue where we used float instead of decimal in some places and it resulted in correctness issues.

I will run the Comet/Spark tests with these changes and see what happens.

+1 for user-defined coercion rules as well though, separate to this discussion. In Comet we want to use coercion rules that match Spark behavior. We do currently use some of DataFusion's coercion rules.

physical_plan
01)ProjectionExec: expr=[sum(lineitem.l_extendedprice * lineitem.l_discount)@0 as revenue]
02)--AggregateExec: mode=Final, gby=[], aggr=[sum(lineitem.l_extendedprice * lineitem.l_discount)]
03)----CoalescePartitionsExec
04)------AggregateExec: mode=Partial, gby=[], aggr=[sum(lineitem.l_extendedprice * lineitem.l_discount)]
05)--------CoalesceBatchesExec: target_batch_size=8192
06)----------FilterExec: l_shipdate@3 >= 1994-01-01 AND l_shipdate@3 < 1995-01-01 AND l_discount@2 >= Some(5),15,2 AND l_discount@2 <= Some(7),15,2 AND l_quantity@0 < Some(2400),15,2, projection=[l_extendedprice@1, l_discount@2]
06)----------FilterExec: l_shipdate@3 >= 1994-01-01 AND l_shipdate@3 < 1995-01-01 AND CAST(l_discount@2 AS Float64) >= 0.049999999999999996 AND CAST(l_discount@2 AS Float64) <= 0.06999999999999999 AND l_quantity@0 < Some(2400),15,2, projection=[l_extendedprice@1, l_discount@2]
Copy link
Member

@andygrove andygrove Jan 30, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is incorrect. l_discount is a decimal and the literals in the query are also decimals. There should not be a cast from an exact type to an approximate type here.

and l_discount between 0.05 and 0.07

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't disagree with the goals stated in the issue, but the implementation here is incorrect and introducing correctness issues. I am requesting changes to avoid an accidental merge since the PR has one approval already.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
5 participants