Fix Float and Decimal coercion #14273
Conversation
Thanks @findepi -- this area is a bit out of my league; I don't fully understand the repercussions of this change.
@jonahgao @berkaysynnada and @ozankabak and I have been discussing coercion related to @nuno-faria's (very) recent change here:
Perhaps they can weigh in here. I feel like there is a tension here and that we are in serious danger of the behavior oscillating 🤔
I feel like in some past version of DataFusion we used to coerce to Floats and that was changed to Decimal for some reason
Maybe what we can do is write down:
- What the current coercion behavior is
- What we want it to be (and why we want to change it)
- Get agreement on the goal
- Then make the code match the agreement
Before the change, Float* and Decimal* would coerce to a decimal type. However, decimal cannot store all the float values: different range, NaN, and infinity. Coercing to floating point is more desirable and also what other systems typically do.
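As a plain illustration of the rationale (stdlib Python standing in for the engine; this is not DataFusion code), Arrow-style Decimal128(precision, scale) carries at most 38 significant digits and has no encoding for NaN or infinity, while f64 represents all of these:

```python
import math

# Decimal128(precision=38, scale=s) stores an unscaled 128-bit integer
# with at most 38 significant decimal digits.
DECIMAL128_MAX_UNSCALED = 10**38 - 1

# An ordinary f64 value that no Decimal128 can represent, at any scale:
assert 1e40 > DECIMAL128_MAX_UNSCALED

# f64 also has values with no decimal counterpart at all: Decimal128 has
# no encoding for NaN or +/- infinity.
assert math.isnan(float("nan"))
assert math.isinf(float("inf"))
assert math.isinf(float("-inf"))
```

Coercing the decimal side to float (rather than the reverse) avoids these unrepresentable cases, at the cost of exactness.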
@alamb thanks for your feedback. I agree it's important to avoid back-and-forth with a change, so the broader the review, the better.
I don't think this particular change should be controversial though. See the issue, but also: let's do it!
Converting to float is consistent with DuckDB too. It makes sense to me.
This change makes sense to me. However, I fully agree with @alamb on avoiding being trigger-happy with partial changes to coercion behavior. Let's follow the 4-step process laid out in his comment before merging. I don't remember the previous change from floats to decimals, but we should understand what happened then to avoid churn. I want to say it was about Spark compatibility, but I might be misremembering. If that is the case, we'd need to choose PostgreSQL compatibility over Spark compatibility (which is in line with project goals).
I have been thinking about this more. Is it a crazy idea if we tried to implement "user defined coercion rules"? That way we can have whatever coercion rules are in the DataFusion core, but when they inevitably are not quite what users want (as @findepi has pointed out with many examples), users can customize them however they want. I feel like with the current requirement for a single set of coercion rules, we'll always end up with conflicting requirements. Rather than having to go back and forth in the core and find some compromise, an API would give us an escape hatch. It would also force us to design the type coercion APIs more coherently, which I think would be good for code quality in general 🤔
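A hypothetical sketch of what such an escape hatch could look like (all names here are invented for illustration; this is not an actual or proposed DataFusion API):

```python
# Hypothetical: a registry that lets users override the engine's default
# binary-operator coercion. Types are modeled as plain strings for brevity.
from typing import Callable, Optional

# A rule maps (lhs_type, rhs_type) to a common type, or None to pass.
CoercionRule = Callable[[str, str], Optional[str]]

def default_rule(lhs: str, rhs: str) -> Optional[str]:
    # Core default after this PR: Float* and Decimal* coerce to float.
    if {lhs, rhs} == {"Float64", "Decimal128"}:
        return "Float64"
    return lhs if lhs == rhs else None

class CoercionRegistry:
    def __init__(self, fallback: CoercionRule):
        self.rules: list[CoercionRule] = []
        self.fallback = fallback

    def register(self, rule: CoercionRule) -> None:
        self.rules.insert(0, rule)  # user rules take precedence

    def coerce(self, lhs: str, rhs: str) -> Optional[str]:
        for rule in self.rules:
            result = rule(lhs, rhs)
            if result is not None:
                return result
        return self.fallback(lhs, rhs)

registry = CoercionRegistry(default_rule)
assert registry.coerce("Float64", "Decimal128") == "Float64"

# A Spark-oriented user could flip this particular pairing back to decimal
# without any change to the core rules:
registry.register(
    lambda l, r: "Decimal128" if {l, r} == {"Float64", "Decimal128"} else None
)
assert registry.coerce("Float64", "Decimal128") == "Decimal128"
```

The point of the sketch is only that a precedence-ordered hook would let the core keep one sane default while downstreams (Comet, InfluxDB, etc.) diverge where they must.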
I find this interesting and we should think about it. It is also in line with DF's vision of being fully customizable and extensible. However, doing this should be a follow-up to finalizing/merging this PR (after going through the steps you mention) and fixing the regression introduced by #14233. I don't think the steps to resolve either are in the "either-break-this-or-break-that" category.
I have been thinking about this more. Is it a crazy idea if we tried to implement "user defined coercion rules"?
I find this interesting and we should think about it. It is also in line with DF's vision of being fully customizable and extensible.
❤️
However, doing this should be a follow-up to finalizing/merging this PR (after going through the steps you mention) and fixing the regression introduced by #14233. I don't think the steps to resolve either are in the "either-break-this-or-break-that" category.
I am almost sure that this switches the behavior back to something that was previously changed.
It also seems like this PR will regress pruning / filtering for data that comes in as Decimal (the columns are now cast as float)
@@ -49,7 +49,7 @@ limit 10;
logical_plan
01)Sort: value DESC NULLS FIRST, fetch=10
02)--Projection: partsupp.ps_partkey, sum(partsupp.ps_supplycost * partsupp.ps_availqty) AS value
-03)----Inner Join: Filter: CAST(sum(partsupp.ps_supplycost * partsupp.ps_availqty) AS Decimal128(38, 15)) > __scalar_sq_1.sum(partsupp.ps_supplycost * partsupp.ps_availqty) * Float64(0.0001)
+03)----Inner Join: Filter: CAST(sum(partsupp.ps_supplycost * partsupp.ps_availqty) AS Float64) > __scalar_sq_1.sum(partsupp.ps_supplycost * partsupp.ps_availqty) * Float64(0.0001)
I vaguely remember the use of Decimal here was important for TPCH results (maybe for correctness or something 🤔 )
Similar situation with Q6
@@ -184,7 +184,13 @@ impl TestOutput {
/// and the appropriate scenario
impl ContextWithParquet {
    async fn new(scenario: Scenario, unit: Unit) -> Self {
-        Self::with_config(scenario, unit, SessionConfig::new()).await
+        let mut session_config = SessionConfig::new();
+        // TODO (https://github.com/apache/datafusion/issues/12817) once this is the default behavior, remove from here
Does this mean that DataFusion will no longer prune predicates like decimal_col = 5.0?
If so, this is likely a significant regression / issue for anyone who relies on decimal types (like Comet, for example).
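To make the pruning concern concrete, here is a toy min/max sketch (hypothetical, stdlib Python; not DataFusion's PruningPredicate code): row-group statistics are stored as decimals, so they apply directly to `decimal_col = 5.0` only while the column side of the predicate stays uncast.

```python
from decimal import Decimal

# Toy min/max pruning: a row group can be skipped when the literal falls
# outside the column's [min, max] statistics. (Illustrative only.)
def can_skip(col_min: Decimal, col_max: Decimal, literal: Decimal) -> bool:
    return literal < col_min or literal > col_max

# Decimal statistics for one row group of decimal_col:
col_min, col_max = Decimal("1.00"), Decimal("4.00")

# decimal_col = 5.0 with a decimal literal: 5.0 is outside [1.00, 4.00],
# so the row group is prunable.
assert can_skip(col_min, col_max, Decimal("5.0"))
# A literal inside the range cannot be skipped:
assert not can_skip(col_min, col_max, Decimal("3.0"))

# After coercion the predicate becomes CAST(decimal_col AS Float64) = 5.0.
# The stored decimal statistics describe decimal_col, not the casted
# expression, so a simple pruner can no longer skip the row group without
# extra rewriting of the statistics or the predicate.
```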
I am not sure if this will affect pruning - but it'd be great if @andygrove can take a look
I will look at this today.
@@ -31,13 +31,13 @@ logical_plan
01)Projection: sum(lineitem.l_extendedprice * lineitem.l_discount) AS revenue
02)--Aggregate: groupBy=[[]], aggr=[[sum(lineitem.l_extendedprice * lineitem.l_discount)]]
03)----Projection: lineitem.l_extendedprice, lineitem.l_discount
-04)------Filter: lineitem.l_shipdate >= Date32("1994-01-01") AND lineitem.l_shipdate < Date32("1995-01-01") AND lineitem.l_discount >= Decimal128(Some(5),15,2) AND lineitem.l_discount <= Decimal128(Some(7),15,2) AND lineitem.l_quantity < Decimal128(Some(2400),15,2)
-05)--------TableScan: lineitem projection=[l_quantity, l_extendedprice, l_discount, l_shipdate], partial_filters=[lineitem.l_shipdate >= Date32("1994-01-01"), lineitem.l_shipdate < Date32("1995-01-01"), lineitem.l_discount >= Decimal128(Some(5),15,2), lineitem.l_discount <= Decimal128(Some(7),15,2), lineitem.l_quantity < Decimal128(Some(2400),15,2)]
+04)------Filter: lineitem.l_shipdate >= Date32("1994-01-01") AND lineitem.l_shipdate < Date32("1995-01-01") AND CAST(lineitem.l_discount AS Float64) >= Float64(0.049999999999999996) AND CAST(lineitem.l_discount AS Float64) <= Float64(0.06999999999999999) AND lineitem.l_quantity < Decimal128(Some(2400),15,2)
This will likely cause a performance regression, as it will cast the entire lineitem.l_discount column to Float before comparison, where previously it could compare directly against a decimal constant.
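The odd-looking literals in the rewritten plan come straight from binary floating point (a quick illustration in plain Python): the decimal literals 0.06 and 0.01 have no exact f64 representation, so evaluating the BETWEEN bounds in f64 drifts off the exact decimal values.

```python
# The query predicate is: l_discount between 0.06 - 0.01 and 0.06 + 0.01.
# Evaluated in f64 arithmetic, the bounds are not exactly 0.05 and 0.07:
lo = 0.06 - 0.01
hi = 0.06 + 0.01

assert repr(lo) == "0.049999999999999996"  # the plan's lower bound
assert repr(hi) == "0.06999999999999999"   # the plan's upper bound
```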
We definitely shouldn't merge anything that might introduce regressions without going through the very reasonable process you suggested. It didn't seem to me this would cause regressions, but if that's the case, we should wait and understand better. This week, I will start off by thinking about what we can do about the correctness issue the other PR is focusing on -- without regressions. It'd be great if we can find a neat fix for the issue that doesn't have the performance implication instead of just reverting for now. I have the feeling that there is a way, but we'll see.
👍
I think @jayzhan211 has a good idea here. This is another version of a similar idea, I think.
I am OK having DF coercions extensible (#14296), but first and foremost I want them to be optional. Regardless of future plans and potential customizability via new or existing extension points, we need to decide what the "out of the box" behavior is. We have a rule to follow PostgreSQL in principle, and that is what this PR does. I consider it non-controversial. The existing comments revolve around "let's make sure it's properly reviewed", and by now it obviously is. I will merge tomorrow unless there are actionable objections.
As I said, I am OK with this change, so merging is fine from my perspective -- but I don't think it is a great idea to pull the trigger before achieving consensus with people who left comments (in this case @alamb). It is not obvious that this is the case -- maybe he is on board already, but the (unanswered) comments are at least suggestive of some possible concerns.
I really worry that this PR will have unintended downstream impacts on systems that rely on Decimal (like maybe Comet), but I don't know enough to know for sure. It won't impact InfluxDB, as we don't use decimal. I am sorry I don't have a more specific concern -- perhaps we can merge this PR and see what happens downstream (and if the old coercion behavior is important for some people, that would be more motivation to make it configurable).
TL;DR: I have some vague concern about this PR, but that is not a reason to avoid merging it. It is fine with me to merge.
I believe Comet doesn't rely on any coercion logic from DataFusion? cc @andygrove
@@ -27,4 +27,4 @@ where
and l_discount between 0.06 - 0.01 and 0.06 + 0.01
and l_quantity < 24;
----
-11803420.2534
+7115406.7008
AFAICT this result changes because the literals are now treated as floats, and l_discount is a decimal. Is this a correctness problem? @andygrove
SQL literals should be decimal not float per ANSI SQL specification
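A minimal illustration (plain Python, standing in for the engine's semantics) of how coercing this decimal comparison to f64 can drop boundary rows, which is consistent with the Q6 result change:

```python
from decimal import Decimal

# Q6-style predicate: l_discount BETWEEN 0.06 - 0.01 AND 0.06 + 0.01.
# With exact decimal arithmetic the bounds are exactly 0.05 and 0.07,
# so a row with l_discount = 0.07 qualifies.
d = Decimal("0.07")
lo = Decimal("0.06") - Decimal("0.01")
hi = Decimal("0.06") + Decimal("0.01")
assert lo == Decimal("0.05") and hi == Decimal("0.07")
assert lo <= d <= hi

# Coerced to f64, the upper bound evaluates to 0.06999999999999999 while
# 0.07 rounds to a double slightly above 0.07, so the same row is
# filtered out -- boundary rows silently disappear from the result.
f = float(d)
assert not (0.06 - 0.01 <= f <= 0.06 + 0.01)
```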
Yes, this is a correctness problem. I explained in #14273 (comment)
Thanks for the pings. I agree with not rushing this in. Decimals are exact types and floats are not, so we need to be careful. There was an earlier issue where we used float instead of decimal in some places and it resulted in correctness issues. I will run the Comet/Spark tests with these changes and see what happens. +1 for user-defined coercion rules as well, though that is separate from this discussion. In Comet we want to use coercion rules that match Spark behavior. We do currently use some of DataFusion's coercion rules.
physical_plan
01)ProjectionExec: expr=[sum(lineitem.l_extendedprice * lineitem.l_discount)@0 as revenue]
02)--AggregateExec: mode=Final, gby=[], aggr=[sum(lineitem.l_extendedprice * lineitem.l_discount)]
03)----CoalescePartitionsExec
04)------AggregateExec: mode=Partial, gby=[], aggr=[sum(lineitem.l_extendedprice * lineitem.l_discount)]
05)--------CoalesceBatchesExec: target_batch_size=8192
-06)----------FilterExec: l_shipdate@3 >= 1994-01-01 AND l_shipdate@3 < 1995-01-01 AND l_discount@2 >= Some(5),15,2 AND l_discount@2 <= Some(7),15,2 AND l_quantity@0 < Some(2400),15,2, projection=[l_extendedprice@1, l_discount@2]
+06)----------FilterExec: l_shipdate@3 >= 1994-01-01 AND l_shipdate@3 < 1995-01-01 AND CAST(l_discount@2 AS Float64) >= 0.049999999999999996 AND CAST(l_discount@2 AS Float64) <= 0.06999999999999999 AND l_quantity@0 < Some(2400),15,2, projection=[l_extendedprice@1, l_discount@2]
This is incorrect. l_discount is a decimal and the literals in the query are also decimals. There should not be a cast from an exact type to an approximate type here.
and l_discount between 0.05 and 0.07
I don't disagree with the goals stated in the issue, but the implementation here is incorrect and introduces correctness issues. I am requesting changes to avoid an accidental merge, since the PR already has one approval.
Which issue does this PR close?
Rationale for this change
Before the change, Float* and Decimal* would coerce to a decimal type. However, decimal cannot store all the float values: different range, NaN, and infinity. Coercing to floating point is more desirable and also what other systems typically do.
What changes are included in this PR?
Numeric coercion rules change
Are these changes tested?
yes
Are there any user-facing changes?
yes