Fix Float and Decimal coercion #14273
Conversation
Thanks @findepi -- this area is a bit out of my league; I don't fully understand the repercussions of this change.
@jonahgao @berkaysynnada and @ozankabak and I have been discussing coercion related to @nuno-faria's (very) recent change here:
Perhaps they can weigh in here. I feel like there is a tension here and that we are in serious danger of the behavior oscillating 🤔
I feel like in some past version of DataFusion we used to coerce to Floats and that was changed to Decimal for some reason
Maybe what we can do is write down:
- What the current coercion behavior is
- What we want it to be (and why we want to change it)
- Get agreement on the goal
- Then make the code match the agreement
Before the change, Float* and Decimal* would coerce to a decimal type. However, decimal cannot store all the float values: different range, NaN, and infinity. Coercing to floating point is more desirable and also what other systems typically do.
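As a plain illustration of the rationale (stdlib Python standing in for the engine; this is not DataFusion code), Arrow-style Decimal128(precision, scale) carries at most 38 significant digits and has no encoding for NaN or infinity, while f64 represents all of these:

```python
import math

# Decimal128(precision=38, scale=s) stores an unscaled 128-bit integer
# with at most 38 significant decimal digits.
DECIMAL128_MAX_UNSCALED = 10**38 - 1

# An ordinary f64 value that no Decimal128 can represent, at any scale:
assert 1e40 > DECIMAL128_MAX_UNSCALED

# f64 also has values with no decimal counterpart at all: Decimal128 has
# no encoding for NaN or +/- infinity.
assert math.isnan(float("nan"))
assert math.isinf(float("inf"))
assert math.isinf(float("-inf"))
```

Coercing the decimal side to float (rather than the reverse) avoids these unrepresentable cases, at the cost of exactness.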
@alamb thanks for your feedback. I agree it's important to avoid back-and-forth with a change, so the broader the review, the better.
I don't think this particular change should be controversial though. See the issue, but also: let's do it!
Converting to float is consistent with DuckDB too. It makes sense to me.
This change makes sense to me. However, I fully agree with @alamb on avoiding being trigger-happy with partial changes to coercion behavior. Let's follow the 4-step process laid out in his comment before merging. I don't remember the previous change from floats to decimals, but we should understand what happened then to avoid churn. I want to say it was about Spark compatibility, but I might be misremembering. If that is the case, we'd need to choose PostgreSQL compatibility over Spark compatibility (which is in line with project goals).
I have been thinking about this more. Is it a crazy idea if we tried to implement "user defined coercion rules"? That way we can have whatever coercion rules are in the DataFusion core, but when they inevitably are not quite what users want (as @findepi has pointed out with many examples), users can customize them however they want. I feel like with the current requirement for a single set of coercion rules, we'll always end up with conflicting requirements. Rather than having to go back and forth in the core and find some compromise, an API would give us an escape hatch. It would also force us to design the type coercion APIs more coherently, which I think would be good for code quality in general 🤔
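A hypothetical sketch of what such an escape hatch could look like (all names here are invented for illustration; this is not an actual or proposed DataFusion API):

```python
# Hypothetical: a registry that lets users override the engine's default
# binary-operator coercion. Types are modeled as plain strings for brevity.
from typing import Callable, Optional

# A rule maps (lhs_type, rhs_type) to a common type, or None to pass.
CoercionRule = Callable[[str, str], Optional[str]]

def default_rule(lhs: str, rhs: str) -> Optional[str]:
    # Core default after this PR: Float* and Decimal* coerce to float.
    if {lhs, rhs} == {"Float64", "Decimal128"}:
        return "Float64"
    return lhs if lhs == rhs else None

class CoercionRegistry:
    def __init__(self, fallback: CoercionRule):
        self.rules: list[CoercionRule] = []
        self.fallback = fallback

    def register(self, rule: CoercionRule) -> None:
        self.rules.insert(0, rule)  # user rules take precedence

    def coerce(self, lhs: str, rhs: str) -> Optional[str]:
        for rule in self.rules:
            result = rule(lhs, rhs)
            if result is not None:
                return result
        return self.fallback(lhs, rhs)

registry = CoercionRegistry(default_rule)
assert registry.coerce("Float64", "Decimal128") == "Float64"

# A Spark-oriented user could flip this particular pairing back to decimal
# without any change to the core rules:
registry.register(
    lambda l, r: "Decimal128" if {l, r} == {"Float64", "Decimal128"} else None
)
assert registry.coerce("Float64", "Decimal128") == "Decimal128"
```

The point of the sketch is only that a precedence-ordered hook would let the core keep one sane default while downstreams (Comet, InfluxDB, etc.) diverge where they must.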
I find this interesting and we should think about it. It is also in line with DF's vision of being fully customizable and extensible. However, doing this should be a follow-up to finalizing/merging this PR (after going through the steps you mention) and fixing the regression introduced by #14233. I don't think the steps to resolve either are in the "either-break-this-or-break-that" category.
I have been thinking about this more. Is it a crazy idea if we tried to implement "user defined coercion rules"?
I find this interesting and we should think about it. It is also in line with DF's vision of being fully customizable and extensible.
❤️
However, doing this should be a follow-up to finalizing/merging this PR (after going through the steps you mention) and fixing the regression introduced by #14233. I don't think the steps to resolve either are in the "either-break-this-or-break-that" category.
I am almost sure that this switches the behavior back to something that was previously changed.
It also seems like this PR will regress pruning / filtering for data that comes in as Decimal (the columns are now cast as float)
@@ -49,7 +49,7 @@ limit 10;
logical_plan
01)Sort: value DESC NULLS FIRST, fetch=10
02)--Projection: partsupp.ps_partkey, sum(partsupp.ps_supplycost * partsupp.ps_availqty) AS value
-03)----Inner Join: Filter: CAST(sum(partsupp.ps_supplycost * partsupp.ps_availqty) AS Decimal128(38, 15)) > __scalar_sq_1.sum(partsupp.ps_supplycost * partsupp.ps_availqty) * Float64(0.0001)
+03)----Inner Join: Filter: CAST(sum(partsupp.ps_supplycost * partsupp.ps_availqty) AS Float64) > __scalar_sq_1.sum(partsupp.ps_supplycost * partsupp.ps_availqty) * Float64(0.0001)
I vaguely remember the use of Decimal here was important for TPCH results (maybe for correctness or something 🤔 )
Similar situation with Q6
@@ -184,7 +184,13 @@ impl TestOutput {
/// and the appropriate scenario
impl ContextWithParquet {
    async fn new(scenario: Scenario, unit: Unit) -> Self {
-        Self::with_config(scenario, unit, SessionConfig::new()).await
+        let mut session_config = SessionConfig::new();
+        // TODO (https://github.com/apache/datafusion/issues/12817) once this is the default behavior, remove from here
Does this mean that DataFusion will no longer prune predicates like decimal_col = 5.0?
If so, this is likely a significant regression / issue for anyone who relies on decimal types (like Comet, for example).
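To make the pruning concern concrete, here is a toy min/max sketch (hypothetical, stdlib Python; not DataFusion's PruningPredicate code): row-group statistics are stored as decimals, so they apply directly to `decimal_col = 5.0` only while the column side of the predicate stays uncast.

```python
from decimal import Decimal

# Toy min/max pruning: a row group can be skipped when the literal falls
# outside the column's [min, max] statistics. (Illustrative only.)
def can_skip(col_min: Decimal, col_max: Decimal, literal: Decimal) -> bool:
    return literal < col_min or literal > col_max

# Decimal statistics for one row group of decimal_col:
col_min, col_max = Decimal("1.00"), Decimal("4.00")

# decimal_col = 5.0 with a decimal literal: 5.0 is outside [1.00, 4.00],
# so the row group is prunable.
assert can_skip(col_min, col_max, Decimal("5.0"))
# A literal inside the range cannot be skipped:
assert not can_skip(col_min, col_max, Decimal("3.0"))

# After coercion the predicate becomes CAST(decimal_col AS Float64) = 5.0.
# The stored decimal statistics describe decimal_col, not the casted
# expression, so a simple pruner can no longer skip the row group without
# extra rewriting of the statistics or the predicate.
```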
I am not sure if this will affect pruning - but it'd be great if @andygrove can take a look
I will look at this today.
@@ -31,13 +31,13 @@ logical_plan
01)Projection: sum(lineitem.l_extendedprice * lineitem.l_discount) AS revenue
02)--Aggregate: groupBy=[[]], aggr=[[sum(lineitem.l_extendedprice * lineitem.l_discount)]]
03)----Projection: lineitem.l_extendedprice, lineitem.l_discount
-04)------Filter: lineitem.l_shipdate >= Date32("1994-01-01") AND lineitem.l_shipdate < Date32("1995-01-01") AND lineitem.l_discount >= Decimal128(Some(5),15,2) AND lineitem.l_discount <= Decimal128(Some(7),15,2) AND lineitem.l_quantity < Decimal128(Some(2400),15,2)
-05)--------TableScan: lineitem projection=[l_quantity, l_extendedprice, l_discount, l_shipdate], partial_filters=[lineitem.l_shipdate >= Date32("1994-01-01"), lineitem.l_shipdate < Date32("1995-01-01"), lineitem.l_discount >= Decimal128(Some(5),15,2), lineitem.l_discount <= Decimal128(Some(7),15,2), lineitem.l_quantity < Decimal128(Some(2400),15,2)]
+04)------Filter: lineitem.l_shipdate >= Date32("1994-01-01") AND lineitem.l_shipdate < Date32("1995-01-01") AND CAST(lineitem.l_discount AS Float64) >= Float64(0.049999999999999996) AND CAST(lineitem.l_discount AS Float64) <= Float64(0.06999999999999999) AND lineitem.l_quantity < Decimal128(Some(2400),15,2)
This will likely cause a performance regression, as it will cast the entire lineitem.l_discount column to Float before comparison, where previously it could compare directly against a decimal constant.
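The odd-looking literals in the rewritten plan come straight from binary floating point (a quick illustration in plain Python): the decimal literals 0.06 and 0.01 have no exact f64 representation, so evaluating the BETWEEN bounds in f64 drifts off the exact decimal values.

```python
# The query predicate is: l_discount between 0.06 - 0.01 and 0.06 + 0.01.
# Evaluated in f64 arithmetic, the bounds are not exactly 0.05 and 0.07:
lo = 0.06 - 0.01
hi = 0.06 + 0.01

assert repr(lo) == "0.049999999999999996"  # the plan's lower bound
assert repr(hi) == "0.06999999999999999"   # the plan's upper bound
```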
We definitely shouldn't merge anything that might introduce regressions without going through the very reasonable process you suggested. It didn't seem to me this would cause regressions, but if that's the case, we should wait and understand better. This week, I will start off by thinking about what we can do about the correctness issue the other PR is focusing on -- without regressions. It'd be great if we can find a neat fix for the issue that doesn't have the performance implication instead of just reverting for now. I have the feeling that there is a way, but we'll see.
👍
I think @jayzhan211 has a good idea here. This is another version of a similar idea, I think.
I am OK having DF coercions extensible (#14296), but first and foremost I want them to be optional. Regardless of future plans and potential customizability via new or existing extension points, we need to decide what the "out of the box" behavior is. We have a rule to follow PostgreSQL in principle, and that is what this PR does. I consider it non-controversial. The existing comments revolve around "let's make sure it's properly reviewed", and by now it obviously is. I will merge tomorrow unless there are actionable objections.
As I said, I am OK with this change, so merging is fine from my perspective -- but I don't think it is a great idea to pull the trigger before achieving consensus with people who left comments (in this case @alamb). It is not obvious that this is the case -- maybe he is on board already, but the (unanswered) comments are at least suggestive of some possible concerns.
I really worry that this PR will have unintended downstream impacts on systems that rely on Decimal (like maybe Comet), but I don't know enough to know for sure. It won't impact InfluxDB, as we don't use decimal. I am sorry I don't have a more specific concern -- perhaps we can merge this PR and see what happens downstream (and if the old coercion behavior is important for some people, that would be more motivation to make it configurable).
TL;DR: I have some vague concern about this PR, but that is not a reason to avoid merging it. It is fine with me to merge.
I believe Comet doesn't rely on any coercion logic from DataFusion? cc @andygrove
@@ -27,4 +27,4 @@ where
and l_discount between 0.06 - 0.01 and 0.06 + 0.01
and l_quantity < 24;
----
-11803420.2534
+7115406.7008
AFAICT this result changes because the literals are now treated as floats, and l_discount is a decimal. Is this a correctness problem? @andygrove
SQL literals should be decimal not float per ANSI SQL specification
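A minimal illustration (plain Python, standing in for the engine's semantics) of how coercing this decimal comparison to f64 can drop boundary rows, which is consistent with the Q6 result change:

```python
from decimal import Decimal

# Q6-style predicate: l_discount BETWEEN 0.06 - 0.01 AND 0.06 + 0.01.
# With exact decimal arithmetic the bounds are exactly 0.05 and 0.07,
# so a row with l_discount = 0.07 qualifies.
d = Decimal("0.07")
lo = Decimal("0.06") - Decimal("0.01")
hi = Decimal("0.06") + Decimal("0.01")
assert lo == Decimal("0.05") and hi == Decimal("0.07")
assert lo <= d <= hi

# Coerced to f64, the upper bound evaluates to 0.06999999999999999 while
# 0.07 rounds to a double slightly above 0.07, so the same row is
# filtered out -- boundary rows silently disappear from the result.
f = float(d)
assert not (0.06 - 0.01 <= f <= 0.06 + 0.01)
```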
Yes, this is a correctness problem. I explained in #14273 (comment)
Thanks for the pings. I agree with not rushing this in. Decimals are exact types and floats are not, so we need to be careful. There was an earlier issue where we used float instead of decimal in some places and it resulted in correctness issues. I will run the Comet/Spark tests with these changes and see what happens. +1 for user-defined coercion rules as well, though that is separate from this discussion. In Comet we want to use coercion rules that match Spark behavior. We do currently use some of DataFusion's coercion rules.
physical_plan
01)ProjectionExec: expr=[sum(lineitem.l_extendedprice * lineitem.l_discount)@0 as revenue]
02)--AggregateExec: mode=Final, gby=[], aggr=[sum(lineitem.l_extendedprice * lineitem.l_discount)]
03)----CoalescePartitionsExec
04)------AggregateExec: mode=Partial, gby=[], aggr=[sum(lineitem.l_extendedprice * lineitem.l_discount)]
05)--------CoalesceBatchesExec: target_batch_size=8192
-06)----------FilterExec: l_shipdate@3 >= 1994-01-01 AND l_shipdate@3 < 1995-01-01 AND l_discount@2 >= Some(5),15,2 AND l_discount@2 <= Some(7),15,2 AND l_quantity@0 < Some(2400),15,2, projection=[l_extendedprice@1, l_discount@2]
+06)----------FilterExec: l_shipdate@3 >= 1994-01-01 AND l_shipdate@3 < 1995-01-01 AND CAST(l_discount@2 AS Float64) >= 0.049999999999999996 AND CAST(l_discount@2 AS Float64) <= 0.06999999999999999 AND l_quantity@0 < Some(2400),15,2, projection=[l_extendedprice@1, l_discount@2]
This is incorrect. l_discount is a decimal and the literals in the query are also decimals. There should not be a cast from an exact type to an approximate type here.
and l_discount between 0.05 and 0.07
I don't disagree with the goals stated in the issue, but the implementation here is incorrect and introduces correctness issues. I am requesting changes to avoid an accidental merge, since the PR already has one approval.
Which issue does this PR close?
Rationale for this change
Before the change, Float* and Decimal* would coerce to a decimal type. However, decimal cannot store all the float values: different range, NaN, and infinity. Coercing to floating point is more desirable and also what other systems typically do.
What changes are included in this PR?
Numeric coercion rules change
Are these changes tested?
yes
Are there any user-facing changes?
yes