Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(expr-common): Coerce to Decimal(20, 0) when combining UInt64 with signed integers #14223

Merged
merged 5 commits into from
Jan 24, 2025
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
33 changes: 18 additions & 15 deletions datafusion/expr-common/src/type_coercion/binary.rs
Original file line number Diff line number Diff line change
Expand Up @@ -777,29 +777,20 @@ pub fn binary_numeric_coercion(
(_, Float32) | (Float32, _) => Some(Float32),
// The following match arms encode the following logic: Given the two
// integral types, we choose the narrowest possible integral type that
// accommodates all values of both types. Note that some information
// loss is inevitable when we have a signed type and a `UInt64`, in
// which case we use `Int64`;i.e. the widest signed integral type.

// TODO: For i64 and u64, we can use decimal or float64
// Postgres has no unsigned type :(
// DuckDB v.0.10.0 has double (double precision floating-point number (8 bytes))
// for largest signed (signed sixteen-byte integer) and unsigned integer (unsigned sixteen-byte integer)
// accommodates all values of both types. Note that to avoid information
// loss when combining UInt64 with signed integers we use Decimal128(20, 0).
(Decimal128(20, 0), _)
| (_, Decimal128(20, 0))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks not correct. For example, combining Decimal128(20, 0) with Decimal128(30, 0) should not result in Decimal128(20, 0)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think when both types are decimal they are handled above before this match, when calling the decimal_coercion function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I rechecked it, and decimal_coercion already covers them in

fn coerce_numeric_type_to_decimal(numeric_type: &DataType) -> Option<DataType> {
use arrow::datatypes::DataType::*;
// This conversion rule is from spark
// https://github.com/apache/spark/blob/1c81ad20296d34f137238dadd67cc6ae405944eb/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala#L127
match numeric_type {
Int8 => Some(Decimal128(3, 0)),
Int16 => Some(Decimal128(5, 0)),
Int32 => Some(Decimal128(10, 0)),
Int64 => Some(Decimal128(20, 0)),

Although it doesn't handle unsigned integer types, we can supplement it there, maybe as a follow-up PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I think we shouldn't combine decimal with integer types here because decimal_coercion has already handled it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried that initially but then it would not handle UInt64 and Decimal128(20, 0):

Cannot infer common argument type for comparison operation UInt64 = Decimal128(20, 0)

So maybe it would be best to add new arms to the coerce_numeric_type_to_decimal to include unsigned integers as well?

    match numeric_type {
        Int8 => Some(Decimal128(3, 0)),
        Int16 => Some(Decimal128(5, 0)),
        Int32 => Some(Decimal128(10, 0)),
        Int64 => Some(Decimal128(20, 0)),
        Float32 => Some(Decimal128(14, 7)),
        Float64 => Some(Decimal128(30, 15)),
        _ => None,
    }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So maybe it would be best to add new arms to the coerce_numeric_type_to_decimal to include unsigned integers as well?

I think so.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

| (UInt64, Int64 | Int32 | Int16 | Int8)
| (Int64 | Int32 | Int16 | Int8, UInt64) => Some(Decimal128(20, 0)),
(Int64, _)
| (_, Int64)
| (UInt64, Int8)
| (Int8, UInt64)
| (UInt64, Int16)
| (Int16, UInt64)
| (UInt64, Int32)
| (Int32, UInt64)
| (UInt32, Int8)
| (Int8, UInt32)
| (UInt32, Int16)
| (Int16, UInt32)
| (UInt32, Int32)
| (Int32, UInt32) => Some(Int64),
(UInt64, _) | (_, UInt64) => Some(UInt64),
jonahgao marked this conversation as resolved.
Show resolved Hide resolved
(Int32, _)
| (_, Int32)
| (UInt16, Int16)
Expand Down Expand Up @@ -1994,6 +1985,12 @@ mod tests {
Operator::Gt,
DataType::UInt32
);
test_coercion_binary_rule!(
DataType::UInt64,
DataType::Int64,
Operator::Eq,
DataType::Decimal128(20, 0)
);
// numeric/decimal
test_coercion_binary_rule!(
DataType::Int64,
Expand Down Expand Up @@ -2025,6 +2022,12 @@ mod tests {
Operator::GtEq,
DataType::Decimal128(15, 3)
);
test_coercion_binary_rule!(
DataType::UInt64,
DataType::Decimal128(20, 0),
Operator::Eq,
DataType::Decimal128(20, 0)
);

// Binary
test_coercion_binary_rule!(
Expand Down
6 changes: 3 additions & 3 deletions datafusion/sqllogictest/test_files/coalesce.slt
Original file line number Diff line number Diff line change
Expand Up @@ -105,13 +105,13 @@ select
----
2 Int64

# TODO: Got types (i64, u64), casting to decimal or double or even i128 if supported
query IT
# i64 and u64, cast to decimal128(20, 0)
query RT
select
coalesce(2, arrow_cast(3, 'UInt64')),
arrow_typeof(coalesce(2, arrow_cast(3, 'UInt64')));
----
2 Int64
2 Decimal128(20, 0)

statement ok
create table t1 (a bigint, b int, c int) as values (null, null, 1), (null, 2, null);
Expand Down
2 changes: 1 addition & 1 deletion datafusion/sqllogictest/test_files/joins.slt
Original file line number Diff line number Diff line change
Expand Up @@ -1828,7 +1828,7 @@ AS VALUES
('BB', 6, 1),
('BB', 6, 1);

query TII
query TIR
select col1, col2, coalesce(sum_col3, 0) as sum_col3
from (select distinct col2 from tbl) AS q1
cross join (select distinct col1 from tbl) AS q2
Expand Down
10 changes: 5 additions & 5 deletions datafusion/sqllogictest/test_files/union.slt
Original file line number Diff line number Diff line change
Expand Up @@ -413,23 +413,23 @@ explain SELECT c1, c9 FROM aggregate_test_100 UNION ALL SELECT c1, c3 FROM aggre
logical_plan
01)Sort: aggregate_test_100.c9 DESC NULLS FIRST, fetch=5
02)--Union
03)----Projection: aggregate_test_100.c1, CAST(aggregate_test_100.c9 AS Int64) AS c9
03)----Projection: aggregate_test_100.c1, CAST(aggregate_test_100.c9 AS Decimal128(20, 0)) AS c9
04)------TableScan: aggregate_test_100 projection=[c1, c9]
05)----Projection: aggregate_test_100.c1, CAST(aggregate_test_100.c3 AS Int64) AS c9
05)----Projection: aggregate_test_100.c1, CAST(aggregate_test_100.c3 AS Decimal128(20, 0)) AS c9
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems like a regression to me (there is now 2x the space needed)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This query unions Int16 with UInt64. We need to find a common type that can accommodate all possible values of these two types, such as -1 and u64::MAX. Despite increasing the space, it makes the following query available.

create table t1(a smallint) as values(1);
create table t2(a bigint unsigned) as values(10000000000000000000);
select * from t1 union select * from t2;

06)------TableScan: aggregate_test_100 projection=[c1, c3]
physical_plan
01)SortPreservingMergeExec: [c9@1 DESC], fetch=5
02)--UnionExec
03)----SortExec: TopK(fetch=5), expr=[c9@1 DESC], preserve_partitioning=[true]
04)------ProjectionExec: expr=[c1@0 as c1, CAST(c9@1 AS Int64) as c9]
04)------ProjectionExec: expr=[c1@0 as c1, CAST(c9@1 AS Decimal128(20, 0)) as c9]
05)--------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
06)----------CsvExec: file_groups={1 group: [[WORKSPACE_ROOT/testing/data/csv/aggregate_test_100.csv]]}, projection=[c1, c9], has_header=true
07)----SortExec: TopK(fetch=5), expr=[c9@1 DESC], preserve_partitioning=[true]
08)------ProjectionExec: expr=[c1@0 as c1, CAST(c3@1 AS Int64) as c9]
08)------ProjectionExec: expr=[c1@0 as c1, CAST(c3@1 AS Decimal128(20, 0)) as c9]
09)--------RepartitionExec: partitioning=RoundRobinBatch(4), input_partitions=1
10)----------CsvExec: file_groups={1 group: [[WORKSPACE_ROOT/testing/data/csv/aggregate_test_100.csv]]}, projection=[c1, c3], has_header=true

query TI
query TR
SELECT c1, c9 FROM aggregate_test_100 UNION ALL SELECT c1, c3 FROM aggregate_test_100 ORDER BY c9 DESC LIMIT 5
----
c 4268716378
Expand Down
Loading