
fix(expr-common): Coerce to Decimal(20, 0) when combining UInt64 with signed integers #14223

Merged · 5 commits merged into apache:main on Jan 24, 2025

Conversation

@nuno-faria (Contributor)

Previously, when combining UInt64 with any signed integer, the resulting type was Int64, which could lose information. Now, combining UInt64 with a signed integer results in a Decimal128(20, 0), which can encode all 64-bit integer types. Thanks @jonahgao for the pointers.

The function bitwise_coercion remains the same, since it's probably not a good idea to introduce decimals when performing bitwise operations. In this case, it converts (UInt64 | _) to UInt64.
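A minimal sketch of the new rule, assuming the match-arm shape visible in the review excerpts quoted later in this thread (the helper name is illustrative, not the verbatim patch):

    use arrow::datatypes::DataType::{self, *};

    // Hedged sketch: pairing UInt64 with any signed integer widens to
    // Decimal128(20, 0), the narrowest decimal that holds both i64 and u64.
    fn coerce_uint64_with_signed(lhs: &DataType, rhs: &DataType) -> Option<DataType> {
        match (lhs, rhs) {
            (UInt64, Int64 | Int32 | Int16 | Int8)
            | (Int64 | Int32 | Int16 | Int8, UInt64) => Some(Decimal128(20, 0)),
            _ => None,
        }
    }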

Which issue does this PR close?

Closes #14208.

What changes are included in this PR?

  • Updated binary_numeric_coercion in expr-common/type_coercion/binary.rs.
  • Added new tests to expr-common/type_coercion/binary.rs.
  • Updated existing sqllogic tests to use the new coercion.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@github-actions bot added the sqllogictest (SQL Logic Tests (.slt)) label on Jan 21, 2025
// accommodates all values of both types. Note that to avoid information
// loss when combining UInt64 with signed integers we use Decimal128(20, 0).
(Decimal128(20, 0), _)
| (_, Decimal128(20, 0))
@jonahgao (Member)

This doesn't look correct. For example, combining Decimal128(20, 0) with Decimal128(30, 0) should not result in Decimal128(20, 0).

@nuno-faria (Contributor, Author)

I think when both types are decimal, they are handled earlier, before this match, by the call to the decimal_coercion function.

@jonahgao (Member)

I rechecked it, and decimal_coercion already covers them in:

    fn coerce_numeric_type_to_decimal(numeric_type: &DataType) -> Option<DataType> {
        use arrow::datatypes::DataType::*;
        // This conversion rule is from spark
        // https://github.com/apache/spark/blob/1c81ad20296d34f137238dadd67cc6ae405944eb/sql/catalyst/src/main/scala/org/apache/spark/sql/types/DecimalType.scala#L127
        match numeric_type {
            Int8 => Some(Decimal128(3, 0)),
            Int16 => Some(Decimal128(5, 0)),
            Int32 => Some(Decimal128(10, 0)),
            Int64 => Some(Decimal128(20, 0)),

Although it doesn't handle unsigned integer types, we can supplement it there, maybe as a follow-up PR.

@jonahgao (Member)

So I think we shouldn't combine decimal with integer types here because decimal_coercion has already handled it.

@nuno-faria (Contributor, Author)

I tried that initially but then it would not handle UInt64 and Decimal128(20, 0):

Cannot infer common argument type for comparison operation UInt64 = Decimal128(20, 0)

So maybe it would be best to add new arms to the coerce_numeric_type_to_decimal to include unsigned integers as well?

    match numeric_type {
        Int8 => Some(Decimal128(3, 0)),
        Int16 => Some(Decimal128(5, 0)),
        Int32 => Some(Decimal128(10, 0)),
        Int64 => Some(Decimal128(20, 0)),
        Float32 => Some(Decimal128(14, 7)),
        Float64 => Some(Decimal128(30, 15)),
        _ => None,
    }
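For reference, a sketch of what those unsigned arms could look like, with precisions taken from each type's maximum decimal width (u8::MAX = 255 → 3 digits, u16::MAX = 65535 → 5, u32::MAX = 4294967295 → 10, u64::MAX = 18446744073709551615 → 20); the helper name is illustrative, not the merged code:

    use arrow::datatypes::DataType;

    // Illustrative helper (not the merged patch): each unsigned integer type
    // maps to the narrowest decimal that holds its full range.
    fn coerce_unsigned_type_to_decimal(numeric_type: &DataType) -> Option<DataType> {
        use arrow::datatypes::DataType::*;
        match numeric_type {
            UInt8 => Some(Decimal128(3, 0)),
            UInt16 => Some(Decimal128(5, 0)),
            UInt32 => Some(Decimal128(10, 0)),
            UInt64 => Some(Decimal128(20, 0)),
            _ => None,
        }
    }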

@jonahgao (Member)

So maybe it would be best to add new arms to the coerce_numeric_type_to_decimal to include unsigned integers as well?

I think so.

@nuno-faria (Contributor, Author)

Done.

@jonahgao (Member)

Perhaps we should add a sqllogictest test for #14208.

@nuno-faria (Contributor, Author)

Perhaps we should add a sqllogictest test for #14208.

Done.

@jonahgao (Member) left a comment

LGTM, thank you @nuno-faria

@alamb (Contributor) commented Jan 22, 2025

Thanks again @nuno-faria - it is great to see you contributing ❤️

@alamb (Contributor) left a comment

Thanks for the contribution @nuno-faria and the review @jonahgao

However, this change doesn't seem like a good one to me

// for largest signed (signed sixteen-byte integer) and unsigned integer (unsigned sixteen-byte integer)
// accommodates all values of both types. Note that to avoid information
// loss when combining UInt64 with signed integers we use Decimal128(20, 0).
(UInt64, Int64 | Int32 | Int16 | Int8)
@alamb (Contributor)

I think this has potentially (large) performance implications.

My understanding is that this means Int64+Int64 will (always) result in a 128-bit value?

So even though performing int64+int64 will never overflow, all queries will pay the price of 2x space (and some time) overhead?

@jonahgao (Member)

No, Int64+Int64 is not affected, it uses mathematics_numerical_coercion.

I did some validation on this PR branch.

DataFusion CLI v44.0.0

> create table test(a bigint, b bigint unsigned) as values(1,1);
0 row(s) fetched.
Elapsed 0.008 seconds.

> select arrow_typeof(a+b), arrow_typeof(a+a), arrow_typeof(a), arrow_typeof(b) from test;
+-------------------------------+-------------------------------+----------------------+----------------------+
| arrow_typeof(test.a + test.b) | arrow_typeof(test.a + test.a) | arrow_typeof(test.a) | arrow_typeof(test.b) |
+-------------------------------+-------------------------------+----------------------+----------------------+
| Int64                         | Int64                         | Int64                | UInt64               |
+-------------------------------+-------------------------------+----------------------+----------------------+
1 row(s) fetched.
Elapsed 0.009 seconds.

@alamb (Contributor)

I really liked this way to verify the behavior. I made a PR with this type of test and verified that the tests still pass with the changes in this PR:

04)------TableScan: aggregate_test_100 projection=[c1, c9]
05)----Projection: aggregate_test_100.c1, CAST(aggregate_test_100.c3 AS Int64) AS c9
05)----Projection: aggregate_test_100.c1, CAST(aggregate_test_100.c3 AS Decimal128(20, 0)) AS c9
@alamb (Contributor)

This seems like a regression to me (there is now 2x the space needed).

@jonahgao (Member)

This query unions Int16 with UInt64. We need to find a common type that can accommodate all possible values of these two types, such as -1 and u64::MAX. Despite increasing the space, it makes the following query available.

create table t1(a smallint) as values(1);
create table t2(a bigint unsigned) as values(10000000000000000000);
select * from t1 union select * from t2;

3

query I rowsort
select * from unsigned_bigint_test
@alamb (Contributor)

I agree the case of overflow on * of two 64-bit numbers is more likely, and automatically coercing to Decimal128 may make sense.

However, I would argue that if the user cares about avoiding overflows when doing integer arithmetic, they should use Decimal128 in their input types.

@jonahgao (Member)

Overflow is unexpected in this case, as all these values are valid unsigned bigint.

The problem is that in values (10000000000000000001), (1), 10000000000000000001 is parsed as UInt64, and 1 is parsed as Int64. They were coerced to Int64, which can't accommodate 10000000000000000001.

This is very similar to the union case mentioned above.

@alamb (Contributor) left a comment

Thank you @nuno-faria and @jonahgao for the explanation. I agree this PR makes sense and my concerns did not apply.


@jonahgao (Member)

Thanks @nuno-faria and @alamb for the review.

@jonahgao merged commit e0f9f65 into apache:main on Jan 24, 2025 (25 checks passed)
@ozankabak (Contributor)

When we combine a UInt64 with a literal value (say 5), we don't accidentally consider the literal value a signed value and do this coercion to decimal, right? That would kill performance in many cases.

@jonahgao (Member)

When we combine a UInt64 with a literal value (say 5), we don't accidentally consider the literal value a signed value and do this coercion to decimal, right? That would kill performance in many cases.

We always parse 5 as Int64 (see parse_sql_number), so the coercion result will be decimal. The same thing also happens in DuckDB.

v1.1.1-dev319 af39bd0dcf
D create table t(a ubigint);
D select * from t union select 5;
┌────────┐
│   a    │
│ int128 │
├────────┤
│      5 │
└────────┘

This is because, during type coercion, we only consider the data types of the binary operands and do not take their actual data into account. Some binary operations, such as union and comparisons, require that the operands have the same data type, so the coerced type must be a superset of both operand types and therefore can accommodate all possible operand values.

For this case, maybe we can use statistical information to examine the operands' data and perform some physical optimizations.

@ozankabak (Contributor) commented Jan 24, 2025

This is a problem with a large potential performance impact in many use cases. I do not think it is a good idea to defer its solution with an open-ended timeframe. I see two ways to proceed:

  1. If it is relatively easy, we can do a quick follow-on PR to parse/type literals more intelligently. Maybe always parse as the smallest/narrowest datatype that will hold the constant. We will probably need to somehow decide whether to use a signed type for the constant as well. Coercion will then widen the type if necessary. However, I am not sure if this is possible to do in a straightforward way.
  2. If doing (1) is not easy, we should revert this PR to avoid the performance impact and think of a more holistic solution that includes a mechanism for doing something like (1), or a more involved mechanism utilizing statistics.

I am fine with both ways forward.

@alamb (Contributor) commented Jan 24, 2025

@ozankabak note the coercion rules @jonahgao refers to above in #14223 (comment) are for UNION (not, for example, integer arithmetic).

I am writing some (more) tests to show what is happening

@alamb (Contributor) commented Jan 24, 2025

Here is a PR that tries to demonstrate more of what is happening:

@findepi (Member) commented Jan 24, 2025

  1. If it is relatively easy, we can do a quick follow-on PR to parse/type literals more intelligently. Maybe always parse as the smallest/narrowest datatype that will hold the constant.

Trino parses integer literals as the integer type unless they do not fit in 32 bits, in which case they are bigint.
This logic generally makes sense; I don't know what the SQL spec or Postgres does.

The smallest possible type may be a step too far. For example, 120 + 120 would overflow if 120 were typed as Int8.

Some other systems (e.g. Snowflake, PostgreSQL) apparently defer literal typing, making it contextual. We have discussed this for varchar-like literals in date/time argument contexts, but this approach could perhaps also apply to numeric literals.

@jonahgao (Member) commented Jan 24, 2025

Some information may be useful:

  1. This coercion rule applies to union and comparisons (for example, a > b, a <= b), but does not apply to arithmetic operations (for example, a + b). Comparisons in arrow-rs require that the operands have the same type.

  2. Before this PR, we already had similar behavior; for instance, combining UInt32 and Int8 would result in Int64. This PR extends that behavior to UInt64. Without this change, the following queries will fail.

DataFusion CLI v44.0.0
> select 10000000000000000001 > -1;
Arrow error: Cast error: Can't cast value 10000000000000000001 to type Int64

> select 10000000000000000001 union select -1;
External error: External error: Arrow error: Cast error: Can't cast value 10000000000000000001 to type Int64

# use narrowest datatype
> select 10000000000000000001 union select -1::tinyint;
External error: External error: Arrow error: Cast error: Can't cast value 10000000000000000001 to type Int64
  3. Postgres and Spark do not provide unsigned types; DuckDB does, and its behavior is similar to ours when combining an unsigned and a signed integer type.

@ozankabak (Contributor)

Comparison with a literal is a very common pattern. I am still worried.

@nuno-faria deleted the uint64_cast_bug branch on January 24, 2025 10:56
@berkaysynnada (Contributor) commented Jan 24, 2025

I think this change brings more harm than it helps. I wrote a simple benchmark:

  • A table with 100_000 rows and one UInt64 column
  • Values randomly generated between 0 and 100_000_000
  • The query is SELECT a > -1 FROM random_a

after this commit:

SELECT a > -1
                       time:   [1.3010 ms 1.4096 ms 1.5468 ms]
Found 13 outliers among 100 measurements (13.00%)
 6 (6.00%) high mild
 7 (7.00%) high severe

before this commit:

SELECT a > -1
                        time:   [601.85 µs 604.89 µs 610.50 µs]
                        change: [-60.997% -57.120% -53.557%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
  3 (3.00%) high severe

I believe we should find a more clever way before the release.

@jonahgao (Member)

Comparison with a literal is a very common pattern. I am still worried.

I found that unwrap_cast_in_comparison can optimize this case.

DataFusion CLI v44.0.0
create table t(a bigint unsigned);

> explain verbose select a > 1 from t;
| logical_plan after simplify_expressions                    | Projection: CAST(t.a AS Decimal128(20, 0)) > Decimal128(Some(1),20,0) AS t.a > Int64(1)                                |
|                                                            |   TableScan: t                                                                                                         |
| logical_plan after unwrap_cast_in_comparison               | Projection: t.a > UInt64(1) AS t.a > Int64(1)                                                                          |
|                                                            |   TableScan: t                                                                                                         |
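In other words, the optimizer turned CAST(t.a AS Decimal128(20, 0)) > Decimal128(1) back into t.a > UInt64(1), because the literal 1 is exactly representable as UInt64. A minimal sketch of the fitting check behind such a rewrite, with hypothetical names (the real rule works on Expr trees, not bare integers):

    // Hypothetical helper, not the real rule's API: a Decimal128(20, 0)
    // literal can be unwrapped to UInt64 iff its value is within u64's range.
    fn try_unwrap_to_u64(decimal_value: i128) -> Option<u64> {
        u64::try_from(decimal_value).ok()
    }

    fn main() {
        assert_eq!(try_unwrap_to_u64(1), Some(1)); // a > 1: compare natively as UInt64
        assert_eq!(try_unwrap_to_u64(-1), None);   // a > -1: cannot unwrap this way
    }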

@jonahgao (Member)

I think this change brings more harm than it helps. I wrote a simple benchmark:
A table with 100_000 rows and one UInt64 column
Values randomly generated between 0 and 100_000_000

The problem is that if the column contains values greater than 9223372036854775807 (i64::MAX), this query cannot run.

I think this is a necessary overhead for correctness, unless we can guarantee that the column's values will not exceed i64::MAX (through statistics), or arrow-rs supports comparing two different types.

@berkaysynnada (Contributor)

I think this change brings more harm than it helps. I wrote a simple benchmark:
A table with 100_000 rows and one UInt64 column
Values randomly generated between 0 and 100_000_000

The problem is that if the column contains values greater than 9223372036854775807 (i64::MAX), this query cannot run.

I think this is a necessary overhead for correctness, unless we can guarantee that the column's values will not exceed i64::MAX (through statistics), or arrow-rs supports comparing two different types.

If the user believes such a scenario is possible, they should be expected to explicitly provide the casted type. Why do we impose this overhead on all cases?

@jonahgao (Member)

If the user believes such a scenario is possible, they should be expected to explicitly provide the casted type. Why do we impose this overhead on all cases?

The user may know nothing about the data in the table and be unable to make decisions. Additionally, some query statements are generated at runtime. I think this is the optimizer's responsibility, for example, to introduce a physical optimization rule similar to unwrap_cast_in_comparison.

@berkaysynnada (Contributor) commented Jan 24, 2025

If the user believes such a scenario is possible, they should be expected to explicitly provide the casted type. Why do we impose this overhead on all cases?

The user may know nothing about the data in the table and be unable to make decisions. Additionally, some query statements are generated at runtime. I think this is the optimizer's responsibility, for example, to introduce a physical optimization rule similar to unwrap_cast_in_comparison.

Let's wait for more feedback. I still believe we should provide a way to take the risk of overflow while achieving much better performance. Arrow kernels support both wrapping and checked operations, determined by the fail_on_overflow flag in BinaryExpr.

Considering all of this, I feel we have the necessary ammo; we just need to enable the pathway for whatever the user intends.

@ozankabak (Contributor)

  1. If it is relatively easy, we can do a quick follow-on PR to parse/type literals more intelligently. Maybe always parse as the smallest/narrowest datatype that will hold the constant. We will probably need to somehow decide whether to use a signed type for the constant as well. Coercion will then widen the type if necessary. However, I am not sure if this is possible to do in a straightforward way.

Given that we are talking about coercions in the context of comparisons and unions, I think a flavor/improvement of the idea above may be simple to implement, achieve correctness, and avoid the issue with this PR (which basically incurs technical debt to fix the performance, with no timeframe for the fix).

Let's get more ideas in, and see if there is an even better solution. Once one shows up, we can either create a PR for it if it is easy enough, or create an issue describing the approach and revert this if there is no volunteer for it.

Someone from our team will take it on at some point if there are no takers, as the correctness issue is important even though it is a corner case. Thanks for bringing this to our attention @jonahgao.

@jayzhan211 (Contributor)

For the scalar case, like SELECT a > -1, consider any SELECT a > b where b is a constant. I think we could optimize it since we know the value. If the scalar is negative, it can be simplified to true. Otherwise, we can rewrite the expression as min(a, i64::max) > b OR (min(a, i64::max) = i64::max AND b = i64::max).

However, for the column case, like SELECT a > b, we can only rewrite it to the latter form.

I believe similar optimization rules exist for other comparison operators, given that the type implies the range of the values.

Considering COALESCE and UNION, for the scalar case, we can rewrite the expression given the known value. But for the column case, I think Decimal128 is the only viable option.

I quite agree that, in general, this should be handled in the optimizer, like unwrap_cast_in_comparison, or in a physical optimizer rule based on column statistics.

Another question I have is whether the u64+i64 combination is common in DataFusion, and whether we can avoid it at all. I guess a u64 larger than i64::max is uncommon; can we aggressively use i64 even though we know it is always positive?

@alamb (Contributor) commented Jan 25, 2025

BTW I filed a ticket to track resolving this thread (so that it doesn't get lost before we release 45.0.0):

@alamb (Contributor) commented Jan 25, 2025

Another question I have is whether the u64+i64 combination is common in DataFusion, and whether we can avoid it at all. I guess a u64 larger than i64::max is uncommon; can we aggressively use i64 even though we know it is always positive?

To be clear, a u64 + u64 in DataFusion still (even after this PR) is u64 -- we even put a test in to verify this:

Int8 Int16 Int32 Int64 UInt8 UInt16 UInt32 UInt64 Float32 Float64 Decimal128(6, 2)

I am still not sure I fully understand in what circumstances this change might reduce performance.

@jayzhan211 (Contributor)

u64+i64 combination

I mean comparison_op(u64, i64), coalesce(u64, i64), and union(u64, i64), which use binary_numeric_coercion and are cast to Decimal128. Mathematical operations are unaffected.

@alamb (Contributor) commented Jan 25, 2025

I ran benchmarks with / without this change and did not see any noticeable performance difference. See details here

I also created a PR with some tests that show the practical impact of this PR on comparisons (I think it is minimal). See

@berkaysynnada (Contributor)

I ran benchmarks with / without this change and did not see any noticeable performance difference. See details here

I don't have detailed knowledge of the benchmarks' scope, but you can try this:
https://github.com/synnada-ai/datafusion-upstream/tree/benchmark-cast

That would show the drastic regression.

@alamb (Contributor) commented Jan 25, 2025

I ran benchmarks with / without this change and did not see any noticeable performance difference. See details here

I don't have detailed knowledge of the benchmarks' scope, but you can try this: https://github.com/synnada-ai/datafusion-upstream/tree/benchmark-cast

That would show the drastic regression.

Got it -- that is comparing a negative number with an unsigned column (a is a UInt64), so it makes sense that a is now being cast to Decimal for the comparison

    SELECT a > -1 FROM random_a

Which turns into

    SELECT CAST(a, Decimal) > Decimal(-1) FROM random_a

We could probably fix this particular case with some sort of change in unwrap_cast_in_comparison

Currently unwrap_cast_in_comparison would try to cast -1 to a UInt and will fail

But we could potentially teach it about negative numbers with unsigned numbers and rewrite the comparison from -1 to 0 (which is still the same)

    SELECT CAST(a, Decimal) > Decimal(0) FROM random_a

Which will then get converted back to

    SELECT a > CAST (Decimal(0) as UINT) FROM random_a

🤔
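A related sketch under assumptions (hypothetical names, not the actual unwrap_cast_in_comparison API; a simpler variant that folds the comparison outright rather than rewriting the literal): since no unsigned value is ever equal to or below a negative number, such comparisons can be decided at plan time.

    use arrow::datatypes::DataType;

    // Hypothetical operator enum, standing in for DataFusion's Operator.
    enum CmpOp { Eq, NotEq, Lt, LtEq, Gt, GtEq }

    // Sketch: an unsigned column compared with a negative literal folds to a
    // constant, because every unsigned value exceeds every negative number.
    // Returns None when the rule does not apply.
    fn fold_unsigned_vs_negative(col_type: &DataType, lit: i64, op: CmpOp) -> Option<bool> {
        use arrow::datatypes::DataType::*;
        if !matches!(col_type, UInt8 | UInt16 | UInt32 | UInt64) || lit >= 0 {
            return None;
        }
        Some(match op {
            CmpOp::Gt | CmpOp::GtEq | CmpOp::NotEq => true, // always holds
            CmpOp::Lt | CmpOp::LtEq | CmpOp::Eq => false,   // never holds
        })
    }

Under this sketch, a > -1 on a UInt64 column would fold to true without touching the column's type at all, sidestepping the Decimal128 cast in this particular pattern.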

@jonahgao (Member)

UInt64 > -1 is less common than UInt64 > 1. In my opinion, ensuring that comparisons between unsigned and signed columns always work is more important.

Labels: sqllogictest SQL Logic Tests (.slt)

Successfully merging this pull request may close these issues: SQL INSERT casts to Int64 when handling UInt64 values (#14208)

7 participants