
[Enhancement]: Chunk Skipping on UUID values #7744

Open
tpetry opened this issue Feb 19, 2025 · 4 comments
Labels
enhancement An enhancement to an existing feature for functionality

Comments

@tpetry

tpetry commented Feb 19, 2025

What type of enhancement is this?

Performance

What subsystems and features will be improved?

Query planner

What does the enhancement do?

At the moment, we've got chunk skipping on integer values to use with serial ids. It works great, but as soon as you insert new rows into old chunks, those chunk-skipping ranges become bloated and match every possible id. This leads to some chunks always being selected by the query planner and significantly impacts performance:

CREATE TABLE hypertbl_serial (
  id serial NOT NULL,
  value bigint NOT NULL,
  time timestamptz NOT NULL
);
SELECT create_hypertable('hypertbl_serial', by_range('time', INTERVAL '1 month'));

-- Hypertable is now filled for the next two years without late ingestion
INSERT INTO hypertbl_serial (value, time) VALUES ...

-- Many ORMs and web frameworks can only reference a row to fetch by a SINGLE value.
-- Which is a problem but you can use chunk skipping with an index to efficiently find specific rows by an id :)
CREATE INDEX pk_lookup ON hypertbl_serial (id);
SELECT enable_chunk_skipping('hypertbl_serial', 'id');

-- So a framework can now search for id=590212 and Timescale is able to detect which chunk matches the data
-- and only use that chunk:
-- chunk_1 (min=000000, max=123743) -> QueryPlanner: chunk excluded
-- chunk_2 (min=123744, max=487283) -> QueryPlanner: chunk excluded
-- chunk_3 (min=487284, max=748931) -> QueryPlanner: Index Scan on pk_lookup :)
SELECT * FROM hypertbl_serial WHERE id = 590212;

-- Now the fun begins...
-- We have some late-ingestion because we import data from a third-party system or our application is extended
-- for some use-case to just allow this.

-- The row with a time far in the past is inserted with a new serial
INSERT INTO hypertbl_serial (value, time) VALUES (33, NOW() - INTERVAL '1 year'); -- id=974932

-- chunk_1 (min=000000, max=974932): -> QueryPlanner: Index Scan on pk_lookup - argh :(
-- chunk_2 (min=123744, max=487283): -> QueryPlanner: chunk excluded
-- chunk_3 (min=487284, max=748931): -> QueryPlanner: Index Scan on pk_lookup :)
SELECT * FROM hypertbl_serial WHERE id = 590212;

Obviously, chunk_1 will now match any id because its range has been broadened so far by a single very late-ingested row. The more old data you backfill (e.g. by importing historic records for new customers), the bigger the problem becomes. At some point chunk skipping is useless because every chunk's range overlaps with some older ids.
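The range-bloat effect can be sketched outside of TimescaleDB (the chunk ranges are the ones from the example above; the range check is a simplified stand-in for the planner's min/max test, not actual TimescaleDB code):

```python
# Hypothetical sketch: per-chunk [min, max] ranges for a serial id column,
# and how one late-ingested row bloats the oldest chunk's range.

def chunks_matching(ranges, needle):
    """Return indices of chunks whose [min, max] range could contain needle."""
    return [i for i, (lo, hi) in enumerate(ranges) if lo <= needle <= hi]

# Tight, non-overlapping ranges as in the example above.
ranges = [(0, 123743), (123744, 487283), (487284, 748931)]
print(chunks_matching(ranges, 590212))   # → [2]: only chunk_3 is scanned

# One late insert lands a brand-new id (974932) in the oldest chunk,
# stretching chunk_1's max so it now overlaps every later range.
lo, hi = ranges[0]
ranges[0] = (min(lo, 974932), max(hi, 974932))
print(chunks_matching(ranges, 590212))   # → [0, 2]: chunk_1 can no longer be skipped
```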


An alternative approach would be to allow chunk skipping on uuid values - when used with uuid7 values it's a great feature. We could link the uuid7 value to the time column used for partitioning (related to #7682). Then the chunk-skipping ranges for each chunk can NEVER overlap, and the problem with the former serial column doesn't happen. We'd get perfect chunk skipping for uuid7 values :)

CREATE TABLE hypertbl_uuid (
  id uuid NOT NULL,
  value bigint NOT NULL,
  time timestamptz NOT NULL CHECK(timestamptz_from_uuid7(id) = time)
);
SELECT create_hypertable('hypertbl_uuid', by_range('time', INTERVAL '1 month'));

-- Hypertable is now filled the next two years
-- Note: id is generated like gen_random_uuid7(time column) - NOT INSERTION TIME!
INSERT INTO hypertbl_uuid (id, value, time) VALUES ...

-- Currently uuid chunk skipping is not possible. Let's pretend it would work.
CREATE INDEX pk_lookup ON hypertbl_uuid (id);
SELECT enable_chunk_skipping('hypertbl_uuid', 'id');

-- UUIDs are long values and hard to scan. Let's pretend for now that there is a simple notation like
-- uuid7-0, uuid7-1, ..., uuid7-999999 that appends the 128-bit number of the UUID to the `uuid7-` prefix.
-- After all, in this way chunk skipping on uuid would work -> use the uuid as a 128-bit number

-- So a framework can now search for id=uuid7-590212 and Timescale is able to detect which chunk matches the data
-- and only use that chunk:
-- chunk_1 (min=uuid7-000000, max=uuid7-123743): nope
-- chunk_2 (min=uuid7-328392, max=uuid7-507362): nope
-- chunk_3 (min=uuid7-537281, max=uuid7-706217): yes - Index Scan on pk_lookup :)
SELECT * FROM hypertbl_uuid WHERE id = 'uuid7-590212';

-- For now, everything is the same as in the serial example.
-- But late ingestion is now different: we're not generating the id based on the insertion time
-- but on the time of that row. To make it more fun: the generated id will be the new max value
-- for chunk_1 and not fall within the existing range - as it would most of the time.
INSERT INTO hypertbl_uuid (id, value, time) VALUES (gen_random_uuid7(NOW() - INTERVAL '1 year'), 33, NOW() - INTERVAL '1 year'); -- id=uuid7-284732

-- chunk_1 (min=uuid7-000000, max=uuid7-284732): nope
-- chunk_2 (min=uuid7-328392, max=uuid7-507362): nope
-- chunk_3 (min=uuid7-537281, max=uuid7-706217): yes - Index Scan on pk_lookup :)
SELECT * FROM hypertbl_uuid WHERE id = 'uuid7-590212';
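For reference, here is a minimal sketch (illustrative Python, not TimescaleDB code) of what the assumed `gen_random_uuid7(time)` and `timestamptz_from_uuid7` helpers could look like, following the RFC 9562 UUIDv7 layout: a 48-bit millisecond timestamp, then version and variant bits, then random bits:

```python
import os
import uuid
from datetime import datetime, timezone

def gen_random_uuid7(ts: datetime) -> uuid.UUID:
    """Hypothetical gen_random_uuid7(time): 48-bit ms timestamp + random tail,
    with RFC 9562 version (7) and variant (10) bits set."""
    ms = int(ts.timestamp() * 1000)
    rand = int.from_bytes(os.urandom(10), "big")   # 80 random bits
    value = (ms << 80) | rand
    value = (value & ~(0xF << 76)) | (0x7 << 76)   # version nibble = 7
    value = (value & ~(0x3 << 62)) | (0x2 << 62)   # variant bits = 10
    return uuid.UUID(int=value)

def timestamptz_from_uuid7(u: uuid.UUID) -> datetime:
    """Hypothetical timestamptz_from_uuid7: recover the embedded timestamp."""
    ms = u.int >> 80
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)

t = datetime(2024, 3, 1, 12, 0, tzinfo=timezone.utc)
u = gen_random_uuid7(t)
assert u.version == 7
assert timestamptz_from_uuid7(u) == t   # round-trips at ms precision

# Because the timestamp occupies the most significant bits, later timestamps
# always sort after earlier ones, regardless of the random tail:
u2 = gen_random_uuid7(datetime(2024, 3, 1, 12, 0, 1, tzinfo=timezone.utc))
assert u2.int > u.int
```

The key property for chunk skipping is the last assertion: values generated from the row's `time` column order exactly like the `time` column itself, so per-chunk min/max ranges stay disjoint.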

Implementation challenges

Implementation should be really easy: chunk skipping currently works on integers and timestamps - both fixed-size values. A uuid7 value is also a fixed-size 128-bit value and can be represented as an integer. So chunk skipping would work exactly as with 64-bit bigint values, just with 128-bit uuid values. The needed changes should be minimal.
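As a sketch of that claim (an assumption about what the planner's check could look like, not actual TimescaleDB code): once a uuid is read as a 128-bit integer, the min/max range test is identical to the bigint case:

```python
import uuid

def uuid_in_range(u: uuid.UUID, chunk_min: uuid.UUID, chunk_max: uuid.UUID) -> bool:
    """Chunk-skipping predicate: keep the chunk only if min <= u <= max,
    comparing the uuids as plain 128-bit integers."""
    return chunk_min.int <= u.int <= chunk_max.int

# Hypothetical per-chunk min/max uuid7 values (illustrative literals).
lo = uuid.UUID("018df000-0000-7000-8000-000000000000")
hi = uuid.UUID("018dffff-ffff-7fff-bfff-ffffffffffff")

inside  = uuid.UUID("018df123-4567-7abc-8def-000000000000")
outside = uuid.UUID("018e0000-0000-7000-8000-000000000000")

print(uuid_in_range(inside, lo, hi))    # True  -> scan this chunk
print(uuid_in_range(outside, lo, hi))   # False -> chunk excluded
```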

A possible counterargument would be that uuid is mostly used with random uuidv4 values, for which chunk skipping wouldn't make any sense. That's true. But you can also enable chunk skipping on int columns filled with RANDOM() or any other value whose semantics don't match the chunk-skipping idea.

@tpetry tpetry added the enhancement An enhancement to an existing feature for functionality label Feb 19, 2025
@jflambert

jflambert commented Feb 19, 2025

I +1 this enhancement request. In fact, I'm redesigning some hypertables to use bigints instead of uuids just to work around this limitation.

But I definitely understand the counterargument. @tpetry don't forget uuid isn't exactly recommended for compression. I'm not 100% sure uuid7 fully resolves the locality issue (maybe?)


@tpetry

tpetry commented Feb 19, 2025

UUID7 values are time-sorted with the first 48 bits used for the time in millisecond precision.

So with 14-day hypertable partitioning, only bits 18 to 48 of the timestamp and all later bits can change within a chunk. That's already an 18/128 ≈ 0.14 compression factor. Not bad for mostly random data.

But in reality similar timestamps will be very local. So those 48 bits will mostly see tiny changes for very frequent inserts. With run-length encoding those bits can be reduced heavily. The upper compression limit is 48/128 = 0.375, which I believe is awesome considering that uuid7 values are mostly random.

So just use run-length encoding (as used with integers). It can lead to relatively good compression for uuid7 and simply no gain for uuid4, which is not compressible at all.
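A quick way to see the locality argument (illustrative Python, assuming the 48-bit-timestamp-plus-random-tail layout described above):

```python
import os

# Fake uuid7-style values: 48-bit ms timestamp in the top bits, 80 random bits
# below (version/variant bits omitted for simplicity in this illustration).
def fake_uuid7(ms: int) -> int:
    return (ms << 80) | int.from_bytes(os.urandom(10), "big")

base = 1_700_000_000_000                           # some epoch-ms timestamp
ids = [fake_uuid7(base + i) for i in range(1000)]  # one insert per millisecond

# Timestamp prefixes of adjacent values differ by tiny deltas (here exactly 1 ms),
# which is what makes delta/run-length style encodings of that prefix effective...
deltas = [(b >> 80) - (a >> 80) for a, b in zip(ids, ids[1:])]
assert all(d == 1 for d in deltas)

# ...while the whole batch shares (nearly) the same high-order 38 bits, so the
# top of every value is almost pure repetition. The random 80-bit tail, by
# contrast, stays incompressible - matching the 48/128 upper bound above.
assert len({v >> 90 for v in ids}) <= 2
```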

@jflambert

jflambert commented Feb 19, 2025

I'm all in. However, my understanding is that pg18 will introduce a new function for uuid7. How are you generating them on pg17 and below?

@tpetry

tpetry commented Feb 19, 2025

You can generate them within the application. Or use one of the many SQL implementations if you can't install an extension: https://gist.github.com/fabiolimace/515a0440e3e40efeb234e12644a6a346

However, providing a uuid7 generation function and timestamp_from_uuid7 within the timescale namespace would be absolutely awesome.
