Performance drop up to x20 with compression on #2829

Open
sasha-id opened this issue Jan 14, 2021 · 12 comments

Comments
@sasha-id

Relevant system information:

  • OS: Ubuntu 18.04
  • PostgreSQL version (output of postgres --version): 12.4
  • TimescaleDB version (output of \dx in psql): 2.0.0
  • Installation method: apt install

Describe the bug
Direct hypertable query: performance drops by roughly 10x when compression is enabled. Query on continuous aggregates: performance drops by roughly 20x.

To Reproduce
Schema:

    create table trades
    (
        time        timestamp not null,
        stock_id    bigint    not null,
        exchange_id integer,
        trade_id    bigint,
        tape        smallint,
        price       numeric(10, 4),
        size        integer
    );
    SELECT create_hypertable('trades', 'time', chunk_time_interval => INTERVAL '1 day');

    -- Disabled for now, until this issue is fixed
    -- ALTER TABLE trades SET (
    --   timescaledb.compress,
    --   timescaledb.compress_orderby = 'time DESC',
    --   timescaledb.compress_segmentby = 'stock_id'
    -- );
    -- SELECT add_compression_policy('trades', INTERVAL '7 days');

    -- Continuous aggregate
    CREATE MATERIALIZED VIEW candles_1m
    WITH (timescaledb.continuous) AS
    SELECT
        time_bucket('1 minute', time) AS time,
        last(time, time) AS last_trade,
        stock_id,
        count(*) trades,
        first(price, time) open,
        MAX(price) high,
        MIN(price) low,
        last(price, time) AS close,
        SUM(size) volume
    FROM trades
    GROUP BY time_bucket('1 minute', time), stock_id;

    SELECT add_continuous_aggregate_policy('candles_1m',
        start_offset => INTERVAL '1 day',
        end_offset => INTERVAL '2 minutes',
        schedule_interval => INTERVAL '1 minute'
    );

Sample data: https://www.dropbox.com/s/6l4ldo9vg1e5e39/data.csv.bz2?dl=0
psql -d new_db -c "\COPY trades FROM data.csv CSV"

Query:

    explain analyze SELECT
        time_bucket('1 minute', time) AS time,
        count(*) trades,
        first(price, time) open,
        MAX(price) high,
        MIN(price) low,
        last(price, time) AS close,
        SUM(size) volume
    FROM trades
    WHERE time >= '2020-12-31 11:38' AND time < '2020-12-31 11:39'
    GROUP BY time_bucket('1 minute', time), stock_id;

Query plan without compression:

 GroupAggregate  (cost=26456.41..27103.55 rows=2669 width=160) (actual time=140.384..274.565 rows=1942 loops=1)
   Group Key: (time_bucket('00:01:00'::interval, _hyper_1_1_chunk."time")), _hyper_1_1_chunk.stock_id
   ->  Sort  (cost=26456.41..26523.12 rows=26686 width=34) (actual time=140.249..205.019 rows=26301 loops=1)
         Sort Key: (time_bucket('00:01:00'::interval, _hyper_1_1_chunk."time")), _hyper_1_1_chunk.stock_id
         Sort Method: quicksort  Memory: 2823kB
         ->  Index Scan using _hyper_1_1_chunk_trades_time_idx on _hyper_1_1_chunk  (cost=0.43..24494.48 rows=26686 width=34) (actual time=0.081..71.490 rows=26301 loops=1)
               Index Cond: (("time" >= '2020-12-31 11:38:00'::timestamp without time zone) AND ("time" < '2020-12-31 11:39:00'::timestamp without time zone))
 Planning Time: 0.571 ms
 Execution Time: 279.394 ms

Query plan with compression:

 Finalize GroupAggregate  (cost=499803.54..578968.95 rows=2669 width=160) (actual time=398.114..493.005 rows=1942 loops=1)
   Group Key: (time_bucket('00:01:00'::interval, _hyper_1_1_chunk."time")), _hyper_1_1_chunk.stock_id
   ->  Gather Merge  (cost=499803.54..578855.51 rows=2669 width=160) (actual time=397.819..482.142 rows=1943 loops=1)
         Workers Planned: 1
         Workers Launched: 1
         ->  Partial GroupAggregate  (cost=498803.53..577555.24 rows=2669 width=160) (actual time=385.002..452.221 rows=972 loops=2)
               Group Key: (time_bucket('00:01:00'::interval, _hyper_1_1_chunk."time")), _hyper_1_1_chunk.stock_id
               ->  Sort  (cost=498803.53..507548.53 rows=3498000 width=34) (actual time=384.824..417.302 rows=13150 loops=2)
                     Sort Key: (time_bucket('00:01:00'::interval, _hyper_1_1_chunk."time")), _hyper_1_1_chunk.stock_id
                     Sort Method: quicksort  Memory: 1304kB
                     Worker 0:  Sort Method: quicksort  Memory: 1520kB
                     ->  Custom Scan (DecompressChunk) on _hyper_1_1_chunk  (cost=0.16..9288.19 rows=3498000 width=34) (actual time=215.162..351.211 rows=13150 loops=2)
                           Filter: (("time" >= '2020-12-31 11:38:00'::timestamp without time zone) AND ("time" < '2020-12-31 11:39:00'::timestamp without time zone))
                           Rows Removed by Filter: 1300744
                           ->  Parallel Seq Scan on compress_hyper_5_3_chunk  (cost=0.00..543.19 rows=3498 width=124) (actual time=214.981..220.760 rows=1958 loops=2)
                                 Filter: ((_ts_meta_max_1 >= '2020-12-31 11:38:00'::timestamp without time zone) AND (_ts_meta_min_1 < '2020-12-31 11:39:00'::timestamp without time zone))
                                 Rows Removed by Filter: 5532
 Planning Time: 3.477 ms
 JIT:
   Functions: 27
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 5.141 ms, Inlining 62.861 ms, Optimization 206.136 ms, Emission 160.349 ms, Total 434.488 ms
 Execution Time: 523.742 ms
(23 rows)
@phemmer

phemmer commented Jan 15, 2021

Not sure they're related, but there are a few open issues on the query planner switching to sequential scans instead of index scans when handling compressed chunks: #2722 #2744
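
If it helps narrow this down, a quick session-level experiment (diagnostic only, not a fix) is to check whether the plan changes when the planner is discouraged from sequential scans; enable_seqscan is a stock PostgreSQL setting:

    -- Diagnostic only: discourage seq scans for this session,
    -- re-run the report's EXPLAIN ANALYZE, and compare the chosen plan.
    SET enable_seqscan = off;
    EXPLAIN ANALYZE
    SELECT time_bucket('1 minute', time) AS time, count(*) AS trades
    FROM trades
    WHERE time >= '2020-12-31 11:38' AND time < '2020-12-31 11:39'
    GROUP BY time_bucket('1 minute', time), stock_id;
    RESET enable_seqscan;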

@sasha-id
Author

> Not sure they're related, but there are a few open issues on the query planner switching to sequential scans instead of index scans when handling compressed chunks: #2722 #2744

Yes, it must be the same issue. I provided sample data to help replicate it.

@svenklemm
Member

Can you add the command you used to set up compression?

@akuzm
Member

akuzm commented Oct 27, 2021

Unfortunately, we cannot reproduce this issue without knowing the exact command that was used to set up compression.

@NunoFilipeSantos
Contributor

@alex-klepa can you provide a way to reproduce it on 2.5.0? Otherwise, we need to close this issue.

@gayyappan
Contributor

gayyappan commented Nov 19, 2021

@NunoFilipeSantos The script has the compression parameters. @alex-klepa, thank you for the report and the data to reproduce the problem. We are investigating the performance issues.

@jjuraszek

I've hit the same wall. The workaround that works for me is to turn off JIT.
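
For reference, a minimal sketch of that workaround; jit is a standard PostgreSQL setting and can be changed per session or instance-wide:

    -- Per session, handy for checking the effect on a single query:
    SET jit = off;

    -- Instance-wide (requires superuser), then reload the configuration:
    ALTER SYSTEM SET jit = off;
    SELECT pg_reload_conf();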

@genedavis

genedavis commented Mar 5, 2023

Hi Team - is there any update on this ticket? I just encountered it in what is presumably a similar scenario:

  1. Have a large hypertable for looking up IoT-based data (tag, time, value, etc.)
  2. Create a covering index starting with tag, since that is a common initial lookup filter (see the sketch after this list)
  3. Run a query collecting data for a window of the past few hours for groups of tags
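
A minimal sketch of the kind of covering index described in step 2; the table and column names (readings, tag, time, value) are placeholders, not the actual schema:

    -- Leading column is the common lookup filter (tag), time keeps range
    -- scans ordered, and value is INCLUDEd so the typical query shape can
    -- be answered with an index-only scan.
    CREATE INDEX readings_tag_time_idx
        ON readings (tag, time DESC) INCLUDE (value);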

This was all working very well until, I believe, I added a compression rule for data older than 30 days. Now I'm getting a scan of the full hypertable chunk based on time instead of an index scan. Queries for some groups of tags have gone from 10 ms up to 800 ms.

Thanks - Gene

Edit: my workaround was to decompress all chunks and set compress to false on this table, which worked. However, I would have hoped (I'm sure it's not easy) that the query planner could have seen that the query was bounded well outside the compression time window, so I wouldn't have had to do this.
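
For anyone needing the same escape hatch, a sketch of that workaround using standard TimescaleDB calls (my_hypertable is a placeholder):

    -- Stop the policy from re-compressing chunks.
    SELECT remove_compression_policy('my_hypertable');
    -- Decompress every compressed chunk (true = skip chunks that are already uncompressed).
    SELECT decompress_chunk(c, true) FROM show_chunks('my_hypertable') c;
    -- Finally, turn compression off for the hypertable.
    ALTER TABLE my_hypertable SET (timescaledb.compress = false);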

@radresian

Hello Team,

I am planning to use Timescale Cloud at my crypto exchange company to keep historical asset prices. I did a small investigation into performance and compression and came across this issue. Everything Timescale offers looks excellent, but I am frustrated to see that this issue is still open.

Does that mean Timescale cannot offer compression and performance at the same time? Won't you do anything about this issue, or suggest anything to your users?

@phemmer

phemmer commented Mar 16, 2023

> Does that mean Timescale cannot offer compression and performance at the same time?

Writing to a compressed chunk is the exception, not the norm. Typically you should be writing to uncompressed chunks, and the chunks get compressed once they've rotated out and are not being written to. Inserting into compressed chunks should generally be reserved for situations where a data source was offline, and dumps its data once coming back online after the chunk has been compressed.
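
For context, the usual pattern (and essentially what the commented-out part of the original script sets up) is to compress only chunks older than some cutoff, so fresh data always lands in uncompressed chunks:

    ALTER TABLE trades SET (
        timescaledb.compress,
        timescaledb.compress_orderby   = 'time DESC',
        timescaledb.compress_segmentby = 'stock_id'
    );
    -- Only chunks whose data is older than 7 days get compressed;
    -- recent chunks stay uncompressed and take current writes and reads.
    SELECT add_compression_policy('trades', INTERVAL '7 days');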

@radresian

@phemmer, after your comment I investigated the sample data in the example and realized that the query is between '2020-12-31 11:38' and '2020-12-31 11:39', which is old data in a compressed chunk. I think querying the uncompressed chunk that holds new data would give much better performance.
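
A hypothetical way to check that, pointing the same aggregation at a window that should still sit in an uncompressed chunk (the dates below are placeholders chosen to fall after the compression cutoff):

    EXPLAIN ANALYZE
    SELECT time_bucket('1 minute', time) AS time, count(*) AS trades, SUM(size) AS volume
    FROM trades
    WHERE time >= '2021-01-13 11:38' AND time < '2021-01-13 11:39'
    GROUP BY time_bucket('1 minute', time), stock_id;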

@phemmer

phemmer commented Mar 17, 2023

Sorry, I completely misunderstood your comment. For some reason I thought you were talking about inserts. But now that I read it again, I don't know where I got that from. So ignore what I said, it's not applicable :-)
