[Bug]: inconsistent bad query plan #6900
Comments
The difference seems to be Nested Loop Join vs Hash Join.
Yes. To eliminate variables, I wanted the data type to be consistent between the relative and absolute queries. In addition, the underlying data type on the table is also timestamp without time zone.
Why would this be the case? If both the query and the table are using the same data type, why would there be casting involved?
It did not. https://explain.dalibo.com/plan/949a6301d7ch487e#raw
That appears to have done it (although now it's deciding to play nice for a while even without that). And it appears that Grafana (the main source of the queries) does allow transactions via explicit …
Use of timestamp without time zone is generally discouraged in Postgres. See https://wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use_timestamp_.28without_time_zone.29_to_store_UTC_times
now() returns timestamp with time zone, so any constraint comparing now() against a timestamp without time zone column will involve data type conversions.
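For illustration, here is a minimal sketch of that conversion (the metrics table and its "time" column are hypothetical stand-ins, not from this issue):

```sql
-- now() yields timestamp with time zone:
SELECT pg_typeof(now());  -- "timestamp with time zone"

-- If the hypothetical metrics."time" column is timestamp WITHOUT time zone,
-- this filter compares timestamp against timestamptz. The column side is
-- implicitly converted using the session TimeZone, and that conversion is
-- only STABLE (not IMMUTABLE), which can keep chunk exclusion from
-- happening at plan time.
EXPLAIN
SELECT *
FROM metrics
WHERE "time" > now() - interval '5 minutes';
```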
Did you change the table definition as well? With the data type being timestamp with time zone, I would expect chunk exclusion to move from execution time to planning time, which should be a big improvement as well. The line …
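A rough way to see where exclusion happens, sketched against a hypothetical hypertable named metrics (not the reporter's schema): with plan-time exclusion, a plain EXPLAIN already lists only the matching chunks, while runtime exclusion shows up in the ANALYZE output instead.

```sql
-- Plan-time exclusion: with a constant filter, EXPLAIN (no ANALYZE) should
-- already show only the chunks that overlap the range.
EXPLAIN
SELECT * FROM metrics
WHERE "time" >= '2024-04-10 12:00:00+00' AND "time" < '2024-04-10 12:05:00+00';

-- Runtime (startup) exclusion: with now() in the filter, EXPLAIN ANALYZE
-- typically lists all chunks and reports a line like
-- "Chunks excluded during startup: N".
EXPLAIN (ANALYZE)
SELECT * FROM metrics
WHERE "time" > now() - interval '5 minutes';
```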
That's why I had …
No, that's a massive undertaking, as it involves changing a ton of data. However, for the purposes of this issue, I've created a new table using …
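A minimal sketch of that kind of side-by-side test (the table name, column name, conversion time zone, and chunk interval are all assumptions for illustration):

```sql
-- Hypothetical copy of the table, switched to timestamptz for comparison.
CREATE TABLE telegraf.haproxy_server_tz (LIKE telegraf.haproxy_server INCLUDING ALL);
ALTER TABLE telegraf.haproxy_server_tz
    ALTER COLUMN "time" TYPE timestamptz USING "time" AT TIME ZONE 'UTC';

SELECT create_hypertable('telegraf.haproxy_server_tz', 'time',
                         chunk_time_interval => interval '1 day');

-- Keep the implicit timestamp -> timestamptz cast consistent with the ALTER above.
SET timezone = 'UTC';

-- Backfill a recent slice so the two tables can be compared on the same query.
INSERT INTO telegraf.haproxy_server_tz
SELECT * FROM telegraf.haproxy_server
WHERE "time" > now() - interval '7 days';
```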
Hi, I was just wondering if this issue had a formal resolution or if you need more assistance from us? Thanks.
I'm not sure I follow. The issue is in the query planner. Or are you saying you don't intend to fix it?
I was mostly wondering if you backfilled the data and if it fixed the issue for you.
As long as all the time types involved in the query are …
@phemmer given that https://wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use_timestamp_.28without_time_zone.29_to_store_UTC_times explicitly mentions that using …
The only reason given in that link is "This complicates many otherwise useful time calculations", implying that the only issue is that it makes things harder on the user. There is nothing about "this causes sub-optimal query planning and major deficiencies in performance".
@phemmer we will add this to the list of issues to take a look at from our end.
What type of bug is this?
Performance issue
What subsystems and features are affected?
Query planner
What happened?
I've been experiencing a lot of queries lately that inconsistently perform extremely poorly, as in they vary between sub-second and 5+ minutes. I don't know whether they're all the same problem, but I dug into one of them, which I'll use for this report.
I've tested a few slight variations of the query and have managed to produce seemingly insightful results. The basis of this particular query's issue seems to revolve around the time filter. If I use an absolute time range, the query always performs fine. If I use a large relative time range (large meaning 1h, relative meaning using now()), the query always performs fine. If I use a small relative time range (small meaning 5m), the query is inconsistent (sometimes <1s, sometimes >5m).

Now, I did run these on a live database with data constantly changing; however, after dozens of tries, the above observations have always held. The short+relative query can jump between good & bad from one second to the next. Sometimes it'll run well for only a few seconds before going bad again, and sometimes it'll run well for many minutes.
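For reference, the two filter styles in question look roughly like this (the column name and aggregate are placeholders; the real queries are in the plans listed below):

```sql
-- Absolute time range: both bounds are constants known at plan time.
SELECT count(*)
FROM haproxy_server
WHERE "time" >= '2024-04-10 12:00:00' AND "time" < '2024-04-10 12:05:00';

-- Relative time range: now() is STABLE, so the lower bound is only resolved
-- at execution time.
SELECT count(*)
FROM haproxy_server
WHERE "time" > now() - interval '5 minutes';
```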
So here are the queries (query + explain analyze output):
5m absolute time range
1h relative time range
5m relative time range - bad plan
5m relative time range - good plan
In the query examples, you can see that the good plans are all exactly the same, and the bad plan is substantially different. And even though the good plans involve parallel workers, that shouldn't explain the difference. We can see in the 5m+relative+good plan that 7 workers were used, so running it without parallelization should take <7x longer, but it doesn't; it is instead 313x longer.
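Given the Nested Loop Join vs Hash Join difference noted in the comments, one way to separate the join choice from the parallelism question is to steer the planner with standard PostgreSQL session settings and re-run the same statement. This is only a diagnostic sketch; the query below is a placeholder for the actual one.

```sql
-- In a session currently producing the bad plan, discourage nested loops:
SET enable_nestloop = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM haproxy_server
WHERE "time" > now() - interval '5 minutes';
RESET enable_nestloop;

-- Conversely, on a good run, disable parallelism to check how much of the
-- speedup comes from the workers rather than from the plan shape:
SET max_parallel_workers_per_gather = 0;
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM haproxy_server
WHERE "time" > now() - interval '5 minutes';
RESET max_parallel_workers_per_gather;
```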
TimescaleDB version affected
2.14.2
PostgreSQL version used
16.2
What operating system did you use?
Debian 10
What installation method did you use?
Deb/Apt
What platform did you run on?
On prem/Self-hosted
Relevant log output and stack trace
No response
How can we reproduce the bug?
Unfortunately I do not know how to reproduce it in a clean environment. I'll provide the basic setup info below, but please feel free to request any additional information and I'll provide it.
First, haproxy_server is a view defined as:

telegraf.haproxy_server_tag is:

telegraf.haproxy_server is: