
[Bug]: Out of Memory on DELETE Query in Hypertable #7776

Open · kevin-uslu opened this issue Feb 26, 2025 · 4 comments

kevin-uslu commented Feb 26, 2025

What type of bug is this?

Crash, Performance issue

What subsystems and features are affected?

Query executor

What happened?

Query:

DELETE FROM nmon_cpu_all WHERE time >= '2024-12-14 04:08:02' AND time <= '2024-12-14 04:22:14' AND id_monitoring = 56983;

RAM: 24 GB

Number of chunks: 9337
Hypertable chunk interval: 15 minutes
Continuous aggregate: No

Table creation:

CREATE TABLE IF NOT EXISTS public.nmon_cpu_all
(
    "time" timestamp without time zone NOT NULL,
    id_monitoring bigint NOT NULL,
    user_cpu double precision NOT NULL,
    sys_cpu double precision NOT NULL,
    wait_cpu double precision NOT NULL,
    consumed_cpu double precision NOT NULL
);

SELECT create_hypertable(
	'nmon_cpu_all',
	'time',
	if_not_exists => TRUE,
	chunk_time_interval=>900000000
);
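
The 900000000 value is in microseconds (the partitioning column is a timestamp), i.e. 15 minutes. As a sketch, the same call could be written with an INTERVAL literal:

SELECT create_hypertable(
	'nmon_cpu_all',
	'time',
	if_not_exists => TRUE,
	chunk_time_interval => INTERVAL '15 minutes'  -- equal to 900000000 microseconds
);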

Issue Description:
The execution time of the DELETE queries above has been increasing as the dataset grows, and now any single such DELETE consistently fails with an out-of-memory (OOM) error. When the query is executed:

The TimescaleDB container rapidly consumes all available RAM and gets killed by the host VM's OOM killer.
Docker stats confirm rapid and total host memory consumption.
Running EXPLAIN on the query also causes OOM, making debugging difficult.

Notable Observations:

The column id_monitoring is not a foreign key and has no index.
Removing the id_monitoring condition does not resolve the issue; the query still leads to OOM.
Upgrading to TimescaleDB 2.6.1 and 2.15.3 did NOT resolve the issue.

Investigations & Attempts to Resolve:
I have already tried the solutions suggested in related issues, without success.

At one point, I dropped 9,000 chunks using drop_chunks() to make it possible to run EXPLAIN. The plan showed that only 2 chunks were being considered for the DELETE operation, as expected.
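
For reference, the sequence was roughly as follows (the drop_chunks cutoff shown here is illustrative, not the exact value used):

-- Drop most historical chunks so that EXPLAIN can complete (illustrative cutoff).
SELECT drop_chunks('nmon_cpu_all', older_than => TIMESTAMP '2024-12-01 00:00:00');

-- With only the 2 relevant chunks remaining, the plan matched expectations.
EXPLAIN
DELETE FROM nmon_cpu_all
WHERE time >= '2024-12-14 04:08:02'
  AND time <= '2024-12-14 04:22:14'
  AND id_monitoring = 56983;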

Dataset Characteristics:
The table stores monitoring data for load tests, which typically last 15 minutes per test; hence the hypertable's 15-minute chunk interval.
Data is pushed and queried only for test periods, meaning the dataset is discontinuous.

Largest Chunks:
The biggest chunks in nmon_cpu_all are around 1.2 MB in size:

SELECT * FROM chunks_detailed_size('nmon_cpu_all') order by total_bytes DESC limit 20;

chunk_schema          |      chunk_name       | table_bytes | index_bytes | toast_bytes | total_bytes | node_name
----------------------+-----------------------+-------------+-------------+-------------+-------------+-----------
_timescaledb_internal | _hyper_51_3980_chunk | 1097728 | 147456 | 0 | 1245184 |
_timescaledb_internal | _hyper_51_3969_chunk | 1073152 | 163840 | 0 | 1236992 |
_timescaledb_internal | _hyper_51_3986_chunk | 794624 | 122880 | 0 | 917504 |
_timescaledb_internal | _hyper_51_26857_chunk | 638976 | 106496 | 0 | 745472 |
_timescaledb_internal | _hyper_51_26881_chunk | 630784 | 114688 | 0 | 745472 |
_timescaledb_internal | _hyper_51_3993_chunk | 647168 | 90112 | 0 | 737280 |
_timescaledb_internal | _hyper_51_26851_chunk | 524288 | 106496 | 0 | 630784 |
_timescaledb_internal | _hyper_51_41492_chunk | 532480 | 90112 | 0 | 622592 |
_timescaledb_internal | _hyper_51_30909_chunk | 483328 | 90112 | 0 | 573440 |
_timescaledb_internal | _hyper_51_6719_chunk | 483328 | 90112 | 0 | 573440 |
_timescaledb_internal | _hyper_51_11819_chunk | 483328 | 90112 | 0 | 573440 |
_timescaledb_internal | _hyper_51_9636_chunk | 483328 | 90112 | 0 | 573440 |
_timescaledb_internal | _hyper_51_10863_chunk | 483328 | 90112 | 0 | 573440 |
_timescaledb_internal | _hyper_51_30921_chunk | 483328 | 90112 | 0 | 573440 |
_timescaledb_internal | _hyper_51_10515_chunk | 483328 | 90112 | 0 | 573440 |
_timescaledb_internal | _hyper_51_26698_chunk | 483328 | 90112 | 0 | 573440 |
_timescaledb_internal | _hyper_51_6641_chunk | 483328 | 90112 | 0 | 573440 |
_timescaledb_internal | _hyper_51_10527_chunk | 483328 | 90112 | 0 | 573440 |
_timescaledb_internal | _hyper_51_10539_chunk | 483328 | 90112 | 0 | 573440 |
_timescaledb_internal | _hyper_51_28178_chunk | 483328 | 90112 | 0 | 573440 |

Potential Next Steps & Open Questions:
One commenter in #4248 suggested upgrading to PostgreSQL 14. I haven't tried this yet, but I’d prefer to avoid a blind upgrade unless necessary.
Would adding an index on id_monitoring help (see the sketch after this list), even though removing it from the query didn't resolve the OOM?
Could the number of chunks (9,337) be a factor in DELETE performance?
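
If the index route is explored, this is a sketch of the hypothetical index (name and column order are assumptions, untested on this dataset):

-- Hypothetical composite index matching the DELETE predicate; not yet created.
-- Note: on a hypertable, this also creates a corresponding index on every chunk.
CREATE INDEX IF NOT EXISTS ix_nmon_cpu_all_idmon_time
    ON public.nmon_cpu_all (id_monitoring, "time");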

postgresql.conf.zip

TimescaleDB version affected

2.2.0, 2.6.1, 2.15.3

PostgreSQL version used

PostgreSQL 13.2 on x86_64-pc-linux-musl, compiled by gcc (Alpine 10.2.1_pre1) 10.2.1 20201203, 64-bit

What operating system did you use?

RedHat 8.5.0

What installation method did you use?

Docker

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

2025-02-26 14:40:49.516 UTC [1] LOG:  server process (PID 3091) was terminated by signal 9: Killed
2025-02-26 14:40:49.516 UTC [1] DETAIL:  Failed process was running: DELETE FROM nmon_cpu_all WHERE time >= '2024-12-14 04:08:02'  AND time <= '2024-12-14 04:22:14';
2025-02-26 14:40:49.516 UTC [1] LOG:  terminating any other active server processes
2025-02-26 14:40:49.520 UTC [24] WARNING:  terminating connection because of crash of another server process
2025-02-26 14:40:49.520 UTC [24] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2025-02-26 14:40:49.520 UTC [24] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2025-02-26 14:40:49.523 UTC [81] WARNING:  terminating connection because of crash of another server process
2025-02-26 14:40:49.523 UTC [81] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2025-02-26 14:40:49.523 UTC [81] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2025-02-26 14:40:49.525 UTC [1] LOG:  all server processes terminated; reinitializing
2025-02-26 14:40:49.668 UTC [3101] FATAL:  the database system is in recovery mode
2025-02-26 14:40:49.669 UTC [3100] LOG:  database system was interrupted; last known up at 2025-02-26 13:44:45 UTC
2025-02-26 14:41:00.302 UTC [3100] LOG:  database system was not properly shut down; automatic recovery in progress
2025-02-26 14:41:00.369 UTC [3100] LOG:  redo starts at 35B/6BAD0CE0
2025-02-26 14:41:00.384 UTC [3100] LOG:  invalid record length at 35B/6BCEEB08: wanted 24, got 0
2025-02-26 14:41:00.384 UTC [3100] LOG:  redo done at 35B/6BCEEAD0
2025-02-26 14:41:00.544 UTC [1] LOG:  database system is ready to accept connections
2025-02-26 14:41:00.549 UTC [3107] LOG:  TimescaleDB background worker launcher connected to shared catalogs


[Wed Feb 26 15:40:48 2025] postgres invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
[Wed Feb 26 15:40:48 2025]  oom_kill_process.cold.33+0xb/0x10
[Wed Feb 26 15:40:48 2025] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[Wed Feb 26 15:40:48 2025] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=9aee39def26fcef2ae9a7c77a1ace5455f286a1765ec85672a3a46f8f1bb58a0,mems_allowed=0,oom_memcg=/docker/9aee39def26fcef2ae9a7c77a1ace5455f286a1765ec85672a3a46f8f1bb58a0,task_memcg=/docker/9aee39def26fcef2ae9a7c77a1ace5455f286a1765ec85672a3a46f8f1bb58a0,task=postgres,pid=16724,uid=70
[Wed Feb 26 15:40:48 2025] Memory cgroup out of memory: Killed process 16724 (postgres) total-vm:21916756kB, anon-rss:12096028kB, file-rss:1504kB, shmem-rss:311984kB, UID:70 pgtables:31508kB oom_score_adj:0

How can we reproduce the bug?

Running a single DELETE query on a large hypertable.
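
A rough, untested sketch of how a comparable hypertable could be generated with synthetic data (table name, row density and date range are illustrative assumptions):

-- Synthetic reproduction sketch; all values are illustrative.
CREATE TABLE repro_cpu (LIKE public.nmon_cpu_all INCLUDING ALL);
SELECT create_hypertable('repro_cpu', 'time', chunk_time_interval => INTERVAL '15 minutes');

-- One row per minute over a year yields roughly 35,000 15-minute chunks.
INSERT INTO repro_cpu
SELECT t, (random() * 1000)::bigint, random(), random(), random(), random()
FROM generate_series(TIMESTAMP '2024-01-01', TIMESTAMP '2024-12-31', INTERVAL '1 minute') AS t;

DELETE FROM repro_cpu
WHERE time >= '2024-06-01 04:08:02' AND time <= '2024-06-01 04:22:14';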
kevin-uslu added the bug label on Feb 26, 2025
svenklemm (Member) commented:

Does the hypertable have compressed chunks?

kevin-uslu (Author) commented:

> Does the hypertable have compressed chunks?

No. Compression is not enabled.

SELECT hypertable_name, num_dimensions, num_chunks, compression_enabled, is_distributed FROM timescaledb_information.hypertables where hypertable_name LIKE 'nmon_%';

hypertable_name | num_dimensions | num_chunks | compression_enabled | is_distributed
-----------------+----------------+------------+---------------------+----------------
nmon_cpu_all | 1 | 9337 | f | f
nmon_disk | 1 | 9337 | f | f
nmon_memory | 1 | 9337 | f | f
nmon_network | 1 | 9335 | f | f
nmon_vcore | 1 | 9337 | f | f

And no compressed chunks.

SELECT hypertable_name, chunk_name, is_compressed, primary_dimension, primary_dimension_type, range_start, range_end
FROM timescaledb_information.chunks
WHERE hypertable_name = 'nmon_cpu_all' and is_compressed = true
limit 10;

hypertable_name | chunk_name | is_compressed | primary_dimension | primary_dimension_type | range_start | range_end
-----------------+------------+---------------+-------------------+------------------------+-------------+-----------
(0 rows)

kevin-uslu (Author) commented:

Would enabling compression on chunks have a positive impact on DELETE performance?
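
For context, a sketch of what enabling compression could look like on this table (the segmentby column and policy interval are assumptions; compression is currently not applied):

-- Sketch only: compression is NOT currently enabled on this hypertable.
ALTER TABLE nmon_cpu_all SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'id_monitoring'
);
SELECT add_compression_policy('nmon_cpu_all', compress_after => INTERVAL '7 days');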

kevin-uslu (Author) commented:

Hello @akuzm, since you have worked on similar issues before, would you have any pointers?
