
Potential Deadlock Between drop_chunks() and the compress_chunks BGW #2575

Closed
JLockerman opened this issue Oct 19, 2020 · 15 comments

@JLockerman (Contributor) commented Oct 19, 2020

Relevant system information:

  • OS: MacOS
  • PostgreSQL version: 12
  • TimescaleDB version : 1.7.4
  • Installation method: docker

When running the Promscale tests with the timescaledev/promscale-extension:latest-pg12 docker image, we occasionally get

2020-10-19 17:48:14.383 UTC [109] ERROR:  deadlock detected
2020-10-19 17:48:14.383 UTC [109] DETAIL:  Process 109 waits for ShareLock on transaction 520; blocked by process 110.
Process 110 waits for AccessExclusiveLock on relation 18254 of database 17453; blocked by process 109.
Process 109: CALL _prom_catalog.drop_metric_chunks($1, $2)
Process 110: <command string not enabled>
2020-10-19 17:48:14.383 UTC [109] HINT:  See server log for query details.
2020-10-19 17:48:14.383 UTC [109] CONTEXT:  while locking tuple (0,1) in relation "dimension_slice"
SQL statement "SELECT drop_chunks(table_name=>metric_table, schema_name=> 'prom_data', older_than=>older_than, cascade_to_materializations=>FALSE)"
PL/pgSQL function drop_metric_chunks(text,timestamp with time zone) line 132 at PERFORM
2020-10-19 17:48:14.383 UTC [109] STATEMENT:  CALL _prom_catalog.drop_metric_chunks($1, $2)
2020-10-19 17:48:14.400 UTC [110] LOG:  completed compressing chunk _timescaledb_internal._hyper_1_1_chunk
2020-10-19 17:48:14.404 UTC [110] LOG:  job 1000 completed compressing chunk

which appears to be a tuple-level deadlock on dimension_slice
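
For anyone trying to trigger this by hand, a minimal sketch of the two competing calls (the metric table name is a placeholder based on the log above, and running these concurrently in two sessions is not a guaranteed repro):

-- Session 1: mirrors the drop path shown in the log (1.7.x drop_chunks API);
-- 'cpu_usage' is a placeholder metric table in the prom_data schema.
SELECT drop_chunks(table_name => 'cpu_usage',
                   schema_name => 'prom_data',
                   older_than => now() - interval '1 hour',
                   cascade_to_materializations => FALSE);

-- Session 2: roughly what the compression background worker does for one chunk.
SELECT compress_chunk('_timescaledb_internal._hyper_1_1_chunk');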

@JLockerman JLockerman added the bug label Oct 19, 2020
@erimatnor (Contributor)

It would be good to get a reproducible example for this deadlock.

@erimatnor (Contributor)

Potential duplicate of #2509

@cevian (Contributor) commented Oct 27, 2020

Honestly, not sure this is a dupe of #2509, because the deadlock here is on a dimension_slice row, not a table-level lock.

@k-rus (Contributor) commented Oct 29, 2020

I'm posting a link to a PR merged into 1.7.3, which might be relevant to this issue: #2150

@BenjenJones commented Oct 29, 2020

I had a similar thing happening over and over again. I posted about it on Slack as well, but then found this issue:

I have set up TimescaleDB via the Helm charts:
https://github.com/timescale/timescaledb-kubernetes/tree/master/charts/timescaledb-single

Running Timescaledb 1.7.4 with pg12

It is running OK, but multiple times a day I see connections stacking up on the master,
coming from both replicas AND the master itself (postgres user).
The queries look like:

SELECT table_name AS "Name",table_bytes AS "table",index_bytes AS "index",toast_bytes AS "toast",total_bytes AS "total" FROM _timescaledb_catalog.hypertable, hypertable_relation_size(schema_name::text||'.'||table_name::text) ORDER BY total_bytes DESC;
SELECT sum(index_bytes) FROM _timescaledb_catalog.hypertable, hypertable_relation_size(schema_name::text||'.'||table_name::text);

These are growing until there are no more connections left.
Does anyone know why this is happening?
Seeing this in the master's log file:

2020-10-28 13:21:57 UTC [6921]: [5f997072.1b09-3] postgres@postgres,app=[unknown] [00000] LOG:  process 6921 still waiting for AccessShareLock on relation 575409 of database 13408 after 1000.082 ms
2020-10-28 13:21:57 UTC [6921]: [5f997072.1b09-4] postgres@postgres,app=[unknown] [00000] DETAIL:  Process holding the lock: 6563. Wait queue: 6761, 6764, 6765, 6766, 6767, 6768, 6769, 6771, 6772, 6774, 6778, 6779, 6777, 6780, 6782, 6785, 6784, 6790, 6788, 6789, 6795, 6794, 6797, 6796, 6798, 6847, 6851, 6846, 6852, 6850, 6857, 6856, 6860, 6862, 6861, 6899, 6897, 6902, 6903, 6901, 6906, 6909, 6912, 6913, 6910, 6918, 6917, 6922, 6921, 6920.
2020-10-28 13:21:57 UTC [6921]: [5f997072.1b09-5] postgres@postgres,app=[unknown] [00000] CONTEXT:  PL/pgSQL function hypertable_relation_size(regclass) line 12 at RETURN QUERY
2020-10-28 13:21:57 UTC [6921]: [5f997072.1b09-6] postgres@postgres,app=[unknown] [00000] STATEMENT:  SELECT sum(toast_bytes) FROM _timescaledb_catalog.hypertable, hypertable_relation_size(schema_name::text||'.'||table_name::text);
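
To see who is blocking whom while the connections pile up, a generic lock-wait query (plain PostgreSQL, not TimescaleDB-specific) should work:

-- Backends that are currently blocked, and the PIDs blocking them.
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       state,
       left(query, 80)       AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0
ORDER BY pid;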

@k-rus (Contributor) commented Oct 29, 2020

@BenjenJones Are compression or drop_chunks involved in your case? Do you query the hypertable catalog or the information views on hypertables?

@BenjenJones commented Oct 29, 2020

I am using compression and also drop_chunks.

And I am using a Grafana dashboard for monitoring (which queries hypertable catalog information).

edit: crap, you might be right ... one of the queries I am seeing in the logs is the same one that a Grafana widget uses.
But the client_addr values are the IP addresses of the 3 Kubernetes nodes.

Closing down the dashboard to check if that happens again.

@erimatnor erimatnor self-assigned this Nov 26, 2020
@erimatnor (Contributor)

I wasn't able to reproduce this in tests on the 2.0 code base. Isolation tests were expanded for this case in #2688.

In particular, I added tests where I tried to reproduce the deadlock, but I wasn't able to. It might mean that prior fixes to locking in drop_chunks already resolved the issue, or that the test isn't "detailed" enough to cover the deadlock case. I believe I also ran this test on the 1.7 branch without hitting a deadlock, but further investigation on that code base might be needed.

Note, however, that there are many situations where drop_chunks and compress_chunks will block each other; for example, if you drop a chunk while it is being compressed, the drop will naturally block until the compression is done. There is always potential for deadlocks if you compress and drop in a different order, e.g., process 1 drops chunks A,B while process 2 compresses in order B,A. So, one thing to look further into is ensuring that the compression policy always compresses chunks in the same order as drop_chunks (e.g., ordered by chunk ID). Just ordering by time can be a problem for multi-dimensional tables, where we would also need to order by space partitions and other dimensions, etc.
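
As a rough illustration of the "order by chunk ID" idea, a sketch against the internal catalog (not how the policy is actually implemented; 'metrics' is a placeholder hypertable name):

-- Enumerate a hypertable's chunks in a deterministic order (by chunk id),
-- so that any job iterating over them tends to acquire locks in the same order.
SELECT format('%I.%I', c.schema_name, c.table_name) AS chunk
FROM _timescaledb_catalog.chunk c
JOIN _timescaledb_catalog.hypertable h ON h.id = c.hypertable_id
WHERE h.table_name = 'metrics'
ORDER BY c.id;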

@erimatnor erimatnor removed their assignment Dec 10, 2020
@NunoFilipeSantos (Contributor)

@JLockerman can we please have a way to reproduce this?

@k-rus k-rus assigned k-rus and unassigned k-rus Jan 19, 2021
@k-rus (Contributor) commented Jan 19, 2021

I've got some information from @JLockerman on how to repro it:
Run go test -t TestSqlDrop ./... using image timescaledev/promscale-extension:latest-pg12 and the commit timescale/promscale@1c852ea

and
run go test -t TestSqlDrop ./pkg/pgmode/end_to_end_tests -use-docker=false -use-extension=false from the source folder of Promscale at the above commit, against a TimescaleDB database (i.e., a PostgreSQL instance with the TimescaleDB extension).

@k-rus k-rus self-assigned this Jan 19, 2021
@k-rus (Contributor) commented Jan 20, 2021

To run the Promscale test:

  1. Clone https://github.com/timescale/promscale
  2. Move to commit timescale/promscale@1c852ea
  3. Update Go and Docker if necessary
  4. Have a running PostgreSQL instance with TimescaleDB 1.7.x installed
  5. If necessary, update containers.go (the connection string at line 28 and the users) so that the connection to the local instance works
  6. If necessary, update the SQL statement at line 143 of extension.go so it doesn't fail if the extension folder contains a TimescaleDB version with no semver labels
  7. Create a database, which will be used for testing.
  8. If the extension is already installed in the database, run go test -run TestSQLDrop ./pkg/pgmodel/end_to_end_tests -use-docker=false -use-extension=false in the Promscale folder
  9. However, it is more convenient to leave installation and removal of the extension to the test. In that case the database should not contain the extension; run go test -run TestSQLDrop ./pkg/pgmodel/end_to_end_tests -use-docker=false in the Promscale folder

At first I wasn't able to hit any failure in the test over many runs, but eventually I got the deadlock. On some occasions the test run prints:

--- FAIL: TestSQLDropChunk (1.94s)
    drop_test.go:165: ERROR: deadlock detected (SQLSTATE 40P01)

@k-rus (Contributor) commented Jan 22, 2021

The Promscale test adds compression policies during test setup and calls drop_chunks during the test. The compression policy compresses only one chunk per job run, and the job is then rescheduled to run immediately if more chunks to compress exist. This suggests that the deadlock happens while compress_chunk and drop_chunks compete for the same chunk.
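
For reference, what "compress only one uncompressed chunk per run" amounts to in SQL, as a simplified sketch (this is not the Promscale job itself; 'prom_data' and 'cpu_usage' are placeholders):

-- Compress the not-yet-compressed chunk with the lowest chunk id, if any.
SELECT compress_chunk(chunk)
FROM (
  SELECT format('%I.%I', c.schema_name, c.table_name)::regclass AS chunk
  FROM _timescaledb_catalog.chunk c
  JOIN _timescaledb_catalog.hypertable h ON h.id = c.hypertable_id
  WHERE h.schema_name = 'prom_data'
    AND h.table_name  = 'cpu_usage'
    AND c.compressed_chunk_id IS NULL
  ORDER BY c.id
  LIMIT 1
) AS next_chunk;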

@k-rus k-rus removed their assignment Jan 22, 2021
@k-rus k-rus self-assigned this Mar 3, 2021
@k-rus (Contributor) commented Mar 9, 2021

According to the source code in current master, drop chunks obtains locks in this order:

  1. AccessShareLock on hypertable
  2. Unlocks hypertable
  3. Iterates over dimension slices with a tuple lock of LockWaitBlock and LockTupleExclusive, and obtains AccessShareLock
  4. RowExclusiveLock on catalog table CHUNK during iterating over it
  5. AccessExclusiveLock on chunk tables

Compress chunk obtains these locks (excluding compression-related tables):

  1. AccessShareLock on hypertable
  2. ShareLock on chunk table
  3. RowExclusiveLock on catalog table CHUNK
  4. ExclusiveLock on chunk table

I am setting up an isolation test to see if introducing wait points between the obtained locks in drop chunks and compress chunk will lead to a deadlock.
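
Roughly the shape of the isolation spec I have in mind, as a sketch only (table and step names are placeholders, and actually forcing the bad interleaving needs debug wait points between the lock acquisitions listed above):

# Two sessions racing compress_chunk against drop_chunks on the same chunk (2.x API).
setup {
  CREATE TABLE metrics(time timestamptz NOT NULL, device int, value float);
  SELECT create_hypertable('metrics', 'time', chunk_time_interval => interval '1 hour');
  INSERT INTO metrics VALUES (now() - interval '2 hours', 1, 1.0);
  ALTER TABLE metrics SET (timescaledb.compress);
}
teardown { DROP TABLE metrics; }

session "s1"
step "s1_compress" { SELECT compress_chunk(ch) FROM show_chunks('metrics') ch; }

session "s2"
step "s2_drop" { SELECT drop_chunks('metrics', older_than => now() - interval '1 hour'); }

permutation "s1_compress" "s2_drop"
permutation "s2_drop" "s1_compress"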

@k-rus (Contributor) commented Mar 10, 2021

Comparing with the 1.7.x branch, I found a few differences:

Compress chunk obtains these locks (excluding compression-related tables):

  1. AccessShareLock on hypertable
  2. ShareLock on chunk table

In 1.7.x, compress chunk obtains AccessShareLock on the chunk table.

Drop chunks

  • Doesn't loop through referenced tables

So it doesn't seem that the deadlock would be resolved in 2.x.

@k-rus (Contributor) commented Mar 18, 2021

I updated the instructions to repro the deadlock in the comment above.

I also pushed the Promscale version, which I use to repro on 1.7.5, to my branch.

An isolation test was set up in this branch to run different permutations of locking between compress and drop, corresponding to the comment above; however, it didn't result in any deadlocks. So it seems that I haven't identified all the relevant locks obtained by these code paths.

@k-rus k-rus removed their assignment Mar 18, 2021