
Potential Deadlock Between drop_chunks() and the compress_chunks BGW #2575

Closed
JLockerman opened this issue Oct 19, 2020 · 15 comments

@JLockerman (Contributor) commented Oct 19, 2020

Relevant system information:

  • OS: MacOS
  • PostgreSQL version: 12
  • TimescaleDB version : 1.7.4
  • Installation method: docker

When running the Promscale tests with the timescaledev/promscale-extension:latest-pg12 docker image, we occasionally get

2020-10-19 17:48:14.383 UTC [109] ERROR:  deadlock detected
2020-10-19 17:48:14.383 UTC [109] DETAIL:  Process 109 waits for ShareLock on transaction 520; blocked by process 110.
Process 110 waits for AccessExclusiveLock on relation 18254 of database 17453; blocked by process 109.
Process 109: CALL _prom_catalog.drop_metric_chunks($1, $2)
Process 110: <command string not enabled>
2020-10-19 17:48:14.383 UTC [109] HINT:  See server log for query details.
2020-10-19 17:48:14.383 UTC [109] CONTEXT:  while locking tuple (0,1) in relation "dimension_slice"
SQL statement "SELECT drop_chunks(table_name=>metric_table, schema_name=> 'prom_data', older_than=>older_than, cascade_to_materializations=>FALSE)"
PL/pgSQL function drop_metric_chunks(text,timestamp with time zone) line 132 at PERFORM
2020-10-19 17:48:14.383 UTC [109] STATEMENT:  CALL _prom_catalog.drop_metric_chunks($1, $2)
2020-10-19 17:48:14.400 UTC [110] LOG:  completed compressing chunk _timescaledb_internal._hyper_1_1_chunk
2020-10-19 17:48:14.404 UTC [110] LOG:  job 1000 completed compressing chunk

which appears to be a tuple-level deadlock on dimension_slice
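
For anyone trying to trigger this by hand, a minimal sketch of the two competing calls (the metric table name is a placeholder based on the log above, and running these concurrently in two sessions is not a guaranteed repro):

-- Session 1: mirrors the drop path shown in the log (1.7.x drop_chunks API);
-- 'cpu_usage' is a placeholder metric table in the prom_data schema.
SELECT drop_chunks(table_name => 'cpu_usage',
                   schema_name => 'prom_data',
                   older_than => now() - interval '1 hour',
                   cascade_to_materializations => FALSE);

-- Session 2: roughly what the compression background worker does for one chunk.
SELECT compress_chunk('_timescaledb_internal._hyper_1_1_chunk');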

@JLockerman JLockerman added the bug label Oct 19, 2020
@erimatnor (Contributor)

It would be good to get a reproducible example for this deadlock.

@erimatnor (Contributor)

Potential duplicate of #2509

@cevian (Contributor) commented Oct 27, 2020

Honestly, not sure this is a dupe of #2509, because the deadlock here is on a dimension_slice row, not a table-level lock.

@k-rus (Contributor) commented Oct 29, 2020

I'm posting a link to a PR merged into 1.7.3, which might be relevant to this issue: #2150

@BenjenJones commented Oct 29, 2020

I had a similar thing happening over and over again. I posted about it on Slack as well, but then found this issue:

I have set up TimescaleDB via the Helm charts:
https://github.com/timescale/timescaledb-kubernetes/tree/master/charts/timescaledb-single

Running Timescaledb 1.7.4 with pg12

It is running OK, but multiple times a day I see connections stacking up on the master,
coming from both replicas AND the master itself (postgres user).
The queries look like:

SELECT table_name AS "Name",table_bytes AS "table",index_bytes AS "index",toast_bytes AS "toast",total_bytes AS "total" FROM _timescaledb_catalog.hypertable, hypertable_relation_size(schema_name::text||'.'||table_name::text) ORDER BY total_bytes DESC;
SELECT sum(index_bytes) FROM _timescaledb_catalog.hypertable, hypertable_relation_size(schema_name::text||'.'||table_name::text);

These are growing until there are no more connections left.
Does anyone know why this is happening?
Seeing this in the master's log file:

2020-10-28 13:21:57 UTC [6921]: [5f997072.1b09-3] postgres@postgres,app=[unknown] [00000] LOG:  process 6921 still waiting for AccessShareLock on relation 575409 of database 13408 after 1000.082 ms
2020-10-28 13:21:57 UTC [6921]: [5f997072.1b09-4] postgres@postgres,app=[unknown] [00000] DETAIL:  Process holding the lock: 6563. Wait queue: 6761, 6764, 6765, 6766, 6767, 6768, 6769, 6771, 6772, 6774, 6778, 6779, 6777, 6780, 6782, 6785, 6784, 6790, 6788, 6789, 6795, 6794, 6797, 6796, 6798, 6847, 6851, 6846, 6852, 6850, 6857, 6856, 6860, 6862, 6861, 6899, 6897, 6902, 6903, 6901, 6906, 6909, 6912, 6913, 6910, 6918, 6917, 6922, 6921, 6920.
2020-10-28 13:21:57 UTC [6921]: [5f997072.1b09-5] postgres@postgres,app=[unknown] [00000] CONTEXT:  PL/pgSQL function hypertable_relation_size(regclass) line 12 at RETURN QUERY
2020-10-28 13:21:57 UTC [6921]: [5f997072.1b09-6] postgres@postgres,app=[unknown] [00000] STATEMENT:  SELECT sum(toast_bytes) FROM _timescaledb_catalog.hypertable, hypertable_relation_size(schema_name::text||'.'||table_name::text);
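
To see who is blocking whom while the connections pile up, a generic lock-wait query (plain PostgreSQL, not TimescaleDB-specific) should work:

-- Backends that are currently blocked, and the PIDs blocking them.
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       state,
       left(query, 80)       AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0
ORDER BY pid;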

@k-rus (Contributor) commented Oct 29, 2020

@BenjenJones Are compression or drop_chunks involved in your case? Do you query the hypertable catalog or the information views on hypertables?

@BenjenJones commented Oct 29, 2020

I am using compression and also drop_chunks.

And I am using a Grafana dashboard for monitoring (which queries hypertable catalog information).

edit: crap, you might be right ... one of the queries I am seeing in the logs is the same one that a Grafana widget uses.
But the client_addr values are the IP addresses of the 3 Kubernetes nodes.

Closing down the dashboard to check if that happens again.

@erimatnor erimatnor self-assigned this Nov 26, 2020
@erimatnor (Contributor)

I wasn't able to reproduce this in tests on the 2.0 code base. Isolation tests were expanded for this case in #2688.

In particular, I added tests where I tried to reproduce the deadlock, but I wasn't able to. It might mean that prior fixes to locking in drop_chunks already resolved the issue, or that the test isn't "detailed" enough to cover the deadlock case. I believe I also ran this test on the 1.7 branch without hitting a deadlock, but further investigation on that code base might be needed.

Note, however, that there are many situations where drop_chunks and compress_chunks will block each other; for example, if you drop a chunk while it is being compressed, the drop will naturally block until the compression is done. There is always potential for deadlocks if you compress and drop in a different order, e.g., process 1 drops chunks A,B while process 2 compresses in order B,A. So, one thing to look further into is ensuring that the compression policy always compresses chunks in the same order as drop_chunks (e.g., ordered by chunk ID). Just ordering by time can be a problem for multi-dimensional tables, where we would also need to order by space partitions and other dimensions, etc.
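
As a rough illustration of the "order by chunk ID" idea, a sketch against the internal catalog (not how the policy is actually implemented; 'metrics' is a placeholder hypertable name):

-- Enumerate a hypertable's chunks in a deterministic order (by chunk id),
-- so that any job iterating over them tends to acquire locks in the same order.
SELECT format('%I.%I', c.schema_name, c.table_name) AS chunk
FROM _timescaledb_catalog.chunk c
JOIN _timescaledb_catalog.hypertable h ON h.id = c.hypertable_id
WHERE h.table_name = 'metrics'
ORDER BY c.id;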

@erimatnor erimatnor removed their assignment Dec 10, 2020
@NunoFilipeSantos (Contributor)

@JLockerman can we please have a way to reproduce this?

@k-rus k-rus assigned k-rus and unassigned k-rus Jan 19, 2021
@k-rus (Contributor) commented Jan 19, 2021

I've got some information from @JLockerman on how to repro it:
Run go test -t TestSqlDrop ./... using image timescaledev/promscale-extension:latest-pg12 and the commit timescale/promscale@1c852ea

and
run go test -t TestSqlDrop ./pkg/pgmode/end_to_end_tests -use-docker=false -use-extension=false from the source folder of Promscale at the above commit, against a TimescaleDB database (i.e., a PostgreSQL instance with the TimescaleDB extension).

@k-rus k-rus self-assigned this Jan 19, 2021
@k-rus (Contributor) commented Jan 20, 2021

To run the Promscale test:

  1. Clone https://github.com/timescale/promscale
  2. Move to commit timescale/promscale@1c852ea
  3. Update Go and Docker if necessary
  4. Have a running PostgreSQL instance with TimescaleDB 1.7.x installed
  5. If necessary, update containers.go (the connection string at line 28 and the users) so that the connection to the local instance works
  6. If necessary, update the SQL statement at line 143 of extension.go so it doesn't fail if the extension folder contains a TimescaleDB version with no semver labels
  7. Create a database, which will be used for testing.
  8. If the extension is already installed in the database, run go test -run TestSQLDrop ./pkg/pgmodel/end_to_end_tests -use-docker=false -use-extension=false in the Promscale folder
  9. However, it is more convenient to leave installation and removal of the extension to the test. In that case the database should not contain the extension; run go test -run TestSQLDrop ./pkg/pgmodel/end_to_end_tests -use-docker=false in the Promscale folder

At first I wasn't able to hit any failure in the test over many runs, but eventually I got the deadlock. On some occasions the test run prints:

--- FAIL: TestSQLDropChunk (1.94s)
    drop_test.go:165: ERROR: deadlock detected (SQLSTATE 40P01)

@k-rus (Contributor) commented Jan 22, 2021

The Promscale test adds compression policies during test setup and calls drop_chunks during the test. The compression policy compresses only one chunk per job run, and the job is then rescheduled to run immediately if more chunks to compress exist. This suggests that the deadlock happens while compress_chunk and drop_chunks compete for the same chunk.
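
For reference, what "compress only one uncompressed chunk per run" amounts to in SQL, as a simplified sketch (this is not the Promscale job itself; 'prom_data' and 'cpu_usage' are placeholders):

-- Compress the not-yet-compressed chunk with the lowest chunk id, if any.
SELECT compress_chunk(chunk)
FROM (
  SELECT format('%I.%I', c.schema_name, c.table_name)::regclass AS chunk
  FROM _timescaledb_catalog.chunk c
  JOIN _timescaledb_catalog.hypertable h ON h.id = c.hypertable_id
  WHERE h.schema_name = 'prom_data'
    AND h.table_name  = 'cpu_usage'
    AND c.compressed_chunk_id IS NULL
  ORDER BY c.id
  LIMIT 1
) AS next_chunk;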

@k-rus k-rus removed their assignment Jan 22, 2021
@k-rus k-rus self-assigned this Mar 3, 2021
@k-rus (Contributor) commented Mar 9, 2021

According to the source code in current master, drop chunks obtains locks in this order:

  1. AccessShareLock on hypertable
  2. Unlocks hypertable
  3. Iterates over dimension slices with a tuple lock of LockWaitBlock and LockTupleExclusive, and obtains AccessShareLock
  4. RowExclusiveLock on catalog table CHUNK during iterating over it
  5. AccessExclusiveLock on chunk tables

Compress chunk obtains these locks (excluding compression-related tables):

  1. AccessShareLock on hypertable
  2. ShareLock on chunk table
  3. RowExclusiveLock on catalog table CHUNK
  4. ExclusiveLock on chunk table

I am setting up an isolation test to see if introducing wait points between the obtained locks in drop chunks and compress chunk will lead to a deadlock.
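
Roughly the shape of the isolation spec I have in mind, as a sketch only (table and step names are placeholders, and actually forcing the bad interleaving needs debug wait points between the lock acquisitions listed above):

# Two sessions racing compress_chunk against drop_chunks on the same chunk (2.x API).
setup {
  CREATE TABLE metrics(time timestamptz NOT NULL, device int, value float);
  SELECT create_hypertable('metrics', 'time', chunk_time_interval => interval '1 hour');
  INSERT INTO metrics VALUES (now() - interval '2 hours', 1, 1.0);
  ALTER TABLE metrics SET (timescaledb.compress);
}
teardown { DROP TABLE metrics; }

session "s1"
step "s1_compress" { SELECT compress_chunk(ch) FROM show_chunks('metrics') ch; }

session "s2"
step "s2_drop" { SELECT drop_chunks('metrics', older_than => now() - interval '1 hour'); }

permutation "s1_compress" "s2_drop"
permutation "s2_drop" "s1_compress"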

@k-rus (Contributor) commented Mar 10, 2021

Comparing with the 1.7.x branch, I found a few differences:

Compress chunk obtains these locks (excluding compression-related tables):

  1. AccessShareLock on hypertable
  2. ShareLock on chunk table

In 1.7.x, compress chunk obtains AccessShareLock on the chunk table.

Drop chunks

  • Doesn't loop through referenced tables

So it doesn't seem that the deadlock would be resolved in 2.x.

@k-rus (Contributor) commented Mar 18, 2021

I updated the instructions to repro the deadlock in the comment above.

I also pushed the Promscale version, which I use to repro on 1.7.5, to my branch.

An isolation test was set up in this branch to run different permutations of locking between compress and drop, corresponding to the comment above; however, it didn't result in any deadlocks. So it seems that I haven't identified all the relevant locks obtained by these code paths.

@k-rus k-rus removed their assignment Mar 18, 2021