Intermittent deadlock on Meter and MeterReading imports #1928

Closed
adrian-lara opened this issue Jul 18, 2019 · 5 comments
Labels
Impact-3, No Longer Valid, P-3

Comments

@adrian-lara
Contributor

adrian-lara commented Jul 18, 2019

During PM Meter/MeterReading imports, an error occurs somewhat infrequently. Whenever the error has occurred, re-importing the file has always led to a successful import. I haven't heard of or seen this happening for GreenButton Meter/MeterReading imports, but it could be possible since the logic is similar, though not identical.

Unfortunately, the underlying reason for the error is not fully understood, as it has been difficult to reproduce at will. What is understood is that Postgres deadlocks are causing the error. Specifically, a ShareUpdateExclusiveLock on the MeterReading table and a ShareRowExclusiveLock on the Meter table seem to be behind these deadlocks.

The PM Meter import process can be found here: https://github.com/SEED-platform/seed/blob/develop/seed/data_importer/tasks.py#L806
There's one transaction, with 2 main queries:

  1. Get or create a Meter record
  2. Reference that Meter record to bulk create the MeterReadings

Keep in mind that this method is sent to Celery and run as a background task. For any given import, each of these background tasks should be importing/working with separate Meter records (and, subsequently, separate MeterReading records). A simplified sketch of this pattern is shown below.
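For reference, here is a minimal sketch of that pattern (the task signature and model field names are simplified and hypothetical; the actual logic lives in the tasks.py file linked above):

```python
# Hypothetical, simplified sketch of the import pattern described above -- the
# real logic lives in seed/data_importer/tasks.py (linked above). Model field
# names and the task signature are illustrative, not the actual SEED API.
from celery import shared_task
from django.db import transaction

from seed.models import Meter, MeterReading  # import path assumed


@shared_task
def _import_meter_and_readings(property_id, meter_type, readings_data):
    """Each Celery worker should be handling a separate Meter and its readings."""
    with transaction.atomic():
        # 1. Get or create the Meter record
        meter, _created = Meter.objects.get_or_create(
            property_id=property_id,
            type=meter_type,
        )

        # 2. Reference that Meter to bulk create the MeterReadings
        MeterReading.objects.bulk_create([
            MeterReading(
                meter=meter,
                start_time=r["start_time"],
                end_time=r["end_time"],
                reading=r["reading"],
                source_unit=r["source_unit"],
                conversion_factor=r.get("conversion_factor"),
            )
            for r in readings_data
        ])
```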

Here are the error messages from two instances of this occurring:
Sentry - Deadlock error

[2019-07-17 14:40:19,431: ERROR/ForkPoolWorker-2] Chord callback for '77736616-0854-454b-bd74-653397beab51' raised: ChordError("Dependency 8532fab9-edcb-4b70-8400-3154823e646d raised OperationalError('deadlock detected\\nDETAIL:  Process 97539 waits for ShareUpdateExclusiveLock on relation 16945371 of database 16940925; blocked by process 97538.\\nProcess 97538 waits for ShareRowExclusiveLock on relation 16942891 of database 16940925; blocked by process 97539.\\nHINT:  See server log for query details.\\n',)",)
Traceback (most recent call last):
  File "/Users/adrianlara/devetry/projects/seed/venv367-20190509-develop/lib/python3.6/site-packages/celery/backends/redis.py", line 399, in on_chord_part_return
    callback.delay([unpack(tup, decode) for tup in resl])
  File "/Users/adrianlara/devetry/projects/seed/venv367-20190509-develop/lib/python3.6/site-packages/celery/backends/redis.py", line 399, in <listcomp>
    callback.delay([unpack(tup, decode) for tup in resl])
  File "/Users/adrianlara/devetry/projects/seed/venv367-20190509-develop/lib/python3.6/site-packages/celery/backends/redis.py", line 352, in _unpack_chord_result
    raise ChordError('Dependency {0} raised {1!r}'.format(tid, retval))
celery.exceptions.ChordError: Dependency 8532fab9-edcb-4b70-8400-3154823e646d raised OperationalError('deadlock detected\nDETAIL:  Process 97539 waits for ShareUpdateExclusiveLock on relation 16945371 of database 16940925; blocked by process 97538.\nProcess 97538 waits for ShareRowExclusiveLock on relation 16942891 of database 16940925; blocked by process 97539.\nHINT:  See server log for query details.\n',)

Steps to Reproduce

This seems to happen randomly. I haven't been able to narrow down why the error occurs.

During my investigation, I did attempt to reproduce this deadlock situation directly within Postgres, but was unsuccessful. Specifically, I initiated a transaction, created a meter and some readings, initiated another transaction (in another Postgres session), added meters and/or readings in this second transaction, and committed both transactions in different orders (roughly as sketched below). Throughout these trials, I also observed the locks being created at each step. The only error I could produce came from creating readings in the second transaction for a meter from the first transaction before that meter had been committed.
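The manual attempt looked roughly like the sketch below, written here with two psycopg2 connections standing in for two separate Postgres sessions. The DSN and the seed_meter column list are hypothetical; only the seed_meterreading columns are taken from the logs further down.

```python
# Rough reconstruction of the manual reproduction attempt described above,
# using two psycopg2 connections in place of two separate Postgres sessions.
# The DSN and the seed_meter column list are hypothetical; the
# seed_meterreading columns come from the statements in the logs below.
import psycopg2

conn_a = psycopg2.connect("dbname=seeddb user=seeduser")  # transaction A
conn_b = psycopg2.connect("dbname=seeddb user=seeduser")  # transaction B

READING_SQL = (
    "INSERT INTO seed_meterreading "
    "(meter_id, start_time, end_time, reading, source_unit, conversion_factor) "
    "VALUES (%s, '2017-01-01T00:00:00+00:00', '2017-02-01T00:00:00+00:00', "
    "100.0, 'kBtu', NULL)"
)

with conn_a.cursor() as cur_a, conn_b.cursor() as cur_b:
    # Transaction A: create a meter and a reading (seed_meter columns illustrative)
    cur_a.execute("INSERT INTO seed_meter (property_id, type) VALUES (1, 1) RETURNING id")
    meter_a_id = cur_a.fetchone()[0]
    cur_a.execute(READING_SQL, (meter_a_id,))

    # Transaction B: same thing for a different meter. Locks can be watched
    # from a third session via pg_locks while both transactions are open.
    cur_b.execute("INSERT INTO seed_meter (property_id, type) VALUES (2, 1) RETURNING id")
    meter_b_id = cur_b.fetchone()[0]
    cur_b.execute(READING_SQL, (meter_b_id,))

# Commit in different orders across trials; no deadlock was produced this way.
conn_b.commit()
conn_a.commit()
```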

Instance Information

This is known to have occurred in dev1 and during local development.

@nllong added the 5 Points, Impact-2, and P-1 labels Aug 9, 2019
@adrian-lara
Contributor Author

@nllong I'm having a hard time prioritizing this.

The number of unknowns, along with how infrequently this happens (it usually works on the "second" attempt after a failure), is making me think of an "alternative" solution.

What do you think of adding some front-end error handling for when this error happens that basically says "Something went wrong - try again. If you already tried again, let your system admin know."? It doesn't feel great, and I haven't seen this pattern in SEED before, so definitely feel free to shoot that down.

@nllong
Member

nllong commented Jan 23, 2020

@adrian-lara -- I think having error handling on the front end would be very helpful to start with.

@macintoshpie
Contributor

I just experienced the same deadlock error while refactoring the BuildingSync upload to use Celery tasks on the backend. I believe the database was fresh (i.e., just created) before I ran this. See the error and the database logs below.

Though the code is not yet in the repo, it was happening during a transaction in which BuildingSync.process() was being called on XML files.

Like the example given by Adrian, from what I can infer from the db logs, the exception is happening during the bulk creation of meter readings (see here).

Hypothesis

This issue seems isolated to MeterReadings in transactions with concurrent processes. The seed_meterreading table seems to be the only one using the TimescaleDB extension. I found this issue on the timescaledb repo where a user reported what appears to be the same issue we're having. A contributor replied with:

(If it's the first insert into a hypertable, yes it will effectively lock it for other txns, because we're creating the first chunk in the txn, it's not created ahead of time, once that txn commits the other txns should be able to finish no problem. The main thing here is you want to make sure you always access any hypertables and foreign tables in the same order so that you avoid deadlocks).

Thus it seems possible that this creation of the first chunk is what's causing the deadlock between worker processes. As a result, this should only be an issue when the database is brand new, or after calling set_chunk_time_interval(), which from what I can tell will cause new chunks to be created. Someone who knows more about timescaledb should comment here on whether these are the only times new chunks are created.

Fix

A potential fix would be to ensure the chunks are created before doing any concurrent inserts. This would probably just be a migration that creates and then deletes a record in the meterreadings table (a rough sketch is below). If chunks are created after the first insert, I'm not sure what the proper fix would be.
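A rough sketch of what such a migration might look like follows. This is purely an assumption-laden illustration: it assumes a Meter can be created without other required fields and that deleting the rows leaves the hypertable chunk in place; the real migration would need to respect the actual model constraints.

```python
# Hypothetical sketch of the proposed fix: a data migration that forces the
# first seed_meterreading hypertable chunk to exist before any concurrent
# imports run. Assumes a Meter can be created without other required fields
# and that deleting the rows leaves the chunk in place -- the real migration
# would need to respect the actual model constraints.
from datetime import datetime, timezone

from django.db import migrations


def create_then_delete_dummy_reading(apps, schema_editor):
    Meter = apps.get_model("seed", "Meter")
    MeterReading = apps.get_model("seed", "MeterReading")

    meter = Meter.objects.create()  # assumption: no other required fields
    MeterReading.objects.create(
        meter=meter,
        start_time=datetime(2000, 1, 1, tzinfo=timezone.utc),
        end_time=datetime(2000, 1, 2, tzinfo=timezone.utc),
        reading=0,
        source_unit="kBtu",
    )
    MeterReading.objects.filter(meter=meter).delete()
    meter.delete()


class Migration(migrations.Migration):
    dependencies = [
        ("seed", "XXXX_previous_migration"),  # placeholder dependency
    ]

    operations = [
        migrations.RunPython(create_then_delete_dummy_reading, migrations.RunPython.noop),
    ]
```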

If seen again

Unfortunately, I reset the database (dropped and recreated it) before realizing the table reference numbers would change. If you run into this issue again, I suggest the following steps:

  1. Check the db logs (probably at /var/log/postgresql/postgresql-10-main.log) and look for the deadlock exception.
  2. Run a query to look up the table names causing the deadlock (see this helpful article about Postgres deadlocks); an example query is shown after this list.
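For step 2, the relation OIDs in the deadlock DETAIL message can be resolved to table names via pg_class. A minimal example, using the OIDs from the logs below (any SQL client works just as well as a Django shell):

```python
# Resolve the relation OIDs from the deadlock DETAIL message to table names.
# The OIDs here are the ones from the logs below; substitute your own.
from django.db import connection

with connection.cursor() as cursor:
    cursor.execute(
        "SELECT oid, relname FROM pg_class WHERE oid IN (%s, %s)",
        [1797542, 1795035],
    )
    for oid, relname in cursor.fetchall():
        print(oid, relname)
```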

Error messages

Celery error message

[2020-03-17 10:46:20,462: ERROR/ForkPoolWorker-2] Chord callback for '6313e616-0a07-4f8a-ae8c-f39db1da8b15' raised: ChordError("Dependency bb8813d0-a443-48d7-8ce0-c59f6c806436 raised OperationalError('deadlock detected\\nDETAIL:  Process 31752 waits for ShareUpdateExclusiveLock on relation 1797542 of database 1793093; blocked by process 31754.\\nProcess 31754 waits for ShareRowExclusiveLock on relation 1795035 of database 1793093; blocked by process 31752.\\nHINT:  See server log for query details.\\n',)",)
Traceback (most recent call last):
  File "/home/ted/.local/lib/python3.6/site-packages/celery/backends/redis.py", line 399, in on_chord_part_return
    callback.delay([unpack(tup, decode) for tup in resl])
  File "/home/ted/.local/lib/python3.6/site-packages/celery/backends/redis.py", line 399, in <listcomp>
    callback.delay([unpack(tup, decode) for tup in resl])
  File "/home/ted/.local/lib/python3.6/site-packages/celery/backends/redis.py", line 352, in _unpack_chord_result
    raise ChordError('Dependency {0} raised {1!r}'.format(tid, retval))
celery.exceptions.ChordError: Dependency bb8813d0-a443-48d7-8ce0-c59f6c806436 raised OperationalError('deadlock detected\nDETAIL:  Process 31752 waits for ShareUpdateExclusiveLock on relation 1797542 of database 1793093; blocked by process 31754.\nProcess 31754 waits for ShareRowExclusiveLock on relation 1795035 of database 1793093; blocked by process 31752.\nHINT:  See server log for query details.\n',)

Database logs

2020-03-17 11:38:55.115 MDT [31422] seeduser@seeddb CONTEXT:  PL/pgSQL function inline_code_block line 23 at RAISE
2020-03-17 11:41:36.654 MDT [31752] seeduser@seeddb ERROR:  deadlock detected
2020-03-17 11:41:36.654 MDT [31752] seeduser@seeddb DETAIL:  Process 31752 waits for ShareUpdateExclusiveLock on relation 1797542 of database 1793093; blocked by process 31754.
        Process 31754 waits for ShareRowExclusiveLock on relation 1795035 of database 1793093; blocked by process 31752.
        Process 31752: INSERT INTO "seed_meterreading" ("meter_id", "start_time", "end_time", "reading", "source_unit", "conversion_factor") VALUES (2, '2017-02-01T00:00:00+00:00'::timestamptz, '2017-03-01T00:00:00+00:00'::timestamptz, 234524.0252775915, 'kBtu', NULL), (2, '2017-09-01T00:00:00+00:00'::timestamptz, '2017-10-01T00:00:00+00:00'::timestamptz, 304193.8483165834, 'kBtu', NULL), (2, '2017-11-01T00:00:00+00:00'::timestamptz, '2017-12-01T00:00:00+00:00'::timestamptz, 253060.44696603998, 'kBtu', NULL), (2, '2017-04-01T00:00:00+00:00'::timestamptz, '2017-05-01T00:00:00+00:00'::timestamptz, 262884.6027515331, 'kBtu', NULL), (2, '2017-06-01T00:00:00+00:00'::timestamptz, '2017-07-01T00:00:00+00:00'::timestamptz, 294824.39173722686, 'kBtu', NULL), (2, '2017-07-01T00:00:00+00:00'::timestamptz, '2017-08-01T00:00:00+00:00'::timestamptz, 316350.645483344, 'kBtu', NULL), (2, '2017-01-01T00:00:00+00:00'::timestamptz, '2017-02-01T00:00:00+00:00'::timestamptz, 258523.62481970456, 'kBtu', NULL), (2, '2017-08-01T00:00:00+00:00'::timestamp
        Process 31754: INSERT INTO "seed_meterreading" ("meter_id", "start_time", "end_time", "reading", "source_unit", "conversion_factor") VALUES (1, '2017-02-01T00:00:00+00:00'::timestamptz, '2017-03-01T00:00:00+00:00'::timestamptz, 234524.0252775915, 'kBtu', NULL), (1, '2017-09-01T00:00:00+00:00'::timestamptz, '2017-10-01T00:00:00+00:00'::timestamptz, 304193.8483165834, 'kBtu', NULL), (1, '2017-11-01T00:00:00+00:00'::timestamptz, '2017-12-01T00:00:00+00:00'::timestamptz, 253060.44696603998, 'kBtu', NULL), (1, '2017-04-01T00:00:00+00:00'::timestamptz, '2017-05-01T00:00:00+00:00'::timestamptz, 262884.6027515331, 'kBtu', NULL), (1, '2017-06-01T00:00:00+00:00'::timestamptz, '2017-07-01T00:00:00+00:00'::timestamptz, 294824.39173722686, 'kBtu', NULL), (1, '2017-07-01T00:00:00+00:00'::timestamptz, '2017-08-01T00:00:00+00:00'::timestamptz, 316350.645483344, 'kBtu', NULL), (1, '2017-01-01T00:00:00+00:00'::timestamptz, '2017-02-01T00:00:00+00:00'::timestamptz, 258523.62481970456, 'kBtu', NULL), (1, '2017-08-01T00:00:00+00:00'::timestamp
2020-03-17 11:41:36.654 MDT [31752] seeduser@seeddb HINT:  See server log for query details.
2020-03-17 11:41:36.654 MDT [31752] seeduser@seeddb STATEMENT:  INSERT INTO "seed_meterreading" ("meter_id", "start_time", "end_time", "reading", "source_unit", "conversion_factor") VALUES (2, '2017-02-01T00:00:00+00:00'::timestamptz, '2017-03-01T00:00:00+00:00'::timestamptz, 234524.0252775915, 'kBtu', NULL), (2, '2017-09-01T00:00:00+00:00'::timestamptz, '2017-10-01T00:00:00+00:00'::timestamptz, 304193.8483165834, 'kBtu', NULL), (2, '2017-11-01T00:00:00+00:00'::timestamptz, '2017-12-01T00:00:00+00:00'::timestamptz, 253060.44696603998, 'kBtu', NULL), (2, '2017-04-01T00:00:00+00:00'::timestamptz, '2017-05-01T00:00:00+00:00'::timestamptz, 262884.6027515331, 'kBtu', NULL), (2, '2017-06-01T00:00:00+00:00'::timestamptz, '2017-07-01T00:00:00+00:00'::timestamptz, 294824.39173722686, 'kBtu', NULL), (2, '2017-07-01T00:00:00+00:00'::timestamptz, '2017-08-01T00:00:00+00:00'::timestamptz, 316350.645483344, 'kBtu', NULL), (2, '2017-01-01T00:00:00+00:00'::timestamptz, '2017-02-01T00:00:00+00:00'::timestamptz, 258523.62481970456, 'kBtu', NULL), (2, '2017-08-01T00:00:00+00:00'::timestamptz, '2017-09-01T00:00:00+00:00'::timestamptz, 308225.00931095495, 'kBtu', NULL), (2, '2017-10-01T00:00:00+00:00'::timestamptz, '2017-11-01T00:00:00+00:00'::timestamptz, 291341.9782226286, 'kBtu', NULL), (2, '2017-12-01T00:00:00+00:00'::timestamptz, '2018-01-01T00:00:00+00:00'::timestamptz, 259046.94606598208, 'kBtu', NULL), (2, '2017-03-01T00:00:00+00:00'::timestamptz, '2017-04-01T00:00:00+00:00'::timestamptz, 259655.78464664074, 'kBtu', NULL), (2, '2017-05-01T00:00:00+00:00'::timestamptz, '2017-06-01T00:00:00+00:00'::timestamptz, 279694.95573607093, 'kBtu', NULL) RETURNING "seed_meterreading"."start_time"

@nllong
Member

nllong commented Mar 18, 2020

Thanks @macintoshpie for investigating this more. I think you are on to something about having the background tasks wait until all are queued up. Just curious, what version of SEED are you on?

@adrian-lara added the Impact-3 and P-3 labels and removed the Impact-2 and P-1 labels Oct 19, 2020
@aviveiros11 added the Verify No Longer Relevant label Apr 8, 2021
@adrian-lara
Contributor Author

I haven't seen this one in a while. I think the upgrades to Postgres and TimescaleDB since the last comment probably ended up resolving these deadlocks.

@adrian-lara added the No Longer Valid label and removed the Verify No Longer Relevant label Apr 21, 2021