[Bug]: Compression policy job keeps on hanging using up lots of cpu #6223
Comments
Hello @axpobastian, thank you for reaching out. Have you checked whether compression is waiting for any locks?
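The exact lock-check query is not included in the thread, but a minimal sketch using standard PostgreSQL views (`pg_stat_activity` and `pg_blocking_pids`) could look like this:

```sql
-- Sketch: list blocked backends together with the sessions blocking them.
SELECT blocked.pid    AS blocked_pid,
       blocked.query  AS blocked_query,
       blocking.pid   AS blocking_pid,
       blocking.query AS blocking_query
FROM pg_stat_activity AS blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS b(pid) ON TRUE
JOIN pg_stat_activity AS blocking ON blocking.pid = b.pid;
```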
Thank you for getting back to us. I just ran the query you suggested, and it seems it's not waiting for any locks. Attached is the output of the query:
@axpobastian Thanks for providing the output above. It seems like the compression job is compressing a lot of data, which is why it takes such a long time. There are a few other things you can do to investigate whether you can improve the performance of the job and pinpoint where the issue is.

You can get the uncompressed chunks using this query (here I am looking at the `conditions` hypertable as an example):

`select format('%I.%I', chunk_schema, chunk_name) from timescaledb_information.chunks where is_compressed = false and hypertable_name = 'conditions';`

You can then try to compress a single one and see how it performs:

`SELECT compress_chunk('_timescaledb_internal._hyper_1_1_chunk');`

Unfortunately, the compression policy job is single-threaded (we have an issue, #3077, to improve this), which means that it has to compress each chunk in serial order. Since you're compressing so much data, it looks like you've just done a backfill of data, or just turned on compression, which would explain the amount of data the job needs to compress.
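To gauge how much work remains, the per-chunk compression status can also be inspected. A sketch, where `conditions` is again a placeholder hypertable name:

```sql
-- Sketch: per-chunk compression status and sizes for a placeholder hypertable.
SELECT chunk_name,
       compression_status,
       pg_size_pretty(before_compression_total_bytes) AS before,
       pg_size_pretty(after_compression_total_bytes)  AS after
FROM chunk_compression_stats('conditions')
ORDER BY chunk_name;
```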
Thanks for looking into it. We tried the query to see uncompressed chunks, and we didn't find any results other than the most recent chunks (we have daily chunks that normally get compressed every day). We didn't quite do a backfill, so maybe I can give a quick summary of what we did to get here and what happened. I feel like our situation is complex and something is going very wrong.

In the beginning, we had some external systems constantly feeding data into our TimescaleDB. We had a filter in the external systems so that only data younger than 24 hours was written (since we use daily chunks and it used to be impossible to insert into compressed chunks). All of our daily chunks were compressed (as shown by the query above), except for the chunk for the current day.

Then we removed the filter for data older than 24 hours, so if our source system sent data that was older than 24 hours, we also inserted it. Note that the amount of old data we get from our source system is very small compared to the amount of recent data we write all the time. After we removed the filter, the following happened:
What we did then was: we removed all the uncompressed chunks using drop_chunks(), since they were not relevant for us, and we put the filter back into our source system so we do not write old data anymore. This helped with the errors on writing new data; however, even though the output of the above query now told us that we have no more uncompressed chunks, the compression policy job still kept getting stuck. At this point we removed the compression policy job and switched to compressing the new daily chunk manually.

A little later we also realized that, even without any compression policy job running, our query performance for some queries deteriorated massively (mostly queries of one particular type).

Sorry for the wall of text. I guess the tl;dr is that we started inserting a very small amount of data into already compressed chunks and everything started to fall apart.

Edit: We also noticed that there is a `recompress_chunk` function; could this possibly be related to our issue?
@axpobastian Thanks for the thorough explanation, it gives much better insight into how you ended up in this situation. Given the number of jobs running at the same time and the fact that you did not see any deadlocks, I wrote the suggestions because it looked like you just had a lot of data to compress and the jobs were not really stuck.

Even though this is for the situation preceding the current issue, it would be good to see those deadlock reports. They contain the lock as well as the OID of the relation and the query. Understanding these might help find out what went wrong here and maybe also offer insight into what is happening now.
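For capturing such reports going forward, a sketch along these lines makes long lock waits visible in the server log. It assumes the settings can be changed directly; on a managed service such as MST/Aiven they are usually adjusted through the provider's console instead of `ALTER SYSTEM`:

```sql
-- Sketch: log sessions that wait on a lock longer than deadlock_timeout.
ALTER SYSTEM SET log_lock_waits = on;
ALTER SYSTEM SET deadlock_timeout = '1s';
SELECT pg_reload_conf();

-- Check the values currently in effect.
SHOW log_lock_waits;
SHOW deadlock_timeout;
```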
If you can share the explain plan for one of those deteriorating queries, it should be possible to see what the issue is.
When you have a mix of compressed and uncompressed data in a chunk, this function is used to decompress the compressed data, incorporate the new uncompressed data, and re-compress it. We had issues with it previously because it was executed inside a transaction and hence could grab locks in an order that would lead to a deadlock. You can try to run it manually on the old chunks just to make sure that all data is compressed; see the instructions under `recompress_chunk`. Although we haven't seen any issues with this, it is not entirely impossible that it is part of your problem.
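For example, a sketch of a single manual run. The chunk name is the placeholder used earlier in the thread, and on 2.x `recompress_chunk` is a procedure, so it is invoked with CALL:

```sql
-- Sketch: recompress a single chunk that holds both compressed and uncompressed data.
CALL recompress_chunk('_timescaledb_internal._hyper_1_1_chunk');
```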
@mkindahl Thank you so much for getting back to us; we think we might know how to tackle our problem. We played around with the `recompress_chunk` procedure and found the following:
Based on this, we are trying to recompress all our chunks. We are still unsure how to deal with the occasional chunk that takes hours to recompress, so we would be very thankful for suggestions. Most chunks are done in seconds to minutes, though. Also, we would appreciate a comment on whether our reasoning makes sense :) If we are right about this, it would not be a TimescaleDB bug, but it is something the documentation could perhaps point out as a major pitfall.
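One way to script a run over all chunks (a sketch, assuming a hypertable named `conditions` and the psql client) is to generate one CALL per compressed chunk and execute the output:

```sql
-- Sketch: generate one CALL per compressed chunk of a placeholder hypertable.
-- In psql, append \gexec to the query to execute each generated statement.
SELECT format('CALL recompress_chunk(%L);',
              format('%I.%I', chunk_schema, chunk_name))
FROM timescaledb_information.chunks
WHERE hypertable_name = 'conditions'
  AND is_compressed;
```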
This was my suspicion when you said you did not have any visible deadlocks, yes.
Yes, this is the problem. Since you have data that is both compressed and uncompressed, it decompresses all data (into memory), incorporates the uncompressed data, and then sorts it all, possibly spilling to disk, which will just make it worse. You can see it clearly in the second plan, while the first plan just visits the compressed chunks.
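The plans themselves are not reproduced in this thread, but a sketch of how such a plan can be captured (the hypertable, columns, and time range are placeholders):

```sql
-- Sketch: capture the plan of a typical query to see whether it has to decompress
-- and sort data from partially compressed chunks before returning results.
EXPLAIN (ANALYZE, BUFFERS)
SELECT "time", device_id, value
FROM conditions
WHERE "time" > now() - INTERVAL '7 days'
ORDER BY "time" DESC
LIMIT 100;
```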
Not sure what is causing it to take hours to recompress. It sounds odd, but could have to do with other updates or inserts interfering with the process, or disk IO being throttled if it spills to disk. I suspect that you just have to let it finish, or run it when the machine is not busy.
Closing the issue since it seems you have a path to fixing this. If you still have issues, feel free to re-open.
Hi @mkindahl, I'm having a similar issue, except on my largest hypertable all the chunks seem to have rolled themselves into one huge chunk (~1.3 TB) and uncompressed themselves (they were in several chunks and around ~600 GB compressed the other day). I've been trying to figure out how to get the compression to run, or even to tell whether it's running. Any guidance would be appreciated!
Hi there @zadeviggers! Since this is quite a different problem compared to the original issue, could you please open a new issue with the information you have? It might be a good idea to look at the job's statistics: I would expect `next_start` to be set to a later time after the job has executed; if that does not happen, something is wrong.
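A query along these lines shows the compression policy job's last outcome and next scheduled start (a sketch using the `timescaledb_information.jobs` and `timescaledb_information.job_stats` views):

```sql
-- Sketch: check the compression policy job's last run and next scheduled start.
SELECT j.job_id,
       j.proc_name,
       js.last_run_status,
       js.last_run_started_at,
       js.next_start
FROM timescaledb_information.jobs AS j
JOIN timescaledb_information.job_stats AS js USING (job_id)
WHERE j.proc_name = 'policy_compression';
```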
I added another TB of disk space and it seems to have sorted itself out now. If this comes up again though, I'll open an issue.
What type of bug is this?
Performance issue
What subsystems and features are affected?
Compression
What happened?
We have a compression policy set on a hypertable. Currently, the job gets stuck (it runs for hours if we don't cancel it) and consumes a significant amount of CPU. If we remove the compression policy and re-add it, it immediately gets stuck again.
TimescaleDB version affected
2.11.1
PostgreSQL version used
14.9
What operating system did you use?
not applicable
What installation method did you use?
Not applicable
What platform did you run on?
Managed Service for TimescaleDB (MST/Aiven)
Relevant log output and stack trace
No response
How can we reproduce the bug?
We do not know concrete steps to reproduce it with certainty.
Before the problem occurred, we accidentally added a lot of old data (i.e. with timestamps far in the past) to our table, which resulted in a large number of new, uncompressed chunks (approximately 1,000). We have already removed all of those chunks manually, but with no improvement.

The accidental import of old data happened because we modified the code feeding data to our database, so currently, every now and then, we insert data that belongs to chunks that are already compressed.