Use space dimension as a retention setting. #563
Comments
Hi @PIdaho, would you mind sharing a bit more, at a high level, about the desired functionality you want, or the use case you have? The above feels like it's trying to shoehorn an existing mechanism into a use case, and I'd want to make sure there isn't a more elegant or first-class way of supporting your desired functionality.
We collect time-series data from many devices. The data samples are basically the same, but some are at higher resolution and some at lower resolution. We have requirements for different retention policies based on certain devices. Currently we manage these retention policies by putting data for the same policy in the same partitions; we can then drop partitions for quick deletes. It is basically one table with all of the related time-series data, where each device's data has its own retention policy. It seems like adding a partition column for the retention policy would make this work, as long as I know how partition columns are assigned to chunks and I have a safe way of dropping chunks by retention policy. TimescaleDB has some really nice features. We like being able to use the COPY command on the parent table instead of having to figure out which partition to COPY into. We also like the dynamic partition creation. These would greatly simplify our partition management.
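For context, a minimal sketch of the partition-per-policy approach described here, using plain PostgreSQL declarative partitioning. The table name, policy values, and time ranges are illustrative assumptions, not details from the comment.

```sql
-- One list partition per retention policy, sub-partitioned by time so that
-- expired time ranges can be dropped wholesale (names are illustrative).
CREATE TABLE samples (
    ts               timestamptz NOT NULL,
    retention_policy text        NOT NULL,
    device_id        int,
    value            double precision
) PARTITION BY LIST (retention_policy);

CREATE TABLE samples_30d PARTITION OF samples
    FOR VALUES IN ('30d')
    PARTITION BY RANGE (ts);

CREATE TABLE samples_30d_2024_01 PARTITION OF samples_30d
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- Quick delete: drop the sub-partition once the 30-day policy has expired it.
DROP TABLE samples_30d_2024_01;
```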
It seems like the approach you might prefer is better support for non-hash partitioning, so that you can specify a distinct "space" partitioning by setting some column value, and then extending … Will discuss with the team how that fits into the roadmap. Thanks!
Mike, non-hash partitioning would also help with multi-tenanted systems, where you frequently need different retention policies. We currently have to let the system retain data for as long as the customer who pays for the highest retention period, which isn't ideal.
@mfreed, thanks for your consideration of this issue. But what can I do today? I am confident that I can manipulate the partition column and a partitioning function to get data into the correct partitions. The real question comes down to: can I do the deletes with TimescaleDB's current implementation? I don't think so with …
I am also going to need similar functionality. One possible solution I have thought about is having a timestamp for when things should be deleted instead; that way you are back to a single column for partitioning. Then when you drop chunks, you can drop anything that you want to delete at now() or sooner.
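A minimal sketch of this "expiry timestamp" idea, assuming the TimescaleDB 2.x API (create_hypertable, drop_chunks); the table and column names are illustrative, not from the comment.

```sql
-- Partition on an expiry timestamp instead of the observation time, so each
-- chunk groups rows that become deletable at the same time.
CREATE TABLE metrics (
    observed_at timestamptz NOT NULL,
    expire_at   timestamptz NOT NULL,   -- observed_at + the row's retention interval
    device_id   int,
    value       double precision
);

SELECT create_hypertable('metrics', 'expire_at',
                         chunk_time_interval => INTERVAL '1 day');

-- Periodically drop every chunk whose rows are all past their expiry.
SELECT drop_chunks('metrics', older_than => now());
```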
+1 to this request. |
We happen to have exactly the same use case as @andrew-blake described above: different tenants have different retention expectations, although the whole set of retention policies is quite small (under 20 entries), so dropping chunks on the combination of section/time, rather than time only, would work very efficiently.

Working back to "set the time when it has to expire" would work for this one particular use case, but would not help us much from the data-read perspective, which is why we would use TimescaleDB in the first place, so this is not really a solution.

Ideally, drop_chunks() should allow additional/optional arguments to accept a spatial-component filter when applying the time component.
I'm also really interested in this. Consider a multi-tenant system where each tenant has its own retention policy (between 3 months and 10 years); being able to instantly drop chunks belonging to a single tenant would be incredibly efficient!
+1 |
For some follow-up: we currently recommend using separate hypertables if you want this functionality, e.g., being able to have different data retention policies. Separate hypertables also have secondary advantages, especially in multi-tenant scenarios; a minimal sketch of this setup follows below.
This is actually the approach taken by Promscale, where different Prometheus metrics are stored in separate hypertables: https://github.com/timescale/promscale
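To make the recommendation concrete, here is a minimal sketch assuming the TimescaleDB 2.x retention API (add_retention_policy); the table names and intervals are illustrative.

```sql
-- One hypertable per retention policy; each gets its own retention job.
CREATE TABLE metrics_30d (ts timestamptz NOT NULL, device_id int, value double precision);
CREATE TABLE metrics_90d (LIKE metrics_30d);

SELECT create_hypertable('metrics_30d', 'ts');
SELECT create_hypertable('metrics_90d', 'ts');

SELECT add_retention_policy('metrics_30d', INTERVAL '30 days');
SELECT add_retention_policy('metrics_90d', INTERVAL '90 days');
```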
I switched to the described solution after receiving the same recommendation via Slack (for the multi-tenant problem I described earlier in this thread). This wouldn't feel strange for non-multi-tenant scenarios, but "table-per-tenant" does feel like an odd setup; still, it does work and does provide the stated advantages.
+1, any plans for this request?
By the way, the table …
I have the same need, but from a different use case. I have data from different devices, but some of them are less interesting to me than others. Still, I don't want to delete the less interesting data; I would like to move the less interesting devices to another Postgres tablespace. I already have data in the table, and the program ingesting the data doesn't differentiate between devices. I can move some data to another hypertable and run a cron job on a daily basis, but I'm looking for a more 'native' solution.
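A hedged sketch of the chunk-level building block for this, assuming TimescaleDB 2.x's show_chunks() and move_chunk(); the hypertable, chunk, and tablespace names are illustrative. Note that this moves whole chunks, so it only separates the "less interesting" devices if the hypertable is partitioned in a way that puts those devices in their own chunks.

```sql
-- List candidate chunks (here: everything older than 90 days).
SELECT show_chunks('metrics', older_than => INTERVAL '90 days');

-- Move one chunk (and its indexes) to a cheaper tablespace.
SELECT move_chunk(
    chunk                  => '_timescaledb_internal._hyper_1_4_chunk',
    destination_tablespace => 'slow_storage'
);
```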
If we follow the current recommendation for a multi-tenant solution, where we use different tables for tenants with different retention policies, then we might have a "30 day" table, a "60 day" table, and a "90 day" table. This could work pretty well as long as we know the retention at insertion time (which might require an additional lookup). However, when a tenant changes their mind and switches their contract from 30 days to 90 days, we'd have to copy all of their data out of the 30-day table and into the 90-day table instead of simply updating the retention policy for that tenant. This would probably leave a lot of holes in the old table (is that a bad thing?), and it would also be tricky to keep the data consistent during the copy and cut over to the new table.

The alternative is that every tenant gets their own table, which could prove rather unwieldy to manage once you get into the thousands of clients.

Most likely, how I would implement this is to retain the data for the full length of time for everyone, but hide it from the customer in application logic. This doesn't save any storage, but is probably the least painful.

I think ideally you could set policies based on WHERE conditions. Without any WHERE condition it would be "where the age of the insert is > some interval"; otherwise it would be "where the age is > some interval AND the user_agent_string is null AND the customer is Big Corp".
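There is no built-in "retention by WHERE clause" today, so as a hypothetical stand-in, this is roughly what the manual version would look like; it runs as an ordinary DELETE and therefore misses the fast chunk-drop path. Table and column names are illustrative.

```sql
-- Hypothetical manual retention by condition: a plain DELETE, no chunk dropping.
DELETE FROM metrics
WHERE ts < now() - INTERVAL '30 days'   -- this tenant's contracted retention
  AND tenant_id = 42;                   -- illustrative tenant still on 30 days
```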
Any progress on this? The workaround with "one hypertable per client" is really just a workaround, and won't work for every case.
Since one hypertable per client felt like an over-engineered solution for our team's taste, we decided to use direct deletion of segments from compressed chunks. This gist can be helpful for understanding how to find compressed chunk names by date range: https://gist.github.com/yarkoyarok/3277a27987415b40368b53d70f348add

In the compressed chunks you can then find rows that consist of the compressed data columns plus the raw, uncompressed segment_by fields; these rows are, in effect, segments. Deleting them happens without decompression and proceeds very fast (<1 sec to drop a segment vs. ~1:30 minutes for the same amount of data deleted directly from the hypertable, in our case). So you can not only drop chunks, but, if you work with compressed data, drop more atomic parts: segments. More information on this technique can also be found in #5802, where I've proposed that it be used by the query planner.
+1 for this. A good first step would be making it easy to find chunks by both the primary (time) and secondary dimension, so that we can at least manually implement retention policies.
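Until there is first-class support, one way to do this manually is to join TimescaleDB's internal catalog tables. This is a hedged sketch against the 2.x internal schema, which is undocumented and may change; also note that for a hash-partitioned space dimension the slice ranges are hash-bucket boundaries rather than raw column values. The hypertable name is illustrative.

```sql
-- Sketch: list each chunk together with the dimension slices it covers.
SELECT ch.schema_name AS chunk_schema,
       ch.table_name  AS chunk_name,
       d.column_name  AS dimension,
       ds.range_start,                 -- time in microseconds, or a hash bucket
       ds.range_end
FROM _timescaledb_catalog.chunk ch
JOIN _timescaledb_catalog.hypertable h        ON h.id = ch.hypertable_id
JOIN _timescaledb_catalog.chunk_constraint cc ON cc.chunk_id = ch.id
JOIN _timescaledb_catalog.dimension_slice ds  ON ds.id = cc.dimension_slice_id
JOIN _timescaledb_catalog.dimension d         ON d.id = ds.dimension_id
WHERE h.table_name = 'metrics'
ORDER BY ch.table_name, d.column_name;
```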
+1 on this. The workaround seems very prone to SQL injection if implemented badly in SQL drivers.
+1 |
+1, but I'm mostly interested in changing the chunking policy by a space dimension, i.e., changing chunk time intervals according to a secondary dimension, to avoid small chunks for low-resolution data.
I am evaluating TimescaleDB and am interested in using the space dimension as a retention policy field.
The idea being that we could use a function similar to drop_chunks() to drop chunks out of dimensions based on their retention policies. Something like this: …

I know from reading that you hash the partition column into the different chunks. If I read correctly, you turn that hash into an integer and then divide that integer by the number of partitions for assignment (see the sketch below).
I assume that if I have dropped chunks and a new value comes in from the out-of-range past, a new chunk would be created regardless of my scheme described above.
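A hedged sketch of the setup described in this issue, using the standard create_hypertable space-partitioning parameters; the table, column, and partition count are illustrative assumptions.

```sql
-- Hypertable with a time dimension plus a hash-partitioned space dimension
-- that encodes the retention policy, as described above.
CREATE TABLE metrics (
    ts              timestamptz NOT NULL,
    retention_class int         NOT NULL,   -- e.g. 30, 60, 90 (days); illustrative
    device_id       int,
    value           double precision
);

SELECT create_hypertable('metrics', 'ts',
                         partitioning_column => 'retention_class',
                         number_partitions   => 4);
```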