
Proposal: support for exporting data before deleting it via drop_chunks #572

Closed
Ngalstyan4 opened this issue Jun 26, 2018 · 16 comments · Fixed by #642

Comments

@Ngalstyan4
Contributor

Ngalstyan4 commented Jun 26, 2018

As of now, TimescaleDB's drop_chunks provides an easy-to-use interface for deleting the chunks that fall entirely before a given time. Note that this is not about deleting all the data (rows) before the given time; rather, drop_chunks deletes the chunks whose time window ends before the specified point (i.e., based on the intervals configured during hypertable creation).

This is useful performance-wise: dropping a chunk essentially just deletes an entire file from disk, while deleting individual rows goes through the full MVCC process and must later be garbage-collected by VACUUM.
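
For illustration, a minimal sketch of the two approaches, assuming a hypertable test_table with a time column named time:

-- Row-wise deletion: every deleted row leaves behind a dead tuple that
-- VACUUM must later reclaim.
DELETE FROM test_table WHERE time < now() - interval '1 week';
VACUUM test_table;

-- Chunk-wise deletion: drops entire chunk tables, which amounts to
-- unlinking their files on disk.
SELECT drop_chunks(now() - interval '1 week', 'test_table');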

However, a common use case is to dump such data into some kind of cold storage before deleting it. There is currently no easy-to-use API in Timescale for this; it can be achieved manually using the _timescaledb_catalog.chunk table. The following is a proposal to integrate such functionality into the Timescale API by adding the following two functions:

  • show_chunks - takes a hypertable along with a time indicator and returns the list of chunk tables representing chunks that end before the given time. Note that drop_chunks would use show_chunks to figure out which chunks need to be dropped. This means the user could also call show_chunks to find out which chunks would be affected by drop_chunks (as a sanity check or for some other reason).
  • export_chunks - takes the parameters above along with a file-format string and exports CSV file(s) to the location(s) specified by the format string.

Below are some use cases along with the resulting SQL:

  • Export the data older than one week into a single CSV file without deleting it:
SELECT export_chunks('test_table', now() - interval '1 week',
                     '/tmp/%hypertable_name%_%timestamp%.csv',
                     csv_per_chunk => FALSE);
  • Export the data older than one week into CSV files (one per chunk) and delete it:
BEGIN;
SELECT export_chunks('test_table', now() - interval '1 week',
                     '/tmp/%hypertable_name%_%chunk_name%.csv');
SELECT drop_chunks(now() - interval '1 week', 'test_table');
COMMIT;

And this is the proposed API:

--hypertable_name - name of the hypertable
--older_than - a time indicator constraint defining the end of the
--  time range window from which chunks are selected
--returns - a set of the chunk tables that satisfy the constraint
FUNCTION show_chunks (
     hypertable_name REGCLASS,
     older_than INTERVAL = NULL
)  RETURNS SETOF REGCLASS

--hypertable_name - name of the hypertable
--older_than - a time indicator constraint defining the end of the
--  time range window from which chunks are selected
--format - formatting of the file name(s) for the output.
--  The value needs to be an absolute file path
--  with the following variables supported:
--  %hypertable_name%, %chunk_name%, %chunk_starttime%, %chunk_endtime%,
--  %epoch%, %timestamp%
--overwrite - if true, will overwrite existing files
--csv_per_chunk - if true, will produce one CSV file per chunk table;
--  otherwise, will produce a single CSV file for all of the data.
--  Note that format should contain a chunk-identifying variable
--  if and only if csv_per_chunk is set to true
FUNCTION export_chunks(
     hypertable_name REGCLASS,
     older_than INTERVAL,
     format TEXT,
     overwrite BOOLEAN = FALSE,
     csv_per_chunk BOOLEAN = TRUE
)  RETURNS VOID

Note that export_chunks would perform a server-side copy.
A client-side copy (and other desired business-specific export functionality) can be achieved with the help of a simple Python script and the show_chunks function.
As an example, this is how chunks could be exported to separate files:

import os
import sys

import psycopg2

conn = psycopg2.connect("dbname=postgres user=myuser")
cur = conn.cursor()

# Find the chunks whose time range ends before the cutoff.
cur.execute("SELECT show_chunks('test_table', now() - interval '1 week');")
records = cur.fetchall()

for record in records:
    chunk = record[0]
    path = "/tmp/{hypertable}/{chunk}_dump.csv".format(
        hypertable="test_table", chunk=chunk)

    # Refuse to clobber files from a previous export
    # (alternatively, remove already-exported files here).
    if os.path.isfile(path):
        sys.exit("file %s exists" % path)

    if not os.path.exists(os.path.dirname(path)):
        os.makedirs(os.path.dirname(path))

    # Client-side copy: stream the chunk's rows into a local CSV file.
    with open(path, 'w') as dest:
        cur.copy_expert(
            "COPY {chunk} TO STDOUT WITH CSV HEADER".format(chunk=chunk),
            dest)

# Finally, drop the chunks that were just exported.
cur.execute("SELECT drop_chunks(now() - interval '1 week', 'test_table')")
conn.commit()
cur.close()
conn.close()
@TSheahan

Without having much of substance to add, I'll say this looks like a suitable & desirable feature as described.

@jamessewell
Contributor

jamessewell commented Jun 27, 2018

This actually looks really good - you could use show_chunks to do data aging to tablespaces hosted on slower disk (or even to other servers via foreign tables) as well.
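
For instance, a minimal sketch of that kind of data aging, assuming the proposed show_chunks lands as described and that a tablespace named history on slower disks already exists:

DO $$
DECLARE
    chunk REGCLASS;
BEGIN
    -- Move every chunk older than one month onto the slower tablespace.
    FOR chunk IN SELECT show_chunks('test_table', now() - interval '1 month')
    LOOP
        EXECUTE format('ALTER TABLE %s SET TABLESPACE history', chunk);
    END LOOP;
END
$$;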

@eduardotsj

This could be really useful for me, and I have an additional requirement: a newer_than filter!

@dianasaur323
Contributor

dianasaur323 commented Aug 14, 2018

@eduardotsj interesting, do you mind providing more detail on the use case for your newer_than filter?

@eduardotsj

This is my main use case for export_chunks with a newer_than option:

  • We deploy our solution on premises, and when we need to set up a good dev DB (mainly when we have to do complex bug analysis), we need to bring a copy of the DB to our office over a poor network connection. Our current DB is already about 250 GB, so it's impossible to download in a viable time. So we dump the reference tables (excluding the _timescaledb_internal schema) and then dump the newest chunks individually. We get the list using this query:
    SELECT chunk_id, chunk_table,
           to_timestamp(lower(ranges[1])/1000000) AS range_start,
           to_timestamp(upper(ranges[1])/1000000) AS range_end
    FROM chunk_relation_size('{Hypertable}')
    WHERE to_timestamp(upper(ranges[1])/1000000) > now();
    For debugging, old data is usually not relevant.
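
Under the proposed API, this could be covered by a hypothetical newer_than parameter mirroring older_than, e.g.:

-- Hypothetical: return only the chunks whose time range starts after the
-- given point, so just the newest data gets dumped.
SELECT show_chunks('test_table', newer_than => now() - interval '1 week');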

@dianasaur323
Contributor

@eduardotsj excellent, that makes perfect sense. Thank you for the additional context. I'll mark you down as being interested in this feature, and update you as we have more progress.

@alanhamlett

Export the data older than one week into CSV files (one per chunk) and delete it

What happens if changes are made to the chunk while it's being exported? Would updates be lost when calling drop_chunks?

I'd only use this if drop_chunks supported an AS OF SYSTEM TIME clause or some way to guarantee any updates during the export are not lost.
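
Short of an AS OF SYSTEM TIME clause, one possible mitigation (a sketch, not part of the proposal, assuming the proposed functions exist) is to take an exclusive lock on each affected chunk up front, so any concurrent write to those chunks blocks until the export-and-drop transaction completes:

BEGIN;
-- Lock every chunk that will be exported; concurrent writes to these
-- chunks block until the transaction ends.
DO $$
DECLARE
    chunk REGCLASS;
BEGIN
    FOR chunk IN SELECT show_chunks('test_table', now() - interval '1 week')
    LOOP
        EXECUTE format('LOCK TABLE %s IN ACCESS EXCLUSIVE MODE', chunk);
    END LOOP;
END
$$;
SELECT export_chunks('test_table', now() - interval '1 week',
                     '/tmp/%hypertable_name%_%chunk_name%.csv');
SELECT drop_chunks(now() - interval '1 week', 'test_table');
COMMIT;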

@alanhamlett

alanhamlett commented Sep 23, 2018

Related to #285, #350, #563, #642.

@dianasaur323
Contributor

@alanhamlett We are now actively working on this feature. Regarding what happens when changes are made during drop_chunks: we are thinking of supporting transactional semantics, so any changes that occur on a chunk actively being dropped will be rejected. How does this approach sound?

@alanhamlett

That would work for me, since the application layer could handle any write failures gracefully.

@dianasaur323
Contributor

Wonderful, thank you for the feedback @alanhamlett! Will re-ping here once I have more updates.

@dianasaur323
Contributor

Not implemented yet, so re-opening.

@pv97

pv97 commented Jan 7, 2019

@dianasaur323 is there an estimated timeline for this feature? Thanks!

@buzz1000

Support for a workflow like this would be very helpful:

  1. Back up data older than, e.g., 7 days, i.e., similar to drop_chunks() but instead of dropping the chunks, back them up
  2. Drop the same data that was just backed up, i.e., data older than 7 days; drop_chunks() would do this
  3. At a later time, when needed, restore one or more of the backed-up chunks

This way, I could, e.g., create daily backups of data older than 7 days and limit the active database to just 7 days' worth of data. Then, when I need to look at older data for troubleshooting, analysis, or something else, I can restore one or more of the daily chunk backups to another instance of TimescaleDB and work with them as needed.
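
Under the proposed API, that workflow might look like the following sketch. The file names, the chunk name in the restore step, and the restored table are all illustrative, and the restore assumes the hypertable's schema already exists on the other instance:

-- Daily job: back up chunks older than 7 days, then drop them.
BEGIN;
SELECT export_chunks('test_table', now() - interval '7 days',
                     '/backups/%hypertable_name%_%chunk_name%.csv');
SELECT drop_chunks(now() - interval '7 days', 'test_table');
COMMIT;

-- Later, on another TimescaleDB instance: restore one backed-up chunk into
-- a plain table for analysis (the CSVs were written with a header row).
CREATE TABLE test_table_restored (LIKE test_table INCLUDING DEFAULTS);
COPY test_table_restored
    FROM '/backups/test_table__hyper_1_2_chunk.csv' WITH CSV HEADER;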

@dianasaur323
Contributor

@pv97 apologies for taking a while to respond. We are actually working on a different feature in this release, so there's no specific timeline yet. For now, you'll have to go with a more manual approach.

@svenklemm
Member

In 2.0 you will be able to implement this with a user-defined action.
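
For reference, a minimal sketch of such a user-defined action in TimescaleDB 2.0. The procedure name, hypertable, and output directory are illustrative, and writing server-side files requires appropriate permissions (e.g., the pg_write_server_files role):

CREATE OR REPLACE PROCEDURE export_and_drop(job_id INT, config JSONB)
LANGUAGE PLPGSQL AS $$
DECLARE
    chunk REGCLASS;
BEGIN
    -- Server-side copy of each expiring chunk to its own CSV file.
    FOR chunk IN SELECT show_chunks('test_table', older_than => interval '1 week')
    LOOP
        EXECUTE format('COPY %s TO %L WITH CSV HEADER',
                       chunk, '/tmp/' || chunk || '.csv');
    END LOOP;
    -- Then drop the exported chunks.
    PERFORM drop_chunks('test_table', older_than => interval '1 week');
END
$$;

-- Run the action once a day.
SELECT add_job('export_and_drop', interval '1 day');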
