
Proposal: support for exporting data before deleting it via drop_chunks #572

Closed
Ngalstyan4 opened this issue Jun 26, 2018 · 16 comments · Fixed by #642

Comments

@Ngalstyan4
Contributor

Ngalstyan4 commented Jun 26, 2018

As of now, TimescaleDB's drop_chunks provides an easy-to-use interface for deleting the chunks that fall entirely before a given time. Note that this is not about deleting all the data (rows) before the given time; rather, drop_chunks deletes the chunks whose time window ends before the specified point (i.e., based on the intervals configured during hypertable creation).

This is useful performance-wise: dropping a chunk essentially just deletes an entire file from disk, while deleting individual rows goes through the full MVCC process and must later be garbage-collected by VACUUM.
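
For illustration, a minimal sketch of the two approaches, assuming a hypertable test_table with a time column named time:

-- Row-wise deletion: every deleted row leaves behind a dead tuple that
-- VACUUM must later reclaim.
DELETE FROM test_table WHERE time < now() - interval '1 week';
VACUUM test_table;

-- Chunk-wise deletion: drops entire chunk tables, which amounts to
-- unlinking their files on disk.
SELECT drop_chunks(now() - interval '1 week', 'test_table');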

However, a common use case is to dump such data into some kind of cold storage before deleting it. There is currently no easy-to-use API in Timescale for this; it can be achieved manually using the _timescaledb_catalog.chunk table. The following is a proposal to integrate such functionality into the Timescale API by adding the following two functions:

  • show_chunks - takes a hypertable along with a time indicator and returns the list of chunk tables representing chunks that end before the given time. Note that drop_chunks would use show_chunks to figure out which chunks need to be dropped. This means the user could also call show_chunks to find out which chunks would be affected by drop_chunks (as a sanity check or for some other reason).
  • export_chunks - takes the parameters above along with a file-format string and exports CSV file(s) to the location(s) specified by the format string.

Below are some use cases along with the resulting SQL:

  • Export the data older than one week into a single CSV file without deleting it:
SELECT export_chunks('test_table', now() - interval '1 week',
                     '/tmp/%hypertable_name%_%timestamp%.csv',
                     csv_per_chunk => FALSE);
  • Export the data older than one week into CSV files (one per chunk) and delete it:
BEGIN;
SELECT export_chunks('test_table', now() - interval '1 week',
                     '/tmp/%hypertable_name%_%chunk_name%.csv');
SELECT drop_chunks(now() - interval '1 week', 'test_table');
COMMIT;

And this is the proposed API:

--hypertable_name - name of the hypertable
--older_than - a time indicator constraint defining the end of the
--  time range window from which chunks are selected
--returns - a set of the chunk tables that satisfy the constraint
FUNCTION show_chunks (
     hypertable_name REGCLASS,
     older_than INTERVAL = NULL
)  RETURNS SETOF REGCLASS

--hypertable_name - name of the hypertable
--older_than - a time indicator constraint defining the end of the
--  time range window from which chunks are selected
--format - formatting of the file name(s) for the output.
--  The value needs to be an absolute file path
--  with the following variables supported:
--  %hypertable_name%, %chunk_name%, %chunk_starttime%, %chunk_endtime%,
--  %epoch%, %timestamp%
--overwrite - if true, will overwrite existing files
--csv_per_chunk - if true, will produce one CSV file per chunk table;
--  otherwise, will produce a single CSV file for all of the data.
--  Note that format should contain a chunk-identifying variable
--  if and only if csv_per_chunk is set to true
FUNCTION export_chunks(
     hypertable_name REGCLASS,
     older_than INTERVAL,
     format TEXT,
     overwrite BOOLEAN = FALSE,
     csv_per_chunk BOOLEAN = TRUE
)  RETURNS VOID

Note that export_chunks would perform a server-side copy.
A client-side copy (and other desired business-specific export functionality) can be achieved with the help of a simple Python script and the show_chunks function.
As an example, this is how chunks could be exported to separate files:

import os
import sys

import psycopg2

conn = psycopg2.connect("dbname=postgres user=myuser")
cur = conn.cursor()

# Find the chunks whose time range ends before the cutoff.
cur.execute("SELECT show_chunks('test_table', now() - interval '1 week');")
records = cur.fetchall()

for record in records:
    chunk = record[0]
    path = "/tmp/{hypertable}/{chunk}_dump.csv".format(
        hypertable="test_table", chunk=chunk)

    # Refuse to clobber files from a previous export
    # (alternatively, remove already-exported files here).
    if os.path.isfile(path):
        sys.exit("file %s exists" % path)

    if not os.path.exists(os.path.dirname(path)):
        os.makedirs(os.path.dirname(path))

    # Client-side copy: stream the chunk's rows into a local CSV file.
    with open(path, 'w') as dest:
        cur.copy_expert(
            "COPY {chunk} TO STDOUT WITH CSV HEADER".format(chunk=chunk),
            dest)

# Finally, drop the chunks that were just exported.
cur.execute("SELECT drop_chunks(now() - interval '1 week', 'test_table')")
conn.commit()
cur.close()
conn.close()
@TSheahan

Without having much of substance to add, I'll say this looks like a suitable & desirable feature as described.

@jamessewell
Contributor

jamessewell commented Jun 27, 2018

This actually looks really good - you could use show_chunks to do data aging to tablespaces hosted on slower disk (or even to other servers via foreign tables) as well.
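
For instance, a minimal sketch of that kind of data aging, assuming the proposed show_chunks lands as described and that a tablespace named history on slower disks already exists:

DO $$
DECLARE
    chunk REGCLASS;
BEGIN
    -- Move every chunk older than one month onto the slower tablespace.
    FOR chunk IN SELECT show_chunks('test_table', now() - interval '1 month')
    LOOP
        EXECUTE format('ALTER TABLE %s SET TABLESPACE history', chunk);
    END LOOP;
END
$$;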

@eduardotsj

This could be really useful for me, and I have an additional requirement: a newer_than filter!

@dianasaur323
Contributor

dianasaur323 commented Aug 14, 2018

@eduardotsj interesting, do you mind providing more detail on the use case for your newer_than filter?

@eduardotsj

This is my main use case for export_chunks with a newer_than option:

  • We deploy our solution on premises, and when we need to set up a good dev DB (mainly when we have to do complex bug analysis), we need to bring a copy of the DB to our office over a poor network connection. Our current DB is already about 250 GB, so it's impossible to download in a viable time. So we dump the reference tables (excluding the _timescaledb_internal schema) and then dump the newest chunks individually. We get the list using this query:
    SELECT chunk_id, chunk_table,
           to_timestamp(lower(ranges[1])/1000000) AS range_start,
           to_timestamp(upper(ranges[1])/1000000) AS range_end
    FROM chunk_relation_size('{Hypertable}')
    WHERE to_timestamp(upper(ranges[1])/1000000) > now();
    For debugging, old data is usually not relevant.
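
Under the proposed API, this could be covered by a hypothetical newer_than parameter mirroring older_than, e.g.:

-- Hypothetical: return only the chunks whose time range starts after the
-- given point, so just the newest data gets dumped.
SELECT show_chunks('test_table', newer_than => now() - interval '1 week');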

@dianasaur323
Contributor

@eduardotsj excellent, that makes perfect sense. Thank you for the additional context. I'll mark you down as being interested in this feature, and update you as we have more progress.

@alanhamlett

Export the data older than one week into CSV files (one per chunk) and delete it

What happens if changes are made to the chunk while it's being exported? Would updates be lost when calling drop_chunks?

I'd only use this if drop_chunks supported an AS OF SYSTEM TIME clause or some way to guarantee any updates during the export are not lost.
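
Short of an AS OF SYSTEM TIME clause, one possible mitigation (a sketch, not part of the proposal, assuming the proposed functions exist) is to take an exclusive lock on each affected chunk up front, so any concurrent write to those chunks blocks until the export-and-drop transaction completes:

BEGIN;
-- Lock every chunk that will be exported; concurrent writes to these
-- chunks block until the transaction ends.
DO $$
DECLARE
    chunk REGCLASS;
BEGIN
    FOR chunk IN SELECT show_chunks('test_table', now() - interval '1 week')
    LOOP
        EXECUTE format('LOCK TABLE %s IN ACCESS EXCLUSIVE MODE', chunk);
    END LOOP;
END
$$;
SELECT export_chunks('test_table', now() - interval '1 week',
                     '/tmp/%hypertable_name%_%chunk_name%.csv');
SELECT drop_chunks(now() - interval '1 week', 'test_table');
COMMIT;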

@alanhamlett

alanhamlett commented Sep 23, 2018

Related to #285, #350, #563, #642.

@dianasaur323
Contributor

@alanhamlett We are now actively working on this feature. Regarding what happens when changes are made during drop_chunks: we are thinking of supporting transactional semantics, so any changes that occur on a chunk actively being dropped will be rejected. How does this approach sound?

@alanhamlett

That would work for me, since the application layer could handle any write failures gracefully.

@dianasaur323
Contributor

Wonderful, thank you for the feedback @alanhamlett! Will re-ping here once I have more updates.

@dianasaur323
Contributor

Not implemented yet, so re-opening.

@pv97

pv97 commented Jan 7, 2019

@dianasaur323 is there an estimated timeline for this feature? Thanks!

@buzz1000

Support for a workflow like this would be very helpful:

  1. Back up data older than, e.g., 7 days, i.e., similar to drop_chunks() but instead of dropping the chunks, back them up
  2. Drop the same data that was just backed up, i.e., data older than 7 days; drop_chunks() would do this
  3. At a later time, when needed, restore one or more of the backed-up chunks

This way, I could, e.g., create daily backups of data older than 7 days and limit the active database to just 7 days' worth of data. Then, when I need to look at older data for troubleshooting, analysis, or something else, I can restore one or more of the daily chunk backups to another instance of TimescaleDB and work with them as needed.
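
Under the proposed API, that workflow might look like the following sketch. The file names, the chunk name in the restore step, and the restored table are all illustrative, and the restore assumes the hypertable's schema already exists on the other instance:

-- Daily job: back up chunks older than 7 days, then drop them.
BEGIN;
SELECT export_chunks('test_table', now() - interval '7 days',
                     '/backups/%hypertable_name%_%chunk_name%.csv');
SELECT drop_chunks(now() - interval '7 days', 'test_table');
COMMIT;

-- Later, on another TimescaleDB instance: restore one backed-up chunk into
-- a plain table for analysis (the CSVs were written with a header row).
CREATE TABLE test_table_restored (LIKE test_table INCLUDING DEFAULTS);
COPY test_table_restored
    FROM '/backups/test_table__hyper_1_2_chunk.csv' WITH CSV HEADER;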

@dianasaur323
Contributor

@pv97 apologies for taking a while to respond. We are actually working on a different feature in this release, so there's no specific timeline yet. For now, you'll have to go with a more manual approach.

@svenklemm
Member

In 2.0 you will be able to implement this with a user-defined action.
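
For reference, a minimal sketch of such a user-defined action in TimescaleDB 2.0. The procedure name, hypertable, and output directory are illustrative, and writing server-side files requires appropriate permissions (e.g., the pg_write_server_files role):

CREATE OR REPLACE PROCEDURE export_and_drop(job_id INT, config JSONB)
LANGUAGE PLPGSQL AS $$
DECLARE
    chunk REGCLASS;
BEGIN
    -- Server-side copy of each expiring chunk to its own CSV file.
    FOR chunk IN SELECT show_chunks('test_table', older_than => interval '1 week')
    LOOP
        EXECUTE format('COPY %s TO %L WITH CSV HEADER',
                       chunk, '/tmp/' || chunk || '.csv');
    END LOOP;
    -- Then drop the exported chunks.
    PERFORM drop_chunks('test_table', older_than => interval '1 week');
END
$$;

-- Run the action once a day.
SELECT add_job('export_and_drop', interval '1 day');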
