Support time_bucket_gapfill in continuous aggregates #1324

Open
andness opened this issue Jun 28, 2019 · 20 comments

Comments

@andness

andness commented Jun 28, 2019

I tried to create a continuous aggregate using time_bucket_gapfill but was disappointed to find this was not possible. Are there plans to support this?

@davidkohn88
Contributor

This isn't possible, but you can create the continuous aggregate using the normal time_bucket and then apply time_bucket_gapfill when you query the view; that should pretty closely approximate what you're trying to do.
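
A minimal sketch of that pattern, assuming a hypothetical hypertable conditions(ts, device_id, temperature):

-- continuous aggregate using the normal time_bucket
CREATE MATERIALIZED VIEW conditions_hourly
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 hour', ts) AS bucket,
    device_id,
    AVG(temperature) AS avg_temp
FROM conditions
GROUP BY bucket, device_id;

-- gapfill at query time, on top of the continuous aggregate
-- (time_bucket_gapfill needs a bounded time range in the WHERE clause)
SELECT
    time_bucket_gapfill('1 hour', bucket) AS gapfilled_bucket,
    device_id,
    AVG(avg_temp) AS avg_temp
FROM conditions_hourly
WHERE bucket >= '2019-06-01' AND bucket < '2019-06-08'
GROUP BY gapfilled_bucket, device_id;

Note that the outer AVG only passes through the single pre-aggregated value per bucket here; for other aggregates the re-aggregation needs more care.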

@andness
Author

andness commented Jul 1, 2019

@davidkohn88, I don't think that would solve my case because I need to join on the time column, and to make that efficient I need it indexed.

To give a little background on what I'm doing: I have a property of my data which is stored as intervals in the source data, i.e. "the device was in state X between time T1 and T2". The way I solve this is brute force: for each hour of the day I materialize the state that the device is in for that hour. This means that the rows:

["Device1", "StateA", 12:20, 17:30]
["Device1", "StateB", 17:30, 21:10]

Expand into

["Device1", "StateA", 12:00]
["Device1", "StateA", 13:00]
["Device1", "StateA", 14:00]
["Device1", "StateA", 15:00]
["Device1", "StateA", 16:00]
["Device1", "StateA", 17:00]
["Device1", "StateB", 18:00]
["Device1", "StateB", 19:00]
["Device1", "StateB", 20:00]
["Device1", "StateB", 21:00]
["Device1", "StateB", 22:00]

Using this approach I can get the state of the device for any given hour of the day by a simple join against this table.

Not the most intelligent approach, but it's simple and straightforward, at the expense of using some disk space. I tried using a join query that would essentially look up the last device state value for each hour of the day, and as you can imagine, that query did not perform very well.
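
A rough sketch of that expansion, assuming a hypothetical source table device_states(device_id, state, start_time, end_time) and a hypothetical target table device_state_hourly(device_id, state, hour); the boundary handling (which state "owns" the hour an interval starts or ends in) would need tuning to match the rows above:

INSERT INTO device_state_hourly (device_id, state, hour)
SELECT
    s.device_id,
    s.state,
    h.hour
FROM device_states s
-- expand each interval into one row per hourly bucket it touches
CROSS JOIN LATERAL generate_series(
    time_bucket('1 hour', s.start_time),
    time_bucket('1 hour', s.end_time),
    '1 hour'::INTERVAL
) AS h(hour);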

@anton-brass

That only works if you do not have state changes during a full hour; otherwise it will be inaccurate.

@davidkohn88
Contributor

I didn't see the previous comment until now, but I will also say that the join, if structured correctly (as a LATERAL with ORDER BY time DESC LIMIT 1, and the correct indexes), should be reasonably performant. You could also build a GiST index on a range expression together with the device_id (using the btree_gist extension) and then do a containment query, which should also be reasonably performant.
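
Sketches of both approaches, assuming a hypothetical table device_states(device_id, state, start_time, end_time) and a hypothetical set of hourly timestamps to resolve, hourly_timestamps(hour, device_id):

-- 1. "last state at or before the hour" via a LATERAL subquery,
--    backed by a (device_id, start_time DESC) index
CREATE INDEX ON device_states (device_id, start_time DESC);

SELECT ts.hour, ts.device_id, ds.state
FROM hourly_timestamps ts
CROSS JOIN LATERAL (
    SELECT state
    FROM device_states
    WHERE device_id = ts.device_id
      AND start_time <= ts.hour
    ORDER BY start_time DESC
    LIMIT 1
) ds;

-- 2. containment query on a range expression, backed by a GiST index
--    (btree_gist is needed to include the scalar device_id column)
CREATE EXTENSION IF NOT EXISTS btree_gist;
CREATE INDEX ON device_states
    USING GIST (device_id, tstzrange(start_time, end_time));

SELECT ts.hour, ts.device_id, ds.state
FROM hourly_timestamps ts
JOIN device_states ds
  ON ds.device_id = ts.device_id
 AND tstzrange(ds.start_time, ds.end_time) @> ts.hour;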

I think there are some other issues around this that I remember hearing about, namely wanting to calculate the time spent in a given state over a period of time. That one is a little harder to accomplish nicely right now, and I've opened an issue to do that sort of thing in Timescale Analytics: timescale/timescaledb-toolkit#117

@danthegoodman1

This is definitely a pain point for us as well. We have to apply it outside, in the query on the cagg, which complicates our SQL statements since we have to double up on some aggregate functions (which don't end up doing anything but make the query more complicated).

@Mykyta-Chernenko

Hi, I need this for my use case because doing it at query time is not performant enough. Are there any plans to release this functionality?

@chrishop

I'm also having this same issue, and the same requirement as @Mykyta-Chernenko. We need to gapfill within the continuous aggregate for performance reasons.

@melicheradam

+1, this would really be appreciated.

@leppaott

leppaott commented Oct 20, 2023

+1, we'd like to use state_agg() and just LOCF the previous state into the next bucket when there is no state change. The things people do in timescale/timescaledb-toolkit#479 (comment) seem complicated.

@jflambert

+1, bummed to discover this has been an open issue for almost 5 years! The workaround isn't doing it for me.

@jrodertqc

+1 would be very useful to have this feature.

@goncalvesnelson

+1. This feature would be great and simplify some of our queries and improve overall performance.

@leppaott

It would be very handy to be able to configure the refresh policy to run at the start of every bucket, so that the first sample would always already be there.

@xerubia

xerubia commented May 10, 2024

Would love this feature; I have the exact same problem that @andness mentioned at the start. Would love to know if there's a viable solution.

@johanatan

+1 on this feature.

@ljt7560868

This problem has been going on for 5 years; why hasn't it been resolved yet?!

@landeholt

Gapfill is an extremely important part of time series work for getting evenly spaced data. Too bad it can't be used in caggs. Why is that? Is it deemed too computationally expensive?

@jkittner

Has anyone found a workaround for this?
Running time_bucket_gapfill for each query is not a real solution to the problem, since you just want gapfill and no further aggregation!

I implemented the gapfill myself using CTEs, but surprise, CTEs are not supported by continuous aggregates. I then changed the query to use subqueries instead, but surprise, subqueries are not supported by continuous aggregates either. So no solution found.

So far, every time I've touched caggs it's been a disaster (e.g. getting right-labelled buckets is already a nightmare, and just thinking about timezones will give you a heart attack).

@jkittner

Maybe a more useful comment than my previous one: I was able to achieve what I wanted by ditching the Timescale-specific materialized view entirely and only using a vanilla Postgres materialized view with the regular time_bucket function.

-- define two tables that are connected via a foreign key
CREATE TABLE stations(
    id BIGINT PRIMARY KEY
);

CREATE TABLE data(
    date TIMESTAMPTZ,
    id BIGINT REFERENCES stations(id),
    measurement NUMERIC,
    PRIMARY KEY(date, id)
);

-- add some example data to showcase the result
INSERT INTO stations(id) VALUES (1), (2);

INSERT INTO data(date, id, measurement) VALUES
    ('2025-01-01 09:15', 1, 0),
    ('2025-01-01 10:15', 1, 2),
    ('2025-01-01 10:45', 1, 4),
    ('2025-01-01 12:15', 1, 6),
    ('2025-01-01 12:45', 1, 8),
    ('2025-01-01 10:15', 2, 3),
    ('2025-01-01 10:45', 2, 6),
    ('2025-01-01 13:15', 2, 9),
    ('2025-01-01 13:45', 2, 12),
    ('2025-01-01 14:15', 2, 15);

-- define the gap-filled materialized view
CREATE MATERIALIZED VIEW data_hourly AS
-- get the start and end date of measurements per station,
-- so we don't extrapolate the gapfilling, but only fill between
-- existing values
WITH data_bounds AS (
    SELECT
        id,
        MIN(date) AS start_time,
        MAX(date) AS end_time
    FROM data
    GROUP BY id
),
-- generate a complete time series that contains all possible dates for any
-- station present
gapfiller_time_series AS (
    SELECT generate_series(
        (SELECT MIN(date) FROM data),
        (SELECT MAX(date) FROM data),
        '1 hour'::INTERVAL
    ) AS date
),
-- based on the gapfiller time series, generate potential fillers for each
-- id in stations. Make sure that filler data is only generated between the
-- earliest and latest measurement per individual station
time_station_combinations AS (
    SELECT
        date,
        stations.id AS id,
        start_time,
        end_time
    FROM gapfiller_time_series
    CROSS JOIN stations
    JOIN data_bounds ON data_bounds.id = stations.id
    WHERE
        gapfiller_time_series.date >= data_bounds.start_time AND
        gapfiller_time_series.date <= data_bounds.end_time
),
-- now combine both the actual measurements and the filler data, intentionally
-- creating duplicates, which, however, will be eliminated in a later step
filled_data AS (
    (
        SELECT
            date,
            id,
            NULL AS measurement
        FROM time_station_combinations
    )
    UNION ALL
    (
        SELECT
            date,
            id,
            measurement
        FROM data
    )
)
-- Now use the regular time_bucket function to calculate hourly averages.
-- This will produce NULL rows where there are no values; otherwise
-- NULLs are ignored when the aggregations contain actual values
SELECT
    time_bucket('1 hour', date) AS hourly_bucket,
    id,
    AVG(measurement)
FROM filled_data
GROUP BY hourly_bucket, id
ORDER BY hourly_bucket, id DESC;

-- to be able to refresh the view concurrently
CREATE UNIQUE INDEX ON data_hourly(id, hourly_bucket);
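
The unique index makes it possible to refresh the view concurrently, e.g. on a schedule:

REFRESH MATERIALIZED VIEW CONCURRENTLY data_hourly;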

With this setup the gaps are filled like this:

SELECT * FROM data_hourly ORDER BY id, hourly_bucket;
     hourly_bucket      | id |          avg
------------------------+----+------------------------
 2025-01-01 09:00:00+00 |  1 | 0.00000000000000000000
 2025-01-01 10:00:00+00 |  1 |     3.0000000000000000
 2025-01-01 11:00:00+00 |  1 |
 2025-01-01 12:00:00+00 |  1 |     7.0000000000000000
 2025-01-01 10:00:00+00 |  2 |     4.5000000000000000
 2025-01-01 11:00:00+00 |  2 |
 2025-01-01 12:00:00+00 |  2 |
 2025-01-01 13:00:00+00 |  2 |    10.5000000000000000
 2025-01-01 14:00:00+00 |  2 |    15.0000000000000000

This is IMO an improved behavior compared to the plain time_bucket_gapfill. Of course, one could drop that behavior by simply removing the data_bounds logic and its conditional join after the CROSS JOIN.

SELECT
    time_bucket_gapfill('1 hour', date) AS hourly_bucket,
    id,
    AVG(measurement) AS measurement
FROM data
WHERE date BETWEEN '2025-01-01 09:15' AND '2025-01-01 14:15'
GROUP BY hourly_bucket, id
ORDER BY id, hourly_bucket;
     hourly_bucket      | id |      measurement
------------------------+----+------------------------
 2025-01-01 09:00:00+00 |  1 | 0.00000000000000000000
 2025-01-01 10:00:00+00 |  1 |     3.0000000000000000
 2025-01-01 11:00:00+00 |  1 |
 2025-01-01 12:00:00+00 |  1 |     7.0000000000000000
 2025-01-01 13:00:00+00 |  1 |
 2025-01-01 14:00:00+00 |  1 |
 2025-01-01 09:00:00+00 |  2 |
 2025-01-01 10:00:00+00 |  2 |     4.5000000000000000
 2025-01-01 11:00:00+00 |  2 |
 2025-01-01 12:00:00+00 |  2 |
 2025-01-01 13:00:00+00 |  2 |    10.5000000000000000
 2025-01-01 14:00:00+00 |  2 |    15.0000000000000000

Performance-wise, my workaround will likely do better than the Timescale materialized view for views with fewer rows, but for very large views performance will likely be worse. So this will only be a solution for up to a few hundred thousand rows...

I hope this will be useful for some people who don't need all the features of a Timescale-native materialized view!

@tukusejssirs

I am affected too. As a workaround, I create a trigger function on insert/delete/update which updates the affected time buckets (for either old or new data being added or removed, including the period between the previous and the next time bucket around the affected buckets) in a hypertable (not a plain table or materialised view) that uses the shortest time bucket; from that hypertable I then create materialised views with caggs for the other, larger time buckets.

It is not a perfect solution, but it works.
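
A very rough sketch of that trigger approach, reusing the data table from the previous comment and a hypothetical base hypertable data_hourly_base(bucket, id, measurement). For simplicity it only recomputes the single bucket touched by the changed row; an UPDATE that moves a row between buckets, and the gapfilling of the period between neighbouring buckets, would need extra handling:

CREATE OR REPLACE FUNCTION refresh_base_buckets() RETURNS trigger AS $$
DECLARE
    affected_bucket TIMESTAMPTZ;
    affected_id BIGINT;
BEGIN
    IF TG_OP = 'DELETE' THEN
        affected_bucket := time_bucket('1 hour', OLD.date);
        affected_id := OLD.id;
    ELSE
        affected_bucket := time_bucket('1 hour', NEW.date);
        affected_id := NEW.id;
    END IF;

    -- recompute the touched bucket from the raw data
    DELETE FROM data_hourly_base
    WHERE bucket = affected_bucket AND id = affected_id;

    INSERT INTO data_hourly_base (bucket, id, measurement)
    SELECT time_bucket('1 hour', date), id, AVG(measurement)
    FROM data
    WHERE date >= affected_bucket
      AND date < affected_bucket + INTERVAL '1 hour'
      AND id = affected_id
    GROUP BY 1, 2;

    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER data_refresh_base_buckets
AFTER INSERT OR UPDATE OR DELETE ON data
FOR EACH ROW EXECUTE FUNCTION refresh_base_buckets();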
