Support time_bucket_gapfill in continuous aggregates #1324

Open
andness opened this issue Jun 28, 2019 · 20 comments

Comments

@andness

andness commented Jun 28, 2019

I tried to create a continuous aggregate using time_bucket_gapfill but was disappointed to find this was not possible. Are there plans to support this?

@davidkohn88
Contributor

This isn't possible, but you can create the continuous aggregate using the normal time_bucket and then apply time_bucket_gapfill when you query the view; that should pretty closely approximate what you're trying to do.
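
A minimal sketch of that pattern, assuming a hypothetical hypertable conditions(ts, device_id, temperature):

-- continuous aggregate using the normal time_bucket
CREATE MATERIALIZED VIEW conditions_hourly
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 hour', ts) AS bucket,
    device_id,
    AVG(temperature) AS avg_temp
FROM conditions
GROUP BY bucket, device_id;

-- gapfill at query time, on top of the continuous aggregate
-- (time_bucket_gapfill needs a bounded time range in the WHERE clause)
SELECT
    time_bucket_gapfill('1 hour', bucket) AS gapfilled_bucket,
    device_id,
    AVG(avg_temp) AS avg_temp
FROM conditions_hourly
WHERE bucket >= '2019-06-01' AND bucket < '2019-06-08'
GROUP BY gapfilled_bucket, device_id;

Note that the outer AVG only passes through the single pre-aggregated value per bucket here; for other aggregates the re-aggregation needs more care.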

@andness
Author

andness commented Jul 1, 2019

@davidkohn88, I don't think that would solve my case because I need to join on the time column, and to make that efficient I need it indexed.

To give a little background on what I'm doing: I have a property of my data which is stored as intervals in the source data, i.e. "the device was in state X between time T1 and T2". The way I solve this is brute force: for each hour of the day I materialize the state that the device is in for that hour. This means that the rows:

["Device1", "StateA", 12:20, 17:30]
["Device1", "StateB", 17:30, 21:10]

Expand into

["Device1", "StateA", 12:00]
["Device1", "StateA", 13:00]
["Device1", "StateA", 14:00]
["Device1", "StateA", 15:00]
["Device1", "StateA", 16:00]
["Device1", "StateA", 17:00]
["Device1", "StateB", 18:00]
["Device1", "StateB", 19:00]
["Device1", "StateB", 20:00]
["Device1", "StateB", 21:00]
["Device1", "StateB", 22:00]

Using this approach I can get the state of the device for any given hour of the day by a simple join against this table.

Not the most intelligent approach, but it's simple and straightforward, at the expense of using some disk space. I tried using a join query that would essentially look up the last device state value for each hour of the day, and as you can imagine, that query did not perform very well.
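
A rough sketch of that expansion, assuming a hypothetical source table device_states(device_id, state, start_time, end_time) and a hypothetical target table device_state_hourly(device_id, state, hour); the boundary handling (which state "owns" the hour an interval starts or ends in) would need tuning to match the rows above:

INSERT INTO device_state_hourly (device_id, state, hour)
SELECT
    s.device_id,
    s.state,
    h.hour
FROM device_states s
-- expand each interval into one row per hourly bucket it touches
CROSS JOIN LATERAL generate_series(
    time_bucket('1 hour', s.start_time),
    time_bucket('1 hour', s.end_time),
    '1 hour'::INTERVAL
) AS h(hour);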

@anton-brass

That only works if you do not have state changes during a full hour; otherwise it will be inaccurate.

@davidkohn88
Contributor

I didn't see the previous comment until now, but I will also say that the join, if structured correctly (as a LATERAL with ORDER BY time DESC LIMIT 1, and the correct indexes), should be reasonably performant. You could also build a GiST index on a range expression together with the device_id (using the btree_gist extension) and then do a containment query, which should also be reasonably performant.
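
Sketches of both approaches, assuming a hypothetical table device_states(device_id, state, start_time, end_time) and a hypothetical set of hourly timestamps to resolve, hourly_timestamps(hour, device_id):

-- 1. "last state at or before the hour" via a LATERAL subquery,
--    backed by a (device_id, start_time DESC) index
CREATE INDEX ON device_states (device_id, start_time DESC);

SELECT ts.hour, ts.device_id, ds.state
FROM hourly_timestamps ts
CROSS JOIN LATERAL (
    SELECT state
    FROM device_states
    WHERE device_id = ts.device_id
      AND start_time <= ts.hour
    ORDER BY start_time DESC
    LIMIT 1
) ds;

-- 2. containment query on a range expression, backed by a GiST index
--    (btree_gist is needed to include the scalar device_id column)
CREATE EXTENSION IF NOT EXISTS btree_gist;
CREATE INDEX ON device_states
    USING GIST (device_id, tstzrange(start_time, end_time));

SELECT ts.hour, ts.device_id, ds.state
FROM hourly_timestamps ts
JOIN device_states ds
  ON ds.device_id = ts.device_id
 AND tstzrange(ds.start_time, ds.end_time) @> ts.hour;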

I think there are some other issues around this that I remember hearing about, namely wanting to calculate the time spent in a given state over a period of time. That one is a little harder to accomplish nicely right now, and I've opened an issue to do that sort of thing in Timescale Analytics: timescale/timescaledb-toolkit#117

@danthegoodman1

This is definitely a pain point for us as well. We have to apply it outside, in the query on the cagg, which complicates our SQL statements since we have to double up on some aggregate functions (which don't end up doing anything but make the query more complicated).

@Mykyta-Chernenko

Hi, I need this for my use case because doing it at query time is not performant enough. Are there any plans to release this functionality?

@chrishop

I'm also having this same issue, and the same requirement as @Mykyta-Chernenko. We need to gapfill within the continuous aggregate for performance reasons.

@melicheradam

+1, this would really be appreciated.

@leppaott

leppaott commented Oct 20, 2023

+1, we'd like to use state_agg() and just LOCF the previous state into the next bucket when there is no state change. The things people do in timescale/timescaledb-toolkit#479 (comment) seem complicated.

@jflambert

+1, bummed to discover this has been an open issue for almost 5 years! The workaround isn't doing it for me.

@jrodertqc

+1 would be very useful to have this feature.

@goncalvesnelson

+1. This feature would be great and simplify some of our queries and improve overall performance.

@leppaott

It would be very handy to be able to configure the refresh policy to run at the start of every bucket, so that the first sample would always already be there.

@xerubia

xerubia commented May 10, 2024

Would love this feature; I have the exact same problem that @andness mentioned at the start. Would love to know if there's a viable solution.

@johanatan

+1 on this feature.

@ljt7560868

This problem has been going on for 5 years; why hasn't it been resolved yet?!

@landeholt

Gapfill is an extremely important part of time series work for getting evenly spaced data. Too bad it can't be used in caggs. Why is that? Is it deemed too computationally expensive?

@jkittner

Has anyone found a workaround for this?
Running time_bucket_gapfill for each query is not a real solution to the problem, since you just want gapfill and no further aggregation!

I implemented the gapfill myself using CTEs, but surprise, CTEs are not supported by continuous aggregates. I then changed the query to use subqueries instead, but surprise, subqueries are not supported by continuous aggregates either. So no solution found.

So far, every time I've touched caggs it's been a disaster (e.g. getting right-labelled buckets is already a nightmare, and just thinking about timezones will give you a heart attack).

@jkittner

Maybe a more useful comment than my previous one: I was able to achieve what I wanted by ditching the Timescale-specific materialized view entirely and only using a vanilla Postgres materialized view with the regular time_bucket function.

-- define two tables that are connected via a foreign key
CREATE TABLE stations(
    id BIGINT PRIMARY KEY
);

CREATE TABLE data(
    date TIMESTAMPTZ,
    id BIGINT REFERENCES stations(id),
    measurement NUMERIC,
    PRIMARY KEY(date, id)
);

-- add some example data to showcase the result
INSERT INTO stations(id) VALUES (1), (2);

INSERT INTO data(date, id, measurement) VALUES
    ('2025-01-01 09:15', 1, 0),
    ('2025-01-01 10:15', 1, 2),
    ('2025-01-01 10:45', 1, 4),
    ('2025-01-01 12:15', 1, 6),
    ('2025-01-01 12:45', 1, 8),
    ('2025-01-01 10:15', 2, 3),
    ('2025-01-01 10:45', 2, 6),
    ('2025-01-01 13:15', 2, 9),
    ('2025-01-01 13:45', 2, 12),
    ('2025-01-01 14:15', 2, 15);

-- define the gap-filled materialized view
CREATE MATERIALIZED VIEW data_hourly AS
-- get the start and end date of measurements per station,
-- so we don't extrapolate the gapfilling, but only fill between
-- existing values
WITH data_bounds AS (
    SELECT
        id,
        MIN(date) AS start_time,
        MAX(date) AS end_time
    FROM data
    GROUP BY id
),
-- generate a complete time series that contains all possible dates for any
-- station present
gapfiller_time_series AS (
    SELECT generate_series(
        (SELECT MIN(date) FROM data),
        (SELECT MAX(date) FROM data),
        '1 hour'::INTERVAL
    ) AS date
),
-- based on the gapfiller time series, generate potential fillers for each
-- id in stations. Make sure that filler data is only generated between the
-- earliest and latest measurement per individual station
time_station_combinations AS (
    SELECT
        date,
        stations.id AS id,
        start_time,
        end_time
    FROM gapfiller_time_series
    CROSS JOIN stations
    JOIN data_bounds ON data_bounds.id = stations.id
    WHERE
        gapfiller_time_series.date >= data_bounds.start_time AND
        gapfiller_time_series.date <= data_bounds.end_time
),
-- now combine both the actual measurements and the filler data, intentionally
-- creating duplicates, which, however, will be eliminated in a later step
filled_data AS (
    (
        SELECT
            date,
            id,
            NULL AS measurement
        FROM time_station_combinations
    )
    UNION ALL
    (
        SELECT
            date,
            id,
            measurement
        FROM data
    )
)
-- Now use the regular time_bucket function to calculate hourly averages.
-- This will produce NULL rows where there are no values; otherwise
-- NULLs are ignored when the aggregations contain actual values
SELECT
    time_bucket('1 hour', date) AS hourly_bucket,
    id,
    AVG(measurement)
FROM filled_data
GROUP BY hourly_bucket, id
ORDER BY hourly_bucket, id DESC;

-- to be able to refresh the view concurrently
CREATE UNIQUE INDEX ON data_hourly(id, hourly_bucket);
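
The unique index makes it possible to refresh the view concurrently, e.g. on a schedule:

REFRESH MATERIALIZED VIEW CONCURRENTLY data_hourly;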

With this setup the gaps are filled like this:

SELECT * FROM data_hourly ORDER BY id, hourly_bucket;
     hourly_bucket      | id |          avg
------------------------+----+------------------------
 2025-01-01 09:00:00+00 |  1 | 0.00000000000000000000
 2025-01-01 10:00:00+00 |  1 |     3.0000000000000000
 2025-01-01 11:00:00+00 |  1 |
 2025-01-01 12:00:00+00 |  1 |     7.0000000000000000
 2025-01-01 10:00:00+00 |  2 |     4.5000000000000000
 2025-01-01 11:00:00+00 |  2 |
 2025-01-01 12:00:00+00 |  2 |
 2025-01-01 13:00:00+00 |  2 |    10.5000000000000000
 2025-01-01 14:00:00+00 |  2 |    15.0000000000000000

This is IMO an improved behavior compared to the plain time_bucket_gapfill. Of course, one could drop that behavior by simply removing the data_bounds logic and its conditional join after the CROSS JOIN.

SELECT
    time_bucket_gapfill('1 hour', date) AS hourly_bucket,
    id,
    AVG(measurement) AS measurement
FROM data
WHERE date BETWEEN '2025-01-01 09:15' AND '2025-01-01 14:15'
GROUP BY hourly_bucket, id
ORDER BY id, hourly_bucket;
     hourly_bucket      | id |      measurement
------------------------+----+------------------------
 2025-01-01 09:00:00+00 |  1 | 0.00000000000000000000
 2025-01-01 10:00:00+00 |  1 |     3.0000000000000000
 2025-01-01 11:00:00+00 |  1 |
 2025-01-01 12:00:00+00 |  1 |     7.0000000000000000
 2025-01-01 13:00:00+00 |  1 |
 2025-01-01 14:00:00+00 |  1 |
 2025-01-01 09:00:00+00 |  2 |
 2025-01-01 10:00:00+00 |  2 |     4.5000000000000000
 2025-01-01 11:00:00+00 |  2 |
 2025-01-01 12:00:00+00 |  2 |
 2025-01-01 13:00:00+00 |  2 |    10.5000000000000000
 2025-01-01 14:00:00+00 |  2 |    15.0000000000000000

Performance-wise, my workaround will likely do better than the Timescale materialized view for views with fewer rows, but for very large views performance will likely be worse. So this will only be a solution for up to a few hundred thousand rows...

I hope this will be useful for some people who don't need all the features of a Timescale-native materialized view!

@tukusejssirs

I am affected too. As a workaround, I create a trigger function on insert/delete/update which updates the affected time buckets (for either old or new data being added or removed, including the period between the previous and the next time bucket around the affected buckets) in a hypertable (not a plain table or materialised view) that uses the shortest time bucket; from that hypertable I then create materialised views with caggs for the other, larger time buckets.

It is not a perfect solution, but it works.
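
A very rough sketch of that trigger approach, reusing the data table from the previous comment and a hypothetical base hypertable data_hourly_base(bucket, id, measurement). For simplicity it only recomputes the single bucket touched by the changed row; an UPDATE that moves a row between buckets, and the gapfilling of the period between neighbouring buckets, would need extra handling:

CREATE OR REPLACE FUNCTION refresh_base_buckets() RETURNS trigger AS $$
DECLARE
    affected_bucket TIMESTAMPTZ;
    affected_id BIGINT;
BEGIN
    IF TG_OP = 'DELETE' THEN
        affected_bucket := time_bucket('1 hour', OLD.date);
        affected_id := OLD.id;
    ELSE
        affected_bucket := time_bucket('1 hour', NEW.date);
        affected_id := NEW.id;
    END IF;

    -- recompute the touched bucket from the raw data
    DELETE FROM data_hourly_base
    WHERE bucket = affected_bucket AND id = affected_id;

    INSERT INTO data_hourly_base (bucket, id, measurement)
    SELECT time_bucket('1 hour', date), id, AVG(measurement)
    FROM data
    WHERE date >= affected_bucket
      AND date < affected_bucket + INTERVAL '1 hour'
      AND id = affected_id
    GROUP BY 1, 2;

    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER data_refresh_base_buckets
AFTER INSERT OR UPDATE OR DELETE ON data
FOR EACH ROW EXECUTE FUNCTION refresh_base_buckets();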
