Support time_bucket_gapfill in continuous aggregates #1324
This isn't possible, but you can create the continuous aggregate using the normal time_bucket and then apply time_bucket_gapfill when you query the view; that should pretty closely approximate what you're trying to do.
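For illustration, a minimal sketch of that workaround, assuming a hypothetical hypertable metrics(time, device_id, value):

-- continuous aggregate using the plain time_bucket
CREATE MATERIALIZED VIEW metrics_hourly
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 hour', time) AS bucket,
    device_id,
    AVG(value) AS avg_value
FROM metrics
GROUP BY bucket, device_id;

-- gapfill applied at query time, on top of the aggregate; note the
-- doubled-up aggregate (AVG of AVG) and the required time bounds
SELECT
    time_bucket_gapfill('1 hour', bucket) AS gapfilled_bucket,
    device_id,
    AVG(avg_value) AS avg_value
FROM metrics_hourly
WHERE bucket BETWEEN now() - INTERVAL '1 day' AND now()
GROUP BY gapfilled_bucket, device_id;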
@davidkohn88, I don't think that would solve my case, because I need to join on the time column, and to make that efficient I need it indexed. To give a little background on what I'm doing: I have a property of my data which is stored as intervals in the source data, i.e. "the device was in state X between time T1 and T2". So the way I solve this is brute force: for each hour of the day I materialize the state that the device is in for that hour. This means that the row

["Device1", "StateA", 12:20, 17:30]

expands into

["Device1", "StateA", 12:00], ["Device1", "StateA", 13:00], ..., ["Device1", "StateA", 17:00]

Using this approach I can get the state of the device for any given hour of the day by a simple join against this table. Not the most intelligent approach, but it's simple and straightforward at the expense of using some disk space. I tried using a join query that would essentially look up the last device state value for each hour of the day, and as you can imagine that query did not perform very well.
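As an aside, that expansion step can be done in plain SQL with generate_series; a sketch, assuming hypothetical tables device_states(device_id, state, start_time, end_time) and device_state_hourly(device_id, state, hour):

-- expand each state interval into one row per hour it covers,
-- e.g. ('Device1', 'StateA', 12:20, 17:30) -> rows for 12:00 .. 17:00
INSERT INTO device_state_hourly (device_id, state, hour)
SELECT
    s.device_id,
    s.state,
    h.hour
FROM device_states s
CROSS JOIN LATERAL generate_series(
    date_trunc('hour', s.start_time),
    date_trunc('hour', s.end_time),
    INTERVAL '1 hour'
) AS h(hour);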
That only works if you do not have state changes during a full hour; otherwise it will be inaccurate.
I didn't see the previous comment until now, but I will also say that the join, if structured correctly (as a lateral with ORDER BY time DESC LIMIT 1, and the correct indexes), should be reasonably performant. You could also build a GiST index on a range expression as well as the device_id (with the btree_gist extension) and then do a containment query, which should also be reasonably performant. I think there are some other issues around this that I remember hearing about, namely wanting to calculate the time spent in a given state over a period of time. That one is a little harder to accomplish nicely right now, and I've opened an issue to do that sort of thing in Timescale Analytics: timescale/timescaledb-toolkit#117
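A sketch of that lateral-join shape, with hypothetical table and column names:

-- for each hour, pick the most recent state change at or before it;
-- an index on (device_id, time DESC) turns the LIMIT 1 into an index scan
SELECT hours.hour, d.device_id, s.state
FROM generate_series(
    '2025-01-01'::timestamptz,
    '2025-01-02'::timestamptz,
    INTERVAL '1 hour'
) AS hours(hour)
CROSS JOIN devices d
CROSS JOIN LATERAL (
    SELECT c.state
    FROM device_state_changes c
    WHERE c.device_id = d.device_id
      AND c.time <= hours.hour
    ORDER BY c.time DESC
    LIMIT 1
) s;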
This is definitely a pain point for us as well. We have to apply the gapfill outside, in the query on the cagg, which complicates our SQL statements since we have to double up on some aggregate functions (which don't end up doing anything but make the query more complicated).
Hi, I need this for my use case because computing it at query time is not performant enough. Are there any plans to release this functionality?
I'm also having this same issue, and the same requirement as @Mykyta-Chernenko. We need to gapfill within the continuous aggregate for performance reasons.
+1, this would really be appreciated.
+1, we'd like to use time_bucket_gapfill in continuous aggregates as well.
+1, bummed to discover this has been an open issue for almost 5 years! The workaround isn't doing it for me.
+1, would be very useful to have this feature.
+1. This feature would be great, simplifying some of our queries and improving overall performance.
It would be very handy to be able to configure the refresh policy to run at the start of every bucket, so the first sample for each new bucket is always already there.
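For reference, the current policy API only takes a fixed schedule interval, not a bucket-aligned start; a sketch using the existing add_continuous_aggregate_policy call, assuming a cagg named data_hourly with one-hour buckets:

-- refreshes once per hour, but the run time is not aligned to bucket
-- starts, which is what the suggestion above would add
SELECT add_continuous_aggregate_policy('data_hourly',
    start_offset      => INTERVAL '3 hours',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour');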
Would love this feature; I have the exact same problem that @andness mentioned at the start. Would love to know if there's a viable solution.
+1 on this feature.
This problem has been going on for 5 years; why hasn't it been resolved yet?!
Gapfill is an extremely important component of time-series work for getting evenly spaced samples. Too bad it can't be used in caggs; why is that? Is it deemed too computationally expensive?
Has anyone found a workaround for this? I implemented the gapfill myself using CTEs, but surprise, CTEs are not supported by continuous aggregates. I then changed the query to use subqueries instead, but surprise, subqueries are not supported by continuous aggregates either. So no solution found. So far, every time I've touched caggs it's been a disaster (e.g. getting right-labeled buckets is already a nightmare, and just thinking about timezones will give you a heart attack).
Maybe a more useful comment than my previous one: I was able to achieve what I wanted by ditching the timescale-specific materialized view entirely, using only a vanilla-Postgres materialized view with the regular time_bucket function (and generate_series to create the filler rows):

-- define two tables that are connected via a foreign key
CREATE TABLE stations(
id BIGINT PRIMARY KEY
);
CREATE TABLE data(
date TIMESTAMPTZ,
id BIGINT REFERENCES stations(id),
measurement NUMERIC,
PRIMARY KEY(date, id)
);
-- add some example data to showcase the result
INSERT INTO stations(id) VALUES (1), (2);
INSERT INTO data(date, id, measurement) VALUES
('2025-01-01 09:15', 1, 0),
('2025-01-01 10:15', 1, 2),
('2025-01-01 10:45', 1, 4),
('2025-01-01 12:15', 1, 6),
('2025-01-01 12:45', 1, 8),
('2025-01-01 10:15', 2, 3),
('2025-01-01 10:45', 2, 6),
('2025-01-01 13:15', 2, 9),
('2025-01-01 13:45', 2, 12),
('2025-01-01 14:15', 2, 15);
-- define the gap-filled materialized view
CREATE MATERIALIZED VIEW data_hourly AS
-- get the start and end date of measurements per station,
-- so we don't extrapolate the gapfilling, but only fill between
-- existing values
WITH data_bounds AS (
SELECT
id,
MIN(date) AS start_time,
MAX(date) AS end_time
FROM data
GROUP BY id
),
-- generate a complete time series that contains all possible dates for any
-- station present
gapfiller_time_series AS (
SELECT generate_series(
(SELECT MIN(date) FROM data),
(SELECT MAX(date) FROM data),
'1 hour'::INTERVAL
) AS date
),
-- based on the gapfiller time series, generate potential filler rows for
-- each station id, taking care that filler data is only generated between
-- the earliest and latest measurement per individual station
time_station_combinations AS (
SELECT
date,
stations.id AS id,
start_time,
end_time
FROM gapfiller_time_series
CROSS JOIN stations
JOIN data_bounds ON data_bounds.id = stations.id
WHERE
gapfiller_time_series.date >= data_bounds.start_time AND
gapfiller_time_series.date <= data_bounds.end_time
),
-- now combine both the actual measurements and the filler rows,
-- intentionally creating duplicates, which will be eliminated in a later step
filled_data AS (
(
SELECT
date,
id,
NULL AS measurement
FROM time_station_combinations
)
UNION ALL
(
SELECT
date,
id,
measurement
FROM data
)
)
-- Now use the regular time_bucket function to calculate hourly averages.
-- This will produce rows with NULL where only filler rows exist; otherwise
-- the NULLs are ignored when the aggregation contains actual values
SELECT
time_bucket('1 hour', date) AS hourly_bucket,
id,
AVG(measurement)
FROM filled_data
GROUP BY hourly_bucket, id
ORDER BY hourly_bucket, id DESC;
-- to be able to refresh the view concurrently
CREATE UNIQUE INDEX ON data_hourly(id, hourly_bucket);

With this setup the gaps are filled like this:

SELECT * FROM data_hourly ORDER BY id, hourly_bucket;

hourly_bucket | id | avg
------------------------+----+------------------------
2025-01-01 09:00:00+00 | 1 | 0.00000000000000000000
2025-01-01 10:00:00+00 | 1 | 3.0000000000000000
2025-01-01 11:00:00+00 | 1 |
2025-01-01 12:00:00+00 | 1 | 7.0000000000000000
2025-01-01 10:00:00+00 | 2 | 4.5000000000000000
2025-01-01 11:00:00+00 | 2 |
2025-01-01 12:00:00+00 | 2 |
2025-01-01 13:00:00+00 | 2 | 10.5000000000000000
2025-01-01 14:00:00+00 | 2 | 15.0000000000000000

This is IMO an improved behavior compared to the plain time_bucket_gapfill:

SELECT
time_bucket_gapfill('1 hour', date) AS hourly_bucket,
id,
AVG(measurement) AS measurement
FROM data
WHERE date BETWEEN '2025-01-01 09:15' AND '2025-01-01 14:15'
GROUP BY hourly_bucket, id
ORDER BY id, hourly_bucket;

hourly_bucket | id | measurement
------------------------+----+------------------------
2025-01-01 09:00:00+00 | 1 | 0.00000000000000000000
2025-01-01 10:00:00+00 | 1 | 3.0000000000000000
2025-01-01 11:00:00+00 | 1 |
2025-01-01 12:00:00+00 | 1 | 7.0000000000000000
2025-01-01 13:00:00+00 | 1 |
2025-01-01 14:00:00+00 | 1 |
2025-01-01 09:00:00+00 | 2 |
2025-01-01 10:00:00+00 | 2 | 4.5000000000000000
2025-01-01 11:00:00+00 | 2 |
2025-01-01 12:00:00+00 | 2 |
2025-01-01 13:00:00+00 | 2 | 10.5000000000000000
2025-01-01 14:00:00+00 | 2 | 15.0000000000000000

Performance-wise, my workaround will likely perform better than the timescale materialized view for views with fewer rows, but for very large views performance will likely be worse. So this will only be a solution for up to a few hundred thousand rows... I hope this will be useful for some people who don't need all the features of a timescale-native materialized view!
I am affected too. As a workaround, I create a trigger function on insert/delete/update which updates the affected time buckets (affected by either the old or new data's addition or removal, including the period between the previous and the next time bucket around the affected buckets) in a hypertable (not a plain table or materialised view) using the shortest time bucket; from that hypertable I then create materialised views with caggs for the other, larger time buckets. It is not a perfect solution, but it works.
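A rough sketch of that trigger approach for the INSERT case only, with hypothetical names (data_minutely would need a unique index on (bucket, id), and a real version would also cover UPDATE/DELETE and the neighbouring buckets):

-- on each insert into the raw table, recompute the affected 1-minute
-- bucket in a separate hypertable; larger-bucket caggs build on top of it
CREATE OR REPLACE FUNCTION refresh_minute_bucket() RETURNS trigger AS $$
BEGIN
    INSERT INTO data_minutely (bucket, id, measurement)
    SELECT time_bucket('1 minute', NEW.date), NEW.id, AVG(measurement)
    FROM data
    WHERE id = NEW.id
      AND date >= time_bucket('1 minute', NEW.date)
      AND date <  time_bucket('1 minute', NEW.date) + INTERVAL '1 minute'
    ON CONFLICT (bucket, id)
    DO UPDATE SET measurement = EXCLUDED.measurement;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER data_refresh_minute
AFTER INSERT ON data
FOR EACH ROW EXECUTE FUNCTION refresh_minute_bucket();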
I tried to create a continuous aggregate using time_bucket_gapfill but was disappointed to find this was not possible. Are there plans to support this?