
TimescaleDB continuous aggregation wrong time buckets #5864

Closed
Sanprof opened this issue Jul 6, 2023 · 8 comments


Sanprof commented Jul 6, 2023

Relevant system information:

  • OS: Debian 11
  • PostgreSQL version (output of SELECT version();): PostgreSQL 15.2, compiled by Visual C++ build 1914, 64-bit
  • TimescaleDB version (output of \dx timescaledb in psql): timescaledb 2.10.3
  • Installation method: "docker"

Describe the bug
We have a device for collecting temperature. It is configured to sample the temperature of a refrigerator box every second into a buffer (it can also sample at shorter periods, 10 ms or 100 ms), and at the end of each minute it sends the average value for the last minute (it may also be required to send the average temperature every second or every 10 seconds). We have a hypertable for these data, and we want to create a number of continuous aggregates over it (with the required intervals; for now it is 5 min), as the table grows continuously.

The issue we faced when creating the materialized view is that the time_bucket function uses the date_bin function, which always uses the start of the interval as the basis. This means time_bucket always applies the condition time >= start_date and time < end_date, which is wrong for us, because we need time > start_date and time <= end_date. I tried timescaledb_experimental.time_bucket_ng with the origin parameter, and it seems to work incorrectly.
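
For illustration, this is the left-closed flooring behavior of plain PostgreSQL date_bin that the report refers to:

-- date_bin always floors to the start of the bin (left-closed bucket),
-- so a value sampled at 06:03 is labeled 06:00, not 06:05:
SELECT date_bin('5 minutes',
                TIMESTAMP '2023-06-27 06:03:00',
                TIMESTAMP '2000-01-01 00:00:00');
-- => 2023-06-27 06:00:00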

To Reproduce

DROP MATERIALIZED VIEW IF EXISTS temp_measure_5min;
-- create table
drop table if exists temp_measure;
create table if not exists temp_measure(
    time timestamp without time zone,
    "avg_temp" double precision
);

insert into temp_measure select '2023-06-27 05:59:00', 6.9;
insert into temp_measure select '2023-06-27 06:00:00', 7.2;
insert into temp_measure select '2023-06-27 06:01:00', 15.1;
insert into temp_measure select '2023-06-27 06:02:00', 10.6;
insert into temp_measure select '2023-06-27 06:03:00', 8.5;
insert into temp_measure select '2023-06-27 06:04:00', 12;
insert into temp_measure select '2023-06-27 06:05:00', 13.2;
insert into temp_measure select '2023-06-27 06:06:00', 9.7;
insert into temp_measure select '2023-06-27 06:07:00', 8.1;
insert into temp_measure select '2023-06-27 06:08:00', 7.9;
insert into temp_measure select '2023-06-27 06:09:00', 8.5;
insert into temp_measure select '2023-06-27 06:10:00', 7.7;

-- create hypertable
SELECT create_hypertable('temp_measure', 'time', if_not_exists => TRUE, chunk_time_interval => INTERVAL '1 day', migrate_data => true);

-- create continuous aggregation
CREATE MATERIALIZED VIEW temp_measure_5min
WITH (timescaledb.continuous)
AS
SELECT
    timescaledb_experimental.time_bucket_ng('5 minutes', time, origin => '2000-01-01 00:01:00') as bucket,
    ROUND(AVG(avg_temp)::decimal,1)::double precision as avg_temp
FROM
    temp_measure
GROUP BY bucket
WITH NO DATA;
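
Because the view is created WITH NO DATA, it presumably has to be refreshed before the SELECTs below return any rows, along these lines (the same applies to the other continuous aggregates in this report):

-- materialize the whole available range of the aggregate
CALL refresh_continuous_aggregate('temp_measure_5min', NULL, NULL);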

Expected behavior

-- review data
SELECT * FROM public.temp_measure_5min

bucket  avg_temp
2023-06-27 06:00:00  7.1
2023-06-27 06:05:00  11.9
2023-06-27 06:10:00  8.4

Actual behavior

-- review data
SELECT * FROM public.temp_measure_5min

bucket  avg_temp
2023-06-27 05:56:00  7.1
2023-06-27 06:01:00  11.9
2023-06-27 06:06:00  8.4

I also get incorrect results when I use time_bucket without the origin parameter and time has type timestamptz:

-- create table
drop table if exists temp_measure_tz;
create table if not exists temp_measure_tz(
    time timestamptz,
    "avg_temp" double precision
);

insert into temp_measure_tz select '2023-06-27 05:59:00+00', 6.9;
insert into temp_measure_tz select '2023-06-27 06:00:00+00', 7.2;
insert into temp_measure_tz select '2023-06-27 06:01:00+00', 15.1;
insert into temp_measure_tz select '2023-06-27 06:02:00+00', 10.6;
insert into temp_measure_tz select '2023-06-27 06:03:00+00', 8.5;
insert into temp_measure_tz select '2023-06-27 06:04:00+00', 12;
insert into temp_measure_tz select '2023-06-27 06:05:00+00', 13.2;
insert into temp_measure_tz select '2023-06-27 06:06:00+00', 9.7;
insert into temp_measure_tz select '2023-06-27 06:07:00+00', 8.1;
insert into temp_measure_tz select '2023-06-27 06:08:00+00', 7.9;
insert into temp_measure_tz select '2023-06-27 06:09:00+00', 8.5;
insert into temp_measure_tz select '2023-06-27 06:10:00+00', 7.7;

-- create hypertable
SELECT create_hypertable('temp_measure_tz', 'time', if_not_exists => TRUE, chunk_time_interval => INTERVAL '1 day', migrate_data => true);

-- create materialized view
CREATE MATERIALIZED VIEW temp_measure_tz_5min
WITH (timescaledb.continuous)
AS
SELECT
    time_bucket('5 minutes', time) as bucket,
    ROUND(AVG(avg_temp)::decimal,1)::double precision as avg_temp
FROM
    temp_measure_tz
GROUP BY bucket
WITH NO DATA;

-- review data
SELECT * FROM public.temp_measure_tz_5min;

bucket  avg_temp
2023-06-27 07:55:00+02  6.9
2023-06-27 08:00:00+02  10.7
2023-06-27 08:05:00+02  9.5
2023-06-27 08:10:00+02  7.7
Sanprof added the bug label Jul 6, 2023
WireBaron transferred this issue from timescale/timescaledb-toolkit Jul 6, 2023

jonatas commented Jul 10, 2023

Thanks for reporting @Sanprof!


pmwkaa commented Jul 13, 2023

@Sanprof I've been able to reproduce the case and am now trying to understand whether this is a bug.

I see that you are using an origin timestamp set as origin => '2000-01-01 00:01:00'.
I was able to get the correct result by changing it to origin => '2000-01-01 00:00:00' or origin => '2000-01-01'.

The question is: is it necessary for you to specify both the date and a time?
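
For reference, with origin => '2000-01-01 00:01:00' the 5-minute boundaries fall at minutes 01, 06, 11, ... past the hour, which is consistent with the 05:56 / 06:01 / 06:06 buckets shown above. A minimal comparison of the two origins (a sketch, using the same time_bucket_ng signature as in the report):

SELECT timescaledb_experimental.time_bucket_ng('5 minutes',
    TIMESTAMP '2023-06-27 06:03:00',
    origin => TIMESTAMP '2000-01-01 00:01:00');  -- 2023-06-27 06:01:00

SELECT timescaledb_experimental.time_bucket_ng('5 minutes',
    TIMESTAMP '2023-06-27 06:03:00',
    origin => TIMESTAMP '2000-01-01 00:00:00');  -- 2023-06-27 06:00:00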


Sanprof commented Jul 15, 2023

@pmwkaa thanks for your response.
Let me describe a bit of what we ran into at the beginning with the CAGG of measured temperature values. Just look at the picture below.

[diagram illustrating the desired right-closed buckets]

In this sample I wanted to show that it is extremely important for us to aggregate data with the condition time > start_date and time <= end_date, where the timestamp of the bucket equals end_date.
E.g. we expect time > 00:05:00 and time <= 00:10:00, so we can say that the AVG temperature value that occurred between these times belongs to 00:10:00 and not to 00:05:00.
If we use the standard time_bucket function for the CAGG, the condition is always time >= start_date and time < end_date, and the bucket timestamp is always start_date. Coming back to my example, the average temperature value would get the timestamp 00:05:00, at which the collector had not yet gathered those values, and this is wrong. You can reproduce this with my sample:

DROP MATERIALIZED VIEW IF EXISTS temp_measure_tz_5min;

-- create table
drop table if exists temp_measure_tz;
create table if not exists temp_measure_tz(
    time timestamptz,
    "avg_temp" double precision
);

insert into temp_measure_tz select '2023-06-27 05:59:00+00', 6.9;
insert into temp_measure_tz select '2023-06-27 06:00:00+00', 7.2;
insert into temp_measure_tz select '2023-06-27 06:01:00+00', 15.1;
insert into temp_measure_tz select '2023-06-27 06:02:00+00', 10.6;
insert into temp_measure_tz select '2023-06-27 06:03:00+00', 8.5;
insert into temp_measure_tz select '2023-06-27 06:04:00+00', 12;
insert into temp_measure_tz select '2023-06-27 06:05:00+00', 13.2;
insert into temp_measure_tz select '2023-06-27 06:06:00+00', 9.7;
insert into temp_measure_tz select '2023-06-27 06:07:00+00', 8.1;
insert into temp_measure_tz select '2023-06-27 06:08:00+00', 7.9;
insert into temp_measure_tz select '2023-06-27 06:09:00+00', 8.5;
insert into temp_measure_tz select '2023-06-27 06:10:00+00', 7.7;

-- create hypertable
SELECT create_hypertable('temp_measure_tz', 'time', if_not_exists => TRUE, chunk_time_interval => INTERVAL '1 day', migrate_data => true);


-- create continuous aggregation
CREATE MATERIALIZED VIEW temp_measure_tz_5min
WITH (timescaledb.continuous)
AS
SELECT
    time_bucket('5 minutes', time) as bucket,
    ROUND(AVG(avg_temp)::decimal,1)::double precision as avg_temp
FROM
    temp_measure_tz
GROUP BY bucket
WITH NO DATA;

SELECT * FROM public.temp_measure_tz_5min ORDER BY bucket ASC;

bucket | avg_temp
2023-06-27 07:55:00+02 | 6.9
2023-06-27 08:00:00+02 | 10.7
2023-06-27 08:05:00+02 | 9.5
2023-06-27 08:10:00+02 | 7.7

The result is wrong for us, so currently we use our own function to select aggregated data from the raw data table:

CREATE OR REPLACE FUNCTION public.time_window_tz(
	_interval interval,
	_time timestamptz,
	_date_from timestamptz)
    RETURNS timestamptz
    LANGUAGE 'plpgsql'
    COST 100
    VOLATILE PARALLEL UNSAFE
AS $BODY$
begin

	RETURN
		CASE
			-- a timestamp exactly on a bin boundary keeps that boundary
			WHEN date_bin(_interval, _time, _date_from) = _time THEN _time
			-- otherwise it is assigned to the end of its bin (right-closed bucket)
			ELSE (date_bin(_interval, _time, _date_from) + _interval)::timestamptz
		END;

end;
$BODY$;


SELECT
    time_window_tz('5 minutes', time, '2000-01-01') as bucket,
    ROUND(AVG(avg_temp)::decimal,1)::double precision as avg_temp
FROM
    temp_measure_tz
GROUP BY bucket;

bucket | avg_temp
2023-06-27 08:00:00+02 | 7.1
2023-06-27 08:05:00+02 | 11.9
2023-06-27 08:10:00+02 | 8.4

And the average values are correct for us.
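
For ad-hoc queries over the raw table, the same right-closed labeling can also be written inline by shifting the timestamp by one microsecond before binning. This is just a sketch, assuming microsecond-resolution timestamps; continuous aggregates restrict the bucketing expression, so it would not be accepted there:

SELECT
    -- shift by 1 microsecond so boundary values stay in the earlier bucket,
    -- then label the bucket with its end instead of its start
    date_bin('5 minutes', time - INTERVAL '1 microsecond',
             TIMESTAMPTZ '2000-01-01') + INTERVAL '5 minutes' as bucket,
    ROUND(AVG(avg_temp)::decimal,1)::double precision as avg_temp
FROM temp_measure_tz
GROUP BY bucket
ORDER BY bucket;

This produces the same 08:00 / 08:05 / 08:10 buckets as time_window_tz above.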


Sanprof commented Jul 25, 2023

[screenshot: 1-minute continuous aggregate where the incident at 04:48:52 is bucketed at 04:48:00]

We changed the continuous aggregate's time bucket to 1 minute, and here is what we see in a live example. We had one critical incident because a higher temperature occurred at 04:48:52; because of the wrong time bucket in the materialized view, the calculated temperature index = 0 is attributed to 04:48:00, but it has to belong to 04:49:00, since the critical issue happened after 04:48:00 and before 04:49:00. This is very critical, especially for time metrics, in order to clearly understand when an issue happened. I'm curious that nobody (I mean users of continuous aggregation) seems to have run into this incorrect calculation before.


konskov commented Jul 31, 2023

Hello @Sanprof,

My understanding is that being able to use this version of time_bucket:

time_bucket('5 min', time, origin => '2000-01-01 00:00:01.000001')

would meet your requirements, is that right?
If that is the case, I think the issue here is that we currently do not support this version of the time_bucket function in continuous aggregates, as reported in issues #5453 and #5301. Would having the ability to use the above version of time_bucket help here?


Sanprof commented Jul 31, 2023

Hello @konskov,

The main issue for us is that the time_bucket function, in any variant, uses the date_bin function, which always floors a TIMESTAMP to the start of the interval. In my example, our temperature collector sends 60 one-second values each minute, which means 300 values per 5 minutes. Take for example the two timestamps 2023-07-31 10:05:00 and 2023-07-31 10:10:00 (a 5-minute period, 300 seconds): when we aggregate those 300 values into 1 using the time_bucket function, that 1 value gets the wrong timestamp, 2023-07-31 10:05:00 (because of the date_bin rule). So when our users see a higher AVG temperature at 2023-07-31 10:05:00, they automatically drill into the raw data between 2023-07-31 10:00:00 and 2023-07-31 10:05:00 but don't see any critical values, because they happened within the next 5 minutes. This is very critical for us, and every time we have to explain why it works this way.
Regarding your answer, I think even the origin parameter won't help.
You can find our own time_window function in one of my previous posts, and here is an example of how correctly and logically it works for us.

SELECT 
	tv.time as origin_time,
	time_window('5m', tv.time, TIMESTAMP '2000-01-01') as time_window_time,
	date_bin('5m', tv.time, TIMESTAMP '2000-01-01') as date_bin_time,
	time_bucket('5m', tv.time) as time_bucket_time,
	time_bucket('5m', tv.time, origin => TIMESTAMP '2000-01-01 00:01:00.000001') as time_bucket_origin_time	
FROM (
SELECT t.time FROM (VALUES
(TIMESTAMP '2023-07-31 10:00:00'),
(TIMESTAMP '2023-07-31 10:01:00'),
(TIMESTAMP '2023-07-31 10:02:00'),
(TIMESTAMP '2023-07-31 10:03:00'),
(TIMESTAMP '2023-07-31 10:04:00'),
(TIMESTAMP '2023-07-31 10:05:00'),
(TIMESTAMP '2023-07-31 10:06:00'),
(TIMESTAMP '2023-07-31 10:07:00'),
(TIMESTAMP '2023-07-31 10:08:00'),
(TIMESTAMP '2023-07-31 10:09:00'),
(TIMESTAMP '2023-07-31 10:10:00'),
(TIMESTAMP '2023-07-31 10:11:00'),
(TIMESTAMP '2023-07-31 10:12:00'),
(TIMESTAMP '2023-07-31 10:13:00'),
(TIMESTAMP '2023-07-31 10:14:00'),
(TIMESTAMP '2023-07-31 10:15:00'),
(TIMESTAMP '2023-07-31 10:16:00'),
(TIMESTAMP '2023-07-31 10:17:00'),
(TIMESTAMP '2023-07-31 10:18:00'),
(TIMESTAMP '2023-07-31 10:19:00'),
(TIMESTAMP '2023-07-31 10:20:00')
) t(time)
) tv;

[screenshot of the query output comparing the five bucket columns]

Unfortunately, for now, we cannot use continuous aggregation while the time bucket carries the incorrect time. But as soon as it becomes possible to override this function, or to use something similar to our time_window function, please let me know, because we see a potentially good approach in this. Thanks.

@melicheradam

+1, This might be related to #5806

mkindahl self-assigned this Aug 14, 2023

mkindahl commented Aug 14, 2023

@Sanprof Thank you for the report. This looks like a duplicate of #5806 so closing it in favor of that issue.

I think there are two possible solutions: either implement a right-closed, left-open version of time_bucket, or allow other functions to be registered as "ok" for continuous aggregates.

Right now, we have an internal cache that marks functions as "ok", but one possible extension is to expose this as a table and allow any functions to be registered there.
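
As a sketch of what such a registered function might look like: a right-closed bucket along the lines of the reporter's time_window_tz, written as a plain SQL function and declared IMMUTABLE (which such a registration would presumably require). The name time_bucket_right is hypothetical:

-- Hypothetical right-closed bucket: a timestamp exactly on a bin boundary
-- keeps that boundary; anything else is labeled with the *end* of its bin.
-- Declared IMMUTABLE on the assumption of fixed-width (sub-day) intervals.
CREATE FUNCTION time_bucket_right(
    _width interval,
    _ts timestamptz,
    _origin timestamptz DEFAULT '2000-01-01')
RETURNS timestamptz
LANGUAGE sql IMMUTABLE PARALLEL SAFE
AS $$
    SELECT CASE
        WHEN date_bin(_width, _ts, _origin) = _ts THEN _ts
        ELSE date_bin(_width, _ts, _origin) + _width
    END;
$$;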

I added a note to #5806 about this.
