[Bug]: job crash detected, see server logs #7085
Comments
@pgloader can you please copy/paste the result of the following query? SELECT * FROM timescaledb_information.job_history ORDER BY start_time;
Hi, I think I could have the same issue with background jobs. In my case, it happens with continuous aggregates:
After the segfault, the whole Postgres instance restarts and enters recovery mode for a while. I'm using TimescaleDB 2.15.3. This is the result of the query you asked for.
@cosimomeli it looks like your background job for the cagg refresh is leading to a segfault, so the job_errors output is correct. It would be great if you opened another issue related to this segmentation fault for further investigation.
SELECT * FROM timescaledb_information.job_history ORDER BY start_time;
 875 | 1005 | t | _timescaledb_functions | policy_compression | 1481534 | 2024-10-20 09:15:00.004445-04 | 2024-10-20 09:15:00.016282-04 | {"hypertable_id": 1, "compress_after": "14 days"} |  |
These are the most recent.
Could you take a look at your logs for these failures, some time around the timestamps above? Or do the logs not contain anything this time either? Thanks.
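As a minimal sketch (assuming the job_history view exposes the succeeded and err_message columns seen elsewhere in this thread), the failed runs and the time windows to look for in the server log can be listed with:

-- List only the failed job runs with their start/finish times.
SELECT job_id, proc_name, start_time, finish_time, err_message
FROM timescaledb_information.job_history
WHERE succeeded IS FALSE
ORDER BY start_time;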
2024-10-20 01:30:00.010 EDT @ LOG: continuous aggregate refresh (individual invalidation) on "cag_30m_metric_data_300" in window [ 2024-10-20 00:30:00-04, 2024-10-20 01:00:00-04 ]
Is it possible for you to cut your logs from that period? Also, I've recently made some refactoring on the code that captures and records job executions and exceptions, and it would be nice if you could try it out by updating the extension to 2.17.1.
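For reference, a sketch of the standard extension upgrade path (TimescaleDB requires this to be the first command in a fresh session; the target version is the one mentioned above):

-- Upgrade the extension in the current database, then verify the installed version.
ALTER EXTENSION timescaledb UPDATE TO '2.17.1';
SELECT extname, extversion FROM pg_extension WHERE extname = 'timescaledb';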
Please see the attached
A quick grep into your logs showed the following:
Looks like another process cancelled the execution of the job, so the error history is correct.
Most likely it conflicted with the pg_dump backup.
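One way to confirm that a dump was running at the time of a failed execution (assuming pg_dump reports its default application_name) is to look at pg_stat_activity during the backup window:

-- Sessions opened by pg_dump; a lock conflict with the dump can lead to a policy job being cancelled.
SELECT pid, application_name, state, query_start
FROM pg_stat_activity
WHERE application_name = 'pg_dump';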
If you have successful executions after the failure then you're safe, since the next execution will process all invalidation logs created even if they were before the window range executed by the policy. The downside is that the refresh will take more time because it has more buckets to aggregate.
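A sketch of how to check that, assuming the job_history columns shown earlier in this thread: take the most recent run of each continuous aggregate refresh policy and confirm it succeeded after the failure.

-- Most recent run per refresh policy job; succeeded = t means the pending invalidations were caught up.
SELECT DISTINCT ON (job_id) job_id, succeeded, start_time, finish_time
FROM timescaledb_information.job_history
WHERE proc_name = 'policy_refresh_continuous_aggregate'
ORDER BY job_id, start_time DESC;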
Hi guys, we recently had a problem very similar to the one reported (unfortunately we are not yet on the updated version of TimescaleDB). What we saw was the following: the job failed after a problem in Postgres (too many clients), then the database went into recovery mode, and when it came back the jobs didn't run again. To work around it, simply removing and recreating the job made the operation possible again. Have you ever seen something similar?
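For context, a minimal sketch of that remove-and-recreate workaround for a continuous aggregate refresh policy; the view name my_cagg and the offsets below are placeholders, not values from this report:

-- Drop the existing refresh policy, then add it back with the desired schedule.
SELECT remove_continuous_aggregate_policy('my_cagg');
SELECT add_continuous_aggregate_policy('my_cagg',
  start_offset      => INTERVAL '7 days',
  end_offset        => INTERVAL '1 hour',
  schedule_interval => INTERVAL '30 minutes');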
We have had reports of a similar situation, where disabling and then enabling a job makes it not run again. It might be useful to check that there is a scheduler running for that database, as well as the job information, in particular the next start time and whether the job is enabled.
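A sketch of those checks; the scheduler's application_name and the exact job_stats columns are assumptions based on recent TimescaleDB versions and may differ slightly:

-- Is the background worker scheduler running for this database?
SELECT datname, pid, application_name
FROM pg_stat_activity
WHERE application_name LIKE 'TimescaleDB Background Worker%';

-- Next start time and scheduled/enabled state for the affected job (1003 is a job id from the report above).
SELECT j.job_id, j.proc_name, j.scheduled, s.next_start, s.last_run_status
FROM timescaledb_information.jobs j
JOIN timescaledb_information.job_stats s USING (job_id)
WHERE j.job_id = 1003;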
@cosimomeli You had a segmentation fault for the job. You don't happen to have a stack trace that you can add here as well? It might help us pinpoint the issue.
Hi, I had no stack trace to share, but we saw the issue was related to LLVM, and we solved it in an unexpected way: moving the instance from an ARM node to an x86 one.
It's difficult to move forward with this one without knowing where the crash is.
What type of bug is this?
Crash
What subsystems and features are affected?
Background worker
What happened?
"job crash detected, see server logs", but there was no information in the PostgreSQL statement log
TimescaleDB: 2.15.2
PostgreSQL: 16.3
log_min_error_statement: log
Besides, the message in the job history would disappear.
This was 5 minutes ago:
=# select * from job_errors;
job_id | proc_schema | proc_name | pid | start_time | finish_time | sqlerrcode | err_message
--------+------------------------+-------------------------------------+---------+-------------------------------+-------------------------------+------------+-------------------------------------
1003 | _timescaledb_functions | policy_refresh_continuous_aggregate | 1116242 | 2024-06-28 09:06:39.752447-04 | 2024-06-28 09:06:39.752542-04 | | job crash detected, see server logs
1002 | _timescaledb_functions | policy_refresh_continuous_aggregate | 1116242 | 2024-06-28 10:46:50.699781-04 | 2024-06-28 10:46:50.699857-04 | | job crash detected, see server logs
1025 | _timescaledb_functions | policy_refresh_continuous_aggregate | 2128427 | 2024-07-01 06:30:00.006238-04 | 2024-07-01 06:30:00.006365-04 | | job crash detected, see server logs
1023 | _timescaledb_functions | policy_refresh_continuous_aggregate | 2128427 | 2024-07-01 07:00:00.00073-04 | 2024-07-01 07:00:00.000763-04 | | job crash detected, see server logs
(4 rows)
Now
select * from job_errors;
job_id | proc_schema | proc_name | pid | start_time | finish_time | sqlerrcode | err_message
--------+------------------------+-------------------------------------+---------+-------------------------------+-------------------------------+------------+-------------------------------------
1003 | _timescaledb_functions | policy_refresh_continuous_aggregate | 1116242 | 2024-06-28 09:06:39.752447-04 | 2024-06-28 09:06:39.752542-04 | | job crash detected, see server logs
1002 | _timescaledb_functions | policy_refresh_continuous_aggregate | 1116242 | 2024-06-28 10:46:50.699781-04 | 2024-06-28 10:46:50.699857-04 | | job crash detected, see server logs
1025 | _timescaledb_functions | policy_refresh_continuous_aggregate | 2128427 | 2024-07-01 06:30:00.006238-04 | 2024-07-01 06:30:00.006365-04 | | job crash detected, see server logs
1023 | _timescaledb_functions | policy_refresh_continuous_aggregate | 2128427 | 2024-07-01 09:00:00.008187-04 | 2024-07-01 09:00:00.008324-04 | | job crash detected, see server logs
(4 rows)
The entry with start time 2024-07-01 07:00:00.00073-04 was gone
TimescaleDB version affected
2.15.2
PostgreSQL version used
16.3
What operating system did you use?
RHEL8.6
What installation method did you use?
Source
What platform did you run on?
On prem/Self-hosted
Relevant log output and stack trace
How can we reproduce the bug?