[Bug]: Log is full of "failed to start a background worker" #7602

Closed
cheggerdev opened this issue Jan 19, 2025 · 21 comments
Labels
bgw (the background worker subsystem, including the scheduler), bug

Comments

@cheggerdev

cheggerdev commented Jan 19, 2025

What type of bug is this?

Configuration

What subsystems and features are affected?

Background worker

What happened?

The timescaledb log is full of

zabbix-timescaledb-1 | 2025-01-19 09:55:28.045 UTC [37] WARNING: failed to launch job 3 "Job History Log Retention Policy [3]": failed to start a background worker
zabbix-timescaledb-1 | 2025-01-19 09:55:29.723 UTC [36] WARNING: failed to launch job 3 "Job History Log Retention Policy [3]": failed to start a background worker

Increasing the worker settings in the config files has no effect: the launch failures do not disappear.

max_worker_processes = 64 (increased from 32)
timescaledb.max_background_workers = 48 (increased from 8 to 16, then to 32, then to 48)
max_parallel_workers = 4 (number of CPUs)

show timescaledb.telemetry_level; => basic
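
For reference, the effective values, and which configuration file each one comes from, can be double-checked from psql; a sketch using the standard pg_settings view:

SELECT name, setting, source, sourcefile
FROM pg_settings
WHERE name IN ('max_worker_processes',
               'timescaledb.max_background_workers',
               'max_parallel_workers');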

TimescaleDB version affected

docker-compose image tag latest-pg16

PostgreSQL version used

16

What operating system did you use?

Alpine Linux

What installation method did you use?

Docker

What platform did you run on?

Other

Relevant log output and stack trace

zabbix-timescaledb-1  | 2025-01-19 09:55:28.045 UTC [37] WARNING:  failed to launch job 3 "Job History Log Retention Policy [3]": failed to start a background worker
zabbix-timescaledb-1  | 2025-01-19 09:55:29.723 UTC [36] WARNING:  failed to launch job 3 "Job History Log Retention Policy [3]": failed to start a background worker

How can we reproduce the bug?

Not sure.
I am using timescaledb with zabbix 7, or zabbix 7 with timescaledb :-).
I also have pgbackrest configured which means archive_mode is turned on.

I have StartPollers=200 in zabbix_server.conf
I have set max_connections=300 in postgresql.conf

Increasing timescaledb.max_background_workers and max_worker_processes has no effect on the log output: the launch failures do not disappear.
@cheggerdev cheggerdev added the bug label Jan 19, 2025
@mkindahl
Contributor

@cheggerdev Thank you for the bug report. The error is generated when the scheduler cannot spawn a new job and RegisterDynamicBackgroundWorker returns false. This can happen for different reasons, but the only reason that does not print any other error message is when you've run out of background worker slots (which is controlled by max_worker_processes). The message alone is unfortunately not enough information to figure out what is wrong.
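
For context, the jobs that the scheduler tries to launch (job 3 in your log is the job-history retention policy) can be listed from SQL; a sketch:

SELECT job_id, application_name, proc_name, schedule_interval, scheduled
FROM timescaledb_information.jobs
ORDER BY job_id;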

So, can you provide some more information? In particular:

  • Do you see any other events in the log that could be related to this? In particular, what did the same PID print before these messages?
  • Have you restarted the server after changing the parameters?

@mkindahl mkindahl added the bgw The background worker subsystem, including the scheduler label Jan 20, 2025
@cheggerdev
Author

So, can you provide some more information? In particular:

* Do you see any other events in the log that could be related to this? In particular, what did the same PID print before these messages?

Log from a fresh (re-)start:

zabbix-timescaledb-1  | 
zabbix-timescaledb-1  | PostgreSQL Database directory appears to contain a database; Skipping initialization
zabbix-timescaledb-1  | 
zabbix-timescaledb-1  | 2025-01-20 11:37:12.517 UTC [1] LOG:  starting PostgreSQL 16.6 on x86_64-pc-linux-musl, compiled by gcc (Alpine 13.2.1_git20240309) 13.2.1 20240309, 64-bit
zabbix-timescaledb-1  | 2025-01-20 11:37:12.517 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
zabbix-timescaledb-1  | 2025-01-20 11:37:12.517 UTC [1] LOG:  listening on IPv6 address "::", port 5432
zabbix-timescaledb-1  | 2025-01-20 11:37:12.604 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
zabbix-timescaledb-1  | 2025-01-20 11:37:12.772 UTC [30] LOG:  database system was shut down at 2025-01-20 10:14:07 UTC
zabbix-timescaledb-1  | 2025-01-20 11:37:13.005 UTC [1] LOG:  database system is ready to accept connections
zabbix-timescaledb-1  | 2025-01-20 11:37:13.014 UTC [34] LOG:  TimescaleDB background worker launcher connected to shared catalogs
zabbix-timescaledb-1  | 2025-01-20 11:37:17.502 UTC [36] WARNING:  failed to launch job 3 "Job History Log Retention Policy [3]": failed to start a background worker
zabbix-timescaledb-1  | 2025-01-20 11:37:17.737 UTC [37] WARNING:  failed to launch job 3 "Job History Log Retention Policy [3]": failed to start a background worker
zabbix-timescaledb-1  | 2025-01-20 11:37:18.070 UTC [38] LOG:  skipping analyze of "bgw_job_stat" --- lock not available
zabbix-timescaledb-1  | 2025-01-20 11:37:18.415 UTC [36] WARNING:  failed to launch job 3 "Job History Log Retention Policy [3]": failed to start a background worker
zabbix-timescaledb-1  | 2025-01-20 11:37:18.510 UTC [37] WARNING:  failed to launch job 3 "Job History Log Retention Policy [3]": failed to start a background worker
zabbix-timescaledb-1  | 2025-01-20 11:37:19.941 UTC [36] WARNING:  failed to launch job 3 "Job History Log Retention Policy [3]": failed to start a background worker
zabbix-timescaledb-1  | 2025-01-20 11:37:20.059 UTC [37] WARNING:  failed to launch job 3 "Job History Log Retention Policy [3]": failed to start a background worker
zabbix-timescaledb-1  | 2025-01-20 11:37:21.464 UTC [36] WARNING:  failed to launch job 3 "Job History Log Retention Policy [3]": failed to start a background worker
zabbix-timescaledb-1  | 2025-01-20 11:37:21.771 UTC [37] WARNING:  failed to launch job 3 "Job History Log Retention Policy [3]": failed to start a background worker
zabbix-timescaledb-1  | 2025-01-20 11:37:22.881 UTC [36] WARNING:  failed to launch job 3 "Job History Log Retention Policy [3]": failed to start a background worker
zabbix-timescaledb-1  | 2025-01-20 11:37:23.068 UTC [39] LOG:  skipping analyze of "bgw_job_stat" --- lock not available
zabbix-timescaledb-1  | 2025-01-20 11:37:23.461 UTC [37] WARNING:  failed to launch job 3 "Job History Log Retention Policy [3]": failed to start a background worker
zabbix-timescaledb-1  | 2025-01-20 11:37:24.400 UTC [36] WARNING:  failed to launch job 3 "Job History Log Retention Policy [3]": failed to start a background worker
zabbix-timescaledb-1  | 2025-01-20 11:37:24.976 UTC [37] WARNING:  failed to launch job 3 "Job History Log Retention Policy [3]": failed to start a background worker
zabbix-timescaledb-1  | 2025-01-20 11:37:26.080 UTC [36] WARNING:  failed to launch job 3 "Job History Log Retention Policy [3]": failed to start a background worker
zabbix-timescaledb-1  | 2025-01-20 11:37:26.612 UTC [37] WARNING:  failed to launch job 3 "Job History Log Retention Policy [3]": failed to start a background worker
zabbix-timescaledb-1  | 2025-01-20 11:37:27.520 UTC [36] WARNING:  failed to launch job 3 "Job History Log Retention Policy [3]": failed to start a background worker
* Have you restarted the server after changing the parameters?

Yes, I have.

@cheggerdev
Author

Hi,

I got more log output with

ALTER SYSTEM SET timescaledb.bgw_log_level TO 'DEBUG5';
SELECT pg_reload_conf();

and the result:

zabbix-timescaledb-1  | 
zabbix-timescaledb-1  | PostgreSQL Database directory appears to contain a database; Skipping initialization
zabbix-timescaledb-1  | 
zabbix-timescaledb-1  | 2025-01-21 10:02:49.396 UTC [1] LOG:  starting PostgreSQL 16.6 on x86_64-pc-linux-musl, compiled by gcc (Alpine 13.2.1_git20240309) 13.2.1 20240309, 64-bit
zabbix-timescaledb-1  | 2025-01-21 10:02:49.396 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
zabbix-timescaledb-1  | 2025-01-21 10:02:49.396 UTC [1] LOG:  listening on IPv6 address "::", port 5432
zabbix-timescaledb-1  | 2025-01-21 10:02:49.911 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
zabbix-timescaledb-1  | 2025-01-21 10:02:49.998 UTC [29] LOG:  database system was shut down at 2025-01-21 10:01:45 UTC
zabbix-timescaledb-1  | 2025-01-21 10:02:50.070 UTC [1] LOG:  database system is ready to accept connections
zabbix-timescaledb-1  | 2025-01-21 10:02:50.074 UTC [33] LOG:  TimescaleDB background worker launcher connected to shared catalogs
zabbix-timescaledb-1  | 2025-01-21 10:02:50.213 UTC [36] DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0
zabbix-timescaledb-1  | 2025-01-21 10:02:50.213 UTC [35] DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0
zabbix-timescaledb-1  | 2025-01-21 10:02:50.217 UTC [36] DEBUG:  updating scheduled jobs list
zabbix-timescaledb-1  | 2025-01-21 10:02:50.217 UTC [36] DEBUG:  StartSubTransaction(1) name: unnamed; blockState: STARTED; state: INPROGRESS, xid/subid/cid: 0/1/0
zabbix-timescaledb-1  | 2025-01-21 10:02:50.217 UTC [36] DEBUG:  StartSubTransaction(2) name: next start on failure; blockState: SUBBEGIN; state: INPROGRESS, xid/subid/cid: 0/2/0
zabbix-timescaledb-1  | 2025-01-21 10:02:50.217 UTC [36] DEBUG:  CommitSubTransaction(1) name: unnamed; blockState: STARTED; state: INPROGRESS, xid/subid/cid: 0/1/0
zabbix-timescaledb-1  | 2025-01-21 10:02:50.217 UTC [36] DEBUG:  CommitSubTransaction(2) name: next start on failure; blockState: SUBINPROGRESS; state: INPROGRESS, xid/subid/cid: 0/2/0
zabbix-timescaledb-1  | 2025-01-21 10:02:50.217 UTC [36] DEBUG:  CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGRESS, xid/subid/cid: 0/1/0
zabbix-timescaledb-1  | 2025-01-21 10:02:50.217 UTC [36] DEBUG:  database scheduler for database 5 starting
zabbix-timescaledb-1  | 2025-01-21 10:02:50.217 UTC [36] DEBUG:  scheduler wakeup in database 5
zabbix-timescaledb-1  | 2025-01-21 10:02:50.217 UTC [36] DEBUG:  starting scheduled job 3
zabbix-timescaledb-1  | 2025-01-21 10:02:50.217 UTC [36] DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0
zabbix-timescaledb-1  | 2025-01-21 10:02:50.218 UTC [35] DEBUG:  updating scheduled jobs list
zabbix-timescaledb-1  | 2025-01-21 10:02:50.218 UTC [36] DEBUG:  CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGRESS, xid/subid/cid: 16799308/1/1
zabbix-timescaledb-1  | 2025-01-21 10:02:50.219 UTC [35] DEBUG:  StartSubTransaction(1) name: unnamed; blockState: STARTED; state: INPROGRESS, xid/subid/cid: 0/1/0
zabbix-timescaledb-1  | 2025-01-21 10:02:50.219 UTC [35] DEBUG:  StartSubTransaction(2) name: next start on failure; blockState: SUBBEGIN; state: INPROGRESS, xid/subid/cid: 0/2/0
zabbix-timescaledb-1  | 2025-01-21 10:02:50.219 UTC [35] DEBUG:  CommitSubTransaction(1) name: unnamed; blockState: STARTED; state: INPROGRESS, xid/subid/cid: 0/1/0
zabbix-timescaledb-1  | 2025-01-21 10:02:50.219 UTC [35] DEBUG:  CommitSubTransaction(2) name: next start on failure; blockState: SUBINPROGRESS; state: INPROGRESS, xid/subid/cid: 0/2/0
zabbix-timescaledb-1  | 2025-01-21 10:02:50.219 UTC [35] DEBUG:  StartSubTransaction(1) name: unnamed; blockState: STARTED; state: INPROGRESS, xid/subid/cid: 0/1/0
zabbix-timescaledb-1  | 2025-01-21 10:02:50.219 UTC [35] DEBUG:  StartSubTransaction(2) name: next start on failure; blockState: SUBBEGIN; state: INPROGRESS, xid/subid/cid: 0/3/0
zabbix-timescaledb-1  | 2025-01-21 10:02:50.219 UTC [35] DEBUG:  CommitSubTransaction(1) name: unnamed; blockState: STARTED; state: INPROGRESS, xid/subid/cid: 0/1/0
zabbix-timescaledb-1  | 2025-01-21 10:02:50.219 UTC [35] DEBUG:  CommitSubTransaction(2) name: next start on failure; blockState: SUBINPROGRESS; state: INPROGRESS, xid/subid/cid: 0/3/0
zabbix-timescaledb-1  | 2025-01-21 10:02:50.219 UTC [35] DEBUG:  CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGRESS, xid/subid/cid: 0/1/0
zabbix-timescaledb-1  | 2025-01-21 10:02:50.219 UTC [35] DEBUG:  database scheduler for database 16384 starting
zabbix-timescaledb-1  | 2025-01-21 10:02:50.219 UTC [35] DEBUG:  scheduler wakeup in database 16384
zabbix-timescaledb-1  | 2025-01-21 10:02:50.219 UTC [35] DEBUG:  starting scheduled job 3 in 3853 seconds
zabbix-timescaledb-1  | 2025-01-21 10:02:50.219 UTC [35] DEBUG:  starting scheduled job 1 in 4049 seconds
zabbix-timescaledb-1  | 2025-01-21 10:02:50.219 UTC [35] DEBUG:  scheduler wakeup in database 16384
zabbix-timescaledb-1  | 2025-01-21 10:02:50.219 UTC [35] DEBUG:  starting scheduled job 3 in 3853 seconds
zabbix-timescaledb-1  | 2025-01-21 10:02:50.219 UTC [35] DEBUG:  starting scheduled job 1 in 4049 seconds
zabbix-timescaledb-1  | 2025-01-21 10:02:50.352 UTC [36] DEBUG:  launching job 3 "Job History Log Retention Policy [3]"
zabbix-timescaledb-1  | 2025-01-21 10:02:50.353 UTC [36] NOTICE:  unable to register background worker
zabbix-timescaledb-1  | 2025-01-21 10:02:50.353 UTC [36] WARNING:  failed to launch job 3 "Job History Log Retention Policy [3]": failed to start a background worker
zabbix-timescaledb-1  | 2025-01-21 10:02:50.353 UTC [36] DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0
zabbix-timescaledb-1  | 2025-01-21 10:02:50.355 UTC [36] DEBUG:  CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGRESS, xid/subid/cid: 16799309/1/3
zabbix-timescaledb-1  | 2025-01-21 10:02:50.775 UTC [36] DEBUG:  starting scheduled job 1 in 4049 seconds
zabbix-timescaledb-1  | 2025-01-21 10:02:51.775 UTC [36] DEBUG:  scheduler wakeup in database 5
zabbix-timescaledb-1  | 2025-01-21 10:02:51.775 UTC [36] DEBUG:  starting scheduled job 3
zabbix-timescaledb-1  | 2025-01-21 10:02:51.775 UTC [36] DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0
zabbix-timescaledb-1  | 2025-01-21 10:02:51.775 UTC [36] DEBUG:  CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGRESS, xid/subid/cid: 16799310/1/1
zabbix-timescaledb-1  | 2025-01-21 10:02:51.972 UTC [36] DEBUG:  launching job 3 "Job History Log Retention Policy [3]"
zabbix-timescaledb-1  | 2025-01-21 10:02:51.972 UTC [36] NOTICE:  unable to register background worker
zabbix-timescaledb-1  | 2025-01-21 10:02:51.972 UTC [36] WARNING:  failed to launch job 3 "Job History Log Retention Policy [3]": failed to start a background worker
zabbix-timescaledb-1  | 2025-01-21 10:02:51.972 UTC [36] DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0
zabbix-timescaledb-1  | 2025-01-21 10:02:51.973 UTC [36] DEBUG:  CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGRESS, xid/subid/cid: 16799311/1/3
zabbix-timescaledb-1  | 2025-01-21 10:02:52.292 UTC [36] DEBUG:  starting scheduled job 1 in 4047 seconds
zabbix-timescaledb-1  | 2025-01-21 10:02:53.292 UTC [36] DEBUG:  scheduler wakeup in database 5
zabbix-timescaledb-1  | 2025-01-21 10:02:53.292 UTC [36] DEBUG:  starting scheduled job 3
zabbix-timescaledb-1  | 2025-01-21 10:02:53.292 UTC [36] DEBUG:  StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0
zabbix-timescaledb-1  | 2025-01-21 10:02:53.293 UTC [36] DEBUG:  CommitTransaction(1) name: unnamed; blockState: STARTED; state: INPROGRESS, xid/subid/cid: 16799312/1/1
zabbix-timescaledb-1  | 2025-01-21 10:02:53.377 UTC [36] DEBUG:  launching job 3 "Job History Log Retention Policy [3]"
zabbix-timescaledb-1  | 2025-01-21 10:02:53.377 UTC [36] NOTICE:  unable to register background worker
zabbix-timescaledb-1  | 2025-01-21 10:02:53.377 UTC [36] WARNING:  failed to launch job 3 "Job History Log Retention Policy [3]": failed to start a background worker

@mkindahl
Contributor

This line is generated when the RegisterDynamicBackgroundWorker call fails, and since there is no additional message preceding it, the only remaining explanation is that the slots are exhausted.

zabbix-timescaledb-1  | 2025-01-21 10:02:50.353 UTC [36] NOTICE:  unable to register background worker

I am not sure there is a good way to check the number of slots or slot assignment through the SQL interface, but could you check pg_stat_activity and see how many processes you have there?
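
A per-backend-type breakdown is often more telling than a plain count, since ordinary client backends do not occupy background worker slots; a sketch:

SELECT backend_type, count(*)
FROM pg_stat_activity
GROUP BY backend_type
ORDER BY count(*) DESC;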

@cheggerdev
Author

select count(*) from pg_stat_activity;

=> 42 most times
=> 43 sometimes

@mkindahl
Contributor

select count(*) from pg_stat_activity;

=> 42 most times => 43 sometimes

So then you need at least this many slots. Since you set it to 64, that should be sufficient, but unfortunately there seem to be more slots in use. It is not easy to figure out how many slots are used, but if you can attach a debugger to the server or a backend, you can check the contents of the symbol BackgroundWorkerData, which holds the slots and information about them.

@cheggerdev
Author

I think I figured it out:

I use pgbackrest (https://pgbackrest.org), which requires turning on archive_mode and setting archive_command = 'pgbackrest --stanza demo archive-push %p'

When I turn off archive_mode the launch failures go away.
When I turn on archive_mode the launch failures appear with the first WAL archive push.

How many slots do I need to also cover archive_mode?
https://docs.timescale.com/self-hosted/latest/configuration/about-configuration/#workers tells me nothing about how to determine that.

@mkindahl
Contributor

I think I figured it out:

I use pgbackrest (https://pgbackrest.org), which requires turning on archive_mode and setting archive_command = 'pgbackrest --stanza demo archive-push %p'

When I turn off archive_mode the launch failures go away. When I turn on archive_mode the launch failures appear with the first WAL archive push.

Then the WAL process in question probably does not exit properly. Have you checked the status of that background worker in the log? It might quickly eat up the slots if not exiting correctly.

How many slots do I need to also cover archive_mode? https://docs.timescale.com/self-hosted/latest/configuration/about-configuration/#workers tells me nothing about how to determine that.

It is a PostgreSQL feature, and the documentation is not very clear about that either.

@cheggerdev
Author

Then the WAL process in question probably does not exit properly. Have you checked the status of that background worker in the log? It might quickly eat up the slots if not exiting correctly.

With vanilla postgresql I do not have that problem. Only with timescaledb.
Is timescaledb doing something that affects the WAL process? Just guessing...

@cheggerdev
Author

Concerning the question of whether the WAL process exits properly, I filed pgbackrest/pgbackrest#2532

The answer:

The archiver does not use additional background worker slots no matter the result from archive_command. Even if it did I think that would be strictly a PostgreSQL issue.

@cheggerdev
Author

cheggerdev commented Jan 23, 2025

So then you need at least this many slots. Since you set it to 64, that should be sufficient, but unfortunately, there seems to be more slots used. It is not easy to figure out how many slots are used, but if you can connect a debugger to the server or a backend, you can check the contents of the symbol BackgroundWorkerData which holds the slots and information about it.

gdb -p 35
(gdb) break RegisterDynamicBackgroundWorker
Breakpoint 1 at 0x55bb67105ae0
(gdb) cont
Continuing.

Breakpoint 1, 0x000055bb67105ae0 in RegisterDynamicBackgroundWorker ()
(gdb) bt
#0  0x000055bb67105ae0 in RegisterDynamicBackgroundWorker ()
#1  0x00007ffaf724cc79 in ts_bgw_start_worker (name=name@entry=0x7ffaf74f66cc "Telemetry Reporter [1]", bgw_params=bgw_params@entry=0x7fffd5b37980)
    at /build/timescaledb/src/bgw/scheduler.c:136
#2  0x00007ffaf72487a7 in ts_bgw_job_start (job=job@entry=0x7ffaf74f66c8, user_oid=<optimized out>) at /build/timescaledb/src/bgw/job.c:76
#3  0x00007ffaf724c730 in scheduled_bgw_job_transition_state_to (sjob=0x7ffaf74f66c8, new_state=JOB_STATE_STARTED)
    at /build/timescaledb/src/bgw/scheduler.c:357
#4  0x00007ffaf724d5cd in scheduled_ts_bgw_job_start (sjob=0x7ffaf74f66c8, bgw_register=<optimized out>) at /build/timescaledb/src/bgw/scheduler.c:434
#5  start_scheduled_jobs (bgw_register=<optimized out>) at /build/timescaledb/src/bgw/scheduler.c:619
#6  ts_bgw_scheduler_process (run_for_interval_ms=run_for_interval_ms@entry=-1, bgw_register=bgw_register@entry=0x0)
    at /build/timescaledb/src/bgw/scheduler.c:831
#7  0x00007ffaf724d8f1 in ts_bgw_scheduler_main (fcinfo=<optimized out>) at /build/timescaledb/src/bgw/scheduler.c:955
#8  0x000055bb672f26d9 in DirectFunctionCall1Coll ()
#9  0x00007ffb1a64c711 in ts_bgw_db_scheduler_entrypoint (fcinfo=<optimized out>) at /build/timescaledb/src/loader/bgw_launcher.c:923
#10 0x000055bb671057e0 in BackgroundWorkerMain ()
#11 0x000055bb67107b0e in postmaster_child_launch ()
#12 0x000055bb67109df1 in maybe_start_bgworkers ()
#13 0x000055bb6710ae0a in ServerLoop.isra.0 ()
#14 0x000055bb6710d41d in PostmasterMain ()
#15 0x000055bb66e393d3 in main ()
(gdb) print /x BackgroundWorkerData
'BackgroundWorkerData' has unknown type; cast it to its declared type
(gdb) finish
Run till exit from #0  0x000055bb67105ae0 in RegisterDynamicBackgroundWorker ()
0x00007ffaf724cc79 in ts_bgw_start_worker (name=name@entry=0x7ffaf74f66cc "Telemetry Reporter [1]", bgw_params=bgw_params@entry=0x7fffd5b37980)
    at /build/timescaledb/src/bgw/scheduler.c:136
warning: 136	/build/timescaledb/src/bgw/scheduler.c: No such file or directory
(gdb) 
Run till exit from #0  0x00007ffaf724cc79 in ts_bgw_start_worker (name=name@entry=0x7ffaf74f66cc "Telemetry Reporter [1]", 
    bgw_params=bgw_params@entry=0x7fffd5b37980) at /build/timescaledb/src/bgw/scheduler.c:136
0x00007ffaf72487a7 in ts_bgw_job_start (job=job@entry=0x7ffaf74f66c8, user_oid=<optimized out>) at /build/timescaledb/src/bgw/job.c:77
warning: 77	/build/timescaledb/src/bgw/job.c: No such file or directory
Value returned is $4 = (BackgroundWorkerHandle *) 0x0
(gdb) 
Run till exit from #0  0x00007ffaf72487a7 in ts_bgw_job_start (job=job@entry=0x7ffaf74f66c8, user_oid=<optimized out>) at /build/timescaledb/src/bgw/job.c:77
0x00007ffaf724c730 in scheduled_bgw_job_transition_state_to (sjob=0x7ffaf74f66c8, new_state=JOB_STATE_STARTED) at /build/timescaledb/src/bgw/scheduler.c:357
warning: 357	/build/timescaledb/src/bgw/scheduler.c: No such file or directory
Value returned is $5 = (BackgroundWorkerHandle *) 0x0
(gdb) 
Run till exit from #0  0x00007ffaf724c730 in scheduled_bgw_job_transition_state_to (sjob=0x7ffaf74f66c8, new_state=JOB_STATE_STARTED)
    at /build/timescaledb/src/bgw/scheduler.c:357
scheduled_ts_bgw_job_start (sjob=0x7ffaf74f66c8, bgw_register=<optimized out>) at /build/timescaledb/src/bgw/scheduler.c:436
436	in /build/timescaledb/src/bgw/scheduler.c
(gdb) 
Run till exit from #0  scheduled_ts_bgw_job_start (sjob=0x7ffaf74f66c8, bgw_register=<optimized out>) at /build/timescaledb/src/bgw/scheduler.c:436
start_scheduled_jobs (bgw_register=<optimized out>) at /build/timescaledb/src/bgw/scheduler.c:609
609	in /build/timescaledb/src/bgw/scheduler.c
(gdb) 
Run till exit from #0  start_scheduled_jobs (bgw_register=<optimized out>) at /build/timescaledb/src/bgw/scheduler.c:609
ts_bgw_scheduler_process (run_for_interval_ms=run_for_interval_ms@entry=-1, bgw_register=bgw_register@entry=0x0)
    at /build/timescaledb/src/bgw/scheduler.c:832
832	in /build/timescaledb/src/bgw/scheduler.c
(gdb) 
Run till exit from #0  ts_bgw_scheduler_process (run_for_interval_ms=run_for_interval_ms@entry=-1, bgw_register=bgw_register@entry=0x0)
    at /build/timescaledb/src/bgw/scheduler.c:832

Breakpoint 1, 0x000055bb67105ae0 in RegisterDynamicBackgroundWorker ()
(gdb) 

So it is ts_bgw_scheduler_process() that decides to start another child process, and that attempt fails endlessly.

What can I do to stop the endless loop of launch failures?

@mkindahl
Contributor

(gdb) print /x BackgroundWorkerData
'BackgroundWorkerData' has unknown type; cast it to its declared type

To see this you need to load debuginfo for PostgreSQL.

(gdb) finish

Here it would be nice to see the value of BackgroundWorkerData->total_slots, but you do not have debug info. However, since this field is first in the structure, you should be able to get the value with this:

(gdb) print *(int*) BackgroundWorkerData

Run till exit from #0 0x000055bb67105ae0 in RegisterDynamicBackgroundWorker ()
0x00007ffaf724cc79 in ts_bgw_start_worker (name=name@entry=0x7ffaf74f66cc "Telemetry Reporter [1]", bgw_params=bgw_params@entry=0x7fffd5b37980)
at /build/timescaledb/src/bgw/scheduler.c:136

I do not see the returned value here, but I assume that it is "false", since we can see that the handle is NULL further up in the stack.

It is trying to spawn the telemetry job. Disabling telemetry would at least avoid spawning this job, but you might still run into issues with other jobs (because there are not enough slots).
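
A sketch of how telemetry could be disabled (timescaledb.telemetry_level is the setting you already showed as 'basic'; turning it off takes effect after a reload):

ALTER SYSTEM SET timescaledb.telemetry_level TO 'off';
SELECT pg_reload_conf();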

warning: 136 /build/timescaledb/src/bgw/scheduler.c: No such file or directory
(gdb)
Run till exit from #0 0x00007ffaf724cc79 in ts_bgw_start_worker (name=name@entry=0x7ffaf74f66cc "Telemetry Reporter [1]",
bgw_params=bgw_params@entry=0x7fffd5b37980) at /build/timescaledb/src/bgw/scheduler.c:136
0x00007ffaf72487a7 in ts_bgw_job_start (job=job@entry=0x7ffaf74f66c8, user_oid=) at /build/timescaledb/src/bgw/job.c:77
warning: 77 /build/timescaledb/src/bgw/job.c: No such file or directory
Value returned is $4 = (BackgroundWorkerHandle *) 0x0
(gdb)
Run till exit from #0 0x00007ffaf72487a7 in ts_bgw_job_start (job=job@entry=0x7ffaf74f66c8, user_oid=) at /build/timescaledb/src/bgw/job.c:77
0x00007ffaf724c730 in scheduled_bgw_job_transition_state_to (sjob=0x7ffaf74f66c8, new_state=JOB_STATE_STARTED) at /build/timescaledb/src/bgw/scheduler.c:357
warning: 357 /build/timescaledb/src/bgw/scheduler.c: No such file or directory
Value returned is $5 = (BackgroundWorkerHandle *) 0x0

Here you can see that RegisterDynamicBackgroundWorker() has failed to register the worker (because there are not enough slots?), which is why NULL is returned as the background worker handle.

(gdb) 
Run till exit from #0  0x00007ffaf724c730 in scheduled_bgw_job_transition_state_to (sjob=0x7ffaf74f66c8, new_state=JOB_STATE_STARTED)
    at /build/timescaledb/src/bgw/scheduler.c:357
scheduled_ts_bgw_job_start (sjob=0x7ffaf74f66c8, bgw_register=<optimized out>) at /build/timescaledb/src/bgw/scheduler.c:436
436	in /build/timescaledb/src/bgw/scheduler.c
(gdb) 
Run till exit from #0  scheduled_ts_bgw_job_start (sjob=0x7ffaf74f66c8, bgw_register=<optimized out>) at /build/timescaledb/src/bgw/scheduler.c:436
start_scheduled_jobs (bgw_register=<optimized out>) at /build/timescaledb/src/bgw/scheduler.c:609
609	in /build/timescaledb/src/bgw/scheduler.c
(gdb) 
Run till exit from #0  start_scheduled_jobs (bgw_register=<optimized out>) at /build/timescaledb/src/bgw/scheduler.c:609
ts_bgw_scheduler_process (run_for_interval_ms=run_for_interval_ms@entry=-1, bgw_register=bgw_register@entry=0x0)
    at /build/timescaledb/src/bgw/scheduler.c:832
832	in /build/timescaledb/src/bgw/scheduler.c
(gdb) 
Run till exit from #0  ts_bgw_scheduler_process (run_for_interval_ms=run_for_interval_ms@entry=-1, bgw_register=bgw_register@entry=0x0)
    at /build/timescaledb/src/bgw/scheduler.c:832

Breakpoint 1, 0x000055bb67105ae0 in RegisterDynamicBackgroundWorker ()
(gdb) 

So it is ts_bgw_scheduler_process() that decides to start another child process, and that attempt fails endlessly.

Yes, this is the function in TimescaleDB that runs the scheduler as a background worker. The scheduler will then spawn background workers for each of the jobs as needed according to the schedule.

What can I do to stop the endless loop of launch failures?

Well... normally the launch of a background worker should eventually succeed. If it fails, the job has not done its work, which is why a restart is attempted. The problem is still that it keeps failing, and it looks like that is because there are not enough slots.
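
If you need to silence the log noise while investigating the slot shortage, one possible workaround (a sketch; it only hides the symptom) is to unschedule the failing job and re-enable it once the configuration is fixed:

SELECT alter_job(3, scheduled => false);
-- after fixing the worker slot configuration:
SELECT alter_job(3, scheduled => true);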

@mkindahl
Contributor

Concerning the question of whether the WAL process exits properly, I filed pgbackrest/pgbackrest#2532

The answer:

The archiver does not use additional background worker slots no matter the result from archive_command. Even if it did I think that would be strictly a PostgreSQL issue.

Yes, it is a PostgreSQL issue since it is PostgreSQL that runs the archive command. It does this from a background worker, and the question is whether this command exits with a failure; in that case the slot might not be cleaned up, which would explain why you run out of slots. If you can look in the log for background workers terminated with exit code 1, that might help pinpoint the problem.

@cheggerdev
Author

(gdb) finish

Here it would be nice to see the value of BackgroundWorkerData->total_slots, but you do not have debug info. However, since this field is first in the structure, you should be able to get the value with this:

(gdb) print *(int*) BackgroundWorkerData
(gdb) print *(int*) BackgroundWorkerData
$2 = 4

@cheggerdev
Author

Yes, it is a PostgreSQL issue since it is PostgreSQL that runs the archive command. It does this from a background worker, and the question is whether this command exits with a failure; in that case the slot might not be cleaned up, which would explain why you run out of slots. If you can look in the log for background workers terminated with exit code 1, that might help pinpoint the problem.

The log output says nothing; it is not verbose enough even with DEBUG5.
Stepping through with gdb, SanityCheckBackgroundWorker() returns 1 (true) when it is called from RegisterDynamicBackgroundWorker().

@mkindahl
Contributor

(gdb) finish

Here it would be nice to see the value of BackgroundWorkerData->total_slots, but you do not have debug info. However, since this field is first in the structure, you should be able to get the value with this:

(gdb) print *(int*) BackgroundWorkerData
(gdb) print *(int*) BackgroundWorkerData
$2 = 4

Thanks. This is very strange. Why do you have a total of just 4 worker slots? That would explain why you cannot find a free slot, but it is not clear why it is 4 when you have set max_worker_processes higher.

@cheggerdev
Author

(gdb) finish

Here it would be nice to see the value of BackgroundWorkerData->total_slots, but you do not have debug info. However, since this field is first in the structure, you should be able to get the value with this:

(gdb) print *(int*) BackgroundWorkerData
(gdb) print *(int*) BackgroundWorkerData
$2 = 4

Thanks. This is very strange. Why do you have a total of just 4 worker slots? That would explain why you cannot find a free slot, but it is not clear why it is 4 when you have set max_worker_processes higher.

Is total_slots bound to the number of CPUs? I have only 4 CPUs ...
I tried values other than 4 for max_parallel_workers, with no impact on BackgroundWorkerData->total_slots.

@mkindahl
Contributor

Thanks. This is very strange. Why do you have a total of just 4 worker slots? That would explain why you cannot find a free slot, but it is not clear why it is 4 when you have set max_worker_processes higher.

Is total_slots bound to the number of CPUs? I have only 4 CPUs ... I tried values other than 4 for max_parallel_workers, with no impact on BackgroundWorkerData->total_slots.

Nope, it is not related to the number of CPUs. Note that if the number of slots does not match max_worker_processes you will get a lot of errors in the log. Here is a comment from the source code:

	/*
	 * The total number of slots stored in shared memory should match our
	 * notion of max_worker_processes.  If it does not, something is very
	 * wrong.  Further down, we always refer to this value as
	 * max_worker_processes, in case shared memory gets corrupted while we're
	 * looping.
	 */
	if (max_worker_processes != BackgroundWorkerData->total_slots)
	{
		ereport(LOG,
				(errmsg("inconsistent background worker state (max_worker_processes=%d, total_slots=%d)",
						max_worker_processes,
						BackgroundWorkerData->total_slots)));
		return;
	}

Please check that:

  • You only have one row with max_worker_processes in the configuration file.
  • That you're editing the correct configuration file.
  • That you have restarted the server.
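
One way to check the first two points from psql is the pg_file_settings view, which lists every occurrence of a setting across postgresql.conf and postgresql.auto.conf and shows which one is applied; a sketch:

SELECT sourcefile, sourceline, name, setting, applied
FROM pg_file_settings
WHERE name IN ('max_worker_processes', 'timescaledb.max_background_workers');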

@cheggerdev
Author

Please check that:

* You only have one row with `max_worker_processes` in the configuration file.

Here we go: there is a postgresql.auto.conf generated via psql with the values from

https://pgtune.leopard.in.ua/?dbVersion=16&osType=linux&dbType=mixed&cpuNum=4&totalMemory=3&totalMemoryUnit=GB&connectionNum=300&hdType=hdd

pgtune always sets max_worker_processes to the given number of CPUs when CPUs >= 4.

I removed the line which sets max_worker_processes = '4' from postgresql.auto.conf

* That you're editing the correct configuration file. 

I assume postgresql.conf is correct (there is only one within a docker container).

* That you have restarted the server.

Done.
With that, the majority of the background worker launch failures are gone.

I found one with 'exit code 1'. Is that what you were referring to in #7602 (comment)?

zabbix-timescaledb-1  | 2025-01-28 16:06:14.994 UTC [37] WARNING:  terminating background worker "Telemetry Reporter [1]" due to timeout
zabbix-timescaledb-1  | 2025-01-28 16:06:15.115 UTC [2767] FATAL:  terminating background worker "Telemetry Reporter [1]" due to administrator command
zabbix-timescaledb-1  | 2025-01-28 16:06:15.115 UTC [2767] CONTEXT:  SQL statement "SELECT job_type, jsonb_object_agg(sqlerrcode, count) FROM(	SELECT (	CASE 			WHEN proc_schema = '_timescaledb_functions' 			AND proc_name ~ '^policy_(retention|compression|reorder|refresh_continuous_aggregate|telemetry|job_error_retention)$' 			THEN proc_name 			ELSE 'user_defined_action'		END	) as job_type, 	sqlerrcode, 	pg_catalog.COUNT(*) 	FROM 	timescaledb_information.job_errors 	WHERE sqlerrcode IS NOT NULL 	GROUP BY job_type, sqlerrcode 	ORDER BY job_type) q GROUP BY q.job_type"
zabbix-timescaledb-1  | 2025-01-28 16:06:15.130 UTC [1] LOG:  background worker "Telemetry Reporter [1]" (PID 2767) exited with exit code 1
zabbix-timescaledb-1  | 2025-01-28 16:06:15.131 UTC [37] LOG:  job 1 failed

@mkindahl
Contributor

mkindahl commented Jan 29, 2025

Please check that:

* You only have one row with `max_worker_processes` in the configuration file.

Here we go: there is a postgresql.auto.conf generated via psql with the values from

https://pgtune.leopard.in.ua/?dbVersion=16&osType=linux&dbType=mixed&cpuNum=4&totalMemory=3&totalMemoryUnit=GB&connectionNum=300&hdType=hdd

pgtune always sets max_worker_processes to the given number of CPUs when CPUs >= 4.

Ok.

I removed the line which sets max_worker_processes = '4' from postgresql.auto.conf

* That you're editing the correct configuration file. 

I assume postgresql.conf is correct (there is only one within a docker container).

Yes, this is the main configuration file. The auto file is used by external tools and by the ALTER SYSTEM command. Here is a quote from 19.1.2. Parameter Interaction via the Configuration File:

In addition to postgresql.conf, a PostgreSQL data directory contains a file postgresql.auto.conf, which has the same format as postgresql.conf but is intended to be edited automatically, not manually. This file holds settings provided through the ALTER SYSTEM command. This file is read whenever postgresql.conf is, and its settings take effect in the same way. Settings in postgresql.auto.conf override those in postgresql.conf.

External tools may also modify postgresql.auto.conf. It is not recommended to do this while the server is running unless allow_alter_system is set to off, since a concurrent ALTER SYSTEM command could overwrite such changes. Such tools might simply append new settings to the end, or they might choose to remove duplicate settings and/or comments (as ALTER SYSTEM will).

So the postgresql.auto.conf file needs to be considered as well.
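
As a side note, rather than hand-editing postgresql.auto.conf, the supported way to drop such a stray setting is ALTER SYSTEM RESET (a sketch; max_worker_processes still needs a server restart to take effect):

ALTER SYSTEM RESET max_worker_processes;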

Done. With that, the majority of the background worker launch failures are gone.

Good that you found the issue.

I found one with 'exit code 1'. Is that what you were referring to in #7602 (comment)?

zabbix-timescaledb-1  | 2025-01-28 16:06:14.994 UTC [37] WARNING:  terminating background worker "Telemetry Reporter [1]" due to timeout
zabbix-timescaledb-1  | 2025-01-28 16:06:15.115 UTC [2767] FATAL:  terminating background worker "Telemetry Reporter [1]" due to administrator command
zabbix-timescaledb-1  | 2025-01-28 16:06:15.115 UTC [2767] CONTEXT:  SQL statement "SELECT job_type, jsonb_object_agg(sqlerrcode, count) FROM(	SELECT (	CASE 			WHEN proc_schema = '_timescaledb_functions' 			AND proc_name ~ '^policy_(retention|compression|reorder|refresh_continuous_aggregate|telemetry|job_error_retention)$' 			THEN proc_name 			ELSE 'user_defined_action'		END	) as job_type, 	sqlerrcode, 	pg_catalog.COUNT(*) 	FROM 	timescaledb_information.job_errors 	WHERE sqlerrcode IS NOT NULL 	GROUP BY job_type, sqlerrcode 	ORDER BY job_type) q GROUP BY q.job_type"
zabbix-timescaledb-1  | 2025-01-28 16:06:15.130 UTC [1] LOG:  background worker "Telemetry Reporter [1]" (PID 2767) exited with exit code 1
zabbix-timescaledb-1  | 2025-01-28 16:06:15.131 UTC [37] LOG:  job 1 failed

Correct. This slot is not cleaned up and the postmaster will restart this worker eventually, so it will take up space in the slot array.
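
For ongoing monitoring, job failures can also be watched from SQL rather than the log; a sketch using TimescaleDB's informational views:

SELECT job_id, last_run_status, total_failures, next_start
FROM timescaledb_information.job_stats
ORDER BY job_id;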

@mkindahl
Contributor

Since it seems it was a configuration issue, I will close this. Feel free to ask for it to be re-opened if there are lingering issues.
