[Bug]: Log is full of "failed to start a background worker" #7602
Comments
@cheggerdev Thank you for the bug report. The error is generated when the scheduler cannot spawn a new job. So, can you provide some more information? In particular:
Log from a fresh (re-)start:
Yes, I do.
Hi, I got more log output with
and the result:
This line is generated when the
I am not sure there is a good way to check the number of slots or slot assignment through the SQL interface, but could you check
=> 42 most times
So then you need at least this many slots. Since you set it to 64, that should be sufficient, but unfortunately, there seem to be more slots in use. It is not easy to figure out how many slots are used, but if you can connect a debugger to the server or a backend, you can check the contents of the symbol
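For reference, a rough approximation from the SQL side (not a substitute for inspecting BackgroundWorkerData, and not necessarily what was used in this thread) is to count the configured jobs and the workers currently running:

```sql
-- Jobs the TimescaleDB scheduler may need to launch; each running job
-- occupies a background worker slot while it executes.
SELECT count(*) FROM timescaledb_information.jobs;

-- Backends currently attached to the server, grouped by type; background
-- workers show up under their registered worker type. This counts running
-- workers only, not registered-but-idle slots.
SELECT backend_type, count(*)
FROM pg_stat_activity
GROUP BY backend_type
ORDER BY count(*) DESC;
```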
I think I figured it out: I use pgbackrest (https://pgbackrest.org), which requires turning on archive_mode and setting archive_command = 'pgbackrest --stanza demo archive-push %p'. When I turn off archive_mode, the launch failures go away. How many slots do I need to also cover archive_mode?
Then the WAL process in question probably does not exit properly. Have you checked the status of that background worker in the log? It might quickly eat up the slots if not exiting correctly.
It is a PostgreSQL feature, and the documentation is not very clear about that either.
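Whether the archive command itself is failing can also be checked from SQL through the pg_stat_archiver view (a sketch, not something quoted in this thread):

```sql
-- A steadily growing failed_count means the archiver keeps retrying
-- archive_command; last_failed_wal and last_failed_time show the most
-- recent failure.
SELECT archived_count, last_archived_wal,
       failed_count, last_failed_wal, last_failed_time
FROM pg_stat_archiver;
```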
With vanilla PostgreSQL I do not have that problem. Only with TimescaleDB.
Concerning the question of whether the WAL process exits properly, I filed pgbackrest/pgbackrest#2532. The answer:
gdb -p 35
So it is ts_bgw_scheduler_process() that decides to start another child process, and that launch will fail endlessly. What can I do to stop the endless loop of launch failures?
To see this you need to load debuginfo for PostgreSQL.
Here it would be nice to see the value of
I do not see the returned value here, but I assume that it is "false", since we can see that the handle is NULL further up in the stack. It is trying to spawn the telemetry job. Disabling telemetry might avoid spawning this job at least, but you might still run into issues with other jobs (because there are not enough slots).
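For completeness, telemetry is controlled by the timescaledb.telemetry_level setting; a minimal sketch of turning it off (assuming a configuration reload is enough to pick it up):

```sql
ALTER SYSTEM SET timescaledb.telemetry_level = 'off';

-- Reload the configuration so the new value takes effect
-- (a restart also works).
SELECT pg_reload_conf();

-- Confirm the new value.
SHOW timescaledb.telemetry_level;
```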
Here you can see that
Yes, this is the function in TimescaleDB that runs the scheduler as a background worker. The scheduler will then spawn background workers for each of the jobs as needed according to the schedule.
Well... normally the launch of a background worker should eventually succeed. If it fails, the job has not done its job, which is why a restart is attempted. The problem is still that it fails, and it looks like it is because there are not enough slots.
Yes, it is a PostgreSQL issue, since it is PostgreSQL that runs the archive command. It does this from a background worker, and the question is whether this command exits with a failure; if it does, the slot might not be cleaned up, which would explain why you run out of slots. If you can look in the log for background workers terminated with error code 1, that might help pinpoint the problem.
The log output says nothing; it is not verbose enough, even with DEBUG5.
Thanks. This is very strange. Why do you have a total of just 4 worker slots? That would explain why you cannot find a slot, but it is not clear why it is 4 when you have set it higher.
Is total_slots bound to the number of CPUs? I have only 4 CPUs ...
Nope, it is not related to the number of CPUs. Note that if the number of slots does not match max_worker_processes, PostgreSQL reports an inconsistent background worker state and bails out:

```c
/*
 * The total number of slots stored in shared memory should match our
 * notion of max_worker_processes. If it does not, something is very
 * wrong. Further down, we always refer to this value as
 * max_worker_processes, in case shared memory gets corrupted while we're
 * looping.
 */
if (max_worker_processes != BackgroundWorkerData->total_slots)
{
    ereport(LOG,
            (errmsg("inconsistent background worker state (max_worker_processes=%d, total_slots=%d)",
                    max_worker_processes,
                    BackgroundWorkerData->total_slots)));
    return;
}
```

Please check that:
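The exact checks requested are not preserved above, but one way to see which value of max_worker_processes is actually in effect, and which configuration file supplied it, is the pg_settings view (a sketch; sourcefile and sourceline are only visible to superusers or members of pg_read_all_settings):

```sql
-- 'setting' is the value in effect; 'source' and 'sourcefile' show where it
-- came from, e.g. postgresql.conf vs. postgresql.auto.conf.
SELECT name, setting, source, sourcefile, sourceline
FROM pg_settings
WHERE name IN ('max_worker_processes', 'timescaledb.max_background_workers');
```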
Here we go: there is a postgresql.auto.conf generated via psql, and pgtune always sets max_worker_processes to the given number of CPUs when CPUs >= 4. I removed the line that sets max_worker_processes = '4' from postgresql.auto.conf.
I assume postgresql.conf is correct (there is only one within a docker container).
Done. I found one with 'exit code 1'. Is that what you were referring to in #7602 (comment)?
Ok.
Yes, this is the main configuration file. The auto file is used for tools in general and the
So the
Good that you found the issue.
Correct. This slot is not cleaned up and the postmaster will restart this worker eventually, so it will take up space in the slot array.
Since it seems it was a configuration issue, I will close this. Feel free to ask for it to be re-opened if there are lingering issues.
What type of bug is this?
Configuration
What subsystems and features are affected?
Background worker
What happened?
The TimescaleDB log is full of:

```
zabbix-timescaledb-1 | 2025-01-19 09:55:28.045 UTC [37] WARNING: failed to launch job 3 "Job History Log Retention Policy [3]": failed to start a background worker
zabbix-timescaledb-1 | 2025-01-19 09:55:29.723 UTC [36] WARNING: failed to launch job 3 "Job History Log Retention Policy [3]": failed to start a background worker
```
Increasing the workers in the config files has no effect, in the sense that the launch failures do not disappear.
max_worker_processes = 64 (increased from 32)
timescaledb.max_background_workers = 48 (increased from 8 to 16, then to 32, then to 48)
max_parallel_workers = 4 (number of CPUs)
show timescaledb.telemetry_level; => basic
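For context, the failing jobs can also be inspected from SQL (a sketch, assuming a TimescaleDB version that provides the timescaledb_information.job_stats view):

```sql
-- Per-job run statistics, including how often each job has failed.
SELECT job_id, job_status, last_run_status, total_runs, total_failures
FROM timescaledb_information.job_stats
ORDER BY total_failures DESC;
```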
TimescaleDB version affected
docker-compose image tag latest-pg16
PostgreSQL version used
16
What operating system did you use?
Alpine Linux
What installation method did you use?
Docker
What platform did you run on?
Other
Relevant log output and stack trace
How can we reproduce the bug?