Nomad scheduler schedules job on client that cannot handle it #24779

Open
EtienneBruines opened this issue Jan 6, 2025 · 2 comments

@EtienneBruines (Contributor)

Nomad version

Nomad v1.9.4
BuildDate 2024-12-18T15:16:22Z
Revision 5e49fcdb7be26941b6c7ad3ed6661bd37e70a9d8+CHANGES

Operating system and Environment details

Ubuntu 22.04.5 LTS on amd64

Issue

When a client is too busy with garbage collection (GC) to start new allocs, the scheduler neither detects nor respects that and schedules new jobs there anyway, even if other clients are available and idle.

Reproduction steps

  • Have multiple clients
  • Have one client that is too busy with GC to start new allocs
  • Start a new job (perhaps a periodic batch job that has already run on that client before); a sketch of such a job follows this list
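
A minimal sketch of the kind of periodic batch job meant here (hypothetical names, image, and schedule; the real job file was not shared):

# Hypothetical example only; task name "sync" mirrors the task seen in the client logs below.
job "sync" {
  type = "batch"

  periodic {
    cron             = "*/15 * * * *" # illustrative schedule
    prohibit_overlap = true
  }

  group "sync" {
    task "sync" {
      driver = "docker"

      config {
        image   = "example/sync:latest" # placeholder image
        command = "/bin/true"           # placeholder command
      }

      resources {
        cpu    = 200
        memory = 128
      }
    }
  }
}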

Expected Result

The scheduler should avoid the client while it is busy with GC and refusing to receive new tasks, picking a different client instead. Some kind of automatic 'deterring' factor should apply to a client while it is busy with GC.

Actual Result

The scheduler schedules the job anyway on that client, which is already overwhelmed.

Perhaps the scheduler already implements this by looking at the nomad.client.allocations.pending metric? If so, this issue can probably be closed because the behavior would be caused by #24777 instead.
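
If it helps to check that metric, a sketch of an agent telemetry block that exposes it in Prometheus format at /v1/metrics?format=prometheus (assuming the default HTTP port; values are illustrative):

# Sketch: enable metrics so nomad.client.allocations.pending can be inspected.
telemetry {
  collection_interval        = "1s"
  prometheus_metrics         = true
  publish_allocation_metrics = true
  publish_node_metrics       = true
}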

Job file (if appropriate)

Not applicable.

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

The client only logs this:

{"@level":"info","@message":"marking allocation for GC","@module":"client.gc","@timestamp":"2025-01-06T10:21:36.250494Z","alloc_id":"ac8fd9bd-39f9-133f-c1ae-eb45c1ecc275"}
{"@level":"info","@message":"garbage collecting allocation","@module":"client.gc","@timestamp":"2025-01-06T10:21:36.252995Z","alloc_id":"feb5dc4c-a549-7b82-a18e-733acd2a7013","reason":"number of allocations (68) is over the limit (50)"}

After garbage collection completes (perhaps 20 minutes or so later), the client starts the alloc and logs entries like:

{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-06T10:23:26.213687Z","alloc_id":"db84c9fb-e9e3-df5e-bc34-42e11f57a32e","failed":false,"msg":"Task received by client","task":"sync","type":"Received"}
@pkazmierczak (Contributor)

Hey @EtienneBruines, what exactly do you mean by

a client is too busy with GC to start new allocs

Client GC is asynchronous and shouldn't interfere with placing workloads. At the time of garbage collection, node resources should be free and thus the scheduler places the workload there. Is the node busy with something other than GC?

@EtienneBruines (Contributor, Author)

what exactly do you mean by

The behavior that is described here: #19917
