Nomad scheduler schedules job on client that cannot handle it #24779

Open
EtienneBruines opened this issue Jan 6, 2025 · 2 comments

@EtienneBruines (Contributor)

Nomad version

Nomad v1.9.4
BuildDate 2024-12-18T15:16:22Z
Revision 5e49fcdb7be26941b6c7ad3ed6661bd37e70a9d8+CHANGES

Operating system and Environment details

Ubuntu 22.04.5 LTS on amd64

Issue

When a client is too busy with garbage collection (GC) to start new allocs, the scheduler neither detects nor respects that and schedules new jobs there anyway, even if other clients are available and idle.

Reproduction steps

  • Have multiple clients
  • Have one client that is too busy with GC to start new allocs
  • Start a new job (perhaps a periodic batch job that has already run on that client before); a sketch of such a job follows this list
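
A minimal sketch of the kind of periodic batch job meant here (hypothetical names, image, and schedule; the real job file was not shared):

# Hypothetical example only; task name "sync" mirrors the task seen in the client logs below.
job "sync" {
  type = "batch"

  periodic {
    cron             = "*/15 * * * *" # illustrative schedule
    prohibit_overlap = true
  }

  group "sync" {
    task "sync" {
      driver = "docker"

      config {
        image   = "example/sync:latest" # placeholder image
        command = "/bin/true"           # placeholder command
      }

      resources {
        cpu    = 200
        memory = 128
      }
    }
  }
}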

Expected Result

The scheduler should avoid the client while it is busy with GC and refusing to receive new tasks, picking a different client instead. Some kind of automatic 'deterring' factor should apply to a client while it is busy with GC.

Actual Result

The scheduler schedules the job anyway on that client, which is already overwhelmed.

Perhaps the scheduler already implements this by looking at the nomad.client.allocations.pending metric? If so, this issue can probably be closed because the behavior would be caused by #24777 instead.
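
If it helps to check that metric, a sketch of an agent telemetry block that exposes it in Prometheus format at /v1/metrics?format=prometheus (assuming the default HTTP port; values are illustrative):

# Sketch: enable metrics so nomad.client.allocations.pending can be inspected.
telemetry {
  collection_interval        = "1s"
  prometheus_metrics         = true
  publish_allocation_metrics = true
  publish_node_metrics       = true
}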

Job file (if appropriate)

Not applicable.

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

The client only logs this:

{"@level":"info","@message":"marking allocation for GC","@module":"client.gc","@timestamp":"2025-01-06T10:21:36.250494Z","alloc_id":"ac8fd9bd-39f9-133f-c1ae-eb45c1ecc275"}
{"@level":"info","@message":"garbage collecting allocation","@module":"client.gc","@timestamp":"2025-01-06T10:21:36.252995Z","alloc_id":"feb5dc4c-a549-7b82-a18e-733acd2a7013","reason":"number of allocations (68) is over the limit (50)"}

After garbage collection completes (perhaps 20 minutes or so later), the client starts the alloc and logs entries like:

{"@level":"info","@message":"Task event","@module":"client.alloc_runner.task_runner","@timestamp":"2025-01-06T10:23:26.213687Z","alloc_id":"db84c9fb-e9e3-df5e-bc34-42e11f57a32e","failed":false,"msg":"Task received by client","task":"sync","type":"Received"}
@pkazmierczak (Contributor)

Hey @EtienneBruines, what exactly do you mean by

a client is too busy with GC to start new allocs

Client GC is asynchronous and shouldn't interfere with placing workloads. At the time of garbage collection, node resources should be free and thus the scheduler places the workload there. Is the node busy with something other than GC?

@EtienneBruines (Contributor, Author)

what exactly do you mean by

The behavior that is described here: #19917
