lost allocation drops reschedule tracker #24918
tgross added a commit that referenced this issue on Jan 23, 2025:
Our vocabulary around scheduler behaviors outside of the `reschedule` and `migrate` blocks leaves room for confusion around whether the reschedule tracker should be propagated between allocations. There are effectively five different behaviors we need to cover:

* restart: when the tasks of an allocation fail and we try to restart the tasks in place.
* reschedule: when the `restart` block runs out of attempts (or the allocation fails before tasks even start), and we need to move the allocation to another node to try again.
* migrate: when the user has asked to drain a node and we need to move the allocations. These are not failures, so we don't want to propagate the reschedule tracker.
* replacement: when a node is lost, we don't count that against the `reschedule` tracker for the allocations on the node (it's not the allocation's "fault", after all). We don't want to run the `migrate` machinery here either, as we can't contact the down node. To the scheduler, this is effectively the same as if we bumped the `group.count`.
* replacement for `disconnect.replace = true`: this is a replacement, but the replacement is intended to be temporary, so we propagate the reschedule tracker.

Add a section to the `reschedule`, `migrate`, and `disconnect` blocks explaining when each item applies. Update the use of the word "reschedule" in several places where "replacement" is correct, and vice versa.

Fixes: #24918
After quite a bit of internal discussion, we've determined this is actually the intended behavior (this stuff is complex!) 🤦. I've opened #24929 with documentation and docstring clarifications.
In #12319 we fixed a very old bug where, when an allocation failed and the scheduler failed to find a placement, the reschedule tracker was dropped. While working with @pkazmierczak on #24869 we discovered this bug was not 100% fixed: in the case where the node is down and the allocation is marked `lost`, we're somehow not propagating the reschedule tracker.

Reproduction
To demonstrate both the behavior that works and the non-working behavior, I'm deploying to a 1 server + 1 client cluster (current tip of `main`, aka 1.9.6-dev) with the following jobspec. This jobspec has restarts disabled and a `constraint` block that lets us control whether or not placement works.
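The original jobspec was attached as a collapsed block and isn't reproduced here; the following is only a minimal sketch consistent with the description above (the job, group, and task names, the `repro` metadata key, and the `allowed` value are all assumptions):

```sh
# Minimal sketch of a jobspec with restarts disabled and a constraint on a
# node metadata key we can toggle later. Names and values are illustrative.
cat > repro.nomad.hcl <<'EOF'
job "repro" {
  group "web" {
    count = 1

    # No local restarts: a task failure immediately becomes a reschedule.
    restart {
      attempts = 0
      mode     = "fail"
    }

    # Placement only works while the node metadata key matches this value.
    constraint {
      attribute = "${meta.repro}"
      value     = "allowed"
    }

    task "web" {
      driver = "docker"

      config {
        image   = "busybox:1"
        command = "sleep"
        args    = ["3600"]
      }
    }
  }
}
EOF
```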
Apply the following node metadata to the node:
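For example, using the hypothetical `repro` key from the sketch above (`$NODE_ID` is the client's node ID, taken from `nomad node status`):

```sh
# Tag the client node so the jobspec's constraint matches.
nomad node meta apply -node-id "$NODE_ID" repro=allowed
```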
Run the job.
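Assuming the file name from the sketch above:

```sh
nomad job run repro.nomad.hcl
nomad job status repro   # wait until the allocation is running
```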
Normal Rescheduling
Kill the task (via `docker kill`) to force a reschedule. Wait for the allocation to be rescheduled and see that the replacement has a reschedule tracker.
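A sketch of this step, assuming the hypothetical task name `web` from the jobspec sketch and that `jq` is available to inspect the allocation:

```sh
# Kill the container out from under Nomad to fail the task
# (assumes a single matching container).
docker kill $(docker ps -q --filter name=web)

# Once the replacement allocation appears, its reschedule tracker should be
# populated with the previous allocation's reschedule event.
nomad job status repro
nomad alloc status -json <replacement-alloc-id> | jq '.RescheduleTracker'
```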
Failed Rescheduling with Correct Behavior
Now we'll change the node metadata to make the node ineligible:
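For example, by flipping the hypothetical metadata key so the constraint no longer matches:

```sh
nomad node meta apply -node-id "$NODE_ID" repro=denied
```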
Kill the task again to force a reschedule, and wait for the blocked eval:
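Something like the following, again assuming the `web` task name:

```sh
docker kill $(docker ps -q --filter name=web)

# The replacement cannot be placed anywhere, so an evaluation should end up
# in the "blocked" state and the job should report a placement failure.
nomad eval list
nomad job status repro
```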
Update the node metadata to unblock the eval:
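Restoring the hypothetical key/value from before:

```sh
nomad node meta apply -node-id "$NODE_ID" repro=allowed
```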
And wait for the node update eval.
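One way to watch for it and then check the tracker on the new allocation (again assuming `jq`):

```sh
# The metadata change triggers a node-update evaluation; once the replacement
# is placed, its tracker should be populated.
nomad eval list
nomad alloc status -json <new-alloc-id> | jq '.RescheduleTracker'
```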
The replacement allocation has a reschedule tracker as we expect, which is what we fixed in #12319.
Reschedule on Downed Node
Now halt the node (`sudo systemctl stop nomad`) and wait for it to be marked down. Wait for the blocked evaluation:
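A sketch of this step (the agent is stopped on the client host; the status commands run against the server):

```sh
# On the client host: stop the agent so the server eventually marks the node down.
sudo systemctl stop nomad

# From the server: wait for the node to show "down", then look for the blocked
# evaluation created once the allocation is marked lost.
nomad node status
nomad eval list
```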
Then restart the node and wait for the allocation to be unblocked:
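For example:

```sh
# Bring the client back; the blocked evaluation should now be able to place
# the replacement allocation.
sudo systemctl start nomad
nomad node status        # wait for the node to report "ready" again
nomad job status repro
```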
The allocation has been replaced but the replacement allocation doesn't have a reschedule tracker!
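Checking the same field as before (again assuming `jq`):

```sh
# Unlike the earlier case, the replacement's reschedule tracker comes back
# empty here, per the behavior described above.
nomad alloc status -json <replacement-alloc-id> | jq '.RescheduleTracker'
```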