lost allocation drops reschedule tracker #24918

Open
tgross opened this issue Jan 22, 2025 · 1 comment · May be fixed by #24929

tgross commented Jan 22, 2025

In #12319 we fixed a very old bug where, when an allocation failed and the scheduler could not find a placement, the reschedule tracker was dropped. While working with @pkazmierczak on #24869 we discovered this bug was not 100% fixed: in the case where the node is down and the allocation is marked lost, we're somehow not propagating the reschedule tracker.

Reproduction

To demonstrate both the working behavior and the non-working behavior, I'm deploying to a 1 server + 1 client cluster (current tip of main, aka 1.9.6-dev) with the following jobspec. This jobspec has restarts disabled and a constraint block that lets us control whether or not placement succeeds.

jobspec
job "example" {

  group "group" {

    reschedule {
      attempts  = 30
      interval  = "24h"
      unlimited = false
    }

    restart {
      attempts = 0
      mode     = "fail"
    }

    constraint {
      attribute = "${meta.example}"
      operator  = "="
      value     = "1"
    }

    task "task" {

      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-vv", "-f", "-p", "8001", "-h", "/local"]
      }

      resources {
        cpu    = 100
        memory = 100
      }

    }
  }
}

Apply the following node metadata to the node:

$ nomad node status
ID        Node Pool  DC        Name     Class      Drain  Eligibility  Status
e6e43a5a  default    philly-1  client0  multipass  false  eligible     ready

$ nomad node meta apply --node-id e6e43a5a example=1

Run the job.
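
Something like the following should do it, assuming the jobspec above is saved as example.nomad.hcl (the filename is illustrative):

$ nomad job run example.nomad.hcl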

Normal Rescheduling

Kill the task (via docker kill) to force a reschedule.
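
On the client node this looks roughly like the sketch below; the name filter assumes the Docker driver's usual <task>-<alloc-id> container naming, and the container ID is a placeholder:

$ docker ps --filter "name=task-" --format '{{.ID}}  {{.Names}}'
$ docker kill <container-id>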

$ nomad alloc status 4d64f58c
...
Recent Events:
Time                       Type            Description
2025-01-22T15:13:20-05:00  Not Restarting  Policy allows no restarts

Wait for the allocation to be rescheduled and see that the replacement has a reschedule tracker.

$ nomad job status example
...
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
1914d5a9  e6e43a5a  group       0        run      running  3s ago     2s ago
4d64f58c  e6e43a5a  group       0        stop     failed   1m14s ago  3s ago

$ nomad operator api "/v1/allocation/1914d5a9-3610-75a9-025d-729a9dbed06c" | jq .RescheduleTracker
{
  "Events": [
    {
      "Delay": 30000000000,
      "PrevAllocID": "4d64f58c-96cc-8465-82ba-e48241dbdba6",
      "PrevNodeID": "e6e43a5a-9ddb-d65a-521a-cde19f093656",
      "RescheduleTime": 1737576830218453000
    }
  ],
  "LastReschedule": "ok"
}

Failed Rescheduling with Correct Behavior

Now we'll change the node metadata so the node no longer satisfies the constraint, making placement infeasible:

$ nomad node meta apply --node-id e6e43a5a example=2

Kill the task again to force a reschedule, and wait for the blocked eval:

$ nomad eval list
ID        Priority  Triggered By        Job ID   Namespace  Node ID   Status    Placement Failures
5db8c171  50        queued-allocs       example  default    <none>    blocked   N/A - In Progress
1b751548  50        alloc-failure       example  default    <none>    complete  true
...

Update the node metadata to unblock the eval:

$ nomad node meta apply --node-id e6e43a5a example=1

And wait for the node update eval.

$ nomad eval list
ID        Priority  Triggered By        Job ID   Namespace  Node ID   Status    Placement Failures
6eac73f2  50        node-update         example  default    e6e43a5a  complete  false
5db8c171  50        queued-allocs       example  default    <none>    complete  false
1b751548  50        alloc-failure       example  default    <none>    complete  true
...

The replacement allocation has a reschedule tracker as we expect, which is what we fixed in #12319.

$ nomad job status example
...
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
1a99a69c  e6e43a5a  group       0        run      running  23s ago    13s ago
1914d5a9  e6e43a5a  group       0        stop     failed   3m54s ago  23s ago
4d64f58c  e6e43a5a  group       0        stop     failed   5m5s ago   3m54s ago

$ nomad operator api "/v1/allocation/1a99a69c-55bf-ddee-0c6d-6e54222b90bf" | jq .RescheduleTracker
{
  "Events": [
    {
      "Delay": 30000000000,
      "PrevAllocID": "4d64f58c-96cc-8465-82ba-e48241dbdba6",
      "PrevNodeID": "e6e43a5a-9ddb-d65a-521a-cde19f093656",
      "RescheduleTime": 1737576830218453000
    },
    {
      "Delay": 60000000000,
      "PrevAllocID": "1914d5a9-3610-75a9-025d-729a9dbed06c",
      "PrevNodeID": "e6e43a5a-9ddb-d65a-521a-cde19f093656",
      "RescheduleTime": 1737577040806473200
    }
  ],
  "LastReschedule": "ok"
}

Rescheduling on a Downed Node

Now halt the Nomad agent on the client node (sudo systemctl stop nomad), and wait for the node to be marked down.
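
Spelled out, assuming the agent runs as a systemd unit named nomad (as the parenthetical above implies); the watch interval is arbitrary:

$ sudo systemctl stop nomad      # on the client machine
$ watch -n 5 nomad node status   # from a machine that can still reach the server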

$ nomad node status
ID        Node Pool  DC        Name     Class      Drain  Eligibility  Status
e6e43a5a  default    philly-1  client0  multipass  false  eligible     down

Wait for the blocked evaluation:

$ nomad job status example
...
Placement Failure
Task Group "group":
  * No nodes were eligible for evaluation

Allocations
ID        Node ID   Task Group  Version  Desired  Status  Created    Modified
1a99a69c  e6e43a5a  group       0        stop     lost    2m43s ago  23s ago
1914d5a9  e6e43a5a  group       0        stop     failed  6m14s ago  2m43s ago
4d64f58c  e6e43a5a  group       0        stop     failed  7m25s ago  6m14s ago

$ nomad eval list
ID        Priority  Triggered By        Job ID   Namespace  Node ID   Status    Placement Failures
17784deb  50        queued-allocs       example  default    <none>    blocked   N/A - In Progress
f34b6262  50        node-update         example  default    e6e43a5a  complete  true
...

Then restart the Nomad agent on the client node and wait for the blocked evaluation to complete:
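
Restarting is the mirror of the halt step, assuming the same systemd unit as above:

$ sudo systemctl start nomad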

$ nomad eval list
ID        Priority  Triggered By        Job ID   Namespace  Node ID   Status    Placement Failures
40652e21  50        node-update         example  default    e6e43a5a  complete  false
4e69a3fe  50        queued-allocs       example  default    <none>    complete  false
9b5ed7fd  50        node-update         example  default    e6e43a5a  complete  true
...

The allocation has been replaced but the replacement allocation doesn't have a reschedule tracker!

$ nomad job status example
...
Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
3896afa8  e6e43a5a  group       0        run      running   19s ago    9s ago
1a99a69c  e6e43a5a  group       0        stop     complete  4m17s ago  14s ago
1914d5a9  e6e43a5a  group       0        stop     failed    7m48s ago  4m17s ago
4d64f58c  e6e43a5a  group       0        stop     failed    8m59s ago  7m48s ago

$ nomad operator api "/v1/allocation/3896afa8-c58b-f436-b4e9-3c5bb733f0b0" | jq .RescheduleTracker
null
tgross added a commit that referenced this issue Jan 23, 2025
Our vocabulary around scheduler behaviors outside of the `reschedule` and
`migrate` blocks leaves room for confusion around whether the reschedule tracker
should be propagated between allocations. There are effectively five different
behaviors we need to cover:

* restart: when the tasks of an allocation fail and we try to restart the tasks
  in place.

* reschedule: when the `restart` block runs out of attempts (or the allocation
  fails before tasks even start) and we need to move the allocation to another
  node to try again.

* migrate: when the user has asked to drain a node and we need to move the
  allocations. These are not failures, so we don't want to propagate the
  reschedule tracker.

* replacement: when a node is lost, we don't count that against the `reschedule`
  tracker for the allocations on the node (it's not the allocation's "fault",
  after all). We don't want to run the `migrate` machinery here either, as we
  can't contact the down node. To the scheduler, this is effectively the same as
  if we bumped the `group.count`.

* replacement for `disconnect.replace = true`: this is a replacement, but the
  replacement is intended to be temporary, so we propagate the reschedule tracker.

Add a section to the `reschedule`, `migrate`, and `disconnect` blocks explaining
when each item applies. Update the use of the word "reschedule" in several
places where "replacement" is correct, and vice-versa.

Fixes: #24918
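
For reference, the `disconnect.replace` case described in the commit message corresponds to a group-level block roughly like the sketch below; the attribute values are illustrative and not taken from the reproduction jobspec:

  group "group" {

    disconnect {
      # replace the allocation while the node is disconnected, then reconcile
      # the original and the replacement when the node comes back
      lost_after = "1h"
      replace    = true
      reconcile  = "best_score"
    }
  }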

tgross commented Jan 23, 2025

After quite a bit of internal discussion, we've determined this is actually the intended behavior (this stuff is complex!) 🤦. I've opened #24929 with documentation and docstring clarifications.
