lost allocation drops reschedule tracker #24918

Open
tgross opened this issue Jan 22, 2025 · 1 comment · May be fixed by #24929

tgross commented Jan 22, 2025

In #12319 we fixed a very old bug where, when an allocation failed and the scheduler could not find a placement, the reschedule tracker was dropped. While working with @pkazmierczak on #24869 we discovered this bug was not 100% fixed: in the case where the node is down and the allocation is marked lost, we're somehow not propagating the reschedule tracker.

Reproduction

To demonstrate both the working behavior and the non-working behavior, I'm deploying to a 1 server + 1 client cluster (current tip of main, aka 1.9.6-dev) with the following jobspec. This jobspec has restarts disabled and a constraint block that lets us control whether or not placement succeeds.

jobspec
job "example" {

  group "group" {

    reschedule {
      attempts  = 30
      interval  = "24h"
      unlimited = false
    }

    restart {
      attempts = 0
      mode     = "fail"
    }

    constraint {
      attribute = "${meta.example}"
      operator  = "="
      value     = "1"
    }

    task "task" {

      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-vv", "-f", "-p", "8001", "-h", "/local"]
      }

      resources {
        cpu    = 100
        memory = 100
      }

    }
  }
}

Apply the following node metadata to the node:

$ nomad node status
ID        Node Pool  DC        Name     Class      Drain  Eligibility  Status
e6e43a5a  default    philly-1  client0  multipass  false  eligible     ready

$ nomad node meta apply --node-id e6e43a5a example=1

Run the job.
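
Something like the following should do it, assuming the jobspec above is saved as example.nomad.hcl (the filename is illustrative):

$ nomad job run example.nomad.hcl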

Normal Rescheduling

Kill the task (via docker kill) to force a reschedule.
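
On the client node this looks roughly like the sketch below; the name filter assumes the Docker driver's usual <task>-<alloc-id> container naming, and the container ID is a placeholder:

$ docker ps --filter "name=task-" --format '{{.ID}}  {{.Names}}'
$ docker kill <container-id>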

$ nomad alloc status 4d64f58c
...
Recent Events:
Time                       Type            Description
2025-01-22T15:13:20-05:00  Not Restarting  Policy allows no restarts

Wait for the allocation to be rescheduled and see that the replacement has a reschedule tracker.

$ nomad job status example
...
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
1914d5a9  e6e43a5a  group       0        run      running  3s ago     2s ago
4d64f58c  e6e43a5a  group       0        stop     failed   1m14s ago  3s ago

$ nomad operator api "/v1/allocation/1914d5a9-3610-75a9-025d-729a9dbed06c" | jq .RescheduleTracker
{
  "Events": [
    {
      "Delay": 30000000000,
      "PrevAllocID": "4d64f58c-96cc-8465-82ba-e48241dbdba6",
      "PrevNodeID": "e6e43a5a-9ddb-d65a-521a-cde19f093656",
      "RescheduleTime": 1737576830218453000
    }
  ],
  "LastReschedule": "ok"
}

Failed Rescheduling with Correct Behavior

Now we'll change the node metadata so the node no longer satisfies the constraint, making placement infeasible:

$ nomad node meta apply --node-id e6e43a5a example=2

Kill the task again to force a reschedule, and wait for the blocked eval:

$ nomad eval list
ID        Priority  Triggered By        Job ID   Namespace  Node ID   Status    Placement Failures
5db8c171  50        queued-allocs       example  default    <none>    blocked   N/A - In Progress
1b751548  50        alloc-failure       example  default    <none>    complete  true
...

Update the node metadata to unblock the eval:

$ nomad node meta apply --node-id e6e43a5a example=1

And wait for the node update eval.

$ nomad eval list
ID        Priority  Triggered By        Job ID   Namespace  Node ID   Status    Placement Failures
6eac73f2  50        node-update         example  default    e6e43a5a  complete  false
5db8c171  50        queued-allocs       example  default    <none>    complete  false
1b751548  50        alloc-failure       example  default    <none>    complete  true
...

The replacement allocation has a reschedule tracker as we expect, which is what we fixed in #12319.

$ nomad job status example
...
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
1a99a69c  e6e43a5a  group       0        run      running  23s ago    13s ago
1914d5a9  e6e43a5a  group       0        stop     failed   3m54s ago  23s ago
4d64f58c  e6e43a5a  group       0        stop     failed   5m5s ago   3m54s ago

$ nomad operator api "/v1/allocation/1a99a69c-55bf-ddee-0c6d-6e54222b90bf" | jq .RescheduleTracker
{
  "Events": [
    {
      "Delay": 30000000000,
      "PrevAllocID": "4d64f58c-96cc-8465-82ba-e48241dbdba6",
      "PrevNodeID": "e6e43a5a-9ddb-d65a-521a-cde19f093656",
      "RescheduleTime": 1737576830218453000
    },
    {
      "Delay": 60000000000,
      "PrevAllocID": "1914d5a9-3610-75a9-025d-729a9dbed06c",
      "PrevNodeID": "e6e43a5a-9ddb-d65a-521a-cde19f093656",
      "RescheduleTime": 1737577040806473200
    }
  ],
  "LastReschedule": "ok"
}

Rescheduling on a Downed Node

Now halt the Nomad agent on the client node (sudo systemctl stop nomad), and wait for the node to be marked down.
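
Spelled out, assuming the agent runs as a systemd unit named nomad (as the parenthetical above implies); the watch interval is arbitrary:

$ sudo systemctl stop nomad      # on the client machine
$ watch -n 5 nomad node status   # from a machine that can still reach the server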

$ nomad node status
ID        Node Pool  DC        Name     Class      Drain  Eligibility  Status
e6e43a5a  default    philly-1  client0  multipass  false  eligible     down

Wait for the blocked evaluation:

$ nomad job status example
...
Placement Failure
Task Group "group":
  * No nodes were eligible for evaluation

Allocations
ID        Node ID   Task Group  Version  Desired  Status  Created    Modified
1a99a69c  e6e43a5a  group       0        stop     lost    2m43s ago  23s ago
1914d5a9  e6e43a5a  group       0        stop     failed  6m14s ago  2m43s ago
4d64f58c  e6e43a5a  group       0        stop     failed  7m25s ago  6m14s ago

$ nomad eval list
ID        Priority  Triggered By        Job ID   Namespace  Node ID   Status    Placement Failures
17784deb  50        queued-allocs       example  default    <none>    blocked   N/A - In Progress
f34b6262  50        node-update         example  default    e6e43a5a  complete  true
...

Then restart the Nomad agent on the client node and wait for the blocked evaluation to complete:
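
Restarting is the mirror of the halt step, assuming the same systemd unit as above:

$ sudo systemctl start nomad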

$ nomad eval list
ID        Priority  Triggered By        Job ID   Namespace  Node ID   Status    Placement Failures
40652e21  50        node-update         example  default    e6e43a5a  complete  false
4e69a3fe  50        queued-allocs       example  default    <none>    complete  false
9b5ed7fd  50        node-update         example  default    e6e43a5a  complete  true
...

The allocation has been replaced but the replacement allocation doesn't have a reschedule tracker!

$ nomad job status example
...
Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created    Modified
3896afa8  e6e43a5a  group       0        run      running   19s ago    9s ago
1a99a69c  e6e43a5a  group       0        stop     complete  4m17s ago  14s ago
1914d5a9  e6e43a5a  group       0        stop     failed    7m48s ago  4m17s ago
4d64f58c  e6e43a5a  group       0        stop     failed    8m59s ago  7m48s ago

$ nomad operator api "/v1/allocation/3896afa8-c58b-f436-b4e9-3c5bb733f0b0" | jq .RescheduleTracker
null
tgross added a commit that referenced this issue Jan 23, 2025
Our vocabulary around scheduler behaviors outside of the `reschedule` and
`migrate` blocks leaves room for confusion around whether the reschedule tracker
should be propagated between allocations. There are effectively five different
behaviors we need to cover:

* restart: when the tasks of an allocation fail and we try to restart the tasks
  in place.

* reschedule: when the `restart` block runs out of attempts (or the allocation
  fails before tasks even start) and we need to move the allocation to another
  node to try again.

* migrate: when the user has asked to drain a node and we need to move the
  allocations. These are not failures, so we don't want to propagate the
  reschedule tracker.

* replacement: when a node is lost, we don't count that against the `reschedule`
  tracker for the allocations on the node (it's not the allocation's "fault",
  after all). We don't want to run the `migrate` machinery here either, as we
  can't contact the down node. To the scheduler, this is effectively the same as
  if we bumped the `group.count`.

* replacement for `disconnect.replace = true`: this is a replacement, but the
  replacement is intended to be temporary, so we propagate the reschedule tracker.

Add a section to the `reschedule`, `migrate`, and `disconnect` blocks explaining
when each item applies. Update the use of the word "reschedule" in several
places where "replacement" is correct, and vice-versa.

Fixes: #24918
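
For reference, the `disconnect.replace` case described in the commit message corresponds to a group-level block roughly like the sketch below; the attribute values are illustrative and not taken from the reproduction jobspec:

  group "group" {

    disconnect {
      # replace the allocation while the node is disconnected, then reconcile
      # the original and the replacement when the node comes back
      lost_after = "1h"
      replace    = true
      reconcile  = "best_score"
    }
  }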

tgross commented Jan 23, 2025

After quite a bit of internal discussion, we've determined this is actually the intended behavior (this stuff is complex!) 🤦. I've opened #24929 with documentation and docstring clarifications.
