2025-01-23T17:01:43-08:00 Not Restarting Error was unrecoverable
2025-01-23T17:01:43-08:00 Driver Failure failed to launch command with executor: rpc error: code = Unavailable desc = error reading from server: EOF
2025-01-23T17:01:43-08:00 Task Setup Building Task Directory
Root cause
While the panic arises when trying to call the running() callback, the source of the error is that the configureResourceContainer(...) call exits early with an error, but the caller proceeds anyway. The error is due to the non-root executor being unable to write the OOM score adjustment. From the executor logs:
{"@level":"error","@message":"failed to configure container, process isolation will not work","@module":"executor","@timestamp":"2025-01-23T16:53:05.876868-08:00","error":"write /proc/self/oom_score_adj: permission denied"}
Reproduction steps
Run nomad agent -dev on Linux as an unprivileged user (not root).
Run nomad job run sleeper.nomad.hcl, where the jobspec contains a raw_exec job:
I think it's probably fine to just add a nil check before calling running() so that non-root agents just don't get any of that functionality...
...but I think that functionality (entering/leaving cgroups) can work without root. This takes more testing than I have time for at the moment, and I'm not sure it's worth the effort. I think assuming rootless agents cannot manipulate resource constraints is fine, and would be happy to approve the nil-check approach.
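The nil-check approach could look roughly like the sketch below. All names here (`runningFunc`, `configure`, `launch`) are hypothetical stand-ins chosen for this example. The real configureResourceContainer(...) signature and call site may differ, so treat this as an illustration of the guard, not as the actual patch.

```go
package main

import (
	"errors"
	"fmt"
)

// runningFunc stands in for the callback invoked once the task process has
// started (hypothetical type, not Nomad's).
type runningFunc func()

// configure is a hypothetical stand-in for the container setup step. When it
// fails it returns a nil callback alongside the error, which is the situation
// described in the root cause.
func configure(privileged bool) (runningFunc, error) {
	if !privileged {
		return nil, errors.New("write /proc/self/oom_score_adj: permission denied")
	}
	return func() { fmt.Println("resource container active") }, nil
}

// launch logs a configuration error and continues, but guards the callback so
// a nil value is never invoked. It reports whether the callback ran.
func launch(privileged bool) bool {
	running, err := configure(privileged)
	if err != nil {
		fmt.Println("failed to configure container, process isolation will not work:", err)
	}
	// ... the task process would be started here ...
	if running != nil {
		running()
		return true
	}
	return false
}

func main() {
	fmt.Println("isolated:", launch(false))
	fmt.Println("isolated:", launch(true))
}
```

With this guard a non-root agent still launches the task and simply skips the resource-container functionality.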
I was looking a little bit into what is necessary to enter and leave cgroups and you need CAP_SYS_ADMIN which should be easy to check, but shouldn't that be done before? When the job is registered? No nodes with the necessary CAP found?
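For reference, a capability check of the kind mentioned above can be sketched by parsing the CapEff bitmask from /proc/self/status. This takes the comment's premise (that CAP_SYS_ADMIN is the relevant capability) as given. `hasCapability` is a hypothetical helper for this example, and the bit index 21 comes from linux/capability.h.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// capSysAdmin is the bit index of CAP_SYS_ADMIN in linux/capability.h.
const capSysAdmin = 21

// hasCapability is a hypothetical helper. It reports whether bit capBit is set
// in the CapEff line of text formatted like /proc/<pid>/status.
func hasCapability(status string, capBit uint) (bool, error) {
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) == 2 && fields[0] == "CapEff:" {
			// The mask is printed as a hexadecimal number without a prefix.
			mask, err := strconv.ParseUint(fields[1], 16, 64)
			if err != nil {
				return false, err
			}
			return mask&(1<<capBit) != 0, nil
		}
	}
	return false, fmt.Errorf("CapEff line not found")
}

func main() {
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("cannot read status:", err)
		return
	}
	ok, err := hasCapability(string(data), capSysAdmin)
	fmt.Println("CAP_SYS_ADMIN effective:", ok, "err:", err)
}
```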
shouldn't that be done before? When the job is registered? No nodes with the necessary CAP found?
The server doesn't know anything about what characteristics of the node the driver needs unless the driver fingerprints an attribute to the server (ex. the Docker version). If the driver can't run at all, it's supposed to fail fingerprinting itself so that the driver.raw_exec attribute doesn't show up for that node. But ideally we'll have a graceful fallback here instead.
Nomad version
Affected versions: 1.9.3, 1.9.4, 1.9.5
Unaffected: 1.9.1 (and probably earlier, but I stopped testing here)
Operating system and Environment details
Ubuntu 24.04.1 LTS (noble)
Issue
When running a job with a raw_exec task, the executor panics, leading to allocation failure. In the agent logs:
Task events:
Workaround
Run as root.
Suggested Fix
Ugh, not sure; otherwise I would have done it.