Executor panics when running raw_exec tasks with a non-root Client #24931

Open
schmichael opened this issue Jan 24, 2025 · 4 comments

@schmichael (Member)

Nomad version

Affected versions: 1.9.3, 1.9.4, 1.9.5

Unaffected: 1.9.1 (and probably earlier, but I stopped testing here)

Operating system and Environment details

Ubuntu 24.04.1 LTS (noble)

Issue

When running a job with a raw_exec task, the executor panics, causing the allocation to fail. In the agent logs:

    2025-01-23T17:01:43.826-0800 [DEBUG] client.driver_mgr.raw_exec.executor: using plugin: alloc_id=b63d9679-35e9-5f33-6cc2-27428451f705 driver=raw_exec task_name=sleeper version=2
    2025-01-23T17:01:43.830-0800 [DEBUG] client.driver_mgr.raw_exec.executor.nomad-1.9.5: panic: runtime error: invalid memory address or nil pointer dereference: alloc_id=b63d9679-35e9-5f33-6cc2-27428451f705 driver=raw_exec task_name=sleeper
    2025-01-23T17:01:43.830-0800 [DEBUG] client.driver_mgr.raw_exec.executor.nomad-1.9.5: [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x1bc6bb5]: alloc_id=b63d9679-35e9-5f33-6cc2-27428451f705 driver=raw_exec task_name=sleeper
    2025-01-23T17:01:43.830-0800 [DEBUG] client.driver_mgr.raw_exec.executor.nomad-1.9.5: alloc_id=b63d9679-35e9-5f33-6cc2-27428451f705 driver=raw_exec task_name=sleeper
    2025-01-23T17:01:43.830-0800 [DEBUG] client.driver_mgr.raw_exec.executor.nomad-1.9.5: goroutine 85 [running]:: alloc_id=b63d9679-35e9-5f33-6cc2-27428451f705 driver=raw_exec task_name=sleeper
    2025-01-23T17:01:43.830-0800 [DEBUG] client.driver_mgr.raw_exec.executor.nomad-1.9.5: github.com/hashicorp/nomad/drivers/shared/executor.(*UniversalExecutor).Launch(0xc000e07680, 0xc000658160): alloc_id=b63d9679-35e9-5f33-6cc2-27428451f705 driver=raw_exec task_name=sleeper
    2025-01-23T17:01:43.830-0800 [DEBUG] client.driver_mgr.raw_exec.executor.nomad-1.9.5:       github.com/hashicorp/nomad/drivers/shared/executor/executor.go:429 +0xb75: alloc_id=b63d9679-35e9-5f33-6cc2-27428451f705 driver=raw_exec task_name=sleeper
    2025-01-23T17:01:43.830-0800 [DEBUG] client.driver_mgr.raw_exec.executor.nomad-1.9.5: github.com/hashicorp/nomad/drivers/shared/executor.(*grpcExecutorServer).Launch(0xc00011c2d0, {0x31ca5c0?, 0xc000cb38a8?}, 0xc0002c4f00): alloc_id=b63d9679-35e9-5f33-6cc2-27428451f705 driver=raw_exec task_name=sleeper
    2025-01-23T17:01:43.830-0800 [DEBUG] client.driver_mgr.raw_exec.executor.nomad-1.9.5:       github.com/hashicorp/nomad/drivers/shared/executor/grpc_server.go:27 +0x6b7: alloc_id=b63d9679-35e9-5f33-6cc2-27428451f705 driver=raw_exec task_name=sleeper
    2025-01-23T17:01:43.830-0800 [DEBUG] client.driver_mgr.raw_exec.executor.nomad-1.9.5: github.com/hashicorp/nomad/drivers/shared/executor/proto._Executor_Launch_Handler({0x31ca5c0, 0xc00011c2d0}, {0x3b22f50, 0xc000d98990}, 0xc00094b200, 0x0): alloc_id=b63d9679-35e9-5f33-6cc2-27428451f705 driver=raw_exec task_name=sleeper
    2025-01-23T17:01:43.830-0800 [DEBUG] client.driver_mgr.raw_exec.executor.nomad-1.9.5:       github.com/hashicorp/nomad/drivers/shared/executor/proto/executor.pb.go:1212 +0x1a6: alloc_id=b63d9679-35e9-5f33-6cc2-27428451f705 driver=raw_exec task_name=sleeper
    2025-01-23T17:01:43.830-0800 [DEBUG] client.driver_mgr.raw_exec.executor.nomad-1.9.5: google.golang.org/grpc.(*Server).processUnaryRPC(0xc000898200, {0x3b22f50, 0xc000539350}, 0xc0002d8e40, 0xc000d98840, 0x55377e0, 0x0): alloc_id=b63d9679-35e9-5f33-6cc2-27428451f705 driver=raw_exec task_name=sleeper
    2025-01-23T17:01:43.830-0800 [DEBUG] client.driver_mgr.raw_exec.executor.nomad-1.9.5:       google.golang.org/[email protected]/server.go:1392 +0xfc3: alloc_id=b63d9679-35e9-5f33-6cc2-27428451f705 driver=raw_exec task_name=sleeper
    2025-01-23T17:01:43.830-0800 [DEBUG] client.driver_mgr.raw_exec.executor.nomad-1.9.5: google.golang.org/grpc.(*Server).handleStream(0xc000898200, {0x3b24298, 0xc0006e6000}, 0xc0002d8e40): alloc_id=b63d9679-35e9-5f33-6cc2-27428451f705 driver=raw_exec task_name=sleeper
    2025-01-23T17:01:43.830-0800 [DEBUG] client.driver_mgr.raw_exec.executor.nomad-1.9.5:       google.golang.org/[email protected]/server.go:1802 +0xbaa: alloc_id=b63d9679-35e9-5f33-6cc2-27428451f705 driver=raw_exec task_name=sleeper
    2025-01-23T17:01:43.830-0800 [DEBUG] client.driver_mgr.raw_exec.executor.nomad-1.9.5: google.golang.org/grpc.(*Server).serveStreams.func2.1(): alloc_id=b63d9679-35e9-5f33-6cc2-27428451f705 driver=raw_exec task_name=sleeper
    2025-01-23T17:01:43.830-0800 [DEBUG] client.driver_mgr.raw_exec.executor.nomad-1.9.5:       google.golang.org/[email protected]/server.go:1030 +0x7f: alloc_id=b63d9679-35e9-5f33-6cc2-27428451f705 driver=raw_exec task_name=sleeper
    2025-01-23T17:01:43.830-0800 [DEBUG] client.driver_mgr.raw_exec.executor.nomad-1.9.5: created by google.golang.org/grpc.(*Server).serveStreams.func2 in goroutine 84: alloc_id=b63d9679-35e9-5f33-6cc2-27428451f705 driver=raw_exec task_name=sleeper
    2025-01-23T17:01:43.830-0800 [DEBUG] client.driver_mgr.raw_exec.executor.nomad-1.9.5:       google.golang.org/[email protected]/server.go:1041 +0x125: alloc_id=b63d9679-35e9-5f33-6cc2-27428451f705 driver=raw_exec task_name=sleeper
    2025-01-23T17:01:43.831-0800 [DEBUG] client.driver_mgr.raw_exec.executor.stdio: received EOF, stopping recv loop: alloc_id=b63d9679-35e9-5f33-6cc2-27428451f705 driver=raw_exec task_name=sleeper err="rpc error: code = Unavailable desc = error reading from server: EOF"

Task events:

2025-01-23T17:01:43-08:00  Not Restarting  Error was unrecoverable
2025-01-23T17:01:43-08:00  Driver Failure  failed to launch command with executor: rpc error: code = Unavailable desc = error reading from server: EOF
2025-01-23T17:01:43-08:00  Task Setup      Building Task Directory

Root cause

While the panic arises when calling the running() callback, the underlying problem is that the configureResourceContainer(...) call returns early with an error, yet the executor proceeds anyway. The error occurs because the non-root executor cannot write the oom adjustment. From the executor logs:

{"@level":"error","@message":"failed to configure container, process isolation will not work","@module":"executor","@timestamp":"2025-01-23T16:53:05.876868-08:00","error":"write /proc/self/oom_score_adj: permission denied"}

Reproduction steps

1. Run nomad agent -dev on Linux as an unprivileged user (not root).

2. nomad job run sleeper.nomad.hcl where the jobspec contains a raw_exec job:

job "sleeper" {
  group "sleeper" {
    task "sleeper" {
      driver = "raw_exec"
      config {
        command = "sleep"
        args = ["10"]
      }
    }
  }
}

Workaround

Run as root.

Suggested Fix

Ugh, not sure; otherwise I would have done it.

I think it's probably fine to just add a nil check before calling running() so that non-root agents just don't get any of that functionality...

...but I think that functionality (entering/leaving cgroups) can work without root. This takes more testing than I have time for at the moment, and I'm not sure it's worth the effort. I think assuming rootless agents cannot manipulate resource constraints is fine, and would be happy to approve the nil-check approach.
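
For illustration, here is a self-contained sketch of the proposed nil-check pattern. The name configureResourceContainer is borrowed from the issue, but the signature and surrounding structure are invented for the example, not Nomad's actual executor code:

    package main

    import (
        "errors"
        "fmt"
    )

    // configureResourceContainer stands in for the real setup step: on a non-root
    // client it fails (e.g. the oom_score_adj write) and returns no callback.
    func configureResourceContainer(root bool) (func(), error) {
        if !root {
            return nil, errors.New("write /proc/self/oom_score_adj: permission denied")
        }
        return func() { fmt.Println("entered task cgroup") }, nil
    }

    func launch(root bool) {
        running, err := configureResourceContainer(root)
        if err != nil {
            // Degrade gracefully: warn and continue without process isolation.
            fmt.Println("WARN: process isolation disabled:", err)
        }
        // ... the task process would be started here ...
        if running != nil { // the proposed nil check; avoids the SIGSEGV
            running()
        }
        fmt.Println("task launched")
    }

    func main() {
        launch(false) // non-root client: warning instead of panic
        launch(true)  // root client: full isolation path
    }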

@Juanadelacuesta (Member)

I was looking a little into what is necessary to enter and leave cgroups: you need CAP_SYS_ADMIN, which should be easy to check. But shouldn't that be checked earlier, when the job is registered? Something like "no nodes with the necessary capability found"?
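
For reference, a minimal sketch of such a check, assuming it is enough to parse the CapEff bitmask from /proc/self/status (CAP_SYS_ADMIN is capability number 21) rather than pull in a capability library:

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strconv"
        "strings"
    )

    const capSysAdmin = 21 // capability number for CAP_SYS_ADMIN

    func main() {
        f, err := os.Open("/proc/self/status")
        if err != nil {
            panic(err)
        }
        defer f.Close()

        scanner := bufio.NewScanner(f)
        for scanner.Scan() {
            line := scanner.Text()
            if !strings.HasPrefix(line, "CapEff:") {
                continue
            }
            // CapEff is a hex bitmask of the effective capability set.
            hex := strings.TrimSpace(strings.TrimPrefix(line, "CapEff:"))
            mask, err := strconv.ParseUint(hex, 16, 64)
            if err != nil {
                panic(err)
            }
            fmt.Println("CAP_SYS_ADMIN:", mask&(1<<capSysAdmin) != 0)
            return
        }
        fmt.Println("CapEff not found")
    }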

@tgross (Member) commented Jan 24, 2025

> to enter and leave cgroups and you need CAP_SYS_ADMIN

You can do this without CAP_SYS_ADMIN if Nomad owns the cgroups files as well. (ref https://hashicorp.atlassian.net/browse/NET-10671 for more)
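
A minimal sketch of that idea: if the agent's user owns a delegated cgroup v2 subtree, creating a child cgroup and moving a PID into it are plain file operations and need no CAP_SYS_ADMIN. The path below is an assumption for illustration, not what Nomad actually uses:

    package main

    import (
        "fmt"
        "os"
        "path/filepath"
        "strconv"
    )

    func main() {
        // Assumed user-owned, delegated cgroup v2 subtree.
        cg := "/sys/fs/cgroup/nomad-demo/task"

        // Creating the child cgroup and writing our PID to cgroup.procs succeed
        // whenever we own the parent directory; otherwise they fail with EACCES.
        if err := os.MkdirAll(cg, 0o755); err != nil {
            fmt.Println("create cgroup:", err)
            return
        }
        pid := strconv.Itoa(os.Getpid())
        if err := os.WriteFile(filepath.Join(cg, "cgroup.procs"), []byte(pid), 0o644); err != nil {
            fmt.Println("enter cgroup:", err)
            return
        }
        fmt.Println("moved pid", pid, "into", cg)
    }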

> shouldn't that be done before? When the job is registered? No nodes with the necessary CAP found?

The server doesn't know anything about what characteristics of the node the driver needs unless the driver fingerprints an attribute to the server (ex. the Docker version). If the driver can't run at all, it's supposed to fail fingerprinting itself so that the driver.raw_exec attribute doesn't show up for that node. But ideally we'll have a graceful fallback here instead.

@Juanadelacuesta (Member)

What do you mean by a "graceful fallback"?

@tgross (Member) commented Jan 24, 2025

What @schmichael was referring to above:

> I think it's probably fine to just add a nil check before calling running() so that non-root agents just don't get any of that functionality...

That is, we lose the cgroup functionality if we can't use it, but otherwise can still launch processes.
