
fix(xunix): also mount symlinked shared object files #123

Merged

johnstcn merged 5 commits into main from cj/sharedobjectregexp on Jan 27, 2025

Conversation

@johnstcn (Member) commented Jan 26, 2025

Summary

Relates to #111

Container runtimes like the NVIDIA container toolkit will inject library dependencies inside containers they create. More recent versions of these runtimes/operators appear to mount the libraries as $library.so.$NVIDIA_DRIVER_VERSION and symlink shorter names (for example $library.so.1) to those locations.

Example:

lrwxrwxrwx  1 root root       32 Jan 26 20:09 libnvidia-allocator.so.1 -> libnvidia-allocator.so.550.90.07
-rwxr-xr-x  1 root root   168776 Dec 23 16:41 libnvidia-allocator.so.550.90.07
lrwxrwxrwx  1 root root       26 Jan 26 20:09 libnvidia-cfg.so.1 -> libnvidia-cfg.so.550.90.07
-rwxr-xr-x  1 root root   398968 Dec 23 16:41 libnvidia-cfg.so.550.90.07
-rwxr-xr-x  1 root root 43659040 Dec 23 16:41 libnvidia-gpucomp.so.550.90.07
lrwxrwxrwx  1 root root       25 Jan 26 20:09 libnvidia-ml.so.1 -> libnvidia-ml.so.550.90.07
-rwxr-xr-x  1 root root  2078360 Dec 23 16:41 libnvidia-ml.so.550.90.07
lrwxrwxrwx  1 root root       27 Jan 26 20:09 libnvidia-nvvm.so.4 -> libnvidia-nvvm.so.550.90.07
-rwxr-xr-x  1 root root 86842616 Dec 23 16:41 libnvidia-nvvm.so.550.90.07
lrwxrwxrwx  1 root root       29 Jan 26 20:09 libnvidia-opencl.so.1 -> libnvidia-opencl.so.550.90.07
-rwxr-xr-x  1 root root 23494344 Dec 23 16:41 libnvidia-opencl.so.550.90.07
-rwxr-xr-x  1 root root    10176 Dec 23 16:41 libnvidia-pkcs11-openssl3.so.550.90.07
-rwxr-xr-x  1 root root    10168 Dec 23 16:41 libnvidia-pkcs11.so.550.90.07
lrwxrwxrwx  1 root root       37 Jan 26 20:09 libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.550.90.07
-rwxr-xr-x  1 root root 28674464 Dec 23 16:41 libnvidia-ptxjitcompiler.so.550.90.07

For more details, see here: https://gist.github.com/johnstcn/1caae2e79eea8788dc7f31e3db4326eb
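
To illustrate the consequence of that layout: if only the symlink were mounted into the inner container, it would dangle, because its versioned target would be missing. Below is a minimal Go sketch (not the envbox implementation; the library path is just an illustrative example) that resolves a symlink to the versioned file that also needs to be mounted:

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// Hypothetical example path; on a real host the location and name will vary.
	lib := "/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1"

	// Resolve the symlink chain to find the versioned target
	// (e.g. libnvidia-ml.so.550.90.07) that must be mounted alongside the link.
	target, err := filepath.EvalSymlinks(lib)
	if err != nil {
		fmt.Fprintln(os.Stderr, "resolve:", err)
		os.Exit(1)
	}
	fmt.Printf("mount both %s and its target %s\n", lib, target)
}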

Changes made:

  • Relaxes the restriction on which mounts are passed into the inner container. Previously, only files ending in .so were mounted inside the inner container; now, all files matching .*.so(\.[0-9\.]+)? are included (see the sketch after this list).
  • Updates the README with information about some of the issues I ran into (such as cgroupns=private)
  • Updates integration tests to pass on cgroupv2-enabled host
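
As a rough illustration of the relaxed matching described above (a sketch only; the exact regexp and package layout in xunix may differ), the Go snippet below shows a pattern that accepts bare .so names as well as versioned ones such as .so.1 or .so.550.90.07:

package main

import (
	"fmt"
	"regexp"
)

// Sketch of the relaxed pattern: a .so suffix optionally followed by a dotted
// numeric version. This is an illustration, not the exact expression in the PR.
var sharedObjectRe = regexp.MustCompile(`\.so(\.[0-9\.]+)?$`)

func main() {
	for _, name := range []string{
		"libnvidia-ml.so.1",              // symlink created by the toolkit
		"libnvidia-ml.so.550.90.07",      // versioned target
		"libnvidia-gpucomp.so.550.90.07", // versioned file with no short symlink
		"libcuda.so",                     // plain .so, matched before this change too
		"README.md",                      // not a shared object
	} {
		fmt.Printf("%-35s %v\n", name, sharedObjectRe.MatchString(name))
	}
}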

Environment

  • Ubuntu 22.04.5 with kernel 6.8.0-51-generic
  • No shiftfs installed
  • Laptop with NVIDIA RTX 3060 Laptop GPU
  • Docker 27.5.1
  • NVIDIA driver version 550.90.07 (from NVIDIA-Linux-x86_64-550.90.07.run)
  • NVIDIA container toolkit version 1.17.4-1 (from NVIDIA APT repo)

Before

docker run -it --rm -v /tmp/envbox/docker:/var/lib/coder/docker -v /tmp/envbox/containers:/var/lib/coder/containers -v /tmp/envbox/sysbox:/var/lib/sysbox -v /tmp/envbox/docker:/var/lib/docker -v /usr/src:/usr/src:ro -v /lib/modules:/lib/modules:ro --privileged --cgroupns=host -e CODER_INNER_IMAGE=nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2 -e CODER_INNER_USERNAME=root -e CODER_ADD_GPU=true -e CODER_USR_LIB_DIR=/usr/lib --runtime=nvidia --gpus=all ghcr.io/coder/envbox:0.6.2 bash

root@container:/# /envbox docker &
<snip>
root@container:/# docker exec -it workspace_cvm bash

root@workspace_cvm:/# nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

root@workspace_cvm:/# /tmp/vectorAdd 
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Failed to launch vectorAdd kernel (error code PTX JIT compiler library not found)!

After

docker run -it --rm -v /tmp/envbox/docker:/var/lib/coder/docker -v /tmp/envbox/containers:/var/lib/coder/containers -v /tmp/envbox/sysbox:/var/lib/sysbox -v /tmp/envbox/docker:/var/lib/docker -v /usr/src:/usr/src:ro -v /lib/modules:/lib/modules:ro --privileged --cgroupns=host -e CODER_INNER_IMAGE=nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2 -e CODER_INNER_USERNAME=root -e CODER_ADD_GPU=true -e CODER_USR_LIB_DIR=/usr/lib --runtime=nvidia --gpus=all envbox:latest bash

root@container:/# /envbox docker &
<snip>
root@container:/# docker exec -it workspace_cvm bash

root@workspace_cvm:/# nvidia-smi
Sun Jan 26 22:18:09 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3060 ...    Off |   00000000:01:00.0 Off |                  N/A |
| N/A   50C    P8             15W /   90W |       2MiB /   6144MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

root@workspace_cvm:/# /tmp/vectorAdd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

Comment on lines +250 to +255
out, err = integrationtest.ExecInnerContainer(t, pool, integrationtest.ExecConfig{
	ContainerID: resource.Container.ID,
	Cmd:         []string{"cat", "/sys/fs/cgroup/memory/memory.limit_in_bytes"},
})
require.NoError(t, err)
require.Equal(t, expectedMemoryLimit, strings.TrimSpace(string(out)))
Member

This only gets run when cgroup v2 fails or is unavailable; do we ever expect that to be the case with our supported versions? Will this test code ever get run in CI or on our dev machines?

Member Author

will it ever get run in CI or on our dev machines?

It ran on my dev machine ;-)
All jokes aside, it's a code path that won't ever really work inside a sysbox container on our dogfood boxen. But I think we do need to think about cgroupv2 given the following:

  • Ubuntu 22.04 is supported by sysbox
  • Ubuntu enables cgroupv2 by default from 22.04 onwards
  • Kubernetes made cgroupv2 support GA from 1.25 onwards

It would be a good idea to add CI to test across multiple distros where you get cgroupv1 and cgroupv2 by default. I'd prefer to keep that out of the scope of this PR though.
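
For context, here is a minimal Go sketch of how a test or helper could detect which cgroup hierarchy it is running under; the detection file is a standard kernel interface, but the helper is made up for illustration and is not necessarily how the integration tests decide:

package main

import (
	"fmt"
	"os"
)

// On a unified (cgroup v2) hierarchy the kernel exposes
// /sys/fs/cgroup/cgroup.controllers; the file does not exist under cgroup v1.
func cgroupV2Enabled() bool {
	_, err := os.Stat("/sys/fs/cgroup/cgroup.controllers")
	return err == nil
}

func main() {
	if cgroupV2Enabled() {
		// cgroup v2: memory limits are read from <group>/memory.max
		fmt.Println("cgroup v2 (unified hierarchy)")
	} else {
		// cgroup v1: memory limits are read from memory/<group>/memory.limit_in_bytes
		fmt.Println("cgroup v1 (legacy hierarchy)")
	}
}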

Member

@mafredri left a comment

Nice job getting to the bottom of this, LGTM!


.PHONY: test
test:
	go test -v -count=1 ./...
Member

Note: setting -count ignores test caching, if that's intended then 👍🏻

Member Author

yep, intended

@johnstcn force-pushed the cj/sharedobjectregexp branch 2 times, most recently from f9a87ea to 6862e15 on January 27, 2025 at 11:59
@johnstcn force-pushed the cj/sharedobjectregexp branch from 6862e15 to 7d91c01 on January 27, 2025 at 12:04
@johnstcn merged commit 361631d into main on Jan 27, 2025
9 checks passed