-
Notifications
You must be signed in to change notification settings - Fork 308
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add CUDA forward compatibility hook #906
base: main
Are you sure you want to change the base?
Conversation
f21f7a6
to
d617fb5
Compare
|
||
// Create the 'compat-libs' command | ||
c := cli.Command{ | ||
Name: "compat-libs", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this hook have a better name?
Currently it reads:
nvidia-cdi-hook compat-libs --driver-version 999.88.77
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed this to:
nvidia-cdi-hook cuda-compat --host-driver-version=999.88.77
We could use this hook for other compat functions such as enhanced compatibility.
606aed5
to
ba30397
Compare
0c4dd3d
to
0b50458
Compare
d1e97a0
to
eeef750
Compare
88841fa
to
8e0c69b
Compare
internal/modifier/gated.go
Outdated
compatLibHookDiscoverer := discover.NewCUDACompatHookDiscoverer(logger, cfg.NVIDIACTKConfig.Path, driver) | ||
discoverers = append(discoverers, compatLibHookDiscoverer) | ||
if cfg.NVIDIAContainerRuntimeConfig.Mode == "legacy" { | ||
ldcacheIpdateHookDiscoverer, err := discover.NewLDCacheUpdateHook( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: Since the nvidia-container-runtime-hook
is invoked as a prestart
hook, this is done BEFORE the createContainer
hook that we insert above. This means that we need to once again run the update ldcache hook.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@klueska one option would be to create the file in /etc/ld.so.conf.d
when invoking the nvidia-container-runtime-hook
instead -- before calling out to the nvidia-container-cli
. We don't have as ready access to the driver version, but we could extract it there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we create the file in nvidia-container-runtime-hook
, then wouldn't that remove the need to add the cuda-compat
createContainer hook altogether? And as a result, the forward compatibility support in the legacy stack would not require users to use NVIDIA container Runtime, correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did try it, and my initial assumption was incorrect. Because we invoke ldconfig as:
argv = (char * []){cnt->cfg.ldconfig, "-f", "/etc/ld.so.conf", "-C", "/etc/ld.so.cache", cnt->cfg.libs_dir, cnt->cfg.libs32_dir, NULL};
in libnvidia-container
, this means that the cnt->cfg.libs_dir
and cnt->cfg.libs32_dir
folders take precedence over the files in /etc/ld.so.conf.d
and the CUDA libraries present there are used.
We could rework the libnvidia-container
implementation further, but the intent of this change is also to provide the functionality for #910 so that we can remove the legacy code path by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. So for legacy mode, we are adding two createContainer hooks ( the cuda-compat
hook and the update-ldcache
) to ensure the compat libs are used, the caveat being that we execute ldconfig twice.
77bfb3c
to
1140fcf
Compare
1140fcf
to
cb6b4f6
Compare
f39a194
to
aaf5ed8
Compare
12cc9aa
to
0060b39
Compare
This change adds an nvidia-cdi-hook cuda-compat hook that checks the container for cuda compat libs and updates /etc/ld.so.conf.d to include their parent folder if their driver major version is sufficient. This allows CUDA Forward Compatibility to be used when this is not available through the libnvidia-container. Signed-off-by: Evan Lezar <[email protected]>
This change adds the cuda-compat hook to the incomming OCI runtime spec if the allow-cuda-compat-libs-from-container feature flag is not enabled. An update-ldcache hook is also injected to ensure that the required folders are processed. Signed-off-by: Evan Lezar <[email protected]>
Signed-off-by: Evan Lezar <[email protected]>
Signed-off-by: Evan Lezar <[email protected]>
Signed-off-by: Evan Lezar <[email protected]>
Signed-off-by: Evan Lezar <[email protected]>
Signed-off-by: Evan Lezar <[email protected]>
c0cf008
to
b47130b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few questions. But overall this LGTM. Would be good to get another pair of eyes on this.
@@ -128,8 +128,8 @@ func supportedModifierTypes(mode string) []string { | |||
return []string{"nvidia-hook-remover", "mode"} | |||
case "csv": | |||
// For CSV mode we support mode and feature-gated modification. | |||
return []string{"nvidia-hook-remover", "mode", "feature-gated"} | |||
return []string{"nvidia-hook-remover", "feature-gated", "mode"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question -- why does the ordering matter?
@@ -166,4 +168,51 @@ var _ = Describe("docker", Ordered, func() { | |||
Expect(referenceOutput).To(Equal(out4)) | |||
}) | |||
}) | |||
|
|||
Describe("CUDA Forward compatibility", Ordered, func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! I notice the test cases get skipped since our holodeck installs the latest driver: https://github.com/NVIDIA/nvidia-container-toolkit/actions/runs/13331020349/job/37235261007?pr=906#step:10:95
Any plans to downgrade the version of the driver we install in our e2e tests so we actually test CUDA forward compatibility?
@@ -78,5 +79,10 @@ func NewFeatureGatedModifier(logger logger.Interface, cfg *config.Config, image | |||
discoverers = append(discoverers, d) | |||
} | |||
|
|||
if !cfg.Features.AllowCUDACompatLibsFromContainer.IsEnabled() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question -- are features
in the config only applicable to libnvidia-container? If yes, how can we make this more clear to users? From the user's perspective, it may be a bit confusing that compat libs do end up being used even if features.allow-cuda-compat-libs-from-container=false
is set in the config.
With #877 the default behaviour of the NVIDIA Container Runtime / NVIDIA Container Runtime Hook was changed to not mount compat libraries from the container into the container. This removed "automatic" support for CUDA Forward compatibility.
This change attempts to address this by adding a
createContainerHook
that will create a file in/etc/ld.so.conf.d/
in the container to ensure that the/usr/local/cuda/compat
libraries are added to the ldcache over the libraries mounted from the host. Here the provided driver version number is used to indicate whether this should be done or not.In the case of the legacy runtime, this behaviour is only triggered if the
allow-cuda-compat-libs-from-container
feature flag is not enabled. The CDI spec generation has also been extended to include this hook.