Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CUDA forward compatibility hook #906

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

elezar
Copy link
Member

@elezar elezar commented Feb 6, 2025

With #877 the default behaviour of the NVIDIA Container Runtime / NVIDIA Container Runtime Hook was changed to not mount compat libraries from the container into the container. This removed "automatic" support for CUDA Forward compatibility.

This change attempts to address this by adding a createContainerHook that will create a file in /etc/ld.so.conf.d/ in the container to ensure that the /usr/local/cuda/compat libraries are added to the ldcache over the libraries mounted from the host. Here the provided driver version number is used to indicate whether this should be done or not.

In the case of the legacy runtime, this behaviour is only triggered if the allow-cuda-compat-libs-from-container feature flag is not enabled. The CDI spec generation has also been extended to include this hook.

@elezar elezar force-pushed the add-compat-lib-hook branch from f21f7a6 to d617fb5 Compare February 6, 2025 22:24

// Create the 'compat-libs' command
c := cli.Command{
Name: "compat-libs",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this hook have a better name?

Currently it reads:

nvidia-cdi-hook compat-libs --driver-version 999.88.77

Copy link
Member Author

@elezar elezar Feb 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed this to:

nvidia-cdi-hook cuda-compat --host-driver-version=999.88.77

We could use this hook for other compat functions such as enhanced compatibility.

@elezar elezar force-pushed the add-compat-lib-hook branch 3 times, most recently from 606aed5 to ba30397 Compare February 6, 2025 23:08
@elezar elezar changed the title Add compat lib hook Add CUDA forward compatibility hook Feb 7, 2025
@elezar elezar force-pushed the add-compat-lib-hook branch 3 times, most recently from 0c4dd3d to 0b50458 Compare February 7, 2025 13:07
@elezar elezar marked this pull request as ready for review February 7, 2025 13:10
@elezar elezar added the must-backport The changes in PR need to be backported to at least one stable release branch. label Feb 7, 2025
@elezar elezar force-pushed the add-compat-lib-hook branch 2 times, most recently from d1e97a0 to eeef750 Compare February 7, 2025 21:08
@elezar elezar force-pushed the add-compat-lib-hook branch 2 times, most recently from 88841fa to 8e0c69b Compare February 10, 2025 13:38
compatLibHookDiscoverer := discover.NewCUDACompatHookDiscoverer(logger, cfg.NVIDIACTKConfig.Path, driver)
discoverers = append(discoverers, compatLibHookDiscoverer)
if cfg.NVIDIAContainerRuntimeConfig.Mode == "legacy" {
ldcacheIpdateHookDiscoverer, err := discover.NewLDCacheUpdateHook(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: Since the nvidia-container-runtime-hook is invoked as a prestart hook, this is done BEFORE the createContainer hook that we insert above. This means that we need to once again run the update ldcache hook.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@klueska one option would be to create the file in /etc/ld.so.conf.d when invoking the nvidia-container-runtime-hook instead -- before calling out to the nvidia-container-cli. We don't have as ready access to the driver version, but we could extract it there.

Copy link
Contributor

@cdesiniotis cdesiniotis Feb 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we create the file in nvidia-container-runtime-hook, then wouldn't that remove the need to add the cuda-compat createContainer hook altogether? And as a result, the forward compatibility support in the legacy stack would not require users to use NVIDIA container Runtime, correct?

Copy link
Member Author

@elezar elezar Feb 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did try it, and my initial assumption was incorrect. Because we invoke ldconfig as:

        argv = (char * []){cnt->cfg.ldconfig, "-f", "/etc/ld.so.conf", "-C", "/etc/ld.so.cache", cnt->cfg.libs_dir, cnt->cfg.libs32_dir, NULL};

in libnvidia-container, this means that the cnt->cfg.libs_dir and cnt->cfg.libs32_dir folders take precedence over the files in /etc/ld.so.conf.d and the CUDA libraries present there are used.

We could rework the libnvidia-container implementation further, but the intent of this change is also to provide the functionality for #910 so that we can remove the legacy code path by default.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it. So for legacy mode, we are adding two createContainer hooks ( the cuda-compat hook and the update-ldcache) to ensure the compat libs are used, the caveat being that we execute ldconfig twice.

@elezar elezar force-pushed the add-compat-lib-hook branch 3 times, most recently from 77bfb3c to 1140fcf Compare February 10, 2025 15:08
@elezar elezar force-pushed the add-compat-lib-hook branch from 1140fcf to cb6b4f6 Compare February 10, 2025 18:18
cmd/nvidia-cdi-hook/cudacompat/cudacompat.go Outdated Show resolved Hide resolved
cmd/nvidia-cdi-hook/cudacompat/cudacompat.go Outdated Show resolved Hide resolved
cmd/nvidia-cdi-hook/cudacompat/cudacompat.go Outdated Show resolved Hide resolved
internal/modifier/gated.go Show resolved Hide resolved
@elezar elezar force-pushed the add-compat-lib-hook branch 2 times, most recently from f39a194 to aaf5ed8 Compare February 11, 2025 14:23
@elezar elezar force-pushed the add-compat-lib-hook branch 2 times, most recently from 12cc9aa to 0060b39 Compare February 14, 2025 13:35
This change adds an nvidia-cdi-hook cuda-compat hook that checks the
container for cuda compat libs and updates /etc/ld.so.conf.d to include their
parent folder if their driver major version is sufficient.

This allows CUDA Forward Compatibility to be used when this is not available
through the libnvidia-container.

Signed-off-by: Evan Lezar <[email protected]>
This change adds the cuda-compat hook to the incomming OCI runtime spec
if the allow-cuda-compat-libs-from-container feature flag is not enabled.

An update-ldcache hook is also injected to ensure that the required folders
are processed.

Signed-off-by: Evan Lezar <[email protected]>
@elezar elezar force-pushed the add-compat-lib-hook branch 2 times, most recently from c0cf008 to b47130b Compare February 14, 2025 14:17
Copy link
Contributor

@cdesiniotis cdesiniotis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a few questions. But overall this LGTM. Would be good to get another pair of eyes on this.

@@ -128,8 +128,8 @@ func supportedModifierTypes(mode string) []string {
return []string{"nvidia-hook-remover", "mode"}
case "csv":
// For CSV mode we support mode and feature-gated modification.
return []string{"nvidia-hook-remover", "mode", "feature-gated"}
return []string{"nvidia-hook-remover", "feature-gated", "mode"}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question -- why does the ordering matter?

@@ -166,4 +168,51 @@ var _ = Describe("docker", Ordered, func() {
Expect(referenceOutput).To(Equal(out4))
})
})

Describe("CUDA Forward compatibility", Ordered, func() {
Copy link
Contributor

@cdesiniotis cdesiniotis Feb 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I notice the test cases get skipped since our holodeck installs the latest driver: https://github.com/NVIDIA/nvidia-container-toolkit/actions/runs/13331020349/job/37235261007?pr=906#step:10:95

Any plans to downgrade the version of the driver we install in our e2e tests so we actually test CUDA forward compatibility?

@@ -78,5 +79,10 @@ func NewFeatureGatedModifier(logger logger.Interface, cfg *config.Config, image
discoverers = append(discoverers, d)
}

if !cfg.Features.AllowCUDACompatLibsFromContainer.IsEnabled() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question -- are features in the config only applicable to libnvidia-container? If yes, how can we make this more clear to users? From the user's perspective, it may be a bit confusing that compat libs do end up being used even if features.allow-cuda-compat-libs-from-container=false is set in the config.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
must-backport The changes in PR need to be backported to at least one stable release branch.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants