
[BUG] In the latest 25.02 nightlies, KMeans' fit() throws a NCCL error on an ARM workstation's dask cluster. #6307

Open
taureandyernv opened this issue Feb 10, 2025 · 1 comment
Labels
? - Needs Triage Need team to review and classify bug Something isn't working

Comments

@taureandyernv
Contributor

Describe the bug
Only on RAPIDS 25.02a (nightly) with CUDA 12.8, I get this NCCL error when trying to fit KMeans on a Dask cluster: NCCL error encountered at: file=/opt/conda/include/raft/comms/detail/std_comms.hpp line=435. This happened on an H100.

This affects the KMeans MNMG notebook on an ARM SBSA workstation equipped with an H100. Tested on Python 3.12 and 3.11. An x86-based B100 machine works with the same docker run commands.

Steps/Code to reproduce bug

from cuml.dask.cluster.kmeans import KMeans as cuKMeans
from cuml.dask.datasets import make_blobs
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One worker per GPU, one thread per worker
cluster = LocalCUDACluster(threads_per_worker=1)
client = Client(cluster)

n_samples = 1_000_000
n_features = 2
# One data partition per worker
n_total_partitions = len(client.has_what())

X_dca, Y_dca = make_blobs(n_samples,
                          n_features,
                          centers=5,
                          n_parts=n_total_partitions,
                          cluster_std=0.1,
                          verbose=True)

kmeans_cuml = cuKMeans(init="k-means||",
                       n_clusters=5,
                       random_state=100)

kmeans_cuml.fit(X_dca)  # raises the NCCL RuntimeError on arm64
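As a diagnostic sketch (not part of the original report), one way to narrow this down is to check which NCCL build the container actually loads, since RAFT's std_comms wraps NCCL directly. This assumes the RAPIDS image ships CuPy compiled with NCCL support:

```shell
# Report the NCCL version CuPy was built against (CuPy exposes NCCL
# under cupy.cuda.nccl when compiled with NCCL support).
python -c "from cupy.cuda import nccl; print(nccl.get_version())"

# List the NCCL shared libraries visible to the dynamic linker, to spot
# a missing or mismatched libnccl on the arm64 image.
ldconfig -p | grep -i nccl
```

Comparing this output between the failing arm64 container and the working x86 one would show whether the images ship different NCCL builds.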

Outputs

2025-02-10 21:48:32,918 - distributed.worker - ERROR - Compute Failed
Key:       _func_fit-95283355-8bff-49c7-8be7-e345c680da67
State:     executing
Task:  <Task '_func_fit-95283355-8bff-49c7-8be7-e345c680da67' _func_fit(..., ...)>
Exception: "RuntimeError('NCCL error encountered at: file=/opt/conda/include/raft/comms/detail/std_comms.hpp line=435: ')"
Traceback:
  File "/opt/conda/lib/python3.12/site-packages/cuml/dask/common/base.py", line 464, in check_cuml_mnmg
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/cuml/dask/cluster/kmeans.py", line 113, in _func_fit
    return cumlKMeans(handle=handle, output_type=datatype, **kwargs).fit(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/cuml/internals/api_decorators.py", line 188, in wrapper
    ret = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "kmeans_mg.pyx", line 158, in cuml.cluster.kmeans_mg.KMeansMG.fit

2025-02-10 21:48:32,920 - distributed.worker - ERROR - Compute Failed
Key:       _func_fit-0d82d4e2-b791-484a-a821-46c29dd567c1
State:     executing
Task:  <Task '_func_fit-0d82d4e2-b791-484a-a821-46c29dd567c1' _func_fit(..., ...)>
Exception: "RuntimeError('NCCL error encountered at: file=/opt/conda/include/raft/comms/detail/std_comms.hpp line=435: ')"
Traceback:
  File "/opt/conda/lib/python3.12/site-packages/cuml/dask/common/base.py", line 464, in check_cuml_mnmg
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/cuml/dask/cluster/kmeans.py", line 113, in _func_fit
    return cumlKMeans(handle=handle, output_type=datatype, **kwargs).fit(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/cuml/internals/api_decorators.py", line 188, in wrapper
    ret = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "kmeans_mg.pyx", line 158, in cuml.cluster.kmeans_mg.KMeansMG.fit

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File <timed exec>:5

File /opt/conda/lib/python3.12/site-packages/cuml/internals/memory_utils.py:87, in with_cupy_rmm.<locals>.cupy_rmm_wrapper(*args, **kwargs)
     85 if GPU_ENABLED:
     86     with cupy_using_allocator(rmm_cupy_allocator):
---> 87         return func(*args, **kwargs)
     88 return func(*args, **kwargs)

File /opt/conda/lib/python3.12/site-packages/cuml/dask/cluster/kmeans.py:175, in KMeans.fit(self, X, sample_weight)
    159 comms.init(workers=data.workers)
    161 kmeans_fit = [
    162     self.client.submit(
    163         KMeans._func_fit,
   (...)
    172     for idx, wf in enumerate(data.worker_to_parts.items())
    173 ]
--> 175 wait_and_raise_from_futures(kmeans_fit)
    177 comms.destroy()
    179 _results = [res.result() for res in kmeans_fit]

File /opt/conda/lib/python3.12/site-packages/cuml/dask/common/utils.py:164, in wait_and_raise_from_futures(futures)
    159 """
    160 Returns the collected futures after all the futures
    161 have finished and do not indicate any exceptions.
    162 """
    163 wait(futures)
--> 164 raise_exception_from_futures(futures)
    165 return futures

File /opt/conda/lib/python3.12/site-packages/cuml/dask/common/utils.py:152, in raise_exception_from_futures(futures)
    150 errs = [f.exception() for f in futures if f.exception()]
    151 if errs:
--> 152     raise RuntimeError(
    153         "%d of %d worker jobs failed: %s"
    154         % (len(errs), len(futures), ", ".join(map(str, errs)))
    155     )

RuntimeError: 2 of 2 worker jobs failed: NCCL error encountered at: file=/opt/conda/include/raft/comms/detail/std_comms.hpp line=435: , NCCL error encountered at: file=/opt/conda/include/raft/comms/detail/std_comms.hpp line=435: 

Expected behavior
It should fit the sample data, as it does on x86 and in other CUDA releases.

Environment details (please complete the following information):

  • Environment location: [Docker]
  • Linux Distro/Architecture: [Ubuntu 24.04 arm64]
  • GPU Model/Driver: [H100 and driver 535.161.08]
  • CUDA: [12.8]
  • Method of cuDF & cuML install: [Docker]
    • If method of install is [Docker], provide docker pull & docker run commands used:
      docker run --gpus all --rm -it --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p 9888:8888 -p 9787:8787 -p 9786:8786 rapidsai/notebooks:25.02a-cuda12.8-py3.12
      (also tested with the py3.11 image)

Additional context

@taureandyernv taureandyernv added ? - Needs Triage Need team to review and classify bug Something isn't working labels Feb 10, 2025
@dantegd
Member

dantegd commented Feb 13, 2025

Identified and solved by rapidsai/docs#574 and rapidsai/docker#735
