Describe the bug
On RAPIDS 25.02a with CUDA 12.8 only, I get this NCCL error when trying to fit KMeans on a Dask cluster: NCCL error encountered at: file=/opt/conda/include/raft/comms/detail/std_comms.hpp line=435. This happened on an H100.
It affects the KMeans MNMG notebook on an ARM SBSA system equipped with an H100. Tested on Python 3.12 and 3.11. An x86-based B100 appears to work with the same docker run commands.
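As a first check on the comms stack, here is a minimal sketch (not something run for this report; it assumes CuPy's NCCL bindings are available in the image) that reports the NCCL build each dask-cuda worker loads:

# Sketch: report the NCCL version visible to each Dask worker.
# Assumes CuPy was built with NCCL support (typical for the RAPIDS images).
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

def nccl_version():
    from cupy.cuda import nccl
    return nccl.get_version()  # integer-encoded version, e.g. 22304 -> 2.23.4

cluster = LocalCUDACluster(threads_per_worker=1)
client = Client(cluster)
print(client.run(nccl_version))  # dict keyed by worker address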
Steps/Code to reproduce bug
from cuml.dask.cluster.kmeans import KMeans as cuKMeans
from cuml.dask.common import to_dask_df
from cuml.dask.datasets import make_blobs
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
from dask_ml.cluster import KMeans as skKMeans

# One dask-cuda worker per GPU
cluster = LocalCUDACluster(threads_per_worker=1)
client = Client(cluster)

n_samples = 1000000
n_features = 2
n_total_partitions = len(list(client.has_what().keys()))

# Generate one partition of synthetic blobs per worker
X_dca, Y_dca = make_blobs(n_samples,
                          n_features,
                          centers=5,
                          n_parts=n_total_partitions,
                          cluster_std=0.1,
                          verbose=True)

kmeans_cuml = cuKMeans(init="k-means||",
                       n_clusters=5,
                       random_state=100)
kmeans_cuml.fit(X_dca)  # raises the NCCL error below
Outputs
2025-02-10 21:48:32,918 - distributed.worker - ERROR - Compute Failed
Key: _func_fit-95283355-8bff-49c7-8be7-e345c680da67
State: executing
Task: <Task '_func_fit-95283355-8bff-49c7-8be7-e345c680da67' _func_fit(..., ...)>
Exception: "RuntimeError('NCCL error encountered at: file=/opt/conda/include/raft/comms/detail/std_comms.hpp line=435: ')"
Traceback: ' File "/opt/conda/lib/python3.12/site-packages/cuml/dask/common/base.py", line 464, in check_cuml_mnmg\n return func(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^\n File "/opt/conda/lib/python3.12/site-packages/cuml/dask/cluster/kmeans.py", line 113, in _func_fit\n return cumlKMeans(handle=handle, output_type=datatype, **kwargs).fit(\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/opt/conda/lib/python3.12/site-packages/cuml/internals/api_decorators.py", line 188, in wrapper\n ret = func(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^\n File "kmeans_mg.pyx", line 158, in cuml.cluster.kmeans_mg.KMeansMG.fit\n'
2025-02-10 21:48:32,920 - distributed.worker - ERROR - Compute Failed
Key: _func_fit-0d82d4e2-b791-484a-a821-46c29dd567c1
State: executing
Task: <Task '_func_fit-0d82d4e2-b791-484a-a821-46c29dd567c1' _func_fit(..., ...)>
Exception: "RuntimeError('NCCL error encountered at: file=/opt/conda/include/raft/comms/detail/std_comms.hpp line=435: ')"
Traceback: ' File "/opt/conda/lib/python3.12/site-packages/cuml/dask/common/base.py", line 464, in check_cuml_mnmg\n return func(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^\n File "/opt/conda/lib/python3.12/site-packages/cuml/dask/cluster/kmeans.py", line 113, in _func_fit\n return cumlKMeans(handle=handle, output_type=datatype, **kwargs).fit(\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/opt/conda/lib/python3.12/site-packages/cuml/internals/api_decorators.py", line 188, in wrapper\n ret = func(*args, **kwargs)\n ^^^^^^^^^^^^^^^^^^^^^\n File "kmeans_mg.pyx", line 158, in cuml.cluster.kmeans_mg.KMeansMG.fit\n'
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
File <timed exec>:5
File /opt/conda/lib/python3.12/site-packages/cuml/internals/memory_utils.py:87, in with_cupy_rmm.<locals>.cupy_rmm_wrapper(*args, **kwargs)
85 if GPU_ENABLED:
86 with cupy_using_allocator(rmm_cupy_allocator):
---> 87 return func(*args, **kwargs)
88 return func(*args, **kwargs)
File /opt/conda/lib/python3.12/site-packages/cuml/dask/cluster/kmeans.py:175, in KMeans.fit(self, X, sample_weight)
159 comms.init(workers=data.workers)
161 kmeans_fit = [
162 self.client.submit(
163 KMeans._func_fit,
(...)
172 for idx, wf in enumerate(data.worker_to_parts.items())
173 ]
--> 175 wait_and_raise_from_futures(kmeans_fit)
177 comms.destroy()
179 _results = [res.result() for res in kmeans_fit]
File /opt/conda/lib/python3.12/site-packages/cuml/dask/common/utils.py:164, in wait_and_raise_from_futures(futures)
159 """
160 Returns the collected futures after all the futures
161 have finished and do not indicate any exceptions.
162 """
163 wait(futures)
--> 164 raise_exception_from_futures(futures)
165 return futures
File /opt/conda/lib/python3.12/site-packages/cuml/dask/common/utils.py:152, in raise_exception_from_futures(futures)
150 errs = [f.exception() for f in futures if f.exception()]
151 if errs:
--> 152 raise RuntimeError(
153 "%d of %d worker jobs failed: %s"
154 % (len(errs), len(futures), ", ".join(map(str, errs)))
155 )
RuntimeError: 2 of 2 worker jobs failed: NCCL error encountered at: file=/opt/conda/include/raft/comms/detail/std_comms.hpp line=435: , NCCL error encountered at: file=/opt/conda/include/raft/comms/detail/std_comms.hpp line=435:
Expected behavior
It should fit the sample data, as it does on x86 and/or with other CUDA releases.
Environment details (please complete the following information):
Environment location: [Docker]
Linux Distro/Architecture: [Ubuntu 24.04 arm64]
GPU Model/Driver: [H100 and driver 535.161.08]
CUDA: [12.8]
Method of cuDF & cuML install: [Docker]
If method of install is [Docker], provide docker pull & docker run commands used: docker run --gpus all --rm -it --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p 9888:8888 -p 9787:8787 -p 9786:8786 rapidsai/notebooks:25.02a-cuda12.8-py3.12 (also tested with the py3.11 image)
Additional context
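One way to isolate the problem (a sketch only; this was not run for this report): fit comparable data with the single-GPU estimator, which does not go through RAFT's NCCL comms. If that succeeds, the failure likely sits in the comms layer rather than in KMeans itself.

# Sketch: single-GPU KMeans on comparable synthetic data; this code path does not use NCCL.
from cuml.cluster import KMeans
from cuml.datasets import make_blobs

X, _ = make_blobs(n_samples=1_000_000, n_features=2,
                  centers=5, cluster_std=0.1, random_state=100)
KMeans(n_clusters=5, init="k-means||", random_state=100).fit(X)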