
[BUG] In the latest 25.02 nightlies, KMeans' fit() throws a NCCL error on an ARM workstation's dask cluster. #6307

Open
taureandyernv opened this issue Feb 10, 2025 · 1 comment
Labels
? - Needs Triage Need team to review and classify bug Something isn't working

Comments

@taureandyernv
Contributor

Describe the bug
Only on RAPIDS 25.02a (nightly) with CUDA 12.8, I get this NCCL error when trying to fit KMeans on a Dask cluster: NCCL error encountered at: file=/opt/conda/include/raft/comms/detail/std_comms.hpp line=435. This happened on an H100.

This affects the KMeans MNMG notebook on an ARM SBSA workstation equipped with an H100. Tested on Python 3.12 and 3.11. An x86-based B100 machine works with the same docker run commands.

Steps/Code to reproduce bug

from cuml.dask.cluster.kmeans import KMeans as cuKMeans
from cuml.dask.datasets import make_blobs
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# One worker per GPU, one thread per worker
cluster = LocalCUDACluster(threads_per_worker=1)
client = Client(cluster)

n_samples = 1_000_000
n_features = 2
# One data partition per worker
n_total_partitions = len(client.has_what())

X_dca, Y_dca = make_blobs(n_samples,
                          n_features,
                          centers=5,
                          n_parts=n_total_partitions,
                          cluster_std=0.1,
                          verbose=True)

kmeans_cuml = cuKMeans(init="k-means||",
                       n_clusters=5,
                       random_state=100)

kmeans_cuml.fit(X_dca)  # raises the NCCL RuntimeError on arm64
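As a diagnostic sketch (not part of the original report), one way to narrow this down is to check which NCCL build the container actually loads, since RAFT's std_comms wraps NCCL directly. This assumes the RAPIDS image ships CuPy compiled with NCCL support:

```shell
# Report the NCCL version CuPy was built against (CuPy exposes NCCL
# under cupy.cuda.nccl when compiled with NCCL support).
python -c "from cupy.cuda import nccl; print(nccl.get_version())"

# List the NCCL shared libraries visible to the dynamic linker, to spot
# a missing or mismatched libnccl on the arm64 image.
ldconfig -p | grep -i nccl
```

Comparing this output between the failing arm64 container and the working x86 one would show whether the images ship different NCCL builds.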

Outputs

2025-02-10 21:48:32,918 - distributed.worker - ERROR - Compute Failed
Key:       _func_fit-95283355-8bff-49c7-8be7-e345c680da67
State:     executing
Task:  <Task '_func_fit-95283355-8bff-49c7-8be7-e345c680da67' _func_fit(..., ...)>
Exception: "RuntimeError('NCCL error encountered at: file=/opt/conda/include/raft/comms/detail/std_comms.hpp line=435: ')"
Traceback:
  File "/opt/conda/lib/python3.12/site-packages/cuml/dask/common/base.py", line 464, in check_cuml_mnmg
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/cuml/dask/cluster/kmeans.py", line 113, in _func_fit
    return cumlKMeans(handle=handle, output_type=datatype, **kwargs).fit(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/cuml/internals/api_decorators.py", line 188, in wrapper
    ret = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "kmeans_mg.pyx", line 158, in cuml.cluster.kmeans_mg.KMeansMG.fit

2025-02-10 21:48:32,920 - distributed.worker - ERROR - Compute Failed
Key:       _func_fit-0d82d4e2-b791-484a-a821-46c29dd567c1
State:     executing
Task:  <Task '_func_fit-0d82d4e2-b791-484a-a821-46c29dd567c1' _func_fit(..., ...)>
Exception: "RuntimeError('NCCL error encountered at: file=/opt/conda/include/raft/comms/detail/std_comms.hpp line=435: ')"
Traceback:
  File "/opt/conda/lib/python3.12/site-packages/cuml/dask/common/base.py", line 464, in check_cuml_mnmg
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/cuml/dask/cluster/kmeans.py", line 113, in _func_fit
    return cumlKMeans(handle=handle, output_type=datatype, **kwargs).fit(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/lib/python3.12/site-packages/cuml/internals/api_decorators.py", line 188, in wrapper
    ret = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "kmeans_mg.pyx", line 158, in cuml.cluster.kmeans_mg.KMeansMG.fit

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
File <timed exec>:5

File /opt/conda/lib/python3.12/site-packages/cuml/internals/memory_utils.py:87, in with_cupy_rmm.<locals>.cupy_rmm_wrapper(*args, **kwargs)
     85 if GPU_ENABLED:
     86     with cupy_using_allocator(rmm_cupy_allocator):
---> 87         return func(*args, **kwargs)
     88 return func(*args, **kwargs)

File /opt/conda/lib/python3.12/site-packages/cuml/dask/cluster/kmeans.py:175, in KMeans.fit(self, X, sample_weight)
    159 comms.init(workers=data.workers)
    161 kmeans_fit = [
    162     self.client.submit(
    163         KMeans._func_fit,
   (...)
    172     for idx, wf in enumerate(data.worker_to_parts.items())
    173 ]
--> 175 wait_and_raise_from_futures(kmeans_fit)
    177 comms.destroy()
    179 _results = [res.result() for res in kmeans_fit]

File /opt/conda/lib/python3.12/site-packages/cuml/dask/common/utils.py:164, in wait_and_raise_from_futures(futures)
    159 """
    160 Returns the collected futures after all the futures
    161 have finished and do not indicate any exceptions.
    162 """
    163 wait(futures)
--> 164 raise_exception_from_futures(futures)
    165 return futures

File /opt/conda/lib/python3.12/site-packages/cuml/dask/common/utils.py:152, in raise_exception_from_futures(futures)
    150 errs = [f.exception() for f in futures if f.exception()]
    151 if errs:
--> 152     raise RuntimeError(
    153         "%d of %d worker jobs failed: %s"
    154         % (len(errs), len(futures), ", ".join(map(str, errs)))
    155     )

RuntimeError: 2 of 2 worker jobs failed: NCCL error encountered at: file=/opt/conda/include/raft/comms/detail/std_comms.hpp line=435: , NCCL error encountered at: file=/opt/conda/include/raft/comms/detail/std_comms.hpp line=435: 

Expected behavior
It should fit the sample data, as it does on x86 and in other CUDA releases.

Environment details (please complete the following information):

  • Environment location: [Docker]
  • Linux Distro/Architecture: [Ubuntu 24.04 arm64]
  • GPU Model/Driver: [H100 and driver 535.161.08]
  • CUDA: [12.8]
  • Method of cuDF & cuML install: [Docker]
    • If method of install is [Docker], provide docker pull & docker run commands used:
      docker run --gpus all --rm -it --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p 9888:8888 -p 9787:8787 -p 9786:8786 rapidsai/notebooks:25.02a-cuda12.8-py3.12
      (also tested with the py3.11 image)

Additional context

@taureandyernv taureandyernv added ? - Needs Triage Need team to review and classify bug Something isn't working labels Feb 10, 2025
@dantegd
Member

dantegd commented Feb 13, 2025

Identified and solved by rapidsai/docs#574 and rapidsai/docker#735
