Add support for UB MNNVL #1470

nvcastet · 2025-02-10T19:32:47Z

Description

Add support for tensor parallelism (TP) across Multi-Node NVLink (MNNVL).

Fixes # (issue)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Support for TP across Multi-Node NVLink.

Signed-off-by: Nicolas Castet <[email protected]>

nvcastet · 2025-02-10T19:33:08Z

CC @ptrendx

timmoon10

Overall looks reasonable, although a more detailed PR description would be helpful for understanding your intentions and for future git blames. If I understand correctly, this PR accomplishes two things:

1. Register UB buffers with Multi-Node NVLink (MNNVL) when possible. This seems straightforward, although we should also handle the case where the TP group is larger than a node and MNNVL is not supported (e.g. by throwing an exception).
2. Remove logic for the inter-node communicator. It seems you just delete code without any logic changes, so was the inter-node communicator just dead code? If so, my guess is that it was made redundant in "[C/PyTorch] Removed MPI dependence in Userbuffers" (#901).
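
The missing-MNNVL case raised in the first point could be handled along these lines. This is a minimal sketch, not code from the PR: `tp_domain_error`, `require_tp_domain`, and their parameters are hypothetical names, and how MNNVL support is actually detected is left to the caller.

```cpp
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of Transformer Engine): returns an error
// message when the TP group spans more GPUs than one node provides and
// MNNVL is unavailable, or std::nullopt when the configuration is usable.
inline std::optional<std::string> tp_domain_error(int tp_size, int gpus_per_node,
                                                  bool mnnvl_supported) {
  if (tp_size <= gpus_per_node || mnnvl_supported) {
    return std::nullopt;  // Fits in one node, or MNNVL can span nodes.
  }
  return "TP size " + std::to_string(tp_size) + " exceeds the " +
         std::to_string(gpus_per_node) +
         " GPUs in a node and Multi-Node NVLink is not supported";
}

// Throwing wrapper, matching the "throw an exception" suggestion.
inline void require_tp_domain(int tp_size, int gpus_per_node, bool mnnvl_supported) {
  if (auto err = tp_domain_error(tp_size, gpus_per_node, mnnvl_supported)) {
    throw std::runtime_error(*err);
  }
}
```

With these assumed names, `require_tp_domain(16, 8, false)` would throw, while `require_tp_domain(16, 8, true)` and `require_tp_domain(8, 8, false)` would return normally.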

timmoon10 · 2025-02-18T22:43:04Z

transformer_engine/common/comm_gemm_overlap/userbuffers/userbuffers-host.cpp

-  int datanodes = allnodes / pipenodes / tensornodes;
-  int pipenodegroup_id = myrank / numlocal / (datanodes * tensornodes);

-  (*comm)->pipe_id = pipegpus * pipenodegroup_id + mylocal / (datagpus * tensorgpus);
-
-  (*comm)->comm_inter = EXT_COMM_INTER;
-  (*comm)->first_node = nodeid - mynode;
  (*comm)->num_nodes = numnodes;
  (*comm)->my_node = mynode;

-  (*comm)->num2_nodes = tensornodes;
-  (*comm)->my2_node = (mynode / datanodes) % tensornodes;
-  (*comm)->first2_node = mynode - (*comm)->my2_node * datanodes;
-
-  (*comm)->fifo = reinterpret_cast<ub_request *>(malloc(sizeof(ub_request) * NVTE_MAX_REQUESTS));
-  (*comm)->nblocks = 8;
-  (*comm)->alignblock = 1024 * 512;
-  (*comm)->minblock = 1024 * 2 * 1024;
-  (*comm)->asyncblocks = 16;


If we are not setting these class members, we should remove them from the header file as well.

timmoon10 · 2025-02-18T23:03:51Z

/te-ci L1

Looks like we're seeing build errors. I suspect it's because we're not guarding new CUDA APIs when building with older CUDA versions.
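
One common way to guard newer CUDA APIs is a compile-time check on `CUDART_VERSION`. The sketch below is illustrative only: `UB_MNNVL_MIN_CUDART_VERSION` and `ub_mnnvl_compiled_in` are hypothetical names, the threshold value is a placeholder and not necessarily the version the MNNVL APIs require, and `CUDART_VERSION` is defaulted to 0 so the snippet compiles without the CUDA toolkit.

```cpp
// In real code CUDART_VERSION comes from <cuda_runtime_api.h>; the fallback
// below exists only so this sketch builds without CUDA installed.
#ifndef CUDART_VERSION
#define CUDART_VERSION 0
#endif

// Placeholder threshold (assumption): replace with the first CUDA release
// that provides the APIs being guarded.
#define UB_MNNVL_MIN_CUDART_VERSION 12030

// True only when the build toolkit is new enough for the guarded code path.
constexpr bool ub_mnnvl_compiled_in() {
#if CUDART_VERSION >= UB_MNNVL_MIN_CUDART_VERSION
  return true;   // Calls to the newer APIs would be placed behind this branch.
#else
  return false;  // Older toolkit: use the single-node path or report an error.
#endif
}
```

Keeping the version check in one place like this means callers can branch on a single function instead of repeating the preprocessor condition around every new API call.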

Add support for UB MNNVL

81b7cb1

Signed-off-by: Nicolas Castet <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

d229c2e

for more information, see https://pre-commit.ci

ptrendx added the 2.1.0 label Feb 15, 2025

timmoon10 reviewed Feb 18, 2025

Merge branch 'main' into mnnvl_support

82d00ca
