
Add support for UB MNNVL #1470

Open
wants to merge 3 commits into base: main

Conversation

nvcastet

Description

Add support for tensor parallelism (TP) across multi-node NVLink (MNNVL).

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Support for TP across multi-node NVLink (MNNVL); a sketch of the general allocation approach follows below.

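For context, a minimal sketch of what an MNNVL-capable buffer allocation looks like with the CUDA driver's virtual memory management APIs. This is not the PR's actual code: the function name and integration point are illustrative, and error checking is omitted. The key difference from the intra-node path is requesting CU_MEM_HANDLE_TYPE_FABRIC rather than a POSIX file descriptor, so peers on other nodes of the same NVLink domain can import the allocation.

#include <cuda.h>

// Illustrative helper: allocate `bytes` of device memory that can be
// exported as a fabric handle and imported by remote ranks over MNNVL.
CUdeviceptr alloc_fabric_buffer(size_t bytes, int device) {
  CUmemAllocationProp prop = {};
  prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
  prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
  prop.location.id = device;
  prop.requestedHandleTypes = CU_MEM_HANDLE_TYPE_FABRIC;  // inter-node shareable

  // Round the size up to the minimum allocation granularity.
  size_t gran = 0;
  cuMemGetAllocationGranularity(&gran, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM);
  bytes = (bytes + gran - 1) / gran * gran;

  // Create the physical allocation, reserve a VA range, map, and enable access.
  CUmemGenericAllocationHandle handle;
  cuMemCreate(&handle, bytes, &prop, 0);
  CUdeviceptr ptr;
  cuMemAddressReserve(&ptr, bytes, gran, 0, 0);
  cuMemMap(ptr, bytes, 0, handle, 0);
  CUmemAccessDesc access = {};
  access.location = prop.location;
  access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
  cuMemSetAccess(ptr, bytes, &access, 1);
  return ptr;
}

The handle exported from such an allocation (via cuMemExportToShareableHandle with a CUmemFabricHandle) can then be broadcast to the other ranks in the TP group, in place of the fd exchange used intra-node.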
Signed-off-by: Nicolas Castet <[email protected]>
@nvcastet (Author)

CC @ptrendx

@ptrendx ptrendx added the 2.1.0 label Feb 15, 2025
@timmoon10 (Collaborator) left a comment


Overall looks reasonable, although a more detailed PR description would be helpful for understanding your intentions and for future git blames. If I understand correctly, this PR accomplishes two things:

  1. Register UB buffers with Multi-Node NVLink when possible. This seems straightforward, although we should also handle the case where the TP group is larger than a node but MNNVL is not supported, e.g. by throwing an exception (see the sketch after this list).
  2. Remove logic for the inter-node communicator. It looks like this only deletes code without changing any logic, so was the inter-node communicator just dead code? If so, my guess is that it was made redundant in [C/PyTorch] Removed MPI dependence in Userbuffers #901.

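A sketch of the guard suggested in point 1. NVTE_CHECK is the error-checking macro used elsewhere in Transformer Engine, but the predicate names here are hypothetical:

// Hypothetical names: tp_size, gpus_per_node, and mnnvl_supported would come
// from the communicator setup; only the shape of the check is the point.
const bool tp_spans_nodes = (tp_size > gpus_per_node);
NVTE_CHECK(!tp_spans_nodes || mnnvl_supported,
           "Userbuffers tensor-parallel group spans multiple nodes, "
           "which requires Multi-Node NVLink support.");
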
Comment on lines -185 to -203
int datanodes = allnodes / pipenodes / tensornodes;
int pipenodegroup_id = myrank / numlocal / (datanodes * tensornodes);

(*comm)->pipe_id = pipegpus * pipenodegroup_id + mylocal / (datagpus * tensorgpus);

(*comm)->comm_inter = EXT_COMM_INTER;
(*comm)->first_node = nodeid - mynode;
(*comm)->num_nodes = numnodes;
(*comm)->my_node = mynode;

(*comm)->num2_nodes = tensornodes;
(*comm)->my2_node = (mynode / datanodes) % tensornodes;
(*comm)->first2_node = mynode - (*comm)->my2_node * datanodes;

(*comm)->fifo = reinterpret_cast<ub_request *>(malloc(sizeof(ub_request) * NVTE_MAX_REQUESTS));
(*comm)->nblocks = 8;
(*comm)->alignblock = 1024 * 512;
(*comm)->minblock = 1024 * 2 * 1024;
(*comm)->asyncblocks = 16;
Collaborator


If we are not setting these class members, we should remove them from the header file as well.

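Concretely, something along these lines. This is a hypothetical excerpt of the communicator struct, mirroring the assignments deleted above rather than quoting the actual header:

struct communicator {
  // ... members that are still initialized and used ...

  // Candidates for removal together with the deleted setup code above
  // (names copied from the deleted assignments; declared types assumed):
  // ExtComm comm_inter;
  // int pipe_id;
  // int first_node, num_nodes, my_node;
  // int num2_nodes, my2_node, first2_node;
  // ub_request *fifo;
  // int nblocks, alignblock, minblock, asyncblocks;
};
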
@timmoon10 (Collaborator) commented Feb 18, 2025

/te-ci L1

Looks like we're seeing build errors; I suspect it's because we're not guarding the new CUDA APIs when building with older CUDA versions.

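For reference, a sketch of the kind of guard that would address this. The version threshold is an assumption (CU_MEM_HANDLE_TYPE_FABRIC appeared in a CUDA 12.x release; check cuda.h for the exact value), and NVTE_ERROR stands in for whatever error path fits here:

#include <cuda.h>

// Hypothetical helper: choose the shareable-handle type for UB registration,
// compiling the fabric path only when the toolkit declares the enum.
static CUmemAllocationHandleType ub_handle_type(bool use_mnnvl) {
#if CUDA_VERSION >= 12030  // assumed threshold for fabric-handle support
  if (use_mnnvl) return CU_MEM_HANDLE_TYPE_FABRIC;
#else
  if (use_mnnvl) {
    // Built against an older toolkit: the fabric enum does not exist,
    // so MNNVL cannot be requested at all.
    NVTE_ERROR("MNNVL requested, but this build's CUDA toolkit predates fabric handles.");
  }
#endif
  return CU_MEM_HANDLE_TYPE_POSIX_FILE_DESCRIPTOR;
}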