[DO NOT MERGE] Initial NCCL Allreduce Backend Prototype #7298

Open: mjwilkins18 (Contributor) wants to merge 1 commit into base: main

Pull Request Description

This PR is a proof-of-concept of how we can use NCCL as a backend of MPI collectives. This PR is missing many necessary features and is not meant to be merged. I am interested in gathering feedback before continuing development. Some open questions/points for discussion I have:

  • When/where to properly init/free the CCLcomm structure
  • How to handle operation and data types portably (copy+pasting them into a new switch statement feels fragile, and I am not 100% sure I covered all of the relevant datatypes)
  • How to design this extensibly so we can add RCCL, OneCCL, etc. (My intuition here is a base abstract class + derived classes, but this is not OOP.)
  • How to modify src/mpi/coll/mpir_coll.c to consider CCL_Allreduce.
  • How to pull in NCCL from the environment and/or a configure argument --with-nccl=
  • Function/variable/file names and locations

And anything else you can think of. Let me know what you think!
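
On the extensibility question, a common C idiom for the "base class plus derived classes" idea is a struct of function pointers, with one instance per backend. The following is only a minimal sketch under stated assumptions: every name in it (`sk_ccl_backend`, `sk_dtype`, `sk_op`, `toy_backend`, `sk_ccl_allreduce`) is hypothetical and not part of this PR, and the enums are stand-ins so the block compiles without MPI or NCCL headers. The toy backend is a single-rank placeholder, not NCCL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins so the sketch compiles without MPI or NCCL headers;
 * a real version would translate MPI_Datatype/MPI_Op into the backend's own
 * types (for NCCL, ncclDataType_t and ncclRedOp_t). */
typedef enum { SK_INT32, SK_FLOAT, SK_DOUBLE, SK_LONG_DOUBLE } sk_dtype;
typedef enum { SK_SUM, SK_MAX } sk_op;

enum { SK_HANDLED = 0, SK_FALLBACK = 1 };

/* The "abstract base class": one table of function pointers per backend. */
typedef struct sk_ccl_backend {
    const char *name;
    /* Element size in bytes, or 0 if the backend cannot handle dt. */
    size_t (*dtype_size)(sk_dtype dt);
    /* Nonzero if the backend can perform this reduction. */
    int (*op_supported)(sk_op op);
    int (*allreduce)(const void *sendbuf, void *recvbuf, size_t count,
                     sk_dtype dt, sk_op op);
} sk_ccl_backend;

/* Toy "derived class": a single-rank backend, where allreduce is a copy. */
static size_t toy_dtype_size(sk_dtype dt)
{
    switch (dt) {
        case SK_INT32:  return sizeof(int32_t);
        case SK_FLOAT:  return sizeof(float);
        case SK_DOUBLE: return sizeof(double);
        default:        return 0;   /* assumed unsupported by this backend */
    }
}

static int toy_op_supported(sk_op op)
{
    return op == SK_SUM || op == SK_MAX;
}

static int toy_allreduce(const void *sendbuf, void *recvbuf, size_t count,
                         sk_dtype dt, sk_op op)
{
    (void) op;  /* with one rank, every reduction is the identity */
    memcpy(recvbuf, sendbuf, count * toy_dtype_size(dt));
    return SK_HANDLED;
}

static const sk_ccl_backend toy_backend = {
    "toy", toy_dtype_size, toy_op_supported, toy_allreduce
};

/* Generic entry point: use the backend if it supports the request,
 * otherwise tell the caller to fall back to the regular MPI algorithm. */
static int sk_ccl_allreduce(const sk_ccl_backend *be, const void *sendbuf,
                            void *recvbuf, size_t count, sk_dtype dt, sk_op op)
{
    assert(be != NULL);
    if (be->dtype_size(dt) == 0 || !be->op_supported(op))
        return SK_FALLBACK;
    return be->allreduce(sendbuf, recvbuf, count, dt, op);
}
```

With this shape, the type and op checks live next to each backend instead of in one shared switch statement, and a caller such as the collective selection code could treat `SK_FALLBACK` as "use the existing algorithm". Adding RCCL or OneCCL would then mean adding another table, assuming their APIs fit the same three hooks.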

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.
