Add backend-agnostic correctness test, particularly for async ops #10

Open
learning-chip opened this issue May 25, 2023 · 1 comment

learning-chip commented May 25, 2023

Asynchronous/non-blocking communications are among the most critical optimizations in large model training, but they are prone to error. For example, batch_isend_irecv returned wrong data with the NCCL backend until it was fixed in torch 1.13, and the same operation does not even run with the current version of the HCCL backend.

For existing correctness checks, see the assertEqual calls in distributed_test.py in the PyTorch main repo -- many of those tests are NCCL-specific. The HCCL backend, on the other hand, currently has very limited test coverage -- async ops are not tested at all :(

The unified unit test should assert the same expected behavior under both the hccl and nccl backends. This is the major assumption made when porting frameworks like Megatron and DeepSpeed to NPU; wherever the behavior differs, extra adapter logic must be added.

The correctness test suite should be kept separate from the OSU-style benchmark (#8), which focuses on performance numbers -- e.g. torch_comm_test.osu_bench and torch_comm_test.unit_test as two separate namespaces.
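A minimal sketch of what such a backend-agnostic check could look like. The names (`run_backend_agnostic_test`, `_ring_exchange`) and the port number are hypothetical, not part of any existing suite; `gloo` is used here only so the sketch runs on CPU, the point being that the same function would be called with `nccl` or `hccl` on the respective hardware:

```python
# Sketch: a batch_isend_irecv ring exchange whose expected result is
# asserted identically for every backend passed in.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def _ring_exchange(rank, world_size, backend):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"  # arbitrary free port for the sketch
    dist.init_process_group(backend, rank=rank, world_size=world_size)

    send_to = (rank + 1) % world_size
    recv_from = (rank - 1) % world_size
    send_tensor = torch.full((4,), float(rank))
    recv_tensor = torch.empty(4)

    ops = [
        dist.P2POp(dist.isend, send_tensor, send_to),
        dist.P2POp(dist.irecv, recv_tensor, recv_from),
    ]
    for req in dist.batch_isend_irecv(ops):
        req.wait()

    # The backend-independent contract: every rank receives its left
    # neighbour's payload. This is the assertion HCCL currently lacks.
    expected = torch.full((4,), float(recv_from))
    assert torch.equal(recv_tensor, expected), (rank, recv_tensor)
    dist.destroy_process_group()

def run_backend_agnostic_test(backend="gloo", world_size=2):
    # On a GPU/NPU runner this would be spawned with backend="nccl"/"hccl".
    mp.spawn(_ring_exchange, args=(world_size, backend), nprocs=world_size)

if __name__ == "__main__":
    run_backend_agnostic_test()
```

The key property is that the expected tensor is computed from the communication pattern alone, so the identical test body can be asserted under hccl and nccl.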

@learning-chip (Author)

An example of an inadequate check vs. a proper check: compare test_batch_isend_irecv_nccl in torch 1.11 vs 1.13.

Only the 1.13 version performs self.assertEqual(recv_tensors[src], expected_tensors[src]), which is exactly the kind of check we need here.
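To make the contrast concrete, here is a hypothetical helper (`check_recv` is not from the PyTorch test suite) capturing the 1.13-style value assertion. Merely waiting on the requests, as the 1.11 test does, passes even when the backend delivers garbage; comparing payloads is what actually catches the bug:

```python
import torch

def check_recv(recv_tensors, expected_tensors):
    # Inadequate check: req.wait() completing without error says nothing
    # about the payload that was delivered.
    # Proper check: compare every received tensor against the expected
    # one, mirroring self.assertEqual(recv_tensors[src], expected_tensors[src]).
    for src, expected in expected_tensors.items():
        assert torch.equal(recv_tensors[src], expected), \
            f"wrong data received from rank {src}"
```

Usage: after the irecv requests complete, call `check_recv(recv_tensors, expected_tensors)` with the expected tensors computed from the communication pattern.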
