Asynchronous (non-blocking) communication is among the most critical optimizations in large-model training, but it is error-prone. For example, `batch_isend_irecv` produced wrong data with the NCCL backend until it was fixed in torch 1.13, and the same operation does not even run with the current version of the HCCL backend.
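To make the failure mode concrete, here is a minimal sketch of the kind of check that would catch wrong data from `batch_isend_irecv`: each rank sends a buffer filled with its own rank to the next rank in a ring and verifies what it receives from the previous one. The helper names (`ring_peers`, `expected_payload`, `check_ring_exchange`) are hypothetical, and the sketch assumes the process group has already been initialized (for example under `torchrun`) and that `device` matches the backend.

```python
def ring_peers(rank, world_size):
    """Return (send_to, recv_from) for a one-step ring exchange."""
    return (rank + 1) % world_size, (rank - 1) % world_size


def expected_payload(src_rank, numel):
    """Payload convention for this sketch: a buffer filled with the sender's rank."""
    return [float(src_rank)] * numel


def check_ring_exchange(device, numel=4):
    """Run one batched isend/irecv ring step and assert the received data.

    Assumes torch.distributed is already initialized on every rank.
    """
    import torch
    import torch.distributed as dist

    rank, world_size = dist.get_rank(), dist.get_world_size()
    send_to, recv_from = ring_peers(rank, world_size)

    send_buf = torch.tensor(expected_payload(rank, numel), device=device)
    # Pre-fill with a sentinel so a silently skipped receive is detected.
    recv_buf = torch.full((numel,), -1.0, device=device)

    ops = [
        dist.P2POp(dist.isend, send_buf, send_to),
        dist.P2POp(dist.irecv, recv_buf, recv_from),
    ]
    for req in dist.batch_isend_irecv(ops):
        req.wait()

    # Copying to CPU also synchronizes with the device before comparing.
    assert recv_buf.cpu().tolist() == expected_payload(recv_from, numel)
```

Filling the receive buffer with a sentinel value is a deliberate choice: it distinguishes "received the wrong data" from "received nothing at all".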
For existing correctness checks, see the `assertEqual` calls in `distributed_test.py` in the PyTorch main repo; many of those tests are NCCL-specific. The HCCL backend, on the other hand, currently has very limited test coverage: async ops are not tested at all :(
The unified unit tests should assert the same expected behavior under both the `hccl` and `nccl` backends. This is the key assumption when porting frameworks like Megatron and DeepSpeed to NPU. Where the behavior differs, extra adapter logic must be added.
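As a rough sketch of what "same expected behavior under both backends" could look like, the test body below is written once and only the backend name and device string vary. The names `BACKEND_TO_DEVICE`, `device_for`, `expected_allreduce_sum`, and `run_allreduce_check` are hypothetical. It assumes a `torchrun`-style launch (so `LOCAL_RANK` and the rendezvous variables are set) and that importing `torch_npu` is what makes the `npu` device and the `hccl` backend available.

```python
import os

# Assumed mapping from backend name to device type for this sketch.
BACKEND_TO_DEVICE = {"nccl": "cuda", "hccl": "npu", "gloo": "cpu"}


def device_for(backend, local_rank):
    """Build the device string a given backend is expected to use."""
    kind = BACKEND_TO_DEVICE[backend]
    return kind if kind == "cpu" else f"{kind}:{local_rank}"


def expected_allreduce_sum(world_size):
    """Sum of (rank + 1) over all ranks, i.e. 1 + 2 + ... + world_size."""
    return world_size * (world_size + 1) / 2


def run_allreduce_check(backend, numel=8):
    """Same assertion for every backend; only the device differs."""
    import torch
    import torch.distributed as dist

    if backend == "hccl":
        import torch_npu  # noqa: F401  (assumption: registers npu + hccl)

    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    device = torch.device(device_for(backend, local_rank))
    dist.init_process_group(backend=backend)
    try:
        rank, world_size = dist.get_rank(), dist.get_world_size()
        x = torch.full((numel,), float(rank + 1), device=device)
        # async_op=True returns a work handle instead of blocking.
        work = dist.all_reduce(x, op=dist.ReduceOp.SUM, async_op=True)
        work.wait()
        assert x.cpu().tolist() == [expected_allreduce_sum(world_size)] * numel
    finally:
        dist.destroy_process_group()
```

Any backend-specific workaround would then live outside the shared assertion, which keeps the adapter logic visible rather than hidden inside the test.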
The correctness test suite should be kept separate from the OSU-style benchmark #8, which focuses on performance numbers, for example as two separate namespaces: `torch_comm_test.osu_bench` and `torch_comm_test.unit_test`.