Add backend-agnostic correctness test, particularly for async ops #10

Open
learning-chip opened this issue May 25, 2023 · 1 comment

learning-chip commented May 25, 2023

Asynchronous/non-blocking communications are among the most critical optimizations in large model training, but they are prone to error. For example, batch_isend_irecv returned wrong data with the NCCL backend until it was fixed in torch 1.13, and the same operation does not even run with the current version of the HCCL backend.

For existing correctness checks, see the assertEqual calls in distributed_test.py in the PyTorch main repo -- many of those tests are NCCL-specific. The HCCL backend, on the other hand, currently has very limited test coverage -- async ops are not tested at all :(

The unified unit test should assert the same expected behavior under both the hccl and nccl backends. This is the major assumption made when porting frameworks like Megatron and DeepSpeed to NPU; wherever the behavior differs, extra adapter logic must be added.

The correctness test suite should be kept separate from the OSU-style benchmark (#8), which focuses on performance numbers -- e.g. torch_comm_test.osu_bench and torch_comm_test.unit_test as two separate namespaces.
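A minimal sketch of what such a backend-agnostic check could look like. The names (`run_backend_agnostic_test`, `_ring_exchange`) and the port number are hypothetical, not part of any existing suite; `gloo` is used here only so the sketch runs on CPU, the point being that the same function would be called with `nccl` or `hccl` on the respective hardware:

```python
# Sketch: a batch_isend_irecv ring exchange whose expected result is
# asserted identically for every backend passed in.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def _ring_exchange(rank, world_size, backend):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29501"  # arbitrary free port for the sketch
    dist.init_process_group(backend, rank=rank, world_size=world_size)

    send_to = (rank + 1) % world_size
    recv_from = (rank - 1) % world_size
    send_tensor = torch.full((4,), float(rank))
    recv_tensor = torch.empty(4)

    ops = [
        dist.P2POp(dist.isend, send_tensor, send_to),
        dist.P2POp(dist.irecv, recv_tensor, recv_from),
    ]
    for req in dist.batch_isend_irecv(ops):
        req.wait()

    # The backend-independent contract: every rank receives its left
    # neighbour's payload. This is the assertion HCCL currently lacks.
    expected = torch.full((4,), float(recv_from))
    assert torch.equal(recv_tensor, expected), (rank, recv_tensor)
    dist.destroy_process_group()

def run_backend_agnostic_test(backend="gloo", world_size=2):
    # On a GPU/NPU runner this would be spawned with backend="nccl"/"hccl".
    mp.spawn(_ring_exchange, args=(world_size, backend), nprocs=world_size)

if __name__ == "__main__":
    run_backend_agnostic_test()
```

The key property is that the expected tensor is computed from the communication pattern alone, so the identical test body can be asserted under hccl and nccl.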

@learning-chip (Author)

An example of an inadequate check vs. a proper check: compare test_batch_isend_irecv_nccl in torch 1.11 vs 1.13.

Only the 1.13 version performs self.assertEqual(recv_tensors[src], expected_tensors[src]), which is exactly the kind of check we need here.
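To make the contrast concrete, here is a hypothetical helper (`check_recv` is not from the PyTorch test suite) capturing the 1.13-style value assertion. Merely waiting on the requests, as the 1.11 test does, passes even when the backend delivers garbage; comparing payloads is what actually catches the bug:

```python
import torch

def check_recv(recv_tensors, expected_tensors):
    # Inadequate check: req.wait() completing without error says nothing
    # about the payload that was delivered.
    # Proper check: compare every received tensor against the expected
    # one, mirroring self.assertEqual(recv_tensors[src], expected_tensors[src]).
    for src, expected in expected_tensors.items():
        assert torch.equal(recv_tensors[src], expected), \
            f"wrong data received from rank {src}"
```

Usage: after the irecv requests complete, call `check_recv(recv_tensors, expected_tensors)` with the expected tensors computed from the communication pattern.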
