paddle.distributed.all_to_all does not support unequal split_size semantics #71429

Open
dynamicheart opened this issue Mar 5, 2025 · 1 comment
dynamicheart commented Mar 5, 2025

  1. Example of unequal split sizes with PyTorch's torch.distributed.all_to_all (https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_to_all):
>>> input
tensor([0, 1, 2, 3, 4, 5])                                       # Rank 0
tensor([10, 11, 12, 13, 14, 15, 16, 17, 18])                     # Rank 1
tensor([20, 21, 22, 23, 24])                                     # Rank 2
tensor([30, 31, 32, 33, 34, 35, 36])                             # Rank 3
>>> input_splits
[2, 2, 1, 1]                                                     # Rank 0
[3, 2, 2, 2]                                                     # Rank 1
[2, 1, 1, 1]                                                     # Rank 2
[2, 2, 2, 1]                                                     # Rank 3
>>> output_splits
[2, 3, 2, 2]                                                     # Rank 0
[2, 2, 1, 2]                                                     # Rank 1
[1, 2, 1, 2]                                                     # Rank 2
[1, 2, 1, 1]                                                     # Rank 3
>>> input = list(input.split(input_splits))
>>> input
[tensor([0, 1]), tensor([2, 3]), tensor([4]), tensor([5])]                   # Rank 0
[tensor([10, 11, 12]), tensor([13, 14]), tensor([15, 16]), tensor([17, 18])] # Rank 1
[tensor([20, 21]), tensor([22]), tensor([23]), tensor([24])]                 # Rank 2
[tensor([30, 31]), tensor([32, 33]), tensor([34, 35]), tensor([36])]         # Rank 3
>>> output = ...
>>> dist.all_to_all(output, input)
>>> output
[tensor([0, 1]), tensor([10, 11, 12]), tensor([20, 21]), tensor([30, 31])]   # Rank 0
[tensor([2, 3]), tensor([13, 14]), tensor([22]), tensor([32, 33])]           # Rank 1
[tensor([4]), tensor([15, 16]), tensor([23]), tensor([34, 35])]              # Rank 2
[tensor([5]), tensor([17, 18]), tensor([24]), tensor([36])]                  # Rank 3
  2. In Paddle, only the paddle.distributed.all_to_all_single API supports unequal split sizes; the tensor list passed to paddle.distributed.all_to_all is concatenated and then split evenly across ranks (see the sketch below).

    Reference: https://github.com/PaddlePaddle/Paddle/blob/974cc53f9d/paddle/fluid/pybind/distributed_py.cc#L322
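
For comparison, below is a minimal sketch of how the unequal-split example above could be written with paddle.distributed.alltoall_single. It assumes 4 ranks and the out-tensor-first argument order of alltoall_single in recent Paddle releases (check the docs for your version); the split tables are copied from the PyTorch example.

import paddle
import paddle.distributed as dist

dist.init_parallel_env()
rank = dist.get_rank()  # assumes the script is launched with 4 ranks

# Per-rank inputs and split tables, copied from the PyTorch example above.
inputs = [
    [0, 1, 2, 3, 4, 5],
    [10, 11, 12, 13, 14, 15, 16, 17, 18],
    [20, 21, 22, 23, 24],
    [30, 31, 32, 33, 34, 35, 36],
]
in_split_sizes = [[2, 2, 1, 1], [3, 2, 2, 2], [2, 1, 1, 1], [2, 2, 2, 1]][rank]
out_split_sizes = [[2, 3, 2, 2], [2, 2, 1, 2], [1, 2, 1, 2], [1, 2, 1, 1]][rank]

in_tensor = paddle.to_tensor(inputs[rank], dtype='int64')
out_tensor = paddle.empty([sum(out_split_sizes)], dtype='int64')

# alltoall_single accepts per-rank split sizes, unlike the tensor-list API.
dist.alltoall_single(out_tensor, in_tensor,
                     in_split_sizes=in_split_sizes,
                     out_split_sizes=out_split_sizes)
# e.g. on rank 0, out_tensor == [0, 1, 10, 11, 12, 20, 21, 30, 31]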

  3. For an equal split, it is recommended to simply pass in_split_sizes and out_split_sizes as empty (a sketch follows the reference below):

    Reference:
    https://github.com/pytorch/pytorch/blob/6c3492b4/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L4912
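
Continuing the sketch above, the equal-split case simply omits the split-size arguments (hypothetical data; with no split sizes, each rank exchanges numel / world_size elements, matching the equal-split fast path in the ProcessGroupNCCL code linked above):

world_size = dist.get_world_size()
x = paddle.arange(2 * world_size, dtype='int64') + rank * 100  # hypothetical data
y = paddle.empty_like(x)

# No in_split_sizes/out_split_sizes: an equal split of 2 elements per rank.
dist.alltoall_single(y, x)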
