I have observed a recent change in `LinearWithGradAccumulationAndAsyncCommunication` that stores the gradient of the weights in `WeightGradStore` as part of the new Zero Bubble Pipeline Parallelism feature (#396):
https://github.com/microsoft/Megatron-DeepSpeed/blob/1280f59c1a65e50d4e174e4195e14f173301a497/megatron/core/tensor_parallel/layers.py#L370
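For context, here is a minimal sketch (not the actual Megatron-DeepSpeed code) of the deferred weight-gradient pattern I am referring to; `WeightGradStoreSketch` and its method names are illustrative only:

```python
import torch

class WeightGradStoreSketch:
    """Simplified stand-in for WeightGradStore: queues deferred weight-grad work."""
    cache = []

    @classmethod
    def put(cls, total_input, grad_output, weight):
        # Defer the weight-gradient matmul instead of computing it inside backward().
        cls.cache.append((total_input, grad_output, weight))

    @classmethod
    def pop_all(cls):
        # Drain the queue: compute and accumulate the deferred weight gradients.
        # In the actual code, this draining appears to happen only in deepspeed_zbh1_engine.
        for total_input, grad_output, weight in cls.cache:
            grad_weight = grad_output.t().matmul(total_input)
            if weight.grad is None:
                weight.grad = grad_weight
            else:
                weight.grad += grad_weight
        cls.cache.clear()
```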
However, the stored gradients are only accessed in `deepspeed_zbh1_engine`:
https://github.com/microsoft/Megatron-DeepSpeed/blob/1280f59c1a65e50d4e174e4195e14f173301a497/megatron/core/pipeline_parallel/deepspeed_zbh1_engine.py#L108
If the Zero Bubble Pipeline Parallelism feature is not enabled, it seems the stored weight gradients are never returned or applied. Is this expected behavior?
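To illustrate the concern, here is a hypothetical gated backward path built on the sketch above; `backward_weight_grad` and `zero_bubble_enabled` are assumed names for illustration, not the real function or flag:

```python
import torch

def backward_weight_grad(total_input, grad_output, weight, zero_bubble_enabled):
    """Hypothetical gated backward path for the weight gradient."""
    if zero_bubble_enabled:
        # Deferred path: the gradient is queued and applied later, when the
        # zero-bubble engine drains the store.
        WeightGradStoreSketch.put(total_input, grad_output, weight)
        return None
    # Eager path: compute and return the weight gradient immediately, as the
    # non-zero-bubble schedule expects.
    return grad_output.t().matmul(total_input)

# Quick check of the concern: with the deferred path but nothing draining the
# store, weight.grad stays None after the "backward" call.
weight = torch.nn.Parameter(torch.zeros(4, 3))
x = torch.randn(2, 3)
g = torch.randn(2, 4)
backward_weight_grad(x, g, weight, zero_bubble_enabled=True)
assert weight.grad is None          # nothing has drained the store yet
WeightGradStoreSketch.pop_all()     # this is what the zbh1 engine does; without it the gradient is lost
assert weight.grad is not None
```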