support adam bf16 state #1465
base: main
Conversation
Force-pushed 035e623 to 89477f9
Signed-off-by: XiaobingSuper <[email protected]>
Force-pushed 298ee6f to 990507e
for more information, see https://pre-commit.ci
@timmoon10 please help review it, thanks.
DISPATCH_DOUBLE_FLOAT_HALF_AND_BFLOAT(
    m_in_type, 2, "adam",
    DISPATCH_DOUBLE_FLOAT_HALF_AND_BFLOAT(
        v_in_type, 3, "adam",
This will involve increasing the number of template instantiations by 16x. As an alternative approach, how about we cast the Adam state to FP32 and reuse the existing template instantiations? This is the approach we use for FP16 state. We can consider alternative approaches like JIT compilation if the memory and compute overhead become burdensome.
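To illustrate the suggested alternative, here is a minimal Python sketch of the cast-to-FP32 approach, assuming FP32 master weights and gradients; the function and argument names are hypothetical and only stand in for the fused kernel path:

```python
import torch

def adam_step_with_bf16_state(param, grad, exp_avg_bf16, exp_avg_sq_bf16,
                              lr, beta1, beta2, eps, step):
    # param and grad are assumed to be FP32 (master weights / unscaled grads).
    # Upcast the BF16 Adam moments to FP32 so the existing FP32 template
    # instantiations can be reused, mirroring how FP16 state is handled.
    exp_avg = exp_avg_bf16.float()
    exp_avg_sq = exp_avg_sq_bf16.float()

    # Plain Adam update in FP32 (stand-in for the fused multi-tensor kernel).
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    bias1 = 1 - beta1 ** step
    bias2 = 1 - beta2 ** step
    denom = (exp_avg_sq / bias2).sqrt_().add_(eps)
    param.addcdiv_(exp_avg / bias1, denom, value=-lr)

    # Write the updated moments back to BF16 storage.
    exp_avg_bf16.copy_(exp_avg)
    exp_avg_sq_bf16.copy_(exp_avg_sq)
```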
elif dtype == torch.bfloat16:
    assert state[state_name].dtype == torch.bfloat16
    unscaled = state[state_name]
It's odd that the FP16 case involves per-tensor scaling while the BF16 case does not. This is not necessarily a problem with this PR, but it is a sign that the per-tensor scaling logic does not generalize and that its design should be improved.
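For comparison, here is a rough sketch of what per-tensor scaling for FP16 state looks like versus the plain BF16 cast in this PR; the helper names are illustrative and not the actual FusedAdam code:

```python
import torch

def store_state_fp16(state_fp32: torch.Tensor):
    # Per-tensor scaling: pick a scale so the largest magnitude fits in the FP16
    # range, then store the scaled tensor together with its scale factor.
    amax = state_fp32.abs().max().clamp(min=1e-12)
    scale = torch.finfo(torch.float16).max / amax
    return (state_fp32 * scale).half(), scale

def load_state_fp16(stored_fp16: torch.Tensor, scale: torch.Tensor):
    # Unscale back to FP32 before the optimizer update.
    return stored_fp16.float() / scale

def store_state_bf16(state_fp32: torch.Tensor):
    # BF16 keeps FP32's exponent range, so this PR stores the state with a
    # direct cast and no per-tensor scale factor.
    return state_fp32.bfloat16()
```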
Description
This PR adds support for storing the Adam optimizer state in BF16 to reduce memory usage. We have tested a range of models, from smaller models to large LLMs, and observe similar convergence even with the lower-precision BF16 state. The DeepSeek-V3 technical report also uses BF16 optimizer state to reduce training memory usage.
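For example, usage at the optimizer level might look like the following; this is a hypothetical sketch that assumes FusedAdam exposes exp_avg_dtype/exp_avg_sq_dtype options in the same way as the existing FP16-state support, and the import path may differ in practice:

```python
import torch
from transformer_engine.pytorch.optimizers import FusedAdam

model = torch.nn.Linear(1024, 1024).cuda()

# Hypothetical: request BF16 storage for the Adam moments to roughly halve
# the optimizer state memory relative to FP32.
optimizer = FusedAdam(
    model.parameters(),
    lr=1e-4,
    exp_avg_dtype=torch.bfloat16,
    exp_avg_sq_dtype=torch.bfloat16,
)
```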
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: