make the backward of differentiable float8 casts pass gradient as is #255
Conversation
Summary:

Behavior before:
* high precision to float8 in fw, float8 to high precision in bw
* float8 to high precision in fw; in bw, high precision to float8 if grad is a Float8Tensor, otherwise pass gradient unchanged

Behavior after:
* high precision to float8 in fw, pass gradient unchanged in bw
* float8 to high precision in fw, pass gradient unchanged in bw

Motivation for the new state:
1. we want gradients to be in high precision unless specified otherwise by the float8 recipe, and the logic to specify grad casting to float8 before the matmul is better implemented elsewhere
2. there is actually no logic change in this diff, as the backward casts were not getting hit from existing code; this diff just makes the intended behavior clearer

Test Plan:
```
./test/test_everything.sh
```

Reviewers:

Subscribers:

Tasks:

Tags:
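A minimal sketch of the behavior after this PR, assuming a recent PyTorch build with float8 dtypes; the class names and the scale-free casts below are illustrative stand-ins, not the repo's actual implementations:

```
import torch

class _CastToFloat8(torch.autograd.Function):
    """High precision -> float8 in fw; gradient passed through unchanged in bw."""

    @staticmethod
    def forward(ctx, tensor, float8_dtype):
        # scale-free cast, purely illustrative; the real code applies a scale
        # and wraps the result in a Float8Tensor subclass
        return tensor.to(float8_dtype)

    @staticmethod
    def backward(ctx, grad):
        # pass the gradient as-is, in whatever precision it arrives
        return grad, None


class _CastFromFloat8(torch.autograd.Function):
    """Float8 -> high precision in fw; gradient passed through unchanged in bw."""

    @staticmethod
    def forward(ctx, tensor, orig_dtype):
        return tensor.to(orig_dtype)

    @staticmethod
    def backward(ctx, grad):
        return grad, None
```

Usage would look like `y_fp8 = _CastToFloat8.apply(x, torch.float8_e4m3fn)`; autograd then delivers whatever gradient the downstream ops produce straight through to `x`, without any backward cast.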
I noticed this logic was funky when copy-pasting some of this code for MX prototyping, fixing here to clarify intent.
```
    else:
        return grad
```

```
    if isinstance(x, DTensor):
```
if I understand the old state correctly:
- `grad` is never a `Float8Tensor` in practice, because we always have the matmul output high precision
- things already work without nested subclasses, which is why removing this function is fine

But, would be great to clarify ^
Oh I checked the current workflow, we call:
- `cast_to_float8_e4m3fn`
- `cast_to_float8_e5m2_bw`

and it seems none of these calls would call into this `from_fp8_no_autograd`, so this should be fine
actually it looks like I'm wrong, `cast_to_float8_e4m3fn`'s backward would call this, but maybe it's a no-op even as of the current state, as the output of the torch.scaled_mm in backward is fp32?
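To make the call flow in this thread concrete, here is a heavily simplified, hypothetical sketch: the function name, the scale-free casts, and the plain-matmul stand-in for the scaled matmul are all assumptions, not the repo's actual code. The point it illustrates is that the matmul output, and therefore the gradient flowing back into the e4m3 cast's backward, is a plain high-precision tensor rather than a Float8Tensor, so the removed backward branch was not being exercised.

```
import torch

def float8_mm_sketch(x_hp: torch.Tensor, w_hp: torch.Tensor) -> torch.Tensor:
    # forward: cast both operands to e4m3 (stand-in for cast_to_float8_e4m3fn;
    # the real code applies scales and wraps results in Float8Tensor)
    x_fp8 = x_hp.to(torch.float8_e4m3fn)
    w_fp8 = w_hp.to(torch.float8_e4m3fn)
    # the matmul produces a high-precision output (the real code asks the
    # scaled matmul for a high-precision out_dtype; here we simply upcast and
    # use a regular matmul), so any gradient autograd later sends back through
    # the e4m3 casts is a plain high-precision tensor, never a Float8Tensor
    return x_fp8.to(torch.float32) @ w_fp8.to(torch.float32).t()
```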
@vkuzo has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.