fp8 backward #119
base: main_perf
Conversation
Force-pushed from 6b691eb to 297742b
I'm approving the PR because I can't see anything wrong with it. I just left some questions and cleanup suggestions.
@@ -553,6 +636,14 @@ def attention_prefill_backward_triton_impl(
    print("use_exp2:", use_exp2)
    print("sequence_parallel:", sequence_parallel)

    is_fp8 = arch_supports_fp8() and q.dtype in {torch.float8_e4m3fnuz, torch.float8_e4m3fn, torch.float8_e5m2, torch.float8_e5m2fnuz}
    if is_fp8:
I think this empty if is_fp8: statement can be removed. Do we need to print or debug anything inside it?
I'm marking this discussion as unresolved because I think the empty if statement is still here.
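For illustration only (not code from the PR), these are the two obvious resolutions, assuming the debug prints shown in the hunk above are the intended use of the branch:

# Option 1: keep only the flag and drop the empty branch.
is_fp8 = arch_supports_fp8() and q.dtype in {torch.float8_e4m3fnuz, torch.float8_e4m3fn,
                                             torch.float8_e5m2, torch.float8_e5m2fnuz}

# Option 2 (hypothetical): if something is worth logging, put it inside the branch
# next to the other debug prints instead of leaving it empty.
if is_fp8:
    print("fp8 backward path enabled, q.dtype:", q.dtype)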
Force-pushed from 7d0277c to 4cd9e2a
Enable BWD fp8 with per block scale factors for p and ds

This is a combination of 9 commits:

Enable BWD fp8

This is a combination of 12 commits:

add backward test case
save
clean up
disable ci
lse is good
dv matches
reduce diff
use do fp8 for dv kinda working
group size is a constexpr
clean up a bit
everything except mqa/gqa works
skip mqa cases

20 cases have nan on dropout
save what you have
disable tests failing
enable tests
per block descale_p and descale_ds use max(abs())
clean up tests a bit more
fix bug
disable ci for now
pass variables
add flags
add alternate path. Still need to load descale factors
dv working
dk works
save
Force-pushed from b725cdc to e6a67b3
…th causal. Varlen has some issues. Might be related to strides.
Great job Michael! Kudos for introducing the compute_fp8_scaling_factors Triton function; it's really useful for avoiding code repetition.
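For context, a helper of this kind usually derives one scale/descale pair per block from the block's maximum magnitude. The body below is only a sketch of that general technique under this assumption, not the PR's actual implementation, and is meant to be called from inside a Triton kernel:

import triton
import triton.language as tl

@triton.jit
def compute_fp8_scaling_factors(x, FP8_MAX: tl.constexpr):
    # Sketch only: per-block max magnitude, clamped so an all-zero block
    # does not cause a division by zero.
    x_amax = tl.maximum(tl.max(tl.abs(x)), 1e-9)
    scale = FP8_MAX / x_amax    # applied before casting the block to fp8
    descale = x_amax / FP8_MAX  # applied after the fp8 matmul to undo the scaling
    return scale, descale

Under this reading, the descale_p and descale_ds factors mentioned in the commit message would be the values applied after the corresponding fp8 matmuls.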
    sm_scale, softmax_lse, o, *q_strides, *k_strides, *v_strides, *o_strides,
    *bias_strides, *alibi_strides, *scores_strides, stride_lse_z, stride_lse_h, stride_lse_m, cu_seqlens_q, cu_seqlens_k,
    dropout_p=dropout_p, philox_seed=philox_seed, philox_offset_base=philox_offset, sd_mask=sd_mask, dropout_mask=dropout_mask, alibi_slopes=alibi_slopes,
    HQ=nheads_q, HK=nheads_k, ACTUAL_BLOCK_DMODEL=head_size, MAX_SEQLENS_Q=max_seqlens_q,
    MAX_SEQLENS_K=max_seqlens_k, IS_CAUSAL=causal, VARLEN=is_varlen,
    BLOCK_DMODEL=padded_d_model, USE_BIAS=False if bias is None else True,
    USE_ALIBI=False if alibi_slopes is None else True, ENABLE_DROPOUT=dropout_p
-   > 0.0, USE_EXP2=use_exp2, RETURN_SCORES=return_softmax, IS_FP8=is_fp8)
+   > 0.0, USE_EXP2=use_exp2, RETURN_SCORES=return_softmax, IS_FP8=is_fp8, FP8_MAX=torch.finfo(torch.float8_e4m3fnuz).max)
Suggestion: Since is_fp8 is defined as:
is_fp8 = arch_supports_fp8() and q.dtype in {torch.float8_e4m3fnuz, torch.float8_e4m3fn, torch.float8_e5m2, torch.float8_e5m2fnuz}
I think it's better to compute FP8_MAX as:
FP8_MAX=torch.finfo(q.dtype).max
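A quick, standalone check (not part of the PR) of why deriving FP8_MAX from q.dtype matters: the four FP8 formats accepted by the is_fp8 check do not share the same maximum representable value, so a hard-coded float8_e4m3fnuz max does not match the other formats.

import torch

# Maximum representable value per FP8 dtype, as reported by torch.finfo
# (values from recent PyTorch builds; verify on your own install).
for dt in (torch.float8_e4m3fnuz, torch.float8_e4m3fn,
           torch.float8_e5m2, torch.float8_e5m2fnuz):
    print(dt, torch.finfo(dt).max)
# torch.float8_e4m3fnuz 240.0
# torch.float8_e4m3fn   448.0
# torch.float8_e5m2     57344.0
# torch.float8_e5m2fnuz 57344.0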
@@ -650,56 +741,26 @@ def attention_prefill_backward_triton_impl(
    do,
    delta,
    stride_oz, stride_oh, stride_om, stride_ok,
    stride_oz, stride_oh, stride_om, stride_ok,
    stride_oz, stride_oh, stride_om, stride_ok, # FIXME: don't share strides with derivatives this was causing a lot of issues
Question: Should we merge this PR without addressing this FIXME? Is it supposed to be addressed later?
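As an aside on why sharing strides is fragile (an illustration, not code from the PR; the tensor shapes below are made up): the gradient do is not guaranteed to have the same memory layout as o, so passing o's strides for both can index do incorrectly.

import torch

# Hypothetical shapes: o is contiguous, do comes from a transpose and has
# the same shape but a different memory layout.
o = torch.empty(2, 4, 128, 64)
do = torch.empty(2, 128, 4, 64).transpose(1, 2)

stride_oz, stride_oh, stride_om, stride_ok = o.stride()
stride_doz, stride_doh, stride_dom, stride_dok = do.stride()

print(o.stride())   # (32768, 8192, 64, 1)
print(do.stride())  # (32768, 64, 256, 1) -- different layout, so o's strides must not be reused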
    IS_VARLEN=is_varlen,
    GROUP_SIZE=group_size,
    IS_FP8=is_fp8,
    FP8_MAX=torch.finfo(torch.float8_e4m3fnuz).max
Suggestion: Since is_fp8 is computed as:
is_fp8 = arch_supports_fp8() and q.dtype in {torch.float8_e4m3fnuz, torch.float8_e4m3fn, torch.float8_e5m2, torch.float8_e5m2fnuz}
I think it's better to compute FP8_MAX as:
FP8_MAX=torch.finfo(q.dtype).max
    DEBUG_TRITON: bool = False,
    DEBUG_TRITON_DETAIL: bool = False,
):
    IS_FP8 = arch_supports_fp8() and q.dtype in {torch.float8_e4m3fnuz, torch.float8_e4m3fn, torch.float8_e5m2, torch.float8_e5m2fnuz}
    if IS_FP8:
        FP8_MAX = torch.finfo(torch.float8_e4m3fnuz).max
Suggestion: Compute FP8_MAX as torch.finfo(q.dtype).max.
    (4, 8, 8, 2048, 2048, 128),
    (4, 16, 16, 4096, 4096, 64),
    (2, 4, 4, 8192, 8192, 32),
    # (1, 1, 1, 1, 1, 1),
Question: Is it our intention to leave only one test case? Or is this still a work in progress?
run: |
  export FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE"
  pytest tests/test_flash_attn_triton_amd.py
# - name: Flash Attention Tests Using Reference Impl
Question: My knowledge of GitHub Actions is almost none, so please take this comment with a grain of salt... As far as I can see, the MI300 integration job is commented out. Am I correct? Do we really want to merge it this way?
add fp8 backward