Add paged attention support #1355
base: main
Conversation
    v_cache: torch.Tensor
        The value cache tensor containing previous and the current tokens
    """
    k_cache, v_cache = self.cache[layer_number]
Suggested change (guard the lookup before unpacking):

    assert layer_number in self.cache
    k_cache, v_cache = self.cache[layer_number]
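For context, a sketch of how such a per-layer cache might be populated so the assert has something to check. The method name, shapes, and dtype here are illustrative assumptions, not TE's actual API:

    # Illustrative only: one (k_cache, v_cache) pair per transformer layer,
    # preallocated up to max_batch_size x max_seqlen_kv ("bshd" layout assumed).
    def allocate_layer(self, layer_number, num_heads, head_dim, dtype=torch.float16):
        self.cache[layer_number] = (
            torch.zeros(self.max_batch_size, self.max_seqlen_kv, num_heads, head_dim, dtype=dtype),
            torch.zeros(self.max_batch_size, self.max_seqlen_kv, num_heads, head_dim, dtype=dtype),
        )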
    def __init__(
        self,
        max_batch_size: int,
        max_seqlen_kv: int,
The corresponding docstring says max_sequence_length; we should change one of those.
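A minimal sketch of one way to resolve the mismatch, renaming the docstring entry to match the parameter (the docstring wording below is assumed, not copied from the PR):

    def __init__(
        self,
        max_batch_size: int,
        max_seqlen_kv: int,
    ):
        """
        Parameters
        ----------
        max_batch_size: int
            Maximum batch size the KV cache is allocated for.
        max_seqlen_kv: int
            Maximum KV sequence length (previously documented as
            max_sequence_length; renamed here to match the parameter).
        """
        self.max_batch_size = max_batch_size
        self.max_seqlen_kv = max_seqlen_kv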
    seq_s = self.sequences[seq] - step_dict[seq]
    seq_e = self.sequences[seq]
    if qkv_format == "bshd":
        new_k_cache[i, seq_s:seq_e, :, :] = k[i, : step_dict[seq], :, :]
Regarding k[i, : step_dict[seq], :, :]: k isn't supposed to have any tokens beyond step_dict[seq], right?
Same for v.
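For context, a hedged reconstruction of the update loop under discussion; the surrounding batch loop and the padding behavior of k and v are assumptions, not code from the PR:

    # Hypothetical reconstruction: i indexes the batch, seq is the sequence id,
    # step_dict maps sequence id -> number of new tokens in this step.
    for i, seq in enumerate(step_dict):
        seq_s = self.sequences[seq] - step_dict[seq]  # first new token position
        seq_e = self.sequences[seq]                   # one past the last new token
        if qkv_format == "bshd":
            # If k/v are padded along dim 1 to the batch's max step count, the
            # upper bound step_dict[seq] is required; if each k[i] holds exactly
            # step_dict[seq] tokens, it is redundant (the reviewer's point).
            new_k_cache[i, seq_s:seq_e, :, :] = k[i, : step_dict[seq], :, :]
            new_v_cache[i, seq_s:seq_e, :, :] = v[i, : step_dict[seq], :, :]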
    seq_s = self.sequences[seq] - step_dict[seq]
    seq_e = self.sequences[seq]
These could potentially be moved into a method, since the logic could be reused from outside, e.g. when getting the start positions for applying RoPE embeddings.
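A minimal sketch of that refactor; the helper name is hypothetical, and self.sequences is assumed to map a sequence id to its total token count so far:

    def get_step_range(self, seq, step_dict):
        """Return (start, end) cache positions of the current step's tokens.

        Hypothetical helper; could also be reused externally, e.g. to find
        the start position for applying RoPE embeddings.
        """
        seq_e = self.sequences[seq]
        seq_s = seq_e - step_dict[seq]
        return seq_s, seq_e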
Description
This PR adds paged attention support for FusedAttention, FlashAttention, and UnfusedDotProductAttention.
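For readers unfamiliar with the technique: paged attention stores the KV cache in fixed-size blocks ("pages") addressed through a per-sequence page table, so sequences can grow without each reserving one large contiguous buffer. A minimal, library-agnostic sketch of the idea; all names are illustrative and this is not TE's actual API:

    import torch

    class PagedKVCache:
        """Toy paged KV cache: a shared page pool plus per-sequence page tables.

        Illustrative only; Transformer Engine's real implementation differs.
        """

        def __init__(self, num_pages, page_size, num_heads, head_dim, dtype=torch.float16):
            self.page_size = page_size
            # One pool of pages shared by all sequences.
            self.k_pool = torch.zeros(num_pages, page_size, num_heads, head_dim, dtype=dtype)
            self.v_pool = torch.zeros(num_pages, page_size, num_heads, head_dim, dtype=dtype)
            self.free_pages = list(range(num_pages))
            self.page_table = {}  # seq_id -> list of page indices
            self.seq_len = {}     # seq_id -> tokens stored so far

        def append(self, seq_id, k, v):
            """Append k, v of shape [num_tokens, num_heads, head_dim] to a sequence."""
            table = self.page_table.setdefault(seq_id, [])
            pos = self.seq_len.get(seq_id, 0)
            for t in range(k.shape[0]):
                page_idx, slot = divmod(pos + t, self.page_size)
                if page_idx == len(table):  # sequence needs a fresh page
                    table.append(self.free_pages.pop())
                self.k_pool[table[page_idx], slot] = k[t]
                self.v_pool[table[page_idx], slot] = v[t]
            self.seq_len[seq_id] = pos + k.shape[0]

    # Usage: two decode steps for sequence 0, three tokens then one.
    cache = PagedKVCache(num_pages=64, page_size=16, num_heads=8, head_dim=64)
    kv = lambda n: torch.randn(n, 8, 64, dtype=torch.float16)
    cache.append(0, kv(3), kv(3))
    cache.append(0, kv(1), kv(1))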