FA3 variable length attention sort/swizzle #82
base: main
Conversation
seqlen = params.seqlen;
if constexpr (Prepared) {
    return batch_idx < params.num_batch && lane < cutlass::NumThreadsPerWarp - 1
        ? cute::ceil_div(params.prepare_seqlen_q_ptr[batch_idx], kBlockM) : 0;
Right now vLLM is a bit annoying: we actually compute the attention metadata (and as a result the mha_fwd_get_scheduler_metadata call) before knowing how many requests we will pad to; this means the scheduler metadata will be for a different batch size than what params.b is at runtime. This is normally fine since cu_seqlens is padded so that all requests up to the max batch size have seqlen_q == 0, and FA returns before touching any bad memory; however, if this reads garbage from prepare_seqlen_q_ptr, it might break? We can probably zero the metadata here: https://github.com/neuralmagic/vllm/blob/a75c6e034abf00603fba527625e44baab7b42f80/vllm/v1/attention/backends/flash_attn.py#L333-L338
(This is a historical artifact of thinking that piecewise CUDA graphs would be enough in V1 and we wouldn't need attention to be in a cudagraph, so this may be re-architected in the near future.)
Actually we might have to do a more aggressive refactor on the vLLM side since I think an even bigger problem is that all of the offsets will be wrong:
int sort_offset = b_rounded * (use_dynamic_split ? 2 : 1);
int head_swizzle_offset = b_rounded * (num_prepare_batch_vectors - 1);
int tile_count_semaphore_offset = b_rounded * num_prepare_batch_vectors;
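To make the mismatch concrete, here is a minimal sketch (assuming, purely for illustration, that the batch size is rounded up to a multiple of 32 and that the scheduler saw a smaller batch than the runtime one; none of these numbers come from the PR):

// Sketch only (not part of the PR): why the offsets go stale when the
// scheduler's batch size differs from the runtime one. The rounding
// multiple of 32 and the batch sizes below are made-up values.
#include <cstdio>

int round_up(int x, int multiple) { return (x + multiple - 1) / multiple * multiple; }

int main() {
    const bool use_dynamic_split = true;
    const int num_prepare_batch_vectors = 4;        // hypothetical value

    // Batch size the scheduler saw when it built the metadata...
    const int b_sched_rounded = round_up(24, 32);   // -> 32
    // ...versus the batch size vLLM later pads to at runtime.
    const int b_run_rounded = round_up(48, 32);     // -> 64

    const int sort_offset_sched = b_sched_rounded * (use_dynamic_split ? 2 : 1);   // 64
    const int sort_offset_run   = b_run_rounded   * (use_dynamic_split ? 2 : 1);   // 128
    std::printf("sort_offset: scheduler=%d runtime=%d\n", sort_offset_sched, sort_offset_run);

    // head_swizzle_offset and tile_count_semaphore_offset shift the same way,
    // so every section of the struct-of-arrays metadata ends up misaligned.
    return 0;
}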
The other option would be to make the scheduler metadata an "Array of Structs" instead of a "Struct of Arrays"; then the offsets wouldn't be dependent on the batch size the scheduler used (and we could more easily just zero out the rest of the metadata).
How hard do you think this would be / how badly do you think this would hurt perf?
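Roughly, the Array-of-Structs idea would look something like the sketch below (field names are illustrative only, not the PR's actual metadata layout):

// Sketch only: an "Array of Structs" layout for the scheduler metadata.
// Field names are illustrative; they do not match the PR's actual vectors.
struct PrepareBatchEntry {
    int num_m_blocks;   // ceil_div(seqlen_q, kBlockM)
    int num_splits;     // dynamic split count for this request
    int head_swizzle;   // swizzled head index
    int tile_count;     // per-request tile count / semaphore contribution
};

// With AoS, entry b always lives at metadata[b], independent of how many
// batches the scheduler prepared, so zeroing the padded tail is a single
// memset over entries [num_prepared, max_batch).
inline PrepareBatchEntry* entry_for_batch(PrepareBatchEntry* metadata, int batch_idx) {
    return metadata + batch_idx;
}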
The other option would be to make the scheduler metadata an "Array of Structs" instead of a "Struct of Arrays"; then the offsets wouldn't be dependent on the batch size the scheduler used (and we could more easily just zero out the rest of the metadata).
How hard do you think this would be / how badly do you think this would hurt perf?
I could write it out as an int4 array instead, but then we wouldn't have coalesced accesses when reading it back in, so I'd like to avoid that if at all possible.
Can we pass in a max batch size to set the offsets correctly?
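For concreteness, a sketch of that option (the names compute_prepare_offsets, max_num_batch, and the rounding multiple are all made up here, not the PR's API):

// Sketch only: derive the struct-of-arrays offsets from a caller-supplied
// maximum batch size rather than the batch size the scheduler happened to
// see, so the layout stays valid after vLLM pads the batch.
// All names and the rounding multiple below are hypothetical.
struct PrepareOffsets {
    int sort_offset;
    int head_swizzle_offset;
    int tile_count_semaphore_offset;
};

inline int round_up_batch(int b, int multiple = 32) {
    return (b + multiple - 1) / multiple * multiple;
}

inline PrepareOffsets compute_prepare_offsets(int max_num_batch,
                                              bool use_dynamic_split,
                                              int num_prepare_batch_vectors) {
    const int b_rounded = round_up_batch(max_num_batch);
    return PrepareOffsets{
        b_rounded * (use_dynamic_split ? 2 : 1),
        b_rounded * (num_prepare_batch_vectors - 1),
        b_rounded * num_prepare_batch_vectors,
    };
}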
vllm side mirror of Dao-AILab#1823