Vllm_flash_attn_with_attention_weights #88
This pull request introduces an experimental auxiliary output for the FA2 variable-length forward path, letting users obtain the sum of absolute pre-softmax attention scores (|S|) for each head and token, or per page in paged-KV mode. The feature is exposed via the vLLM wrapper and is intended primarily for numerical analysis and debugging. The implementation spans the Python wrapper, the C++ binding layer, and the CUDA kernels.
**Feature: Auxiliary `abs_s` Output for FA2 Varlen Forward (Numerical Analysis/Debugging)**

- Added an optional auxiliary output to `flash_attn_varlen_func`, requested by passing `return_aux=True`. The output is the sum of absolute pre-softmax attention scores, scaled by `1/sqrt(D)`, for each head and token (non-paged) or per page (paged-KV); a usage sketch follows.
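A minimal usage sketch, assuming the wrapper follows the upstream flash-attn calling convention and returns the auxiliary tensor alongside the output when `return_aux=True`; the import path, argument order, and the shape of `abs_s` are assumptions, not confirmed by the PR:

```python
import torch
from vllm_flash_attn import flash_attn_varlen_func  # assumed import path

# Two packed sequences of lengths 3 and 5; 4 heads, head_dim 64.
q = torch.randn(8, 4, 64, dtype=torch.float16, device="cuda")
k = torch.randn(8, 4, 64, dtype=torch.float16, device="cuda")
v = torch.randn(8, 4, 64, dtype=torch.float16, device="cuda")
cu_seqlens = torch.tensor([0, 3, 8], dtype=torch.int32, device="cuda")

# With return_aux=True the call additionally yields abs_s: per head and
# query token (non-paged), the sum of |pre-softmax scores| * 1/sqrt(64).
out, abs_s = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=5, max_seqlen_k=5,
    return_aux=True,  # experimental flag added by this PR
)
print(out.shape, abs_s.shape)  # assumed: (8, 4, 64) and (8, 4)
```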
- Implemented `varlen_fwd_with_abs_aux` in `flash_api_torch_lib.cpp`, which computes and returns the auxiliary tensor (`abs_s`) alongside the usual outputs, and registered the function in the PyTorch extension; a sketch follows.
**Kernel/Parameter Changes for Per-Page Accumulation**

- Extended the `Flash_fwd_params` struct in `flash.h` with pointers and stride information for accumulating pre-softmax |S| per page, enabling efficient per-page statistics in the CUDA kernel; see the sketch below.
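The PR description only states that pointers and strides were added, so the field names below are hypothetical:

```cpp
// flash.h (sketch): hypothetical additions to Flash_fwd_params.
struct Flash_fwd_params {
    // ... existing q/k/v/o pointers, strides, and sizes elided ...

    // Per-page |S| accumulator; nullptr when the aux output is disabled.
    void *__restrict__ abs_s_ptr;
    // Strides used to index the accumulator by batch, head, and page
    // (names and layout are illustrative, not from the PR).
    int64_t abs_s_batch_stride;
    int64_t abs_s_head_stride;
    int64_t abs_s_page_stride;
};
```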
- Added `accumulate_abslogits_per_page` in `flash_fwd_kernel.h`, which atomically accumulates the absolute values of the pre-softmax scores into the provided buffer for each batch, head, query, and page; it is called from all relevant kernel paths. A simplified sketch follows.
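A simplified, standalone sketch of the accumulation pattern; in the real kernel the helper operates on register-resident score fragments right after the Q·K^T product, and everything here except the function's name and its use of atomics is assumed:

```cuda
#include <cuda_runtime.h>

// Simplified stand-in for accumulate_abslogits_per_page: sum |S| over this
// thread's fragment of pre-softmax scores (already scaled by 1/sqrt(D)) and
// atomically add it to the accumulator slot. Atomics are required because
// several warps/blocks may contribute to the same (batch, head, query, page).
__device__ void accumulate_abslogits_per_page(
    float *abs_s,         // global accumulator buffer
    const float *scores,  // this thread's score fragment
    int n_scores,         // fragment length
    long slot) {          // flattened (batch, head, query, page) index
  float local_sum = 0.f;
  for (int i = 0; i < n_scores; ++i) {
    local_sum += fabsf(scores[i]);
  }
  atomicAdd(&abs_s[slot], local_sum);
}
```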
**Developer Experience**

- Updated `.vscode/settings.json` to improve code navigation in VS Code by associating certain file types with C++.
**Build Configuration**

- Updated `CMakeLists.txt` to allow disabling FA3 via the `FLASH_ATTN_DISABLE_FA3` environment variable, improving build flexibility for users who only need FA2; a sketch of the gate follows.
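A sketch of what such a gate might look like; the PR confirms only the `FLASH_ATTN_DISABLE_FA3` environment variable, so the surrounding logic and the `FA3_ENABLED` variable name are illustrative:

```cmake
# Skip FA3 kernels when the user sets FLASH_ATTN_DISABLE_FA3 (sketch).
if(DEFINED ENV{FLASH_ATTN_DISABLE_FA3})
  message(STATUS "FLASH_ATTN_DISABLE_FA3 set; building FA2 only")
  set(FA3_ENABLED OFF)
else()
  set(FA3_ENABLED ON)
endif()
```

With a gate like this, something along the lines of `FLASH_ATTN_DISABLE_FA3=1 pip install -e .` would build only the FA2 kernels.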