Conversation

SiriusPaul

This pull request introduces an experimental auxiliary output for the FA2 variable-length forward path, allowing callers to obtain the sum of absolute pre-softmax attention scores (|S|) for each head and query token, or per page in paged-KV mode. The feature is exposed via the vLLM wrapper and is primarily intended for numerical analysis and debugging. The implementation spans Python, C++, and CUDA kernel changes to support this auxiliary return.

Feature: Auxiliary abs_s Output for FA2 Varlen Forward (Numerical Analysis/Debugging)

  • Added an experimental auxiliary output to FA2 varlen forward, accessible via the vLLM wrapper (flash_attn_varlen_func) by setting return_aux=True. This output provides the sum of absolute pre-softmax attention scores, scaled by 1/sqrt(D), for each head and token (non-paged) or per page (paged-KV).
  • Implemented a new C++/CUDA wrapper function varlen_fwd_with_abs_aux in flash_api_torch_lib.cpp, which computes and returns the auxiliary tensor (abs_s) alongside the usual outputs, and registered it in the PyTorch extension.
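The statistic itself is straightforward to state as a reference computation. The sketch below is illustrative only: names, shapes, and the dense layout are assumptions for clarity, while the actual kernel operates on packed varlen batches on the GPU.

```python
import numpy as np

def abs_s_reference(q, k):
    """Reference for the abs_s statistic: sum over keys of the absolute
    pre-softmax score (q_i . k_j) / sqrt(D), per head and query token.

    q: (num_q_tokens, num_heads, head_dim)
    k: (num_k_tokens, num_heads, head_dim)
    Returns: (num_heads, num_q_tokens)
    """
    d = q.shape[-1]
    # scores[h, i, j] = (q_i . k_j) / sqrt(D) for head h
    scores = np.einsum("ihd,jhd->hij", q, k) / np.sqrt(d)
    return np.abs(scores).sum(axis=-1)

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 2, 8))   # 4 query tokens, 2 heads, D=8
k = rng.standard_normal((6, 2, 8))   # 6 key tokens
aux = abs_s_reference(q, k)
print(aux.shape)  # (2, 4): one scalar per head and query token
```

Because the absolute values are summed before softmax, the result reflects raw score magnitudes, which is what makes it useful for spotting numerical outliers that softmax would otherwise normalize away.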

Kernel/Parameter Changes for Per-Page Accumulation

  • Extended the Flash_fwd_params struct in flash.h to include pointers and stride information for accumulating pre-softmax |S| per page, enabling efficient per-page statistics in the CUDA kernel.
  • Added a device-side helper accumulate_abslogits_per_page in flash_fwd_kernel.h to atomically accumulate the absolute values of pre-softmax scores into the provided buffer for each batch, head, query, and page. This is called in all relevant kernel paths.
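The per-page accumulation can be sketched in NumPy as follows. This is a CPU reference under assumed dense shapes, not the kernel's code; in the CUDA kernel the `out[p] +=` step is an atomic add, since several thread blocks may contribute to the same (page, head, query) slot.

```python
import numpy as np

def abs_s_per_page(q, k, page_size):
    """Per-page sum of |pre-softmax scores|, scaled by 1/sqrt(D).

    q: (num_q_tokens, num_heads, head_dim)
    k: (num_k_tokens, num_heads, head_dim), keys grouped into pages
       of page_size consecutive tokens (last page may be partial)
    Returns: (num_pages, num_heads, num_q_tokens)
    """
    num_k, num_heads, d = k.shape
    num_pages = (num_k + page_size - 1) // page_size
    out = np.zeros((num_pages, num_heads, q.shape[0]))
    scores = np.abs(np.einsum("ihd,jhd->hij", q, k)) / np.sqrt(d)
    for p in range(num_pages):
        lo, hi = p * page_size, min((p + 1) * page_size, num_k)
        # In the CUDA kernel this accumulation uses atomicAdd, because
        # multiple thread blocks can write the same output slot.
        out[p] += scores[:, :, lo:hi].sum(axis=-1)
    return out

rng = np.random.default_rng(1)
q = rng.standard_normal((4, 2, 8))
k = rng.standard_normal((6, 2, 8))
per_page = abs_s_per_page(q, k, page_size=4)
print(per_page.shape)  # (2, 2, 4): 2 pages, 2 heads, 4 query tokens
```

Summing the per-page buffer over the page axis recovers the non-paged per-token totals, which is a useful invariant when sanity-checking the kernel output.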

Developer Experience

  • Added .vscode/settings.json to improve code navigation in VSCode by associating certain file types with C++.

Build Configuration

  • Updated CMakeLists.txt to allow disabling FA3 via the FLASH_ATTN_DISABLE_FA3 environment variable, improving build flexibility for users who only need FA2.
