Add XAttention documentation #3010
base: master
Conversation
Pull Request Overview
This PR adds comprehensive documentation for the XAttention sparse attention algorithm, expanding the previous "TBA" placeholder section with detailed implementation descriptions and visual illustrations.
Key changes:
- Detailed explanation of XAttention's two-stage importance estimation procedure (see the sketch after this list)
- Addition of visual diagram illustrating the algorithm's operation
- Configuration parameter references for customizing XAttention behavior
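As a hedged sketch of the first stage of that procedure, block-level importance might be estimated roughly as below. The block sizes, the stride of 4, and the function name are illustrative assumptions in the spirit of XAttention's antidiagonal scoring, not the OpenVINO implementation:

```python
import numpy as np

def antidiagonal_block_score(q_block, k_block, stride=4):
    """Cheap importance proxy for one (query-block, key-block) pair:
    sum a strided subset of antidiagonals of the block's QK^T slice
    instead of evaluating the full attention for the block."""
    s = q_block @ k_block.T                  # (block, block) slice of QK^T
    b = s.shape[0]
    flipped = np.fliplr(s)                   # antidiagonals become diagonals
    return sum(np.trace(flipped, offset=o) for o in range(-b + 1, b, stride))

# Toy example: score four KV blocks of 128 tokens against one query block.
rng = np.random.default_rng(0)
q = rng.standard_normal((128, 64))
k = rng.standard_normal((512, 64))
scores = np.array([antidiagonal_block_score(q, k[i:i + 128])
                   for i in range(0, 512, 128)])
```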
@peterchen-intel @ceciliapeng2011 @WeldonWangwang could you please help with reviewing the XAttention documentation?
The prompt processing occurs as usual until at least two KV cache blocks have been completely filled (`t = 0, 1`). Once the block-level importance scores have been computed (`t = 2-4`), only the subset of KV blocks whose cumulative attention mass exceeds the `xattention_threshold` is retained, effectively introducing sparsity into the attention computation.
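As a hedged illustration of that retention rule, here is a minimal NumPy sketch that keeps the smallest set of KV blocks whose cumulative attention mass reaches the threshold. The softmax normalization and all names are assumptions for illustration, not the OpenVINO implementation:

```python
import numpy as np

def select_kv_blocks(block_scores, xattention_threshold=0.9):
    """Indices of KV blocks retained for one query block: take the most
    important blocks first, stopping once their cumulative attention
    mass reaches the threshold."""
    probs = np.exp(block_scores - block_scores.max())
    probs /= probs.sum()                           # softmax over block scores
    order = np.argsort(probs)[::-1]                # most important first
    cutoff = np.searchsorted(np.cumsum(probs[order]), xattention_threshold)
    return np.sort(order[:cutoff + 1])             # remaining blocks are skipped

retained = select_kv_blocks(np.array([2.0, -1.0, 0.5, 3.0]))  # -> [0, 3]
```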
Upon reaching the tail of the prompt, the KV cache corresponding to the entire prompt becomes visible again, reverting to dense attention mode (`t = 5`). This transition ensures that the model attends to the complete prompt context before entering the generation stage. Similar to the tri-shape algorithm, the final dense portion of the prefill can be configured using the `SparseAttentionConfig.num_last_dense_tokens_in_prefill` field. Due to the block-wise cache organization and scheduler chunking, the actual number of prompt tokens processed with dense attention may slightly exceed the specified value, potentially extending across a full block or subsequence chunk depending on the hardware configuration.
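Purely for illustration, configuring that dense tail might look as follows. The field names come from the text above, while the keyword-argument construction and the value 256 are assumptions that may differ between OV GenAI releases:

```python
from openvino_genai import SparseAttentionConfig, SparseAttentionMode

# Constructor keyword arguments are assumed here; attribute assignment on a
# default-constructed config is an equivalent alternative.
sparse_cfg = SparseAttentionConfig(
    mode=SparseAttentionMode.XATTENTION,
    xattention_threshold=0.9,
    # Process (at least) the last 256 prompt tokens with dense attention;
    # the actual dense span may round up to a full KV block or scheduler
    # chunk, as noted above.
    num_last_dense_tokens_in_prefill=256,
)
```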
Shall we add a link here to documentation on how to switch on XAttention in OpenVINO?
I haven’t found any related OpenVINO documentation. As far as I understand, XAttention is enabled via OV GenAI, without any need to switch it on separately in the Runtime. Example of the scheduler config:
```python
from openvino_genai import SchedulerConfig, SparseAttentionConfig, SparseAttentionMode

cb_config = SchedulerConfig(
    use_sparse_attention=True,
    sparse_attention_config=SparseAttentionConfig(
        mode=SparseAttentionMode.XATTENTION,
        xattention_threshold=0.9,
    ),
)
```
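For completeness, a hedged sketch of passing that config to a pipeline, assuming the usual OV GenAI `LLMPipeline` entry point (`model_dir` and the prompt are placeholders):

```python
import openvino_genai as ov_genai

# "model_dir" is a placeholder for a directory with a model exported for OV GenAI.
pipe = ov_genai.LLMPipeline("model_dir", "CPU", scheduler_config=cb_config)
print(pipe.generate("A long prompt ...", max_new_tokens=64))
```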
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's great. If we have OV GenAI documentation on it, let's refer to it. Or just mention that this mode value exists and should be specified via SchedulerConfig.
Pull Request Overview
Copilot reviewed 1 out of 3 changed files in this pull request and generated no new comments.
@l-bat, please also update the list of optimization methods in the README.
Pull Request Overview
Copilot reviewed 2 out of 4 changed files in this pull request and generated no new comments.
Co-authored-by: Roman Kazantsev <[email protected]>
Description
CVS-173857