Add XAttention documentation #3010
base: master
Conversation
Pull Request Overview
This PR adds comprehensive documentation for the XAttention sparse attention algorithm, expanding the previous "TBA" placeholder section with detailed implementation descriptions and visual illustrations.
Key changes:
- Detailed explanation of XAttention's two-stage importance estimation procedure (see the sketch after this list)
- Addition of visual diagram illustrating the algorithm's operation
- Configuration parameter references for customizing XAttention behavior
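As a hedged sketch of the first stage of that procedure, block-level importance might be estimated roughly as below. The block sizes, the stride of 4, and the function name are illustrative assumptions in the spirit of XAttention's antidiagonal scoring, not the OpenVINO implementation:

```python
import numpy as np

def antidiagonal_block_score(q_block, k_block, stride=4):
    """Cheap importance proxy for one (query-block, key-block) pair:
    sum a strided subset of antidiagonals of the block's QK^T slice
    instead of evaluating the full attention for the block."""
    s = q_block @ k_block.T                  # (block, block) slice of QK^T
    b = s.shape[0]
    flipped = np.fliplr(s)                   # antidiagonals become diagonals
    return sum(np.trace(flipped, offset=o) for o in range(-b + 1, b, stride))

# Toy example: score four KV blocks of 128 tokens against one query block.
rng = np.random.default_rng(0)
q = rng.standard_normal((128, 64))
k = rng.standard_normal((512, 64))
scores = np.array([antidiagonal_block_score(q, k[i:i + 128])
                   for i in range(0, 512, 128)])
```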
@peterchen-intel @ceciliapeng2011 @WeldonWangwang could you please help with reviewing the XAttention documentation?
The prompt processing occurs as usual until at least two KV cache blocks have been completely filled (`t = 0, 1`). Once the block-level importance scores have been computed (`t = 2-4`), only the subset of KV blocks whose cumulative attention mass exceeds the `xattention_threshold` is retained, effectively introducing sparsity into the attention computation.
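As a hedged illustration of that retention rule, here is a minimal NumPy sketch that keeps the smallest set of KV blocks whose cumulative attention mass reaches the threshold. The softmax normalization and all names are assumptions for illustration, not the OpenVINO implementation:

```python
import numpy as np

def select_kv_blocks(block_scores, xattention_threshold=0.9):
    """Indices of KV blocks retained for one query block: take the most
    important blocks first, stopping once their cumulative attention
    mass reaches the threshold."""
    probs = np.exp(block_scores - block_scores.max())
    probs /= probs.sum()                           # softmax over block scores
    order = np.argsort(probs)[::-1]                # most important first
    cutoff = np.searchsorted(np.cumsum(probs[order]), xattention_threshold)
    return np.sort(order[:cutoff + 1])             # remaining blocks are skipped

retained = select_kv_blocks(np.array([2.0, -1.0, 0.5, 3.0]))  # -> [0, 3]
```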
Upon reaching the tail of the prompt, the KV cache corresponding to the entire prompt becomes visible again, reverting to dense attention mode (`t = 5`). This transition ensures that the model attends to the complete prompt context before entering the generation stage. Similar to the tri-shape algorithm, the final dense portion of the prefill can be configured using the `SparseAttentionConfig.num_last_dense_tokens_in_prefill` field. Due to the block-wise cache organization and scheduler chunking, the actual number of prompt tokens processed with dense attention may slightly exceed the specified value, potentially extending across a full block or subsequence chunk depending on the hardware configuration.
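Purely for illustration, configuring that dense tail might look as follows. The field names come from the text above, while the keyword-argument construction and the value 256 are assumptions that may differ between OV GenAI releases:

```python
from openvino_genai import SparseAttentionConfig, SparseAttentionMode

# Constructor keyword arguments are assumed here; attribute assignment on a
# default-constructed config is an equivalent alternative.
sparse_cfg = SparseAttentionConfig(
    mode=SparseAttentionMode.XATTENTION,
    xattention_threshold=0.9,
    # Process (at least) the last 256 prompt tokens with dense attention;
    # the actual dense span may round up to a full KV block or scheduler
    # chunk, as noted above.
    num_last_dense_tokens_in_prefill=256,
)
```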
Shall we add a link here to documentation on how to switch on XAttention in OpenVINO?
I haven’t found any related OpenVINO documentation. As far as I understand, XAttention is enabled via OV GenAI, without any need to switch it on separately in the Runtime. Example of the scheduler config:
```python
from openvino_genai import SchedulerConfig, SparseAttentionConfig, SparseAttentionMode

cb_config = SchedulerConfig(
    use_sparse_attention=True,
    sparse_attention_config=SparseAttentionConfig(
        mode=SparseAttentionMode.XATTENTION,
        xattention_threshold=0.9,
    ),
)
```
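For completeness, a hedged sketch of passing that config to a pipeline, assuming the usual OV GenAI `LLMPipeline` entry point (`model_dir` and the prompt are placeholders):

```python
import openvino_genai as ov_genai

# "model_dir" is a placeholder for a directory with a model exported for OV GenAI.
pipe = ov_genai.LLMPipeline("model_dir", "CPU", scheduler_config=cb_config)
print(pipe.generate("A long prompt ...", max_new_tokens=64))
```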
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's great. If we have OV GenAI documentation on it, let's refer to it. Or just mention that this mode value exists and should be specified via SchedulerConfig.
Pull Request Overview
Copilot reviewed 1 out of 3 changed files in this pull request and generated no new comments.
@l-bat, please also update the list of optimization methods in the README.
Pull Request Overview
Copilot reviewed 2 out of 4 changed files in this pull request and generated no new comments.
Co-authored-by: Roman Kazantsev <[email protected]>
Description
CVS-173857