
Conversation

@ceciliapeng2011 (Contributor) commented Sep 12, 2025

Details:

  • XAttention for FP16 KVCache as a preview feature
  • to add unit tests
  • to disable XAttention for legacy platforms (XAttention kernels are implemented for Xe2/Xe3 with CM)
  • to streamline the XAttention detection process; the kvcache shape is currently used to determine it, and there may be a better approach.
  • to add warning messages for unsupported cases: multiple subsequences, mistyped kvcache precision, etc.
  • to remove the trivial Convert nodes between the xattention_threshold Parameter and the PageAttention input.
  • to refactor the xattention kernel implementations to reuse RT parameters instead of recomputing them.
  • to enable path of U8 KVCache (stretch goal)
  • WWB with long prompts

This PR is intended to work together with openvinotoolkit/openvino.genai#2764.

Tickets:

@github-actions bot added the category: GPU (OpenVINO GPU plugin) label on Sep 12, 2025
@peterchen-intel added the pr: needs tests (PR needs tests updating) label on Sep 18, 2025
@github-actions bot added the category: transformations (OpenVINO Runtime library - Transformations) label on Sep 23, 2025
if (past_len != 0) {
-    int blocks_num = ceil_div(past_len, block_size);
+    int blocks_num = ceil_div(past_len + 1, block_size);
     int start_block_idx = block_indices[block_indices_begins[i]];
ceciliapeng2011 (Contributor, Author):

@WeldonWangwang May I know why we need the +1 here?

Contributor:

This takes into account the block where the current token is located.
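
For illustration, here is a minimal standalone sketch of the off-by-one (the values block_size = 16 and past_len = 16 are hypothetical, and ceil_div simply mirrors the helper used in the snippet above):

#include <cstdio>

// Integer ceiling division, mirroring the helper in the snippet above.
static int ceil_div(int a, int b) { return (a + b - 1) / b; }

int main() {
    const int block_size = 16;
    const int past_len = 16;  // 16 past tokens exactly fill one block (indices 0..15)

    // Without +1: counts only the blocks holding past tokens.
    printf("%d\n", ceil_div(past_len, block_size));      // prints 1
    // With +1: also counts the block the current token occupies,
    // since token 16 starts a fresh block.
    printf("%d\n", ceil_div(past_len + 1, block_size));  // prints 2
    return 0;
}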

* throw exception if k_head_size != v_head_size and has_xattn

* Add more test cases
@yeonbok (Contributor) commented Oct 21, 2025

ie_tests_win_gpu_vs2022_release: this seems like a real crash. Please check.

bool use_xattention = false;
const auto& parameters = func->get_parameters();
for (const auto& param : parameters) {
    if (param->get_friendly_name() == "xattention_block_size") {
        use_xattention = true;
        break;
    }
}
@yeonbok (Contributor), Oct 21, 2025:

Can't the user turn it off via config? Currently XAttention is not fully supported (e.g., only num_seqs = 1 and by_token compression are supported, CM is unavailable, etc.).

@ceciliapeng2011 (Contributor, Author), Oct 22, 2025:

XAttention is disabled by default. Its activation is controlled via the GenAI scheduler configuration using the --cb_config parameter.
When enabled through the scheduler config, GenAI creates model Parameter nodes with names starting with "xattention_". Otherwise, these nodes are created as empty Constant nodes. For more details, refer to the OpenVINO implementation.
To learn how to enable or disable XAttention, please see the command line reference.

Here, the GPU plugin follows the same logic to determine whether use_xattention was enabled by the user.

The GPU plugin performs additional checks to ensure XAttention is supported. It throws an exception if any of the following conditions is met (a minimal sketch follows the list):

  • num_seqs > 1
  • channel-level kvcache compression is used
  • CM is unavailable
  • Not Xe2~Xe3 GPUs
  • Other unsupported configurations
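
For illustration, a minimal sketch of what such validation could look like (every name below is hypothetical; the actual checks live in the GPU plugin and use its own types and device queries):

#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper mirroring the checks described above.
void validate_xattention_support(std::size_t num_seqs,
                                 bool channel_wise_kv_compression,
                                 bool cm_available,
                                 const std::string& gpu_arch) {
    if (num_seqs > 1)
        throw std::runtime_error("XAttention supports only a single subsequence");
    if (channel_wise_kv_compression)
        throw std::runtime_error("XAttention requires by_token kvcache compression");
    if (!cm_available)
        throw std::runtime_error("XAttention kernels require CM support");
    if (gpu_arch != "Xe2" && gpu_arch != "Xe3")
        throw std::runtime_error("XAttention kernels are implemented for Xe2/Xe3 only");
}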

    kv_stop = (wg_id + 1) * wg_seq_len + past_q_lens;
    if (kv_stop > kv_seq_len) kv_stop = kv_seq_len;
}
// printf("###########wg:%d.%d q: %d, +%d kv: %d, +%d, kvstop:%d\n", wg_id, wg_local_id, q_start_sg, q_len_sg, kv_start, kv_seq_len, kv_stop);
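
For reference, a standalone sketch of the clamping in the quoted lines (the values of wg_seq_len, past_q_lens, and kv_seq_len are hypothetical):

#include <algorithm>
#include <cstdio>

int main() {
    const int wg_seq_len = 128;  // KV tokens assigned per workgroup (hypothetical)
    const int past_q_lens = 32;  // offset from already-processed queries (hypothetical)
    const int kv_seq_len = 300;  // total KV length (hypothetical)

    for (int wg_id = 0; wg_id < 3; ++wg_id) {
        // Each workgroup's KV range ends one wg_seq_len further along,
        // clamped to the real sequence length.
        int kv_stop = std::min((wg_id + 1) * wg_seq_len + past_q_lens, kv_seq_len);
        printf("wg %d: kv_stop = %d\n", wg_id, kv_stop);  // 160, 288, 300
    }
    return 0;
}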
Contributor:

(random point)
Please clean up the CM code so it doesn't contain unused comments.

ceciliapeng2011 (Contributor, Author):

all done

@praasz (Contributor) left a comment:

OK for the common optimization part.


Labels: category: GPU (OpenVINO GPU plugin), category: transformations (OpenVINO Runtime library - Transformations), Code Freeze
