[GPU] XAttention as a preview feature #32064

ceciliapeng2011 · 2025-09-12T08:04:49Z

Details:

XAttention for FP16 KVCache as a preview feature
to add unit tests
to disable XAttention for legacy platforms (XAttention kernels are implemented for Xe2/Xe3 with CM)
to streamline how XAttention is enabled. Currently the kvcache shape is used to determine it; there may be a better approach.
to add warning messages for unsupported cases: multiple subsequences, a mistyped kvcache precision, etc.
to remove the trivial converter nodes between the xattention_threshold Parameter and the PagedAttention input.
to refactor xattention kernel impls by reusing RT parameters, instead of recomputing them.
to enable path of U8 KVCache (stretch goal)
WWB with long prompts

This PR should work along with openvinotoolkit/openvino.genai#2764.

Tickets:

CVS-173857

src/plugins/intel_gpu/tests/unit/test_cases/xattention_gpu_test.cpp

ceciliapeng2011 · 2025-10-21T08:20:16Z

src/plugins/intel_gpu/tests/unit/test_cases/paged_attention_gpu_test.cpp

            if (past_len != 0) {
-                int blocks_num = ceil_div(past_len, block_size);
+                int blocks_num = ceil_div(past_len + 1, block_size);
                int start_block_idx = block_indices[block_indices_begins[i]];


@WeldonWangwang May I know why we need +1 here?

The +1 accounts for the block in which the current token is located.
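To illustrate the reply above, here is a minimal standalone sketch of the block-count arithmetic. The `ceil_div` below is a local stand-in written for this example (the test uses its own helper of the same name), and `blocks_for_decode` is a hypothetical name that does not appear in the PR.

```cpp
#include <cassert>

// Local stand-in: integer ceiling division for non-negative a and positive b.
constexpr int ceil_div(int a, int b) {
    return (a + b - 1) / b;
}

// Hypothetical helper: blocks needed to hold past_len cached tokens
// plus the single token that is currently being processed.
constexpr int blocks_for_decode(int past_len, int block_size) {
    return ceil_div(past_len + 1, block_size);
}

// With 16 past tokens and block_size 16, the past tokens exactly fill one block...
static_assert(ceil_div(16, 16) == 1, "past tokens fill one block");
// ...so the current token lands in a second block, which the +1 accounts for.
static_assert(blocks_for_decode(16, 16) == 2, "current token needs a second block");
// When the last block still has room, the +1 does not change the count.
static_assert(blocks_for_decode(15, 16) == 1, "current token fits in the first block");
```

The difference only shows up when `past_len` is an exact multiple of `block_size`; in every other case both expressions give the same block count.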

yeonbok · 2025-10-21T18:03:48Z

ie_tests_win_gpu_vs2022_release: this looks like a real crash. Please check.

src/plugins/intel_gpu/src/graph/paged_attention.cpp

yeonbok · 2025-10-21T18:10:27Z

src/plugins/intel_gpu/src/plugin/transformations_pipeline.cpp

+            bool use_xattention = false;
+            const auto& parameters = func->get_parameters();
+            for (const auto& param : parameters) {
+                if (param->get_friendly_name() == "xattention_block_size") {


Can't the user turn this off via config? XAttention is currently not fully supported (e.g., it only supports num_seqs = 1 and by-token compression, it does not work when CM is unavailable, etc.).

XAttention is disabled by default. Its activation is controlled via the GenAI scheduler configuration using the --cb_config parameter.
When enabled through the scheduler config, GenAI creates model Parameter nodes with names starting with "xattention_". Otherwise, these nodes are created as empty Constant nodes. For more details, refer to the OpenVINO implementation.
To learn how to enable or disable XAttention, please see the command line reference.

Here, the GPU plugin follows the same logic to determine whether the user has enabled XAttention (use_xattention).
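The name-based detection described above can be sketched in isolation as follows. This is not the plugin code: `FakeParameter` and `detect_xattention` are hypothetical stand-ins that model only the friendly name, so the example compiles without OpenVINO. The name compared against, "xattention_block_size", is the one used in the diff excerpt above.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a model Parameter node; only the friendly name is modeled.
struct FakeParameter {
    std::string friendly_name;
};

// Returns true if any parameter carries the XAttention block-size name,
// mirroring the loop over the model parameters in the diff excerpt.
bool detect_xattention(const std::vector<FakeParameter>& parameters) {
    return std::any_of(parameters.begin(), parameters.end(), [](const FakeParameter& p) {
        return p.friendly_name == "xattention_block_size";
    });
}
```

If the scheduler config leaves XAttention off, no Parameter with that name exists (the inputs are empty Constant nodes instead), so a check of this kind returns false without any extra plugin-side option.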

The GPU plugin performs additional checks to ensure XAttention is supported. It throws an exception if any of the following conditions are met:

num_seqs > 1

channel-level kvcache compression is used

CM is unavailable

The GPU is not Xe2 or Xe3

Other unsupported configurations
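The checks listed above could be organized roughly as in the following sketch. Everything here is an assumption made for illustration: the `XAttnContext` struct, its field names, the function names, and the error messages are hypothetical and are not taken from the GPU plugin. The open-ended "other unsupported configurations" item is not modeled.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical summary of the facts the checks depend on.
struct XAttnContext {
    int num_seqs = 1;                           // number of sequences in the batch
    bool channel_level_kv_compression = false;  // true if the KV cache is compressed per channel
    bool cm_available = true;                   // true if CM kernels can be built and run
    bool is_xe2_or_xe3 = true;                  // true for Xe2/Xe3 devices
};

// Throws std::runtime_error for the first unsupported condition found.
void validate_xattention(const XAttnContext& ctx) {
    if (ctx.num_seqs > 1)
        throw std::runtime_error("XAttention: multiple sequences are not supported");
    if (ctx.channel_level_kv_compression)
        throw std::runtime_error("XAttention: channel-level kvcache compression is not supported");
    if (!ctx.cm_available)
        throw std::runtime_error("XAttention: CM is unavailable");
    if (!ctx.is_xe2_or_xe3)
        throw std::runtime_error("XAttention: only Xe2/Xe3 GPUs are supported");
}

// Convenience wrapper that reports support as a boolean instead of throwing.
bool xattention_supported(const XAttnContext& ctx) {
    try {
        validate_xattention(ctx);
        return true;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

Throwing rather than silently falling back matches the behavior described in the reply: the user asked for XAttention explicitly through the scheduler config, so an unsupported setup is reported as an error.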

src/plugins/intel_gpu/src/graph/paged_attention.cpp

src/plugins/intel_gpu/src/graph/impls/cm/xattn_post_proc.cm

yeonbok · 2025-10-21T18:12:41Z

src/plugins/intel_gpu/src/graph/impls/cm/pa_multi_token.cm

+        kv_stop = (wg_id + 1) * wg_seq_len + past_q_lens;
+        if (kv_stop > kv_seq_len) kv_stop = kv_seq_len;
+    }
+    // printf("###########wg:%d.%d  q: %d, +%d   kv: %d, +%d, kvstop:%d\n", wg_id, wg_local_id, q_start_sg, q_len_sg, kv_start, kv_seq_len, kv_stop);


(random point)
Please clean up the CM code so that it has no leftover unused comments.

...common/transformations/src/transformations/common_optimizations/convert_pagedattn_inputs.cpp

praasz

OK, for common optimization part.

github-actions bot added the category: GPU OpenVINO GPU plugin label Sep 12, 2025

peterchen-intel added the pr: needs tests PR needs tests updating label Sep 18, 2025

ceciliapeng2011 force-pushed the cecilia/pa_cm_xattention branch from 623f524 to 50a8290 September 23, 2025 01:46

github-actions bot added the category: transformations OpenVINO Runtime library - Transformations label Sep 23, 2025

rnwang04 mentioned this pull request Sep 24, 2025

update gpu block size based on xattn openvinotoolkit/openvino.genai#2764

Open

riverlijunjie and others added 25 commits September 26, 2025 14:51

Init PA CM Impl(1st/2nd token and kvcache update)

e030c80

enabled simple pa unit tests pass

435a7ac

Fix 2nd_token issue

8947906

Fixed pipeline output corruption issue

83dba29

1. kvcache update's k/v offset issue 2. 2nd token lse data overflow issue

Fix 2nd non-16 alignment accuracy issue

2743aab

Set best partition size for 2nd

65b9cc7

update KV_BLOCK_SIZE to 256

c4a1659

initiate xattention integration

62a222f

qwen2.5-1.5b 4k trunk works with xatten.

ac882ab

4k aligned works.

0621e4b

fix block_mask not fully initialized issue.

98a4ecd

fix of find_block

5af3330

xatten: fix accuacy problem caused by debug

4f9ed28

use int32 to store float INV_S to align python version accuracy

d35f4fb

OV_GPU_XATTN_BLOCK_SIZE and OV_GPU_XATTN_THRESH

4e25a4a

fix building error on windows.

c3c87b7

process tail in find_block

76685f0

Fix f16 accuracy issue and optimize 2nd token to improve 5%

c5bdcf9

fix waring_as_error on CI Windows.

95a2da1

dump block mask with DUMP_XATTN_BLOCK_MASK for debug

36bee72

Support kv cache u8 precision

4fa97be

refactor: split into pa_common and sdpa_common, which include attenti…

55ba7c3

…on_common.

integrate xattn_post_proc kernel and FP16 kernel works. TODOto verify…

a06adef

… u8 kvcache.

update partition size

4b391be

enable int8 kvcache for xatten, but accuracy fails.

f2f2126

fix dump... intermediates tensor may empty.

b45062c

ceciliapeng2011 requested a review from yeonbok October 17, 2025 06:14

ceciliapeng2011 and others added 7 commits October 17, 2025 14:47

fix

50628c5

Ww/pa cm xattention 1019 (#61)

1073002

* Tests support num_kv_heads * Update test cases * Fix code style * Fix code style

Ww/pa cm xattention 1020 (#62)

5eff824

* Fix code style * Clean code

Merge branch 'master' into cecilia/pa_cm_xattention

d164bba

PagedAttentionInternBuffIdx

853b562

refactor xattention kernel impls by reusing RT parameters, instead of…

0870cbb

… recomputing them.

fix clang-format style issues

c2bde5b

ceciliapeng2011 commented Oct 20, 2025

src/plugins/intel_gpu/tests/unit/test_cases/xattention_gpu_test.cpp

peterchen-intel added the Code Freeze label Oct 21, 2025

WeldonWangwang added 2 commits October 21, 2025 14:45

merge xattention tests into paged_attention tests (#63)

554ebf4

Fix build error (#64)

e794f5b

ceciliapeng2011 commented Oct 21, 2025

WeldonWangwang added 2 commits October 21, 2025 20:51

Ww/cm xattention (#65)

5ff7d32

* throw exception if k_head_size != v_head_size and has_xattn * Add more test cases

Remove debug messages (#66)

26c4f2f