Commits (101 total; changes shown from 93 commits)
e030c80
Init PA CM Impl (1st/2nd token and kvcache update)
riverlijunjie Aug 29, 2025
435a7ac
enable simple pa unit tests to pass
riverlijunjie Aug 31, 2025
8947906
Fix 2nd_token issue
riverlijunjie Aug 31, 2025
83dba29
Fixed pipeline output corruption issue
riverlijunjie Sep 2, 2025
2743aab
Fix 2nd non-16 alignment accuracy issue
riverlijunjie Sep 2, 2025
65b9cc7
Set best partition size for 2nd
riverlijunjie Sep 2, 2025
c4a1659
update KV_BLOCK_SIZE to 256
ceciliapeng2011 Sep 3, 2025
62a222f
initiate xattention integration
ceciliapeng2011 Sep 3, 2025
ac882ab
qwen2.5-1.5b 4k trunk works with xatten.
ceciliapeng2011 Sep 5, 2025
0621e4b
4k aligned works.
ceciliapeng2011 Sep 5, 2025
98a4ecd
fix block_mask not fully initialized issue.
ceciliapeng2011 Sep 5, 2025
5af3330
fix of find_block
ceciliapeng2011 Sep 8, 2025
4f9ed28
xatten: fix accuracy problem caused by debug
ceciliapeng2011 Sep 9, 2025
d35f4fb
use int32 to store float INV_S to align python version accuracy
luo-cheng2021 Sep 10, 2025
4e25a4a
OV_GPU_XATTN_BLOCK_SIZE and OV_GPU_XATTN_THRESH
ceciliapeng2011 Sep 10, 2025
c3c87b7
fix building error on windows.
usstq Sep 10, 2025
76685f0
process tail in find_block
ceciliapeng2011 Sep 12, 2025
c5bdcf9
Fix f16 accuracy issue and optimize 2nd token to improve by 5%
riverlijunjie Sep 9, 2025
95a2da1
fix warning_as_error on CI Windows.
ceciliapeng2011 Sep 15, 2025
36bee72
dump block mask with DUMP_XATTN_BLOCK_MASK for debug
ceciliapeng2011 Sep 15, 2025
4fa97be
Support kv cache u8 precision
riverlijunjie Sep 14, 2025
55ba7c3
refactor: split into pa_common and sdpa_common, which include attenti…
ceciliapeng2011 Sep 22, 2025
a06adef
integrate xattn_post_proc kernel and FP16 kernel works. TODO to verify…
ceciliapeng2011 Sep 22, 2025
4b391be
update partition size
riverlijunjie Sep 14, 2025
f2f2126
enable int8 kvcache for xatten, but accuracy fails.
ceciliapeng2011 Sep 23, 2025
89c8577
fix xattn kvcache u8 accuracy issue.
ceciliapeng2011 Sep 24, 2025
024b71a
Fix 2nd accuracy issue
riverlijunjie Sep 24, 2025
033304f
Fix 2nd accuracy issue
ceciliapeng2011 Sep 29, 2025
a6e72d0
fix xattn tailing issue: Q_blocks < K_blocks, as K_blocks is aligned …
ceciliapeng2011 Sep 30, 2025
f7ddc68
decide pa block size based on whether xattention is used
rnwang04 Sep 23, 2025
29cdabb
fix block size logic
rnwang04 Sep 25, 2025
5048081
fix partition size
rnwang04 Sep 26, 2025
0c8c029
fix condition of xattn stages
rnwang04 Oct 9, 2025
6fbf07b
Add xAttention reference operation and test
WeldonWangwang Oct 9, 2025
13b1122
Optimize single_token_finalization kernel with fixed unroll
riverlijunjie Oct 10, 2025
24d6b80
Fix win build
peterchen-intel Oct 10, 2025
326fc4d
Fix win build
peterchen-intel Oct 10, 2025
508fab3
Fix win build
peterchen-intel Oct 10, 2025
73669d3
Enable CM PA only when XAttention is enabled.
ceciliapeng2011 Oct 11, 2025
45bedf3
pass xattention threshold from genai
ceciliapeng2011 Oct 11, 2025
b7a9a8b
xattention_block_size unconfigurable
ceciliapeng2011 Oct 11, 2025
703dca6
Merge branch 'cecilia/pa_cm_xattention_bridge' into cecilia/pa_cm_xat…
ceciliapeng2011 Oct 11, 2025
f9f58be
invalidate sparse atten process if threshold is larger than 1.0.
ceciliapeng2011 Oct 11, 2025
f7fa94f
Merge branch 'master' into cecilia/pa_cm_xattention
ceciliapeng2011 Oct 11, 2025
3afbdb5
cpplint error fixes
ceciliapeng2011 Oct 11, 2025
2c37d0d
Define ENABLE_PA_CM_PATH for build
peterchen-intel Oct 12, 2025
cae516a
Fix warning-as-error issues on Windows with VS2022
zhaixuejun1993 Oct 12, 2025
010b6e7
Merge pull request #56 from zhaixuejun1993/xuejun/fix-warning-as-error
ceciliapeng2011 Oct 13, 2025
808a789
[WA] clean unused kvcache buffer
riverlijunjie Oct 10, 2025
22f0459
Fix format issues
zhaixuejun1993 Oct 13, 2025
780f55a
disable XAttention for legacy platforms (XAttention kernels are imple…
ceciliapeng2011 Oct 13, 2025
d21c4f6
reset left V cache block rather than 16 rows
riverlijunjie Oct 13, 2025
6b9b4c2
Remove debug code
riverlijunjie Oct 13, 2025
eb9765e
revert code change to ocl_v2
ceciliapeng2011 Oct 13, 2025
1418daa
cleanup debug code
ceciliapeng2011 Oct 13, 2025
21c3193
Limit head_num/kv_head_num to not exceed 8
riverlijunjie Oct 13, 2025
8a7a380
streamline block_size and head_size for both fp16 and u8/i8 kvcache
ceciliapeng2011 Oct 13, 2025
472f774
Remove CM PA tests
zhaixuejun1993 Oct 13, 2025
1fdcd3c
refactor: use paged_attention::block_size_xattn instead of hardcode n…
ceciliapeng2011 Oct 13, 2025
a62fd1b
worksgit status git status
WeldonWangwang Oct 13, 2025
52aad92
Merge pull request #57 from ceciliapeng2011/river/pa_nan_debug
ceciliapeng2011 Oct 14, 2025
3da8a34
Fix the KV cache padding with Nan issue for 1st token.
luweizhou2016 Oct 14, 2025
2dd7a81
Fix nan issue for 2nd token
riverlijunjie Oct 14, 2025
bbf17ed
Clean code
WeldonWangwang Oct 14, 2025
147063f
Clean code
WeldonWangwang Oct 14, 2025
c02fb34
Clean code
WeldonWangwang Oct 14, 2025
314bd71
Add CMXAttentionBlockSelector
WeldonWangwang Oct 14, 2025
fdbba78
Clean code
WeldonWangwang Oct 14, 2025
2ade1e1
Clean code
WeldonWangwang Oct 14, 2025
f402a14
refactor: check single sequence condition
ceciliapeng2011 Oct 15, 2025
342ae59
Avoid 2nd token perf drop due to cleanup unused K cache
riverlijunjie Oct 15, 2025
8e8b74c
fix: if kvcache config is dynamic, which may occur with a typo error…
ceciliapeng2011 Oct 15, 2025
4a82167
Clean code
WeldonWangwang Oct 15, 2025
6f7dd8d
Clean code
WeldonWangwang Oct 15, 2025
326ee44
Add more test cases
WeldonWangwang Oct 15, 2025
cfa1f3a
Clean code
WeldonWangwang Oct 15, 2025
f795152
Merge pull request #55 from WeldonWangwang/wangwang/add_xattention_tests
WeldonWangwang Oct 15, 2025
35267d3
Fix build errors and code style (#59)
WeldonWangwang Oct 16, 2025
2dfbb19
Fix test cases and skip testing on unsupported platforms (#60)
WeldonWangwang Oct 16, 2025
cca1528
bypass xattn when thresh>=1.0 and q_len<STRIDE.
ceciliapeng2011 Oct 16, 2025
618e575
throw exception if xattn is not supported by either GPU architecture or…
ceciliapeng2011 Oct 16, 2025
b2afd6e
Merge branch 'master' into cecilia/pa_cm_xattention
WeldonWangwang Oct 17, 2025
522a503
add OV_GPU_DUMP_SRC_TENSORS_AFTER_EXEC
ceciliapeng2011 Oct 17, 2025
3e527be
code cleanup, unused code
ceciliapeng2011 Oct 17, 2025
1e243fc
throw exception for unsupported cases.
ceciliapeng2011 Oct 17, 2025
b45062c
fix dump... intermediates tensor may be empty.
ceciliapeng2011 Oct 17, 2025
50628c5
fix
ceciliapeng2011 Oct 17, 2025
1073002
Ww/pa cm xattention 1019 (#61)
WeldonWangwang Oct 19, 2025
5eff824
Ww/pa cm xattention 1020 (#62)
WeldonWangwang Oct 19, 2025
d164bba
Merge branch 'master' into cecilia/pa_cm_xattention
WeldonWangwang Oct 19, 2025
853b562
PagedAttentionInternBuffIdx
ceciliapeng2011 Oct 17, 2025
0870cbb
refactor xattention kernel impls by reusing RT parameters, instead of…
ceciliapeng2011 Oct 17, 2025
c2bde5b
fix clang-format style issues
ceciliapeng2011 Oct 20, 2025
554ebf4
merge xattention tests into paged_attention tests (#63)
WeldonWangwang Oct 21, 2025
e794f5b
Fix build error (#64)
WeldonWangwang Oct 21, 2025
5ff7d32
Ww/cm xattention (#65)
WeldonWangwang Oct 21, 2025
26c4f2f
Remove debug messages (#66)
WeldonWangwang Oct 21, 2025
1ec3dfd
fix the place to check kvcache precision
ceciliapeng2011 Oct 22, 2025
a6e4bbb
useless code cleanup.
ceciliapeng2011 Oct 22, 2025
bdf2e89
fix lint error
ceciliapeng2011 Oct 22, 2025
8ba831a
fix throw check
ceciliapeng2011 Oct 23, 2025
@@ -109,7 +109,7 @@ ov::pass::ConvertPagedAttnInputs::ConvertPagedAttnInputs(const KVCacheConfig& co
value_cache->set_element_type(value_cache_precision);
bool status = false;
if (pa_op->get_rt_info().count("num_k_heads") && pa_op->get_rt_info().count("k_head_size") &&
pa_op->get_rt_info().count("num_v_heads") && pa_op->get_rt_info().count("num_v_heads")) {
pa_op->get_rt_info().count("num_v_heads") && pa_op->get_rt_info().count("v_head_size")) {
const auto key_cache_shape = init_cache_shape(pa_op->get_rt_info()["num_k_heads"].as<size_t>(),
pa_op->get_rt_info()["k_head_size"].as<size_t>(),
m_config.keyCacheBlockSize,
@@ -38,6 +38,7 @@ struct paged_attention : public primitive_base<paged_attention> {
};

static constexpr size_t block_size = 16;
+ static constexpr size_t block_size_xattn = 256;

paged_attention() : primitive_base("", {}) {}

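The commits above ("decide pa block size based on whether xattention is used", "update KV_BLOCK_SIZE to 256") indicate that the paged-attention KV block granularity depends on whether the XAttention sparse path is active. Below is a minimal sketch of that selection using only the two constants declared in this hunk; the helper name is hypothetical, not code from this PR.

```cpp
#include <cstddef>

// Constants mirrored from cldnn::paged_attention in the hunk above.
constexpr std::size_t block_size = 16;         // default dense-path KV block (16 tokens)
constexpr std::size_t block_size_xattn = 256;  // KV block size when XAttention is enabled

// Hypothetical helper: pick the KV-cache block granularity for the
// PagedAttention implementation depending on whether XAttention is used.
constexpr std::size_t select_pa_block_size(bool xattention_enabled) {
    return xattention_enabled ? block_size_xattn : block_size;
}

static_assert(select_pa_block_size(false) == 16, "dense path keeps 16-token blocks");
static_assert(select_pa_block_size(true) == 256, "XAttention path uses 256-token blocks");
```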
@@ -176,6 +176,7 @@ static constexpr Property<bool, ov::PropertyMutability::RW> could_use_flashattn_
static constexpr Property<uint64_t, PropertyMutability::RW> dynamic_quantization_group_size_max{"GPU_DYNAMIC_QUANTIZATION_GROUP_SIZE_MAX"};
static constexpr Property<bool, ov::PropertyMutability::RW> validate_output_buffer{"GPU_VALIDATE_OUTPUT_BUFFER"};
static constexpr Property<float, ov::PropertyMutability::RW> mem_pool_util_threshold{"GPU_MEM_POOL_UTIL_THRESHOLD"};
+ static constexpr Property<bool, ov::PropertyMutability::RW> dump_src_after_exec{"GPU_DUMP_SRC_TENSORS_AFTER_EXEC"};
} // namespace ov::intel_gpu

namespace cldnn {
@@ -81,6 +81,7 @@ OV_CONFIG_DEBUG_OPTION(ov::intel_gpu, dump_layer_names, std::vector<std::string>
OV_CONFIG_DEBUG_OPTION(ov::intel_gpu, dump_memory_pool_path, "", "Save csv file with memory pool info to specified folder")
OV_CONFIG_DEBUG_OPTION(ov::intel_gpu, dump_memory_pool, false, "Enable verbose output for memory pool")
OV_CONFIG_DEBUG_OPTION(ov::intel_gpu, dump_iterations, std::set<int64_t>{}, "Space separated list of iterations where other dump options should be enabled")
+ OV_CONFIG_DEBUG_OPTION(ov::intel_gpu, dump_src_after_exec, false, "Enable source data dump after layer execution. Useful for capturing updated states in stateful models.")
OV_CONFIG_DEBUG_OPTION(ov::intel_gpu, host_time_profiling, 0, "Measure and print host time spent from the beginning of the infer until all host work is done and plugin is ready to block thread on the final clFinish() call")
OV_CONFIG_DEBUG_OPTION(ov::intel_gpu, disable_async_compilation, false, "Disable feature that allows to asynchronously prepare static-shaped implementations for the primitives with shape-agnostic kernels selected during compilation")
OV_CONFIG_DEBUG_OPTION(ov::intel_gpu, disable_runtime_buffer_fusing, false, "Disable runtime inplace optimizations for operations like concat and crop")
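Commit 522a503 above adds this option as OV_GPU_DUMP_SRC_TENSORS_AFTER_EXEC. A rough usage sketch follows, assuming intel_gpu debug options are picked up from OV_GPU_*-prefixed environment variables and that the dump folder is configured through a companion option (both assumptions, not verified against this PR).

```cpp
#include <cstdlib>
#include <openvino/openvino.hpp>

int main() {
    // Assumption: the GPU plugin reads debug options from the environment at load time,
    // so they must be set before the Core is created.
    setenv("OV_GPU_DUMP_SRC_TENSORS_AFTER_EXEC", "1", 1);      // re-dump inputs after each layer executes
    setenv("OV_GPU_DUMP_TENSORS_PATH", "/tmp/gpu_dumps/", 1);  // assumed companion option for the output folder

    ov::Core core;
    auto compiled = core.compile_model("model.xml", "GPU");
    auto request = compiled.create_infer_request();
    request.infer();  // updated source tensors (e.g. the PagedAttention KV cache) are written out after execution
    return 0;
}
```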
33 changes: 31 additions & 2 deletions src/plugins/intel_gpu/src/graph/debug_helper.cpp
@@ -250,7 +250,7 @@ void log_memory_to_file(memory::ptr mem, layout data_layout, stream& stream, std
dump<int8_t>(actual_mem, stream, file_stream, dump_raw);
else if (mem_dt == cldnn::data_types::u8)
dump<uint8_t>(actual_mem, stream, file_stream, dump_raw);
- else if (mem_dt == cldnn::data_types::u8)
+ else if (mem_dt == cldnn::data_types::boolean)
dump<uint8_t>(actual_mem, stream, file_stream, dump_raw);
else if (mem_dt == cldnn::data_types::i4 || mem_dt == cldnn::data_types::u4)
dump_i4u4(mem_dt, actual_mem, stream, file_stream, dump_raw);
@@ -536,7 +536,7 @@ NodeDebugHelper::~NodeDebugHelper() {
for (size_t i = 0; i < m_inst.get_intermediates_memories().size(); i++) {
std::string name = get_file_prefix() + "_intermediates_" + std::to_string(i);
auto output_mem = m_inst.get_intermediates_memories()[i];
- if (output_mem == nullptr) {
+ if (output_mem == nullptr || output_mem->size() == 0) {
GPU_DEBUG_COUT << " intermediates_mem is nullptr. Nothing to dump." << std::endl;
continue;
}
@@ -558,6 +558,35 @@
log_memory_to_file(output_mem, output_layout, m_stream, filename, dump_raw);
}
}

if (config.get_dump_src_after_exec()) {
for (size_t i = 0; i < m_inst.inputs_memory_count(); i++) {
std::string name = get_file_prefix() + "_updated_src_" + std::to_string(i);
auto output_mem = m_inst.input_memory_ptr(i);
if (output_mem == nullptr) {
GPU_DEBUG_COUT << " updated_input_mem is nullptr. Nothing to dump." << std::endl;
continue;
}

auto& output_layout = m_inst.get_input_layout(i);
if (config.get_dump_tensors_format() == ov::intel_gpu::DumpFormat::binary) {
// Binary dump : raw
auto filename = get_file_path_for_binary_dump(output_layout, name, config.get_dump_tensors_path());

mem_lock<char, mem_lock_type::read> lock(output_mem, m_stream);
ov::util::save_binary(filename, lock.data(), output_mem->size());
GPU_DEBUG_COUT << " Dump layer dst : " << layer_name << " to " << filename << std::endl;
debug_str_for_bin_load += (filename + ",");
} else {
const bool dump_raw = config.get_dump_tensors_format() == ov::intel_gpu::DumpFormat::text_raw;
GPU_DEBUG_COUT << " Dump " << (dump_raw ? "raw " : "") << name << std::endl;
auto filename = config.get_dump_tensors_path() + get_name_for_dump(name) + ".txt";
// Text dump
log_memory_to_file(output_mem, output_layout, m_stream, filename, dump_raw);
}
}
}

if (config.get_dump_tensors_format() == ov::intel_gpu::DumpFormat::binary && m_inst.is_input()) {
debug_str_for_bin_load[debug_str_for_bin_load.size()-1] = '\"';
GPU_DEBUG_COUT << debug_str_for_bin_load << std::endl;;