[GPU] sdpa_micro for prefix caching #31968
Conversation
if (config.is_paged_attention && data_type_traits::is_i8_u8(K.data_type)) {
Can't we use config.is_kv_compressed?
The config.is_kv_compressed flag is being used for the non-PA case. I'm not sure when it is used, but from the code I see that it requires separate scale and zp inputs when config.is_kv_compressed is set. So I didn't use config.is_kv_compressed for the PA case.
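For context, here is a minimal sketch of the distinction being discussed, assuming hypothetical types (SdpaConfig, TensorDesc and the helper is_i8_u8 below are stand-ins, not the actual kernel-selector code): the non-PA path signals compression via config.is_kv_compressed together with dedicated scale/zp inputs, while the PA path infers compression from the K cache element type.

```cpp
// Illustrative sketch only; struct layouts and names are hypothetical.
#include <iostream>

enum class DataType { f16, f32, i8, u8 };

struct TensorDesc {
    DataType data_type;
};

struct SdpaConfig {
    bool is_paged_attention = false;  // request comes from the paged-attention path
    bool is_kv_compressed = false;    // non-PA path: compressed KV with explicit scale/zp inputs
};

static bool is_i8_u8(DataType dt) {
    return dt == DataType::i8 || dt == DataType::u8;
}

// Decides whether the kernel must treat the KV cache as quantized.
// - Non-PA case: is_kv_compressed is set and separate scale/zp tensors are supplied.
// - PA case: compression is inferred from the K cache element type instead.
bool kv_needs_dequant(const SdpaConfig& config, const TensorDesc& K) {
    if (config.is_kv_compressed)
        return true;  // scale/zp arrive as dedicated kernel inputs
    if (config.is_paged_attention && is_i8_u8(K.data_type))
        return true;  // quantized paged KV cache, no separate scale/zp inputs assumed
    return false;
}

int main() {
    SdpaConfig pa_cfg;
    pa_cfg.is_paged_attention = true;
    TensorDesc k_cache{DataType::i8};
    std::cout << std::boolalpha << kv_needs_dequant(pa_cfg, k_cache) << "\n";  // true
    return 0;
}
```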
Except for a minor comment.
Details:
- Extend sdpa_micro to support paged attention for better performance.
- The mixed stage of paged attention will be handled by sdpa_micro instead of pa_sdpa_opt (see the sketch below).
- Extend sdpa_micro to support sliding window.

Tickets:
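As a rough illustration of the mixed-stage dispatch mentioned above (the enum and function names here are hypothetical stand-ins, not the actual GPU plugin API), the change can be thought of as routing the mixed stage to the micro-kernel SDPA when it is supported, with pa_sdpa_opt kept as the fallback.

```cpp
// Illustrative sketch only; names are hypothetical, not the plugin's real API.
#include <iostream>
#include <string>

// Hypothetical stage enum for paged-attention requests.
enum class PagedAttentionStage { GENERATE, PREFILL, MIXED };

// Per the PR description, the mixed stage (e.g. prefix caching) is the one that
// moves from pa_sdpa_opt to sdpa_micro; other stages are left out of this sketch.
std::string select_sdpa_kernel(PagedAttentionStage stage, bool micro_supported) {
    if (stage == PagedAttentionStage::MIXED && micro_supported)
        return "sdpa_micro";
    return "pa_sdpa_opt";
}

int main() {
    std::cout << select_sdpa_kernel(PagedAttentionStage::MIXED, /*micro_supported=*/true) << "\n";
    return 0;
}
```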