
Conversation

quic-vjanfaza commented Jun 19, 2025

The Context-Length-Specialization technique optimizes the throughput of large language models (LLMs) on Qualcomm devices when handling very large context lengths. The current Ahead-Of-Time (AOT) compilation on Qualcomm devices cannot predict how many tokens will actually be needed, so attention is always computed over the full compiled context length, causing significant throughput drops during both the prefill and decode phases. To address this, we introduce Compute Context Length (CCL), an additional ONNX variable that enables dynamic context-length specialization. By generating tokens with smaller, more manageable compute context lengths, we reduce memory reads and attention computation, thereby improving throughput.
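A minimal sketch of the bucket-selection idea, using hypothetical names (pick_ccl, ccl_buckets) that are not from this PR: during decode, attention only needs to cover the smallest compiled CCL bucket that contains the current position, rather than the full context length.

from typing import List

def pick_ccl(position: int, ccl_buckets: List[int], ctx_len: int) -> int:
    # Smallest compute-context-length bucket covering the current
    # position; fall back to the full context length.
    for ccl in sorted(ccl_buckets):
        if position < ccl:
            return ccl
    return ctx_len

# With buckets [512, 2048, 8192] and ctx_len=32768, a decode step at
# position 1000 attends over at most 2048 cached entries instead of 32768.
assert pick_ccl(1000, [512, 2048, 8192], 32768) == 2048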

Contributor:

How much time do the tests take?
We can choose to test only one model per KV-cache type, i.e. chunked, hybrid, sliding-window, etc.:

chunked -> global + local -> llama4
hybrid -> sliding window + global -> gemma3
sliding window -> mistral

For the above categories, we need to handle CCL differently. For local or sliding-window layers, the complete CCL probably won't apply once it goes beyond the sliding-window length, while the full CCL applies for global layers.

This support needs to be added.
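A hedged sketch of the capping rule described above (effective_ccl is a hypothetical helper, not code from this PR): local/sliding-window layers can never use more context than the window, while global layers take the full CCL.

from typing import Optional

def effective_ccl(ccl: int, sliding_window: Optional[int]) -> int:
    # Global layers (sliding_window=None) use the full CCL; for
    # local/sliding-window layers the window length is the ceiling.
    if sliding_window is None:
        return ccl
    return min(ccl, sliding_window)

# Hybrid model (e.g. gemma3-style): window=4096, CCL bucket=8192
assert effective_ccl(8192, 4096) == 4096  # sliding-window layer
assert effective_ccl(8192, None) == 8192  # global layer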

full_batch_size: Optional[int] = None,
prompt_len: int = 32,
ctx_len: int = 128,
comp_ctx_lengths: Optional[List[int]] = None,
Contributor:

Add this to the function's docstring as well.
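For instance, the docstring entry could look something like this (wording illustrative, not from the PR):

    :comp_ctx_lengths (Optional[List[int]]): Compute context lengths (CCL
        buckets) used to specialize attention to smaller context windows
        during prefill and decode. If None (default), the full ctx_len is used.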

logger.warning("Updating low_cpu_mem_usage=False")

kv_offload = kwargs.pop("kv_offload", None)

Contributor:

I think comp_ctx_lengths should be handled as an explicit parameter in from_pretrained rather than inside kwargs. Then there will be no need to pop that variable from kwargs, and we can add a proper docstring here as well.
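A sketch of what that might look like (the class body is an illustrative stand-in; only comp_ctx_lengths comes from this PR):

from typing import List, Optional

class QEFFAutoModelForCausalLM:  # stand-in for the real class
    @classmethod
    def from_pretrained(
        cls,
        pretrained_model_name_or_path: str,
        comp_ctx_lengths: Optional[List[int]] = None,  # explicit, documented
        **kwargs,
    ):
        # :comp_ctx_lengths (Optional[List[int]]): CCL buckets for
        #     context-length specialization; None disables CCL.
        obj = cls()
        obj.comp_ctx_lengths = comp_ctx_lengths  # no kwargs.pop needed
        return obj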

Comment on lines +1422 to +1423
self.comp_ctx_lengths = kwargs.pop("comp_ctx_lengths", None)

Contributor:

Why? I don't think there is any need for this.

quic-rishinr (Contributor) commented Jun 25, 2025:

We can use kwargs.get instead of pop; we are planning to use these kwargs for creating the model hash.
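i.e., roughly (sketch of the suggested one-line change):

# pop removes the key, so it can no longer feed into the model hash:
self.comp_ctx_lengths = kwargs.pop("comp_ctx_lengths", None)
# get leaves kwargs intact for the planned hash computation:
self.comp_ctx_lengths = kwargs.get("comp_ctx_lengths", None)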

for i in range(1, len(self.comp_ctx_lengths)):
decode_spec = self.build_decode_specialization(
prefill_seq_len=prefill_seq_len,
ctx_len=ctx_len,
Contributor:

There is no need for the if/else condition; please handle this for loop inside build_decode_specialization.
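A hedged sketch of the requested refactor (the specialization dict keys and the return type are assumptions, not code from this PR):

from typing import Dict, List, Optional

def build_decode_specialization(
    prefill_seq_len: int,
    ctx_len: int,
    comp_ctx_lengths: Optional[List[int]] = None,
) -> List[Dict[str, int]]:
    # One decode specialization per CCL bucket when CCL is enabled,
    # otherwise a single full-context specialization.
    buckets = comp_ctx_lengths if comp_ctx_lengths else [ctx_len]
    return [
        {"batch_size": 1, "seq_len": 1, "ctx_len": ctx_len, "comp_ctx_lengths": ccl}
        for ccl in buckets
    ]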



@dataclass
class QEffBaseModelOutputWithPast(BaseModelOutputWithPast):
Contributor:

As this dataclass is common across all the modeling files, it's better to keep it in modeling_utils.py.
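i.e., roughly (the import path is an assumption):

# modeling_utils.py: define the shared dataclass once
from dataclasses import dataclass
from transformers.modeling_outputs import BaseModelOutputWithPast

@dataclass
class QEffBaseModelOutputWithPast(BaseModelOutputWithPast):
    ...  # fields as in this PR

# each modeling file then imports it instead of redefining it, e.g.:
# from QEfficient.transformers.modeling_utils import QEffBaseModelOutputWithPast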

quic-rishinr (Contributor) commented:
Closing this PR; the changes will be tracked as part of #576.
