
Conversation

quic-vjanfaza commented Jun 19, 2025

The Context-Length-Specialization technique optimizes the throughput of large language models (LLMs) on Qualcomm devices when handling very large context lengths. The current Ahead-Of-Time (AOT) compilation on Qualcomm devices cannot predict how many tokens will actually be needed, so attention is always computed over the full compiled context length, causing significant throughput drops during both the prefill and decode phases. To address this, we introduce Compute Context Length (CCL), an additional ONNX variable that enables dynamic context-length specialization. By generating tokens with smaller, more manageable compute context lengths, we reduce memory reads and attention computation, thereby improving throughput.
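A minimal sketch of the bucket-selection idea, using hypothetical names (pick_ccl, ccl_buckets) that are not from this PR: during decode, attention only needs to cover the smallest compiled CCL bucket that contains the current position, rather than the full context length.

from typing import List

def pick_ccl(position: int, ccl_buckets: List[int], ctx_len: int) -> int:
    # Smallest compute-context-length bucket covering the current
    # position; fall back to the full context length.
    for ccl in sorted(ccl_buckets):
        if position < ccl:
            return ccl
    return ctx_len

# With buckets [512, 2048, 8192] and ctx_len=32768, a decode step at
# position 1000 attends over at most 2048 cached entries instead of 32768.
assert pick_ccl(1000, [512, 2048, 8192], 32768) == 2048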

Contributor:

How much time do the tests take?
We can choose to test only one model per KV-cache type, i.e. chunked, hybrid, sliding-window, etc.:

chunked -> global + local -> llama4
hybrid -> sliding window + global -> gemma3
sliding window -> mistral

For the above categories, we need to handle CCL differently. For local or sliding-window layers, the complete CCL probably won't apply once it goes beyond the sliding-window length, while the full CCL applies for global layers.

This support needs to be added.
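A hedged sketch of the capping rule described above (effective_ccl is a hypothetical helper, not code from this PR): local/sliding-window layers can never use more context than the window, while global layers take the full CCL.

from typing import Optional

def effective_ccl(ccl: int, sliding_window: Optional[int]) -> int:
    # Global layers (sliding_window=None) use the full CCL; for
    # local/sliding-window layers the window length is the ceiling.
    if sliding_window is None:
        return ccl
    return min(ccl, sliding_window)

# Hybrid model (e.g. gemma3-style): window=4096, CCL bucket=8192
assert effective_ccl(8192, 4096) == 4096  # sliding-window layer
assert effective_ccl(8192, None) == 8192  # global layer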

full_batch_size: Optional[int] = None,
prompt_len: int = 32,
ctx_len: int = 128,
comp_ctx_lengths: Optional[List[int]] = None,
Contributor:

Add this to the function's docstring as well.
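For instance, the docstring entry could look something like this (wording illustrative, not from the PR):

    :comp_ctx_lengths (Optional[List[int]]): Compute context lengths (CCL
        buckets) used to specialize attention to smaller context windows
        during prefill and decode. If None (default), the full ctx_len is used.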

logger.warning("Updating low_cpu_mem_usage=False")

kv_offload = kwargs.pop("kv_offload", None)

Contributor:

I think comp_ctx_lengths should be handled as an explicit parameter in from_pretrained rather than inside kwargs. Then there will be no need to pop that variable from kwargs, and we can add a proper docstring here as well.
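A sketch of what that might look like (the class body is an illustrative stand-in; only comp_ctx_lengths comes from this PR):

from typing import List, Optional

class QEFFAutoModelForCausalLM:  # stand-in for the real class
    @classmethod
    def from_pretrained(
        cls,
        pretrained_model_name_or_path: str,
        comp_ctx_lengths: Optional[List[int]] = None,  # explicit, documented
        **kwargs,
    ):
        # :comp_ctx_lengths (Optional[List[int]]): CCL buckets for
        #     context-length specialization; None disables CCL.
        obj = cls()
        obj.comp_ctx_lengths = comp_ctx_lengths  # no kwargs.pop needed
        return obj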

Comment on lines +1422 to +1423
self.comp_ctx_lengths = kwargs.pop("comp_ctx_lengths", None)

Contributor:

Why? I don't think there is any need for this.

quic-rishinr (Contributor) commented Jun 25, 2025:

We can use kwargs.get instead of pop; we are planning to use these kwargs for creating the model hash.
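i.e., roughly (sketch of the suggested one-line change):

# pop removes the key, so it can no longer feed into the model hash:
self.comp_ctx_lengths = kwargs.pop("comp_ctx_lengths", None)
# get leaves kwargs intact for the planned hash computation:
self.comp_ctx_lengths = kwargs.get("comp_ctx_lengths", None)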

for i in range(1, len(self.comp_ctx_lengths)):
decode_spec = self.build_decode_specialization(
prefill_seq_len=prefill_seq_len,
ctx_len=ctx_len,
Contributor:

There is no need for the if/else condition; please handle this for loop inside build_decode_specialization.
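A hedged sketch of the requested refactor (the specialization dict keys and the return type are assumptions, not code from this PR):

from typing import Dict, List, Optional

def build_decode_specialization(
    prefill_seq_len: int,
    ctx_len: int,
    comp_ctx_lengths: Optional[List[int]] = None,
) -> List[Dict[str, int]]:
    # One decode specialization per CCL bucket when CCL is enabled,
    # otherwise a single full-context specialization.
    buckets = comp_ctx_lengths if comp_ctx_lengths else [ctx_len]
    return [
        {"batch_size": 1, "seq_len": 1, "ctx_len": ctx_len, "comp_ctx_lengths": ccl}
        for ccl in buckets
    ]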



@dataclass
class QEffBaseModelOutputWithPast(BaseModelOutputWithPast):
Contributor:

As this dataclass is common across all the modeling files, it's better to keep it in modeling_utils.py.
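i.e., roughly (the import path is an assumption):

# modeling_utils.py: define the shared dataclass once
from dataclasses import dataclass
from transformers.modeling_outputs import BaseModelOutputWithPast

@dataclass
class QEffBaseModelOutputWithPast(BaseModelOutputWithPast):
    ...  # fields as in this PR

# each modeling file then imports it instead of redefining it, e.g.:
# from QEfficient.transformers.modeling_utils import QEffBaseModelOutputWithPast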

quic-rishinr (Contributor) commented:
Closing this PR; the changes will be tracked as part of #576.
