Automatically find the least compression needed to fit a target token count.
When tokenBudget is set, the engine binary-searches recencyWindow to find the largest recency window that fits within the budget. This maximizes preserved recent context while still hitting the target.
1. Fast path: if total tokens <= budget, return immediately (no compression needed)
2. Set lo = minRecencyWindow (default 0), hi = messages.length - 1
3. Binary search (while lo < hi; see the sketch after this list):
   a. mid = ceil((lo + hi) / 2)
   b. Compress with recencyWindow = mid
   c. If the result fits the budget: lo = mid (try a larger window)
   d. If over budget: hi = mid - 1 (try a smaller window)
4. Final compress at recencyWindow = lo
5. If still over budget and forceConverge is enabled: run a hard-truncation pass
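A minimal sketch of that loop, using simplified stand-ins: Msg for the message shape, and compressOnce/countTokens for the engine's single-pass compression and token counting (none of these names are the library's public API):

```ts
type Msg = { role: string; content: string };

// Hypothetical internals, for illustration only:
declare function compressOnce(messages: Msg[], recencyWindow: number): Msg[];
declare function countTokens(messages: Msg[]): number;

function searchRecencyWindow(messages: Msg[], budget: number, minRecencyWindow = 0): number {
  let lo = minRecencyWindow;
  let hi = messages.length - 1;
  while (lo < hi) {
    // Ceiling midpoint, so lo = mid always makes progress.
    const mid = Math.ceil((lo + hi) / 2);
    if (countTokens(compressOnce(messages, mid)) <= budget) {
      lo = mid; // fits: try to preserve a larger recency window
    } else {
      hi = mid - 1; // over budget: shrink the window
    }
  }
  return lo; // largest window that fits, or the floor if none do
}
```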
The binary search runs compression at each iteration. When a summarizer is provided, each iteration calls the LLM, so budget + LLM is slower than budget alone (for example, 1,000 messages means roughly log2(1000) ≈ 10 LLM-backed iterations).
```ts
import { compress } from 'context-compression-engine';

const result = compress(messages, {
  tokenBudget: 4000,
  minRecencyWindow: 2,
});

result.fits;          // true if result fits within budget
result.tokenCount;    // token count (via tokenCounter)
result.recencyWindow; // the recencyWindow the binary search settled on
```

The built-in estimator:
```ts
function defaultTokenCounter(msg: Message): number {
  return Math.ceil(msg.content.length / 3.5);
}
```

The ~3.5 characters-per-token figure is derived from empirical measurements of GPT-family BPE tokenizers (cl100k_base, o200k_base) on mixed English text. We pick the lower end of the observed range so estimates are conservative: slightly over-counting tokens is safer than under-counting and blowing the budget. For example, a 700-character message estimates to ceil(700 / 3.5) = 200 tokens, where a real tokenizer averaging 4 chars/token would report about 175. It's fast and works for ballpark estimates, but real tokenizers vary:
| Tokenizer | Typical chars/token |
|---|---|
| GPT-4/4o | ~3.5-4.0 |
| Claude | ~3.5-4.0 |
| Llama 3 | ~3.0-3.5 |
For accurate budgeting, replace it.
The tokenCounter function is called for all budget decisions: binary search iterations, force-converge deltas, token_ratio stats, and the final tokenCount/fits fields.
```ts
import { compress } from 'context-compression-engine';
import { encode } from 'gpt-tokenizer';

const result = compress(messages, {
  tokenBudget: 4000,
  tokenCounter: (msg) => {
    const text = typeof msg.content === 'string' ? msg.content : '';
    return encode(text).length;
  },
});
```

```ts
import { compress } from 'context-compression-engine';
import { encoding_for_model } from 'tiktoken';
const enc = encoding_for_model('gpt-4o');

const result = compress(messages, {
  tokenBudget: 4000,
  tokenCounter: (msg) => {
    const text = typeof msg.content === 'string' ? msg.content : '';
    return enc.encode(text).length;
  },
});

enc.free(); // tiktoken uses WASM; free the encoding when done
```

minRecencyWindow sets a floor for recencyWindow during the binary search. It guarantees that at least N recent messages are always preserved, even under tight budgets.
```ts
const result = compress(messages, {
  tokenBudget: 2000,
  minRecencyWindow: 4, // always keep at least 4 recent messages
});
```

Default: 0 (no floor).
When the binary search bottoms out (reaches minRecencyWindow) and the result still exceeds the budget, forceConverge runs a hard-truncation pass (sketched after this list):

- Collect eligible messages: before the recency cutoff, not in preserveRoles, content > 512 chars
- Sort by content length descending (biggest savings first)
- Truncate each to 512 chars: `[truncated — {original_length} chars: {first 512 chars}]`
- Stop once the budget is satisfied
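A minimal sketch of that pass. isEligible and countTokens are hypothetical stand-ins, and the real engine also attaches _cce_original provenance and tracks token deltas instead of recounting; both are omitted here:

```ts
type Msg = { role: string; content: string };

const LIMIT = 512;

// Hypothetical stand-ins: eligible = before the recency cutoff, not in
// preserveRoles, and content longer than 512 chars.
declare function isEligible(msg: Msg): boolean;
declare function countTokens(messages: Msg[]): number;

function forceConvergePass(messages: Msg[], budget: number): Msg[] {
  // Shallow-copy so the caller's messages stay untouched.
  const out = messages.map((m) => ({ ...m }));
  const candidates = out
    .filter(isEligible)
    .sort((a, b) => b.content.length - a.content.length); // biggest savings first
  for (const msg of candidates) {
    if (countTokens(out) <= budget) break; // stop once the budget is satisfied
    msg.content = `[truncated — ${msg.content.length} chars: ${msg.content.slice(0, LIMIT)}]`;
  }
  return out;
}
```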
```ts
const result = compress(messages, {
  tokenBudget: 4000,
  forceConverge: true,
});

// result.fits is guaranteed true (unless only system/recency messages remain)
```

Truncated messages get _cce_original provenance metadata, so uncompress() restores the full content. Messages that were already compressed (have _cce_original) get their content replaced in-place without double-wrapping.
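For example (result.messages is an assumption here for where the compressed conversation lives; check the API reference for the exact field):

```ts
import { compress, uncompress } from 'context-compression-engine';

const result = compress(messages, { tokenBudget: 4000, forceConverge: true });

// Assumption: the compressed messages are exposed as result.messages.
const restored = uncompress(result.messages); // provenance restores full content
```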
forceConverge is built for cases where the budget is a hard ceiling:

- CI/CD pipelines where you need a hard guarantee that context fits
- Streaming applications where exceeding the context window is a crash
- Agentic loops where the budget must be respected each iteration
Without forceConverge, the result may exceed the budget when a conversation is dominated by system messages or short messages, since those are preserved.
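A common pattern is to check fits first and only pay for the truncation pass when a conversation actually fails to converge:

```ts
let result = compress(messages, { tokenBudget: 4000 });

if (!result.fits) {
  // Over budget even at minRecencyWindow: retry with the hard-truncation pass.
  result = compress(messages, { tokenBudget: 4000, forceConverge: true });
}
```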
Combining a token budget with an LLM summarizer:

```ts
const result = await compress(messages, {
  tokenBudget: 4000,
  summarizer: mySummarizer,
});
```

The binary search calls the LLM at each iteration, so cost and latency scale with the iteration count, roughly log2(messages.length). The LLM path still has the three-level fallback (LLM -> deterministic -> size guard) at each step.
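A sketch of what mySummarizer might look like; the real summarizer contract is defined in the LLM integration doc, so the (msgs) => Promise<string> shape and the callLLM helper are assumptions here:

```ts
// callLLM is a hypothetical wrapper around your provider's chat API.
declare function callLLM(prompt: string): Promise<string>;

// Assumed contract: condense a segment of messages into a summary string.
const mySummarizer = async (msgs: { role: string; content: string }[]): Promise<string> => {
  const transcript = msgs.map((m) => `${m.role}: ${m.content}`).join('\n');
  return callLLM(`Summarize this conversation segment, keeping key facts and decisions:\n\n${transcript}`);
};
```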
- Compression pipeline - overall pipeline flow
- LLM integration - setting up summarizers
- API reference - tokenBudget, minRecencyWindow, forceConverge, tokenCounter