
Conversation

windreamer
Collaborator

Motivation

LMDeploy’s TurboMind backend is the fastest inference stack in the ecosystem, yet it still lacks Guided Decoding – a feature that is already available in the PyTorch backend and heavily requested by the community.
This PR closes the gap by bringing token-level, C++-native Guided Decoding to TurboMind while keeping the API 100% compatible with the existing PyTorch backend.
The implementation is built on xGrammar (Apache-2.0), a high-performance C++ library that compiles JSON / Choice / Regex grammars into token FSMs and applies them with negligible overhead.

Modification

  1. Build-system

    • Add xgrammar as a header-only dependency via CMake FetchContent (CUDA & Python bindings disabled).
    • Export xgrammar::tokenizer_info and xgrammar::grammar_compiler symbols under lmdeploy::xgrammar.
  2. Core C++ changes

    • DynamicDecodeLayer pipeline extended with two new layers (a simplified sketch of both follows the Modification list):
      • GuidedDecodeMaskLayer: in setup() compiles / reuses the grammar and builds a per-request token bitmask; in forward() launches a light CUDA kernel to mask disallowed logits to -INF.
      • GuidedDecodeUpdateLayer: in forward() calls matcher->AcceptToken(output_id) to advance the FSM.
    • Grammar compiler cache (LRU, keyed by schema hash) shared across all sessions to avoid re-compilation (see the cache sketch after this list).
  3. Python frontend

    • Re-use existing guided_decoding utilities from PyTorch backend; no new API surface.
    • turbo.TurboMindEngine now accepts the same response_format= / guided_json= / guided_choice= arguments.
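
A simplified, self-contained sketch of how the two layers interact is given below. The Matcher type and its FillNextTokenBitmask(std::vector<uint32_t>&) signature are stand-ins for xgrammar's grammar matcher (whose real API works on DLTensor bitmasks), and the mask is applied on the host here, whereas the actual layer launches a small CUDA kernel.

// Sketch only: Matcher stands in for xgrammar's compiled grammar matcher, and
// masking happens on the host instead of inside a CUDA kernel.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <memory>
#include <vector>

struct Matcher {                // stand-in, not the real xgrammar interface
    std::vector<bool> allowed;  // token ids the FSM currently permits

    void FillNextTokenBitmask(std::vector<uint32_t>& bits) const
    {
        for (size_t t = 0; t < allowed.size(); ++t) {
            if (allowed[t]) {
                bits[t / 32] |= 1u << (t % 32);
            }
        }
    }

    void AcceptToken(int32_t /*token_id*/) { /* advance the FSM (omitted) */ }
};

using MatcherPtr = std::shared_ptr<Matcher>;

class GuidedDecodeMaskLayer {
public:
    explicit GuidedDecodeMaskLayer(size_t vocab_size): vocab_size_{vocab_size} {}

    // setup(): build one bitmask row per request; requests without a grammar
    // (null matcher) keep every token allowed.
    void setup(const std::vector<MatcherPtr>& matchers)
    {
        const size_t words = (vocab_size_ + 31) / 32;
        bitmask_.assign(matchers.size() * words, 0u);
        for (size_t i = 0; i < matchers.size(); ++i) {
            std::vector<uint32_t> row(words, matchers[i] ? 0u : ~0u);  // no grammar: allow all
            if (matchers[i]) {
                matchers[i]->FillNextTokenBitmask(row);
            }
            std::copy(row.begin(), row.end(), bitmask_.begin() + i * words);
        }
    }

    // forward(): set the logits of disallowed tokens to -INF
    // (the real layer launches a light CUDA kernel for this).
    void forward(std::vector<float>& logits /* [batch, vocab] */) const
    {
        const size_t words = (vocab_size_ + 31) / 32;
        const size_t batch = logits.size() / vocab_size_;
        for (size_t i = 0; i < batch; ++i) {
            for (size_t t = 0; t < vocab_size_; ++t) {
                if (!((bitmask_[i * words + t / 32] >> (t % 32)) & 1u)) {
                    logits[i * vocab_size_ + t] = -std::numeric_limits<float>::infinity();
                }
            }
        }
    }

private:
    size_t                vocab_size_;
    std::vector<uint32_t> bitmask_;
};

class GuidedDecodeUpdateLayer {
public:
    // forward(): after sampling, advance each request's FSM with its new token.
    void forward(const std::vector<MatcherPtr>& matchers, const std::vector<int32_t>& output_ids) const
    {
        for (size_t i = 0; i < matchers.size(); ++i) {
            if (matchers[i]) {
                matchers[i]->AcceptToken(output_ids[i]);
            }
        }
    }
};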

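The compiler cache can be pictured as a small LRU map keyed by a hash of the schema text. CompiledGrammar and compile_grammar() below are placeholders for the actual xgrammar types, and the synchronization needed for cross-session sharing is omitted.

// Sketch only: CompiledGrammar / compile_grammar() are placeholders for the
// xgrammar types; locking for sharing across sessions is omitted.
#include <cstddef>
#include <functional>
#include <list>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

struct CompiledGrammar {};  // placeholder for xgrammar's compiled grammar

class GrammarCache {
public:
    explicit GrammarCache(size_t capacity): capacity_{capacity} {}

    std::shared_ptr<CompiledGrammar> get_or_compile(const std::string& schema)
    {
        const size_t key = std::hash<std::string>{}(schema);
        auto         it  = index_.find(key);
        if (it != index_.end()) {  // hit: move the entry to the front (most recent)
            lru_.splice(lru_.begin(), lru_, it->second);
            return it->second->second;
        }
        auto grammar = compile_grammar(schema);  // miss: compile once and cache
        lru_.emplace_front(key, grammar);
        index_[key] = lru_.begin();
        if (lru_.size() > capacity_) {  // evict the least-recently used entry
            index_.erase(lru_.back().first);
            lru_.pop_back();
        }
        return grammar;
    }

private:
    static std::shared_ptr<CompiledGrammar> compile_grammar(const std::string&)
    {
        return std::make_shared<CompiledGrammar>();  // stands in for grammar compilation
    }

    using Entry = std::pair<size_t, std::shared_ptr<CompiledGrammar>>;

    size_t                                                 capacity_;
    std::list<Entry>                                       lru_;
    std::unordered_map<size_t, std::list<Entry>::iterator> index_;
};
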
Checklist

  • Pre-commit hooks (clang-format, flake8, mypy) passed.
  • Documentation updated.

@windreamer windreamer changed the title Guided decoding with xgrammar [WIP] Guided decoding with xgrammar Sep 12, 2025
@windreamer windreamer force-pushed the guided_decoding_with_xgrammar branch 3 times, most recently from 8b3e766 to 8fd6d05 Compare September 12, 2025 09:44
@shell-nlp
Contributor

good job!

@windreamer windreamer force-pushed the guided_decoding_with_xgrammar branch 25 times, most recently from 0362250 to 8bcbfff Compare September 22, 2025 12:41
@windreamer windreamer force-pushed the guided_decoding_with_xgrammar branch from d75e11e to 39b4d13 Compare September 24, 2025 04:38
@windreamer windreamer changed the title [WIP] Guided decoding with xgrammar Guided decoding with xgrammar Sep 24, 2025
@lvhan028
Collaborator

What's the size of the whl file if this PR is applied?

Comment on lines 39 to 43
layers_.emplace_back(new LogitsProcessorLayer<float>{param});
layers_.emplace_back(new GuidedDecodeMaskLayer<float>{param});
layers_.emplace_back(new SamplingLayer<float>{param});
layers_.emplace_back(new GuidedDecodeUpdateLayer<float>{param});
layers_.emplace_back(new StopCriteriaLayer<float>{param});
Collaborator

The sampling-related classes are declared as templates, but the template parameter T does not appear to be utilized in any of the following:

  • Member variable
  • Member function
  • Base class

How about removing the templates? @lzhangzz @irexyc, any comments?
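
For illustration only (not code from this PR): with the unused parameter removed, the layers become plain classes and the pipeline construction drops the <float> arguments. Param and BaseDynamicDecodeLayer below are stand-ins for the real types.

// Illustration of the suggestion; Param / BaseDynamicDecodeLayer are stand-ins.
#include <memory>
#include <vector>

struct Param {};  // stand-in for the real layer-construction parameters

struct BaseDynamicDecodeLayer {
    virtual ~BaseDynamicDecodeLayer() = default;
};

struct GuidedDecodeMaskLayer: BaseDynamicDecodeLayer {  // was: template<class T> class ...
    explicit GuidedDecodeMaskLayer(const Param&) {}
};

struct GuidedDecodeUpdateLayer: BaseDynamicDecodeLayer {  // was: template<class T> class ...
    explicit GuidedDecodeUpdateLayer(const Param&) {}
};

void BuildPipeline(std::vector<std::unique_ptr<BaseDynamicDecodeLayer>>& layers, const Param& param)
{
    layers.emplace_back(new GuidedDecodeMaskLayer{param});    // no <float> needed any more
    layers.emplace_back(new GuidedDecodeUpdateLayer{param});
}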

Comment on lines 215 to 251
switch (logits.dtype()) {
    case kFloat32: {
        ApplyTokenBitmaskInplaceDispatchToPackedT(logits.data<float>(),
                                                  bitmask.data<int32_t>(),
                                                  indices_ptr,
                                                  vocab_size,
                                                  logits.stride(0),
                                                  bitmask.stride(0),
                                                  num_rows);
        break;
    }
    case kFloat16: {
        ApplyTokenBitmaskInplaceDispatchToPackedT(logits.data<half_t>(),
                                                  bitmask.data<int32_t>(),
                                                  indices_ptr,
                                                  vocab_size,
                                                  logits.stride(0),
                                                  bitmask.stride(0),
                                                  num_rows);
        break;
    }
#if __CUDA_ARCH__ >= 800
    case kBfloat16: {
        ApplyTokenBitmaskInplaceDispatchToPackedT(logits.data<bfloat16_t>(),
                                                  bitmask.data<int32_t>(),
                                                  indices_ptr,
                                                  vocab_size,
                                                  logits.stride(0),
                                                  bitmask.stride(0),
                                                  num_rows);
        break;
    }
#endif
    default:
        TM_CHECK(false) << "logits dtype must be float, float16 or bfloat16.";
        break;
}
Collaborator

Consider using TM_DISPATCH_PRIMARY_DTYPES for a more robust and maintainable dispatcher implementation.
Here is a reference:

TM_DISPATCH_PRIMARY_DTYPES(gate.dtype(), dispatch);

Collaborator Author

Since we only use float logits, I have streamlined the code to launch just the float kernel.
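
For reference, the streamlined version roughly collapses the switch above into a single check plus one launch (a sketch based on the diff shown earlier, not necessarily the final code):

// Sketch of the float-only path, reusing the names from the diff above.
TM_CHECK(logits.dtype() == kFloat32) << "logits dtype must be float.";
ApplyTokenBitmaskInplaceDispatchToPackedT(logits.data<float>(),
                                          bitmask.data<int32_t>(),
                                          indices_ptr,
                                          vocab_size,
                                          logits.stride(0),
                                          bitmask.stride(0),
                                          num_rows);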

@lvhan028
Collaborator

cc @zhulinJulia24, we may want to add CI coverage for the guided decoding functions.
Here is the guide: https://lmdeploy.readthedocs.io/en/latest/advance/structed_output.html

@windreamer windreamer force-pushed the guided_decoding_with_xgrammar branch from bf3a8ea to 39b4d13 Compare September 24, 2025 16:46
@windreamer
Collaborator Author

What's the size of the whl file if this PR is applied?

 ls -alh lmdeploy-0.10.0-cp310-cp310-linux_x86_64.whl
-rw-rw-r-- 1 tianzhongbo tianzhongbo 92M Sep 25 09:07 lmdeploy-0.10.0-cp310-cp310-linux_x86_64.whl

It seems this will increase the package size from 79M to 92M, an increase of about 16%.

@windreamer windreamer force-pushed the guided_decoding_with_xgrammar branch from b75e4c9 to 2355af6 Compare September 25, 2025 04:41
Comment on lines 47 to 69
const auto           bitmask_size = xgrammar::GetBitmaskSize(vocab_size_padded_);
Tensor_<int32_t>     bitmask{{bsz, bitmask_size}, kCPU};
Tensor_<int32_t>     bitmask_device{{bsz, bitmask_size}, kDEVICE};
std::vector<int64_t> bitmask_shape = {bsz, bitmask_size};

DLTensor bitmask_dltensor{bitmask.data(),
                          DLDevice{kDLCPU, 0},
                          bitmask.ndim(),
                          xgrammar::GetBitmaskDLType(),
                          bitmask_shape.data(),
                          nullptr,
                          0};
bool need_apply = false;
for (size_t i = 0; i < bsz; ++i) {
    const auto& matcher = matchers_[i];
    if (matcher) {
        matcher->FillNextTokenBitmask(&bitmask_dltensor, i);
        need_apply = true;
    }
}

if (need_apply) {
    Copy(bitmask, bitmask_device);
Collaborator

This part is CPU and PCIe work. Move it to setup so it can be executed in parallel with the model forward pass.

Collaborator Author

Can you elaborate a bit more? Do you mean I should pre-allocate buffers in Setup and do the bitmask device copy and apply in Forward?

Collaborator

Pre-allocate buffers, fill masks and copy the masks to device in setup; apply the masks in forward
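
A rough sketch of that split, reusing the names from the diff above (member names such as bitmask_device_ and the ApplyTokenBitmaskInplace call are illustrative, not the final code):

// Setup(): CPU-side mask filling plus the host-to-device copy, so the PCIe
// traffic overlaps with the model forward pass.
void GuidedDecodeMaskLayer::Setup(const std::vector<MatcherPtr>& matchers)
{
    need_apply_ = false;
    for (size_t i = 0; i < matchers.size(); ++i) {
        if (matchers[i]) {
            matchers[i]->FillNextTokenBitmask(&bitmask_dltensor_, i);  // CPU work
            need_apply_ = true;
        }
    }
    if (need_apply_) {
        Copy(bitmask_, bitmask_device_);  // host -> device copy issued early
    }
}

// Forward(): only the GPU-side masking remains on the decode critical path.
void GuidedDecodeMaskLayer::Forward(Tensor& logits)
{
    if (need_apply_) {
        ApplyTokenBitmaskInplace(logits, bitmask_device_);  // mask disallowed logits to -INF
    }
}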

Collaborator Author

I have moved the pre-allocation of tensors to Setup as a first-stage implementation; the rest is planned as future work.

@windreamer windreamer force-pushed the guided_decoding_with_xgrammar branch from 8413618 to 3c4cbdb Compare September 26, 2025 01:18
@windreamer
Collaborator Author

2025-09-26 09:28:55,874 - lmdeploy - ERROR - model_agent.py:804 - Task <ModelAgentLoop> failed
Traceback (most recent call last):
  File "/__w/lmdeploy/lmdeploy/lmdeploy/pytorch/engine/model_agent.py", line 799, in _on_finish_callback
    task.result()
  File "/__w/lmdeploy/lmdeploy/lmdeploy/pytorch/engine/model_agent.py", line 771, in _async_loop_background
    await self._async_step_background(**forward_inputs, )
  File "/__w/lmdeploy/lmdeploy/lmdeploy/pytorch/engine/model_agent.py", line 707, in _async_step_background
    next_token_ids, logprobs = await self.async_sampling_logits(last_logits, sampling_inputs, inputs)
  File "/__w/lmdeploy/lmdeploy/lmdeploy/pytorch/engine/model_agent.py", line 542, in async_sampling_logits
    logits, raw_logprobs = await logits_processor(origin_logits)
  File "/__w/lmdeploy/lmdeploy/lmdeploy/pytorch/engine/logits_process.py", line 237, in __call__
    scores = _guided_sampling(sampling_inputs.response_formats, scores, guided_input_ids, self.tokenizer)
  File "/__w/lmdeploy/lmdeploy/lmdeploy/pytorch/engine/logits_process.py", line 105, in _guided_sampling
    from .guided_process import _get_guided_logits_processor
  File "/__w/lmdeploy/lmdeploy/lmdeploy/pytorch/engine/guided_process.py", line 23, in <module>
    from outlines.fsm.guide import CFGGuide, Generate, RegexGuide, Write
  File "/opt/py3/lib/python3.10/site-packages/outlines/__init__.py", line 5, in <module>
    import outlines.types
  File "/opt/py3/lib/python3.10/site-packages/outlines/types/__init__.py", line 1, in <module>
    from . import airports, countries
  File "/opt/py3/lib/python3.10/site-packages/outlines/types/airports.py", line 4, in <module>
    from pyairports.airports import AIRPORT_LIST
ModuleNotFoundError: No module named 'pyairports'

@zhulinJulia24 can you help me identify this issue?

I have also temporarily disabled the PyTorch engine test; it will be re-enabled once we replace outlines with xgrammar.
