Guided decoding with xgrammar #3965
Conversation
good job!
What's the size of the whl file if this PR is applied?
layers_.emplace_back(new LogitsProcessorLayer<float>{param});
layers_.emplace_back(new GuidedDecodeMaskLayer<float>{param});    // masks grammar-disallowed logits before sampling
layers_.emplace_back(new SamplingLayer<float>{param});
layers_.emplace_back(new GuidedDecodeUpdateLayer<float>{param});  // advances the grammar FSM with the sampled token
layers_.emplace_back(new StopCriteriaLayer<float>{param});
switch (logits.dtype()) {
    case kFloat32: {
        ApplyTokenBitmaskInplaceDispatchToPackedT(logits.data<float>(),
                                                  bitmask.data<int32_t>(),
                                                  indices_ptr,
                                                  vocab_size,
                                                  logits.stride(0),
                                                  bitmask.stride(0),
                                                  num_rows);
        break;
    }
    case kFloat16: {
        ApplyTokenBitmaskInplaceDispatchToPackedT(logits.data<half_t>(),
                                                  bitmask.data<int32_t>(),
                                                  indices_ptr,
                                                  vocab_size,
                                                  logits.stride(0),
                                                  bitmask.stride(0),
                                                  num_rows);
        break;
    }
#if __CUDA_ARCH__ >= 800
    case kBfloat16: {
        ApplyTokenBitmaskInplaceDispatchToPackedT(logits.data<bfloat16_t>(),
                                                  bitmask.data<int32_t>(),
                                                  indices_ptr,
                                                  vocab_size,
                                                  logits.stride(0),
                                                  bitmask.stride(0),
                                                  num_rows);
        break;
    }
#endif
    default:
        TM_CHECK(false) << "logits dtype must be float, float16 or bfloat16.";
        break;
}
May consider using TM_DISPATCH_PRIMARY_DTYPES for a more robust and maintainable dispatcher implementation. Here is a reference: lmdeploy/src/turbomind/kernels/activation.cu, line 100 (at d18ab56):

TM_DISPATCH_PRIMARY_DTYPES(gate.dtype(), dispatch);
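For illustration, here is a self-contained sketch of the tag-to-template dispatch pattern that suggestion points at. DataType, dispatch_primary_dtypes, and apply_token_bitmask are hypothetical stand-ins, not TurboMind's actual TM_DISPATCH_PRIMARY_DTYPES macro or its types; the point is only that the switch over dtypes lives in one reusable place and each kernel-launch call site is written once.

```cpp
#include <cstdint>
#include <cstdio>
#include <stdexcept>

// Hypothetical dtype tag standing in for TurboMind's real dtype enum.
enum class DataType { kFloat32, kFloat16, kBfloat16 };

// Invoke f with a value of the concrete element type; the switch over dtypes
// then exists only here instead of at every kernel-launch call site.
template<class F>
void dispatch_primary_dtypes(DataType dtype, F&& f)
{
    switch (dtype) {
        case DataType::kFloat32: f(float{}); break;
        case DataType::kFloat16: f(std::uint16_t{}); break;   // stand-in for half_t
        case DataType::kBfloat16: f(std::uint16_t{}); break;  // stand-in for bfloat16_t
        default: throw std::invalid_argument("unsupported logits dtype");
    }
}

// Stub standing in for ApplyTokenBitmaskInplaceDispatchToPackedT.
template<class T>
void apply_token_bitmask(T* /*logits*/, const std::int32_t* /*bitmask*/, int num_rows)
{
    std::printf("masking %d row(s) of %zu-byte logits\n", num_rows, sizeof(T));
}

int main()
{
    float        logits[8]  = {};
    std::int32_t bitmask[1] = {};
    dispatch_primary_dtypes(DataType::kFloat32, [&](auto tag) {
        using T = decltype(tag);
        apply_token_bitmask(reinterpret_cast<T*>(logits), bitmask, /*num_rows=*/1);
    });
}
```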
As we just use float logits, I have streamlined the code to launch only the float kernel.
cc @zhulinJulia24: we may consider adding CI for the guided decoding functions.
ls -alh lmdeploy-0.10.0-cp310-cp310-linux_x86_64.whl
-rw-rw-r-- 1 tianzhongbo tianzhongbo 92M Sep 25 09:07 lmdeploy-0.10.0-cp310-cp310-linux_x86_64.whl

It seems this PR increases the package size from 79M to 92M, roughly a 17% increase.
const auto bitmask_size = xgrammar::GetBitmaskSize(vocab_size_padded_);
Tensor_<int32_t> bitmask{{bsz, bitmask_size}, kCPU};
Tensor_<int32_t> bitmask_device{{bsz, bitmask_size}, kDEVICE};
std::vector<int64_t> bitmask_shape = {bsz, bitmask_size};

DLTensor bitmask_dltensor{bitmask.data(),
                          DLDevice{kDLCPU, 0},
                          bitmask.ndim(),
                          xgrammar::GetBitmaskDLType(),
                          bitmask_shape.data(),
                          nullptr,
                          0};
bool need_apply = false;
for (size_t i = 0; i < bsz; ++i) {
    const auto& matcher = matchers_[i];
    if (matcher) {
        matcher->FillNextTokenBitmask(&bitmask_dltensor, i);
        need_apply = true;
    }
}

if (need_apply) {
    Copy(bitmask, bitmask_device);
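For context on the apply step that this copy feeds, below is a small CPU-only sketch of what a packed token bitmask does to one row of logits. It assumes the usual packing of 32 tokens per int32 word with a set bit meaning "allowed"; the real work in this PR is done by the CUDA kernel, so this is only an illustration.

```cpp
#include <cstdint>
#include <cstdio>
#include <limits>
#include <vector>

// Mask one row of logits with a packed bitmask: token i is governed by bit
// (i % 32) of word (i / 32); a cleared bit means "disallowed", so that logit
// is forced to -inf and the token can never be sampled.
void apply_bitmask_row(std::vector<float>& logits, const std::vector<std::int32_t>& bitmask)
{
    for (std::size_t i = 0; i < logits.size(); ++i) {
        const bool allowed = (bitmask[i / 32] >> (i % 32)) & 1;
        if (!allowed) {
            logits[i] = -std::numeric_limits<float>::infinity();
        }
    }
}

int main()
{
    std::vector<float>        logits  = {0.1f, 2.0f, -0.5f, 1.3f};
    std::vector<std::int32_t> bitmask = {0b0101};  // grammar allows only tokens 0 and 2
    apply_bitmask_row(logits, bitmask);
    for (float v : logits) {
        std::printf("%f\n", v);
    }
}
```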
This part (the bitmask fill and the copy to device) consists of CPU & PCIe workload. Move it to Setup so it can be executed in parallel with the model forward pass.
Can you elaborate a bit more? Do you mean I can pre-allocate buffers in Setup and do the bitmask device copy and apply in Forward?
Pre-allocate buffers, fill the masks, and copy the masks to device in Setup; apply the masks in Forward.
I have moved the pre-allocation of tensors to Setup as the first-stage implementation; the rest is planned as future work.
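A rough sketch of the split being discussed, using stubbed stand-ins (MatcherStub, the plain vectors standing in for host/device buffers, and the commented-out kernel launch are all hypothetical, not the PR's actual classes): Setup carries the CPU bitmask fill plus the host-to-device copy so it can overlap with the model forward pass, while Forward keeps only the masking kernel on the critical path.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical matcher: fills one packed bitmask row on the CPU.
struct MatcherStub {
    void FillNextTokenBitmask(std::int32_t* row, std::size_t words) const
    {
        for (std::size_t w = 0; w < words; ++w) row[w] = -1;  // allow every token
    }
};

struct GuidedMaskSketch {
    std::vector<const MatcherStub*> matchers;        // one per request; nullptr = unconstrained
    std::vector<std::int32_t>       bitmask_host;
    std::vector<std::int32_t>       bitmask_device;  // stand-in for a GPU buffer
    bool                            need_apply = false;

    // Setup: CPU work plus the (here simulated) host-to-device copy, so it can
    // run in parallel with the model forward pass.
    void Setup(std::size_t bsz, std::size_t words_per_row)
    {
        bitmask_host.assign(bsz * words_per_row, -1);
        bitmask_device.resize(bsz * words_per_row);
        need_apply = false;
        for (std::size_t i = 0; i < bsz; ++i) {
            if (matchers[i]) {
                matchers[i]->FillNextTokenBitmask(&bitmask_host[i * words_per_row], words_per_row);
                need_apply = true;
            }
        }
        if (need_apply) {
            bitmask_device = bitmask_host;  // stands in for an async H2D copy
        }
    }

    // Forward: only the lightweight masking kernel remains on the critical path.
    void Forward(float* logits, std::size_t bsz, std::size_t vocab_size)
    {
        if (!need_apply) return;
        // launch_apply_bitmask(logits, bitmask_device.data(), bsz, vocab_size);  // kernel stub
        (void)logits; (void)bsz; (void)vocab_size;
    }
};

int main()
{
    MatcherStub m;
    GuidedMaskSketch layer;
    layer.matchers = {&m, nullptr};  // request 0 is grammar-constrained, request 1 is not
    layer.Setup(/*bsz=*/2, /*words_per_row=*/4);
    std::vector<float> logits(2 * 128, 0.f);
    layer.Forward(logits.data(), 2, 128);
}
```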
@zhulinJulia24 can you help me identify this issue? I have temporarily disabled the PyTorch engine test; it will be re-enabled when we make the substitution of
Motivation
LMDeploy’s TurboMind backend is the fastest inference stack in the ecosystem, yet it still lacks Guided Decoding – a feature that is already available in the PyTorch backend and heavily requested by the community.
This PR closes the gap by bringing token-level, C++-native Guided Decoding to TurboMind while keeping the API 100% compatible with the existing PyTorch backend.
The implementation is built on xGrammar (Apache-2.0), a high-performance C++ library that compiles JSON / Choice / Regex grammars into token FSMs and applies them with negligible overhead.
Modification

Build-system
- xgrammar is added as a header-only dependency via CMake FetchContent (CUDA & Python bindings disabled).
- The xgrammar::tokenizer_info and xgrammar::grammar_compiler symbols are exposed under lmdeploy::xgrammar.

Core C++ changes
- The DynamicDecodeLayer pipeline is extended with two new layers (see the sketch after this list):
  - GuidedDecodeMaskLayer: in setup() compiles / reuses the grammar → builds a per-request token bitmask; in forward() launches a light CUDA kernel to mask disallowed logits to -INF.
  - GuidedDecodeUpdateLayer: in forward() calls matcher->AcceptToken(output_id) to advance the FSM.

Python frontend
- Reuses the guided_decoding utilities from the PyTorch backend; no new API surface.
- turbo.TurboMindEngine now accepts the same response_format= / guided_json= / guided_choice= arguments.
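As referenced in the list above, here is a minimal stubbed sketch of the per-step flow of the two layers (mask the logits, sample, then accept the sampled token so the matcher advances). MatcherStub and the greedy sampling step are illustrative stand-ins, not the PR's layer classes or xGrammar's real matcher.

```cpp
#include <algorithm>
#include <cstdio>
#include <limits>
#include <vector>

// Stand-in for an xgrammar matcher: decides which tokens are legal next and
// is advanced with the token that was actually sampled.
struct MatcherStub {
    std::vector<int> allowed{1, 3};
    bool IsAllowed(int token) const
    {
        return std::find(allowed.begin(), allowed.end(), token) != allowed.end();
    }
    void AcceptToken(int token) { std::printf("FSM advanced with token %d\n", token); }
};

int main()
{
    MatcherStub matcher;
    std::vector<float> logits = {0.9f, 0.1f, 2.0f, 0.4f};

    // GuidedDecodeMaskLayer-style step: disallowed logits are forced to -inf.
    for (int t = 0; t < static_cast<int>(logits.size()); ++t) {
        if (!matcher.IsAllowed(t)) logits[t] = -std::numeric_limits<float>::infinity();
    }

    // Sampling stand-in: greedy pick over the masked logits.
    const int output_id =
        static_cast<int>(std::max_element(logits.begin(), logits.end()) - logits.begin());

    // GuidedDecodeUpdateLayer-style step: advance the grammar FSM.
    matcher.AcceptToken(output_id);
}
```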
Checklist