Guided decoding with xgrammar #3965
Conversation
good job!
What's the size of the whl file if this PR is applied?
layers_.emplace_back(new LogitsProcessorLayer<float>{param});
layers_.emplace_back(new GuidedDecodeMaskLayer<float>{param});    // masks grammar-disallowed logits before sampling
layers_.emplace_back(new SamplingLayer<float>{param});
layers_.emplace_back(new GuidedDecodeUpdateLayer<float>{param});  // advances the grammar FSM with the sampled token
layers_.emplace_back(new StopCriteriaLayer<float>{param});
switch (logits.dtype()) {
    case kFloat32: {
        ApplyTokenBitmaskInplaceDispatchToPackedT(logits.data<float>(),
                                                  bitmask.data<int32_t>(),
                                                  indices_ptr,
                                                  vocab_size,
                                                  logits.stride(0),
                                                  bitmask.stride(0),
                                                  num_rows);
        break;
    }
    case kFloat16: {
        ApplyTokenBitmaskInplaceDispatchToPackedT(logits.data<half_t>(),
                                                  bitmask.data<int32_t>(),
                                                  indices_ptr,
                                                  vocab_size,
                                                  logits.stride(0),
                                                  bitmask.stride(0),
                                                  num_rows);
        break;
    }
#if __CUDA_ARCH__ >= 800
    case kBfloat16: {
        ApplyTokenBitmaskInplaceDispatchToPackedT(logits.data<bfloat16_t>(),
                                                  bitmask.data<int32_t>(),
                                                  indices_ptr,
                                                  vocab_size,
                                                  logits.stride(0),
                                                  bitmask.stride(0),
                                                  num_rows);
        break;
    }
#endif
    default:
        TM_CHECK(false) << "logits dtype must be float, float16 or bfloat16.";
        break;
}
May consider using TM_DISPATCH_PRIMARY_DTYPES for a more robust and maintainable dispatcher implementation. Here is a reference: lmdeploy/src/turbomind/kernels/activation.cu, line 100 (at d18ab56):

TM_DISPATCH_PRIMARY_DTYPES(gate.dtype(), dispatch);
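For illustration, here is a self-contained sketch of the tag-to-template dispatch pattern that suggestion points at. DataType, dispatch_primary_dtypes, and apply_token_bitmask are hypothetical stand-ins, not TurboMind's actual TM_DISPATCH_PRIMARY_DTYPES macro or its types; the point is only that the switch over dtypes lives in one reusable place and each kernel-launch call site is written once.

```cpp
#include <cstdint>
#include <cstdio>
#include <stdexcept>

// Hypothetical dtype tag standing in for TurboMind's real dtype enum.
enum class DataType { kFloat32, kFloat16, kBfloat16 };

// Invoke f with a value of the concrete element type; the switch over dtypes
// then exists only here instead of at every kernel-launch call site.
template<class F>
void dispatch_primary_dtypes(DataType dtype, F&& f)
{
    switch (dtype) {
        case DataType::kFloat32: f(float{}); break;
        case DataType::kFloat16: f(std::uint16_t{}); break;   // stand-in for half_t
        case DataType::kBfloat16: f(std::uint16_t{}); break;  // stand-in for bfloat16_t
        default: throw std::invalid_argument("unsupported logits dtype");
    }
}

// Stub standing in for ApplyTokenBitmaskInplaceDispatchToPackedT.
template<class T>
void apply_token_bitmask(T* /*logits*/, const std::int32_t* /*bitmask*/, int num_rows)
{
    std::printf("masking %d row(s) of %zu-byte logits\n", num_rows, sizeof(T));
}

int main()
{
    float        logits[8]  = {};
    std::int32_t bitmask[1] = {};
    dispatch_primary_dtypes(DataType::kFloat32, [&](auto tag) {
        using T = decltype(tag);
        apply_token_bitmask(reinterpret_cast<T*>(logits), bitmask, /*num_rows=*/1);
    });
}
```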
As we just use float logits, I have streamlined the code to launch only the float kernel.
cc @zhulinJulia24: we may consider adding CI for the guided decoding functions.
ls -alh lmdeploy-0.10.0-cp310-cp310-linux_x86_64.whl
-rw-rw-r-- 1 tianzhongbo tianzhongbo 92M Sep 25 09:07 lmdeploy-0.10.0-cp310-cp310-linux_x86_64.whl

It seems this PR increases the package size from 79M to 92M, roughly a 17% increase.
const auto bitmask_size = xgrammar::GetBitmaskSize(vocab_size_padded_);
Tensor_<int32_t> bitmask{{bsz, bitmask_size}, kCPU};
Tensor_<int32_t> bitmask_device{{bsz, bitmask_size}, kDEVICE};
std::vector<int64_t> bitmask_shape = {bsz, bitmask_size};

DLTensor bitmask_dltensor{bitmask.data(),
                          DLDevice{kDLCPU, 0},
                          bitmask.ndim(),
                          xgrammar::GetBitmaskDLType(),
                          bitmask_shape.data(),
                          nullptr,
                          0};
bool need_apply = false;
for (size_t i = 0; i < bsz; ++i) {
    const auto& matcher = matchers_[i];
    if (matcher) {
        matcher->FillNextTokenBitmask(&bitmask_dltensor, i);
        need_apply = true;
    }
}

if (need_apply) {
    Copy(bitmask, bitmask_device);
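For context on the apply step that this copy feeds, below is a small CPU-only sketch of what a packed token bitmask does to one row of logits. It assumes the usual packing of 32 tokens per int32 word with a set bit meaning "allowed"; the real work in this PR is done by the CUDA kernel, so this is only an illustration.

```cpp
#include <cstdint>
#include <cstdio>
#include <limits>
#include <vector>

// Mask one row of logits with a packed bitmask: token i is governed by bit
// (i % 32) of word (i / 32); a cleared bit means "disallowed", so that logit
// is forced to -inf and the token can never be sampled.
void apply_bitmask_row(std::vector<float>& logits, const std::vector<std::int32_t>& bitmask)
{
    for (std::size_t i = 0; i < logits.size(); ++i) {
        const bool allowed = (bitmask[i / 32] >> (i % 32)) & 1;
        if (!allowed) {
            logits[i] = -std::numeric_limits<float>::infinity();
        }
    }
}

int main()
{
    std::vector<float>        logits  = {0.1f, 2.0f, -0.5f, 1.3f};
    std::vector<std::int32_t> bitmask = {0b0101};  // grammar allows only tokens 0 and 2
    apply_bitmask_row(logits, bitmask);
    for (float v : logits) {
        std::printf("%f\n", v);
    }
}
```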
This part (the bitmask fill and the copy to device) consists of CPU & PCIe workload. Move it to Setup so it can be executed in parallel with the model forward pass.
Can you elaborate a bit more? Do you mean I can pre-allocate buffers in Setup and do the bitmask device copy and apply in Forward?
Pre-allocate buffers, fill the masks, and copy the masks to device in Setup; apply the masks in Forward.
I have moved the pre-allocation of tensors to Setup as the first-stage implementation; the rest is planned as future work.
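A rough sketch of the split being discussed, using stubbed stand-ins (MatcherStub, the plain vectors standing in for host/device buffers, and the commented-out kernel launch are all hypothetical, not the PR's actual classes): Setup carries the CPU bitmask fill plus the host-to-device copy so it can overlap with the model forward pass, while Forward keeps only the masking kernel on the critical path.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical matcher: fills one packed bitmask row on the CPU.
struct MatcherStub {
    void FillNextTokenBitmask(std::int32_t* row, std::size_t words) const
    {
        for (std::size_t w = 0; w < words; ++w) row[w] = -1;  // allow every token
    }
};

struct GuidedMaskSketch {
    std::vector<const MatcherStub*> matchers;        // one per request; nullptr = unconstrained
    std::vector<std::int32_t>       bitmask_host;
    std::vector<std::int32_t>       bitmask_device;  // stand-in for a GPU buffer
    bool                            need_apply = false;

    // Setup: CPU work plus the (here simulated) host-to-device copy, so it can
    // run in parallel with the model forward pass.
    void Setup(std::size_t bsz, std::size_t words_per_row)
    {
        bitmask_host.assign(bsz * words_per_row, -1);
        bitmask_device.resize(bsz * words_per_row);
        need_apply = false;
        for (std::size_t i = 0; i < bsz; ++i) {
            if (matchers[i]) {
                matchers[i]->FillNextTokenBitmask(&bitmask_host[i * words_per_row], words_per_row);
                need_apply = true;
            }
        }
        if (need_apply) {
            bitmask_device = bitmask_host;  // stands in for an async H2D copy
        }
    }

    // Forward: only the lightweight masking kernel remains on the critical path.
    void Forward(float* logits, std::size_t bsz, std::size_t vocab_size)
    {
        if (!need_apply) return;
        // launch_apply_bitmask(logits, bitmask_device.data(), bsz, vocab_size);  // kernel stub
        (void)logits; (void)bsz; (void)vocab_size;
    }
};

int main()
{
    MatcherStub m;
    GuidedMaskSketch layer;
    layer.matchers = {&m, nullptr};  // request 0 is grammar-constrained, request 1 is not
    layer.Setup(/*bsz=*/2, /*words_per_row=*/4);
    std::vector<float> logits(2 * 128, 0.f);
    layer.Forward(logits.data(), 2, 128);
}
```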
@zhulinJulia24 can you help me identify this issue? I have temporarily disabled the PyTorch engine test; it will be re-enabled when we make the substitution of
Motivation
LMDeploy’s TurboMind backend is the fastest inference stack in the ecosystem, yet it still lacks Guided Decoding – a feature that is already available in the PyTorch backend and heavily requested by the community.
This PR closes the gap by bringing token-level, C++-native Guided Decoding to TurboMind while keeping the API 100% compatible with the existing PyTorch backend.
The implementation is built on xGrammar (Apache-2.0), a high-performance C++ library that compiles JSON / Choice / Regex grammars into token FSMs and applies them with negligible overhead.
Modification

Build-system
- xgrammar is added as a header-only dependency via CMake FetchContent (CUDA & Python bindings disabled).
- The xgrammar::tokenizer_info and xgrammar::grammar_compiler symbols are exposed under lmdeploy::xgrammar.

Core C++ changes
- The DynamicDecodeLayer pipeline is extended with two new layers (see the sketch after this list):
  - GuidedDecodeMaskLayer: in setup() compiles / reuses the grammar → builds a per-request token bitmask; in forward() launches a light CUDA kernel to mask disallowed logits to -INF.
  - GuidedDecodeUpdateLayer: in forward() calls matcher->AcceptToken(output_id) to advance the FSM.

Python frontend
- Reuses the guided_decoding utilities from the PyTorch backend; no new API surface.
- turbo.TurboMindEngine now accepts the same response_format= / guided_json= / guided_choice= arguments.
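As referenced in the list above, here is a minimal stubbed sketch of the per-step flow of the two layers (mask the logits, sample, then accept the sampled token so the matcher advances). MatcherStub and the greedy sampling step are illustrative stand-ins, not the PR's layer classes or xGrammar's real matcher.

```cpp
#include <algorithm>
#include <cstdio>
#include <limits>
#include <vector>

// Stand-in for an xgrammar matcher: decides which tokens are legal next and
// is advanced with the token that was actually sampled.
struct MatcherStub {
    std::vector<int> allowed{1, 3};
    bool IsAllowed(int token) const
    {
        return std::find(allowed.begin(), allowed.end(), token) != allowed.end();
    }
    void AcceptToken(int token) { std::printf("FSM advanced with token %d\n", token); }
};

int main()
{
    MatcherStub matcher;
    std::vector<float> logits = {0.9f, 0.1f, 2.0f, 0.4f};

    // GuidedDecodeMaskLayer-style step: disallowed logits are forced to -inf.
    for (int t = 0; t < static_cast<int>(logits.size()); ++t) {
        if (!matcher.IsAllowed(t)) logits[t] = -std::numeric_limits<float>::infinity();
    }

    // Sampling stand-in: greedy pick over the masked logits.
    const int output_id =
        static_cast<int>(std::max_element(logits.begin(), logits.end()) - logits.begin());

    // GuidedDecodeUpdateLayer-style step: advance the grammar FSM.
    matcher.AcceptToken(output_id);
}
```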
Checklist