Add compressed-tensors Marlin support for Qwen3-VL AWQ in pymllm #663

jialilve wants to merge 6 commits into UbiquitousLearning:main
Conversation
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
- ⚙️ Configuration used: defaults; Review profile: CHILL; Plan: Pro
- 📒 Files selected for processing: 2
- ✅ Files skipped from review due to trivial changes: 1
- 🚧 Files skipped from review as they are similar to previous changes: 1
📝 Walkthrough

Adds compressed-tensors GPTQ Marlin quantization (CUDA kernels and Python integration), a new gptq_marlin_repack kernel and tests, timing instrumentation propagated through the inference pipeline, multimodal/tokenizer/server adjustments, a CMake CPM vendoring change, an RMSNorm PyTorch fallback, and Jetson-oriented documentation.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 9
🧹 Nitpick comments (2)
pymllm/executor/model_runner.py (1)
490-495: Make config.json a true fallback. Handling it inside the main candidate loop makes precedence depend on `unique` ordering. Since every HF checkpoint has a config.json, encountering it early makes the rest of the probe unreachable, and the more specific quantization metadata can never win.

♻️ Suggested simplification

```diff
     for fname in unique:
+        if fname == "config.json":
+            continue
         fpath = model_path / fname
         if fpath.exists():
             with open(fpath) as fp:
                 cfg = json.load(fp)
-            # config.json stores model metadata at the top level and
-            # nests quantization details under quantization_config.
-            if fname == "config.json" and "quantization_config" in cfg:
-                return cfg["quantization_config"]
             return cfg
```

Keep the existing explicit config.json -> quantization_config block below as the only fallback path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/executor/model_runner.py` around lines 490-495: the config.json handling inside the main candidate loop (the block that checks fname == "config.json" and returns cfg["quantization_config"]) makes config.json a premature match and prevents more specific probes from being considered; remove that in-loop special-case and instead handle "config.json" -> "quantization_config" only as the final fallback path after the candidate/probe loop completes (keep the existing fallback block below). Update references to cfg/fname accordingly so the loop always continues for other candidates and only uses config.json's quantization_config when no other probe matched.

pymllm/quantization/methods/compressed_tensors.py (1)
89-97: Use buffers for Marlin runtime tensors instead of mixing plain attributes with Parameters. Line 230 stores `workspace` as a plain tensor attribute, which won't be moved by module.to()/cuda(). Lines 231-233 create empty Parameters for runtime metadata, causing them to appear in parameters()/state_dict() despite not being checkpoint-backed weights. All four tensors should be registered as non-persistent buffers, which ensures they move with the module while remaining excluded from the parameter list and the checkpoint.
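A minimal sketch of the suggested buffer registration (the class name and tensor shapes here are illustrative, not the actual pymllm layer):

```python
import torch
import torch.nn as nn

class MarlinLikeLayer(nn.Module):
    """Toy layer contrasting checkpoint-backed weights with runtime scratch."""

    def __init__(self) -> None:
        super().__init__()
        # Checkpoint-backed weight: belongs in parameters()/state_dict().
        self.weight_packed = nn.Parameter(torch.zeros(8, 8), requires_grad=False)
        # Runtime scratch tensor: a non-persistent buffer follows
        # module.to()/cuda() but stays out of parameters() and the checkpoint.
        self.register_buffer("workspace", torch.zeros(16), persistent=False)

layer = MarlinLikeLayer()
assert "workspace" not in dict(layer.named_parameters())
assert "workspace" not in layer.state_dict()   # non-persistent: not saved
assert "weight_packed" in layer.state_dict()   # real weight: saved
```

Calling layer.to("cuda") would move workspace along with the parameters, which a plain `self.workspace = torch.zeros(16)` attribute would not.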
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@mllm-kernel/mllm_kernel/cuda/jit/gptq_marlin_repack.py`:
- Around line 56-74: The function gptq_marlin_repack currently computes output
shape using floor division and can silently truncate/leave uninitialized data;
before allocating out or calling kernel, validate inputs: ensure num_bits
divides 32 evenly (so pack_factor = 32 // num_bits is integer and >0), ensure
size_k is a multiple of tile_size (16), ensure (size_n * tile_size) is divisible
by pack_factor (so out shape is exact), ensure b_q_weight.shape matches (size_k,
size_n) and its dtype/device are used for out, and ensure perm (if provided) has
length size_k; raise a clear ValueError on any violation and only then call
_normalize_perm, _make_gptq_marlin_repack_kernel(), allocate out and invoke
kernel.
In `@pymllm/layers/rms_norm.py`:
- Around line 52-54: The except branch in the RMSNorm fallback currently
computes x = x + residual but returns the old residual, causing downstream
blocks to receive stale residual state; update the except block in
pymllm/layers/rms_norm.py so that after computing x = x + residual you set a
new_residual (or reuse x) and return (_torch_rmsnorm(new_residual, self.weight,
self.eps), new_residual) instead of returning the old residual so subsequent
calls (e.g., in pymllm/models/qwen3_5.py flows) get the updated accumulated
residual.
- Around line 18-21: The code currently casts x_norm back to x.dtype before
multiplying by weight, which loses precision and can produce wrong dtype when
weight.dtype != x.dtype; change the order so you compute x_fp32, var, and x_norm
in FP32, cast weight to float32 (weight_fp32 = weight.to(torch.float32) or
weight.float()), multiply x_norm * weight_fp32 while still in FP32, then cast
the final product once to x.dtype before returning; locate symbols x_fp32, var,
x_norm, weight and eps in the RMSNorm implementation and apply this single-cast
fix consistent with rms_norm_gated.py.
In `@pymllm/models/qwen3_vl.py`:
- Around line 980-994: The helper _cuda_timed_run currently calls
torch.cuda.synchronize() on every invocation which forces a GPU sync in hot
paths (vision prefill and LLM forward); change it so synchronization is opt-in
or deferred: add a parameter like sync=False to _cuda_timed_run (or
alternatively return the start/end torch.cuda.Event objects and the raw output)
and only call torch.cuda.synchronize() and compute elapsed time when sync=True
(or when the caller performs synchronization at an existing barrier), ensuring
the default path avoids unconditional torch.cuda.synchronize() and preserves
existing return shapes by documenting the new return form if you choose the
event-return approach.
In `@pymllm/orchestrator/scheduler_process.py`:
- Around line 792-793: The code currently overwrites req.llm_decode_ms with
out["llm_decode_ms"] inside the decode-step loop; change this to accumulate
per-step timings by adding the step value to the existing request-level value
(e.g., req.llm_decode_ms = (req.llm_decode_ms or 0) + out["llm_decode_ms"]) so
each decode batch increments req.llm_decode_ms instead of replacing it; update
the setter site where "llm_decode_ms" is read from out (the block referencing
req.llm_decode_ms and out) to handle missing/None initial values safely.
In `@pymllm/orchestrator/tokenizer_process.py`:
- Around line 374-387: The multimodal branch replaces text tokenizer output with
processor IDs (mm_inputs -> image_inputs -> proc_input_ids) but skips the
earlier context-length guard; validate proc_input_ids against
self._context_length and either truncate or raise consistently with the text
path before assigning input_ids. Locate the block handling
mm_inputs/image_inputs/proc_input_ids in tokenizer_process.py and after
converting proc_input_ids to a Python list apply the same length-check logic
used earlier (use self._context_length to cap the list or raise a clear
exception), ensuring behavior matches the existing text-tokenizer
truncation/rejection policy.
In `@pymllm/quantization/methods/compressed_tensors.py`:
- Around line 61-70: Change verify_marlin_supported to accept a device argument
and use that device when checking CUDA capability (instead of calling
torch.cuda.get_device_capability() with no device), and update any callers
accordingly; specifically, update verify_marlin_supported(group_size: int,
device: torch.device) to first check MARLIN_SUPPORTED_GROUP_SIZES then if
device.type == "cuda" call torch.cuda.get_device_capability(device) and enforce
SM80+, and add an extra call to the new signature at the start of
process_weights_after_loading() using layer.weight_packed.device so validation
runs for the runtime device.
In `@pymllm/README-ZH.md`:
- Around line 5-16: The "## 环境配置 ToDo" section lacks validated Jetson
environment details; replace the checklist with a concrete validated matrix or a
link to a follow-up issue and populate the table with exact JetPack/L4T,
CUDA/cuDNN/TensorRT, Python/pip/venv or conda,
PyTorch/torchvision/transformers/safetensors, flashinfer version/installation,
apt dependencies, memory/GPU suggestions and required ENV vars; specifically
update the "## 环境配置 ToDo" header and the checklist items (e.g., JetPack / L4T,
CUDA / cuDNN / TensorRT, Python / pip / venv 或 conda, PyTorch / torchvision /
transformers / safetensors, flashinfer 版本与安装方式) with validated values or convert
them into a compact validated-version table and add a note linking to a
follow-up issue if full validation will occur later.
In `@pymllm/server/launch.py`:
- Around line 473-487: The code builds multimodal content lists in the messages
loop (mm_parts -> content) but may leave content as a list which later gets
stringified by the ChatML fallback and silently drops images; update the path
around tokenizer.apply_chat_template and the ChatML fallback to fail fast: after
constructing mm_parts (and before falling back to ChatML) check that
tokenizer.apply_chat_template exists and can accept list-form content, attempt
to call it in a try/except, and if it is missing or raises, immediately raise an
explicit exception (e.g., ValueError/RuntimeError) indicating structured chat
content could not be templated for the request instead of allowing silent
stringification; reference the variables/functions messages, mm_parts/content,
tokenizer.apply_chat_template, and the ChatML fallback so the change is made
where list content is prepared and before any stringification occurs.
📒 Files selected for processing (22)

- mllm-kernel/cmake/CPM.cmake
- mllm-kernel/include/mllm_kernel/scalar_type.hpp
- mllm-kernel/mllm_kernel/cuda/csrc/gemm/marlin/marlin.cuh
- mllm-kernel/mllm_kernel/cuda/jit/__init__.py
- mllm-kernel/mllm_kernel/cuda/jit/gptq_marlin.py
- mllm-kernel/mllm_kernel/cuda/jit/gptq_marlin_repack.py
- mllm-kernel/tests/test_gptq_marlin.py
- mllm-kernel/tests/test_gptq_marlin_repack.py
- pymllm/README-ZH.md
- pymllm/README.md
- pymllm/executor/model_runner.py
- pymllm/layers/rms_norm.py
- pymllm/models/qwen3_vl.py
- pymllm/orchestrator/detokenizer_process.py
- pymllm/orchestrator/model_runner_process.py
- pymllm/orchestrator/scheduler_process.py
- pymllm/orchestrator/tokenizer_process.py
- pymllm/quantization/methods/__init__.py
- pymllm/quantization/methods/compressed_tensors.py
- pymllm/server/launch.py
- pymllm/tests/test_compressed_tensors_config.py
- pymllm/tests/test_compressed_tensors_runtime.py
```python
def gptq_marlin_repack(
    b_q_weight: torch.Tensor,
    perm: Optional[torch.Tensor],
    size_k: int,
    size_n: int,
    num_bits: int,
) -> torch.Tensor:
    """Repack GPTQ/Compressed-Tensors weights into Marlin layout."""
    pack_factor = 32 // num_bits
    tile_size = 16
    out = torch.empty(
        (size_k // tile_size, size_n * tile_size // pack_factor),
        dtype=b_q_weight.dtype,
        device=b_q_weight.device,
    )
    kernel = _make_gptq_marlin_repack_kernel()
    perm_t = _normalize_perm(perm, size_k, b_q_weight.device)
    kernel(b_q_weight, perm_t, out, size_k, size_n, num_bits)
```
Reject unsupported size_k / size_n / num_bits inputs before allocating out.

Line 68 derives the output shape with floor division, so unsupported inputs are silently truncated instead of failing fast: a size_k tail is dropped, a non-64-multiple size_n can leave part of out uninitialized, and a non-divisor num_bits produces the wrong packing factor. Since this wrapper is now public, please validate num_bits, alignment, and b_q_weight.shape/dtype/device before the allocation and kernel launch.

Suggested guardrail

```diff
 def gptq_marlin_repack(
     b_q_weight: torch.Tensor,
     perm: Optional[torch.Tensor],
     size_k: int,
     size_n: int,
     num_bits: int,
 ) -> torch.Tensor:
     """Repack GPTQ/Compressed-Tensors weights into Marlin layout."""
+    if b_q_weight.dtype != torch.int32:
+        raise ValueError("b_q_weight must be int32")
+    if b_q_weight.device.type != "cuda":
+        raise ValueError("b_q_weight must live on CUDA")
+    if num_bits <= 0 or 32 % num_bits != 0:
+        raise ValueError("num_bits must be a positive divisor of 32")
+    if size_k % 16 != 0:
+        raise ValueError("size_k must be divisible by 16")
+    if size_n % 64 != 0:
+        raise ValueError("size_n must be divisible by 64")
+
     pack_factor = 32 // num_bits
+    expected_shape = (size_k // pack_factor, size_n)
+    if tuple(b_q_weight.shape) != expected_shape:
+        raise ValueError(f"b_q_weight must have shape {expected_shape}")
+
     tile_size = 16
     out = torch.empty(
         (size_k // tile_size, size_n * tile_size // pack_factor),
         dtype=b_q_weight.dtype,
         device=b_q_weight.device,
```

As per coding guidelines, "Validate inputs for public APIs and critical internal functions."
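The proposed checks can be exercised standalone; the helper below is a hypothetical distillation of the divisibility rules from the guardrail above (tile size 16, 32-bit packed words), not code from the PR:

```python
def validate_repack_args(size_k: int, size_n: int, num_bits: int,
                         tile_size: int = 16) -> int:
    """Fail fast on packing parameters that floor division would silently
    truncate; returns pack_factor on success."""
    if num_bits <= 0 or 32 % num_bits != 0:
        raise ValueError("num_bits must be a positive divisor of 32")
    pack_factor = 32 // num_bits
    if size_k % tile_size != 0:
        raise ValueError(f"size_k must be divisible by {tile_size}")
    if (size_n * tile_size) % pack_factor != 0:
        raise ValueError("size_n * tile_size must be divisible by pack_factor")
    return pack_factor

# Valid 4-bit case: pack_factor = 32 // 4 = 8
assert validate_repack_args(size_k=128, size_n=64, num_bits=4) == 8
```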
```python
x_fp32 = x.float()
var = x_fp32.pow(2).mean(dim=-1, keepdim=True)
x_norm = x_fp32 * torch.rsqrt(var + eps)
return x_norm.to(dtype=x.dtype) * weight
```
Apply the scale in FP32 before the final cast.
This helper downcasts x_norm before multiplying by weight, so the fallback loses precision and can even return the wrong dtype when weight.dtype != x.dtype. pymllm/layers/rms_norm_gated.py:40-61 already keeps both operands in FP32 and casts once at the end.
Suggested fix

```diff
 def _torch_rmsnorm(
     x: torch.Tensor,
     weight: torch.Tensor,
     eps: float,
 ) -> torch.Tensor:
     x_fp32 = x.float()
+    weight_fp32 = weight.float()
     var = x_fp32.pow(2).mean(dim=-1, keepdim=True)
-    x_norm = x_fp32 * torch.rsqrt(var + eps)
-    return x_norm.to(dtype=x.dtype) * weight
+    out = x_fp32 * torch.rsqrt(var + eps) * weight_fp32
+    return out.to(dtype=x.dtype)
```
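The ordering point generalizes beyond torch; a NumPy sketch (NumPy is used here only to keep the example dependency-light, so the function name is illustrative) shows the single-cast version keeping the result in x's dtype even when the weight is float32:

```python
import numpy as np

def rmsnorm_single_cast(x: np.ndarray, weight: np.ndarray,
                        eps: float = 1e-6) -> np.ndarray:
    """RMSNorm computed fully in float32, with one cast back to x.dtype."""
    x32 = x.astype(np.float32)
    w32 = weight.astype(np.float32)
    var = np.mean(x32 ** 2, axis=-1, keepdims=True)
    out = x32 * (1.0 / np.sqrt(var + eps)) * w32
    return out.astype(x.dtype)

x = np.ones((2, 4), dtype=np.float16)
w = np.full(4, 2.0, dtype=np.float32)   # weight dtype differs from x
out = rmsnorm_single_cast(x, w)
assert out.dtype == np.float16          # result follows x.dtype, not weight.dtype
```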
```python
except Exception:
    x = x + residual
    return _torch_rmsnorm(x, self.weight, self.eps), residual
```
Return the accumulated residual from the fallback path.
This branch computes x + residual but returns the old residual. pymllm/models/qwen3_5.py:206-211 threads that second tuple element straight into the next norm call, so taking this fallback leaves all subsequent blocks operating on stale residual state.
Suggested fix

```diff
 except Exception:
-    x = x + residual
-    return _torch_rmsnorm(x, self.weight, self.eps), residual
+    accumulated = x + residual
+    return _torch_rmsnorm(accumulated, self.weight, self.eps), accumulated
```
+ return _torch_rmsnorm(accumulated, self.weight, self.eps), accumulated📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| except Exception: | |
| x = x + residual | |
| return _torch_rmsnorm(x, self.weight, self.eps), residual | |
| except Exception: | |
| accumulated = x + residual | |
| return _torch_rmsnorm(accumulated, self.weight, self.eps), accumulated |
🧰 Tools
🪛 Ruff (0.15.9)
[warning] 52-52: Do not catch blind exception: Exception (BLE001)
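The effect of returning the stale residual is visible even with scalar stand-ins (the norm is dropped here for clarity; the function names are illustrative, not the pymllm API):

```python
def block_stale(x: float, residual: float):
    # Buggy: adds the residual but hands the *old* residual to the next block.
    x = x + residual
    return x, residual

def block_fixed(x: float, residual: float):
    # Fixed: the accumulated sum becomes both the output and the new residual.
    accumulated = x + residual
    return accumulated, accumulated

# Two chained blocks, starting from x=1.0 and residual=1.0.
x, r = block_stale(1.0, 1.0)   # x=2.0, but r is still 1.0
x, r = block_stale(x, r)       # x=3.0: the accumulation from step 1 was lost
stale_result = x

x, r = block_fixed(1.0, 1.0)   # x=2.0, r=2.0
x, r = block_fixed(x, r)       # x=4.0: residual state carried forward
fixed_result = x

assert stale_result == 3.0 and fixed_result == 4.0
```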
```python
def _cuda_timed_run(fn):
    if torch.cuda.is_available():
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        out = fn()
        end.record()
        torch.cuda.synchronize()
        return out, float(start.elapsed_time(end))
    else:
        import time
        t0 = time.perf_counter()
        out = fn()
        t1 = time.perf_counter()
        return out, float((t1 - t0) * 1000.0)
```
Don't synchronize the device on every timed forward.
torch.cuda.synchronize() here forces a full GPU sync in the model hot path, and this helper is used for both the vision prefill and LLM passes. That will reduce throughput and inflate latency; please make this opt-in, or at least defer elapsed-time collection to an existing synchronization point instead of doing it unconditionally on every batch.
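One shape the opt-in API could take, sketched without CUDA (timed_run and its sync flag are hypothetical; on GPU the sync=True branch is where torch.cuda.synchronize() and elapsed_time() would run):

```python
import time

def timed_run(fn, sync: bool = False):
    """Run fn and return (out, elapsed_ms) when sync=True, else (out, None).

    Sketch of the opt-in pattern: on CUDA the start/end events would be
    recorded unconditionally (cheap), but the device sync and elapsed-time
    readout happen only when the caller asks for them, so hot paths that
    never read the timing never pay for a full synchronize.
    """
    t0 = time.perf_counter()
    out = fn()
    if not sync:
        return out, None   # default path: no measurement cost
    return out, (time.perf_counter() - t0) * 1000.0

out, ms = timed_run(lambda: 21 * 2)             # default: no measurement
assert out == 42 and ms is None
out, ms = timed_run(lambda: 21 * 2, sync=True)  # opt-in: measured
assert out == 42 and ms >= 0.0
```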
```python
if "llm_decode_ms" in out:
    req.llm_decode_ms = out["llm_decode_ms"]
```
Accumulate llm_decode_ms instead of replacing it.
This setter runs once per decode batch, so multi-token requests end up returning only the last step's decode latency in the final timing payload.
💡 If this field is meant to be request-level timing

```diff
-    if "llm_decode_ms" in out:
-        req.llm_decode_ms = out["llm_decode_ms"]
+    if "llm_decode_ms" in out:
+        req.llm_decode_ms = (req.llm_decode_ms or 0.0) + float(
+            out["llm_decode_ms"]
+        )
```
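The accumulate-instead-of-overwrite pattern in isolation (Request here is a minimal stand-in for the scheduler's request object, not the real class):

```python
class Request:
    llm_decode_ms = None  # unset until the first decode step reports timing

def record_decode_ms(req: Request, out: dict) -> None:
    """Accumulate per-step decode latency; None-safe on the first step."""
    if "llm_decode_ms" in out:
        req.llm_decode_ms = (req.llm_decode_ms or 0.0) + float(out["llm_decode_ms"])

req = Request()
for step_ms in (3.0, 2.5, 4.5):            # three decode batches
    record_decode_ms(req, {"llm_decode_ms": step_ms})
assert req.llm_decode_ms == 10.0           # total, not just the last step
```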
```python
# If AutoProcessor produced multimodal input_ids, they must override
# the plain tokenizer result. Otherwise the prompt contains only a
# single image placeholder token and won't match the visual features.
if mm_inputs is not None:
    image_inputs = mm_inputs.get("image_inputs")
    if image_inputs is not None and "input_ids" in image_inputs:
        proc_input_ids = image_inputs["input_ids"]
        if hasattr(proc_input_ids, "ndim") and proc_input_ids.ndim > 1:
            proc_input_ids = proc_input_ids[0]
        if hasattr(proc_input_ids, "tolist"):
            input_ids = proc_input_ids.tolist()
        else:
            input_ids = list(proc_input_ids)
```
Keep the context-length guard on the multimodal path.
Once these processor IDs replace the text-tokenizer output, multimodal requests no longer inherit the truncation/length check applied earlier. A long image prompt can now exceed self._context_length and fail later in the runner instead of being capped or rejected here. At minimum, validate the replacement IDs before packing the request.
🧭 Minimal safeguard
```diff
 if hasattr(proc_input_ids, "tolist"):
     input_ids = proc_input_ids.tolist()
 else:
     input_ids = list(proc_input_ids)
+if (
+    self._context_length is not None
+    and len(input_ids) > self._context_length
+):
+    raise ValueError(
+        "Multimodal prompt exceeds context_length after processor tokenization"
+    )
```
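A standalone version of the guard this comment asks for (the function name and the truncate switch are illustrative; the real code would follow whichever policy the text path already uses):

```python
def enforce_context_length(input_ids, context_length, truncate=False):
    """Apply one length policy to processor-produced ids, mirroring the
    text-tokenizer path: cap the list or reject the request outright."""
    if context_length is None or len(input_ids) <= context_length:
        return input_ids
    if truncate:
        return input_ids[:context_length]
    raise ValueError(
        f"prompt length {len(input_ids)} exceeds context_length {context_length}"
    )

assert enforce_context_length([1, 2, 3], 4) == [1, 2, 3]
assert enforce_context_length([1, 2, 3, 4, 5], 3, truncate=True) == [1, 2, 3]
```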
```python
def verify_marlin_supported(group_size: int) -> None:
    if group_size not in MARLIN_SUPPORTED_GROUP_SIZES:
        raise ValueError(
            f"Unsupported compressed-tensors group_size: {group_size}"
        )
    if not torch.cuda.is_available():
        return
    major, minor = torch.cuda.get_device_capability()
    if major * 10 + minor < 80:
        raise ValueError("compressed-tensors Marlin requires SM80+")
```
🧩 Analysis chain

🌐 Web query: In PyTorch, if torch.cuda.get_device_capability() is called without a device argument, which CUDA device does it inspect?

💡 Result: It inspects the current CUDA device, as returned by torch.cuda.current_device.

Citations:
- https://pytorch.org/docs/2.5/_modules/torch/cuda.html
- https://github.com/pytorch/pytorch/blob/main/torch/cuda/_utils.py
- https://docs.pytorch.org/docs/1.10/_modules/torch/cuda.html
Re-validate Marlin device compatibility in the actual runtime device context.
The verify_marlin_supported() function calls torch.cuda.get_device_capability() without a device argument, which inspects only the current CUDA device, not the device where the layer actually resides. The process_weights_after_loading() method extracts the correct device from layer.weight_packed.device but never re-validates Marlin support for that device before executing CUDA operations. In workflows where initialization occurs on a different device than the runtime device (e.g., CPU initialization before GPU placement, or multi-GPU scenarios), this allows invalid configurations to pass validation and fail later at runtime.
Modify verify_marlin_supported() to accept and use a device parameter, and call it again at the start of process_weights_after_loading() with the actual layer.weight_packed.device.
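A dependency-injected sketch of the device-aware check (the group-size whitelist values and the get_capability hook are illustrative so the example runs without a GPU; in the real code the probe would be torch.cuda.get_device_capability(device)):

```python
MARLIN_SUPPORTED_GROUP_SIZES = (-1, 32, 64, 128)  # illustrative values

def verify_marlin_supported(group_size, device_type="cuda",
                            device_index=0, get_capability=None):
    """Validate against the device the weights actually live on, not the
    ambient current device. get_capability is injected for testability."""
    if group_size not in MARLIN_SUPPORTED_GROUP_SIZES:
        raise ValueError(f"Unsupported compressed-tensors group_size: {group_size}")
    if device_type != "cuda" or get_capability is None:
        return  # nothing to probe (e.g. CPU-side initialization)
    major, minor = get_capability(device_index)
    if major * 10 + minor < 80:
        raise ValueError("compressed-tensors Marlin requires SM80+")
```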
```diff
 # Preserve multimodal message structure for tokenizer.apply_chat_template.
 msg_dicts: List[Dict[str, Any]] = []
 for msg in messages:
     content = msg.content
     if isinstance(content, list):
-        # Multimodal: extract only text parts for the prompt string.
-        text_parts = [p.text for p in content if p.type == "text" and p.text]
-        content = "\n".join(text_parts) if text_parts else ""
+        mm_parts: List[Dict[str, Any]] = []
+        for part in content:
+            if part.type == "text" and part.text is not None:
+                mm_parts.append({"type": "text", "text": part.text})
+            elif part.type == "image_url" and part.image_url is not None:
+                # Keep image content so chat template can emit vision tokens.
+                mm_parts.append(
+                    {"type": "image", "image": part.image_url.url}
+                )
+        content = mm_parts
```
Fail fast when structured chat content cannot be templated.
After this change, content can remain a list of parts. If apply_chat_template() is unavailable or throws, the ChatML fallback below will stringify that list and silently drop image placeholders, so the request becomes malformed instead of being rejected.
💡 Suggested guard

```diff
@@
     if tokenizer is not None and hasattr(tokenizer, "apply_chat_template"):
         try:
             extra = dict(chat_template_kwargs) if chat_template_kwargs else {}
             return tokenizer.apply_chat_template(
                 msg_dicts,
                 tokenize=False,
                 add_generation_prompt=True,
                 **extra,
             )
         except Exception as e:
             logger.warning("apply_chat_template failed, using fallback: %s", e)
+            has_multimodal_content = any(
+                isinstance(m["content"], list)
+                and any(p.get("type") != "text" for p in m["content"])
+                for m in msg_dicts
+            )
+            if has_multimodal_content:
+                raise ValueError(
+                    "Multimodal chat content requires apply_chat_template() support"
+                ) from e
     # Fallback: ChatML format (Qwen-style)
     parts: List[str] = []
     for m in msg_dicts:
-        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
+        content = m["content"]
+        if isinstance(content, list):
+            content = "\n".join(
+                p["text"] for p in content if p.get("type") == "text"
+            )
+        parts.append(f"<|im_start|>{m['role']}\n{content}<|im_end|>")
```

As per coding guidelines, "Ensure functions that can fail return appropriate error codes or raise exceptions."
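The fail-fast behavior can be prototyped independently of the tokenizer (render_fallback and has_non_text_parts are hypothetical names for a ChatML-style fallback, not the launch.py code):

```python
def has_non_text_parts(msg_dicts) -> bool:
    """True if any message carries structured (non-text) content parts."""
    return any(
        isinstance(m.get("content"), list)
        and any(p.get("type") != "text" for p in m["content"])
        for m in msg_dicts
    )

def render_fallback(msg_dicts) -> str:
    """ChatML fallback that refuses to stringify image parts silently."""
    if has_non_text_parts(msg_dicts):
        raise ValueError("multimodal content requires apply_chat_template() support")
    parts = []
    for m in msg_dicts:
        content = m["content"]
        if isinstance(content, list):  # text-only structured content is safe
            content = "\n".join(p["text"] for p in content if p.get("type") == "text")
        parts.append(f"<|im_start|>{m['role']}\n{content}<|im_end|>")
    return "\n".join(parts)
```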
docs: add validated Jetson Orin environment to README

Summary
- Add compressed-tensors Marlin support in pymllm
- Add the gptq_marlin_repack Python JIT wrapper in mllm-kernel
- Update the pymllm README documents for the validated Jetson Orin workflow

Why
Qwen3-VL-2B-Instruct-AWQ-4bit uses compressed-tensors, but pymllm + mllm-kernel previously did not provide a complete formal Marlin-backed path for loading and serving this model. This PR adds the missing kernel wrapper, quantization runtime integration, config parsing fix, and usage documentation needed for the validated Jetson Orin workflow.

Key Changes
- Add the compressed-tensors quantization method in pymllm: CompressedTensorsConfig, CompressedTensorsLinearMethod, and the Marlin-backed weight scheme
- Add gptq_marlin_repack to the Python JIT wrapper layer in mllm-kernel
- Fix config.json -> quantization_config parsing
- Update pymllm/README.md and pymllm/README-ZH.md for the current Jetson Orin usage flow

Testing