Add compressed-tensors Marlin support for Qwen3-VL AWQ in pymllm #663

jialilve wants to merge 6 commits into UbiquitousLearning:main
Conversation
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
- ⚙️ Configuration used: defaults; Review profile: CHILL; Plan: Pro
- 📒 Files selected for processing: 2
- ✅ Files skipped from review due to trivial changes: 1
- 🚧 Files skipped from review as they are similar to previous changes: 1
📝 Walkthrough

Adds compressed-tensors GPTQ Marlin quantization (CUDA kernels and Python integration), a new gptq_marlin_repack kernel and tests, timing instrumentation propagated through the inference pipeline, multimodal/tokenizer/server adjustments, a CMake CPM vendoring change, an RMSNorm PyTorch fallback, and Jetson-oriented documentation.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 9
🧹 Nitpick comments (2)
pymllm/executor/model_runner.py (1)
490-495: Make config.json a true fallback. Handling it inside the main candidate loop makes precedence depend on `unique` ordering. Since every HF checkpoint has a config.json, encountering it early makes the rest of the probe unreachable, and the more specific quantization metadata can never win.

♻️ Suggested simplification

```diff
     for fname in unique:
+        if fname == "config.json":
+            continue
         fpath = model_path / fname
         if fpath.exists():
             with open(fpath) as fp:
                 cfg = json.load(fp)
-            # config.json stores model metadata at the top level and
-            # nests quantization details under quantization_config.
-            if fname == "config.json" and "quantization_config" in cfg:
-                return cfg["quantization_config"]
             return cfg
```

Keep the existing explicit config.json -> quantization_config block below as the only fallback path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pymllm/executor/model_runner.py` around lines 490-495: the config.json handling inside the main candidate loop (the block that checks fname == "config.json" and returns cfg["quantization_config"]) makes config.json a premature match and prevents more specific probes from being considered; remove that in-loop special-case and instead handle "config.json" -> "quantization_config" only as the final fallback path after the candidate/probe loop completes (keep the existing fallback block below). Update references to cfg/fname accordingly so the loop always continues for other candidates and only uses config.json's quantization_config when no other probe matched.

pymllm/quantization/methods/compressed_tensors.py (1)
89-97: Use buffers for Marlin runtime tensors instead of mixing plain attributes with Parameters. Line 230 stores `workspace` as a plain tensor attribute, which won't be moved by module.to()/cuda(). Lines 231-233 create empty Parameters for runtime metadata, causing them to appear in parameters()/state_dict() despite not being checkpoint-backed weights. All four tensors should be registered as non-persistent buffers, which ensures they move with the module while remaining excluded from the parameter list and the checkpoint.
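A minimal sketch of the suggested buffer registration (the class name and tensor shapes here are illustrative, not the actual pymllm layer):

```python
import torch
import torch.nn as nn

class MarlinLikeLayer(nn.Module):
    """Toy layer contrasting checkpoint-backed weights with runtime scratch."""

    def __init__(self) -> None:
        super().__init__()
        # Checkpoint-backed weight: belongs in parameters()/state_dict().
        self.weight_packed = nn.Parameter(torch.zeros(8, 8), requires_grad=False)
        # Runtime scratch tensor: a non-persistent buffer follows
        # module.to()/cuda() but stays out of parameters() and the checkpoint.
        self.register_buffer("workspace", torch.zeros(16), persistent=False)

layer = MarlinLikeLayer()
assert "workspace" not in dict(layer.named_parameters())
assert "workspace" not in layer.state_dict()   # non-persistent: not saved
assert "weight_packed" in layer.state_dict()   # real weight: saved
```

Calling layer.to("cuda") would move workspace along with the parameters, which a plain `self.workspace = torch.zeros(16)` attribute would not.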
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@mllm-kernel/mllm_kernel/cuda/jit/gptq_marlin_repack.py`:
- Around line 56-74: The function gptq_marlin_repack currently computes output
shape using floor division and can silently truncate/leave uninitialized data;
before allocating out or calling kernel, validate inputs: ensure num_bits
divides 32 evenly (so pack_factor = 32 // num_bits is integer and >0), ensure
size_k is a multiple of tile_size (16), ensure (size_n * tile_size) is divisible
by pack_factor (so out shape is exact), ensure b_q_weight.shape matches (size_k,
size_n) and its dtype/device are used for out, and ensure perm (if provided) has
length size_k; raise a clear ValueError on any violation and only then call
_normalize_perm, _make_gptq_marlin_repack_kernel(), allocate out and invoke
kernel.
In `@pymllm/layers/rms_norm.py`:
- Around line 52-54: The except branch in the RMSNorm fallback currently
computes x = x + residual but returns the old residual, causing downstream
blocks to receive stale residual state; update the except block in
pymllm/layers/rms_norm.py so that after computing x = x + residual you set a
new_residual (or reuse x) and return (_torch_rmsnorm(new_residual, self.weight,
self.eps), new_residual) instead of returning the old residual so subsequent
calls (e.g., in pymllm/models/qwen3_5.py flows) get the updated accumulated
residual.
- Around line 18-21: The code currently casts x_norm back to x.dtype before
multiplying by weight, which loses precision and can produce wrong dtype when
weight.dtype != x.dtype; change the order so you compute x_fp32, var, and x_norm
in FP32, cast weight to float32 (weight_fp32 = weight.to(torch.float32) or
weight.float()), multiply x_norm * weight_fp32 while still in FP32, then cast
the final product once to x.dtype before returning; locate symbols x_fp32, var,
x_norm, weight and eps in the RMSNorm implementation and apply this single-cast
fix consistent with rms_norm_gated.py.
In `@pymllm/models/qwen3_vl.py`:
- Around line 980-994: The helper _cuda_timed_run currently calls
torch.cuda.synchronize() on every invocation which forces a GPU sync in hot
paths (vision prefill and LLM forward); change it so synchronization is opt-in
or deferred: add a parameter like sync=False to _cuda_timed_run (or
alternatively return the start/end torch.cuda.Event objects and the raw output)
and only call torch.cuda.synchronize() and compute elapsed time when sync=True
(or when the caller performs synchronization at an existing barrier), ensuring
the default path avoids unconditional torch.cuda.synchronize() and preserves
existing return shapes by documenting the new return form if you choose the
event-return approach.
In `@pymllm/orchestrator/scheduler_process.py`:
- Around line 792-793: The code currently overwrites req.llm_decode_ms with
out["llm_decode_ms"] inside the decode-step loop; change this to accumulate
per-step timings by adding the step value to the existing request-level value
(e.g., req.llm_decode_ms = (req.llm_decode_ms or 0) + out["llm_decode_ms"]) so
each decode batch increments req.llm_decode_ms instead of replacing it; update
the setter site where "llm_decode_ms" is read from out (the block referencing
req.llm_decode_ms and out) to handle missing/None initial values safely.
In `@pymllm/orchestrator/tokenizer_process.py`:
- Around line 374-387: The multimodal branch replaces text tokenizer output with
processor IDs (mm_inputs -> image_inputs -> proc_input_ids) but skips the
earlier context-length guard; validate proc_input_ids against
self._context_length and either truncate or raise consistently with the text
path before assigning input_ids. Locate the block handling
mm_inputs/image_inputs/proc_input_ids in tokenizer_process.py and after
converting proc_input_ids to a Python list apply the same length-check logic
used earlier (use self._context_length to cap the list or raise a clear
exception), ensuring behavior matches the existing text-tokenizer
truncation/rejection policy.
In `@pymllm/quantization/methods/compressed_tensors.py`:
- Around line 61-70: Change verify_marlin_supported to accept a device argument
and use that device when checking CUDA capability (instead of calling
torch.cuda.get_device_capability() with no device), and update any callers
accordingly; specifically, update verify_marlin_supported(group_size: int,
device: torch.device) to first check MARLIN_SUPPORTED_GROUP_SIZES then if
device.type == "cuda" call torch.cuda.get_device_capability(device) and enforce
SM80+, and add an extra call to the new signature at the start of
process_weights_after_loading() using layer.weight_packed.device so validation
runs for the runtime device.
In `@pymllm/README-ZH.md`:
- Around line 5-16: The "## 环境配置 ToDo" section lacks validated Jetson
environment details; replace the checklist with a concrete validated matrix or a
link to a follow-up issue and populate the table with exact JetPack/L4T,
CUDA/cuDNN/TensorRT, Python/pip/venv or conda,
PyTorch/torchvision/transformers/safetensors, flashinfer version/installation,
apt dependencies, memory/GPU suggestions and required ENV vars; specifically
update the "## 环境配置 ToDo" header and the checklist items (e.g., JetPack / L4T,
CUDA / cuDNN / TensorRT, Python / pip / venv 或 conda, PyTorch / torchvision /
transformers / safetensors, flashinfer 版本与安装方式) with validated values or convert
them into a compact validated-version table and add a note linking to a
follow-up issue if full validation will occur later.
In `@pymllm/server/launch.py`:
- Around line 473-487: The code builds multimodal content lists in the messages
loop (mm_parts -> content) but may leave content as a list which later gets
stringified by the ChatML fallback and silently drops images; update the path
around tokenizer.apply_chat_template and the ChatML fallback to fail fast: after
constructing mm_parts (and before falling back to ChatML) check that
tokenizer.apply_chat_template exists and can accept list-form content, attempt
to call it in a try/except, and if it is missing or raises, immediately raise an
explicit exception (e.g., ValueError/RuntimeError) indicating structured chat
content could not be templated for the request instead of allowing silent
stringification; reference the variables/functions messages, mm_parts/content,
tokenizer.apply_chat_template, and the ChatML fallback so the change is made
where list content is prepared and before any stringification occurs.
📒 Files selected for processing (22)

- mllm-kernel/cmake/CPM.cmake
- mllm-kernel/include/mllm_kernel/scalar_type.hpp
- mllm-kernel/mllm_kernel/cuda/csrc/gemm/marlin/marlin.cuh
- mllm-kernel/mllm_kernel/cuda/jit/__init__.py
- mllm-kernel/mllm_kernel/cuda/jit/gptq_marlin.py
- mllm-kernel/mllm_kernel/cuda/jit/gptq_marlin_repack.py
- mllm-kernel/tests/test_gptq_marlin.py
- mllm-kernel/tests/test_gptq_marlin_repack.py
- pymllm/README-ZH.md
- pymllm/README.md
- pymllm/executor/model_runner.py
- pymllm/layers/rms_norm.py
- pymllm/models/qwen3_vl.py
- pymllm/orchestrator/detokenizer_process.py
- pymllm/orchestrator/model_runner_process.py
- pymllm/orchestrator/scheduler_process.py
- pymllm/orchestrator/tokenizer_process.py
- pymllm/quantization/methods/__init__.py
- pymllm/quantization/methods/compressed_tensors.py
- pymllm/server/launch.py
- pymllm/tests/test_compressed_tensors_config.py
- pymllm/tests/test_compressed_tensors_runtime.py
```python
def gptq_marlin_repack(
    b_q_weight: torch.Tensor,
    perm: Optional[torch.Tensor],
    size_k: int,
    size_n: int,
    num_bits: int,
) -> torch.Tensor:
    """Repack GPTQ/Compressed-Tensors weights into Marlin layout."""
    pack_factor = 32 // num_bits
    tile_size = 16
    out = torch.empty(
        (size_k // tile_size, size_n * tile_size // pack_factor),
        dtype=b_q_weight.dtype,
        device=b_q_weight.device,
    )
    kernel = _make_gptq_marlin_repack_kernel()
    perm_t = _normalize_perm(perm, size_k, b_q_weight.device)
    kernel(b_q_weight, perm_t, out, size_k, size_n, num_bits)
```
Reject unsupported size_k / size_n / num_bits inputs before allocating out.

Line 68 derives the output shape with floor division, so unsupported inputs are silently truncated instead of failing fast: a size_k tail is dropped, a non-64-multiple size_n can leave part of out uninitialized, and a non-divisor num_bits produces the wrong packing factor. Since this wrapper is now public, please validate num_bits, alignment, and b_q_weight.shape/dtype/device before the allocation and kernel launch.

Suggested guardrail

```diff
 def gptq_marlin_repack(
     b_q_weight: torch.Tensor,
     perm: Optional[torch.Tensor],
     size_k: int,
     size_n: int,
     num_bits: int,
 ) -> torch.Tensor:
     """Repack GPTQ/Compressed-Tensors weights into Marlin layout."""
+    if b_q_weight.dtype != torch.int32:
+        raise ValueError("b_q_weight must be int32")
+    if b_q_weight.device.type != "cuda":
+        raise ValueError("b_q_weight must live on CUDA")
+    if num_bits <= 0 or 32 % num_bits != 0:
+        raise ValueError("num_bits must be a positive divisor of 32")
+    if size_k % 16 != 0:
+        raise ValueError("size_k must be divisible by 16")
+    if size_n % 64 != 0:
+        raise ValueError("size_n must be divisible by 64")
+
     pack_factor = 32 // num_bits
+    expected_shape = (size_k // pack_factor, size_n)
+    if tuple(b_q_weight.shape) != expected_shape:
+        raise ValueError(f"b_q_weight must have shape {expected_shape}")
+
     tile_size = 16
     out = torch.empty(
         (size_k // tile_size, size_n * tile_size // pack_factor),
         dtype=b_q_weight.dtype,
         device=b_q_weight.device,
```

As per coding guidelines, "Validate inputs for public APIs and critical internal functions."
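The proposed checks can be exercised standalone; the helper below is a hypothetical distillation of the divisibility rules from the guardrail above (tile size 16, 32-bit packed words), not code from the PR:

```python
def validate_repack_args(size_k: int, size_n: int, num_bits: int,
                         tile_size: int = 16) -> int:
    """Fail fast on packing parameters that floor division would silently
    truncate; returns pack_factor on success."""
    if num_bits <= 0 or 32 % num_bits != 0:
        raise ValueError("num_bits must be a positive divisor of 32")
    pack_factor = 32 // num_bits
    if size_k % tile_size != 0:
        raise ValueError(f"size_k must be divisible by {tile_size}")
    if (size_n * tile_size) % pack_factor != 0:
        raise ValueError("size_n * tile_size must be divisible by pack_factor")
    return pack_factor

# Valid 4-bit case: pack_factor = 32 // 4 = 8
assert validate_repack_args(size_k=128, size_n=64, num_bits=4) == 8
```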
```python
x_fp32 = x.float()
var = x_fp32.pow(2).mean(dim=-1, keepdim=True)
x_norm = x_fp32 * torch.rsqrt(var + eps)
return x_norm.to(dtype=x.dtype) * weight
```
Apply the scale in FP32 before the final cast.
This helper downcasts x_norm before multiplying by weight, so the fallback loses precision and can even return the wrong dtype when weight.dtype != x.dtype. pymllm/layers/rms_norm_gated.py:40-61 already keeps both operands in FP32 and casts once at the end.
Suggested fix

```diff
 def _torch_rmsnorm(
     x: torch.Tensor,
     weight: torch.Tensor,
     eps: float,
 ) -> torch.Tensor:
     x_fp32 = x.float()
+    weight_fp32 = weight.float()
     var = x_fp32.pow(2).mean(dim=-1, keepdim=True)
-    x_norm = x_fp32 * torch.rsqrt(var + eps)
-    return x_norm.to(dtype=x.dtype) * weight
+    out = x_fp32 * torch.rsqrt(var + eps) * weight_fp32
+    return out.to(dtype=x.dtype)
```
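The ordering point generalizes beyond torch; a NumPy sketch (NumPy is used here only to keep the example dependency-light, so the function name is illustrative) shows the single-cast version keeping the result in x's dtype even when the weight is float32:

```python
import numpy as np

def rmsnorm_single_cast(x: np.ndarray, weight: np.ndarray,
                        eps: float = 1e-6) -> np.ndarray:
    """RMSNorm computed fully in float32, with one cast back to x.dtype."""
    x32 = x.astype(np.float32)
    w32 = weight.astype(np.float32)
    var = np.mean(x32 ** 2, axis=-1, keepdims=True)
    out = x32 * (1.0 / np.sqrt(var + eps)) * w32
    return out.astype(x.dtype)

x = np.ones((2, 4), dtype=np.float16)
w = np.full(4, 2.0, dtype=np.float32)   # weight dtype differs from x
out = rmsnorm_single_cast(x, w)
assert out.dtype == np.float16          # result follows x.dtype, not weight.dtype
```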
```python
except Exception:
    x = x + residual
    return _torch_rmsnorm(x, self.weight, self.eps), residual
```
Return the accumulated residual from the fallback path.
This branch computes x + residual but returns the old residual. pymllm/models/qwen3_5.py:206-211 threads that second tuple element straight into the next norm call, so taking this fallback leaves all subsequent blocks operating on stale residual state.
Suggested fix

```diff
 except Exception:
-    x = x + residual
-    return _torch_rmsnorm(x, self.weight, self.eps), residual
+    accumulated = x + residual
+    return _torch_rmsnorm(accumulated, self.weight, self.eps), accumulated
```
+ return _torch_rmsnorm(accumulated, self.weight, self.eps), accumulated📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| except Exception: | |
| x = x + residual | |
| return _torch_rmsnorm(x, self.weight, self.eps), residual | |
| except Exception: | |
| accumulated = x + residual | |
| return _torch_rmsnorm(accumulated, self.weight, self.eps), accumulated |
🧰 Tools
🪛 Ruff (0.15.9)
[warning] 52-52: Do not catch blind exception: Exception (BLE001)
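The effect of returning the stale residual is visible even with scalar stand-ins (the norm is dropped here for clarity; the function names are illustrative, not the pymllm API):

```python
def block_stale(x: float, residual: float):
    # Buggy: adds the residual but hands the *old* residual to the next block.
    x = x + residual
    return x, residual

def block_fixed(x: float, residual: float):
    # Fixed: the accumulated sum becomes both the output and the new residual.
    accumulated = x + residual
    return accumulated, accumulated

# Two chained blocks, starting from x=1.0 and residual=1.0.
x, r = block_stale(1.0, 1.0)   # x=2.0, but r is still 1.0
x, r = block_stale(x, r)       # x=3.0: the accumulation from step 1 was lost
stale_result = x

x, r = block_fixed(1.0, 1.0)   # x=2.0, r=2.0
x, r = block_fixed(x, r)       # x=4.0: residual state carried forward
fixed_result = x

assert stale_result == 3.0 and fixed_result == 4.0
```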
```python
def _cuda_timed_run(fn):
    if torch.cuda.is_available():
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        out = fn()
        end.record()
        torch.cuda.synchronize()
        return out, float(start.elapsed_time(end))
    else:
        import time
        t0 = time.perf_counter()
        out = fn()
        t1 = time.perf_counter()
        return out, float((t1 - t0) * 1000.0)
```
Don't synchronize the device on every timed forward.
torch.cuda.synchronize() here forces a full GPU sync in the model hot path, and this helper is used for both the vision prefill and LLM passes. That will reduce throughput and inflate latency; please make this opt-in, or at least defer elapsed-time collection to an existing synchronization point instead of doing it unconditionally on every batch.
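One shape the opt-in API could take, sketched without CUDA (timed_run and its sync flag are hypothetical; on GPU the sync=True branch is where torch.cuda.synchronize() and elapsed_time() would run):

```python
import time

def timed_run(fn, sync: bool = False):
    """Run fn and return (out, elapsed_ms) when sync=True, else (out, None).

    Sketch of the opt-in pattern: on CUDA the start/end events would be
    recorded unconditionally (cheap), but the device sync and elapsed-time
    readout happen only when the caller asks for them, so hot paths that
    never read the timing never pay for a full synchronize.
    """
    t0 = time.perf_counter()
    out = fn()
    if not sync:
        return out, None   # default path: no measurement cost
    return out, (time.perf_counter() - t0) * 1000.0

out, ms = timed_run(lambda: 21 * 2)             # default: no measurement
assert out == 42 and ms is None
out, ms = timed_run(lambda: 21 * 2, sync=True)  # opt-in: measured
assert out == 42 and ms >= 0.0
```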
```python
if "llm_decode_ms" in out:
    req.llm_decode_ms = out["llm_decode_ms"]
```
Accumulate llm_decode_ms instead of replacing it.
This setter runs once per decode batch, so multi-token requests end up returning only the last step's decode latency in the final timing payload.
💡 If this field is meant to be request-level timing

```diff
-    if "llm_decode_ms" in out:
-        req.llm_decode_ms = out["llm_decode_ms"]
+    if "llm_decode_ms" in out:
+        req.llm_decode_ms = (req.llm_decode_ms or 0.0) + float(
+            out["llm_decode_ms"]
+        )
```
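The accumulate-instead-of-overwrite pattern in isolation (Request here is a minimal stand-in for the scheduler's request object, not the real class):

```python
class Request:
    llm_decode_ms = None  # unset until the first decode step reports timing

def record_decode_ms(req: Request, out: dict) -> None:
    """Accumulate per-step decode latency; None-safe on the first step."""
    if "llm_decode_ms" in out:
        req.llm_decode_ms = (req.llm_decode_ms or 0.0) + float(out["llm_decode_ms"])

req = Request()
for step_ms in (3.0, 2.5, 4.5):            # three decode batches
    record_decode_ms(req, {"llm_decode_ms": step_ms})
assert req.llm_decode_ms == 10.0           # total, not just the last step
```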
```python
# If AutoProcessor produced multimodal input_ids, they must override
# the plain tokenizer result. Otherwise the prompt contains only a
# single image placeholder token and won't match the visual features.
if mm_inputs is not None:
    image_inputs = mm_inputs.get("image_inputs")
    if image_inputs is not None and "input_ids" in image_inputs:
        proc_input_ids = image_inputs["input_ids"]
        if hasattr(proc_input_ids, "ndim") and proc_input_ids.ndim > 1:
            proc_input_ids = proc_input_ids[0]
        if hasattr(proc_input_ids, "tolist"):
            input_ids = proc_input_ids.tolist()
        else:
            input_ids = list(proc_input_ids)
```
Keep the context-length guard on the multimodal path.
Once these processor IDs replace the text-tokenizer output, multimodal requests no longer inherit the truncation/length check applied earlier. A long image prompt can now exceed self._context_length and fail later in the runner instead of being capped or rejected here. At minimum, validate the replacement IDs before packing the request.
🧭 Minimal safeguard
```diff
 if hasattr(proc_input_ids, "tolist"):
     input_ids = proc_input_ids.tolist()
 else:
     input_ids = list(proc_input_ids)
+if (
+    self._context_length is not None
+    and len(input_ids) > self._context_length
+):
+    raise ValueError(
+        "Multimodal prompt exceeds context_length after processor tokenization"
+    )
```
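A standalone version of the guard this comment asks for (the function name and the truncate switch are illustrative; the real code would follow whichever policy the text path already uses):

```python
def enforce_context_length(input_ids, context_length, truncate=False):
    """Apply one length policy to processor-produced ids, mirroring the
    text-tokenizer path: cap the list or reject the request outright."""
    if context_length is None or len(input_ids) <= context_length:
        return input_ids
    if truncate:
        return input_ids[:context_length]
    raise ValueError(
        f"prompt length {len(input_ids)} exceeds context_length {context_length}"
    )

assert enforce_context_length([1, 2, 3], 4) == [1, 2, 3]
assert enforce_context_length([1, 2, 3, 4, 5], 3, truncate=True) == [1, 2, 3]
```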
```python
def verify_marlin_supported(group_size: int) -> None:
    if group_size not in MARLIN_SUPPORTED_GROUP_SIZES:
        raise ValueError(
            f"Unsupported compressed-tensors group_size: {group_size}"
        )
    if not torch.cuda.is_available():
        return
    major, minor = torch.cuda.get_device_capability()
    if major * 10 + minor < 80:
        raise ValueError("compressed-tensors Marlin requires SM80+")
```
🧩 Analysis chain

🌐 Web query: In PyTorch, if torch.cuda.get_device_capability() is called without a device argument, which CUDA device does it inspect?

💡 Result: It inspects the current CUDA device, as returned by torch.cuda.current_device.

Citations:
- https://pytorch.org/docs/2.5/_modules/torch/cuda.html
- https://github.com/pytorch/pytorch/blob/main/torch/cuda/_utils.py
- https://docs.pytorch.org/docs/1.10/_modules/torch/cuda.html
Re-validate Marlin device compatibility in the actual runtime device context.
The verify_marlin_supported() function calls torch.cuda.get_device_capability() without a device argument, which inspects only the current CUDA device, not the device where the layer actually resides. The process_weights_after_loading() method extracts the correct device from layer.weight_packed.device but never re-validates Marlin support for that device before executing CUDA operations. In workflows where initialization occurs on a different device than the runtime device (e.g., CPU initialization before GPU placement, or multi-GPU scenarios), this allows invalid configurations to pass validation and fail later at runtime.
Modify verify_marlin_supported() to accept and use a device parameter, and call it again at the start of process_weights_after_loading() with the actual layer.weight_packed.device.
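A dependency-injected sketch of the device-aware check (the group-size whitelist values and the get_capability hook are illustrative so the example runs without a GPU; in the real code the probe would be torch.cuda.get_device_capability(device)):

```python
MARLIN_SUPPORTED_GROUP_SIZES = (-1, 32, 64, 128)  # illustrative values

def verify_marlin_supported(group_size, device_type="cuda",
                            device_index=0, get_capability=None):
    """Validate against the device the weights actually live on, not the
    ambient current device. get_capability is injected for testability."""
    if group_size not in MARLIN_SUPPORTED_GROUP_SIZES:
        raise ValueError(f"Unsupported compressed-tensors group_size: {group_size}")
    if device_type != "cuda" or get_capability is None:
        return  # nothing to probe (e.g. CPU-side initialization)
    major, minor = get_capability(device_index)
    if major * 10 + minor < 80:
        raise ValueError("compressed-tensors Marlin requires SM80+")
```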
```diff
 # Preserve multimodal message structure for tokenizer.apply_chat_template.
 msg_dicts: List[Dict[str, Any]] = []
 for msg in messages:
     content = msg.content
     if isinstance(content, list):
-        # Multimodal: extract only text parts for the prompt string.
-        text_parts = [p.text for p in content if p.type == "text" and p.text]
-        content = "\n".join(text_parts) if text_parts else ""
+        mm_parts: List[Dict[str, Any]] = []
+        for part in content:
+            if part.type == "text" and part.text is not None:
+                mm_parts.append({"type": "text", "text": part.text})
+            elif part.type == "image_url" and part.image_url is not None:
+                # Keep image content so chat template can emit vision tokens.
+                mm_parts.append(
+                    {"type": "image", "image": part.image_url.url}
+                )
+        content = mm_parts
```
Fail fast when structured chat content cannot be templated.
After this change, content can remain a list of parts. If apply_chat_template() is unavailable or throws, the ChatML fallback below will stringify that list and silently drop image placeholders, so the request becomes malformed instead of being rejected.
💡 Suggested guard

```diff
@@
     if tokenizer is not None and hasattr(tokenizer, "apply_chat_template"):
         try:
             extra = dict(chat_template_kwargs) if chat_template_kwargs else {}
             return tokenizer.apply_chat_template(
                 msg_dicts,
                 tokenize=False,
                 add_generation_prompt=True,
                 **extra,
             )
         except Exception as e:
             logger.warning("apply_chat_template failed, using fallback: %s", e)
+            has_multimodal_content = any(
+                isinstance(m["content"], list)
+                and any(p.get("type") != "text" for p in m["content"])
+                for m in msg_dicts
+            )
+            if has_multimodal_content:
+                raise ValueError(
+                    "Multimodal chat content requires apply_chat_template() support"
+                ) from e
     # Fallback: ChatML format (Qwen-style)
     parts: List[str] = []
     for m in msg_dicts:
-        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
+        content = m["content"]
+        if isinstance(content, list):
+            content = "\n".join(
+                p["text"] for p in content if p.get("type") == "text"
+            )
+        parts.append(f"<|im_start|>{m['role']}\n{content}<|im_end|>")
```

As per coding guidelines, "Ensure functions that can fail return appropriate error codes or raise exceptions."
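The fail-fast behavior can be prototyped independently of the tokenizer (render_fallback and has_non_text_parts are hypothetical names for a ChatML-style fallback, not the launch.py code):

```python
def has_non_text_parts(msg_dicts) -> bool:
    """True if any message carries structured (non-text) content parts."""
    return any(
        isinstance(m.get("content"), list)
        and any(p.get("type") != "text" for p in m["content"])
        for m in msg_dicts
    )

def render_fallback(msg_dicts) -> str:
    """ChatML fallback that refuses to stringify image parts silently."""
    if has_non_text_parts(msg_dicts):
        raise ValueError("multimodal content requires apply_chat_template() support")
    parts = []
    for m in msg_dicts:
        content = m["content"]
        if isinstance(content, list):  # text-only structured content is safe
            content = "\n".join(p["text"] for p in content if p.get("type") == "text")
        parts.append(f"<|im_start|>{m['role']}\n{content}<|im_end|>")
    return "\n".join(parts)
```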
docs: add validated Jetson Orin environment to README

Summary
- Add compressed-tensors Marlin support in pymllm
- Add the gptq_marlin_repack Python JIT wrapper in mllm-kernel
- Update the pymllm README documents for the validated Jetson Orin workflow

Why
Qwen3-VL-2B-Instruct-AWQ-4bit uses compressed-tensors, but pymllm + mllm-kernel previously did not provide a complete formal Marlin-backed path for loading and serving this model. This PR adds the missing kernel wrapper, quantization runtime integration, config parsing fix, and usage documentation needed for the validated Jetson Orin workflow.

Key Changes
- Add the compressed-tensors quantization method in pymllm: CompressedTensorsConfig, CompressedTensorsLinearMethod, and the Marlin-backed weight scheme
- Add gptq_marlin_repack to the Python JIT wrapper layer in mllm-kernel
- Fix config.json -> quantization_config parsing
- Update pymllm/README.md and pymllm/README-ZH.md for the current Jetson Orin usage flow

Testing