
Add compressed-tensors Marlin support for Qwen3-VL AWQ in pymllm #663

Open
jialilve wants to merge 6 commits into UbiquitousLearning:main from jialilve:jetson-awq-qwen3vl

Conversation

@jialilve
Contributor

@jialilve jialilve commented Apr 5, 2026

Summary

  • add compressed-tensors Marlin support for Qwen3-VL AWQ in pymllm
  • add the missing gptq_marlin_repack Python JIT wrapper in mllm-kernel
  • add kernel and runtime tests for the new quantized path
  • add bilingual pymllm README documents for the validated Jetson Orin workflow

Why

Qwen3-VL-2B-Instruct-AWQ-4bit uses compressed-tensors, but pymllm + mllm-kernel previously
did not provide a complete Marlin-backed path for loading and serving this model.

This PR adds the missing kernel wrapper, quantization runtime integration, config parsing fix, and
usage documentation needed for the validated Jetson Orin workflow.

Key Changes

  • register a new compressed-tensors quantization method in pymllm
  • implement CompressedTensorsConfig, CompressedTensorsLinearMethod, and the Marlin-backed
    weight scheme
  • add gptq_marlin_repack to the Python JIT wrapper layer in mllm-kernel
  • add tests for Marlin kernel wrappers and compressed-tensors runtime/config behavior
  • fix quantization config loading from config.json -> quantization_config
  • add pymllm/README.md and pymllm/README-ZH.md for the current Jetson Orin usage flow
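The config-loading fix in the next-to-last bullet can be sketched as follows. This is a hypothetical, simplified version of the probe (the function name `load_quant_config` and candidate file names are illustrative, not the actual pymllm code): dedicated files store the quantization config at the top level, while `config.json` nests it under `quantization_config`.

```python
import json
from pathlib import Path
from typing import Optional


def load_quant_config(model_path: str) -> Optional[dict]:
    """Probe a checkpoint directory for its quantization config.

    Dedicated files such as quantize_config.json store the config at the
    top level, while config.json stores model metadata at the top level
    and nests quantization details under "quantization_config".
    """
    candidates = ["quantize_config.json", "quantization_config.json", "config.json"]
    for fname in candidates:
        fpath = Path(model_path) / fname
        if not fpath.exists():
            continue
        cfg = json.loads(fpath.read_text())
        if fname == "config.json":
            # The fix described above: unwrap the nested block instead of
            # returning the whole model config.
            return cfg.get("quantization_config")
        return cfg
    return None
```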

Testing

pytest -q \
  mllm-kernel/tests/test_gptq_marlin.py \
  mllm-kernel/tests/test_gptq_marlin_repack.py \
  pymllm/tests/test_compressed_tensors_config.py \
  pymllm/tests/test_compressed_tensors_runtime.py

Result:

- 27 passed

## Validated Configuration

Validated with:

- Qwen3-VL-2B-Instruct-AWQ-4bit
- compressed-tensors
- safetensors
- float16

## Notes

- the current implementation targets the validated compressed-tensors signature used by the target
  AWQ checkpoint
- the Jetson Orin setup is documented in pymllm/README.md and pymllm/README-ZH.md

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->
## Summary by CodeRabbit

* **New Features**
  * Compressed‑tensors 4‑bit GPTQ support with Marlin kernels and new repack kernel; public Python entrypoint exported
  * Timing metrics (vit_prefill_ms, llm_prefill_ms, llm_decode_ms) propagated and included in API responses

* **Bug Fixes**
  * RMSNorm: added robust PyTorch fallback when FlashInfer fails
  * Quant config loading: now unwraps nested quantization_config from config.json

* **Documentation**
  * Added Jetson Orin–targeted docs (README and README‑ZH)

* **Tests**
  * New CUDA test suites for Marlin GEMM and repack, plus compressed‑tensors runtime/config tests
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

@coderabbitai
Contributor

coderabbitai bot commented Apr 5, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: eed2afff-3148-431d-9c07-8451cf85a522

📥 Commits

Reviewing files that changed from the base of the PR and between 058dee2 and 25cdb3d.

📒 Files selected for processing (2)
  • pymllm/README-ZH.md
  • pymllm/README.md
✅ Files skipped from review due to trivial changes (1)
  • pymllm/README.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • pymllm/README-ZH.md

📝 Walkthrough

Walkthrough

Adds compressed-tensors GPTQ Marlin quantization (CUDA kernels and Python integration), new gptq_marlin_repack kernel and tests, timing instrumentation propagated through inference pipeline, multimodal/tokenizer/server adjustments, CMake CPM vendoring change, RMSNorm PyTorch fallback, and Jetson-oriented documentation.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **CMake Bootstrap**<br>`mllm-kernel/cmake/CPM.cmake` | Prefer vendored CPM.cmake relative to the script; fall back to download-only if absent; removed the copy/rename fallback; fail with FATAL_ERROR if CPMAddPackage is not available. |
| **Kernel Namespace & Includes**<br>`mllm-kernel/include/mllm_kernel/scalar_type.hpp`, `mllm-kernel/mllm_kernel/cuda/csrc/gemm/marlin/marlin.cuh` | Moved host symbols into `mllm_kernel::host` with an alias; updated the marlin header to include scalar_type.hpp and use host symbols. |
| **CUDA JIT: Marlin GEMM & Repack**<br>`mllm-kernel/mllm_kernel/cuda/jit/__init__.py`, `.../gptq_marlin.py`, `.../gptq_marlin_repack.py` | Exported new `gptq_marlin_repack`; changed kernel wrapper argument handling in gptq_marlin.py; added new `gptq_marlin_repack` wrapper with permutation validation, kernel compilation/caching, and a public API. |
| **Compressed-Tensors Quantization**<br>`pymllm/quantization/methods/compressed_tensors.py`, `pymllm/quantization/methods/__init__.py` | Added `CompressedTensorsConfig` and `CompressedTensorsLinearMethod` (registered as "compressed-tensors"); implements weight packing/repacking, Marlin workspace/index handling, scale permutation, and runtime GEMM invocation; re-exported symbols in the package init. |
| **Tests — Kernel & Quantization**<br>`mllm-kernel/tests/test_gptq_marlin.py`, `mllm-kernel/tests/test_gptq_marlin_repack.py`, `pymllm/tests/test_compressed_tensors_config.py`, `pymllm/tests/test_compressed_tensors_runtime.py` | Added CUDA-only and CPU tests covering gptq_marlin correctness, repack shape/dtype/perm validation, config parsing, runtime integration, and control-flow assertions (device/dtype/perm error cases). |
| **Performance Instrumentation**<br>`pymllm/models/qwen3_vl.py`, `pymllm/executor/model_runner.py`, `pymllm/orchestrator/model_runner_process.py`, `pymllm/orchestrator/scheduler_process.py`, `pymllm/orchestrator/detokenizer_process.py` | Introduced `_cuda_timed_run`; capture/store vit/llm prefill/decode ms on ForwardBatch; propagate timing fields through scheduler → detokenizer → responses; added guards when caches lack page_size. |
| **Multimodal & Tokenization**<br>`pymllm/orchestrator/tokenizer_process.py`, `pymllm/server/launch.py` | Preserve structured multimodal content for the tokenizer (image/text typed entries) and include timing metadata in non-streaming API responses. |
| **RMSNorm Fallback**<br>`pymllm/layers/rms_norm.py` | Added `_torch_rmsnorm` pure-PyTorch fallback and try/except around FlashInfer paths to ensure a safe fallback on errors. |
| **Docs & Misc**<br>`pymllm/README.md`, `pymllm/README-ZH.md` | Added Jetson Orin–focused documentation (environment, install, server commands, examples) in English and Chinese. |
| **Small API/Helpers**<br>`pymllm/quantization/methods/__init__.py`, `pymllm/executor/model_runner.py`, `mllm-kernel/include/.../scalar_type.hpp` | Re-exports added for compressed-tensors config/method; `ModelRunner._load_quant_config_dict` unwraps quantization_config from config.json; namespace aliasing added for backward compatibility. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • chenghuaWang
  • oreomaker
  • liang1232018

Poem

🐰
I hop through kernels, bits in tow,
Repacking weights, I watch them flow,
Timings tick from vision to text,
Marlin dances, packed and hexed,
Jetson hums — hooray, let's go!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 22.08%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically identifies the main change: adding compressed-tensors Marlin support for a specific model (Qwen3-VL AWQ) to the pymllm package. |
| Description check | ✅ Passed | The PR description is comprehensive and well structured, with clear sections covering Summary, Why, Key Changes, Testing, and Validated Configuration. It directly addresses the template's requirement for clarity and completeness. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

🧹 Nitpick comments (2)
pymllm/executor/model_runner.py (1)

490-495: Make config.json a true fallback.

Handling it inside the main candidate loop makes precedence depend on the ordering of the candidate list. Since every HF checkpoint has a config.json, encountering it early makes the rest of the probe unreachable, so the more specific quantization metadata can never win.

♻️ Suggested simplification
```diff
         for fname in unique:
+            if fname == "config.json":
+                continue
             fpath = model_path / fname
             if fpath.exists():
                 with open(fpath) as fp:
                     cfg = json.load(fp)
-                # config.json stores model metadata at the top level and
-                # nests quantization details under quantization_config.
-                if fname == "config.json" and "quantization_config" in cfg:
-                    return cfg["quantization_config"]
                 return cfg
```

Keep the existing explicit config.json -> quantization_config block below as the only fallback path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/executor/model_runner.py` around lines 490 - 495, The config.json
handling inside the main candidate loop (the block that checks fname ==
"config.json" and returns cfg["quantization_config"]) makes config.json a
premature match and prevents more specific probes from being considered; remove
that in-loop special-case and instead handle "config.json" ->
"quantization_config" only as the final fallback path after the candidate/probe
loop completes (keep the existing fallback block below). Update references to
cfg/fname accordingly so the loop always continues for other candidates and only
uses config.json's quantization_config when no other probe matched.
pymllm/quantization/methods/compressed_tensors.py (1)

89-97: Use buffers for Marlin runtime tensors instead of mixing plain attributes with Parameters.

Line 230 stores workspace as a plain tensor attribute, which won't be moved by module.to()/cuda(). Lines 231-233 create empty Parameters for runtime metadata, causing them to appear in parameters()/state_dict() despite not being checkpoint-backed weights. All four tensors should be registered as non-persistent buffers, which ensures they move with the module while remaining excluded from the parameter list and checkpoint.
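A minimal sketch of the suggested pattern (the module and tensor names below are illustrative, not the actual compressed_tensors.py code): registering runtime tensors as non-persistent buffers keeps them out of `parameters()` and the state dict while still letting `.to()`/`.cuda()` move them with the module.

```python
import torch
import torch.nn as nn


class MarlinLinearStub(nn.Module):
    """Illustrative module holding Marlin runtime scratch tensors."""

    def __init__(self, out_features: int = 64):
        super().__init__()
        # Checkpoint-backed weight stays a Parameter.
        self.weight_packed = nn.Parameter(
            torch.zeros(out_features, 16), requires_grad=False
        )
        # Runtime-only tensors become non-persistent buffers: they move
        # with the module on .to()/.cuda(), but are excluded from
        # state_dict() and do not appear in parameters().
        self.register_buffer(
            "workspace", torch.zeros(512, dtype=torch.int32), persistent=False
        )
        self.register_buffer(
            "g_idx", torch.empty(0, dtype=torch.int32), persistent=False
        )


m = MarlinLinearStub()
param_names = {name for name, _ in m.named_parameters()}
state_keys = set(m.state_dict().keys())
```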

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@mllm-kernel/mllm_kernel/cuda/jit/gptq_marlin_repack.py`:
- Around line 56-74: The function gptq_marlin_repack currently computes output
shape using floor division and can silently truncate/leave uninitialized data;
before allocating out or calling kernel, validate inputs: ensure num_bits
divides 32 evenly (so pack_factor = 32 // num_bits is integer and >0), ensure
size_k is a multiple of tile_size (16), ensure (size_n * tile_size) is divisible
by pack_factor (so out shape is exact), ensure b_q_weight.shape matches (size_k,
size_n) and its dtype/device are used for out, and ensure perm (if provided) has
length size_k; raise a clear ValueError on any violation and only then call
_normalize_perm, _make_gptq_marlin_repack_kernel(), allocate out and invoke
kernel.

In `@pymllm/layers/rms_norm.py`:
- Around line 52-54: The except branch in the RMSNorm fallback currently
computes x = x + residual but returns the old residual, causing downstream
blocks to receive stale residual state; update the except block in
pymllm/layers/rms_norm.py so that after computing x = x + residual you set a
new_residual (or reuse x) and return (_torch_rmsnorm(new_residual, self.weight,
self.eps), new_residual) instead of returning the old residual so subsequent
calls (e.g., in pymllm/models/qwen3_5.py flows) get the updated accumulated
residual.
- Around line 18-21: The code currently casts x_norm back to x.dtype before
multiplying by weight, which loses precision and can produce wrong dtype when
weight.dtype != x.dtype; change the order so you compute x_fp32, var, and x_norm
in FP32, cast weight to float32 (weight_fp32 = weight.to(torch.float32) or
weight.float()), multiply x_norm * weight_fp32 while still in FP32, then cast
the final product once to x.dtype before returning; locate symbols x_fp32, var,
x_norm, weight and eps in the RMSNorm implementation and apply this single-cast
fix consistent with rms_norm_gated.py.

In `@pymllm/models/qwen3_vl.py`:
- Around line 980-994: The helper _cuda_timed_run currently calls
torch.cuda.synchronize() on every invocation which forces a GPU sync in hot
paths (vision prefill and LLM forward); change it so synchronization is opt-in
or deferred: add a parameter like sync=False to _cuda_timed_run (or
alternatively return the start/end torch.cuda.Event objects and the raw output)
and only call torch.cuda.synchronize() and compute elapsed time when sync=True
(or when the caller performs synchronization at an existing barrier), ensuring
the default path avoids unconditional torch.cuda.synchronize() and preserves
existing return shapes by documenting the new return form if you choose the
event-return approach.

In `@pymllm/orchestrator/scheduler_process.py`:
- Around line 792-793: The code currently overwrites req.llm_decode_ms with
out["llm_decode_ms"] inside the decode-step loop; change this to accumulate
per-step timings by adding the step value to the existing request-level value
(e.g., req.llm_decode_ms = (req.llm_decode_ms or 0) + out["llm_decode_ms"]) so
each decode batch increments req.llm_decode_ms instead of replacing it; update
the setter site where "llm_decode_ms" is read from out (the block referencing
req.llm_decode_ms and out) to handle missing/None initial values safely.

In `@pymllm/orchestrator/tokenizer_process.py`:
- Around line 374-387: The multimodal branch replaces text tokenizer output with
processor IDs (mm_inputs -> image_inputs -> proc_input_ids) but skips the
earlier context-length guard; validate proc_input_ids against
self._context_length and either truncate or raise consistently with the text
path before assigning input_ids. Locate the block handling
mm_inputs/image_inputs/proc_input_ids in tokenizer_process.py and after
converting proc_input_ids to a Python list apply the same length-check logic
used earlier (use self._context_length to cap the list or raise a clear
exception), ensuring behavior matches the existing text-tokenizer
truncation/rejection policy.

In `@pymllm/quantization/methods/compressed_tensors.py`:
- Around line 61-70: Change verify_marlin_supported to accept a device argument
and use that device when checking CUDA capability (instead of calling
torch.cuda.get_device_capability() with no device), and update any callers
accordingly; specifically, update verify_marlin_supported(group_size: int,
device: torch.device) to first check MARLIN_SUPPORTED_GROUP_SIZES then if
device.type == "cuda" call torch.cuda.get_device_capability(device) and enforce
SM80+, and add an extra call to the new signature at the start of
process_weights_after_loading() using layer.weight_packed.device so validation
runs for the runtime device.

In `@pymllm/README-ZH.md`:
- Around line 5-16: The "## 环境配置 ToDo" section lacks validated Jetson
environment details; replace the checklist with a concrete validated matrix or a
link to a follow-up issue and populate the table with exact JetPack/L4T,
CUDA/cuDNN/TensorRT, Python/pip/venv or conda,
PyTorch/torchvision/transformers/safetensors, flashinfer version/installation,
apt dependencies, memory/GPU suggestions and required ENV vars; specifically
update the "## 环境配置 ToDo" header and the checklist items (e.g., JetPack / L4T,
CUDA / cuDNN / TensorRT, Python / pip / venv 或 conda, PyTorch / torchvision /
transformers / safetensors, flashinfer 版本与安装方式) with validated values or convert
them into a compact validated-version table and add a note linking to a
follow-up issue if full validation will occur later.

In `@pymllm/server/launch.py`:
- Around line 473-487: The code builds multimodal content lists in the messages
loop (mm_parts -> content) but may leave content as a list which later gets
stringified by the ChatML fallback and silently drops images; update the path
around tokenizer.apply_chat_template and the ChatML fallback to fail fast: after
constructing mm_parts (and before falling back to ChatML) check that
tokenizer.apply_chat_template exists and can accept list-form content, attempt
to call it in a try/except, and if it is missing or raises, immediately raise an
explicit exception (e.g., ValueError/RuntimeError) indicating structured chat
content could not be templated for the request instead of allowing silent
stringification; reference the variables/functions messages, mm_parts/content,
tokenizer.apply_chat_template, and the ChatML fallback so the change is made
where list content is prepared and before any stringification occurs.

---

Nitpick comments:
In `@pymllm/executor/model_runner.py`:
- Around line 490-495: The config.json handling inside the main candidate loop
(the block that checks fname == "config.json" and returns
cfg["quantization_config"]) makes config.json a premature match and prevents
more specific probes from being considered; remove that in-loop special-case and
instead handle "config.json" -> "quantization_config" only as the final fallback
path after the candidate/probe loop completes (keep the existing fallback block
below). Update references to cfg/fname accordingly so the loop always continues
for other candidates and only uses config.json's quantization_config when no
other probe matched.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5cd8b503-101e-4acb-822f-2d4c14c9ed71

📥 Commits

Reviewing files that changed from the base of the PR and between 690b859 and 058dee2.

📒 Files selected for processing (22)
  • mllm-kernel/cmake/CPM.cmake
  • mllm-kernel/include/mllm_kernel/scalar_type.hpp
  • mllm-kernel/mllm_kernel/cuda/csrc/gemm/marlin/marlin.cuh
  • mllm-kernel/mllm_kernel/cuda/jit/__init__.py
  • mllm-kernel/mllm_kernel/cuda/jit/gptq_marlin.py
  • mllm-kernel/mllm_kernel/cuda/jit/gptq_marlin_repack.py
  • mllm-kernel/tests/test_gptq_marlin.py
  • mllm-kernel/tests/test_gptq_marlin_repack.py
  • pymllm/README-ZH.md
  • pymllm/README.md
  • pymllm/executor/model_runner.py
  • pymllm/layers/rms_norm.py
  • pymllm/models/qwen3_vl.py
  • pymllm/orchestrator/detokenizer_process.py
  • pymllm/orchestrator/model_runner_process.py
  • pymllm/orchestrator/scheduler_process.py
  • pymllm/orchestrator/tokenizer_process.py
  • pymllm/quantization/methods/__init__.py
  • pymllm/quantization/methods/compressed_tensors.py
  • pymllm/server/launch.py
  • pymllm/tests/test_compressed_tensors_config.py
  • pymllm/tests/test_compressed_tensors_runtime.py

Comment on lines +56 to +74
```python
def gptq_marlin_repack(
    b_q_weight: torch.Tensor,
    perm: Optional[torch.Tensor],
    size_k: int,
    size_n: int,
    num_bits: int,
) -> torch.Tensor:
    """Repack GPTQ/Compressed-Tensors weights into Marlin layout."""

    pack_factor = 32 // num_bits
    tile_size = 16
    out = torch.empty(
        (size_k // tile_size, size_n * tile_size // pack_factor),
        dtype=b_q_weight.dtype,
        device=b_q_weight.device,
    )
    kernel = _make_gptq_marlin_repack_kernel()
    perm_t = _normalize_perm(perm, size_k, b_q_weight.device)
    kernel(b_q_weight, perm_t, out, size_k, size_n, num_bits)
```
Contributor


⚠️ Potential issue | 🟠 Major

Reject unsupported size_k/size_n/num_bits inputs before allocating out.

Line 68 derives the output shape with floor division, so unsupported inputs are silently truncated instead of failing fast: a size_k tail is dropped, a non-64-multiple size_n can leave part of out uninitialized, and a non-divisor num_bits produces the wrong packing factor. Since this wrapper is now public, please validate num_bits, alignment, and b_q_weight.shape/dtype/device before the allocation and kernel launch.

Suggested guardrail
```diff
 def gptq_marlin_repack(
     b_q_weight: torch.Tensor,
     perm: Optional[torch.Tensor],
     size_k: int,
     size_n: int,
     num_bits: int,
 ) -> torch.Tensor:
     """Repack GPTQ/Compressed-Tensors weights into Marlin layout."""

+    if b_q_weight.dtype != torch.int32:
+        raise ValueError("b_q_weight must be int32")
+    if b_q_weight.device.type != "cuda":
+        raise ValueError("b_q_weight must live on CUDA")
+    if num_bits <= 0 or 32 % num_bits != 0:
+        raise ValueError("num_bits must be a positive divisor of 32")
+    if size_k % 16 != 0:
+        raise ValueError("size_k must be divisible by 16")
+    if size_n % 64 != 0:
+        raise ValueError("size_n must be divisible by 64")
+
     pack_factor = 32 // num_bits
+    expected_shape = (size_k // pack_factor, size_n)
+    if tuple(b_q_weight.shape) != expected_shape:
+        raise ValueError(f"b_q_weight must have shape {expected_shape}")
+
     tile_size = 16
     out = torch.empty(
         (size_k // tile_size, size_n * tile_size // pack_factor),
         dtype=b_q_weight.dtype,
         device=b_q_weight.device,
```

As per coding guidelines, "Validate inputs for public APIs and critical internal functions."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@mllm-kernel/mllm_kernel/cuda/jit/gptq_marlin_repack.py` around lines 56 - 74,
The function gptq_marlin_repack currently computes output shape using floor
division and can silently truncate/leave uninitialized data; before allocating
out or calling kernel, validate inputs: ensure num_bits divides 32 evenly (so
pack_factor = 32 // num_bits is integer and >0), ensure size_k is a multiple of
tile_size (16), ensure (size_n * tile_size) is divisible by pack_factor (so out
shape is exact), ensure b_q_weight.shape matches (size_k, size_n) and its
dtype/device are used for out, and ensure perm (if provided) has length size_k;
raise a clear ValueError on any violation and only then call _normalize_perm,
_make_gptq_marlin_repack_kernel(), allocate out and invoke kernel.

Comment on lines +18 to +21
```python
x_fp32 = x.float()
var = x_fp32.pow(2).mean(dim=-1, keepdim=True)
x_norm = x_fp32 * torch.rsqrt(var + eps)
return x_norm.to(dtype=x.dtype) * weight
```
Contributor


⚠️ Potential issue | 🟠 Major

Apply the scale in FP32 before the final cast.

This helper downcasts x_norm before multiplying by weight, so the fallback loses precision and can even return the wrong dtype when weight.dtype != x.dtype. pymllm/layers/rms_norm_gated.py:40-61 already keeps both operands in FP32 and casts once at the end.

Suggested fix
```diff
 def _torch_rmsnorm(
     x: torch.Tensor,
     weight: torch.Tensor,
     eps: float,
 ) -> torch.Tensor:
     x_fp32 = x.float()
+    weight_fp32 = weight.float()
     var = x_fp32.pow(2).mean(dim=-1, keepdim=True)
-    x_norm = x_fp32 * torch.rsqrt(var + eps)
-    return x_norm.to(dtype=x.dtype) * weight
+    out = x_fp32 * torch.rsqrt(var + eps) * weight_fp32
+    return out.to(dtype=x.dtype)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/layers/rms_norm.py` around lines 18 - 21, The code currently casts
x_norm back to x.dtype before multiplying by weight, which loses precision and
can produce wrong dtype when weight.dtype != x.dtype; change the order so you
compute x_fp32, var, and x_norm in FP32, cast weight to float32 (weight_fp32 =
weight.to(torch.float32) or weight.float()), multiply x_norm * weight_fp32 while
still in FP32, then cast the final product once to x.dtype before returning;
locate symbols x_fp32, var, x_norm, weight and eps in the RMSNorm implementation
and apply this single-cast fix consistent with rms_norm_gated.py.

Comment on lines +52 to +54
```python
except Exception:
    x = x + residual
    return _torch_rmsnorm(x, self.weight, self.eps), residual
```
Contributor


⚠️ Potential issue | 🔴 Critical

Return the accumulated residual from the fallback path.

This branch computes x + residual but returns the old residual. pymllm/models/qwen3_5.py:206-211 threads that second tuple element straight into the next norm call, so taking this fallback leaves all subsequent blocks operating on stale residual state.

Suggested fix
```diff
             except Exception:
-                x = x + residual
-                return _torch_rmsnorm(x, self.weight, self.eps), residual
+                accumulated = x + residual
+                return _torch_rmsnorm(accumulated, self.weight, self.eps), accumulated
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```diff
             except Exception:
-                x = x + residual
-                return _torch_rmsnorm(x, self.weight, self.eps), residual
+                accumulated = x + residual
+                return _torch_rmsnorm(accumulated, self.weight, self.eps), accumulated
```
🧰 Tools
🪛 Ruff (0.15.9)

[warning] 52-52: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/layers/rms_norm.py` around lines 52 - 54, The except branch in the
RMSNorm fallback currently computes x = x + residual but returns the old
residual, causing downstream blocks to receive stale residual state; update the
except block in pymllm/layers/rms_norm.py so that after computing x = x +
residual you set a new_residual (or reuse x) and return
(_torch_rmsnorm(new_residual, self.weight, self.eps), new_residual) instead of
returning the old residual so subsequent calls (e.g., in
pymllm/models/qwen3_5.py flows) get the updated accumulated residual.

Comment on lines +980 to +994
```python
def _cuda_timed_run(fn):
    if torch.cuda.is_available():
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        out = fn()
        end.record()
        torch.cuda.synchronize()
        return out, float(start.elapsed_time(end))
    else:
        import time
        t0 = time.perf_counter()
        out = fn()
        t1 = time.perf_counter()
        return out, float((t1 - t0) * 1000.0)
```
Contributor


⚠️ Potential issue | 🟠 Major

Don't synchronize the device on every timed forward.

torch.cuda.synchronize() here forces a full GPU sync in the model hot path, and this helper is used for both the vision prefill and LLM passes. That will reduce throughput and inflate latency; please make this opt-in, or at least defer elapsed-time collection to an existing synchronization point instead of doing it unconditionally on every batch.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/models/qwen3_vl.py` around lines 980 - 994, The helper _cuda_timed_run
currently calls torch.cuda.synchronize() on every invocation which forces a GPU
sync in hot paths (vision prefill and LLM forward); change it so synchronization
is opt-in or deferred: add a parameter like sync=False to _cuda_timed_run (or
alternatively return the start/end torch.cuda.Event objects and the raw output)
and only call torch.cuda.synchronize() and compute elapsed time when sync=True
(or when the caller performs synchronization at an existing barrier), ensuring
the default path avoids unconditional torch.cuda.synchronize() and preserves
existing return shapes by documenting the new return form if you choose the
event-return approach.

Comment on lines +792 to +793
```python
if "llm_decode_ms" in out:
    req.llm_decode_ms = out["llm_decode_ms"]
```
Contributor


⚠️ Potential issue | 🟡 Minor

Accumulate llm_decode_ms instead of replacing it.

This setter runs once per decode batch, so multi-token requests end up returning only the last step's decode latency in the final timing payload.

💡 If this field is meant to be request-level timing
```diff
-            if "llm_decode_ms" in out:
-                req.llm_decode_ms = out["llm_decode_ms"]
+            if "llm_decode_ms" in out:
+                req.llm_decode_ms = (req.llm_decode_ms or 0.0) + float(
+                    out["llm_decode_ms"]
+                )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/orchestrator/scheduler_process.py` around lines 792 - 793, The code
currently overwrites req.llm_decode_ms with out["llm_decode_ms"] inside the
decode-step loop; change this to accumulate per-step timings by adding the step
value to the existing request-level value (e.g., req.llm_decode_ms =
(req.llm_decode_ms or 0) + out["llm_decode_ms"]) so each decode batch increments
req.llm_decode_ms instead of replacing it; update the setter site where
"llm_decode_ms" is read from out (the block referencing req.llm_decode_ms and
out) to handle missing/None initial values safely.

Comment on lines +374 to +387
```python
# If AutoProcessor produced multimodal input_ids, they must override
# the plain tokenizer result. Otherwise the prompt contains only a
# single image placeholder token and won't match the visual features.
if mm_inputs is not None:
    image_inputs = mm_inputs.get("image_inputs")
    if image_inputs is not None and "input_ids" in image_inputs:
        proc_input_ids = image_inputs["input_ids"]
        if hasattr(proc_input_ids, "ndim") and proc_input_ids.ndim > 1:
            proc_input_ids = proc_input_ids[0]
        if hasattr(proc_input_ids, "tolist"):
            input_ids = proc_input_ids.tolist()
        else:
            input_ids = list(proc_input_ids)
```
Contributor


⚠️ Potential issue | 🟠 Major

Keep the context-length guard on the multimodal path.

Once these processor IDs replace the text-tokenizer output, multimodal requests no longer inherit the truncation/length check applied earlier. A long image prompt can now exceed self._context_length and fail later in the runner instead of being capped or rejected here. At minimum, validate the replacement IDs before packing the request.

🧭 Minimal safeguard
```diff
                 if hasattr(proc_input_ids, "tolist"):
                     input_ids = proc_input_ids.tolist()
                 else:
                     input_ids = list(proc_input_ids)
+                if (
+                    self._context_length is not None
+                    and len(input_ids) > self._context_length
+                ):
+                    raise ValueError(
+                        "Multimodal prompt exceeds context_length after processor tokenization"
+                    )
```
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
         # If AutoProcessor produced multimodal input_ids, they must override
         # the plain tokenizer result. Otherwise the prompt contains only a
         # single image placeholder token and won't match the visual features.
         if mm_inputs is not None:
             image_inputs = mm_inputs.get("image_inputs")
             if image_inputs is not None and "input_ids" in image_inputs:
                 proc_input_ids = image_inputs["input_ids"]
                 if hasattr(proc_input_ids, "ndim") and proc_input_ids.ndim > 1:
                     proc_input_ids = proc_input_ids[0]
                 if hasattr(proc_input_ids, "tolist"):
                     input_ids = proc_input_ids.tolist()
                 else:
                     input_ids = list(proc_input_ids)
+                if (
+                    self._context_length is not None
+                    and len(input_ids) > self._context_length
+                ):
+                    raise ValueError(
+                        "Multimodal prompt exceeds context_length after processor tokenization"
+                    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/orchestrator/tokenizer_process.py` around lines 374 - 387, The
multimodal branch replaces text tokenizer output with processor IDs (mm_inputs
-> image_inputs -> proc_input_ids) but skips the earlier context-length guard;
validate proc_input_ids against self._context_length and either truncate or
raise consistently with the text path before assigning input_ids. Locate the
block handling mm_inputs/image_inputs/proc_input_ids in tokenizer_process.py and
after converting proc_input_ids to a Python list apply the same length-check
logic used earlier (use self._context_length to cap the list or raise a clear
exception), ensuring behavior matches the existing text-tokenizer
truncation/rejection policy.
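The "cap or reject" policy the comment asks to mirror can be sketched as a small helper. `enforce_context_length` is a hypothetical name introduced here for illustration; the real code would read `self._context_length` and follow whichever policy the text path already uses.

```python
# Hedged sketch of the missing guard: either truncate the replacement
# input_ids to context_length or reject them, mirroring the text path.
def enforce_context_length(input_ids, context_length, truncate=True):
    if context_length is None or len(input_ids) <= context_length:
        return input_ids
    if truncate:
        return input_ids[:context_length]
    raise ValueError(
        "Multimodal prompt exceeds context_length after processor tokenization"
    )

print(enforce_context_length(list(range(6)), 4))  # → [0, 1, 2, 3]
```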

Comment on lines +61 to +70
def verify_marlin_supported(group_size: int) -> None:
    if group_size not in MARLIN_SUPPORTED_GROUP_SIZES:
        raise ValueError(
            f"Unsupported compressed-tensors group_size: {group_size}"
        )
    if not torch.cuda.is_available():
        return
    major, minor = torch.cuda.get_device_capability()
    if major * 10 + minor < 80:
        raise ValueError("compressed-tensors Marlin requires SM80+")

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

In PyTorch, if torch.cuda.get_device_capability() is called without a device argument, which CUDA device does it inspect?

💡 Result:

It inspects the current CUDA device, as returned by torch.cuda.current_device().

🏁 Script executed:

# First, locate the file and check its structure
find . -path "*pymllm/quantization/methods/compressed_tensors.py" -type f

Repository: UbiquitousLearning/mllm


🏁 Script executed:

# Read the relevant sections of the file
if [ -f "pymllm/quantization/methods/compressed_tensors.py" ]; then
  echo "=== Lines 61-75 (verify_marlin_supported) ===" 
  sed -n '61,75p' pymllm/quantization/methods/compressed_tensors.py
  echo ""
  echo "=== Lines 85-105 (process_weights_after_loading area) ===" 
  sed -n '85,105p' pymllm/quantization/methods/compressed_tensors.py
  echo ""
  echo "=== Lines 215-235 (second mentioned range) ===" 
  sed -n '215,235p' pymllm/quantization/methods/compressed_tensors.py
fi



🏁 Script executed:

# Check the total file length to understand context
wc -l pymllm/quantization/methods/compressed_tensors.py



Re-validate Marlin device compatibility in the actual runtime device context.

The verify_marlin_supported() function calls torch.cuda.get_device_capability() without a device argument, which inspects only the current CUDA device, not the device where the layer actually resides. The process_weights_after_loading() method extracts the correct device from layer.weight_packed.device but never re-validates Marlin support for that device before executing CUDA operations. In workflows where initialization occurs on a different device than the runtime device (e.g., CPU initialization before GPU placement, or multi-GPU scenarios), this allows invalid configurations to pass validation and fail later at runtime.

Modify verify_marlin_supported() to accept and use a device parameter, and call it again at the start of process_weights_after_loading() with the actual layer.weight_packed.device.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/quantization/methods/compressed_tensors.py` around lines 61 - 70,
Change verify_marlin_supported to accept a device argument and use that device
when checking CUDA capability (instead of calling
torch.cuda.get_device_capability() with no device), and update any callers
accordingly; specifically, update verify_marlin_supported(group_size: int,
device: torch.device) to first check MARLIN_SUPPORTED_GROUP_SIZES then if
device.type == "cuda" call torch.cuda.get_device_capability(device) and enforce
SM80+, and add an extra call to the new signature at the start of
process_weights_after_loading() using layer.weight_packed.device so validation
runs for the runtime device.
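The device-aware re-validation the comment proposes can be sketched as follows. To keep the sketch runnable without a GPU, the capability lookup is injected rather than calling torch.cuda.get_device_capability(device) directly, and the group-size tuple is an assumption, not the PR's actual MARLIN_SUPPORTED_GROUP_SIZES.

```python
# Sketch of a device-aware verify_marlin_supported. get_capability stands in
# for torch.cuda.get_device_capability(device); the group sizes below are an
# assumption for illustration only.
MARLIN_SUPPORTED_GROUP_SIZES = (-1, 32, 64, 128)

def verify_marlin_supported(group_size, get_capability=None):
    if group_size not in MARLIN_SUPPORTED_GROUP_SIZES:
        raise ValueError(
            f"Unsupported compressed-tensors group_size: {group_size}"
        )
    if get_capability is None:
        return  # no CUDA device visible; nothing to validate
    major, minor = get_capability()
    if major * 10 + minor < 80:
        raise ValueError("compressed-tensors Marlin requires SM80+")

# Jetson Orin reports SM87, so this passes:
verify_marlin_supported(128, get_capability=lambda: (8, 7))
```

Calling the check again at the start of process_weights_after_loading() with the capability of layer.weight_packed.device would then catch init-on-CPU, run-on-GPU mismatches.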

Comment on lines +473 to +487
 # Preserve multimodal message structure for tokenizer.apply_chat_template.
 msg_dicts: List[Dict[str, Any]] = []
 for msg in messages:
     content = msg.content
     if isinstance(content, list):
-        # Multimodal: extract only text parts for the prompt string.
-        text_parts = [p.text for p in content if p.type == "text" and p.text]
-        content = "\n".join(text_parts) if text_parts else ""
+        mm_parts: List[Dict[str, Any]] = []
+        for part in content:
+            if part.type == "text" and part.text is not None:
+                mm_parts.append({"type": "text", "text": part.text})
+            elif part.type == "image_url" and part.image_url is not None:
+                # Keep image content so chat template can emit vision tokens.
+                mm_parts.append(
+                    {"type": "image", "image": part.image_url.url}
+                )
+        content = mm_parts

⚠️ Potential issue | 🟠 Major

Fail fast when structured chat content cannot be templated.

After this change, content can remain a list of parts. If apply_chat_template() is unavailable or throws, the ChatML fallback below will stringify that list and silently drop image placeholders, so the request becomes malformed instead of being rejected.

💡 Suggested guard
@@
     if tokenizer is not None and hasattr(tokenizer, "apply_chat_template"):
         try:
             extra = dict(chat_template_kwargs) if chat_template_kwargs else {}
             return tokenizer.apply_chat_template(
                 msg_dicts,
                 tokenize=False,
                 add_generation_prompt=True,
                 **extra,
             )
         except Exception as e:
             logger.warning("apply_chat_template failed, using fallback: %s", e)
+            has_multimodal_content = any(
+                isinstance(m["content"], list)
+                and any(p.get("type") != "text" for p in m["content"])
+                for m in msg_dicts
+            )
+            if has_multimodal_content:
+                raise ValueError(
+                    "Multimodal chat content requires apply_chat_template() support"
+                ) from e
 
     # Fallback: ChatML format (Qwen-style)
     parts: List[str] = []
     for m in msg_dicts:
-        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
+        content = m["content"]
+        if isinstance(content, list):
+            content = "\n".join(
+                p["text"] for p in content if p.get("type") == "text"
+            )
+        parts.append(f"<|im_start|>{m['role']}\n{content}<|im_end|>")

As per coding guidelines, "Ensure functions that can fail return appropriate error codes or raise exceptions."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pymllm/server/launch.py` around lines 473 - 487, The code builds multimodal
content lists in the messages loop (mm_parts -> content) but may leave content
as a list which later gets stringified by the ChatML fallback and silently drops
images; update the path around tokenizer.apply_chat_template and the ChatML
fallback to fail fast: after constructing mm_parts (and before falling back to
ChatML) check that tokenizer.apply_chat_template exists and can accept list-form
content, attempt to call it in a try/except, and if it is missing or raises,
immediately raise an explicit exception (e.g., ValueError/RuntimeError)
indicating structured chat content could not be templated for the request
instead of allowing silent stringification; reference the variables/functions
messages, mm_parts/content, tokenizer.apply_chat_template, and the ChatML
fallback so the change is made where list content is prepared and before any
stringification occurs.
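The fail-fast detection in the suggested guard boils down to a small predicate over msg_dicts, sketched here on its own. The dict shapes match the diff above, but this is an illustration, not the PR's code.

```python
# Detect non-text parts in list-form message content, so the ChatML string
# fallback can refuse multimodal requests instead of silently dropping images.
def has_multimodal_content(msg_dicts):
    return any(
        isinstance(m["content"], list)
        and any(p.get("type") != "text" for p in m["content"])
        for m in msg_dicts
    )

text_only = [{"role": "user", "content": [{"type": "text", "text": "hi"}]}]
with_image = [{"role": "user", "content": [
    {"type": "text", "text": "describe this"},
    {"type": "image", "image": "https://example.com/cat.png"},
]}]
print(has_multimodal_content(text_only), has_multimodal_content(with_image))  # → False True
```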
