Conversation
[[tool.uv.index]]
name = "vllm-nightly"
url = "https://wheels.vllm.ai/nightly"

url = "https://download.pytorch.org/whl/test/cu128"
PyTorch sourced from test/pre-release channel instead of stable
High Severity
The PyTorch index URL was changed from https://download.pytorch.org/whl/cu128 (stable releases) to https://download.pytorch.org/whl/test/cu128 (pre-release/test builds). This causes the project to install torch 2.9.1+cu128 from the test channel instead of a stable release. Test channel builds may contain regressions or breaking changes that haven't been validated for production use.
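If the test channel was unintentional, the fix is to point the index back at the stable cu128 wheel index. A sketch of the relevant pyproject.toml fragment (the index name here is illustrative):

```toml
# Stable CUDA 12.8 PyTorch wheel index (reverts the test-channel URL).
[[tool.uv.index]]
name = "pytorch-cu128"
url = "https://download.pytorch.org/whl/cu128"
```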
Additional Locations (1)
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
        description="Quantize weights to FP8 (e4m3) with block-wise scaling during kernel format transfer. "
        "Only used when use_kernel_format_transfer is True."
    ),
] = False
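For intuition, block-wise scaling of the kind this flag describes can be sketched in plain Python. This is illustrative only: real e4m3 quantization also rounds each scaled value to a representable FP8 number, which this sketch omits.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in e4m3

def quantize_block(block):
    """Scale one block of floats into the e4m3 range with a shared scale."""
    amax = max(abs(x) for x in block) or 1.0  # guard against all-zero blocks
    scale = amax / FP8_E4M3_MAX
    q = [x / scale for x in block]  # values now fit within +/- 448
    return q, scale  # the receiver dequantizes as q[i] * scale

def quantize_blockwise(weights, block_size=128):
    """Quantize a flat weight list block by block, one scale per block."""
    blocks, scales = [], []
    for i in range(0, len(weights), block_size):
        q, s = quantize_block(weights[i:i + block_size])
        blocks.extend(q)
        scales.append(s)
    return blocks, scales
```

The per-block scales are what travel alongside the FP8 weights as `weight_scale_inv`-style tensors during the broadcast.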
Config changes missing CHANGELOG.md update
Low Severity
Multiple new config fields were added, but CHANGELOG.md was not updated with entries for any of them:
- src/prime_rl/configs/inference.py: data_parallel_address, data_parallel_start_rank, headless
- src/prime_rl/configs/rl.py: use_kernel_format_transfer, quantize_fp8, allow_different_inference_model
- src/prime_rl/configs/trainer.py: use_kernel_format_transfer, quantize_fp8
- src/prime_rl/configs/orchestrator.py: use_kernel_format_transfer
Additional Locations (2)
Triggered by project rule: BugBot Instructions
logger.error(f"Kernel weight transfer: {len(shape_mismatches)} SHAPE MISMATCHES: {shape_mismatches}")
if skipped:
    logger.warning(f"Kernel weight transfer: {len(skipped)} skipped (not in model): {skipped}")
logger.info(f"Kernel weight transfer: copied {loaded} weights in-place")
Kernel format loader only checks parameters, missing buffers
Medium Severity
_load_kernel_format builds its lookup dict from model.named_parameters(), which excludes buffers. When quantize_fp8 is enabled on the sender, weight_scale_inv tensors are broadcast alongside FP8 weights. If the receiving vLLM model stores these scale tensors as buffers rather than parameters, they won't be matched and will be silently skipped, leaving stale scale factors that cause incorrect dequantization during inference.
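A fix along the lines this comment implies would build the lookup from both parameters and buffers. A minimal sketch, assuming a torch-style model exposing `named_parameters()`/`named_buffers()`; the helper name is hypothetical, and a stub stands in for a real model here:

```python
def build_weight_lookup(model):
    """Map tensor name -> tensor, covering parameters AND buffers.

    Using only named_parameters() misses buffers such as FP8
    weight_scale_inv tensors, so received scales would be silently
    skipped and stale scales would corrupt dequantization.
    """
    lookup = dict(model.named_parameters())
    for name, buf in model.named_buffers():
        lookup.setdefault(name, buf)  # parameters win on a name clash
    return lookup
```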
src/prime_rl/configs/inference.py
Outdated
data_parallel_address: Annotated[
    str | None,
    Field(
        description="Address for cross-node data parallel communication. Passed to vLLM as `--data-parallel-address`.",
    ),
] = None

data_parallel_start_rank: Annotated[
    int | None,
    Field(
        ge=0,
        description="Starting DP rank for this node in multi-node EP. Passed to vLLM as `--data-parallel-start-rank`.",
    ),
] = None

headless: Annotated[
    bool,
    Field(
        description="Run in headless mode (no API server). Passed to vLLM as `--headless`.",
    ),
] = False
can pass all of this via vllm_extras instead imo
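The reviewer's alternative might look like this in a run config. This is a hypothetical TOML fragment: the exact `vllm_extras` key and its shape are assumed, not confirmed by this PR, and the address/rank values are placeholders.

```toml
# Hypothetical: forward the flags straight to vLLM instead of
# adding dedicated config fields (key shape assumed).
[inference]
vllm_extras = [
  "--data-parallel-address", "10.0.0.1",
  "--data-parallel-start-rank", "8",
  "--headless",
]
```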
src/prime_rl/configs/rl.py
Outdated
use_kernel_format_transfer: Annotated[
    bool,
    Field(
        description="Transfer weights in vLLM kernel format instead of HF checkpoint format. "
        "Avoids the HF conversion intermediate step and allows direct in-place weight updates."
    ),
] = False
Suggested change:
-use_kernel_format_transfer: Annotated[
-    bool,
-    Field(
-        description="Transfer weights in vLLM kernel format instead of HF checkpoint format. "
-        "Avoids the HF conversion intermediate step and allows direct in-place weight updates."
-    ),
-] = False
+use_vllm_format_transfer: Annotated[
+    bool,
+    Field(
+        description="Transfer weights in vLLM kernel format instead of HF checkpoint format. "
+        "Avoids the HF conversion intermediate step and allows direct in-place weight updates."
+    ),
+] = False
Let's rename to this.
src/prime_rl/configs/orchestrator.py
Outdated
port: Annotated[int, Field(description="The port to use for the NCCL broadcast.")] = 29501
timeout: Annotated[int, Field(description="The timeout in seconds to use for the NCCL broadcast.")] = 1200

use_kernel_format_transfer: Annotated[
Can we move this to the vLLM broadcast config instead of having it on the orchestrator?
src/prime_rl/configs/rl.py
Outdated
allow_different_inference_model: Annotated[
    bool,
    Field(
        description="Allow the inference server to use a different model name than the trainer. "
        "When enabled, the orchestrator uses the inference model name for querying. "
        "Useful for kernel format weight transfer where the trainer uses a bf16 model "
        "and inference uses a quantized (e.g. FP8) variant.",
    ),
] = False
Does that mean we always need to load the FP8 model on the vLLM side?
I think we should allow this by default and drop this param: still translate from model.name to trainer.model.name and infer.model.name, but let the user override it via trainer.model.name.
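Under that proposal, a user config might read as below. This is a hypothetical TOML sketch assuming model.name propagates to both sides unless overridden; the model names are placeholders.

```toml
# model.name flows to both trainer and inference by default;
# trainer.model.name overrides it for the training side only.
[model]
name = "org/Model-FP8"    # what the inference server loads

[trainer.model]
name = "org/Model-BF16"   # trainer override: train in bf16
```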
--data_parallel_start_rank $INFER_DP_START_RANK \
--data_parallel_address $INFER_HEAD_HOST \
--data_parallel_rpc_port $INFERENCE_DATA_PARALLEL_RPC_PORT \


Note
High Risk
Touches the training↔inference weight-update pipeline (NCCL broadcast, in-place parameter updates, and optional FP8 quantization), where shape/format mismatches can break live inference or silently degrade quality. Also changes multi-node SLURM orchestration and dependency pinning, which can affect cluster runs and reproducibility.
Overview
Adds an opt-in vLLM kernel-format weight broadcast path for NCCL, including optional block-wise FP8 quantization, to enable direct in-place updates on inference workers without converting through HF checkpoint format.
Plumbs new config flags (e.g. `use_vllm_format_transfer`, `quantize_fp8`) through shared/trainer/orchestrator configs and the `/init_broadcaster` RPC; updates the vLLM NCCL worker to `copy_()` received params (with EP expert slicing and MLA absorbed-weight recomputation) and introduces model-side layer conversion for `glm_moe_dsa`.

Extends multi-node RL deployment to support multiple inference replicas (`num_infer_replicas`/`total_infer_nodes`) and updates the SLURM template for per-replica head selection and headless nodes; also relaxes inference `api_server_count` to allow `0` for headless mode and adjusts model-name propagation/validation so the orchestrator matches inference.

Written by Cursor Bugbot for commit d685dfc. This will update automatically on new commits.