Bugfix: Fix accuracy degradation caused by EPLB #4490
base: main
Conversation
Force-pushed 7cca720 to 60ce54b
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request aims to fix accuracy degradation related to Expert Parallelism Load Balancing (EPLB). The changes correct how expert maps and weights are updated, fix an issue with group indices in MoE MLPs, and ensure up-to-date data is used for quantization scales. While most changes appear to be valid bug fixes, I've identified critical issues in vllm_ascend/eplb/adaptor/vllm_adaptor.py: the new tensor-padding logic lacks validation of tensor shapes, which could lead to runtime errors. These should be addressed to ensure the robustness of the implementation.
```python
pad_len = self.expert_map_per_layer[layer_id].shape[0] - updated_expert_map.shape[0]
updated_expert_map_padded = torch.nn.functional.pad(
    updated_expert_map,
    pad=(0, pad_len),
    mode='constant',
    value=-1
)
self.expert_map_per_layer[layer_id].copy_(updated_expert_map_padded)
self.expert_map_per_layer_cpu[layer_id].copy_(updated_expert_map)
```
There are two potential issues in this function that could lead to runtime errors:
1. The calculation `pad_len = self.expert_map_per_layer[layer_id].shape[0] - updated_expert_map.shape[0]` can result in a negative value if `updated_expert_map` is larger than `self.expert_map_per_layer[layer_id]`. This will cause `torch.nn.functional.pad` to raise an error, as it does not support negative padding.
2. The line `self.expert_map_per_layer_cpu[layer_id].copy_(updated_expert_map)` performs an in-place copy, which will fail if `updated_expert_map` has a different number of elements than `self.expert_map_per_layer_cpu[layer_id]`. Since padding is used for the device tensor, it's plausible that the size of `updated_expert_map` can vary, which would make this `copy_` unsafe. The original code used `clone()`, which would re-assign the tensor and handle size differences.
Please add checks to prevent these errors, for example by asserting that `pad_len` is non-negative and clarifying the intended behavior for updating the CPU map when sizes differ.
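A minimal sketch of what such a guarded update could look like, reusing the attribute names from the snippet above (the free-function form and the fallback-to-`clone()` behavior are illustrative assumptions, not the PR's actual fix):

```python
import torch

def guarded_update_expert_map(expert_map_per_layer: dict,
                              expert_map_per_layer_cpu: dict,
                              layer_id: int,
                              updated_expert_map: torch.Tensor) -> None:
    """Hypothetical guarded variant of the update shown above."""
    target = expert_map_per_layer[layer_id]
    pad_len = target.shape[0] - updated_expert_map.shape[0]
    if pad_len < 0:
        raise ValueError(
            f"updated_expert_map has {updated_expert_map.shape[0]} entries, "
            f"but the target map only holds {target.shape[0]}")
    # Pad the tail with -1 so unused slots stay marked as "no expert".
    padded = torch.nn.functional.pad(
        updated_expert_map, pad=(0, pad_len), mode='constant', value=-1)
    target.copy_(padded)

    # copy_ requires matching element counts, so guard the CPU-side update
    # and fall back to re-assignment (mirroring the old clone() behavior).
    cpu_target = expert_map_per_layer_cpu[layer_id]
    if cpu_target.shape == updated_expert_map.shape:
        cpu_target.copy_(updated_expert_map)
    else:
        expert_map_per_layer_cpu[layer_id] = updated_expert_map.clone()
```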
```python
pad_len = self.log2phy_map_per_layer[layer_id].shape[0] - updated_log2phy_map.shape[0]
updated_log2phy_map_padded = torch.nn.functional.pad(
    updated_log2phy_map,
    pad=(0, pad_len),
    mode='constant',
    value=-1
)
self.log2phy_map_per_layer[layer_id].copy_(updated_log2phy_map_padded)
```
Similar to `do_update_expert_map`, the calculation `pad_len = self.log2phy_map_per_layer[layer_id].shape[0] - updated_log2phy_map.shape[0]` can result in a negative value if `updated_log2phy_map` is larger than the target tensor. This would cause `torch.nn.functional.pad` to raise an error.
Please add a check to ensure `pad_len` is non-negative to prevent a potential crash. For example:
```python
pad_len = self.log2phy_map_per_layer[layer_id].shape[0] - updated_log2phy_map.shape[0]
if pad_len < 0:
    raise ValueError(
        f"updated_log2phy_map shape {updated_log2phy_map.shape} is larger "
        f"than target shape {self.log2phy_map_per_layer[layer_id].shape}")
updated_log2phy_map_padded = torch.nn.functional.pad(
    updated_log2phy_map,
    pad=(0, pad_len),
    mode='constant',
    value=-1
)
self.log2phy_map_per_layer[layer_id].copy_(updated_log2phy_map_padded)
```
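For illustration, a quick standalone check of the padding behavior (the tensor sizes and values here are made up):

```python
import torch

log2phy = torch.full((8,), -1, dtype=torch.int64)        # target map with 8 slots
updated = torch.tensor([3, 1, 0, 2], dtype=torch.int64)  # new map with 4 entries

pad_len = log2phy.shape[0] - updated.shape[0]            # 4, non-negative here
padded = torch.nn.functional.pad(updated, pad=(0, pad_len),
                                 mode='constant', value=-1)
log2phy.copy_(padded)
print(log2phy)  # tensor([ 3,  1,  0,  2, -1, -1, -1, -1])
```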
Signed-off-by: Che Ruan <[email protected]>
Signed-off-by: Che Ruan <[email protected]>
Force-pushed 8acd608 to 15fe4c0
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: Mercykid-bash <[email protected]>
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: tanqingshan (A) <[email protected]>
Signed-off-by: tanqingshan (A) <[email protected]>
Signed-off-by: tanqingshan (A) <[email protected]>