
Conversation

@Mercykid-bash (Contributor) commented Nov 27, 2025

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand the change.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request aims to fix accuracy degradation related to Expert Parallelism Load Balancing (EPLB). The changes include corrections to how expert maps and weights are updated, a fix for group indices in MoE MLPs, and ensuring up-to-date data is used for quantization scales. While most changes appear to be valid bug fixes, I've identified critical issues in vllm_ascend/eplb/adaptor/vllm_adaptor.py: the new tensor-padding logic lacks validation of tensor shapes, which could lead to runtime errors. These should be addressed to ensure the robustness of the implementation.

Comment on lines 197 to 205
pad_len = self.expert_map_per_layer[layer_id].shape[0] - updated_expert_map.shape[0]
updated_expert_map_padded = torch.nn.functional.pad(
    updated_expert_map,
    pad=(0, pad_len),
    mode='constant',
    value=-1,
)
self.expert_map_per_layer[layer_id].copy_(updated_expert_map_padded)
self.expert_map_per_layer_cpu[layer_id].copy_(updated_expert_map)

critical

There are two potential issues in this function that could lead to runtime errors:

  1. The calculation pad_len = self.expert_map_per_layer[layer_id].shape[0] - updated_expert_map.shape[0] can result in a negative value if updated_expert_map is larger than self.expert_map_per_layer[layer_id]. This will cause torch.nn.functional.pad to raise an error, as it does not support negative padding.

  2. The line self.expert_map_per_layer_cpu[layer_id].copy_(updated_expert_map) performs an in-place copy, which will fail if updated_expert_map has a different number of elements than self.expert_map_per_layer_cpu[layer_id]. Since padding is used for the device tensor, it's plausible that updated_expert_map's size can vary, which would make this copy_ unsafe. The original code used clone() which would re-assign the tensor and handle size differences.

Please add checks to prevent these errors, for example by asserting that pad_len is non-negative and clarifying the intended behavior for updating the CPU map when sizes differ.
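The guard suggested above can be factored into a small helper that validates the pad length before calling torch.nn.functional.pad. This is a minimal sketch, not the PR's actual code; the helper name pad_expert_map and the target-length parameter are illustrative:

```python
import torch

def pad_expert_map(updated_map: torch.Tensor, target_len: int) -> torch.Tensor:
    """Right-pad a 1-D expert map with -1 up to target_len.

    Raises ValueError when the updated map is already longer than the
    target, instead of passing a negative pad length to F.pad.
    """
    pad_len = target_len - updated_map.shape[0]
    if pad_len < 0:
        raise ValueError(
            f"updated map has {updated_map.shape[0]} entries, "
            f"which exceeds the target length {target_len}"
        )
    return torch.nn.functional.pad(
        updated_map, pad=(0, pad_len), mode='constant', value=-1
    )
```

With a helper like this, both do_update_expert_map and do_update_log2phy_map would fail fast with a clear message on a shape mismatch rather than crashing inside the padding call.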

Comment on lines 217 to 224
pad_len = self.log2phy_map_per_layer[layer_id].shape[0] - updated_log2phy_map.shape[0]
updated_log2phy_map_padded = torch.nn.functional.pad(
    updated_log2phy_map,
    pad=(0, pad_len),
    mode='constant',
    value=-1,
)
self.log2phy_map_per_layer[layer_id].copy_(updated_log2phy_map_padded)

critical

Similar to do_update_expert_map, the calculation pad_len = self.log2phy_map_per_layer[layer_id].shape[0] - updated_log2phy_map.shape[0] can result in a negative value if updated_log2phy_map is larger than the target tensor. This would cause torch.nn.functional.pad to raise an error.

Please add a check to ensure pad_len is non-negative to prevent a potential crash.

pad_len = self.log2phy_map_per_layer[layer_id].shape[0] - updated_log2phy_map.shape[0]
if pad_len < 0:
    raise ValueError(
        f"updated_log2phy_map shape {updated_log2phy_map.shape} is larger "
        f"than target shape {self.log2phy_map_per_layer[layer_id].shape}"
    )
updated_log2phy_map_padded = torch.nn.functional.pad(
    updated_log2phy_map,
    pad=(0, pad_len),
    mode='constant',
    value=-1,
)
self.log2phy_map_per_layer[layer_id].copy_(updated_log2phy_map_padded)

Signed-off-by: Che Ruan <[email protected]>
@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@github-actions

github-actions bot commented Dec 4, 2025

This pull request has conflicts, please resolve those before we can evaluate the pull request.

Clorist33 pushed a commit to Clorist33/vllm-ascend that referenced this pull request Dec 5, 2025