Bugfix: Fix accuracy degradation caused by EPLB #4490
base: main
Conversation
Force-pushed 7cca720 to 60ce54b
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request aims to fix accuracy degradation related to Expert Parallelism Load Balancing (EPLB). The changes correct how expert maps and weights are updated, fix an issue with group indices in MoE MLPs, and ensure up-to-date data is used for quantization scales. While most changes appear to be valid bug fixes, I've identified critical issues in vllm_ascend/eplb/adaptor/vllm_adaptor.py: the new tensor-padding logic lacks validation of tensor shapes, which could lead to runtime errors. These should be addressed to ensure the robustness of the implementation.
```python
pad_len = self.expert_map_per_layer[layer_id].shape[0] - updated_expert_map.shape[0]
updated_expert_map_padded = torch.nn.functional.pad(
    updated_expert_map,
    pad=(0, pad_len),
    mode='constant',
    value=-1
)
self.expert_map_per_layer[layer_id].copy_(updated_expert_map_padded)
self.expert_map_per_layer_cpu[layer_id].copy_(updated_expert_map)
```
There are two potential issues in this function that could lead to runtime errors:
1. The calculation `pad_len = self.expert_map_per_layer[layer_id].shape[0] - updated_expert_map.shape[0]` can result in a negative value if `updated_expert_map` is larger than `self.expert_map_per_layer[layer_id]`. This will cause `torch.nn.functional.pad` to raise an error, as it does not support negative padding.
2. The line `self.expert_map_per_layer_cpu[layer_id].copy_(updated_expert_map)` performs an in-place copy, which will fail if `updated_expert_map` has a different number of elements than `self.expert_map_per_layer_cpu[layer_id]`. Since padding is used for the device tensor, it's plausible that the size of `updated_expert_map` can vary, which would make this `copy_` unsafe. The original code used `clone()`, which would re-assign the tensor and handle size differences.
Please add checks to prevent these errors, for example by asserting that `pad_len` is non-negative and clarifying the intended behavior for updating the CPU map when sizes differ.
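A minimal sketch of what such a guarded update could look like, reusing the attribute names from the snippet above (the free-function form and the fallback-to-`clone()` behavior are illustrative assumptions, not the PR's actual fix):

```python
import torch

def guarded_update_expert_map(expert_map_per_layer: dict,
                              expert_map_per_layer_cpu: dict,
                              layer_id: int,
                              updated_expert_map: torch.Tensor) -> None:
    """Hypothetical guarded variant of the update shown above."""
    target = expert_map_per_layer[layer_id]
    pad_len = target.shape[0] - updated_expert_map.shape[0]
    if pad_len < 0:
        raise ValueError(
            f"updated_expert_map has {updated_expert_map.shape[0]} entries, "
            f"but the target map only holds {target.shape[0]}")
    # Pad the tail with -1 so unused slots stay marked as "no expert".
    padded = torch.nn.functional.pad(
        updated_expert_map, pad=(0, pad_len), mode='constant', value=-1)
    target.copy_(padded)

    # copy_ requires matching element counts, so guard the CPU-side update
    # and fall back to re-assignment (mirroring the old clone() behavior).
    cpu_target = expert_map_per_layer_cpu[layer_id]
    if cpu_target.shape == updated_expert_map.shape:
        cpu_target.copy_(updated_expert_map)
    else:
        expert_map_per_layer_cpu[layer_id] = updated_expert_map.clone()
```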
```python
pad_len = self.log2phy_map_per_layer[layer_id].shape[0] - updated_log2phy_map.shape[0]
updated_log2phy_map_padded = torch.nn.functional.pad(
    updated_log2phy_map,
    pad=(0, pad_len),
    mode='constant',
    value=-1
)
self.log2phy_map_per_layer[layer_id].copy_(updated_log2phy_map_padded)
```
Similar to `do_update_expert_map`, the calculation `pad_len = self.log2phy_map_per_layer[layer_id].shape[0] - updated_log2phy_map.shape[0]` can result in a negative value if `updated_log2phy_map` is larger than the target tensor. This would cause `torch.nn.functional.pad` to raise an error.
Please add a check to ensure `pad_len` is non-negative to prevent a potential crash. For example:
```python
pad_len = self.log2phy_map_per_layer[layer_id].shape[0] - updated_log2phy_map.shape[0]
if pad_len < 0:
    raise ValueError(
        f"updated_log2phy_map shape {updated_log2phy_map.shape} is larger "
        f"than target shape {self.log2phy_map_per_layer[layer_id].shape}")
updated_log2phy_map_padded = torch.nn.functional.pad(
    updated_log2phy_map,
    pad=(0, pad_len),
    mode='constant',
    value=-1
)
self.log2phy_map_per_layer[layer_id].copy_(updated_log2phy_map_padded)
```
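For illustration, a quick standalone check of the padding behavior (the tensor sizes and values here are made up):

```python
import torch

log2phy = torch.full((8,), -1, dtype=torch.int64)        # target map with 8 slots
updated = torch.tensor([3, 1, 0, 2], dtype=torch.int64)  # new map with 4 entries

pad_len = log2phy.shape[0] - updated.shape[0]            # 4, non-negative here
padded = torch.nn.functional.pad(updated, pad=(0, pad_len),
                                 mode='constant', value=-1)
log2phy.copy_(padded)
print(log2phy)  # tensor([ 3,  1,  0,  2, -1, -1, -1, -1])
```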
Signed-off-by: Che Ruan <[email protected]>
Signed-off-by: Che Ruan <[email protected]>
Force-pushed 8acd608 to 15fe4c0
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: Mercykid-bash <[email protected]>
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: tanqingshan (A) <[email protected]>
Signed-off-by: tanqingshan (A) <[email protected]>
Signed-off-by: tanqingshan (A) <[email protected]>