Check consistency of multiplier #5708
base: main
!test
Description
| Relevant files |
|---|
| Bug fix |
PR Reviewer Guide
Here are some key observations to aid the review process:
🧪 No relevant tests
⚡ Recommended focus areas for review
Thread Safety
extent_to_multiplier_map is not thread-safe. If unshardedSizes() can be called concurrently from multiple threads, this could lead to race conditions during map access and modification. Consider adding mutex protection or using thread-safe alternatives.
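The mutex suggestion in the Thread Safety note could look roughly like the sketch below. This is not the PR's code: the class name `ExtentMultiplierRegistry`, the method `recordOrCheck`, and the `std::string` key / `int64_t` value types are all hypothetical stand-ins, since the actual type of extent_to_multiplier_map is not shown here. The sketch only illustrates guarding the lookup-or-insert and the consistency check with a single lock.

```cpp
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for extent_to_multiplier_map. The key and value
// types are assumptions made for illustration, not nvFuser's real types.
class ExtentMultiplierRegistry {
 public:
  // Records `multiplier` for `extent` the first time the extent is seen.
  // On later calls, returns true only if `multiplier` matches the value
  // recorded earlier. The lookup, the insert, and the comparison all
  // happen while the mutex is held.
  bool recordOrCheck(const std::string& extent, int64_t multiplier) {
    std::lock_guard<std::mutex> lock(mutex_);
    // emplace does not overwrite an existing entry; result.second reports
    // whether a new entry was inserted.
    auto result = map_.emplace(extent, multiplier);
    return result.second || result.first->second == multiplier;
  }

 private:
  std::mutex mutex_;
  std::unordered_map<std::string, int64_t> map_;
};
```

A caller would turn a `false` return into whatever error reporting the code base uses. Whether a lock is needed at all depends on whether unshardedSizes() is really reachable from multiple threads, which the reviewer guide raises only as a possibility.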
Test failures
- (High, 106) CUDA driver too old for runtime on dlcluster_h100 (nvFuser matmul/MMA test suites)

  | Test Name | H100 | Source |
  |---|---|---|
  | .tests.python.multidevice.test_communication | ❌ | |
  | .tests.python.multidevice.test_deepseek_v3 | ❌ | |
  | .tests.python.multidevice.test_dtensor | ❌ | |
  | .tests.python.multidevice.test_matmul | ❌ | |
  | .tests.python.multidevice.test_multidevice | ❌ | |
  | .tests.python.multidevice.test_overlap | ❌ | |
  | .tests.python.multidevice.test_transformer | ❌ | |
  | .tests.python.opinfo.test_direct_ops | ❌ | |
  | .tests.python.test_alphafold3 | ❌ | |
  | ArgsortParameterizedWithBlockAndBatch.SharedMemoryRequirement/1024_1_1_1 | ❌ | Link |

  ... with 96 more test failures omitted. Check internal logs.

- (High, 7) CUDA driver too old on dlcluster_h100: initialization fails in multiple test suites

  | Test Name | H100 | Source |
  |---|---|---|
  | .tests.python.multidevice.test_transformer_engine | ❌ | |
  | .tests.python.opinfo.test_legacy_ops | ❌ | |
  | .tests.python.test_normalization | ❌ | |
  | .tests.python.test_python_frontend | ❌ | |
  | .tests.python.test_schedule_ops | ❌ | |
  | tests.python.multidevice.test_expert_parallel.test_dispatch_and_combine | ❌ | |
  | tests.python.test_moe.test_llama4_moe_thunderfx | ❌ | |

- (Medium, 6) nvFuser multi-device assert failure (inconsistent extent multiplier) in test_overlap_allgather_matmul_shard_outermost

  | Test Name | A100 (dist.) | GB200 (dist.) | H100 (dist.) | Source |
  |---|---|---|---|---|
  | tests.python.multidevice.test_overlap.test_overlap_allgather_matmul_shard_outermost[backend_type=CommunicatorBackend.cuda] | ❌ | ❌ | ❌ | |
  | tests.python.multidevice.test_overlap.test_overlap_allgather_matmul_shard_outermost[backend_type=CommunicatorBackend.nccl] | ❌ | ❌ | ❌ | |
!test
No description provided.