Conversation

@zasdfgbnm
Collaborator

No description provided.

@zasdfgbnm
Collaborator Author

!test

@github-actions

github-actions bot commented Dec 18, 2025

Description

  • Added static map to track extent-to-multiplier relationships

  • Implemented consistency check for multiplier calculations

  • Throws error if same extent has different multipliers

  • Prevents inconsistent sharding behavior across device mesh

Changes walkthrough

Relevant files
Bug fix: csrc/multidevice/execution_utils.cpp (+19/-0)
Add multiplier consistency validation

  • Added static unordered_map to track extent->multiplier relationships
  • Added consistency validation after multiplier calculation
  • Throws NVF_ERROR if multiplier inconsistency detected for same extent
  • Ensures deterministic sharding behavior across multi-device execution

    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 No relevant tests
    ⚡ Recommended focus areas for review
    Thread Safety

    The static unordered_map extent_to_multiplier_map is not thread-safe. If unshardedSizes() can be called concurrently from multiple threads, this could lead to race conditions during map access and modification. Consider adding mutex protection or using thread-safe alternatives.

    static std::unordered_map<Val*, int64_t> extent_to_multiplier_map;
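
    One possible mitigation, sketched under the assumption that the map stays a function-local static; the mutex name is illustrative and not part of the PR:

    // Hypothetical sketch: serialize cache access with a mutex.
    // Requires <mutex>; Val and NVF_ERROR come from the nvFuser codebase.
    static std::unordered_map<Val*, int64_t> extent_to_multiplier_map;
    static std::mutex extent_map_mutex;

    {
      // Hold the lock for the lookup-or-insert and the consistency check.
      std::lock_guard<std::mutex> lock(extent_map_mutex);
      auto [it, inserted] =
          extent_to_multiplier_map.try_emplace(extent, multiplier);
      NVF_ERROR(
          inserted || it->second == multiplier,
          "Inconsistent multiplier for extent ",
          extent->toString());
    }
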
    Memory Management

    The map stores Val* pointers without any cleanup mechanism. If Val objects are destroyed while the static map still references them, this could lead to dangling pointer dereferences. Consider using smart pointers or ensuring proper lifetime management.

    static std::unordered_map<Val*, int64_t> extent_to_multiplier_map;
    
    auto multiplier = [&]() -> int64_t {
      if (parallel_type == ParallelType::Stream) {
        // TODO(#5525): hack for MultiDeviceExecutor.  MultiDeviceExecutor looks
        // for ParallelType::Stream only in logical domains and assumes a
        // stream-parallelized dimension is always fully allocated.  So we set
        // the multiplier to 1 when `sharded_id` is a logical IterDomain. This
        // will have to change when FusionExecutorCache requires a logical
        // dimension to be stream-parallelized, both loop and allocation. Refer
        // to
        // https://github.com/NVIDIA/Fuser/blob/f8e84e52296cdecd318dd2ce904139616d7bd434/tests/cpp/test_overlap.cpp#L155
        // for an example. An alternative to consider is to create a new
        // ParallelType for stream parallelization and use it in
        // FusionExecutorCache.
        if (std::find(
                tv->getLogicalDomain().begin(),
                tv->getLogicalDomain().end(),
                sharded_id) != tv->getLogicalDomain().end()) {
          return 1;
        }
    
        NVF_ERROR(
            sharded_id->extent()->isConstInt(),
            "DIDs/Stream extent is expected to be constant: ",
            sharded_id);
        return sharded_id->extent()->evaluate().as<int64_t>();
      }
    
      if (isParallelTypeDeviceDim(parallel_type)) {
        return tv->getDeviceMesh().size(parallel_type);
      }
    
      NVF_THROW("Unexpected parallel type: ", parallel_type);
    }();
    
    // Check consistency: for the same extent, we should always get the same multiplier
    Val* extent = sharded_id->extent();
    auto it = extent_to_multiplier_map.find(extent);
    if (it != extent_to_multiplier_map.end()) {
      NVF_ERROR(
          it->second == multiplier,
          "Inconsistent multiplier for extent ",
          extent->toString(),
          ": expected ",
          it->second,
          " but got ",
          multiplier);
    } else {
      extent_to_multiplier_map[extent] = multiplier;
    }
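
    One hedged alternative for the dangling-pointer risk, not what the PR implements: key the cache by the extent's evaluated integer value instead of the Val*, so the map never stores a pointer that can outlive its Val. This sketch assumes the extent is a constant integer, which the Stream path above already asserts but which would need checking for device dimensions:

    // Illustrative sketch only: a value-keyed cache cannot dangle.
    static std::unordered_map<int64_t, int64_t> extent_value_to_multiplier;

    NVF_ERROR(extent->isConstInt(), "Expected a constant extent for caching");
    const int64_t extent_value = extent->evaluate().as<int64_t>();
    auto [it, inserted] =
        extent_value_to_multiplier.try_emplace(extent_value, multiplier);
    NVF_ERROR(
        inserted || it->second == multiplier,
        "Inconsistent multiplier for extent ",
        extent_value);
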
    Static Lifetime Concerns

    The static map will persist for the entire program lifetime, which may not be appropriate if Val objects have different lifetimes. This could lead to memory bloat or undefined behavior if Val objects are destroyed.

    static std::unordered_map<Val*, int64_t> extent_to_multiplier_map;
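
    A sketch of one way to bound the lifetime, assuming the cache can be owned by the caller; the struct and function names here are illustrative, not from the PR:

    // Hypothetical refactor: thread a caller-owned cache through instead of
    // relying on a process-wide static.
    struct MultiplierCache {
      std::unordered_map<Val*, int64_t> extent_to_multiplier;
    };

    void checkMultiplierConsistency(
        MultiplierCache& cache, Val* extent, int64_t multiplier) {
      auto [it, inserted] =
          cache.extent_to_multiplier.try_emplace(extent, multiplier);
      NVF_ERROR(
          inserted || it->second == multiplier,
          "Inconsistent multiplier for extent ",
          extent->toString());
    }

    // The cache would live alongside the fusion (for example, as a member of
    // the executor), so it is destroyed together with the Vals it refers to.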

    Test failures

    • (High, 106) CUDA driver too old for runtime on dlcluster_h100 (nvFuser matmul/MMA test suites)

  Failing tests (H100):
  .tests.python.multidevice.test_communication
  .tests.python.multidevice.test_deepseek_v3
  .tests.python.multidevice.test_dtensor
  .tests.python.multidevice.test_matmul
  .tests.python.multidevice.test_multidevice
  .tests.python.multidevice.test_overlap
  .tests.python.multidevice.test_transformer
  .tests.python.opinfo.test_direct_ops
  .tests.python.test_alphafold3
  ArgsortParameterizedWithBlockAndBatch.SharedMemoryRequirement/1024_1_1_1
      ... with 96 more test failures omitted. Check internal logs.
    • (High, 7) CUDA driver too old on dlcluster_h100 – initialization fails in multiple test suites

  Failing tests (H100):
      .tests.python.multidevice.test_transformer_engine
      .tests.python.opinfo.test_legacy_ops
      .tests.python.test_normalization
      .tests.python.test_python_frontend
      .tests.python.test_schedule_ops
      tests.python.multidevice.test_expert_parallel.test_dispatch_and_combine
      tests.python.test_moe.test_llama4_moe_thunderfx
    • (Medium, 6) nvFuser multi-device assert failure (inconsistent extent multiplier) in test_overlap_allgather_matmul_shard_outermost

  Failing tests (A100 dist., GB200 dist., H100 dist.):
      tests.python.multidevice.test_overlap.test_overlap_allgather_matmul_shard_outermost[backend_type=CommunicatorBackend.cuda]
      tests.python.multidevice.test_overlap.test_overlap_allgather_matmul_shard_outermost[backend_type=CommunicatorBackend.nccl]

    @zasdfgbnm
    Collaborator Author

    !test
