
Conversation

naoyam (Collaborator) commented Dec 20, 2025

No description provided.

naoyam (Collaborator, Author) commented Dec 20, 2025

!test

github-actions bot commented Dec 20, 2025

Review updated until commit 0823fe7

Description

  • Move circular buffer index variable allocation earlier in the logic flow

  • Add support for Stream parallel type in loop index variable allocation

  • Enable IdModel (TensorIndexer) by default in matmul scheduler tests

  • Enable IdModel by default in stream tests and base test fixture

Changes walkthrough

Relevant files

Bug fix

  • id_model.cpp: Fix circular buffer allocation order and add stream support
    (csrc/id_model/id_model.cpp, +23/-17)
      • Move circular buffer index variable allocation before thread index handling
      • Add ParallelType::Stream support to thread index condition
      • Add error checking for missing circular buffer index variables

Tests

  • test_matmul_scheduler.cpp: Enable IdModel in matmul scheduler test fixtures
    (tests/cpp/test_matmul_scheduler.cpp, +17/-0)
      • Add SetUp() methods to enable IdModel with the "all" option (see the sketch after this walkthrough)
      • Apply to MatmulSchedulerTest, MatmulSchedulerPluginTest, AllocationDomainTest
      • Apply to HopperPlusMatmulSchedulerTest
  • test_stream.cpp: Enable IdModel in stream tests
    (tests/cpp/test_stream.cpp, +1/-0)
      • Enable IdModel with the "all" option in the StreamTest constructor
  • utils.cpp: Enable IdModel in base test fixture
    (tests/cpp/utils.cpp, +1/-0)
      • Enable IdModel with the "all" option in NVFuserTest::SetUp()
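
For reference, a minimal sketch of what the test-fixture change could look like, assuming nvFuser's EnableOptionsGuard / EnableOption::IdModel API; the fixture member name and exact guard placement are assumptions, not the PR's literal diff:

    // Hypothetical sketch: enable IdModel (TensorIndexer) in a test fixture.
    // EnableOptionsGuard and EnableOption::IdModel are assumed to follow the
    // existing nvFuser options API; everything else is illustrative.
    class MatmulSchedulerTest : public NVFuserTest {
     protected:
      void SetUp() override {
        NVFuserTest::SetUp();
        // "all" asks lowering to use IdModel-based indexing for all code paths.
        EnableOptionsGuard::getCurOptions().set(EnableOption::IdModel, {"all"});
      }

      // The guard restores the previous option state on test teardown.
      EnableOptionsGuard enable_options_guard_;
    };

The same one-line option change presumably appears in the StreamTest constructor and in NVFuserTest::SetUp() in tests/cpp/utils.cpp.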

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests

⚡ Recommended focus areas for review

Code Movement Impact

The circular buffer handling code has been moved earlier in the allocateLoopIndexVariables() function. The logic itself appears preserved, but the moved block allocates an index variable for each stage of a circular-buffered loop, so any change in when it executes relative to the rest of the function could introduce subtle bugs.

    if (GpuLower::current()->circularBufferInfo().isCircularBufferedIterDomain(
            loop_group->front()->as<IterDomain>())) {
      // Allocate index variable for each stage of the circular
      // buffered loop.
      auto indices = std::make_unique<CircularBufferIndices>();
      for (auto i :
           arange(static_cast<int>(CircularBufferLoopStage::EndOfStages))) {
        indices->emplace(
            static_cast<CircularBufferLoopStage>(i),
            IrBuilder::create<Val>(DataType::Index));
      }
      circular_buffered_loop_index_variable_map_[loop_group] =
          std::move(indices);
      continue;
    }

New Error Checking

New error checking throws NVF_ERROR when circular_buffered_loop_index_variable_map_ does not contain the expected loop_group. This is good for catching bugs, but it could surface test failures if the assumption about when this function is called does not hold in every scenario.

    NVF_ERROR(
        circular_buffered_loop_index_variable_map_.contains(loop_group),
        "Failed to find circular buffer index var for: ",
        nvfuser::toString(loop_group),
        ", ",
        loop_group->front()->toString());
Parallel Type Handling

The condition for using NamedScalar::getParallelIndex has been expanded to include ParallelType::Stream in addition to the thread parallel types. This changes how loop indices are chosen for stream-parallelized domains and should be validated for correctness and performance in stream-based operations.

    } else if (isParallelTypeThread(ptype) || ptype == ParallelType::Stream) {
      loop_index = NamedScalar::getParallelIndex(ptype);
    }

Test failures

• (High, 189) CUDA driver version insufficient for runtime on dlcluster_h100 (nvFuser test suites)

  Failing tests (H100):
    • ArgsortParameterizedWithBlockAndBatch.SharedMemoryRequirement/128_1_0_0
    • ArgsortParameterizedWithBlockAndBatch.SharedMemoryRequirement/4096_2_1_1
    • ArgsortParameterizedWithBlockAndBatch.SharedMemoryRequirement/512_1_0_0
    • ArgsortParameterizedWithBlockAndBatch.SharedMemoryRequirement/512_1_1_1
    • ArgsortTest.ZeroDimensionalInput
    • BlackwellMatmulTest.EpilogueSiluPersistentBroadcastInputs
    • BlockSizeAndItemsPerThread/ArgSortComprehensiveTest.ComprehensiveValidation/BlockSize128_ItemsPerThread4
    • ClusterReductionTest.SimpleFusionAllReduce/cluster_10_dtype_float
    • ClusterReductionTest.SimpleFusionAllReduce/cluster_6_dtype_double
    • ClusterReductionTest.SimpleFusionAllReduce/cluster_9_dtype___bfloat
  ... with 179 more test failures omitted. Check internal logs.

• (High, 44) NCCL NVLS multicast memory bind failures across multidevice/nvfuser test suites on dlcluster_viking_ci

  Failing tests (H100, distributed):
    • tests.python.multidevice.test_communication.test_allgather
    • tests.python.multidevice.test_communication.test_allgather_expanded_broadcast
    • tests.python.multidevice.test_communication.test_allreduce
    • tests.python.multidevice.test_communication.test_reduce_scatter
    • tests.python.multidevice.test_communication.test_reduce_scatter_noncontiguous
    • tests.python.multidevice.test_dtensor.test_column_parallel_linear
    • tests.python.multidevice.test_dtensor.test_plus_one
    • tests.python.multidevice.test_dtensor.test_row_parallel_linear
    • tests.python.multidevice.test_expert_parallel.test_dispatch_and_combine
    • tests.python.multidevice.test_matmul.test_column_parallel_grouped_mm
  ... with 34 more test failures omitted. Check internal logs.

• (Medium, 1) NCCL invalid usage error in multidevice overlap tests (tests/python/multidevice/test_overlap.py)

  Failing tests (H100, distributed):
    • tests.python.multidevice.test_overlap.test_overlap_allgather_matmul_shard_outermost[backend_type=CommunicatorBackend.cuda]

naoyam (Collaborator, Author) commented Dec 20, 2025

!test
