
Conversation

@zasdfgbnm (Collaborator)

No description provided.

github-actions bot commented Dec 17, 2025

Review updated until commit 6b57c2d

Description

  • Add meta-device fast path to CutlassNvfp4GroupedMmaOp::evaluate for efficient meta tensor handling

  • Move NVFUSER_CUTLASS_KERNEL_ENABLED guard to allow meta device evaluation without CUTLASS support

  • Implement shape inference for meta tensors: the output shape is [M, N], where M = mat1.size(0) and N = mat2.size(1) (a standalone sketch of this control flow follows this list)

  • Add comprehensive test case CutlassNvfp4GroupedMma verifying meta device evaluation matches CUDA behavior
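
To make the control flow concrete, here is a minimal standalone sketch, in plain ATen, of the pattern the bullets above describe. The names evaluate_grouped_mm and run_cutlass_nvfp4_grouped_mm are hypothetical stand-ins, not the PR's code; the actual excerpt from CutlassNvfp4GroupedMmaOp::evaluate appears in the reviewer guide below.

    #include <ATen/ATen.h>
    #include <stdexcept>

    // Hypothetical stand-in for the evaluate method: the meta-device
    // branch compiles unconditionally, and only the real kernel path
    // sits behind NVFUSER_CUTLASS_KERNEL_ENABLED.
    at::Tensor evaluate_grouped_mm(
        const at::Tensor& mat1, // [M, K/2] packed FP4
        const at::Tensor& mat2) { // [G, N, K/2] packed FP4
      if (mat1.is_meta() || mat2.is_meta()) {
        // Shape inference only: an empty [M, N] tensor on the meta
        // device, so nothing is allocated and no kernel is launched.
        return at::empty(
            {mat1.size(0), mat2.size(1)},
            mat1.options().device(c10::Device(c10::kMeta)).dtype(at::kBFloat16));
      }
    #if NVFUSER_CUTLASS_KERNEL_ENABLED
      return run_cutlass_nvfp4_grouped_mm(mat1, mat2); // hypothetical kernel call
    #else
      throw std::runtime_error("CUTLASS kernels are not enabled in this build");
    #endif
    }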

Changes walkthrough

Relevant files

Enhancement

csrc/ir/internal_nodes.cpp: Add meta-device fast path to CutlassNvfp4GroupedMmaOp (+23/-1)

  • Add meta-device fast path in CutlassNvfp4GroupedMmaOp::evaluate method
  • Check if any input tensors are meta-device tensors and handle them specially
  • Create empty result tensor on meta device with correct output shape [M, N]
  • Handle rfactor dimension by unsqueezing when necessary
  • Move NVFUSER_CUTLASS_KERNEL_ENABLED guard down to allow meta evaluation without CUTLASS

Tests

tests/cpp/test_meta.cpp: Add test for CutlassNvfp4GroupedMma meta device evaluation (+123/-0)

  • Add new test case CutlassNvfp4GroupedMma for meta device evaluation
  • Test both CUDA and meta device evaluation paths for the cutlass_nvfp4_grouped_mm operation
  • Verify meta tensor properties: is_meta flag, scalar_type, sizes, and strides
  • Create appropriate test tensors with FP4, FP8, FP32, and Index data types

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests

⚡ Recommended focus areas for review

Meta device fast path implementation

The meta device fast path implementation looks correct. It properly checks all input tensors for meta device status and creates an appropriately shaped result tensor. The use of getRFactorDeviceDimensionIndex for handling rfactor dimensions is appropriate.

    // Meta-device fast path
    if (mat1.is_meta() || mat2.is_meta() || scale1.is_meta() ||
        scale2.is_meta() || alpha.is_meta() || problem_sizes.is_meta() ||
        expert_offsets.is_meta() || sf_offsets.is_meta()) {
      // For nvfp4_scaled_grouped_mm, the output shape is [M, N]
      // where M = mat1.size(0) and N = mat2.size(1)
      std::vector<int64_t> result_sizes = {mat1.size(0), mat2.size(1)};
    
      at::ScalarType out_dtype = data_type_to_aten(out()->dtype());
      auto options =
          mat1.options().device(c10::Device(c10::kMeta)).dtype(out_dtype);
      at::Tensor result = at::empty(result_sizes, options);
    
      if (const auto rfactor_did_idx = getRFactorDeviceDimensionIndex(out());
          rfactor_did_idx != -1) {
        result = result.unsqueeze(rfactor_did_idx);
      }
    
      return {result};
    }
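
One subtlety worth reviewing in the excerpt above is the final unsqueeze: when the output's logical domain carries a DID-parallel rfactor dimension, the [M, N] result gains a size-1 axis at that index. A tiny illustration, where rfactor_did_idx = 0 is an assumed value standing in for what getRFactorDeviceDimensionIndex(out()) would return:

    // Illustration only; the index value is assumed for the example.
    at::Tensor result = at::empty(
        {128, 128},
        at::TensorOptions().device(c10::Device(c10::kMeta)).dtype(at::kBFloat16));
    const int64_t rfactor_did_idx = 0;
    if (rfactor_did_idx != -1) {
      result = result.unsqueeze(rfactor_did_idx); // [128, 128] -> [1, 128, 128]
    }
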
Comprehensive test coverage

The test CutlassNvfp4GroupedMma provides excellent coverage by testing both the CUDA and meta device paths with the same fusion definition. It verifies that meta outputs have the correct shape, dtype, and device type. The test uses realistic tensor dimensions and data types.

    // Test CutlassNvfp4GroupedMmaOp with meta device
    TEST_F(MetaTest, CutlassNvfp4GroupedMma) {
    #if NVFUSER_CUTLASS_KERNEL_ENABLED
      auto fusion = std::make_unique<Fusion>();
      FusionGuard fg(fusion.get());
    
      // mat1: [M, K/2] = [128, 64] (packed FP4)
      // mat2: [G, N, K/2] = [4, 128, 64] (packed FP4)
      // output: [M, N] = [128, 128]
      auto mat1 = makeContigConcreteTensor({128, 64}, DataType::Float4_e2m1fn_x2);
      auto mat2 =
          makeContigConcreteTensor({4, 128, 64}, DataType::Float4_e2m1fn_x2);
      auto scale1 = makeContigConcreteTensor({128, 8}, DataType::Float8_e4m3fn);
      auto scale2 = makeContigConcreteTensor({4, 128, 8}, DataType::Float8_e4m3fn);
      auto alpha = makeContigConcreteTensor({4}, DataType::Float);
      auto problem_sizes = makeContigConcreteTensor({4, 3}, DataType::Index);
      auto expert_offsets = makeContigConcreteTensor({4}, DataType::Index);
      auto sf_offsets = makeContigConcreteTensor({4}, DataType::Index);
    
      fusion->addInput(mat1);
      fusion->addInput(mat2);
      fusion->addInput(scale1);
      fusion->addInput(scale2);
      fusion->addInput(alpha);
      fusion->addInput(problem_sizes);
      fusion->addInput(expert_offsets);
      fusion->addInput(sf_offsets);
    
      auto result = cutlass_nvfp4_grouped_mm(
          mat1,
          mat2,
          scale1,
          scale2,
          alpha,
          problem_sizes,
          expert_offsets,
          sf_offsets,
          DataType::BFloat16);
      fusion->addOutput(result);
    
      // Create real inputs with appropriate data types
      auto options_fp4 =
          at::TensorOptions().dtype(at::kFloat4_e2m1fn_x2).device(at::kCUDA, 0);
      auto options_fp8 =
          at::TensorOptions().dtype(at::kFloat8_e4m3fn).device(at::kCUDA, 0);
      auto options_fp32 =
          at::TensorOptions().dtype(at::kFloat).device(at::kCUDA, 0);
      auto options_int = at::TensorOptions().dtype(at::kInt).device(at::kCUDA, 0);
    
      at::Tensor mat1_input = at::randn({128, 64}, options_fp4);
      at::Tensor mat2_input = at::randn({4, 128, 64}, options_fp4);
      at::Tensor scale1_input = at::randn({128, 8}, options_fp8);
      at::Tensor scale2_input = at::randn({4, 128, 8}, options_fp8);
      at::Tensor alpha_input = at::ones({4}, options_fp32);
      at::Tensor problem_sizes_input = at::tensor(
          {{32, 128, 128}, {32, 128, 128}, {32, 128, 128}, {32, 128, 128}},
          options_int);
      at::Tensor expert_offsets_input = at::tensor({0, 32, 64, 96}, options_int);
      at::Tensor sf_offsets_input = at::tensor({0, 32, 64, 96}, options_int);
    
      // CUDA path
      ExpressionEvaluator ee_cuda;
      ee_cuda.bind(fusion->inputs().at(0), mat1_input);
      ee_cuda.bind(fusion->inputs().at(1), mat2_input);
      ee_cuda.bind(fusion->inputs().at(2), scale1_input);
      ee_cuda.bind(fusion->inputs().at(3), scale2_input);
      ee_cuda.bind(fusion->inputs().at(4), alpha_input);
      ee_cuda.bind(fusion->inputs().at(5), problem_sizes_input);
      ee_cuda.bind(fusion->inputs().at(6), expert_offsets_input);
      ee_cuda.bind(fusion->inputs().at(7), sf_offsets_input);
      auto real_out = ee_cuda.evaluate(fusion->outputs().at(0)).as<at::Tensor>();
    
      // Meta evaluation
      ExpressionEvaluator ee_meta;
      auto meta_mat1 = at::empty_strided(
          mat1_input.sizes(), mat1_input.strides(), options_fp4.device(at::kMeta));
      auto meta_mat2 = at::empty_strided(
          mat2_input.sizes(), mat2_input.strides(), options_fp4.device(at::kMeta));
      auto meta_scale1 = at::empty_strided(
          scale1_input.sizes(),
          scale1_input.strides(),
          options_fp8.device(at::kMeta));
      auto meta_scale2 = at::empty_strided(
          scale2_input.sizes(),
          scale2_input.strides(),
          options_fp8.device(at::kMeta));
      auto meta_alpha = at::empty_strided(
          alpha_input.sizes(),
          alpha_input.strides(),
          options_fp32.device(at::kMeta));
      auto meta_problem_sizes = at::empty_strided(
          problem_sizes_input.sizes(),
          problem_sizes_input.strides(),
          options_int.device(at::kMeta));
      auto meta_expert_offsets = at::empty_strided(
          expert_offsets_input.sizes(),
          expert_offsets_input.strides(),
          options_int.device(at::kMeta));
      auto meta_sf_offsets = at::empty_strided(
          sf_offsets_input.sizes(),
          sf_offsets_input.strides(),
          options_int.device(at::kMeta));
    
      ee_meta.bind(fusion->inputs().at(0), meta_mat1);
      ee_meta.bind(fusion->inputs().at(1), meta_mat2);
      ee_meta.bind(fusion->inputs().at(2), meta_scale1);
      ee_meta.bind(fusion->inputs().at(3), meta_scale2);
      ee_meta.bind(fusion->inputs().at(4), meta_alpha);
      ee_meta.bind(fusion->inputs().at(5), meta_problem_sizes);
      ee_meta.bind(fusion->inputs().at(6), meta_expert_offsets);
      ee_meta.bind(fusion->inputs().at(7), meta_sf_offsets);
      auto meta_out = ee_meta.evaluate(fusion->outputs().at(0)).as<at::Tensor>();
    
      // Checks
      EXPECT_TRUE(meta_out.is_meta());
      EXPECT_EQ(meta_out.scalar_type(), at::kBFloat16);
      EXPECT_EQ(meta_out.sizes(), real_out.sizes());
      EXPECT_EQ(meta_out.strides(), real_out.strides());
    #else
      GTEST_SKIP() << "Test requires CUTLASS support";
    #endif
    }
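
A small design note on the meta-input setup: the eight at::empty_strided calls all follow the same mirror-to-meta pattern, so a helper along the lines below (hypothetical, not part of this PR) would shorten the test without changing behavior:

    // Hypothetical helper: mirror a real tensor's sizes, strides, and
    // dtype onto the meta device.
    at::Tensor to_meta(const at::Tensor& t) {
      return at::empty_strided(
          t.sizes(), t.strides(), t.options().device(c10::Device(c10::kMeta)));
    }

    // Usage: ee_meta.bind(fusion->inputs().at(0), to_meta(mat1_input));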

Test failures

• (Medium, 1) CUDA out-of-memory in nvFuser TmaPointwiseTest on H100

  Test Name                          H100 Source
  TmaPointwiseTestF.SplitGridDim2D   Link

@zasdfgbnm (Collaborator, Author)

!test
