Skip to content

Conversation

@liqiangxl
Copy link
Collaborator

Similar to #5717

@github-actions
Copy link

github-actions bot commented Dec 19, 2025

Review updated until commit eec699a

Description

  • Add required includes for scheduler utilities and device memory calculations

  • Replace hardcoded test size with dynamic calculation based on device capabilities

  • Calculate total elements using cluster size, register file size, and shared memory

  • Update test comments to explain cluster reduction and persistence behavior

Changes walkthrough

Relevant files
Bug fix
test_welford.cpp
Fix Welford test with dynamic hardware-aware sizing           

tests/cpp/test_welford.cpp

  • Added includes for scheduler_utils.h and utils.h headers
  • Replaced hardcoded 512*1024+1024 test size with dynamic calculation
  • Implemented hardware-aware sizing using cluster size, register file,
    and shared memory
  • Updated comments to explain cluster reduction and memory persistence
    context
  • +14/-3   

    PR Reviewer Guide

    Here are some key observations to aid the review process:

    🧪 PR contains tests
    ⚡ Recommended focus areas for review
    Include dependencies

    Verify that the newly added includes (#include "scheduler/utils.h" and #include "utils.h") are sufficient for all the functions being used, particularly scheduler_utils::getMaxClusterSize(), scheduler_utils::register_file_size_bit, deviceAvailableSharedMemoryBytes(), ceilDiv(), and scheduler_utils::roundUpPow2Or8().

    #include "scheduler/utils.h"
    #include "utils.h"
    Dynamic calculation logic

    Review the logic for calculating total_elements to ensure it correctly handles both cases (sm_per_cluster == 1 vs sm_per_cluster > 1) and that the resulting test size will reliably trigger the untranslated path as intended.

    const int64_t total_elements = sm_per_cluster == 1
        ? scheduler_utils::roundUpPow2Or8(smem_buffer_count)
        : regs_buffer_count * sm_per_cluster;

    @greptile-apps
    Copy link
    Contributor

    greptile-apps bot commented Dec 19, 2025

    Greptile Summary

    Dynamically calculates buffer size thresholds for Welford reduction tests based on actual cluster configuration instead of using hardcoded values.

    Key Changes:

    • Added #include "scheduler/utils.h" and #include "utils.h" to access scheduler utilities
    • Replaced hardcoded 512 * 1024 + 1024 with dynamic calculation using scheduler_utils::getMaxClusterSize()
    • Now correctly handles both cluster reduction (register persistence) and non-cluster reduction (shared memory persistence) scenarios
    • Calculation logic: if sm_per_cluster == 1, uses roundUpPow2Or8(smem_buffer_count), otherwise uses regs_buffer_count * sm_per_cluster

    This makes the test portable across different GPU architectures with varying cluster sizes (similar to PR #5717 for test_cluster.cpp).

    Confidence Score: 4/5

    • Safe to merge with minor verification recommended
    • The includes resolve compilation issues and the logic correctly adapts test parameters to device capabilities. However, the includes have already been added and flagged in previous review threads, so verification that compilation succeeds is recommended before merging.
    • Verify that tests/cpp/test_welford.cpp compiles successfully with the new includes

    Important Files Changed

    Filename Overview
    tests/cpp/test_welford.cpp Added missing includes and updated test logic to dynamically calculate buffer sizes based on cluster configuration instead of hardcoded values

    Sequence Diagram

    sequenceDiagram
        participant Test as Translate1Welford Test
        participant Runtime as FusionExecutorCache
        participant Scheduler as Welford Scheduler
        participant Device as GPU Device Info
    
        Note over Test: Test Setup
        Test->>Test: Create fusion with Welford op
        
        Note over Test: Small Inner Size (Translated)
        Test->>Runtime: run_test(64)
        Runtime->>Scheduler: Schedule welford reduction
        Scheduler-->>Runtime: Translation applied (>2 exprs)
        Runtime-->>Test: Returns runtime1
        Test->>Test: Verify translation occurred
    
        Note over Test: Large Inner Size (Not Translated)
        Test->>Device: getMaxClusterSize()
        Device-->>Test: sm_per_cluster
        Test->>Device: deviceAvailableSharedMemoryBytes()
        Device-->>Test: available_memory
        Test->>Test: Calculate total_elements based on cluster config
        Note over Test: If cluster size = 1: use smem_buffer_count<br/>Else: use regs_buffer_count * sm_per_cluster
        Test->>Runtime: run_test(total_elements + 1024)
        Runtime->>Scheduler: Schedule welford reduction
        Scheduler-->>Runtime: No translation (WelfordOp preserved)
        Runtime-->>Test: Returns runtime2
        Test->>Test: Verify WelfordOp exists in segments
    
    Loading

    Copy link
    Contributor

    @greptile-apps greptile-apps bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    2 files reviewed, 1 comment

    Edit Code Review Agent Settings | Greptile

    @liqiangxl
    Copy link
    Collaborator Author

    !test

    @liqiangxl liqiangxl force-pushed the llu/fix_test_welford branch from 7bc4c9e to e59425a Compare December 19, 2025 20:01
    @liqiangxl
    Copy link
    Collaborator Author

    !test

    Copy link
    Contributor

    @greptile-apps greptile-apps bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    1 file reviewed, 1 comment

    Edit Code Review Agent Settings | Greptile

    @liqiangxl
    Copy link
    Collaborator Author

    !test

    Copy link
    Contributor

    @greptile-apps greptile-apps bot left a comment

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    1 file reviewed, 1 comment

    Edit Code Review Agent Settings | Greptile

    // size is larger than both register and shared memory size of a single SM.
    // Context: cluster reduction only uses register persistence while block
    // reduction may also use shared memory persistence.
    const int64_t sm_per_cluster = scheduler_utils::getMaxClusterSize();
    Copy link
    Contributor

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    syntax: Add #include <scheduler/utils.h> at the top of the file (after line 14) to declare scheduler_utils::getMaxClusterSize(). Without this include, the code will not compile.

    Reference: tests/cpp/test_cluster.cpp includes this header on line 17.

    @liqiangxl
    Copy link
    Collaborator Author

    !test

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

    Labels

    None yet

    Projects

    None yet

    Development

    Successfully merging this pull request may close these issues.

    2 participants