Lwpck 3052 grouped conv split image #2970

JH-Leon-KIM-AMD · 2025-10-03T17:03:23Z

Implement Recursive Split-Image for Large Tensor Support

Summary

Implemented recursive split-image algorithm for grouped convolution to handle large tensors exceeding 2GB memory threshold by recursively splitting spatial dimensions and launching multiple kernel pieces with correct offset management.

Issue

Large convolution tensors (> 2GB) cannot fit in GPU memory or exceed descriptor size limits, causing:

GPU memory allocation failures
Incorrect results due to descriptor overflow
Unable to run large-scale convolution workloads

Example: N=1, W=4.2M, C=256, K=256 → Output = 2.02 GB (exceeds 2GB limit)

Goal

Enable large convolution execution by automatically splitting tensors into pieces < 2GB with numerical correctness and Split-N compatibility.

Workflow

1. Create kargs = MakeKernelArgs(args)
2. Check split_info = kargs.GetSplitImageInfo()
3a. If NO split: launch_kernel(kargs)
3b. If split: LaunchWithRecursiveSplit()
    - BFS split until pieces < threshold
    - For each piece: launch_kernel(piece_kargs)

Implementation

Core Algorithm: Recursive Binary Split

Split spatial dimensions (W/H/D) recursively using breadth-first search until each piece fits under threshold.

Original tensor: W=4.2M (2.02 GB - exceeds threshold)
    ↓ Split W into LEFT + RIGHT
├─ LEFT:  W=2.1M (1.01 GB)
└─ RIGHT: W=2.1M (1.01 GB)

Launch 2 kernels with different spatial offsets

Split Implementation

1. Descriptor Splitting (Dimensions)

Binary split on dimension 0 (first spatial dimension):

1D: Split W → W_left + W_right
2D: Split H → H_left + H_right (keep W unchanged)
3D: Split D → D_left + D_right (keep H, W unchanged)

Example: 1D convolution W=4.2M, C=256, K=256, G=1 → Output = 2.02 GB (exceeds threshold)

// Original descriptor
N=1, W=4200000, G=1, C=256, K=256
Output size = 1 × 4200000 × 1 × 256 = 1.075B elements = 2.02 GB

// After split
LEFT:  N=1, W=2100000, G=1, C=256, K=256 → 1.01 GB
RIGHT: N=1, W=2100000, G=1, C=256, K=256 → 1.01 GB

2. Pointer Offset Calculation

Example: 1D with W=4.2M, C=256, K=256, stride=1, pad=1

// LEFT piece
input_offset  = 0
output_offset = 0

// RIGHT piece (W_left=2100000)
physical_offset = (2100000 × 1) - 1 = 2099999
input_offset  = 2099999 × (1 × 256) = 537,599,744 elements
output_offset = 2100000 × (1 × 256) = 537,600,000 elements

3. Recursive Splitting (Breadth-First Search)

Stop condition: piece < 2GB OR depth >= MAX_DEPTH(10)

Example: 2D convolution H=4096, W=2048, C=256, K=256 → Output = 4.29 GB

// BFS iteration 1: Original problem (4.29 GB > 2GB, depth=0)
Original: H=4096, W=2048, offset=0, depth=0 → Split!
├─ LEFT:  H=2048, W=2048, offset=0, depth=1 (2.15 GB > 2GB) → Split!
└─ RIGHT: H=2048, W=2048, offset=X, depth=1 (2.15 GB > 2GB) → Split!

// BFS iteration 2: LEFT piece (2.15 GB > 2GB, depth=1)
LEFT: H=2048, offset=0, depth=1 → Split!
├─ LEFT:  H=1024, W=2048, offset=0, depth=2 (1.07 GB < 2GB) → Ready
└─ RIGHT: H=1024, W=2048, offset=Y, depth=2 (1.07 GB < 2GB) → Ready

// BFS iteration 3: RIGHT piece (2.15 GB > 2GB, depth=1)
RIGHT: H=2048, offset=X, depth=1 → Split!
├─ LEFT:  H=1024, W=2048, offset=X, depth=2 (1.07 GB < 2GB) → Ready
└─ RIGHT: H=1024, W=2048, offset=X+Y, depth=2 (1.07 GB < 2GB) → Ready

Result: 4 pieces ready to launch (each ~1GB)

Offset Application in Kernel

batch_id = blockIdx.z

ptr = base_ptr +
      group_offset +
      (batch_id × batch_stride) +     // Uses ORIGINAL dimensions
      spatial_offset;                  // Uses SPLIT dimensions

Integration with Split-N

Split-N (batch dimension) and Split-Image (spatial dimensions) work together independently.

Testing

Three test suites with different thresholds:

100KB threshold: 20+ tests (thorough algorithm validation)
10MB threshold: 142 tests (94×1D, 20×2D, 28×3D comprehensive)

All 142+ tests passing with CPU verification.

Note: 2GB threshold tests (15 typical + 8 large >2GB) exist but not run in CI due to memory requirements.

Performance

No overhead when no split needed
Linear scaling with number of pieces
Each piece < 2GB

Files Changed

grouped_convolution_forward_invoker.hpp
transform_conv_fwd_to_gemm.hpp
grouped_convolution_forward_kernel.hpp

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

I have added tests relevant to the introduced functionality, and the unit tests are passing locally
I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
I have added inline documentation which enables the maintainers with understanding the motivation
I have removed the stale documentation which is no longer relevant after this pull request
(If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
I have run clang-format on all changed files
Any dependent changes have been merged

…nt variables

- Added debug prints for padding calculations in transform_conv_fwd_to_gemm.hpp - Verified padding works correctly with all tests passing

- Cast blockIdX from unsigned to signed index_t for comparisons - Integrated with new GetOutputTileIndex logic from upstream - Updated to use amd_wave_read_first_lane instead of __builtin_amdgcn_readfirstlane

- Fixed batch stride calculation to include G dimension for grouped convolutions - When moving between batches in NHWGC/NWGC/NDHWGC layouts, need to account for all groups - Removed unused multi-split parameters (we only support 2-way split) - All tests now pass: G=1 with Split-N, G>1 with Split-N, G>1 without Split-N

- Add LaunchKernelWithSplitIfNeeded() helper method in transform_conv_fwd_to_gemm.hpp - Implement recursive binary splitting algorithm (10GB→5GB+5GB→...) - Correctly handle odd dimensions (61→30+31) - Calculate proper offsets for each split piece - Update invoker to use split-image helper Note: Split detection and calculation work correctly but kernel launching for individual pieces requires kernel modification to handle different spatial dimensions (unlike Split-N which uses blockIdx.z).

- Split-N modifies N_ directly in transformer constructor - Split-Image needs different approach due to varying dimensions - Added split calculation logic for 1D and 2D convolutions - Still facing memory issues when creating piece transformers Key finding: Split-N uses blockIdx.z for parallel execution, while Split-Image needs sequential execution of non-uniform pieces.

Implements split-image for 1D convolution to handle large tensors that exceed memory thresholds. This is a critical milestone with N=1 fully working and tested. Key Changes: - Invoker: Add split-image logic that splits W dimension in half - Transformer: Add SplitConvProblem helper for recursive splitting - Calculate offsets for LEFT and RIGHT pieces - Launch two kernels sequentially (LEFT then RIGHT) Implementation Details: - Binary split: divides W dimension by 2 - LEFT piece: W=0 to W/2, keeps left padding, removes right padding - RIGHT piece: W/2 to W, removes left padding, keeps right padding - Offset calculation accounts for stride, dilation, and padding - Physical memory offset (no padding in memory) Test Results (N=1): ✅ 94/94 tests passing - Comprehensive tests: 36/36 (channels, padding, stride, dilation, filters, groups) - Edge case tests: 31/31 (odd dimensions, extreme parameters, boundaries) - Stress tests: 27/27 (maximum dimensions, up to 91.4 TFlops) Known Limitations: - Only works with N=1 (single batch) - N>1 fails when split-image triggers (offset calculation issue with Split-N) - Root cause: Split-N modifies N in transformer, but offset calculated in invoker - Solution planned: Move offset calculation to transformer (next phase) Files Modified: - grouped_convolution_forward_invoker.hpp: Add split-image logic - transform_conv_fwd_to_gemm.hpp: Add SplitConvProblem helper This commit represents a stable, tested 1D split-image implementation for N=1 cases. It's an important milestone before extending to N>1 and multi-dimensional splits.

This is a working baseline implementation that splits large spatial dimensions to handle memory constraints. Implementation: - 1D: W-split for NWGC layout (36/36 tests passing) - 2D: H-split for NHWGC layout (20/20 tests passing) - 3D: D-split for NDHWGC layout (verified working) Features: - Binary split of outermost spatial dimension - Sequential LEFT/RIGHT kernel launches - Proper padding adjustment at split boundaries - Offset calculation for pointer arithmetic - Debug output for verification Threshold: 100KB (configurable in transformer) Known limitations: - No safety checks for edge cases (to be added) - Offset calculated before Split-N (incompatible with N>1, to be fixed) - No recursive splitting for very large tensors Next steps: - Add safety checks (is_possible_to_split_*) - Move offset calculation to transformer (after Split-N) - Test with N>1 + split-image combination

Unified the three separate dimension-specific blocks into a single common implementation with dimension-specific stride calculations. Benefits: - Reduced code from 636 → 348 lines (45% reduction) - Eliminated code duplication - Easier to maintain and extend - Single source of truth for split logic Implementation: - Common: Binary split, offset calc, padding adjustment, kernel launch - Dimension-specific: Stride calculation only - 1D: stride = G * C - 2D: stride = W_in * G * C - 3D: stride = H_in * W_in * G * C Test results (all passing): - 1D: 36/36 tests ✅ - 2D: 20/20 tests ✅ - 3D: 28/28 tests ✅ - Total: 84/84 (100%) All test scenarios verified: - Varying channels, padding, stride, dilation - Filter sizes (1x1 pointwise to 7x7) - Multiple groups (G=1,2,4) - Odd dimensions - Complex combinations

Added is_possible_to_split safety checks to prevent crashes when splitting is not feasible. Safety checks verify: 1. Output dimension > 1 (can't split single element) 2. RIGHT piece starts after left padding 3. LEFT piece ends within input bounds If checks fail, falls back to normal kernel launch. Verified for all dimensions: - 1D (W-split): Wo=1 case triggers fallback - 2D (H-split): Ho=1 case triggers fallback - 3D (D-split): Do=1 case triggers fallback Original 84 tests still pass - they use normal configurations that naturally satisfy safety conditions. Safety checks protect against pathological edge cases with: - Very small spatial dimensions - Extreme stride/dilation combinations - Invalid padding configurations

Fixed critical bug where Split-N and Split-Image working together caused ~50% incorrect results due to wrong batch stride calculation. Problem: - Batch stride was calculated using MODIFIED spatial dimensions (e.g., W=50000 after split) instead of ORIGINAL dimensions (W=100000) - Spatial offset was applied globally in invoker, not per-batch in kernel - Each batch (blockIdx.z) got wrong memory offset Solution: 1. Store spatial offset in kargs (don't apply to pointer in invoker) 2. Copy correct batch_stride from temp_kargs to left/right kargs 3. Apply formula in operator(): ptr = base + (batch × stride) + spatial_offset Changes: - grouped_convolution_forward_kernel.hpp: * Added spatial_offset_in/out fields to KernelArgs * Apply batch + spatial offset in operator() - grouped_convolution_forward_invoker.hpp: * Keep base pointer, store spatial offset in kargs * Copy batch_stride from temp_kargs (has original dimensions) - transform_conv_fwd_to_gemm.hpp: * Add debug output for split-image calculation Results: - N=1 tests: 84/84 passing (100%) - N>1 tests: Now all passing (previously ~50% errors) - Tested: 1D, 2D, 3D with N=1,2,4,8,16,20

This commit consolidates threshold management for both Split-N and Split-Image operations into a single source of truth, eliminating code duplication and fixing offset calculation issues. Key Changes: ============ 1. Transformer (transform_conv_fwd_to_gemm.hpp): - Moved TwoGB constant to public section for unified access - CalculateSplitImage() now takes no parameters - Uses internal threshold: TwoGB / sizeof(CDataType) - Calculates offsets using N_ (after Split-N) for correctness 2. Kernel (grouped_convolution_forward_kernel.hpp): - GetSplitImageInfo() simplified to take no parameters - Forwards to transformer's CalculateSplitImage() - Clean interface with unified threshold internally 3. Invoker (grouped_convolution_forward_invoker.hpp): - Removed redundant threshold calculation - Simplified to call kargs.GetSplitImageInfo() with no params - Clean early-return pattern (no unnecessary else blocks) - Removed duplicate/dead code paths Benefits: ========= - Single source of truth: TwoGB defined once in transformer - No parameter passing for threshold between components - Correct offset calculation using N_ (post-Split-N) - Cleaner code with no duplication - All tests passing: 1D/2D/3D with various N values Testing: ======== - Split-Image only (N=1, large spatial): PASS - Split-N only (N>1, small spatial): PASS - Both splits active (N>1, large spatial): PASS - No splits (N=1, small spatial): PASS - CPU verification correct for all scenarios

…ithSplitIfNeeded) The old recursive queue-based implementation has been replaced by the new CalculateSplitImage() method which is simpler and correctly handles Split-N + Split-Image interaction. Changes: - Wrapped lines 381-1078 in #if 0...#endif - Old methods: SplitConvProblem() and LaunchKernelWithSplitIfNeeded() - Preserved for reference but disabled from compilation - No functional changes - all tests still pass The new implementation (CalculateSplitImage at line ~2163) provides: - Correct offset calculation using N_ (after Split-N) - Simpler binary split logic - Better integration with unified threshold approach

Changes: - Add depth tracking to SplitPiece struct - Implement two stopping conditions: 1. Piece size below threshold (optimal case) 2. Depth >= MAX_DEPTH (prevents infinite recursion) - Remove MAX_PIECES limit in favor of depth-based control - Support up to 2^10 = 1024 pieces with depth 10 This allows handling extreme tensor sizes while ensuring termination. Pieces larger than threshold will still launch correctly if depth limit reached. Tested with H=100 (4 levels), H=2000 (6 levels), H=4000 (9 levels) - all pass CPU verification.

- Recursive queue-based splitting with depth limit (MAX_DEPTH=10, up to 1024 pieces) - Two stopping conditions: size below threshold OR max depth reached - Cumulative offset tracking through all recursion levels - LEFT piece inherits parent offset, RIGHT accumulates (parent + local) - Per-batch spatial offset application in kernel operator() - Batch stride uses original dimensions (before split) - Works with Split-N: split-N first, then recursive split-image - Handles odd dimensions, padding, stride, dilation correctly - All 1D/2D/3D tests pass with CPU verification

- Move LaunchWithRecursiveSplit() from invoker to transform_conv_fwd_to_gemm.hpp - Simplify invoker from ~250 lines to ~140 lines (removed 110 lines of inline logic) - Encapsulate SplitPiece struct and BFS splitting algorithm in transformer - Remove unused includes (queue, vector) from invoker - Add documentation comment for AreDescriptorsSmallerThan2GB() - Improve code organization and reusability - No performance overhead (static template function, compiler inlines) - All tests passing with 2GB production threshold

- Format invoker and transformer files with clang-format-18 - Fix brace placement and alignment - No functional changes

- Remove extra blank lines - Fix line wrapping for template calls - Consolidate GetSplitImageInfo() to single line

Copilot

Pull Request Overview

This PR implements a recursive split-image algorithm for grouped convolution to handle large tensors exceeding 2GB memory threshold by automatically splitting spatial dimensions and launching multiple kernel pieces with correct offset management.

Key changes:

Added recursive binary splitting algorithm for spatial dimensions (W/H/D) with breadth-first search until pieces fit under 2GB threshold
Integrated split-image with existing Split-N support for independent batch and spatial dimension splitting
Implemented comprehensive offset calculation system for correct memory addressing across split pieces

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

File	Description
`transform_conv_fwd_to_gemm.hpp`	Added `SplitImageInfo` structure, `CalculateSplitImage()` method, and `LaunchWithRecursiveSplit()` template function for split-image algorithm implementation
`grouped_convolution_forward_kernel.hpp`	Modified kernel args to store transformer instance, added spatial offset fields, and updated batch stride calculations to include group dimension
`grouped_convolution_forward_invoker.hpp`	Added split-image detection logic and routing to either direct kernel launch or recursive split-image path
`grouped_convolution_forward.cpp`	Removed trailing blank line

_{Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.}

include/ck_tile/ops/grouped_convolution/utils/transform_conv_fwd_to_gemm.hpp

Copilot · 2025-10-07T07:17:57Z

include/ck_tile/ops/grouped_convolution/utils/transform_conv_fwd_to_gemm.hpp

-    __host__ bool AreDescriptorsSmallerThan2GB() const
+    // Check if descriptors fit within memory threshold
+    // NOTE: Not used by forward convolution (uses CalculateSplitImage() instead)
+    // May be used by backward convolution implementations


The comment indicates this function is not used by forward convolution. If this function is truly unused in the current codebase, consider removing it or marking it as deprecated to avoid confusion.

Suggested change

// May be used by backward convolution implementations

// May be used by backward convolution implementations

[[deprecated("Not used by forward convolution; may be removed in the future.")]]

include/ck_tile/ops/grouped_convolution/kernel/grouped_convolution_forward_kernel.hpp

bartekxk · 2025-10-07T07:19:32Z

example/ck_tile/20_grouped_convolution/grouped_convolution_forward_invoker.hpp

+            auto kargs = Kernel::MakeKernelArgs(args);
+
+            // Check if split-image is needed (uses unified threshold internally)
+            auto split_info = kargs.GetSplitImageInfo();


Maybe just IsSplitImageNeeded and return bool?

include/ck_tile/ops/grouped_convolution/kernel/grouped_convolution_forward_kernel.hpp

include/ck_tile/ops/grouped_convolution/utils/transform_conv_fwd_to_gemm.hpp

…wd_to_gemm.hpp Co-authored-by: Copilot <[email protected]>

…tion_forward_kernel.hpp Co-authored-by: Copilot <[email protected]>

…t out old split-image code

- Implemented spatial dimension splitting (Split-Image) for large tensors - Added piece-based coordinate transformation for 1D/2D/3D convolutions - Integrated Split-N (batch splitting) with automatic threshold detection - Fixed M dimension calculation to include batch: M = N × spatial_size - Added spatial offset support in kernel arguments - Verified 20/20 test cases passing for Split-Image alone - Known issue: Split-N + Split-Image combination needs coordinate fix Implementation Details: - Split factors: 4 (1D), 4×4 (2D), 4×4×4 (3D) - temporary fixed values - Batch strides properly calculated for NWGC/NHWGC/NDHWGC layouts - Piece descriptors track spatial boundaries and block ranges - No performance overhead for N=1 cases

- Store actual size per piece to handle non-uniform splits - Remove dead code from transform utils

Problem: Single split factor caused non-uniform pieces when dimensions didn't divide evenly. Result: 18/25 (72%) 2D padding combinations failed. Solution: Independent split factor selection for W, H, D dimensions. Each dimension gets optimal factor based on its own size. Test Results: - 1D: 42/42 pass (100%) - 2D: 25/25 pass (100%) - Total: 67/67 combinations verified

Cleanup of split-image implementation: - Removed unused piece_d, piece_h, piece_w fields from SplitImageInfo struct - These fields were declared but never used in the kernel - Per-piece dimensions are already stored in pieces[] array - Reduces struct size and improves code clarity Tested: 1D/2D/3D convolutions with split-image, padding, stride all pass

- Extract piece calculation logic into calculate_piece lambda helper - Extract kernel args population into populate_split_image_kargs lambda - Use aggregate initialization for cleaner struct population - Reduce nesting depth and improve maintainability - Fix outdated comment about split-image implementation status

- Extract GPU kernel helper lambdas for better readability - Remove all split-image debug print statements - Set memory threshold to 2GB for production - All tests pass with CPU verification

bartekxk · 2025-10-21T07:37:58Z

example/ck_tile/20_grouped_convolution/grouped_convolution_forward_invoker.hpp

+            // Helper: Calculate single piece information
+            auto calculate_piece = [&](ck_tile::index_t piece_idx) -> TempPieceInfo {
+                const ck_tile::index_t w_idx = piece_idx % num_w_pieces;
+                const ck_tile::index_t h_idx = (piece_idx / num_w_pieces) % num_h_pieces;
+                const ck_tile::index_t d_idx = piece_idx / (num_w_pieces * num_h_pieces);
+
+                const ck_tile::index_t w_start = w_idx * base_piece_w;
+                const ck_tile::index_t h_start = h_idx * base_piece_h;
+                const ck_tile::index_t d_start = d_idx * base_piece_d;
+
+                const ck_tile::index_t w_size =
+                    (w_idx == num_w_pieces - 1) ? (total_w - w_start) : base_piece_w;
+                const ck_tile::index_t h_size =
+                    (h_idx == num_h_pieces - 1) ? (total_h - h_start) : base_piece_h;
+                const ck_tile::index_t d_size =
+                    (d_idx == num_d_pieces - 1) ? (total_d - d_start) : base_piece_d;
+
+                const ck_tile::index_t piece_gemm_m = args.N_ * d_size * h_size * w_size;
+                const ck_tile::index_t piece_gemm_n = args.K_;
+                const ck_tile::index_t piece_grid =
+                    ((piece_gemm_m + TilePartitioner::MPerBlock - 1) / TilePartitioner::MPerBlock) *
+                    ((piece_gemm_n + TilePartitioner::NPerBlock - 1) / TilePartitioner::NPerBlock);
+
+                return {total_blocks,
+                        total_blocks + piece_grid,
+                        d_start,
+                        h_start,
+                        w_start,
+                        d_size,
+                        h_size,
+                        w_size};
+            };


Can we move this to some utils?

I moved this function to
include/ck_tile/ops/grouped_convolution/utils/grouped_convolution_utils.hpp

bartekxk · 2025-10-21T07:43:32Z

include/ck_tile/ops/grouped_convolution/kernel/grouped_convolution_forward_kernel.hpp

+        // Pre-calculate block_id (used in both split-image and non-split paths)
+        const index_t block_id = static_cast<index_t>(blockIdX);
+
+        if(kargs.num_spatial_pieces > 1)


Can we change it to constexpr value? I mean something like we initialize the kernel with SplitImageSupport in grouped conv utils and then deduce in compile time.

Yes I corrected this as constexpr.

bartekxk · 2025-10-21T07:44:03Z

example/ck_tile/20_grouped_convolution/grouped_convolution_forward_invoker.hpp

                                      1,
                                      std::multiplies<ck_tile::index_t>());

+        // Split-K parameters


Can we create two seperate examples? The first one for regular convolution and second one for large image with split image?

bartekxk · 2025-10-21T07:44:39Z

include/ck_tile/ops/grouped_convolution/kernel/grouped_convolution_forward_kernel.hpp

+        {
+            // Helper: Find which piece owns this block
+            auto find_piece_id = [&]() -> index_t {
+                for(index_t i = 0; i < kargs.num_spatial_pieces; i++)


you can use golden division like in include/ck/tensor_operation/gpu/device/impl/device_grouped_conv_fwd_multiple_d_xdl_large_tensor_cshuffle.hpp

bartekxk · 2025-10-21T07:47:13Z

include/ck_tile/ops/grouped_convolution/kernel/grouped_convolution_forward_kernel.hpp

+                TilePartitioner{local_gemm_m, kargs.GemmN}.GetOutputTileIndex(local_block_id);
+
+            // Extract batch and spatial coordinates from local tile
+            const index_t local_m_start      = local_tile_m * TilePartitioner::MPerBlock;


Can we simpilify this logic? Can we just pass some set of descriptors and select appropriate based on blockIdx?

- Add MAX_TOTAL_PIECES=64 limit to prevent segfault - Move calculate_spatial_piece to library utils - Add layout validation (NWGC, NHWGC, NDHWGC only) - Fix hierarchical splitting to respect piece limits - Add proper documentation and formatting

@bartekxk

Response to @bartekxk review comment: Convert 'if(kargs.num_spatial_pieces > 1)' to 'if constexpr(EnableSplitImage)' Changes: - Add EnableSplitImage template parameter to kernel - Change runtime if to compile-time if constexpr - Update invoker to instantiate kernel variants with true/false Benefits: - Eliminates runtime branching in GPU kernel - Dead code elimination (each variant is smaller) - Better compiler optimization Files modified: 2 Lines changed: 20 total (6 in kernel, 14 in invoker) Tests: 27/27 passed (100%) Performance: No regression

- Add EnableSplitImage template parameter to kernel - Change runtime if to if constexpr - Update invoker to pass true/false

- Create grouped_convolution_forward_split_image example - Add grouped_convolution_forward_split_image_invoker.hpp - Update CMakeLists.txt to build split_image binary

- Change O(n) to O(log n) for finding piece ownership - Matches reference implementation in large_tensor_cshuffle

JH-Leon-KIM-AMD added 20 commits September 28, 2025 13:02

Refactor split-image implementation: simplify code and remove redunda…

4403d55

…nt variables

Add padding debug output to split-image implementation

74a7179

- Added debug prints for padding calculations in transform_conv_fwd_to_gemm.hpp - Verified padding works correctly with all tests passing

Fix sign comparison warning after rebase with origin/develop

e94a485

- Cast blockIdX from unsigned to signed index_t for comparisons - Integrated with new GetOutputTileIndex logic from upstream - Updated to use amd_wave_read_first_lane instead of __builtin_amdgcn_readfirstlane

Add comment explaining MAX_DEPTH capacity for 2GB threshold

34326e2

Apply clang-format-18 formatting

8c6d280

- Format invoker and transformer files with clang-format-18 - Fix brace placement and alignment - No functional changes

Fix clang-format-18 issues in forward kernel

54869a3

- Remove extra blank lines - Fix line wrapping for template calls - Consolidate GetSplitImageInfo() to single line

Merge branch 'develop' into LWPCK-3052-grouped-conv-split-image

08bc24d

JH-Leon-KIM-AMD marked this pull request as ready for review October 6, 2025 13:35

JH-Leon-KIM-AMD requested review from afagaj, andriy-ca, aosewski, bartekxk, carlushuang, geyyer, illsilin, poyenc and qianfengz as code owners October 6, 2025 13:35

JH-Leon-KIM-AMD requested review from ThomasNing, aska-0096, asleepzzz, cgmillette, coderfeli, shumway, tenpercent and vidyasagar-amd as code owners October 6, 2025 13:35

bartekxk requested a review from Copilot October 7, 2025 07:16

Copilot AI reviewed Oct 7, 2025

View reviewed changes

bartekxk reviewed Oct 7, 2025

View reviewed changes

JH-Leon-KIM-AMD and others added 13 commits October 8, 2025 16:50

Update include/ck_tile/ops/grouped_convolution/utils/transform_conv_f…

e75944e

…wd_to_gemm.hpp Co-authored-by: Copilot <[email protected]>

Update include/ck_tile/ops/grouped_convolution/utils/transform_conv_f…

a1f9d7e

…wd_to_gemm.hpp Co-authored-by: Copilot <[email protected]>

Update include/ck_tile/ops/grouped_convolution/kernel/grouped_convolu…

ca06bfe

…tion_forward_kernel.hpp Co-authored-by: Copilot <[email protected]>

Update include/ck_tile/ops/grouped_convolution/kernel/grouped_convolu…

5fbaecf

…tion_forward_kernel.hpp Co-authored-by: Copilot <[email protected]>

Merge branch 'develop' into LWPCK-3052-grouped-conv-split-image

0ea5ece

Merge branch 'develop' into LWPCK-3052-grouped-conv-split-image

03e44ee

Merge develop into LWPCK-3052: Accept universal GEMM pipeline, commen…

58df1f6

…t out old split-image code

Fix 1D split-image padding issue with per-piece dimensions

a779af6

- Store actual size per piece to handle non-uniform splits - Remove dead code from transform utils

Refactor split-image code and remove debug prints

49b622e

- Extract GPU kernel helper lambdas for better readability - Remove all split-image debug print statements - Set memory threshold to 2GB for production - All tests pass with CPU verification

bartekxk reviewed Oct 21, 2025

View reviewed changes

JH-Leon-KIM-AMD added 5 commits October 22, 2025 07:14

Change split-image to compile-time branching

29fed44

- Add EnableSplitImage template parameter to kernel - Change runtime if to if constexpr - Update invoker to pass true/false

Add split-image example as separate binary

b28ea3c

- Create grouped_convolution_forward_split_image example - Add grouped_convolution_forward_split_image_invoker.hpp - Update CMakeLists.txt to build split_image binary

Replace linear search with binary search in find_piece_id

badffd8

- Change O(n) to O(log n) for finding piece ownership - Matches reference implementation in large_tensor_cshuffle

	// May be used by backward convolution implementations
	// May be used by backward convolution implementations
	[[deprecated("Not used by forward convolution; may be removed in the future.")]]

Lwpck 3052 grouped conv split image #2970

Are you sure you want to change the base?

Lwpck 3052 grouped conv split image #2970

Uh oh!

Conversation

JH-Leon-KIM-AMD commented Oct 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Implement Recursive Split-Image for Large Tensor Support

Summary

Issue

Goal

Workflow

Implementation

Core Algorithm: Recursive Binary Split

Split Implementation

Offset Application in Kernel

Integration with Split-N

Testing

Performance

Files Changed

Checklist

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull Request Overview

Reviewed Changes

Uh oh!

Uh oh!

Uh oh!

Copilot AI Oct 7, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

JH-Leon-KIM-AMD commented Oct 3, 2025 •

edited

Loading