-
Notifications
You must be signed in to change notification settings - Fork 245
Lwpck 3052 grouped conv split image #2970
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
- Added debug prints for padding calculations in transform_conv_fwd_to_gemm.hpp - Verified padding works correctly with all tests passing
- Cast blockIdX from unsigned to signed index_t for comparisons - Integrated with new GetOutputTileIndex logic from upstream - Updated to use amd_wave_read_first_lane instead of __builtin_amdgcn_readfirstlane
- Fixed batch stride calculation to include G dimension for grouped convolutions - When moving between batches in NHWGC/NWGC/NDHWGC layouts, need to account for all groups - Removed unused multi-split parameters (we only support 2-way split) - All tests now pass: G=1 with Split-N, G>1 with Split-N, G>1 without Split-N
- Add LaunchKernelWithSplitIfNeeded() helper method in transform_conv_fwd_to_gemm.hpp - Implement recursive binary splitting algorithm (10GB→5GB+5GB→...) - Correctly handle odd dimensions (61→30+31) - Calculate proper offsets for each split piece - Update invoker to use split-image helper Note: Split detection and calculation work correctly but kernel launching for individual pieces requires kernel modification to handle different spatial dimensions (unlike Split-N which uses blockIdx.z).
- Split-N modifies N_ directly in transformer constructor - Split-Image needs different approach due to varying dimensions - Added split calculation logic for 1D and 2D convolutions - Still facing memory issues when creating piece transformers Key finding: Split-N uses blockIdx.z for parallel execution, while Split-Image needs sequential execution of non-uniform pieces.
Implements split-image for 1D convolution to handle large tensors that exceed memory thresholds. This is a critical milestone with N=1 fully working and tested. Key Changes: - Invoker: Add split-image logic that splits W dimension in half - Transformer: Add SplitConvProblem helper for recursive splitting - Calculate offsets for LEFT and RIGHT pieces - Launch two kernels sequentially (LEFT then RIGHT) Implementation Details: - Binary split: divides W dimension by 2 - LEFT piece: W=0 to W/2, keeps left padding, removes right padding - RIGHT piece: W/2 to W, removes left padding, keeps right padding - Offset calculation accounts for stride, dilation, and padding - Physical memory offset (no padding in memory) Test Results (N=1): ✅ 94/94 tests passing - Comprehensive tests: 36/36 (channels, padding, stride, dilation, filters, groups) - Edge case tests: 31/31 (odd dimensions, extreme parameters, boundaries) - Stress tests: 27/27 (maximum dimensions, up to 91.4 TFlops) Known Limitations: - Only works with N=1 (single batch) - N>1 fails when split-image triggers (offset calculation issue with Split-N) - Root cause: Split-N modifies N in transformer, but offset calculated in invoker - Solution planned: Move offset calculation to transformer (next phase) Files Modified: - grouped_convolution_forward_invoker.hpp: Add split-image logic - transform_conv_fwd_to_gemm.hpp: Add SplitConvProblem helper This commit represents a stable, tested 1D split-image implementation for N=1 cases. It's an important milestone before extending to N>1 and multi-dimensional splits.
This is a working baseline implementation that splits large spatial dimensions to handle memory constraints. Implementation: - 1D: W-split for NWGC layout (36/36 tests passing) - 2D: H-split for NHWGC layout (20/20 tests passing) - 3D: D-split for NDHWGC layout (verified working) Features: - Binary split of outermost spatial dimension - Sequential LEFT/RIGHT kernel launches - Proper padding adjustment at split boundaries - Offset calculation for pointer arithmetic - Debug output for verification Threshold: 100KB (configurable in transformer) Known limitations: - No safety checks for edge cases (to be added) - Offset calculated before Split-N (incompatible with N>1, to be fixed) - No recursive splitting for very large tensors Next steps: - Add safety checks (is_possible_to_split_*) - Move offset calculation to transformer (after Split-N) - Test with N>1 + split-image combination
Unified the three separate dimension-specific blocks into a single common implementation with dimension-specific stride calculations. Benefits: - Reduced code from 636 → 348 lines (45% reduction) - Eliminated code duplication - Easier to maintain and extend - Single source of truth for split logic Implementation: - Common: Binary split, offset calc, padding adjustment, kernel launch - Dimension-specific: Stride calculation only - 1D: stride = G * C - 2D: stride = W_in * G * C - 3D: stride = H_in * W_in * G * C Test results (all passing): - 1D: 36/36 tests ✅ - 2D: 20/20 tests ✅ - 3D: 28/28 tests ✅ - Total: 84/84 (100%) All test scenarios verified: - Varying channels, padding, stride, dilation - Filter sizes (1x1 pointwise to 7x7) - Multiple groups (G=1,2,4) - Odd dimensions - Complex combinations
Added is_possible_to_split safety checks to prevent crashes when splitting is not feasible. Safety checks verify: 1. Output dimension > 1 (can't split single element) 2. RIGHT piece starts after left padding 3. LEFT piece ends within input bounds If checks fail, falls back to normal kernel launch. Verified for all dimensions: - 1D (W-split): Wo=1 case triggers fallback - 2D (H-split): Ho=1 case triggers fallback - 3D (D-split): Do=1 case triggers fallback Original 84 tests still pass - they use normal configurations that naturally satisfy safety conditions. Safety checks protect against pathological edge cases with: - Very small spatial dimensions - Extreme stride/dilation combinations - Invalid padding configurations
Fixed critical bug where Split-N and Split-Image working together caused ~50% incorrect results due to wrong batch stride calculation. Problem: - Batch stride was calculated using MODIFIED spatial dimensions (e.g., W=50000 after split) instead of ORIGINAL dimensions (W=100000) - Spatial offset was applied globally in invoker, not per-batch in kernel - Each batch (blockIdx.z) got wrong memory offset Solution: 1. Store spatial offset in kargs (don't apply to pointer in invoker) 2. Copy correct batch_stride from temp_kargs to left/right kargs 3. Apply formula in operator(): ptr = base + (batch × stride) + spatial_offset Changes: - grouped_convolution_forward_kernel.hpp: * Added spatial_offset_in/out fields to KernelArgs * Apply batch + spatial offset in operator() - grouped_convolution_forward_invoker.hpp: * Keep base pointer, store spatial offset in kargs * Copy batch_stride from temp_kargs (has original dimensions) - transform_conv_fwd_to_gemm.hpp: * Add debug output for split-image calculation Results: - N=1 tests: 84/84 passing (100%) - N>1 tests: Now all passing (previously ~50% errors) - Tested: 1D, 2D, 3D with N=1,2,4,8,16,20
This commit consolidates threshold management for both Split-N and Split-Image operations into a single source of truth, eliminating code duplication and fixing offset calculation issues. Key Changes: ============ 1. Transformer (transform_conv_fwd_to_gemm.hpp): - Moved TwoGB constant to public section for unified access - CalculateSplitImage() now takes no parameters - Uses internal threshold: TwoGB / sizeof(CDataType) - Calculates offsets using N_ (after Split-N) for correctness 2. Kernel (grouped_convolution_forward_kernel.hpp): - GetSplitImageInfo() simplified to take no parameters - Forwards to transformer's CalculateSplitImage() - Clean interface with unified threshold internally 3. Invoker (grouped_convolution_forward_invoker.hpp): - Removed redundant threshold calculation - Simplified to call kargs.GetSplitImageInfo() with no params - Clean early-return pattern (no unnecessary else blocks) - Removed duplicate/dead code paths Benefits: ========= - Single source of truth: TwoGB defined once in transformer - No parameter passing for threshold between components - Correct offset calculation using N_ (post-Split-N) - Cleaner code with no duplication - All tests passing: 1D/2D/3D with various N values Testing: ======== - Split-Image only (N=1, large spatial): PASS - Split-N only (N>1, small spatial): PASS - Both splits active (N>1, large spatial): PASS - No splits (N=1, small spatial): PASS - CPU verification correct for all scenarios
…ithSplitIfNeeded) The old recursive queue-based implementation has been replaced by the new CalculateSplitImage() method which is simpler and correctly handles Split-N + Split-Image interaction. Changes: - Wrapped lines 381-1078 in #if 0...#endif - Old methods: SplitConvProblem() and LaunchKernelWithSplitIfNeeded() - Preserved for reference but disabled from compilation - No functional changes - all tests still pass The new implementation (CalculateSplitImage at line ~2163) provides: - Correct offset calculation using N_ (after Split-N) - Simpler binary split logic - Better integration with unified threshold approach
Changes: - Add depth tracking to SplitPiece struct - Implement two stopping conditions: 1. Piece size below threshold (optimal case) 2. Depth >= MAX_DEPTH (prevents infinite recursion) - Remove MAX_PIECES limit in favor of depth-based control - Support up to 2^10 = 1024 pieces with depth 10 This allows handling extreme tensor sizes while ensuring termination. Pieces larger than threshold will still launch correctly if depth limit reached. Tested with H=100 (4 levels), H=2000 (6 levels), H=4000 (9 levels) - all pass CPU verification.
- Recursive queue-based splitting with depth limit (MAX_DEPTH=10, up to 1024 pieces) - Two stopping conditions: size below threshold OR max depth reached - Cumulative offset tracking through all recursion levels - LEFT piece inherits parent offset, RIGHT accumulates (parent + local) - Per-batch spatial offset application in kernel operator() - Batch stride uses original dimensions (before split) - Works with Split-N: split-N first, then recursive split-image - Handles odd dimensions, padding, stride, dilation correctly - All 1D/2D/3D tests pass with CPU verification
- Move LaunchWithRecursiveSplit() from invoker to transform_conv_fwd_to_gemm.hpp - Simplify invoker from ~250 lines to ~140 lines (removed 110 lines of inline logic) - Encapsulate SplitPiece struct and BFS splitting algorithm in transformer - Remove unused includes (queue, vector) from invoker - Add documentation comment for AreDescriptorsSmallerThan2GB() - Improve code organization and reusability - No performance overhead (static template function, compiler inlines) - All tests passing with 2GB production threshold
- Format invoker and transformer files with clang-format-18 - Fix brace placement and alignment - No functional changes
- Remove extra blank lines - Fix line wrapping for template calls - Consolidate GetSplitImageInfo() to single line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR implements a recursive split-image algorithm for grouped convolution to handle large tensors exceeding 2GB memory threshold by automatically splitting spatial dimensions and launching multiple kernel pieces with correct offset management.
Key changes:
- Added recursive binary splitting algorithm for spatial dimensions (W/H/D) with breadth-first search until pieces fit under 2GB threshold
- Integrated split-image with existing Split-N support for independent batch and spatial dimension splitting
- Implemented comprehensive offset calculation system for correct memory addressing across split pieces
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
transform_conv_fwd_to_gemm.hpp |
Added SplitImageInfo structure, CalculateSplitImage() method, and LaunchWithRecursiveSplit() template function for split-image algorithm implementation |
grouped_convolution_forward_kernel.hpp |
Modified kernel args to store transformer instance, added spatial offset fields, and updated batch stride calculations to include group dimension |
grouped_convolution_forward_invoker.hpp |
Added split-image detection logic and routing to either direct kernel launch or recursive split-image path |
grouped_convolution_forward.cpp |
Removed trailing blank line |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
include/ck_tile/ops/grouped_convolution/utils/transform_conv_fwd_to_gemm.hpp
Outdated
Show resolved
Hide resolved
include/ck_tile/ops/grouped_convolution/utils/transform_conv_fwd_to_gemm.hpp
Outdated
Show resolved
Hide resolved
| __host__ bool AreDescriptorsSmallerThan2GB() const | ||
| // Check if descriptors fit within memory threshold | ||
| // NOTE: Not used by forward convolution (uses CalculateSplitImage() instead) | ||
| // May be used by backward convolution implementations |
Copilot
AI
Oct 7, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment indicates this function is not used by forward convolution. If this function is truly unused in the current codebase, consider removing it or marking it as deprecated to avoid confusion.
| // May be used by backward convolution implementations | |
| // May be used by backward convolution implementations | |
| [[deprecated("Not used by forward convolution; may be removed in the future.")]] |
include/ck_tile/ops/grouped_convolution/kernel/grouped_convolution_forward_kernel.hpp
Outdated
Show resolved
Hide resolved
include/ck_tile/ops/grouped_convolution/kernel/grouped_convolution_forward_kernel.hpp
Outdated
Show resolved
Hide resolved
| auto kargs = Kernel::MakeKernelArgs(args); | ||
|
|
||
| // Check if split-image is needed (uses unified threshold internally) | ||
| auto split_info = kargs.GetSplitImageInfo(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just IsSplitImageNeeded and return bool?
include/ck_tile/ops/grouped_convolution/kernel/grouped_convolution_forward_kernel.hpp
Show resolved
Hide resolved
include/ck_tile/ops/grouped_convolution/utils/transform_conv_fwd_to_gemm.hpp
Outdated
Show resolved
Hide resolved
include/ck_tile/ops/grouped_convolution/utils/transform_conv_fwd_to_gemm.hpp
Outdated
Show resolved
Hide resolved
…wd_to_gemm.hpp Co-authored-by: Copilot <[email protected]>
…wd_to_gemm.hpp Co-authored-by: Copilot <[email protected]>
…tion_forward_kernel.hpp Co-authored-by: Copilot <[email protected]>
…tion_forward_kernel.hpp Co-authored-by: Copilot <[email protected]>
…t out old split-image code
- Implemented spatial dimension splitting (Split-Image) for large tensors - Added piece-based coordinate transformation for 1D/2D/3D convolutions - Integrated Split-N (batch splitting) with automatic threshold detection - Fixed M dimension calculation to include batch: M = N × spatial_size - Added spatial offset support in kernel arguments - Verified 20/20 test cases passing for Split-Image alone - Known issue: Split-N + Split-Image combination needs coordinate fix Implementation Details: - Split factors: 4 (1D), 4×4 (2D), 4×4×4 (3D) - temporary fixed values - Batch strides properly calculated for NWGC/NHWGC/NDHWGC layouts - Piece descriptors track spatial boundaries and block ranges - No performance overhead for N=1 cases
- Store actual size per piece to handle non-uniform splits - Remove dead code from transform utils
Problem: Single split factor caused non-uniform pieces when dimensions didn't divide evenly. Result: 18/25 (72%) 2D padding combinations failed. Solution: Independent split factor selection for W, H, D dimensions. Each dimension gets optimal factor based on its own size. Test Results: - 1D: 42/42 pass (100%) - 2D: 25/25 pass (100%) - Total: 67/67 combinations verified
Cleanup of split-image implementation: - Removed unused piece_d, piece_h, piece_w fields from SplitImageInfo struct - These fields were declared but never used in the kernel - Per-piece dimensions are already stored in pieces[] array - Reduces struct size and improves code clarity Tested: 1D/2D/3D convolutions with split-image, padding, stride all pass
- Extract piece calculation logic into calculate_piece lambda helper - Extract kernel args population into populate_split_image_kargs lambda - Use aggregate initialization for cleaner struct population - Reduce nesting depth and improve maintainability - Fix outdated comment about split-image implementation status
- Extract GPU kernel helper lambdas for better readability - Remove all split-image debug print statements - Set memory threshold to 2GB for production - All tests pass with CPU verification
| // Helper: Calculate single piece information | ||
| auto calculate_piece = [&](ck_tile::index_t piece_idx) -> TempPieceInfo { | ||
| const ck_tile::index_t w_idx = piece_idx % num_w_pieces; | ||
| const ck_tile::index_t h_idx = (piece_idx / num_w_pieces) % num_h_pieces; | ||
| const ck_tile::index_t d_idx = piece_idx / (num_w_pieces * num_h_pieces); | ||
|
|
||
| const ck_tile::index_t w_start = w_idx * base_piece_w; | ||
| const ck_tile::index_t h_start = h_idx * base_piece_h; | ||
| const ck_tile::index_t d_start = d_idx * base_piece_d; | ||
|
|
||
| const ck_tile::index_t w_size = | ||
| (w_idx == num_w_pieces - 1) ? (total_w - w_start) : base_piece_w; | ||
| const ck_tile::index_t h_size = | ||
| (h_idx == num_h_pieces - 1) ? (total_h - h_start) : base_piece_h; | ||
| const ck_tile::index_t d_size = | ||
| (d_idx == num_d_pieces - 1) ? (total_d - d_start) : base_piece_d; | ||
|
|
||
| const ck_tile::index_t piece_gemm_m = args.N_ * d_size * h_size * w_size; | ||
| const ck_tile::index_t piece_gemm_n = args.K_; | ||
| const ck_tile::index_t piece_grid = | ||
| ((piece_gemm_m + TilePartitioner::MPerBlock - 1) / TilePartitioner::MPerBlock) * | ||
| ((piece_gemm_n + TilePartitioner::NPerBlock - 1) / TilePartitioner::NPerBlock); | ||
|
|
||
| return {total_blocks, | ||
| total_blocks + piece_grid, | ||
| d_start, | ||
| h_start, | ||
| w_start, | ||
| d_size, | ||
| h_size, | ||
| w_size}; | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move this to some utils?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved this function to
include/ck_tile/ops/grouped_convolution/utils/grouped_convolution_utils.hpp
| // Pre-calculate block_id (used in both split-image and non-split paths) | ||
| const index_t block_id = static_cast<index_t>(blockIdX); | ||
|
|
||
| if(kargs.num_spatial_pieces > 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we change it to constexpr value? I mean something like we initialize the kernel with SplitImageSupport in grouped conv utils and then deduce in compile time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I corrected this as constexpr.
| 1, | ||
| std::multiplies<ck_tile::index_t>()); | ||
|
|
||
| // Split-K parameters |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we create two seperate examples? The first one for regular convolution and second one for large image with split image?
| { | ||
| // Helper: Find which piece owns this block | ||
| auto find_piece_id = [&]() -> index_t { | ||
| for(index_t i = 0; i < kargs.num_spatial_pieces; i++) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can use golden division like in include/ck/tensor_operation/gpu/device/impl/device_grouped_conv_fwd_multiple_d_xdl_large_tensor_cshuffle.hpp
| TilePartitioner{local_gemm_m, kargs.GemmN}.GetOutputTileIndex(local_block_id); | ||
|
|
||
| // Extract batch and spatial coordinates from local tile | ||
| const index_t local_m_start = local_tile_m * TilePartitioner::MPerBlock; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we simpilify this logic? Can we just pass some set of descriptors and select appropriate based on blockIdx?
- Add MAX_TOTAL_PIECES=64 limit to prevent segfault - Move calculate_spatial_piece to library utils - Add layout validation (NWGC, NHWGC, NDHWGC only) - Fix hierarchical splitting to respect piece limits - Add proper documentation and formatting
Response to @bartekxk review comment: Convert 'if(kargs.num_spatial_pieces > 1)' to 'if constexpr(EnableSplitImage)' Changes: - Add EnableSplitImage template parameter to kernel - Change runtime if to compile-time if constexpr - Update invoker to instantiate kernel variants with true/false Benefits: - Eliminates runtime branching in GPU kernel - Dead code elimination (each variant is smaller) - Better compiler optimization Files modified: 2 Lines changed: 20 total (6 in kernel, 14 in invoker) Tests: 27/27 passed (100%) Performance: No regression
- Add EnableSplitImage template parameter to kernel - Change runtime if to if constexpr - Update invoker to pass true/false
- Create grouped_convolution_forward_split_image example - Add grouped_convolution_forward_split_image_invoker.hpp - Update CMakeLists.txt to build split_image binary
- Change O(n) to O(log n) for finding piece ownership - Matches reference implementation in large_tensor_cshuffle
Implement Recursive Split-Image for Large Tensor Support
Summary
Implemented recursive split-image algorithm for grouped convolution to handle large tensors exceeding 2GB memory threshold by recursively splitting spatial dimensions and launching multiple kernel pieces with correct offset management.
Issue
Large convolution tensors (> 2GB) cannot fit in GPU memory or exceed descriptor size limits, causing:
Example: N=1, W=4.2M, C=256, K=256 → Output = 2.02 GB (exceeds 2GB limit)
Goal
Enable large convolution execution by automatically splitting tensors into pieces < 2GB with numerical correctness and Split-N compatibility.
Workflow
Implementation
Core Algorithm: Recursive Binary Split
Split spatial dimensions (W/H/D) recursively using breadth-first search until each piece fits under threshold.
Split Implementation
1. Descriptor Splitting (Dimensions)
Binary split on dimension 0 (first spatial dimension):
Example: 1D convolution W=4.2M, C=256, K=256, G=1 → Output = 2.02 GB (exceeds threshold)
2. Pointer Offset Calculation
Example: 1D with W=4.2M, C=256, K=256, stride=1, pad=1
3. Recursive Splitting (Breadth-First Search)
Stop condition: piece < 2GB OR depth >= MAX_DEPTH(10)
Example: 2D convolution H=4096, W=2048, C=256, K=256 → Output = 4.29 GB
Offset Application in Kernel
batch_id = blockIdx.z ptr = base_ptr + group_offset + (batch_id × batch_stride) + // Uses ORIGINAL dimensions spatial_offset; // Uses SPLIT dimensionsIntegration with Split-N
Split-N (batch dimension) and Split-Image (spatial dimensions) work together independently.
Testing
Three test suites with different thresholds:
All 142+ tests passing with CPU verification.
Note: 2GB threshold tests (15 typical + 8 large >2GB) exist but not run in CI due to memory requirements.
Performance
Files Changed
grouped_convolution_forward_invoker.hpptransform_conv_fwd_to_gemm.hppgrouped_convolution_forward_kernel.hppChecklist
Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.