Merged
Changes from 20 commits (60 commits total)
4403d55
Refactor split-image implementation: simplify code and remove redunda…
JH-Leon-KIM-AMD Sep 26, 2025
74a7179
Add padding debug output to split-image implementation
JH-Leon-KIM-AMD Sep 28, 2025
e94a485
Fix sign comparison warning after rebase with origin/develop
JH-Leon-KIM-AMD Sep 28, 2025
9b31169
Fix Split-N with groups bug and clean up unused parameters
JH-Leon-KIM-AMD Sep 28, 2025
9c7df2a
Implement recursive queue-based split-image detection and calculation
JH-Leon-KIM-AMD Sep 29, 2025
1f48a5a
WIP: Split-Image investigation - found architecture mismatch
JH-Leon-KIM-AMD Sep 30, 2025
68af577
Add 1D split-image implementation for grouped convolution (N=1 working)
JH-Leon-KIM-AMD Oct 1, 2025
b51fedb
Add basic split-image implementation for 1D/2D/3D grouped convolution
JH-Leon-KIM-AMD Oct 2, 2025
2992c0b
Refactor split-image to unified structure for 1D/2D/3D
JH-Leon-KIM-AMD Oct 2, 2025
5024d43
Add safety checks for split-image in all dimensions
JH-Leon-KIM-AMD Oct 2, 2025
51c77f7
Fix Split-N + Split-Image compatibility issue
JH-Leon-KIM-AMD Oct 3, 2025
74671dd
Implement unified threshold for Split-N and Split-Image
JH-Leon-KIM-AMD Oct 3, 2025
49e46de
Comment out outdated split-image code (SplitConvProblem/LaunchKernelW…
JH-Leon-KIM-AMD Oct 3, 2025
eebb88d
Implement recursive split-image with depth limit (MAX_DEPTH=10)
JH-Leon-KIM-AMD Oct 3, 2025
73a6adf
Summary of recursive split-image implementation:
JH-Leon-KIM-AMD Oct 3, 2025
34326e2
Add comment explaining MAX_DEPTH capacity for 2GB threshold
JH-Leon-KIM-AMD Oct 6, 2025
414e9a1
Refactor: move recursive split-image logic to transformer
JH-Leon-KIM-AMD Oct 6, 2025
8c6d280
Apply clang-format-18 formatting
JH-Leon-KIM-AMD Oct 6, 2025
54869a3
Fix clang-format-18 issues in forward kernel
JH-Leon-KIM-AMD Oct 6, 2025
08bc24d
Merge branch 'develop' into LWPCK-3052-grouped-conv-split-image
JH-Leon-KIM-AMD Oct 6, 2025
e75944e
Update include/ck_tile/ops/grouped_convolution/utils/transform_conv_f…
JH-Leon-KIM-AMD Oct 8, 2025
a1f9d7e
Update include/ck_tile/ops/grouped_convolution/utils/transform_conv_f…
JH-Leon-KIM-AMD Oct 8, 2025
ca06bfe
Update include/ck_tile/ops/grouped_convolution/kernel/grouped_convolu…
JH-Leon-KIM-AMD Oct 8, 2025
5fbaecf
Update include/ck_tile/ops/grouped_convolution/kernel/grouped_convolu…
JH-Leon-KIM-AMD Oct 8, 2025
0ea5ece
Merge branch 'develop' into LWPCK-3052-grouped-conv-split-image
JH-Leon-KIM-AMD Oct 8, 2025
03e44ee
Merge branch 'develop' into LWPCK-3052-grouped-conv-split-image
JH-Leon-KIM-AMD Oct 9, 2025
58df1f6
Merge develop into LWPCK-3052: Accept universal GEMM pipeline, commen…
JH-Leon-KIM-AMD Oct 15, 2025
89c6c92
Split-Image implementation with temporary fixed divider
JH-Leon-KIM-AMD Oct 17, 2025
a779af6
Fix 1D split-image padding issue with per-piece dimensions
JH-Leon-KIM-AMD Oct 19, 2025
a7871c8
Fix 2D/3D split-image with independent split factors per dimension
JH-Leon-KIM-AMD Oct 19, 2025
c85669e
Remove unused split-image struct fields
JH-Leon-KIM-AMD Oct 19, 2025
d48b4ed
Refactor split-image invoker code for improved readability
JH-Leon-KIM-AMD Oct 19, 2025
49b622e
Refactor split-image code and remove debug prints
JH-Leon-KIM-AMD Oct 19, 2025
7db8d77
Add split-image safety constraints and refactor to utils
JH-Leon-KIM-AMD Oct 22, 2025
a87da59
Change split-image from runtime to compile-time branching
JH-Leon-KIM-AMD Oct 24, 2025
29fed44
Change split-image to compile-time branching
JH-Leon-KIM-AMD Oct 24, 2025
b28ea3c
Add split-image example as separate binary
JH-Leon-KIM-AMD Oct 25, 2025
badffd8
Replace linear search with binary search in find_piece_id
JH-Leon-KIM-AMD Oct 25, 2025
91ffc82
Simplify split-image code and fix integer overflow
JH-Leon-KIM-AMD Oct 27, 2025
d6184ed
Merge branch 'develop' into LWPCK-3052-grouped-conv-split-image
JH-Leon-KIM-AMD Oct 27, 2025
85c4c58
Trigger CI rerun - fix merge conflicts
JH-Leon-KIM-AMD Oct 28, 2025
61d0e6a
Fix merge conflict markers
JH-Leon-KIM-AMD Oct 28, 2025
02d33c3
Merge branch 'develop' into LWPCK-3052-grouped-conv-split-image
JH-Leon-KIM-AMD Oct 28, 2025
b8a94d5
Fix clang-format: remove space before {}
JH-Leon-KIM-AMD Oct 28, 2025
c80e237
Fix clang-format: comment wrapping and Swish constructor
JH-Leon-KIM-AMD Oct 28, 2025
387361d
Merge branch 'develop' into LWPCK-3052-grouped-conv-split-image
JH-Leon-KIM-AMD Oct 30, 2025
cc7299b
Rename split_image to large_tensor for clarity
JH-Leon-KIM-AMD Oct 30, 2025
8e03659
Update comments and include in large_tensor example
JH-Leon-KIM-AMD Oct 30, 2025
2fbb436
Remove test code, restore 2GB threshold
JH-Leon-KIM-AMD Oct 30, 2025
8558e07
Update include/ck_tile/ops/grouped_convolution/utils/transform_conv_f…
JH-Leon-KIM-AMD Oct 30, 2025
781bf67
Fix build errors after develop merge and complete rename to large_tensor
JH-Leon-KIM-AMD Oct 31, 2025
2110c42
Merge branch 'develop' into LWPCK-3052-grouped-conv-split-image
JH-Leon-KIM-AMD Oct 31, 2025
b3b5d70
Restore 2GB threshold for split-image
JH-Leon-KIM-AMD Oct 31, 2025
9705c7e
Fix const-correctness in ds_ptr cast
JH-Leon-KIM-AMD Oct 31, 2025
479e58d
Update include/ck_tile/ops/grouped_convolution/kernel/grouped_convolu…
JH-Leon-KIM-AMD Oct 31, 2025
8197597
Apply clang-format-18
JH-Leon-KIM-AMD Oct 31, 2025
81b3347
update c++ 18 format
JH-Leon-KIM-AMD Oct 31, 2025
f801ca7
Apply clang-format-18 to transform_conv_fwd_to_gemm.hpp
JH-Leon-KIM-AMD Oct 31, 2025
c7d6fb6
Merge branch 'develop' into LWPCK-3052-grouped-conv-split-image
JH-Leon-KIM-AMD Oct 31, 2025
b816ba4
Merge branch 'develop' into LWPCK-3052-grouped-conv-split-image
JH-Leon-KIM-AMD Oct 31, 2025
@@ -13,7 +13,6 @@
#include "grouped_convolution_utils.hpp"
#include "grouped_convolution_forward_invoker.hpp"
#include "run_grouped_convolution_fwd_example.inc"

template <typename GemmWarpConfig>
int run_grouped_conv_fwd_example(int argc, char* argv[])
{
@@ -21,6 +21,10 @@ struct GroupedConvolutionForwardInvoker
static float grouped_conv_fwd(const ck_tile::GroupedConvFwdHostArgs& args,
const ck_tile::stream_config& s)
{
if(s.log_level_ > 0)
{
std::cout << "[INVOKER] grouped_conv_fwd called, NDimSpatial=" << NDimSpatial << "\n";
}
constexpr int kBlockPerCu = 1;

constexpr ck_tile::index_t M_Tile = 64;
@@ -99,32 +103,40 @@ struct GroupedConvolutionForwardInvoker
TilePartitioner,
CodegenPipeline,
ConvEpilogue>;
auto kargs = Kernel::MakeKernelArgs(args);

const dim3 grids = Kernel::GridSize(kargs);
const dim3 blocks = Kernel::BlockSize();
float ave_time = 0.0f;

if(!Kernel::IsSupportedArgument(kargs))
{
throw std::runtime_error("Wrong! Arguments not supported! Skipping conv!\n");
}
// Create kargs and check if split-image is needed
auto kargs = Kernel::MakeKernelArgs(args);

// Check if split-image is needed (uses unified threshold internally)
auto split_info = kargs.GetSplitImageInfo();
Contributor: Maybe just IsSplitImageNeeded and return bool?
if(s.log_level_ > 0)
if(!split_info.should_split)
{
std::cout << "Launching kernel with args: " << Kernel::GetName() << '\n'
<< "shape: " << CodegenShape::GetName() << '\n'
<< "problem: " << CodegenPipelineProblem::GetName() << '\n'
<< "pipeline: " << CodegenPipeline::GetName() << '\n'
<< "grid: {" << grids.x << ", " << grids.y << ", " << grids.z << "}"
<< ", blocks: {" << blocks.x << ", " << blocks.y << ", " << blocks.z
<< "}" << '\n'
<< "Vector size A: " << CodegenPipeline::GetVectorSizeA()
<< ", Vector size B: " << CodegenPipeline::GetVectorSizeB()
<< ", Vector size C: " << ConvEpilogue::GetVectorSizeC() << std::endl;
// No split-image needed - use kargs directly (may have Split-N)
if(s.log_level_ > 0)
{
std::cout << "[INVOKER] No split-image needed - launching with kargs"
<< std::endl;
}
const dim3 grids = Kernel::GridSize(kargs);
const dim3 blocks = Kernel::BlockSize();

if(!Kernel::IsSupportedArgument(kargs))
{
throw std::runtime_error("Wrong! Arguments not supported! Skipping conv!\n");
}

ave_time = ck_tile::launch_kernel(
s, ck_tile::make_kernel<kBlockPerCu>(Kernel{}, grids, blocks, 0, kargs));
return ave_time;
}

float ave_time = ck_tile::launch_kernel(
s, ck_tile::make_kernel<kBlockPerCu>(Kernel{}, grids, blocks, 0, kargs));
// RECURSIVE split-image path - delegate to transformer helper
ave_time = decltype(kargs.transformer_)::template LaunchWithRecursiveSplit<Kernel,
kBlockPerCu>(
args, s, kargs);

return ave_time;
};
@@ -73,23 +73,21 @@ struct GroupedConvFwdKernelArgs
}
out_ptr = args.out_ptr;

ConvToGemmFwdTransformer conv_to_gemm_transformer{in_g_n_c_wis_lengths,
wei_g_k_c_xs_lengths,
out_g_n_k_wos_lengths,
conv_filter_strides,
conv_filter_dilations,
input_left_pads,
input_right_pads};
// Create and STORE transformer (for split-image support)
transformer_ = ConvToGemmFwdTransformer{in_g_n_c_wis_lengths,
wei_g_k_c_xs_lengths,
out_g_n_k_wos_lengths,
conv_filter_strides,
conv_filter_dilations,
input_left_pads,
input_right_pads};

a_grid_desc_m_k =
conv_to_gemm_transformer
.template MakeADescriptor_M_K<typename GroupedConvTraitsType_::InLayout>();
transformer_.template MakeADescriptor_M_K<typename GroupedConvTraitsType_::InLayout>();
b_grid_desc_n_k =
conv_to_gemm_transformer
.template MakeBDescriptor_N_K<typename GroupedConvTraitsType_::WeiLayout>();
transformer_.template MakeBDescriptor_N_K<typename GroupedConvTraitsType_::WeiLayout>();
c_grid_desc_m_n =
conv_to_gemm_transformer
.template MakeCDescriptor_M_N<typename GroupedConvTraitsType_::OutLayout>();
transformer_.template MakeCDescriptor_M_N<typename GroupedConvTraitsType_::OutLayout>();

group_stride_a = args.C_;
group_stride_b = args.K_ * args.C_ *
@@ -101,13 +99,15 @@ struct GroupedConvFwdKernelArgs

// Initialize Split-N support fields for 1D convolution (NWGC layout)
// Get the actual split N from transformer
n_per_split = conv_to_gemm_transformer.GetN();
original_n = conv_to_gemm_transformer.GetOriginalN();
n_per_split = transformer_.GetN();
original_n = transformer_.GetOriginalN();
n_splits = ck_tile::integer_divide_ceil(original_n, n_per_split);

// Calculate batch strides for NWGC layout
input_batch_stride = args.C_ * args.input_spatial_lengths_[0];
output_batch_stride = args.K_ * args.output_spatial_lengths_[0];
// FIX: Calculate batch strides using args dimensions
// These are the ORIGINAL dimensions passed to constructor, not modified by invoker yet
// (invoker modifies args AFTER calling MakeKernelArgs)
input_batch_stride = args.G_ * args.C_ * args.input_spatial_lengths_[0];
output_batch_stride = args.G_ * args.K_ * args.output_spatial_lengths_[0];

// Update GemmM to use split N (not original N)
GemmM = n_per_split * args.output_spatial_lengths_[0];
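The stride change above is the core of this hunk: for the NWGC layout, stepping from one batch index to the next must skip over all G groups, so the batch stride is G * C * W rather than C * W. A tiny host-side arithmetic sketch (illustrative struct and names, not library code):

```cpp
#include <cassert>

// Illustrative dimensions for a grouped conv in NWGC/NWGK layout.
struct ConvDims
{
    int G, C, K, Wi, Wo; // groups, in-channels, out-channels, in/out widths
};

// Batch stride must include G: the fix in this diff.
inline long long input_batch_stride_nwgc(const ConvDims& d)
{
    return static_cast<long long>(d.G) * d.C * d.Wi; // was C * Wi before the fix
}

inline long long output_batch_stride_nwgc(const ConvDims& d)
{
    return static_cast<long long>(d.G) * d.K * d.Wo;
}
```

The 2D (NHWGC) and 3D (NDHWGC) hunks below apply the same G factor with the extra spatial extents multiplied in.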
@@ -163,23 +163,21 @@ struct GroupedConvFwdKernelArgs
}
out_ptr = args.out_ptr;

ConvToGemmFwdTransformer conv_to_gemm_transformer{in_g_n_c_wis_lengths,
wei_g_k_c_xs_lengths,
out_g_n_k_wos_lengths,
conv_filter_strides,
conv_filter_dilations,
input_left_pads,
input_right_pads};
// Create and STORE transformer (for split-image support)
transformer_ = ConvToGemmFwdTransformer{in_g_n_c_wis_lengths,
wei_g_k_c_xs_lengths,
out_g_n_k_wos_lengths,
conv_filter_strides,
conv_filter_dilations,
input_left_pads,
input_right_pads};

a_grid_desc_m_k =
conv_to_gemm_transformer
.template MakeADescriptor_M_K<typename GroupedConvTraitsType_::InLayout>();
transformer_.template MakeADescriptor_M_K<typename GroupedConvTraitsType_::InLayout>();
b_grid_desc_n_k =
conv_to_gemm_transformer
.template MakeBDescriptor_N_K<typename GroupedConvTraitsType_::WeiLayout>();
transformer_.template MakeBDescriptor_N_K<typename GroupedConvTraitsType_::WeiLayout>();
c_grid_desc_m_n =
conv_to_gemm_transformer
.template MakeCDescriptor_M_N<typename GroupedConvTraitsType_::OutLayout>();
transformer_.template MakeCDescriptor_M_N<typename GroupedConvTraitsType_::OutLayout>();

group_stride_a = args.C_;
group_stride_b = args.K_ * args.C_ *
@@ -191,15 +189,16 @@ struct GroupedConvFwdKernelArgs

// Initialize Split-N support fields for 2D convolution (NHWGC layout)
// Get the actual split N from transformer
n_per_split = conv_to_gemm_transformer.GetN();
original_n = conv_to_gemm_transformer.GetOriginalN();
n_per_split = transformer_.GetN();
original_n = transformer_.GetOriginalN();
n_splits = ck_tile::integer_divide_ceil(original_n, n_per_split);

// Calculate batch strides for NHWGC layout
// Need to account for G dimension when moving between batches
input_batch_stride =
args.C_ * args.input_spatial_lengths_[0] * args.input_spatial_lengths_[1];
args.G_ * args.C_ * args.input_spatial_lengths_[0] * args.input_spatial_lengths_[1];
output_batch_stride =
args.K_ * args.output_spatial_lengths_[0] * args.output_spatial_lengths_[1];
args.G_ * args.K_ * args.output_spatial_lengths_[0] * args.output_spatial_lengths_[1];

// Update GemmM to use split N (not original N)
GemmM = n_per_split * args.output_spatial_lengths_[0] * args.output_spatial_lengths_[1];
@@ -263,23 +262,21 @@ struct GroupedConvFwdKernelArgs
}
out_ptr = args.out_ptr;

ConvToGemmFwdTransformer conv_to_gemm_transformer{in_g_n_c_wis_lengths,
wei_g_k_c_xs_lengths,
out_g_n_k_wos_lengths,
conv_filter_strides,
conv_filter_dilations,
input_left_pads,
input_right_pads};
// Create and STORE transformer (for split-image support)
transformer_ = ConvToGemmFwdTransformer{in_g_n_c_wis_lengths,
wei_g_k_c_xs_lengths,
out_g_n_k_wos_lengths,
conv_filter_strides,
conv_filter_dilations,
input_left_pads,
input_right_pads};

a_grid_desc_m_k =
conv_to_gemm_transformer
.template MakeADescriptor_M_K<typename GroupedConvTraitsType_::InLayout>();
transformer_.template MakeADescriptor_M_K<typename GroupedConvTraitsType_::InLayout>();
b_grid_desc_n_k =
conv_to_gemm_transformer
.template MakeBDescriptor_N_K<typename GroupedConvTraitsType_::WeiLayout>();
transformer_.template MakeBDescriptor_N_K<typename GroupedConvTraitsType_::WeiLayout>();
c_grid_desc_m_n =
conv_to_gemm_transformer
.template MakeCDescriptor_M_N<typename GroupedConvTraitsType_::OutLayout>();
transformer_.template MakeCDescriptor_M_N<typename GroupedConvTraitsType_::OutLayout>();

group_stride_a = args.C_;
group_stride_b = args.K_ * args.C_ *
@@ -291,14 +288,15 @@ struct GroupedConvFwdKernelArgs

// Initialize Split-N support fields for 3D convolution (NDHWGC layout)
// Get the actual split N from transformer
n_per_split = conv_to_gemm_transformer.GetN();
original_n = conv_to_gemm_transformer.GetOriginalN();
n_per_split = transformer_.GetN();
original_n = transformer_.GetOriginalN();
n_splits = ck_tile::integer_divide_ceil(original_n, n_per_split);

// Calculate batch strides for NDHWGC layout
input_batch_stride = args.C_ * args.input_spatial_lengths_[0] *
// Need to account for G dimension when moving between batches
input_batch_stride = args.G_ * args.C_ * args.input_spatial_lengths_[0] *
args.input_spatial_lengths_[1] * args.input_spatial_lengths_[2];
output_batch_stride = args.K_ * args.output_spatial_lengths_[0] *
output_batch_stride = args.G_ * args.K_ * args.output_spatial_lengths_[0] *
args.output_spatial_lengths_[1] * args.output_spatial_lengths_[2];

// Update GemmM to use split N (not original N)
@@ -351,6 +349,19 @@ struct GroupedConvFwdKernelArgs
index_t original_n = 1; // Original batch size before splitting
index_t input_batch_stride = 0; // Stride to next batch in input tensor
index_t output_batch_stride = 0; // Stride to next batch in output tensor

// Split-image support - spatial offsets (applied per-batch in operator())
long_index_t spatial_offset_in = 0; // Spatial offset for input (e.g., W/2 for 1D split)
long_index_t spatial_offset_out = 0; // Spatial offset for output (e.g., W/2 for 1D split)

// Split-image support - transformer instance
// We store the transformer so invoker can call CalculateSplitImage()
// which uses N_ (after Split-N) for correct offset calculation
ConvToGemmFwdTransformer transformer_;

// Method to get split-image information from transformer
// Uses unified TwoGB threshold internally
CK_TILE_HOST auto GetSplitImageInfo() const { return transformer_.CalculateSplitImage(); }
};
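GetSplitImageInfo() above forwards to the transformer's CalculateSplitImage(), and a reviewer suggested also exposing a boolean IsSplitImageNeeded(). A hedged sketch of what that convenience wrapper could look like — only should_split appears in this diff; the split_factor field, the power-of-two search, and the function names are assumptions for illustration:

```cpp
#include <cassert>
#include <cstdint>

// Assumed shape of the info returned by CalculateSplitImage(); only
// should_split is visible in the diff.
struct SplitInfo
{
    bool should_split;
    int  split_factor;
};

constexpr std::int64_t kTwoGB = std::int64_t{1} << 31;

// Double the split factor until each piece is under the 2 GB threshold.
inline SplitInfo calculate_split_image(std::int64_t tensor_bytes)
{
    int factor = 1;
    while(tensor_bytes / factor >= kTwoGB)
        factor *= 2;
    return {factor > 1, factor};
}

// The reviewer's suggested boolean front end.
inline bool is_split_image_needed(std::int64_t tensor_bytes)
{
    return calculate_split_image(tensor_bytes).should_split;
}
```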

/// @brief The Grouped Convolution Forward kernel template.
@@ -460,7 +471,8 @@ struct GroupedConvolutionForwardKernel
CK_TILE_HOST static constexpr GroupedConvFwdKernelArgsSpecialized
MakeKernelArgs(const GroupedConvFwdHostArgs& hostArgs)
{
return GroupedConvFwdKernelArgsSpecialized(hostArgs);
auto kargs = GroupedConvFwdKernelArgsSpecialized(hostArgs);
return kargs;
}

CK_TILE_HOST_DEVICE static constexpr index_t GetSmemSize()
@@ -821,12 +833,8 @@ struct GroupedConvolutionForwardKernel
CK_TILE_DEVICE void operator()(GroupedConvFwdKernelArgsSpecialized kargs) const
{
const auto blockIdX = amd_wave_read_first_lane(blockIdx.x);
const auto [iM, iN] =
TilePartitioner{kargs.GemmM, kargs.GemmN}.GetOutputTileIndex(blockIdX);
const index_t i_m = amd_wave_read_first_lane(iM * TilePartitioner::MPerBlock);
const index_t i_n = amd_wave_read_first_lane(iN * TilePartitioner::NPerBlock);
const auto blockIdY = amd_wave_read_first_lane(blockIdx.y);

const auto blockIdY = amd_wave_read_first_lane(blockIdx.y);
const auto group_offset_a = amd_wave_read_first_lane(kargs.group_stride_a * blockIdY);
const auto group_offset_b = amd_wave_read_first_lane(kargs.group_stride_b * blockIdY);
const auto group_offset_c = amd_wave_read_first_lane(kargs.group_stride_c * blockIdY);
Expand All @@ -844,13 +852,26 @@ struct GroupedConvolutionForwardKernel
static_cast<long_index_t>(batch_offset) *
static_cast<long_index_t>(kargs.output_batch_stride);

// Adjust pointers: combine group offset and batch offset
const InDataType* a_ptr =
static_cast<const InDataType*>(kargs.in_ptr) + group_offset_a + input_batch_offset;
// FIX: Adjust pointers with formula: base + group_offset + batch_offset + spatial_offset
// This ensures spatial offset is applied per-batch, not globally
const InDataType* base_a_ptr =
Contributor: Can we add these offsets in constexpr?

Author: I think these offset calculations cannot be constexpr because they depend on runtime variables (blockIdx.y, blockIdx.z, and kernel arguments like kargs.in_ptr and kargs.spatial_offset_in), which are only known at kernel execution time, not at compile time.

Contributor: I mean that you can add spatial_offset_in / spatial_offset_out in a constexpr if in line 994.

Author: You're right! I can optimize by only adding the spatial offsets when EnableSplitImage is true.
static_cast<const InDataType*>(kargs.in_ptr) + group_offset_a + input_batch_offset +
kargs.spatial_offset_in; // Add spatial offset from split-image
const WeiDataType* b_ptr = static_cast<const WeiDataType*>(kargs.wei_ptr) +
group_offset_b; // No batch offset for weights!
OutDataType* c_ptr =
static_cast<OutDataType*>(kargs.out_ptr) + group_offset_c + output_batch_offset;
OutDataType* base_c_ptr = static_cast<OutDataType*>(kargs.out_ptr) + group_offset_c +
output_batch_offset +
kargs.spatial_offset_out; // Add spatial offset from split-image

// Use base pointers directly
const InDataType* a_ptr = base_a_ptr;
OutDataType* c_ptr = base_c_ptr;

// Tile partitioning
const auto [iM, iN] = TilePartitioner{kargs.GemmM, kargs.GemmN}.GetOutputTileIndex(
static_cast<index_t>(blockIdX));
const index_t i_m = amd_wave_read_first_lane(iM * TilePartitioner::MPerBlock);
const index_t i_n = amd_wave_read_first_lane(iN * TilePartitioner::NPerBlock);

// allocate LDS
__shared__ char smem_ptr_0[GetSmemSize()];
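The review thread on this hunk settled on adding the spatial offsets only when split-image support is enabled at compile time. A standalone host-side sketch of that `if constexpr` gating — treating EnableSplitImage as a template parameter is an assumption mirroring the discussion, not the exact shape of the final code:

```cpp
#include <cassert>

// Combine group, batch, and (optionally) spatial offsets into one pointer
// offset. When EnableSplitImage is false, the spatial term is not even
// compiled in, so the no-split path pays no extra cost.
template <bool EnableSplitImage>
long long a_ptr_offset(long long group_offset,
                       long long batch_offset,
                       long long spatial_offset)
{
    long long off = group_offset + batch_offset;
    if constexpr(EnableSplitImage)
    {
        off += spatial_offset; // only present on the split-image path
    }
    return off;
}
```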