A compact CUDA benchmark and demo that explores several parallel reduction implementations for GPUs. The project collects small, focused implementations of block- and warp-level reductions (including vectorized and warp-specialized variants) to compare correctness, performance trade-offs, and implementation techniques.
Only the sum operator is implemented; other operators can be added in a similar way.
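As a rough illustration of why swapping operators is straightforward, here is a hedged sketch of a warp-level sum reduction using shuffle intrinsics, with a max variant alongside it. The function names are illustrative, not the project's actual helpers:

```cuda
#include <cuda_runtime.h>

// Warp-level sum via shuffle: each step folds the upper half of the
// warp's lanes into the lower half. Swapping `+=` for `fmaxf` (and using
// the appropriate identity element when padding) gives a max-reduction.
__device__ float warp_reduce_sum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;  // lane 0 ends up holding the warp's sum
}

__device__ float warp_reduce_max(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val = fmaxf(val, __shfl_down_sync(0xffffffff, val, offset));
    return val;  // lane 0 ends up holding the warp's max
}
```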
- Provide minimal, easy-to-read CUDA kernels that implement common reduction patterns.
- Compare alternative strategies (basic block reduction, block+warp reduction, vectorized reductions, and warp-specialization) in a single test harness.
- Serve as a learning and benchmarking playground for GPU reduction techniques.
- Multiple reduction kernels selectable at runtime (`basic`, `warp`, `vectorized_warp`, `warp_specialization`, and `vectorized_warp_speciliaztion`).
- Small, isolated header modules for each algorithm in `include/` so you can read and modify kernels quickly.
- Single-file test runner at `src/main.cu` that prepares data, launches kernels, and validates results.
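For orientation, the `basic` variant typically looks something like the following shared-memory tree reduction. This is a hedged sketch; the kernel and variable names are guesses, not the actual contents of `include/00_basic_block_reduction.h`:

```cuda
#include <cuda_runtime.h>

// Shared-memory tree reduction within a block: each block writes one
// partial sum, which a second pass (or the host) combines.
__global__ void block_reduce_sum(const float* in, float* block_sums, int n) {
    extern __shared__ float smem[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    smem[tid] = (i < n) ? in[i] : 0.0f;  // pad out-of-range lanes with identity
    __syncthreads();

    // Halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) smem[tid] += smem[tid + stride];
        __syncthreads();
    }
    if (tid == 0) block_sums[blockIdx.x] = smem[0];  // one partial per block
}
```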
- `src/main.cu` — test harness and kernel launcher; parses the `target` string to select which kernel to run.
- `include/` — header files containing kernels and helpers:
  - `00_basic_block_reduction.h`
  - `01_block_reduction_with_warp_reduction.h`
  - `01_vectorized_reduction_with_warp_reduction.h`
  - `02_block_reduction_with_warp_reduction_with_warp_specialization.h`
  - `02_vectorized_block_reduction_with_warp_reduction_with_warp_specialization.h`
  - `utils.h` — utility helpers for data setup and validation.
- `CMakeLists.txt` & `build/` — CMake-based build configuration and generated build files.
- Create a build directory and run CMake, then make (example):

```bash
mkdir -p build && cd build
cmake ..
make -j
```

- Run the produced executable (example usage; adjust depending on build output):

```bash
# from the project root
./build/reduction_to_the_hell
```

- This project is intended for experimentation and learning rather than production deployment. Kernels are written for clarity and to demonstrate different reduction idioms; they omit many boundary checks, which could lead to bugs in production code.
- Warp specialization changes the order of operations, so it can produce slightly different float/double results, since floating-point addition is not associative. The performance gain is also modest, because it only saves warp 0's final reduction.
- Requires the CUDA toolkit and a CUDA-capable GPU. The build system uses `nvcc` (CMake CUDA support).
- Add CUB/Thrust examples
- Add the nvbench library for runtime profiling
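For the CUB baseline, one likely shape is `cub::DeviceReduce::Sum` with its usual two-pass calling convention; this sketch (function name and error handling are simplified assumptions) shows the idea:

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// CUB device-wide sum: first call with a null temp pointer queries the
// required scratch size, the second call performs the reduction.
void cub_sum(const float* d_in, float* d_out, int n) {
    void*  d_temp     = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n);  // size query
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n);  // actual reduce
    cudaFree(d_temp);
}
```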