
reduction_to_the_hell

A compact CUDA benchmark and demo that explores several parallel reduction implementations for GPUs. The project collects small, focused implementations of block- and warp-level reductions (including vectorized and warp-specialized variants) to compare correctness, performance trade-offs, and implementation techniques.

Only the sum operator is used; other operators can be implemented in a similar way.
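As an illustration of why the operator swaps in so easily, here is a warp-level sum reduction via shuffle intrinsics; this is a sketch, not the repo's exact code, and only the `+=` would change for another associative operator:

```cuda
// Warp-level sum reduction (illustrative sketch, not the repo's code).
// Each step halves the number of active contributions; after
// log2(32) = 5 steps, lane 0 holds the sum of the whole warp.
__device__ float warp_reduce_sum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1) {
        val += __shfl_down_sync(0xffffffff, val, offset);
    }
    return val;
}
```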

Goals

  • Provide minimal, easy-to-read CUDA kernels that implement common reduction patterns.
  • Compare alternative strategies (basic block reduction, block+warp reduction, vectorized reductions, and warp-specialization) in a single test harness.
  • Serve as a learning and benchmarking playground for GPU reduction techniques.

Features

  • Multiple reduction kernels selectable at runtime (basic, warp, vectorized_warp, warp_specialization, and vectorized_warp_speciliaztion).
  • Small, isolated header modules for each algorithm in include/ so you can read and modify kernels quickly.
  • Single-file test runner at src/main.cu that prepares data, launches kernels and validates results.
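For orientation, the "basic" variant is presumably the classic shared-memory tree reduction. A minimal sketch, assuming that pattern (kernel name, block size, and the `atomicAdd` combine are illustrative, not taken from the repo's headers):

```cuda
#include <cuda_runtime.h>

// Sketch of a classic shared-memory block reduction.
template <int BLOCK_SIZE>
__global__ void block_reduce_sum(const float* in, float* out, int n) {
    __shared__ float smem[BLOCK_SIZE];
    int tid = threadIdx.x;
    int idx = blockIdx.x * BLOCK_SIZE + tid;

    smem[tid] = (idx < n) ? in[idx] : 0.0f;  // guard the tail
    __syncthreads();

    // Tree reduction: halve the active threads each iteration.
    for (int stride = BLOCK_SIZE / 2; stride > 0; stride >>= 1) {
        if (tid < stride) smem[tid] += smem[tid + stride];
        __syncthreads();
    }

    // One partial sum per block; atomicAdd is the simplest combine.
    if (tid == 0) atomicAdd(out, smem[0]);
}
```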

Repository layout

  • src/main.cu — test harness and kernel launcher; parses the target string to select which kernel to run.
  • include/ — header files containing kernels and helpers:
    • 00_basic_block_reduction.h
    • 01_block_reduction_with_warp_reduction.h
    • 01_vectorized_reduction_with_warp_reduction.h
    • 02_block_reduction_with_warp_reduction_with_warp_specialization.h
    • 02_vectorized_block_reduction_with_warp_reduction_with_warp_specialization.h
    • utils.h — utility helpers for data setup and validation.
  • CMakeLists.txt & build/ — CMake-based build configuration and generated build files.

Build & run (quick)

  1. Create a build directory, then run CMake and make:

```sh
mkdir -p build && cd build
cmake ..
make -j
```

  2. Run the produced executable (example usage; adjust depending on build output):

```sh
# from the project root
./build/reduction_to_the_hell
```

Notes

  • This project is intended for experimentation and learning rather than production deployment. Kernels are written for clarity and to demonstrate different reduction idioms; bounds checking is minimal, which could lead to bugs in production use.
  • Warp specialization is not applicable to float/double computation, because floating-point addition is not associative and reordering changes the result. The performance difference is small anyway, since it only saves warp 0's final reduction.
  • Requires CUDA toolkit and a CUDA-capable GPU. The build system uses nvcc (CMake CUDA support).
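The vectorized variants presumably widen global loads. One common idiom is to reinterpret the input as `float4`; a sketch under the assumptions that the pointer is 16-byte aligned and `n` is a multiple of 4 (names are illustrative, not from the repo):

```cuda
// Vectorized-load idiom: each thread issues one 128-bit load
// instead of four 32-bit loads. Assumes `in` is 16-byte aligned
// and `n` is a multiple of 4.
__global__ void vectorized_partial_sum(const float* in, float* out, int n) {
    const float4* in4 = reinterpret_cast<const float4*>(in);
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    // Grid-stride loop over float4 elements.
    for (int i = idx; i < n / 4; i += gridDim.x * blockDim.x) {
        float4 v = in4[i];
        sum += v.x + v.y + v.z + v.w;
    }
    // Each thread's `sum` would then feed a warp/block reduction;
    // atomicAdd is used here only to keep the sketch self-contained.
    atomicAdd(out, sum);
}
```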

Possible additions: a small benchmark script, a micro-benchmark harness that times each kernel, and example outputs from running the kernels on different hardware.

TODOs

  • Add CUB/Thrust examples
  • Add the nvbench library for runtime profiling
