A compact CUDA benchmark and demo that explores several parallel reduction implementations for GPUs. The project collects small, focused implementations of block- and warp-level reductions (including vectorized and warp-specialized variants) to compare correctness, performance trade-offs, and implementation techniques.
Only the sum operator is implemented; other operators can be added in a similar way.
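As a rough illustration of why swapping operators is straightforward, here is a hedged sketch of a warp-level sum reduction using shuffle intrinsics, with a max variant alongside it. The function names are illustrative, not the project's actual helpers:

```cuda
#include <cuda_runtime.h>

// Warp-level sum via shuffle: each step folds the upper half of the
// warp's lanes into the lower half. Swapping `+=` for `fmaxf` (and using
// the appropriate identity element when padding) gives a max-reduction.
__device__ float warp_reduce_sum(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;  // lane 0 ends up holding the warp's sum
}

__device__ float warp_reduce_max(float val) {
    for (int offset = 16; offset > 0; offset >>= 1)
        val = fmaxf(val, __shfl_down_sync(0xffffffff, val, offset));
    return val;  // lane 0 ends up holding the warp's max
}
```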
- Provide minimal, easy-to-read CUDA kernels that implement common reduction patterns.
- Compare alternative strategies (basic block reduction, block+warp reduction, vectorized reductions, and warp-specialization) in a single test harness.
- Serve as a learning and benchmarking playground for GPU reduction techniques.
- Multiple reduction kernels selectable at runtime (`basic`, `warp`, `vectorized_warp`, `warp_specialization`, and `vectorized_warp_speciliaztion`).
- Small, isolated header modules for each algorithm in `include/` so you can read and modify kernels quickly.
- Single-file test runner at `src/main.cu` that prepares data, launches kernels, and validates results.
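For orientation, the `basic` variant typically looks something like the following shared-memory tree reduction. This is a hedged sketch; the kernel and variable names are guesses, not the actual contents of `include/00_basic_block_reduction.h`:

```cuda
#include <cuda_runtime.h>

// Shared-memory tree reduction within a block: each block writes one
// partial sum, which a second pass (or the host) combines.
__global__ void block_reduce_sum(const float* in, float* block_sums, int n) {
    extern __shared__ float smem[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    smem[tid] = (i < n) ? in[i] : 0.0f;  // pad out-of-range lanes with identity
    __syncthreads();

    // Halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) smem[tid] += smem[tid + stride];
        __syncthreads();
    }
    if (tid == 0) block_sums[blockIdx.x] = smem[0];  // one partial per block
}
```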
- `src/main.cu` — test harness and kernel launcher; parses the `target` string to select which kernel to run.
- `include/` — header files containing kernels and helpers:
  - `00_basic_block_reduction.h`
  - `01_block_reduction_with_warp_reduction.h`
  - `01_vectorized_reduction_with_warp_reduction.h`
  - `02_block_reduction_with_warp_reduction_with_warp_specialization.h`
  - `02_vectorized_block_reduction_with_warp_reduction_with_warp_specialization.h`
  - `utils.h` — utility helpers for data setup and validation.
- `CMakeLists.txt` & `build/` — CMake-based build configuration and generated build files.
- Create a build directory and run CMake, then make (example):

```bash
mkdir -p build && cd build
cmake ..
make -j
```

- Run the produced executable (example usage; adjust depending on build output):

```bash
# from the project root
./build/reduction_to_the_hell
```

- This project is intended for experimentation and learning rather than production deployment. Kernels are written for clarity and to demonstrate different reduction idioms; they omit many boundary checks, which could lead to bugs in production code.
- Warp specialization changes the order of operations, so it can produce slightly different float/double results, since floating-point addition is not associative. The performance gain is also modest, because it only saves warp 0's final reduction.
- Requires the CUDA toolkit and a CUDA-capable GPU. The build system uses `nvcc` (CMake CUDA support).
- Add CUB/Thrust examples
- Add the nvbench library for runtime profiling
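For the CUB baseline, one likely shape is `cub::DeviceReduce::Sum` with its usual two-pass calling convention; this sketch (function name and error handling are simplified assumptions) shows the idea:

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>

// CUB device-wide sum: first call with a null temp pointer queries the
// required scratch size, the second call performs the reduction.
void cub_sum(const float* d_in, float* d_out, int n) {
    void*  d_temp     = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n);  // size query
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in, d_out, n);  // actual reduce
    cudaFree(d_temp);
}
```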