A comprehensive benchmark suite designed to evaluate and validate CUDA kernels generated by Large Language Models (LLMs). This benchmark addresses the limitations of existing kernel benchmarks by implementing robust evaluation criteria that prevent LLMs from exploiting benchmark settings.
Traditional kernel benchmarks often fall short when evaluating LLM-generated CUDA code because:
- They can be easily exploited by LLMs through input shape manipulation
- They don't detect shortcuts that only work for particular weight magnitudes (e.g., near-zero default initializations)
- They lack comprehensive validation across different initialization settings
- They don't test for real-world performance characteristics
`robust-kbench` addresses these limitations through:
- Multiple initialization settings
- Varied input configurations
- Comprehensive correctness checks
- Performance profiling capabilities
- Real-world task scenarios
```bash
# Clone the repository with submodules
git clone --recurse-submodules https://github.com/SakanaAI/robust-kbench.git

# Create and activate conda environment
conda create -n robust_kbench python=3.11
conda activate robust_kbench

# Install PyTorch (CUDA 12.4 wheels) and the package
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
cd robust-kbench
pip install -e .
```
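As a quick sanity check (not part of the repo's own instructions), the CUDA-enabled PyTorch build can be verified before running anything:

```python
import torch

# Both should succeed on a correctly installed cu124 build
print(torch.cuda.is_available())  # expected: True
print(torch.version.cuda)         # expected: "12.4"
```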
- Run Task Filtering:

  ```bash
  python run_filter.py --task_dir tasks/mnist_cross_entropy
  ```

- Evaluate a Single Kernel:

  ```bash
  python run_kernel.py --task_dir tasks/mnist_cross_entropy --cuda_code_path highlighted/mnist_cross_entropy/forward/kernel.cu
  ```
The benchmark implements several filter checks to ensure robust evaluation (a sketch of the three numeric output checks follows this list):
- Output Range Check: Ensures outputs are not artificially confined to the trivially small range [-0.01, 0.01]
- Standard Deviation Check: Verifies that the output standard deviation exceeds 0.01
- Axes Variation Check: Confirms that outputs vary by more than 0.01 along each axis
- Initialization Impact: Tests kernel behavior across different weight initialization settings
- Input Impact: Evaluates correctness and performance under varied input configurations
- LLM-Judge Inefficiency: Uses an LLM judge to assess potential inefficiencies in the generated code
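A minimal sketch of how the three numeric output checks could look, assuming outputs arrive as a PyTorch tensor; `passes_output_filters` and its exact thresholds are illustrative, not the library's API:

```python
import torch

def passes_output_filters(out: torch.Tensor, eps: float = 0.01) -> bool:
    """Hypothetical sketch of the numeric output filters; the real
    checks live in robust_kbench and may differ in detail."""
    # Output range check: reject outputs collapsed into [-eps, eps]
    if out.abs().max().item() <= eps:
        return False
    # Standard deviation check: overall variation must exceed eps
    if out.std().item() <= eps:
        return False
    # Axes variation check: outputs must vary along every axis
    for dim in range(out.dim()):
        if out.std(dim=dim).max().item() <= eps:
            return False
    return True
```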
```python
from robust_kbench.parallel import ParallelKernelExecutor

executor = ParallelKernelExecutor(
    task_dir="tasks/mnist_cross_entropy",
    op_atol=1e-5,               # absolute tolerance for correctness checks
    op_rtol=1e-5,               # relative tolerance for correctness checks
    warmup_time=25,
    repetition_time=10000,
    multi_init_settings=True,   # test across multiple initialization settings
    multi_input_settings=True,  # test across varied input configurations
    forward=True,               # evaluate the forward pass
    timeout=300,
    torch_prof=True,
)

# Evaluate multiple kernels (the same kernel appears twice here as a placeholder)
cuda_files = [
    "highlighted/mnist_cross_entropy/forward/kernel.cu",
    "highlighted/mnist_cross_entropy/forward/kernel.cu",
]

# Run evaluations
torch_results = executor.torch_eval()           # PyTorch baseline runtimes
compile_results = executor.compile(cuda_files)  # compile each kernel
test_results = executor.test(cuda_files)        # correctness checks
eval_results = executor.evaluate(cuda_files)    # runtime benchmarking
profile_results = executor.profile(cuda_files)  # profiling
```
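Each method returns results for every entry in `cuda_files`; how those records are structured is not shown here, so the inspection below is a hypothetical sketch:

```python
# Pair each kernel with its evaluation record
# (alignment with cuda_files is an assumption)
for path, result in zip(cuda_files, eval_results):
    print(path, result)
```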
```python
import os

from robust_kbench.evaluate import eval_torch_runtime

# Benchmark the PyTorch eager and torch.compile baselines for a task
torch_results, torch_compile_results = eval_torch_runtime(
    task_dir="tasks/mnist_linear",
    warmup_time=25,
    repetition_time=10000,
    eval_type="kernelbench",
    multi_init_settings=True,
    multi_input_settings=True,
    gpu_id=0,
    ext_dir=os.path.expanduser("~/.cache/torch_extensions/py311_cu124"),
    timeout=300,
    forward=True,
)
```
```python
import os

from robust_kbench.evaluate import compile_cuda_kernel

# Compile a candidate kernel as a PyTorch CUDA extension
cuda_compile_results = compile_cuda_kernel(
    cuda_code_path="highlighted/mnist_linear/forward/kernel.cu",
    task_dir="tasks/mnist_linear",
    gpu_id=0,
    ext_dir=os.path.expanduser("~/.cache/torch_extensions/py311_cu124"),
)
```
```python
import os

from robust_kbench.evaluate import test_cuda_kernel

# Check kernel outputs against the PyTorch reference within the given tolerances
correct_results = test_cuda_kernel(
    cuda_code_path="highlighted/mnist_linear/forward/kernel.cu",
    task_dir="tasks/mnist_linear",
    op_atol=1e-5,
    op_rtol=1e-5,
    multi_init_settings=True,
    multi_input_settings=True,
    gpu_id=0,
    ext_dir=os.path.expanduser("~/.cache/torch_extensions/py311_cu124"),
    timeout=300,
    forward=True,
)
```
```python
import os

from robust_kbench.evaluate import eval_cuda_kernel

# Benchmark a compiled kernel's runtime
cuda_results = eval_cuda_kernel(
    cuda_code_path="highlighted/mnist_linear/forward/kernel.cu",
    task_dir="tasks/mnist_linear",
    warmup_time=25,
    repetition_time=10000,
    eval_type="kernelbench",
    multi_init_settings=True,
    multi_input_settings=True,
    gpu_id=0,
    ext_dir=os.path.expanduser("~/.cache/torch_extensions/py311_cu124"),
    timeout=300,
    forward=True,
)
```
```python
from robust_kbench.evaluate import prof_cuda_kernel

# Profile a kernel; each flag enables one profiling backend
prof_results = prof_cuda_kernel(
    cuda_code_path="highlighted/mnist_linear/forward/kernel.cu",
    task_dir="tasks/mnist_linear",
    torch_prof=True,   # PyTorch profiler
    ncu_prof=False,    # NVIDIA Nsight Compute
    clang_prof=False,  # Clang-based profiling
    forward=True,
)
```
| Task | Description | Forward | Backward | Use Case |
|---|---|---|---|---|
| Linear | Matrix multiplication with bias | ✅ | ✅ | Neural network layers |
| Linear+ReLU | Linear layer followed by ReLU activation | ✅ | ✅ | Deep neural networks |
| LayerNorm | Layer normalization | ✅ | ✅ | Transformer architectures |
| Cross Entropy | Cross entropy loss for multi-class classification | ✅ | ✅ | Classification tasks |

| Task | Description | Forward | Backward | Use Case |
|---|---|---|---|---|
| Conv2D | 2D convolution operation | ✅ | ✅ | CNN architectures |
| Conv+ReLU+Pool | Convolution followed by ReLU and pooling | ✅ | ✅ | CNN feature extraction |
| MaxPool2D | 2D max pooling operation | ✅ | ✅ | CNN downsampling |

| Task | Description | Forward | Backward | Use Case |
|---|---|---|---|---|
| LLaMA-FFW | LLaMA feed-forward network | ✅ | ✅ | LLaMA model architecture |
| LLaMA-RMSNorm | Root mean square normalization | ✅ | ✅ | LLaMA model architecture |

| Task | Description | Forward | Backward | Use Case |
|---|---|---|---|---|
| ResNet Block | Residual block with convolutions | ✅ | ✅ | ResNet architectures |
| UNet Linear | Linear operations in UNet architecture | ✅ | ✅ | UNet model architecture |

| Task | Description | Forward | Backward | Use Case |
|---|---|---|---|---|
| KernelBench | Original KernelBench tasks | ✅ | ✅ | Baseline comparison |
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.