SakanaAI/robust-kbench

Towards Robust Agentic CUDA Kernel Benchmarking, Verification, and Optimization


A comprehensive benchmark suite designed to evaluate and validate CUDA kernels generated by Large Language Models (LLMs). This benchmark addresses the limitations of existing kernel benchmarks by implementing robust evaluation criteria that prevent LLMs from exploiting benchmark settings.

🎯 Motivation

Traditional kernel benchmarks often fall short when evaluating LLM-generated CUDA code because:

  • They can be easily exploited by LLMs through input shape manipulation
  • They don't account for weight magnitude optimizations
  • They lack comprehensive validation across different initialization settings
  • They don't test for real-world performance characteristics

robust-kbench addresses these limitations through:

  • Multiple initialization settings
  • Varied input configurations
  • Comprehensive correctness checks
  • Performance profiling capabilities
  • Real-world task scenarios

🚀 Quick Start

Installation

# Clone the repository with submodules
git clone --recurse-submodules https://github.com/SakanaAI/robust-kbench.git

# Create and activate conda environment
conda create -n robust_kbench python=3.11
conda activate robust_kbench

# Install PyTorch with CUDA 12.4 support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

# Install the package
cd robust-kbench
pip install -e .
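
To verify that the CUDA-enabled PyTorch build is usable, a quick sanity check (not part of the repository) is:

import torch

# Confirm the cu124 wheel was installed and a GPU is visible.
print(torch.__version__)
print(torch.cuda.is_available())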

Basic Usage

  1. Run Task Filtering
python run_filter.py --task_dir tasks/mnist_cross_entropy
  2. Evaluate a Single Kernel
python run_kernel.py --task_dir tasks/mnist_cross_entropy --cuda_code_path highlighted/mnist_cross_entropy/forward/kernel.cu

🔍 Task Filtering Procedure

The benchmark implements several filter checks to ensure robust evaluation (a sketch of the statistical checks appears after the list):

  1. Output Range Check: Ensures outputs are not artificially constrained to [-0.01, 0.01]
  2. Standard Deviation Check: Verifies output variation > 0.01
  3. Axes Variation Check: Confirms output variation across axes > 0.01
  4. Initialization Impact: Tests kernel behavior across different initialization settings
  5. Input Impact: Evaluates performance with varied input configurations
  6. LLM-judge Inefficiency: Assesses potential inefficiencies in LLM-generated code
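
The first three checks guard against degenerate outputs that would make almost any kernel look "correct". A minimal sketch of what such checks might compute (a hypothetical helper, not the repository's actual implementation; the 0.01 thresholds follow the list above):

import torch

def passes_statistical_checks(out: torch.Tensor, eps: float = 0.01) -> bool:
    # Output range check: reject outputs squeezed into [-eps, eps].
    if out.abs().max().item() <= eps:
        return False
    # Standard deviation check: overall output variation must exceed eps.
    if out.std().item() <= eps:
        return False
    # Axes variation check: variation along every axis must exceed eps.
    for dim in range(out.dim()):
        if out.std(dim=dim).mean().item() <= eps:
            return False
    return True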

💻 Detailed Usage

Parallel Kernel Evaluation

from robust_kbench.parallel import ParallelKernelExecutor

executor = ParallelKernelExecutor(
    task_dir="tasks/mnist_cross_entropy",
    op_atol=1e-5,
    op_rtol=1e-5,
    warmup_time=25,
    repetition_time=10000,
    multi_init_settings=True,
    multi_input_settings=True,
    forward=True,
    timeout=300,
    torch_prof=True,
)

# Evaluate multiple kernels (this example lists the same kernel twice)
cuda_files = [
    "highlighted/mnist_cross_entropy/forward/kernel.cu",
    "highlighted/mnist_cross_entropy/forward/kernel.cu",
]

# Run evaluations
torch_results = executor.torch_eval()
compile_results = executor.compile(cuda_files)
test_results = executor.test(cuda_files)
eval_results = executor.evaluate(cuda_files)
profile_results = executor.profile(cuda_files)

Individual Evaluation Components

Torch Baseline Evaluation

import os

from robust_kbench.evaluate import eval_torch_runtime

torch_results, torch_compile_results = eval_torch_runtime(
    task_dir="tasks/mnist_linear",
    warmup_time=25,
    repetition_time=10000,
    eval_type="kernelbench",
    multi_init_settings=True,
    multi_input_settings=True,
    gpu_id=0,
    ext_dir=os.path.expanduser("~/.cache/torch_extensions/py311_cu124"),
    timeout=300,
    forward=True,
)
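
Judging by the return names, the first result holds the eager PyTorch baseline and the second a torch.compile baseline.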

CUDA Kernel Compilation

import os

# compile_cuda_kernel is assumed to be importable alongside the other evaluate helpers
from robust_kbench.evaluate import compile_cuda_kernel

cuda_compile_results = compile_cuda_kernel(
    cuda_code_path="highlighted/mnist_linear/forward/kernel.cu",
    task_dir="tasks/mnist_linear",
    gpu_id=0,
    ext_dir=os.path.expanduser("~/.cache/torch_extensions/py311_cu124"),
)

Kernel Correctness Testing

import os

from robust_kbench.evaluate import test_cuda_kernel

correct_results = test_cuda_kernel(
    cuda_code_path="highlighted/mnist_linear/forward/forward.cu",
    task_dir="tasks/mnist_linear",
    op_atol=1e-5,
    op_rtol=1e-5,
    multi_init_settings=True,
    multi_input_settings=True,
    gpu_id=0,
    ext_dir=os.path.expanduser("~/.cache/torch_extensions/py311_cu124"),
    timeout=300,
    forward=True
)
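
Correctness is judged with absolute and relative tolerances. In the spirit of torch.allclose, a kernel output passes when |kernel_out - ref_out| <= op_atol + op_rtol * |ref_out| holds elementwise. An illustration of the tolerance semantics (not the repository's exact code):

import torch

ref_out = torch.randn(128, 10)
kernel_out = ref_out + 1e-6 * torch.randn_like(ref_out)

# Elementwise check: |a - b| <= atol + rtol * |b|
ok = torch.allclose(kernel_out, ref_out, atol=1e-5, rtol=1e-5)
print(ok)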

Runtime Evaluation

import os

from robust_kbench.evaluate import eval_cuda_kernel

cuda_results = eval_cuda_kernel(
    cuda_code_path="highlighted/mnist_linear/forward/kernel.cu",
    task_dir="tasks/mnist_linear",
    warmup_time=25,
    repetition_time=10000,
    eval_type="kernelbench",
    multi_init_settings=True,
    multi_input_settings=True,
    gpu_id=0,
    ext_dir=os.path.expanduser("~/.cache/torch_extensions/py311_cu124"),
    timeout=300,
    forward=True
)

Performance Profiling

from robust_kbench.evaluate import prof_cuda_kernel

prof_results = prof_cuda_kernel(
    cuda_code_path="tasks/linear/forward.cu",
    task_dir="tasks/mnist_linear",
    torch_prof=True,
    ncu_prof=False,
    clang_prof=False,
    forward=True
)
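
With torch_prof=True, profiling presumably builds on PyTorch's built-in profiler. A standalone illustration of that API (not the repository's wrapper):

import torch
from torch.profiler import ProfilerActivity, profile

x = torch.randn(1024, 1024, device="cuda")
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    y = x @ x
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=5))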

📋 Supported Tasks

Basic Neural Network Operations

| Task | Description | Forward | Backward | Use Case |
| --- | --- | --- | --- | --- |
| Linear | Matrix multiplication with bias | ✓ | ✓ | Neural network layers |
| Linear+ReLU | Linear layer followed by ReLU activation | ✓ | ✓ | Deep neural networks |
| LayerNorm | Layer normalization | ✓ | ✓ | Transformer architectures |
| Cross Entropy | Cross entropy loss for multi-class classification | ✓ | ✓ | Classification tasks |
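
As an illustration of what the Linear task computes, a minimal PyTorch reference might look as follows (MNIST-style shapes are an assumption, not the benchmark's configuration):

import torch

# Hypothetical MNIST-style Linear task: y = x @ W.T + b
x = torch.randn(128, 784, device="cuda")
layer = torch.nn.Linear(784, 10, device="cuda")
y = layer(x)  # reference output a custom CUDA kernel must match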

Convolutional Neural Network Operations

| Task | Description | Forward | Backward | Use Case |
| --- | --- | --- | --- | --- |
| Conv2D | 2D convolution operation | ✓ | ✓ | CNN architectures |
| Conv+ReLU+Pool | Convolution followed by ReLU and pooling | ✓ | ✓ | CNN feature extraction |
| MaxPool2D | 2D max pooling operation | ✓ | ✓ | CNN downsampling |

Transformer Architecture Operations

| Task | Description | Forward | Backward | Use Case |
| --- | --- | --- | --- | --- |
| LLaMA-FFW | LLaMA feed-forward network | ✓ | ✗ | LLaMA model architecture |
| LLaMA-RMSNorm | Root mean square normalization | ✓ | ✓ | LLaMA model architecture |
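
For reference, RMS normalization (as used in LLaMA) scales each vector by the reciprocal of its root-mean-square and then applies a learned gain. A minimal PyTorch sketch of the standard formulation (not the benchmark's reference code):

import torch

def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Normalize by the RMS over the last dimension, then rescale.
    inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x * inv_rms * weight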

Complex Network Blocks

| Task | Description | Forward | Backward | Use Case |
| --- | --- | --- | --- | --- |
| ResNet Block | Residual block with convolutions | ✓ | ✗ | ResNet architectures |
| UNet Linear | Linear operations in UNet architecture | ✓ | ✗ | UNet model architecture |

Original KernelBench Tasks

| Task | Description | Forward | Backward | Use Case |
| --- | --- | --- | --- | --- |
| KernelBench | Original KernelBench tasks | ✓ | ✗ | Baseline comparison |

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

📝 License

This project is licensed under the Apache License 2.0; see the LICENSE file for details.
