A comprehensive benchmark suite designed to evaluate and validate CUDA kernels generated by Large Language Models (LLMs). This benchmark addresses the limitations of existing kernel benchmarks by implementing robust evaluation criteria that prevent LLMs from exploiting benchmark settings.
Traditional kernel benchmarks often fall short when evaluating LLM-generated CUDA code because:
- They can be easily exploited by LLMs through input shape manipulation
- They don't detect shortcuts that only work for particular weight magnitudes (e.g., near-zero default initializations)
- They lack comprehensive validation across different initialization settings
- They don't test for real-world performance characteristics
`robust-kbench` addresses these limitations through:
- Multiple initialization settings
- Varied input configurations
- Comprehensive correctness checks
- Performance profiling capabilities
- Real-world task scenarios
```bash
# Clone the repository with submodules
git clone --recurse-submodules https://github.com/SakanaAI/robust-kbench.git

# Create and activate conda environment
conda create -n robust_kbench python=3.11
conda activate robust_kbench

# Install PyTorch (CUDA 12.4 wheels) and the package
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
cd robust-kbench
pip install -e .
```
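As a quick sanity check (not part of the repo's own instructions), the CUDA-enabled PyTorch build can be verified before running anything:

```python
import torch

# Both should succeed on a correctly installed cu124 build
print(torch.cuda.is_available())  # expected: True
print(torch.version.cuda)         # expected: "12.4"
```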
- Run Task Filtering:

  ```bash
  python run_filter.py --task_dir tasks/mnist_cross_entropy
  ```

- Evaluate a Single Kernel:

  ```bash
  python run_kernel.py --task_dir tasks/mnist_cross_entropy --cuda_code_path highlighted/mnist_cross_entropy/forward/kernel.cu
  ```
The benchmark implements several filter checks to ensure robust evaluation (a sketch of the three numeric output checks follows this list):
- Output Range Check: Ensures outputs are not artificially confined to the trivially small range [-0.01, 0.01]
- Standard Deviation Check: Verifies that the output standard deviation exceeds 0.01
- Axes Variation Check: Confirms that outputs vary by more than 0.01 along each axis
- Initialization Impact: Tests kernel behavior across different weight initialization settings
- Input Impact: Evaluates correctness and performance under varied input configurations
- LLM-Judge Inefficiency: Uses an LLM judge to assess potential inefficiencies in the generated code
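A minimal sketch of how the three numeric output checks could look, assuming outputs arrive as a PyTorch tensor; `passes_output_filters` and its exact thresholds are illustrative, not the library's API:

```python
import torch

def passes_output_filters(out: torch.Tensor, eps: float = 0.01) -> bool:
    """Hypothetical sketch of the numeric output filters; the real
    checks live in robust_kbench and may differ in detail."""
    # Output range check: reject outputs collapsed into [-eps, eps]
    if out.abs().max().item() <= eps:
        return False
    # Standard deviation check: overall variation must exceed eps
    if out.std().item() <= eps:
        return False
    # Axes variation check: outputs must vary along every axis
    for dim in range(out.dim()):
        if out.std(dim=dim).max().item() <= eps:
            return False
    return True
```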
```python
from robust_kbench.parallel import ParallelKernelExecutor

executor = ParallelKernelExecutor(
    task_dir="tasks/mnist_cross_entropy",
    op_atol=1e-5,               # absolute tolerance for correctness checks
    op_rtol=1e-5,               # relative tolerance for correctness checks
    warmup_time=25,
    repetition_time=10000,
    multi_init_settings=True,   # test across multiple initialization settings
    multi_input_settings=True,  # test across varied input configurations
    forward=True,               # evaluate the forward pass
    timeout=300,
    torch_prof=True,
)

# Evaluate multiple kernels (the same kernel appears twice here as a placeholder)
cuda_files = [
    "highlighted/mnist_cross_entropy/forward/kernel.cu",
    "highlighted/mnist_cross_entropy/forward/kernel.cu",
]

# Run evaluations
torch_results = executor.torch_eval()           # PyTorch baseline runtimes
compile_results = executor.compile(cuda_files)  # compile each kernel
test_results = executor.test(cuda_files)        # correctness checks
eval_results = executor.evaluate(cuda_files)    # runtime benchmarking
profile_results = executor.profile(cuda_files)  # profiling
```
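Each method returns results for every entry in `cuda_files`; how those records are structured is not shown here, so the inspection below is a hypothetical sketch:

```python
# Pair each kernel with its evaluation record
# (alignment with cuda_files is an assumption)
for path, result in zip(cuda_files, eval_results):
    print(path, result)
```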
```python
import os

from robust_kbench.evaluate import eval_torch_runtime

# Benchmark the PyTorch eager and torch.compile baselines for a task
torch_results, torch_compile_results = eval_torch_runtime(
    task_dir="tasks/mnist_linear",
    warmup_time=25,
    repetition_time=10000,
    eval_type="kernelbench",
    multi_init_settings=True,
    multi_input_settings=True,
    gpu_id=0,
    ext_dir=os.path.expanduser("~/.cache/torch_extensions/py311_cu124"),
    timeout=300,
    forward=True,
)
```
```python
import os

from robust_kbench.evaluate import compile_cuda_kernel

# Compile a candidate kernel as a PyTorch CUDA extension
cuda_compile_results = compile_cuda_kernel(
    cuda_code_path="highlighted/mnist_linear/forward/kernel.cu",
    task_dir="tasks/mnist_linear",
    gpu_id=0,
    ext_dir=os.path.expanduser("~/.cache/torch_extensions/py311_cu124"),
)
```
```python
import os

from robust_kbench.evaluate import test_cuda_kernel

# Check kernel outputs against the PyTorch reference within the given tolerances
correct_results = test_cuda_kernel(
    cuda_code_path="highlighted/mnist_linear/forward/kernel.cu",
    task_dir="tasks/mnist_linear",
    op_atol=1e-5,
    op_rtol=1e-5,
    multi_init_settings=True,
    multi_input_settings=True,
    gpu_id=0,
    ext_dir=os.path.expanduser("~/.cache/torch_extensions/py311_cu124"),
    timeout=300,
    forward=True,
)
```
```python
import os

from robust_kbench.evaluate import eval_cuda_kernel

# Benchmark a compiled kernel's runtime
cuda_results = eval_cuda_kernel(
    cuda_code_path="highlighted/mnist_linear/forward/kernel.cu",
    task_dir="tasks/mnist_linear",
    warmup_time=25,
    repetition_time=10000,
    eval_type="kernelbench",
    multi_init_settings=True,
    multi_input_settings=True,
    gpu_id=0,
    ext_dir=os.path.expanduser("~/.cache/torch_extensions/py311_cu124"),
    timeout=300,
    forward=True,
)
```
```python
from robust_kbench.evaluate import prof_cuda_kernel

# Profile a kernel; each flag enables one profiling backend
prof_results = prof_cuda_kernel(
    cuda_code_path="highlighted/mnist_linear/forward/kernel.cu",
    task_dir="tasks/mnist_linear",
    torch_prof=True,   # PyTorch profiler
    ncu_prof=False,    # NVIDIA Nsight Compute
    clang_prof=False,  # Clang-based profiling
    forward=True,
)
```
| Task | Description | Forward | Backward | Use Case |
|---|---|---|---|---|
| Linear | Matrix multiplication with bias | ✅ | ✅ | Neural network layers |
| Linear+ReLU | Linear layer followed by ReLU activation | ✅ | ✅ | Deep neural networks |
| LayerNorm | Layer normalization | ✅ | ✅ | Transformer architectures |
| Cross Entropy | Cross entropy loss for multi-class classification | ✅ | ✅ | Classification tasks |

| Task | Description | Forward | Backward | Use Case |
|---|---|---|---|---|
| Conv2D | 2D convolution operation | ✅ | ✅ | CNN architectures |
| Conv+ReLU+Pool | Convolution followed by ReLU and pooling | ✅ | ✅ | CNN feature extraction |
| MaxPool2D | 2D max pooling operation | ✅ | ✅ | CNN downsampling |

| Task | Description | Forward | Backward | Use Case |
|---|---|---|---|---|
| LLaMA-FFW | LLaMA feed-forward network | ✅ | ✅ | LLaMA model architecture |
| LLaMA-RMSNorm | Root mean square normalization | ✅ | ✅ | LLaMA model architecture |

| Task | Description | Forward | Backward | Use Case |
|---|---|---|---|---|
| ResNet Block | Residual block with convolutions | ✅ | ✅ | ResNet architectures |
| UNet Linear | Linear operations in UNet architecture | ✅ | ✅ | UNet model architecture |

| Task | Description | Forward | Backward | Use Case |
|---|---|---|---|---|
| KernelBench | Original KernelBench tasks | ✅ | ✅ | Baseline comparison |
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.