Add OpenEvolve-based Autotuner for Helion GPU Kernels #1082

mycpuorg · 2025-11-04T20:39:01Z

This commit implements a new autotuner that uses OpenEvolve's LLM-guided evolutionary algorithm to find optimal Helion kernel configurations.

Key features:

OpenEvolveTuner class as drop-in alternative to differential evolution
Intelligent config space exploration using GPT-4o-mini
Comprehensive error handling and fallback to random search
Example script demonstrating vector add kernel tuning
Full documentation with API reference and usage examples

Files added:

helion/autotuner/openevolve_tuner.py: Main tuner implementation
examples/helion_vector_add_tuning.py: Complete working example
helion/autotuner/openevolve_tuner_README.md: Documentation

The tuner converts Helion's config space into Python programs that OpenEvolve evolves, bridging discrete kernel parameters with LLM-based optimization. Typical cost is $0.01-0.10 per tuning run.

Tested with mock evaluations to verify structure and logic.

This commit implements a new autotuner that uses OpenEvolve's LLM-guided evolutionary algorithm to find optimal Helion kernel configurations. Key features: - OpenEvolveTuner class as drop-in alternative to differential evolution - Intelligent config space exploration using GPT-4o-mini - Comprehensive error handling and fallback to random search - Example script demonstrating vector add kernel tuning - Full documentation with API reference and usage examples Files added: - helion/autotuner/openevolve_tuner.py: Main tuner implementation - examples/helion_vector_add_tuning.py: Complete working example - helion/autotuner/openevolve_tuner_README.md: Documentation The tuner converts Helion's config space into Python programs that OpenEvolve evolves, bridging discrete kernel parameters with LLM-based optimization. Typical cost is $0.01-0.10 per tuning run. Tested with mock evaluations to verify structure and logic.

This commit adds comprehensive testing and documentation for running the OpenEvolve autotuner on NVIDIA B200 (Blackwell) GPUs. Files added: - QUICKSTART_B200.md: Quick start guide for B200 testing - TESTING_B200.md: Comprehensive B200 testing documentation - examples/helion_b200_attention_tuning.py: B200-optimized attention kernel tuning - test_openevolve_b200.sh: Automated test suite for B200 Key features: - B200-specific config space (tensor descriptors, warp specialization, registers) - Automated test suite with quick/full modes - Mock mode for testing without GPU/API key - Real benchmarking mode with TFLOPS measurement - Attention kernel example leveraging Blackwell features - Cost estimates and performance expectations - Troubleshooting guides and monitoring tips The test script can detect B200 GPUs and run appropriate tests: ./test_openevolve_b200.sh quick # Fast tests (~1 min) ./test_openevolve_b200.sh full # Full tests (~10 min) B200-specific parameters tuned: - indexing: 'default' vs 'tensor_descriptor' - pid_type: 'default' vs 'persistent_interleaved' - maxreg: 128-256 (leverages increased register file) - block sizes optimized for Blackwell SM architecture Ready for testing on B200 machines with comprehensive documentation.

- PR_DESCRIPTION.md: Comprehensive PR description - CREATE_PR.md: Instructions for creating the PR These files provide all the information needed to create a PR to the main branch of mycpuorg/helion.

jansel · 2025-11-07T02:34:07Z

Do you have any data on how well this works? Does it find faster results than the existing algorithms? How long does it take? Does it end up spending most of the time waiting for the LLM or is it able to saturate the local CPU compiling kernels?

The check `config.indexing == "tensor_descriptor"` was comparing a list to a string, which always evaluates to False. This meant the safeguards to disable range_num_stages and range_unroll_factor when using tensor_descriptor indexing were never applied. This caused CUDA "misaligned address" errors and "'ttg.local_load' op not assigned a pipeline stage" errors during autotuning with matmul kernels. The fix properly handles both cases: - config.indexing as a string (single strategy for all loads/stores) - config.indexing as a list (per load/store strategies)

Helion requires grid loops (hl.tile) to be at the top level of the kernel function, not nested inside Python for loops. Changes: - Restructured attention kernel to operate on 2D tensors (seq_len x head_dim) instead of 4D (batch, heads, seq_q, head_dim) - Moved hl.tile loop to top level of the kernel function - Updated evaluate_attention_config to use 2D input tensors - Fixed "default" config values to use valid Helion values ("flat", "pointer") - Use hl.zeros and hl.full instead of torch functions for tile-sized tensors

meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Nov 4, 2025

claude added 2 commits November 4, 2025 20:44

Add PR documentation for mycpuorg/helion

7bfb40e

- PR_DESCRIPTION.md: Comprehensive PR description - CREATE_PR.md: Instructions for creating the PR These files provide all the information needed to create a PR to the main branch of mycpuorg/helion.

mycpuorg added 2 commits December 6, 2025 18:59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add OpenEvolve-based Autotuner for Helion GPU Kernels #1082

Add OpenEvolve-based Autotuner for Helion GPU Kernels #1082

Uh oh!

mycpuorg commented Nov 4, 2025

Uh oh!

jansel commented Nov 7, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Add OpenEvolve-based Autotuner for Helion GPU Kernels #1082

Are you sure you want to change the base?

Add OpenEvolve-based Autotuner for Helion GPU Kernels #1082

Uh oh!

Conversation

mycpuorg commented Nov 4, 2025

Uh oh!

jansel commented Nov 7, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants