-
Notifications
You must be signed in to change notification settings - Fork 84
Add OpenEvolve-based Autotuner for Helion GPU Kernels #1082
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft
mycpuorg
wants to merge
5
commits into
pytorch:main
Choose a base branch
from
mycpuorg:claude/openevolve-autotuner-helion-011CUoUYodYsMMzqcBCnGbKR
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
Add OpenEvolve-based Autotuner for Helion GPU Kernels #1082
mycpuorg
wants to merge
5
commits into
pytorch:main
from
mycpuorg:claude/openevolve-autotuner-helion-011CUoUYodYsMMzqcBCnGbKR
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This commit implements a new autotuner that uses OpenEvolve's LLM-guided evolutionary algorithm to find optimal Helion kernel configurations. Key features: - OpenEvolveTuner class as drop-in alternative to differential evolution - Intelligent config space exploration using GPT-4o-mini - Comprehensive error handling and fallback to random search - Example script demonstrating vector add kernel tuning - Full documentation with API reference and usage examples Files added: - helion/autotuner/openevolve_tuner.py: Main tuner implementation - examples/helion_vector_add_tuning.py: Complete working example - helion/autotuner/openevolve_tuner_README.md: Documentation The tuner converts Helion's config space into Python programs that OpenEvolve evolves, bridging discrete kernel parameters with LLM-based optimization. Typical cost is $0.01-0.10 per tuning run. Tested with mock evaluations to verify structure and logic.
This commit adds comprehensive testing and documentation for running the OpenEvolve autotuner on NVIDIA B200 (Blackwell) GPUs. Files added: - QUICKSTART_B200.md: Quick start guide for B200 testing - TESTING_B200.md: Comprehensive B200 testing documentation - examples/helion_b200_attention_tuning.py: B200-optimized attention kernel tuning - test_openevolve_b200.sh: Automated test suite for B200 Key features: - B200-specific config space (tensor descriptors, warp specialization, registers) - Automated test suite with quick/full modes - Mock mode for testing without GPU/API key - Real benchmarking mode with TFLOPS measurement - Attention kernel example leveraging Blackwell features - Cost estimates and performance expectations - Troubleshooting guides and monitoring tips The test script can detect B200 GPUs and run appropriate tests: ./test_openevolve_b200.sh quick # Fast tests (~1 min) ./test_openevolve_b200.sh full # Full tests (~10 min) B200-specific parameters tuned: - indexing: 'default' vs 'tensor_descriptor' - pid_type: 'default' vs 'persistent_interleaved' - maxreg: 128-256 (leverages increased register file) - block sizes optimized for Blackwell SM architecture Ready for testing on B200 machines with comprehensive documentation.
- PR_DESCRIPTION.md: Comprehensive PR description - CREATE_PR.md: Instructions for creating the PR These files provide all the information needed to create a PR to the main branch of mycpuorg/helion.
Contributor
|
Do you have any data on how well this works? Does it find faster results than the existing algorithms? How long does it take? Does it end up spending most of the time waiting for the LLM or is it able to saturate the local CPU compiling kernels? |
The check `config.indexing == "tensor_descriptor"` was comparing a list to a string, which always evaluates to False. This meant the safeguards to disable range_num_stages and range_unroll_factor when using tensor_descriptor indexing were never applied. This caused CUDA "misaligned address" errors and "'ttg.local_load' op not assigned a pipeline stage" errors during autotuning with matmul kernels. The fix properly handles both cases: - config.indexing as a string (single strategy for all loads/stores) - config.indexing as a list (per load/store strategies)
Helion requires grid loops (hl.tile) to be at the top level of the
kernel function, not nested inside Python for loops.
Changes:
- Restructured attention kernel to operate on 2D tensors (seq_len x head_dim)
instead of 4D (batch, heads, seq_q, head_dim)
- Moved hl.tile loop to top level of the kernel function
- Updated evaluate_attention_config to use 2D input tensors
- Fixed "default" config values to use valid Helion values ("flat", "pointer")
- Use hl.zeros and hl.full instead of torch functions for tile-sized tensors
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This commit implements a new autotuner that uses OpenEvolve's LLM-guided evolutionary algorithm to find optimal Helion kernel configurations.
Key features:
Files added:
The tuner converts Helion's config space into Python programs that OpenEvolve evolves, bridging discrete kernel parameters with LLM-based optimization. Typical cost is $0.01-0.10 per tuning run.
Tested with mock evaluations to verify structure and logic.