@gmagogsfm gmagogsfm commented Nov 20, 2025

Prototype Helion kernels in vLLM

  • Support out-variant kernel, auto-functionalize
    • Support more dtypes in fake kernel
  • Automatically generate fake kernel
    • Helion kernels can't handle non-nested SymInt input (the bound kernel cache errors when computing a cache key that contains a SymInt)
  • Reduce boilerplate in authoring/registering Helion kernels
    • Use direct_register_custom_op to reduce overhead
  • Performance tuning
    • allreduce_add_rmsnorm hangs on H200
    • Fine-grained dynamic shape control in Helion kernel settings
  • Add runtime dispatch logic based on model param

@gmagogsfm gmagogsfm force-pushed the helion branch 3 times, most recently from b6a7fa2 to ff1634b Compare November 21, 2025 23:24
@mergify mergify bot added the performance Performance-related issues label Nov 21, 2025

gmagogsfm commented Nov 21, 2025

@ProExpertProg @zou3519

After some optimization, the Helion silu_mul_fp8 kernel outperforms the original CUDA kernel. The biggest factor is enabling CUDA graphs, which eliminates launch overhead.

The benchmarking script is included in this branch.

============================================================
Summary Statistics
============================================================
Total configurations tested: 242

Speedup:
  Average: 2.64x
  Median:  2.45x
  Min:     1.01x
  Max:     6.25x

Latency (ms):
  Baseline - Avg: 0.0129, Min: 0.0016, Max: 0.2469
  Helion   - Avg: 0.0063, Min: 0.0015, Max: 0.1570
============================================================
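
A note on reading the summary: the average speedup (2.64x) is larger than the ratio of the average latencies (0.0129 / 0.0063 ≈ 2.05x). That is what you would expect if speedup is computed per configuration and then averaged, because an average of ratios weights every configuration equally while a ratio of averages is dominated by the slowest configurations. The sketch below illustrates the difference. It assumes speedup is defined as baseline latency divided by Helion latency; the latency values are made up for illustration and are not taken from the benchmark run.

```python
from statistics import mean, median

# Hypothetical per-configuration latencies in ms (NOT from the actual benchmark).
baseline_ms = [0.0020, 0.0040, 0.0100, 0.2000]
helion_ms = [0.0010, 0.0016, 0.0050, 0.1600]

# Assumed definition: per-configuration speedup = baseline / Helion.
speedups = [b / h for b, h in zip(baseline_ms, helion_ms)]

# Mean of ratios: every configuration counts equally.
mean_of_speedups = mean(speedups)
median_speedup = median(speedups)

# Ratio of means: dominated by the configurations with the largest latencies.
ratio_of_means = mean(baseline_ms) / mean(helion_ms)

print(f"mean of speedups: {mean_of_speedups:.2f}x")
print(f"median speedup:   {median_speedup:.2f}x")
print(f"ratio of means:   {ratio_of_means:.2f}x")
```

With these made-up inputs, the one large, barely-faster configuration pulls the ratio of means well below the mean of the per-configuration speedups.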

gmagogsfm commented

RMS Norm Quant 8

============================================================
Summary Statistics
============================================================
Total configurations tested: 199

Speedup:
  Average: 1.62x
  Median:  1.59x
  Min:     0.96x
  Max:     2.58x

Latency (ms):
  Baseline - Avg: 0.0040, Min: 0.0022, Max: 0.0190
  Helion   - Avg: 0.0024, Min: 0.0014, Max: 0.0125
============================================================

gmagogsfm commented

Added allreduce_add_rmsnorm Helion kernel

Tested on 2xH100 without any comms optimization, compared against FlashInfer comm with fusion.

(Results are flaky on my machine, with the average speedup ranging from 0.99x to over 1.6x, possibly because it is a shared dev box.)

============================================================
Summary Statistics
============================================================
Total configurations tested: 78

Speedup:
  Average: 1.36x
  Median:  1.37x
  Min:     0.91x
  Max:     2.23x

Latency (ms):
  Baseline - Avg: 0.1202, Min: 0.0985, Max: 0.2095
  Helion   - Avg: 0.0946, Min: 0.0510, Max: 0.2102
============================================================

gmagogsfm and others added 10 commits December 1, 2025 23:24
- This prototype implements a naive silu_mul_fp8 kernel and
integrates it into vLLM's custom fusion pass as a custom op
- Numerical accuracy is verified
- On average it is about 4x slower than vLLM's custom
silu_mul_fp8 CUDA kernel

Signed-off-by: Yanan Cao <[email protected]>
Signed-off-by: Yanan Cao <[email protected]>
This commit introduces comprehensive improvements to Helion kernel configuration:

## Major Changes

### ConfigManager Consolidation
- **Created centralized ConfigManager**: Extracted duplicated config logic from
  HelionCustomOp and autotune script into dedicated class
- **Standardized naming**: Config files now use exact kernel names
  (helion_silu_mul_fp8_helion_4096.json) instead of normalized names
- **Smart directory detection**: Auto-finds vLLM repo root for config storage
- **Renamed existing configs**: Migrated 8 config files to new naming standard

### Architecture Improvements
- **Separated concerns**: HelionCustomOp only handles autotuning, script handles saving
- **Pure function design**: get_best_config() now takes available configs dict instead of doing I/O
- **Method to property**: Converted _get_helion_kernel() to helion_kernel property
- **Removed dead code**: Eliminated unused find_best_config() method violating SRP
- **Fixed critical bug**: Config filtering logic now properly handles partial configs
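
For illustration, the "pure function design" bullet could look roughly like the sketch below. Only the names `get_best_config` and `available_configs` come from this commit message; the `size` parameter, the integer-keyed dict, the nearest-size selection rule, and the `block_size` config field are assumptions made for the example and may not match the actual implementation.

```python
from typing import Any, Optional


def get_best_config(
    available_configs: dict[int, dict[str, Any]],
    size: int,
) -> Optional[dict[str, Any]]:
    """Pick a config without touching the filesystem.

    Assumption for this sketch: configs are keyed by a problem size and the
    closest key wins (ties go to the smaller key). The caller is responsible
    for loading the configs from disk beforehand.
    """
    if not available_configs:
        return None
    best_key = min(available_configs, key=lambda k: (abs(k - size), k))
    return available_configs[best_key]


# Usage: the caller does the I/O once and passes plain data in.
configs = {2048: {"block_size": 64}, 4096: {"block_size": 128}}
chosen = get_best_config(configs, 4000)
```

Because the function takes plain data, it can be unit tested with an in-memory dict and no config files.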

### Type Safety Enhancements
- **Specific type annotations**: Replaced generic Union[str, type] with KernelIdentifier = Union[str, "type[HelionCustomOp]"]
- **TYPE_CHECKING imports**: Added proper forward references to avoid circular imports
- **Type alias**: Introduced KernelIdentifier for better code readability
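
A minimal sketch of the type-alias pattern described above. The alias definition is quoted from this commit message; the import path is inferred from the "Files Modified" list, and the `kernel_name` helper is hypothetical, added only to show the alias in use.

```python
from typing import TYPE_CHECKING, Union

if TYPE_CHECKING:
    # Evaluated by type checkers only, so no circular import occurs at runtime.
    # Assumption: HelionCustomOp lives in the module listed under "Files Modified".
    from vllm.compilation.helion.custom_op import HelionCustomOp

# The string forward reference keeps this line valid even though
# HelionCustomOp is never imported at runtime.
KernelIdentifier = Union[str, "type[HelionCustomOp]"]


def kernel_name(kernel: KernelIdentifier) -> str:
    """Hypothetical helper: normalize a kernel identifier to its name."""
    return kernel if isinstance(kernel, str) else kernel.__name__
```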

### API Changes
- **autotune() signature**: Now requires autotune_inputs parameter instead of calling get_autotune_inputs() internally
- **get_best_config()**: Takes available_configs dict parameter for pure function behavior
- **Logging namespace**: Fixed script logging to use proper vLLM namespace

## Files Modified
- NEW: vllm/compilation/helion/config_manager.py - Centralized config management
- NEW: scripts/autotune_helion_kernels.py - Orchestration script with proper separation
- MODIFIED: vllm/compilation/helion/custom_op.py - Refactored base class
- MODIFIED: All 3 kernel implementations - Updated to use new architecture
- RENAMED: 8 config files from normalized to exact kernel names

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
- Convert f-string logging to proper format strings (G004)
- Fix line length violations (E501)
- Use dict iteration instead of .keys() (SIM118)
- Break long help text into multiple lines

All pre-commit checks now pass successfully.
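
For reference, the G004 and SIM118 fixes follow the usual pattern sketched below. The logger name and the config dict are hypothetical and only serve the example.

```python
import logging

logger = logging.getLogger("vllm.example")  # hypothetical logger name

# Hypothetical data standing in for loaded kernel configs.
configs = {"4096": {"block_size": 128}, "8192": {"block_size": 256}}

# G004: instead of logger.info(f"Loaded {len(configs)} configs"), pass the
# arguments separately so formatting only happens if the record is emitted.
logger.info("Loaded %d configs", len(configs))

# SIM118: instead of `for name in configs.keys()`, iterate the dict directly.
seen = []
for name in configs:
    logger.info("Config %s has %d settings", name, len(configs[name]))
    seen.append(name)

# SIM118 also covers membership tests: `"4096" in configs.keys()` becomes:
has_4096 = "4096" in configs
```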

Signed-off-by: Yanan Cao <[email protected]>