What's Changed
- Generalize the CUDA-biased test cases by replacing the hardcoded "cuda" literal with the DEVICE variable by @EikanWang in #775
- Make progress bar prettier by @oulgen in #786
- Upgrade ruff==0.13.3 pyright==1.1.406 by @jansel in #790
- Add hl.split and hl.join by @jansel in #791
- Generalize test_print and test_tensor_descriptor to support different accelerators by @EikanWang in #801
- Limit rebench to 1000 iterations by @jansel in #789
- Turn down autotuner defaults by @jansel in #788
- Enable torch.xpu._XpuDeviceProperties in Helion kernels by @EikanWang in #798
- Better error message for augmented assignment (e.g. +=) on a host tensor without a subscript by @yf225 in #807
- Add Pattern Search autotuning algorithm to docs by @choijon5 in #810
- Support 0-dim tensors in output code printing by @oulgen in #806
- Set range_num_stages <= 1 if using tensor_descriptor, to avoid CUDA misaligned address error by @yf225 in #792
- Add hl.inline_triton API by @jansel in #811
- Add out_dtype arg to hl.dot by @jansel in #813 (see the hl.dot sketch at the end of these notes)
- Add autotune_config_overrides by @jansel in #814 (see the settings sketch after this list)
- Reduce initial_population to 100 by @jansel in #800
- Disable range_num_stages for kernels with aliasing by @jansel in #812
- Add a new setting, autotune_max_generations, that lets users cap the number of autotuning generations by @choijon5 in #796 (covered by the settings sketch after this list)
- Increase tolerance for test_matmul_reshape_m_2 by @jansel in #816
- Update docs by @jansel in #815
- Fix torch version check by @adam-smnk in #818
- [Benchmark] Keep going when a single benchmark fails by @oulgen in #820
- Faster Helion JSD by @PaulZhang12 in #733
- Faster KL Div by @PaulZhang12 in #822
- Normalize device names and decorate CUDA-only test cases by @EikanWang in #819
- Improve log messages for autotuning by @choijon5 in #817
- Simplify range indexing so block size symbols can be reused by @yf225 in #809
- Fix hl.rand to use tile-specific offsets instead of fixed offsets, ensuring unique random numbers per tile by @karthickai in #685
- Match CUDA versions for benchmarks by @oulgen in #828
- Print nvidia-smi/rocminfo by @oulgen in #827
- Dump nvidia-smi/rocminfo on benchmarks by @oulgen in #829
- Add Python 3.14 support by @oulgen in #830
- Remove py312 vanilla test by @oulgen in #831
- Pad to the next power of 2 for hl.specialize'd shape values used in device tensor creation by @yf225 in #804
- Autotune eviction policy by @oulgen in #823
- [Docs] Consistent pre-commit/lint by @oulgen in #836
- [Docs] Recommend venv instead of conda by @oulgen in #837
- [Docs] Helion works on Python 3.10 through 3.14 by @oulgen in #838
- [Docs] Add eviction policy by @oulgen in #839
- Update to use the new attribute setting for tf32 by @choijon5 in #835
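
The two autotuner settings added in this release, autotune_max_generations (#796) and autotune_config_overrides (#814), are most naturally wired up on the kernel decorator. Below is a minimal, hedged sketch: the setting names come from the PRs above, but passing them as @helion.kernel keyword settings and the particular override key shown are assumptions for illustration, not a confirmed API.

```python
import torch
import helion
import helion.language as hl

# Minimal sketch (assumed usage): autotune_max_generations and autotune_config_overrides
# are the setting names from #796/#814; the decorator-kwarg form and the "num_warps"
# override key are assumptions for illustration.
@helion.kernel(
    autotune_max_generations=5,                  # cap the autotuner at 5 generations
    autotune_config_overrides={"num_warps": 4},  # assumed: pin a config field during the search
)
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    for tile in hl.tile(out.size()):
        out[tile] = x[tile] + y[tile]
    return out
```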
Full Changelog: v0.1.6...v0.1.7
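
For the new out_dtype argument on hl.dot (#813), here is a minimal sketch of how it might be used inside a Helion matmul kernel; the argument name comes from the PR, while the exact signature and the surrounding kernel are assumptions.

```python
import torch
import helion
import helion.language as hl

# Minimal sketch (assumed signature): accumulate low-precision inputs into fp32
# via hl.dot's out_dtype argument from #813.
@helion.kernel(use_default_config=True)
def matmul_fp32_acc(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    m, k = x.size()
    _, n = y.size()
    out = torch.empty([m, n], dtype=torch.float32, device=x.device)
    for tile_m, tile_n in hl.tile([m, n]):
        acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)
        for tile_k in hl.tile(k):
            acc = acc + hl.dot(x[tile_m, tile_k], y[tile_k, tile_n], out_dtype=torch.float32)
        out[tile_m, tile_n] = acc
    return out
```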