
v0.1.7

@oulgen oulgen released this 08 Oct 19:16
· 64 commits to main since this release
269deb3

What's Changed

  • Generalize the CUDA-biased test cases by replacing the hardcoded "cuda" literal with the DEVICE variable by @EikanWang in #775
  • Make progress bar prettier by @oulgen in #786
  • Upgrade ruff==0.13.3 pyright==1.1.406 by @jansel in #790
  • Add hl.split and hl.join by @jansel in #791
  • Generalize test_print and test_tensor_descriptor to support different accelerators by @EikanWang in #801
  • Limit rebench to 1000 iterations by @jansel in #789
  • Turn down autotuner defaults by @jansel in #788
  • Enable torch.xpu._XpuDeviceProperties in Helion kernel by @EikanWang in #798
  • Better error message for augmented assignment (e.g. +=) on host tensor without subscript by @yf225 in #807
  • Add Pattern Search autotuning algorithm to docs by @choijon5 in #810
  • Support 0dim tensor in output code printing by @oulgen in #806
  • Set range_num_stages <= 1 if using tensor_descriptor, to avoid CUDA misaligned address error by @yf225 in #792
  • Add hl.inline_triton API by @jansel in #811
  • Add out_dtype arg to hl.dot by @jansel in #813
  • Add autotune_config_overrides by @jansel in #814
  • Reduce initial_population to 100 by @jansel in #800
  • Disable range_num_stages for kernels with aliasing by @jansel in #812
  • Add new setting, autotune_max_generations, allowing users to set the maximum number of generations for autotuning by @choijon5 in #796
  • Increase tolerance for test_matmul_reshape_m_2 by @jansel in #816
  • Update docs by @jansel in #815
  • Fix torch version check by @adam-smnk in #818
  • [Benchmark] Keep going when a single benchmark fails by @oulgen in #820
  • Faster Helion JSD by @PaulZhang12 in #733
  • Faster KL Div by @PaulZhang12 in #822
  • Normalize device name and decorate cuda-only test cases by @EikanWang in #819
  • Improved log messages for autotuning by @choijon5 in #817
  • Apply simplification to range indexing in order to reuse block size symbols by @yf225 in #809
  • Fix hl.rand to use tile-specific offsets instead of fixed offsets, ensuring unique random numbers per tile by @karthickai in #685
  • Match cuda versions for benchmark by @oulgen in #828
  • Print nvidia-smi/rocminfo by @oulgen in #827
  • Dump nvidia-smi/rocminfo on benchmarks by @oulgen in #829
  • Add 3.14 support by @oulgen in #830
  • Remove py312 vanilla test by @oulgen in #831
  • Pad to next power of 2 for hl.specialize'ed shape value used in device tensor creation by @yf225 in #804
  • Autotune eviction policy by @oulgen in #823
  • [Docs] Consistent pre-commit/lint by @oulgen in #836
  • [Docs] Recommend venv instead of conda by @oulgen in #837
  • [Docs] Helion works on 3.10 through 3.14 by @oulgen in #838
  • [Docs] Add eviction policy by @oulgen in #839
  • Update to use the new attribute setting for tf32 by @choijon5 in #835

Full Changelog: v0.1.6...v0.1.7