Port the Neon-specific optimization from https://github.com/triton-lang/triton-cpu/pull/56 to AVX512. On ARM, we observed a very impressive speedup from that change.