Leverage BF16 instructions for AVX512 as well

Basically, port this Neon-specific optimization to AVX512: https://github.com/triton-lang/triton-cpu/pull/56

On ARM, we observed a very impressive speedup with that change.
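
For reference, here is a rough scalar model of what a BF16 dot-product instruction computes. It assumes the AVX512 target would be the AVX512_BF16 `VDPBF16PS` instruction (intrinsic `_mm512_dpbf16_ps`), where each fp32 accumulator lane adds the products of two adjacent bf16 pairs. That choice is an assumption and is not confirmed by the linked PR. The helper names (`f32_to_bf16_bits`, `bf16_bits_to_f32`, `dpbf16_ps`) are hypothetical, and the model ignores hardware details such as denormal flushing, NaN handling, and intermediate rounding.

```python
import struct


def f32_to_bf16_bits(x: float) -> int:
    """Round a float to bfloat16 (round-to-nearest-even); returns the 16-bit pattern.

    Sketch only: NaN inputs are not handled specially.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add 0x7FFF plus the lowest kept bit so that ties round to even.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_bits_to_f32(b: int) -> float:
    """Widen a bfloat16 bit pattern to float by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]


def dpbf16_ps(acc, a, b):
    """Scalar model of a 2-way bf16 dot product accumulated into fp32 lanes.

    acc: list of n floats (accumulator lanes)
    a, b: lists of 2*n bf16 bit patterns (two bf16 values per lane)
    """
    assert len(a) == len(b) == 2 * len(acc)
    out = []
    for i, c in enumerate(acc):
        lo = bf16_bits_to_f32(a[2 * i]) * bf16_bits_to_f32(b[2 * i])
        hi = bf16_bits_to_f32(a[2 * i + 1]) * bf16_bits_to_f32(b[2 * i + 1])
        out.append(c + lo + hi)
    return out


if __name__ == "__main__":
    a = [f32_to_bf16_bits(v) for v in (1.0, 2.0, 3.0, 4.0)]
    b = [f32_to_bf16_bits(v) for v in (0.5, 0.5, 2.0, 1.0)]
    print(dpbf16_ps([10.0, 0.0], a, b))
```

A model like this could serve as a reference when checking the output of a vectorized lowering on small inputs. The actual port would happen in the compiler's lowering passes rather than in Python.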