CUDA: Accelerate MXFP4 table lookup using `__byte_perm` (#15451)

* CUDA: optimize get_int_from_table_16

* CUDA: use v_perm_b32 to replace byte_perm on AMD GPUs

* revise documentation

---------

Co-authored-by: xix <[email protected]>
Co-authored-by: Johannes Gäßler <[email protected]>