kozinove

Vector implementations of the vexp and vtanh functions for RISC-V-based processors.
The code is adapted from the rvvmf project: https://github.com/rvvpl/rvvmf

Test results on BananaPi processors.
VExp:
Run on (8 X 1600 MHz CPU s)
CPU Caches:
L1 Instruction 32 KiB (x8)
L1 Data 32 KiB (x8)
L2 Unified 512 KiB (x2)
Load Average: 2.24, 2.15, 2.10

Benchmark Time CPU Iterations UserCounters...

vunary/xnn_f16_vexp_ukernel__rvvfp16arith_exp_u1v/N:7680/real_time 63481 ns 63184 ns 11062 bytes=483.922M/s cpufreq=1.6G elements=120.981M/s
vunary/xnn_f16_vexp_ukernel__rvvfp16arith_exp_u1v/N:65280/real_time 535412 ns 535376 ns 1302 bytes=487.699M/s cpufreq=1.6G elements=121.925M/s
vunary/xnn_f16_vexp_ukernel__rvvfp16arith_exp_u2v/N:7680/real_time 53729 ns 53726 ns 13009 bytes=571.758M/s cpufreq=1.6G elements=142.939M/s
vunary/xnn_f16_vexp_ukernel__rvvfp16arith_exp_u2v/N:65280/real_time 460626 ns 460613 ns 1516 bytes=566.88M/s cpufreq=1.6G elements=141.72M/s
vunary/xnn_f16_vexp_ukernel__rvvfp16arith_exp_u4v/N:7680/real_time 59941 ns 59936 ns 11675 bytes=512.507M/s cpufreq=1.6G elements=128.127M/s
vunary/xnn_f16_vexp_ukernel__rvvfp16arith_exp_u4v/N:65280/real_time 516931 ns 516882 ns 1347 bytes=505.135M/s cpufreq=1.6G elements=126.284M/s
vunary/xnn_f16_vexp_ukernel__rvvfp16arith_exp_u8v/N:7680/real_time 84517 ns 84511 ns 8286 bytes=363.478M/s cpufreq=1.6G elements=90.8695M/s
vunary/xnn_f16_vexp_ukernel__rvvfp16arith_exp_u8v/N:65280/real_time 726974 ns 726953 ns 961 bytes=359.187M/s cpufreq=1.6G elements=89.7969M/s
vunary/xnn_f16_vexp_ukernel__scalar_poly_3_u1/N:7680/real_time 3489759 ns 3489563 ns 201 bytes=8.8029M/s cpufreq=1.6G elements=2.20073M/s
vunary/xnn_f16_vexp_ukernel__scalar_poly_3_u1/N:65280/real_time 29667023 ns 29665262 ns 24 bytes=8.80169M/s cpufreq=1.6G elements=2.20042M/s
vunary/xnn_f16_vexp_ukernel__scalar_poly_3_u2/N:7680/real_time 3021576 ns 3021260 ns 232 bytes=10.1669M/s cpufreq=1.6G elements=2.54172M/s
vunary/xnn_f16_vexp_ukernel__scalar_poly_3_u2/N:65280/real_time 25684498 ns 25682623 ns 27 bytes=10.1664M/s cpufreq=1.6G elements=2.54161M/s
vunary/xnn_f16_vexp_ukernel__scalar_poly_3_u4/N:7680/real_time 3040489 ns 3040289 ns 230 bytes=10.1036M/s cpufreq=1.6G elements=2.52591M/s
vunary/xnn_f16_vexp_ukernel__scalar_poly_3_u4/N:65280/real_time 25848453 ns 25846605 ns 27 bytes=10.102M/s cpufreq=1.6G elements=2.52549M/s
vunary/xnn_f16_vexp_ukernel__scalar_poly_3_u8/N:7680/real_time 3086399 ns 3086193 ns 227 bytes=9.95335M/s cpufreq=1.6G elements=2.48834M/s
vunary/xnn_f16_vexp_ukernel__scalar_poly_3_u8/N:65280/real_time 26245078 ns 26243356 ns 27 bytes=9.94929M/s cpufreq=1.6G elements=2.48732M/s
vunary/xnn_f32_vexp_ukernel__scalar_rational_3_2_div_u1/N:3840/real_time 187811 ns 187799 ns 3725 bytes=163.569M/s cpufreq=1.6G elements=20.4461M/s
vunary/xnn_f32_vexp_ukernel__scalar_rational_3_2_div_u1/N:32640/real_time 1595105 ns 1595058 ns 438 bytes=163.701M/s cpufreq=1.6G elements=20.4626M/s
vunary/xnn_f32_vexp_ukernel__scalar_rational_3_2_div_u2/N:3840/real_time 114437 ns 114434 ns 6114 bytes=268.444M/s cpufreq=1.6G elements=33.5555M/s
vunary/xnn_f32_vexp_ukernel__scalar_rational_3_2_div_u2/N:32640/real_time 971701 ns 971643 ns 719 bytes=268.725M/s cpufreq=1.6G elements=33.5906M/s
vunary/xnn_f32_vexp_ukernel__scalar_rational_3_2_div_u4/N:3840/real_time 100005 ns 99998 ns 6998 bytes=307.185M/s cpufreq=1.6G elements=38.3981M/s
vunary/xnn_f32_vexp_ukernel__scalar_rational_3_2_div_u4/N:32640/real_time 850240 ns 850184 ns 821 bytes=307.113M/s cpufreq=1.6G elements=38.3891M/s
vunary/xnn_f32_vexp_ukernel__scalar_rational_3_2_div_u8/N:3840/real_time 100349 ns 100339 ns 6974 bytes=306.133M/s cpufreq=1.6G elements=38.2666M/s
vunary/xnn_f32_vexp_ukernel__scalar_rational_3_2_div_u8/N:32640/real_time 852013 ns 851986 ns 819 bytes=306.474M/s cpufreq=1.6G elements=38.3093M/s
vunary/xnn_f32_vexp_ukernel__rvv_exp_u1v/N:3840/real_time 56750 ns 56746 ns 12327 bytes=541.325M/s cpufreq=1.6G elements=67.6656M/s
vunary/xnn_f32_vexp_ukernel__rvv_exp_u1v/N:32640/real_time 483284 ns 483269 ns 1448 bytes=540.304M/s cpufreq=1.6G elements=67.538M/s
vunary/xnn_f32_vexp_ukernel__rvv_exp_u2v/N:3840/real_time 48406 ns 48403 ns 14460 bytes=634.636M/s cpufreq=1.6G elements=79.3296M/s
vunary/xnn_f32_vexp_ukernel__rvv_exp_u2v/N:32640/real_time 414598 ns 414569 ns 1685 bytes=629.815M/s cpufreq=1.6G elements=78.7269M/s
vunary/xnn_f32_vexp_ukernel__rvv_exp_u4v/N:3840/real_time 56758 ns 56755 ns 12310 bytes=541.241M/s cpufreq=1.6G elements=67.6551M/s
vunary/xnn_f32_vexp_ukernel__rvv_exp_u4v/N:32640/real_time 490841 ns 490808 ns 1427 bytes=531.985M/s cpufreq=1.6G elements=66.4981M/s
vunary/xnn_f32_vexp_ukernel__rvv_exp_u8v/N:3840/real_time 73428 ns 73423 ns 9539 bytes=418.371M/s cpufreq=1.6G elements=52.2964M/s
vunary/xnn_f32_vexp_ukernel__rvv_exp_u8v/N:32640/real_time 633034 ns 633017 ns 1102 bytes=412.49M/s cpufreq=1.6G elements=51.5612M/s

VTanh:
Run on (8 X 1600 MHz CPU s)
CPU Caches:
L1 Instruction 32 KiB (x8)
L1 Data 32 KiB (x8)
L2 Unified 512 KiB (x2)
Load Average: 2.13, 2.08, 2.07

Benchmark Time CPU Iterations UserCounters...

vunary/xnn_f16_vtanh_ukernel__rvvfp16arith_tanh_u1v/N:7680/real_time 68440 ns 68207 ns 10092 bytes=448.861M/s cpufreq=1.6G elements=112.215M/s
vunary/xnn_f16_vtanh_ukernel__rvvfp16arith_tanh_u1v/N:65280/real_time 591878 ns 588278 ns 1188 bytes=441.172M/s cpufreq=1.6G elements=110.293M/s
vunary/xnn_f16_vtanh_ukernel__rvvfp16arith_tanh_u2v/N:7680/real_time 59576 ns 59509 ns 11745 bytes=515.642M/s cpufreq=1.6G elements=128.911M/s
vunary/xnn_f16_vtanh_ukernel__rvvfp16arith_tanh_u2v/N:65280/real_time 509353 ns 508744 ns 1366 bytes=512.65M/s cpufreq=1.6G elements=128.163M/s
vunary/xnn_f16_vtanh_ukernel__rvvfp16arith_tanh_u4v/N:7680/real_time 63102 ns 63037 ns 11103 bytes=486.833M/s cpufreq=1.6G elements=121.708M/s
vunary/xnn_f16_vtanh_ukernel__rvvfp16arith_tanh_u4v/N:65280/real_time 571406 ns 570772 ns 1227 bytes=456.978M/s cpufreq=1.6G elements=114.244M/s
vunary/xnn_f16_vtanh_ukernel__rvvfp16arith_tanh_u8v/N:7680/real_time 64972 ns 64897 ns 10752 bytes=472.817M/s cpufreq=1.6G elements=118.204M/s
vunary/xnn_f16_vtanh_ukernel__rvvfp16arith_tanh_u8v/N:65280/real_time 595918 ns 595370 ns 1178 bytes=438.181M/s cpufreq=1.6G elements=109.545M/s
vunary/xnn_f32_vtanh_ukernel__scalar_rational_9_8_div_u1/N:3840/real_time 137177 ns 137023 ns 5080 bytes=223.944M/s cpufreq=1.6G elements=27.993M/s
vunary/xnn_f32_vtanh_ukernel__scalar_rational_9_8_div_u1/N:32640/real_time 1176321 ns 1174959 ns 595 bytes=221.98M/s cpufreq=1.6G elements=27.7475M/s
vunary/xnn_f32_vtanh_ukernel__scalar_rational_9_8_div_u2/N:3840/real_time 90912 ns 90797 ns 7697 bytes=337.911M/s cpufreq=1.6G elements=42.2389M/s
vunary/xnn_f32_vtanh_ukernel__scalar_rational_9_8_div_u2/N:32640/real_time 780256 ns 779331 ns 893 bytes=334.659M/s cpufreq=1.6G elements=41.8324M/s
vunary/xnn_f32_vtanh_ukernel__scalar_rational_9_8_div_u4/N:3840/real_time 91824 ns 91742 ns 7578 bytes=334.553M/s cpufreq=1.6G elements=41.8191M/s
vunary/xnn_f32_vtanh_ukernel__scalar_rational_9_8_div_u4/N:32640/real_time 791701 ns 790382 ns 882 bytes=329.822M/s cpufreq=1.6G elements=41.2277M/s
vunary/xnn_f32_vtanh_ukernel__scalar_rational_9_8_div_u8/N:3840/real_time 94924 ns 94817 ns 7357 bytes=323.628M/s cpufreq=1.6G elements=40.4535M/s
vunary/xnn_f32_vtanh_ukernel__scalar_rational_9_8_div_u8/N:32640/real_time 815168 ns 814081 ns 855 bytes=320.326M/s cpufreq=1.6G elements=40.0408M/s
vunary/xnn_f32_vtanh_ukernel__rvv_tanh_u1v/N:3840/real_time 48223 ns 48167 ns 14522 bytes=637.047M/s cpufreq=1.6G elements=79.6309M/s
vunary/xnn_f32_vtanh_ukernel__rvv_tanh_u1v/N:32640/real_time 420415 ns 419844 ns 1664 bytes=621.101M/s cpufreq=1.6G elements=77.6376M/s
vunary/xnn_f32_vtanh_ukernel__rvv_tanh_u2v/N:3840/real_time 41039 ns 40999 ns 17064 bytes=748.549M/s cpufreq=1.6G elements=93.5686M/s
vunary/xnn_f32_vtanh_ukernel__rvv_tanh_u2v/N:32640/real_time 352869 ns 352441 ns 1982 bytes=739.992M/s cpufreq=1.6G elements=92.499M/s
vunary/xnn_f32_vtanh_ukernel__rvv_tanh_u4v/N:3840/real_time 43893 ns 43842 ns 15945 bytes=699.887M/s cpufreq=1.6G elements=87.4858M/s
vunary/xnn_f32_vtanh_ukernel__rvv_tanh_u4v/N:32640/real_time 419134 ns 418607 ns 1641 bytes=622.998M/s cpufreq=1.6G elements=77.8748M/s
vunary/xnn_f32_vtanh_ukernel__rvv_tanh_u8v/N:3840/real_time 45865 ns 45809 ns 15263 bytes=669.791M/s cpufreq=1.6G elements=83.7238M/s
vunary/xnn_f32_vtanh_ukernel__rvv_tanh_u8v/N:32640/real_time 433013 ns 432531 ns 1610 bytes=603.03M/s cpufreq=1.6G elements=75.3788M/s


google-cla bot commented Jul 28, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

#if XNN_ENABLE_RISCV_FP16_VECTOR
set_arch_flag(xnn_arch_riscv_vector_fp16_arith, true);
#else
/* There is no HWCAP for fp16 so disable by default */
Contributor

On OSes that expose /proc/cpuinfo, fp16 support is detectable.
For example, see libyuv:
https://chromium.googlesource.com/libyuv/libyuv/+/refs/heads/main/source/cpu_id.cc#335

However, the emulator I use, which is supposed to model the SiFive X280 (which has fp16), does not support fp16.
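
A minimal sketch of the /proc/cpuinfo approach mentioned above, assuming a Linux RISC-V system whose cpuinfo has an `isa` line listing multi-letter extensions separated by underscores (for example `rv64imafdcv_zicsr_zvfh`). This is neither the libyuv code nor the code in this PR; the function names `isa_has_extension` and `cpuinfo_reports_extension` are hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

// Hypothetical helper: returns true if `isa` lists the multi-letter
// extension `ext` as a complete underscore-delimited token, so that
// "zvfhmin" is not mistaken for "zvfh".
static bool isa_has_extension(const char* isa, const char* ext) {
  const size_t ext_len = strlen(ext);
  if (ext_len == 0) return false;
  const char* p = strchr(isa, '_');
  while (p != NULL && *p == '_') {
    p++;  // Skip the '_' separator.
    const size_t tok_len = strcspn(p, "_ \t\r\n");
    if (tok_len == ext_len && strncmp(p, ext, ext_len) == 0) return true;
    p += tok_len;  // Now at the next '_', at whitespace, or at the end.
  }
  return false;
}

// Hypothetical helper: scans a cpuinfo-style file for a line starting
// with "isa" and checks the text after the colon. Returns false if the
// file cannot be opened (assumption: treat "unknown" as "not supported").
static bool cpuinfo_reports_extension(const char* path, const char* ext) {
  FILE* f = fopen(path, "r");
  if (f == NULL) return false;
  char line[1024];
  bool found = false;
  while (!found && fgets(line, sizeof(line), f) != NULL) {
    if (strncmp(line, "isa", 3) == 0) {
      const char* colon = strchr(line, ':');
      if (colon != NULL) found = isa_has_extension(colon + 1, ext);
    }
  }
  fclose(f);
  return found;
}
```

A caller would use something like `cpuinfo_reports_extension("/proc/cpuinfo", "zvfh")`. As the comment above notes, an emulator may report or omit the extension independently of the core it is meant to model, so the result only reflects what the running environment advertises.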

Author

The standard way to detect FP16:
https://github.com/google/XNNPACK/blob/master/src/xnnpack/math.h#L504-L506

When hardware-config.c is built, the __riscv_zvfh macro is not defined.
Apparently, this file is compiled separately from the microkernels, so adding
https://github.com/google/XNNPACK/pull/8740/files#diff-1e7de1ae2d059d21e1dd75d5812d5a34b0222cef273b7c3a2af62eb747f9d20aR990-R992
is not enough.
Where should I change the build configuration?
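
A small sketch of why a compile-time check can differ between files. `__riscv_zvfh` is predefined by the compiler only for translation units built with Zvfh in their `-march` string, so a file compiled without that flag takes the `#else` branch even when the microkernels were built with it. The helper names `built_with_zvfh` and `use_fp16_kernels` are hypothetical and are not XNNPACK APIs.

```c
#include <stdbool.h>

// Hypothetical helper: reports whether THIS translation unit was compiled
// with Zvfh enabled. The answer depends on the flags used for this file,
// not on the flags used for other files in the same library.
static bool built_with_zvfh(void) {
#if defined(__riscv_zvfh)
  return true;
#else
  return false;
#endif
}

// Hypothetical helper: combines the compile-time answer with a runtime
// probe result supplied by the caller (for example, from /proc/cpuinfo).
static bool use_fp16_kernels(bool runtime_has_zvfh) {
  return built_with_zvfh() && runtime_has_zvfh;
}
```

If the configuration file is built without the extension flag, one option under these assumptions is to rely on a runtime probe in that file and keep the compile-time guard only around the kernels themselves.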

Author

Commit 3bd1297 adds a dynamic ISA check for the presence of FP16.

kozinove requested a review from fbarchard on August 11, 2025, 13:23
2 participants