vector implementation of vexp and vtanh functions for RISC-V-based processors #8740
Vector implementation of vexp and vtanh functions for RISC-V-based processors.
Code adapted from the rvvmf project: https://github.com/rvvpl/rvvmf
Benchmark results on a BananaPi board (8 cores at 1600 MHz) follow.
VExp:
Run on (8 X 1600 MHz CPU s)
CPU Caches:
L1 Instruction 32 KiB (x8)
L1 Data 32 KiB (x8)
L2 Unified 512 KiB (x2)
Load Average: 2.24, 2.15, 2.10
Benchmark Time CPU Iterations UserCounters...
vunary/xnn_f16_vexp_ukernel__rvvfp16arith_exp_u1v/N:7680/real_time 63481 ns 63184 ns 11062 bytes=483.922M/s cpufreq=1.6G elements=120.981M/s
vunary/xnn_f16_vexp_ukernel__rvvfp16arith_exp_u1v/N:65280/real_time 535412 ns 535376 ns 1302 bytes=487.699M/s cpufreq=1.6G elements=121.925M/s
vunary/xnn_f16_vexp_ukernel__rvvfp16arith_exp_u2v/N:7680/real_time 53729 ns 53726 ns 13009 bytes=571.758M/s cpufreq=1.6G elements=142.939M/s
vunary/xnn_f16_vexp_ukernel__rvvfp16arith_exp_u2v/N:65280/real_time 460626 ns 460613 ns 1516 bytes=566.88M/s cpufreq=1.6G elements=141.72M/s
vunary/xnn_f16_vexp_ukernel__rvvfp16arith_exp_u4v/N:7680/real_time 59941 ns 59936 ns 11675 bytes=512.507M/s cpufreq=1.6G elements=128.127M/s
vunary/xnn_f16_vexp_ukernel__rvvfp16arith_exp_u4v/N:65280/real_time 516931 ns 516882 ns 1347 bytes=505.135M/s cpufreq=1.6G elements=126.284M/s
vunary/xnn_f16_vexp_ukernel__rvvfp16arith_exp_u8v/N:7680/real_time 84517 ns 84511 ns 8286 bytes=363.478M/s cpufreq=1.6G elements=90.8695M/s
vunary/xnn_f16_vexp_ukernel__rvvfp16arith_exp_u8v/N:65280/real_time 726974 ns 726953 ns 961 bytes=359.187M/s cpufreq=1.6G elements=89.7969M/s
vunary/xnn_f16_vexp_ukernel__scalar_poly_3_u1/N:7680/real_time 3489759 ns 3489563 ns 201 bytes=8.8029M/s cpufreq=1.6G elements=2.20073M/s
vunary/xnn_f16_vexp_ukernel__scalar_poly_3_u1/N:65280/real_time 29667023 ns 29665262 ns 24 bytes=8.80169M/s cpufreq=1.6G elements=2.20042M/s
vunary/xnn_f16_vexp_ukernel__scalar_poly_3_u2/N:7680/real_time 3021576 ns 3021260 ns 232 bytes=10.1669M/s cpufreq=1.6G elements=2.54172M/s
vunary/xnn_f16_vexp_ukernel__scalar_poly_3_u2/N:65280/real_time 25684498 ns 25682623 ns 27 bytes=10.1664M/s cpufreq=1.6G elements=2.54161M/s
vunary/xnn_f16_vexp_ukernel__scalar_poly_3_u4/N:7680/real_time 3040489 ns 3040289 ns 230 bytes=10.1036M/s cpufreq=1.6G elements=2.52591M/s
vunary/xnn_f16_vexp_ukernel__scalar_poly_3_u4/N:65280/real_time 25848453 ns 25846605 ns 27 bytes=10.102M/s cpufreq=1.6G elements=2.52549M/s
vunary/xnn_f16_vexp_ukernel__scalar_poly_3_u8/N:7680/real_time 3086399 ns 3086193 ns 227 bytes=9.95335M/s cpufreq=1.6G elements=2.48834M/s
vunary/xnn_f16_vexp_ukernel__scalar_poly_3_u8/N:65280/real_time 26245078 ns 26243356 ns 27 bytes=9.94929M/s cpufreq=1.6G elements=2.48732M/s
vunary/xnn_f32_vexp_ukernel__scalar_rational_3_2_div_u1/N:3840/real_time 187811 ns 187799 ns 3725 bytes=163.569M/s cpufreq=1.6G elements=20.4461M/s
vunary/xnn_f32_vexp_ukernel__scalar_rational_3_2_div_u1/N:32640/real_time 1595105 ns 1595058 ns 438 bytes=163.701M/s cpufreq=1.6G elements=20.4626M/s
vunary/xnn_f32_vexp_ukernel__scalar_rational_3_2_div_u2/N:3840/real_time 114437 ns 114434 ns 6114 bytes=268.444M/s cpufreq=1.6G elements=33.5555M/s
vunary/xnn_f32_vexp_ukernel__scalar_rational_3_2_div_u2/N:32640/real_time 971701 ns 971643 ns 719 bytes=268.725M/s cpufreq=1.6G elements=33.5906M/s
vunary/xnn_f32_vexp_ukernel__scalar_rational_3_2_div_u4/N:3840/real_time 100005 ns 99998 ns 6998 bytes=307.185M/s cpufreq=1.6G elements=38.3981M/s
vunary/xnn_f32_vexp_ukernel__scalar_rational_3_2_div_u4/N:32640/real_time 850240 ns 850184 ns 821 bytes=307.113M/s cpufreq=1.6G elements=38.3891M/s
vunary/xnn_f32_vexp_ukernel__scalar_rational_3_2_div_u8/N:3840/real_time 100349 ns 100339 ns 6974 bytes=306.133M/s cpufreq=1.6G elements=38.2666M/s
vunary/xnn_f32_vexp_ukernel__scalar_rational_3_2_div_u8/N:32640/real_time 852013 ns 851986 ns 819 bytes=306.474M/s cpufreq=1.6G elements=38.3093M/s
vunary/xnn_f32_vexp_ukernel__rvv_exp_u1v/N:3840/real_time 56750 ns 56746 ns 12327 bytes=541.325M/s cpufreq=1.6G elements=67.6656M/s
vunary/xnn_f32_vexp_ukernel__rvv_exp_u1v/N:32640/real_time 483284 ns 483269 ns 1448 bytes=540.304M/s cpufreq=1.6G elements=67.538M/s
vunary/xnn_f32_vexp_ukernel__rvv_exp_u2v/N:3840/real_time 48406 ns 48403 ns 14460 bytes=634.636M/s cpufreq=1.6G elements=79.3296M/s
vunary/xnn_f32_vexp_ukernel__rvv_exp_u2v/N:32640/real_time 414598 ns 414569 ns 1685 bytes=629.815M/s cpufreq=1.6G elements=78.7269M/s
vunary/xnn_f32_vexp_ukernel__rvv_exp_u4v/N:3840/real_time 56758 ns 56755 ns 12310 bytes=541.241M/s cpufreq=1.6G elements=67.6551M/s
vunary/xnn_f32_vexp_ukernel__rvv_exp_u4v/N:32640/real_time 490841 ns 490808 ns 1427 bytes=531.985M/s cpufreq=1.6G elements=66.4981M/s
vunary/xnn_f32_vexp_ukernel__rvv_exp_u8v/N:3840/real_time 73428 ns 73423 ns 9539 bytes=418.371M/s cpufreq=1.6G elements=52.2964M/s
vunary/xnn_f32_vexp_ukernel__rvv_exp_u8v/N:32640/real_time 633034 ns 633017 ns 1102 bytes=412.49M/s cpufreq=1.6G elements=51.5612M/s
VTanh:
Run on (8 X 1600 MHz CPU s)
CPU Caches:
L1 Instruction 32 KiB (x8)
L1 Data 32 KiB (x8)
L2 Unified 512 KiB (x2)
Load Average: 2.13, 2.08, 2.07
Benchmark Time CPU Iterations UserCounters...
vunary/xnn_f16_vtanh_ukernel__rvvfp16arith_tanh_u1v/N:7680/real_time 68440 ns 68207 ns 10092 bytes=448.861M/s cpufreq=1.6G elements=112.215M/s
vunary/xnn_f16_vtanh_ukernel__rvvfp16arith_tanh_u1v/N:65280/real_time 591878 ns 588278 ns 1188 bytes=441.172M/s cpufreq=1.6G elements=110.293M/s
vunary/xnn_f16_vtanh_ukernel__rvvfp16arith_tanh_u2v/N:7680/real_time 59576 ns 59509 ns 11745 bytes=515.642M/s cpufreq=1.6G elements=128.911M/s
vunary/xnn_f16_vtanh_ukernel__rvvfp16arith_tanh_u2v/N:65280/real_time 509353 ns 508744 ns 1366 bytes=512.65M/s cpufreq=1.6G elements=128.163M/s
vunary/xnn_f16_vtanh_ukernel__rvvfp16arith_tanh_u4v/N:7680/real_time 63102 ns 63037 ns 11103 bytes=486.833M/s cpufreq=1.6G elements=121.708M/s
vunary/xnn_f16_vtanh_ukernel__rvvfp16arith_tanh_u4v/N:65280/real_time 571406 ns 570772 ns 1227 bytes=456.978M/s cpufreq=1.6G elements=114.244M/s
vunary/xnn_f16_vtanh_ukernel__rvvfp16arith_tanh_u8v/N:7680/real_time 64972 ns 64897 ns 10752 bytes=472.817M/s cpufreq=1.6G elements=118.204M/s
vunary/xnn_f16_vtanh_ukernel__rvvfp16arith_tanh_u8v/N:65280/real_time 595918 ns 595370 ns 1178 bytes=438.181M/s cpufreq=1.6G elements=109.545M/s
vunary/xnn_f32_vtanh_ukernel__scalar_rational_9_8_div_u1/N:3840/real_time 137177 ns 137023 ns 5080 bytes=223.944M/s cpufreq=1.6G elements=27.993M/s
vunary/xnn_f32_vtanh_ukernel__scalar_rational_9_8_div_u1/N:32640/real_time 1176321 ns 1174959 ns 595 bytes=221.98M/s cpufreq=1.6G elements=27.7475M/s
vunary/xnn_f32_vtanh_ukernel__scalar_rational_9_8_div_u2/N:3840/real_time 90912 ns 90797 ns 7697 bytes=337.911M/s cpufreq=1.6G elements=42.2389M/s
vunary/xnn_f32_vtanh_ukernel__scalar_rational_9_8_div_u2/N:32640/real_time 780256 ns 779331 ns 893 bytes=334.659M/s cpufreq=1.6G elements=41.8324M/s
vunary/xnn_f32_vtanh_ukernel__scalar_rational_9_8_div_u4/N:3840/real_time 91824 ns 91742 ns 7578 bytes=334.553M/s cpufreq=1.6G elements=41.8191M/s
vunary/xnn_f32_vtanh_ukernel__scalar_rational_9_8_div_u4/N:32640/real_time 791701 ns 790382 ns 882 bytes=329.822M/s cpufreq=1.6G elements=41.2277M/s
vunary/xnn_f32_vtanh_ukernel__scalar_rational_9_8_div_u8/N:3840/real_time 94924 ns 94817 ns 7357 bytes=323.628M/s cpufreq=1.6G elements=40.4535M/s
vunary/xnn_f32_vtanh_ukernel__scalar_rational_9_8_div_u8/N:32640/real_time 815168 ns 814081 ns 855 bytes=320.326M/s cpufreq=1.6G elements=40.0408M/s
vunary/xnn_f32_vtanh_ukernel__rvv_tanh_u1v/N:3840/real_time 48223 ns 48167 ns 14522 bytes=637.047M/s cpufreq=1.6G elements=79.6309M/s
vunary/xnn_f32_vtanh_ukernel__rvv_tanh_u1v/N:32640/real_time 420415 ns 419844 ns 1664 bytes=621.101M/s cpufreq=1.6G elements=77.6376M/s
vunary/xnn_f32_vtanh_ukernel__rvv_tanh_u2v/N:3840/real_time 41039 ns 40999 ns 17064 bytes=748.549M/s cpufreq=1.6G elements=93.5686M/s
vunary/xnn_f32_vtanh_ukernel__rvv_tanh_u2v/N:32640/real_time 352869 ns 352441 ns 1982 bytes=739.992M/s cpufreq=1.6G elements=92.499M/s
vunary/xnn_f32_vtanh_ukernel__rvv_tanh_u4v/N:3840/real_time 43893 ns 43842 ns 15945 bytes=699.887M/s cpufreq=1.6G elements=87.4858M/s
vunary/xnn_f32_vtanh_ukernel__rvv_tanh_u4v/N:32640/real_time 419134 ns 418607 ns 1641 bytes=622.998M/s cpufreq=1.6G elements=77.8748M/s
vunary/xnn_f32_vtanh_ukernel__rvv_tanh_u8v/N:3840/real_time 45865 ns 45809 ns 15263 bytes=669.791M/s cpufreq=1.6G elements=83.7238M/s
vunary/xnn_f32_vtanh_ukernel__rvv_tanh_u8v/N:32640/real_time 433013 ns 432531 ns 1610 bytes=603.03M/s cpufreq=1.6G elements=75.3788M/s
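For reading the numbers above, here is a minimal sketch of how the `elements` counter and a kernel-to-kernel speedup can be recomputed from the `N` and `real_time` columns. The helper names `elements_per_second` and `speedup` are hypothetical and not part of XNNPACK or Google Benchmark. It assumes `N` is the element count per iteration, which matches the reported `elements` values.

```python
def elements_per_second(n_elements: int, real_time_ns: float) -> float:
    """Throughput in elements/s from the N and real_time columns."""
    return n_elements / (real_time_ns * 1e-9)


def speedup(baseline_ns: float, candidate_ns: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_ns / candidate_ns


# f32 vexp, rvv_exp_u1v, N:3840, real_time 56750 ns (from the VExp table)
rvv_u1v = elements_per_second(3840, 56750)

# f32 vexp, N:32640: scalar_rational_3_2_div_u4 (850240 ns)
# versus rvv_exp_u2v (414598 ns)
vexp_f32_speedup = speedup(850240, 414598)

print(f"rvv_exp_u1v throughput: {rvv_u1v / 1e6:.2f} M elements/s")
print(f"f32 vexp speedup (rvv u2v vs scalar u4): {vexp_f32_speedup:.2f}x")
```

The same two helpers apply to the VTanh rows; only the `real_time` values change.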