uttampawar commented Jun 12, 2025

This patch specifically addresses an issue observed at compression level 9; it has no adverse impact at other levels.
The following are throughput numbers (MiB processed per second) for various inputs on a recent Xeon server.

Runtime 10 seconds
Compression level 9

Input                            Input (B)  Compressed (B)  Default (MiB/s)  Opt (MiB/s)  Opt/Default
x                                        1               5             0.27         3.62        13.41
xyzzy                                    5               9             0.61         1.65         2.70
xyzzy.compressed                         9              13             0.64         2.58         4.03
64x                                     64              10             8.73        27.13         3.11
alice29.txt                        152,089          51,054            11.36        27.73         2.44
alice29.txt.compressed              50,096          50,100             7.13         6.91         0.97
asyoulik.txt                       125,179          46,694             9.85        27.16         2.76
asyoulik.txt.compressed             45,687          45,691             6.68         6.61         0.99
backward65536                       65,792              19          2359.99      4371.97         1.85
bb.binast                       12,356,697       5,412,654             5.89         5.89         1.00
compressed_file                     50,096          50,100             7.13         7.01         0.98
compressed_file.compressed          50,100          50,104             7.13         7.09         0.99
compressed_repeated                144,224          50,443            15.90       168.83        10.62
compressed_repeated.compressed      50,299          50,303             6.90         6.73         0.98
cp1251-utf16le                       1,554             660             1.32        38.11        28.87
empty.compressed.17                 65,538              17          2794.64      4558.76         1.63
empty.compressed.18                196,610              22          4568.71      6190.96         1.36
lcet10.txt                         426,754         127,437            17.17        26.17         1.52
lcet10.txt.compressed              124,719         124,724            14.65        14.37         0.98
mapsdatazrh                        285,886         166,978            16.09        30.18         1.88
mapsdatazrh.compressed             161,743         161,748            18.12        18.00         0.99
monkey                                 843             423             1.50        32.68        21.79
plrabn12.txt                       481,861         177,362            14.75        20.08         1.36
plrabn12.txt.compressed            174,771         174,776            19.93        19.78         0.99
quickfox_repeated                  176,128              51          2202.14      5649.04         2.57
random_chunks                        2,704           1,906             2.25        44.59        19.82
random_org_10k.bin                  10,000          10,004             3.25       125.47        38.61
random_org_10k.bin.compressed       10,004          10,008             3.20       119.01        37.19
ukkonooa                               119              71             1.48        13.60         9.19
index.html (from Cloudflare)        29,329           7,476             4.76        45.24         9.50

Average gain: 7.50x
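
The "Average gain" figure appears to be the arithmetic mean of the 30 Opt/Default ratios listed above. A minimal sketch to check that arithmetic follows; the names `kRatios`, `speedup`, and `mean` are hypothetical and not part of brotli or the bench tool.

```c
#include <assert.h>
#include <stddef.h>

/* Opt/Default ratios copied from the table, in row order (30 inputs). */
const double kRatios[] = {
    13.41, 2.70, 4.03,  3.11,  2.44, 0.97,  2.76,  0.99,  1.85, 1.00,
    0.98,  0.99, 10.62, 0.98,  28.87, 1.63, 1.36,  1.52,  0.98, 1.88,
    0.99,  21.79, 1.36, 0.99,  2.57, 19.82, 38.61, 37.19, 9.19, 9.50};
const size_t kNumRatios = sizeof(kRatios) / sizeof(kRatios[0]);

/* Ratio of optimized to default throughput for a single input. */
double speedup(double default_mibs, double opt_mibs) {
  return opt_mibs / default_mibs;
}

/* Arithmetic mean of n values; returns 0.0 for an empty array. */
double mean(const double *v, size_t n) {
  assert(v != NULL || n == 0);
  double sum = 0.0;
  for (size_t i = 0; i < n; ++i) sum += v[i];
  return n == 0 ? 0.0 : sum / (double)n;
}
```

Calling `mean(kRatios, kNumRatios)` gives roughly 7.50, matching the reported average, and `speedup(0.27, 3.62)` reproduces the 13.41 shown for input `x`.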

Background

  • The "perf" profile showed that most of the time was spent in the kernel.

Children Self Shared Object Command

70.32%    70.32%  [unknown]
29.55%    29.55%  libbrotlienc.so.1.1.0
 0.12%     0.12%  libc.so.6
 0.01%     0.01%  bench
 0.00%     0.00%  ld-linux-x86-64.so.2

The detailed stack trace shows the following:

Children Self Command Shared Object Symbol

99.91%    29.39%  libbrotlienc.so.1.1.0
        |
        |--75.54%--CreateBackwardReferencesNH5
        |          |
        |          |--54.11%--asm_exc_page_fault
        |          |          |
        |          |           --46.60%--exc_page_fault
        |          |                     |
        |          |                     |--45.08%--do_user_addr_fault
        |          |                     |          |
        |          |                     |          |--20.26%--handle_mm_fault
        |          |                     |          |
        |          |                     |           --0.84%--lock_vma_under_rcu
        |          |                     |
        |          |                      --0.59%--irqentry_exit
        |          |
        |          |--7.61%--sync_regs
        |          |
        |          |--0.87%--asm_sysvec_apic_timer_interrupt
        |          |          |
        |          |           --0.87%--sysvec_apic_timer_interrupt
        |          |                     |
        |          |                      --0.70%--__sysvec_apic_timer_interrupt
        |          |
        |           --0.84%--error_entry
        |
        |--12.19%--0
        |          |
        |          |--9.45%--CreateBackwardReferencesNH5
        |          |          |
        |          |           --6.98%--asm_exc_page_fault
        |          |
        |           --2.68%--0x7a3250ffe010
        |                     CreateBackwardReferencesNH5
        |
        |--9.54%--BrotliEncoderDestroyInstance
        |          |
        |           --9.53%--__munmap
        |                     entry_SYSCALL_64_after_hwframe
        |
         --0.54%--0x400
                   |
                    --0.54%--CreateBackwardReferencesNH5

This clearly indicated that most cycles were being spent handling page faults. Collecting "perf stat" showed the following:
$ perf stat -- ./bench -q 9 -c 1 index.html
Tested file index.html; size: 29329
Threads: 1, alg: brotli, quality 9
Total times compressed: 1716; compressed size: 7476
Compression speed:4.80 MiB

Performance counter stats for './bench -q 9 -c 1 index.html':

     10,005.57 msec task-clock                       #    1.000 CPUs utilized
            21      context-switches                 #    2.099 /sec
             0      cpu-migrations                   #    0.000 /sec
     9,060,486      page-faults                      #  905.544 K/sec <-------- 9 million page-faults
38,884,787,373      cycles                           #    3.886 GHz
64,575,947,425      instructions                     #    1.66  insn per cycle
11,559,990,014      branches                         #    1.155 G/sec
    70,260,354      branch-misses                    #    0.61% of all branches

  10.007755894 seconds time elapsed

   2.164726000 seconds user
   7.841751000 seconds sys
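
A similar counter can be read from inside a process without perf. The following is a minimal sketch, assuming a POSIX system where `getrusage` reports minor faults in `ru_minflt`; the helper names `minor_faults_now` and `faults_for_alloc_loop` are hypothetical and are not part of brotli or the bench tool. Whether a given allocation is actually served by mmap depends on the allocator and its thresholds, so the result will vary between systems.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <unistd.h>

/* Minor page faults taken by this process so far, or -1 on error. */
long minor_faults_now(void) {
  struct rusage ru;
  if (getrusage(RUSAGE_SELF, &ru) != 0) return -1;
  return ru.ru_minflt;
}

/* Hypothetical helper: allocate, touch, and free `size` bytes `iters` times.
 * Returns the minor faults taken during the loop, or -1 on error. */
long faults_for_alloc_loop(size_t size, int iters) {
  assert(size > 0 && iters > 0);
  long page = sysconf(_SC_PAGESIZE);
  if (page <= 0) page = 4096; /* fallback page size; an assumption */
  long before = minor_faults_now();
  for (int i = 0; i < iters; ++i) {
    /* volatile keeps the compiler from eliding the allocation and writes */
    volatile unsigned char *buf = malloc(size);
    if (buf == NULL) return -1;
    for (size_t off = 0; off < size; off += (size_t)page) buf[off] = 1;
    free((void *)buf);
  }
  long after = minor_faults_now();
  if (before < 0 || after < 0) return -1;
  return after - before;
}
```

If every iteration is backed by a fresh mapping, the returned count grows roughly with `iters`; if the allocator reuses the same memory, it stays close to the cost of the first iteration.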

With the suggested change, page faults dropped from 9,060,486 to 24,359, which raised compression speed from 4.80 to 45.06 MiB:

Tested file index.html; size: 29329
Threads: 1, alg: brotli, quality 9
Total times compressed: 16109; compressed size: 7476
Compression speed:45.06 MiB

Performance counter stats for './bench -q 9 -c 1 index.html':

     10,022.94 msec task-clock                       #    1.000 CPUs utilized
            39      context-switches                 #    3.891 /sec
             2      cpu-migrations                   #    0.200 /sec
        24,359      page-faults                      #    2.430 K/sec <-------- reduced page-faults
38,898,170,252      cycles                           #    3.881 GHz
111,210,331,663      instructions                     #    2.86  insn per cycle
22,419,526,620      branches                         #    2.237 G/sec
   528,617,633      branch-misses                    #    2.36% of all branches

  10.027335163 seconds time elapsed

   9.987097000 seconds user
   0.039175000 seconds sys

The majority of cycles are now spent in the application rather than in the kernel managing memory (mmap/munmap).

Children Self Shared Object

98.48%    98.48%  libbrotlienc.so.1.1.0
 0.80%     0.80%  [unknown]
 0.63%     0.63%  libc.so.6
 0.04%     0.04%  bench
 0.04%     0.04%  libm.so.6
 0.01%     0.01%  libbrotlicommon.so.1.1.0
 0.01%     0.01%  [vdso]
 0.00%     0.00%  ld-linux-x86-64.so.2
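
The patch itself is not shown in this conversation; the truncated title only mentions adjusting an MMAP_THRESHOLD value. As a minimal sketch, assuming the change relies on glibc's `mallopt(M_MMAP_THRESHOLD, ...)`, a guarded call could look like the following. The helper name `raise_mmap_threshold` and the `__GLIBC__` guard are assumptions for illustration, not the author's code.

```c
#include <limits.h>
#include <stdlib.h> /* on glibc, including this also defines __GLIBC__ */

#if defined(__GLIBC__)
#include <malloc.h> /* mallopt, M_MMAP_THRESHOLD */
#endif

/* Hypothetical helper, not a brotli API.
 * Asks glibc to serve allocations smaller than `bytes` from the heap
 * instead of mmap. Returns 1 if the setting was applied, 0 otherwise. */
int raise_mmap_threshold(size_t bytes) {
#if defined(__GLIBC__) && defined(M_MMAP_THRESHOLD)
  if (bytes > (size_t)INT_MAX) return 0; /* mallopt takes an int value */
  /* mallopt returns 1 on success and 0 on error, for example when the
   * value is above glibc's documented upper limit for this parameter. */
  return mallopt(M_MMAP_THRESHOLD, (int)bytes) != 0;
#else
  (void)bytes; /* no equivalent knob assumed on other C libraries */
  return 0;
#endif
}
```

Note that `mallopt` changes allocator behaviour for the whole process, so a library calling it also affects allocations made by the host application.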

Environment:

OS: Ubuntu 24.04.2 LTS
Kernel: 6.8.0-58-generic
GCC: gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Glibc: ldd (Ubuntu GLIBC 2.39-0ubuntu8.4) 2.39

…MMAP_THRESHOLD value.


google-cla bot commented Jun 12, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

uttampawar changed the title from "Improve throughput performance at compression level 9 by adjusting M_…" to "Improve throughput performance at compression level 9" on Jun 12, 2025
@eustas
Collaborator

eustas commented Jun 13, 2025

Wow! Nice investigation.

@uttampawar
Author

Thanks @eustas. Regarding the failing tests, should I put those changes under an "#if linux" macro so that all of them pass?

@eustas
Collaborator

eustas commented Jun 13, 2025

I'm still thinking about how to make this:

  • portable
  • non-surprising (since the adjustment seems to be process-wide)
  • discoverable (so that users know this option is available)

Let's continue with this on Monday.
