
Throughput and CPU usage of Quiche code #2284

@Milind-Blaze

Hi all! I am interested in designing a low-latency transport protocol on top of QUIC for XR, and I want to build on Quiche for this. I therefore evaluated the throughput and CPU usage of Quiche. It would be great to receive some feedback from the Quiche community on whether
(a) the results obtained are what one would expect for Quiche, and
(b) the methodology makes sense: have I missed any optimizations? does my evaluation setup introduce unexpected behaviour? and so forth.

Here is my setup:

  1. Codebase: I used a modified version of the codebase from this commit. The Rust code is compiled in release mode and the C code is compiled with the -I. -Wall -pedantic -O3 -DNDEBUG flags.

  2. Network: The CPU and throughput measurements are performed over localhost, with tc setting the rate and buffer size as below:

sudo tc qdisc add dev lo root handle 1: tbf rate 100Mbit burst 2mbit latency 40ms

Rates from the set {10, 20, 50, 100, 200, 350, 500} Mbps are tested, along with a run without tc. Only the rate parameter is varied; burst and latency remain the same. There is no propagation delay in this setup.

  3. Server: Written using the C API. The server wakes up, stores 2 GB in memory, receives a client request, and then sends the requested amount from the 2 GB, keeping a reference to the unsent remainder so the response can be completed later if it can't all be sent at once (see the send-loop sketch after this list). It uses HTTP/0.9 to minimize any parsing overhead.

  4. Client: Written using the C API. The client wakes up, requests data of size 100 MB or 1 GB, receives it, and then closes.
    (a) It records timestamps before sending the request (before quiche_conn_stream_send) and after receiving the full response (after the final quiche_conn_stream_recv). The difference between the two timestamps is treated as the RTT (see the timing sketch after this list).
    (b) The throughput is calculated as (requested size)/RTT.

  5. CPU usage measurement:
    (a) The client and server are pinned to separate CPU cores using taskset -c. The client starts 10 s after the server.
    (b) The CPU usage of the client and server PIDs is monitored using top. It produces a new sample every second, i.e. it is run as top -b -d 1 -p $process_pid, and reports CPU usage as the percentage of one core used over the last second.
    (c) The client and server CPU usage patterns are plotted against time. Note that the server graphs have a roughly 10 s window at the start where CPU usage is almost 0%, as the client hasn't started yet.
    (d) This is repeated for 3 runs. As each experiment takes roughly the same amount of time, the plotted graph is the mean CPU usage of the client or server at each second across the three trials.

  6. Throughput measurement:
    (a) Throughput is measured as described in (4b).
    (b) The values are obtained and averaged across the three runs of (5).
    (c) iperf3 measurements are plotted for comparison.
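To make items 3 and 4 concrete, here are minimal sketches, not my exact code. The struct and buffer names are hypothetical; the quiche calls use the five-argument form of quiche_conn_stream_send (newer headers also add an error-code out-parameter to the stream calls, so adjust to your version).

First, the server-side send loop: quiche_conn_stream_send may accept fewer bytes than offered when flow-control or congestion windows are full, so the unsent remainder is kept for the next writable event.

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/types.h>

#include <quiche.h>

/* Hypothetical per-stream state: the part of the response that has
 * not yet been accepted by quiche. buf points into the 2 GB blob. */
struct partial_resp {
    const uint8_t *buf;
    size_t left;
    uint64_t stream_id;
};

static void flush_response(quiche_conn *conn, struct partial_resp *r)
{
    while (r->left > 0) {
        ssize_t n = quiche_conn_stream_send(conn, r->stream_id,
                                            r->buf, r->left, false);
        if (n <= 0) {
            /* QUICHE_ERR_DONE or no capacity: retry on the next
             * writable event for this stream. */
            return;
        }
        r->buf  += (size_t)n;
        r->left -= (size_t)n;
    }

    /* All payload accepted: signal end-of-stream with an empty write. */
    quiche_conn_stream_send(conn, r->stream_id, (const uint8_t *)"", 0, true);
}
```

Second, the client-side timing from item 4, assuming POSIX clock_gettime with a monotonic clock:

```c
#include <stdint.h>
#include <time.h>

/* Seconds elapsed between two monotonic timestamps. */
static double elapsed_s(struct timespec a, struct timespec b)
{
    return (double)(b.tv_sec - a.tv_sec) +
           (double)(b.tv_nsec - a.tv_nsec) / 1e9;
}

/* In the event loop (pseudostructure):
 *
 *   struct timespec t0, t1;
 *   clock_gettime(CLOCK_MONOTONIC, &t0);
 *   quiche_conn_stream_send(conn, 0, req, req_len, true);
 *   ...run the loop, draining quiche_conn_stream_recv until fin...
 *   clock_gettime(CLOCK_MONOTONIC, &t1);
 *
 *   double rtt_s = elapsed_s(t0, t1);           // "RTT" as defined above
 *   double mbps  = requested_bytes * 8.0 / rtt_s / 1e6;
 */
```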

CPU usage results

[Image: client and server CPU usage vs. time]

Some questions:

(a) Are these expected numbers? Have I missed any optimizations?
(b) Is there any way to improve these CPU usage numbers?

I repeated the experiment with the master branch of the code and obtained similar results, as seen below:

[Image: client and server CPU usage vs. time, master branch]

Throughput results

[Image: measured throughput vs. tc rate limit, compared with iperf3]

(a) Why does the throughput not measure up to iperf3? Does tc, as I have configured it, induce odd behaviour?
(b) Have I missed some configuration of congestion control, pacing, etc. that might be causing this? (See the configuration sketch below for the knobs I have in mind.)
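For (b), this is the shape of the configuration I would expect to matter. A sketch, assuming the setter names from the quiche C header at a recent commit; the window sizes and stream count below are illustrative values, not recommendations:

```c
#include <quiche.h>

static quiche_config *make_config(void)
{
    quiche_config *config = quiche_config_new(QUICHE_PROTOCOL_VERSION);
    if (config == NULL) {
        return NULL;
    }

    /* HTTP/0.9 ALPN, as in the quiche examples. */
    quiche_config_set_application_protos(config,
                                         (uint8_t *)"\x08http/0.9", 9);

    /* Flow-control windows. If these stay at example-sized defaults
     * (on the order of 1 MB), the windows rather than the link can cap
     * throughput once the path's bandwidth-delay product exceeds them. */
    quiche_config_set_initial_max_data(config, 100 * 1024 * 1024);
    quiche_config_set_initial_max_stream_data_bidi_local(config, 100 * 1024 * 1024);
    quiche_config_set_initial_max_stream_data_bidi_remote(config, 100 * 1024 * 1024);
    quiche_config_set_initial_max_streams_bidi(config, 100);

    /* Congestion control: CUBIC is the default; Reno and BBR are also
     * selectable via the quiche_cc_algorithm enum. */
    quiche_config_set_cc_algorithm(config, QUICHE_CC_CUBIC);

    return config;
}
```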

When I ran throughput measurements on a modified Mahimahi emulator (https://github.com/ravinet/mahimahi), I found that Quiche was able to achieve the full link rate and match iperf3.

[Image: throughput results on the Mahimahi emulator]

Any feedback, thoughts or comments would be very helpful! Thank you!
