
Wide range of metrics (RPM, e2el) compared to other solutions #95

@psydok

Description

I compared benchmarks of the same model using different tools: llmperf, genai-perf, and vllm/benchmarks, and I got different RPM results from each.

The spread, especially at 50 concurrent requests, is very large, and it does not seem to decrease after that, which seems strange.

The Qwen2.5-72B-AWQ model was running on my own server.

Genai-perf
Request Throughput (per sec): 2.83
RPM = 2.83 * 60 = ~170

LLMPerf
"results_num_completed_requests_per_min": 95.78302671905037

vllm/benchmarks
Request throughput (req/s): 2.30
RPM = 2.3 * 60 = ~138
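
To make explicit what I am comparing: my assumption is that every tool ultimately reports completed requests divided by some measurement window, and that the tools mainly differ in which window they use. A minimal sketch with hypothetical window lengths, just to show how sensitive RPM is to that choice:

    def rpm(num_completed: int, window_sec: float) -> float:
        """Requests per minute = completed requests over a measurement window."""
        return num_completed / window_sec * 60.0

    # Hypothetical windows for the same 600 completed requests:
    print(rpm(600, 212))  # ~170 rpm -- window covering only steady-state traffic
    print(rpm(600, 261))  # ~138 rpm -- slightly longer window
    print(rpm(600, 376))  # ~96 rpm  -- window that also includes ramp-up/drain time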

The sonnet dataset was used, as shipped with the tools. Parameters: input tokens = 300, output tokens = 200, stddev = 0, duration_sec = 60, MAX_NUM_COMPLETED_REQUESTS = 600.

# vllm, DATASET_NAME=sonnet
    python benchmark_serving.py \
        --backend openai-chat \
        --model "${MODEL}" \
        --host ${LLM_HOST} \
        --port ${LLM_PORT} \
        --endpoint /v1/chat/completions \
        --dataset-name ${DATASET_NAME} \
        --dataset-path ./sonnet.txt \
        --max-concurrency 50 \
        --save-result \
        --save-detailed \
        --result-dir "${OUTPUT_DIR}/${folder}" \
        --percentile-metrics ttft,tpot,itl,e2el \
        --metric-percentiles "50,90,95,99" \
        --${DATASET_NAME}-input-len $INPUT_SEQUENCE_LENGTH \
        --${DATASET_NAME}-output-len $OUTPUT_SEQUENCE_LENGTH \
        --num-prompts ${MAX_NUM_COMPLETED_REQUESTS} \
        --ignore-eos \
        --goodput e2el:${DURATION_MSEC}
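
My understanding of how benchmark_serving.py gets its throughput number (a paraphrase based on reading the script, not the actual code): completed requests divided by the wall-clock time of the whole run, with --max-concurrency only bounding how many requests are in flight.

    import time
    from typing import Callable

    def measure_request_throughput(send_all_prompts: Callable[[], list[dict]]) -> float:
        """Sketch: completed requests / wall-clock duration of the full run (req/s)."""
        start = time.perf_counter()
        results = send_all_prompts()             # all --num-prompts requests,
                                                 # at most --max-concurrency in flight
        duration = time.perf_counter() - start   # includes ramp-up and drain
        completed = sum(1 for r in results if r.get("success"))
        return completed / duration              # multiply by 60 for RPM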

# llmperf
    python token_benchmark_ray.py \
        --model "${MODEL}" \
        --mean-input-tokens ${INPUT_SEQUENCE_LENGTH} --stddev-input-tokens ${STDDEV} \
        --mean-output-tokens ${OUTPUT_SEQUENCE_LENGTH} --stddev-output-tokens ${STDDEV} \
        --max-num-completed-requests ${MAX_NUM_COMPLETED_REQUESTS} \
        --num-concurrent-requests 50 \
        --timeout ${DURATION_SEC} \
        --results-dir "${OUTPUT_DIR}/${folder}" \
        --llm-api openai \
        --additional-sampling-params '{"ignore_eos": true}'
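
For llmperf, my suspicion (an assumption, not a statement about its implementation) is that the run is bound by --timeout rather than by --max-num-completed-requests, and that the per-minute figure is diluted by start-up time and by draining the 50 in-flight requests at the end. Quick arithmetic on my configuration:

    # Assumption: llmperf stops launching new requests after --timeout seconds or
    # once --max-num-completed-requests finish, whichever comes first.
    timeout_s = 60
    target_completed = 600                 # MAX_NUM_COMPLETED_REQUESTS
    observed_req_per_sec = 2.3             # roughly what the server sustains

    completed_in_window = observed_req_per_sec * timeout_s
    print(completed_in_window)             # ~138 << 600, so the run is timeout-bound

    # With 50 requests still in flight when the timeout hits, the drain tail is a
    # large fraction of such a short window, which would pull the reported
    # completed-per-minute figure well below throughput * 60.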

# genai-perf, MAX_NUM_COMPLETED_REQUESTS=100
    genai-perf analyze --random-seed ${seed} \
        --service-kind openai --endpoint-type chat --streaming \
        --url ${llm_host} -m ${model} \
        --extra-inputs ignore_eos:true \
        --extra-inputs max_tokens:${output_sequence_length} \
        --extra-inputs min_tokens:${output_sequence_length} \
        --output-tokens-mean ${output_sequence_length} --output-tokens-stddev ${stddev} \
        --synthetic-input-tokens-mean ${input_sequence_length} --synthetic-input-tokens-stddev ${stddev} \
        -v --measurement-interval ${duration_msec} \
        --warmup-request-count 10 \
        --num-dataset-entries ${MAX_NUM_COMPLETED_REQUESTS} \
        --profile-export-file ${input_sequence_length}_${output_sequence_length}.json \
        --sweep-type concurrency --sweep-list 50,100
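
My assumption about why genai-perf reports the highest number: it runs --warmup-request-count requests first and then measures over the --measurement-interval window, so start-up and warmup traffic do not dilute the denominator. A sketch of that interpretation, not genai-perf's actual code:

    def steady_state_rpm(completed_in_interval: int, measurement_interval_ms: float) -> float:
        """Throughput over the steady-state measurement window only, scaled to RPM."""
        return completed_in_interval / (measurement_interval_ms / 1000.0) * 60.0

    # e.g. 170 requests completed inside a 60 000 ms measurement interval -> ~170 rpm
    print(steady_state_rpm(170, 60_000))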

Qwen3, without thinking mode (concurrency = 1, 3, 5, 8, 13, 21, 34, 55, 89, 144; MAX_NUM_COMPLETED_REQUESTS = 100):

[Image: benchmark results across the listed concurrency levels]

At the same time, the vLLM service counters show 135 requests per minute, while 143 requests were processed during the run. llmperf reports 35 RPM for the same run, genai-perf reports 102 RPM at concurrency 144, and vLLM shows 109 in Grafana. So genai-perf seems to give the most truthful values, but I still don't understand why: I compared the formulas and implementations, and it seems there should not be such differences.
Formula: rate(vllm:request_success_total[$__rate_interval]) * 60

[Image: Grafana panel using the formula above]
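
For reference, this is how I read that Grafana formula (an approximation of PromQL rate(): the per-second increase of the success counter over the lookback window, scaled to a per-minute value):

    def rpm_from_counter(count_start: float, count_end: float, window_sec: float) -> float:
        """Approximation of rate(counter[window]) * 60."""
        return (count_end - count_start) / window_sec * 60.0

    # Illustration with the numbers above, assuming a ~63 s lookback window:
    print(rpm_from_counter(0, 143, 63.5))   # ~135 req/min, matching the panel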

Can you tell me what could cause this?
How should I configure llmperf so that its results are at least roughly comparable to genai-perf's?
