
AIPerf


Architecture | Design Proposals | Migrating from Genai-Perf | CLI Options

AIPerf is a comprehensive benchmarking tool that measures the performance of generative AI models served by your preferred inference solution. It provides detailed metrics in a live command-line display as well as extensive exportable benchmark performance reports.

AIPerf provides multiprocess support out of the box, making it a single, scalable benchmarking solution.

[Screenshot: AIPerf UI Dashboard]

Tutorials & Advanced Features

Getting Started

  • Basic Tutorial - Learn the fundamentals with Dynamo and vLLM examples

Advanced Benchmarking Features

Feature | Description | Use Cases
Random Number Generation & Reproducibility | Deterministic dataset generation with --random-seed | Debugging, regression testing, controlled experiments
Request Cancellation | Test timeout behavior and service resilience | SLA validation, cancellation modeling
Trace Benchmarking | Deterministic workload replay with custom datasets | Regression testing, A/B testing
Fixed Schedule | Precise timestamp-based request execution | Traffic replay, temporal analysis, burst testing
Time-based Benchmarking | Duration-based testing with grace period control | Stability testing, sustained performance
Timeslice Metrics | Split the benchmark into timeslices and calculate metrics for each one | Load pattern impact, detecting warm-up effects, performance degradation analysis
Sequence Distributions | Mixed ISL/OSL pairings | Benchmarking mixed use cases
Goodput | Throughput of requests meeting user-defined SLOs | SLO validation, capacity planning, runtime/model comparisons
Request Rate with Max Concurrency | Dual control of request timing and a concurrent-connection ceiling (Poisson or constant modes) | Testing API rate/concurrency limits, avoiding thundering herd, realistic client simulation
GPU Telemetry | Real-time GPU metrics collection via DCGM (power, utilization, memory, temperature, etc.) | Performance optimization, resource monitoring, multi-node telemetry
Template Endpoint | Benchmark custom APIs with flexible Jinja2 request templates | Custom API formats, rapid prototyping, non-standard endpoints
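For example, a reproducible run pins the synthetic dataset with --random-seed. This is a minimal sketch that combines only flags shown elsewhere in this README:

# Reproducible benchmark: the same seed regenerates the same synthetic dataset
aiperf profile \
  --model Qwen/Qwen3-0.6B \
  --url http://localhost:8000 \
  --endpoint-type chat \
  --request-count 100 \
  --random-seed 42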

Working with Benchmark Data

  • Profile Exports - Parse and analyze profile_export.jsonl with Pydantic models, custom metrics, and async processing
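Before writing a full parser, it can help to inspect the export's schema directly. This generic snippet assumes only that jq is installed and that profile_export.jsonl is in the current directory; the exact fields depend on your AIPerf version:

# Pretty-print the first record to see which fields your AIPerf version emits
head -n 1 profile_export.jsonl | jq '.'

# Count exported records
wc -l profile_export.jsonl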

Quick Navigation

# Basic profiling
aiperf profile --model Qwen/Qwen3-0.6B --url localhost:8000 --endpoint-type chat

# Request timeout testing
aiperf profile --request-timeout-seconds 30.0 [other options...]

# Trace-based benchmarking
aiperf profile --input-file trace.jsonl --custom-dataset-type single_turn [other options...]

# Fixed schedule execution
aiperf profile --input-file schedule.jsonl --fixed-schedule --fixed-schedule-auto-offset [other options...]

# Time-based benchmarking
aiperf profile --benchmark-duration 300.0 --benchmark-grace-period 30.0 [other options...]

Supported APIs

  • OpenAI chat completions
  • OpenAI completions
  • OpenAI embeddings
  • OpenAI audio: request throughput and latency
  • OpenAI images: request throughput and latency
  • NIM rankings
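Every example in this README uses --endpoint-type chat; the other APIs are selected the same way, but the exact endpoint-type values may vary by version, so list them from your installation:

# Show the supported endpoint types and other CLI options
aiperf profile --help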

Installation

pip install aiperf
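After installing, a quick sanity check confirms the CLI is on your PATH:

# Print the CLI help to verify the installation
aiperf --help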

Quick Start

Basic Usage

Run a simple benchmark against a model:

aiperf profile \
  --model your_model_name \
  --url http://localhost:8000 \
  --endpoint-type chat \
  --streaming

Example with Custom Configuration

aiperf profile \
  --model Qwen/Qwen3-0.6B \
  --url http://localhost:8000 \
  --endpoint-type chat \
  --concurrency 10 \
  --request-count 100 \
  --streaming

Example output:

NVIDIA AIPerf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃                               Metric ┃       avg ┃    min ┃    max ┃    p99 ┃    p90 ┃    p75 ┃   std ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│             Time to First Token (ms) │     18.26 │  11.22 │ 106.32 │  68.82 │  27.76 │  16.62 │ 12.07 │
│            Time to Second Token (ms) │     11.40 │   0.02 │  85.91 │  34.54 │  12.59 │  11.65 │  7.01 │
│                 Request Latency (ms) │    487.30 │ 267.07 │ 769.57 │ 715.99 │ 580.83 │ 536.17 │ 79.60 │
│             Inter Token Latency (ms) │     11.23 │   8.80 │  13.17 │  12.48 │  11.73 │  11.37 │  0.45 │
│     Output Token Throughput Per User │     89.23 │  75.93 │ 113.60 │ 102.28 │  90.91 │  90.29 │  3.70 │
│                    (tokens/sec/user) │           │        │        │        │        │        │       │
│      Output Sequence Length (tokens) │     42.83 │  24.00 │  65.00 │  64.00 │  52.00 │  47.00 │  7.21 │
│       Input Sequence Length (tokens) │     10.00 │  10.00 │  10.00 │  10.00 │  10.00 │  10.00 │  0.00 │
│ Output Token Throughput (tokens/sec) │ 10,944.03 │    N/A │    N/A │    N/A │    N/A │    N/A │   N/A │
│    Request Throughput (requests/sec) │    255.54 │    N/A │    N/A │    N/A │    N/A │    N/A │   N/A │
│             Request Count (requests) │    711.00 │    N/A │    N/A │    N/A │    N/A │    N/A │   N/A │
└──────────────────────────────────────┴───────────┴────────┴────────┴────────┴────────┴────────┴───────┘
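As a sanity check on these numbers, the aggregate output token throughput is approximately the request throughput times the mean output sequence length: 255.54 requests/sec × 42.83 tokens/request ≈ 10,945 tokens/sec, which matches the reported 10,944.03.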

Known Issues

  • Output sequence length constraints (--output-tokens-mean) cannot be guaranteed unless you pass ignore_eos and/or min_tokens via --extra-inputs to an inference server that supports them (see the sketch after this list).
  • Very high concurrency settings (typically >15,000 concurrency) may lead to port exhaustion on some systems, causing connection failures during benchmarking. If encountered, consider adjusting system limits or reducing concurrency.
  • Startup errors caused by invalid configuration settings can cause AIPerf to hang indefinitely. If AIPerf appears to freeze during initialization, terminate the process and check configuration settings.
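For the first issue, the workaround looks roughly like the sketch below. The key:value syntax for --extra-inputs follows Genai-Perf conventions and is an assumption here; confirm the exact form with aiperf profile --help, and note that ignore_eos and min_tokens must be supported by your inference server:

# Hedged sketch: force the server to generate the requested output length
# (ignore_eos / min_tokens support and the key:value syntax are assumptions)
aiperf profile \
  --model Qwen/Qwen3-0.6B \
  --url http://localhost:8000 \
  --endpoint-type chat \
  --output-tokens-mean 128 \
  --extra-inputs ignore_eos:true \
  --extra-inputs min_tokens:128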
