Performance benchmarks

This guide keeps benchmark expectations explicit and reproducible so future changes can be measured consistently.

Measurement environment

Capture these details for every benchmark run:

Host CPU model and core count
RAM
OS and kernel
rustc version
Provider and exact command used
Model used (for local/sherpa-onnx) or deployment/model (for APIs)
Model format (GGML or ONNX) for local providers
Input file duration and codec/container

Benchmarks to run

1. Local model inference throughput

Run on representative files (e.g. 1, 5, 10 minutes).

# whisper.cpp (GGML)
time transcribeit run -i <input_file> -m base -f text -o ./output
time transcribeit run -i <input_file> -m small -f text -o ./output
time transcribeit run -i <input_file> -m small.en -f text -o ./output

# sherpa-onnx Whisper (ONNX) — auto-segments at 30s
time transcribeit run -p sherpa-onnx -i <input_file> -m base -f text -o ./output
time transcribeit run -p sherpa-onnx -i <input_file> -m small.en -f text -o ./output

# sherpa-onnx Moonshine
time transcribeit run -p sherpa-onnx -i <input_file> -m moonshine-base -f text -o ./output

# sherpa-onnx SenseVoice
time transcribeit run -p sherpa-onnx -i <input_file> -m sense-voice -f text -o ./output

Record:

wall clock duration
output length/time ratio (e.g. 600s audio in 180s)
CPU utilization profile (optional)

2. Provider overhead comparison (same input)

time transcribeit run -p local -i <input_file> -m base -f text -o ./output
time transcribeit run -p sherpa-onnx -i <input_file> -m base -f text -o ./output
time transcribeit run -p openai -i <input_file> -f text -o ./output
time transcribeit run -p azure -i <input_file> -f text -o ./output

Record:

wall clock duration
retry counts (if any)
segment count
average segment latency (from logs)

3. Segmentation impact

# FFmpeg silencedetect segmentation
time transcribeit run -p openai -i <long_file> --segment --segment-concurrency 2 -f text -o ./output
time transcribeit run -p openai -i <long_file> --segment --segment-concurrency 1 --max-segment-secs 300 -f text -o ./output

# sherpa-onnx with FFmpeg silencedetect (default, always segments at 30s max)
time transcribeit run -p sherpa-onnx -i <long_file> -m base -f text -o ./output

# sherpa-onnx with VAD-based segmentation
time transcribeit run -p sherpa-onnx -i <long_file> -m base --vad-model /path/to/silero_vad.onnx -f text -o ./output

Record:

total segment count
max queue wait
request-level retry counts
segmentation method used (VAD vs silencedetect)
transcript quality at segment boundaries (check for mid-word cuts)

4. I/O + conversion overhead

time transcribeit run -i <video_file> -m base --normalize -f text -o ./output
time transcribeit run -i <audio_file> -m base --normalize -f text -o ./output

Record:

conversion wall time (video vs audio)
post-conversion processing time ratio

Suggested result format

Model/Provider: base
Input: meeting_01.wav (300s)
Machine: MacBook Pro M2
Elapsed: 92s
Realtime factor: 3.26x
Segments: 1
Retries: 0
Output size: 4.6 MB

Keep rows in a simple table (date + commit hash + environment + results) in your preferred tracker so regressions are easy to catch.

Reference benchmark results

These results were measured on a 5-minute medical interview recording.

Engine / Model	Wall clock	Realtime factor	Notes
Local whisper.cpp `base`	3.6s	83x RT	Best speed/quality trade-off
SenseVoice 2024	6.6s	46x RT	Good quality, 50+ languages
Sherpa-ONNX Whisper `base`	10.9s	27x RT
Moonshine `base`	14.1s	21x RT
Local whisper.cpp `large-v3-turbo`	33.7s	8.9x RT	Highest transcription quality
Sherpa-ONNX Whisper `turbo`	47.2s	6.4x RT

Notes:

Local whisper.cpp (GGML) is consistently the fastest engine for a given model size.
SenseVoice 2024 offers excellent speed with good quality. Avoid the SenseVoice 2025 model -- it is a regression in quality.
Moonshine provides a compact alternative but is slower than Whisper at the same size tier.
For highest quality where speed is not critical, use large-v3-turbo with local whisper.cpp.

VAD vs FFmpeg silencedetect segmentation

VAD-based segmentation (Silero VAD via --vad-model) and FFmpeg silencedetect produce different segment boundaries. Key differences to observe when benchmarking:

Segment boundary quality: VAD detects speech regions directly, so segment boundaries align with actual speech. FFmpeg silencedetect splits at silence midpoints, which can cut mid-word if silence gaps are short or thresholds are mistuned.
Segment count: VAD typically produces more segments (one per speech region after merging) while silencedetect produces fewer, longer segments based on silence gaps.
Processing overhead: VAD runs on the audio samples in-memory (fast, no subprocess). FFmpeg silencedetect runs as a subprocess and requires parsing its stderr output.
Transcript quality: VAD-segmented transcripts tend to have fewer artifacts at segment boundaries because chunks start and end at speech boundaries with 250ms padding, rather than at arbitrary silence midpoints.

When comparing, use the same audio file and model to isolate the effect of the segmentation method on overall transcript quality and timing.

CI/automatable baseline

For now, treat these as manual benchmarks in a fixed environment. If you want to automate later:

add a dedicated criterion benchmark target,
pin fixture files,
and fail CI only on large regressions with generous tolerances.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance benchmarks

Measurement environment

Benchmarks to run

1. Local model inference throughput

2. Provider overhead comparison (same input)

3. Segmentation impact

4. I/O + conversion overhead

Suggested result format

Reference benchmark results

VAD vs FFmpeg silencedetect segmentation

CI/automatable baseline

FilesExpand file tree

performance-benchmarks.md

Latest commit

History

performance-benchmarks.md

File metadata and controls

Performance benchmarks

Measurement environment

Benchmarks to run

1. Local model inference throughput

2. Provider overhead comparison (same input)

3. Segmentation impact

4. I/O + conversion overhead

Suggested result format

Reference benchmark results

VAD vs FFmpeg silencedetect segmentation

CI/automatable baseline