A reproducible benchmark harness for evaluating bounded, real-time inference behavior of small LLMs using llama.cpp. This is a configuration-only study; no llama.cpp source modifications are made.

The goal is to gather measurable, systems-level evidence that bounded LLM inference is achievable on edge devices by:
- Characterizing sources of latency variance in llama.cpp
- Measuring worst-case (p99, max) behavior across repeated runs
- Evaluating trade-offs across model size, quantization, and context length
- Comparing isolated vs. contended environments

```bash
# 1. Set up llama.cpp (pins to a specific commit)
./scripts/setup_llamacpp.sh

# 2. Download test models
./scripts/download_models.sh recommended

# 3. Run comprehensive experiments (all models x all sampling strategies)
./scripts/run_all_experiments.sh

# 4. Analyze and compare results
python scripts/analyze_results.py compare results/*_greedy/*/ results/*_temperature/*/
```

```
rt-llm-bench/
├── scripts/
│   ├── setup_llamacpp.sh         # Build llama.cpp at pinned version
│   ├── download_models.sh        # Download GGUF models
│   ├── benchmark.py              # Single benchmark run
│   ├── run_experiment.py         # Multi-run experiment orchestration
│   ├── run_all_experiments.sh    # Run all models x all sampling strategies
│   ├── run_all_models.sh         # Run all models with single config
│   ├── analyze_results.py        # Analysis and visualization
│   └── prepare_system.sh         # System configuration for isolated tests
├── configs/
│   ├── baseline_greedy.json      # Greedy sampling baseline
│   ├── sampling_temperature.json # Temperature-based sampling
│   ├── sampling_top_k.json       # Top-K sampling
│   ├── sampling_top_p.json       # Top-P (nucleus) sampling
│   ├── isolated_pinned.json      # CPU-pinned, memory-locked config
│   └── experiment_matrix.json    # Full experiment plan
├── results/      # Benchmark output (gitignored)
├── models/       # GGUF models (gitignored)
├── analysis/     # Generated reports and plots
├── docs/         # Provenance and documentation
└── llama.cpp/    # Cloned llama.cpp (gitignored)
```
The `run_all_experiments.sh` script runs the full matrix of experiments across all available models and sampling strategies:

```bash
# Run all models with all sampling strategies (default: 30 iterations each)
./scripts/run_all_experiments.sh

# Run with more iterations
./scripts/run_all_experiments.sh -i 50

# Run specific models only
./scripts/run_all_experiments.sh -m gemma3-270m,llama3.2-1b

# Run specific sampling strategies only
./scripts/run_all_experiments.sh -s greedy,temperature

# Run with isolated environment settings
./scripts/run_all_experiments.sh --pin-cpu --mlock

# Quick test run
./scripts/run_all_experiments.sh -m gemma3-270m -s greedy -i 5

# Dry run to see what would be executed
./scripts/run_all_experiments.sh --dry-run
```

Available options:
| Option | Description | Default |
|---|---|---|
| `-m, --models` | Comma-separated model list | all |
| `-s, --sampling` | Comma-separated sampling strategies | all |
| `-i, --iterations` | Runs per experiment | 30 |
| `-w, --warmup` | Warmup runs | 3 |
| `-t, --threads` | Thread count | 4 |
| `-n, --tokens` | Output tokens | 128 |
| `-c, --context` | Context length | 2048 |
| `-o, --output` | Output directory | `results/` |
| `--pin-cpu` | Enable CPU pinning | off |
| `--mlock` | Enable memory locking | off |
| `--dry-run` | Preview without executing | - |

All runs are automatically logged to `results/log_run_YYYYMMDD_HHMMSS.log`.
Available models:

| Model | Size | Description |
|---|---|---|
| `gemma3-270m` | 279MB | Smallest, edge-device focused |
| `gemma3-1b` | 1.0GB | Google's efficient 1B |
| `llama3.2-1b` | 1.3GB | Meta's efficient 1B |
| `deepseek-r1-1.5b` | 1.8GB | Reasoning-focused distilled model |
| `smollm2-1.7b` | ~1.8GB | Efficient 1.7B model |

Supported sampling strategies:

| Strategy | Flag | Parameters | Description |
|---|---|---|---|
| `greedy` | `--sampling greedy` | temp=0.0 | Deterministic, always picks the highest-probability token |
| `temperature` | `--sampling temperature` | temp=0.8 | Softmax temperature sampling |
| `top_k` | `--sampling top_k` | temp=0.8, k=40 | Sample from the top K tokens |
| `top_p` | `--sampling top_p` | temp=0.8, p=0.9 | Nucleus sampling |

All sampling strategies are tested to verify that execution-time variance is independent of the sampling method.
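
To make the Parameters column above concrete, here is a minimal sketch of the four selection rules over a toy logits vector. This is illustrative NumPy, not llama.cpp's sampler code:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits, temp):
    z = (logits - logits.max()) / temp
    p = np.exp(z)
    return p / p.sum()

def sample(logits, strategy, temp=0.8, k=40, p=0.9):
    if strategy == "greedy":      # temp=0.0: argmax, fully deterministic
        return int(np.argmax(logits))
    probs = softmax(logits, temp)
    if strategy == "top_k":       # keep only the k most probable tokens
        cutoff = np.sort(probs)[-k]
        probs = np.where(probs >= cutoff, probs, 0.0)
    elif strategy == "top_p":     # keep the smallest set with cumulative mass >= p
        order = np.argsort(probs)[::-1]
        csum = np.cumsum(probs[order])
        keep = order[: np.searchsorted(csum, p) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = probs[keep]
        probs = mask
    probs /= probs.sum()          # renormalize over the surviving tokens
    return int(rng.choice(len(probs), p=probs))

logits = rng.normal(size=100)     # toy 100-token vocabulary
for s in ["greedy", "temperature", "top_k", "top_p"]:
    print(s, sample(logits, s))
```

Each rule adds only a small, roughly constant amount of per-token work relative to the forward pass, which is what the variance measurements here set out to confirm.
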

```bash
# Run all models with a single configuration
./scripts/run_all_models.sh 30   # 30 iterations per model
```

Single benchmark run:

```bash
# -n: output tokens, -c: context length, -t: threads, --sampling: strategy
python scripts/benchmark.py \
  -m <model.gguf> \
  -p "Your prompt here" \
  -n 128 \
  -c 2048 \
  -t 4 \
  --sampling greedy
```

Multi-run experiment orchestration:

```bash
# -i: iterations, --warmup: warmup runs,
# --pin-cpu: CPU pinning (taskset), --mlock: memory locking
python scripts/run_experiment.py \
  -m <model.gguf> \
  -n experiment_name \
  -i 50 \
  --warmup 3 \
  --sampling greedy \
  --pin-cpu \
  --mlock
```

Running from a config file:

```bash
# Run from config file
python scripts/run_experiment.py -c configs/baseline_greedy.json
# Available configs
ls configs/
# baseline_greedy.json - Deterministic baseline
# sampling_temperature.json - Temperature sampling
# sampling_top_k.json - Top-K sampling
# sampling_top_p.json - Top-P (nucleus) sampling
# isolated_pinned.json - CPU-pinned with mlock
```

System preparation for isolated testing:

```bash
# Show current system state
./scripts/prepare_system.sh status
# Configure for isolated testing (requires root for full effect)
sudo ./scripts/prepare_system.sh isolate
# Run experiment with CPU pinning
./scripts/run_all_experiments.sh --pin-cpu --cpu-cores "0-3" --mlock
# Restore defaults
sudo ./scripts/prepare_system.sh restore
```

Generating background load for contended-environment tests:

```bash
# Show background load options
./scripts/prepare_system.sh contended
# Example: 50% CPU load
stress-ng --cpu 2 --cpu-load 50 &
./scripts/run_all_experiments.sh -m gemma3-270m -s greedy
pkill stress-ng
```

Analysis commands:

```bash
# Variance analysis for a single experiment
python scripts/analyze_results.py variance results/gemma3-270m_greedy/<timestamp>
# Compare multiple experiments (first is baseline)
python scripts/analyze_results.py compare \
results/gemma3-270m_greedy/*/ \
results/gemma3-270m_temperature/*/ \
results/gemma3-270m_top_k/*/
# Compare all models with greedy sampling
python scripts/analyze_results.py compare results/*_greedy/*/
# Compare sampling strategies for a single model
python scripts/analyze_results.py compare \
results/llama3.2-1b_greedy/*/ \
results/llama3.2-1b_temperature/*/ \
results/llama3.2-1b_top_k/*/ \
results/llama3.2-1b_top_p/*/
# Generate histogram
python scripts/analyze_results.py histogram results/gemma3-270m_greedy/*/ -o hist.png
# Run-by-run plot
python scripts/analyze_results.py runs results/gemma3-270m_greedy/*/ -o runs.png
# Markdown report
python scripts/analyze_results.py report results/gemma3-270m_greedy/*/ -o report.md
```

| Metric | Description | Bounded Goal |
|---|---|---|
| Per-token latency | Time to generate each token | Low variance (CV < 10%) |
| P99 latency | 99th percentile token latency | Stable across runs |
| Max latency | Worst-case token latency | Bounded, predictable |
| Peak RSS | Maximum resident set size | Minimal jitter |
| Total time | End-to-end inference time | Predictable |
- Coefficient of Variation (CV): stddev/mean as percentage. Lower = more deterministic.
- Max/Mean Ratio: How much worse is the worst case vs average?
- P99 Stability: Does p99 vary significantly run-to-run?
- RSS Jitter: Memory variance between identical runs.
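
As a concrete reference, a minimal sketch of computing these statistics from one run's per-token latencies (plain Python; the repo's `analyze_results.py` may compute them differently):

```python
import statistics

def variance_metrics(latencies_ms):
    """Summarize how deterministic a run was from its per-token latencies (ms)."""
    mean = statistics.mean(latencies_ms)
    stdev = statistics.stdev(latencies_ms)
    ordered = sorted(latencies_ms)
    p99 = ordered[min(len(ordered) - 1, int(0.99 * len(ordered)))]
    return {
        "cv_pct": 100.0 * stdev / mean,          # coefficient of variation
        "max_mean_ratio": max(latencies_ms) / mean,
        "p99_ms": p99,
        "max_ms": max(latencies_ms),
    }

# Toy example: ~7 ms tokens with one 30 ms outlier
print(variance_metrics([7.1, 6.9, 7.0, 7.2, 6.8, 30.0, 7.0, 7.1]))
```

P99 stability and RSS jitter are then simply the run-to-run CV of `p99_ms` and of peak RSS, respectively.
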
The full experiment plan supports these sweeps:
| Experiment Type | Variable | Values |
|---|---|---|
| Baseline | - | Default settings, 50 runs |
| Sampling | Strategy | greedy, temperature, top_k, top_p |
| Context | Length | 512, 1024, 2048, 4096 |
| Threads | Count | 1, 2, 4, 8 |
| Output | Tokens | 32, 64, 128, 256 |
| Environment | Isolation | default, pinned, mlock |
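
These sweeps map naturally onto a cartesian product. A hypothetical sketch of expanding one such sweep; the actual `experiment_matrix.json` schema may differ:

```python
from itertools import product

models = ["gemma3-270m", "gemma3-1b", "llama3.2-1b", "deepseek-r1-1.5b", "smollm2-1.7b"]
contexts = [512, 1024, 2048, 4096]

# One-variable-at-a-time sweeps keep the run count manageable;
# a full cross product over every column would explode combinatorially.
context_sweep = [
    {"model": m, "sampling": "greedy", "context": c}
    for m, c in product(models, contexts)
]
print(len(context_sweep), "context-sweep experiments")  # 5 models x 4 lengths = 20
```
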
Recommended workflow:

```bash
# 1. Set up the environment
./scripts/setup_llamacpp.sh
./scripts/download_models.sh recommended

# 2. Run baseline experiments (all models, greedy only)
./scripts/run_all_experiments.sh -s greedy -i 50

# 3. Run sampling comparison
./scripts/run_all_experiments.sh -s temperature,top_k,top_p -i 50

# 4. Run isolated environment experiments
sudo ./scripts/prepare_system.sh isolate
./scripts/run_all_experiments.sh -s greedy --pin-cpu --mlock -i 50
sudo ./scripts/prepare_system.sh restore

# 5. Analyze results
python scripts/analyze_results.py compare results/*_greedy/*/
python scripts/analyze_results.py compare results/gemma3-270m_*/*/
```

Requirements:

- Linux (tested on Ubuntu 22.04+)
- Python 3.8+
- CMake 3.14+
- GCC/G++ with C++17 support
- Optional: matplotlib (for visualization), psutil (for memory monitoring)

```bash
# Install Python dependencies
pip install matplotlib psutil   # optional but recommended
```

This study uses only external configuration, not llama.cpp source modifications:
| Control | Method |
|---|---|
| CPU frequency | `cpupower frequency-set -g performance` |
| Turbo boost | `/sys/devices/system/cpu/intel_pstate/no_turbo` |
| CPU pinning | `taskset -c 0-3` |
| Memory locking | `--mlock` flag |
| Swappiness | `/proc/sys/vm/swappiness` |
| Thread count | `-t` flag |
| Context pre-allocation | `-c` flag (sets max context) |

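For reference, a minimal sketch of reading these knobs directly. The paths are standard Linux sysfs/procfs locations; `prepare_system.sh status` presumably reports similar information:

```python
from pathlib import Path

# Standard Linux sysfs/procfs locations for the controls above.
KNOBS = {
    "governor": "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor",
    "no_turbo": "/sys/devices/system/cpu/intel_pstate/no_turbo",  # Intel pstate only
    "swappiness": "/proc/sys/vm/swappiness",
}

for name, path in KNOBS.items():
    p = Path(path)
    value = p.read_text().strip() if p.exists() else "n/a (not on this system)"
    print(f"{name:12s} {value}")
```
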
Phase 1 is successful if we can show:
- Repeatable worst-case bounds: p99 and max latency are stable across runs (CV < 15%)
- Memory stability: peak RSS varies minimally between runs (CV < 5%)
- Sampling independence: execution variance is consistent across sampling strategies
- Clear baseline metrics: documented variance for comparison with optimizations
Experiments are saved with the naming pattern `{model}_{sampling}/`. Examples:

- `results/gemma3-270m_greedy/` - Gemma 270M with greedy sampling
- `results/llama3.2-1b_temperature/` - Llama 3.2 1B with temperature sampling
- `results/deepseek-r1-1.5b_top_p/` - DeepSeek R1 1.5B with nucleus sampling
Each experiment directory contains:
- `config.json` - Exact configuration used
- `aggregated.json` - Summary statistics
- `all_results.json` - All run data
- `runs/run_*.json` - Individual run results
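
A hypothetical sketch of sweeping this results tree and checking the Phase 1 thresholds; the field names read from `aggregated.json` are assumptions here, not the repo's documented schema:

```python
import json
from pathlib import Path

# Thresholds from the Phase 1 success criteria above.
LATENCY_CV_LIMIT = 15.0  # percent
RSS_CV_LIMIT = 5.0       # percent

for agg_path in sorted(Path("results").glob("*/*/aggregated.json")):
    stats = json.loads(agg_path.read_text())
    # Hypothetical field names; adjust to the actual aggregated.json schema.
    lat_cv = stats.get("token_latency_cv_pct")
    rss_cv = stats.get("peak_rss_cv_pct")
    if lat_cv is None or rss_cv is None:
        continue
    ok = lat_cv < LATENCY_CV_LIMIT and rss_cv < RSS_CV_LIMIT
    exp = agg_path.parent.relative_to("results")
    print(f"{str(exp):40s} latency CV {lat_cv:5.1f}%  "
          f"RSS CV {rss_cv:4.1f}%  {'PASS' if ok else 'FAIL'}")
```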