Merged
29 commits
6217bc0
feat: over-saturation stopping
AlonKellner-RedHat Nov 17, 2025
80b3808
test: over-saturation stopping
AlonKellner-RedHat Nov 17, 2025
5940a8d
test: comprehensive over-saturation detection
AlonKellner-RedHat Nov 17, 2025
1cc313f
test(e2e): enable over-saturation test
AlonKellner-RedHat Nov 17, 2025
f61db60
refactor: split constraints into modular package structure
AlonKellner-RedHat Nov 17, 2025
96455eb
fix: add missing stop_over_saturated field to BenchmarkGenerativeText…
AlonKellner-RedHat Nov 17, 2025
7b24ed1
fix: resolve type checking errors in over_saturation constraint
AlonKellner-RedHat Nov 17, 2025
c2bd25f
fix: remove type=bool from click flag option
AlonKellner-RedHat Nov 17, 2025
5ffe315
fix: update test imports and fix linting issues
AlonKellner-RedHat Nov 18, 2025
e1b8977
fix: over-saturation test coverage mdformat
AlonKellner-RedHat Nov 18, 2025
d1ba423
fix: mark review comments
AlonKellner-RedHat Nov 20, 2025
2fa358b
fix: mark review further comments
AlonKellner-RedHat Nov 20, 2025
16568d5
fix: CI errors
AlonKellner-RedHat Nov 20, 2025
e3a9556
fix: E2E tests
AlonKellner-RedHat Nov 20, 2025
cee4417
fix: E2E tests
AlonKellner-RedHat Nov 20, 2025
7d16fb1
fix: E2E tests
AlonKellner-RedHat Nov 20, 2025
be1af98
feat: macos llm-d simulator dockerfile
AlonKellner-RedHat Nov 20, 2025
9ed67af
fix: over-saturation settings handling
AlonKellner-RedHat Nov 30, 2025
72fe7fb
feat: over-saturation docs
AlonKellner-RedHat Nov 30, 2025
832f7df
fix: mdformat over-saturation docs
AlonKellner-RedHat Nov 30, 2025
2579f41
fix: review suggestions
AlonKellner-RedHat Dec 4, 2025
95628aa
Address feedback
sjmonson Dec 4, 2025
241aceb
Update over saturation flag name in docs
sjmonson Dec 5, 2025
800bec6
Update over sat test to pass in min seconds
sjmonson Dec 5, 2025
3bd986b
Fix e2e tests writing output to test dir
sjmonson Dec 5, 2025
ca5363c
Treat OverSaturationConstraint as a Constraint
sjmonson Dec 5, 2025
1d16296
Drop benchmark report delete in e2e
sjmonson Dec 5, 2025
89b145d
Add info property to sat constraint
sjmonson Dec 5, 2025
913a601
Update main README
sjmonson Dec 5, 2025
2 changes: 2 additions & 0 deletions README.md
@@ -234,6 +234,7 @@ guidellm benchmark \
--warmup 0.1 \
--cooldown 0.1 \
--max-errors 5 \
--detect-saturation
```

**Key parameters:**
@@ -243,6 +244,7 @@ guidellm benchmark \
- `--max-seconds`: Maximum duration in seconds for each benchmark before automatic termination
- `--max-requests`: Maximum number of requests per benchmark before automatic termination
- `--max-errors`: Maximum number of individual errors before stopping the benchmark entirely
- `--detect-saturation`: Enable over-saturation detection to automatically stop benchmarks when the model becomes over-saturated (see also `--over-saturation` for more advanced control)

## Development and Contribution

8 changes: 8 additions & 0 deletions docs/guides/index.md
@@ -60,4 +60,12 @@ Whether you're interested in understanding the system architecture, exploring su

[:octicons-arrow-right-24: SLO Guide](service_level_objectives.md)

- :material-stop-circle-outline:{ .lg .middle } Over-Saturation Stopping

______________________________________________________________________

Automatically detect and stop benchmarks when models become over-saturated to prevent wasted compute resources and ensure valid results.

[:octicons-arrow-right-24: Over-Saturation Guide](over_saturation_stopping.md)

</div>
138 changes: 138 additions & 0 deletions docs/guides/over_saturation_stopping.md
@@ -0,0 +1,138 @@
# Over-Saturation Stopping

GuideLLM provides over-saturation detection (OSD) to automatically stop benchmarks when a model becomes over-saturated. This feature helps prevent wasted compute resources and ensures that benchmark results remain valid by detecting when the response rate can no longer keep up with the request rate.

## What is Over-Saturation?

Over-saturation occurs when an LLM inference server receives requests faster than it can process them, causing a queue to build up. As the queue grows, the server takes progressively longer to start handling each request, leading to degraded performance metrics. When a benchmarking tool over-saturates an LLM inference server, the metrics it measures are dominated by queueing delay and become significantly skewed, rendering them useless.

Think of a cashier during a sudden rush: customers arrive faster than they can be served, so the line keeps growing and each new customer waits longer than the last. Continuing to benchmark in this state wastes costly machine time, which GuideLLM avoids by detecting over-saturation and stopping the benchmark automatically.
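
To make the queue build-up concrete, here is a minimal sketch of a toy queue with constant arrival and service rates. The function name `simulate_queue` and the rates used are illustrative assumptions, not part of GuideLLM, and real servers batch and schedule requests in more complex ways.

```python
def simulate_queue(arrival_rate: float, service_rate: float, seconds: int) -> list:
    """Toy model: requests arrive and complete at constant rates (requests/second).

    Returns one (second, queue_length, estimated_wait_seconds) tuple per second.
    """
    queue = 0.0
    history = []
    for second in range(1, seconds + 1):
        # The queue grows by the excess of arrivals over completions,
        # and it can never drop below zero.
        queue = max(0.0, queue + arrival_rate - service_rate)
        # A newly arriving request waits for everything ahead of it.
        wait = queue / service_rate
        history.append((second, queue, wait))
    return history


# Arrivals exceed capacity: the queue and the wait time grow every second.
overloaded = simulate_queue(arrival_rate=12.0, service_rate=10.0, seconds=5)

# Arrivals stay below capacity: the queue stays empty.
healthy = simulate_queue(arrival_rate=8.0, service_rate=10.0, seconds=5)
```

In the overloaded case the wait time never levels off, which is why latency measured late in such a run says more about the run's length than about the server.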

## How It Works

GuideLLM's Over-Saturation Detection (OSD) algorithm uses statistical slope detection to identify when a model becomes over-saturated. The algorithm tracks two key metrics over time:

1. **Concurrent Requests**: The number of requests being processed simultaneously
2. **Time-to-First-Token (TTFT)**: The latency for the first token of each response

For each metric, the algorithm:

- Maintains a sliding window of recent data points
- Calculates the linear regression slope using online statistics
- Computes the margin of error (MOE) using t-distribution confidence intervals
- Detects positive slopes with low MOE, indicating degradation

Over-saturation is detected when:

- Both concurrent requests and TTFT show statistically significant positive slopes
- The minimum duration threshold has been met
- Sufficient data points are available for reliable slope estimation

When over-saturation is detected, the constraint automatically stops request queuing and optionally stops processing of existing requests, preventing further resource waste.
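
The steps above can be sketched roughly as follows. This is not GuideLLM's implementation: the names `SlopeWindow`, `is_rising`, and `is_over_saturated` are hypothetical, the sums are recomputed over the window instead of being updated online, a normal quantile stands in for the t-distribution quantile, and "low MOE" is assumed to mean "margin of error below `moe_threshold`".

```python
import math
from collections import deque
from statistics import NormalDist
from typing import Optional


class SlopeWindow:
    """Sliding window of (time, value) points with a least-squares slope estimate."""

    def __init__(self, max_window_seconds: float = 120.0, confidence: float = 0.95):
        self.max_window_seconds = max_window_seconds
        # Assumption: a normal quantile approximates the t-distribution quantile.
        self.quantile = NormalDist().inv_cdf(0.5 + confidence / 2.0)
        self.points = deque()

    def add(self, t: float, value: float) -> None:
        self.points.append((t, value))
        # Prune points that have fallen out of the time window.
        while t - self.points[0][0] > self.max_window_seconds:
            self.points.popleft()

    def slope_and_moe(self) -> Optional[tuple]:
        n = len(self.points)
        if n < 3:
            return None  # the residual variance needs at least 3 points
        mean_t = sum(t for t, _ in self.points) / n
        mean_v = sum(v for _, v in self.points) / n
        sxx = sum((t - mean_t) ** 2 for t, _ in self.points)
        if sxx == 0.0:
            return None  # all points share one timestamp
        sxy = sum((t - mean_t) * (v - mean_v) for t, v in self.points)
        slope = sxy / sxx
        intercept = mean_v - slope * mean_t
        sse = sum((v - (intercept + slope * t)) ** 2 for t, v in self.points)
        std_err = math.sqrt(sse / (n - 2) / sxx)
        return slope, self.quantile * std_err


def is_rising(window: SlopeWindow, moe_threshold: float = 2.0,
              minimum_window_size: int = 5) -> bool:
    if len(window.points) < minimum_window_size:
        return False
    estimate = window.slope_and_moe()
    if estimate is None:
        return False
    slope, moe = estimate
    # Assumption: "low MOE" means the margin of error is below the threshold.
    return slope > 0.0 and moe < moe_threshold


def is_over_saturated(concurrent: SlopeWindow, ttft: SlopeWindow,
                      elapsed_seconds: float, min_seconds: float = 30.0) -> bool:
    # Both metrics must be rising, and the minimum duration must have passed.
    return (elapsed_seconds >= min_seconds
            and is_rising(concurrent)
            and is_rising(ttft))
```

Requiring both slopes to be positive is what separates over-saturation from a benchmark that is merely ramping up concurrency while TTFT stays flat.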

## Usage

### Basic Usage

Enable over-saturation detection with default settings:

```bash
guidellm benchmark \
--target http://localhost:8000 \
--profile throughput \
--rate 10 \
--detect-saturation
```

### Advanced Configuration

Configure detection parameters using a JSON dictionary:

```bash
guidellm benchmark \
--target http://localhost:8000 \
--profile concurrent \
--rate 16 \
--over-saturation '{"enabled": true, "min_seconds": 60, "max_window_seconds": 300, "moe_threshold": 1.5}'
```

## Configuration Options

The following parameters can be configured when enabling over-saturation detection:

- **`enabled`** (bool, default: `True`): Whether to stop the benchmark if over-saturation is detected
- **`min_seconds`** (float, default: `30.0`): Minimum seconds before checking for over-saturation. This prevents false positives during the initial warm-up phase.
- **`max_window_seconds`** (float, default: `120.0`): Maximum time window in seconds for data retention. Older data points are automatically pruned to maintain bounded memory usage.
- **`moe_threshold`** (float, default: `2.0`): Margin of error threshold for slope detection. Lower values make detection more sensitive to degradation.
- **`minimum_ttft`** (float, default: `2.5`): Minimum TTFT threshold in seconds for violation counting. Only TTFT values above this threshold are counted as violations.
- **`maximum_window_ratio`** (float, default: `0.75`): Maximum window size as a ratio of total requests. Limits memory usage by capping the number of tracked requests.
- **`minimum_window_size`** (int, default: `5`): Minimum data points required for slope estimation. Ensures statistical reliability before making detection decisions.
- **`confidence`** (float, default: `0.95`): Statistical confidence level for t-distribution calculations (0-1). Higher values require stronger evidence before detecting over-saturation.
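
As a reading aid, the options and defaults listed above can be mirrored in a small dataclass. This is a hypothetical sketch: `OverSaturationSettings` and `from_overrides` are not GuideLLM names, and the validation ranges are assumptions drawn from the descriptions above, not from the library's own checks.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class OverSaturationSettings:
    """Hypothetical mirror of the documented options and their defaults."""

    enabled: bool = True
    min_seconds: float = 30.0
    max_window_seconds: float = 120.0
    moe_threshold: float = 2.0
    minimum_ttft: float = 2.5
    maximum_window_ratio: float = 0.75
    minimum_window_size: int = 5
    confidence: float = 0.95

    def __post_init__(self) -> None:
        # Assumed ranges, based only on the option descriptions.
        if not 0.0 < self.confidence < 1.0:
            raise ValueError("confidence must be between 0 and 1")
        if not 0.0 < self.maximum_window_ratio <= 1.0:
            raise ValueError("maximum_window_ratio must be in (0, 1]")
        if self.minimum_window_size < 1:
            raise ValueError("minimum_window_size must be at least 1")
        if self.min_seconds < 0 or self.max_window_seconds <= 0:
            raise ValueError("time settings must be positive")


def from_overrides(overrides: dict) -> OverSaturationSettings:
    """Apply a partial dict (like the JSON passed on the CLI) over the defaults."""
    known = {f.name for f in fields(OverSaturationSettings)}
    unknown = set(overrides) - known
    if unknown:
        raise ValueError(f"Unknown over-saturation options: {sorted(unknown)}")
    return OverSaturationSettings(**overrides)


# Only the overridden keys change; everything else keeps its documented default.
settings = from_overrides({"min_seconds": 60, "moe_threshold": 1.5})
```

Rejecting unknown keys in a sketch like this catches typos such as `min_second` before a long benchmark starts.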

## Use Cases

Over-saturation detection is particularly useful in the following scenarios:

### Stress Testing and Capacity Planning

When testing how your system handles increasing load, over-saturation detection automatically stops benchmarks once the system can no longer keep up, preventing wasted compute time on invalid results.

```bash
guidellm benchmark \
--target http://localhost:8000 \
--profile sweep \
--rate 5 \
--detect-saturation
```

### Cost-Effective Benchmarking

When running large-scale benchmark matrices across multiple models, GPUs, and configurations, over-saturation detection can significantly reduce costs by stopping invalid runs early.

### Finding Safe Operating Ranges

Use over-saturation detection to identify the maximum sustainable throughput for your deployment, helping you set appropriate rate limits and capacity planning targets.

## Interpreting Results

When over-saturation detection is enabled, the benchmark output records the detection state in the scheduler action metadata. The metadata includes:

- **`is_over_saturated`** (bool): Whether over-saturation was detected at the time of evaluation
- **`concurrent_slope`** (float): The calculated slope for concurrent requests
- **`concurrent_slope_moe`** (float): The margin of error for the concurrent requests slope
- **`concurrent_n`** (int): The number of data points used for concurrent requests slope calculation
- **`ttft_slope`** (float): The calculated slope for TTFT
- **`ttft_slope_moe`** (float): The margin of error for the TTFT slope
- **`ttft_n`** (int): The number of data points used for TTFT slope calculation
- **`ttft_violations`** (int): The count of TTFT values exceeding the minimum threshold

These metrics can help you understand why over-saturation was detected and fine-tune the detection parameters if needed.
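
One way to read these fields is to treat each slope and its margin of error as an interval, `slope ± moe`, and check whether the whole interval lies above zero. The sketch below assumes the metadata has already been loaded into a flat dict with the field names listed above; the helper names and the sample values are illustrative, not real benchmark output.

```python
def describe_slope(name: str, slope: float, moe: float, n: int) -> str:
    """Classify a slope by whether its interval (slope +/- moe) excludes zero."""
    low, high = slope - moe, slope + moe
    if low > 0:
        trend = "rising"
    elif high < 0:
        trend = "falling"
    else:
        trend = "inconclusive"
    return f"{name}: slope={slope:.3f} +/- {moe:.3f} over {n} points ({trend})"


def summarize_detection(info: dict) -> list:
    """Build a short, human-readable summary of the detection metadata."""
    return [
        "over-saturated" if info["is_over_saturated"] else "not over-saturated",
        describe_slope("concurrent", info["concurrent_slope"],
                       info["concurrent_slope_moe"], info["concurrent_n"]),
        describe_slope("ttft", info["ttft_slope"],
                       info["ttft_slope_moe"], info["ttft_n"]),
        f"ttft violations: {info['ttft_violations']}",
    ]


# Made-up values for illustration only.
sample = {
    "is_over_saturated": False,
    "concurrent_slope": 0.8,
    "concurrent_slope_moe": 0.3,
    "concurrent_n": 42,
    "ttft_slope": 0.05,
    "ttft_slope_moe": 0.2,
    "ttft_n": 40,
    "ttft_violations": 0,
}
for line in summarize_detection(sample):
    print(line)
```

In the sample, concurrency is clearly rising but the TTFT slope is inconclusive, which matches a run that was not flagged as over-saturated.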

## Example: Complete Benchmark with Over-Saturation Detection

```bash
guidellm benchmark \
--target http://localhost:8000 \
--profile concurrent \
--rate 16 \
--data "prompt_tokens=256,output_tokens=128" \
--max-seconds 300 \
--over-saturation '{"enabled": true, "min_seconds": 30, "max_window_seconds": 120}' \
--outputs json,html
```

This example:

- Runs a concurrent benchmark with 16 simultaneous requests
- Uses synthetic data with 256 prompt tokens and 128 output tokens
- Enables over-saturation detection with custom timing parameters
- Sets a maximum duration of 300 seconds (as a fallback)
- Outputs results in both JSON and HTML formats
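
When the same command is launched from a script, building the JSON argument with `json.dumps` avoids hand-written quoting mistakes. The sketch below only assembles the argument list using the flags shown in the example above; `build_benchmark_command` is a hypothetical helper, not part of GuideLLM.

```python
import json
import shlex


def build_benchmark_command(target: str, over_saturation: dict) -> list:
    """Assemble the example command as an argument list (no shell quoting needed)."""
    return [
        "guidellm", "benchmark",
        "--target", target,
        "--profile", "concurrent",
        "--rate", "16",
        "--data", "prompt_tokens=256,output_tokens=128",
        "--max-seconds", "300",
        # json.dumps produces valid JSON (true/false, double quotes) from a dict.
        "--over-saturation", json.dumps(over_saturation),
        "--outputs", "json,html",
    ]


command = build_benchmark_command(
    "http://localhost:8000",
    {"enabled": True, "min_seconds": 30, "max_window_seconds": 120},
)

# shlex.join renders a copy-pasteable, shell-safe version of the same command.
print(shlex.join(command))
```

The list form can be passed directly to `subprocess.run(command, check=True)`, which sidesteps shell quoting entirely.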

## Additional Resources

For more in-depth information about over-saturation detection, including the algorithm development, evaluation metrics, and implementation details, see the following Red Hat Developer blog posts:

- [Reduce LLM benchmarking costs with oversaturation detection](https://developers.redhat.com/articles/2025/11/18/reduce-llm-benchmarking-costs-oversaturation-detection) - An introduction to the problem of over-saturation and why it matters for LLM benchmarking
- [Defining success: Evaluation metrics and data augmentation for oversaturation detection](https://developers.redhat.com/articles/2025/11/20/oversaturation-detection-evaluation-metrics) - How to evaluate the performance of an OSD algorithm through custom metrics, dataset labeling, and load augmentation techniques
- [Building an oversaturation detector with iterative error analysis](https://developers.redhat.com/articles/2025/11/24/building-oversaturation-detector-iterative-error-analysis) - A detailed walkthrough of how the OSD algorithm was built
22 changes: 21 additions & 1 deletion src/guidellm/__main__.py
@@ -384,7 +384,27 @@ def benchmark():
    default=BenchmarkGenerativeTextArgs.get_default("max_global_error_rate"),
    help="Maximum global error rate across all benchmarks.",
)
def run(**kwargs):
@click.option(
    "--over-saturation",
    "over_saturation",
    callback=cli_tools.parse_json,
    default=None,
    help=(
        "Enable over-saturation detection. "
        "Pass a JSON dict with configuration "
        '(e.g., \'{"enabled": true, "min_seconds": 30}\'). '
        "Defaults to None (disabled)."
    ),
)
@click.option(
    "--detect-saturation",
    "--default-over-saturation",
    "over_saturation",
    callback=cli_tools.parse_json,
    flag_value='{"enabled": true}',
    help="Enable over-saturation detection with default settings.",
)
def run(**kwargs):  # noqa: C901
    # Only set CLI args that differ from click defaults
    kwargs = cli_tools.set_if_not_default(click.get_current_context(), **kwargs)

4 changes: 4 additions & 0 deletions src/guidellm/benchmark/entrypoints.py
@@ -323,6 +323,7 @@ async def resolve_profile(
max_errors: int | None,
max_error_rate: float | None,
max_global_error_rate: float | None,
over_saturation: dict[str, Any] | None = None,
console: Console | None = None,
) -> Profile:
"""
@@ -343,6 +344,7 @@ async def resolve_profile(
:param max_errors: Maximum number of errors before stopping
:param max_error_rate: Maximum error rate threshold before stopping
:param max_global_error_rate: Maximum global error rate threshold before stopping
:param over_saturation: Over-saturation detection configuration (dict)
:param console: Console instance for progress reporting, or None
:return: Configured Profile instance ready for benchmarking
:raises ValueError: If constraints are provided with a pre-configured Profile
@@ -359,6 +361,7 @@ async def resolve_profile(
"max_errors": max_errors,
"max_error_rate": max_error_rate,
"max_global_error_rate": max_global_error_rate,
"over_saturation": over_saturation,
}.items():
if val is not None:
constraints[key] = val
@@ -500,6 +503,7 @@ async def benchmark_generative_text(
max_errors=args.max_errors,
max_error_rate=args.max_error_rate,
max_global_error_rate=args.max_global_error_rate,
over_saturation=args.over_saturation,
console=console,
)
output_formats = await resolve_output_formats(
5 changes: 2 additions & 3 deletions src/guidellm/benchmark/progress.py
@@ -12,7 +12,6 @@

from abc import ABC, abstractmethod
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Generic, Literal

from rich.console import Group
@@ -37,7 +36,7 @@
GenerativeBenchmarkAccumulator,
)
from guidellm.scheduler import SchedulerState, SchedulingStrategy
from guidellm.utils import Colors, format_value_display
from guidellm.utils import Colors, format_value_display, safe_format_timestamp

__all__ = ["BenchmarkerProgress", "GenerativeConsoleBenchmarkerProgress"]

@@ -390,7 +389,7 @@ def formatted_start_time(self) -> str:
if self.start_time < 0.0:
return "--:--:--"

return datetime.fromtimestamp(self.start_time).strftime("%H:%M:%S")
return safe_format_timestamp(self.start_time, format_="%H:%M:%S")

@property
def formatted_progress_status(self) -> str:
8 changes: 8 additions & 0 deletions src/guidellm/benchmark/schemas/generative/entrypoints.py
@@ -283,6 +283,14 @@ def get_default(cls: type[BenchmarkGenerativeTextArgs], field: str) -> Any:
    max_global_error_rate: float | None = Field(
        default=None, description="Maximum global error rate (0-1) before stopping"
    )
    over_saturation: dict[str, Any] | None = Field(
        default=None,
        description=(
            "Over-saturation detection configuration. A dict with configuration "
            "parameters (enabled, min_seconds, max_window_seconds, "
            "moe_threshold, etc.)."
        ),
    )

    @field_validator("data", "data_args", "rate", mode="wrap")
    @classmethod
4 changes: 4 additions & 0 deletions src/guidellm/scheduler/__init__.py
@@ -19,6 +19,8 @@
MaxErrorsConstraint,
MaxGlobalErrorRateConstraint,
MaxNumberConstraint,
OverSaturationConstraint,
OverSaturationConstraintInitializer,
PydanticConstraintInitializer,
SerializableConstraintInitializer,
UnserializableConstraintInitializer,
@@ -66,6 +68,8 @@
"MaxNumberConstraint",
"MultiTurnRequestT",
"NonDistributedEnvironment",
"OverSaturationConstraint",
"OverSaturationConstraintInitializer",
"PydanticConstraintInitializer",
"RequestT",
"ResponseT",