vllm-project · sjmonson · Dec 5, 2025 · Nov 17, 2025 · Nov 17, 2025 · Nov 17, 2025
diff --git a/README.md b/README.md
@@ -234,6 +234,7 @@ guidellm benchmark \
   --warmup 0.1 \
   --cooldown 0.1 \
   --max-errors 5
+  --detect-saturation
 ```
 
 **Key parameters:**
@@ -243,6 +244,7 @@ guidellm benchmark \
 - `--max-seconds`: Maximum duration in seconds for each benchmark before automatic termination
 - `--max-requests`: Maximum number of requests per benchmark before automatic termination
 - `--max-errors`: Maximum number of individual errors before stopping the benchmark entirely
+- `--detect-saturation`: Enable over-saturation detection to automatically stop benchmarks when the model becomes over-saturated (see also `--over-saturation` for more advanced control)
 
 ## Development and Contribution
 

diff --git a/docs/guides/index.md b/docs/guides/index.md
@@ -60,4 +60,12 @@ Whether you're interested in understanding the system architecture, exploring su
 
   [:octicons-arrow-right-24: SLO Guide](service_level_objectives.md)
 
+- :material-stop-circle-outline:{ .lg .middle } Over-Saturation Stopping
+
+  ______________________________________________________________________
+
+  Automatically detect and stop benchmarks when models become over-saturated to prevent wasted compute resources and ensure valid results.
+
+  [:octicons-arrow-right-24: Over-Saturation Guide](over_saturation_stopping.md)
+
 </div>
diff --git a/docs/guides/over_saturation_stopping.md b/docs/guides/over_saturation_stopping.md
@@ -0,0 +1,138 @@
+# Over-Saturation Stopping
+
+GuideLLM provides over-saturation detection (OSD) to automatically stop benchmarks when a model becomes over-saturated. This feature helps prevent wasted compute resources and ensures that benchmark results remain valid by detecting when the response rate can no longer keep up with the request rate.
+
+## What is Over-Saturation?
+
+Over-saturation occurs when an LLM inference server receives requests faster than it can process them, causing a queue to build up. As the queue grows, the server takes progressively longer to start handling each request, leading to degraded performance metrics. When a performance benchmarking tool oversaturates an LLM inference server, the metrics it measures become significantly skewed, rendering them useless.
+
+Think of it like a cashier getting flustered during a sudden rush. As the line grows (the load), the cashier can't keep up, the line gets longer, and there is no room for additional customers. This waste of costly machine time can be prevented by automatically detecting and stopping benchmarks when over-saturation is detected.
+
+## How It Works
+
+GuideLLM's Over-Saturation Detection (OSD) algorithm uses statistical slope detection to identify when a model becomes over-saturated. The algorithm tracks two key metrics over time:
+
+1. **Concurrent Requests**: The number of requests being processed simultaneously
+2. **Time-to-First-Token (TTFT)**: The latency for the first token of each response
+
+For each metric, the algorithm:
+
+- Maintains a sliding window of recent data points
+- Calculates the linear regression slope using online statistics
+- Computes the margin of error (MOE) using t-distribution confidence intervals
+- Detects positive slopes with low MOE, indicating degradation
+
+Over-saturation is detected when:
+
+- Both concurrent requests and TTFT show statistically significant positive slopes
+- The minimum duration threshold has been met
+- Sufficient data points are available for reliable slope estimation
+
+When over-saturation is detected, the constraint automatically stops request queuing and optionally stops processing of existing requests, preventing further resource waste.
+
+## Usage
+
+### Basic Usage
+
+Enable over-saturation detection with default settings:
+
+```bash
+guidellm benchmark \
+  --target http://localhost:8000 \
+  --profile throughput \
+  --rate 10 \
+  --detect-saturation
+```
+
+### Advanced Configuration
+
+Configure detection parameters using a JSON dictionary:
+
+```bash
+guidellm benchmark \
+  --target http://localhost:8000 \
+  --profile concurrent \
+  --rate 16 \
+  --over-saturation '{"enabled": true, "min_seconds": 60, "max_window_seconds": 300, "moe_threshold": 1.5}'
+```
+
+## Configuration Options
+
+The following parameters can be configured when enabling over-saturation detection:
+
+- **`enabled`** (bool, default: `True`): Whether to stop the benchmark if over-saturation is detected
+- **`min_seconds`** (float, default: `30.0`): Minimum seconds before checking for over-saturation. This prevents false positives during the initial warm-up phase.
+- **`max_window_seconds`** (float, default: `120.0`): Maximum time window in seconds for data retention. Older data points are automatically pruned to maintain bounded memory usage.
+- **`moe_threshold`** (float, default: `2.0`): Margin of error threshold for slope detection. Lower values make detection more sensitive to degradation.
+- **`minimum_ttft`** (float, default: `2.5`): Minimum TTFT threshold in seconds for violation counting. Only TTFT values above this threshold are counted as violations.
+- **`maximum_window_ratio`** (float, default: `0.75`): Maximum window size as a ratio of total requests. Limits memory usage by capping the number of tracked requests.
+- **`minimum_window_size`** (int, default: `5`): Minimum data points required for slope estimation. Ensures statistical reliability before making detection decisions.
+- **`confidence`** (float, default: `0.95`): Statistical confidence level for t-distribution calculations (0-1). Higher values require stronger evidence before detecting over-saturation.
+
+## Use Cases
+
+Over-saturation detection is particularly useful in the following scenarios:
+
+### Stress Testing and Capacity Planning
+
+When testing how your system handles increasing load, over-saturation detection automatically stops benchmarks once the system can no longer keep up, preventing wasted compute time on invalid results.
+
+```bash
+guidellm benchmark \
+  --target http://localhost:8000 \
+  --profile sweep \
+  --rate 5 \
+  --detect-saturation
+```
+
+### Cost-Effective Benchmarking
+
+When running large-scale benchmark matrices across multiple models, GPUs, and configurations, over-saturation detection can significantly reduce costs by stopping invalid runs early.
+
+### Finding Safe Operating Ranges
+
+Use over-saturation detection to identify the maximum sustainable throughput for your deployment, helping you set appropriate rate limits and capacity planning targets.
+
+## Interpreting Results
+
+When over-saturation detection is enabled, the benchmark output includes metadata about the detection state. This metadata is available in the scheduler action metadata and includes:
+
+- **`is_over_saturated`** (bool): Whether over-saturation was detected at the time of evaluation
+- **`concurrent_slope`** (float): The calculated slope for concurrent requests
+- **`concurrent_slope_moe`** (float): The margin of error for the concurrent requests slope
+- **`concurrent_n`** (int): The number of data points used for concurrent requests slope calculation
+- **`ttft_slope`** (float): The calculated slope for TTFT
+- **`ttft_slope_moe`** (float): The margin of error for the TTFT slope
+- **`ttft_n`** (int): The number of data points used for TTFT slope calculation
+- **`ttft_violations`** (int): The count of TTFT values exceeding the minimum threshold
+
+These metrics can help you understand why over-saturation was detected and fine-tune the detection parameters if needed.
+
+## Example: Complete Benchmark with Over-Saturation Detection
+
+```bash
+guidellm benchmark \
+  --target http://localhost:8000 \
+  --profile concurrent \
+  --rate 16 \
+  --data "prompt_tokens=256,output_tokens=128" \
+  --max-seconds 300 \
+  --over-saturation '{"enabled": true, "min_seconds": 30, "max_window_seconds": 120}' \
+  --outputs json,html
+```
+
+This example:
+
+- Runs a concurrent benchmark with 16 simultaneous requests
+- Uses synthetic data with 256 prompt tokens and 128 output tokens
+- Enables over-saturation detection with custom timing parameters
+- Sets a maximum duration of 300 seconds (as a fallback)
+- Outputs results in both JSON and HTML formats
+
+## Additional Resources
+
+For more in-depth information about over-saturation detection, including the algorithm development, evaluation metrics, and implementation details, see the following Red Hat Developer blog posts:
+
+- [Reduce LLM benchmarking costs with oversaturation detection](https://developers.redhat.com/articles/2025/11/18/reduce-llm-benchmarking-costs-oversaturation-detection) - An introduction to the problem of over-saturation and why it matters for LLM benchmarking
+- [Defining success: Evaluation metrics and data augmentation for oversaturation detection](https://developers.redhat.com/articles/2025/11/20/oversaturation-detection-evaluation-metrics) - How to evaluate the performance of an OSD algorithm through custom metrics, dataset labeling, and load augmentation techniques
+- [Building an oversaturation detector with iterative error analysis](https://developers.redhat.com/articles/2025/11/24/building-oversaturation-detector-iterative-error-analysis) - A detailed walkthrough of how the OSD algorithm was built
diff --git a/src/guidellm/__main__.py b/src/guidellm/__main__.py
@@ -384,7 +384,27 @@ def benchmark():
     default=BenchmarkGenerativeTextArgs.get_default("max_global_error_rate"),
     help="Maximum global error rate across all benchmarks.",
 )
-def run(**kwargs):
+@click.option(
+    "--over-saturation",
+    "over_saturation",
+    callback=cli_tools.parse_json,
+    default=None,
+    help=(
+        "Enable over-saturation detection. "
+        "Pass a JSON dict with configuration "
+        '(e.g., \'{"enabled": true, "min_seconds": 30}\'). '
+        "Defaults to None (disabled)."
+    ),
+)
+@click.option(
+    "--detect-saturation",
+    "--default-over-saturation",
+    "over_saturation",
+    callback=cli_tools.parse_json,
+    flag_value='{"enabled": true}',
+    help="Enable over-saturation detection with default settings.",
+)
+def run(**kwargs):  # noqa: C901
     # Only set CLI args that differ from click defaults
     kwargs = cli_tools.set_if_not_default(click.get_current_context(), **kwargs)
 

diff --git a/src/guidellm/benchmark/entrypoints.py b/src/guidellm/benchmark/entrypoints.py
@@ -323,6 +323,7 @@ async def resolve_profile(
     max_errors: int | None,
     max_error_rate: float | None,
     max_global_error_rate: float | None,
+    over_saturation: dict[str, Any] | None = None,
     console: Console | None = None,
 ) -> Profile:
     """
@@ -343,6 +344,7 @@ async def resolve_profile(
     :param max_errors: Maximum number of errors before stopping
     :param max_error_rate: Maximum error rate threshold before stopping
     :param max_global_error_rate: Maximum global error rate threshold before stopping
+    :param over_saturation: Over-saturation detection configuration (dict)
     :param console: Console instance for progress reporting, or None
     :return: Configured Profile instance ready for benchmarking
     :raises ValueError: If constraints are provided with a pre-configured Profile
@@ -359,6 +361,7 @@ async def resolve_profile(
         "max_errors": max_errors,
         "max_error_rate": max_error_rate,
         "max_global_error_rate": max_global_error_rate,
+        "over_saturation": over_saturation,
     }.items():
         if val is not None:
             constraints[key] = val
@@ -500,6 +503,7 @@ async def benchmark_generative_text(
         max_errors=args.max_errors,
         max_error_rate=args.max_error_rate,
         max_global_error_rate=args.max_global_error_rate,
+        over_saturation=args.over_saturation,
         console=console,
     )
     output_formats = await resolve_output_formats(

diff --git a/src/guidellm/benchmark/progress.py b/src/guidellm/benchmark/progress.py
@@ -12,7 +12,6 @@
 
 from abc import ABC, abstractmethod
 from dataclasses import dataclass
-from datetime import datetime
 from typing import Any, Generic, Literal
 
 from rich.console import Group
@@ -37,7 +36,7 @@
     GenerativeBenchmarkAccumulator,
 )
 from guidellm.scheduler import SchedulerState, SchedulingStrategy
-from guidellm.utils import Colors, format_value_display
+from guidellm.utils import Colors, format_value_display, safe_format_timestamp
 
 __all__ = ["BenchmarkerProgress", "GenerativeConsoleBenchmarkerProgress"]
 
@@ -390,7 +389,7 @@ def formatted_start_time(self) -> str:
         if self.start_time < 0.0:
             return "--:--:--"
 
-        return datetime.fromtimestamp(self.start_time).strftime("%H:%M:%S")
+        return safe_format_timestamp(self.start_time, format_="%H:%M:%S")
 
     @property
     def formatted_progress_status(self) -> str:

diff --git a/src/guidellm/benchmark/schemas/generative/entrypoints.py b/src/guidellm/benchmark/schemas/generative/entrypoints.py
@@ -283,6 +283,14 @@ def get_default(cls: type[BenchmarkGenerativeTextArgs], field: str) -> Any:
     max_global_error_rate: float | None = Field(
         default=None, description="Maximum global error rate (0-1) before stopping"
     )
+    over_saturation: dict[str, Any] | None = Field(
+        default=None,
+        description=(
+            "Over-saturation detection configuration. A dict with configuration "
+            "parameters (enabled, min_seconds, max_window_seconds, "
+            "moe_threshold, etc.)."
+        ),
+    )
 
     @field_validator("data", "data_args", "rate", mode="wrap")
     @classmethod

diff --git a/src/guidellm/scheduler/__init__.py b/src/guidellm/scheduler/__init__.py
@@ -19,6 +19,8 @@
     MaxErrorsConstraint,
     MaxGlobalErrorRateConstraint,
     MaxNumberConstraint,
+    OverSaturationConstraint,
+    OverSaturationConstraintInitializer,
     PydanticConstraintInitializer,
     SerializableConstraintInitializer,
     UnserializableConstraintInitializer,
@@ -66,6 +68,8 @@
     "MaxNumberConstraint",
     "MultiTurnRequestT",
     "NonDistributedEnvironment",
+    "OverSaturationConstraint",
+    "OverSaturationConstraintInitializer",
     "PydanticConstraintInitializer",
     "RequestT",
     "ResponseT",