Commit f21fff0

feat: over-saturation docs
Signed-off-by: Alon Kellner <[email protected]>
1 parent eb882f5 commit f21fff0

4 files changed

+167
-2
lines changed

README.md

Lines changed: 3 additions & 1 deletion
@@ -233,7 +233,8 @@ guidellm benchmark \
233233
--rate 16 \
234234
--warmup 0.1 \
235235
--cooldown 0.1 \
236-
--max-errors 5
236+
--max-errors 5 \
237+
--over-saturation True
237238
```
238239

239240
**Key parameters:**
@@ -243,6 +244,7 @@ guidellm benchmark \
243244
- `--max-seconds`: Maximum duration in seconds for each benchmark before automatic termination
244245
- `--max-requests`: Maximum number of requests per benchmark before automatic termination
245246
- `--max-errors`: Maximum number of individual errors before stopping the benchmark entirely
247+
- `--over-saturation`: Enable over-saturation detection to automatically stop benchmarks when the model becomes over-saturated (use `True` for defaults or a JSON dict for custom configuration)
246248

247249
## Development and Contribution
248250

docs/guides/index.md

Lines changed: 8 additions & 0 deletions
@@ -60,4 +60,12 @@ Whether you're interested in understanding the system architecture, exploring su
6060

6161
[:octicons-arrow-right-24: SLO Guide](service_level_objectives.md)
6262

63+
- :material-stop-circle-outline:{ .lg .middle } Over-Saturation Stopping
64+
65+
______________________________________________________________________
66+
67+
Automatically detect and stop benchmarks when models become over-saturated to prevent wasted compute resources and ensure valid results.
68+
69+
[:octicons-arrow-right-24: Over-Saturation Guide](over_saturation_stopping.md)
70+
6371
</div>
docs/guides/over_saturation_stopping.md

Lines changed: 147 additions & 0 deletions
@@ -0,0 +1,147 @@
1+
# Over-Saturation Stopping
2+
3+
GuideLLM provides over-saturation detection (OSD) to automatically stop benchmarks when a model becomes over-saturated. This feature helps prevent wasted compute resources and ensures that benchmark results remain valid by detecting when the response rate can no longer keep up with the request rate.
4+
5+
## What is Over-Saturation?
6+
7+
Over-saturation occurs when an LLM inference server receives requests faster than it can process them, causing a queue to build up. As the queue grows, the server takes progressively longer to start handling each request, so latency metrics such as time-to-first-token degrade. When a performance benchmarking tool over-saturates an LLM inference server, the metrics it measures are dominated by queueing delay rather than the server's actual performance, which makes them useless.
8+
9+
Think of a cashier during a sudden rush: customers arrive faster than they can be served, so the line keeps growing and each new customer waits longer than the last. A benchmark that keeps running in this state wastes costly machine time. Automatically detecting over-saturation and stopping the benchmark prevents that waste.
10+
11+
## How It Works
12+
13+
GuideLLM's Over-Saturation Detection (OSD) algorithm uses statistical slope detection to identify when a model becomes over-saturated. The algorithm tracks two key metrics over time:
14+
15+
1. **Concurrent Requests**: The number of requests being processed simultaneously
16+
2. **Time-to-First-Token (TTFT)**: The latency for the first token of each response
17+
18+
For each metric, the algorithm:
19+
- Maintains a sliding window of recent data points
20+
- Calculates the linear regression slope using online statistics
21+
- Computes the margin of error (MOE) using t-distribution confidence intervals
22+
- Detects positive slopes with low MOE, indicating degradation
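
The per-metric steps above can be sketched in Python as follows. This is a minimal illustration, not GuideLLM's implementation: the class name `SlidingSlope` and its methods are hypothetical, and the critical value is taken from the normal distribution (`statistics.NormalDist`) as a stand-in for the t-distribution quantile, which understates the margin of error for small windows.

```python
import math
from collections import deque
from statistics import NormalDist


class SlidingSlope:
    """Hypothetical sliding-window slope tracker (illustration only)."""

    def __init__(self, max_window_seconds=120.0, confidence=0.95):
        self.max_window_seconds = max_window_seconds
        self.confidence = confidence
        self.points = deque()  # (timestamp, value) pairs, oldest first
        # Running sums let us add and remove points in O(1).
        self.sx = self.sy = self.sxx = self.syy = self.sxy = 0.0

    def _update(self, x, y, sign):
        self.sx += sign * x
        self.sy += sign * y
        self.sxx += sign * x * x
        self.syy += sign * y * y
        self.sxy += sign * x * y

    def add(self, timestamp, value):
        """Add a data point and prune points older than the window."""
        self.points.append((timestamp, value))
        self._update(timestamp, value, +1)
        while timestamp - self.points[0][0] > self.max_window_seconds:
            old_t, old_v = self.points.popleft()
            self._update(old_t, old_v, -1)

    def slope_and_moe(self):
        """Return (slope, margin_of_error), or None if not computable."""
        n = len(self.points)
        if n < 3:
            return None
        sxx_c = self.sxx - self.sx * self.sx / n  # centered sum of squares of x
        if sxx_c <= 0:
            return None
        sxy_c = self.sxy - self.sx * self.sy / n
        syy_c = self.syy - self.sy * self.sy / n
        slope = sxy_c / sxx_c
        # Residual sum of squares, clamped against rounding error.
        residual = max(syy_c - slope * sxy_c, 0.0)
        std_err = math.sqrt(residual / (n - 2) / sxx_c)
        # Assumption: normal quantile used instead of the t quantile.
        critical = NormalDist().inv_cdf(0.5 + self.confidence / 2)
        return slope, critical * std_err
```

Running sums keep each update cheap, but they can lose precision when timestamps are large (for example, raw epoch seconds), so a real implementation would likely offset timestamps from the benchmark start.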
23+
24+
Over-saturation is detected when:
25+
- Both concurrent requests and TTFT show statistically significant positive slopes
26+
- The minimum duration threshold has been met
27+
- Sufficient data points are available for reliable slope estimation
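
The three conditions above can be combined roughly as in the sketch below. The names `SlopeEstimate`, `is_significant_increase`, and `is_over_saturated` are hypothetical, the default values are taken from the Configuration Options section, and the rule "margin of error below `moe_threshold`" is an assumption about how a "low MOE" is judged, not a statement of GuideLLM's exact test.

```python
from dataclasses import dataclass


@dataclass
class SlopeEstimate:
    """Hypothetical container for one metric's regression result."""

    slope: float  # estimated slope of the metric over time
    moe: float    # margin of error of that slope
    n: int        # number of data points behind the estimate


def is_significant_increase(est, moe_threshold=2.0, minimum_window_size=5):
    """Positive slope, enough data, and a small margin of error.

    Assumption: "low MOE" is modeled as moe < moe_threshold.
    """
    return (
        est.n >= minimum_window_size
        and est.slope > 0
        and est.moe < moe_threshold
    )


def is_over_saturated(concurrent, ttft, elapsed_seconds, min_seconds=30.0,
                      moe_threshold=2.0, minimum_window_size=5):
    """Flag over-saturation only when every condition holds at once."""
    if elapsed_seconds < min_seconds:
        return False  # still inside the initial warm-up period
    return all(
        is_significant_increase(est, moe_threshold, minimum_window_size)
        for est in (concurrent, ttft)
    )
```

Requiring both metrics to rise together guards against stopping on a TTFT spike alone or on a concurrency increase that the server is still absorbing.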
28+
29+
When over-saturation is detected, the constraint automatically stops request queuing and optionally stops processing of existing requests, preventing further resource waste.
30+
31+
## Usage
32+
33+
### Basic Usage
34+
35+
Enable over-saturation detection with default settings:
36+
37+
```bash
38+
guidellm benchmark \
39+
--target http://localhost:8000 \
40+
--profile throughput \
41+
--rate 10 \
42+
--over-saturation True
43+
```
44+
45+
### Advanced Configuration
46+
47+
Configure detection parameters using a JSON dictionary:
48+
49+
```bash
50+
guidellm benchmark \
51+
--target http://localhost:8000 \
52+
--profile concurrent \
53+
--rate 16 \
54+
--over-saturation '{"enabled": true, "min_seconds": 60, "max_window_seconds": 300, "moe_threshold": 1.5}'
55+
```
56+
57+
### Using the Alias
58+
59+
You can also use the `--detect-saturation` alias:
60+
61+
```bash
62+
guidellm benchmark \
63+
--target http://localhost:8000 \
64+
--profile throughput \
65+
--rate 10 \
66+
--detect-saturation True
67+
```
68+
69+
## Configuration Options
70+
71+
The following parameters can be configured when enabling over-saturation detection:
72+
73+
- **`enabled`** (bool, default: `True`): Whether to stop the benchmark if over-saturation is detected
74+
- **`min_seconds`** (float, default: `30.0`): Minimum seconds before checking for over-saturation. This prevents false positives during the initial warm-up phase.
75+
- **`max_window_seconds`** (float, default: `120.0`): Maximum time window in seconds for data retention. Older data points are automatically pruned to maintain bounded memory usage.
76+
- **`moe_threshold`** (float, default: `2.0`): Margin of error threshold for slope detection. Lower values make detection more sensitive to degradation.
77+
- **`minimum_ttft`** (float, default: `2.5`): Minimum TTFT threshold in seconds for violation counting. Only TTFT values above this threshold are counted as violations.
78+
- **`maximum_window_ratio`** (float, default: `0.75`): Maximum window size as a ratio of total requests. Limits memory usage by capping the number of tracked requests.
79+
- **`minimum_window_size`** (int, default: `5`): Minimum data points required for slope estimation. Ensures statistical reliability before making detection decisions.
80+
- **`confidence`** (float, default: `0.95`): Statistical confidence level for t-distribution calculations (0-1). Higher values require stronger evidence before detecting over-saturation.
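
When benchmarks are launched from a script, the JSON value for `--over-saturation` can be generated rather than typed by hand. The sketch below is a hypothetical helper, not part of GuideLLM: the `OverSaturationConfig` dataclass simply mirrors the parameter names and defaults listed above, and the range checks are assumptions based on the descriptions.

```python
import json
import shlex
from dataclasses import asdict, dataclass


@dataclass
class OverSaturationConfig:
    """Hypothetical mirror of the documented options and their defaults."""

    enabled: bool = True
    min_seconds: float = 30.0
    max_window_seconds: float = 120.0
    moe_threshold: float = 2.0
    minimum_ttft: float = 2.5
    maximum_window_ratio: float = 0.75
    minimum_window_size: int = 5
    confidence: float = 0.95

    def __post_init__(self):
        # Assumed sanity checks derived from the parameter descriptions.
        if not 0.0 < self.confidence < 1.0:
            raise ValueError("confidence must be between 0 and 1")
        if not 0.0 < self.maximum_window_ratio <= 1.0:
            raise ValueError("maximum_window_ratio must be in (0, 1]")
        if self.min_seconds < 0 or self.max_window_seconds <= 0:
            raise ValueError("time settings must be positive")

    def to_cli_args(self):
        """Return the flag and its JSON value as an argument list."""
        return ["--over-saturation", json.dumps(asdict(self))]


if __name__ == "__main__":
    config = OverSaturationConfig(min_seconds=60, moe_threshold=1.5)
    # shlex.join quotes the JSON so it can be pasted into a shell.
    print(shlex.join(config.to_cli_args()))
```

Passing the arguments as a list (for example to `subprocess.run`) avoids shell-quoting problems with the embedded JSON.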
81+
82+
## Use Cases
83+
84+
Over-saturation detection is particularly useful in the following scenarios:
85+
86+
### Stress Testing and Capacity Planning
87+
88+
When testing how your system handles increasing load, over-saturation detection automatically stops benchmarks once the system can no longer keep up, preventing wasted compute time on invalid results.
89+
90+
```bash
91+
guidellm benchmark \
92+
--target http://localhost:8000 \
93+
--profile sweep \
94+
--rate 5 \
95+
--over-saturation True
96+
```
97+
98+
### Cost-Effective Benchmarking
99+
100+
When running large-scale benchmark matrices across multiple models, GPUs, and configurations, over-saturation detection can significantly reduce costs by stopping invalid runs early.
101+
102+
### Finding Safe Operating Ranges
103+
104+
Use over-saturation detection to identify the maximum sustainable throughput for your deployment, helping you set appropriate rate limits and capacity planning targets.
105+
106+
## Interpreting Results
107+
108+
When over-saturation detection is enabled, the benchmark output includes metadata about the detection state. This metadata is available in the scheduler action metadata and includes:
109+
110+
- **`is_over_saturated`** (bool): Whether over-saturation was detected at the time of evaluation
111+
- **`concurrent_slope`** (float): The calculated slope for concurrent requests
112+
- **`concurrent_slope_moe`** (float): The margin of error for the concurrent requests slope
113+
- **`concurrent_n`** (int): The number of data points used for concurrent requests slope calculation
114+
- **`ttft_slope`** (float): The calculated slope for TTFT
115+
- **`ttft_slope_moe`** (float): The margin of error for the TTFT slope
116+
- **`ttft_n`** (int): The number of data points used for TTFT slope calculation
117+
- **`ttft_violations`** (int): The count of TTFT values exceeding the minimum threshold
118+
119+
These metrics can help you understand why over-saturation was detected and fine-tune the detection parameters if needed.
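
As an illustration of how these fields might be read together, the sketch below formats a metadata dictionary into a one-line summary. The function `summarize_detection` is hypothetical, the sample values are invented for the example, and how the metadata is obtained from a benchmark run is not shown here.

```python
def summarize_detection(meta):
    """Build a one-line summary from the documented metadata fields."""
    parts = []
    for name in ("concurrent", "ttft"):
        slope = meta[f"{name}_slope"]
        moe = meta[f"{name}_slope_moe"]
        n = meta[f"{name}_n"]
        parts.append(f"{name}: slope={slope:+.4f} +/- {moe:.4f} (n={n})")
    status = "over-saturated" if meta["is_over_saturated"] else "not over-saturated"
    violations = meta["ttft_violations"]
    return f"{status}; " + "; ".join(parts) + f"; ttft_violations={violations}"


if __name__ == "__main__":
    # Invented sample values, for illustration only.
    sample = {
        "is_over_saturated": True,
        "concurrent_slope": 0.42,
        "concurrent_slope_moe": 0.05,
        "concurrent_n": 180,
        "ttft_slope": 0.03,
        "ttft_slope_moe": 0.01,
        "ttft_n": 150,
        "ttft_violations": 37,
    }
    print(summarize_detection(sample))
```

A slope whose margin of error is small relative to its magnitude indicates a consistent upward trend, whereas a large margin of error suggests noise, in which case a longer window or a stricter threshold may be appropriate.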
120+
121+
## Example: Complete Benchmark with Over-Saturation Detection
122+
123+
```bash
124+
guidellm benchmark \
125+
--target http://localhost:8000 \
126+
--profile concurrent \
127+
--rate 16 \
128+
--data "prompt_tokens=256,output_tokens=128" \
129+
--max-seconds 300 \
130+
--over-saturation '{"enabled": true, "min_seconds": 30, "max_window_seconds": 120}' \
131+
--outputs json,html
132+
```
133+
134+
This example:
135+
- Runs a concurrent benchmark with 16 simultaneous requests
136+
- Uses synthetic data with 256 prompt tokens and 128 output tokens
137+
- Enables over-saturation detection with custom timing parameters
138+
- Sets a maximum duration of 300 seconds as a fallback in case over-saturation is never detected
139+
- Outputs results in both JSON and HTML formats
140+
141+
## Additional Resources
142+
143+
For more in-depth information about over-saturation detection, including the algorithm development, evaluation metrics, and implementation details, see the following Red Hat Developer blog posts:
144+
145+
- [Reduce LLM benchmarking costs with oversaturation detection](https://developers.redhat.com/articles/2025/11/18/reduce-llm-benchmarking-costs-oversaturation-detection) - An introduction to the problem of over-saturation and why it matters for LLM benchmarking
146+
- [Defining success: Evaluation metrics and data augmentation for oversaturation detection](https://developers.redhat.com/articles/2025/11/20/oversaturation-detection-evaluation-metrics) - How to evaluate the performance of an OSD algorithm through custom metrics, dataset labeling, and load augmentation techniques
147+
- [Building an oversaturation detector with iterative error analysis](https://developers.redhat.com/articles/2025/11/24/building-oversaturation-detector-iterative-error-analysis) - A detailed walkthrough of how the OSD algorithm was built

tests/unit/scheduler/test_over_saturation_comprehensive.py

Lines changed: 9 additions & 1 deletion
@@ -21,6 +21,7 @@
2121
approx_t_ppf,
2222
)
2323
from guidellm.schemas import RequestInfo, RequestTimings
24+
from guidellm.settings import settings
2425

2526

2627
class TestSlopeCheckerStatisticalAccuracy:
@@ -637,7 +638,14 @@ def test_initializer_alias_precedence(self):
637638
)
638639

639640
# detect_saturation should override over_saturation=False
640-
assert result == {"enabled": True}
641+
# When a boolean is passed, defaults from settings are included
642+
assert result == {
643+
"enabled": True,
644+
"min_seconds": settings.constraint_over_saturation_min_seconds,
645+
"max_window_seconds": (
646+
settings.constraint_over_saturation_max_window_seconds
647+
),
648+
}
641649

642650
@pytest.mark.smoke
643651
def test_constraint_creation_with_mock_constraint(self):
