Load Testing LLM Services
Load testing a web service that wraps an LLM is trickier than it looks. The usual metrics and rules of thumb from traditional load testing don’t quite apply. A checkout endpoint that responds in 50ms behaves nothing like a model inference call that takes 5 seconds on a good day.
Why LLMs are Different
Traditional services optimize for throughput: requests per second. If your API handles 10,000 RPS, you’re probably happy. But LLM inference flips this priority.
A single LLM call might take 1 – 30 seconds depending on model size, prompt length, and whether you’re streaming tokens. At these timescales, throughput numbers become less meaningful. What matters is consistency: does your service degrade gracefully under load, or does it fall off a cliff?
The answer lies in tail latencies.
The Metrics That Matter
When I first ran Locust against my LLM service, the output looked familiar: request counts, response times, failures. But interpreting these numbers for an LLM backend requires a different lens.
Latency Percentiles
| Metric | What It Tells You |
|---|---|
| p50 (Median) | Half your requests finish faster than this |
| p90 | 90% of requests finish faster than this |
| p95, p99 | Tail latencies - how bad it gets for unlucky users |
| Max | Worst-case scenario |
For LLM services, p95 and p99 matter more than the average. Why? Because LLM calls are slow enough that users notice variance. If your median is 3 seconds but p99 is 30 seconds, one in a hundred users waits ten times longer. That’s a problem.
The gap between p50 and p99 tells you about consistency. A small gap means predictable performance. A large gap means your service behaves erratically under load.
Locust provides these metrics out of the box. Here’s how to access them in a test stop listener:
```python
from locust import events

@events.test_stop.add_listener
def on_test_stop(environment, **kwargs):
    stats = environment.stats.total
    print(f"Median (p50): {stats.median_response_time:.0f} ms")
    print(f"p90: {stats.get_response_time_percentile(0.90):.0f} ms")
    print(f"p95: {stats.get_response_time_percentile(0.95):.0f} ms")
    print(f"p99: {stats.get_response_time_percentile(0.99):.0f} ms")
    print(f"Requests/sec: {stats.total_rps:.2f}")
    print(f"Failure rate: {stats.fail_ratio * 100:.2f}%")
```
Throughput
Requests per second (RPS) still matters, but interpret it carefully. With LLM inference, you’re often GPU-bound rather than CPU-bound. A “low” RPS number might be perfectly acceptable if each request involves a 7-billion parameter model.
Failure Rate
Track this as a percentage, not absolute count. Under load, some failures are expected. The question is: at what point does the failure rate spike? That’s your saturation point.
Why Percentiles Jump Around
If you watch Locust’s real-time charts, you might notice p95 spike from 30s to 120s, then drop back to 35s a minute later. This isn’t a bug.
Locust reports real-time percentiles using a sliding window (default: 10 seconds), not cumulative statistics. When a slow request enters the window, p95 jumps. When it ages out, p95 drops.
This effect is amplified at low throughput. With 0.5 RPS and 5-second intervals, each data point contains only 2 – 3 requests. A single queued request dominates the calculation:
```
Interval 1: [25s, 26s, 120s] -> p95 = 120s
Interval 2: [24s, 25s, 27s]  -> p95 = 27s
```
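The jumpiness is easy to reproduce offline. Here’s a minimal nearest-rank percentile, a helper of my own that roughly mimics how a percentile is picked from a small window (not Locust’s exact implementation), applied to the two intervals above:

```python
def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample covering fraction p of the data."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(p * len(ordered)))
    return ordered[idx]

interval_1 = [25, 26, 120]  # one queued request enters the window...
interval_2 = [24, 25, 27]   # ...and ages out again

print(percentile(interval_1, 0.95))  # 120
print(percentile(interval_2, 0.95))  # 27
```

With only three samples per interval, the p95 is just the maximum, which is why a single slow request swings the chart so violently.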
To show cumulative percentiles instead, add this monkey-patch to your locustfile:
```python
from locust.stats import StatsEntry

StatsEntry.get_current_response_time_percentile = StatsEntry.get_response_time_percentile
```
This makes the HTML report show cumulative percentiles that grow monotonically, matching the final statistics table.
Reading the Signs
Raw numbers don’t mean much without context. Here’s a framework I use to interpret load test results for LLM services.
Health Indicators
| Signal | Healthy | Warning | Critical |
|---|---|---|---|
| p99/p50 ratio | < 2x | 2 – 5x | > 5x |
| Failure rate | < 0.1% | 0.1 – 1% | > 1% |
| p95 growth pattern | Linear | Exponential | Flat + failures |
The p99/p50 Ratio
This ratio measures consistency. If your p50 is 3 seconds and p99 is 5 seconds, that’s a ratio of ~1.7x. Most users get a similar experience. But if p99 is 15 seconds (5x ratio), something is queuing up or timing out under load.
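As a trivial sketch, the cutoffs from the health table can be folded into a helper (the function name and bucket labels are mine, the thresholds come from the table above):

```python
def consistency(p50_ms, p99_ms):
    """Bucket the p99/p50 ratio using the health-indicator thresholds."""
    ratio = p99_ms / p50_ms
    if ratio < 2:
        return "healthy"   # most users get a similar experience
    if ratio <= 5:
        return "warning"   # tail is drifting away from the median
    return "critical"      # something is queuing up or timing out

print(consistency(3000, 5000))   # healthy (~1.7x)
print(consistency(3000, 18000))  # critical (6x)
```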
Latency Growth Patterns
As you add concurrent users, watch how p95 changes:
- Linear growth: Latency increases proportionally with load. The system is handling requests fairly, just slower. This is normal.
- Exponential growth: Latency spikes suddenly at some user count. You’ve hit a bottleneck, likely GPU memory, request queue depth, or connection limits.
- Flat latency + rising failures: The system is rejecting requests to protect itself. Better than crashing, but you’ve exceeded capacity.
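One way to tell these three patterns apart programmatically is to compare successive p95 deltas as load steps up. The heuristic below is my own rough sketch, not anything Locust provides:

```python
def classify_growth(p95_by_step, fail_by_step):
    """Rough heuristic: label the latency growth pattern across load steps.

    p95_by_step:  p95 latency (ms) measured at each load step
    fail_by_step: failure ratio measured at each load step
    """
    deltas = [b - a for a, b in zip(p95_by_step, p95_by_step[1:])]
    baseline = p95_by_step[0]
    # Latency barely moves but errors climb: the system is shedding load
    if all(abs(d) < 0.1 * baseline for d in deltas) and fail_by_step[-1] > 0.01:
        return "flat + failures"
    # A late step's delta dwarfs the first one: latency is accelerating
    if deltas and deltas[0] > 0 and max(deltas) > 2 * deltas[0]:
        return "exponential"
    return "linear"

print(classify_growth([3000, 3300, 4500, 9000], [0, 0, 0.001, 0.005]))  # exponential
```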
Finding the Saturation Point
The saturation point is where your service transitions from “handling load” to “struggling.” I look for two signals:
- p95 latency exceeds 2x the baseline (measured at low load)
- Failure rate crosses 1%
Whichever comes first marks your practical capacity limit.
Step Load Testing
To find saturation reliably, don’t slam your service with maximum load immediately. Ramp up gradually. A step load shape works well: add 10 users every 60 seconds, observe metrics at each step.
Locust supports custom load shapes through the LoadTestShape class. Override the tick() method to control how many users are active at any point in time. It returns a tuple of (user_count, spawn_rate) or None to stop the test.
```python
from locust import LoadTestShape

class StepLoadShape(LoadTestShape):
    """Gradually increase users in steps to observe latency degradation."""

    step_users = 10   # users to add per step
    step_time = 60    # seconds per step
    max_users = 100

    def tick(self):
        run_time = self.get_run_time()
        current_step = int(run_time // self.step_time) + 1
        target_users = min(current_step * self.step_users, self.max_users)
        spawn_rate = self.step_users
        return (target_users, spawn_rate)
```
With this shape, users ramp up as: 10 users at 0 – 60s, 20 users at 60 – 120s, 30 users at 120 – 180s, and so on. This reveals exactly where performance degrades.
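You can sanity-check the schedule without running Locust at all. The standalone helper below mirrors the arithmetic in `tick()` and computes the target user count at a given elapsed time:

```python
def users_at(run_time, step_users=10, step_time=60, max_users=100):
    """Target user count at a given elapsed time, mirroring StepLoadShape.tick()."""
    current_step = int(run_time // step_time) + 1
    return min(current_step * step_users, max_users)

print(users_at(0), users_at(60), users_at(125), users_at(600))  # 10 20 30 100
```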
Two Essential Enhancements
A basic Locust script gets you started, but two enhancements make the difference between “ran a load test” and “gathered actionable data.”
First, set up a shared state object to track warmup and saturation:
```python
from dataclasses import dataclass

@dataclass
class BenchmarkState:
    test_start_time: float = 0.0
    warmup_complete: bool = False
    baseline_p95: float | None = None
    saturation_detected: bool = False
    saturation_user_count: int | None = None

BENCHMARK_STATE = BenchmarkState()
```
1. Warmup Period
The first few seconds of a load test are noisy. JIT compilation, connection pool initialization, model loading, and cold caches all inflate early latency measurements. Including this data skews your metrics.
The fix is simple: skip metrics collection during an initial warmup period. Let the system stabilize before you start measuring.
```python
import time

def is_in_warmup(warmup_duration: int = 30) -> bool:
    elapsed = time.time() - BENCHMARK_STATE.test_start_time
    return elapsed < warmup_duration
```
I typically use 30 seconds for warmup. For larger models with slow initialization, you might need 60 seconds or more.
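One way to apply the warmup cutoff is to keep your own sample store and drop anything recorded before the window closes. The class below is an illustrative standalone sketch (the names are mine, not Locust API); in a locustfile you would call `record()` from an `@events.request` listener:

```python
import time

class WarmupFilter:
    """Collect latency samples, discarding any recorded during warmup."""

    def __init__(self, warmup_duration=30.0):
        self.start = time.time()
        self.warmup_duration = warmup_duration
        self.samples = []

    def record(self, response_time_ms):
        if time.time() - self.start < self.warmup_duration:
            return False  # still warming up: drop the sample
        self.samples.append(response_time_ms)
        return True
```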
2. Saturation Detection
Staring at scrolling metrics, waiting to see when things go wrong, is tedious and error-prone. Better to let the test tell you when saturation occurs.
Track baseline latency at low load, then alert when p95 exceeds a threshold ratio:
```python
def check_saturation(stats, user_count, latency_threshold=2.0, error_threshold=0.01):
    current_p95 = stats.get_response_time_percentile(0.95)

    # Establish the baseline once we reach low-but-real load;
    # until then, there is nothing to compare against
    if BENCHMARK_STATE.baseline_p95 is None:
        if user_count >= 10:
            BENCHMARK_STATE.baseline_p95 = current_p95
        return

    # Check thresholds against the baseline
    latency_ratio = current_p95 / BENCHMARK_STATE.baseline_p95
    if latency_ratio > latency_threshold or stats.fail_ratio > error_threshold:
        BENCHMARK_STATE.saturation_detected = True
        BENCHMARK_STATE.saturation_user_count = user_count
```
Now the test records when you’ve hit the wall and at what user count.
Configuring Load Profiles
The right load profile depends on what you’re testing. For stress testing and finding breaking points, I use aggressive settings: 50 users per step, 200 max users, and 15 minutes runtime.
Key Parameters
- step_users: How many users to add per step. Smaller increments give finer granularity but take longer.
- step_time: How long to hold each step. For LLMs, 60 seconds minimum. You need enough requests at each level to get stable percentile measurements.
- max_users: Upper bound. Set this higher than your expected capacity to actually find the saturation point.
- wait_time: Time between requests per user. For LLM services, 0.1 – 0.5 seconds simulates realistic usage. Going lower creates artificial burst load and tests “how fast can my service fail” rather than “how many concurrent users can my service handle.”
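A back-of-envelope model shows why wait_time matters so much. Locust is a closed-loop load generator: each simulated user waits for its response before sending the next request, so offered throughput is roughly users divided by (response time + wait time). The helper below is my own illustration of that arithmetic:

```python
def offered_rps(users, avg_response_s, avg_wait_s):
    """Approximate steady-state request rate for a closed-loop load test."""
    return users / (avg_response_s + avg_wait_s)

# 50 users, 5s average LLM response, 0.3s wait between requests
print(round(offered_rps(50, 5.0, 0.3), 1))  # 9.4
```

With 5-second responses dominating the denominator, shrinking the wait time barely raises offered load; dropping it to near zero only synchronizes the users into artificial bursts.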
Conclusion
Load testing LLM services requires a different mindset. Focus on tail latencies (p95, p99) rather than throughput, use step load shapes to find your saturation point, and add a warmup period to avoid cold-start bias. The goal isn’t to maximize requests per second. It’s to understand your service’s limits and ensure predictable performance up to those limits.