Load Testing LLM Services
Load testing a web service that wraps an LLM is trickier than it looks. The usual metrics and rules of thumb from traditional load testing don’t quite apply. A checkout endpoint that responds in 50ms behaves nothing like a model inference call that takes 5 seconds on a good day.
Why LLMs are Different
Traditional services optimize for throughput: requests per second. If your API handles 10,000 RPS, you’re probably happy. But LLM inference flips this priority.
A single LLM call might take 1 – 30 seconds depending on model size, prompt length, and whether you’re streaming tokens. At these timescales, throughput numbers become less meaningful. What matters is consistency: does your service degrade gracefully under load, or does it fall off a cliff?
The answer lies in tail latencies.
The Metrics That Matter
When I first ran Locust against my LLM service, the output looked familiar: request counts, response times, failures. But interpreting these numbers for an LLM backend requires a different lens.
Latency Percentiles
| Metric | What It Tells You |
|---|---|
| p50 (Median) | Half your requests finish faster than this |
| p90 | 90% of requests finish faster than this |
| p95, p99 | Tail latencies - how bad it gets for unlucky users |
| Max | Worst-case scenario |
For LLM services, p95 and p99 matter more than the average. Why? Because LLM calls are slow enough that users notice variance. If your median is 3 seconds but p99 is 30 seconds, one in a hundred users waits ten times longer. That’s a problem.
The gap between p50 and p99 tells you about consistency. A small gap means predictable performance. A large gap means your service behaves erratically under load.
Locust provides these metrics out of the box. Here’s how to access them in a test stop listener:
```python
from locust import events

@events.test_stop.add_listener
def on_test_stop(environment, **kwargs):
    stats = environment.stats.total
    print(f"Median (p50): {stats.median_response_time:.0f} ms")
    print(f"p90: {stats.get_response_time_percentile(0.90):.0f} ms")
    print(f"p95: {stats.get_response_time_percentile(0.95):.0f} ms")
    print(f"p99: {stats.get_response_time_percentile(0.99):.0f} ms")
    print(f"Requests/sec: {stats.total_rps:.2f}")
    print(f"Failure rate: {stats.fail_ratio * 100:.2f}%")
```
Throughput
Requests per second (RPS) still matters, but interpret it carefully. With LLM inference, you’re often GPU-bound rather than CPU-bound. A “low” RPS number might be perfectly acceptable if each request involves a 7-billion parameter model.
Failure Rate
Track this as a percentage, not absolute count. Under load, some failures are expected. The question is: at what point does the failure rate spike? That’s your saturation point.
Why Percentiles Jump Around
If you watch Locust’s real-time charts, you might notice p95 spike from 30s to 120s, then drop back to 35s a minute later. This isn’t a bug.
Locust reports real-time percentiles using a sliding window (default: 10 seconds), not cumulative statistics. When a slow request enters the window, p95 jumps. When it ages out, p95 drops.
This effect is amplified at low throughput. With 0.5 RPS and 5-second intervals, each data point contains only 2 – 3 requests. A single queued request dominates the calculation:
```
Interval 1: [25s, 26s, 120s] -> p95 = 120s
Interval 2: [24s, 25s, 27s]  -> p95 = 27s
```
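The jumpiness is easy to reproduce offline. Here’s a minimal nearest-rank percentile, a helper of my own that roughly mimics how a percentile is picked from a small window (not Locust’s exact implementation), applied to the two intervals above:

```python
def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample covering fraction p of the data."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(p * len(ordered)))
    return ordered[idx]

interval_1 = [25, 26, 120]  # one queued request enters the window...
interval_2 = [24, 25, 27]   # ...and ages out again

print(percentile(interval_1, 0.95))  # 120
print(percentile(interval_2, 0.95))  # 27
```

With only three samples per interval, the p95 is just the maximum, which is why a single slow request swings the chart so violently.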
To show cumulative percentiles instead, add this monkey-patch to your locustfile:
```python
from locust.stats import StatsEntry

StatsEntry.get_current_response_time_percentile = StatsEntry.get_response_time_percentile
```
This makes the HTML report show cumulative percentiles that grow monotonically, matching the final statistics table.
Reading the Signs
Raw numbers don’t mean much without context. Here’s a framework I use to interpret load test results for LLM services.
Health Indicators
| Signal | Healthy | Warning | Critical |
|---|---|---|---|
| p99/p50 ratio | < 2x | 2 – 5x | > 5x |
| Failure rate | < 0.1% | 0.1 – 1% | > 1% |
| p95 growth pattern | Linear | Exponential | Flat + failures |
The p99/p50 Ratio
This ratio measures consistency. If your p50 is 3 seconds and p99 is 5 seconds, that’s a ratio of ~1.7x. Most users get a similar experience. But if p99 is 15 seconds (5x ratio), something is queuing up or timing out under load.
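As a trivial sketch, the cutoffs from the health table can be folded into a helper (the function name and bucket labels are mine, the thresholds come from the table above):

```python
def consistency(p50_ms, p99_ms):
    """Bucket the p99/p50 ratio using the health-indicator thresholds."""
    ratio = p99_ms / p50_ms
    if ratio < 2:
        return "healthy"   # most users get a similar experience
    if ratio <= 5:
        return "warning"   # tail is drifting away from the median
    return "critical"      # something is queuing up or timing out

print(consistency(3000, 5000))   # healthy (~1.7x)
print(consistency(3000, 18000))  # critical (6x)
```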
Latency Growth Patterns
As you add concurrent users, watch how p95 changes:
- Linear growth: Latency increases proportionally with load. The system is handling requests fairly, just slower. This is normal.
- Exponential growth: Latency spikes suddenly at some user count. You’ve hit a bottleneck, likely GPU memory, request queue depth, or connection limits.
- Flat latency + rising failures: The system is rejecting requests to protect itself. Better than crashing, but you’ve exceeded capacity.
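One way to tell these three patterns apart programmatically is to compare successive p95 deltas as load steps up. The heuristic below is my own rough sketch, not anything Locust provides:

```python
def classify_growth(p95_by_step, fail_by_step):
    """Rough heuristic: label the latency growth pattern across load steps.

    p95_by_step:  p95 latency (ms) measured at each load step
    fail_by_step: failure ratio measured at each load step
    """
    deltas = [b - a for a, b in zip(p95_by_step, p95_by_step[1:])]
    baseline = p95_by_step[0]
    # Latency barely moves but errors climb: the system is shedding load
    if all(abs(d) < 0.1 * baseline for d in deltas) and fail_by_step[-1] > 0.01:
        return "flat + failures"
    # A late step's delta dwarfs the first one: latency is accelerating
    if deltas and deltas[0] > 0 and max(deltas) > 2 * deltas[0]:
        return "exponential"
    return "linear"

print(classify_growth([3000, 3300, 4500, 9000], [0, 0, 0.001, 0.005]))  # exponential
```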
Finding the Saturation Point
The saturation point is where your service transitions from “handling load” to “struggling.” I look for two signals:
- p95 latency exceeds 2x the baseline (measured at low load)
- Failure rate crosses 1%
Whichever comes first marks your practical capacity limit.
Step Load Testing
To find saturation reliably, don’t slam your service with maximum load immediately. Ramp up gradually. A step load shape works well: add 10 users every 60 seconds, observe metrics at each step.
Locust supports custom load shapes through the LoadTestShape class. Override the tick() method to control how many users are active at any point in time. It returns a tuple of (user_count, spawn_rate) or None to stop the test.
```python
from locust import LoadTestShape

class StepLoadShape(LoadTestShape):
    """Gradually increase users in steps to observe latency degradation."""

    step_users = 10   # users to add per step
    step_time = 60    # seconds per step
    max_users = 100

    def tick(self):
        run_time = self.get_run_time()
        current_step = int(run_time // self.step_time) + 1
        target_users = min(current_step * self.step_users, self.max_users)
        spawn_rate = self.step_users
        return (target_users, spawn_rate)
```
With this shape, users ramp up as: 10 users at 0 – 60s, 20 users at 60 – 120s, 30 users at 120 – 180s, and so on. This reveals exactly where performance degrades.
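You can sanity-check the schedule without running Locust at all. The standalone helper below mirrors the arithmetic in `tick()` and computes the target user count at a given elapsed time:

```python
def users_at(run_time, step_users=10, step_time=60, max_users=100):
    """Target user count at a given elapsed time, mirroring StepLoadShape.tick()."""
    current_step = int(run_time // step_time) + 1
    return min(current_step * step_users, max_users)

print(users_at(0), users_at(60), users_at(125), users_at(600))  # 10 20 30 100
```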
Two Essential Enhancements
A basic Locust script gets you started, but two enhancements make the difference between “ran a load test” and “gathered actionable data.”
First, set up a shared state object to track warmup and saturation:
```python
from dataclasses import dataclass

@dataclass
class BenchmarkState:
    test_start_time: float = 0.0
    warmup_complete: bool = False
    baseline_p95: float | None = None
    saturation_detected: bool = False
    saturation_user_count: int | None = None

BENCHMARK_STATE = BenchmarkState()
```
1. Warmup Period
The first few seconds of a load test are noisy. JIT compilation, connection pool initialization, model loading, and cold caches all inflate early latency measurements. Including this data skews your metrics.
The fix is simple: skip metrics collection during an initial warmup period. Let the system stabilize before you start measuring.
```python
import time

def is_in_warmup(warmup_duration: int = 30) -> bool:
    elapsed = time.time() - BENCHMARK_STATE.test_start_time
    return elapsed < warmup_duration
```
I typically use 30 seconds for warmup. For larger models with slow initialization, you might need 60 seconds or more.
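One way to apply the warmup cutoff is to keep your own sample store and drop anything recorded before the window closes. The class below is an illustrative standalone sketch (the names are mine, not Locust API); in a locustfile you would call `record()` from an `@events.request` listener:

```python
import time

class WarmupFilter:
    """Collect latency samples, discarding any recorded during warmup."""

    def __init__(self, warmup_duration=30.0):
        self.start = time.time()
        self.warmup_duration = warmup_duration
        self.samples = []

    def record(self, response_time_ms):
        if time.time() - self.start < self.warmup_duration:
            return False  # still warming up: drop the sample
        self.samples.append(response_time_ms)
        return True
```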
2. Saturation Detection
Staring at scrolling metrics, waiting to see when things go wrong, is tedious and error-prone. Better to let the test tell you when saturation occurs.
Track baseline latency at low load, then alert when p95 exceeds a threshold ratio:
```python
def check_saturation(stats, user_count, latency_threshold=2.0, error_threshold=0.01):
    current_p95 = stats.get_response_time_percentile(0.95)

    # Establish the baseline once we reach low-but-real load;
    # until then, there is nothing to compare against
    if BENCHMARK_STATE.baseline_p95 is None:
        if user_count >= 10:
            BENCHMARK_STATE.baseline_p95 = current_p95
        return

    # Check thresholds against the baseline
    latency_ratio = current_p95 / BENCHMARK_STATE.baseline_p95
    if latency_ratio > latency_threshold or stats.fail_ratio > error_threshold:
        BENCHMARK_STATE.saturation_detected = True
        BENCHMARK_STATE.saturation_user_count = user_count
```
Now the test records when you’ve hit the wall and at what user count.
Configuring Load Profiles
The right load profile depends on what you’re testing. For stress testing and finding breaking points, I use aggressive settings: 50 users per step, 200 max users, and 15 minutes runtime.
Key Parameters
- step_users: How many users to add per step. Smaller increments give finer granularity but take longer.
- step_time: How long to hold each step. For LLMs, 60 seconds minimum. You need enough requests at each level to get stable percentile measurements.
- max_users: Upper bound. Set this higher than your expected capacity to actually find the saturation point.
- wait_time: Time between requests per user. For LLM services, 0.1 – 0.5 seconds simulates realistic usage. Going lower creates artificial burst load and tests “how fast can my service fail” rather than “how many concurrent users can my service handle.”
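A back-of-envelope model shows why wait_time matters so much. Locust is a closed-loop load generator: each simulated user waits for its response before sending the next request, so offered throughput is roughly users divided by (response time + wait time). The helper below is my own illustration of that arithmetic:

```python
def offered_rps(users, avg_response_s, avg_wait_s):
    """Approximate steady-state request rate for a closed-loop load test."""
    return users / (avg_response_s + avg_wait_s)

# 50 users, 5s average LLM response, 0.3s wait between requests
print(round(offered_rps(50, 5.0, 0.3), 1))  # 9.4
```

With 5-second responses dominating the denominator, shrinking the wait time barely raises offered load; dropping it to near zero only synchronizes the users into artificial bursts.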
Conclusion
Load testing LLM services requires a different mindset. Focus on tail latencies (p95, p99) rather than throughput, use step load shapes to find your saturation point, and add a warmup period to avoid cold-start bias. The goal isn’t to maximize requests per second. It’s to understand your service’s limits and ensure predictable performance up to those limits.