Load Testing LLM Services

Load testing a web service that wraps an LLM is deceptively tricky. The usual metrics and rules of thumb from traditional load testing don’t quite apply. A checkout endpoint that responds in 50ms behaves nothing like a model inference call that takes 5 seconds on a good day.

Why LLMs are Different

Traditional services optimize for throughput: requests per second. If your API handles 10,000 RPS, you’re probably happy. But LLM inference flips this priority.

A single LLM call might take 1–30 seconds depending on model size, prompt length, and whether you’re streaming tokens. At these timescales, throughput numbers become less meaningful. What matters is consistency: does your service degrade gracefully under load, or does it fall off a cliff?

The answer lies in tail latencies.

The Metrics That Matter

When I first ran Locust against my LLM service, the output looked familiar: request counts, response times, failures. But interpreting these numbers for an LLM backend requires a different lens.

Latency Percentiles

Metric        What It Tells You
p50 (Median)  Half your requests finish faster than this
p90           90% of requests finish faster than this
p95, p99      Tail latencies: how bad it gets for unlucky users
Max           Worst-case scenario

For LLM services, p95 and p99 matter more than the average. Why? Because LLM calls are slow enough that users notice variance. If your median is 3 seconds but p99 is 30 seconds, one in a hundred users waits ten times longer. That’s a problem.

The gap between p50 and p99 tells you about consistency. A small gap means predictable performance. A large gap means your service behaves erratically under load.

Locust provides these metrics out of the box. Here’s how to access them in a test stop listener:

from locust import events

@events.test_stop.add_listener
def on_test_stop(environment, **kwargs):
    # Aggregated stats across all request types
    stats = environment.stats.total

    print(f"Median (p50): {stats.median_response_time:.0f} ms")
    print(f"p90: {stats.get_response_time_percentile(0.90):.0f} ms")
    print(f"p95: {stats.get_response_time_percentile(0.95):.0f} ms")
    print(f"p99: {stats.get_response_time_percentile(0.99):.0f} ms")
    print(f"Requests/sec: {stats.total_rps:.2f}")
    print(f"Failure rate: {stats.fail_ratio * 100:.2f}%")

Throughput

Requests per second (RPS) still matters, but interpret it carefully. With LLM inference, you’re often GPU-bound rather than CPU-bound. A “low” RPS number might be perfectly acceptable if each request involves a 7-billion-parameter model.

Failure Rate

Track this as a percentage, not an absolute count. Under load, some failures are expected. The question is: at what point does the failure rate spike? That’s your saturation point.

Why Percentiles Jump Around

If you watch Locust’s real-time charts, you might notice p95 spike from 30s to 120s, then drop back to 35s a minute later. This isn’t a bug.

Locust reports real-time percentiles using a sliding window (default: 10 seconds), not cumulative statistics. When a slow request enters the window, p95 jumps. When it ages out, p95 drops.

This effect is amplified at low throughput. With 0.5 RPS and 5-second intervals, each data point contains only 2–3 requests. A single queued request dominates the calculation:

Interval 1: [25s, 26s, 120s] -> P95 = 120s
Interval 2: [24s, 25s, 27s]  -> P95 = 27s
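The effect is easy to reproduce with a toy nearest-rank percentile function (my own sketch; Locust’s internal bucketing and rounding differ, but the instability is the same):

```python
def nearest_rank_percentile(samples: list[float], q: float) -> float:
    """Toy nearest-rank percentile: with a tiny window, a single
    outlier decides the result."""
    ordered = sorted(samples)
    index = min(int(q * len(ordered)), len(ordered) - 1)
    return ordered[index]

# The two consecutive 3-request windows from the example above:
nearest_rank_percentile([25, 26, 120], 0.95)  # one queued request -> 120
nearest_rank_percentile([24, 25, 27], 0.95)   # it ages out -> 27
```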

To show cumulative percentiles instead, add this monkey-patch to your locustfile:

from locust.stats import StatsEntry
StatsEntry.get_current_response_time_percentile = StatsEntry.get_response_time_percentile

This makes the HTML report show cumulative percentiles that grow monotonically, matching the final statistics table.

Reading the Signs

Raw numbers don’t mean much without context. Here’s a framework I use to interpret load test results for LLM services.

Health Indicators

Signal              Healthy    Warning        Critical
p99/p50 ratio       < 2x       2–5x           > 5x
Failure rate        < 0.1%     0.1–1%         > 1%
p95 growth pattern  Linear     Exponential    Flat + failures

The p99/​p50 Ratio

This ratio measures consistency. If your p50 is 3 seconds and p99 is 5 seconds, that’s a ratio of ~1.7x. Most users get a similar experience. But if p99 is 15 seconds (a 5x ratio), something is queuing up or timing out under load.
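The thresholds from the health-indicator table fold into a small helper (the function name and labels are mine):

```python
def consistency_rating(p50_ms: float, p99_ms: float) -> str:
    """Classify tail-latency consistency by the p99/p50 ratio,
    using the < 2x / 2-5x / > 5x bands from the table above."""
    ratio = p99_ms / p50_ms
    if ratio < 2:
        return "healthy"
    if ratio <= 5:
        return "warning"
    return "critical"

consistency_rating(3000, 5000)    # ~1.7x -> "healthy"
consistency_rating(3000, 15000)   # 5x, at the boundary -> "warning"
```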

Latency Growth Patterns

As you add concurrent users, watch how p95 changes:

  • Linear growth: Latency increases proportionally with load. The system is handling requests fairly, just slower. This is normal.
  • Exponential growth: Latency spikes suddenly at some user count. You’ve hit a bottleneck, likely GPU memory, request queue depth, or connection limits.
  • Flat latency + rising failures: The system is rejecting requests to protect itself. Better than crashing, but you’ve exceeded capacity.

Finding the Saturation Point

The saturation point is where your service transitions from “handling load” to “struggling.” I look for two signals:

  1. p95 latency exceeds 2x the baseline (measured at low load)
  2. Failure rate crosses 1%

Whichever comes first marks your practical capacity limit.

Step Load Testing

To find saturation reliably, don’t slam your service with maximum load immediately. Ramp up gradually. A step load shape works well: add 10 users every 60 seconds, and observe metrics at each step.

Locust supports custom load shapes through the LoadTestShape class. Override the tick() method to control how many users are active at any point in time. It returns a tuple of (user_count, spawn_rate), or None to stop the test.

from locust import LoadTestShape

class StepLoadShape(LoadTestShape):
    """Gradually increase users in steps to observe latency degradation."""

    step_users = 10   # users to add per step
    step_time = 60    # seconds per step
    max_users = 100

    def tick(self):
        run_time = self.get_run_time()
        current_step = int(run_time // self.step_time) + 1
        target_users = min(current_step * self.step_users, self.max_users)
        spawn_rate = self.step_users
        # Holds at max_users once reached; end the test with --run-time
        # (or return None here to stop it from the shape itself)
        return (target_users, spawn_rate)

With this shape, users ramp up as: 10 users at 0–60s, 20 users at 60–120s, 30 users at 120–180s, and so on. This reveals exactly where performance degrades.
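After a step run with `--csv <prefix>`, the per-step p95 can be pulled from the stats-history file Locust writes. The column names below (“Name”, “User Count”, “95%”) match Locust’s CSV output as I understand it; verify them against your version:

```python
import csv

def p95_by_user_count(history_csv_path: str) -> dict[int, float]:
    """Map each user-count step to the last p95 reported at that step,
    reading Locust's *_stats_history.csv ("Aggregated" rows only)."""
    result: dict[int, float] = {}
    with open(history_csv_path, newline="") as f:
        for row in csv.DictReader(f):
            if row.get("Name") != "Aggregated":
                continue
            try:
                result[int(row["User Count"])] = float(row["95%"])
            except (KeyError, ValueError):
                continue  # early rows can report "N/A"
    return result
```

Plotting these pairs (user count on x, p95 on y) makes the linear-vs-exponential growth patterns from the previous section obvious.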

Two Essential Enhancements

A basic Locust script gets you started, but two enhancements make the difference between “ran a load test” and “gathered actionable data.”

First, set up a shared state object to track warmup and saturation:

from dataclasses import dataclass

@dataclass
class BenchmarkState:
    """Shared state for warmup tracking and saturation detection."""
    test_start_time: float = 0.0
    warmup_complete: bool = False
    baseline_p95: float | None = None
    saturation_detected: bool = False
    saturation_user_count: int | None = None

BENCHMARK_STATE = BenchmarkState()

1. Warmup Period

The first few seconds of a load test are noisy. JIT compilation, connection pool initialization, model loading, and cold caches all inflate early latency measurements. Including this data skews your metrics.

The fix is simple: skip metrics collection during an initial warmup period. Let the system stabilize before you start measuring.

import time

def is_in_warmup(warmup_duration: int = 30) -> bool:
    """True while the test is still inside its warmup window."""
    elapsed = time.time() - BENCHMARK_STATE.test_start_time
    return elapsed < warmup_duration

I typically use 30 seconds for warmup. For larger models with slow initialization, you might need 60 seconds or more.
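In a locustfile, the natural place to apply this check is a per-request hook; the gating logic itself is simple enough to verify in isolation. The snippet below is a self-contained sketch (`WARMUP_STATE` stands in for the shared `BenchmarkState` above):

```python
import time
from dataclasses import dataclass, field

@dataclass
class _WarmupState:  # minimal stand-in for the shared BenchmarkState
    test_start_time: float = field(default_factory=time.time)

WARMUP_STATE = _WarmupState()
SAMPLES: list[float] = []

def record_sample(response_time_ms: float, warmup_duration: float = 30.0) -> bool:
    """Keep a latency sample only after the warmup window has elapsed.
    Returns True if the sample was recorded."""
    if time.time() - WARMUP_STATE.test_start_time < warmup_duration:
        return False  # still warming up: discard
    SAMPLES.append(response_time_ms)
    return True
```

In Locust, a function like this would be called from an `@events.request.add_listener` hook, which fires once per completed request.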

2. Saturation Detection

Staring at scrolling metrics, waiting to see when things go wrong, is tedious and error-prone. Better to let the test tell you when saturation occurs.

Track baseline latency at low load, then alert when p95 exceeds a threshold ratio:

def check_saturation(stats, user_count, latency_threshold=2.0, error_threshold=0.01):
    current_p95 = stats.get_response_time_percentile(0.95)

    # Establish baseline at the first low-load step; until then, do nothing
    if BENCHMARK_STATE.baseline_p95 is None:
        if user_count >= 10:
            BENCHMARK_STATE.baseline_p95 = current_p95
        return

    # Check thresholds against the baseline
    latency_ratio = current_p95 / BENCHMARK_STATE.baseline_p95
    if latency_ratio > latency_threshold or stats.fail_ratio > error_threshold:
        BENCHMARK_STATE.saturation_detected = True
        BENCHMARK_STATE.saturation_user_count = user_count
Now the test records when you’ve hit the wall and at what user count.

Configuring Load Profiles

The right load profile depends on what you’re testing. For stress testing and finding breaking points, I use aggressive settings: 50 users per step, 200 max users, and 15 minutes of runtime.

Key Parameters

  • step_users: How many users to add per step. Smaller increments give finer granularity but take longer.
  • step_time: How long to hold each step. For LLMs, 60 seconds minimum. You need enough requests at each level to get stable percentile measurements.
  • max_users: Upper bound. Set this higher than your expected capacity so you actually find the saturation point.
  • wait_time: Time between requests per user. For LLM services, 0.1–0.5 seconds simulates realistic usage. Going lower creates artificial burst load and tests “how fast can my service fail” rather than “how many concurrent users can my service handle.”
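Note that wait_time interacts with response time: each simulated user is a closed loop, issuing its next request only after the previous one finishes plus the wait. A back-of-envelope estimate of the offered load (my own arithmetic, not a Locust API):

```python
def expected_rps(users: int, avg_response_s: float, avg_wait_s: float) -> float:
    """Closed-loop throughput estimate: each user completes one request
    every (response time + wait time) seconds."""
    return users / (avg_response_s + avg_wait_s)

# 50 users against a ~4.7 s LLM endpoint with ~0.3 s average wait:
expected_rps(50, 4.7, 0.3)  # roughly 10 RPS offered
```

This is why a tiny wait_time barely raises offered load when responses already take seconds; the response time dominates the denominator.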

Conclusion

Load testing LLM services requires a different mindset. Focus on tail latencies (p95, p99) rather than throughput, use step load shapes to find your saturation point, and add a warmup period to avoid cold-start bias. The goal isn’t to maximize requests per second. It’s to understand your service’s limits and ensure predictable performance up to those limits.