Agent latency is usually a prompt-construction problem in disguise: prefix caching, continuous batching, and tool-call structure determine how much of your context gets reprocessed on every turn.
Most agent latency post-mortems start in the wrong place. Someone profiles token generation speed, blames the model, maybe switches to a "faster" one, and the p50 barely moves. The actual bottleneck is usually upstream of the model entirely: how the serving layer handles the specific, repetitive shape of an agent loop's prompt. If you don't understand prefix caching, continuous batching, and how they interact with the way agent frameworks construct prompts, you're optimizing the wrong 20% of the pipeline.
This post is about that interaction — what happens between "agent decides to call a tool" and "next token arrives" — and what it implies for how you should structure agent loops if you care about latency and cost at scale.
A chat completion is one request: system prompt, a handful of user/assistant turns, done. An agent loop is a sequence of requests where each one resends almost everything from the last, plus a small increment:
    Turn 1: [system][tools][user_msg] -> assistant calls tool_a
    Turn 2: [system][tools][user_msg][tool_a_call][tool_a_result] -> assistant calls tool_b
    Turn 3: [system][tools][user_msg][tool_a_call][tool_a_result][tool_b_call][tool_b_result] -> final answer
If the agent takes 8 tool calls to finish a task and the context has grown to 6K tokens by the end, you're not paying for 6K tokens once: a naive implementation reprocesses the entire growing prefix on every single turn. That's roughly O(n²) total tokens processed across the loop, not O(n). For a 10-step agent run whose context grows steadily to a 10K-token final prompt, the per-turn prompts add up to 50K+ tokens of prefill, most of which has nothing to do with generating new content. It's the model re-reading things it already read three turns ago.
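The arithmetic is easy to check. The sketch below assumes a hypothetical run whose prompt grows by exactly 1,000 tokens per turn, and compares a server that re-prefills everything against an idealized cache that only prefills the newly added tokens. The function name and the growth pattern are illustrative, not taken from any particular framework.

```python
def prefill_tokens(context_sizes, cached=False):
    """Total prompt tokens the server must prefill across an agent loop.

    context_sizes[i] is the prompt length (in tokens) sent on turn i.
    With cached=True we assume an ideal prefix cache: each turn only
    prefills the tokens added since the previous turn.
    """
    if not cached:
        return sum(context_sizes)
    total = 0
    previous = 0
    for size in context_sizes:
        total += size - previous
        previous = size
    return total


# Hypothetical 10-turn run: 1K tokens on turn 1, 10K tokens on turn 10.
sizes = [1_000 * turn for turn in range(1, 11)]
naive = prefill_tokens(sizes)               # every turn re-reads its whole prompt
ideal = prefill_tokens(sizes, cached=True)  # every turn reads only the new suffix
print(naive, ideal)  # 55000 10000
```

Real caches fall somewhere between the two numbers, since entries get evicted and the first request always misses.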
This is the part that doesn't show up in "tokens per second" benchmarks, because those benchmark decode speed, not prefill reuse. And for agent workloads, prefill dominates.
Modern inference servers — vLLM's automatic prefix caching, SGLang's RadixAttention, and the equivalent mechanisms behind hosted APIs — solve this by caching the KV (key/value) tensors for token sequences they've already computed, keyed by prefix. If turn 3's prompt starts with the exact same 4,000 tokens as turn 2's prompt, the server doesn't recompute attention for those 4,000 tokens. It looks up the cached KV state and starts computing from the first token that differs.
Anthropic's API exposes a version of this directly via cache_control breakpoints — you mark where a stable prefix ends, and repeated calls that share that prefix get a large discount and lower latency on the cached portion:
    {
      "system": [
        {
          "type": "text",
          "text": "You are an agent with access to the following tools...",
          "cache_control": { "type": "ephemeral" }
        }
      ],
      "messages": [ /* growing conversation history */ ]
    }
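Self-hosted servers such as vLLM, SGLang, and TGI do this automatically once prefix caching is enabled, with no explicit annotation, though the data structures differ: vLLM's automatic prefix caching hashes fixed-size blocks of tokens, while SGLang's RadixAttention keeps a radix tree over token sequences. The effect is the same. Any two requests that share a prefix (for block-based caches, down to a block boundary) reuse the cached KV state for that prefix, regardless of which client sent them.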
Every turn shares the root (system prompt, tool definitions) and increasingly large portions of the growing history. Only the newest suffix — the part the server has never seen — requires fresh prefill compute. In a well-structured agent loop, that suffix is small: a tool result plus a short assistant response. In a poorly structured one, cache hit rate collapses to near zero, and you're back to full O(n²) reprocessing.
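To make the block-level matching concrete, here is a toy model of a block-hash prefix cache. It is a simplified sketch in the spirit of block-based caching, not the actual implementation of any server: the block size, the class name, and the use of SHA-256 are all assumptions for illustration, and real caches also handle eviction and memory limits.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per cache block; illustrative value


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Hash each full block chained with the hash of everything before it."""
    hashes = []
    parent = b""
    for i in range(len(tokens) // block_size):
        block = tokens[i * block_size:(i + 1) * block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes


class ToyPrefixCache:
    """Remembers which block hashes have been computed before."""

    def __init__(self):
        self.seen = set()

    def process(self, tokens):
        """Return (cached_tokens, fresh_tokens) and record the new blocks."""
        hashes = block_hashes(tokens)
        hit_blocks = 0
        for h in hashes:
            if h not in self.seen:
                break  # first miss ends the reusable prefix
            hit_blocks += 1
        self.seen.update(hashes)
        cached = hit_blocks * BLOCK_SIZE
        return cached, len(tokens) - cached


cache = ToyPrefixCache()
turn2 = list(range(100))                     # stand-in token IDs
turn3 = turn2 + list(range(1000, 1040))      # same prefix plus a new suffix
first = cache.process(turn2)                 # cold cache: nothing reusable
second = cache.process(turn3)                # six full blocks (96 tokens) reused
shifted = cache.process([-1] + turn3)        # one token prepended: every block misses
```

Because each hash is chained to the one before it, a single changed token at the front invalidates every block after it, which is why the `shifted` request gets no reuse at all.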
The single biggest lever here is prefix stability: content that doesn't change should come first, and content that changes should come last. This sounds obvious written down, but it's routinely violated:
A common offender: a "Current time: 14:32:07" line at the top of the system prompt means nothing after it can ever hit the cache, because that line changes on every request. Move volatile metadata to the end of the prompt, or omit it if the model doesn't need second-level precision.

None of this requires framework changes. It's discipline about what goes where in the prompt, informed by understanding that the server is doing literal prefix matching, not semantic matching.
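One way to enforce that discipline is to funnel every request through a single prompt builder. The sketch below is a minimal illustration: `build_prompt`, its arguments, and the plain newline joins are hypothetical (a real framework would render messages through the model's chat template), and it compares characters as a proxy for tokens. It serializes tools in a deterministic order and puts the volatile line last.

```python
import json


def build_prompt(system, tools, history, volatile):
    """Assemble a prompt with stable content first and volatile content last."""
    # Sort tools by name and serialize with sorted keys so the rendered
    # text is byte-identical across requests, whatever order the caller used.
    tool_text = "\n".join(
        json.dumps(tool, sort_keys=True)
        for tool in sorted(tools, key=lambda t: t["name"])
    )
    return "\n".join([system, tool_text, *history, volatile])


def shared_prefix_len(a, b):
    """Number of leading characters two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


tools = [
    {"name": "search", "description": "Look something up"},
    {"name": "convert_units", "description": "Convert between units"},
]
stamp_a, stamp_b = "Current time: 14:32:07", "Current time: 14:32:09"

# Volatile line last: the two turns share everything up to the new history.
turn_a = build_prompt("You are an agent.", tools,
                      ["user: convert 12.5 kg"], stamp_a)
turn_b = build_prompt("You are an agent.", list(reversed(tools)),
                      ["user: convert 12.5 kg", "tool: 27.56 lb"], stamp_b)
good = shared_prefix_len(turn_a, turn_b)

# Volatile line first: the shared prefix ends inside the timestamp.
bad = shared_prefix_len(stamp_a + "\n" + turn_a, stamp_b + "\n" + turn_b)
```

Here `good` covers the system prompt, the tool definitions, and the first user message, while `bad` stops partway through the timestamp.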
The other serving-layer mechanism worth understanding is continuous batching (vLLM, TGI, and most production servers use some variant). Instead of batching a fixed group of requests and running them in lockstep through decode, the server admits new requests into the in-flight batch as soon as capacity frees up and retires finished ones, keeping the GPU busy without waiting for the slowest request in a static batch.
This works great for high-throughput chat traffic: lots of independent, concurrently-arriving requests to interleave. It works poorly for a single agent's sequential loop, because there's nothing to interleave — request 2 literally cannot be formed until request 1's response comes back and the tool executes. From the server's perspective, one agent running solo looks like a trickle of requests with idle gaps between them, not a dense stream it can pack efficiently.
| Workload shape | Typical GPU utilization | Dominant cost | What helps |
|---|---|---|---|
| High-concurrency chat (many independent users) | High — continuous batching packs requests densely | Decode throughput | Larger batch sizes, more concurrent load |
| Single agent, sequential tool loop | Low — gaps while tools execute, no requests to interleave | Prefill of re-sent context | Prefix caching, minimizing re-sent tokens |
| Many agents running concurrently (fleet) | Moderate-to-high — agents' idle gaps get filled by other agents' work | Mix of both | Prefix caching and enough concurrent agents to keep the batch full |
| Agent loop with parallel tool calls | Higher than sequential — multiple in-flight branches per agent | Prefill, but amortized | Structuring loops to fan out tool calls instead of chaining them |
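The last row of the table is the one most directly under the agent author's control. As a rough sketch of the difference, the code below times three independent tool calls run one after another against the same calls fanned out with `asyncio.gather`. The `fake_tool` coroutine and its sleep durations are stand-ins for real network-bound tools, and fanning out is only valid when the calls do not depend on each other's results.

```python
import asyncio
import time


async def fake_tool(name, seconds):
    """Stand-in for a network-bound tool call (hypothetical)."""
    await asyncio.sleep(seconds)
    return f"{name}: done"


async def chained(calls):
    """Run tool calls one at a time, as a strictly sequential loop would."""
    results = []
    for name, seconds in calls:
        results.append(await fake_tool(name, seconds))
    return results


async def fanned_out(calls):
    """Start every tool call at once and wait for all of them."""
    return await asyncio.gather(*(fake_tool(n, s) for n, s in calls))


def timed(coro):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


calls = [("tool_a", 0.05), ("tool_b", 0.05), ("tool_c", 0.05)]
chained_results, chained_time = timed(chained(calls))
parallel_results, parallel_time = timed(fanned_out(calls))
print(f"chained: {chained_time:.2f}s, fanned out: {parallel_time:.2f}s")
```

The timing only captures tool execution. In a real loop the larger saving is usually the model round trips: three chained calls need a model turn between each, while three parallel calls can come back in a single turn.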
The practical implication: if you're running agents at any real scale, running many of them concurrently isn't just a throughput nice-to-have — it's what lets continuous batching do its job at all. A single agent running in isolation against a self-hosted model will underutilize the GPU no matter how you tune it, because the bottleneck is the tool-execution gap, not compute.
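A back-of-envelope model shows why. Assume each agent alternates between a fixed stretch of model time and a fixed stretch of tool execution, and that agents are independent. Both assumptions are simplifications, and the function name and numbers below are illustrative. Under them, the average number of requests the server has in flight is just the number of agents times the fraction of time each spends waiting on the model.

```python
def mean_inflight_requests(inference_s, tool_s, agents=1):
    """Average number of requests the server is working on at any moment.

    Assumes every agent spends inference_s seconds per turn waiting on the
    model and tool_s seconds executing tools, independently of the others.
    Ignores the slowdown each request sees as the batch fills up.
    """
    duty_cycle = inference_s / (inference_s + tool_s)
    return agents * duty_cycle


# Hypothetical agent: 0.5 s of model time per turn, 2 s of tool time.
solo = mean_inflight_requests(0.5, 2.0)              # about 0.2 requests in flight
fleet = mean_inflight_requests(0.5, 2.0, agents=40)  # about 8 requests in flight
```

With these numbers a lone agent leaves the server idle about 80% of the time, while 40 of them give the scheduler roughly eight requests to pack together at any moment.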
Speculative decoding works like this: a small draft model proposes several tokens ahead, the target model verifies them in one forward pass, and the accepted tokens are kept. It speeds up decode, which matters less for agent loops than prefill does, but it has a specific and underappreciated win for tool calling: structured output is highly predictable, and predictable output is exactly where speculative decoding's acceptance rate is highest.
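The propose-and-verify loop can be sketched in a few lines. This is the greedy variant only, with toy "models" that read characters off fixed strings; production implementations verify all draft positions in one batched forward pass and use a rejection-sampling rule when sampling is enabled. Every name here (`speculative_step`, `TARGET`, `DRAFT`) is hypothetical.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One round of greedy speculative decoding.

    draft_next and target_next map a token list to the next token.
    Returns (new_tokens, accepted_count).
    """
    # 1. The draft model proposes k tokens autoregressively.
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))

    # 2. The target model checks each position. A real server scores all
    #    k positions in a single forward pass; this sketch loops instead.
    accepted = []
    for token in proposal:
        expected = target_next(prefix + accepted)
        if token != expected:
            # First mismatch: keep the target's token and stop.
            return accepted + [expected], len(accepted)
        accepted.append(token)

    # 3. All k accepted: the verification pass also yields one extra token.
    return accepted + [target_next(prefix + accepted)], k


# Toy models: the draft agrees with the target except on the argument value.
TARGET = list('{"name":"x"}')
DRAFT = list('{"name":"y"}')


def target_next(tokens):
    return TARGET[len(tokens)] if len(tokens) < len(TARGET) else "<eos>"


def draft_next(tokens):
    return DRAFT[len(tokens)] if len(tokens) < len(DRAFT) else "<eos>"


structural = speculative_step([], draft_next, target_next)
value = speculative_step(list('{"name":'), draft_next, target_next)
```

On the structural opening of the call, all four drafted tokens are accepted and the round yields five tokens. On the stretch containing the one genuinely uncertain character, only one draft token survives.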
A JSON tool call like {"name": "convert_units", "arguments": {"value": 12.5, "from": "kg", "to": "lb"}} has enormous structural redundancy: the field names, brackets, and quoting are nearly deterministic given the tool schema. Grammar-constrained decoding (forcing the output to match a JSON schema via a token-level automaton) and speculative decoding compose well here: the draft model's guesses are frequently right precisely because the grammar has already ruled out most alternatives. On tool-call-heavy workloads, servers that combine constrained decoding with speculation can therefore expect higher token-acceptance rates than on free-form prose, because there's less genuine uncertainty to resolve.
If your agent calls tools frequently — say, hitting a REST API for dozens of small conversions or lookups in a session — the returns come less from making generation faster in the abstract and more from making the specific, structured part of generation (the tool call itself) nearly free.
Pulling the mechanics together into something actionable:
None of this shows up if you're only measuring tokens-per-second on the model card. It shows up when you trace an actual multi-step agent run and notice that half the wall-clock time is spent reprocessing a prefix the server has already seen — which is a prompt-construction problem wearing an inference-performance costume.
Takeaway: before reaching for a faster model to cut agent latency, check whether your agent loop is even letting the serving layer cache what it already computed. Stable prefixes, stable tool ordering, and parallel tool calls will often cut more wall-clock time than a model swap, and they cost almost nothing to try.