KV Cache Reuse and Agent Loop Latency in LLM Serving

Agent latency is usually a prompt-construction problem in disguise: prefix caching, continuous batching, and tool-call structure determine how much of your context gets reprocessed on every turn.

Most agent latency post-mortems start in the wrong place. Someone profiles token generation speed, blames the model, maybe switches to a "faster" one, and the p50 barely moves. The actual bottleneck is usually upstream of the model entirely: how the serving layer handles the specific, repetitive shape of an agent loop's prompt. If you don't understand prefix caching, continuous batching, and how they interact with the way agent frameworks construct prompts, you're optimizing the wrong 20% of the pipeline.

This post is about that interaction — what happens between "agent decides to call a tool" and "next token arrives" — and what it implies for how you should structure agent loops if you care about latency and cost at scale.

The Agent Loop Looks Nothing Like a Chat Turn

A chat completion is one request: system prompt, a handful of user/assistant turns, done. An agent loop is a sequence of requests where each one resends almost everything from the last, plus a small increment:

Turn 1: [system][tools][user_msg]                          -> assistant calls tool_a
Turn 2: [system][tools][user_msg][tool_a_call][tool_a_result] -> assistant calls tool_b
Turn 3: [system][tools][user_msg][tool_a_call][tool_a_result][tool_b_call][tool_b_result] -> final answer

If the agent takes 8 tool calls to finish a task, and your context is averaging 6K tokens by the end, you're not paying for 6K tokens once: a naive implementation reprocesses the entire growing prefix on every single turn. Because each turn's prompt is longer than the last by a roughly constant increment, the total tokens processed across the loop grow as O(n²) in the number of turns n, not O(n). For a 10-step agent run whose context grows steadily to 10K tokens, that is on the order of 50K tokens of prefill work that has nothing to do with generating new content. It's the model re-reading things it already read three turns ago.
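To make the arithmetic concrete, here is a small sketch that totals prefill tokens for a loop whose prompt grows by a fixed increment each turn. The function name, the fixed increment, and the idea of a "perfect" cache that prefills each token exactly once are simplifying assumptions for illustration, not measurements from any server.

```python
def prefill_tokens(base: int, increment: int, turns: int) -> tuple[int, int]:
    """Return (naive, cached) prefill token totals for an agent loop.

    Assumes turn t sends a prompt of base + t * increment tokens (t from 0),
    and that an idealized prefix cache only prefills tokens it has not seen.
    """
    prompt_lengths = [base + t * increment for t in range(turns)]
    naive = sum(prompt_lengths)    # every turn re-reads its whole prompt
    cached = prompt_lengths[-1]    # each token is prefilled exactly once
    return naive, cached


naive, cached = prefill_tokens(base=1_000, increment=1_000, turns=10)
print(f"naive={naive} cached={cached} ratio={naive / cached:.1f}x")
```

With these inputs the naive total is 55,000 tokens against 10,000 for the idealized cache, which is where the "50K+" figure comes from.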

This is the part that doesn't show up in "tokens per second" benchmarks, because those measure decode speed, not prefill reuse. And for agent workloads, with long re-sent contexts and short outputs, prefill often dominates.

What Prefix Caching Actually Does

Modern inference servers — vLLM's automatic prefix caching, SGLang's RadixAttention, and the equivalent mechanisms behind hosted APIs — solve this by caching the KV (key/value) tensors for token sequences they've already computed, keyed by prefix. If turn 3's prompt starts with the exact same 4,000 tokens as turn 2's prompt, the server doesn't recompute attention for those 4,000 tokens. It looks up the cached KV state and starts computing from the first token that differs.

Anthropic's API exposes a version of this directly via cache_control breakpoints — you mark where a stable prefix ends, and repeated calls that share that prefix get a large discount and lower latency on the cached portion:

{
  "system": [
    {
      "type": "text",
      "text": "You are an agent with access to the following tools...",
      "cache_control": { "type": "ephemeral" }
    }
  ],
  "messages": [ /* growing conversation history */ ]
}

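Self-hosted setups (vLLM, SGLang, TGI) do this without requiring explicit annotation, though the bookkeeping differs by server. vLLM hashes fixed-size blocks of tokens, with each block's hash also covering the prefix before it; SGLang keeps a radix tree over token sequences. Either way, any two requests that share a prefix share the cached KV state for it, regardless of which client sent them. In block-based designs the match extends only to the last full block boundary, and depending on the server and version, prefix caching may be on by default or behind a flag.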

In an agent loop, every turn shares the root of that common prefix (system prompt, tool definitions) and increasingly large portions of the growing history. Only the newest suffix, the part the server has never seen, requires fresh prefill compute. In a well-structured agent loop, that suffix is small: a tool result plus a short assistant response. In a poorly structured one, cache hit rate collapses to near zero, and you're back to full O(n²) reprocessing.
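The block-level matching described above can be sketched with a toy cache that stores only block hashes, not real KV tensors. Everything here (the class name, the block size of 4, SHA-256 as the hash) is an illustrative assumption and not how any particular server is implemented; the point is only that a block's identity depends on every token before it.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block; illustrative, real servers pick their own


class ToyPrefixCache:
    """Toy block-level prefix cache: tracks hashes only, no KV tensors."""

    def __init__(self) -> None:
        self.blocks: set[str] = set()

    def _block_hashes(self, tokens: list[int]) -> list[str]:
        hashes, parent = [], ""
        # Only full blocks are cacheable; a trailing partial block is skipped.
        for start in range(0, len(tokens) - BLOCK_SIZE + 1, BLOCK_SIZE):
            block = tokens[start:start + BLOCK_SIZE]
            # Chain the parent hash so each block's hash covers its whole prefix.
            parent = hashlib.sha256(f"{parent}|{block}".encode()).hexdigest()
            hashes.append(parent)
        return hashes

    def prefill(self, tokens: list[int]) -> tuple[int, int]:
        """Return (cached_tokens, computed_tokens) and record the new blocks."""
        hashes = self._block_hashes(tokens)
        hit_blocks = 0
        for h in hashes:
            if h not in self.blocks:
                break  # first miss ends the reusable prefix
            hit_blocks += 1
        self.blocks.update(hashes)
        cached = hit_blocks * BLOCK_SIZE
        return cached, len(tokens) - cached


cache = ToyPrefixCache()
turn_1 = list(range(10))
turn_2 = turn_1 + [100, 101, 102, 103, 104, 105]
print(cache.prefill(turn_1))           # nothing cached yet
print(cache.prefill(turn_2))           # full blocks of turn_1 are reused
print(cache.prefill([99] + turn_2))    # one token prepended: every block misses
```

Note the last call: inserting a single token at the front shifts every block, so nothing is reused even though almost all of the content is identical.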

Cache Hit Rate Is a Function of Prompt Structure, Not Luck

The single biggest lever here is prefix stability: content that doesn't change should come first, and content that changes should come last. This sounds obvious written down, but it's routinely violated:

Tool schemas reordered per request. If your framework serializes available tools in a different order depending on which ones are "relevant" this turn, you invalidate the cache on every call, because the prefix hash changes even though the underlying set of tools didn't. Keep the full tool list stable and let the model ignore irrelevant ones — that's cheaper than re-prefilling a shuffled list.
Timestamps or request IDs injected into the system prompt. A single "Current time: 14:32:07" line at the top of the system prompt means nothing after it can ever hit the cache, because that line changes every request. Move volatile metadata to the end of the prompt, or omit it if the model doesn't need second-level precision.
Non-deterministic JSON serialization of tool results. Dict key ordering that isn't stable across your own backend's serialization will silently defeat prefix matching. Canonicalize before you send it.

None of this requires framework changes — it's discipline about what goes where in the prompt, informed by understanding that the server is doing literal prefix matching, not semantic matching.
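A minimal sketch of that discipline, assuming a prompt assembled as plain text: canonical JSON for anything serialized, a fixed tool order, and volatile metadata appended last. The helper names and the prompt layout are hypothetical, and character-level prefix length is used here only as a rough proxy for the token-level matching a server actually performs.

```python
import json


def canonical_json(obj) -> str:
    """Serialize with sorted keys and fixed separators so equal data gives equal text."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def build_prompt(system: str, tools: list[dict], history: list[str], volatile: str) -> str:
    """Stable parts first, volatile metadata last."""
    # Sort tools by name so the catalog serializes identically on every turn.
    catalog = canonical_json(sorted(tools, key=lambda t: t["name"]))
    return "\n".join([system, catalog, *history, volatile])


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Same tools, different list order and different key order.
tools_a = [{"name": "lookup", "description": "Look something up"},
           {"name": "convert_units", "description": "Convert units"}]
tools_b = [{"description": "Convert units", "name": "convert_units"},
           {"description": "Look something up", "name": "lookup"}]

turn_1 = build_prompt("You are an agent.", tools_a, ["user: hi"], "Current time: 14:32:07")
turn_2 = build_prompt("You are an agent.", tools_b, ["user: hi", "tool: ok"], "Current time: 14:32:09")
print(shared_prefix_len(turn_1, turn_2), "of", len(turn_1), "characters shared")
```

Putting the timestamp as the first line instead would shrink the shared prefix to a couple of dozen characters, which is the failure mode described above.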

Continuous Batching and Why Agent Workloads Batch Poorly

The other serving-layer mechanism worth understanding is continuous batching (vLLM, TGI, and most production servers use some variant). Instead of batching a fixed group of requests and running them in lockstep through decode, the server reschedules at every decoding step: it admits waiting requests into the in-flight batch as soon as capacity frees up and retires finished ones immediately, keeping the GPU busy without waiting for the slowest request in a static batch.

This works great for high-throughput chat traffic: lots of independent, concurrently-arriving requests to interleave. It works poorly for a single agent's sequential loop, because there's nothing to interleave — request 2 literally cannot be formed until request 1's response comes back and the tool executes. From the server's perspective, one agent running solo looks like a trickle of requests with idle gaps between them, not a dense stream it can pack efficiently.

Workload shape	Typical GPU utilization	Dominant cost	What helps
High-concurrency chat (many independent users)	High — continuous batching packs requests densely	Decode throughput	Larger batch sizes, more concurrent load
Single agent, sequential tool loop	Low — gaps while tools execute, no requests to interleave	Prefill of re-sent context	Prefix caching, minimizing re-sent tokens
Many agents running concurrently (fleet)	Moderate-to-high — agents' idle gaps get filled by other agents' work	Mix of both	Prefix caching and enough concurrent agents to keep the batch full
Agent loop with parallel tool calls	Higher than sequential — multiple in-flight branches per agent	Prefill, but amortized	Structuring loops to fan out tool calls instead of chaining them

The practical implication: if you're running agents at any real scale, running many of them concurrently isn't just a throughput nice-to-have — it's what lets continuous batching do its job at all. A single agent running in isolation against a self-hosted model will underutilize the GPU no matter how you tune it, because the bottleneck is the tool-execution gap, not compute.
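A back-of-the-envelope way to see this is to estimate how many requests the server has in flight on average. The sketch below assumes each agent strictly alternates between waiting on the model and waiting on a tool, that agents are independent, and that inference latency does not change with load; the function names and example timings are made up for illustration.

```python
import math


def mean_in_flight(model_seconds: float, tool_seconds: float, agents: int) -> float:
    """Average number of requests the server holds at once (idealized)."""
    # Fraction of each agent's wall-clock time spent waiting on the model.
    duty_cycle = model_seconds / (model_seconds + tool_seconds)
    return agents * duty_cycle


def agents_for_batch(target_batch: float, model_seconds: float, tool_seconds: float) -> int:
    """Concurrent agents needed to keep roughly target_batch requests in flight."""
    duty_cycle = model_seconds / (model_seconds + tool_seconds)
    return math.ceil(target_batch / duty_cycle)


# Example: 0.5 s of inference per turn, 1.5 s of tool execution per turn.
print(mean_in_flight(0.5, 1.5, agents=1))
print(mean_in_flight(0.5, 1.5, agents=32))
print(agents_for_batch(8, 0.5, 1.5))
```

Under these assumptions a lone agent keeps a quarter of a request in flight on average, so there is nothing for the batcher to pack; it takes dozens of concurrent agents before the batch is consistently non-trivial.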

Where Speculative Decoding Fits

Speculative decoding has a small draft model propose several tokens ahead; the target model verifies them in one forward pass and keeps the ones it agrees with. That helps decode speed, which matters less for agent loops than prefill does, but it has a specific and underappreciated win for tool calling: structured output is highly predictable, and predictable output is exactly where speculative decoding's acceptance rate is highest.

A JSON tool call like {"name": "convert_units", "arguments": {"value": 12.5, "from": "kg", "to": "lb"}} has enormous structural redundancy: the field names, brackets, and quoting are nearly deterministic given the tool schema. Grammar-constrained decoding (forcing the output to match a JSON schema via a token-level automaton) and speculative decoding compose well here, because the grammar has already ruled out most alternatives and the draft model's guesses are therefore frequently right. Servers that combine constrained decoding with speculation on tool-call-heavy workloads can be expected to see higher token-acceptance rates than on free-form prose, because there's less genuine uncertainty to resolve.

If your agent calls tools frequently — say, hitting a REST API for dozens of small conversions or lookups in a session — the returns come less from making generation faster in the abstract and more from making the specific, structured part of generation (the tool call itself) nearly free.
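The acceptance mechanics can be sketched with toy next-token functions standing in for the two models. This is the greedy variant only (accept drafted tokens while they match the target's choice, then take the target's token at the first mismatch); it omits the bonus token real implementations emit when every draft is accepted, and the token list and both "models" are invented for the example.

```python
from typing import Callable, Sequence

NextToken = Callable[[Sequence[str]], str]


def speculative_decode(target: NextToken, draft: NextToken, prompt: list[str],
                       max_new: int, k: int = 4, stop: str = "<eos>") -> tuple[list[str], int]:
    """Greedy speculative decoding; returns (generated_tokens, target_passes)."""
    out: list[str] = []
    passes = 0
    while len(out) < max_new and (not out or out[-1] != stop):
        context = prompt + out
        # Draft model proposes k tokens autoregressively.
        proposal: list[str] = []
        for _ in range(k):
            proposal.append(draft(context + proposal))
        # One verification round; stands in for a single batched target pass.
        passes += 1
        accepted: list[str] = []
        for tok in proposal:
            expected = target(context + accepted)
            accepted.append(expected)  # always keep the target's own choice
            if expected != tok or expected == stop:
                break                  # first mismatch (or stop) ends the round
        out.extend(accepted)
    return out[:max_new], passes


# Toy tool call: structure is predictable, only the numeric value is "free".
CALL = ["{", '"name"', ":", '"convert_units"', ",", '"arguments"', ":", "{",
        '"value"', ":", "12.5", "}", "}", "<eos>"]


def target_next(seq: Sequence[str]) -> str:
    return CALL[min(len(seq), len(CALL) - 1)]


def draft_next(seq: Sequence[str]) -> str:
    tok = target_next(seq)
    return "0" if tok == "12.5" else tok  # draft gets everything but the value


tokens, passes = speculative_decode(target_next, draft_next, [], max_new=32)
print(tokens == CALL, passes, len(CALL))
```

In this toy run the output matches plain greedy decoding exactly, but it takes 4 verification rounds instead of 14 single-token steps, because the only rejected draft is the one genuinely uncertain token.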

What This Means for How You Build Agent Loops

Pulling the mechanics together into something actionable:

Order your prompt by volatility, not by logical grouping. Stable stuff (system prompt, full tool catalog, few-shot examples) first; volatile stuff (this turn's tool result, current timestamp) last. This is the single highest-leverage change for cache hit rate.
Don't reshuffle or filter the tool list per turn. A stable, complete tool list that the model selectively ignores caches far better than a "smart" filtered list that changes every call.
Batch independent tool calls in one turn instead of chaining them serially. If a task needs three independent lookups (say, three separate calls to something like Utilix's conversion API), issuing them as parallel tool calls in a single turn means the model emits all three in one response and all three results come back in one follow-up request. You pay for one prefill-plus-round-trip cycle instead of three.
Treat "concurrent agents" as a batching requirement, not just a scaling requirement, if you're self-hosting. A fleet of agents running concurrently is what makes continuous batching effective; a lone agent loop will bottleneck on tool-execution idle time no matter what you do to the model itself.
For tool-call-heavy agents, invest in constrained/grammar-guided decoding on the serving side. It improves correctness (fewer malformed tool calls to retry) and composes with speculative decoding for latency, a gain that free-form chat text sees far less of.
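On the client side, fanning out tool calls might look like the sketch below, which runs every call from one assistant turn concurrently and returns the results in the original order. The `convert_units` coroutine is a hypothetical stand-in for a remote tool (only kg to lb is modeled), the `src`/`dst` argument names are made up, and the sleep simulates a network round trip.

```python
import asyncio


async def convert_units(value: float, src: str, dst: str) -> dict:
    """Hypothetical stand-in for a remote conversion tool."""
    await asyncio.sleep(0.05)  # simulated network round trip
    factor = {("kg", "lb"): 2.20462}[(src, dst)]
    return {"value": round(value * factor, 3), "unit": dst}


async def run_tool_calls(calls: list[dict]) -> list[dict]:
    """Run all tool calls from one assistant turn concurrently, preserving order."""
    tasks = [convert_units(**call["arguments"]) for call in calls]
    return await asyncio.gather(*tasks)


calls = [
    {"name": "convert_units", "arguments": {"value": 1.0, "src": "kg", "dst": "lb"}},
    {"name": "convert_units", "arguments": {"value": 10.0, "src": "kg", "dst": "lb"}},
    {"name": "convert_units", "arguments": {"value": 100.0, "src": "kg", "dst": "lb"}},
]
results = asyncio.run(run_tool_calls(calls))
print(results)
```

The three simulated round trips overlap, so the turn waits roughly as long as the slowest call rather than the sum of all three, and the results go back to the model in a single follow-up request.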

None of this shows up if you're only measuring tokens-per-second on the model card. It shows up when you trace an actual multi-step agent run and notice that half the wall-clock time is spent reprocessing a prefix the server has already seen — which is a prompt-construction problem wearing an inference-performance costume.

Takeaway: before reaching for a faster model to cut agent latency, check whether your agent loop is even letting the serving layer cache what it already computed. Stable prefixes, stable tool ordering, and batched tool calls will often cut more wall-clock time than a model swap — and they cost nothing to try.
