Vector similarity search answers what text is topically related — but long-running agents need to know what is true right now. Conflating the two is why agents keep resurrecting overturned decisions.

Agent Memory Isn't RAG: Why Vector Retrieval Falls Apart for Stateful Agents

Most teams building "agent memory" reach for the same architecture: chunk everything, embed it, stuff it in a vector store, and retrieve top-k on every turn. It works well enough for question-answering over static documents that RAG has become the default answer to a completely different question: how does an agent remember what it did, decided, and learned across a long-running task?

Those are not the same problem. Document retrieval is about finding relevant facts in a large, mostly static corpus. Agent memory is about maintaining a coherent, evolving model of state — what changed, why, and what still needs to happen — across a session that might run for hours or days and involve hundreds of tool calls. Treating the second problem as an instance of the first is why so many "memory-enabled" agents still forget the constraint you gave them three tool calls ago, or re-discover a fact they already invalidated.

This post is about why vector similarity search is the wrong primary mechanism for agent state, what actually breaks in practice, and what a memory architecture that separates working state from retrieved knowledge looks like.

The core mismatch: similarity isn't relevance

Vector retrieval answers one question well: "what text is semantically similar to this query?" Agent memory needs answers to different questions entirely:

What is true right now, given everything that has happened since?
What did I already try, and did it work?
What constraints were established earlier that still apply?
What's the next unresolved step in a plan I committed to?

None of these are similarity queries. They're state queries. A vector store has no concept of recency-with-override — if you embed "the deploy target is us-east-1" on turn 3 and "actually, switch the deploy target to eu-west-1" on turn 40, a similarity search for "deploy target" can easily return both chunks with comparable scores, or worse, rank the earlier one higher because it's phrased more directly. The retrieval layer doesn't know one fact superseded the other. It just knows they're both about deploy targets.

This is the fundamental issue: embeddings encode topical similarity, not temporal or logical precedence. Agent state is inherently versioned — every fact has a lifespan, and most interesting bugs in long-running agents come from operating on facts past their expiration.
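To make the deploy-target example concrete, here is a toy version of it. It assumes a bag-of-words cosine similarity as a stand-in for a real embedding model. Real embeddings differ in detail, but they share the property that matters here: nothing in the score encodes which fact came later. The names `bag_of_words`, `cosine`, and `top_k` are illustrative, not any library's API.

```python
import math
from collections import Counter

def bag_of_words(text: str) -> Counter:
    # Toy stand-in for an embedding model: lowercase tokens, edge punctuation stripped.
    return Counter(w.strip(".,\"'").lower() for w in text.split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[token] for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# (turn, text) pairs in the order they were appended to the memory store.
memory = [
    (3, "The deploy target is us-east-1"),
    (12, "Unit tests pass on the feature branch"),
    (40, "Actually, switch the deploy target to eu-west-1"),
]

def top_k(query: str, k: int = 2) -> list[tuple[float, int, str]]:
    q = bag_of_words(query)
    scored = [(cosine(q, bag_of_words(text)), turn, text) for turn, text in memory]
    # Ranking uses similarity alone; the turn number never influences the order.
    return sorted(scored, key=lambda s: s[0], reverse=True)[:k]

for score, turn, text in top_k("deploy target"):
    print(f"{score:.2f}  turn {turn}: {text}")
```

With this toy scorer the superseded turn-3 fact scores about 0.63 and the turn-40 correction about 0.53, so the stale value ranks first simply because it is phrased more tersely.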

Where this actually breaks

A few concrete failure modes, all observed in production agent systems:

Stale constraint resurrection. The user says "don't touch the auth module, it's mid-refactor." Forty turns later, a top-k retrieval pulls in an old planning note that says "auth module needs the token refresh fix" — without the later correction anywhere near it in embedding space — and the agent edits the file it was explicitly told to avoid.
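For contrast, a constraint like "don't touch the auth module" can live as structured state and be checked in code before any write tool runs, so it no longer depends on what retrieval happens to surface. This is a minimal sketch; `Constraints`, `guarded_edit`, `ConstraintViolation`, and the path-prefix matching rule are hypothetical choices for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable

class ConstraintViolation(Exception):
    """Raised when a tool call would break a standing constraint."""

@dataclass
class Constraints:
    # Hypothetical structured store for standing constraints, keyed by identity
    # rather than embedded, so a later "unfreeze" genuinely removes the entry.
    frozen_paths: set[str] = field(default_factory=set)

    def freeze(self, path: str) -> None:
        self.frozen_paths.add(path.rstrip("/"))

    def unfreeze(self, path: str) -> None:
        self.frozen_paths.discard(path.rstrip("/"))

def guarded_edit(constraints: Constraints, path: str, edit_fn: Callable[[str], str]) -> str:
    # Checked on every write, independent of whatever retrieval put in context.
    for frozen in constraints.frozen_paths:
        if path == frozen or path.startswith(frozen + "/"):
            raise ConstraintViolation(f"{path} is inside frozen path {frozen}")
    return edit_fn(path)
```

Matching on `frozen + "/"` keeps a sibling like `src/authz` editable while everything under `src/auth` is blocked.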

Contradictory fact merging. Two chunks both score highly for "what's the API rate limit," one from an early exploration (wrong, from stale docs) and one from a later verified test (right). The agent's context window ends up with both, and without an explicit precedence signal, the model has to guess which one to trust — often getting it wrong because the earlier, wordier explanation "sounds" more authoritative.

Lost procedural state. Vector stores are bad at representing sequence. "Step 3 of the migration plan is done, step 4 is next" is not a fact you retrieve by similarity — it's a pointer into a state machine. Cramming a todo list into a vector store and hoping retrieval surfaces "the next step" at the right moment is a coin flip.
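Treating the plan as a small state machine makes "what's next" a deterministic lookup. A minimal sketch, assuming a linear plan; `Plan`, `StepStatus`, and the rule that a failed step stays "next" until retried are illustrative choices, not a prescribed design.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class StepStatus(Enum):
    PENDING = "pending"
    DONE = "done"
    FAILED = "failed"

@dataclass
class Plan:
    # Ordered steps with explicit status: "the next step" is a pointer, not a search.
    steps: list[str]
    status: dict[int, StepStatus] = field(default_factory=dict)

    def __post_init__(self) -> None:
        for i in range(len(self.steps)):
            self.status.setdefault(i, StepStatus.PENDING)

    def mark(self, index: int, status: StepStatus) -> None:
        self.status[index] = status

    def next_step(self) -> Optional[tuple[int, str]]:
        # First step not yet done; a FAILED step stays next until it is retried.
        for i, step in enumerate(self.steps):
            if self.status[i] is not StepStatus.DONE:
                return i, step
        return None

    def render(self) -> str:
        # 1-based numbering for the context window; indices stay 0-based internally.
        return "\n".join(
            f"[{self.status[i].value}] {i + 1}. {s}" for i, s in enumerate(self.steps)
        )
```

After marking the first three steps of a four-step migration as done, `next_step()` returns index 3 (rendered as step 4), which is exactly the "step 3 is done, step 4 is next" pointer described above.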

Context poisoning under scale. As the memory store grows, the top-k window fills with plausible-but-irrelevant near-matches. This is the same problem RAG has always had (retrieval precision degrades as corpus size grows) but it's worse for agents because a bad retrieval doesn't just produce a slightly wrong answer — it can trigger a wrong action.

A better decomposition: separate the memory types

The fix isn't to abandon vector search; it's to stop asking it to do a job it's structurally unsuited for. Cognitive architectures (and, not coincidentally, most well-designed agent frameworks, which have converged on the same idea independently over the last two years) split memory into distinct layers with different consistency guarantees: working memory, structured state, an episodic log, and a semantic store.

The key design decision is that only the semantic store uses similarity search. Everything with a notion of "current value" — constraints, plan state, configuration — lives in a structured store keyed by identity, not embedding, with explicit overwrite semantics. When the deploy target changes, you overwrite the deploy_target key. There's no ambiguity for the model to resolve, because there's no duplicate to resolve between.
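Overwrite semantics do not have to mean losing history. A minimal sketch, assuming writes arrive in turn order: every write is kept per key for audit, but reads only ever return the latest version, so the model sees one current value. `VersionedStore` and `Version` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass(frozen=True)
class Version:
    value: Any
    updated_at: int  # turn number or timestamp supplied by the caller
    source: str      # provenance, e.g. "planning note" or "user correction"

class VersionedStore:
    """Facts keyed by identity. Every write is kept as history, but reads
    only ever see the latest version (last-write-wins in call order)."""

    def __init__(self) -> None:
        self._history: dict[str, list[Version]] = {}

    def set(self, key: str, value: Any, source: str, now: int) -> None:
        self._history.setdefault(key, []).append(Version(value, now, source))

    def get(self, key: str) -> Optional[Version]:
        versions = self._history.get(key)
        return versions[-1] if versions else None

    def history(self, key: str) -> list[Version]:
        return list(self._history.get(key, []))

store = VersionedStore()
store.set("deploy_target", "us-east-1", source="planning note", now=3)
store.set("deploy_target", "eu-west-1", source="user correction", now=40)
current = store.get("deploy_target")  # Version(value='eu-west-1', updated_at=40, ...)
```

Because `get` returns exactly one version, the turn-3 value can still be inspected through `history` but can never compete with the turn-40 value in the context window.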

What goes where

Memory type	Storage model	Query pattern	Consistency	Good for
Working memory	In-context window	N/A (always present)	Strong (single writer per turn)	Active plan, immediate task state
Structured state	Key-value / relational, versioned	Exact key lookup	Strong, last-write-wins	Constraints, config, plan progress, todo status
Episodic log	Append-only, timestamped	Time range / sequence	Strong (immutable once written)	"What did I already try," audit trail, replay
Semantic store	Vector embeddings	Top-k similarity	Eventually consistent, no precedence	Stable domain docs, past conversation gist, unstructured knowledge

The practical implication: before reaching for a vector database, ask whether the fact you're storing has a current value that can change. If yes, it belongs in structured state with a real update operation, not a new embedding appended to a growing pile. Vector search should be reserved for genuinely unstructured, rarely-superseded content — documentation, prior conversation summaries, reference material.
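The "does it have a current value?" test can be pushed into the write path itself. A minimal sketch, assuming the caller (for example, the agent's tool schema) decides whether a memory carries an identity key; this is not automatic classification. `MemoryWrite` and `MemoryRouter` are hypothetical, and the `semantic` list stands in for a real vector store's insert call.

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class MemoryWrite:
    text: str
    key: Optional[str] = None  # set only when the fact has a current value that can change
    value: Any = None

class MemoryRouter:
    def __init__(self) -> None:
        self.structured: dict[str, Any] = {}
        self.semantic: list[str] = []  # stand-in for a vector store's insert path

    def write(self, item: MemoryWrite) -> str:
        if item.key is not None:
            # Stateful fact: a real update operation on an identity key.
            self.structured[item.key] = item.value
            return "structured"
        # Unstructured, rarely superseded content: append for similarity search.
        self.semantic.append(item.text)
        return "semantic"
```

Writing the deploy target twice leaves one entry in `structured`, while a documentation summary lands in `semantic`, where duplicates and near-matches are harmless.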

A minimal structured-state pattern

You don't need a framework to get most of the benefit. A simple pattern that works well for tool-calling agents:

```python
class AgentState:
    def __init__(self):
        self.facts: dict[str, dict] = {}  # key -> {value, updated_at, source}
        self.episodic: list[dict] = []    # append-only action log

    def set_fact(self, key: str, value, source: str, now: int):
        # Last-write-wins, but keep provenance so the agent can explain *why*
        self.facts[key] = {"value": value, "updated_at": now, "source": source}

    def get_fact(self, key: str):
        return self.facts.get(key)

    def log_action(self, action: str, result: str, now: int):
        self.episodic.append({"t": now, "action": action, "result": result})

    def render_working_memory(self) -> str:
        # This is what actually goes into the context window each turn
        active = "\n".join(f"{k}: {v['value']}" for k, v in self.facts.items())
        recent = self.episodic[-10:]
        recent_str = "\n".join(f"[{e['t']}] {e['action']} -> {e['result']}" for e in recent)
        return f"## Current State\n{active}\n\n## Recent Actions\n{recent_str}"
```

This is deliberately unglamorous. The point isn't the code — it's that facts gives you O(1) lookup with real overwrite semantics, and episodic gives you a ground-truth history you can compact, summarize, or replay without ever needing a similarity search. When the agent asks "what's the current deploy target," it does a dictionary lookup, not a nearest-neighbor search that might return two candidates.

Compaction is the piece that connects this back to working memory limits: periodically (every N turns, or when approaching a context budget), summarize older episodic entries into a condensed narrative and fold that into working memory, while the raw log stays in cold storage for later retrieval if needed. This is the same idea behind context-window compaction in coding agents — the recent, high-fidelity window matters more than perfect recall of everything, as long as you don't lose facts that still govern current behavior.
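A minimal compaction sketch, assuming episodic entries shaped like the `{"t", "action", "result"}` records in `AgentState` and an entry-count trigger as a stand-in for a token budget. `maybe_compact` is a hypothetical name, and `naive_summarize` is a placeholder for what would normally be a model-generated summary.

```python
from typing import Callable

Entry = dict  # {"t": int, "action": str, "result": str}

def maybe_compact(
    episodic: list[Entry],
    cold_storage: list[Entry],
    summary: str,
    max_entries: int,
    keep_recent: int,
    summarize: Callable[[list[Entry]], str],
) -> tuple[list[Entry], str]:
    """Once the log exceeds max_entries, fold all but the newest keep_recent
    entries into the running summary and move them (not delete them) to cold storage."""
    assert 0 <= keep_recent <= max_entries
    if len(episodic) <= max_entries:
        return episodic, summary
    split = len(episodic) - keep_recent
    older, recent = episodic[:split], episodic[split:]
    cold_storage.extend(older)  # raw log kept for later retrieval if needed
    new_summary = summarize(older)
    return recent, (f"{summary}\n{new_summary}" if summary else new_summary)

def naive_summarize(entries: list[Entry]) -> str:
    # Placeholder for a model-generated summary; just lists what was tried.
    return "Earlier: " + "; ".join(f"{e['action']} -> {e['result']}" for e in entries)
```

Note that this only touches the episodic log. Structured facts (constraints, config, plan progress) are never compacted, which is what keeps facts that still govern current behavior from being summarized away.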

Where vector retrieval still earns its place

None of this is an argument against embeddings — it's an argument against using them as the only memory mechanism. Semantic search is genuinely the right tool when:

The corpus is large, mostly static, and doesn't have a "current value" — API reference docs, historical conversation transcripts, a knowledge base.
You need fuzzy matching over natural language where exact keys don't exist — "find prior incidents similar to this one."
Precedence doesn't matter because there isn't a "latest" version — multiple valid past examples are all still useful context.

An agent architecture that calls a tool like an MCP server or REST API for document lookup is doing exactly the right thing when the query is "find me relevant prior art." The mistake is routing stateful facts — the ones with a clear current value — through that same retrieval path instead of a structured store with real update semantics.

The takeaway

If your agent keeps re-surfacing decisions that were already overturned, or acting on constraints that no longer apply, the bug usually isn't in the retrieval ranking — it's in the architecture. Before tuning top-k or re-embedding with a better model, ask: does this fact have a version, and does my storage layer know which version is current? If the answer is no, you don't have a retrieval problem, you have a state-management problem wearing a retrieval-shaped costume. Split working state (small, structured, always in context) from retrieved knowledge (large, unstructured, similarity-searched), and most of the "my agent forgot" bugs disappear — not because the model got smarter, but because you stopped asking a similarity function to answer a versioning question.
