Naive RAG, Agentic RAG, and GraphRAG: What Actually Changes Architecturally

"RAG is broken" is one of the most common complaints in production LLM systems, and it's almost always imprecise. Retrieval-augmented generation isn't one architecture — it's a family of at least three structurally different systems that get lumped under the same three letters. A naive RAG pipeline, an agentic RAG loop, and a graph-based RAG system fail in different ways, cost different amounts to run, and suit different problems. Diagnosing "RAG is broken" without knowing which of the three you built is like debugging "the network is slow" without knowing if you mean DNS, TCP, or application latency.

This post walks through what actually changes at the architecture level between the three, where each one breaks in practice, and how to pick without over-building.

Naive RAG: the pipeline everyone starts with

The baseline architecture is a fixed, one-shot pipeline: chunk the corpus, embed the chunks, store the vectors, and at query time embed the question, run a similarity search (usually top-k cosine similarity), and stuff the retrieved chunks into the prompt alongside the question.

query → embed → vector search (top-k) → stuff into prompt → generate

This works well for the case it was designed for: a single, well-phrased question with an answer that lives in one or two contiguous chunks of text. It's cheap, it's fast (one embedding call, one ANN lookup), and it's easy to reason about.
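To make the pipeline concrete, here is a minimal sketch of the query-time path. It assumes a toy bag-of-words `embed` function in place of a real embedding model and a brute-force scan in place of an ANN index; `CHUNKS`, `top_k`, and `build_prompt` are hypothetical names for illustration, and the final generation call is left out.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words term count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[tuple[float, str]]:
    """Score every chunk against the query and keep the k best (brute force, no ANN)."""
    q = embed(query)
    scored = [(cosine(q, embed(c)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


def build_prompt(query: str, chunks: list[str], k: int = 2) -> str:
    """Stuff the retrieved chunks into the prompt alongside the question."""
    context = "\n".join(f"- {text}" for _, text in top_k(query, chunks, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Hypothetical three-chunk corpus.
CHUNKS = [
    "The Pro plan costs 40 dollars per month.",
    "Support hours are 9am to 5pm on weekdays.",
    "The Basic plan was deprecated in Q2.",
]

prompt = build_prompt("How much does the Pro plan cost?", CHUNKS, k=1)
print(prompt)
```

Note that nothing in this path looks at the retrieved text before it reaches the prompt. That is what keeps it cheap, and it is also where the failure modes below come from.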

It breaks down in three predictable ways:

Multi-hop questions. "What was the churn rate for the plan that replaced the one deprecated in Q2?" requires resolving two facts in sequence. A single top-k retrieval has no mechanism to notice it needs a second lookup — it just returns the k chunks most similar to the original question, which may not even mention the deprecated plan.
Precision decay with corpus size. As the corpus grows, more chunks become superficially similar to any given query. Top-k retrieval has no correction step, so irrelevant-but-similar chunks crowd out the chunk that actually has the answer. Precision doesn't degrade gracefully: once enough look-alike chunks compete for the same k slots, the right chunk can drop out of the results entirely unless you invest in reranking.
No verification. If retrieval returns nothing useful, naive RAG doesn't know that. It hands the model whatever it found and lets the model either hallucinate around the gap or (if you're lucky and prompted for it) say "I don't know." There's no architectural loop that checks "did I actually find the answer?" before generating.

Naive RAG's failure mode is silent: it doesn't error, it just answers confidently from irrelevant context. That's the worst kind of failure to catch in review.
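The silent failure is easy to reproduce. The sketch below assumes a toy word-overlap score (`overlap`) in place of embedding similarity and a hypothetical corpus with nothing about refunds in it. Top-k still returns k chunks, because ranking has no notion of "nothing here is relevant".

```python
def overlap(query: str, chunk: str) -> float:
    """Toy relevance score: Jaccard overlap of lowercase word sets."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / len(q | c) if q | c else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[tuple[float, str]]:
    """Always returns k chunks (if the corpus has that many), however low the scores."""
    scored = [(overlap(query, c), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Hypothetical corpus: no chunk covers refunds.
CHUNKS = [
    "support hours are 9am to 5pm on weekdays",
    "the pro plan costs 40 dollars per month",
    "the basic plan was deprecated in q2",
]

hits = top_k("what is the refund policy", CHUNKS, k=2)
for score, chunk in hits:
    print(f"{score:.2f}  {chunk}")
```

Both returned chunks match the query only on the word "the", yet they would be stuffed into the prompt exactly as if they were strong matches.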

Agentic RAG: retrieval as a tool call, not a preprocessing step

Agentic RAG restructures retrieval from a fixed pipeline stage into a tool the model can call, inspect the results of, and call again. Instead of "always retrieve once, then generate," the loop looks like:

query → retrieve → evaluate evidence → sufficient? yes: generate / no: rewrite query → retrieve again

The load-bearing difference is the decision point: an agentic RAG system has an explicit checkpoint where it evaluates whether the retrieved evidence is sufficient before generating. If it isn't, the loop rewrites the query (expanding an acronym, splitting a multi-hop question into sub-questions, or trying a different retrieval strategy) and retrieves again.
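A minimal sketch of that control flow, under some loud assumptions: `retrieve`, `is_sufficient`, `rewrite`, and `generate` are hypothetical callables that would be backed by a search index and model calls in a real system. Here they are scripted stubs over a two-chunk corpus so the loop itself can be run and inspected.

```python
from typing import Callable

Chunks = list[str]


def agentic_answer(
    question: str,
    retrieve: Callable[[str], Chunks],
    is_sufficient: Callable[[str, Chunks], bool],
    rewrite: Callable[[str, Chunks], str],
    generate: Callable[[str, Chunks], str],
    max_rounds: int = 3,
) -> tuple[str, int]:
    """Retrieve, check the evidence, and rewrite the query until it suffices or the budget runs out."""
    query = question
    evidence: Chunks = []
    for round_no in range(1, max_rounds + 1):
        for chunk in retrieve(query):
            if chunk not in evidence:
                evidence.append(chunk)
        if is_sufficient(question, evidence):  # the explicit checkpoint
            return generate(question, evidence), round_no
        query = rewrite(question, evidence)
    return "Not enough evidence found to answer.", max_rounds


# Hypothetical corpus and scripted stubs, for illustration only.
CORPUS = [
    "The Basic plan was deprecated in Q2 and replaced by the Starter plan.",
    "The Starter plan had a churn rate of 4 percent.",
]


def retrieve(query: str) -> Chunks:
    # Stub retriever: first chunk that mentions the query's last word.
    key = query.lower().rstrip("?.").split()[-1]
    return [c for c in CORPUS if key in c.lower()][:1]


def is_sufficient(question: str, evidence: Chunks) -> bool:
    # Stub evaluator: the question asks for a churn rate, so look for one.
    return any("churn rate" in c.lower() for c in evidence)


def rewrite(question: str, evidence: Chunks) -> str:
    # Scripted rewrite; a real system would derive this from the evidence.
    return "Starter plan churn"


def generate(question: str, evidence: Chunks) -> str:
    return f"Answer from {len(evidence)} chunks: {evidence[-1]}"


answer, rounds = agentic_answer(
    "What was the churn rate for the plan that replaced the one deprecated in Q2?",
    retrieve, is_sufficient, rewrite, generate,
)
print(rounds, answer)
```

The `max_rounds` budget matters as much as the checkpoint: without it, a question the corpus cannot answer would loop until something else stopped it.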

Query rewriting and self-correction

The research underpinning this pattern includes Self-RAG (which trains the model to emit reflection tokens judging its own retrieval quality) and Corrective RAG / CRAG (which adds a lightweight retrieval evaluator that grades documents as correct, ambiguous, or incorrect, and falls back to web search when local retrieval scores poorly). Both formalize the same idea: retrieval quality should be checked, not assumed.
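The grading idea can be sketched without any training. The example below is loosely modeled on the correct / ambiguous / incorrect split described above; it is not the CRAG authors' implementation. The thresholds (`upper`, `lower`), the action names, and the assumption that some evaluator has already produced a relevance score in [0, 1] per document are all hypothetical.

```python
from enum import Enum


class Grade(Enum):
    CORRECT = "correct"
    AMBIGUOUS = "ambiguous"
    INCORRECT = "incorrect"


def grade(score: float, upper: float = 0.7, lower: float = 0.3) -> Grade:
    """Map one document's relevance score to a grade (thresholds are assumptions)."""
    if score >= upper:
        return Grade.CORRECT
    if score <= lower:
        return Grade.INCORRECT
    return Grade.AMBIGUOUS


def choose_action(scores: list[float]) -> str:
    """Decide what to do with a batch of retrieved documents."""
    grades = [grade(s) for s in scores]
    if Grade.CORRECT in grades:
        return "use_local"  # at least one document clearly answers the query
    if all(g is Grade.INCORRECT for g in grades):
        return "web_search"  # local retrieval scored poorly, so fall back
    return "use_local_and_web_search"  # unclear, so combine both sources


print(choose_action([0.9, 0.1]))
print(choose_action([0.1, 0.2]))
print(choose_action([0.5, 0.1]))
```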

In practice, most production agentic RAG systems don't need the full trained-reflection-token approach. A simpler pattern works: give the model a search tool and a system prompt instructing it to cite sources and to search again if the first results don't answer the question, then let the normal agent loop (plan → call tool → observe → decide) handle the rest. This is structurally identical to any other tool-use loop — the retrieval index is just one more tool the model can call, alongside a calculator, a code execution sandbox, or a REST endpoint that does deterministic formatting work. Utilix's MCP server, for example, exposes utility endpoints (hashing, diffing, format conversion) in exactly this shape: a tool an agent calls mid-loop and inspects the output of before deciding what to do next. Retrieval isn't architecturally special — treating it as just another callable tool is what makes the self-correction loop possible.
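As a rough illustration of "retrieval is just another tool", the sketch below registers a stub `search` function next to a deterministic hashing utility and dispatches scripted tool calls to both through the same code path. The registry shape, the `{"tool", "input"}` call format, and the `NO_RESULTS` marker are assumptions made for this example; they are not the API of any particular agent framework or MCP server.

```python
import hashlib
from typing import Callable

# Hypothetical one-entry document store.
DOCS = {"hours": "Support hours are 9am to 5pm on weekdays."}


def search(query: str) -> str:
    """Stub retrieval tool: keyword lookup standing in for a vector index."""
    hits = [text for key, text in DOCS.items() if key in query.lower()]
    return hits[0] if hits else "NO_RESULTS"


def sha256(text: str) -> str:
    """Deterministic utility tool: hex SHA-256 digest of the input."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


TOOLS: dict[str, Callable[[str], str]] = {"search": search, "sha256": sha256}


def run_tool_call(call: dict[str, str]) -> str:
    """Dispatch one tool call; retrieval gets no special treatment."""
    return TOOLS[call["tool"]](call["input"])


# Scripted calls standing in for a model's decisions: the first search
# misses, so the "model" searches again with different wording.
calls = [
    {"tool": "search", "input": "opening times"},
    {"tool": "search", "input": "support hours"},
    {"tool": "sha256", "input": "abc"},
]
observations = [run_tool_call(c) for c in calls]
for call, obs in zip(calls, observations):
    print(call["tool"], "->", obs)
```

Because the first search returns an explicit `NO_RESULTS` observation, a model reading it has something concrete to react to, which is the hook the self-correction loop depends on.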

What this costs you

Agentic RAG trades latency and token cost for correctness. Each retrieval round-trip adds a model call to decide whether to retrieve again, plus the retrieval latency itself. A question that naive RAG answers in one embedding call and one generation call might take agentic RAG two to four tool-call rounds. For high-volume, low-stakes queries (a chatbot answering "what are your business hours"), that overhead is waste. For high-stakes, multi-hop, or compliance-sensitive queries, it's the difference between a wrong answer and a correct one.
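The trade-off can be put into back-of-the-envelope numbers. The sketch below counts calls per query under one simplifying assumption: each agentic round costs one retrieval plus one model call to decide what to do next, followed by a single final generation call. The latency figures (`retrieval_ms`, `model_ms`) are made-up placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass
class QueryCost:
    retrievals: int
    model_calls: int

    def latency_ms(self, retrieval_ms: float = 50.0, model_ms: float = 800.0) -> float:
        """Sequential latency estimate; per-call figures are hypothetical placeholders."""
        return self.retrievals * retrieval_ms + self.model_calls * model_ms


def naive_cost() -> QueryCost:
    # One retrieval (embed + lookup) and one generation call.
    return QueryCost(retrievals=1, model_calls=1)


def agentic_cost(rounds: int) -> QueryCost:
    # Each round: one retrieval plus one "is this enough?" model call,
    # then one final generation call (an assumption; some loops fold
    # the last decision and the answer into the same call).
    return QueryCost(retrievals=rounds, model_calls=rounds + 1)


print(naive_cost().latency_ms())
for rounds in (2, 3, 4):
    print(rounds, agentic_cost(rounds).latency_ms())
```

Under these placeholder numbers the model calls dominate, so the number of rounds matters far more than the speed of the index.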

GraphRAG: when relationships matter more than similarity

Both naive and agentic RAG are fundamentally similarity-based: they find text that's semantically close to the query. Neither is good at questions that hinge on structured relationships between entities rather than textual similarity — "which vendors does our largest customer's parent company also contract with?" has almost no lexical or semantic overlap with the documents that contain the answer.

GraphRAG (the pattern popularized by Microsoft Research's 2024 paper "From Local to Global: A Graph RAG Approach to Query-Focused Summarization", and now implemented in various open-source forms) addresses this by building a knowledge graph from the corpus during indexing: an LLM extracts entities and relationships from each chunk, those get merged into a graph, and the graph is clustered into hierarchical communities with LLM-generated summaries at each level. At query time, retrieval walks the graph, pulling in entities, their relationships, and community summaries, rather than doing a flat similarity search over text chunks.
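Here is a stripped-down sketch of the graph-walking part only. It assumes the entity and relationship extraction has already happened, so the `TRIPLES` below are hand-written stand-ins for LLM output with invented company names, and it leaves out community clustering and summaries entirely. It is meant to show why a relationship question becomes a traversal rather than a similarity search, not to mirror any GraphRAG implementation.

```python
from collections import defaultdict

# Hand-written stand-ins for (subject, relation, object) triples an LLM
# would extract during indexing. All names are invented.
TRIPLES = [
    ("Acme Retail", "subsidiary_of", "Acme Holdings"),
    ("Acme Holdings", "contracts_with", "Northwind Logistics"),
    ("Acme Holdings", "contracts_with", "Globex Cloud"),
    ("Beta Corp", "contracts_with", "Initech"),
]

Graph = dict[str, list[tuple[str, str]]]


def build_graph(triples: list[tuple[str, str, str]]) -> Graph:
    """Merge triples into an adjacency list keyed by subject entity."""
    graph: Graph = defaultdict(list)
    for subject, relation, obj in triples:
        graph[subject].append((relation, obj))
    return graph


def follow(graph: Graph, starts: list[str], relation: str) -> list[str]:
    """One hop: every entity reachable from `starts` via `relation`."""
    return [obj for s in starts for rel, obj in graph.get(s, []) if rel == relation]


graph = build_graph(TRIPLES)

# "Which vendors does our largest customer's parent company contract with?"
# assuming the largest customer has already been resolved to "Acme Retail".
parents = follow(graph, ["Acme Retail"], "subsidiary_of")
vendors = follow(graph, parents, "contracts_with")
print(vendors)
```

Neither hop depends on the question sharing any wording with the source documents, which is exactly the case where similarity search struggles.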

This is a fundamentally different indexing cost model. Naive and agentic RAG both index with one cheap embedding call per chunk: chunk and embed, done. GraphRAG's indexing pass also grows with the number of chunks, but each chunk needs an LLM generation call (or several) to extract entities and relationships, followed by a graph clustering step and LLM-written community summaries. For a large corpus, that indexing cost can dwarf the cost of actually serving queries.
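A quick way to size the difference is to count calls. The sketch below is arithmetic only; the corpus size, the number of extraction calls per chunk, and the number of community summaries are hypothetical inputs you would replace with figures from your own pipeline.

```python
def naive_index_calls(n_chunks: int) -> dict[str, int]:
    """Naive and agentic RAG: one embedding call per chunk, no LLM generation calls."""
    return {"embedding_calls": n_chunks, "llm_calls": 0}


def graph_index_llm_calls(
    n_chunks: int,
    extraction_calls_per_chunk: int = 1,
    n_community_summaries: int = 0,
) -> int:
    """GraphRAG-style indexing: LLM extraction per chunk plus one summary per community."""
    return n_chunks * extraction_calls_per_chunk + n_community_summaries


# Hypothetical corpus: 10,000 chunks, 2 extraction calls each, 500 communities.
print(naive_index_calls(10_000))
print(graph_index_llm_calls(10_000, extraction_calls_per_chunk=2, n_community_summaries=500))
```

The counts alone understate the gap, since an LLM generation call typically costs far more than an embedding call.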

Where GraphRAG earns its cost

GraphRAG is worth the indexing investment for corpora where the value is in the connections, not the paragraphs: organizational knowledge bases, legal document sets with cross-references, codebases with import/dependency graphs, or research literature with citation networks. It's a poor fit for corpora that are mostly independent, self-contained documents (a support ticket archive, a product FAQ) — there, the "relationships" GraphRAG would extract are thin, and you're paying graph-construction cost for no retrieval benefit over a good hybrid search.

Comparing the three

	Naive RAG	Agentic RAG	GraphRAG
Retrieval shape	Single top-k vector search	Iterative: retrieve → evaluate → retrieve again	Graph traversal + community summaries
Best for	Single-fact lookup in one document	Multi-hop questions, compliance/citation needs	Relationship-heavy, cross-referenced corpora
Indexing cost	Low (embed once)	Low (same as naive)	High (LLM entity/relationship extraction per chunk)
Query-time cost	1 embedding + 1 generation call	2–4+ tool-call rounds per query	1–2 graph queries + generation
Main failure mode	Silent wrong answers from irrelevant-but-similar chunks	Latency/cost blowup on simple queries if not gated	Expensive indexing for corpora with weak entity structure
Corpus growth behavior	Precision degrades sharply past a size threshold	Same degradation, but self-correction partially compensates	Scales with entity density, not raw token count

Choosing without over-building

The practical heuristic: start with naive RAG plus a decent hybrid search (BM25 + embeddings, reranked) — it solves a surprising majority of single-hop lookup questions at the lowest cost and latency. Add the agentic loop only for the query classes where you can show naive RAG actually fails — multi-hop questions, or anywhere a wrong-but-confident answer is costly enough that a verification round-trip is worth the latency. Reach for GraphRAG only when you can point to specific questions your users ask that are fundamentally about relationships between entities, not facts within documents — and even then, consider running it as a fallback path behind agentic RAG rather than the default, since most queries against most corpora don't need graph traversal.
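For the hybrid-search starting point, one common way to combine a BM25 ranking with an embedding ranking is reciprocal rank fusion (RRF). That choice is an assumption on my part, not something the heuristic above prescribes, and RRF is a fusion step rather than a learned reranker. The sketch takes two already-computed rankings of hypothetical document IDs and merges them; `k=60` is the conventional default constant.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each document scores the sum of 1 / (k + rank) over the lists it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings from a BM25 index and a vector index.
bm25_ranking = ["d3", "d1", "d2"]
vector_ranking = ["d1", "d4", "d3"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
print(fused)
```

Documents that both retrievers rank highly rise to the top, while a document only one retriever found still stays in the list.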

The mistake to avoid is picking the architecture based on how sophisticated it sounds rather than on which failure mode actually shows up in your query logs. Pull a sample of your worst RAG answers before you rebuild anything: the failure pattern in that sample tells you which of the three problems above you actually have.
