RAG is not one architecture — it is three structurally different systems with different costs and failure modes. Here is what actually changes between naive, agentic, and graph-based retrieval, and how to pick without over-building.
"RAG is broken" is one of the most common complaints in production LLM systems, and it's almost always imprecise. Retrieval-augmented generation isn't one architecture — it's a family of at least three structurally different systems that get lumped under the same three letters. A naive RAG pipeline, an agentic RAG loop, and a graph-based RAG system fail in different ways, cost different amounts to run, and suit different problems. Diagnosing "RAG is broken" without knowing which of the three you built is like debugging "the network is slow" without knowing if you mean DNS, TCP, or application latency.
This post walks through what actually changes at the architecture level between the three, where each one breaks in practice, and how to pick without over-building.
The baseline architecture is a fixed, one-shot pipeline: chunk the corpus, embed the chunks, store the vectors, and at query time embed the question, run a similarity search (usually top-k cosine similarity), and stuff the retrieved chunks into the prompt alongside the question.
query → embed → vector search (top-k) → stuff into prompt → generate
This works well for the case it was designed for: a single, well-phrased question with an answer that lives in one or two contiguous chunks of text. It's cheap, it's fast (one embedding call, one ANN lookup), and it's easy to reason about.
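The pipeline above can be sketched in a few lines. This is a minimal illustration, not a production setup: `embed` is a toy bag-of-words stand-in for a real embedding model, the search is a brute-force scan rather than an ANN index, and all function names and sample chunks are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model (assumption): bag-of-words counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    # Index time: embed every chunk once and keep (chunk, vector) pairs.
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(index: list[tuple[str, Counter]], question: str, k: int = 2) -> list[str]:
    # Query time: embed the question, rank chunks by cosine similarity, keep top-k.
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(question: str, chunks: list[str]) -> str:
    # "Stuff" the retrieved chunks into the prompt next to the question.
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Hypothetical sample corpus.
CHUNKS = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is open Monday to Friday.",
    "Shipping takes three to five business days.",
]
index = build_index(CHUNKS)
question = "When are refunds issued?"
prompt = build_prompt(question, retrieve(index, question, k=1))
```

Note that `retrieve` always returns `k` chunks, however low their similarity scores are; nothing in this shape checks whether the chunks actually answer the question.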
It breaks down in three predictable ways:
Naive RAG's failure mode is silent: it doesn't error, it just answers confidently from irrelevant context. That makes it the hardest kind of failure to catch in review.
Agentic RAG restructures retrieval from a fixed pipeline stage into a tool the model can call, inspect the results of, and call again. Instead of "always retrieve once, then generate," the loop looks like:
query → retrieve → evaluate: sufficient? → yes: generate / no: rewrite query → retrieve again
The load-bearing difference is the decision point: an agentic RAG system has an explicit checkpoint where it evaluates whether the retrieved evidence is sufficient before generating. If it isn't, the loop rewrites the query (expanding an acronym, splitting a multi-hop question into sub-questions, or trying a different retrieval strategy) and retrieves again.
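As a rough sketch of that control flow, the loop below takes the retriever, the sufficiency check, the query rewriter, and the generator as plain callables. In a real system each of these would typically be a model or index call; here they are toy stand-ins, and every name, including `max_rounds`, is an assumption made for illustration.

```python
from typing import Callable

def agentic_answer(
    question: str,
    retrieve: Callable[[str], list[str]],
    is_sufficient: Callable[[str, list[str]], bool],
    rewrite: Callable[[str, list[str]], str],
    generate: Callable[[str, list[str]], str],
    max_rounds: int = 3,
) -> tuple[str, int]:
    """Retrieve, check the evidence, and retry with a rewritten query if needed.

    Returns the answer and the number of retrieval rounds used.
    """
    query = question
    evidence: list[str] = []
    for round_no in range(1, max_rounds + 1):
        for doc in retrieve(query):
            if doc not in evidence:
                evidence.append(doc)
        # The explicit checkpoint: is the evidence good enough to answer?
        if is_sufficient(question, evidence):
            return generate(question, evidence), round_no
        query = rewrite(question, evidence)
    # Round budget exhausted: answer from whatever was found.
    return generate(question, evidence), max_rounds

# Toy stand-ins (assumptions) so the loop can run end to end.
CORPUS = ["The service level agreement guarantees 99.9% uptime."]

def toy_retrieve(query: str) -> list[str]:
    words = query.lower().split()
    return [d for d in CORPUS if all(w in d.lower() for w in words)]

def toy_sufficient(question: str, evidence: list[str]) -> bool:
    return len(evidence) > 0

def toy_rewrite(question: str, evidence: list[str]) -> str:
    # Expand an acronym, one of the rewrite strategies mentioned above.
    return question.lower().replace("sla", "service level agreement")

def toy_generate(question: str, evidence: list[str]) -> str:
    return evidence[0] if evidence else "I don't know."

answer, rounds = agentic_answer(
    "SLA uptime", toy_retrieve, toy_sufficient, toy_rewrite, toy_generate
)
```

In this toy run the first query finds nothing, the rewrite expands the acronym, and the second round retrieves the matching document. The `max_rounds` cap is what keeps the loop from running indefinitely when no rewrite helps.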
The research underpinning this pattern includes Self-RAG (which trains the model to emit reflection tokens judging its own retrieval quality) and Corrective RAG / CRAG (which adds a lightweight retrieval evaluator that grades documents as correct, ambiguous, or incorrect, and falls back to web search when local retrieval scores poorly). Both formalize the same idea: retrieval quality should be checked, not assumed.
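To make the "checked, not assumed" idea concrete, here is a small sketch of a three-way retrieval grade in the spirit of CRAG. It is not the paper's implementation: the thresholds, the function names, and the choice to use both sources in the ambiguous case are assumptions for illustration, and the relevance scores would come from whatever evaluator you use.

```python
from enum import Enum

class Grade(Enum):
    CORRECT = "correct"
    AMBIGUOUS = "ambiguous"
    INCORRECT = "incorrect"

def grade_retrieval(scores: list[float], upper: float = 0.7, lower: float = 0.3) -> Grade:
    """Grade a retrieval result from per-document relevance scores in [0, 1].

    The thresholds are hypothetical and would need tuning on real data.
    """
    if not scores:
        return Grade.INCORRECT
    best = max(scores)
    if best >= upper:
        return Grade.CORRECT
    if best < lower:
        return Grade.INCORRECT
    return Grade.AMBIGUOUS

def choose_sources(grade: Grade) -> list[str]:
    # Poor local retrieval falls back to web search; the mixed case uses both (assumption).
    if grade is Grade.CORRECT:
        return ["local"]
    if grade is Grade.INCORRECT:
        return ["web"]
    return ["local", "web"]
```

For example, `choose_sources(grade_retrieval([0.1, 0.2]))` routes to web search because no local document scored above the lower threshold.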
In practice, most production agentic RAG systems don't need the full trained-reflection-token approach. A simpler pattern works: give the model a search tool and a system prompt instructing it to cite sources and to search again if the first results don't answer the question, then let the normal agent loop (plan → call tool → observe → decide) handle the rest. This is structurally identical to any other tool-use loop — the retrieval index is just one more tool the model can call, alongside a calculator, a code execution sandbox, or a REST endpoint that does deterministic formatting work. Utilix's MCP server, for example, exposes utility endpoints (hashing, diffing, format conversion) in exactly this shape: a tool an agent calls mid-loop and inspects the output of before deciding what to do next. Retrieval isn't architecturally special — treating it as just another callable tool is what makes the self-correction loop possible.
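One way to picture "retrieval is just another tool" is a registry in which the search function sits next to a deterministic utility. The sketch below is illustrative only: the tool names, the call format, and the toy `search` are hypothetical, and it does not reflect the API of any particular MCP server or agent framework.

```python
import hashlib
from typing import Callable

# Hypothetical document store for the toy search tool.
DOCS = {
    "refund policy": "Refunds are issued within 14 days of purchase.",
    "support hours": "Support is available Monday to Friday.",
}

def search(query: str) -> str:
    # Toy retrieval tool: return the first document whose key appears in the query.
    for key, text in DOCS.items():
        if key in query.lower():
            return text
    return ""

def sha256_hex(text: str) -> str:
    # Deterministic utility tool, unrelated to retrieval.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Retrieval is registered exactly like any other callable.
TOOLS: dict[str, Callable[[str], str]] = {
    "search": search,
    "sha256": sha256_hex,
}

def run_tool_call(call: dict) -> dict:
    """Dispatch one tool call of the assumed form {"name": ..., "argument": ...}."""
    name = call["name"]
    if name not in TOOLS:
        return {"name": name, "error": "unknown tool"}
    return {"name": name, "output": TOOLS[name](call["argument"])}
```

Because `search` returns an ordinary tool result, the agent can inspect it (here, an empty string signals "nothing found") and decide to call it again with a different query.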
Agentic RAG trades latency and token cost for correctness. Each retrieval round-trip adds a model call to decide whether to retrieve again, plus the retrieval latency itself. A question that naive RAG answers in one embedding call and one generation call might take agentic RAG two to four tool-call rounds. For high-volume, low-stakes queries (a chatbot answering "what are your business hours"), that overhead is waste. For high-stakes, multi-hop, or compliance-sensitive queries, it's the difference between a wrong answer and a correct one.
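A back-of-the-envelope latency model makes the trade-off easier to reason about. The formulas below follow the description above (each agentic round adds one decision call plus one retrieval); the function names are made up, and the timings in the usage note are hypothetical placeholders, not measurements.

```python
def naive_latency(t_retrieve: float, t_generate: float) -> float:
    # One retrieval (embedding + vector lookup) followed by one generation call.
    return t_retrieve + t_generate

def agentic_latency(rounds: int, t_retrieve: float, t_decide: float, t_generate: float) -> float:
    # Each round pays for a model call that decides what to do, plus a retrieval;
    # the final answer still needs one generation call.
    return rounds * (t_decide + t_retrieve) + t_generate
```

With assumed timings of 0.05 s per retrieval, 0.8 s per decision call, and 1.5 s per generation, `naive_latency(0.05, 1.5)` comes to about 1.55 s, while `agentic_latency(3, 0.05, 0.8, 1.5)` comes to about 4.05 s. Swap in your own measured numbers before drawing conclusions.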
Both naive and agentic RAG are fundamentally similarity-based: they find text that's semantically close to the query. Neither is good at questions that hinge on structured relationships between entities rather than textual similarity — "which vendors does our largest customer's parent company also contract with?" has almost no lexical or semantic overlap with the documents that contain the answer.
GraphRAG (the pattern popularized by Microsoft Research's 2024 paper "From Local to Global: A Graph RAG Approach to Query-Focused Summarization," and now implemented in various open-source forms) addresses this by building a knowledge graph from the corpus during indexing: an LLM extracts entities and relationships from each chunk, those get merged into a graph, and the graph is clustered into hierarchical communities with LLM-generated summaries at each level. At query time, retrieval walks the graph, pulling in entities, their relationships, and community summaries, rather than doing a flat similarity search over text chunks.
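The core idea, answering a question by following relationships rather than matching text, can be shown with a toy graph. This sketch assumes the entity and relationship extraction has already produced (subject, relation, object) triples, and it leaves out community clustering and summaries entirely. The company names and relation labels are invented.

```python
Triple = tuple[str, str, str]
Graph = dict[str, dict[str, set[str]]]

def build_graph(triples: list[Triple]) -> Graph:
    # Merge extracted triples into an adjacency map: subject -> relation -> objects.
    graph: Graph = {}
    for subject, relation, obj in triples:
        graph.setdefault(subject, {}).setdefault(relation, set()).add(obj)
    return graph

def follow(graph: Graph, start: set[str], relation: str) -> set[str]:
    # One traversal hop: every node reachable from `start` via `relation`.
    reached: set[str] = set()
    for node in start:
        reached |= graph.get(node, {}).get(relation, set())
    return reached

# Hypothetical triples, as if extracted from three unrelated documents.
TRIPLES: list[Triple] = [
    ("Acme", "parent", "Globex"),
    ("Globex", "contracts_with", "Initech"),
    ("Globex", "contracts_with", "Hooli"),
    ("Umbrella", "contracts_with", "Initech"),
]
graph = build_graph(TRIPLES)

# "Which vendors does Acme's parent company contract with?" as two hops.
parents = follow(graph, {"Acme"}, "parent")
vendors = follow(graph, parents, "contracts_with")
```

The two facts needed for the answer can live in documents that share no wording with the question; the graph connects them through the shared entity.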
The indexing cost profile is very different. Naive and agentic RAG index with one embedding call per chunk: chunk and embed, done. GraphRAG's indexing pass also grows with the number of chunks, but each chunk needs an LLM call (or several) to extract entities and relationships, followed by a graph clustering and summarization step. LLM extraction calls are typically far more expensive than embedding calls, so for a large corpus the indexing cost can dwarf the cost of actually serving queries.
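A rough cost estimate shows where the gap comes from. The sketch below simply multiplies call counts by per-call prices; the function names are hypothetical, and the prices in the usage note are placeholders rather than real vendor rates.

```python
def embedding_index_cost(num_chunks: int, cost_per_embedding: float) -> float:
    # Naive and agentic RAG: one embedding call per chunk.
    return num_chunks * cost_per_embedding

def graph_index_cost(
    num_chunks: int,
    llm_calls_per_chunk: int,
    cost_per_llm_call: float,
    clustering_cost: float = 0.0,
) -> float:
    # Graph-based RAG: one or more LLM extraction calls per chunk,
    # plus a one-off cost for clustering and community summaries.
    return num_chunks * llm_calls_per_chunk * cost_per_llm_call + clustering_cost
```

With 10,000 chunks, an assumed 0.0001 per embedding, and an assumed 0.01 per LLM call at two calls per chunk, `embedding_index_cost(10_000, 0.0001)` is about 1 while `graph_index_cost(10_000, 2, 0.01)` is about 200, before any clustering cost is added.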
GraphRAG is worth the indexing investment for corpora where the value is in the connections, not the paragraphs: organizational knowledge bases, legal document sets with cross-references, codebases with import/dependency graphs, or research literature with citation networks. It's a poor fit for corpora that are mostly independent, self-contained documents (a support ticket archive, a product FAQ) — there, the "relationships" GraphRAG would extract are thin, and you're paying graph-construction cost for no retrieval benefit over a good hybrid search.
| | Naive RAG | Agentic RAG | GraphRAG |
|---|---|---|---|
| Retrieval shape | Single top-k vector search | Iterative: retrieve → evaluate → retrieve again | Graph traversal + community summaries |
| Best for | Single-fact lookup in one document | Multi-hop questions, compliance/citation needs | Relationship-heavy, cross-referenced corpora |
| Indexing cost | Low (embed once) | Low (same as naive) | High (LLM entity/relationship extraction per chunk) |
| Query-time cost | 1 embedding + 1 generation call | 2–4+ tool-call rounds per query | 1–2 graph queries + generation |
| Main failure mode | Silent wrong answers from irrelevant-but-similar chunks | Latency/cost blowup on simple queries if not gated | Expensive indexing for corpora with weak entity structure |
| Corpus growth behavior | Precision degrades sharply past a size threshold | Same degradation, but self-correction partially compensates | Scales with entity density, not raw token count |
The practical heuristic: start with naive RAG plus a decent hybrid search (BM25 + embeddings, reranked) — it solves a surprising majority of single-hop lookup questions at the lowest cost and latency. Add the agentic loop only for the query classes where you can show naive RAG actually fails — multi-hop questions, or anywhere a wrong-but-confident answer is costly enough that a verification round-trip is worth the latency. Reach for GraphRAG only when you can point to specific questions your users ask that are fundamentally about relationships between entities, not facts within documents — and even then, consider running it as a fallback path behind agentic RAG rather than the default, since most queries against most corpora don't need graph traversal.
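For the hybrid-search starting point, one common way to combine a BM25 ranking with an embedding ranking is reciprocal rank fusion (RRF). It is offered here as one option, not as the method this post prescribes. The sketch assumes each retriever has already returned an ordered list of document IDs; the IDs and the constant `k = 60` (a commonly used default) are illustrative.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores 1 / (k + rank) in every list it appears in (rank starts at 1).
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical outputs from a BM25 retriever and an embedding retriever.
bm25_ranking = ["a", "b", "c"]
dense_ranking = ["b", "d", "a"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Documents that both retrievers rank highly rise to the top of `fused`, which can then be passed to a reranker or straight into the prompt.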
The mistake to avoid is picking the architecture based on how sophisticated it sounds rather than on which failure mode actually shows up in your query logs. Pull a sample of your worst RAG answers before you rebuild anything; the failure pattern in that sample tells you which of the three problems above you actually have.