The three dominant multi-agent orchestration topologies each fail in a different, predictable way once you move past the demo — here is how to pick one based on where your task actually breaks, not which pattern sounds more sophisticated.
Ask three teams building agent systems how they orchestrate multiple agents and you'll get three different topologies, each defended as "obviously correct" for their use case. That's not indecision. The orchestration pattern is a load-bearing architecture decision with real failure modes, and those failure modes don't show up until you're past the demo and into production traffic.
This post walks through the three dominant patterns — pipeline, supervisor-worker, and mesh — what each one actually buys you, and specifically where each one breaks. The goal isn't "which pattern is best," it's giving you enough of a mental model to predict which failure you're signing up for before you've built around it.
Every multi-agent system is solving the same underlying problem: decomposing a task across multiple LLM calls that can't (or shouldn't) share a single context window, then reassembling their outputs into something coherent. The topology is just how you wire the message-passing.
Pipeline. Agents run in a fixed sequence, each consuming the previous agent's output as input. A research agent hands off to a writing agent, which hands off to an editing agent. No agent talks back to an earlier stage.
Supervisor-worker. A coordinator agent decomposes the task, dispatches subtasks to worker agents (often the same agent invoked multiple times with different prompts/tools), and synthesizes their results. Workers don't talk to each other — everything routes through the supervisor.
Mesh. Agents communicate peer-to-peer, each with visibility into what others are doing, converging on a result through iteration rather than a fixed hierarchy. Think of debate-style or critique-style setups where agents review and revise each other's output.
Here are the same three topologies as a diagram, including the direction messages actually flow. Direction matters more than it looks, because it's exactly where each pattern's failure mode lives.
```text
Pipeline:   A → B → C → result
Supervisor: A ⇄ (B, C, D) → A synthesizes → result
Mesh:       A ⇄ B ⇄ C, iterate until convergence → result
```
Pipelines are the easiest pattern to build and the easiest to misjudge. Because there's no backchannel, an error introduced at stage A doesn't fail loudly — it gets faithfully passed to stage B, which does its best with bad input, and the pipeline completes successfully. You get a wrong answer with a green checkmark.
The concrete failure: a research agent hallucinates a fact, a writing agent writes confident prose around it, an editing agent polishes the prose. All three stages "succeeded." Nothing in the pipeline detected that stage one was wrong, because nothing in a pipeline is designed to detect that — each agent's job is to trust its input and add value on top of it.
This is fine when each stage's failure mode is structural (malformed JSON, missing field) and easy to validate mechanically between stages. It's not fine when the failure mode is semantic (plausible-but-wrong content), because catching that requires another LLM call to re-evaluate the claim against source material — which is really a supervisor pattern wearing a pipeline's clothes.
Mitigation that actually works: insert a validation stage that checks stage N's output against stage N-1's inputs, not just stage N's own internal consistency. A pipeline that only validates "is this valid JSON" catches nothing about factual drift.
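A minimal sketch of that validation step, under loud assumptions: `StageResult`, its `claims` field, `unsupported_claims`, and `run_pipeline` are hypothetical names, and the substring match is only a stand-in for a real semantic check (which would more likely be another LLM call or a retrieval lookup). The part that matters is the reference: each stage's claims are checked against what the previous stage was given, not against the stage's own output.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class StageResult:
    output: str
    # Factual claims the stage asserts; a hypothetical field for this sketch.
    claims: list[str] = field(default_factory=list)


class ValidationError(Exception):
    """Raised when a stage's claims cannot be traced to upstream material."""


def unsupported_claims(claims: list[str], reference: str) -> list[str]:
    # Placeholder check: case-insensitive substring match against the reference.
    return [c for c in claims if c.lower() not in reference.lower()]


def run_pipeline(source_text: str,
                 stages: list[Callable[[str], StageResult]]) -> str:
    inputs = [source_text]  # inputs[i] is the text stage i received
    for i, stage in enumerate(stages):
        result = stage(inputs[i])
        # Compare against the previous stage's input (the first stage is
        # compared against its own input, the original source).
        reference = inputs[max(i - 1, 0)]
        missing = unsupported_claims(result.claims, reference)
        if missing:
            name = getattr(stage, "__name__", repr(stage))
            # Fail loudly instead of handing bad input to the next stage.
            raise ValidationError(
                f"stage {i} ({name}) made unsupported claims: {missing}"
            )
        inputs.append(result.output)
    return inputs[-1]
```

With this shape, a stage that introduces a claim absent from its upstream material stops the run with an error naming the stage, rather than completing with a green checkmark.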
Supervisor-worker is the most popular pattern for a reason — it maps cleanly onto how you'd describe the problem to a person ("figure out three things, then combine them"), and it gives you one place to add guardrails: the supervisor's synthesis step.
The failure mode is structural, not conceptual: every worker's output has to route back through the supervisor's context window before it becomes useful. If you fan out to 8 workers and each returns 2,000 tokens, the supervisor is holding 16,000 tokens of worker output just to synthesize a result, before you count the task prompt or the supervisor's own reasoning about how to combine them. Scale the worker count and you don't get more useful parallelism; you get a coordinator that hits context pressure and starts dropping or summarizing worker outputs before it has actually used them.
There's a second-order version of this: workers that need to know what sibling workers are doing (to avoid duplicate work, or to stay consistent) have no way to find out except by asking the supervisor to relay it — which turns your "parallel" fan-out into a chain of round trips through a single serialization point. At that point you've built a mesh with extra latency.
```python
import asyncio

# The naive supervisor loop: looks parallel, isn't once results come back.
# plan, run_worker, and synthesize are the application's own LLM-calling helpers.
async def supervise(task):
    subtasks = plan(task)                       # 1 LLM call
    results = await asyncio.gather(*[
        run_worker(st) for st in subtasks       # N calls, genuinely parallel
    ])
    return synthesize(task, results)            # 1 call, holding ALL worker output
```
The synthesize call is where the bottleneck actually lives — it's a single LLM invocation whose input size scales linearly with worker count. Budgeting for this pattern means budgeting the synthesis context, not just the fan-out cost.
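The budgeting arithmetic is simple enough to write down. This is a rough sketch, not a tokenizer-accurate estimate: the function names are hypothetical, and the context limit and reserved-token figures in the usage note are illustrative numbers rather than properties of any particular model.

```python
def synthesis_input_tokens(workers: int, tokens_per_worker: int,
                           task_prompt_tokens: int = 0) -> int:
    """Estimate the input size of the single synthesis call."""
    return task_prompt_tokens + workers * tokens_per_worker


def max_workers(context_limit: int, tokens_per_worker: int,
                reserved_tokens: int) -> int:
    """How many workers fit once room is reserved for the task prompt,
    instructions, and the supervisor's own output."""
    usable = context_limit - reserved_tokens
    return max(usable // tokens_per_worker, 0)
```

For the example above, `synthesis_input_tokens(8, 2000)` gives 16,000 tokens of worker output. With an assumed 32,000-token limit and 8,000 tokens reserved, `max_workers(32000, 2000, 8000)` caps the fan-out at 12 workers.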
Mitigation that actually works: have workers return structured, compressed summaries rather than raw output, and reserve full output for the specific worker whose result the supervisor decides it needs verbatim. This trades a little synthesis fidelity for a synthesis context budget that doesn't blow up linearly.
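One way to sketch that trade, assuming each worker can produce both a short summary and its full output: `WorkerReport`, `build_synthesis_input`, and `verbatim_ids` are hypothetical names, and how the supervisor decides which worker it needs verbatim is left out.

```python
from dataclasses import dataclass


@dataclass
class WorkerReport:
    worker_id: str
    summary: str      # short, structured digest sent to the supervisor
    full_output: str  # kept out of the synthesis prompt unless requested


def build_synthesis_input(task: str, reports: list[WorkerReport],
                          verbatim_ids: frozenset[str] = frozenset()) -> str:
    """Assemble the synthesis prompt from summaries, inlining full output
    only for the workers the supervisor asked to see verbatim."""
    sections = [f"Task: {task}"]
    for report in reports:
        if report.worker_id in verbatim_ids:
            body = report.full_output
        else:
            body = report.summary
        sections.append(f"[{report.worker_id}] {body}")
    return "\n".join(sections)
```

A first synthesis pass would run on summaries only; if the supervisor decides it needs one worker's raw result, a second pass can pass that worker's id in `verbatim_ids`.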
Mesh (or debate/critique-style) architectures are appealing because they mirror how humans actually improve work — draft, critique, revise, repeat. The problem is that "repeat" needs a termination condition, and in practice teams either hardcode a round limit (arbitrary, sometimes cuts off mid-improvement) or rely on the agents to detect convergence themselves (unreliable — LLMs are bad at knowing when they're done, and will often manufacture a reason to do one more pass).
The cost profile is the giveaway that something's wrong before you even measure output quality: token spend that should be O(n) for n agents ends up O(n × rounds), and rounds is unbounded unless you cap it. Teams that ship mesh patterns to production almost always end up capping rounds at 2-3 and calling the third round "final" regardless of whether convergence actually happened — which quietly turns the mesh back into a fixed-depth pipeline with extra steps.
The debugging cost compounds this. When a pipeline or supervisor pattern produces a bad result, you can trace it to a specific stage. When a mesh pattern produces a bad result after 4 rounds of peer revision, reconstructing which round introduced the regression means replaying the whole interaction — there's no natural checkpoint boundary the way there is between pipeline stages.
Mitigation that actually works: treat each mesh round as a checkpointed pipeline stage — log the full state after every round, cap rounds explicitly rather than letting agents self-terminate, and only use mesh where the marginal round genuinely tends to improve output (open-ended writing/critique) rather than where the answer is verifiable and a single strong pass should get you there.
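A minimal sketch of a capped, checkpointed critique loop. The names are hypothetical, and each "agent" is modeled as a plain function from text to revised text, which is a simplification of real peer-to-peer messaging. The early exit is a mechanical check (nothing changed this round), not agent self-termination.

```python
from typing import Callable


def run_mesh(draft: str,
             critics: list[Callable[[str], str]],
             max_rounds: int = 3) -> tuple[str, list[dict]]:
    """Run critique rounds up to a hard cap, recording state after each."""
    checkpoints: list[dict] = []
    current = draft
    for round_no in range(1, max_rounds + 1):
        before = current
        for critic in critics:
            current = critic(current)
        # One checkpoint per round, so a regression can be traced to the
        # round that introduced it without replaying the whole interaction.
        checkpoints.append(
            {"round": round_no, "before": before, "after": current}
        )
        if current == before:
            break  # no agent changed anything; stop early
    return current, checkpoints
```

In a real system the checkpoint would likely also carry each agent's messages and token counts, and would be written to durable storage rather than a list.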
The decision that actually matters isn't "which pattern is more sophisticated" — mesh isn't strictly better than pipeline — it's whether your subtasks are independent and whether the failure you're most worried about is structural or semantic.
| If your task is... | ...use | Because |
|---|---|---|
| Strictly sequential, each stage's input validity is mechanically checkable | Pipeline | Cheapest, fastest, and validation between stages catches the failure mode that actually occurs |
| Decomposable into independent subtasks that don't need to see each other's work | Supervisor-worker | Parallelism pays off; budget synthesis context, not just fan-out |
| Genuinely benefits from iterative critique (open-ended writing, code review, ambiguous specs) | Mesh, with a hard round cap | The marginal round adds real value; cap it before cost and traceability both degrade |
| Sequential but a later stage sometimes needs to correct an earlier one | Pipeline + a supervisor-style validation stage inserted at the risky boundary | Don't reach for full mesh just because one junction needs a backchannel |
Most production systems are hybrids that don't announce themselves as such — a supervisor that dispatches to workers, one of which is internally a pipeline, with a two-round critique loop bolted onto the final synthesis. That's not a design smell. It's what happens when you match topology to the actual shape of the coordination problem instead of picking one pattern and forcing every subtask through it.
Regardless of topology, there's a failure mode common to all three once agents start calling external tools (via MCP or a REST API) rather than just passing text: a worker or pipeline stage that calls a non-idempotent tool (create an order, send an email, write a record) and then gets retried — because the supervisor timed out waiting, or a mesh round decided to redo the step — will execute that side effect twice. A dev-tools API being probed by an agent (say, hitting a JSON formatter or a UUID generator through something like Utilix's REST API) is harmless to call twice. A payment API or a database write is not.
This is the one place where topology stops being the interesting variable and idempotency becomes the load-bearing property. If a tool call in your orchestration graph can be retried — and in any multi-agent system with timeouts, it eventually will be — it needs an idempotency key or it needs to be safe to call twice. That's true whether the retry comes from a pipeline stage timing out, a supervisor re-dispatching a worker, or a mesh round redoing prior work.
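A minimal sketch of the idempotency-key idea, with `IdempotentToolRunner`, `make_key`, and `step_id` as hypothetical names. It assumes tool arguments are JSON-serializable and keeps results in memory; a real system would need a durable store, and would also have to handle concurrent retries and a crash between the side effect and the write, neither of which this sketch attempts.

```python
import hashlib
import json
from typing import Any, Callable


class IdempotentToolRunner:
    """Deduplicate tool calls by a key derived from the logical step."""

    def __init__(self) -> None:
        # In-memory only, for illustration.
        self._results: dict[str, Any] = {}

    @staticmethod
    def make_key(tool_name: str, args: dict, step_id: str) -> str:
        # The same tool, args, and logical step always produce the same key,
        # so a retry of that step maps onto the original call.
        payload = json.dumps(
            {"tool": tool_name, "args": args, "step": step_id},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def call(self, tool: Callable[..., Any], tool_name: str,
             args: dict, step_id: str) -> Any:
        key = self.make_key(tool_name, args, step_id)
        if key not in self._results:
            # First time this step runs: perform the side effect once.
            self._results[key] = tool(**args)
        # A retry gets the recorded result instead of a second side effect.
        return self._results[key]
```

The `step_id` should identify the logical step in the orchestration graph, not the attempt, so that a supervisor re-dispatch or a repeated mesh round reuses the key. Where the downstream API accepts a client-supplied idempotency key, passing the same key through to it is the more robust option.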
Pick topology based on where your task's failure mode actually lives — structural errors want pipelines with inter-stage validation, independent subtasks want supervisor-worker with compressed synthesis input, and only genuinely iterative-improvement tasks want mesh, and even then with a hard round cap. Whichever you pick, audit every tool call your agents can make for idempotency before you scale up fan-out or retries — that failure shows up in production, not in the demo.