Why Retrieval-Augmented Generation Quietly Stopped Working

Three years ago, "RAG" — retrieval-augmented generation — was the architectural answer to almost every enterprise question about LLMs. How do I let the model use my private data without retraining? RAG. How do I cite my sources? RAG. How do I keep the model current? RAG. The pattern was: chunk your documents, embed them, store the vectors, retrieve the top-k for each query, stuff them into the prompt. Simple. Beautiful. Demoable in 90 minutes.

It's also, for the production teams DataSynth interviewed for this piece, quietly being unwound.

38%
of enterprise RAG systems deployed in 2024 were either rebuilt from scratch or replaced with a different architecture in the following 18 months. The remaining 62% are mostly Q&A systems where the corpus is small and the queries are templated.
Source: DataSynth survey · 41 named enterprise teams · April 2026

The first failure mode: embedding similarity is not relevance

The mechanical heart of vector retrieval is the assumption that semantic similarity, as measured by a cosine between two embedding vectors, is a good proxy for "this document is relevant to this query". For dense, paraphrase-style questions on a well-curated corpus, that's true. For the messy reality of enterprise search, it is — and was always — at best half true.
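
To make the proxy concrete, here is a minimal sketch of the score that vector retrieval ranks by. The three-dimensional vectors and document names are hand-picked toy values for illustration, not the output of any real embedding model.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=2):
    """Return the ids of the k documents with the highest cosine to the query."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]

# Hypothetical toy embeddings.
docs = {
    "playbook": [0.2, 0.9, 0.1],
    "faq": [0.9, 0.3, 0.1],
    "changelog": [0.1, 0.1, 0.9],
}
query = [0.8, 0.4, 0.1]

print(top_k(query, docs))
```

With these toy values the "faq" vector ranks first simply because it points in nearly the same direction as the query. Nothing in the score checks whether that document actually answers the question.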

The empirical work behind this is now substantial. Thakur et al.¹ showed in the BEIR benchmark that BM25, a 30-year-old keyword scoring algorithm, beats most dense retrievers on most tasks when the retrievers are evaluated zero-shot, that is, on corpora they were not trained on. The papers that introduced the dominant 2023-era embedding models tested almost exclusively on benchmarks where the query and the target passage shared substantial surface vocabulary. Real enterprise queries rarely do. The query "How do I escalate a P1 incident?" and the document that actually answers it (a 2019 wiki page titled incident_response_playbook_v3.md whose body uses the words "severity-one") share almost no surface vocabulary and not many semantic atoms either.
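
For readers who have not looked at BM25 in a while, the following is a minimal sketch of Okapi BM25 scoring over whitespace tokens. The `k1` and `b` defaults are conventional values, the idf uses the common variant with 1 added inside the logarithm, and the two toy documents are made up for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document (a list of tokens) against the query tokens."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

# Hypothetical two-document corpus.
docs = [
    "severity-one incidents must be paged to the on-call lead".split(),
    "how to escalate an expense report for approval".split(),
]
query = "how do i escalate a p1 incident".split()

scores = bm25_scores(query, docs)
print(scores)
```

On this toy pair the playbook-style document scores exactly zero, because "incident" and "incidents" are different tokens and nothing else overlaps, while the irrelevant expense document picks up credit for "how" and "escalate". Keyword scoring has its own blind spots on vocabulary mismatch.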

[Figure: Recall@10 across four enterprise corpora (engineering wiki, legal / contracts, customer-support tickets, open-domain Wikipedia) for BM25 (keyword), dense (E5-Large), and hybrid (BM25 + dense) retrieval. Higher is better · 1,200 evaluation queries with manual relevance labels.]

Dense retrieval beats BM25 in only one of the four corpora — open Wikipedia, which is also exactly the domain on which the dense retriever was originally trained. Hybrid wins everywhere by a comfortable margin. Hybrid is now the responsible default; the share of new production systems using dense-only retrieval has fallen from a 2023 peak of around 70% to under 30% today, according to the same survey above.
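
The article does not say how the hybrid systems combined their two result lists. One widely used option is reciprocal rank fusion, sketched below as an assumption rather than as the method behind the chart. The document ids and the two input rankings are made up, and `k=60` is the constant conventionally used with this technique.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids; each list adds 1 / (k + rank)."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))

# Hypothetical outputs of a keyword retriever and a dense retriever.
bm25_ranking = ["playbook", "oncall_rota", "faq"]
dense_ranking = ["faq", "playbook", "changelog"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Because the fusion uses only ranks, it sidesteps the problem that BM25 scores and cosine similarities live on different scales. A document that both retrievers place near the top ("playbook" here) beats one that only a single retriever likes.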

The second failure mode: chunking destroys context

Embedding-based retrieval requires you to chop documents into chunks small enough to embed coherently — typically 200 to 800 tokens. That chop is where most of the information loss happens. A clinical trial protocol or a contract clause refers backwards and forwards across pages; a 500-token chunk strips off the antecedents and the consequences, and the chunk that wins the cosine race is often the one whose surface vocabulary best matches the query but whose meaning, denuded of context, is wrong or trivially ambiguous.
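
A minimal sketch of fixed-size chunking shows where the antecedents go missing. It treats whitespace-separated words as tokens, which is a simplification, and the contract sentence and chunk size are made up to keep the example short.

```python
def chunk_tokens(tokens, size, overlap=0):
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the last."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
        start += step
    return chunks

# Hypothetical contract text.
text = (
    "The Supplier shall refund all fees within 30 days . "
    "This obligation lapses if the Customer is in breach ."
)
tokens = text.split()

plain = chunk_tokens(tokens, size=10)
overlapped = chunk_tokens(tokens, size=10, overlap=4)

for chunk in plain:
    print(" ".join(chunk))
```

The second chunk opens with "This obligation" and no longer contains the refund clause it refers to. Overlap softens the cut, but it only carries a few tokens across the boundary and cannot restore a reference that sits pages away.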

The 2025 industry workaround was late chunking²: run the whole document through a long-context embedding model first, then pool the resulting token embeddings into one vector per chunk, so that each chunk vector carries context from the rest of the document. It works reasonably well, but it pushes the cost back onto storage and forces long-context embedding models, which are slow. Most teams that tried it have backed off in favour of either much longer context windows (just stuff the whole document in) or agentic retrieval (let the model issue multiple searches, refine, drill in).
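
In the formulation by Günther et al.², the chunk vectors are pooled from token embeddings that were computed over the full document. The sketch below covers only that pooling step and assumes the token embeddings already exist. The two-dimensional vectors and the chunk spans are toy values, and the function names are hypothetical.

```python
def mean_pool(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def late_chunk(token_vectors, spans):
    """Pool document-level token embeddings into one vector per (start, end) span."""
    return [mean_pool(token_vectors[start:end]) for start, end in spans]

# Hypothetical token embeddings for a four-token document. In a real system
# these would come from one forward pass of a long-context embedding model
# over the whole document, so each token vector already reflects its context.
token_vectors = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
spans = [(0, 2), (2, 4)]  # chunk boundaries as token offsets, end exclusive

chunk_vectors = late_chunk(token_vectors, spans)
print(chunk_vectors)
```

The difference from ordinary chunking is the order of operations: the model sees the whole document before any boundary is drawn, which is also why the approach needs an embedding model whose context window covers the document.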

12×
Latency penalty of agentic retrieval vs single-shot vector search on the same corpus. The trade-off is real, but for high-value queries (legal research, incident triage) the per-query cost is dwarfed by what an analyst would charge to do the same work.
Basis: median of 8 benchmarked stacks, simple to multi-step queries

The third failure mode: the model lies about what it retrieved

This is the most insidious one and the reason a lot of RAG systems are quietly being deprecated. A model that has been given retrieved context will sometimes use it, sometimes ignore it in favour of its parametric knowledge, sometimes hallucinate a confident-sounding answer that splices fragments of both. The 2024 benchmark by Liu et al.³ found that even GPT-4-class models cited the wrong retrieved passage about 18% of the time on multi-passage QA — and on tasks where the correct answer required not using a misleading retrieved passage, performance was meaningfully worse than the same model with no retrieval at all.

In other words: RAG sometimes makes the model wrong by giving it bad evidence to anchor on.

What's replacing it

Three architectures are picking up the production load that RAG once handled:

Long-context with structured prompts. With million-token contexts now table stakes, many corpora that used to need a retriever now just fit. Anthropic's Claude, Gemini 2.5 Pro, and the open-weight Mistral Magnitude all handle this regime competently. The cost per query is higher; the engineering surface is dramatically smaller.

Agentic search with tool use. A small "router" model decomposes the question, issues several keyword and embedding searches in parallel, re-ranks, drills in. This is closer to how a human researcher works and produces visibly better answers on the kind of question that previously broke single-shot RAG. The downside is latency and operational complexity.

Fine-tuning on the corpus. For static, domain-specific corpora (medical guidelines, a legal codebook), continued pre-training on the corpus has quietly become competitive with retrieval again — especially when combined with parameter-efficient adapters that can be swapped per tenant. We expect this to grow.
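
Of the three, the agentic pattern is the hardest to picture from prose, so here is a minimal sketch of its control flow: decompose, search in parallel, merge, re-rank. Every function, the toy corpus, and the hard-coded sub-queries are hypothetical stand-ins. In a real system the decomposer and re-ranker would be model calls, and an embedding searcher would sit alongside the keyword one.

```python
from concurrent.futures import ThreadPoolExecutor

def agentic_search(question, decompose, searchers, rerank, top_n=3):
    """Run every searcher on every sub-query in parallel, then merge and re-rank."""
    sub_queries = decompose(question)
    jobs = [(search, q) for q in sub_queries for search in searchers]
    with ThreadPoolExecutor(max_workers=max(1, len(jobs))) as pool:
        result_lists = list(pool.map(lambda job: job[0](job[1]), jobs))
    # De-duplicate while keeping first-seen order.
    candidates = list(dict.fromkeys(d for results in result_lists for d in results))
    return rerank(question, candidates)[:top_n]

# Hypothetical toy corpus and components.
CORPUS = {
    "playbook": "severity-one incidents are paged to the on-call lead",
    "rota": "on-call rota and escalation contacts",
    "expenses": "how to escalate an expense report",
}

def toy_decompose(question):
    # Stand-in for the router model: returns fixed sub-queries.
    return ["escalate incident", "severity-one on-call"]

def keyword_search(query):
    # Return ids of documents sharing at least one token with the query.
    terms = set(query.split())
    return [doc_id for doc_id, text in CORPUS.items() if terms & set(text.split())]

def toy_rerank(question, candidates):
    # Stand-in for a re-ranker: documents mentioning "on-call" go first.
    return sorted(candidates, key=lambda doc_id: "on-call" not in CORPUS[doc_id])

hits = agentic_search(
    "How do I escalate a P1 incident?",
    toy_decompose,
    [keyword_search],
    toy_rerank,
    top_n=2,
)
print(hits)
```

The second sub-query is what rescues the playbook here: the first one matches only the expense document. The cost is visible in the structure too, since each question fans out into several searches plus a re-ranking pass, which is where the latency and operational complexity come from.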

The honest summary is that RAG was a useful intermediate architecture for a moment when context windows were small and embedding models had just got good. Neither condition still holds: context windows are no longer small, and good embeddings are no longer new. What replaces it will be less uniform, because different problems will want different stacks. The era when one architecture diagram answered every "how should we put our docs into an LLM" question is over, and the field is better for it.

References & sources
  1. Thakur, N. et al. (2021). BEIR: A Heterogeneous Benchmark for Zero-Shot Evaluation of Information Retrieval Models. arXiv:2104.08663.
  2. Günther, M. et al. (2024). Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models. arXiv:2409.04701.
  3. Liu, N. F., Lin, K., Hewitt, J., et al. (2024). Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172.
  4. DataSynth Research, Q2 2026 survey of 41 enterprise teams running production RAG systems. Methodology and aggregated answers available on request.
  5. Karpukhin, V. et al. (2020). Dense Passage Retrieval for Open-Domain Question Answering. arXiv:2004.04906. The original DPR paper, useful for understanding the assumptions.