
For two years, "RAG" (Retrieval Augmented Generation) was the holy grail of Enterprise AI. Every vendor pitched it. Every consultant recommended it. Every Medium article celebrated it. We built our entire AI documentation system around it, and we did everything by the book.
We chose Pinecone as our vector database after evaluating Milvus, Weaviate, and Chroma. We experimented with dozens of embedding models—OpenAI's text-embedding-ada-002, Cohere's embed-english-v3.0, and even open-source alternatives like BGE and E5. We implemented sophisticated chunking strategies: fixed-size chunks with overlap, semantic chunking based on paragraph boundaries, and recursive chunking that preserved document structure. We added re-ranking with Cohere's rerank endpoint to improve retrieval precision. We built hybrid search combining BM25 keyword matching with dense vector similarity. We added metadata filtering so users could scope searches to specific docs.
Our RAG pipeline became a distributed system with seven moving parts: the embedding service, the vector database, the keyword index (Elasticsearch), the reranker, the query router, the response synthesizer, and the caching layer. We had three engineers maintaining it. The monthly infrastructure cost was $4,000 for vector DB hosting alone.
All of this complexity existed to solve one fundamental problem: The Context Window Limit. GPT-4 could only process 8,000 tokens. Claude 2 maxed out at 100k but was slow and expensive at that length. Our documentation was 2 million tokens. We couldn't fit it all in, so we had to retrieve the "relevant" pieces.
Then Gemini 1.5 Pro launched with a 2-million-token context window. Claude 3.5 Sonnet arrived with 200k tokens and fast processing. GPT-4-Turbo extended to 128k.
We ran a comparative test that changed everything. We took our entire documentation corpus—1.8 million tokens of technical guides, API references, tutorials, and troubleshooting docs. We dumped ALL of it into a single Gemini 1.5 Pro prompt. No retrieval. No chunking. No embeddings. Just raw, unprocessed documentation.
We asked the same 500 test questions we used to evaluate our RAG system. The results were devastating for our careful engineering:
- RAG Pipeline Accuracy: 73% (questions answered correctly with proper citations)
- Full Context Dump Accuracy: 89% (16 percentage points higher)
The "dumb" approach of just sending everything beat our carefully optimized retrieval system by a massive margin. We looked at the failure cases. Most RAG failures were retrieval failures—the relevant chunk simply wasn't in the top-k retrieved results. The model was smart enough; we were starving it of information.
Within a month, we deleted the RAG pipeline. We deleted the vector database subscription. We deleted the embedding service. We deleted the Elasticsearch cluster. We deleted thousands of lines of chunking and retrieval code. We replaced it all with a single API call that sends the entire documentation.
Here's the full story of why RAG was solving yesterday's problem, and why we believe it's over-engineering for most use cases in 2026.
Section 1: The Inherent Complexity and Fragility of RAG
RAG was a necessary evil born from constraint. Large language models were intelligent but had amnesia—they couldn't read your entire document library, so you had to carefully select which snippets to show them. This selection process, called "retrieval," sounds simple in concept but is deceptively complex in practice.
The Chunking Problem: No Good Answers
First, you have to split your documents into "chunks" that can be embedded and retrieved independently. This sounds trivial until you actually try it.
Fixed-size chunking (e.g., 500 tokens per chunk): This is deterministic and simple, but it splits information arbitrarily. A paragraph explaining a concept might be cut in half. The first half gets retrieved; the second half (with the crucial details) doesn't. The model receives partial information and hallucinates the rest.
Semantic chunking (split by paragraphs or sections): This respects document structure better, but chunk sizes vary wildly. One chunk might be 50 tokens (a short paragraph); another might be 2,000 tokens (a long section). This variance causes embedding-quality problems: very short chunks carry too little signal to embed distinctively, while long chunks blur several topics into a single vector.
Recursive chunking with overlap: You add 100-token overlap between chunks to preserve context at boundaries. Now your index is 30% larger. Your embedding costs are 30% higher. Retrieval returns redundant information. And you still miss cross-chunk dependencies.
We spent three weeks tuning chunk size (256 vs 512 vs 1024 tokens) and overlap percentage (10% vs 20% vs 50%). Every setting had trade-offs. There was no optimal configuration—just least-bad compromises.
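To make those trade-offs concrete, here's a minimal sketch of the kind of fixed-size chunker with overlap we kept re-tuning. It splits on whitespace as a stand-in for real tokenization (a simplification; in production you'd count tokens with the model's tokenizer), and chunk_size and overlap are the knobs from the tuning exercise above.

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 51) -> list[str]:
    """Fixed-size chunking with overlap.

    Uses whitespace 'tokens' as a simplification for a real tokenizer.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks

# Every setting shifts the failure mode rather than removing it: small chunks
# cut explanations in half, large chunks dilute the embedding, and more
# overlap inflates the index and the embedding bill.
```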
The Retrieval Problem: Vector Similarity Is Fuzzy
Even with perfect chunking, retrieval itself is unreliable. Vector similarity measures semantic relatedness, but semantic relatedness is not the same as "answer relevance."
A user asks: "How do I reset my password?" The retrieval system might find chunks about "Password Hashing Algorithms" (high semantic similarity to "password") instead of "User Password Reset Flow" (the actual answer). Both are relevant to "passwords," but only one answers the question.
We saw this constantly. Technical documentation is full of interconnected concepts. A question about "configuring rate limits" might require information from the "Rate Limiting" section AND the "API Authentication" section AND the "Error Handling" section. Vector similarity doesn't understand these dependencies. It retrieves the single most similar chunk, missing the others.
To fix this, we added complexity: hybrid search (combining keyword matching with vectors), query expansion (rewriting the user's query into multiple variations), multi-hop retrieval (using the first retrieved chunks to inform a second retrieval pass), and re-ranking (using a cross-encoder model to re-score the top-50 results). Each addition improved accuracy by 2-3% but doubled the pipeline complexity.
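To show where the complexity creeps in, here is one illustrative piece of it: a reciprocal rank fusion (RRF) step of the sort a hybrid setup needs to merge the BM25 ranking with the vector ranking before re-ranking. The function and the k constant follow the standard RRF formulation; this is a sketch, not our production code.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one fused ranking.

    Each input list is ordered best-first; k dampens the influence of any
    single ranker (60 is the value from the original RRF paper).
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical usage: bm25_hits and vector_hits are chunk-ID lists returned by
# the keyword index and the vector store respectively.
# fused = reciprocal_rank_fusion([bm25_hits, vector_hits])[:50]
# The fused top-50 then goes to the cross-encoder re-ranker -- yet another call.
```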
The Latency and Cost Stack
Our "simple" chatbot became an eight-step pipeline:
- Receive user query
- Embed the query (API call to OpenAI: 100ms)
- Vector search in Pinecone (API call: 50ms)
- Keyword search in Elasticsearch (API call: 30ms)
- Merge and deduplicate results (compute: 10ms)
- Re-rank with Cohere (API call: 200ms)
- Assemble context from top-k chunks (compute: 10ms)
- Send to LLM for response generation (API call: 2000ms)
Total latency: 2.4 seconds minimum, and that's the happy path. Any service degradation cascaded. When Pinecone had a slow day (it happened), our bot became unusably slow.
Cost per query: embedding ($0.0001) + vector search ($0.0005) + rerank ($0.002) + LLM ($0.01) = $0.0126 per query. At 100,000 queries per month, that's $1,260 just in API calls, plus $4,000 in infrastructure. And we needed three engineers to keep it running.
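To give a feel for what those three engineers were maintaining, here is a heavily condensed sketch of the per-query orchestration. The client objects and helper names are placeholders rather than our actual code; the point is how many network hops sit in front of every single answer.

```python
async def answer_query(query: str, clients) -> str:
    # 1. Embed the user query (~100 ms call to the embedding API).
    query_vec = await clients.embedder.embed(query)
    # 2. Dense retrieval from the vector database (~50 ms).
    dense_hits = await clients.vector_db.search(query_vec, top_k=50)
    # 3. Keyword retrieval from the search index (~30 ms).
    keyword_hits = await clients.keyword_index.search(query, top_k=50)
    # 4. Merge and deduplicate candidates (~10 ms of local compute).
    candidates = {hit.chunk_id: hit for hit in [*dense_hits, *keyword_hits]}
    # 5. Re-rank the merged candidates with a cross-encoder (~200 ms).
    reranked = await clients.reranker.rerank(query, list(candidates.values()))
    # 6. Assemble the top-k chunks into the prompt (~10 ms).
    context = "\n\n".join(hit.text for hit in reranked[:10])
    # 7. Generate the answer (~2 s).
    return await clients.llm.complete(
        f"Answer using only this documentation:\n\n{context}\n\nQuestion: {query}"
    )

# Every awaited call is a separate service that can be slow, rate-limited, or
# down -- each with its own config, credentials, and failure modes.
```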
Section 2: The Power of Long Context—Why "Dump Everything" Works Better
When you use RAG, the LLM only sees the top-k chunks you retrieved. Typically k=5 or k=10. It doesn't see the full document. It doesn't see the table of contents. It doesn't see the cross-references. It doesn't see the context that explains why the retrieved chunk exists.
This is called "context starvation." The model has intelligence but lacks the information to use it properly.
The "Needle in a Haystack" Benchmark
Modern long-context models (Gemini 1.5, GPT-4-Turbo, Claude 3.5) have been rigorously tested on the "Needle in a Haystack" benchmark. A specific fact (the "needle") is buried at a random position in a massive document (the "haystack" of 500k to 1M+ tokens). The model is asked to find that fact.
The results are remarkable: Gemini 1.5 Pro achieves 99.7% accuracy at finding facts anywhere in a 1-million-token context. Claude 3.5 achieves similar results up to 200k tokens. These models don't just "fit" long contexts—they genuinely understand them.
When you feed the entire documentation to the model, it gains something RAG can never provide: structural understanding. It sees that "Section 3: Authentication" references concepts from "Section 1: Getting Started." It sees the table of contents. It understands the overall organization. It can synthesize information from multiple sections because it has access to all of them simultaneously.
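In practice, "feed the entire documentation" is only a few lines. Here's a minimal sketch, assuming the google-generativeai Python SDK and a docs/ directory of Markdown files; the prompt framing and file layout are illustrative.

```python
from pathlib import Path
import google.generativeai as genai

genai.configure(api_key="...")  # your API key
model = genai.GenerativeModel("gemini-1.5-pro")

# Concatenate the entire corpus, keeping file paths visible so the model
# sees the overall structure -- its own "table of contents".
doc_files = sorted(Path("docs").rglob("*.md"))
corpus = "\n\n".join(f"=== {path} ===\n{path.read_text()}" for path in doc_files)

def answer(question: str) -> str:
    # One call, no retrieval: the model sees every section at once.
    response = model.generate_content(
        "You are our documentation assistant. Answer using only the "
        f"documentation below, and cite the file you used.\n\n{corpus}\n\n"
        f"Question: {question}"
    )
    return response.text

print(answer("How do I reset my password?"))
```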
Cross-Document Synthesis
One of RAG's biggest weaknesses is multi-source queries. A user asks: "What's the difference between the Standard and Enterprise authentication flows?" The answer requires information from the Standard docs AND the Enterprise docs AND probably the comparison guide.
RAG retrieves based on similarity. It might get good hits from the Standard docs, partial hits from Enterprise, and miss the comparison guide entirely (because "comparison" doesn't appear in the user's query). The resulting answer is incomplete.
With full-context, the model has ALL three sources available. It naturally synthesizes. It produces answers that reference multiple documents coherently. Our accuracy on multi-source questions jumped from 52% (RAG) to 91% (full context).
Reduced Hallucination
Here's a counterintuitive finding: giving the model MORE information reduced hallucination, not increased it.
With RAG, the model receives 5 chunks and is asked to answer a question. If the answer isn't fully contained in those 5 chunks, the model has two choices: say "I don't know" or hallucinate. Models are trained to be helpful, so they often hallucinate—they make up plausible-sounding information to fill the gaps.
With full context, the gaps are filled. The model rarely needs to make things up because the actual answer is usually present somewhere in the 2M tokens. When it genuinely doesn't know, it says so more reliably—because it has searched its entire knowledge base (the context) and confirmed the absence of information.
Section 3: The Cost and Latency Argument—Debunked with Numbers
The first objection we heard from our team: "Sending 1.8 million tokens per query is insanely expensive and slow!"
Let's do the math.
Cost Comparison
RAG Pipeline (per month, 100k queries):
- Pinecone hosting: $4,000
- Elasticsearch hosting: $500
- OpenAI embeddings (100k queries × $0.0001): $10
- Cohere reranking (100k queries × $0.002): $200
- LLM calls (GPT-4, 5k tokens avg context, 100k queries × $0.03): $3,000
- Engineering time (0.5 FTE maintaining pipeline): ~$8,000
- Total: ~$15,700/month
Full Context with Gemini 1.5 Pro (per month, 100k queries):
- Infrastructure: $0 (no vector DB, no Elasticsearch)
- Gemini 1.5 Pro with Context Caching: Input tokens are cached after first use.
- First query: 1.8M input tokens × $0.00035 = $630
- Cached subsequent queries: the 1.8M cached tokens cost about $0.1575 per query, i.e. $157.50 per 1,000 queries
- 100k queries: $630 (initial) + ($157.50 × 100) = ~$16,380 in input tokens
- Output tokens (500 avg per response × 100k queries): ~$5,250
- Engineering time: Near zero (no pipeline to maintain)
- Total: ~$21,630/month
Wait—full context is MORE expensive? Yes, at current Gemini pricing for very high volume. But consider:
- We eliminated 0.5 FTE of engineering maintenance (~$8,000/month in salary).
- We eliminated the risk of RAG failures (which required human intervention).
- Token prices are dropping 50% every 6-12 months. By mid-2026, the full-context approach will be cheaper.
- For lower volume (10k queries/month), full context is already cheaper due to caching.
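If you want to sanity-check the break-even point for your own volume, the arithmetic above fits in a short script. The rates below are simply the per-query figures from this comparison; your pricing will differ, so treat them as placeholders.

```python
def rag_monthly_cost(queries: int) -> float:
    infra = 4_000 + 500                  # vector DB + Elasticsearch hosting
    maintenance = 8_000                  # ~0.5 FTE keeping the pipeline alive
    per_query = 0.0001 + 0.002 + 0.03    # embed + rerank + LLM call
    return infra + maintenance + per_query * queries

def full_context_monthly_cost(queries: int) -> float:
    initial_cache_fill = 630             # first pass over 1.8M input tokens
    cached_input_per_query = 0.1575      # warm-cache input cost per query
    output_per_query = 0.0525            # ~500 output tokens per answer
    return initial_cache_fill + (cached_input_per_query + output_per_query) * queries

for volume in (10_000, 100_000):
    print(volume, round(rag_monthly_cost(volume)), round(full_context_monthly_cost(volume)))
# 10k queries:  RAG ~ $12,821  vs  full context ~ $2,730
# 100k queries: RAG ~ $15,710  vs  full context ~ $21,630
```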
Latency Comparison
RAG Pipeline: 2.4 seconds (embedding + search + rerank + LLM)
Full Context with Caching:
- First query (cold cache): ~8 seconds (processing 1.8M tokens)
- Subsequent queries (warm cache): ~2 seconds (the 1.8M tokens are pre-computed)
With prompt caching (available in Gemini, Claude, and OpenAI), the "attention computation" for the static documentation happens once. Subsequent queries only compute attention for the new user query. The time to first token becomes comparable to RAG.
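Here's a sketch of the warm-cache setup, assuming the google-generativeai SDK's context-caching interface (CachedContent.create and GenerativeModel.from_cached_content); the newer google-genai client exposes the same feature differently, so check the current docs. corpus is the concatenated documentation string from the earlier sketch.

```python
import datetime
import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key="...")

# Pay the 1.8M-token processing cost once and keep the result warm.
docs_cache = caching.CachedContent.create(
    model="models/gemini-1.5-pro-001",
    system_instruction="Answer questions using only the cached documentation.",
    contents=[corpus],                  # the full documentation string
    ttl=datetime.timedelta(hours=1),
)

model = genai.GenerativeModel.from_cached_content(cached_content=docs_cache)

# Subsequent queries only pay for the new tokens in the question and the
# answer; the attention work over the documentation is already done.
response = model.generate_content("How do rate limits interact with auth?")
print(response.text)
```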
And we removed all the latency variance. No more slow Pinecone days. No more Elasticsearch timeouts. One API call. Predictable latency.
Section 4: When RAG Still Makes Sense—The Edge Cases
We're not saying RAG is dead for all use cases. There are legitimate scenarios where retrieval remains necessary:
The "Infinite Corpus" Problem (100M+ Tokens)
If you have 100 years of legal precedents (100 GB of text), or a multi-million-document knowledge base, you simply cannot fit it in a 2M token window. You need retrieval to narrow down to a relevant subset.
But note: "Retrieval" in this case is coarse filtering, not precise snippet extraction. You retrieve the 50 most relevant documents and dump them all in. You're not trying to find the perfect 5 paragraphs; you're finding the relevant 1M tokens to feed the model.
This is "Retrieval as Pre-filter," not "Retrieval as Context Construction." Much simpler. Much more robust.
Hardware-Constrained Environments
If you're running local LLMs (Llama 3 on a MacBook), you don't have the memory for 1M+ token contexts. Consumer hardware tops out at 16-32k tokens effectively. RAG is still necessary for local-first, privacy-focused deployments.
But cloud costs are dropping. The case for local-only is narrowing to specific compliance scenarios.
Real-Time Data Requirements
If your knowledge base changes every hour (e.g., a live support ticket system), full-context caching doesn't help—you'd have to recompute the cache constantly. For highly dynamic data, RAG with frequent index updates might still be appropriate.
But even here, consider: maybe you cache the "static" docs (90% of your corpus) and only retrieve "dynamic" docs (recent tickets). Hybrid approaches are often better than pure RAG.
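A sketch of that split, assuming the static corpus lives behind a context cache (as in the caching example above) and only the fast-changing records are fetched per query. fetch_recent_tickets is a hypothetical callable standing in for whatever your live ticket system exposes.

```python
def answer_with_hybrid_context(question: str, cached_docs_model,
                               fetch_recent_tickets) -> str:
    """Static docs come from the prompt cache; only dynamic data is retrieved.

    cached_docs_model: a model handle bound to the cached documentation.
    fetch_recent_tickets: placeholder callable returning fresh ticket text.
    """
    # Retrieval is reduced to the slice of content that actually changes hourly.
    recent = fetch_recent_tickets(query=question, limit=20)
    dynamic_context = "\n\n".join(recent)

    prompt = (
        "Recent support tickets (live data, not in the cached docs):\n"
        f"{dynamic_context}\n\n"
        f"Question: {question}"
    )
    return cached_docs_model.generate_content(prompt).text
```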
Conclusion: Delete the Bridge; the River Dried Up
In software engineering, we often build elaborate solutions to problems that later disappear. RAG was an ingenious bridge over the "Small Context Window River." We built pontoons and suspension cables and toll booths.
Then the river dried up. Context windows expanded from 8k to 2M—a 250x increase in three years. The bridge is now a monument to a constraint that no longer exists.
We deleted our bridge. We walk directly across the dry riverbed. Our documentation AI is simpler, more accurate, and requires no maintenance.
Before building (or continuing to maintain) a RAG pipeline, ask yourself: What problem am I actually solving? Does that problem still exist?
For most teams with documentation under 2M tokens, the answer is: just dump it in. Delete the complexity. Trust the long-context models.
The best retrieval strategy is no retrieval at all.
Written by XQA Team
Our team of experts delivers insights on technology, business, and design. We are dedicated to helping you build better products and scale your business.