
The 73% Lie
Our RAG-powered customer support bot was a success story. At least, that's what the metrics said.
In its first month of production, the bot handled 50,000 conversations. At the end of each conversation, users were asked: "Was this response helpful?" A full 73% clicked "Yes."
We celebrated. We wrote internal blog posts. We presented to the board. "AI-powered support is working," we declared.
Then someone suggested we actually audit the responses.
We randomly sampled 500 conversations and had human support agents review them for accuracy. The results were devastating:
- 41% of responses were fully accurate
- 27% contained partial truths with significant errors
- 32% were completely wrong or hallucinated
Our 73% "helpful" rating was meaningless. Users clicked "helpful" when the answer sounded right—confident, well-formatted, fluent. They couldn't verify accuracy in real-time, so they judged on vibes.
This is the fundamental problem with RAG systems: they look like they're working even when they're not. The LLM's fluency masks retrieval failures. Users can't tell the difference between grounded truth and confident hallucination.
After months of debugging, we've learned what actually breaks RAG systems and how to fix them. Here's everything we discovered.
Section 1: The RAG Illusion
Let's start with why RAG seems to be working when it isn't.
The Promise of RAG
Retrieval-Augmented Generation is supposed to solve the hallucination problem. Instead of relying on the LLM's training data (which may be outdated or incomplete), you:
- Embed your documents into a vector database
- When a query comes in, embed the query and find similar documents
- Pass the retrieved documents to the LLM as context
- The LLM generates an answer grounded in your actual data
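To make the moving parts concrete, here's a minimal sketch of that pipeline in Python using sentence-transformers for embeddings. The model name, the example documents, and the generate() call are placeholders, not a production recipe.

```python
# Minimal RAG pipeline sketch: embed documents, retrieve by similarity,
# then hand the top chunks to an LLM as context.
# Model name and generate() are illustrative placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

documents = [
    "To end your plan, go to Billing > Manage Plan and choose Cancel.",
    "Canceled orders are refunded within 5-7 business days.",
]
doc_embeddings = model.encode(documents, convert_to_tensor=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    query_embedding = model.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, doc_embeddings, top_k=k)[0]
    return [documents[hit["corpus_id"]] for hit in hits]

def answer(query: str) -> str:
    context = "\n\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)  # placeholder for your LLM call
```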
In theory, this gives you the best of both worlds: the fluency of LLMs plus the accuracy of your source documents.
In practice, it introduces multiple new failure modes—all of which are invisible to end users.
Garbage In, Garbage Out (At the Retrieval Layer)
The dirty secret of RAG is that retrieval quality is terrible for most production systems.
Vector similarity search sounds mathematical and precise. In reality:
- Queries and documents are written in different registers (short questions vs. long prose), so a query and the passage that answers it can still land far apart in embedding space
- Short queries match poorly with long documents
- Negations, qualifications, and nuance are poorly captured by embeddings
- Similar words in different contexts match when they shouldn't
When retrieval fails, the LLM receives irrelevant context—but it doesn't know the context is irrelevant. It dutifully generates an answer based on whatever you gave it.
The "Sounds Right" Problem
Here's what makes RAG failures insidious: LLMs are extremely good at generating confident, fluent prose from bad context.
If you give an LLM a document about product A and ask about product B, it won't say "I don't know." It will extract whatever information it can and present it confidently. Sometimes it conflates. Sometimes it hallucinates. Sometimes it answers a different question than what was asked.
To the end user, all of these failure modes look identical: a helpful-sounding response.
Why User Ratings Are Useless
Our 73% "helpful" rating was measuring the wrong thing. Users were evaluating:
- Did the response sound confident?
- Was it well-formatted?
- Did it seem relevant to my question?
They were not evaluating:
- Is this factually accurate?
- Does this match our official documentation?
- Are there important caveats being omitted?
Users can't verify accuracy without looking up the source material themselves—which defeats the purpose of having a support bot. So they rate what they can perceive: fluency and format.
Section 2: The 5 Hidden Failure Modes of RAG
After auditing thousands of RAG failures, we've identified five recurring patterns.
Failure Mode 1: Semantic Drift
The query and relevant documents don't match in embedding space, even though they should.
Example: User asks "How do I cancel my subscription?" The embedding matches documents containing the word "cancel"—including pages about canceled orders, canceled events, and a blog post about "cancel culture." The actual subscription cancellation doc ranks lower because it uses "end your plan" instead of "cancel."
Root cause: Embeddings are trained on general text, not your specific domain vocabulary. Synonyms and related concepts don't always cluster together.
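One cheap way to catch this kind of vocabulary mismatch is to embed a known query against the document you expect it to retrieve and against likely distractors, then compare the scores directly. A small sketch; the model name and texts are illustrative:

```python
# Probe for semantic drift: does the query actually score highest against
# the document that answers it? Texts and model name are examples.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "How do I cancel my subscription?"
candidates = {
    "subscription_doc": "To end your plan, open Billing and choose End Plan.",
    "orders_doc": "Canceled orders are refunded to the original payment method.",
    "blog_post": "Our take on cancel culture and online communities.",
}

query_emb = model.encode(query, convert_to_tensor=True)
for name, text in candidates.items():
    score = util.cos_sim(query_emb, model.encode(text, convert_to_tensor=True))
    print(f"{name}: {float(score):.3f}")
# If a distractor outscores the real answer, your embeddings aren't
# bridging the "cancel" vs. "end your plan" vocabulary gap.
```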
Failure Mode 2: Chunk Boundary Disasters
The answer to the user's question spans two document chunks, but neither chunk alone makes sense.
Example: A user asks about pricing tiers. Chunk 1 lists the tier names. Chunk 2 lists the prices. Neither chunk mentions both tier names AND prices together. The LLM receives one chunk and gives incomplete information.
Root cause: Most chunking strategies are naive—fixed token counts or paragraph breaks. They don't respect semantic units. Important information gets split.
Failure Mode 3: Outdated Embeddings
Documents are updated, but embeddings aren't re-computed. Retrieval returns stale context.
Example: Your pricing page changed last month. The updated text is in the document store, but its embedding was computed from the old version, so queries about pricing still retrieve the stale content.
Root cause: Embedding updates are often manual or delayed. There's no automatic sync between document updates and embedding refresh.
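A lightweight guard is to store a content hash alongside each embedding and re-embed whenever the hash changes. A minimal sketch, assuming you control the ingestion job; embed() and upsert() stand in for your embedding model and vector store:

```python
# Re-embed only documents whose content has changed since the last run.
# embed() and upsert() are placeholders for your model and vector store.
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def sync_embeddings(documents: dict[str, str], stored_hashes: dict[str, str]) -> None:
    """documents: doc_id -> current text; stored_hashes: doc_id -> hash at last embed."""
    for doc_id, text in documents.items():
        new_hash = content_hash(text)
        if stored_hashes.get(doc_id) != new_hash:
            vector = embed(text)          # placeholder: your embedding model
            upsert(doc_id, vector, text)  # placeholder: your vector store
            stored_hashes[doc_id] = new_hash
```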
Failure Mode 4: Hallucination Despite Retrieval
The LLM ignores the retrieved context and makes things up anyway.
Example: The retrieved document clearly states the return window is 30 days. The LLM confidently states it's 60 days because that matches its training data (from a competitor's site, perhaps).
Root cause: LLMs have priors from training. When retrieved context is ambiguous or conflicts with priors, the LLM sometimes trusts its training more than the context. This is especially common with numbers, dates, and specific facts.
Failure Mode 5: Top-K Blindness
The relevant document exists in your database but doesn't rank in the top-K retrieved results.
Example: You retrieve top 5 documents. The correct document is at position 8. The LLM never sees it. It answers based on the 5 less-relevant documents it did receive.
Root cause: K is usually set to a small number (3, 5, 10) for latency and cost reasons. But embedding similarity is noisy. Relevant documents can easily fall outside the top-K threshold.
Section 3: The RAG Debugging Framework
You can't fix what you can't measure. Here's how to debug RAG systems systematically.
Step 1: Log Everything
For every RAG query, log:
- The original user query
- The embedded query vector (or a hash of it)
- The retrieved chunks (full text, not just IDs)
- The generated response
- Any user feedback or ratings
- Latency at each stage
This seems obvious, but many teams only log the final response. Without retrieval logs, you can't diagnose retrieval failures.
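Here's a minimal shape for that log record, as a sketch; the field names are ours, not a standard:

```python
# One structured record per RAG query, so retrieval failures can be
# diagnosed after the fact. Field names are illustrative.
import json
import time
from dataclasses import dataclass, field, asdict

@dataclass
class RagLogRecord:
    query: str
    retrieved_chunks: list[dict]      # [{"id": ..., "text": ..., "score": ...}]
    response: str
    user_feedback: str | None = None  # "helpful", "not_helpful", or None
    retrieval_ms: float = 0.0
    generation_ms: float = 0.0
    timestamp: float = field(default_factory=time.time)

def log_rag_query(record: RagLogRecord, path: str = "rag_queries.jsonl") -> None:
    """Append the record as one JSON line; swap in your real logging pipeline."""
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")
```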
Step 2: Sample and Audit
User ratings are unreliable. You need human review.
We now audit 5% of all conversations weekly, with human reviewers answering:
- Was the response factually accurate?
- Did the retrieved chunks contain the correct answer?
- Did the LLM faithfully represent the chunk content?
- Were there important caveats or context missing?
This gives us ground-truth accuracy metrics that user ratings can't provide.
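The sampling itself can be automated even if the review can't. A sketch of drawing a reproducible 5% sample and attaching the rubric above:

```python
# Draw a reproducible 5% sample of the week's conversations for human review.
import random

AUDIT_RUBRIC = [
    "Was the response factually accurate?",
    "Did the retrieved chunks contain the correct answer?",
    "Did the LLM faithfully represent the chunk content?",
    "Were important caveats or context missing?",
]

def sample_for_audit(conversation_ids: list[str], rate: float = 0.05,
                     seed: int = 42) -> list[dict]:
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = max(1, int(len(conversation_ids) * rate))
    return [{"conversation_id": cid, "rubric": AUDIT_RUBRIC, "answers": {}}
            for cid in rng.sample(conversation_ids, k)]
```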
Step 3: Separate Retrieval Quality from Generation Quality
Most teams measure end-to-end accuracy. This hides the failure mode.
Measure separately:
- Retrieval precision: Of the K documents retrieved, how many were relevant to the query?
- Retrieval recall: Of all relevant documents in the database, how many were in the top-K?
- Faithfulness: Given correct retrieved documents, did the LLM accurately represent them?
If retrieval precision is low, fix retrieval (embeddings, chunking, hybrid search). If faithfulness is low, fix generation (prompting, model choice, citation enforcement).
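Precision and recall at K are easy to compute once you have a small labeled set of queries with known relevant document IDs; faithfulness usually needs human or LLM-judge review. A sketch of the first two:

```python
# Retrieval precision@K and recall@K over a labeled evaluation set.
# Each eval item maps a query to the set of document IDs that answer it.

def precision_recall_at_k(retrieved_ids: list[str],
                          relevant_ids: set[str]) -> tuple[float, float]:
    hits = len(set(retrieved_ids) & relevant_ids)
    precision = hits / len(retrieved_ids) if retrieved_ids else 0.0
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

def evaluate(eval_set: list[dict], retrieve, k: int = 5) -> dict:
    """eval_set: [{"query": str, "relevant_ids": set}, ...]; retrieve(query, k) -> doc IDs."""
    precisions, recalls = [], []
    for item in eval_set:
        p, r = precision_recall_at_k(retrieve(item["query"], k), item["relevant_ids"])
        precisions.append(p)
        recalls.append(r)
    return {"precision@k": sum(precisions) / len(precisions),
            "recall@k": sum(recalls) / len(recalls)}
```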
Step 4: Chunk Quality Analysis
Review your chunks manually. Ask:
- Does each chunk contain a complete thought?
- Are sentences or concepts split across chunk boundaries?
- Does each chunk have enough context to be understood standalone?
Bad chunking is often the root cause of RAG failures. Fixing chunk quality can improve accuracy more than any other intervention.
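Manual review scales better if you pre-flag suspicious chunks. A rough heuristic sketch; the thresholds are arbitrary starting points, not tuned values:

```python
# Flag chunks that are likely broken: too short, starting mid-sentence,
# or cut off mid-sentence. Thresholds are arbitrary starting points.

def chunk_quality_flags(chunk: str, min_chars: int = 200) -> list[str]:
    flags = []
    text = chunk.strip()
    if len(text) < min_chars:
        flags.append("too_short")
    if text and text[0].islower():
        flags.append("starts_mid_sentence")
    if text and text[-1] not in ".!?:\"'":
        flags.append("ends_mid_sentence")
    return flags

def flag_suspicious_chunks(chunks: list[str]) -> list[tuple[int, list[str]]]:
    """Return (index, flags) for every chunk that trips at least one heuristic."""
    return [(i, f) for i, c in enumerate(chunks) if (f := chunk_quality_flags(c))]
```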
Step 5: Implement Reranking
Embedding similarity is a coarse filter. Reranking adds a second, more precise layer.
After retrieving top-50 by embedding similarity, use a reranking model (like Cohere Rerank or a cross-encoder) to reorder by relevance. Then take the top 5 from the reranked results.
This dramatically improves retrieval precision at modest latency cost.
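A sketch of that two-stage setup using a cross-encoder from sentence-transformers; the model name is one commonly used example, and first_stage_retrieve() is a placeholder for your existing embedding search:

```python
# Two-stage retrieval: broad embedding search, then cross-encoder reranking.
# first_stage_retrieve() is a placeholder for your existing vector search.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example model

def retrieve_with_rerank(query: str, k_first: int = 50, k_final: int = 5) -> list[str]:
    candidates = first_stage_retrieve(query, k=k_first)  # list of chunk texts
    scores = reranker.predict([(query, chunk) for chunk in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:k_final]]
```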
Section 4: What Actually Works
Based on these lessons, here's how to build RAG systems that actually work.
Hybrid Search
Never use embedding search alone.
Combine vector similarity with keyword/BM25 search. They have complementary strengths:
- Embeddings capture semantic similarity (concepts, paraphrases)
- BM25 captures lexical matches (specific terms, product names, codes)
Hybrid search retrieves documents that score well on both dimensions, dramatically reducing false positives.
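One simple way to combine the two signals is reciprocal rank fusion: rank documents separately by BM25 and by embedding similarity, then merge the rankings. A sketch using the rank_bm25 package; vector_rank() is a placeholder for your embedding-based ranking:

```python
# Hybrid retrieval via reciprocal rank fusion (RRF) of BM25 and vector rankings.
# vector_rank() is a placeholder returning doc IDs ordered by embedding similarity.
from rank_bm25 import BM25Okapi

def bm25_rank(query: str, corpus: dict[str, str]) -> list[str]:
    """Return document IDs ordered by BM25 score (naive whitespace tokenization)."""
    doc_ids = list(corpus)
    bm25 = BM25Okapi([corpus[d].lower().split() for d in doc_ids])
    scores = bm25.get_scores(query.lower().split())
    return [d for d, _ in sorted(zip(doc_ids, scores), key=lambda x: x[1], reverse=True)]

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge rankings: each doc scores sum(1 / (k + rank)) across the lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_retrieve(query: str, corpus: dict[str, str], top_k: int = 5) -> list[str]:
    return rrf_fuse([bm25_rank(query, corpus), vector_rank(query, corpus)])[:top_k]
```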
Metadata Filtering
Don't rely on embeddings to distinguish document types.
Before vector search, filter by:
- Document category (FAQ vs. technical docs vs. blog)
- Date (exclude outdated content)
- Product or feature area
- Language or region
This narrows the search space to relevant documents before embedding similarity even runs.
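Most vector databases support metadata filters natively; if yours doesn't, the same idea works as a pre-filter in application code. A plain-Python sketch with illustrative chunk shape and field names:

```python
# Narrow the candidate set by metadata before running similarity search.
# Chunk shape and field names are illustrative; most vector DBs can do
# this filtering server-side.
from datetime import date

def filter_chunks(chunks: list[dict], category: str | None = None,
                  min_date: date | None = None, product: str | None = None) -> list[dict]:
    def keep(chunk: dict) -> bool:
        meta = chunk["metadata"]
        if category and meta.get("category") != category:
            return False
        if min_date and meta.get("updated_at", date.min) < min_date:
            return False
        if product and meta.get("product") != product:
            return False
        return True
    return [c for c in chunks if keep(c)]

# Usage: restrict to current FAQ content before similarity scoring.
# candidates = filter_chunks(all_chunks, category="faq", min_date=date(2024, 1, 1))
```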
Semantic Chunking
Stop chunking by token count. Chunk by semantic unit:
- Split by section headers
- Keep paragraphs intact
- Include surrounding context (overlap with adjacent chunks)
- Add chunk metadata (what section is this from?)
Smaller, semantically coherent chunks retrieve better than large, arbitrary ones.
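Here's a sketch of header-based chunking with paragraph overlap and section metadata, assuming markdown-style source documents:

```python
# Split markdown-style docs on section headers, keep paragraphs intact,
# include preceding paragraph(s) as overlap, and tag each chunk with its section.
import re

def semantic_chunks(doc_text: str, doc_id: str, overlap_paragraphs: int = 1) -> list[dict]:
    chunks = []
    sections = re.split(r"\n(?=#{1,6} )", doc_text)  # split before each markdown header
    for section in sections:
        lines = section.strip().splitlines()
        if not lines:
            continue
        header = lines[0].lstrip("#").strip()
        body = "\n".join(lines[1:])
        paragraphs = [p.strip() for p in body.split("\n\n") if p.strip()]
        for i in range(len(paragraphs)):
            start = max(0, i - overlap_paragraphs)  # pull in preceding paragraph(s)
            text = "\n\n".join(paragraphs[start:i + 1])
            chunks.append({
                "text": f"{header}\n\n{text}",
                "metadata": {"doc_id": doc_id, "section": header},
            })
    return chunks
```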
Citation Enforcement
Force the LLM to cite which chunk it used:
"Answer the user's question based on the following documents. Quote the specific passage you used in your answer."
Then, automatically verify that the quote exists in the retrieved chunks. If the LLM can't produce a valid citation, flag the response for human review or decline to answer.
This catches hallucinations that ignore the retrieved context.
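The verification step can be as simple as a normalized substring check. A sketch, assuming the prompt asks the model to wrap its quote in a known delimiter (here a hypothetical <quote> tag):

```python
# Verify that the model's quoted passage actually appears in the retrieved
# chunks. Assumes the prompt requests the quote inside <quote>...</quote> tags.
import re

def normalize(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip().lower()

def citation_is_valid(response: str, retrieved_chunks: list[str]) -> bool:
    match = re.search(r"<quote>(.*?)</quote>", response, re.DOTALL)
    if not match:
        return False  # no citation produced at all
    quote = normalize(match.group(1))
    return any(quote in normalize(chunk) for chunk in retrieved_chunks)

# If citation_is_valid() returns False, route the response to human review
# or fall back to "I couldn't find a confident answer."
```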
Confidence Thresholds
Not every query should get an answer. If retrieval confidence is low (no documents above the similarity threshold), the system should say:
"I couldn't find a confident answer to this question. Would you like to speak with a human agent?"
Saying "I don't know" is better than confidently hallucinating.
Closing Thought
RAG is a system, not a magic solution. Debug it like one.
Our 73% "helpful" rating taught us that LLM fluency masks retrieval failures. Users can't evaluate accuracy; they evaluate confidence. If you're using user ratings as your success metric, you're probably celebrating a broken system.
Measure retrieval quality. Audit accuracy with humans. Log everything. Test continuously. RAG works—but only when you treat it as a complex system that requires ongoing debugging, not a set-and-forget solution.
Appendix: RAG Health Check
Score your RAG system 1-5 on each dimension:
- Retrieval logging: Do you log every query + retrieved chunks?
- Human audits: Do you regularly audit accuracy with humans?
- Retrieval metrics: Do you measure precision/recall separately from end-to-end accuracy?
- Chunk quality: Are chunks semantically coherent?
- Hybrid search: Do you combine embeddings with keyword search?
If you score below 3 on any dimension, that's where to focus first.
Written by XQA Team