
Model evaluation is supposed to be how you decide which AI model to deploy. You build a test set, run all candidate models against it, measure accuracy, and pick the winner. It's scientific. It's rigorous. It's how ML should be done.
We built an elaborate evaluation pipeline. Custom test sets representing our use cases. Automated scoring pipelines. Leaderboards comparing models across multiple dimensions. We spent three months getting it right.
Our evaluation showed GPT-4 at 95% accuracy on our benchmark. Claude 3 at 91%. Llama 70B at 82%. The choice was obvious—GPT-4 won decisively.
We deployed GPT-4. We monitored production performance.
Production accuracy: 68%.
Twenty-seven percentage points lower than our benchmark predicted. What happened?
Our test set didn't match production queries. The carefully curated examples we built to evaluate models weren't representative of what real users actually asked. Our benchmarks tested what we imagined users would ask, not what users actually asked.
We stopped doing offline evaluation as our primary decision method. We switched to live A/B testing with production traffic. Here's why benchmarks fail and what actually works.
Section 1: The Distribution Shift Problem—Why Test Sets Lie
Every ML practitioner knows about distribution shift in theory. But it's easy to underestimate how severe the mismatch can be.
What We Tested vs. What Users Asked
Our product was an AI assistant for software documentation. We built a test set of 500 carefully crafted questions:
- "How do I configure authentication for the API?"
- "What's the syntax for the retry policy configuration?"
- "How do I enable debug logging?"
Clean, well-formed questions about documented topics. Our evaluation scored models on these.
What users actually asked:
- "it broken"
- "why err"
- "auth not work help"
- "@#$% is wrong with this thing"
- "I've been trying for three hours and I'm about to flip my desk, how do I get oauth to work with our legacy system that was built in 2007"
Typos. Fragments. Emotional outbursts. Complex context. Profanity. Non-native English. Assumptions about what we already know about their setup.
Our test set was written by native English speakers, proofread for clarity, and focused on topics we documented. Real queries were messy, ambiguous, and often about edge cases we hadn't documented.
The Curation Bias
When you build a test set, you curate. You remove duplicates. You remove obviously bad examples. You balance categories. You make it "representative."
This curation removes exactly the edge cases that models fail on. The weird queries get cut. The ambiguous ones get clarified. The long, rambling ones get trimmed.
What remains is a sanitized version of user behavior that doesn't reflect the messy reality. Models that handle clean inputs well might crumble on messy ones—and you'd never know from your benchmark.
The Temporal Shift
We built our test set in Q1. We deployed in Q2. User behavior had shifted:
- New features released (users asked about things not in test set)
- New bugs discovered (users asked about error messages we hadn't seen)
- New use cases emerged (integrations we didn't anticipate)
- Documentation changed (answers to old questions were now different)
By the time we deployed, our test set was already stale. The "representative" examples no longer represented.
Section 2: The Scoring Problem—What Does "Accuracy" Even Mean?
Even if our test set perfectly matched production distribution, scoring would still be problematic.
Human Agreement on AI Outputs
We hired annotators to score model outputs as "correct" or "incorrect." We assumed this was objective.
We measured inter-annotator agreement: 73%. On nearly three out of every ten examples, annotators disagreed on whether the answer was correct.
Why? Because "correct" is subjective for open-ended AI responses:
- Answer A is technically correct but confusing
- Answer B is slightly incorrect but more helpful
- Answer C is correct but missing context
- Answer D addresses a different interpretation of the question (also valid)
When annotators disagree on 27% of examples, your benchmark has a noise floor of roughly 13.5 points (half the disagreement rate): a model scoring 95% might actually land anywhere from about 82% to 100% depending on which annotator's opinion you use.
Multi-Dimensional Quality
AI output quality is multi-dimensional:
- Accuracy: Is the information factually correct?
- Relevance: Does it address what the user actually needs?
- Completeness: Does it cover all aspects of the question?
- Conciseness: Is it appropriately brief or too verbose?
- Clarity: Is it easy to understand?
- Actionability: Can the user act on the information?
Collapsing all of these into a single "accuracy" number loses crucial information. A model might be highly accurate but so verbose that users give up reading. Another might be slightly inaccurate but so clear that users prefer it.
Users don't care about "accuracy" as we measured it. They care about whether the response helped them. Those are different things.
The Latency Dimension
Our benchmarks measured quality. They didn't measure latency. In production, latency matters enormously.
GPT-4 scored 95% on our benchmark at 3 seconds response time.
Llama 70B scored 82% at 0.5 seconds response time.
User satisfaction with the faster, less accurate model was actually higher than with the slower, more accurate one. Users abandoned the chat when responses took too long, regardless of quality.
Our benchmark didn't capture this tradeoff. It compared quality in isolation.
Section 3: The A/B Testing Alternative—Measuring What Matters
We replaced offline evaluation with live A/B testing on production traffic.
The Setup
When a user sends a query:
- Randomly assign them to model A or model B (50/50 split)
- Serve the response from the assigned model
- Measure user behavior: Did they follow up? Did they report the answer as helpful? Did they rage-quit?
- After sufficient data, compare metrics and pick the winner
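A minimal sketch of that assignment step, assuming each request carries a stable user ID. `call_model` and `log_event` below are placeholder stubs standing in for a real model client and event logger, not our production interfaces:

```python
import hashlib

ARMS = {"A": "gpt-4", "B": "llama-70b"}

def call_model(model_name: str, query: str) -> str:
    # Placeholder: in practice this calls the model-serving API.
    return f"[{model_name}] response to: {query}"

def log_event(user_id: str, arm: str, query: str, response: str) -> None:
    # Placeholder: in practice this writes to the analytics pipeline.
    print(user_id, arm, len(query), len(response))

def assign_arm(user_id: str) -> str:
    """Hash the user ID into a stable 50/50 bucket so a user always sees the same arm."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

def handle_query(user_id: str, query: str) -> str:
    arm = assign_arm(user_id)
    response = call_model(ARMS[arm], query)
    log_event(user_id, arm, query, response)
    return response
```

Hashing the user ID, rather than flipping a coin per request, keeps each user in one arm, so their experience is consistent and their behavior can be attributed to a single model.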
What We Measure
Thumbs up/down rate: Do users explicitly rate responses as helpful?
Follow-up rate: Do users need to ask clarifying questions? (Lower is better—they got what they needed.)
Escalation rate: Do users contact support after the AI interaction? (Lower is better.)
Session duration: How long until the user accomplishes their goal? (Context-dependent—sometimes longer is better.)
Return rate: Do users come back to use the AI again? (Higher is better.)
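Once those events are logged, deciding a winner on a proportion metric like thumbs-up rate can be as simple as a two-proportion z-test. A sketch with made-up counts:

```python
import math

def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference in thumbs-up rate between arms."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # two-sided
    return p_a, p_b, z, p_value

# Made-up counts: thumbs-up out of rated sessions in each arm.
p_a, p_b, z, p = two_proportion_ztest(412, 1000, 455, 1000)
print(f"A: {p_a:.1%}  B: {p_b:.1%}  z={z:.2f}  p={p:.3f}")
```

In practice you run the same comparison for each metric and watch for conflicts between them, for example a better thumbs-up rate alongside a worse escalation rate.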
Why A/B Testing Works
1. Real distribution: You're testing on actual user queries, not curated examples. No distribution shift by definition.
2. Real behavior: You measure user behavior, not annotator opinions. Users vote with their actions.
3. All dimensions: Latency, quality, clarity, relevance—they all show up in user behavior. You don't have to score each dimension separately.
4. Continuous: A/B testing runs continuously. As user behavior shifts, you detect the change in real time.
The Downsides
Slower decisions: You need statistical significance based on real traffic. With low traffic, this takes weeks (the sample-size sketch after this list shows why).
User exposure: Some users get the worse model. In high-stakes applications, this can be a problem.
Metric selection: You still have to choose what to measure. Wrong metrics lead to wrong conclusions.
Local maxima: A/B testing tells you which of two options is better. It doesn't tell you if both are terrible compared to an option you haven't tried.
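To make the slower-decisions point concrete, here is a rough per-arm sample-size estimate for a proportion metric using the standard normal-approximation formula. The 40% baseline thumbs-up rate and 3-point target lift are assumed numbers for illustration, not our real figures:

```python
import math

def required_n_per_arm(p_baseline, lift):
    """Approximate per-arm sample size to detect `lift` on a proportion metric
    with a two-sided test at alpha=0.05 and 80% power (normal approximation)."""
    z_alpha, z_power = 1.96, 0.84          # standard normal quantiles
    p1, p2 = p_baseline, p_baseline + lift
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_power * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / lift ** 2)

# Assumed numbers: 40% baseline thumbs-up rate, hoping to detect a 3-point lift.
print(required_n_per_arm(0.40, 0.03), "rated sessions per arm")  # roughly 4,200
```

At 4,000-plus rated sessions per arm, a product collecting a few hundred ratings a day is looking at weeks before it can call a winner.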
Section 4: When Offline Evaluation Still Matters
We didn't abandon offline evaluation entirely. It still has a role—just not as the primary decision mechanism.
Filtering Out Obviously Bad Models
Before running an expensive A/B test, we use offline evaluation to filter candidates. If a model scores below 60% on our benchmark, we don't bother A/B testing it—even accounting for benchmark limitations, it's probably not competitive.
Offline evaluation as a filter = valuable.
Offline evaluation as a decision-maker = dangerous.
Regression Testing
When we update prompts or fine-tune models, we run them through our benchmark to check for obvious regressions. If accuracy drops 20%, something broke—even if absolute benchmark numbers don't predict production performance.
Relative comparisons (did this change make things worse?) are more reliable than absolute predictions (this model will perform at 95%).
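The regression gate itself can be this simple: compare before/after scores on the same test set and flag large drops. A sketch, with the 20-point threshold from the example above:

```python
REGRESSION_THRESHOLD = 0.20   # flag if offline accuracy drops by 20 points or more

def has_regressed(baseline_score: float, candidate_score: float) -> bool:
    """Compare before/after scores on the same test set; only the delta is trusted,
    never the absolute numbers."""
    return (baseline_score - candidate_score) >= REGRESSION_THRESHOLD

# Example: a prompt change drops offline accuracy from 0.81 to 0.58.
print(has_regressed(0.81, 0.58))   # True -> block the change and investigate
```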
Specific Capability Testing
Some capabilities can be tested offline reliably:
- Does the model refuse harmful requests? (Safety benchmarks)
- Can the model follow specific format instructions? (JSON output, etc.)
- Does the model handle non-English well? (Multilingual benchmarks)
These are verifiable, objective capabilities that transfer well from benchmark to production.
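Format-following, for instance, can be checked with a deterministic validator instead of human judgment. A sketch for JSON output, with a made-up required schema:

```python
import json

REQUIRED_FIELDS = {"answer", "sources"}   # made-up schema for illustration

def follows_json_instruction(model_output: str) -> bool:
    """Return True if the output parses as a JSON object with the required fields."""
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and REQUIRED_FIELDS.issubset(parsed)

print(follows_json_instruction('{"answer": "Set DEBUG=true", "sources": ["docs/logging.md"]}'))  # True
print(follows_json_instruction("Sure! Here is the answer: set DEBUG=true"))                      # False
```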
Building Intuition
Running models through test sets builds intuition about their strengths and weaknesses. Even if the scores don't predict production exactly, you learn: "GPT-4 handles ambiguous queries well but over-explains simple ones. Claude is concise but sometimes too brief."
This intuition informs A/B test design and interpretation.
Conclusion: Measure What You Care About, Not What's Easy
Offline evaluation is seductive because it's fast, controlled, and produces clean numbers. You can run benchmarks overnight and have a leaderboard by morning. It feels scientific.
But the goal isn't clean numbers—it's models that work for users. And the only way to know if models work for users is to test them with users.
This is harder. A/B tests require traffic, infrastructure, and patience. The results are messier—confidence intervals instead of point estimates. But they're measuring what actually matters.
Your benchmark might tell you Model A scores 95% and Model B scores 85%. A/B testing might tell you users prefer Model B by 2:1. Trust the users.
The best evaluation is the one users run for you every time they interact with your product. Measure that, not your carefully curated test set.
Written by XQA Team