Technology
January 21, 2026
4 min read
693 words

Why We Stopped Using Synthetic Data. Real Data Was Cheaper.

We generated millions of synthetic training examples. Validation looked great. Production was a disaster. Turns out 10,000 real examples beat 1 million synthetic ones. And cost less.


The promise of synthetic data is irresistible. Why laboriously collect and label real-world examples when you can generate unlimited training data programmatically? GPT-4 can create thousands of diverse examples. Data augmentation can multiply your dataset 100x. Simulation can generate scenarios you'd never see in the real world.

We were building a classification model for customer support tickets. Real labeled data was expensive—our support team had to manually categorize tickets, and we could only get about 500 labeled examples per week. At that rate, building a robust model would take months.

Synthetic data seemed like the obvious shortcut.

The Million-Example Dream

We used GPT-4 to generate synthetic support tickets. We gave it examples of each category and asked it to generate variations. It worked beautifully:

"Generate 100 variations of a billing complaint support ticket. Vary the tone (angry, neutral, polite), the specific issue (overcharge, wrong plan, failed payment), and the writing style."

GPT-4 delivered. Fluent, diverse, realistic-looking tickets. We generated 50,000 examples per category. We had a million training examples in a week.
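
For context, the generation loop looked roughly like the sketch below. This is illustrative, not our exact pipeline: the category list, the batch size, and the assumption that GPT-4 returns clean JSON are all placeholders.

```python
# Minimal sketch of a synthetic-ticket generation loop (illustrative, not our pipeline).
# Assumes the openai>=1.0 Python SDK and OPENAI_API_KEY set in the environment.
import json
from openai import OpenAI

client = OpenAI()

CATEGORIES = ["billing", "shipping", "account_access", "product_defect"]  # hypothetical

def generate_batch(category: str, n: int = 100) -> list[str]:
    prompt = (
        f"Generate {n} variations of a {category} complaint support ticket. "
        "Vary the tone (angry, neutral, polite), the specific issue, and the "
        "writing style. Return only a JSON array of strings."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    # Optimistic parse: a real pipeline needs retries for malformed JSON.
    return json.loads(response.choices[0].message.content)

synthetic = {category: generate_batch(category) for category in CATEGORIES}
```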

Our validation metrics were stellar. 94% accuracy on our synthetic test set. We deployed with confidence.

Production accuracy: 61%.

What Went Wrong: A Forensic Analysis

Issue #1: The Distribution Wasn't Real

Real customer tickets have patterns we didn't anticipate. Typos cluster in predictable ways. Angry customers use specific phrases ("I've been a customer for X years"). Non-native English speakers make particular grammar errors. GPT-4's synthetic data was too clean: varied in the wrong ways, and not varied enough in the right ones.

Issue #2: Edge Cases Were Underrepresented

Real data has a long tail. 20% of tickets don't fit neatly into categories. They're about multiple issues. They're poorly written. They include irrelevant information. Our synthetic data, generated from clean category definitions, missed this messiness entirely.

Issue #3: The Model Learned Synthetic Artifacts

GPT-4 has stylistic patterns. Our model learned to recognize those patterns, not the actual characteristics of each category. It was classifying "does this sound like GPT-4 wrote a billing complaint?" not "is this customer complaining about billing?"
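
A cheap diagnostic would have caught this early: train a throwaway classifier to tell real tickets from synthetic ones. If it can, your production model can too, and it will take that shortcut. Here's a sketch, assuming the data sits in two CSV files with a text column (file and column names are hypothetical):

```python
# Sketch: a "real vs. synthetic" detector (a classifier two-sample test).
# File names and column names are hypothetical.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

real = pd.read_csv("real_tickets.csv")["text"]
synthetic = pd.read_csv("synthetic_tickets.csv")["text"]

texts = pd.concat([real, synthetic], ignore_index=True)
labels = [0] * len(real) + [1] * len(synthetic)  # 0 = real, 1 = synthetic

detector = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
scores = cross_val_score(detector, texts, labels, cv=5)

# ~0.5 means the two sets are hard to tell apart; ~1.0 means the synthetic
# data has obvious artifacts a model can latch onto instead of the content.
print(f"real-vs-synthetic detection accuracy: {scores.mean():.2f}")
```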

The Real Data Experiment

We shifted strategy. We paid our support team overtime to label aggressively. We built a labeling interface that made it fast. We prioritized quality over quantity.

After two weeks, we had 10,000 real labeled examples. Not a million. Ten thousand.

We trained the same model architecture. Production accuracy: 84%.

Ten thousand real examples beat one million synthetic ones. By 23 percentage points.
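
For reference, a model in this ballpark doesn't need to be exotic. The baseline below is a sketch with placeholder file and column names, not our production architecture; the point is that the held-out split is real tickets, so the score you see tracks production far better than a synthetic test set.

```python
# Illustrative baseline for the ticket classifier (placeholder data paths; not our model).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("labeled_tickets.csv")  # hypothetical columns: text, category

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["category"], test_size=0.2, stratify=df["category"], random_state=42
)

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

# The test split here is real, hand-labeled tickets, so this report is an
# honest preview of production behavior.
print(classification_report(y_test, model.predict(X_test)))
```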

The Cost Comparison Nobody Does

Let's compare the actual costs:

Synthetic Data Approach:

  • GPT-4 API calls: ~$2,000
  • Engineering time (pipeline, validation, debugging): ~$15,000
  • Debugging the production accuracy issues: 3 weeks
  • Total: ~$17,000 + wasted time

Real Data Approach:

  • Support team labeling (overtime): ~$8,000
  • Labeling interface development: ~$3,000
  • Time to working model: 2 weeks
  • Total: ~$11,000

Real data was cheaper. And faster. And delivered a model that actually worked.

When Synthetic Data Actually Works

We're not saying synthetic data never works. It has legitimate uses:

  • Rare event augmentation: If you have 50 examples of a rare class and 50,000 of common classes, synthetic augmentation of the rare class can help rebalance the training set (a sketch follows below).
  • Privacy-sensitive domains: When you genuinely can't access real data due to privacy constraints, synthetic is the only option.
  • Simulation environments: For robotics and games where synthetic = reality by definition.
  • Seed data for cold start: Getting something working before real data is available, with plans to replace.

But as a replacement for real training data? Almost never.
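
For the rare-event case in the first bullet above, the narrow version looks something like this. The category name, counts, and generate_batch (the GPT-4 helper sketched earlier) are illustrative; what matters is that only the starved class gets synthetic rows, and the evaluation set stays entirely real.

```python
# Sketch: augment ONLY the rare class, and keep the evaluation set 100% real.
# Category names, counts, and generate_batch() are illustrative assumptions.
import pandas as pd

real = pd.read_csv("labeled_tickets.csv")   # hypothetical columns: text, category
rare = "fraud_report"                       # say, 50 real examples vs. 50,000 common ones

# Hold out real data first, so no synthetic row can ever leak into evaluation.
eval_set = real.sample(frac=0.2, random_state=42)
train_real = real.drop(eval_set.index)

# Generate only enough synthetic examples to rebalance the starved class.
# (In practice you would batch these calls; generate_batch is the GPT-4 helper above.)
needed = train_real["category"].value_counts().max() - (train_real["category"] == rare).sum()
synthetic_rare = pd.DataFrame({"text": generate_batch(rare, n=needed), "category": rare})

train = pd.concat([train_real, synthetic_rare], ignore_index=True)
```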

The Real Lesson

Synthetic data looks cheap because generation is cheap. But the hidden costs are enormous: debugging production failures, re-training, lost user trust, engineering time chasing phantom improvements.

Real data looks expensive because collection is visible work. But it compounds: once collected, it's reusable. It reveals the actual distribution. It exposes real edge cases. It builds models that work.

Don't let the allure of "unlimited free data" blind you to the cost of data that doesn't work. The cheapest data is the data that makes your model actually perform in production.

A million synthetic examples taught our model to recognize GPT-4's writing style. Ten thousand real examples taught it to classify tickets. Know which one you need.

Tags: Technology, Tutorial, Guide

Written by XQA Team

Our team of experts delivers insights on technology, business, and design. We are dedicated to helping you build better products and scale your business.