
The conventional wisdom in AI engineering is clear: fine-tune your models for your domain. Generic embeddings trained on internet data can't possibly understand your specific terminology, your edge cases, your unique context. Fine-tuning will unlock the performance you need.
We believed this deeply. Our product involved semantic search over technical documentation—highly specialized content with domain-specific terminology. Surely fine-tuned embeddings would dramatically outperform generic ones.
We assembled a team: two ML engineers, a data engineer, and a PM. We spent three months on the project. We curated training data, set up training infrastructure, ran hundreds of experiments, evaluated on custom benchmarks, and finally deployed our fine-tuned embedding model.
The result: 2% improvement in retrieval accuracy over OpenAI's off-the-shelf text-embedding-ada-002.
Three months of engineering time. Significant infrastructure costs. New operational complexity. For 2%.
Nine months after deployment, OpenAI released text-embedding-3-large. We benchmarked it against our fine-tuned model on the same evaluation set.
The new pretrained model beat our fine-tuned model by 15%. With zero effort on our part.
We deprecated our fine-tuned embeddings. We now use pretrained models from OpenAI or Cohere, updated whenever they release improvements. Here's why fine-tuning embeddings is usually a waste of time.
Section 1: The Marginal Gains Trap—Why Domain-Specific Rarely Matters
The intuition behind fine-tuning seems solid: models trained on general data won't understand domain-specific content. A model that learned from Wikipedia and Reddit can't possibly understand quantum physics papers or legal contracts or medical records.
This intuition is wrong for modern large embedding models.
What "General" Training Data Actually Contains
OpenAI's embedding models are trained on massive, diverse corpora. This includes:
- Wikipedia (including highly technical articles)
- Academic papers (arXiv, PubMed, etc.)
- Technical documentation (from public docs sites)
- Books (including textbooks)
- Code and technical discussions (GitHub, Stack Overflow)
Unless your domain is genuinely novel (invented last month), the base model has likely seen significant content about it. The "general" model isn't ignorant of your domain—it's trained on everything.
Our Actual Results
We fine-tuned on 50,000 document pairs from our technical documentation corpus. We expected significant improvement on domain-specific queries.
What we found:
- Generic queries: Base model and fine-tuned model performed identically
- Domain-specific queries: Fine-tuned model was 3-5% better
- Edge case queries: Fine-tuned model was sometimes worse (overfitting to training distribution)
- Blended benchmark: 2% overall improvement
The domain-specific gains were real but small. And they came with a cost: the model became slightly worse at understanding novel queries that didn't match the training distribution.
The Opportunity Cost
Three months of ML engineering time at our rates: approximately $150,000 in fully-loaded cost.
Training infrastructure (GPUs, storage, compute): approximately $15,000.
Ongoing operational overhead (model serving, monitoring, updates): approximately $5,000/month.
For a 2% improvement. The ROI calculation didn't work.
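A quick back-of-the-envelope calculation makes the point (a minimal sketch using the first-year figures above):

```python
# First-year cost of owning the fine-tuned model, using the figures above
engineering = 150_000        # three months of fully-loaded team time
infrastructure = 15_000      # training GPUs, storage, compute
operations = 5_000 * 12      # monthly serving, monitoring, updates

first_year_total = engineering + infrastructure + operations
print(first_year_total)       # 225000
print(first_year_total / 2)   # 112500 dollars per point of accuracy gained
```

Over $100,000 per point of retrieval accuracy, before even counting what the team could have built instead.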
Section 2: The Moving Baseline Problem—Pretrained Models Keep Getting Better
The most devastating argument against fine-tuning: the baseline keeps improving without your effort.
The Timeline of Our Embarrassment
Month 0: We benchmark text-embedding-ada-002. It scores 78% on our retrieval benchmark.
Month 3: We deploy our fine-tuned model. It scores 79.6%—a 2% relative improvement. We celebrate.
Month 9: OpenAI releases text-embedding-3-small. We benchmark it: 85%. Without any fine-tuning, it beats our fine-tuned model by more than five points.
Month 12: We try text-embedding-3-large: 91%. Our fine-tuned model (still on the old base) looks embarrassing.
Every six months, the major providers release improved embedding models. Each release typically improves performance by 10-20% on standard benchmarks. Our fine-tuning effort captured 2% while the industry delivered 15%.
The Maintenance Burden
To keep up, we would have needed to re-fine-tune every time a new base model was released. That means:
- Curating updated training data (the old data might not transfer well)
- Running new training experiments
- Validating performance
- Managing model versions
- Updating production systems
This is a permanent tax on your engineering team. Every time the provider improves, you have to do work to benefit from it. If you just use pretrained models, you get improvements for free.
The Simplicity of Pretrained
Our current embedding pipeline:
- Call OpenAI's embedding API
- Store the vector
- Done
When OpenAI releases a better model, we:
- Change the model name in our config
- Re-embed our corpus over a weekend
- Done
No training. No experiments. No ML infrastructure. No operational complexity. Just the latest, best model with minimal effort.
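Here is roughly what that looks like in code (a minimal sketch using the OpenAI Python SDK; in practice the model name lives in config, so an upgrade is a one-line change):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Lives in config in practice; upgrading to a new model is a one-line change
EMBEDDING_MODEL = "text-embedding-3-large"

def embed(texts: list[str]) -> list[list[float]]:
    """Embed a batch of texts with the current pretrained model."""
    response = client.embeddings.create(model=EMBEDDING_MODEL, input=texts)
    return [item.embedding for item in response.data]

# Store the resulting vectors in your vector database; that's the whole pipeline.
vectors = embed(["How do I configure single sign-on?"])
```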
Section 3: When Fine-Tuning Does Make Sense
We're not saying fine-tuning is always wrong. There are legitimate use cases—they're just narrower than the hype suggests.
Genuinely Novel Domains
If you're working in a domain where the terminology literally didn't exist when the base model was trained, fine-tuning adds value. Examples:
- Internal company jargon (product names, process names, acronyms)
- Brand-new scientific fields (where papers weren't in training data)
- Proprietary methodologies with custom vocabulary
But be honest: is your domain really that novel? Most "specialized" domains have extensive public content that base models have seen.
Latency-Critical Applications
Fine-tuning smaller models can make sense for latency. If you need embeddings in <10ms and API calls take 100ms, a fine-tuned smaller model running locally might be worth it.
But this is a latency optimization, not a quality optimization. You're trading quality for speed at the margin.
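If latency genuinely forces your hand, the local path is short (a sketch using sentence-transformers; the checkpoint is an illustrative public model, not a fine-tune):

```python
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

# A small public checkpoint; a latency-critical team would fine-tune something this size
model = SentenceTransformer("all-MiniLM-L6-v2")

# Runs in-process: no network round trip, typically single-digit milliseconds on a GPU
vectors = model.encode(["How do I fix auth errors?"])
```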
Privacy Requirements
If you can't send data to external APIs (regulated industries, sensitive data), you need local models. Fine-tuning local open-source models might be necessary.
But again, this is a constraint-driven decision, not a quality-seeking one.
Massive Scale Cost Optimization
At extreme scale (billions of embeddings per day), API costs become significant. Fine-tuning a model to run locally can reduce cost. But you need truly massive scale to justify the engineering overhead.
Section 4: What To Do Instead—Maximizing Pretrained Performance
If fine-tuning isn't the answer, what is? There are high-leverage activities that improve retrieval without the costs of fine-tuning.
Better Chunking
How you chunk your documents matters more than embedding model quality. We spent three months fine-tuning embeddings for a 2% gain. We spent one week improving our chunking strategy and got an 8% gain.
Good chunking practices (a minimal sketch follows this list):
- Semantic chunking (split on meaning boundaries, not character counts)
- Overlapping chunks (prevents losing context at boundaries)
- Chunk size optimization (too small loses context, too big dilutes relevance)
- Document structure awareness (headings, sections, lists)
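Here is a minimal illustration of the first two practices: a paragraph-aware chunker with overlap. The size threshold is an arbitrary starting point to tune against your own corpus, and paragraph breaks stand in for true semantic boundaries:

```python
def chunk_document(text: str, max_chars: int = 1500, overlap: int = 1) -> list[str]:
    """Pack paragraphs into chunks of up to max_chars, carrying the last
    `overlap` paragraphs into the next chunk so boundary context survives."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[list[str]] = []
    current: list[str] = []
    size = 0
    for para in paragraphs:
        if current and size + len(para) > max_chars:
            chunks.append(current)
            current = current[-overlap:]  # overlapping chunks: re-seed with the tail
            size = sum(len(p) for p in current)
        current.append(para)
        size += len(para)
    if current:
        chunks.append(current)
    return ["\n\n".join(chunk) for chunk in chunks]
```

A production version would also split on headings and section boundaries, but even this crude version beats fixed character windows.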
Query Expansion
Instead of fine-tuning embeddings to understand your queries better, expand the queries to match how documents are written.
User query: "How do I fix auth errors?"
Expanded query: "How do I fix auth errors? Authentication failure. Login problems. Access denied. Token expired."
A simple GPT call to expand queries before embedding improved our retrieval by 5%—more than fine-tuning, implemented in a day.
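The whole technique is one LLM call before embedding (a sketch; the prompt and model choice are illustrative, not what we ran in production):

```python
from openai import OpenAI

client = OpenAI()

def expand_query(query: str) -> str:
    """Append alternate phrasings so the query vocabulary matches the docs."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any small, cheap model works here
        messages=[
            {
                "role": "system",
                "content": "Given a search query, restate it and append 3-5 short "
                           "alternate phrasings using the vocabulary of technical "
                           "documentation.",
            },
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content

# "How do I fix auth errors?" -> "... Authentication failure. Login problems. ..."
expanded = expand_query("How do I fix auth errors?")
```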
Hybrid Search
Combine embedding search with keyword search. Some queries are best served by exact keyword matching; others by semantic similarity. A hybrid approach captures both.
We implemented a simple scorer that blends BM25 (keyword) and embedding scores, which gave a 7% improvement over embedding-only search.
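A sketch of the blending step, assuming you already have embedding vectors for a candidate set (the rank-bm25 package supplies the keyword side; the 0.5 weight is a starting point to tune):

```python
import numpy as np
from rank_bm25 import BM25Okapi  # pip install rank-bm25

def hybrid_scores(query_tokens: list[str], query_vec: np.ndarray,
                  doc_tokens: list[list[str]], doc_vecs: np.ndarray,
                  alpha: float = 0.5) -> np.ndarray:
    """Blend min-max-normalized BM25 and cosine-similarity scores."""
    keyword = BM25Okapi(doc_tokens).get_scores(query_tokens)
    keyword = (keyword - keyword.min()) / (keyword.max() - keyword.min() + 1e-9)

    semantic = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    semantic = (semantic - semantic.min()) / (semantic.max() - semantic.min() + 1e-9)

    return alpha * keyword + (1 - alpha) * semantic  # higher is better
```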
Reranking
Use a reranker model (such as Cohere Rerank or a cross-encoder) to refine the top results from embedding search. Rerankers are more expensive per query, but they only run over a small candidate set, so the added latency and cost stay bounded.
Adding reranking to our pipeline improved accuracy by 10%—the single biggest improvement we achieved, with no training required.
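A sketch using an open cross-encoder from sentence-transformers (Cohere's hosted Rerank endpoint is a drop-in alternative; the checkpoint here is an illustrative public one):

```python
from sentence_transformers import CrossEncoder  # pip install sentence-transformers

# A public MS MARCO cross-encoder; scores (query, document) pairs jointly
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    """Re-score the embedding-search candidates and keep the best top_k."""
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```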
Conclusion: Buy, Don't Build
Fine-tuning embeddings feels like the right thing to do. It's "proper ML engineering." It demonstrates expertise. It creates intellectual property.
But for most teams, it's a trap. The marginal gains are small. The maintenance burden is high. The opportunity cost is massive. And the baseline keeps improving without your effort.
The highest-leverage activities are not training new models—they're using existing models well. Better chunking. Smarter queries. Hybrid approaches. Reranking.
These techniques stack. Each one gave us a 5-10% improvement. Together, they delivered many times the gain of our fine-tuning effort, all without training infrastructure, ML expertise, or ongoing model maintenance.
Save fine-tuning for when you've exhausted the simpler approaches and still need more. In three years of production semantic search, we've never reached that point.
The best embedding model is the one you don't have to maintain. Use pretrained. Focus on what surrounds the model, not the model itself.
Written by XQA Team