
Multi-modal AI was supposed to be the holy grail. One model that understands text, images, audio, and video. "Just show it the invoice and it'll extract all the fields!" OpenAI's demos were magical—GPT-4V reading handwritten notes, analyzing charts, understanding memes. Google Gemini followed with even more impressive multi-modal capabilities.
We were eager adopters. Our document processing pipeline handled thousands of invoices, receipts, and contracts daily. The existing system used traditional OCR (Tesseract) followed by text extraction and NLP. It was boring. It required maintenance. It had edge cases.
Multi-modal AI promised to simplify everything. Feed the image directly to GPT-4V, ask it to extract the fields, get structured JSON back. No OCR pipeline. No text processing. One API call.
We built a proof of concept. It worked beautifully on our test set. We deployed to production.
Within a week, we had a crisis.
30% of invoices were being processed with at least one error. Amounts were wrong. Dates were transposed. Vendor names were hallucinated. Our finance team was spending more time correcting AI errors than they'd spent doing manual data entry.
We rolled back to our "boring" OCR pipeline. Error rate dropped to 3% immediately.
Here's what we learned about why multi-modal AI isn't ready for production document processing—and when text-focused approaches still win.
Section 1: The Demo vs. Production Gap—Why Impressive Demos Fail
Multi-modal AI demos are genuinely impressive. The model looks at a complex diagram and explains it. It reads a handwritten recipe and converts it to a structured format. It analyzes a chart and provides insights.
But demos are curated. Production is not.
The Lighting and Quality Problem
Our invoice images came from various sources:
- High-resolution PDFs (optimal)
- Scanned documents (variable quality)
- Photos taken on phones (often blurry, poorly lit)
- Screenshots with compression artifacts
- Faxes converted to digital (degraded quality)
GPT-4V performed excellently on high-resolution PDFs. On phone photos with shadows? Error rates exceeded 40%. The model would "guess" at partially visible characters, often confidently and incorrectly.
Traditional OCR with preprocessing (deskewing, contrast adjustment, noise reduction) handled these edge cases much better. The preprocessing pipeline was specifically engineered for degraded input. The vision model assumed good input.
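To make that concrete, here is a minimal sketch of that kind of preprocessing, assuming OpenCV and pytesseract; the file name and parameter values are illustrative, not our exact production settings.

```python
# Minimal OCR preprocessing sketch. Assumes OpenCV (cv2) and pytesseract
# are installed; "invoice_scan.png" and the blur kernel size are
# illustrative values, not our production configuration.
import cv2
import pytesseract


def preprocess(path: str):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)

    # Noise reduction: median blur removes the salt-and-pepper speckle
    # typical of faxes and dim phone photos.
    img = cv2.medianBlur(img, 3)

    # Contrast normalization: Otsu thresholding separates ink from paper
    # even when lighting varies across the page.
    _, img = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # Deskewing would go here (e.g., estimating the text angle with
    # cv2.minAreaRect or a Hough transform and rotating to correct it).
    return img


text = pytesseract.image_to_string(preprocess("invoice_scan.png"))
```

Each step exists because some class of degraded input needs it. The vision model gets none of this help.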
The Confidence-Without-Accuracy Problem
The most dangerous failure mode: GPT-4V would return confident, well-formatted responses that were completely wrong.
Example: An invoice total was partially obscured by a coffee stain. The visible portion showed "1,2", then a smudge, then ".00". The actual total was $1,250.00.
GPT-4V returned: {"total": 1200.00, "confidence": "high"}
It guessed. It didn't say "I can't read this." It filled in the gap with a plausible number. And it marked itself as confident.
Traditional OCR would have flagged this as a low-confidence extraction, triggering human review. The vision model's confidence scores were essentially meaningless for degraded input.
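For comparison, here is roughly how per-word confidence scores drive that review flag, sketched with pytesseract; the 60-point cutoff is an illustrative threshold, not a universal constant.

```python
# Confidence-gated OCR sketch using pytesseract's word-level confidences.
import pytesseract
from PIL import Image
from pytesseract import Output

image = Image.open("invoice_scan.png")  # illustrative file name
data = pytesseract.image_to_data(image, output_type=Output.DICT)

# image_to_data returns parallel lists; 'conf' holds per-word confidence
# (rows with no recognized text carry a confidence of -1).
needs_review = [
    (word, float(conf))
    for word, conf in zip(data["text"], data["conf"])
    if word.strip() and float(conf) < 60
]

if needs_review:
    # In production, documents with any low-confidence words are queued
    # for a human instead of being posted automatically.
    print("Route to manual review:", needs_review)
```

A smudged total comes back with a low score and gets a human's eyes on it, rather than a confident guess.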
The Training Data Problem
Vision models are trained on internet-scale image data. They've seen millions of photos, diagrams, and screenshots. But they haven't seen your specific invoice formats, your specific vendors' letterheads, your specific edge cases.
Our OCR pipeline had been tuned against our actual invoice corpus. It knew that "Invoice #" appears in the top-right corner of 80% of our invoices. It knew that amounts are usually right-aligned. It had statistical priors built from our real data.
GPT-4V had generic priors from generic training data. It sometimes extracted "Invoice #" from a purchase order because it didn't know our document taxonomy.
Section 2: The Latency and Cost Reality—Production Economics
Beyond accuracy, multi-modal AI had practical limitations we hadn't fully considered.
Latency
Processing one invoice image through GPT-4V took 3-8 seconds. Highly variable, unpredictable.
Our OCR + text extraction pipeline took 200-400ms per document. Consistent, predictable.
At 10,000 invoices per day:
- Vision model approach: roughly 8-22 hours of sequential processing time
- OCR approach: roughly 33-67 minutes of sequential processing time
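Those figures are just the per-document latencies multiplied out; a quick back-of-the-envelope check, assuming a single sequential worker:

```python
# Back-of-the-envelope throughput check using the latencies quoted above,
# assuming one sequential worker.
invoices_per_day = 10_000

vision_hours = (invoices_per_day * 3 / 3600, invoices_per_day * 8 / 3600)
ocr_minutes = (invoices_per_day * 0.2 / 60, invoices_per_day * 0.4 / 60)

print(vision_hours)  # ~(8.3, 22.2) hours
print(ocr_minutes)   # ~(33.3, 66.7) minutes
```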
We'd need to run the vision model approach heavily parallelized (expensive) or accept massive delays (unacceptable). The OCR approach just worked with minimal infrastructure.
Cost
GPT-4V pricing for image input was approximately $0.01 per image at the time. That sounds cheap until you scale:
10,000 invoices/day × $0.01 × 30 days = $3,000/month just for API calls.
Our OCR pipeline ran on a $200/month VM. Even adding cloud OCR services (Google Vision API, AWS Textract), our total cost was under $500/month for the same volume.
6x cost increase for worse accuracy? Not a winning trade.
Rate Limits and Availability
OpenAI's API has rate limits. When we tried to burst-process a backlog, we hit limits. When their service had an outage, our pipeline stopped.
Our OCR pipeline was self-hosted. It ran on our infrastructure. It didn't depend on external API availability. It processed at whatever rate our hardware supported.
For production systems, dependency on external APIs is a liability, not a feature.
Section 3: Where Multi-Modal Shines—And Where It Doesn't
We're not against multi-modal AI. We use it successfully for specific use cases. But document processing wasn't one of them.
Good Use Cases for Multi-Modal AI
1. Exploratory analysis: "What's in this image?" "Describe this diagram." When you don't know what you're looking for, vision models excel at open-ended understanding.
2. Natural language + image queries: "Find all photos from our library that show beach scenes at sunset." Semantic search over images is a legitimate strength.
3. Accessibility: Generating alt-text for images, describing charts for screen readers. The output quality can be "good enough" because the baseline (no alt text) is so low.
4. Human-in-the-loop workflows: When a human reviews every output, the AI is augmenting rather than replacing judgment. Errors get caught.
Bad Use Cases for Multi-Modal AI
1. Structured data extraction at scale: When you need specific fields extracted with high accuracy from known document types. Traditional pipelines win.
2. Numerical accuracy: Amounts, dates, ID numbers—anything where a single wrong digit matters. Vision models are not precise.
3. High-volume automated processing: When humans can't review every output. Error rates compound. One bad extraction becomes one bad financial record, which becomes an audit failure.
4. Regulated industries: Healthcare, finance, legal—anywhere explainability and accuracy are legally required. "The AI read it wrong" is not an acceptable explanation to regulators.
Section 4: Our Current Architecture—Boring But Reliable
Our production document processing pipeline is decidedly un-magical:
The Pipeline
- Preprocessing: Image deskewing, contrast normalization, noise reduction
- OCR: Tesseract for text extraction, with confidence scores per character
- Document classification: SVM classifier to identify document type (invoice, receipt, contract, etc.)
- Field extraction: Regex + rules for known document layouts, NER for unknown layouts
- Validation: Business rules (e.g., total must equal sum of line items)
- Human review: Low-confidence extractions flagged for manual review (a minimal sketch of these last two steps follows below)
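Here is that sketch of the validation and review steps. The field names, one-cent tolerance, and confidence threshold are illustrative, not our production schema.

```python
# Illustrative validation step: field names, the one-cent tolerance, and
# the confidence threshold are assumptions, not our production schema.
from dataclasses import dataclass, field


@dataclass
class Extraction:
    total: float
    line_items: list[float]
    confidence: float               # lowest per-field OCR confidence (0-100)
    errors: list[str] = field(default_factory=list)


def validate(doc: Extraction, conf_threshold: float = 60.0) -> bool:
    """Return True only if the document can be posted without review."""
    # Business rule: the stated total must equal the sum of line items
    # (within a cent, to absorb rounding).
    if abs(doc.total - sum(doc.line_items)) > 0.01:
        doc.errors.append("total does not match line items")

    # Low OCR confidence anywhere on the document forces manual review.
    if doc.confidence < conf_threshold:
        doc.errors.append("low OCR confidence")

    return not doc.errors


doc = Extraction(total=1250.00, line_items=[1000.00, 250.00], confidence=91.0)
assert validate(doc)
```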
Nothing here is cutting-edge. Tesseract's engine dates back to the 1980s. SVMs are from the 1990s. Regex is eternal.
But the pipeline works. 97% accuracy. Predictable costs. No external dependencies. Auditable decision paths.
Where We Use AI (Text-Based)
We do use LLMs—but on the extracted text, not on images:
- Semantic field matching: When a document has an unusual layout, we ask GPT-4 to identify field values from the extracted text. Text-based LLMs are much more reliable than vision models for structured extraction.
- Anomaly detection: "Does this invoice look unusual compared to this vendor's typical invoices?" LLM-based reasoning over structured data.
- Summarization: Generating plain-language summaries of contract terms.
The key insight: extract text first (reliably, using proven methods), then apply AI reasoning to the text.
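A minimal sketch of that two-step flow, assuming pytesseract for the OCR stage and the OpenAI Python client for the reasoning stage; the file name, model name, and prompt wording are illustrative rather than our production configuration.

```python
# Text-first sketch: OCR produces plain text, then an LLM labels fields in
# that text. File name, model name, and prompt are illustrative.
import json

import pytesseract
from openai import OpenAI
from PIL import Image

# Step 1: reliable text extraction (preprocessing omitted for brevity).
ocr_text = pytesseract.image_to_string(Image.open("invoice_scan.png"))

# Step 2: reasoning over the extracted text, not the pixels.
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4-turbo",
    response_format={"type": "json_object"},
    messages=[
        {
            "role": "system",
            "content": (
                "Extract vendor, date, and total from the invoice text. "
                "Return JSON with keys vendor, date, and total, and use "
                "null for any field you cannot find."
            ),
        },
        {"role": "user", "content": ocr_text},
    ],
)
fields = json.loads(response.choices[0].message.content)
```

The prompt asks for null rather than a guess when a field is missing, and the OCR confidence scores still gate what reaches this step at all.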
Why Text-First Wins
Text-based LLMs have been trained on vastly more text than vision models have been trained on document images. The knowledge encoded in GPT-4 about invoices comes primarily from textual descriptions of invoices, not from looking at invoice images.
When you give GPT-4 extracted text like "Vendor: Acme Corp, Date: 2024-01-15, Total: $1,234.56", it can reason about it reliably. When you give GPT-4V a blurry photo of an invoice, it's doing much harder work with much less training signal.
Meet the model where it's strongest. For structured document processing, that's text, not images.
Conclusion: Impressive Demos ≠ Production Systems
Multi-modal AI is genuinely impressive technology. The fact that a model can look at an image and understand its content is a remarkable achievement. We're not dismissing the capability.
But production systems have different requirements than demos:
- Consistency: The same input should produce the same output. Vision models have high variance.
- Accuracy: Errors matter. Vision models make confident errors.
- Cost-effectiveness: Unit economics must work at scale. Vision models are expensive.
- Reliability: The system must work when third-party APIs are down. Local processing wins.
For now, multi-modal AI is best suited for human-in-the-loop workflows where its impressive capabilities augment human judgment rather than replace it. For automated, high-volume, accuracy-critical processing, boring traditional pipelines still win.
Don't let impressive demos drive architecture decisions. Evaluate on production requirements: accuracy, cost, latency, reliability. When you do, text-focused approaches often come out ahead.
Text is still king. Vision is a cool prince who isn't ready to rule.
Written by XQA Team