
The architecture diagrams looked beautiful. Clean boxes connected by arrows. "This service publishes UserCreated. These five services consume it. Loose coupling! Scalability! Resilience!"
Reality was different.
A Simple Bug Report
Customer: "I signed up but never got my welcome email."
Investigation time: 4 hours.
Here's what we traced:
- Auth service created user → published UserCreated
- Profile service consumed event → published ProfileCreated
- Email service consumed UserCreated → checked for ProfileCreated (needed name for email)
- Race condition: ProfileCreated hadn't arrived yet → email service queued a retry
- Retry handler crashed (unrelated bug) → event lost
- No alert because the retry queue was "eventually consistent"
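The failure chain above can be reproduced in a few lines. This is a minimal in-memory sketch, not our production code: `EmailService`, `drain_retries`, and the `handler_is_broken` flag are hypothetical names invented for illustration, and the "broker" is just a Python list of event names delivered in order.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str      # "UserCreated" or "ProfileCreated"
    user_id: str


class EmailService:
    """Hypothetical consumer that needs both events before it can send."""

    def __init__(self) -> None:
        self.profiles: set[str] = set()          # user IDs with a known profile
        self.retry_queue: deque[Event] = deque()
        self.sent: list[str] = []                # user IDs that got an email

    def handle(self, event: Event) -> None:
        if event.kind == "ProfileCreated":
            self.profiles.add(event.user_id)
        elif event.kind == "UserCreated":
            if event.user_id in self.profiles:
                self.sent.append(event.user_id)
            else:
                # Profile data is not here yet: park the event for later.
                self.retry_queue.append(event)

    def drain_retries(self, handler_is_broken: bool) -> None:
        # One pass over the events queued so far.
        for _ in range(len(self.retry_queue)):
            event = self.retry_queue.popleft()   # removed before processing
            try:
                if handler_is_broken:
                    raise RuntimeError("unrelated bug in retry handler")
                self.handle(event)
            except RuntimeError:
                # Swallowed: no re-queue, no dead-letter entry, no alert.
                pass


def run(order: list[str], handler_is_broken: bool) -> list[str]:
    """Deliver events in the given order, then drain retries once."""
    service = EmailService()
    for kind in order:
        service.handle(Event(kind, "user-42"))
    service.drain_retries(handler_is_broken)
    return service.sent
```

With a healthy retry handler, either delivery order ends with the email sent. The email is only lost when the unlucky ordering (UserCreated first) meets the broken handler, which is why a bug like this can sit unnoticed until both conditions line up.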
To find this, we:
- Searched Kafka logs for the user ID
- Correlated timestamps across 4 services
- Searched Elasticsearch for retry handler exceptions
- Checked S3 for dead-letter queue exports
- Reconstructed the timeline on a whiteboard
The fix was 3 lines of code. Finding where to put them took 4 hours.
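Most of those 4 hours went into the whiteboard step: merging entries from several log sources into one ordered timeline for a single user. A rough sketch of that merge is below. The `LogEntry` shape and `build_timeline` helper are hypothetical, and the sketch assumes every entry carries the user ID and that service clocks are reasonably in sync, neither of which we could take for granted.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class LogEntry:
    service: str         # which service emitted the entry
    timestamp: datetime  # assumed comparable across services
    user_id: str
    message: str


def build_timeline(user_id: str, *sources: list[LogEntry]) -> list[LogEntry]:
    """Merge per-service logs and return one user's entries in time order."""
    matching = [
        entry
        for entries in sources
        for entry in entries
        if entry.user_id == user_id
    ]
    return sorted(matching, key=lambda entry: entry.timestamp)
```

Calling `build_timeline("user-42", auth_logs, profile_logs, email_logs)` would return the interleaved entries for that user. In practice the hard part was not the sort but getting each source (Kafka, Elasticsearch, S3 exports) into a common shape in the first place.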
The Allure of Events
Events promise independence. Services don't need to know about each other. Add a new consumer without changing the producer. Scale each service independently.
These benefits are real. But the costs are hidden:
- Invisible dependencies: Services don't call each other directly, but they still depend on event schemas, timing, and ordering. The dependency is just harder to see.
- Debugging requires omniscience: You need to understand the entire event flow to debug any part of it.
- Testing is nearly impossible: Unit tests can't capture cross-service event timing issues. Integration tests are slow and flaky.
- State is scattered: Where's the truth? Is it the event log? The database? The cache that got updated from events?
The Metric That Killed Us
We measured mean time to debug production issues:
| Architecture | Avg Debug Time |
|---|---|
| Monolith (before migration) | 35 minutes |
| Event-driven microservices | 3.5 hours |
6x slower debugging. That's not a typo. Every production issue became an archaeological expedition through distributed logs.
What We Do Now
We moved back to synchronous, direct service calls for most flows. Events are reserved for genuinely asynchronous operations (batch processing, analytics, notifications that can be delayed).
The signup flow now:
- Auth service creates user
- Auth service calls Profile service synchronously
- Auth service calls Email service synchronously
- If any call fails, the whole request fails with a clear error
Debugging: "Email service returned 500." Look at email service logs. Done.
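The synchronous flow can be sketched roughly as follows. This is an illustration under assumptions rather than our actual implementation: `sign_up`, `SignupError`, and the three callables standing in for the Auth, Profile, and Email service clients are all hypothetical names, and cleanup of a partially completed signup is left out.

```python
from typing import Callable


class SignupError(Exception):
    """Raised when one step of the signup flow fails."""

    def __init__(self, step: str, cause: Exception) -> None:
        super().__init__(f"signup failed at {step}: {cause}")
        self.step = step  # which service call failed


def sign_up(
    email: str,
    create_user: Callable[[str], str],          # returns the new user ID
    create_profile: Callable[[str], None],      # takes the user ID
    send_welcome_email: Callable[[str], None],  # takes the user ID
) -> str:
    """Run the signup steps in order; the first failure aborts the request."""
    try:
        user_id = create_user(email)
    except Exception as exc:
        raise SignupError("auth", exc) from exc

    for step, call in (("profile", create_profile), ("email", send_welcome_email)):
        try:
            call(user_id)
        except Exception as exc:
            # The error names the failing step, so the caller knows
            # which service's logs to open first.
            raise SignupError(step, exc) from exc

    return user_id
```

The point of the design is that the failing step travels with the error. There is no timeline to reconstruct: the exception already says which call broke and why.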
Events Aren't Evil
Events work well for:
- Analytics pipelines (eventual consistency is fine)
- True background jobs (send report email in 5 min)
- Cross-team notifications (team A doesn't need to wait for team B)
Events fail for:
- Critical user flows (signup, purchase, core features)
- Anything requiring transactional consistency
- Flows where debugging speed matters
Architecture beauty doesn't pay the bills. Debuggability does. Choose accordingly.
Written by XQA Team