
The "Staging Environment" is the security blanket of the software world. It's that warm, fuzzy place where we deploy our code to see if it works before we let it touch real users. It feels safe. It feels responsible.
It's also a complete waste of money and a dangerous illusion.
After five years of maintaining a "replica" environment that cost us $15,000 a month and constantly broke for reasons unrelated to code, we shut it down. We now deploy straight to Production. And our incident rate has dropped by 60%.
The Lie of "Parity"
The core argument for Staging is "Environment Parity." We tell ourselves that Staging is exactly like Production, just without the traffic.
This is mathematically impossible.
- Data Gravity: Production has 5TB of messy, historical user data. Staging has 5GB of sanitized, anonymized seed data. You can't test a database migration on seed data and expect it to behave the same on 5TB of real data. It won't time out in Staging. It will time out in Prod.
- Traffic shape: Production has spikes, race conditions, and thundering herds. Staging has one QA engineer clicking "Submit" slowly. You will never catch concurrency bugs in Staging.
- Configuration Drift: "Oh, Staging is using Redis 6.2, but Production is on 7.0 because we forgot to upgrade the Terraform state." This happens in every company I've ever seen.
We realized we were spending 30% of our DevOps time debugging "Staging Issues" that weren't real bugs. "Why is the login broken in Staging? Oh, the SendGrid API key expired." That's not a software bug. That's operational toil.
The False Confidence Loop
The worst part about Staging is that it breeds complacency.
Developers merge code, wait for it to deploy to Staging, click around for 30 seconds, see that it works, and think "Done." They assume the safety net caught them.
Then it breaks in Production.
Why? Because of the data. Because of the traffic. Because of the third-party integrations that behave differently.
When you remove Staging, the psychology changes. The developer knows: "When I merge this, it goes to real users." The fear is healthy. It forces them to write better tests. It forces them to think about rollback strategies before they merge, not after.
How We Do It: The "Shields Up" Strategy
So how do we deploy to Prod without burning the house down? We didn't just delete Staging and pray. We built a "Shields Up" infrastructure.
1. Feature Flags (The Kill Switch)
Every new feature is wrapped in a flag.
```jsx
if (flags.isEnabled('new-checkout-flow', user.id)) {
  return <NewCheckout />;
} else {
  return <OldCheckout />;
}
```
We deploy the code to Production, but the flag is "OFF." The code is there. It's sleeping. Users see nothing.
Then, we turn it on for specific users: "Internal Employees." We test it in Production, with real data, but limited blast radius.
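A minimal sketch of how that targeting can work. The `rule` shape, the segment name, and the helper functions here are illustrative assumptions, not a real flag SDK:

```javascript
// Hypothetical flag evaluation with audience targeting; the shape of
// `rule` and these helper names are made up for illustration.

// Deterministic bucketing: the same user always lands in the same bucket,
// so a 10% rollout doesn't flicker between requests.
function bucket(userId) {
  let hash = 0;
  for (const ch of userId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
  }
  return hash % 100;
}

function isEnabled(rule, userId, segments) {
  if (!rule.enabled) return false; // global kill switch
  if (segments.some((s) => rule.allowedSegments.includes(s))) return true;
  return bucket(userId) < rule.rolloutPercent;
}

const checkoutFlag = {
  name: 'new-checkout-flow',
  enabled: true,
  allowedSegments: ['internal-employees'],
  rolloutPercent: 0, // live in Prod, but dark for everyone outside the segment
};

console.log(isEnabled(checkoutFlag, 'u-123', ['internal-employees'])); // true
console.log(isEnabled(checkoutFlag, 'u-456', []));                     // false
```

The key property is deterministic bucketing: when you later raise `rolloutPercent` to 10, the same 10% of users stay enabled across requests.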
2. Canary Deployments
We use Argo Rollouts in Kubernetes. When a new version ships, it doesn't replace 100% of the pods.
- Phase 1: 1% of traffic goes to the new version. Any 500 errors? Auto-rollback.
- Phase 2: 10% of traffic. Latency check. Any regression? Auto-rollback.
- Phase 3: 50%... 100%.
This automated safety net is infinitely better than a manual QA pass in Staging.
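The phases above boil down to a simple loop: shift traffic, query metrics, promote or abort. A sketch of that promotion logic, where the thresholds and the `getErrorRate` callback are assumptions standing in for the real Argo Rollouts config and metrics provider:

```javascript
// Illustrative canary promotion logic; thresholds and phase steps are
// assumptions, not our actual Argo Rollouts configuration.
const phases = [
  { trafficPercent: 1, maxErrorRate: 0.001 },
  { trafficPercent: 10, maxErrorRate: 0.005 },
  { trafficPercent: 50, maxErrorRate: 0.005 },
  { trafficPercent: 100, maxErrorRate: 0.005 },
];

// `getErrorRate` stands in for a metrics query (e.g. against Prometheus).
function rollout(getErrorRate) {
  for (const phase of phases) {
    const errorRate = getErrorRate(phase.trafficPercent);
    if (errorRate > phase.maxErrorRate) {
      return 'rolled-back'; // any regression aborts the whole rollout
    }
  }
  return 'promoted'; // every phase passed its check
}

console.log(rollout(() => 0.0001));                                 // promoted
console.log(rollout((traffic) => (traffic >= 10 ? 0.02 : 0.0001))); // rolled-back
```

In the real setup the controller does this for you; the point is that the rollback decision is a metric comparison, not a human clicking around in Staging.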
3. Ephemeral Preview Environments
We do have "testing" environments, but they are ephemeral. When a Pull Request is opened, we spin up a tiny, isolated environment just for that branch. It lasts for the life of the PR.
This is better than a shared Staging environment because it isolates changes. If PR #123 breaks the DB, it doesn't block the developer on PR #124.
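The mechanics are simple: every preview environment is keyed to its PR. A hypothetical helper showing the idea; the naming scheme and domain are invented for illustration:

```javascript
// Hypothetical helper keying a preview environment to a PR; the naming
// scheme and the domain are made up for illustration.
function previewEnv(repo, prNumber) {
  const name = `${repo}-pr-${prNumber}`.toLowerCase();
  return {
    namespace: name, // isolated Kubernetes namespace per PR
    url: `https://${name}.preview.example.com`,
  };
}

console.log(previewEnv('checkout', 123).url);
// https://checkout-pr-123.preview.example.com
```

A CI job creates the namespace when the PR opens and tears it down when the PR closes, so nothing lingers and nothing is shared.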
Testing in Production (The Right Way)
Testing in production sounds scary. But you are already doing it. As @charitymajors famously says: "Staging is like masturbation, and testing in production is like sex. You can practice all you want alone, but the real thing is always different."
We implemented "Synthetic Transaction Monitoring."
Every minute, a script runs in Production that:
1. Creates a test user.
2. Adds an item to the cart.
3. Checks out (using a Stripe test card or a 100% discount coupon).
4. Verifies the order success.
If this fails, PagerDuty wakes us up. We know the checkout is broken before a real customer complains.
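The probe above can be sketched as a pipeline of steps where any failure pages immediately. The dummy steps and the paging callback here stand in for the real API clients and PagerDuty integration:

```javascript
// Sketch of the synthetic checkout probe; the steps and the `page`
// callback stand in for real API clients and PagerDuty.
async function syntheticCheckout(steps, page) {
  for (const step of steps) {
    try {
      await step.run();
    } catch (err) {
      // One failed step means the funnel is broken for real users too.
      page(`synthetic checkout failed at "${step.name}": ${err.message}`);
      return false;
    }
  }
  return true;
}

const steps = [
  { name: 'create test user', run: async () => {} },
  { name: 'add item to cart', run: async () => {} },
  { name: 'checkout with 100% coupon', run: async () => { throw new Error('502 from payments'); } },
  { name: 'verify order success', run: async () => {} },
];

syntheticCheckout(steps, (msg) => console.log('PAGE:', msg)).then((ok) =>
  console.log(ok ? 'checkout healthy' : 'checkout broken'),
);
```

Running it on a one-minute schedule means your worst-case detection time for a broken checkout is about a minute, regardless of whether any real customer is shopping at 3am.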
The Cost Savings
Deleting Staging saved us:
- $180,000/year in AWS bills (RDS instances are expensive).
- ~10 hours/week of DevOps time maintaining the environment.
- 45 minutes per deployment cycle (no more "Waiting for Staging to be free").
We reinvested that money into better observability tools (Datadog, Honeycomb) because if you test in Prod, you need to see what is happening.
Conclusion
Staging is a relic of the "Waterfall" era where we released software once every 6 months on a DVD. In the cloud era, where we deploy 20 times a day, Staging is a bottleneck.
It takes courage to delete it. You will feel naked. But once you get used to the discipline of Feature Flags and Observability, you will realize that Staging was just a security blanket full of holes.
Move fast. Break nothing (because it's behind a flag). Test in Prod.
Written by XQA Team
Our team of experts delivers insights on technology, business, and design. We are dedicated to helping you build better products and scale your business.