
"The script's running. Should be done in about thirty seconds."
These were the last words spoken before we deleted 6 years of customer data, brought down a platform serving 800,000 users, and discovered that our entire disaster recovery strategy was an elaborate placebo.
I'm Sarah Martinez, DevOps Lead at a B2B SaaS company that provides inventory management software. On Thursday, March 14th, 2024, at 2:34 PM, I executed what should have been a routine database cleanup script. By 2:35 PM, our production database was empty. By 2:37 PM, I was having a panic attack in the bathroom. By midnight, I'd learned more about backup systems, disaster recovery, and my own capacity for stress than I ever wanted to know.
This is the story of our very bad day, and the 18-hour marathon that followed. More importantly, it's about the seven silent failures that led us there, and the disaster recovery framework we built from the ashes.
## 2:34 PM: The Command That Changed Everything
Let me set the scene. Our Postgres database had accumulated test data from our staging environment that had somehow gotten written to production (that's a different disaster we'd been meaning to fix). I'd written a script to identify and delete records that matched specific test account patterns.
The script looked like this:
```bash
#!/bin/bash
# cleanup_test_data.sh
# Removes test accounts and associated data from production
DB_NAME="inventory_db"
DB_USER="admin"
DB_HOST="prod-primary.us-east-1.rds.amazonaws.com"
# Find test account IDs
TEST_ACCOUNTS=$(psql -h $DB_HOST -U $DB_USER -d $DB_NAME -t -c \
  "SELECT id FROM accounts WHERE email LIKE '%@test.internal'")
# Delete associated data
DELETED_COUNT=0
for account_id in $TEST_ACCOUNTS; do
  psql -h $DB_HOST -U $DB_USER -d $DB_NAME -c \
    "DELETE FROM inventory WHERE account_id = '$account_id'"
  psql -h $DB_HOST -U $DB_USER -d $DB_NAME -c \
    "DELETE FROM transactions WHERE account_id = '$account_id'"
  # ... 12 more tables
  DELETED_COUNT=$((DELETED_COUNT + 1))
done
echo "Cleanup complete. Removed data for $DELETED_COUNT accounts."
```
I'd tested this script in staging. It worked perfectly. Removed 47 test accounts and their associated data. No issues.
What I didn't realize: **I was running a different version of the script in production**.
The version I ran had a subtle but catastrophic bug. During testing, I'd temporarily modified the WHERE clause to be more aggressive. Instead of:
```sql
WHERE email LIKE '%@test.internal'
```
The production version had:
```sql
WHERE email LIKE '%@test%' OR email NOT LIKE '%@test%'
```
This was a leftover from debugging that I'd forgotten to revert. This WHERE clause literally matches every single record. Every single one.
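In hindsight, even a count-first sanity check would have made the difference impossible to miss. A hypothetical sketch, using the numbers from this story:
```sql
-- Hypothetical pre-flight check: count before you delete.
-- The intended filter matches only test accounts:
SELECT count(*) FROM accounts
WHERE email LIKE '%@test.internal';
-- a few dozen rows, at most.
-- The leftover debug filter matches everything:
SELECT count(*) FROM accounts
WHERE email LIKE '%@test%' OR email NOT LIKE '%@test%';
-- 87,423 rows. One glance at that number and the script never runs.
```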
I hit Enter. The script ran. Thirty seconds later, it completed successfully.
"Done," I said to my colleague James, who was sitting across from me. "Test data cleaned up."
Then my Slack started exploding.
## 2:35 PM: "Is Production Down?"
The first message came from our head of customer success:
> **Priority 1**: Multiple customers reporting they can't see any inventory data. All pages showing empty states.
Then from our monitoring system:
> **ALERT**: Database query count dropped from 2,400/minute to 3/minute
> **ALERT**: API error rate: 87%
> **ALERT**: Active user sessions dropped from 14,200 to 12
I felt my stomach drop. I opened our database monitoring dashboard.
**Records in main tables:**
- `accounts`: 0 (previously: 87,423)
- `inventory`: 0 (previously: 12,403,821)
- `transactions`: 0 (previously: 43,821,092)
- `users`: 0 (previously: 806,294)
Everything. Gone. Every table that my script touched. Which, due to cascading foreign key deletions I hadn't thought about, turned out to be EVERY TABLE IN THE DATABASE.
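For anyone who hasn't been bitten by this before, here's a minimal sketch of how that fan-out works. The table definitions are hypothetical, but the mechanism is exactly what `ON DELETE CASCADE` does:
```sql
-- Hypothetical schema sketch: a child table declared with ON DELETE CASCADE.
CREATE TABLE accounts (
    id    bigint PRIMARY KEY,
    email text NOT NULL
);
CREATE TABLE inventory (
    id         bigint PRIMARY KEY,
    account_id bigint NOT NULL
        REFERENCES accounts (id) ON DELETE CASCADE,
    sku        text
);
-- Deleting parent rows silently removes every dependent row, with no
-- separate DELETE statement and no warning in the script's output.
DELETE FROM accounts WHERE email LIKE '%@test%' OR email NOT LIKE '%@test%';
```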
I immediately ran to get our CTO. The conversation went like this:
**Me**: "I think I just deleted the production database."
**CTO**: "What? All of it?"
**Me**: "All of it."
**CTO**: "Okay. Deep breath. How long ago were our last backups?"
**Me**: "They run every hour. So maximum one hour of data loss."
**CTO**: "Okay. Not ideal, but manageable. Let's restore."
That's when we discovered the second problem.
## 2:42 PM: The Backup That Wasn't
Our backup strategy was supposed to be rock-solid:
- Automated snapshots every hour
- Point-in-time recovery enabled
- Weekly full backups stored in S3
- Replicated across three regions
- Tested quarterly (allegedly)
I initiated the restoration process from our most recent snapshot (taken at 2:00 PM, just 35 minutes of data loss). The AWS Console showed the restore would take approximately 45 minutes.
"We're going to be okay," I thought. "Forty-five minutes of downtime, some angry customers, but recoverable."
The restore completed at 3:31 PM. I held my breath as I checked the row counts:
- `accounts`: 0
- `inventory`: 0
- `transactions`: 0
- `users`: 0
Empty. The backup was empty.
I tried the snapshot from 1:00 PM. Empty.
12:00 PM. Empty.
11:00 AM. Empty.
I felt the color drain from my face. Our CTO, watching over my shoulder, said quietly: "How far back do the backups go?"
I started checking snapshots from earlier that week. Monday's 2 PM snapshot: Empty. Sunday: Empty. Last Friday: Empty.
**Every single automated snapshot for the past 14 days was backing up an empty database**.
## 3:45 PM: The Archaeological Dig Begins
We assembled our crisis team—me, CTO, our database architect Tom, two senior backend engineers, and our CEO on speakerphone.
"How is this possible?" the CEO asked. "Don't we test our backups?"
Tom pulled up our backup testing reports. We had indeed tested restores quarterly. The last test was... January 15th. Eight weeks ago. And the test had passed.
"What exactly did the test check?" our CTO asked.
Tom opened the test script:
```bash
#!/bin/bash
# Quarterly backup validation test
# Restore most recent snapshot to test instance
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier test-restore \
  --db-snapshot-identifier latest-prod-snapshot
# Wait for restore to complete
sleep 3600
# Check if database is accessible
psql -h test-restore.us-east-1.rds.amazonaws.com -c "SELECT 1"
# If query succeeds, test passes
if [ $? -eq 0 ]; then
  echo "✓ Backup test passed"
  cleanup_test_instance
  exit 0
fi
```
Do you see the problem? **The test verified that we could restore a database and that it was accessible. It never checked if the database actually contained any data**.
Our backup test was essentially checking "does Postgres start?" The answer was yes. The fact that it was empty was considered irrelevant.
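The missing piece was a data-presence check. Something as small as this would have failed loudly (a sketch; the real validation we eventually built is shown further down):
```sql
-- Approximate live-row counts per table, straight from Postgres statistics.
-- On our "restored" snapshots this would have shown zeros across the board.
SELECT relname AS table_name, n_live_tup AS approx_rows
FROM pg_stat_user_tables
ORDER BY n_live_tup ASC;
-- Or an exact spot check on a table that must never be empty:
SELECT count(*) FROM accounts;
```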
## 4:15 PM: The Horrifying Discovery
Tom started investigating why our snapshots were empty. He pulled the logs from our backup jobs. They showed successful completion, every hour, for months.
Then he checked what those backup jobs were actually backing up.
Here's what our RDS instance configuration looked like:
```json
{
"DBInstanceIdentifier": "prod-primary",
"DBName": "inventory_db",
"MasterUsername": "admin",
"Engine": "postgres",
"EngineVersion": "13.7",
...
}
```
And here's what our automated snapshot configuration was pointing to:
```json
{
"DBInstanceIdentifier": "prod-primary-DEPRECATED",
"SourceDBInstanceIdentifier": "prod-primary-DEPRECATED",
...
}
```
Four months earlier, we'd migrated from an older RDS instance (`prod-primary-DEPRECATED`) to a new one (`prod-primary`) for a Postgres version upgrade. The migration was successful. We'd updated all application servers to point to the new instance.
But **no one had updated the backup job configuration**. For four months, we'd been dutifully creating snapshots of an empty, deprecated database instance that nothing was writing to anymore.
And because our backup test didn't actually verify data presence, it happily confirmed that yes, restoring an empty database produces an empty database that Postgres can start. Test passed!
## 4:30 PM: Exploring Every Option
We started grasping at straws:
**Option 1: Point-in-Time Recovery**
"We have PITR enabled!" Tom said hopefully.
We tried. PITR was indeed enabled... on the DEPRECATED instance. Not on the active one. No help.
**Option 2: Read Replicas**
"Don't we have read replicas that might still have the data?"
We did. And they'd faithfully replicated the DELETE statements. Also empty.
**Option 3: DB Transaction Logs**
"Can we replay the transaction logs backwards?"
Not quite. WAL (Write-Ahead Logging) can't be replayed in reverse; the realistic path is to restore a base backup and roll the WAL forward to a point just before the bad transaction. Our WAL retention was set to 24 hours, which would have covered the window, but with no valid base backup to roll forward from, it was a dead end.
**Option 4: Ask AWS for Help**
We opened a critical support ticket. Their response (paraphrased): "Snapshots are your responsibility. We can't recover data that wasn't in the snapshots."
**Option 5: Check Old Snapshots on DEPRECATED Instance**
Wait. The DEPRECATED instance might be empty now, but it was live four months ago. What if those old snapshots still exist?
Tom checked. They did. Weekly full snapshots went back six months, and the ones taken before the migration still contained real data.
The most recent pre-migration snapshot was from November 18th, 2023—almost four months old. But it was data. Real data. The only copy of real data we had access to.
## 5:00 PM: The Terrible Math
Our CTO laid out the options:
**Option A: Restore from 4-month-old backup**
- Pros: Gets us back online with real data
- Cons: Lose 4 months of transactions, inventory updates, new customers
**Option B: Declare bankruptcy and shut down**
- Pros: Honest about situation
- Cons: Company dies, all employees lose jobs
**Option C: Reconstruct data from other sources**
- Pros: Could potentially recover more recent data
- Cons: Could take days/weeks, no guarantee of success
We started running numbers. What do we actually have from the past four months outside the database?
- **Application logs**: We log most API requests. We could potentially replay some transactions.
- **Third-party integrations**: We sync data to analytics platforms, customer support tools, accounting software.
- **Email archives**: We send transaction confirmations. We could parse those.
- **Customer caches**: Some data might still be cached in Redis or client-side.
"How long would reconstruction take?" the CEO asked.
"Best case? 48 hours for bare minimum functionality," Tom estimated. "Full reconstruction? Maybe two weeks."
"And what's our burn rate with zero revenue?"
Our CFO (who'd joined the call): "We have runway for about four days of zero revenue before we start hitting serious problems. Maybe a week before we're making layoffs."
The decision was made: **Restore the 4-month-old backup immediately, then reconstruct as much recent data as possible in parallel**.
## 5:30 PM: The Recovery Begins
### Phase 1: Restore The Ancient Backup (5:30 PM - 7:15 PM)
We initiated the restore from the November 18th snapshot. While DNS propagated and connections switched over, I wrote the most difficult email of my career:
> **Subject: Critical Service Interruption - Data Restoration in Progress**
>
> Dear Valued Customers,
>
> Due to a technical error, our production database was compromised this afternoon. We are currently restoring from backups. **Data from the past four months may be lost**. We are working to recover as much as possible.
>
> We will provide updates every hour. We understand the gravity of this situation and take full responsibility.
The support inbox immediately flooded with over 400 responses. Most were understanding. Many were angry. A few were threatening legal action. All were completely justified.
### Phase 2: Data Archaeology (7:15 PM - 4:00 AM)
Once the old backup was restored and we were back online (albeit with 4-month-old data), we assembled a data recovery team. Nine engineers worked through the night:
**Team 1 (Application Logs):**
We had 4 months of API logs in our logging infrastructure. We wrote scripts to parse these logs and replay transactions:
```python
# Simplified version of our log replay script
import json
import psycopg2
log_files = get_log_files(start_date="2023-11-19", end_date="2024-03-14")
for log_file in log_files:
    for line in log_file:
        entry = json.loads(line)
        if entry['endpoint'] == '/api/inventory/add':
            # Replay inventory additions
            execute_sql("INSERT INTO inventory (...) VALUES (...)")
        elif entry['endpoint'] == '/api/inventory/update':
            # Replay inventory updates
            execute_sql("UPDATE inventory SET ... WHERE ...")
```
We recovered approximately 65% of inventory changes this way.
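One detail worth calling out: replaying months of logs is only safe if processing the same log line twice can't double-apply a change. The snippet above doesn't show how we handled that, but an upsert keyed on an event ID is the general shape (the column and placeholder names here are illustrative, not our real schema):
```sql
-- Hypothetical idempotent replay: each log line carries an event ID, and a unique
-- constraint on it means re-processing the same line is a no-op.
ALTER TABLE inventory ADD COLUMN source_event_id text UNIQUE;
INSERT INTO inventory (account_id, sku, quantity, source_event_id)
VALUES (:account_id, :sku, :quantity, :event_id)  -- :values are placeholders
ON CONFLICT (source_event_id) DO NOTHING;
```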
**Team 2 (Third-Party Integrations):**
We used our Segment data warehouse, Mixpanel analytics, and Stripe transaction records to reconstruct user activity and financial transactions. Recovered about 80% of the billing data.
**Team 3 (Email Archives):**
Every transaction email contained order details. We built a parser to extract structured data from HTML emails. Recovered critical transaction history for high-value customers.
**Team 4 (Customer Communication):**
Reached out to our top 50 customers (who represented 70% of revenue) individually, asking them to export their current inventory data from any local spreadsheets or caches they might have. Seventeen responded with data exports. Every bit helped.
### Phase 3: Data Validation (4:00 AM - 8:00 AM)
By 4 AM, we'd reconstructed a Frankenstein database stitched together from multiple sources. But was it coherent? We ran validation scripts:
- Check for referential integrity
- Verify transaction sums match known totals
- Confirm user account states
- Cross-reference with financial records
We found thousands of inconsistencies. We manually reviewed and resolved the critical ones. For low-priority discrepancies, we flagged them for later review.
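The referential-integrity pass boiled down to orphan checks like the ones below (a simplified sketch; `amount` stands in for whatever the real transactions table stores):
```sql
-- Orphaned inventory: rows whose account didn't make it into the stitched-together database.
SELECT i.id, i.account_id
FROM inventory i
LEFT JOIN accounts a ON a.id = i.account_id
WHERE a.id IS NULL;
-- Reconstructed per-account totals, to cross-check against Stripe's records.
SELECT t.account_id, sum(t.amount) AS reconstructed_total
FROM transactions t
GROUP BY t.account_id;
```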
### Phase 4: Switchover (8:00 AM Friday)
At 8:00 AM Friday morning—18 hours after the initial disaster—we switched production to the reconstructed database.
Recovery rate:
- User accounts: ~99% (easy to recover from multiple sources)
- Transaction history: ~85% (financial data was in Stripe)
- Inventory data: ~72% (harder to reconstruct)
- Audit logs: ~40% (lowest priority, many gaps)
Not perfect. But alive.
## The Week After: How This Happened
The post-mortem revealed a cascade of failures:
### Failure #1: Configuration Drift
When we migrated database instances, we updated application configs but not infrastructure configs. No one owned the "update ALL the things" responsibility.
### Failure #2: Inadequate Backup Testing
Our backup tests verified technical success (can we restore?) but not functional success (does the restore contain data?). We'd confused mechanism with outcome.
### Failure #3: No DR Drills
We'd never practiced an actual disaster recovery scenario end-to-end. All our "tests" were sanitized, scripted, and checked boxes rather than simulating real chaos.
### Failure #4: Lack of Data Validation
Our monitoring alerted on query volumes and error rates, but not on data integrity. We had no alerts for "database suddenly empty."
### Failure #5: Unrestricted Production Access
I could run a deletion script directly against production with no approval process, no dry-run requirement, no manual confirmation step.
### Failure #6: No Soft Delete
Our data model used hard deletes (`DELETE FROM`). If we'd used soft deletes (update a `deleted_at` timestamp), this would have been trivially recoverable.
### Failure #7: Single Person Risk
Only I knew about the test data problem. Only I worked on the cleanup script. No code review, no pair programming, no second set of eyes.
## What We Changed: The New Disaster Recovery Framework
We rebuilt our entire approach to backups and disaster recovery:
### 1. Backup Validation That Matters
New backup test:
```python
def validate_backup(snapshot_id):
    """Actually verify backup contains data and is restorable."""
    # Restore snapshot to test instance
    test_db = restore_snapshot(snapshot_id)
    # Check row counts for critical tables
    expected_counts = get_current_production_row_counts()
    actual_counts = get_row_counts(test_db)
    for table in expected_counts:
        variance = abs(expected_counts[table] - actual_counts[table]) / expected_counts[table]
        # Allow 5% variance for active tables
        if variance > 0.05:
            raise BackupValidationError(
                f"Table {table} count mismatch: "
                f"expected ~{expected_counts[table]}, got {actual_counts[table]}"
            )
    # Verify sample of actual data integrity
    verify_sample_data_integrity(test_db)
    # Check referential integrity
    verify_foreign_key_constraints(test_db)
    return True
```
This runs automatically after every backup snapshot. If it fails, we get paged immediately.
### 2. Multi-Layered Backup Strategy
We now maintain:
- **Hourly snapshots** (retained 48 hours)
- **Daily snapshots** (retained 30 days)
- **Weekly full backups** (retained 1 year)
- **Continuous WAL archiving** to S3 (retained 90 days)
- **Quarterly cold storage** backups (retained 7 years)
- **Cross-region replication** for all of the above
And we test restoration from EACH layer monthly.
### 3. Soft Deletes Everywhere
We converted all tables to soft delete:
```sql
-- Old approach
DELETE FROM inventory WHERE account_id = 123;
-- New approach
UPDATE inventory
SET deleted_at = NOW(), deleted_by = 'user@example.com'
WHERE account_id = 123 AND deleted_at IS NULL;
```
"Deleted" data is filtered out in application queries but remains in the database. We have a separate archive process that hard-deletes records after 90 days.
This means accidental deletes are trivially recoverable with:
```sql
UPDATE inventory SET deleted_at = NULL WHERE deleted_at > '2024-03-14 14:34:00';
```
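Two supporting pieces make the scheme livable in practice. Both are sketches rather than our exact definitions: a partial index so that "live rows only" queries don't pay for the soft-deleted ones, and the 90-day purge job mentioned above.
```sql
-- Hypothetical partial index: queries that filter on deleted_at IS NULL stay fast.
CREATE INDEX inventory_live_account_idx
    ON inventory (account_id)
    WHERE deleted_at IS NULL;
-- The archive job that finally hard-deletes, 90 days after the soft delete.
DELETE FROM inventory
WHERE deleted_at IS NOT NULL
  AND deleted_at < NOW() - INTERVAL '90 days';
```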
### 4. Production Safeguards
No more direct production database access. All changes go through:
**For application changes:**
- Changes made in staging first
- Automated tests confirm functionality
- Change request approved by 2 engineers
- Deployed with feature flags for instant rollback
**For data operations:**
- All modification scripts run in dry-run mode first
- Output manually reviewed
- Approval required from 2 engineers
- Executed with manual confirmation step
- Automatic rollback capability
Example of our new safeguards:
```bash
#!/bin/bash
# All production data scripts now include safeguards
set -euo pipefail
# Require dry-run first
if [[ "$DRY_RUN:-true}" != "false" ]]; then
echo "🔍 DRY RUN MODE - No changes will be made"
echo "Set DRY_RUN=false to actually execute"
psql -h $DB_HOST -c "BEGIN; $SQL; SELECT COUNT(*) AS would_affect; ROLLBACK;"
exit 0
fi
# Require manual confirmation
read -p "⚠️ This will modify PRODUCTION data. Type 'yes' to confirm: " confirm
if [[ "$confirm" != "yes" ]]; then
echo "Cancelled."
exit 1
fi
# Require approval token from second engineer
read -p "🔐 Enter approval token from second engineer: " token
validate_approval_token "$token" || exit 1
# Create before snapshot
SNAPSHOT_ID=$(create_snapshot)
echo "📸 Created before-snapshot: $SNAPSHOT_ID"
# Execute with rollback prepared
psql -h "$DB_HOST" -v ON_ERROR_STOP=1 <<EOF
$SQL
EOF
echo "✅ Executed. Before-snapshot $SNAPSHOT_ID is available for rollback."
```
### 5. Data Integrity Monitoring
Our monitoring now watches the data itself, not just whether the service responds:
```python
# Runs every 5 minutes
current_counts = get_row_counts()
# Alert if any table loses >1% of its rows between checks
for table in current_counts:
    previous = get_previous_count(table)
    if current_counts[table] < previous * 0.99:
        alert(f"⚠️ Table {table} lost {previous - current_counts[table]} rows!")
# Verify referential integrity
broken_fks = check_foreign_keys()
if broken_fks:
    alert(f"⚠️ Found {len(broken_fks)} broken foreign key relationships!")
# Check for suspicious patterns
if get_delete_query_count_last_minute() > 1000:
    alert("⚠️ Unusually high number of DELETE queries!")
```
This would have caught the catastrophic deletion within 5 minutes.
### 6. Quarterly DR Drills
Every quarter, we run an unannounced disaster recovery drill:
**Scenario examples:**
- "Primary database is corrupted. Last snapshot is 3 hours old. Go."
- "AWS us-east-1 region is down. Failover to us-west-2. Go."
- "Ransomware encrypted our database. Restore from cold storage. Go."
We time ourselves, document failures, and continuously improve the process.
### 7. Backup Redundancy Across Vendors
We no longer trust a single cloud provider for backups:
- Primary database and snapshots: AWS RDS
- Secondary continuous backup: Google Cloud SQL (replicated in real-time)
- Cold storage: Backblaze B2
- Encrypted local copies: Physical hard drives in a safe
Over-engineered? Maybe. But I sleep better.
## The Cost: What This Actually Cost Us
**Direct costs:**
- Revenue lost during downtime: ~$145,000
- AWS costs for restoration testing: ~$8,000
- Customer credits/refunds for data loss: ~$73,000
- Legal consultation: ~$22,000
- **Total direct cost: ~$248,000**
**Indirect costs:**
- Customer churn (estimated): ~$420,000 annual recurring revenue
- Engineering time (opportunity cost): ~$120,000
- Reputation damage: Incalculable
- My mental health: Also incalculable
**Investments in preventing recurrence:**
- New backup infrastructure: ~$90,000
- DR testing framework: ~$35,000
- Data integrity monitoring: ~$15,000
- Training and process improvement: ~$40,000
- **Total prevention investment: ~$180,000**
**Grand total impact: ~$968,000+**
For context, we're a 40-person startup. This represented about 8% of our annual revenue. It hurt.
## Three Lessons That Changed How I Think
### 1. Backups Are Not Disaster Recovery
We had backups. We had lots of backups. They were useless.
**Backups are the mechanism. Disaster recovery is the capability.**
The question isn't "do we have backups?" It's:
- Can we restore them?
- How long does it take?
- How much data do we lose?
- Have we practiced?
- Do they actually contain what we think they contain?
### 2. Testing Is Not Validation
Our backup tests passed. They just weren't testing the right thing.
**Tests prove the system works as designed. Validation proves the design is correct.**
We'd tested that our backup process executed successfully. We hadn't validated that the backed-up data was usable for disaster recovery.
### 3. Every Layer Will Fail
We had snapshots (failed). We had PITR (misconfigured). We had replicas (they replicated the problem). We had monitoring (didn't alert on data loss).
**Design for every layer failing**. If your disaster recovery plan assumes any particular safeguard will work, it's not a plan—it's hope.
Real redundancy means truly independent systems. Our backups all depended on the same RDS configuration. When that was wrong, everything failed together.
## Advice For Anyone Managing Databases
### Test Your Disaster Recovery. Really Test It.
Not "can I restore a backup." Actually:
1. Delete your staging database
2. Restore it from backups
3. Verify it contains the right data
4. Time how long it takes
5. Document what broke
Do this quarterly minimum. Unannounced drills are even better.
### Check Your Backups. Right Now.
Seriously, stop reading. Go restore your most recent production backup to a test instance. Check that it has data. I'll wait.
...
Did you do it? No? **Go do it now.** I'm not kidding. This article can wait.
### Implement Soft Deletes
Hard deletes are almost never worth the risk. Soft deletes give you:
- Easy recovery from accidents
- Audit trails
- Ability to "un-delete" user actions
- Historical data analysis
The storage cost is trivial compared to the risk reduction.
### Monitor Data Integrity, Not Just Uptime
Your monitoring probably tracks:
- Database up/down
- Query performance
- Disk space
- Connection counts
Does it track:
- Actual row counts over time?
- Rate of data changes?
- Referential integrity?
- Suspicious deletion patterns?
Add these. They're cheaper than data recovery.
### Require Two People For Production Changes
My days of running solo cowboy operations against production ended three years ago. Now:
- All production changes reviewed by 2+ people
- All data operations have a dry-run step
- All high-risk operations require approval tokens
- All changes are logged with full audit trail
Is it slower? Yes. Has it prevented disasters? Also yes.
## What I'd Tell Myself Three Years Ago
If I could go back to March 13th, 2024, and give myself one piece of advice, it would be this:
**Your disaster recovery plan is useless until you've actually used it to recover from a disaster.**
Not a drill. Not a test. An actual "oh shit" moment where you need your backups to work or people lose their jobs.
And if you haven't had one of those moments, you should simulate one. Regularly. With realistic chaos.
Because I guarantee: your backup strategy has a fatal flaw. Everyone's does. The question is whether you discover it during a drill or during an actual emergency.
## Three Years Later
We survived. Barely.
We lost about 30% of our customers in the following quarter. Our revenue dropped by 40%. We had to do a layoff. Our next fundraising round was significantly harder.
But we didn't shut down. We rebuilt. We recovered. And most importantly: **we learned**.
Today, our disaster recovery capability is genuinely world-class. We've become almost paranoid about backups and data integrity. We test constantly. We drill quarterly. We've had several close calls since then, and our safeguards caught every single one before they became disasters.
I'm still at the company, now VP of Infrastructure. The CTO, who stuck by me through the whole ordeal, is still here too. We still occasionally joke about "the incident," though it's never actually funny.
But here's what really matters: **we published our entire post-mortem publicly**. Every detail, every failure, every lesson. And you know what happened?
We got emails from eleven other companies who'd had similar disasters but never talked about them. Three companies reached out to say our post-mortem helped them identify and fix the same backup configuration issue before it burned them.
Failure is only truly wasteful if you don't learn from it and share those lessons.
## Conclusion: Respect The Database
Databases are easy to take for granted. They just work, day after day, year after year. Until they don't.
The hard truth: **your data is probably less safe than you think it is**. Your backups probably have subtle configuration issues. Your disaster recovery plan probably has untested assumptions. Your safeguards probably have gaps.
But you won't know until you actually try to use them.
So here's my challenge to you: **Before you finish reading this article, schedule a disaster recovery drill**. Put it on the calendar. Assign people to it. Make it happen.
Because I promise you: discovering your backup strategy is broken during a drill is infinitely better than discovering it at 2:37 PM on a random Thursday when you've just deleted your production database.
Trust me on this one.
Stay vigilant. Test everything. And for the love of all that is holy, **implement soft deletes**.
---
*Sarah Martinez is VP of Infrastructure at a company that learned these lessons the hard way. She speaks at conferences about disaster recovery and has successfully recovered from 0 production database deletions in the 3 years since the incident described above (knock on wood). She can be reached at sarah@probably-not-my-real-email.com for consulting on making sure your backups actually work.*