
"The script's running. Should be done in about thirty seconds."
These were the last words spoken before we deleted 6 years of customer data, brought down a platform serving 800,000 users, and discovered that our entire disaster recovery strategy was an elaborate placebo.
I'm Sarah Martinez, DevOps Lead at a B2B SaaS company that provides inventory management software. On Thursday, March 14th, 2024, at 2:34 PM, I executed what should have been a routine database cleanup script. By 2:35 PM, our production database was empty. By 2:37 PM, I was having a panic attack in the bathroom. By midnight, I'd learned more about backup systems, disaster recovery, and my own capacity for stress than I ever wanted to know.
This is the story of our very bad day, and the 18-hour marathon that followed. More importantly, it's about the seven silent failures that led us there, and the disaster recovery framework we built from the ashes.
## 2:34 PM: The Command That Changed Everything
Let me set the scene. Our Postgres database had accumulated test data from our staging environment that had somehow gotten written to production (that's a different disaster we'd been meaning to fix). I'd written a script to identify and delete records that matched specific test account patterns.
The script looked like this:
```bash
#!/bin/bash
# cleanup_test_data.sh
# Removes test accounts and associated data from production
DB_NAME="inventory_db"
DB_USER="admin"
DB_HOST="prod-primary.us-east-1.rds.amazonaws.com"
# Find test account IDs
TEST_ACCOUNTS=$(psql -h $DB_HOST -U $DB_USER -d $DB_NAME -t -c \
  "SELECT id FROM accounts WHERE email LIKE '%@test.internal'")
# Delete associated data
DELETED_COUNT=0
for account_id in $TEST_ACCOUNTS; do
  psql -h $DB_HOST -U $DB_USER -d $DB_NAME -c \
    "DELETE FROM inventory WHERE account_id = '$account_id'"
  psql -h $DB_HOST -U $DB_USER -d $DB_NAME -c \
    "DELETE FROM transactions WHERE account_id = '$account_id'"
  # ... 12 more tables
  DELETED_COUNT=$((DELETED_COUNT + 1))
done
echo "Cleanup complete. Removed data for $DELETED_COUNT accounts."
```
I'd tested this script in staging. It worked perfectly. Removed 47 test accounts and their associated data. No issues.
What I didn't realize: **I was running a different version of the script in production**.
The version I ran had a subtle but catastrophic bug. During testing, I'd temporarily modified the WHERE clause to be more aggressive. Instead of:
```sql
WHERE email LIKE '%@test.internal'
```
The production version had:
```sql
WHERE email LIKE '%@test%' OR email NOT LIKE '%@test%'
```
This was a leftover from debugging that I'd forgotten to revert. This WHERE clause literally matches every single record. Every single one.
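In hindsight, even a count-first sanity check would have made the difference impossible to miss. A hypothetical sketch, using the numbers from this story:
```sql
-- Hypothetical pre-flight check: count before you delete.
-- The intended filter matches only test accounts:
SELECT count(*) FROM accounts
WHERE email LIKE '%@test.internal';
-- a few dozen rows, at most.
-- The leftover debug filter matches everything:
SELECT count(*) FROM accounts
WHERE email LIKE '%@test%' OR email NOT LIKE '%@test%';
-- 87,423 rows. One glance at that number and the script never runs.
```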
I hit Enter. The script ran. Thirty seconds later, it completed successfully.
"Done," I said to my colleague James, who was sitting across from me. "Test data cleaned up."
Then my Slack started exploding.
## 2:35 PM: "Is Production Down?"
The first message came from our head of customer success:
> **Priority 1**: Multiple customers reporting they can't see any inventory data. All pages showing empty states.
Then from our monitoring system:
> **ALERT**: Database query count dropped from 2,400/minute to 3/minute
> **ALERT**: API error rate: 87%
> **ALERT**: Active user sessions dropped from 14,200 to 12
I felt my stomach drop. I opened our database monitoring dashboard.
**Records in main tables:**
- `accounts`: 0 (previously: 87,423)
- `inventory`: 0 (previously: 12,403,821)
- `transactions`: 0 (previously: 43,821,092)
- `users`: 0 (previously: 806,294)
Everything. Gone. Every table that my script touched. Which, due to cascading foreign key deletions I hadn't thought about, turned out to be EVERY TABLE IN THE DATABASE.
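For anyone who hasn't been bitten by this before, here's a minimal sketch of how that fan-out works. The table definitions are hypothetical, but the mechanism is exactly what `ON DELETE CASCADE` does:
```sql
-- Hypothetical schema sketch: a child table declared with ON DELETE CASCADE.
CREATE TABLE accounts (
    id    bigint PRIMARY KEY,
    email text NOT NULL
);
CREATE TABLE inventory (
    id         bigint PRIMARY KEY,
    account_id bigint NOT NULL
        REFERENCES accounts (id) ON DELETE CASCADE,
    sku        text
);
-- Deleting parent rows silently removes every dependent row, with no
-- separate DELETE statement and no warning in the script's output.
DELETE FROM accounts WHERE email LIKE '%@test%' OR email NOT LIKE '%@test%';
```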
I immediately ran to get our CTO. The conversation went like this:
**Me**: "I think I just deleted the production database."
**CTO**: "What? All of it?"
**Me**: "All of it."
**CTO**: "Okay. Deep breath. How long ago were our last backups?"
**Me**: "They run every hour. So maximum one hour of data loss."
**CTO**: "Okay. Not ideal, but manageable. Let's restore."
That's when we discovered the second problem.
## 2:42 PM: The Backup That Wasn't
Our backup strategy was supposed to be rock-solid:
- Automated snapshots every hour
- Point-in-time recovery enabled
- Weekly full backups stored in S3
- Replicated across three regions
- Tested quarterly (allegedly)
I initiated the restoration process from our most recent snapshot (taken at 2:00 PM, just 35 minutes of data loss). The AWS Console showed the restore would take approximately 45 minutes.
"We're going to be okay," I thought. "Forty-five minutes of downtime, some angry customers, but recoverable."
The restore completed at 3:31 PM. I held my breath as I checked the row counts:
- `accounts`: 0
- `inventory`: 0
- `transactions`: 0
- `users`: 0
Empty. The backup was empty.
I tried the snapshot from 1:00 PM. Empty.
12:00 PM. Empty.
11:00 AM. Empty.
I felt the color drain from my face. Our CTO, watching over my shoulder, said quietly: "How far back do the backups go?"
I started checking snapshots from earlier that week. Monday's 2 PM snapshot: Empty. Sunday: Empty. Last Friday: Empty.
**Every single automated snapshot for the past 14 days was backing up an empty database**.
## 3:45 PM: The Archaeological Dig Begins
We assembled our crisis team—me, CTO, our database architect Tom, two senior backend engineers, and our CEO on speakerphone.
"How is this possible?" the CEO asked. "Don't we test our backups?"
Tom pulled up our backup testing reports. We had indeed tested restores quarterly. The last test was... January 15th. Eight weeks ago. And the test had passed.
"What exactly did the test check?" our CTO asked.
Tom opened the test script:
```bash
#!/bin/bash
# Quarterly backup validation test
# Restore most recent snapshot to test instance
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier test-restore \
  --db-snapshot-identifier latest-prod-snapshot
# Wait for restore to complete
sleep 3600
# Check if database is accessible
psql -h test-restore.us-east-1.rds.amazonaws.com -c "SELECT 1"
# If query succeeds, test passes
if [ $? -eq 0 ]; then
  echo "✓ Backup test passed"
  cleanup_test_instance
  exit 0
fi
```
Do you see the problem? **The test verified that we could restore a database and that it was accessible. It never checked if the database actually contained any data**.
Our backup test was essentially checking "does Postgres start?" The answer was yes. The fact that it was empty was considered irrelevant.
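The missing piece was a data-presence check. Something as small as this would have failed loudly (a sketch; the real validation we eventually built is shown further down):
```sql
-- Approximate live-row counts per table, straight from Postgres statistics.
-- On our "restored" snapshots this would have shown zeros across the board.
SELECT relname AS table_name, n_live_tup AS approx_rows
FROM pg_stat_user_tables
ORDER BY n_live_tup ASC;
-- Or an exact spot check on a table that must never be empty:
SELECT count(*) FROM accounts;
```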
## 4:15 PM: The Horrifying Discovery
Tom started investigating why our snapshots were empty. He pulled the logs from our backup jobs. They showed successful completion, every hour, for months.
Then he checked what those backup jobs were actually backing up.
Here's what our RDS instance configuration looked like:
```json
{
"DBInstanceIdentifier": "prod-primary",
"DBName": "inventory_db",
"MasterUsername": "admin",
"Engine": "postgres",
"EngineVersion": "13.7",
...
}
```
And here's what our automated snapshot configuration was pointing to:
```json
{
"DBInstanceIdentifier": "prod-primary-DEPRECATED",
"SourceDBInstanceIdentifier": "prod-primary-DEPRECATED",
...
}
```
Four months earlier, we'd migrated from an older RDS instance (`prod-primary-DEPRECATED`) to a new one (`prod-primary`) for a Postgres version upgrade. The migration was successful. We'd updated all application servers to point to the new instance.
But **no one had updated the backup job configuration**. For four months, we'd been dutifully creating snapshots of an empty, deprecated database instance that nothing was writing to anymore.
And because our backup test didn't actually verify data presence, it happily confirmed that yes, restoring an empty database produces an empty database that Postgres can start. Test passed!
## 4:30 PM: Exploring Every Option
We started grasping at straws:
**Option 1: Point-in-Time Recovery**
"We have PITR enabled!" Tom said hopefully.
We tried. PITR was indeed enabled... on the DEPRECATED instance. Not on the active one. No help.
**Option 2: Read Replicas**
"Don't we have read replicas that might still have the data?"
We did. And they'd faithfully replicated the DELETE statements. Also empty.
**Option 3: DB Transaction Logs**
"Can we replay the transaction logs backwards?"
Not quite. WAL (Write-Ahead Logging) can't be replayed in reverse; the realistic path is to restore a base backup and roll the WAL forward to a point just before the bad transaction. Our WAL retention was set to 24 hours, which would have covered the window, but with no valid base backup to roll forward from, it was a dead end.
**Option 4: Ask AWS for Help**
We opened a critical support ticket. Their response (paraphrased): "Snapshots are your responsibility. We can't recover data that wasn't in the snapshots."
**Option 5: Check Old Snapshots on DEPRECATED Instance**
Wait. The DEPRECATED instance might be empty now, but it was live four months ago. What if those old snapshots still exist?
Tom checked. They did. Weekly full snapshots went back six months, and the ones taken before the migration still contained real data.
The most recent pre-migration snapshot was from November 18th, 2023—almost four months old. But it was data. Real data. The only copy of real data we had access to.
## 5:00 PM: The Terrible Math
Our CTO laid out the options:
**Option A: Restore from 4-month-old backup**
- Pros: Gets us back online with real data
- Cons: Lose 4 months of transactions, inventory updates, new customers
**Option B: Declare bankruptcy and shut down**
- Pros: Honest about situation
- Cons: Company dies, all employees lose jobs
**Option C: Reconstruct data from other sources**
- Pros: Could potentially recover more recent data
- Cons: Could take days/weeks, no guarantee of success
We started running numbers. What do we actually have from the past four months outside the database?
- **Application logs**: We log most API requests. We could potentially replay some transactions.
- **Third-party integrations**: We sync data to analytics platforms, customer support tools, accounting software.
- **Email archives**: We send transaction confirmations. We could parse those.
- **Customer caches**: Some data might still be cached in Redis or client-side.
"How long would reconstruction take?" the CEO asked.
"Best case? 48 hours for bare minimum functionality," Tom estimated. "Full reconstruction? Maybe two weeks."
"And what's our burn rate with zero revenue?"
Our CFO (who'd joined the call): "We have runway for about four days of zero revenue before we start hitting serious problems. Maybe a week before we're making layoffs."
The decision was made: **Restore the 4-month-old backup immediately, then reconstruct as much recent data as possible in parallel**.
## 5:30 PM: The Recovery Begins
### Phase 1: Restore The Ancient Backup (5:30 PM - 7:15 PM)
We initiated the restore from the November 18th snapshot. While DNS propagated and connections switched over, I wrote the most difficult email of my career:
> **Subject: Critical Service Interruption - Data Restoration in Progress**
>
> Dear Valued Customers,
>
> Due to a technical error, our production database was compromised this afternoon. We are currently restoring from backups. **Data from the past four months may be lost**. We are working to recover as much as possible.
>
> We will provide updates every hour. We understand the gravity of this situation and take full responsibility.
The support inbox immediately flooded with over 400 responses. Most were understanding. Many were angry. A few were threatening legal action. All were completely justified.
### Phase 2: Data Archaeology (7:15 PM - 4:00 AM)
Once the old backup was restored and we were back online (albeit with 4-month-old data), we assembled a data recovery team. Nine engineers worked through the night:
**Team 1 (Application Logs):**
We had 4 months of API logs in our logging infrastructure. We wrote scripts to parse these logs and replay transactions:
```python
# Simplified version of our log replay script
import json
import psycopg2
log_files = get_log_files(start_date="2023-11-19", end_date="2024-03-14")
for log_file in log_files:
    for line in log_file:
        entry = json.loads(line)
        if entry['endpoint'] == '/api/inventory/add':
            # Replay inventory additions
            execute_sql("INSERT INTO inventory (...) VALUES (...)")
        elif entry['endpoint'] == '/api/inventory/update':
            # Replay inventory updates
            execute_sql("UPDATE inventory SET ... WHERE ...")
```
We recovered approximately 65% of inventory changes this way.
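One detail worth calling out: replaying months of logs is only safe if processing the same log line twice can't double-apply a change. The snippet above doesn't show how we handled that, but an upsert keyed on an event ID is the general shape (the column and placeholder names here are illustrative, not our real schema):
```sql
-- Hypothetical idempotent replay: each log line carries an event ID, and a unique
-- constraint on it means re-processing the same line is a no-op.
ALTER TABLE inventory ADD COLUMN source_event_id text UNIQUE;
INSERT INTO inventory (account_id, sku, quantity, source_event_id)
VALUES (:account_id, :sku, :quantity, :event_id)  -- :values are placeholders
ON CONFLICT (source_event_id) DO NOTHING;
```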
**Team 2 (Third-Party Integrations):**
We used our Segment data warehouse, Mixpanel analytics, and Stripe transaction records to reconstruct user activity and financial transactions. Recovered about 80% of the billing data.
**Team 3 (Email Archives):**
Every transaction email contained order details. We built a parser to extract structured data from HTML emails. Recovered critical transaction history for high-value customers.
**Team 4 (Customer Communication):**
Reached out to our top 50 customers (who represented 70% of revenue) individually, asking them to export their current inventory data from any local spreadsheets or caches they might have. Seventeen responded with data exports. Every bit helped.
### Phase 3: Data Validation (4:00 AM - 8:00 AM)
By 4 AM, we'd reconstructed a Frankenstein database stitched together from multiple sources. But was it coherent? We ran validation scripts:
- Check for referential integrity
- Verify transaction sums match known totals
- Confirm user account states
- Cross-reference with financial records
We found thousands of inconsistencies. We manually reviewed and resolved the critical ones. For low-priority discrepancies, we flagged them for later review.
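The referential-integrity pass boiled down to orphan checks like the ones below (a simplified sketch; `amount` stands in for whatever the real transactions table stores):
```sql
-- Orphaned inventory: rows whose account didn't make it into the stitched-together database.
SELECT i.id, i.account_id
FROM inventory i
LEFT JOIN accounts a ON a.id = i.account_id
WHERE a.id IS NULL;
-- Reconstructed per-account totals, to cross-check against Stripe's records.
SELECT t.account_id, sum(t.amount) AS reconstructed_total
FROM transactions t
GROUP BY t.account_id;
```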
### Phase 4: Switchover (8:00 AM Friday)
At 8:00 AM Friday morning—18 hours after the initial disaster—we switched production to the reconstructed database.
Recovery rate:
- User accounts: ~99% (easy to recover from multiple sources)
- Transaction history: ~85% (financial data was in Stripe)
- Inventory data: ~72% (harder to reconstruct)
- Audit logs: ~40% (lowest priority, many gaps)
Not perfect. But alive.
## The Week After: How This Happened
The post-mortem revealed a cascade of failures:
### Failure #1: Configuration Drift
When we migrated database instances, we updated application configs but not infrastructure configs. No one owned the "update ALL the things" responsibility.
### Failure #2: Inadequate Backup Testing
Our backup tests verified technical success (can we restore?) but not functional success (does the restore contain data?). We'd confused mechanism with outcome.
### Failure #3: No DR Drills
We'd never practiced an actual disaster recovery scenario end-to-end. All our "tests" were sanitized, scripted, and checked boxes rather than simulating real chaos.
### Failure #4: Lack of Data Validation
Our monitoring alerted on query volumes and error rates, but not on data integrity. We had no alerts for "database suddenly empty."
### Failure #5: Unrestricted Production Access
I could run a deletion script directly against production with no approval process, no dry-run requirement, no manual confirmation step.
### Failure #6: No Soft Delete
Our data model used hard deletes (`DELETE FROM`). If we'd used soft deletes (update a `deleted_at` timestamp), this would have been trivially recoverable.
### Failure #7: Single Person Risk
Only I knew about the test data problem. Only I worked on the cleanup script. No code review, no pair programming, no second set of eyes.
## What We Changed: The New Disaster Recovery Framework
We rebuilt our entire approach to backups and disaster recovery:
### 1. Backup Validation That Matters
New backup test:
```python
def validate_backup(snapshot_id):
    """Actually verify backup contains data and is restorable."""
    # Restore snapshot to test instance
    test_db = restore_snapshot(snapshot_id)
    # Check row counts for critical tables
    expected_counts = get_current_production_row_counts()
    actual_counts = get_row_counts(test_db)
    for table in expected_counts:
        variance = abs(expected_counts[table] - actual_counts[table]) / expected_counts[table]
        # Allow 5% variance for active tables
        if variance > 0.05:
            raise BackupValidationError(
                f"Table {table} count mismatch: "
                f"expected ~{expected_counts[table]}, got {actual_counts[table]}"
            )
    # Verify sample of actual data integrity
    verify_sample_data_integrity(test_db)
    # Check referential integrity
    verify_foreign_key_constraints(test_db)
    return True
```
This runs automatically after every backup snapshot. If it fails, we get paged immediately.
### 2. Multi-Layered Backup Strategy
We now maintain:
- **Hourly snapshots** (retained 48 hours)
- **Daily snapshots** (retained 30 days)
- **Weekly full backups** (retained 1 year)
- **Continuous WAL archiving** to S3 (retained 90 days)
- **Quarterly cold storage** backups (retained 7 years)
- **Cross-region replication** for all of the above
And we test restoration from EACH layer monthly.
### 3. Soft Deletes Everywhere
We converted all tables to soft delete:
```sql
-- Old approach
DELETE FROM inventory WHERE account_id = 123;
-- New approach
UPDATE inventory
SET deleted_at = NOW(), deleted_by = 'user@example.com'
WHERE account_id = 123 AND deleted_at IS NULL;
```
"Deleted" data is filtered out in application queries but remains in the database. We have a separate archive process that hard-deletes records after 90 days.
This means accidental deletes are trivially recoverable with:
```sql
UPDATE inventory SET deleted_at = NULL WHERE deleted_at > '2024-03-14 14:34:00';
```
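Two supporting pieces make the scheme livable in practice. Both are sketches rather than our exact definitions: a partial index so that "live rows only" queries don't pay for the soft-deleted ones, and the 90-day purge job mentioned above.
```sql
-- Hypothetical partial index: queries that filter on deleted_at IS NULL stay fast.
CREATE INDEX inventory_live_account_idx
    ON inventory (account_id)
    WHERE deleted_at IS NULL;
-- The archive job that finally hard-deletes, 90 days after the soft delete.
DELETE FROM inventory
WHERE deleted_at IS NOT NULL
  AND deleted_at < NOW() - INTERVAL '90 days';
```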
### 4. Production Safeguards
No more direct production database access. All changes go through:
**For application changes:**
- Changes made in staging first
- Automated tests confirm functionality
- Change request approved by 2 engineers
- Deployed with feature flags for instant rollback
**For data operations:**
- All modification scripts run in dry-run mode first
- Output manually reviewed
- Approval required from 2 engineers
- Executed with manual confirmation step
- Automatic rollback capability
Example of our new safeguards:
```bash
#!/bin/bash
# All production data scripts now include safeguards
set -euo pipefail
# Require dry-run first
if [[ "$DRY_RUN:-true}" != "false" ]]; then
echo "🔍 DRY RUN MODE - No changes will be made"
echo "Set DRY_RUN=false to actually execute"
psql -h $DB_HOST -c "BEGIN; $SQL; SELECT COUNT(*) AS would_affect; ROLLBACK;"
exit 0
fi
# Require manual confirmation
read -p "⚠️ This will modify PRODUCTION data. Type 'yes' to confirm: " confirm
if [[ "$confirm" != "yes" ]]; then
echo "Cancelled."
exit 1
fi
# Require approval token from second engineer
read -p "🔐 Enter approval token from second engineer: " token
validate_approval_token "$token" || exit 1
# Create before snapshot
SNAPSHOT_ID=$(create_snapshot)
echo "📸 Created before-snapshot: $SNAPSHOT_ID"
# Execute with rollback prepared
psql -h "$DB_HOST" -v ON_ERROR_STOP=1 <<EOF
$SQL
EOF
echo "✅ Executed. Before-snapshot $SNAPSHOT_ID is available for rollback."
```
### 5. Data Integrity Monitoring
Our monitoring now watches the data itself, not just whether the service responds:
```python
# Runs every 5 minutes
current_counts = get_row_counts()
# Alert if any table loses >1% of its rows between checks
for table in current_counts:
    previous = get_previous_count(table)
    if current_counts[table] < previous * 0.99:
        alert(f"⚠️ Table {table} lost {previous - current_counts[table]} rows!")
# Verify referential integrity
broken_fks = check_foreign_keys()
if broken_fks:
    alert(f"⚠️ Found {len(broken_fks)} broken foreign key relationships!")
# Check for suspicious patterns
if get_delete_query_count_last_minute() > 1000:
    alert("⚠️ Unusually high number of DELETE queries!")
```
This would have caught the catastrophic deletion within 5 minutes.
### 6. Quarterly DR Drills
Every quarter, we run an unannounced disaster recovery drill:
**Scenario examples:**
- "Primary database is corrupted. Last snapshot is 3 hours old. Go."
- "AWS us-east-1 region is down. Failover to us-west-2. Go."
- "Ransomware encrypted our database. Restore from cold storage. Go."
We time ourselves, document failures, and continuously improve the process.
### 7. Backup Redundancy Across Vendors
We no longer trust a single cloud provider for backups:
- Primary database and snapshots: AWS RDS
- Secondary continuous backup: Google Cloud SQL (replicated in real-time)
- Cold storage: Backblaze B2
- Encrypted local copies: Physical hard drives in a safe
Over-engineered? Maybe. But I sleep better.
## The Cost: What This Actually Cost Us
**Direct costs:**
- Revenue lost during downtime: ~$145,000
- AWS costs for restoration testing: ~$8,000
- Customer credits/refunds for data loss: ~$73,000
- Legal consultation: ~$22,000
- **Total direct cost: ~$248,000**
**Indirect costs:**
- Customer churn (estimated): ~$420,000 annual recurring revenue
- Engineering time (opportunity cost): ~$120,000
- Reputation damage: Incalculable
- My mental health: Also incalculable
**Investments in preventing recurrence:**
- New backup infrastructure: ~$90,000
- DR testing framework: ~$35,000
- Data integrity monitoring: ~$15,000
- Training and process improvement: ~$40,000
- **Total prevention investment: ~$180,000**
**Grand total impact: ~$968,000+**
For context, we're a 40-person startup. This represented about 8% of our annual revenue. It hurt.
## Three Lessons That Changed How I Think
### 1. Backups Are Not Disaster Recovery
We had backups. We had lots of backups. They were useless.
**Backups are the mechanism. Disaster recovery is the capability.**
The question isn't "do we have backups?" It's:
- Can we restore them?
- How long does it take?
- How much data do we lose?
- Have we practiced?
- Do they actually contain what we think they contain?
### 2. Testing Is Not Validation
Our backup tests passed. They just weren't testing the right thing.
**Tests prove the system works as designed. Validation proves the design is correct.**
We'd tested that our backup process executed successfully. We hadn't validated that the backed-up data was usable for disaster recovery.
### 3. Every Layer Will Fail
We had snapshots (failed). We had PITR (misconfigured). We had replicas (they replicated the problem). We had monitoring (didn't alert on data loss).
**Design for every layer failing**. If your disaster recovery plan assumes any particular safeguard will work, it's not a plan—it's hope.
Real redundancy means truly independent systems. Our backups all depended on the same RDS configuration. When that was wrong, everything failed together.
## Advice For Anyone Managing Databases
### Test Your Disaster Recovery. Really Test It.
Not "can I restore a backup." Actually:
1. Delete your staging database
2. Restore it from backups
3. Verify it contains the right data
4. Time how long it takes
5. Document what broke
Do this quarterly minimum. Unannounced drills are even better.
### Check Your Backups. Right Now.
Seriously, stop reading. Go restore your most recent production backup to a test instance. Check that it has data. I'll wait.
...
Did you do it? No? **Go do it now.** I'm not kidding. This article can wait.
### Implement Soft Deletes
Hard deletes are almost never worth the risk. Soft deletes give you:
- Easy recovery from accidents
- Audit trails
- Ability to "un-delete" user actions
- Historical data analysis
The storage cost is trivial compared to the risk reduction.
### Monitor Data Integrity, Not Just Uptime
Your monitoring probably tracks:
- Database up/down
- Query performance
- Disk space
- Connection counts
Does it track:
- Actual row counts over time?
- Rate of data changes?
- Referential integrity?
- Suspicious deletion patterns?
Add these. They're cheaper than data recovery.
### Require Two People For Production Changes
My days of running solo cowboy operations against production ended three years ago. Now:
- All production changes reviewed by 2+ people
- All data operations have a dry-run step
- All high-risk operations require approval tokens
- All changes are logged with full audit trail
Is it slower? Yes. Has it prevented disasters? Also yes.
## What I'd Tell Myself Three Years Ago
If I could go back to March 13th, 2024, and give myself one piece of advice, it would be this:
**Your disaster recovery plan is useless until you've actually used it to recover from a disaster.**
Not a drill. Not a test. An actual "oh shit" moment where you need your backups to work or people lose their jobs.
And if you haven't had one of those moments, you should simulate one. Regularly. With realistic chaos.
Because I guarantee: your backup strategy has a fatal flaw. Everyone's does. The question is whether you discover it during a drill or during an actual emergency.
## Three Years Later
We survived. Barely.
We lost about 30% of our customers in the following quarter. Our revenue dropped by 40%. We had to do a layoff. Our next fundraising round was significantly harder.
But we didn't shut down. We rebuilt. We recovered. And most importantly: **we learned**.
Today, our disaster recovery capability is genuinely world-class. We've become almost paranoid about backups and data integrity. We test constantly. We drill quarterly. We've had several close calls since then, and our safeguards caught every single one before they became disasters.
I'm still at the company, now VP of Infrastructure. The CTO, who stuck by me through the whole ordeal, is still here too. We still occasionally joke about "the incident," though it's never actually funny.
But here's what really matters: **we published our entire post-mortem publicly**. Every detail, every failure, every lesson. And you know what happened?
We got emails from eleven other companies who'd had similar disasters but never talked about them. Three companies reached out to say our post-mortem helped them identify and fix the same backup configuration issue before it burned them.
Failure is only truly wasteful if you don't learn from it and share those lessons.
## Conclusion: Respect The Database
Databases are easy to take for granted. They just work, day after day, year after year. Until they don't.
The hard truth: **your data is probably less safe than you think it is**. Your backups probably have subtle configuration issues. Your disaster recovery plan probably has untested assumptions. Your safeguards probably have gaps.
But you won't know until you actually try to use them.
So here's my challenge to you: **Before you finish reading this article, schedule a disaster recovery drill**. Put it on the calendar. Assign people to it. Make it happen.
Because I promise you: discovering your backup strategy is broken during a drill is infinitely better than discovering it at 2:37 PM on a random Thursday when you've just deleted your production database.
Trust me on this one.
Stay vigilant. Test everything. And for the love of all that is holy, **implement soft deletes**.
---
*Sarah Martinez is VP of Infrastructure at a company that learned these lessons the hard way. She speaks at conferences about disaster recovery and has successfully recovered from 0 production database deletions in the 3 years since the incident described above (knock on wood). She can be reached at sarah@probably-not-my-real-email.com for consulting on making sure your backups actually work.*