# Disaster Recovery Runbook

Step-by-step procedures for recovering VAPORA from various disaster scenarios.

---

## Disaster Severity Levels

### Level 1: Critical 🔴

**Complete Service Loss** - Entire VAPORA unavailable

Examples:
- Complete cluster failure
- Complete data center outage
- Database completely corrupted
- All backups inaccessible

RTO: 2-4 hours
RPO: Up to 1 hour of data loss possible

### Level 2: Major 🟠

**Partial Service Loss** - Some services unavailable

Examples:
- Single region down
- Database corrupted but backups available
- One service completely failed
- Primary storage unavailable

RTO: 30 minutes - 2 hours
RPO: Minimal data loss

### Level 3: Minor 🟡

**Degraded Service** - Service running but with issues

Examples:
- Performance issues
- One pod crashed
- Database connection issues
- High error rate

RTO: 5-15 minutes
RPO: No data loss

---

## Disaster Assessment (First 5 Minutes)

### Step 1: Declare Disaster State

Run these checks to determine whether to declare a disaster:

```bash
# Q1: Is the service accessible?
curl -v https://api.vapora.com/health

# Q2: How many pods are running?
kubectl get pods -n vapora

# Q3: Can we access the database?
kubectl exec -n vapora pod/surrealdb-0 -- \
  surreal sql --conn ws://localhost:8000 \
  --user root --pass "$DB_PASSWORD" \
  "SELECT * FROM projects LIMIT 1"

# Q4: Are backups available?
aws s3 ls s3://vapora-backups/
```

**Decision Tree**:

```
Can access service normally?
  YES → No disaster, escalate to incident response
  NO  → Continue

Can reach any pods?
  YES → Partial disaster (Level 2-3)
  NO  → Likely total disaster (Level 1)

Can reach database?
  YES → Application issue, not data issue
  NO  → Database issue, restoration needed

Are backups accessible?
  YES → Recovery likely possible
  NO  → Critical situation, activate backup locations
```

### Step 2: Severity Assignment

Assign a severity level based on the assessment:

```
# Level 1 Criteria (Critical)
- 0 pods running in vapora namespace
- Database completely unreachable
- All backup locations inaccessible
- Service down >30 minutes

# Level 2 Criteria (Major)
- <50% of pods running
- Database reachable but degraded
- Primary backups inaccessible but secondary available
- Service down 5-30 minutes

# Level 3 Criteria (Minor)
- >75% of pods running
- Database responsive but with errors
- Backups accessible
- Service down <5 minutes

Assignment: Level ___

If Level 1: Activate full DR plan
If Level 2: Activate partial DR plan
If Level 3: Use normal incident response
```

### Step 3: Notify Key Personnel

```
# For Level 1 (Critical) DR
send_message_to = [
  "@cto",
  "@ops-manager",
  "@database-team",
  "@infrastructure-team",
  "@product-manager"
]

message = """
🔴 DISASTER DECLARED - LEVEL 1 CRITICAL

Service: VAPORA (Complete Outage)
Severity: Critical
Time Declared: [UTC]
Status: Assessing

Actions underway:
1. Activating disaster recovery procedures
2. Notifying stakeholders
3. Engaging full team

Next update: [+5 min]

/cc @all-involved
"""

post_to_slack("#incident-critical")
page_on_call_manager(urgent=true)
```

---

## Disaster Scenario Procedures

### Scenario 1: Complete Cluster Failure

**Symptoms**:
- kubectl commands time out or fail
- No pods running in any namespace
- Nodes unreachable
- All services down

**Recovery Steps**:

#### Step 1: Assess Infrastructure (5 min)

```bash
# Try basic cluster operations
kubectl cluster-info

# If: "Unable to connect to the server"
# Check cloud provider status:
# AWS: Check the AWS status page, check EC2 instances
# GKE: Check the Google Cloud console
# On-prem: Check with the infrastructure team

# Determine: Is the infrastructure failed, or is it just connectivity?
```
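
The "infrastructure failed or just connectivity?" question above can often be answered from outside the cluster. A minimal sketch, assuming an AWS/EKS setup whose worker nodes carry a `vapora-*` Name tag (the tag filter and timeout are assumptions, not part of this runbook):

```bash
# Hypothetical infra-vs-connectivity probe (adjust the filter for your provider).

# Connectivity check: does the API server answer at all?
kubectl cluster-info --request-timeout=10s && echo "Control plane reachable"

# Infrastructure check: are the worker instances themselves up?
aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=vapora-*" \
  --query 'Reservations[].Instances[].[InstanceId,State.Name]' \
  --output table

# Instances "running" but kubectl timing out → suspect network/connectivity.
# Instances stopped/terminated or missing    → treat as infrastructure failure.
```
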
#### Step 2: If Infrastructure Failed

**Activate Secondary Infrastructure** (if available):

```bash
# 1. Access backup/secondary infrastructure
export KUBECONFIG=/path/to/backup/kubeconfig

# 2. Verify it's operational
kubectl cluster-info
kubectl get nodes

# 3. Prepare for database restore
# (See: Scenario 2 - Database Recovery)
```

**If No Secondary**: Activate failover to an alternate region

```bash
# 1. Contact cloud provider
# AWS: Open a support case - request emergency instance launch
# GKE: Request cluster creation in a different region

# 2. While infrastructure rebuilds:
# - Retrieve backups
# - Prepare restore scripts
# - Brief team on ETA
```

#### Step 3: Restore Database (See Scenario 2)

#### Step 4: Deploy Services

```bash
# Once infrastructure is ready and the database is restored:

# 1. Apply ConfigMaps
kubectl apply -f vapora-configmap.yaml

# 2. Apply Secrets
kubectl apply -f vapora-secrets.yaml

# 3. Deploy services
kubectl apply -f vapora-deployments.yaml

# 4. Wait for pods to start
kubectl rollout status deployment/vapora-backend -n vapora --timeout=10m

# 5. Verify health
curl http://localhost:8001/health
```

#### Step 5: Verification

```bash
# 1. Check all pods are running
kubectl get pods -n vapora
# All should show: Running, 1/1 Ready

# 2. Verify database connectivity
kubectl logs deployment/vapora-backend -n vapora | tail -20
# Should show: "Successfully connected to database"

# 3. Test API
curl http://localhost:8001/api/projects
# Should return the project list

# 4. Check data integrity
# Run validation queries (SurrealQL):
SELECT COUNT(*) FROM projects;  # Should be > 0
SELECT COUNT(*) FROM users;     # Should be > 0
SELECT COUNT(*) FROM tasks;     # Should be > 0
```

---

### Scenario 2: Database Corruption/Loss

**Symptoms**:
- Database queries return errors
- Data integrity issues
- Corruption detected in logs

**Recovery Steps**:

#### Step 1: Assess Database State (10 min)

```bash
# 1. Try to connect
kubectl exec -n vapora pod/surrealdb-0 -- \
  surreal sql --conn ws://localhost:8000 \
  --user root --pass "$DB_PASSWORD" \
  "SELECT COUNT(*) FROM projects"

# 2. Check for error messages
kubectl logs -n vapora pod/surrealdb-0 | tail -50 | grep -i error

# 3. Assess the damage. Is it:
# - A connection issue (might recover)
# - Data corruption (needs restore)
# - Complete loss (restore from backup)
```
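
The triage in Step 1 is quicker if you first check whether the pod and the server process are reachable at all before concluding the data is corrupted. A minimal sketch, assuming the namespace and pod name used throughout this runbook and a SurrealDB build that ships the `isready` subcommand:

```bash
# Hypothetical triage helper: separates "process/connection down" from
# "process up but queries failing" before committing to a restore.

# Is the SurrealDB pod running at all?
kubectl get pod -n vapora surrealdb-0 -o jsonpath='{.status.phase}{"\n"}'

# Is the server accepting connections inside the pod?
kubectl exec -n vapora surrealdb-0 -- surreal isready --conn ws://localhost:8000 \
  && echo "DB accepting connections" \
  || echo "DB not reachable - likely a process/connection issue"

# If connections succeed but the COUNT query above errors,
# treat it as corruption and continue with Steps 2-4 (backup, identify, restore).
```
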
#### Step 2: Backup Current State (for forensics)

Before attempting recovery, save the current state:

```bash
# Export what remains
kubectl exec -n vapora pod/surrealdb-0 -- \
  surreal export --conn ws://localhost:8000 \
  --user root --pass "$DB_PASSWORD" \
  --output /tmp/corrupted-export.sql

# Download for analysis
kubectl cp vapora/surrealdb-0:/tmp/corrupted-export.sql \
  ./corrupted-export-$(date +%Y%m%d-%H%M%S).sql
```

#### Step 3: Identify Latest Good Backup

```bash
# Find the most recent backup taken before the corruption
aws s3 ls s3://vapora-backups/database/ --recursive | sort

# The latest backup timestamp should be within the last hour

# Download the backup
aws s3 cp s3://vapora-backups/database/2026-01-12/vapora-db-010000.sql.gz \
  ./vapora-db-restore.sql.gz

gunzip vapora-db-restore.sql.gz
```

#### Step 4: Restore Database

**Option A: Restore to the same database (destructive)**

```bash
# WARNING: This will overwrite the current database
kubectl exec -n vapora pod/surrealdb-0 -- \
  rm -rf /var/lib/surrealdb/data.db

# Restart the pod to reinitialize
kubectl delete pod -n vapora surrealdb-0
# Pod will restart with a clean database

# Copy the backup into the pod and import it
kubectl cp ./vapora-db-restore.sql vapora/surrealdb-0:/tmp/
kubectl exec -n vapora pod/surrealdb-0 -- \
  surreal import --conn ws://localhost:8000 \
  --user root --pass "$DB_PASSWORD" \
  --input /tmp/vapora-db-restore.sql

# Wait for the import to complete (5-15 minutes)
```

**Option B: Restore to a temporary database (safer)**

```bash
# 1. Create a temporary database pod
kubectl run -n vapora restore-test --image=surrealdb/surrealdb:latest \
  -- start file:///tmp/restore-test

# 2. Restore into the temporary database
kubectl cp ./vapora-db-restore.sql vapora/restore-test:/tmp/
kubectl exec -n vapora restore-test -- \
  surreal import --conn ws://localhost:8000 \
  --user root --pass "$DB_PASSWORD" \
  --input /tmp/vapora-db-restore.sql

# 3. Verify the restored data
kubectl exec -n vapora restore-test -- \
  surreal sql "SELECT COUNT(*) FROM projects"

# 4. If it looks good: restore production
kubectl delete pod -n vapora surrealdb-0
# Wait for the pod to restart
kubectl cp ./vapora-db-restore.sql vapora/surrealdb-0:/tmp/
kubectl exec -n vapora surrealdb-0 -- \
  surreal import --conn ws://localhost:8000 \
  --user root --pass "$DB_PASSWORD" \
  --input /tmp/vapora-db-restore.sql

# 5. Clean up the test pod
kubectl delete pod -n vapora restore-test
```

#### Step 5: Verify Recovery

```bash
# 1. Database responsive
kubectl exec -n vapora pod/surrealdb-0 -- \
  surreal sql "SELECT COUNT(*) FROM projects"

# 2. Application can connect
kubectl logs deployment/vapora-backend -n vapora | tail -5
# Should show a successful connection

# 3. API working
curl http://localhost:8001/api/projects

# 4. Data valid
# Check record counts match the pre-backup counts
# Check there is no corruption in key records
```

---

### Scenario 3: Configuration Corruption

**Symptoms**:
- Application misconfigured
- Pods failing to start
- Wrong values in environment

**Recovery Steps**:

#### Step 1: Identify Bad Configuration

```bash
# 1. Get the current ConfigMap
kubectl get configmap -n vapora vapora-config -o yaml > current-config.yaml

# 2. Compare with a known-good backup
aws s3 cp s3://vapora-backups/configs/2026-01-12/configmaps.yaml .

# 3. Diff to find the issues
diff configmaps.yaml current-config.yaml
```
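
If it is not obvious which change broke the configuration, walking backwards through the dated backups brackets when the drift landed, and `kubectl diff` previews what restoring a known-good copy would change before Step 2 applies it. A minimal sketch, assuming the dated backup layout shown above (the 2026-01-10 entry is an illustrative extra day, not from this runbook):

```bash
# Hypothetical drift finder: report the most recent daily backup that still
# matches production, i.e. the last point before the bad change was introduced.
kubectl get configmap -n vapora vapora-config -o yaml > current-config.yaml

for day in 2026-01-12 2026-01-11 2026-01-10; do
    aws s3 cp "s3://vapora-backups/configs/${day}/configmaps.yaml" "./configmaps-${day}.yaml"
    if diff -q "./configmaps-${day}.yaml" current-config.yaml > /dev/null; then
        echo "${day}: matches current config (drift happened after this backup)"
        break
    else
        echo "${day}: differs from current config"
    fi
done

# Preview what restoring the known-good copy would change
# (kubectl diff exits non-zero when differences exist):
kubectl diff -f ./configmaps-2026-01-11.yaml || true
```
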
#### Step 2: Restore Previous Configuration

```bash
# 1. Get the previous ConfigMap from backup
aws s3 cp s3://vapora-backups/configs/2026-01-11/configmaps.yaml ./good-config.yaml

# 2. Apply the previous configuration
kubectl apply -f good-config.yaml

# 3. Restart pods to pick up the new config
kubectl rollout restart deployment/vapora-backend -n vapora
kubectl rollout restart deployment/vapora-agents -n vapora

# 4. Monitor the restart
kubectl get pods -n vapora -w
```

#### Step 3: Verify Configuration

```bash
# 1. Pods should restart and become Running
kubectl get pods -n vapora
# All should show: Running, 1/1 Ready

# 2. Check pod logs
kubectl logs deployment/vapora-backend -n vapora | tail -10
# Should show a successful startup

# 3. API operational
curl http://localhost:8001/health
```

---

### Scenario 4: Data Center/Region Outage

**Symptoms**:
- Entire region unreachable
- Multiple infrastructure components down
- Network connectivity issues

**Recovery Steps**:

#### Step 1: Declare Regional Failover

```bash
# 1. Confirm the region is down
ping production.vapora.com
# Should fail

# Check the status page
# Cloud provider should report the outage

# 2. Declare failover (runbook action, not a shell command)
declare_failover_to_region("us-west-2")
```

#### Step 2: Activate Alternate Region

```bash
# 1. Switch kubeconfig to the alternate region
export KUBECONFIG=/path/to/backup-region/kubeconfig

# 2. Verify the alternate region is up
kubectl cluster-info

# 3. Download and restore the database
aws s3 cp s3://vapora-backups/database/latest/ . --recursive

# 4. Restore services (as in Scenario 1, Step 4)
```

#### Step 3: Update DNS/Routing

```bash
# Update DNS to point to the alternate region
aws route53 change-resource-record-sets \
  --hosted-zone-id Z123456 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.vapora.com",
        "Type": "A",
        "AliasTarget": {
          "HostedZoneId": "Z987654",
          "DNSName": "backup-region-lb.elb.amazonaws.com",
          "EvaluateTargetHealth": false
        }
      }
    }]
  }'

# Wait for DNS propagation (5-10 minutes)
```

#### Step 4: Verify Failover

```bash
# 1. DNS resolves to the new region
nslookup api.vapora.com

# 2. Services accessible
curl https://api.vapora.com/health

# 3. Data intact
curl https://api.vapora.com/api/projects
```

#### Step 5: Communicate Failover

```
Post to #incident-critical:

✅ FAILOVER TO ALTERNATE REGION COMPLETE

Primary Region: us-east-1 (Down)
Active Region: us-west-2 (Restored)

Status:
- All services running: ✓
- Database restored: ✓
- Data integrity verified: ✓
- Partial data loss: ~30 minutes of transactions

Estimated Data Loss: 30 minutes (11:30-12:00 UTC)
Current Time: 12:05 UTC

Next steps:
- Monitor the alternate region closely
- Begin investigation of the primary region
- Plan failback once the primary region is recovered

Questions? /cc @ops-team
```

---

## Post-Disaster Recovery

### Phase 1: Stabilization (Ongoing)

```
# Continue monitoring for 4 hours minimum

# Checks every 15 minutes:
✓ All pods Running
✓ API responding
✓ Database queries working
✓ Error rates normal
✓ Performance at baseline
```

### Phase 2: Root Cause Analysis

**Start within 1 hour of service recovery**:

```
Questions to answer:

1. What caused the disaster?
   - Hardware failure
   - Software bug
   - Configuration error
   - External attack
   - Human error

2. Why wasn't it detected earlier?
   - Monitoring gap
   - Alert misconfiguration
   - Alert fatigue

3. How did the backups perform?
   - Were they accessible?
   - Was the restore time as expected?
   - Was the data loss acceptable?

4. What took longest in recovery?
   - Finding backups
   - Restoring the database
   - Redeploying services
   - Verifying integrity

5. What can be improved?
   - Faster detection
   - Faster recovery
   - Better documentation
   - More automated recovery
```
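
Collecting raw evidence for these questions is easiest while the incident is fresh. A minimal sketch, assuming the namespace, deployment, and backup bucket used throughout this runbook (the incident folder layout is an assumption):

```bash
# Hypothetical evidence-collection helper: snapshot cluster events, recent logs,
# and backup history into a dated folder for the RCA.
INCIDENT_DIR="incident-$(date +%Y%m%d-%H%M)"
mkdir -p "$INCIDENT_DIR"

# What happened in the cluster? (question 1)
kubectl get events -n vapora --sort-by=.lastTimestamp > "$INCIDENT_DIR/events.txt"
kubectl logs deployment/vapora-backend -n vapora --tail=500 > "$INCIDENT_DIR/backend.log"

# What state are the workloads in now? (questions 1-2)
kubectl get pods -n vapora -o wide > "$INCIDENT_DIR/pods.txt"

# How did the backups perform? (question 3) - what was available and how old
aws s3 ls s3://vapora-backups/database/ --recursive | sort | tail -n 24 > "$INCIDENT_DIR/backups.txt"

echo "Evidence collected in $INCIDENT_DIR"
```
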
### Phase 3: Recovery Documentation

```
Create post-disaster report:

Timeline:
- 11:30 UTC: Disaster detected
- 11:35 UTC: Database restore started
- 11:50 UTC: Services redeployed
- 12:00 UTC: All systems operational
- Duration: 30 minutes

Impact:
- Users affected: [X]
- Data lost: [X] transactions
- Revenue impact: $[X]

Root cause: [Description]

Contributing factors:
1. [Factor 1]
2. [Factor 2]

Preventive measures:
1. [Action] by [Owner] by [Date]
2. [Action] by [Owner] by [Date]

Lessons learned:
1. [Lesson 1]
2. [Lesson 2]
```

### Phase 4: Improvements Implementation

**Due date: Within 2 weeks**

```
Checklist for improvements:

□ Update backup strategy (if needed)
□ Improve monitoring/alerting
□ Automate more recovery steps
□ Update runbooks with learnings
□ Train team on new procedures
□ Test improved procedures
□ Document for future reference
□ Hold incident retrospective meeting
```

---

## Disaster Recovery Drill

### Quarterly DR Drill

**Purpose**: Test DR procedures before a real disaster strikes

**Schedule**: Last Friday of each quarter at 02:00 UTC

```nu
def quarterly_dr_drill [] {
    print "=== QUARTERLY DISASTER RECOVERY DRILL ==="
    print $"Date: (date now | format date '%Y-%m-%d %H:%M:%S UTC')"
    print ""

    # 1. Simulate database corruption
    print "1. Simulating database corruption..."
    # Create a test database, introduce corruption

    # 2. Test the restore procedure
    print "2. Testing restore from backup..."
    # Download backup, restore to the test database

    # 3. Measure restore time
    let start_time = (date now)
    # ... restore process ...
    let end_time = (date now)
    let duration = $end_time - $start_time
    print $"Restore time: ($duration)"

    # 4. Verify data integrity
    print "4. Verifying data integrity..."
    # Check restored data matches pre-backup

    # 5. Document results
    print "5. Documenting results..."
    # Record in the DR drill log

    print ""
    print "Drill complete"
}
```
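
The restore time measured during the drill is only meaningful when compared against a target. A minimal bash sketch, assuming the 1-hour RTO for database corruption from the RTO/RPO table later in this runbook (the wrapper itself and its variable names are illustrative):

```bash
# Hypothetical timing wrapper around the drill's restore step.
RTO_TARGET_SECONDS=3600   # database corruption RTO: 1 hour
start=$(date +%s)

# ... run the drill's restore procedure here (Scenario 2, Step 4) ...

end=$(date +%s)
elapsed=$(( end - start ))
echo "Restore took ${elapsed}s (target: ${RTO_TARGET_SECONDS}s)"

if [ "$elapsed" -gt "$RTO_TARGET_SECONDS" ]; then
    echo "DRILL FINDING: restore exceeded the RTO target - record it in the drill log" >&2
fi
```
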
### Drill Checklist

```
Pre-Drill (1 week before):
□ Notify team of the scheduled drill
□ Plan the specific scenario to test
□ Prepare the test environment
□ Have runbooks available

During Drill:
□ Execute the scenario as planned
□ Record actual timings
□ Document any issues
□ Note what went well
□ Note what could improve

Post-Drill (within 1 day):
□ Debrief meeting
□ Review recorded times vs. targets
□ Discuss improvements
□ Update runbooks if needed
□ Thank the team for participating
□ Document lessons learned

Post-Drill (within 1 week):
□ Implement identified improvements
□ Test the improvements
□ Verify procedures are updated
□ Archive drill documentation
```

---

## Disaster Recovery Readiness

### Recovery Readiness Checklist

```
Infrastructure:
□ Primary region configured
□ Backup region prepared
□ Load balancing configured
□ DNS failover configured

Data:
□ Hourly database backups
□ Backups encrypted
□ Backups tested (monthly)
□ Multiple backup locations

Configuration:
□ ConfigMaps backed up (daily)
□ Secrets encrypted and backed up
□ Infrastructure code in Git
□ Deployment manifests versioned

Documentation:
□ Disaster procedures documented
□ Runbooks current and tested
□ Team trained on procedures
□ Escalation paths clear

Testing:
□ Monthly restore test passes
□ Quarterly DR drill scheduled
□ Recovery times meet RTO/RPO

Monitoring:
□ Alerts for backup failures
□ Backup health checks running
□ Recovery procedures monitored
```

### RTO/RPO Targets

| Scenario | RTO | RPO |
|----------|-----|-----|
| **Single pod failure** | 5 min | 0 min |
| **Database corruption** | 1 hour | 1 hour |
| **Node failure** | 15 min | 0 min |
| **Region outage** | 2 hours | 15 min |
| **Complete cluster loss** | 4 hours | 1 hour |

---

## Disaster Recovery Contacts

```
Role:           Contact:   Phone:    Slack:
Primary DBA:    [Name]     [Phone]   @[slack]
Backup DBA:     [Name]     [Phone]   @[slack]
Infra Lead:     [Name]     [Phone]   @[slack]
Backup Infra:   [Name]     [Phone]   @[slack]
CTO:            [Name]     [Phone]   @[slack]
Ops Manager:    [Name]     [Phone]   @[slack]

Escalation:
Level 1: [Name] - notify immediately
Level 2: [Name] - notify within 5 min
Level 3: [Name] - notify within 15 min
```

---

## Quick Reference: Disaster Steps

```
1. ASSESS (First 5 min)
   - Determine disaster severity
   - Assess damage scope
   - Get backup location access

2. COMMUNICATE (Immediately)
   - Declare the disaster
   - Notify key personnel
   - Start status updates (every 5 min)

3. RECOVER (Next 30-120 min)
   - Activate backup infrastructure if needed
   - Restore database from latest backup
   - Redeploy applications
   - Verify all systems operational

4. VERIFY (Continuous)
   - Check pod health
   - Verify database connectivity
   - Test API endpoints
   - Monitor error rates

5. STABILIZE (Next 4 hours)
   - Monitor closely
   - Watch for anomalies
   - Verify performance is normal
   - Check data integrity

6. INVESTIGATE (Within 1 hour)
   - Root cause analysis
   - Document what happened
   - Plan improvements
   - Update procedures

7. IMPROVE (Within 2 weeks)
   - Implement improvements
   - Test improvements
   - Update documentation
   - Train team
```
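
The readiness checklist above calls for hourly database backups and backup health checks. A minimal bash sketch of such a check, assuming the backup bucket layout used in this runbook and GNU `date` (the 60-minute threshold mirrors the hourly backup target; the alert wording is illustrative):

```bash
# Hypothetical backup freshness check: exit non-zero (for alerting) if the
# newest database backup is older than the hourly target.
set -euo pipefail

BUCKET="s3://vapora-backups/database/"
MAX_AGE_MINUTES=60

# Newest object timestamp, as "YYYY-MM-DD HH:MM:SS"
latest=$(aws s3 ls "$BUCKET" --recursive | sort | tail -n 1 | awk '{print $1" "$2}')

latest_epoch=$(date -d "$latest" +%s)   # GNU date; use gdate on macOS
now_epoch=$(date +%s)
age_minutes=$(( (now_epoch - latest_epoch) / 60 ))

echo "Newest backup: $latest (${age_minutes} minutes old)"

if [ "$age_minutes" -gt "$MAX_AGE_MINUTES" ]; then
    echo "ALERT: newest backup exceeds the ${MAX_AGE_MINUTES}-minute target" >&2
    exit 1
fi
```
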