# Disaster Recovery Runbook
Step-by-step procedures for recovering VAPORA from various disaster scenarios.
---
## Disaster Severity Levels
### Level 1: Critical 🔴
**Complete Service Loss** - Entire VAPORA unavailable
Examples:
- Complete cluster failure
- Complete data center outage
- Database completely corrupted
- All backups inaccessible
RTO (Recovery Time Objective): 2-4 hours
RPO (Recovery Point Objective): Up to 1 hour of data loss possible
### Level 2: Major 🟠
**Partial Service Loss** - Some services unavailable
Examples:
- Single region down
- Database corrupted but backups available
- One service completely failed
- Primary storage unavailable
RTO: 30 minutes - 2 hours
RPO: Minimal data loss
### Level 3: Minor 🟡
**Degraded Service** - Service running but with issues
Examples:
- Performance issues
- One pod crashed
- Database connection issues
- High error rate
RTO: 5-15 minutes
RPO: No data loss
---
## Disaster Assessment (First 5 Minutes)
### Step 1: Declare Disaster State
Run these quick checks to determine whether a disaster should be declared:
```bash
# Q1: Is the service accessible?
curl -v https://api.vapora.com/health
# Q2: How many pods are running?
kubectl get pods -n vapora
# Q3: Can we access the database?
kubectl exec -n vapora pod/<name> -- \
surreal sql --conn ws://localhost:8000 \
--user root --pass "$DB_PASSWORD" \
"SELECT * FROM projects LIMIT 1"
# Q4: Are backups available?
aws s3 ls s3://vapora-backups/
```
**Decision Tree**:
```
Can access service normally?
YES → No disaster, escalate to incident response
NO → Continue
Can reach any pods?
YES → Partial disaster (Level 2-3)
NO → Likely total disaster (Level 1)
Can reach database?
YES → Application issue, not data issue
NO → Database issue, need restoration
Are backups accessible?
YES → Recovery likely possible
NO → Critical situation, activate backup locations
```
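The decision tree can be partially automated. Below is a minimal triage sketch, assuming the endpoint, namespace, and bucket shown above; the database check (Q3) is left manual because it needs credentials, and the output is a suggestion, not a verdict.
```bash
#!/usr/bin/env bash
# Rough triage helper for the decision tree above.
api_ok=false
backups_ok=false

# Q1: Is the service accessible?
if curl -fsS --max-time 10 https://api.vapora.com/health >/dev/null; then
  api_ok=true
fi

# Q2: How many pods are running?
pods_running=$(kubectl get pods -n vapora --no-headers 2>/dev/null | grep -c ' Running ' || true)

# Q4: Are backups available?
if aws s3 ls s3://vapora-backups/ >/dev/null 2>&1; then
  backups_ok=true
fi

echo "API reachable:    $api_ok"
echo "Pods Running:     $pods_running"
echo "Backups listable: $backups_ok"

if $api_ok; then
  echo "Suggested: no disaster - use normal incident response"
elif [ "$pods_running" -eq 0 ]; then
  echo "Suggested: Level 1 (Critical) - activate full DR plan"
else
  echo "Suggested: Level 2-3 - continue assessment"
fi
```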
### Step 2: Severity Assignment
Based on assessment:
```
# Level 1 Criteria (Critical)
- 0 pods running in vapora namespace
- Database completely unreachable
- All backup locations inaccessible
- Service down >30 minutes
# Level 2 Criteria (Major)
- <50% pods running
- Database reachable but degraded
- Primary backups inaccessible but secondary available
- Service down 5-30 minutes
# Level 3 Criteria (Minor)
- >75% pods running
- Database responsive but with errors
- Backups accessible
- Service down <5 minutes
Assignment: Level ___
If Level 1: Activate full DR plan
If Level 2: Activate partial DR plan
If Level 3: Use normal incident response
```
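For the pod-percentage criteria, a quick ratio check helps. A sketch, assuming all VAPORA workloads run as Deployments in the vapora namespace:
```bash
# Sum desired vs. ready replicas across all Deployments in the namespace.
desired=$(kubectl get deploy -n vapora -o jsonpath='{range .items[*]}{.spec.replicas}{"\n"}{end}' \
  | awk '{s+=$1} END {print s+0}')
ready=$(kubectl get deploy -n vapora -o jsonpath='{range .items[*]}{.status.readyReplicas}{"\n"}{end}' \
  | awk '{s+=$1} END {print s+0}')
echo "Ready: $ready / Desired: $desired"
if [ "$desired" -gt 0 ]; then
  echo "Running: $(( 100 * ready / desired ))%"
fi
```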
### Step 3: Notify Key Personnel
```
# For Level 1 (Critical) DR
send_message_to = [
"@cto",
"@ops-manager",
"@database-team",
"@infrastructure-team",
"@product-manager"
]
message = """
🔴 DISASTER DECLARED - LEVEL 1 CRITICAL
Service: VAPORA (Complete Outage)
Severity: Critical
Time Declared: [UTC]
Status: Assessing
Actions underway:
1. Activating disaster recovery procedures
2. Notifying stakeholders
3. Engaging full team
Next update: [+5 min]
/cc @all-involved
"""
post_to_slack("#incident-critical")
page_on_call_manager(urgent=true)
```
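If the `post_to_slack` / `page_on_call_manager` helpers above are not available, the message can be sent with a plain webhook call. A minimal sketch, assuming a Slack incoming-webhook URL in `$SLACK_WEBHOOK_URL` and `jq` on the operator workstation; paging is left to whatever tool the team uses.
```bash
# Post the disaster declaration to the incident channel via webhook.
MSG=$(cat <<'EOF'
🔴 DISASTER DECLARED - LEVEL 1 CRITICAL
Service: VAPORA (Complete Outage)
Time Declared: [UTC]
Next update: [+5 min]
EOF
)
curl -fsS -X POST -H 'Content-Type: application/json' \
  -d "$(jq -n --arg text "$MSG" '{text: $text}')" \
  "$SLACK_WEBHOOK_URL"
```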
---
## Disaster Scenario Procedures
### Scenario 1: Complete Cluster Failure
**Symptoms**:
- kubectl commands time out or fail
- No pods running in any namespace
- Nodes unreachable
- All services down
**Recovery Steps**:
#### Step 1: Assess Infrastructure (5 min)
```bash
# Try basic cluster operations
kubectl cluster-info
# If: "Unable to connect to the server"
# Check cloud provider status
# AWS: Check AWS status page, check EC2 instances
# GKE: Check Google Cloud console
# On-prem: Check infrastructure team
# Determine: Is infrastructure failed or just connectivity?
```
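On AWS, a quick way to distinguish failed infrastructure from lost connectivity is to ask EC2 directly. The tag filter below is an assumption (a typical EKS cluster tag); substitute whatever tag your node group actually carries.
```bash
# List node instances and their states, independent of kubectl.
aws ec2 describe-instances \
  --filters "Name=tag:kubernetes.io/cluster/vapora,Values=owned" \
  --query 'Reservations[].Instances[].[InstanceId,State.Name]' \
  --output table
```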
#### Step 2: If Infrastructure Failed
**Activate Secondary Infrastructure** (if available):
```bash
# 1. Access backup/secondary infrastructure
export KUBECONFIG=/path/to/backup/kubeconfig
# 2. Verify it's operational
kubectl cluster-info
kubectl get nodes
# 3. Prepare for database restore
# (See: Scenario 2 - Database Recovery)
```
**If No Secondary**: Activate failover to alternate region
```bash
# 1. Contact cloud provider
# AWS: Open support case - request emergency instance launch
# GKE: Request cluster creation in different region
# 2. While infrastructure rebuilds:
# - Retrieve backups
# - Prepare restore scripts
# - Brief team on ETA
```
#### Step 3: Restore Database (See Scenario 2)
#### Step 4: Deploy Services
```bash
# Once infrastructure ready and database restored
# 1. Apply ConfigMaps
kubectl apply -f vapora-configmap.yaml
# 2. Apply Secrets
kubectl apply -f vapora-secrets.yaml
# 3. Deploy services
kubectl apply -f vapora-deployments.yaml
# 4. Wait for pods to start
kubectl rollout status deployment/vapora-backend -n vapora --timeout=10m
# 5. Verify health
curl http://localhost:8001/health
```
#### Step 5: Verification
```bash
# 1. Check all pods running
kubectl get pods -n vapora
# All should show: Running, 1/1 Ready
# 2. Verify database connectivity
kubectl logs deployment/vapora-backend -n vapora | tail -20
# Should show: "Successfully connected to database"
# 3. Test API
curl http://localhost:8001/api/projects
# Should return project list
# 4. Check data integrity
# Run validation queries against the database (via the surrealdb pod):
#   SELECT COUNT(*) FROM projects;  -- should be > 0
#   SELECT COUNT(*) FROM users;     -- should be > 0
#   SELECT COUNT(*) FROM tasks;     -- should be > 0
```
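The validation counts can be run in one pass through the database pod, following the `surreal sql` invocation used elsewhere in this runbook:
```bash
# Count rows in the key tables; each result should be greater than zero.
for table in projects users tasks; do
  kubectl exec -n vapora pod/surrealdb-0 -- \
    surreal sql --conn ws://localhost:8000 \
    --user root --pass "$DB_PASSWORD" \
    "SELECT COUNT(*) FROM $table"
done
```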
---
### Scenario 2: Database Corruption/Loss
**Symptoms**:
- Database queries return errors
- Data integrity issues
- Corruption detected in logs
**Recovery Steps**:
#### Step 1: Assess Database State (10 min)
```bash
# 1. Try to connect
kubectl exec -n vapora pod/surrealdb-0 -- \
surreal sql --conn ws://localhost:8000 \
--user root --pass "$DB_PASSWORD" \
"SELECT COUNT(*) FROM projects"
# 2. Check for error messages
kubectl logs -n vapora pod/surrealdb-0 | tail -50 | grep -i error
# 3. Assess damage
# Is it:
# - Connection issue (might recover)
# - Data corruption (need restore)
# - Complete loss (restore from backup)
```
#### Step 2: Backup Current State (for forensics)
```bash
# Before attempting recovery, save current state
# Export what's remaining
kubectl exec -n vapora pod/surrealdb-0 -- \
surreal export --conn ws://localhost:8000 \
--user root --pass "$DB_PASSWORD" \
--output /tmp/corrupted-export.sql
# Download for analysis
kubectl cp vapora/surrealdb-0:/tmp/corrupted-export.sql \
./corrupted-export-$(date +%Y%m%d-%H%M%S).sql
```
#### Step 3: Identify Latest Good Backup
```bash
# Find most recent backup before corruption
aws s3 ls s3://vapora-backups/database/ --recursive | sort
# Latest backup timestamp
# Should be within last hour
# Download backup
aws s3 cp s3://vapora-backups/database/2026-01-12/vapora-db-010000.sql.gz \
./vapora-db-restore.sql.gz
gunzip vapora-db-restore.sql.gz
```
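To avoid eyeballing the listing under time pressure, the newest backup object can be selected programmatically. A sketch, assuming the date-stamped key layout shown above keeps lexical and chronological order aligned:
```bash
# Pick the newest backup key from the recursive listing and download it.
LATEST_KEY=$(aws s3 ls s3://vapora-backups/database/ --recursive \
  | sort | tail -n 1 | awk '{print $4}')
echo "Latest backup: $LATEST_KEY"
aws s3 cp "s3://vapora-backups/$LATEST_KEY" ./vapora-db-restore.sql.gz
gunzip -f ./vapora-db-restore.sql.gz
```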
#### Step 4: Restore Database
```bash
# Option A: Restore to same database (destructive)
# WARNING: This will overwrite current database
kubectl exec -n vapora pod/surrealdb-0 -- \
rm -rf /var/lib/surrealdb/data.db
# Restart pod to reinitialize
kubectl delete pod -n vapora surrealdb-0
# Pod will restart with clean database
# Copy the backup into the pod, then import it
kubectl cp ./vapora-db-restore.sql vapora/surrealdb-0:/tmp/vapora-db-restore.sql
kubectl exec -n vapora pod/surrealdb-0 -- \
surreal import --conn ws://localhost:8000 \
--user root --pass "$DB_PASSWORD" \
--input /tmp/vapora-db-restore.sql
# Wait for import to complete (5-15 minutes)
```
**Option B: Restore to temporary database (safer)**
```bash
# 1. Create temporary database pod
kubectl run -n vapora restore-test --image=surrealdb/surrealdb:latest \
-- start file:///tmp/restore-test
# 2. Restore to temporary
kubectl cp ./vapora-db-restore.sql vapora/restore-test:/tmp/
kubectl exec -n vapora restore-test -- \
surreal import --conn ws://localhost:8000 \
--user root --pass "$DB_PASSWORD" \
--input /tmp/vapora-db-restore.sql
# 3. Verify restored data
kubectl exec -n vapora restore-test -- \
surreal sql --conn ws://localhost:8000 \
--user root --pass "$DB_PASSWORD" \
"SELECT COUNT(*) FROM projects"
# 4. If good: Restore production
kubectl delete pod -n vapora surrealdb-0
# Wait for pod restart
kubectl cp ./vapora-db-restore.sql vapora/surrealdb-0:/tmp/
kubectl exec -n vapora surrealdb-0 -- \
surreal import --conn ws://localhost:8000 \
--user root --pass "$DB_PASSWORD" \
--input /tmp/vapora-db-restore.sql
# 5. Cleanup test pod
kubectl delete pod -n vapora restore-test
```
#### Step 5: Verify Recovery
```bash
# 1. Database responsive
kubectl exec -n vapora pod/surrealdb-0 -- \
surreal sql --conn ws://localhost:8000 \
--user root --pass "$DB_PASSWORD" \
"SELECT COUNT(*) FROM projects"
# 2. Application can connect
kubectl logs deployment/vapora-backend -n vapora | tail -5
# Should show successful connection
# 3. API working
curl http://localhost:8001/api/projects
# 4. Data valid
# Check record counts match pre-backup
# Check no corruption in key records
```
---
### Scenario 3: Configuration Corruption
**Symptoms**:
- Application misconfigured
- Pods failing to start
- Wrong values in environment
**Recovery Steps**:
#### Step 1: Identify Bad Configuration
```bash
# 1. Get current ConfigMap
kubectl get configmap -n vapora vapora-config -o yaml > current-config.yaml
# 2. Compare with known-good backup
aws s3 cp s3://vapora-backups/configs/2026-01-12/configmaps.yaml .
# 3. Diff to find issues
diff configmaps.yaml current-config.yaml
```
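As an alternative to a local `diff`, `kubectl diff` compares the backed-up manifest against what is live in the cluster and shows exactly what re-applying it would change:
```bash
# Server-side comparison of the backup against the live ConfigMaps.
kubectl diff -f ./configmaps.yaml || true   # exits non-zero when differences exist
```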
#### Step 2: Restore Previous Configuration
```bash
# 1. Get previous ConfigMap from backup
aws s3 cp s3://vapora-backups/configs/2026-01-11/configmaps.yaml ./good-config.yaml
# 2. Apply previous configuration
kubectl apply -f good-config.yaml
# 3. Restart pods to pick up new config
kubectl rollout restart deployment/vapora-backend -n vapora
kubectl rollout restart deployment/vapora-agents -n vapora
# 4. Monitor restart
kubectl get pods -n vapora -w
```
#### Step 3: Verify Configuration
```bash
# 1. Pods should restart and become Running
kubectl get pods -n vapora
# All should show: Running, 1/1 Ready
# 2. Check pod logs
kubectl logs deployment/vapora-backend -n vapora | tail -10
# Should show successful startup
# 3. API operational
curl http://localhost:8001/health
```
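Optionally, spot-check that the restarted pods actually picked up the restored values. The `VAPORA_` prefix below is a placeholder; grep for a key you know lives in the ConfigMap.
```bash
# Inspect the environment of a running backend pod for restored config values.
kubectl exec -n vapora deploy/vapora-backend -- env | grep VAPORA_ || true
```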
---
### Scenario 4: Data Center/Region Outage
**Symptoms**:
- Entire region unreachable
- Multiple infrastructure components down
- Network connectivity issues
**Recovery Steps**:
#### Step 1: Declare Regional Failover
```bash
# 1. Confirm region is down
ping production.vapora.com
# Should fail
# Check status page
# Cloud provider should report outage
# 2. Declare failover (placeholder call - substitute your team's failover
#    automation, or carry out Step 2 below manually)
declare_failover_to_region("us-west-2")
```
#### Step 2: Activate Alternate Region
```bash
# 1. Switch kubeconfig to alternate region
export KUBECONFIG=/path/to/backup-region/kubeconfig
# 2. Verify alternate region up
kubectl cluster-info
# 3. Download and restore database
aws s3 cp s3://vapora-backups/database/latest/ . --recursive
# 4. Restore services (as in Scenario 1, Step 4)
```
#### Step 3: Update DNS/Routing
```bash
# Update DNS to point to alternate region
aws route53 change-resource-record-sets \
--hosted-zone-id Z123456 \
--change-batch '{
"Changes": [{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "api.vapora.com",
"Type": "A",
"AliasTarget": {
"HostedZoneId": "Z987654",
"DNSName": "backup-region-lb.elb.amazonaws.com",
"EvaluateTargetHealth": false
}
}
}]
}'
# Wait for DNS propagation (5-10 minutes)
```
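Rather than waiting a fixed interval, propagation can be watched. A sketch assuming `dig` on the operator workstation; ELB addresses rotate, so each pass resolves the load balancer fresh before comparing.
```bash
# Poll until api.vapora.com returns the backup-region load balancer's addresses.
while true; do
  lb_ips=$(dig +short backup-region-lb.elb.amazonaws.com | sort)
  api_ips=$(dig +short api.vapora.com | sort)
  if [ -n "$api_ips" ] && [ "$api_ips" = "$lb_ips" ]; then
    echo "DNS cutover complete"
    break
  fi
  echo "Waiting for DNS propagation..."
  sleep 30
done
```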
#### Step 4: Verify Failover
```bash
# 1. DNS resolves to new region
nslookup api.vapora.com
# 2. Services accessible
curl https://api.vapora.com/health
# 3. Data intact
curl https://api.vapora.com/api/projects
```
#### Step 5: Communicate Failover
```
Post to #incident-critical:
✅ FAILOVER TO ALTERNATE REGION COMPLETE
Primary Region: us-east-1 (Down)
Active Region: us-west-2 (Restored)
Status:
- All services running: ✓
- Database restored: ✓
- Data integrity verified: ✓
- Partial data loss: ~30 minutes of transactions
Estimated Data Loss: 30 minutes (11:30-12:00 UTC)
Current Time: 12:05 UTC
Next steps:
- Monitor alternate region closely
- Begin investigation of primary region
- Plan failback when primary recovered
Questions? /cc @ops-team
```
---
## Post-Disaster Recovery
### Phase 1: Stabilization (Ongoing)
```
# Continue monitoring for 4 hours minimum
# Checks every 15 minutes:
✓ All pods Running
✓ API responding
✓ Database queries working
✓ Error rates normal
✓ Performance baseline
```
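A simple watch loop covers the 4-hour stabilization window. Endpoints follow the examples used earlier in this runbook, and the 15-minute cadence matches the checklist above.
```bash
# Run the core checks every 15 minutes for 4 hours (16 iterations).
for i in $(seq 1 16); do
  echo "=== Stabilization check $i/16: $(date -u) ==="
  kubectl get pods -n vapora --no-headers | grep -v ' Running ' || echo "All pods Running"
  curl -fsS --max-time 10 https://api.vapora.com/health || echo "WARN: health check failed"
  kubectl exec -n vapora pod/surrealdb-0 -- \
    surreal sql --conn ws://localhost:8000 \
    --user root --pass "$DB_PASSWORD" \
    "SELECT COUNT(*) FROM projects" || echo "WARN: database query failed"
  sleep 900
done
```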
### Phase 2: Root Cause Analysis
**Start within 1 hour of service recovery**:
```
Questions to answer:
1. What caused the disaster?
- Hardware failure
- Software bug
- Configuration error
- External attack
- Human error
2. Why wasn't it detected earlier?
- Monitoring gap
- Alert misconfiguration
- Alert fatigue
3. How did backups perform?
- Were they accessible?
- Restore time as expected?
- Data loss acceptable?
4. What took longest in recovery?
- Finding backups
- Restoring database
- Redeploying services
- Verifying integrity
5. What can be improved?
- Faster detection
- Faster recovery
- Better documentation
- More automated recovery
```
### Phase 3: Recovery Documentation
```
Create post-disaster report:
Timeline:
- 11:30 UTC: Disaster detected
- 11:35 UTC: Database restore started
- 11:50 UTC: Services redeployed
- 12:00 UTC: All systems operational
- Duration: 30 minutes
Impact:
- Users affected: [X]
- Data lost: [X] transactions
- Revenue impact: $[X]
Root cause: [Description]
Contributing factors:
1. [Factor 1]
2. [Factor 2]
Preventive measures:
1. [Action] by [Owner] by [Date]
2. [Action] by [Owner] by [Date]
Lessons learned:
1. [Lesson 1]
2. [Lesson 2]
```
### Phase 4: Improvements Implementation
**Due date: Within 2 weeks**
```
Checklist for improvements:
□ Update backup strategy (if needed)
□ Improve monitoring/alerting
□ Automate more recovery steps
□ Update runbooks with learnings
□ Train team on new procedures
□ Test improved procedures
□ Document for future reference
□ Incident retrospective meeting
```
---
## Disaster Recovery Drill
### Quarterly DR Drill
**Purpose**: Test DR procedures before real disaster
**Schedule**: Last Friday of each quarter at 02:00 UTC
```nu
def quarterly_dr_drill [] {
    print "=== QUARTERLY DISASTER RECOVERY DRILL ==="
    print $"Date: (date now | format date '%Y-%m-%d %H:%M:%S') UTC"
    print ""

    # 1. Simulate database corruption
    print "1. Simulating database corruption..."
    # Create test database, introduce corruption

    # 2. Test restore procedure
    print "2. Testing restore from backup..."
    # Download backup, restore to test database

    # 3. Measure restore time
    let start_time = (date now)
    # ... restore process ...
    let end_time = (date now)
    let duration = $end_time - $start_time
    print $"Restore time: ($duration)"

    # 4. Verify data integrity
    print "3. Verifying data integrity..."
    # Check restored data matches pre-backup

    # 5. Document results
    print "4. Documenting results..."
    # Record in DR drill log

    print ""
    print "Drill complete"
}
```
### Drill Checklist
```
Pre-Drill (1 week before):
□ Notify team of scheduled drill
□ Plan specific scenario to test
□ Prepare test environment
□ Have runbooks available
During Drill:
□ Execute scenario as planned
□ Record actual timings
□ Document any issues
□ Note what went well
□ Note what could improve
Post-Drill (within 1 day):
□ Debrief meeting
□ Review recorded times vs. targets
□ Discuss improvements
□ Update runbooks if needed
□ Thank team for participation
□ Document lessons learned
Post-Drill (within 1 week):
□ Implement identified improvements
□ Test improvements
□ Verify procedures updated
□ Archive drill documentation
```
---
## Disaster Recovery Readiness
### Recovery Readiness Checklist
```
Infrastructure:
□ Primary region configured
□ Backup region prepared
□ Load balancing configured
□ DNS failover configured
Data:
□ Hourly database backups
□ Backups encrypted
□ Backups tested (monthly)
□ Multiple backup locations
Configuration:
□ ConfigMaps backed up (daily)
□ Secrets encrypted and backed up
□ Infrastructure code in Git
□ Deployment manifests versioned
Documentation:
□ Disaster procedures documented
□ Runbooks current and tested
□ Team trained on procedures
□ Escalation paths clear
Testing:
□ Monthly restore test passes
□ Quarterly DR drill scheduled
□ Recovery times meet RTO/RPO targets
Monitoring:
□ Alerts for backup failures
□ Backup health checks running
□ Recovery procedures monitored
```
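For the backup health-check item, a recency probe is often enough. A sketch assuming GNU `date` and the bucket layout used in this runbook; wiring the alert into your paging tool is left open.
```bash
# Alert if the newest database backup object is older than 2 hours.
latest_ts=$(aws s3 ls s3://vapora-backups/database/ --recursive \
  | sort | tail -n 1 | awk '{print $1" "$2}')
latest_epoch=$(date -d "$latest_ts" +%s)
age=$(( $(date +%s) - latest_epoch ))
if [ "$age" -gt 7200 ]; then
  echo "ALERT: newest backup is $((age / 60)) minutes old"
fi
```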
### RTO/RPO Targets
| Scenario | RTO | RPO |
|----------|-----|-----|
| **Single pod failure** | 5 min | 0 min |
| **Database corruption** | 1 hour | 1 hour |
| **Node failure** | 15 min | 0 min |
| **Region outage** | 2 hours | 15 min |
| **Complete cluster loss** | 4 hours | 1 hour |
---
## Disaster Recovery Contacts
| Role | Contact | Phone | Slack |
|------|---------|-------|-------|
| Primary DBA | [Name] | [Phone] | @[slack] |
| Backup DBA | [Name] | [Phone] | @[slack] |
| Infra Lead | [Name] | [Phone] | @[slack] |
| Backup Infra | [Name] | [Phone] | @[slack] |
| CTO | [Name] | [Phone] | @[slack] |
| Ops Manager | [Name] | [Phone] | @[slack] |

Escalation:
- Level 1: [Name] - notify immediately
- Level 2: [Name] - notify within 5 min
- Level 3: [Name] - notify within 15 min
---
## Quick Reference: Disaster Steps
```
1. ASSESS (First 5 min)
- Determine disaster severity
- Assess damage scope
- Get backup location access
2. COMMUNICATE (Immediately)
- Declare disaster
- Notify key personnel
- Start status updates (every 5 min)
3. RECOVER (Next 30-120 min)
- Activate backup infrastructure if needed
- Restore database from latest backup
- Redeploy applications
- Verify all systems operational
4. VERIFY (Continuous)
- Check pod health
- Verify database connectivity
- Test API endpoints
- Monitor error rates
5. STABILIZE (Next 4 hours)
- Monitor closely
- Watch for anomalies
- Verify performance normal
- Check data integrity
6. INVESTIGATE (Within 1 hour)
- Root cause analysis
- Document what happened
- Plan improvements
- Update procedures
7. IMPROVE (Within 2 weeks)
- Implement improvements
- Test improvements
- Update documentation
- Train team
```