# Disaster Recovery Runbook

Step-by-step procedures for recovering VAPORA from various disaster scenarios.

---

## Disaster Severity Levels

### Level 1: Critical 🔴

**Complete Service Loss** - Entire VAPORA unavailable

Examples:

- Complete cluster failure
- Complete data center outage
- Database completely corrupted
- All backups inaccessible

RTO: 2-4 hours
RPO: Up to 1 hour of data loss possible

### Level 2: Major 🟠

**Partial Service Loss** - Some services unavailable

Examples:

- Single region down
- Database corrupted but backups available
- One service completely failed
- Primary storage unavailable

RTO: 30 minutes - 2 hours
RPO: Minimal data loss

### Level 3: Minor 🟡

**Degraded Service** - Service running but with issues

Examples:

- Performance issues
- One pod crashed
- Database connection issues
- High error rate

RTO: 5-15 minutes
RPO: No data loss

---

## Disaster Assessment (First 5 Minutes)

### Step 1: Declare Disaster State

When any of these occur, declare a disaster:

```bash
# Q1: Is the service accessible?
curl -v https://api.vapora.com/health

# Q2: How many pods are running?
kubectl get pods -n vapora

# Q3: Can we access the database?
kubectl exec -n vapora pod/<name> -- \
  surreal sql --conn ws://localhost:8000 \
  --user root --pass "$DB_PASSWORD" \
  "SELECT * FROM projects LIMIT 1"

# Q4: Are backups available?
aws s3 ls s3://vapora-backups/
```

**Decision Tree**:

```
Can access service normally?
  YES → No disaster, escalate to incident response
  NO  → Continue

Can reach any pods?
  YES → Partial disaster (Level 2-3)
  NO  → Likely total disaster (Level 1)

Can reach database?
  YES → Application issue, not data issue
  NO  → Database issue, need restoration

Are backups accessible?
  YES → Recovery likely possible
  NO  → Critical situation, activate backup locations
```
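
The checks above can be scripted so the on-call engineer gets a single PASS/FAIL summary to feed into the decision tree. A minimal sketch; the script name and `check` helper are hypothetical, while the endpoints, namespace, and bucket match those above:

```bash
#!/usr/bin/env bash
# check_dr_state.sh (hypothetical): run the Q1-Q4 checks above and
# print PASS/FAIL for each, as input to the decision tree.

check() {
  local label="$1"; shift
  if "$@" > /dev/null 2>&1; then echo "PASS: $label"; else echo "FAIL: $label"; fi
}

check "Q1 service reachable"  curl -fsS --max-time 10 https://api.vapora.com/health
check "Q2 pods listable"      kubectl get pods -n vapora
check "Q4 backups listable"   aws s3 ls s3://vapora-backups/

# Q2 detail: count of Running pods (0 suggests Level 1)
kubectl get pods -n vapora --no-headers 2>/dev/null | grep -c ' Running '
# Q3 (database) requires a live pod; see the exec command above.
```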

### Step 2: Severity Assignment

Based on assessment:

```
# Level 1 Criteria (Critical)
- 0 pods running in vapora namespace
- Database completely unreachable
- All backup locations inaccessible
- Service down >30 minutes

# Level 2 Criteria (Major)
- <50% pods running
- Database reachable but degraded
- Primary backups inaccessible but secondary available
- Service down 5-30 minutes

# Level 3 Criteria (Minor)
- >75% pods running
- Database responsive but with errors
- Backups accessible
- Service down <5 minutes

Assignment: Level ___

If Level 1: Activate full DR plan
If Level 2: Activate partial DR plan
If Level 3: Use normal incident response
```
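
To make the pod-count criteria concrete, severity can be estimated from the fraction of pods that are running. A minimal sketch under the criteria above; note the 50-75% band falls between the listed thresholds and still needs human judgment:

```bash
#!/usr/bin/env bash
# Hypothetical severity estimator based on the pod criteria above.
total=$(kubectl get pods -n vapora --no-headers 2>/dev/null | wc -l)
running=$(kubectl get pods -n vapora --no-headers 2>/dev/null | grep -c ' Running ')

if [ "$total" -eq 0 ] || [ "$running" -eq 0 ]; then
  echo "Level 1 (Critical): no pods running in vapora namespace"
elif [ $(( running * 100 / total )) -lt 50 ]; then
  echo "Level 2 (Major): ${running}/${total} pods running"
else
  echo "Level 3 (Minor): ${running}/${total} pods running - verify other criteria"
fi
```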

### Step 3: Notify Key Personnel

```
# For Level 1 (Critical) DR
send_message_to = [
    "@cto",
    "@ops-manager",
    "@database-team",
    "@infrastructure-team",
    "@product-manager"
]

message = """
🔴 DISASTER DECLARED - LEVEL 1 CRITICAL

Service: VAPORA (Complete Outage)
Severity: Critical
Time Declared: [UTC]
Status: Assessing

Actions underway:
1. Activating disaster recovery procedures
2. Notifying stakeholders
3. Engaging full team

Next update: [+5 min]

/cc @all-involved
"""

post_to_slack("#incident-critical")
page_on_call_manager(urgent=true)
```
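
Where posting is automated, the same declaration can be sent through a Slack incoming webhook with plain `curl`. A minimal sketch, assuming a `SLACK_WEBHOOK_URL` environment variable configured for #incident-critical; paging is provider-specific and not shown:

```bash
# Post the Level 1 declaration to Slack via an incoming webhook.
curl -fsS -X POST -H 'Content-type: application/json' \
  --data '{"text":"🔴 DISASTER DECLARED - LEVEL 1 CRITICAL\nService: VAPORA (Complete Outage)\nStatus: Assessing\nNext update: +5 min"}' \
  "$SLACK_WEBHOOK_URL"
```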

---

## Disaster Scenario Procedures

### Scenario 1: Complete Cluster Failure

**Symptoms**:

- kubectl commands time out or fail
- No pods running in any namespace
- Nodes unreachable
- All services down

**Recovery Steps**:

#### Step 1: Assess Infrastructure (5 min)

```bash
# Try basic cluster operations
kubectl cluster-info
# If output is "Unable to connect to the server", the control plane is unreachable

# Check cloud provider status
# AWS: Check AWS status page, check EC2 instances
# GKE: Check Google Cloud console
# On-prem: Check infrastructure team

# Determine: Is infrastructure failed or just connectivity?
```

#### Step 2: If Infrastructure Failed

**Activate Secondary Infrastructure** (if available):

```bash
# 1. Access backup/secondary infrastructure
export KUBECONFIG=/path/to/backup/kubeconfig

# 2. Verify it's operational
kubectl cluster-info
kubectl get nodes

# 3. Prepare for database restore
# (See: Scenario 2 - Database Recovery)
```

**If No Secondary**: Activate failover to alternate region

```bash
# 1. Contact cloud provider
# AWS: Open support case - request emergency instance launch
# GKE: Request cluster creation in different region

# 2. While infrastructure rebuilds:
# - Retrieve backups
# - Prepare restore scripts
# - Brief team on ETA
```

#### Step 3: Restore Database (See Scenario 2)

#### Step 4: Deploy Services

```bash
# Once infrastructure is ready and the database is restored

# 1. Apply ConfigMaps
kubectl apply -f vapora-configmap.yaml

# 2. Apply Secrets
kubectl apply -f vapora-secrets.yaml

# 3. Deploy services
kubectl apply -f vapora-deployments.yaml

# 4. Wait for pods to start
kubectl rollout status deployment/vapora-backend -n vapora --timeout=10m

# 5. Verify health
curl http://localhost:8001/health
```

#### Step 5: Verification

```bash
# 1. Check all pods running
kubectl get pods -n vapora
# All should show: Running, 1/1 Ready

# 2. Verify database connectivity
kubectl logs deployment/vapora-backend -n vapora | tail -20
# Should show: "Successfully connected to database"

# 3. Test API
curl http://localhost:8001/api/projects
# Should return project list

# 4. Check data integrity
# Run validation queries in the SurrealDB SQL shell:
#   SELECT COUNT(*) FROM projects;  -- should be > 0
#   SELECT COUNT(*) FROM users;     -- should be > 0
#   SELECT COUNT(*) FROM tasks;     -- should be > 0
```
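
The validation queries can also be run non-interactively from the workstation. A sketch that mirrors the `surreal sql` invocation used in Scenario 2; the `COUNT(*)` form follows this runbook's examples, so adjust to your SurrealQL dialect if needed:

```bash
# Run the Step 5 count checks against each core table.
for table in projects users tasks; do
  echo "--- ${table} ---"
  kubectl exec -n vapora pod/surrealdb-0 -- \
    surreal sql --conn ws://localhost:8000 \
    --user root --pass "$DB_PASSWORD" \
    "SELECT COUNT(*) FROM ${table}"
done
```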

---

### Scenario 2: Database Corruption/Loss

**Symptoms**:

- Database queries return errors
- Data integrity issues
- Corruption detected in logs

**Recovery Steps**:

#### Step 1: Assess Database State (10 min)

```bash
# 1. Try to connect
kubectl exec -n vapora pod/surrealdb-0 -- \
  surreal sql --conn ws://localhost:8000 \
  --user root --pass "$DB_PASSWORD" \
  "SELECT COUNT(*) FROM projects"

# 2. Check for error messages
kubectl logs -n vapora pod/surrealdb-0 | tail -50 | grep -i error

# 3. Assess damage
# Is it:
# - Connection issue (might recover)
# - Data corruption (need restore)
# - Complete loss (restore from backup)
```

#### Step 2: Backup Current State (for forensics)

```bash
# Before attempting recovery, save the current state

# Export what remains
kubectl exec -n vapora pod/surrealdb-0 -- \
  surreal export --conn ws://localhost:8000 \
  --user root --pass "$DB_PASSWORD" \
  --output /tmp/corrupted-export.sql

# Download for analysis
kubectl cp vapora/surrealdb-0:/tmp/corrupted-export.sql \
  ./corrupted-export-$(date +%Y%m%d-%H%M%S).sql
```

#### Step 3: Identify Latest Good Backup

```bash
# Find the most recent backup taken before the corruption
aws s3 ls s3://vapora-backups/database/ --recursive | sort

# The latest backup timestamp should be within the last hour

# Download the backup
aws s3 cp s3://vapora-backups/database/2026-01-12/vapora-db-010000.sql.gz \
  ./vapora-db-restore.sql.gz

gunzip vapora-db-restore.sql.gz
```
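
Rather than hard-coding a dated key, the newest backup can be selected from the listing. A minimal sketch, assuming the bucket layout shown above:

```bash
# Pick the most recent backup object automatically.
# aws s3 ls --recursive prints: date  time  size  key
latest_key=$(aws s3 ls s3://vapora-backups/database/ --recursive \
  | sort | tail -n 1 | awk '{print $4}')
echo "Latest backup: ${latest_key}"
aws s3 cp "s3://vapora-backups/${latest_key}" ./vapora-db-restore.sql.gz
gunzip -f vapora-db-restore.sql.gz
```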

#### Step 4: Restore Database

```bash
# Option A: Restore to same database (destructive)
# WARNING: This will overwrite the current database

kubectl exec -n vapora pod/surrealdb-0 -- \
  rm -rf /var/lib/surrealdb/data.db

# Restart pod to reinitialize
kubectl delete pod -n vapora surrealdb-0
# Pod will restart with a clean database

# Copy the backup into the pod, then import it
kubectl cp ./vapora-db-restore.sql vapora/surrealdb-0:/tmp/
kubectl exec -n vapora pod/surrealdb-0 -- \
  surreal import --conn ws://localhost:8000 \
  --user root --pass "$DB_PASSWORD" \
  --input /tmp/vapora-db-restore.sql

# Wait for import to complete (5-15 minutes)
```

**Option B: Restore to temporary database (safer)**

```bash
# 1. Create a temporary database pod
kubectl run -n vapora restore-test --image=surrealdb/surrealdb:latest \
  -- start file:///tmp/restore-test

# 2. Restore into the temporary pod
kubectl cp ./vapora-db-restore.sql vapora/restore-test:/tmp/
kubectl exec -n vapora restore-test -- \
  surreal import --conn ws://localhost:8000 \
  --user root --pass "$DB_PASSWORD" \
  --input /tmp/vapora-db-restore.sql

# 3. Verify the restored data
kubectl exec -n vapora restore-test -- \
  surreal sql --conn ws://localhost:8000 \
  --user root --pass "$DB_PASSWORD" \
  "SELECT COUNT(*) FROM projects"

# 4. If good: restore production
kubectl delete pod -n vapora surrealdb-0
# Wait for the pod to restart
kubectl cp ./vapora-db-restore.sql vapora/surrealdb-0:/tmp/
kubectl exec -n vapora surrealdb-0 -- \
  surreal import --conn ws://localhost:8000 \
  --user root --pass "$DB_PASSWORD" \
  --input /tmp/vapora-db-restore.sql

# 5. Clean up the test pod
kubectl delete pod -n vapora restore-test
```

#### Step 5: Verify Recovery

```bash
# 1. Database responsive
kubectl exec -n vapora pod/surrealdb-0 -- \
  surreal sql --conn ws://localhost:8000 \
  --user root --pass "$DB_PASSWORD" \
  "SELECT COUNT(*) FROM projects"

# 2. Application can connect
kubectl logs deployment/vapora-backend -n vapora | tail -5
# Should show a successful connection

# 3. API working
curl http://localhost:8001/api/projects

# 4. Data valid
# Check record counts match pre-backup values
# Check key records for corruption
```

---

### Scenario 3: Configuration Corruption

**Symptoms**:

- Application misconfigured
- Pods failing to start
- Wrong values in environment

**Recovery Steps**:

#### Step 1: Identify Bad Configuration

```bash
# 1. Get the current ConfigMap
kubectl get configmap -n vapora vapora-config -o yaml > current-config.yaml

# 2. Fetch the known-good backup
aws s3 cp s3://vapora-backups/configs/2026-01-12/configmaps.yaml .

# 3. Diff to find issues
diff configmaps.yaml current-config.yaml
```

#### Step 2: Restore Previous Configuration

```bash
# 1. Get the previous ConfigMap from backup
aws s3 cp s3://vapora-backups/configs/2026-01-11/configmaps.yaml ./good-config.yaml

# 2. Apply the previous configuration
kubectl apply -f good-config.yaml

# 3. Restart pods to pick up the new config
kubectl rollout restart deployment/vapora-backend -n vapora
kubectl rollout restart deployment/vapora-agents -n vapora

# 4. Monitor the restart
kubectl get pods -n vapora -w
```

#### Step 3: Verify Configuration

```bash
# 1. Pods should restart and become Running
kubectl get pods -n vapora
# All should show: Running, 1/1 Ready

# 2. Check pod logs
kubectl logs deployment/vapora-backend -n vapora | tail -10
# Should show a successful startup

# 3. API operational
curl http://localhost:8001/health
```

---

### Scenario 4: Data Center/Region Outage

**Symptoms**:

- Entire region unreachable
- Multiple infrastructure components down
- Network connectivity issues

**Recovery Steps**:

#### Step 1: Declare Regional Failover

```bash
# 1. Confirm the region is down
ping production.vapora.com
# Should fail

# Check the status page
# Cloud provider should report an outage

# 2. Declare failover (placeholder - use your failover tooling)
# declare_failover_to_region "us-west-2"
```

#### Step 2: Activate Alternate Region

```bash
# 1. Switch kubeconfig to the alternate region
export KUBECONFIG=/path/to/backup-region/kubeconfig

# 2. Verify the alternate region is up
kubectl cluster-info

# 3. Download and restore the database
aws s3 cp s3://vapora-backups/database/latest/ . --recursive

# 4. Restore services (as in Scenario 1, Step 4)
```

#### Step 3: Update DNS/Routing

```bash
# Update DNS to point to the alternate region
aws route53 change-resource-record-sets \
  --hosted-zone-id Z123456 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.vapora.com",
        "Type": "A",
        "AliasTarget": {
          "HostedZoneId": "Z987654",
          "DNSName": "backup-region-lb.elb.amazonaws.com",
          "EvaluateTargetHealth": false
        }
      }
    }]
  }'

# Wait for DNS propagation (5-10 minutes)
```
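
Propagation can be polled rather than timed. A minimal sketch using `dig`; since ELB IPs rotate, both names are re-resolved on every iteration:

```bash
# Wait until api.vapora.com resolves to the same A records as the backup LB.
until [ "$(dig +short api.vapora.com | sort)" = \
        "$(dig +short backup-region-lb.elb.amazonaws.com | sort)" ]; do
  echo "Waiting for DNS propagation..."
  sleep 30
done
echo "DNS now points at the backup region"
```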

#### Step 4: Verify Failover

```bash
# 1. DNS resolves to the new region
nslookup api.vapora.com

# 2. Services accessible
curl https://api.vapora.com/health

# 3. Data intact
curl https://api.vapora.com/api/projects
```

#### Step 5: Communicate Failover

```
Post to #incident-critical:

✅ FAILOVER TO ALTERNATE REGION COMPLETE

Primary Region: us-east-1 (Down)
Active Region: us-west-2 (Restored)

Status:
- All services running: ✓
- Database restored: ✓
- Data integrity verified: ✓
- Partial data loss: ~30 minutes of transactions

Estimated Data Loss: 30 minutes (11:30-12:00 UTC)
Current Time: 12:05 UTC

Next steps:
- Monitor alternate region closely
- Begin investigation of primary region
- Plan failback when primary recovered

Questions? /cc @ops-team
```

---

## Post-Disaster Recovery

### Phase 1: Stabilization (Ongoing)

```
Continue monitoring for 4 hours minimum

Checks every 15 minutes:
✓ All pods Running
✓ API responding
✓ Database queries working
✓ Error rates normal
✓ Performance at baseline
```
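
A minimal sketch of this loop, reusing the endpoints from earlier steps; 16 iterations at 15-minute intervals cover the 4-hour window, and alert wiring is left to your monitoring stack:

```bash
# Stabilization checks: every 15 minutes for 4 hours.
for i in $(seq 1 16); do
  echo "--- check ${i} at $(date -u) ---"

  # Pods not in Running state (prints nothing when healthy)
  kubectl get pods -n vapora --no-headers | awk '$3 != "Running" {print "NOT RUNNING:", $1}'

  # API health
  if curl -fsS http://localhost:8001/health > /dev/null; then
    echo "API: OK"
  else
    echo "API: FAIL"
  fi

  sleep 900
done
```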

### Phase 2: Root Cause Analysis

**Start within 1 hour of service recovery**:

```
Questions to answer:

1. What caused the disaster?
   - Hardware failure
   - Software bug
   - Configuration error
   - External attack
   - Human error

2. Why wasn't it detected earlier?
   - Monitoring gap
   - Alert misconfiguration
   - Alert fatigue

3. How did backups perform?
   - Were they accessible?
   - Restore time as expected?
   - Data loss acceptable?

4. What took longest in recovery?
   - Finding backups
   - Restoring database
   - Redeploying services
   - Verifying integrity

5. What can be improved?
   - Faster detection
   - Faster recovery
   - Better documentation
   - More automated recovery
```

### Phase 3: Recovery Documentation

```
Create post-disaster report:

Timeline:
- 11:30 UTC: Disaster detected
- 11:35 UTC: Database restore started
- 11:50 UTC: Services redeployed
- 12:00 UTC: All systems operational
- Duration: 30 minutes

Impact:
- Users affected: [X]
- Data lost: [X] transactions
- Revenue impact: $[X]

Root cause: [Description]

Contributing factors:
1. [Factor 1]
2. [Factor 2]

Preventive measures:
1. [Action] by [Owner] by [Date]
2. [Action] by [Owner] by [Date]

Lessons learned:
1. [Lesson 1]
2. [Lesson 2]
```

### Phase 4: Improvements Implementation

**Due date: Within 2 weeks**

```
Checklist for improvements:

□ Update backup strategy (if needed)
□ Improve monitoring/alerting
□ Automate more recovery steps
□ Update runbooks with learnings
□ Train team on new procedures
□ Test improved procedures
□ Document for future reference
□ Incident retrospective meeting
```

---

## Disaster Recovery Drill

### Quarterly DR Drill

**Purpose**: Test DR procedures before a real disaster

**Schedule**: Last Friday of each quarter at 02:00 UTC

```nu
def quarterly_dr_drill [] {
    print "=== QUARTERLY DISASTER RECOVERY DRILL ==="
    print $"Date: (date now | format date '%Y-%m-%d %H:%M:%S') UTC"
    print ""

    # 1. Simulate database corruption
    print "1. Simulating database corruption..."
    # Create test database, introduce corruption

    # 2. Test restore procedure
    print "2. Testing restore from backup..."
    # Download backup, restore to test database

    # 3. Measure restore time
    let start_time = (date now)
    # ... restore process ...
    let end_time = (date now)
    let duration = $end_time - $start_time
    print $"Restore time: ($duration)"

    # 4. Verify data integrity
    print "3. Verifying data integrity..."
    # Check restored data matches pre-backup

    # 5. Document results
    print "4. Documenting results..."
    # Record in DR drill log

    print ""
    print "Drill complete"
}
```

### Drill Checklist

```
Pre-Drill (1 week before):
□ Notify team of scheduled drill
□ Plan specific scenario to test
□ Prepare test environment
□ Have runbooks available

During Drill:
□ Execute scenario as planned
□ Record actual timings
□ Document any issues
□ Note what went well
□ Note what could improve

Post-Drill (within 1 day):
□ Debrief meeting
□ Review recorded times vs. targets
□ Discuss improvements
□ Update runbooks if needed
□ Thank team for participation
□ Document lessons learned

Post-Drill (within 1 week):
□ Implement identified improvements
□ Test improvements
□ Verify procedures updated
□ Archive drill documentation
```

---

## Disaster Recovery Readiness

### Recovery Readiness Checklist

```
Infrastructure:
□ Primary region configured
□ Backup region prepared
□ Load balancing configured
□ DNS failover configured

Data:
□ Hourly database backups
□ Backups encrypted
□ Backups tested (monthly)
□ Multiple backup locations

Configuration:
□ ConfigMaps backed up (daily)
□ Secrets encrypted and backed up
□ Infrastructure code in Git
□ Deployment manifests versioned

Documentation:
□ Disaster procedures documented
□ Runbooks current and tested
□ Team trained on procedures
□ Escalation paths clear

Testing:
□ Monthly restore test passes
□ Quarterly DR drill scheduled
□ Recovery times meet RTO/RPO

Monitoring:
□ Alerts for backup failures
□ Backup health checks running
□ Recovery procedures monitored
```
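
The hourly-backup item can be verified mechanically. A minimal sketch, assuming the `s3://vapora-backups/database/` layout used elsewhere in this runbook and GNU `date`:

```bash
# Alert if the newest database backup is older than the hourly target.
latest=$(aws s3 ls s3://vapora-backups/database/ --recursive \
  | sort | tail -n 1 | awk '{print $1" "$2}')
age_min=$(( ( $(date +%s) - $(date -d "$latest" +%s) ) / 60 ))
if [ "$age_min" -gt 60 ]; then
  echo "ALERT: newest backup is ${age_min} minutes old (target: <60)"
else
  echo "OK: newest backup is ${age_min} minutes old"
fi
```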

### RTO/RPO Targets

| Scenario | RTO | RPO |
|----------|-----|-----|
| **Single pod failure** | 5 min | 0 min |
| **Database corruption** | 1 hour | 1 hour |
| **Node failure** | 15 min | 0 min |
| **Region outage** | 2 hours | 15 min |
| **Complete cluster loss** | 4 hours | 1 hour |

---

## Disaster Recovery Contacts

```
Role:           Contact:   Phone:    Slack:
Primary DBA:    [Name]     [Phone]   @[slack]
Backup DBA:     [Name]     [Phone]   @[slack]
Infra Lead:     [Name]     [Phone]   @[slack]
Backup Infra:   [Name]     [Phone]   @[slack]
CTO:            [Name]     [Phone]   @[slack]
Ops Manager:    [Name]     [Phone]   @[slack]

Escalation:
Level 1: [Name] - notify immediately
Level 2: [Name] - notify within 5 min
Level 3: [Name] - notify within 15 min
```

---

## Quick Reference: Disaster Steps

```
1. ASSESS (First 5 min)
   - Determine disaster severity
   - Assess damage scope
   - Get backup location access

2. COMMUNICATE (Immediately)
   - Declare disaster
   - Notify key personnel
   - Start status updates (every 5 min)

3. RECOVER (Next 30-120 min)
   - Activate backup infrastructure if needed
   - Restore database from latest backup
   - Redeploy applications
   - Verify all systems operational

4. VERIFY (Continuous)
   - Check pod health
   - Verify database connectivity
   - Test API endpoints
   - Monitor error rates

5. STABILIZE (Next 4 hours)
   - Monitor closely
   - Watch for anomalies
   - Verify performance normal
   - Check data integrity

6. INVESTIGATE (Within 1 hour)
   - Root cause analysis
   - Document what happened
   - Plan improvements
   - Update procedures

7. IMPROVE (Within 2 weeks)
   - Implement improvements
   - Test improvements
   - Update documentation
   - Train team
```