# Disaster Recovery Runbook
Step-by-step procedures for recovering VAPORA from various disaster scenarios.
---
## Disaster Severity Levels
### Level 1: Critical 🔴
**Complete Service Loss** - Entire VAPORA unavailable
Examples:
- Complete cluster failure
- Complete data center outage
- Database completely corrupted
- All backups inaccessible
RTO (Recovery Time Objective): 2-4 hours
RPO (Recovery Point Objective): Up to 1 hour of data loss possible
### Level 2: Major 🟠
**Partial Service Loss** - Some services unavailable
Examples:
- Single region down
- Database corrupted but backups available
- One service completely failed
- Primary storage unavailable
RTO: 30 minutes - 2 hours
RPO: Minimal data loss
### Level 3: Minor 🟡
**Degraded Service** - Service running but with issues
Examples:
- Performance issues
- One pod crashed
- Database connection issues
- High error rate
RTO: 5-15 minutes
RPO: No data loss
---
## Disaster Assessment (First 5 Minutes)
### Step 1: Declare Disaster State
Run these quick checks to determine whether a disaster should be declared:
```bash
# Q1: Is the service accessible?
curl -v https://api.vapora.com/health
# Q2: How many pods are running?
kubectl get pods -n vapora
# Q3: Can we access the database?
kubectl exec -n vapora pod/<name> -- \
surreal sql --conn ws://localhost:8000 \
--user root --pass "$DB_PASSWORD" \
"SELECT * FROM projects LIMIT 1"
# Q4: Are backups available?
aws s3 ls s3://vapora-backups/
```
**Decision Tree**:
```
Can access service normally?
YES → No disaster, escalate to incident response
NO → Continue
Can reach any pods?
YES → Partial disaster (Level 2-3)
NO → Likely total disaster (Level 1)
Can reach database?
YES → Application issue, not data issue
NO → Database issue, need restoration
Are backups accessible?
YES → Recovery likely possible
NO → Critical situation, activate backup locations
```
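The decision tree can be partially automated. Below is a minimal triage sketch, assuming the endpoint, namespace, and bucket shown above; the database check (Q3) is left manual because it needs credentials, and the output is a suggestion, not a verdict.
```bash
#!/usr/bin/env bash
# Rough triage helper for the decision tree above.
api_ok=false
backups_ok=false

# Q1: Is the service accessible?
if curl -fsS --max-time 10 https://api.vapora.com/health >/dev/null; then
  api_ok=true
fi

# Q2: How many pods are running?
pods_running=$(kubectl get pods -n vapora --no-headers 2>/dev/null | grep -c ' Running ' || true)

# Q4: Are backups available?
if aws s3 ls s3://vapora-backups/ >/dev/null 2>&1; then
  backups_ok=true
fi

echo "API reachable:    $api_ok"
echo "Pods Running:     $pods_running"
echo "Backups listable: $backups_ok"

if $api_ok; then
  echo "Suggested: no disaster - use normal incident response"
elif [ "$pods_running" -eq 0 ]; then
  echo "Suggested: Level 1 (Critical) - activate full DR plan"
else
  echo "Suggested: Level 2-3 - continue assessment"
fi
```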
### Step 2: Severity Assignment
Based on assessment:
```
# Level 1 Criteria (Critical)
- 0 pods running in vapora namespace
- Database completely unreachable
- All backup locations inaccessible
- Service down >30 minutes
# Level 2 Criteria (Major)
- <50% pods running
- Database reachable but degraded
- Primary backups inaccessible but secondary available
- Service down 5-30 minutes
# Level 3 Criteria (Minor)
- >75% pods running
- Database responsive but with errors
- Backups accessible
- Service down <5 minutes
Assignment: Level ___
If Level 1: Activate full DR plan
If Level 2: Activate partial DR plan
If Level 3: Use normal incident response
```
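For the pod-percentage criteria, a quick ratio check helps. A sketch, assuming all VAPORA workloads run as Deployments in the vapora namespace:
```bash
# Sum desired vs. ready replicas across all Deployments in the namespace.
desired=$(kubectl get deploy -n vapora -o jsonpath='{range .items[*]}{.spec.replicas}{"\n"}{end}' \
  | awk '{s+=$1} END {print s+0}')
ready=$(kubectl get deploy -n vapora -o jsonpath='{range .items[*]}{.status.readyReplicas}{"\n"}{end}' \
  | awk '{s+=$1} END {print s+0}')
echo "Ready: $ready / Desired: $desired"
if [ "$desired" -gt 0 ]; then
  echo "Running: $(( 100 * ready / desired ))%"
fi
```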
### Step 3: Notify Key Personnel
```
# For Level 1 (Critical) DR
send_message_to = [
"@cto",
"@ops-manager",
"@database-team",
"@infrastructure-team",
"@product-manager"
]
message = """
🔴 DISASTER DECLARED - LEVEL 1 CRITICAL
Service: VAPORA (Complete Outage)
Severity: Critical
Time Declared: [UTC]
Status: Assessing
Actions underway:
1. Activating disaster recovery procedures
2. Notifying stakeholders
3. Engaging full team
Next update: [+5 min]
/cc @all-involved
"""
post_to_slack("#incident-critical")
page_on_call_manager(urgent=true)
```
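If the `post_to_slack` / `page_on_call_manager` helpers above are not available, the message can be sent with a plain webhook call. A minimal sketch, assuming a Slack incoming-webhook URL in `$SLACK_WEBHOOK_URL` and `jq` on the operator workstation; paging is left to whatever tool the team uses.
```bash
# Post the disaster declaration to the incident channel via webhook.
MSG=$(cat <<'EOF'
🔴 DISASTER DECLARED - LEVEL 1 CRITICAL
Service: VAPORA (Complete Outage)
Time Declared: [UTC]
Next update: [+5 min]
EOF
)
curl -fsS -X POST -H 'Content-Type: application/json' \
  -d "$(jq -n --arg text "$MSG" '{text: $text}')" \
  "$SLACK_WEBHOOK_URL"
```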
---
## Disaster Scenario Procedures
### Scenario 1: Complete Cluster Failure
**Symptoms**:
- kubectl commands time out or fail
- No pods running in any namespace
- Nodes unreachable
- All services down
**Recovery Steps**:
#### Step 1: Assess Infrastructure (5 min)
```bash
# Try basic cluster operations
kubectl cluster-info
# If: "Unable to connect to the server"
# Check cloud provider status
# AWS: Check AWS status page, check EC2 instances
# GKE: Check Google Cloud console
# On-prem: Check infrastructure team
# Determine: Is infrastructure failed or just connectivity?
```
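On AWS, a quick way to distinguish failed infrastructure from lost connectivity is to ask EC2 directly. The tag filter below is an assumption (a typical EKS cluster tag); substitute whatever tag your node group actually carries.
```bash
# List node instances and their states, independent of kubectl.
aws ec2 describe-instances \
  --filters "Name=tag:kubernetes.io/cluster/vapora,Values=owned" \
  --query 'Reservations[].Instances[].[InstanceId,State.Name]' \
  --output table
```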
#### Step 2: If Infrastructure Failed
**Activate Secondary Infrastructure** (if available):
```bash
# 1. Access backup/secondary infrastructure
export KUBECONFIG=/path/to/backup/kubeconfig
# 2. Verify it's operational
kubectl cluster-info
kubectl get nodes
# 3. Prepare for database restore
# (See: Scenario 2 - Database Recovery)
```
**If No Secondary**: Activate failover to alternate region
```bash
# 1. Contact cloud provider
# AWS: Open support case - request emergency instance launch
# GKE: Request cluster creation in different region
# 2. While infrastructure rebuilds:
# - Retrieve backups
# - Prepare restore scripts
# - Brief team on ETA
```
#### Step 3: Restore Database (See Scenario 2)
#### Step 4: Deploy Services
```bash
# Once infrastructure ready and database restored
# 1. Apply ConfigMaps
kubectl apply -f vapora-configmap.yaml
# 2. Apply Secrets
kubectl apply -f vapora-secrets.yaml
# 3. Deploy services
kubectl apply -f vapora-deployments.yaml
# 4. Wait for pods to start
kubectl rollout status deployment/vapora-backend -n vapora --timeout=10m
# 5. Verify health
curl http://localhost:8001/health
```
#### Step 5: Verification
```bash
# 1. Check all pods running
kubectl get pods -n vapora
# All should show: Running, 1/1 Ready
# 2. Verify database connectivity
kubectl logs deployment/vapora-backend -n vapora | tail -20
# Should show: "Successfully connected to database"
# 3. Test API
curl http://localhost:8001/api/projects
# Should return project list
# 4. Check data integrity
# Run validation queries against the database (via the surrealdb pod):
#   SELECT COUNT(*) FROM projects;  -- should be > 0
#   SELECT COUNT(*) FROM users;     -- should be > 0
#   SELECT COUNT(*) FROM tasks;     -- should be > 0
```
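The validation counts can be run in one pass through the database pod, following the `surreal sql` invocation used elsewhere in this runbook:
```bash
# Count rows in the key tables; each result should be greater than zero.
for table in projects users tasks; do
  kubectl exec -n vapora pod/surrealdb-0 -- \
    surreal sql --conn ws://localhost:8000 \
    --user root --pass "$DB_PASSWORD" \
    "SELECT COUNT(*) FROM $table"
done
```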
---
### Scenario 2: Database Corruption/Loss
**Symptoms**:
- Database queries return errors
- Data integrity issues
- Corruption detected in logs
**Recovery Steps**:
#### Step 1: Assess Database State (10 min)
```bash
# 1. Try to connect
kubectl exec -n vapora pod/surrealdb-0 -- \
surreal sql --conn ws://localhost:8000 \
--user root --pass "$DB_PASSWORD" \
"SELECT COUNT(*) FROM projects"
# 2. Check for error messages
kubectl logs -n vapora pod/surrealdb-0 | tail -50 | grep -i error
# 3. Assess damage
# Is it:
# - Connection issue (might recover)
# - Data corruption (need restore)
# - Complete loss (restore from backup)
```
#### Step 2: Backup Current State (for forensics)
```bash
# Before attempting recovery, save current state
# Export what's remaining
kubectl exec -n vapora pod/surrealdb-0 -- \
surreal export --conn ws://localhost:8000 \
--user root --pass "$DB_PASSWORD" \
--output /tmp/corrupted-export.sql
# Download for analysis
kubectl cp vapora/surrealdb-0:/tmp/corrupted-export.sql \
./corrupted-export-$(date +%Y%m%d-%H%M%S).sql
```
#### Step 3: Identify Latest Good Backup
```bash
# Find most recent backup before corruption
aws s3 ls s3://vapora-backups/database/ --recursive | sort
# Latest backup timestamp
# Should be within last hour
# Download backup
aws s3 cp s3://vapora-backups/database/2026-01-12/vapora-db-010000.sql.gz \
./vapora-db-restore.sql.gz
gunzip vapora-db-restore.sql.gz
```
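To avoid eyeballing the listing under time pressure, the newest backup object can be selected programmatically. A sketch, assuming the date-stamped key layout shown above keeps lexical and chronological order aligned:
```bash
# Pick the newest backup key from the recursive listing and download it.
LATEST_KEY=$(aws s3 ls s3://vapora-backups/database/ --recursive \
  | sort | tail -n 1 | awk '{print $4}')
echo "Latest backup: $LATEST_KEY"
aws s3 cp "s3://vapora-backups/$LATEST_KEY" ./vapora-db-restore.sql.gz
gunzip -f ./vapora-db-restore.sql.gz
```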
#### Step 4: Restore Database
```bash
# Option A: Restore to same database (destructive)
# WARNING: This will overwrite current database
kubectl exec -n vapora pod/surrealdb-0 -- \
rm -rf /var/lib/surrealdb/data.db
# Restart pod to reinitialize
kubectl delete pod -n vapora surrealdb-0
# Pod will restart with clean database
# Copy the backup into the pod, then import it
kubectl cp ./vapora-db-restore.sql vapora/surrealdb-0:/tmp/vapora-db-restore.sql
kubectl exec -n vapora pod/surrealdb-0 -- \
surreal import --conn ws://localhost:8000 \
--user root --pass "$DB_PASSWORD" \
--input /tmp/vapora-db-restore.sql
# Wait for import to complete (5-15 minutes)
```
**Option B: Restore to temporary database (safer)**
```bash
# 1. Create temporary database pod
kubectl run -n vapora restore-test --image=surrealdb/surrealdb:latest \
-- start file:///tmp/restore-test
# 2. Restore to temporary
kubectl cp ./vapora-db-restore.sql vapora/restore-test:/tmp/
kubectl exec -n vapora restore-test -- \
surreal import --conn ws://localhost:8000 \
--user root --pass "$DB_PASSWORD" \
--input /tmp/vapora-db-restore.sql
# 3. Verify restored data
kubectl exec -n vapora restore-test -- \
surreal sql --conn ws://localhost:8000 \
--user root --pass "$DB_PASSWORD" \
"SELECT COUNT(*) FROM projects"
# 4. If good: Restore production
kubectl delete pod -n vapora surrealdb-0
# Wait for pod restart
kubectl cp ./vapora-db-restore.sql vapora/surrealdb-0:/tmp/
kubectl exec -n vapora surrealdb-0 -- \
surreal import --conn ws://localhost:8000 \
--user root --pass "$DB_PASSWORD" \
--input /tmp/vapora-db-restore.sql
# 5. Cleanup test pod
kubectl delete pod -n vapora restore-test
```
#### Step 5: Verify Recovery
```bash
# 1. Database responsive
kubectl exec -n vapora pod/surrealdb-0 -- \
surreal sql --conn ws://localhost:8000 \
--user root --pass "$DB_PASSWORD" \
"SELECT COUNT(*) FROM projects"
# 2. Application can connect
kubectl logs deployment/vapora-backend -n vapora | tail -5
# Should show successful connection
# 3. API working
curl http://localhost:8001/api/projects
# 4. Data valid
# Check record counts match pre-backup
# Check no corruption in key records
```
---
### Scenario 3: Configuration Corruption
**Symptoms**:
- Application misconfigured
- Pods failing to start
- Wrong values in environment
**Recovery Steps**:
#### Step 1: Identify Bad Configuration
```bash
# 1. Get current ConfigMap
kubectl get configmap -n vapora vapora-config -o yaml > current-config.yaml
# 2. Compare with known-good backup
aws s3 cp s3://vapora-backups/configs/2026-01-12/configmaps.yaml .
# 3. Diff to find issues
diff configmaps.yaml current-config.yaml
```
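As an alternative to a local `diff`, `kubectl diff` compares the backed-up manifest against what is live in the cluster and shows exactly what re-applying it would change:
```bash
# Server-side comparison of the backup against the live ConfigMaps.
kubectl diff -f ./configmaps.yaml || true   # exits non-zero when differences exist
```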
#### Step 2: Restore Previous Configuration
```bash
# 1. Get previous ConfigMap from backup
aws s3 cp s3://vapora-backups/configs/2026-01-11/configmaps.yaml ./good-config.yaml
# 2. Apply previous configuration
kubectl apply -f good-config.yaml
# 3. Restart pods to pick up new config
kubectl rollout restart deployment/vapora-backend -n vapora
kubectl rollout restart deployment/vapora-agents -n vapora
# 4. Monitor restart
kubectl get pods -n vapora -w
```
#### Step 3: Verify Configuration
```bash
# 1. Pods should restart and become Running
kubectl get pods -n vapora
# All should show: Running, 1/1 Ready
# 2. Check pod logs
kubectl logs deployment/vapora-backend -n vapora | tail -10
# Should show successful startup
# 3. API operational
curl http://localhost:8001/health
```
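Optionally, spot-check that the restarted pods actually picked up the restored values. The `VAPORA_` prefix below is a placeholder; grep for a key you know lives in the ConfigMap.
```bash
# Inspect the environment of a running backend pod for restored config values.
kubectl exec -n vapora deploy/vapora-backend -- env | grep VAPORA_ || true
```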
---
### Scenario 4: Data Center/Region Outage
**Symptoms**:
- Entire region unreachable
- Multiple infrastructure components down
- Network connectivity issues
**Recovery Steps**:
#### Step 1: Declare Regional Failover
```bash
# 1. Confirm region is down
ping production.vapora.com
# Should fail
# Check status page
# Cloud provider should report outage
# 2. Declare failover (placeholder call - substitute your team's failover
#    automation, or carry out Step 2 below manually)
declare_failover_to_region("us-west-2")
```
#### Step 2: Activate Alternate Region
```bash
# 1. Switch kubeconfig to alternate region
export KUBECONFIG=/path/to/backup-region/kubeconfig
# 2. Verify alternate region up
kubectl cluster-info
# 3. Download and restore database
aws s3 cp s3://vapora-backups/database/latest/ . --recursive
# 4. Restore services (as in Scenario 1, Step 4)
```
#### Step 3: Update DNS/Routing
```bash
# Update DNS to point to alternate region
aws route53 change-resource-record-sets \
--hosted-zone-id Z123456 \
--change-batch '{
"Changes": [{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "api.vapora.com",
"Type": "A",
"AliasTarget": {
"HostedZoneId": "Z987654",
"DNSName": "backup-region-lb.elb.amazonaws.com",
"EvaluateTargetHealth": false
}
}
}]
}'
# Wait for DNS propagation (5-10 minutes)
```
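Rather than waiting a fixed interval, propagation can be watched. A sketch assuming `dig` on the operator workstation; ELB addresses rotate, so each pass resolves the load balancer fresh before comparing.
```bash
# Poll until api.vapora.com returns the backup-region load balancer's addresses.
while true; do
  lb_ips=$(dig +short backup-region-lb.elb.amazonaws.com | sort)
  api_ips=$(dig +short api.vapora.com | sort)
  if [ -n "$api_ips" ] && [ "$api_ips" = "$lb_ips" ]; then
    echo "DNS cutover complete"
    break
  fi
  echo "Waiting for DNS propagation..."
  sleep 30
done
```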
#### Step 4: Verify Failover
```bash
# 1. DNS resolves to new region
nslookup api.vapora.com
# 2. Services accessible
curl https://api.vapora.com/health
# 3. Data intact
curl https://api.vapora.com/api/projects
```
#### Step 5: Communicate Failover
```
Post to #incident-critical:
✅ FAILOVER TO ALTERNATE REGION COMPLETE
Primary Region: us-east-1 (Down)
Active Region: us-west-2 (Restored)
Status:
- All services running: ✓
- Database restored: ✓
- Data integrity verified: ✓
- Partial data loss: ~30 minutes of transactions
Estimated Data Loss: 30 minutes (11:30-12:00 UTC)
Current Time: 12:05 UTC
Next steps:
- Monitor alternate region closely
- Begin investigation of primary region
- Plan failback when primary recovered
Questions? /cc @ops-team
```
---
## Post-Disaster Recovery
### Phase 1: Stabilization (Ongoing)
```
# Continue monitoring for 4 hours minimum
# Checks every 15 minutes:
✓ All pods Running
✓ API responding
✓ Database queries working
✓ Error rates normal
✓ Performance baseline
```
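A simple watch loop covers the 4-hour stabilization window. Endpoints follow the examples used earlier in this runbook, and the 15-minute cadence matches the checklist above.
```bash
# Run the core checks every 15 minutes for 4 hours (16 iterations).
for i in $(seq 1 16); do
  echo "=== Stabilization check $i/16: $(date -u) ==="
  kubectl get pods -n vapora --no-headers | grep -v ' Running ' || echo "All pods Running"
  curl -fsS --max-time 10 https://api.vapora.com/health || echo "WARN: health check failed"
  kubectl exec -n vapora pod/surrealdb-0 -- \
    surreal sql --conn ws://localhost:8000 \
    --user root --pass "$DB_PASSWORD" \
    "SELECT COUNT(*) FROM projects" || echo "WARN: database query failed"
  sleep 900
done
```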
### Phase 2: Root Cause Analysis
**Start within 1 hour of service recovery**:
```
Questions to answer:
1. What caused the disaster?
- Hardware failure
- Software bug
- Configuration error
- External attack
- Human error
2. Why wasn't it detected earlier?
- Monitoring gap
- Alert misconfiguration
- Alert fatigue
3. How did backups perform?
- Were they accessible?
- Restore time as expected?
- Data loss acceptable?
4. What took longest in recovery?
- Finding backups
- Restoring database
- Redeploying services
- Verifying integrity
5. What can be improved?
- Faster detection
- Faster recovery
- Better documentation
- More automated recovery
```
### Phase 3: Recovery Documentation
```
Create post-disaster report:
Timeline:
- 11:30 UTC: Disaster detected
- 11:35 UTC: Database restore started
- 11:50 UTC: Services redeployed
- 12:00 UTC: All systems operational
- Duration: 30 minutes
Impact:
- Users affected: [X]
- Data lost: [X] transactions
- Revenue impact: $[X]
Root cause: [Description]
Contributing factors:
1. [Factor 1]
2. [Factor 2]
Preventive measures:
1. [Action] by [Owner] by [Date]
2. [Action] by [Owner] by [Date]
Lessons learned:
1. [Lesson 1]
2. [Lesson 2]
```
### Phase 4: Improvements Implementation
**Due date: Within 2 weeks**
```
Checklist for improvements:
□ Update backup strategy (if needed)
□ Improve monitoring/alerting
□ Automate more recovery steps
□ Update runbooks with learnings
□ Train team on new procedures
□ Test improved procedures
□ Document for future reference
□ Incident retrospective meeting
```
---
## Disaster Recovery Drill
### Quarterly DR Drill
**Purpose**: Test DR procedures before real disaster
**Schedule**: Last Friday of each quarter at 02:00 UTC
```nu
def quarterly_dr_drill [] {
    print "=== QUARTERLY DISASTER RECOVERY DRILL ==="
    print $"Date: (date now | format date '%Y-%m-%d %H:%M:%S') UTC"
    print ""

    # 1. Simulate database corruption
    print "1. Simulating database corruption..."
    # Create test database, introduce corruption

    # 2. Test restore procedure
    print "2. Testing restore from backup..."
    # Download backup, restore to test database

    # 3. Measure restore time
    let start_time = (date now)
    # ... restore process ...
    let end_time = (date now)
    let duration = $end_time - $start_time
    print $"Restore time: ($duration)"

    # 4. Verify data integrity
    print "3. Verifying data integrity..."
    # Check restored data matches pre-backup

    # 5. Document results
    print "4. Documenting results..."
    # Record in DR drill log

    print ""
    print "Drill complete"
}
```
### Drill Checklist
```
Pre-Drill (1 week before):
□ Notify team of scheduled drill
□ Plan specific scenario to test
□ Prepare test environment
□ Have runbooks available
During Drill:
□ Execute scenario as planned
□ Record actual timings
□ Document any issues
□ Note what went well
□ Note what could improve
Post-Drill (within 1 day):
□ Debrief meeting
□ Review recorded times vs. targets
□ Discuss improvements
□ Update runbooks if needed
□ Thank team for participation
□ Document lessons learned
Post-Drill (within 1 week):
□ Implement identified improvements
□ Test improvements
□ Verify procedures updated
□ Archive drill documentation
```
---
## Disaster Recovery Readiness
### Recovery Readiness Checklist
```
Infrastructure:
□ Primary region configured
□ Backup region prepared
□ Load balancing configured
□ DNS failover configured
Data:
□ Hourly database backups
□ Backups encrypted
□ Backups tested (monthly)
□ Multiple backup locations
Configuration:
□ ConfigMaps backed up (daily)
□ Secrets encrypted and backed up
□ Infrastructure code in Git
□ Deployment manifests versioned
Documentation:
□ Disaster procedures documented
□ Runbooks current and tested
□ Team trained on procedures
□ Escalation paths clear
Testing:
□ Monthly restore test passes
□ Quarterly DR drill scheduled
□ Recovery times meet RTO/RPO targets
Monitoring:
□ Alerts for backup failures
□ Backup health checks running
□ Recovery procedures monitored
```
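For the backup health-check item, a recency probe is often enough. A sketch assuming GNU `date` and the bucket layout used in this runbook; wiring the alert into your paging tool is left open.
```bash
# Alert if the newest database backup object is older than 2 hours.
latest_ts=$(aws s3 ls s3://vapora-backups/database/ --recursive \
  | sort | tail -n 1 | awk '{print $1" "$2}')
latest_epoch=$(date -d "$latest_ts" +%s)
age=$(( $(date +%s) - latest_epoch ))
if [ "$age" -gt 7200 ]; then
  echo "ALERT: newest backup is $((age / 60)) minutes old"
fi
```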
### RTO/RPO Targets
| Scenario | RTO | RPO |
|----------|-----|-----|
| **Single pod failure** | 5 min | 0 min |
| **Database corruption** | 1 hour | 1 hour |
| **Node failure** | 15 min | 0 min |
| **Region outage** | 2 hours | 15 min |
| **Complete cluster loss** | 4 hours | 1 hour |
---
## Disaster Recovery Contacts
| Role | Contact | Phone | Slack |
|------|---------|-------|-------|
| Primary DBA | [Name] | [Phone] | @[slack] |
| Backup DBA | [Name] | [Phone] | @[slack] |
| Infra Lead | [Name] | [Phone] | @[slack] |
| Backup Infra | [Name] | [Phone] | @[slack] |
| CTO | [Name] | [Phone] | @[slack] |
| Ops Manager | [Name] | [Phone] | @[slack] |

Escalation:
- Level 1: [Name] - notify immediately
- Level 2: [Name] - notify within 5 min
- Level 3: [Name] - notify within 15 min
---
## Quick Reference: Disaster Steps
```
1. ASSESS (First 5 min)
- Determine disaster severity
- Assess damage scope
- Get backup location access
2. COMMUNICATE (Immediately)
- Declare disaster
- Notify key personnel
- Start status updates (every 5 min)
3. RECOVER (Next 30-120 min)
- Activate backup infrastructure if needed
- Restore database from latest backup
- Redeploy applications
- Verify all systems operational
4. VERIFY (Continuous)
- Check pod health
- Verify database connectivity
- Test API endpoints
- Monitor error rates
5. STABILIZE (Next 4 hours)
- Monitor closely
- Watch for anomalies
- Verify performance normal
- Check data integrity
6. INVESTIGATE (Within 1 hour)
- Root cause analysis
- Document what happened
- Plan improvements
- Update procedures
7. IMPROVE (Within 2 weeks)
- Implement improvements
- Test improvements
- Update documentation
- Train team
```