
# Database Recovery Procedures
Detailed procedures for recovering SurrealDB in various failure scenarios.
---
## Quick Reference: Recovery Methods
| Scenario | Method | Time | Data Loss |
|----------|--------|------|-----------|
| **Pod restart** | Automatic pod recovery | 2 min | 0 |
| **Pod crash** | Persistent volume intact | 3 min | 0 |
| **Corrupted pod** | Restart from snapshot | 5 min | 0 |
| **Corrupted database** | Restore from backup | 15 min | 0-60 min |
| **Complete loss** | Restore from backup | 30 min | 0-60 min |
---
## SurrealDB Architecture
```
VAPORA Database Layer

SurrealDB Pod (Kubernetes)
├── PersistentVolume: /var/lib/surrealdb/
├── Data file: data.db (RocksDB)
├── Index files: *.idx
└── WAL (write-ahead log): *.wal

Backed up to:
├── Hourly exports: S3 backups/database/
├── Volume snapshots: AWS EBS / GCP PD snapshots
└── Archive backups: Glacier (monthly)
```
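Before any recovery, it is worth confirming that the newest hourly export actually exists. A minimal check, assuming the backup keys embed a sortable timestamp as in the examples later in this document:

```bash
# Show the newest hourly export (keys sort lexicographically by timestamp)
aws s3 ls s3://vapora-backups/database/ | sort | tail -1
# Confirm the timestamp is within the last hour before relying on it
```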
---
## Scenario 1: Pod Restart (Most Common)
**Cause**: Node maintenance, resource limits, health check failure
**Duration**: 2-3 minutes
**Data Loss**: None
### Recovery Procedure
```bash
# Most of the time, just restart the pod
# 1. Delete the pod
kubectl delete pod -n vapora surrealdb-0
# 2. Pod automatically restarts (via StatefulSet)
kubectl get pods -n vapora -w
# 3. Verify it's Ready
kubectl get pod surrealdb-0 -n vapora
# Should show: 1/1 Running
# 4. Verify database is accessible
kubectl exec -n vapora surrealdb-0 -- \
surreal sql "SELECT 1"
# 5. Check data integrity
kubectl exec -n vapora surrealdb-0 -- \
surreal sql "SELECT COUNT(*) FROM projects"
# Should return non-zero count
```
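For on-call use, the same restart can be scripted end to end. A minimal sketch using `kubectl wait` in place of watching the pod list manually:

```bash
# Restart, block until Ready (up to 3 minutes), then smoke-test the database
kubectl delete pod -n vapora surrealdb-0
kubectl wait --for=condition=Ready pod/surrealdb-0 -n vapora --timeout=180s
kubectl exec -n vapora surrealdb-0 -- surreal sql "SELECT 1"
```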
---
## Scenario 2: Pod CrashLoop (Container Issue)
**Cause**: Application crash, memory issues, corrupt index
**Duration**: 5-10 minutes
**Data Loss**: None (usually)
### Recovery Procedure
```bash
# 1. Examine pod logs to identify issue
kubectl logs surrealdb-0 -n vapora --previous
# Look for: "panic", "fatal", "out of memory"
# 2. Increase resource limits if memory issue
kubectl patch statefulset surrealdb -n vapora --type='json' \
-p='[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value":"2Gi"}]'
# 3. If corrupt index, rebuild
kubectl exec -n vapora surrealdb-0 -- \
surreal query "REBUILD INDEX"
# 4. If the issue persists, delete the pod to force a clean restart;
#    if that still fails, restore the volume from a snapshot (see Scenario 4)
kubectl delete pod -n vapora surrealdb-0
# 5. Monitor restart
kubectl get pods -n vapora -w
```
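When the logs are inconclusive, the kubelet's record of the last container termination usually is not. A quick check for OOM kills:

```bash
# Prints "OOMKilled" if the container exceeded its memory limit
kubectl get pod surrealdb-0 -n vapora \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
```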
---
## Scenario 3: Corrupted Database (Detected via Queries)
**Cause**: Unclean shutdown, disk issue, data corruption
**Duration**: 15-30 minutes
**Data Loss**: Minimal (last hour of transactions)
### Detection
```bash
# Symptoms to watch for:
#   ✗ Queries return error: "corrupted database"
#   ✗ Disk check shows corruption
#   ✗ Checksums fail
#   ✗ Integrity check fails

# Verify corruption
kubectl exec -n vapora surrealdb-0 -- \
  surreal query "INFO FOR DB"
# Look for any error messages

# Try repair
kubectl exec -n vapora surrealdb-0 -- \
  surreal query "REBUILD INDEX"
```
### Recovery: Option A - Restart and Repair (Try First)
```bash
# 1. Delete pod to force restart
kubectl delete pod -n vapora surrealdb-0
# 2. Watch restart
kubectl get pods -n vapora -w
# Should restart within 30 seconds
# 3. Verify database accessible
kubectl exec -n vapora surrealdb-0 -- \
surreal sql "SELECT COUNT(*) FROM projects"
# 4. If successful, done
# If still errors, proceed to Option B
```
### Recovery: Option B - Restore from Recent Backup
```bash
# 1. Stop database pod
kubectl scale statefulset surrealdb --replicas=0 -n vapora
# 2. Download latest backup
aws s3 cp s3://vapora-backups/database/ ./ --recursive
# Get most recent .sql.gz file
# 3. Clear corrupted data
kubectl delete pvc -n vapora surrealdb-data-surrealdb-0
# 4. Recreate pod (will create new PVC)
kubectl scale statefulset surrealdb --replicas=1 -n vapora
# 5. Wait for pod to be ready
kubectl wait --for=condition=Ready pod/surrealdb-0 \
  -n vapora --timeout=300s
# 6. Restore backup: extract, then copy under a fixed name
# (shell globs do not expand inside the pod)
gunzip vapora-db-*.sql.gz
kubectl cp vapora-db-*.sql vapora/surrealdb-0:/tmp/restore.sql
kubectl exec -n vapora surrealdb-0 -- \
  surreal import \
    --conn ws://localhost:8000 \
    --user root \
    --pass $DB_PASSWORD \
    --input /tmp/restore.sql
# 7. Verify restored data
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "SELECT COUNT(*) FROM projects"
# Should match pre-corruption count
```
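Step 2 above downloads the whole backup prefix. To fetch only the newest export, a sketch that assumes the keys sort lexicographically by timestamp:

```bash
# Identify and download only the most recent backup object
LATEST=$(aws s3 ls s3://vapora-backups/database/ | awk '{print $4}' | sort | tail -1)
aws s3 cp "s3://vapora-backups/database/$LATEST" ./
```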
---
## Scenario 4: Storage Failure (PVC Issue)
**Cause**: Storage volume corruption, node storage failure
**Duration**: 20-30 minutes
**Data Loss**: None with backup
### Recovery Procedure
```bash
# 1. Detect storage issue
kubectl describe pvc -n vapora surrealdb-data-surrealdb-0
# Look for: "Pod pending", "volume binding failure"
# 2. Check if snapshot available (cloud)
aws ec2 describe-snapshots \
  --filters "Name=tag:database,Values=vapora" \
  --query 'sort_by(Snapshots, &StartTime)[].{SnapshotId:SnapshotId,StartTime:StartTime}' \
  --output table | tail -10
# 3. Delete the failed PVC, then recreate it from the snapshot under the
#    SAME name the StatefulSet expects (volumeClaimTemplates are immutable,
#    so patching the claim name on a live StatefulSet is rejected)
kubectl delete pvc -n vapora surrealdb-data-surrealdb-0
# (if the delete hangs on the pvc-protection finalizer, delete the pod first)
kubectl apply -f - << EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: surrealdb-data-surrealdb-0
  namespace: vapora
spec:
  accessModes:
    - ReadWriteOnce
  dataSource:
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
    name: surrealdb-snapshot-latest
  resources:
    requests:
      storage: 100Gi
EOF
# 4. Delete the old pod to force a remount of the restored volume
kubectl delete pod -n vapora surrealdb-0
# 5. Verify new pod runs
kubectl get pods -n vapora -w
# 6. Test database
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "SELECT COUNT(*) FROM projects"
```
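This scenario depends on a recent snapshot existing. Creating one ahead of risky maintenance looks roughly like this; `csi-snapclass` is a placeholder for whatever VolumeSnapshotClass the cluster actually provides:

```bash
kubectl apply -f - << EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: surrealdb-snapshot-$(date +%Y%m%d-%H%M)
  namespace: vapora
spec:
  volumeSnapshotClassName: csi-snapclass   # placeholder: use the cluster's class
  source:
    persistentVolumeClaimName: surrealdb-data-surrealdb-0
EOF
```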
---
## Scenario 5: Complete Data Loss (Restore from Backup)
**Cause**: Accidental deletion or truncation, security incident
**Duration**: 30-60 minutes
**Data Loss**: Up to 1 hour
### Pre-Recovery Checklist
```
Before restoring, verify:
□ What data was lost? (specific tables or the entire database?)
□ When was it lost? (exact time if possible)
□ Do we have valid backups from before the loss?
□ Has the backup been tested before?
```
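For the "Located backup to restore" item, a sketch that picks the newest backup taken before the loss, assuming keys follow the `vapora-db-YYYY-MM-DD-HHMMSS.sql.gz` naming used in the example below:

```bash
# Hypothetical loss time, in the same format as the backup file names
LOSS_TS="2026-01-12-000000"
aws s3 ls s3://vapora-backups/database/ | awk '{print $4}' | sort | \
  awk -v cutoff="vapora-db-$LOSS_TS" '$0 < cutoff' | tail -1
```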
### Recovery Procedure
```bash
# 1. Stop the database
kubectl scale statefulset surrealdb --replicas=0 -n vapora
sleep 10
# 2. Identify backup to restore
# Look for backup from time BEFORE data loss
aws s3 ls s3://vapora-backups/database/ --recursive | sort
# Example: vapora-db-2026-01-12-230000.sql.gz
# (from 11 PM, before the midnight loss)
# 3. Download backup
aws s3 cp s3://vapora-backups/database/vapora-db-2026-01-12-230000.sql.gz ./
gunzip vapora-db-2026-01-12-230000.sql.gz
# 4. Verify backup integrity before restoring
# Inspect the first 100 lines to check the format
head -100 vapora-db-2026-01-12-230000.sql
# 5. Delete corrupted PVC
kubectl delete pvc -n vapora surrealdb-data-surrealdb-0
# 6. Restart database pod (will create new PVC)
kubectl scale statefulset surrealdb --replicas=1 -n vapora
# 7. Wait for pod to be ready and listening
kubectl wait --for=condition=Ready pod/surrealdb-0 \
-n vapora --timeout=300s
sleep 10
# 8. Copy backup to pod
kubectl cp vapora-db-2026-01-12-230000.sql vapora/surrealdb-0:/tmp/restore.sql
# 9. Restore backup
kubectl exec -n vapora surrealdb-0 -- \
  surreal import \
    --conn ws://localhost:8000 \
    --user root \
    --pass $DB_PASSWORD \
    --input /tmp/restore.sql
# Expected output: "Imported 1500+ records..."
# This should take 5-15 minutes depending on backup size
# 10. Verify data restored
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql \
    --conn ws://localhost:8000 \
    --user root \
    --pass $DB_PASSWORD \
    "SELECT COUNT(*) as project_count FROM projects"
# Should match pre-loss count
```
### Data Loss Assessment
```bash
# After restore, compare with the pre-loss state
# 1. Get current record count (reduce the query output to a bare number;
#    exact parsing depends on the surreal CLI output format)
RESTORED_COUNT=$(kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "SELECT COUNT(*) FROM projects" | grep -Eo '[0-9]+' | head -1)
# 2. Get pre-loss count (from logs or ticket)
PRE_LOSS_COUNT=1500
# 3. Calculate data loss
if [ "$RESTORED_COUNT" -lt "$PRE_LOSS_COUNT" ]; then
  LOSS=$(( PRE_LOSS_COUNT - RESTORED_COUNT ))
  echo "Data loss: $LOSS records"
  echo "Data loss duration: ~1 hour"
  echo "Restore successful but incomplete"
else
  echo "Data loss: 0 records"
  echo "Full recovery complete"
fi
```
---
## Scenario 6: Backup Verification Failed
**Cause**: Corrupt backup file, incompatible format
**Duration**: 30-120 minutes (fallback to older backup)
**Data Loss**: 2+ hours possible
### Recovery Procedure
```bash
# 1. Identify backup corruption
# During restore, if backup fails import:
kubectl exec -n vapora surrealdb-0 -- \
  surreal import \
    --conn ws://localhost:8000 \
    --user root \
    --pass $DB_PASSWORD \
    --input /tmp/backup.sql
# Error: "invalid SQL format" or similar
# 2. Check backup file integrity
file vapora-db-backup.sql
# Should show: ASCII text
head -5 vapora-db-backup.sql
# Should show: SQL statements or surreal export format
# 3. If corrupt, try next-oldest backup
aws s3 ls s3://vapora-backups/database/ --recursive | sort | tail -5
# Get second-newest backup
# 4. Retry restore with the older backup
aws s3 cp s3://vapora-backups/database/vapora-db-2026-01-12-210000.sql.gz ./
gunzip vapora-db-2026-01-12-210000.sql.gz
# 5. Repeat restore procedure with older backup
# (As in Scenario 5, steps 8-10)
```
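Instead of trying backups one at a time by hand, `gzip -t` can validate archives without extracting them. A sketch that walks the backups newest-first and stops at the first intact one:

```bash
# Find the newest backup whose gzip container passes an integrity test
for key in $(aws s3 ls s3://vapora-backups/database/ | awk '{print $4}' | sort -r); do
  aws s3 cp "s3://vapora-backups/database/$key" ./candidate.sql.gz
  if gzip -t candidate.sql.gz 2>/dev/null; then
    echo "Intact backup: $key"
    break
  fi
done
```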
---
## Scenario 7: Database Size Growing Unexpectedly
**Cause**: Accumulation of data, logs not rotated, storage leak
**Duration**: Varies (prevention focus)
**Data Loss**: None
### Detection
```bash
# Monitor database size
kubectl exec -n vapora surrealdb-0 -- du -sh /var/lib/surrealdb/
# Check disk usage trend
# (Should be ~1-2% growth per week)
# If sudden spike:
kubectl exec -n vapora surrealdb-0 -- \
find /var/lib/surrealdb/ -type f -exec ls -lh {} + | sort -k5 -h | tail -20
```
### Cleanup Procedure
```bash
# 1. Identify large tables
kubectl exec -n vapora surrealdb-0 -- \
surreal sql "SELECT table, count(*) FROM meta::tb GROUP BY table ORDER BY count DESC"
# 2. If logs table too large
kubectl exec -n vapora surrealdb-0 -- \
surreal sql "DELETE FROM audit_logs WHERE created_at < now() - 90d"
# 3. Rebuild indexes to reclaim space
kubectl exec -n vapora surrealdb-0 -- \
surreal query "REBUILD INDEX"
# 4. If still large, delete old records from other tables
kubectl exec -n vapora surrealdb-0 -- \
surreal sql "DELETE FROM tasks WHERE status = 'archived' AND updated_at < now() - 1y"
# 5. Monitor size after cleanup
kubectl exec -n vapora surrealdb-0 -- du -sh /var/lib/surrealdb/
```
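To catch growth before it becomes an incident, a simple threshold check can run on a schedule. A sketch assuming GNU `df` is available in the container image:

```bash
# Warn when the data volume passes 80% used
USED=$(kubectl exec -n vapora surrealdb-0 -- df --output=pcent /var/lib/surrealdb/ \
  | tail -1 | tr -dc '0-9')
[ "$USED" -gt 80 ] && echo "WARNING: SurrealDB volume at ${USED}% capacity"
```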
---
## Scenario 8: Replication Lag (If Using Replicas)
**Cause**: Replica behind primary, network latency
**Duration**: Usually self-healing (seconds to minutes)
**Data Loss**: None
### Detection
```bash
# Check replica lag
kubectl exec -n vapora surrealdb-replica -- \
surreal sql "SHOW REPLICATION STATUS"
# Look for: "Seconds_Behind_Master" > 5 seconds
```
### Recovery
```bash
# Usually self-healing, but if stuck:
# 1. Check network connectivity
kubectl exec -n vapora surrealdb-replica -- ping -c 5 surrealdb-primary
# 2. Restart replica
kubectl delete pod -n vapora surrealdb-replica
# 3. Monitor replica catching up
kubectl logs -n vapora surrealdb-replica -f
# 4. Verify replica status
kubectl exec -n vapora surrealdb-replica -- \
surreal sql "SHOW REPLICATION STATUS"
```
---
## Database Health Checks
### Pre-Recovery Verification
```bash
verify_database_health() {
  echo "=== Database Health Check ==="
  # 1. Connection test
  kubectl exec -n vapora surrealdb-0 -- \
    surreal sql "SELECT 1" || { echo "✗ Cannot connect to database"; return 1; }
  echo "✓ Connection OK"
  # 2. Data integrity test
  kubectl exec -n vapora surrealdb-0 -- \
    surreal query "REBUILD INDEX" && echo "✓ Integrity check passed"
  # 3. Performance test
  kubectl exec -n vapora surrealdb-0 -- \
    surreal sql "SELECT COUNT(*) FROM projects" && echo "✓ Performance acceptable"
  # 4. Replication lag (if applicable)
  # kubectl exec -n vapora surrealdb-replica -- surreal sql "SHOW REPLICATION STATUS"
  echo "✓ All health checks passed"
}
```
### Post-Recovery Verification
```bash
verify_recovery_success() {
  echo "=== Post-Recovery Verification ==="
  # 1. Database accessible
  kubectl exec -n vapora surrealdb-0 -- \
    surreal sql "SELECT 1"
  echo "✓ Database accessible"
  # 2. All tables present
  kubectl exec -n vapora surrealdb-0 -- \
    surreal sql "SELECT table FROM meta::tb"
  echo "✓ All tables present"
  # 3. Record counts reasonable
  kubectl exec -n vapora surrealdb-0 -- \
    surreal sql "SELECT table, count(*) FROM meta::tb"
  echo "✓ Record counts verified"
  # 4. Application can connect
  kubectl logs -n vapora deployment/vapora-backend --tail=5 | grep -i connected
  echo "✓ Application connected"
  # 5. API operational
  curl http://localhost:8001/api/projects
  echo "✓ API operational"
}
```
---
## Database Recovery Checklist
### Before Recovery
```
□ Documented failure symptoms
□ Determined root cause
□ Selected appropriate recovery method
□ Located backup to restore
□ Verified backup integrity
□ Notified relevant teams
□ Have runbook available
□ Test environment ready (for testing)
```
### During Recovery
```
□ Followed procedure step-by-step
□ Monitored each step completion
□ Captured any error messages
□ Took notes of timings
□ Did NOT skip verification steps
□ Had backup plans ready
```
### After Recovery
```
□ Verified database accessible
□ Verified data integrity
□ Verified application can connect
□ Checked API endpoints working
□ Monitored error rates
□ Waited for 30 min stability check (see the polling sketch below)
□ Documented recovery procedure
□ Identified improvements needed
□ Updated runbooks if needed
```
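For the 30-minute stability check, a minimal polling loop against the API endpoint used earlier in this document:

```bash
# Poll the API once a minute for 30 minutes; log any non-200 responses
for i in $(seq 1 30); do
  code=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8001/api/projects)
  [ "$code" = "200" ] || echo "$(date -u +%H:%M:%S) check $i: HTTP $code"
  sleep 60
done
```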
---
## Recovery Troubleshooting
### Issue: "Cannot connect to database after restore"
**Cause**: Database not fully recovered, network issue
**Solution**:
```bash
# 1. Wait longer (import can take 15+ minutes)
sleep 60 && kubectl exec -n vapora surrealdb-0 -- surreal sql "SELECT 1"
# 2. Check pod logs
kubectl logs -n vapora surrealdb-0 | tail -50
# 3. Restart pod
kubectl delete pod -n vapora surrealdb-0
# 4. Check network connectivity
kubectl exec -n vapora surrealdb-0 -- ping localhost
```
### Issue: "Import corrupted data" error
**Cause**: Backup file corrupted or wrong format
**Solution**:
```bash
# 1. Try different backup
aws s3 ls s3://vapora-backups/database/ | sort | tail -5
# 2. Verify backup format
file vapora-db-backup.sql
# Should show: text
# 3. Manual inspection
head -20 vapora-db-backup.sql
# Should show SQL format
# 4. Try with older backup
```
### Issue: "Database running but data seems wrong"
**Cause**: Restored wrong backup or partial restore
**Solution**:
```bash
# 1. Verify record counts
kubectl exec -n vapora surrealdb-0 -- \
surreal sql "SELECT table, count(*) FROM meta::tb"
# 2. Compare to pre-loss baseline
# (from documentation or logs)
# If counts don't match:
# - Used wrong backup
# - Restore incomplete
# - Try again with correct backup
```
---
## Database Recovery Reference
**Recovery Procedure Flowchart**:
```
Database Issue Detected
│
├─ Is it just a pod restart?
│    YES → kubectl delete pod surrealdb-0
│    NO  → continue below
│
├─ Can queries connect and run?
│    YES → continue with application recovery
│    NO  → continue below
│
├─ Is data corrupted (errors in queries)?
│    YES → try REBUILD INDEX
│    NO  → continue below
│
└─ Still errors after rebuild?
     YES → scale replicas=0, clear PVC, restore from backup
     NO  → success, monitor for 30 min
```