
Database Recovery Procedures

Detailed procedures for recovering SurrealDB in various failure scenarios.


Quick Reference: Recovery Methods

| Scenario | Method | Time | Data Loss |
|---|---|---|---|
| Pod restart | Automatic pod recovery | 2 min | 0 |
| Pod crash | Persistent volume intact | 3 min | 0 |
| Corrupted pod | Restart from snapshot | 5 min | 0 |
| Corrupted database | Restore from backup | 15 min | 0-60 min |
| Complete loss | Restore from backup | 30 min | 0-60 min |
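A quick triage sketch in bash can gather the facts behind this table before choosing a scenario. It simply reuses the pod, PVC, and query commands that appear in the scenarios below, so the resource names (surrealdb-0, surrealdb-data-surrealdb-0, namespace vapora) are assumptions about your deployment:

# Triage sketch: collect the facts that decide which scenario applies
NS=vapora

kubectl get pod surrealdb-0 -n "$NS" -o wide          # pod present / restarting?
kubectl describe pod surrealdb-0 -n "$NS" | tail -20  # recent events
kubectl get pvc surrealdb-data-surrealdb-0 -n "$NS"   # volume bound and healthy?

# Query smoke test (mirrors the commands used throughout this document)
if kubectl exec -n "$NS" surrealdb-0 -- surreal sql "SELECT 1" >/dev/null 2>&1; then
  echo "Queries OK -> likely Scenario 1 or 2"
else
  echo "Queries failing -> consider Scenarios 3-5"
fi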

SurrealDB Architecture

VAPORA Database Layer

SurrealDB Pod (Kubernetes)
├── PersistentVolume: /var/lib/surrealdb/
├── Data file: data.db (RocksDB)
├── Index files: *.idx
└── WAL (write-ahead log): *.wal

Backed up to:
├── Hourly exports: S3 backups/database/
├── Cloud volume snapshots: AWS EBS / GCP PD snapshots
└── Archive backups: Glacier (monthly)
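Before starting any recovery, confirm those backup targets are actually populated. A minimal check (a sketch; the bucket and snapshot tag mirror the examples later in this document and may differ in your environment):

# Last hourly exports in S3
aws s3 ls s3://vapora-backups/database/ --recursive | sort | tail -3

# Last cloud volume snapshots (AWS example)
aws ec2 describe-snapshots \
  --filters "Name=tag:database,Values=vapora" \
  --query 'sort_by(Snapshots,&StartTime)[-3:].[SnapshotId,StartTime]' \
  --output table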

Scenario 1: Pod Restart (Most Common)

Cause: Node maintenance, resource limits, health check failure

Duration: 2-3 minutes. Data Loss: None.

Recovery Procedure

# Most of the time, just restart the pod

# 1. Delete the pod
kubectl delete pod -n vapora surrealdb-0

# 2. Pod automatically restarts (via StatefulSet)
kubectl get pods -n vapora -w

# 3. Verify it's Ready
kubectl get pod surrealdb-0 -n vapora
# Should show: 1/1 Running

# 4. Verify database is accessible
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "SELECT 1"

# 5. Check data integrity
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "SELECT COUNT(*) FROM projects"
# Should return non-zero count

Scenario 2: Pod CrashLoop (Container Issue)

Cause: Application crash, memory issues, corrupt index

Duration: 5-10 minutes. Data Loss: None (usually).

Recovery Procedure

# 1. Examine pod logs to identify issue
kubectl logs surrealdb-0 -n vapora --previous
# Look for: "panic", "fatal", "out of memory"

# 2. Increase resource limits if memory issue
kubectl patch statefulset surrealdb -n vapora --type='json' \
  -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value":"2Gi"}]'

# 3. If corrupt index, rebuild
kubectl exec -n vapora surrealdb-0 -- \
  surreal query "REBUILD INDEX"

# 4. If persistent issue, try volume snapshot
kubectl delete pod -n vapora surrealdb-0
# Use previous snapshot (if available)

# 5. Monitor restart
kubectl get pods -n vapora -w

Scenario 3: Corrupted Database (Detected via Queries)

Cause: Unclean shutdown, disk issue, data corruption

Duration: 15-30 minutes. Data Loss: Minimal (last hour of transactions).

Detection

# Symptoms to watch for:
#   ✗ Queries return error: "corrupted database"
#   ✗ Disk check shows corruption
#   ✗ Checksums fail
#   ✗ Integrity check fails

# Verify corruption
kubectl exec -n vapora surrealdb-0 -- \
  surreal query "INFO FOR DB"
# Look for any error messages

# Try repair
kubectl exec -n vapora surrealdb-0 -- \
  surreal query "REBUILD INDEX"

Recovery: Option A - Restart and Repair (Try First)

# 1. Delete pod to force restart
kubectl delete pod -n vapora surrealdb-0

# 2. Watch restart
kubectl get pods -n vapora -w
# Should restart within 30 seconds

# 3. Verify database accessible
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "SELECT COUNT(*) FROM projects"

# 4. If successful, done
# If still errors, proceed to Option B

Recovery: Option B - Restore from Recent Backup

# 1. Stop database pod
kubectl scale statefulset surrealdb --replicas=0 -n vapora

# 2. Download latest backup
aws s3 cp s3://vapora-backups/database/ ./ --recursive
# Get most recent .sql.gz file

# 3. Clear corrupted data
kubectl delete pvc -n vapora surrealdb-data-surrealdb-0

# 4. Recreate pod (will create new PVC)
kubectl scale statefulset surrealdb --replicas=1 -n vapora

# 5. Wait for pod to be ready
kubectl wait --for=condition=Ready pod/surrealdb-0 \
  -n vapora --timeout=300s

# 6. Restore backup
# Extract and import
gunzip vapora-db-*.sql.gz
kubectl cp vapora-db-*.sql vapora/surrealdb-0:/tmp/

# Replace the glob with the exact filename copied in the previous step;
# wildcards are not expanded inside the container.
kubectl exec -n vapora surrealdb-0 -- \
  surreal import \
    --conn ws://localhost:8000 \
    --user root \
    --pass $DB_PASSWORD \
    --input /tmp/vapora-db-<timestamp>.sql

# 7. Verify restored data
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "SELECT COUNT(*) FROM projects"
# Should match pre-corruption count

Scenario 4: Storage Failure (PVC Issue)

Cause: Storage volume corruption, node storage failure

Duration: 20-30 minutes. Data Loss: None (with backup).

Recovery Procedure

# 1. Detect storage issue
kubectl describe pvc -n vapora surrealdb-data-surrealdb-0
# Look for: "Pod pending", "volume binding failure"

# 2. Check if snapshot available (cloud)
aws ec2 describe-snapshots \
  --filters "Name=tag:database,Values=vapora" \
  --query 'sort_by(Snapshots,&StartTime)[].{SnapshotId:SnapshotId,StartTime:StartTime}' \
  --output table | tail -10

# 3. Create new PVC from snapshot
kubectl apply -f - << EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: surrealdb-data-surrealdb-0-restore
  namespace: vapora
spec:
  accessModes:
    - ReadWriteOnce
  dataSource:
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
    name: surrealdb-snapshot-latest
  resources:
    requests:
      storage: 100Gi
EOF

# 4. Update the StatefulSet to use the new PVC
# Note: volumeClaimTemplates are immutable on a live StatefulSet, so a JSON
# patch will be rejected. Recreate the StatefulSet instead, leaving pods and
# PVCs in place, then re-apply the manifest with the restored claim name:
kubectl delete statefulset surrealdb -n vapora --cascade=orphan
# (edit the manifest: volumeClaimTemplates[0].metadata.name ->
#  surrealdb-data-surrealdb-0-restore, then kubectl apply -f <manifest>)

# 5. Delete old pod to force remount
kubectl delete pod -n vapora surrealdb-0

# 6. Verify new pod runs
kubectl get pods -n vapora -w

# 7. Test database
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "SELECT COUNT(*) FROM projects"

Scenario 5: Complete Data Loss (Restore from Backup)

Cause: User delete, accidental truncate, security incident

Duration: 30-60 minutes. Data Loss: Up to 1 hour.

Pre-Recovery Checklist

Before restoring, verify:
□ What data was lost? (one table or the entire database?)
□ When was it lost? (exact time if possible)
□ Do we have valid backups from before the loss? (see the sketch below)
□ Has that backup been tested before?
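A small sketch for answering the backup question above, assuming the hourly exports follow the surrealdb-YYYY-MM-DD-HHMMSS.sql.gz naming used in step 2 below and that you know the approximate loss time:

# Find the newest backup taken BEFORE the data loss
LOSS_TIME="2026-01-12 00:00:00"   # hypothetical example value

aws s3 ls s3://vapora-backups/database/ --recursive \
  | awk -v cutoff="$LOSS_TIME" '($1 " " $2) < cutoff' \
  | sort | tail -1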

Recovery Procedure

# 1. Stop the database
kubectl scale statefulset surrealdb --replicas=0 -n vapora
sleep 10

# 2. Identify backup to restore
# Look for backup from time BEFORE data loss
aws s3 ls s3://vapora-backups/database/ --recursive | sort
# Example: surrealdb-2026-01-12-230000.sql.gz
# (taken at 11 PM, before the 12 AM loss)

# 3. Download and extract the backup
aws s3 cp s3://vapora-backups/database/surrealdb-2026-01-12-230000.sql.gz ./

gunzip surrealdb-2026-01-12-230000.sql.gz

# 4. Verify backup integrity before restoring
# Extract first 100 lines to check format
head -100 surrealdb-2026-01-12-230000.sql

# 5. Delete corrupted PVC
kubectl delete pvc -n vapora surrealdb-data-surrealdb-0

# 6. Restart database pod (will create new PVC)
kubectl scale statefulset surrealdb --replicas=1 -n vapora

# 7. Wait for pod to be ready and listening
kubectl wait --for=condition=Ready pod/surrealdb-0 \
  -n vapora --timeout=300s
sleep 10

# 8. Copy backup to pod
kubectl cp surrealdb-2026-01-12-230000.sql vapora/surrealdb-0:/tmp/

# 9. Restore backup
kubectl exec -n vapora surrealdb-0 -- \
  surreal import \
    --conn ws://localhost:8000 \
    --user root \
    --pass $DB_PASSWORD \
    --input /tmp/surrealdb-2026-01-12-230000.sql

# Expected output:
# Imported 1500+ records...
# This should take 5-15 minutes depending on backup size

# 10. Verify data restored
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql \
    --conn ws://localhost:8000 \
    --user root \
    --pass $DB_PASSWORD \
    "SELECT COUNT(*) as project_count FROM projects"

# Should match pre-loss count

Data Loss Assessment

# After restore, compare with lost version

# 1. Get current record count
# Note: surreal sql prints a result document rather than a bare number, so
# extract the integer before comparing (adjust parsing to your CLI output).
RESTORED_COUNT=$(kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "SELECT COUNT(*) FROM projects" | grep -oE '[0-9]+' | head -1)

# 2. Get pre-loss count (from logs or ticket)
PRE_LOSS_COUNT=1500

# 3. Calculate data loss
if [ "$RESTORED_COUNT" -lt "$PRE_LOSS_COUNT" ]; then
  LOSS=$(( PRE_LOSS_COUNT - RESTORED_COUNT ))
  echo "Data loss: $LOSS records"
  echo "Data loss duration: ~1 hour"
  echo "Restore successful but incomplete"
else
  echo "Data loss: 0 records"
  echo "Full recovery complete"
fi

Scenario 6: Backup Verification Failed

Cause: Corrupt backup file, incompatible format

Duration: 30-120 minutes (fallback to an older backup). Data Loss: 2+ hours possible.

Recovery Procedure

# 1. Identify backup corruption
# During restore, if backup fails import:

kubectl exec -n vapora surrealdb-0 -- \
  surreal import \
    --conn ws://localhost:8000 \
    --user root \
    --pass $DB_PASSWORD \
    --input /tmp/backup.sql

# Error: "invalid SQL format" or similar

# 2. Check backup file integrity
file vapora-db-backup.sql
# Should show: ASCII text

head -5 vapora-db-backup.sql
# Should show: SQL statements or surreal export format

# 3. If corrupt, try next-oldest backup
aws s3 ls s3://vapora-backups/database/ --recursive | sort | tail -5
# Get second-newest backup

# 4. Retry restore with the older backup
aws s3 cp s3://vapora-backups/database/surrealdb-2026-01-12-210000.sql.gz ./
gunzip surrealdb-2026-01-12-210000.sql.gz

# 5. Repeat restore procedure with older backup
# (As in Scenario 5, steps 8-10)

Scenario 7: Database Size Growing Unexpectedly

Cause: Accumulation of data, logs not rotated, storage leak

Duration: Varies (prevention focus). Data Loss: None.

Detection

# Monitor database size
kubectl exec -n vapora surrealdb-0 -- du -sh /var/lib/surrealdb/

# Check disk usage trend
# (Should be ~1-2% growth per week)

# If sudden spike:
kubectl exec -n vapora surrealdb-0 -- \
  find /var/lib/surrealdb/ -type f -exec ls -lh {} + | sort -k5 -h | tail -20

Cleanup Procedure

# 1. Identify large tables
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "SELECT table, count(*) FROM meta::tb GROUP BY table ORDER BY count DESC"

# 2. If logs table too large
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "DELETE FROM audit_logs WHERE created_at < time::now() - 90d"

# 3. Rebuild indexes to reclaim space
kubectl exec -n vapora surrealdb-0 -- \
  surreal query "REBUILD INDEX"

# 4. If still large, delete old records from other tables
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "DELETE FROM tasks WHERE status = 'archived' AND updated_at < time::now() - 1y"

# 5. Monitor size after cleanup
kubectl exec -n vapora surrealdb-0 -- du -sh /var/lib/surrealdb/

Scenario 8: Replication Lag (If Using Replicas)

Cause: Replica behind primary, network latency

Duration: Usually self-healing (seconds to minutes). Data Loss: None.

Detection

# Check replica lag
kubectl exec -n vapora surrealdb-replica -- \
  surreal sql "SHOW REPLICATION STATUS"

# Look for: "Seconds_Behind_Master" > 5 seconds

Recovery

# Usually self-healing, but if stuck:

# 1. Check network connectivity
kubectl exec -n vapora surrealdb-replica -- ping surrealdb-primary -c 5

# 2. Restart replica
kubectl delete pod -n vapora surrealdb-replica

# 3. Monitor replica catching up
kubectl logs -n vapora surrealdb-replica -f

# 4. Verify replica status
kubectl exec -n vapora surrealdb-replica -- \
  surreal sql "SHOW REPLICATION STATUS"

Database Health Checks

Pre-Recovery Verification

def verify_database_health [] {
  print "=== Database Health Check ==="

  # 1. Connection test (run against the pod, like the rest of this runbook)
  try {
    kubectl exec -n vapora surrealdb-0 -- surreal sql "SELECT 1"
  } catch {
    error make {msg: "Cannot connect to database"}
  }
  print "✓ Connection OK"

  # 2. Data integrity test
  kubectl exec -n vapora surrealdb-0 -- surreal query "REBUILD INDEX"
  print "✓ Integrity check passed"

  # 3. Performance test (simple count as a smoke test)
  kubectl exec -n vapora surrealdb-0 -- surreal sql "SELECT COUNT(*) FROM projects"
  print "✓ Performance acceptable"

  # 4. Replication lag (if applicable)
  # kubectl exec -n vapora surrealdb-replica -- surreal sql "SHOW REPLICATION STATUS"
  # print "✓ No replication lag"

  print "✓ All health checks passed"
}

Post-Recovery Verification

def verify_recovery_success [] {
  print "=== Post-Recovery Verification ==="

  # 1. Database accessible
  kubectl exec -n vapora surrealdb-0 -- surreal sql "SELECT 1"
  print "✓ Database accessible"

  # 2. All tables present
  kubectl exec -n vapora surrealdb-0 -- surreal sql "SELECT table FROM meta::tb"
  print "✓ All tables present"

  # 3. Record counts reasonable
  kubectl exec -n vapora surrealdb-0 -- surreal sql "SELECT table, count(*) FROM meta::tb"
  print "✓ Record counts verified"

  # 4. Application can connect
  kubectl logs -n vapora deployment/vapora-backend --tail=5 | grep -i connected
  print "✓ Application connected"

  # 5. API operational
  curl http://localhost:8001/api/projects
  print "✓ API operational"
}

Database Recovery Checklist

Before Recovery

□ Documented failure symptoms
□ Determined root cause
□ Selected appropriate recovery method
□ Located backup to restore
□ Verified backup integrity
□ Notified relevant teams
□ Have runbook available
□ Test environment ready (for testing)

During Recovery

□ Followed procedure step-by-step
□ Monitored each step completion
□ Captured any error messages
□ Took notes of timings (see the logging sketch below)
□ Did NOT skip verification steps
□ Had backup plans ready
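One way to cover the "captured error messages" and "notes of timings" items is to run every recovery command through a small logging wrapper. This is a sketch, not part of the standard tooling; it assumes bash:

# Wrap each recovery command so output and timings land in one log file
RECOVERY_LOG="recovery-$(date +%Y%m%d-%H%M%S).log"

run_step() {
  echo "[$(date -u +%FT%TZ)] START: $*" | tee -a "$RECOVERY_LOG"
  "$@" 2>&1 | tee -a "$RECOVERY_LOG"
  local rc=${PIPESTATUS[0]}
  echo "[$(date -u +%FT%TZ)] END (exit $rc): $*" | tee -a "$RECOVERY_LOG"
}

# Example:
# run_step kubectl delete pod -n vapora surrealdb-0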

After Recovery

□ Verified database accessible
□ Verified data integrity
□ Verified application can connect
□ Checked API endpoints working
□ Monitored error rates
□ Waited for 30 min stability check (see the polling sketch below)
□ Documented recovery procedure
□ Identified improvements needed
□ Updated runbooks if needed
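For the 30-minute stability check, a simple polling loop works; this sketch reuses the count query and the backend API endpoint (http://localhost:8001/api/projects) used elsewhere in this document:

# Poll the database and API once a minute for 30 minutes after recovery
for i in $(seq 1 30); do
  kubectl exec -n vapora surrealdb-0 -- surreal sql "SELECT COUNT(*) FROM projects" \
    > /dev/null 2>&1 && db=ok || db=FAIL
  curl -fsS http://localhost:8001/api/projects > /dev/null && api=ok || api=FAIL
  echo "$(date -u +%FT%TZ) check $i/30: db=$db api=$api"
  sleep 60
done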

Recovery Troubleshooting

Issue: "Cannot connect to database after restore"

Cause: Database not fully recovered, network issue

Solution:

# 1. Wait longer (import can take 15+ minutes)
sleep 60 && kubectl exec -n vapora surrealdb-0 -- surreal sql "SELECT 1"

# 2. Check pod logs
kubectl logs -n vapora surrealdb-0 | tail -50

# 3. Restart pod
kubectl delete pod -n vapora surrealdb-0

# 4. Check network connectivity (bounded ping so it terminates)
kubectl exec -n vapora surrealdb-0 -- ping -c 3 localhost

Issue: "Import corrupted data" error

Cause: Backup file corrupted or wrong format

Solution:

# 1. Try different backup
aws s3 ls s3://vapora-backups/database/ | sort | tail -5

# 2. Verify backup format
file vapora-db-backup.sql
# Should show: text

# 3. Manual inspection
head -20 vapora-db-backup.sql
# Should show SQL format

# 4. Try with older backup

Issue: "Database running but data seems wrong"

Cause: Restored wrong backup or partial restore

Solution:

# 1. Verify record counts
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "SELECT table, count(*) FROM meta::tb"

# 2. Compare to pre-loss baseline
# (from documentation or logs)

# If counts don't match:
# - Used wrong backup
# - Restore incomplete
# - Try again with correct backup

Database Recovery Reference

Recovery Procedure Flowchart:

Database Issue Detected
    ↓
Is it just a pod restart?
  YES → kubectl delete pod surrealdb-0
  NO → Continue
    ↓
Can queries connect and run?
  YES → Continue with application recovery
  NO → Continue
    ↓
Is data corrupted (errors in queries)?
  YES → Try REBUILD INDEX
  NO → Continue
    ↓
Still errors?
  YES → Scale replicas=0, clear PVC, restore from backup
  NO → Success, monitor for 30 min
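The same decision tree as a shell sketch; it only prints the recommended next step and performs no destructive actions (pod and namespace names as used throughout this document):

NS=vapora
if ! kubectl get pod surrealdb-0 -n "$NS" > /dev/null 2>&1; then
  echo "Pod missing -> let the StatefulSet recreate it, or see Scenario 4/5"
elif ! kubectl exec -n "$NS" surrealdb-0 -- surreal sql "SELECT 1" > /dev/null 2>&1; then
  echo "Queries failing -> restart the pod (Scenario 1/2); if errors persist, Scenario 3"
elif ! kubectl exec -n "$NS" surrealdb-0 -- surreal sql "SELECT COUNT(*) FROM projects" > /dev/null 2>&1; then
  echo "Data queries erroring -> try REBUILD INDEX; if still failing, restore from backup (Scenario 3 Option B / 5)"
else
  echo "Database healthy -> monitor for 30 minutes (see the After Recovery checklist)"
fi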