Database Recovery Procedures
Detailed procedures for recovering SurrealDB in various failure scenarios.
Quick Reference: Recovery Methods
| Scenario | Method | Time | Data Loss |
|---|---|---|---|
| Pod restart | Automatic pod recovery | 2 min | 0 |
| Pod crash | Persistent volume intact | 3 min | 0 |
| Corrupted pod | Restart from snapshot | 5 min | 0 |
| Corrupted database | Restore from backup | 15 min | 0-60 min |
| Complete loss | Restore from backup | 30 min | 0-60 min |
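To pick the right row, a quick triage with the objects used throughout this runbook usually settles whether you are looking at a pod problem, a volume problem, or a data problem:
# Quick triage: pod, volume, or data issue?
kubectl get pod surrealdb-0 -n vapora                        # CrashLoopBackOff / Pending / Running?
kubectl describe pvc surrealdb-data-surrealdb-0 -n vapora    # volume bound and healthy?
kubectl logs surrealdb-0 -n vapora --tail=50                 # panics or corruption messages?
kubectl exec -n vapora surrealdb-0 -- surreal sql "SELECT 1" # can the database answer queries?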
SurrealDB Architecture
VAPORA Database Layer
SurrealDB Pod (Kubernetes)
├── PersistentVolume: /var/lib/surrealdb/
├── Data file: data.db (RocksDB)
├── Index files: *.idx
└── WAL (write-ahead log): *.wal
Backed up to:
├── Hourly exports: S3 backups/database/
├── Volume snapshots: AWS/GCP disk snapshots
└── Archive backups: Glacier (monthly)
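The hourly S3 exports above are assumed to be produced by a scheduled job roughly like the sketch below. The bucket path and file naming follow the examples later in this runbook; the exact surreal export flags (for example namespace/database selection) depend on your CLI version.
# Hourly backup job (sketch) - run from a host or CronJob with kubectl and aws access
STAMP=$(date +%Y-%m-%d-%H%M%S)
kubectl exec -n vapora surrealdb-0 -- \
surreal export \
--conn ws://localhost:8000 \
--user root \
--pass $DB_PASSWORD \
/tmp/export.sql
kubectl cp vapora/surrealdb-0:/tmp/export.sql ./surrealdb-$STAMP.sql
gzip surrealdb-$STAMP.sql
aws s3 cp surrealdb-$STAMP.sql.gz s3://vapora-backups/database/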
Scenario 1: Pod Restart (Most Common)
Cause: Node maintenance, resource limits, health check failure
Duration: 2-3 minutes Data Loss: None
Recovery Procedure
# Most of the time, just restart the pod
# 1. Delete the pod
kubectl delete pod -n vapora surrealdb-0
# 2. Pod automatically restarts (via StatefulSet)
kubectl get pods -n vapora -w
# 3. Verify it's Ready
kubectl get pod surrealdb-0 -n vapora
# Should show: 1/1 Running
# 4. Verify database is accessible
kubectl exec -n vapora surrealdb-0 -- \
surreal sql "SELECT 1"
# 5. Check data integrity
kubectl exec -n vapora surrealdb-0 -- \
surreal sql "SELECT COUNT(*) FROM projects"
# Should return non-zero count
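One of the causes listed for this scenario is a health check failure. For context, the probes on the surrealdb container typically look something like the sketch below; this assumes SurrealDB's HTTP health endpoint on port 8000 and uses illustrative timing values, so adjust to the actual StatefulSet manifest.
# Probe sketch for the surrealdb container (values are illustrative)
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 15
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 5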
Scenario 2: Pod CrashLoop (Container Issue)
Cause: Application crash, memory issues, corrupt index
Duration: 5-10 minutes Data Loss: None (usually)
Recovery Procedure
# 1. Examine pod logs to identify issue
kubectl logs surrealdb-0 -n vapora --previous
# Look for: "panic", "fatal", "out of memory"
# 2. Increase resource limits if memory issue
kubectl patch statefulset surrealdb -n vapora --type='json' \
-p='[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value":"2Gi"}]'
# 3. If an index is corrupt, rebuild it
# (REBUILD INDEX generally takes an index name and table,
#  e.g. REBUILD INDEX <name> ON <table>)
kubectl exec -n vapora surrealdb-0 -- \
surreal query "REBUILD INDEX"
# 4. If the issue persists, delete the pod and restore the volume
# from the most recent snapshot if one is available (see Scenario 4)
kubectl delete pod -n vapora surrealdb-0
# 5. Monitor restart
kubectl get pods -n vapora -w
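Related to step 1 above: a convenience one-liner (not part of the numbered procedure) to scan the previous container's logs for the failure signatures mentioned there:
# Scan the crashed container's logs for the usual suspects
kubectl logs surrealdb-0 -n vapora --previous 2>/dev/null | \
grep -iE "panic|fatal|out of memory" | tail -20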
Scenario 3: Corrupted Database (Detected via Queries)
Cause: Unclean shutdown, disk issue, data corruption
Duration: 15-30 minutes Data Loss: Minimal (last hour of transactions)
Detection
# Symptoms to watch for
✗ Queries return error: "corrupted database"
✗ Disk check shows corruption
✗ Checksums fail
✗ Integrity check fails
# Verify corruption
kubectl exec -n vapora surrealdb-0 -- \
surreal query "INFO FOR DB"
# Look for any error messages
# Try repair
kubectl exec -n vapora surrealdb-0 -- \
surreal query "REBUILD INDEX"
Recovery: Option A - Restart and Repair (Try First)
# 1. Delete pod to force restart
kubectl delete pod -n vapora surrealdb-0
# 2. Watch restart
kubectl get pods -n vapora -w
# Should restart within 30 seconds
# 3. Verify database accessible
kubectl exec -n vapora surrealdb-0 -- \
surreal sql "SELECT COUNT(*) FROM projects"
# 4. If successful, done
# If still errors, proceed to Option B
Recovery: Option B - Restore from Recent Backup
# 1. Stop database pod
kubectl scale statefulset surrealdb --replicas=0 -n vapora
# 2. Download latest backup
aws s3 cp s3://vapora-backups/database/ ./ --recursive
# Get most recent .sql.gz file
# 3. Clear corrupted data
kubectl delete pvc -n vapora surrealdb-data-surrealdb-0
# 4. Recreate pod (will create new PVC)
kubectl scale statefulset surrealdb --replicas=1 -n vapora
# 5. Wait for pod to be ready
kubectl wait --for=condition=Ready pod/surrealdb-0 \
-n vapora --timeout=300s
# 6. Restore backup
# Extract and import
gunzip vapora-db-*.sql.gz
# Copy under a fixed name so no shell globbing is needed inside the pod
kubectl cp vapora-db-*.sql vapora/surrealdb-0:/tmp/restore.sql
kubectl exec -n vapora surrealdb-0 -- \
surreal import \
--conn ws://localhost:8000 \
--user root \
--pass $DB_PASSWORD \
--input /tmp/restore.sql
# 7. Verify restored data
kubectl exec -n vapora surrealdb-0 -- \
surreal sql "SELECT COUNT(*) FROM projects"
# Should match pre-corruption count
Scenario 4: Storage Failure (PVC Issue)
Cause: Storage volume corruption, node storage failure
Duration: 20-30 minutes Data Loss: None with backup
Recovery Procedure
# 1. Detect storage issue
kubectl describe pvc -n vapora surrealdb-data-surrealdb-0
# Look for: "Pod pending", "volume binding failure"
# 2. Check if snapshot available (cloud)
aws ec2 describe-snapshots \
--filters "Name=tag:database,Values=vapora" \
--query 'sort_by(Snapshots, &StartTime)[].{SnapshotId:SnapshotId,StartTime:StartTime}' \
--output table | tail -10
# 3. Create new PVC from snapshot
kubectl apply -f - << EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: surrealdb-data-surrealdb-0-restore
  namespace: vapora
spec:
  accessModes:
    - ReadWriteOnce
  dataSource:
    name: surrealdb-snapshot-latest
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  resources:
    requests:
      storage: 100Gi
EOF
# 4. Point the StatefulSet at the restored PVC
# Note: volumeClaimTemplates are immutable, so the StatefulSet must be
# recreated rather than patched: delete it while keeping its pods, then
# re-apply your StatefulSet manifest updated to reference the restored claim
kubectl delete statefulset surrealdb -n vapora --cascade=orphan
# 5. Delete old pod to force remount
kubectl delete pod -n vapora surrealdb-0
# 6. Verify new pod runs
kubectl get pods -n vapora -w
# 7. Test database
kubectl exec -n vapora surrealdb-0 -- \
surreal sql "SELECT COUNT(*) FROM projects"
Scenario 5: Complete Data Loss (Restore from Backup)
Cause: User delete, accidental truncate, security incident
Duration: 30-60 minutes Data Loss: Up to 1 hour
Pre-Recovery Checklist
Before restoring, verify:
□ What data was lost? (specific tables or entire DB?)
□ When was it lost? (exact time if possible)
□ Is it just one table or entire database?
□ Do we have valid backups from before loss?
□ Has the backup been tested before?
Recovery Procedure
# 1. Stop the database
kubectl scale statefulset surrealdb --replicas=0 -n vapora
sleep 10
# 2. Identify backup to restore
# Look for backup from time BEFORE data loss
aws s3 ls s3://vapora-backups/database/ --recursive | sort
# Example: surrealdb-2026-01-12-230000.sql.gz
# (from 11 PM, before 12 AM loss)
# 3. Download backup
aws s3 cp s3://vapora-backups/database/surrealdb-2026-01-12-230000.sql.gz ./
gunzip surrealdb-2026-01-12-230000.sql.gz
# 4. Verify backup integrity before restoring
# Extract first 100 lines to check format
head -100 surrealdb-2026-01-12-230000.sql
# 5. Delete corrupted PVC
kubectl delete pvc -n vapora surrealdb-data-surrealdb-0
# 6. Restart database pod (will create new PVC)
kubectl scale statefulset surrealdb --replicas=1 -n vapora
# 7. Wait for pod to be ready and listening
kubectl wait --for=condition=Ready pod/surrealdb-0 \
-n vapora --timeout=300s
sleep 10
# 8. Copy backup to pod
kubectl cp surrealdb-2026-01-12-230000.sql vapora/surrealdb-0:/tmp/
# 9. Restore backup
kubectl exec -n vapora surrealdb-0 -- \
surreal import \
--conn ws://localhost:8000 \
--user root \
--pass $DB_PASSWORD \
--input /tmp/surrealdb-2026-01-12-230000.sql
# Expected output:
# Imported 1500+ records...
# This should take 5-15 minutes depending on backup size
# 10. Verify data restored
kubectl exec -n vapora surrealdb-0 -- \
surreal sql \
--conn ws://localhost:8000 \
--user root \
--pass $DB_PASSWORD \
"SELECT COUNT(*) as project_count FROM projects"
# Should match pre-loss count
Data Loss Assessment
# After restore, compare with lost version
# 1. Get current record count
# (reduce the CLI output to a bare integer; adjust the parsing to your output format)
RESTORED_COUNT=$(kubectl exec -n vapora surrealdb-0 -- \
surreal sql "SELECT COUNT(*) FROM projects" | grep -oE '[0-9]+' | head -1)
# 2. Get pre-loss count (from logs or ticket)
PRE_LOSS_COUNT=1500
# 3. Calculate data loss
if [ "$RESTORED_COUNT" -lt "$PRE_LOSS_COUNT" ]; then
LOSS=$(( PRE_LOSS_COUNT - RESTORED_COUNT ))
echo "Data loss: $LOSS records"
echo "Data loss duration: ~1 hour"
echo "Restore successful but incomplete"
else
echo "Data loss: 0 records"
echo "Full recovery complete"
fi
Scenario 6: Backup Verification Failed
Cause: Corrupt backup file, incompatible format
Duration: 30-120 minutes (fallback to older backup) Data Loss: 2+ hours possible
Recovery Procedure
# 1. Identify backup corruption
# During restore, if backup fails import:
kubectl exec -n vapora surrealdb-0 -- \
surreal import \
--conn ws://localhost:8000 \
--user root \
--pass $DB_PASSWORD \
--input /tmp/backup.sql
# Error: "invalid SQL format" or similar
# 2. Check backup file integrity
file vapora-db-backup.sql
# Should show: ASCII text
head -5 vapora-db-backup.sql
# Should show: SQL statements or surreal export format
# 3. If corrupt, try next-oldest backup
aws s3 ls s3://vapora-backups/database/ --recursive | sort | tail -5
# Get second-newest backup
# 4. Retry restore with older backup
aws s3 cp s3://vapora-backups/database/surrealdb-2026-01-12-210000.sql.gz ./
gunzip surrealdb-2026-01-12-210000.sql.gz
# 5. Repeat restore procedure with older backup
# (As in Scenario 5, steps 8-10)
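To catch a corrupt backup before it is needed for a restore, the same checks can run automatically after each export. A minimal sketch using only standard tools; the size threshold is an arbitrary assumption to tune:
#!/usr/bin/env bash
# verify_backup.sh <backup.sql.gz> - sanity-check a backup artifact
set -euo pipefail
BACKUP="$1"
# 1. The gzip stream must be intact
gzip -t "$BACKUP"
# 2. The decompressed content should start with readable statements
zcat "$BACKUP" | head -5
# 3. It should not be suspiciously small (threshold is an assumption; tune it)
SIZE=$(zcat "$BACKUP" | wc -c)
if [ "$SIZE" -lt 1024 ]; then
  echo "Backup looks too small ($SIZE bytes)" >&2
  exit 1
fi
echo "Backup $BACKUP passed basic verification"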
Scenario 7: Database Size Growing Unexpectedly
Cause: Accumulation of data, logs not rotated, storage leak
Duration: Varies (prevention focus) Data Loss: None
Detection
# Monitor database size
kubectl exec -n vapora surrealdb-0 -- du -sh /var/lib/surrealdb/
# Check disk usage trend
# (Should be ~1-2% growth per week)
# If sudden spike:
kubectl exec -n vapora surrealdb-0 -- \
find /var/lib/surrealdb/ -type f -exec ls -lh {} + | sort -k5 -h | tail -20
Cleanup Procedure
# 1. Identify large tables
kubectl exec -n vapora surrealdb-0 -- \
surreal sql "SELECT table, count(*) FROM meta::tb GROUP BY table ORDER BY count DESC"
# 2. If logs table too large
kubectl exec -n vapora surrealdb-0 -- \
surreal sql "DELETE FROM audit_logs WHERE created_at < now() - 90d"
# 3. Rebuild indexes to reclaim space
kubectl exec -n vapora surrealdb-0 -- \
surreal query "REBUILD INDEX"
# 4. If still large, delete old records from other tables
kubectl exec -n vapora surrealdb-0 -- \
surreal sql "DELETE FROM tasks WHERE status = 'archived' AND updated_at < now() - 1y"
# 5. Monitor size after cleanup
kubectl exec -n vapora surrealdb-0 -- du -sh /var/lib/surrealdb/
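The retention deletes above can also be scheduled rather than run by hand. One option is a plain cron entry on an ops host that already has kubectl access; the user, schedule, and log path below are illustrative assumptions.
# /etc/cron.d/vapora-db-cleanup (sketch) - weekly retention cleanup, Sundays 03:00
0 3 * * 0 ops kubectl exec -n vapora surrealdb-0 -- surreal sql "DELETE FROM audit_logs WHERE created_at < time::now() - 90d" >> /var/log/vapora-db-cleanup.log 2>&1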
Scenario 8: Replication Lag (If Using Replicas)
Cause: Replica behind primary, network latency
Duration: Usually self-healing (seconds to minutes) Data Loss: None
Detection
# Check replica lag
kubectl exec -n vapora surrealdb-replica -- \
surreal sql "SHOW REPLICATION STATUS"
# Look for: "Seconds_Behind_Master" > 5 seconds
Recovery
# Usually self-healing, but if stuck:
# 1. Check network connectivity
kubectl exec -n vapora surrealdb-replica -- ping surrealdb-primary -c 5
# 2. Restart replica
kubectl delete pod -n vapora surrealdb-replica
# 3. Monitor replica catching up
kubectl logs -n vapora surrealdb-replica -f
# 4. Verify replica status
kubectl exec -n vapora surrealdb-replica -- \
surreal sql "SHOW REPLICATION STATUS"
Database Health Checks
Pre-Recovery Verification
def verify_database_health [] {
    print "=== Database Health Check ==="
    # 1. Connection test
    try {
        ^surreal sql --conn ws://localhost:8000 "SELECT 1"
    } catch {
        error make {msg: "Cannot connect to database"}
    }
    print "✓ Connection OK"
    # 2. Data integrity test
    ^surreal sql "REBUILD INDEX"
    print "✓ Integrity check passed"
    # 3. Performance test
    ^surreal sql "SELECT COUNT(*) FROM projects"
    print "✓ Performance acceptable"
    # 4. Replication lag (if applicable)
    # ^surreal sql "SHOW REPLICATION STATUS"
    # print "✓ No replication lag"
    print "✓ All health checks passed"
}
Post-Recovery Verification
def verify_recovery_success [] {
    print "=== Post-Recovery Verification ==="
    # 1. Database accessible
    ^kubectl exec -n vapora surrealdb-0 -- surreal sql "SELECT 1"
    print "✓ Database accessible"
    # 2. All tables present
    ^kubectl exec -n vapora surrealdb-0 -- surreal sql "SELECT table FROM meta::tb"
    print "✓ All tables present"
    # 3. Record counts reasonable
    ^kubectl exec -n vapora surrealdb-0 -- surreal sql "SELECT table, count(*) FROM meta::tb"
    print "✓ Record counts verified"
    # 4. Application can connect
    ^kubectl logs -n vapora deployment/vapora-backend --tail=5 | grep -i connected
    print "✓ Application connected"
    # 5. API operational
    ^curl http://localhost:8001/api/projects
    print "✓ API operational"
}
Database Recovery Checklist
Before Recovery
□ Documented failure symptoms
□ Determined root cause
□ Selected appropriate recovery method
□ Located backup to restore
□ Verified backup integrity
□ Notified relevant teams
□ Have runbook available
□ Test environment ready (for testing)
During Recovery
□ Followed procedure step-by-step
□ Monitored each step completion
□ Captured any error messages
□ Took notes of timings
□ Did NOT skip verification steps
□ Had backup plans ready
After Recovery
□ Verified database accessible
□ Verified data integrity
□ Verified application can connect
□ Checked API endpoints working
□ Monitored error rates
□ Waited for 30 min stability check (see the watch loop below)
□ Documented recovery procedure
□ Identified improvements needed
□ Updated runbooks if needed
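For the error-rate and 30-minute stability items, a simple watch loop against the API endpoint used in the post-recovery verification works; the interval and endpoint are illustrative, so adjust as needed.
# 30-minute stability check: poll the API once a minute and count failures
FAILURES=0
for i in $(seq 1 30); do
  if ! curl -sf http://localhost:8001/api/projects > /dev/null; then
    FAILURES=$((FAILURES + 1))
    echo "$(date): API check failed ($FAILURES so far)"
  fi
  sleep 60
done
echo "Stability check complete: $FAILURES failures in 30 minutes"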
Recovery Troubleshooting
Issue: "Cannot connect to database after restore"
Cause: Database not fully recovered, network issue
Solution:
# 1. Wait longer (import can take 15+ minutes)
sleep 60 && kubectl exec -n vapora surrealdb-0 -- surreal sql "SELECT 1"
# 2. Check pod logs
kubectl logs -n vapora surrealdb-0 | tail -50
# 3. Restart pod
kubectl delete pod -n vapora surrealdb-0
# 4. Check network connectivity
kubectl exec -n vapora surrealdb-0 -- ping localhost
Issue: "Import corrupted data" error
Cause: Backup file corrupted or wrong format
Solution:
# 1. Try different backup
aws s3 ls s3://vapora-backups/database/ | sort | tail -5
# 2. Verify backup format
file vapora-db-backup.sql
# Should show: text
# 3. Manual inspection
head -20 vapora-db-backup.sql
# Should show SQL format
# 4. Try with older backup
Issue: "Database running but data seems wrong"
Cause: Restored wrong backup or partial restore
Solution:
# 1. Verify record counts
kubectl exec -n vapora surrealdb-0 -- \
surreal sql "SELECT table, count(*) FROM meta::tb"
# 2. Compare to pre-loss baseline
# (from documentation or logs)
# If counts don't match:
# - Used wrong backup
# - Restore incomplete
# - Try again with correct backup
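The comparison in step 2 only works if a pre-loss baseline exists. A small sketch that records counts on a schedule so a baseline is always available; the output path is an assumption.
#!/usr/bin/env bash
# record_baseline.sh - save current record counts with a timestamp
set -euo pipefail
OUT=/var/log/vapora/db-baseline-$(date +%Y-%m-%d-%H%M).txt
mkdir -p "$(dirname "$OUT")"
kubectl exec -n vapora surrealdb-0 -- \
surreal sql "SELECT table, count(*) FROM meta::tb" > "$OUT"
echo "Baseline written to $OUT"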
Database Recovery Reference
Recovery Procedure Flowchart:
Database Issue Detected
↓
Is it just a pod restart?
YES → kubectl delete pod surrealdb-0
NO → Continue
↓
Can queries connect and run?
YES → Continue with application recovery
NO → Continue
↓
Is data corrupted (errors in queries)?
YES → Try REBUILD INDEX
NO → Continue
↓
Still errors?
YES → Scale replicas=0, clear PVC, restore from backup
NO → Success, monitor for 30 min