# Database Recovery Procedures

Detailed procedures for recovering SurrealDB in various failure scenarios.

---

## Quick Reference: Recovery Methods

| Scenario | Method | Time | Data Loss |
|----------|--------|------|-----------|
| **Pod restart** | Automatic pod recovery | 2 min | None |
| **Pod crash** | Persistent volume intact | 3 min | None |
| **Corrupted pod** | Restart from snapshot | 5 min | None |
| **Corrupted database** | Restore from backup | 15 min | Up to 60 min |
| **Complete loss** | Restore from backup | 30 min | Up to 60 min |

---

## SurrealDB Architecture

```
VAPORA Database Layer

SurrealDB Pod (Kubernetes)
├── PersistentVolume: /var/lib/surrealdb/
├── Data file: data.db (RocksDB)
├── Index files: *.idx
└── WAL (write-ahead log): *.wal

Backed up to:
├── Hourly exports: S3 (s3://vapora-backups/database/)
├── Volume snapshots: AWS/GCP disk snapshots
└── Archive backups: Glacier (monthly)
```

---

## Scenario 1: Pod Restart (Most Common)

**Cause**: Node maintenance, resource limits, health check failure
**Duration**: 2-3 minutes
**Data Loss**: None

### Recovery Procedure

```bash
# Most of the time, simply restarting the pod is enough.

# 1. Delete the pod
kubectl delete pod -n vapora surrealdb-0

# 2. The pod restarts automatically (managed by the StatefulSet)
kubectl get pods -n vapora -w

# 3. Verify it is Ready
kubectl get pod surrealdb-0 -n vapora
# Should show: 1/1 Running

# 4. Verify the database is accessible
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "SELECT 1"

# 5. Check data integrity
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "SELECT count() FROM projects GROUP ALL"
# Should return a non-zero count
```

---

## Scenario 2: Pod CrashLoop (Container Issue)

**Cause**: Application crash, memory issues, corrupt index
**Duration**: 5-10 minutes
**Data Loss**: None (usually)

### Recovery Procedure

```bash
# 1. Examine pod logs to identify the issue
kubectl logs surrealdb-0 -n vapora --previous
# Look for: "panic", "fatal", "out of memory"

# 2. Increase resource limits if it is a memory issue
kubectl patch statefulset surrealdb -n vapora --type='json' \
  -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value":"2Gi"}]'

# 3. If an index is corrupt, rebuild it
# (index and table names are examples; rebuild the indexes that report errors)
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "REBUILD INDEX idx_projects_name ON TABLE projects"

# 4. If the issue persists, restart the pod and, if necessary,
#    restore the volume from the most recent snapshot (see Scenario 4)
kubectl delete pod -n vapora surrealdb-0

# 5. Monitor the restart
kubectl get pods -n vapora -w
```

---

## Scenario 3: Corrupted Database (Detected via Queries)

**Cause**: Unclean shutdown, disk issue, data corruption
**Duration**: 15-30 minutes
**Data Loss**: Minimal (last hour of transactions)

### Detection

Symptoms to watch for:

- Queries return error: "corrupted database"
- Disk check shows corruption
- Checksums fail
- Integrity check fails

```bash
# Verify corruption
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "INFO FOR DB"
# Look for any error messages

# Try a repair by rebuilding indexes
# (index and table names are examples; rebuild the indexes that report errors)
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "REBUILD INDEX idx_projects_name ON TABLE projects"
```

### Recovery: Option A - Restart and Repair (Try First)

```bash
# 1. Delete the pod to force a restart
kubectl delete pod -n vapora surrealdb-0

# 2. Watch the restart
kubectl get pods -n vapora -w
# Should restart within 30 seconds

# 3. Verify the database is accessible
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "SELECT count() FROM projects GROUP ALL"

# 4. If successful, done
# If there are still errors, proceed to Option B
```
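Option B below deletes the PersistentVolumeClaim, so before proceeding it is worth confirming that a recent, readable backup actually exists. A minimal pre-flight sketch, using the same bucket layout as the rest of this runbook (the local filename is an arbitrary choice):

```bash
# Pre-flight check before Option B: confirm a recent backup exists and is readable.
# Bucket and prefix match the examples in this runbook; adjust for your environment.
set -euo pipefail

# Newest backup key (last line of a sorted listing; column 4 is the object key)
LATEST=$(aws s3 ls s3://vapora-backups/database/ --recursive | sort | tail -1 | awk '{print $4}')
echo "Latest backup: $LATEST"

# Download and confirm the archive decompresses cleanly
aws s3 cp "s3://vapora-backups/$LATEST" ./latest-backup.sql.gz
gzip -t latest-backup.sql.gz && echo "✓ Backup archive is readable"

# Peek at the contents without fully extracting
zcat latest-backup.sql.gz | head -5
```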
### Recovery: Option B - Restore from Recent Backup

```bash
# 1. Stop the database pod
kubectl scale statefulset surrealdb --replicas=0 -n vapora

# 2. Download the latest backup
aws s3 cp s3://vapora-backups/database/ ./ --recursive
# Get the most recent .sql.gz file

# 3. Clear the corrupted data
kubectl delete pvc -n vapora surrealdb-data-surrealdb-0

# 4. Recreate the pod (a new PVC will be created)
kubectl scale statefulset surrealdb --replicas=1 -n vapora

# 5. Wait for the pod to be ready
kubectl wait --for=condition=Ready pod/surrealdb-0 \
  -n vapora --timeout=300s

# 6. Restore the backup: extract, copy into the pod, and import
gunzip vapora-db-*.sql.gz
BACKUP_FILE=$(ls vapora-db-*.sql | head -1)
kubectl cp "$BACKUP_FILE" vapora/surrealdb-0:/tmp/
kubectl exec -n vapora surrealdb-0 -- \
  surreal import \
    --conn ws://localhost:8000 \
    --user root \
    --pass $DB_PASSWORD \
    --input /tmp/"$BACKUP_FILE"

# 7. Verify the restored data
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "SELECT count() FROM projects GROUP ALL"
# Should match the pre-corruption count
```

---

## Scenario 4: Storage Failure (PVC Issue)

**Cause**: Storage volume corruption, node storage failure
**Duration**: 20-30 minutes
**Data Loss**: None with backup

### Recovery Procedure

```bash
# 1. Detect the storage issue
kubectl describe pvc -n vapora surrealdb-data-surrealdb-0
# Look for: failed provisioning, attach, or volume binding events

# 2. Check whether a snapshot is available (cloud)
aws ec2 describe-snapshots \
  --filters "Name=tag:database,Values=vapora" \
  --query 'sort_by(Snapshots,&StartTime)[].{SnapshotId:SnapshotId,StartTime:StartTime}' \
  --output table | tail -10

# 3. Stop the database, then recreate the PVC from the snapshot,
#    reusing the name the StatefulSet expects
#    (volumeClaimTemplates on an existing StatefulSet are immutable,
#     so keep the original PVC name rather than patching the StatefulSet)
kubectl scale statefulset surrealdb --replicas=0 -n vapora
kubectl delete pvc -n vapora surrealdb-data-surrealdb-0
kubectl apply -f - << EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: surrealdb-data-surrealdb-0
  namespace: vapora
spec:
  accessModes:
    - ReadWriteOnce
  dataSource:
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
    name: surrealdb-snapshot-latest
  resources:
    requests:
      storage: 100Gi
EOF

# 4. Scale the database back up so the pod mounts the restored volume
kubectl scale statefulset surrealdb --replicas=1 -n vapora

# 5. Verify the new pod runs
kubectl get pods -n vapora -w

# 6. Test the database
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "SELECT count() FROM projects GROUP ALL"
```
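The restore above assumes a VolumeSnapshot named `surrealdb-snapshot-latest` already exists, typically created on a schedule before the failure. If your cluster has a CSI snapshot controller, such a snapshot can be created from the healthy PVC ahead of time. A sketch, with the VolumeSnapshotClass name (`csi-snapclass`) as a placeholder for whatever class your cluster provides:

```bash
# Create a point-in-time snapshot of the SurrealDB volume (CSI snapshot support required).
# The VolumeSnapshotClass name below is a placeholder; use the class available in your cluster.
kubectl apply -f - << EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: surrealdb-snapshot-latest
  namespace: vapora
spec:
  volumeSnapshotClassName: csi-snapclass
  source:
    persistentVolumeClaimName: surrealdb-data-surrealdb-0
EOF

# Wait until the snapshot reports readyToUse: true before restoring from it
kubectl get volumesnapshot -n vapora surrealdb-snapshot-latest \
  -o jsonpath='{.status.readyToUse}'
```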
---

## Scenario 5: Complete Data Loss (Restore from Backup)

**Cause**: User delete, accidental truncate, security incident
**Duration**: 30-60 minutes
**Data Loss**: Up to 1 hour

### Pre-Recovery Checklist

```
Before restoring, verify:
□ What data was lost? (specific tables or the entire database?)
□ When was it lost? (exact time if possible)
□ Do we have valid backups from before the loss?
□ Has the backup been tested before?
```

### Recovery Procedure

```bash
# 1. Stop the database
kubectl scale statefulset surrealdb --replicas=0 -n vapora
sleep 10

# 2. Identify the backup to restore
# Look for a backup taken BEFORE the data loss
aws s3 ls s3://vapora-backups/database/ --recursive | sort
# Example: surrealdb-2026-01-12-230000.sql.gz
# (from 11 PM, before the 12 AM loss)

# 3. Download the backup
aws s3 cp s3://vapora-backups/database/surrealdb-2026-01-12-230000.sql.gz ./
gunzip surrealdb-2026-01-12-230000.sql.gz

# 4. Verify backup integrity before restoring
# Check the first 100 lines for the expected format
head -100 surrealdb-2026-01-12-230000.sql

# 5. Delete the corrupted PVC
kubectl delete pvc -n vapora surrealdb-data-surrealdb-0

# 6. Restart the database pod (a new PVC will be created)
kubectl scale statefulset surrealdb --replicas=1 -n vapora

# 7. Wait for the pod to be ready and listening
kubectl wait --for=condition=Ready pod/surrealdb-0 \
  -n vapora --timeout=300s
sleep 10

# 8. Copy the backup into the pod
kubectl cp surrealdb-2026-01-12-230000.sql vapora/surrealdb-0:/tmp/

# 9. Restore the backup
kubectl exec -n vapora surrealdb-0 -- \
  surreal import \
    --conn ws://localhost:8000 \
    --user root \
    --pass $DB_PASSWORD \
    --input /tmp/surrealdb-2026-01-12-230000.sql
# Expected output:
#   Imported 1500+ records...
# This should take 5-15 minutes depending on backup size

# 10. Verify the data was restored
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql \
    --conn ws://localhost:8000 \
    --user root \
    --pass $DB_PASSWORD \
    "SELECT count() AS project_count FROM projects GROUP ALL"
# Should match the pre-loss count
```

### Data Loss Assessment

```bash
# After the restore, compare against the pre-loss state

# 1. Get the current record count
# (extract the numeric count from the query output before comparing)
RESTORED_COUNT=$(kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "SELECT count() FROM projects GROUP ALL")

# 2. Get the pre-loss count (from logs or the incident ticket)
PRE_LOSS_COUNT=1500

# 3. Calculate the data loss
if [ "$RESTORED_COUNT" -lt "$PRE_LOSS_COUNT" ]; then
  LOSS=$(( PRE_LOSS_COUNT - RESTORED_COUNT ))
  echo "Data loss: $LOSS records"
  echo "Data loss duration: ~1 hour"
  echo "Restore successful but incomplete"
else
  echo "Data loss: 0 records"
  echo "Full recovery complete"
fi
```

---

## Scenario 6: Backup Verification Failed

**Cause**: Corrupt backup file, incompatible format
**Duration**: 30-120 minutes (fall back to an older backup)
**Data Loss**: 2+ hours possible

### Recovery Procedure

```bash
# 1. Identify backup corruption
# During the restore, the import fails with an error:
kubectl exec -n vapora surrealdb-0 -- \
  surreal import \
    --conn ws://localhost:8000 \
    --user root \
    --pass $DB_PASSWORD \
    --input /tmp/backup.sql
# Error: "invalid SQL format" or similar

# 2. Check the backup file integrity
file vapora-db-backup.sql
# Should show: ASCII text
head -5 vapora-db-backup.sql
# Should show: SQL statements or surreal export format

# 3. If it is corrupt, try the next-oldest backup
aws s3 ls s3://vapora-backups/database/ --recursive | sort | tail -5
# Take the second-newest backup

# 4. Retry the restore with the older backup
aws s3 cp s3://vapora-backups/database/surrealdb-2026-01-12-210000.sql.gz ./
gunzip surrealdb-2026-01-12-210000.sql.gz

# 5. Repeat the restore procedure with the older backup
# (as in Scenario 5, steps 8-10)
```
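Scenario 6 is easier to avoid if backups are checked routinely instead of only during an incident. A sketch of a periodic verification pass over the most recent exports, assuming the same bucket layout as above (the scratch directory and the five-backup window are arbitrary choices):

```bash
# Verify the integrity of the five most recent database backups.
# Catches corrupt or truncated archives before they are needed in a restore.
set -euo pipefail

BUCKET="vapora-backups"
PREFIX="database/"
WORKDIR=$(mktemp -d)

aws s3 ls "s3://$BUCKET/$PREFIX" --recursive | sort | tail -5 | awk '{print $4}' |
while read -r KEY; do
  FILE="$WORKDIR/$(basename "$KEY")"
  aws s3 cp "s3://$BUCKET/$KEY" "$FILE" --quiet

  # 1. The archive must decompress cleanly
  if ! gzip -t "$FILE"; then
    echo "✗ $KEY failed gzip integrity check"
    continue
  fi

  # 2. The contents should be non-empty text (a SurrealDB export)
  if zcat "$FILE" | head -1 | grep -q .; then
    echo "✓ $KEY looks valid"
  else
    echo "✗ $KEY appears to be empty"
  fi
done

rm -rf "$WORKDIR"
```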
---

## Scenario 7: Database Size Growing Unexpectedly

**Cause**: Accumulation of data, logs not rotated, storage leak
**Duration**: Varies (prevention focus)
**Data Loss**: None

### Detection

```bash
# Monitor database size
kubectl exec -n vapora surrealdb-0 -- du -sh /var/lib/surrealdb/

# Check the disk usage trend
# (should be ~1-2% growth per week)

# If there is a sudden spike, find the largest files:
kubectl exec -n vapora surrealdb-0 -- \
  find /var/lib/surrealdb/ -type f -exec ls -lh {} + | sort -k5 -h | tail -20
```

### Cleanup Procedure

```bash
# 1. Identify large tables
# List the tables, then count the likely suspects individually
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "INFO FOR DB"
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "SELECT count() FROM audit_logs GROUP ALL"

# 2. If the logs table is too large, prune old entries
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "DELETE audit_logs WHERE created_at < time::now() - 90d"

# 3. Rebuild indexes to reclaim space
# (index and table names are examples; rebuild the indexes on the pruned tables)
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "REBUILD INDEX idx_audit_logs_created ON TABLE audit_logs"

# 4. If still large, delete old records from other tables
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "DELETE tasks WHERE status = 'archived' AND updated_at < time::now() - 1y"

# 5. Monitor the size after cleanup
kubectl exec -n vapora surrealdb-0 -- du -sh /var/lib/surrealdb/
```

---

## Scenario 8: Replication Lag (If Using Replicas)

**Cause**: Replica behind primary, network latency
**Duration**: Usually self-healing (seconds to minutes)
**Data Loss**: None

### Detection

```bash
# Check replica lag
# (the status command below is illustrative; substitute whatever your
#  replication or clustering setup exposes)
kubectl exec -n vapora surrealdb-replica -- \
  surreal sql "SHOW REPLICATION STATUS"
# Look for: lag greater than 5 seconds
```

### Recovery

```bash
# Usually self-healing, but if it is stuck:

# 1. Check network connectivity
kubectl exec -n vapora surrealdb-replica -- ping -c 5 surrealdb-primary

# 2. Restart the replica
kubectl delete pod -n vapora surrealdb-replica

# 3. Monitor the replica catching up
kubectl logs -n vapora surrealdb-replica -f

# 4. Verify replica status
kubectl exec -n vapora surrealdb-replica -- \
  surreal sql "SHOW REPLICATION STATUS"
```

---

## Database Health Checks

### Pre-Recovery Verification

```bash
verify_database_health() {
  echo "=== Database Health Check ==="

  # 1. Connection test
  kubectl exec -n vapora surrealdb-0 -- \
    surreal sql --conn ws://localhost:8000 "SELECT 1" \
    || { echo "✗ Cannot connect to database"; return 1; }
  echo "✓ Connection OK"

  # 2. Data integrity test (a basic query against a core table)
  kubectl exec -n vapora surrealdb-0 -- \
    surreal sql "SELECT count() FROM projects GROUP ALL" \
    || { echo "✗ Integrity check failed"; return 1; }
  echo "✓ Integrity check passed"

  # 3. Performance test (the query should return promptly)
  time kubectl exec -n vapora surrealdb-0 -- \
    surreal sql "SELECT count() FROM tasks GROUP ALL"
  echo "✓ Performance acceptable"

  # 4. Replication lag (if applicable)
  # kubectl exec -n vapora surrealdb-replica -- surreal sql "SHOW REPLICATION STATUS"
  # echo "✓ No replication lag"

  echo "✓ All health checks passed"
}
```

### Post-Recovery Verification

```bash
verify_recovery_success() {
  echo "=== Post-Recovery Verification ==="

  # 1. Database accessible
  kubectl exec -n vapora surrealdb-0 -- \
    surreal sql "SELECT 1"
  echo "✓ Database accessible"

  # 2. All tables present
  kubectl exec -n vapora surrealdb-0 -- \
    surreal sql "INFO FOR DB"
  echo "✓ All tables present"

  # 3. Record counts reasonable
  kubectl exec -n vapora surrealdb-0 -- \
    surreal sql "SELECT count() FROM projects GROUP ALL"
  echo "✓ Record counts verified"

  # 4. Application can connect
  kubectl logs -n vapora deployment/vapora-backend --tail=5 | grep -i connected
  echo "✓ Application connected"

  # 5. API operational
  curl http://localhost:8001/api/projects
  echo "✓ API operational"
}
```

---

## Database Recovery Checklist

### Before Recovery

```
□ Documented failure symptoms
□ Determined root cause
□ Selected appropriate recovery method
□ Located backup to restore
□ Verified backup integrity
□ Notified relevant teams
□ Have runbook available
□ Test environment ready (for testing)
```

### During Recovery

```
□ Followed procedure step-by-step
□ Monitored each step to completion
□ Captured any error messages
□ Took notes of timings
□ Did NOT skip verification steps
□ Had backup plans ready
```

### After Recovery

```
□ Verified database accessible
□ Verified data integrity
□ Verified application can connect
□ Checked API endpoints working
□ Monitored error rates
□ Waited for 30 min stability check (see the watch loop below)
□ Documented recovery procedure
□ Identified improvements needed
□ Updated runbooks if needed
```
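The "After Recovery" checklist above calls for a 30-minute stability check. A simple watch loop for that window, polling the database and the backend logs; the interval and duration are arbitrary choices, and the `vapora-backend` deployment name follows the examples used earlier:

```bash
# Post-recovery stability watch: poll the database and backend for ~30 minutes.
# A single failed poll is reported but does not abort the loop.
DURATION_MIN=30
INTERVAL_SEC=60

for i in $(seq 1 $(( DURATION_MIN * 60 / INTERVAL_SEC ))); do
  TS=$(date +%H:%M:%S)

  # Database still answers queries
  if kubectl exec -n vapora surrealdb-0 -- surreal sql "SELECT 1" > /dev/null 2>&1; then
    echo "$TS ✓ database responding"
  else
    echo "$TS ✗ database check failed"
  fi

  # Backend is not logging fresh errors
  ERRORS=$(kubectl logs -n vapora deployment/vapora-backend --since=60s 2>/dev/null | grep -ci error || true)
  echo "$TS backend errors in last minute: $ERRORS"

  sleep "$INTERVAL_SEC"
done
echo "Stability window complete"
```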
---

## Recovery Troubleshooting

### Issue: "Cannot connect to database after restore"

**Cause**: Database not fully recovered, network issue

**Solution**:

```bash
# 1. Wait longer (the import can take 15+ minutes)
sleep 60 && kubectl exec -n vapora surrealdb-0 -- surreal sql "SELECT 1"

# 2. Check the pod logs
kubectl logs -n vapora surrealdb-0 | tail -50

# 3. Restart the pod
kubectl delete pod -n vapora surrealdb-0

# 4. Check network connectivity
kubectl exec -n vapora surrealdb-0 -- ping -c 3 localhost
```

### Issue: "Import corrupted data" error

**Cause**: Backup file corrupted or wrong format

**Solution**:

```bash
# 1. Look for a different backup
aws s3 ls s3://vapora-backups/database/ | sort | tail -5

# 2. Verify the backup format
file vapora-db-backup.sql
# Should show: text

# 3. Manual inspection
head -20 vapora-db-backup.sql
# Should show SQL statements

# 4. Retry with an older backup
```

### Issue: "Database running but data seems wrong"

**Cause**: Restored the wrong backup, or a partial restore

**Solution**:

```bash
# 1. Verify record counts
# List the tables, then count the important ones individually
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "INFO FOR DB"
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "SELECT count() FROM projects GROUP ALL"

# 2. Compare to the pre-loss baseline
# (from documentation or logs)

# If the counts do not match:
# - The wrong backup was used
# - The restore was incomplete
# - Try again with the correct backup
```

---

## Database Recovery Reference

**Recovery Procedure Flowchart**:

```
Database Issue Detected
        ↓
Is it just a pod restart?
  YES → kubectl delete pod surrealdb-0
  NO  → Continue
        ↓
Can queries connect and run?
  YES → Continue with application recovery
  NO  → Continue
        ↓
Is data corrupted (errors in queries)?
  YES → Try REBUILD INDEX
  NO  → Continue
        ↓
Still errors?
  YES → Scale replicas=0, clear PVC, restore from backup
  NO  → Success, monitor for 30 min
```
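A minimal triage sketch that walks the first branches of the flowchart (read-only checks only; it stops before any destructive step such as clearing the PVC):

```bash
# Quick triage following the flowchart: pod status → connectivity → basic query.
# Read-only; escalate to the scenario-specific procedures above for actual recovery.

echo "== Pod status =="
kubectl get pod -n vapora surrealdb-0

echo "== Can we connect? =="
if ! kubectl exec -n vapora surrealdb-0 -- surreal sql "SELECT 1" > /dev/null 2>&1; then
  echo "✗ Database not reachable: start with Scenario 1 (pod restart)"
  exit 1
fi
echo "✓ Database reachable"

echo "== Do basic queries succeed? =="
if ! kubectl exec -n vapora surrealdb-0 -- surreal sql "SELECT count() FROM projects GROUP ALL"; then
  echo "✗ Queries failing: see Scenario 3 (corrupted database)"
  exit 1
fi

echo "✓ No obvious database issue: continue with application-level recovery"
```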