# Database Recovery Procedures

Detailed procedures for recovering SurrealDB in various failure scenarios.

---

## Quick Reference: Recovery Methods

| Scenario | Method | Time | Data Loss |
|----------|--------|------|-----------|
| **Pod restart** | Automatic pod recovery | 2 min | 0 |
| **Pod crash** | Persistent volume intact | 3 min | 0 |
| **Corrupted pod** | Restart from snapshot | 5 min | 0 |
| **Corrupted database** | Restore from backup | 15 min | 0-60 min |
| **Complete loss** | Restore from backup | 30 min | 0-60 min |

---

## SurrealDB Architecture

```
VAPORA Database Layer

SurrealDB Pod (Kubernetes)
├── PersistentVolume: /var/lib/surrealdb/
├── Data file: data.db (RocksDB)
├── Index files: *.idx
└── WAL (write-ahead log): *.wal

Backed up to:
├── Hourly exports: s3://vapora-backups/database/
├── Volume snapshots: AWS/GCP disk snapshots
└── Archive backups: Glacier (monthly)
```
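The hourly exports shown above are assumed to be produced by a job along these lines (a sketch: the `--ns`/`--db` names, file naming, and upload path mirror conventions used elsewhere in this runbook and should be checked against the real backup CronJob):

```bash
# One hourly export run (namespace/database names are assumptions)
TS=$(date -u +%Y-%m-%d-%H%M%S)
kubectl exec -n vapora surrealdb-0 -- \
  surreal export \
  --conn ws://localhost:8000 \
  --user root --pass "$DB_PASSWORD" \
  --ns vapora --db vapora \
  /tmp/surrealdb-$TS.sql
kubectl cp vapora/surrealdb-0:/tmp/surrealdb-$TS.sql ./surrealdb-$TS.sql
gzip surrealdb-$TS.sql
aws s3 cp surrealdb-$TS.sql.gz s3://vapora-backups/database/
```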
---

## Scenario 1: Pod Restart (Most Common)

**Cause**: Node maintenance, resource limits, health check failure

**Duration**: 2-3 minutes
**Data Loss**: None

### Recovery Procedure

```bash
# Most of the time, just restart the pod

# 1. Delete the pod
kubectl delete pod -n vapora surrealdb-0

# 2. Pod automatically restarts (via StatefulSet)
kubectl get pods -n vapora -w

# 3. Verify it's Ready
kubectl get pod surrealdb-0 -n vapora
# Should show: 1/1 Running

# 4. Verify database is accessible
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "SELECT 1"

# 5. Check data integrity
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "SELECT COUNT(*) FROM projects"
# Should return non-zero count
```
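If you prefer not to watch the pod list in step 2, `kubectl wait` blocks until the pod reports Ready (a sketch; the 180s timeout is an arbitrary choice, and the replacement pod object must already exist):

```bash
# Block until the restarted pod is Ready, then run the verification steps above
kubectl wait --for=condition=Ready pod/surrealdb-0 -n vapora --timeout=180s
```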
---

## Scenario 2: Pod CrashLoop (Container Issue)

**Cause**: Application crash, memory issues, corrupt index

**Duration**: 5-10 minutes
**Data Loss**: None (usually)

### Recovery Procedure

```bash
# 1. Examine pod logs to identify the issue
kubectl logs surrealdb-0 -n vapora --previous
# Look for: "panic", "fatal", "out of memory"

# 2. Increase resource limits if it is a memory issue
kubectl patch statefulset surrealdb -n vapora --type='json' \
  -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value":"2Gi"}]'

# 3. If an index is corrupt, rebuild it
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "REBUILD INDEX"

# 4. If the issue persists, recreate the pod from a volume snapshot
kubectl delete pod -n vapora surrealdb-0
# Use the previous snapshot (if available)

# 5. Monitor the restart
kubectl get pods -n vapora -w
```
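After patching the limits in step 2, it is worth confirming the new value is actually in effect on the running container (a quick check, not part of the original procedure):

```bash
# Print the effective memory limit of the SurrealDB container
# (the pod must have restarted after the patch for this to show the new value)
kubectl get pod surrealdb-0 -n vapora \
  -o jsonpath='{.spec.containers[0].resources.limits.memory}'
```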
---

## Scenario 3: Corrupted Database (Detected via Queries)

**Cause**: Unclean shutdown, disk issue, data corruption

**Duration**: 15-30 minutes
**Data Loss**: Minimal (last hour of transactions)

### Detection

```bash
# Symptoms to watch for:
# ✗ Queries return error: "corrupted database"
# ✗ Disk check shows corruption
# ✗ Checksums fail
# ✗ Integrity check fails

# Verify corruption
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "INFO FOR DB"
# Look for any error messages

# Try a repair
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "REBUILD INDEX"
```
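In SurrealQL, `REBUILD INDEX` generally targets one named index on one table, so the repair above is usually run per index; a sketch with placeholder names:

```bash
# Rebuild one suspect index (index and table names are placeholders)
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "REBUILD INDEX IF EXISTS idx_project_name ON TABLE projects"
```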
### Recovery: Option A - Restart and Repair (Try First)

```bash
# 1. Delete the pod to force a restart
kubectl delete pod -n vapora surrealdb-0

# 2. Watch the restart
kubectl get pods -n vapora -w
# Should restart within 30 seconds

# 3. Verify the database is accessible
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "SELECT COUNT(*) FROM projects"

# 4. If successful, done
# If errors persist, proceed to Option B
```

### Recovery: Option B - Restore from Recent Backup

```bash
# 1. Stop the database pod
kubectl scale statefulset surrealdb --replicas=0 -n vapora

# 2. Download the latest backup
aws s3 cp s3://vapora-backups/database/ ./ --recursive
# Get the most recent .sql.gz file

# 3. Clear the corrupted data
kubectl delete pvc -n vapora surrealdb-data-surrealdb-0

# 4. Recreate the pod (will create a new PVC)
kubectl scale statefulset surrealdb --replicas=1 -n vapora

# 5. Wait for the pod to be ready
kubectl wait --for=condition=Ready pod/surrealdb-0 \
  -n vapora --timeout=300s

# 6. Restore the backup
# Extract and import
gunzip vapora-db-*.sql.gz
kubectl cp vapora-db-*.sql vapora/surrealdb-0:/tmp/

kubectl exec -n vapora surrealdb-0 -- \
  surreal import \
  --conn ws://localhost:8000 \
  --user root \
  --pass $DB_PASSWORD \
  --input /tmp/vapora-db-*.sql

# 7. Verify the restored data
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "SELECT COUNT(*) FROM projects"
# Should match the pre-corruption count
```
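Before wiping the PVC in step 3, it is worth a quick sanity check that the downloaded archive is usable (this only proves the gzip stream is intact, not that the export is complete):

```bash
# Verify the archive decompresses cleanly and peek at the first lines
gzip -t vapora-db-*.sql.gz && echo "archive OK"
zcat vapora-db-*.sql.gz | head -5
```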
---

## Scenario 4: Storage Failure (PVC Issue)

**Cause**: Storage volume corruption, node storage failure

**Duration**: 20-30 minutes
**Data Loss**: None with backup

### Recovery Procedure

```bash
# 1. Detect the storage issue
kubectl describe pvc -n vapora surrealdb-data-surrealdb-0
# Look for: "Pod pending", "volume binding failure"

# 2. Check whether a snapshot is available (cloud)
aws ec2 describe-snapshots \
  --filters "Name=tag:database,Values=vapora" \
  --query 'sort_by(Snapshots, &StartTime)[].{SnapshotId:SnapshotId,StartTime:StartTime}' \
  --output table | tail -10

# 3. Create a new PVC from the snapshot
kubectl apply -f - << EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: surrealdb-data-surrealdb-0-restore
  namespace: vapora
spec:
  accessModes:
    - ReadWriteOnce
  dataSource:
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
    name: surrealdb-snapshot-latest
  resources:
    requests:
      storage: 100Gi
EOF

# 4. Update the StatefulSet to use the new PVC
kubectl patch statefulset surrealdb -n vapora --type='json' \
  -p='[{"op": "replace", "path": "/spec/volumeClaimTemplates/0/metadata/name", "value":"surrealdb-data-surrealdb-0-restore"}]'

# 5. Delete the old pod to force a remount
kubectl delete pod -n vapora surrealdb-0

# 6. Verify the new pod runs
kubectl get pods -n vapora -w

# 7. Test the database
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "SELECT COUNT(*) FROM projects"
```
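Step 3 assumes a `VolumeSnapshot` named `surrealdb-snapshot-latest` already exists. A sketch of how such a snapshot is taken from the data PVC during normal operation (the `volumeSnapshotClassName` is an assumption and depends on the cluster's CSI driver):

```bash
# Create/refresh the snapshot that the restore PVC above points at
kubectl apply -f - << EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: surrealdb-snapshot-latest
  namespace: vapora
spec:
  volumeSnapshotClassName: csi-snapclass   # assumption: depends on the CSI driver
  source:
    persistentVolumeClaimName: surrealdb-data-surrealdb-0
EOF
```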
---

## Scenario 5: Complete Data Loss (Restore from Backup)

**Cause**: User delete, accidental truncate, security incident

**Duration**: 30-60 minutes
**Data Loss**: Up to 1 hour

### Pre-Recovery Checklist

```
Before restoring, verify:
□ What data was lost? (specific tables or entire DB?)
□ When was it lost? (exact time if possible)
□ Is it just one table or the entire database?
□ Do we have valid backups from before the loss?
□ Has the backup been tested before?
```
### Recovery Procedure

```bash
# 1. Stop the database
kubectl scale statefulset surrealdb --replicas=0 -n vapora
sleep 10

# 2. Identify the backup to restore
# Look for a backup from a time BEFORE the data loss
aws s3 ls s3://vapora-backups/database/ --recursive | sort
# Example: surrealdb-2026-01-12-230000.sql.gz
# (from 11 PM, before the 12 AM loss)

# 3. Download the backup
aws s3 cp s3://vapora-backups/database/surrealdb-2026-01-12-230000.sql.gz ./

gunzip surrealdb-2026-01-12-230000.sql.gz

# 4. Verify backup integrity before restoring
# Check the first 100 lines for the expected format
head -100 surrealdb-2026-01-12-230000.sql

# 5. Delete the existing PVC
kubectl delete pvc -n vapora surrealdb-data-surrealdb-0

# 6. Restart the database pod (will create a new PVC)
kubectl scale statefulset surrealdb --replicas=1 -n vapora

# 7. Wait for the pod to be ready and listening
kubectl wait --for=condition=Ready pod/surrealdb-0 \
  -n vapora --timeout=300s
sleep 10

# 8. Copy the backup to the pod
kubectl cp surrealdb-2026-01-12-230000.sql vapora/surrealdb-0:/tmp/

# 9. Restore the backup
kubectl exec -n vapora surrealdb-0 -- \
  surreal import \
  --conn ws://localhost:8000 \
  --user root \
  --pass $DB_PASSWORD \
  --input /tmp/surrealdb-2026-01-12-230000.sql

# Expected output:
# Imported 1500+ records...
# This should take 5-15 minutes depending on backup size

# 10. Verify the data was restored
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql \
  --conn ws://localhost:8000 \
  --user root \
  --pass $DB_PASSWORD \
  "SELECT COUNT(*) as project_count FROM projects"

# Should match the pre-loss count
```
### Data Loss Assessment

```bash
# After the restore, compare against the pre-loss state

# 1. Get the current record count
RESTORED_COUNT=$(kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "SELECT COUNT(*) FROM projects")

# 2. Get the pre-loss count (from logs or the incident ticket)
PRE_LOSS_COUNT=1500

# 3. Calculate the data loss
if [ "$RESTORED_COUNT" -lt "$PRE_LOSS_COUNT" ]; then
  LOSS=$(( PRE_LOSS_COUNT - RESTORED_COUNT ))
  echo "Data loss: $LOSS records"
  echo "Data loss duration: ~1 hour"
  echo "Restore successful but incomplete"
else
  echo "Data loss: 0 records"
  echo "Full recovery complete"
fi
```
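The `-lt` comparison above assumes `$RESTORED_COUNT` is a bare integer; the CLI normally prints a formatted result, so in practice the number has to be extracted first. A crude sketch (verify against the actual output format of your `surreal` version):

```bash
# Keep only the last number from the query output before comparing
# (crude: breaks if the output contains other numbers such as timestamps)
RESTORED_COUNT=$(kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "SELECT COUNT(*) FROM projects" | grep -oE '[0-9]+' | tail -1)
```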
---

## Scenario 6: Backup Verification Failed

**Cause**: Corrupt backup file, incompatible format

**Duration**: 30-120 minutes (fallback to an older backup)
**Data Loss**: 2+ hours possible

### Recovery Procedure

```bash
# 1. Identify backup corruption
# During a restore, the import fails:
kubectl exec -n vapora surrealdb-0 -- \
  surreal import \
  --conn ws://localhost:8000 \
  --user root \
  --pass $DB_PASSWORD \
  --input /tmp/backup.sql
# Error: "invalid SQL format" or similar

# 2. Check the backup file's integrity
file vapora-db-backup.sql
# Should show: ASCII text

head -5 vapora-db-backup.sql
# Should show: SQL statements or surreal export format

# 3. If corrupt, try the next-oldest backup
aws s3 ls s3://vapora-backups/database/ --recursive | sort | tail -5
# Pick the second-newest backup

# 4. Retry the restore with the older backup
aws s3 cp s3://vapora-backups/database/surrealdb-2026-01-12-210000.sql.gz ./
gunzip surrealdb-2026-01-12-210000.sql.gz

# 5. Repeat the restore procedure with the older backup
# (As in Scenario 5, steps 8-10)
```
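Corrupt backups are much cheaper to catch before an incident; a small verification pass like the following can run after each hourly export (a sketch that assumes the bucket layout and file naming used above):

```bash
# Pull the newest archive and confirm it decompresses and looks like an export
LATEST=$(aws s3 ls s3://vapora-backups/database/ --recursive | sort | tail -1 | awk '{print $4}')
aws s3 cp "s3://vapora-backups/$LATEST" ./latest-backup.sql.gz
gzip -t latest-backup.sql.gz && echo "latest backup decompresses cleanly"
zcat latest-backup.sql.gz | head -5
```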
---

## Scenario 7: Database Size Growing Unexpectedly

**Cause**: Accumulation of data, logs not rotated, storage leak

**Duration**: Varies (prevention focus)
**Data Loss**: None

### Detection

```bash
# Monitor database size
kubectl exec -n vapora surrealdb-0 -- du -sh /var/lib/surrealdb/

# Check disk usage trend
# (Should be ~1-2% growth per week)

# If there is a sudden spike, list the largest files:
kubectl exec -n vapora surrealdb-0 -- \
  find /var/lib/surrealdb/ -type f -exec ls -lh {} + | sort -k5 -h | tail -20
```
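Growth of the data directory only matters relative to the volume size, so also check how full the PVC itself is:

```bash
# Show free space on the volume backing /var/lib/surrealdb/
kubectl exec -n vapora surrealdb-0 -- df -h /var/lib/surrealdb/
```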
### Cleanup Procedure

```bash
# 1. Identify large tables
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "SELECT table, count(*) FROM meta::tb GROUP BY table ORDER BY count DESC"

# 2. If the logs table is too large, prune old entries
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "DELETE FROM audit_logs WHERE created_at < time::now() - 90d"

# 3. Rebuild indexes to reclaim space
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "REBUILD INDEX"

# 4. If still large, delete old records from other tables
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "DELETE FROM tasks WHERE status = 'archived' AND updated_at < time::now() - 1y"

# 5. Monitor size after cleanup
kubectl exec -n vapora surrealdb-0 -- du -sh /var/lib/surrealdb/
```
---

## Scenario 8: Replication Lag (If Using Replicas)

**Cause**: Replica behind primary, network latency

**Duration**: Usually self-healing (seconds to minutes)
**Data Loss**: None

### Detection

```bash
# Check replica lag
kubectl exec -n vapora surrealdb-replica -- \
  surreal sql "SHOW REPLICATION STATUS"

# Look for: "Seconds_Behind_Master" > 5 seconds
```

### Recovery

```bash
# Usually self-healing, but if stuck:

# 1. Check network connectivity
kubectl exec -n vapora surrealdb-replica -- ping surrealdb-primary -c 5

# 2. Restart the replica
kubectl delete pod -n vapora surrealdb-replica

# 3. Monitor the replica catching up
kubectl logs -n vapora surrealdb-replica -f

# 4. Verify replica status
kubectl exec -n vapora surrealdb-replica -- \
  surreal sql "SHOW REPLICATION STATUS"
```
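Minimal container images often ship without `ping`; checking whether the primary's Service has ready endpoints gives a similar signal without relying on binaries inside the container (the Service name `surrealdb-primary` follows the commands above and is an assumption):

```bash
# Confirm the primary Service resolves to at least one ready pod
kubectl get endpoints -n vapora surrealdb-primary
```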
---

## Database Health Checks

The helpers below are Nushell functions; they assume `nu`, `kubectl`, and the `surreal` CLI are available on the operator's workstation.

### Pre-Recovery Verification

```nu
def verify_database_health [] {
  print "=== Database Health Check ==="

  # 1. Connection test
  try {
    surreal sql --conn ws://localhost:8000 "SELECT 1"
  } catch {
    error make {msg: "Cannot connect to database"}
  }

  # 2. Data integrity test
  surreal sql "REBUILD INDEX"
  print "✓ Integrity check passed"

  # 3. Performance test
  surreal sql "SELECT COUNT(*) FROM projects"
  print "✓ Performance acceptable"

  # 4. Replication lag (if applicable)
  # surreal sql "SHOW REPLICATION STATUS"
  # print "✓ No replication lag"

  print "✓ All health checks passed"
}
```
### Post-Recovery Verification

```nu
def verify_recovery_success [] {
  print "=== Post-Recovery Verification ==="

  # 1. Database accessible
  kubectl exec -n vapora surrealdb-0 -- surreal sql "SELECT 1"
  print "✓ Database accessible"

  # 2. All tables present
  kubectl exec -n vapora surrealdb-0 -- surreal sql "SELECT table FROM meta::tb"
  print "✓ All tables present"

  # 3. Record counts reasonable
  kubectl exec -n vapora surrealdb-0 -- surreal sql "SELECT table, count(*) FROM meta::tb"
  print "✓ Record counts verified"

  # 4. Application can connect
  kubectl logs -n vapora deployment/vapora-backend --tail=5 | grep -i connected
  print "✓ Application connected"

  # 5. API operational
  curl http://localhost:8001/api/projects
  print "✓ API operational"
}
```
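Assuming the two Nushell helpers above live in a file such as `recovery-checks.nu` (a hypothetical name), they can be run ad hoc from a shell:

```bash
# Hypothetical invocation of the Nushell helpers above
nu -c "source recovery-checks.nu; verify_database_health"
nu -c "source recovery-checks.nu; verify_recovery_success"
```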
---

## Database Recovery Checklist

### Before Recovery

```
□ Documented failure symptoms
□ Determined root cause
□ Selected appropriate recovery method
□ Located backup to restore
□ Verified backup integrity
□ Notified relevant teams
□ Have runbook available
□ Test environment ready (for testing)
```

### During Recovery

```
□ Followed procedure step-by-step
□ Monitored each step completion
□ Captured any error messages
□ Took notes of timings
□ Did NOT skip verification steps
□ Had backup plans ready
```

### After Recovery

```
□ Verified database accessible
□ Verified data integrity
□ Verified application can connect
□ Checked API endpoints working
□ Monitored error rates
□ Waited for 30 min stability check
□ Documented recovery procedure
□ Identified improvements needed
□ Updated runbooks if needed
```
---

## Recovery Troubleshooting

### Issue: "Cannot connect to database after restore"

**Cause**: Database not fully recovered, network issue

**Solution**:

```bash
# 1. Wait longer (an import can take 15+ minutes)
sleep 60 && kubectl exec -n vapora surrealdb-0 -- surreal sql "SELECT 1"

# 2. Check pod logs
kubectl logs -n vapora surrealdb-0 | tail -50

# 3. Restart the pod
kubectl delete pod -n vapora surrealdb-0

# 4. Check network connectivity
kubectl exec -n vapora surrealdb-0 -- ping localhost
```

### Issue: "Import corrupted data" error

**Cause**: Backup file corrupted or wrong format

**Solution**:

```bash
# 1. Try a different backup
aws s3 ls s3://vapora-backups/database/ | sort | tail -5

# 2. Verify the backup format
file vapora-db-backup.sql
# Should show: text

# 3. Manual inspection
head -20 vapora-db-backup.sql
# Should show SQL format

# 4. Retry the restore with an older backup
```

### Issue: "Database running but data seems wrong"

**Cause**: Restored wrong backup or partial restore

**Solution**:

```bash
# 1. Verify record counts
kubectl exec -n vapora surrealdb-0 -- \
  surreal sql "SELECT table, count(*) FROM meta::tb"

# 2. Compare to the pre-loss baseline
# (from documentation or logs)

# If counts don't match:
# - The wrong backup was used
# - The restore was incomplete
# - Try again with the correct backup
```
---

## Database Recovery Reference

**Recovery Procedure Flowchart**:

```
Database Issue Detected
        ↓
Is it just a pod restart?
  YES → kubectl delete pod surrealdb-0
  NO  → Continue
        ↓
Can queries connect and run?
  YES → Continue with application recovery
  NO  → Continue
        ↓
Is data corrupted (errors in queries)?
  YES → Try REBUILD INDEX
  NO  → Continue
        ↓
Still errors?
  YES → Scale replicas=0, clear PVC, restore from backup
  NO  → Success, monitor for 30 min
```