Vapora/docs/operations/backup-recovery-automation.md
Jesús Pérez 7110ffeea2
chore: extend doc: adr, tutorials, operations, etc
2026-01-12 03:32:47 +00:00


# VAPORA Backup & Recovery Automation
Automated backup and recovery procedures using Nushell scripts and Kubernetes CronJobs. Supports both direct S3 backups and Restic-based incremental backups.

---
## Overview
**Backup Strategy**:
- Hourly: Database export + Restic backup (1-hour RPO)
- Daily: Kubernetes config backup + Restic backup
- Monthly: Clean up old snapshots and archive
**Dual Backup Approach**:
- **S3 Direct**: Simple file upload for quick recovery
- **Restic**: Incremental, deduplicated backups with integrated encryption
**Recovery Procedures**:
- One-command restore from S3 or Restic
- Verification before committing to production
- Automated database readiness checks
---
## Files and Components
### Backup Scripts
All scripts follow NUSHELL_GUIDELINES.md (0.109.0+) strictly.
#### `scripts/backup/database-backup.nu`
Direct S3 backup of SurrealDB with encryption.
```bash
nu scripts/backup/database-backup.nu \
--surreal-url "ws://localhost:8000" \
--surreal-user "root" \
--surreal-pass "$SURREAL_PASS" \
--s3-bucket "vapora-backups" \
--s3-prefix "backups/database" \
--encryption-key "$ENCRYPTION_KEY_FILE"
```
**Process**:
1. Export SurrealDB to SQL
2. Compress with gzip
3. Encrypt with AES-256
4. Upload to S3 with metadata
5. Verify upload completed
**Output**: `s3://vapora-backups/backups/database/database-YYYYMMDD-HHMMSS.sql.gz.enc`
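
The five steps above might be sketched in shell roughly as follows. This is not the contents of `database-backup.nu`: the helper names (`backup_object_key`, `compress_and_encrypt`, `upload_and_verify`) are hypothetical, and the `openssl enc -aes-256-cbc -pbkdf2` invocation is an assumption, since the source only states that AES-256 is used. The export step is left out, so the sketch starts from an already exported SQL file.

```bash
# Sketch only: helper names and the openssl invocation are illustrative assumptions.

# Build the object key in the documented format:
#   <prefix>/database-YYYYMMDD-HHMMSS.sql.gz.enc
backup_object_key() {
  # $1 = S3 prefix, $2 = timestamp (defaults to the current UTC time)
  echo "$1/database-${2:-$(date -u +%Y%m%d-%H%M%S)}.sql.gz.enc"
}

# Steps 2-3: compress, then encrypt with AES-256 (CBC mode is an assumption).
compress_and_encrypt() {
  # $1 = exported SQL file, $2 = key file, $3 = output file
  gzip -c "$1" | openssl enc -aes-256-cbc -pbkdf2 -salt -pass "file:$2" -out "$3"
}

# Steps 4-5: upload, then confirm the object exists in the bucket.
upload_and_verify() {
  # $1 = local file, $2 = bucket, $3 = object key
  aws s3 cp "$1" "s3://$2/$3" &&
    aws s3api head-object --bucket "$2" --key "$3" >/dev/null
}
```

Under these assumptions, `backup_object_key backups/database` yields a key in the same shape as the documented output path, and a file produced by `compress_and_encrypt` could be reversed with `openssl enc -d` using the same options, followed by `gunzip`.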
#### `scripts/backup/config-backup.nu`
Backup Kubernetes resources (ConfigMaps, Secrets, Deployments).
```bash
nu scripts/backup/config-backup.nu \
--namespace "vapora" \
--s3-bucket "vapora-backups" \
--s3-prefix "backups/config"
```
**Process**:
1. Export ConfigMaps from namespace
2. Export Secrets
3. Export Deployments, Services, Ingress
4. Compress all to tar.gz
5. Upload to S3
**Output**: `s3://vapora-backups/backups/config/configs-YYYYMMDD-HHMMSS.tar.gz`
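
As an illustration of steps 1 to 4, the export and archive stages could look something like the sketch below. `export_k8s_config` and the `KUBECTL` override are hypothetical names introduced here, not part of `config-backup.nu`; the list of resource kinds simply mirrors the process above.

```bash
# Sketch only: export_k8s_config is a hypothetical helper, not config-backup.nu itself.
# KUBECTL can be overridden (for example with a stub) when testing.
export_k8s_config() {
  # $1 = namespace, $2 = scratch directory, $3 = archive path (outside the scratch directory)
  mkdir -p "$2" || return 1
  for kind in configmaps secrets deployments services ingress; do
    # Steps 1-3: write one YAML file per resource kind
    ${KUBECTL:-kubectl} get "$kind" -n "$1" -o yaml > "$2/$kind.yaml" || return 1
  done
  # Step 4: compress everything into a single archive
  tar -czf "$3" -C "$2" .
}

# Step 5 (upload) would then be a plain copy, for example:
#   aws s3 cp "$archive" "s3://vapora-backups/backups/config/configs-$(date -u +%Y%m%d-%H%M%S).tar.gz"
```

Exported Secrets are only base64-encoded, so an archive produced this way deserves the same access controls as the cluster's Secrets themselves.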
#### `scripts/backup/restic-backup.nu`
Incremental, deduplicated backup using Restic.
```bash
nu scripts/backup/restic-backup.nu \
--repo "s3:s3.amazonaws.com/vapora-backups/restic" \
--password "$RESTIC_PASSWORD" \
--database-dir "/tmp/vapora-db-backup" \
--k8s-dir "/tmp/vapora-k8s-backup" \
--iac-dir "provisioning" \
--backup-db \
--backup-k8s \
--backup-iac \
--verify \
--cleanup \
--keep-daily 7 \
--keep-weekly 4 \
--keep-monthly 12
```
**Features**:
- Incremental backups (only changed data stored)
- Deduplication across snapshots
- Built-in compression and encryption
- Automatic retention policies
- Repository health verification
**Output**: Tagged snapshots in Restic repository with metadata
#### `scripts/orchestrate-backup-recovery.nu`
Coordinates all backup types (S3 + Restic).
```bash
# Full backup cycle
nu scripts/orchestrate-backup-recovery.nu \
--operation backup \
--mode full \
--surreal-url "ws://localhost:8000" \
--surreal-user "root" \
--surreal-pass "$SURREAL_PASS" \
--namespace "vapora" \
--s3-bucket "vapora-backups" \
--s3-prefix "backups/database" \
--encryption-key "$ENCRYPTION_KEY_FILE" \
--restic-repo "s3:s3.amazonaws.com/vapora-backups/restic" \
--restic-password "$RESTIC_PASSWORD" \
--iac-dir "provisioning"
```
**Modes**:
- `full`: Database export → S3 + Restic
- `database-only`: Database export only
- `config-only`: Kubernetes config only
### Recovery Scripts
#### `scripts/recovery/database-recovery.nu`
Restore SurrealDB from S3 backup (with decryption).
```bash
nu scripts/recovery/database-recovery.nu \
--s3-location "s3://vapora-backups/backups/database/database-20260112-010000.sql.gz.enc" \
--encryption-key "$ENCRYPTION_KEY_FILE" \
--surreal-url "ws://localhost:8000" \
--surreal-user "root" \
--surreal-pass "$SURREAL_PASS" \
--namespace "vapora" \
--statefulset "surrealdb" \
--pvc "surrealdb-data-surrealdb-0" \
--verify
```
**Process**:
1. Download encrypted backup from S3
2. Decrypt backup file
3. Decompress backup
4. Scale down StatefulSet (for PVC replacement)
5. Delete current PVC
6. Scale up StatefulSet (creates new PVC)
7. Wait for pod readiness
8. Import backup to database
9. Verify data integrity
**Output**: Restored database at specified SurrealDB URL
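
Steps 4 to 7 are the part of the process that touches the cluster. A minimal sketch of that sequence, assuming a single-replica StatefulSet, is shown below. `replace_pvc` and `run` are hypothetical helpers, and the timeout value is a placeholder; setting `DRY_RUN=1` prints the commands instead of executing them.

```bash
# Sketch only: replace_pvc is a hypothetical helper, not database-recovery.nu itself.
# Set DRY_RUN=1 to print the commands instead of running them.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

replace_pvc() {
  # $1 = namespace, $2 = StatefulSet name, $3 = PVC name
  run kubectl scale statefulset "$2" --replicas=0 -n "$1" &&         # step 4: stop the pod
  run kubectl delete pvc "$3" -n "$1" &&                              # step 5: drop the old volume claim
  run kubectl scale statefulset "$2" --replicas=1 -n "$1" &&          # step 6: recreate pod and PVC
  run kubectl rollout status "statefulset/$2" -n "$1" --timeout=300s  # step 7: wait for readiness
}
```

For example, `DRY_RUN=1 replace_pvc vapora surrealdb surrealdb-data-surrealdb-0` prints the four `kubectl` commands in order, which is a cheap way to review the plan before running it against a real cluster.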
#### `scripts/orchestrate-backup-recovery.nu` (Recovery Mode)
One-command recovery from backup.
```bash
nu scripts/orchestrate-backup-recovery.nu \
--operation recovery \
--s3-location "s3://vapora-backups/backups/database/database-20260112-010000.sql.gz.enc" \
--encryption-key "$ENCRYPTION_KEY_FILE" \
--surreal-url "ws://localhost:8000" \
--surreal-user "root" \
--surreal-pass "$SURREAL_PASS"
```
### Verification Scripts
#### `scripts/verify-backup-health.nu`
Health check for backup infrastructure.
```bash
# Basic health check
nu scripts/verify-backup-health.nu \
--s3-bucket "vapora-backups" \
--s3-prefix "backups/database" \
--restic-repo "s3:s3.amazonaws.com/vapora-backups/restic" \
--restic-password "$RESTIC_PASSWORD" \
--surreal-url "ws://localhost:8000" \
--surreal-user "root" \
--surreal-pass "$SURREAL_PASS" \
--max-age-hours 25
```
**Checks Performed**:
- ✓ S3 backups exist and have content
- ✓ Restic repository accessible and has snapshots
- ✓ Database connectivity verified
- ✓ Backup freshness (< 25 hours old)
- Backup rotation policy (daily, weekly, monthly)
- Restore test (if `--full-test` specified)
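
The freshness check in the list above can be illustrated with a small sketch. The helper names `backup_timestamp` and `is_fresh` are hypothetical and not taken from `verify-backup-health.nu`; the sketch assumes the timestamp embedded in the backup file name is in UTC. With GNU `date`, the string printed by `backup_timestamp` could be converted to epoch seconds via `date -u -d "<timestamp>" +%s` and passed to `is_fresh` together with `date -u +%s` and the `--max-age-hours` value.

```bash
# Sketch only: helper names are hypothetical, not taken from verify-backup-health.nu.

# Turn ".../database-20260112-010000.sql.gz.enc" into "2026-01-12 01:00:00".
backup_timestamp() {
  basename "$1" |
    sed 's/^database-\(....\)\(..\)\(..\)-\(..\)\(..\)\(..\)\..*$/\1-\2-\3 \4:\5:\6/'
}

# Succeed only when the backup is strictly younger than the allowed age.
# $1 = backup time (epoch seconds), $2 = current time (epoch seconds), $3 = max age in hours
is_fresh() {
  [ $(( $2 - $1 )) -lt $(( $3 * 3600 )) ]
}
```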
**Output**: Pass/fail for each check with detailed status

---
## Kubernetes Automation
### CronJob Configuration
File: `kubernetes/09-backup-cronjobs.yaml`
Defines four automated CronJobs:
#### 1. Hourly Database Backup
```yaml
schedule: "0 * * * *" # Every hour
timeout: 1800 seconds # 30 minutes
```
Runs `orchestrate-backup-recovery.nu --operation backup --mode full`
**Backups**:
- SurrealDB to S3 (encrypted)
- SurrealDB to Restic (incremental)
- IaC to Restic
#### 2. Daily Configuration Backup
```yaml
schedule: "0 2 * * *" # 02:00 UTC daily
timeout: 3600 seconds # 60 minutes
```
Runs `config-backup.nu` for Kubernetes resources.
#### 3. Daily Health Verification
```yaml
schedule: "0 3 * * *" # 03:00 UTC daily
timeout: 900 seconds # 15 minutes
```
Runs `verify-backup-health.nu` to verify backup infrastructure.
**Alerts if**:
- No S3 backups found
- Restic repository inaccessible
- Database unreachable
- Backups older than 25 hours
- Rotation policy violated
#### 4. Monthly Backup Rotation
```yaml
schedule: "0 4 1 * *" # First day of month, 04:00 UTC
timeout: 3600 seconds # 60 minutes
```
Cleans up old Restic snapshots per retention policy:
- Keep: 7 daily, 4 weekly, 12 monthly
- Prune: Remove unreferenced data
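
In Restic terms, the retention policy above corresponds to a `forget` call with `--keep-*` options plus `--prune`. The sketch below shows one way the rotation job might invoke it; `rotate_snapshots` is a hypothetical helper, and whether the actual CronJob calls Restic directly or goes through `restic-backup.nu --cleanup` is not stated in the source.

```bash
# Sketch only: rotate_snapshots is a hypothetical helper.
# Set DRY_RUN=1 to print the command instead of running it.
rotate_snapshots() {
  # $1 = Restic repository; RESTIC_PASSWORD must already be set in the environment
  set -- restic -r "$1" forget --keep-daily 7 --keep-weekly 4 --keep-monthly 12 --prune
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}
```

Restic's own `forget --dry-run` option can be added to preview which snapshots would be removed before any data is pruned.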
### Environment Configuration
CronJobs require these secrets and ConfigMaps:
**ConfigMap: `vapora-config`**
```yaml
backup_s3_bucket: "vapora-backups"
restic_repo: "s3:s3.amazonaws.com/vapora-backups/restic"
aws_region: "us-east-1"
```
**Secret: `vapora-secrets`**
```yaml
surreal_password: "<database-password>"
restic_password: "<restic-encryption-password>"
```
**Secret: `vapora-aws-credentials`**
```yaml
access_key_id: "<aws-access-key>"
secret_access_key: "<aws-secret-key>"
```
**Secret: `vapora-encryption-key`**
```yaml
# File containing AES-256 encryption key
encryption.key: "<binary-key-data>"
```
### Deployment
1. **Create secrets** (if they do not already exist):
```bash
kubectl create secret generic vapora-secrets \
--from-literal=surreal_password="$SURREAL_PASS" \
--from-literal=restic_password="$RESTIC_PASSWORD" \
-n vapora
kubectl create secret generic vapora-aws-credentials \
--from-literal=access_key_id="$AWS_ACCESS_KEY_ID" \
--from-literal=secret_access_key="$AWS_SECRET_ACCESS_KEY" \
-n vapora
kubectl create secret generic vapora-encryption-key \
--from-file=encryption.key=/path/to/encryption.key \
-n vapora
```
2. **Deploy CronJobs**:
```bash
kubectl apply -f kubernetes/09-backup-cronjobs.yaml
```
3. **Verify CronJobs**:
```bash
kubectl get cronjobs -n vapora
kubectl describe cronjob vapora-backup-database-hourly -n vapora
```
4. **Monitor scheduled runs**:
```bash
# Watch CronJob executions
kubectl get jobs -n vapora -l job-type=backup --watch
# View logs from backup job
kubectl logs -n vapora -l backup-type=database --tail=100 -f
```
---
## Setup Instructions
### Prerequisites
- Kubernetes 1.18+ with CronJob support (1.21+ if the manifest uses the stable `batch/v1` CronJob API)
- Nushell 0.109.0+
- AWS CLI v2+
- Restic installed (or container image with restic)
- SurrealDB CLI (`surreal` command)
- `kubectl` with cluster access
### Local Testing
1. **Setup environment variables**:
```bash
export SURREAL_URL="ws://localhost:8000"
export SURREAL_USER="root"
export SURREAL_PASS="password"
export S3_BUCKET="vapora-backups"
export ENCRYPTION_KEY_FILE="/path/to/encryption.key"
export RESTIC_REPO="s3:s3.amazonaws.com/vapora-backups/restic"
export RESTIC_PASSWORD="restic-password"
export AWS_REGION="us-east-1"
export AWS_ACCESS_KEY_ID="your-key"
export AWS_SECRET_ACCESS_KEY="your-secret"
```
2. **Run backup**:
```bash
nu scripts/orchestrate-backup-recovery.nu \
--operation backup \
--mode full \
--surreal-url "$SURREAL_URL" \
--surreal-user "$SURREAL_USER" \
--surreal-pass "$SURREAL_PASS" \
--s3-bucket "$S3_BUCKET" \
--s3-prefix "backups/database" \
--encryption-key "$ENCRYPTION_KEY_FILE" \
--restic-repo "$RESTIC_REPO" \
--restic-password "$RESTIC_PASSWORD" \
--iac-dir "provisioning"
```
3. **Verify backup**:
```bash
nu scripts/verify-backup-health.nu \
--s3-bucket "$S3_BUCKET" \
--s3-prefix "backups/database" \
--restic-repo "$RESTIC_REPO" \
--restic-password "$RESTIC_PASSWORD" \
--surreal-url "$SURREAL_URL" \
--surreal-user "$SURREAL_USER" \
--surreal-pass "$SURREAL_PASS"
```
4. **Test recovery**:
```bash
# First, list available backups
aws s3 ls s3://$S3_BUCKET/backups/database/
# Then recover from latest backup
nu scripts/orchestrate-backup-recovery.nu \
--operation recovery \
--s3-location "s3://$S3_BUCKET/backups/database/database-20260112-010000.sql.gz.enc" \
--encryption-key "$ENCRYPTION_KEY_FILE" \
--surreal-url "$SURREAL_URL" \
--surreal-user "$SURREAL_USER" \
--surreal-pass "$SURREAL_PASS"
```
### Production Deployment
1. **Create S3 bucket** for backups:
```bash
aws s3 mb s3://vapora-backups --region us-east-1
```
2. **Enable bucket versioning** for protection:
```bash
aws s3api put-bucket-versioning \
--bucket vapora-backups \
--versioning-configuration Status=Enabled
```
3. **Set lifecycle policy** for Glacier archival (optional):
```bash
# 30 days to standard-IA, 90 days to Glacier
aws s3api put-bucket-lifecycle-configuration \
--bucket vapora-backups \
--lifecycle-configuration file://s3-lifecycle-policy.json
```
4. **Create Restic repository**:
```bash
export RESTIC_REPO="s3:s3.amazonaws.com/vapora-backups/restic"
export RESTIC_PASSWORD="your-restic-password"
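# Restic reads RESTIC_REPOSITORY (not RESTIC_REPO), so pass the repository explicitly
restic -r "$RESTIC_REPO" init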
```
5. **Deploy to Kubernetes**:
```bash
# 1. Create namespace
kubectl create namespace vapora
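# 2. Create secrets (the CronJobs require all three)
kubectl create secret generic vapora-secrets \
--from-literal=surreal_password="$SURREAL_PASS" \
--from-literal=restic_password="$RESTIC_PASSWORD" \
-n vapora
kubectl create secret generic vapora-aws-credentials \
--from-literal=access_key_id="$AWS_ACCESS_KEY_ID" \
--from-literal=secret_access_key="$AWS_SECRET_ACCESS_KEY" \
-n vapora
kubectl create secret generic vapora-encryption-key \
--from-file=encryption.key=/path/to/encryption.key \
-n vapora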
# 3. Create ConfigMap
kubectl create configmap vapora-config \
--from-literal=backup_s3_bucket="vapora-backups" \
--from-literal=restic_repo="s3:s3.amazonaws.com/vapora-backups/restic" \
--from-literal=aws_region="us-east-1" \
-n vapora
# 4. Deploy CronJobs
kubectl apply -f kubernetes/09-backup-cronjobs.yaml
```
6. **Monitor**:
```bash
# Watch CronJobs
kubectl get cronjobs -n vapora --watch
# View backup logs
kubectl logs -n vapora -l backup-type=database -f
# Check health status
kubectl get jobs -n vapora -l job-type=health-check -o wide
```
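
Step 3 above references `s3-lifecycle-policy.json` without showing it. A minimal sketch of a helper that writes such a file, matching the 30-day Standard-IA and 90-day Glacier transitions mentioned in that step, might look like the following. The function name `write_lifecycle_policy`, the rule ID, and the `backups/` prefix filter are assumptions; the prefix is chosen so that the rule would leave the Restic repository under `restic/` untouched, since Restic generally needs its repository files to remain directly readable. Calling `write_lifecycle_policy s3-lifecycle-policy.json` would produce the file that step 3 passes to `aws s3api`.

```bash
# Sketch only: the rule ID and the "backups/" prefix filter are assumptions.
write_lifecycle_policy() {
  # $1 = output path (step 3 expects s3-lifecycle-policy.json)
  cat > "$1" <<'EOF'
{
  "Rules": [
    {
      "ID": "vapora-backup-archival",
      "Status": "Enabled",
      "Filter": { "Prefix": "backups/" },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" }
      ]
    }
  ]
}
EOF
}
```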
---
## Emergency Recovery
### Complete Database Loss
If the production database is lost, restore it from backup:
```bash
# 1. Scale down StatefulSet
kubectl scale statefulset surrealdb --replicas=0 -n vapora
# 2. Delete current PVC
kubectl delete pvc surrealdb-data-surrealdb-0 -n vapora
# 3. Run recovery (replace LATEST with a real backup timestamp, e.g. 20260112-010000)
nu scripts/orchestrate-backup-recovery.nu \
--operation recovery \
--s3-location "s3://vapora-backups/backups/database/database-LATEST.sql.gz.enc" \
--encryption-key "/path/to/encryption.key" \
--surreal-url "ws://surrealdb:8000" \
--surreal-user "root" \
--surreal-pass "$SURREAL_PASS"
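# 4. Verify database restored
# The SurrealDB CLI reads queries on stdin via `surreal sql`;
# SURREAL_NS and SURREAL_DB are placeholders for your namespace and database.
echo "SELECT count() FROM projects GROUP ALL;" | \
kubectl exec -i -n vapora surrealdb-0 -- \
surreal sql \
--conn ws://localhost:8000 \
--user root \
--pass "$SURREAL_PASS" \
--ns "$SURREAL_NS" \
--db "$SURREAL_DB"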
```
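
The `database-LATEST` name in step 3 has to be replaced with a real object name. One way to find it, sketched below, is to filter the output of `aws s3 ls` and take the lexicographically last name, which works because the `YYYYMMDD-HHMMSS` timestamp sorts chronologically. `latest_backup_name` is a hypothetical helper, not part of the VAPORA scripts.

```bash
# Sketch only: latest_backup_name is a hypothetical helper.
# Reads `aws s3 ls` output on stdin and prints the newest database backup name.
latest_backup_name() {
  awk '{print $4}' |
    grep '^database-[0-9]\{8\}-[0-9]\{6\}\.sql\.gz\.enc$' |
    sort |
    tail -n 1
}
```

It would typically be used as `aws s3 ls s3://vapora-backups/backups/database/ | latest_backup_name`, with the result appended to the bucket and prefix to form the `--s3-location` value.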
### Backup Verification Failed
If the health check fails:
1. **Check Restic repository**:
```bash
export RESTIC_PASSWORD="$RESTIC_PASSWORD"
restic -r "s3:s3.amazonaws.com/vapora-backups/restic" check
```
2. **Force full verification** (slow):
```bash
restic -r "s3:s3.amazonaws.com/vapora-backups/restic" check --read-data
```
3. **List recent snapshots**:
```bash
restic -r "s3:s3.amazonaws.com/vapora-backups/restic" snapshots --latest 10
```
---
## Troubleshooting
| Issue | Cause | Solution |
|-------|-------|----------|
| **CronJob not running** | Schedule incorrect | Check `kubectl get cronjobs` and verify schedule format |
| **Backup file too large** | Database growing | Check for old data that can be cleaned up |
| **S3 upload fails** | Credentials invalid | Verify `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY` |
| **Restic backup slow** | First backup or network latency | Expected on first run; use `--keep-*` flags to limit retention |
| **Recovery fails** | Database already running | Scale down StatefulSet before recovery |
| **Encryption key missing** | Secret not created | Create `vapora-encryption-key` secret in namespace |
---
## Related Documentation
- **Disaster Recovery Procedures**: `docs/disaster-recovery/README.md`
- **Backup Strategy**: `docs/disaster-recovery/backup-strategy.md`
- **Database Recovery**: `docs/disaster-recovery/database-recovery-procedures.md`
- **Operations Guide**: `docs/operations/README.md`
---
**Last Updated**: January 12, 2026
**Status**: Production-Ready
**Automation**: Full CronJob automation with health checks