# VAPORA Backup Strategy

Comprehensive backup and data protection strategy for VAPORA infrastructure.

---

## Overview

**Purpose**: Protect against data loss, corruption, and service interruptions

**Coverage**:
- Database backups (SurrealDB)
- Configuration backups (ConfigMaps, Secrets)
- Application state
- Infrastructure-as-Code
- Container images

**Success Metrics**:
- RPO (Recovery Point Objective): 1 hour (lose at most 1 hour of data)
- RTO (Recovery Time Objective): 4 hours (restore service within 4 hours)
- Backup availability: 99.9% (backups available when needed)
- Backup validation: 100% (all backups tested monthly)

---

## Backup Architecture

### What Gets Backed Up

```
VAPORA Backup Scope

Critical (Daily):
├── Database
│   ├── SurrealDB data
│   ├── User data
│   ├── Project/task data
│   └── Audit logs
├── Configuration
│   ├── ConfigMaps
│   ├── Secrets
│   └── Deployment manifests
└── Infrastructure Code
    ├── Provisioning/Nickel configs
    ├── Kubernetes manifests
    └── Scripts

Important (Weekly):
├── Application logs
├── Metrics data
└── Documentation updates

Optional (As-needed):
├── Container images
├── Build artifacts
└── Development configurations
```

### Backup Storage Strategy

```
PRIMARY BACKUP LOCATION
├── Storage: Cloud object storage (S3/GCS/Azure Blob)
├── Frequency: Hourly for database, daily for configs
├── Retention: 30 days rolling window
├── Encryption: AES-256 at rest
└── Redundancy: Geo-replicated to different region

SECONDARY BACKUP LOCATION (for critical data)
├── Storage: Different cloud provider or on-prem
├── Frequency: Daily
├── Retention: 90 days
├── Purpose: Protection against primary provider outage
└── Testing: Restore tested weekly

ARCHIVE LOCATION (compliance/long-term)
├── Storage: Cold storage (Glacier, Azure Archive)
├── Frequency: Monthly
├── Retention: 7 years (adjust per compliance needs)
├── Purpose: Compliance & legal holds
└── Accessibility: ~4 hours to retrieve
```

---

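## Database Backup Procedures

### SurrealDB Backup

**Backup Method**: Full database dump via SurrealDB export

```bash
# Export full database.
# With no file argument the dump goes to stdout, so it is saved on the
# machine running kubectl rather than inside the pod.
# DB_NAMESPACE and DB_NAME are assumed environment variables; flag names vary
# between SurrealDB releases, so confirm with `surreal export --help`.
kubectl exec -n vapora surrealdb-pod -- \
  surreal export --conn ws://localhost:8000 \
  --user root \
  --pass "$DB_PASSWORD" \
  --ns "$DB_NAMESPACE" --db "$DB_NAME" \
  > "backup-$(date +%Y%m%d-%H%M%S).sql"

# Expected size: 100MB-1GB (depending on data)
# Expected time: 5-15 minutes
```
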
**Automated Backup Setup**

```nushell
# Create backup script: provisioning/scripts/backup-database.nu
def backup_database [output_dir: string] {
    let timestamp = (date now | format date "%Y%m%d-%H%M%S")
    let day = (date now | format date "%Y-%m-%d")
    let backup_file = $"($output_dir)/vapora-db-($timestamp).sql"

    print $"Starting database backup to ($backup_file)..."

    # Export database: stream the dump to stdout and save it locally.
    # Runs in the SurrealDB pod, as in the manual command above.
    # DB_NAMESPACE and DB_NAME are assumed environment variables.
    (kubectl exec -n vapora surrealdb-pod --
        surreal export
        --conn ws://localhost:8000
        --user root
        --pass $env.DB_PASSWORD
        --ns $env.DB_NAMESPACE
        --db $env.DB_NAME
    ) | save --raw --force $backup_file

    # Compress
    gzip $backup_file

    # Upload to S3
    (aws s3 cp $"($backup_file).gz"
        $"s3://vapora-backups/database/($day)/"
        --sse AES256)

    print $"Backup complete: ($backup_file).gz"
}
```

**Backup Schedule**

```yaml
# Kubernetes CronJob for hourly backups
apiVersion: batch/v1
kind: CronJob
metadata:
  name: database-backup
  namespace: vapora
spec:
  schedule: "0 * * * *"  # Every hour
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: vapora/backup-tools:latest
            command:
            # Assumed to be a wrapper that invokes backup-database.nu
            - /scripts/backup-database.sh
            env:
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: password
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: aws-credentials
                  key: access-key
            # AWS_SECRET_ACCESS_KEY must be supplied the same way (not shown)
          restartPolicy: OnFailure
```

### Backup Retention Policy

```
Hourly backups (last 24 hours):
├── Keep: All hourly backups
├── Purpose: Granular recovery options
└── Storage: Standard (fast access)

Daily backups (last 30 days):
├── Keep: 1 per day at midnight UTC
├── Purpose: Daily recovery options
└── Storage: Standard (fast access)

Weekly backups (last 90 days):
├── Keep: 1 per Sunday at midnight UTC
├── Purpose: Medium-term recovery
└── Storage: Standard

Monthly backups (7 years):
├── Keep: 1 per month on 1st at midnight UTC
├── Purpose: Compliance & long-term recovery
└── Storage: Archive (cold storage)
```

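The retention tiers can be expressed as a small classification function. The following is a minimal sketch, not the project's actual pruning logic: `retention_tier` and its output labels are hypothetical names, and it assumes GNU `date` and UTC timestamps.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a backup timestamp to a retention tier.
# Assumes GNU date; both arguments are UTC, e.g. "2024-03-15 12:00:00".
retention_tier() {
  local backup_epoch now_epoch age_hours hour day_of_month day_of_week
  backup_epoch=$(date -u -d "$1" +%s) || return 1
  now_epoch=$(date -u -d "$2" +%s) || return 1
  age_hours=$(( (now_epoch - backup_epoch) / 3600 ))
  hour=$(date -u -d "$1" +%H)
  day_of_month=$(date -u -d "$1" +%d)
  day_of_week=$(date -u -d "$1" +%u)   # 7 = Sunday

  if [ "$age_hours" -lt 24 ]; then
    echo "hourly"      # every backup from the last 24 hours is kept
  elif [ "$hour" != "00" ]; then
    echo "expired"     # older backups survive only if taken at midnight UTC
  elif [ "$age_hours" -lt $((30 * 24)) ]; then
    echo "daily"
  elif [ "$day_of_week" = "7" ] && [ "$age_hours" -lt $((90 * 24)) ]; then
    echo "weekly"
  elif [ "$day_of_month" = "01" ] && [ "$age_hours" -lt $((7 * 365 * 24)) ]; then
    echo "monthly"
  else
    echo "expired"
  fi
}
```

A pruning job could call this for each stored backup and delete only those reported as `expired`.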
### Backup Verification

```nushell
# Daily backup verification (run against the uncompressed .sql dump)
def verify_backup [backup_file: string] {
    print $"Verifying backup: ($backup_file)"

    # 1. Check that the file exists
    if not ($backup_file | path exists) {
        error make {msg: $"Backup file not found: ($backup_file)"}
    }

    # 2. Check file size (should be > 1MB)
    let size = (ls $backup_file | get 0.size)
    if $size < 1MB {
        error make {msg: $"Backup file too small: ($size)"}
    }

    # 3. Check file header (should look like a SurrealDB dump).
    # SurrealDB 1.x/2.x exports begin with an OPTION IMPORT statement;
    # adjust the marker if your version differs.
    let header = (open --raw $backup_file | lines | first 10 | str join "\n")
    if not ($header | str contains "OPTION IMPORT") {
        error make {msg: "Invalid backup format"}
    }

    print "✓ Backup verified successfully"
}

# Monthly restore test
# DB_NAMESPACE and DB_NAME are assumed environment variables.
def test_restore [backup_file: string] {
    print $"Testing restore from: ($backup_file)"

    # 1. Create temporary test database and wait until it is ready
    (kubectl run -n vapora test-db --image=surrealdb/surrealdb:latest
        -- start --user root --pass $env.DB_PASSWORD file://test-data)
    kubectl wait -n vapora --for=condition=Ready pod/test-db --timeout=120s

    # 2. Copy the backup into the pod and restore it
    # (kubectl cp needs tar in the image; otherwise port-forward and import from outside)
    kubectl cp $backup_file vapora/test-db:/tmp/restore.sql
    (kubectl exec -n vapora test-db --
        surreal import --conn ws://localhost:8000
        --user root --pass $env.DB_PASSWORD
        --ns $env.DB_NAMESPACE --db $env.DB_NAME
        /tmp/restore.sql)

    # 3. Verify data integrity (surreal sql reads queries from stdin)
    ("SELECT count() FROM projects GROUP ALL;"
        | kubectl exec -i -n vapora test-db --
            surreal sql --conn ws://localhost:8000
            --user root --pass $env.DB_PASSWORD
            --ns $env.DB_NAMESPACE --db $env.DB_NAME)

    # 4. Compare record counts
    # Should match production database

    # 5. Cleanup test database
    kubectl delete pod -n vapora test-db

    print "✓ Restore test passed"
}
```

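Because the uploaded artifact is the gzip-compressed dump, it can also be useful to check the archive itself. The sketch below writes a SHA-256 sidecar file next to the archive and later verifies both the checksum and the gzip stream. The function names and the `.sha256` sidecar convention are assumptions for illustration, and it relies on GNU coreutils `sha256sum`.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for checking a compressed backup archive.

# write_checksum FILE: create FILE.sha256 next to the archive.
write_checksum() {
  local dir name
  dir=$(dirname "$1")
  name=$(basename "$1")
  ( cd "$dir" && sha256sum "$name" > "$name.sha256" )
}

# verify_backup_archive FILE: succeed only if the archive is non-empty,
# matches its recorded checksum, and is a readable gzip stream.
verify_backup_archive() {
  local dir name
  dir=$(dirname "$1")
  name=$(basename "$1")
  if [ ! -s "$1" ]; then
    echo "missing or empty: $1" >&2
    return 1
  fi
  if ! ( cd "$dir" && sha256sum --check --quiet "$name.sha256" ) >&2; then
    echo "checksum mismatch: $1" >&2
    return 1
  fi
  if ! gzip -t "$1"; then
    echo "corrupt gzip stream: $1" >&2
    return 1
  fi
  echo "OK: $1"
}
```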
---

## Configuration Backup

### ConfigMap & Secret Backups

```bash
# Backup all ConfigMaps
kubectl get configmap -n vapora -o yaml > configmaps-backup-$(date +%Y%m%d).yaml

# Backup all Secrets (encrypted; openssl prompts for a passphrase)
kubectl get secret -n vapora -o yaml | \
  openssl enc -aes-256-cbc -salt -pbkdf2 -out secrets-backup-$(date +%Y%m%d).yaml.enc

# Upload to S3
aws s3 sync . s3://vapora-backups/k8s-configs/$(date +%Y-%m-%d)/ \
  --exclude "*" --include "*.yaml" --include "*.yaml.enc" \
  --sse AES256
```

**Automated Nushell Script**

```nushell
def backup_k8s_configs [output_dir: string] {
    let timestamp = (date now | format date "%Y%m%d")
    let config_dir = $"($output_dir)/k8s-configs-($timestamp)"

    mkdir $config_dir

    # Backup ConfigMaps
    kubectl get configmap -n vapora -o yaml | save --force $"($config_dir)/configmaps.yaml"

    # Backup Secrets (encrypted; BACKUP_PASSPHRASE is an assumed environment
    # variable so that openssl does not prompt when run unattended)
    (kubectl get secret -n vapora -o yaml
        | openssl enc -aes-256-cbc -salt -pbkdf2 -pass env:BACKUP_PASSPHRASE
            -out $"($config_dir)/secrets.yaml.enc")

    # Backup Deployments
    kubectl get deployments -n vapora -o yaml | save --force $"($config_dir)/deployments.yaml"

    # Backup Services
    kubectl get services -n vapora -o yaml | save --force $"($config_dir)/services.yaml"

    # Backup all to archive
    tar -czf $"($config_dir).tar.gz" $config_dir

    # Upload
    (aws s3 cp $"($config_dir).tar.gz"
        s3://vapora-backups/configs/
        --sse AES256)

    print "✓ K8s configs backed up"
}
```

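An encrypted Secrets backup is only useful if it can be decrypted again, and `openssl enc -d` must be given the same cipher and key-derivation options that were used to encrypt. The pair below is a sketch of a matching round trip. `encrypt_to`, `decrypt_from`, and the `BACKUP_PASSPHRASE` variable are hypothetical, and `-pbkdf2` requires OpenSSL 1.1.1 or newer.

```bash
#!/usr/bin/env bash
# Hypothetical helpers; the passphrase is read from the BACKUP_PASSPHRASE
# environment variable (an assumption, not something the source defines).

# encrypt_to FILE: encrypt stdin into FILE.
encrypt_to() {
  openssl enc -aes-256-cbc -salt -pbkdf2 -pass env:BACKUP_PASSPHRASE -out "$1"
}

# decrypt_from FILE: decrypt FILE to stdout using the same options.
decrypt_from() {
  openssl enc -d -aes-256-cbc -pbkdf2 -pass env:BACKUP_PASSPHRASE -in "$1"
}

# Example (not executed here):
#   kubectl get secret -n vapora -o yaml | encrypt_to secrets.yaml.enc
#   decrypt_from secrets.yaml.enc | kubectl apply -f -
```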
---

## Infrastructure-as-Code Backups

### Git Repository Backups

**Primary**: GitHub (with backup organization)

```bash
# Mirror repository to backup location
git clone --mirror https://github.com/your-org/vapora.git \
  vapora-mirror.git

# Push to backup location
cd vapora-mirror.git
git push --mirror https://backup-git-server/vapora-mirror.git
```

**Backup Schedule**

```cron
# Daily mirror push (at midnight)
0 0 * * * /scripts/backup-git-repo.sh
```

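The scheduled script itself is not shown in this document. One way it might look is sketched below: the first run creates the mirror clone, later runs refresh it, and every run pushes to the backup remote. `mirror_repo` and its arguments are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical body for /scripts/backup-git-repo.sh.

# mirror_repo SOURCE_URL BACKUP_URL MIRROR_DIR
mirror_repo() {
  local source_url="$1" backup_url="$2" mirror_dir="$3"
  if [ -d "$mirror_dir" ]; then
    # Existing mirror: fetch all refs and drop ones deleted upstream
    git -C "$mirror_dir" remote update --prune || return 1
  else
    git clone --mirror "$source_url" "$mirror_dir" || return 1
  fi
  # Push every ref (branches, tags) to the backup location
  git -C "$mirror_dir" push --mirror "$backup_url"
}
```

Called from cron, the script would invoke `mirror_repo` with the GitHub URL, the backup server URL, and a persistent working directory, so that each run only transfers new objects.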
### Provisioning Code Backups

```nushell
# Backup Nickel configs & scripts
def backup_provisioning_code [output_dir: string] {
    let timestamp = (date now | format date "%Y%m%d")
    let archive = $"($output_dir)/provisioning-($timestamp).tar.gz"

    # Create backup
    (tar -czf $archive
        provisioning/schemas
        provisioning/scripts
        provisioning/templates)

    # Upload
    (aws s3 cp $archive
        s3://vapora-backups/provisioning/
        --sse AES256)
}
```

---

## Application State Backups

### Persistent Volume Backups

If using persistent volumes for data:

```nushell
# Backup PersistentVolumeClaims
def backup_pvcs [namespace: string] {
    let pvcs = (kubectl get pvc -n $namespace -o json | from json).items
    let day = (date now | format date "%Y-%m-%d")

    for pvc in $pvcs {
        let pvc_name = $pvc.metadata.name
        let volume_size = $pvc.spec.resources.requests.storage

        print $"Backing up PVC: ($pvc_name) \(($volume_size)\)"

        # Resolve the cloud volume ID from the bound PersistentVolume.
        # Assumes the AWS EBS CSI driver, where volumeHandle is the EBS volume ID;
        # the PVC name itself is not a valid --volume-id.
        let volume_id = (kubectl get pv $pvc.spec.volumeName -o json | from json).spec.csi.volumeHandle

        # Create snapshot (cloud-specific)
        (aws ec2 create-snapshot
            --volume-id $volume_id
            --description $"VAPORA backup ($day)")
    }
}
```

### Application Logs

```nushell
# Export logs for archive
def backup_application_logs [output_dir: string] {
    let timestamp = (date now | format date "%Y%m%d")

    # Export last 7 days of logs
    (kubectl logs deployment/vapora-backend -n vapora --since=168h
        | save --force $"($output_dir)/backend-logs-($timestamp).log")

    (kubectl logs deployment/vapora-agents -n vapora --since=168h
        | save --force $"($output_dir)/agents-logs-($timestamp).log")

    # Compress and upload (a quoted glob is not expanded, so list the files first)
    glob $"($output_dir)/*.log" | each {|log_file| gzip $log_file }
    (aws s3 sync $output_dir s3://vapora-backups/logs/
        --exclude "*" --include "*.log.gz"
        --sse AES256)
}
```

---

## Container Image Backups

### Docker Image Registry

```bash
# Tag images for the backup registry
docker tag vapora/backend:latest backup-registry/vapora/backend:backup-$(date +%Y%m%d)
docker tag vapora/agents:latest backup-registry/vapora/agents:backup-$(date +%Y%m%d)
docker tag vapora/llm-router:latest backup-registry/vapora/llm-router:backup-$(date +%Y%m%d)

# Push to backup registry
docker push backup-registry/vapora/backend:backup-$(date +%Y%m%d)
docker push backup-registry/vapora/agents:backup-$(date +%Y%m%d)
docker push backup-registry/vapora/llm-router:backup-$(date +%Y%m%d)

# Retention: Keep last 30 days of images
```

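The 30-day image retention can be enforced by comparing the date embedded in each `backup-YYYYMMDD` tag with a cutoff. The function below is a sketch under the assumption that tags follow exactly that pattern. `expired_backup_tags` is a hypothetical name, it needs GNU `date`, and how the tag list is obtained and how tags are deleted depends on the registry in use.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the backup tags that fall outside the retention window.

# expired_backup_tags RETENTION_DAYS TODAY(YYYYMMDD)   (tags on stdin, one per line)
expired_backup_tags() {
  local retention_days="$1" today="$2" today_epoch cutoff tag stamp
  today_epoch=$(date -u -d "$today" +%s) || return 1
  cutoff=$(date -u -d "@$(( today_epoch - retention_days * 86400 ))" +%Y%m%d)
  while IFS= read -r tag; do
    case "$tag" in
      backup-[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9])
        stamp=${tag#backup-}
        # Tags dated before the cutoff are outside the retention window
        if [ "$stamp" -lt "$cutoff" ]; then
          echo "$tag"
        fi
        ;;
    esac   # anything else (e.g. "latest") is ignored
  done
}
```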
---

## Backup Monitoring

### Backup Health Checks

```nushell
# Daily backup status check
def check_backup_status [] {
    print "=== Backup Status Report ==="

    # 1. Check latest database backup (newest object under database/)
    let latest_db = (aws s3api list-objects-v2 --bucket vapora-backups --prefix database/
        --query 'sort_by(Contents, &LastModified)[-1].LastModified' --output text
        | str trim | into datetime)
    let db_age = (date now) - $latest_db

    if $db_age > 2hr {
        print "⚠️  Database backup stale (> 2 hours old)"
    } else {
        print "✓ Database backup current"
    }

    # 2. Check config backup
    let config_count = (aws s3 ls s3://vapora-backups/configs/ | lines | length)
    if $config_count > 0 {
        print "✓ Config backups present"
    } else {
        print "❌ No config backups found"
    }

    # 3. Check storage usage
    let storage_used = (aws s3 ls s3://vapora-backups/ --recursive --summarize
        | lines | where ($it | str contains "Total Size") | str join | str trim)
    print $"Storage used: ($storage_used)"

    # 4. Check backup encryption
    # list-objects-v2 does not report encryption; check ServerSideEncryption
    # (expected: AES256) with `aws s3api head-object` for each key
    let objects = (aws s3api list-objects-v2 --bucket vapora-backups --query 'Contents[*].Key' | from json)

    print "=== End Report ==="
}
```

### Backup Alerts

Configure alerts for:

```yaml
Backup Failures:
  - Threshold: Backup not completed in 2 hours
  - Action: Alert operations team
  - Severity: High

Backup Staleness:
  - Threshold: Latest backup > 24 hours old
  - Action: Alert operations team
  - Severity: High

Storage Capacity:
  - Threshold: Backup storage > 80% full
  - Action: Alert & plan cleanup
  - Severity: Medium

Restore Test Failures:
  - Threshold: Monthly restore test fails
  - Action: Alert & investigate
  - Severity: Critical
```

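Two of these thresholds reduce to simple arithmetic that a monitoring script can evaluate directly. The functions below are an illustrative sketch: the names, the message format, and the idea of a fixed storage quota (object storage has no built-in "percent full") are assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical alert checks for the staleness and capacity thresholds.

# staleness_alert LAST_BACKUP_EPOCH NOW_EPOCH MAX_AGE_SECONDS
staleness_alert() {
  local age=$(( $2 - $1 ))
  if [ "$age" -gt "$3" ]; then
    echo "HIGH: latest backup is ${age}s old (limit ${3}s)"
  else
    echo "OK"
  fi
}

# capacity_alert USED_BYTES QUOTA_BYTES [THRESHOLD_PERCENT]
capacity_alert() {
  local percent=$(( $1 * 100 / $2 )) threshold="${3:-80}"
  if [ "$percent" -gt "$threshold" ]; then
    echo "MEDIUM: backup storage ${percent}% full (threshold ${threshold}%)"
  else
    echo "OK"
  fi
}
```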
---

## Backup Testing & Validation

### Monthly Restore Test

**Schedule**: First Sunday of each month at 02:00 UTC

```nushell
def monthly_restore_test [] {
    print "Starting monthly restore test..."

    # 1. Select a recent backup (here: the one taken 7 days ago)
    let backup_date = ((date now) - 7day | format date "%Y-%m-%d")

    # 2. Download backup
    (aws s3 cp $"s3://vapora-backups/database/($backup_date)/"
        ./test-backups/
        --recursive)

    # 3. Restore to test environment
    # (See Database Recovery Procedures)

    # 4. Verify data integrity
    # - Count records match
    # - No data corruption
    # - All tables present

    # 5. Verify application works
    # - Can query database
    # - Can perform basic operations

    # 6. Document results
    # - Success/failure
    # - Any issues found
    # - Time taken

    print "✓ Restore test completed"
}
```

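If the test should exercise a randomly chosen recent backup instead of a fixed offset, the date can be drawn at random from the retention window. This is a sketch under assumptions: `random_backup_date` is a hypothetical helper, and it relies on GNU `date` and `/dev/urandom`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: pick a random date between 1 and MAX_AGE_DAYS days ago.

# random_backup_date MAX_AGE_DAYS [TODAY(YYYY-MM-DD)]
random_backup_date() {
  local max_age_days="$1" today="${2:-$(date -u +%Y-%m-%d)}"
  local today_epoch random_value offset
  today_epoch=$(date -u -d "$today" +%s) || return 1
  # Two random bytes read as an unsigned integer (0..65535)
  random_value=$(od -An -N2 -tu2 /dev/urandom | tr -d ' \n')
  offset=$(( random_value % max_age_days + 1 ))
  date -u -d "@$(( today_epoch - offset * 86400 ))" +%Y-%m-%d
}
```

The result has the same `YYYY-MM-DD` form as the date prefix used in the database backup path, so it can be used to build the S3 location to download.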
### Backup Audit Report

**Quarterly**: Generate backup audit report

```nushell
def quarterly_backup_audit [] {
    print "=== Quarterly Backup Audit Report ==="
    print $"Report Date: (date now | format date '%Y-%m-%d')"
    print ""

    print "1. Backup Coverage"
    print "   Database: Hourly ✓"
    print "   Configs: Daily ✓"
    print "   IaC: Daily ✓"
    print ""

    print "2. Restore Tests (Last Quarter)"
    print "   Tests Performed: 3"
    print "   Tests Passed: 3"
    print "   Average Restore Time: 2.5 hours"
    print ""

    print "3. Storage Usage"
    # Calculate storage per category

    print "4. Backup Age Distribution"
    # Show age distribution of backups

    print "5. Incidents & Issues"
    # Any backup-related incidents

    print "6. Recommendations"
    # Any needed improvements
}
```

---

## Backup Security

### Encryption

- ✅ All backups encrypted at rest (AES-256)
- ✅ All backups encrypted in transit (HTTPS/TLS)
- ✅ Encryption keys managed by cloud provider or KMS
- ✅ Separate keys for database and config backups

### Access Control

```
Backup Access Policy:

Read Access:
- Operations team
- Disaster recovery team
- Compliance/audit team

Write Access:
- Automated backup system only
- Require 2FA for manual backups

Delete/Modify Access:
- Require 2 approvals
- Audit logging enabled
- 24-hour delay before deletion
```

### Audit Logging

```
# All backup operations logged
- Backup creation: When, size, hash
- Backup retrieval: Who, when, what
- Restore operations: When, who, from where
- Backup deletion: When, who, reason

# Logs stored separately and immutable
# Example: CloudTrail, S3 access logs, custom logging
```

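For custom logging, each backup operation can emit one structured line carrying the fields listed above. The following is a sketch only: `audit_record` and its `key=value` line format are assumptions, and it uses GNU coreutils `sha256sum`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print one audit line for a backup operation.

# audit_record ACTION FILE ACTOR
audit_record() {
  local action="$1" file="$2" actor="$3" size hash
  size=$(wc -c < "$file" | tr -d ' ') || return 1
  hash=$(sha256sum "$file" | cut -d' ' -f1) || return 1
  # Fields: UTC timestamp (when), action, actor (who), file, size in bytes, SHA-256
  printf '%s action=%s actor=%s file=%s size=%s sha256=%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$action" "$actor" "$file" "$size" "$hash"
}

# Example (not executed here):
#   audit_record create /backups/vapora-db-20240315-100000.sql.gz backup-cronjob >> audit.log
```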
---

## Backup Disaster Scenarios

### Scenario 1: Single Database Backup Fails

**Impact**: 1-hour data loss risk

**Prevention**:
- Backup redundancy (multiple copies)
- Multiple backup methods
- Backup validation after each backup

**Recovery**:
- Use previous hour's backup
- Restore to test environment first
- Validate data integrity
- Restore to production if good

### Scenario 2: Backup Storage Compromised

**Impact**: Data loss + security breach

**Prevention**:
- Encryption with separate keys
- Geographic redundancy
- Backup verification signing
- Access control restrictions

**Recovery**:
- Activate secondary backup location
- Restore from archive backups
- Full security audit

### Scenario 3: Ransomware Infection

**Impact**: All recent backups encrypted

**Prevention**:
- Immutable backups (WORM)
- Air-gapped backups (offline)
- Archive-only old backups
- Regular backup verification

**Recovery**:
- Use air-gapped backup
- Restore to clean environment
- Full security remediation

### Scenario 4: Accidental Data Deletion

**Impact**: Data loss from point of deletion

**Prevention**:
- Frequent backups (hourly)
- Soft deletes in application
- Audit logging

**Recovery**:
- Restore from backup before deletion time
- Point-in-time recovery if available

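
For Scenario 4, the backup to restore is the newest one taken before the deletion. Because the backup file names embed a sortable `YYYYMMDD-HHMMSS` timestamp, it can be selected with a plain text comparison. This is a minimal sketch: `latest_backup_before` is a hypothetical helper, and it expects one object key per line on stdin, named as in the database backup script (`vapora-db-<timestamp>.sql.gz`).

```bash
#!/usr/bin/env bash
# Hypothetical helper: choose the newest backup taken before a cutoff time.

# latest_backup_before CUTOFF(YYYYMMDD-HHMMSS)   (object keys on stdin)
latest_backup_before() {
  local cutoff="$1"
  # Prefix each key with its embedded timestamp, sort by it, then keep the
  # last key whose timestamp is strictly earlier than the cutoff.
  sed -n 's/.*vapora-db-\([0-9]\{8\}-[0-9]\{6\}\)\.sql\.gz$/\1 &/p' |
    sort |
    awk -v cutoff="$cutoff" '
      ($1 "") < (cutoff "") { best = $NF }
      END { if (best != "") print best; else exit 1 }'
}
```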
---

## Backup Checklists

### Daily

- [ ] Database backup completed
- [ ] Backup size normal (not 0 bytes)
- [ ] No backup errors in logs
- [ ] Upload to S3 succeeded
- [ ] Previous backup still available

### Weekly

- [ ] Database backup retention verified
- [ ] Config backup completed
- [ ] Infrastructure code backed up
- [ ] Backup storage space adequate
- [ ] Encryption keys accessible

### Monthly

- [ ] Restore test scheduled
- [ ] Backup audit report generated
- [ ] Backup verification successful
- [ ] Archive backups created
- [ ] Old backups properly retained

### Quarterly

- [ ] Full audit report completed
- [ ] Backup strategy reviewed
- [ ] Team trained on procedures
- [ ] RTO/RPO targets met
- [ ] Recommendations implemented

---

## Summary

**Backup Strategy at a Glance**:

| Item | Frequency | Retention | Storage | Encryption |
|------|-----------|-----------|---------|------------|
| **Database** | Hourly | 30 days | S3 | AES-256 |
| **Config** | Daily | 90 days | S3 | AES-256 |
| **IaC** | Daily | 30 days | Git + S3 | AES-256 |
| **Images** | Daily | 30 days | Registry | Built-in |
| **Archive** | Monthly | 7 years | Glacier | AES-256 |

**Key Metrics**:
- RPO: 1 hour (lose at most 1 hour of data)
- RTO: 4 hours (restore within 4 hours)
- Availability: 99.9% (backups available when needed)
- Validation: 100% (all backups tested monthly)

**Success Criteria**:
- ✅ Daily backup completion
- ✅ Backup validation passes
- ✅ Monthly restore test successful
- ✅ No security incidents
- ✅ Compliance requirements met