VAPORA Backup Strategy

Comprehensive backup and data protection strategy for VAPORA infrastructure.


Overview

Purpose: Protect against data loss, corruption, and service interruptions

Coverage:

  • Database backups (SurrealDB)
  • Configuration backups (ConfigMaps, Secrets)
  • Application state
  • Infrastructure-as-Code
  • Container images

Success Metrics:

  • RPO (Recovery Point Objective): 1 hour (lose at most 1 hour of data)
  • RTO (Recovery Time Objective): 4 hours (restore service within 4 hours)
  • Backup availability: 99.9% (backups always available when needed)
  • Backup validation: 100% (all backups tested monthly)

Backup Architecture

What Gets Backed Up

VAPORA Backup Scope

Critical (Daily):
├── Database
│   ├── SurrealDB data
│   ├── User data
│   ├── Project/task data
│   └── Audit logs
├── Configuration
│   ├── ConfigMaps
│   ├── Secrets
│   └── Deployment manifests
└── Infrastructure Code
    ├── Provisioning/Nickel configs
    ├── Kubernetes manifests
    └── Scripts

Important (Weekly):
├── Application logs
├── Metrics data
└── Documentation updates

Optional (As-needed):
├── Container images
├── Build artifacts
└── Development configurations

Backup Storage Strategy

PRIMARY BACKUP LOCATION
├── Storage: Cloud object storage (S3/GCS/Azure Blob)
├── Frequency: Hourly for database, daily for configs
├── Retention: 30 days rolling window
├── Encryption: AES-256 at rest
└── Redundancy: Geo-replicated to different region

SECONDARY BACKUP LOCATION (for critical data)
├── Storage: Different cloud provider or on-prem
├── Frequency: Daily
├── Retention: 90 days
├── Purpose: Protection against primary provider outage
└── Testing: Restore tested weekly

ARCHIVE LOCATION (compliance/long-term)
├── Storage: Cold storage (Glacier, Azure Archive)
├── Frequency: Monthly
├── Retention: 7 years (adjust per compliance needs)
├── Purpose: Compliance & legal holds
└── Accessibility: ~4 hours to retrieve
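
The tiered retention above can be enforced on the primary bucket with an S3 lifecycle policy. A minimal sketch, assuming the bucket name and prefix layout used throughout this document (database/ for dumps, archive/ for monthly archives):

# Sketch: lifecycle rules for the primary backup bucket (bucket name and prefixes assumed)
let lifecycle = {
  Rules: [
    # Expire database dumps after the 30-day rolling window
    { ID: "database-30d", Status: "Enabled", Filter: { Prefix: "database/" }, Expiration: { Days: 30 } },
    # Move monthly archives to cold storage and keep them ~7 years
    { ID: "archive-glacier", Status: "Enabled", Filter: { Prefix: "archive/" }, Transitions: [{ Days: 30, StorageClass: "GLACIER" }], Expiration: { Days: 2555 } }
  ]
}

aws s3api put-bucket-lifecycle-configuration --bucket vapora-backups --lifecycle-configuration ($lifecycle | to json)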

Database Backup Procedures

SurrealDB Backup

Backup Method: Full database dump via SurrealDB export

# Export full database
kubectl exec -n vapora surrealdb-pod -- \
  surreal export --conn ws://localhost:8000 \
  --user root \
  --pass "$DB_PASSWORD" \
  --output backup-$(date +%Y%m%d-%H%M%S).sql

# Expected size: 100MB-1GB (depending on data)
# Expected time: 5-15 minutes
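
The export above writes the dump to a path inside the pod, so copy it to the operator host before uploading; a sketch using kubectl cp (pod name and file name assumed to match the export command):

# Copy the dump out of the pod (replace <timestamp> with the generated value)
kubectl cp vapora/surrealdb-pod:backup-<timestamp>.sql ./backup-<timestamp>.sql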

Automated Backup Setup

# Create backup script: provisioning/scripts/backup-database.nu
def backup_database [output_dir: string] {
  let timestamp = (date now | format date "%Y%m%d-%H%M%S")
  let backup_file = $"($output_dir)/vapora-db-($timestamp).sql"
  let pod_file = $"/tmp/vapora-db-($timestamp).sql"

  print $"Starting database backup to ($backup_file)..."

  # Export database (the dump is written to a path inside the pod)
  kubectl exec -n vapora deployment/vapora-backend -- \
    surreal export \
      --conn ws://localhost:8000 \
      --user root \
      --pass $env.DB_PASSWORD \
      --output $pod_file

  # Copy the dump out of the pod to the operator host
  let pod = (kubectl get pods -n vapora -o name | lines | where $it =~ "vapora-backend" | first | str replace "pod/" "")
  kubectl cp -n vapora $"($pod):($pod_file)" $backup_file

  # Compress
  gzip $backup_file

  # Upload to S3
  aws s3 cp $"($backup_file).gz" \
    $"s3://vapora-backups/database/(date now | format date '%Y-%m-%d')/" \
    --sse AES256

  print $"Backup complete: ($backup_file).gz"
}
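
To run the script ad hoc from a workstation, source it and call the function; the output directory below is an assumption:

# Ad-hoc invocation (output directory assumed)
source provisioning/scripts/backup-database.nu
backup_database "/var/backups/vapora"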

Backup Schedule

# Kubernetes CronJob for hourly backups
apiVersion: batch/v1
kind: CronJob
metadata:
  name: database-backup
  namespace: vapora
spec:
  schedule: "0 * * * *"  # Every hour
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: backup
            image: vapora/backup-tools:latest
            command:
            - nu
            - -c
            - "source /scripts/backup-database.nu; backup_database '/backups'"  # output dir assumed
            env:
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: db-credentials
                  key: password
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: aws-credentials
                  key: access-key
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: aws-credentials
                  key: secret-key  # key name assumed to match the aws-credentials Secret
          restartPolicy: OnFailure
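
After applying the manifest, the CronJob can be exercised immediately by creating a one-off Job from it rather than waiting for the next scheduled hour:

# Verify the CronJob and trigger a manual run
kubectl get cronjob database-backup -n vapora
kubectl create job --from=cronjob/database-backup database-backup-manual -n vapora
kubectl logs -n vapora job/database-backup-manual --follow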

Backup Retention Policy

Hourly backups (last 24 hours):
├── Keep: All hourly backups
├── Purpose: Granular recovery options
└── Storage: Standard (fast access)

Daily backups (last 30 days):
├── Keep: 1 per day at midnight UTC
├── Purpose: Daily recovery options
└── Storage: Standard (fast access)

Weekly backups (last 90 days):
├── Keep: 1 per Sunday at midnight UTC
├── Purpose: Medium-term recovery
└── Storage: Standard

Monthly backups (7 years):
├── Keep: 1 per month on 1st at midnight UTC
├── Purpose: Compliance & long-term recovery
└── Storage: Archive (cold storage)
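
These tiers can be enforced with lifecycle rules (see the storage strategy above) or with a small pruning job. A hedged Nushell sketch of the 24-hour hourly tier, assuming daily/weekly/monthly copies are promoted to their own prefixes before pruning:

# Sketch: prune hourly database backups older than 24 hours (bucket layout assumed)
def prune_hourly_backups [] {
  let objects = (aws s3api list-objects-v2 --bucket vapora-backups --prefix database/ --output json | from json | get Contents)

  for obj in $objects {
    let age = (date now) - ($obj.LastModified | into datetime)
    if $age > 24hr {
      print $"Deleting expired hourly backup: ($obj.Key)"
      aws s3 rm $"s3://vapora-backups/($obj.Key)"
    }
  }
}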

Backup Verification

# Daily backup verification
def verify_backup [backup_file: string] {
  print $"Verifying backup: ($backup_file)"

  # 1. Check file integrity
  if not ($backup_file | path exists) {
    error make {msg: $"Backup file not found: ($backup_file)"}
  }

  # 2. Check file size (should be > 1MB)
  let size = (ls $backup_file | get 0.size)
  if $size < 1mb {
    error make {msg: $"Backup file too small: ($size)"}
  }

  # 3. Check file header (marker assumed; adjust to match your surreal export format)
  let header = (open -r $backup_file | lines | first 10 | str join "\n")
  if not ($header | str contains "SURREALDB") {
    error make {msg: "Invalid backup format"}
  }

  print "✓ Backup verified successfully"
}

# Monthly restore test
def test_restore [backup_file: string] {
  print $"Testing restore from: ($backup_file)"

  # 1. Create temporary test database
  kubectl run -n vapora test-db --image=surrealdb/surrealdb:latest \
    -- start file://test-data
  kubectl wait -n vapora pod/test-db --for=condition=Ready --timeout=60s

  # 2. Restore backup to test database
  # (copy the dump into the pod first with kubectl cp if it is not already mounted)
  kubectl exec -n vapora test-db -- \
    surreal import --conn ws://localhost:8000 \
    --user root --pass $env.DB_PASSWORD \
    --input $backup_file

  # 3. Verify data integrity (SurrealQL aggregate query)
  kubectl exec -n vapora test-db -- \
    surreal sql --conn ws://localhost:8000 \
    --user root --pass $env.DB_PASSWORD \
    "SELECT count() FROM projects GROUP ALL"

  # 4. Compare record counts
  # Should match production database

  # 5. Cleanup test database
  kubectl delete pod -n vapora test-db

  print "✓ Restore test passed"
}

Configuration Backup

ConfigMap & Secret Backups

# Backup all ConfigMaps
kubectl get configmap -n vapora -o yaml > configmaps-backup-$(date +%Y%m%d).yaml

# Backup all Secrets (encrypted; passphrase read from the BACKUP_PASSPHRASE env var)
kubectl get secret -n vapora -o yaml | \
  openssl enc -aes-256-cbc -salt -pbkdf2 -pass env:BACKUP_PASSPHRASE \
  -out secrets-backup-$(date +%Y%m%d).yaml.enc

# Upload to S3
aws s3 sync . s3://vapora-backups/k8s-configs/$(date +%Y-%m-%d)/ \
  --exclude "*" --include "*.yaml" --include "*.yaml.enc" \
  --sse AES256
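
Restoring from the encrypted secrets backup is the reverse operation: decrypt with the same passphrase and re-apply the manifests (file name assumed):

# Decrypt and re-apply a secrets backup
openssl enc -d -aes-256-cbc -pbkdf2 -pass env:BACKUP_PASSPHRASE -in secrets-backup-<date>.yaml.enc | kubectl apply -f -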

Automated Nushell Script

def backup_k8s_configs [output_dir: string] {
  let timestamp = (date now | format date "%Y%m%d")
  let config_dir = $"($output_dir)/k8s-configs-($timestamp)"

  mkdir $config_dir

  # Backup ConfigMaps
  kubectl get configmap -n vapora -o yaml | save $"($config_dir)/configmaps.yaml"

  # Backup Secrets (encrypted; passphrase read from the BACKUP_PASSPHRASE env var)
  kubectl get secret -n vapora -o yaml | \
    openssl enc -aes-256-cbc -salt -pbkdf2 -pass env:BACKUP_PASSPHRASE \
    -out $"($config_dir)/secrets.yaml.enc"

  # Backup Deployments
  kubectl get deployments -n vapora -o yaml | save $"($config_dir)/deployments.yaml"

  # Backup Services
  kubectl get services -n vapora -o yaml | save $"($config_dir)/services.yaml"

  # Archive everything
  tar -czf $"($config_dir).tar.gz" $config_dir

  # Upload
  aws s3 cp $"($config_dir).tar.gz" \
    s3://vapora-backups/configs/ \
    --sse AES256

  print "✓ K8s configs backed up"
}

Infrastructure-as-Code Backups

Git Repository Backups

Primary: GitHub (with backup organization)

# Mirror repository to backup location
git clone --mirror https://github.com/your-org/vapora.git \
  vapora-mirror.git

# Push to backup location
cd vapora-mirror.git
git push --mirror https://backup-git-server/vapora-mirror.git
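
Recovering the repository from the mirror is the same operation in reverse (backup server URL assumed):

# Restore from the mirror
git clone --mirror https://backup-git-server/vapora-mirror.git
cd vapora-mirror.git
git push --mirror https://github.com/your-org/vapora.git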

Backup Schedule

# Mirror push every 6 hours (meets the daily IaC backup requirement)
0 */6 * * * /scripts/backup-git-repo.sh

Provisioning Code Backups

# Backup Nickel configs & scripts
def backup_provisioning_code [output_dir: string] {
  let timestamp = (date now | format date "%Y%m%d")

  # Create backup
  tar -czf $"($output_dir)/provisioning-($timestamp).tar.gz" \
    provisioning/schemas \
    provisioning/scripts \
    provisioning/templates

  # Upload
  aws s3 cp $"($output_dir)/provisioning-($timestamp).tar.gz" \
    s3://vapora-backups/provisioning/ \
    --sse AES256
}

Application State Backups

Persistent Volume Backups

If using persistent volumes for data:

# Backup PersistentVolumeClaims
def backup_pvcs [namespace: string] {
  let pvcs = (kubectl get pvc -n $namespace -o json | from json).items

  for pvc in $pvcs {
    let pvc_name = $pvc.metadata.name
    let volume_size = $pvc.spec.resources.requests.storage

    print $"Backing up PVC: ($pvc_name), size ($volume_size)"

    # Resolve the bound PV to the cloud volume ID (CSI drivers expose it as volumeHandle)
    let pv_name = $pvc.spec.volumeName
    let volume_id = (kubectl get pv $pv_name -o jsonpath='{.spec.csi.volumeHandle}' | str trim)

    # Create snapshot (cloud-specific; EBS shown here)
    aws ec2 create-snapshot \
      --volume-id $volume_id \
      --description $"VAPORA backup (date now | format date '%Y-%m-%d')"
  }
}

Application Logs

# Export logs for archive
def backup_application_logs [output_dir: string] {
  let timestamp = (date now | format date "%Y%m%d")

  # Export last 7 days of logs
  kubectl logs deployment/vapora-backend -n vapora --since=168h | save $"($output_dir)/backend-logs-($timestamp).log"

  kubectl logs deployment/vapora-agents -n vapora --since=168h | save $"($output_dir)/agents-logs-($timestamp).log"

  # Compress and upload
  glob $"($output_dir)/*.log" | each {|file| gzip $file }
  aws s3 sync $output_dir s3://vapora-backups/logs/ \
    --exclude "*" --include "*.log.gz" \
    --sse AES256
}

Container Image Backups

Docker Image Registry

# Tag images for the backup registry
docker tag vapora/backend:latest backup-registry/vapora/backend:backup-$(date +%Y%m%d)
docker tag vapora/agents:latest backup-registry/vapora/agents:backup-$(date +%Y%m%d)
docker tag vapora/llm-router:latest backup-registry/vapora/llm-router:backup-$(date +%Y%m%d)

# Push to backup registry
docker push backup-registry/vapora/backend:backup-$(date +%Y%m%d)
docker push backup-registry/vapora/agents:backup-$(date +%Y%m%d)
docker push backup-registry/vapora/llm-router:backup-$(date +%Y%m%d)

# Retention: Keep last 30 days of images

Backup Monitoring

Backup Health Checks

# Daily backup status check
def check_backup_status [] {
  print "=== Backup Status Report ==="

  # 1. Check latest database backup age
  let latest_db = (aws s3api list-objects-v2 --bucket vapora-backups --prefix database/ --query 'sort_by(Contents, &LastModified)[-1].LastModified' --output text | str trim)
  let db_age = (date now) - ($latest_db | into datetime)

  if $db_age > 2hr {
    print "⚠️  Database backup stale (> 2 hours old)"
  } else {
    print "✓ Database backup current"
  }

  # 2. Check config backup
  let config_count = (aws s3 ls s3://vapora-backups/configs/ | lines | length)
  if $config_count > 0 {
    print "✓ Config backups present"
  } else {
    print "❌ No config backups found"
  }

  # 3. Check storage usage
  let storage_used = (aws s3 ls s3://vapora-backups/ --recursive --summarize | lines | where $it =~ "Total Size" | first)
  print $"Storage used: ($storage_used)"

  # 4. Check backup encryption (spot check; per-object SSE info requires head-object)
  let sample_key = (aws s3api list-objects-v2 --bucket vapora-backups --max-items 1 --query 'Contents[0].Key' --output text | str trim)
  let sse = (aws s3api head-object --bucket vapora-backups --key $sample_key --query 'ServerSideEncryption' --output text)
  print $"Sample object encryption: ($sse) (expected AES256)"

  print "=== End Report ==="
}

Backup Alerts

Configure alerts for:

Backup Failures:
  - Threshold: Backup not completed in 2 hours
  - Action: Alert operations team
  - Severity: High

Backup Staleness (see the check sketch after this list):
  - Threshold: Latest backup > 24 hours old
  - Action: Alert operations team
  - Severity: High

Storage Capacity:
  - Threshold: Backup storage > 80% full
  - Action: Alert & plan cleanup
  - Severity: Medium

Restore Test Failures:
  - Threshold: Monthly restore test fails
  - Action: Alert & investigate
  - Severity: Critical
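
One way to implement the staleness alert above is a small Nushell check that posts to an alerting webhook; the webhook URL is an assumption:

# Sketch: alert if the newest database backup is older than 24 hours (webhook URL assumed)
def alert_if_backup_stale [] {
  let newest = (aws s3api list-objects-v2 --bucket vapora-backups --prefix database/ --query 'sort_by(Contents, &LastModified)[-1].LastModified' --output text | str trim)
  let age = (date now) - ($newest | into datetime)

  if $age > 24hr {
    let payload = { severity: "high", message: $"Latest VAPORA database backup is ($age) old" }
    http post --content-type application/json https://alerts.example.com/hooks/backup-staleness ($payload | to json)
  }
}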

Backup Testing & Validation

Monthly Restore Test

Schedule: First Sunday of each month at 02:00 UTC

def monthly_restore_test [] {
  print "Starting monthly restore test..."

  # 1. Select a recent backup (here: the backup from 7 days ago)
  let backup_date = (((date now) - 7day) | format date "%Y-%m-%d")

  # 2. Download backup
  aws s3 cp $"s3://vapora-backups/database/($backup_date)/" ./test-backups/ --recursive

  # 3. Restore to test environment
  # (See Database Recovery Procedures)

  # 4. Verify data integrity
  # - Count records match
  # - No data corruption
  # - All tables present

  # 5. Verify application works
  # - Can query database
  # - Can perform basic operations

  # 6. Document results
  # - Success/failure
  # - Any issues found
  # - Time taken

  print "✓ Restore test completed"
}

Backup Audit Report

Quarterly: Generate backup audit report

def quarterly_backup_audit [] {
  print "=== Quarterly Backup Audit Report ==="
  print $"Report Date: (date now | format date %Y-%m-%d)"
  print ""

  print "1. Backup Coverage"
  print "   Database: Daily ✓"
  print "   Configs: Daily ✓"
  print "   IaC: Daily ✓"
  print ""

  print "2. Restore Tests (Last Quarter)"
  print "   Tests Performed: 3"
  print "   Tests Passed: 3"
  print "   Average Restore Time: 2.5 hours"
  print ""

  print "3. Storage Usage"
  # Calculate storage per category

  print "4. Backup Age Distribution"
  # Show age distribution of backups

  print "5. Incidents & Issues"
  # Any backup-related incidents

  print "6. Recommendations"
  # Any needed improvements
}

Backup Security

Encryption

  • ✅ All backups encrypted at rest (AES-256)
  • ✅ All backups encrypted in transit (HTTPS/TLS)
  • ✅ Encryption keys managed by cloud provider or KMS
  • ✅ Separate keys for database and config backups

Access Control

Backup Access Policy:

Read Access:
  - Operations team
  - Disaster recovery team
  - Compliance/audit team

Write Access:
  - Automated backup system only
  - Require 2FA for manual backups

Delete/Modify Access:
  - Require 2 approvals
  - Audit logging enabled
  - 24-hour delay before deletion

Audit Logging

# All backup operations logged
- Backup creation: When, size, hash
- Backup retrieval: Who, when, what
- Restore operations: When, who, from where
- Backup deletion: When, who, reason

# Logs stored separately and immutable
# Example: CloudTrail, S3 access logs, custom logging
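
A minimal sketch of shipping the backup bucket's access logs to a separate, locked-down bucket (bucket names assumed; CloudTrail data events are the managed alternative):

# Sketch: enable S3 access logging for the backup bucket (target bucket name assumed)
let logging = {
  LoggingEnabled: {
    TargetBucket: "vapora-backup-audit-logs",
    TargetPrefix: "vapora-backups/"
  }
}

aws s3api put-bucket-logging --bucket vapora-backups --bucket-logging-status ($logging | to json)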

Backup Disaster Scenarios

Scenario 1: Single Database Backup Fails

Impact: 1-hour data loss risk

Prevention:

  • Backup redundancy (multiple copies)
  • Multiple backup methods
  • Backup validation after each backup

Recovery:

  • Use previous hour's backup
  • Restore to test environment first
  • Validate data integrity
  • Restore to production if good

Scenario 2: Backup Storage Compromised

Impact: Data loss + security breach

Prevention:

  • Encryption with separate keys
  • Geographic redundancy
  • Backup verification signing
  • Access control restrictions

Recovery:

  • Activate secondary backup location
  • Restore from archive backups
  • Full security audit

Scenario 3: Ransomware Infection

Impact: All recent backups encrypted

Prevention:

  • Immutable backups (WORM; see the Object Lock sketch below)
  • Air-gapped backups (offline)
  • Archive-only old backups
  • Regular backup verification

Recovery:

  • Use air-gapped backup
  • Restore to clean environment
  • Full security remediation
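
For the immutable (WORM) backups listed under prevention, S3 Object Lock is one concrete option; a sketch assuming the backup bucket was created with Object Lock enabled:

# Sketch: enforce 30-day compliance-mode retention on new backup objects
# (requires a bucket created with Object Lock enabled; bucket name assumed)
let lock = {
  ObjectLockEnabled: "Enabled",
  Rule: { DefaultRetention: { Mode: "COMPLIANCE", Days: 30 } }
}

aws s3api put-object-lock-configuration --bucket vapora-backups --object-lock-configuration ($lock | to json)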

Scenario 4: Accidental Data Deletion

Impact: Data loss from point of deletion

Prevention:

  • Frequent backups (hourly)
  • Soft deletes in application
  • Audit logging

Recovery:

  • Restore from backup before deletion time
  • Point-in-time recovery if available

Backup Checklists

Daily

  • Database backup completed
  • Backup size normal (not 0 bytes)
  • No backup errors in logs
  • Upload to S3 succeeded
  • Previous backup still available

Weekly

  • Database backup retention verified
  • Config backup completed
  • Infrastructure code backed up
  • Backup storage space adequate
  • Encryption keys accessible

Monthly

  • Restore test scheduled
  • Backup audit report generated
  • Backup verification successful
  • Archive backups created
  • Old backups properly retained

Quarterly

  • Full audit report completed
  • Backup strategy reviewed
  • Team trained on procedures
  • RTO/RPO targets met
  • Recommendations implemented

Summary

Backup Strategy at a Glance:

Item     | Frequency | Retention | Storage  | Encryption
---------|-----------|-----------|----------|-----------
Database | Hourly    | 30 days   | S3       | AES-256
Config   | Daily     | 90 days   | S3       | AES-256
IaC      | Daily     | 30 days   | Git + S3 | AES-256
Images   | Daily     | 30 days   | Registry | Built-in
Archive  | Monthly   | 7 years   | Glacier  | AES-256

Key Metrics:

  • RPO: 1 hour (lose at most 1 hour of data)
  • RTO: 4 hours (restore within 4 hours)
  • Availability: 99.9% (backups available when needed)
  • Validation: 100% (all backups tested monthly)

Success Criteria:

  • ✅ Daily backup completion
  • ✅ Backup validation passes
  • ✅ Monthly restore test successful
  • ✅ No security incidents
  • ✅ Compliance requirements met