Vapora/docs/operations/pre-deployment-checklist.md
Jesús Pérez 7110ffeea2
chore: extend doc: adr, tutorials, operations, etc
2026-01-12 03:32:47 +00:00

Pre-Deployment Checklist

Critical verification steps before any VAPORA deployment to production or staging.


24 Hours Before Deployment

Communication & Scheduling

  • Schedule deployment with team (record in calendar/ticket)
  • Post in #deployments channel: "Deployment scheduled for [DATE TIME UTC]"
  • Identify on-call engineer for deployment period
  • Brief on-call on deployment plan and rollback procedure
  • Ensure affected teams (support, product, etc.) are notified
  • Verify no other critical infrastructure changes are scheduled in the same time window

Change Documentation

  • Create GitHub issue or ticket tracking the deployment
  • Document: what's changing (configs, manifests, versions)
  • Document: why (bug fix, feature, performance, security)
  • Document: rollback plan (revision number or previous config)
  • Document: success criteria (what indicates successful deployment)
  • Document: estimated duration (usually 5-15 minutes)

Code Review & Validation

  • All provisioning changes merged and code reviewed
  • Confirm main branch has latest changes
  • Run validation locally: nu scripts/validate-config.nu --mode enterprise
  • Verify all 3 modes (solo, multiuser, enterprise) validate without errors or critical warnings
  • Check git log for unexpected commits
  • Review artifact generation: ensure configs are correct
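
The "all 3 modes" check can be scripted as a loop. This is a minimal sketch, assuming the three modes are solo, multiuser and enterprise (the names used in the generated artifact filenames) and that the validator accepts each of them through the same `--mode` flag shown above. `validate_all_modes` is a hypothetical helper name.

```shell
# Validate every deployment mode and report which ones fail.
# Assumption: mode names are taken from the artifact filenames
# (vapora-{solo,multiuser,enterprise}.*); adjust if yours differ.
validate_all_modes() {
    failed=""
    for mode in solo multiuser enterprise; do
        if nu scripts/validate-config.nu --mode "$mode"; then
            echo "OK   $mode"
        else
            echo "FAIL $mode"
            failed="$failed $mode"
        fi
    done
    # Non-zero exit status if any mode failed validation.
    [ -z "$failed" ]
}
```

Run `validate_all_modes` from the repository root. A non-zero exit status means at least one mode needs attention before the deployment is scheduled.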

4 Hours Before Deployment

Environment Verification

Staging Environment

  • Access staging Kubernetes cluster: kubectl cluster-info
  • Verify cluster is healthy: kubectl get nodes (all Ready)
  • Check namespace exists: kubectl get namespace vapora
  • Verify current deployments: kubectl get deployments -n vapora
  • Check ConfigMap is up to date: kubectl get configmap vapora-config -n vapora -o yaml | head -20

Production Environment (if applicable)

  • Access production Kubernetes cluster: kubectl cluster-info
  • Verify all nodes healthy: kubectl get nodes (all Ready)
  • Check current resource usage: kubectl top nodes (not near capacity)
  • Verify current deployments: kubectl get deployments -n vapora
  • Check pod status: kubectl get pods -n vapora (all Running)
  • Verify recent events: kubectl get events -n vapora --sort-by='.lastTimestamp' | tail -10
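
The "all Ready" check on nodes is easy to misread in a long listing. The sketch below filters the listing down to the nodes that need attention. `not_ready_nodes` is a hypothetical helper; it only parses text and makes no cluster calls itself.

```shell
# Print the name of every node whose STATUS is not exactly "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin, where
# column 1 is the node name and column 2 is the status.
not_ready_nodes() {
    awk '$2 != "Ready" { print $1 }'
}

# Usage (requires cluster access):
#   kubectl get nodes --no-headers | not_ready_nodes
```

Empty output means every node is Ready. Cordoned nodes (status `Ready,SchedulingDisabled`) are listed on purpose, since they reduce the capacity available to the rollout.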

Health Baseline

  • Record current metrics before deployment

    • CPU usage per deployment
    • Memory usage per deployment
    • Request latency (p50, p95, p99)
    • Error rate (4xx, 5xx)
    • Queue depth (if applicable)
  • Verify services are responsive:

    curl http://localhost:8001/health -H "Authorization: Bearer $TOKEN"
    curl http://localhost:8001/api/projects
    
  • Check logs for recent errors:

    kubectl logs deployment/vapora-backend -n vapora --tail=50
    kubectl logs deployment/vapora-agents -n vapora --tail=50
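
To make the baseline comparable after the deployment, the CPU, memory and pod-status part of it can be written to a file. This is a sketch only: `save_baseline` and the default file name are hypothetical, and latency and error-rate figures still have to come from your monitoring system.

```shell
# Save a timestamped snapshot of pod resource usage and pod status
# so post-deployment numbers can be compared against it.
# The default output file name is a hypothetical choice.
save_baseline() {
    out_file="${1:-baseline-before.txt}"
    {
        echo "# Baseline recorded at $(date -u '+%Y-%m-%dT%H:%M:%SZ')"
        echo "## kubectl top pods"
        kubectl top pods -n vapora
        echo "## kubectl get pods"
        kubectl get pods -n vapora
    } > "$out_file"
}
```

Calling `save_baseline baseline-before.txt` before the deployment and `save_baseline baseline-after.txt` afterwards gives two files that can be compared with `diff`.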
    

Infrastructure Check

  • Verify storage is not near capacity: df -h /var/lib/vapora
  • Check database health: kubectl exec -n vapora <pod> -- surreal isready
  • Verify backups are recent (within 24 hours)
  • Check SSL certificate expiration: openssl s_client -connect api.vapora.com:443 -servername api.vapora.com </dev/null 2>/dev/null | openssl x509 -noout -dates
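
The storage check above can be turned into a pass/fail test instead of an eyeballed `df` listing. A minimal sketch: `storage_ok` is a hypothetical helper, and the 80% default is an assumed threshold, not a documented VAPORA limit.

```shell
# Succeed only if the filesystem holding PATH is at or below MAX_PCT used.
# `df -P` forces the portable one-line-per-filesystem format, where
# column 5 is the use percentage (e.g. "42%").
storage_ok() {
    path="$1"
    max_pct="${2:-80}"   # 80% is an assumed threshold, not a VAPORA default
    used=$(df -P "$path" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')
    echo "$path: ${used}% used (limit ${max_pct}%)"
    [ "$used" -le "$max_pct" ]
}

# Usage, on the host or inside the pod that mounts the volume:
#   storage_ok /var/lib/vapora 80 || echo "storage near capacity"
```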

2 Hours Before Deployment

Artifact Preparation

  • Trigger validation in CI/CD pipeline

  • Wait for artifact generation to complete

  • Download artifacts from pipeline:

    # From GitHub Actions or Woodpecker UI
    # Download: deployment-artifacts.zip
    
  • Verify artifact contents:

    unzip deployment-artifacts.zip
    ls -la
    # Should contain:
    # - configmap.yaml
    # - deployment.yaml
    # - docker-compose.yml
    # - vapora-{solo,multiuser,enterprise}.{toml,yaml,json}
    
  • Validate manifest syntax:

    yq eval '.' configmap.yaml > /dev/null && echo "✓ ConfigMap valid"
    yq eval '.' deployment.yaml > /dev/null && echo "✓ Deployment valid"
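
Checking the unzipped artifact list by eye is error-prone, so the expected files can be verified mechanically. The sketch below assumes the brace pattern in the "Should contain" comment means all nine mode/extension combinations; `missing_artifacts` is a hypothetical helper.

```shell
# Print every expected artifact that is missing from DIR (default: .).
# The file list mirrors the "Should contain" comment above. The pattern
# vapora-{solo,multiuser,enterprise}.{toml,yaml,json} is expanded with
# explicit loops so the function also works in plain POSIX sh.
missing_artifacts() {
    dir="${1:-.}"
    for f in configmap.yaml deployment.yaml docker-compose.yml; do
        [ -f "$dir/$f" ] || echo "$f"
    done
    for mode in solo multiuser enterprise; do
        for ext in toml yaml json; do
            [ -f "$dir/vapora-$mode.$ext" ] || echo "vapora-$mode.$ext"
        done
    done
}
```

Empty output from `missing_artifacts .` means the archive is complete. Any file name printed should block the deployment until the pipeline is re-run.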
    

Test in Staging

  • Perform dry-run deployment to staging cluster:

    kubectl apply -f configmap.yaml --dry-run=server -n vapora
    kubectl apply -f deployment.yaml --dry-run=server -n vapora
    
  • Review dry-run output for any warnings or errors

  • If a staging deployment is possible, deploy to staging and verify:

    kubectl get deployments -n vapora
    kubectl get pods -n vapora
    kubectl logs deployment/vapora-backend -n vapora --tail=5
    
  • Test health endpoints on staging

  • Run smoke tests against staging (if available)
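
The health-endpoint test can be reduced to a strict "HTTP 200 or fail" check. A minimal sketch: `check_health` is a hypothetical helper and `STAGING_URL` is a placeholder for wherever the staging backend is reachable. The paths `/health` and `/api/projects` are the ones used earlier in this checklist.

```shell
# Succeed only if URL answers with HTTP 200.
# -s silences progress output, -o /dev/null discards the body,
# -w '%{http_code}' prints just the status code, and
# --max-time bounds how long a hung service can stall the check.
check_health() {
    url="$1"
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")
    echo "$url -> $code"
    [ "$code" = "200" ]
}

# Usage; STAGING_URL is a placeholder for your staging base URL:
#   check_health "$STAGING_URL/health" && check_health "$STAGING_URL/api/projects"
```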

Rollback Plan Verification

  • Document current deployment revisions:

    kubectl rollout history deployment/vapora-backend -n vapora
    # Record the highest revision number
    
  • Create backup of current ConfigMap:

    kubectl get configmap -n vapora vapora-config -o yaml > configmap-backup.yaml
    
  • Test rollback procedure on staging (if safe):

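    # Record the revision that is live now. rollout history output can end
    # with a blank line, which makes `tail -1` unreliable, so read the
    # revision annotation instead.
    CURRENT_REV=$(kubectl get deployment vapora-backend -n vapora \
      -o jsonpath='{.metadata.annotations.deployment\.kubernetes\.io/revision}')
    
    # Test undo and wait for it to finish
    kubectl rollout undo deployment/vapora-backend -n vapora
    kubectl rollout status deployment/vapora-backend -n vapora
    
    # Verify rollback
    kubectl get deployment vapora-backend -n vapora -o yaml | grep image
    
    # Restore to current
    kubectl rollout undo deployment/vapora-backend -n vapora --to-revision="$CURRENT_REV"
    kubectl rollout status deployment/vapora-backend -n vapora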
    
  • Confirm rollback command is documented in ticket/issue
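
The rollback commands for the ticket can be generated rather than typed by hand. The sketch below, run before the deployment, pins each command to the revision that is currently live. `rollback_commands` is a hypothetical helper, and the three deployment names are the ones used in the quick-rollback commands elsewhere in this checklist.

```shell
# Print a ready-to-paste rollback command for each deployment, pinned to
# the revision that is live right now. Kubernetes stores that number in
# the deployment.kubernetes.io/revision annotation.
rollback_commands() {
    for name in vapora-backend vapora-agents vapora-llm-router; do
        rev=$(kubectl get deployment "$name" -n vapora \
            -o jsonpath='{.metadata.annotations.deployment\.kubernetes\.io/revision}')
        echo "kubectl rollout undo deployment/$name -n vapora --to-revision=$rev"
    done
}
```

Pinning to an explicit revision avoids ambiguity if more than one rollout happens during the window. A plain `rollout undo` only steps back one revision from whatever is current at that moment.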


1 Hour Before Deployment

Final Checks

  • Confirm all prerequisites met:
    • Code merged to main
    • Artifacts generated and validated
    • Staging deployment tested
    • Rollback plan documented
    • Team notified

Communication Setup

  • Set status page to "Maintenance Mode" (if public)

    "VAPORA maintenance deployment starting at HH:MM UTC.
     Expected duration: 10 minutes. Services may be briefly unavailable."
    
  • Join #deployments Slack channel

  • Prepare message: "🚀 Deployment starting now. Will update every 2 minutes."

  • Have on-call engineer monitoring

  • Verify monitoring/alerting dashboards are accessible

Access Verification

  • Verify kubeconfig is valid and up to date:

    kubectl cluster-info
    kubectl get nodes
    
  • Verify kubectl version compatibility:

    kubectl version
    # Should match server version reasonably (within 1 minor version)
    
  • Test write access to cluster:

    kubectl auth can-i create deployments --namespace=vapora
    # Should return "yes"
    
  • Verify docker/docker-compose access (if Docker deployment)

  • Verify Slack webhook is working (test send message)
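
The "within 1 minor version" rule can be checked numerically. This is a sketch under the assumption that you read the two `minor` fields from `kubectl version -o json` (`clientVersion.minor` and `serverVersion.minor`); `minor_skew` is a hypothetical helper that does only the arithmetic.

```shell
# Print the absolute difference between two Kubernetes minor versions.
# Managed clusters often report the minor with a suffix such as "27+",
# so everything from the first non-digit onward is dropped.
minor_skew() {
    client="${1%%[!0-9]*}"
    server="${2%%[!0-9]*}"
    diff=$((client - server))
    if [ "$diff" -lt 0 ]; then
        diff=$((0 - diff))
    fi
    echo "$diff"
}

# Usage, with the two minor values read from `kubectl version -o json`:
#   [ "$(minor_skew 29 28+)" -le 1 ] || echo "kubectl/server skew too large"
```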


15 Minutes Before Deployment

Final Go/No-Go Decision

STOP HERE and make final decision to proceed or reschedule:

Proceed IF:

  • All checklist items above completed
  • No critical issues found during testing
  • Staging deployment successful
  • Team ready and monitoring
  • Rollback plan clear and tested
  • Within designated maintenance window

RESCHEDULE IF:

  • Any critical issues discovered
  • Staging tests failed
  • Team member unavailable
  • Production issues detected
  • Unexpected changes in code/configs

Final Notifications

If proceeding:

  • Post to #deployments: "🚀 Deployment starting in 5 minutes"
  • Alert on-call engineer: "Ready to start - confirm you're monitoring"
  • Have rollback plan visible and accessible
  • Open monitoring dashboard showing current metrics

Terminal Setup

  • Open terminal with kubeconfig configured:

    export KUBECONFIG=/path/to/production/kubeconfig
    kubectl cluster-info  # Verify connected to production
    
  • Open second terminal for tailing logs:

    kubectl logs -f deployment/vapora-backend -n vapora
    
  • Have rollback commands ready:

    # For quick rollback if needed
    kubectl rollout undo deployment/vapora-backend -n vapora
    kubectl rollout undo deployment/vapora-agents -n vapora
    kubectl rollout undo deployment/vapora-llm-router -n vapora
    
  • Prepare metrics check script:

    # Run each watch in its own terminal or pane; watch blocks until interrupted
    watch kubectl top pods -n vapora
    watch kubectl get pods -n vapora
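
Because the terminal setup switches `KUBECONFIG` by hand, a guard against running production commands in the wrong context is cheap insurance. A minimal sketch: `require_context` is a hypothetical helper, and the expected context name must come from your own kubeconfig, since this checklist does not define one.

```shell
# Refuse to continue unless kubectl points at the expected context.
# The expected context name is supplied by the caller; this checklist
# does not define one.
require_context() {
    expected="$1"
    actual=$(kubectl config current-context)
    if [ "$actual" != "$expected" ]; then
        echo "Wrong context: '$actual' (expected '$expected')" >&2
        return 1
    fi
    echo "Context OK: $actual"
}
```

Running `require_context <your-production-context> || exit 1` at the top of each terminal session stops the session before any command reaches the wrong cluster.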
    

Success Criteria Verification

Document what "success" looks like for this deployment:

  • All three deployments have updated image IDs
  • All pods reach "Ready" state within 5 minutes
  • No pod restarts: kubectl get pods -n vapora --watch (RESTARTS column does not increase)
  • No error logs in first 2 minutes
  • Health endpoints respond (200 OK)
  • API endpoints respond to test requests
  • Metrics show normal resource usage
  • No alerts triggered
  • Support team reports no user impact
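
The "no pod restarts" criterion can be checked as a single number instead of watching a column. The sketch below sums the RESTARTS column. `total_restarts` is a hypothetical helper that only parses text.

```shell
# Sum the RESTARTS column of `kubectl get pods --no-headers` output
# read from stdin. Column 4 holds the count; newer kubectl versions
# append "(5m ago)" as extra columns, which does not shift column 4.
total_restarts() {
    awk '{ sum += $4 } END { print sum + 0 }'
}

# Usage: take two readings a minute or two apart once the rollout
# has finished; the total should not grow between them.
#   kubectl get pods -n vapora --no-headers | total_restarts
```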

Team Roles During Deployment

Deployment Lead

  • Executes deployment commands
  • Monitors progress
  • Communicates status updates
  • Decides whether to proceed or roll back

On-Call Engineer

  • Monitors dashboards and alerts
  • Watches for anomalies
  • Prepares for rollback if needed
  • Available for emergency decisions

Communications Lead (optional)

  • Updates #deployments channel
  • Notifies support/product teams
  • Updates status page if public
  • Handles external communication

Backup Person

  • Monitors for issues
  • Ready to assist with troubleshooting
  • Prepares rollback procedures
  • Escalates if needed

Common Issues to Watch For

⚠️ Pod CrashLoopBackOff

  • Indicates config or image issue
  • Check pod logs: kubectl logs <pod>
  • Check events: kubectl describe pod <pod>
  • Action: Roll back immediately

⚠️ Pending Pods (not starting)

  • Check resource availability: kubectl describe pod <pod>
  • Check node capacity
  • Action: Investigate, or roll back if resources are exhausted

⚠️ High Error Rate

  • Check application logs
  • Compare with baseline errors
  • Action: If the error rate rises more than 10% above the recorded baseline, roll back
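
The 10% rule can be made explicit so that nobody has to do the arithmetic under pressure. This sketch reads the threshold as a relative increase over the baseline error rate, which is an assumption about what the rule means. `error_rate_exceeded` is a hypothetical helper.

```shell
# Succeed (exit 0) when CURRENT exceeds BASELINE by more than
# MAX_INCREASE_PCT percent, i.e. when a rollback is indicated.
# Rates are percentages such as 0.5 for 0.5%. awk does the
# floating-point comparison because POSIX sh arithmetic is integer-only.
error_rate_exceeded() {
    baseline="$1"
    current="$2"
    max_increase_pct="${3:-10}"
    awk -v b="$baseline" -v c="$current" -v m="$max_increase_pct" \
        'BEGIN { exit !(c > b * (1 + m / 100)) }'
}
```

For example, `error_rate_exceeded 1.0 1.2 && echo "roll back"` prints the message, because 1.2% is more than 10% above a 1.0% baseline. With a baseline of 0, any non-zero error rate triggers the rule.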

⚠️ Database Connection Errors

  • Check ConfigMap has correct database URL
  • Verify network connectivity to database
  • Action: Check ConfigMap, fix and reapply if needed

⚠️ Memory or CPU Spike

  • Monitor trends (sudden spike vs gradual)
  • Check if within expected range for new code
  • Action: Roll back if resource limits are exceeded

Post-Deployment Documentation

After deployment completes, record:

  • Deployment start time (UTC)
  • Deployment end time (UTC)
  • Total duration
  • Any issues encountered and resolution
  • Rollback performed (Y/N)
  • Metrics before/after (CPU, memory, latency, errors)
  • Team members involved
  • Blockers or lessons learned

Sign-Off

Use this template for the deployment issue/ticket:

DEPLOYMENT COMPLETED

✓ All checks passed
✓ Deployment successful
✓ All pods running
✓ Health checks passing
✓ No user impact

Deployed by: [Name]
Start time: [UTC]
Duration: [X minutes]
Rollback needed: No

Metrics:
- Latency (p99): [X]ms
- Error rate: [X]%
- Pod restarts: 0

Next deployment: [Date/Time]