# Pre-Deployment Checklist

Critical verification steps before any VAPORA deployment to production or staging.

---

## 24 Hours Before Deployment

### Communication & Scheduling

- [ ] Schedule deployment with team (record in calendar/ticket)
- [ ] Post in #deployments channel: "Deployment scheduled for [DATE TIME UTC]"
- [ ] Identify on-call engineer for deployment period
- [ ] Brief on-call on deployment plan and rollback procedure
- [ ] Ensure affected teams (support, product, etc.) are notified
- [ ] Verify no other critical infrastructure changes are scheduled in the same time window

### Change Documentation

- [ ] Create GitHub issue or ticket tracking the deployment
- [ ] Document: what's changing (configs, manifests, versions)
- [ ] Document: why (bug fix, feature, performance, security)
- [ ] Document: rollback plan (revision number or previous config)
- [ ] Document: success criteria (what indicates successful deployment)
- [ ] Document: estimated duration (usually 5-15 minutes)

### Code Review & Validation

- [ ] All provisioning changes merged and code reviewed
- [ ] Confirm `main` branch has latest changes
- [ ] Run validation locally: `nu scripts/validate-config.nu --mode enterprise`
- [ ] Verify all 3 modes (solo, multiuser, enterprise) validate without errors or critical warnings
- [ ] Check git log for unexpected commits
- [ ] Review artifact generation: ensure configs are correct

---

## 4 Hours Before Deployment

### Environment Verification

#### Staging Environment

- [ ] Access staging Kubernetes cluster: `kubectl cluster-info`
- [ ] Verify cluster is healthy: `kubectl get nodes` (all Ready)
- [ ] Check namespace exists: `kubectl get namespace vapora`
- [ ] Verify current deployments: `kubectl get deployments -n vapora`
- [ ] Check ConfigMap is up to date: `kubectl get configmap -n vapora -o yaml | head -20`

#### Production Environment (if applicable)

- [ ] Access production Kubernetes cluster: `kubectl cluster-info`
- [ ] Verify all nodes healthy: `kubectl get nodes` (all Ready)
- [ ] Check current resource usage: `kubectl top nodes` (not near capacity)
- [ ] Verify current deployments: `kubectl get deployments -n vapora`
- [ ] Check pod status: `kubectl get pods -n vapora` (all Running)
- [ ] Verify recent events: `kubectl get events -n vapora --sort-by='.lastTimestamp' | tail -10`

### Health Baseline

- [ ] Record current metrics before deployment
  - CPU usage per deployment
  - Memory usage per deployment
  - Request latency (p50, p95, p99)
  - Error rate (4xx, 5xx)
  - Queue depth (if applicable)
- [ ] Verify services are responsive:
  ```bash
  curl http://localhost:8001/health -H "Authorization: Bearer $TOKEN"
  curl http://localhost:8001/api/projects
  ```
- [ ] Check logs for recent errors:
  ```bash
  kubectl logs deployment/vapora-backend -n vapora --tail=50
  kubectl logs deployment/vapora-agents -n vapora --tail=50
  ```

### Infrastructure Check

- [ ] Verify storage is not near capacity: `df -h /var/lib/vapora`
- [ ] Check database health: `kubectl exec -n vapora <surrealdb-pod> -- surreal info`
- [ ] Verify backups are recent (within 24 hours)
- [ ] Check SSL certificate expiration: `echo | openssl s_client -connect api.vapora.com:443 -servername api.vapora.com 2>/dev/null | openssl x509 -noout -dates`

---

## 2 Hours Before Deployment

### Artifact Preparation

- [ ] Trigger validation in CI/CD pipeline
- [ ] Wait for artifact generation to complete
- [ ] Download artifacts from pipeline:
  ```bash
  # From GitHub Actions or Woodpecker UI
  # Download: deployment-artifacts.zip
  ```
- [ ] Verify artifact contents:
  ```bash
  unzip deployment-artifacts.zip
  ls -la
  # Should contain:
  # - configmap.yaml
  # - deployment.yaml
  # - docker-compose.yml
  # - vapora-{solo,multiuser,enterprise}.{toml,yaml,json}
  ```
- [ ] Validate manifest syntax:
  ```bash
  yq eval '.' configmap.yaml > /dev/null && echo "✓ ConfigMap valid"
  yq eval '.' deployment.yaml > /dev/null && echo "✓ Deployment valid"
  ```

### Test in Staging

- [ ] Perform dry-run deployment to staging cluster:
  ```bash
  kubectl apply -f configmap.yaml --dry-run=server -n vapora
  kubectl apply -f deployment.yaml --dry-run=server -n vapora
  ```
- [ ] Review dry-run output for any warnings or errors
- [ ] If a test deployment is available, do an actual staging deployment and verify:
  ```bash
  kubectl get deployments -n vapora
  kubectl get pods -n vapora
  kubectl logs deployment/vapora-backend -n vapora --tail=5
  ```
- [ ] Test health endpoints on staging
- [ ] Run smoke tests against staging (if available)

### Rollback Plan Verification

- [ ] Document current deployment revisions:
  ```bash
  kubectl rollout history deployment/vapora-backend -n vapora
  # Record the highest revision number
  ```
- [ ] Create backup of current ConfigMap:
  ```bash
  kubectl get configmap -n vapora vapora-config -o yaml > configmap-backup.yaml
  ```
- [ ] Test rollback procedure on staging (if safe):
  ```bash
  # Record current revision (last numeric entry in the history table;
  # a plain `tail -1` can pick up a trailing blank line instead)
  CURRENT_REV=$(kubectl rollout history deployment/vapora-backend -n vapora | awk '$1 ~ /^[0-9]+$/ {rev=$1} END {print rev}')

  # Test undo
  kubectl rollout undo deployment/vapora-backend -n vapora

  # Verify rollback
  kubectl get deployment vapora-backend -n vapora -o yaml | grep image

  # Restore to current
  kubectl rollout undo deployment/vapora-backend -n vapora --to-revision="$CURRENT_REV"
  ```
- [ ] Confirm rollback command is documented in ticket/issue

---

## 1 Hour Before Deployment

### Final Checks

- [ ] Confirm all prerequisites met:
  - [ ] Code merged to main
  - [ ] Artifacts generated and validated
  - [ ] Staging deployment tested
  - [ ] Rollback plan documented
  - [ ] Team notified

### Communication Setup

- [ ] Set status page to "Maintenance Mode" (if public)
  ```
  "VAPORA maintenance deployment starting at HH:MM UTC.
  Expected duration: 10 minutes.
  Services may be briefly unavailable."
  ```
- [ ] Join #deployments Slack channel
- [ ] Prepare message: "🚀 Deployment starting now. Will update every 2 minutes."
- [ ] Have on-call engineer monitoring
- [ ] Verify monitoring/alerting dashboards are accessible

### Access Verification

- [ ] Verify kubeconfig is valid and up to date:
  ```bash
  kubectl cluster-info
  kubectl get nodes
  ```
- [ ] Verify kubectl version compatibility:
  ```bash
  kubectl version
  # Client should be within 1 minor version of the server
  ```
- [ ] Test write access to cluster:
  ```bash
  kubectl auth can-i create deployments --namespace=vapora
  # Should return "yes"
  ```
- [ ] Verify docker/docker-compose access (if Docker deployment)
- [ ] Verify Slack webhook is working (send a test message)

---

## 15 Minutes Before Deployment

### Final Go/No-Go Decision

**STOP HERE** and make the final decision to proceed or reschedule:

**Proceed IF:**

- ✅ All checklist items above completed
- ✅ No critical issues found during testing
- ✅ Staging deployment successful
- ✅ Team ready and monitoring
- ✅ Rollback plan clear and tested
- ✅ Within designated maintenance window

**RESCHEDULE IF:**

- ❌ Any critical issues discovered
- ❌ Staging tests failed
- ❌ Team member unavailable
- ❌ Production issues detected
- ❌ Unexpected changes in code/configs

### Final Notifications

If proceeding:

- [ ] Post to #deployments: "🚀 Deployment starting in 5 minutes"
- [ ] Alert on-call engineer: "Ready to start - confirm you're monitoring"
- [ ] Have rollback plan visible and accessible
- [ ] Open monitoring dashboard showing current metrics

### Terminal Setup

- [ ] Open terminal with kubeconfig configured:
  ```bash
  export KUBECONFIG=/path/to/production/kubeconfig
  kubectl cluster-info  # Verify connected to production
  ```
- [ ] Open second terminal for tailing logs:
  ```bash
  kubectl logs -f deployment/vapora-backend -n vapora
  ```
- [ ] Have rollback commands ready:
  ```bash
  # For quick rollback if needed
  kubectl rollout undo deployment/vapora-backend -n vapora
  kubectl rollout undo deployment/vapora-agents -n vapora
  kubectl rollout undo deployment/vapora-llm-router -n vapora
  ```
- [ ] Prepare metrics check commands:
  ```bash
  # Each watch blocks its terminal, so run them in separate terminals
  watch kubectl top pods -n vapora
  watch kubectl get pods -n vapora
  ```

---

## Success Criteria Verification

Document what "success" looks like for this deployment:

- [ ] All three deployments have updated image IDs
- [ ] All pods reach "Ready" state within 5 minutes
- [ ] No pod restarts: `kubectl get pods -n vapora --watch` (RESTARTS column not increasing)
- [ ] No error logs in first 2 minutes
- [ ] Health endpoints respond (200 OK)
- [ ] API endpoints respond to test requests
- [ ] Metrics show normal resource usage
- [ ] No alerts triggered
- [ ] Support team reports no user impact

---

## Team Roles During Deployment

### Deployment Lead

- Executes deployment commands
- Monitors progress
- Communicates status updates
- Decides to proceed/rollback

### On-Call Engineer

- Monitors dashboards and alerts
- Watches for anomalies
- Prepares for rollback if needed
- Available for emergency decisions

### Communications Lead (optional)

- Updates #deployments channel
- Notifies support/product teams
- Updates status page if public
- Handles external communication

### Backup Person

- Monitors for issues
- Ready to assist with troubleshooting
- Prepares rollback procedures
- Escalates if needed

---

## Common Issues to Watch For

⚠️ **Pod CrashLoopBackOff**

- Indicates config or image issue
- Check pod logs: `kubectl logs <pod-name> -n vapora`
- Check events: `kubectl describe pod <pod-name> -n vapora`
- **Action**: Rollback immediately

⚠️ **Pending Pods (not starting)**

- Check resource availability: `kubectl describe pod <pod-name> -n vapora`
- Check node capacity
- **Action**: Investigate or rollback if resource exhausted

⚠️ **High Error Rate**

- Check application logs
- Compare with baseline errors
- **Action**: If >10% error increase, rollback

⚠️ **Database Connection Errors**

- Check ConfigMap has correct database URL
- Verify network connectivity to database
- **Action**: Check ConfigMap, fix and reapply if needed

⚠️ **Memory or CPU Spike**

- Monitor trends (sudden spike vs gradual)
- Check if within expected range for new code
- **Action**: Rollback if resource limits exceeded

---

## Post-Deployment Documentation

After deployment completes, record:

- [ ] Deployment start time (UTC)
- [ ] Deployment end time (UTC)
- [ ] Total duration
- [ ] Any issues encountered and resolution
- [ ] Rollback performed (Y/N)
- [ ] Metrics before/after (CPU, memory, latency, errors)
- [ ] Team members involved
- [ ] Blockers or lessons learned

---

## Sign-Off

Use this template for the deployment issue/ticket:

```
DEPLOYMENT COMPLETED

✓ All checks passed
✓ Deployment successful
✓ All pods running
✓ Health checks passing
✓ No user impact

Deployed by: [Name]
Start time: [UTC]
Duration: [X minutes]
Rollback needed: No

Metrics:
- Latency (p99): [X]ms
- Error rate: [X]%
- Pod restarts: 0

Next deployment: [Date/Time]
```
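
The `Duration` and `Pod restarts` fields in the sign-off template above can be computed rather than estimated. The following is a minimal sketch, not part of the VAPORA tooling: the helper names `sum_restarts` and `duration_minutes` are hypothetical, and the commented-out `kubectl` line assumes restart counts are exposed at `.status.containerStatuses[*].restartCount`.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for filling in the sign-off template.

# Sum restart counts read from stdin. Accepts one value per line or
# comma-separated values (kubectl prints multi-container pods as "0,1").
# Non-numeric entries such as "<none>" are ignored.
sum_restarts() {
  tr ',' '\n' | awk '$1 ~ /^[0-9]+$/ { total += $1 } END { print total + 0 }'
}

# Whole minutes between two epoch timestamps in seconds, rounded up.
# Usage: duration_minutes START_EPOCH END_EPOCH
duration_minutes() {
  echo $(( ($2 - $1 + 59) / 60 ))
}

# Example wiring (needs cluster access, so it is left commented out):
# kubectl get pods -n vapora --no-headers \
#   -o custom-columns='RESTARTS:.status.containerStatuses[*].restartCount' | sum_restarts
```

Recording `date +%s` at the start and end of the deployment gives the two arguments for `duration_minutes`. Rounding up is a deliberate choice here so that a short deployment is never reported as 0 minutes.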