# Deployment Runbook

Step-by-step procedures for deploying VAPORA to staging and production environments.

---

## Quick Start

For experienced operators:

```bash
# Validate in CI/CD
# Download artifacts
# Review dry-run
# Apply:
kubectl apply -f configmap.yaml -f deployment.yaml
# Monitor:
kubectl logs -f deployment/vapora-backend -n vapora
# Verify:
curl http://localhost:8001/health
```

For complete steps, continue reading.

---

## Before Starting

✅ **Prerequisites Completed**:

- [ ] Pre-deployment checklist completed
- [ ] Artifacts generated and validated
- [ ] Staging deployment verified
- [ ] Team ready and monitoring
- [ ] Maintenance window announced

✅ **Access Verified**:

- [ ] kubectl configured for the target cluster
- [ ] Can list nodes: `kubectl get nodes`
- [ ] Can access the namespace: `kubectl get namespace vapora`

❌ **If any prerequisite is missing**: go back to the pre-deployment checklist.

---

## Phase 1: Pre-Flight (5 minutes)

### 1.1 Verify Current State

```bash
# Set context
export CLUSTER=production   # or staging
export NAMESPACE=vapora

# Verify cluster access
kubectl cluster-info
kubectl get nodes

# Output should show:
# NAME     STATUS   ROLES    AGE
# node-1   Ready    worker   30d
# node-2   Ready    worker   25d
```

**What to look for:**

- ✓ All nodes in "Ready" state
- ✓ No "NotReady" or "Unknown" nodes
- If there are issues: don't proceed; investigate node health first

### 1.2 Check Current Deployments

```bash
# Get current deployment status
kubectl get deployments -n $NAMESPACE -o wide
kubectl get pods -n $NAMESPACE

# Output example:
# NAME                READY   UP-TO-DATE   AVAILABLE
# vapora-backend      3/3     3            3
# vapora-agents       2/2     2            2
# vapora-llm-router   2/2     2            2
```

**What to look for:**

- ✓ All deployments showing the expected replica count
- ✓ All pods in "Running" state
- ❌ If pods are in "CrashLoopBackOff" or "Pending": investigate before proceeding

### 1.3 Record Current Versions

```bash
# Get current image versions (baseline for rollback)
kubectl get deployments -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}'

# Expected output:
# vapora-backend      vapora/backend:v1.2.0
# vapora-agents       vapora/agents:v1.2.0
# vapora-llm-router   vapora/llm-router:v1.2.0
```

**Record these for rollback**: keep this output visible.

### 1.4 Get Current Revision Numbers

```bash
# For each deployment, get the rollout history
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  echo "=== $deployment ==="
  kubectl rollout history deployment/$deployment -n $NAMESPACE | tail -5
done

# Output example:
# REVISION  CHANGE-CAUSE
# 42        Deployment rolled out
# 43        Deployment rolled out
# 44        (current)
```

**Record the highest revision number for each deployment**: this is your rollback reference.

### 1.5 Check Cluster Resources

```bash
# Verify the cluster has capacity for the new deployment
kubectl top nodes
kubectl describe nodes | grep -A 5 "Allocated resources"

# Example output - check CPU/memory availability:
#   Resource   Requests      Limits
#   cpu        8200m (41%)   16400m (82%)
```

**What to look for:**

- ✓ Less than 80% resource utilization (a scripted check is sketched below)
- ❌ If above 85%: insufficient capacity, don't proceed
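If you prefer this gate scripted, here is a minimal sketch that parses the `Allocated resources` block shown above and flags any node whose CPU requests exceed 80% of allocatable. The awk field positions assume the standard `kubectl describe nodes` layout; adjust them if your output differs.

```bash
# Hypothetical capacity gate: prints each node's CPU request percentage and
# exits non-zero if any node is above the 80% threshold used in this step.
kubectl describe nodes | awk '
  /^Name:/                { node = $2 }
  /Allocated resources/   { in_alloc = 1 }
  in_alloc && $1 == "cpu" {
    pct = $3                       # e.g. "(41%)"
    gsub(/[()%]/, "", pct)
    printf "%-30s cpu requests: %s%%\n", node, pct
    if (pct + 0 > 80) over = 1
    in_alloc = 0
  }
  END { exit over }'
echo "Capacity check exit code: $?"   # non-zero means a node is over 80%
```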
---

## Phase 2: Configuration Deployment (3 minutes)

### 2.1 Apply ConfigMap

The ConfigMap contains all application configuration.

```bash
# First: dry-run to verify there are no syntax errors
kubectl apply -f configmap.yaml --dry-run=server -n $NAMESPACE

# Should output:
# configmap/vapora-config configured (server dry run)

# Check for any warnings or errors in the output.
# If there are errors, stop and fix the YAML before proceeding.
```

**Troubleshooting**:

- "error validating": YAML syntax error; fix and retry
- "field is immutable": certain ConfigMap fields can't be changed; delete and recreate
- "exceeded quota": namespace resource quota exceeded; contact the cluster admin

### 2.2 Apply ConfigMap for Real

```bash
# Apply the actual ConfigMap
kubectl apply -f configmap.yaml -n $NAMESPACE

# Output:
# configmap/vapora-config configured

# Verify it was applied
kubectl get configmap vapora-config -n $NAMESPACE -o yaml | head -20
# Check for your new values in the output
```

**Verify the ConfigMap is correct**:

```bash
# Extract specific values to verify
kubectl get configmap vapora-config -n $NAMESPACE -o jsonpath='{.data.vapora\.toml}' | grep "database_url" | head -1

# Should show the correct database URL
```

### 2.3 Annotate ConfigMap

Record when this config was deployed, for the audit trail:

```bash
kubectl annotate configmap vapora-config \
  -n $NAMESPACE \
  deployment.timestamp="$(date -u +'%Y-%m-%dT%H:%M:%SZ')" \
  deployment.commit="$(git rev-parse HEAD | cut -c1-8)" \
  deployment.branch="$(git rev-parse --abbrev-ref HEAD)" \
  --overwrite

# Verify the annotations were added
kubectl get configmap vapora-config -n $NAMESPACE -o yaml | grep "deployment\."
```

---

## Phase 3: Deployment Update (5 minutes)

### 3.1 Dry-Run Deployment

Always dry-run first to catch issues:

```bash
# Run deployment dry-run
kubectl apply -f deployment.yaml --dry-run=server -n $NAMESPACE

# Output should show what will be updated:
# deployment.apps/vapora-backend configured (server dry run)
# deployment.apps/vapora-agents configured (server dry run)
# deployment.apps/vapora-llm-router configured (server dry run)
```

**Check for warnings**:

- "ImagePullBackOff": the Docker image doesn't exist or can't be pulled
- "insufficient quota": resource limits exceeded
- "nodeAffinity": the pod can't be placed on any node

### 3.2 Apply Deployments

```bash
# Apply the actual deployments
kubectl apply -f deployment.yaml -n $NAMESPACE

# Output:
# deployment.apps/vapora-backend configured
# deployment.apps/vapora-agents configured
# deployment.apps/vapora-llm-router configured
```

**Verify the deployments updated**:

```bash
# Check that a new rollout was initiated
kubectl get deployments -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.observedGeneration}{"\n"}{end}'

# Each generation should have incremented compared to before the apply
```
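Optionally, record a change cause now so the revision history from step 1.4 reads better during a rollback. `kubernetes.io/change-cause` is the standard annotation behind the CHANGE-CAUSE column; the message format below is only a suggestion:

```bash
# Annotate each deployment so `kubectl rollout history` shows what changed
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  kubectl annotate deployment/$deployment -n $NAMESPACE \
    kubernetes.io/change-cause="deploy $(git rev-parse --short HEAD) at $(date -u +'%Y-%m-%dT%H:%M:%SZ')" \
    --overwrite
done
```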
### 3.3 Monitor Rollout Progress

Watch the deployment rollout status:

```bash
# For each deployment, monitor the rollout
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  echo "Waiting for $deployment..."
  kubectl rollout status deployment/$deployment \
    -n $NAMESPACE \
    --timeout=5m
  echo "$deployment ready"
done
```

**What to look for** (per pod update):

```
Waiting for rollout to finish: 2 of 3 updated replicas are available...
Waiting for rollout to finish: 2 of 3 updated replicas are available...
Waiting for rollout to finish: 3 of 3 updated replicas are available...
deployment "vapora-backend" successfully rolled out
```

**Expected time: 2-3 minutes per deployment**

### 3.4 Watch Pod Updates (in a separate terminal)

While the rollout completes, monitor the pods:

```bash
# Watch pods being updated in real time
kubectl get pods -n $NAMESPACE -w

# Output shows updates like:
# NAME                            READY   STATUS
# vapora-backend-abc123-def45     1/1     Running
# vapora-backend-xyz789-old-pod   1/1     Running             ← old pod still running
# vapora-backend-abc123-new-pod   0/1     Pending             ← new pod starting
# vapora-backend-abc123-new-pod   0/1     ContainerCreating
# vapora-backend-abc123-new-pod   1/1     Running             ← new pod ready
# vapora-backend-xyz789-old-pod   1/1     Terminating         ← old pod being removed
```

**What to look for:**

- ✓ New pods starting (Pending → ContainerCreating → Running)
- ✓ Each new pod reaches the Running state
- ✓ Old pods gradually terminating
- ❌ Pod stuck in "CrashLoopBackOff": stop and check the logs; you may need to roll back

---

## Phase 4: Verification (5 minutes)

### 4.1 Verify All Pods Running

```bash
# Check all pods are ready
kubectl get pods -n $NAMESPACE

# Expected output:
# NAME                   READY   STATUS
# vapora-backend--1      1/1     Running
# vapora-backend--2      1/1     Running
# vapora-backend--3      1/1     Running
# vapora-agents--1       1/1     Running
# vapora-agents--2       1/1     Running
# vapora-llm-router--1   1/1     Running
# vapora-llm-router--2   1/1     Running
```

**Verification**:

```bash
# All pods should show READY=1/1
# All pods should show STATUS=Running
# No pods should be in Pending, CrashLoopBackOff, or Error state

# Quick check:
READY=$(kubectl get pods -n $NAMESPACE -o jsonpath='{range .items[*]}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' | grep -c "True")
TOTAL=$(kubectl get pods -n $NAMESPACE --no-headers | wc -l)
echo "Ready pods: $READY / $TOTAL"

# Should show: Ready pods: 7 / 7 (or your expected pod count)
```

### 4.2 Check Pod Logs for Errors

```bash
# Check logs from the last minute for errors
for pod in $(kubectl get pods -n $NAMESPACE -o name); do
  echo "=== $pod ==="
  kubectl logs $pod -n $NAMESPACE --since=1m 2>&1 | grep -i "error\|exception\|fatal" | head -3
done

# If errors are found:
# 1. Note which pods have errors
# 2. Get the full log: kubectl logs <pod> -n $NAMESPACE
# 3. Decide: can you proceed, or do you need to roll back?
```
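To turn the loop above into a pass/fail signal, here is a small sketch that totals suspicious lines across all pods. It reuses the same grep pattern, so expect false positives if your logs legitimately contain these words:

```bash
# Hypothetical pass/fail wrapper around the per-pod log check
ERRORS=0
for pod in $(kubectl get pods -n $NAMESPACE -o name); do
  COUNT=$(kubectl logs "$pod" -n $NAMESPACE --since=1m 2>/dev/null \
    | grep -ci "error\|exception\|fatal")
  if [ "${COUNT:-0}" -gt 0 ]; then
    echo "$pod: $COUNT suspicious lines"
    ERRORS=$((ERRORS + COUNT))
  fi
done
echo "Total suspicious log lines: $ERRORS"
[ "$ERRORS" -eq 0 ]   # exit status 0 means clean
```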
### 4.3 Verify Service Endpoints

```bash
# Check services are exposing pods correctly
kubectl get endpoints -n $NAMESPACE

# Expected output:
# NAME                ENDPOINTS
# vapora-backend      10.1.2.3:8001,10.1.2.4:8001,10.1.2.5:8001
# vapora-agents       10.1.2.6:8002,10.1.2.7:8002
# vapora-llm-router   10.1.2.8:8003,10.1.2.9:8003
```

**Verification**:

- ✓ Each service has endpoints (not empty)
- ✓ Endpoints match the running pods
- ❌ If endpoints are empty: the service can't route traffic

### 4.4 Health Check Endpoints

```bash
# Port-forward to access services locally
kubectl port-forward -n $NAMESPACE svc/vapora-backend 8001:8001 &

# Wait a moment for the port-forward to establish
sleep 2

# Check backend health
curl -v http://localhost:8001/health

# Expected response:
# HTTP/1.1 200 OK
# {...healthy response...}

# Check other endpoints
curl http://localhost:8001/api/projects -H "Authorization: Bearer test-token"
```

**Expected responses**:

- `/health`: 200 OK with health data
- `/api/projects`: 200 OK with the projects list
- `/metrics`: 200 OK with Prometheus metrics

**If connection refused**:

```bash
# Check whether the port-forward is running
ps aux | grep "port-forward"

# Restart the port-forward (regex needed because -n $NAMESPACE sits between the words)
pkill -f "port-forward.*svc/vapora-backend"
kubectl port-forward -n $NAMESPACE svc/vapora-backend 8001:8001 &
```

### 4.5 Check Metrics

```bash
# Monitor resource usage of the deployed pods
kubectl top pods -n $NAMESPACE

# Expected output:
# NAME                    CPU(cores)   MEMORY(bytes)
# vapora-backend-abc123   250m         512Mi
# vapora-backend-def456   280m         498Mi
# vapora-agents-ghi789    300m         256Mi
```

**Verification**:

- ✓ CPU usage within the expected range (typically 100-500m per pod)
- ✓ Memory usage within the expected range (typically 200-512Mi)
- ❌ If any pod is at 100% of its CPU/memory limit: performance issue; monitor closely

---

## Phase 5: Validation (3 minutes)

### 5.1 Run Smoke Tests (if available)

```bash
# If your project has smoke tests:
kubectl exec -it deployment/vapora-backend -n $NAMESPACE -- \
  sh -c "curl http://localhost:8001/health && echo 'Health check passed'"

# Or run from your local machine:
./scripts/smoke-tests.sh --endpoint http://localhost:8001
```

If the project doesn't have a smoke-test script yet, a minimal stand-in is sketched after step 5.2.

### 5.2 Check for Errors in Logs

```bash
# Look at logs from all pods since the deployment started
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  echo "=== Checking $deployment ==="
  kubectl logs deployment/$deployment -n $NAMESPACE --since=5m 2>&1 | \
    grep -i "error\|exception\|failed" | wc -l
done

# If any errors are found:
# 1. Get detailed logs
# 2. Determine whether they are critical or expected
# 3. Decide whether to proceed or roll back
```
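The smoke-test stand-in promised in step 5.1. It assumes the port-forward from step 4.4 is still running and exercises only endpoints this runbook already uses; `TEST_TOKEN` is a placeholder for a valid token:

```bash
#!/usr/bin/env bash
# Minimal smoke test. -f makes curl fail on HTTP errors, so with `set -e`
# any non-2xx response aborts the script with a non-zero exit code.
set -euo pipefail
BASE="${1:-http://localhost:8001}"

curl -fsS "$BASE/health"  > /dev/null && echo "health:   OK"
curl -fsS "$BASE/metrics" > /dev/null && echo "metrics:  OK"
curl -fsS "$BASE/api/projects" \
  -H "Authorization: Bearer ${TEST_TOKEN:-test-token}" > /dev/null \
  && echo "projects: OK"

echo "Smoke tests passed"
```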
### 5.3 Compare Against Baseline Metrics

Compare current metrics with the pre-deployment baseline:

```bash
# Current metrics
echo "=== Current ==="
kubectl top nodes
kubectl top pods -n $NAMESPACE | head -5

# Compare with the recorded baseline:
# If similar:              ✓ Good
# If significantly higher: ⚠️ Watch for issues
# If error rates are high: ❌ Consider rollback
```

### 5.4 Check for Recent Events/Warnings

```bash
# Look for any cluster events in the last few minutes
kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' | tail -20

# Watch for:
# - Warning: FailedScheduling  (pod won't fit on any node)
# - Warning: Failed / BackOff  (image doesn't exist or can't be pulled)
# - Warning: FailedCreate      (e.g., resource quota exceeded)
# - Warning: Unhealthy         (readiness/liveness probe failing)
```

---

## Phase 6: Communication (1 minute)

### 6.1 Post Deployment Complete

Post this message to #deployments:

```
🚀 DEPLOYMENT COMPLETE

Deployment: VAPORA Core Services
Mode: Enterprise
Duration: 8 minutes
Status: ✅ Successful

Deployed:
- vapora-backend (v1.2.1)
- vapora-agents (v1.2.1)
- vapora-llm-router (v1.2.1)

Verification:
✓ All pods running
✓ Health checks passing
✓ No error logs
✓ Metrics normal

Next steps:
- Monitor #alerts for any issues
- Check dashboards every 5 minutes for 30 min
- Review logs if any issues are detected

Questions? @on-call-engineer
```

### 6.2 Update Status Page

If using a public status page:

```
UPDATE: Maintenance Complete

VAPORA services have been successfully updated and are now operating
normally. All systems monitoring nominal.
```

### 6.3 Notify Stakeholders

- [ ] Send a message to the support team: "Deployment complete, all systems normal"
- [ ] Post in #product: "Backend updated to v1.2.1, new features available"
- [ ] Update the ticket/issue with the deployment completion time and status

---

## Phase 7: Post-Deployment Monitoring (Ongoing)

### 7.1 First 5 Minutes: Watch Closely

```bash
# Keep watching for any issues (run each in its own terminal)
watch kubectl get pods -n $NAMESPACE
watch kubectl top pods -n $NAMESPACE
kubectl logs -f deployment/vapora-backend -n $NAMESPACE   # -f already follows; no watch needed
```

**Watch for:**

- Pod restarts (RESTARTS counter increasing)
- Increased error logs
- Resource usage spikes
- Service unreachability

### 7.2 First 30 Minutes: Monitor Dashboard

Keep a dashboard visible showing:

- Pod health status
- CPU/memory usage per pod
- Request latency (if available)
- Error rate
- Recent logs

**Alert triggers for immediate action:**

- Any pod restarting repeatedly
- Error rate above 5%
- Latency above 2x normal
- Pod stuck in Pending state

### 7.3 First 2 Hours: Regular Checks

Every 10 minutes:

1. `kubectl get pods -n $NAMESPACE`
2. `kubectl top pods -n $NAMESPACE`
3. Check error logs: `grep -i error` over recent logs
4. Check the alerts dashboard

**If issues are detected**, proceed to the Incident Response Runbook.

### 7.4 After 2 Hours: Normal Monitoring

Return to standard monitoring procedures. Deployment complete.

---

## If Issues Detected: Quick Rollback

If problems occur at any point:

```bash
# IMMEDIATE: roll back (about 1 minute)
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  kubectl rollout undo deployment/$deployment -n $NAMESPACE &
done
wait

# Verify the rollback is completing:
kubectl rollout status deployment/vapora-backend -n $NAMESPACE --timeout=5m

# Confirm services are recovering:
curl http://localhost:8001/health

# Post to #deployments:
# 🔙 ROLLBACK EXECUTED
# Issue detected, services rolled back to the previous version
# All pods should be recovering now
```
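If "previous" isn't the revision you want (for example, the bad rollout itself was already undone once), roll back to an explicit revision using the numbers recorded in step 1.4. `--to-revision` is a standard flag of `kubectl rollout undo`; the revision number here is illustrative:

```bash
# Roll a single deployment back to a known-good revision (example: 43)
kubectl rollout undo deployment/vapora-backend -n $NAMESPACE --to-revision=43
kubectl rollout status deployment/vapora-backend -n $NAMESPACE --timeout=5m
```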
See [Rollback Runbook](./rollback-runbook.md) for detailed procedures.

---

## Common Issues & Solutions

### Issue: Pod stuck in ImagePullBackOff

**Cause**: the Docker image doesn't exist or can't be downloaded.

**Solution**:

```bash
# Check pod events
kubectl describe pod <pod-name> -n $NAMESPACE

# Check image registry access
kubectl get secret -n $NAMESPACE
```

Then either:

1. Verify the image name is correct in deployment.yaml
2. Push the missing image to the registry
3. Roll back the deployment

### Issue: Pod stuck in CrashLoopBackOff

**Cause**: the application is crashing on startup.

**Solution**:

```bash
# Get logs from the previous (crashed) container
kubectl logs <pod-name> -n $NAMESPACE --previous
```

The fix typically requires a config change:

1. Fix the ConfigMap issue
2. Re-apply the ConfigMap: `kubectl apply -f configmap.yaml -n $NAMESPACE`
3. Trigger a pod restart: `kubectl rollout restart deployment/<name> -n $NAMESPACE`

Or roll back if the cause is unclear.

### Issue: Pod in Pending state

**Cause**: no node has the capacity or resources.

**Solution**:

```bash
# Describe the pod to see why
kubectl describe pod <pod-name> -n $NAMESPACE

# Check for "Insufficient cpu" or "Insufficient memory"
kubectl top nodes
```

Then either:

1. Scale down other workloads
2. Increase the node count
3. Reduce the resource requirements in deployment.yaml and redeploy

### Issue: Service endpoints empty

**Cause**: pods are not passing their readiness checks.

**Solution**:

```bash
# Check pod logs for errors
kubectl logs <pod-name> -n $NAMESPACE

# Check for readiness probe failures
kubectl describe pod <pod-name> -n $NAMESPACE | grep -A 5 "Readiness"

# Fix the configuration or roll back
```

---

## Completion Checklist

- [ ] All pods running and ready
- [ ] Health endpoints responding
- [ ] No error logs
- [ ] Metrics normal
- [ ] Deployment communication posted
- [ ] Status page updated
- [ ] Stakeholders notified
- [ ] Monitoring enabled for the next 2 hours
- [ ] Ticket/issue updated with completion details

---

## Next Steps

- Continue monitoring per the [Monitoring Runbook](./monitoring-runbook.md)
- If issues arise, follow the [Incident Response Runbook](./incident-response-runbook.md)
- Document lessons learned
- Update the runbooks if procedures need improvement