# Rollback Runbook

Procedures for safely rolling back VAPORA deployments when issues are detected.

---

## When to Rollback

Immediately trigger a rollback if any of these occur within 5 minutes of deployment:

❌ **Critical Issues** (rollback within 1 minute):
- Pod in `CrashLoopBackOff` (repeatedly restarting)
- All pods unable to start
- Service completely unreachable (0 endpoints)
- Database connection completely broken
- All requests returning 5xx errors
- Service consuming all available memory/CPU

⚠️ **Serious Issues** (rollback within 5 minutes):
- High error rate (>10% 5xx errors)
- Significant performance degradation (2x+ latency)
- Deployment not completing (stuck pods)
- Unexpected dependency failures
- Data corruption or loss

✓ **Monitor & Investigate** (don't rollback immediately):
- Single pod failing (might be a node issue)
- Transient network errors
- Gradual latency increase (might be load-related)
- Expected warnings in logs
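As a quick triage aid, the criteria above can be approximated with a few `kubectl` queries. A minimal sketch, assuming the `vapora` namespace used throughout this runbook; it only surfaces symptoms, it does not make the rollback decision for you:

```bash
# Quick triage: surface the symptoms that usually justify an immediate rollback
NAMESPACE=vapora

# Pods stuck in CrashLoopBackOff or any other non-Running state
kubectl get pods -n $NAMESPACE --no-headers | grep -v ' Running ' || echo "✓ all pods Running"

# Services with zero endpoints (completely unreachable)
kubectl get endpoints -n $NAMESPACE

# Restart counts (repeated restarts suggest CrashLoopBackOff)
kubectl get pods -n $NAMESPACE \
  -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount
```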
---

## Kubernetes Rollback (Automatic)

### Step 1: Assess Situation (30 seconds)

```bash
# Set up environment
export NAMESPACE=vapora
export CLUSTER=production  # or staging

# Verify you're on the correct cluster
kubectl cluster-info | grep server

# STOP if you're on the wrong cluster!
# The correct cluster should show the production URL
```

### Step 2: Check Current Status

```bash
# See what's happening right now
kubectl get deployments -n $NAMESPACE
kubectl get pods -n $NAMESPACE

# Output should show the broken state that triggered the rollback
```

**Critical check:**

```bash
# How many pods are actually running?
RUNNING=$(kubectl get pods -n $NAMESPACE --field-selector=status.phase=Running --no-headers | wc -l)
TOTAL=$(kubectl get pods -n $NAMESPACE --no-headers | wc -l)
echo "Pods running: $RUNNING / $TOTAL"

# If 0/X: critical, rollback immediately
# If X/X: investigate before rolling back (you might not need to)
```

### Step 3: Identify Which Deployment Failed

```bash
# Check which deployment has issues
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  echo "=== $deployment ==="
  kubectl get deployment $deployment -n $NAMESPACE -o wide
  kubectl get pods -n $NAMESPACE -l app=$deployment
done

# Example: backend has a ReplicaSet mismatch
# DESIRED   CURRENT   UPDATED   AVAILABLE
# 3         3         3         0          ← Problem: no pods available
```

**Decide**: Rollback all services or a specific deployment?
- If all services are down: rollback all
- If only the backend has issues: rollback the backend only

### Step 4: Get Rollout History

```bash
# Show deployment revisions to see what to roll back to
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  echo "=== $deployment ==="
  kubectl rollout history deployment/$deployment -n $NAMESPACE | tail -5
done

# Output:
# REVISION  CHANGE-CAUSE
# 42        Deployment rolled out
# 43        Deployment rolled out
# 44        (current - the one with issues)
```

**Key**: Revision numbers increase with each deployment.

### Step 5: Execute Rollback

**Option A: Rollback all three services**

```bash
echo "🔙 Rolling back all services..."

for deployment in vapora-backend vapora-agents vapora-llm-router; do
  echo "Rolling back $deployment..."
  kubectl rollout undo deployment/$deployment -n $NAMESPACE
  echo "✓ $deployment undo initiated"
done

# Wait for all rollbacks
echo "⏳ Waiting for rollback to complete..."
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  kubectl rollout status deployment/$deployment -n $NAMESPACE --timeout=5m
done

echo "✓ All services rolled back"
```

**Option B: Rollback a specific deployment**

```bash
# If only the backend has issues
kubectl rollout undo deployment/vapora-backend -n $NAMESPACE

# Monitor the rollback
kubectl rollout status deployment/vapora-backend -n $NAMESPACE --timeout=5m
```

**Option C: Rollback to a specific revision**

```bash
# If you need to skip the immediately previous version,
# find the working revision number from the rollout history
TARGET_REVISION=42  # Example

for deployment in vapora-backend vapora-agents vapora-llm-router; do
  echo "Rolling back $deployment to revision $TARGET_REVISION..."
  kubectl rollout undo deployment/$deployment -n $NAMESPACE \
    --to-revision=$TARGET_REVISION
done

# Verify the rollback (repeat for each service)
kubectl rollout status deployment/vapora-backend -n $NAMESPACE --timeout=5m
```

### Step 6: Monitor Rollback Progress

In a **separate terminal**, watch the rollback happening:

```bash
# Watch pods being recreated with the old version
kubectl get pods -n $NAMESPACE -w

# Output shows:
# vapora-backend-abc123-newhash   1/1   Terminating   ← pods from the failed version being removed
# vapora-backend-def456-oldhash   0/1   Pending       ← previous-version pods starting
# vapora-backend-def456-oldhash   1/1   Running       ← previous-version pods ready
```

**Expected timeline:**
- 0-30 seconds: failed-version pods terminating, previous-version pods being scheduled
- 30-90 seconds: replacement pods starting up (ContainerCreating)
- 90-180 seconds: replacement pods reaching Running state

### Step 7: Verify Rollback Complete

```bash
# After rollout status shows "successfully rolled out",
# verify all pods are running
kubectl get pods -n $NAMESPACE

# All should show:
# STATUS: Running
# READY: 1/1

# Verify service endpoints exist
kubectl get endpoints -n $NAMESPACE

# All services should have endpoints like:
# NAME             ENDPOINTS
# vapora-backend   10.x.x.x:8001,10.x.x.x:8001,10.x.x.x:8001
```

### Step 8: Health Check

```bash
# Port-forward to test services
kubectl port-forward -n $NAMESPACE svc/vapora-backend 8001:8001 &
sleep 2

# Test health endpoint
curl -v http://localhost:8001/health

# Expected: HTTP 200 OK with health data
```

**If the health check fails:**

```bash
# Check pod logs for errors
kubectl logs deployment/vapora-backend -n $NAMESPACE --tail=50

# See what's wrong; this may need further investigation,
# or possibly a rollback to an earlier revision
```

### Step 9: Check Logs for Success

```bash
# Verify no errors in the first 2 minutes of rolled-back logs
kubectl logs deployment/vapora-backend -n $NAMESPACE --since=2m | \
  grep -i "error\|exception\|failed" | head -10

# Should return no (or very few) errors
```

### Step 10: Verify Version Reverted

```bash
# Confirm we're back to the previous version
kubectl get deployments -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}'

# Output should show the previous image versions:
# vapora-backend      vapora/backend:v1.2.0     (not v1.2.1)
# vapora-agents       vapora/agents:v1.2.0
# vapora-llm-router   vapora/llm-router:v1.2.0
```
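Steps 7-10 can be collapsed into a single post-rollback check if you prefer to script it. A minimal sketch, not part of the repo: `EXPECTED_VERSION` and the `/health` port and path are assumptions taken from the examples above.

```bash
#!/usr/bin/env bash
# post-rollback-check.sh - illustrative sketch; adjust names, ports, and tags to your environment.
NAMESPACE=${NAMESPACE:-vapora}
EXPECTED_VERSION=${EXPECTED_VERSION:-v1.2.0}   # assumed previous tag; take it from rollout history

# 1. Every deployment should be back on the expected image tag
kubectl get deployments -n "$NAMESPACE" \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}' |
while read -r name image; do
  case "$image" in
    *"$EXPECTED_VERSION"*) echo "✓ $name -> $image" ;;
    *)                     echo "✗ $name still on $image" ;;
  esac
done

# 2. List any pods stuck outside the Running phase (prints nothing when all are Running)
kubectl get pods -n "$NAMESPACE" --field-selector=status.phase!=Running --no-headers

# 3. Backend health endpoint answers 200 (port and path as in Step 8)
kubectl port-forward -n "$NAMESPACE" svc/vapora-backend 8001:8001 &
PF_PID=$!
sleep 2
if curl -sf http://localhost:8001/health > /dev/null; then
  echo "✓ health check passed"
else
  echo "✗ health check failed; see Step 8 troubleshooting"
fi
kill $PF_PID
```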
---

## Docker Rollback (Manual)

For Docker Compose deployments (not Kubernetes):

### Step 1: Assess Current State

```bash
# Check running containers
docker compose ps

# Check logs for errors
docker compose logs --tail=50 backend
```

### Step 2: Stop Services

```bash
# Stop all services gracefully
docker compose down

# Verify stopped
docker ps | grep vapora
# Should return nothing

# Wait a moment for graceful shutdown
sleep 5
```

### Step 3: Restore Previous Configuration

```bash
# Option A: Git history
cd deploy/docker
git log docker-compose.yml | head -5
git checkout HEAD~1 docker-compose.yml

# Option B: Backup file
cp docker-compose.yml docker-compose.yml.broken
cp docker-compose.yml.backup docker-compose.yml

# Option C: Manual
# Edit docker-compose.yml to use previous image versions
# Example: change backend service image from v1.2.1 to v1.2.0
# (a scripted version of this edit is sketched below)
```
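The manual edit in Option C can also be scripted. A hedged sketch, assuming the `vapora/<service>:vX.Y.Z` image naming used elsewhere in this runbook and GNU `sed`; adjust the tags to your actual versions:

```bash
# Pin every vapora image in docker-compose.yml back to the previous tag
BROKEN_TAG=v1.2.1     # assumed: the version that was just deployed
PREVIOUS_TAG=v1.2.0   # assumed: the last known-good version

cp docker-compose.yml docker-compose.yml.broken   # keep the broken file for the post-mortem
sed -i "s|\(vapora/[a-z-]*\):${BROKEN_TAG}|\1:${PREVIOUS_TAG}|g" docker-compose.yml

# Confirm the change before restarting
grep 'image:' docker-compose.yml
```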
### Step 4: Restart Services

```bash
# Start services with previous configuration
docker compose up -d

# Wait for startup
sleep 5

# Verify services running
docker compose ps
# Should show all services with status "Up"
```

### Step 5: Verify Health

```bash
# Check container logs
docker compose logs backend | tail -20

# Test health endpoint
curl -v http://localhost:8001/health
# Expected: HTTP 200 OK
```

### Step 6: Check Services

```bash
# Verify all services responding
docker compose exec backend curl http://localhost:8001/health
docker compose exec frontend curl http://localhost:3000 --head

# All should return successful responses
```

---

## Post-Rollback Procedures

### Immediate (Within 5 minutes)

1. Verify all services healthy:
   - ✓ All pods running
   - ✓ Health endpoints responding
   - ✓ No error logs
   - ✓ Service endpoints populated
2. Communicate to team (see below)

### Communication

```
Post to #deployments:

🔙 ROLLBACK EXECUTED

Issue detected in deployment v1.2.1
All services rolled back to v1.2.0

Status: ✅ Services recovering
- All pods: Running
- Health checks: Passing
- Endpoints: Responding

Timeline:
- Issue detected: HH:MM UTC
- Rollback initiated: HH:MM UTC
- Services recovered: HH:MM UTC (5 minutes)

Next:
- Investigate root cause
- Fix issue
- Prepare corrected deployment

Questions? @on-call-engineer
```

### Investigation & Root Cause

```bash
# While services are recovered, investigate what went wrong

# 1. Save logs from failed deployment
kubectl logs deployment/vapora-backend -n $NAMESPACE \
  --timestamps=true \
  > failed-deployment-backend.log

# 2. Save pod events
kubectl describe pod $(kubectl get pods -n $NAMESPACE \
  -l app=vapora-backend --sort-by=.metadata.creationTimestamp \
  | tail -1 | awk '{print $1}') \
  -n $NAMESPACE > failed-pod-events.log

# 3. Archive ConfigMap from failed deployment (if changed)
kubectl get configmap -n $NAMESPACE vapora-config -o yaml > configmap-failed.yaml

# 4. Compare with previous good state
diff configmap-previous.yaml configmap-failed.yaml

# 5. Check what changed in code
git diff HEAD~1 HEAD provisioning/
```

### Decision: What Went Wrong

Common issues and investigation paths:

| Issue | Investigation | Action |
|-------|---------------|--------|
| **Config syntax error** | Check ConfigMap YAML | Fix YAML, test locally with yq |
| **Missing environment variable** | Check pod logs for "not found" | Update ConfigMap with value |
| **Database connection** | Check database connectivity | Verify DB URL in ConfigMap |
| **Resource exhaustion** | Check `kubectl top`, pod events | Increase resources or reduce replicas |
| **Image missing** | Check ImagePullBackOff event | Verify image pushed to registry |
| **Permission issue** | Check RBAC, logs for "forbidden" | Update service account permissions |
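For the config-related rows above, it usually pays to validate the fix locally before redeploying. A hedged sketch of that idea; the file name `vapora-config.yaml` is a placeholder for whatever you edited, and `yq` v4 syntax is assumed:

```bash
# Validate a corrected ConfigMap before applying it
# (file name below is a placeholder)

# Syntax-check the YAML with yq
yq eval '.' vapora-config.yaml > /dev/null && echo "✓ YAML parses"

# Server-side dry run catches schema and permission problems without changing anything
kubectl apply --dry-run=server -f vapora-config.yaml -n vapora

# Spot-check the value that was missing or wrong
yq eval '.data' vapora-config.yaml
```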
### Post-Rollback Review

Schedule within 24 hours:

```
DEPLOYMENT POST-MORTEM

Deployment: v1.2.1
Outcome: ❌ Rolled back

Timeline:
- Deployed: 2026-01-12 14:00 UTC
- Issue detected: 14:05 UTC
- Rollback completed: 14:10 UTC
- Impact duration: 5 minutes

Root Cause:
[describe what went wrong]

Why not caught before:
- [ ] Testing incomplete
- [ ] Config not validated
- [ ] Monitoring missed issue
- [ ] Other: [describe]

Prevention for next time:
1. [action item]
2. [action item]
3. [action item]

Owner: [person responsible for follow-up]
Deadline: [date]
```

---

## Rollback Emergency Procedures

### If Services Still Down After Rollback

```bash
# Services not recovering - emergency procedures

# 1. Check if rollback actually happened
kubectl get deployments -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}'

# If image is still the new version:
# - Rollback might have failed
# - Try manual version specification

# 2. Force rollback to specific revision
kubectl rollout undo deployment/vapora-backend -n $NAMESPACE --to-revision=41

# 3. If still failing, delete and recreate pods
kubectl delete pods -n $NAMESPACE -l app=vapora-backend
# Pods will restart via deployment

# 4. Last resort: scale down and up
kubectl scale deployment/vapora-backend --replicas=0 -n $NAMESPACE
sleep 10
kubectl scale deployment/vapora-backend --replicas=3 -n $NAMESPACE

# 5. Monitor restart
kubectl get pods -n $NAMESPACE -w
```

### If Database Corrupted

```bash
# Only do this if you have recent backups

# 1. Identify corruption
kubectl logs deployment/vapora-backend -n $NAMESPACE | grep -i "corruption\|data"

# 2. Restore from backup (requires DBA support)
# Contact database team

# 3. Verify data integrity
# Run validation queries/commands

# 4. Notify stakeholders immediately
```

### If All Else Fails

Complete infrastructure recovery:

1. Escalate to Infrastructure team
2. Activate Disaster Recovery procedures
3. Failover to backup environment if available
4. Engage senior engineers for investigation

---

## Prevention & Lessons Learned

After every rollback:

1. **Root Cause Analysis**
   - What actually went wrong?
   - Why wasn't it caught before deployment?
   - What can prevent this in the future?

2. **Testing Improvements**
   - Add test case for failure scenario
   - Update pre-deployment checklist
   - Improve staging validation

3. **Monitoring Improvements**
   - Add alert for this failure mode
   - Improve alerting sensitivity
   - Document expected vs abnormal logs

4. **Documentation**
   - Update runbooks with new learnings
   - Document this specific failure scenario
   - Share with team

---

## Rollback Checklist

```
☐ Confirmed critical issue requiring rollback
☐ Verified correct cluster and namespace
☐ Checked rollout history
☐ Executed rollback command (all services or specific)
☐ Monitored rollback progress (5-10 min wait)
☐ Verified all pods running
☐ Verified health endpoints responding
☐ Confirmed version reverted
☐ Posted communication to #deployments
☐ Notified on-call engineer: "rollback complete"
☐ Scheduled root cause analysis
☐ Saved logs for investigation
☐ Started post-mortem process
```

---

## Reference: Quick Rollback Commands

For experienced operators:

```bash
# One-liner: Rollback all services
export NS=vapora; for d in vapora-backend vapora-agents vapora-llm-router; do kubectl rollout undo deployment/$d -n $NS & done; wait

# Quick verification
kubectl get pods -n $NS && kubectl get endpoints -n $NS

# Health check
kubectl port-forward -n $NS svc/vapora-backend 8001:8001 &
sleep 2 && curl http://localhost:8001/health
```
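If these one-liners get used often, they can be folded into a small helper. A hypothetical `rollback-all.sh`, shown only as a sketch; it simply strings together the commands documented above:

```bash
#!/usr/bin/env bash
# rollback-all.sh - hypothetical helper, not part of the repo.
# Rolls back all three VAPORA services and waits for them to settle.
NS=${1:-vapora}

# Kick off the undo for each deployment in parallel
for d in vapora-backend vapora-agents vapora-llm-router; do
  kubectl rollout undo deployment/$d -n "$NS" &
done
wait

# Wait for each rollout to finish (same timeout as the runbook)
for d in vapora-backend vapora-agents vapora-llm-router; do
  kubectl rollout status deployment/$d -n "$NS" --timeout=5m
done

# Quick verification, as above
kubectl get pods -n "$NS"
kubectl get endpoints -n "$NS"
```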