Rollback Runbook
Procedures for safely rolling back VAPORA deployments when issues are detected.
When to Rollback
Use these criteria, within the first 5 minutes after a deployment, to decide whether to roll back (a quick triage sketch follows the lists below):
❌ Critical Issues (rollback within 1 minute):
- Pods in CrashLoopBackOff (repeatedly restarting)
- All pods unable to start
- Service completely unreachable (0 endpoints)
- Database connection completely broken
- All requests returning 5xx errors
- Service consuming all available memory/CPU
⚠️ Serious Issues (rollback within 5 minutes):
- High error rate (>10% 5xx errors)
- Significant performance degradation (2x+ latency)
- Deployment not completing (stuck pods)
- Unexpected dependency failures
- Data corruption or loss
✓ Monitor & Investigate (don't rollback immediately):
- Single pod failing (might be node issue)
- Transient network errors
- Gradual latency increase (might be load-related)
- Expected warnings in logs
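To speed up this triage, the sketch below summarizes non-running pods, restart counts, and service endpoints in one pass. It is a minimal sketch, assuming the vapora namespace and the service names used throughout this runbook:
# Quick triage sketch - assumes NAMESPACE=vapora and the VAPORA service names
NAMESPACE=vapora
echo "--- Pods not in Running phase ---"
kubectl get pods -n $NAMESPACE --field-selector=status.phase!=Running --no-headers
echo "--- Restart counts ---"
kubectl get pods -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}'
echo "--- Service endpoints ---"
for svc in vapora-backend vapora-agents vapora-llm-router; do
kubectl get endpoints $svc -n $NAMESPACE
done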
Kubernetes Rollback (Automatic)
Step 1: Assess Situation (30 seconds)
# Set up environment
export NAMESPACE=vapora
export CLUSTER=production # or staging
# Verify you're on correct cluster
kubectl config current-context
kubectl cluster-info
# STOP if you're on wrong cluster!
# Correct cluster should be production URL
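If you want a hard stop rather than a visual check, a small guard can refuse to continue when the active context is not the expected one. EXPECTED_CONTEXT below is a placeholder, not a name defined by this runbook:
# Abort if the active kubectl context is not the expected one
EXPECTED_CONTEXT=production-cluster   # placeholder - substitute your real context name
if [ "$(kubectl config current-context)" != "$EXPECTED_CONTEXT" ]; then
echo "❌ Wrong cluster context: $(kubectl config current-context) - stopping"
exit 1   # note: this closes an interactive shell; run as a script or adapt
fi
echo "✓ Context verified: $EXPECTED_CONTEXT"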
Step 2: Check Current Status
# See what's happening right now
kubectl get deployments -n $NAMESPACE
kubectl get pods -n $NAMESPACE
# Output should show the broken state that triggered rollback
Critical check:
# How many pods are actually running?
RUNNING=$(kubectl get pods -n $NAMESPACE --field-selector=status.phase=Running --no-headers | wc -l)
TOTAL=$(kubectl get pods -n $NAMESPACE --no-headers | wc -l)
echo "Pods running: $RUNNING / $TOTAL"
# If 0/X: Critical, rollback immediately
# If X/X: Investigate before rollback (might not need to)
Step 3: Identify Which Deployment Failed
# Check which deployment has issues
for deployment in vapora-backend vapora-agents vapora-llm-router; do
echo "=== $deployment ==="
kubectl get deployment $deployment -n $NAMESPACE -o wide
kubectl get pods -n $NAMESPACE -l app=$deployment
done
# Example problem output for the backend deployment:
# NAME             READY   UP-TO-DATE   AVAILABLE
# vapora-backend   0/3     3            0        ← Problem: no pods available
Decide: Rollback all or specific deployment?
- If all services down: Rollback all
- If only backend issues: Rollback backend only
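One way to make this call quickly is to compare desired vs. available replicas per deployment; a sketch using the deployment names above:
# Desired vs. available replicas per service
for deployment in vapora-backend vapora-agents vapora-llm-router; do
DESIRED=$(kubectl get deployment $deployment -n $NAMESPACE -o jsonpath='{.spec.replicas}')
AVAILABLE=$(kubectl get deployment $deployment -n $NAMESPACE -o jsonpath='{.status.availableReplicas}')
echo "$deployment: available ${AVAILABLE:-0} / desired $DESIRED"
done
# Roll back only the deployments reporting 0 available (or all, if every service is affected)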
Step 4: Get Rollout History
# Show deployment revisions to see what to rollback to
for deployment in vapora-backend vapora-agents vapora-llm-router; do
echo "=== $deployment ==="
kubectl rollout history deployment/$deployment -n $NAMESPACE | tail -5
done
# Output:
# REVISION CHANGE-CAUSE
# 42 Deployment rolled out
# 43 Deployment rolled out
# 44 (current - the one with issues)
Key: Revision numbers increase with each deployment
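Before choosing a target revision, it can help to confirm which image that revision recorded; `kubectl rollout history` takes a `--revision` flag for this (revision 42 is the example number from the output above):
# Show the pod template (including image) recorded for a specific revision
kubectl rollout history deployment/vapora-backend -n $NAMESPACE --revision=42
# The output includes that revision's container image, e.g. vapora/backend:v1.2.0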
Step 5: Execute Rollback
# Option A: Rollback all three services
echo "🔙 Rolling back all services..."
for deployment in vapora-backend vapora-agents vapora-llm-router; do
echo "Rolling back $deployment..."
kubectl rollout undo deployment/$deployment -n $NAMESPACE
echo "✓ $deployment undo initiated"
done
# Wait for all rollbacks
echo "⏳ Waiting for rollback to complete..."
for deployment in vapora-backend vapora-agents vapora-llm-router; do
kubectl rollout status deployment/$deployment -n $NAMESPACE --timeout=5m
done
echo "✓ All services rolled back"
Option B: Rollback specific deployment
# If only backend has issues
kubectl rollout undo deployment/vapora-backend -n $NAMESPACE
# Monitor rollback
kubectl rollout status deployment/vapora-backend -n $NAMESPACE --timeout=5m
Option C: Rollback to specific revision
# If you need to skip the immediate previous version
# Find the working revision number from history
TARGET_REVISION=42 # Example
for deployment in vapora-backend vapora-agents vapora-llm-router; do
echo "Rolling back $deployment to revision $TARGET_REVISION..."
kubectl rollout undo deployment/$deployment -n $NAMESPACE \
--to-revision=$TARGET_REVISION
done
# Verify rollback for each service
for deployment in vapora-backend vapora-agents vapora-llm-router; do
kubectl rollout status deployment/$deployment -n $NAMESPACE --timeout=5m
done
Step 6: Monitor Rollback Progress
In a separate terminal, watch the rollback happening:
# Watch pods being recreated with old version
kubectl get pods -n $NAMESPACE -w
# Output shows:
# vapora-backend-abc123-xxxxx   1/1   Terminating   ← broken-version pods being removed
# vapora-backend-def456-xxxxx   0/1   Pending       ← previous-version pods being created
# vapora-backend-def456-xxxxx   1/1   Running       ← previous-version pods ready
Expected timeline:
- 0-30 seconds: broken-version pods terminating, previous-version pods being created
- 30-90 seconds: previous-version pods starting up (ContainerCreating)
- 90-180 seconds: previous-version pods reaching Running state
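If you prefer a bounded wait over watching the pod list, `kubectl wait` can block until each deployment reports the Available condition again (the timeout below is illustrative):
# Block until each rolled-back deployment is Available again, or fail after 3 minutes
for deployment in vapora-backend vapora-agents vapora-llm-router; do
kubectl wait --for=condition=Available --timeout=180s deployment/$deployment -n $NAMESPACE
done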
Step 7: Verify Rollback Complete
# After rollout status shows "successfully rolled out"
# Verify all pods are running
kubectl get pods -n $NAMESPACE
# All should show:
# STATUS: Running
# READY: 1/1
# Verify service endpoints exist
kubectl get endpoints -n $NAMESPACE
# All services should have endpoints like:
# NAME ENDPOINTS
# vapora-backend 10.x.x.x:8001,10.x.x.x:8001,10.x.x.x:8001
Step 8: Health Check
# Port-forward to test services
kubectl port-forward -n $NAMESPACE svc/vapora-backend 8001:8001 &
sleep 2
# Test health endpoint
curl -v http://localhost:8001/health
# Expected: HTTP 200 OK with health data
If health check fails:
# Check pod logs for errors
kubectl logs deployment/vapora-backend -n $NAMESPACE --tail=50
# See what's wrong, might need further investigation
# Possibly need to rollback to earlier version
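If the logs alone are not conclusive, recent namespace events usually point at the underlying cause (failed image pulls, failing probes, scheduling problems):
# Recent events in the namespace, newest last
kubectl get events -n $NAMESPACE --sort-by=.lastTimestamp | tail -20
# Look for ImagePullBackOff, probe failures, OOMKilled, or FailedScheduling entries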
Step 9: Check Logs for Success
# Verify no errors in the first 2 minutes of rolled-back logs
kubectl logs deployment/vapora-backend -n $NAMESPACE --since=2m | \
grep -i "error\|exception\|failed" | head -10
# Should return no (or very few) errors
Step 10: Verify Version Reverted
# Confirm we're back to previous version
kubectl get deployments -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}'
# Output should show previous image versions:
# vapora-backend vapora/backend:v1.2.0 (not v1.2.1)
# vapora-agents vapora/agents:v1.2.0
# vapora-llm-router vapora/llm-router:v1.2.0
Docker Rollback (Manual)
For Docker Compose deployments (not Kubernetes):
Step 1: Assess Current State
# Check running containers
docker compose ps
# Check logs for errors
docker compose logs --tail=50 backend
Step 2: Stop Services
# Stop all services gracefully
docker compose down
# Verify stopped
docker ps | grep vapora
# Should return nothing
# Wait a moment for graceful shutdown
sleep 5
Step 3: Restore Previous Configuration
# Option A: Git history
cd deploy/docker
git log --oneline -5 -- docker-compose.yml
git checkout HEAD~1 -- docker-compose.yml
# Option B: Backup file
cp docker-compose.yml docker-compose.yml.broken
cp docker-compose.yml.backup docker-compose.yml
# Option C: Manual
# Edit docker-compose.yml to use previous image versions
# Example: change backend service image from v1.2.1 to v1.2.0
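For the manual edit, a targeted substitution keeps the change small; a sketch using GNU sed and the example image tags from this runbook (adjust the image names to match your compose file):
# Pin the backend image back to the previous tag (GNU sed, edits the file in place)
sed -i 's|vapora/backend:v1.2.1|vapora/backend:v1.2.0|' docker-compose.yml
# Confirm the change before restarting
grep 'image:' docker-compose.yml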
Step 4: Restart Services
# Start services with previous configuration
docker compose up -d
# Wait for startup
sleep 5
# Verify services running
docker compose ps
# Should show all services with status "Up"
Step 5: Verify Health
# Check container logs
docker compose logs backend | tail -20
# Test health endpoint
curl -v http://localhost:8001/health
# Expected: HTTP 200 OK
Step 6: Check Services
# Verify all services responding
docker compose exec backend curl http://localhost:8001/health
docker compose exec frontend curl http://localhost:3000 --head
# All should return successful responses
Post-Rollback Procedures
Immediate (Within 5 minutes)
1. Verify all services healthy (a combined check sketch follows below):
- ✓ All pods running
- ✓ Health endpoints responding
- ✓ No error logs
- ✓ Service endpoints populated
2. Communicate to team (see Communication below)
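A compact way to run the first check is to combine the pod, endpoint, and health verifications from Steps 7 and 8 into one pass; a sketch using the service names and port from this runbook:
# Combined post-rollback verification sketch
kubectl get pods -n $NAMESPACE --no-headers | grep -v Running && echo "⚠️ Non-running pods found above"
for svc in vapora-backend vapora-agents vapora-llm-router; do
echo "$svc endpoints: $(kubectl get endpoints $svc -n $NAMESPACE -o jsonpath='{.subsets[*].addresses[*].ip}')"
done
kubectl port-forward -n $NAMESPACE svc/vapora-backend 8001:8001 &
PF_PID=$!
sleep 2
curl -fsS http://localhost:8001/health && echo "✓ backend health OK"
kill $PF_PID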
Communication
Post to #deployments:
🔙 ROLLBACK EXECUTED
Issue detected in deployment v1.2.1
All services rolled back to v1.2.0
Status: ✅ Services recovering
- All pods: Running
- Health checks: Passing
- Endpoints: Responding
Timeline:
- Issue detected: HH:MM UTC
- Rollback initiated: HH:MM UTC
- Services recovered: HH:MM UTC (5 minutes)
Next:
- Investigate root cause
- Fix issue
- Prepare corrected deployment
Questions? @on-call-engineer
Investigation & Root Cause
# While services are recovered, investigate what went wrong
# 1. Save logs from the failed deployment
#    (note: after rollback this returns the rolled-back pods' logs; capture the failed pods' logs before they terminate if you can)
kubectl logs deployment/vapora-backend -n $NAMESPACE \
--timestamps=true \
> failed-deployment-backend.log
# 2. Save pod events
kubectl describe pod $(kubectl get pods -n $NAMESPACE \
-l app=vapora-backend --sort-by=.metadata.creationTimestamp \
| tail -1 | awk '{print $1}') \
-n $NAMESPACE > failed-pod-events.log
# 3. Archive ConfigMap from failed deployment (if changed)
kubectl get configmap -n $NAMESPACE vapora-config -o yaml > configmap-failed.yaml
# 4. Compare with previous good state
diff configmap-previous.yaml configmap-failed.yaml
# 5. Check what changed in code
git diff HEAD~1 HEAD provisioning/
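To see exactly what changed between the last good rollout and the failed one, the pod templates recorded for the two revisions can be diffed; a sketch using the example revision numbers from Step 4:
# 6. Diff the pod templates recorded for the good and failed revisions (example numbers)
kubectl rollout history deployment/vapora-backend -n $NAMESPACE --revision=43 > revision-good.txt
kubectl rollout history deployment/vapora-backend -n $NAMESPACE --revision=44 > revision-failed.txt
diff revision-good.txt revision-failed.txt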
Decision: What Went Wrong
Common issues and investigation paths:
| Issue | Investigation | Action |
|---|---|---|
| Config syntax error | Check ConfigMap YAML | Fix YAML, test locally with yq |
| Missing environment variable | Check pod logs for "not found" | Update ConfigMap with value |
| Database connection | Check database connectivity | Verify DB URL in ConfigMap |
| Resource exhaustion | Check kubectl top, pod events | Increase resources or reduce replicas |
| Image missing | Check ImagePullBackOff event | Verify image pushed to registry |
| Permission issue | Check RBAC, logs for "forbidden" | Update service account permissions |
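For the config-related rows, a fix can be sanity-checked before redeploying; the sketch below parses the YAML locally and asks the API server to validate it without applying. It assumes yq v4 and a local vapora-config.yaml manifest (the file name is illustrative):
# Parse the ConfigMap manifest locally - fails fast on YAML syntax errors
yq eval '.' vapora-config.yaml > /dev/null && echo "✓ YAML parses"
# Ask the cluster to validate the manifest without persisting it
kubectl apply --dry-run=server -f vapora-config.yaml -n $NAMESPACE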
Post-Rollback Review
Schedule within 24 hours:
DEPLOYMENT POST-MORTEM
Deployment: v1.2.1
Outcome: ❌ Rolled back
Timeline:
- Deployed: 2026-01-12 14:00 UTC
- Issue detected: 14:05 UTC
- Rollback completed: 14:10 UTC
- Impact duration: 5 minutes
Root Cause: [describe what went wrong]
Why not caught before:
- [ ] Testing incomplete
- [ ] Config not validated
- [ ] Monitoring missed issue
- [ ] Other: [describe]
Prevention for next time:
1. [action item]
2. [action item]
3. [action item]
Owner: [person responsible for follow-up]
Deadline: [date]
Rollback Emergency Procedures
If Services Still Down After Rollback
# Services not recovering - emergency procedures
# 1. Check if rollback actually happened
kubectl get deployments -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}'
# If image is still new version:
# - Rollback might have failed
# - Try manual version specification
# 2. Force rollback to specific revision
kubectl rollout undo deployment/vapora-backend -n $NAMESPACE --to-revision=41
# 3. If still failing, delete and recreate pods
kubectl delete pods -n $NAMESPACE -l app=vapora-backend
# Pods will restart via deployment
# 4. Last resort: Scale down and up
kubectl scale deployment/vapora-backend --replicas=0 -n $NAMESPACE
sleep 10
kubectl scale deployment/vapora-backend --replicas=3 -n $NAMESPACE
# 5. Monitor restart
kubectl get pods -n $NAMESPACE -w
If Database Corrupted
# Only do this if you have recent backups
# 1. Identify corruption
kubectl logs deployment/vapora-backend -n $NAMESPACE | grep -i "corruption\|data"
# 2. Restore from backup (requires DBA support)
# Contact database team
# 3. Verify data integrity
# Run validation queries/commands
# 4. Notify stakeholders immediately
If All Else Fails
# Complete infrastructure recovery
# 1. Escalate to Infrastructure team
# 2. Activate Disaster Recovery procedures
# 3. Failover to backup environment if available
# 4. Engage senior engineers for investigation
Prevention & Lessons Learned
After every rollback:
- Root Cause Analysis
  - What actually went wrong?
  - Why wasn't it caught before deployment?
  - What can prevent this in the future?
- Testing Improvements
  - Add test case for failure scenario
  - Update pre-deployment checklist
  - Improve staging validation
- Monitoring Improvements
  - Add alert for this failure mode
  - Improve alerting sensitivity
  - Document expected vs abnormal logs
- Documentation
  - Update runbooks with new learnings
  - Document this specific failure scenario
  - Share with team
Rollback Checklist
☐ Confirmed critical issue requiring rollback
☐ Verified correct cluster and namespace
☐ Checked rollout history
☐ Executed rollback command (all services or specific)
☐ Monitored rollback progress (5-10 min wait)
☐ Verified all pods running
☐ Verified health endpoints responding
☐ Confirmed version reverted
☐ Posted communication to #deployments
☐ Notified on-call engineer: "rollback complete"
☐ Scheduled root cause analysis
☐ Saved logs for investigation
☐ Started post-mortem process
Reference: Quick Rollback Commands
For experienced operators:
# One-liner: Rollback all services
export NS=vapora; for d in vapora-backend vapora-agents vapora-llm-router; do kubectl rollout undo deployment/$d -n $NS & done; wait
# Quick verification
kubectl get pods -n $NS && kubectl get endpoints -n $NS
# Health check
kubectl port-forward -n $NS svc/vapora-backend 8001:8001 &
sleep 2 && curl http://localhost:8001/health