
Rollback Runbook

Procedures for safely rolling back VAPORA deployments when issues are detected.


When to Roll Back

Trigger a rollback if any of the following are observed after a deployment (most appear within the first 5 minutes):

Critical Issues (rollback within 1 minute):

  • Pod in CrashLoopBackOff (repeatedly restarting)
  • All pods unable to start
  • Service completely unreachable (0 endpoints)
  • Database connection completely broken
  • All requests returning 5xx errors
  • Service consuming all available memory/CPU

⚠️ Serious Issues (rollback within 5 minutes):

  • High error rate (>10% 5xx errors)
  • Significant performance degradation (2x+ latency)
  • Deployment not completing (stuck pods)
  • Unexpected dependency failures
  • Data corruption or loss

Monitor & Investigate (don't roll back immediately):

  • Single pod failing (might be node issue)
  • Transient network errors
  • Gradual performance degradation (might be load-related)
  • Expected warnings in logs
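
The thresholds above can be encoded in a small helper so the on-call decision is mechanical. This is an illustrative sketch, not part of the VAPORA tooling; `should_rollback`, its inputs, and its output strings are hypothetical, but the zero-running-pods and >10% 5xx thresholds come from the criteria above.

```shell
#!/usr/bin/env bash
# Hypothetical helper encoding the rollback criteria above.
# Inputs: running pod count, total 5xx responses, total responses.
should_rollback() {
  local running=$1 errors_5xx=$2 total_requests=$3

  # Critical: no pods running at all
  if [ "$running" -eq 0 ]; then
    echo "ROLLBACK (critical: 0 pods running)"
    return 0
  fi

  # Serious: more than 10% of requests returning 5xx
  if [ "$total_requests" -gt 0 ]; then
    local error_pct=$(( errors_5xx * 100 / total_requests ))
    if [ "$error_pct" -gt 10 ]; then
      echo "ROLLBACK (serious: ${error_pct}% 5xx errors)"
      return 0
    fi
  fi

  echo "MONITOR (no rollback threshold crossed)"
}

should_rollback 3 5 1000    # healthy pods, 0.5% errors → MONITOR
should_rollback 0 0 0       # no pods running → ROLLBACK
should_rollback 3 150 1000  # 15% 5xx errors → ROLLBACK
```

Feed it live numbers from `kubectl` and your metrics stack; the point is that the decision thresholds live in one place instead of in someone's head at 3 AM.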

Kubernetes Rollback (Automatic)

Step 1: Assess Situation (30 seconds)

# Set up environment
export NAMESPACE=vapora
export CLUSTER=production  # or staging

# Verify you're on correct cluster
kubectl cluster-info | grep server

# STOP if you're on wrong cluster!
# Correct cluster should be production URL

Step 2: Check Current Status

# See what's happening right now
kubectl get deployments -n $NAMESPACE
kubectl get pods -n $NAMESPACE

# Output should show the broken state that triggered rollback

Critical check:

# How many pods are actually running?
RUNNING=$(kubectl get pods -n $NAMESPACE --field-selector=status.phase=Running --no-headers | wc -l)
TOTAL=$(kubectl get pods -n $NAMESPACE --no-headers | wc -l)

echo "Pods running: $RUNNING / $TOTAL"

# If 0/X: Critical, rollback immediately
# If X/X: Investigate before rollback (might not need to)

Step 3: Identify Which Deployment Failed

# Check which deployment has issues
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  echo "=== $deployment ==="
  kubectl get deployment $deployment -n $NAMESPACE -o wide
  kubectl get pods -n $NAMESPACE -l app=$deployment
done

# Example: backend has ReplicaSet mismatch
# DESIRED   CURRENT   UPDATED   AVAILABLE
# 3         3         3         0         ← Problem: no pods available

Decide: Rollback all or specific deployment?

  • If all services down: Rollback all
  • If only backend issues: Rollback backend only

Step 4: Get Rollout History

# Show deployment revisions to see what to rollback to
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  echo "=== $deployment ==="
  kubectl rollout history deployment/$deployment -n $NAMESPACE | tail -5
done

# Output:
# REVISION  CHANGE-CAUSE
# 42        Deployment rolled out
# 43        Deployment rolled out
# 44        (current - the one with issues)

Key: Revision numbers increase with each deployment
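
Extracting the rollback target from that history can be scripted. The sketch below runs the parsing logic on captured sample text (the `HISTORY` variable is illustrative) so it is testable without a cluster; in practice, pipe the real `kubectl rollout history` output in. It picks the immediately previous revision, which is what a plain `rollout undo` targets.

```shell
# Sketch: extract the previous (rollback-target) revision number from
# `kubectl rollout history` output. Sample text stands in for the live command.
HISTORY='REVISION  CHANGE-CAUSE
42        Deployment rolled out
43        Deployment rolled out
44        <none>'

# Skip the header row; the last row is the current (broken) revision,
# so the rollback target is the row before it.
PREVIOUS=$(echo "$HISTORY" | awk 'NR>1 {print $1}' | tail -2 | head -1)
echo "Rollback target revision: $PREVIOUS"

# Then, against the real cluster:
# kubectl rollout undo deployment/vapora-backend -n $NAMESPACE --to-revision=$PREVIOUS
```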

Step 5: Execute Rollback

# Option A: Rollback all three services
echo "🔙 Rolling back all services..."

for deployment in vapora-backend vapora-agents vapora-llm-router; do
  echo "Rolling back $deployment..."
  kubectl rollout undo deployment/$deployment -n $NAMESPACE
  echo "✓ $deployment undo initiated"
done

# Wait for all rollbacks
echo "⏳ Waiting for rollback to complete..."
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  kubectl rollout status deployment/$deployment -n $NAMESPACE --timeout=5m
done

echo "✓ All services rolled back"

Option B: Rollback specific deployment

# If only backend has issues
kubectl rollout undo deployment/vapora-backend -n $NAMESPACE

# Monitor rollback
kubectl rollout status deployment/vapora-backend -n $NAMESPACE --timeout=5m

Option C: Rollback to specific revision

# If you need to skip the immediate previous version
# Find the working revision number from history
TARGET_REVISION=42  # Example

for deployment in vapora-backend vapora-agents vapora-llm-router; do
  echo "Rolling back $deployment to revision $TARGET_REVISION..."
  kubectl rollout undo deployment/$deployment -n $NAMESPACE \
    --to-revision=$TARGET_REVISION
done

# Verify rollback
kubectl rollout status deployment/vapora-backend -n $NAMESPACE --timeout=5m

Step 6: Monitor Rollback Progress

In a separate terminal, watch the rollback happening:

# Watch pods being recreated with old version
kubectl get pods -n $NAMESPACE -w

# Output shows:
# vapora-backend-abc123-newhash   1/1     Terminating   ← broken new pods being removed
# vapora-backend-def456-oldhash   0/1     Pending       ← previous-version pods starting
# vapora-backend-def456-oldhash   1/1     Running       ← previous-version pods ready

Expected timeline:

  • 0-30 seconds: broken pods terminating, previous-version pods scheduled
  • 30-90 seconds: previous-version pods pulling images and starting (ContainerCreating)
  • 90-180 seconds: previous-version pods reaching Running and Ready

Step 7: Verify Rollback Complete

# After rollout status shows "successfully rolled out"

# Verify all pods are running
kubectl get pods -n $NAMESPACE

# All should show:
# STATUS: Running
# READY: 1/1

# Verify service endpoints exist
kubectl get endpoints -n $NAMESPACE

# All services should have endpoints like:
# NAME              ENDPOINTS
# vapora-backend    10.x.x.x:8001,10.x.x.x:8001,10.x.x.x:8001

Step 8: Health Check

# Port-forward to test services
kubectl port-forward -n $NAMESPACE svc/vapora-backend 8001:8001 &
sleep 2

# Test health endpoint
curl -v http://localhost:8001/health

# Expected: HTTP 200 OK with health data

If health check fails:

# Check pod logs for errors
kubectl logs deployment/vapora-backend -n $NAMESPACE --tail=50

# Investigate further; you may need to roll back to an earlier revision

Step 9: Check Logs for Success

# Verify no errors in the first 2 minutes of rolled-back logs
kubectl logs deployment/vapora-backend -n $NAMESPACE --since=2m | \
  grep -i "error\|exception\|failed" | head -10

# Should return no (or very few) errors

Step 10: Verify Version Reverted

# Confirm we're back to previous version
kubectl get deployments -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}'

# Output should show previous image versions:
# vapora-backend      vapora/backend:v1.2.0    (not v1.2.1)
# vapora-agents       vapora/agents:v1.2.0
# vapora-llm-router   vapora/llm-router:v1.2.0
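
That check can be turned into a pass/fail assertion. The sketch below runs against sample image strings so the logic is self-contained; in practice, feed it the jsonpath output above. `EXPECTED_TAG` is an assumption matching the example versions.

```shell
# Sketch: assert every deployment is back on the expected (rolled-back) tag.
EXPECTED_TAG="v1.2.0"
IMAGES="vapora/backend:v1.2.0
vapora/agents:v1.2.0
vapora/llm-router:v1.2.0"

# Count images whose tag does not match the expected one
BAD=$(echo "$IMAGES" | awk -v tag=":${EXPECTED_TAG}" 'index($0, tag)==0 {n++} END {print n+0}')

if [ "$BAD" -eq 0 ]; then
  echo "OK: all images at ${EXPECTED_TAG}"
else
  echo "WARNING: ${BAD} image(s) not at ${EXPECTED_TAG}"
fi
```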

Docker Rollback (Manual)

For Docker Compose deployments (not Kubernetes):

Step 1: Assess Current State

# Check running containers
docker compose ps

# Check logs for errors
docker compose logs --tail=50 backend

Step 2: Stop Services

# Stop all services gracefully
docker compose down

# Verify stopped
docker ps | grep vapora
# Should return nothing

# Wait a moment for graceful shutdown
sleep 5

Step 3: Restore Previous Configuration

# Option A: Git history
cd deploy/docker
git log --oneline -5 -- docker-compose.yml
git checkout HEAD~1 -- docker-compose.yml

# Option B: Backup file
cp docker-compose.yml docker-compose.yml.broken
cp docker-compose.yml.backup docker-compose.yml

# Option C: Manual
# Edit docker-compose.yml to use previous image versions
# Example: change backend service image from v1.2.1 to v1.2.0
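
Option C can be scripted with sed rather than hand-editing. The sketch below demonstrates the substitution on a sample compose fragment (the file path, service names, and tags are illustrative); run the same `sed` against the real docker-compose.yml.

```shell
# Sketch: pin every image back to the previous tag with one sed pass.
# Sample fragment stands in for the real docker-compose.yml.
cat > /tmp/docker-compose.sample.yml <<'EOF'
services:
  backend:
    image: vapora/backend:v1.2.1
  agents:
    image: vapora/agents:v1.2.1
EOF

FROM_TAG="v1.2.1"
TO_TAG="v1.2.0"

# -i.bak keeps the broken version alongside as a .bak file
sed -i.bak "s|:${FROM_TAG}\$|:${TO_TAG}|" /tmp/docker-compose.sample.yml

# Both image lines should now end in the previous tag
grep "image:" /tmp/docker-compose.sample.yml
```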

Step 4: Restart Services

# Start services with previous configuration
docker compose up -d

# Wait for startup
sleep 5

# Verify services running
docker compose ps

# Should show all services with status "Up"

Step 5: Verify Health

# Check container logs
docker compose logs backend | tail -20

# Test health endpoint
curl -v http://localhost:8001/health

# Expected: HTTP 200 OK

Step 6: Check Services

# Verify all services responding
docker compose exec backend curl http://localhost:8001/health
docker compose exec frontend curl http://localhost:3000 --head

# All should return successful responses

Post-Rollback Procedures

Immediate (Within 5 minutes)

# 1. Verify all services healthy
✓ All pods running
✓ Health endpoints responding
✓ No error logs
✓ Service endpoints populated

# 2. Communicate to team

Communication

Post to #deployments:

🔙 ROLLBACK EXECUTED

Issue detected in deployment v1.2.1
All services rolled back to v1.2.0

Status: ✅ Services recovering
- All pods: Running
- Health checks: Passing
- Endpoints: Responding

Timeline:
- Issue detected: HH:MM UTC
- Rollback initiated: HH:MM UTC
- Services recovered: HH:MM UTC (5 minutes)

Next:
- Investigate root cause
- Fix issue
- Prepare corrected deployment

Questions? @on-call-engineer

Investigation & Root Cause

# While services are recovered, investigate what went wrong

# 1. Save logs from failed deployment
kubectl logs deployment/vapora-backend -n $NAMESPACE \
  --timestamps=true \
  > failed-deployment-backend.log

# 2. Save pod events (do this promptly - the failed pods may be
# garbage-collected; this grabs the most recently created backend pod)
kubectl describe pod $(kubectl get pods -n $NAMESPACE \
  -l app=vapora-backend --sort-by=.metadata.creationTimestamp \
  --no-headers | tail -1 | awk '{print $1}') \
  -n $NAMESPACE > failed-pod-events.log

# 3. Archive ConfigMap from failed deployment (if changed)
kubectl get configmap -n $NAMESPACE vapora-config -o yaml > configmap-failed.yaml

# 4. Compare with previous good state
diff configmap-previous.yaml configmap-failed.yaml

# 5. Check what changed in code
git diff HEAD~1 HEAD provisioning/

Decision: What Went Wrong

Common issues and investigation paths:

| Issue                        | Investigation                     | Action                                |
|------------------------------|-----------------------------------|---------------------------------------|
| Config syntax error          | Check ConfigMap YAML              | Fix YAML, test locally with yq        |
| Missing environment variable | Check pod logs for "not found"    | Update ConfigMap with value           |
| Database connection          | Check database connectivity       | Verify DB URL in ConfigMap            |
| Resource exhaustion          | Check kubectl top, pod events     | Increase resources or reduce replicas |
| Image missing                | Check ImagePullBackOff event      | Verify image pushed to registry       |
| Permission issue             | Check RBAC, logs for "forbidden"  | Update service account permissions    |
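
A first-pass triage of those signatures can be automated. The sketch below is a hypothetical classifier (the function name, categories, and sample log lines are illustrative, not VAPORA output); it only suggests a starting row in the table above, it does not replace reading the logs.

```shell
# Sketch: map a failure log line to a likely category from the table above.
classify_failure() {
  local line=$1
  case "$line" in
    *"ImagePullBackOff"*)    echo "image-missing" ;;
    *[Ff]orbidden*)          echo "permission-issue" ;;
    *"not found"*)           echo "missing-env-var" ;;
    *[Cc]onnection*refused*) echo "database-connection" ;;
    *OOMKilled*)             echo "resource-exhaustion" ;;
    *)                       echo "unknown" ;;
  esac
}

classify_failure 'Error: DATABASE_URL not found'          # missing-env-var
classify_failure 'Back-off pulling image: ImagePullBackOff'  # image-missing
classify_failure 'pods is forbidden: User cannot list'    # permission-issue
```

Pipe candidate lines from the saved failure logs through it, e.g. `grep -i error failed-deployment-backend.log | while read -r l; do classify_failure "$l"; done | sort | uniq -c`.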

Post-Rollback Review

Schedule within 24 hours:

DEPLOYMENT POST-MORTEM

Deployment: v1.2.1
Outcome: ❌ Rolled back

Timeline:
- Deployed: 2026-01-12 14:00 UTC
- Issue detected: 14:05 UTC
- Rollback completed: 14:10 UTC
- Impact duration: 5 minutes

Root Cause: [describe what went wrong]

Why not caught before:
- [ ] Testing incomplete
- [ ] Config not validated
- [ ] Monitoring missed issue
- [ ] Other: [describe]

Prevention for next time:
1. [action item]
2. [action item]
3. [action item]

Owner: [person responsible for follow-up]
Deadline: [date]

Rollback Emergency Procedures

If Services Still Down After Rollback

# Services not recovering - emergency procedures

# 1. Check if rollback actually happened
kubectl get deployments -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}'

# If image is still new version:
# - Rollback might have failed
# - Try manual version specification

# 2. Force rollback to specific revision
kubectl rollout undo deployment/vapora-backend -n $NAMESPACE --to-revision=41

# 3. If still failing, delete and recreate pods
kubectl delete pods -n $NAMESPACE -l app=vapora-backend
# Pods will restart via deployment

# 4. Last resort: Scale down and up
kubectl scale deployment/vapora-backend --replicas=0 -n $NAMESPACE
sleep 10
kubectl scale deployment/vapora-backend --replicas=3 -n $NAMESPACE

# 5. Monitor restart
kubectl get pods -n $NAMESPACE -w

If Database Corrupted

# Only do this if you have recent backups

# 1. Identify corruption
kubectl logs deployment/vapora-backend -n $NAMESPACE | grep -iE "corrupt|integrity"

# 2. Restore from backup (requires DBA support)
# Contact database team

# 3. Verify data integrity
# Run validation queries/commands

# 4. Notify stakeholders immediately

If All Else Fails

# Complete infrastructure recovery

# 1. Escalate to Infrastructure team
# 2. Activate Disaster Recovery procedures
# 3. Failover to backup environment if available
# 4. Engage senior engineers for investigation

Prevention & Lessons Learned

After every rollback:

  1. Root Cause Analysis

    • What actually went wrong?
    • Why wasn't it caught before deployment?
    • What can prevent this in the future?
  2. Testing Improvements

    • Add test case for failure scenario
    • Update pre-deployment checklist
    • Improve staging validation
  3. Monitoring Improvements

    • Add alert for this failure mode
    • Improve alerting sensitivity
    • Document expected vs abnormal logs
  4. Documentation

    • Update runbooks with new learnings
    • Document this specific failure scenario
    • Share with team

Rollback Checklist

☐ Confirmed critical issue requiring rollback
☐ Verified correct cluster and namespace
☐ Checked rollout history
☐ Executed rollback command (all services or specific)
☐ Monitored rollback progress (5-10 min wait)
☐ Verified all pods running
☐ Verified health endpoints responding
☐ Confirmed version reverted
☐ Posted communication to #deployments
☐ Notified on-call engineer: "rollback complete"
☐ Scheduled root cause analysis
☐ Saved logs for investigation
☐ Started post-mortem process

Reference: Quick Rollback Commands

For experienced operators:

# One-liner: Rollback all services
export NS=vapora; for d in vapora-backend vapora-agents vapora-llm-router; do kubectl rollout undo deployment/$d -n $NS & done; wait

# Quick verification
kubectl get pods -n $NS && kubectl get endpoints -n $NS

# Health check
kubectl port-forward -n $NS svc/vapora-backend 8001:8001 &
sleep 2 && curl http://localhost:8001/health