
Rollback Runbook

Procedures for safely rolling back VAPORA deployments when issues are detected.


When to Roll Back

Trigger a rollback if any of the following are observed after a deployment (most appear within the first 5 minutes):

Critical Issues (rollback within 1 minute):

  • Pod in CrashLoopBackOff (repeatedly restarting)
  • All pods unable to start
  • Service completely unreachable (0 endpoints)
  • Database connection completely broken
  • All requests returning 5xx errors
  • Service consuming all available memory/CPU

⚠️ Serious Issues (rollback within 5 minutes):

  • High error rate (>10% 5xx errors)
  • Significant performance degradation (2x+ latency)
  • Deployment not completing (stuck pods)
  • Unexpected dependency failures
  • Data corruption or loss

Monitor & Investigate (don't roll back immediately):

  • Single pod failing (might be node issue)
  • Transient network errors
  • Gradual performance degradation (might be load-related)
  • Expected warnings in logs
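
The thresholds above can be encoded in a small helper so the on-call decision is mechanical. This is an illustrative sketch, not part of the VAPORA tooling; `should_rollback`, its inputs, and its output strings are hypothetical, but the zero-running-pods and >10% 5xx thresholds come from the criteria above.

```shell
#!/usr/bin/env bash
# Hypothetical helper encoding the rollback criteria above.
# Inputs: running pod count, total 5xx responses, total responses.
should_rollback() {
  local running=$1 errors_5xx=$2 total_requests=$3

  # Critical: no pods running at all
  if [ "$running" -eq 0 ]; then
    echo "ROLLBACK (critical: 0 pods running)"
    return 0
  fi

  # Serious: more than 10% of requests returning 5xx
  if [ "$total_requests" -gt 0 ]; then
    local error_pct=$(( errors_5xx * 100 / total_requests ))
    if [ "$error_pct" -gt 10 ]; then
      echo "ROLLBACK (serious: ${error_pct}% 5xx errors)"
      return 0
    fi
  fi

  echo "MONITOR (no rollback threshold crossed)"
}

should_rollback 3 5 1000    # healthy pods, 0.5% errors → MONITOR
should_rollback 0 0 0       # no pods running → ROLLBACK
should_rollback 3 150 1000  # 15% 5xx errors → ROLLBACK
```

Feed it live numbers from `kubectl` and your metrics stack; the point is that the decision thresholds live in one place instead of in someone's head at 3 AM.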

Kubernetes Rollback (Automatic)

Step 1: Assess Situation (30 seconds)

# Set up environment
export NAMESPACE=vapora
export CLUSTER=production  # or staging

# Verify you're on correct cluster
kubectl cluster-info | grep server

# STOP if you're on wrong cluster!
# Correct cluster should be production URL

Step 2: Check Current Status

# See what's happening right now
kubectl get deployments -n $NAMESPACE
kubectl get pods -n $NAMESPACE

# Output should show the broken state that triggered rollback

Critical check:

# How many pods are actually running?
RUNNING=$(kubectl get pods -n $NAMESPACE --field-selector=status.phase=Running --no-headers | wc -l)
TOTAL=$(kubectl get pods -n $NAMESPACE --no-headers | wc -l)

echo "Pods running: $RUNNING / $TOTAL"

# If 0/X: Critical, rollback immediately
# If X/X: Investigate before rollback (might not need to)

Step 3: Identify Which Deployment Failed

# Check which deployment has issues
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  echo "=== $deployment ==="
  kubectl get deployment $deployment -n $NAMESPACE -o wide
  kubectl get pods -n $NAMESPACE -l app=$deployment
done

# Example: backend has ReplicaSet mismatch
# DESIRED   CURRENT   UPDATED   AVAILABLE
# 3         3         3         0         ← Problem: no pods available

Decide: Rollback all or specific deployment?

  • If all services down: Rollback all
  • If only backend issues: Rollback backend only

Step 4: Get Rollout History

# Show deployment revisions to see what to rollback to
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  echo "=== $deployment ==="
  kubectl rollout history deployment/$deployment -n $NAMESPACE | tail -5
done

# Output:
# REVISION  CHANGE-CAUSE
# 42        Deployment rolled out
# 43        Deployment rolled out
# 44        (current - the one with issues)

Key: Revision numbers increase with each deployment
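
Extracting the rollback target from that history can be scripted. The sketch below runs the parsing logic on captured sample text (the `HISTORY` variable is illustrative) so it is testable without a cluster; in practice, pipe the real `kubectl rollout history` output in. It picks the immediately previous revision, which is what a plain `rollout undo` targets.

```shell
# Sketch: extract the previous (rollback-target) revision number from
# `kubectl rollout history` output. Sample text stands in for the live command.
HISTORY='REVISION  CHANGE-CAUSE
42        Deployment rolled out
43        Deployment rolled out
44        <none>'

# Skip the header row; the last row is the current (broken) revision,
# so the rollback target is the row before it.
PREVIOUS=$(echo "$HISTORY" | awk 'NR>1 {print $1}' | tail -2 | head -1)
echo "Rollback target revision: $PREVIOUS"

# Then, against the real cluster:
# kubectl rollout undo deployment/vapora-backend -n $NAMESPACE --to-revision=$PREVIOUS
```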

Step 5: Execute Rollback

# Option A: Rollback all three services
echo "🔙 Rolling back all services..."

for deployment in vapora-backend vapora-agents vapora-llm-router; do
  echo "Rolling back $deployment..."
  kubectl rollout undo deployment/$deployment -n $NAMESPACE
  echo "✓ $deployment undo initiated"
done

# Wait for all rollbacks
echo "⏳ Waiting for rollback to complete..."
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  kubectl rollout status deployment/$deployment -n $NAMESPACE --timeout=5m
done

echo "✓ All services rolled back"

Option B: Rollback specific deployment

# If only backend has issues
kubectl rollout undo deployment/vapora-backend -n $NAMESPACE

# Monitor rollback
kubectl rollout status deployment/vapora-backend -n $NAMESPACE --timeout=5m

Option C: Rollback to specific revision

# If you need to skip the immediate previous version
# Find the working revision number from history
TARGET_REVISION=42  # Example

for deployment in vapora-backend vapora-agents vapora-llm-router; do
  echo "Rolling back $deployment to revision $TARGET_REVISION..."
  kubectl rollout undo deployment/$deployment -n $NAMESPACE \
    --to-revision=$TARGET_REVISION
done

# Verify rollback
kubectl rollout status deployment/vapora-backend -n $NAMESPACE --timeout=5m

Step 6: Monitor Rollback Progress

In a separate terminal, watch the rollback happening:

# Watch pods being recreated with old version
kubectl get pods -n $NAMESPACE -w

# Output shows:
# vapora-backend-abc123-newhash   1/1     Terminating   ← broken new pods being removed
# vapora-backend-def456-oldhash   0/1     Pending       ← previous-version pods starting
# vapora-backend-def456-oldhash   1/1     Running       ← previous-version pods ready

Expected timeline:

  • 0-30 seconds: broken pods terminating, previous-version pods scheduled
  • 30-90 seconds: previous-version pods pulling images and starting (ContainerCreating)
  • 90-180 seconds: previous-version pods reaching Running and Ready

Step 7: Verify Rollback Complete

# After rollout status shows "successfully rolled out"

# Verify all pods are running
kubectl get pods -n $NAMESPACE

# All should show:
# STATUS: Running
# READY: 1/1

# Verify service endpoints exist
kubectl get endpoints -n $NAMESPACE

# All services should have endpoints like:
# NAME              ENDPOINTS
# vapora-backend    10.x.x.x:8001,10.x.x.x:8001,10.x.x.x:8001

Step 8: Health Check

# Port-forward to test services
kubectl port-forward -n $NAMESPACE svc/vapora-backend 8001:8001 &
sleep 2

# Test health endpoint
curl -v http://localhost:8001/health

# Expected: HTTP 200 OK with health data

If health check fails:

# Check pod logs for errors
kubectl logs deployment/vapora-backend -n $NAMESPACE --tail=50

# Investigate further; you may need to roll back to an earlier revision

Step 9: Check Logs for Success

# Verify no errors in the first 2 minutes of rolled-back logs
kubectl logs deployment/vapora-backend -n $NAMESPACE --since=2m | \
  grep -i "error\|exception\|failed" | head -10

# Should return no (or very few) errors

Step 10: Verify Version Reverted

# Confirm we're back to previous version
kubectl get deployments -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}'

# Output should show previous image versions:
# vapora-backend      vapora/backend:v1.2.0    (not v1.2.1)
# vapora-agents       vapora/agents:v1.2.0
# vapora-llm-router   vapora/llm-router:v1.2.0
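
That check can be turned into a pass/fail assertion. The sketch below runs against sample image strings so the logic is self-contained; in practice, feed it the jsonpath output above. `EXPECTED_TAG` is an assumption matching the example versions.

```shell
# Sketch: assert every deployment is back on the expected (rolled-back) tag.
EXPECTED_TAG="v1.2.0"
IMAGES="vapora/backend:v1.2.0
vapora/agents:v1.2.0
vapora/llm-router:v1.2.0"

# Count images whose tag does not match the expected one
BAD=$(echo "$IMAGES" | awk -v tag=":${EXPECTED_TAG}" 'index($0, tag)==0 {n++} END {print n+0}')

if [ "$BAD" -eq 0 ]; then
  echo "OK: all images at ${EXPECTED_TAG}"
else
  echo "WARNING: ${BAD} image(s) not at ${EXPECTED_TAG}"
fi
```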

Docker Rollback (Manual)

For Docker Compose deployments (not Kubernetes):

Step 1: Assess Current State

# Check running containers
docker compose ps

# Check logs for errors
docker compose logs --tail=50 backend

Step 2: Stop Services

# Stop all services gracefully
docker compose down

# Verify stopped
docker ps | grep vapora
# Should return nothing

# Wait a moment for graceful shutdown
sleep 5

Step 3: Restore Previous Configuration

# Option A: Git history
cd deploy/docker
git log --oneline -5 -- docker-compose.yml
git checkout HEAD~1 -- docker-compose.yml

# Option B: Backup file
cp docker-compose.yml docker-compose.yml.broken
cp docker-compose.yml.backup docker-compose.yml

# Option C: Manual
# Edit docker-compose.yml to use previous image versions
# Example: change backend service image from v1.2.1 to v1.2.0
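
Option C can be scripted with sed rather than hand-editing. The sketch below demonstrates the substitution on a sample compose fragment (the file path, service names, and tags are illustrative); run the same `sed` against the real docker-compose.yml.

```shell
# Sketch: pin every image back to the previous tag with one sed pass.
# Sample fragment stands in for the real docker-compose.yml.
cat > /tmp/docker-compose.sample.yml <<'EOF'
services:
  backend:
    image: vapora/backend:v1.2.1
  agents:
    image: vapora/agents:v1.2.1
EOF

FROM_TAG="v1.2.1"
TO_TAG="v1.2.0"

# -i.bak keeps the broken version alongside as a .bak file
sed -i.bak "s|:${FROM_TAG}\$|:${TO_TAG}|" /tmp/docker-compose.sample.yml

# Both image lines should now end in the previous tag
grep "image:" /tmp/docker-compose.sample.yml
```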

Step 4: Restart Services

# Start services with previous configuration
docker compose up -d

# Wait for startup
sleep 5

# Verify services running
docker compose ps

# Should show all services with status "Up"

Step 5: Verify Health

# Check container logs
docker compose logs backend | tail -20

# Test health endpoint
curl -v http://localhost:8001/health

# Expected: HTTP 200 OK

Step 6: Check Services

# Verify all services responding
docker compose exec backend curl http://localhost:8001/health
docker compose exec frontend curl http://localhost:3000 --head

# All should return successful responses

Post-Rollback Procedures

Immediate (Within 5 minutes)

# 1. Verify all services healthy
✓ All pods running
✓ Health endpoints responding
✓ No error logs
✓ Service endpoints populated

# 2. Communicate to team

Communication

Post to #deployments:

🔙 ROLLBACK EXECUTED

Issue detected in deployment v1.2.1
All services rolled back to v1.2.0

Status: ✅ Services recovering
- All pods: Running
- Health checks: Passing
- Endpoints: Responding

Timeline:
- Issue detected: HH:MM UTC
- Rollback initiated: HH:MM UTC
- Services recovered: HH:MM UTC (5 minutes)

Next:
- Investigate root cause
- Fix issue
- Prepare corrected deployment

Questions? @on-call-engineer

Investigation & Root Cause

# While services are recovered, investigate what went wrong

# 1. Save logs from failed deployment
kubectl logs deployment/vapora-backend -n $NAMESPACE \
  --timestamps=true \
  > failed-deployment-backend.log

# 2. Save pod events (do this promptly - the failed pods may be
# garbage-collected; this grabs the most recently created backend pod)
kubectl describe pod $(kubectl get pods -n $NAMESPACE \
  -l app=vapora-backend --sort-by=.metadata.creationTimestamp \
  --no-headers | tail -1 | awk '{print $1}') \
  -n $NAMESPACE > failed-pod-events.log

# 3. Archive ConfigMap from failed deployment (if changed)
kubectl get configmap -n $NAMESPACE vapora-config -o yaml > configmap-failed.yaml

# 4. Compare with previous good state
diff configmap-previous.yaml configmap-failed.yaml

# 5. Check what changed in code
git diff HEAD~1 HEAD provisioning/

Decision: What Went Wrong

Common issues and investigation paths:

| Issue                        | Investigation                     | Action                                |
|------------------------------|-----------------------------------|---------------------------------------|
| Config syntax error          | Check ConfigMap YAML              | Fix YAML, test locally with yq        |
| Missing environment variable | Check pod logs for "not found"    | Update ConfigMap with value           |
| Database connection          | Check database connectivity       | Verify DB URL in ConfigMap            |
| Resource exhaustion          | Check kubectl top, pod events     | Increase resources or reduce replicas |
| Image missing                | Check ImagePullBackOff event      | Verify image pushed to registry       |
| Permission issue             | Check RBAC, logs for "forbidden"  | Update service account permissions    |
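
A first-pass triage of those signatures can be automated. The sketch below is a hypothetical classifier (the function name, categories, and sample log lines are illustrative, not VAPORA output); it only suggests a starting row in the table above, it does not replace reading the logs.

```shell
# Sketch: map a failure log line to a likely category from the table above.
classify_failure() {
  local line=$1
  case "$line" in
    *"ImagePullBackOff"*)    echo "image-missing" ;;
    *[Ff]orbidden*)          echo "permission-issue" ;;
    *"not found"*)           echo "missing-env-var" ;;
    *[Cc]onnection*refused*) echo "database-connection" ;;
    *OOMKilled*)             echo "resource-exhaustion" ;;
    *)                       echo "unknown" ;;
  esac
}

classify_failure 'Error: DATABASE_URL not found'          # missing-env-var
classify_failure 'Back-off pulling image: ImagePullBackOff'  # image-missing
classify_failure 'pods is forbidden: User cannot list'    # permission-issue
```

Pipe candidate lines from the saved failure logs through it, e.g. `grep -i error failed-deployment-backend.log | while read -r l; do classify_failure "$l"; done | sort | uniq -c`.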

Post-Rollback Review

Schedule within 24 hours:

DEPLOYMENT POST-MORTEM

Deployment: v1.2.1
Outcome: ❌ Rolled back

Timeline:
- Deployed: 2026-01-12 14:00 UTC
- Issue detected: 14:05 UTC
- Rollback completed: 14:10 UTC
- Impact duration: 5 minutes

Root Cause: [describe what went wrong]

Why not caught before:
- [ ] Testing incomplete
- [ ] Config not validated
- [ ] Monitoring missed issue
- [ ] Other: [describe]

Prevention for next time:
1. [action item]
2. [action item]
3. [action item]

Owner: [person responsible for follow-up]
Deadline: [date]

Rollback Emergency Procedures

If Services Still Down After Rollback

# Services not recovering - emergency procedures

# 1. Check if rollback actually happened
kubectl get deployments -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}'

# If image is still new version:
# - Rollback might have failed
# - Try manual version specification

# 2. Force rollback to specific revision
kubectl rollout undo deployment/vapora-backend -n $NAMESPACE --to-revision=41

# 3. If still failing, delete and recreate pods
kubectl delete pods -n $NAMESPACE -l app=vapora-backend
# Pods will restart via deployment

# 4. Last resort: Scale down and up
kubectl scale deployment/vapora-backend --replicas=0 -n $NAMESPACE
sleep 10
kubectl scale deployment/vapora-backend --replicas=3 -n $NAMESPACE

# 5. Monitor restart
kubectl get pods -n $NAMESPACE -w

If Database Corrupted

# Only do this if you have recent backups

# 1. Identify corruption
kubectl logs deployment/vapora-backend -n $NAMESPACE | grep -iE "corrupt|integrity"

# 2. Restore from backup (requires DBA support)
# Contact database team

# 3. Verify data integrity
# Run validation queries/commands

# 4. Notify stakeholders immediately

If All Else Fails

# Complete infrastructure recovery

# 1. Escalate to Infrastructure team
# 2. Activate Disaster Recovery procedures
# 3. Failover to backup environment if available
# 4. Engage senior engineers for investigation

Prevention & Lessons Learned

After every rollback:

  1. Root Cause Analysis

    • What actually went wrong?
    • Why wasn't it caught before deployment?
    • What can prevent this in the future?
  2. Testing Improvements

    • Add test case for failure scenario
    • Update pre-deployment checklist
    • Improve staging validation
  3. Monitoring Improvements

    • Add alert for this failure mode
    • Improve alerting sensitivity
    • Document expected vs abnormal logs
  4. Documentation

    • Update runbooks with new learnings
    • Document this specific failure scenario
    • Share with team

Rollback Checklist

☐ Confirmed critical issue requiring rollback
☐ Verified correct cluster and namespace
☐ Checked rollout history
☐ Executed rollback command (all services or specific)
☐ Monitored rollback progress (5-10 min wait)
☐ Verified all pods running
☐ Verified health endpoints responding
☐ Confirmed version reverted
☐ Posted communication to #deployments
☐ Notified on-call engineer: "rollback complete"
☐ Scheduled root cause analysis
☐ Saved logs for investigation
☐ Started post-mortem process

Reference: Quick Rollback Commands

For experienced operators:

# One-liner: Rollback all services
export NS=vapora; for d in vapora-backend vapora-agents vapora-llm-router; do kubectl rollout undo deployment/$d -n $NS & done; wait

# Quick verification
kubectl get pods -n $NS && kubectl get endpoints -n $NS

# Health check
kubectl port-forward -n $NS svc/vapora-backend 8001:8001 &
sleep 2 && curl http://localhost:8001/health