# Rollback Runbook

Procedures for safely rolling back VAPORA deployments when issues are detected.

---

## When to Rollback

Immediately trigger a rollback if any of these occur within 5 minutes of deployment:

❌ **Critical Issues** (rollback within 1 minute):
- Pod in `CrashLoopBackOff` (repeatedly restarting)
- All pods unable to start
- Service completely unreachable (0 endpoints)
- Database connection completely broken
- All requests returning 5xx errors
- Service consuming all available memory/CPU

⚠️ **Serious Issues** (rollback within 5 minutes):
- High error rate (>10% 5xx errors)
- Significant performance degradation (2x+ latency)
- Deployment not completing (stuck pods)
- Unexpected dependency failures
- Data corruption or loss

✓ **Monitor & Investigate** (don't rollback immediately):
- Single pod failing (might be a node issue)
- Transient network errors
- Gradual latency increase (might be load-related)
- Expected warnings in logs
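As a quick triage aid, the criteria above can be approximated with a few `kubectl` queries. A minimal sketch, assuming the `vapora` namespace used throughout this runbook; it only surfaces symptoms, it does not make the rollback decision for you:

```bash
# Quick triage: surface the symptoms that usually justify an immediate rollback
NAMESPACE=vapora

# Pods stuck in CrashLoopBackOff or any other non-Running state
kubectl get pods -n $NAMESPACE --no-headers | grep -v ' Running ' || echo "✓ all pods Running"

# Services with zero endpoints (completely unreachable)
kubectl get endpoints -n $NAMESPACE

# Restart counts (repeated restarts suggest CrashLoopBackOff)
kubectl get pods -n $NAMESPACE \
  -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount
```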
---

## Kubernetes Rollback (Automatic)

### Step 1: Assess Situation (30 seconds)

```bash
# Set up environment
export NAMESPACE=vapora
export CLUSTER=production  # or staging

# Verify you're on the correct cluster
kubectl cluster-info | grep server

# STOP if you're on the wrong cluster!
# The correct cluster should show the production URL
```

### Step 2: Check Current Status

```bash
# See what's happening right now
kubectl get deployments -n $NAMESPACE
kubectl get pods -n $NAMESPACE

# Output should show the broken state that triggered the rollback
```

**Critical check:**

```bash
# How many pods are actually running?
RUNNING=$(kubectl get pods -n $NAMESPACE --field-selector=status.phase=Running --no-headers | wc -l)
TOTAL=$(kubectl get pods -n $NAMESPACE --no-headers | wc -l)
echo "Pods running: $RUNNING / $TOTAL"

# If 0/X: critical, rollback immediately
# If X/X: investigate before rolling back (you might not need to)
```

### Step 3: Identify Which Deployment Failed

```bash
# Check which deployment has issues
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  echo "=== $deployment ==="
  kubectl get deployment $deployment -n $NAMESPACE -o wide
  kubectl get pods -n $NAMESPACE -l app=$deployment
done

# Example: backend has a ReplicaSet mismatch
# DESIRED   CURRENT   UPDATED   AVAILABLE
# 3         3         3         0          ← Problem: no pods available
```

**Decide**: Rollback all services or a specific deployment?
- If all services are down: rollback all
- If only the backend has issues: rollback the backend only

### Step 4: Get Rollout History

```bash
# Show deployment revisions to see what to roll back to
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  echo "=== $deployment ==="
  kubectl rollout history deployment/$deployment -n $NAMESPACE | tail -5
done

# Output:
# REVISION  CHANGE-CAUSE
# 42        Deployment rolled out
# 43        Deployment rolled out
# 44        (current - the one with issues)
```

**Key**: Revision numbers increase with each deployment.

### Step 5: Execute Rollback

**Option A: Rollback all three services**

```bash
echo "🔙 Rolling back all services..."

for deployment in vapora-backend vapora-agents vapora-llm-router; do
  echo "Rolling back $deployment..."
  kubectl rollout undo deployment/$deployment -n $NAMESPACE
  echo "✓ $deployment undo initiated"
done

# Wait for all rollbacks
echo "⏳ Waiting for rollback to complete..."
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  kubectl rollout status deployment/$deployment -n $NAMESPACE --timeout=5m
done

echo "✓ All services rolled back"
```

**Option B: Rollback a specific deployment**

```bash
# If only the backend has issues
kubectl rollout undo deployment/vapora-backend -n $NAMESPACE

# Monitor the rollback
kubectl rollout status deployment/vapora-backend -n $NAMESPACE --timeout=5m
```

**Option C: Rollback to a specific revision**

```bash
# If you need to skip the immediately previous version,
# find the working revision number from the rollout history
TARGET_REVISION=42  # Example

for deployment in vapora-backend vapora-agents vapora-llm-router; do
  echo "Rolling back $deployment to revision $TARGET_REVISION..."
  kubectl rollout undo deployment/$deployment -n $NAMESPACE \
    --to-revision=$TARGET_REVISION
done

# Verify the rollback (repeat for each service)
kubectl rollout status deployment/vapora-backend -n $NAMESPACE --timeout=5m
```

### Step 6: Monitor Rollback Progress

In a **separate terminal**, watch the rollback happening:

```bash
# Watch pods being recreated with the old version
kubectl get pods -n $NAMESPACE -w

# Output shows:
# vapora-backend-abc123-newhash   1/1   Terminating   ← pods from the failed version being removed
# vapora-backend-def456-oldhash   0/1   Pending       ← previous-version pods starting
# vapora-backend-def456-oldhash   1/1   Running       ← previous-version pods ready
```

**Expected timeline:**
- 0-30 seconds: failed-version pods terminating, previous-version pods being scheduled
- 30-90 seconds: replacement pods starting up (ContainerCreating)
- 90-180 seconds: replacement pods reaching Running state

### Step 7: Verify Rollback Complete

```bash
# After rollout status shows "successfully rolled out",
# verify all pods are running
kubectl get pods -n $NAMESPACE

# All should show:
# STATUS: Running
# READY: 1/1

# Verify service endpoints exist
kubectl get endpoints -n $NAMESPACE

# All services should have endpoints like:
# NAME             ENDPOINTS
# vapora-backend   10.x.x.x:8001,10.x.x.x:8001,10.x.x.x:8001
```

### Step 8: Health Check

```bash
# Port-forward to test services
kubectl port-forward -n $NAMESPACE svc/vapora-backend 8001:8001 &
sleep 2

# Test health endpoint
curl -v http://localhost:8001/health

# Expected: HTTP 200 OK with health data
```

**If the health check fails:**

```bash
# Check pod logs for errors
kubectl logs deployment/vapora-backend -n $NAMESPACE --tail=50

# See what's wrong; this may need further investigation,
# or possibly a rollback to an earlier revision
```

### Step 9: Check Logs for Success

```bash
# Verify no errors in the first 2 minutes of rolled-back logs
kubectl logs deployment/vapora-backend -n $NAMESPACE --since=2m | \
  grep -i "error\|exception\|failed" | head -10

# Should return no (or very few) errors
```

### Step 10: Verify Version Reverted

```bash
# Confirm we're back to the previous version
kubectl get deployments -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}'

# Output should show the previous image versions:
# vapora-backend      vapora/backend:v1.2.0     (not v1.2.1)
# vapora-agents       vapora/agents:v1.2.0
# vapora-llm-router   vapora/llm-router:v1.2.0
```
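Steps 7-10 can be collapsed into a single post-rollback check if you prefer to script it. A minimal sketch, not part of the repo: `EXPECTED_VERSION` and the `/health` port and path are assumptions taken from the examples above.

```bash
#!/usr/bin/env bash
# post-rollback-check.sh - illustrative sketch; adjust names, ports, and tags to your environment.
NAMESPACE=${NAMESPACE:-vapora}
EXPECTED_VERSION=${EXPECTED_VERSION:-v1.2.0}   # assumed previous tag; take it from rollout history

# 1. Every deployment should be back on the expected image tag
kubectl get deployments -n "$NAMESPACE" \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}' |
while read -r name image; do
  case "$image" in
    *"$EXPECTED_VERSION"*) echo "✓ $name -> $image" ;;
    *)                     echo "✗ $name still on $image" ;;
  esac
done

# 2. List any pods stuck outside the Running phase (prints nothing when all are Running)
kubectl get pods -n "$NAMESPACE" --field-selector=status.phase!=Running --no-headers

# 3. Backend health endpoint answers 200 (port and path as in Step 8)
kubectl port-forward -n "$NAMESPACE" svc/vapora-backend 8001:8001 &
PF_PID=$!
sleep 2
if curl -sf http://localhost:8001/health > /dev/null; then
  echo "✓ health check passed"
else
  echo "✗ health check failed; see Step 8 troubleshooting"
fi
kill $PF_PID
```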
---

## Docker Rollback (Manual)

For Docker Compose deployments (not Kubernetes):

### Step 1: Assess Current State

```bash
# Check running containers
docker compose ps

# Check logs for errors
docker compose logs --tail=50 backend
```

### Step 2: Stop Services

```bash
# Stop all services gracefully
docker compose down

# Verify stopped
docker ps | grep vapora
# Should return nothing

# Wait a moment for graceful shutdown
sleep 5
```

### Step 3: Restore Previous Configuration

```bash
# Option A: Git history
cd deploy/docker
git log docker-compose.yml | head -5
git checkout HEAD~1 docker-compose.yml

# Option B: Backup file
cp docker-compose.yml docker-compose.yml.broken
cp docker-compose.yml.backup docker-compose.yml

# Option C: Manual
# Edit docker-compose.yml to use previous image versions
# Example: change backend service image from v1.2.1 to v1.2.0
# (a scripted version of this edit is sketched below)
```
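The manual edit in Option C can also be scripted. A hedged sketch, assuming the `vapora/<service>:vX.Y.Z` image naming used elsewhere in this runbook and GNU `sed`; adjust the tags to your actual versions:

```bash
# Pin every vapora image in docker-compose.yml back to the previous tag
BROKEN_TAG=v1.2.1     # assumed: the version that was just deployed
PREVIOUS_TAG=v1.2.0   # assumed: the last known-good version

cp docker-compose.yml docker-compose.yml.broken   # keep the broken file for the post-mortem
sed -i "s|\(vapora/[a-z-]*\):${BROKEN_TAG}|\1:${PREVIOUS_TAG}|g" docker-compose.yml

# Confirm the change before restarting
grep 'image:' docker-compose.yml
```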
### Step 4: Restart Services

```bash
# Start services with previous configuration
docker compose up -d

# Wait for startup
sleep 5

# Verify services running
docker compose ps
# Should show all services with status "Up"
```

### Step 5: Verify Health

```bash
# Check container logs
docker compose logs backend | tail -20

# Test health endpoint
curl -v http://localhost:8001/health
# Expected: HTTP 200 OK
```

### Step 6: Check Services

```bash
# Verify all services responding
docker compose exec backend curl http://localhost:8001/health
docker compose exec frontend curl http://localhost:3000 --head

# All should return successful responses
```

---

## Post-Rollback Procedures

### Immediate (Within 5 minutes)

1. Verify all services healthy:
   - ✓ All pods running
   - ✓ Health endpoints responding
   - ✓ No error logs
   - ✓ Service endpoints populated
2. Communicate to team (see below)

### Communication

```
Post to #deployments:

🔙 ROLLBACK EXECUTED

Issue detected in deployment v1.2.1
All services rolled back to v1.2.0

Status: ✅ Services recovering
- All pods: Running
- Health checks: Passing
- Endpoints: Responding

Timeline:
- Issue detected: HH:MM UTC
- Rollback initiated: HH:MM UTC
- Services recovered: HH:MM UTC (5 minutes)

Next:
- Investigate root cause
- Fix issue
- Prepare corrected deployment

Questions? @on-call-engineer
```

### Investigation & Root Cause

```bash
# While services are recovered, investigate what went wrong

# 1. Save logs from failed deployment
kubectl logs deployment/vapora-backend -n $NAMESPACE \
  --timestamps=true \
  > failed-deployment-backend.log

# 2. Save pod events
kubectl describe pod $(kubectl get pods -n $NAMESPACE \
  -l app=vapora-backend --sort-by=.metadata.creationTimestamp \
  | tail -1 | awk '{print $1}') \
  -n $NAMESPACE > failed-pod-events.log

# 3. Archive ConfigMap from failed deployment (if changed)
kubectl get configmap -n $NAMESPACE vapora-config -o yaml > configmap-failed.yaml

# 4. Compare with previous good state
diff configmap-previous.yaml configmap-failed.yaml

# 5. Check what changed in code
git diff HEAD~1 HEAD provisioning/
```

### Decision: What Went Wrong

Common issues and investigation paths:

| Issue | Investigation | Action |
|-------|---------------|--------|
| **Config syntax error** | Check ConfigMap YAML | Fix YAML, test locally with yq |
| **Missing environment variable** | Check pod logs for "not found" | Update ConfigMap with value |
| **Database connection** | Check database connectivity | Verify DB URL in ConfigMap |
| **Resource exhaustion** | Check `kubectl top`, pod events | Increase resources or reduce replicas |
| **Image missing** | Check ImagePullBackOff event | Verify image pushed to registry |
| **Permission issue** | Check RBAC, logs for "forbidden" | Update service account permissions |
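For the config-related rows above, it usually pays to validate the fix locally before redeploying. A hedged sketch of that idea; the file name `vapora-config.yaml` is a placeholder for whatever you edited, and `yq` v4 syntax is assumed:

```bash
# Validate a corrected ConfigMap before applying it
# (file name below is a placeholder)

# Syntax-check the YAML with yq
yq eval '.' vapora-config.yaml > /dev/null && echo "✓ YAML parses"

# Server-side dry run catches schema and permission problems without changing anything
kubectl apply --dry-run=server -f vapora-config.yaml -n vapora

# Spot-check the value that was missing or wrong
yq eval '.data' vapora-config.yaml
```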
### Post-Rollback Review

Schedule within 24 hours:

```
DEPLOYMENT POST-MORTEM

Deployment: v1.2.1
Outcome: ❌ Rolled back

Timeline:
- Deployed: 2026-01-12 14:00 UTC
- Issue detected: 14:05 UTC
- Rollback completed: 14:10 UTC
- Impact duration: 5 minutes

Root Cause:
[describe what went wrong]

Why not caught before:
- [ ] Testing incomplete
- [ ] Config not validated
- [ ] Monitoring missed issue
- [ ] Other: [describe]

Prevention for next time:
1. [action item]
2. [action item]
3. [action item]

Owner: [person responsible for follow-up]
Deadline: [date]
```

---

## Rollback Emergency Procedures

### If Services Still Down After Rollback

```bash
# Services not recovering - emergency procedures

# 1. Check if rollback actually happened
kubectl get deployments -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}'

# If image is still the new version:
# - Rollback might have failed
# - Try manual version specification

# 2. Force rollback to specific revision
kubectl rollout undo deployment/vapora-backend -n $NAMESPACE --to-revision=41

# 3. If still failing, delete and recreate pods
kubectl delete pods -n $NAMESPACE -l app=vapora-backend
# Pods will restart via deployment

# 4. Last resort: scale down and up
kubectl scale deployment/vapora-backend --replicas=0 -n $NAMESPACE
sleep 10
kubectl scale deployment/vapora-backend --replicas=3 -n $NAMESPACE

# 5. Monitor restart
kubectl get pods -n $NAMESPACE -w
```

### If Database Corrupted

```bash
# Only do this if you have recent backups

# 1. Identify corruption
kubectl logs deployment/vapora-backend -n $NAMESPACE | grep -i "corruption\|data"

# 2. Restore from backup (requires DBA support)
# Contact database team

# 3. Verify data integrity
# Run validation queries/commands

# 4. Notify stakeholders immediately
```

### If All Else Fails

Complete infrastructure recovery:

1. Escalate to Infrastructure team
2. Activate Disaster Recovery procedures
3. Failover to backup environment if available
4. Engage senior engineers for investigation

---

## Prevention & Lessons Learned

After every rollback:

1. **Root Cause Analysis**
   - What actually went wrong?
   - Why wasn't it caught before deployment?
   - What can prevent this in the future?

2. **Testing Improvements**
   - Add test case for failure scenario
   - Update pre-deployment checklist
   - Improve staging validation

3. **Monitoring Improvements**
   - Add alert for this failure mode
   - Improve alerting sensitivity
   - Document expected vs abnormal logs

4. **Documentation**
   - Update runbooks with new learnings
   - Document this specific failure scenario
   - Share with team

---

## Rollback Checklist

```
☐ Confirmed critical issue requiring rollback
☐ Verified correct cluster and namespace
☐ Checked rollout history
☐ Executed rollback command (all services or specific)
☐ Monitored rollback progress (5-10 min wait)
☐ Verified all pods running
☐ Verified health endpoints responding
☐ Confirmed version reverted
☐ Posted communication to #deployments
☐ Notified on-call engineer: "rollback complete"
☐ Scheduled root cause analysis
☐ Saved logs for investigation
☐ Started post-mortem process
```

---

## Reference: Quick Rollback Commands

For experienced operators:

```bash
# One-liner: Rollback all services
export NS=vapora; for d in vapora-backend vapora-agents vapora-llm-router; do kubectl rollout undo deployment/$d -n $NS & done; wait

# Quick verification
kubectl get pods -n $NS && kubectl get endpoints -n $NS

# Health check
kubectl port-forward -n $NS svc/vapora-backend 8001:8001 &
sleep 2 && curl http://localhost:8001/health
```
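If these one-liners get used often, they can be folded into a small helper. A hypothetical `rollback-all.sh`, shown only as a sketch; it simply strings together the commands documented above:

```bash
#!/usr/bin/env bash
# rollback-all.sh - hypothetical helper, not part of the repo.
# Rolls back all three VAPORA services and waits for them to settle.
NS=${1:-vapora}

# Kick off the undo for each deployment in parallel
for d in vapora-backend vapora-agents vapora-llm-router; do
  kubectl rollout undo deployment/$d -n "$NS" &
done
wait

# Wait for each rollout to finish (same timeout as the runbook)
for d in vapora-backend vapora-agents vapora-llm-router; do
  kubectl rollout status deployment/$d -n "$NS" --timeout=5m
done

# Quick verification, as above
kubectl get pods -n "$NS"
kubectl get endpoints -n "$NS"
```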