# Deployment Runbook
Step-by-step procedures for deploying VAPORA to staging and production environments.
---
## Quick Start
For experienced operators:
```bash
# Validate in CI/CD
# Download artifacts
# Review dry-run
# Apply: kubectl apply -f configmap.yaml -f deployment.yaml -n vapora
# Monitor: kubectl logs -f deployment/vapora-backend -n vapora
# Verify: curl http://localhost:8001/health
```
For complete steps, continue reading.
---
## Before Starting
**Prerequisites Completed**:
- [ ] Pre-deployment checklist completed
- [ ] Artifacts generated and validated
- [ ] Staging deployment verified
- [ ] Team ready and monitoring
- [ ] Maintenance window announced
**Access Verified**:
- [ ] kubectl configured for target cluster
- [ ] Can list nodes: `kubectl get nodes`
- [ ] Can access namespace: `kubectl get namespace vapora`
**If any prerequisite missing**: Go back to pre-deployment checklist
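These access checks can be scripted as a single gate; a minimal sketch, assuming the `vapora` namespace and that your kubeconfig already points at the target cluster:
```bash
# Quick access gate (sketch) - all three commands should succeed before proceeding
kubectl get nodes > /dev/null && echo "cluster reachable"
kubectl get namespace vapora > /dev/null && echo "namespace exists"
kubectl auth can-i update deployments -n vapora && echo "can update deployments"
```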
---
## Phase 1: Pre-Flight (5 minutes)
### 1.1 Verify Current State
```bash
# Set context
export CLUSTER=production # or staging
export NAMESPACE=vapora
# Verify cluster access
kubectl cluster-info
kubectl get nodes
# Output should show:
# NAME STATUS ROLES AGE
# node-1 Ready worker 30d
# node-2 Ready worker 25d
```
**What to look for:**
- ✓ All nodes in "Ready" state
- ✓ No "NotReady" or "Unknown" nodes
- If issues: Don't proceed, investigate node health
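This check can also be scripted; a sketch that treats anything other than a plain `Ready` status (including cordoned nodes) as a reason to stop:
```bash
# Exit non-zero if any node is not simply "Ready"
kubectl get nodes --no-headers | awk '$2 != "Ready" { print "NOT READY:", $1, $2; bad=1 } END { exit bad }'
```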
### 1.2 Check Current Deployments
```bash
# Get current deployment status
kubectl get deployments -n $NAMESPACE -o wide
kubectl get pods -n $NAMESPACE
# Output example:
# NAME READY UP-TO-DATE AVAILABLE
# vapora-backend 3/3 3 3
# vapora-agents 2/2 2 2
# vapora-llm-router 2/2 2 2
```
**What to look for:**
- ✓ All deployments showing correct replica count
- ✓ All pods in "Running" state
- ❌ If pods in "CrashLoopBackOff" or "Pending": Investigate before proceeding
### 1.3 Record Current Versions
```bash
# Get current image versions (baseline for rollback)
kubectl get deployments -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}'
# Expected output:
# vapora-backend vapora/backend:v1.2.0
# vapora-agents vapora/agents:v1.2.0
# vapora-llm-router vapora/llm-router:v1.2.0
```
**Record these for rollback**: Keep this output visible
### 1.4 Get Current Revision Numbers
```bash
# For each deployment, get rollout history
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  echo "=== $deployment ==="
  kubectl rollout history deployment/$deployment -n $NAMESPACE | tail -5
done
# Output example:
# REVISION CHANGE-CAUSE
# 42 Deployment rolled out
# 43 Deployment rolled out
# 44 (current)
```
**Record the highest revision number for each** - this is your rollback reference
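If you prefer a file over keeping terminal output visible, a small sketch that saves both the image versions and the revision history as one baseline (the filename pattern is just an example):
```bash
# Capture the rollback baseline to a timestamped file
BASELINE="deploy-baseline-$(date -u +'%Y%m%dT%H%M%SZ').txt"
{
  echo "=== images ==="
  kubectl get deployments -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}'
  echo "=== revisions ==="
  for deployment in vapora-backend vapora-agents vapora-llm-router; do
    echo "--- $deployment ---"
    kubectl rollout history deployment/$deployment -n $NAMESPACE | tail -5
  done
} | tee "$BASELINE"
echo "Baseline saved to $BASELINE"
```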
### 1.5 Check Cluster Resources
```bash
# Verify cluster has capacity for new deployment
kubectl top nodes
kubectl describe nodes | grep -A 5 "Allocated resources"
# Example - check memory/CPU availability
# Requested: 8200m (41%)
# Limits: 16400m (82%)
```
**What to look for:**
- ✓ Less than 80% resource utilization
- ❌ If above 85%: Insufficient capacity, don't proceed
---
## Phase 2: Configuration Deployment (3 minutes)
### 2.1 Apply ConfigMap
The ConfigMap contains all application configuration.
```bash
# First: Dry-run to verify no syntax errors
kubectl apply -f configmap.yaml --dry-run=server -n $NAMESPACE
# Should output:
# configmap/vapora-config configured (server dry run)
# Check for any warnings or errors in output
# If errors, stop and fix the YAML before proceeding
```
**Troubleshooting**:
- "error validating": YAML syntax error - fix and retry
- "field is immutable": Can't change certain ConfigMap fields - delete and recreate
- "resourceQuotaExceeded": Namespace quota exceeded - contact cluster admin
### 2.2 Apply ConfigMap for Real
```bash
# Apply the actual ConfigMap
kubectl apply -f configmap.yaml -n $NAMESPACE
# Output:
# configmap/vapora-config configured
# Verify it was applied
kubectl get configmap -n $NAMESPACE vapora-config -o yaml | head -20
# Check for your new values in the output
```
**Verify ConfigMap is correct**:
```bash
# Extract specific values to verify
kubectl get configmap vapora-config -n $NAMESPACE -o jsonpath='{.data.vapora\.toml}' | grep "database_url" | head -1
# Should show the correct database URL
```
### 2.3 Annotate ConfigMap
Record when this config was deployed for audit trail:
```bash
kubectl annotate configmap vapora-config \
  -n $NAMESPACE \
  deployment.timestamp="$(date -u +'%Y-%m-%dT%H:%M:%SZ')" \
  deployment.commit="$(git rev-parse HEAD | cut -c1-8)" \
  deployment.branch="$(git rev-parse --abbrev-ref HEAD)" \
  --overwrite
# Verify annotation was added
kubectl get configmap vapora-config -n $NAMESPACE -o yaml | grep "deployment\."
```
---
## Phase 3: Deployment Update (5 minutes)
### 3.1 Dry-Run Deployment
Always dry-run first to catch issues:
```bash
# Run deployment dry-run
kubectl apply -f deployment.yaml --dry-run=server -n $NAMESPACE
# Output should show what will be updated:
# deployment.apps/vapora-backend configured (server dry run)
# deployment.apps/vapora-agents configured (server dry run)
# deployment.apps/vapora-llm-router configured (server dry run)
```
**Check for warnings** (some of these only surface once pods are actually scheduled, so also watch events after the real apply):
- "ImagePullBackOff": Docker image doesn't exist or can't be pulled
- "insufficient quota": Resource limits exceeded
- "nodeAffinity": Pod can't be placed on any node
### 3.2 Apply Deployments
```bash
# Apply the actual deployments
kubectl apply -f deployment.yaml -n $NAMESPACE
# Output:
# deployment.apps/vapora-backend configured
# deployment.apps/vapora-agents configured
# deployment.apps/vapora-llm-router configured
```
**Verify deployments updated**:
```bash
# Check that new rollout was initiated
kubectl get deployments -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.observedGeneration}{"\n"}{end}'
# Each deployment's observedGeneration should be higher than it was before the apply
```
### 3.3 Monitor Rollout Progress
Watch the deployment rollout status:
```bash
# For each deployment, monitor the rollout
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  echo "Waiting for $deployment..."
  kubectl rollout status deployment/$deployment \
    -n $NAMESPACE \
    --timeout=5m
  echo "$deployment ready"
done
```
**What to look for** (per pod update):
```
Waiting for rollout to finish: 2 of 3 updated replicas are available...
Waiting for rollout to finish: 2 of 3 updated replicas are available...
Waiting for rollout to finish: 3 of 3 updated replicas are available...
deployment "vapora-backend" successfully rolled out
```
**Expected time: 2-3 minutes per deployment**
### 3.4 Watch Pod Updates (in separate terminal)
While rollout completes, monitor pods:
```bash
# Watch pods being updated in real-time
kubectl get pods -n $NAMESPACE -w
# Output shows updates like:
# NAME READY STATUS
# vapora-backend-abc123-def45 1/1 Running
# vapora-backend-xyz789-old-pod 1/1 Running ← old pod still running
# vapora-backend-abc123-new-pod 0/1 Pending ← new pod starting
# vapora-backend-abc123-new-pod 0/1 ContainerCreating
# vapora-backend-abc123-new-pod 1/1 Running ← new pod ready
# vapora-backend-xyz789-old-pod 1/1 Terminating ← old pod being removed
```
**What to look for:**
- ✓ New pods starting (Pending → ContainerCreating → Running)
- ✓ Each new pod reaches Running state
- ✓ Old pods gradually terminating
- ❌ Pod stuck in "CrashLoopBackOff": Stop, check logs, might need rollback
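If a pod looks stuck, a quick triage sketch using only standard kubectl (no extra tooling assumed):
```bash
# List pods that are not in the Running phase
kubectl get pods -n $NAMESPACE --field-selector=status.phase!=Running
# Show the waiting reason per pod (e.g. CrashLoopBackOff, ImagePullBackOff), if any
kubectl get pods -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}'
```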
---
## Phase 4: Verification (5 minutes)
### 4.1 Verify All Pods Running
```bash
# Check all pods are ready
kubectl get pods -n $NAMESPACE
# Expected output:
# NAME READY STATUS
# vapora-backend-<hash>-1 1/1 Running
# vapora-backend-<hash>-2 1/1 Running
# vapora-backend-<hash>-3 1/1 Running
# vapora-agents-<hash>-1 1/1 Running
# vapora-agents-<hash>-2 1/1 Running
# vapora-llm-router-<hash>-1 1/1 Running
# vapora-llm-router-<hash>-2 1/1 Running
```
**Verification**:
```bash
# All pods should show READY=1/1
# All pods should show STATUS=Running
# No pods should be in Pending, CrashLoopBackOff, or Error state
# Quick check:
READY=$(kubectl get pods -n $NAMESPACE -o jsonpath='{range .items[*]}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' | grep -c "True")
TOTAL=$(kubectl get pods -n $NAMESPACE --no-headers | wc -l)
echo "Ready pods: $READY / $TOTAL"
# Should show: Ready pods: 7 / 7 (or your expected pod count)
```
### 4.2 Check Pod Logs for Errors
```bash
# Check logs from the last minute for errors
for pod in $(kubectl get pods -n $NAMESPACE -o name); do
  echo "=== $pod ==="
  kubectl logs $pod -n $NAMESPACE --since=1m 2>&1 | grep -i "error\|exception\|fatal" | head -3
done
# If errors found:
# 1. Note which pods have errors
# 2. Get full log: kubectl logs <pod> -n $NAMESPACE
# 3. Decide: can proceed or need to rollback
```
### 4.3 Verify Service Endpoints
```bash
# Check services are exposing pods correctly
kubectl get endpoints -n $NAMESPACE
# Expected output:
# NAME ENDPOINTS
# vapora-backend 10.1.2.3:8001,10.1.2.4:8001,10.1.2.5:8001
# vapora-agents 10.1.2.6:8002,10.1.2.7:8002
# vapora-llm-router 10.1.2.8:8003,10.1.2.9:8003
```
**Verification**:
- ✓ Each service has multiple endpoints (not empty)
- ✓ Endpoints match running pods
- ❌ If empty endpoints: Service can't route traffic
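An empty service can be spotted quickly; a sketch that prints each service with its ready addresses:
```bash
# A service with an empty second column has no ready endpoints and cannot route traffic
kubectl get endpoints -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.subsets[*].addresses[*].ip}{"\n"}{end}'
```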
### 4.4 Health Check Endpoints
```bash
# Port-forward to access services locally
kubectl port-forward -n $NAMESPACE svc/vapora-backend 8001:8001 &
# Wait a moment for port-forward to establish
sleep 2
# Check backend health
curl -v http://localhost:8001/health
# Expected response:
# HTTP/1.1 200 OK
# {...healthy response...}
# Check other endpoints
curl http://localhost:8001/api/projects -H "Authorization: Bearer test-token"
```
**Expected responses**:
- `/health`: 200 OK with health data
- `/api/projects`: 200 OK with projects list
- `/metrics`: 200 OK with Prometheus metrics
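The metrics endpoint can be spot-checked through the same port-forward; a sketch, assuming the backend serves its Prometheus metrics on the port listed above:
```bash
# First few lines of the Prometheus metrics output
curl -s http://localhost:8001/metrics | head -10
```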
**If connection refused**:
```bash
# Check if port-forward working
ps aux | grep "port-forward"
# Restart port-forward
pkill -f "port-forward.*svc/vapora-backend"
kubectl port-forward -n $NAMESPACE svc/vapora-backend 8001:8001 &
```
### 4.5 Check Metrics
```bash
# Monitor resource usage of deployed pods
kubectl top pods -n $NAMESPACE
# Expected output:
# NAME CPU(cores) MEMORY(Mi)
# vapora-backend-abc123 250m 512Mi
# vapora-backend-def456 280m 498Mi
# vapora-agents-ghi789 300m 256Mi
```
**Verification**:
- ✓ CPU usage within expected range (typically 100-500m per pod)
- ✓ Memory usage within expected range (typically 200-512Mi)
- ❌ If any pod at 100% CPU/Memory: Performance issue, monitor closely
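A rough scripted version of this check; a sketch with hypothetical thresholds (tune them to your workloads):
```bash
# Flag pods above ~800m CPU or ~900Mi memory (placeholder thresholds)
kubectl top pods -n $NAMESPACE --no-headers | awk '
  { cpu = $2; mem = $3; sub(/m$/, "", cpu); sub(/Mi$/, "", mem) }
  cpu + 0 > 800 || mem + 0 > 900 { print "HIGH USAGE:", $1, $2, $3 }'
```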
---
## Phase 5: Validation (3 minutes)
### 5.1 Run Smoke Tests (if available)
```bash
# If your project has smoke tests:
kubectl exec -it deployment/vapora-backend -n $NAMESPACE -- \
sh -c "curl http://localhost:8001/health && echo 'Health check passed'"
# Or run from your local machine:
./scripts/smoke-tests.sh --endpoint http://localhost:8001
```
### 5.2 Check for Errors in Logs
```bash
# Look at logs from all pods since deployment started
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  echo "=== Checking $deployment ==="
  kubectl logs deployment/$deployment -n $NAMESPACE --since=5m 2>&1 | \
    grep -i "error\|exception\|failed" | wc -l
done
# If any errors found:
# 1. Get detailed logs
# 2. Determine if critical or expected errors
# 3. Decide to proceed or rollback
```
### 5.3 Compare Against Baseline Metrics
Compare current metrics with pre-deployment baseline:
```bash
# Current metrics
echo "=== Current ==="
kubectl top nodes
kubectl top pods -n $NAMESPACE | head -5
# Compare with recorded baseline
# If similar: ✓ Good
# If significantly higher: ⚠️ Watch for issues
# If error rates high: ❌ Consider rollback
```
### 5.4 Check for Recent Events/Warnings
```bash
# Look for any cluster events in the last 5 minutes
kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' | tail -20
# Watch for:
# - Warning: FailedScheduling (pod won't fit on any node)
# - Warning: Failed / ErrImagePull (image doesn't exist)
# - Warning: ImagePullBackOff (can't download image)
# - Warning: FailedCreate (resource quota exceeded)
```
---
## Phase 6: Communication (1 minute)
### 6.1 Post Deployment Complete
```
Post message to #deployments:
🚀 DEPLOYMENT COMPLETE
Deployment: VAPORA Core Services
Mode: Enterprise
Duration: 8 minutes
Status: ✅ Successful
Deployed:
- vapora-backend (v1.2.1)
- vapora-agents (v1.2.1)
- vapora-llm-router (v1.2.1)
Verification:
✓ All pods running
✓ Health checks passing
✓ No error logs
✓ Metrics normal
Next steps:
- Monitor #alerts for any issues
- Check dashboards every 5 minutes for 30 min
- Review logs if any issues detected
Questions? @on-call-engineer
```
### 6.2 Update Status Page
```
If using public status page:
UPDATE: Maintenance Complete
VAPORA services have been successfully updated
and are now operating normally.
All systems monitoring nominal.
```
### 6.3 Notify Stakeholders
- [ ] Send message to support team: "Deployment complete, all systems normal"
- [ ] Post in #product: "Backend updated to v1.2.1, new features available"
- [ ] Update ticket/issue with deployment completion time and status
---
## Phase 7: Post-Deployment Monitoring (Ongoing)
### 7.1 First 5 Minutes: Watch Closely
```bash
# In separate terminals, keep an eye on pods, resource usage, and logs
watch kubectl get pods -n $NAMESPACE
watch kubectl top pods -n $NAMESPACE
# logs -f already streams, so it does not need watch
kubectl logs -f deployment/vapora-backend -n $NAMESPACE
```
**Watch for:**
- Pod restarts (RESTARTS counter increasing)
- Increased error logs
- Resource usage spikes
- Service unreachability
### 7.2 First 30 Minutes: Monitor Dashboard
Keep dashboard visible showing:
- Pod health status
- CPU/Memory usage per pod
- Request latency (if available)
- Error rate
- Recent logs
**Alert triggers for immediate action** (a CLI spot-check sketch follows the list):
- Any pod restarting repeatedly
- Error rate above 5%
- Latency above 2x normal
- Pod stuck in Pending state
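The restart and Pending triggers can be spot-checked from the command line; a small sketch:
```bash
# Restart counts per pod (a climbing count means trouble)
kubectl get pods -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].restartCount}{"\n"}{end}'
# Pods stuck in Pending
kubectl get pods -n $NAMESPACE --field-selector=status.phase=Pending
```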
### 7.3 First 2 Hours: Regular Checks
```bash
# Every 10 minutes:
kubectl get pods -n $NAMESPACE
kubectl top pods -n $NAMESPACE
# Scan recent logs for errors
kubectl logs deployment/vapora-backend -n $NAMESPACE --since=10m | grep -i error
# Then review the alerts dashboard
```
**If issues detected**, proceed to Incident Response Runbook
### 7.4 After 2 Hours: Normal Monitoring
Return to standard monitoring procedures. Deployment complete.
---
## If Issues Detected: Quick Rollback
If problems occur at any point:
```bash
# IMMEDIATE: Rollback (1 minute)
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  kubectl rollout undo deployment/$deployment -n $NAMESPACE &
done
wait
# Verify rollback completing:
kubectl rollout status deployment/vapora-backend -n $NAMESPACE --timeout=5m
# Confirm services are recovering (via the port-forward from Phase 4; re-establish it if needed):
curl http://localhost:8001/health
# Post to #deployments:
# 🔙 ROLLBACK EXECUTED
# Issue detected, services rolled back to previous version
# All pods should be recovering now
```
See [Rollback Runbook](./rollback-runbook.md) for detailed procedures.
---
## Common Issues & Solutions
### Issue: Pod stuck in ImagePullBackOff
**Cause**: Docker image doesn't exist or can't be downloaded
**Solution**:
```bash
# Check pod events
kubectl describe pod <pod-name> -n $NAMESPACE
# Check image registry access
kubectl get secret -n $NAMESPACE
# Either:
# 1. Verify the image name is correct in deployment.yaml
# 2. Push the missing image to the registry
# 3. Rollback the deployment
```
### Issue: Pod stuck in CrashLoopBackOff
**Cause**: Application crashing on startup
**Solution**:
```bash
# Get pod logs
kubectl logs <pod-name> -n $NAMESPACE --previous
# Fix typically requires config change:
# 1. Fix the ConfigMap issue
# 2. Re-apply the ConfigMap: kubectl apply -f configmap.yaml -n $NAMESPACE
# 3. Trigger a pod restart: kubectl rollout restart deployment/<name> -n $NAMESPACE
# Or rollback if unclear
```
### Issue: Pod in Pending state
**Cause**: No node has enough free CPU or memory to schedule the pod
**Solution**:
```bash
# Describe pod to see why
kubectl describe pod <pod-name> -n $NAMESPACE
# Check for "Insufficient cpu", "Insufficient memory"
kubectl top nodes
# Either:
# 1. Scale down other workloads
# 2. Increase the node count
# 3. Reduce resource requirements in deployment.yaml and redeploy
```
### Issue: Service endpoints empty
**Cause**: Pods are not passing their readiness probes, so they are never added to the service
**Solution**:
```bash
# Check pod logs for errors
kubectl logs <pod-name> -n $NAMESPACE
# Check pod readiness probe failures
kubectl describe pod <pod-name> -n $NAMESPACE | grep -A 5 "Readiness"
# Fix configuration or rollback
```
---
## Completion Checklist
- [ ] All pods running and ready
- [ ] Health endpoints responding
- [ ] No error logs
- [ ] Metrics normal
- [ ] Deployment communication posted
- [ ] Status page updated
- [ ] Stakeholders notified
- [ ] Monitoring enabled for next 2 hours
- [ ] Ticket/issue updated with completion details
---
## Next Steps
- Continue monitoring per [Monitoring Runbook](./monitoring-runbook.md)
- If issues arise, follow [Incident Response Runbook](./incident-response-runbook.md)
- Document lessons learned
- Update runbooks if procedures need improvement