Deployment Runbook

Step-by-step procedures for deploying VAPORA to staging and production environments.


Quick Start

For experienced operators:

# Validate in CI/CD
# Download artifacts
# Review dry-run
# Apply: kubectl apply -f configmap.yaml -f deployment.yaml
# Monitor: kubectl logs -f deployment/vapora-backend -n vapora
# Verify: curl http://localhost:8001/health

For complete steps, continue reading.


Before Starting

Prerequisites Completed:

  • Pre-deployment checklist completed
  • Artifacts generated and validated
  • Staging deployment verified
  • Team ready and monitoring
  • Maintenance window announced

Access Verified:

  • kubectl configured for target cluster
  • Can list nodes: kubectl get nodes
  • Can access namespace: kubectl get namespace vapora

If any prerequisite is missing: go back to the pre-deployment checklist. A combined access check is sketched below.
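As a convenience, the access checks above can be run in one pass. This is a sketch, assuming the vapora namespace; kubectl auth can-i confirms write permission:

# Sketch: verify all access prerequisites at once (assumes namespace "vapora")
set -e
kubectl get nodes >/dev/null && echo "OK: can list nodes"
kubectl get namespace vapora >/dev/null && echo "OK: namespace vapora accessible"
kubectl auth can-i update deployments -n vapora && echo "OK: can update deployments"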


Phase 1: Pre-Flight (5 minutes)

1.1 Verify Current State

# Set context
export CLUSTER=production  # or staging
export NAMESPACE=vapora

# Verify cluster access
kubectl cluster-info
kubectl get nodes

# Output should show:
# NAME     STATUS   ROLES    AGE
# node-1   Ready    worker   30d
# node-2   Ready    worker   25d

What to look for:

  • ✓ All nodes in "Ready" state
  • ✓ No "NotReady" or "Unknown" nodes
  • If issues: Don't proceed, investigate node health
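To script this check instead of reading the table, one option (a sketch that parses kubectl's default STATUS column, which may vary across versions):

# Print any node whose STATUS is not exactly "Ready"; exit non-zero if found
kubectl get nodes --no-headers | awk '$2 != "Ready" { bad++; print } END { exit bad ? 1 : 0 }'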

1.2 Check Current Deployments

# Get current deployment status
kubectl get deployments -n $NAMESPACE -o wide
kubectl get pods -n $NAMESPACE

# Output example:
# NAME                READY   UP-TO-DATE   AVAILABLE
# vapora-backend      3/3     3            3
# vapora-agents       2/2     2            2
# vapora-llm-router   2/2     2            2

What to look for:

  • ✓ All deployments showing correct replica count
  • ✓ All pods in "Running" state
  • ❌ If pods in "CrashLoopBackOff" or "Pending": Investigate before proceeding

1.3 Record Current Versions

# Get current image versions (baseline for rollback)
kubectl get deployments -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}'

# Expected output:
# vapora-backend      vapora/backend:v1.2.0
# vapora-agents       vapora/agents:v1.2.0
# vapora-llm-router   vapora/llm-router:v1.2.0

Record these for rollback: keep this output visible (or save it to a file, as sketched below)
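A suggested (not prescribed) convention is to save the baseline to a file, which step 5.3 can then diff against; the filename baseline-images.txt is hypothetical:

# Save baseline image versions to a file for later comparison (hypothetical filename)
kubectl get deployments -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}' > baseline-images.txt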

1.4 Get Current Revision Numbers

# For each deployment, get rollout history
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  echo "=== $deployment ==="
  kubectl rollout history deployment/$deployment -n $NAMESPACE | tail -5
done

# Output example:
# REVISION  CHANGE-CAUSE
# 42        Deployment rolled out
# 43        Deployment rolled out
# 44        (current)

Record the highest revision number for each - this is your rollback reference

1.5 Check Cluster Resources

# Verify cluster has capacity for new deployment
kubectl top nodes
kubectl describe nodes | grep -A 5 "Allocated resources"

# Example - check memory/CPU availability
# Requested:     8200m (41%)
# Limits:        16400m (82%)

What to look for:

  • ✓ Less than 80% resource utilization
  • ❌ If above 85%: Insufficient capacity, don't proceed
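If you prefer a scripted check, the sketch below flags nodes whose live usage exceeds 80% (kubectl top reports actual consumption rather than requests; the column positions assume the default output format):

# Flag nodes above 80% CPU or memory utilization (parses default kubectl top columns)
kubectl top nodes --no-headers | awk '{ cpu = $3; mem = $5; gsub(/%/, "", cpu); gsub(/%/, "", mem); if (cpu + 0 > 80 || mem + 0 > 80) print $1 ": CPU " cpu "%, MEM " mem "%" }'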

Phase 2: Configuration Deployment (3 minutes)

2.1 Apply ConfigMap

The ConfigMap contains all application configuration.

# First: Dry-run to verify no syntax errors
kubectl apply -f configmap.yaml --dry-run=server -n $NAMESPACE

# Should output:
# configmap/vapora-config configured (server dry run)

# Check for any warnings or errors in output
# If errors, stop and fix the YAML before proceeding

Troubleshooting:

  • "error validating": YAML syntax error - fix and retry
  • "field is immutable": Can't change certain ConfigMap fields - delete and recreate
  • "resourceQuotaExceeded": Namespace quota exceeded - contact cluster admin

2.2 Apply ConfigMap for Real

# Apply the actual ConfigMap
kubectl apply -f configmap.yaml -n $NAMESPACE

# Output:
# configmap/vapora-config configured

# Verify it was applied
kubectl get configmap -n $NAMESPACE vapora-config -o yaml | head -20

# Check for your new values in the output

Verify ConfigMap is correct:

# Extract specific values to verify
kubectl get configmap vapora-config -n $NAMESPACE -o jsonpath='{.data.vapora\.toml}' | grep "database_url" | head -1

# Should show the correct database URL

2.3 Annotate ConfigMap

Record when this config was deployed for audit trail:

kubectl annotate configmap vapora-config \
  -n $NAMESPACE \
  deployment.timestamp="$(date -u +'%Y-%m-%dT%H:%M:%SZ')" \
  deployment.commit="$(git rev-parse HEAD | cut -c1-8)" \
  deployment.branch="$(git rev-parse --abbrev-ref HEAD)" \
  --overwrite

# Verify annotation was added
kubectl get configmap vapora-config -n $NAMESPACE -o yaml | grep "deployment\."

Phase 3: Deployment Update (5 minutes)

3.1 Dry-Run Deployment

Always dry-run first to catch issues:

# Run deployment dry-run
kubectl apply -f deployment.yaml --dry-run=server -n $NAMESPACE

# Output should show what will be updated:
# deployment.apps/vapora-backend configured (server dry run)
# deployment.apps/vapora-agents configured (server dry run)
# deployment.apps/vapora-llm-router configured (server dry run)

Check for warnings. Note that a server dry-run validates manifests and quota, but runtime problems only surface during the rollout (3.3):

  • "insufficient quota": Namespace resource limits exceeded
  • "nodeAffinity": Pod can't be placed on any node
  • "ImagePullBackOff": Appears on the new pods during rollout if the image doesn't exist

3.2 Apply Deployments

# Apply the actual deployments
kubectl apply -f deployment.yaml -n $NAMESPACE

# Output:
# deployment.apps/vapora-backend configured
# deployment.apps/vapora-agents configured
# deployment.apps/vapora-llm-router configured

Verify deployments updated:

# Check that new rollout was initiated
kubectl get deployments -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.observedGeneration}{"\n"}{end}'

# Each deployment's generation should have incremented from its pre-deployment value

3.3 Monitor Rollout Progress

Watch the deployment rollout status:

# For each deployment, monitor the rollout
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  echo "Waiting for $deployment..."
  kubectl rollout status deployment/$deployment \
    -n $NAMESPACE \
    --timeout=5m
  echo "$deployment ready"
done

What to look for (per pod update):

Waiting for rollout to finish: 2 of 3 updated replicas are available...
Waiting for rollout to finish: 2 of 3 updated replicas are available...
Waiting for rollout to finish: 3 of 3 updated replicas are available...
deployment "vapora-backend" successfully rolled out

Expected time: 2-3 minutes per deployment
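Note that the loop above continues to the next deployment even if one rollout times out. A stricter variant (a sketch) aborts on the first stall so you can investigate or roll back immediately:

# Abort on the first stalled rollout instead of continuing
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  if ! kubectl rollout status deployment/$deployment -n $NAMESPACE --timeout=5m; then
    echo "ROLLOUT STALLED: $deployment - investigate or use Quick Rollback below"
    exit 1
  fi
done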

3.4 Watch Pod Updates (in separate terminal)

While rollout completes, monitor pods:

# Watch pods being updated in real-time
kubectl get pods -n $NAMESPACE -w

# Output shows updates like:
# NAME                              READY   STATUS
# vapora-backend-abc123-def45       1/1     Running
# vapora-backend-xyz789-old-pod     1/1     Running  ← old pod still running
# vapora-backend-abc123-new-pod     0/1     Pending  ← new pod starting
# vapora-backend-abc123-new-pod     0/1     ContainerCreating
# vapora-backend-abc123-new-pod     1/1     Running  ← new pod ready
# vapora-backend-xyz789-old-pod     1/1     Terminating  ← old pod being removed

What to look for:

  • ✓ New pods starting (Pending → ContainerCreating → Running)
  • ✓ Each new pod reaches Running state
  • ✓ Old pods gradually terminating
  • ❌ Pod stuck in "CrashLoopBackOff": Stop, check logs, might need rollback
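Rather than eyeballing the watch output, you can filter for problem pods directly (a sketch based on the default STATUS column):

# Show only pods in a state other than Running or Completed
kubectl get pods -n $NAMESPACE --no-headers | awk '$3 != "Running" && $3 != "Completed"'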

Phase 4: Verification (5 minutes)

4.1 Verify All Pods Running

# Check all pods are ready
kubectl get pods -n $NAMESPACE

# Expected output:
# NAME                              READY   STATUS
# vapora-backend-<hash>-1           1/1     Running
# vapora-backend-<hash>-2           1/1     Running
# vapora-backend-<hash>-3           1/1     Running
# vapora-agents-<hash>-1            1/1     Running
# vapora-agents-<hash>-2            1/1     Running
# vapora-llm-router-<hash>-1        1/1     Running
# vapora-llm-router-<hash>-2        1/1     Running

Verification:

# All pods should show READY=1/1
# All pods should show STATUS=Running
# No pods should be in Pending, CrashLoopBackOff, or Error state

# Quick check:
READY=$(kubectl get pods -n $NAMESPACE -o jsonpath='{range .items[*]}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' | grep -c "True")
TOTAL=$(kubectl get pods -n $NAMESPACE --no-headers | wc -l)

echo "Ready pods: $READY / $TOTAL"

# Should show: Ready pods: 7 / 7 (or your expected pod count)

4.2 Check Pod Logs for Errors

# Check logs from the last minute for errors
for pod in $(kubectl get pods -n $NAMESPACE -o name); do
  echo "=== $pod ==="
  kubectl logs $pod -n $NAMESPACE --since=1m 2>&1 | grep -i "error\|exception\|fatal" | head -3
done

# If errors found:
# 1. Note which pods have errors
# 2. Get full log: kubectl logs <pod> -n $NAMESPACE
# 3. Decide: can proceed or need to rollback

4.3 Verify Service Endpoints

# Check services are exposing pods correctly
kubectl get endpoints -n $NAMESPACE

# Expected output:
# NAME              ENDPOINTS
# vapora-backend    10.1.2.3:8001,10.1.2.4:8001,10.1.2.5:8001
# vapora-agents     10.1.2.6:8002,10.1.2.7:8002
# vapora-llm-router 10.1.2.8:8003,10.1.2.9:8003

Verification:

  • ✓ Each service has multiple endpoints (not empty)
  • ✓ Endpoints match running pods
  • ❌ If empty endpoints: Service can't route traffic
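A scripted version of this check (a sketch, using the service names above):

# Fail loudly if any service has no ready endpoint addresses
for svc in vapora-backend vapora-agents vapora-llm-router; do
  ips=$(kubectl get endpoints $svc -n $NAMESPACE -o jsonpath='{.subsets[*].addresses[*].ip}')
  [ -n "$ips" ] && echo "OK: $svc -> $ips" || echo "EMPTY: $svc has no endpoints"
done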

4.4 Health Check Endpoints

# Port-forward to access services locally
kubectl port-forward -n $NAMESPACE svc/vapora-backend 8001:8001 &

# Wait a moment for port-forward to establish
sleep 2

# Check backend health
curl -v http://localhost:8001/health

# Expected response:
# HTTP/1.1 200 OK
# {...healthy response...}

# Check other endpoints
curl http://localhost:8001/api/projects -H "Authorization: Bearer test-token"

Expected responses:

  • /health: 200 OK with health data
  • /api/projects: 200 OK with projects list
  • /metrics: 200 OK with Prometheus metrics
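To check all three paths in one pass through the port-forward (a sketch; it assumes a valid bearer token exported as TOKEN for the authenticated route):

# Verify expected HTTP status codes (expect 200 for each; assumes $TOKEN is set)
for path in /health /metrics; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "http://localhost:8001$path")
  echo "$path -> $code"
done
code=$(curl -s -o /dev/null -w '%{http_code}' -H "Authorization: Bearer $TOKEN" http://localhost:8001/api/projects)
echo "/api/projects -> $code"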

If connection refused:

# Check if port-forward working
ps aux | grep "port-forward"

# Restart port-forward
pkill -f "port-forward svc/vapora-backend"
kubectl port-forward -n $NAMESPACE svc/vapora-backend 8001:8001 &

4.5 Check Metrics

# Monitor resource usage of deployed pods
kubectl top pods -n $NAMESPACE

# Expected output:
# NAME                           CPU(cores)   MEMORY(Mi)
# vapora-backend-abc123          250m         512Mi
# vapora-backend-def456          280m         498Mi
# vapora-agents-ghi789           300m         256Mi

Verification:

  • ✓ CPU usage within expected range (typically 100-500m per pod)
  • ✓ Memory usage within expected range (typically 200-512Mi)
  • ❌ If any pod at 100% CPU/Memory: Performance issue, monitor closely

Phase 5: Validation (3 minutes)

5.1 Run Smoke Tests (if available)

# If your project has smoke tests:
kubectl exec -it deployment/vapora-backend -n $NAMESPACE -- \
  sh -c "curl http://localhost:8001/health && echo 'Health check passed'"

# Or run from your local machine:
./scripts/smoke-tests.sh --endpoint http://localhost:8001

5.2 Check for Errors in Logs

# Look at logs since the deployment started
# (note: "kubectl logs deployment/<name>" samples a single pod; loop over pods for full coverage)
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  echo "=== Checking $deployment ==="
  kubectl logs deployment/$deployment -n $NAMESPACE --since=5m 2>&1 | \
    grep -i "error\|exception\|failed" | wc -l
done

# If any errors found:
# 1. Get detailed logs
# 2. Determine if critical or expected errors
# 3. Decide to proceed or rollback

5.3 Compare Against Baseline Metrics

Compare current metrics with pre-deployment baseline:

# Current metrics
echo "=== Current ==="
kubectl top nodes
kubectl top pods -n $NAMESPACE | head -5

# Compare with recorded baseline
# If similar: ✓ Good
# If significantly higher: ⚠️ Watch for issues
# If error rates high: ❌ Consider rollback
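Alongside the metrics comparison, if you saved the image baseline in step 1.3 (the hypothetical baseline-images.txt), a diff makes any drift in deployed images obvious:

# Compare deployed images against the recorded baseline (file from the 1.3 sketch)
kubectl get deployments -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}' | diff baseline-images.txt -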

5.4 Check for Recent Events/Warnings

# Look for any cluster events in the last 5 minutes
kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' | tail -20

# Watch for:
# - Warning: FailedScheduling (pod won't fit on any node)
# - Warning: Failed / ErrImagePull (image doesn't exist or can't be pulled)
# - Warning: BackOff / ImagePullBackOff (repeated image pull failures)
# - Warning: FailedCreate with "exceeded quota" (namespace resource limits)

Phase 6: Communication (1 minute)

6.1 Post Deployment Complete

Post message to #deployments:

🚀 DEPLOYMENT COMPLETE

Deployment: VAPORA Core Services
Mode: Enterprise
Duration: 8 minutes
Status: ✅ Successful

Deployed:
- vapora-backend (v1.2.1)
- vapora-agents (v1.2.1)
- vapora-llm-router (v1.2.1)

Verification:
✓ All pods running
✓ Health checks passing
✓ No error logs
✓ Metrics normal

Next steps:
- Monitor #alerts for any issues
- Check dashboards every 5 minutes for 30 min
- Review logs if any issues detected

Questions? @on-call-engineer

6.2 Update Status Page

If using public status page:

UPDATE: Maintenance Complete

VAPORA services have been successfully updated
and are now operating normally.

All systems monitoring nominal.

6.3 Notify Stakeholders

  • Send message to support team: "Deployment complete, all systems normal"
  • Post in #product: "Backend updated to v1.2.1, new features available"
  • Update ticket/issue with deployment completion time and status

Phase 7: Post-Deployment Monitoring (Ongoing)

7.1 First 5 Minutes: Watch Closely

# Keep watching for any issues (run each in its own terminal)
watch kubectl get pods -n $NAMESPACE
watch kubectl top pods -n $NAMESPACE

# Logs already stream continuously with -f; don't wrap this one in watch:
kubectl logs -f deployment/vapora-backend -n $NAMESPACE

Watch for:

  • Pod restarts (RESTARTS counter increasing)
  • Increased error logs
  • Resource usage spikes
  • Service unreachability

7.2 First 30 Minutes: Monitor Dashboard

Keep dashboard visible showing:

  • Pod health status
  • CPU/Memory usage per pod
  • Request latency (if available)
  • Error rate
  • Recent logs

Alert triggers for immediate action:

  • Any pod restarting repeatedly (see the restart-count sketch below)
  • Error rate above 5%
  • Latency above 2x normal
  • Pod stuck in Pending state
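A quick restart-count check without a dashboard (a sketch; it reads each pod's first container):

# Print restart counts per pod; a rising number indicates a crash loop
kubectl get pods -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}'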

7.3 First 2 Hours: Regular Checks

# Every 10 minutes:
# 1. kubectl get pods -n $NAMESPACE
# 2. kubectl top pods -n $NAMESPACE
# 3. kubectl logs deployment/vapora-backend -n $NAMESPACE --since=10m | grep -i error
# 4. Check alerts dashboard

If issues detected, proceed to Incident Response Runbook

7.4 After 2 Hours: Normal Monitoring

Return to standard monitoring procedures. Deployment complete.


If Issues Detected: Quick Rollback

If problems occur at any point:

# IMMEDIATE: Rollback (1 minute)
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  kubectl rollout undo deployment/$deployment -n $NAMESPACE &
done
wait

# Verify rollback completing:
kubectl rollout status deployment/vapora-backend -n $NAMESPACE --timeout=5m

# Confirm services recovering (re-establish the 4.4 port-forward first if needed):
curl http://localhost:8001/health

# Post to #deployments:
# 🔙 ROLLBACK EXECUTED
# Issue detected, services rolled back to previous version
# All pods should be recovering now

See Rollback Runbook for detailed procedures.
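kubectl rollout undo targets the immediately previous revision by default; if several rollouts have occurred, you can return to the exact revision recorded in step 1.4 (the revision number below is only an example):

# Roll back to a specific recorded revision (substitute your own number from 1.4)
kubectl rollout undo deployment/vapora-backend -n $NAMESPACE --to-revision=43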


Common Issues & Solutions

Issue: Pod stuck in ImagePullBackOff

Cause: Docker image doesn't exist or can't be downloaded

Solution:

# Check pod events
kubectl describe pod <pod-name> -n $NAMESPACE

# Check image registry access
kubectl get secret -n $NAMESPACE

# Either:
1. Verify image name is correct in deployment.yaml
2. Push missing image to registry
3. Rollback deployment

Issue: Pod stuck in CrashLoopBackOff

Cause: Application crashing on startup

Solution:

# Get pod logs
kubectl logs <pod-name> -n $NAMESPACE --previous

# Fix typically requires config change:
1. Fix ConfigMap issue
2. Re-apply ConfigMap: kubectl apply -f configmap.yaml
3. Trigger pod restart: kubectl rollout restart deployment/<name>

# Or rollback if unclear

Issue: Pod in Pending state

Cause: No node has enough free CPU or memory to schedule the pod

Solution:

# Describe pod to see why
kubectl describe pod <pod-name> -n $NAMESPACE

# Check for "Insufficient cpu", "Insufficient memory"
kubectl top nodes

# Either:
1. Scale down other workloads
2. Increase node count
3. Reduce resource requirements in deployment.yaml and redeploy

Issue: Service endpoints empty

Cause: Pods not passing health checks

Solution:

# Check pod logs for errors
kubectl logs <pod-name> -n $NAMESPACE

# Check pod readiness probe failures
kubectl describe pod <pod-name> -n $NAMESPACE | grep -A 5 "Readiness"

# Fix configuration or rollback

Completion Checklist

  • All pods running and ready
  • Health endpoints responding
  • No error logs
  • Metrics normal
  • Deployment communication posted
  • Status page updated
  • Stakeholders notified
  • Monitoring enabled for next 2 hours
  • Ticket/issue updated with completion details

Next Steps