Deployment Runbook

Step-by-step procedures for deploying VAPORA to staging and production environments.


Quick Start

For experienced operators:

# Validate in CI/CD
# Download artifacts
# Review dry-run
# Apply: kubectl apply -f configmap.yaml -f deployment.yaml
# Monitor: kubectl logs -f deployment/vapora-backend -n vapora
# Verify: curl http://localhost:8001/health

For complete steps, continue reading.


Before Starting

Prerequisites Completed:

  • Pre-deployment checklist completed
  • Artifacts generated and validated
  • Staging deployment verified
  • Team ready and monitoring
  • Maintenance window announced

Access Verified:

  • kubectl configured for target cluster
  • Can list nodes: kubectl get nodes
  • Can access namespace: kubectl get namespace vapora

If any prerequisite is missing: go back to the pre-deployment checklist. A combined access check is sketched below.
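As a convenience, the access checks above can be run in one pass. This is a sketch, assuming the vapora namespace; kubectl auth can-i confirms write permission:

# Sketch: verify all access prerequisites at once (assumes namespace "vapora")
set -e
kubectl get nodes >/dev/null && echo "OK: can list nodes"
kubectl get namespace vapora >/dev/null && echo "OK: namespace vapora accessible"
kubectl auth can-i update deployments -n vapora && echo "OK: can update deployments"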


Phase 1: Pre-Flight (5 minutes)

1.1 Verify Current State

# Set context
export CLUSTER=production  # or staging
export NAMESPACE=vapora

# Verify cluster access
kubectl cluster-info
kubectl get nodes

# Output should show:
# NAME     STATUS   ROLES    AGE
# node-1   Ready    worker   30d
# node-2   Ready    worker   25d

What to look for:

  • ✓ All nodes in "Ready" state
  • ✓ No "NotReady" or "Unknown" nodes
  • If issues: Don't proceed, investigate node health
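To script this check instead of reading the table, one option (a sketch that parses kubectl's default STATUS column, which may vary across versions):

# Print any node whose STATUS is not exactly "Ready"; exit non-zero if found
kubectl get nodes --no-headers | awk '$2 != "Ready" { bad++; print } END { exit bad ? 1 : 0 }'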

1.2 Check Current Deployments

# Get current deployment status
kubectl get deployments -n $NAMESPACE -o wide
kubectl get pods -n $NAMESPACE

# Output example:
# NAME                READY   UP-TO-DATE   AVAILABLE
# vapora-backend      3/3     3            3
# vapora-agents       2/2     2            2
# vapora-llm-router   2/2     2            2

What to look for:

  • ✓ All deployments showing correct replica count
  • ✓ All pods in "Running" state
  • ❌ If pods in "CrashLoopBackOff" or "Pending": Investigate before proceeding

1.3 Record Current Versions

# Get current image versions (baseline for rollback)
kubectl get deployments -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}'

# Expected output:
# vapora-backend      vapora/backend:v1.2.0
# vapora-agents       vapora/agents:v1.2.0
# vapora-llm-router   vapora/llm-router:v1.2.0

Record these for rollback: keep this output visible (or save it to a file, as sketched below)
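A suggested (not prescribed) convention is to save the baseline to a file, which step 5.3 can then diff against; the filename baseline-images.txt is hypothetical:

# Save baseline image versions to a file for later comparison (hypothetical filename)
kubectl get deployments -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}' > baseline-images.txt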

1.4 Get Current Revision Numbers

# For each deployment, get rollout history
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  echo "=== $deployment ==="
  kubectl rollout history deployment/$deployment -n $NAMESPACE | tail -5
done

# Output example:
# REVISION  CHANGE-CAUSE
# 42        Deployment rolled out
# 43        Deployment rolled out
# 44        (current)

Record the highest revision number for each - this is your rollback reference

1.5 Check Cluster Resources

# Verify cluster has capacity for new deployment
kubectl top nodes
kubectl describe nodes | grep -A 5 "Allocated resources"

# Example - check memory/CPU availability
# Requested:     8200m (41%)
# Limits:        16400m (82%)

What to look for:

  • ✓ Less than 80% resource utilization
  • ❌ If above 85%: Insufficient capacity, don't proceed
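If you prefer a scripted check, the sketch below flags nodes whose live usage exceeds 80% (kubectl top reports actual consumption rather than requests; the column positions assume the default output format):

# Flag nodes above 80% CPU or memory utilization (parses default kubectl top columns)
kubectl top nodes --no-headers | awk '{ cpu = $3; mem = $5; gsub(/%/, "", cpu); gsub(/%/, "", mem); if (cpu + 0 > 80 || mem + 0 > 80) print $1 ": CPU " cpu "%, MEM " mem "%" }'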

Phase 2: Configuration Deployment (3 minutes)

2.1 Apply ConfigMap

The ConfigMap contains all application configuration.

# First: Dry-run to verify no syntax errors
kubectl apply -f configmap.yaml --dry-run=server -n $NAMESPACE

# Should output:
# configmap/vapora-config configured (server dry run)

# Check for any warnings or errors in output
# If errors, stop and fix the YAML before proceeding

Troubleshooting:

  • "error validating": YAML syntax error - fix and retry
  • "field is immutable": Can't change certain ConfigMap fields - delete and recreate
  • "resourceQuotaExceeded": Namespace quota exceeded - contact cluster admin

2.2 Apply ConfigMap for Real

# Apply the actual ConfigMap
kubectl apply -f configmap.yaml -n $NAMESPACE

# Output:
# configmap/vapora-config configured

# Verify it was applied
kubectl get configmap -n $NAMESPACE vapora-config -o yaml | head -20

# Check for your new values in the output

Verify ConfigMap is correct:

# Extract specific values to verify
kubectl get configmap vapora-config -n $NAMESPACE -o jsonpath='{.data.vapora\.toml}' | grep "database_url" | head -1

# Should show the correct database URL

2.3 Annotate ConfigMap

Record when this config was deployed for audit trail:

kubectl annotate configmap vapora-config \
  -n $NAMESPACE \
  deployment.timestamp="$(date -u +'%Y-%m-%dT%H:%M:%SZ')" \
  deployment.commit="$(git rev-parse HEAD | cut -c1-8)" \
  deployment.branch="$(git rev-parse --abbrev-ref HEAD)" \
  --overwrite

# Verify annotation was added
kubectl get configmap vapora-config -n $NAMESPACE -o yaml | grep "deployment\."

Phase 3: Deployment Update (5 minutes)

3.1 Dry-Run Deployment

Always dry-run first to catch issues:

# Run deployment dry-run
kubectl apply -f deployment.yaml --dry-run=server -n $NAMESPACE

# Output should show what will be updated:
# deployment.apps/vapora-backend configured (server dry run)
# deployment.apps/vapora-agents configured (server dry run)
# deployment.apps/vapora-llm-router configured (server dry run)

Check for warnings. Note that a server dry-run validates manifests and quota, but runtime problems only surface during the rollout (3.3):

  • "insufficient quota": Namespace resource limits exceeded
  • "nodeAffinity": Pod can't be placed on any node
  • "ImagePullBackOff": Appears on the new pods during rollout if the image doesn't exist

3.2 Apply Deployments

# Apply the actual deployments
kubectl apply -f deployment.yaml -n $NAMESPACE

# Output:
# deployment.apps/vapora-backend configured
# deployment.apps/vapora-agents configured
# deployment.apps/vapora-llm-router configured

Verify deployments updated:

# Check that new rollout was initiated
kubectl get deployments -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.observedGeneration}{"\n"}{end}'

# Each deployment's generation should have incremented from its pre-deployment value

3.3 Monitor Rollout Progress

Watch the deployment rollout status:

# For each deployment, monitor the rollout
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  echo "Waiting for $deployment..."
  kubectl rollout status deployment/$deployment \
    -n $NAMESPACE \
    --timeout=5m
  echo "$deployment ready"
done

What to look for (per pod update):

Waiting for rollout to finish: 2 of 3 updated replicas are available...
Waiting for rollout to finish: 2 of 3 updated replicas are available...
Waiting for rollout to finish: 3 of 3 updated replicas are available...
deployment "vapora-backend" successfully rolled out

Expected time: 2-3 minutes per deployment
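Note that the loop above continues to the next deployment even if one rollout times out. A stricter variant (a sketch) aborts on the first stall so you can investigate or roll back immediately:

# Abort on the first stalled rollout instead of continuing
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  if ! kubectl rollout status deployment/$deployment -n $NAMESPACE --timeout=5m; then
    echo "ROLLOUT STALLED: $deployment - investigate or use Quick Rollback below"
    exit 1
  fi
done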

3.4 Watch Pod Updates (in separate terminal)

While rollout completes, monitor pods:

# Watch pods being updated in real-time
kubectl get pods -n $NAMESPACE -w

# Output shows updates like:
# NAME                              READY   STATUS
# vapora-backend-abc123-def45       1/1     Running
# vapora-backend-xyz789-old-pod     1/1     Running  ← old pod still running
# vapora-backend-abc123-new-pod     0/1     Pending  ← new pod starting
# vapora-backend-abc123-new-pod     0/1     ContainerCreating
# vapora-backend-abc123-new-pod     1/1     Running  ← new pod ready
# vapora-backend-xyz789-old-pod     1/1     Terminating  ← old pod being removed

What to look for:

  • ✓ New pods starting (Pending → ContainerCreating → Running)
  • ✓ Each new pod reaches Running state
  • ✓ Old pods gradually terminating
  • ❌ Pod stuck in "CrashLoopBackOff": Stop, check logs, might need rollback
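Rather than eyeballing the watch output, you can filter for problem pods directly (a sketch based on the default STATUS column):

# Show only pods in a state other than Running or Completed
kubectl get pods -n $NAMESPACE --no-headers | awk '$3 != "Running" && $3 != "Completed"'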

Phase 4: Verification (5 minutes)

4.1 Verify All Pods Running

# Check all pods are ready
kubectl get pods -n $NAMESPACE

# Expected output:
# NAME                              READY   STATUS
# vapora-backend-<hash>-1           1/1     Running
# vapora-backend-<hash>-2           1/1     Running
# vapora-backend-<hash>-3           1/1     Running
# vapora-agents-<hash>-1            1/1     Running
# vapora-agents-<hash>-2            1/1     Running
# vapora-llm-router-<hash>-1        1/1     Running
# vapora-llm-router-<hash>-2        1/1     Running

Verification:

# All pods should show READY=1/1
# All pods should show STATUS=Running
# No pods should be in Pending, CrashLoopBackOff, or Error state

# Quick check:
READY=$(kubectl get pods -n $NAMESPACE -o jsonpath='{range .items[*]}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' | grep -c "True")
TOTAL=$(kubectl get pods -n $NAMESPACE --no-headers | wc -l)

echo "Ready pods: $READY / $TOTAL"

# Should show: Ready pods: 7 / 7 (or your expected pod count)

4.2 Check Pod Logs for Errors

# Check logs from the last minute for errors
for pod in $(kubectl get pods -n $NAMESPACE -o name); do
  echo "=== $pod ==="
  kubectl logs $pod -n $NAMESPACE --since=1m 2>&1 | grep -i "error\|exception\|fatal" | head -3
done

# If errors found:
# 1. Note which pods have errors
# 2. Get full log: kubectl logs <pod> -n $NAMESPACE
# 3. Decide: can proceed or need to rollback

4.3 Verify Service Endpoints

# Check services are exposing pods correctly
kubectl get endpoints -n $NAMESPACE

# Expected output:
# NAME              ENDPOINTS
# vapora-backend    10.1.2.3:8001,10.1.2.4:8001,10.1.2.5:8001
# vapora-agents     10.1.2.6:8002,10.1.2.7:8002
# vapora-llm-router 10.1.2.8:8003,10.1.2.9:8003

Verification:

  • ✓ Each service has multiple endpoints (not empty)
  • ✓ Endpoints match running pods
  • ❌ If empty endpoints: Service can't route traffic
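A scripted version of this check (a sketch, using the service names above):

# Fail loudly if any service has no ready endpoint addresses
for svc in vapora-backend vapora-agents vapora-llm-router; do
  ips=$(kubectl get endpoints $svc -n $NAMESPACE -o jsonpath='{.subsets[*].addresses[*].ip}')
  [ -n "$ips" ] && echo "OK: $svc -> $ips" || echo "EMPTY: $svc has no endpoints"
done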

4.4 Health Check Endpoints

# Port-forward to access services locally
kubectl port-forward -n $NAMESPACE svc/vapora-backend 8001:8001 &

# Wait a moment for port-forward to establish
sleep 2

# Check backend health
curl -v http://localhost:8001/health

# Expected response:
# HTTP/1.1 200 OK
# {...healthy response...}

# Check other endpoints
curl http://localhost:8001/api/projects -H "Authorization: Bearer test-token"

Expected responses:

  • /health: 200 OK with health data
  • /api/projects: 200 OK with projects list
  • /metrics: 200 OK with Prometheus metrics
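To check all three paths in one pass through the port-forward (a sketch; it assumes a valid bearer token exported as TOKEN for the authenticated route):

# Verify expected HTTP status codes (expect 200 for each; assumes $TOKEN is set)
for path in /health /metrics; do
  code=$(curl -s -o /dev/null -w '%{http_code}' "http://localhost:8001$path")
  echo "$path -> $code"
done
code=$(curl -s -o /dev/null -w '%{http_code}' -H "Authorization: Bearer $TOKEN" http://localhost:8001/api/projects)
echo "/api/projects -> $code"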

If connection refused:

# Check if port-forward working
ps aux | grep "port-forward"

# Restart port-forward
pkill -f "port-forward svc/vapora-backend"
kubectl port-forward -n $NAMESPACE svc/vapora-backend 8001:8001 &

4.5 Check Metrics

# Monitor resource usage of deployed pods
kubectl top pods -n $NAMESPACE

# Expected output:
# NAME                           CPU(cores)   MEMORY(Mi)
# vapora-backend-abc123          250m         512Mi
# vapora-backend-def456          280m         498Mi
# vapora-agents-ghi789           300m         256Mi

Verification:

  • ✓ CPU usage within expected range (typically 100-500m per pod)
  • ✓ Memory usage within expected range (typically 200-512Mi)
  • ❌ If any pod at 100% CPU/Memory: Performance issue, monitor closely

Phase 5: Validation (3 minutes)

5.1 Run Smoke Tests (if available)

# If your project has smoke tests:
kubectl exec -it deployment/vapora-backend -n $NAMESPACE -- \
  sh -c "curl http://localhost:8001/health && echo 'Health check passed'"

# Or run from your local machine:
./scripts/smoke-tests.sh --endpoint http://localhost:8001

5.2 Check for Errors in Logs

# Look at logs since the deployment started
# (note: "kubectl logs deployment/<name>" samples a single pod; loop over pods for full coverage)
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  echo "=== Checking $deployment ==="
  kubectl logs deployment/$deployment -n $NAMESPACE --since=5m 2>&1 | \
    grep -i "error\|exception\|failed" | wc -l
done

# If any errors found:
# 1. Get detailed logs
# 2. Determine if critical or expected errors
# 3. Decide to proceed or rollback

5.3 Compare Against Baseline Metrics

Compare current metrics with pre-deployment baseline:

# Current metrics
echo "=== Current ==="
kubectl top nodes
kubectl top pods -n $NAMESPACE | head -5

# Compare with recorded baseline
# If similar: ✓ Good
# If significantly higher: ⚠️ Watch for issues
# If error rates high: ❌ Consider rollback
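Alongside the metrics comparison, if you saved the image baseline in step 1.3 (the hypothetical baseline-images.txt), a diff makes any drift in deployed images obvious:

# Compare deployed images against the recorded baseline (file from the 1.3 sketch)
kubectl get deployments -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}' | diff baseline-images.txt -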

5.4 Check for Recent Events/Warnings

# Look for any cluster events in the last 5 minutes
kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' | tail -20

# Watch for:
# - Warning: FailedScheduling (pod won't fit on any node)
# - Warning: Failed / ErrImagePull (image doesn't exist or can't be pulled)
# - Warning: BackOff / ImagePullBackOff (repeated image pull failures)
# - Warning: FailedCreate with "exceeded quota" (namespace resource limits)

Phase 6: Communication (1 minute)

6.1 Post Deployment Complete

Post message to #deployments:

🚀 DEPLOYMENT COMPLETE

Deployment: VAPORA Core Services
Mode: Enterprise
Duration: 8 minutes
Status: ✅ Successful

Deployed:
- vapora-backend (v1.2.1)
- vapora-agents (v1.2.1)
- vapora-llm-router (v1.2.1)

Verification:
✓ All pods running
✓ Health checks passing
✓ No error logs
✓ Metrics normal

Next steps:
- Monitor #alerts for any issues
- Check dashboards every 5 minutes for 30 min
- Review logs if any issues detected

Questions? @on-call-engineer

6.2 Update Status Page

If using public status page:

UPDATE: Maintenance Complete

VAPORA services have been successfully updated
and are now operating normally.

All systems monitoring nominal.

6.3 Notify Stakeholders

  • Send message to support team: "Deployment complete, all systems normal"
  • Post in #product: "Backend updated to v1.2.1, new features available"
  • Update ticket/issue with deployment completion time and status

Phase 7: Post-Deployment Monitoring (Ongoing)

7.1 First 5 Minutes: Watch Closely

# Keep watching for any issues (run each in its own terminal)
watch kubectl get pods -n $NAMESPACE
watch kubectl top pods -n $NAMESPACE

# Logs already stream continuously with -f; don't wrap this one in watch:
kubectl logs -f deployment/vapora-backend -n $NAMESPACE

Watch for:

  • Pod restarts (RESTARTS counter increasing)
  • Increased error logs
  • Resource usage spikes
  • Service unreachability

7.2 First 30 Minutes: Monitor Dashboard

Keep dashboard visible showing:

  • Pod health status
  • CPU/Memory usage per pod
  • Request latency (if available)
  • Error rate
  • Recent logs

Alert triggers for immediate action:

  • Any pod restarting repeatedly (see the restart-count sketch below)
  • Error rate above 5%
  • Latency above 2x normal
  • Pod stuck in Pending state
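A quick restart-count check without a dashboard (a sketch; it reads each pod's first container):

# Print restart counts per pod; a rising number indicates a crash loop
kubectl get pods -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}'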

7.3 First 2 Hours: Regular Checks

# Every 10 minutes:
# 1. kubectl get pods -n $NAMESPACE
# 2. kubectl top pods -n $NAMESPACE
# 3. kubectl logs deployment/vapora-backend -n $NAMESPACE --since=10m | grep -i error
# 4. Check alerts dashboard

If issues detected, proceed to Incident Response Runbook

7.4 After 2 Hours: Normal Monitoring

Return to standard monitoring procedures. Deployment complete.


If Issues Detected: Quick Rollback

If problems occur at any point:

# IMMEDIATE: Rollback (1 minute)
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  kubectl rollout undo deployment/$deployment -n $NAMESPACE &
done
wait

# Verify rollback completing:
kubectl rollout status deployment/vapora-backend -n $NAMESPACE --timeout=5m

# Confirm services recovering (re-establish the 4.4 port-forward first if needed):
curl http://localhost:8001/health

# Post to #deployments:
# 🔙 ROLLBACK EXECUTED
# Issue detected, services rolled back to previous version
# All pods should be recovering now

See Rollback Runbook for detailed procedures.
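kubectl rollout undo targets the immediately previous revision by default; if several rollouts have occurred, you can return to the exact revision recorded in step 1.4 (the revision number below is only an example):

# Roll back to a specific recorded revision (substitute your own number from 1.4)
kubectl rollout undo deployment/vapora-backend -n $NAMESPACE --to-revision=43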


Common Issues & Solutions

Issue: Pod stuck in ImagePullBackOff

Cause: Docker image doesn't exist or can't be downloaded

Solution:

# Check pod events
kubectl describe pod <pod-name> -n $NAMESPACE

# Check image registry access
kubectl get secret -n $NAMESPACE

# Either:
1. Verify image name is correct in deployment.yaml
2. Push missing image to registry
3. Rollback deployment

Issue: Pod stuck in CrashLoopBackOff

Cause: Application crashing on startup

Solution:

# Get pod logs
kubectl logs <pod-name> -n $NAMESPACE --previous

# Fix typically requires config change:
1. Fix ConfigMap issue
2. Re-apply ConfigMap: kubectl apply -f configmap.yaml
3. Trigger pod restart: kubectl rollout restart deployment/<name>

# Or rollback if unclear

Issue: Pod in Pending state

Cause: No node has enough free CPU or memory to schedule the pod

Solution:

# Describe pod to see why
kubectl describe pod <pod-name> -n $NAMESPACE

# Check for "Insufficient cpu", "Insufficient memory"
kubectl top nodes

# Either:
1. Scale down other workloads
2. Increase node count
3. Reduce resource requirements in deployment.yaml and redeploy

Issue: Service endpoints empty

Cause: Pods not passing health checks

Solution:

# Check pod logs for errors
kubectl logs <pod-name> -n $NAMESPACE

# Check pod readiness probe failures
kubectl describe pod <pod-name> -n $NAMESPACE | grep -A 5 "Readiness"

# Fix configuration or rollback

Completion Checklist

  • All pods running and ready
  • Health endpoints responding
  • No error logs
  • Metrics normal
  • Deployment communication posted
  • Status page updated
  • Stakeholders notified
  • Monitoring enabled for next 2 hours
  • Ticket/issue updated with completion details

Next Steps