Deployment Runbook
Step-by-step procedures for deploying VAPORA to staging and production environments.
Quick Start
For experienced operators:
# Validate in CI/CD
# Download artifacts
# Review dry-run
# Apply: kubectl apply -f configmap.yaml -f deployment.yaml
# Monitor: kubectl logs -f deployment/vapora-backend -n vapora
# Verify: curl http://localhost:8001/health
For complete steps, continue reading.
Before Starting
✅ Prerequisites Completed:
- Pre-deployment checklist completed
- Artifacts generated and validated
- Staging deployment verified
- Team ready and monitoring
- Maintenance window announced
✅ Access Verified:
- kubectl configured for target cluster
- Can list nodes: kubectl get nodes
- Can access namespace: kubectl get namespace vapora
❌ If any prerequisite missing: Go back to pre-deployment checklist
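Optional: confirm both access checks in one pass. A minimal sketch, assuming the vapora namespace from above:
# One-pass access check - exits non-zero on the first failure
kubectl get nodes >/dev/null || { echo "FAIL: cannot list nodes"; exit 1; }
kubectl get namespace vapora >/dev/null || { echo "FAIL: cannot access namespace vapora"; exit 1; }
echo "Access checks passed"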
Phase 1: Pre-Flight (5 minutes)
1.1 Verify Current State
# Set context
export CLUSTER=production # or staging
export NAMESPACE=vapora
# Verify cluster access
kubectl cluster-info
kubectl get nodes
# Output should show:
# NAME STATUS ROLES AGE
# node-1 Ready worker 30d
# node-2 Ready worker 25d
What to look for:
- ✓ All nodes in "Ready" state
- ✓ No "NotReady" or "Unknown" nodes
- ❌ If issues: Don't proceed; investigate node health first
1.2 Check Current Deployments
# Get current deployment status
kubectl get deployments -n $NAMESPACE -o wide
kubectl get pods -n $NAMESPACE
# Output example:
# NAME READY UP-TO-DATE AVAILABLE
# vapora-backend 3/3 3 3
# vapora-agents 2/2 2 2
# vapora-llm-router 2/2 2 2
What to look for:
- ✓ All deployments showing correct replica count
- ✓ All pods in "Running" state
- ❌ If pods in "CrashLoopBackOff" or "Pending": Investigate before proceeding
1.3 Record Current Versions
# Get current image versions (baseline for rollback)
kubectl get deployments -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}'
# Expected output:
# vapora-backend vapora/backend:v1.2.0
# vapora-agents vapora/agents:v1.2.0
# vapora-llm-router vapora/llm-router:v1.2.0
Record these for rollback: Keep this output visible
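Optionally save this baseline to a file so it survives a closed terminal; a small sketch (baseline-images.txt is an illustrative name):
# Save current images for rollback reference
kubectl get deployments -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}' | tee baseline-images.txt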
1.4 Get Current Revision Numbers
# For each deployment, get rollout history
for deployment in vapora-backend vapora-agents vapora-llm-router; do
echo "=== $deployment ==="
kubectl rollout history deployment/$deployment -n $NAMESPACE | tail -5
done
# Output example:
# REVISION CHANGE-CAUSE
# 42 Deployment rolled out
# 43 Deployment rolled out
# 44 (current)
Record the highest revision number for each - this is your rollback reference
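A sketch for capturing those revision numbers to a file as well (baseline-revisions.txt is illustrative; the awk keeps the last numeric revision in the history output):
# Save the current (highest) revision per deployment
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  rev=$(kubectl rollout history deployment/$deployment -n $NAMESPACE | awk '$1 ~ /^[0-9]+$/ {r=$1} END {print r}')
  echo "$deployment $rev"
done | tee baseline-revisions.txt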
1.5 Check Cluster Resources
# Verify cluster has capacity for new deployment
kubectl top nodes
kubectl describe nodes | grep -A 5 "Allocated resources"
# Example - check memory/CPU availability
# Requested: 8200m (41%)
# Limits: 16400m (82%)
What to look for:
- ✓ Less than 80% resource utilization
- ❌ If above 85%: Insufficient capacity, don't proceed
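If you want a scripted gate on the 80% rule, a rough sketch parsing kubectl top nodes (CPU% and MEMORY% are columns 3 and 5 of its output):
# Flag any node above 80% CPU or memory utilization
kubectl top nodes --no-headers | awk '{cpu=$3; mem=$5; gsub(/%/,"",cpu); gsub(/%/,"",mem); if (cpu+0 > 80 || mem+0 > 80) print $1" over 80%: cpu="cpu"% mem="mem"%"}'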
Phase 2: Configuration Deployment (3 minutes)
2.1 Apply ConfigMap
The ConfigMap contains all application configuration.
# First: Dry-run to verify no syntax errors
kubectl apply -f configmap.yaml --dry-run=server -n $NAMESPACE
# Should output:
# configmap/vapora-config configured (server dry run)
# Check for any warnings or errors in output
# If errors, stop and fix the YAML before proceeding
Troubleshooting:
- "error validating": YAML syntax error - fix and retry
- "field is immutable": Can't change certain ConfigMap fields - delete and recreate
- "resourceQuotaExceeded": Namespace quota exceeded - contact cluster admin
2.2 Apply ConfigMap for Real
# Apply the actual ConfigMap
kubectl apply -f configmap.yaml -n $NAMESPACE
# Output:
# configmap/vapora-config configured
# Verify it was applied
kubectl get configmap -n $NAMESPACE vapora-config -o yaml | head -20
# Check for your new values in the output
Verify ConfigMap is correct:
# Extract specific values to verify
kubectl get configmap vapora-config -n $NAMESPACE -o jsonpath='{.data.vapora\.toml}' | grep "database_url" | head -1
# Should show the correct database URL
2.3 Annotate ConfigMap
Record when this config was deployed for audit trail:
kubectl annotate configmap vapora-config \
-n $NAMESPACE \
deployment.timestamp="$(date -u +'%Y-%m-%dT%H:%M:%SZ')" \
deployment.commit="$(git rev-parse HEAD | cut -c1-8)" \
deployment.branch="$(git rev-parse --abbrev-ref HEAD)" \
--overwrite
# Verify annotation was added
kubectl get configmap vapora-config -n $NAMESPACE -o yaml | grep "deployment\."
Phase 3: Deployment Update (5 minutes)
3.1 Dry-Run Deployment
Always dry-run first to catch issues:
# Run deployment dry-run
kubectl apply -f deployment.yaml --dry-run=server -n $NAMESPACE
# Output should show what will be updated:
# deployment.apps/vapora-backend configured (server dry run)
# deployment.apps/vapora-agents configured (server dry run)
# deployment.apps/vapora-llm-router configured (server dry run)
Check for issues (note: image pulls and scheduling aren't exercised by a dry run; most of these surface as pod events after the real apply):
- ErrImagePull / ImagePullBackOff: Docker image doesn't exist or can't be pulled
- "insufficient quota": Resource limits exceeded (the server dry run can catch this)
- nodeAffinity conflict: Pod can't be placed on any node
3.2 Apply Deployments
# Apply the actual deployments
kubectl apply -f deployment.yaml -n $NAMESPACE
# Output:
# deployment.apps/vapora-backend configured
# deployment.apps/vapora-agents configured
# deployment.apps/vapora-llm-router configured
Verify deployments updated:
# Check that new rollout was initiated
kubectl get deployments -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.observedGeneration}{"\n"}{end}'
# Each deployment's generation should have incremented from its pre-deployment value
3.3 Monitor Rollout Progress
Watch the deployment rollout status:
# For each deployment, monitor the rollout
for deployment in vapora-backend vapora-agents vapora-llm-router; do
echo "Waiting for $deployment..."
kubectl rollout status deployment/$deployment \
-n $NAMESPACE \
--timeout=5m
echo "$deployment ready"
done
What to look for (per pod update):
Waiting for rollout to finish: 2 of 3 updated replicas are available...
Waiting for rollout to finish: 2 of 3 updated replicas are available...
Waiting for rollout to finish: 3 of 3 updated replicas are available...
deployment "vapora-backend" successfully rolled out
Expected time: 2-3 minutes per deployment
3.4 Watch Pod Updates (in separate terminal)
While rollout completes, monitor pods:
# Watch pods being updated in real-time
kubectl get pods -n $NAMESPACE -w
# Output shows updates like:
# NAME READY STATUS
# vapora-backend-abc123-def45 1/1 Running
# vapora-backend-xyz789-old-pod 1/1 Running ← old pod still running
# vapora-backend-abc123-new-pod 0/1 Pending ← new pod starting
# vapora-backend-abc123-new-pod 0/1 ContainerCreating
# vapora-backend-abc123-new-pod 1/1 Running ← new pod ready
# vapora-backend-xyz789-old-pod 1/1 Terminating ← old pod being removed
What to look for:
- ✓ New pods starting (Pending → ContainerCreating → Running)
- ✓ Each new pod reaches Running state
- ✓ Old pods gradually terminating
- ❌ Pod stuck in "CrashLoopBackOff": Stop, check logs, might need rollback
Phase 4: Verification (5 minutes)
4.1 Verify All Pods Running
# Check all pods are ready
kubectl get pods -n $NAMESPACE
# Expected output:
# NAME READY STATUS
# vapora-backend-<hash>-1 1/1 Running
# vapora-backend-<hash>-2 1/1 Running
# vapora-backend-<hash>-3 1/1 Running
# vapora-agents-<hash>-1 1/1 Running
# vapora-agents-<hash>-2 1/1 Running
# vapora-llm-router-<hash>-1 1/1 Running
# vapora-llm-router-<hash>-2 1/1 Running
Verification:
# All pods should show READY=1/1
# All pods should show STATUS=Running
# No pods should be in Pending, CrashLoopBackOff, or Error state
# Quick check:
READY=$(kubectl get pods -n $NAMESPACE -o jsonpath='{range .items[*]}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' | grep -c "True")
TOTAL=$(kubectl get pods -n $NAMESPACE --no-headers | wc -l)
echo "Ready pods: $READY / $TOTAL"
# Should show: Ready pods: 7 / 7 (or your expected pod count)
4.2 Check Pod Logs for Errors
# Check logs from the last minute for errors
for pod in $(kubectl get pods -n $NAMESPACE -o name); do
echo "=== $pod ==="
kubectl logs $pod -n $NAMESPACE --since=1m 2>&1 | grep -i "error\|exception\|fatal" | head -3
done
# If errors found:
# 1. Note which pods have errors
# 2. Get full log: kubectl logs <pod> -n $NAMESPACE
# 3. Decide: can proceed or need to rollback
4.3 Verify Service Endpoints
# Check services are exposing pods correctly
kubectl get endpoints -n $NAMESPACE
# Expected output:
# NAME ENDPOINTS
# vapora-backend 10.1.2.3:8001,10.1.2.4:8001,10.1.2.5:8001
# vapora-agents 10.1.2.6:8002,10.1.2.7:8002
# vapora-llm-router 10.1.2.8:8003,10.1.2.9:8003
Verification:
- ✓ Each service has multiple endpoints (not empty)
- ✓ Endpoints match running pods
- ❌ If empty endpoints: Service can't route traffic
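To catch empty endpoints in one pass, a small sketch:
# Warn on any service with no endpoints
for svc in vapora-backend vapora-agents vapora-llm-router; do
  eps=$(kubectl get endpoints $svc -n $NAMESPACE -o jsonpath='{.subsets[*].addresses[*].ip}')
  if [ -z "$eps" ]; then echo "WARNING: $svc has no endpoints"; else echo "$svc: $eps"; fi
done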
4.4 Health Check Endpoints
# Port-forward to access services locally
kubectl port-forward -n $NAMESPACE svc/vapora-backend 8001:8001 &
# Wait a moment for port-forward to establish
sleep 2
# Check backend health
curl -v http://localhost:8001/health
# Expected response:
# HTTP/1.1 200 OK
# {...healthy response...}
# Check other endpoints
curl http://localhost:8001/api/projects -H "Authorization: Bearer test-token"
Expected responses:
- /health: 200 OK with health data
- /api/projects: 200 OK with projects list
- /metrics: 200 OK with Prometheus metrics
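To check all three in one pass, a sketch using curl's status-code output (assumes the port-forward above is still running; test-token is a placeholder):
# Print the HTTP status code for each endpoint
for path in /health /api/projects /metrics; do
  code=$(curl -s -o /dev/null -w "%{http_code}" -H "Authorization: Bearer test-token" http://localhost:8001$path)
  echo "$path -> $code"
done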
If connection refused:
# Check if port-forward working
ps aux | grep "port-forward"
# Restart port-forward
pkill -f "port-forward.*vapora-backend"
kubectl port-forward -n $NAMESPACE svc/vapora-backend 8001:8001 &
4.5 Check Metrics
# Monitor resource usage of deployed pods
kubectl top pods -n $NAMESPACE
# Expected output:
# NAME CPU(cores) MEMORY(Mi)
# vapora-backend-abc123 250m 512Mi
# vapora-backend-def456 280m 498Mi
# vapora-agents-ghi789 300m 256Mi
Verification:
- ✓ CPU usage within expected range (typically 100-500m per pod)
- ✓ Memory usage within expected range (typically 200-512Mi)
- ❌ If any pod at 100% CPU/Memory: Performance issue, monitor closely
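For a scripted version of this check, a rough sketch flagging pods above illustrative thresholds of 500m CPU / 512Mi memory (assumes the millicore/Mi units shown above; adjust to your baseline):
# Flag pods above the expected resource range
kubectl top pods -n $NAMESPACE --no-headers | awk '{cpu=$2; mem=$3; gsub(/m$/,"",cpu); gsub(/Mi$/,"",mem); if (cpu+0 > 500 || mem+0 > 512) print $1" above threshold: "$2" "$3}'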
Phase 5: Validation (3 minutes)
5.1 Run Smoke Tests (if available)
# If your project has smoke tests:
kubectl exec -it deployment/vapora-backend -n $NAMESPACE -- \
sh -c "curl http://localhost:8001/health && echo 'Health check passed'"
# Or run from your local machine:
./scripts/smoke-tests.sh --endpoint http://localhost:8001
5.2 Check for Errors in Logs
# Look at logs from all pods since deployment started
for deployment in vapora-backend vapora-agents vapora-llm-router; do
echo "=== Checking $deployment ==="
kubectl logs deployment/$deployment -n $NAMESPACE --since=5m 2>&1 | \
grep -i "error\|exception\|failed" | wc -l
done
# If any errors found:
# 1. Get detailed logs
# 2. Determine if critical or expected errors
# 3. Decide to proceed or rollback
5.3 Compare Against Baseline Metrics
Compare current metrics with pre-deployment baseline:
# Current metrics
echo "=== Current ==="
kubectl top nodes
kubectl top pods -n $NAMESPACE | head -5
# Compare with recorded baseline
# If similar: ✓ Good
# If significantly higher: ⚠️ Watch for issues
# If error rates high: ❌ Consider rollback
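If you snapshotted usage before deploying (e.g. kubectl top pods -n $NAMESPACE > baseline-top.txt; the file name is illustrative), printing both makes the comparison concrete:
# Show baseline and current usage together (pod hashes will differ; compare per-service totals)
echo "=== Baseline (pre-deployment) ==="; cat baseline-top.txt
echo "=== Current ==="; kubectl top pods -n $NAMESPACE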
5.4 Check for Recent Events/Warnings
# Look for any cluster events in the last 5 minutes
kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' | tail -20
# Watch for:
# - Warning: FailedScheduling (pod won't fit on any node)
# - Warning: ErrImagePull (image doesn't exist or registry unreachable)
# - Warning: ImagePullBackOff (repeated pull failures, backing off)
# - Warning: FailedCreate (resource quota exceeded)
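To narrow the list to warnings only, kubectl supports a field selector:
# Show only Warning events, most recent last
kubectl get events -n $NAMESPACE --field-selector type=Warning --sort-by='.lastTimestamp' | tail -20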
Phase 6: Communication (1 minute)
6.1 Post Deployment Complete
Post message to #deployments:
🚀 DEPLOYMENT COMPLETE
Deployment: VAPORA Core Services
Mode: Enterprise
Duration: 8 minutes
Status: ✅ Successful
Deployed:
- vapora-backend (v1.2.1)
- vapora-agents (v1.2.1)
- vapora-llm-router (v1.2.1)
Verification:
✓ All pods running
✓ Health checks passing
✓ No error logs
✓ Metrics normal
Next steps:
- Monitor #alerts for any issues
- Check dashboards every 5 minutes for 30 min
- Review logs if any issues detected
Questions? @on-call-engineer
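If your team posts these messages via an incoming webhook rather than by hand, a curl sketch (SLACK_WEBHOOK_URL is an assumed environment variable; adapt to your chat tool):
# Post the completion message to the deployments channel via webhook
curl -s -X POST -H 'Content-type: application/json' \
  --data '{"text":"🚀 DEPLOYMENT COMPLETE: VAPORA v1.2.1 - all pods running, health checks passing"}' \
  "$SLACK_WEBHOOK_URL"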
6.2 Update Status Page
If using public status page:
UPDATE: Maintenance Complete
VAPORA services have been successfully updated
and are now operating normally.
All systems monitoring nominal.
6.3 Notify Stakeholders
- Send message to support team: "Deployment complete, all systems normal"
- Post in #product: "Backend updated to v1.2.1, new features available"
- Update ticket/issue with deployment completion time and status
Phase 7: Post-Deployment Monitoring (Ongoing)
7.1 First 5 Minutes: Watch Closely
# Keep watching for any issues
watch kubectl get pods -n $NAMESPACE
watch kubectl top pods -n $NAMESPACE
kubectl logs -f deployment/vapora-backend -n $NAMESPACE  # logs -f streams on its own; no watch needed
Watch for:
- Pod restarts (RESTARTS counter increasing)
- Increased error logs
- Resource usage spikes
- Service unreachability
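To spot restarts without eyeballing the watch output, a sketch listing any pod whose restart counter is non-zero:
# List pods that have restarted (first container's counter)
kubectl get pods -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}' | awk '$2 > 0'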
7.2 First 30 Minutes: Monitor Dashboard
Keep dashboard visible showing:
- Pod health status
- CPU/Memory usage per pod
- Request latency (if available)
- Error rate
- Recent logs
Alert triggers for immediate action:
- Any pod restarting repeatedly
- Error rate above 5%
- Latency above 2x normal
- Pod stuck in Pending state
7.3 First 2 Hours: Regular Checks
# Every 10 minutes:
1. kubectl get pods -n $NAMESPACE
2. kubectl top pods -n $NAMESPACE
3. kubectl logs deployment/vapora-backend -n $NAMESPACE --since=10m | grep -i error
4. Check alerts dashboard
If issues detected, proceed to Incident Response Runbook
7.4 After 2 Hours: Normal Monitoring
Return to standard monitoring procedures. Deployment complete.
If Issues Detected: Quick Rollback
If problems occur at any point:
# IMMEDIATE: Rollback (1 minute)
for deployment in vapora-backend vapora-agents vapora-llm-router; do
kubectl rollout undo deployment/$deployment -n $NAMESPACE &
done
wait
# Verify rollback completing:
kubectl rollout status deployment/vapora-backend -n $NAMESPACE --timeout=5m
# Confirm services recovering (re-establish the port-forward first if it isn't running):
curl http://localhost:8001/health
# Post to #deployments:
# 🔙 ROLLBACK EXECUTED
# Issue detected, services rolled back to previous version
# All pods should be recovering now
See Rollback Runbook for detailed procedures.
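If you need to return to a specific revision rather than the immediately previous one, rollout undo accepts --to-revision (43 here is illustrative; use the number recorded in step 1.4):
# Roll back one deployment to a specific recorded revision
kubectl rollout undo deployment/vapora-backend -n $NAMESPACE --to-revision=43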
Common Issues & Solutions
Issue: Pod stuck in ImagePullBackOff
Cause: Docker image doesn't exist or can't be downloaded
Solution:
# Check pod events
kubectl describe pod <pod-name> -n $NAMESPACE
# Check image registry access
kubectl get secret -n $NAMESPACE
# Either:
1. Verify image name is correct in deployment.yaml
2. Push missing image to registry
3. Rollback deployment
Issue: Pod stuck in CrashLoopBackOff
Cause: Application crashing on startup
Solution:
# Get pod logs
kubectl logs <pod-name> -n $NAMESPACE --previous
# Fix typically requires config change:
1. Fix ConfigMap issue
2. Re-apply ConfigMap: kubectl apply -f configmap.yaml
3. Trigger pod restart: kubectl rollout restart deployment/<name>
# Or rollback if unclear
Issue: Pod in Pending state
Cause: Node doesn't have capacity or resources
Solution:
# Describe pod to see why
kubectl describe pod <pod-name> -n $NAMESPACE
# Check for "Insufficient cpu", "Insufficient memory"
kubectl top nodes
# Either:
1. Scale down other workloads
2. Increase node count
3. Reduce resource requirements in deployment.yaml and redeploy
Issue: Service endpoints empty
Cause: Pods not passing health checks
Solution:
# Check pod logs for errors
kubectl logs <pod-name> -n $NAMESPACE
# Check pod readiness probe failures
kubectl describe pod <pod-name> -n $NAMESPACE | grep -A 5 "Readiness"
# Fix configuration or rollback
Completion Checklist
- All pods running and ready
- Health endpoints responding
- No error logs
- Metrics normal
- Deployment communication posted
- Status page updated
- Stakeholders notified
- Monitoring enabled for next 2 hours
- Ticket/issue updated with completion details
Next Steps
- Continue monitoring per Monitoring Runbook
- If issues arise, follow Incident Response Runbook
- Document lessons learned
- Update runbooks if procedures need improvement