# Deployment Runbook
Step-by-step procedures for deploying VAPORA to staging and production environments.
---
## Quick Start
For experienced operators:
```bash
# Validate in CI/CD
# Download artifacts
# Review dry-run
# Apply: kubectl apply -f configmap.yaml -f deployment.yaml -n vapora
# Monitor: kubectl logs -f deployment/vapora-backend -n vapora
# Verify: curl http://localhost:8001/health
```
For complete steps, continue reading.
---
## Before Starting
**Prerequisites Completed**:
- [ ] Pre-deployment checklist completed
- [ ] Artifacts generated and validated
- [ ] Staging deployment verified
- [ ] Team ready and monitoring
- [ ] Maintenance window announced
**Access Verified**:
- [ ] kubectl configured for target cluster
- [ ] Can list nodes: `kubectl get nodes`
- [ ] Can access namespace: `kubectl get namespace vapora`
**If any prerequisite missing**: Go back to pre-deployment checklist
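These access checks can be scripted as a single gate; a minimal sketch, assuming the `vapora` namespace and that your kubeconfig already points at the target cluster:
```bash
# Quick access gate (sketch) - all three commands should succeed before proceeding
kubectl get nodes > /dev/null && echo "cluster reachable"
kubectl get namespace vapora > /dev/null && echo "namespace exists"
kubectl auth can-i update deployments -n vapora && echo "can update deployments"
```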
---
## Phase 1: Pre-Flight (5 minutes)
### 1.1 Verify Current State
```bash
# Set context
export CLUSTER=production # or staging
export NAMESPACE=vapora
# Verify cluster access
kubectl cluster-info
kubectl get nodes
# Output should show:
# NAME STATUS ROLES AGE
# node-1 Ready worker 30d
# node-2 Ready worker 25d
```
**What to look for:**
- ✓ All nodes in "Ready" state
- ✓ No "NotReady" or "Unknown" nodes
- If issues: Don't proceed, investigate node health
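This check can also be scripted; a sketch that treats anything other than a plain `Ready` status (including cordoned nodes) as a reason to stop:
```bash
# Exit non-zero if any node is not simply "Ready"
kubectl get nodes --no-headers | awk '$2 != "Ready" { print "NOT READY:", $1, $2; bad=1 } END { exit bad }'
```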
### 1.2 Check Current Deployments
```bash
# Get current deployment status
kubectl get deployments -n $NAMESPACE -o wide
kubectl get pods -n $NAMESPACE
# Output example:
# NAME READY UP-TO-DATE AVAILABLE
# vapora-backend 3/3 3 3
# vapora-agents 2/2 2 2
# vapora-llm-router 2/2 2 2
```
**What to look for:**
- ✓ All deployments showing correct replica count
- ✓ All pods in "Running" state
- ❌ If pods in "CrashLoopBackOff" or "Pending": Investigate before proceeding
### 1.3 Record Current Versions
```bash
# Get current image versions (baseline for rollback)
kubectl get deployments -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}'
# Expected output:
# vapora-backend vapora/backend:v1.2.0
# vapora-agents vapora/agents:v1.2.0
# vapora-llm-router vapora/llm-router:v1.2.0
```
**Record these for rollback**: Keep this output visible
### 1.4 Get Current Revision Numbers
```bash
# For each deployment, get rollout history
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  echo "=== $deployment ==="
  kubectl rollout history deployment/$deployment -n $NAMESPACE | tail -5
done
# Output example:
# REVISION CHANGE-CAUSE
# 42 Deployment rolled out
# 43 Deployment rolled out
# 44 (current)
```
**Record the highest revision number for each** - this is your rollback reference
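If you prefer a file over keeping terminal output visible, a small sketch that saves both the image versions and the revision history as one baseline (the filename pattern is just an example):
```bash
# Capture the rollback baseline to a timestamped file
BASELINE="deploy-baseline-$(date -u +'%Y%m%dT%H%M%SZ').txt"
{
  echo "=== images ==="
  kubectl get deployments -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].image}{"\n"}{end}'
  echo "=== revisions ==="
  for deployment in vapora-backend vapora-agents vapora-llm-router; do
    echo "--- $deployment ---"
    kubectl rollout history deployment/$deployment -n $NAMESPACE | tail -5
  done
} | tee "$BASELINE"
echo "Baseline saved to $BASELINE"
```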
### 1.5 Check Cluster Resources
```bash
# Verify cluster has capacity for new deployment
kubectl top nodes
kubectl describe nodes | grep -A 5 "Allocated resources"
# Example - check memory/CPU availability
# Requested: 8200m (41%)
# Limits: 16400m (82%)
```
**What to look for:**
- ✓ Less than 80% resource utilization
- ❌ If above 85%: Insufficient capacity, don't proceed
---
## Phase 2: Configuration Deployment (3 minutes)
### 2.1 Apply ConfigMap
The ConfigMap contains all application configuration.
```bash
# First: Dry-run to verify no syntax errors
kubectl apply -f configmap.yaml --dry-run=server -n $NAMESPACE
# Should output:
# configmap/vapora-config configured (server dry run)
# Check for any warnings or errors in output
# If errors, stop and fix the YAML before proceeding
```
**Troubleshooting**:
- "error validating": YAML syntax error - fix and retry
- "field is immutable": Can't change certain ConfigMap fields - delete and recreate
- "resourceQuotaExceeded": Namespace quota exceeded - contact cluster admin
### 2.2 Apply ConfigMap for Real
```bash
# Apply the actual ConfigMap
kubectl apply -f configmap.yaml -n $NAMESPACE
# Output:
# configmap/vapora-config configured
# Verify it was applied
kubectl get configmap -n $NAMESPACE vapora-config -o yaml | head -20
# Check for your new values in the output
```
**Verify ConfigMap is correct**:
```bash
# Extract specific values to verify
kubectl get configmap vapora-config -n $NAMESPACE -o jsonpath='{.data.vapora\.toml}' | grep "database_url" | head -1
# Should show the correct database URL
```
### 2.3 Annotate ConfigMap
Record when this config was deployed for audit trail:
```bash
kubectl annotate configmap vapora-config \
  -n $NAMESPACE \
  deployment.timestamp="$(date -u +'%Y-%m-%dT%H:%M:%SZ')" \
  deployment.commit="$(git rev-parse HEAD | cut -c1-8)" \
  deployment.branch="$(git rev-parse --abbrev-ref HEAD)" \
  --overwrite
# Verify annotation was added
kubectl get configmap vapora-config -n $NAMESPACE -o yaml | grep "deployment\."
```
---
## Phase 3: Deployment Update (5 minutes)
### 3.1 Dry-Run Deployment
Always dry-run first to catch issues:
```bash
# Run deployment dry-run
kubectl apply -f deployment.yaml --dry-run=server -n $NAMESPACE
# Output should show what will be updated:
# deployment.apps/vapora-backend configured (server dry run)
# deployment.apps/vapora-agents configured (server dry run)
# deployment.apps/vapora-llm-router configured (server dry run)
```
**Check for warnings** (some of these only surface once pods are actually scheduled, so also watch events after the real apply):
- "ImagePullBackOff": Docker image doesn't exist or can't be pulled
- "insufficient quota": Resource limits exceeded
- "nodeAffinity": Pod can't be placed on any node
### 3.2 Apply Deployments
```bash
# Apply the actual deployments
kubectl apply -f deployment.yaml -n $NAMESPACE
# Output:
# deployment.apps/vapora-backend configured
# deployment.apps/vapora-agents configured
# deployment.apps/vapora-llm-router configured
```
**Verify deployments updated**:
```bash
# Check that new rollout was initiated
kubectl get deployments -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.observedGeneration}{"\n"}{end}'
# Each deployment's observedGeneration should be higher than it was before the apply
```
### 3.3 Monitor Rollout Progress
Watch the deployment rollout status:
```bash
# For each deployment, monitor the rollout
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  echo "Waiting for $deployment..."
  kubectl rollout status deployment/$deployment \
    -n $NAMESPACE \
    --timeout=5m
  echo "$deployment ready"
done
```
**What to look for** (per pod update):
```
Waiting for rollout to finish: 2 of 3 updated replicas are available...
Waiting for rollout to finish: 2 of 3 updated replicas are available...
Waiting for rollout to finish: 3 of 3 updated replicas are available...
deployment "vapora-backend" successfully rolled out
```
**Expected time: 2-3 minutes per deployment**
### 3.4 Watch Pod Updates (in separate terminal)
While rollout completes, monitor pods:
```bash
# Watch pods being updated in real-time
kubectl get pods -n $NAMESPACE -w
# Output shows updates like:
# NAME READY STATUS
# vapora-backend-abc123-def45 1/1 Running
# vapora-backend-xyz789-old-pod 1/1 Running ← old pod still running
# vapora-backend-abc123-new-pod 0/1 Pending ← new pod starting
# vapora-backend-abc123-new-pod 0/1 ContainerCreating
# vapora-backend-abc123-new-pod 1/1 Running ← new pod ready
# vapora-backend-xyz789-old-pod 1/1 Terminating ← old pod being removed
```
**What to look for:**
- ✓ New pods starting (Pending → ContainerCreating → Running)
- ✓ Each new pod reaches Running state
- ✓ Old pods gradually terminating
- ❌ Pod stuck in "CrashLoopBackOff": Stop, check logs, might need rollback
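If a pod looks stuck, a quick triage sketch using only standard kubectl (no extra tooling assumed):
```bash
# List pods that are not in the Running phase
kubectl get pods -n $NAMESPACE --field-selector=status.phase!=Running
# Show the waiting reason per pod (e.g. CrashLoopBackOff, ImagePullBackOff), if any
kubectl get pods -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}'
```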
---
## Phase 4: Verification (5 minutes)
### 4.1 Verify All Pods Running
```bash
# Check all pods are ready
kubectl get pods -n $NAMESPACE
# Expected output:
# NAME READY STATUS
# vapora-backend-<hash>-1 1/1 Running
# vapora-backend-<hash>-2 1/1 Running
# vapora-backend-<hash>-3 1/1 Running
# vapora-agents-<hash>-1 1/1 Running
# vapora-agents-<hash>-2 1/1 Running
# vapora-llm-router-<hash>-1 1/1 Running
# vapora-llm-router-<hash>-2 1/1 Running
```
**Verification**:
```bash
# All pods should show READY=1/1
# All pods should show STATUS=Running
# No pods should be in Pending, CrashLoopBackOff, or Error state
# Quick check:
READY=$(kubectl get pods -n $NAMESPACE -o jsonpath='{range .items[*]}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' | grep -c "True")
TOTAL=$(kubectl get pods -n $NAMESPACE --no-headers | wc -l)
echo "Ready pods: $READY / $TOTAL"
# Should show: Ready pods: 7 / 7 (or your expected pod count)
```
### 4.2 Check Pod Logs for Errors
```bash
# Check logs from the last minute for errors
for pod in $(kubectl get pods -n $NAMESPACE -o name); do
  echo "=== $pod ==="
  kubectl logs $pod -n $NAMESPACE --since=1m 2>&1 | grep -i "error\|exception\|fatal" | head -3
done
# If errors found:
# 1. Note which pods have errors
# 2. Get full log: kubectl logs <pod> -n $NAMESPACE
# 3. Decide: can proceed or need to rollback
```
### 4.3 Verify Service Endpoints
```bash
# Check services are exposing pods correctly
kubectl get endpoints -n $NAMESPACE
# Expected output:
# NAME ENDPOINTS
# vapora-backend 10.1.2.3:8001,10.1.2.4:8001,10.1.2.5:8001
# vapora-agents 10.1.2.6:8002,10.1.2.7:8002
# vapora-llm-router 10.1.2.8:8003,10.1.2.9:8003
```
**Verification**:
- ✓ Each service has multiple endpoints (not empty)
- ✓ Endpoints match running pods
- ❌ If empty endpoints: Service can't route traffic
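An empty service can be spotted quickly; a sketch that prints each service with its ready addresses:
```bash
# A service with an empty second column has no ready endpoints and cannot route traffic
kubectl get endpoints -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.subsets[*].addresses[*].ip}{"\n"}{end}'
```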
### 4.4 Health Check Endpoints
```bash
# Port-forward to access services locally
kubectl port-forward -n $NAMESPACE svc/vapora-backend 8001:8001 &
# Wait a moment for port-forward to establish
sleep 2
# Check backend health
curl -v http://localhost:8001/health
# Expected response:
# HTTP/1.1 200 OK
# {...healthy response...}
# Check other endpoints
curl http://localhost:8001/api/projects -H "Authorization: Bearer test-token"
```
**Expected responses**:
- `/health`: 200 OK with health data
- `/api/projects`: 200 OK with projects list
- `/metrics`: 200 OK with Prometheus metrics
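The metrics endpoint can be spot-checked through the same port-forward; a sketch, assuming the backend serves its Prometheus metrics on the port listed above:
```bash
# First few lines of the Prometheus metrics output
curl -s http://localhost:8001/metrics | head -10
```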
**If connection refused**:
```bash
# Check if port-forward working
ps aux | grep "port-forward"
# Restart port-forward
pkill -f "port-forward.*svc/vapora-backend"
kubectl port-forward -n $NAMESPACE svc/vapora-backend 8001:8001 &
```
### 4.5 Check Metrics
```bash
# Monitor resource usage of deployed pods
kubectl top pods -n $NAMESPACE
# Expected output:
# NAME CPU(cores) MEMORY(Mi)
# vapora-backend-abc123 250m 512Mi
# vapora-backend-def456 280m 498Mi
# vapora-agents-ghi789 300m 256Mi
```
**Verification**:
- ✓ CPU usage within expected range (typically 100-500m per pod)
- ✓ Memory usage within expected range (typically 200-512Mi)
- ❌ If any pod at 100% CPU/Memory: Performance issue, monitor closely
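A rough scripted version of this check; a sketch with hypothetical thresholds (tune them to your workloads):
```bash
# Flag pods above ~800m CPU or ~900Mi memory (placeholder thresholds)
kubectl top pods -n $NAMESPACE --no-headers | awk '
  { cpu = $2; mem = $3; sub(/m$/, "", cpu); sub(/Mi$/, "", mem) }
  cpu + 0 > 800 || mem + 0 > 900 { print "HIGH USAGE:", $1, $2, $3 }'
```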
---
## Phase 5: Validation (3 minutes)
### 5.1 Run Smoke Tests (if available)
```bash
# If your project has smoke tests:
kubectl exec -it deployment/vapora-backend -n $NAMESPACE -- \
sh -c "curl http://localhost:8001/health && echo 'Health check passed'"
# Or run from your local machine:
./scripts/smoke-tests.sh --endpoint http://localhost:8001
```
### 5.2 Check for Errors in Logs
```bash
# Look at logs from all pods since deployment started
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  echo "=== Checking $deployment ==="
  kubectl logs deployment/$deployment -n $NAMESPACE --since=5m 2>&1 | \
    grep -i "error\|exception\|failed" | wc -l
done
# If any errors found:
# 1. Get detailed logs
# 2. Determine if critical or expected errors
# 3. Decide to proceed or rollback
```
### 5.3 Compare Against Baseline Metrics
Compare current metrics with pre-deployment baseline:
```bash
# Current metrics
echo "=== Current ==="
kubectl top nodes
kubectl top pods -n $NAMESPACE | head -5
# Compare with recorded baseline
# If similar: ✓ Good
# If significantly higher: ⚠️ Watch for issues
# If error rates high: ❌ Consider rollback
```
### 5.4 Check for Recent Events/Warnings
```bash
# Look for any cluster events in the last 5 minutes
kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' | tail -20
# Watch for:
# - Warning: FailedScheduling (pod won't fit on any node)
# - Warning: Failed / ErrImagePull (image doesn't exist)
# - Warning: ImagePullBackOff (can't download image)
# - Warning: FailedCreate (resource quota exceeded)
```
---
## Phase 6: Communication (1 minute)
### 6.1 Post Deployment Complete
```
Post message to #deployments:
🚀 DEPLOYMENT COMPLETE
Deployment: VAPORA Core Services
Mode: Enterprise
Duration: 8 minutes
Status: ✅ Successful
Deployed:
- vapora-backend (v1.2.1)
- vapora-agents (v1.2.1)
- vapora-llm-router (v1.2.1)
Verification:
✓ All pods running
✓ Health checks passing
✓ No error logs
✓ Metrics normal
Next steps:
- Monitor #alerts for any issues
- Check dashboards every 5 minutes for 30 min
- Review logs if any issues detected
Questions? @on-call-engineer
```
### 6.2 Update Status Page
```
If using public status page:
UPDATE: Maintenance Complete
VAPORA services have been successfully updated
and are now operating normally.
All systems monitoring nominal.
```
### 6.3 Notify Stakeholders
- [ ] Send message to support team: "Deployment complete, all systems normal"
- [ ] Post in #product: "Backend updated to v1.2.1, new features available"
- [ ] Update ticket/issue with deployment completion time and status
---
## Phase 7: Post-Deployment Monitoring (Ongoing)
### 7.1 First 5 Minutes: Watch Closely
```bash
# In separate terminals, keep an eye on pods, resource usage, and logs
watch kubectl get pods -n $NAMESPACE
watch kubectl top pods -n $NAMESPACE
# logs -f already streams, so it does not need watch
kubectl logs -f deployment/vapora-backend -n $NAMESPACE
```
**Watch for:**
- Pod restarts (RESTARTS counter increasing)
- Increased error logs
- Resource usage spikes
- Service unreachability
### 7.2 First 30 Minutes: Monitor Dashboard
Keep dashboard visible showing:
- Pod health status
- CPU/Memory usage per pod
- Request latency (if available)
- Error rate
- Recent logs
**Alert triggers for immediate action** (a CLI spot-check sketch follows the list):
- Any pod restarting repeatedly
- Error rate above 5%
- Latency above 2x normal
- Pod stuck in Pending state
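The restart and Pending triggers can be spot-checked from the command line; a small sketch:
```bash
# Restart counts per pod (a climbing count means trouble)
kubectl get pods -n $NAMESPACE -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].restartCount}{"\n"}{end}'
# Pods stuck in Pending
kubectl get pods -n $NAMESPACE --field-selector=status.phase=Pending
```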
### 7.3 First 2 Hours: Regular Checks
```bash
# Every 10 minutes:
kubectl get pods -n $NAMESPACE
kubectl top pods -n $NAMESPACE
# Scan recent logs for errors
kubectl logs deployment/vapora-backend -n $NAMESPACE --since=10m | grep -i error
# Then review the alerts dashboard
```
**If issues detected**, proceed to Incident Response Runbook
### 7.4 After 2 Hours: Normal Monitoring
Return to standard monitoring procedures. Deployment complete.
---
## If Issues Detected: Quick Rollback
If problems occur at any point:
```bash
# IMMEDIATE: Rollback (1 minute)
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  kubectl rollout undo deployment/$deployment -n $NAMESPACE &
done
wait
# Verify rollback completing:
kubectl rollout status deployment/vapora-backend -n $NAMESPACE --timeout=5m
# Confirm services are recovering (via the port-forward from Phase 4; re-establish it if needed):
curl http://localhost:8001/health
# Post to #deployments:
# 🔙 ROLLBACK EXECUTED
# Issue detected, services rolled back to previous version
# All pods should be recovering now
```
See [Rollback Runbook](./rollback-runbook.md) for detailed procedures.
---
## Common Issues & Solutions
### Issue: Pod stuck in ImagePullBackOff
**Cause**: Docker image doesn't exist or can't be downloaded
**Solution**:
```bash
# Check pod events
kubectl describe pod <pod-name> -n $NAMESPACE
# Check image registry access
kubectl get secret -n $NAMESPACE
# Either:
# 1. Verify the image name is correct in deployment.yaml
# 2. Push the missing image to the registry
# 3. Rollback the deployment
```
### Issue: Pod stuck in CrashLoopBackOff
**Cause**: Application crashing on startup
**Solution**:
```bash
# Get pod logs
kubectl logs <pod-name> -n $NAMESPACE --previous
# Fix typically requires config change:
# 1. Fix the ConfigMap issue
# 2. Re-apply the ConfigMap: kubectl apply -f configmap.yaml -n $NAMESPACE
# 3. Trigger a pod restart: kubectl rollout restart deployment/<name> -n $NAMESPACE
# Or rollback if unclear
```
### Issue: Pod in Pending state
**Cause**: No node has enough free CPU or memory to schedule the pod
**Solution**:
```bash
# Describe pod to see why
kubectl describe pod <pod-name> -n $NAMESPACE
# Check for "Insufficient cpu", "Insufficient memory"
kubectl top nodes
# Either:
# 1. Scale down other workloads
# 2. Increase the node count
# 3. Reduce resource requirements in deployment.yaml and redeploy
```
### Issue: Service endpoints empty
**Cause**: Pods are not passing their readiness probes, so they are never added to the service
**Solution**:
```bash
# Check pod logs for errors
kubectl logs <pod-name> -n $NAMESPACE
# Check pod readiness probe failures
kubectl describe pod <pod-name> -n $NAMESPACE | grep -A 5 "Readiness"
# Fix configuration or rollback
```
---
## Completion Checklist
- [ ] All pods running and ready
- [ ] Health endpoints responding
- [ ] No error logs
- [ ] Metrics normal
- [ ] Deployment communication posted
- [ ] Status page updated
- [ ] Stakeholders notified
- [ ] Monitoring enabled for next 2 hours
- [ ] Ticket/issue updated with completion details
---
## Next Steps
- Continue monitoring per [Monitoring Runbook](./monitoring-runbook.md)
- If issues arise, follow [Incident Response Runbook](./incident-response-runbook.md)
- Document lessons learned
- Update runbooks if procedures need improvement