VAPORA Operations Runbooks
Complete set of runbooks and procedures for deploying, monitoring, and operating VAPORA in production environments.
Quick Navigation
I need to...
- Deploy to production: See Deployment Runbook or Pre-Deployment Checklist
- Respond to an incident: See Incident Response Runbook
- Roll back a deployment: See Rollback Runbook
- Go on-call: See On-Call Procedures
- Monitor services: See Monitoring Runbook
- Understand common failures: See Common Failure Scenarios
Runbook Overview
1. Pre-Deployment Checklist
When: 24 hours before any production deployment
Content: Comprehensive checklist for deployment preparation including:
- Communication & scheduling
- Code review & validation
- Environment verification
- Health baseline recording
- Artifact preparation
- Rollback plan verification
Time: 1-2 hours
File: pre-deployment-checklist.md
2. Deployment Runbook
When: Executing actual production deployment
Content: Step-by-step deployment procedures including:
- Pre-flight checks (5 min)
- Configuration deployment (3 min)
- Deployment update (5 min)
- Verification (5 min)
- Validation (3 min)
- Communication & monitoring
Time: 15-20 minutes total
File: deployment-runbook.md
3. Rollback Runbook
When: Issues detected after deployment requiring immediate rollback
Content: Safe rollback procedures including:
- When to roll back (decision criteria)
- Kubernetes automatic rollback (step-by-step)
- Docker manual rollback (guided)
- Post-rollback verification
- Emergency procedures
- Prevention & lessons learned
Time: 5-10 minutes (depending on issues)
File: rollback-runbook.md
4. Incident Response Runbook
When: Production incident declared
Content: Full incident response procedures including:
- Severity levels (1-4) with examples
- Report & assess procedures
- Diagnosis & escalation
- Fix implementation
- Recovery verification
- Communication templates
- Role definitions
Time: Varies by severity (2 min to 1+ hour)
File: incident-response-runbook.md
5. On-Call Procedures
When: During assigned on-call shift
Content: Full on-call guide including:
- Before shift starts (setup & verification)
- Daily tasks & check-ins
- Responding to alerts
- Monitoring dashboard setup
- Escalation decision tree
- Shift handoff procedures
- Common questions & answers
Time: Read thoroughly before first on-call shift (~30 min)
File: on-call-procedures.md
Deployment Workflow
Standard Deployment Process
DAY 1 (Planning)
↓
- Create GitHub issue/ticket
- Identify deployment window
- Notify stakeholders
24 HOURS BEFORE
↓
- Complete pre-deployment checklist
(pre-deployment-checklist.md)
- Verify all prerequisites
- Stage artifacts
- Test in staging
DEPLOYMENT DAY
↓
- Final go/no-go decision
- Execute deployment runbook
(deployment-runbook.md)
- Pre-flight checks
- ConfigMap deployment
- Service deployment
- Verification
- Communication
POST-DEPLOYMENT (2 hours)
↓
- Monitor closely (every 10 minutes)
- Watch for issues
- If problems → execute rollback runbook
(rollback-runbook.md)
- Document results
24 HOURS LATER
↓
- Declare deployment stable
- Schedule post-mortem (if issues)
- Update documentation
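The post-deployment step above (check every 10 minutes for 2 hours) can be sketched as a small loop. This is a minimal illustration, not part of the runbooks: `monitor_deployment` and its arguments are hypothetical names, and the check command is whatever you pass in.

```shell
# Sketch: run a check command at a fixed interval for a fixed number of rounds.
# Usage: monitor_deployment <interval_seconds> <rounds> <check command...>
monitor_deployment() {
  interval=$1
  rounds=$2
  shift 2
  i=1
  while [ "$i" -le "$rounds" ]; do
    if "$@"; then
      echo "check $i/$rounds: ok"
    else
      echo "check $i/$rounds: FAILED, consider the rollback runbook"
      return 1
    fi
    # Sleep between rounds, but not after the last one.
    if [ "$i" -lt "$rounds" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
}
```

For example, `monitor_deployment 600 12 curl -fsS http://localhost:8001/health` would run 12 checks, 10 minutes apart, and stop at the first failure.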
If Issues During Deployment
Issue Detected
↓
Severity Assessment
↓
Severity 1-2:
├─ Immediate rollback
│ (rollback-runbook.md)
│
└─ Post-rollback investigation
(incident-response-runbook.md)
Severity 3-4:
├─ Monitor and investigate
│ (incident-response-runbook.md)
│
└─ Fix in place if quick
OR
Schedule rollback
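The decision tree above reduces to a two-way branch on severity. A minimal sketch, assuming severity is passed as a number from 1 to 4; the function name `deployment_issue_action` and its output labels are illustrative only:

```shell
# Sketch: map an incident severity (1-4) to the first action in the tree above.
deployment_issue_action() {
  case "$1" in
    1|2) echo "rollback" ;;      # immediate rollback, then investigate
    3|4) echo "investigate" ;;   # monitor, fix in place or schedule a rollback
    *)   echo "unknown severity: $1" >&2; return 2 ;;
  esac
}
```

`deployment_issue_action 2` prints `rollback`, pointing to rollback-runbook.md; `deployment_issue_action 3` prints `investigate`, pointing to incident-response-runbook.md.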
Monitoring & Alerting
Essential Dashboards
Keep these visible during deployments and throughout on-call shifts:
- Kubernetes Dashboard
  - Pod status
  - Node health
  - Event logs
- Grafana Dashboards (if available)
  - Request rate and latency
  - Error rate
  - CPU/Memory usage
  - Pod restart counts
- Application Logs (Elasticsearch, CloudWatch, etc.)
  - Error messages
  - Stack traces
  - Performance logs
Alert Triggers & Responses
| Alert | Severity | Response |
|---|---|---|
| Pod CrashLoopBackOff | 1 | Check logs, likely config issue |
| Error rate >10% | 1 | Check recent deployment, consider rollback |
| All pods pending | 1 | Node issue or resource exhausted |
| High memory usage >90% | 2 | Check for memory leak or scale up |
| High latency (2x normal) | 2 | Check database, external services |
| Single pod failed | 3 | Monitor, likely transient |
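The numeric thresholds in the table can be expressed as a small classifier. This is a sketch under assumptions: the metric names (`error_rate_pct`, `memory_pct`, `latency_ratio`) are invented for illustration, and "2x normal" is read here as a ratio of 2 or more.

```shell
# Sketch: alert_severity <metric> <value>
#   error_rate_pct  error rate in percent
#   memory_pct      memory usage in percent
#   latency_ratio   current latency divided by normal latency
alert_severity() {
  awk -v m="$1" -v v="$2" 'BEGIN {
    if (m == "error_rate_pct" && v + 0 > 10) print 1
    else if (m == "memory_pct" && v + 0 > 90) print 2
    else if (m == "latency_ratio" && v + 0 >= 2) print 2
    else print "none"
  }'
}
```

`alert_severity error_rate_pct 12` prints `1`, while `alert_severity memory_pct 85` prints `none`.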
Health Check Commands
Quick commands to verify everything is working:
# Cluster health
kubectl cluster-info
kubectl get nodes # All should be Ready
# Service health
kubectl get pods -n vapora
# All should be Running, 1/1 Ready
# Quick endpoints test
curl http://localhost:8001/health
curl http://localhost:3000
# Pod resources
kubectl top pods -n vapora
# Recent issues
kubectl get events -n vapora | grep Warning
kubectl logs deployment/vapora-backend -n vapora --tail=20
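The "all should be Running, 1/1 Ready" check can be automated by filtering the `kubectl get pods` table. A minimal sketch, assuming the default column layout (NAME, READY, STATUS, ...); `unhealthy_pods` is a hypothetical helper name:

```shell
# Sketch: read `kubectl get pods` output on stdin and print the name of every
# pod that is not Running or whose READY column is not n/n.
unhealthy_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")
    if ($3 != "Running" || ready[1] != ready[2]) print $1
  }'
}
```

Typical use would be `kubectl get pods -n vapora | unhealthy_pods`; empty output means every pod passed. Pods from finished jobs (status Completed) would also be listed, so adjust the filter if the namespace runs jobs.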
Common Failure Scenarios
Pod CrashLoopBackOff
Symptoms: Pod restarts repeatedly
Diagnosis:
kubectl logs <pod> -n vapora --previous # See what crashed
kubectl describe pod <pod> -n vapora # Check events
Solutions:
- If config error: Fix ConfigMap, restart pod
- If code error: Rollback deployment
- If resource issue: Increase limits or scale out
Runbook: Rollback Runbook or Incident Response
Pod Stuck in Pending
Symptoms: Pod won't start, stuck in "Pending" state
Diagnosis:
kubectl describe pod <pod> -n vapora # Check "Events" section
Common causes:
- Insufficient CPU/memory on nodes
- Node disk full
- Pod can't be scheduled
- Persistent volume not available
Solutions:
- Scale down other workloads
- Add more nodes
- Fix persistent volume issues
- Check node disk space
Runbook: On-Call Procedures → "Common Questions"
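The "Events" section usually contains a scheduler message that points at one of the causes listed above. The sketch below maps a few common message fragments to those causes. The exact wording of scheduler messages varies between Kubernetes versions, so treat the patterns as assumptions to adapt; `pending_cause` is a hypothetical name.

```shell
# Sketch: classify a FailedScheduling event message into a likely cause.
pending_cause() {
  case "$1" in
    *"Insufficient cpu"*|*"Insufficient memory"*) echo "insufficient-resources" ;;
    *disk-pressure*)                              echo "node-disk-pressure" ;;
    *PersistentVolumeClaim*)                      echo "volume-unavailable" ;;
    *)                                            echo "unschedulable-other" ;;
  esac
}
```

For example, `pending_cause "0/3 nodes are available: 3 Insufficient cpu."` prints `insufficient-resources`.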
Service Unresponsive (Connection Refused)
Symptoms: curl: (7) Failed to connect to localhost port 8001
Diagnosis:
kubectl get pods -n vapora # Are pods even running?
kubectl get service vapora-backend -n vapora # Does service exist?
kubectl get endpoints -n vapora # Do endpoints exist?
Common causes:
- Pods not running (restart loops)
- Service missing or misconfigured
- Port incorrect
- Network policy blocking traffic
Solutions:
- Verify pods are running: kubectl get pods -n vapora
- Verify the service exists: kubectl get svc -n vapora
- Check endpoints: kubectl get endpoints -n vapora
- Port-forward if the issue is with routing: kubectl port-forward svc/vapora-backend 8001:8001 -n vapora
Runbook: Incident Response
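A service with no ready pods behind it shows `<none>` in the ENDPOINTS column, which is a common reason for "connection refused". A minimal sketch that picks those services out of the `kubectl get endpoints` table; `services_without_endpoints` is a hypothetical helper name and the default column layout is assumed:

```shell
# Sketch: read `kubectl get endpoints` output on stdin and print every service
# whose ENDPOINTS column is <none>.
services_without_endpoints() {
  awk 'NR > 1 && $2 == "<none>" { print $1 }'
}
```

Typical use would be `kubectl get endpoints -n vapora | services_without_endpoints`. If vapora-backend appears in the output, check pod readiness and the service selector before looking at routing.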
High Error Rate
Symptoms: Dashboard shows >5% 5xx errors
Diagnosis:
# Check which endpoint
kubectl logs deployment/vapora-backend -n vapora | grep "ERROR\|500"
# Check recent deployment
git log -1 --oneline provisioning/
# Check dependencies
curl http://localhost:8001/health # is it healthy?
Common causes:
- Recent bad deployment
- Database connectivity issue
- Configuration error
- Dependency service down
Solutions:
- If recent deployment: Consider rollback
- Check ConfigMap for typos
- Check database connectivity
- Check external service health
Runbook: Rollback Runbook or Incident Response
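When a dashboard is not available, the 5% symptom threshold can be checked by hand from request counts. A minimal sketch: the function names are illustrative, and how you obtain the total and failed counts (logs, metrics, load balancer) is left open.

```shell
# Sketch: error_rate_pct <total_requests> <failed_requests>
# Prints the failure rate in percent with one decimal place.
error_rate_pct() {
  awk -v total="$1" -v failed="$2" 'BEGIN {
    if (total + 0 <= 0) { print "0.0"; exit }
    printf "%.1f\n", 100 * failed / total
  }'
}

# Sketch: exceeds_error_threshold <rate_pct> <threshold_pct>
# Exit status 0 when the rate is above the threshold, 1 otherwise.
exceeds_error_threshold() {
  awk -v r="$1" -v t="$2" 'BEGIN { exit !(r + 0 > t + 0) }'
}
```

For example, 62 failures out of 1000 requests gives `error_rate_pct 1000 62`, which prints `6.2`, and `exceeds_error_threshold 6.2 5` succeeds.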
Resource Exhaustion (CPU/Memory)
Symptoms: kubectl top pods shows pod at 100% usage or "limits exceeded"
Diagnosis:
kubectl top nodes # Overall node usage
kubectl top pods -n vapora # Per-pod usage
kubectl get pod <pod> -n vapora -o yaml | grep -A 10 limits # Check limits
Solutions:
- Increase pod resource limits (requires redeployment)
- Scale out (add more replicas)
- Scale down other workloads
- Investigate memory leak if growing
Runbook: Deployment Runbook → Phase 4 (Verification)
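To spot pods approaching their memory limit, the `kubectl top pods` table can be filtered against a threshold. A minimal sketch, assuming memory is reported in Mi (the usual unit in this output) and that you pass the threshold yourself; `pods_over_memory` is a hypothetical name:

```shell
# Sketch: pods_over_memory <threshold_mi>
# Reads `kubectl top pods` output on stdin and prints pods using more memory
# than the threshold. Rows whose memory is not reported in Mi are skipped.
pods_over_memory() {
  awk -v limit="$1" 'NR > 1 && $3 ~ /Mi$/ {
    mem = $3
    sub(/Mi$/, "", mem)
    if (mem + 0 > limit + 0) print $1, $3
  }'
}
```

Typical use would be `kubectl top pods -n vapora | pods_over_memory 900`, with the threshold set to roughly 90% of the configured limit.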
Database Connection Errors
Symptoms: ERROR: could not connect to database
Diagnosis:
# Check database is running
kubectl get pods -n <database-namespace>
# Check credentials in ConfigMap
kubectl get configmap vapora-config -n vapora -o yaml | grep -i "database\|password"
# Test connectivity
kubectl exec <pod> -n vapora -- sh -c 'psql "$DATABASE_URL" -c "SELECT 1"' # Quoted so $DATABASE_URL expands inside the pod
Solutions:
- If credentials wrong: Fix in ConfigMap, restart pods
- If database down: Escalate to DBA
- If network issue: Network team investigation
- If permissions: Update database user
Runbook: Incident Response → "Root Cause: Database Issues"
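To separate a network problem from a credentials problem, it can help to extract just the host and port from the connection string and test reachability on its own. A minimal sketch, assuming a URL of the form `postgres://user:password@host:port/dbname`; `db_host_port` is a hypothetical helper and the example hostname below is made up.

```shell
# Sketch: print the host:port part of a database URL without the credentials.
db_host_port() {
  rest=${1#*://}     # drop the scheme
  rest=${rest%%/*}   # drop the /dbname suffix
  rest=${rest##*@}   # drop user:password@ if present
  echo "$rest"
}
```

`db_host_port 'postgres://vapora:secret@db.internal:5432/vapora'` prints `db.internal:5432`, which can then be passed to a reachability check such as `pg_isready -h db.internal -p 5432` where that tool is installed. This also avoids pasting the password into an incident channel.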
Communication Templates
Deployment Start
🚀 Deployment starting
Service: VAPORA
Version: v1.2.1
Mode: Enterprise
Expected duration: 10-15 minutes
Will update every 2 minutes. Questions? Ask in #deployments
Deployment Complete
✅ Deployment complete
Duration: 12 minutes
Status: All services healthy
Pods: All running
Health check results:
✓ Backend: responding
✓ Frontend: accessible
✓ API: normal latency
✓ No errors in logs
Next step: Monitor for 2 hours
Contact: @on-call-engineer
Incident Declared
🔴 INCIDENT DECLARED
Service: VAPORA Backend
Severity: 1 (Critical)
Time detected: HH:MM UTC
Current status: Investigating
Updates every 2 minutes
/cc @on-call-engineer @senior-engineer
Incident Resolved
✅ Incident resolved
Duration: 8 minutes
Root cause: [description]
Fix: [what was done]
All services healthy, monitoring for 1 hour
Post-mortem scheduled for [date]
Rollback Executed
🔙 Rollback executed
Issue detected in v1.2.1
Rolled back to v1.2.0
Status: Services recovering
Timeline: Issue 14:30 → Rollback 14:32 → Recovered 14:35
Investigating root cause
Escalation Matrix
When unsure who to contact:
| Issue Type | First Contact | Escalation | Emergency |
|---|---|---|---|
| Deployment issue | Deployment lead | Ops team | Ops manager |
| Pod/Container | On-call engineer | Senior engineer | Director of Eng |
| Database | DBA team | Ops manager | CTO |
| Infrastructure | Infra team | Ops manager | VP Ops |
| Security issue | Security team | CISO | CEO |
| Networking | Network team | Ops manager | CTO |
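For on-call tooling or chat bots, the matrix above can be encoded as a lookup. A minimal sketch: the short issue-type keys (`deployment`, `pod`, and so on) and the function name are illustrative, while the contacts are copied from the table.

```shell
# Sketch: escalation_contact <issue_type> <level>
#   level: first | escalation | emergency
escalation_contact() {
  case "$1" in
    deployment)     row="Deployment lead|Ops team|Ops manager" ;;
    pod)            row="On-call engineer|Senior engineer|Director of Eng" ;;
    database)       row="DBA team|Ops manager|CTO" ;;
    infrastructure) row="Infra team|Ops manager|VP Ops" ;;
    security)       row="Security team|CISO|CEO" ;;
    networking)     row="Network team|Ops manager|CTO" ;;
    *) echo "unknown issue type: $1" >&2; return 2 ;;
  esac
  case "$2" in
    first)      field=1 ;;
    escalation) field=2 ;;
    emergency)  field=3 ;;
    *) echo "unknown level: $2" >&2; return 2 ;;
  esac
  echo "$row" | cut -d'|' -f"$field"
}
```

`escalation_contact database emergency` prints `CTO`.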
Tools & Commands Quick Reference
Essential kubectl Commands
# Get status
kubectl get pods -n vapora
kubectl get deployments -n vapora
kubectl get services -n vapora
# Logs
kubectl logs deployment/vapora-backend -n vapora
kubectl logs <pod> -n vapora --previous # Previous crash
kubectl logs <pod> -n vapora -f # Follow/tail
# Execute commands
kubectl exec -it <pod> -n vapora -- bash
kubectl exec <pod> -n vapora -- curl http://localhost:8001/health
# Describe (detailed info)
kubectl describe pod <pod> -n vapora
kubectl describe node <node>
# Port forward (local access)
kubectl port-forward svc/vapora-backend 8001:8001 -n vapora
# Restart pods
kubectl rollout restart deployment/vapora-backend -n vapora
# Rollback
kubectl rollout undo deployment/vapora-backend -n vapora
# Scale
kubectl scale deployment/vapora-backend --replicas=5 -n vapora
Useful Aliases
alias k='kubectl'
alias kgp='kubectl get pods'
alias kgd='kubectl get deployments'
alias kgs='kubectl get services'
alias klogs='kubectl logs'
alias kexec='kubectl exec'
alias kdesc='kubectl describe'
alias ktop='kubectl top'
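The aliases above do not set a namespace, so every call still needs `-n vapora`. A shell function can add it by default and, unlike an alias, also works inside scripts. A minimal sketch: `kv` and the `VAPORA_NAMESPACE` variable are illustrative names, not existing settings.

```shell
# Sketch: run kubectl against the vapora namespace unless VAPORA_NAMESPACE
# overrides it.
kv() {
  kubectl -n "${VAPORA_NAMESPACE:-vapora}" "$@"
}
```

With this defined, `kv get pods` behaves like `kubectl -n vapora get pods`, and `VAPORA_NAMESPACE=staging kv get pods` targets another namespace.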
Before Your First Deployment
- Read all runbooks: Thoroughly review all procedures
- Practice in staging: Do a test deployment to staging first
- Understand rollback: Know how to roll back before deploying
- Get trained: Have a senior engineer walk you through the procedures
- Test tools: Verify kubectl and other tools work
- Verify access: Confirm you have cluster access
- Know contacts: Have escalation contacts readily available
- Review history: Look at past deployments to understand patterns
Continuous Improvement
After Each Deployment
- Were all runbooks clear?
- Any steps missing or unclear?
- Any issues that could be prevented?
- Update documentation with learnings
Monthly Review
- Review all incidents from past month
- Update procedures based on patterns
- Refresh team on any changes
- Update escalation contacts
- Review and improve alerting
Key Principles
✅ Safety First
- Always dry-run before applying
- Rollback quickly if issues detected
- Better to be conservative
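One way to make "dry-run before applying" hard to skip is to wrap both steps in a single function. A minimal sketch using kubectl's `--dry-run=server` option, which asks the API server to validate the manifest without persisting it; `safe_apply` is a hypothetical name and the manifest path is whatever you pass in.

```shell
# Sketch: safe_apply <manifest>
# Applies the manifest only if the server-side dry run succeeds.
safe_apply() {
  kubectl apply --dry-run=server -f "$1" || return 1
  kubectl apply -f "$1"
}
```

If the dry run reports an error, the function stops with a non-zero status and nothing is changed in the cluster.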
✅ Communication
- Communicate early and often
- Update every 2-5 minutes during incidents
- Notify stakeholders proactively
✅ Documentation
- Document everything you do
- Update runbooks with learnings
- Share knowledge with team
✅ Preparation
- Plan deployments thoroughly
- Test before going live
- Have rollback plan ready
✅ Quick Response
- Detect issues quickly
- Diagnose systematically
- Execute fixes decisively
❌ Avoid
- Guessing without verifying
- Skipping steps to save time
- Assuming systems are working
- Not communicating with team
- Making multiple changes at once
Support & Questions
- Questions about procedures? Ask a senior engineer or the operations team
- Found runbook gap? Create issue/PR to update documentation
- Unclear instructions? Clarify before executing critical operations
- Ideas for improvement? Share in team meetings or documentation repo
Quick Start: Your First Deployment
Day 0: Preparation
- Read: pre-deployment-checklist.md (30 min)
- Read: deployment-runbook.md (30 min)
- Read: rollback-runbook.md (20 min)
- Schedule walkthrough with senior engineer (1 hour)
Day 1: Execute with Mentorship
- Complete pre-deployment checklist with senior engineer
- Execute deployment runbook with senior observing
- Monitor for 2 hours with senior available
- Debrief: what went well, what to improve
Day 2+: Independent Deployments
- Complete checklist independently
- Execute runbook
- Document and communicate
- Ask for help if anything is unclear
Generated: 2026-01-12 Status: Production-ready Last Updated: 2026-01-12