# VAPORA Operations Runbooks

Complete set of runbooks and procedures for deploying, monitoring, and operating VAPORA in production environments.

---

## Quick Navigation

**I need to...**

- **Deploy to production**: See [Deployment Runbook](./deployment-runbook.md) or [Pre-Deployment Checklist](./pre-deployment-checklist.md)
- **Respond to an incident**: See [Incident Response Runbook](./incident-response-runbook.md)
- **Roll back a deployment**: See [Rollback Runbook](./rollback-runbook.md)
- **Go on-call**: See [On-Call Procedures](./on-call-procedures.md)
- **Monitor services**: See [Monitoring & Alerting](#monitoring--alerting)
- **Understand common failures**: See [Common Failure Scenarios](#common-failure-scenarios)

---

## Runbook Overview

### 1. Pre-Deployment Checklist

**When**: 24 hours before any production deployment

**Content**: Comprehensive checklist for deployment preparation including:

- Communication & scheduling
- Code review & validation
- Environment verification
- Health baseline recording
- Artifact preparation
- Rollback plan verification

**Time**: 1-2 hours

**File**: [`pre-deployment-checklist.md`](./pre-deployment-checklist.md)

### 2. Deployment Runbook

**When**: Executing the actual production deployment

**Content**: Step-by-step deployment procedures including:

- Pre-flight checks (5 min)
- Configuration deployment (3 min)
- Deployment update (5 min)
- Verification (5 min)
- Validation (3 min)
- Communication & monitoring

**Time**: 15-20 minutes total

**File**: [`deployment-runbook.md`](./deployment-runbook.md)

### 3. Rollback Runbook

**When**: Issues detected after deployment that require immediate rollback

**Content**: Safe rollback procedures including:

- When to roll back (decision criteria)
- Kubernetes automatic rollback (step-by-step)
- Docker manual rollback (guided)
- Post-rollback verification
- Emergency procedures
- Prevention & lessons learned

**Time**: 5-10 minutes (depending on issues)

**File**: [`rollback-runbook.md`](./rollback-runbook.md)

### 4. Incident Response Runbook

**When**: Production incident declared

**Content**: Full incident response procedures including:

- Severity levels (1-4) with examples
- Report & assess procedures
- Diagnosis & escalation
- Fix implementation
- Recovery verification
- Communication templates
- Role definitions

**Time**: Varies by severity (2 min to 1+ hour)

**File**: [`incident-response-runbook.md`](./incident-response-runbook.md)

### 5. On-Call Procedures

**When**: During an assigned on-call shift

**Content**: Full on-call guide including:

- Before shift starts (setup & verification)
- Daily tasks & check-ins
- Responding to alerts
- Monitoring dashboard setup
- Escalation decision tree
- Shift handoff procedures
- Common questions & answers

**Time**: Read thoroughly before your first on-call shift (~30 min)

**File**: [`on-call-procedures.md`](./on-call-procedures.md)

---

## Deployment Workflow

### Standard Deployment Process

```
DAY 1 (Planning)
  ↓
- Create GitHub issue/ticket
- Identify deployment window
- Notify stakeholders

24 HOURS BEFORE
  ↓
- Complete pre-deployment checklist
  (pre-deployment-checklist.md)
- Verify all prerequisites
- Stage artifacts
- Test in staging

DEPLOYMENT DAY
  ↓
- Final go/no-go decision
- Execute deployment runbook
  (deployment-runbook.md)
  - Pre-flight checks
  - ConfigMap deployment
  - Service deployment
  - Verification
  - Communication

POST-DEPLOYMENT (2 hours)
  ↓
- Monitor closely (every 10 minutes)
- Watch for issues
- If problems → execute rollback runbook
  (rollback-runbook.md)
- Document results

24 HOURS LATER
  ↓
- Declare deployment stable
- Schedule post-mortem (if issues)
- Update documentation
```

### If Issues During Deployment

```
Issue Detected
  ↓
Severity Assessment
  ↓
Severity 1-2:
  ├─ Immediate rollback
  │  (rollback-runbook.md)
  │
  └─ Post-rollback investigation
     (incident-response-runbook.md)

Severity 3-4:
  ├─ Monitor and investigate
  │  (incident-response-runbook.md)
  │
  └─ Fix in place if quick
     OR
     Schedule rollback
```

---

## Monitoring & Alerting

### Essential Dashboards

Keep these visible during deployments and at all times while on call:

1. **Kubernetes Dashboard**
   - Pod status
   - Node health
   - Event logs

2. **Grafana Dashboards** (if available)
   - Request rate and latency
   - Error rate
   - CPU/Memory usage
   - Pod restart counts

3. **Application Logs** (Elasticsearch, CloudWatch, etc.)
   - Error messages
   - Stack traces
   - Performance logs

### Alert Triggers & Responses

| Alert | Severity | Response |
|-------|----------|----------|
| Pod CrashLoopBackOff | 1 | Check logs, likely config issue |
| Error rate >10% | 1 | Check recent deployment, consider rollback |
| All pods pending | 1 | Node issue or resources exhausted |
| High memory usage >90% | 2 | Check for memory leak or scale up |
| High latency (2x normal) | 2 | Check database, external services |
| Single pod failed | 3 | Monitor, likely transient |

### Health Check Commands

Quick commands to verify everything is working:

```bash
# Cluster health
kubectl cluster-info
kubectl get nodes  # All should be Ready

# Service health
kubectl get pods -n vapora  # All should be Running, 1/1 Ready

# Quick endpoint test (needs local access, e.g. via kubectl port-forward)
curl http://localhost:8001/health
curl http://localhost:3000

# Pod resources
kubectl top pods -n vapora

# Recent issues
kubectl get events -n vapora | grep Warning
kubectl logs deployment/vapora-backend -n vapora --tail=20
```

---

## Common Failure Scenarios

### Pod CrashLoopBackOff

**Symptoms**: Pod restarts repeatedly

**Diagnosis**:

```bash
# Replace <pod-name> with the failing pod from `kubectl get pods -n vapora`
kubectl logs <pod-name> -n vapora --previous  # See what crashed
kubectl describe pod <pod-name> -n vapora     # Check events
```

**Solutions**:

1. If config error: Fix ConfigMap, restart pod
2. If code error: Roll back deployment
3. If resource issue: Increase limits or scale out

**Runbook**: [Rollback Runbook](./rollback-runbook.md) or [Incident Response](./incident-response-runbook.md)

### Pod Stuck in Pending

**Symptoms**: Pod won't start, stuck in "Pending" state

**Diagnosis**:

```bash
kubectl describe pod <pod-name> -n vapora  # Check "Events" section
```

**Common causes**:

- Insufficient CPU/memory on nodes
- Node disk full
- Pod can't be scheduled
- Persistent volume not available

**Solutions**:

1. Scale down other workloads
2. Add more nodes
3. Fix persistent volume issues
4. Check node disk space

**Runbook**: [On-Call Procedures](./on-call-procedures.md) → "Common Questions"

### Service Unresponsive (Connection Refused)

**Symptoms**: `curl: (7) Failed to connect to localhost port 8001`

**Diagnosis**:

```bash
kubectl get pods -n vapora                    # Are pods even running?
kubectl get service vapora-backend -n vapora  # Does service exist?
kubectl get endpoints -n vapora               # Do endpoints exist?
```

**Common causes**:

- Pods not running (restart loops)
- Service missing or misconfigured
- Port incorrect
- Network policy blocking traffic

**Solutions**:

1. Verify pods are running: `kubectl get pods -n vapora`
2. Verify service exists: `kubectl get svc -n vapora`
3. Check endpoints: `kubectl get endpoints -n vapora`
4. Port-forward if the issue is with routing: `kubectl port-forward svc/vapora-backend 8001:8001 -n vapora`

**Runbook**: [Incident Response](./incident-response-runbook.md)

### High Error Rate

**Symptoms**: Dashboard shows >5% 5xx errors

**Diagnosis**:

```bash
# Check which endpoint is failing
kubectl logs deployment/vapora-backend -n vapora | grep "ERROR\|500"

# Check recent deployment
git log -1 --oneline provisioning/

# Check dependencies
curl http://localhost:8001/health  # Is it healthy?
```

**Common causes**:

- Recent bad deployment
- Database connectivity issue
- Configuration error
- Dependency service down

**Solutions**:

1. If recent deployment: Consider rollback
2. Check ConfigMap for typos
3. Check database connectivity
4. Check external service health

**Runbook**: [Rollback Runbook](./rollback-runbook.md) or [Incident Response](./incident-response-runbook.md)

### Resource Exhaustion (CPU/Memory)

**Symptoms**: `kubectl top pods` shows pod at 100% usage or "limits exceeded"

**Diagnosis**:

```bash
kubectl top nodes                                              # Overall node usage
kubectl top pods -n vapora                                     # Per-pod usage
kubectl get pod <pod-name> -n vapora -o yaml | grep limits -A 10  # Check limits
```

**Solutions**:

1. Increase pod resource limits (requires redeployment)
2. Scale out (add more replicas)
3. Scale down other workloads
4. Investigate a memory leak if usage keeps growing

**Runbook**: [Deployment Runbook](./deployment-runbook.md) → Phase 4 (Verification)

### Database Connection Errors

**Symptoms**: `ERROR: could not connect to database`

**Diagnosis**:

```bash
# Check database is running
kubectl get pods -n <database-namespace>

# Check credentials in ConfigMap
kubectl get configmap vapora-config -n vapora -o yaml | grep -i "database\|password"

# Test connectivity from inside a pod. The inner shell expands $DATABASE_URL
# in the pod's environment rather than on your workstation.
kubectl exec <pod-name> -n vapora -- sh -c 'psql "$DATABASE_URL" -c "SELECT 1"'
```

**Solutions**:

1. If credentials are wrong: Fix in ConfigMap, restart pods
2. If database is down: Escalate to DBA
3. If network issue: Hand off to the network team for investigation
4. If permissions: Update database user

**Runbook**: [Incident Response](./incident-response-runbook.md) → "Root Cause: Database Issues"

---

## Communication Templates

### Deployment Start

```
🚀 Deployment starting

Service: VAPORA
Version: v1.2.1
Mode: Enterprise

Expected duration: 10-15 minutes
Will update every 2 minutes. Questions? Ask in #deployments
```

### Deployment Complete

```
✅ Deployment complete

Duration: 12 minutes
Status: All services healthy
Pods: All running

Health check results:
✓ Backend: responding
✓ Frontend: accessible
✓ API: normal latency
✓ No errors in logs

Next step: Monitor for 2 hours
Contact: @on-call-engineer
```

### Incident Declared

```
🔴 INCIDENT DECLARED

Service: VAPORA Backend
Severity: 1 (Critical)
Time detected: HH:MM UTC
Current status: Investigating

Updates every 2 minutes
/cc @on-call-engineer @senior-engineer
```

### Incident Resolved

```
✅ Incident resolved

Duration: 8 minutes
Root cause: [description]
Fix: [what was done]

All services healthy, monitoring for 1 hour
Post-mortem scheduled for [date]
```

### Rollback Executed

```
🔙 Rollback executed

Issue detected in v1.2.1
Rolled back to v1.2.0

Status: Services recovering
Timeline: Issue 14:30 → Rollback 14:32 → Recovered 14:35

Investigating root cause
```

---

## Escalation Matrix

When unsure who to contact:

| Issue Type | First Contact | Escalation | Emergency |
|------------|---------------|------------|-----------|
| **Deployment issue** | Deployment lead | Ops team | Ops manager |
| **Pod/Container** | On-call engineer | Senior engineer | Director of Eng |
| **Database** | DBA team | Ops manager | CTO |
| **Infrastructure** | Infra team | Ops manager | VP Ops |
| **Security issue** | Security team | CISO | CEO |
| **Networking** | Network team | Ops manager | CTO |

---

## Tools & Commands Quick Reference

### Essential kubectl Commands

```bash
# Get status
kubectl get pods -n vapora
kubectl get deployments -n vapora
kubectl get services -n vapora

# Logs
kubectl logs deployment/vapora-backend -n vapora
kubectl logs <pod-name> -n vapora --previous  # Previous crash
kubectl logs <pod-name> -n vapora -f          # Follow/tail

# Execute commands
kubectl exec -it <pod-name> -n vapora -- bash
kubectl exec <pod-name> -n vapora -- curl http://localhost:8001/health

# Describe (detailed info)
kubectl describe pod <pod-name> -n vapora
kubectl describe node <node-name>

# Port forward (local access)
kubectl port-forward svc/vapora-backend 8001:8001 -n vapora

# Restart pods
kubectl rollout restart deployment/vapora-backend -n vapora

# Rollback
kubectl rollout undo deployment/vapora-backend -n vapora

# Scale
kubectl scale deployment/vapora-backend --replicas=5 -n vapora
```

### Useful Aliases

```bash
alias k='kubectl'
alias kgp='kubectl get pods'
alias kgd='kubectl get deployments'
alias kgs='kubectl get services'
alias klogs='kubectl logs'
alias kexec='kubectl exec'
alias kdesc='kubectl describe'
alias ktop='kubectl top'
```

---

## Before Your First Deployment

1. **Read all runbooks**: Thoroughly review all procedures
2. **Practice in staging**: Do a test deployment to staging first
3. **Understand rollback**: Know how to roll back before deploying
4. **Get trained**: Have a senior engineer walk through the procedures with you
5. **Test tools**: Verify kubectl and other tools work
6. **Verify access**: Confirm you have cluster access
7. **Know contacts**: Have escalation contacts readily available
8. **Review history**: Look at past deployments to understand patterns

---

## Continuous Improvement

### After Each Deployment

- [ ] Were all runbooks clear?
- [ ] Any steps missing or unclear?
- [ ] Any issues that could be prevented?
- [ ] Update documentation with learnings

### Monthly Review

- [ ] Review all incidents from past month
- [ ] Update procedures based on patterns
- [ ] Refresh team on any changes
- [ ] Update escalation contacts
- [ ] Review and improve alerting

---

## Key Principles

✅ **Safety First**

- Always dry-run before applying
- Roll back quickly if issues are detected
- Better to be conservative

✅ **Communication**

- Communicate early and often
- Update every 2-5 minutes during incidents
- Notify stakeholders proactively

✅ **Documentation**

- Document everything you do
- Update runbooks with learnings
- Share knowledge with team

✅ **Preparation**

- Plan deployments thoroughly
- Test before going live
- Have rollback plan ready

✅ **Quick Response**

- Detect issues quickly
- Diagnose systematically
- Execute fixes decisively

❌ **Avoid**

- Guessing without verifying
- Skipping steps to save time
- Assuming systems are working
- Not communicating with team
- Making multiple changes at once

---

## Support & Questions

- **Questions about procedures?** Ask a senior engineer or the operations team
- **Found a runbook gap?** Create an issue/PR to update the documentation
- **Unclear instructions?** Clarify before executing critical operations
- **Ideas for improvement?** Share in team meetings or the documentation repo

---

## Quick Start: Your First Deployment

### Day 0: Preparation

1. Read: `pre-deployment-checklist.md` (30 min)
2. Read: `deployment-runbook.md` (30 min)
3. Read: `rollback-runbook.md` (20 min)
4. Schedule walkthrough with senior engineer (1 hour)

### Day 1: Execute with Mentorship

1. Complete pre-deployment checklist with senior engineer
2. Execute deployment runbook with senior observing
3. Monitor for 2 hours with senior available
4. Debrief: what went well, what to improve

### Day 2+: Independent Deployments

1. Complete checklist independently
2. Execute runbook
3. Document and communicate
4. Ask for help if anything is unclear

---

**Generated**: 2026-01-12
**Status**: Production-ready
**Last Updated**: 2026-01-12
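
---

## Appendix: Triage Sketch

The Alert Triggers & Responses table and the "If Issues During Deployment" flow earlier in this document can be combined into a small lookup helper. The following is a minimal sketch, not part of the official runbooks: the function names `alert_severity` and `deployment_action` are hypothetical, and the alert strings must match the table entries exactly. It only encodes the mapping; it does not call `kubectl` or change anything in the cluster.

```bash
#!/usr/bin/env bash
# Sketch only: function names are hypothetical, values are copied from the
# "Alert Triggers & Responses" table and the "If Issues During Deployment" flow.

# Print the severity (1-3) for an alert name from the table.
alert_severity() {
  case "$1" in
    "Pod CrashLoopBackOff"|"Error rate >10%"|"All pods pending")
      echo 1 ;;
    "High memory usage >90%"|"High latency (2x normal)")
      echo 2 ;;
    "Single pod failed")
      echo 3 ;;
    *)
      echo "unknown"
      return 1 ;;
  esac
}

# Print the first action during a deployment for a given severity:
# severity 1-2 means immediate rollback, severity 3-4 means monitor and investigate.
deployment_action() {
  case "$1" in
    1|2) echo "rollback" ;;
    3|4) echo "investigate" ;;
    *)
      echo "unknown"
      return 1 ;;
  esac
}
```

For example, `deployment_action "$(alert_severity "Single pod failed")"` prints `investigate`, which points to the Incident Response Runbook, while any severity 1 or 2 alert prints `rollback` and points to the Rollback Runbook.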