# Incident Response Runbook

Procedures for responding to and resolving VAPORA production incidents.

---

## Incident Severity Levels

### Severity 1: Critical 🔴

**Definition**: Service completely down or severely degraded, affecting all users

**Examples**:
- All backend pods crashed
- Database completely unreachable
- API returning 100% errors
- Frontend completely inaccessible

**Response Time**: Immediate (< 2 minutes)
**On-Call**: Page immediately (not optional)
**Communication**: Update status page every 2 minutes

### Severity 2: Major 🟠

**Definition**: Service partially down or significantly degraded

**Examples**:
- 50% of requests returning errors
- Latency 10x normal
- Some services down while others still work
- Intermittent connectivity issues

**Response Time**: 5 minutes
**On-Call**: Alert the on-call engineer
**Communication**: Internal updates every 5 minutes

### Severity 3: Minor 🟡

**Definition**: Service slow, or minor issues affecting some users

**Examples**:
- 5-10% error rate
- Elevated latency (2x normal)
- One pod having issues while the others recover
- Non-critical features unavailable

**Response Time**: 15 minutes
**On-Call**: Alert the team; an emergency page is not necessarily required
**Communication**: Post-incident update

### Severity 4: Informational 🟢

**Definition**: No user impact; system anomalies or preventive issues

**Examples**:
- Disk usage trending high
- SSL certificate expiring within 30 days
- Deployment taking longer than normal
- Non-critical service warnings

**Response Time**: During business hours
**On-Call**: No alert needed
**Communication**: Team Slack message

---

## Incident Response Process

### Step 1: Report & Assess (Immediately)

When an incident is reported (via alert, user report, or discovery):

```bash
# 1. Create an incident ticket
# Title: "INCIDENT: [Service] - [Brief description]"
# Example: "INCIDENT: API - 50% error rate since 14:30 UTC"
# Severity: [1-4]
# Reporter: [Your name]
# Time Detected: [UTC time]

# 2. Open a dedicated Slack channel
# In Slack: /create #incident-20260112-backend
# Then: /invite @on-call-engineer

# 3. Post the initial message
# "🔴 INCIDENT DECLARED
# Service: VAPORA Backend
# Severity: 1 (Critical)
# Time Detected: 14:32 UTC
# Current Status: Unknown
# Next Update: 14:34 UTC"
```

### Step 2: Quick Diagnosis (First 2 minutes)

```bash
# Establish facts quickly
export NAMESPACE=vapora

# Q1: Is the service actually down?
curl -v http://api.vapora.com/health
# Connection refused → Service down
# 500 errors        → Service crashed
# Timeout           → Service hung

# Q2: What's the scope?
kubectl get pods -n $NAMESPACE
# Count Running vs non-Running pods
# All down  → Complete outage
# Some down → Partial outage

# Q3: What's happening right now?
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  echo "=== $deployment ==="
  kubectl get deployment $deployment -n $NAMESPACE
done
# Shows DESIRED vs CURRENT vs AVAILABLE
# Example: 3 DESIRED, 0 CURRENT, 0 AVAILABLE → Pod startup failure

# Q4: Any obvious errors?
kubectl logs deployment/vapora-backend -n $NAMESPACE --tail=20 | grep -i "error\|fatal"
# Shows what's in the logs right now
```
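The checks above can also be scripted so the first responder captures one consistent snapshot instead of re-running commands by hand. A minimal sketch, assuming `kubectl` and `curl` are available where the responder works and that the health endpoint is the same one used above; the `quick-triage.sh` name and output file are illustrative:

```bash
#!/usr/bin/env bash
# quick-triage.sh: capture the Step 2 facts in one pass (illustrative sketch).
set -uo pipefail

NAMESPACE="${NAMESPACE:-vapora}"
HEALTH_URL="${HEALTH_URL:-http://api.vapora.com/health}"   # assumed public health endpoint
OUT="triage-$(date -u +%Y%m%dT%H%M%SZ).log"

{
  echo "=== Q1: health endpoint ==="
  curl -sv --max-time 10 "$HEALTH_URL" || echo "health check failed (see output above)"

  echo "=== Q2: pod status ==="
  kubectl get pods -n "$NAMESPACE" -o wide

  echo "=== Q3: deployment status ==="
  for deployment in vapora-backend vapora-agents vapora-llm-router; do
    echo "--- $deployment ---"
    kubectl get deployment "$deployment" -n "$NAMESPACE"
  done

  echo "=== Q4: recent errors ==="
  kubectl logs deployment/vapora-backend -n "$NAMESPACE" --tail=50 2>&1 \
    | grep -iE "error|fatal" || echo "no errors in the last 50 lines"
} 2>&1 | tee "$OUT"

echo "Snapshot saved to $OUT; attach it to the incident ticket."
```

Saving the output immediately gives the incident ticket a timestamped baseline, even if the cluster state changes while the response continues.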
### Step 3: Escalate Decision

Based on the quick diagnosis, decide the next action:

```
IF pods not starting (CrashLoopBackOff):
  → Likely config issue → check ConfigMap values
  → Likely recent deployment → DECISION: possible rollback

IF pods pending (not scheduled):
  → Likely resource issue → check node capacity
  → DECISION: scale down workloads or investigate nodes

IF pods running but unresponsive:
  → Likely application issue → check application logs
  → DECISION: investigate app logic

IF network/database issues:
  → Check connectivity → check credentials
  → DECISION: infrastructure escalation

IF unknown:
  → Ask: "What changed recently?"
  → Check deployment history
  → Check infrastructure changes
```

### Step 4: Initial Response Actions

**For Severity 1 (Critical)**:

```
A. Escalate immediately
   - Page a senior engineer if not already responding
   - Contact the infrastructure team
   - Notify product/support managers

B. Buy time with failover if available
   - Switch to the backup environment if configured
   - Scale to a different region if multi-region

C. Gather data for debugging
   - Save current logs
   - Save pod events
   - Record current metrics
   - Take screenshots of dashboards

D. Keep the team updated
   - Update the #incident-* channel every 2 minutes
```

**For Severity 2 (Major)**:

```
A. Alert the on-call team
B. Gather the same diagnostics
C. Start the investigation
D. Update every 5 minutes
```

**For Severity 3 (Minor)**:

```
A. Create a ticket for later investigation
B. Monitor closely
C. Gather diagnostics
D. Plan the fix during normal hours if not urgent
```

### Step 5: Detailed Diagnosis

Once the immediate actions are taken:

```bash
# Get a comprehensive view of system state
kubectl describe nodes               # Hardware/capacity issues
kubectl describe pods -n $NAMESPACE  # Pod-specific issues
kubectl get events -n $NAMESPACE     # What happened recently
kubectl top nodes                    # CPU/memory usage per node
kubectl top pods -n $NAMESPACE       # Per-pod resource usage

# Check recent changes
git log -5 --oneline
git diff HEAD~1 HEAD -- provisioning/

# Check deployment history
kubectl rollout history deployment/vapora-backend -n $NAMESPACE | tail -5

# Timeline analysis
# What happened at 14:30 UTC? (incident time)
# Was there a deployment?
# Did metrics change suddenly?
# Were any alerts triggered?
```

### Step 6: Implement Fix

Depending on the root cause:

#### Root Cause: Recent Bad Deployment

```bash
# Solution: Rollback
# See: Rollback Runbook
kubectl rollout undo deployment/vapora-backend -n $NAMESPACE
kubectl rollout status deployment/vapora-backend -n $NAMESPACE --timeout=5m

# Verify
curl http://localhost:8001/health
```

#### Root Cause: Insufficient Resources

```bash
# Solution: either scale out or reduce load

# Option A: Add more nodes
# (Requires infrastructure access; add capacity via your cloud provider or
#  cluster autoscaler. There is no kubectl command that creates nodes.)

# Option B: Scale down non-critical services
kubectl scale deployment/vapora-agents --replicas=1 -n $NAMESPACE
# Then scale back up once resolved

# Option C: Temporarily reduce backend replicas
kubectl scale deployment/vapora-backend --replicas=2 -n $NAMESPACE
# (Trade-off: reduced capacity but faster recovery)
```

#### Root Cause: Configuration Error

```bash
# Solution: Fix the ConfigMap

# 1. Identify the wrong value
kubectl get configmap vapora-config -n $NAMESPACE -o yaml | grep -A 2 "[suspect setting]"

# 2. Fix the value
# Edit the ConfigMap in an external editor, or patch it directly:
kubectl patch configmap vapora-config -n $NAMESPACE \
  --type merge \
  -p '{"data":{"vapora.toml":"[corrected content]"}}'

# 3. Restart pods to pick up the new config
kubectl rollout restart deployment/vapora-backend -n $NAMESPACE
kubectl rollout status deployment/vapora-backend -n $NAMESPACE --timeout=5m
```
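Hand-escaping a multi-line `vapora.toml` inside a JSON patch is error-prone. A hedged alternative sketch, assuming the corrected file is available locally as `./vapora.toml` (path illustrative): regenerate the ConfigMap from the file, then restart the deployment as in the patch-based flow above.

```bash
# Sketch: rebuild the ConfigMap from a corrected local file instead of an inline JSON patch.
# Assumes ./vapora.toml contains the fixed configuration; adjust the path to your checkout.
kubectl create configmap vapora-config \
  --from-file=vapora.toml=./vapora.toml \
  --dry-run=client -o yaml \
  | kubectl apply -n "$NAMESPACE" -f -

# Pods still need a restart to pick up the change:
kubectl rollout restart deployment/vapora-backend -n "$NAMESPACE"
```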
#### Root Cause: Database Issues

```bash
# Solution: depends on the specific issue

# If the database is down:
# - Contact the DBA or database team
# - Check database status:
kubectl exec [database-pod] -- curl localhost:8000

# If credentials are wrong:
kubectl patch configmap vapora-config -n $NAMESPACE \
  --type merge \
  -p '{"data":{"DB_PASSWORD":"[correct-password]"}}'
# (then restart the app pods so they pick up the new credentials)

# If the database is full:
# - Contact the DBA for cleanup
# - Free up space on the database volume

# If the connection pool is exhausted:
# - Scale down services to reduce connections
# - Increase the connection pool size if possible
```

#### Root Cause: External Service Down

```
Examples: third-party API, external database

Solution: depends on severity

If critical: fail over
- Switch to a backup provider if available
- Route traffic differently

If non-critical: degrade gracefully
- Disable the feature temporarily
- Serve cached data if available

Communicate
- Notify users of the reduced functionality
- Provide an ETA for restoration
```

### Step 7: Verify Recovery

```bash
# Once the fix is applied, verify systematically

# 1. Pod health
kubectl get pods -n $NAMESPACE
# All should show: Running, 1/1 Ready

# 2. Service endpoints
kubectl get endpoints -n $NAMESPACE
# All should have IP addresses

# 3. Health endpoints
curl http://localhost:8001/health
# Should return: 200 OK

# 4. Check for errors
kubectl logs deployment/vapora-backend -n $NAMESPACE --since=2m | grep -i error
# Should return few or no errors

# 5. Monitor metrics
kubectl top pods -n $NAMESPACE
# CPU/memory should be normal (not spiking)

# 6. Check for new issues
kubectl get events -n $NAMESPACE
# Should show a normal state, no warnings
```

### Step 8: Incident Closure

```bash
# When everything is verified healthy:

# 1. Document the resolution
# Update the incident ticket with:
# - Root cause
# - Fix applied
# - Verification steps
# - Resolution time
# - Impact (how many users, for how long)

# 2. Post the final update to the #incident channel:
# "✅ INCIDENT RESOLVED
#
# Duration: [start] to [end] = [X minutes]
# Root Cause: [brief description]
# Fix Applied: [brief description]
# Impact: ~X users affected for X minutes
#
# Status: All services healthy
# Monitoring: Continuing for 1 hour
# Post-mortem: Scheduled for [date]"

# 3. Schedule the post-mortem
# Within 24 hours: review what happened and why
# Document lessons learned

# 4. Update dashboards
# Record the incident in the status page history
# If it was a public incident: close the status page incident

# 5. Send the all-clear message
# Notify: support team, product team, key stakeholders
```
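Steps 4 and 8 both ask the responder to save logs, events, and metrics before closing out. A minimal collection sketch follows; the script name and output layout are illustrative and assume `kubectl` access plus a metrics-server for `kubectl top`.

```bash
#!/usr/bin/env bash
# collect-incident-artifacts.sh: bundle diagnostics for the post-mortem
# before pods are replaced and the evidence rotates away (illustrative sketch).
set -uo pipefail

NAMESPACE="${NAMESPACE:-vapora}"
DIR="incident-artifacts-$(date -u +%Y%m%dT%H%M%SZ)"
mkdir -p "$DIR"

kubectl get pods -n "$NAMESPACE" -o wide                    > "$DIR/pods.txt"
kubectl get events -n "$NAMESPACE" --sort-by=.lastTimestamp > "$DIR/events.txt"
kubectl describe pods -n "$NAMESPACE"                       > "$DIR/describe-pods.txt"
kubectl top pods -n "$NAMESPACE"                            > "$DIR/top-pods.txt" 2>&1

for deployment in vapora-backend vapora-agents vapora-llm-router; do
  kubectl logs "deployment/$deployment" -n "$NAMESPACE" --tail=1000 \
    > "$DIR/logs-$deployment.txt" 2>&1
done

tar czf "$DIR.tar.gz" "$DIR"
echo "Diagnostics saved to $DIR.tar.gz; attach them to the incident ticket."
```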
---

## Incident Response Roles & Responsibilities

### Incident Commander
- Has overall control of the incident response
- Makes critical decisions
- Keeps decision-making moving quickly
- Communicates status updates
- Decides when to escalate
- Usually **you**, if you discovered the incident and understand it best

### Technical Responders
- Investigate specific systems
- Implement fixes
- Report findings to the commander
- Execute verified solutions

### Communication Lead (if Severity 1)
- Updates the #incident channel every 2 minutes
- Updates the status page every 5 minutes
- Fields questions from support/product
- Notifies key stakeholders

### On-Call Manager (if Severity 1)
- Pages additional resources if needed
- Escalates to senior engineers
- Engages the infrastructure/DBA teams
- Tracks the response timeline

---

## Common Incidents & Responses

### Incident Type: Service Unresponsive

```
Detection: curl returns "Connection refused"
Diagnosis Time: 1 minute

Response:
1. Check whether pods are running: kubectl get pods -n vapora
2. If not running: likely a crash → check logs
3. If running but unresponsive: likely a port/network issue
4. Verify the service exists: kubectl get service vapora-backend -n vapora

Solution:
- If pods crashed: check logs; likely a config or deployment issue
- If pods are hanging: restart them: kubectl delete pods -l app=vapora-backend -n vapora
- If the service/endpoints are missing: apply the service manifest
```

### Incident Type: High Error Rate

```
Detection: Dashboard shows >10% 5xx errors
Diagnosis Time: 2 minutes

Response:
1. Check which endpoint is failing
2. Check logs for the error pattern
3. Identify the affected service (backend, agents, router)
4. Compare with the baseline (it worked X minutes ago)

Solution:
- If recent deployment: rollback
- If config change: revert the config
- If database issue: contact the DBA
- If a third party is down: implement a fallback
```

### Incident Type: High Latency

```
Detection: Dashboard shows p99 latency >2 seconds
Diagnosis Time: 2 minutes

Response:
1. Check whether requests still succeed (is it slow or failing?)
2. Check CPU/memory usage: kubectl top pods -n vapora
3. Check whether the database is slow: run query diagnostics
4. Check the network: is there packet loss?

Solution:
- If resources are exhausted: scale up or reduce load
- If the database is slow: DBA investigation
- If it is a network issue: infrastructure team
- If the increased load is legitimate: no action needed (expected)
```

### Incident Type: Pod Restarting Repeatedly

```
Detection: kubectl get pods shows a high RESTARTS count
Diagnosis Time: 1 minute

Response:
1. Check the restart count: kubectl get pods -n vapora
2. Get the previous pod logs: kubectl logs [pod-name] -n vapora --previous
3. Get the pod events: kubectl describe pod [pod-name] -n vapora

Solution:
- Application error: check logs, fix the issue, redeploy
- Config issue: fix the ConfigMap, restart pods
- Resource issue: increase limits or scale out
- Liveness probe failing: adjust the probe timing or fix the health check
```

### Incident Type: Database Connectivity

```
Detection: Logs show "database connection refused"
Diagnosis Time: 2 minutes

Response:
1. Check that the database pod is running: kubectl get pods -n [database-namespace]
2. Check the database credentials in the ConfigMap
3. Test connectivity: kubectl exec [backend-pod] -n vapora -- psql $DB_URL
4. Check firewall/network policy

Solution:
- If the DB is down: escalate to the DBA, possibly restore from backup
- If credentials are wrong: fix the ConfigMap, restart app pods
- If it is a network issue: network team investigation
- If the DB is out of space: DBA cleanup
```
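If the backend image does not ship `psql`, the connectivity test above can be run from a short-lived pod instead. A minimal sketch, assuming the datastore speaks the PostgreSQL protocol (as the `psql` check implies) and that `DB_URL` is exported in your shell; the `postgres:16` image tag is illustrative.

```bash
# Sketch: probe the database from inside the cluster with a throwaway pod.
# Assumes a PostgreSQL-compatible database reachable via $DB_URL (set locally,
# e.g. copied from the ConfigMap); the image tag is illustrative.
kubectl run db-check -n vapora --rm -it --restart=Never \
  --image=postgres:16 \
  -- pg_isready -d "$DB_URL"
```

A zero exit status means the database is accepting connections, which points the investigation toward credentials or application configuration rather than network reachability.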
---

## Communication During Incident

### Every 2 Minutes (Severity 1) or 5 Minutes (Severity 2)

Post an update to the #incident channel:

```
⏱️ 14:35 UTC UPDATE

Status: Investigating
Current Action: Checking pod logs
Findings: Backend pods in CrashLoopBackOff
Next Step: Review recent deployment
ETA for Update: 14:37 UTC

/cc @on-call-engineer
```

### Status Page Updates (If Public)

```
INCIDENT: VAPORA API Partially Degraded

Investigating: Our team is investigating elevated error rates
Duration: 5 minutes
Impact: ~30% of API requests failing

Last Updated: 14:35 UTC
Next Update: 14:37 UTC
```

### Escalation Communication

```
If Severity 1 and the cause cannot be identified within 5 minutes:

"Escalating to the senior engineering team.
Page @senior-engineer-on-call immediately.
Activating the Incident War Room."

Include:
- Service name
- Duration so far
- What has been tried
- Current symptoms
- Why you are stuck
```

---

## Incident Severity Decision Tree

```
Question 1: Can any users access the service?
  NO  → Severity 1 (Critical - complete outage)
  YES → Question 2

Question 2: What percentage of requests are failing?
  >50%   → Severity 1 (Critical)
  10-50% → Severity 2 (Major)
  5-10%  → Severity 3 (Minor)
  <5%    → Question 3

Question 3: Is the service recovering on its own?
  NO (staying broken)          → Severity 2
  YES (recovering by itself)   → Question 4

Question 4: Does it require user action or involve data loss?
  YES → Severity 2
  NO  → Severity 3
```

(A scripted version of this tree is sketched at the end of this runbook.)

---

## Post-Incident Procedures

### Immediate (Within 30 minutes)
- [ ] Close the incident ticket
- [ ] Post the final update to the #incident channel
- [ ] Save all logs and diagnostics
- [ ] Create a post-mortem ticket
- [ ] Notify the team: "incident resolved"

### Follow-Up (Within 24 hours)
- [ ] Schedule the post-mortem meeting
- [ ] Identify the root cause
- [ ] Document preventive measures
- [ ] Identify an owner for each action item
- [ ] Create tickets for improvements

### Prevention (Within 1 week)
- [ ] Implement the identified fixes
- [ ] Update monitoring/alerting
- [ ] Update runbooks with the findings
- [ ] Conduct team training if needed
- [ ] Close the post-mortem ticket

---

## Incident Checklist

```
☐ Incident severity determined
☐ Ticket created and updated
☐ #incident channel created
☐ On-call team alerted
☐ Initial diagnosis completed
☐ Fix identified and implemented
☐ Fix verified working
☐ Incident closed and communicated
☐ Post-mortem scheduled
☐ Team debriefed
☐ Root cause documented
☐ Prevention measures identified
☐ Tickets created for follow-up
```
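Teams that want to codify the severity decision tree can wrap it in a small prompt script for the responder. A minimal sketch follows; the script name and prompts are illustrative, the thresholds simply mirror the tree above (whole-number error rates assumed), and the tree itself remains the source of truth.

```bash
#!/usr/bin/env bash
# Sketch: interactive severity classifier that mirrors the decision tree above.
# Illustrative only; the policy lives in the tree, not in this script.
set -euo pipefail

read -rp "Can any users access the service? [y/n] " access
if [[ "$access" == "n" ]]; then echo "Severity 1 (Critical: complete outage)"; exit 0; fi

read -rp "What percentage of requests are failing? (whole number) " rate
if   (( rate > 50 ));  then echo "Severity 1 (Critical)"; exit 0
elif (( rate >= 10 )); then echo "Severity 2 (Major)"; exit 0
elif (( rate >= 5 ));  then echo "Severity 3 (Minor)"; exit 0
fi

read -rp "Is the service recovering on its own? [y/n] " recovering
if [[ "$recovering" == "n" ]]; then echo "Severity 2 (Major)"; exit 0; fi

read -rp "Does it require user action or involve data loss? [y/n] " impact
if [[ "$impact" == "y" ]]; then
  echo "Severity 2 (Major)"
else
  echo "Severity 3 (Minor)"
fi
```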