Incident Response Runbook

Procedures for responding to and resolving VAPORA production incidents.


Incident Severity Levels

Severity 1: Critical 🔴

Definition: Service completely down or severely degraded affecting all users

Examples:

  • All backend pods crashed
  • Database completely unreachable
  • API returning 100% errors
  • Frontend completely inaccessible

Response Time: Immediate (< 2 minutes)
On-Call: Page immediately (not optional)
Communication: Update status page every 2 minutes

Severity 2: Major 🟠

Definition: Service partially down or significantly degraded

Examples:

  • 50% of requests returning errors
  • Latency 10x normal
  • Some services down but others working
  • Intermittent connectivity issues

Response Time: 5 minutes
On-Call: Alert on-call engineer
Communication: Internal updates every 5 minutes

Severity 3: Minor 🟡

Definition: Service slow or minor issues affecting some users

Examples:

  • 5-10% error rate
  • Elevated latency (2x normal)
  • One pod having issues, others recovering
  • Non-critical features unavailable

Response Time: 15 minutes
On-Call: Alert team, not necessarily emergency page
Communication: Post-incident update

Severity 4: Informational 🟢

Definition: No user impact, system anomalies or preventive issues

Examples:

  • Disk usage trending high
  • SSL cert expiring in 30 days
  • Deployment taking longer than normal
  • Non-critical service warnings

Response Time: During business hours
On-Call: No alert needed
Communication: Team Slack message


Incident Response Process

Step 1: Report & Assess (Immediately)

When incident reported (via alert, user report, or discovery):

# 1. Create incident ticket
# Title: "INCIDENT: [Service] - [Brief description]"
# Example: "INCIDENT: API - 50% error rate since 14:30 UTC"
# Severity: [1-4]
# Reporter: [Your name]
# Time Detected: [UTC time]

# 2. Open dedicated Slack channel
# In Slack: create a dedicated channel, e.g. #incident-20260112-backend
# Then: /invite @on-call-engineer

# 3. Post initial message
# "🔴 INCIDENT DECLARED
#  Service: VAPORA Backend
#  Severity: 1 (Critical)
#  Time Detected: 14:32 UTC
#  Current Status: Unknown
#  Next Update: 14:34 UTC"
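
If you prefer to script the declaration, here is a minimal sketch that fills in the detection time automatically (assumes GNU date for the +2 minute offset; service name and severity are placeholders):

# Sketch: print the initial declaration with the current UTC time
NOW=$(date -u +"%H:%M")
NEXT=$(date -u -d '+2 minutes' +"%H:%M")
cat <<EOF
🔴 INCIDENT DECLARED
Service: VAPORA Backend
Severity: 1 (Critical)
Time Detected: ${NOW} UTC
Current Status: Unknown
Next Update: ${NEXT} UTC
EOF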

Step 2: Quick Diagnosis (First 2 minutes)

# Establish facts quickly
export NAMESPACE=vapora

# Q1: Is the service actually down?
curl -v http://api.vapora.com/health
# If: Connection refused → Service down
# If: 500 errors → Service crashed
# If: Timeout → Service hung

# Q2: What's the scope?
kubectl get pods -n $NAMESPACE
# Count Running vs non-Running pods
# All down → Complete outage
# Some down → Partial outage

# Q3: What's happening right now?
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  echo "=== $deployment ==="
  kubectl get deployment $deployment -n $NAMESPACE
done
# Shows: DESIRED vs CURRENT vs AVAILABLE
# Example: 3 DESIRED, 0 CURRENT, 0 AVAILABLE → Pod startup failure

# Q4: Any obvious errors?
kubectl logs deployment/vapora-backend -n $NAMESPACE --tail=20 | grep -iE "error|fatal"
# Shows: What's in the logs right now

Step 3: Escalate Decision

Based on quick diagnosis, decide next action:

IF pods not starting (CrashLoopBackOff):
  → Likely config issue
  → Check ConfigMap values
  → Likely recent deployment
  → DECISION: Possible rollback

IF pods pending (not scheduled):
  → Likely resource issue
  → Check node capacity
  → DECISION: Scale down workloads or investigate nodes

IF pods running but unresponsive:
  → Likely application issue
  → Check application logs
  → DECISION: Investigate app logic

IF network/database issues:
  → Check connectivity
  → Check credentials
  → DECISION: Infrastructure escalation

IF unknown:
  → Ask: "What changed recently?"
  → Check deployment history
  → Check infrastructure changes
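
A minimal triage sketch that tallies pod states to point at the right branch above (assumes kubectl access and the vapora namespace, with the deployments named in Step 2):

# Sketch: count pod states to suggest which branch applies
export NAMESPACE=vapora
kubectl get pods -n $NAMESPACE --no-headers | awk '
  { states[$3]++ }
  END {
    for (s in states) printf "%-20s %d\n", s, states[s]
    if (states["CrashLoopBackOff"] > 0) print "=> pods not starting: check ConfigMap / recent deployment"
    if (states["Pending"] > 0)          print "=> pods pending: check node capacity"
  }'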

Step 4: Initial Response Actions

For Severity 1 (Critical):

# A. Escalate immediately
- Page senior engineer if not already responding
- Contact infrastructure team
- Notify product/support managers

# B. Buy time with failover if available
- Switch to backup environment if configured
- Scale to different region if multi-region

# C. Gather data for debugging (see the snapshot sketch after this list)
- Save current logs
- Save pod events
- Record current metrics
- Take screenshot of dashboards

# D. Keep team updated
# Update #incident-* channel every 2 minutes
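
To capture the data listed under item C before it rotates away, a minimal snapshot sketch (assumes the deployment names used elsewhere in this runbook):

# Sketch: dump logs, events, and resource usage into a timestamped directory
export NAMESPACE=vapora
DIR="incident-$(date -u +%Y%m%d-%H%M)"
mkdir -p "$DIR"
for d in vapora-backend vapora-agents vapora-llm-router; do
  kubectl logs deployment/$d -n $NAMESPACE --tail=500 > "$DIR/$d.log" 2>&1 || true
done
kubectl get events -n $NAMESPACE --sort-by=.lastTimestamp > "$DIR/events.txt"
kubectl get pods -n $NAMESPACE -o wide > "$DIR/pods.txt"
kubectl top pods -n $NAMESPACE > "$DIR/top-pods.txt" 2>&1 || true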

For Severity 2 (Major):

# A. Alert on-call team
# B. Gather same diagnostics
# C. Start investigation
# D. Update every 5 minutes

For Severity 3 (Minor):

# A. Create ticket for later investigation
# B. Monitor closely
# C. Gather diagnostics
# D. Plan fix during normal hours if not urgent

Step 5: Detailed Diagnosis

Once immediate actions taken:

# Get comprehensive view of system state
kubectl describe node <nodename>      # Hardware/capacity issues
kubectl describe pod <podname> -n $NAMESPACE  # Pod-specific issues
kubectl get events -n $NAMESPACE --sort-by=.lastTimestamp   # What happened recently
kubectl top nodes                     # CPU/memory usage
kubectl top pods -n $NAMESPACE        # Per-pod resource usage

# Check recent changes
git log -5 --oneline
git diff HEAD~1 HEAD provisioning/

# Check deployment history
kubectl rollout history deployment/vapora-backend -n $NAMESPACE | tail -5

# Timeline analysis
# What happened at 14:30 UTC? (incident time)
# Was there a deployment?
# Did metrics change suddenly?
# Any alerts triggered?
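
For the timeline questions, ReplicaSet creation times show when each deployment rolled out, and time-sorted events show what the cluster recorded around the incident window (sketch):

# When did each deployment last roll out? (newest ReplicaSets at the bottom)
kubectl get replicasets -n $NAMESPACE --sort-by=.metadata.creationTimestamp

# What did the cluster record around the incident time?
kubectl get events -n $NAMESPACE --sort-by=.lastTimestamp | tail -30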

Step 6: Implement Fix

Depending on root cause:

Root Cause: Recent Bad Deployment

# Solution: Rollback
# See: Rollback Runbook
kubectl rollout undo deployment/vapora-backend -n $NAMESPACE
kubectl rollout status deployment/vapora-backend -n $NAMESPACE --timeout=5m

# Verify
curl http://localhost:8001/health   # assumes a local port-forward to the backend service

Root Cause: Insufficient Resources

# Solution: Either scale out or reduce load

# Option A: Add more nodes
# Note: kubectl cannot scale nodes directly. Add capacity through your
# cloud provider's node pool or cluster autoscaler (requires infrastructure access).

# Option B: Scale down non-critical services
kubectl scale deployment/vapora-agents --replicas=1 -n $NAMESPACE
# Then scale back up when resolved

# Option C: Temporarily scale down pod replicas
kubectl scale deployment/vapora-backend --replicas=2 -n $NAMESPACE
# (Trade: Reduced capacity but faster recovery)
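
Before choosing between these options, confirm whether the cluster is actually short on allocatable capacity (sketch):

# Per-node usage vs capacity
kubectl top nodes

# Requests/limits already allocated on each node
kubectl describe nodes | grep -A 5 "Allocated resources"

# Pods stuck Pending usually name the shortage in their events
kubectl get pods -n $NAMESPACE --field-selector=status.phase=Pending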

Root Cause: Configuration Error

# Solution: Fix ConfigMap

# 1. Identify wrong value
kubectl get configmap -n $NAMESPACE vapora-config -o yaml | grep -A 2 <suspicious-key>

# 2. Fix value
# Edit configmap in external editor or via kubectl patch:
kubectl patch configmap vapora-config -n $NAMESPACE \
  --type merge \
  -p '{"data":{"vapora.toml":"[corrected content]"}}'

# 3. Restart pods to pick up new config
kubectl rollout restart deployment/vapora-backend -n $NAMESPACE
kubectl rollout status deployment/vapora-backend -n $NAMESPACE --timeout=5m

Root Cause: Database Issues

# Solution: Depends on specific issue

# If database down:
- Contact DBA or database team
- Check database status: kubectl exec <pod> -- curl localhost:8000

# If credentials wrong:
kubectl patch configmap vapora-config -n $NAMESPACE \
  --type merge \
  -p '{"data":{"DB_PASSWORD":"[correct-password]"}}'

# If database full:
- Contact DBA for cleanup
- Free up space on database volume

# If connection pool exhausted:
- Scale down services to reduce connections
- Increase connection pool size if possible

Root Cause: External Service Down

# Examples: Third-party API, external database

# Solution: Depends on severity

# If critical: Failover
- Switch to backup provider if available
- Route traffic differently

# If non-critical: Degrade gracefully
- Disable feature temporarily
- Use cache if available
- Return cached data

# Communicate
- Notify users of reduced functionality
- Provide ETA for restoration

Step 7: Verify Recovery

# Once fix applied, verify systematically

# 1. Pod health
kubectl get pods -n $NAMESPACE
# All should show: Running, 1/1 Ready

# 2. Service endpoints
kubectl get endpoints -n $NAMESPACE
# All should have IP addresses

# 3. Health endpoints
curl http://localhost:8001/health
# Should return: 200 OK

# 4. Check errors
kubectl logs deployment/vapora-backend -n $NAMESPACE --since=2m | grep -i error
# Should return: few or no errors

# 5. Monitor metrics
kubectl top pods -n $NAMESPACE
# CPU/Memory should be normal (not spiking)

# 6. Check for new issues
kubectl get events -n $NAMESPACE
# Should show normal state, no warnings
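
If you want to watch recovery rather than spot-check it, a short polling sketch (assumes the public health endpoint from Step 2; swap in a port-forwarded URL if needed):

# Sketch: poll the health endpoint for ~5 minutes and log the status codes
for i in $(seq 1 10); do
  code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 http://api.vapora.com/health)
  echo "$(date -u +%H:%M:%S) attempt $i: HTTP $code"
  sleep 30
done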

Step 8: Incident Closure

# When everything verified healthy:

# 1. Document resolution
# Update incident ticket with:
# - Root cause
# - Fix applied
# - Verification steps
# - Resolution time
# - Impact (how many users, how long)

# 2. Post final update
# "#incident channel:
#  ✅ INCIDENT RESOLVED
#
#  Duration: [start] to [end] = [X minutes]
#  Root Cause: [brief description]
#  Fix Applied: [brief description]
#  Impact: ~X users affected for X minutes
#
#  Status: All services healthy
#  Monitoring: Continuing for 1 hour
#  Post-mortem: Scheduled for [date]"

# 3. Schedule post-mortem
# Within 24 hours: review what happened and why
# Document lessons learned

# 4. Update dashboards
# Document incident on status page history
# If public incident: close status page incident

# 5. Send all-clear message
# Notify: support team, product team, key stakeholders

Incident Response Roles & Responsibilities

Incident Commander

  • Overall control of incident response
  • Makes critical decisions
  • Drives decision-making speed
  • Communicates status updates
  • Calls when to escalate
  • Usually the person who discovered the incident and understands it best

Technical Responders

  • Investigate specific systems
  • Implement fixes
  • Report findings to commander
  • Execute verified solutions

Communication Lead (if Severity 1)

  • Updates #incident channel every 2 minutes
  • Updates status page every 5 minutes
  • Fields questions from support/product
  • Notifies key stakeholders

On-Call Manager (if Severity 1)

  • Pages additional resources if needed
  • Escalates to senior engineers
  • Engages infrastructure/DBA teams
  • Tracks response timeline

Common Incidents & Responses

Incident Type: Service Unresponsive

Detection: curl returns "Connection refused"
Diagnosis Time: 1 minute
Response:
1. Check if pods are running: kubectl get pods -n vapora
2. If not running: likely crash → check logs
3. If running but unresponsive: likely port/network issue
4. Verify service exists: kubectl get service vapora-backend -n vapora

Solution:
- If pods crashed: check logs, likely config or deployment issue
- If pods hanging: restart pods: kubectl delete pods -l app=vapora-backend -n vapora
- If service/endpoints missing: apply service manifest

Incident Type: High Error Rate

Detection: Dashboard shows >10% 5xx errors
Diagnosis Time: 2 minutes
Response:
1. Check which endpoint is failing
2. Check logs for error pattern
3. Identify affected service (backend, agents, router)
4. Compare with baseline (worked X minutes ago)

Solution:
- If recent deployment: rollback
- If config change: revert config
- If database issue: contact DBA
- If third-party down: implement fallback
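
To put a rough number on the error rate from the pod side, a sketch (assumes error lines are greppable in the backend logs, which may not match your exact log format):

# Rough error ratio over the last 5 minutes of backend logs
TOTAL=$(kubectl logs deployment/vapora-backend -n vapora --since=5m | wc -l)
ERRORS=$(kubectl logs deployment/vapora-backend -n vapora --since=5m | grep -ci "error")
echo "error lines: $ERRORS / $TOTAL"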

Incident Type: High Latency

Detection: Dashboard shows p99 latency >2 seconds
Diagnosis Time: 2 minutes
Response:
1. Check if requests still succeeding (is it slow or failing?)
2. Check CPU/memory usage: kubectl top pods
3. Check if database slow: run query diagnostics
4. Check network: are there packet losses?

Solution:
- If resource exhausted: scale up or reduce load
- If database slow: DBA investigation
- If network issue: infrastructure team
- If legitimate increased load: no action needed (expected)

Incident Type: Pod Restarting Repeatedly

Detection: kubectl get pods shows high RESTARTS count
Diagnosis Time: 1 minute
Response:
1. Check restart count: kubectl get pods -n vapora
2. Get pod logs: kubectl logs <pod-name> -n vapora --previous
3. Get pod events: kubectl describe pod <pod-name> -n vapora

Solution:
- Application error: check logs, fix issue, redeploy
- Config issue: fix ConfigMap, restart pods
- Resource issue: increase limits or scale out
- Liveness probe failing: adjust probe timing or fix health check
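
The container's last termination record usually names the cause (OOMKilled, Error, etc.); a quick sketch, assuming a single container per pod:

# Why did the container last die? (reason + exit code)
kubectl get pod <pod-name> -n vapora \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{" exit="}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'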

Incident Type: Database Connectivity

Detection: Logs show "database connection refused"
Diagnosis Time: 2 minutes
Response:
1. Check database service running: kubectl get pod -n <db-namespace>
2. Check database credentials in ConfigMap
3. Test connectivity: kubectl exec <pod> -- psql $DB_URL
4. Check firewall/network policy

Solution:
- If DB down: escalate to DBA, possibly restore from backup
- If credentials wrong: fix ConfigMap, restart app pods
- If network issue: network team investigation
- If no space: DBA cleanup

Communication During Incident

Every 2 Minutes (Severity 1) or 5 Minutes (Severity 2)

Post update to #incident channel:

⏱️ 14:35 UTC UPDATE

Status: Investigating
Current Action: Checking pod logs
Findings: Backend pods in CrashLoopBackOff
Next Step: Review recent deployment
ETA for Update: 14:37 UTC

/cc @on-call-engineer

Status Page Updates (If Public)

INCIDENT: VAPORA API Partially Degraded

Investigating: Our team is investigating elevated error rates
Duration: 5 minutes
Impact: ~30% of API requests failing

Last Updated: 14:35 UTC
Next Update: 14:37 UTC

Escalation Communication

If Severity 1 and unable to identify cause in 5 minutes:

"Escalating to senior engineering team.
Page @senior-engineer-on-call immediately.
Activating Incident War Room."

Include:
- Service name
- Duration so far
- What's been tried
- Current symptoms
- Why stuck

Incident Severity Decision Tree

Question 1: Can any users access the service?
  NO → Severity 1 (Critical - complete outage)
  YES → Question 2

Question 2: What percentage of requests are failing?
  >50% → Severity 1 (Critical)
  10-50% → Severity 2 (Major)
  5-10% → Severity 3 (Minor)
  <5% → Question 3

Question 3: Is the service recovering on its own?
  NO (staying broken) → Severity 2
  YES (automatically recovering) → Question 4

Question 4: Does it require any user action/data loss?
  YES → Severity 2
  NO → Severity 3
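
The error-rate branch of this tree (Question 2) as a small shell helper for quick reference during triage (sketch; thresholds copied from above):

# Map a failing-request percentage to a severity per Question 2
severity_from_error_rate() {
  local pct=$1   # integer percentage of failing requests
  if   [ "$pct" -gt 50 ]; then echo "Severity 1 (Critical)"
  elif [ "$pct" -ge 10 ]; then echo "Severity 2 (Major)"
  elif [ "$pct" -ge 5  ]; then echo "Severity 3 (Minor)"
  else                         echo "Below 5% -- continue to Question 3"
  fi
}

severity_from_error_rate 30   # => Severity 2 (Major)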

Post-Incident Procedures

Immediate (Within 30 minutes)

  • Close incident ticket
  • Post final update to #incident channel
  • Save all logs and diagnostics
  • Create post-mortem ticket
  • Notify team: "incident resolved"

Follow-Up (Within 24 hours)

  • Schedule post-mortem meeting
  • Identify root cause
  • Document preventive measures
  • Identify owner for each action item
  • Create tickets for improvements

Prevention (Within 1 week)

  • Implement identified fixes
  • Update monitoring/alerting
  • Update runbooks with findings
  • Conduct team training if needed
  • Close post-mortem ticket

Incident Checklist

☐ Incident severity determined
☐ Ticket created and updated
☐ #incident channel created
☐ On-call team alerted
☐ Initial diagnosis completed
☐ Fix identified and implemented
☐ Fix verified working
☐ Incident closed and communicated
☐ Post-mortem scheduled
☐ Team debriefed
☐ Root cause documented
☐ Prevention measures identified
☐ Tickets created for follow-up