# Incident Response Runbook

Procedures for responding to and resolving VAPORA production incidents.

---

## Incident Severity Levels

### Severity 1: Critical 🔴

**Definition**: Service completely down or severely degraded, affecting all users

**Examples**:

- All backend pods crashed
- Database completely unreachable
- API returning 100% errors
- Frontend completely inaccessible

**Response Time**: Immediate (< 2 minutes)
**On-Call**: Page immediately (not optional)
**Communication**: Update status page every 2 minutes

### Severity 2: Major 🟠

**Definition**: Service partially down or significantly degraded

**Examples**:

- 50% of requests returning errors
- Latency 10x normal
- Some services down but others working
- Intermittent connectivity issues

**Response Time**: 5 minutes
**On-Call**: Alert on-call engineer
**Communication**: Internal updates every 5 minutes

### Severity 3: Minor 🟡

**Definition**: Service slow or minor issues affecting some users

**Examples**:

- 5-10% error rate
- Elevated latency (2x normal)
- One pod having issues while others remain healthy
- Non-critical features unavailable

**Response Time**: 15 minutes
**On-Call**: Alert team, not necessarily emergency page
**Communication**: Post-incident update

### Severity 4: Informational 🟢

**Definition**: No user impact; system anomalies or preventive maintenance items

**Examples**:

- Disk usage trending high
- SSL cert expiring in 30 days
- Deployment taking longer than normal
- Non-critical service warnings

**Response Time**: During business hours
**On-Call**: No alert needed
**Communication**: Team Slack message

---

## Incident Response Process

### Step 1: Report & Assess (Immediately)

When an incident is reported (via alert, user report, or discovery):

```bash
# 1. Create incident ticket
# Title: "INCIDENT: [Service] - [Brief description]"
# Example: "INCIDENT: API - 50% error rate since 14:30 UTC"
# Severity: [1-4]
# Reporter: [Your name]
# Time Detected: [UTC time]

# 2. Open dedicated Slack channel
# In Slack: /create #incident-20260112-backend
# Then: /invite @on-call-engineer

# 3. Post initial message
# "🔴 INCIDENT DECLARED
# Service: VAPORA Backend
# Severity: 1 (Critical)
# Time Detected: 14:32 UTC
# Current Status: Unknown
# Next Update: 14:34 UTC"
```
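
If the team uses Slack incoming webhooks for announcements, the initial message can also be posted from the terminal. This is a minimal sketch, not part of the standard tooling: the `SLACK_WEBHOOK_URL` value is an assumed placeholder for whatever webhook your workspace has configured.

```bash
# Hedged sketch: post the initial incident message via a Slack incoming webhook.
# SLACK_WEBHOOK_URL is a placeholder; posting manually in the channel works just as well.
SLACK_WEBHOOK_URL="https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

curl -sS -X POST "$SLACK_WEBHOOK_URL" \
  -H 'Content-Type: application/json' \
  -d '{"text":"🔴 INCIDENT DECLARED\nService: VAPORA Backend\nSeverity: 1 (Critical)\nTime Detected: 14:32 UTC\nCurrent Status: Unknown\nNext Update: 14:34 UTC"}'
```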

### Step 2: Quick Diagnosis (First 2 minutes)

```bash
# Establish facts quickly
export NAMESPACE=vapora

# Q1: Is the service actually down?
curl -v http://api.vapora.com/health
# If: Connection refused → Service down
# If: 500 errors → Service crashed
# If: Timeout → Service hung

# Q2: What's the scope?
kubectl get pods -n $NAMESPACE
# Count Running vs non-Running pods
# All down → Complete outage
# Some down → Partial outage

# Q3: What's happening right now?
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  echo "=== $deployment ==="
  kubectl get deployment $deployment -n $NAMESPACE
done
# Shows: READY (current/desired), UP-TO-DATE, AVAILABLE
# Example: READY 0/3, AVAILABLE 0 → Pod startup failure

# Q4: Any obvious errors?
kubectl logs deployment/vapora-backend -n $NAMESPACE --tail=20 | grep -i "error\|fatal"
# Shows: What's in the logs right now
```
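
To answer Q2-Q4 in one pass, the checks above can be condensed into a short summary, sketched below with plain kubectl plus awk (no project-specific tooling assumed):

```bash
# Quick triage summary: pod phases, restart counts, and recent warning events.
export NAMESPACE=vapora

# Pods grouped by status (Running, CrashLoopBackOff, Pending, ...)
kubectl get pods -n "$NAMESPACE" --no-headers | awk '{print $3}' | sort | uniq -c

# Pods with restarts, highest first
kubectl get pods -n "$NAMESPACE" --no-headers | awk '$4 > 0 {print $4, $1}' | sort -rn | head

# Most recent Warning events in the namespace
kubectl get events -n "$NAMESPACE" --field-selector type=Warning --sort-by=.lastTimestamp | tail -10
```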

### Step 3: Escalate Decision

Based on the quick diagnosis, decide the next action:

```
IF pods not starting (CrashLoopBackOff):
  → Likely config issue or a bad recent deployment
  → Check ConfigMap values
  → Check recent deployment history
  → DECISION: Possible rollback

IF pods pending (not scheduled):
  → Likely resource issue
  → Check node capacity
  → DECISION: Scale down workloads or investigate nodes

IF pods running but unresponsive:
  → Likely application issue
  → Check application logs
  → DECISION: Investigate app logic

IF network/database issues:
  → Check connectivity
  → Check credentials
  → DECISION: Infrastructure escalation

IF unknown:
  → Ask: "What changed recently?"
  → Check deployment history
  → Check infrastructure changes
```
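
To pick the right branch, it helps to see why pods are not Running before deciding. A small sketch using only standard kubectl output; `<pending-pod-name>` is a placeholder:

```bash
# Show the container waiting reason per pod
# (e.g. CrashLoopBackOff, ImagePullBackOff, CreateContainerConfigError)
kubectl get pods -n $NAMESPACE \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}'

# For Pending pods, the scheduler explains the blocker in the pod events
kubectl describe pod <pending-pod-name> -n $NAMESPACE | grep -A 10 "Events:"
```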

### Step 4: Initial Response Actions

**For Severity 1 (Critical)**:

```bash
# A. Escalate immediately
# - Page senior engineer if not already responding
# - Contact infrastructure team
# - Notify product/support managers

# B. Buy time with failover if available
# - Switch to backup environment if configured
# - Scale to different region if multi-region

# C. Gather data for debugging
# - Save current logs
# - Save pod events
# - Record current metrics
# - Take screenshot of dashboards

# D. Keep team updated
# Update #incident-* channel every 2 minutes
```
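
For step C, a small capture script preserves the evidence even if pods get replaced while you fix things. A sketch with standard kubectl commands; it only assumes write access to the current directory:

```bash
# Snapshot the current state into a timestamped directory before changing anything.
SNAP="incident-$(date -u +%Y%m%dT%H%M%SZ)"
mkdir -p "$SNAP"

kubectl get pods -n $NAMESPACE -o wide                     > "$SNAP/pods.txt"
kubectl get events -n $NAMESPACE --sort-by=.lastTimestamp  > "$SNAP/events.txt"
kubectl top pods -n $NAMESPACE                             > "$SNAP/top-pods.txt" || true

for d in vapora-backend vapora-agents vapora-llm-router; do
  kubectl logs deployment/$d -n $NAMESPACE --tail=500 > "$SNAP/$d.log"          || true
  kubectl describe deployment/$d -n $NAMESPACE        > "$SNAP/$d-describe.txt" || true
done

echo "Diagnostics saved to $SNAP/"
```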

**For Severity 2 (Major)**:

```bash
# A. Alert on-call team
# B. Gather same diagnostics
# C. Start investigation
# D. Update every 5 minutes
```

**For Severity 3 (Minor)**:

```bash
# A. Create ticket for later investigation
# B. Monitor closely
# C. Gather diagnostics
# D. Plan fix during normal hours if not urgent
```

### Step 5: Detailed Diagnosis

Once immediate actions are taken:

```bash
# Get comprehensive view of system state
kubectl describe node <nodename>              # Hardware/capacity issues
kubectl describe pod <podname> -n $NAMESPACE  # Pod-specific issues
kubectl get events -n $NAMESPACE              # What happened recently
kubectl top nodes                             # CPU/memory usage
kubectl top pods -n $NAMESPACE                # Per-pod resource usage

# Check recent changes
git log -5 --oneline
git diff HEAD~1 HEAD provisioning/

# Check deployment history
kubectl rollout history deployment/vapora-backend -n $NAMESPACE | tail -5

# Timeline analysis
# What happened at 14:30 UTC? (incident time)
# Was there a deployment?
# Did metrics change suddenly?
# Any alerts triggered?
```
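
For the timeline questions, ReplicaSet creation times are a quick proxy for when each deployment last rolled out; compare them against the incident start time. A sketch:

```bash
# When did each deployment last roll out? Newest ReplicaSets last.
kubectl get replicasets -n $NAMESPACE \
  --sort-by=.metadata.creationTimestamp \
  -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp,READY:.status.readyReplicas

# Namespace events around the incident window, oldest to newest
kubectl get events -n $NAMESPACE --sort-by=.lastTimestamp | tail -30
```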

### Step 6: Implement Fix

Depending on root cause:

#### Root Cause: Recent Bad Deployment

```bash
# Solution: Rollback
# See: Rollback Runbook
kubectl rollout undo deployment/vapora-backend -n $NAMESPACE
kubectl rollout status deployment/vapora-backend -n $NAMESPACE --timeout=5m

# Verify (assumes the backend is reachable locally, e.g. via a port-forward)
curl http://localhost:8001/health
```
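
If the immediately previous revision is also suspect, roll back to a specific known-good revision instead. A sketch (the revision number 3 is only an example):

```bash
# List revisions, inspect one, then roll back to it explicitly.
kubectl rollout history deployment/vapora-backend -n $NAMESPACE
kubectl rollout history deployment/vapora-backend -n $NAMESPACE --revision=3

kubectl rollout undo deployment/vapora-backend -n $NAMESPACE --to-revision=3
kubectl rollout status deployment/vapora-backend -n $NAMESPACE --timeout=5m
```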

#### Root Cause: Insufficient Resources

```bash
# Solution: Either scale out or reduce load

# Option A: Add more nodes
# There is no single kubectl command for this; add nodes via your cloud provider,
# node pool, or cluster autoscaler (requires infrastructure access)

# Option B: Scale down non-critical services
kubectl scale deployment/vapora-agents --replicas=1 -n $NAMESPACE
# Then scale back up when resolved

# Option C: Temporarily scale down pod replicas
kubectl scale deployment/vapora-backend --replicas=2 -n $NAMESPACE
# (Trade: Reduced capacity but faster recovery)
```
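
Before choosing an option, confirm the cluster is actually out of headroom and check whether requests (rather than real usage) are the blocker. A sketch:

```bash
# Actual usage per node
kubectl top nodes

# Requested vs allocatable per node (high requests can block scheduling even with idle CPU)
kubectl describe nodes | grep -A 8 "Allocated resources"

# Which pods are stuck Pending
kubectl get pods -n $NAMESPACE --field-selector=status.phase=Pending
```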

#### Root Cause: Configuration Error

```bash
# Solution: Fix ConfigMap

# 1. Identify wrong value
kubectl get configmap -n $NAMESPACE vapora-config -o yaml | grep -A 2 <suspicious-key>

# 2. Fix value
# Edit the ConfigMap in an external editor or via kubectl patch:
kubectl patch configmap vapora-config -n $NAMESPACE \
  --type merge \
  -p '{"data":{"vapora.toml":"[corrected content]"}}'

# 3. Restart pods to pick up new config
kubectl rollout restart deployment/vapora-backend -n $NAMESPACE
kubectl rollout status deployment/vapora-backend -n $NAMESPACE --timeout=5m
```
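
If the ConfigMap is generated from a file kept in the repo, regenerating it is usually safer than hand-editing JSON in a patch. A sketch, assuming the corrected file lives at `provisioning/config/vapora.toml` and is mounted at `/etc/vapora/vapora.toml` (both paths are assumptions; adjust to your layout):

```bash
# Regenerate the ConfigMap from the corrected file (paths are assumptions).
kubectl create configmap vapora-config -n $NAMESPACE \
  --from-file=vapora.toml=provisioning/config/vapora.toml \
  --dry-run=client -o yaml | kubectl apply -f -

# After the restart, confirm the running pods actually see the new value
# (mount path is an assumption).
kubectl exec deployment/vapora-backend -n $NAMESPACE -- head /etc/vapora/vapora.toml
```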

#### Root Cause: Database Issues

```bash
# Solution: Depends on specific issue

# If database down:
# - Contact DBA or database team
# - Check database status: kubectl exec <pod> -- curl localhost:8000

# If credentials wrong:
kubectl patch configmap vapora-config -n $NAMESPACE \
  --type merge \
  -p '{"data":{"DB_PASSWORD":"[correct-password]"}}'

# If database full:
# - Contact DBA for cleanup
# - Free up space on database volume

# If connection pool exhausted:
# - Scale down services to reduce connections
# - Increase connection pool size if possible
```
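
A quick way to separate "database down" from "credentials wrong" is to test reachability from inside the cluster. A sketch, assuming a Postgres-compatible database; `<db-host>` and `<db-port>` are placeholders taken from the ConfigMap:

```bash
# Reachability check from inside the cluster using a throwaway client pod
# (the postgres image and pg_isready are assumptions; swap in your database's client).
kubectl run db-conn-test --rm -it --restart=Never --image=postgres:16 -n $NAMESPACE \
  -- pg_isready -h <db-host> -p <db-port>

# If the server is reachable but the app still fails, suspect credentials/auth
kubectl logs deployment/vapora-backend -n $NAMESPACE --tail=50 | grep -i "auth\|password\|denied"
```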

#### Root Cause: External Service Down

```bash
# Examples: Third-party API, external database

# Solution: Depends on severity

# If critical: Failover
# - Switch to backup provider if available
# - Route traffic differently

# If non-critical: Degrade gracefully
# - Disable feature temporarily
# - Use cache if available
# - Return cached data

# Communicate
# - Notify users of reduced functionality
# - Provide ETA for restoration
```
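
To confirm the failure really is the third party and not your own egress, probe the dependency from inside the cluster and from a workstation and compare. A sketch; the URL is a placeholder and the backend image is assumed to contain curl:

```bash
# Probe from inside the cluster (URL is a placeholder; assumes curl exists in the image)
kubectl exec deployment/vapora-backend -n $NAMESPACE -- \
  curl -sS -o /dev/null -w "HTTP %{http_code} in %{time_total}s\n" \
  https://api.example-provider.com/health

# Same probe from your workstation for comparison
curl -sS -o /dev/null -w "HTTP %{http_code} in %{time_total}s\n" \
  https://api.example-provider.com/health
```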

### Step 7: Verify Recovery

```bash
# Once fix applied, verify systematically

# 1. Pod health
kubectl get pods -n $NAMESPACE
# All should show: Running, 1/1 Ready

# 2. Service endpoints
kubectl get endpoints -n $NAMESPACE
# All should have IP addresses

# 3. Health endpoints
curl http://localhost:8001/health
# Should return: 200 OK

# 4. Check errors
kubectl logs deployment/vapora-backend -n $NAMESPACE --since=2m | grep -i error
# Should return: few or no errors

# 5. Monitor metrics
kubectl top pods -n $NAMESPACE
# CPU/Memory should be normal (not spiking)

# 6. Check for new issues
kubectl get events -n $NAMESPACE
# Should show normal state, no warnings
```
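
The same checks can be scripted so that recovery is confirmed for every deployment, not just the one that failed. A sketch (the health probe assumes the backend is reachable locally on port 8001, e.g. via a port-forward):

```bash
# Wait until every VAPORA deployment reports Available, then re-test health.
for d in vapora-backend vapora-agents vapora-llm-router; do
  kubectl wait --for=condition=available deployment/$d -n $NAMESPACE --timeout=5m
done

# Several consecutive healthy responses before declaring recovery
for i in 1 2 3 4 5; do
  curl -sS -o /dev/null -w "attempt $i: HTTP %{http_code}\n" http://localhost:8001/health
  sleep 5
done
```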

### Step 8: Incident Closure

```bash
# When everything verified healthy:

# 1. Document resolution
# Update incident ticket with:
# - Root cause
# - Fix applied
# - Verification steps
# - Resolution time
# - Impact (how many users, how long)

# 2. Post final update
# "#incident channel:
# ✅ INCIDENT RESOLVED
#
# Duration: [start] to [end] = [X minutes]
# Root Cause: [brief description]
# Fix Applied: [brief description]
# Impact: ~X users affected for X minutes
#
# Status: All services healthy
# Monitoring: Continuing for 1 hour
# Post-mortem: Scheduled for [date]"

# 3. Schedule post-mortem
# Within 24 hours: review what happened and why
# Document lessons learned

# 4. Update dashboards
# Document incident on status page history
# If public incident: close status page incident

# 5. Send all-clear message
# Notify: support team, product team, key stakeholders
```

---

## Incident Response Roles & Responsibilities

### Incident Commander

- Overall control of incident response
- Makes critical decisions
- Drives decision-making speed
- Communicates status updates
- Calls when to escalate
- **You**, if you discovered the incident and understand it best

### Technical Responders

- Investigate specific systems
- Implement fixes
- Report findings to commander
- Execute verified solutions

### Communication Lead (if Severity 1)

- Updates #incident channel every 2 minutes
- Updates status page every 5 minutes
- Fields questions from support/product
- Notifies key stakeholders

### On-Call Manager (if Severity 1)

- Pages additional resources if needed
- Escalates to senior engineers
- Engages infrastructure/DBA teams
- Tracks response timeline

---

## Common Incidents & Responses

### Incident Type: Service Unresponsive

```
Detection: curl returns "Connection refused"
Diagnosis Time: 1 minute
Response:
  1. Check if pods are running: kubectl get pods
  2. If not running: likely crash → check logs
  3. If running but unresponsive: likely port/network issue
  4. Verify service exists: kubectl get service vapora-backend

Solution:
  - If pods crashed: check logs, likely config or deployment issue
  - If pods hanging: restart pods: kubectl delete pods -l app=vapora-backend
  - If service/endpoints missing: apply service manifest
```

### Incident Type: High Error Rate

```
Detection: Dashboard shows >10% 5xx errors
Diagnosis Time: 2 minutes
Response:
  1. Check which endpoint is failing
  2. Check logs for error pattern
  3. Identify affected service (backend, agents, router)
  4. Compare with baseline (worked X minutes ago)

Solution:
  - If recent deployment: rollback
  - If config change: revert config
  - If database issue: contact DBA
  - If third-party down: implement fallback
```

### Incident Type: High Latency

```
Detection: Dashboard shows p99 latency >2 seconds
Diagnosis Time: 2 minutes
Response:
  1. Check if requests still succeeding (is it slow or failing?)
  2. Check CPU/memory usage: kubectl top pods
  3. Check if database slow: run query diagnostics
  4. Check network: are there packet losses?

Solution:
  - If resource exhausted: scale up or reduce load
  - If database slow: DBA investigation
  - If network issue: infrastructure team
  - If legitimate increased load: no action needed (expected)
```
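
A quick way to tell "slow" from "failing" without waiting on the dashboard is a short curl timing loop against the health endpoint used earlier in this runbook; a sketch:

```bash
# Sample end-to-end latency ten times; look at the spread, not a single value.
for i in $(seq 1 10); do
  curl -sS -o /dev/null -w "%{http_code} %{time_total}s\n" http://api.vapora.com/health
done

# Check whether the pods themselves are saturated
kubectl top pods -n vapora --sort-by=cpu
```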

### Incident Type: Pod Restarting Repeatedly

```
Detection: kubectl get pods shows high RESTARTS count
Diagnosis Time: 1 minute
Response:
  1. Check restart count: kubectl get pods -n vapora
  2. Get pod logs: kubectl logs <pod-name> -n vapora --previous
  3. Get pod events: kubectl describe pod <pod-name> -n vapora

Solution:
  - Application error: check logs, fix issue, redeploy
  - Config issue: fix ConfigMap, restart pods
  - Resource issue: increase limits or scale out
  - Liveness probe failing: adjust probe timing or fix health check
```

### Incident Type: Database Connectivity

```
Detection: Logs show "database connection refused"
Diagnosis Time: 2 minutes
Response:
  1. Check the database pod is running: kubectl get pods -n <db-namespace>
  2. Check database credentials in ConfigMap
  3. Test connectivity: kubectl exec <pod> -- psql $DB_URL
  4. Check firewall/network policy

Solution:
  - If DB down: escalate to DBA, possibly restore from backup
  - If credentials wrong: fix ConfigMap, restart app pods
  - If network issue: network team investigation
  - If no space: DBA cleanup
```
---

## Communication During Incident

### Every 2 Minutes (Severity 1) or 5 Minutes (Severity 2)

Post update to #incident channel:

```
⏱️ 14:35 UTC UPDATE

Status: Investigating
Current Action: Checking pod logs
Findings: Backend pods in CrashLoopBackOff
Next Step: Review recent deployment
ETA for Update: 14:37 UTC

/cc @on-call-engineer
```

### Status Page Updates (If Public)

```
INCIDENT: VAPORA API Partially Degraded

Investigating: Our team is investigating elevated error rates
Duration: 5 minutes
Impact: ~30% of API requests failing

Last Updated: 14:35 UTC
Next Update: 14:37 UTC
```

### Escalation Communication

```
If Severity 1 and the cause cannot be identified within 5 minutes:

"Escalating to senior engineering team.
Page @senior-engineer-on-call immediately.
Activating Incident War Room."

Include:
  - Service name
  - Duration so far
  - What's been tried
  - Current symptoms
  - Why you are stuck
```
---

## Incident Severity Decision Tree

```
Question 1: Can any users access the service?
  NO  → Severity 1 (Critical - complete outage)
  YES → Question 2

Question 2: What percentage of requests are failing?
  >50%   → Severity 1 (Critical)
  10-50% → Severity 2 (Major)
  5-10%  → Severity 3 (Minor)
  <5%    → Question 3

Question 3: Is the service recovering on its own?
  NO (staying broken) → Severity 2
  YES (automatically recovering) → Question 4

Question 4: Does it require user action or involve data loss?
  YES → Severity 2
  NO  → Severity 3
```
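
The error-rate branch of the tree can be applied mechanically once the dashboard gives you a failing-request percentage. A minimal sketch; the thresholds simply mirror the tree above:

```bash
# Map an observed error-rate percentage to a starting severity (mirrors the tree above).
# Usage: severity.sh 35   → Severity 2 (Major)
ERR_PCT="${1:?usage: severity.sh <error-rate-percent>}"

if [ "$ERR_PCT" -gt 50 ]; then
  echo "Severity 1 (Critical)"
elif [ "$ERR_PCT" -ge 10 ]; then
  echo "Severity 2 (Major)"
elif [ "$ERR_PCT" -ge 5 ]; then
  echo "Severity 3 (Minor)"
else
  echo "Below 5%: continue to Question 3"
fi
```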
---

## Post-Incident Procedures

### Immediate (Within 30 minutes)

- [ ] Close incident ticket
- [ ] Post final update to #incident channel
- [ ] Save all logs and diagnostics
- [ ] Create post-mortem ticket
- [ ] Notify team: "incident resolved"

### Follow-Up (Within 24 hours)

- [ ] Schedule post-mortem meeting
- [ ] Identify root cause
- [ ] Document preventive measures
- [ ] Identify owner for each action item
- [ ] Create tickets for improvements

### Prevention (Within 1 week)

- [ ] Implement identified fixes
- [ ] Update monitoring/alerting
- [ ] Update runbooks with findings
- [ ] Conduct team training if needed
- [ ] Close post-mortem ticket

---

## Incident Checklist

```
☐ Incident severity determined
☐ Ticket created and updated
☐ #incident channel created
☐ On-call team alerted
☐ Initial diagnosis completed
☐ Fix identified and implemented
☐ Fix verified working
☐ Incident closed and communicated
☐ Post-mortem scheduled
☐ Team debriefed
☐ Root cause documented
☐ Prevention measures identified
☐ Tickets created for follow-up
```