# Incident Response Runbook
Procedures for responding to and resolving VAPORA production incidents.
---
## Incident Severity Levels
### Severity 1: Critical 🔴
**Definition**: Service completely down or severely degraded affecting all users
**Examples**:
- All backend pods crashed
- Database completely unreachable
- API returning 100% errors
- Frontend completely inaccessible
**Response Time**: Immediate (< 2 minutes)
**On-Call**: Page immediately (not optional)
**Communication**: Update status page every 2 minutes
### Severity 2: Major 🟠
**Definition**: Service partially down or significantly degraded
**Examples**:
- 50% of requests returning errors
- Latency 10x normal
- Some services down but others working
- Intermittent connectivity issues
**Response Time**: 5 minutes
**On-Call**: Alert on-call engineer
**Communication**: Internal updates every 5 minutes
### Severity 3: Minor 🟡
**Definition**: Service slow or minor issues affecting some users
**Examples**:
- 5-10% error rate
- Elevated latency (2x normal)
- One pod having issues, others recovering
- Non-critical features unavailable
**Response Time**: 15 minutes
**On-Call**: Alert team, not necessarily emergency page
**Communication**: Post-incident update
### Severity 4: Informational 🟢
**Definition**: No user impact; system anomalies or items needing preventive attention
**Examples**:
- Disk usage trending high
- SSL cert expiring in 30 days
- Deployment taking longer than normal
- Non-critical service warnings
**Response Time**: During business hours
**On-Call**: No alert needed
**Communication**: Team Slack message
---
## Incident Response Process
### Step 1: Report & Assess (Immediately)
When incident reported (via alert, user report, or discovery):
```bash
# 1. Create incident ticket
# Title: "INCIDENT: [Service] - [Brief description]"
# Example: "INCIDENT: API - 50% error rate since 14:30 UTC"
# Severity: [1-4]
# Reporter: [Your name]
# Time Detected: [UTC time]
# 2. Open dedicated Slack channel
# Slack: /create #incident-20260112-backend
# Then: /invite @on-call-engineer
# 3. Post initial message
# "🔴 INCIDENT DECLARED
# Service: VAPORA Backend
# Severity: 1 (Critical)
# Time Detected: 14:32 UTC
# Current Status: Unknown
# Next Update: 14:34 UTC"
```
### Step 2: Quick Diagnosis (First 2 minutes)
```bash
# Establish facts quickly
export NAMESPACE=vapora
# Q1: Is the service actually down?
curl -v http://api.vapora.com/health
# If: Connection refused → Service down
# If: 500 errors → Service crashed
# If: Timeout → Service hung
# Q2: What's the scope?
kubectl get pods -n $NAMESPACE
# Count Running vs non-Running pods
# All down → Complete outage
# Some down → Partial outage
# Q3: What's happening right now?
for deployment in vapora-backend vapora-agents vapora-llm-router; do
  echo "=== $deployment ==="
  kubectl get deployment "$deployment" -n $NAMESPACE
done
# Shows: DESIRED vs CURRENT vs AVAILABLE
# Example: 3 DESIRED, 0 CURRENT, 0 AVAILABLE → Pod startup failure
# Q4: Any obvious errors?
kubectl logs deployment/vapora-backend -n $NAMESPACE --tail=20 | grep -i "error\|fatal"
# Shows: What's in the logs right now
```
### Step 3: Escalate Decision
Based on quick diagnosis, decide next action:
```
IF pods not starting (CrashLoopBackOff):
→ Likely config issue
→ Check ConfigMap values
→ Likely recent deployment
→ DECISION: Possible rollback
IF pods pending (not scheduled):
→ Likely resource issue
→ Check node capacity
→ DECISION: Scale down workloads or investigate nodes
IF pods running but unresponsive:
→ Likely application issue
→ Check application logs
→ DECISION: Investigate app logic
IF network/database issues:
→ Check connectivity
→ Check credentials
→ DECISION: Infrastructure escalation
IF unknown:
→ Ask: "What changed recently?"
→ Check deployment history
→ Check infrastructure changes
```
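A minimal triage sketch of the decision logic above, assuming kubectl access to the vapora namespace. It only mirrors the branches and prints a suggested next step; it is not a substitute for judgment.
```bash
#!/usr/bin/env bash
# Rough triage helper mirroring the escalation branches above (sketch only).
NAMESPACE=${NAMESPACE:-vapora}

crashlooping=$(kubectl get pods -n "$NAMESPACE" --no-headers | grep -c "CrashLoopBackOff")
pending=$(kubectl get pods -n "$NAMESPACE" --no-headers | grep -c "Pending")
running=$(kubectl get pods -n "$NAMESPACE" --no-headers | grep -c "Running")

if [ "$crashlooping" -gt 0 ]; then
  echo "Pods in CrashLoopBackOff: check ConfigMap values and recent deployment; consider rollback"
elif [ "$pending" -gt 0 ]; then
  echo "Pods Pending: check node capacity (kubectl describe nodes); scale down workloads or investigate nodes"
elif [ "$running" -gt 0 ]; then
  echo "Pods Running: if unresponsive, check application logs and recent changes"
else
  echo "No pods found: check the Deployment objects and recent infrastructure changes"
fi
```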
### Step 4: Initial Response Actions
**For Severity 1 (Critical)**:
```bash
# A. Escalate immediately
#    - Page senior engineer if not already responding
#    - Contact infrastructure team
#    - Notify product/support managers
# B. Buy time with failover if available
#    - Switch to backup environment if configured
#    - Scale to different region if multi-region
# C. Gather data for debugging
#    - Save current logs
#    - Save pod events
#    - Record current metrics
#    - Take screenshot of dashboards
# D. Keep team updated
# Update #incident-* channel every 2 minutes
```
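One way to script step C ("gather data for debugging") is a small snapshot helper like the sketch below, run before any fix changes the state. The output directory naming and the list of deployments are assumptions; adjust them to your environment.
```bash
# Snapshot current state into a timestamped directory before anything changes (sketch).
NAMESPACE=${NAMESPACE:-vapora}
OUTDIR="incident-$(date -u +%Y%m%dT%H%M%SZ)"   # assumed naming convention
mkdir -p "$OUTDIR"

kubectl get pods -n "$NAMESPACE" -o wide                      > "$OUTDIR/pods.txt"
kubectl get events -n "$NAMESPACE" --sort-by=.lastTimestamp   > "$OUTDIR/events.txt"
kubectl top pods -n "$NAMESPACE"                              > "$OUTDIR/top-pods.txt" 2>&1
for d in vapora-backend vapora-agents vapora-llm-router; do
  kubectl logs deployment/"$d" -n "$NAMESPACE" --tail=500 > "$OUTDIR/$d.log" 2>&1
done
echo "Diagnostics saved to $OUTDIR"
```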
**For Severity 2 (Major)**:
```bash
# A. Alert on-call team
# B. Gather same diagnostics
# C. Start investigation
# D. Update every 5 minutes
```
**For Severity 3 (Minor)**:
```bash
# A. Create ticket for later investigation
# B. Monitor closely
# C. Gather diagnostics
# D. Plan fix during normal hours if not urgent
```
### Step 5: Detailed Diagnosis
Once immediate actions taken:
```bash
# Get comprehensive view of system state
kubectl describe node <nodename> # Hardware/capacity issues
kubectl describe pod <podname> -n $NAMESPACE # Pod-specific issues
kubectl get events -n $NAMESPACE --sort-by=.lastTimestamp # What happened recently
kubectl top nodes # CPU/memory usage
kubectl top pods -n $NAMESPACE # Per-pod resource usage
# Check recent changes
git log -5 --oneline
git diff HEAD~1 HEAD provisioning/
# Check deployment history
kubectl rollout history deployment/vapora-backend -n $NAMESPACE | tail -5
# Timeline analysis
# What happened at 14:30 UTC? (incident time)
# Was there a deployment?
# Did metrics change suddenly?
# Any alerts triggered?
```
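For the timeline questions above, ReplicaSet creation times are a quick way to see when each deployment revision actually went out; a sketch, to be cross-checked against the incident start time:
```bash
# When did each deployment revision roll out? (A ReplicaSet is created per revision.)
kubectl get replicasets -n $NAMESPACE --sort-by=.metadata.creationTimestamp \
  -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp,DESIRED:.spec.replicas

# Cross-reference with cluster events around the incident time (14:30 UTC in the example above)
kubectl get events -n $NAMESPACE --sort-by=.lastTimestamp | tail -40

# And with recent commits touching provisioning
git log --since="2 hours ago" --oneline -- provisioning/
```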
### Step 6: Implement Fix
Depending on root cause:
#### Root Cause: Recent Bad Deployment
```bash
# Solution: Rollback
# See: Rollback Runbook
kubectl rollout undo deployment/vapora-backend -n $NAMESPACE
kubectl rollout status deployment/vapora-backend -n $NAMESPACE --timeout=5m
# Verify (assumes a port-forward to the backend service,
# e.g. kubectl port-forward svc/vapora-backend 8001:8001 -n $NAMESPACE)
curl http://localhost:8001/health
```
#### Root Cause: Insufficient Resources
```bash
# Solution: Either scale out or reduce load
# Option A: Add more nodes
# (There is no kubectl command for this; add capacity via the cloud provider
#  console, cluster autoscaler, or the infrastructure team. Requires infrastructure access.)
# Option B: Scale down non-critical services
kubectl scale deployment/vapora-agents --replicas=1 -n $NAMESPACE
# Then scale back up when resolved
# Option C: Temporarily scale down pod replicas
kubectl scale deployment/vapora-backend --replicas=2 -n $NAMESPACE
# (Trade: Reduced capacity but faster recovery)
```
#### Root Cause: Configuration Error
```bash
# Solution: Fix ConfigMap
# 1. Identify wrong value
kubectl get configmap -n $NAMESPACE vapora-config -o yaml | grep -A 2 <suspicious-key>
# 2. Fix value
# Edit configmap in external editor or via kubectl patch:
kubectl patch configmap vapora-config -n $NAMESPACE \
--type merge \
-p '{"data":{"vapora.toml":"[corrected content]"}}'
# 3. Restart pods to pick up new config
kubectl rollout restart deployment/vapora-backend -n $NAMESPACE
kubectl rollout status deployment/vapora-backend -n $NAMESPACE --timeout=5m
```
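To confirm the pods actually picked up the corrected value after the restart, a quick comparison like the sketch below helps. The mount path /etc/vapora/vapora.toml is an assumption; use whatever path the deployment mounts the ConfigMap at.
```bash
# Compare what the ConfigMap says with what a freshly restarted pod actually sees.
kubectl get configmap vapora-config -n $NAMESPACE -o jsonpath='{.data.vapora\.toml}' | head -20

# Assumed mount path; check `kubectl describe deployment vapora-backend -n $NAMESPACE` for the real one.
kubectl exec deployment/vapora-backend -n $NAMESPACE -- cat /etc/vapora/vapora.toml | head -20
```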
#### Root Cause: Database Issues
```bash
# Solution: Depends on specific issue
# If database down:
#   - Contact DBA or database team
#   - Check database status from an app pod: kubectl exec <pod> -- curl localhost:8000
# If credentials wrong:
kubectl patch configmap vapora-config -n $NAMESPACE \
  --type merge \
  -p '{"data":{"DB_PASSWORD":"[correct-password]"}}'
# If database full:
#   - Contact DBA for cleanup
#   - Free up space on database volume
# If connection pool exhausted:
#   - Scale down services to reduce connections
#   - Increase connection pool size if possible
```
#### Root Cause: External Service Down
```bash
# Examples: Third-party API, external database
# Solution: Depends on severity
# If critical: Failover
#   - Switch to backup provider if available
#   - Route traffic differently
# If non-critical: Degrade gracefully
#   - Disable feature temporarily
#   - Use cache if available
#   - Return cached data
# Communicate
#   - Notify users of reduced functionality
#   - Provide ETA for restoration
```
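If degrading gracefully means flipping a feature flag, a hedged sketch follows. The FEATURE_EXTERNAL_ENRICHMENT key is purely illustrative, not a real VAPORA setting; substitute whatever flag your configuration actually exposes.
```bash
# Hypothetical feature flag; the key name is an illustration only.
kubectl patch configmap vapora-config -n $NAMESPACE \
  --type merge \
  -p '{"data":{"FEATURE_EXTERNAL_ENRICHMENT":"false"}}'

# Restart so pods pick up the flag, then confirm in the logs that the feature reports as disabled.
kubectl rollout restart deployment/vapora-backend -n $NAMESPACE
kubectl rollout status deployment/vapora-backend -n $NAMESPACE --timeout=5m
```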
### Step 7: Verify Recovery
```bash
# Once fix applied, verify systematically
# 1. Pod health
kubectl get pods -n $NAMESPACE
# All should show: Running, 1/1 Ready
# 2. Service endpoints
kubectl get endpoints -n $NAMESPACE
# All should have IP addresses
# 3. Health endpoints
curl http://localhost:8001/health  # assumes a port-forward to the backend service
# Should return: 200 OK
# 4. Check errors
kubectl logs deployment/vapora-backend -n $NAMESPACE --since=2m | grep -i error
# Should return: few or no errors
# 5. Monitor metrics
kubectl top pods -n $NAMESPACE
# CPU/Memory should be normal (not spiking)
# 6. Check for new issues
kubectl get events -n $NAMESPACE
# Should show normal state, no warnings
```
### Step 8: Incident Closure
```bash
# When everything verified healthy:
# 1. Document resolution
# Update incident ticket with:
# - Root cause
# - Fix applied
# - Verification steps
# - Resolution time
# - Impact (how many users, how long)
# 2. Post final update
# "#incident channel:
# ✅ INCIDENT RESOLVED
#
# Duration: [start] to [end] = [X minutes]
# Root Cause: [brief description]
# Fix Applied: [brief description]
# Impact: ~X users affected for X minutes
#
# Status: All services healthy
# Monitoring: Continuing for 1 hour
# Post-mortem: Scheduled for [date]"
# 3. Schedule post-mortem
# Within 24 hours: review what happened and why
# Document lessons learned
# 4. Update dashboards
# Document incident on status page history
# If public incident: close status page incident
# 5. Send all-clear message
# Notify: support team, product team, key stakeholders
```
---
## Incident Response Roles & Responsibilities
### Incident Commander
- Overall control of incident response
- Makes critical decisions
- Keeps decision-making fast
- Communicates status updates
- Decides when to escalate
- Typically **you**, if you discovered the incident and understand it best
### Technical Responders
- Investigate specific systems
- Implement fixes
- Report findings to commander
- Execute verified solutions
### Communication Lead (if Severity 1)
- Updates #incident channel every 2 minutes
- Updates status page every 5 minutes
- Fields questions from support/product
- Notifies key stakeholders
### On-Call Manager (if Severity 1)
- Pages additional resources if needed
- Escalates to senior engineers
- Engages infrastructure/DBA teams
- Tracks response timeline
---
## Common Incidents & Responses
### Incident Type: Service Unresponsive
```
Detection: curl returns "Connection refused"
Diagnosis Time: 1 minute
Response:
1. Check if pods are running: kubectl get pods
2. If not running: likely crash → check logs
3. If running but unresponsive: likely port/network issue
4. Verify service exists: kubectl get service vapora-backend
Solution:
- If pods crashed: check logs, likely config or deployment issue
- If pods hanging: restart them: kubectl delete pods -l app=vapora-backend -n vapora
- If service/endpoints missing: apply service manifest
```
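The checks above can be run as one short sequence; a sketch assuming the vapora namespace, the vapora-backend Service, and the app=vapora-backend label used elsewhere in this runbook:
```bash
NAMESPACE=vapora
kubectl get pods -n $NAMESPACE -l app=vapora-backend            # are the pods running?
kubectl get service vapora-backend -n $NAMESPACE                # does the Service exist?
kubectl get endpoints vapora-backend -n $NAMESPACE              # does it have endpoints?
kubectl logs deployment/vapora-backend -n $NAMESPACE --tail=50 | grep -i "error\|panic"
# No endpoints but Running pods: likely a label/selector mismatch. No pods: crash or deployment issue.
```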
### Incident Type: High Error Rate
```
Detection: Dashboard shows >10% 5xx errors
Diagnosis Time: 2 minutes
Response:
1. Check which endpoint is failing
2. Check logs for error pattern
3. Identify affected service (backend, agents, router)
4. Compare with baseline (worked X minutes ago)
Solution:
- If recent deployment: rollback
- If config change: revert config
- If database issue: contact DBA
- If third-party down: implement fallback
```
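To find which endpoint is failing (step 1 above), grepping access logs for 5xx status codes and aggregating by path is usually enough. The log format here is an assumption; adjust the awk field numbers to whatever the backend actually emits.
```bash
# Assumes access-log lines carry the HTTP path and status code as whitespace-separated fields
# ($7 = path, $9 = status, as in common/combined log format); adjust to your real format.
kubectl logs deployment/vapora-backend -n vapora --since=10m \
  | awk '$9 ~ /^5[0-9][0-9]$/ {print $9, $7}' \
  | sort | uniq -c | sort -rn | head -10
```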
### Incident Type: High Latency
```
Detection: Dashboard shows p99 latency >2 seconds
Diagnosis Time: 2 minutes
Response:
1. Check if requests still succeeding (is it slow or failing?)
2. Check CPU/memory usage: kubectl top pods
3. Check if database slow: run query diagnostics
4. Check network: are there packet losses?
Solution:
- If resource exhausted: scale up or reduce load
- If database slow: DBA investigation
- If network issue: infrastructure team
- If legitimate increased load: no action needed (expected)
```
### Incident Type: Pod Restarting Repeatedly
```
Detection: kubectl get pods shows high RESTARTS count
Diagnosis Time: 1 minute
Response:
1. Check restart count: kubectl get pods -n vapora
2. Get pod logs: kubectl logs <pod-name> -n vapora --previous
3. Get pod events: kubectl describe pod <pod-name> -n vapora
Solution:
- Application error: check logs, fix issue, redeploy
- Config issue: fix ConfigMap, restart pods
- Resource issue: increase limits or scale out
- Liveness probe failing: adjust probe timing or fix health check
```
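The three response steps above, strung together for a single pod (the pod name is a placeholder you fill in from the restart listing):
```bash
NAMESPACE=vapora
POD="<pod-name>"   # e.g. the pod with the highest RESTARTS count below

kubectl get pods -n $NAMESPACE --sort-by='.status.containerStatuses[0].restartCount'   # worst offenders last
kubectl logs "$POD" -n $NAMESPACE --previous --tail=100          # why the last container instance died
kubectl describe pod "$POD" -n $NAMESPACE | grep -A 10 Events    # OOMKilled? Probe failures? Image pull?
```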
### Incident Type: Database Connectivity
```
Detection: Logs show "database connection refused"
Diagnosis Time: 2 minutes
Response:
1. Check database service running: kubectl get pod -n <db-namespace>
2. Check database credentials in ConfigMap
3. Test connectivity: kubectl exec <pod> -- psql $DB_URL
4. Check firewall/network policy
Solution:
- If DB down: escalate to DBA, possibly restore from backup
- If credentials wrong: fix ConfigMap, restart app pods
- If network issue: network team investigation
- If no space: DBA cleanup
```
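A quick connectivity probe from an application pod, assuming the pods carry a DB_URL environment variable as referenced above and that the psql client is available in the image (both assumptions; fall back to a bare TCP check otherwise):
```bash
NAMESPACE=vapora
POD=$(kubectl get pods -n $NAMESPACE -l app=vapora-backend -o jsonpath='{.items[0].metadata.name}')

# Preferred: a real query through the configured URL (assumes psql exists in the image)
kubectl exec "$POD" -n $NAMESPACE -- sh -c 'psql "$DB_URL" -c "SELECT 1"'

# Fallback: plain TCP reachability check if psql is not installed (host/port are placeholders)
kubectl exec "$POD" -n $NAMESPACE -- sh -c 'nc -zv <db-host> <db-port>'
```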
---
## Communication During Incident
### Every 2 Minutes (Severity 1) or 5 Minutes (Severity 2)
Post update to #incident channel:
```
⏱️ 14:35 UTC UPDATE
Status: Investigating
Current Action: Checking pod logs
Findings: Backend pods in CrashLoopBackOff
Next Step: Review recent deployment
ETA for Update: 14:37 UTC
/cc @on-call-engineer
```
### Status Page Updates (If Public)
```
INCIDENT: VAPORA API Partially Degraded
Investigating: Our team is investigating elevated error rates
Duration: 5 minutes
Impact: ~30% of API requests failing
Last Updated: 14:35 UTC
Next Update: 14:37 UTC
```
### Escalation Communication
```
If Severity 1 and unable to identify cause in 5 minutes:
"Escalating to senior engineering team.
Page @senior-engineer-on-call immediately.
Activating Incident War Room."
Include:
- Service name
- Duration so far
- What's been tried
- Current symptoms
- Why stuck
```
---
## Incident Severity Decision Tree
```
Question 1: Can any users access the service?
NO → Severity 1 (Critical - complete outage)
YES → Question 2
Question 2: What percentage of requests are failing?
>50% → Severity 1 (Critical)
10-50% → Severity 2 (Major)
5-10% → Severity 3 (Minor)
<5% → Question 3
Question 3: Is the service recovering on its own?
NO (staying broken) → Severity 2
YES (automatically recovering) → Question 4
Question 4: Does it require any user action/data loss?
YES → Severity 2
NO → Severity 3
```
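The same tree as a small shell helper, for a worked example. The thresholds are copied from above; the inputs are answers you supply yourself, nothing is measured automatically.
```bash
# Usage: severity <any_users_can_access yes|no> <error_rate_percent> <recovering yes|no> <user_action_or_data_loss yes|no>
severity() {
  local users_ok=$1 err=$2 recovering=$3 impactful=$4
  if [ "$users_ok" = "no" ]; then echo "Severity 1"; return; fi
  if [ "$err" -gt 50 ]; then echo "Severity 1"; return; fi
  if [ "$err" -ge 10 ]; then echo "Severity 2"; return; fi
  if [ "$err" -ge 5 ];  then echo "Severity 3"; return; fi
  if [ "$recovering" = "no" ]; then echo "Severity 2"; return; fi
  [ "$impactful" = "yes" ] && echo "Severity 2" || echo "Severity 3"
}

severity yes 30 no no   # prints "Severity 2"
```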
---
## Post-Incident Procedures
### Immediate (Within 30 minutes)
- [ ] Close incident ticket
- [ ] Post final update to #incident channel
- [ ] Save all logs and diagnostics
- [ ] Create post-mortem ticket
- [ ] Notify team: "incident resolved"
### Follow-Up (Within 24 hours)
- [ ] Schedule post-mortem meeting
- [ ] Identify root cause
- [ ] Document preventive measures
- [ ] Identify owner for each action item
- [ ] Create tickets for improvements
### Prevention (Within 1 week)
- [ ] Implement identified fixes
- [ ] Update monitoring/alerting
- [ ] Update runbooks with findings
- [ ] Conduct team training if needed
- [ ] Close post-mortem ticket
---
## Incident Checklist
```
☐ Incident severity determined
☐ Ticket created and updated
☐ #incident channel created
☐ On-call team alerted
☐ Initial diagnosis completed
☐ Fix identified and implemented
☐ Fix verified working
☐ Incident closed and communicated
☐ Post-mortem scheduled
☐ Team debriefed
☐ Root cause documented
☐ Prevention measures identified
☐ Tickets created for follow-up
```