Incident Response Runbook
Procedures for responding to and resolving VAPORA production incidents.
Incident Severity Levels
Severity 1: Critical 🔴
Definition: Service completely down or severely degraded affecting all users
Examples:
- All backend pods crashed
- Database completely unreachable
- API returning 100% errors
- Frontend completely inaccessible
Response Time: Immediate (< 2 minutes)
On-Call: Page immediately (not optional)
Communication: Update status page every 2 minutes
Severity 2: Major 🟠
Definition: Service partially down or significantly degraded
Examples:
- 50% of requests returning errors
- Latency 10x normal
- Some services down but others working
- Intermittent connectivity issues
Response Time: 5 minutes
On-Call: Alert on-call engineer
Communication: Internal updates every 5 minutes
Severity 3: Minor 🟡
Definition: Service slow or minor issues affecting some users
Examples:
- 5-10% error rate
- Elevated latency (2x normal)
- One pod having issues, others recovering
- Non-critical features unavailable
Response Time: 15 minutes
On-Call: Alert team, not necessarily an emergency page
Communication: Post-incident update
Severity 4: Informational 🟢
Definition: No user impact, system anomalies or preventive issues
Examples:
- Disk usage trending high
- SSL cert expiring in 30 days
- Deployment taking longer than normal
- Non-critical service warnings
Response Time: During business hours
On-Call: No alert needed
Communication: Team Slack message
Incident Response Process
Step 1: Report & Assess (Immediately)
When an incident is reported (via alert, user report, or discovery):
# 1. Create incident ticket
# Title: "INCIDENT: [Service] - [Brief description]"
# Example: "INCIDENT: API - 50% error rate since 14:30 UTC"
# Severity: [1-4]
# Reporter: [Your name]
# Time Detected: [UTC time]
# 2. Open dedicated Slack channel
# In Slack: /create #incident-20260112-backend
# Then: /invite @on-call-engineer
# 3. Post initial message
# "🔴 INCIDENT DECLARED
# Service: VAPORA Backend
# Severity: 1 (Critical)
# Time Detected: 14:32 UTC
# Current Status: Unknown
# Next Update: 14:34 UTC"
Step 2: Quick Diagnosis (First 2 minutes)
# Establish facts quickly
export NAMESPACE=vapora
# Q1: Is the service actually down?
curl -v http://api.vapora.com/health
# If: Connection refused → Service down
# If: 500 errors → Service crashed
# If: Timeout → Service hung
# Q2: What's the scope?
kubectl get pods -n $NAMESPACE
# Count Running vs non-Running pods
# All down → Complete outage
# Some down → Partial outage
# Q3: What's happening right now?
for deployment in vapora-backend vapora-agents vapora-llm-router; do
echo "=== $deployment ==="
kubectl get deployment $deployment -n $NAMESPACE
done
# Shows: DESIRED vs CURRENT vs AVAILABLE
# Example: 3 DESIRED, 0 CURRENT, 0 AVAILABLE → Pod startup failure
# Q4: Any obvious errors?
kubectl logs deployment/vapora-backend -n $NAMESPACE --tail=20 | grep -i "error\|fatal"
# Shows: What's in the logs right now
Step 3: Escalate Decision
Based on the quick diagnosis, decide the next action (a triage sketch follows this list):
IF pods not starting (CrashLoopBackOff):
→ Likely config issue
→ Check ConfigMap values
→ Likely recent deployment
→ DECISION: Possible rollback
IF pods pending (not scheduled):
→ Likely resource issue
→ Check node capacity
→ DECISION: Scale down workloads or investigate nodes
IF pods running but unresponsive:
→ Likely application issue
→ Check application logs
→ DECISION: Investigate app logic
IF network/database issues:
→ Check connectivity
→ Check credentials
→ DECISION: Infrastructure escalation
IF unknown:
→ Ask: "What changed recently?"
→ Check deployment history
→ Check infrastructure changes
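The branching above can be approximated with a quick status scan. A minimal sketch, assuming the standard vapora namespace; it only surfaces the dominant pod states and recent rollouts, the decision itself stays with the responder.
# Rough triage: summarize pod states and recent rollouts to pick a branch above
export NAMESPACE=vapora
# Dominant pod states (Running / Pending / CrashLoopBackOff counts)
kubectl get pods -n $NAMESPACE --no-headers | awk '{print $3}' | sort | uniq -c | sort -rn
# Waiting reasons per pod (catches CrashLoopBackOff, ImagePullBackOff, etc.)
kubectl get pods -n $NAMESPACE \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}'
# "What changed recently?" — last rollouts per deployment
for d in vapora-backend vapora-agents vapora-llm-router; do
  echo "=== $d ==="
  kubectl rollout history deployment/$d -n $NAMESPACE | tail -3
done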
Step 4: Initial Response Actions
For Severity 1 (Critical):
# A. Escalate immediately
- Page senior engineer if not already responding
- Contact infrastructure team
- Notify product/support managers
# B. Buy time with failover if available
- Switch to backup environment if configured
- Scale to different region if multi-region
# C. Gather data for debugging (capture sketch after this list)
- Save current logs
- Save pod events
- Record current metrics
- Take screenshot of dashboards
# D. Keep team updated
# Update #incident-* channel every 2 minutes
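A minimal capture sketch for item C, assuming kubectl access and the standard vapora namespace; the file names and directory layout are illustrative only.
# Snapshot diagnostics before logs and events rotate away
export NAMESPACE=vapora
SNAP="incident-$(date -u +%Y%m%dT%H%M%SZ)"
mkdir -p "$SNAP"
kubectl get pods -n $NAMESPACE -o wide > "$SNAP/pods.txt"
kubectl get events -n $NAMESPACE --sort-by=.lastTimestamp > "$SNAP/events.txt"
kubectl top pods -n $NAMESPACE > "$SNAP/top-pods.txt" 2>&1
for d in vapora-backend vapora-agents vapora-llm-router; do
  kubectl logs deployment/$d -n $NAMESPACE --tail=500 > "$SNAP/$d.log" 2>&1
done
echo "Diagnostics saved to $SNAP/"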
For Severity 2 (Major):
# A. Alert on-call team
# B. Gather same diagnostics
# C. Start investigation
# D. Update every 5 minutes
For Severity 3 (Minor):
# A. Create ticket for later investigation
# B. Monitor closely
# C. Gather diagnostics
# D. Plan fix during normal hours if not urgent
Step 5: Detailed Diagnosis
Once immediate actions have been taken:
# Get comprehensive view of system state
kubectl describe node <nodename> # Hardware/capacity issues
kubectl describe pod <podname> -n $NAMESPACE # Pod-specific issues
kubectl get events -n $NAMESPACE --sort-by=.lastTimestamp # What happened recently
kubectl top nodes # CPU/memory usage
kubectl top pods -n $NAMESPACE # Per-pod resource usage
# Check recent changes
git log -5 --oneline
git diff HEAD~1 HEAD provisioning/
# Check deployment history
kubectl rollout history deployment/vapora-backend -n $NAMESPACE | tail -5
# Timeline analysis
# What happened at 14:30 UTC? (incident time)
# Was there a deployment?
# Did metrics change suddenly?
# Any alerts triggered?
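To answer the timeline questions above, one approach is to line up rollout and event timestamps against the detection time. A sketch, assuming the incident time is known in UTC:
# Correlate recent changes with the incident time (14:30 UTC in the example)
export NAMESPACE=vapora
# ReplicaSets sorted by creation time show when each rollout actually happened
kubectl get replicasets -n $NAMESPACE --sort-by=.metadata.creationTimestamp \
  -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp,READY:.status.readyReplicas
# Cluster events around the same window
kubectl get events -n $NAMESPACE --sort-by=.lastTimestamp | tail -30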
Step 6: Implement Fix
Depending on root cause:
Root Cause: Recent Bad Deployment
# Solution: Rollback
# See: Rollback Runbook
kubectl rollout undo deployment/vapora-backend -n $NAMESPACE
kubectl rollout status deployment/vapora-backend -n $NAMESPACE --timeout=5m
# Verify
curl http://localhost:8001/health # via kubectl port-forward if running outside the cluster
Root Cause: Insufficient Resources
# Solution: Either scale out or reduce load
# Option A: Add more nodes
# (kubectl cannot scale nodes; add capacity via your cloud provider,
#  node pool settings, or cluster autoscaler. Requires infrastructure access.)
# Option B: Scale down non-critical services
kubectl scale deployment/vapora-agents --replicas=1 -n $NAMESPACE
# Then scale back up when resolved
# Option C: Temporarily scale down pod replicas
kubectl scale deployment/vapora-backend --replicas=2 -n $NAMESPACE
# (Trade: Reduced capacity but faster recovery)
Root Cause: Configuration Error
# Solution: Fix ConfigMap
# 1. Identify wrong value
kubectl get configmap -n $NAMESPACE vapora-config -o yaml | grep -A 2 <suspicious-key>
# 2. Fix value
# Edit configmap in external editor or via kubectl patch:
kubectl patch configmap vapora-config -n $NAMESPACE \
--type merge \
-p '{"data":{"vapora.toml":"[corrected content]"}}'
# 3. Restart pods to pick up new config
kubectl rollout restart deployment/vapora-backend -n $NAMESPACE
kubectl rollout status deployment/vapora-backend -n $NAMESPACE --timeout=5m
Root Cause: Database Issues
# Solution: Depends on specific issue
# If database down:
- Contact DBA or database team
- Check database status: kubectl exec <pod> -- curl localhost:8000
# If credentials wrong:
kubectl patch configmap vapora-config -n $NAMESPACE \
--type merge \
-p '{"data":{"DB_PASSWORD":"[correct-password]"}}'
# If database full:
- Contact DBA for cleanup
- Free up space on database volume
# If connection pool exhausted:
- Scale down services to reduce connections
- Increase connection pool size if possible
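A connectivity sketch for the database cases above; the host, port, and debug image are assumptions to adapt to your environment.
# Test raw TCP reachability to the database from inside the cluster
# (<db-host> and <db-port> are placeholders, not the real values)
kubectl run db-check -n $NAMESPACE --rm -it --restart=Never \
  --image=busybox:1.36 -- nc -zv -w 5 <db-host> <db-port>
# Confirm what the application pods are actually configured to use
kubectl get configmap vapora-config -n $NAMESPACE -o yaml | grep -iE 'db|database'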
Root Cause: External Service Down
# Examples: Third-party API, external database
# Solution: Depends on severity
# If critical: Failover
- Switch to backup provider if available
- Route traffic differently
# If non-critical: Degrade gracefully
- Disable feature temporarily
- Use cache if available
- Return cached data
# Communicate
- Notify users of reduced functionality
- Provide ETA for restoration
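One way to degrade gracefully is a config-level switch, if the application exposes one. The flag name below is hypothetical; VAPORA may not have such a setting, so treat this purely as a pattern sketch.
# Hypothetical feature flag — the key name is NOT a real VAPORA setting
kubectl patch configmap vapora-config -n $NAMESPACE \
  --type merge \
  -p '{"data":{"EXTERNAL_ENRICHMENT_ENABLED":"false"}}'
# Restart to pick up the change, then confirm the reduced functionality is acceptable
kubectl rollout restart deployment/vapora-backend -n $NAMESPACE
kubectl rollout status deployment/vapora-backend -n $NAMESPACE --timeout=5m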
Step 7: Verify Recovery
# Once fix applied, verify systematically
# 1. Pod health
kubectl get pods -n $NAMESPACE
# All should show: Running, 1/1 Ready
# 2. Service endpoints
kubectl get endpoints -n $NAMESPACE
# All should have IP addresses
# 3. Health endpoints
curl http://localhost:8001/health # via kubectl port-forward if outside the cluster
# Should return: 200 OK
# 4. Check errors
kubectl logs deployment/vapora-backend -n $NAMESPACE --since=2m | grep -i error
# Should return: few or no errors
# 5. Monitor metrics
kubectl top pods -n $NAMESPACE
# CPU/Memory should be normal (not spiking)
# 6. Check for new issues
kubectl get events -n $NAMESPACE
# Should show normal state, no warnings
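After the checks above pass once, a short watch loop helps confirm the recovery holds. A sketch, assuming the health endpoint is reachable on localhost:8001 (for example via kubectl port-forward):
# Poll health and error logs for ~10 minutes after the fix
for i in $(seq 1 10); do
  date -u
  curl -s -o /dev/null -w "health: HTTP %{http_code}\n" http://localhost:8001/health
  # error-line count in the last minute (should trend to zero)
  kubectl logs deployment/vapora-backend -n $NAMESPACE --since=1m | grep -ci error || true
  sleep 60
done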
Step 8: Incident Closure
# When everything verified healthy:
# 1. Document resolution
# Update incident ticket with:
# - Root cause
# - Fix applied
# - Verification steps
# - Resolution time
# - Impact (how many users, how long)
# 2. Post final update
# "#incident channel:
# ✅ INCIDENT RESOLVED
#
# Duration: [start] to [end] = [X minutes]
# Root Cause: [brief description]
# Fix Applied: [brief description]
# Impact: ~X users affected for X minutes
#
# Status: All services healthy
# Monitoring: Continuing for 1 hour
# Post-mortem: Scheduled for [date]"
# 3. Schedule post-mortem
# Within 24 hours: review what happened and why
# Document lessons learned
# 4. Update dashboards
# Document incident on status page history
# If public incident: close status page incident
# 5. Send all-clear message
# Notify: support team, product team, key stakeholders
Incident Response Roles & Responsibilities
Incident Commander
- Overall control of incident response
- Makes critical decisions
- Keeps decision-making moving quickly
- Communicates status updates
- Decides when to escalate
- Usually the person who discovered the incident and understands it best
Technical Responders
- Investigate specific systems
- Implement fixes
- Report findings to commander
- Execute verified solutions
Communication Lead (if Severity 1)
- Updates #incident channel every 2 minutes
- Updates status page every 5 minutes
- Fields questions from support/product
- Notifies key stakeholders
On-Call Manager (if Severity 1)
- Pages additional resources if needed
- Escalates to senior engineers
- Engages infrastructure/DBA teams
- Tracks response timeline
Common Incidents & Responses
Incident Type: Service Unresponsive
Detection: curl returns "Connection refused"
Diagnosis Time: 1 minute
Response:
1. Check if pods are running: kubectl get pods -n vapora
2. If not running: likely crash → check logs
3. If running but unresponsive: likely port/network issue
4. Verify service exists: kubectl get service vapora-backend -n vapora
Solution:
- If pods crashed: check logs; likely a config or deployment issue
- If pods hanging: restart them: kubectl delete pods -l app=vapora-backend -n vapora
- If service/endpoints missing: apply the service manifest
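A sketch of the service-wiring check for this case, assuming the backend Service and its pods use the app=vapora-backend label referenced elsewhere in this runbook:
# Is the Service selecting any pods at all?
kubectl get service vapora-backend -n vapora -o wide
kubectl get endpoints vapora-backend -n vapora   # empty ENDPOINTS → selector/port mismatch
# Compare the Service selector with the pod labels
kubectl get service vapora-backend -n vapora -o jsonpath='{.spec.selector}'; echo
kubectl get pods -n vapora -l app=vapora-backend --show-labels
# If pods are hung rather than crashed, recycle them
kubectl delete pods -n vapora -l app=vapora-backend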
Incident Type: High Error Rate
Detection: Dashboard shows >10% 5xx errors
Diagnosis Time: 2 minutes
Response:
1. Check which endpoint is failing
2. Check logs for error pattern
3. Identify affected service (backend, agents, router)
4. Compare with baseline (worked X minutes ago)
Solution:
- If recent deployment: rollback
- If config change: revert config
- If database issue: contact DBA
- If third-party down: implement fallback
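To find the failing endpoint and the dominant error pattern quickly, a log scan like the following can help. The grep patterns assume the services log HTTP status codes and error lines in plain text, which may not match VAPORA's actual log format:
# Which error messages dominate? (log format is an assumption)
kubectl logs deployment/vapora-backend -n vapora --since=10m \
  | grep -i error | sort | uniq -c | sort -rn | head -10
# Rough 5xx count per service over the last 10 minutes
for d in vapora-backend vapora-agents vapora-llm-router; do
  echo -n "$d: "
  kubectl logs deployment/$d -n vapora --since=10m | grep -cE ' 5[0-9]{2} ' || true
done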
Incident Type: High Latency
Detection: Dashboard shows p99 latency >2 seconds
Diagnosis Time: 2 minutes
Response:
1. Check if requests still succeeding (is it slow or failing?)
2. Check CPU/memory usage: kubectl top pods
3. Check if database slow: run query diagnostics
4. Check network: is there packet loss?
Solution:
- If resource exhausted: scale up or reduce load
- If database slow: DBA investigation
- If network issue: infrastructure team
- If legitimate increased load: no action needed (expected)
Incident Type: Pod Restarting Repeatedly
Detection: kubectl get pods shows high RESTARTS count
Diagnosis Time: 1 minute
Response:
1. Check restart count: kubectl get pods -n vapora
2. Get pod logs: kubectl logs <pod-name> -n vapora --previous
3. Get pod events: kubectl describe pod <pod-name> -n vapora
Solution:
- Application error: check logs, fix issue, redeploy
- Config issue: fix ConfigMap, restart pods
- Resource issue: increase limits or scale out
- Liveness probe failing: adjust probe timing or fix health check
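The last termination reason and the liveness probe settings usually narrow this down. A sketch using standard kubectl fields; replace <pod-name> as in the steps above:
# Why did the container last die? (OOMKilled, Error + exit code, etc.)
kubectl get pod <pod-name> -n vapora \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{" exit="}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'
# Current liveness probe settings for the deployment
kubectl get deployment vapora-backend -n vapora \
  -o jsonpath='{.spec.template.spec.containers[0].livenessProbe}'; echo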
Incident Type: Database Connectivity
Detection: Logs show "database connection refused"
Diagnosis Time: 2 minutes
Response:
1. Check database service running: kubectl get pod -n <db-namespace>
2. Check database credentials in ConfigMap
3. Test connectivity: kubectl exec <pod> -- psql $DB_URL
4. Check firewall/network policy
Solution:
- If DB down: escalate to DBA, possibly restore from backup
- If credentials wrong: fix ConfigMap, restart app pods
- If network issue: network team investigation
- If no space: DBA cleanup
Communication During Incident
Every 2 Minutes (Severity 1) or 5 Minutes (Severity 2)
Post update to #incident channel:
⏱️ 14:35 UTC UPDATE
Status: Investigating
Current Action: Checking pod logs
Findings: Backend pods in CrashLoopBackOff
Next Step: Review recent deployment
ETA for Update: 14:37 UTC
/cc @on-call-engineer
Status Page Updates (If Public)
INCIDENT: VAPORA API Partially Degraded
Investigating: Our team is investigating elevated error rates
Duration: 5 minutes
Impact: ~30% of API requests failing
Last Updated: 14:35 UTC
Next Update: 14:37 UTC
Escalation Communication
If Severity 1 and unable to identify cause in 5 minutes:
"Escalating to senior engineering team.
Page @senior-engineer-on-call immediately.
Activating Incident War Room."
Include:
- Service name
- Duration so far
- What's been tried
- Current symptoms
- Why stuck
Incident Severity Decision Tree
Question 1: Can any users access the service?
NO → Severity 1 (Critical - complete outage)
YES → Question 2
Question 2: What percentage of requests are failing?
>50% → Severity 1 (Critical)
10-50% → Severity 2 (Major)
5-10% → Severity 3 (Minor)
<5% → Question 3
Question 3: Is the service recovering on its own?
NO (staying broken) → Severity 2
YES (automatically recovering) → Question 4
Question 4: Does it require user action or involve data loss?
YES → Severity 2
NO → Severity 3
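The error-rate branch of the tree can also be expressed as a small helper for quick classification. A sketch only; the judgment questions (recovery behavior, user action, data loss) stay with the responder:
# severity <error_rate_percent> — rough mapping from the tree above
severity() {
  local rate=$1
  if [ "$rate" -gt 50 ]; then echo "Severity 1 (Critical)"
  elif [ "$rate" -ge 10 ]; then echo "Severity 2 (Major)"
  elif [ "$rate" -ge 5 ]; then echo "Severity 3 (Minor)"
  else echo "Below 5% — answer Questions 3 and 4 manually"
  fi
}
severity 30   # → Severity 2 (Major)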
Post-Incident Procedures
Immediate (Within 30 minutes)
- Close incident ticket
- Post final update to #incident channel
- Save all logs and diagnostics
- Create post-mortem ticket
- Notify team: "incident resolved"
Follow-Up (Within 24 hours)
- Schedule post-mortem meeting
- Identify root cause
- Document preventive measures
- Identify owner for each action item
- Create tickets for improvements
Prevention (Within 1 week)
- Implement identified fixes
- Update monitoring/alerting
- Update runbooks with findings
- Conduct team training if needed
- Close post-mortem ticket
Incident Checklist
☐ Incident severity determined
☐ Ticket created and updated
☐ #incident channel created
☐ On-call team alerted
☐ Initial diagnosis completed
☐ Fix identified and implemented
☐ Fix verified working
☐ Incident closed and communicated
☐ Post-mortem scheduled
☐ Team debriefed
☐ Root cause documented
☐ Prevention measures identified
☐ Tickets created for follow-up