On-Call Procedures

Guide for on-call engineers managing VAPORA production operations.


Overview

On-Call Responsibility: Monitor VAPORA production and respond to incidents during your assigned shift

Time Commitment:

  • During business hours: ~5-10 minutes of daily check-ins
  • During off-hours: Available for emergencies (paged for critical issues)

Expected Availability:

  • Severity 1: Respond within 2 minutes
  • Severity 2: Respond within 15 minutes
  • Severity 3: Respond within 1 hour

Before Your Shift Starts

24 Hours Before On-Call

  • Verify schedule: "I'm on-call starting [date] [time]"
  • Update your calendar with shift times
  • Notify team: "I'll be on-call [dates]"
  • Share personal contact info if not already shared
  • Download necessary tools/credentials

1 Hour Before Shift

  • Test pager notification system

    # Verify Slack notifications working
    # Ask previous on-call to send test alert: "/test-alert-to-[yourname]"
    
  • Verify access to necessary systems

    # Test each required access:
    ✓ SSH to bastion host: ssh bastion.vapora.com
    ✓ kubectl to production: kubectl cluster-info
    ✓ Slack channels: /join #deployments #alerts
    ✓ Incident tracking: open Jira/GitHub
    ✓ Monitoring dashboards: access Grafana
    ✓ Status page: access status page admin
    
  • Review current system status

    # Quick health check
    kubectl cluster-info
    kubectl get pods -n vapora
    kubectl get events -n vapora | head -10
    
    # Should show: All pods Running, no recent errors
    
  • Read recent incident reports

    • Check previous on-call handoff notes
    • Review any incidents from past week
    • Note any known issues or monitoring gaps
  • Receive handoff from previous on-call

    Ask: "Anything I should know?"
    - Any ongoing issues?
    - Any deployments planned?
    - Any flaky services or known alerts?
    - Any customer complaints?
    

Daily On-Call Tasks

Morning Check-In (After shift starts)

# Automated check - run this first thing
export NAMESPACE=vapora

echo "=== Cluster Health ==="
kubectl cluster-info
kubectl get nodes

echo "=== Pod Status ==="
kubectl get pods -n $NAMESPACE
kubectl get pods -n $NAMESPACE | grep -v Running

echo "=== Recent Events ==="
kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' | tail -10

echo "=== Resource Usage ==="
kubectl top nodes
kubectl top pods -n $NAMESPACE

# If any anomalies: investigate before declaring "all clear"

Mid-Shift Check (Every 4 hours)

# Quick sanity check - print the HTTP status codes
curl -s -o /dev/null -w '%{http_code}\n' https://api.vapora.com/health
curl -s -o /dev/null -w '%{http_code}\n' https://vapora.app/
# Both should print 200

# Check dashboards
# Grafana: any alerts? any trending issues?

# Check Slack #alerts channel
# Any warnings or anomalies posted?

End-of-Shift Handoff (Before shift ends)

# Prepare handoff for next on-call

# 1. Document current state
kubectl get pods -n vapora
kubectl get nodes
kubectl top pods -n vapora

# 2. Check for known issues
kubectl get events -n vapora | grep Warning
# Any persistent warnings?

# 3. Check deployment status
git log -1 --oneline provisioning/
# Any recent changes?

# 4. Document in handoff notes:
echo "HANDOFF NOTES - $(date)
Duration: [start time] to [end time]
Status: All normal / Issues: [list]
Alerts: [any]
Deployments: [any planned]
Known issues: [any]
Recommendations: [any]
" > on-call-handoff.txt

# 5. Pass notes to next on-call
# Send message to @next-on-call with notes

Responding to Alerts

Alert Received

Step 1: Verify it's real

# Don't panic - verify the alert is legitimate
1. Check the source: is it from our system?
2. Check current status manually: curl endpoints
3. Check dashboard: see if issue visible there
4. Check cluster: kubectl get pods

# False alarms happen - verify before escalating
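
To make that verification concrete, here is a minimal command sequence reusing the health endpoints and namespace from the checks earlier in this guide:

# Manual verification sketch
echo "API health: $(curl -s -o /dev/null -w '%{http_code}' https://api.vapora.com/health)"
echo "Frontend:   $(curl -s -o /dev/null -w '%{http_code}' https://vapora.app/)"

# Any pods not in Running state?
kubectl get pods -n vapora | grep -v Running

# Recent warning events
kubectl get events -n vapora --sort-by='.lastTimestamp' | grep Warning | tail -5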

Step 2: Assess severity

  • Is service completely down? → Severity 1
  • Is service partially down? → Severity 2
  • Is there a warning/anomaly? → Severity 3

Step 3: Declare incident

# Create ticket (Severity 1 is emergency)
# If Severity 1:
# - Alert team immediately
# - Create #incident-[date] channel
# - Start 2-minute update cycle
# See: Incident Response Runbook
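
If your team routes incident notices through a Slack incoming webhook, the initial post can be scripted; SLACK_WEBHOOK_URL, the severity, and the summary below are placeholders, not part of VAPORA's tooling:

# Hypothetical example: post the initial incident notice via a Slack incoming webhook
SEVERITY=1
SUMMARY="API returning 5xx on /health"
curl -s -X POST -H 'Content-type: application/json' \
  --data "{\"text\":\"INCIDENT (Sev ${SEVERITY}): ${SUMMARY} - investigating, updates every 2 min\"}" \
  "$SLACK_WEBHOOK_URL"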

During Incident

Your role as on-call:

  1. Respond quickly - First 2 minutes are critical
  2. Communicate - Update team/status page
  3. Investigate - Follow diagnostics in runbooks
  4. Escalate if needed - Page senior engineer if stuck
  5. Execute fix - Follow approved procedures
  6. Verify recovery - Confirm service healthy
  7. Document - Record what happened

Key communication:

  • Initial response time: < 2 min (post "investigating")
  • Status update: every 2-5 minutes
  • Escalation: page a senior engineer if the root cause is not clear after 5 minutes
  • Resolution: post "incident resolved"

Alert Examples & Responses

Alert: "Pod CrashLoopBackOff"

1. Get pod logs: kubectl logs <pod> --previous
2. Check for config issues: kubectl get configmap
3. Check for resource limits: kubectl describe pod <pod>
4. Decide: rollback or fix config
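
A consolidated version of that triage, assuming the pod lives in the vapora namespace (the pod name is a placeholder):

# CrashLoopBackOff triage
POD=vapora-backend-xxxxx   # placeholder - use the actual pod name

kubectl logs "$POD" -n vapora --previous            # logs from the crashed container
kubectl describe pod "$POD" -n vapora | tail -30    # events, restart count, resource limits
kubectl get configmap -n vapora                     # confirm the expected config exists
# Only if a recent deployment caused the crashes:
# kubectl rollout undo deployment/vapora-backend -n vapora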

Alert: "High Error Rate (>5% 5xx)"

1. Check which endpoint: tail application logs
2. Check dependencies: database, cache, external APIs
3. Check recent deployment: git log
4. Decide: rollback or investigate further
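
One rough way to eyeball the 5xx rate from the backend logs - this assumes HTTP status codes appear in the access log lines, so adjust the pattern to your log format:

# Approximate 5xx count vs total requests over the last 10 minutes
kubectl logs deployment/vapora-backend -n vapora --since=10m | grep -cE ' 5[0-9]{2} '
kubectl logs deployment/vapora-backend -n vapora --since=10m | wc -l

# Was anything deployed recently?
kubectl rollout history deployment/vapora-backend -n vapora | tail -5
git log -3 --oneline -- provisioning/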

Alert: "Pod Memory > 90%"

1. Check actual usage: kubectl top pod <pod>
2. Check limits: kubectl get pod <pod> -o yaml | grep memory
3. Decide: scale up or investigate memory leak
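
To compare actual usage against the configured limit (the jsonpath query is one way to pull the limit; the pod name is a placeholder):

POD=vapora-backend-xxxxx   # placeholder - use the actual pod name

# Current usage vs configured limit
kubectl top pod "$POD" -n vapora
kubectl get pod "$POD" -n vapora \
  -o jsonpath='{range .spec.containers[*]}{.name}{": "}{.resources.limits.memory}{"\n"}{end}'

# If it looks like a leak rather than load, deleting the pod buys time while you investigate
# (the deployment recreates it):
# kubectl delete pod "$POD" -n vapora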

Alert: "Node NotReady"

1. Check node: kubectl describe node <node>
2. Check kubelet: ssh node-x 'systemctl status kubelet'
3. Contact infrastructure team for hardware issues
4. Possibly: drain node and reschedule pods
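
If the node has to come out of service, cordon and drain it so pods reschedule elsewhere; the flags shown are the standard ones, but double-check before draining a node that hosts stateful workloads:

NODE=node-x   # placeholder - use the actual node name

kubectl describe node "$NODE" | grep -A5 Conditions    # why is it NotReady?
kubectl cordon "$NODE"                                  # stop new pods landing on it
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data
# Once the underlying issue is fixed:
# kubectl uncordon "$NODE"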

Monitoring Dashboard Setup

When you start shift, have these visible:

Browser Tabs (Keep Open)

  1. Grafana Dashboard - VAPORA Cluster Overview

    • Pod CPU/Memory usage
    • Request rate and latency
    • Error rate
    • Deployment status
  2. Kubernetes Dashboard

    • kubectl port-forward -n kube-system svc/kubernetes-dashboard 8443:443
    • Or use K9s terminal UI: k9s
  3. Alert Dashboard (if available)

    • Prometheus Alerts
    • Or monitoring system of choice
  4. Status Page (if public-facing)

    • Check for ongoing incidents
    • Prepare to update

Terminal Windows (Keep Ready)

# Terminal 1: Watch pods
watch kubectl get pods -n vapora

# Terminal 2: Tail logs
kubectl logs -f deployment/vapora-backend -n vapora

# Terminal 3: General kubectl commands
kubectl -n vapora get events --watch

# Terminal 4: Ad-hoc commands and troubleshooting
# (leave empty for ad-hoc use)

Common Questions During On-Call

Q: I think I found an issue, but I'm not sure it's a problem

A: When in doubt, escalate:

  1. Post your observation in the #deployments channel
  2. Ask: "Does this look normal?"
  3. If others see it too, treat it as a real issue
  4. Better safe than sorry in production

Q: Do I need to respond to every alert?

A: Yes. Even false alarms need verification:

  1. Confirm it is actually a false alarm (don't just assume it is)
  2. Update the alert if it's misconfigured
  3. Never ignore alerts - fix the alerting instead

Q: Service looks broken but dashboard looks normal

A:

  1. Check whether the dashboard is lagging (metrics refresh can be slow)
  2. Test manually: curl endpoints
  3. Check pod logs directly: kubectl logs
  4. Trust actual service health over dashboard

Q: Can I deploy changes while on-call?

A:

  • Yes, if it's an emergency fix for an active incident
  • No for routine features or changes (schedule them for a dedicated deployment window)
  • Escalate if unsure

Q: Something looks weird but I can't reproduce it

A:

  1. Save any evidence: logs, metrics, events
  2. Monitor more closely for pattern
  3. Document in ticket for later investigation
  4. Escalate if behavior continues

Q: An alert keeps firing but service is fine

A:

  1. Investigate why alert is false
  2. Check alert thresholds (might be too sensitive)
  3. Fix the alert configuration
  4. Update alert runbook with details

Escalation Decision Tree

When should you escalate?

START: Issue detected

Is it Severity 1 (complete outage)?
  YES → Escalate immediately to senior engineer
  NO → Continue

Have you diagnosed root cause in 5 minutes?
  YES → Continue with fix
  NO → Page senior engineer or escalate

Does fix require infrastructure/database changes?
  YES → Contact infrastructure/DBA team
  NO → Continue with fix

Is this outside your authority (company policy)?
  YES → Escalate to manager
  NO → Proceed with fix

Implemented fix, service still broken?
  YES → Page senior engineer immediately
  NO → Verify and close incident

Result: Uncertain?
  → Ask senior engineer or manager
  → Always better to escalate early

When to Page Senior Engineer

Page immediately if:

  • Service completely down (Severity 1)
  • Database appears corrupted
  • You're stuck for >5 minutes
  • Rollback didn't work
  • Need infrastructure changes urgently
  • Something affecting >50% of users

Don't page just because:

  • Single pod restarting (monitor first)
  • Transient network errors
  • You're slightly unsure (ask in #deployments first)
  • It's 3 AM and not critical (file a ticket for the morning instead)

End of Shift Handoff

Create Handoff Report

SHIFT HANDOFF - [Your Name]
Dates: [Start] to [End] UTC
Duration: [X hours]

STATUS: ✅ All normal / ⚠️ Issues ongoing / ❌ Critical

INCIDENTS: [Number]
- Incident 1: [description, resolved or ongoing]
- Incident 2: [description]

ALERTS: [Any unusual alerts]
- Alert 1: [description, action taken]

DEPLOYMENTS: [Any scheduled or happened]
- Deployment 1: [status]

KNOWN ISSUES:
- Issue 1: [description, workaround]
- Issue 2: [description]

MONITORING NOTES:
- [Any trending issues]
- [Any monitoring gaps]
- [Any recommended actions]

RECOMMENDATIONS FOR NEXT ON-CALL:
1. [Action item]
2. [Action item]
3. [Action item]

NEXT ON-CALL: @[name]

Send to Next On-Call

@next-on-call - Handoff notes attached:
[paste report above]

Key points:
- [Most important item]
- [Second important]
- [Any urgent follow-ups]

Questions? I'm available for the next 30 minutes

Tools & Commands Reference

Essential Commands

# Pod management
kubectl get pods -n vapora
kubectl logs pod-name -n vapora
kubectl exec pod-name -n vapora -- bash
kubectl describe pod pod-name -n vapora
kubectl delete pod pod-name -n vapora  # (recreates via deployment)

# Deployment management
kubectl get deployments -n vapora
kubectl rollout status deployment/vapora-backend -n vapora
kubectl rollout undo deployment/vapora-backend -n vapora
kubectl scale deployment/vapora-backend --replicas=5 -n vapora

# Service health
curl http://localhost:8001/health
kubectl get events -n vapora
kubectl top pods -n vapora
kubectl get endpoints -n vapora

# Quick diagnostics
kubectl describe nodes
kubectl cluster-info
kubectl get persistentvolumes   # or: kubectl get pv

Useful Tools

# Install these on your workstation
brew install kubectl              # Kubernetes CLI
brew install k9s                  # Terminal UI for K8s
brew install watch               # Monitor command output
brew install jq                  # JSON processing
brew install yq                  # YAML processing
brew install grpcurl             # gRPC debugging

# Aliases to save time
alias k='kubectl'
alias kgp='kubectl get pods'
alias klogs='kubectl logs'
alias kexec='kubectl exec'

Bookmark these:

  • Grafana: https://grafana.vapora.com
  • Status Page: https://status.vapora.com
  • Incident Tracker: https://github.com/your-org/vapora/issues
  • Runbooks: https://github.com/your-org/vapora/tree/main/docs/operations
  • Kubernetes Dashboard: Run kubectl proxy then http://localhost:8001/ui

On-Call Checklist

Starting Shift

  • Verified pager notifications working
  • Tested access to all systems
  • Reviewed current system status
  • Read recent incidents
  • Received handoff from previous on-call
  • Set up monitoring dashboards
  • Opened necessary terminal windows
  • Posted "on-call" status in #deployments

During Shift

  • Responded to all alerts within SLA
  • Updated incident status regularly
  • Escalated when appropriate
  • Documented actions in tickets
  • Verified fixes before closing
  • Communicated clearly with team

Ending Shift

  • Created handoff report
  • Resolved or escalated open issues
  • Updated monitoring for anomalies
  • Passed report to next on-call
  • Closed out incident tickets
  • Verified next on-call is ready
  • Posted "handing off to [next on-call]" in #deployments

Post-On-Call Follow-Up

After your shift:

  1. Document lessons learned

    • Did you learn something new?
    • Did any procedure need updating?
    • Were any runbooks unclear?
  2. Update runbooks

    • If you found gaps, update procedures
    • If you had questions, update docs
    • Share improvements with team
  3. Communicate findings

    • Anything the team should know?
    • Any recommendations?
    • Trends to watch?
  4. Celebrate successes

    • Any incidents quickly resolved?
    • Any new insights?
    • Recognize good practices

Emergency Contacts

Keep these accessible:

ESCALATION CONTACTS:

Primary Escalation: [Name] [Phone] [Slack]
Backup Escalation:  [Name] [Phone] [Slack]
Infrastructure:     [Name] [Phone] [Slack]
Database Team:      [Name] [Phone] [Slack]
Manager:            [Name] [Phone] [Slack]

External Contacts:
AWS Support:        [Account ID] [Contact]
CDN Provider:       [Account] [Contact]
DNS Provider:       [Account] [Contact]

EMERGENCY PROCEDURES:
- Complete AWS outage: Contact AWS support immediately
- Database failure: Contact DBA, activate backups
- Security incident: Contact security team immediately
- Major data loss: Activate disaster recovery

Remember

You are the guardian of production - Your vigilance keeps services running

Better safe than sorry - Escalate early and often

Communication is key - Keep team informed

Document everything - Future you and team will thank you

Ask for help - No shame in escalating

Don't guess - Verify before taking action

Don't stay silent - Alert team to any issues

Don't ignore alerts - Even false ones need investigation