On-Call Procedures

Guide for on-call engineers managing VAPORA production operations.


Overview

On-Call Responsibility: Monitor VAPORA production and respond to incidents during your assigned shift

Time Commitment:

  • During business hours: ~5-10 minutes of daily check-ins
  • During off-hours: Available for emergencies (paged for critical issues)

Expected Availability:

  • Severity 1: Respond within 2 minutes
  • Severity 2: Respond within 15 minutes
  • Severity 3: Respond within 1 hour

Before Your Shift Starts

24 Hours Before On-Call

  • Verify schedule: "I'm on-call starting [date] [time]"
  • Update your calendar with shift times
  • Notify team: "I'll be on-call [dates]"
  • Share personal contact info if not already shared
  • Download necessary tools/credentials

1 Hour Before Shift

  • Test pager notification system

    # Verify Slack notifications working
    # Ask previous on-call to send test alert: "/test-alert-to-[yourname]"
    
  • Verify access to necessary systems

    # Test each required access:
    ✓ SSH to bastion host: ssh bastion.vapora.com
    ✓ kubectl to production: kubectl cluster-info
    ✓ Slack channels: /join #deployments #alerts
    ✓ Incident tracking: open Jira/GitHub
    ✓ Monitoring dashboards: access Grafana
    ✓ Status page: access status page admin
    
  • Review current system status

    # Quick health check
    kubectl cluster-info
    kubectl get pods -n vapora
    kubectl get events -n vapora | head -10
    
    # Should show: All pods Running, no recent errors
    
  • Read recent incident reports

    • Check previous on-call handoff notes
    • Review any incidents from past week
    • Note any known issues or monitoring gaps
  • Receive handoff from previous on-call

    Ask: "Anything I should know?"
    - Any ongoing issues?
    - Any deployments planned?
    - Any flaky services or known alerts?
    - Any customer complaints?
    

Daily On-Call Tasks

Morning Check-In (After shift starts)

# Automated check - run this first thing
export NAMESPACE=vapora

echo "=== Cluster Health ==="
kubectl cluster-info
kubectl get nodes

echo "=== Pod Status ==="
kubectl get pods -n $NAMESPACE
kubectl get pods -n $NAMESPACE | grep -v Running

echo "=== Recent Events ==="
kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' | tail -10

echo "=== Resource Usage ==="
kubectl top nodes
kubectl top pods -n $NAMESPACE

# If any anomalies: investigate before declaring "all clear"

Mid-Shift Check (Every 4 hours)

# Quick sanity check - print the HTTP status codes
curl -s -o /dev/null -w '%{http_code}\n' https://api.vapora.com/health
curl -s -o /dev/null -w '%{http_code}\n' https://vapora.app/
# Both should print 200

# Check dashboards
# Grafana: any alerts? any trending issues?

# Check Slack #alerts channel
# Any warnings or anomalies posted?

End-of-Shift Handoff (Before shift ends)

# Prepare handoff for next on-call

# 1. Document current state
kubectl get pods -n vapora
kubectl get nodes
kubectl top pods -n vapora

# 2. Check for known issues
kubectl get events -n vapora | grep Warning
# Any persistent warnings?

# 3. Check deployment status
git log -1 --oneline provisioning/
# Any recent changes?

# 4. Document in handoff notes:
echo "HANDOFF NOTES - $(date)
Duration: [start time] to [end time]
Status: All normal / Issues: [list]
Alerts: [any]
Deployments: [any planned]
Known issues: [any]
Recommendations: [any]
" > on-call-handoff.txt

# 5. Pass notes to next on-call
# Send message to @next-on-call with notes

Responding to Alerts

Alert Received

Step 1: Verify it's real

# Don't panic - verify the alert is legitimate
1. Check the source: is it from our system?
2. Check current status manually: curl endpoints
3. Check dashboard: see if issue visible there
4. Check cluster: kubectl get pods

# False alarms happen - verify before escalating
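
To make that verification concrete, here is a minimal command sequence reusing the health endpoints and namespace from the checks earlier in this guide:

# Manual verification sketch
echo "API health: $(curl -s -o /dev/null -w '%{http_code}' https://api.vapora.com/health)"
echo "Frontend:   $(curl -s -o /dev/null -w '%{http_code}' https://vapora.app/)"

# Any pods not in Running state?
kubectl get pods -n vapora | grep -v Running

# Recent warning events
kubectl get events -n vapora --sort-by='.lastTimestamp' | grep Warning | tail -5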

Step 2: Assess severity

  • Is service completely down? → Severity 1
  • Is service partially down? → Severity 2
  • Is there a warning/anomaly? → Severity 3

Step 3: Declare incident

# Create ticket (Severity 1 is emergency)
# If Severity 1:
# - Alert team immediately
# - Create #incident-[date] channel
# - Start 2-minute update cycle
# See: Incident Response Runbook
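
If your team routes incident notices through a Slack incoming webhook, the initial post can be scripted; SLACK_WEBHOOK_URL, the severity, and the summary below are placeholders, not part of VAPORA's tooling:

# Hypothetical example: post the initial incident notice via a Slack incoming webhook
SEVERITY=1
SUMMARY="API returning 5xx on /health"
curl -s -X POST -H 'Content-type: application/json' \
  --data "{\"text\":\"INCIDENT (Sev ${SEVERITY}): ${SUMMARY} - investigating, updates every 2 min\"}" \
  "$SLACK_WEBHOOK_URL"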

During Incident

Your role as on-call:

  1. Respond quickly - First 2 minutes are critical
  2. Communicate - Update team/status page
  3. Investigate - Follow diagnostics in runbooks
  4. Escalate if needed - Page senior engineer if stuck
  5. Execute fix - Follow approved procedures
  6. Verify recovery - Confirm service healthy
  7. Document - Record what happened

Key communication:

  • Initial response time: < 2 min (post "investigating")
  • Status update: every 2-5 minutes
  • Escalation: page a senior engineer if the root cause is not clear after 5 minutes
  • Resolution: post "incident resolved"

Alert Examples & Responses

Alert: "Pod CrashLoopBackOff"

1. Get pod logs: kubectl logs <pod> --previous
2. Check for config issues: kubectl get configmap
3. Check for resource limits: kubectl describe pod <pod>
4. Decide: rollback or fix config
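
A consolidated version of that triage, assuming the pod lives in the vapora namespace (the pod name is a placeholder):

# CrashLoopBackOff triage
POD=vapora-backend-xxxxx   # placeholder - use the actual pod name

kubectl logs "$POD" -n vapora --previous            # logs from the crashed container
kubectl describe pod "$POD" -n vapora | tail -30    # events, restart count, resource limits
kubectl get configmap -n vapora                     # confirm the expected config exists
# Only if a recent deployment caused the crashes:
# kubectl rollout undo deployment/vapora-backend -n vapora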

Alert: "High Error Rate (>5% 5xx)"

1. Check which endpoint: tail application logs
2. Check dependencies: database, cache, external APIs
3. Check recent deployment: git log
4. Decide: rollback or investigate further
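
One rough way to eyeball the 5xx rate from the backend logs - this assumes HTTP status codes appear in the access log lines, so adjust the pattern to your log format:

# Approximate 5xx count vs total requests over the last 10 minutes
kubectl logs deployment/vapora-backend -n vapora --since=10m | grep -cE ' 5[0-9]{2} '
kubectl logs deployment/vapora-backend -n vapora --since=10m | wc -l

# Was anything deployed recently?
kubectl rollout history deployment/vapora-backend -n vapora | tail -5
git log -3 --oneline -- provisioning/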

Alert: "Pod Memory > 90%"

1. Check actual usage: kubectl top pod <pod>
2. Check limits: kubectl get pod <pod> -o yaml | grep memory
3. Decide: scale up or investigate memory leak
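
To compare actual usage against the configured limit (the jsonpath query is one way to pull the limit; the pod name is a placeholder):

POD=vapora-backend-xxxxx   # placeholder - use the actual pod name

# Current usage vs configured limit
kubectl top pod "$POD" -n vapora
kubectl get pod "$POD" -n vapora \
  -o jsonpath='{range .spec.containers[*]}{.name}{": "}{.resources.limits.memory}{"\n"}{end}'

# If it looks like a leak rather than load, deleting the pod buys time while you investigate
# (the deployment recreates it):
# kubectl delete pod "$POD" -n vapora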

Alert: "Node NotReady"

1. Check node: kubectl describe node <node>
2. Check kubelet: ssh node-x 'systemctl status kubelet'
3. Contact infrastructure team for hardware issues
4. Possibly: drain node and reschedule pods
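
If the node has to come out of service, cordon and drain it so pods reschedule elsewhere; the flags shown are the standard ones, but double-check before draining a node that hosts stateful workloads:

NODE=node-x   # placeholder - use the actual node name

kubectl describe node "$NODE" | grep -A5 Conditions    # why is it NotReady?
kubectl cordon "$NODE"                                  # stop new pods landing on it
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data
# Once the underlying issue is fixed:
# kubectl uncordon "$NODE"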

Monitoring Dashboard Setup

When you start shift, have these visible:

Browser Tabs (Keep Open)

  1. Grafana Dashboard - VAPORA Cluster Overview

    • Pod CPU/Memory usage
    • Request rate and latency
    • Error rate
    • Deployment status
  2. Kubernetes Dashboard

    • kubectl port-forward -n kube-system svc/kubernetes-dashboard 8443:443
    • Or use K9s terminal UI: k9s
  3. Alert Dashboard (if available)

    • Prometheus Alerts
    • Or monitoring system of choice
  4. Status Page (if public-facing)

    • Check for ongoing incidents
    • Prepare to update

Terminal Windows (Keep Ready)

# Terminal 1: Watch pods
watch kubectl get pods -n vapora

# Terminal 2: Tail logs
kubectl logs -f deployment/vapora-backend -n vapora

# Terminal 3: General kubectl commands
kubectl -n vapora get events --watch

# Terminal 4: Ad-hoc commands and troubleshooting
# (leave empty for ad-hoc use)

Common Questions During On-Call

Q: I think I found an issue, but I'm not sure it's a problem

A: When in doubt, escalate:

  1. Post your observation in the #deployments channel
  2. Ask: "Does this look normal?"
  3. If others see it too, treat it as a real issue
  4. Better safe than sorry in production

Q: Do I need to respond to every alert?

A: Yes. Even false alarms need verification:

  1. Confirm it is actually a false alarm (don't just assume it is)
  2. Update the alert if it's misconfigured
  3. Never ignore alerts - fix the alerting instead

Q: Service looks broken but dashboard looks normal

A:

  1. Check whether the dashboard is lagging (metrics refresh can be slow)
  2. Test manually: curl endpoints
  3. Check pod logs directly: kubectl logs
  4. Trust actual service health over dashboard

Q: Can I deploy changes while on-call?

A:

  • Yes, if it's an emergency fix for an active incident
  • No for routine features or changes (schedule them for a dedicated deployment window)
  • Escalate if unsure

Q: Something looks weird but I can't reproduce it

A:

  1. Save any evidence: logs, metrics, events
  2. Monitor more closely for pattern
  3. Document in ticket for later investigation
  4. Escalate if behavior continues

Q: An alert keeps firing but service is fine

A:

  1. Investigate why alert is false
  2. Check alert thresholds (might be too sensitive)
  3. Fix the alert configuration
  4. Update alert runbook with details

Escalation Decision Tree

When should you escalate?

START: Issue detected

Is it Severity 1 (complete outage)?
  YES → Escalate immediately to senior engineer
  NO → Continue

Have you diagnosed root cause in 5 minutes?
  YES → Continue with fix
  NO → Page senior engineer or escalate

Does fix require infrastructure/database changes?
  YES → Contact infrastructure/DBA team
  NO → Continue with fix

Is this outside your authority (company policy)?
  YES → Escalate to manager
  NO → Proceed with fix

Implemented fix, service still broken?
  YES → Page senior engineer immediately
  NO → Verify and close incident

Result: Uncertain?
  → Ask senior engineer or manager
  → Always better to escalate early

When to Page Senior Engineer

Page immediately if:

  • Service completely down (Severity 1)
  • Database appears corrupted
  • You're stuck for >5 minutes
  • Rollback didn't work
  • Need infrastructure changes urgently
  • Something affecting >50% of users

Don't page just because:

  • Single pod restarting (monitor first)
  • Transient network errors
  • You're slightly unsure (ask in #deployments first)
  • It's 3 AM and not critical (file a ticket for the morning instead)

End of Shift Handoff

Create Handoff Report

SHIFT HANDOFF - [Your Name]
Dates: [Start] to [End] UTC
Duration: [X hours]

STATUS: ✅ All normal / ⚠️ Issues ongoing / ❌ Critical

INCIDENTS: [Number]
- Incident 1: [description, resolved or ongoing]
- Incident 2: [description]

ALERTS: [Any unusual alerts]
- Alert 1: [description, action taken]

DEPLOYMENTS: [Any scheduled or happened]
- Deployment 1: [status]

KNOWN ISSUES:
- Issue 1: [description, workaround]
- Issue 2: [description]

MONITORING NOTES:
- [Any trending issues]
- [Any monitoring gaps]
- [Any recommended actions]

RECOMMENDATIONS FOR NEXT ON-CALL:
1. [Action item]
2. [Action item]
3. [Action item]

NEXT ON-CALL: @[name]

Send to Next On-Call

@next-on-call - Handoff notes attached:
[paste report above]

Key points:
- [Most important item]
- [Second important]
- [Any urgent follow-ups]

Questions? I'm available for the next 30 minutes

Tools & Commands Reference

Essential Commands

# Pod management
kubectl get pods -n vapora
kubectl logs pod-name -n vapora
kubectl exec pod-name -n vapora -- bash
kubectl describe pod pod-name -n vapora
kubectl delete pod pod-name -n vapora  # (recreates via deployment)

# Deployment management
kubectl get deployments -n vapora
kubectl rollout status deployment/vapora-backend -n vapora
kubectl rollout undo deployment/vapora-backend -n vapora
kubectl scale deployment/vapora-backend --replicas=5 -n vapora

# Service health
curl http://localhost:8001/health
kubectl get events -n vapora
kubectl top pods -n vapora
kubectl get endpoints -n vapora

# Quick diagnostics
kubectl describe nodes
kubectl cluster-info
kubectl get persistentvolumes   # or: kubectl get pv

Useful Tools

# Install these on your workstation
brew install kubectl              # Kubernetes CLI
brew install k9s                  # Terminal UI for K8s
brew install watch               # Monitor command output
brew install jq                  # JSON processing
brew install yq                  # YAML processing
brew install grpcurl             # gRPC debugging

# Aliases to save time
alias k='kubectl'
alias kgp='kubectl get pods'
alias klogs='kubectl logs'
alias kexec='kubectl exec'

Bookmark these:

  • Grafana: https://grafana.vapora.com
  • Status Page: https://status.vapora.com
  • Incident Tracker: https://github.com/your-org/vapora/issues
  • Runbooks: https://github.com/your-org/vapora/tree/main/docs/operations
  • Kubernetes Dashboard: Run kubectl proxy then http://localhost:8001/ui

On-Call Checklist

Starting Shift

  • Verified pager notifications working
  • Tested access to all systems
  • Reviewed current system status
  • Read recent incidents
  • Received handoff from previous on-call
  • Set up monitoring dashboards
  • Opened necessary terminal windows
  • Posted "on-call" status in #deployments

During Shift

  • Responded to all alerts within SLA
  • Updated incident status regularly
  • Escalated when appropriate
  • Documented actions in tickets
  • Verified fixes before closing
  • Communicated clearly with team

Ending Shift

  • Created handoff report
  • Resolved or escalated open issues
  • Updated monitoring for anomalies
  • Passed report to next on-call
  • Closed out incident tickets
  • Verified next on-call is ready
  • Posted "handing off to [next on-call]" in #deployments

Post-On-Call Follow-Up

After your shift:

  1. Document lessons learned

    • Did you learn something new?
    • Did any procedure need updating?
    • Were any runbooks unclear?
  2. Update runbooks

    • If you found gaps, update procedures
    • If you had questions, update docs
    • Share improvements with team
  3. Communicate findings

    • Anything the team should know?
    • Any recommendations?
    • Trends to watch?
  4. Celebrate successes

    • Any incidents quickly resolved?
    • Any new insights?
    • Recognize good practices

Emergency Contacts

Keep these accessible:

ESCALATION CONTACTS:

Primary Escalation: [Name] [Phone] [Slack]
Backup Escalation:  [Name] [Phone] [Slack]
Infrastructure:     [Name] [Phone] [Slack]
Database Team:      [Name] [Phone] [Slack]
Manager:            [Name] [Phone] [Slack]

External Contacts:
AWS Support:        [Account ID] [Contact]
CDN Provider:       [Account] [Contact]
DNS Provider:       [Account] [Contact]

EMERGENCY PROCEDURES:
- Complete AWS outage: Contact AWS support immediately
- Database failure: Contact DBA, activate backups
- Security incident: Contact security team immediately
- Major data loss: Activate disaster recovery

Remember

You are the guardian of production - Your vigilance keeps services running

Better safe than sorry - Escalate early and often

Communication is key - Keep team informed

Document everything - Future you and team will thank you

Ask for help - No shame in escalating

Don't guess - Verify before taking action

Don't stay silent - Alert team to any issues

Don't ignore alerts - Even false ones need investigation