On-Call Procedures
Guide for on-call engineers managing VAPORA production operations.
Overview
On-Call Responsibility: Monitor VAPORA production and respond to incidents during your assigned shift
Time Commitment:
- During business hours: ~5-10 minutes of daily check-ins
- During off-hours: Available for emergencies (paged for critical issues)
Expected Availability:
- Severity 1: Respond within 2 minutes
- Severity 2: Respond within 15 minutes
- Severity 3: Respond within 1 hour
Before Your Shift Starts
24 Hours Before On-Call
- Verify schedule: "I'm on-call starting [date] [time]"
- Update your calendar with shift times
- Notify team: "I'll be on-call [dates]"
- Share personal contact info if not already shared
- Download necessary tools/credentials
1 Hour Before Shift
- Test pager notification system
  # Verify Slack notifications working
  # Ask previous on-call to send test alert: "/test-alert-to-[yourname]"
- Verify access to necessary systems
  # Test each required access:
  ✓ SSH to bastion host: ssh bastion.vapora.com
  ✓ kubectl to production: kubectl cluster-info
  ✓ Slack channels: /join #deployments #alerts
  ✓ Incident tracking: open Jira/GitHub
  ✓ Monitoring dashboards: access Grafana
  ✓ Status page: access status page admin
- Review current system status
  # Quick health check
  kubectl cluster-info
  kubectl get pods -n vapora
  kubectl get events -n vapora | head -10
  # Should show: all pods Running, no recent errors
- Read recent incident reports
  - Check previous on-call handoff notes
  - Review any incidents from the past week
  - Note any known issues or monitoring gaps
- Receive handoff from previous on-call
  Ask: "Anything I should know?"
  - Any ongoing issues?
  - Any deployments planned?
  - Any flaky services or known alerts?
  - Any customer complaints?
Daily On-Call Tasks
Morning Check-In (After shift starts)
# Automated check - run this first thing
export NAMESPACE=vapora
echo "=== Cluster Health ==="
kubectl cluster-info
kubectl get nodes
echo "=== Pod Status ==="
kubectl get pods -n $NAMESPACE
kubectl get pods -n $NAMESPACE | grep -v Running
echo "=== Recent Events ==="
kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' | tail -10
echo "=== Resource Usage ==="
kubectl top nodes
kubectl top pods -n $NAMESPACE
# If any anomalies: investigate before declaring "all clear"
Mid-Shift Check (Every 4 hours)
# Quick sanity check
curl -i https://api.vapora.com/health
curl -i https://vapora.app/
# Both should return a 200 status line
# Check dashboards
# Grafana: any alerts? any trending issues?
# Check Slack #alerts channel
# Any warnings or anomalies posted?
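If you prefer a single loop for the mid-shift check, a sketch like this flags any endpoint that does not return 200 (the endpoint list is an assumption; extend it to match your setup):
for url in https://api.vapora.com/health https://vapora.app/; do
  code=$(curl -sS -o /dev/null -w '%{http_code}' --max-time 10 "$url")
  if [ "$code" = "200" ]; then
    echo "OK   $url"
  else
    echo "WARN $url returned $code - investigate"
  fi
done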
End-of-Shift Handoff (Before shift ends)
# Prepare handoff for next on-call
# 1. Document current state
kubectl get pods -n vapora
kubectl get nodes
kubectl top pods -n vapora
# 2. Check for known issues
kubectl get events -n vapora | grep Warning
# Any persistent warnings?
# 3. Check deployment status
git log -1 --oneline provisioning/
# Any recent changes?
# 4. Document in handoff notes:
echo "HANDOFF NOTES - $(date)
Duration: [start time] to [end time]
Status: All normal / Issues: [list]
Alerts: [any]
Deployments: [any planned]
Known issues: [any]
Recommendations: [any]
" > on-call-handoff.txt
# 5. Pass notes to next on-call
# Send message to @next-on-call with notes
Responding to Alerts
Alert Received
Step 1: Verify it's real
# Don't panic - verify the alert is legitimate
1. Check the source: is it from our system?
2. Check current status manually: curl endpoints
3. Check dashboard: see if issue visible there
4. Check cluster: kubectl get pods
# False alarms happen - verify before escalating
Step 2: Assess severity
- Is service completely down? → Severity 1
- Is service partially down? → Severity 2
- Is there a warning/anomaly? → Severity 3
Step 3: Declare incident
# Create ticket (Severity 1 is emergency)
# If Severity 1:
# - Alert team immediately
# - Create #incident-[date] channel
# - Start 2-minute update cycle
# See: Incident Response Runbook
During Incident
Your role as on-call:
- Respond quickly - First 2 minutes are critical
- Communicate - Update team/status page
- Investigate - Follow diagnostics in runbooks
- Escalate if needed - Page senior engineer if stuck
- Execute fix - Follow approved procedures
- Verify recovery - Confirm service healthy
- Document - Record what happened
Key communication:
- Initial response time: < 2 min (post "investigating")
- Status update: every 2-5 minutes
- Escalation: if not clear after 5 minutes
- Resolution: post "incident resolved"
Alert Examples & Responses
Alert: "Pod CrashLoopBackOff"
1. Get pod logs: kubectl logs <pod> --previous
2. Check for config issues: kubectl get configmap
3. Check for resource limits: kubectl describe pod <pod>
4. Decide: rollback or fix config
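The commands below sketch steps 1-4 for a pod in the vapora namespace (the pod name is a placeholder, and the rollback assumes vapora-backend is the affected deployment):
POD=vapora-backend-xxxxx                                        # placeholder - use the crashing pod's name
kubectl logs "$POD" -n vapora --previous | tail -50             # why the last container died
kubectl describe pod "$POD" -n vapora | grep -A5 "Last State"   # exit code, OOMKilled, etc.
kubectl get configmap -n vapora                                 # was config changed recently?
kubectl rollout undo deployment/vapora-backend -n vapora        # only if a bad deploy is the cause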
Alert: "High Error Rate (>5% 5xx)"
1. Check which endpoint: tail application logs
2. Check dependencies: database, cache, external APIs
3. Check recent deployment: git log
4. Decide: rollback or investigate further
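A rough sketch of that triage, assuming the vapora-backend deployment and that 5xx codes appear in the application logs (the grep pattern depends on your log format):
kubectl logs deployment/vapora-backend -n vapora --since=10m | grep -c " 5[0-9][0-9] "   # rough 5xx count
kubectl rollout history deployment/vapora-backend -n vapora                              # recent rollouts
git log --oneline -5                                                                     # recent code changes
kubectl rollout undo deployment/vapora-backend -n vapora   # only if the timeline points at the deploy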
Alert: "Pod Memory > 90%"
1. Check actual usage: kubectl top pod <pod>
2. Check limits: kubectl get pod <pod> -o yaml | grep memory
3. Decide: scale up or investigate memory leak
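A sketch of that comparison (pod and deployment names are placeholders):
POD=vapora-backend-xxxxx                                        # placeholder
kubectl top pod "$POD" -n vapora                                # live usage
kubectl get pod "$POD" -n vapora -o jsonpath='{.spec.containers[*].resources}{"\n"}'   # requests/limits
# Usage legitimately near the limit: raise the limit or scale out
# Usage climbing steadily over hours: suspect a leak, open a ticket
kubectl scale deployment/vapora-backend --replicas=5 -n vapora  # short-term relief, if scaling applies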
Alert: "Node NotReady"
1. Check node: kubectl describe node <node>
2. Check kubelet: ssh node-x 'systemctl status kubelet'
3. Contact infrastructure team for hardware issues
4. Possibly: drain node and reschedule pods
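If the node has to come out of service, a drain sketch like this moves workloads off it first (node name is a placeholder; confirm with infrastructure before draining):
NODE=node-x                                                        # placeholder
kubectl describe node "$NODE" | grep -A10 Conditions               # why is it NotReady?
kubectl cordon "$NODE"                                             # stop new pods landing here
kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data   # evict existing pods
# After the node is repaired:
kubectl uncordon "$NODE"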
Monitoring Dashboard Setup
When you start shift, have these visible:
Browser Tabs (Keep Open)
- Grafana Dashboard - VAPORA Cluster Overview
  - Pod CPU/Memory usage
  - Request rate and latency
  - Error rate
  - Deployment status
- Kubernetes Dashboard
  - kubectl port-forward -n kube-system svc/kubernetes-dashboard 8443:443
  - Or use the k9s terminal UI: k9s
- Alert Dashboard (if available)
  - Prometheus Alerts
  - Or monitoring system of choice
- Status Page (if public-facing)
  - Check for ongoing incidents
  - Prepare to update
Terminal Windows (Keep Ready)
# Terminal 1: Watch pods
watch kubectl get pods -n vapora
# Terminal 2: Tail logs
kubectl logs -f deployment/vapora-backend -n vapora
# Terminal 3: General kubectl commands
kubectl -n vapora get events --watch
# Terminal 4: keep free for ad-hoc commands and troubleshooting
Common Questions During On-Call
Q: I think I found an issue, but I'm not sure it's a problem
A: When in doubt, escalate:
- Post in #deployments channel with observation
- Ask: "Does this look normal?"
- If others confirm it, treat it as a real issue
- Better safe than sorry (on production)
Q: Do I need to respond to every alert?
A: Yes. Even false alarms need verification:
- Confirm it's a false alarm (don't just assume it is)
- Update the alert if it's misconfigured
- Never ignore alerts - fix the alerting instead
Q: Service looks broken but dashboard looks normal
A:
- Check whether the dashboard is delayed (refreshes can lag)
- Test manually: curl endpoints
- Check pod logs directly: kubectl logs
- Trust actual service health over dashboard
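A quick manual check that shows both status and latency, using the endpoints assumed earlier in this guide:
curl -sS -o /dev/null -w 'status=%{http_code} total=%{time_total}s\n' https://api.vapora.com/health
kubectl logs deployment/vapora-backend -n vapora --since=5m | tail -20   # what the service itself is saying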
Q: Can I deploy changes while on-call?
A:
- Yes, if it's an emergency fix for an active incident
- No, for normal features/changes (schedule them for a dedicated deployment window)
- Escalate if unsure
Q: Something looks weird but I can't reproduce it
A:
- Save any evidence: logs, metrics, events (see the sketch after this answer)
- Monitor more closely for pattern
- Document in ticket for later investigation
- Escalate if behavior continues
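A sketch for capturing that evidence before logs and events rotate away (filenames and the deployment name are illustrative):
ts=$(date +%Y%m%d-%H%M%S)
kubectl get pods -n vapora -o wide                            > "evidence-pods-$ts.txt"
kubectl get events -n vapora --sort-by='.lastTimestamp'       > "evidence-events-$ts.txt"
kubectl logs deployment/vapora-backend -n vapora --since=30m  > "evidence-logs-$ts.txt"
kubectl top pods -n vapora                                    > "evidence-top-$ts.txt"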
Q: An alert keeps firing but service is fine
A:
- Investigate why alert is false
- Check alert thresholds (might be too sensitive)
- Fix the alert configuration
- Update alert runbook with details
Escalation Decision Tree
When should you escalate?
START: Issue detected
Is it Severity 1 (complete outage)?
YES → Escalate immediately to senior engineer
NO → Continue
Have you diagnosed root cause in 5 minutes?
YES → Continue with fix
NO → Page senior engineer or escalate
Does fix require infrastructure/database changes?
YES → Contact infrastructure/DBA team
NO → Continue with fix
Is this outside your authority (company policy)?
YES → Escalate to manager
NO → Proceed with fix
Implemented fix, service still broken?
YES → Page senior engineer immediately
NO → Verify and close incident
Result: Uncertain?
→ Ask senior engineer or manager
→ Always better to escalate early
When to Page Senior Engineer
Page immediately if:
- Service completely down (Severity 1)
- Database appears corrupted
- You're stuck for >5 minutes
- Rollback didn't work
- Need infrastructure changes urgently
- Something affecting >50% of users
Don't page just because:
- Single pod restarting (monitor first)
- Transient network errors
- You're slightly unsure (ask in #deployments first)
- It's 3 AM and the issue isn't critical (file a ticket for the morning instead)
End of Shift Handoff
Create Handoff Report
SHIFT HANDOFF - [Your Name]
Dates: [Start] to [End] UTC
Duration: [X hours]
STATUS: ✅ All normal / ⚠️ Issues ongoing / ❌ Critical
INCIDENTS: [Number]
- Incident 1: [description, resolved or ongoing]
- Incident 2: [description]
ALERTS: [Any unusual alerts]
- Alert 1: [description, action taken]
DEPLOYMENTS: [Any scheduled or happened]
- Deployment 1: [status]
KNOWN ISSUES:
- Issue 1: [description, workaround]
- Issue 2: [description]
MONITORING NOTES:
- [Any trending issues]
- [Any monitoring gaps]
- [Any recommended actions]
RECOMMENDATIONS FOR NEXT ON-CALL:
1. [Action item]
2. [Action item]
3. [Action item]
NEXT ON-CALL: @[name]
Send to Next On-Call
@next-on-call - Handoff notes attached:
[paste report above]
Key points:
- [Most important item]
- [Second important]
- [Any urgent follow-ups]
Questions? I'm available for the next 30 minutes
Tools & Commands Reference
Essential Commands
# Pod management
kubectl get pods -n vapora
kubectl logs pod-name -n vapora
kubectl exec pod-name -n vapora -- bash
kubectl describe pod pod-name -n vapora
kubectl delete pod pod-name -n vapora # (recreates via deployment)
# Deployment management
kubectl get deployments -n vapora
kubectl rollout status deployment/vapora-backend -n vapora
kubectl rollout undo deployment/vapora-backend -n vapora
kubectl scale deployment/vapora-backend --replicas=5 -n vapora
# Service health
curl http://localhost:8001/health
kubectl get events -n vapora
kubectl top pods -n vapora
kubectl get endpoints -n vapora
# Quick diagnostics
kubectl describe nodes
kubectl cluster-info
kubectl get persistentvolumes
Useful Tools
# Install these on your workstation
brew install kubectl # Kubernetes CLI
brew install k9s # Terminal UI for K8s
brew install watch # Monitor command output
brew install jq # JSON processing
brew install yq # YAML processing
brew install grpcurl # gRPC debugging
# Aliases to save time
alias k='kubectl'
alias kgp='kubectl get pods'
alias klogs='kubectl logs'
alias kexec='kubectl exec'
Dashboards & Links
Bookmark these:
- Grafana: https://grafana.vapora.com
- Status Page: https://status.vapora.com
- Incident Tracker: https://github.com/your-org/vapora/issues
- Runbooks: https://github.com/your-org/vapora/tree/main/docs/operations
- Kubernetes Dashboard: run kubectl proxy, then open http://localhost:8001/ui
On-Call Checklist
Starting Shift
- Verified pager notifications working
- Tested access to all systems
- Reviewed current system status
- Read recent incidents
- Received handoff from previous on-call
- Set up monitoring dashboards
- Opened necessary terminal windows
- Posted "on-call" status in #deployments
During Shift
- Responded to all alerts within SLA
- Updated incident status regularly
- Escalated when appropriate
- Documented actions in tickets
- Verified fixes before closing
- Communicated clearly with team
Ending Shift
- Created handoff report
- Resolved or escalated open issues
- Updated monitoring for anomalies
- Passed report to next on-call
- Closed out incident tickets
- Verified next on-call is ready
- Posted "handing off to [next on-call]" in #deployments
Post-On-Call Follow-Up
After your shift:
- Document lessons learned
  - Did you learn something new?
  - Did any procedure need updating?
  - Were any runbooks unclear?
- Update runbooks
  - If you found gaps, update procedures
  - If you had questions, update docs
  - Share improvements with team
- Communicate findings
  - Anything the team should know?
  - Any recommendations?
  - Trends to watch?
- Celebrate successes
  - Any incidents quickly resolved?
  - Any new insights?
  - Recognize good practices
Emergency Contacts
Keep these accessible:
ESCALATION CONTACTS:
Primary Escalation: [Name] [Phone] [Slack]
Backup Escalation: [Name] [Phone] [Slack]
Infrastructure: [Name] [Phone] [Slack]
Database Team: [Name] [Phone] [Slack]
Manager: [Name] [Phone] [Slack]
External Contacts:
AWS Support: [Account ID] [Contact]
CDN Provider: [Account] [Contact]
DNS Provider: [Account] [Contact]
EMERGENCY PROCEDURES:
- Complete AWS outage: Contact AWS support immediately
- Database failure: Contact DBA, activate backups
- Security incident: Contact security team immediately
- Major data loss: Activate disaster recovery
Remember
✅ You are the guardian of production - Your vigilance keeps services running
✅ Better safe than sorry - Escalate early and often
✅ Communication is key - Keep team informed
✅ Document everything - Future you and team will thank you
✅ Ask for help - No shame in escalating
❌ Don't guess - Verify before taking action
❌ Don't stay silent - Alert team to any issues
❌ Don't ignore alerts - Even false ones need investigation