# On-Call Procedures
Guide for on-call engineers managing VAPORA production operations.
---
## Overview
**On-Call Responsibility**: Monitor VAPORA production and respond to incidents during assigned shift
**Time Commitment**:
- During business hours: ~5-10 minutes of daily check-ins
- During off-hours: Available for emergencies (paged for critical issues)
**Expected Availability**:
- Severity 1: Respond within 2 minutes
- Severity 2: Respond within 15 minutes
- Severity 3: Respond within 1 hour
---
## Before Your Shift Starts
### 24 Hours Before On-Call
- [ ] Verify schedule: "I'm on-call starting [date] [time]"
- [ ] Update your calendar with shift times
- [ ] Notify team: "I'll be on-call [dates]"
- [ ] Share personal contact info if not already shared
- [ ] Download necessary tools/credentials
### 1 Hour Before Shift
- [ ] Test pager notification system
```bash
# Verify Slack notifications working
# Ask previous on-call to send test alert: "/test-alert-to-[yourname]"
```
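If alerts reach you through a Slack incoming webhook, a quick self-test like the sketch below confirms the channel end-to-end; `SLACK_WEBHOOK_URL` is an assumption about your setup, not something defined in this repo.
```bash
# Hedged sketch: post a test message via an assumed Slack incoming webhook URL
curl -sS -X POST -H 'Content-type: application/json' \
  --data '{"text":"On-call pager test - please ignore"}' \
  "$SLACK_WEBHOOK_URL"
```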
- [ ] Verify access to necessary systems
```bash
# Test each required access:
# ✓ SSH to bastion host:    ssh bastion.vapora.com
# ✓ kubectl to production:  kubectl cluster-info
# ✓ Slack channels:         /join #deployments #alerts
# ✓ Incident tracking:      open Jira/GitHub
# ✓ Monitoring dashboards:  access Grafana
# ✓ Status page:            access status page admin
```
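The same checks as a non-interactive sketch; the hostnames come from this doc, and the Grafana `/api/health` path assumes a standard Grafana install.
```bash
# Quick access self-check (adjust hostnames to your environment)
ssh -o BatchMode=yes -o ConnectTimeout=5 bastion.vapora.com true \
  && echo "bastion: OK" || echo "bastion: FAILED"
kubectl cluster-info > /dev/null 2>&1 \
  && echo "kubectl: OK" || echo "kubectl: FAILED"
curl -sf https://grafana.vapora.com/api/health > /dev/null \
  && echo "grafana: OK" || echo "grafana: FAILED"
```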
- [ ] Review current system status
```bash
# Quick health check
kubectl cluster-info
kubectl get pods -n vapora
kubectl get events -n vapora --sort-by='.lastTimestamp' | tail -10
# Should show: All pods Running, no recent errors
```
- [ ] Read recent incident reports
- Check previous on-call handoff notes
- Review any incidents from past week
- Note any known issues or monitoring gaps
- [ ] Receive handoff from previous on-call
```
Ask: "Anything I should know?"
- Any ongoing issues?
- Any deployments planned?
- Any flaky services or known alerts?
- Any customer complaints?
```
---
## Daily On-Call Tasks
### Morning Check-In (After shift starts)
```bash
# Automated check - run this first thing
export NAMESPACE=vapora
echo "=== Cluster Health ==="
kubectl cluster-info
kubectl get nodes
echo "=== Pod Status ==="
kubectl get pods -n $NAMESPACE
kubectl get pods -n $NAMESPACE | grep -v Running   # anything not Running needs a look
echo "=== Recent Events ==="
kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' | tail -10
echo "=== Resource Usage ==="
kubectl top nodes
kubectl top pods -n $NAMESPACE
# If any anomalies: investigate before declaring "all clear"
```
### Mid-Shift Check (Every 4 hours)
```bash
# Quick sanity check
curl -sS -o /dev/null -w 'api.vapora.com/health -> %{http_code}\n' https://api.vapora.com/health
curl -sS -o /dev/null -w 'vapora.app -> %{http_code}\n' https://vapora.app/
# Both should report 200
# Check dashboards
# Grafana: any alerts? any trending issues?
# Check Slack #alerts channel
# Any warnings or anomalies posted?
```
### End-of-Shift Handoff (Before shift ends)
```bash
# Prepare handoff for next on-call
# 1. Document current state
kubectl get pods -n vapora
kubectl get nodes
kubectl top pods -n vapora
# 2. Check for known issues
kubectl get events -n vapora | grep Warning
# Any persistent warnings?
# 3. Check deployment status
git log -1 --oneline provisioning/
# Any recent changes?
# 4. Document in handoff notes:
echo "HANDOFF NOTES - $(date)
Duration: [start time] to [end time]
Status: All normal / Issues: [list]
Alerts: [any]
Deployments: [any planned]
Known issues: [any]
Recommendations: [any]
" > on-call-handoff.txt
# 5. Pass notes to next on-call
# Send message to @next-on-call with notes
```
---
## Responding to Alerts
### Alert Received
**Step 1: Verify it's real**
```bash
# Don't panic - verify the alert is legitimate
# 1. Check the source: is it from our system?
# 2. Check current status manually: curl endpoints
# 3. Check dashboard: see if issue visible there
# 4. Check cluster: kubectl get pods
# False alarms happen - verify before escalating
```
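A concrete verification pass might look like this, using the same endpoints as the mid-shift check:
```bash
# Manual verification before declaring an incident
curl -sS -o /dev/null -w 'api health -> %{http_code}\n' https://api.vapora.com/health
curl -sS -o /dev/null -w 'frontend   -> %{http_code}\n' https://vapora.app/
kubectl get pods -n vapora | grep -v Running                       # any pods not Running?
kubectl get events -n vapora --sort-by='.lastTimestamp' | tail -5   # anything recent and suspicious?
```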
**Step 2: Assess severity**
- Is service completely down? → Severity 1
- Is service partially down? → Severity 2
- Is there a warning/anomaly? → Severity 3
**Step 3: Declare incident**
```bash
# Create ticket (Severity 1 is emergency)
# If Severity 1:
# - Alert team immediately
# - Create #incident-[date] channel
# - Start 2-minute update cycle
# See: Incident Response Runbook
```
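If incidents are tracked as GitHub issues (see Dashboards & Links) and the GitHub CLI is installed, the ticket can be opened from the terminal; the title, label, and body below are placeholders.
```bash
# Open an incident ticket from the terminal (assumes gh is installed and
# authenticated, and that an "incident" label exists in the repo)
gh issue create \
  --title "SEV1: [short description] - $(date -u +%Y-%m-%dT%H:%MZ)" \
  --label incident \
  --body "Status: investigating. On-call: $(whoami). Updates every 2 minutes."
```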
### During Incident
**Your role as on-call**:
1. **Respond quickly** - First 2 minutes are critical
2. **Communicate** - Update team/status page
3. **Investigate** - Follow diagnostics in runbooks
4. **Escalate if needed** - Page senior engineer if stuck
5. **Execute fix** - Follow approved procedures
6. **Verify recovery** - Confirm service healthy
7. **Document** - Record what happened
**Key communication**:
- Initial response time: < 2 min (post "investigating")
- Status update: every 2-5 minutes
- Escalation: if not clear after 5 minutes
- Resolution: post "incident resolved"
### Alert Examples & Responses
#### Alert: "Pod CrashLoopBackOff"
```
1. Get pod logs: kubectl logs <pod> --previous
2. Check for config issues: kubectl get configmap
3. Check for resource limits: kubectl describe pod <pod>
4. Decide: rollback or fix config
```
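For step 4, if a recent release is the likely cause, rolling back is usually the fastest fix; the deployment name below assumes the backend is the affected workload.
```bash
# Roll back the affected deployment and confirm pods settle
kubectl rollout undo deployment/vapora-backend -n vapora
kubectl rollout status deployment/vapora-backend -n vapora
kubectl get pods -n vapora | grep -v Running
```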
#### Alert: "High Error Rate (>5% 5xx)"
```
1. Check which endpoint: tail application logs
2. Check dependencies: database, cache, external APIs
3. Check recent deployment: git log
4. Decide: rollback or investigate further
```
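A rough sketch of steps 1-3; the grep pattern assumes the HTTP status appears in each access-log line, so adjust it to the actual log format.
```bash
# Count recent 5xx responses in the backend logs (pattern is an assumption)
kubectl logs deployment/vapora-backend -n vapora --since=10m | grep -cE ' 5[0-9]{2} '
# Confirm services still have healthy backends
kubectl get endpoints -n vapora
# See what shipped recently before deciding on a rollback
git log -3 --oneline
```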
#### Alert: "Pod Memory > 90%"
```
1. Check actual usage: kubectl top pod <pod>
2. Check limits: kubectl get pod <pod> -o yaml | grep memory
3. Decide: scale up or investigate memory leak
```
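The same checks as commands, with `<pod>` taken from the alert; adding replicas is one option if the usage is legitimate rather than a leak.
```bash
# Compare live usage against the configured memory limit
kubectl top pod <pod> -n vapora
kubectl get pod <pod> -n vapora \
  -o jsonpath='{.spec.containers[*].resources.limits.memory}{"\n"}'
# If usage is legitimate, scaling out buys headroom while you investigate
kubectl scale deployment/vapora-backend --replicas=5 -n vapora
```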
#### Alert: "Node NotReady"
```
1. Check node: kubectl describe node <node>
2. Check kubelet: ssh <node> 'systemctl status kubelet'
3. Contact infrastructure team for hardware issues
4. Possibly: drain node and reschedule pods
```
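If the node has to come out of service (step 4), cordon and drain it so pods reschedule elsewhere; `<node>` comes from the alert.
```bash
# Take the node out of scheduling and move its pods elsewhere
kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
# After the underlying issue is fixed, return it to the pool
kubectl uncordon <node>
```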
---
## Monitoring Dashboard Setup
When you start shift, have these visible:
### Browser Tabs (Keep Open)
1. **Grafana Dashboard** - VAPORA Cluster Overview
- Pod CPU/Memory usage
- Request rate and latency
- Error rate
- Deployment status
2. **Kubernetes Dashboard**
- `kubectl port-forward -n kube-system svc/kubernetes-dashboard 8443:443`
- Or use K9s terminal UI: `k9s`
3. **Alert Dashboard** (if available)
- Prometheus Alerts
- Or monitoring system of choice
4. **Status Page** (if public-facing)
- Check for ongoing incidents
- Prepare to update
### Terminal Windows (Keep Ready)
```bash
# Terminal 1: Watch pods
watch kubectl get pods -n vapora
# Terminal 2: Tail logs
kubectl logs -f deployment/vapora-backend -n vapora
# Terminal 3: General kubectl commands
kubectl -n vapora get events --watch
# Terminal 4: Ad-hoc commands and troubleshooting
# (leave empty for ad-hoc use)
```
---
## Common Questions During On-Call
### Q: I think I found an issue, but I'm not sure it's a problem
**A**: When in doubt, escalate:
1. Post in #deployments channel with observation
2. Ask: "Does this look normal?"
3. If others confirm: treat it as an issue
4. Better safe than sorry (on production)
### Q: Do I need to respond to every alert?
**A**: Yes. Even false alarms need verification:
1. Confirm it's a false alarm (don't just assume)
2. Update alert if it's misconfigured
3. Never ignore alerts - fix the alerting
### Q: Service looks broken but dashboard looks normal
**A**:
1. Check if the dashboard is delayed (refreshes can be slow)
2. Test manually: curl endpoints
3. Check pod logs directly: kubectl logs
4. Trust actual service health over dashboard
### Q: Can I deploy changes while on-call?
**A**:
- **Yes** if it's emergency fix for active incident
- **No** for normal features/changes (schedule for dedicated deployment window)
- **Escalate** if unsure
### Q: Something looks weird but I can't reproduce it
**A**:
1. Save any evidence: logs, metrics, events
2. Monitor more closely for pattern
3. Document in ticket for later investigation
4. Escalate if behavior continues
### Q: An alert keeps firing but service is fine
**A**:
1. Investigate why alert is false
2. Check alert thresholds (might be too sensitive)
3. Fix the alert configuration
4. Update alert runbook with details
---
## Escalation Decision Tree
When should you escalate?
```
START: Issue detected
Is it Severity 1 (complete outage)?
YES → Escalate immediately to senior engineer
NO → Continue
Have you diagnosed root cause in 5 minutes?
YES → Continue with fix
NO → Page senior engineer or escalate
Does fix require infrastructure/database changes?
YES → Contact infrastructure/DBA team
NO → Continue with fix
Is this outside your authority (company policy)?
YES → Escalate to manager
NO → Proceed with fix
Implemented fix, service still broken?
YES → Page senior engineer immediately
NO → Verify and close incident
Result: Uncertain?
→ Ask senior engineer or manager
→ Always better to escalate early
```
---
## When to Page Senior Engineer
**Page immediately if**:
- Service completely down (Severity 1)
- Database appears corrupted
- You're stuck for >5 minutes
- Rollback didn't work
- Need infrastructure changes urgently
- Something affecting >50% of users
**Don't page just because**:
- Single pod restarting (monitor first)
- Transient network errors
- You're slightly unsure (ask in #deployments first)
- It's 3 AM and not critical (use tickets for morning)
---
## End of Shift Handoff
### Create Handoff Report
```
SHIFT HANDOFF - [Your Name]
Dates: [Start] to [End] UTC
Duration: [X hours]
STATUS: ✅ All normal / ⚠️ Issues ongoing / ❌ Critical
INCIDENTS: [Number]
- Incident 1: [description, resolved or ongoing]
- Incident 2: [description]
ALERTS: [Any unusual alerts]
- Alert 1: [description, action taken]
DEPLOYMENTS: [Any scheduled or completed]
- Deployment 1: [status]
KNOWN ISSUES:
- Issue 1: [description, workaround]
- Issue 2: [description]
MONITORING NOTES:
- [Any trending issues]
- [Any monitoring gaps]
- [Any recommended actions]
RECOMMENDATIONS FOR NEXT ON-CALL:
1. [Action item]
2. [Action item]
3. [Action item]
NEXT ON-CALL: @[name]
```
### Send to Next On-Call
```
@next-on-call - Handoff notes attached:
[paste report above]
Key points:
- [Most important item]
- [Second important]
- [Any urgent follow-ups]
Questions? I'm available for 30 min
```
---
## Tools & Commands Reference
### Essential Commands
```bash
# Pod management
kubectl get pods -n vapora
kubectl logs pod-name -n vapora
kubectl exec pod-name -n vapora -- bash
kubectl describe pod pod-name -n vapora
kubectl delete pod pod-name -n vapora # (recreates via deployment)
# Deployment management
kubectl get deployments -n vapora
kubectl rollout status deployment/vapora-backend -n vapora
kubectl rollout undo deployment/vapora-backend -n vapora
kubectl scale deployment/vapora-backend --replicas=5 -n vapora
# Service health
curl http://localhost:8001/health
kubectl get events -n vapora
kubectl top pods -n vapora
kubectl get endpoints -n vapora
# Quick diagnostics
kubectl describe nodes
kubectl cluster-info
kubectl get persistentvolumes
```
### Useful Tools
```bash
# Install these on your workstation
brew install kubectl # Kubernetes CLI
brew install k9s # Terminal UI for K8s
brew install watch # Monitor command output
brew install jq # JSON processing
brew install yq # YAML processing
brew install grpcurl # gRPC debugging
# Aliases to save time
alias k='kubectl'
alias kgp='kubectl get pods'
alias klogs='kubectl logs'
alias kexec='kubectl exec'
```
### Dashboards & Links
Bookmark these:
- Grafana: `https://grafana.vapora.com`
- Status Page: `https://status.vapora.com`
- Incident Tracker: `https://github.com/your-org/vapora/issues`
- Runbooks: `https://github.com/your-org/vapora/tree/main/docs/operations`
- Kubernetes Dashboard: Run `kubectl port-forward -n kube-system svc/kubernetes-dashboard 8443:443`, then open `https://localhost:8443`
---
## On-Call Checklist
### Starting Shift
- [ ] Verified pager notifications working
- [ ] Tested access to all systems
- [ ] Reviewed current system status
- [ ] Read recent incidents
- [ ] Received handoff from previous on-call
- [ ] Set up monitoring dashboards
- [ ] Opened necessary terminal windows
- [ ] Posted "on-call" status in #deployments
### During Shift
- [ ] Responded to all alerts within SLA
- [ ] Updated incident status regularly
- [ ] Escalated when appropriate
- [ ] Documented actions in tickets
- [ ] Verified fixes before closing
- [ ] Communicated clearly with team
### Ending Shift
- [ ] Created handoff report
- [ ] Resolved or escalated open issues
- [ ] Updated monitoring/alerting for any anomalies found
- [ ] Passed report to next on-call
- [ ] Closed out incident tickets
- [ ] Verified next on-call is ready
- [ ] Posted "handing off to [next on-call]" in #deployments
---
## Post-On-Call Follow-Up
After your shift:
1. **Document lessons learned**
- Did you learn something new?
- Did any procedure need updating?
- Were any runbooks unclear?
2. **Update runbooks**
- If you found gaps, update procedures
- If you had questions, update docs
- Share improvements with team
3. **Communicate findings**
- Anything the team should know?
- Any recommendations?
- Trends to watch?
4. **Celebrate successes**
- Any incidents quickly resolved?
- Any new insights?
- Recognize good practices
---
## Emergency Contacts
Keep these accessible:
```
ESCALATION CONTACTS:
Primary Escalation: [Name] [Phone] [Slack]
Backup Escalation: [Name] [Phone] [Slack]
Infrastructure: [Name] [Phone] [Slack]
Database Team: [Name] [Phone] [Slack]
Manager: [Name] [Phone] [Slack]
External Contacts:
AWS Support: [Account ID] [Contact]
CDN Provider: [Account] [Contact]
DNS Provider: [Account] [Contact]
EMERGENCY PROCEDURES:
- Complete AWS outage: Contact AWS support immediately
- Database failure: Contact DBA, activate backups
- Security incident: Contact security team immediately
- Major data loss: Activate disaster recovery
```
---
## Remember
**You are the guardian of production** - Your vigilance keeps services running
**Better safe than sorry** - Escalate early and often
**Communication is key** - Keep team informed
**Document everything** - Future you and team will thank you
**Ask for help** - No shame in escalating
**Don't guess** - Verify before taking action
**Don't stay silent** - Alert team to any issues
**Don't ignore alerts** - Even false ones need investigation