# On-Call Procedures

Guide for on-call engineers managing VAPORA production operations.

---

## Overview

**On-Call Responsibility**: Monitor VAPORA production and respond to incidents during your assigned shift.

**Time Commitment**:

- During business hours: ~5-10 minutes of daily check-ins
- During off-hours: Available for emergencies (paged for critical issues)

**Expected Availability**:

- Severity 1: Respond within 2 minutes
- Severity 2: Respond within 15 minutes
- Severity 3: Respond within 1 hour

---

## Before Your Shift Starts

### 24 Hours Before On-Call

- [ ] Verify the schedule: "I'm on-call starting [date] [time]"
- [ ] Update your calendar with shift times
- [ ] Notify the team: "I'll be on-call [dates]"
- [ ] Share personal contact info if not already shared
- [ ] Download necessary tools/credentials

### 1 Hour Before Shift

- [ ] Test the pager notification system

  ```bash
  # Verify Slack notifications are working.
  # Ask the previous on-call to send a test alert:
  # "/test-alert-to-[yourname]"
  ```

- [ ] Verify access to necessary systems (a scripted version of these checks is sketched at the end of this section)

  ```bash
  # Test each required access:
  # ✓ SSH to bastion host:    ssh bastion.vapora.com
  # ✓ kubectl to production:  kubectl cluster-info
  # ✓ Slack channels:         /join #deployments #alerts
  # ✓ Incident tracking:      open Jira/GitHub
  # ✓ Monitoring dashboards:  access Grafana
  # ✓ Status page:            access the status page admin
  ```

- [ ] Review current system status

  ```bash
  # Quick health check
  kubectl cluster-info
  kubectl get pods -n vapora
  kubectl get events -n vapora | head -10
  # Should show: all pods Running, no recent errors
  ```

- [ ] Read recent incident reports
  - Check the previous on-call's handoff notes
  - Review any incidents from the past week
  - Note any known issues or monitoring gaps

- [ ] Receive handoff from the previous on-call

  ```
  Ask: "Anything I should know?"
  - Any ongoing issues?
  - Any deployments planned?
  - Any flaky services or known alerts?
  - Any customer complaints?
  ```
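The access checks above can also be run as one script so nothing gets skipped. Below is a minimal sketch, assuming the hosts and endpoints used elsewhere in this guide (`bastion.vapora.com`, `api.vapora.com`, `vapora.app`, `grafana.vapora.com`); the script name `pre-shift-check.sh` is illustrative, not an existing tool.

```bash
#!/usr/bin/env bash
# pre-shift-check.sh - illustrative sketch of the pre-shift access checks.
# Hostnames and endpoints come from the examples in this guide; adjust for your environment.
set -u

check() {
  # Run a command quietly and report OK/FAIL with the check's name.
  local name="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "OK   $name"
  else
    echo "FAIL $name  (tried: $*)"
  fi
}

check "SSH to bastion host"    ssh -o BatchMode=yes -o ConnectTimeout=5 bastion.vapora.com true
check "kubectl to production"  kubectl cluster-info
check "API health endpoint"    curl -fsS --max-time 5 https://api.vapora.com/health
check "Frontend"               curl -fsS --max-time 5 https://vapora.app/
check "Grafana reachable"      curl -fsS --max-time 5 https://grafana.vapora.com
```

Run it about an hour before your shift; any `FAIL` line means access should be sorted out before you take the pager.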
---

## Daily On-Call Tasks

### Morning Check-In (After shift starts)

```bash
# Automated check - run this first thing
export NAMESPACE=vapora

echo "=== Cluster Health ==="
kubectl cluster-info
kubectl get nodes

echo "=== Pod Status ==="
kubectl get pods -n $NAMESPACE
kubectl get pods -n $NAMESPACE | grep -v Running

echo "=== Recent Events ==="
kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' | tail -10

echo "=== Resource Usage ==="
kubectl top nodes
kubectl top pods -n $NAMESPACE

# If anything looks anomalous, investigate before declaring "all clear"
```

### Mid-Shift Check (Every 4 hours)

```bash
# Quick sanity check
curl https://api.vapora.com/health
curl https://vapora.app/
# Both should return 200 OK

# Check dashboards
# Grafana: any alerts? Any trending issues?

# Check the Slack #alerts channel
# Any warnings or anomalies posted?
```

### End-of-Shift Handoff (Before shift ends)

```bash
# Prepare the handoff for the next on-call

# 1. Document current state
kubectl get pods -n vapora
kubectl get nodes
kubectl top pods -n vapora

# 2. Check for known issues
kubectl get events -n vapora | grep Warning
# Any persistent warnings?

# 3. Check deployment status
git log -1 --oneline -- provisioning/
# Any recent changes?

# 4. Document in handoff notes:
echo "HANDOFF NOTES - $(date)
Duration: [start time] to [end time]
Status: All normal / Issues: [list]
Alerts: [any]
Deployments: [any planned]
Known issues: [any]
Recommendations: [any]
" > on-call-handoff.txt

# 5. Pass the notes to the next on-call
# Send a message to @next-on-call with the notes
```

---

## Responding to Alerts

### Alert Received

**Step 1: Verify it's real**

```bash
# Don't panic - verify the alert is legitimate
# 1. Check the source: is it from our system?
# 2. Check current status manually: curl the endpoints
# 3. Check the dashboard: is the issue visible there?
# 4. Check the cluster: kubectl get pods
# False alarms happen - verify before escalating
```

**Step 2: Assess severity**

- Is the service completely down? → Severity 1
- Is the service partially down? → Severity 2
- Is there a warning/anomaly? → Severity 3

**Step 3: Declare an incident**

```bash
# Create a ticket (Severity 1 is an emergency)
# If Severity 1:
# - Alert the team immediately
# - Create a #incident-[date] channel
# - Start the 2-minute update cycle
# See: Incident Response Runbook
```

### During Incident

**Your role as on-call**:

1. **Respond quickly** - The first 2 minutes are critical
2. **Communicate** - Update the team/status page
3. **Investigate** - Follow the diagnostics in the runbooks
4. **Escalate if needed** - Page a senior engineer if stuck
5. **Execute the fix** - Follow approved procedures
6. **Verify recovery** - Confirm the service is healthy
7. **Document** - Record what happened

**Key communication**:

- Initial response time: < 2 min (post "investigating")
- Status update: every 2-5 minutes
- Escalation: if the cause is not clear after 5 minutes
- Resolution: post "incident resolved"

### Alert Examples & Responses

#### Alert: "Pod CrashLoopBackOff"

```
1. Get pod logs: kubectl logs <pod> --previous
2. Check for config issues: kubectl get configmap
3. Check for resource limits: kubectl describe pod <pod>
4. Decide: rollback or fix the config
```

#### Alert: "High Error Rate (>5% 5xx)"

```
1. Check which endpoint: tail the application logs
2. Check dependencies: database, cache, external APIs
3. Check the most recent deployment: git log
4. Decide: rollback or investigate further
```

#### Alert: "Pod Memory > 90%"

```
1. Check actual usage: kubectl top pod
2. Check limits: kubectl get pod <pod> -o yaml | grep memory
3. Decide: scale up or investigate a memory leak
```

#### Alert: "Node NotReady"

```
1. Check the node: kubectl describe node <node>
2. Check kubelet: ssh <node> "systemctl status kubelet"
3. Contact the infrastructure team for hardware issues
4. Possibly: drain the node and reschedule pods
```
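The CrashLoopBackOff steps come up often enough that it can help to collect them into a single triage command. The sketch below is illustrative rather than an official tool: it assumes the `vapora` namespace used throughout this guide, takes the pod name as its only argument, and the file name `crashloop-triage.sh` is made up.

```bash
#!/usr/bin/env bash
# crashloop-triage.sh <pod-name> - sketch collecting the "Pod CrashLoopBackOff" steps above.
set -euo pipefail

POD="${1:?usage: crashloop-triage.sh <pod-name>}"
NS="${NAMESPACE:-vapora}"

echo "=== Logs from the previous (crashed) container ==="
kubectl logs "$POD" -n "$NS" --previous --tail=100 || true

echo "=== Pod description (last state, resource limits, recent events) ==="
kubectl describe pod "$POD" -n "$NS" | tail -40

echo "=== ConfigMaps in the namespace (look for recent config changes) ==="
kubectl get configmap -n "$NS"

echo "=== Recent warning events ==="
kubectl get events -n "$NS" --sort-by='.lastTimestamp' | grep Warning | tail -10 || true

# The next step is a judgment call: roll back the deployment or fix the config,
# as described under "Pod CrashLoopBackOff" above.
```

It only gathers evidence; the rollback-versus-fix decision stays with you.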
---

## Monitoring Dashboard Setup

When you start your shift, have these visible:

### Browser Tabs (Keep Open)

1. **Grafana Dashboard**
   - VAPORA Cluster Overview
   - Pod CPU/memory usage
   - Request rate and latency
   - Error rate
   - Deployment status

2. **Kubernetes Dashboard**
   - `kubectl port-forward -n kube-system svc/kubernetes-dashboard 8443:443`
   - Or use the K9s terminal UI: `k9s`

3. **Alert Dashboard** (if available)
   - Prometheus Alerts
   - Or the monitoring system of choice

4. **Status Page** (if public-facing)
   - Check for ongoing incidents
   - Be prepared to update it

### Terminal Windows (Keep Ready)

```bash
# Terminal 1: Watch pods
watch kubectl get pods -n vapora

# Terminal 2: Tail logs
kubectl logs -f deployment/vapora-backend -n vapora

# Terminal 3: Watch events
kubectl -n vapora get events --watch

# Terminal 4: Ad-hoc commands and troubleshooting
# (leave empty for ad-hoc use)
```

---

## Common Questions During On-Call

### Q: I think I found an issue, but I'm not sure it's a problem

**A**: When in doubt, escalate:

1. Post your observation in the #deployments channel
2. Ask: "Does this look normal?"
3. If others confirm it, treat it as a real issue
4. Better safe than sorry (this is production)

### Q: Do I need to respond to every alert?

**A**: Yes. Even false alarms need verification:

1. Confirm it's a false alarm (don't just assume it is)
2. Update the alert if it's misconfigured
3. Never ignore alerts - fix the alerting instead

### Q: The service looks broken but the dashboard looks normal

**A**:

1. Check whether the dashboard is lagging (refreshes are sometimes slow)
2. Test manually: curl the endpoints
3. Check pod logs directly: kubectl logs
4. Trust actual service health over the dashboard

### Q: Can I deploy changes while on-call?

**A**:

- **Yes** if it's an emergency fix for an active incident
- **No** for normal features/changes (schedule them for a dedicated deployment window)
- **Escalate** if unsure

### Q: Something looks weird but I can't reproduce it

**A**:

1. Save any evidence: logs, metrics, events
2. Monitor more closely for a pattern
3. Document it in a ticket for later investigation
4. Escalate if the behavior continues

### Q: An alert keeps firing but the service is fine

**A**:

1. Investigate why the alert is firing falsely
2. Check the alert thresholds (they might be too sensitive)
3. Fix the alert configuration
4. Update the alert runbook with the details

---

## Escalation Decision Tree

When should you escalate?

```
START: Issue detected

Is it Severity 1 (complete outage)?
  YES → Escalate immediately to a senior engineer
  NO  → Continue

Have you diagnosed the root cause within 5 minutes?
  YES → Continue with the fix
  NO  → Page a senior engineer or escalate

Does the fix require infrastructure/database changes?
  YES → Contact the infrastructure/DBA team
  NO  → Continue with the fix

Is this outside your authority (company policy)?
  YES → Escalate to your manager
  NO  → Proceed with the fix

Implemented the fix but the service is still broken?
  YES → Page a senior engineer immediately
  NO  → Verify and close the incident

Still uncertain?
  → Ask a senior engineer or manager
  → It is always better to escalate early
```

---

## When to Page Senior Engineer

**Page immediately if**:

- The service is completely down (Severity 1)
- The database appears corrupted
- You've been stuck for >5 minutes
- A rollback didn't work
- Infrastructure changes are needed urgently
- Something is affecting >50% of users

**Don't page just because**:

- A single pod is restarting (monitor it first)
- There are transient network errors
- You're slightly unsure (ask in #deployments first)
- It's 3 AM and the issue isn't critical (file a ticket for the morning)

---

## End of Shift Handoff

### Create Handoff Report

```
SHIFT HANDOFF - [Your Name]
Dates: [Start] to [End] UTC
Duration: [X hours]

STATUS: ✅ All normal / ⚠️ Issues ongoing / ❌ Critical

INCIDENTS: [Number]
- Incident 1: [description, resolved or ongoing]
- Incident 2: [description]

ALERTS: [Any unusual alerts]
- Alert 1: [description, action taken]

DEPLOYMENTS: [Any scheduled or completed]
- Deployment 1: [status]

KNOWN ISSUES:
- Issue 1: [description, workaround]
- Issue 2: [description]

MONITORING NOTES:
- [Any trending issues]
- [Any monitoring gaps]
- [Any recommended actions]

RECOMMENDATIONS FOR NEXT ON-CALL:
1. [Action item]
2. [Action item]
3. [Action item]

NEXT ON-CALL: @[name]
```

### Send to Next On-Call

```
@next-on-call - Handoff notes attached:

[paste report above]

Key points:
- [Most important item]
- [Second most important item]
- [Any urgent follow-ups]

Questions? I'm available for 30 minutes.
```
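If you would rather start from live data than a blank template, a small script can pre-fill the handoff skeleton and leave the bracketed fields for you to complete. This is a minimal sketch, assuming the `vapora` namespace used throughout this guide; the script and output file names are illustrative.

```bash
#!/usr/bin/env bash
# handoff-report.sh - sketch that pre-fills the shift handoff skeleton with live cluster data.
# Review and complete the bracketed sections by hand before sending.
set -euo pipefail

NS="${NAMESPACE:-vapora}"
OUT="${1:-shift-handoff-$(date +%F).txt}"

{
  echo "SHIFT HANDOFF - $(whoami)"
  echo "Dates: [Start] to [End] UTC"
  echo
  echo "STATUS: [All normal / Issues ongoing / Critical]"
  echo
  echo "POD STATUS (anything not Running needs an explanation):"
  kubectl get pods -n "$NS" | grep -v Running || true
  echo
  echo "RECENT WARNING EVENTS:"
  kubectl get events -n "$NS" --sort-by='.lastTimestamp' | grep Warning | tail -10 || echo "(none)"
  echo
  echo "INCIDENTS: [Number]"
  echo "KNOWN ISSUES: [Any]"
  echo "RECOMMENDATIONS FOR NEXT ON-CALL: [Any]"
  echo "NEXT ON-CALL: @[name]"
} > "$OUT"

echo "Wrote $OUT - fill in the bracketed sections, then send it to @next-on-call."
```

The section labels mirror the template above, so the next on-call sees the same structure either way.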
---

## Tools & Commands Reference

### Essential Commands

```bash
# Pod management
kubectl get pods -n vapora
kubectl logs pod-name -n vapora
kubectl exec -it pod-name -n vapora -- bash
kubectl describe pod pod-name -n vapora
kubectl delete pod pod-name -n vapora   # (recreated via its deployment)

# Deployment management
kubectl get deployments -n vapora
kubectl rollout status deployment/vapora-backend -n vapora
kubectl rollout undo deployment/vapora-backend -n vapora
kubectl scale deployment/vapora-backend --replicas=5 -n vapora

# Service health
curl https://api.vapora.com/health
kubectl get events -n vapora
kubectl top pods -n vapora
kubectl get endpoints -n vapora

# Quick diagnostics
kubectl describe nodes
kubectl cluster-info
kubectl get persistentvolumes
```

### Useful Tools

```bash
# Install these on your workstation
brew install kubectl   # Kubernetes CLI
brew install k9s       # Terminal UI for K8s
brew install watch     # Monitor command output
brew install jq        # JSON processing
brew install yq        # YAML processing
brew install grpcurl   # gRPC debugging

# Aliases to save time
alias k='kubectl'
alias kgp='kubectl get pods'
alias klogs='kubectl logs'
alias kexec='kubectl exec'
```

### Dashboards & Links

Bookmark these:

- Grafana: `https://grafana.vapora.com`
- Status Page: `https://status.vapora.com`
- Incident Tracker: `https://github.com/your-org/vapora/issues`
- Runbooks: `https://github.com/your-org/vapora/tree/main/docs/operations`
- Kubernetes Dashboard: Run `kubectl proxy`, then open `http://localhost:8001/ui`
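If you find yourself repeating the same lookups, a couple of small shell functions can sit next to the aliases above. These are illustrative (the names `kshell` and `kwarn` are not standard tooling) and default to the `vapora` namespace:

```bash
# Illustrative helpers for your shell profile; adjust names and defaults to taste.

# kshell <pod> [namespace] - open an interactive shell in a pod (default namespace: vapora)
kshell() {
  kubectl exec -it "$1" -n "${2:-vapora}" -- bash
}

# kwarn [namespace] - show the ten most recent Warning events (default namespace: vapora)
kwarn() {
  kubectl get events -n "${1:-vapora}" --sort-by='.lastTimestamp' | grep Warning | tail -10
}
```

For example, `kwarn` covers the "check for known issues" step of the end-of-shift handoff without retyping the full command.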
---

## On-Call Checklist

### Starting Shift

- [ ] Verified pager notifications working
- [ ] Tested access to all systems
- [ ] Reviewed current system status
- [ ] Read recent incidents
- [ ] Received handoff from previous on-call
- [ ] Set up monitoring dashboards
- [ ] Opened necessary terminal windows
- [ ] Posted "on-call" status in #deployments

### During Shift

- [ ] Responded to all alerts within SLA
- [ ] Updated incident status regularly
- [ ] Escalated when appropriate
- [ ] Documented actions in tickets
- [ ] Verified fixes before closing
- [ ] Communicated clearly with team

### Ending Shift

- [ ] Created handoff report
- [ ] Resolved or escalated open issues
- [ ] Updated monitoring for anomalies
- [ ] Passed report to next on-call
- [ ] Closed out incident tickets
- [ ] Verified next on-call is ready
- [ ] Posted "handing off to [next on-call]" in #deployments

---

## Post-On-Call Follow-Up

After your shift:

1. **Document lessons learned**
   - Did you learn something new?
   - Did any procedure need updating?
   - Were any runbooks unclear?

2. **Update runbooks**
   - If you found gaps, update procedures
   - If you had questions, update docs
   - Share improvements with team

3. **Communicate findings**
   - Anything the team should know?
   - Any recommendations?
   - Trends to watch?

4. **Celebrate successes**
   - Any incidents quickly resolved?
   - Any new insights?
   - Recognize good practices

---

## Emergency Contacts

Keep these accessible:

```
ESCALATION CONTACTS:

Primary Escalation:  [Name] [Phone] [Slack]
Backup Escalation:   [Name] [Phone] [Slack]
Infrastructure:      [Name] [Phone] [Slack]
Database Team:       [Name] [Phone] [Slack]
Manager:             [Name] [Phone] [Slack]

External Contacts:
AWS Support:   [Account ID] [Contact]
CDN Provider:  [Account] [Contact]
DNS Provider:  [Account] [Contact]

EMERGENCY PROCEDURES:
- Complete AWS outage: Contact AWS support immediately
- Database failure: Contact DBA, activate backups
- Security incident: Contact security team immediately
- Major data loss: Activate disaster recovery
```

---

## Remember

✅ **You are the guardian of production** - Your vigilance keeps services running
✅ **Better safe than sorry** - Escalate early and often
✅ **Communication is key** - Keep team informed
✅ **Document everything** - Future you and team will thank you
✅ **Ask for help** - No shame in escalating

❌ **Don't guess** - Verify before taking action
❌ **Don't stay silent** - Alert team to any issues
❌ **Don't ignore alerts** - Even false ones need investigation