# On-Call Procedures
Guide for on-call engineers managing VAPORA production operations.
---
## Overview
**On-Call Responsibility**: Monitor VAPORA production and respond to incidents during assigned shift
**Time Commitment**:
- During business hours: ~5-10 minutes of daily check-ins
- During off-hours: Available for emergencies (paged for critical issues)
**Expected Availability**:
- Severity 1: Respond within 2 minutes
- Severity 2: Respond within 15 minutes
- Severity 3: Respond within 1 hour
---
## Before Your Shift Starts
### 24 Hours Before On-Call
- [ ] Verify schedule: "I'm on-call starting [date] [time]"
- [ ] Update your calendar with shift times
- [ ] Notify team: "I'll be on-call [dates]"
- [ ] Share personal contact info if not already shared
- [ ] Download necessary tools/credentials
### 1 Hour Before Shift
- [ ] Test pager notification system
```bash
# Verify Slack notifications working
# Ask previous on-call to send test alert: "/test-alert-to-[yourname]"
```
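If alerts reach you through a Slack incoming webhook, a quick self-test like the sketch below confirms the channel end-to-end; `SLACK_WEBHOOK_URL` is an assumption about your setup, not something defined in this repo.
```bash
# Hedged sketch: post a test message via an assumed Slack incoming webhook URL
curl -sS -X POST -H 'Content-type: application/json' \
  --data '{"text":"On-call pager test - please ignore"}' \
  "$SLACK_WEBHOOK_URL"
```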
- [ ] Verify access to necessary systems
```bash
# Test each required access:
# ✓ SSH to bastion host:    ssh bastion.vapora.com
# ✓ kubectl to production:  kubectl cluster-info
# ✓ Slack channels:         /join #deployments #alerts
# ✓ Incident tracking:      open Jira/GitHub
# ✓ Monitoring dashboards:  access Grafana
# ✓ Status page:            access status page admin
```
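The same checks as a non-interactive sketch; the hostnames come from this doc, and the Grafana `/api/health` path assumes a standard Grafana install.
```bash
# Quick access self-check (adjust hostnames to your environment)
ssh -o BatchMode=yes -o ConnectTimeout=5 bastion.vapora.com true \
  && echo "bastion: OK" || echo "bastion: FAILED"
kubectl cluster-info > /dev/null 2>&1 \
  && echo "kubectl: OK" || echo "kubectl: FAILED"
curl -sf https://grafana.vapora.com/api/health > /dev/null \
  && echo "grafana: OK" || echo "grafana: FAILED"
```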
- [ ] Review current system status
```bash
# Quick health check
kubectl cluster-info
kubectl get pods -n vapora
kubectl get events -n vapora --sort-by='.lastTimestamp' | tail -10
# Should show: All pods Running, no recent errors
```
- [ ] Read recent incident reports
- Check previous on-call handoff notes
- Review any incidents from past week
- Note any known issues or monitoring gaps
- [ ] Receive handoff from previous on-call
```
Ask: "Anything I should know?"
- Any ongoing issues?
- Any deployments planned?
- Any flaky services or known alerts?
- Any customer complaints?
```
---
## Daily On-Call Tasks
### Morning Check-In (After shift starts)
```bash
# Automated check - run this first thing
export NAMESPACE=vapora
echo "=== Cluster Health ==="
kubectl cluster-info
kubectl get nodes
echo "=== Pod Status ==="
kubectl get pods -n $NAMESPACE
kubectl get pods -n $NAMESPACE | grep -v Running   # anything not Running needs a look
echo "=== Recent Events ==="
kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' | tail -10
echo "=== Resource Usage ==="
kubectl top nodes
kubectl top pods -n $NAMESPACE
# If any anomalies: investigate before declaring "all clear"
```
### Mid-Shift Check (Every 4 hours)
```bash
# Quick sanity check
curl -sS -o /dev/null -w 'api.vapora.com/health -> %{http_code}\n' https://api.vapora.com/health
curl -sS -o /dev/null -w 'vapora.app -> %{http_code}\n' https://vapora.app/
# Both should report 200
# Check dashboards
# Grafana: any alerts? any trending issues?
# Check Slack #alerts channel
# Any warnings or anomalies posted?
```
### End-of-Shift Handoff (Before shift ends)
```bash
# Prepare handoff for next on-call
# 1. Document current state
kubectl get pods -n vapora
kubectl get nodes
kubectl top pods -n vapora
# 2. Check for known issues
kubectl get events -n vapora | grep Warning
# Any persistent warnings?
# 3. Check deployment status
git log -1 --oneline provisioning/
# Any recent changes?
# 4. Document in handoff notes:
echo "HANDOFF NOTES - $(date)
Duration: [start time] to [end time]
Status: All normal / Issues: [list]
Alerts: [any]
Deployments: [any planned]
Known issues: [any]
Recommendations: [any]
" > on-call-handoff.txt
# 5. Pass notes to next on-call
# Send message to @next-on-call with notes
```
---
## Responding to Alerts
### Alert Received
**Step 1: Verify it's real**
```bash
# Don't panic - verify the alert is legitimate
# 1. Check the source: is it from our system?
# 2. Check current status manually: curl endpoints
# 3. Check dashboard: see if issue visible there
# 4. Check cluster: kubectl get pods
# False alarms happen - verify before escalating
```
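A concrete verification pass might look like this, using the same endpoints as the mid-shift check:
```bash
# Manual verification before declaring an incident
curl -sS -o /dev/null -w 'api health -> %{http_code}\n' https://api.vapora.com/health
curl -sS -o /dev/null -w 'frontend   -> %{http_code}\n' https://vapora.app/
kubectl get pods -n vapora | grep -v Running                       # any pods not Running?
kubectl get events -n vapora --sort-by='.lastTimestamp' | tail -5   # anything recent and suspicious?
```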
**Step 2: Assess severity**
- Is service completely down? → Severity 1
- Is service partially down? → Severity 2
- Is there a warning/anomaly? → Severity 3
**Step 3: Declare incident**
```bash
# Create ticket (Severity 1 is emergency)
# If Severity 1:
# - Alert team immediately
# - Create #incident-[date] channel
# - Start 2-minute update cycle
# See: Incident Response Runbook
```
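If incidents are tracked as GitHub issues (see Dashboards & Links) and the GitHub CLI is installed, the ticket can be opened from the terminal; the title, label, and body below are placeholders.
```bash
# Open an incident ticket from the terminal (assumes gh is installed and
# authenticated, and that an "incident" label exists in the repo)
gh issue create \
  --title "SEV1: [short description] - $(date -u +%Y-%m-%dT%H:%MZ)" \
  --label incident \
  --body "Status: investigating. On-call: $(whoami). Updates every 2 minutes."
```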
### During Incident
**Your role as on-call**:
1. **Respond quickly** - First 2 minutes are critical
2. **Communicate** - Update team/status page
3. **Investigate** - Follow diagnostics in runbooks
4. **Escalate if needed** - Page senior engineer if stuck
5. **Execute fix** - Follow approved procedures
6. **Verify recovery** - Confirm service healthy
7. **Document** - Record what happened
**Key communication**:
- Initial response time: < 2 min (post "investigating")
- Status update: every 2-5 minutes
- Escalation: if not clear after 5 minutes
- Resolution: post "incident resolved"
### Alert Examples & Responses
#### Alert: "Pod CrashLoopBackOff"
```
1. Get pod logs: kubectl logs <pod> --previous
2. Check for config issues: kubectl get configmap
3. Check for resource limits: kubectl describe pod <pod>
4. Decide: rollback or fix config
```
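For step 4, if a recent release is the likely cause, rolling back is usually the fastest fix; the deployment name below assumes the backend is the affected workload.
```bash
# Roll back the affected deployment and confirm pods settle
kubectl rollout undo deployment/vapora-backend -n vapora
kubectl rollout status deployment/vapora-backend -n vapora
kubectl get pods -n vapora | grep -v Running
```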
#### Alert: "High Error Rate (>5% 5xx)"
```
1. Check which endpoint: tail application logs
2. Check dependencies: database, cache, external APIs
3. Check recent deployment: git log
4. Decide: rollback or investigate further
```
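A rough sketch of steps 1-3; the grep pattern assumes the HTTP status appears in each access-log line, so adjust it to the actual log format.
```bash
# Count recent 5xx responses in the backend logs (pattern is an assumption)
kubectl logs deployment/vapora-backend -n vapora --since=10m | grep -cE ' 5[0-9]{2} '
# Confirm services still have healthy backends
kubectl get endpoints -n vapora
# See what shipped recently before deciding on a rollback
git log -3 --oneline
```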
#### Alert: "Pod Memory > 90%"
```
1. Check actual usage: kubectl top pod <pod>
2. Check limits: kubectl get pod <pod> -o yaml | grep memory
3. Decide: scale up or investigate memory leak
```
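The same checks as commands, with `<pod>` taken from the alert; adding replicas is one option if the usage is legitimate rather than a leak.
```bash
# Compare live usage against the configured memory limit
kubectl top pod <pod> -n vapora
kubectl get pod <pod> -n vapora \
  -o jsonpath='{.spec.containers[*].resources.limits.memory}{"\n"}'
# If usage is legitimate, scaling out buys headroom while you investigate
kubectl scale deployment/vapora-backend --replicas=5 -n vapora
```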
#### Alert: "Node NotReady"
```
1. Check node: kubectl describe node <node>
2. Check kubelet: ssh <node> 'systemctl status kubelet'
3. Contact infrastructure team for hardware issues
4. Possibly: drain node and reschedule pods
```
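If the node has to come out of service (step 4), cordon and drain it so pods reschedule elsewhere; `<node>` comes from the alert.
```bash
# Take the node out of scheduling and move its pods elsewhere
kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
# After the underlying issue is fixed, return it to the pool
kubectl uncordon <node>
```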
---
## Monitoring Dashboard Setup
When you start shift, have these visible:
### Browser Tabs (Keep Open)
1. **Grafana Dashboard** - VAPORA Cluster Overview
- Pod CPU/Memory usage
- Request rate and latency
- Error rate
- Deployment status
2. **Kubernetes Dashboard**
- `kubectl port-forward -n kube-system svc/kubernetes-dashboard 8443:443`
- Or use K9s terminal UI: `k9s`
3. **Alert Dashboard** (if available)
- Prometheus Alerts
- Or monitoring system of choice
4. **Status Page** (if public-facing)
- Check for ongoing incidents
- Prepare to update
### Terminal Windows (Keep Ready)
```bash
# Terminal 1: Watch pods
watch kubectl get pods -n vapora
# Terminal 2: Tail logs
kubectl logs -f deployment/vapora-backend -n vapora
# Terminal 3: General kubectl commands
kubectl -n vapora get events --watch
# Terminal 4: Ad-hoc commands and troubleshooting
# (leave empty for ad-hoc use)
```
---
## Common Questions During On-Call
### Q: I think I found an issue, but I'm not sure it's a problem
**A**: When in doubt, escalate:
1. Post in #deployments channel with observation
2. Ask: "Does this look normal?"
3. If others confirm: treat it as an issue
4. Better safe than sorry (on production)
### Q: Do I need to respond to every alert?
**A**: Yes. Even false alarms need verification:
1. Confirm it's a false alarm (don't just assume)
2. Update alert if it's misconfigured
3. Never ignore alerts - fix the alerting
### Q: Service looks broken but dashboard looks normal
**A**:
1. Check if the dashboard is delayed (refreshes can be slow)
2. Test manually: curl endpoints
3. Check pod logs directly: kubectl logs
4. Trust actual service health over dashboard
### Q: Can I deploy changes while on-call?
**A**:
- **Yes** if it's emergency fix for active incident
- **No** for normal features/changes (schedule for dedicated deployment window)
- **Escalate** if unsure
### Q: Something looks weird but I can't reproduce it
**A**:
1. Save any evidence: logs, metrics, events
2. Monitor more closely for pattern
3. Document in ticket for later investigation
4. Escalate if behavior continues
### Q: An alert keeps firing but service is fine
**A**:
1. Investigate why alert is false
2. Check alert thresholds (might be too sensitive)
3. Fix the alert configuration
4. Update alert runbook with details
---
## Escalation Decision Tree
When should you escalate?
```
START: Issue detected
Is it Severity 1 (complete outage)?
YES → Escalate immediately to senior engineer
NO → Continue
Have you diagnosed root cause in 5 minutes?
YES → Continue with fix
NO → Page senior engineer or escalate
Does fix require infrastructure/database changes?
YES → Contact infrastructure/DBA team
NO → Continue with fix
Is this outside your authority (company policy)?
YES → Escalate to manager
NO → Proceed with fix
Implemented fix, service still broken?
YES → Page senior engineer immediately
NO → Verify and close incident
Result: Uncertain?
→ Ask senior engineer or manager
→ Always better to escalate early
```
---
## When to Page Senior Engineer
**Page immediately if**:
- Service completely down (Severity 1)
- Database appears corrupted
- You're stuck for >5 minutes
- Rollback didn't work
- Need infrastructure changes urgently
- Something affecting >50% of users
**Don't page just because**:
- Single pod restarting (monitor first)
- Transient network errors
- You're slightly unsure (ask in #deployments first)
- It's 3 AM and not critical (use tickets for morning)
---
## End of Shift Handoff
### Create Handoff Report
```
SHIFT HANDOFF - [Your Name]
Dates: [Start] to [End] UTC
Duration: [X hours]
STATUS: ✅ All normal / ⚠️ Issues ongoing / ❌ Critical
INCIDENTS: [Number]
- Incident 1: [description, resolved or ongoing]
- Incident 2: [description]
ALERTS: [Any unusual alerts]
- Alert 1: [description, action taken]
DEPLOYMENTS: [Any scheduled or completed]
- Deployment 1: [status]
KNOWN ISSUES:
- Issue 1: [description, workaround]
- Issue 2: [description]
MONITORING NOTES:
- [Any trending issues]
- [Any monitoring gaps]
- [Any recommended actions]
RECOMMENDATIONS FOR NEXT ON-CALL:
1. [Action item]
2. [Action item]
3. [Action item]
NEXT ON-CALL: @[name]
```
### Send to Next On-Call
```
@next-on-call - Handoff notes attached:
[paste report above]
Key points:
- [Most important item]
- [Second important]
- [Any urgent follow-ups]
Questions? I'm available for 30 min
```
---
## Tools & Commands Reference
### Essential Commands
```bash
# Pod management
kubectl get pods -n vapora
kubectl logs pod-name -n vapora
kubectl exec pod-name -n vapora -- bash
kubectl describe pod pod-name -n vapora
kubectl delete pod pod-name -n vapora # (recreates via deployment)
# Deployment management
kubectl get deployments -n vapora
kubectl rollout status deployment/vapora-backend -n vapora
kubectl rollout undo deployment/vapora-backend -n vapora
kubectl scale deployment/vapora-backend --replicas=5 -n vapora
# Service health
curl http://localhost:8001/health
kubectl get events -n vapora
kubectl top pods -n vapora
kubectl get endpoints -n vapora
# Quick diagnostics
kubectl describe nodes
kubectl cluster-info
kubectl get persistentvolumes
```
### Useful Tools
```bash
# Install these on your workstation
brew install kubectl # Kubernetes CLI
brew install k9s # Terminal UI for K8s
brew install watch # Monitor command output
brew install jq # JSON processing
brew install yq # YAML processing
brew install grpcurl # gRPC debugging
# Aliases to save time
alias k='kubectl'
alias kgp='kubectl get pods'
alias klogs='kubectl logs'
alias kexec='kubectl exec'
```
### Dashboards & Links
Bookmark these:
- Grafana: `https://grafana.vapora.com`
- Status Page: `https://status.vapora.com`
- Incident Tracker: `https://github.com/your-org/vapora/issues`
- Runbooks: `https://github.com/your-org/vapora/tree/main/docs/operations`
- Kubernetes Dashboard: Run `kubectl port-forward -n kube-system svc/kubernetes-dashboard 8443:443`, then open `https://localhost:8443`
---
## On-Call Checklist
### Starting Shift
- [ ] Verified pager notifications working
- [ ] Tested access to all systems
- [ ] Reviewed current system status
- [ ] Read recent incidents
- [ ] Received handoff from previous on-call
- [ ] Set up monitoring dashboards
- [ ] Opened necessary terminal windows
- [ ] Posted "on-call" status in #deployments
### During Shift
- [ ] Responded to all alerts within SLA
- [ ] Updated incident status regularly
- [ ] Escalated when appropriate
- [ ] Documented actions in tickets
- [ ] Verified fixes before closing
- [ ] Communicated clearly with team
### Ending Shift
- [ ] Created handoff report
- [ ] Resolved or escalated open issues
- [ ] Updated monitoring/alerting for any anomalies found
- [ ] Passed report to next on-call
- [ ] Closed out incident tickets
- [ ] Verified next on-call is ready
- [ ] Posted "handing off to [next on-call]" in #deployments
---
## Post-On-Call Follow-Up
After your shift:
1. **Document lessons learned**
- Did you learn something new?
- Did any procedure need updating?
- Were any runbooks unclear?
2. **Update runbooks**
- If you found gaps, update procedures
- If you had questions, update docs
- Share improvements with team
3. **Communicate findings**
- Anything the team should know?
- Any recommendations?
- Trends to watch?
4. **Celebrate successes**
- Any incidents quickly resolved?
- Any new insights?
- Recognize good practices
---
## Emergency Contacts
Keep these accessible:
```
ESCALATION CONTACTS:
Primary Escalation: [Name] [Phone] [Slack]
Backup Escalation: [Name] [Phone] [Slack]
Infrastructure: [Name] [Phone] [Slack]
Database Team: [Name] [Phone] [Slack]
Manager: [Name] [Phone] [Slack]
External Contacts:
AWS Support: [Account ID] [Contact]
CDN Provider: [Account] [Contact]
DNS Provider: [Account] [Contact]
EMERGENCY PROCEDURES:
- Complete AWS outage: Contact AWS support immediately
- Database failure: Contact DBA, activate backups
- Security incident: Contact security team immediately
- Major data loss: Activate disaster recovery
```
---
## Remember
**You are the guardian of production** - Your vigilance keeps services running
**Better safe than sorry** - Escalate early and often
**Communication is key** - Keep team informed
**Document everything** - Future you and team will thank you
**Ask for help** - No shame in escalating
**Don't guess** - Verify before taking action
**Don't stay silent** - Alert team to any issues
**Don't ignore alerts** - Even false ones need investigation