# On-Call Procedures

Guide for on-call engineers managing VAPORA production operations.

---

## Overview

**On-Call Responsibility**: Monitor VAPORA production and respond to incidents during your assigned shift.

**Time Commitment**:

- During business hours: ~5-10 minutes of daily check-ins
- During off-hours: Available for emergencies (paged for critical issues)

**Expected Availability**:

- Severity 1: Respond within 2 minutes
- Severity 2: Respond within 15 minutes
- Severity 3: Respond within 1 hour

---

## Before Your Shift Starts

### 24 Hours Before On-Call

- [ ] Verify the schedule: "I'm on-call starting [date] [time]"
- [ ] Update your calendar with shift times
- [ ] Notify the team: "I'll be on-call [dates]"
- [ ] Share personal contact info if not already shared
- [ ] Download necessary tools/credentials

### 1 Hour Before Shift

- [ ] Test the pager notification system

  ```bash
  # Verify Slack notifications are working.
  # Ask the previous on-call to send a test alert:
  # "/test-alert-to-[yourname]"
  ```

- [ ] Verify access to necessary systems (a scripted version of these checks is sketched at the end of this section)

  ```bash
  # Test each required access:
  # ✓ SSH to bastion host:    ssh bastion.vapora.com
  # ✓ kubectl to production:  kubectl cluster-info
  # ✓ Slack channels:         /join #deployments #alerts
  # ✓ Incident tracking:      open Jira/GitHub
  # ✓ Monitoring dashboards:  access Grafana
  # ✓ Status page:            access the status page admin
  ```

- [ ] Review current system status

  ```bash
  # Quick health check
  kubectl cluster-info
  kubectl get pods -n vapora
  kubectl get events -n vapora | head -10
  # Should show: all pods Running, no recent errors
  ```

- [ ] Read recent incident reports
  - Check the previous on-call's handoff notes
  - Review any incidents from the past week
  - Note any known issues or monitoring gaps

- [ ] Receive handoff from the previous on-call

  ```
  Ask: "Anything I should know?"
  - Any ongoing issues?
  - Any deployments planned?
  - Any flaky services or known alerts?
  - Any customer complaints?
  ```
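The access checks above can also be run as one script so nothing gets skipped. Below is a minimal sketch, assuming the hosts and endpoints used elsewhere in this guide (`bastion.vapora.com`, `api.vapora.com`, `vapora.app`, `grafana.vapora.com`); the script name `pre-shift-check.sh` is illustrative, not an existing tool.

```bash
#!/usr/bin/env bash
# pre-shift-check.sh - illustrative sketch of the pre-shift access checks.
# Hostnames and endpoints come from the examples in this guide; adjust for your environment.
set -u

check() {
  # Run a command quietly and report OK/FAIL with the check's name.
  local name="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "OK   $name"
  else
    echo "FAIL $name  (tried: $*)"
  fi
}

check "SSH to bastion host"    ssh -o BatchMode=yes -o ConnectTimeout=5 bastion.vapora.com true
check "kubectl to production"  kubectl cluster-info
check "API health endpoint"    curl -fsS --max-time 5 https://api.vapora.com/health
check "Frontend"               curl -fsS --max-time 5 https://vapora.app/
check "Grafana reachable"      curl -fsS --max-time 5 https://grafana.vapora.com
```

Run it about an hour before your shift; any `FAIL` line means access should be sorted out before you take the pager.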
---

## Daily On-Call Tasks

### Morning Check-In (After shift starts)

```bash
# Automated check - run this first thing
export NAMESPACE=vapora

echo "=== Cluster Health ==="
kubectl cluster-info
kubectl get nodes

echo "=== Pod Status ==="
kubectl get pods -n $NAMESPACE
kubectl get pods -n $NAMESPACE | grep -v Running

echo "=== Recent Events ==="
kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' | tail -10

echo "=== Resource Usage ==="
kubectl top nodes
kubectl top pods -n $NAMESPACE

# If anything looks anomalous, investigate before declaring "all clear"
```

### Mid-Shift Check (Every 4 hours)

```bash
# Quick sanity check
curl https://api.vapora.com/health
curl https://vapora.app/
# Both should return 200 OK

# Check dashboards
# Grafana: any alerts? Any trending issues?

# Check the Slack #alerts channel
# Any warnings or anomalies posted?
```

### End-of-Shift Handoff (Before shift ends)

```bash
# Prepare the handoff for the next on-call

# 1. Document current state
kubectl get pods -n vapora
kubectl get nodes
kubectl top pods -n vapora

# 2. Check for known issues
kubectl get events -n vapora | grep Warning
# Any persistent warnings?

# 3. Check deployment status
git log -1 --oneline -- provisioning/
# Any recent changes?

# 4. Document in handoff notes:
echo "HANDOFF NOTES - $(date)
Duration: [start time] to [end time]
Status: All normal / Issues: [list]
Alerts: [any]
Deployments: [any planned]
Known issues: [any]
Recommendations: [any]
" > on-call-handoff.txt

# 5. Pass the notes to the next on-call
# Send a message to @next-on-call with the notes
```

---

## Responding to Alerts

### Alert Received

**Step 1: Verify it's real**

```bash
# Don't panic - verify the alert is legitimate
# 1. Check the source: is it from our system?
# 2. Check current status manually: curl the endpoints
# 3. Check the dashboard: is the issue visible there?
# 4. Check the cluster: kubectl get pods
# False alarms happen - verify before escalating
```

**Step 2: Assess severity**

- Is the service completely down? → Severity 1
- Is the service partially down? → Severity 2
- Is there a warning/anomaly? → Severity 3

**Step 3: Declare an incident**

```bash
# Create a ticket (Severity 1 is an emergency)
# If Severity 1:
# - Alert the team immediately
# - Create a #incident-[date] channel
# - Start the 2-minute update cycle
# See: Incident Response Runbook
```

### During Incident

**Your role as on-call**:

1. **Respond quickly** - The first 2 minutes are critical
2. **Communicate** - Update the team/status page
3. **Investigate** - Follow the diagnostics in the runbooks
4. **Escalate if needed** - Page a senior engineer if stuck
5. **Execute the fix** - Follow approved procedures
6. **Verify recovery** - Confirm the service is healthy
7. **Document** - Record what happened

**Key communication**:

- Initial response time: < 2 min (post "investigating")
- Status update: every 2-5 minutes
- Escalation: if the cause is not clear after 5 minutes
- Resolution: post "incident resolved"

### Alert Examples & Responses

#### Alert: "Pod CrashLoopBackOff"

```
1. Get pod logs: kubectl logs <pod> --previous
2. Check for config issues: kubectl get configmap
3. Check for resource limits: kubectl describe pod <pod>
4. Decide: rollback or fix the config
```

#### Alert: "High Error Rate (>5% 5xx)"

```
1. Check which endpoint: tail the application logs
2. Check dependencies: database, cache, external APIs
3. Check the most recent deployment: git log
4. Decide: rollback or investigate further
```

#### Alert: "Pod Memory > 90%"

```
1. Check actual usage: kubectl top pod
2. Check limits: kubectl get pod <pod> -o yaml | grep memory
3. Decide: scale up or investigate a memory leak
```

#### Alert: "Node NotReady"

```
1. Check the node: kubectl describe node <node>
2. Check kubelet: ssh <node> "systemctl status kubelet"
3. Contact the infrastructure team for hardware issues
4. Possibly: drain the node and reschedule pods
```
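The CrashLoopBackOff steps come up often enough that it can help to collect them into a single triage command. The sketch below is illustrative rather than an official tool: it assumes the `vapora` namespace used throughout this guide, takes the pod name as its only argument, and the file name `crashloop-triage.sh` is made up.

```bash
#!/usr/bin/env bash
# crashloop-triage.sh <pod-name> - sketch collecting the "Pod CrashLoopBackOff" steps above.
set -euo pipefail

POD="${1:?usage: crashloop-triage.sh <pod-name>}"
NS="${NAMESPACE:-vapora}"

echo "=== Logs from the previous (crashed) container ==="
kubectl logs "$POD" -n "$NS" --previous --tail=100 || true

echo "=== Pod description (last state, resource limits, recent events) ==="
kubectl describe pod "$POD" -n "$NS" | tail -40

echo "=== ConfigMaps in the namespace (look for recent config changes) ==="
kubectl get configmap -n "$NS"

echo "=== Recent warning events ==="
kubectl get events -n "$NS" --sort-by='.lastTimestamp' | grep Warning | tail -10 || true

# The next step is a judgment call: roll back the deployment or fix the config,
# as described under "Pod CrashLoopBackOff" above.
```

It only gathers evidence; the rollback-versus-fix decision stays with you.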
---

## Monitoring Dashboard Setup

When you start your shift, have these visible:

### Browser Tabs (Keep Open)

1. **Grafana Dashboard**
   - VAPORA Cluster Overview
   - Pod CPU/memory usage
   - Request rate and latency
   - Error rate
   - Deployment status

2. **Kubernetes Dashboard**
   - `kubectl port-forward -n kube-system svc/kubernetes-dashboard 8443:443`
   - Or use the K9s terminal UI: `k9s`

3. **Alert Dashboard** (if available)
   - Prometheus Alerts
   - Or the monitoring system of choice

4. **Status Page** (if public-facing)
   - Check for ongoing incidents
   - Be prepared to update it

### Terminal Windows (Keep Ready)

```bash
# Terminal 1: Watch pods
watch kubectl get pods -n vapora

# Terminal 2: Tail logs
kubectl logs -f deployment/vapora-backend -n vapora

# Terminal 3: Watch events
kubectl -n vapora get events --watch

# Terminal 4: Ad-hoc commands and troubleshooting
# (leave empty for ad-hoc use)
```

---

## Common Questions During On-Call

### Q: I think I found an issue, but I'm not sure it's a problem

**A**: When in doubt, escalate:

1. Post your observation in the #deployments channel
2. Ask: "Does this look normal?"
3. If others confirm it, treat it as a real issue
4. Better safe than sorry (this is production)

### Q: Do I need to respond to every alert?

**A**: Yes. Even false alarms need verification:

1. Confirm it's a false alarm (don't just assume it is)
2. Update the alert if it's misconfigured
3. Never ignore alerts - fix the alerting instead

### Q: The service looks broken but the dashboard looks normal

**A**:

1. Check whether the dashboard is lagging (refreshes are sometimes slow)
2. Test manually: curl the endpoints
3. Check pod logs directly: kubectl logs
4. Trust actual service health over the dashboard

### Q: Can I deploy changes while on-call?

**A**:

- **Yes** if it's an emergency fix for an active incident
- **No** for normal features/changes (schedule them for a dedicated deployment window)
- **Escalate** if unsure

### Q: Something looks weird but I can't reproduce it

**A**:

1. Save any evidence: logs, metrics, events
2. Monitor more closely for a pattern
3. Document it in a ticket for later investigation
4. Escalate if the behavior continues

### Q: An alert keeps firing but the service is fine

**A**:

1. Investigate why the alert is firing falsely
2. Check the alert thresholds (they might be too sensitive)
3. Fix the alert configuration
4. Update the alert runbook with the details

---

## Escalation Decision Tree

When should you escalate?

```
START: Issue detected

Is it Severity 1 (complete outage)?
  YES → Escalate immediately to a senior engineer
  NO  → Continue

Have you diagnosed the root cause within 5 minutes?
  YES → Continue with the fix
  NO  → Page a senior engineer or escalate

Does the fix require infrastructure/database changes?
  YES → Contact the infrastructure/DBA team
  NO  → Continue with the fix

Is this outside your authority (company policy)?
  YES → Escalate to your manager
  NO  → Proceed with the fix

Implemented the fix but the service is still broken?
  YES → Page a senior engineer immediately
  NO  → Verify and close the incident

Still uncertain?
  → Ask a senior engineer or manager
  → It is always better to escalate early
```

---

## When to Page Senior Engineer

**Page immediately if**:

- The service is completely down (Severity 1)
- The database appears corrupted
- You've been stuck for >5 minutes
- A rollback didn't work
- Infrastructure changes are needed urgently
- Something is affecting >50% of users

**Don't page just because**:

- A single pod is restarting (monitor it first)
- There are transient network errors
- You're slightly unsure (ask in #deployments first)
- It's 3 AM and the issue isn't critical (file a ticket for the morning)

---

## End of Shift Handoff

### Create Handoff Report

```
SHIFT HANDOFF - [Your Name]
Dates: [Start] to [End] UTC
Duration: [X hours]

STATUS: ✅ All normal / ⚠️ Issues ongoing / ❌ Critical

INCIDENTS: [Number]
- Incident 1: [description, resolved or ongoing]
- Incident 2: [description]

ALERTS: [Any unusual alerts]
- Alert 1: [description, action taken]

DEPLOYMENTS: [Any scheduled or completed]
- Deployment 1: [status]

KNOWN ISSUES:
- Issue 1: [description, workaround]
- Issue 2: [description]

MONITORING NOTES:
- [Any trending issues]
- [Any monitoring gaps]
- [Any recommended actions]

RECOMMENDATIONS FOR NEXT ON-CALL:
1. [Action item]
2. [Action item]
3. [Action item]

NEXT ON-CALL: @[name]
```

### Send to Next On-Call

```
@next-on-call - Handoff notes attached:

[paste report above]

Key points:
- [Most important item]
- [Second most important item]
- [Any urgent follow-ups]

Questions? I'm available for 30 minutes.
```
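If you would rather start from live data than a blank template, a small script can pre-fill the handoff skeleton and leave the bracketed fields for you to complete. This is a minimal sketch, assuming the `vapora` namespace used throughout this guide; the script and output file names are illustrative.

```bash
#!/usr/bin/env bash
# handoff-report.sh - sketch that pre-fills the shift handoff skeleton with live cluster data.
# Review and complete the bracketed sections by hand before sending.
set -euo pipefail

NS="${NAMESPACE:-vapora}"
OUT="${1:-shift-handoff-$(date +%F).txt}"

{
  echo "SHIFT HANDOFF - $(whoami)"
  echo "Dates: [Start] to [End] UTC"
  echo
  echo "STATUS: [All normal / Issues ongoing / Critical]"
  echo
  echo "POD STATUS (anything not Running needs an explanation):"
  kubectl get pods -n "$NS" | grep -v Running || true
  echo
  echo "RECENT WARNING EVENTS:"
  kubectl get events -n "$NS" --sort-by='.lastTimestamp' | grep Warning | tail -10 || echo "(none)"
  echo
  echo "INCIDENTS: [Number]"
  echo "KNOWN ISSUES: [Any]"
  echo "RECOMMENDATIONS FOR NEXT ON-CALL: [Any]"
  echo "NEXT ON-CALL: @[name]"
} > "$OUT"

echo "Wrote $OUT - fill in the bracketed sections, then send it to @next-on-call."
```

The section labels mirror the template above, so the next on-call sees the same structure either way.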
---

## Tools & Commands Reference

### Essential Commands

```bash
# Pod management
kubectl get pods -n vapora
kubectl logs pod-name -n vapora
kubectl exec -it pod-name -n vapora -- bash
kubectl describe pod pod-name -n vapora
kubectl delete pod pod-name -n vapora   # (recreated via its deployment)

# Deployment management
kubectl get deployments -n vapora
kubectl rollout status deployment/vapora-backend -n vapora
kubectl rollout undo deployment/vapora-backend -n vapora
kubectl scale deployment/vapora-backend --replicas=5 -n vapora

# Service health
curl https://api.vapora.com/health
kubectl get events -n vapora
kubectl top pods -n vapora
kubectl get endpoints -n vapora

# Quick diagnostics
kubectl describe nodes
kubectl cluster-info
kubectl get persistentvolumes
```

### Useful Tools

```bash
# Install these on your workstation
brew install kubectl   # Kubernetes CLI
brew install k9s       # Terminal UI for K8s
brew install watch     # Monitor command output
brew install jq        # JSON processing
brew install yq        # YAML processing
brew install grpcurl   # gRPC debugging

# Aliases to save time
alias k='kubectl'
alias kgp='kubectl get pods'
alias klogs='kubectl logs'
alias kexec='kubectl exec'
```

### Dashboards & Links

Bookmark these:

- Grafana: `https://grafana.vapora.com`
- Status Page: `https://status.vapora.com`
- Incident Tracker: `https://github.com/your-org/vapora/issues`
- Runbooks: `https://github.com/your-org/vapora/tree/main/docs/operations`
- Kubernetes Dashboard: Run `kubectl proxy`, then open `http://localhost:8001/ui`
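If you find yourself repeating the same lookups, a couple of small shell functions can sit next to the aliases above. These are illustrative (the names `kshell` and `kwarn` are not standard tooling) and default to the `vapora` namespace:

```bash
# Illustrative helpers for your shell profile; adjust names and defaults to taste.

# kshell <pod> [namespace] - open an interactive shell in a pod (default namespace: vapora)
kshell() {
  kubectl exec -it "$1" -n "${2:-vapora}" -- bash
}

# kwarn [namespace] - show the ten most recent Warning events (default namespace: vapora)
kwarn() {
  kubectl get events -n "${1:-vapora}" --sort-by='.lastTimestamp' | grep Warning | tail -10
}
```

For example, `kwarn` covers the "check for known issues" step of the end-of-shift handoff without retyping the full command.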
---

## On-Call Checklist

### Starting Shift

- [ ] Verified pager notifications working
- [ ] Tested access to all systems
- [ ] Reviewed current system status
- [ ] Read recent incidents
- [ ] Received handoff from previous on-call
- [ ] Set up monitoring dashboards
- [ ] Opened necessary terminal windows
- [ ] Posted "on-call" status in #deployments

### During Shift

- [ ] Responded to all alerts within SLA
- [ ] Updated incident status regularly
- [ ] Escalated when appropriate
- [ ] Documented actions in tickets
- [ ] Verified fixes before closing
- [ ] Communicated clearly with team

### Ending Shift

- [ ] Created handoff report
- [ ] Resolved or escalated open issues
- [ ] Updated monitoring for anomalies
- [ ] Passed report to next on-call
- [ ] Closed out incident tickets
- [ ] Verified next on-call is ready
- [ ] Posted "handing off to [next on-call]" in #deployments

---

## Post-On-Call Follow-Up

After your shift:

1. **Document lessons learned**
   - Did you learn something new?
   - Did any procedure need updating?
   - Were any runbooks unclear?

2. **Update runbooks**
   - If you found gaps, update procedures
   - If you had questions, update docs
   - Share improvements with team

3. **Communicate findings**
   - Anything the team should know?
   - Any recommendations?
   - Trends to watch?

4. **Celebrate successes**
   - Any incidents quickly resolved?
   - Any new insights?
   - Recognize good practices

---

## Emergency Contacts

Keep these accessible:

```
ESCALATION CONTACTS:

Primary Escalation:  [Name] [Phone] [Slack]
Backup Escalation:   [Name] [Phone] [Slack]
Infrastructure:      [Name] [Phone] [Slack]
Database Team:       [Name] [Phone] [Slack]
Manager:             [Name] [Phone] [Slack]

External Contacts:
AWS Support:   [Account ID] [Contact]
CDN Provider:  [Account] [Contact]
DNS Provider:  [Account] [Contact]

EMERGENCY PROCEDURES:
- Complete AWS outage: Contact AWS support immediately
- Database failure: Contact DBA, activate backups
- Security incident: Contact security team immediately
- Major data loss: Activate disaster recovery
```

---

## Remember

✅ **You are the guardian of production** - Your vigilance keeps services running
✅ **Better safe than sorry** - Escalate early and often
✅ **Communication is key** - Keep team informed
✅ **Document everything** - Future you and team will thank you
✅ **Ask for help** - No shame in escalating

❌ **Don't guess** - Verify before taking action
❌ **Don't stay silent** - Alert team to any issues
❌ **Don't ignore alerts** - Even false ones need investigation