# VAPORA Operations Runbooks

Complete set of runbooks and procedures for deploying, monitoring, and operating VAPORA in production environments.

---

## Quick Navigation

**I need to...**

- **Deploy to production**: See [Deployment Runbook](./deployment-runbook.md) or [Pre-Deployment Checklist](./pre-deployment-checklist.md)
- **Respond to an incident**: See [Incident Response Runbook](./incident-response-runbook.md)
- **Roll back a deployment**: See [Rollback Runbook](./rollback-runbook.md)
- **Go on-call**: See [On-Call Procedures](./on-call-procedures.md)
- **Monitor services**: See [Monitoring & Alerting](#monitoring--alerting)
- **Understand common failures**: See [Common Failure Scenarios](#common-failure-scenarios)

---

## Runbook Overview

### 1. Pre-Deployment Checklist

**When**: 24 hours before any production deployment

**Content**: Comprehensive checklist for deployment preparation including:
- Communication & scheduling
- Code review & validation
- Environment verification
- Health baseline recording
- Artifact preparation
- Rollback plan verification

**Time**: 1-2 hours

**File**: [`pre-deployment-checklist.md`](./pre-deployment-checklist.md)

### 2. Deployment Runbook

**When**: Executing actual production deployment

**Content**: Step-by-step deployment procedures including:
- Pre-flight checks (5 min)
- Configuration deployment (3 min)
- Deployment update (5 min)
- Verification (5 min)
- Validation (3 min)
- Communication & monitoring

**Time**: 15-20 minutes total

**File**: [`deployment-runbook.md`](./deployment-runbook.md)

### 3. Rollback Runbook

**When**: Issues detected after deployment requiring immediate rollback

**Content**: Safe rollback procedures including:
- When to roll back (decision criteria)
- Kubernetes automatic rollback (step-by-step)
- Docker manual rollback (guided)
- Post-rollback verification
- Emergency procedures
- Prevention & lessons learned

**Time**: 5-10 minutes (depending on issues)

**File**: [`rollback-runbook.md`](./rollback-runbook.md)

### 4. Incident Response Runbook

**When**: Production incident declared

**Content**: Full incident response procedures including:
- Severity levels (1-4) with examples
- Report & assess procedures
- Diagnosis & escalation
- Fix implementation
- Recovery verification
- Communication templates
- Role definitions

**Time**: Varies by severity (2 min to 1+ hour)

**File**: [`incident-response-runbook.md`](./incident-response-runbook.md)

### 5. On-Call Procedures

**When**: During assigned on-call shift

**Content**: Full on-call guide including:
- Before shift starts (setup & verification)
- Daily tasks & check-ins
- Responding to alerts
- Monitoring dashboard setup
- Escalation decision tree
- Shift handoff procedures
- Common questions & answers

**Time**: Read thoroughly before first on-call shift (~30 min)

**File**: [`on-call-procedures.md`](./on-call-procedures.md)

---

## Deployment Workflow

### Standard Deployment Process

```
DAY 1 (Planning)
  ↓
- Create GitHub issue/ticket
- Identify deployment window
- Notify stakeholders

24 HOURS BEFORE
  ↓
- Complete pre-deployment checklist
  (pre-deployment-checklist.md)
- Verify all prerequisites
- Stage artifacts
- Test in staging

DEPLOYMENT DAY
  ↓
- Final go/no-go decision
- Execute deployment runbook
  (deployment-runbook.md)
  - Pre-flight checks
  - ConfigMap deployment
  - Service deployment
  - Verification
  - Communication

POST-DEPLOYMENT (2 hours)
  ↓
- Monitor closely (every 10 minutes)
- Watch for issues
- If problems → execute rollback runbook
  (rollback-runbook.md)
- Document results

24 HOURS LATER
  ↓
- Declare deployment stable
- Schedule post-mortem (if issues)
- Update documentation
```
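
The post-deployment step above ("monitor closely, every 10 minutes, for 2 hours") can be scripted. The following is a minimal sketch, not part of the existing runbooks: the `monitor` function name is hypothetical, and the health URL in the usage comment is the one used elsewhere in this document.

```bash
# Run a check command every INTERVAL seconds until DURATION seconds have passed.
# Stops at the first failing check so the operator can decide whether to roll back.
monitor() {
  interval=$1
  duration=$2
  shift 2
  elapsed=0
  while [ "$elapsed" -lt "$duration" ]; do
    "$@" || { echo "check failed after ${elapsed}s" >&2; return 1; }
    sleep "$interval"
    elapsed=$(( elapsed + interval ))
  done
  echo "stable for ${duration}s"
}

# Example: poll the backend health endpoint every 10 minutes for 2 hours.
#   monitor 600 7200 curl -fsS http://localhost:8001/health
```

A non-zero exit from `monitor` is only a prompt to investigate; the rollback decision still follows the rollback runbook.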

### If Issues During Deployment

```
Issue Detected
  ↓
Severity Assessment
  ↓
Severity 1-2:
  ├─ Immediate rollback
  │   (rollback-runbook.md)
  │
  └─ Post-rollback investigation
      (incident-response-runbook.md)

Severity 3-4:
  ├─ Monitor and investigate
  │   (incident-response-runbook.md)
  │
  └─ Fix in place if quick
      OR
      Schedule rollback
```
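
The decision tree above can also be expressed as a small lookup, which may be convenient in a chat-ops helper. This is an illustrative sketch: the `next_action` function and its output strings are assumptions, not an existing VAPORA tool.

```bash
# Map an incident severity (1-4) to the first action from the decision tree.
next_action() {
  case "$1" in
    1|2) echo "rollback now (rollback-runbook.md), then investigate (incident-response-runbook.md)" ;;
    3|4) echo "monitor and investigate (incident-response-runbook.md), then fix in place or schedule rollback" ;;
    *)   echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

For example, `next_action 2` prints the rollback path, while an out-of-range severity returns a non-zero status.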

---

## Monitoring & Alerting

### Essential Dashboards

Keep these visible during deployments and throughout every on-call shift:

1. **Kubernetes Dashboard**
   - Pod status
   - Node health
   - Event logs

2. **Grafana Dashboards** (if available)
   - Request rate and latency
   - Error rate
   - CPU/Memory usage
   - Pod restart counts

3. **Application Logs** (Elasticsearch, CloudWatch, etc.)
   - Error messages
   - Stack traces
   - Performance logs

### Alert Triggers & Responses

| Alert | Severity | Response |
|-------|----------|----------|
| Pod CrashLoopBackOff | 1 | Check logs, likely config issue |
| Error rate >10% | 1 | Check recent deployment, consider rollback |
| All pods pending | 1 | Node issue or resource exhausted |
| High memory usage >90% | 2 | Check for memory leak or scale up |
| High latency (2x normal) | 2 | Check database, external services |
| Single pod failed | 3 | Monitor, likely transient |

### Health Check Commands

Quick commands to verify everything is working:

```bash
# Cluster health
kubectl cluster-info
kubectl get nodes        # All should be Ready

# Service health
kubectl get pods -n vapora
# All should be Running, 1/1 Ready

# Quick endpoint tests (assumes the services are reachable on localhost,
# for example through kubectl port-forward)
curl http://localhost:8001/health
curl http://localhost:3000

# Pod resources
kubectl top pods -n vapora

# Recent issues
kubectl get events -n vapora | grep Warning
kubectl logs deployment/vapora-backend -n vapora --tail=20
```
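
To turn the "all should be Running, 1/1 Ready" check into a single number, the pod listing can be piped through a small filter. This is a sketch under stated assumptions: `count_unready` is a hypothetical helper, and it relies on the default `kubectl get pods --no-headers` column order (NAME, READY, STATUS, ...).

```bash
# Count pods that are not fully ready or not in the Running state.
# Reads `kubectl get pods --no-headers` output on stdin.
count_unready() {
  awk '{
    split($2, ready, "/")                      # READY column, e.g. "1/1"
    if (ready[1] != ready[2] || $3 != "Running") n++
  }
  END { print n + 0 }'
}

# Usage against a live cluster:
#   kubectl get pods -n vapora --no-headers | count_unready
```

A result of `0` means every listed pod is Running and fully ready. Pods from finished Jobs (status `Completed`) would be counted as unready by this sketch.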

---

## Common Failure Scenarios

### Pod CrashLoopBackOff

**Symptoms**: Pod keeps restarting repeatedly

**Diagnosis**:
```bash
kubectl logs <pod> -n vapora --previous  # See what crashed
kubectl describe pod <pod> -n vapora    # Check events
```

**Solutions**:
1. If config error: Fix ConfigMap, restart pod
2. If code error: Rollback deployment
3. If resource issue: Increase limits or scale out

**Runbook**: [Rollback Runbook](./rollback-runbook.md) or [Incident Response](./incident-response-runbook.md)

### Pod Stuck in Pending

**Symptoms**: Pod won't start, stuck in "Pending" state

**Diagnosis**:
```bash
kubectl describe pod <pod> -n vapora  # Check "Events" section
```

**Common causes**:
- Insufficient CPU/memory on nodes
- Node disk full
- Pod can't be scheduled
- Persistent volume not available

**Solutions**:
1. Scale down other workloads
2. Add more nodes
3. Fix persistent volume issues
4. Check node disk space

**Runbook**: [On-Call Procedures](./on-call-procedures.md) → "Common Questions"

### Service Unresponsive (Connection Refused)

**Symptoms**: `curl: (7) Failed to connect to localhost port 8001`

**Diagnosis**:
```bash
kubectl get pods -n vapora      # Are pods even running?
kubectl get service vapora-backend -n vapora  # Does service exist?
kubectl get endpoints -n vapora # Do endpoints exist?
```

**Common causes**:
- Pods not running (restart loops)
- Service missing or misconfigured
- Port incorrect
- Network policy blocking traffic

**Solutions**:
1. Verify pods are running: `kubectl get pods -n vapora`
2. Verify the service exists: `kubectl get svc -n vapora`
3. Check endpoints: `kubectl get endpoints -n vapora`
4. Port-forward to bypass routing issues: `kubectl port-forward svc/vapora-backend 8001:8001 -n vapora`

**Runbook**: [Incident Response](./incident-response-runbook.md)
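
A Service that exists but has no endpoints usually means its selector matches no ready pods. The check can be automated with a filter such as the one below. This is a sketch: `has_endpoints` is a hypothetical helper that assumes the default `kubectl get endpoints --no-headers` columns (NAME, ENDPOINTS, AGE), where an empty endpoint list is shown as `<none>`.

```bash
# Succeed only if the named Service lists at least one endpoint address.
# Reads `kubectl get endpoints --no-headers` output on stdin.
has_endpoints() {
  awk -v svc="$1" '
    $1 == svc && $2 != "<none>" && $2 != "" { found = 1 }
    END { exit !found }'
}

# Usage against a live cluster:
#   kubectl get endpoints -n vapora --no-headers | has_endpoints vapora-backend \
#     || echo "vapora-backend has no ready endpoints"
```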

### High Error Rate

**Symptoms**: Dashboard shows >5% 5xx errors

**Diagnosis**:
```bash
# Find which requests are failing
kubectl logs deployment/vapora-backend -n vapora --tail=500 | grep -E "ERROR|500"

# Check recent deployment
git log -1 --oneline provisioning/

# Check dependencies
curl http://localhost:8001/health  # is it healthy?
```

**Common causes**:
- Recent bad deployment
- Database connectivity issue
- Configuration error
- Dependency service down

**Solutions**:
1. If recent deployment: Consider rollback
2. Check ConfigMap for typos
3. Check database connectivity
4. Check external service health

**Runbook**: [Rollback Runbook](./rollback-runbook.md) or [Incident Response](./incident-response-runbook.md)
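
The 5% threshold from the symptom above can be checked from two request counts. The sketch below is illustrative: the function names are hypothetical, and the counts are assumed to be plain integers read from a dashboard or derived from logs.

```bash
# Percentage of failed requests, rounded down.
error_rate_pct() {
  total=$1
  errors=$2
  [ "$total" -gt 0 ] || { echo 0; return 0; }
  echo $(( errors * 100 / total ))
}

# Succeed when more than 5% of requests failed.
# Compares errors*100 against total*5 to avoid rounding error near the threshold.
needs_attention() {
  total=$1
  errors=$2
  [ $(( errors * 100 )) -gt $(( total * 5 )) ]
}
```

For example, 140 errors out of 2000 requests is 7%, so `needs_attention 2000 140` succeeds, while 60 errors out of 2000 (3%) does not.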

### Resource Exhaustion (CPU/Memory)

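**Symptoms**: `kubectl top pods` shows a pod's CPU or memory usage at or near its configured limits (100% usage), or the pod reports that limits were exceeded (for example, terminated with reason `OOMKilled`)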

**Diagnosis**:
```bash
kubectl top nodes              # Overall node usage
kubectl top pods -n vapora     # Per-pod usage
kubectl get pod <pod> -n vapora -o yaml | grep limits -A 10  # Check limits
```

**Solutions**:
1. Increase pod resource limits (requires redeployment)
2. Scale out (add more replicas)
3. Scale down other workloads
4. Investigate memory leak if growing

**Runbook**: [Deployment Runbook](./deployment-runbook.md) → Phase 4 (Verification)

### Database Connection Errors

**Symptoms**: `ERROR: could not connect to database`

**Diagnosis**:
```bash
# Check database is running
kubectl get pods -n <database-namespace>

# Check credentials in ConfigMap
kubectl get configmap vapora-config -n vapora -o yaml | grep -i "database\|password"

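# Test connectivity. Single quotes make $DATABASE_URL expand inside the pod,
# not in your local shell. Assumes psql is installed in the container image.
kubectl exec <pod> -n vapora -- sh -c 'psql "$DATABASE_URL" -c "SELECT 1"'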
```

**Solutions**:
1. If credentials wrong: Fix in ConfigMap, restart pods
2. If database down: Escalate to DBA
3. If network issue: Escalate to the network team
4. If permissions: Update database user

**Runbook**: [Incident Response](./incident-response-runbook.md) → "Root Cause: Database Issues"

---

## Communication Templates

### Deployment Start

```
🚀 Deployment starting

Service: VAPORA
Version: v1.2.1
Mode: Enterprise
Expected duration: 10-15 minutes

Will update every 2 minutes. Questions? Ask in #deployments
```

### Deployment Complete

```
✅ Deployment complete

Duration: 12 minutes
Status: All services healthy
Pods: All running

Health check results:
✓ Backend: responding
✓ Frontend: accessible
✓ API: normal latency
✓ No errors in logs

Next step: Monitor for 2 hours
Contact: @on-call-engineer
```

### Incident Declared

```
🔴 INCIDENT DECLARED

Service: VAPORA Backend
Severity: 1 (Critical)
Time detected: HH:MM UTC
Current status: Investigating

Updates every 2 minutes
/cc @on-call-engineer @senior-engineer
```

### Incident Resolved

```
✅ Incident resolved

Duration: 8 minutes
Root cause: [description]
Fix: [what was done]

All services healthy, monitoring for 1 hour
Post-mortem scheduled for [date]
```

### Rollback Executed

```
🔙 Rollback executed

Issue detected in v1.2.1
Rolled back to v1.2.0

Status: Services recovering
Timeline: Issue 14:30 → Rollback 14:32 → Recovered 14:35

Investigating root cause
```

---

## Escalation Matrix

When unsure who to contact:

| Issue Type | First Contact | Escalation | Emergency |
|-----------|---|---|---|
| **Deployment issue** | Deployment lead | Ops team | Ops manager |
| **Pod/Container** | On-call engineer | Senior engineer | Director of Eng |
| **Database** | DBA team | Ops manager | CTO |
| **Infrastructure** | Infra team | Ops manager | VP Ops |
| **Security issue** | Security team | CISO | CEO |
| **Networking** | Network team | Ops manager | CTO |
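
If the matrix needs to be queried from a script (for example, by an alerting hook), it can be encoded as a lookup. This is a minimal sketch: the `escalation_contact` function and its short issue keys (`pod`, `security`, and so on) are assumptions, while the contacts are copied from the table above.

```bash
# Print the contact for an issue type at a given level.
# level: first | escalation | emergency
escalation_contact() {
  issue=$1
  level=$2
  case "$issue" in
    deployment)     row="Deployment lead|Ops team|Ops manager" ;;
    pod)            row="On-call engineer|Senior engineer|Director of Eng" ;;
    database)       row="DBA team|Ops manager|CTO" ;;
    infrastructure) row="Infra team|Ops manager|VP Ops" ;;
    security)       row="Security team|CISO|CEO" ;;
    networking)     row="Network team|Ops manager|CTO" ;;
    *) echo "unknown issue type: $issue" >&2; return 1 ;;
  esac
  case "$level" in
    first)      field=1 ;;
    escalation) field=2 ;;
    emergency)  field=3 ;;
    *) echo "unknown level: $level" >&2; return 1 ;;
  esac
  echo "$row" | cut -d'|' -f"$field"
}
```

Keeping the table and the lookup in the same repository makes it easier to update both when contacts change.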

---

## Tools & Commands Quick Reference

### Essential kubectl Commands

```bash
# Get status
kubectl get pods -n vapora
kubectl get deployments -n vapora
kubectl get services -n vapora

# Logs
kubectl logs deployment/vapora-backend -n vapora
kubectl logs <pod> -n vapora --previous  # Previous crash
kubectl logs <pod> -n vapora -f          # Follow/tail

# Execute commands
kubectl exec -it <pod> -n vapora -- bash
kubectl exec <pod> -n vapora -- curl http://localhost:8001/health

# Describe (detailed info)
kubectl describe pod <pod> -n vapora
kubectl describe node <node>

# Port forward (local access)
kubectl port-forward svc/vapora-backend 8001:8001 -n vapora

# Restart pods
kubectl rollout restart deployment/vapora-backend -n vapora

# Rollback
kubectl rollout undo deployment/vapora-backend -n vapora

# Scale
kubectl scale deployment/vapora-backend --replicas=5 -n vapora
```

### Useful Aliases

```bash
alias k='kubectl'
alias kgp='kubectl get pods'
alias kgd='kubectl get deployments'
alias kgs='kubectl get services'
alias klogs='kubectl logs'
alias kexec='kubectl exec'
alias kdesc='kubectl describe'
alias ktop='kubectl top'
```

---

## Before Your First Deployment

1. **Read all runbooks**: Thoroughly review all procedures
2. **Practice in staging**: Do a test deployment to staging first
3. **Understand rollback**: Know how to roll back before deploying
4. **Get trained**: Have a senior engineer walk you through the procedures
5. **Test tools**: Verify kubectl and other tools work
6. **Verify access**: Confirm you have cluster access
7. **Know contacts**: Have escalation contacts readily available
8. **Review history**: Look at past deployments to understand patterns

---

## Continuous Improvement

### After Each Deployment

- [ ] Were all runbooks clear?
- [ ] Any steps missing or unclear?
- [ ] Any issues that could be prevented?
- [ ] Update documentation with learnings

### Monthly Review

- [ ] Review all incidents from past month
- [ ] Update procedures based on patterns
- [ ] Refresh team on any changes
- [ ] Update escalation contacts
- [ ] Review and improve alerting

---

## Key Principles

✅ **Safety First**
- Always dry-run before applying
- Roll back quickly if issues are detected
- Better to be conservative
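
One way to apply the dry-run principle with `kubectl` is to validate a manifest server-side before applying it. The wrapper below is a sketch, not the procedure from the deployment runbook: `safe_apply` is a hypothetical helper, and it assumes a `kubectl` version that supports `--dry-run=server`.

```bash
# Validate a manifest with a server-side dry run, and apply it only if validation passes.
safe_apply() {
  manifest=$1
  kubectl apply --dry-run=server -f "$manifest" -n vapora || {
    echo "dry run failed; nothing applied" >&2
    return 1
  }
  kubectl apply -f "$manifest" -n vapora
}

# Usage:
#   safe_apply path/to/manifest.yaml
```

The server-side dry run sends the manifest through API validation and admission without persisting it, so it catches more problems than a purely client-side check.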

✅ **Communication**
- Communicate early and often
- Update every 2-5 minutes during incidents
- Notify stakeholders proactively

✅ **Documentation**
- Document everything you do
- Update runbooks with learnings
- Share knowledge with team

✅ **Preparation**
- Plan deployments thoroughly
- Test before going live
- Have rollback plan ready

✅ **Quick Response**
- Detect issues quickly
- Diagnose systematically
- Execute fixes decisively

❌ **Avoid**
- Guessing without verifying
- Skipping steps to save time
- Assuming systems are working
- Not communicating with team
- Making multiple changes at once

---

## Support & Questions

- **Questions about procedures?** Ask a senior engineer or the operations team
- **Found a runbook gap?** Create an issue or PR to update the documentation
- **Unclear instructions?** Clarify before executing critical operations
- **Ideas for improvement?** Share in team meetings or documentation repo

---

## Quick Start: Your First Deployment

### Day 0: Preparation

1. Read: `pre-deployment-checklist.md` (30 min)
2. Read: `deployment-runbook.md` (30 min)
3. Read: `rollback-runbook.md` (20 min)
4. Schedule walkthrough with senior engineer (1 hour)

### Day 1: Execute with Mentorship

1. Complete pre-deployment checklist with senior engineer
2. Execute deployment runbook with senior observing
3. Monitor for 2 hours with senior available
4. Debrief: what went well, what to improve

### Day 2+: Independent Deployments

1. Complete checklist independently
2. Execute runbook
3. Document and communicate
4. Ask for help if anything is unclear

---

**Generated**: 2026-01-12
**Status**: Production-ready
**Last Updated**: 2026-01-12