
# VAPORA Operations Runbooks

Complete set of runbooks and procedures for deploying, monitoring, and operating VAPORA in production environments.

---

## Quick Navigation

**I need to...**

- **Deploy to production**: See [Deployment Runbook](./deployment-runbook.md) or [Pre-Deployment Checklist](./pre-deployment-checklist.md)
- **Respond to an incident**: See [Incident Response Runbook](./incident-response-runbook.md)
- **Roll back a deployment**: See [Rollback Runbook](./rollback-runbook.md)
- **Go on-call**: See [On-Call Procedures](./on-call-procedures.md)
- **Monitor services**: See [Monitoring & Alerting](#monitoring--alerting)
- **Understand common failures**: See [Common Failure Scenarios](#common-failure-scenarios)

---

## Runbook Overview

### 1. Pre-Deployment Checklist

**When**: 24 hours before any production deployment

**Content**: Comprehensive checklist for deployment preparation including:
- Communication & scheduling
- Code review & validation
- Environment verification
- Health baseline recording
- Artifact preparation
- Rollback plan verification

**Time**: 1-2 hours

**File**: [`pre-deployment-checklist.md`](./pre-deployment-checklist.md)

### 2. Deployment Runbook

**When**: Executing actual production deployment

**Content**: Step-by-step deployment procedures including:
- Pre-flight checks (5 min)
- Configuration deployment (3 min)
- Deployment update (5 min)
- Verification (5 min)
- Validation (3 min)
- Communication & monitoring

**Time**: 15-20 minutes total

**File**: [`deployment-runbook.md`](./deployment-runbook.md)

### 3. Rollback Runbook

**When**: Issues detected after deployment requiring immediate rollback

**Content**: Safe rollback procedures including:
- When to roll back (decision criteria)
- Kubernetes automatic rollback (step-by-step)
- Docker manual rollback (guided)
- Post-rollback verification
- Emergency procedures
- Prevention & lessons learned

**Time**: 5-10 minutes (depending on issues)

**File**: [`rollback-runbook.md`](./rollback-runbook.md)

### 4. Incident Response Runbook

**When**: Production incident declared

**Content**: Full incident response procedures including:
- Severity levels (1-4) with examples
- Report & assess procedures
- Diagnosis & escalation
- Fix implementation
- Recovery verification
- Communication templates
- Role definitions

**Time**: Varies by severity (2 min to 1+ hour)

**File**: [`incident-response-runbook.md`](./incident-response-runbook.md)

### 5. On-Call Procedures

**When**: During assigned on-call shift

**Content**: Full on-call guide including:
- Before shift starts (setup & verification)
- Daily tasks & check-ins
- Responding to alerts
- Monitoring dashboard setup
- Escalation decision tree
- Shift handoff procedures
- Common questions & answers

**Time**: Read thoroughly before first on-call shift (~30 min)

**File**: [`on-call-procedures.md`](./on-call-procedures.md)

---

## Deployment Workflow

### Standard Deployment Process

```
DAY 1 (Planning)
  ↓
- Create GitHub issue/ticket
- Identify deployment window
- Notify stakeholders

24 HOURS BEFORE
  ↓
- Complete pre-deployment checklist
  (pre-deployment-checklist.md)
- Verify all prerequisites
- Stage artifacts
- Test in staging

DEPLOYMENT DAY
  ↓
- Final go/no-go decision
- Execute deployment runbook
  (deployment-runbook.md)
  - Pre-flight checks
  - ConfigMap deployment
  - Service deployment
  - Verification
  - Communication

POST-DEPLOYMENT (2 hours)
  ↓
- Monitor closely (every 10 minutes)
- Watch for issues
- If problems → execute rollback runbook
  (rollback-runbook.md)
- Document results

24 HOURS LATER
  ↓
- Declare deployment stable
- Schedule post-mortem (if issues)
- Update documentation
```
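The post-deployment watch (a check every 10 minutes for 2 hours) can be scripted. The following is a minimal sketch, not part of the runbooks themselves: the function name `post_deploy_watch` and the `KUBECTL` override variable are hypothetical, and the check shown only lists pods.

```bash
KUBECTL="${KUBECTL:-kubectl}"   # overridable, so the loop can be tried without a cluster

# Print pod status a fixed number of times, pausing between checks.
# Defaults: 12 checks, 600 s apart, covering roughly the 2-hour window.
post_deploy_watch() {
  checks="${1:-12}"
  interval="${2:-600}"
  i=1
  while [ "$i" -le "$checks" ]; do
    echo "check $i/$checks"
    "$KUBECTL" get pods -n vapora || return 1   # stop early if the cluster is unreachable
    if [ "$i" -lt "$checks" ]; then sleep "$interval"; fi
    i=$((i + 1))
  done
}

# Usage: post_deploy_watch          # full watch with the defaults
#        post_deploy_watch 3 60     # three checks, one minute apart
```

A loop like this only surfaces pod status; dashboards and logs still need to be watched alongside it.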

### If Issues During Deployment

```
Issue Detected
  ↓
Severity Assessment
  ↓
Severity 1-2:
  ├─ Immediate rollback
  │   (rollback-runbook.md)
  │
  └─ Post-rollback investigation
      (incident-response-runbook.md)

Severity 3-4:
  ├─ Monitor and investigate
  │   (incident-response-runbook.md)
  │
  └─ Fix in place if quick
      OR
      Schedule rollback
```
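The decision tree above can be expressed as a small lookup. This is an illustrative sketch only; `runbook_for_severity` is a hypothetical helper name, and the mapping simply mirrors the tree (severity 1-2 roll back first, severity 3-4 investigate first).

```bash
# Hypothetical helper: map an incident severity (1-4) to the runbook to open first.
runbook_for_severity() {
  case "$1" in
    1|2) echo "rollback-runbook.md" ;;            # roll back immediately, investigate afterwards
    3|4) echo "incident-response-runbook.md" ;;   # monitor and investigate before deciding
    *)   echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

runbook_for_severity 2   # prints: rollback-runbook.md
```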

---

## Monitoring & Alerting

### Essential Dashboards

Keep these visible during deployments and throughout every on-call shift:

1. **Kubernetes Dashboard**
   - Pod status
   - Node health
   - Event logs

2. **Grafana Dashboards** (if available)
   - Request rate and latency
   - Error rate
   - CPU/Memory usage
   - Pod restart counts

3. **Application Logs** (Elasticsearch, CloudWatch, etc.)
   - Error messages
   - Stack traces
   - Performance logs

### Alert Triggers & Responses

| Alert | Severity | Response |
|-------|----------|----------|
| Pod CrashLoopBackOff | 1 | Check logs, likely config issue |
| Error rate >10% | 1 | Check recent deployment, consider rollback |
| All pods pending | 1 | Node issue or resource exhausted |
| High memory usage >90% | 2 | Check for memory leak or scale up |
| High latency (2x normal) | 2 | Check database, external services |
| Single pod failed | 3 | Monitor, likely transient |

### Health Check Commands

Quick commands to verify everything is working:

```bash
# Cluster health
kubectl cluster-info
kubectl get nodes        # All should be Ready

# Service health
kubectl get pods -n vapora
# All should be Running, 1/1 Ready

# Quick endpoint tests (assumes the services are reachable on localhost,
# for example through kubectl port-forward)
curl http://localhost:8001/health
curl http://localhost:3000

# Pod resources
kubectl top pods -n vapora

# Recent issues
kubectl get events -n vapora | grep Warning
kubectl logs deployment/vapora-backend -n vapora --tail=20
```
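To avoid scanning the pod list by eye, the "all Running, all Ready" check can be filtered automatically. The sketch below assumes the default column layout of `kubectl get pods --no-headers` (NAME, READY, STATUS, ...); `unhealthy_pods` is a hypothetical helper name.

```bash
# Reads `kubectl get pods --no-headers` output on stdin and prints the names of
# pods that are not Running or whose READY column (e.g. 1/2) shows missing containers.
unhealthy_pods() {
  awk '{
    split($2, ready, "/")
    if ($3 != "Running" || ready[1] != ready[2]) print $1
  }'
}

# Usage: kubectl get pods -n vapora --no-headers | unhealthy_pods
```

Empty output means every pod passed. Pods from finished Jobs (status `Completed`) would also be listed, so filter those out if the namespace runs Jobs.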

---

## Common Failure Scenarios

### Pod CrashLoopBackOff

**Symptoms**: Pod keeps restarting repeatedly

**Diagnosis**:
```bash
kubectl logs <pod> -n vapora --previous  # See what crashed
kubectl describe pod <pod> -n vapora    # Check events
```

**Solutions**:
1. If config error: Fix ConfigMap, restart pod
2. If code error: Roll back the deployment
3. If resource issue: Increase limits or scale out

**Runbook**: [Rollback Runbook](./rollback-runbook.md) or [Incident Response](./incident-response-runbook.md)
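Restart loops can be spotted before a pod reaches CrashLoopBackOff by watching the RESTARTS column. This is a rough sketch assuming the default `kubectl get pods --no-headers` layout; the helper name `high_restart_pods` and the default threshold of 3 are arbitrary choices, not values from the runbooks.

```bash
# Reads `kubectl get pods --no-headers` on stdin and prints "<pod> <restarts>"
# for pods restarted more often than the threshold (first argument, default 3).
high_restart_pods() {
  awk -v max="${1:-3}" '$4 + 0 > max { print $1, $4 }'
}

# Usage: kubectl get pods -n vapora --no-headers | high_restart_pods 5
```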

### Pod Stuck in Pending

**Symptoms**: Pod won't start, stuck in "Pending" state

**Diagnosis**:
```bash
kubectl describe pod <pod> -n vapora  # Check "Events" section
```

**Common causes**:
- Insufficient CPU/memory on nodes
- Node disk full
- Pod can't be scheduled
- Persistent volume not available

**Solutions**:
1. Scale down other workloads
2. Add more nodes
3. Fix persistent volume issues
4. Check node disk space

**Runbook**: [On-Call Procedures](./on-call-procedures.md) → "Common Questions"
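The scheduler usually states why a pod is Pending in a `FailedScheduling` event. The following sketch pulls just those messages out of the event list; it assumes the default `kubectl get events --no-headers` columns (LAST SEEN, TYPE, REASON, OBJECT, MESSAGE), and `scheduling_failures` is a hypothetical helper name.

```bash
# Reads `kubectl get events --no-headers` on stdin and prints only the message
# text of FailedScheduling events (e.g. "0/3 nodes are available: ...").
scheduling_failures() {
  awk '$3 == "FailedScheduling" {
    $1 = $2 = $3 = $4 = ""   # drop the LAST SEEN, TYPE, REASON and OBJECT columns
    sub(/^ +/, "")           # trim the spaces left behind by the blanked columns
    print
  }'
}

# Usage: kubectl get events -n vapora --no-headers | scheduling_failures
```

A message such as "Insufficient memory" points to the first two solutions above, while a volume-related message points to the persistent volume.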

### Service Unresponsive (Connection Refused)

**Symptoms**: `curl: (7) Failed to connect to localhost port 8001`

**Diagnosis**:
```bash
kubectl get pods -n vapora      # Are pods even running?
kubectl get service vapora-backend -n vapora  # Does service exist?
kubectl get endpoints -n vapora # Do endpoints exist?
```

**Common causes**:
- Pods not running (restart loops)
- Service missing or misconfigured
- Port incorrect
- Network policy blocking traffic

**Solutions**:
1. Verify pods are running: `kubectl get pods -n vapora`
2. Verify the service exists: `kubectl get svc -n vapora`
3. Check endpoints: `kubectl get endpoints -n vapora`
4. Port-forward to bypass routing if routing is the suspected cause: `kubectl port-forward svc/vapora-backend 8001:8001 -n vapora`

**Runbook**: [Incident Response](./incident-response-runbook.md)
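A service with no endpoints has no ready pods behind it, which is a common reason for "connection refused". As a minimal sketch, assuming the default `kubectl get endpoints --no-headers` layout (NAME, ENDPOINTS, AGE), the hypothetical helper below lists such services.

```bash
# Reads `kubectl get endpoints --no-headers` on stdin and prints services whose
# ENDPOINTS column is <none>, i.e. no ready pod matches the service selector.
services_without_endpoints() {
  awk '$2 == "<none>" { print $1 }'
}

# Usage: kubectl get endpoints -n vapora --no-headers | services_without_endpoints
```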

### High Error Rate

**Symptoms**: Dashboard shows >5% 5xx errors

**Diagnosis**:
```bash
# Check which endpoint
kubectl logs deployment/vapora-backend -n vapora | grep -E "ERROR|500"

# Check recent deployment
git log -1 --oneline provisioning/

# Check dependencies
curl http://localhost:8001/health  # is it healthy?
```

**Common causes**:
- Recent bad deployment
- Database connectivity issue
- Configuration error
- Dependency service down

**Solutions**:
1. If recent deployment: Consider rollback
2. Check ConfigMap for typos
3. Check database connectivity
4. Check external service health

**Runbook**: [Rollback Runbook](./rollback-runbook.md) or [Incident Response](./incident-response-runbook.md)
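When the dashboard is unavailable, the 5% threshold can be checked by hand from two counts. This is only a sketch: the helper names are hypothetical, and how the error and total request counts are obtained depends on the log format, which is not specified here.

```bash
# Print the error percentage for a given error count and total request count.
error_rate_pct() {
  awk -v e="$1" -v t="$2" 'BEGIN {
    if (t == 0) print "0.0"
    else printf "%.1f\n", 100 * e / t
  }'
}

# Exit 0 when the error rate is above the threshold (third argument, default 5).
error_rate_exceeded() {
  awk -v r="$(error_rate_pct "$1" "$2")" -v max="${3:-5}" 'BEGIN { exit !(r > max) }'
}

# Usage: error_rate_exceeded 12 200 && echo "above threshold, consider rollback"
```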

### Resource Exhaustion (CPU/Memory)

**Symptoms**: `kubectl top pods` shows a pod's usage at or near its configured limits, or events report "limits exceeded"

**Diagnosis**:
```bash
kubectl top nodes              # Overall node usage
kubectl top pods -n vapora     # Per-pod usage
kubectl get pod <pod> -n vapora -o yaml | grep limits -A 10  # Check limits
```

**Solutions**:
1. Increase pod resource limits (requires redeployment)
2. Scale out (add more replicas)
3. Scale down other workloads
4. Investigate memory leak if growing

**Runbook**: [Deployment Runbook](./deployment-runbook.md) → Phase 4 (Verification)
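The per-pod output of `kubectl top pods` can be filtered for heavy memory users. The sketch below assumes the default `--no-headers` layout (NAME, CPU, MEMORY) with memory reported in `Mi`; other units are ignored. `pods_over_memory` and its 512 Mi default are hypothetical and should be set to match the actual limits.

```bash
# Reads `kubectl top pods --no-headers` on stdin and prints "<pod> <memory>"
# for pods using more than the given number of Mi (first argument, default 512).
pods_over_memory() {
  awk -v max="${1:-512}" '$3 ~ /Mi$/ {
    mem = $3
    sub(/Mi$/, "", mem)            # strip the unit to compare numerically
    if (mem + 0 > max) print $1, $3
  }'
}

# Usage: kubectl top pods -n vapora --no-headers | pods_over_memory 900
```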

### Database Connection Errors

**Symptoms**: `ERROR: could not connect to database`

**Diagnosis**:
```bash
# Check database is running
kubectl get pods -n <database-namespace>

# Check credentials in ConfigMap
kubectl get configmap vapora-config -n vapora -o yaml | grep -i "database\|password"

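# Test connectivity (sh -c makes $DATABASE_URL expand inside the pod, not in your
# local shell; assumes psql is installed in the container image)
kubectl exec <pod> -n vapora -- sh -c 'psql "$DATABASE_URL" -c "SELECT 1"'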
```

**Solutions**:
1. If credentials wrong: Fix in ConfigMap, restart pods
2. If database down: Escalate to DBA
3. If network issue: Network team investigation
4. If permissions: Update database user

**Runbook**: [Incident Response](./incident-response-runbook.md) → "Root Cause: Database Issues"
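The error text from the connectivity test usually indicates which of the four solutions applies. The following is a rough triage sketch, assuming PostgreSQL-style client messages; `classify_db_error` is a hypothetical helper, and the mapping is a heuristic (a refused connection can also mean a wrong host or port).

```bash
# Map a database client error message to the most likely solution branch.
classify_db_error() {
  case "$1" in
    *"password authentication failed"*) echo "credentials" ;;
    *"Connection refused"*)             echo "database down" ;;
    *"could not translate host name"*|*"timeout expired"*|*"timed out"*)
                                        echo "network" ;;
    *"permission denied"*)              echo "permissions" ;;
    *)                                  echo "unknown" ;;
  esac
}

classify_db_error 'FATAL:  password authentication failed for user "vapora"'   # prints: credentials
```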

---

## Communication Templates

### Deployment Start

```
🚀 Deployment starting

Service: VAPORA
Version: v1.2.1
Mode: Enterprise
Expected duration: 10-15 minutes

Will update every 2 minutes. Questions? Ask in #deployments
```
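To keep announcements consistent, the template can be filled in from a small function instead of being edited by hand. This is an illustrative sketch; `deployment_start_msg` is a hypothetical helper, and posting the text to a chat channel is left out.

```bash
# Print the "Deployment starting" announcement.
# Arguments: version, mode, expected duration (optional, defaults to 10-15 minutes).
deployment_start_msg() {
  cat <<EOF
🚀 Deployment starting

Service: VAPORA
Version: $1
Mode: $2
Expected duration: ${3:-10-15 minutes}

Will update every 2 minutes. Questions? Ask in #deployments
EOF
}

# Usage: deployment_start_msg v1.2.1 Enterprise
```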

### Deployment Complete

```
✅ Deployment complete

Duration: 12 minutes
Status: All services healthy
Pods: All running

Health check results:
✓ Backend: responding
✓ Frontend: accessible
✓ API: normal latency
✓ No errors in logs

Next step: Monitor for 2 hours
Contact: @on-call-engineer
```

### Incident Declared

```
🔴 INCIDENT DECLARED

Service: VAPORA Backend
Severity: 1 (Critical)
Time detected: HH:MM UTC
Current status: Investigating

Updates every 2 minutes
/cc @on-call-engineer @senior-engineer
```

### Incident Resolved

```
✅ Incident resolved

Duration: 8 minutes
Root cause: [description]
Fix: [what was done]

All services healthy, monitoring for 1 hour
Post-mortem scheduled for [date]
```

### Rollback Executed

```
🔙 Rollback executed

Issue detected in v1.2.1
Rolled back to v1.2.0

Status: Services recovering
Timeline: Issue 14:30 → Rollback 14:32 → Recovered 14:35

Investigating root cause
```

---

## Escalation Matrix

When unsure who to contact:

| Issue Type | First Contact | Escalation | Emergency |
|-----------|---|---|---|
| **Deployment issue** | Deployment lead | Ops team | Ops manager |
| **Pod/Container** | On-call engineer | Senior engineer | Director of Eng |
| **Database** | DBA team | Ops manager | CTO |
| **Infrastructure** | Infra team | Ops manager | VP Ops |
| **Security issue** | Security team | CISO | CEO |
| **Networking** | Network team | Ops manager | CTO |

---

## Tools & Commands Quick Reference

### Essential kubectl Commands

```bash
# Get status
kubectl get pods -n vapora
kubectl get deployments -n vapora
kubectl get services -n vapora

# Logs
kubectl logs deployment/vapora-backend -n vapora
kubectl logs <pod> -n vapora --previous  # Previous crash
kubectl logs <pod> -n vapora -f          # Follow/tail

# Execute commands
kubectl exec -it <pod> -n vapora -- bash
kubectl exec <pod> -n vapora -- curl http://localhost:8001/health

# Describe (detailed info)
kubectl describe pod <pod> -n vapora
kubectl describe node <node>

# Port forward (local access)
kubectl port-forward svc/vapora-backend 8001:8001 -n vapora

# Restart pods
kubectl rollout restart deployment/vapora-backend -n vapora

# Rollback
kubectl rollout undo deployment/vapora-backend -n vapora

# Scale
kubectl scale deployment/vapora-backend --replicas=5 -n vapora
```

### Useful Aliases

```bash
alias k='kubectl'
alias kgp='kubectl get pods'
alias kgd='kubectl get deployments'
alias kgs='kubectl get services'
alias klogs='kubectl logs'
alias kexec='kubectl exec'
alias kdesc='kubectl describe'
alias ktop='kubectl top'
```

---

## Before Your First Deployment

1. **Read all runbooks**: Thoroughly review all procedures
2. **Practice in staging**: Do a test deployment to staging first
3. **Understand rollback**: Know how to roll back before deploying
4. **Get trained**: Have a senior engineer walk you through the procedures
5. **Test tools**: Verify kubectl and other tools work
6. **Verify access**: Confirm you have cluster access
7. **Know contacts**: Have escalation contacts readily available
8. **Review history**: Look at past deployments to understand patterns

---

## Continuous Improvement

### After Each Deployment

- [ ] Were all runbooks clear?
- [ ] Any steps missing or unclear?
- [ ] Any issues that could be prevented?
- [ ] Update documentation with learnings

### Monthly Review

- [ ] Review all incidents from past month
- [ ] Update procedures based on patterns
- [ ] Refresh team on any changes
- [ ] Update escalation contacts
- [ ] Review and improve alerting

---

## Key Principles

✅ **Safety First**
- Always dry-run before applying
- Roll back quickly if issues are detected
- Better to be conservative
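
One way to make "dry-run before applying" a habit is to wrap both steps in a single function. The sketch below uses `kubectl apply --dry-run=server`, which asks the API server to validate the manifest without persisting it; the function name `safe_apply` and the `KUBECTL` override variable are hypothetical.

```bash
KUBECTL="${KUBECTL:-kubectl}"   # overridable, so the function can be tried without a cluster

# Validate a manifest with a server-side dry run, then apply it only if validation passed.
safe_apply() {
  "$KUBECTL" apply --dry-run=server -f "$1" -n vapora &&
  "$KUBECTL" apply -f "$1" -n vapora
}

# Usage: safe_apply path/to/manifest.yaml
```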

✅ **Communication**
- Communicate early and often
- Update every 2-5 minutes during incidents
- Notify stakeholders proactively

✅ **Documentation**
- Document everything you do
- Update runbooks with learnings
- Share knowledge with team

✅ **Preparation**
- Plan deployments thoroughly
- Test before going live
- Have rollback plan ready

✅ **Quick Response**
- Detect issues quickly
- Diagnose systematically
- Execute fixes decisively

❌ **Avoid**
- Guessing without verifying
- Skipping steps to save time
- Assuming systems are working
- Not communicating with team
- Making multiple changes at once

---

## Support & Questions

- **Questions about procedures?** Ask a senior engineer or the operations team
- **Found a gap in a runbook?** Create an issue or PR to update the documentation
- **Unclear instructions?** Clarify before executing critical operations
- **Ideas for improvement?** Share in team meetings or documentation repo

---

## Quick Start: Your First Deployment

### Day 0: Preparation

1. Read: `pre-deployment-checklist.md` (30 min)
2. Read: `deployment-runbook.md` (30 min)
3. Read: `rollback-runbook.md` (20 min)
4. Schedule walkthrough with senior engineer (1 hour)

### Day 1: Execute with Mentorship

1. Complete pre-deployment checklist with senior engineer
2. Execute deployment runbook with a senior engineer observing
3. Monitor for 2 hours with a senior engineer available
4. Debrief: what went well, what to improve

### Day 2+: Independent Deployments

1. Complete checklist independently
2. Execute runbook
3. Document and communicate
4. Ask for help if anything is unclear

---

**Generated**: 2026-01-12
**Status**: Production-ready
**Last Updated**: 2026-01-12