Vapora/docs/operations/README.md
Jesús Pérez 7110ffeea2
chore: extend doc: adr, tutorials, operations, etc
2026-01-12 03:32:47 +00:00


# VAPORA Operations Runbooks
Complete set of runbooks and procedures for deploying, monitoring, and operating VAPORA in production environments.
---
## Quick Navigation
**I need to...**
- **Deploy to production**: See [Deployment Runbook](./deployment-runbook.md) or [Pre-Deployment Checklist](./pre-deployment-checklist.md)
- **Respond to an incident**: See [Incident Response Runbook](./incident-response-runbook.md)
- **Rollback a deployment**: See [Rollback Runbook](./rollback-runbook.md)
- **Go on-call**: See [On-Call Procedures](./on-call-procedures.md)
- **Monitor services**: See [Monitoring & Alerting](#monitoring--alerting)
- **Understand common failures**: See [Common Failure Scenarios](#common-failure-scenarios)
---
## Runbook Overview
### 1. Pre-Deployment Checklist
**When**: 24 hours before any production deployment
**Content**: Comprehensive checklist for deployment preparation including:
- Communication & scheduling
- Code review & validation
- Environment verification
- Health baseline recording
- Artifact preparation
- Rollback plan verification
**Time**: 1-2 hours
**File**: [`pre-deployment-checklist.md`](./pre-deployment-checklist.md)
### 2. Deployment Runbook
**When**: Executing actual production deployment
**Content**: Step-by-step deployment procedures including:
- Pre-flight checks (5 min)
- Configuration deployment (3 min)
- Deployment update (5 min)
- Verification (5 min)
- Validation (3 min)
- Communication & monitoring
**Time**: 15-20 minutes total
**File**: [`deployment-runbook.md`](./deployment-runbook.md)
### 3. Rollback Runbook
**When**: Issues detected after deployment requiring immediate rollback
**Content**: Safe rollback procedures including:
- When to rollback (decision criteria)
- Kubernetes automatic rollback (step-by-step)
- Docker manual rollback (guided)
- Post-rollback verification
- Emergency procedures
- Prevention & lessons learned
**Time**: 5-10 minutes (depending on issues)
**File**: [`rollback-runbook.md`](./rollback-runbook.md)
### 4. Incident Response Runbook
**When**: Production incident declared
**Content**: Full incident response procedures including:
- Severity levels (1-4) with examples
- Report & assess procedures
- Diagnosis & escalation
- Fix implementation
- Recovery verification
- Communication templates
- Role definitions
**Time**: Varies by severity (2 min to 1+ hour)
**File**: [`incident-response-runbook.md`](./incident-response-runbook.md)
### 5. On-Call Procedures
**When**: During assigned on-call shift
**Content**: Full on-call guide including:
- Before shift starts (setup & verification)
- Daily tasks & check-ins
- Responding to alerts
- Monitoring dashboard setup
- Escalation decision tree
- Shift handoff procedures
- Common questions & answers
**Time**: Read thoroughly before first on-call shift (~30 min)
**File**: [`on-call-procedures.md`](./on-call-procedures.md)
---
## Deployment Workflow
### Standard Deployment Process
```
DAY 1 (Planning)
  - Create GitHub issue/ticket
  - Identify deployment window
  - Notify stakeholders

24 HOURS BEFORE
  - Complete pre-deployment checklist (pre-deployment-checklist.md)
  - Verify all prerequisites
  - Stage artifacts
  - Test in staging

DEPLOYMENT DAY
  - Final go/no-go decision
  - Execute deployment runbook (deployment-runbook.md)
      - Pre-flight checks
      - ConfigMap deployment
      - Service deployment
      - Verification
      - Communication

POST-DEPLOYMENT (2 hours)
  - Monitor closely (every 10 minutes)
  - Watch for issues
  - If problems → execute rollback runbook (rollback-runbook.md)
  - Document results

24 HOURS LATER
  - Declare deployment stable
  - Schedule post-mortem (if issues)
  - Update documentation
```
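
The post-deployment step above (a check every 10 minutes for 2 hours, i.e. 12 checks) can be sketched as a small loop. This is a minimal illustration rather than part of the runbooks: `monitor_window` is a hypothetical helper name, and the health command you pass to it is your choice.

```bash
# monitor_window INTERVAL_SECONDS CHECKS COMMAND [ARGS...]
# Runs COMMAND up to CHECKS times, sleeping INTERVAL_SECONDS between runs.
# Stops at the first failing check so the operator can consider a rollback.
monitor_window() {
  mw_interval=$1
  mw_checks=$2
  shift 2
  mw_i=1
  while [ "$mw_i" -le "$mw_checks" ]; do
    if ! "$@"; then
      echo "check $mw_i/$mw_checks failed: see rollback-runbook.md" >&2
      return 1
    fi
    echo "check $mw_i/$mw_checks passed"
    if [ "$mw_i" -lt "$mw_checks" ]; then
      sleep "$mw_interval"
    fi
    mw_i=$((mw_i + 1))
  done
}
```

For the 2-hour window this could be invoked as `monitor_window 600 12 curl -fsS http://localhost:8001/health`, assuming the backend health endpoint is reachable locally.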
### If Issues During Deployment
```
Issue Detected
      ↓
Severity Assessment
      ↓
Severity 1-2:
├─ Immediate rollback
│  (rollback-runbook.md)
└─ Post-rollback investigation
   (incident-response-runbook.md)

Severity 3-4:
├─ Monitor and investigate
│  (incident-response-runbook.md)
└─ Fix in place if quick
   OR
   Schedule rollback
```
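
The decision tree above reduces to a lookup on severity. The following is a minimal sketch; `next_action` is a hypothetical helper, and the output strings are labels chosen for this example, not values defined by the runbooks.

```bash
# next_action SEVERITY
# Prints the first step of the decision tree for a severity of 1-4.
next_action() {
  case "$1" in
    1|2) echo "rollback" ;;      # Severity 1-2: immediate rollback, then investigate
    3|4) echo "investigate" ;;   # Severity 3-4: monitor and investigate first
    *)   echo "unknown severity: $1" >&2; return 2 ;;
  esac
}
```

For example, `next_action 2` prints `rollback`, which points to `rollback-runbook.md`.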
---
## Monitoring & Alerting
### Essential Dashboards
Keep these dashboards visible during deployments and throughout every on-call shift:
1. **Kubernetes Dashboard**
   - Pod status
   - Node health
   - Event logs
2. **Grafana Dashboards** (if available)
   - Request rate and latency
   - Error rate
   - CPU/Memory usage
   - Pod restart counts
3. **Application Logs** (Elasticsearch, CloudWatch, etc.)
   - Error messages
   - Stack traces
   - Performance logs
### Alert Triggers & Responses
| Alert | Severity | Response |
|-------|----------|----------|
| Pod CrashLoopBackOff | 1 | Check logs, likely config issue |
| Error rate >10% | 1 | Check recent deployment, consider rollback |
| All pods pending | 1 | Node issue or resource exhausted |
| High memory usage >90% | 2 | Check for memory leak or scale up |
| High latency (2x normal) | 2 | Check database, external services |
| Single pod failed | 3 | Monitor, likely transient |
### Health Check Commands
Quick commands to verify everything is working:
```bash
# Cluster health
kubectl cluster-info
kubectl get nodes # All should be Ready
# Service health
kubectl get pods -n vapora
# All should be Running, 1/1 Ready
# Quick endpoint test (requires local access to the services, e.g. via kubectl port-forward)
curl http://localhost:8001/health
curl http://localhost:3000
# Pod resources
kubectl top pods -n vapora
# Recent issues
kubectl get events -n vapora | grep Warning
kubectl logs deployment/vapora-backend -n vapora --tail=20
```
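
The "All should be Running, 1/1 Ready" check can be automated by filtering the table that `kubectl get pods` prints. This is a minimal sketch that assumes the default column layout (NAME, READY, STATUS, ...); `not_ready_pods` is a hypothetical helper name.

```bash
# not_ready_pods: reads `kubectl get pods` table output on stdin and prints
# the name of every pod that is not Running or not fully Ready.
not_ready_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")    # READY column, e.g. "1/1"
    if ($3 != "Running" || ready[1] != ready[2]) print $1
  }'
}
```

Usage: `kubectl get pods -n vapora | not_ready_pods`. Empty output means every pod passed. Pods from finished Jobs (status `Completed`) would also be listed, so filter those out if the namespace runs Jobs.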
---
## Common Failure Scenarios
### Pod CrashLoopBackOff
**Symptoms**: Pod keeps restarting repeatedly
**Diagnosis**:
```bash
kubectl logs <pod> -n vapora --previous # See what crashed
kubectl describe pod <pod> -n vapora # Check events
```
**Solutions**:
1. If config error: Fix ConfigMap, restart pod
2. If code error: Rollback deployment
3. If resource issue: Increase limits or scale out
**Runbook**: [Rollback Runbook](./rollback-runbook.md) or [Incident Response](./incident-response-runbook.md)
### Pod Stuck in Pending
**Symptoms**: Pod won't start, stuck in "Pending" state
**Diagnosis**:
```bash
kubectl describe pod <pod> -n vapora # Check "Events" section
```
**Common causes**:
- Insufficient CPU/memory on nodes
- Node disk full
- Pod can't be scheduled
- Persistent volume not available
**Solutions**:
1. Scale down other workloads
2. Add more nodes
3. Fix persistent volume issues
4. Check node disk space
**Runbook**: [On-Call Procedures](./on-call-procedures.md) → "Common Questions"
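
The scheduler's `FailedScheduling` message in the pod's Events section usually names the cause. The sketch below maps a few common message fragments to the causes listed above. It is an illustration only: `classify_pending` is a hypothetical helper, and the exact wording of scheduler messages varies between Kubernetes versions.

```bash
# classify_pending MESSAGE
# Maps a FailedScheduling event message to one of the common causes.
classify_pending() {
  case "$1" in
    *"Insufficient cpu"*|*"Insufficient memory"*)
      echo "insufficient CPU/memory: scale down other workloads or add nodes" ;;
    *disk-pressure*)
      echo "node disk pressure: check node disk space" ;;
    *PersistentVolumeClaim*)
      echo "persistent volume not available: fix persistent volume issues" ;;
    *)
      echo "unclassified: read the full Events section" ;;
  esac
}
```

Usage: copy the message from `kubectl describe pod <pod> -n vapora` and pass it as a single quoted argument, for example `classify_pending "0/3 nodes are available: 3 Insufficient cpu."`.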
### Service Unresponsive (Connection Refused)
**Symptoms**: `curl: (7) Failed to connect to localhost port 8001`
**Diagnosis**:
```bash
kubectl get pods -n vapora # Are pods even running?
kubectl get service vapora-backend -n vapora # Does service exist?
kubectl get endpoints -n vapora # Do endpoints exist?
```
**Common causes**:
- Pods not running (restart loops)
- Service missing or misconfigured
- Port incorrect
- Network policy blocking traffic
**Solutions**:
1. Verify pods running: `kubectl get pods -n vapora`
2. Verify service exists: `kubectl get svc -n vapora`
3. Check endpoints: `kubectl get endpoints -n vapora`
4. Port-forward to bypass routing issues: `kubectl port-forward svc/vapora-backend 8001:8001 -n vapora`
**Runbook**: [Incident Response](./incident-response-runbook.md)
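
A Service with no ready pods behind it shows `<none>` in the ENDPOINTS column, which is a common reason for "connection refused". As a minimal sketch, assuming the default `kubectl get endpoints` table layout, the hypothetical helper below lists such services:

```bash
# services_without_endpoints: reads `kubectl get endpoints` table output on
# stdin and prints each service whose ENDPOINTS column is "<none>".
services_without_endpoints() {
  awk 'NR > 1 && $2 == "<none>" { print $1 }'
}
```

Usage: `kubectl get endpoints -n vapora | services_without_endpoints`. Any name printed means the Service selector matches no ready pods, so check pod status and labels next.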
### High Error Rate
**Symptoms**: Dashboard shows >5% 5xx errors
**Diagnosis**:
```bash
# Check which endpoint
kubectl logs deployment/vapora-backend -n vapora | grep "ERROR\|500"
# Check recent deployment
git log -1 --oneline provisioning/
# Check dependencies
curl http://localhost:8001/health # is it healthy?
```
**Common causes**:
- Recent bad deployment
- Database connectivity issue
- Configuration error
- Dependency service down
**Solutions**:
1. If recent deployment: Consider rollback
2. Check ConfigMap for typos
3. Check database connectivity
4. Check external service health
**Runbook**: [Rollback Runbook](./rollback-runbook.md) or [Incident Response](./incident-response-runbook.md)
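
When the dashboard is unavailable, the error rate can be worked out from request counts. The helpers below are a sketch with hypothetical names; how you obtain the total and 5xx counts (logs, metrics) depends on your setup. The thresholds used in the example are the 5% symptom level from this section and the 10% Severity 1 alert from the alert table.

```bash
# error_rate_pct TOTAL ERRORS
# Prints the error rate as a percentage with one decimal place.
error_rate_pct() {
  awk -v total="$1" -v errors="$2" 'BEGIN {
    if (total <= 0) { print "0.0"; exit }   # avoid division by zero
    printf "%.1f\n", errors * 100 / total
  }'
}

# exceeds_threshold RATE THRESHOLD
# Succeeds (exit 0) when RATE is strictly greater than THRESHOLD.
exceeds_threshold() {
  awk -v r="$1" -v t="$2" 'BEGIN { exit !(r > t) }'
}
```

For example, 62 errors in 1000 requests gives `error_rate_pct 1000 62`, and `exceeds_threshold "$(error_rate_pct 1000 62)" 5` succeeds, while the same rate checked against `10` does not.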
### Resource Exhaustion (CPU/Memory)
**Symptoms**: `kubectl top pods` shows a pod's usage at its CPU/memory limits, or the pod reports "limits exceeded"
**Diagnosis**:
```bash
kubectl top nodes # Overall node usage
kubectl top pods -n vapora # Per-pod usage
kubectl get pod <pod> -n vapora -o yaml | grep limits -A 10 # Check limits
```
**Solutions**:
1. Increase pod resource limits (requires redeployment)
2. Scale out (add more replicas)
3. Scale down other workloads
4. Investigate memory leak if growing
**Runbook**: [Deployment Runbook](./deployment-runbook.md) → Phase 4 (Verification)
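
`kubectl top pods` reports absolute usage rather than a percentage of the limit, so comparing against a threshold takes one more step. The sketch below assumes the default three-column output (NAME, CPU(cores), MEMORY(bytes)) and a threshold in MiB that you choose, for example 90% of the pod's memory limit. `pods_over_memory` is a hypothetical helper name.

```bash
# pods_over_memory LIMIT_MIB
# Reads `kubectl top pods` output on stdin and prints pods using more than
# LIMIT_MIB of memory, with usage normalised to Mi.
pods_over_memory() {
  awk -v limit="$1" 'NR > 1 {
    mem = $3
    if (mem ~ /Gi$/)      { sub(/Gi$/, "", mem); mib = mem * 1024 }
    else if (mem ~ /Mi$/) { sub(/Mi$/, "", mem); mib = mem + 0 }
    else if (mem ~ /Ki$/) { sub(/Ki$/, "", mem); mib = mem / 1024 }
    else next             # unknown unit: skip rather than guess
    if (mib > limit) print $1, mib "Mi"
  }'
}
```

Usage: with a 1024Mi limit, `kubectl top pods -n vapora | pods_over_memory 921` lists pods above roughly 90% of that limit.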
### Database Connection Errors
**Symptoms**: `ERROR: could not connect to database`
**Diagnosis**:
```bash
# Check database is running
kubectl get pods -n <database-namespace>
# Check credentials in ConfigMap
kubectl get configmap vapora-config -n vapora -o yaml | grep -i "database\|password"
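# Test connectivity (single quotes make $DATABASE_URL expand inside the pod, not in your local shell)
kubectl exec <pod> -n vapora -- sh -c 'psql "$DATABASE_URL" -c "SELECT 1"'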
```
**Solutions**:
1. If credentials wrong: Fix in ConfigMap, restart pods
2. If database down: Escalate to DBA
3. If network issue: Network team investigation
4. If permissions: Update database user
**Runbook**: [Incident Response](./incident-response-runbook.md) → "Root Cause: Database Issues"
---
## Communication Templates
### Deployment Start
```
🚀 Deployment starting
Service: VAPORA
Version: v1.2.1
Mode: Enterprise
Expected duration: 10-15 minutes
Will update every 2 minutes. Questions? Ask in #deployments
```
### Deployment Complete
```
✅ Deployment complete
Duration: 12 minutes
Status: All services healthy
Pods: All running
Health check results:
✓ Backend: responding
✓ Frontend: accessible
✓ API: normal latency
✓ No errors in logs
Next step: Monitor for 2 hours
Contact: @on-call-engineer
```
### Incident Declared
```
🔴 INCIDENT DECLARED
Service: VAPORA Backend
Severity: 1 (Critical)
Time detected: HH:MM UTC
Current status: Investigating
Updates every 2 minutes
/cc @on-call-engineer @senior-engineer
```
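
Filling the template from a helper avoids typos under pressure and stamps the detection time in UTC automatically. This is a minimal sketch: `incident_declared` is a hypothetical function, and it treats the current time as the detection time.

```bash
# incident_declared SERVICE SEVERITY CC
# Prints the "Incident Declared" message with the current UTC time.
incident_declared() {
  printf '%s\n' '🔴 INCIDENT DECLARED'
  printf 'Service: %s\n' "$1"
  printf 'Severity: %s\n' "$2"
  printf 'Time detected: %s UTC\n' "$(date -u +%H:%M)"
  printf '%s\n' 'Current status: Investigating'
  printf '%s\n' 'Updates every 2 minutes'
  printf '/cc %s\n' "$3"
}
```

Usage: `incident_declared "VAPORA Backend" "1 (Critical)" "@on-call-engineer @senior-engineer"`, then paste the output into the incident channel.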
### Incident Resolved
```
✅ Incident resolved
Duration: 8 minutes
Root cause: [description]
Fix: [what was done]
All services healthy, monitoring for 1 hour
Post-mortem scheduled for [date]
```
### Rollback Executed
```
🔙 Rollback executed
Issue detected in v1.2.1
Rolled back to v1.2.0
Status: Services recovering
Timeline: Issue 14:30 → Rollback 14:32 → Recovered 14:35
Investigating root cause
```
---
## Escalation Matrix
When unsure who to contact:
| Issue Type | First Contact | Escalation | Emergency |
|-----------|---|---|---|
| **Deployment issue** | Deployment lead | Ops team | Ops manager |
| **Pod/Container** | On-call engineer | Senior engineer | Director of Eng |
| **Database** | DBA team | Ops manager | CTO |
| **Infrastructure** | Infra team | Ops manager | VP Ops |
| **Security issue** | Security team | CISO | CEO |
| **Networking** | Network team | Ops manager | CTO |
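
For scripted paging or chat-bot lookups, the matrix above can be encoded as a simple function. The sketch below mirrors the table one-to-one; the function name and the short keys (`pod`, `first`, and so on) are choices made for this example.

```bash
# escalation_contact ISSUE_TYPE LEVEL
# ISSUE_TYPE: deployment | pod | database | infrastructure | security | networking
# LEVEL:      first | escalation | emergency
escalation_contact() {
  case "$1:$2" in
    deployment:first)          echo "Deployment lead" ;;
    deployment:escalation)     echo "Ops team" ;;
    deployment:emergency)      echo "Ops manager" ;;
    pod:first)                 echo "On-call engineer" ;;
    pod:escalation)            echo "Senior engineer" ;;
    pod:emergency)             echo "Director of Eng" ;;
    database:first)            echo "DBA team" ;;
    database:escalation)       echo "Ops manager" ;;
    database:emergency)        echo "CTO" ;;
    infrastructure:first)      echo "Infra team" ;;
    infrastructure:escalation) echo "Ops manager" ;;
    infrastructure:emergency)  echo "VP Ops" ;;
    security:first)            echo "Security team" ;;
    security:escalation)       echo "CISO" ;;
    security:emergency)        echo "CEO" ;;
    networking:first)          echo "Network team" ;;
    networking:escalation)     echo "Ops manager" ;;
    networking:emergency)      echo "CTO" ;;
    *) echo "no entry for '$1' at level '$2'" >&2; return 1 ;;
  esac
}
```

For example, `escalation_contact database escalation` prints `Ops manager`.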
---
## Tools & Commands Quick Reference
### Essential kubectl Commands
```bash
# Get status
kubectl get pods -n vapora
kubectl get deployments -n vapora
kubectl get services -n vapora
# Logs
kubectl logs deployment/vapora-backend -n vapora
kubectl logs <pod> -n vapora --previous # Previous crash
kubectl logs <pod> -n vapora -f # Follow/tail
# Execute commands
kubectl exec -it <pod> -n vapora -- bash
kubectl exec <pod> -n vapora -- curl http://localhost:8001/health
# Describe (detailed info)
kubectl describe pod <pod> -n vapora
kubectl describe node <node>
# Port forward (local access)
kubectl port-forward svc/vapora-backend 8001:8001 -n vapora
# Restart pods
kubectl rollout restart deployment/vapora-backend -n vapora
# Rollback
kubectl rollout undo deployment/vapora-backend -n vapora
# Scale
kubectl scale deployment/vapora-backend --replicas=5 -n vapora
```
### Useful Aliases
```bash
alias k='kubectl'
alias kgp='kubectl get pods'
alias kgd='kubectl get deployments'
alias kgs='kubectl get services'
alias klogs='kubectl logs'
alias kexec='kubectl exec'
alias kdesc='kubectl describe'
alias ktop='kubectl top'
```
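
The aliases above do not set a namespace, so each command still needs `-n vapora`. One option is a small wrapper function, sketched here; `kv` and the `VAPORA_NAMESPACE` override are hypothetical names, not something the runbooks define.

```bash
# kv: run kubectl against the vapora namespace by default.
# Set VAPORA_NAMESPACE to target a different namespace.
kv() {
  kubectl -n "${VAPORA_NAMESPACE:-vapora}" "$@"
}
```

With this in your shell profile, `kv get pods` is equivalent to `kubectl -n vapora get pods`, and `kv logs deployment/vapora-backend --tail=20` works the same way.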
---
## Before Your First Deployment
1. **Read all runbooks**: Thoroughly review all procedures
2. **Practice in staging**: Do a test deployment to staging first
3. **Understand rollback**: Know how to rollback before deploying
4. **Get trained**: Have a senior engineer walk you through the procedures
5. **Test tools**: Verify kubectl and other tools work
6. **Verify access**: Confirm you have cluster access
7. **Know contacts**: Have escalation contacts readily available
8. **Review history**: Look at past deployments to understand patterns
---
## Continuous Improvement
### After Each Deployment
- [ ] Were all runbooks clear?
- [ ] Any steps missing or unclear?
- [ ] Any issues that could be prevented?
- [ ] Update documentation with learnings
### Monthly Review
- [ ] Review all incidents from past month
- [ ] Update procedures based on patterns
- [ ] Refresh team on any changes
- [ ] Update escalation contacts
- [ ] Review and improve alerting
---
## Key Principles
**Safety First**
- Always dry-run before applying
- Rollback quickly if issues detected
- Better to be conservative
**Communication**
- Communicate early and often
- Update every 2-5 minutes during incidents
- Notify stakeholders proactively
**Documentation**
- Document everything you do
- Update runbooks with learnings
- Share knowledge with team
**Preparation**
- Plan deployments thoroughly
- Test before going live
- Have rollback plan ready
**Quick Response**
- Detect issues quickly
- Diagnose systematically
- Execute fixes decisively
**Avoid**
- Guessing without verifying
- Skipping steps to save time
- Assuming systems are working
- Not communicating with team
- Making multiple changes at once
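
The "always dry-run before applying" principle can be sketched with two standard kubectl features: `kubectl diff` to preview changes and `kubectl apply --dry-run=server` to validate them against the API server without persisting anything. The wrapper name `safe_apply` is hypothetical, and the actual deployment steps are defined in the deployment runbook.

```bash
# safe_apply FILE
# Shows the pending changes, then validates the manifest server-side.
# Nothing is persisted; the real apply is left to the operator.
safe_apply() {
  sa_rc=0
  kubectl diff -f "$1" || sa_rc=$?
  # kubectl diff exits 0 (no changes) or 1 (changes found); >1 means an error
  if [ "$sa_rc" -gt 1 ]; then
    echo "kubectl diff failed (exit $sa_rc); not continuing" >&2
    return "$sa_rc"
  fi
  kubectl apply --dry-run=server -f "$1" || return 1
  echo "dry run OK; apply for real with: kubectl apply -f $1"
}
```

Usage: `safe_apply <manifest>`, review the diff, then run the printed command once the change looks right.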
---
## Support & Questions
- **Questions about procedures?** Ask senior engineer or operations team
- **Found runbook gap?** Create issue/PR to update documentation
- **Unclear instructions?** Clarify before executing critical operations
- **Ideas for improvement?** Share in team meetings or documentation repo
---
## Quick Start: Your First Deployment
### Day 0: Preparation
1. Read: `pre-deployment-checklist.md` (30 min)
2. Read: `deployment-runbook.md` (30 min)
3. Read: `rollback-runbook.md` (20 min)
4. Schedule walkthrough with senior engineer (1 hour)
### Day 1: Execute with Mentorship
1. Complete pre-deployment checklist with senior engineer
2. Execute deployment runbook with senior observing
3. Monitor for 2 hours with senior available
4. Debrief: what went well, what to improve
### Day 2+: Independent Deployments
1. Complete checklist independently
2. Execute runbook
3. Document and communicate
4. Ask for help if anything is unclear
---
**Generated**: 2026-01-12
**Status**: Production-ready
**Last Updated**: 2026-01-12