VAPORA Operations Runbooks

DAY 1 (Planning)
  ↓
- Create GitHub issue/ticket
- Identify deployment window
- Notify stakeholders

24 HOURS BEFORE
  ↓
- Complete pre-deployment checklist
  (pre-deployment-checklist.md)
- Verify all prerequisites
- Stage artifacts
- Test in staging

DEPLOYMENT DAY
  ↓
- Final go/no-go decision
- Execute deployment runbook
  (deployment-runbook.md)
  - Pre-flight checks
  - ConfigMap deployment
  - Service deployment
  - Verification
  - Communication

POST-DEPLOYMENT (2 hours)
  ↓
- Monitor closely (every 10 minutes)
- Watch for issues
- If problems → execute rollback runbook
  (rollback-runbook.md)
- Document results

24 HOURS LATER
  ↓
- Declare deployment stable
- Schedule post-mortem (if issues)
- Update documentation
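
The post-deployment phase above calls for a check every 10 minutes for 2 hours, which is about 12 checks. A minimal sketch of such a loop follows; the function name `watch_deployment` and its arguments are hypothetical, not part of any VAPORA tooling, and the check command is whatever you pass in.

```shell
# Hypothetical helper: run a check command a fixed number of times,
# sleeping between runs, and stop at the first failure.
# Usage: watch_deployment <checks> <interval_seconds> <command> [args...]
watch_deployment() {
  checks=$1
  interval=$2
  shift 2
  i=1
  while [ "$i" -le "$checks" ]; do
    if ! "$@"; then
      echo "check $i failed"
      return 1
    fi
    # Sleep only between checks, not after the last one.
    if [ "$i" -lt "$checks" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
  echo "all $checks checks passed"
}
```

For example, `watch_deployment 12 600 curl -fsS http://localhost:8001/health` would cover roughly the 2-hour window, assuming the backend is reachable on localhost. A failure is the cue to open rollback-runbook.md.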

If Issues During Deployment

Issue Detected
  ↓
Severity Assessment
  ↓
Severity 1-2:
  ├─ Immediate rollback
  │   (rollback-runbook.md)
  │
  └─ Post-rollback investigation
      (incident-response-runbook.md)

Severity 3-4:
  ├─ Monitor and investigate
  │   (incident-response-runbook.md)
  │
  └─ Fix in place if quick
      OR
      Schedule rollback
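
The decision tree above can be sketched as a small lookup. The function name and the output tokens are illustrative assumptions; the mapping itself comes straight from the tree.

```shell
# Hypothetical helper: map an incident severity (1-4) to the first action
# in the decision tree. Output tokens are illustrative only.
severity_action() {
  case "$1" in
    1|2) echo "rollback" ;;     # immediate rollback, investigate afterwards
    3|4) echo "investigate" ;;  # monitor, then fix in place or schedule rollback
    *)   echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

`severity_action 2` prints `rollback`; `severity_action 3` prints `investigate`.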

Monitoring & Alerting

Essential Dashboards

Keep these visible during deployments and whenever you are on call:

Kubernetes Dashboard
- Pod status
- Node health
- Event logs
Grafana Dashboards (if available)
- Request rate and latency
- Error rate
- CPU/Memory usage
- Pod restart counts
Application Logs (Elasticsearch, CloudWatch, etc.)
- Error messages
- Stack traces
- Performance logs

Alert Triggers & Responses

Alert	Severity	Response
Pod CrashLoopBackOff	1	Check logs, likely config issue
Error rate >10%	1	Check recent deployment, consider rollback
All pods pending	1	Node issue or resource exhausted
High memory usage >90%	2	Check for memory leak or scale up
High latency (2x normal)	2	Check database, external services
Single pod failed	3	Monitor, likely transient
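
As a rough illustration, the numeric thresholds in the table above can be encoded as follows. The function name `alert_severity` and the metric keywords are assumptions made for this sketch; only the thresholds (error rate >10%, memory >90%, latency at 2x normal) come from the table.

```shell
# Hypothetical helper: classify a metric reading against the table thresholds.
# Usage: alert_severity error_rate <percent>
#        alert_severity memory <percent>
#        alert_severity latency <current_ms> <normal_ms>
# Integer inputs only. Prints the severity, or "ok" if no threshold is crossed.
alert_severity() {
  metric=$1
  value=$2
  baseline=${3:-0}
  case "$metric" in
    error_rate)
      if [ "$value" -gt 10 ]; then echo 1; else echo ok; fi ;;
    memory)
      if [ "$value" -gt 90 ]; then echo 2; else echo ok; fi ;;
    latency)
      if [ "$baseline" -gt 0 ] && [ "$value" -ge $((baseline * 2)) ]; then
        echo 2
      else
        echo ok
      fi ;;
    *)
      echo "unknown metric: $metric" >&2
      return 1 ;;
  esac
}
```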

Health Check Commands

Quick commands to verify everything is working:

# Cluster health
kubectl cluster-info
kubectl get nodes        # All should be Ready

# Service health
kubectl get pods -n vapora
# All should be Running, 1/1 Ready

# Quick endpoint tests (assumes both services are reachable on localhost,
# for example through kubectl port-forward)
curl http://localhost:8001/health
curl http://localhost:3000

# Pod resources
kubectl top pods -n vapora

# Recent issues
kubectl get events -n vapora | grep Warning
kubectl logs deployment/vapora-backend -n vapora --tail=20
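
To avoid checking the pod list by eye, the "Running, 1/1 Ready" rule can be applied with a small filter. This is a sketch: `unhealthy_pods` is a hypothetical name, and it assumes the default `kubectl get pods` table with its header line (NAME, READY, STATUS, ...).

```shell
# Hypothetical filter: read `kubectl get pods` output on stdin and print
# pods that are not Running or whose ready count is below the container count.
# Assumes the header row is present (do not combine with --no-headers).
unhealthy_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")   # READY column, e.g. "1/1"
    if ($3 != "Running" || ready[1] != ready[2]) print $1
  }'
}
```

Usage: `kubectl get pods -n vapora | unhealthy_pods`. Empty output means every pod passed. Pods of finished Jobs (status Completed) would also be listed, so treat the result as a prompt to look closer.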

Common Failure Scenarios

Pod CrashLoopBackOff

Symptoms: Pod restarts repeatedly

Diagnosis:

kubectl logs <pod> -n vapora --previous  # See what crashed
kubectl describe pod <pod> -n vapora    # Check events

Solutions:

If config error: Fix ConfigMap, restart pod
If code error: Rollback deployment
If resource issue: Increase limits or scale out

Runbook: Rollback Runbook or Incident Response
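
A quick way to spot restart loops before a pod reaches CrashLoopBackOff is to filter on the RESTARTS column. This is a sketch under assumptions: `restarting_pods` is a hypothetical name and the default threshold of 3 is arbitrary.

```shell
# Hypothetical filter: read `kubectl get pods` output on stdin and print
# "<pod> <restarts>" for pods restarted more than a threshold (default 3).
# RESTARTS is the 4th column; newer kubectl versions append "(5m ago)",
# which lands in later fields and does not affect the count.
restarting_pods() {
  awk -v max="${1:-3}" 'NR > 1 && $4 + 0 > max { print $1, $4 }'
}
```

Usage: `kubectl get pods -n vapora | restarting_pods 5`.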

Pod Stuck in Pending

Symptoms: Pod won't start, stuck in "Pending" state

Diagnosis:

kubectl describe pod <pod> -n vapora  # Check "Events" section

Common causes:

Insufficient CPU/memory on nodes
Node disk full
Pod can't be scheduled (taints, node selectors, or affinity rules match no node)
Persistent volume not available

Solutions:

Scale down other workloads
Add more nodes
Fix persistent volume issues
Check node disk space

Runbook: On-Call Procedures → "Common Questions"
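
The Events section usually names the cause directly. The sketch below matches a few scheduler message fragments against the causes listed above. The fragments are assumptions based on common scheduler wording and may differ between Kubernetes versions; `pending_cause` and its output tokens are hypothetical.

```shell
# Hypothetical helper: read `kubectl describe pod` output on stdin and
# print a coarse guess at why the pod is Pending.
# The matched fragments are assumed wording and may vary by cluster version.
pending_cause() {
  events=$(cat)
  case "$events" in
    *"Insufficient cpu"*|*"Insufficient memory"*)
      echo "insufficient-resources" ;;
    *"disk-pressure"*)
      echo "node-disk-pressure" ;;
    *"unbound immediate PersistentVolumeClaims"*)
      echo "volume-unavailable" ;;
    *"taint"*)
      echo "taint-not-tolerated" ;;
    *)
      echo "unknown" ;;
  esac
}
```

Usage: `kubectl describe pod <pod> -n vapora | pending_cause`. An `unknown` result means the Events section needs to be read manually.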

Service Unresponsive (Connection Refused)

Symptoms: curl: (7) Failed to connect to localhost port 8001

Diagnosis:

kubectl get pods -n vapora      # Are pods even running?
kubectl get service vapora-backend -n vapora  # Does service exist?
kubectl get endpoints -n vapora # Do endpoints exist?

Common causes:

Pods not running (restart loops)
Service missing or misconfigured
Port incorrect
Network policy blocking traffic

Solutions:

Verify pods running: kubectl get pods -n vapora
Verify service exists: kubectl get svc -n vapora
Check endpoints: kubectl get endpoints -n vapora
Port-forward if the issue is with routing: kubectl port-forward svc/vapora-backend 8001:8001 -n vapora

Runbook: Incident Response
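
A service with no ready pods behind it shows `<none>` in the ENDPOINTS column. The filter below picks those out. It is a sketch: the name `services_without_endpoints` is hypothetical, and it assumes the default `kubectl get endpoints` table with a header row.

```shell
# Hypothetical filter: read `kubectl get endpoints` output on stdin and
# print the names of services that have no endpoints.
services_without_endpoints() {
  awk 'NR > 1 && $2 == "<none>" { print $1 }'
}
```

Usage: `kubectl get endpoints -n vapora | services_without_endpoints`. A listed service usually points to pods that are not Ready or a selector that matches no pods.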

High Error Rate

Symptoms: Dashboard shows >5% 5xx errors

Diagnosis:

# Check which endpoint
kubectl logs deployment/vapora-backend -n vapora | grep "ERROR\|500"

# Check recent deployment
git log -1 --oneline provisioning/

# Check dependencies
curl http://localhost:8001/health  # is it healthy?

Common causes:

Recent bad deployment
Database connectivity issue
Configuration error
Dependency service down

Solutions:

If recent deployment: Consider rollback
Check ConfigMap for typos
Check database connectivity
Check external service health

Runbook: Rollback Runbook or Incident Response
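
When the dashboard is unavailable, the 5% threshold can be checked from raw counts. A minimal sketch, assuming you already have a count of 5xx responses and total requests for the same window; `error_rate_exceeds` is a hypothetical name.

```shell
# Hypothetical helper: succeed (exit 0) when errors/total is above the
# threshold percentage (default 5). Integer arithmetic only.
# Usage: error_rate_exceeds <errors> <total> [threshold_percent]
error_rate_exceeds() {
  errors=$1
  total=$2
  threshold=${3:-5}
  # Compare errors*100 with total*threshold to avoid division.
  [ "$total" -gt 0 ] && [ $((errors * 100)) -gt $((total * threshold)) ]
}
```

Usage: `error_rate_exceeds 60 1000 && echo "consider rollback"`.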

Resource Exhaustion (CPU/Memory)

Symptoms: kubectl top pods shows a pod at or near its CPU or memory limit, or "limits exceeded" is reported

Diagnosis:

kubectl top nodes              # Overall node usage
kubectl top pods -n vapora     # Per-pod usage
kubectl get pod <pod> -n vapora -o yaml | grep limits -A 10  # Check limits

Solutions:

Increase pod resource limits (requires redeployment)
Scale out (add more replicas)
Scale down other workloads
Investigate memory leak if growing

Runbook: Deployment Runbook → Phase 4 (Verification)
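
`kubectl top pods` reports absolute usage, so it has to be compared with the limit yourself. The sketch below flags pods above a memory figure you supply. `pods_over_memory` is a hypothetical name, and it assumes memory is reported in Mi, which is the usual unit in this output.

```shell
# Hypothetical filter: read `kubectl top pods` output on stdin and print
# "<pod> <memory>" for pods using more than <limit_mi> MiB.
# Only values ending in "Mi" are considered.
pods_over_memory() {
  awk -v limit="$1" 'NR > 1 && $3 ~ /Mi$/ {
    mem = $3
    sub(/Mi$/, "", mem)
    if (mem + 0 > limit) print $1, $3
  }'
}
```

Usage: `kubectl top pods -n vapora | pods_over_memory 460` lists pods above 90% of a 512Mi limit.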

Database Connection Errors

Symptoms: ERROR: could not connect to database

Diagnosis:

# Check database is running
kubectl get pods -n <database-namespace>

# Check credentials in ConfigMap
kubectl get configmap vapora-config -n vapora -o yaml | grep -i "database\|password"

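# Test connectivity (sh -c makes $DATABASE_URL expand inside the pod, not in
# your local shell; assumes psql is installed in the container image)
kubectl exec <pod> -n vapora -- sh -c 'psql "$DATABASE_URL" -c "SELECT 1"'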

Solutions:

If credentials wrong: Fix in ConfigMap, restart pods
If database down: Escalate to DBA
If network issue: Escalate to the network team
If permissions issue: Update the database user's grants

Runbook: Incident Response → "Root Cause: Database Issues"
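
The diagnosis commands above can print credentials. Before pasting a connection string into a ticket or chat channel, mask the password. A minimal sketch, assuming a URL of the form `scheme://user:password@host/...`; `redact_db_url` is a hypothetical name.

```shell
# Hypothetical helper: replace the password part of a connection URL with ***.
# URLs without a user:password section are printed unchanged.
redact_db_url() {
  printf '%s\n' "$1" | sed -E 's#(://[^:/@]+):[^@]*@#\1:***@#'
}
```

For example, `redact_db_url 'postgres://vapora:s3cret@db.internal:5432/vapora'` prints `postgres://vapora:***@db.internal:5432/vapora` (the host and credentials here are made up).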

Communication Templates

Deployment Start

🚀 Deployment starting

Service: VAPORA
Version: v1.2.1
Mode: Enterprise
Expected duration: 10-15 minutes

Will update every 2 minutes. Questions? Ask in #deployments

Deployment Complete

✅ Deployment complete

Duration: 12 minutes
Status: All services healthy
Pods: All running

Health check results:
✓ Backend: responding
✓ Frontend: accessible
✓ API: normal latency
✓ No errors in logs

Next step: Monitor for 2 hours
Contact: @on-call-engineer

Incident Declared

🔴 INCIDENT DECLARED

Service: VAPORA Backend
Severity: 1 (Critical)
Time detected: HH:MM UTC
Current status: Investigating

Updates every 2 minutes
/cc @on-call-engineer @senior-engineer

Incident Resolved

✅ Incident resolved

Duration: 8 minutes
Root cause: [description]
Fix: [what was done]

All services healthy, monitoring for 1 hour
Post-mortem scheduled for [date]

Rollback Executed

🔙 Rollback executed

Issue detected in v1.2.1
Rolled back to v1.2.0

Status: Services recovering
Timeline: Issue 14:30 → Rollback 14:32 → Recovered 14:35

Investigating root cause

Escalation Matrix

When unsure who to contact:

Issue Type	First Contact	Escalation	Emergency
Deployment issue	Deployment lead	Ops team	Ops manager
Pod/Container	On-call engineer	Senior engineer	Director of Eng
Database	DBA team	Ops manager	CTO
Infrastructure	Infra team	Ops manager	VP Ops
Security issue	Security team	CISO	CEO
Networking	Network team	Ops manager	CTO
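
For scripted paging or chat bots, the matrix above can be kept as a lookup. This is a sketch: the function name `escalation_contact` and the short issue-type keywords are assumptions, while the contacts are copied from the table.

```shell
# Hypothetical lookup: print the contact for an issue type and level.
# Usage: escalation_contact <issue_type> [first|escalation|emergency]
escalation_contact() {
  level=${2:-first}
  case "$1" in
    deployment)     row="Deployment lead|Ops team|Ops manager" ;;
    pod)            row="On-call engineer|Senior engineer|Director of Eng" ;;
    database)       row="DBA team|Ops manager|CTO" ;;
    infrastructure) row="Infra team|Ops manager|VP Ops" ;;
    security)       row="Security team|CISO|CEO" ;;
    networking)     row="Network team|Ops manager|CTO" ;;
    *) echo "unknown issue type: $1" >&2; return 1 ;;
  esac
  case "$level" in
    first)      field=1 ;;
    escalation) field=2 ;;
    emergency)  field=3 ;;
    *) echo "unknown level: $level" >&2; return 1 ;;
  esac
  printf '%s\n' "$row" | cut -d'|' -f"$field"
}
```

`escalation_contact database emergency` prints `CTO`; with no level given, the first contact is returned.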

Tools & Commands Quick Reference

Essential kubectl Commands

# Get status
kubectl get pods -n vapora
kubectl get deployments -n vapora
kubectl get services -n vapora

# Logs
kubectl logs deployment/vapora-backend -n vapora
kubectl logs <pod> -n vapora --previous  # Previous crash
kubectl logs <pod> -n vapora -f          # Follow/tail

# Execute commands
kubectl exec -it <pod> -n vapora -- bash
kubectl exec <pod> -n vapora -- curl http://localhost:8001/health

# Describe (detailed info)
kubectl describe pod <pod> -n vapora
kubectl describe node <node>

# Port forward (local access)
kubectl port-forward svc/vapora-backend 8001:8001 -n vapora

# Restart pods
kubectl rollout restart deployment/vapora-backend -n vapora

# Rollback
kubectl rollout undo deployment/vapora-backend -n vapora

# Scale
kubectl scale deployment/vapora-backend --replicas=5 -n vapora

Useful Aliases

alias k='kubectl'
alias kgp='kubectl get pods'
alias kgd='kubectl get deployments'
alias kgs='kubectl get services'
alias klogs='kubectl logs'
alias kexec='kubectl exec'
alias kdesc='kubectl describe'
alias ktop='kubectl top'
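
The aliases above still need `-n vapora` on every call. One option is a wrapper function that supplies the namespace. This is a sketch: `kv` and the `VAPORA_NAMESPACE` variable are hypothetical names, not existing tooling.

```shell
# Hypothetical wrapper: run kubectl against the vapora namespace by default.
# Set VAPORA_NAMESPACE to target a different namespace.
kv() {
  kubectl -n "${VAPORA_NAMESPACE:-vapora}" "$@"
}
```

With this in place, `kv get pods` is equivalent to `kubectl -n vapora get pods`. A function is used rather than an alias because aliases are not expanded in non-interactive scripts by default.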