jesus/Vapora

Jesús Pérez 7110ffeea2

chore: extend doc: adr, tutorials, operations, etc

2026-01-12 03:32:47 +00:00

VAPORA Operations Runbooks

Complete set of runbooks and procedures for deploying, monitoring, and operating VAPORA in production environments.

Quick Navigation

I need to...

Deploy to production: See Deployment Runbook or Pre-Deployment Checklist
Respond to an incident: See Incident Response Runbook
Roll back a deployment: See Rollback Runbook
Go on-call: See On-Call Procedures
Monitor services: See Monitoring & Alerting
Understand common failures: See Common Failure Scenarios

Runbook Overview

1. Pre-Deployment Checklist

When: 24 hours before any production deployment

Content: Comprehensive checklist for deployment preparation including:

Communication & scheduling
Code review & validation
Environment verification
Health baseline recording
Artifact preparation
Rollback plan verification

Time: 1-2 hours

File: pre-deployment-checklist.md

2. Deployment Runbook

When: Executing actual production deployment

Content: Step-by-step deployment procedures including:

Pre-flight checks (5 min)
Configuration deployment (3 min)
Deployment update (5 min)
Verification (5 min)
Validation (3 min)
Communication & monitoring

Time: 15-20 minutes total

File: deployment-runbook.md

3. Rollback Runbook

When: Issues detected after deployment requiring immediate rollback

Content: Safe rollback procedures including:

When to roll back (decision criteria)
Kubernetes automatic rollback (step-by-step)
Docker manual rollback (guided)
Post-rollback verification
Emergency procedures
Prevention & lessons learned

Time: 5-10 minutes (depending on issues)

File: rollback-runbook.md

4. Incident Response Runbook

When: Production incident declared

Content: Full incident response procedures including:

Severity levels (1-4) with examples
Report & assess procedures
Diagnosis & escalation
Fix implementation
Recovery verification
Communication templates
Role definitions

Time: Varies by severity (2 min to 1+ hour)

File: incident-response-runbook.md

5. On-Call Procedures

When: During assigned on-call shift

Content: Full on-call guide including:

Before shift starts (setup & verification)
Daily tasks & check-ins
Responding to alerts
Monitoring dashboard setup
Escalation decision tree
Shift handoff procedures
Common questions & answers

Time: Read thoroughly before first on-call shift (~30 min)

File: on-call-procedures.md

Deployment Workflow

Standard Deployment Process

DAY 1 (Planning)
  ↓
- Create GitHub issue/ticket
- Identify deployment window
- Notify stakeholders

24 HOURS BEFORE
  ↓
- Complete pre-deployment checklist
  (pre-deployment-checklist.md)
- Verify all prerequisites
- Stage artifacts
- Test in staging

DEPLOYMENT DAY
  ↓
- Final go/no-go decision
- Execute deployment runbook
  (deployment-runbook.md)
  - Pre-flight checks
  - ConfigMap deployment
  - Service deployment
  - Verification
  - Communication

POST-DEPLOYMENT (2 hours)
  ↓
- Monitor closely (every 10 minutes)
- Watch for issues
- If problems → execute rollback runbook
  (rollback-runbook.md)
- Document results

24 HOURS LATER
  ↓
- Declare deployment stable
- Schedule post-mortem (if issues)
- Update documentation
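
The post-deployment step above (check every 10 minutes for 2 hours) could be scripted roughly as follows. This is only a sketch: `watch_deployment` is a hypothetical helper that does not exist in the runbooks, and the check command you pass to it is whatever health check your team trusts.

```bash
# Run a check command every <interval> seconds for <duration> seconds.
# Stops at the first failed check so the operator can open rollback-runbook.md.
# Usage: watch_deployment <interval-seconds> <duration-seconds> <command...>
watch_deployment() {
  interval=$1
  duration=$2
  shift 2
  elapsed=0
  while [ "$elapsed" -lt "$duration" ]; do
    "$@" || { echo "check failed after ${elapsed}s"; return 1; }
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "stable for ${duration}s"
}
```

For the 2-hour window this might be called as `watch_deployment 600 7200 kubectl rollout status deployment/vapora-backend -n vapora --timeout=30s`. The interval must be greater than zero, otherwise the loop never advances.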

If Issues During Deployment

Issue Detected
  ↓
Severity Assessment
  ↓
Severity 1-2:
  ├─ Immediate rollback
  │   (rollback-runbook.md)
  │
  └─ Post-rollback investigation
      (incident-response-runbook.md)

Severity 3-4:
  ├─ Monitor and investigate
  │   (incident-response-runbook.md)
  │
  └─ Fix in place if quick
      OR
      Schedule rollback
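
The decision tree above can be written down as a small lookup, which may help when wiring it into chat-ops or an alert handler. `next_action` is a hypothetical name used only for illustration; it encodes just the first branch of the tree, not the follow-up investigation.

```bash
# Map an incident severity (1-4) to the first action in the decision tree.
next_action() {
  case "$1" in
    1|2) echo "rollback" ;;     # Severity 1-2: roll back immediately, investigate afterwards
    3|4) echo "investigate" ;;  # Severity 3-4: monitor and investigate before deciding
    *)   echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

For example, `next_action 2` prints `rollback`, pointing the operator at rollback-runbook.md.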

Monitoring & Alerting

Essential Dashboards

Keep these visible during deployments and throughout every on-call shift:

Kubernetes Dashboard
  - Pod status
  - Node health
  - Event logs
Grafana Dashboards (if available)
  - Request rate and latency
  - Error rate
  - CPU/Memory usage
  - Pod restart counts
Application Logs (Elasticsearch, CloudWatch, etc.)
  - Error messages
  - Stack traces
  - Performance logs

Alert Triggers & Responses

Alert	Severity	Response
Pod CrashLoopBackOff	1	Check logs, likely config issue
Error rate >10%	1	Check recent deployment, consider rollback
All pods pending	1	Node issue or resource exhausted
High memory usage >90%	2	Check for memory leak or scale up
High latency (2x normal)	2	Check database, external services
Single pod failed	3	Monitor, likely transient
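
The numeric rows of the table above can be expressed as a threshold check. The sketch below assumes readings arrive as whole-number percentages (latency as a percentage of normal, so 200 means 2x); `alert_severity` and the metric names are hypothetical and would need to match your alerting setup.

```bash
# Print the severity from the alert table for a numeric reading,
# or "ok" when the reading is below the table's threshold.
# Usage: alert_severity <error_rate|memory|latency> <whole-number percent>
alert_severity() {
  metric=$1
  value=$2
  case "$metric" in
    error_rate) [ "$value" -gt 10 ] && { echo 1; return 0; } ;;   # Error rate >10%
    memory)     [ "$value" -gt 90 ] && { echo 2; return 0; } ;;   # Memory usage >90%
    latency)    [ "$value" -ge 200 ] && { echo 2; return 0; } ;;  # Latency at 2x normal or worse
    *)          echo "unknown metric: $metric" >&2; return 1 ;;
  esac
  echo ok
}
```

The state-based alerts (CrashLoopBackOff, pending pods, a single failed pod) are not covered here because they are not numeric thresholds.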

Health Check Commands

Quick commands to verify everything is working:

# Cluster health
kubectl cluster-info
kubectl get nodes        # All should be Ready

# Service health
kubectl get pods -n vapora
# All should be Running, 1/1 Ready

# Quick endpoint tests (assumes ports 8001 and 3000 are reachable locally, e.g. via kubectl port-forward)
curl http://localhost:8001/health
curl http://localhost:3000

# Pod resources
kubectl top pods -n vapora

# Recent issues
kubectl get events -n vapora | grep Warning
kubectl logs deployment/vapora-backend -n vapora --tail=20
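
The "All should be Running, 1/1 Ready" check above can be automated instead of read by eye. A minimal sketch, assuming the default column layout of `kubectl get pods` (NAME, READY, STATUS, ...); `count_unhealthy_pods` is a hypothetical helper name.

```bash
# Read `kubectl get pods --no-headers` output on stdin and print how many pods
# are not fully healthy: STATUS is not Running, or READY shows fewer ready
# containers than expected (e.g. 0/1).
count_unhealthy_pods() {
  awk '{ split($2, r, "/"); if ($3 != "Running" || r[1] != r[2]) n++ } END { print n + 0 }'
}

# Against a live cluster:
#   kubectl get pods -n vapora --no-headers | count_unhealthy_pods
```

Note that pods from finished Jobs (STATUS `Completed`) would also be counted, so filter those out first if the namespace runs batch workloads.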

Common Failure Scenarios

Pod CrashLoopBackOff

Symptoms: Pod keeps restarting repeatedly

Diagnosis:

kubectl logs <pod> -n vapora --previous  # See what crashed
kubectl describe pod <pod> -n vapora    # Check events

Solutions:

If config error: Fix ConfigMap, restart pod
If code error: Roll back the deployment
If resource issue: Increase limits or scale out

Runbook: Rollback Runbook or Incident Response
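
For the config-error case, the fix-and-restart sequence might look like the sketch below. `fix_config_and_restart` is a hypothetical helper, and the manifest path passed to it is an assumption: the runbooks do not name the file that holds the ConfigMap.

```bash
# Apply a corrected ConfigMap manifest, restart the backend so its pods pick up
# the new values, then wait for the rollout to finish.
# $1: path to the corrected manifest (assumed; not named in the runbooks).
fix_config_and_restart() {
  kubectl apply -f "$1" -n vapora &&
    kubectl rollout restart deployment/vapora-backend -n vapora &&
    kubectl rollout status deployment/vapora-backend -n vapora --timeout=120s
}
```

Each step runs only if the previous one succeeded, so a rejected manifest never triggers a restart.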

Pod Stuck in Pending

Symptoms: Pod won't start, stuck in "Pending" state

Diagnosis:

kubectl describe pod <pod> -n vapora  # Check "Events" section

Common causes:

Insufficient CPU/memory on nodes
Node disk full
Pod can't be scheduled
Persistent volume not available

Solutions:

Scale down other workloads
Add more nodes
Fix persistent volume issues
Check node disk space

Runbook: On-Call Procedures → "Common Questions"
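
To narrow down which of the causes above applies, it can help to list only the Pending pods and the scheduler's own explanation for them. A sketch using standard field selectors; `pending_report` is a hypothetical wrapper name.

```bash
# Show pods the scheduler has not placed, followed by FailedScheduling events,
# which usually name the missing resource (CPU, memory, volume, taints).
pending_report() {
  kubectl get pods -n vapora --field-selector=status.phase=Pending
  kubectl get events -n vapora --field-selector reason=FailedScheduling
}
```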

Service Unresponsive (Connection Refused)

Symptoms: curl: (7) Failed to connect to localhost port 8001

Diagnosis:

kubectl get pods -n vapora      # Are pods even running?
kubectl get service vapora-backend -n vapora  # Does service exist?
kubectl get endpoints -n vapora # Do endpoints exist?

Common causes:

Pods not running (restart loops)
Service missing or misconfigured
Port incorrect
Network policy blocking traffic

Solutions:

Verify pods running: kubectl get pods -n vapora
Verify service exists: kubectl get svc -n vapora
Check endpoints: kubectl get endpoints -n vapora
Port-forward if issue with routing: kubectl port-forward svc/vapora-backend 8001:8001 -n vapora

Runbook: Incident Response
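
A Service can exist and still have nothing behind it, which produces exactly this "connection refused" symptom. The sketch below checks whether `vapora-backend` has any ready endpoint addresses; `backend_has_endpoints` is a hypothetical helper name.

```bash
# Succeeds only if the vapora-backend Service has at least one ready endpoint.
# An empty result usually means the Service selector matches no Ready pods.
backend_has_endpoints() {
  ips=$(kubectl get endpoints vapora-backend -n vapora \
    -o jsonpath='{.subsets[*].addresses[*].ip}')
  [ -n "$ips" ]
}
```

A typical use would be `backend_has_endpoints || echo "no ready endpoints: check pod readiness and Service selector"`.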

High Error Rate

Symptoms: Dashboard shows >5% 5xx errors

Diagnosis:

# Check which endpoint
kubectl logs deployment/vapora-backend -n vapora | grep -E "ERROR|500"

# Check recent deployment
git log -1 --oneline provisioning/

# Check dependencies
curl http://localhost:8001/health  # is it healthy?

Common causes:

Recent bad deployment
Database connectivity issue
Configuration error
Dependency service down

Solutions:

If recent deployment: Consider rollback
Check ConfigMap for typos
Check database connectivity
Check external service health

Runbook: Rollback Runbook or Incident Response
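
If the dashboard is unavailable, the 5% symptom threshold can be checked from raw counts. This sketch assumes you already have an error count and a total request count from some source (logs, metrics export); `error_rate_high` is a hypothetical name.

```bash
# True when errors/total exceeds 5%. Compares errors*100 against total*5
# so that fractional rates such as 5.1% are not lost to integer division.
# Usage: error_rate_high <error-count> <total-count>
error_rate_high() {
  [ $(( $1 * 100 )) -gt $(( $2 * 5 )) ]
}
```

For example, `error_rate_high 51 1000 && echo "consider rollback"` fires because 5.1% is above the threshold.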

Resource Exhaustion (CPU/Memory)

Symptoms: kubectl top pods shows pod at 100% usage or "limits exceeded"

Diagnosis:

kubectl top nodes              # Overall node usage
kubectl top pods -n vapora     # Per-pod usage
kubectl get pod <pod> -n vapora -o yaml | grep -A 10 limits  # Check limits

Solutions:

Increase pod resource limits (requires redeployment)
Scale out (add more replicas)
Scale down other workloads
Investigate memory leak if growing

Runbook: Deployment Runbook → Phase 4 (Verification)
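
To spot the pods worth investigating, the `kubectl top pods` output can be filtered by a memory threshold. A sketch that assumes the MEMORY column is reported in Mi, which is the usual case; rows in other units are skipped. `pods_over_memory` is a hypothetical helper name.

```bash
# Read `kubectl top pods --no-headers` output on stdin and print the names of
# pods using more than <limit> MiB of memory.
# Usage: kubectl top pods -n vapora --no-headers | pods_over_memory 512
pods_over_memory() {
  awk -v limit="$1" '$3 ~ /Mi$/ {
    mem = $3
    sub(/Mi$/, "", mem)
    if (mem + 0 > limit + 0) print $1
  }'
}
```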

Database Connection Errors

Symptoms: ERROR: could not connect to database

Diagnosis:

# Check database is running
kubectl get pods -n <database-namespace>

# Check credentials in ConfigMap
kubectl get configmap vapora-config -n vapora -o yaml | grep -i "database\|password"

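# Test connectivity from inside the pod
# (single quotes make $DATABASE_URL expand in the pod, not in your local shell;
#  assumes psql is installed in the container image)
kubectl exec <pod> -n vapora -- sh -c 'psql "$DATABASE_URL" -c "SELECT 1"'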

Solutions:

If credentials wrong: Fix in ConfigMap, restart pods
If database down: Escalate to DBA
If network issue: Network team investigation
If permissions: Update database user

Runbook: Incident Response → "Root Cause: Database Issues"

Communication Templates

Deployment Start

🚀 Deployment starting

Service: VAPORA
Version: v1.2.1
Mode: Enterprise
Expected duration: 15-20 minutes

Will update every 2 minutes. Questions? Ask in #deployments

Deployment Complete

✅ Deployment complete

Duration: 12 minutes
Status: All services healthy
Pods: All running

Health check results:
✓ Backend: responding
✓ Frontend: accessible
✓ API: normal latency
✓ No errors in logs

Next step: Monitor for 2 hours
Contact: @on-call-engineer

Incident Declared

🔴 INCIDENT DECLARED

Service: VAPORA Backend
Severity: 1 (Critical)
Time detected: HH:MM UTC
Current status: Investigating

Updates every 2 minutes
/cc @on-call-engineer @senior-engineer

Incident Resolved

✅ Incident resolved

Duration: 8 minutes
Root cause: [description]
Fix: [what was done]

All services healthy, monitoring for 1 hour
Post-mortem scheduled for [date]

Rollback Executed

🔙 Rollback executed

Issue detected in v1.2.1
Rolled back to v1.2.0

Status: Services recovering
Timeline: Issue 14:30 → Rollback 14:32 → Recovered 14:35

Investigating root cause
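
Templates like these are easier to post consistently when they are generated rather than retyped. As an illustration, the Deployment Start message could be rendered by a small function; `deployment_start_message` is a hypothetical helper, and how the text reaches the #deployments channel is left to your chat tooling.

```bash
# Render the "Deployment Start" template.
# Usage: deployment_start_message <version> <mode> <expected-duration>
deployment_start_message() {
  cat <<EOF
🚀 Deployment starting

Service: VAPORA
Version: $1
Mode: $2
Expected duration: $3

Will update every 2 minutes. Questions? Ask in #deployments
EOF
}
```

Calling `deployment_start_message v1.2.1 Enterprise "15 minutes"` prints the filled-in template, ready to paste or pipe into a notification command.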

Escalation Matrix

When unsure who to contact:

Issue Type	First Contact	Escalation	Emergency
Deployment issue	Deployment lead	Ops team	Ops manager
Pod/Container	On-call engineer	Senior engineer	Director of Eng
Database	DBA team	Ops manager	CTO
Infrastructure	Infra team	Ops manager	VP Ops
Security issue	Security team	CISO	CEO
Networking	Network team	Ops manager	CTO

Tools & Commands Quick Reference

Essential kubectl Commands

# Get status
kubectl get pods -n vapora
kubectl get deployments -n vapora
kubectl get services -n vapora

# Logs
kubectl logs deployment/vapora-backend -n vapora
kubectl logs <pod> -n vapora --previous  # Previous crash
kubectl logs <pod> -n vapora -f          # Follow/tail

# Execute commands
kubectl exec -it <pod> -n vapora -- bash
kubectl exec <pod> -n vapora -- curl http://localhost:8001/health

# Describe (detailed info)
kubectl describe pod <pod> -n vapora
kubectl describe node <node>

# Port forward (local access)
kubectl port-forward svc/vapora-backend 8001:8001 -n vapora

# Restart pods
kubectl rollout restart deployment/vapora-backend -n vapora

# Rollback
kubectl rollout undo deployment/vapora-backend -n vapora

# Scale
kubectl scale deployment/vapora-backend --replicas=5 -n vapora
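
`kubectl rollout undo` returns as soon as the rollback is requested, not when it has finished. A sketch that pairs it with a wait, so the operator knows when the previous revision is actually serving; `rollback_backend` is a hypothetical helper and the 120-second timeout is an arbitrary assumption.

```bash
# Roll the backend back to the previous revision and wait until the rollout
# completes or the timeout expires.
rollback_backend() {
  kubectl rollout undo deployment/vapora-backend -n vapora &&
    kubectl rollout status deployment/vapora-backend -n vapora --timeout=120s
}
```

The full procedure, including decision criteria and post-rollback verification, is in rollback-runbook.md.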

Useful Aliases

alias k='kubectl'
alias kgp='kubectl get pods'
alias kgd='kubectl get deployments'
alias kgs='kubectl get services'
alias klogs='kubectl logs'
alias kexec='kubectl exec'
alias kdesc='kubectl describe'
alias ktop='kubectl top'
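
The aliases above do not set a namespace, so they act on whatever namespace the current context defaults to. One possible addition is a wrapper that always targets `vapora`; the name `kv` is an arbitrary choice for this sketch.

```bash
# Run any kubectl subcommand against the vapora namespace.
kv() {
  kubectl -n vapora "$@"
}
```

With this in place, `kv get pods` is equivalent to `kubectl -n vapora get pods`. A shell function is used instead of an alias so it also works inside scripts.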

Before Your First Deployment

Read all runbooks: Thoroughly review all procedures
Practice in staging: Do a test deployment to staging first
Understand rollback: Know how to roll back before deploying
Get trained: Have a senior engineer walk you through the procedures
Test tools: Verify kubectl and other tools work
Verify access: Confirm you have cluster access
Know contacts: Have escalation contacts readily available
Review history: Look at past deployments to understand patterns
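
The "Test tools" and "Verify access" items can be checked from the command line. A minimal sketch, assuming that permission to update Deployments in the `vapora` namespace is what a deployer needs; `preflight` is a hypothetical helper name.

```bash
# Confirm kubectl is installed and the current user may update Deployments
# in the vapora namespace. Prints "ok" on success.
preflight() {
  command -v kubectl >/dev/null 2>&1 || { echo "kubectl not found"; return 1; }
  kubectl auth can-i update deployments -n vapora >/dev/null ||
    { echo "no permission to update deployments in vapora"; return 1; }
  echo "ok"
}
```

`kubectl auth can-i` exits non-zero when the answer is "no", which is what the second check relies on.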

Continuous Improvement

After Each Deployment

Were all runbooks clear?
Any steps missing or unclear?
Any issues that could be prevented?
Update documentation with learnings

Monthly Review

Review all incidents from past month
Update procedures based on patterns
Refresh team on any changes
Update escalation contacts
Review and improve alerting

Key Principles

✅ Safety First

Always dry-run before applying
Roll back quickly if issues are detected
Better to be conservative
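
In kubectl terms, "dry-run before applying" can mean a server-side dry run followed by a diff against the live objects. The sketch below previews a change without applying it; `preview_change` is a hypothetical helper and the manifest path is whatever file you are about to apply.

```bash
# Validate a manifest against the API server without persisting it, then show
# what would change. Nothing is applied.
# $1: path to the manifest you intend to apply.
preview_change() {
  kubectl apply --dry-run=server -f "$1" -n vapora || return 1
  rc=0
  kubectl diff -f "$1" -n vapora || rc=$?
  # kubectl diff exits 1 when differences exist; anything above 1 is an error.
  [ "$rc" -le 1 ]
}
```

Only after reviewing the diff would you run the real `kubectl apply -f` for the same file.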

✅ Communication

Communicate early and often
Update every 2-5 minutes during incidents
Notify stakeholders proactively

✅ Documentation

Document everything you do
Update runbooks with learnings
Share knowledge with team

✅ Preparation

Plan deployments thoroughly
Test before going live
Have rollback plan ready

✅ Quick Response

Detect issues quickly
Diagnose systematically
Execute fixes decisively

❌ Avoid

Guessing without verifying
Skipping steps to save time
Assuming systems are working
Not communicating with team
Making multiple changes at once

Support & Questions

Questions about procedures? Ask senior engineer or operations team
Found runbook gap? Create issue/PR to update documentation
Unclear instructions? Clarify before executing critical operations
Ideas for improvement? Share in team meetings or documentation repo

Quick Start: Your First Deployment

Day 0: Preparation

Read: pre-deployment-checklist.md (30 min)
Read: deployment-runbook.md (30 min)
Read: rollback-runbook.md (20 min)
Schedule walkthrough with senior engineer (1 hour)

Day 1: Execute with Mentorship

Complete pre-deployment checklist with senior engineer
Execute deployment runbook with senior observing
Monitor for 2 hours with senior available
Debrief: what went well, what to improve

Day 2+: Independent Deployments

Complete checklist independently
Execute runbook
Document and communicate
Ask for help if anything is unclear

Generated: 2026-01-12
Status: Production-ready
Last Updated: 2026-01-12
