
VAPORA Operations Runbooks

Complete set of runbooks and procedures for deploying, monitoring, and operating VAPORA in production environments.


Quick Navigation

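I need to...

  • Prepare for a deployment: pre-deployment-checklist.md
  • Deploy to production: deployment-runbook.md
  • Roll back a deployment: rollback-runbook.md
  • Respond to an incident: incident-response-runbook.md
  • Start an on-call shift: on-call-procedures.md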


Runbook Overview

1. Pre-Deployment Checklist

When: 24 hours before any production deployment

Content: Comprehensive checklist for deployment preparation including:

  • Communication & scheduling
  • Code review & validation
  • Environment verification
  • Health baseline recording
  • Artifact preparation
  • Rollback plan verification

Time: 1-2 hours

File: pre-deployment-checklist.md

2. Deployment Runbook

When: Executing a production deployment

Content: Step-by-step deployment procedures including:

  • Pre-flight checks (5 min)
  • Configuration deployment (3 min)
  • Deployment update (5 min)
  • Verification (5 min)
  • Validation (3 min)
  • Communication & monitoring

Time: 15-20 minutes total

File: deployment-runbook.md

3. Rollback Runbook

When: Issues detected after deployment requiring immediate rollback

Content: Safe rollback procedures including:

  • When to roll back (decision criteria)
  • Kubernetes automatic rollback (step-by-step)
  • Docker manual rollback (guided)
  • Post-rollback verification
  • Emergency procedures
  • Prevention & lessons learned

Time: 5-10 minutes (depending on issues)

File: rollback-runbook.md

4. Incident Response Runbook

When: Production incident declared

Content: Full incident response procedures including:

  • Severity levels (1-4) with examples
  • Report & assess procedures
  • Diagnosis & escalation
  • Fix implementation
  • Recovery verification
  • Communication templates
  • Role definitions

Time: Varies by severity (2 min to 1+ hour)

File: incident-response-runbook.md

5. On-Call Procedures

When: During assigned on-call shift

Content: Full on-call guide including:

  • Before shift starts (setup & verification)
  • Daily tasks & check-ins
  • Responding to alerts
  • Monitoring dashboard setup
  • Escalation decision tree
  • Shift handoff procedures
  • Common questions & answers

Time: Read thoroughly before first on-call shift (~30 min)

File: on-call-procedures.md


Deployment Workflow

Standard Deployment Process

DAY 1 (Planning)
  ↓
- Create GitHub issue/ticket
- Identify deployment window
- Notify stakeholders

24 HOURS BEFORE
  ↓
- Complete pre-deployment checklist
  (pre-deployment-checklist.md)
- Verify all prerequisites
- Stage artifacts
- Test in staging

DEPLOYMENT DAY
  ↓
- Final go/no-go decision
- Execute deployment runbook
  (deployment-runbook.md)
  - Pre-flight checks
  - ConfigMap deployment
  - Service deployment
  - Verification
  - Communication

POST-DEPLOYMENT (2 hours)
  ↓
- Monitor closely (every 10 minutes)
- Watch for issues
- If problems → execute rollback runbook
  (rollback-runbook.md)
- Document results

24 HOURS LATER
  ↓
- Declare deployment stable
- Schedule post-mortem (if issues)
- Update documentation
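
The post-deployment step (checks every 10 minutes for 2 hours) can be scripted as a simple loop. The following is a minimal sketch, not part of the runbooks: `monitor` is a hypothetical helper name, and the check command you pass to it is up to you.

```shell
# Run a check command repeatedly and stop at the first failure.
# Usage: monitor <checks> <interval_seconds> <command> [args...]
monitor() {
  checks=$1
  interval=$2
  shift 2
  i=1
  while [ "$i" -le "$checks" ]; do
    if ! "$@"; then
      echo "check $i/$checks failed: consider rollback-runbook.md" >&2
      return 1
    fi
    # Sleep between checks, but not after the last one.
    if [ "$i" -lt "$checks" ]; then sleep "$interval"; fi
    i=$((i + 1))
  done
  echo "all $checks checks passed"
}
```

For the 2-hour window above, `monitor 12 600 curl -fsS http://localhost:8001/health` would run 12 checks, 600 seconds apart, against the health endpoint used elsewhere in this document.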

If Issues During Deployment

Issue Detected
  ↓
Severity Assessment
  ↓
Severity 1-2:
  ├─ Immediate rollback
  │   (rollback-runbook.md)
  │
  └─ Post-rollback investigation
      (incident-response-runbook.md)

Severity 3-4:
  ├─ Monitor and investigate
  │   (incident-response-runbook.md)
  │
  └─ Fix in place if quick
      OR
      Schedule rollback
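
The decision tree above can be encoded as a small lookup, which is handy in chat-ops scripts or alert handlers. This is only a sketch: `first_action` and the printed labels are illustrative and do not come from the runbooks.

```shell
# Map an incident severity (1-4) to the first action in the decision tree.
# Returns 2 for anything outside the documented severity range.
first_action() {
  case "$1" in
    1|2) echo "rollback: follow rollback-runbook.md, then investigate" ;;
    3|4) echo "investigate: follow incident-response-runbook.md" ;;
    *)   echo "unknown severity: $1" >&2; return 2 ;;
  esac
}
```

Calling `first_action 2` prints the rollback path; severities 3 and 4 print the investigation path, leaving the "fix in place or schedule rollback" choice to the engineer.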

Monitoring & Alerting

Essential Dashboards

Keep these visible during deployments and throughout every on-call shift:

  1. Kubernetes Dashboard

    • Pod status
    • Node health
    • Event logs
  2. Grafana Dashboards (if available)

    • Request rate and latency
    • Error rate
    • CPU/Memory usage
    • Pod restart counts
  3. Application Logs (Elasticsearch, CloudWatch, etc.)

    • Error messages
    • Stack traces
    • Performance logs

Alert Triggers & Responses

| Alert | Severity | Response |
|-------|----------|----------|
| Pod CrashLoopBackOff | 1 | Check logs, likely config issue |
| Error rate >10% | 1 | Check recent deployment, consider rollback |
| All pods pending | 1 | Node issue or resource exhausted |
| High memory usage >90% | 2 | Check for memory leak or scale up |
| High latency (2x normal) | 2 | Check database, external services |
| Single pod failed | 3 | Monitor, likely transient |

Health Check Commands

Quick commands to verify everything is working:

# Cluster health
kubectl cluster-info
kubectl get nodes        # All should be Ready

# Service health
kubectl get pods -n vapora
# All should be Running, 1/1 Ready

# Quick endpoints test
curl http://localhost:8001/health
curl http://localhost:3000

# Pod resources
kubectl top pods -n vapora

# Recent issues
kubectl get events -n vapora | grep Warning
kubectl logs deployment/vapora-backend -n vapora --tail=20
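
Scanning the pod list by eye for "Running, 1/1 Ready" gets error-prone with many replicas. A minimal sketch of a filter follows; `unhealthy_pods` is a hypothetical helper that assumes the default column layout of `kubectl get pods --no-headers` (NAME, READY, STATUS, ...).

```shell
# Print pods that are not Running or not fully ready.
# Reads `kubectl get pods --no-headers` output on stdin:
#   NAME  READY  STATUS  RESTARTS  AGE
unhealthy_pods() {
  awk '{
    split($2, ready, "/")   # READY looks like "1/1"
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $2, $3
  }'
}
```

Usage: `kubectl get pods -n vapora --no-headers | unhealthy_pods`. Empty output means every pod is Running and fully ready. Note that pods from finished Jobs (status Completed) would also be listed by this filter.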

Common Failure Scenarios

Pod CrashLoopBackOff

Symptoms: Pod keeps restarting repeatedly

Diagnosis:

kubectl logs <pod> -n vapora --previous  # See what crashed
kubectl describe pod <pod> -n vapora    # Check events

Solutions:

  1. If config error: Fix ConfigMap, restart pod
  2. If code error: Rollback deployment
  3. If resource issue: Increase limits or scale out

Runbook: Rollback Runbook or Incident Response

Pod Stuck in Pending

Symptoms: Pod won't start, stuck in "Pending" state

Diagnosis:

kubectl describe pod <pod> -n vapora  # Check "Events" section

Common causes:

  • Insufficient CPU/memory on nodes
  • Node disk full
  • Pod can't be scheduled
  • Persistent volume not available

Solutions:

  1. Scale down other workloads
  2. Add more nodes
  3. Fix persistent volume issues
  4. Check node disk space

Runbook: On-Call Procedures → "Common Questions"

Service Unresponsive (Connection Refused)

Symptoms: curl: (7) Failed to connect to localhost port 8001

Diagnosis:

kubectl get pods -n vapora      # Are pods even running?
kubectl get service vapora-backend -n vapora  # Does service exist?
kubectl get endpoints -n vapora # Do endpoints exist?

Common causes:

  • Pods not running (restart loops)
  • Service missing or misconfigured
  • Port incorrect
  • Network policy blocking traffic

Solutions:

  1. Verify pods running: kubectl get pods -n vapora
  2. Verify service exists: kubectl get svc vapora-backend -n vapora
  3. Check endpoints: kubectl get endpoints -n vapora
  4. Port-forward if the issue is with routing: kubectl port-forward svc/vapora-backend 8001:8001 -n vapora

Runbook: Incident Response
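
A Service can exist and still have no backing pods, which also produces "connection refused". The endpoint check from the diagnosis above can be turned into a yes/no test. This is a sketch under the assumption that the Service uses classic Endpoints objects; `has_endpoints` is a hypothetical helper name.

```shell
# Succeed only if the Service has at least one ready endpoint address.
# Usage: has_endpoints <service> <namespace>
# Returns 2 if kubectl itself fails, 1 if there are no ready addresses.
has_endpoints() {
  ips=$(kubectl get endpoints "$1" -n "$2" \
        -o jsonpath='{.subsets[*].addresses[*].ip}') || return 2
  [ -n "$ips" ]
}
```

Usage: `has_endpoints vapora-backend vapora || echo "no ready endpoints: check pod readiness and service selector"`.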

High Error Rate

Symptoms: Dashboard shows >5% 5xx errors

Diagnosis:

# Check which endpoint
kubectl logs deployment/vapora-backend -n vapora | grep -E "ERROR|500"

# Check recent deployment
git log -1 --oneline provisioning/

# Check dependencies
curl http://localhost:8001/health  # is it healthy?

Common causes:

  • Recent bad deployment
  • Database connectivity issue
  • Configuration error
  • Dependency service down

Solutions:

  1. If recent deployment: Consider rollback
  2. Check ConfigMap for typos
  3. Check database connectivity
  4. Check external service health

Runbook: Rollback Runbook or Incident Response
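
When no dashboard is at hand, the 5xx share can be estimated from a sample of response status codes. The sketch below assumes you can already extract one status code per line; how to do that depends on the backend's log format, which this document does not specify. `pct_5xx` is a hypothetical helper name.

```shell
# Print the percentage (rounded down) of 5xx responses among HTTP status
# codes read from stdin, one code per line. Prints 0 when there is no input.
pct_5xx() {
  awk '
    /^[0-9]+$/ { total++; if ($1 >= 500 && $1 <= 599) errors++ }
    END { if (total == 0) print 0; else printf "%d\n", errors * 100 / total }
  '
}
```

A result above 5 matches the symptom threshold for this scenario; above 10 it matches the Severity 1 alert trigger.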

Resource Exhaustion (CPU/Memory)

Symptoms: kubectl top pods shows a pod at or near its resource limits, or events report "limits exceeded"

Diagnosis:

kubectl top nodes              # Overall node usage
kubectl top pods -n vapora     # Per-pod usage
kubectl get pod <pod> -n vapora -o yaml | grep -A 10 limits  # Check limits

Solutions:

  1. Increase pod resource limits (requires redeployment)
  2. Scale out (add more replicas)
  3. Scale down other workloads
  4. Investigate memory leak if growing

Runbook: Deployment Runbook → Phase 4 (Verification)
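
To spot pods approaching their memory limit, the `kubectl top pods` output can be filtered against a threshold. This is a sketch: `pods_over_mem` is a hypothetical helper, and it assumes memory is reported in Mi (for example "340Mi"); rows in other units are skipped rather than converted.

```shell
# Print pods whose memory usage exceeds a threshold given in Mi.
# Reads `kubectl top pods --no-headers` output on stdin: NAME CPU MEMORY
pods_over_mem() {
  awk -v limit="$1" '
    $3 ~ /^[0-9]+Mi$/ {
      mem = $3
      sub(/Mi$/, "", mem)          # "900Mi" -> "900"
      if (mem + 0 > limit + 0) print $1, $3
    }
  '
}
```

Usage: `kubectl top pods -n vapora --no-headers | pods_over_mem 512`. Pick the threshold from the limits shown by the diagnosis commands above.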

Database Connection Errors

Symptoms: ERROR: could not connect to database

Diagnosis:

# Check database is running
kubectl get pods -n <database-namespace>

# Check credentials in ConfigMap
kubectl get configmap vapora-config -n vapora -o yaml | grep -iE "database|password"

# Test connectivity
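# Single quotes make $DATABASE_URL expand inside the pod, not in your local shell
kubectl exec <pod> -n vapora -- sh -c 'psql "$DATABASE_URL" -c "SELECT 1"'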

Solutions:

  1. If credentials wrong: Fix in ConfigMap, restart pods
  2. If database down: Escalate to DBA
  3. If network issue: Network team investigation
  4. If permissions: Update database user

Runbook: Incident Response → "Root Cause: Database Issues"


Communication Templates

Deployment Start

🚀 Deployment starting

Service: VAPORA
Version: v1.2.1
Mode: Enterprise
Expected duration: 10-15 minutes

Will update every 2 minutes. Questions? Ask in #deployments

Deployment Complete

✅ Deployment complete

Duration: 12 minutes
Status: All services healthy
Pods: All running

Health check results:
✓ Backend: responding
✓ Frontend: accessible
✓ API: normal latency
✓ No errors in logs

Next step: Monitor for 2 hours
Contact: @on-call-engineer

Incident Declared

🔴 INCIDENT DECLARED

Service: VAPORA Backend
Severity: 1 (Critical)
Time detected: HH:MM UTC
Current status: Investigating

Updates every 2 minutes
/cc @on-call-engineer @senior-engineer

Incident Resolved

✅ Incident resolved

Duration: 8 minutes
Root cause: [description]
Fix: [what was done]

All services healthy, monitoring for 1 hour
Post-mortem scheduled for [date]

Rollback Executed

🔙 Rollback executed

Issue detected in v1.2.1
Rolled back to v1.2.0

Status: Services recovering
Timeline: Issue 14:30 → Rollback 14:32 → Recovered 14:35

Investigating root cause

Escalation Matrix

When unsure who to contact:

| Issue Type | First Contact | Escalation | Emergency |
|------------|---------------|------------|-----------|
| Deployment issue | Deployment lead | Ops team | Ops manager |
| Pod/Container | On-call engineer | Senior engineer | Director of Eng |
| Database | DBA team | Ops manager | CTO |
| Infrastructure | Infra team | Ops manager | VP Ops |
| Security issue | Security team | CISO | CEO |
| Networking | Network team | Ops manager | CTO |

Tools & Commands Quick Reference

Essential kubectl Commands

# Get status
kubectl get pods -n vapora
kubectl get deployments -n vapora
kubectl get services -n vapora

# Logs
kubectl logs deployment/vapora-backend -n vapora
kubectl logs <pod> -n vapora --previous  # Previous crash
kubectl logs <pod> -n vapora -f          # Follow/tail

# Execute commands
kubectl exec -it <pod> -n vapora -- bash
kubectl exec <pod> -n vapora -- curl http://localhost:8001/health

# Describe (detailed info)
kubectl describe pod <pod> -n vapora
kubectl describe node <node>

# Port forward (local access)
kubectl port-forward svc/vapora-backend 8001:8001 -n vapora

# Restart pods
kubectl rollout restart deployment/vapora-backend -n vapora

# Rollback
kubectl rollout undo deployment/vapora-backend -n vapora

# Scale
kubectl scale deployment/vapora-backend --replicas=5 -n vapora
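
`kubectl rollout undo` returns as soon as the rollback is requested, not when the old pods are healthy again. A minimal sketch that pairs it with `kubectl rollout status` follows; `rollback_and_wait` and the 120s default timeout are assumptions, not values from the rollback runbook.

```shell
# Roll a deployment back one revision and wait for the rollout to settle.
# Usage: rollback_and_wait <deployment> <namespace> [timeout]
rollback_and_wait() {
  kubectl rollout undo "deployment/$1" -n "$2" || return 1
  # Blocks until the rollout completes or the timeout expires.
  kubectl rollout status "deployment/$1" -n "$2" --timeout="${3:-120s}"
}
```

Usage: `rollback_and_wait vapora-backend vapora`. A non-zero exit status means the rollback itself did not converge and should be escalated.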

Useful Aliases

alias k='kubectl'
alias kgp='kubectl get pods'
alias kgd='kubectl get deployments'
alias kgs='kubectl get services'
alias klogs='kubectl logs'
alias kexec='kubectl exec'
alias kdesc='kubectl describe'
alias ktop='kubectl top'
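
The aliases above do not pin a namespace, so `-n vapora` is easy to forget. One option is a small wrapper function; `kv` and `VAPORA_NS` are illustrative names, not something the runbooks define.

```shell
# Namespace-pinned kubectl wrapper.
# Defaults to the vapora namespace; override with VAPORA_NS=<namespace>.
kv() {
  kubectl -n "${VAPORA_NS:-vapora}" "$@"
}
```

With this in place, `kv get pods` is equivalent to `kubectl -n vapora get pods`, and `VAPORA_NS=staging kv get pods` targets a different namespace for a single command.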

Before Your First Deployment

  1. Read all runbooks: Thoroughly review all procedures
  2. Practice in staging: Do a test deployment to staging first
  3. Understand rollback: Know how to roll back before deploying
  4. Get trained: Have senior engineer walk through procedures
  5. Test tools: Verify kubectl and other tools work
  6. Verify access: Confirm you have cluster access
  7. Know contacts: Have escalation contacts readily available
  8. Review history: Look at past deployments to understand patterns

Continuous Improvement

After Each Deployment

  • Were all runbooks clear?
  • Any steps missing or unclear?
  • Any issues that could be prevented?
  • Update documentation with learnings

Monthly Review

  • Review all incidents from past month
  • Update procedures based on patterns
  • Refresh team on any changes
  • Update escalation contacts
  • Review and improve alerting

Key Principles

Safety First

  • Always dry-run before applying
  • Roll back quickly if issues are detected
  • Better to be conservative
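
"Always dry-run before applying" can be enforced rather than remembered. The sketch below assumes manifests are applied with `kubectl apply -f`; `safe_apply` is a hypothetical helper and the manifest path is whatever your deployment uses.

```shell
# Validate a manifest against the API server, then apply it only if the
# server-side dry run succeeds.
# Usage: safe_apply <manifest.yaml> <namespace>
safe_apply() {
  kubectl apply --dry-run=server -f "$1" -n "$2" || {
    echo "dry-run failed, nothing applied" >&2
    return 1
  }
  kubectl apply -f "$1" -n "$2"
}
```

A server-side dry run catches schema and admission errors without changing cluster state. Running `kubectl diff -f <manifest.yaml>` beforehand additionally shows what would change.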

Communication

  • Communicate early and often
  • Update every 2-5 minutes during incidents
  • Notify stakeholders proactively

Documentation

  • Document everything you do
  • Update runbooks with learnings
  • Share knowledge with team

Preparation

  • Plan deployments thoroughly
  • Test before going live
  • Have rollback plan ready

Quick Response

  • Detect issues quickly
  • Diagnose systematically
  • Execute fixes decisively

Avoid

  • Guessing without verifying
  • Skipping steps to save time
  • Assuming systems are working
  • Not communicating with team
  • Making multiple changes at once

Support & Questions

  • Questions about procedures? Ask senior engineer or operations team
  • Found runbook gap? Create issue/PR to update documentation
  • Unclear instructions? Clarify before executing critical operations
  • Ideas for improvement? Share in team meetings or documentation repo

Quick Start: Your First Deployment

Day 0: Preparation

  1. Read: pre-deployment-checklist.md (30 min)
  2. Read: deployment-runbook.md (30 min)
  3. Read: rollback-runbook.md (20 min)
  4. Schedule walkthrough with senior engineer (1 hour)

Day 1: Execute with Mentorship

  1. Complete pre-deployment checklist with senior engineer
  2. Execute deployment runbook with senior observing
  3. Monitor for 2 hours with senior available
  4. Debrief: what went well, what to improve

Day 2+: Independent Deployments

  1. Complete checklist independently
  2. Execute runbook
  3. Document and communicate
  4. Ask for help if anything unclear

Generated: 2026-01-12
Status: Production-ready
Last Updated: 2026-01-12