VAPORA Operations Runbooks

Complete set of runbooks and procedures for deploying, monitoring, and operating VAPORA in production environments.


Quick Navigation

I need to...


Runbook Overview

1. Pre-Deployment Checklist

When: 24 hours before any production deployment

Content: Comprehensive checklist for deployment preparation including:

  • Communication & scheduling
  • Code review & validation
  • Environment verification
  • Health baseline recording
  • Artifact preparation
  • Rollback plan verification

Time: 1-2 hours

File: pre-deployment-checklist.md

2. Deployment Runbook

When: Executing actual production deployment

Content: Step-by-step deployment procedures including:

  • Pre-flight checks (5 min)
  • Configuration deployment (3 min)
  • Deployment update (5 min)
  • Verification (5 min)
  • Validation (3 min)
  • Communication & monitoring

Time: 15-20 minutes total

File: deployment-runbook.md

3. Rollback Runbook

When: Issues detected after deployment requiring immediate rollback

Content: Safe rollback procedures including:

  • When to roll back (decision criteria)
  • Kubernetes automatic rollback (step-by-step)
  • Docker manual rollback (guided)
  • Post-rollback verification
  • Emergency procedures
  • Prevention & lessons learned

Time: 5-10 minutes (depending on issues)

File: rollback-runbook.md

4. Incident Response Runbook

When: Production incident declared

Content: Full incident response procedures including:

  • Severity levels (1-4) with examples
  • Report & assess procedures
  • Diagnosis & escalation
  • Fix implementation
  • Recovery verification
  • Communication templates
  • Role definitions

Time: Varies by severity (2 min to 1+ hour)

File: incident-response-runbook.md

5. On-Call Procedures

When: During assigned on-call shift

Content: Full on-call guide including:

  • Before shift starts (setup & verification)
  • Daily tasks & check-ins
  • Responding to alerts
  • Monitoring dashboard setup
  • Escalation decision tree
  • Shift handoff procedures
  • Common questions & answers

Time: Read thoroughly before first on-call shift (~30 min)

File: on-call-procedures.md


Deployment Workflow

Standard Deployment Process

DAY 1 (Planning)
  ↓
- Create GitHub issue/ticket
- Identify deployment window
- Notify stakeholders

24 HOURS BEFORE
  ↓
- Complete pre-deployment checklist
  (pre-deployment-checklist.md)
- Verify all prerequisites
- Stage artifacts
- Test in staging

DEPLOYMENT DAY
  ↓
- Final go/no-go decision
- Execute deployment runbook
  (deployment-runbook.md)
  - Pre-flight checks
  - ConfigMap deployment
  - Service deployment
  - Verification
  - Communication

POST-DEPLOYMENT (2 hours)
  ↓
- Monitor closely (every 10 minutes)
- Watch for issues
- If problems → execute rollback runbook
  (rollback-runbook.md)
- Document results

24 HOURS LATER
  ↓
- Declare deployment stable
- Schedule post-mortem (if issues)
- Update documentation

If Issues During Deployment

Issue Detected
  ↓
Severity Assessment
  ↓
Severity 1-2:
  ├─ Immediate rollback
  │   (rollback-runbook.md)
  │
  └─ Post-rollback investigation
      (incident-response-runbook.md)

Severity 3-4:
  ├─ Monitor and investigate
  │   (incident-response-runbook.md)
  │
  └─ Fix in place if quick
      OR
      Schedule rollback
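
The decision tree above can also be encoded as a small lookup, which may be handy in chat-ops scripts or on-call tooling. The helper name `next_action` is hypothetical; the runbook file names are the ones listed in this guide.

```shell
# Hypothetical helper: map a severity level (1-4) to the first step in the tree above.
next_action() {
  case "$1" in
    1|2) echo "rollback now (rollback-runbook.md), then investigate (incident-response-runbook.md)" ;;
    3|4) echo "monitor and investigate (incident-response-runbook.md); fix in place or schedule rollback" ;;
    *)   echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```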

Monitoring & Alerting

Essential Dashboards

Keep these dashboards visible during deployments and throughout every on-call shift:

  1. Kubernetes Dashboard

    • Pod status
    • Node health
    • Event logs
  2. Grafana Dashboards (if available)

    • Request rate and latency
    • Error rate
    • CPU/Memory usage
    • Pod restart counts
  3. Application Logs (Elasticsearch, CloudWatch, etc.)

    • Error messages
    • Stack traces
    • Performance logs

Alert Triggers & Responses

Alert                      Severity  Response
Pod CrashLoopBackOff       1         Check logs, likely config issue
Error rate >10%            1         Check recent deployment, consider rollback
All pods pending           1         Node issue or resource exhausted
High memory usage >90%     2         Check for memory leak or scale up
High latency (2x normal)   2         Check database, external services
Single pod failed          3         Monitor, likely transient
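
For the numeric rows of the table above, the thresholds can be expressed as a simple classifier. This is only a sketch: the function name, the metric keywords, and the integer-percent inputs are assumptions made for illustration, and real alerting would normally live in the monitoring system itself.

```shell
# Hypothetical helper: return the severity from the table above for a numeric metric.
# Usage: severity_for_metric <error_rate|memory|latency> <integer percent>
#   error_rate: percent of failing requests
#   memory:     percent of the memory limit in use
#   latency:    current latency as a percent of normal (200 = 2x normal)
severity_for_metric() {
  local metric="$1" value="$2"
  case "$metric" in
    error_rate) if [ "$value" -gt 10 ]; then echo 1; return 0; fi ;;
    memory)     if [ "$value" -gt 90 ]; then echo 2; return 0; fi ;;
    latency)    if [ "$value" -ge 200 ]; then echo 2; return 0; fi ;;
    *)          echo "unknown metric: $metric" >&2; return 1 ;;
  esac
  # Below every threshold in the table.
  echo "ok"
}
```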

Health Check Commands

Quick commands to verify everything is working:

# Cluster health
kubectl cluster-info
kubectl get nodes        # All should be Ready

# Service health
kubectl get pods -n vapora
# All should be Running, 1/1 Ready

# Quick endpoint test (assumes the services are reachable on localhost, e.g. via kubectl port-forward)
curl http://localhost:8001/health
curl http://localhost:3000

# Pod resources
kubectl top pods -n vapora

# Recent issues
kubectl get events -n vapora | grep Warning
kubectl logs deployment/vapora-backend -n vapora --tail=20
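
To turn the "all should be Running, 1/1 Ready" check into something a script can act on, the pod listing can be filtered. The sketch below assumes the default column layout of `kubectl get pods --no-headers` (NAME, READY, STATUS, RESTARTS, AGE); the function name `not_ready_pods` is hypothetical.

```shell
# Hypothetical helper: print the names of pods that are not fully ready.
# Reads `kubectl get pods --no-headers` output on stdin.
not_ready_pods() {
  # $2 is READY (e.g. 1/1), $3 is STATUS (e.g. Running).
  awk '{ split($2, r, "/"); if ($3 != "Running" || r[1] != r[2]) print $1 }'
}

# Usage:
#   kubectl get pods -n vapora --no-headers | not_ready_pods
```

Pods from finished Jobs (status Completed) would also be listed, so treat the output as a prompt to look closer rather than as a failure signal.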

Common Failure Scenarios

Pod CrashLoopBackOff

Symptoms: Pod keeps restarting repeatedly

Diagnosis:

kubectl logs <pod> -n vapora --previous  # See what crashed
kubectl describe pod <pod> -n vapora    # Check events

Solutions:

  1. If config error: Fix ConfigMap, restart pod
  2. If code error: Roll back the deployment
  3. If resource issue: Increase limits or scale out

Runbook: Rollback Runbook or Incident Response
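
For solution 1, a restart after fixing the ConfigMap might look like the sketch below. It chains the `rollout restart` command from the quick reference with `rollout status` so the shell blocks until the new pods are up. The function name and the 120-second default timeout are assumptions.

```shell
# Hypothetical helper: restart the backend after a ConfigMap fix and wait
# until the rollout finishes or the timeout expires.
restart_backend() {
  local timeout="${1:-120s}"
  kubectl rollout restart deployment/vapora-backend -n vapora &&
    kubectl rollout status deployment/vapora-backend -n vapora --timeout="$timeout"
}
```

If `rollout status` times out, the pods are probably still crashing; go back to the diagnosis commands before retrying.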

Pod Stuck in Pending

Symptoms: Pod won't start, stuck in "Pending" state

Diagnosis:

kubectl describe pod <pod> -n vapora  # Check "Events" section

Common causes:

  • Insufficient CPU/memory on nodes
  • Node disk full
  • Pod can't be scheduled
  • Persistent volume not available

Solutions:

  1. Scale down other workloads
  2. Add more nodes
  3. Fix persistent volume issues
  4. Check node disk space

Runbook: On-Call Procedures → "Common Questions"
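
When several pods are Pending at once, describing them one by one gets slow. The sketch below lists Pending pods with a field selector and prints the events recorded for each; `pending_report` is a hypothetical name, and the namespace is the one used throughout this guide.

```shell
# Hypothetical helper: list Pending pods in the vapora namespace and print
# the events recorded for each one, oldest first.
pending_report() {
  local pod
  for pod in $(kubectl get pods -n vapora \
      --field-selector=status.phase=Pending \
      -o jsonpath='{.items[*].metadata.name}'); do
    echo "== $pod =="
    kubectl get events -n vapora \
      --field-selector "involvedObject.name=$pod" \
      --sort-by=.lastTimestamp
  done
}
```

Scheduling failures usually show up in these events with a reason such as FailedScheduling and a message naming the missing resource.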

Service Unresponsive (Connection Refused)

Symptoms: curl: (7) Failed to connect to localhost port 8001

Diagnosis:

kubectl get pods -n vapora      # Are pods even running?
kubectl get service vapora-backend -n vapora  # Does service exist?
kubectl get endpoints -n vapora # Do endpoints exist?

Common causes:

  • Pods not running (restart loops)
  • Service missing or misconfigured
  • Port incorrect
  • Network policy blocking traffic

Solutions:

  1. Verify pods are running: kubectl get pods -n vapora
  2. Verify the service exists: kubectl get svc -n vapora
  3. Check endpoints: kubectl get endpoints -n vapora
  4. Port-forward to bypass routing while you investigate: kubectl port-forward svc/vapora-backend 8001:8001 -n vapora

Runbook: Incident Response
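
A Service with no ready endpoints refuses connections even when the Service object itself exists, so the endpoint check is worth automating. The following is a sketch under the assumption that the Service is named `vapora-backend`, as in the commands above; the function name is hypothetical.

```shell
# Hypothetical helper: succeed only if the Service has at least one ready endpoint.
has_endpoints() {
  local svc="${1:-vapora-backend}" ips
  ips=$(kubectl get endpoints "$svc" -n vapora \
    -o jsonpath='{.subsets[*].addresses[*].ip}') || return 1
  if [ -n "$ips" ]; then
    echo "$svc endpoints: $ips"
  else
    echo "$svc has no ready endpoints; check pod readiness and the Service selector" >&2
    return 1
  fi
}
```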

High Error Rate

Symptoms: Dashboard shows >5% 5xx errors

Diagnosis:

# Check which endpoint
kubectl logs deployment/vapora-backend -n vapora | grep "ERROR\|500"

# Check recent deployment
git log -1 --oneline provisioning/

# Check dependencies
curl http://localhost:8001/health  # is it healthy?

Common causes:

  • Recent bad deployment
  • Database connectivity issue
  • Configuration error
  • Dependency service down

Solutions:

  1. If recent deployment: Consider rollback
  2. Check ConfigMap for typos
  3. Check database connectivity
  4. Check external service health

Runbook: Rollback Runbook or Incident Response
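
If the dashboard is unavailable, the 5% threshold from the symptom above can be checked by hand from two counts (failed requests and total requests over the same window). The helper below is a sketch using integer arithmetic only; its name and argument order are assumptions, and where the counts come from depends on your logging setup.

```shell
# Hypothetical helper: succeed if errors/total exceeds the threshold percent
# (default 5, matching the symptom above).
# Usage: error_rate_exceeds <errors> <total> [threshold_percent]
error_rate_exceeds() {
  local errors="$1" total="$2" threshold="${3:-5}"
  # No traffic means no measurable error rate.
  if [ "$total" -le 0 ]; then return 1; fi
  # errors/total > threshold/100, rearranged to avoid division.
  [ $((errors * 100)) -gt $((threshold * total)) ]
}

# Usage:
#   error_rate_exceeds 60 1000 && echo "above 5%: consider rollback"
```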

Resource Exhaustion (CPU/Memory)

Symptoms: kubectl top pods shows pod at 100% usage or "limits exceeded"

Diagnosis:

kubectl top nodes              # Overall node usage
kubectl top pods -n vapora     # Per-pod usage
kubectl get pod <pod> -n vapora -o yaml | grep limits -A 10  # Check limits

Solutions:

  1. Increase pod resource limits (requires redeployment)
  2. Scale out (add more replicas)
  3. Scale down other workloads
  4. Investigate memory leak if growing

Runbook: Deployment Runbook → Phase 4 (Verification)
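
To spot the heaviest pods quickly, the output of `kubectl top pods` can be filtered by memory. The sketch below assumes the three-column layout NAME, CPU, MEMORY and that memory is reported in Mi; rows in other units are skipped rather than guessed at. The function name is hypothetical.

```shell
# Hypothetical helper: print pods whose memory usage is at or above a limit in Mi.
# Reads `kubectl top pods --no-headers` output on stdin.
# Usage: pods_over_memory <limit_mi>
pods_over_memory() {
  # "$3 + 0" converts a value such as 950Mi to the number 950.
  awk -v limit="$1" '$3 ~ /^[0-9]+Mi$/ && ($3 + 0) >= limit { print $1, $3 }'
}

# Usage:
#   kubectl top pods -n vapora --no-headers | pods_over_memory 900
```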

Database Connection Errors

Symptoms: ERROR: could not connect to database

Diagnosis:

# Check database is running
kubectl get pods -n <database-namespace>

# Check credentials in ConfigMap
kubectl get configmap vapora-config -n vapora -o yaml | grep -i "database\|password"

# Test connectivity (sh -c expands $DATABASE_URL inside the pod, not in your local shell)
kubectl exec <pod> -n vapora -- sh -c 'psql "$DATABASE_URL" -c "SELECT 1"'

Solutions:

  1. If credentials wrong: Fix in ConfigMap, restart pods
  2. If database down: Escalate to DBA
  3. If network issue: Network team investigation
  4. If permissions: Update database user

Runbook: Incident Response → "Root Cause: Database Issues"


Communication Templates

Deployment Start

🚀 Deployment starting

Service: VAPORA
Version: v1.2.1
Mode: Enterprise
Expected duration: 10-15 minutes

Will update every 2 minutes. Questions? Ask in #deployments

Deployment Complete

✅ Deployment complete

Duration: 12 minutes
Status: All services healthy
Pods: All running

Health check results:
✓ Backend: responding
✓ Frontend: accessible
✓ API: normal latency
✓ No errors in logs

Next step: Monitor for 2 hours
Contact: @on-call-engineer

Incident Declared

🔴 INCIDENT DECLARED

Service: VAPORA Backend
Severity: 1 (Critical)
Time detected: HH:MM UTC
Current status: Investigating

Updates every 2 minutes
/cc @on-call-engineer @senior-engineer

Incident Resolved

✅ Incident resolved

Duration: 8 minutes
Root cause: [description]
Fix: [what was done]

All services healthy, monitoring for 1 hour
Post-mortem scheduled for [date]

Rollback Executed

🔙 Rollback executed

Issue detected in v1.2.1
Rolled back to v1.2.0

Status: Services recovering
Timeline: Issue 14:30 → Rollback 14:32 → Recovered 14:35

Investigating root cause

Escalation Matrix

When unsure who to contact:

Issue Type        First Contact     Escalation       Emergency
Deployment issue  Deployment lead   Ops team         Ops manager
Pod/Container     On-call engineer  Senior engineer  Director of Eng
Database          DBA team          Ops manager      CTO
Infrastructure    Infra team        Ops manager      VP Ops
Security issue    Security team     CISO             CEO
Networking        Network team      Ops manager      CTO

Tools & Commands Quick Reference

Essential kubectl Commands

# Get status
kubectl get pods -n vapora
kubectl get deployments -n vapora
kubectl get services -n vapora

# Logs
kubectl logs deployment/vapora-backend -n vapora
kubectl logs <pod> -n vapora --previous  # Previous crash
kubectl logs <pod> -n vapora -f          # Follow/tail

# Execute commands
kubectl exec -it <pod> -n vapora -- bash
kubectl exec <pod> -n vapora -- curl http://localhost:8001/health

# Describe (detailed info)
kubectl describe pod <pod> -n vapora
kubectl describe node <node>

# Port forward (local access)
kubectl port-forward svc/vapora-backend 8001:8001 -n vapora

# Restart pods
kubectl rollout restart deployment/vapora-backend -n vapora

# Rollback
kubectl rollout undo deployment/vapora-backend -n vapora

# Scale
kubectl scale deployment/vapora-backend --replicas=5 -n vapora

Useful Aliases

alias k='kubectl'
alias kgp='kubectl get pods'
alias kgd='kubectl get deployments'
alias kgs='kubectl get services'
alias klogs='kubectl logs'
alias kexec='kubectl exec'
alias kdesc='kubectl describe'
alias ktop='kubectl top'
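
The aliases above do not set a namespace, so every call still needs `-n vapora`. One possible complement is a small wrapper function; the name `kv` is an arbitrary choice for illustration.

```shell
# Hypothetical wrapper: run any kubectl command against the vapora namespace.
kv() {
  kubectl -n vapora "$@"
}

# Usage:
#   kv get pods
#   kv logs deployment/vapora-backend --tail=20
```

A function is used instead of an alias so that it also works inside scripts, where aliases are not expanded by default.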

Before Your First Deployment

  1. Read all runbooks: Thoroughly review all procedures
  2. Practice in staging: Do a test deployment to staging first
  3. Understand rollback: Know how to roll back before deploying
  4. Get trained: Have senior engineer walk through procedures
  5. Test tools: Verify kubectl and other tools work
  6. Verify access: Confirm you have cluster access
  7. Know contacts: Have escalation contacts readily available
  8. Review history: Look at past deployments to understand patterns

Continuous Improvement

After Each Deployment

  • Were all runbooks clear?
  • Any steps missing or unclear?
  • Any issues that could be prevented?
  • Update documentation with learnings

Monthly Review

  • Review all incidents from past month
  • Update procedures based on patterns
  • Refresh team on any changes
  • Update escalation contacts
  • Review and improve alerting

Key Principles

Safety First

  • Always dry-run before applying
  • Roll back quickly if issues are detected
  • Better to be conservative
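
As one way to follow the dry-run principle, a manifest can be validated by the API server and diffed against the live state before anything is applied. This is a sketch: the function name `preview_apply` and the manifest path in the usage line are placeholders, not files from the VAPORA repository.

```shell
# Hypothetical helper: validate a manifest and show what would change,
# without applying anything.
preview_apply() {
  local manifest="$1" rc=0
  # Server-side dry run: the API server validates the objects but persists nothing.
  kubectl apply --dry-run=server -f "$manifest" || return 1
  # kubectl diff exits 0 when nothing would change, 1 when there are
  # differences, and with a higher code on error.
  kubectl diff -f "$manifest" || rc=$?
  [ "$rc" -le 1 ]
}

# Usage (path is a placeholder):
#   preview_apply path/to/manifest.yaml && kubectl apply -f path/to/manifest.yaml
```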

Communication

  • Communicate early and often
  • Update every 2-5 minutes during incidents
  • Notify stakeholders proactively

Documentation

  • Document everything you do
  • Update runbooks with learnings
  • Share knowledge with team

Preparation

  • Plan deployments thoroughly
  • Test before going live
  • Have rollback plan ready

Quick Response

  • Detect issues quickly
  • Diagnose systematically
  • Execute fixes decisively

Avoid

  • Guessing without verifying
  • Skipping steps to save time
  • Assuming systems are working
  • Not communicating with team
  • Making multiple changes at once

Support & Questions

  • Questions about procedures? Ask senior engineer or operations team
  • Found runbook gap? Create issue/PR to update documentation
  • Unclear instructions? Clarify before executing critical operations
  • Ideas for improvement? Share in team meetings or documentation repo

Quick Start: Your First Deployment

Day 0: Preparation

  1. Read: pre-deployment-checklist.md (30 min)
  2. Read: deployment-runbook.md (30 min)
  3. Read: rollback-runbook.md (20 min)
  4. Schedule walkthrough with senior engineer (1 hour)

Day 1: Execute with Mentorship

  1. Complete pre-deployment checklist with senior engineer
  2. Execute deployment runbook with senior observing
  3. Monitor for 2 hours with senior available
  4. Debrief: what went well, what to improve

Day 2+: Independent Deployments

  1. Complete checklist independently
  2. Execute runbook
  3. Document and communicate
  4. Ask for help if anything unclear

Generated: 2026-01-12
Status: Production-ready
Last Updated: 2026-01-12