
Monitoring & Health Check Operations

Guide for continuous monitoring and health checks of VAPORA in production.


Overview

Responsibility: Maintain visibility into VAPORA service health through monitoring, logging, and alerting

Key Activities:

  • Regular health checks (automated and manual)
  • Alert response and investigation
  • Trend analysis and capacity planning
  • Incident prevention through early detection

Success Metric: Detect and respond to issues before users are significantly impacted


Automated Health Checks

Kubernetes Health Check Pipeline

If using CI/CD, leverage automatic health monitoring:

GitHub Actions:

# Runs every 15 minutes (quick check)
# Runs every 6 hours (comprehensive diagnostics)
# See: .github/workflows/health-check.yml

Woodpecker:

# Runs every 15 minutes (quick check)
# Runs every 6 hours (comprehensive diagnostics)
# See: .woodpecker/health-check.yml

Artifacts Generated:

  • docker-health.log - Docker container status
  • k8s-health.log - Kubernetes deployments status
  • k8s-diagnostics.log - Full system diagnostics
  • docker-diagnostics.log - Docker system info
  • HEALTH_REPORT.md - Summary report
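
The exact steps live in the workflow files referenced above. As a rough illustration, a pipeline step along the following lines could produce those artifacts; this is a minimal sketch, assuming kubectl, docker, and metrics-server access are available on the runner and that the vapora namespace is used as elsewhere in this guide:

# Hypothetical pipeline step: collect health artifacts (sketch; adjust to your workflow)
NAMESPACE=vapora

docker ps --all --format 'table {{.Names}}\t{{.Status}}' > docker-health.log
docker system info > docker-diagnostics.log 2>&1

kubectl get deployments,pods -n "$NAMESPACE" -o wide > k8s-health.log
{
  kubectl get events -n "$NAMESPACE" --sort-by='.lastTimestamp'
  kubectl top pods -n "$NAMESPACE"
  kubectl describe pods -n "$NAMESPACE"
} > k8s-diagnostics.log 2>&1

{
  echo "# Health Report ($(date -u))"
  echo ""
  echo "Pods not Running/Completed:"
  kubectl get pods -n "$NAMESPACE" --no-headers | grep -vE 'Running|Completed' || echo "none"
} > HEALTH_REPORT.md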

Quick Manual Health Check

# Run this command to get instant health status
export NAMESPACE=vapora

echo "=== Pod Status ==="
kubectl get pods -n $NAMESPACE
echo ""

echo "=== Service Health ==="
kubectl get endpoints -n $NAMESPACE
echo ""

echo "=== Recent Events ==="
kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' | tail -10
echo ""

echo "=== Resource Usage ==="
kubectl top pods -n $NAMESPACE
echo ""

echo "=== API Health ==="
curl -s http://localhost:8001/health | jq .

Manual Daily Monitoring

Morning Check (Start of Business Day)

# Run at start of business day (or when starting shift)

echo "=== MORNING HEALTH CHECK ==="
echo "Date: $(date -u)"

# 1. Cluster Status
echo "Cluster Status:"
kubectl cluster-info | grep server

# 2. Node Status
echo ""
echo "Node Status:"
kubectl get nodes
# Should show: All nodes Ready

# 3. Pod Status
echo ""
echo "Pod Status:"
kubectl get pods -n vapora
# Should show: All Running, 1/1 Ready

# 4. Service Endpoints
echo ""
echo "Service Endpoints:"
kubectl get endpoints -n vapora
# Should show: All services have endpoints (not empty)

# 5. Resource Usage
echo ""
echo "Resource Usage:"
kubectl top nodes
kubectl top pods -n vapora | head -10

# 6. Recent Errors
echo ""
echo "Recent Errors (last 1 hour):"
kubectl logs deployment/vapora-backend -n vapora --since=1h | grep -i error | wc -l
# Should show: 0 or very few errors

# 7. Overall Status
echo ""
echo "Overall Status: ✅ Healthy"
# If any issues found: Document and investigate

Mid-Day Check (Every 4-6 hours)

# Quick sanity check during business hours

# 1. Service Responsiveness
curl -s http://localhost:8001/health | jq '.status'
# Should return: "healthy"

# 2. Pod Restart Tracking
kubectl get pods -n vapora -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}'
# Restart count should not be increasing rapidly

# 3. Error Log Check
kubectl logs deployment/vapora-backend -n vapora --since=4h --timestamps | grep ERROR | tail -5
# Should show: Few to no errors

# 4. Performance Check
kubectl top pods -n vapora | tail -5
# CPU/Memory should be in normal range

End-of-Day Check (Before Shift End)

# Summary check before handing off to on-call

echo "=== END OF DAY SUMMARY ==="

# Current status
kubectl get pods -n vapora
kubectl top pods -n vapora

# Any concerning trends?
echo ""
echo "Checking for concerning events..."
kubectl get events -n vapora --sort-by='.lastTimestamp' | grep -i warning

# Any pod restarts?
echo ""
echo "Pod restart status:"
kubectl get pods -n vapora -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.containerStatuses[0].restartCount}{"\n"}{end}' | grep -v ": 0"

# Document for next shift
echo ""
echo "Status for on-call: All normal / Issues detected"

Dashboard Setup & Monitoring

Essential Dashboards to Monitor

If you have Grafana/Prometheus, create these dashboards:

1. Service Health Dashboard

Monitor:

  • Pod running count (should be stable at expected count)
  • Pod restart count (should not increase rapidly)
  • Service endpoint availability (should be >99%)
  • API response time (p99, track trends)

Alert if:

  • Pod count drops below expected
  • Restart count increasing
  • Endpoints empty
  • Response time >2s
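
When the dashboard is unavailable, these conditions can be spot-checked from a shell. A minimal sketch, assuming the vapora-backend deployment and the port-forwarded /health endpoint used elsewhere in this guide; the expected replica count is an assumption to adjust:

# Spot-check service health alert conditions from the CLI (sketch)
NAMESPACE=vapora
EXPECTED_REPLICAS=3   # assumption; set to your configured replica count

READY=$(kubectl get deployment vapora-backend -n "$NAMESPACE" -o jsonpath='{.status.readyReplicas}')
echo "Ready replicas: ${READY:-0} (expected: $EXPECTED_REPLICAS)"

echo "Endpoint addresses:"
kubectl get endpoints vapora-backend -n "$NAMESPACE" -o jsonpath='{.subsets[*].addresses[*].ip}'; echo ""

# Response time (>2s should alert)
RESPONSE_TIME=$(curl -s -o /dev/null -w '%{time_total}' http://localhost:8001/health)
echo "Health endpoint response time: ${RESPONSE_TIME}s"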

2. Resource Utilization Dashboard

Monitor:

  • CPU usage per pod
  • Memory usage per pod
  • Node capacity (CPU, memory, disk)
  • Network I/O

Alert if:

  • Any pod >80% CPU/Memory
  • Any node >85% capacity
  • Memory trending upward consistently

3. Error Rate Dashboard

Monitor:

  • 4xx error rate (should be low)
  • 5xx error rate (should be minimal)
  • Error rate by endpoint
  • Error rate by service

Alert if:

  • 5xx error rate >5%
  • 4xx error rate >10%
  • Sudden spike in errors

4. Application Metrics Dashboard

Monitor:

  • Request rate (RPS)
  • Request latency (p50, p95, p99)
  • Active connections
  • Database query time

Alert if:

  • Request rate suddenly drops (might indicate an outage)
  • Latency spikes above baseline
  • Database queries slow
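
Without a metrics stack, a rough latency percentile check can be improvised with curl against the port-forwarded /health endpoint. A sketch; the sample count is arbitrary, and this only measures the health route, not real request traffic:

# Rough p50/p95 latency sample against /health (sketch, not a substitute for request metrics)
SAMPLES=20
for i in $(seq 1 $SAMPLES); do
  curl -s -o /dev/null -w '%{time_total}\n' http://localhost:8001/health
done | sort -n | awk '
  { t[NR] = $1 }
  END { printf "p50: %ss  p95: %ss  max: %ss\n", t[int(NR*0.50)], t[int(NR*0.95)], t[NR] }'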

Grafana Setup Example

# If setting up Grafana monitoring:
1. Deploy Prometheus to scrape Kubernetes metrics
2. Create dashboards with the panels above
3. Set alert rules:
   - CPU >80%: Warning
   - Memory >85%: Warning
   - Error rate >5%: Critical
   - Pod crashed: Critical
   - Response time >2s: Warning
4. Configure notifications to Slack/email

Alert Response Procedures

When Alert Fires

Alert Received
    ↓
Step 1: Verify it's real (not a false alarm)
  - Check dashboard
  - Check manually (curl endpoints, kubectl get pods)
  - Ask in #deployments if unsure

Step 2: Assess severity
  - Service completely down? Severity 1
  - Service partially degraded? Severity 2
  - Warning/trending issue? Severity 3

Step 3: Declare incident (if Severity 1-2)
  - Create #incident channel
  - Follow Incident Response Runbook
  - See: incident-response-runbook.md

Step 4: Investigate (if Severity 3)
  - Document in ticket
  - Schedule investigation
  - Monitor for escalation

Common Alerts & Actions

| Alert | Cause | Response |
| --- | --- | --- |
| Pod CrashLoopBackOff | App crashing | Get logs, fix, restart |
| High CPU (>80%) | Resource exhausted | Scale up or reduce load |
| High Memory (>85%) | Memory leak or surge | Investigate or restart |
| Error rate spike | App issue | Check logs, consider rollback |
| Response time spike | Slow queries/I/O | Check database, consider restart |
| Pod Pending | Can't schedule | Check node resources |
| Endpoints empty | Service down | Verify the service exists |
| Disk full | Storage exhausted | Clean up or expand |
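
A quick triage pass for the two most common rows above (crash-looping and unschedulable pods) can be scripted. A minimal sketch, assuming the vapora namespace:

# Find pods matching the most common alert causes (sketch)
NAMESPACE=vapora

echo "Pods not Running/Completed:"
kubectl get pods -n "$NAMESPACE" --no-headers | grep -vE 'Running|Completed' || echo "none"

echo ""
echo "Pending pods and why they cannot be scheduled:"
for pod in $(kubectl get pods -n "$NAMESPACE" --field-selector=status.phase=Pending -o name); do
  echo "$pod"
  kubectl describe "$pod" -n "$NAMESPACE" | grep -A3 'Events:'
done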

Establishing Baselines

Record these metrics during normal operation:

# CPU per pod (typical)
Backend:    200-400m per pod
Agents:     300-500m per pod
LLM Router: 100-200m per pod

# Memory per pod (typical)
Backend:    256-512Mi per pod
Agents:     128-256Mi per pod
LLM Router: 64-128Mi per pod

# Response time (typical)
Backend:    p50: 50ms, p95: 200ms, p99: 500ms
Frontend:   Load time: 2-3 seconds

# Error rate (typical)
Backend:    4xx: <1%, 5xx: <0.1%
Frontend:   <5% user-visible errors

# Pod restart count
Should remain 0 (no restarts expected in normal operation)
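
It helps to capture these numbers from the live cluster rather than from memory. A minimal sketch that snapshots current usage into a dated file for later comparison; the baselines/ directory is an assumption, and kubectl top requires metrics-server:

# Snapshot current usage as a baseline (sketch)
NAMESPACE=vapora
OUTDIR=baselines
mkdir -p "$OUTDIR"
OUTFILE="$OUTDIR/$(date -u +%Y-%m-%d).txt"

{
  echo "# Baseline captured $(date -u)"
  kubectl top pods -n "$NAMESPACE"
  echo ""
  kubectl get pods -n "$NAMESPACE" -o jsonpath='{range .items[*]}{.metadata.name}{": restarts="}{.status.containerStatuses[0].restartCount}{"\n"}{end}'
} > "$OUTFILE"

echo "Baseline written to $OUTFILE"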

Detecting Anomalies

Compare current metrics to baseline:

# If CPU is 2x normal:
- Check whether load increased
- Check for a resource leak
- Monitor for further increases

# If memory is steadily increasing:
- Might indicate a memory leak
- Monitor over time (1-2 hours)
- Restart if clearly trending up

# If error rate is 10x normal:
- Something broke recently
- Check the most recent deployment
- Consider a rollback

# If a new process is consuming resources:
- Identify the new resource consumer
- Investigate its purpose
- Kill it if unintended
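
The "2x normal" CPU comparison is easy to automate once a baseline exists. A sketch that flags pods whose current CPU exceeds a fixed millicore cutoff; the 800m threshold is an assumption (roughly 2x the documented backend range) and should be tuned per service:

# Flag pods using more CPU than an assumed 2x-baseline threshold (sketch)
NAMESPACE=vapora
CPU_THRESHOLD_M=800   # assumption: ~2x the 200-400m backend baseline

kubectl top pods -n "$NAMESPACE" --no-headers | awk -v limit="$CPU_THRESHOLD_M" '
  { cpu = $2; gsub(/m/, "", cpu) }
  cpu + 0 > limit { printf "ANOMALY: %s using %s CPU (threshold %sm)\n", $1, $2, limit }'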

Capacity Planning

When to Scale

Monitor trends and plan ahead:

# Trigger capacity planning if:
- Average CPU >60%
- Average Memory >60%
- Peak usage trending upward
- Disk usage >80%

# Questions to ask:
- Is traffic increasing? Seasonal spike?
- Did we add features? New workload?
- Do we have capacity for growth?
- Should we scale now or wait?
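
Average utilization across the namespace can be aggregated quickly to see how close the 60% triggers above are. A sketch, assuming metrics-server is installed and kubectl top reports CPU in millicores and memory in Mi:

# Average CPU/memory across all pods in the namespace (sketch)
NAMESPACE=vapora

kubectl top pods -n "$NAMESPACE" --no-headers | awk '
  { gsub(/m/, "", $2); gsub(/Mi/, "", $3); cpu += $2; mem += $3; pods++ }
  END {
    if (pods > 0)
      printf "pods=%d  avg_cpu=%dm  avg_mem=%dMi  total_cpu=%dm  total_mem=%dMi\n",
             pods, cpu/pods, mem/pods, cpu, mem
  }'
# Compare the averages against pod requests/limits to estimate percent utilization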

Scaling Actions

# Quick scale (temporary):
kubectl scale deployment/vapora-backend --replicas=5 -n vapora

# Permanent scale (update deployment.yaml):
# Edit: replicas: 5
# Apply: kubectl apply -f deployment.yaml

# Add nodes (infrastructure):
# Contact infrastructure team

# Reduce resource consumption:
# Investigate slow queries, memory leaks, etc.

Log Analysis & Troubleshooting

Checking Logs

# Most recent logs
kubectl logs deployment/vapora-backend -n vapora

# Last N lines
kubectl logs deployment/vapora-backend -n vapora --tail=100

# From specific time
kubectl logs deployment/vapora-backend -n vapora --since=1h

# Follow/tail logs
kubectl logs deployment/vapora-backend -n vapora -f

# From specific pod
kubectl logs pod-name -n vapora

# Previous pod (if crashed)
kubectl logs pod-name -n vapora --previous

Log Patterns to Watch For

# Error patterns
kubectl logs deployment/vapora-backend -n vapora | grep -i "error\|exception\|fatal"

# Database issues
kubectl logs deployment/vapora-backend -n vapora | grep -i "database\|connection\|sql"

# Authentication issues
kubectl logs deployment/vapora-backend -n vapora | grep -i "auth\|permission\|forbidden"

# Resource issues
kubectl logs deployment/vapora-backend -n vapora | grep -i "memory\|cpu\|timeout"

# Startup issues (if pod restarting)
kubectl logs pod-name -n vapora --previous | head -50

Common Log Messages & Meaning

| Log Message | Meaning | Action |
| --- | --- | --- |
| Connection refused | Service not listening | Check if the service started |
| Out of memory | Memory exhausted | Increase limits or scale |
| Unauthorized | Auth failed | Check credentials/tokens |
| Database connection timeout | Database unreachable | Check DB health |
| 404 Not Found | Endpoint doesn't exist | Check API routes |
| Slow query | Database query taking too long | Optimize the query or check DB |
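
A single pass over recent logs can count how often each of these messages appears. A sketch over the last hour of backend logs; the pattern list is illustrative and worth tailoring to what your services actually emit:

# Count occurrences of common log patterns in the last hour (sketch)
NAMESPACE=vapora
LOGS=$(kubectl logs deployment/vapora-backend -n "$NAMESPACE" --since=1h)

for pattern in "connection refused" "out of memory" "unauthorized" "timeout" "slow query"; do
  count=$(echo "$LOGS" | grep -ic "$pattern")
  echo "$pattern: $count"
done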

Proactive Monitoring Practices

Weekly Review

# Every Monday (or your weekly cadence):

1. Review incidents from past week
   - Were any preventable?
   - Any patterns?

2. Check alert tuning
   - False alarms?
   - Missed issues?
   - Adjust thresholds if needed

3. Capacity check
   - How much headroom remaining?
   - Plan for growth?

4. Log analysis
   - Any concerning patterns?
   - Warnings that should be errors?

5. Update runbooks if needed

Monthly Review

# First of each month:

1. Performance trends
   - Response time trending up/down?
   - Error rate changing?
   - Resource usage changing?

2. Capacity forecast
   - Extrapolate current trends
   - Plan for growth
   - Schedule scaling if needed

3. Incident review
   - MTBF (Mean Time Between Failures)
   - MTTR (Mean Time To Resolve)
   - MTTI (Mean Time To Identify)
   - Are we improving?

4. Tool/alert improvements
   - New monitoring needs?
   - Alert fatigue issues?
   - Better ways to visualize data?

Health Check Checklist

Pre-Deployment Health Check

Before any deployment, verify:
☐ All pods running: kubectl get pods
☐ No recent errors: kubectl logs --since=1h
☐ Resource usage normal: kubectl top pods
☐ Services healthy: curl /health
☐ Recent events normal: kubectl get events

Post-Deployment Health Check

For the first 2 hours after a deployment, verify:
☐ All new pods running
☐ Old pods terminated
☐ Health endpoints responding
☐ No spike in error logs
☐ Resource usage within expected range
☐ Response time normal
☐ No pod restarts

Daily Health Check

Once per business day:
☐ kubectl get pods (all Running, 1/1 Ready)
☐ curl http://localhost:8001/health (200 OK)
☐ kubectl logs --since=24h | grep ERROR (few to none)
☐ kubectl top pods (normal usage)
☐ kubectl get events (no warnings)
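
The five checks above can be wrapped into one script that exits non-zero on failure, so it can run from cron or a shell alias. A minimal sketch; the error-count threshold and the backend deployment as the log source are assumptions:

# Daily health check wrapper (sketch); exits non-zero if any check fails
NAMESPACE=vapora
FAIL=0

# 1. Any pods not Running/Completed?
kubectl get pods -n "$NAMESPACE" --no-headers | grep -vE 'Running|Completed' && FAIL=1

# 2. Health endpoint responding?
curl -sf http://localhost:8001/health > /dev/null || { echo "Health endpoint check failed"; FAIL=1; }

# 3. Error volume in the last 24h (threshold of 10 is an assumption)
ERRORS=$(kubectl logs deployment/vapora-backend -n "$NAMESPACE" --since=24h | grep -c ERROR)
[ "$ERRORS" -gt 10 ] && { echo "High error count in last 24h: $ERRORS"; FAIL=1; }

# 4. Recent warning events (informational)
kubectl get events -n "$NAMESPACE" --field-selector type=Warning --no-headers | tail -5

exit $FAIL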

Monitoring Runbook Checklist

☐ Verified automated health checks running
☐ Manual health checks performed (daily)
☐ Dashboards set up and visible
☐ Alert thresholds tuned
☐ Log patterns identified
☐ Baselines recorded
☐ Escalation procedures understood
☐ Team trained on monitoring
☐ Alert responses tested
☐ Runbooks up to date

Common Monitoring Issues

False Alerts

Problem: Alert fires but service is actually fine

Solution:

  1. Verify manually (don't just assume false)
  2. Check alert threshold (might be too sensitive)
  3. Adjust threshold if consistently false
  4. Document the change

Alert Fatigue

Problem: Too many alerts, getting ignored

Solution:

  1. Review all alerts
  2. Disable/adjust non-actionable ones
  3. Consolidate related alerts
  4. Focus on critical-only alerts

Missing Alerts

Problem: Issue happens but no alert fired

Solution:

  1. Investigate why alert didn't fire
  2. Check alert condition
  3. Add new alert for this issue
  4. Test the new alert

Lag in Monitoring

Problem: Dashboard/alerts slow to update

Solution:

  1. Check monitoring system performance
  2. Increase scrape frequency if appropriate
  3. Reduce data retention if storage issue
  4. Investigate database performance

Monitoring Tools & Commands

kubectl Commands

# Pod monitoring
kubectl get pods -n vapora
kubectl get pods -n vapora -w        # Watch mode
kubectl describe pod <pod> -n vapora
kubectl logs <pod> -n vapora -f

# Resource monitoring
kubectl top nodes
kubectl top pods -n vapora
kubectl describe nodes

# Event monitoring
kubectl get events -n vapora --sort-by='.lastTimestamp'
kubectl get events -n vapora --watch

# Health checks
kubectl get --raw /healthz          # API health

Useful Commands

# Check API responsiveness
curl -v http://localhost:8001/health

# Check all endpoints have pods
for svc in backend agents llm-router; do
  echo "$svc endpoints:"
  kubectl get endpoints vapora-$svc -n vapora
done

# Monitor pod restarts
watch 'kubectl get pods -n vapora -o jsonpath="{range .items[*]}{.metadata.name}{\" \"}{.status.containerStatuses[0].restartCount}{\"\\n\"}{end}"'

# Find pods with high restarts
kubectl get pods -n vapora -o json | jq '.items[] | select(.status.containerStatuses[0].restartCount > 5) | .metadata.name'

Next Steps

  1. Set up dashboards - Create Grafana/Prometheus dashboards if not available
  2. Configure alerts - Set thresholds based on baselines
  3. Test alerting - Verify Slack/email notifications work
  4. Train team - Ensure everyone knows how to read dashboards
  5. Document baselines - Record normal metrics for comparison
  6. Automate checks - Use CI/CD health check pipelines
  7. Review regularly - Weekly/monthly health check reviews

Last Updated: 2026-01-12
Status: Production-ready