Monitoring & Health Check Operations

Guide for continuous monitoring and health checks of VAPORA in production.


Overview

Responsibility: Maintain visibility into VAPORA service health through monitoring, logging, and alerting

Key Activities:

  • Regular health checks (automated and manual)
  • Alert response and investigation
  • Trend analysis and capacity planning
  • Incident prevention through early detection

Success Metric: Detect and respond to issues before users are significantly impacted


Automated Health Checks

Kubernetes Health Check Pipeline

If using CI/CD, leverage automatic health monitoring:

GitHub Actions:

# Runs every 15 minutes (quick check)
# Runs every 6 hours (comprehensive diagnostics)
# See: .github/workflows/health-check.yml

Woodpecker:

# Runs every 15 minutes (quick check)
# Runs every 6 hours (comprehensive diagnostics)
# See: .woodpecker/health-check.yml

Artifacts Generated:

  • docker-health.log - Docker container status
  • k8s-health.log - Kubernetes deployments status
  • k8s-diagnostics.log - Full system diagnostics
  • docker-diagnostics.log - Docker system info
  • HEALTH_REPORT.md - Summary report

Quick Manual Health Check

# Run these commands for an instant health snapshot
export NAMESPACE=vapora

echo "=== Pod Status ==="
kubectl get pods -n $NAMESPACE
echo ""

echo "=== Service Health ==="
kubectl get endpoints -n $NAMESPACE
echo ""

echo "=== Recent Events ==="
kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' | tail -10
echo ""

echo "=== Resource Usage ==="
kubectl top pods -n $NAMESPACE
echo ""

echo "=== API Health ==="
curl -s http://localhost:8001/health | jq .

Manual Daily Monitoring

Morning Check (Start of Business Day)

# Run at start of business day (or when starting shift)

echo "=== MORNING HEALTH CHECK ==="
echo "Date: $(date -u)"

# 1. Cluster Status
echo "Cluster Status:"
kubectl cluster-info | grep -i "running"

# 2. Node Status
echo ""
echo "Node Status:"
kubectl get nodes
# Should show: All nodes Ready

# 3. Pod Status
echo ""
echo "Pod Status:"
kubectl get pods -n vapora
# Should show: All Running, 1/1 Ready

# 4. Service Endpoints
echo ""
echo "Service Endpoints:"
kubectl get endpoints -n vapora
# Should show: All services have endpoints (not empty)

# 5. Resource Usage
echo ""
echo "Resource Usage:"
kubectl top nodes
kubectl top pods -n vapora | head -10

# 6. Recent Errors
echo ""
echo "Recent Errors (last 1 hour):"
kubectl logs deployment/vapora-backend -n vapora --since=1h | grep -i error | wc -l
# Should show: 0 or very few errors

# 7. Overall Status
echo ""
echo "Overall Status: ✅ Healthy"
# If any issues found: Document and investigate

Mid-Day Check (Every 4-6 hours)

# Quick sanity check during business hours

# 1. Service Responsiveness
curl -s http://localhost:8001/health | jq '.status'
# Should return: "healthy"

# 2. Pod Restart Tracking
kubectl get pods -n vapora -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}'
# Restart count should not be increasing rapidly

# 3. Error Log Check
kubectl logs deployment/vapora-backend -n vapora --since=4h --timestamps | grep ERROR | tail -5
# Should show: Few to no errors

# 4. Performance Check
kubectl top pods -n vapora | tail -5
# CPU/Memory should be in normal range

End-of-Day Check (Before Shift End)

# Summary check before handing off to on-call

echo "=== END OF DAY SUMMARY ==="

# Current status
kubectl get pods -n vapora
kubectl top pods -n vapora

# Any concerning trends?
echo ""
echo "Checking for concerning events..."
kubectl get events -n vapora --sort-by='.lastTimestamp' | grep -i warning

# Any pod restarts?
echo ""
echo "Pod restart status:"
kubectl get pods -n vapora -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.containerStatuses[0].restartCount}{"\n"}{end}' | grep -v ": 0"

# Document for next shift
echo ""
echo "Status for on-call: All normal / Issues detected"

Dashboard Setup & Monitoring

Essential Dashboards to Monitor

If you have Grafana/Prometheus, create these dashboards:

1. Service Health Dashboard

Monitor:

  • Pod running count (should be stable at expected count)
  • Pod restart count (should not increase rapidly)
  • Service endpoint availability (should be >99%)
  • API response time (p99, track trends)

Alert if:

  • Pod count drops below expected
  • Restart count increasing
  • Endpoints empty
  • Response time >2s
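
If Prometheus/Grafana are not available yet, a rough spot check of these conditions can be done from the command line. This is a minimal sketch; the vapora-backend service name comes from this runbook, and the thresholds mirror the alert rules above.

# Spot-check the service-health conditions above (sketch)
NAMESPACE=vapora

# Pod count: running vs. total in the namespace
TOTAL=$(kubectl get pods -n "$NAMESPACE" --no-headers | wc -l)
RUNNING=$(kubectl get pods -n "$NAMESPACE" --field-selector=status.phase=Running --no-headers | wc -l)
echo "Pods running: $RUNNING / $TOTAL"

# Restart counts (should stay flat between checks)
kubectl get pods -n "$NAMESPACE" \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.containerStatuses[0].restartCount}{"\n"}{end}'

# Endpoint availability for the backend service (empty output means no endpoints)
kubectl get endpoints vapora-backend -n "$NAMESPACE" -o jsonpath='{.subsets[*].addresses[*].ip}{"\n"}'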

2. Resource Utilization Dashboard

Monitor:

  • CPU usage per pod
  • Memory usage per pod
  • Node capacity (CPU, memory, disk)
  • Network I/O

Alert if:

  • Any pod >80% CPU/Memory
  • Any node >85% capacity
  • Memory trending upward consistently
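
A node-level version of the same check can be scripted against kubectl top (requires metrics-server). The 85% threshold mirrors the alert rule above; the column positions assume standard kubectl top nodes output.

# Flag nodes above ~85% CPU or memory (sketch; requires metrics-server)
kubectl top nodes --no-headers | awk '
{
  cpu = $3; mem = $5;                      # CPU% and MEMORY% columns
  gsub(/%/, "", cpu); gsub(/%/, "", mem);
  if (cpu + 0 > 85 || mem + 0 > 85)
    printf "WARNING: node %s at %s%% CPU / %s%% memory\n", $1, cpu, mem;
}'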

3. Error Rate Dashboard

Monitor:

  • 4xx error rate (should be low)
  • 5xx error rate (should be minimal)
  • Error rate by endpoint
  • Error rate by service

Alert if:

  • 5xx error rate >5%
  • 4xx error rate >10%
  • Sudden spike in errors
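
Until an error-rate panel exists, a crude estimate can be pulled from the backend logs. This sketch assumes HTTP status codes appear in the log lines, which may not match your log format; adjust the patterns accordingly.

# Rough 4xx/5xx counts from the last hour of backend logs (sketch; adjust to your log format)
LOGS=$(kubectl logs deployment/vapora-backend -n vapora --since=1h)
TOTAL=$(echo "$LOGS" | wc -l)
CLIENT_ERRORS=$(echo "$LOGS" | grep -cE ' 4[0-9]{2} ')
SERVER_ERRORS=$(echo "$LOGS" | grep -cE ' 5[0-9]{2} ')
echo "Last hour: $TOTAL log lines, $CLIENT_ERRORS 4xx, $SERVER_ERRORS 5xx"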

4. Application Metrics Dashboard

Monitor:

  • Request rate (RPS)
  • Request latency (p50, p95, p99)
  • Active connections
  • Database query time

Alert if:

  • Request rate suddenly drops (might indicate outage)
  • Latency spikes above baseline
  • Database queries slow
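
For a quick latency spot check without a metrics stack, curl's timing output against the health endpoint gives a rough number to compare with the baselines recorded below. The localhost:8001 URL matches the earlier checks in this runbook and assumes the usual port-forward.

# Sample response time of the health endpoint 10 times (sketch)
for i in $(seq 1 10); do
  curl -s -o /dev/null -w '%{http_code} %{time_total}s\n' http://localhost:8001/health
done
# Compare the spread against the p95/p99 baselines recorded in "Establishing Baselines"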

Grafana Setup Example

# If setting up Grafana monitoring
1. Deploy Prometheus to scrape Kubernetes metrics
2. Create dashboards with the panels above
3. Set alert rules:
   - CPU >80%: Warning
   - Memory >85%: Warning
   - Error rate >5%: Critical
   - Pod crashed: Critical
   - Response time >2s: Warning
4. Configure notifications to Slack/email
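
Once the notification channel from step 4 is set up, a one-line test confirms messages actually arrive. This is a sketch for a Slack incoming webhook; SLACK_WEBHOOK_URL is a placeholder for your webhook URL.

# Send a test message through the Slack incoming webhook (sketch)
curl -s -X POST -H 'Content-type: application/json' \
  --data '{"text":"VAPORA monitoring: test alert notification"}' \
  "$SLACK_WEBHOOK_URL"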

Alert Response Procedures

When Alert Fires

Alert Received
    ↓
Step 1: Verify it's real (not a false alarm)
  - Check dashboard
  - Check manually (curl endpoints, kubectl get pods)
  - Ask in #deployments if unsure

Step 2: Assess severity
  - Service completely down? Severity 1
  - Service partially degraded? Severity 2
  - Warning/trending issue? Severity 3

Step 3: Declare incident (if Severity 1-2)
  - Create #incident channel
  - Follow Incident Response Runbook
  - See: incident-response-runbook.md

Step 4: Investigate (if Severity 3)
  - Document in ticket
  - Schedule investigation
  - Monitor for escalation

Common Alerts & Actions

Alert                 | Cause                 | Response
----------------------|-----------------------|------------------------------
Pod CrashLoopBackOff  | App crashing          | Get logs, fix, restart
High CPU >80%         | Resource exhausted    | Scale up or reduce load
High Memory >85%      | Memory leak or surge  | Investigate or restart
Error rate spike      | App issue             | Check logs, might rollback
Response time spike   | Slow queries/I/O      | Check database, might restart
Pod pending           | Can't schedule        | Check node resources
Endpoints empty       | Service down          | Verify service exists
Disk full             | Storage exhausted     | Clean up or expand
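
For the most common of these, CrashLoopBackOff, a first-response triage usually looks like the sketch below (replace <pod> with the failing pod's name).

# First-response triage for a CrashLoopBackOff pod (sketch)
kubectl describe pod <pod> -n vapora | tail -20        # scheduling, probe, and OOM events
kubectl logs <pod> -n vapora --previous | tail -50     # output from the crashed container
kubectl get pod <pod> -n vapora \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'   # e.g. OOMKilled, Error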

Establishing Baselines

Record these metrics during normal operation:

# CPU per pod (typical)
Backend:    200-400m per pod
Agents:     300-500m per pod
LLM Router: 100-200m per pod

# Memory per pod (typical)
Backend:    256-512Mi per pod
Agents:     128-256Mi per pod
LLM Router: 64-128Mi per pod

# Response time (typical)
Backend:    p50: 50ms, p95: 200ms, p99: 500ms
Frontend:   Load time: 2-3 seconds

# Error rate (typical)
Backend:    4xx: <1%, 5xx: <0.1%
Frontend:   <5% user-visible errors

# Pod restart count
Should remain 0 (no restarts expected in normal operation)
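
To make later comparisons concrete, record a dated snapshot whenever the system looks healthy. The baselines/ directory below is an arbitrary choice; store the files wherever your team keeps operational notes.

# Record a dated baseline snapshot (sketch; baselines/ is an arbitrary location)
mkdir -p baselines
STAMP=$(date -u +%Y-%m-%d)
kubectl top pods -n vapora  > "baselines/pods-$STAMP.txt"
kubectl top nodes           > "baselines/nodes-$STAMP.txt"
echo "Baseline written to baselines/*-$STAMP.txt"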

Detecting Anomalies

Compare current metrics to baseline:

# If CPU 2x normal:
- Check if load increased
- Check for resource leak
- Monitor for further increase

# If Memory increasing:
- Might indicate memory leak
- Monitor over time (1-2 hours)
- Restart if clearly trending up

# If Error rate 10x:
- Something broke recently
- Check recent deployment
- Consider rollback

# If new process consuming resources:
- Identify the new resource consumer
- Investigate purpose
- Kill if unintended
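
With a saved snapshot (see the baseline sketch above), putting baseline and current usage side by side makes the 2x/10x comparisons above quick to eyeball. STAMP is the date of whichever baseline you want to compare against.

# Compare current usage against a saved baseline (sketch; assumes the snapshot above)
STAMP=2026-01-05   # date of the baseline to compare against
echo "=== Baseline ($STAMP) ==="
cat "baselines/pods-$STAMP.txt"
echo ""
echo "=== Current ==="
kubectl top pods -n vapora
# Flag anything roughly 2x the baseline CPU or memory for investigation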

Capacity Planning

When to Scale

Monitor trends and plan ahead:

# Trigger capacity planning if:
- Average CPU >60%
- Average Memory >60%
- Peak usage trending upward
- Disk usage >80%

# Questions to ask:
- Is traffic increasing? Seasonal spike?
- Did we add features? New workload?
- Do we have capacity for growth?
- Should we scale now or wait?

Scaling Actions

# Quick scale (temporary):
kubectl scale deployment/vapora-backend --replicas=5 -n vapora

# Permanent scale (update deployment.yaml):
# Edit: replicas: 5
# Apply: kubectl apply -f deployment.yaml

# Add nodes (infrastructure):
# Contact infrastructure team

# Reduce resource consumption:
# Investigate slow queries, memory leaks, etc.
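
If metrics-server is available, a HorizontalPodAutoscaler can take over routine scaling so manual kubectl scale is only needed for exceptional load. The min/max/target values below are illustrative, not recommendations.

# Optional: let an HPA handle routine scaling (sketch; values are illustrative)
kubectl autoscale deployment/vapora-backend -n vapora \
  --min=3 --max=10 --cpu-percent=70

# Verify the autoscaler is tracking the deployment
kubectl get hpa -n vapora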

Log Analysis & Troubleshooting

Checking Logs

# Most recent logs
kubectl logs deployment/vapora-backend -n vapora

# Last N lines
kubectl logs deployment/vapora-backend -n vapora --tail=100

# From specific time
kubectl logs deployment/vapora-backend -n vapora --since=1h

# Follow/tail logs
kubectl logs deployment/vapora-backend -n vapora -f

# From specific pod
kubectl logs pod-name -n vapora

# Previous pod (if crashed)
kubectl logs pod-name -n vapora --previous

Log Patterns to Watch For

# Error patterns
kubectl logs deployment/vapora-backend -n vapora | grep -i "error\|exception\|fatal"

# Database issues
kubectl logs deployment/vapora-backend -n vapora | grep -i "database\|connection\|sql"

# Authentication issues
kubectl logs deployment/vapora-backend -n vapora | grep -i "auth\|permission\|forbidden"

# Resource issues
kubectl logs deployment/vapora-backend -n vapora | grep -i "memory\|cpu\|timeout"

# Startup issues (if pod restarting)
kubectl logs pod-name -n vapora --previous | head -50

Common Log Messages & Meaning

Log Message                 | Meaning                    | Action
----------------------------|----------------------------|----------------------------
Connection refused          | Service not listening      | Check if service started
Out of memory               | Memory exhausted           | Increase limits or scale
Unauthorized                | Auth failed                | Check credentials/tokens
Database connection timeout | Database unreachable       | Check DB health
404 Not Found               | Endpoint doesn't exist     | Check API routes
Slow query                  | Database query taking time | Optimize query or check DB

Proactive Monitoring Practices

Weekly Review

# Every Monday (or your weekly cadence):

1. Review incidents from past week
   - Were any preventable?
   - Any patterns?

2. Check alert tuning
   - False alarms?
   - Missed issues?
   - Adjust thresholds if needed

3. Capacity check
   - How much headroom remaining?
   - Plan for growth?

4. Log analysis
   - Any concerning patterns?
   - Warnings that should be errors?

5. Update runbooks if needed
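
Most of the numbers for this review can be gathered in one pass with the sketch below; adjust the grep patterns to your log format, and note that the 7-day log window depends on log retention.

# Gather raw numbers for the weekly review (sketch)
echo "Pods with restarts:"
kubectl get pods -n vapora \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.containerStatuses[0].restartCount}{"\n"}{end}' | grep -v ": 0"

echo ""
echo "Warning events still in the event log:"
kubectl get events -n vapora --field-selector type=Warning | tail -20

echo ""
echo "Backend error lines, last 7 days (subject to log retention):"
kubectl logs deployment/vapora-backend -n vapora --since=168h 2>/dev/null | grep -ci error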

Monthly Review

# First of each month:

1. Performance trends
   - Response time trending up/down?
   - Error rate changing?
   - Resource usage changing?

2. Capacity forecast
   - Extrapolate current trends
   - Plan for growth
   - Schedule scaling if needed

3. Incident review
   - MTBF (Mean Time Between Failures)
   - MTTR (Mean Time To Resolve)
   - MTTI (Mean Time To Identify)
   - Are we improving?

4. Tool/alert improvements
   - New monitoring needs?
   - Alert fatigue issues?
   - Better ways to visualize data?

Health Check Checklist

Pre-Deployment Health Check

Before any deployment, verify:
☐ All pods running: kubectl get pods
☐ No recent errors: kubectl logs --since=1h
☐ Resource usage normal: kubectl top pods
☐ Services healthy: curl /health
☐ Recent events normal: kubectl get events

Post-Deployment Health Check

After deployment, verify for 2 hours:
☐ All new pods running
☐ Old pods terminated
☐ Health endpoints responding
☐ No spike in error logs
☐ Resource usage within expected range
☐ Response time normal
☐ No pod restarts

Daily Health Check

Once per business day:
☐ kubectl get pods (all Running, 1/1 Ready)
☐ curl http://localhost:8001/health (200 OK)
☐ kubectl logs --since=24h | grep ERROR (few to none)
☐ kubectl top pods (normal usage)
☐ kubectl get events (no warnings)

Monitoring Runbook Checklist

☐ Verified automated health checks running
☐ Manual health checks performed (daily)
☐ Dashboards set up and visible
☐ Alert thresholds tuned
☐ Log patterns identified
☐ Baselines recorded
☐ Escalation procedures understood
☐ Team trained on monitoring
☐ Alert responses tested
☐ Runbooks up to date

Common Monitoring Issues

False Alerts

Problem: Alert fires but service is actually fine

Solution:

  1. Verify manually (don't just assume it's a false alarm)
  2. Check alert threshold (might be too sensitive)
  3. Adjust threshold if consistently false
  4. Document the change

Alert Fatigue

Problem: Too many alerts, getting ignored

Solution:

  1. Review all alerts
  2. Disable/adjust non-actionable ones
  3. Consolidate related alerts
  4. Focus on critical-only alerts

Missing Alerts

Problem: Issue happens but no alert fired

Solution:

  1. Investigate why alert didn't fire
  2. Check alert condition
  3. Add new alert for this issue
  4. Test the new alert

Lag in Monitoring

Problem: Dashboard/alerts slow to update

Solution:

  1. Check monitoring system performance
  2. Increase scrape frequency if appropriate
  3. Reduce data retention if storage issue
  4. Investigate database performance

Monitoring Tools & Commands

kubectl Commands

# Pod monitoring
kubectl get pods -n vapora
kubectl get pods -n vapora -w        # Watch mode
kubectl describe pod <pod> -n vapora
kubectl logs <pod> -n vapora -f

# Resource monitoring
kubectl top nodes
kubectl top pods -n vapora
kubectl describe nodes

# Event monitoring
kubectl get events -n vapora --sort-by='.lastTimestamp'
kubectl get events -n vapora --watch

# Health checks
kubectl get --raw /healthz          # API health

Useful Commands

# Check API responsiveness
curl -v http://localhost:8001/health

# Check all endpoints have pods
for svc in backend agents llm-router; do
  echo "$svc endpoints:"
  kubectl get endpoints vapora-$svc -n vapora
done

# Monitor pod restarts
watch 'kubectl get pods -n vapora -o jsonpath="{range .items[*]}{.metadata.name}{\" \"}{.status.containerStatuses[0].restartCount}{\"\\n\"}{end}"'

# Find pods with high restarts
kubectl get pods -n vapora -o json | jq '.items[] | select(.status.containerStatuses[0].restartCount > 5) | .metadata.name'

Next Steps

  1. Set up dashboards - Create Grafana/Prometheus dashboards if not available
  2. Configure alerts - Set thresholds based on baselines
  3. Test alerting - Verify Slack/email notifications work
  4. Train team - Ensure everyone knows how to read dashboards
  5. Document baselines - Record normal metrics for comparison
  6. Automate checks - Use CI/CD health check pipelines
  7. Review regularly - Weekly/monthly health check reviews

Last Updated: 2026-01-12 Status: Production-ready