
Monitoring & Health Check Operations

Guide for continuous monitoring and health checks of VAPORA in production.


Overview

Responsibility: Maintain visibility into VAPORA service health through monitoring, logging, and alerting

Key Activities:

  • Regular health checks (automated and manual)
  • Alert response and investigation
  • Trend analysis and capacity planning
  • Incident prevention through early detection

Success Metric: Detect and respond to issues before users are significantly impacted


Automated Health Checks

Kubernetes Health Check Pipeline

If using CI/CD, leverage automatic health monitoring:

GitHub Actions:

# Runs every 15 minutes (quick check)
# Runs every 6 hours (comprehensive diagnostics)
# See: .github/workflows/health-check.yml

Woodpecker:

# Runs every 15 minutes (quick check)
# Runs every 6 hours (comprehensive diagnostics)
# See: .woodpecker/health-check.yml

Artifacts Generated:

  • docker-health.log - Docker container status
  • k8s-health.log - Kubernetes deployments status
  • k8s-diagnostics.log - Full system diagnostics
  • docker-diagnostics.log - Docker system info
  • HEALTH_REPORT.md - Summary report
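
The exact steps live in the workflow files referenced above. As a rough illustration, a pipeline step along the following lines could produce those artifacts; this is a minimal sketch, assuming kubectl, docker, and metrics-server access are available on the runner and that the vapora namespace is used as elsewhere in this guide:

# Hypothetical pipeline step: collect health artifacts (sketch; adjust to your workflow)
NAMESPACE=vapora

docker ps --all --format 'table {{.Names}}\t{{.Status}}' > docker-health.log
docker system info > docker-diagnostics.log 2>&1

kubectl get deployments,pods -n "$NAMESPACE" -o wide > k8s-health.log
{
  kubectl get events -n "$NAMESPACE" --sort-by='.lastTimestamp'
  kubectl top pods -n "$NAMESPACE"
  kubectl describe pods -n "$NAMESPACE"
} > k8s-diagnostics.log 2>&1

{
  echo "# Health Report ($(date -u))"
  echo ""
  echo "Pods not Running/Completed:"
  kubectl get pods -n "$NAMESPACE" --no-headers | grep -vE 'Running|Completed' || echo "none"
} > HEALTH_REPORT.md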

Quick Manual Health Check

# Run this command to get instant health status
export NAMESPACE=vapora

echo "=== Pod Status ==="
kubectl get pods -n $NAMESPACE
echo ""

echo "=== Service Health ==="
kubectl get endpoints -n $NAMESPACE
echo ""

echo "=== Recent Events ==="
kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' | tail -10
echo ""

echo "=== Resource Usage ==="
kubectl top pods -n $NAMESPACE
echo ""

echo "=== API Health ==="
curl -s http://localhost:8001/health | jq .

Manual Daily Monitoring

Morning Check (Start of Business Day)

# Run at start of business day (or when starting shift)

echo "=== MORNING HEALTH CHECK ==="
echo "Date: $(date -u)"

# 1. Cluster Status
echo "Cluster Status:"
kubectl cluster-info | grep server

# 2. Node Status
echo ""
echo "Node Status:"
kubectl get nodes
# Should show: All nodes Ready

# 3. Pod Status
echo ""
echo "Pod Status:"
kubectl get pods -n vapora
# Should show: All Running, 1/1 Ready

# 4. Service Endpoints
echo ""
echo "Service Endpoints:"
kubectl get endpoints -n vapora
# Should show: All services have endpoints (not empty)

# 5. Resource Usage
echo ""
echo "Resource Usage:"
kubectl top nodes
kubectl top pods -n vapora | head -10

# 6. Recent Errors
echo ""
echo "Recent Errors (last 1 hour):"
kubectl logs deployment/vapora-backend -n vapora --since=1h | grep -i error | wc -l
# Should show: 0 or very few errors

# 7. Overall Status
echo ""
echo "Overall Status: ✅ Healthy"
# If any issues found: Document and investigate

Mid-Day Check (Every 4-6 hours)

# Quick sanity check during business hours

# 1. Service Responsiveness
curl -s http://localhost:8001/health | jq '.status'
# Should return: "healthy"

# 2. Pod Restart Tracking
kubectl get pods -n vapora -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}'
# Restart count should not be increasing rapidly

# 3. Error Log Check
kubectl logs deployment/vapora-backend -n vapora --since=4h --timestamps | grep ERROR | tail -5
# Should show: Few to no errors

# 4. Performance Check
kubectl top pods -n vapora | tail -5
# CPU/Memory should be in normal range

End-of-Day Check (Before Shift End)

# Summary check before handing off to on-call

echo "=== END OF DAY SUMMARY ==="

# Current status
kubectl get pods -n vapora
kubectl top pods -n vapora

# Any concerning trends?
echo ""
echo "Checking for concerning events..."
kubectl get events -n vapora --sort-by='.lastTimestamp' | grep -i warning

# Any pod restarts?
echo ""
echo "Pod restart status:"
kubectl get pods -n vapora -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.containerStatuses[0].restartCount}{"\n"}{end}' | grep -v ": 0"

# Document for next shift
echo ""
echo "Status for on-call: All normal / Issues detected"

Dashboard Setup & Monitoring

Essential Dashboards to Monitor

If you have Grafana/Prometheus, create these dashboards:

1. Service Health Dashboard

Monitor:

  • Pod running count (should be stable at expected count)
  • Pod restart count (should not increase rapidly)
  • Service endpoint availability (should be >99%)
  • API response time (p99, track trends)

Alert if:

  • Pod count drops below expected
  • Restart count increasing
  • Endpoints empty
  • Response time >2s
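
When the dashboard is unavailable, these conditions can be spot-checked from a shell. A minimal sketch, assuming the vapora-backend deployment and the port-forwarded /health endpoint used elsewhere in this guide; the expected replica count is an assumption to adjust:

# Spot-check service health alert conditions from the CLI (sketch)
NAMESPACE=vapora
EXPECTED_REPLICAS=3   # assumption; set to your configured replica count

READY=$(kubectl get deployment vapora-backend -n "$NAMESPACE" -o jsonpath='{.status.readyReplicas}')
echo "Ready replicas: ${READY:-0} (expected: $EXPECTED_REPLICAS)"

echo "Endpoint addresses:"
kubectl get endpoints vapora-backend -n "$NAMESPACE" -o jsonpath='{.subsets[*].addresses[*].ip}'; echo ""

# Response time (>2s should alert)
RESPONSE_TIME=$(curl -s -o /dev/null -w '%{time_total}' http://localhost:8001/health)
echo "Health endpoint response time: ${RESPONSE_TIME}s"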

2. Resource Utilization Dashboard

Monitor:

  • CPU usage per pod
  • Memory usage per pod
  • Node capacity (CPU, memory, disk)
  • Network I/O

Alert if:

  • Any pod >80% CPU/Memory
  • Any node >85% capacity
  • Memory trending upward consistently

3. Error Rate Dashboard

Monitor:

  • 4xx error rate (should be low)
  • 5xx error rate (should be minimal)
  • Error rate by endpoint
  • Error rate by service

Alert if:

  • 5xx error rate >5%
  • 4xx error rate >10%
  • Sudden spike in errors

4. Application Metrics Dashboard

Monitor:

  • Request rate (RPS)
  • Request latency (p50, p95, p99)
  • Active connections
  • Database query time

Alert if:

  • Request rate suddenly drops (might indicate an outage)
  • Latency spikes above baseline
  • Database queries slow
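
Without a metrics stack, a rough latency percentile check can be improvised with curl against the port-forwarded /health endpoint. A sketch; the sample count is arbitrary, and this only measures the health route, not real request traffic:

# Rough p50/p95 latency sample against /health (sketch, not a substitute for request metrics)
SAMPLES=20
for i in $(seq 1 $SAMPLES); do
  curl -s -o /dev/null -w '%{time_total}\n' http://localhost:8001/health
done | sort -n | awk '
  { t[NR] = $1 }
  END { printf "p50: %ss  p95: %ss  max: %ss\n", t[int(NR*0.50)], t[int(NR*0.95)], t[NR] }'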

Grafana Setup Example

# If setting up Grafana monitoring:
1. Deploy Prometheus to scrape Kubernetes metrics
2. Create dashboards with the panels above
3. Set alert rules:
   - CPU >80%: Warning
   - Memory >85%: Warning
   - Error rate >5%: Critical
   - Pod crashed: Critical
   - Response time >2s: Warning
4. Configure notifications to Slack/email

Alert Response Procedures

When Alert Fires

Alert Received
    ↓
Step 1: Verify it's real (not a false alarm)
  - Check dashboard
  - Check manually (curl endpoints, kubectl get pods)
  - Ask in #deployments if unsure

Step 2: Assess severity
  - Service completely down? Severity 1
  - Service partially degraded? Severity 2
  - Warning/trending issue? Severity 3

Step 3: Declare incident (if Severity 1-2)
  - Create #incident channel
  - Follow Incident Response Runbook
  - See: incident-response-runbook.md

Step 4: Investigate (if Severity 3)
  - Document in ticket
  - Schedule investigation
  - Monitor for escalation

Common Alerts & Actions

| Alert | Cause | Response |
| --- | --- | --- |
| Pod CrashLoopBackOff | App crashing | Get logs, fix, restart |
| High CPU (>80%) | Resource exhausted | Scale up or reduce load |
| High Memory (>85%) | Memory leak or surge | Investigate or restart |
| Error rate spike | App issue | Check logs, consider rollback |
| Response time spike | Slow queries/I/O | Check database, consider restart |
| Pod Pending | Can't schedule | Check node resources |
| Endpoints empty | Service down | Verify the service exists |
| Disk full | Storage exhausted | Clean up or expand |
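
A quick triage pass for the two most common rows above (crash-looping and unschedulable pods) can be scripted. A minimal sketch, assuming the vapora namespace:

# Find pods matching the most common alert causes (sketch)
NAMESPACE=vapora

echo "Pods not Running/Completed:"
kubectl get pods -n "$NAMESPACE" --no-headers | grep -vE 'Running|Completed' || echo "none"

echo ""
echo "Pending pods and why they cannot be scheduled:"
for pod in $(kubectl get pods -n "$NAMESPACE" --field-selector=status.phase=Pending -o name); do
  echo "$pod"
  kubectl describe "$pod" -n "$NAMESPACE" | grep -A3 'Events:'
done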

Establishing Baselines

Record these metrics during normal operation:

# CPU per pod (typical)
Backend:    200-400m per pod
Agents:     300-500m per pod
LLM Router: 100-200m per pod

# Memory per pod (typical)
Backend:    256-512Mi per pod
Agents:     128-256Mi per pod
LLM Router: 64-128Mi per pod

# Response time (typical)
Backend:    p50: 50ms, p95: 200ms, p99: 500ms
Frontend:   Load time: 2-3 seconds

# Error rate (typical)
Backend:    4xx: <1%, 5xx: <0.1%
Frontend:   <5% user-visible errors

# Pod restart count
Should remain 0 (no restarts expected in normal operation)
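
It helps to capture these numbers from the live cluster rather than from memory. A minimal sketch that snapshots current usage into a dated file for later comparison; the baselines/ directory is an assumption, and kubectl top requires metrics-server:

# Snapshot current usage as a baseline (sketch)
NAMESPACE=vapora
OUTDIR=baselines
mkdir -p "$OUTDIR"
OUTFILE="$OUTDIR/$(date -u +%Y-%m-%d).txt"

{
  echo "# Baseline captured $(date -u)"
  kubectl top pods -n "$NAMESPACE"
  echo ""
  kubectl get pods -n "$NAMESPACE" -o jsonpath='{range .items[*]}{.metadata.name}{": restarts="}{.status.containerStatuses[0].restartCount}{"\n"}{end}'
} > "$OUTFILE"

echo "Baseline written to $OUTFILE"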

Detecting Anomalies

Compare current metrics to baseline:

# If CPU is 2x normal:
- Check whether load increased
- Check for a resource leak
- Monitor for further increases

# If memory is steadily increasing:
- Might indicate a memory leak
- Monitor over time (1-2 hours)
- Restart if clearly trending up

# If error rate is 10x normal:
- Something broke recently
- Check the most recent deployment
- Consider a rollback

# If a new process is consuming resources:
- Identify the new resource consumer
- Investigate its purpose
- Kill it if unintended
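
The "2x normal" CPU comparison is easy to automate once a baseline exists. A sketch that flags pods whose current CPU exceeds a fixed millicore cutoff; the 800m threshold is an assumption (roughly 2x the documented backend range) and should be tuned per service:

# Flag pods using more CPU than an assumed 2x-baseline threshold (sketch)
NAMESPACE=vapora
CPU_THRESHOLD_M=800   # assumption: ~2x the 200-400m backend baseline

kubectl top pods -n "$NAMESPACE" --no-headers | awk -v limit="$CPU_THRESHOLD_M" '
  { cpu = $2; gsub(/m/, "", cpu) }
  cpu + 0 > limit { printf "ANOMALY: %s using %s CPU (threshold %sm)\n", $1, $2, limit }'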

Capacity Planning

When to Scale

Monitor trends and plan ahead:

# Trigger capacity planning if:
- Average CPU >60%
- Average Memory >60%
- Peak usage trending upward
- Disk usage >80%

# Questions to ask:
- Is traffic increasing? Seasonal spike?
- Did we add features? New workload?
- Do we have capacity for growth?
- Should we scale now or wait?
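
Average utilization across the namespace can be aggregated quickly to see how close the 60% triggers above are. A sketch, assuming metrics-server is installed and kubectl top reports CPU in millicores and memory in Mi:

# Average CPU/memory across all pods in the namespace (sketch)
NAMESPACE=vapora

kubectl top pods -n "$NAMESPACE" --no-headers | awk '
  { gsub(/m/, "", $2); gsub(/Mi/, "", $3); cpu += $2; mem += $3; pods++ }
  END {
    if (pods > 0)
      printf "pods=%d  avg_cpu=%dm  avg_mem=%dMi  total_cpu=%dm  total_mem=%dMi\n",
             pods, cpu/pods, mem/pods, cpu, mem
  }'
# Compare the averages against pod requests/limits to estimate percent utilization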

Scaling Actions

# Quick scale (temporary):
kubectl scale deployment/vapora-backend --replicas=5 -n vapora

# Permanent scale (update deployment.yaml):
# Edit: replicas: 5
# Apply: kubectl apply -f deployment.yaml

# Add nodes (infrastructure):
# Contact infrastructure team

# Reduce resource consumption:
# Investigate slow queries, memory leaks, etc.

Log Analysis & Troubleshooting

Checking Logs

# Most recent logs
kubectl logs deployment/vapora-backend -n vapora

# Last N lines
kubectl logs deployment/vapora-backend -n vapora --tail=100

# From specific time
kubectl logs deployment/vapora-backend -n vapora --since=1h

# Follow/tail logs
kubectl logs deployment/vapora-backend -n vapora -f

# From specific pod
kubectl logs pod-name -n vapora

# Previous pod (if crashed)
kubectl logs pod-name -n vapora --previous

Log Patterns to Watch For

# Error patterns
kubectl logs deployment/vapora-backend -n vapora | grep -i "error\|exception\|fatal"

# Database issues
kubectl logs deployment/vapora-backend -n vapora | grep -i "database\|connection\|sql"

# Authentication issues
kubectl logs deployment/vapora-backend -n vapora | grep -i "auth\|permission\|forbidden"

# Resource issues
kubectl logs deployment/vapora-backend -n vapora | grep -i "memory\|cpu\|timeout"

# Startup issues (if pod restarting)
kubectl logs pod-name -n vapora --previous | head -50

Common Log Messages & Meaning

| Log Message | Meaning | Action |
| --- | --- | --- |
| Connection refused | Service not listening | Check if the service started |
| Out of memory | Memory exhausted | Increase limits or scale |
| Unauthorized | Auth failed | Check credentials/tokens |
| Database connection timeout | Database unreachable | Check DB health |
| 404 Not Found | Endpoint doesn't exist | Check API routes |
| Slow query | Database query taking too long | Optimize the query or check DB |
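
A single pass over recent logs can count how often each of these messages appears. A sketch over the last hour of backend logs; the pattern list is illustrative and worth tailoring to what your services actually emit:

# Count occurrences of common log patterns in the last hour (sketch)
NAMESPACE=vapora
LOGS=$(kubectl logs deployment/vapora-backend -n "$NAMESPACE" --since=1h)

for pattern in "connection refused" "out of memory" "unauthorized" "timeout" "slow query"; do
  count=$(echo "$LOGS" | grep -ic "$pattern")
  echo "$pattern: $count"
done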

Proactive Monitoring Practices

Weekly Review

# Every Monday (or your weekly cadence):

1. Review incidents from past week
   - Were any preventable?
   - Any patterns?

2. Check alert tuning
   - False alarms?
   - Missed issues?
   - Adjust thresholds if needed

3. Capacity check
   - How much headroom remaining?
   - Plan for growth?

4. Log analysis
   - Any concerning patterns?
   - Warnings that should be errors?

5. Update runbooks if needed

Monthly Review

# First of each month:

1. Performance trends
   - Response time trending up/down?
   - Error rate changing?
   - Resource usage changing?

2. Capacity forecast
   - Extrapolate current trends
   - Plan for growth
   - Schedule scaling if needed

3. Incident review
   - MTBF (Mean Time Between Failures)
   - MTTR (Mean Time To Resolve)
   - MTTI (Mean Time To Identify)
   - Are we improving?

4. Tool/alert improvements
   - New monitoring needs?
   - Alert fatigue issues?
   - Better ways to visualize data?

Health Check Checklist

Pre-Deployment Health Check

Before any deployment, verify:
☐ All pods running: kubectl get pods
☐ No recent errors: kubectl logs --since=1h
☐ Resource usage normal: kubectl top pods
☐ Services healthy: curl /health
☐ Recent events normal: kubectl get events

Post-Deployment Health Check

For the first 2 hours after a deployment, verify:
☐ All new pods running
☐ Old pods terminated
☐ Health endpoints responding
☐ No spike in error logs
☐ Resource usage within expected range
☐ Response time normal
☐ No pod restarts

Daily Health Check

Once per business day:
☐ kubectl get pods (all Running, 1/1 Ready)
☐ curl http://localhost:8001/health (200 OK)
☐ kubectl logs --since=24h | grep ERROR (few to none)
☐ kubectl top pods (normal usage)
☐ kubectl get events (no warnings)
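
The five checks above can be wrapped into one script that exits non-zero on failure, so it can run from cron or a shell alias. A minimal sketch; the error-count threshold and the backend deployment as the log source are assumptions:

# Daily health check wrapper (sketch); exits non-zero if any check fails
NAMESPACE=vapora
FAIL=0

# 1. Any pods not Running/Completed?
kubectl get pods -n "$NAMESPACE" --no-headers | grep -vE 'Running|Completed' && FAIL=1

# 2. Health endpoint responding?
curl -sf http://localhost:8001/health > /dev/null || { echo "Health endpoint check failed"; FAIL=1; }

# 3. Error volume in the last 24h (threshold of 10 is an assumption)
ERRORS=$(kubectl logs deployment/vapora-backend -n "$NAMESPACE" --since=24h | grep -c ERROR)
[ "$ERRORS" -gt 10 ] && { echo "High error count in last 24h: $ERRORS"; FAIL=1; }

# 4. Recent warning events (informational)
kubectl get events -n "$NAMESPACE" --field-selector type=Warning --no-headers | tail -5

exit $FAIL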

Monitoring Runbook Checklist

☐ Verified automated health checks running
☐ Manual health checks performed (daily)
☐ Dashboards set up and visible
☐ Alert thresholds tuned
☐ Log patterns identified
☐ Baselines recorded
☐ Escalation procedures understood
☐ Team trained on monitoring
☐ Alert responses tested
☐ Runbooks up to date

Common Monitoring Issues

False Alerts

Problem: Alert fires but service is actually fine

Solution:

  1. Verify manually (don't just assume false)
  2. Check alert threshold (might be too sensitive)
  3. Adjust threshold if consistently false
  4. Document the change

Alert Fatigue

Problem: Too many alerts, getting ignored

Solution:

  1. Review all alerts
  2. Disable/adjust non-actionable ones
  3. Consolidate related alerts
  4. Focus on critical-only alerts

Missing Alerts

Problem: Issue happens but no alert fired

Solution:

  1. Investigate why alert didn't fire
  2. Check alert condition
  3. Add new alert for this issue
  4. Test the new alert

Lag in Monitoring

Problem: Dashboard/alerts slow to update

Solution:

  1. Check monitoring system performance
  2. Increase scrape frequency if appropriate
  3. Reduce data retention if storage issue
  4. Investigate database performance

Monitoring Tools & Commands

kubectl Commands

# Pod monitoring
kubectl get pods -n vapora
kubectl get pods -n vapora -w        # Watch mode
kubectl describe pod <pod> -n vapora
kubectl logs <pod> -n vapora -f

# Resource monitoring
kubectl top nodes
kubectl top pods -n vapora
kubectl describe nodes

# Event monitoring
kubectl get events -n vapora --sort-by='.lastTimestamp'
kubectl get events -n vapora --watch

# Health checks
kubectl get --raw /healthz          # API health

Useful Commands

# Check API responsiveness
curl -v http://localhost:8001/health

# Check all endpoints have pods
for svc in backend agents llm-router; do
  echo "$svc endpoints:"
  kubectl get endpoints vapora-$svc -n vapora
done

# Monitor pod restarts
watch 'kubectl get pods -n vapora -o jsonpath="{range .items[*]}{.metadata.name}{\" \"}{.status.containerStatuses[0].restartCount}{\"\\n\"}{end}"'

# Find pods with high restarts
kubectl get pods -n vapora -o json | jq '.items[] | select(.status.containerStatuses[0].restartCount > 5) | .metadata.name'

Next Steps

  1. Set up dashboards - Create Grafana/Prometheus dashboards if not available
  2. Configure alerts - Set thresholds based on baselines
  3. Test alerting - Verify Slack/email notifications work
  4. Train team - Ensure everyone knows how to read dashboards
  5. Document baselines - Record normal metrics for comparison
  6. Automate checks - Use CI/CD health check pipelines
  7. Review regularly - Weekly/monthly health check reviews

Last Updated: 2026-01-12
Status: Production-ready