Monitoring & Health Check Operations
Guide for continuous monitoring and health checks of VAPORA in production.
Overview
Responsibility: Maintain visibility into VAPORA service health through monitoring, logging, and alerting
Key Activities:
- Regular health checks (automated and manual)
- Alert response and investigation
- Trend analysis and capacity planning
- Incident prevention through early detection
Success Metric: Detect and respond to issues before users are significantly impacted
Automated Health Checks
CI/CD Health Check Pipeline
If using CI/CD, leverage automatic health monitoring:
GitHub Actions:
# Runs every 15 minutes (quick check)
# Runs every 6 hours (comprehensive diagnostics)
# See: .github/workflows/health-check.yml
Woodpecker:
# Runs every 15 minutes (quick check)
# Runs every 6 hours (comprehensive diagnostics)
# See: .woodpecker/health-check.yml
Artifacts Generated:
- docker-health.log - Docker container status
- k8s-health.log - Kubernetes deployment status
- k8s-diagnostics.log - Full system diagnostics
- docker-diagnostics.log - Docker system info
- HEALTH_REPORT.md - Summary report
Quick Manual Health Check
# Run this command to get instant health status
export NAMESPACE=vapora
echo "=== Pod Status ==="
kubectl get pods -n $NAMESPACE
echo ""
echo "=== Service Health ==="
kubectl get endpoints -n $NAMESPACE
echo ""
echo "=== Recent Events ==="
kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' | tail -10
echo ""
echo "=== Resource Usage ==="
kubectl top pods -n $NAMESPACE
echo ""
echo "=== API Health ==="
curl -s http://localhost:8001/health | jq .
Manual Daily Monitoring
Morning Check (Start of Business Day)
# Run at start of business day (or when starting shift)
echo "=== MORNING HEALTH CHECK ==="
echo "Date: $(date -u)"
# 1. Cluster Status
echo "Cluster Status:"
kubectl cluster-info | head -n 2
# 2. Node Status
echo ""
echo "Node Status:"
kubectl get nodes
# Should show: All nodes Ready
# 3. Pod Status
echo ""
echo "Pod Status:"
kubectl get pods -n vapora
# Should show: All Running, 1/1 Ready
# 4. Service Endpoints
echo ""
echo "Service Endpoints:"
kubectl get endpoints -n vapora
# Should show: All services have endpoints (not empty)
# 5. Resource Usage
echo ""
echo "Resource Usage:"
kubectl top nodes
kubectl top pods -n vapora | head -10
# 6. Recent Errors
echo ""
echo "Recent Errors (last 1 hour):"
kubectl logs deployment/vapora-backend -n vapora --since=1h | grep -i error | wc -l
# Should show: 0 or very few errors
# 7. Overall Status
echo ""
echo "Overall Status: ✅ Healthy"
# If any issues found: Document and investigate
Mid-Day Check (Every 4-6 hours)
# Quick sanity check during business hours
# 1. Service Responsiveness
curl -s http://localhost:8001/health | jq '.status'
# Should return: "healthy"
# 2. Pod Restart Tracking
kubectl get pods -n vapora -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}'
# Restart count should not be increasing rapidly
# 3. Error Log Check
kubectl logs deployment/vapora-backend -n vapora --since=4h --timestamps | grep ERROR | tail -5
# Should show: Few to no errors
# 4. Performance Check
kubectl top pods -n vapora | tail -5
# CPU/Memory should be in normal range
End-of-Day Check (Before Shift End)
# Summary check before handing off to on-call
echo "=== END OF DAY SUMMARY ==="
# Current status
kubectl get pods -n vapora
kubectl top pods -n vapora
# Any concerning trends?
echo ""
echo "Checking for concerning events..."
kubectl get events -n vapora --sort-by='.lastTimestamp' | grep -i warning
# Any pod restarts?
echo ""
echo "Pod restart status:"
kubectl get pods -n vapora -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.containerStatuses[0].restartCount}{"\n"}{end}' | grep -v ": 0"
# Document for next shift
echo ""
echo "Status for on-call: All normal / Issues detected"
Dashboard Setup & Monitoring
Essential Dashboards to Monitor
If you have Grafana/Prometheus, create these dashboards:
1. Service Health Dashboard
Monitor:
- Pod running count (should be stable at expected count)
- Pod restart count (should not increase rapidly)
- Service endpoint availability (should be >99%)
- API response time (p99, track trends)
Alert if:
- Pod count drops below expected
- Restart count increasing
- Endpoints empty
- Response time >2s
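If Prometheus is scraping the cluster, these signals can also be spot-checked from the command line when verifying an alert. A minimal sketch, assuming Prometheus is reachable at http://prometheus:9090 (adjust the URL to your setup) and kube-state-metrics is installed:
# Pods not in Running phase (should be 0)
curl -s http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=sum(kube_pod_status_phase{namespace="vapora",phase!="Running"})' | jq '.data.result'
# Container restarts over the last hour (should be 0)
curl -s http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=sum(increase(kube_pod_container_status_restarts_total{namespace="vapora"}[1h]))' | jq '.data.result'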
2. Resource Utilization Dashboard
Monitor:
- CPU usage per pod
- Memory usage per pod
- Node capacity (CPU, memory, disk)
- Network I/O
Alert if:
- Any pod >80% CPU/Memory
- Any node >85% capacity
- Memory trending upward consistently
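Without dashboards, the >80% thresholds can be approximated from kubectl top output. A rough sketch that flags vapora pods using more than 800m CPU; adjust the threshold to your pod limits, since kubectl top reports absolute usage rather than a percentage, and the awk assumes millicore output such as 812m:
# Flag pods above ~800m CPU
kubectl top pods -n vapora --no-headers | awk '{cpu=$2; sub(/m$/,"",cpu); if (cpu+0 > 800) print $1, $2, $3}'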
3. Error Rate Dashboard
Monitor:
- 4xx error rate (should be low)
- 5xx error rate (should be minimal)
- Error rate by endpoint
- Error rate by service
Alert if:
- 5xx error rate >5%
- 4xx error rate >10%
- Sudden spike in errors
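If request metrics are not available, a rough error-rate estimate can be pulled from the backend logs. A sketch that assumes the backend logs HTTP status codes; adapt the grep patterns to the actual log format:
# Rough request vs. 5xx counts over the last hour
TOTAL=$(kubectl logs deployment/vapora-backend -n vapora --since=1h | grep -c "HTTP")
ERRORS=$(kubectl logs deployment/vapora-backend -n vapora --since=1h | grep -cE " 5[0-9]{2} ")
echo "requests: $TOTAL, 5xx: $ERRORS"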
4. Application Metrics Dashboard
Monitor:
- Request rate (RPS)
- Request latency (p50, p95, p99)
- Active connections
- Database query time
Alert if:
- Request rate suddenly drops (might indicate outage)
- Latency spikes above baseline
- Database queries slow
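Latency can also be spot-checked directly with curl's timing output; it does not replace p95/p99 panels, but it helps when confirming a latency alert (assumes the usual port-forward to 8001):
# Single-request latency against the health endpoint
curl -s -o /dev/null -w 'status=%{http_code} total=%{time_total}s\n' http://localhost:8001/health
# Repeat a few times to get a feel for variance
for i in 1 2 3 4 5; do
  curl -s -o /dev/null -w '%{time_total}s\n' http://localhost:8001/health
done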
Grafana Setup Example
# If setting up Grafana monitoring
1. Deploy Prometheus scraping Kubernetes metrics
2. Create dashboard with above panels
3. Set alert rules:
- CPU >80%: Warning
- Memory >85%: Warning
- Error rate >5%: Critical
- Pod crashed: Critical
- Response time >2s: Warning
4. Configure notifications to Slack/email
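If the Prometheus Operator is installed, the alert rules above can be declared as a PrometheusRule resource rather than configured in the UI. A minimal sketch for the error-rate rule only; the metric name http_requests_total and its status label are illustrative, not confirmed VAPORA metrics:
kubectl apply -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vapora-alerts
  namespace: vapora
spec:
  groups:
    - name: vapora.rules
      rules:
        - alert: VaporaHighErrorRate
          expr: |
            sum(rate(http_requests_total{namespace="vapora",status=~"5.."}[5m]))
              / sum(rate(http_requests_total{namespace="vapora"}[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "5xx error rate above 5% for 5 minutes"
EOF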
Alert Response Procedures
When Alert Fires
Alert Received
↓
Step 1: Verify it's real (not false alarm)
- Check dashboard
- Check manually (curl endpoints, kubectl get pods)
- Ask in #deployments if unsure
Step 2: Assess severity
- Service completely down? Severity 1
- Service partially degraded? Severity 2
- Warning/trending issue? Severity 3
Step 3: Declare incident (if Severity 1-2)
- Create #incident channel
- Follow Incident Response Runbook
- See: incident-response-runbook.md
Step 4: Investigate (if Severity 3)
- Document in ticket
- Schedule investigation
- Monitor for escalation
Common Alerts & Actions
| Alert | Cause | Response |
|---|---|---|
| Pod CrashLoopBackOff | App crashing | Get logs, fix, restart |
| High CPU >80% | Resource exhausted | Scale up or reduce load |
| High Memory >85% | Memory leak or surge | Investigate or restart |
| Error rate spike | App issue | Check logs, might rollback |
| Response time spike | Slow queries/I/O | Check database, might restart |
| Pod pending | Can't schedule | Check node resources |
| Endpoints empty | Service down | Verify service exists |
| Disk full | Storage exhausted | Clean up or expand |
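For most of these alerts the first move is the same: look at the pod's state and its most recent logs. For example, when the backend hits CrashLoopBackOff:
# Identify the failing pod and the reason it is restarting
kubectl get pods -n vapora | grep -i crash
kubectl describe pod <pod-name> -n vapora | tail -20
# Logs from the crashed container (the previous instance, not the restarted one)
kubectl logs <pod-name> -n vapora --previous --tail=50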
Metric Baselines & Trends
Establishing Baselines
Record these metrics during normal operation:
# CPU per pod (typical)
Backend: 200-400m per pod
Agents: 300-500m per pod
LLM Router: 100-200m per pod
# Memory per pod (typical)
Backend: 256-512Mi per pod
Agents: 128-256Mi per pod
LLM Router: 64-128Mi per pod
# Response time (typical)
Backend: p50: 50ms, p95: 200ms, p99: 500ms
Frontend: Load time: 2-3 seconds
# Error rate (typical)
Backend: 4xx: <1%, 5xx: <0.1%
Frontend: <5% user-visible errors
# Pod restart count
Should remain 0 (no restarts expected in normal operation)
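One lightweight way to capture these baselines is to snapshot kubectl top output on a regular schedule and keep it next to the runbook. A sketch; the file path is only an example:
# Append a timestamped resource snapshot to a baseline log
{
  echo "=== $(date -u) ==="
  kubectl top nodes
  kubectl top pods -n vapora
} >> ./vapora-baselines.log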
Detecting Anomalies
Compare current metrics to baseline:
# If CPU 2x normal:
- Check if load increased
- Check for resource leak
- Monitor for further increase
# If Memory increasing:
- Might indicate memory leak
- Monitor over time (1-2 hours)
- Restart if clearly trending up
# If Error rate 10x:
- Something broke recently
- Check recent deployment
- Consider rollback
# If new process consuming resources:
- Identify the new resource consumer
- Investigate purpose
- Kill if unintended
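When something looks off but the culprit is not obvious, sorting the top output usually points at it quickly:
# Highest CPU and memory consumers in the namespace
kubectl top pods -n vapora --sort-by=cpu | head -5
kubectl top pods -n vapora --sort-by=memory | head -5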
Capacity Planning
When to Scale
Monitor trends and plan ahead:
# Trigger capacity planning if:
- Average CPU >60%
- Average Memory >60%
- Peak usage trending upward
- Disk usage >80%
# Questions to ask:
- Is traffic increasing? Seasonal spike?
- Did we add features? New workload?
- Do we have capacity for growth?
- Should we scale now or wait?
Scaling Actions
# Quick scale (temporary):
kubectl scale deployment/vapora-backend --replicas=5 -n vapora
# Permanent scale (update deployment.yaml):
# Edit: replicas: 5
# Apply: kubectl apply -f deployment.yaml
# Add nodes (infrastructure):
# Contact infrastructure team
# Reduce resource consumption:
# Investigate slow queries, memory leaks, etc.
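If load varies predictably during the day, a HorizontalPodAutoscaler may be a better fit than manual scaling. A minimal sketch; the thresholds are examples and assume CPU requests are set on the deployment:
# Autoscale the backend between 3 and 8 replicas, targeting 70% CPU
kubectl autoscale deployment/vapora-backend -n vapora --min=3 --max=8 --cpu-percent=70
# Check autoscaler status
kubectl get hpa -n vapora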
Log Analysis & Troubleshooting
Checking Logs
# Most recent logs
kubectl logs deployment/vapora-backend -n vapora
# Last N lines
kubectl logs deployment/vapora-backend -n vapora --tail=100
# From specific time
kubectl logs deployment/vapora-backend -n vapora --since=1h
# Follow/tail logs
kubectl logs deployment/vapora-backend -n vapora -f
# From specific pod
kubectl logs pod-name -n vapora
# Previous pod (if crashed)
kubectl logs pod-name -n vapora --previous
Log Patterns to Watch For
# Error patterns
kubectl logs deployment/vapora-backend -n vapora | grep -i "error\|exception\|fatal"
# Database issues
kubectl logs deployment/vapora-backend -n vapora | grep -i "database\|connection\|sql"
# Authentication issues
kubectl logs deployment/vapora-backend -n vapora | grep -i "auth\|permission\|forbidden"
# Resource issues
kubectl logs deployment/vapora-backend -n vapora | grep -i "memory\|cpu\|timeout"
# Startup issues (if pod restarting)
kubectl logs pod-name -n vapora --previous | head -50
Common Log Messages & Meaning
| Log Message | Meaning | Action |
|---|---|---|
| Connection refused | Service not listening | Check if service started |
| Out of memory | Memory exhausted | Increase limits or scale |
| Unauthorized | Auth failed | Check credentials/tokens |
| Database connection timeout | Database unreachable | Check DB health |
| 404 Not Found | Endpoint doesn't exist | Check API routes |
| Slow query | Database query taking time | Optimize query or check DB |
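A quick way to scan for all of these at once; the patterns are approximate and should be adapted to the actual log wording:
# Count occurrences of common problem patterns in the last 24h
for pattern in "connection refused" "out of memory" "unauthorized" "timeout" "slow query"; do
  count=$(kubectl logs deployment/vapora-backend -n vapora --since=24h | grep -ci "$pattern")
  echo "$pattern: $count"
done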
Proactive Monitoring Practices
Weekly Review
# Every Monday (or your weekly cadence):
1. Review incidents from past week
- Were any preventable?
- Any patterns?
2. Check alert tuning
- False alarms?
- Missed issues?
- Adjust thresholds if needed
3. Capacity check
- How much headroom remaining?
- Plan for growth?
4. Log analysis
- Any concerning patterns?
- Warnings that should be errors?
5. Update runbooks if needed
Monthly Review
# First of each month:
1. Performance trends
- Response time trending up/down?
- Error rate changing?
- Resource usage changing?
2. Capacity forecast
- Extrapolate current trends
- Plan for growth
- Schedule scaling if needed
3. Incident review
- MTBF (Mean Time Between Failures)
- MTTR (Mean Time To Resolve)
- MTTI (Mean Time To Identify)
- Are we improving?
4. Tool/alert improvements
- New monitoring needs?
- Alert fatigue issues?
- Better ways to visualize data?
Health Check Checklist
Pre-Deployment Health Check
Before any deployment, verify:
☐ All pods running: kubectl get pods -n vapora
☐ No recent errors: kubectl logs deployment/vapora-backend -n vapora --since=1h
☐ Resource usage normal: kubectl top pods -n vapora
☐ Services healthy: curl -s http://localhost:8001/health
☐ Recent events normal: kubectl get events -n vapora
Post-Deployment Health Check
For the first 2 hours after deployment, verify:
☐ All new pods running
☐ Old pods terminated
☐ Health endpoints responding
☐ No spike in error logs
☐ Resource usage within expected range
☐ Response time normal
☐ No pod restarts
Daily Health Check
Once per business day:
☐ kubectl get pods -n vapora (all Running, 1/1 Ready)
☐ curl -s http://localhost:8001/health (200 OK)
☐ kubectl logs deployment/vapora-backend -n vapora --since=24h | grep ERROR (few to none)
☐ kubectl top pods -n vapora (normal usage)
☐ kubectl get events -n vapora (no warnings)
Monitoring Runbook Checklist
☐ Verified automated health checks running
☐ Manual health checks performed (daily)
☐ Dashboards set up and visible
☐ Alert thresholds tuned
☐ Log patterns identified
☐ Baselines recorded
☐ Escalation procedures understood
☐ Team trained on monitoring
☐ Alert responses tested
☐ Runbooks up to date
Common Monitoring Issues
False Alerts
Problem: Alert fires but service is actually fine
Solution:
- Verify manually (don't just assume it's a false alarm)
- Check alert threshold (might be too sensitive)
- Adjust threshold if consistently false
- Document the change
Alert Fatigue
Problem: Too many alerts, getting ignored
Solution:
- Review all alerts
- Disable/adjust non-actionable ones
- Consolidate related alerts
- Focus on critical-only alerts
Missing Alerts
Problem: Issue happens but no alert fired
Solution:
- Investigate why alert didn't fire
- Check alert condition
- Add new alert for this issue
- Test the new alert
Lag in Monitoring
Problem: Dashboard/alerts slow to update
Solution:
- Check monitoring system performance
- Increase scrape frequency if appropriate
- Reduce data retention if storage issue
- Investigate database performance
Monitoring Tools & Commands
kubectl Commands
# Pod monitoring
kubectl get pods -n vapora
kubectl get pods -n vapora -w # Watch mode
kubectl describe pod <pod> -n vapora
kubectl logs <pod> -n vapora -f
# Resource monitoring
kubectl top nodes
kubectl top pods -n vapora
kubectl describe nodes
# Event monitoring
kubectl get events -n vapora --sort-by='.lastTimestamp'
kubectl get events -n vapora --watch
# Health checks
kubectl get --raw /healthz # API health
Useful Commands
# Check API responsiveness
curl -v http://localhost:8001/health
# Check all endpoints have pods
for svc in backend agents llm-router; do
echo "$svc endpoints:"
kubectl get endpoints vapora-$svc -n vapora
done
# Monitor pod restarts
watch 'kubectl get pods -n vapora -o jsonpath="{range .items[*]}{.metadata.name}{\" \"}{.status.containerStatuses[0].restartCount}{\"\\n\"}{end}"'
# Find pods with high restarts
kubectl get pods -n vapora -o json | jq '.items[] | select(.status.containerStatuses[0].restartCount > 5) | .metadata.name'
Next Steps
- Set up dashboards - Create Grafana/Prometheus dashboards if not available
- Configure alerts - Set thresholds based on baselines
- Test alerting - Verify Slack/email notifications work
- Train team - Ensure everyone knows how to read dashboards
- Document baselines - Record normal metrics for comparison
- Automate checks - Use CI/CD health check pipelines
- Review regularly - Weekly/monthly health check reviews
Last Updated: 2026-01-12 Status: Production-ready