# Monitoring & Health Check Operations

Guide for continuous monitoring and health checks of VAPORA in production.

---

## Overview

**Responsibility**: Maintain visibility into VAPORA service health through monitoring, logging, and alerting

**Key Activities**:
- Regular health checks (automated and manual)
- Alert response and investigation
- Trend analysis and capacity planning
- Incident prevention through early detection

**Success Metric**: Detect and respond to issues before users are significantly impacted

---

## Automated Health Checks

### Kubernetes Health Check Pipeline

If using CI/CD, leverage automatic health monitoring:

**GitHub Actions**:
```bash
# Runs every 15 minutes (quick check)
# Runs every 6 hours (comprehensive diagnostics)
# See: .github/workflows/health-check.yml
```

**Woodpecker**:
```bash
# Runs every 15 minutes (quick check)
# Runs every 6 hours (comprehensive diagnostics)
# See: .woodpecker/health-check.yml
```

**Artifacts Generated**:
- `docker-health.log` - Docker container status
- `k8s-health.log` - Kubernetes deployment status
- `k8s-diagnostics.log` - Full system diagnostics
- `docker-diagnostics.log` - Docker system info
- `HEALTH_REPORT.md` - Summary report

### Quick Manual Health Check

```bash
# Run these commands to get an instant health status
export NAMESPACE=vapora

echo "=== Pod Status ==="
kubectl get pods -n $NAMESPACE

echo ""
echo "=== Service Health ==="
kubectl get endpoints -n $NAMESPACE

echo ""
echo "=== Recent Events ==="
kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' | tail -10

echo ""
echo "=== Resource Usage ==="
kubectl top pods -n $NAMESPACE

echo ""
echo "=== API Health ==="
curl -s http://localhost:8001/health | jq .
```

---

## Manual Daily Monitoring

### Morning Check (Start of Business Day)

```bash
# Run at start of business day (or when starting shift)
echo "=== MORNING HEALTH CHECK ==="
echo "Date: $(date -u)"

# 1. Cluster Status
echo "Cluster Status:"
kubectl cluster-info | grep server

# 2. Node Status
echo ""
echo "Node Status:"
kubectl get nodes
# Should show: All nodes Ready

# 3. Pod Status
echo ""
echo "Pod Status:"
kubectl get pods -n vapora
# Should show: All Running, 1/1 Ready

# 4. Service Endpoints
echo ""
echo "Service Endpoints:"
kubectl get endpoints -n vapora
# Should show: All services have endpoints (not empty)

# 5. Resource Usage
echo ""
echo "Resource Usage:"
kubectl top nodes
kubectl top pods -n vapora | head -10

# 6. Recent Errors
echo ""
echo "Recent Errors (last 1 hour):"
kubectl logs deployment/vapora-backend -n vapora --since=1h | grep -i error | wc -l
# Should show: 0 or very few errors

# 7. Overall Status
echo ""
echo "Overall Status: ✅ Healthy"
# If any issues found: Document and investigate
```

### Mid-Day Check (Every 4-6 hours)

```bash
# Quick sanity check during business hours

# 1. Service Responsiveness
curl -s http://localhost:8001/health | jq '.status'
# Should return: "healthy"

# 2. Pod Restart Tracking
kubectl get pods -n vapora -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}'
# Restart count should not be increasing rapidly

# 3. Error Log Check
kubectl logs deployment/vapora-backend -n vapora --since=4h --timestamps | grep ERROR | tail -5
# Should show: Few to no errors

# 4. Performance Check
kubectl top pods -n vapora | tail -5
# CPU/Memory should be in normal range
```
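The mid-day checks above also lend themselves to an unattended variant that can run from cron or CI between manual passes. Below is a minimal sketch, assuming the `vapora` namespace and the backend `/health` endpoint on `localhost:8001` used throughout this guide; the restart and error thresholds are illustrative values, not project-defined limits.

```bash
#!/usr/bin/env bash
# Sketch: unattended mid-day check with simple pass/fail thresholds.
# Assumptions: namespace "vapora", backend health endpoint on localhost:8001,
# MAX_RESTARTS / MAX_ERRORS are illustrative and should be tuned to your baselines.
set -euo pipefail

NAMESPACE=vapora
MAX_RESTARTS=3
MAX_ERRORS=10
STATUS=0

# 1. API health
if ! curl -sf http://localhost:8001/health >/dev/null; then
  echo "FAIL: /health not responding"
  STATUS=1
fi

# 2. Pod restarts above threshold
while read -r name restarts; do
  if [ "${restarts:-0}" -gt "$MAX_RESTARTS" ]; then
    echo "FAIL: $name has $restarts restarts"
    STATUS=1
  fi
done < <(kubectl get pods -n "$NAMESPACE" \
  -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.containerStatuses[0].restartCount}{"\n"}{end}')

# 3. Error volume in the last 4 hours
ERRORS=$(kubectl logs deployment/vapora-backend -n "$NAMESPACE" --since=4h 2>/dev/null | grep -c ERROR || true)
if [ "$ERRORS" -gt "$MAX_ERRORS" ]; then
  echo "FAIL: $ERRORS errors in backend logs (last 4h)"
  STATUS=1
fi

[ "$STATUS" -eq 0 ] && echo "OK: mid-day check passed"
exit "$STATUS"
```

The non-zero exit status on any failed check makes the script easy to wire into a cron job or pipeline step that notifies on failure.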
### End-of-Day Check (Before Shift End)

```bash
# Summary check before handing off to on-call
echo "=== END OF DAY SUMMARY ==="

# Current status
kubectl get pods -n vapora
kubectl top pods -n vapora

# Any concerning trends?
echo ""
echo "Checking for concerning events..."
kubectl get events -n vapora --sort-by='.lastTimestamp' | grep -i warning

# Any pod restarts?
echo ""
echo "Pod restart status:"
kubectl get pods -n vapora -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.containerStatuses[0].restartCount}{"\n"}{end}' | grep -v ": 0"

# Document for next shift
echo ""
echo "Status for on-call: All normal / Issues detected"
```

---

## Dashboard Setup & Monitoring

### Essential Dashboards to Monitor

If you have Grafana/Prometheus, create these dashboards:

#### 1. Service Health Dashboard

Monitor:
- Pod running count (should be stable at the expected count)
- Pod restart count (should not increase rapidly)
- Service endpoint availability (should be >99%)
- API response time (p99, track trends)

**Alert if:**
- Pod count drops below expected
- Restart count is increasing
- Endpoints are empty
- Response time >2s

#### 2. Resource Utilization Dashboard

Monitor:
- CPU usage per pod
- Memory usage per pod
- Node capacity (CPU, memory, disk)
- Network I/O

**Alert if:**
- Any pod >80% CPU/Memory
- Any node >85% capacity
- Memory trending upward consistently

#### 3. Error Rate Dashboard

Monitor:
- 4xx error rate (should be low)
- 5xx error rate (should be minimal)
- Error rate by endpoint
- Error rate by service

**Alert if:**
- 5xx error rate >5%
- 4xx error rate >10%
- Sudden spike in errors

#### 4. Application Metrics Dashboard

Monitor:
- Request rate (RPS)
- Request latency (p50, p95, p99)
- Active connections
- Database query time

**Alert if:**
- Request rate suddenly drops (might indicate an outage)
- Latency spikes above baseline
- Database queries slow down

### Grafana Setup Example

```
# If setting up Grafana monitoring
1. Deploy Prometheus scraping Kubernetes metrics
2. Create a dashboard with the panels above
3. Set alert rules:
   - CPU >80%: Warning
   - Memory >85%: Warning
   - Error rate >5%: Critical
   - Pod crashed: Critical
   - Response time >2s: Warning
4. Configure notifications to Slack/email
```

---

## Alert Response Procedures

### When Alert Fires

```
Alert Received
      ↓
Step 1: Verify it's real (not a false alarm)
  - Check dashboard
  - Check manually (curl endpoints, kubectl get pods)
  - Ask in #deployments if unsure

Step 2: Assess severity
  - Service completely down?      Severity 1
  - Service partially degraded?   Severity 2
  - Warning/trending issue?       Severity 3

Step 3: Declare incident (if Severity 1-2)
  - Create #incident channel
  - Follow Incident Response Runbook
  - See: incident-response-runbook.md

Step 4: Investigate (if Severity 3)
  - Document in ticket
  - Schedule investigation
  - Monitor for escalation
```
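To make Step 2 faster under pressure, the manual checks can be folded into a small triage helper. Below is a minimal sketch, assuming the `vapora` namespace and the backend health endpoint on `localhost:8001` used elsewhere in this guide; the severity mapping mirrors the flow above, but "any pod not Running" as the Severity 2 cutoff is an illustrative heuristic, not a project rule.

```bash
#!/usr/bin/env bash
# Sketch: quick severity triage for Step 2.
# Assumptions: namespace "vapora", health endpoint on localhost:8001;
# the severity thresholds below are illustrative heuristics.
NAMESPACE=vapora

TOTAL=$(kubectl get pods -n "$NAMESPACE" --no-headers 2>/dev/null | wc -l)
RUNNING=$(kubectl get pods -n "$NAMESPACE" --no-headers 2>/dev/null | grep -c ' Running ' || true)

if curl -sf http://localhost:8001/health >/dev/null; then
  HEALTH=ok
else
  HEALTH=down
fi

if [ "$HEALTH" = "down" ] && [ "$RUNNING" -eq 0 ]; then
  echo "Severity 1: service appears completely down ($RUNNING/$TOTAL pods Running)"
elif [ "$HEALTH" = "down" ] || [ "$RUNNING" -lt "$TOTAL" ]; then
  echo "Severity 2: service partially degraded (health: $HEALTH, $RUNNING/$TOTAL pods Running)"
else
  echo "Severity 3 (or false alarm): /health responding, $RUNNING/$TOTAL pods Running; keep investigating"
fi
```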
### Common Alerts & Actions

| Alert | Cause | Response |
|-------|-------|----------|
| **Pod CrashLoopBackOff** | App crashing | Get logs, fix, restart |
| **High CPU >80%** | Resources exhausted | Scale up or reduce load |
| **High Memory >85%** | Memory leak or surge | Investigate or restart |
| **Error rate spike** | App issue | Check logs; consider rollback |
| **Response time spike** | Slow queries/I/O | Check database; consider restart |
| **Pod pending** | Can't schedule | Check node resources |
| **Endpoints empty** | Service down | Verify the service exists |
| **Disk full** | Storage exhausted | Clean up or expand |

---

## Metric Baselines & Trends

### Establishing Baselines

Record these metrics during normal operation:

```
# CPU per pod (typical)
Backend:    200-400m per pod
Agents:     300-500m per pod
LLM Router: 100-200m per pod

# Memory per pod (typical)
Backend:    256-512Mi per pod
Agents:     128-256Mi per pod
LLM Router: 64-128Mi per pod

# Response time (typical)
Backend:  p50: 50ms, p95: 200ms, p99: 500ms
Frontend: Load time: 2-3 seconds

# Error rate (typical)
Backend:  4xx: <1%, 5xx: <0.1%
Frontend: <5% user-visible errors

# Pod restart count
Should remain 0 (no restarts expected in normal operation)
```

### Detecting Anomalies

Compare current metrics to the baseline:

```
# If CPU is 2x normal:
- Check if load increased
- Check for a resource leak
- Monitor for further increase

# If memory is increasing:
- Might indicate a memory leak
- Monitor over time (1-2 hours)
- Restart if clearly trending up

# If the error rate is 10x normal:
- Something broke recently
- Check recent deployments
- Consider rollback

# If a new process is consuming resources:
- Identify the new resource consumer
- Investigate its purpose
- Kill it if unintended
```

---

## Capacity Planning

### When to Scale

Monitor trends and plan ahead:

```
# Trigger capacity planning if:
- Average CPU >60%
- Average Memory >60%
- Peak usage trending upward
- Disk usage >80%

# Questions to ask:
- Is traffic increasing? Seasonal spike?
- Did we add features? New workload?
- Do we have capacity for growth?
- Should we scale now or wait?
```

### Scaling Actions

```bash
# Quick scale (temporary):
kubectl scale deployment/vapora-backend --replicas=5 -n vapora

# Permanent scale (update deployment.yaml):
#   Edit: replicas: 5
#   Apply: kubectl apply -f deployment.yaml

# Add nodes (infrastructure):
#   Contact the infrastructure team

# Reduce resource consumption:
#   Investigate slow queries, memory leaks, etc.
```
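If scaling should track load automatically instead of being applied by hand, a HorizontalPodAutoscaler can complement the manual actions above. The sketch below uses `kubectl autoscale`; the replica bounds and the 70% CPU target are assumptions to tune against your recorded baselines, and autoscaling relies on the same metrics API that `kubectl top` already depends on.

```bash
# Sketch: CPU-based autoscaling for the backend deployment.
# The bounds (min 3, max 8) and the 70% CPU target are assumptions;
# tune them against the baselines recorded above before relying on this.
kubectl autoscale deployment/vapora-backend \
  --min=3 --max=8 --cpu-percent=70 -n vapora

# Verify the autoscaler and watch current vs. target utilization
kubectl get hpa -n vapora
kubectl describe hpa vapora-backend -n vapora
```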
---

## Log Analysis & Troubleshooting

### Checking Logs

```bash
# Most recent logs
kubectl logs deployment/vapora-backend -n vapora

# Last N lines
kubectl logs deployment/vapora-backend -n vapora --tail=100

# From a specific time window
kubectl logs deployment/vapora-backend -n vapora --since=1h

# Follow/tail logs
kubectl logs deployment/vapora-backend -n vapora -f

# From a specific pod
kubectl logs pod-name -n vapora

# Previous container (if the pod crashed)
kubectl logs pod-name -n vapora --previous
```

### Log Patterns to Watch For

```bash
# Error patterns
kubectl logs deployment/vapora-backend -n vapora | grep -i "error\|exception\|fatal"

# Database issues
kubectl logs deployment/vapora-backend -n vapora | grep -i "database\|connection\|sql"

# Authentication issues
kubectl logs deployment/vapora-backend -n vapora | grep -i "auth\|permission\|forbidden"

# Resource issues
kubectl logs deployment/vapora-backend -n vapora | grep -i "memory\|cpu\|timeout"

# Startup issues (if a pod is restarting)
kubectl logs pod-name -n vapora --previous | head -50
```

### Common Log Messages & Meaning

| Log Message | Meaning | Action |
|---|---|---|
| `Connection refused` | Service not listening | Check if the service started |
| `Out of memory` | Memory exhausted | Increase limits or scale |
| `Unauthorized` | Auth failed | Check credentials/tokens |
| `Database connection timeout` | Database unreachable | Check DB health |
| `404 Not Found` | Endpoint doesn't exist | Check API routes |
| `Slow query` | Database query taking too long | Optimize the query or check DB health |

---

## Proactive Monitoring Practices

### Weekly Review

```
# Every Monday (or your weekly cadence):

1. Review incidents from the past week
   - Were any preventable?
   - Any patterns?

2. Check alert tuning
   - False alarms?
   - Missed issues?
   - Adjust thresholds if needed

3. Capacity check
   - How much headroom remains?
   - Plan for growth?

4. Log analysis
   - Any concerning patterns?
   - Warnings that should be errors?

5. Update runbooks if needed
```

### Monthly Review

```
# First of each month:

1. Performance trends
   - Response time trending up or down?
   - Error rate changing?
   - Resource usage changing?

2. Capacity forecast
   - Extrapolate current trends
   - Plan for growth
   - Schedule scaling if needed

3. Incident review
   - MTBF (Mean Time Between Failures)
   - MTTR (Mean Time To Resolve)
   - MTTI (Mean Time To Identify)
   - Are we improving?

4. Tool/alert improvements
   - New monitoring needs?
   - Alert fatigue issues?
   - Better ways to visualize data?
```
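Much of the data for these reviews can be collected up front by a small summary script. Below is a rough sketch, assuming the `vapora` namespace; the deployment names in the loop are examples (match them to whatever `kubectl get deployments -n vapora` reports), warning events only go back as far as the cluster's event TTL, and error counts only cover logs still retained.

```bash
#!/usr/bin/env bash
# Sketch: gather inputs for the weekly review.
# Assumptions: namespace "vapora"; the deployment names below are examples;
# log and event retention limit how far back the counts can reach.
NAMESPACE=vapora

echo "=== WEEKLY REVIEW SUMMARY ($(date -u +%F)) ==="

echo ""
echo "Pod restart counts:"
kubectl get pods -n "$NAMESPACE" \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.containerStatuses[0].restartCount}{"\n"}{end}'

echo ""
echo "Recent warning events (limited by event TTL):"
kubectl get events -n "$NAMESPACE" --field-selector type=Warning | tail -20

echo ""
echo "Error counts per deployment (last 7 days of retained logs):"
for deploy in vapora-backend vapora-agents vapora-llm-router; do
  count=$(kubectl logs "deployment/$deploy" -n "$NAMESPACE" --since=168h 2>/dev/null | grep -ci error || true)
  echo "  $deploy: ${count:-0}"
done
```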
---

## Health Check Checklist

### Pre-Deployment Health Check

```
Before any deployment, verify:

☐ All pods running: kubectl get pods
☐ No recent errors: kubectl logs --since=1h
☐ Resource usage normal: kubectl top pods
☐ Services healthy: curl /health
☐ Recent events normal: kubectl get events
```

### Post-Deployment Health Check

```
After deployment, verify for 2 hours:

☐ All new pods running
☐ Old pods terminated
☐ Health endpoints responding
☐ No spike in error logs
☐ Resource usage within expected range
☐ Response time normal
☐ No pod restarts
```

### Daily Health Check

```
Once per business day:

☐ kubectl get pods (all Running, 1/1 Ready)
☐ curl http://localhost:8001/health (200 OK)
☐ kubectl logs --since=24h | grep ERROR (few to none)
☐ kubectl top pods (normal usage)
☐ kubectl get events (no warnings)
```

---

## Monitoring Runbook Checklist

```
☐ Verified automated health checks running
☐ Manual health checks performed (daily)
☐ Dashboards set up and visible
☐ Alert thresholds tuned
☐ Log patterns identified
☐ Baselines recorded
☐ Escalation procedures understood
☐ Team trained on monitoring
☐ Alert responses tested
☐ Runbooks up to date
```

---

## Common Monitoring Issues

### False Alerts

**Problem**: Alert fires but the service is actually fine

**Solution**:
1. Verify manually (don't just assume it's false)
2. Check the alert threshold (it might be too sensitive)
3. Adjust the threshold if it is consistently false
4. Document the change

### Alert Fatigue

**Problem**: Too many alerts, so they get ignored

**Solution**:
1. Review all alerts
2. Disable or adjust non-actionable ones
3. Consolidate related alerts
4. Focus on critical-only alerts

### Missing Alerts

**Problem**: An issue happens but no alert fires

**Solution**:
1. Investigate why the alert didn't fire
2. Check the alert condition
3. Add a new alert for this issue
4. Test the new alert

### Lag in Monitoring

**Problem**: Dashboards/alerts are slow to update

**Solution**:
1. Check monitoring system performance
2. Increase scrape frequency if appropriate
3. Reduce data retention if storage is the issue
4. Investigate database performance

---

## Monitoring Tools & Commands

### kubectl Commands

```bash
# Pod monitoring
kubectl get pods -n vapora
kubectl get pods -n vapora -w              # Watch mode
kubectl describe pod pod-name -n vapora
kubectl logs pod-name -n vapora -f

# Resource monitoring
kubectl top nodes
kubectl top pods -n vapora
kubectl describe nodes

# Event monitoring
kubectl get events -n vapora --sort-by='.lastTimestamp'
kubectl get events -n vapora --watch

# Health checks
kubectl get --raw /healthz                 # API server health
```

### Useful Commands

```bash
# Check API responsiveness
curl -v http://localhost:8001/health

# Check all endpoints have pods
for svc in backend agents llm-router; do
  echo "$svc endpoints:"
  kubectl get endpoints vapora-$svc -n vapora
done

# Monitor pod restarts
watch 'kubectl get pods -n vapora -o jsonpath="{range .items[*]}{.metadata.name}{\" \"}{.status.containerStatuses[0].restartCount}{\"\\n\"}{end}"'

# Find pods with high restart counts
kubectl get pods -n vapora -o json | jq '.items[] | select(.status.containerStatuses[0].restartCount > 5) | .metadata.name'
```
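When testing alerting (see "Test alerting" under Next Steps below), it helps to verify the notification path independently of the monitoring stack. A minimal sketch using Slack's incoming-webhook API is shown here; `SLACK_WEBHOOK_URL` is a placeholder for whatever webhook your workspace provides.

```bash
# Sketch: confirm the Slack notification path works end to end.
# SLACK_WEBHOOK_URL is a placeholder; export the incoming-webhook URL
# configured for your alert channel before running this.
export SLACK_WEBHOOK_URL="https://hooks.slack.com/services/XXX/YYY/ZZZ"   # placeholder

curl -sf -X POST \
  -H 'Content-type: application/json' \
  --data '{"text":"VAPORA monitoring test: alert delivery path OK"}' \
  "$SLACK_WEBHOOK_URL" \
  && echo "Slack test message sent" \
  || echo "Slack test message FAILED: check the webhook URL and network egress"
```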
---

## Next Steps

1. **Set up dashboards** - Create Grafana/Prometheus dashboards if not available
2. **Configure alerts** - Set thresholds based on baselines
3. **Test alerting** - Verify Slack/email notifications work
4. **Train team** - Ensure everyone knows how to read dashboards
5. **Document baselines** - Record normal metrics for comparison
6. **Automate checks** - Use CI/CD health check pipelines
7. **Review regularly** - Weekly/monthly health check reviews

---

**Last Updated**: 2026-01-12
**Status**: Production-ready