# Monitoring & Health Check Operations

Guide for continuous monitoring and health checks of VAPORA in production.

---

## Overview

**Responsibility**: Maintain visibility into VAPORA service health through monitoring, logging, and alerting

**Key Activities**:

- Regular health checks (automated and manual)
- Alert response and investigation
- Trend analysis and capacity planning
- Incident prevention through early detection

**Success Metric**: Detect and respond to issues before users are significantly impacted

---

## Automated Health Checks

### Kubernetes Health Check Pipeline

If you use CI/CD, you can leverage automated health monitoring:

**GitHub Actions**:

```bash
# Runs every 15 minutes (quick check)
# Runs every 6 hours (comprehensive diagnostics)
# See: .github/workflows/health-check.yml
```

**Woodpecker**:

```bash
# Runs every 15 minutes (quick check)
# Runs every 6 hours (comprehensive diagnostics)
# See: .woodpecker/health-check.yml
```

**Artifacts Generated**:

- `docker-health.log` - Docker container status
- `k8s-health.log` - Kubernetes deployment status
- `k8s-diagnostics.log` - Full system diagnostics
- `docker-diagnostics.log` - Docker system info
- `HEALTH_REPORT.md` - Summary report

### Quick Manual Health Check

```bash
# Run this block to get an instant health status
export NAMESPACE=vapora

echo "=== Pod Status ==="
kubectl get pods -n $NAMESPACE
echo ""

echo "=== Service Health ==="
kubectl get endpoints -n $NAMESPACE
echo ""

echo "=== Recent Events ==="
kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' | tail -10
echo ""

echo "=== Resource Usage ==="
kubectl top pods -n $NAMESPACE
echo ""

echo "=== API Health ==="
# Assumes the backend is reachable on localhost:8001 (e.g. via kubectl port-forward)
curl -s http://localhost:8001/health | jq .
```

---

## Manual Daily Monitoring

### Morning Check (Start of Business Day)

```bash
# Run at start of business day (or when starting shift)

echo "=== MORNING HEALTH CHECK ==="
echo "Date: $(date -u)"

# 1. Cluster Status
echo "Cluster Status:"
kubectl cluster-info | grep server

# 2. Node Status
echo ""
echo "Node Status:"
kubectl get nodes
# Should show: all nodes Ready

# 3. Pod Status
echo ""
echo "Pod Status:"
kubectl get pods -n vapora
# Should show: all Running, 1/1 Ready

# 4. Service Endpoints
echo ""
echo "Service Endpoints:"
kubectl get endpoints -n vapora
# Should show: all services have endpoints (not empty)

# 5. Resource Usage
echo ""
echo "Resource Usage:"
kubectl top nodes
kubectl top pods -n vapora | head -10

# 6. Recent Errors
echo ""
echo "Recent Errors (last 1 hour):"
kubectl logs deployment/vapora-backend -n vapora --since=1h | grep -i error | wc -l
# Should show: 0 or very few errors

# 7. Overall Status
echo ""
echo "Overall Status: ✅ Healthy"
# If any issues are found: document and investigate
```

### Mid-Day Check (Every 4-6 hours)

```bash
# Quick sanity check during business hours

# 1. Service Responsiveness
curl -s http://localhost:8001/health | jq '.status'
# Should return: "healthy"

# 2. Pod Restart Tracking
kubectl get pods -n vapora -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}'
# Restart counts should not be increasing rapidly

# 3. Error Log Check
kubectl logs deployment/vapora-backend -n vapora --since=4h --timestamps | grep ERROR | tail -5
# Should show: few to no errors

# 4. Performance Check
kubectl top pods -n vapora | tail -5
# CPU/memory should be in the normal range
```

### End-of-Day Check (Before Shift End)

```bash
# Summary check before handing off to on-call

echo "=== END OF DAY SUMMARY ==="

# Current status
kubectl get pods -n vapora
kubectl top pods -n vapora

# Any concerning trends?
echo ""
echo "Checking for concerning events..."
kubectl get events -n vapora --sort-by='.lastTimestamp' | grep -i warning

# Any pod restarts?
echo ""
echo "Pod restart status:"
kubectl get pods -n vapora -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.containerStatuses[0].restartCount}{"\n"}{end}' | grep -v ": 0"

# Document for next shift
echo ""
echo "Status for on-call: All normal / Issues detected"
```

---

## Dashboard Setup & Monitoring

### Essential Dashboards to Monitor

If you have Grafana/Prometheus, create these dashboards:

#### 1. Service Health Dashboard

Monitor:

- Pod running count (should be stable at the expected count)
- Pod restart count (should not increase rapidly)
- Service endpoint availability (should be >99%)
- API response time (p99, track trends)

**Alert if:**

- Pod count drops below expected
- Restart count is increasing
- Endpoints are empty
- Response time >2s
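
For the pod-count and restart panels, queries can be tested from the command line before wiring them into Grafana. A minimal sketch, assuming Prometheus scrapes kube-state-metrics (the metric names below come from kube-state-metrics) and is reachable at `http://prometheus:9090`, a placeholder URL:

```bash
# Sketch: query Prometheus' HTTP API for the panel data above.
# Assumes kube-state-metrics is scraped; the Prometheus URL is a placeholder.
PROM=http://prometheus:9090

# Running pod count in the vapora namespace
curl -s "$PROM/api/v1/query" \
  --data-urlencode 'query=sum(kube_pod_status_phase{namespace="vapora",phase="Running"})' \
  | jq '.data.result'

# Container restarts over the last hour (should stay at 0)
curl -s "$PROM/api/v1/query" \
  --data-urlencode 'query=sum(increase(kube_pod_container_status_restarts_total{namespace="vapora"}[1h]))' \
  | jq '.data.result'
```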

#### 2. Resource Utilization Dashboard

Monitor:

- CPU usage per pod
- Memory usage per pod
- Node capacity (CPU, memory, disk)
- Network I/O

**Alert if:**

- Any pod >80% CPU/memory
- Any node >85% capacity
- Memory trending upward consistently

#### 3. Error Rate Dashboard

Monitor:

- 4xx error rate (should be low)
- 5xx error rate (should be minimal)
- Error rate by endpoint
- Error rate by service

**Alert if:**

- 5xx error rate >5%
- 4xx error rate >10%
- Sudden spike in errors
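
The 5xx alert threshold above can be expressed as a single PromQL ratio. A hedged sketch: the counter name `http_requests_total` and its `status` label are assumptions about how the backend exports request metrics, so substitute whatever VAPORA actually exposes:

```bash
# Sketch: 5xx error ratio over the last 5 minutes.
# http_requests_total and its labels are assumed metric names.
PROM=http://prometheus:9090
QUERY='sum(rate(http_requests_total{namespace="vapora",status=~"5.."}[5m])) / sum(rate(http_requests_total{namespace="vapora"}[5m]))'

curl -s "$PROM/api/v1/query" --data-urlencode "query=$QUERY" \
  | jq -r '.data.result[0].value[1]'
# Alert threshold from this dashboard: fire when the value exceeds 0.05 (5%)
```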

#### 4. Application Metrics Dashboard

Monitor:

- Request rate (RPS)
- Request latency (p50, p95, p99)
- Active connections
- Database query time

**Alert if:**

- Request rate suddenly drops (might indicate an outage)
- Latency spikes above baseline
- Database queries slow down

### Grafana Setup Example

If you are setting up Grafana monitoring:

1. Deploy Prometheus scraping Kubernetes metrics
2. Create a dashboard with the panels above
3. Set alert rules:
   - CPU >80%: Warning
   - Memory >85%: Warning
   - Error rate >5%: Critical
   - Pod crashed: Critical
   - Response time >2s: Warning
4. Configure notifications to Slack/email
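
As one way to codify the first rule, here is a hedged sketch of a `PrometheusRule` object applied via a heredoc. It assumes the Prometheus Operator (`monitoring.coreos.com/v1` CRDs) is installed and that cAdvisor and kube-state-metrics metrics are available; treat it as a starting point, not a tuned rule:

```bash
# Sketch only: requires Prometheus Operator CRDs plus cAdvisor/kube-state-metrics.
kubectl apply -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vapora-resource-alerts
  namespace: vapora
spec:
  groups:
    - name: vapora.resources
      rules:
        - alert: VaporaHighCPU
          # Pod CPU usage relative to its configured CPU limit, averaged over 5m
          expr: |
            sum(rate(container_cpu_usage_seconds_total{namespace="vapora"}[5m])) by (pod)
              / sum(kube_pod_container_resource_limits{namespace="vapora",resource="cpu"}) by (pod)
              > 0.80
          for: 10m
          labels:
            severity: warning
EOF
```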

---

## Alert Response Procedures

### When Alert Fires

```
Alert Received
      ↓
Step 1: Verify it's real (not a false alarm)
  - Check the dashboard
  - Check manually (curl endpoints, kubectl get pods)
  - Ask in #deployments if unsure

Step 2: Assess severity
  - Service completely down? Severity 1
  - Service partially degraded? Severity 2
  - Warning/trending issue? Severity 3

Step 3: Declare an incident (if Severity 1-2)
  - Create an #incident channel
  - Follow the Incident Response Runbook
  - See: incident-response-runbook.md

Step 4: Investigate (if Severity 3)
  - Document in a ticket
  - Schedule investigation
  - Monitor for escalation
```
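
For Step 1, the manual verification can be scripted. A small sketch, assuming the backend health endpoint is reachable on localhost:8001 (e.g. via `kubectl port-forward`):

```bash
# Quick "is it real?" check before declaring an incident
echo "HTTP status from /health:"
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8001/health

echo "Pods not in Running phase:"
kubectl get pods -n vapora --field-selector=status.phase!=Running

echo "Recent warning events:"
kubectl get events -n vapora --sort-by='.lastTimestamp' | grep -i warning | tail -5
```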

### Common Alerts & Actions

| Alert | Cause | Response |
|-------|-------|----------|
| **Pod CrashLoopBackOff** | App crashing | Get logs, fix, restart |
| **High CPU >80%** | Resources exhausted | Scale up or reduce load |
| **High Memory >85%** | Memory leak or surge | Investigate or restart |
| **Error rate spike** | App issue | Check logs, consider rollback |
| **Response time spike** | Slow queries/I/O | Check database, consider restart |
| **Pod pending** | Can't schedule | Check node resources |
| **Endpoints empty** | Service down | Verify the service exists |
| **Disk full** | Storage exhausted | Clean up or expand |
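
For the most frequent of these, **Pod CrashLoopBackOff**, the standard triage sequence is:

```bash
# Triage a crash-looping pod (replace <pod> with the failing pod's name)
kubectl describe pod <pod> -n vapora     # events, exit codes, probe failures
kubectl logs <pod> -n vapora --previous  # logs from the crashed container
# After fixing the cause, roll the deployment:
kubectl rollout restart deployment/vapora-backend -n vapora
```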

---

## Metric Baselines & Trends

### Establishing Baselines

Record these metrics during normal operation:

```
# CPU per pod (typical)
Backend: 200-400m per pod
Agents: 300-500m per pod
LLM Router: 100-200m per pod

# Memory per pod (typical)
Backend: 256-512Mi per pod
Agents: 128-256Mi per pod
LLM Router: 64-128Mi per pod

# Response time (typical)
Backend: p50: 50ms, p95: 200ms, p99: 500ms
Frontend: load time of 2-3 seconds

# Error rate (typical)
Backend: 4xx: <1%, 5xx: <0.1%
Frontend: <5% user-visible errors

# Pod restart count
Should remain 0 (no restarts expected in normal operation)
```
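
To make these baselines reproducible, snapshot live usage on a known-good day and keep the files for comparison; a minimal sketch:

```bash
# Capture a dated baseline snapshot during normal operation
DATE=$(date +%F)
kubectl top pods -n vapora > "baseline-pods-$DATE.txt"
kubectl top nodes > "baseline-nodes-$DATE.txt"
echo "Baselines written to baseline-pods-$DATE.txt and baseline-nodes-$DATE.txt"
```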

### Detecting Anomalies

Compare current metrics to baseline:

```
# If CPU is 2x normal:
- Check whether load increased
- Check for a resource leak
- Monitor for further increase

# If memory is increasing:
- Might indicate a memory leak
- Monitor over time (1-2 hours)
- Restart if clearly trending up

# If error rate is 10x normal:
- Something broke recently
- Check recent deployments
- Consider rollback

# If a new process is consuming resources:
- Identify the new resource consumer
- Investigate its purpose
- Kill it if unintended
```

---

## Capacity Planning

### When to Scale

Monitor trends and plan ahead:

```
# Trigger capacity planning if:
- Average CPU >60%
- Average memory >60%
- Peak usage trending upward
- Disk usage >80%

# Questions to ask:
- Is traffic increasing? Seasonal spike?
- Did we add features? New workload?
- Do we have capacity for growth?
- Should we scale now or wait?
```
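
A quick way to check node headroom against these thresholds from the command line; a sketch that parses `kubectl top nodes` output (requires metrics-server):

```bash
# Flag nodes above 60% CPU or memory (the capacity-planning trigger above).
# Columns from `kubectl top nodes`: NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
kubectl top nodes --no-headers | awk '{
  cpu = $3 + 0; mem = $5 + 0
  flag = (cpu > 60 || mem > 60) ? "  <-- plan capacity" : ""
  printf "%-30s CPU %3d%%  MEM %3d%%%s\n", $1, cpu, mem, flag
}'
```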

### Scaling Actions

```bash
# Quick scale (temporary):
kubectl scale deployment/vapora-backend --replicas=5 -n vapora

# Permanent scale (update deployment.yaml):
# Edit: replicas: 5
# Apply: kubectl apply -f deployment.yaml

# Add nodes (infrastructure):
# Contact the infrastructure team

# Reduce resource consumption:
# Investigate slow queries, memory leaks, etc.
```
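
If load fluctuates, a Horizontal Pod Autoscaler can replace the manual quick-scale step. A minimal sketch (requires metrics-server; the min/max/target values are illustrative, not tuned for VAPORA):

```bash
# Scale vapora-backend on CPU utilization; thresholds are illustrative
kubectl autoscale deployment/vapora-backend -n vapora \
  --min=3 --max=10 --cpu-percent=70

# Inspect current scaling decisions
kubectl get hpa -n vapora
```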

---

## Log Analysis & Troubleshooting

### Checking Logs

```bash
# Most recent logs
kubectl logs deployment/vapora-backend -n vapora

# Last N lines
kubectl logs deployment/vapora-backend -n vapora --tail=100

# From a specific time window
kubectl logs deployment/vapora-backend -n vapora --since=1h

# Follow/tail logs
kubectl logs deployment/vapora-backend -n vapora -f

# From a specific pod
kubectl logs <pod-name> -n vapora

# Previous container (if the pod crashed)
kubectl logs <pod-name> -n vapora --previous
```

### Log Patterns to Watch For

```bash
# Error patterns
kubectl logs deployment/vapora-backend -n vapora | grep -i "error\|exception\|fatal"

# Database issues
kubectl logs deployment/vapora-backend -n vapora | grep -i "database\|connection\|sql"

# Authentication issues
kubectl logs deployment/vapora-backend -n vapora | grep -i "auth\|permission\|forbidden"

# Resource issues
kubectl logs deployment/vapora-backend -n vapora | grep -i "memory\|cpu\|timeout"

# Startup issues (if a pod is restarting)
kubectl logs <pod-name> -n vapora --previous | head -50
```
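
To turn these greps into a per-service summary, a small loop works. The deployment names other than `vapora-backend` are assumed from the service list used elsewhere in this guide; adjust to your cluster:

```bash
# Count error-ish log lines per service over the last hour
for dep in vapora-backend vapora-agents vapora-llm-router; do
  count=$(kubectl logs "deployment/$dep" -n vapora --since=1h 2>/dev/null \
    | grep -ci "error\|exception\|fatal")
  echo "$dep: $count error lines in the last hour"
done
```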

### Common Log Messages & Meaning

| Log Message | Meaning | Action |
|---|---|---|
| `Connection refused` | Service not listening | Check if the service started |
| `Out of memory` | Memory exhausted | Increase limits or scale |
| `Unauthorized` | Auth failed | Check credentials/tokens |
| `Database connection timeout` | Database unreachable | Check DB health |
| `404 Not Found` | Endpoint doesn't exist | Check API routes |
| `Slow query` | Database query taking too long | Optimize the query or check the DB |

---

## Proactive Monitoring Practices

### Weekly Review

```
Every Monday (or your weekly cadence):

1. Review incidents from the past week
   - Were any preventable?
   - Any patterns?

2. Check alert tuning
   - False alarms?
   - Missed issues?
   - Adjust thresholds if needed

3. Capacity check
   - How much headroom remains?
   - Plan for growth?

4. Log analysis
   - Any concerning patterns?
   - Warnings that should be errors?

5. Update runbooks if needed
```

### Monthly Review

```
First of each month:

1. Performance trends
   - Response time trending up or down?
   - Error rate changing?
   - Resource usage changing?

2. Capacity forecast
   - Extrapolate current trends
   - Plan for growth
   - Schedule scaling if needed

3. Incident review
   - MTBF (Mean Time Between Failures)
   - MTTR (Mean Time To Resolve)
   - MTTI (Mean Time To Identify)
   - Are we improving?

4. Tool/alert improvements
   - New monitoring needs?
   - Alert fatigue issues?
   - Better ways to visualize data?
```

---

## Health Check Checklist

### Pre-Deployment Health Check

```
Before any deployment, verify:
☐ All pods running: kubectl get pods
☐ No recent errors: kubectl logs --since=1h
☐ Resource usage normal: kubectl top pods
☐ Services healthy: curl /health
☐ Recent events normal: kubectl get events
```

### Post-Deployment Health Check

```
After deployment, verify for 2 hours:
☐ All new pods running
☐ Old pods terminated
☐ Health endpoints responding
☐ No spike in error logs
☐ Resource usage within expected range
☐ Response time normal
☐ No pod restarts
```
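
One way to cover the first two hours without staring at a terminal is a simple poll loop; a sketch, assuming the port-forwarded health endpoint used earlier:

```bash
# Poll health and total restarts every 5 minutes for 2 hours (24 iterations).
# Assumes http://localhost:8001/health is reachable (e.g. via port-forward).
for i in $(seq 1 24); do
  status=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8001/health)
  restarts=$(kubectl get pods -n vapora \
    -o jsonpath='{range .items[*]}{.status.containerStatuses[0].restartCount}{"\n"}{end}' \
    | awk '{s+=$1} END {print s}')
  echo "$(date -u +%H:%M) health=$status total_restarts=$restarts"
  sleep 300
done
```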

### Daily Health Check

```
Once per business day:
☐ kubectl get pods (all Running, 1/1 Ready)
☐ curl http://localhost:8001/health (200 OK)
☐ kubectl logs --since=24h | grep ERROR (few to none)
☐ kubectl top pods (normal usage)
☐ kubectl get events (no warnings)
```

---

## Monitoring Runbook Checklist

```
☐ Verified automated health checks running
☐ Manual health checks performed (daily)
☐ Dashboards set up and visible
☐ Alert thresholds tuned
☐ Log patterns identified
☐ Baselines recorded
☐ Escalation procedures understood
☐ Team trained on monitoring
☐ Alert responses tested
☐ Runbooks up to date
```

---

## Common Monitoring Issues

### False Alerts

**Problem**: An alert fires but the service is actually fine

**Solution**:

1. Verify manually (don't just assume it's false)
2. Check the alert threshold (it might be too sensitive)
3. Adjust the threshold if the alert is consistently false
4. Document the change

### Alert Fatigue

**Problem**: Too many alerts, so they get ignored

**Solution**:

1. Review all alerts
2. Disable or adjust non-actionable ones
3. Consolidate related alerts
4. Focus on critical alerts only

### Missing Alerts

**Problem**: An issue happens but no alert fires

**Solution**:

1. Investigate why the alert didn't fire
2. Check the alert condition
3. Add a new alert for this issue
4. Test the new alert

### Lag in Monitoring

**Problem**: Dashboards/alerts are slow to update

**Solution**:

1. Check monitoring system performance
2. Increase scrape frequency if appropriate
3. Reduce data retention if storage is the issue
4. Investigate database performance

---

## Monitoring Tools & Commands

### kubectl Commands

```bash
# Pod monitoring
kubectl get pods -n vapora
kubectl get pods -n vapora -w             # Watch mode
kubectl describe pod <pod> -n vapora
kubectl logs <pod> -n vapora -f

# Resource monitoring
kubectl top nodes
kubectl top pods -n vapora
kubectl describe nodes

# Event monitoring
kubectl get events -n vapora --sort-by='.lastTimestamp'
kubectl get events -n vapora --watch

# Health checks
kubectl get --raw /healthz                # Kubernetes API server health
```

### Useful Commands

```bash
# Check API responsiveness
curl -v http://localhost:8001/health

# Check that all services have endpoints
for svc in backend agents llm-router; do
  echo "$svc endpoints:"
  kubectl get endpoints vapora-$svc -n vapora
done

# Monitor pod restarts
watch 'kubectl get pods -n vapora -o jsonpath="{range .items[*]}{.metadata.name}{\" \"}{.status.containerStatuses[0].restartCount}{\"\\n\"}{end}"'

# Find pods with high restart counts
kubectl get pods -n vapora -o json | jq '.items[] | select(.status.containerStatuses[0].restartCount > 5) | .metadata.name'
```

---

## Next Steps

1. **Set up dashboards** - Create Grafana/Prometheus dashboards if not already available
2. **Configure alerts** - Set thresholds based on baselines
3. **Test alerting** - Verify Slack/email notifications work
4. **Train the team** - Ensure everyone knows how to read the dashboards
5. **Document baselines** - Record normal metrics for comparison
6. **Automate checks** - Use CI/CD health check pipelines
7. **Review regularly** - Weekly/monthly health check reviews

---
**Last Updated**: 2026-01-12

**Status**: Production-ready