# Monitoring & Health Check Operations

Guide for continuous monitoring and health checks of VAPORA in production.

---

## Overview

**Responsibility**: Maintain visibility into VAPORA service health through monitoring, logging, and alerting

**Key Activities**:

- Regular health checks (automated and manual)
- Alert response and investigation
- Trend analysis and capacity planning
- Incident prevention through early detection

**Success Metric**: Detect and respond to issues before users are significantly impacted

---

## Automated Health Checks

### Kubernetes Health Check Pipeline

If you use CI/CD, you can leverage automated health monitoring:

**GitHub Actions**:

```bash
# Runs every 15 minutes (quick check)
# Runs every 6 hours (comprehensive diagnostics)
# See: .github/workflows/health-check.yml
```

**Woodpecker**:

```bash
# Runs every 15 minutes (quick check)
# Runs every 6 hours (comprehensive diagnostics)
# See: .woodpecker/health-check.yml
```

**Artifacts Generated**:

- `docker-health.log` - Docker container status
- `k8s-health.log` - Kubernetes deployment status
- `k8s-diagnostics.log` - Full system diagnostics
- `docker-diagnostics.log` - Docker system info
- `HEALTH_REPORT.md` - Summary report

### Quick Manual Health Check

```bash
# Run this block to get an instant health status
export NAMESPACE=vapora

echo "=== Pod Status ==="
kubectl get pods -n $NAMESPACE
echo ""

echo "=== Service Health ==="
kubectl get endpoints -n $NAMESPACE
echo ""

echo "=== Recent Events ==="
kubectl get events -n $NAMESPACE --sort-by='.lastTimestamp' | tail -10
echo ""

echo "=== Resource Usage ==="
kubectl top pods -n $NAMESPACE
echo ""

echo "=== API Health ==="
# Assumes the backend is reachable on localhost:8001 (e.g. via kubectl port-forward)
curl -s http://localhost:8001/health | jq .
```

---

## Manual Daily Monitoring

### Morning Check (Start of Business Day)

```bash
# Run at start of business day (or when starting shift)

echo "=== MORNING HEALTH CHECK ==="
echo "Date: $(date -u)"

# 1. Cluster Status
echo "Cluster Status:"
kubectl cluster-info | grep server

# 2. Node Status
echo ""
echo "Node Status:"
kubectl get nodes
# Should show: all nodes Ready

# 3. Pod Status
echo ""
echo "Pod Status:"
kubectl get pods -n vapora
# Should show: all Running, 1/1 Ready

# 4. Service Endpoints
echo ""
echo "Service Endpoints:"
kubectl get endpoints -n vapora
# Should show: all services have endpoints (not empty)

# 5. Resource Usage
echo ""
echo "Resource Usage:"
kubectl top nodes
kubectl top pods -n vapora | head -10

# 6. Recent Errors
echo ""
echo "Recent Errors (last 1 hour):"
kubectl logs deployment/vapora-backend -n vapora --since=1h | grep -i error | wc -l
# Should show: 0 or very few errors

# 7. Overall Status
echo ""
echo "Overall Status: ✅ Healthy"
# If any issues are found: document and investigate
```

### Mid-Day Check (Every 4-6 hours)

```bash
# Quick sanity check during business hours

# 1. Service Responsiveness
curl -s http://localhost:8001/health | jq '.status'
# Should return: "healthy"

# 2. Pod Restart Tracking
kubectl get pods -n vapora -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}'
# Restart counts should not be increasing rapidly

# 3. Error Log Check
kubectl logs deployment/vapora-backend -n vapora --since=4h --timestamps | grep ERROR | tail -5
# Should show: few to no errors

# 4. Performance Check
kubectl top pods -n vapora | tail -5
# CPU/memory should be in the normal range
```

### End-of-Day Check (Before Shift End)

```bash
# Summary check before handing off to on-call

echo "=== END OF DAY SUMMARY ==="

# Current status
kubectl get pods -n vapora
kubectl top pods -n vapora

# Any concerning trends?
echo ""
echo "Checking for concerning events..."
kubectl get events -n vapora --sort-by='.lastTimestamp' | grep -i warning

# Any pod restarts?
echo ""
echo "Pod restart status:"
kubectl get pods -n vapora -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.containerStatuses[0].restartCount}{"\n"}{end}' | grep -v ": 0"

# Document for next shift
echo ""
echo "Status for on-call: All normal / Issues detected"
```

---

## Dashboard Setup & Monitoring

### Essential Dashboards to Monitor

If you have Grafana/Prometheus, create these dashboards:

#### 1. Service Health Dashboard

Monitor:

- Pod running count (should be stable at the expected count)
- Pod restart count (should not increase rapidly)
- Service endpoint availability (should be >99%)
- API response time (p99, track trends)

**Alert if:**

- Pod count drops below expected
- Restart count is increasing
- Endpoints are empty
- Response time >2s
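
For the pod-count and restart panels, queries can be tested from the command line before wiring them into Grafana. A minimal sketch, assuming Prometheus scrapes kube-state-metrics (the metric names below come from kube-state-metrics) and is reachable at `http://prometheus:9090`, a placeholder URL:

```bash
# Sketch: query Prometheus' HTTP API for the panel data above.
# Assumes kube-state-metrics is scraped; the Prometheus URL is a placeholder.
PROM=http://prometheus:9090

# Running pod count in the vapora namespace
curl -s "$PROM/api/v1/query" \
  --data-urlencode 'query=sum(kube_pod_status_phase{namespace="vapora",phase="Running"})' \
  | jq '.data.result'

# Container restarts over the last hour (should stay at 0)
curl -s "$PROM/api/v1/query" \
  --data-urlencode 'query=sum(increase(kube_pod_container_status_restarts_total{namespace="vapora"}[1h]))' \
  | jq '.data.result'
```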

#### 2. Resource Utilization Dashboard

Monitor:

- CPU usage per pod
- Memory usage per pod
- Node capacity (CPU, memory, disk)
- Network I/O

**Alert if:**

- Any pod >80% CPU/memory
- Any node >85% capacity
- Memory trending upward consistently

#### 3. Error Rate Dashboard

Monitor:

- 4xx error rate (should be low)
- 5xx error rate (should be minimal)
- Error rate by endpoint
- Error rate by service

**Alert if:**

- 5xx error rate >5%
- 4xx error rate >10%
- Sudden spike in errors
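
The 5xx alert threshold above can be expressed as a single PromQL ratio. A hedged sketch: the counter name `http_requests_total` and its `status` label are assumptions about how the backend exports request metrics, so substitute whatever VAPORA actually exposes:

```bash
# Sketch: 5xx error ratio over the last 5 minutes.
# http_requests_total and its labels are assumed metric names.
PROM=http://prometheus:9090
QUERY='sum(rate(http_requests_total{namespace="vapora",status=~"5.."}[5m])) / sum(rate(http_requests_total{namespace="vapora"}[5m]))'

curl -s "$PROM/api/v1/query" --data-urlencode "query=$QUERY" \
  | jq -r '.data.result[0].value[1]'
# Alert threshold from this dashboard: fire when the value exceeds 0.05 (5%)
```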

#### 4. Application Metrics Dashboard

Monitor:

- Request rate (RPS)
- Request latency (p50, p95, p99)
- Active connections
- Database query time

**Alert if:**

- Request rate suddenly drops (might indicate an outage)
- Latency spikes above baseline
- Database queries slow down

### Grafana Setup Example

If you are setting up Grafana monitoring:

1. Deploy Prometheus scraping Kubernetes metrics
2. Create a dashboard with the panels above
3. Set alert rules:
   - CPU >80%: Warning
   - Memory >85%: Warning
   - Error rate >5%: Critical
   - Pod crashed: Critical
   - Response time >2s: Warning
4. Configure notifications to Slack/email
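
As one way to codify the first rule, here is a hedged sketch of a `PrometheusRule` object applied via a heredoc. It assumes the Prometheus Operator (`monitoring.coreos.com/v1` CRDs) is installed and that cAdvisor and kube-state-metrics metrics are available; treat it as a starting point, not a tuned rule:

```bash
# Sketch only: requires Prometheus Operator CRDs plus cAdvisor/kube-state-metrics.
kubectl apply -f - <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vapora-resource-alerts
  namespace: vapora
spec:
  groups:
    - name: vapora.resources
      rules:
        - alert: VaporaHighCPU
          # Pod CPU usage relative to its configured CPU limit, averaged over 5m
          expr: |
            sum(rate(container_cpu_usage_seconds_total{namespace="vapora"}[5m])) by (pod)
              / sum(kube_pod_container_resource_limits{namespace="vapora",resource="cpu"}) by (pod)
              > 0.80
          for: 10m
          labels:
            severity: warning
EOF
```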

---

## Alert Response Procedures

### When Alert Fires

```
Alert Received
      ↓
Step 1: Verify it's real (not a false alarm)
  - Check the dashboard
  - Check manually (curl endpoints, kubectl get pods)
  - Ask in #deployments if unsure

Step 2: Assess severity
  - Service completely down? Severity 1
  - Service partially degraded? Severity 2
  - Warning/trending issue? Severity 3

Step 3: Declare an incident (if Severity 1-2)
  - Create an #incident channel
  - Follow the Incident Response Runbook
  - See: incident-response-runbook.md

Step 4: Investigate (if Severity 3)
  - Document in a ticket
  - Schedule investigation
  - Monitor for escalation
```
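
For Step 1, the manual verification can be scripted. A small sketch, assuming the backend health endpoint is reachable on localhost:8001 (e.g. via `kubectl port-forward`):

```bash
# Quick "is it real?" check before declaring an incident
echo "HTTP status from /health:"
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8001/health

echo "Pods not in Running phase:"
kubectl get pods -n vapora --field-selector=status.phase!=Running

echo "Recent warning events:"
kubectl get events -n vapora --sort-by='.lastTimestamp' | grep -i warning | tail -5
```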

### Common Alerts & Actions

| Alert | Cause | Response |
|-------|-------|----------|
| **Pod CrashLoopBackOff** | App crashing | Get logs, fix, restart |
| **High CPU >80%** | Resources exhausted | Scale up or reduce load |
| **High Memory >85%** | Memory leak or surge | Investigate or restart |
| **Error rate spike** | App issue | Check logs, consider rollback |
| **Response time spike** | Slow queries/I/O | Check database, consider restart |
| **Pod pending** | Can't schedule | Check node resources |
| **Endpoints empty** | Service down | Verify the service exists |
| **Disk full** | Storage exhausted | Clean up or expand |
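
For the most frequent of these, **Pod CrashLoopBackOff**, the standard triage sequence is:

```bash
# Triage a crash-looping pod (replace <pod> with the failing pod's name)
kubectl describe pod <pod> -n vapora     # events, exit codes, probe failures
kubectl logs <pod> -n vapora --previous  # logs from the crashed container
# After fixing the cause, roll the deployment:
kubectl rollout restart deployment/vapora-backend -n vapora
```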

---

## Metric Baselines & Trends

### Establishing Baselines

Record these metrics during normal operation:

```
# CPU per pod (typical)
Backend: 200-400m per pod
Agents: 300-500m per pod
LLM Router: 100-200m per pod

# Memory per pod (typical)
Backend: 256-512Mi per pod
Agents: 128-256Mi per pod
LLM Router: 64-128Mi per pod

# Response time (typical)
Backend: p50: 50ms, p95: 200ms, p99: 500ms
Frontend: load time of 2-3 seconds

# Error rate (typical)
Backend: 4xx: <1%, 5xx: <0.1%
Frontend: <5% user-visible errors

# Pod restart count
Should remain 0 (no restarts expected in normal operation)
```
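
To make these baselines reproducible, snapshot live usage on a known-good day and keep the files for comparison; a minimal sketch:

```bash
# Capture a dated baseline snapshot during normal operation
DATE=$(date +%F)
kubectl top pods -n vapora > "baseline-pods-$DATE.txt"
kubectl top nodes > "baseline-nodes-$DATE.txt"
echo "Baselines written to baseline-pods-$DATE.txt and baseline-nodes-$DATE.txt"
```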

### Detecting Anomalies

Compare current metrics to baseline:

```
# If CPU is 2x normal:
- Check whether load increased
- Check for a resource leak
- Monitor for further increase

# If memory is increasing:
- Might indicate a memory leak
- Monitor over time (1-2 hours)
- Restart if clearly trending up

# If error rate is 10x normal:
- Something broke recently
- Check recent deployments
- Consider rollback

# If a new process is consuming resources:
- Identify the new resource consumer
- Investigate its purpose
- Kill it if unintended
```

---

## Capacity Planning

### When to Scale

Monitor trends and plan ahead:

```
# Trigger capacity planning if:
- Average CPU >60%
- Average memory >60%
- Peak usage trending upward
- Disk usage >80%

# Questions to ask:
- Is traffic increasing? Seasonal spike?
- Did we add features? New workload?
- Do we have capacity for growth?
- Should we scale now or wait?
```
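
A quick way to check node headroom against these thresholds from the command line; a sketch that parses `kubectl top nodes` output (requires metrics-server):

```bash
# Flag nodes above 60% CPU or memory (the capacity-planning trigger above).
# Columns from `kubectl top nodes`: NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
kubectl top nodes --no-headers | awk '{
  cpu = $3 + 0; mem = $5 + 0
  flag = (cpu > 60 || mem > 60) ? "  <-- plan capacity" : ""
  printf "%-30s CPU %3d%%  MEM %3d%%%s\n", $1, cpu, mem, flag
}'
```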

### Scaling Actions

```bash
# Quick scale (temporary):
kubectl scale deployment/vapora-backend --replicas=5 -n vapora

# Permanent scale (update deployment.yaml):
# Edit: replicas: 5
# Apply: kubectl apply -f deployment.yaml

# Add nodes (infrastructure):
# Contact the infrastructure team

# Reduce resource consumption:
# Investigate slow queries, memory leaks, etc.
```
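
If load fluctuates, a Horizontal Pod Autoscaler can replace the manual quick-scale step. A minimal sketch (requires metrics-server; the min/max/target values are illustrative, not tuned for VAPORA):

```bash
# Scale vapora-backend on CPU utilization; thresholds are illustrative
kubectl autoscale deployment/vapora-backend -n vapora \
  --min=3 --max=10 --cpu-percent=70

# Inspect current scaling decisions
kubectl get hpa -n vapora
```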

---

## Log Analysis & Troubleshooting

### Checking Logs

```bash
# Most recent logs
kubectl logs deployment/vapora-backend -n vapora

# Last N lines
kubectl logs deployment/vapora-backend -n vapora --tail=100

# From a specific time window
kubectl logs deployment/vapora-backend -n vapora --since=1h

# Follow/tail logs
kubectl logs deployment/vapora-backend -n vapora -f

# From a specific pod
kubectl logs <pod-name> -n vapora

# Previous container (if the pod crashed)
kubectl logs <pod-name> -n vapora --previous
```

### Log Patterns to Watch For

```bash
# Error patterns
kubectl logs deployment/vapora-backend -n vapora | grep -i "error\|exception\|fatal"

# Database issues
kubectl logs deployment/vapora-backend -n vapora | grep -i "database\|connection\|sql"

# Authentication issues
kubectl logs deployment/vapora-backend -n vapora | grep -i "auth\|permission\|forbidden"

# Resource issues
kubectl logs deployment/vapora-backend -n vapora | grep -i "memory\|cpu\|timeout"

# Startup issues (if a pod is restarting)
kubectl logs <pod-name> -n vapora --previous | head -50
```
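
To turn these greps into a per-service summary, a small loop works. The deployment names other than `vapora-backend` are assumed from the service list used elsewhere in this guide; adjust to your cluster:

```bash
# Count error-ish log lines per service over the last hour
for dep in vapora-backend vapora-agents vapora-llm-router; do
  count=$(kubectl logs "deployment/$dep" -n vapora --since=1h 2>/dev/null \
    | grep -ci "error\|exception\|fatal")
  echo "$dep: $count error lines in the last hour"
done
```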

### Common Log Messages & Meaning

| Log Message | Meaning | Action |
|---|---|---|
| `Connection refused` | Service not listening | Check if the service started |
| `Out of memory` | Memory exhausted | Increase limits or scale |
| `Unauthorized` | Auth failed | Check credentials/tokens |
| `Database connection timeout` | Database unreachable | Check DB health |
| `404 Not Found` | Endpoint doesn't exist | Check API routes |
| `Slow query` | Database query taking too long | Optimize the query or check the DB |

---

## Proactive Monitoring Practices

### Weekly Review

```
Every Monday (or your weekly cadence):

1. Review incidents from the past week
   - Were any preventable?
   - Any patterns?

2. Check alert tuning
   - False alarms?
   - Missed issues?
   - Adjust thresholds if needed

3. Capacity check
   - How much headroom remains?
   - Plan for growth?

4. Log analysis
   - Any concerning patterns?
   - Warnings that should be errors?

5. Update runbooks if needed
```

### Monthly Review

```
First of each month:

1. Performance trends
   - Response time trending up or down?
   - Error rate changing?
   - Resource usage changing?

2. Capacity forecast
   - Extrapolate current trends
   - Plan for growth
   - Schedule scaling if needed

3. Incident review
   - MTBF (Mean Time Between Failures)
   - MTTR (Mean Time To Resolve)
   - MTTI (Mean Time To Identify)
   - Are we improving?

4. Tool/alert improvements
   - New monitoring needs?
   - Alert fatigue issues?
   - Better ways to visualize data?
```

---

## Health Check Checklist

### Pre-Deployment Health Check

```
Before any deployment, verify:
☐ All pods running: kubectl get pods
☐ No recent errors: kubectl logs --since=1h
☐ Resource usage normal: kubectl top pods
☐ Services healthy: curl /health
☐ Recent events normal: kubectl get events
```

### Post-Deployment Health Check

```
After deployment, verify for 2 hours:
☐ All new pods running
☐ Old pods terminated
☐ Health endpoints responding
☐ No spike in error logs
☐ Resource usage within expected range
☐ Response time normal
☐ No pod restarts
```
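
One way to cover the first two hours without staring at a terminal is a simple poll loop; a sketch, assuming the port-forwarded health endpoint used earlier:

```bash
# Poll health and total restarts every 5 minutes for 2 hours (24 iterations).
# Assumes http://localhost:8001/health is reachable (e.g. via port-forward).
for i in $(seq 1 24); do
  status=$(curl -s -o /dev/null -w '%{http_code}' http://localhost:8001/health)
  restarts=$(kubectl get pods -n vapora \
    -o jsonpath='{range .items[*]}{.status.containerStatuses[0].restartCount}{"\n"}{end}' \
    | awk '{s+=$1} END {print s}')
  echo "$(date -u +%H:%M) health=$status total_restarts=$restarts"
  sleep 300
done
```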

### Daily Health Check

```
Once per business day:
☐ kubectl get pods (all Running, 1/1 Ready)
☐ curl http://localhost:8001/health (200 OK)
☐ kubectl logs --since=24h | grep ERROR (few to none)
☐ kubectl top pods (normal usage)
☐ kubectl get events (no warnings)
```

---

## Monitoring Runbook Checklist

```
☐ Verified automated health checks running
☐ Manual health checks performed (daily)
☐ Dashboards set up and visible
☐ Alert thresholds tuned
☐ Log patterns identified
☐ Baselines recorded
☐ Escalation procedures understood
☐ Team trained on monitoring
☐ Alert responses tested
☐ Runbooks up to date
```

---

## Common Monitoring Issues

### False Alerts

**Problem**: An alert fires but the service is actually fine

**Solution**:

1. Verify manually (don't just assume it's false)
2. Check the alert threshold (it might be too sensitive)
3. Adjust the threshold if the alert is consistently false
4. Document the change

### Alert Fatigue

**Problem**: Too many alerts, so they get ignored

**Solution**:

1. Review all alerts
2. Disable or adjust non-actionable ones
3. Consolidate related alerts
4. Focus on critical alerts only

### Missing Alerts

**Problem**: An issue happens but no alert fires

**Solution**:

1. Investigate why the alert didn't fire
2. Check the alert condition
3. Add a new alert for this issue
4. Test the new alert

### Lag in Monitoring

**Problem**: Dashboards/alerts are slow to update

**Solution**:

1. Check monitoring system performance
2. Increase scrape frequency if appropriate
3. Reduce data retention if storage is the issue
4. Investigate database performance

---

## Monitoring Tools & Commands

### kubectl Commands

```bash
# Pod monitoring
kubectl get pods -n vapora
kubectl get pods -n vapora -w             # Watch mode
kubectl describe pod <pod> -n vapora
kubectl logs <pod> -n vapora -f

# Resource monitoring
kubectl top nodes
kubectl top pods -n vapora
kubectl describe nodes

# Event monitoring
kubectl get events -n vapora --sort-by='.lastTimestamp'
kubectl get events -n vapora --watch

# Health checks
kubectl get --raw /healthz                # Kubernetes API server health
```

### Useful Commands

```bash
# Check API responsiveness
curl -v http://localhost:8001/health

# Check that all services have endpoints
for svc in backend agents llm-router; do
  echo "$svc endpoints:"
  kubectl get endpoints vapora-$svc -n vapora
done

# Monitor pod restarts
watch 'kubectl get pods -n vapora -o jsonpath="{range .items[*]}{.metadata.name}{\" \"}{.status.containerStatuses[0].restartCount}{\"\\n\"}{end}"'

# Find pods with high restart counts
kubectl get pods -n vapora -o json | jq '.items[] | select(.status.containerStatuses[0].restartCount > 5) | .metadata.name'
```

---

## Next Steps

1. **Set up dashboards** - Create Grafana/Prometheus dashboards if not already available
2. **Configure alerts** - Set thresholds based on baselines
3. **Test alerting** - Verify Slack/email notifications work
4. **Train the team** - Ensure everyone knows how to read the dashboards
5. **Document baselines** - Record normal metrics for comparison
6. **Automate checks** - Use CI/CD health check pipelines
7. **Review regularly** - Weekly/monthly health check reviews

---
**Last Updated**: 2026-01-12

**Status**: Production-ready