# VAPORA Grafana Dashboards

This directory contains 4 pre-configured Grafana dashboards for monitoring VAPORA.

## Dashboards

### 1. VAPORA Overview (`vapora-overview.json`)

**UID:** `vapora-overview`

**Panels:**
- Request Rate (req/sec)
- Error Rate (%)
- P95 Latency (ms)
- Request Rate by Endpoint (timeseries)
- Response Latency (P50, P95, P99) (timeseries)
- Response Status Distribution (pie chart)
- Database Operations (timeseries)

**Metrics Used:**
- `vapora_http_requests_total`
- `vapora_http_request_duration_seconds_bucket`
- `vapora_db_operations_total`

**Refresh:** 10 seconds

---

### 2. VAPORA Agent Metrics (`agent-metrics.json`)

**UID:** `vapora-agents`

**Panels:**
- Active Agents (count)
- Task Assignment Rate (assignments/sec)
- Task Failure Rate (%)
- Average Agent Load
- Task Execution Time by Agent Role (P50, P95, P99)
- Task Assignments by Skill (stacked)
- Agent Load Distribution (donut chart)
- Agent Expertise Scores (Learning Profiles)
- NATS Message Coordination (A2A)

**Metrics Used:**
- `vapora_swarm_agents_registered`
- `vapora_swarm_task_assignments_total`
- `vapora_swarm_agent_load`
- `vapora_agent_task_duration_seconds_bucket`
- `vapora_agent_expertise_score`
- `vapora_a2a_nats_messages_total`

**Refresh:** 10 seconds

---

### 3. VAPORA LLM Cost Tracking (`llm-cost-tracking.json`)

**UID:** `vapora-llm-cost`

**Panels:**
- Total LLM Cost (USD)
- Total Input Tokens
- Total Output Tokens
- Budget Usage % (gauge)
- Cost by Provider (timeseries)
- Token Usage by Provider (timeseries)
- Cost Distribution by Provider (donut chart)
- Cost Distribution by Role (donut chart)
- Request Distribution by Provider (donut chart)
- Hourly Budget Usage by Role (bars)
- Budget Status by Role (table)

**Metrics Used:**
- `vapora_llm_cost_total_cents`
- `vapora_llm_provider_token_usage`
- `vapora_llm_role_budget_used_cents`
- `vapora_llm_role_budget_limit_cents`
- `vapora_llm_provider_requests_total`

**Refresh:** 10 seconds

---

### 4. VAPORA Knowledge Graph Analytics (`knowledge-graph-analytics.json`)

**UID:** `vapora-kg-analytics`

**Panels:**
- Total Executions in KG
- KG Nodes
- KG Relationships
- Average Learning Curve Slope
- Learning Curves (Improvement Over Time)
- Average Execution Duration by Task Type
- Execution Count by Task Type (table)
- Execution Status Distribution (donut chart)
- Recency Bias Weights (7-day 3×, 30-day 1×)
- Similarity Searches (Hourly)
- Agent Success Rates by Task Type (table)

**Metrics Used:**
- `vapora_kg_total_executions`
- `vapora_kg_total_nodes`
- `vapora_kg_total_relationships`
- `vapora_kg_learning_curve_slope`
- `vapora_kg_learning_curve_improvement`
- `vapora_kg_execution_duration_seconds`
- `vapora_kg_executions_by_task_type`
- `vapora_kg_executions_by_status`
- `vapora_kg_recency_bias_weight`
- `vapora_kg_similarity_searches_total`
- `vapora_kg_agent_success_rate`

**Refresh:** 30 seconds

---

## Import Instructions

### Option 1: Grafana UI (Recommended)

1. **Access Grafana:**

   ```bash
   kubectl port-forward -n observability svc/grafana 3000:3000
   ```

   Open: http://localhost:3000

2. **Login:**
   - Username: `admin`
   - Password: `prom-operator` (or your configured password)

3. **Import Dashboards:**
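   - Go to **Dashboards** → **New** → **Import** (in older Grafana versions: **"+"** → **"Import"** in the left sidebar)
   - Click **"Upload dashboard JSON file"**, or paste the file contents under **"Import via dashboard JSON model"**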
   - Select one of the JSON files from this directory
   - Select **Prometheus** as the datasource
   - Click **"Import"**

4. **Repeat** for all 4 dashboards

### Option 2: Kubernetes ConfigMap (Automated)

Create a ConfigMap to auto-provision dashboards:

```bash
# Create ConfigMap for dashboards
kubectl create configmap vapora-dashboards \
  --from-file=vapora-overview.json \
  --from-file=agent-metrics.json \
  --from-file=llm-cost-tracking.json \
  --from-file=knowledge-graph-analytics.json \
  -n observability

# Label for Grafana auto-discovery
kubectl label configmap vapora-dashboards \
  grafana_dashboard=1 \
  -n observability
```

**Note:** This assumes your Grafana instance is configured with a dashboard provider that watches for ConfigMaps with the `grafana_dashboard=1` label.
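
The `kubectl create` command above fails if the ConfigMap already exists. The following is a minimal sketch of a re-runnable variant, assuming it is executed from this directory. The function name `apply_vapora_dashboards` is illustrative and not part of VAPORA.

```bash
# Hypothetical helper: re-runnable creation of the dashboards ConfigMap.
# The manifest is rendered client-side and then applied, so an existing
# ConfigMap is updated instead of causing an "already exists" error.
apply_vapora_dashboards() {
  local ns="${1:-observability}"
  kubectl create configmap vapora-dashboards \
    --from-file=vapora-overview.json \
    --from-file=agent-metrics.json \
    --from-file=llm-cost-tracking.json \
    --from-file=knowledge-graph-analytics.json \
    -n "$ns" --dry-run=client -o yaml |
    kubectl apply -f - &&
  # --overwrite keeps the label step from failing on repeated runs
  kubectl label configmap vapora-dashboards \
    grafana_dashboard=1 --overwrite -n "$ns"
}
```

Call `apply_vapora_dashboards` to target the `observability` namespace, or pass another namespace as the first argument.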

### Option 3: Direct File Mount (Docker/Local)

If running Grafana locally via Docker:

```bash
# Copy dashboards to Grafana provisioning directory
cp *.json /path/to/grafana/provisioning/dashboards/

# Restart Grafana
docker restart grafana
```
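
Grafana only loads dashboards from paths declared in a provider file inside `provisioning/dashboards/`. If no provider file exists yet, one can be written as sketched below. The provider name `vapora`, the file name `vapora.yaml`, and the container path `/etc/grafana/provisioning/dashboards` are assumptions to adapt to your setup.

```bash
# Hypothetical helper: write a Grafana dashboard provider file into the
# given provisioning directory (the one the JSON files were copied to).
write_dashboard_provider() {
  local dir="$1"
  mkdir -p "$dir"
  cat > "$dir/vapora.yaml" <<'EOF'
apiVersion: 1
providers:
  - name: vapora            # assumed provider name
    folder: VAPORA          # Grafana folder shown in the UI
    type: file
    options:
      # Path as seen inside the Grafana container (assumed mount point)
      path: /etc/grafana/provisioning/dashboards
EOF
}
```

For example, `write_dashboard_provider /path/to/grafana/provisioning/dashboards` followed by `docker restart grafana`.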

---

## Verification

After importing, verify dashboards are working:

1. **Check Prometheus Data Source:**
   - Go to **Connections** → **Data sources** (**Configuration** → **Data Sources** before Grafana 10)
   - Verify that a **Prometheus** datasource exists
   - Click **Save & test** to confirm it is reachable

2. **Check Metrics Availability:**

   Port-forward the Prometheus service:

   ```bash
   kubectl port-forward -n observability svc/prometheus 9090:9090
   ```

   Open http://localhost:9090 and query these test metrics:
   - `vapora_http_requests_total`
   - `vapora_agent_task_duration_seconds_bucket`
   - `vapora_llm_cost_total_cents`
   - `vapora_kg_total_executions`

3. **View Dashboards:**
   - Go to **Dashboards** → **Browse**
   - Look for "VAPORA" folder or tag
   - Open each dashboard
   - Verify panels show data (may take a few minutes after VAPORA starts)
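
The metric check in step 2 can also be scripted against the Prometheus HTTP API. This is a minimal sketch, assuming the port-forward from step 2 is active on `localhost:9090`. The function name `check_vapora_metrics` is illustrative.

```bash
# Hypothetical helper: report which VAPORA test metrics Prometheus has.
check_vapora_metrics() {
  local prom_url="${1:-http://localhost:9090}" metric resp
  for metric in \
    vapora_http_requests_total \
    vapora_agent_task_duration_seconds_bucket \
    vapora_llm_cost_total_cents \
    vapora_kg_total_executions
  do
    # Instant query; -G sends the url-encoded data as a GET query string
    resp="$(curl -fsS -G "$prom_url/api/v1/query" \
              --data-urlencode "query=$metric")" || {
      echo "ERROR   $metric"
      continue
    }
    # A metric without samples yields an empty result array
    case "$resp" in
      *'"result":[]'*) echo "MISSING $metric" ;;
      *)               echo "OK      $metric" ;;
    esac
  done
}
```

Run `check_vapora_metrics` and investigate any line marked `MISSING` or `ERROR` before debugging the dashboards themselves.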

---

## Customization

### Update Datasource

If the dashboards do not resolve your Prometheus datasource automatically, point them at its UID (shown in the URL of the datasource's settings page):

1. Open dashboard JSON file
2. Find all instances of `"uid": "${DS_PROMETHEUS}"`
3. Replace with your datasource UID
4. Re-import
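
The replacement in steps 2 and 3 can be done with `sed`. This is a minimal sketch: the function name `set_datasource_uid` and the example UID are hypothetical, and the UID is assumed to contain no `/` or `&` characters.

```bash
# Hypothetical helper: print a copy of a dashboard JSON file with the
# ${DS_PROMETHEUS} placeholder replaced by a concrete datasource UID.
# Assumes the UID contains no "/" or "&" (both are special to sed).
set_datasource_uid() {
  local file="$1" uid="$2"
  sed 's/\${DS_PROMETHEUS}/'"$uid"'/g' "$file"
}
```

For example, `set_datasource_uid vapora-overview.json my-prometheus-uid > vapora-overview.local.json` writes a patched copy and leaves the original file untouched.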

### Adjust Refresh Rate

To change auto-refresh interval:

1. Open dashboard in Grafana
2. Click **Dashboard settings** (gear icon)
3. Go to **General** tab
4. Update **Refresh** dropdown
5. Click **Save dashboard**

### Add Custom Panels

To add new panels:

1. Edit dashboard
2. Click **"Add"** → **"Visualization"** (**"Add panel"** → **"Add a new panel"** in older Grafana versions)
3. Select Prometheus datasource
4. Write PromQL query (see **Metrics Used** above for examples)
5. Configure visualization
6. Click **"Apply"**
7. Save dashboard
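
Before building a panel, a query can be tried against the Prometheus HTTP API. The sketch below computes P95 request latency from the histogram bucket metric listed under **Metrics Used**. The 5-minute rate window, the `localhost:9090` URL, and the helper name `prom_query` are assumptions.

```bash
# PromQL for P95 request latency, built from the histogram bucket metric.
# The 5m rate window is an assumed choice; adjust it to your scrape interval.
P95_LATENCY_QUERY='histogram_quantile(0.95, sum by (le) (rate(vapora_http_request_duration_seconds_bucket[5m])))'

# Hypothetical helper: run an instant query and print the raw JSON response.
prom_query() {
  local query="$1" prom_url="${2:-http://localhost:9090}"
  curl -fsS -G "$prom_url/api/v1/query" --data-urlencode "query=$query"
}
```

Run `prom_query "$P95_LATENCY_QUERY"` while Prometheus is port-forwarded. The result is in seconds, so a panel that displays milliseconds needs the value multiplied by 1000 or a unit override.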

---

## Troubleshooting

### No Data Shown

**Problem:** Panels show "No data"

**Solutions:**
1. **Check VAPORA is running:**

   ```bash
   kubectl get pods -n vapora
   # All pods should be Running
   ```

2. **Check Prometheus is scraping VAPORA:**

   ```bash
   kubectl port-forward -n observability svc/prometheus 9090:9090
   ```

   Open: http://localhost:9090/targets

   Look for targets such as `vapora-backend` and `vapora-a2a`; each should be in the **UP** state

3. **Check metrics endpoint manually:**

   ```bash
   kubectl port-forward -n vapora svc/vapora-backend 8001:8001
   curl http://localhost:8001/metrics | grep vapora_
   ```

   The output should list Prometheus-format metrics with the `vapora_` prefix

4. **Wait a few minutes** for metrics to accumulate

### Wrong Datasource

**Problem:** Dashboard shows "Data source not found"

**Solution:**
- Edit dashboard
- Click **Dashboard settings** → **Variables**
- Update `DS_PROMETHEUS` variable to match your datasource name
- Save

### Missing Metrics

**Problem:** Some panels show "No data" while others work

**Solution:**
- Check if specific VAPORA features are enabled:
  - **Agent metrics:** Requires `vapora-agents` running
  - **LLM cost:** Requires LLM provider configured
  - **KG analytics:** Requires Knowledge Graph enabled
- Some metrics only appear after certain actions (e.g., task assignments, LLM calls)

---

## Dashboard Organization

Recommended Grafana folder structure:

```
📁 VAPORA/
├── 📊 Overview (vapora-overview)
├── 📊 Agent Metrics (vapora-agents)
├── 📊 LLM Cost Tracking (vapora-llm-cost)
└── 📊 Knowledge Graph Analytics (vapora-kg-analytics)
```

To create the folder:
1. Go to **Dashboards** → **Browse**
2. Click **"New"** → **"New folder"**
3. Name: "VAPORA"
4. Move imported dashboards into this folder

---

## Alerting (Optional)

To set up alerts based on dashboard panels:

### Example: High Error Rate Alert

1. Open **VAPORA Overview** dashboard
2. Edit **"Error Rate"** panel
3. Go to **Alert** tab
4. Click **"Create alert rule from this panel"**
5. Configure:
   - **Name:** "VAPORA High Error Rate"
   - **Condition:** `avg() > 0.05` (5%, if the panel query returns a ratio; use `> 5` if it returns a percentage)
   - **For:** 5 minutes
   - **Annotations:** "VAPORA error rate exceeded 5%"
6. Save

### Example: Budget Exceeded Alert

1. Open **VAPORA LLM Cost Tracking** dashboard
2. Edit **"Budget Usage %"** panel
3. Create alert:
   - **Name:** "LLM Budget Near Limit"
   - **Condition:** `last() > 0.9` (90%, if the panel query returns a ratio; use `> 90` if it returns a percentage)
   - **For:** 1 minute
   - **Annotations:** "LLM budget usage exceeded 90%"
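
The budget condition can also be written directly as a PromQL expression over the two budget metrics listed under **Metrics Used**. This sketch assumes both metrics carry identical label sets (for example, one series per role) so that the division matches series one-to-one. The helper name `budget_alert_expr` is hypothetical.

```bash
# Hypothetical helper: print a PromQL expression that returns series only
# for roles whose budget usage exceeds the given ratio (default 0.9 = 90%).
# Assumes both metrics carry identical label sets so the division matches.
budget_alert_expr() {
  local threshold="${1:-0.9}"
  printf '%s\n' "vapora_llm_role_budget_used_cents / vapora_llm_role_budget_limit_cents > $threshold"
}
```

`budget_alert_expr 0.8` prints the same expression with an 80% threshold, which can be pasted into an alert rule query.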

---

## Maintenance

### Update Dashboards

When VAPORA metrics change:

1. Export current dashboard JSON
2. Edit JSON file with new metrics
3. Increment the `version` field in the JSON
4. Re-import (overwrites existing)

### Backup Dashboards

```bash
# Export one VAPORA dashboard by UID
curl -H "Authorization: Bearer $GRAFANA_API_KEY" \
  "http://localhost:3000/api/dashboards/uid/vapora-overview" \
  > vapora-overview-backup.json

# Repeat for other dashboard UIDs:
# - vapora-agents
# - vapora-llm-cost
# - vapora-kg-analytics
```
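
To back up all four dashboards in one pass, the same request can be wrapped in a loop. This is a minimal sketch: the function name `backup_vapora_dashboards` and the `GRAFANA_URL` variable are illustrative, and `GRAFANA_API_KEY` is expected to be set as in the command above.

```bash
# Hypothetical helper: export every VAPORA dashboard into a directory.
GRAFANA_URL="${GRAFANA_URL:-http://localhost:3000}"

backup_vapora_dashboards() {
  local out_dir="${1:-.}" uid
  mkdir -p "$out_dir"
  for uid in vapora-overview vapora-agents vapora-llm-cost vapora-kg-analytics
  do
    # -f makes curl fail on HTTP errors so a bad token stops the loop
    curl -fsS -H "Authorization: Bearer $GRAFANA_API_KEY" \
      "$GRAFANA_URL/api/dashboards/uid/$uid" \
      > "$out_dir/$uid-backup.json" || return 1
  done
}
```

Each response wraps the dashboard model in a `dashboard` field next to a `meta` field, so extract the `dashboard` object before re-importing a backup.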

---

## Support

For dashboard issues:
- Check **VAPORA Metrics Documentation**: `docs/architecture/metrics.md`
- Check **Prometheus Setup**: `docs/operations/monitoring.md`
- Review **Grafana Docs**: https://grafana.com/docs/

For VAPORA metrics questions:
- See: `.claude/CLAUDE.md` → **Debugging & Monitoring** section
- Check: `crates/*/src/metrics.rs` files for metric definitions

---

- **Last Updated:** 2026-02-08
- **VAPORA Version:** 1.2.0
- **Grafana Version:** 10.0+
- **Prometheus Version:** 2.40+