# VAPORA Grafana Dashboards
This directory contains 4 pre-configured Grafana dashboards for monitoring VAPORA.
## Dashboards
### 1. VAPORA Overview (`vapora-overview.json`)
**UID:** `vapora-overview`
**Panels:**
- Request Rate (req/sec)
- Error Rate (%)
- P95 Latency (ms)
- Request Rate by Endpoint (timeseries)
- Response Latency (P50, P95, P99) (timeseries)
- Response Status Distribution (pie chart)
- Database Operations (timeseries)
**Metrics Used:**
- `vapora_http_requests_total`
- `vapora_http_request_duration_seconds_bucket`
- `vapora_db_operations_total`
**Refresh:** 10 seconds
---
### 2. VAPORA Agent Metrics (`agent-metrics.json`)
**UID:** `vapora-agents`
**Panels:**
- Active Agents (count)
- Task Assignment Rate (assignments/sec)
- Task Failure Rate (%)
- Average Agent Load
- Task Execution Time by Agent Role (P50, P95, P99)
- Task Assignments by Skill (stacked)
- Agent Load Distribution (donut chart)
- Agent Expertise Scores (Learning Profiles)
- NATS Message Coordination (A2A)
**Metrics Used:**
- `vapora_swarm_agents_registered`
- `vapora_swarm_task_assignments_total`
- `vapora_swarm_agent_load`
- `vapora_agent_task_duration_seconds_bucket`
- `vapora_agent_expertise_score`
- `vapora_a2a_nats_messages_total`
**Refresh:** 10 seconds
---
### 3. VAPORA LLM Cost Tracking (`llm-cost-tracking.json`)
**UID:** `vapora-llm-cost`
**Panels:**
- Total LLM Cost (USD)
- Total Input Tokens
- Total Output Tokens
- Budget Usage % (gauge)
- Cost by Provider (timeseries)
- Token Usage by Provider (timeseries)
- Cost Distribution by Provider (donut chart)
- Cost Distribution by Role (donut chart)
- Request Distribution by Provider (donut chart)
- Hourly Budget Usage by Role (bars)
- Budget Status by Role (table)
**Metrics Used:**
- `vapora_llm_cost_total_cents`
- `vapora_llm_provider_token_usage`
- `vapora_llm_role_budget_used_cents`
- `vapora_llm_role_budget_limit_cents`
- `vapora_llm_provider_requests_total`
**Refresh:** 10 seconds
---
### 4. VAPORA Knowledge Graph Analytics (`knowledge-graph-analytics.json`)
**UID:** `vapora-kg-analytics`
**Panels:**
- Total Executions in KG
- KG Nodes
- KG Relationships
- Average Learning Curve Slope
- Learning Curves (Improvement Over Time)
- Average Execution Duration by Task Type
- Execution Count by Task Type (table)
- Execution Status Distribution (donut chart)
- Recency Bias Weights (7-day 3×, 30-day 1×)
- Similarity Searches (Hourly)
- Agent Success Rates by Task Type (table)
**Metrics Used:**
- `vapora_kg_total_executions`
- `vapora_kg_total_nodes`
- `vapora_kg_total_relationships`
- `vapora_kg_learning_curve_slope`
- `vapora_kg_learning_curve_improvement`
- `vapora_kg_execution_duration_seconds`
- `vapora_kg_executions_by_task_type`
- `vapora_kg_executions_by_status`
- `vapora_kg_recency_bias_weight`
- `vapora_kg_similarity_searches_total`
- `vapora_kg_agent_success_rate`
**Refresh:** 30 seconds
---
## Import Instructions
### Option 1: Grafana UI (Recommended)
1. **Access Grafana:**
```bash
kubectl port-forward -n observability svc/grafana 3000:3000
```
Open: http://localhost:3000
2. **Login:**
- Username: `admin`
- Password: `prom-operator` (or your configured password)
3. **Import Dashboards:**
- Click **"+"** → **"Import"** in the left sidebar
- Click **"Upload JSON file"** or **"Import via panel json"**
- Select one of the JSON files from this directory
- Select **Prometheus** as the datasource
- Click **"Import"**
4. **Repeat** for all 4 dashboards
### Option 2: Kubernetes ConfigMap (Automated)
Create a ConfigMap to auto-provision dashboards:
```bash
# Create ConfigMap for dashboards
kubectl create configmap vapora-dashboards \
--from-file=vapora-overview.json \
--from-file=agent-metrics.json \
--from-file=llm-cost-tracking.json \
--from-file=knowledge-graph-analytics.json \
-n observability
# Label for Grafana auto-discovery
kubectl label configmap vapora-dashboards \
grafana_dashboard=1 \
-n observability
```
**Note:** This assumes your Grafana instance is configured with a dashboard provider that watches for ConfigMaps with the `grafana_dashboard=1` label.
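`kubectl create configmap` fails if the ConfigMap already exists, so the commands above only work the first time. A minimal sketch of a re-runnable variant follows, assuming the JSON files are in the current directory; the function name `apply_vapora_dashboards` is illustrative and not part of VAPORA.

```bash
# Create or update the dashboards ConfigMap, then (re)apply the discovery label.
# Usage: apply_vapora_dashboards [namespace]   (defaults to "observability")
apply_vapora_dashboards() {
  local namespace="${1:-observability}"

  # --dry-run=client renders the manifest locally; piping it to "apply"
  # creates the ConfigMap on the first run and updates it on later runs.
  kubectl create configmap vapora-dashboards \
    --from-file=vapora-overview.json \
    --from-file=agent-metrics.json \
    --from-file=llm-cost-tracking.json \
    --from-file=knowledge-graph-analytics.json \
    -n "$namespace" \
    --dry-run=client -o yaml \
    | kubectl apply -n "$namespace" -f - \
    && kubectl label configmap vapora-dashboards \
         grafana_dashboard=1 --overwrite \
         -n "$namespace"
}
```

Run `apply_vapora_dashboards` after editing any of the JSON files. `--overwrite` keeps the label step from failing when the label is already set.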
### Option 3: Direct File Mount (Docker/Local)
If running Grafana locally via Docker:
```bash
# Copy dashboards to Grafana provisioning directory
cp *.json /path/to/grafana/provisioning/dashboards/
# Restart Grafana
docker restart grafana
```
---
## Verification
After importing, verify dashboards are working:
1. **Check Prometheus Data Source:**
- Go to **Configuration** → **Data Sources**
- Verify **Prometheus** datasource exists and is reachable
- Test connection
2. **Check Metrics Availability:**
Open Prometheus UI:
```bash
kubectl port-forward -n observability svc/prometheus 9090:9090
```
Query test metrics:
- `vapora_http_requests_total`
- `vapora_agent_task_duration_seconds_bucket`
- `vapora_llm_cost_total_cents`
- `vapora_kg_total_executions`
3. **View Dashboards:**
- Go to **Dashboards** → **Browse**
- Look for "VAPORA" folder or tag
- Open each dashboard
- Verify panels show data (may take a few minutes after VAPORA starts)
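The metric check in step 2 can also be scripted against the Prometheus HTTP API (`/api/v1/query`). The sketch below assumes the port-forward from step 2 is active and that Prometheus returns compact JSON, so an empty result appears as `"result":[]`. `PROMETHEUS_URL` and `check_vapora_metrics` are illustrative names.

```bash
# Report which of the four test metrics currently return at least one series.
# Returns 0 if all are present, 1 if any is missing, 2 if Prometheus is unreachable.
check_vapora_metrics() {
  local prom_url="${PROMETHEUS_URL:-http://localhost:9090}"
  local missing=0
  local metric response

  for metric in \
    vapora_http_requests_total \
    vapora_agent_task_duration_seconds_bucket \
    vapora_llm_cost_total_cents \
    vapora_kg_total_executions
  do
    # --get with --data-urlencode sends the metric name as ?query=<name>
    response=$(curl -fsS --get --data-urlencode "query=$metric" \
      "$prom_url/api/v1/query") || return 2

    case "$response" in
      *'"result":[]'*) printf '%-8s %s\n' "MISSING" "$metric"; missing=1 ;;
      *)               printf '%-8s %s\n' "OK" "$metric" ;;
    esac
  done

  return "$missing"
}
```

A `MISSING` line means Prometheus has no series for that metric yet, which matches the "No data" cases described under Troubleshooting.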
---
## Customization
### Update Datasource
If your Prometheus datasource has a different name:
1. Open dashboard JSON file
2. Find all instances of `"uid": "${DS_PROMETHEUS}"`
3. Replace with your datasource UID
4. Re-import
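Steps 2 and 3 can be done with `sed` instead of a manual search and replace. This is a sketch, assuming the placeholder appears literally as `${DS_PROMETHEUS}` in the JSON files and that the UID contains only letters, digits, `-` and `_`. `set_datasource_uid` is an illustrative helper name.

```bash
# Replace the ${DS_PROMETHEUS} placeholder with a concrete datasource UID.
# Usage: set_datasource_uid <datasource-uid> <dashboard.json>...
# Assumes the UID has no "/" or "&", which would be special in the sed expression.
set_datasource_uid() {
  local uid="$1"
  shift
  local file

  for file in "$@"; do
    # Write to a temporary file first so a failed sed never truncates the original.
    sed 's/\${DS_PROMETHEUS}/'"$uid"'/g' "$file" > "$file.tmp" \
      && mv "$file.tmp" "$file" \
      || return 1
  done
}
```

For example, `set_datasource_uid my-prometheus-uid *.json` rewrites all four dashboards in place, where `my-prometheus-uid` stands in for your actual datasource UID. Keep a copy of the originals if you still need the placeholder version.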
### Adjust Refresh Rate
To change auto-refresh interval:
1. Open dashboard in Grafana
2. Click **Dashboard settings** (gear icon)
3. Go to **General** tab
4. Update **Refresh** dropdown
5. Click **Save dashboard**
### Add Custom Panels
To add new panels:
1. Edit dashboard
2. Click **"Add panel"** → **"Add a new panel"**
3. Select Prometheus datasource
4. Write PromQL query (see **Metrics Used** above for examples)
5. Configure visualization
6. Click **"Apply"**
7. Save dashboard
---
## Troubleshooting
### No Data Shown
**Problem:** Panels show "No data"
**Solutions:**
1. **Check VAPORA is running:**
```bash
kubectl get pods -n vapora
# All pods should be Running
```
2. **Check Prometheus is scraping VAPORA:**
```bash
kubectl port-forward -n observability svc/prometheus 9090:9090
```
Open: http://localhost:9090/targets
Look for `vapora-backend`, `vapora-a2a`, etc. targets
3. **Check metrics endpoint manually:**
```bash
kubectl port-forward -n vapora svc/vapora-backend 8001:8001
curl http://localhost:8001/metrics | grep vapora_
```
Should show Prometheus-format metrics
4. **Wait a few minutes** for metrics to accumulate
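To see which VAPORA metrics an endpoint exposes, rather than scrolling through raw `grep` output, the metrics text can be reduced to a list of unique metric names. A small sketch follows; `list_vapora_metric_names` is an illustrative name.

```bash
# Read Prometheus text-format metrics on stdin and print each vapora_ metric name once.
# "# HELP" and "# TYPE" comment lines are skipped because they do not start with "vapora_".
list_vapora_metric_names() {
  grep '^vapora_' \
    | sed 's/[{ ].*$//' \
    | sort -u
}
```

With the port-forward from step 3 active, run `curl -s http://localhost:8001/metrics | list_vapora_metric_names` and compare the output with the **Metrics Used** lists above. An empty result means the endpoint exposes no `vapora_` metrics at all.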
### Wrong Datasource
**Problem:** Dashboard shows "Data source not found"
**Solution:**
- Edit dashboard
- Click **Dashboard settings** → **Variables**
- Update `DS_PROMETHEUS` variable to match your datasource name
- Save
### Missing Metrics
**Problem:** Some panels show "No data" while others work
**Solution:**
- Check if specific VAPORA features are enabled:
- **Agent metrics:** Requires `vapora-agents` running
- **LLM cost:** Requires LLM provider configured
- **KG analytics:** Requires Knowledge Graph enabled
- Some metrics only appear after certain actions (e.g., task assignments, LLM calls)
---
## Dashboard Organization
Recommended Grafana folder structure:
```
📁 VAPORA/
├── 📊 Overview (vapora-overview)
├── 📊 Agent Metrics (vapora-agents)
├── 📊 LLM Cost Tracking (vapora-llm-cost)
└── 📊 Knowledge Graph Analytics (vapora-kg-analytics)
```
To create folder:
1. Go to **Dashboards** → **Browse**
2. Click **"New"** → **"New folder"**
3. Name: "VAPORA"
4. Move imported dashboards into this folder
---
## Alerting (Optional)
To set up alerts based on dashboard panels:
### Example: High Error Rate Alert
1. Open **VAPORA Overview** dashboard
2. Edit **"Error Rate"** panel
3. Go to **Alert** tab
4. Click **"Create alert rule from this panel"**
5. Configure:
- **Name:** "VAPORA High Error Rate"
- **Condition:** `avg() > 0.05` (5%, if the panel query returns a ratio; use `> 5` if it returns a percentage)
- **For:** 5 minutes
- **Annotations:** "VAPORA error rate exceeded 5%"
6. Save
### Example: Budget Exceeded Alert
1. Open **VAPORA LLM Cost Tracking** dashboard
2. Edit **"Budget Usage %"** panel
3. Create alert:
- **Name:** "LLM Budget Near Limit"
- **Condition:** `last() > 0.9` (90%, if the panel query returns a ratio; use `> 90` if it returns a percentage)
- **For:** 1 minute
- **Annotations:** "LLM budget usage exceeded 90%"
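
Before saving an alert rule, it can help to check which series would currently exceed the threshold. The sketch below queries the Prometheus HTTP API with the two budget metrics listed for the LLM Cost Tracking dashboard. It assumes both metrics carry the same label set, so the division matches series one-to-one; the panel's actual query may differ. `PROMETHEUS_URL`, `prom_query` and `budget_over_threshold` are illustrative names.

```bash
# Run an instant PromQL query against the Prometheus HTTP API and print the raw JSON.
prom_query() {
  local prom_url="${PROMETHEUS_URL:-http://localhost:9090}"
  curl -fsS --get --data-urlencode "query=$1" "$prom_url/api/v1/query"
}

# Budget usage as a ratio (0.0 to 1.0) per series.
# Assumption: used and limit metrics share identical labels.
BUDGET_RATIO_QUERY='vapora_llm_role_budget_used_cents / vapora_llm_role_budget_limit_cents'

# Print the series whose budget usage ratio is above the given threshold.
# Usage: budget_over_threshold [ratio]   (defaults to 0.9, i.e. 90%)
budget_over_threshold() {
  prom_query "($BUDGET_RATIO_QUERY) > ${1:-0.9}"
}
```

An empty `result` array in the response means no series is above the threshold right now. If the two metrics turn out to have different labels, the query needs an explicit `on(...)` matching clause.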
---
## Maintenance
### Update Dashboards
When VAPORA metrics change:
1. Export current dashboard JSON
2. Edit JSON file with new metrics
3. Increment version number
4. Re-import (overwrites existing)
### Backup Dashboards
```bash
# Export a single VAPORA dashboard by UID
curl -H "Authorization: Bearer $GRAFANA_API_KEY" \
"http://localhost:3000/api/dashboards/uid/vapora-overview" \
> vapora-overview-backup.json
# Repeat for other dashboard UIDs:
# - vapora-agents
# - vapora-llm-cost
# - vapora-kg-analytics
```
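
To back up all four dashboards in one pass, the same request can be wrapped in a loop over the UIDs listed in this README. This is a sketch, assuming `GRAFANA_API_KEY` is set as in the command above. `GRAFANA_URL` and `backup_vapora_dashboards` are illustrative names.

```bash
# Dashboard UIDs documented in this README.
VAPORA_DASHBOARD_UIDS="vapora-overview vapora-agents vapora-llm-cost vapora-kg-analytics"

# Export each dashboard to <uid>-backup.json in the given directory.
# Usage: backup_vapora_dashboards [output-dir]   (defaults to the current directory)
backup_vapora_dashboards() {
  local out_dir="${1:-.}"
  local base_url="${GRAFANA_URL:-http://localhost:3000}"
  local uid

  mkdir -p "$out_dir" || return 1

  for uid in $VAPORA_DASHBOARD_UIDS; do
    # -f makes curl exit non-zero on HTTP errors (for example an invalid API key);
    # -sS hides the progress bar but still prints error messages.
    curl -fsS -H "Authorization: Bearer $GRAFANA_API_KEY" \
      "$base_url/api/dashboards/uid/$uid" \
      > "$out_dir/$uid-backup.json" || return 1
  done
}
```

The API response wraps each dashboard as `{"meta": ..., "dashboard": ...}`, so the importable definition is the value of the `dashboard` key, not the whole file.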
---
## Support
For dashboard issues:
- Check **VAPORA Metrics Documentation**: `docs/architecture/metrics.md`
- Check **Prometheus Setup**: `docs/operations/monitoring.md`
- Review **Grafana Docs**: https://grafana.com/docs/
For VAPORA metrics questions:
- See: `.claude/CLAUDE.md` → **Debugging & Monitoring** section
- Check: `crates/*/src/metrics.rs` files for metric definitions
---
**Last Updated:** 2026-02-08
**VAPORA Version:** 1.2.0
**Grafana Version:** 10.0+
**Prometheus Version:** 2.40+