# VAPORA Grafana Dashboards

This directory contains 4 pre-configured Grafana dashboards for monitoring VAPORA.

## Dashboards

### 1. VAPORA Overview (`vapora-overview.json`)

**UID:** `vapora-overview`

**Panels:**
- Request Rate (req/sec)
- Error Rate (%)
- P95 Latency (ms)
- Request Rate by Endpoint (timeseries)
- Response Latency (P50, P95, P99) (timeseries)
- Response Status Distribution (pie chart)
- Database Operations (timeseries)

**Metrics Used:**
- `vapora_http_requests_total`
- `vapora_http_request_duration_seconds_bucket`
- `vapora_db_operations_total`

**Refresh:** 10 seconds
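
The panels above map onto fairly standard PromQL. As a rough sketch (not the queries stored in `vapora-overview.json`), the helper below prints an expression for three of the stat panels and can send it to the Prometheus HTTP API. The function names, the `PROM_URL` default, the 5-minute rate window, and in particular the `status` label used for the error rate are assumptions; check the dashboard JSON for the real label names.

```bash
# Prometheus base URL; assumes a local port-forward to Prometheus on 9090.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Print an illustrative PromQL expression for one Overview panel.
overview_query() {
  case "$1" in
    request_rate)
      echo 'sum(rate(vapora_http_requests_total[5m]))' ;;
    p95_latency_ms)
      # The histogram is recorded in seconds; multiply by 1000 for milliseconds.
      echo 'histogram_quantile(0.95, sum by (le) (rate(vapora_http_request_duration_seconds_bucket[5m]))) * 1000' ;;
    error_rate_pct)
      # ASSUMPTION: a `status` label that carries the HTTP status code.
      echo '100 * sum(rate(vapora_http_requests_total{status=~"5.."}[5m])) / sum(rate(vapora_http_requests_total[5m]))' ;;
    *)
      echo "unknown panel: $1" >&2
      return 1 ;;
  esac
}

# Run an instant query against the Prometheus HTTP API and print the JSON reply.
prom_query() {
  curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}
```

For example, `prom_query "$(overview_query request_rate)"` returns the current request rate as JSON once Prometheus is reachable at `PROM_URL`.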

---

### 2. VAPORA Agent Metrics (`agent-metrics.json`)

**UID:** `vapora-agents`

**Panels:**
- Active Agents (count)
- Task Assignment Rate (assignments/sec)
- Task Failure Rate (%)
- Average Agent Load
- Task Execution Time by Agent Role (P50, P95, P99)
- Task Assignments by Skill (stacked)
- Agent Load Distribution (donut chart)
- Agent Expertise Scores (Learning Profiles)
- NATS Message Coordination (A2A)

**Metrics Used:**
- `vapora_swarm_agents_registered`
- `vapora_swarm_task_assignments_total`
- `vapora_swarm_agent_load`
- `vapora_agent_task_duration_seconds_bucket`
- `vapora_agent_expertise_score`
- `vapora_a2a_nats_messages_total`

**Refresh:** 10 seconds

---

### 3. VAPORA LLM Cost Tracking (`llm-cost-tracking.json`)

**UID:** `vapora-llm-cost`

**Panels:**
- Total LLM Cost (USD)
- Total Input Tokens
- Total Output Tokens
- Budget Usage % (gauge)
- Cost by Provider (timeseries)
- Token Usage by Provider (timeseries)
- Cost Distribution by Provider (donut chart)
- Cost Distribution by Role (donut chart)
- Request Distribution by Provider (donut chart)
- Hourly Budget Usage by Role (bars)
- Budget Status by Role (table)

**Metrics Used:**
- `vapora_llm_cost_total_cents`
- `vapora_llm_provider_token_usage`
- `vapora_llm_role_budget_used_cents`
- `vapora_llm_role_budget_limit_cents`
- `vapora_llm_provider_requests_total`

**Refresh:** 10 seconds
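
Two of these panels involve simple arithmetic on the cent-denominated metrics: dollars are cents divided by 100, and budget usage is presumably used cents divided by limit cents. The sketch below shows that arithmetic with hypothetical helper names; it is an illustration of the unit conversion, not code taken from the dashboard definition.

```bash
# Convert a value in cents to a USD string with two decimals.
cents_to_usd() {
  awk -v c="$1" 'BEGIN { printf "%.2f\n", c / 100 }'
}

# Budget usage as a percentage of the limit.
# A limit of zero (or less) prints "n/a" instead of dividing by zero.
budget_usage_pct() {
  awk -v used="$1" -v limit="$2" 'BEGIN {
    if (limit <= 0) { print "n/a"; exit }
    printf "%.1f\n", 100 * used / limit
  }'
}
```

With these helpers, `cents_to_usd 12345` prints `123.45` and `budget_usage_pct 4500 5000` prints `90.0`.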

---

### 4. VAPORA Knowledge Graph Analytics (`knowledge-graph-analytics.json`)

**UID:** `vapora-kg-analytics`

**Panels:**
- Total Executions in KG
- KG Nodes
- KG Relationships
- Average Learning Curve Slope
- Learning Curves (Improvement Over Time)
- Average Execution Duration by Task Type
- Execution Count by Task Type (table)
- Execution Status Distribution (donut chart)
- Recency Bias Weights (7-day 3×, 30-day 1×)
- Similarity Searches (Hourly)
- Agent Success Rates by Task Type (table)

**Metrics Used:**
- `vapora_kg_total_executions`
- `vapora_kg_total_nodes`
- `vapora_kg_total_relationships`
- `vapora_kg_learning_curve_slope`
- `vapora_kg_learning_curve_improvement`
- `vapora_kg_execution_duration_seconds`
- `vapora_kg_executions_by_task_type`
- `vapora_kg_executions_by_status`
- `vapora_kg_recency_bias_weight`
- `vapora_kg_similarity_searches_total`
- `vapora_kg_agent_success_rate`

**Refresh:** 30 seconds

---

## Import Instructions

### Option 1: Grafana UI (Recommended)

1. **Access Grafana:**

   ```bash
   kubectl port-forward -n observability svc/grafana 3000:3000
   ```

   Open: http://localhost:3000

2. **Login:**
   - Username: `admin`
   - Password: `prom-operator` (or your configured password)

3. **Import Dashboards:**
   - Click **"+"** → **"Import"** in the left sidebar
   - Click **"Upload JSON file"** or **"Import via panel json"**
   - Select one of the JSON files from this directory
   - Select **Prometheus** as the datasource
   - Click **"Import"**

4. **Repeat** for all 4 dashboards

### Option 2: Kubernetes ConfigMap (Automated)

Create a ConfigMap to auto-provision dashboards:

```bash
# Create ConfigMap for dashboards
kubectl create configmap vapora-dashboards \
  --from-file=vapora-overview.json \
  --from-file=agent-metrics.json \
  --from-file=llm-cost-tracking.json \
  --from-file=knowledge-graph-analytics.json \
  -n observability

# Label for Grafana auto-discovery
kubectl label configmap vapora-dashboards \
  grafana_dashboard=1 \
  -n observability
```

**Note:** This assumes your Grafana instance is configured with a dashboard provider that watches for ConfigMaps with the `grafana_dashboard=1` label.
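
`kubectl create` fails if the ConfigMap already exists, so re-running it after editing a dashboard returns an error. One possible workaround, sketched here with a hypothetical function name, is to render the ConfigMap client-side and pipe it to `kubectl apply`, then re-apply the label with `--overwrite`. Run it from this directory so the `--from-file` paths resolve.

```bash
# Hypothetical helper: create or update the dashboards ConfigMap idempotently.
apply_vapora_dashboards() {
  # Render the ConfigMap as YAML without contacting the cluster, then apply it.
  kubectl create configmap vapora-dashboards \
    --from-file=vapora-overview.json \
    --from-file=agent-metrics.json \
    --from-file=llm-cost-tracking.json \
    --from-file=knowledge-graph-analytics.json \
    -n observability \
    --dry-run=client -o yaml |
    kubectl apply -f - &&
    # Re-apply the discovery label; --overwrite keeps repeated runs from failing.
    kubectl label configmap vapora-dashboards \
      grafana_dashboard=1 --overwrite \
      -n observability
}
```

Calling `apply_vapora_dashboards` after each dashboard change should leave the ConfigMap in sync with the JSON files on disk.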

### Option 3: Direct File Mount (Docker/Local)

If running Grafana locally via Docker:

```bash
# Copy dashboards to Grafana provisioning directory
cp *.json /path/to/grafana/provisioning/dashboards/

# Restart Grafana
docker restart grafana
```
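
Grafana only loads JSON files from disk when a dashboard provider definition points at their location. If your provisioning directory does not already contain one, a minimal sketch looks like the following. The file name `vapora.yaml`, the provider name, the `VAPORA` folder, and the default in-container path are assumptions to adapt to your setup.

```bash
# Hypothetical helper: write a file-based dashboard provider definition.
#   $1 = provisioning/dashboards directory on the host
#   $2 = path where Grafana sees the JSON files (ASSUMED default below)
write_dashboard_provider() (
  dir="$1"
  json_path="${2:-/etc/grafana/provisioning/dashboards}"
  mkdir -p "$dir"
  cat > "$dir/vapora.yaml" <<EOF
apiVersion: 1
providers:
  - name: vapora
    folder: VAPORA
    type: file
    options:
      path: $json_path
EOF
)
```

For example, `write_dashboard_provider /path/to/grafana/provisioning/dashboards` writes the provider file next to the copied dashboards; restart Grafana afterwards so it is picked up.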

---

## Verification

After importing, verify dashboards are working:

1. **Check Prometheus Data Source:**
   - Go to **Configuration** → **Data Sources**
   - Verify **Prometheus** datasource exists and is reachable
   - Test connection

2. **Check Metrics Availability:**

   Open Prometheus UI:

   ```bash
   kubectl port-forward -n observability svc/prometheus 9090:9090
   ```

   Query test metrics:
   - `vapora_http_requests_total`
   - `vapora_agent_task_duration_seconds_bucket`
   - `vapora_llm_cost_total_cents`
   - `vapora_kg_total_executions`

3. **View Dashboards:**
   - Go to **Dashboards** → **Browse**
   - Look for "VAPORA" folder or tag
   - Open each dashboard
   - Verify panels show data (may take a few minutes after VAPORA starts)
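
The metric check in step 2 can also be run from a terminal through the Prometheus HTTP API instead of the web UI. The sketch below assumes the port-forward from step 2 is active; the function and variable names are illustrative.

```bash
# Prometheus base URL; assumes the port-forward shown in step 2.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# The four test metrics listed in step 2.
VAPORA_TEST_METRICS="vapora_http_requests_total vapora_agent_task_duration_seconds_bucket vapora_llm_cost_total_cents vapora_kg_total_executions"

# Ask Prometheus how many series exist for each test metric.
check_vapora_metrics() {
  for metric in $VAPORA_TEST_METRICS; do
    printf '%s: ' "$metric"
    curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=count(${metric})"
    echo
  done
}
```

Running `check_vapora_metrics` prints one JSON reply per metric. An empty `result` array in a reply means Prometheus currently has no series for that metric.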

---

## Customization

### Update Datasource

If your Prometheus datasource has a different name:

1. Open dashboard JSON file
2. Find all instances of `"uid": "${DS_PROMETHEUS}"`
3. Replace with your datasource UID
4. Re-import
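
Steps 2 and 3 can be scripted rather than done by hand. The following is a minimal sketch with a hypothetical helper name; it assumes the UID contains no `/` or `&` characters, which `sed` would interpret specially.

```bash
# Hypothetical helper: replace the ${DS_PROMETHEUS} placeholder in one dashboard file.
#   $1 = dashboard JSON file, $2 = datasource UID
# ASSUMPTION: the UID contains no "/" or "&".
set_datasource_uid() (
  file="$1"
  uid="$2"
  tmp="$(mktemp)"
  # [$] matches a literal dollar sign, so neither the shell nor sed expands the placeholder.
  sed 's/[$]{DS_PROMETHEUS}/'"$uid"'/g' "$file" > "$tmp" && cat "$tmp" > "$file"
  status=$?
  rm -f "$tmp"
  exit "$status"
)
```

To update every dashboard in this directory, loop over the files: `for f in *.json; do set_datasource_uid "$f" "<your-datasource-uid>"; done`, then re-import them.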

### Adjust Refresh Rate

To change auto-refresh interval:

1. Open dashboard in Grafana
2. Click **Dashboard settings** (gear icon)
3. Go to **General** tab
4. Update **Refresh** dropdown
5. Click **Save dashboard**

### Add Custom Panels

To add new panels:

1. Edit dashboard
2. Click **"Add panel"** → **"Add a new panel"**
3. Select Prometheus datasource
4. Write PromQL query (see **Metrics Used** above for examples)
5. Configure visualization
6. Click **"Apply"**
7. Save dashboard

---

## Troubleshooting

### No Data Shown

**Problem:** Panels show "No data"

**Solutions:**
1. **Check VAPORA is running:**

   ```bash
   kubectl get pods -n vapora
   # All pods should be Running
   ```

2. **Check Prometheus is scraping VAPORA:**

   ```bash
   kubectl port-forward -n observability svc/prometheus 9090:9090
   ```

   Open: http://localhost:9090/targets

   Look for `vapora-backend`, `vapora-a2a`, etc. targets

3. **Check metrics endpoint manually:**

   ```bash
   kubectl port-forward -n vapora svc/vapora-backend 8001:8001
   curl http://localhost:8001/metrics | grep vapora_
   ```

   Should show Prometheus-format metrics

4. **Wait a few minutes** for metrics to accumulate
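
The raw `/metrics` output from step 3 can be long. As a rough aid, the hypothetical filter below condenses it to one line per `vapora_` metric name with its series count, which makes it easier to spot a metric that a dashboard expects but the endpoint does not expose.

```bash
# Hypothetical filter: read Prometheus text-format metrics on stdin and
# print "<series count> <metric name>" for every vapora_ metric.
summarize_vapora_metrics() {
  grep '^vapora_' |      # keep sample lines, drop # HELP / # TYPE comments
    sed 's/[{ ].*//' |   # strip labels and values, leaving the metric name
    sort | uniq -c
}
```

With the port-forward from step 3 active, run `curl -s http://localhost:8001/metrics | summarize_vapora_metrics`.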

### Wrong Datasource

**Problem:** Dashboard shows "Data source not found"

**Solution:**
- Edit dashboard
- Click **Dashboard settings** → **Variables**
- Update `DS_PROMETHEUS` variable to match your datasource name
- Save

### Missing Metrics

**Problem:** Some panels show "No data" while others work

**Solution:**
- Check if specific VAPORA features are enabled:
  - **Agent metrics:** Requires `vapora-agents` running
  - **LLM cost:** Requires LLM provider configured
  - **KG analytics:** Requires Knowledge Graph enabled
- Some metrics only appear after certain actions (e.g., task assignments, LLM calls)

---

## Dashboard Organization

Recommended Grafana folder structure:

```
📁 VAPORA/
├── 📊 Overview (vapora-overview)
├── 📊 Agent Metrics (vapora-agents)
├── 📊 LLM Cost Tracking (vapora-llm-cost)
└── 📊 Knowledge Graph Analytics (vapora-kg-analytics)
```

To create folder:
1. Go to **Dashboards** → **Browse**
2. Click **"New"** → **"New folder"**
3. Name: "VAPORA"
4. Move imported dashboards into this folder

---

## Alerting (Optional)

To set up alerts based on dashboard panels:

### Example: High Error Rate Alert

1. Open **VAPORA Overview** dashboard
2. Edit **"Error Rate"** panel
3. Go to **Alert** tab
4. Click **"Create alert rule from this panel"**
5. Configure:
   - **Name:** "VAPORA High Error Rate"
   - **Condition:** `avg() > 0.05` (5%)
   - **For:** 5 minutes
   - **Annotations:** "VAPORA error rate exceeded 5%"
6. Save

### Example: Budget Exceeded Alert

1. Open **VAPORA LLM Cost Tracking** dashboard
2. Edit **"Budget Usage %"** panel
3. Create alert:
   - **Name:** "LLM Budget Near Limit"
   - **Condition:** `last() > 0.9` (90%)
   - **For:** 1 minute
   - **Annotations:** "LLM budget usage exceeded 90%"
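
Before saving the rule, the 90% threshold can be previewed directly against Prometheus as a PromQL comparison. This sketch assumes both budget metrics carry identical label sets (for example, one series per role), so the division matches series one-to-one; the variable and function names are illustrative.

```bash
# Prometheus base URL; assumes a local port-forward to Prometheus on 9090.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Ratio of used to allowed budget, filtered to series above 90%.
# ASSUMPTION: the two metrics share the same labels, so they match one-to-one.
BUDGET_ALERT_EXPR='vapora_llm_role_budget_used_cents / vapora_llm_role_budget_limit_cents > 0.9'

# Show which series would currently breach the threshold.
preview_budget_alert() {
  curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=${BUDGET_ALERT_EXPR}"
}
```

An empty `result` array from `preview_budget_alert` means no series is over the threshold right now.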

---

## Maintenance

### Update Dashboards

When VAPORA metrics change:

1. Export current dashboard JSON
2. Edit JSON file with new metrics
3. Increment version number
4. Re-import (overwrites existing)

### Backup Dashboards

```bash
# Export one VAPORA dashboard by UID
curl -H "Authorization: Bearer $GRAFANA_API_KEY" \
  "http://localhost:3000/api/dashboards/uid/vapora-overview" \
  > vapora-overview-backup.json

# Repeat for other dashboard UIDs:
# - vapora-agents
# - vapora-llm-cost
# - vapora-kg-analytics
```
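
To avoid repeating the command by hand, the four UIDs can be exported in a loop. This is a minimal sketch; the function name, the `GRAFANA_URL` default, and the output file naming are assumptions.

```bash
# Grafana base URL; assumes a local port-forward to Grafana on 3000.
GRAFANA_URL="${GRAFANA_URL:-http://localhost:3000}"

# UIDs of the four VAPORA dashboards.
VAPORA_DASHBOARD_UIDS="vapora-overview vapora-agents vapora-llm-cost vapora-kg-analytics"

# Export every VAPORA dashboard into the given directory (default: current directory).
backup_vapora_dashboards() (
  out_dir="${1:-.}"
  mkdir -p "$out_dir"
  for uid in $VAPORA_DASHBOARD_UIDS; do
    # -f makes curl return non-zero on HTTP errors such as 401 or 404.
    curl -sf -H "Authorization: Bearer ${GRAFANA_API_KEY:-}" \
      "${GRAFANA_URL}/api/dashboards/uid/${uid}" \
      > "${out_dir}/${uid}-backup.json" ||
      echo "failed to export ${uid}" >&2
  done
)
```

Note that this endpoint wraps each dashboard in an envelope with `meta` and `dashboard` keys, so extract the `dashboard` object (for example with `jq .dashboard`) before re-importing a backup through the UI.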

---

## Support

For dashboard issues:
- Check **VAPORA Metrics Documentation**: `docs/architecture/metrics.md`
- Check **Prometheus Setup**: `docs/operations/monitoring.md`
- Review **Grafana Docs**: https://grafana.com/docs/

For VAPORA metrics questions:
- See: `.claude/CLAUDE.md` → **Debugging & Monitoring** section
- Check: `crates/*/src/metrics.rs` files for metric definitions

---

**Last Updated:** 2026-02-08
**VAPORA Version:** 1.2.0
**Grafana Version:** 10.0+
**Prometheus Version:** 2.40+