# VAPORA Grafana Dashboards

This directory contains 4 pre-configured Grafana dashboards for monitoring VAPORA.

## Dashboards

### 1. VAPORA Overview (`vapora-overview.json`)

**UID:** `vapora-overview`

**Panels:**
- Request Rate (req/sec)
- Error Rate (%)
- P95 Latency (ms)
- Request Rate by Endpoint (timeseries)
- Response Latency (P50, P95, P99) (timeseries)
- Response Status Distribution (pie chart)
- Database Operations (timeseries)

**Metrics Used:**
- `vapora_http_requests_total`
- `vapora_http_request_duration_seconds_bucket`
- `vapora_db_operations_total`

**Refresh:** 10 seconds

---

### 2. VAPORA Agent Metrics (`agent-metrics.json`)

**UID:** `vapora-agents`

**Panels:**
- Active Agents (count)
- Task Assignment Rate (assignments/sec)
- Task Failure Rate (%)
- Average Agent Load
- Task Execution Time by Agent Role (P50, P95, P99)
- Task Assignments by Skill (stacked)
- Agent Load Distribution (donut chart)
- Agent Expertise Scores (Learning Profiles)
- NATS Message Coordination (A2A)

**Metrics Used:**
- `vapora_swarm_agents_registered`
- `vapora_swarm_task_assignments_total`
- `vapora_swarm_agent_load`
- `vapora_agent_task_duration_seconds_bucket`
- `vapora_agent_expertise_score`
- `vapora_a2a_nats_messages_total`

**Refresh:** 10 seconds

---

### 3. VAPORA LLM Cost Tracking (`llm-cost-tracking.json`)

**UID:** `vapora-llm-cost`

**Panels:**
- Total LLM Cost (USD)
- Total Input Tokens
- Total Output Tokens
- Budget Usage % (gauge)
- Cost by Provider (timeseries)
- Token Usage by Provider (timeseries)
- Cost Distribution by Provider (donut chart)
- Cost Distribution by Role (donut chart)
- Request Distribution by Provider (donut chart)
- Hourly Budget Usage by Role (bars)
- Budget Status by Role (table)

**Metrics Used:**
- `vapora_llm_cost_total_cents`
- `vapora_llm_provider_token_usage`
- `vapora_llm_role_budget_used_cents`
- `vapora_llm_role_budget_limit_cents`
- `vapora_llm_provider_requests_total`

**Refresh:** 10 seconds

---

### 4. VAPORA Knowledge Graph Analytics (`knowledge-graph-analytics.json`)

**UID:** `vapora-kg-analytics`

**Panels:**
- Total Executions in KG
- KG Nodes
- KG Relationships
- Average Learning Curve Slope
- Learning Curves (Improvement Over Time)
- Average Execution Duration by Task Type
- Execution Count by Task Type (table)
- Execution Status Distribution (donut chart)
- Recency Bias Weights (7-day 3×, 30-day 1×)
- Similarity Searches (Hourly)
- Agent Success Rates by Task Type (table)

**Metrics Used:**
- `vapora_kg_total_executions`
- `vapora_kg_total_nodes`
- `vapora_kg_total_relationships`
- `vapora_kg_learning_curve_slope`
- `vapora_kg_learning_curve_improvement`
- `vapora_kg_execution_duration_seconds`
- `vapora_kg_executions_by_task_type`
- `vapora_kg_executions_by_status`
- `vapora_kg_recency_bias_weight`
- `vapora_kg_similarity_searches_total`
- `vapora_kg_agent_success_rate`

**Refresh:** 30 seconds

---

## Import Instructions

### Option 1: Grafana UI (Recommended)

1. **Access Grafana:**

   ```bash
   kubectl port-forward -n observability svc/grafana 3000:3000
   ```

   Open: http://localhost:3000

2. **Login:**
   - Username: `admin`
   - Password: `prom-operator` (or your configured password)

3. **Import Dashboards:**
   - Go to **Dashboards**, then click **"New"** → **"Import"** (older Grafana versions: **"+"** → **"Import"** in the left sidebar)
   - Click **"Upload JSON file"** or **"Import via panel json"**
   - Select one of the JSON files from this directory
   - Select **Prometheus** as the datasource
   - Click **"Import"**

4. **Repeat** for all 4 dashboards

### Option 2: Kubernetes ConfigMap (Automated)

Create a ConfigMap to auto-provision dashboards:

```bash
# Create ConfigMap for dashboards
kubectl create configmap vapora-dashboards \
  --from-file=vapora-overview.json \
  --from-file=agent-metrics.json \
  --from-file=llm-cost-tracking.json \
  --from-file=knowledge-graph-analytics.json \
  -n observability

# Label for Grafana auto-discovery
kubectl label configmap vapora-dashboards \
  grafana_dashboard=1 \
  -n observability
```

**Note:** This assumes your Grafana instance is configured with a dashboard provider that watches for ConfigMaps with the `grafana_dashboard=1` label.
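
`kubectl create configmap` fails when `vapora-dashboards` already exists, so re-running the commands above after editing a dashboard will error. The following is a minimal sketch of a repeatable variant, assuming the same ConfigMap name, namespace, and label as above; the function name `sync_vapora_dashboards` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: create or update the dashboards ConfigMap in one step.
# Run it from the directory that contains the four dashboard JSON files.
sync_vapora_dashboards() {
  local ns="${1:-observability}"

  # Render the ConfigMap locally, then apply it so that repeat runs update it.
  kubectl create configmap vapora-dashboards \
    --from-file=vapora-overview.json \
    --from-file=agent-metrics.json \
    --from-file=llm-cost-tracking.json \
    --from-file=knowledge-graph-analytics.json \
    -n "$ns" --dry-run=client -o yaml |
    kubectl apply -n "$ns" -f - || return 1

  # --overwrite lets the label be re-set even if it already exists with another value.
  kubectl label configmap vapora-dashboards grafana_dashboard=1 \
    -n "$ns" --overwrite
}
```

Call `sync_vapora_dashboards` with no arguments to target the `observability` namespace, or pass another namespace as the first argument.
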

### Option 3: Direct File Mount (Docker/Local)

If running Grafana locally via Docker:

```bash
# Copy dashboards to Grafana provisioning directory
cp *.json /path/to/grafana/provisioning/dashboards/

# Restart Grafana
docker restart grafana
```
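
Copied JSON files are only loaded when a dashboard provider points Grafana at that directory. The sketch below writes a minimal provider file next to the dashboards. The function name, the provider name `vapora`, the folder `VAPORA`, and the in-container path `/etc/grafana/provisioning/dashboards` (the default provisioning location in the official Docker image) are assumptions to adapt to your setup.

```bash
#!/usr/bin/env bash
# Hypothetical helper: write a file-based dashboard provider definition.
# $1 is the host directory that is mounted as Grafana's provisioning/dashboards.
write_dashboard_provider() {
  local dir="$1"
  mkdir -p "$dir" || return 1
  # "path" is the directory as seen from inside the Grafana container (assumed).
  # Grafana rescans that path every updateIntervalSeconds.
  cat > "$dir/vapora.yaml" <<'EOF'
apiVersion: 1
providers:
  - name: vapora
    folder: VAPORA
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /etc/grafana/provisioning/dashboards
EOF
}

# Usage: write_dashboard_provider /path/to/grafana/provisioning/dashboards
```
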

---

## Verification

After importing, verify dashboards are working:

1. **Check Prometheus Data Source:**
   - Go to **Configuration** → **Data Sources** (on newer Grafana versions: **Connections** → **Data sources**)
   - Verify **Prometheus** datasource exists and is reachable
   - Test connection

2. **Check Metrics Availability:**

   Open Prometheus UI:

   ```bash
   kubectl port-forward -n observability svc/prometheus 9090:9090
   ```

   Query test metrics:
   - `vapora_http_requests_total`
   - `vapora_agent_task_duration_seconds_bucket`
   - `vapora_llm_cost_total_cents`
   - `vapora_kg_total_executions`

3. **View Dashboards:**
   - Go to **Dashboards** → **Browse**
   - Look for "VAPORA" folder or tag
   - Open each dashboard
   - Verify panels show data (may take a few minutes after VAPORA starts)
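
The metric check in step 2 can also be scripted against the Prometheus HTTP API (`/api/v1/query`). This is a minimal sketch, assuming the port-forward from step 2 is active on `localhost:9090`; the function name `check_vapora_metrics` is hypothetical, and the string matching relies on Prometheus returning compact JSON with an empty `"result":[]` when no series match.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report which test metrics currently return series.
# Returns non-zero if any metric is missing or a query fails.
check_vapora_metrics() {
  local base="${1:-http://localhost:9090}" metric body status=0
  for metric in vapora_http_requests_total \
                vapora_agent_task_duration_seconds_bucket \
                vapora_llm_cost_total_cents \
                vapora_kg_total_executions; do
    body="$(curl -s "$base/api/v1/query?query=$metric")"
    # An instant query with no matching series returns "result":[].
    case "$body" in
      *'"status":"success"'*'"result":[]'*) echo "MISSING $metric"; status=1 ;;
      *'"status":"success"'*)               echo "OK      $metric" ;;
      *)                                    echo "ERROR   $metric"; status=1 ;;
    esac
  done
  return "$status"
}

# Usage: check_vapora_metrics            (defaults to http://localhost:9090)
```
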

---

## Customization

### Update Datasource

If your Prometheus datasource has a different name:

1. Open dashboard JSON file
2. Find all instances of `"uid": "${DS_PROMETHEUS}"`
3. Replace with your datasource UID
4. Re-import
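
Steps 2 and 3 can be done with `sed` rather than by hand. The sketch below writes a modified copy and leaves the original file untouched; the function name `set_dashboard_datasource` is hypothetical, and it assumes the UID contains only letters, digits, `-`, and `_`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: copy a dashboard, replacing the datasource placeholder.
# Assumes the UID has no characters that are special to sed (such as / or &).
set_dashboard_datasource() {
  local src="$1" uid="$2" dest="$3"
  # \$ matches a literal dollar sign; the braces are literal in basic regex.
  sed 's/\${DS_PROMETHEUS}/'"$uid"'/g' "$src" > "$dest"
}
```

For example, `set_dashboard_datasource vapora-overview.json my-prom-uid vapora-overview.local.json` produces a copy to import, where `my-prom-uid` stands in for your own datasource UID.
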

### Adjust Refresh Rate

To change auto-refresh interval:

1. Open dashboard in Grafana
2. Click **Dashboard settings** (gear icon)
3. Go to **General** tab
4. Update **Refresh** dropdown
5. Click **Save dashboard**

### Add Custom Panels

To add new panels:

1. Edit dashboard
2. Click **"Add panel"** → **"Add a new panel"**
3. Select Prometheus datasource
4. Write PromQL query (see **Metrics Used** above for examples)
5. Configure visualization
6. Click **"Apply"**
7. Save dashboard

---

## Troubleshooting

### No Data Shown

**Problem:** Panels show "No data"

**Solutions:**
1. **Check VAPORA is running:**

   ```bash
   kubectl get pods -n vapora
   # All pods should be Running
   ```

2. **Check Prometheus is scraping VAPORA:**

   ```bash
   kubectl port-forward -n observability svc/prometheus 9090:9090
   ```

   Open: http://localhost:9090/targets

   Look for targets such as `vapora-backend` and `vapora-a2a`, and check that they are up.

3. **Check metrics endpoint manually:**

   ```bash
   kubectl port-forward -n vapora svc/vapora-backend 8001:8001
   curl http://localhost:8001/metrics | grep vapora_
   ```

   The output should list Prometheus-format metrics.

4. **Wait a few minutes** for metrics to accumulate

### Wrong Datasource

**Problem:** Dashboard shows "Data source not found"

**Solution:**
- Edit dashboard
- Click **Dashboard settings** → **Variables**
- Update `DS_PROMETHEUS` variable to match your datasource name
- Save

### Missing Metrics

**Problem:** Some panels show "No data" while others work

**Solution:**
- Check if specific VAPORA features are enabled:
  - **Agent metrics:** Requires `vapora-agents` running
  - **LLM cost:** Requires LLM provider configured
  - **KG analytics:** Requires Knowledge Graph enabled
- Some metrics only appear after certain actions (e.g., task assignments, LLM calls)
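
To see which metrics a service is not yet exposing, compare a saved `/metrics` dump against the names listed under **Metrics Used**. This is a minimal sketch; the function name `missing_metrics` and the dump file name are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print expected metric names that are absent from a
# Prometheus text-format dump. Returns non-zero if any name is missing.
missing_metrics() {
  local dump="$1" name status=0
  shift
  for name in "$@"; do
    # Match the name at line start, followed by '{' (labels) or a space (value).
    if ! grep -Eq "^${name}(\{| )" "$dump"; then
      echo "$name"
      status=1
    fi
  done
  return "$status"
}

# Usage:
#   curl -s http://localhost:8001/metrics > metrics.txt
#   missing_metrics metrics.txt vapora_http_requests_total vapora_db_operations_total
```
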

---

## Dashboard Organization

Recommended Grafana folder structure:

```
📁 VAPORA/
├── 📊 Overview (vapora-overview)
├── 📊 Agent Metrics (vapora-agents)
├── 📊 LLM Cost Tracking (vapora-llm-cost)
└── 📊 Knowledge Graph Analytics (vapora-kg-analytics)
```

To create folder:
1. Go to **Dashboards** → **Browse**
2. Click **"New"** → **"New folder"**
3. Name: "VAPORA"
4. Move imported dashboards into this folder

---

## Alerting (Optional)

To set up alerts based on dashboard panels:

### Example: High Error Rate Alert

1. Open **VAPORA Overview** dashboard
2. Edit **"Error Rate"** panel
3. Go to **Alert** tab
4. Click **"Create alert rule from this panel"**
5. Configure:
   - **Name:** "VAPORA High Error Rate"
   - **Condition:** `avg() > 0.05` (5%, assuming the panel query returns a 0-1 ratio; use `> 5` if it returns a percentage)
   - **For:** 5 minutes
   - **Annotations:** "VAPORA error rate exceeded 5%"
6. Save

### Example: Budget Exceeded Alert

1. Open **VAPORA LLM Cost Tracking** dashboard
2. Edit **"Budget Usage %"** panel
3. Create alert:
   - **Name:** "LLM Budget Near Limit"
   - **Condition:** `last() > 0.9` (90%, assuming the panel query returns a 0-1 ratio; use `> 90` if it returns a percentage)
   - **For:** 1 minute
   - **Annotations:** "LLM budget usage exceeded 90%"

---

## Maintenance

### Update Dashboards

When VAPORA metrics change:

1. Export current dashboard JSON
2. Edit JSON file with new metrics
3. Increment version number
4. Re-import (overwrites existing)

### Backup Dashboards

```bash
# Export one VAPORA dashboard by UID
curl -H "Authorization: Bearer $GRAFANA_API_KEY" \
  "http://localhost:3000/api/dashboards/uid/vapora-overview" \
  > vapora-overview-backup.json

# Repeat for other dashboard UIDs:
# - vapora-agents
# - vapora-llm-cost
# - vapora-kg-analytics
```
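
The per-UID repetition can be folded into a loop. This is a minimal sketch, assuming `GRAFANA_API_KEY` is set and Grafana is reachable at the given URL; the function name `backup_vapora_dashboards` is hypothetical. Each response from this endpoint wraps the dashboard model under a `dashboard` key next to a `meta` object, so the saved files are not identical to the JSON files in this directory.

```bash
#!/usr/bin/env bash
# Hypothetical helper: export every VAPORA dashboard listed in this README.
# $1 = Grafana base URL (default http://localhost:3000), $2 = output directory.
backup_vapora_dashboards() {
  local base="${1:-http://localhost:3000}" out="${2:-.}" uid
  for uid in vapora-overview vapora-agents vapora-llm-cost vapora-kg-analytics; do
    # -f makes curl exit non-zero on HTTP errors such as 401 or 404.
    curl -fsS -H "Authorization: Bearer $GRAFANA_API_KEY" \
      "$base/api/dashboards/uid/$uid" > "$out/$uid-backup.json" || return 1
  done
}

# Usage: backup_vapora_dashboards http://localhost:3000 ./backups
```
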

---

## Support

For dashboard issues:
- Check **VAPORA Metrics Documentation**: `docs/architecture/metrics.md`
- Check **Prometheus Setup**: `docs/operations/monitoring.md`
- Review **Grafana Docs**: https://grafana.com/docs/

For VAPORA metrics questions:
- See: `.claude/CLAUDE.md` → **Debugging & Monitoring** section
- Check: `crates/*/src/metrics.rs` files for metric definitions

---

**Last Updated:** 2026-02-08
**VAPORA Version:** 1.2.0
**Grafana Version:** 10.0+
**Prometheus Version:** 2.40+