provisioning/docs/src/operations/monitoring.md

512 lines
12 KiB
Markdown
Raw Normal View History

2026-01-17 03:58:28 +00:00
# Monitoring
Comprehensive observability stack for the Provisioning platform using Prometheus, Grafana, and custom metrics.
## Monitoring Stack Overview
The platform monitoring system consists of:
| Component | Purpose | Port | Status |
| --------- | ------- | ---- | ------ |
| **Prometheus** | Metrics collection and storage | 9090 | Production |
| **Grafana** | Visualization and dashboards | 3000 | Production |
| **Loki** | Log aggregation | 3100 | Active |
| **Alertmanager** | Alert routing and notification | 9093 | Production |
| **Node Exporter** | System metrics | 9100 | Production |
## Quick Start
Install monitoring stack:
```bash
# Install all monitoring components
provisioning monitoring install
# Install specific components
provisioning monitoring install --components prometheus,grafana
# Start monitoring services
provisioning monitoring start
```
Access dashboards:
- Prometheus: [http://localhost:9090](http://localhost:9090)
- Grafana: [http://localhost:3000](http://localhost:3000) (admin/admin)
- Alertmanager: [http://localhost:9093](http://localhost:9093)
## Prometheus Configuration
### Service Discovery
Prometheus automatically discovers platform services:
```yaml
# /etc/provisioning/prometheus/prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'provisioning-orchestrator'
static_configs:
- targets: ['localhost:8080']
metrics_path: '/metrics'
- job_name: 'provisioning-control-center'
static_configs:
- targets: ['localhost:8081']
- job_name: 'provisioning-vault-service'
static_configs:
- targets: ['localhost:8085']
- job_name: 'node-exporter'
static_configs:
- targets: ['localhost:9100']
```
### Retention Configuration
```yaml
global:
external_labels:
cluster: 'provisioning-production'
# Storage retention
storage:
tsdb:
retention.time: 30d
retention.size: 50GB
```
## Key Metrics
### Platform Metrics
Orchestrator metrics:
```text
provisioning_workflows_total - Total workflows created
provisioning_workflows_active - Currently active workflows
provisioning_workflows_completed - Successfully completed workflows
provisioning_workflows_failed - Failed workflows
provisioning_tasks_queued - Tasks in queue
provisioning_tasks_running - Currently executing tasks
provisioning_tasks_completed - Total completed tasks
provisioning_checkpoint_recoveries - Checkpoint recovery count
```
Control Center metrics:
```text
provisioning_api_requests_total - Total API requests
provisioning_api_requests_duration_seconds - Request latency histogram
provisioning_auth_attempts_total - Authentication attempts
provisioning_auth_failures_total - Failed authentication attempts
provisioning_rbac_denials_total - Authorization denials
```
Vault Service metrics:
```text
provisioning_secrets_operations_total - Secret operations count
provisioning_kms_encryptions_total - Encryption operations
provisioning_kms_decryptions_total - Decryption operations
provisioning_kms_latency_seconds - KMS operation latency
```
### System Metrics
Node Exporter provides system-level metrics:
```text
node_cpu_seconds_total - CPU time per core
node_memory_MemAvailable_bytes - Available memory
node_disk_io_time_seconds_total - Disk I/O time
node_network_receive_bytes_total - Network RX bytes
node_network_transmit_bytes_total - Network TX bytes
node_filesystem_avail_bytes - Available disk space
```
## Grafana Dashboards
### Pre-built Dashboards
Import platform dashboards:
```bash
# Install all pre-built dashboards
provisioning monitoring install-dashboards
# List available dashboards
provisioning monitoring list-dashboards
```
Available dashboards:
1. **Platform Overview** - High-level system status
2. **Orchestrator Performance** - Workflow and task metrics
3. **Control Center API** - API request metrics and latency
4. **Vault Service KMS** - Encryption operations and performance
5. **System Resources** - CPU, memory, disk, network
6. **Security Events** - Authentication, authorization, audit logs
7. **Database Performance** - SurrealDB metrics
### Custom Dashboard Creation
Create custom dashboards via Grafana UI or provisioning:
```json
{
"dashboard": {
"title": "Custom Infrastructure Dashboard",
"panels": [
{
"title": "Active Workflows",
"targets": [
{
"expr": "provisioning_workflows_active",
"legendFormat": "Active Workflows"
}
],
"type": "graph"
}
]
}
}
```
Save dashboard:
```bash
provisioning monitoring export-dashboard --id 1 --output custom-dashboard.json
```
## Alerting
### Alert Rules
Configure alert rules in Prometheus:
```yaml
# /etc/provisioning/prometheus/alerts/provisioning.yml
groups:
- name: provisioning_alerts
interval: 30s
rules:
- alert: OrchestratorDown
expr: up{job="provisioning-orchestrator"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Orchestrator service is down"
description: "Orchestrator has been down for more than 1 minute"
- alert: HighWorkflowFailureRate
expr: |
rate(provisioning_workflows_failed[5m]) /
rate(provisioning_workflows_total[5m]) > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "High workflow failure rate"
description: "More than 10% of workflows are failing"
- alert: DatabaseConnectionLoss
expr: provisioning_database_connected == 0
for: 30s
labels:
severity: critical
annotations:
summary: "Database connection lost"
- alert: HighMemoryUsage
expr: |
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
for: 5m
labels:
severity: warning
annotations:
summary: "High memory usage"
description: "Memory usage is above 90%"
- alert: DiskSpaceLow
expr: |
(node_filesystem_avail_bytes{mountpoint="/var/lib/provisioning"} /
node_filesystem_size_bytes{mountpoint="/var/lib/provisioning"}) < 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "Low disk space"
description: "Less than 10% disk space available"
```
### Alertmanager Configuration
Route alerts to appropriate channels:
```yaml
# /etc/provisioning/alertmanager/alertmanager.yml
global:
resolve_timeout: 5m
route:
group_by: ['alertname', 'severity']
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
receiver: 'team-email'
routes:
- match:
severity: critical
receiver: 'pagerduty'
continue: true
- match:
severity: warning
receiver: 'slack'
receivers:
- name: 'team-email'
email_configs:
- to: '[ops@example.com](mailto:ops@example.com)'
from: '[alerts@provisioning.example.com](mailto:alerts@provisioning.example.com)'
smarthost: 'smtp.example.com:587'
- name: 'pagerduty'
pagerduty_configs:
- service_key: '<pagerduty-key>'
- name: 'slack'
slack_configs:
- api_url: '<slack-webhook-url>'
channel: '#provisioning-alerts'
```
Test alerts:
```bash
# Send test alert
provisioning monitoring test-alert --severity critical
# Silence alerts temporarily
provisioning monitoring silence --duration 2h --reason "Maintenance window"
```
## Log Aggregation with Loki
### Loki Configuration
```yaml
# /etc/provisioning/loki/loki.yml
auth_enabled: false
server:
http_listen_port: 3100
ingester:
lifecycler:
ring:
kvstore:
store: inmemory
replication_factor: 1
schema_config:
configs:
- from: 2024-01-01
store: boltdb-shipper
object_store: filesystem
schema: v11
index:
prefix: index_
period: 24h
storage_config:
boltdb_shipper:
active_index_directory: /var/lib/loki/boltdb-shipper-active
cache_location: /var/lib/loki/boltdb-shipper-cache
filesystem:
directory: /var/lib/loki/chunks
limits_config:
retention_period: 720h # 30 days
```
### Promtail for Log Shipping
```yaml
# /etc/provisioning/promtail/promtail.yml
server:
http_listen_port: 9080
positions:
filename: /tmp/positions.yaml
clients:
- url: [http://localhost:3100/loki/api/v1/push](http://localhost:3100/loki/api/v1/push)
scrape_configs:
- job_name: system
static_configs:
- targets:
- localhost
labels:
job: varlogs
__path__: /var/log/provisioning/*.log
- job_name: journald
journal:
max_age: 12h
labels:
job: systemd-journal
relabel_configs:
- source_labels: ['__journal__systemd_unit']
target_label: 'unit'
```
Query logs in Grafana:
```logql
{job="varlogs"} | = "error"
{unit="provisioning-orchestrator.service"} | = "workflow" | json
```
## Tracing with Tempo
### Distributed Tracing
Enable OpenTelemetry tracing in services:
```toml
# /etc/provisioning/config.toml
[tracing]
enabled = true
exporter = "otlp"
endpoint = "localhost:4317"
service_name = "provisioning-orchestrator"
```
Tempo configuration:
```yaml
# /etc/provisioning/tempo/tempo.yml
server:
http_listen_port: 3200
distributor:
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
storage:
trace:
backend: local
local:
path: /var/lib/tempo/traces
query_frontend:
search:
enabled: true
```
View traces in Grafana or Tempo UI.
## Performance Monitoring
### Query Performance
Monitor slow queries:
```promql
# 95th percentile API latency
histogram_quantile(0.95,
rate(provisioning_api_requests_duration_seconds_bucket[5m])
)
# Slow workflows (>60s)
provisioning_workflow_duration_seconds > 60
```
### Resource Monitoring
Track resource utilization:
```promql
# CPU usage per service
rate(process_cpu_seconds_total{job=~"provisioning-.*"}[5m]) * 100
# Memory usage per service
process_resident_memory_bytes{job=~"provisioning-.*"}
# Disk I/O rate
rate(node_disk_io_time_seconds_total[5m])
```
## Custom Metrics
### Adding Custom Metrics
Rust services use prometheus crate:
```rust
use prometheus::{Counter, Histogram, Registry};
// Create metrics
let workflow_counter = Counter::new(
"provisioning_custom_workflows",
"Custom workflow counter"
)?;
let task_duration = Histogram::with_opts(
HistogramOpts::new("provisioning_task_duration", "Task duration")
.buckets(vec![0.1, 0.5, 1.0, 5.0, 10.0])
)?;
// Register metrics
registry.register(Box::new(workflow_counter))?;
registry.register(Box::new(task_duration))?;
// Use metrics
workflow_counter.inc();
task_duration.observe(duration_seconds);
```
Nushell scripts export metrics:
```nushell
# Export metrics in Prometheus format
def export-metrics [] {
[
"# HELP provisioning_custom_metric Custom metric"
"# TYPE provisioning_custom_metric counter"
$"provisioning_custom_metric (get-metric-value)"
] | str join "
"
}
```
## Monitoring Best Practices
- Set appropriate scrape intervals (15-60s)
- Configure retention based on compliance requirements
- Use labels for multi-dimensional metrics
- Create dashboards for key business metrics
- Set up alerts for critical failures only
- Document alert thresholds and runbooks
- Review and tune alerts regularly
- Use recording rules for expensive queries
- Archive long-term metrics to object storage
## Related Documentation
- [Service Management](service-management.md) - Service lifecycle
- [Platform Health](platform-health.md) - Health checks
- [Troubleshooting](troubleshooting.md) - Debugging issues