Monitoring
Comprehensive observability stack for the Provisioning platform using Prometheus, Grafana, and custom metrics.
Monitoring Stack Overview
The platform monitoring system consists of:
| Component | Purpose | Port | Status |
|---|---|---|---|
| Prometheus | Metrics collection and storage | 9090 | Production |
| Grafana | Visualization and dashboards | 3000 | Production |
| Loki | Log aggregation | 3100 | Active |
| Alertmanager | Alert routing and notification | 9093 | Production |
| Node Exporter | System metrics | 9100 | Production |
Quick Start
Install monitoring stack:
# Install all monitoring components
provisioning monitoring install
# Install specific components
provisioning monitoring install --components prometheus,grafana
# Start monitoring services
provisioning monitoring start
Access dashboards:
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (default login admin/admin; change the password after first sign-in)
- Alertmanager: http://localhost:9093
Prometheus Configuration
Service Discovery
Prometheus scrapes each platform service as a statically configured target:
# /etc/provisioning/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'provisioning-orchestrator'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/metrics'
  - job_name: 'provisioning-control-center'
    static_configs:
      - targets: ['localhost:8081']
  - job_name: 'provisioning-vault-service'
    static_configs:
      - targets: ['localhost:8085']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
Retention Configuration
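global:
  external_labels:
    cluster: 'provisioning-production'
# Storage retention
# Prometheus has traditionally taken these values as command-line flags
# (--storage.tsdb.retention.time, --storage.tsdb.retention.size); use the
# flags if your Prometheus version rejects the keys below.
storage:
  tsdb:
    retention.time: 30d
    retention.size: 50GB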
Key Metrics
Platform Metrics
Orchestrator metrics:
provisioning_workflows_total - Total workflows created
provisioning_workflows_active - Currently active workflows
provisioning_workflows_completed - Successfully completed workflows
provisioning_workflows_failed - Failed workflows
provisioning_tasks_queued - Tasks in queue
provisioning_tasks_running - Currently executing tasks
provisioning_tasks_completed - Total completed tasks
provisioning_checkpoint_recoveries - Checkpoint recovery count
Control Center metrics:
provisioning_api_requests_total - Total API requests
provisioning_api_requests_duration_seconds - Request latency histogram
provisioning_auth_attempts_total - Authentication attempts
provisioning_auth_failures_total - Failed authentication attempts
provisioning_rbac_denials_total - Authorization denials
Vault Service metrics:
provisioning_secrets_operations_total - Secret operations count
provisioning_kms_encryptions_total - Encryption operations
provisioning_kms_decryptions_total - Decryption operations
provisioning_kms_latency_seconds - KMS operation latency
System Metrics
Node Exporter provides system-level metrics:
node_cpu_seconds_total - CPU time per core
node_memory_MemAvailable_bytes - Available memory
node_disk_io_time_seconds_total - Disk I/O time
node_network_receive_bytes_total - Network RX bytes
node_network_transmit_bytes_total - Network TX bytes
node_filesystem_avail_bytes - Available disk space
Grafana Dashboards
Pre-built Dashboards
Import platform dashboards:
# Install all pre-built dashboards
provisioning monitoring install-dashboards
# List available dashboards
provisioning monitoring list-dashboards
Available dashboards:
- Platform Overview - High-level system status
- Orchestrator Performance - Workflow and task metrics
- Control Center API - API request metrics and latency
- Vault Service KMS - Encryption operations and performance
- System Resources - CPU, memory, disk, network
- Security Events - Authentication, authorization, audit logs
- Database Performance - SurrealDB metrics
Custom Dashboard Creation
Create custom dashboards through the Grafana UI or by provisioning a JSON definition:
{
  "dashboard": {
    "title": "Custom Infrastructure Dashboard",
    "panels": [
      {
        "title": "Active Workflows",
        "targets": [
          {
            "expr": "provisioning_workflows_active",
            "legendFormat": "Active Workflows"
          }
        ],
        "type": "graph"
      }
    ]
  }
}
Save dashboard:
provisioning monitoring export-dashboard --id 1 --output custom-dashboard.json
Alerting
Alert Rules
Configure alert rules in Prometheus:
# /etc/provisioning/prometheus/alerts/provisioning.yml
groups:
  - name: provisioning_alerts
    interval: 30s
    rules:
      - alert: OrchestratorDown
        expr: up{job="provisioning-orchestrator"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Orchestrator service is down"
          description: "Orchestrator has been down for more than 1 minute"
      - alert: HighWorkflowFailureRate
        expr: |
          rate(provisioning_workflows_failed[5m]) /
          rate(provisioning_workflows_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High workflow failure rate"
          description: "More than 10% of workflows are failing"
      - alert: DatabaseConnectionLoss
        expr: provisioning_database_connected == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Database connection lost"
      - alert: HighMemoryUsage
        expr: |
          (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage is above 90%"
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/var/lib/provisioning"} /
          node_filesystem_size_bytes{mountpoint="/var/lib/provisioning"}) < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space"
          description: "Less than 10% disk space available"
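The `for:` clause in each rule means the expression must stay true across consecutive evaluations for that long before the alert fires; until then the alert is only pending. The following is a minimal sketch of that state machine, not Prometheus's implementation. `AlertRule`, `AlertState`, and `evaluate` are hypothetical names introduced for illustration, and timestamps are plain seconds.

```rust
/// State of a single alert rule after one evaluation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AlertState {
    Inactive,
    Pending,
    Firing,
}

/// Tracks how long a rule's expression has been continuously true.
/// Hypothetical type for illustration only.
pub struct AlertRule {
    hold_secs: u64,            // the rule's `for:` duration, in seconds
    active_since: Option<u64>, // time of the first consecutive true evaluation
}

impl AlertRule {
    pub fn new(hold_secs: u64) -> Self {
        AlertRule { hold_secs, active_since: None }
    }

    /// Record one evaluation at `now_secs` and return the resulting state.
    pub fn evaluate(&mut self, now_secs: u64, condition: bool) -> AlertState {
        if !condition {
            // A single false evaluation resets the timer.
            self.active_since = None;
            return AlertState::Inactive;
        }
        let since = *self.active_since.get_or_insert(now_secs);
        if now_secs.saturating_sub(since) >= self.hold_secs {
            AlertState::Firing
        } else {
            AlertState::Pending
        }
    }
}
```

With `AlertRule::new(60)` (matching `for: 1m`) and evaluations every 30 seconds, true results at 0 s and 30 s leave the alert pending, and the third true result at 60 s makes it fire. A false result at any point returns it to inactive and restarts the clock.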
Alertmanager Configuration
Route alerts to appropriate channels:
# /etc/provisioning/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'team-email'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true
    - match:
        severity: warning
      receiver: 'slack'
receivers:
  - name: 'team-email'
    email_configs:
      - to: 'ops@example.com'
        from: 'alerts@provisioning.example.com'
        smarthost: 'smtp.example.com:587'
  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '<pagerduty-key>'
  - name: 'slack'
    slack_configs:
      - api_url: '<slack-webhook-url>'
        channel: '#provisioning-alerts'
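To make the routing tree easier to reason about, here is a rough sketch of how a severity label could be resolved to receivers under the configuration above. It assumes the usual Alertmanager behavior: child routes are tried in order, matching stops at the first hit unless `continue: true` is set, and the parent receiver is used only when no child matches. `Route`, `child_routes`, and `receivers_for` are hypothetical names, and grouping and timing are ignored.

```rust
/// One child route from the `routes:` list (hypothetical type).
pub struct Route {
    pub severity: &'static str,
    pub receiver: &'static str,
    pub continue_matching: bool, // the `continue:` flag
}

/// The two child routes from the configuration above.
pub fn child_routes() -> Vec<Route> {
    vec![
        Route { severity: "critical", receiver: "pagerduty", continue_matching: true },
        Route { severity: "warning", receiver: "slack", continue_matching: false },
    ]
}

/// Resolve the receivers for an alert with the given severity label.
pub fn receivers_for(
    severity: &str,
    routes: &[Route],
    default_receiver: &'static str,
) -> Vec<&'static str> {
    let mut matched = Vec::new();
    for route in routes {
        if route.severity == severity {
            matched.push(route.receiver);
            if !route.continue_matching {
                break; // first match wins unless `continue` is set
            }
        }
    }
    if matched.is_empty() {
        // No child matched: fall back to the parent route's receiver.
        matched.push(default_receiver);
    }
    matched
}
```

Under these assumptions a critical alert reaches only `pagerduty`: `continue: true` lets evaluation proceed to the next sibling, but the warning route does not match it. An alert with any other severity falls back to `team-email`.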
Test alerts:
# Send test alert
provisioning monitoring test-alert --severity critical
# Silence alerts temporarily
provisioning monitoring silence --duration 2h --reason "Maintenance window"
Log Aggregation with Loki
Loki Configuration
# /etc/provisioning/loki/loki.yml
auth_enabled: false
server:
  http_listen_port: 3100
ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
schema_config:
  configs:
    - from: 2024-01-01
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h
storage_config:
  boltdb_shipper:
    active_index_directory: /var/lib/loki/boltdb-shipper-active
    cache_location: /var/lib/loki/boltdb-shipper-cache
  filesystem:
    directory: /var/lib/loki/chunks
limits_config:
  retention_period: 720h # 30 days
Promtail for Log Shipping
# /etc/provisioning/promtail/promtail.yml
server:
  http_listen_port: 9080
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://localhost:3100/loki/api/v1/push
scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/provisioning/*.log
  - job_name: journald
    journal:
      max_age: 12h
      labels:
        job: systemd-journal
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: 'unit'
Query logs in Grafana:
{job="varlogs"} |= "error"
{unit="provisioning-orchestrator.service"} |= "workflow" | json
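Each query has two stages: a label selector in braces picks the log streams, then the line filter keeps only lines containing the given substring. The sketch below mimics those two stages over an in-memory list. `LogEntry`, `entry`, and `query` are hypothetical helpers for illustration and are not part of Loki.

```rust
use std::collections::HashMap;

/// A log line together with the labels of the stream it belongs to.
pub struct LogEntry {
    pub labels: HashMap<String, String>,
    pub line: String,
}

/// Convenience constructor for the example.
pub fn entry(labels: &[(&str, &str)], line: &str) -> LogEntry {
    LogEntry {
        labels: labels.iter().map(|(k, v)| (k.to_string(), v.to_string())).collect(),
        line: line.to_string(),
    }
}

/// Stage 1: keep entries whose `label` equals `value` (the `{...}` selector).
/// Stage 2: keep lines containing `needle` (a case-sensitive substring match).
pub fn query<'a>(
    entries: &'a [LogEntry],
    label: &str,
    value: &str,
    needle: &str,
) -> Vec<&'a str> {
    entries
        .iter()
        .filter(|e| e.labels.get(label).map(String::as_str) == Some(value))
        .filter(|e| e.line.contains(needle))
        .map(|e| e.line.as_str())
        .collect()
}
```

Calling `query(&entries, "job", "varlogs", "error")` corresponds to the first query above. The `| json` stage of the second query, which parses each line into fields, is not modeled here.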
Tracing with Tempo
Distributed Tracing
Enable OpenTelemetry tracing in services:
# /etc/provisioning/config.toml
[tracing]
enabled = true
exporter = "otlp"
endpoint = "localhost:4317"
service_name = "provisioning-orchestrator"
Tempo configuration:
# /etc/provisioning/tempo/tempo.yml
server:
  http_listen_port: 3200
distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
storage:
  trace:
    backend: local
    local:
      path: /var/lib/tempo/traces
query_frontend:
  search:
    enabled: true
View traces in Grafana or Tempo UI.
Performance Monitoring
Query Performance
Monitor slow queries:
# 95th percentile API latency
histogram_quantile(0.95,
rate(provisioning_api_requests_duration_seconds_bucket[5m])
)
# Slow workflows (>60s)
provisioning_workflow_duration_seconds > 60
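`histogram_quantile` does not see individual request durations. It estimates the quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below shows that idea in simplified form; `Bucket` and `estimate_quantile` are hypothetical names, and Prometheus's real function handles further edge cases.

```rust
/// A cumulative histogram bucket: the number of observations
/// with a value less than or equal to `upper_bound`.
#[derive(Debug, Clone, Copy)]
pub struct Bucket {
    pub upper_bound: f64,
    pub cumulative_count: f64,
}

/// Estimate quantile `q` (0.0 to 1.0) from buckets sorted by upper bound.
pub fn estimate_quantile(q: f64, buckets: &[Bucket]) -> Option<f64> {
    let total = buckets.last()?.cumulative_count;
    if !(0.0..=1.0).contains(&q) || total <= 0.0 {
        return None;
    }
    let rank = q * total;
    // First bucket whose cumulative count reaches the target rank.
    let idx = buckets.iter().position(|b| b.cumulative_count >= rank)?;
    let upper = buckets[idx];
    if upper.upper_bound.is_infinite() {
        // Rank falls in the +Inf bucket: report the highest finite bound.
        return if idx > 0 { Some(buckets[idx - 1].upper_bound) } else { None };
    }
    // The first bucket is assumed to start at zero.
    let (lower_bound, lower_count) = if idx == 0 {
        (0.0, 0.0)
    } else {
        (buckets[idx - 1].upper_bound, buckets[idx - 1].cumulative_count)
    };
    let in_bucket = upper.cumulative_count - lower_count;
    if in_bucket <= 0.0 {
        return Some(upper.upper_bound);
    }
    Some(lower_bound + (upper.upper_bound - lower_bound) * (rank - lower_count) / in_bucket)
}
```

As an illustration with made-up counts: for buckets 0.1 s, 0.5 s, 1 s, and 5 s holding 10, 60, 90, and 100 cumulative observations, the 95th percentile lands halfway through the 1 s to 5 s bucket, giving an estimate of about 3 s. The result is only as precise as the bucket boundaries, so place boundaries near the latencies you alert on.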
Resource Monitoring
Track resource utilization:
# CPU usage per service
rate(process_cpu_seconds_total{job=~"provisioning-.*"}[5m]) * 100
# Memory usage per service
process_resident_memory_bytes{job=~"provisioning-.*"}
# Disk I/O rate
rate(node_disk_io_time_seconds_total[5m])
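The `rate()` calls above turn ever-increasing counters into per-second values. As a simplified sketch of the arithmetic, the function below sums the increases between consecutive samples in a window, treats any drop as a counter reset, and divides by the elapsed time. `Sample` and `per_second_rate` are hypothetical names, and Prometheus additionally extrapolates to the window boundaries, which is omitted here.

```rust
/// One scraped counter sample.
#[derive(Debug, Clone, Copy)]
pub struct Sample {
    pub timestamp_secs: f64,
    pub value: f64,
}

/// Average per-second increase of a counter across the given samples.
/// Returns None when fewer than two distinct points in time are available.
pub fn per_second_rate(samples: &[Sample]) -> Option<f64> {
    let first = samples.first()?;
    let last = samples.last()?;
    let elapsed = last.timestamp_secs - first.timestamp_secs;
    if elapsed <= 0.0 {
        return None;
    }
    let mut increase = 0.0;
    for pair in samples.windows(2) {
        let (prev, cur) = (pair[0], pair[1]);
        if cur.value >= prev.value {
            increase += cur.value - prev.value;
        } else {
            // Counter reset (for example a process restart): it began again from zero.
            increase += cur.value;
        }
    }
    Some(increase / elapsed)
}
```

For `process_cpu_seconds_total`, a rate of 0.1 means the process used 0.1 CPU-seconds per wall-clock second, which the query above multiplies by 100 to report 10% of one core.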
Custom Metrics
Adding Custom Metrics
Rust services use the prometheus crate:
use prometheus::{Counter, Histogram, HistogramOpts, Registry};

// Create a registry (or reuse the service's existing one)
let registry = Registry::new();

// Create metrics
let workflow_counter = Counter::new(
    "provisioning_custom_workflows",
    "Custom workflow counter",
)?;
let task_duration = Histogram::with_opts(
    HistogramOpts::new("provisioning_task_duration", "Task duration")
        .buckets(vec![0.1, 0.5, 1.0, 5.0, 10.0]),
)?;

// Register clones so the original handles stay usable;
// clones share the same underlying values
registry.register(Box::new(workflow_counter.clone()))?;
registry.register(Box::new(task_duration.clone()))?;

// Use metrics
workflow_counter.inc();
task_duration.observe(duration_seconds);
Nushell scripts export metrics:
# Export metrics in Prometheus format
def export-metrics [] {
    [
        "# HELP provisioning_custom_metric Custom metric"
        "# TYPE provisioning_custom_metric counter"
        $"provisioning_custom_metric (get-metric-value)"
    ] | str join "\n"
}
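For services that need the same output without the prometheus crate, the text exposition format produced by the Nushell script above is simple enough to render by hand. The following is a minimal sketch for a single unlabeled sample; `MetricKind` and `render_metric` are hypothetical names introduced here.

```rust
/// Metric types covered by this sketch.
pub enum MetricKind {
    Counter,
    Gauge,
}

/// Render one unlabeled sample in the Prometheus text exposition format:
/// a HELP line, a TYPE line, then `name value`, each ending in a newline.
pub fn render_metric(name: &str, help: &str, kind: MetricKind, value: f64) -> String {
    let kind = match kind {
        MetricKind::Counter => "counter",
        MetricKind::Gauge => "gauge",
    };
    format!("# HELP {name} {help}\n# TYPE {name} {kind}\n{name} {value}\n")
}
```

Serving the returned string from the path named in `metrics_path` lets the scrape job collect it like any other metric. Labels and escaping of special characters in the help text are left out of this sketch.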
Monitoring Best Practices
- Set appropriate scrape intervals (15-60s)
- Configure retention based on compliance requirements
- Use labels for multi-dimensional metrics
- Create dashboards for key business metrics
- Set up alerts for critical failures only
- Document alert thresholds and runbooks
- Review and tune alerts regularly
- Use recording rules for expensive queries
- Archive long-term metrics to object storage
Related Documentation
- Service Management - Service lifecycle
- Platform Health - Health checks
- Troubleshooting - Debugging issues