# Monitoring

Comprehensive observability stack for the Provisioning platform using Prometheus, Grafana, and custom metrics.

## Monitoring Stack Overview

The platform monitoring system consists of:

| Component | Purpose | Port | Status |
| --------- | ------- | ---- | ------ |
| **Prometheus** | Metrics collection and storage | 9090 | Production |
| **Grafana** | Visualization and dashboards | 3000 | Production |
| **Loki** | Log aggregation | 3100 | Active |
| **Alertmanager** | Alert routing and notification | 9093 | Production |
| **Node Exporter** | System metrics | 9100 | Production |

## Quick Start

Install the monitoring stack:

```bash
# Install all monitoring components
provisioning monitoring install

# Install specific components
provisioning monitoring install --components prometheus,grafana

# Start monitoring services
provisioning monitoring start
```

Access dashboards:

- Prometheus: [http://localhost:9090](http://localhost:9090)
- Grafana: [http://localhost:3000](http://localhost:3000) (admin/admin)
- Alertmanager: [http://localhost:9093](http://localhost:9093)

## Prometheus Configuration

### Service Discovery

Prometheus scrapes each platform service as a statically configured target:

```yaml
# /etc/provisioning/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'provisioning-orchestrator'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/metrics'

  - job_name: 'provisioning-control-center'
    static_configs:
      - targets: ['localhost:8081']

  - job_name: 'provisioning-vault-service'
    static_configs:
      - targets: ['localhost:8085']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
```

### Retention Configuration

```yaml
global:
  external_labels:
    cluster: 'provisioning-production'

# Storage retention
storage:
  tsdb:
    retention.time: 30d
    retention.size: 50GB
```

## Key Metrics

### Platform Metrics

Orchestrator metrics:

```text
provisioning_workflows_total - Total workflows created
provisioning_workflows_active - Currently active workflows
provisioning_workflows_completed - Successfully completed workflows
provisioning_workflows_failed - Failed workflows
provisioning_tasks_queued - Tasks in queue
provisioning_tasks_running - Currently executing tasks
provisioning_tasks_completed - Total completed tasks
provisioning_checkpoint_recoveries - Checkpoint recovery count
```

Control Center metrics:

```text
provisioning_api_requests_total - Total API requests
provisioning_api_requests_duration_seconds - Request latency histogram
provisioning_auth_attempts_total - Authentication attempts
provisioning_auth_failures_total - Failed authentication attempts
provisioning_rbac_denials_total - Authorization denials
```

Vault Service metrics:

```text
provisioning_secrets_operations_total - Secret operations count
provisioning_kms_encryptions_total - Encryption operations
provisioning_kms_decryptions_total - Decryption operations
provisioning_kms_latency_seconds - KMS operation latency
```

### System Metrics

Node Exporter provides system-level metrics:

```text
node_cpu_seconds_total - CPU time per core
node_memory_MemAvailable_bytes - Available memory
node_disk_io_time_seconds_total - Disk I/O time
node_network_receive_bytes_total - Network RX bytes
node_network_transmit_bytes_total - Network TX bytes
node_filesystem_avail_bytes - Available disk space
```

## Grafana Dashboards

### Pre-built Dashboards

Import platform dashboards:

```bash
# Install all pre-built dashboards
provisioning monitoring install-dashboards

# List available dashboards
provisioning monitoring list-dashboards
```

Available dashboards:

1. **Platform Overview** - High-level system status
2. **Orchestrator Performance** - Workflow and task metrics
3. **Control Center API** - API request metrics and latency
4. **Vault Service KMS** - Encryption operations and performance
5. **System Resources** - CPU, memory, disk, network
6. **Security Events** - Authentication, authorization, audit logs
7. **Database Performance** - SurrealDB metrics

### Custom Dashboard Creation

Create custom dashboards via Grafana UI or provisioning:

```json
{
  "dashboard": {
    "title": "Custom Infrastructure Dashboard",
    "panels": [
      {
        "title": "Active Workflows",
        "targets": [
          {
            "expr": "provisioning_workflows_active",
            "legendFormat": "Active Workflows"
          }
        ],
        "type": "graph"
      }
    ]
  }
}
```

Save dashboard:

```bash
provisioning monitoring export-dashboard --id 1 --output custom-dashboard.json
```

## Alerting

### Alert Rules

Configure alert rules in Prometheus:

```yaml
# /etc/provisioning/prometheus/alerts/provisioning.yml
groups:
  - name: provisioning_alerts
    interval: 30s
    rules:
      - alert: OrchestratorDown
        expr: up{job="provisioning-orchestrator"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Orchestrator service is down"
          description: "Orchestrator has been down for more than 1 minute"

      - alert: HighWorkflowFailureRate
        expr: |
          rate(provisioning_workflows_failed[5m]) /
          rate(provisioning_workflows_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High workflow failure rate"
          description: "More than 10% of workflows are failing"

      - alert: DatabaseConnectionLoss
        expr: provisioning_database_connected == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Database connection lost"

      - alert: HighMemoryUsage
        expr: |
          (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage is above 90%"

      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/var/lib/provisioning"} /
          node_filesystem_size_bytes{mountpoint="/var/lib/provisioning"}) < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space"
          description: "Less than 10% disk space available"
```

### Alertmanager Configuration

Route alerts to appropriate channels:

```yaml
# /etc/provisioning/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'team-email'
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true
    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'team-email'
    email_configs:
      - to: 'ops@example.com'
        from: 'alerts@provisioning.example.com'
        smarthost: 'smtp.example.com:587'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: ''

  - name: 'slack'
    slack_configs:
      - api_url: ''
        channel: '#provisioning-alerts'
```

Test alerts:

```bash
# Send test alert
provisioning monitoring test-alert --severity critical

# Silence alerts temporarily
provisioning monitoring silence --duration 2h --reason "Maintenance window"
```

## Log Aggregation with Loki

### Loki Configuration

```yaml
# /etc/provisioning/loki/loki.yml
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1

schema_config:
  configs:
    - from: 2024-01-01
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /var/lib/loki/boltdb-shipper-active
    cache_location: /var/lib/loki/boltdb-shipper-cache
  filesystem:
    directory: /var/lib/loki/chunks

limits_config:
  retention_period: 720h  # 30 days
```

### Promtail for Log Shipping

```yaml
# /etc/provisioning/promtail/promtail.yml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://localhost:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/provisioning/*.log

  - job_name: journald
    journal:
      max_age: 12h
      labels:
        job: systemd-journal
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: 'unit'
```

Query logs in Grafana:

```logql
{job="varlogs"} |= "error"
{unit="provisioning-orchestrator.service"} |= "workflow" | json
```

## Tracing with Tempo

### Distributed Tracing

Enable OpenTelemetry tracing in services:

```toml
# /etc/provisioning/config.toml
[tracing]
enabled = true
exporter = "otlp"
endpoint = "localhost:4317"
service_name = "provisioning-orchestrator"
```

Tempo configuration:

```yaml
# /etc/provisioning/tempo/tempo.yml
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317

storage:
  trace:
    backend: local
    local:
      path: /var/lib/tempo/traces

query_frontend:
  search:
    enabled: true
```

View traces in Grafana or the Tempo UI.

## Performance Monitoring

### Query Performance

Monitor slow requests and workflows:

```promql
# 95th percentile API latency
histogram_quantile(0.95,
  rate(provisioning_api_requests_duration_seconds_bucket[5m])
)

# Slow workflows (>60s)
provisioning_workflow_duration_seconds > 60
```

### Resource Monitoring

Track resource utilization:

```promql
# CPU usage per service
rate(process_cpu_seconds_total{job=~"provisioning-.*"}[5m]) * 100

# Memory usage per service
process_resident_memory_bytes{job=~"provisioning-.*"}

# Disk I/O rate
rate(node_disk_io_time_seconds_total[5m])
```

## Custom Metrics

### Adding Custom Metrics

Rust services use the `prometheus` crate:

```rust
use prometheus::{Counter, Histogram, HistogramOpts, Registry};

fn main() -> Result<(), prometheus::Error> {
    // Registry that collects the metrics defined below
    let registry = Registry::new();

    // Create metrics
    let workflow_counter = Counter::new(
        "provisioning_custom_workflows",
        "Custom workflow counter",
    )?;

    let task_duration = Histogram::with_opts(
        HistogramOpts::new("provisioning_task_duration", "Task duration")
            .buckets(vec![0.1, 0.5, 1.0, 5.0, 10.0]),
    )?;

    // Register clones: metric handles share state, so the originals stay usable
    registry.register(Box::new(workflow_counter.clone()))?;
    registry.register(Box::new(task_duration.clone()))?;

    // Use metrics
    workflow_counter.inc();
    let duration_seconds = 0.42; // placeholder; pass the measured task duration
    task_duration.observe(duration_seconds);

    Ok(())
}
```

Nushell scripts export metrics:

```nushell
# Export metrics in Prometheus format
def export-metrics [] {
  [
    "# HELP provisioning_custom_metric Custom metric"
    "# TYPE provisioning_custom_metric counter"
    $"provisioning_custom_metric (get-metric-value)"
  ] | str join "\n"
}
```

## Monitoring Best Practices

- Set appropriate scrape intervals (15-60s)
- Configure retention based on compliance requirements
- Use labels for multi-dimensional metrics
- Create dashboards for key business metrics
- Set up alerts for critical failures only
- Document alert thresholds and runbooks
- Review and tune alerts regularly
- Use recording rules for expensive queries
- Archive long-term metrics to object storage

## Related Documentation

- [Service Management](service-management.md) - Service lifecycle
- [Platform Health](platform-health.md) - Health checks
- [Troubleshooting](troubleshooting.md) - Debugging issues
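
## Recording Rules Example

The best practice "Use recording rules for expensive queries" can be illustrated with a minimal sketch. The file path, group name, and `record` names here are assumptions chosen for illustration; only the metric names come from the queries and alert rules shown earlier in this document. The sketch also assumes the metrics carry the `job` label that Prometheus attaches at scrape time.

```yaml
# /etc/provisioning/prometheus/rules/recording.yml  (hypothetical path)
groups:
  - name: provisioning_recording_rules   # hypothetical group name
    interval: 30s
    rules:
      # Precompute the 95th percentile API latency per job
      - record: job:provisioning_api_requests_duration_seconds:p95_5m
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(provisioning_api_requests_duration_seconds_bucket[5m]))
          )

      # Precompute the workflow failure ratio per job
      - record: job:provisioning_workflows_failed:ratio_rate5m
        expr: |
          sum by (job) (rate(provisioning_workflows_failed[5m]))
          /
          sum by (job) (rate(provisioning_workflows_total[5m]))
```

Prometheus only evaluates rule files listed under `rule_files` in `prometheus.yml`, so a file like this would need to be added there. Dashboards and alerts could then query the recorded series instead of re-evaluating the full expression on every refresh.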