provisioning/docs/src/operations/monitoring.md
2026-01-17 03:58:28 +00:00

Monitoring

The Provisioning platform's observability stack is built on Prometheus, Grafana, Loki, Alertmanager, and custom application metrics.

Monitoring Stack Overview

The platform monitoring system consists of:

| Component | Purpose | Port | Status |
| --- | --- | --- | --- |
| Prometheus | Metrics collection and storage | 9090 | Production |
| Grafana | Visualization and dashboards | 3000 | Production |
| Loki | Log aggregation | 3100 | Active |
| Alertmanager | Alert routing and notification | 9093 | Production |
| Node Exporter | System metrics | 9100 | Production |

Quick Start

Install monitoring stack:

# Install all monitoring components
provisioning monitoring install

# Install specific components
provisioning monitoring install --components prometheus,grafana

# Start monitoring services
provisioning monitoring start

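Access the dashboards on the ports listed in the overview table, for example Grafana on port 3000 and Prometheus on port 9090.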

Prometheus Configuration

Service Discovery

Prometheus scrapes each platform service from a statically configured target:

# /etc/provisioning/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'provisioning-orchestrator'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/metrics'

  - job_name: 'provisioning-control-center'
    static_configs:
      - targets: ['localhost:8081']

  - job_name: 'provisioning-vault-service'
    static_configs:
      - targets: ['localhost:8085']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']

Retention Configuration

global:
  external_labels:
    cluster: 'provisioning-production'

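# Storage retention (upstream Prometheus exposes the same limits as the
# --storage.tsdb.retention.time and --storage.tsdb.retention.size flags)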
storage:
  tsdb:
    retention.time: 30d
    retention.size: 50GB
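
Which of the two limits above is reached first depends on ingestion volume. The Prometheus storage documentation gives a rule of thumb: required disk is roughly retention seconds times ingested samples per second times bytes per sample (typically 1 to 2 bytes). The Rust sketch below applies that rule. The function names, the binary interpretation of `50GB`, and the ingestion rates in the usage note are illustrative assumptions, not values measured on the platform.

```rust
/// Size cap from the retention settings, assuming binary units
/// (1GB = 1024^3 bytes). This interpretation is an assumption of the sketch.
const SIZE_CAP_BYTES: f64 = 50.0 * 1024.0 * 1024.0 * 1024.0;

/// Rough TSDB disk estimate:
/// retention seconds * ingested samples per second * bytes per sample.
fn estimated_disk_bytes(
    retention_days: u64,
    samples_per_second: u64,
    bytes_per_sample: f64,
) -> f64 {
    let retention_seconds = (retention_days * 24 * 60 * 60) as f64;
    retention_seconds * samples_per_second as f64 * bytes_per_sample
}

/// True when the estimate stays under the size cap, meaning the time limit
/// (not the size limit) is expected to remove data first.
fn fits_within(size_cap_bytes: f64, estimate_bytes: f64) -> bool {
    estimate_bytes <= size_cap_bytes
}
```

At an assumed 5,000 samples per second and 2 bytes per sample, 30 days needs about 25.9 GB, so the time limit applies first. At 20,000 samples per second the estimate exceeds the cap, and the size limit would trim data before 30 days have passed.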

Key Metrics

Platform Metrics

Orchestrator metrics:

provisioning_workflows_total - Total workflows created
provisioning_workflows_active - Currently active workflows
provisioning_workflows_completed - Successfully completed workflows
provisioning_workflows_failed - Failed workflows
provisioning_tasks_queued - Tasks in queue
provisioning_tasks_running - Currently executing tasks
provisioning_tasks_completed - Total completed tasks
provisioning_checkpoint_recoveries - Checkpoint recovery count

Control Center metrics:

provisioning_api_requests_total - Total API requests
provisioning_api_requests_duration_seconds - Request latency histogram
provisioning_auth_attempts_total - Authentication attempts
provisioning_auth_failures_total - Failed authentication attempts
provisioning_rbac_denials_total - Authorization denials

Vault Service metrics:

provisioning_secrets_operations_total - Secret operations count
provisioning_kms_encryptions_total - Encryption operations
provisioning_kms_decryptions_total - Decryption operations
provisioning_kms_latency_seconds - KMS operation latency

System Metrics

Node Exporter provides system-level metrics:

node_cpu_seconds_total - CPU time per core
node_memory_MemAvailable_bytes - Available memory
node_disk_io_time_seconds_total - Disk I/O time
node_network_receive_bytes_total - Network RX bytes
node_network_transmit_bytes_total - Network TX bytes
node_filesystem_avail_bytes - Available disk space

Grafana Dashboards

Pre-built Dashboards

Import platform dashboards:

# Install all pre-built dashboards
provisioning monitoring install-dashboards

# List available dashboards
provisioning monitoring list-dashboards

Available dashboards:

  1. Platform Overview - High-level system status
  2. Orchestrator Performance - Workflow and task metrics
  3. Control Center API - API request metrics and latency
  4. Vault Service KMS - Encryption operations and performance
  5. System Resources - CPU, memory, disk, network
  6. Security Events - Authentication, authorization, audit logs
  7. Database Performance - SurrealDB metrics

Custom Dashboard Creation

Create custom dashboards through the Grafana UI or by provisioning a JSON definition:

{
  "dashboard": {
    "title": "Custom Infrastructure Dashboard",
    "panels": [
      {
        "title": "Active Workflows",
        "targets": [
          {
            "expr": "provisioning_workflows_active",
            "legendFormat": "Active Workflows"
          }
        ],
        "type": "timeseries"
      }
    ]
  }
}

Export an existing dashboard to a JSON file:

provisioning monitoring export-dashboard --id 1 --output custom-dashboard.json

Alerting

Alert Rules

Configure alert rules in Prometheus:

# /etc/provisioning/prometheus/alerts/provisioning.yml
groups:
  - name: provisioning_alerts
    interval: 30s
    rules:
      - alert: OrchestratorDown
        expr: up{job="provisioning-orchestrator"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Orchestrator service is down"
          description: "Orchestrator has been down for more than 1 minute"

      - alert: HighWorkflowFailureRate
        expr: |
          rate(provisioning_workflows_failed[5m]) /
          rate(provisioning_workflows_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High workflow failure rate"
          description: "More than 10% of workflows are failing"

      - alert: DatabaseConnectionLoss
        expr: provisioning_database_connected == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Database connection lost"

      - alert: HighMemoryUsage
        expr: |
          (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage is above 90%"

      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/var/lib/provisioning"} /
           node_filesystem_size_bytes{mountpoint="/var/lib/provisioning"}) < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space"
          description: "Less than 10% disk space available"
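
To make the HighWorkflowFailureRate expression concrete, the Rust sketch below computes the same ratio from two samples of each counter. `simple_rate` is a simplified stand-in for PromQL `rate()`, which additionally handles counter resets and extrapolates to the window edges. The type and function names are hypothetical and not part of the platform.

```rust
/// One scraped sample of a monotonically increasing counter.
#[derive(Clone, Copy)]
struct Sample {
    timestamp_secs: f64,
    value: f64,
}

/// Per-second increase between the first and last sample of a window.
/// Returns None for an empty window or an apparent counter reset.
fn simple_rate(first: Sample, last: Sample) -> Option<f64> {
    let elapsed = last.timestamp_secs - first.timestamp_secs;
    if elapsed <= 0.0 || last.value < first.value {
        return None;
    }
    Some((last.value - first.value) / elapsed)
}

/// Mirrors `rate(failed[5m]) / rate(total[5m]) > threshold`.
/// A zero total rate never fires, just as NaN never satisfies `>` in PromQL.
fn failure_ratio_exceeds(
    failed: (Sample, Sample),
    total: (Sample, Sample),
    threshold: f64,
) -> bool {
    match (simple_rate(failed.0, failed.1), simple_rate(total.0, total.1)) {
        (Some(failed_rate), Some(total_rate)) if total_rate > 0.0 => {
            failed_rate / total_rate > threshold
        }
        _ => false,
    }
}
```

With 15 failures out of 100 new workflows in a 5-minute window the ratio is 0.15 and the condition holds. The `for: 5m` clause in the rule then requires the condition to stay true for a further five minutes before the alert fires.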

Alertmanager Configuration

Route alerts to appropriate channels:

# /etc/provisioning/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'team-email'

  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true

    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'team-email'
    email_configs:
      - to: 'ops@example.com'
        from: 'alerts@provisioning.example.com'
        smarthost: 'smtp.example.com:587'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '<pagerduty-key>'

  - name: 'slack'
    slack_configs:
      - api_url: '<slack-webhook-url>'
        channel: '#provisioning-alerts'
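
The routing tree above can be read as follows: child routes are tried in order, `continue: true` lets evaluation proceed to later siblings after a match, and the top-level receiver is used only when no child matches. The Rust sketch below models that behaviour for a single `severity` label. The `Route` type and both functions are hypothetical illustrations, not Alertmanager code.

```rust
/// One child route that matches on the `severity` label.
struct Route {
    severity: &'static str,
    receiver: &'static str,
    continue_matching: bool,
}

/// The two child routes from the configuration above.
fn documented_routes() -> Vec<Route> {
    vec![
        Route { severity: "critical", receiver: "pagerduty", continue_matching: true },
        Route { severity: "warning", receiver: "slack", continue_matching: false },
    ]
}

/// Receivers that would be notified for an alert with the given severity.
fn receivers_for(
    default_receiver: &'static str,
    routes: &[Route],
    severity: &str,
) -> Vec<&'static str> {
    let mut matched = Vec::new();
    for route in routes {
        if route.severity == severity {
            matched.push(route.receiver);
            // Without `continue`, the first matching child ends the search.
            if !route.continue_matching {
                break;
            }
        }
    }
    // The parent receiver applies only when no child route matched.
    if matched.is_empty() {
        matched.push(default_receiver);
    }
    matched
}
```

Under this model a critical alert reaches only `pagerduty`: `continue: true` moves evaluation on to the warning route, which does not match, and `team-email` is skipped because a child already matched. If critical alerts should also reach email, a second child route would be needed.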

Test alerts:

# Send test alert
provisioning monitoring test-alert --severity critical

# Silence alerts temporarily
provisioning monitoring silence --duration 2h --reason "Maintenance window"

Log Aggregation with Loki

Loki Configuration

# /etc/provisioning/loki/loki.yml
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1

schema_config:
  configs:
    - from: 2024-01-01
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /var/lib/loki/boltdb-shipper-active
    cache_location: /var/lib/loki/boltdb-shipper-cache
  filesystem:
    directory: /var/lib/loki/chunks

limits_config:
  retention_period: 720h  # 30 days; Loki enforces this only when the compactor runs with retention_enabled: true

Promtail for Log Shipping

# /etc/provisioning/promtail/promtail.yml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://localhost:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/provisioning/*.log

  - job_name: journald
    journal:
      max_age: 12h
      labels:
        job: systemd-journal
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: 'unit'

Query logs in Grafana:

{job="varlogs"} |= "error"
{unit="provisioning-orchestrator.service"} |= "workflow" | json

Tracing with Tempo

Distributed Tracing

Enable OpenTelemetry tracing in services:

# /etc/provisioning/config.toml
[tracing]
enabled = true
exporter = "otlp"
endpoint = "localhost:4317"
service_name = "provisioning-orchestrator"

Tempo configuration:

# /etc/provisioning/tempo/tempo.yml
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317

storage:
  trace:
    backend: local
    local:
      path: /var/lib/tempo/traces

query_frontend:
  search:
    enabled: true

View traces in Grafana by adding Tempo as a data source (Tempo's HTTP API listens on port 3200 in this configuration).

Performance Monitoring

Query Performance

Monitor slow queries:

# 95th percentile API latency
histogram_quantile(0.95,
  rate(provisioning_api_requests_duration_seconds_bucket[5m])
)

# Slow workflows (>60s)
provisioning_workflow_duration_seconds > 60
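
`histogram_quantile` does not see individual request durations. It estimates the quantile by locating the cumulative bucket that contains the target rank and interpolating linearly inside it. The Rust sketch below reproduces that estimate for positive bucket bounds such as latencies. It is a simplified model with hypothetical names, not the Prometheus implementation, and the bucket counts in the usage note are invented for illustration.

```rust
/// Cumulative histogram bucket: number of observations <= upper_bound.
struct Bucket {
    upper_bound: f64,
    cumulative_count: f64,
}

/// Estimate quantile `q` (0 < q <= 1) from buckets sorted by upper bound.
/// The last bucket must be the +Inf bucket that holds the total count.
fn histogram_quantile(q: f64, buckets: &[Bucket]) -> Option<f64> {
    let last = buckets.last()?;
    let valid_q = q > 0.0 && q <= 1.0;
    if !valid_q || buckets.len() < 2 || !last.upper_bound.is_infinite() {
        return None;
    }
    if last.cumulative_count <= 0.0 {
        return None;
    }

    // Rank of the observation the quantile refers to.
    let rank = q * last.cumulative_count;
    let idx = buckets.iter().position(|b| b.cumulative_count >= rank)?;

    // A rank inside the +Inf bucket cannot be interpolated, so report the
    // highest finite bound instead.
    if idx == buckets.len() - 1 {
        return Some(buckets[idx - 1].upper_bound);
    }

    // The lowest bucket is assumed to start at zero.
    let (lower_bound, lower_count) = if idx == 0 {
        (0.0, 0.0)
    } else {
        (buckets[idx - 1].upper_bound, buckets[idx - 1].cumulative_count)
    };
    let bucket = &buckets[idx];
    let in_bucket = bucket.cumulative_count - lower_count;
    Some(lower_bound + (bucket.upper_bound - lower_bound) * (rank - lower_count) / in_bucket)
}
```

With cumulative counts of 50, 90, 98, and 100 for bounds 0.1, 0.5, 1.0, and +Inf, the 95th percentile falls in the 0.5 to 1.0 bucket and is estimated at roughly 0.81 seconds. The estimate can only be as precise as the bucket layout, so bucket bounds should bracket the latency thresholds that matter.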

Resource Monitoring

Track resource utilization:

# CPU usage per service, as a percentage of one core
rate(process_cpu_seconds_total{job=~"provisioning-.*"}[5m]) * 100

# Memory usage per service
process_resident_memory_bytes{job=~"provisioning-.*"}

# Disk I/O rate
rate(node_disk_io_time_seconds_total[5m])

Custom Metrics

Adding Custom Metrics

Rust services use the prometheus crate:

use prometheus::{Counter, Histogram, HistogramOpts, Registry};

// Runs inside a function that returns prometheus::Result<()>.
// Registry that the /metrics handler gathers from
let registry = Registry::new();

// Create metrics
let workflow_counter = Counter::new(
    "provisioning_custom_workflows",
    "Custom workflow counter"
)?;

let task_duration = Histogram::with_opts(
    HistogramOpts::new("provisioning_task_duration", "Task duration")
        .buckets(vec![0.1, 0.5, 1.0, 5.0, 10.0])
)?;

// Register clones: metric handles share state, so the originals stay usable
registry.register(Box::new(workflow_counter.clone()))?;
registry.register(Box::new(task_duration.clone()))?;

// Use metrics
workflow_counter.inc();
task_duration.observe(duration_seconds);

Nushell scripts export metrics:

# Export metrics in Prometheus format
def export-metrics [] {
    [
        "# HELP provisioning_custom_metric Custom metric"
        "# TYPE provisioning_custom_metric counter"
        $"provisioning_custom_metric (get-metric-value)"
    ] | str join "\n"
}
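
For comparison, the same three-line output can be sketched in Rust without any crate. The function name is hypothetical. The escaping of backslashes and newlines in the HELP text follows the Prometheus text exposition format, and the trailing newline is included because the format expects every line, including the last, to end with one.

```rust
/// Render a single counter sample in the Prometheus text exposition format.
fn render_counter(name: &str, help: &str, value: u64) -> String {
    // HELP text must escape backslashes and line feeds.
    let escaped_help = help.replace('\\', "\\\\").replace('\n', "\\n");
    format!(
        "# HELP {0} {1}\n# TYPE {0} counter\n{0} {2}\n",
        name, escaped_help, value
    )
}
```

Calling `render_counter("provisioning_custom_metric", "Custom metric", 3)` yields the HELP line, the TYPE line, and the sample line `provisioning_custom_metric 3`, each terminated by a newline.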

Monitoring Best Practices

  • Set appropriate scrape intervals (15-60s)
  • Configure retention based on compliance requirements
  • Use labels for multi-dimensional metrics
  • Create dashboards for key business metrics
  • Page only on critical failures; route warnings to lower-urgency channels
  • Document alert thresholds and runbooks
  • Review and tune alerts regularly
  • Use recording rules for expensive queries
  • Archive long-term metrics to object storage