# Monitoring

The Provisioning platform's observability stack combines Prometheus, Grafana, Loki, and Alertmanager with custom application metrics.

## Monitoring Stack Overview

The platform monitoring system consists of:

| Component | Purpose | Port | Status |
| --------- | ------- | ---- | ------ |
| **Prometheus** | Metrics collection and storage | 9090 | Production |
| **Grafana** | Visualization and dashboards | 3000 | Production |
| **Loki** | Log aggregation | 3100 | Active |
| **Alertmanager** | Alert routing and notification | 9093 | Production |
| **Node Exporter** | System metrics | 9100 | Production |

## Quick Start

Install the monitoring stack:

```bash
# Install all monitoring components
provisioning monitoring install

# Install specific components
provisioning monitoring install --components prometheus,grafana

# Start monitoring services
provisioning monitoring start
```

Access dashboards:

- Prometheus: [http://localhost:9090](http://localhost:9090)
- Grafana: [http://localhost:3000](http://localhost:3000) (default login admin/admin; change the password after first sign-in)
- Alertmanager: [http://localhost:9093](http://localhost:9093)

## Prometheus Configuration

### Service Discovery

Prometheus scrapes platform services from statically configured targets:

```yaml
# /etc/provisioning/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'provisioning-orchestrator'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/metrics'

  - job_name: 'provisioning-control-center'
    static_configs:
      - targets: ['localhost:8081']

  - job_name: 'provisioning-vault-service'
    static_configs:
      - targets: ['localhost:8085']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
```

### Retention Configuration

```yaml
global:
  external_labels:
    cluster: 'provisioning-production'

# Storage retention is set with Prometheus command-line flags, not in prometheus.yml:
#   --storage.tsdb.retention.time=30d
#   --storage.tsdb.retention.size=50GB
```
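Retention values are easier to choose with a rough size estimate. The Prometheus documentation gives the rule of thumb `needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample`, with roughly 1 to 2 bytes per sample. The sketch below applies that formula; the function name, the series count, and the bytes-per-sample figure are illustrative assumptions, not measurements from this platform.

```rust
/// Rough Prometheus TSDB size estimate using the rule of thumb
/// needed_disk_space = retention_seconds * samples_per_second * bytes_per_sample.
/// `active_series` and `bytes_per_sample` are assumptions the caller supplies.
fn estimate_tsdb_bytes(
    active_series: u64,
    scrape_interval_secs: u64,
    retention_days: u64,
    bytes_per_sample: f64,
) -> f64 {
    // Each active series yields one sample per scrape interval.
    let samples_per_second = active_series as f64 / scrape_interval_secs as f64;
    let retention_seconds = (retention_days * 24 * 60 * 60) as f64;
    retention_seconds * samples_per_second * bytes_per_sample
}
```

For example, `estimate_tsdb_bytes(100_000, 15, 30, 2.0)` gives about 34.56 GB for a hypothetical 100,000 active series at a 15s scrape interval, which would fit under a 50GB size limit with some headroom.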

## Key Metrics

### Platform Metrics

Orchestrator metrics:

```text
provisioning_workflows_total - Total workflows created
provisioning_workflows_active - Currently active workflows
provisioning_workflows_completed - Successfully completed workflows
provisioning_workflows_failed - Failed workflows
provisioning_tasks_queued - Tasks in queue
provisioning_tasks_running - Currently executing tasks
provisioning_tasks_completed - Total completed tasks
provisioning_checkpoint_recoveries - Checkpoint recovery count
```

Control Center metrics:

```text
provisioning_api_requests_total - Total API requests
provisioning_api_requests_duration_seconds - Request latency histogram
provisioning_auth_attempts_total - Authentication attempts
provisioning_auth_failures_total - Failed authentication attempts
provisioning_rbac_denials_total - Authorization denials
```

Vault Service metrics:

```text
provisioning_secrets_operations_total - Secret operations count
provisioning_kms_encryptions_total - Encryption operations
provisioning_kms_decryptions_total - Decryption operations
provisioning_kms_latency_seconds - KMS operation latency
```

### System Metrics

Node Exporter provides system-level metrics:

```text
node_cpu_seconds_total - CPU time per core
node_memory_MemAvailable_bytes - Available memory
node_disk_io_time_seconds_total - Disk I/O time
node_network_receive_bytes_total - Network RX bytes
node_network_transmit_bytes_total - Network TX bytes
node_filesystem_avail_bytes - Available disk space
```

## Grafana Dashboards

### Pre-built Dashboards

Import platform dashboards:

```bash
# Install all pre-built dashboards
provisioning monitoring install-dashboards

# List available dashboards
provisioning monitoring list-dashboards
```

Available dashboards:

1. **Platform Overview** - High-level system status
2. **Orchestrator Performance** - Workflow and task metrics
3. **Control Center API** - API request metrics and latency
4. **Vault Service KMS** - Encryption operations and performance
5. **System Resources** - CPU, memory, disk, network
6. **Security Events** - Authentication, authorization, audit logs
7. **Database Performance** - SurrealDB metrics

### Custom Dashboard Creation

Create custom dashboards via Grafana UI or provisioning:

```json
{
  "dashboard": {
    "title": "Custom Infrastructure Dashboard",
    "panels": [
      {
        "title": "Active Workflows",
        "targets": [
          {
            "expr": "provisioning_workflows_active",
            "legendFormat": "Active Workflows"
          }
        ],
        "type": "graph"
      }
    ]
  }
}
```

Export an existing dashboard to a JSON file:

```bash
provisioning monitoring export-dashboard --id 1 --output custom-dashboard.json
```

## Alerting

### Alert Rules

Configure alert rules in Prometheus:

```yaml
# /etc/provisioning/prometheus/alerts/provisioning.yml
groups:
  - name: provisioning_alerts
    interval: 30s
    rules:
      - alert: OrchestratorDown
        expr: up{job="provisioning-orchestrator"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Orchestrator service is down"
          description: "Orchestrator has been down for more than 1 minute"

      - alert: HighWorkflowFailureRate
        expr: |
          rate(provisioning_workflows_failed[5m]) /
          rate(provisioning_workflows_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High workflow failure rate"
          description: "More than 10% of workflows are failing"

      - alert: DatabaseConnectionLoss
        expr: provisioning_database_connected == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Database connection lost"

      - alert: HighMemoryUsage
        expr: |
          (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage is above 90%"

      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/var/lib/provisioning"} /
           node_filesystem_size_bytes{mountpoint="/var/lib/provisioning"}) < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space"
          description: "Less than 10% disk space available"
```
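The `HighWorkflowFailureRate` expression divides the rate of failed workflows by the rate of all workflows. The sketch below shows that arithmetic on two counter samples. It is deliberately simplified: the `Sample` struct and function names are hypothetical, and unlike PromQL's `rate()` it ignores counter resets and extrapolation. It also does not model the `for: 5m` hold period.

```rust
/// One scrape of the two workflow counters (values are cumulative).
struct Sample {
    timestamp_secs: f64,
    failed: f64,
    total: f64,
}

/// Simplified per-second rate between two samples of a counter.
/// Unlike PromQL's rate(), this ignores counter resets and extrapolation.
fn simple_rate(first: f64, last: f64, elapsed_secs: f64) -> f64 {
    (last - first) / elapsed_secs
}

/// Failure ratio over the window, or None when no workflows were created.
/// (PromQL yields NaN for 0/0, which never satisfies `> 0.1`.)
fn failure_ratio(start: &Sample, end: &Sample) -> Option<f64> {
    let elapsed = end.timestamp_secs - start.timestamp_secs;
    if elapsed <= 0.0 {
        return None;
    }
    let failed_rate = simple_rate(start.failed, end.failed, elapsed);
    let total_rate = simple_rate(start.total, end.total, elapsed);
    if total_rate <= 0.0 {
        None
    } else {
        Some(failed_rate / total_rate)
    }
}

/// True when the ratio crosses the 10% threshold used by the alert rule.
fn exceeds_threshold(start: &Sample, end: &Sample) -> bool {
    failure_ratio(start, end).map_or(false, |ratio| ratio > 0.1)
}
```

With 100 new workflows in a five-minute window, 15 failures give a ratio of 0.15 and cross the threshold, while 3 failures give 0.03 and do not.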

### Alertmanager Configuration

Route alerts to appropriate channels:

```yaml
# /etc/provisioning/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'team-email'

  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true

    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'team-email'
    email_configs:
      - to: 'ops@example.com'
        from: 'alerts@provisioning.example.com'
        smarthost: 'smtp.example.com:587'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '<pagerduty-key>'

  - name: 'slack'
    slack_configs:
      - api_url: '<slack-webhook-url>'
        channel: '#provisioning-alerts'
```
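To make the routing behavior concrete, the sketch below models only the route tree above: child routes are tried in order, a match stops the search unless the route sets `continue`, and the top-level receiver is used when no child matches. The `Route` struct and function names are hypothetical and cover nothing beyond matching on `severity`.

```rust
/// A child route that matches on the `severity` label only.
/// This mirrors the two routes in alertmanager.yml, nothing more.
struct Route {
    severity: &'static str,
    receiver: &'static str,
    continue_matching: bool,
}

/// Returns the receivers an alert with the given severity is sent to.
/// Child routes are tried in order; a match stops the search unless the
/// route sets `continue`. If no child matches, the default receiver is used.
fn receivers_for(
    severity: &str,
    routes: &[Route],
    default_receiver: &'static str,
) -> Vec<&'static str> {
    let mut matched = Vec::new();
    for route in routes {
        if route.severity == severity {
            matched.push(route.receiver);
            if !route.continue_matching {
                break;
            }
        }
    }
    if matched.is_empty() {
        matched.push(default_receiver);
    }
    matched
}

/// The route tree from the configuration above, expressed as data.
fn example_routes() -> Vec<Route> {
    vec![
        Route { severity: "critical", receiver: "pagerduty", continue_matching: true },
        Route { severity: "warning", receiver: "slack", continue_matching: false },
    ]
}
```

Under this model, a critical alert reaches only `pagerduty`, a warning reaches only `slack`, and any other severity falls back to `team-email`. The `continue: true` flag has no visible effect here because no later sibling route matches critical alerts.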

Test alerts:

```bash
# Send test alert
provisioning monitoring test-alert --severity critical

# Silence alerts temporarily
provisioning monitoring silence --duration 2h --reason "Maintenance window"
```

## Log Aggregation with Loki

### Loki Configuration

```yaml
# /etc/provisioning/loki/loki.yml
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1

schema_config:
  configs:
    - from: 2024-01-01
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /var/lib/loki/boltdb-shipper-active
    cache_location: /var/lib/loki/boltdb-shipper-cache
  filesystem:
    directory: /var/lib/loki/chunks

limits_config:
  retention_period: 720h  # 30 days; applied only when the compactor runs with retention_enabled: true
```

### Promtail for Log Shipping

```yaml
# /etc/provisioning/promtail/promtail.yml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://localhost:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/provisioning/*.log

  - job_name: journald
    journal:
      max_age: 12h
      labels:
        job: systemd-journal
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: 'unit'
```

Query logs in Grafana:

```logql
{job="varlogs"} |= "error"
{unit="provisioning-orchestrator.service"} |= "workflow" | json
```

## Tracing with Tempo

### Distributed Tracing

Enable OpenTelemetry tracing in services:

```toml
# /etc/provisioning/config.toml
[tracing]
enabled = true
exporter = "otlp"
endpoint = "localhost:4317"
service_name = "provisioning-orchestrator"
```

Tempo configuration:

```yaml
# /etc/provisioning/tempo/tempo.yml
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317

storage:
  trace:
    backend: local
    local:
      path: /var/lib/tempo/traces

query_frontend:
  search:
    enabled: true
```

View traces in Grafana by adding Tempo (port 3200) as a data source.

## Performance Monitoring

### Query Performance

Monitor slow queries:

```promql
# 95th percentile API latency
histogram_quantile(0.95,
  rate(provisioning_api_requests_duration_seconds_bucket[5m])
)

# Slow workflows (>60s)
provisioning_workflow_duration_seconds > 60
```
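The 95th percentile above is an estimate: `histogram_quantile` finds the bucket that contains the requested rank and interpolates linearly inside it. The sketch below reproduces that idea for classic histograms under simplifying assumptions (buckets already sorted and cumulative, last bucket is `+Inf`); it is not the Prometheus implementation and skips some of its edge cases.

```rust
/// Estimates a quantile from cumulative histogram buckets using linear
/// interpolation inside the bucket that contains the rank.
/// `buckets` holds (upper_bound, cumulative_count) pairs sorted by upper
/// bound and ending with the +Inf bucket.
fn histogram_quantile(q: f64, buckets: &[(f64, f64)]) -> Option<f64> {
    if !(0.0..=1.0).contains(&q) || buckets.len() < 2 {
        return None;
    }
    let total = buckets.last()?.1;
    if total <= 0.0 {
        return None;
    }
    let rank = q * total;
    // First bucket whose cumulative count reaches the rank.
    let idx = buckets.iter().position(|&(_, count)| count >= rank)?;
    if idx == buckets.len() - 1 {
        // Rank falls in the +Inf bucket: report the highest finite bound.
        return Some(buckets[buckets.len() - 2].0);
    }
    let (upper, count) = buckets[idx];
    let (lower, prev_count) = if idx == 0 { (0.0, 0.0) } else { buckets[idx - 1] };
    if count <= prev_count {
        // Empty bucket (only reachable for q = 0): avoid dividing by zero.
        return Some(upper);
    }
    Some(lower + (upper - lower) * (rank - prev_count) / (count - prev_count))
}
```

With hypothetical buckets `[(0.1, 50), (0.5, 90), (1.0, 98), (+Inf, 100)]`, the 0.95 quantile has rank 95, lands in the 0.5 to 1.0 bucket, and interpolates to 0.8125 seconds. Accuracy therefore depends on how closely the bucket boundaries bracket the latencies you care about.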

### Resource Monitoring

Track resource utilization:

```promql
# CPU usage per service
rate(process_cpu_seconds_total{job=~"provisioning-.*"}[5m]) * 100

# Memory usage per service
process_resident_memory_bytes{job=~"provisioning-.*"}

# Disk I/O rate
rate(node_disk_io_time_seconds_total[5m])
```
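The CPU query works because `process_cpu_seconds_total` is a counter of CPU seconds consumed: its per-second rate is the fraction of one core in use. The sketch below shows that calculation, including the way a counter reset (for example after a service restart) is handled. Function names are hypothetical, and the extrapolation that PromQL's `rate()` applies at window edges is left out.

```rust
/// Total increase of a counter across consecutive samples, treating any
/// decrease as a counter reset (the counter restarted from zero).
fn counter_increase(samples: &[f64]) -> f64 {
    samples
        .windows(2)
        .map(|pair| if pair[1] >= pair[0] { pair[1] - pair[0] } else { pair[1] })
        .sum()
}

/// CPU usage as a percentage of one core: CPU seconds consumed per second
/// of wall-clock time, multiplied by 100.
fn cpu_percent(samples: &[f64], window_secs: f64) -> f64 {
    counter_increase(samples) / window_secs * 100.0
}
```

For samples `[10.0, 12.0, 14.0, 1.0, 3.0]` the drop from 14 to 1 is counted as a reset, so the increase is 2 + 2 + 1 + 2 = 7 CPU seconds; over a 70-second window that is 10% of one core.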

## Custom Metrics

### Adding Custom Metrics

Rust services use the `prometheus` crate:

```rust
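use prometheus::{Counter, Histogram, HistogramOpts, Registry};

fn main() -> Result<(), prometheus::Error> {
    let registry = Registry::new();

    // Create metrics
    let workflow_counter = Counter::new(
        "provisioning_custom_workflows",
        "Custom workflow counter",
    )?;

    let task_duration = Histogram::with_opts(
        HistogramOpts::new("provisioning_task_duration", "Task duration")
            .buckets(vec![0.1, 0.5, 1.0, 5.0, 10.0]),
    )?;

    // Register clones so the original handles stay usable;
    // clones share the same underlying metric values
    registry.register(Box::new(workflow_counter.clone()))?;
    registry.register(Box::new(task_duration.clone()))?;

    // Use metrics
    workflow_counter.inc();
    let duration_seconds = 0.42; // placeholder; record the measured task duration here
    task_duration.observe(duration_seconds);

    Ok(())
}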
```

Nushell scripts export metrics:

```nushell
# Export metrics in Prometheus format
def export-metrics [] {
    [
        "# HELP provisioning_custom_metric Custom metric"
        "# TYPE provisioning_custom_metric counter"
        $"provisioning_custom_metric (get-metric-value)"
    ] | str join "\n"
}
```
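The Nushell function above emits the Prometheus text exposition format: a `# HELP` line, a `# TYPE` line, and a `name value` sample line. The same layout can be sketched in Rust with only the standard library; `render_counter` is a hypothetical helper for illustration, not part of the platform or of the `prometheus` crate.

```rust
/// Renders one counter sample in the Prometheus text exposition format:
/// a HELP line, a TYPE line, then `name value`, each ending in a newline.
fn render_counter(name: &str, help: &str, value: f64) -> String {
    format!("# HELP {name} {help}\n# TYPE {name} counter\n{name} {value}\n")
}
```

Calling `render_counter("provisioning_custom_metric", "Custom metric", 42.0)` produces the same three lines as the Nushell version. Note that the text format expects every line, including the last, to end with a newline character, which this sketch adds explicitly.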

## Monitoring Best Practices

- Set appropriate scrape intervals (15-60s)
- Configure retention based on compliance requirements
- Use labels for multi-dimensional metrics
- Create dashboards for key business metrics
- Set up alerts for critical failures only
- Document alert thresholds and runbooks
- Review and tune alerts regularly
- Use recording rules for expensive queries
- Archive long-term metrics to object storage

## Related Documentation

- [Service Management](service-management.md) - Service lifecycle
- [Platform Health](platform-health.md) - Health checks
- [Troubleshooting](troubleshooting.md) - Debugging issues