# Monitoring

Observability stack for the Provisioning platform, built on Prometheus, Grafana, Loki, Alertmanager, and custom metrics.

## Monitoring Stack Overview

The platform monitoring system consists of:

| Component | Purpose | Port | Status |
| --------- | ------- | ---- | ------ |
| **Prometheus** | Metrics collection and storage | 9090 | Production |
| **Grafana** | Visualization and dashboards | 3000 | Production |
| **Loki** | Log aggregation | 3100 | Active |
| **Alertmanager** | Alert routing and notification | 9093 | Production |
| **Node Exporter** | System metrics | 9100 | Production |

## Quick Start

Install the monitoring stack:

```bash
# Install all monitoring components
provisioning monitoring install

# Install specific components
provisioning monitoring install --components prometheus,grafana

# Start monitoring services
provisioning monitoring start
```

Access dashboards:

- Prometheus: [http://localhost:9090](http://localhost:9090)
- Grafana: [http://localhost:3000](http://localhost:3000) (default credentials: admin/admin)
- Alertmanager: [http://localhost:9093](http://localhost:9093)

## Prometheus Configuration

### Service Discovery

Prometheus scrapes the platform services from statically configured targets:

```yaml
# /etc/provisioning/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'provisioning-orchestrator'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/metrics'

  - job_name: 'provisioning-control-center'
    static_configs:
      - targets: ['localhost:8081']

  - job_name: 'provisioning-vault-service'
    static_configs:
      - targets: ['localhost:8085']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
```

### Retention Configuration

```yaml
global:
  external_labels:
    cluster: 'provisioning-production'

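# Storage retention
# Note: Prometheus has traditionally set these through the command-line flags
# --storage.tsdb.retention.time and --storage.tsdb.retention.size; confirm
# that your Prometheus version accepts them in the configuration file.
storage:
  tsdb:
    retention.time: 30d
    retention.size: 50GB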
```
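
Whether the 30-day window or the 50GB cap takes effect first depends on how many samples are ingested. The Prometheus storage documentation gives a rough sizing rule: needed disk space is approximately retention time in seconds, times ingested samples per second, times bytes per sample (roughly 1 to 2 bytes after compression). The following is a minimal sketch of that arithmetic; the function names are hypothetical and any ingestion rate passed in is an assumption, not a measurement of this platform.

```rust
/// Rough TSDB disk estimate in bytes:
/// retention seconds * samples ingested per second * bytes per sample.
fn estimate_tsdb_bytes(retention_days: u64, samples_per_second: u64, bytes_per_sample: u64) -> u64 {
    const SECONDS_PER_DAY: u64 = 86_400;
    retention_days * SECONDS_PER_DAY * samples_per_second * bytes_per_sample
}

/// Returns true when the estimate exceeds the size cap, meaning the size
/// limit would start removing data before the time limit does.
fn size_cap_hit_first(estimated_bytes: u64, cap_bytes: u64) -> bool {
    estimated_bytes > cap_bytes
}

/// 50GB expressed in bytes. Prometheus documents its size units as
/// powers of two, so 1GB here is 1024^3 bytes.
const CAP_50GB_BYTES: u64 = 50 * 1024 * 1024 * 1024;
```

With an assumed 10,000 samples per second at 2 bytes each, `estimate_tsdb_bytes(30, 10_000, 2)` gives 51,840,000,000 bytes, which stays under the cap, so the 30-day limit applies first. At an assumed 15,000 samples per second the estimate exceeds the cap and the size limit applies first.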

## Key Metrics

### Platform Metrics

Orchestrator metrics:

```text
provisioning_workflows_total - Total workflows created
provisioning_workflows_active - Currently active workflows
provisioning_workflows_completed - Successfully completed workflows
provisioning_workflows_failed - Failed workflows
provisioning_tasks_queued - Tasks in queue
provisioning_tasks_running - Currently executing tasks
provisioning_tasks_completed - Total completed tasks
provisioning_checkpoint_recoveries - Checkpoint recovery count
```

Control Center metrics:

```text
provisioning_api_requests_total - Total API requests
provisioning_api_requests_duration_seconds - Request latency histogram
provisioning_auth_attempts_total - Authentication attempts
provisioning_auth_failures_total - Failed authentication attempts
provisioning_rbac_denials_total - Authorization denials
```

Vault Service metrics:

```text
provisioning_secrets_operations_total - Secret operations count
provisioning_kms_encryptions_total - Encryption operations
provisioning_kms_decryptions_total - Decryption operations
provisioning_kms_latency_seconds - KMS operation latency
```

### System Metrics

Node Exporter provides system-level metrics:

```text
node_cpu_seconds_total - CPU time per core
node_memory_MemAvailable_bytes - Available memory
node_disk_io_time_seconds_total - Disk I/O time
node_network_receive_bytes_total - Network RX bytes
node_network_transmit_bytes_total - Network TX bytes
node_filesystem_avail_bytes - Available disk space
```

## Grafana Dashboards

### Pre-built Dashboards

Import platform dashboards:

```bash
# Install all pre-built dashboards
provisioning monitoring install-dashboards

# List available dashboards
provisioning monitoring list-dashboards
```

Available dashboards:

1. **Platform Overview** - High-level system status
2. **Orchestrator Performance** - Workflow and task metrics
3. **Control Center API** - API request metrics and latency
4. **Vault Service KMS** - Encryption operations and performance
5. **System Resources** - CPU, memory, disk, network
6. **Security Events** - Authentication, authorization, audit logs
7. **Database Performance** - SurrealDB metrics

### Custom Dashboard Creation

Create custom dashboards in the Grafana UI, or define them as JSON:

```json
{
  "dashboard": {
    "title": "Custom Infrastructure Dashboard",
    "panels": [
      {
        "title": "Active Workflows",
        "targets": [
          {
            "expr": "provisioning_workflows_active",
            "legendFormat": "Active Workflows"
          }
        ],
        "type": "graph"
      }
    ]
  }
}
```

Export a dashboard to a JSON file:

```bash
provisioning monitoring export-dashboard --id 1 --output custom-dashboard.json
```

## Alerting

### Alert Rules

Configure alert rules in Prometheus:

```yaml
# /etc/provisioning/prometheus/alerts/provisioning.yml
groups:
  - name: provisioning_alerts
    interval: 30s
    rules:
      - alert: OrchestratorDown
        expr: up{job="provisioning-orchestrator"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Orchestrator service is down"
          description: "Orchestrator has been down for more than 1 minute"

      - alert: HighWorkflowFailureRate
        expr: |
          rate(provisioning_workflows_failed[5m]) /
          rate(provisioning_workflows_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High workflow failure rate"
          description: "More than 10% of workflows are failing"

      - alert: DatabaseConnectionLoss
        expr: provisioning_database_connected == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Database connection lost"

      - alert: HighMemoryUsage
        expr: |
          (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage is above 90%"

      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/var/lib/provisioning"} /
           node_filesystem_size_bytes{mountpoint="/var/lib/provisioning"}) < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space"
          description: "Less than 10% disk space available"
```
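
To make the `HighWorkflowFailureRate` expression concrete, the sketch below reproduces its arithmetic on two counter samples. It is a simplification under stated assumptions: `simple_rate` and `failure_rate_exceeds` are hypothetical helpers, and unlike PromQL's `rate()` they ignore counter resets and extrapolation at the window boundaries.

```rust
/// Simplified per-second rate between two counter samples taken
/// `window_secs` apart. Ignores counter resets and extrapolation.
fn simple_rate(first: f64, last: f64, window_secs: f64) -> f64 {
    (last - first) / window_secs
}

/// Evaluates the alert condition: failed rate divided by total rate must
/// exceed the threshold. Returns false when no workflows ran, matching
/// how a 0/0 (NaN) result never satisfies `>` in PromQL.
fn failure_rate_exceeds(failed_rate: f64, total_rate: f64, threshold: f64) -> bool {
    if total_rate == 0.0 {
        return false;
    }
    failed_rate / total_rate > threshold
}
```

For example, if over a 300-second window the failed counter moves from 4 to 10 while the total counter moves from 100 to 140, the ratio is 6/40 = 0.15 and the condition holds. The alert fires only if that stays true for the full `for: 5m` period.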

### Alertmanager Configuration

Route alerts to appropriate channels:

```yaml
# /etc/provisioning/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'team-email'

  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true

    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'team-email'
    email_configs:
      - to: 'ops@example.com'
        from: 'alerts@provisioning.example.com'
        smarthost: 'smtp.example.com:587'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '<pagerduty-key>'

  - name: 'slack'
    slack_configs:
      - api_url: '<slack-webhook-url>'
        channel: '#provisioning-alerts'
```
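
The effect of `continue: true` is easy to misread, so here is a minimal model of how sibling routes are walked. It is a sketch, not Alertmanager's implementation: `ChildRoute`, `receivers_for`, and `example_routes` are hypothetical names, and matching is reduced to a single `severity` label.

```rust
/// One child route that matches on a single severity value.
struct ChildRoute {
    severity: &'static str,
    receiver: &'static str,
    continue_matching: bool,
}

/// Walks child routes in order. A match stops the walk unless
/// `continue_matching` is set. If no child matches, the alert falls
/// back to the root receiver.
fn receivers_for(
    severity: &str,
    root_receiver: &'static str,
    routes: &[ChildRoute],
) -> Vec<&'static str> {
    let mut matched = Vec::new();
    for route in routes {
        if route.severity == severity {
            matched.push(route.receiver);
            if !route.continue_matching {
                break;
            }
        }
    }
    if matched.is_empty() {
        matched.push(root_receiver);
    }
    matched
}

/// The two child routes from the configuration above.
fn example_routes() -> Vec<ChildRoute> {
    vec![
        ChildRoute { severity: "critical", receiver: "pagerduty", continue_matching: true },
        ChildRoute { severity: "warning", receiver: "slack", continue_matching: false },
    ]
}
```

Under this model a critical alert reaches only `pagerduty`: `continue: true` keeps the walk going, but the next sibling matches warnings only. `team-email` receives alerts that match neither child route.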

Test alerts:

```bash
# Send test alert
provisioning monitoring test-alert --severity critical

# Silence alerts temporarily
provisioning monitoring silence --duration 2h --reason "Maintenance window"
```

## Log Aggregation with Loki

### Loki Configuration

```yaml
# /etc/provisioning/loki/loki.yml
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1

schema_config:
  configs:
    - from: 2024-01-01
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /var/lib/loki/boltdb-shipper-active
    cache_location: /var/lib/loki/boltdb-shipper-cache
  filesystem:
    directory: /var/lib/loki/chunks

limits_config:
  retention_period: 720h  # 30 days
```

### Promtail for Log Shipping

```yaml
# /etc/provisioning/promtail/promtail.yml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://localhost:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/provisioning/*.log

  - job_name: journald
    journal:
      max_age: 12h
      labels:
        job: systemd-journal
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: 'unit'
```

Query logs in Grafana:

```logql
{job="varlogs"} |= "error"
{unit="provisioning-orchestrator.service"} |= "workflow" | json
```

## Tracing with Tempo

### Distributed Tracing

Enable OpenTelemetry tracing in services:

```toml
# /etc/provisioning/config.toml
[tracing]
enabled = true
exporter = "otlp"
endpoint = "localhost:4317"
service_name = "provisioning-orchestrator"
```

Tempo configuration:

```yaml
# /etc/provisioning/tempo/tempo.yml
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317

storage:
  trace:
    backend: local
    local:
      path: /var/lib/tempo/traces

query_frontend:
  search:
    enabled: true
```

View traces in Grafana with Tempo configured as a data source.

## Performance Monitoring

### Query Performance

Monitor slow queries:

```promql
# 95th percentile API latency
histogram_quantile(0.95,
  rate(provisioning_api_requests_duration_seconds_bucket[5m])
)

# Slow workflows (>60s)
provisioning_workflow_duration_seconds > 60
```
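
The 95th percentile returned by `histogram_quantile` is an estimate interpolated from bucket counts, so its accuracy depends on where the bucket boundaries sit. The sketch below shows the linear interpolation described in the PromQL documentation, in simplified form; `quantile_from_buckets` is a hypothetical helper and the bucket counts in the usage note are made up for illustration.

```rust
/// Estimates a quantile from cumulative histogram buckets using linear
/// interpolation inside the bucket that contains the target rank.
/// `buckets` holds (upper_bound, cumulative_count) pairs sorted by bound;
/// the last bound must be f64::INFINITY (the `+Inf` bucket).
/// Simplification: only quantiles in (0, 1] are supported.
fn quantile_from_buckets(q: f64, buckets: &[(f64, f64)]) -> Option<f64> {
    if !(q > 0.0 && q <= 1.0) || buckets.len() < 2 {
        return None;
    }
    let (_, total) = *buckets.last()?;
    if total == 0.0 {
        return None;
    }
    let rank = q * total;
    // First bucket whose cumulative count reaches the rank.
    let idx = buckets.iter().position(|&(_, count)| count >= rank)?;
    if idx == buckets.len() - 1 {
        // Rank falls in the +Inf bucket: report the highest finite bound.
        return Some(buckets[idx - 1].0);
    }
    let (upper, count) = buckets[idx];
    // The lowest bucket is assumed to start at zero.
    let (lower, prev_count) = if idx == 0 { (0.0, 0.0) } else { buckets[idx - 1] };
    Some(lower + (upper - lower) * (rank - prev_count) / (count - prev_count))
}
```

With buckets `[(0.1, 50.0), (0.5, 90.0), (1.0, 98.0), (f64::INFINITY, 100.0)]`, the 0.95 quantile has rank 95, which lands in the 0.5 to 1.0 bucket and interpolates to about 0.8125 seconds. A 0.99 quantile lands in the `+Inf` bucket and is reported as 1.0, the highest finite bound, which is why the top finite bucket should sit above the latencies you care about.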

### Resource Monitoring

Track resource utilization:

```promql
# CPU usage per service
rate(process_cpu_seconds_total{job=~"provisioning-.*"}[5m]) * 100

# Memory usage per service
process_resident_memory_bytes{job=~"provisioning-.*"}

# Disk I/O rate
rate(node_disk_io_time_seconds_total[5m])
```

## Custom Metrics

### Adding Custom Metrics

Rust services use the `prometheus` crate:

```rust
use prometheus::{Counter, Histogram, HistogramOpts, Registry};

fn record_custom_metrics(duration_seconds: f64) -> prometheus::Result<()> {
    // Registry that owns the custom metrics
    let registry = Registry::new();

    // Create metrics
    let workflow_counter = Counter::new(
        "provisioning_custom_workflows",
        "Custom workflow counter",
    )?;

    let task_duration = Histogram::with_opts(
        HistogramOpts::new("provisioning_task_duration", "Task duration")
            .buckets(vec![0.1, 0.5, 1.0, 5.0, 10.0]),
    )?;

    // Register clones so the original handles stay usable for recording
    registry.register(Box::new(workflow_counter.clone()))?;
    registry.register(Box::new(task_duration.clone()))?;

    // Use metrics
    workflow_counter.inc();
    task_duration.observe(duration_seconds);
    Ok(())
}
```

Nushell scripts can export metrics in the Prometheus text format:

```nushell
# Export metrics in Prometheus format
def export-metrics [] {
    [
        "# HELP provisioning_custom_metric Custom metric"
        "# TYPE provisioning_custom_metric counter"
        $"provisioning_custom_metric (get-metric-value)"
    ] | str join "\n"
}
```
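
The same three-line text format can be produced from Rust without any dependency, which can help when checking what a scrape should return. This is a minimal sketch; `render_counter` is a hypothetical helper, not part of the platform or of the `prometheus` crate.

```rust
/// Renders one counter in the Prometheus text exposition format:
/// a HELP line, a TYPE line, then the sample, each ending in a newline.
fn render_counter(name: &str, help: &str, value: u64) -> String {
    format!(
        "# HELP {0} {1}\n# TYPE {0} counter\n{0} {2}\n",
        name, help, value
    )
}
```

Calling `render_counter("provisioning_custom_metric", "Custom metric", 42)` yields the same HELP and TYPE lines as the Nushell function above, followed by the sample line `provisioning_custom_metric 42`.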

## Monitoring Best Practices

- Set appropriate scrape intervals (15-60s)
- Configure retention based on compliance requirements
- Use labels for multi-dimensional metrics
- Create dashboards for key business metrics
- Set up alerts for critical failures only
- Document alert thresholds and runbooks
- Review and tune alerts regularly
- Use recording rules for expensive queries
- Archive long-term metrics to object storage

## Related Documentation

- [Service Management](service-management.md) - Service lifecycle
- [Platform Health](platform-health.md) - Health checks
- [Troubleshooting](troubleshooting.md) - Debugging issues