# Monitoring

Comprehensive observability stack for the Provisioning platform using Prometheus, Grafana, and custom metrics.

## Monitoring Stack Overview

The platform monitoring system consists of:

| Component | Purpose | Port | Status |
| --------- | ------- | ---- | ------ |
| **Prometheus** | Metrics collection and storage | 9090 | Production |
| **Grafana** | Visualization and dashboards | 3000 | Production |
| **Loki** | Log aggregation | 3100 | Active |
| **Alertmanager** | Alert routing and notification | 9093 | Production |
| **Node Exporter** | System metrics | 9100 | Production |

## Quick Start

Install monitoring stack:

```bash
# Install all monitoring components
provisioning monitoring install

# Install specific components
provisioning monitoring install --components prometheus,grafana

# Start monitoring services
provisioning monitoring start
```

Access dashboards:

- Prometheus: [http://localhost:9090](http://localhost:9090)
- Grafana: [http://localhost:3000](http://localhost:3000) (admin/admin)
- Alertmanager: [http://localhost:9093](http://localhost:9093)

## Prometheus Configuration

### Service Discovery

Prometheus finds platform services through scrape jobs. The default configuration lists each service as a static target:

```yaml
# /etc/provisioning/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'provisioning-orchestrator'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/metrics'

  - job_name: 'provisioning-control-center'
    static_configs:
      - targets: ['localhost:8081']

  - job_name: 'provisioning-vault-service'
    static_configs:
      - targets: ['localhost:8085']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
```
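To check that the scrape jobs above are healthy, the response of Prometheus's `/api/v1/targets` endpoint can be summarized per job. The following Python sketch works on an already-decoded response body; the helper name `summarize_targets` and the sample payload are illustrative assumptions, not part of the platform.

```python
from collections import defaultdict


def summarize_targets(payload: dict) -> dict:
    """Count scrape targets per job and health state.

    `payload` is the decoded JSON body of GET /api/v1/targets.
    """
    summary = defaultdict(lambda: defaultdict(int))
    for target in payload.get("data", {}).get("activeTargets", []):
        job = target.get("labels", {}).get("job", "unknown")
        health = target.get("health", "unknown")
        summary[job][health] += 1
    # Convert the nested defaultdicts to plain dicts for printing and comparison
    return {job: dict(states) for job, states in summary.items()}


# Hypothetical sample shaped like a Prometheus response
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "provisioning-orchestrator"}, "health": "up"},
            {"labels": {"job": "provisioning-vault-service"}, "health": "down"},
            {"labels": {"job": "node-exporter"}, "health": "up"},
        ]
    },
}

print(summarize_targets(sample))
```

Against a live server, the payload would come from `http://localhost:9090/api/v1/targets`, for example via `urllib.request.urlopen` and `json.load`.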

### Retention Configuration

```yaml
global:
  external_labels:
    cluster: 'provisioning-production'

# Storage retention
# (Prometheus also accepts these limits as the command-line flags
# --storage.tsdb.retention.time and --storage.tsdb.retention.size)
storage:
  tsdb:
    retention.time: 30d
    retention.size: 50GB
```
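Whether a 30-day window fits inside the 50GB size cap depends on ingestion volume. The Prometheus storage documentation gives a rough sizing rule: retention time in seconds, times ingested samples per second, times bytes per sample (about 1 to 2 bytes after compression). The Python sketch below applies that rule; the series count is an assumed example value, not a measurement from the platform.

```python
def estimate_tsdb_bytes(retention_days: float,
                        samples_per_second: float,
                        bytes_per_sample: float = 2.0) -> float:
    """Rough disk estimate: retention seconds * samples/s * bytes/sample."""
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Assumed workload: 100,000 active series scraped every 15 seconds
active_series = 100_000
scrape_interval_seconds = 15
samples_per_second = active_series / scrape_interval_seconds

needed = estimate_tsdb_bytes(30, samples_per_second)
print(f"{needed / 1e9:.1f} GB")
```

With these assumed numbers the estimate is roughly 35 GB, which stays under the 50GB cap. A larger series count or a shorter scrape interval would make the size limit, not the time limit, the one that triggers deletion.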

## Key Metrics

### Platform Metrics

Orchestrator metrics:

```text
provisioning_workflows_total - Total workflows created
provisioning_workflows_active - Currently active workflows
provisioning_workflows_completed - Successfully completed workflows
provisioning_workflows_failed - Failed workflows
provisioning_tasks_queued - Tasks in queue
provisioning_tasks_running - Currently executing tasks
provisioning_tasks_completed - Total completed tasks
provisioning_checkpoint_recoveries - Checkpoint recovery count
```

Control Center metrics:

```text
provisioning_api_requests_total - Total API requests
provisioning_api_requests_duration_seconds - Request latency histogram
provisioning_auth_attempts_total - Authentication attempts
provisioning_auth_failures_total - Failed authentication attempts
provisioning_rbac_denials_total - Authorization denials
```

Vault Service metrics:

```text
provisioning_secrets_operations_total - Secret operations count
provisioning_kms_encryptions_total - Encryption operations
provisioning_kms_decryptions_total - Decryption operations
provisioning_kms_latency_seconds - KMS operation latency
```

### System Metrics

Node Exporter provides system-level metrics:

```text
node_cpu_seconds_total - CPU time per core
node_memory_MemAvailable_bytes - Available memory
node_disk_io_time_seconds_total - Disk I/O time
node_network_receive_bytes_total - Network RX bytes
node_network_transmit_bytes_total - Network TX bytes
node_filesystem_avail_bytes - Available disk space
```

## Grafana Dashboards

### Pre-built Dashboards

Import platform dashboards:

```bash
# Install all pre-built dashboards
provisioning monitoring install-dashboards

# List available dashboards
provisioning monitoring list-dashboards
```

Available dashboards:

1. **Platform Overview** - High-level system status
2. **Orchestrator Performance** - Workflow and task metrics
3. **Control Center API** - API request metrics and latency
4. **Vault Service KMS** - Encryption operations and performance
5. **System Resources** - CPU, memory, disk, network
6. **Security Events** - Authentication, authorization, audit logs
7. **Database Performance** - SurrealDB metrics

### Custom Dashboard Creation

Create custom dashboards via Grafana UI or provisioning:

```json
{
  "dashboard": {
    "title": "Custom Infrastructure Dashboard",
    "panels": [
      {
        "title": "Active Workflows",
        "targets": [
          {
            "expr": "provisioning_workflows_active",
            "legendFormat": "Active Workflows"
          }
        ],
        "type": "graph"
      }
    ]
  }
}
```
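When a dashboard needs many similar panels, the JSON above can be generated instead of written by hand. The Python sketch below mirrors the structure shown in this section; the helper name `build_dashboard` and the second panel are illustrative assumptions.

```python
import json


def build_dashboard(title: str, panels: dict) -> dict:
    """Build a dashboard document; `panels` maps panel title to PromQL expression."""
    return {
        "dashboard": {
            "title": title,
            "panels": [
                {
                    "title": panel_title,
                    "targets": [{"expr": expr, "legendFormat": panel_title}],
                    "type": "graph",
                }
                for panel_title, expr in panels.items()
            ],
        }
    }


dashboard = build_dashboard(
    "Custom Infrastructure Dashboard",
    {
        "Active Workflows": "provisioning_workflows_active",
        "Queued Tasks": "provisioning_tasks_queued",  # hypothetical extra panel
    },
)
print(json.dumps(dashboard, indent=2))
```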

Export a dashboard to a file:

```bash
provisioning monitoring export-dashboard --id 1 --output custom-dashboard.json
```

## Alerting

### Alert Rules

Configure alert rules in Prometheus:

```yaml
# /etc/provisioning/prometheus/alerts/provisioning.yml
groups:
  - name: provisioning_alerts
    interval: 30s
    rules:
      - alert: OrchestratorDown
        expr: up{job="provisioning-orchestrator"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Orchestrator service is down"
          description: "Orchestrator has been down for more than 1 minute"

      - alert: HighWorkflowFailureRate
        expr: |
          rate(provisioning_workflows_failed[5m]) /
          rate(provisioning_workflows_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High workflow failure rate"
          description: "More than 10% of workflows are failing"

      - alert: DatabaseConnectionLoss
        expr: provisioning_database_connected == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "Database connection lost"

      - alert: HighMemoryUsage
        expr: |
          (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage is above 90%"

      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/var/lib/provisioning"} /
          node_filesystem_size_bytes{mountpoint="/var/lib/provisioning"}) < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space"
          description: "Less than 10% disk space available"
```
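The `HighWorkflowFailureRate` condition divides two rates taken over the same 5-minute window, which is the same as dividing the counter increases over that window. The Python sketch below reproduces that arithmetic for a single evaluation; the function names are hypothetical and the `for: 5m` hold period is not modeled.

```python
def failure_ratio(failed_increase: float, total_increase: float):
    """Return failed/total over a window, or None when no workflows started."""
    if total_increase <= 0:
        # PromQL would yield NaN here, which never satisfies "> 0.1"
        return None
    return failed_increase / total_increase


def should_fire(failed_increase: float, total_increase: float,
                threshold: float = 0.1) -> bool:
    """True when the failure ratio is strictly above the threshold."""
    ratio = failure_ratio(failed_increase, total_increase)
    return ratio is not None and ratio > threshold


print(should_fire(3, 20))   # 15% of workflows failed
print(should_fire(1, 20))   # 5% of workflows failed
print(should_fire(0, 0))    # idle window, no workflows at all
```

An idle window produces no alert, so a quiet system does not page anyone, but it also means this rule cannot detect a platform that has stopped creating workflows entirely.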

### Alertmanager Configuration

Route alerts to appropriate channels:

```yaml
# /etc/provisioning/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'team-email'

  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      continue: true

    - match:
        severity: warning
      receiver: 'slack'

receivers:
  - name: 'team-email'
    email_configs:
      - to: 'ops@example.com'
        from: 'alerts@provisioning.example.com'
        smarthost: 'smtp.example.com:587'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: '<pagerduty-key>'

  - name: 'slack'
    slack_configs:
      - api_url: '<slack-webhook-url>'
        channel: '#provisioning-alerts'
```
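How the routing tree above picks receivers can be sketched in a few lines of Python. This is a simplified model of a single-level tree (exact label matches, ordered child routes, a `continue` flag, and a default receiver), not Alertmanager's implementation; the names `ROUTES` and `receivers_for` are illustrative.

```python
# Child routes in the order they appear in the configuration
ROUTES = [
    {"match": {"severity": "critical"}, "receiver": "pagerduty", "continue": True},
    {"match": {"severity": "warning"}, "receiver": "slack", "continue": False},
]
DEFAULT_RECEIVER = "team-email"


def receivers_for(labels: dict) -> list:
    """Return the receivers an alert with these labels would reach."""
    matched = []
    for route in ROUTES:
        if all(labels.get(key) == value for key, value in route["match"].items()):
            matched.append(route["receiver"])
            if not route["continue"]:
                break  # stop at the first match unless the route says continue
    # The root receiver is used only when no child route matched
    return matched or [DEFAULT_RECEIVER]


print(receivers_for({"alertname": "OrchestratorDown", "severity": "critical"}))
print(receivers_for({"alertname": "DiskSpaceLow", "severity": "warning"}))
print(receivers_for({"alertname": "Heartbeat", "severity": "info"}))
```

Under this model, `continue: true` on the critical route changes nothing yet, because no later sibling matches critical alerts. The email receiver only sees alerts that match neither child route.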

Test alerts:

```bash
# Send test alert
provisioning monitoring test-alert --severity critical

# Silence alerts temporarily
provisioning monitoring silence --duration 2h --reason "Maintenance window"
```

## Log Aggregation with Loki

### Loki Configuration

```yaml
# /etc/provisioning/loki/loki.yml
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1

schema_config:
  configs:
    - from: 2024-01-01
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /var/lib/loki/boltdb-shipper-active
    cache_location: /var/lib/loki/boltdb-shipper-cache
  filesystem:
    directory: /var/lib/loki/chunks

limits_config:
  retention_period: 720h  # 30 days
```

### Promtail for Log Shipping

```yaml
# /etc/provisioning/promtail/promtail.yml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://localhost:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets:
          - localhost
        labels:
          job: varlogs
          __path__: /var/log/provisioning/*.log

  - job_name: journald
    journal:
      max_age: 12h
      labels:
        job: systemd-journal
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: 'unit'
```

Query logs in Grafana:

```logql
{job="varlogs"} |= "error"
{unit="provisioning-orchestrator.service"} |= "workflow" | json
```
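The same LogQL expressions can be sent to Loki's `query_range` HTTP endpoint when logs are needed outside Grafana. The Python sketch below only builds the request URL; the helper name `loki_query_url` is hypothetical, and the base URL assumes the port configured above.

```python
from urllib.parse import urlencode


def loki_query_url(base_url: str, logql: str,
                   start_ns: int, end_ns: int, limit: int = 100) -> str:
    """Build a /loki/api/v1/query_range URL.

    `start_ns` and `end_ns` are Unix timestamps in nanoseconds.
    """
    params = urlencode(
        {"query": logql, "start": start_ns, "end": end_ns, "limit": limit}
    )
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{params}"


url = loki_query_url(
    "http://localhost:3100",
    '{job="varlogs"} |= "error"',
    start_ns=0,
    end_ns=1_000_000_000,
)
print(url)
```

Encoding the query with `urlencode` matters here because LogQL selectors contain braces, quotes, and pipes that are not safe in a raw URL.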

## Tracing with Tempo

### Distributed Tracing

Enable OpenTelemetry tracing in services:

```toml
# /etc/provisioning/config.toml
[tracing]
enabled = true
exporter = "otlp"
endpoint = "localhost:4317"
service_name = "provisioning-orchestrator"
```

Tempo configuration:

```yaml
# /etc/provisioning/tempo/tempo.yml
server:
  http_listen_port: 3200

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317

storage:
  trace:
    backend: local
    local:
      path: /var/lib/tempo/traces

query_frontend:
  search:
    enabled: true
```

View traces in Grafana or Tempo UI.

## Performance Monitoring

### Query Performance

Monitor slow queries:

```promql
# 95th percentile API latency
histogram_quantile(0.95,
  rate(provisioning_api_requests_duration_seconds_bucket[5m])
)

# Slow workflows (>60s)
provisioning_workflow_duration_seconds > 60
```
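`histogram_quantile` does not see individual request durations; it estimates the quantile from cumulative bucket counts by linear interpolation. The Python sketch below is a simplified model of that estimate, assuming buckets sorted by upper bound and ending with `+Inf`; the bucket values are invented for illustration.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Quantile falls in the +Inf bucket: report the highest finite bound
                return prev_bound
            # Interpolate linearly inside the bucket that contains the rank
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = upper, count
    return prev_bound


# Hypothetical cumulative counts for buckets le=0.1, 0.5, 1.0, +Inf
sample_buckets = [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]

print(histogram_quantile(0.5, sample_buckets))
print(histogram_quantile(0.95, sample_buckets))
```

With these invented counts the median lands inside the 0.1 to 0.5 second bucket, while the 95th percentile falls into the `+Inf` bucket and is reported as 1.0, the highest finite bound. That is why bucket boundaries should extend past the latencies the service is expected to produce.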

### Resource Monitoring

Track resource utilization:

```promql
# CPU usage per service
rate(process_cpu_seconds_total{job=~"provisioning-.*"}[5m]) * 100

# Memory usage per service
process_resident_memory_bytes{job=~"provisioning-.*"}

# Disk I/O rate
rate(node_disk_io_time_seconds_total[5m])
```

## Custom Metrics

### Adding Custom Metrics

Rust services use the `prometheus` crate:

```rust
use prometheus::{Counter, Histogram, HistogramOpts, Registry};

fn main() -> Result<(), prometheus::Error> {
    // Registry that holds the service's metrics
    let registry = Registry::new();

    // Create metrics
    let workflow_counter = Counter::new(
        "provisioning_custom_workflows",
        "Custom workflow counter",
    )?;

    let task_duration = Histogram::with_opts(
        HistogramOpts::new("provisioning_task_duration", "Task duration")
            .buckets(vec![0.1, 0.5, 1.0, 5.0, 10.0]),
    )?;

    // Register clones so the original handles stay usable below
    registry.register(Box::new(workflow_counter.clone()))?;
    registry.register(Box::new(task_duration.clone()))?;

    // Use metrics
    workflow_counter.inc();
    let duration_seconds = 0.42; // placeholder for a measured task duration
    task_duration.observe(duration_seconds);

    Ok(())
}
```

Nushell scripts export metrics:

```nushell
# Export metrics in Prometheus format
def export-metrics [] {
    [
        "# HELP provisioning_custom_metric Custom metric"
        "# TYPE provisioning_custom_metric counter"
        $"provisioning_custom_metric (get-metric-value)"
    ] | str join "\n"
}
```
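The output of the script above follows the Prometheus text exposition format: a `# HELP` line, a `# TYPE` line, then one line per sample. The Python sketch below renders the same format with optional labels; the function name `render_metric` is hypothetical, and escaping of special characters in label values is left out for brevity.

```python
def render_metric(name: str, help_text: str, metric_type: str, samples: dict) -> str:
    """Render one metric family in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to the sample value;
    use an empty tuple for a sample without labels.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples.items():
        if labels:
            rendered = ",".join(f'{key}="{val}"' for key, val in labels)
            lines.append(f"{name}{{{rendered}}} {value}")
        else:
            lines.append(f"{name} {value}")
    # The format is line-oriented, so end with a trailing newline
    return "\n".join(lines) + "\n"


text = render_metric(
    "provisioning_custom_metric",
    "Custom metric",
    "counter",
    {(("status", "ok"),): 5, (("status", "failed"),): 1},
)
print(text, end="")
```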

## Monitoring Best Practices

- Set appropriate scrape intervals (15-60s)
- Configure retention based on compliance requirements
- Use labels for multi-dimensional metrics
- Create dashboards for key business metrics
- Set up alerts for critical failures only
- Document alert thresholds and runbooks
- Review and tune alerts regularly
- Use recording rules for expensive queries
- Archive long-term metrics to object storage

## Related Documentation

- [Service Management](service-management.md) - Service lifecycle
- [Platform Health](platform-health.md) - Health checks
- [Troubleshooting](troubleshooting.md) - Debugging issues