provisioning/docs/src/operations/service-management.md
2026-01-17 03:58:28 +00:00

Service Management

This guide covers managing the nine core services that power the Provisioning infrastructure automation platform.

Platform Services Overview

The platform consists of nine microservices providing execution, management, and supporting infrastructure:

| Service | Purpose | Port | Language | Status |
|---|---|---|---|---|
| orchestrator | Workflow execution and task scheduling | 8080 | Rust + Nushell | Production |
| control-center | Backend management API with RBAC | 8081 | Rust | Production |
| control-center-ui | Web-based management interface | 8082 | Web | Production |
| mcp-server | AI-powered configuration assistance | 8083 | Nushell | Active |
| ai-service | Machine learning and anomaly detection | 8084 | Rust | Active |
| vault-service | Secrets management and KMS | 8085 | Rust | Production |
| extension-registry | OCI registry for extensions | 8086 | Rust | Planned |
| api-gateway | Unified REST API routing | 8087 | Rust | Planned |
| provisioning-daemon | Background service coordination | 8088 | Rust | Development |

Service Lifecycle Management

Starting Services

Systemd management (production):

# Start individual service
sudo systemctl start provisioning-orchestrator

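# Start all platform services
# (systemctl globs match only units already loaded in memory, so resolve
# the installed unit files first instead of using "start provisioning-*")
systemctl list-unit-files 'provisioning-*.service' --no-legend \
  | awk '{print $1}' | xargs -r sudo systemctl start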

# Enable automatic start on boot
sudo systemctl enable provisioning-orchestrator
sudo systemctl enable provisioning-control-center
sudo systemctl enable provisioning-vault-service

Manual start (development):

# Orchestrator
cd provisioning/platform/crates/orchestrator
cargo run --release

# Control Center
cd provisioning/platform/crates/control-center
cargo run --release

# MCP Server
cd provisioning/platform/crates/mcp-server
nu run.nu

Stopping Services

# Stop individual service
sudo systemctl stop provisioning-orchestrator

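# Stop all platform services (quote the glob so the shell does not expand it)
sudo systemctl stop 'provisioning-*'

# Graceful shutdown with 30-second timeout: systemctl has no --timeout flag,
# so set TimeoutStopSec=30 under [Service] in the unit, then stop as usual
sudo systemctl stop provisioning-orchestrator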

Restarting Services

# Restart after configuration changes
sudo systemctl restart provisioning-orchestrator

# Reload configuration without restart (works only if the unit defines ExecReload=)
sudo systemctl reload provisioning-control-center

Checking Service Status

# Status of all services
systemctl status 'provisioning-*'

# Detailed status
provisioning platform status

# Health check endpoints
curl http://localhost:8080/health  # Orchestrator
curl http://localhost:8081/health  # Control Center
curl http://localhost:8085/health  # Vault Service
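
The three health checks above can be wrapped in a single function for scripts or cron jobs. This is a minimal sketch, assuming the services listen on localhost on the ports shown; `check_platform_health` is a hypothetical helper name, not part of the provisioning CLI.

```bash
# Probe each /health endpoint and print one line per service.
# Returns non-zero if any endpoint is unreachable or answers with an HTTP error.
check_platform_health() {
    failed=0
    for entry in orchestrator:8080 control-center:8081 vault-service:8085; do
        name=${entry%%:*}   # text before the colon
        port=${entry##*:}   # text after the colon
        # -f: fail on HTTP errors, -sS: quiet but report errors, --max-time: do not hang
        if curl -fsS --max-time 5 "http://localhost:${port}/health" >/dev/null 2>&1; then
            echo "${name}: ok"
        else
            echo "${name}: FAILED"
            failed=1
        fi
    done
    return "$failed"
}
```

Calling `check_platform_health || echo "platform degraded"` gives a one-line summary suitable for a monitoring hook.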

Service Configuration

Configuration Files

Each service reads configuration from hierarchical sources:

/etc/provisioning/config.toml           # System defaults
~/.config/provisioning/user_config.yaml # User overrides
workspace/config/provisioning.yaml      # Workspace config
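
When a setting does not take effect, it helps to see which of these files actually exist on the host. The sketch below only reports presence; it does not merge or interpret the files. `list_config_sources` is a hypothetical helper, and treating `workspace` as a directory passed by the caller is an assumption.

```bash
# Print each configuration source, in the order listed above, with a found/missing marker.
# $1: workspace directory (defaults to the current directory)
list_config_sources() {
    workspace_dir=${1:-.}
    for f in \
        "/etc/provisioning/config.toml" \
        "${HOME}/.config/provisioning/user_config.yaml" \
        "${workspace_dir}/config/provisioning.yaml"
    do
        if [ -f "$f" ]; then
            printf 'found:   %s\n' "$f"
        else
            printf 'missing: %s\n' "$f"
        fi
    done
}
```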

Orchestrator Configuration

# /etc/provisioning/orchestrator.toml
[server]
host = "0.0.0.0"
port = 8080
workers = 8

[storage]
persistence_dir = "/var/lib/provisioning/orchestrator"
checkpoint_interval = 30

[execution]
max_parallel_tasks = 100
retry_attempts = 3
retry_backoff = "exponential"

[api]
enable_rest = true
enable_grpc = false
auth_required = true

Control Center Configuration

# /etc/provisioning/control-center.toml
[server]
host = "0.0.0.0"
port = 8081

[auth]
jwt_algorithm = "RS256"
access_token_ttl = 900
refresh_token_ttl = 604800

[rbac]
policy_dir = "/etc/provisioning/policies"
reload_interval = 60

Vault Service Configuration

# /etc/provisioning/vault-service.toml
[vault]
backend = "secretumvault"
url = "http://localhost:8200"
token_env = "VAULT_TOKEN"

[kms]
envelope_encryption = true
key_rotation_days = 90

Service Dependencies

Services must start in dependency order:

Database (SurrealDB)
  ↓
orchestrator (requires database)
  ↓
vault-service (requires orchestrator)
  ↓
control-center (requires orchestrator + vault)
  ↓
control-center-ui (requires control-center)
  ↓
mcp-server (requires control-center)
  ↓
ai-service (requires mcp-server)

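Systemd enforces this order when each unit declares its dependencies with After= and Requires=:

# /etc/systemd/system/provisioning-control-center.service
[Unit]
Description=Provisioning Control Center
After=provisioning-orchestrator.service provisioning-vault-service.service
Requires=provisioning-orchestrator.service provisioning-vault-service.service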
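
Where unit files do not yet declare every dependency, the same order can be applied from a script. This is a sketch under two assumptions: every service has a unit named `provisioning-<service>.service` (the document confirms this pattern only for the orchestrator, control center, and vault service), and the database is started separately. `start_platform_in_order` and `DRY_RUN` are hypothetical names.

```bash
# Startup order taken from the dependency chain above.
PLATFORM_START_ORDER="orchestrator vault-service control-center control-center-ui mcp-server ai-service"

# Start each service in order and stop at the first failure.
# Set DRY_RUN=1 to print the commands instead of running them.
start_platform_in_order() {
    for svc in $PLATFORM_START_ORDER; do
        unit="provisioning-${svc}.service"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "systemctl start ${unit}"
        else
            sudo systemctl start "$unit" || {
                echo "failed to start ${unit}" >&2
                return 1
            }
        fi
    done
}
```

Running `DRY_RUN=1 start_platform_in_order` first shows the planned sequence without touching any service.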

Service Health Monitoring

Health Check Endpoints

All services expose /health endpoints:

# Check orchestrator health
curl http://localhost:8080/health

# Expected response
{
  "status": "healthy",
  "version": "5.0.0",
  "uptime_seconds": 3600,
  "database": "connected",
  "active_workflows": 5,
  "queued_tasks": 12
}
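
Scripts usually need only the `status` field from this response. The sketch below extracts it with `sed` so that it has no dependencies beyond a POSIX shell; this is a simplification that assumes the field is a plain string as in the example, and a real JSON parser such as `jq` is the safer choice when available. `health_status` and `is_healthy` are hypothetical helper names.

```bash
# Read a /health JSON document on stdin and print the value of its "status" field.
health_status() {
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Succeed only if the document on stdin reports status "healthy".
is_healthy() {
    [ "$(health_status)" = "healthy" ]
}
```

For example, `curl -fsS http://localhost:8080/health | is_healthy || echo "orchestrator unhealthy"`.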

Automated Health Monitoring

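Use the systemd watchdog to restart a service that hangs or fails. The service must send keep-alive notifications (sd_notify with WATCHDOG=1) more often than WatchdogSec, or systemd will treat it as failed: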

# /etc/systemd/system/provisioning-orchestrator.service
[Service]
WatchdogSec=30
Restart=on-failure
RestartSec=10
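
The keep-alive side of the watchdog can be illustrated from a shell wrapper. This is only a sketch of the protocol, not how the platform services implement it: systemd exports `WATCHDOG_USEC` to a service that has `WatchdogSec=` set, and sd_watchdog_enabled(3) recommends pinging at half that interval. A shell-based notifier typically also needs `NotifyAccess=all` in the unit, which is an assumption about your unit file. Both function names are hypothetical.

```bash
# Print the recommended keep-alive interval in seconds (half the watchdog timeout).
# Returns 1 if the watchdog is not enabled for this process.
watchdog_ping_interval() {
    usec=${WATCHDOG_USEC:-0}
    [ "$usec" -gt 0 ] || return 1
    secs=$((usec / 2000000))      # microseconds -> seconds, then halved
    [ "$secs" -ge 1 ] || secs=1   # never sleep for less than one second
    echo "$secs"
}

# Send keep-alives forever; returns immediately if no watchdog is configured.
watchdog_loop() {
    interval=$(watchdog_ping_interval) || return 0
    while :; do
        systemd-notify WATCHDOG=1
        sleep "$interval"
    done
}
```

With `WatchdogSec=30`, `watchdog_ping_interval` yields a 15-second interval.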

Monitor with provisioning CLI:

# Continuous health monitoring
provisioning platform monitor --interval 5

# Alert on unhealthy services
provisioning platform monitor --alert-email ops@example.com

Log Management

Log Locations

Systemd services log to journald:

# View orchestrator logs
sudo journalctl -u provisioning-orchestrator -f

# View last hour of logs
sudo journalctl -u provisioning-orchestrator --since "1 hour ago"

# View errors only
sudo journalctl -u provisioning-orchestrator -p err

# Export logs to file
sudo journalctl -u 'provisioning-*' > platform-logs.txt

File-based logs:

/var/log/provisioning/orchestrator.log
/var/log/provisioning/control-center.log
/var/log/provisioning/vault-service.log

Log Rotation

Configure logrotate for file-based logs:

# /etc/logrotate.d/provisioning
/var/log/provisioning/*.log {
    daily
    rotate 30
    compress
    delaycompress
    missingok
    notifempty
    create 0644 provisioning provisioning
    sharedscripts
    postrotate
        systemctl reload 'provisioning-*' || true
    endscript
}

Log Levels

Configure log verbosity:

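# Set log level via environment (manual or development runs)
export PROVISIONING_LOG_LEVEL=debug

# Systemd services do not inherit shell exports; set the variable in a
# drop-in instead (sudo systemctl edit provisioning-orchestrator):
#   [Service]
#   Environment=PROVISIONING_LOG_LEVEL=debug
sudo systemctl restart provisioning-orchestrator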

# Or in configuration
provisioning config set logging.level debug

Log levels: trace, debug, info, warn, error
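
A script that sets the level can guard against typos before touching the configuration. This is a minimal sketch that checks a value against the five levels listed above; `is_valid_log_level` is a hypothetical helper, and treating levels as case-sensitive is an assumption.

```bash
# Succeed only if $1 is one of the documented log levels (lowercase).
is_valid_log_level() {
    case "$1" in
        trace|debug|info|warn|error) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `is_valid_log_level "$level" && provisioning config set logging.level "$level"`.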

Performance Tuning

Orchestrator Performance

Adjust worker threads and task limits:

[execution]
max_parallel_tasks = 200  # Increase for high throughput
worker_threads = 16       # Match CPU cores
task_queue_size = 1000

[performance]
enable_metrics = true
metrics_interval = 10
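
To match `worker_threads` to the CPU count, the value can be derived on the target host rather than hard-coded. This sketch prints a candidate line for the configuration file; `suggest_worker_threads` is a hypothetical helper, and whether one thread per core is the right ratio for your workload is an assumption to verify with metrics.

```bash
# Print a worker_threads setting equal to the number of online CPU cores.
suggest_worker_threads() {
    # nproc ships with GNU coreutils; getconf is the portable fallback.
    cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    echo "worker_threads = ${cores}"
}
```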

Database Connection Pooling

[database]
max_connections = 100
min_connections = 10
connection_timeout = 30
idle_timeout = 600

Memory Limits

Set memory limits via systemd:

[Service]
MemoryMax=4G
MemoryHigh=3G

Service Updates and Upgrades

Zero-Downtime Upgrades

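Upgrade procedure for a single host. The restart in the last step interrupts the service briefly, so schedule it accordingly:

# 1. Back up the current binary (the rollback procedure restores it)
sudo cp /usr/local/bin/provisioning-orchestrator \
       /usr/local/bin/provisioning-orchestrator.backup

# 2. Deploy new version alongside old version, then swap it in atomically
sudo cp provisioning-orchestrator /usr/local/bin/provisioning-orchestrator-new
sudo mv /usr/local/bin/provisioning-orchestrator-new \
       /usr/local/bin/provisioning-orchestrator

# 3. Reload unit files (needed only if the unit file itself changed)
sudo systemctl daemon-reload

# 4. Restart to run the new binary ("reload" does not replace the running process)
sudo systemctl restart provisioning-orchestrator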

Version Management

Check running versions:

provisioning platform versions

# Output:
# orchestrator: 5.0.0
# control-center: 5.0.0
# vault-service: 4.0.0
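
Version drift such as the vault-service entry above is easy to miss by eye. The sketch below flags every service that is not at an expected version, assuming the command prints one `service: version` pair per line as in the sample output. `find_version_mismatches` is a hypothetical helper.

```bash
# $1: expected version; stdin: "service: version" lines.
# Prints one line per mismatching service and returns 1 if any were found.
find_version_mismatches() {
    awk -v want="$1" -F': *' '
        /^#/ || NF < 2 { next }                 # skip comments and blank lines
        ($2 "") != (want "") {                  # force string comparison
            print $1 " is at " $2 " (expected " want ")"
            bad = 1
        }
        END { exit bad + 0 }
    '
}
```

For example, `provisioning platform versions | find_version_mismatches 5.0.0`.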

Rollback Procedure

# 1. Stop new version
sudo systemctl stop provisioning-orchestrator

# 2. Restore previous binary
sudo cp /usr/local/bin/provisioning-orchestrator.backup \
       /usr/local/bin/provisioning-orchestrator

# 3. Start service with previous version
sudo systemctl start provisioning-orchestrator
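
Whether a rollback is needed, or has succeeded, is best decided by the health endpoint rather than by the exit code of `systemctl`. This is a minimal sketch of a polling helper, assuming the service answers on its `/health` URL once it is ready; `wait_until_healthy` is a hypothetical name.

```bash
# Poll a health URL until it responds successfully.
# $1: health URL; $2: maximum attempts (default 30), one second apart.
# Returns 0 as soon as the endpoint answers, 1 if every attempt fails.
wait_until_healthy() {
    url=$1
    attempts=${2:-30}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        if curl -fsS --max-time 2 "$url" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}
```

After starting a service, `wait_until_healthy http://localhost:8080/health 30 || echo "service did not become healthy"` turns a silent failure into an explicit one.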

Security Hardening

Service Isolation

Run services with dedicated users:

# Create service user
sudo useradd -r -s /usr/sbin/nologin provisioning

# Set ownership
sudo chown -R provisioning:provisioning /var/lib/provisioning
sudo chown -R provisioning:provisioning /etc/provisioning

Systemd service configuration:

[Service]
User=provisioning
Group=provisioning
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ReadWritePaths=/var/lib/provisioning /var/log/provisioning
ProtectHome=true

Network Security

Restrict service access with firewall:

# Allow only localhost access (relies on ufw's default policy of denying other incoming traffic)
sudo ufw allow from 127.0.0.1 to any port 8080
sudo ufw allow from 127.0.0.1 to any port 8081

# Or use systemd socket activation

Troubleshooting Services

Service Won't Start

Check service status and logs:

systemctl status provisioning-orchestrator
journalctl -u provisioning-orchestrator -n 100

Common issues:

  • Port already in use: Check with lsof -i :8080
  • Configuration error: Validate with provisioning validate config
  • Missing dependencies: Check with ldd /usr/local/bin/provisioning-orchestrator
  • Permission issues: Verify file ownership

High Resource Usage

Monitor resource consumption:

# CPU and memory usage
systemctl status provisioning-orchestrator

# Detailed metrics
provisioning platform metrics --service orchestrator

Adjust limits:

# Increase memory limit
sudo systemctl set-property provisioning-orchestrator MemoryMax=8G

# Reduce parallel tasks
provisioning config set execution.max_parallel_tasks 50
sudo systemctl restart provisioning-orchestrator

Service Crashes

Enable core dumps for debugging:

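# Enable core dumps for the service: add LimitCORE=infinity under [Service]
# (a shell "ulimit -c unlimited" does not apply to systemd-managed services)
sudo systemctl edit provisioning-orchestrator
sudo systemctl restart provisioning-orchestrator

# Analyze crash (coredumpctl needs systemd-coredump as the core handler,
# so leave kernel.core_pattern at its systemd-coredump default)
sudo coredumpctl list
sudo coredumpctl debug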

Service Metrics

Prometheus Integration

Services expose Prometheus metrics:

# Orchestrator metrics
curl http://localhost:8080/metrics

# Example metrics:
# provisioning_workflows_total 1234
# provisioning_workflows_active 5
# provisioning_tasks_queued 12
# provisioning_tasks_completed 9876
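
A single value can be pulled from this output for quick checks without a full Prometheus setup. This sketch assumes un-labelled samples in the text exposition format, as in the example metrics above; `metric_value` is a hypothetical helper.

```bash
# $1: metric name; stdin: Prometheus text exposition format.
# Prints the value of the first matching un-labelled sample; returns 1 if absent.
metric_value() {
    awk -v name="$1" '
        $1 == name { print $2; found = 1; exit }
        END { if (!found) exit 1 }
    '
}
```

For example, `curl -fsS http://localhost:8080/metrics | metric_value provisioning_tasks_queued`.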

Grafana Dashboards

Import pre-built dashboards:

provisioning monitoring install-dashboards

Dashboards available at http://localhost:3000

Best Practices

Service Management

  • Use systemd for production deployments
  • Enable automatic restart on failure
  • Monitor health endpoints continuously
  • Set appropriate resource limits
  • Implement log rotation
  • Back up service data regularly

Configuration Management

  • Version control all configuration files
  • Use hierarchical configuration for flexibility
  • Validate configuration before applying
  • Document all custom settings
  • Use environment variables for secrets

Monitoring and Alerting

  • Monitor all service health endpoints
  • Set up alerts for service failures
  • Track key performance metrics
  • Review logs regularly
  • Establish incident response procedures