2026-01-17 03:58:28 +00:00
# Service Management
Managing the nine core platform services that power the Provisioning infrastructure automation platform.
## Platform Services Overview
The platform consists of nine microservices providing execution, management, and supporting infrastructure:
| Service | Purpose | Port | Language | Status |
| ------- | ------- | ---- | -------- | ------ |
| **orchestrator** | Workflow execution and task scheduling | 8080 | Rust + Nushell | Production |
| **control-center** | Backend management API with RBAC | 8081 | Rust | Production |
| **control-center-ui** | Web-based management interface | 8082 | Web | Production |
| **mcp-server** | AI-powered configuration assistance | 8083 | Nushell | Active |
| **ai-service** | Machine learning and anomaly detection | 8084 | Rust | Active |
| **vault-service** | Secrets management and KMS | 8085 | Rust | Production |
| **extension-registry** | OCI registry for extensions | 8086 | Rust | Planned |
| **api-gateway** | Unified REST API routing | 8087 | Rust | Planned |
| **provisioning-daemon** | Background service coordination | 8088 | Rust | Development |
## Service Lifecycle Management
### Starting Services
Systemd management (production):
```bash
# Start individual service
sudo systemctl start provisioning-orchestrator
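# Start several platform services at once (systemctl globs match only units
# already loaded in memory, so list stopped units by name)
sudo systemctl start provisioning-orchestrator provisioning-vault-service provisioning-control-center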
# Enable automatic start on boot
sudo systemctl enable provisioning-orchestrator
sudo systemctl enable provisioning-control-center
sudo systemctl enable provisioning-vault-service
```
Manual start (development), each service in its own terminal, starting from the repository root:
```bash
# Orchestrator
cd provisioning/platform/crates/orchestrator
cargo run --release
# Control Center
cd provisioning/platform/crates/control-center
cargo run --release
# MCP Server
cd provisioning/platform/crates/mcp-server
nu run.nu
```
### Stopping Services
```bash
# Stop individual service
sudo systemctl stop provisioning-orchestrator
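# Stop all running platform services (quote the glob so the shell does not expand it)
sudo systemctl stop 'provisioning-*'
# Graceful shutdown with a 30-second timeout: systemctl stop has no --timeout flag,
# so set TimeoutStopSec=30 under [Service] in the unit file, then verify it
systemctl show -p TimeoutStopUSec provisioning-orchestrator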
```
### Restarting Services
```bash
# Restart after configuration changes
sudo systemctl restart provisioning-orchestrator
# Reload configuration without restart
sudo systemctl reload provisioning-control-center
```
### Checking Service Status
```bash
# Status of all services
systemctl status 'provisioning-*'
# Detailed status
provisioning platform status
# Health check endpoints
curl http://localhost:8080/health  # Orchestrator
curl http://localhost:8081/health  # Control Center
curl http://localhost:8085/health  # Vault Service
```
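When services are started from a script, it can help to wait until a health endpoint answers before moving on to the next service. The following is a minimal sketch, assuming `curl` is installed; the function name `wait_for_health` and its retry parameters are illustrative and not part of the provisioning CLI:

```bash
#!/usr/bin/env bash
# wait_for_health URL [ATTEMPTS] [DELAY_SECONDS]
# Polls URL until curl gets a successful HTTP response or attempts run out.
# Hypothetical helper; not part of the provisioning CLI.
wait_for_health() {
  local url="$1" attempts="${2:-10}" delay="${3:-2}" i
  for ((i = 1; i <= attempts; i++)); do
    # -f: treat HTTP errors as failures, -sS: quiet but still report errors
    if curl -fsS --max-time 5 "$url" >/dev/null 2>&1; then
      echo "healthy after $i attempt(s): $url"
      return 0
    fi
    # Sleep between attempts, but not after the last one
    if ((i < attempts)); then sleep "$delay"; fi
  done
  echo "unhealthy after $attempts attempt(s): $url" >&2
  return 1
}
```

For example, `wait_for_health http://localhost:8080/health 15 2` polls the orchestrator up to 15 times, two seconds apart, and returns non-zero if it never responds.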
## Service Configuration
### Configuration Files
Each service reads configuration from hierarchical sources:
```text
/etc/provisioning/config.toml # System defaults
~/.config/provisioning/user_config.yaml # User overrides
workspace/config/provisioning.yaml # Workspace config
```
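To see which of these sources actually exist on a given machine, a small check like the one below may be useful. It is only a sketch: the function name `config_sources` is hypothetical, and the workspace directory is passed as an argument because its location is not fixed by the paths above.

```bash
#!/usr/bin/env bash
# config_sources [WORKSPACE_DIR]
# Prints each configuration path from the list above with "found" or "missing".
# Hypothetical helper for illustration only.
config_sources() {
  local workspace="${1:-workspace}" path
  for path in \
    "/etc/provisioning/config.toml" \
    "$HOME/.config/provisioning/user_config.yaml" \
    "$workspace/config/provisioning.yaml"; do
    if [ -f "$path" ]; then
      printf 'found    %s\n' "$path"
    else
      printf 'missing  %s\n' "$path"
    fi
  done
}
```

Calling `config_sources ./workspace` from a project directory lists all three paths, one per line, in the order shown above.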
### Orchestrator Configuration
```toml
# /etc/provisioning/orchestrator.toml
[server]
host = "0.0.0.0"
port = 8080
workers = 8
[storage]
persistence_dir = "/var/lib/provisioning/orchestrator"
checkpoint_interval = 30
[execution]
max_parallel_tasks = 100
retry_attempts = 3
retry_backoff = "exponential"
[api]
enable_rest = true
enable_grpc = false
auth_required = true
```
### Control Center Configuration
```toml
# /etc/provisioning/control-center.toml
[server]
host = "0.0.0.0"
port = 8081
[auth]
jwt_algorithm = "RS256"
access_token_ttl = 900
refresh_token_ttl = 604800
[rbac]
policy_dir = "/etc/provisioning/policies"
reload_interval = 60
```
### Vault Service Configuration
```toml
# /etc/provisioning/vault-service.toml
[vault]
backend = "secretumvault"
url = "http://localhost:8200"
token_env = "VAULT_TOKEN"
[kms]
envelope_encryption = true
key_rotation_days = 90
```
## Service Dependencies
Understanding service dependencies for proper startup order:
```text
Database (SurrealDB)
└─ orchestrator (requires database)
   └─ vault-service (requires orchestrator)
      └─ control-center (requires orchestrator + vault)
         ├─ control-center-ui (requires control-center)
         └─ mcp-server (requires control-center)
            └─ ai-service (requires mcp-server)
```
Systemd starts units in this order once each unit declares its dependencies with `After=` and `Requires=`:
```ini
# /etc/systemd/system/provisioning-control-center.service
[Unit]
Description=Provisioning Control Center
After=provisioning-orchestrator.service provisioning-vault-service.service
Requires=provisioning-orchestrator.service provisioning-vault-service.service
```
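For manual or scripted starts outside systemd, the dependency list above can be turned into a valid start order mechanically. The sketch below feeds "dependency dependent" pairs to `tsort` (part of coreutils), which performs a topological sort. The function name `startup_order` and the node label `database` (standing in for SurrealDB) are illustrative choices.

```bash
#!/usr/bin/env bash
# startup_order: prints one valid start order, one service per line.
# Each input line is "dependency dependent", taken from the dependency list above.
startup_order() {
  tsort <<'EOF'
database orchestrator
orchestrator vault-service
orchestrator control-center
vault-service control-center
control-center control-center-ui
control-center mcp-server
mcp-server ai-service
EOF
}
```

Several orders can satisfy the same constraints (for instance, `control-center-ui` and `mcp-server` may appear in either order), so scripts should rely only on each dependency preceding its dependents.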
## Service Health Monitoring
### Health Check Endpoints
All services expose `/health` endpoints:
```bash
# Check orchestrator health
curl http://localhost:8080/health
# Expected response
{
  "status": "healthy",
  "version": "5.0.0",
  "uptime_seconds": 3600,
  "database": "connected",
  "active_workflows": 5,
  "queued_tasks": 12
}
```
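In scripts, the response body usually needs to be reduced to a pass/fail result or a single number. The helpers below are a sketch that assumes the response shape shown above and uses only `grep` and `sed`, so no JSON tool is required; the names `is_healthy` and `health_number` are hypothetical.

```bash
#!/usr/bin/env bash
# is_healthy: reads a /health JSON body on stdin and succeeds only when the
# "status" field is "healthy". Assumes the response shape shown above.
is_healthy() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"healthy"'
}

# health_number FIELD: prints the integer value of FIELD from the JSON on stdin.
# Only handles unsigned integer fields such as queued_tasks or uptime_seconds.
health_number() {
  sed -nE "s/.*\"$1\"[[:space:]]*:[[:space:]]*([0-9]+).*/\1/p"
}
```

For example, `curl -fsS http://localhost:8080/health | is_healthy` can gate a deployment step, and `curl -fsS http://localhost:8080/health | health_number queued_tasks` prints the current queue depth. For anything beyond simple fields, a real JSON parser is the safer choice.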
### Automated Health Monitoring
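Use the systemd watchdog together with `Restart=on-failure` to restart a failed or hung service automatically. `WatchdogSec=` only takes effect if the service sends periodic `WATCHDOG=1` keep-alive notifications via `sd_notify`: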
```ini
# /etc/systemd/system/provisioning-orchestrator.service
[Service]
WatchdogSec=30
Restart=on-failure
RestartSec=10
```
Monitor with provisioning CLI:
```bash
# Continuous health monitoring
provisioning platform monitor --interval 5
# Alert on unhealthy services
provisioning platform monitor --alert-email ops@example.com
```
## Log Management
### Log Locations
Systemd services log to journald:
```bash
# View orchestrator logs
sudo journalctl -u provisioning-orchestrator -f
# View last hour of logs
sudo journalctl -u provisioning-orchestrator --since "1 hour ago"
# View errors only
sudo journalctl -u provisioning-orchestrator -p err
# Export logs to file
sudo journalctl -u 'provisioning-*' > platform-logs.txt
```
File-based logs:
```text
/var/log/provisioning/orchestrator.log
/var/log/provisioning/control-center.log
/var/log/provisioning/vault-service.log
```
### Log Rotation
Configure logrotate for file-based logs:
```text
# /etc/logrotate.d/provisioning
/var/log/provisioning/*.log {
    daily
    rotate 30
    compress
    delaycompress
    missingok
    notifempty
    create 0644 provisioning provisioning
    sharedscripts
    postrotate
        systemctl reload 'provisioning-*' || true
    endscript
}
```
### Log Levels
Configure log verbosity:
```bash
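# Set log level via the unit environment (a shell export does not reach systemd services):
# add Environment=PROVISIONING_LOG_LEVEL=debug under [Service] in a drop-in
sudo systemctl edit provisioning-orchestrator
sudo systemctl restart provisioning-orchestrator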
# Or in configuration
provisioning config set logging.level debug
```
Log levels, from most to least verbose: `trace`, `debug`, `info`, `warn`, `error`.
## Performance Tuning
### Orchestrator Performance
Adjust worker threads and task limits:
```toml
[execution]
max_parallel_tasks = 200 # Increase for high throughput
worker_threads = 16 # Match CPU cores
task_queue_size = 1000
[performance]
enable_metrics = true
metrics_interval = 10
```
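The "match CPU cores" guidance for `worker_threads` can be derived from the host instead of being hard-coded. A minimal sketch, assuming `nproc` (GNU coreutils) or `getconf` is available; the function name `suggest_worker_threads` is hypothetical:

```bash
#!/usr/bin/env bash
# suggest_worker_threads: prints a worker_threads line sized to the number of
# online CPUs. Falls back to getconf where nproc is unavailable.
suggest_worker_threads() {
  local cores
  cores="$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)"
  printf 'worker_threads = %s\n' "$cores"
}
```

The printed line can be pasted into the `[execution]` section. On hosts shared with other workloads, a lower value may be more appropriate.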
### Database Connection Pooling
```toml
[database]
max_connections = 100
min_connections = 10
connection_timeout = 30
idle_timeout = 600
```
### Memory Limits
Set memory limits via systemd:
```ini
[Service]
MemoryMax=4G
MemoryHigh=3G
```
## Service Updates and Upgrades
### Zero-Downtime Upgrades
Rolling upgrade procedure:
```bash
# 1. Deploy new version alongside old version
sudo cp provisioning-orchestrator /usr/local/bin/provisioning-orchestrator-new
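# 2. Point ExecStart= in the unit at the new binary, then reload unit files
sudo systemctl edit --full provisioning-orchestrator
sudo systemctl daemon-reload
# 3. Restart to load the new binary (reload only re-reads configuration)
sudo systemctl restart provisioning-orchestrator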
```
### Version Management
Check running versions:
```bash
provisioning platform versions
# Output:
# orchestrator: 5.0.0
# control-center: 5.0.0
# vault-service: 4.0.0
```
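Mixed versions, such as the `vault-service` entry above, are easy to miss by eye. The sketch below reads `service: version` lines and reports any service whose version differs from the first one listed. It assumes the output format shown above; `version_skew` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# version_skew: reads "service: version" lines on stdin (a leading "# " is
# allowed) and prints every service whose version differs from the first one.
# Exits non-zero when any mismatch is found. Hypothetical helper.
version_skew() {
  awk '
    { sub(/^#[ \t]*/, "") }                      # drop a leading comment marker
    !/^[a-z0-9-]+:[ \t]*[0-9]/ { next }          # skip lines without a version
    {
      name = $1; sub(/:$/, "", name)
      if (ref == "") { ref = $2; next }          # first entry is the reference
      if ($2 != ref) { print name " " $2 " (expected " ref ")"; bad = 1 }
    }
    END { exit bad }
  '
}
```

A typical call would be `provisioning platform versions | version_skew`, which prints nothing and exits zero when all services report the same version.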
### Rollback Procedure
```bash
# 1. Stop new version
sudo systemctl stop provisioning-orchestrator
# 2. Restore previous binary (assumes a .backup copy was saved before the upgrade)
sudo cp /usr/local/bin/provisioning-orchestrator.backup \
  /usr/local/bin/provisioning-orchestrator
# 3. Start service with previous version
sudo systemctl start provisioning-orchestrator
```
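The rollback only works if a backup was taken before the upgrade. One way to make that automatic is to wrap the binary swap in a pair of functions, sketched below. This assumes the unit's `ExecStart=` points at a fixed path and that the functions run from a root shell; `install_with_backup` and `rollback_binary` are hypothetical names.

```bash
#!/usr/bin/env bash
# install_with_backup NEW_BINARY TARGET
# Saves TARGET as TARGET.backup, then moves a staged copy of NEW_BINARY into
# place. The rename is atomic when both paths are on the same filesystem.
install_with_backup() {
  local new="$1" target="$2"
  if [ -f "$target" ]; then cp -p "$target" "$target.backup" || return 1; fi
  cp "$new" "$target.new" || return 1
  chmod 0755 "$target.new" || return 1
  mv "$target.new" "$target"
}

# rollback_binary TARGET: restores TARGET from TARGET.backup if it exists.
rollback_binary() {
  local target="$1"
  [ -f "$target.backup" ] || { echo "no backup for $target" >&2; return 1; }
  cp -p "$target.backup" "$target.new" || return 1
  mv "$target.new" "$target"
}
```

After either function succeeds, the service still has to be restarted with `systemctl restart provisioning-orchestrator` to pick up the swapped binary.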
## Security Hardening
### Service Isolation
Run services with dedicated users:
```bash
# Create service user
sudo useradd -r -s /usr/sbin/nologin provisioning
# Set ownership
sudo chown -R provisioning:provisioning /var/lib/provisioning
sudo chown -R provisioning:provisioning /etc/provisioning
```
Systemd service configuration:
```ini
[Service]
User=provisioning
Group=provisioning
NoNewPrivileges=true
PrivateTmp=true
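ProtectSystem=strict
# strict mounts the file system read-only; re-allow the service's data and log paths
ReadWritePaths=/var/lib/provisioning /var/log/provisioning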
ProtectHome=true
```
### Network Security
Restrict service access with firewall:
```bash
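# Allow only localhost access (relies on ufw's default-deny policy for incoming
# traffic; binding the service to 127.0.0.1 instead of 0.0.0.0 has the same effect)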
sudo ufw allow from 127.0.0.1 to any port 8080
sudo ufw allow from 127.0.0.1 to any port 8081
# Or use systemd socket activation
```
## Troubleshooting Services
### Service Won't Start
Check service status and logs:
```bash
systemctl status provisioning-orchestrator
journalctl -u provisioning-orchestrator -n 100
```
Common issues:
- Port already in use: Check with `lsof -i :8080`
- Configuration error: Validate with `provisioning validate config`
- Missing dependencies: Check with `ldd /usr/local/bin/provisioning-orchestrator`
- Permission issues: Verify file ownership
### High Resource Usage
Monitor resource consumption:
```bash
# CPU and memory usage
systemctl status provisioning-orchestrator
# Detailed metrics
provisioning platform metrics --service orchestrator
```
Adjust limits:
```bash
# Increase memory limit
sudo systemctl set-property provisioning-orchestrator MemoryMax=8G
# Reduce parallel tasks
provisioning config set execution.max_parallel_tasks 50
sudo systemctl restart provisioning-orchestrator
```
### Service Crashes
Enable core dumps for debugging:
```bash
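# Enable core dumps for the unit (a shell ulimit does not apply to systemd services):
# add LimitCORE=infinity under [Service] in a drop-in, then restart
sudo systemctl edit provisioning-orchestrator
sudo systemctl restart provisioning-orchestrator
# Analyze crash (coredumpctl needs systemd-coredump as the kernel.core_pattern
# handler; a file path such as /var/crash/core.%e.%p bypasses it)
sudo coredumpctl list
sudo coredumpctl debug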
```
## Service Metrics
### Prometheus Integration
Services expose Prometheus metrics:
```bash
# Orchestrator metrics
curl http://localhost:8080/metrics
# Example metrics:
# provisioning_workflows_total 1234
# provisioning_workflows_active 5
# provisioning_tasks_queued 12
# provisioning_tasks_completed 9876
```
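For quick checks without a Prometheus server, a single value can be pulled out of the text output with `awk`. The sketch below assumes unlabelled samples like the ones listed above; `metric_value` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# metric_value NAME: reads Prometheus text-format metrics on stdin and prints
# the value of the first sample whose metric name is exactly NAME.
# Comment lines (# HELP, # TYPE) never match; labelled samples are not handled.
# Exits non-zero when NAME is not present.
metric_value() {
  awk -v name="$1" '$1 == name { print $2; found = 1; exit } END { exit !found }'
}
```

For example, `curl -fsS http://localhost:8080/metrics | metric_value provisioning_tasks_queued` prints the current queue length, which can then be compared against a threshold in a cron job or alert script.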
### Grafana Dashboards
Import pre-built dashboards:
```bash
provisioning monitoring install-dashboards
```
Dashboards available at [http://localhost:3000](http://localhost:3000)
## Best Practices
### Service Management
- Use systemd for production deployments
- Enable automatic restart on failure
- Monitor health endpoints continuously
- Set appropriate resource limits
- Implement log rotation
- Back up service data regularly
### Configuration Management
- Version control all configuration files
- Use hierarchical configuration for flexibility
- Validate configuration before applying
- Document all custom settings
- Use environment variables for secrets
### Monitoring and Alerting
- Monitor all service health endpoints
- Set up alerts for service failures
- Track key performance metrics
- Review logs regularly
- Establish incident response procedures
## Related Documentation
- [Deployment Modes](deployment-modes.md) - Installation strategies
- [Monitoring](monitoring.md) - Observability and metrics
- [Platform Health](platform-health.md) - Health check procedures
- [Troubleshooting](troubleshooting.md) - Common issues and solutions