Platform Deployment Guide
Version: 1.0.0
Last Updated: 2026-01-05
Target Audience: DevOps Engineers, Platform Operators
Status: Production Ready
Practical guide for deploying the 9-service provisioning platform in any environment using mode-based configuration.
Table of Contents
- Prerequisites
- Deployment Modes
- Quick Start
- Solo Mode Deployment
- Multiuser Mode Deployment
- CICD Mode Deployment
- Enterprise Mode Deployment
- Service Management
- Health Checks & Monitoring
- Troubleshooting
Prerequisites
Required Software
- Rust: 1.70+ (for building services)
- Nickel: Latest (for config validation)
- Nushell: 0.109.1+ (for scripts)
- Cargo: Included with Rust
- Git: For cloning and pulling updates
Required Tools (Mode-Dependent)
| Tool | Solo | Multiuser | CICD | Enterprise |
|---|---|---|---|---|
| Docker/Podman | No | Optional | Yes | Yes |
| SurrealDB | No | Yes | No | Yes |
| Etcd | No | No | No | Yes |
| PostgreSQL | No | Optional | No | Optional |
| OpenAI/Anthropic API | No | Optional | Yes | Yes |
System Requirements
| Resource | Solo | Multiuser | CICD | Enterprise |
|---|---|---|---|---|
| CPU Cores | 2+ | 4+ | 8+ | 16+ |
| Memory | 2 GB | 4 GB | 8 GB | 16 GB |
| Disk | 10 GB | 50 GB | 100 GB | 500 GB |
| Network | Local | Local/Cloud | Cloud | HA Cloud |
Directory Structure
# Ensure base directories exist
mkdir -p provisioning/schemas/platform
mkdir -p provisioning/platform/logs
mkdir -p provisioning/platform/data
mkdir -p provisioning/.typedialog/platform
mkdir -p provisioning/config/runtime
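The same layout can be created in one loop with a final sanity check (a sketch; paths as above):

```shell
# Create the base directory layout in one pass
for d in schemas/platform platform/logs platform/data .typedialog/platform config/runtime; do
  mkdir -p "provisioning/$d"
done
# Sanity check: succeed only if the last directory actually exists
[ -d provisioning/config/runtime ] && echo "directory layout ready"
```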
Deployment Modes
Mode Selection Matrix
| Requirement | Recommended Mode |
|---|---|
| Development & testing | solo |
| Team environment (2-10 people) | multiuser |
| CI/CD pipelines & automation | cicd |
| Production with HA | enterprise |
Mode Characteristics
Solo Mode
Use Case: Development, testing, demonstration
Characteristics:
- All services run locally with minimal resources
- Filesystem-based storage (no external databases)
- No TLS/SSL required
- Embedded/in-memory backends
- Single machine only
Services Configuration:
- 2-4 workers per service
- 30-60 second timeouts
- No replication or clustering
- Debug-level logging enabled
Startup Time: ~2-5 minutes
Data Persistence: Local files only
Multiuser Mode
Use Case: Team environments, shared infrastructure
Characteristics:
- Shared database backends (SurrealDB)
- Multiple concurrent users
- CORS and multi-user features enabled
- Optional TLS support
- 2-4 machines (or containerized)
Services Configuration:
- 4-6 workers per service
- 60-120 second timeouts
- Basic replication available
- Info-level logging
Startup Time: ~3-8 minutes (database dependent)
Data Persistence: SurrealDB (shared)
CICD Mode
Use Case: CI/CD pipelines, ephemeral environments
Characteristics:
- Ephemeral storage (memory, temporary)
- High throughput
- RAG system disabled
- Minimal logging
- Stateless services
Services Configuration:
- 8-12 workers per service
- 10-30 second timeouts
- No persistence
- Warn-level logging
Startup Time: ~1-2 minutes
Data Persistence: None (ephemeral)
Enterprise Mode
Use Case: Production, high availability, compliance
Characteristics:
- Distributed, replicated backends
- High availability (HA) clustering
- TLS/SSL encryption
- Audit logging
- Full monitoring and observability
Services Configuration:
- 16-32 workers per service
- 120-300 second timeouts
- Active replication across 3+ nodes
- Info-level logging with audit trails
Startup Time: ~5-15 minutes (cluster initialization)
Data Persistence: Replicated across cluster
Quick Start
1. Clone Repository
git clone https://github.com/your-org/project-provisioning.git
cd project-provisioning
2. Select Deployment Mode
Choose your mode based on use case:
# For development
export DEPLOYMENT_MODE=solo
# For team environments
export DEPLOYMENT_MODE=multiuser
# For CI/CD
export DEPLOYMENT_MODE=cicd
# For production
export DEPLOYMENT_MODE=enterprise
3. Set Environment Variables
All services use mode-specific TOML configs automatically loaded via environment variables:
# Vault Service
export VAULT_MODE=$DEPLOYMENT_MODE
# Extension Registry
export REGISTRY_MODE=$DEPLOYMENT_MODE
# RAG System
export RAG_MODE=$DEPLOYMENT_MODE
# AI Service
export AI_SERVICE_MODE=$DEPLOYMENT_MODE
# Provisioning Daemon
export DAEMON_MODE=$DEPLOYMENT_MODE
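The five exports above can be wrapped in a small helper that also rejects typos in the mode name. This is a sketch; `set_platform_mode` is a name invented for this guide, not part of the platform:

```shell
# set_platform_mode MODE -- export every service mode variable at once (sketch)
set_platform_mode() {
  case "$1" in
    solo|multiuser|cicd|enterprise) ;;
    *) echo "unknown mode: $1" >&2; return 1 ;;
  esac
  export VAULT_MODE="$1" REGISTRY_MODE="$1" RAG_MODE="$1" \
         AI_SERVICE_MODE="$1" DAEMON_MODE="$1"
}

# Example: configure every service for solo mode
set_platform_mode solo
```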
4. Build All Services
# Build all platform crates
cargo build --release -p vault-service \
-p extension-registry \
-p provisioning-rag \
-p ai-service \
-p provisioning-daemon \
-p orchestrator \
-p control-center \
-p mcp-server \
-p installer
5. Start Services (Order Matters)
# Start in dependency order:
# 1. Core infrastructure (KMS, storage)
cargo run --release -p vault-service &
# 2. Configuration and extensions
cargo run --release -p extension-registry &
# 3. AI/RAG layer
cargo run --release -p provisioning-rag &
cargo run --release -p ai-service &
# 4. Orchestration layer
cargo run --release -p orchestrator &
cargo run --release -p control-center &
cargo run --release -p mcp-server &
# 5. Background operations
cargo run --release -p provisioning-daemon &
# 6. Installer (optional, for new deployments)
cargo run --release -p installer &
6. Verify Services
# Check all services are running
pgrep -fl "vault-service|extension-registry|provisioning-rag|ai-service"
# Test endpoints
curl http://localhost:8200/health # Vault
curl http://localhost:8081/health # Registry
curl http://localhost:8083/health # RAG
curl http://localhost:8082/health # AI Service
curl http://localhost:9090/health # Orchestrator
curl http://localhost:8080/health # Control Center
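The six checks above can be collapsed into one loop that reports per-service status. A sketch; the ports are the defaults used throughout this guide:

```shell
# check_all -- probe each service health endpoint and report OK/FAILED (sketch)
check_all() {
  local failed=0 entry name port
  for entry in vault:8200 registry:8081 rag:8083 ai:8082 orchestrator:9090 control-center:8080; do
    name=${entry%%:*}
    port=${entry##*:}
    if curl -sf --max-time 2 "http://localhost:$port/health" > /dev/null; then
      echo "$name: OK"
    else
      echo "$name: FAILED (port $port)"
      failed=1
    fi
  done
  return "$failed"
}
```

`check_all` exits nonzero if any service is down, so it can gate later steps in a script.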
Solo Mode Deployment
Perfect for: Development, testing, learning
Step 1: Verify Solo Configuration Files
# Check that solo schemas are available
ls -la provisioning/schemas/platform/defaults/deployment/solo-defaults.ncl
# Available schemas for each service:
# - provisioning/schemas/platform/schemas/vault-service.ncl
# - provisioning/schemas/platform/schemas/extension-registry.ncl
# - provisioning/schemas/platform/schemas/rag.ncl
# - provisioning/schemas/platform/schemas/ai-service.ncl
# - provisioning/schemas/platform/schemas/provisioning-daemon.ncl
Step 2: Set Solo Environment Variables
# Set all services to solo mode
export VAULT_MODE=solo
export REGISTRY_MODE=solo
export RAG_MODE=solo
export AI_SERVICE_MODE=solo
export DAEMON_MODE=solo
# Verify settings
echo $VAULT_MODE # Should output: solo
Step 3: Build Services
# Build in release mode for better performance
cargo build --release
Step 4: Create Local Data Directories
# Create storage directories for solo mode
mkdir -p /tmp/provisioning-solo/{vault,registry,rag,ai,daemon}
chmod 755 /tmp/provisioning-solo/{vault,registry,rag,ai,daemon}
Step 5: Start Services
# Start each service in a separate terminal or use tmux:
# Terminal 1: Vault
cargo run --release -p vault-service
# Terminal 2: Registry
cargo run --release -p extension-registry
# Terminal 3: RAG
cargo run --release -p provisioning-rag
# Terminal 4: AI Service
cargo run --release -p ai-service
# Terminal 5: Orchestrator
cargo run --release -p orchestrator
# Terminal 6: Control Center
cargo run --release -p control-center
# Terminal 7: Daemon
cargo run --release -p provisioning-daemon
Step 6: Test Services
# Wait 10-15 seconds for services to start, then test
# Check service health
curl -s http://localhost:8200/health | jq .
curl -s http://localhost:8081/health | jq .
curl -s http://localhost:8083/health | jq .
# Try a simple operation
curl -X GET http://localhost:9090/api/v1/health
Step 7: Verify Persistence (Optional)
# Check that data is stored locally
ls -la /tmp/provisioning-solo/vault/
ls -la /tmp/provisioning-solo/registry/
# Data should accumulate as you use the services
Cleanup
# Stop all services
pkill -f "cargo run --release"
# Remove temporary data (optional)
rm -rf /tmp/provisioning-solo
Multiuser Mode Deployment
Perfect for: Team environments, shared infrastructure
Prerequisites
- SurrealDB: Running and accessible at http://surrealdb:8000
- Network Access: All machines can reach SurrealDB
- DNS/Hostnames: Services accessible via hostnames (not just localhost)
Step 1: Deploy SurrealDB
# Using Docker (recommended)
docker run -d \
--name surrealdb \
-p 8000:8000 \
surrealdb/surrealdb:latest \
start --user root --pass root
# Or using native installation:
surreal start --user root --pass root
Step 2: Verify SurrealDB Connectivity
# Test SurrealDB connection
curl -si http://localhost:8000/health | head -n 1
# Expect: HTTP/1.1 200 OK (use the /version endpoint for the version string)
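Since services fail fast when the database is absent, a small polling helper is useful before starting them. A sketch; `wait_for_db` is a name invented here:

```shell
# wait_for_db URL [RETRIES] -- poll an endpoint until it responds (sketch)
wait_for_db() {
  local url=$1 retries=${2:-30} i=0
  until curl -sf --max-time 2 "$url" > /dev/null; do
    i=$((i + 1))
    if [ "$i" -ge "$retries" ]; then
      echo "gave up after $retries attempts" >&2
      return 1
    fi
    sleep 2
  done
  echo "database reachable"
}

# Example: block until SurrealDB answers before starting services
# wait_for_db http://localhost:8000/health
```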
Step 3: Set Multiuser Environment Variables
# Configure all services for multiuser mode
export VAULT_MODE=multiuser
export REGISTRY_MODE=multiuser
export RAG_MODE=multiuser
export AI_SERVICE_MODE=multiuser
export DAEMON_MODE=multiuser
# Set database connection
export SURREALDB_URL=http://surrealdb:8000
export SURREALDB_USER=root
export SURREALDB_PASS=root
# Set service hostnames (if not localhost)
export VAULT_SERVICE_HOST=vault.internal
export REGISTRY_HOST=registry.internal
export RAG_HOST=rag.internal
Step 4: Build Services
cargo build --release
Step 5: Create Shared Data Directories
# Create directories on shared storage (NFS, etc.)
mkdir -p /mnt/provisioning-data/{vault,registry,rag,ai}
chmod 755 /mnt/provisioning-data/{vault,registry,rag,ai}
# Or use local directories if on separate machines
mkdir -p /var/lib/provisioning/{vault,registry,rag,ai}
Step 6: Start Services on Multiple Machines
# Machine 1: Infrastructure services
ssh ops@machine1
export VAULT_MODE=multiuser
cargo run --release -p vault-service &
cargo run --release -p extension-registry &
# Machine 2: AI services
ssh ops@machine2
export RAG_MODE=multiuser
export AI_SERVICE_MODE=multiuser
cargo run --release -p provisioning-rag &
cargo run --release -p ai-service &
# Machine 3: Orchestration
ssh ops@machine3
cargo run --release -p orchestrator &
cargo run --release -p control-center &
# Machine 4: Background tasks
ssh ops@machine4
export DAEMON_MODE=multiuser
cargo run --release -p provisioning-daemon &
Step 7: Test Multi-Machine Setup
# From any machine, test cross-machine connectivity
curl -s http://machine1:8200/health
curl -s http://machine2:8083/health
curl -s http://machine3:9090/health
# Test integration
curl -X POST http://machine3:9090/api/v1/provision \
-H "Content-Type: application/json" \
-d '{"workspace": "test"}'
Step 8: Enable User Access
# Create shared credentials
export VAULT_TOKEN=s.xxxxxxxxxxx
# Configure TLS (optional but recommended)
# Update configs to use https:// URLs
export VAULT_MODE=multiuser
# Edit provisioning/schemas/platform/schemas/vault-service.ncl
# Add TLS configuration in the schema definition
# See: provisioning/schemas/platform/validators/ for constraints
Monitoring Multiuser Deployment
# Check each service reports a database connection
# (use the port of the service running on each host: 8200 vault, 8081 registry, etc.)
for host in machine1 machine2 machine3 machine4; do
  ssh ops@$host 'curl -s http://localhost:8200/health | jq .database_connected'
done
# Monitor SurrealDB
curl -s http://surrealdb:8000/version
CICD Mode Deployment
Perfect for: GitHub Actions, GitLab CI, Jenkins, cloud automation
Step 1: Understand Ephemeral Nature
CICD mode services:
- Don't persist data between runs
- Use in-memory storage
- Have RAG disabled
- Optimize for startup speed
- Suitable for containerized deployments
Step 2: Set CICD Environment Variables
# Use cicd mode for all services
export VAULT_MODE=cicd
export REGISTRY_MODE=cicd
export RAG_MODE=cicd
export AI_SERVICE_MODE=cicd
export DAEMON_MODE=cicd
# Disable TLS (not needed in CI)
export CI_ENVIRONMENT=true
Step 3: Containerize Services (Optional)
# Dockerfile for CICD deployments
FROM rust:1.75-slim
WORKDIR /app
COPY . .
# Build all services
RUN cargo build --release
# Set CICD mode
ENV VAULT_MODE=cicd
ENV REGISTRY_MODE=cicd
ENV RAG_MODE=cicd
ENV AI_SERVICE_MODE=cicd
# Expose ports
EXPOSE 8200 8081 8083 8082 9090 8080
# Run services
CMD ["sh", "-c", "\
cargo run --release -p vault-service & \
cargo run --release -p extension-registry & \
cargo run --release -p provisioning-rag & \
cargo run --release -p ai-service & \
cargo run --release -p orchestrator & \
wait"]
Step 4: GitHub Actions Example
name: CICD Platform Deployment
on:
push:
branches: [main, develop]
jobs:
test-deployment:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Install Rust
uses: actions-rs/toolchain@v1
with:
toolchain: 1.75
profile: minimal
- name: Set CICD Mode
run: |
echo "VAULT_MODE=cicd" >> $GITHUB_ENV
echo "REGISTRY_MODE=cicd" >> $GITHUB_ENV
echo "RAG_MODE=cicd" >> $GITHUB_ENV
echo "AI_SERVICE_MODE=cicd" >> $GITHUB_ENV
echo "DAEMON_MODE=cicd" >> $GITHUB_ENV
- name: Build Services
run: cargo build --release
- name: Run Integration Tests
run: |
# Start services in background
cargo run --release -p vault-service &
cargo run --release -p extension-registry &
cargo run --release -p orchestrator &
# Wait for startup
sleep 10
# Run tests
cargo test --release
- name: Health Checks
run: |
curl -f http://localhost:8200/health
curl -f http://localhost:8081/health
curl -f http://localhost:9090/health
deploy:
needs: test-deployment
runs-on: ubuntu-latest
if: github.ref == 'refs/heads/main'
steps:
- uses: actions/checkout@v3
- name: Deploy to Production
run: |
# Deploy production enterprise cluster
./scripts/deploy-enterprise.sh
Step 5: Run CICD Tests
# Simulate CI environment locally
export VAULT_MODE=cicd
export CI_ENVIRONMENT=true
# Build
cargo build --release
# Run short-lived services for testing
timeout 30 cargo run --release -p vault-service &
timeout 30 cargo run --release -p extension-registry &
timeout 30 cargo run --release -p orchestrator &
# Run tests while services are running
sleep 5
cargo test --release
# Services auto-cleanup after timeout
Enterprise Mode Deployment
Perfect for: Production, high availability, compliance
Prerequisites
- 3+ Machines: Minimum 3 for HA
- Etcd Cluster: For distributed consensus
- Load Balancer: HAProxy, nginx, or cloud LB
- TLS Certificates: Valid certificates for all services
- Monitoring: Prometheus, ELK, or cloud monitoring
- Backup System: Daily snapshots to S3 or similar
Step 1: Deploy Infrastructure
1.1 Deploy Etcd Cluster
# Node 1, 2, 3
etcd --name=node-1 \
--listen-client-urls=http://0.0.0.0:2379 \
--advertise-client-urls=http://node-1.internal:2379 \
--initial-cluster="node-1=http://node-1.internal:2380,node-2=http://node-2.internal:2380,node-3=http://node-3.internal:2380" \
--initial-cluster-state=new
# Verify cluster
etcdctl --endpoints=http://localhost:2379 member list
1.2 Deploy Load Balancer
# HAProxy configuration for vault-service (example)
frontend vault_frontend
bind *:8200
mode tcp
default_backend vault_backend
backend vault_backend
mode tcp
balance roundrobin
server vault-1 10.0.1.10:8200 check
server vault-2 10.0.1.11:8200 check
server vault-3 10.0.1.12:8200 check
1.3 Configure TLS
# Generate certificates (or use existing)
mkdir -p /etc/provisioning/tls
# For each service:
openssl req -x509 -newkey rsa:4096 \
-keyout /etc/provisioning/tls/vault-key.pem \
-out /etc/provisioning/tls/vault-cert.pem \
-days 365 -nodes \
-subj "/CN=vault.provisioning.prod"
# Set permissions
chmod 600 /etc/provisioning/tls/*-key.pem
chmod 644 /etc/provisioning/tls/*-cert.pem
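Before distributing certificates, it is worth confirming each key actually matches its certificate. A sketch; `check_tls_pair` is a helper name invented here:

```shell
# check_tls_pair CERT KEY -- confirm a private key matches its certificate (sketch)
check_tls_pair() {
  local cert=$1 key=$2 cert_md key_md
  # Print subject and validity window for a quick visual check
  openssl x509 -in "$cert" -noout -subject -dates || return 1
  # Compare public-key digests; a mismatch means the files are not a pair
  cert_md=$(openssl x509 -in "$cert" -noout -pubkey | openssl sha256)
  key_md=$(openssl pkey -in "$key" -pubout | openssl sha256)
  [ "$cert_md" = "$key_md" ]
}

# Example:
# check_tls_pair /etc/provisioning/tls/vault-cert.pem /etc/provisioning/tls/vault-key.pem
```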
Step 2: Set Enterprise Environment Variables
# All machines: Set enterprise mode
export VAULT_MODE=enterprise
export REGISTRY_MODE=enterprise
export RAG_MODE=enterprise
export AI_SERVICE_MODE=enterprise
export DAEMON_MODE=enterprise
# Database cluster
export SURREALDB_URL="ws://surrealdb-cluster.internal:8000"
export SURREALDB_REPLICAS=3
# Etcd cluster
export ETCD_ENDPOINTS="http://node-1.internal:2379,http://node-2.internal:2379,http://node-3.internal:2379"
# TLS configuration
export TLS_CERT_PATH=/etc/provisioning/tls
export TLS_VERIFY=true
export TLS_CA_CERT=/etc/provisioning/tls/ca.crt
# Monitoring
export PROMETHEUS_URL=http://prometheus.internal:9090
export METRICS_ENABLED=true
export AUDIT_LOG_ENABLED=true
Step 3: Deploy Services Across Cluster
# Ansible playbook (simplified)
---
- hosts: provisioning_cluster
tasks:
- name: Build services
shell: cargo build --release
- name: Start vault-service (machine 1-3)
shell: "cargo run --release -p vault-service"
when: "'vault' in group_names"
- name: Start orchestrator (machine 2-3)
shell: "cargo run --release -p orchestrator"
when: "'orchestrator' in group_names"
- name: Start daemon (machine 3)
shell: "cargo run --release -p provisioning-daemon"
when: "'daemon' in group_names"
- name: Verify cluster health
uri:
url: "https://{{ inventory_hostname }}:9090/health"
validate_certs: yes
Step 4: Monitor Cluster Health
# Check cluster status
curl -s https://vault.internal:8200/health | jq .state
# Check replication
curl -s https://orchestrator.internal:9090/api/v1/cluster/status
# Monitor etcd
etcdctl --endpoints=https://node-1.internal:2379 endpoint health
# Check which node is the leader
etcdctl --endpoints=https://node-1.internal:2379 endpoint status --cluster -w table
Step 5: Enable Monitoring & Alerting
# Prometheus configuration
global:
scrape_interval: 30s
evaluation_interval: 30s
scrape_configs:
- job_name: 'vault-service'
scheme: https
tls_config:
ca_file: /etc/provisioning/tls/ca.crt
static_configs:
- targets: ['vault-1.internal:8200', 'vault-2.internal:8200', 'vault-3.internal:8200']
- job_name: 'orchestrator'
scheme: https
static_configs:
- targets: ['orch-1.internal:9090', 'orch-2.internal:9090', 'orch-3.internal:9090']
Step 6: Backup & Recovery
#!/bin/bash
# Daily backup script
BACKUP_DIR="/mnt/provisioning-backups"
DATE=$(date +%Y%m%d_%H%M%S)
# Backup etcd
etcdctl --endpoints=https://node-1.internal:2379 \
snapshot save "$BACKUP_DIR/etcd-$DATE.db"
# Backup SurrealDB
curl -X POST https://surrealdb.internal:8000/backup \
-H "Authorization: Bearer $SURREALDB_TOKEN" \
> "$BACKUP_DIR/surreal-$DATE.sql"
# Upload to S3
aws s3 cp "$BACKUP_DIR/etcd-$DATE.db" \
s3://provisioning-backups/etcd/
# Cleanup old backups (keep 30 days)
find "$BACKUP_DIR" -mtime +30 -delete
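A backup that cannot be verified is a liability; one option is to record a checksum beside each snapshot and re-check it on later runs. A sketch to pair with the backup script above; the `.sha256` sidecar convention and the `verify_backup` name are ours:

```shell
# verify_backup FILE -- record a sha256 next to a snapshot, verify it on later runs (sketch)
verify_backup() {
  local f=$1
  if [ -f "$f.sha256" ]; then
    sha256sum -c "$f.sha256"          # fails loudly if the snapshot was corrupted
  else
    sha256sum "$f" > "$f.sha256"
    echo "checksum recorded for $f"
  fi
}
```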
Service Management
Starting Services
Individual Service Startup
# Start one service
export VAULT_MODE=enterprise
cargo run --release -p vault-service
# In another terminal
export REGISTRY_MODE=enterprise
cargo run --release -p extension-registry
Batch Startup
#!/bin/bash
# Start all services in dependency order
set -e
MODE=${1:-solo}
export VAULT_MODE=$MODE
export REGISTRY_MODE=$MODE
export RAG_MODE=$MODE
export AI_SERVICE_MODE=$MODE
export DAEMON_MODE=$MODE
echo "Starting provisioning platform in $MODE mode..."
# Core services first
echo "Starting infrastructure..."
cargo run --release -p vault-service &
VAULT_PID=$!
echo "Starting extension registry..."
cargo run --release -p extension-registry &
REGISTRY_PID=$!
# AI layer
echo "Starting AI services..."
cargo run --release -p provisioning-rag &
RAG_PID=$!
cargo run --release -p ai-service &
AI_PID=$!
# Orchestration
echo "Starting orchestration..."
cargo run --release -p orchestrator &
ORCH_PID=$!
echo "All services started. PIDs: $VAULT_PID $REGISTRY_PID $RAG_PID $AI_PID $ORCH_PID"
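The startup script above leaves orphaned children if it is interrupted. One hardening option is to track PIDs and install a trap; a sketch, with `start_bg` and `cleanup` as names invented here:

```shell
# Track background children and stop them all on exit or Ctrl-C (sketch)
PIDS=()
start_bg() { "$@" & PIDS+=($!); }
cleanup() { kill "${PIDS[@]}" 2>/dev/null; }
trap cleanup EXIT INT TERM

# Example usage inside the script:
# start_bg cargo run --release -p vault-service
```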
Stopping Services
# Stop all services gracefully
pkill -SIGTERM -f "cargo run --release -p"
# Wait for graceful shutdown
sleep 5
# Force kill if needed
pkill -9 -f "cargo run --release -p"
# Verify all stopped
pgrep -f "cargo run --release -p" && echo "Services still running" || echo "All stopped"
Restarting Services
# Restart single service
pkill -SIGTERM vault-service
sleep 2
cargo run --release -p vault-service &
# Restart all services
./scripts/restart-all.sh $MODE
# Restart with config reload
export VAULT_MODE=multiuser
pkill -SIGTERM vault-service
sleep 2
cargo run --release -p vault-service &
Checking Service Status
# Check running processes
pgrep -a "cargo run --release"
# Check listening ports
netstat -tlnp | grep -E "8200|8081|8083|8082|9090|8080"
# Or using ss (modern alternative)
ss -tlnp | grep -E "8200|8081|8083|8082|9090|8080"
# Health endpoint checks (bash: map each service to its port first)
declare -A port=([vault]=8200 [registry]=8081 [rag]=8083 [ai]=8082 [orchestrator]=9090)
for service in vault registry rag ai orchestrator; do
  echo "=== $service ==="
  curl -s "http://localhost:${port[$service]}/health" | jq .
done
Health Checks & Monitoring
Manual Health Verification
# Vault Service
curl -s http://localhost:8200/health | jq .
# Expected: {"status":"ok","uptime":123.45}
# Extension Registry
curl -s http://localhost:8081/health | jq .
# RAG System
curl -s http://localhost:8083/health | jq .
# Expected: {"status":"ok","embeddings":"ready","vector_db":"connected"}
# AI Service
curl -s http://localhost:8082/health | jq .
# Orchestrator
curl -s http://localhost:9090/health | jq .
# Control Center
curl -s http://localhost:8080/health | jq .
Service Integration Tests
# Test vault <-> registry integration
curl -X POST http://localhost:8200/api/encrypt \
-H "Content-Type: application/json" \
-d '{"plaintext":"secret"}' | jq .
# Test RAG system
curl -X POST http://localhost:8083/api/ingest \
-H "Content-Type: application/json" \
-d '{"document":"test.md","content":"# Test"}' | jq .
# Test orchestrator
curl -X GET http://localhost:9090/api/v1/status | jq .
# End-to-end workflow
curl -X POST http://localhost:9090/api/v1/provision \
-H "Content-Type: application/json" \
-d '{
"workspace": "test",
"services": ["vault", "registry"],
"mode": "solo"
}' | jq .
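When scripting these checks, the `{"status":"ok"}` shape shown above can be asserted directly. A sketch; `assert_status` is a name invented here, and `jq` is assumed installed:

```shell
# assert_status URL -- fail unless the health JSON reports "ok" (sketch)
assert_status() {
  [ "$(curl -sf --max-time 2 "$1" | jq -r .status)" = "ok" ]
}

# Example: gate a deployment step on orchestrator health
# assert_status http://localhost:9090/health && echo "orchestrator healthy"
```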
Monitoring Dashboards
Prometheus Metrics
# Query service uptime
curl -s 'http://prometheus:9090/api/v1/query?query=up' | jq .
# Query request rate
curl -s 'http://prometheus:9090/api/v1/query?query=rate(http_requests_total[5m])' | jq .
# Query error rate
curl -s 'http://prometheus:9090/api/v1/query?query=rate(http_errors_total[5m])' | jq .
Log Aggregation
# Follow vault logs
tail -f /var/log/provisioning/vault-service.log
# Follow all service logs
tail -f /var/log/provisioning/*.log
# Search for errors
grep -r "ERROR" /var/log/provisioning/
# Follow with filtering
tail -f /var/log/provisioning/orchestrator.log | grep -E "ERROR|WARN"
Alerting
# AlertManager configuration
groups:
- name: provisioning
rules:
- alert: ServiceDown
expr: up{job=~"vault|registry|rag|orchestrator"} == 0
for: 5m
annotations:
summary: "{{ $labels.job }} is down"
- alert: HighErrorRate
expr: rate(http_errors_total[5m]) > 0.05
annotations:
summary: "High error rate detected"
- alert: DiskSpaceWarning
expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.2
annotations:
summary: "Disk space below 20%"
Troubleshooting
Service Won't Start
Problem: error: failed to bind to port 8200
Solutions:
# Check if port is in use
lsof -i :8200
ss -tlnp | grep 8200
# Kill existing process
pkill -9 -f vault-service
# Or use different port
export VAULT_SERVER_PORT=8201
cargo run --release -p vault-service
Configuration Loading Fails
Problem: error: failed to load config from mode file
Solutions:
# Verify schemas exist
ls -la provisioning/schemas/platform/schemas/vault-service.ncl
# Validate schema syntax
nickel typecheck provisioning/schemas/platform/schemas/vault-service.ncl
# Check defaults are present
nickel typecheck provisioning/schemas/platform/defaults/vault-service-defaults.ncl
# Verify deployment mode overlay exists
ls -la provisioning/schemas/platform/defaults/deployment/$VAULT_MODE-defaults.ncl
# Run service with explicit mode
export VAULT_MODE=solo
cargo run --release -p vault-service
Database Connection Issues
Problem: error: failed to connect to database
Solutions:
# Verify database is running
curl http://surrealdb:8000/health
etcdctl --endpoints=http://etcd:2379 endpoint health
# Check connectivity
nc -zv surrealdb 8000
nc -zv etcd 2379
# Update connection string
export SURREALDB_URL=ws://surrealdb:8000
export ETCD_ENDPOINTS=http://etcd:2379
# Restart service with new config
pkill -9 vault-service
cargo run --release -p vault-service
Service Crashes on Startup
Problem: Service exits with code 1 or 139
Solutions:
# Run with verbose logging
RUST_LOG=debug cargo run -p vault-service 2>&1 | head -50
# Check system resources
free -h
df -h
# Check for core dumps
coredumpctl list
# Run under debugger (if crash suspected)
rust-gdb --args target/release/vault-service
High Memory Usage
Problem: Service consuming more memory than expected
Solutions:
# Check memory usage
ps aux | grep vault-service | grep -v grep
# Monitor over time
watch -n 1 'ps aux | grep vault-service | grep -v grep'
# Reduce worker count
export VAULT_SERVER_WORKERS=2
cargo run --release -p vault-service
# Check for memory leaks
valgrind --leak-check=full target/release/vault-service
Network/DNS Issues
Problem: error: failed to resolve hostname
Solutions:
# Test DNS resolution
nslookup vault.internal
dig vault.internal
# Test connectivity to service
curl -v http://vault.internal:8200/health
# Add to /etc/hosts if needed (writing /etc/hosts requires root)
echo "10.0.1.10 vault.internal" | sudo tee -a /etc/hosts
# Check network interface
ip addr show
netstat -nr
Data Persistence Issues
Problem: Data lost after restart
Solutions:
# Verify backup exists
ls -la /mnt/provisioning-backups/
ls -la /var/lib/provisioning/
# Check disk space
df -h /var/lib/provisioning
# Verify file permissions
ls -l /var/lib/provisioning/vault/
chmod 755 /var/lib/provisioning/vault/*
# Restore from backup
./scripts/restore-backup.sh /mnt/provisioning-backups/vault-20260105.sql
Debugging Checklist
When troubleshooting, use this systematic approach:
# 1. Check service is running
pgrep -f vault-service || echo "Service not running"
# 2. Check port is listening
ss -tlnp | grep 8200 || echo "Port not listening"
# 3. Check logs for errors
tail -20 /var/log/provisioning/vault-service.log | grep -i error
# 4. Test HTTP endpoint
curl -i http://localhost:8200/health
# 5. Check dependencies
curl http://surrealdb:8000/health
etcdctl --endpoints=http://etcd:2379 endpoint health
# 6. Check schema definition
nickel typecheck provisioning/schemas/platform/schemas/vault-service.ncl
# 7. Verify environment variables
env | grep -E "VAULT_|SURREALDB_|ETCD_"
# 8. Check system resources
free -h && df -h && top -bn1 | head -10
Configuration Updates
Updating Service Configuration
# 1. Edit the schema definition
vim provisioning/schemas/platform/schemas/vault-service.ncl
# 2. Update defaults if needed
vim provisioning/schemas/platform/defaults/vault-service-defaults.ncl
# 3. Validate syntax
nickel typecheck provisioning/schemas/platform/schemas/vault-service.ncl
# 4. Re-export configuration from schemas
./provisioning/.typedialog/platform/scripts/generate-configs.nu vault-service multiuser
# 5. Restart affected service (no downtime for clients)
pkill -SIGTERM vault-service
sleep 2
cargo run --release -p vault-service &
# 6. Verify configuration loaded
curl http://localhost:8200/api/config | jq .
Mode Migration
# Migrate from solo to multiuser:
# 1. Stop services
pkill -SIGTERM -f "cargo run"
sleep 5
# 2. Backup current data
tar -czf /backup/provisioning-solo-$(date +%s).tar.gz /var/lib/provisioning/
# 3. Set new mode
export VAULT_MODE=multiuser
export REGISTRY_MODE=multiuser
export RAG_MODE=multiuser
# 4. Start services with new config
cargo run --release -p vault-service &
cargo run --release -p extension-registry &
# 5. Verify new mode
curl http://localhost:8200/api/config | jq .deployment_mode
Production Checklist
Before deploying to production:
- All services compiled in release mode (--release)
- TLS certificates installed and valid
- Database cluster deployed and healthy
- Load balancer configured and routing traffic
- Monitoring and alerting configured
- Backup system tested and working
- High availability verified (failover tested)
- Security hardening applied (firewall rules, etc.)
- Documentation updated for your environment
- Team trained on deployment procedures
- Runbooks created for common operations
- Disaster recovery plan tested
Getting Help
Community Resources
- GitHub Issues: Report bugs at github.com/your-org/provisioning/issues
- Documentation: Full docs at provisioning/docs/
- Slack Channel: #provisioning-platform
Internal Support
- Platform Team: platform@your-org.com
- On-Call: Check PagerDuty for active rotation
- Escalation: Contact infrastructure leadership
Useful Commands Reference
# View a service's CLI options (name the crate; the workspace has multiple binaries)
cargo run --release -p vault-service -- --help
# View service schemas
ls -la provisioning/schemas/platform/schemas/
ls -la provisioning/schemas/platform/defaults/
# List running services
ps aux | grep cargo
# Monitor service logs in real-time
journalctl -fu provisioning-vault
# Generate diagnostics bundle
./scripts/generate-diagnostics.sh > /tmp/diagnostics-$(date +%s).tar.gz