
Platform Deployment Guide

Version: 1.0.0
Last Updated: 2026-01-05
Target Audience: DevOps Engineers, Platform Operators
Status: Production Ready

Practical guide for deploying the 9-service provisioning platform in any environment using mode-based configuration.

Table of Contents

  1. Prerequisites
  2. Deployment Modes
  3. Quick Start
  4. Solo Mode Deployment
  5. Multiuser Mode Deployment
  6. CICD Mode Deployment
  7. Enterprise Mode Deployment
  8. Service Management
  9. Health Checks & Monitoring
  10. Troubleshooting

Prerequisites

Required Software

  • Rust: 1.70+ (for building services)
  • Nickel: Latest (for config validation)
  • Nushell: 0.109.1+ (for scripts)
  • Cargo: Included with Rust
  • Git: For cloning and pulling updates
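A quick way to confirm the core toolchain is present before building (a minimal sketch; extend the list with `nickel` and `nu` if your workflow relies on them):

```shell
#!/bin/bash
# Minimal prerequisite check: reports each required tool as present or missing.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
  fi
}

for tool in rustc cargo git; do
  check_tool "$tool"
done
```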

Required Tools (Mode-Dependent)

| Tool | Solo | Multiuser | CICD | Enterprise |
|------|------|-----------|------|------------|
| Docker/Podman | No | Optional | Yes | Yes |
| SurrealDB | No | Yes | No | Yes |
| Etcd | No | No | No | Yes |
| PostgreSQL | No | Optional | No | Optional |
| OpenAI/Anthropic API | No | Optional | Yes | Yes |

System Requirements

| Resource | Solo | Multiuser | CICD | Enterprise |
|----------|------|-----------|------|------------|
| CPU Cores | 2+ | 4+ | 8+ | 16+ |
| Memory | 2 GB | 4 GB | 8 GB | 16 GB |
| Disk | 10 GB | 50 GB | 100 GB | 500 GB |
| Network | Local | Local/Cloud | Cloud | HA Cloud |

Directory Structure

# Ensure base directories exist
mkdir -p provisioning/schemas/platform
mkdir -p provisioning/platform/logs
mkdir -p provisioning/platform/data
mkdir -p provisioning/.typedialog/platform
mkdir -p provisioning/config/runtime

Deployment Modes

Mode Selection Matrix

| Requirement | Recommended Mode |
|-------------|------------------|
| Development & testing | solo |
| Team environment (2-10 people) | multiuser |
| CI/CD pipelines & automation | cicd |
| Production with HA | enterprise |
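A small guard like this (a hypothetical helper, not part of the repo's scripts) catches typos in the mode name before services start with a half-configured environment:

```shell
#!/bin/bash
# Validate a deployment mode string against the four supported modes.
validate_mode() {
  case "$1" in
    solo|multiuser|cicd|enterprise) return 0 ;;
    *) echo "unknown deployment mode: $1" >&2; return 1 ;;
  esac
}

validate_mode "${DEPLOYMENT_MODE:-solo}" || exit 1
```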

Mode Characteristics

Solo Mode

Use Case: Development, testing, demonstration

Characteristics:

  • All services run locally with minimal resources
  • Filesystem-based storage (no external databases)
  • No TLS/SSL required
  • Embedded/in-memory backends
  • Single machine only

Services Configuration:

  • 2-4 workers per service
  • 30-60 second timeouts
  • No replication or clustering
  • Debug-level logging enabled

Startup Time: ~2-5 minutes
Data Persistence: Local files only


Multiuser Mode

Use Case: Team environments, shared infrastructure

Characteristics:

  • Shared database backends (SurrealDB)
  • Multiple concurrent users
  • CORS and multi-user features enabled
  • Optional TLS support
  • 2-4 machines (or containerized)

Services Configuration:

  • 4-6 workers per service
  • 60-120 second timeouts
  • Basic replication available
  • Info-level logging

Startup Time: ~3-8 minutes (database dependent)
Data Persistence: SurrealDB (shared)


CICD Mode

Use Case: CI/CD pipelines, ephemeral environments

Characteristics:

  • Ephemeral storage (memory, temporary)
  • High throughput
  • RAG system disabled
  • Minimal logging
  • Stateless services

Services Configuration:

  • 8-12 workers per service
  • 10-30 second timeouts
  • No persistence
  • Warn-level logging

Startup Time: ~1-2 minutes
Data Persistence: None (ephemeral)


Enterprise Mode

Use Case: Production, high availability, compliance

Characteristics:

  • Distributed, replicated backends
  • High availability (HA) clustering
  • TLS/SSL encryption
  • Audit logging
  • Full monitoring and observability

Services Configuration:

  • 16-32 workers per service
  • 120-300 second timeouts
  • Active replication across 3+ nodes
  • Info-level logging with audit trails

Startup Time: ~5-15 minutes (cluster initialization)
Data Persistence: Replicated across cluster


Quick Start

1. Clone Repository

git clone https://github.com/your-org/project-provisioning.git
cd project-provisioning

2. Select Deployment Mode

Choose your mode based on use case:

# For development
export DEPLOYMENT_MODE=solo

# For team environments
export DEPLOYMENT_MODE=multiuser

# For CI/CD
export DEPLOYMENT_MODE=cicd

# For production
export DEPLOYMENT_MODE=enterprise

3. Set Environment Variables

All services use mode-specific TOML configs automatically loaded via environment variables:

# Vault Service
export VAULT_MODE=$DEPLOYMENT_MODE

# Extension Registry
export REGISTRY_MODE=$DEPLOYMENT_MODE

# RAG System
export RAG_MODE=$DEPLOYMENT_MODE

# AI Service
export AI_SERVICE_MODE=$DEPLOYMENT_MODE

# Provisioning Daemon
export DAEMON_MODE=$DEPLOYMENT_MODE

4. Build All Services

# Build all platform crates
cargo build --release -p vault-service \
                      -p extension-registry \
                      -p provisioning-rag \
                      -p ai-service \
                      -p provisioning-daemon \
                      -p orchestrator \
                      -p control-center \
                      -p mcp-server \
                      -p installer

5. Start Services (Order Matters)

# Start in dependency order:

# 1. Core infrastructure (KMS, storage)
cargo run --release -p vault-service &

# 2. Configuration and extensions
cargo run --release -p extension-registry &

# 3. AI/RAG layer
cargo run --release -p provisioning-rag &
cargo run --release -p ai-service &

# 4. Orchestration layer
cargo run --release -p orchestrator &
cargo run --release -p control-center &
cargo run --release -p mcp-server &

# 5. Background operations
cargo run --release -p provisioning-daemon &

# 6. Installer (optional, for new deployments)
cargo run --release -p installer &

6. Verify Services

# Check all services are running
pgrep -l "vault-service|extension-registry|provisioning-rag|ai-service"

# Test endpoints
curl http://localhost:8200/health   # Vault
curl http://localhost:8081/health   # Registry
curl http://localhost:8083/health   # RAG
curl http://localhost:8082/health   # AI Service
curl http://localhost:9090/health   # Orchestrator
curl http://localhost:8080/health   # Control Center

Solo Mode Deployment

Perfect for: Development, testing, learning

Step 1: Verify Solo Configuration Files

# Check that solo schemas are available
ls -la provisioning/schemas/platform/defaults/deployment/solo-defaults.ncl

# Available schemas for each service:
# - provisioning/schemas/platform/schemas/vault-service.ncl
# - provisioning/schemas/platform/schemas/extension-registry.ncl
# - provisioning/schemas/platform/schemas/rag.ncl
# - provisioning/schemas/platform/schemas/ai-service.ncl
# - provisioning/schemas/platform/schemas/provisioning-daemon.ncl

Step 2: Set Solo Environment Variables

# Set all services to solo mode
export VAULT_MODE=solo
export REGISTRY_MODE=solo
export RAG_MODE=solo
export AI_SERVICE_MODE=solo
export DAEMON_MODE=solo

# Verify settings
echo $VAULT_MODE  # Should output: solo

Step 3: Build Services

# Build in release mode for better performance
cargo build --release

Step 4: Create Local Data Directories

# Create storage directories for solo mode
mkdir -p /tmp/provisioning-solo/{vault,registry,rag,ai,daemon}
chmod 755 /tmp/provisioning-solo/{vault,registry,rag,ai,daemon}

Step 5: Start Services

# Start each service in a separate terminal or use tmux:

# Terminal 1: Vault
cargo run --release -p vault-service

# Terminal 2: Registry
cargo run --release -p extension-registry

# Terminal 3: RAG
cargo run --release -p provisioning-rag

# Terminal 4: AI Service
cargo run --release -p ai-service

# Terminal 5: Orchestrator
cargo run --release -p orchestrator

# Terminal 6: Control Center
cargo run --release -p control-center

# Terminal 7: Daemon
cargo run --release -p provisioning-daemon

Step 6: Test Services

# Wait 10-15 seconds for services to start, then test

# Check service health
curl -s http://localhost:8200/health | jq .
curl -s http://localhost:8081/health | jq .
curl -s http://localhost:8083/health | jq .

# Try a simple operation
curl -X GET http://localhost:9090/api/v1/health

Step 7: Verify Persistence (Optional)

# Check that data is stored locally
ls -la /tmp/provisioning-solo/vault/
ls -la /tmp/provisioning-solo/registry/

# Data should accumulate as you use the services

Cleanup

# Stop all services
pkill -f "cargo run --release"

# Remove temporary data (optional)
rm -rf /tmp/provisioning-solo

Multiuser Mode Deployment

Perfect for: Team environments, shared infrastructure

Prerequisites

  • SurrealDB: Running and accessible at http://surrealdb:8000
  • Network Access: All machines can reach SurrealDB
  • DNS/Hostnames: Services accessible via hostnames (not just localhost)

Step 1: Deploy SurrealDB

# Using Docker (recommended)
docker run -d \
  --name surrealdb \
  -p 8000:8000 \
  surrealdb/surrealdb:latest \
  start --user root --pass root

# Or using native installation:
surreal start --user root --pass root

Step 2: Verify SurrealDB Connectivity

# Test SurrealDB connection
curl -s http://localhost:8000/health

# An HTTP 200 response means the database is reachable; query /version for the version string

Step 3: Set Multiuser Environment Variables

# Configure all services for multiuser mode
export VAULT_MODE=multiuser
export REGISTRY_MODE=multiuser
export RAG_MODE=multiuser
export AI_SERVICE_MODE=multiuser
export DAEMON_MODE=multiuser

# Set database connection
export SURREALDB_URL=http://surrealdb:8000
export SURREALDB_USER=root
export SURREALDB_PASS=root

# Set service hostnames (if not localhost)
export VAULT_SERVICE_HOST=vault.internal
export REGISTRY_HOST=registry.internal
export RAG_HOST=rag.internal

Step 4: Build Services

cargo build --release

Step 5: Create Shared Data Directories

# Create directories on shared storage (NFS, etc.)
mkdir -p /mnt/provisioning-data/{vault,registry,rag,ai}
chmod 755 /mnt/provisioning-data/{vault,registry,rag,ai}

# Or use local directories if on separate machines
mkdir -p /var/lib/provisioning/{vault,registry,rag,ai}

Step 6: Start Services on Multiple Machines

# Machine 1: Infrastructure services
ssh ops@machine1
export VAULT_MODE=multiuser
cargo run --release -p vault-service &
cargo run --release -p extension-registry &

# Machine 2: AI services
ssh ops@machine2
export RAG_MODE=multiuser
export AI_SERVICE_MODE=multiuser
cargo run --release -p provisioning-rag &
cargo run --release -p ai-service &

# Machine 3: Orchestration
ssh ops@machine3
cargo run --release -p orchestrator &
cargo run --release -p control-center &

# Machine 4: Background tasks
ssh ops@machine4
export DAEMON_MODE=multiuser
cargo run --release -p provisioning-daemon &

Step 7: Test Multi-Machine Setup

# From any machine, test cross-machine connectivity
curl -s http://machine1:8200/health
curl -s http://machine2:8083/health
curl -s http://machine3:9090/health

# Test integration
curl -X POST http://machine3:9090/api/v1/provision \
  -H "Content-Type: application/json" \
  -d '{"workspace": "test"}'

Step 8: Enable User Access

# Create shared credentials
export VAULT_TOKEN=s.xxxxxxxxxxx

# Configure TLS (optional but recommended)
# Update configs to use https:// URLs
export VAULT_MODE=multiuser
# Edit provisioning/schemas/platform/schemas/vault-service.ncl
# Add TLS configuration in the schema definition
# See: provisioning/schemas/platform/validators/ for constraints

Monitoring Multiuser Deployment

# Check all services are connected to SurrealDB
# (use the health port of the service deployed on each host; 8200 shown for vault)
for host in machine1 machine2 machine3 machine4; do
  ssh ops@$host "curl -s http://localhost:8200/api/v1/health | jq .database_connected"
done

# Monitor SurrealDB
curl -s http://surrealdb:8000/version

CICD Mode Deployment

Perfect for: GitHub Actions, GitLab CI, Jenkins, cloud automation

Step 1: Understand Ephemeral Nature

CICD mode services:

  • Don't persist data between runs
  • Use in-memory storage
  • Have RAG disabled
  • Optimize for startup speed
  • Suitable for containerized deployments

Step 2: Set CICD Environment Variables

# Use cicd mode for all services
export VAULT_MODE=cicd
export REGISTRY_MODE=cicd
export RAG_MODE=cicd
export AI_SERVICE_MODE=cicd
export DAEMON_MODE=cicd

# Mark the environment as CI (cicd-mode defaults already skip TLS)
export CI_ENVIRONMENT=true

Step 3: Containerize Services (Optional)

# Dockerfile for CICD deployments
FROM rust:1.75-slim

WORKDIR /app
COPY . .

# Build all services
RUN cargo build --release

# Set CICD mode
ENV VAULT_MODE=cicd
ENV REGISTRY_MODE=cicd
ENV RAG_MODE=cicd
ENV AI_SERVICE_MODE=cicd

# Expose ports
EXPOSE 8200 8081 8083 8082 9090 8080

# Run services
CMD ["sh", "-c", "\
  cargo run --release -p vault-service & \
  cargo run --release -p extension-registry & \
  cargo run --release -p provisioning-rag & \
  cargo run --release -p ai-service & \
  cargo run --release -p orchestrator & \
  wait"]

Step 4: GitHub Actions Example

name: CICD Platform Deployment

on:
  push:
    branches: [main, develop]

jobs:
  test-deployment:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Install Rust
        uses: actions-rs/toolchain@v1
        with:
          toolchain: 1.75
          profile: minimal

      - name: Set CICD Mode
        run: |
          echo "VAULT_MODE=cicd" >> $GITHUB_ENV
          echo "REGISTRY_MODE=cicd" >> $GITHUB_ENV
          echo "RAG_MODE=cicd" >> $GITHUB_ENV
          echo "AI_SERVICE_MODE=cicd" >> $GITHUB_ENV
          echo "DAEMON_MODE=cicd" >> $GITHUB_ENV

      - name: Build Services
        run: cargo build --release

      - name: Run Integration Tests
        run: |
          # Start services in background
          cargo run --release -p vault-service &
          cargo run --release -p extension-registry &
          cargo run --release -p orchestrator &

          # Wait for startup
          sleep 10

          # Run tests
          cargo test --release

      - name: Health Checks
        run: |
          curl -f http://localhost:8200/health
          curl -f http://localhost:8081/health
          curl -f http://localhost:9090/health

  deploy:
    needs: test-deployment
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v3
      - name: Deploy to Production
        run: |
          # Deploy production enterprise cluster
          ./scripts/deploy-enterprise.sh

Step 5: Run CICD Tests

# Simulate CI environment locally
export VAULT_MODE=cicd
export CI_ENVIRONMENT=true

# Build
cargo build --release

# Run short-lived services for testing
timeout 30 cargo run --release -p vault-service &
timeout 30 cargo run --release -p extension-registry &
timeout 30 cargo run --release -p orchestrator &

# Run tests while services are running
sleep 5
cargo test --release

# Services auto-cleanup after timeout

Enterprise Mode Deployment

Perfect for: Production, high availability, compliance

Prerequisites

  • 3+ Machines: Minimum 3 for HA
  • Etcd Cluster: For distributed consensus
  • Load Balancer: HAProxy, nginx, or cloud LB
  • TLS Certificates: Valid certificates for all services
  • Monitoring: Prometheus, ELK, or cloud monitoring
  • Backup System: Daily snapshots to S3 or similar

Step 1: Deploy Infrastructure

1.1 Deploy Etcd Cluster

# Node 1, 2, 3
etcd --name=node-1 \
     --listen-client-urls=http://0.0.0.0:2379 \
     --advertise-client-urls=http://node-1.internal:2379 \
     --initial-cluster="node-1=http://node-1.internal:2380,node-2=http://node-2.internal:2380,node-3=http://node-3.internal:2380" \
     --initial-cluster-state=new

# Verify cluster
etcdctl --endpoints=http://localhost:2379 member list

1.2 Deploy Load Balancer

# HAProxy configuration for vault-service (example)
frontend vault_frontend
    bind *:8200
    mode tcp
    default_backend vault_backend

backend vault_backend
    mode tcp
    balance roundrobin
    server vault-1 10.0.1.10:8200 check
    server vault-2 10.0.1.11:8200 check
    server vault-3 10.0.1.12:8200 check

1.3 Configure TLS

# Generate certificates (or use existing)
mkdir -p /etc/provisioning/tls

# For each service:
openssl req -x509 -newkey rsa:4096 \
  -keyout /etc/provisioning/tls/vault-key.pem \
  -out /etc/provisioning/tls/vault-cert.pem \
  -days 365 -nodes \
  -subj "/CN=vault.provisioning.prod"

# Set permissions
chmod 600 /etc/provisioning/tls/*-key.pem
chmod 644 /etc/provisioning/tls/*-cert.pem
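Before starting services it is worth confirming no certificate is close to expiry; `openssl x509 -checkend` makes this a one-liner (a sketch; 2592000 seconds is 30 days):

```shell
#!/bin/bash
# Warn for any service certificate that expires within 30 days.
check_cert() {
  if openssl x509 -in "$1" -noout -checkend 2592000 >/dev/null; then
    echo "ok: $1"
  else
    echo "expiring soon: $1"
  fi
}

for cert in /etc/provisioning/tls/*-cert.pem; do
  if [ -e "$cert" ]; then
    check_cert "$cert"
  fi
done
```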

Step 2: Set Enterprise Environment Variables

# All machines: Set enterprise mode
export VAULT_MODE=enterprise
export REGISTRY_MODE=enterprise
export RAG_MODE=enterprise
export AI_SERVICE_MODE=enterprise
export DAEMON_MODE=enterprise

# Database cluster
export SURREALDB_URL="ws://surrealdb-cluster.internal:8000"
export SURREALDB_REPLICAS=3

# Etcd cluster
export ETCD_ENDPOINTS="http://node-1.internal:2379,http://node-2.internal:2379,http://node-3.internal:2379"

# TLS configuration
export TLS_CERT_PATH=/etc/provisioning/tls
export TLS_VERIFY=true
export TLS_CA_CERT=/etc/provisioning/tls/ca.crt

# Monitoring
export PROMETHEUS_URL=http://prometheus.internal:9090
export METRICS_ENABLED=true
export AUDIT_LOG_ENABLED=true

Step 3: Deploy Services Across Cluster

# Ansible playbook (simplified)
---
- hosts: provisioning_cluster
  tasks:
    - name: Build services
      shell: cargo build --release

    - name: Start vault-service (machine 1-3)
      shell: "cargo run --release -p vault-service"
      when: "'vault' in group_names"

    - name: Start orchestrator (machine 2-3)
      shell: "cargo run --release -p orchestrator"
      when: "'orchestrator' in group_names"

    - name: Start daemon (machine 3)
      shell: "cargo run --release -p provisioning-daemon"
      when: "'daemon' in group_names"

    - name: Verify cluster health
      uri:
        url: "https://{{ inventory_hostname }}:9090/health"
        validate_certs: yes

Step 4: Monitor Cluster Health

# Check cluster status
curl -s https://vault.internal:8200/health | jq .state

# Check replication
curl -s https://orchestrator.internal:9090/api/v1/cluster/status

# Monitor etcd
etcdctl --endpoints=https://node-1.internal:2379 endpoint health

# Check leader status per endpoint
etcdctl --endpoints=https://node-1.internal:2379 endpoint status -w table

Step 5: Enable Monitoring & Alerting

# Prometheus configuration
global:
  scrape_interval: 30s
  evaluation_interval: 30s

scrape_configs:
  - job_name: 'vault-service'
    scheme: https
    tls_config:
      ca_file: /etc/provisioning/tls/ca.crt
    static_configs:
      - targets: ['vault-1.internal:8200', 'vault-2.internal:8200', 'vault-3.internal:8200']

  - job_name: 'orchestrator'
    scheme: https
    static_configs:
      - targets: ['orch-1.internal:9090', 'orch-2.internal:9090', 'orch-3.internal:9090']

Step 6: Backup & Recovery

#!/bin/bash
# Daily backup script
BACKUP_DIR="/mnt/provisioning-backups"
DATE=$(date +%Y%m%d_%H%M%S)

# Backup etcd
etcdctl --endpoints=https://node-1.internal:2379 \
  snapshot save "$BACKUP_DIR/etcd-$DATE.db"

# Backup SurrealDB
curl -X POST https://surrealdb.internal:8000/backup \
  -H "Authorization: Bearer $SURREALDB_TOKEN" \
  > "$BACKUP_DIR/surreal-$DATE.sql"

# Upload to S3
aws s3 cp "$BACKUP_DIR/etcd-$DATE.db" \
  s3://provisioning-backups/etcd/

# Cleanup old backups (keep 30 days)
find "$BACKUP_DIR" -mtime +30 -delete

Service Management

Starting Services

Individual Service Startup

# Start one service
export VAULT_MODE=enterprise
cargo run --release -p vault-service

# In another terminal
export REGISTRY_MODE=enterprise
cargo run --release -p extension-registry

Batch Startup

#!/bin/bash
# Start all services (dependency order)
set -e

MODE=${1:-solo}
export VAULT_MODE=$MODE
export REGISTRY_MODE=$MODE
export RAG_MODE=$MODE
export AI_SERVICE_MODE=$MODE
export DAEMON_MODE=$MODE

echo "Starting provisioning platform in $MODE mode..."

# Core services first
echo "Starting infrastructure..."
cargo run --release -p vault-service &
VAULT_PID=$!

echo "Starting extension registry..."
cargo run --release -p extension-registry &
REGISTRY_PID=$!

# AI layer
echo "Starting AI services..."
cargo run --release -p provisioning-rag &
RAG_PID=$!

cargo run --release -p ai-service &
AI_PID=$!

# Orchestration
echo "Starting orchestration..."
cargo run --release -p orchestrator &
ORCH_PID=$!

echo "All services started. PIDs: $VAULT_PID $REGISTRY_PID $RAG_PID $AI_PID $ORCH_PID"

Stopping Services

# Stop all services gracefully
pkill -SIGTERM -f "cargo run --release -p"

# Wait for graceful shutdown
sleep 5

# Force kill if needed
pkill -9 -f "cargo run --release -p"

# Verify all stopped
pgrep -f "cargo run --release -p" && echo "Services still running" || echo "All stopped"
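The TERM-wait-KILL sequence above can be wrapped per PID so a slow service gets its full grace period while a fast one releases immediately (hypothetical helper):

```shell
#!/bin/bash
# Send SIGTERM, wait up to GRACE seconds, then SIGKILL if still alive.
stop_pid() {
  local pid="$1" grace="${2:-5}" i=0
  kill -TERM "$pid" 2>/dev/null || return 0   # already gone
  while [ "$i" -lt "$grace" ]; do
    kill -0 "$pid" 2>/dev/null || return 0    # exited cleanly
    i=$((i + 1))
    sleep 1
  done
  kill -KILL "$pid" 2>/dev/null
  return 0
}

# Example: stop_pid "$VAULT_PID" 10
```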

Restarting Services

# Restart single service
pkill -SIGTERM vault-service
sleep 2
cargo run --release -p vault-service &

# Restart all services
./scripts/restart-all.sh $MODE

# Restart with config reload
export VAULT_MODE=multiuser
pkill -SIGTERM vault-service
sleep 2
cargo run --release -p vault-service &

Checking Service Status

# Check running processes
pgrep -a "cargo run --release"

# Check listening ports
netstat -tlnp | grep -E "8200|8081|8083|8082|9090|8080"

# Or using ss (modern alternative)
ss -tlnp | grep -E "8200|8081|8083|8082|9090|8080"

# Health endpoint checks (bash: declare the port map before looping)
declare -A port=( [vault]=8200 [registry]=8081 [rag]=8083 [ai]=8082 [orchestrator]=9090 )
for service in vault registry rag ai orchestrator; do
  echo "=== $service ==="
  curl -s "http://localhost:${port[$service]}/health" | jq .
done

Health Checks & Monitoring

Manual Health Verification

# Vault Service
curl -s http://localhost:8200/health | jq .
# Expected: {"status":"ok","uptime":123.45}

# Extension Registry
curl -s http://localhost:8081/health | jq .

# RAG System
curl -s http://localhost:8083/health | jq .
# Expected: {"status":"ok","embeddings":"ready","vector_db":"connected"}

# AI Service
curl -s http://localhost:8082/health | jq .

# Orchestrator
curl -s http://localhost:9090/health | jq .

# Control Center
curl -s http://localhost:8080/health | jq .

Service Integration Tests

# Test vault <-> registry integration
curl -X POST http://localhost:8200/api/encrypt \
  -H "Content-Type: application/json" \
  -d '{"plaintext":"secret"}' | jq .

# Test RAG system
curl -X POST http://localhost:8083/api/ingest \
  -H "Content-Type: application/json" \
  -d '{"document":"test.md","content":"# Test"}' | jq .

# Test orchestrator
curl -X GET http://localhost:9090/api/v1/status | jq .

# End-to-end workflow
curl -X POST http://localhost:9090/api/v1/provision \
  -H "Content-Type: application/json" \
  -d '{
    "workspace": "test",
    "services": ["vault", "registry"],
    "mode": "solo"
  }' | jq .

Monitoring Dashboards

Prometheus Metrics

# Query service uptime
curl -s 'http://prometheus:9090/api/v1/query?query=up' | jq .

# Query request rate
curl -s 'http://prometheus:9090/api/v1/query?query=rate(http_requests_total[5m])' | jq .

# Query error rate
curl -s 'http://prometheus:9090/api/v1/query?query=rate(http_errors_total[5m])' | jq .

Log Aggregation

# Follow vault logs
tail -f /var/log/provisioning/vault-service.log

# Follow all service logs
tail -f /var/log/provisioning/*.log

# Search for errors
grep -r "ERROR" /var/log/provisioning/

# Follow with filtering
tail -f /var/log/provisioning/orchestrator.log | grep -E "ERROR|WARN"
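For a quick per-file summary rather than a live tail, counting ERROR lines across the log directory works well (a sketch; adjust the directory to your log location):

```shell
#!/bin/bash
# Print an ERROR count for every log file in a directory.
error_summary() {
  local f count
  for f in "$1"/*.log; do
    [ -e "$f" ] || continue
    count=$(grep -c "ERROR" "$f" || true)
    printf '%s: %s errors\n' "$f" "$count"
  done
}

error_summary /var/log/provisioning
```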

Alerting

# AlertManager configuration
groups:
  - name: provisioning
    rules:
      - alert: ServiceDown
        expr: up{job=~"vault|registry|rag|orchestrator"} == 0
        for: 5m
        annotations:
          summary: "{{ $labels.job }} is down"

      - alert: HighErrorRate
        expr: rate(http_errors_total[5m]) > 0.05
        annotations:
          summary: "High error rate detected"

      - alert: DiskSpaceWarning
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.2
        annotations:
          summary: "Disk space below 20%"

Troubleshooting

Service Won't Start

Problem: error: failed to bind to port 8200

Solutions:

# Check if port is in use
lsof -i :8200
ss -tlnp | grep 8200

# Kill existing process
pkill -9 -f vault-service

# Or use different port
export VAULT_SERVER_PORT=8201
cargo run --release -p vault-service

Configuration Loading Fails

Problem: error: failed to load config from mode file

Solutions:

# Verify schemas exist
ls -la provisioning/schemas/platform/schemas/vault-service.ncl

# Validate schema syntax
nickel typecheck provisioning/schemas/platform/schemas/vault-service.ncl

# Check defaults are present
nickel typecheck provisioning/schemas/platform/defaults/vault-service-defaults.ncl

# Verify deployment mode overlay exists
ls -la provisioning/schemas/platform/defaults/deployment/$VAULT_MODE-defaults.ncl

# Run service with explicit mode
export VAULT_MODE=solo
cargo run --release -p vault-service
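Typechecking every schema at once narrows down which file is at fault (a sketch; assumes `nickel` is on PATH):

```shell
#!/bin/bash
# Typecheck every Nickel schema in a directory; report each failure.
check_schemas() {
  local f status=0
  for f in "$1"/*.ncl; do
    if [ ! -e "$f" ]; then
      echo "no .ncl schemas found in $1" >&2
      return 1
    fi
    if nickel typecheck "$f" >/dev/null 2>&1; then
      echo "ok: $f"
    else
      echo "FAIL: $f"
      status=1
    fi
  done
  return $status
}

# check_schemas provisioning/schemas/platform/schemas
```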

Database Connection Issues

Problem: error: failed to connect to database

Solutions:

# Verify database is running
curl http://surrealdb:8000/health
etcdctl --endpoints=http://etcd:2379 endpoint health

# Check connectivity
nc -zv surrealdb 8000
nc -zv etcd 2379

# Update connection string
export SURREALDB_URL=ws://surrealdb:8000
export ETCD_ENDPOINTS=http://etcd:2379

# Restart service with new config
pkill -9 vault-service
cargo run --release -p vault-service

Service Crashes on Startup

Problem: Service exits with code 1 or 139

Solutions:

# Run with verbose logging
RUST_LOG=debug cargo run -p vault-service 2>&1 | head -50

# Check system resources
free -h
df -h

# Check for core dumps
coredumpctl list

# Run under debugger (if crash suspected)
rust-gdb --args target/release/vault-service

High Memory Usage

Problem: Service consuming > expected memory

Solutions:

# Check memory usage
ps aux | grep vault-service | grep -v grep

# Monitor over time
watch -n 1 'ps aux | grep vault-service | grep -v grep'

# Reduce worker count
export VAULT_SERVER_WORKERS=2
cargo run --release -p vault-service

# Check for memory leaks
valgrind --leak-check=full target/release/vault-service

Network/DNS Issues

Problem: error: failed to resolve hostname

Solutions:

# Test DNS resolution
nslookup vault.internal
dig vault.internal

# Test connectivity to service
curl -v http://vault.internal:8200/health

# Add to /etc/hosts if needed (requires root)
echo "10.0.1.10 vault.internal" | sudo tee -a /etc/hosts

# Check network interface
ip addr show
netstat -nr

Data Persistence Issues

Problem: Data lost after restart

Solutions:

# Verify backup exists
ls -la /mnt/provisioning-backups/
ls -la /var/lib/provisioning/

# Check disk space
df -h /var/lib/provisioning

# Verify file permissions
ls -l /var/lib/provisioning/vault/
chmod 755 /var/lib/provisioning/vault/*

# Restore from backup
./scripts/restore-backup.sh /mnt/provisioning-backups/vault-20260105.sql

Debugging Checklist

When troubleshooting, use this systematic approach:

# 1. Check service is running
pgrep -f vault-service || echo "Service not running"

# 2. Check port is listening
ss -tlnp | grep 8200 || echo "Port not listening"

# 3. Check logs for errors
tail -20 /var/log/provisioning/vault-service.log | grep -i error

# 4. Test HTTP endpoint
curl -i http://localhost:8200/health

# 5. Check dependencies
curl http://surrealdb:8000/health
etcdctl --endpoints=http://etcd:2379 endpoint health

# 6. Check schema definition
nickel typecheck provisioning/schemas/platform/schemas/vault-service.ncl

# 7. Verify environment variables
env | grep -E "VAULT_|SURREALDB_|ETCD_"

# 8. Check system resources
free -h && df -h && top -bn1 | head -10

Configuration Updates

Updating Service Configuration

# 1. Edit the schema definition
vim provisioning/schemas/platform/schemas/vault-service.ncl

# 2. Update defaults if needed
vim provisioning/schemas/platform/defaults/vault-service-defaults.ncl

# 3. Validate syntax
nickel typecheck provisioning/schemas/platform/schemas/vault-service.ncl

# 4. Re-export configuration from schemas
./provisioning/.typedialog/platform/scripts/generate-configs.nu vault-service multiuser

# 5. Restart the affected service (expect a brief interruption while it restarts)
pkill -SIGTERM vault-service
sleep 2
cargo run --release -p vault-service &

# 6. Verify configuration loaded
curl http://localhost:8200/api/config | jq .

Mode Migration

# Migrate from solo to multiuser:

# 1. Stop services
pkill -SIGTERM -f "cargo run"
sleep 5

# 2. Backup current data
tar -czf /backup/provisioning-solo-$(date +%s).tar.gz /var/lib/provisioning/

# 3. Set new mode
export VAULT_MODE=multiuser
export REGISTRY_MODE=multiuser
export RAG_MODE=multiuser

# 4. Start services with new config
cargo run --release -p vault-service &
cargo run --release -p extension-registry &

# 5. Verify new mode
curl http://localhost:8200/api/config | jq .deployment_mode

Production Checklist

Before deploying to production:

  • All services compiled in release mode (--release)
  • TLS certificates installed and valid
  • Database cluster deployed and healthy
  • Load balancer configured and routing traffic
  • Monitoring and alerting configured
  • Backup system tested and working
  • High availability verified (failover tested)
  • Security hardening applied (firewall rules, etc.)
  • Documentation updated for your environment
  • Team trained on deployment procedures
  • Runbooks created for common operations
  • Disaster recovery plan tested

Getting Help

Community Resources

  • GitHub Issues: Report bugs at github.com/your-org/provisioning/issues
  • Documentation: Full docs at provisioning/docs/
  • Slack Channel: #provisioning-platform

Internal Support

  • Platform Team: platform@your-org.com
  • On-Call: Check PagerDuty for active rotation
  • Escalation: Contact infrastructure leadership

Useful Commands Reference

# View a service's command-line flags (specify the crate)
cargo run --release -p vault-service -- --help

# View service schemas
ls -la provisioning/schemas/platform/schemas/
ls -la provisioning/schemas/platform/defaults/

# List running services
ps aux | grep cargo

# Monitor service logs in real-time
journalctl -fu provisioning-vault

# Generate diagnostics bundle
./scripts/generate-diagnostics.sh > /tmp/diagnostics-$(date +%s).tar.gz