
Platform Deployment Guide

Version: 1.0.0
Last Updated: 2026-01-05
Target Audience: DevOps Engineers, Platform Operators
Status: Production Ready

Practical guide for deploying the 9-service provisioning platform in any environment using mode-based configuration.

Table of Contents

  1. Prerequisites
  2. Deployment Modes
  3. Quick Start
  4. Solo Mode Deployment
  5. Multiuser Mode Deployment
  6. CICD Mode Deployment
  7. Enterprise Mode Deployment
  8. Service Management
  9. Health Checks & Monitoring
  10. Troubleshooting

Prerequisites

Required Software

  • Rust: 1.70+ (for building services)
  • Nickel: Latest (for config validation)
  • Nushell: 0.109.1+ (for scripts)
  • Cargo: Included with Rust
  • Git: For cloning and pulling updates
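A quick way to confirm the core toolchain is present before building (a minimal sketch; extend the list with `nickel` and `nu` if your workflow relies on them):

```shell
#!/bin/bash
# Minimal prerequisite check: reports each required tool as present or missing.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
  fi
}

for tool in rustc cargo git; do
  check_tool "$tool"
done
```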

Required Tools (Mode-Dependent)

| Tool | Solo | Multiuser | CICD | Enterprise |
|------|------|-----------|------|------------|
| Docker/Podman | No | Optional | Yes | Yes |
| SurrealDB | No | Yes | No | Yes |
| Etcd | No | No | No | Yes |
| PostgreSQL | No | Optional | No | Optional |
| OpenAI/Anthropic API | No | Optional | Yes | Yes |

System Requirements

| Resource | Solo | Multiuser | CICD | Enterprise |
|----------|------|-----------|------|------------|
| CPU Cores | 2+ | 4+ | 8+ | 16+ |
| Memory | 2 GB | 4 GB | 8 GB | 16 GB |
| Disk | 10 GB | 50 GB | 100 GB | 500 GB |
| Network | Local | Local/Cloud | Cloud | HA Cloud |

Directory Structure

# Ensure base directories exist
mkdir -p provisioning/schemas/platform
mkdir -p provisioning/platform/logs
mkdir -p provisioning/platform/data
mkdir -p provisioning/.typedialog/platform
mkdir -p provisioning/config/runtime

Deployment Modes

Mode Selection Matrix

| Requirement | Recommended Mode |
|-------------|------------------|
| Development & testing | solo |
| Team environment (2-10 people) | multiuser |
| CI/CD pipelines & automation | cicd |
| Production with HA | enterprise |
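A small guard like this (a hypothetical helper, not part of the repo's scripts) catches typos in the mode name before services start with a half-configured environment:

```shell
#!/bin/bash
# Validate a deployment mode string against the four supported modes.
validate_mode() {
  case "$1" in
    solo|multiuser|cicd|enterprise) return 0 ;;
    *) echo "unknown deployment mode: $1" >&2; return 1 ;;
  esac
}

validate_mode "${DEPLOYMENT_MODE:-solo}" || exit 1
```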

Mode Characteristics

Solo Mode

Use Case: Development, testing, demonstration

Characteristics:

  • All services run locally with minimal resources
  • Filesystem-based storage (no external databases)
  • No TLS/SSL required
  • Embedded/in-memory backends
  • Single machine only

Services Configuration:

  • 2-4 workers per service
  • 30-60 second timeouts
  • No replication or clustering
  • Debug-level logging enabled

Startup Time: ~2-5 minutes
Data Persistence: Local files only


Multiuser Mode

Use Case: Team environments, shared infrastructure

Characteristics:

  • Shared database backends (SurrealDB)
  • Multiple concurrent users
  • CORS and multi-user features enabled
  • Optional TLS support
  • 2-4 machines (or containerized)

Services Configuration:

  • 4-6 workers per service
  • 60-120 second timeouts
  • Basic replication available
  • Info-level logging

Startup Time: ~3-8 minutes (database dependent)
Data Persistence: SurrealDB (shared)


CICD Mode

Use Case: CI/CD pipelines, ephemeral environments

Characteristics:

  • Ephemeral storage (memory, temporary)
  • High throughput
  • RAG system disabled
  • Minimal logging
  • Stateless services

Services Configuration:

  • 8-12 workers per service
  • 10-30 second timeouts
  • No persistence
  • Warn-level logging

Startup Time: ~1-2 minutes
Data Persistence: None (ephemeral)


Enterprise Mode

Use Case: Production, high availability, compliance

Characteristics:

  • Distributed, replicated backends
  • High availability (HA) clustering
  • TLS/SSL encryption
  • Audit logging
  • Full monitoring and observability

Services Configuration:

  • 16-32 workers per service
  • 120-300 second timeouts
  • Active replication across 3+ nodes
  • Info-level logging with audit trails

Startup Time: ~5-15 minutes (cluster initialization)
Data Persistence: Replicated across cluster


Quick Start

1. Clone Repository

git clone https://github.com/your-org/project-provisioning.git
cd project-provisioning

2. Select Deployment Mode

Choose your mode based on use case:

# For development
export DEPLOYMENT_MODE=solo

# For team environments
export DEPLOYMENT_MODE=multiuser

# For CI/CD
export DEPLOYMENT_MODE=cicd

# For production
export DEPLOYMENT_MODE=enterprise

3. Set Environment Variables

All services use mode-specific TOML configs automatically loaded via environment variables:

# Vault Service
export VAULT_MODE=$DEPLOYMENT_MODE

# Extension Registry
export REGISTRY_MODE=$DEPLOYMENT_MODE

# RAG System
export RAG_MODE=$DEPLOYMENT_MODE

# AI Service
export AI_SERVICE_MODE=$DEPLOYMENT_MODE

# Provisioning Daemon
export DAEMON_MODE=$DEPLOYMENT_MODE

4. Build All Services

# Build all platform crates
cargo build --release -p vault-service \
                      -p extension-registry \
                      -p provisioning-rag \
                      -p ai-service \
                      -p provisioning-daemon \
                      -p orchestrator \
                      -p control-center \
                      -p mcp-server \
                      -p installer

5. Start Services (Order Matters)

# Start in dependency order:

# 1. Core infrastructure (KMS, storage)
cargo run --release -p vault-service &

# 2. Configuration and extensions
cargo run --release -p extension-registry &

# 3. AI/RAG layer
cargo run --release -p provisioning-rag &
cargo run --release -p ai-service &

# 4. Orchestration layer
cargo run --release -p orchestrator &
cargo run --release -p control-center &
cargo run --release -p mcp-server &

# 5. Background operations
cargo run --release -p provisioning-daemon &

# 6. Installer (optional, for new deployments)
cargo run --release -p installer &

6. Verify Services

# Check all services are running
pgrep -l "vault-service|extension-registry|provisioning-rag|ai-service"

# Test endpoints
curl http://localhost:8200/health   # Vault
curl http://localhost:8081/health   # Registry
curl http://localhost:8083/health   # RAG
curl http://localhost:8082/health   # AI Service
curl http://localhost:9090/health   # Orchestrator
curl http://localhost:8080/health   # Control Center

Solo Mode Deployment

Perfect for: Development, testing, learning

Step 1: Verify Solo Configuration Files

# Check that solo schemas are available
ls -la provisioning/schemas/platform/defaults/deployment/solo-defaults.ncl

# Available schemas for each service:
# - provisioning/schemas/platform/schemas/vault-service.ncl
# - provisioning/schemas/platform/schemas/extension-registry.ncl
# - provisioning/schemas/platform/schemas/rag.ncl
# - provisioning/schemas/platform/schemas/ai-service.ncl
# - provisioning/schemas/platform/schemas/provisioning-daemon.ncl

Step 2: Set Solo Environment Variables

# Set all services to solo mode
export VAULT_MODE=solo
export REGISTRY_MODE=solo
export RAG_MODE=solo
export AI_SERVICE_MODE=solo
export DAEMON_MODE=solo

# Verify settings
echo $VAULT_MODE  # Should output: solo

Step 3: Build Services

# Build in release mode for better performance
cargo build --release

Step 4: Create Local Data Directories

# Create storage directories for solo mode
mkdir -p /tmp/provisioning-solo/{vault,registry,rag,ai,daemon}
chmod 755 /tmp/provisioning-solo/{vault,registry,rag,ai,daemon}

Step 5: Start Services

# Start each service in a separate terminal or use tmux:

# Terminal 1: Vault
cargo run --release -p vault-service

# Terminal 2: Registry
cargo run --release -p extension-registry

# Terminal 3: RAG
cargo run --release -p provisioning-rag

# Terminal 4: AI Service
cargo run --release -p ai-service

# Terminal 5: Orchestrator
cargo run --release -p orchestrator

# Terminal 6: Control Center
cargo run --release -p control-center

# Terminal 7: Daemon
cargo run --release -p provisioning-daemon

Step 6: Test Services

# Wait 10-15 seconds for services to start, then test

# Check service health
curl -s http://localhost:8200/health | jq .
curl -s http://localhost:8081/health | jq .
curl -s http://localhost:8083/health | jq .

# Try a simple operation
curl -X GET http://localhost:9090/api/v1/health

Step 7: Verify Persistence (Optional)

# Check that data is stored locally
ls -la /tmp/provisioning-solo/vault/
ls -la /tmp/provisioning-solo/registry/

# Data should accumulate as you use the services

Cleanup

# Stop all services
pkill -f "cargo run --release"

# Remove temporary data (optional)
rm -rf /tmp/provisioning-solo

Multiuser Mode Deployment

Perfect for: Team environments, shared infrastructure

Prerequisites

  • SurrealDB: Running and accessible at http://surrealdb:8000
  • Network Access: All machines can reach SurrealDB
  • DNS/Hostnames: Services accessible via hostnames (not just localhost)

Step 1: Deploy SurrealDB

# Using Docker (recommended)
docker run -d \
  --name surrealdb \
  -p 8000:8000 \
  surrealdb/surrealdb:latest \
  start --user root --pass root

# Or using native installation:
surreal start --user root --pass root

Step 2: Verify SurrealDB Connectivity

# Test SurrealDB connection
curl -s http://localhost:8000/health

# An HTTP 200 response means the database is reachable; query /version for the version string

Step 3: Set Multiuser Environment Variables

# Configure all services for multiuser mode
export VAULT_MODE=multiuser
export REGISTRY_MODE=multiuser
export RAG_MODE=multiuser
export AI_SERVICE_MODE=multiuser
export DAEMON_MODE=multiuser

# Set database connection
export SURREALDB_URL=http://surrealdb:8000
export SURREALDB_USER=root
export SURREALDB_PASS=root

# Set service hostnames (if not localhost)
export VAULT_SERVICE_HOST=vault.internal
export REGISTRY_HOST=registry.internal
export RAG_HOST=rag.internal

Step 4: Build Services

cargo build --release

Step 5: Create Shared Data Directories

# Create directories on shared storage (NFS, etc.)
mkdir -p /mnt/provisioning-data/{vault,registry,rag,ai}
chmod 755 /mnt/provisioning-data/{vault,registry,rag,ai}

# Or use local directories if on separate machines
mkdir -p /var/lib/provisioning/{vault,registry,rag,ai}

Step 6: Start Services on Multiple Machines

# Machine 1: Infrastructure services
ssh ops@machine1
export VAULT_MODE=multiuser
cargo run --release -p vault-service &
cargo run --release -p extension-registry &

# Machine 2: AI services
ssh ops@machine2
export RAG_MODE=multiuser
export AI_SERVICE_MODE=multiuser
cargo run --release -p provisioning-rag &
cargo run --release -p ai-service &

# Machine 3: Orchestration
ssh ops@machine3
cargo run --release -p orchestrator &
cargo run --release -p control-center &

# Machine 4: Background tasks
ssh ops@machine4
export DAEMON_MODE=multiuser
cargo run --release -p provisioning-daemon &

Step 7: Test Multi-Machine Setup

# From any machine, test cross-machine connectivity
curl -s http://machine1:8200/health
curl -s http://machine2:8083/health
curl -s http://machine3:9090/health

# Test integration
curl -X POST http://machine3:9090/api/v1/provision \
  -H "Content-Type: application/json" \
  -d '{"workspace": "test"}'

Step 8: Enable User Access

# Create shared credentials
export VAULT_TOKEN=s.xxxxxxxxxxx

# Configure TLS (optional but recommended)
# Update configs to use https:// URLs
export VAULT_MODE=multiuser
# Edit provisioning/schemas/platform/schemas/vault-service.ncl
# Add TLS configuration in the schema definition
# See: provisioning/schemas/platform/validators/ for constraints

Monitoring Multiuser Deployment

# Check all services are connected to SurrealDB
# (use the health port of the service deployed on each host; 8200 shown for vault)
for host in machine1 machine2 machine3 machine4; do
  ssh ops@$host "curl -s http://localhost:8200/api/v1/health | jq .database_connected"
done

# Monitor SurrealDB
curl -s http://surrealdb:8000/version

CICD Mode Deployment

Perfect for: GitHub Actions, GitLab CI, Jenkins, cloud automation

Step 1: Understand Ephemeral Nature

CICD mode services:

  • Don't persist data between runs
  • Use in-memory storage
  • Have RAG disabled
  • Optimize for startup speed
  • Suitable for containerized deployments

Step 2: Set CICD Environment Variables

# Use cicd mode for all services
export VAULT_MODE=cicd
export REGISTRY_MODE=cicd
export RAG_MODE=cicd
export AI_SERVICE_MODE=cicd
export DAEMON_MODE=cicd

# Mark the environment as CI (cicd-mode defaults already skip TLS)
export CI_ENVIRONMENT=true

Step 3: Containerize Services (Optional)

# Dockerfile for CICD deployments
FROM rust:1.75-slim

WORKDIR /app
COPY . .

# Build all services
RUN cargo build --release

# Set CICD mode
ENV VAULT_MODE=cicd
ENV REGISTRY_MODE=cicd
ENV RAG_MODE=cicd
ENV AI_SERVICE_MODE=cicd

# Expose ports
EXPOSE 8200 8081 8083 8082 9090 8080

# Run services
CMD ["sh", "-c", "\
  cargo run --release -p vault-service & \
  cargo run --release -p extension-registry & \
  cargo run --release -p provisioning-rag & \
  cargo run --release -p ai-service & \
  cargo run --release -p orchestrator & \
  wait"]

Step 4: GitHub Actions Example

name: CICD Platform Deployment

on:
  push:
    branches: [main, develop]

jobs:
  test-deployment:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Install Rust
        uses: actions-rs/toolchain@v1
        with:
          toolchain: 1.75
          profile: minimal

      - name: Set CICD Mode
        run: |
          echo "VAULT_MODE=cicd" >> $GITHUB_ENV
          echo "REGISTRY_MODE=cicd" >> $GITHUB_ENV
          echo "RAG_MODE=cicd" >> $GITHUB_ENV
          echo "AI_SERVICE_MODE=cicd" >> $GITHUB_ENV
          echo "DAEMON_MODE=cicd" >> $GITHUB_ENV

      - name: Build Services
        run: cargo build --release

      - name: Run Integration Tests
        run: |
          # Start services in background
          cargo run --release -p vault-service &
          cargo run --release -p extension-registry &
          cargo run --release -p orchestrator &

          # Wait for startup
          sleep 10

          # Run tests
          cargo test --release

      - name: Health Checks
        run: |
          curl -f http://localhost:8200/health
          curl -f http://localhost:8081/health
          curl -f http://localhost:9090/health

  deploy:
    needs: test-deployment
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v3
      - name: Deploy to Production
        run: |
          # Deploy production enterprise cluster
          ./scripts/deploy-enterprise.sh

Step 5: Run CICD Tests

# Simulate CI environment locally
export VAULT_MODE=cicd
export CI_ENVIRONMENT=true

# Build
cargo build --release

# Run short-lived services for testing
timeout 30 cargo run --release -p vault-service &
timeout 30 cargo run --release -p extension-registry &
timeout 30 cargo run --release -p orchestrator &

# Run tests while services are running
sleep 5
cargo test --release

# Services auto-cleanup after timeout

Enterprise Mode Deployment

Perfect for: Production, high availability, compliance

Prerequisites

  • 3+ Machines: Minimum 3 for HA
  • Etcd Cluster: For distributed consensus
  • Load Balancer: HAProxy, nginx, or cloud LB
  • TLS Certificates: Valid certificates for all services
  • Monitoring: Prometheus, ELK, or cloud monitoring
  • Backup System: Daily snapshots to S3 or similar

Step 1: Deploy Infrastructure

1.1 Deploy Etcd Cluster

# Node 1, 2, 3
etcd --name=node-1 \
     --listen-client-urls=http://0.0.0.0:2379 \
     --advertise-client-urls=http://node-1.internal:2379 \
     --initial-cluster="node-1=http://node-1.internal:2380,node-2=http://node-2.internal:2380,node-3=http://node-3.internal:2380" \
     --initial-cluster-state=new

# Verify cluster
etcdctl --endpoints=http://localhost:2379 member list

1.2 Deploy Load Balancer

# HAProxy configuration for vault-service (example)
frontend vault_frontend
    bind *:8200
    mode tcp
    default_backend vault_backend

backend vault_backend
    mode tcp
    balance roundrobin
    server vault-1 10.0.1.10:8200 check
    server vault-2 10.0.1.11:8200 check
    server vault-3 10.0.1.12:8200 check

1.3 Configure TLS

# Generate certificates (or use existing)
mkdir -p /etc/provisioning/tls

# For each service:
openssl req -x509 -newkey rsa:4096 \
  -keyout /etc/provisioning/tls/vault-key.pem \
  -out /etc/provisioning/tls/vault-cert.pem \
  -days 365 -nodes \
  -subj "/CN=vault.provisioning.prod"

# Set permissions
chmod 600 /etc/provisioning/tls/*-key.pem
chmod 644 /etc/provisioning/tls/*-cert.pem
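Before starting services it is worth confirming no certificate is close to expiry; `openssl x509 -checkend` makes this a one-liner (a sketch; 2592000 seconds is 30 days):

```shell
#!/bin/bash
# Warn for any service certificate that expires within 30 days.
check_cert() {
  if openssl x509 -in "$1" -noout -checkend 2592000 >/dev/null; then
    echo "ok: $1"
  else
    echo "expiring soon: $1"
  fi
}

for cert in /etc/provisioning/tls/*-cert.pem; do
  if [ -e "$cert" ]; then
    check_cert "$cert"
  fi
done
```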

Step 2: Set Enterprise Environment Variables

# All machines: Set enterprise mode
export VAULT_MODE=enterprise
export REGISTRY_MODE=enterprise
export RAG_MODE=enterprise
export AI_SERVICE_MODE=enterprise
export DAEMON_MODE=enterprise

# Database cluster
export SURREALDB_URL="ws://surrealdb-cluster.internal:8000"
export SURREALDB_REPLICAS=3

# Etcd cluster
export ETCD_ENDPOINTS="http://node-1.internal:2379,http://node-2.internal:2379,http://node-3.internal:2379"

# TLS configuration
export TLS_CERT_PATH=/etc/provisioning/tls
export TLS_VERIFY=true
export TLS_CA_CERT=/etc/provisioning/tls/ca.crt

# Monitoring
export PROMETHEUS_URL=http://prometheus.internal:9090
export METRICS_ENABLED=true
export AUDIT_LOG_ENABLED=true

Step 3: Deploy Services Across Cluster

# Ansible playbook (simplified)
---
- hosts: provisioning_cluster
  tasks:
    - name: Build services
      shell: cargo build --release

    - name: Start vault-service (machine 1-3)
      shell: "cargo run --release -p vault-service"
      when: "'vault' in group_names"

    - name: Start orchestrator (machine 2-3)
      shell: "cargo run --release -p orchestrator"
      when: "'orchestrator' in group_names"

    - name: Start daemon (machine 3)
      shell: "cargo run --release -p provisioning-daemon"
      when: "'daemon' in group_names"

    - name: Verify cluster health
      uri:
        url: "https://{{ inventory_hostname }}:9090/health"
        validate_certs: yes

Step 4: Monitor Cluster Health

# Check cluster status
curl -s https://vault.internal:8200/health | jq .state

# Check replication
curl -s https://orchestrator.internal:9090/api/v1/cluster/status

# Monitor etcd
etcdctl --endpoints=https://node-1.internal:2379 endpoint health

# Check leader status per endpoint
etcdctl --endpoints=https://node-1.internal:2379 endpoint status -w table

Step 5: Enable Monitoring & Alerting

# Prometheus configuration
global:
  scrape_interval: 30s
  evaluation_interval: 30s

scrape_configs:
  - job_name: 'vault-service'
    scheme: https
    tls_config:
      ca_file: /etc/provisioning/tls/ca.crt
    static_configs:
      - targets: ['vault-1.internal:8200', 'vault-2.internal:8200', 'vault-3.internal:8200']

  - job_name: 'orchestrator'
    scheme: https
    static_configs:
      - targets: ['orch-1.internal:9090', 'orch-2.internal:9090', 'orch-3.internal:9090']

Step 6: Backup & Recovery

#!/bin/bash
# Daily backup script
BACKUP_DIR="/mnt/provisioning-backups"
DATE=$(date +%Y%m%d_%H%M%S)

# Backup etcd
etcdctl --endpoints=https://node-1.internal:2379 \
  snapshot save "$BACKUP_DIR/etcd-$DATE.db"

# Backup SurrealDB
curl -X POST https://surrealdb.internal:8000/backup \
  -H "Authorization: Bearer $SURREALDB_TOKEN" \
  > "$BACKUP_DIR/surreal-$DATE.sql"

# Upload to S3
aws s3 cp "$BACKUP_DIR/etcd-$DATE.db" \
  s3://provisioning-backups/etcd/

# Cleanup old backups (keep 30 days)
find "$BACKUP_DIR" -mtime +30 -delete

Service Management

Starting Services

Individual Service Startup

# Start one service
export VAULT_MODE=enterprise
cargo run --release -p vault-service

# In another terminal
export REGISTRY_MODE=enterprise
cargo run --release -p extension-registry

Batch Startup

#!/bin/bash
# Start all services (dependency order)
set -e

MODE=${1:-solo}
export VAULT_MODE=$MODE
export REGISTRY_MODE=$MODE
export RAG_MODE=$MODE
export AI_SERVICE_MODE=$MODE
export DAEMON_MODE=$MODE

echo "Starting provisioning platform in $MODE mode..."

# Core services first
echo "Starting infrastructure..."
cargo run --release -p vault-service &
VAULT_PID=$!

echo "Starting extension registry..."
cargo run --release -p extension-registry &
REGISTRY_PID=$!

# AI layer
echo "Starting AI services..."
cargo run --release -p provisioning-rag &
RAG_PID=$!

cargo run --release -p ai-service &
AI_PID=$!

# Orchestration
echo "Starting orchestration..."
cargo run --release -p orchestrator &
ORCH_PID=$!

echo "All services started. PIDs: $VAULT_PID $REGISTRY_PID $RAG_PID $AI_PID $ORCH_PID"

Stopping Services

# Stop all services gracefully
pkill -SIGTERM -f "cargo run --release -p"

# Wait for graceful shutdown
sleep 5

# Force kill if needed
pkill -9 -f "cargo run --release -p"

# Verify all stopped
pgrep -f "cargo run --release -p" && echo "Services still running" || echo "All stopped"
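The TERM-wait-KILL sequence above can be wrapped per PID so a slow service gets its full grace period while a fast one releases immediately (hypothetical helper):

```shell
#!/bin/bash
# Send SIGTERM, wait up to GRACE seconds, then SIGKILL if still alive.
stop_pid() {
  local pid="$1" grace="${2:-5}" i=0
  kill -TERM "$pid" 2>/dev/null || return 0   # already gone
  while [ "$i" -lt "$grace" ]; do
    kill -0 "$pid" 2>/dev/null || return 0    # exited cleanly
    i=$((i + 1))
    sleep 1
  done
  kill -KILL "$pid" 2>/dev/null
  return 0
}

# Example: stop_pid "$VAULT_PID" 10
```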

Restarting Services

# Restart single service
pkill -SIGTERM vault-service
sleep 2
cargo run --release -p vault-service &

# Restart all services
./scripts/restart-all.sh $MODE

# Restart with config reload
export VAULT_MODE=multiuser
pkill -SIGTERM vault-service
sleep 2
cargo run --release -p vault-service &

Checking Service Status

# Check running processes
pgrep -a "cargo run --release"

# Check listening ports
netstat -tlnp | grep -E "8200|8081|8083|8082|9090|8080"

# Or using ss (modern alternative)
ss -tlnp | grep -E "8200|8081|8083|8082|9090|8080"

# Health endpoint checks (bash: declare the port map before looping)
declare -A port=( [vault]=8200 [registry]=8081 [rag]=8083 [ai]=8082 [orchestrator]=9090 )
for service in vault registry rag ai orchestrator; do
  echo "=== $service ==="
  curl -s "http://localhost:${port[$service]}/health" | jq .
done

Health Checks & Monitoring

Manual Health Verification

# Vault Service
curl -s http://localhost:8200/health | jq .
# Expected: {"status":"ok","uptime":123.45}

# Extension Registry
curl -s http://localhost:8081/health | jq .

# RAG System
curl -s http://localhost:8083/health | jq .
# Expected: {"status":"ok","embeddings":"ready","vector_db":"connected"}

# AI Service
curl -s http://localhost:8082/health | jq .

# Orchestrator
curl -s http://localhost:9090/health | jq .

# Control Center
curl -s http://localhost:8080/health | jq .

Service Integration Tests

# Test vault <-> registry integration
curl -X POST http://localhost:8200/api/encrypt \
  -H "Content-Type: application/json" \
  -d '{"plaintext":"secret"}' | jq .

# Test RAG system
curl -X POST http://localhost:8083/api/ingest \
  -H "Content-Type: application/json" \
  -d '{"document":"test.md","content":"# Test"}' | jq .

# Test orchestrator
curl -X GET http://localhost:9090/api/v1/status | jq .

# End-to-end workflow
curl -X POST http://localhost:9090/api/v1/provision \
  -H "Content-Type: application/json" \
  -d '{
    "workspace": "test",
    "services": ["vault", "registry"],
    "mode": "solo"
  }' | jq .

Monitoring Dashboards

Prometheus Metrics

# Query service uptime
curl -s 'http://prometheus:9090/api/v1/query?query=up' | jq .

# Query request rate
curl -s 'http://prometheus:9090/api/v1/query?query=rate(http_requests_total[5m])' | jq .

# Query error rate
curl -s 'http://prometheus:9090/api/v1/query?query=rate(http_errors_total[5m])' | jq .

Log Aggregation

# Follow vault logs
tail -f /var/log/provisioning/vault-service.log

# Follow all service logs
tail -f /var/log/provisioning/*.log

# Search for errors
grep -r "ERROR" /var/log/provisioning/

# Follow with filtering
tail -f /var/log/provisioning/orchestrator.log | grep -E "ERROR|WARN"
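For a quick per-file summary rather than a live tail, counting ERROR lines across the log directory works well (a sketch; adjust the directory to your log location):

```shell
#!/bin/bash
# Print an ERROR count for every log file in a directory.
error_summary() {
  local f count
  for f in "$1"/*.log; do
    [ -e "$f" ] || continue
    count=$(grep -c "ERROR" "$f" || true)
    printf '%s: %s errors\n' "$f" "$count"
  done
}

error_summary /var/log/provisioning
```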

Alerting

# AlertManager configuration
groups:
  - name: provisioning
    rules:
      - alert: ServiceDown
        expr: up{job=~"vault|registry|rag|orchestrator"} == 0
        for: 5m
        annotations:
          summary: "{{ $labels.job }} is down"

      - alert: HighErrorRate
        expr: rate(http_errors_total[5m]) > 0.05
        annotations:
          summary: "High error rate detected"

      - alert: DiskSpaceWarning
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.2
        annotations:
          summary: "Disk space below 20%"

Troubleshooting

Service Won't Start

Problem: error: failed to bind to port 8200

Solutions:

# Check if port is in use
lsof -i :8200
ss -tlnp | grep 8200

# Kill existing process
pkill -9 -f vault-service

# Or use different port
export VAULT_SERVER_PORT=8201
cargo run --release -p vault-service

Configuration Loading Fails

Problem: error: failed to load config from mode file

Solutions:

# Verify schemas exist
ls -la provisioning/schemas/platform/schemas/vault-service.ncl

# Validate schema syntax
nickel typecheck provisioning/schemas/platform/schemas/vault-service.ncl

# Check defaults are present
nickel typecheck provisioning/schemas/platform/defaults/vault-service-defaults.ncl

# Verify deployment mode overlay exists
ls -la provisioning/schemas/platform/defaults/deployment/$VAULT_MODE-defaults.ncl

# Run service with explicit mode
export VAULT_MODE=solo
cargo run --release -p vault-service
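Typechecking every schema at once narrows down which file is at fault (a sketch; assumes `nickel` is on PATH):

```shell
#!/bin/bash
# Typecheck every Nickel schema in a directory; report each failure.
check_schemas() {
  local f status=0
  for f in "$1"/*.ncl; do
    if [ ! -e "$f" ]; then
      echo "no .ncl schemas found in $1" >&2
      return 1
    fi
    if nickel typecheck "$f" >/dev/null 2>&1; then
      echo "ok: $f"
    else
      echo "FAIL: $f"
      status=1
    fi
  done
  return $status
}

# check_schemas provisioning/schemas/platform/schemas
```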

Database Connection Issues

Problem: error: failed to connect to database

Solutions:

# Verify database is running
curl http://surrealdb:8000/health
etcdctl --endpoints=http://etcd:2379 endpoint health

# Check connectivity
nc -zv surrealdb 8000
nc -zv etcd 2379

# Update connection string
export SURREALDB_URL=ws://surrealdb:8000
export ETCD_ENDPOINTS=http://etcd:2379

# Restart service with new config
pkill -9 vault-service
cargo run --release -p vault-service

Service Crashes on Startup

Problem: Service exits with code 1 or 139

Solutions:

# Run with verbose logging
RUST_LOG=debug cargo run -p vault-service 2>&1 | head -50

# Check system resources
free -h
df -h

# Check for core dumps
coredumpctl list

# Run under debugger (if crash suspected)
rust-gdb --args target/release/vault-service

High Memory Usage

Problem: Service consuming > expected memory

Solutions:

# Check memory usage
ps aux | grep vault-service | grep -v grep

# Monitor over time
watch -n 1 'ps aux | grep vault-service | grep -v grep'

# Reduce worker count
export VAULT_SERVER_WORKERS=2
cargo run --release -p vault-service

# Check for memory leaks
valgrind --leak-check=full target/release/vault-service

Network/DNS Issues

Problem: error: failed to resolve hostname

Solutions:

# Test DNS resolution
nslookup vault.internal
dig vault.internal

# Test connectivity to service
curl -v http://vault.internal:8200/health

# Add to /etc/hosts if needed (requires root)
echo "10.0.1.10 vault.internal" | sudo tee -a /etc/hosts

# Check network interface
ip addr show
netstat -nr

Data Persistence Issues

Problem: Data lost after restart

Solutions:

# Verify backup exists
ls -la /mnt/provisioning-backups/
ls -la /var/lib/provisioning/

# Check disk space
df -h /var/lib/provisioning

# Verify file permissions
ls -l /var/lib/provisioning/vault/
chmod 755 /var/lib/provisioning/vault/*

# Restore from backup
./scripts/restore-backup.sh /mnt/provisioning-backups/vault-20260105.sql

Debugging Checklist

When troubleshooting, use this systematic approach:

# 1. Check service is running
pgrep -f vault-service || echo "Service not running"

# 2. Check port is listening
ss -tlnp | grep 8200 || echo "Port not listening"

# 3. Check logs for errors
tail -20 /var/log/provisioning/vault-service.log | grep -i error

# 4. Test HTTP endpoint
curl -i http://localhost:8200/health

# 5. Check dependencies
curl http://surrealdb:8000/health
etcdctl --endpoints=http://etcd:2379 endpoint health

# 6. Check schema definition
nickel typecheck provisioning/schemas/platform/schemas/vault-service.ncl

# 7. Verify environment variables
env | grep -E "VAULT_|SURREALDB_|ETCD_"

# 8. Check system resources
free -h && df -h && top -bn1 | head -10

Configuration Updates

Updating Service Configuration

# 1. Edit the schema definition
vim provisioning/schemas/platform/schemas/vault-service.ncl

# 2. Update defaults if needed
vim provisioning/schemas/platform/defaults/vault-service-defaults.ncl

# 3. Validate syntax
nickel typecheck provisioning/schemas/platform/schemas/vault-service.ncl

# 4. Re-export configuration from schemas
./provisioning/.typedialog/platform/scripts/generate-configs.nu vault-service multiuser

# 5. Restart the affected service (expect a brief interruption while it restarts)
pkill -SIGTERM vault-service
sleep 2
cargo run --release -p vault-service &

# 6. Verify configuration loaded
curl http://localhost:8200/api/config | jq .

Mode Migration

# Migrate from solo to multiuser:

# 1. Stop services
pkill -SIGTERM -f "cargo run"
sleep 5

# 2. Backup current data
tar -czf /backup/provisioning-solo-$(date +%s).tar.gz /var/lib/provisioning/

# 3. Set new mode
export VAULT_MODE=multiuser
export REGISTRY_MODE=multiuser
export RAG_MODE=multiuser

# 4. Start services with new config
cargo run --release -p vault-service &
cargo run --release -p extension-registry &

# 5. Verify new mode
curl http://localhost:8200/api/config | jq .deployment_mode

Production Checklist

Before deploying to production:

  • All services compiled in release mode (--release)
  • TLS certificates installed and valid
  • Database cluster deployed and healthy
  • Load balancer configured and routing traffic
  • Monitoring and alerting configured
  • Backup system tested and working
  • High availability verified (failover tested)
  • Security hardening applied (firewall rules, etc.)
  • Documentation updated for your environment
  • Team trained on deployment procedures
  • Runbooks created for common operations
  • Disaster recovery plan tested

Getting Help

Community Resources

  • GitHub Issues: Report bugs at github.com/your-org/provisioning/issues
  • Documentation: Full docs at provisioning/docs/
  • Slack Channel: #provisioning-platform

Internal Support

  • Platform Team: platform@your-org.com
  • On-Call: Check PagerDuty for active rotation
  • Escalation: Contact infrastructure leadership

Useful Commands Reference

# View a service's command-line flags (specify the crate)
cargo run --release -p vault-service -- --help

# View service schemas
ls -la provisioning/schemas/platform/schemas/
ls -la provisioning/schemas/platform/defaults/

# List running services
ps aux | grep cargo

# Monitor service logs in real-time
journalctl -fu provisioning-vault

# Generate diagnostics bundle
./scripts/generate-diagnostics.sh > /tmp/diagnostics-$(date +%s).tar.gz