# Platform Deployment Guide
**Version**: 1.0.0
**Last Updated**: 2026-01-05
**Target Audience**: DevOps Engineers, Platform Operators
**Status**: Production Ready
Practical guide for deploying the 9-service provisioning platform in any environment using mode-based configuration.
## Table of Contents
1. [Prerequisites](#prerequisites)
2. [Deployment Modes](#deployment-modes)
3. [Quick Start](#quick-start)
4. [Solo Mode Deployment](#solo-mode-deployment)
5. [Multiuser Mode Deployment](#multiuser-mode-deployment)
6. [CICD Mode Deployment](#cicd-mode-deployment)
7. [Enterprise Mode Deployment](#enterprise-mode-deployment)
8. [Service Management](#service-management)
9. [Health Checks & Monitoring](#health-checks--monitoring)
10. [Troubleshooting](#troubleshooting)
---
## Prerequisites
### Required Software
- **Rust**: 1.70+ (for building services)
- **Nickel**: Latest (for config validation)
- **Nushell**: 0.109.1+ (for scripts)
- **Cargo**: Included with Rust
- **Git**: For cloning and pulling updates
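Before building, a quick preflight catches missing tooling early. The sketch below is illustrative: the tool list mirrors the entries above (Nushell's binary is `nu`), and the version parsing assumes `rustc --version` output of the form `rustc 1.75.0 (...)`.

```shell
#!/usr/bin/env bash
# Preflight sketch: verify the required software is installed.
# The 1.70 minimum matches the Rust entry above.

# Extract the minor component from output like "rustc 1.75.0 (abc 2024-01-01)"
minor_version() { echo "$1" | awk '{print $2}' | cut -d. -f2; }

for tool in git cargo rustc nu nickel; do
  command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
done

if command -v rustc >/dev/null 2>&1; then
  if [ "$(minor_version "$(rustc --version)")" -lt 70 ]; then
    echo "rustc older than 1.70; upgrade before building"
  fi
fi
```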
### Required Tools (Mode-Dependent)
| Tool | Solo | Multiuser | CICD | Enterprise |
|------|------|-----------|------|------------|
| Docker/Podman | No | Optional | Yes | Yes |
| SurrealDB | No | Yes | No | Yes |
| Etcd | No | No | No | Yes |
| PostgreSQL | No | Optional | No | Optional |
| OpenAI/Anthropic API | No | Optional | Yes | Yes |
### System Requirements
| Resource | Solo | Multiuser | CICD | Enterprise |
|----------|------|-----------|------|------------|
| CPU Cores | 2+ | 4+ | 8+ | 16+ |
| Memory | 2 GB | 4 GB | 8 GB | 16 GB |
| Disk | 10 GB | 50 GB | 100 GB | 500 GB |
| Network | Local | Local/Cloud | Cloud | HA Cloud |
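A preflight check against the Memory column above can flag an undersized host before services start thrashing. A sketch, assuming Linux procps `free -g` output (the 7th column of the `Mem:` row is available memory in GB):

```shell
# Warn when available memory falls below the per-mode minimum from the table above.
min_mem_gb() {
  case "$1" in
    solo)       echo 2  ;;
    multiuser)  echo 4  ;;
    cicd)       echo 8  ;;
    enterprise) echo 16 ;;
    *)          echo 0  ;;
  esac
}

mode="${DEPLOYMENT_MODE:-solo}"
avail_gb=$(free -g 2>/dev/null | awk '/^Mem:/ {print $7}')
if [ "${avail_gb:-0}" -lt "$(min_mem_gb "$mode")" ]; then
  echo "WARN: ${avail_gb:-unknown} GB available; $(min_mem_gb "$mode") GB recommended for $mode mode"
fi
```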
### Directory Structure
```bash
# Ensure base directories exist
mkdir -p provisioning/schemas/platform
mkdir -p provisioning/platform/logs
mkdir -p provisioning/platform/data
mkdir -p provisioning/.typedialog/platform
mkdir -p provisioning/config/runtime
```
---
## Deployment Modes
### Mode Selection Matrix
| Requirement | Recommended Mode |
|-------------|------------------|
| Development & testing | **solo** |
| Team environment (2-10 people) | **multiuser** |
| CI/CD pipelines & automation | **cicd** |
| Production with HA | **enterprise** |
### Mode Characteristics
#### Solo Mode
**Use Case**: Development, testing, demonstration
**Characteristics**:
- All services run locally with minimal resources
- Filesystem-based storage (no external databases)
- No TLS/SSL required
- Embedded/in-memory backends
- Single machine only
**Services Configuration**:
- 2-4 workers per service
- 30-60 second timeouts
- No replication or clustering
- Debug-level logging enabled
**Startup Time**: ~2-5 minutes
**Data Persistence**: Local files only
---
#### Multiuser Mode
**Use Case**: Team environments, shared infrastructure
**Characteristics**:
- Shared database backends (SurrealDB)
- Multiple concurrent users
- CORS and multi-user features enabled
- Optional TLS support
- 2-4 machines (or containerized)
**Services Configuration**:
- 4-6 workers per service
- 60-120 second timeouts
- Basic replication available
- Info-level logging
**Startup Time**: ~3-8 minutes (database dependent)
**Data Persistence**: SurrealDB (shared)
---
#### CICD Mode
**Use Case**: CI/CD pipelines, ephemeral environments
**Characteristics**:
- Ephemeral storage (memory, temporary)
- High throughput
- RAG system disabled
- Minimal logging
- Stateless services
**Services Configuration**:
- 8-12 workers per service
- 10-30 second timeouts
- No persistence
- Warn-level logging
**Startup Time**: ~1-2 minutes
**Data Persistence**: None (ephemeral)
---
#### Enterprise Mode
**Use Case**: Production, high availability, compliance
**Characteristics**:
- Distributed, replicated backends
- High availability (HA) clustering
- TLS/SSL encryption
- Audit logging
- Full monitoring and observability
**Services Configuration**:
- 16-32 workers per service
- 120-300 second timeouts
- Active replication across 3+ nodes
- Info-level logging with audit trails
**Startup Time**: ~5-15 minutes (cluster initialization)
**Data Persistence**: Replicated across cluster
---
## Quick Start
### 1. Clone Repository
```bash
git clone https://github.com/your-org/project-provisioning.git
cd project-provisioning
```
### 2. Select Deployment Mode
Choose your mode based on use case:
```bash
# For development
export DEPLOYMENT_MODE=solo
# For team environments
export DEPLOYMENT_MODE=multiuser
# For CI/CD
export DEPLOYMENT_MODE=cicd
# For production
export DEPLOYMENT_MODE=enterprise
```
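Because every per-service `*_MODE` variable is derived from `DEPLOYMENT_MODE`, a typo here propagates to all nine services. A minimal guard (the four mode names are the ones defined in this guide):

```shell
# Reject anything that is not one of the four supported deployment modes.
validate_mode() {
  case "$1" in
    solo|multiuser|cicd|enterprise) return 0 ;;
    *) echo "invalid DEPLOYMENT_MODE: '$1' (expected solo|multiuser|cicd|enterprise)" >&2
       return 1 ;;
  esac
}

validate_mode "${DEPLOYMENT_MODE:-solo}" || exit 1
```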
### 3. Set Environment Variables
All services use mode-specific TOML configs automatically loaded via environment variables:
```bash
# Vault Service
export VAULT_MODE=$DEPLOYMENT_MODE
# Extension Registry
export REGISTRY_MODE=$DEPLOYMENT_MODE
# RAG System
export RAG_MODE=$DEPLOYMENT_MODE
# AI Service
export AI_SERVICE_MODE=$DEPLOYMENT_MODE
# Provisioning Daemon
export DAEMON_MODE=$DEPLOYMENT_MODE
```
### 4. Build All Services
```bash
# Build all platform crates
cargo build --release -p vault-service \
-p extension-registry \
-p provisioning-rag \
-p ai-service \
-p provisioning-daemon \
-p orchestrator \
-p control-center \
-p mcp-server \
-p installer
```
### 5. Start Services (Order Matters)
```bash
# Start in dependency order:
# 1. Core infrastructure (KMS, storage)
cargo run --release -p vault-service &
# 2. Configuration and extensions
cargo run --release -p extension-registry &
# 3. AI/RAG layer
cargo run --release -p provisioning-rag &
cargo run --release -p ai-service &
# 4. Orchestration layer
cargo run --release -p orchestrator &
cargo run --release -p control-center &
cargo run --release -p mcp-server &
# 5. Background operations
cargo run --release -p provisioning-daemon &
# 6. Installer (optional, for new deployments)
cargo run --release -p installer &
```
### 6. Verify Services
```bash
# Check all services are running
pgrep -l "vault-service|extension-registry|provisioning-rag|ai-service"
# Test endpoints
curl http://localhost:8200/health # Vault
curl http://localhost:8081/health # Registry
curl http://localhost:8083/health # RAG
curl http://localhost:8082/health # AI Service
curl http://localhost:9090/health # Orchestrator
curl http://localhost:8080/health # Control Center
```
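Fixed sleeps before the checks above are fragile on slow machines; polling each health endpoint until it answers is more robust. A small helper sketch (the retry count and per-request timeout are assumptions, not platform defaults):

```shell
# Poll a health endpoint until it responds, or give up after N retries.
wait_for_health() {
  local url="$1" retries="${2:-30}"
  local i
  for i in $(seq "$retries"); do
    if curl -fsS --max-time 2 "$url" >/dev/null 2>&1; then
      echo "up: $url"
      return 0
    fi
    sleep 1
  done
  echo "timeout waiting for $url" >&2
  return 1
}

# Example: block until the core services answer before running workloads
# wait_for_health http://localhost:8200/health   # Vault
# wait_for_health http://localhost:9090/health   # Orchestrator
```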
---
## Solo Mode Deployment
**Perfect for**: Development, testing, learning
### Step 1: Verify Solo Configuration Files
```bash
# Check that solo schemas are available
ls -la provisioning/schemas/platform/defaults/deployment/solo-defaults.ncl
# Available schemas for each service:
# - provisioning/schemas/platform/schemas/vault-service.ncl
# - provisioning/schemas/platform/schemas/extension-registry.ncl
# - provisioning/schemas/platform/schemas/rag.ncl
# - provisioning/schemas/platform/schemas/ai-service.ncl
# - provisioning/schemas/platform/schemas/provisioning-daemon.ncl
```
### Step 2: Set Solo Environment Variables
```bash
# Set all services to solo mode
export VAULT_MODE=solo
export REGISTRY_MODE=solo
export RAG_MODE=solo
export AI_SERVICE_MODE=solo
export DAEMON_MODE=solo
# Verify settings
echo $VAULT_MODE # Should output: solo
```
### Step 3: Build Services
```bash
# Build in release mode for better performance
cargo build --release
```
### Step 4: Create Local Data Directories
```bash
# Create storage directories for solo mode
mkdir -p /tmp/provisioning-solo/{vault,registry,rag,ai,daemon}
chmod 755 /tmp/provisioning-solo/{vault,registry,rag,ai,daemon}
```
### Step 5: Start Services
```bash
# Start each service in a separate terminal or use tmux:
# Terminal 1: Vault
cargo run --release -p vault-service
# Terminal 2: Registry
cargo run --release -p extension-registry
# Terminal 3: RAG
cargo run --release -p provisioning-rag
# Terminal 4: AI Service
cargo run --release -p ai-service
# Terminal 5: Orchestrator
cargo run --release -p orchestrator
# Terminal 6: Control Center
cargo run --release -p control-center
# Terminal 7: Daemon
cargo run --release -p provisioning-daemon
```
### Step 6: Test Services
```bash
# Wait 10-15 seconds for services to start, then test
# Check service health
curl -s http://localhost:8200/health | jq .
curl -s http://localhost:8081/health | jq .
curl -s http://localhost:8083/health | jq .
# Try a simple operation
curl -X GET http://localhost:9090/api/v1/health
```
### Step 7: Verify Persistence (Optional)
```bash
# Check that data is stored locally
ls -la /tmp/provisioning-solo/vault/
ls -la /tmp/provisioning-solo/registry/
# Data should accumulate as you use the services
```
### Cleanup
```bash
# Stop all services
pkill -f "cargo run --release"
# Remove temporary data (optional)
rm -rf /tmp/provisioning-solo
```
---
## Multiuser Mode Deployment
**Perfect for**: Team environments, shared infrastructure
### Prerequisites
- **SurrealDB**: Running and accessible at `http://surrealdb:8000`
- **Network Access**: All machines can reach SurrealDB
- **DNS/Hostnames**: Services accessible via hostnames (not just localhost)
### Step 1: Deploy SurrealDB
```bash
# Using Docker (recommended)
docker run -d \
--name surrealdb \
-p 8000:8000 \
surrealdb/surrealdb:latest \
start --user root --pass root
# Or using native installation:
surreal start --user root --pass root
```
### Step 2: Verify SurrealDB Connectivity
```bash
# Test SurrealDB connection
curl -s http://localhost:8000/health
# An HTTP 200 status means the database is reachable
curl -s http://localhost:8000/version
# Should return something like: surrealdb-1.x.x
```
### Step 3: Set Multiuser Environment Variables
```bash
# Configure all services for multiuser mode
export VAULT_MODE=multiuser
export REGISTRY_MODE=multiuser
export RAG_MODE=multiuser
export AI_SERVICE_MODE=multiuser
export DAEMON_MODE=multiuser
# Set database connection
export SURREALDB_URL=http://surrealdb:8000
export SURREALDB_USER=root
export SURREALDB_PASS=root
# Set service hostnames (if not localhost)
export VAULT_SERVICE_HOST=vault.internal
export REGISTRY_HOST=registry.internal
export RAG_HOST=rag.internal
```
### Step 4: Build Services
```bash
cargo build --release
```
### Step 5: Create Shared Data Directories
```bash
# Create directories on shared storage (NFS, etc.)
mkdir -p /mnt/provisioning-data/{vault,registry,rag,ai}
chmod 755 /mnt/provisioning-data/{vault,registry,rag,ai}
# Or use local directories if on separate machines
mkdir -p /var/lib/provisioning/{vault,registry,rag,ai}
```
### Step 6: Start Services on Multiple Machines
```bash
# The blocks below are meant to be run inside an SSH session on each machine
# Machine 1: Infrastructure services
ssh ops@machine1
export VAULT_MODE=multiuser
cargo run --release -p vault-service &
cargo run --release -p extension-registry &
# Machine 2: AI services
ssh ops@machine2
export RAG_MODE=multiuser
export AI_SERVICE_MODE=multiuser
cargo run --release -p provisioning-rag &
cargo run --release -p ai-service &
# Machine 3: Orchestration
ssh ops@machine3
cargo run --release -p orchestrator &
cargo run --release -p control-center &
# Machine 4: Background tasks
ssh ops@machine4
export DAEMON_MODE=multiuser
cargo run --release -p provisioning-daemon &
```
### Step 7: Test Multi-Machine Setup
```bash
# From any machine, test cross-machine connectivity
curl -s http://machine1:8200/health
curl -s http://machine2:8083/health
curl -s http://machine3:9090/health
# Test integration
curl -X POST http://machine3:9090/api/v1/provision \
-H "Content-Type: application/json" \
-d '{"workspace": "test"}'
```
### Step 8: Enable User Access
```bash
# Create shared credentials
export VAULT_TOKEN=s.xxxxxxxxxxx
# Configure TLS (optional but recommended)
# Update configs to use https:// URLs
export VAULT_MODE=multiuser
# Edit provisioning/schemas/platform/schemas/vault-service.ncl
# Add TLS configuration in the schema definition
# See: provisioning/schemas/platform/validators/ for constraints
```
### Monitoring Multiuser Deployment
```bash
# Check each machine's services are connected to SurrealDB
# (substitute the health port of the service running on each host,
#  e.g. 8200 on machine1, 8083 on machine2)
for host in machine1 machine2 machine3 machine4; do
  ssh ops@$host "curl -s http://localhost:8200/health | jq .database_connected"
done
# Monitor SurrealDB
curl -s http://surrealdb:8000/version
```
---
## CICD Mode Deployment
**Perfect for**: GitHub Actions, GitLab CI, Jenkins, cloud automation
### Step 1: Understand Ephemeral Nature
CICD mode services:
- Don't persist data between runs
- Use in-memory storage
- Have RAG disabled
- Optimize for startup speed
- Suitable for containerized deployments
### Step 2: Set CICD Environment Variables
```bash
# Use cicd mode for all services
export VAULT_MODE=cicd
export REGISTRY_MODE=cicd
export RAG_MODE=cicd
export AI_SERVICE_MODE=cicd
export DAEMON_MODE=cicd
# Disable TLS (not needed in CI)
export CI_ENVIRONMENT=true
```
### Step 3: Containerize Services (Optional)
```dockerfile
# Dockerfile for CICD deployments
FROM rust:1.75-slim
WORKDIR /app
COPY . .
# Build all services
RUN cargo build --release
# Set CICD mode
ENV VAULT_MODE=cicd
ENV REGISTRY_MODE=cicd
ENV RAG_MODE=cicd
ENV AI_SERVICE_MODE=cicd
# Expose ports
EXPOSE 8200 8081 8083 8082 9090 8080
# Run services
CMD ["sh", "-c", "\
cargo run --release -p vault-service & \
cargo run --release -p extension-registry & \
cargo run --release -p provisioning-rag & \
cargo run --release -p ai-service & \
cargo run --release -p orchestrator & \
wait"]
```
### Step 4: GitHub Actions Example
```yaml
name: CICD Platform Deployment

on:
  push:
    branches: [main, develop]

jobs:
  test-deployment:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install Rust
        uses: actions-rs/toolchain@v1
        with:
          toolchain: 1.75
          profile: minimal
      - name: Set CICD Mode
        run: |
          echo "VAULT_MODE=cicd" >> $GITHUB_ENV
          echo "REGISTRY_MODE=cicd" >> $GITHUB_ENV
          echo "RAG_MODE=cicd" >> $GITHUB_ENV
          echo "AI_SERVICE_MODE=cicd" >> $GITHUB_ENV
          echo "DAEMON_MODE=cicd" >> $GITHUB_ENV
      - name: Build Services
        run: cargo build --release
      - name: Run Integration Tests
        run: |
          # Start services in background
          cargo run --release -p vault-service &
          cargo run --release -p extension-registry &
          cargo run --release -p orchestrator &
          # Wait for startup
          sleep 10
          # Run tests
          cargo test --release
      - name: Health Checks
        run: |
          curl -f http://localhost:8200/health
          curl -f http://localhost:8081/health
          curl -f http://localhost:9090/health

  deploy:
    needs: test-deployment
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v3
      - name: Deploy to Production
        run: |
          # Deploy production enterprise cluster
          ./scripts/deploy-enterprise.sh
```
### Step 5: Run CICD Tests
```bash
# Simulate CI environment locally
export VAULT_MODE=cicd
export CI_ENVIRONMENT=true
# Build
cargo build --release
# Run short-lived services for testing
timeout 30 cargo run --release -p vault-service &
timeout 30 cargo run --release -p extension-registry &
timeout 30 cargo run --release -p orchestrator &
# Run tests while services are running
sleep 5
cargo test --release
# Services auto-cleanup after timeout
```
---
## Enterprise Mode Deployment
**Perfect for**: Production, high availability, compliance
### Prerequisites
- **3+ Machines**: Minimum 3 for HA
- **Etcd Cluster**: For distributed consensus
- **Load Balancer**: HAProxy, nginx, or cloud LB
- **TLS Certificates**: Valid certificates for all services
- **Monitoring**: Prometheus, ELK, or cloud monitoring
- **Backup System**: Daily snapshots to S3 or similar
### Step 1: Deploy Infrastructure
#### 1.1 Deploy Etcd Cluster
```bash
# Node 1, 2, 3
etcd --name=node-1 \
--listen-client-urls=http://0.0.0.0:2379 \
--advertise-client-urls=http://node-1.internal:2379 \
--initial-cluster="node-1=http://node-1.internal:2380,node-2=http://node-2.internal:2380,node-3=http://node-3.internal:2380" \
--initial-cluster-state=new
# Verify cluster
etcdctl --endpoints=http://localhost:2379 member list
```
#### 1.2 Deploy Load Balancer
```haproxy
# HAProxy configuration for vault-service (example)
frontend vault_frontend
    bind *:8200
    mode tcp
    default_backend vault_backend

backend vault_backend
    mode tcp
    balance roundrobin
    server vault-1 10.0.1.10:8200 check
    server vault-2 10.0.1.11:8200 check
    server vault-3 10.0.1.12:8200 check
```
#### 1.3 Configure TLS
```bash
# Generate certificates (or use existing)
mkdir -p /etc/provisioning/tls
# For each service:
openssl req -x509 -newkey rsa:4096 \
-keyout /etc/provisioning/tls/vault-key.pem \
-out /etc/provisioning/tls/vault-cert.pem \
-days 365 -nodes \
-subj "/CN=vault.provisioning.prod"
# Set permissions
chmod 600 /etc/provisioning/tls/*-key.pem
chmod 644 /etc/provisioning/tls/*-cert.pem
```
### Step 2: Set Enterprise Environment Variables
```bash
# All machines: Set enterprise mode
export VAULT_MODE=enterprise
export REGISTRY_MODE=enterprise
export RAG_MODE=enterprise
export AI_SERVICE_MODE=enterprise
export DAEMON_MODE=enterprise
# Database cluster
export SURREALDB_URL="ws://surrealdb-cluster.internal:8000"
export SURREALDB_REPLICAS=3
# Etcd cluster
export ETCD_ENDPOINTS="http://node-1.internal:2379,http://node-2.internal:2379,http://node-3.internal:2379"
# TLS configuration
export TLS_CERT_PATH=/etc/provisioning/tls
export TLS_VERIFY=true
export TLS_CA_CERT=/etc/provisioning/tls/ca.crt
# Monitoring
export PROMETHEUS_URL=http://prometheus.internal:9090
export METRICS_ENABLED=true
export AUDIT_LOG_ENABLED=true
```
### Step 3: Deploy Services Across Cluster
```yaml
# Ansible playbook (simplified)
---
- hosts: provisioning_cluster
  tasks:
    - name: Build services
      shell: cargo build --release

    - name: Start vault-service (machine 1-3)
      shell: "cargo run --release -p vault-service"
      when: "'vault' in group_names"

    - name: Start orchestrator (machine 2-3)
      shell: "cargo run --release -p orchestrator"
      when: "'orchestrator' in group_names"

    - name: Start daemon (machine 3)
      shell: "cargo run --release -p provisioning-daemon"
      when: "'daemon' in group_names"

    - name: Verify cluster health
      uri:
        url: "https://{{ inventory_hostname }}:9090/health"
        validate_certs: yes
```
### Step 4: Monitor Cluster Health
```bash
# Check cluster status
curl -s https://vault.internal:8200/health | jq .state
# Check replication
curl -s https://orchestrator.internal:9090/api/v1/cluster/status
# Monitor etcd
etcdctl --endpoints=https://node-1.internal:2379 endpoint health
# Check which node is the current leader
etcdctl --endpoints=https://node-1.internal:2379 endpoint status --cluster -w table
```
### Step 5: Enable Monitoring & Alerting
```yaml
# Prometheus configuration
global:
  scrape_interval: 30s
  evaluation_interval: 30s

scrape_configs:
  - job_name: 'vault-service'
    scheme: https
    tls_config:
      ca_file: /etc/provisioning/tls/ca.crt
    static_configs:
      - targets: ['vault-1.internal:8200', 'vault-2.internal:8200', 'vault-3.internal:8200']

  - job_name: 'orchestrator'
    scheme: https
    static_configs:
      - targets: ['orch-1.internal:9090', 'orch-2.internal:9090', 'orch-3.internal:9090']
```
### Step 6: Backup & Recovery
```bash
#!/bin/bash
# Daily backup script
BACKUP_DIR="/mnt/provisioning-backups"
DATE=$(date +%Y%m%d_%H%M%S)
# Backup etcd
etcdctl --endpoints=https://node-1.internal:2379 \
snapshot save "$BACKUP_DIR/etcd-$DATE.db"
# Backup SurrealDB (GET /export returns a SQL dump; add the NS/DB
# headers your SurrealDB version expects)
curl -X GET https://surrealdb.internal:8000/export \
  -H "Authorization: Bearer $SURREALDB_TOKEN" \
  > "$BACKUP_DIR/surreal-$DATE.sql"
# Upload to S3
aws s3 cp "$BACKUP_DIR/etcd-$DATE.db" \
s3://provisioning-backups/etcd/
# Cleanup old backups (keep 30 days)
find "$BACKUP_DIR" -mtime +30 -delete
```
---
## Service Management
### Starting Services
#### Individual Service Startup
```bash
# Start one service
export VAULT_MODE=enterprise
cargo run --release -p vault-service
# In another terminal
export REGISTRY_MODE=enterprise
cargo run --release -p extension-registry
```
#### Batch Startup
```bash
#!/bin/bash
# Start all services in dependency order
set -e
MODE=${1:-solo}
export VAULT_MODE=$MODE
export REGISTRY_MODE=$MODE
export RAG_MODE=$MODE
export AI_SERVICE_MODE=$MODE
export DAEMON_MODE=$MODE
echo "Starting provisioning platform in $MODE mode..."
# Core services first
echo "Starting infrastructure..."
cargo run --release -p vault-service &
VAULT_PID=$!
echo "Starting extension registry..."
cargo run --release -p extension-registry &
REGISTRY_PID=$!
# AI layer
echo "Starting AI services..."
cargo run --release -p provisioning-rag &
RAG_PID=$!
cargo run --release -p ai-service &
AI_PID=$!
# Orchestration
echo "Starting orchestration..."
cargo run --release -p orchestrator &
ORCH_PID=$!
echo "All services started. PIDs: $VAULT_PID $REGISTRY_PID $RAG_PID $AI_PID $ORCH_PID"
```
### Stopping Services
```bash
# Stop all services gracefully
pkill -SIGTERM -f "cargo run --release -p"
# Wait for graceful shutdown
sleep 5
# Force kill if needed
pkill -9 -f "cargo run --release -p"
# Verify all stopped
pgrep -f "cargo run --release -p" && echo "Services still running" || echo "All stopped"
```
### Restarting Services
```bash
# Restart single service
pkill -SIGTERM vault-service
sleep 2
cargo run --release -p vault-service &
# Restart all services
./scripts/restart-all.sh $MODE
# Restart with config reload
export VAULT_MODE=multiuser
pkill -SIGTERM vault-service
sleep 2
cargo run --release -p vault-service &
```
### Checking Service Status
```bash
# Check running processes
pgrep -a "cargo run --release"
# Check listening ports
netstat -tlnp | grep -E "8200|8081|8083|8082|9090|8080"
# Or using ss (modern alternative)
ss -tlnp | grep -E "8200|8081|8083|8082|9090|8080"
# Health endpoint checks (bash 4+ associative array maps service to port)
declare -A port=( [vault]=8200 [registry]=8081 [rag]=8083 [ai]=8082 [orchestrator]=9090 )
for service in vault registry rag ai orchestrator; do
echo "=== $service ==="
curl -s "http://localhost:${port[$service]}/health" | jq .
done
```
---
## Health Checks & Monitoring
### Manual Health Verification
```bash
# Vault Service
curl -s http://localhost:8200/health | jq .
# Expected: {"status":"ok","uptime":123.45}
# Extension Registry
curl -s http://localhost:8081/health | jq .
# RAG System
curl -s http://localhost:8083/health | jq .
# Expected: {"status":"ok","embeddings":"ready","vector_db":"connected"}
# AI Service
curl -s http://localhost:8082/health | jq .
# Orchestrator
curl -s http://localhost:9090/health | jq .
# Control Center
curl -s http://localhost:8080/health | jq .
```
### Service Integration Tests
```bash
# Test vault <-> registry integration
curl -X POST http://localhost:8200/api/encrypt \
-H "Content-Type: application/json" \
-d '{"plaintext":"secret"}' | jq .
# Test RAG system
curl -X POST http://localhost:8083/api/ingest \
-H "Content-Type: application/json" \
-d '{"document":"test.md","content":"# Test"}' | jq .
# Test orchestrator
curl -X GET http://localhost:9090/api/v1/status | jq .
# End-to-end workflow
curl -X POST http://localhost:9090/api/v1/provision \
-H "Content-Type: application/json" \
-d '{
"workspace": "test",
"services": ["vault", "registry"],
"mode": "solo"
}' | jq .
```
### Monitoring Dashboards
#### Prometheus Metrics
```bash
# Query service uptime
curl -s 'http://prometheus:9090/api/v1/query?query=up' | jq .
# Query request rate
curl -s 'http://prometheus:9090/api/v1/query?query=rate(http_requests_total[5m])' | jq .
# Query error rate
curl -s 'http://prometheus:9090/api/v1/query?query=rate(http_errors_total[5m])' | jq .
```
#### Log Aggregation
```bash
# Follow vault logs
tail -f /var/log/provisioning/vault-service.log
# Follow all service logs
tail -f /var/log/provisioning/*.log
# Search for errors
grep -r "ERROR" /var/log/provisioning/
# Follow with filtering
tail -f /var/log/provisioning/orchestrator.log | grep -E "ERROR|WARN"
```
### Alerting
```yaml
# Prometheus alerting rules (loaded from prometheus.yml via rule_files)
groups:
  - name: provisioning
    rules:
      - alert: ServiceDown
        expr: up{job=~"vault|registry|rag|orchestrator"} == 0
        for: 5m
        annotations:
          summary: "{{ $labels.job }} is down"
      - alert: HighErrorRate
        expr: rate(http_errors_total[5m]) > 0.05
        annotations:
          summary: "High error rate detected"
      - alert: DiskSpaceWarning
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.2
        annotations:
          summary: "Disk space below 20%"
```
---
## Troubleshooting
### Service Won't Start
**Problem**: `error: failed to bind to port 8200`
**Solutions**:
```bash
# Check if port is in use
lsof -i :8200
ss -tlnp | grep 8200
# Kill existing process
pkill -9 -f vault-service
# Or use different port
export VAULT_SERVER_PORT=8201
cargo run --release -p vault-service
```
### Configuration Loading Fails
**Problem**: `error: failed to load config from mode file`
**Solutions**:
```bash
# Verify schemas exist
ls -la provisioning/schemas/platform/schemas/vault-service.ncl
# Validate schema syntax
nickel typecheck provisioning/schemas/platform/schemas/vault-service.ncl
# Check defaults are present
nickel typecheck provisioning/schemas/platform/defaults/vault-service-defaults.ncl
# Verify deployment mode overlay exists
ls -la provisioning/schemas/platform/defaults/deployment/$VAULT_MODE-defaults.ncl
# Run service with explicit mode
export VAULT_MODE=solo
cargo run --release -p vault-service
```
### Database Connection Issues
**Problem**: `error: failed to connect to database`
**Solutions**:
```bash
# Verify database is running
curl http://surrealdb:8000/health
etcdctl --endpoints=http://etcd:2379 endpoint health
# Check connectivity
nc -zv surrealdb 8000
nc -zv etcd 2379
# Update connection string
export SURREALDB_URL=ws://surrealdb:8000
export ETCD_ENDPOINTS=http://etcd:2379
# Restart service with new config
pkill -9 vault-service
cargo run --release -p vault-service
```
### Service Crashes on Startup
**Problem**: Service exits with code 1 or 139
**Solutions**:
```bash
# Run with verbose logging
RUST_LOG=debug cargo run -p vault-service 2>&1 | head -50
# Check system resources
free -h
df -h
# Check for core dumps
coredumpctl list
# Run under debugger (if crash suspected)
rust-gdb --args target/release/vault-service
```
### High Memory Usage
**Problem**: Service consuming > expected memory
**Solutions**:
```bash
# Check memory usage
ps aux | grep vault-service | grep -v grep
# Monitor over time
watch -n 1 'ps aux | grep vault-service | grep -v grep'
# Reduce worker count
export VAULT_SERVER_WORKERS=2
cargo run --release -p vault-service
# Check for memory leaks
valgrind --leak-check=full target/release/vault-service
```
### Network/DNS Issues
**Problem**: `error: failed to resolve hostname`
**Solutions**:
```bash
# Test DNS resolution
nslookup vault.internal
dig vault.internal
# Test connectivity to service
curl -v http://vault.internal:8200/health
# Add to /etc/hosts if needed (requires root)
echo "10.0.1.10 vault.internal" | sudo tee -a /etc/hosts
# Check network interface
ip addr show
netstat -nr
```
### Data Persistence Issues
**Problem**: Data lost after restart
**Solutions**:
```bash
# Verify backup exists
ls -la /mnt/provisioning-backups/
ls -la /var/lib/provisioning/
# Check disk space
df -h /var/lib/provisioning
# Verify file permissions
ls -l /var/lib/provisioning/vault/
chmod 755 /var/lib/provisioning/vault/*
# Restore from backup
./scripts/restore-backup.sh /mnt/provisioning-backups/vault-20260105.sql
```
### Debugging Checklist
When troubleshooting, use this systematic approach:
```bash
# 1. Check service is running
pgrep -f vault-service || echo "Service not running"
# 2. Check port is listening
ss -tlnp | grep 8200 || echo "Port not listening"
# 3. Check logs for errors
tail -20 /var/log/provisioning/vault-service.log | grep -i error
# 4. Test HTTP endpoint
curl -i http://localhost:8200/health
# 5. Check dependencies
curl http://surrealdb:8000/health
etcdctl --endpoints=http://etcd:2379 endpoint health
# 6. Check schema definition
nickel typecheck provisioning/schemas/platform/schemas/vault-service.ncl
# 7. Verify environment variables
env | grep -E "VAULT_|SURREALDB_|ETCD_"
# 8. Check system resources
free -h && df -h && top -bn1 | head -10
```
---
## Configuration Updates
### Updating Service Configuration
```bash
# 1. Edit the schema definition
vim provisioning/schemas/platform/schemas/vault-service.ncl
# 2. Update defaults if needed
vim provisioning/schemas/platform/defaults/vault-service-defaults.ncl
# 3. Validate syntax
nickel typecheck provisioning/schemas/platform/schemas/vault-service.ncl
# 4. Re-export configuration from schemas
./provisioning/.typedialog/platform/scripts/generate-configs.nu vault-service multiuser
# 5. Restart affected service (expect a brief interruption for a single instance)
pkill -SIGTERM vault-service
sleep 2
cargo run --release -p vault-service &
# 6. Verify configuration loaded
curl http://localhost:8200/api/config | jq .
```
### Mode Migration
```bash
# Migrate from solo to multiuser:
# 1. Stop services
pkill -SIGTERM -f "cargo run"
sleep 5
# 2. Backup current data
tar -czf /backup/provisioning-solo-$(date +%s).tar.gz /var/lib/provisioning/
# 3. Set new mode
export VAULT_MODE=multiuser
export REGISTRY_MODE=multiuser
export RAG_MODE=multiuser
# 4. Start services with new config
cargo run --release -p vault-service &
cargo run --release -p extension-registry &
# 5. Verify new mode
curl http://localhost:8200/api/config | jq .deployment_mode
```
---
## Production Checklist
Before deploying to production:
- [ ] All services compiled in release mode (`--release`)
- [ ] TLS certificates installed and valid
- [ ] Database cluster deployed and healthy
- [ ] Load balancer configured and routing traffic
- [ ] Monitoring and alerting configured
- [ ] Backup system tested and working
- [ ] High availability verified (failover tested)
- [ ] Security hardening applied (firewall rules, etc.)
- [ ] Documentation updated for your environment
- [ ] Team trained on deployment procedures
- [ ] Runbooks created for common operations
- [ ] Disaster recovery plan tested
---
## Getting Help
### Community Resources
- **GitHub Issues**: Report bugs at `github.com/your-org/provisioning/issues`
- **Documentation**: Full docs at `provisioning/docs/`
- **Slack Channel**: `#provisioning-platform`
### Internal Support
- **Platform Team**: [platform@your-org.com](mailto:platform@your-org.com)
- **On-Call**: Check PagerDuty for active rotation
- **Escalation**: Contact infrastructure leadership
### Useful Commands Reference
```bash
# View a service's CLI flags (pick the crate with -p in this workspace)
cargo run --release -p orchestrator -- --help
# View service schemas
ls -la provisioning/schemas/platform/schemas/
ls -la provisioning/schemas/platform/defaults/
# List running services
ps aux | grep cargo
# Monitor service logs in real-time
journalctl -fu provisioning-vault
# Generate diagnostics bundle
./scripts/generate-diagnostics.sh > /tmp/diagnostics-$(date +%s).tar.gz
```