# Platform Deployment Guide
**Version**: 1.0.0
**Last Updated**: 2026-01-05
**Target Audience**: DevOps Engineers, Platform Operators
**Status**: Production Ready
Practical guide for deploying the 9-service provisioning platform in any environment using mode-based configuration.
## Table of Contents
1. [Prerequisites](#prerequisites)
2. [Deployment Modes](#deployment-modes)
3. [Quick Start](#quick-start)
4. [Solo Mode Deployment](#solo-mode-deployment)
5. [Multiuser Mode Deployment](#multiuser-mode-deployment)
6. [CICD Mode Deployment](#cicd-mode-deployment)
7. [Enterprise Mode Deployment](#enterprise-mode-deployment)
8. [Service Management](#service-management)
9. [Health Checks & Monitoring](#health-checks--monitoring)
10. [Troubleshooting](#troubleshooting)
---
## Prerequisites
### Required Software
- **Rust**: 1.70+ (for building services)
- **Nickel**: Latest (for config validation)
- **Nushell**: 0.109.1+ (for scripts)
- **Cargo**: Included with Rust
- **Git**: For cloning and pulling updates
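Before building, a quick preflight catches missing tooling early. The sketch below is illustrative: the tool list mirrors the entries above (Nushell's binary is `nu`), and the version parsing assumes `rustc --version` output of the form `rustc 1.75.0 (...)`.

```shell
#!/usr/bin/env bash
# Preflight sketch: verify the required software is installed.
# The 1.70 minimum matches the Rust entry above.

# Extract the minor component from output like "rustc 1.75.0 (abc 2024-01-01)"
minor_version() { echo "$1" | awk '{print $2}' | cut -d. -f2; }

for tool in git cargo rustc nu nickel; do
  command -v "$tool" >/dev/null 2>&1 || echo "missing: $tool"
done

if command -v rustc >/dev/null 2>&1; then
  if [ "$(minor_version "$(rustc --version)")" -lt 70 ]; then
    echo "rustc older than 1.70; upgrade before building"
  fi
fi
```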
### Required Tools (Mode-Dependent)
| Tool | Solo | Multiuser | CICD | Enterprise |
|------|------|-----------|------|------------|
| Docker/Podman | No | Optional | Yes | Yes |
| SurrealDB | No | Yes | No | Yes |
| Etcd | No | No | No | Yes |
| PostgreSQL | No | Optional | No | Optional |
| OpenAI/Anthropic API | No | Optional | Yes | Yes |
### System Requirements
| Resource | Solo | Multiuser | CICD | Enterprise |
|----------|------|-----------|------|------------|
| CPU Cores | 2+ | 4+ | 8+ | 16+ |
| Memory | 2 GB | 4 GB | 8 GB | 16 GB |
| Disk | 10 GB | 50 GB | 100 GB | 500 GB |
| Network | Local | Local/Cloud | Cloud | HA Cloud |
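A preflight check against the Memory column above can flag an undersized host before services start thrashing. A sketch, assuming Linux procps `free -g` output (the 7th column of the `Mem:` row is available memory in GB):

```shell
# Warn when available memory falls below the per-mode minimum from the table above.
min_mem_gb() {
  case "$1" in
    solo)       echo 2  ;;
    multiuser)  echo 4  ;;
    cicd)       echo 8  ;;
    enterprise) echo 16 ;;
    *)          echo 0  ;;
  esac
}

mode="${DEPLOYMENT_MODE:-solo}"
avail_gb=$(free -g 2>/dev/null | awk '/^Mem:/ {print $7}')
if [ "${avail_gb:-0}" -lt "$(min_mem_gb "$mode")" ]; then
  echo "WARN: ${avail_gb:-unknown} GB available; $(min_mem_gb "$mode") GB recommended for $mode mode"
fi
```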
### Directory Structure
```bash
# Ensure base directories exist
mkdir -p provisioning/schemas/platform
mkdir -p provisioning/platform/logs
mkdir -p provisioning/platform/data
mkdir -p provisioning/.typedialog/platform
mkdir -p provisioning/config/runtime
```
---
## Deployment Modes
### Mode Selection Matrix
| Requirement | Recommended Mode |
|-------------|------------------|
| Development & testing | **solo** |
| Team environment (2-10 people) | **multiuser** |
| CI/CD pipelines & automation | **cicd** |
| Production with HA | **enterprise** |
### Mode Characteristics
#### Solo Mode
**Use Case**: Development, testing, demonstration
**Characteristics**:
- All services run locally with minimal resources
- Filesystem-based storage (no external databases)
- No TLS/SSL required
- Embedded/in-memory backends
- Single machine only
**Services Configuration**:
- 2-4 workers per service
- 30-60 second timeouts
- No replication or clustering
- Debug-level logging enabled
**Startup Time**: ~2-5 minutes
**Data Persistence**: Local files only
---
#### Multiuser Mode
**Use Case**: Team environments, shared infrastructure
**Characteristics**:
- Shared database backends (SurrealDB)
- Multiple concurrent users
- CORS and multi-user features enabled
- Optional TLS support
- 2-4 machines (or containerized)
**Services Configuration**:
- 4-6 workers per service
- 60-120 second timeouts
- Basic replication available
- Info-level logging
**Startup Time**: ~3-8 minutes (database dependent)
**Data Persistence**: SurrealDB (shared)
---
#### CICD Mode
**Use Case**: CI/CD pipelines, ephemeral environments
**Characteristics**:
- Ephemeral storage (memory, temporary)
- High throughput
- RAG system disabled
- Minimal logging
- Stateless services
**Services Configuration**:
- 8-12 workers per service
- 10-30 second timeouts
- No persistence
- Warn-level logging
**Startup Time**: ~1-2 minutes
**Data Persistence**: None (ephemeral)
---
#### Enterprise Mode
**Use Case**: Production, high availability, compliance
**Characteristics**:
- Distributed, replicated backends
- High availability (HA) clustering
- TLS/SSL encryption
- Audit logging
- Full monitoring and observability
**Services Configuration**:
- 16-32 workers per service
- 120-300 second timeouts
- Active replication across 3+ nodes
- Info-level logging with audit trails
**Startup Time**: ~5-15 minutes (cluster initialization)
**Data Persistence**: Replicated across cluster
---
## Quick Start
### 1. Clone Repository
```bash
git clone https://github.com/your-org/project-provisioning.git
cd project-provisioning
```
### 2. Select Deployment Mode
Choose your mode based on use case:
```bash
# For development
export DEPLOYMENT_MODE=solo
# For team environments
export DEPLOYMENT_MODE=multiuser
# For CI/CD
export DEPLOYMENT_MODE=cicd
# For production
export DEPLOYMENT_MODE=enterprise
```
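Because every per-service `*_MODE` variable is derived from `DEPLOYMENT_MODE`, a typo here propagates to all nine services. A minimal guard (the four mode names are the ones defined in this guide):

```shell
# Reject anything that is not one of the four supported deployment modes.
validate_mode() {
  case "$1" in
    solo|multiuser|cicd|enterprise) return 0 ;;
    *) echo "invalid DEPLOYMENT_MODE: '$1' (expected solo|multiuser|cicd|enterprise)" >&2
       return 1 ;;
  esac
}

validate_mode "${DEPLOYMENT_MODE:-solo}" || exit 1
```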
### 3. Set Environment Variables
All services use mode-specific TOML configs automatically loaded via environment variables:
```bash
# Vault Service
export VAULT_MODE=$DEPLOYMENT_MODE
# Extension Registry
export REGISTRY_MODE=$DEPLOYMENT_MODE
# RAG System
export RAG_MODE=$DEPLOYMENT_MODE
# AI Service
export AI_SERVICE_MODE=$DEPLOYMENT_MODE
# Provisioning Daemon
export DAEMON_MODE=$DEPLOYMENT_MODE
```
### 4. Build All Services
```bash
# Build all platform crates
cargo build --release -p vault-service \
-p extension-registry \
-p provisioning-rag \
-p ai-service \
-p provisioning-daemon \
-p orchestrator \
-p control-center \
-p mcp-server \
-p installer
```
### 5. Start Services (Order Matters)
```bash
# Start in dependency order:
# 1. Core infrastructure (KMS, storage)
cargo run --release -p vault-service &
# 2. Configuration and extensions
cargo run --release -p extension-registry &
# 3. AI/RAG layer
cargo run --release -p provisioning-rag &
cargo run --release -p ai-service &
# 4. Orchestration layer
cargo run --release -p orchestrator &
cargo run --release -p control-center &
cargo run --release -p mcp-server &
# 5. Background operations
cargo run --release -p provisioning-daemon &
# 6. Installer (optional, for new deployments)
cargo run --release -p installer &
```
### 6. Verify Services
```bash
# Check all services are running
pgrep -l "vault-service|extension-registry|provisioning-rag|ai-service"
# Test endpoints
curl http://localhost:8200/health # Vault
curl http://localhost:8081/health # Registry
curl http://localhost:8083/health # RAG
curl http://localhost:8082/health # AI Service
curl http://localhost:9090/health # Orchestrator
curl http://localhost:8080/health # Control Center
```
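Fixed sleeps before the checks above are fragile on slow machines; polling each health endpoint until it answers is more robust. A small helper sketch (the retry count and per-request timeout are assumptions, not platform defaults):

```shell
# Poll a health endpoint until it responds, or give up after N retries.
wait_for_health() {
  local url="$1" retries="${2:-30}"
  local i
  for i in $(seq "$retries"); do
    if curl -fsS --max-time 2 "$url" >/dev/null 2>&1; then
      echo "up: $url"
      return 0
    fi
    sleep 1
  done
  echo "timeout waiting for $url" >&2
  return 1
}

# Example: block until the core services answer before running workloads
# wait_for_health http://localhost:8200/health   # Vault
# wait_for_health http://localhost:9090/health   # Orchestrator
```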
---
## Solo Mode Deployment
**Perfect for**: Development, testing, learning
### Step 1: Verify Solo Configuration Files
```bash
# Check that solo schemas are available
ls -la provisioning/schemas/platform/defaults/deployment/solo-defaults.ncl
# Available schemas for each service:
# - provisioning/schemas/platform/schemas/vault-service.ncl
# - provisioning/schemas/platform/schemas/extension-registry.ncl
# - provisioning/schemas/platform/schemas/rag.ncl
# - provisioning/schemas/platform/schemas/ai-service.ncl
# - provisioning/schemas/platform/schemas/provisioning-daemon.ncl
```
### Step 2: Set Solo Environment Variables
```bash
# Set all services to solo mode
export VAULT_MODE=solo
export REGISTRY_MODE=solo
export RAG_MODE=solo
export AI_SERVICE_MODE=solo
export DAEMON_MODE=solo
# Verify settings
echo $VAULT_MODE # Should output: solo
```
### Step 3: Build Services
```bash
# Build in release mode for better performance
cargo build --release
```
### Step 4: Create Local Data Directories
```bash
# Create storage directories for solo mode
mkdir -p /tmp/provisioning-solo/{vault,registry,rag,ai,daemon}
chmod 755 /tmp/provisioning-solo/{vault,registry,rag,ai,daemon}
```
### Step 5: Start Services
```bash
# Start each service in a separate terminal or use tmux:
# Terminal 1: Vault
cargo run --release -p vault-service
# Terminal 2: Registry
cargo run --release -p extension-registry
# Terminal 3: RAG
cargo run --release -p provisioning-rag
# Terminal 4: AI Service
cargo run --release -p ai-service
# Terminal 5: Orchestrator
cargo run --release -p orchestrator
# Terminal 6: Control Center
cargo run --release -p control-center
# Terminal 7: Daemon
cargo run --release -p provisioning-daemon
```
### Step 6: Test Services
```bash
# Wait 10-15 seconds for services to start, then test
# Check service health
curl -s http://localhost:8200/health | jq .
curl -s http://localhost:8081/health | jq .
curl -s http://localhost:8083/health | jq .
# Try a simple operation
curl -X GET http://localhost:9090/api/v1/health
```
### Step 7: Verify Persistence (Optional)
```bash
# Check that data is stored locally
ls -la /tmp/provisioning-solo/vault/
ls -la /tmp/provisioning-solo/registry/
# Data should accumulate as you use the services
```
### Cleanup
```bash
# Stop all services
pkill -f "cargo run --release"
# Remove temporary data (optional)
rm -rf /tmp/provisioning-solo
```
---
## Multiuser Mode Deployment
**Perfect for**: Team environments, shared infrastructure
### Prerequisites
- **SurrealDB**: Running and accessible at `http://surrealdb:8000`
- **Network Access**: All machines can reach SurrealDB
- **DNS/Hostnames**: Services accessible via hostnames (not just localhost)
### Step 1: Deploy SurrealDB
```bash
# Using Docker (recommended)
docker run -d \
--name surrealdb \
-p 8000:8000 \
surrealdb/surrealdb:latest \
start --user root --pass root
# Or using native installation:
surreal start --user root --pass root
```
### Step 2: Verify SurrealDB Connectivity
```bash
# Test SurrealDB connection
curl -s http://localhost:8000/health
# An HTTP 200 status means the database is reachable
curl -s http://localhost:8000/version
# Should return something like: surrealdb-1.x.x
```
### Step 3: Set Multiuser Environment Variables
```bash
# Configure all services for multiuser mode
export VAULT_MODE=multiuser
export REGISTRY_MODE=multiuser
export RAG_MODE=multiuser
export AI_SERVICE_MODE=multiuser
export DAEMON_MODE=multiuser
# Set database connection
export SURREALDB_URL=http://surrealdb:8000
export SURREALDB_USER=root
export SURREALDB_PASS=root
# Set service hostnames (if not localhost)
export VAULT_SERVICE_HOST=vault.internal
export REGISTRY_HOST=registry.internal
export RAG_HOST=rag.internal
```
### Step 4: Build Services
```bash
cargo build --release
```
### Step 5: Create Shared Data Directories
```bash
# Create directories on shared storage (NFS, etc.)
mkdir -p /mnt/provisioning-data/{vault,registry,rag,ai}
chmod 755 /mnt/provisioning-data/{vault,registry,rag,ai}
# Or use local directories if on separate machines
mkdir -p /var/lib/provisioning/{vault,registry,rag,ai}
```
### Step 6: Start Services on Multiple Machines
```bash
# The blocks below are meant to be run inside an SSH session on each machine
# Machine 1: Infrastructure services
ssh ops@machine1
export VAULT_MODE=multiuser
cargo run --release -p vault-service &
cargo run --release -p extension-registry &
# Machine 2: AI services
ssh ops@machine2
export RAG_MODE=multiuser
export AI_SERVICE_MODE=multiuser
cargo run --release -p provisioning-rag &
cargo run --release -p ai-service &
# Machine 3: Orchestration
ssh ops@machine3
cargo run --release -p orchestrator &
cargo run --release -p control-center &
# Machine 4: Background tasks
ssh ops@machine4
export DAEMON_MODE=multiuser
cargo run --release -p provisioning-daemon &
```
### Step 7: Test Multi-Machine Setup
```bash
# From any machine, test cross-machine connectivity
curl -s http://machine1:8200/health
curl -s http://machine2:8083/health
curl -s http://machine3:9090/health
# Test integration
curl -X POST http://machine3:9090/api/v1/provision \
-H "Content-Type: application/json" \
-d '{"workspace": "test"}'
```
### Step 8: Enable User Access
```bash
# Create shared credentials
export VAULT_TOKEN=s.xxxxxxxxxxx
# Configure TLS (optional but recommended)
# Update configs to use https:// URLs
export VAULT_MODE=multiuser
# Edit provisioning/schemas/platform/schemas/vault-service.ncl
# Add TLS configuration in the schema definition
# See: provisioning/schemas/platform/validators/ for constraints
```
### Monitoring Multiuser Deployment
```bash
# Check each machine's services are connected to SurrealDB
# (substitute the health port of the service running on each host,
#  e.g. 8200 on machine1, 8083 on machine2)
for host in machine1 machine2 machine3 machine4; do
  ssh ops@$host "curl -s http://localhost:8200/health | jq .database_connected"
done
# Monitor SurrealDB
curl -s http://surrealdb:8000/version
```
---
## CICD Mode Deployment
**Perfect for**: GitHub Actions, GitLab CI, Jenkins, cloud automation
### Step 1: Understand Ephemeral Nature
CICD mode services:
- Don't persist data between runs
- Use in-memory storage
- Have RAG disabled
- Optimize for startup speed
- Suitable for containerized deployments
### Step 2: Set CICD Environment Variables
```bash
# Use cicd mode for all services
export VAULT_MODE=cicd
export REGISTRY_MODE=cicd
export RAG_MODE=cicd
export AI_SERVICE_MODE=cicd
export DAEMON_MODE=cicd
# Disable TLS (not needed in CI)
export CI_ENVIRONMENT=true
```
### Step 3: Containerize Services (Optional)
```dockerfile
# Dockerfile for CICD deployments
FROM rust:1.75-slim
WORKDIR /app
COPY . .
# Build all services
RUN cargo build --release
# Set CICD mode
ENV VAULT_MODE=cicd
ENV REGISTRY_MODE=cicd
ENV RAG_MODE=cicd
ENV AI_SERVICE_MODE=cicd
# Expose ports
EXPOSE 8200 8081 8083 8082 9090 8080
# Run services
CMD ["sh", "-c", "\
cargo run --release -p vault-service & \
cargo run --release -p extension-registry & \
cargo run --release -p provisioning-rag & \
cargo run --release -p ai-service & \
cargo run --release -p orchestrator & \
wait"]
```
### Step 4: GitHub Actions Example
```yaml
name: CICD Platform Deployment

on:
  push:
    branches: [main, develop]

jobs:
  test-deployment:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Install Rust
        uses: actions-rs/toolchain@v1
        with:
          toolchain: 1.75
          profile: minimal
      - name: Set CICD Mode
        run: |
          echo "VAULT_MODE=cicd" >> $GITHUB_ENV
          echo "REGISTRY_MODE=cicd" >> $GITHUB_ENV
          echo "RAG_MODE=cicd" >> $GITHUB_ENV
          echo "AI_SERVICE_MODE=cicd" >> $GITHUB_ENV
          echo "DAEMON_MODE=cicd" >> $GITHUB_ENV
      - name: Build Services
        run: cargo build --release
      - name: Run Integration Tests
        run: |
          # Start services in background
          cargo run --release -p vault-service &
          cargo run --release -p extension-registry &
          cargo run --release -p orchestrator &
          # Wait for startup
          sleep 10
          # Run tests
          cargo test --release
      - name: Health Checks
        run: |
          curl -f http://localhost:8200/health
          curl -f http://localhost:8081/health
          curl -f http://localhost:9090/health

  deploy:
    needs: test-deployment
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v3
      - name: Deploy to Production
        run: |
          # Deploy production enterprise cluster
          ./scripts/deploy-enterprise.sh
```
### Step 5: Run CICD Tests
```bash
# Simulate CI environment locally
export VAULT_MODE=cicd
export CI_ENVIRONMENT=true
# Build
cargo build --release
# Run short-lived services for testing
timeout 30 cargo run --release -p vault-service &
timeout 30 cargo run --release -p extension-registry &
timeout 30 cargo run --release -p orchestrator &
# Run tests while services are running
sleep 5
cargo test --release
# Services auto-cleanup after timeout
```
---
## Enterprise Mode Deployment
**Perfect for**: Production, high availability, compliance
### Prerequisites
- **3+ Machines**: Minimum 3 for HA
- **Etcd Cluster**: For distributed consensus
- **Load Balancer**: HAProxy, nginx, or cloud LB
- **TLS Certificates**: Valid certificates for all services
- **Monitoring**: Prometheus, ELK, or cloud monitoring
- **Backup System**: Daily snapshots to S3 or similar
### Step 1: Deploy Infrastructure
#### 1.1 Deploy Etcd Cluster
```bash
# Node 1, 2, 3
etcd --name=node-1 \
--listen-client-urls=http://0.0.0.0:2379 \
--advertise-client-urls=http://node-1.internal:2379 \
--initial-cluster="node-1=http://node-1.internal:2380,node-2=http://node-2.internal:2380,node-3=http://node-3.internal:2380" \
--initial-cluster-state=new
# Verify cluster
etcdctl --endpoints=http://localhost:2379 member list
```
#### 1.2 Deploy Load Balancer
```haproxy
# HAProxy configuration for vault-service (example)
frontend vault_frontend
    bind *:8200
    mode tcp
    default_backend vault_backend

backend vault_backend
    mode tcp
    balance roundrobin
    server vault-1 10.0.1.10:8200 check
    server vault-2 10.0.1.11:8200 check
    server vault-3 10.0.1.12:8200 check
```
#### 1.3 Configure TLS
```bash
# Generate certificates (or use existing)
mkdir -p /etc/provisioning/tls
# For each service:
openssl req -x509 -newkey rsa:4096 \
-keyout /etc/provisioning/tls/vault-key.pem \
-out /etc/provisioning/tls/vault-cert.pem \
-days 365 -nodes \
-subj "/CN=vault.provisioning.prod"
# Set permissions
chmod 600 /etc/provisioning/tls/*-key.pem
chmod 644 /etc/provisioning/tls/*-cert.pem
```
### Step 2: Set Enterprise Environment Variables
```bash
# All machines: Set enterprise mode
export VAULT_MODE=enterprise
export REGISTRY_MODE=enterprise
export RAG_MODE=enterprise
export AI_SERVICE_MODE=enterprise
export DAEMON_MODE=enterprise
# Database cluster
export SURREALDB_URL="ws://surrealdb-cluster.internal:8000"
export SURREALDB_REPLICAS=3
# Etcd cluster
export ETCD_ENDPOINTS="http://node-1.internal:2379,http://node-2.internal:2379,http://node-3.internal:2379"
# TLS configuration
export TLS_CERT_PATH=/etc/provisioning/tls
export TLS_VERIFY=true
export TLS_CA_CERT=/etc/provisioning/tls/ca.crt
# Monitoring
export PROMETHEUS_URL=http://prometheus.internal:9090
export METRICS_ENABLED=true
export AUDIT_LOG_ENABLED=true
```
### Step 3: Deploy Services Across Cluster
```yaml
# Ansible playbook (simplified)
---
- hosts: provisioning_cluster
  tasks:
    - name: Build services
      shell: cargo build --release

    - name: Start vault-service (machine 1-3)
      shell: "cargo run --release -p vault-service"
      when: "'vault' in group_names"

    - name: Start orchestrator (machine 2-3)
      shell: "cargo run --release -p orchestrator"
      when: "'orchestrator' in group_names"

    - name: Start daemon (machine 3)
      shell: "cargo run --release -p provisioning-daemon"
      when: "'daemon' in group_names"

    - name: Verify cluster health
      uri:
        url: "https://{{ inventory_hostname }}:9090/health"
        validate_certs: yes
```
### Step 4: Monitor Cluster Health
```bash
# Check cluster status
curl -s https://vault.internal:8200/health | jq .state
# Check replication
curl -s https://orchestrator.internal:9090/api/v1/cluster/status
# Monitor etcd
etcdctl --endpoints=https://node-1.internal:2379 endpoint health
# Check which node is the current leader
etcdctl --endpoints=https://node-1.internal:2379 endpoint status --cluster -w table
```
### Step 5: Enable Monitoring & Alerting
```yaml
# Prometheus configuration
global:
  scrape_interval: 30s
  evaluation_interval: 30s

scrape_configs:
  - job_name: 'vault-service'
    scheme: https
    tls_config:
      ca_file: /etc/provisioning/tls/ca.crt
    static_configs:
      - targets: ['vault-1.internal:8200', 'vault-2.internal:8200', 'vault-3.internal:8200']

  - job_name: 'orchestrator'
    scheme: https
    static_configs:
      - targets: ['orch-1.internal:9090', 'orch-2.internal:9090', 'orch-3.internal:9090']
```
### Step 6: Backup & Recovery
```bash
#!/bin/bash
# Daily backup script
BACKUP_DIR="/mnt/provisioning-backups"
DATE=$(date +%Y%m%d_%H%M%S)
# Backup etcd
etcdctl --endpoints=https://node-1.internal:2379 \
snapshot save "$BACKUP_DIR/etcd-$DATE.db"
# Backup SurrealDB (GET /export returns a SQL dump; add the NS/DB
# headers your SurrealDB version expects)
curl -X GET https://surrealdb.internal:8000/export \
  -H "Authorization: Bearer $SURREALDB_TOKEN" \
  > "$BACKUP_DIR/surreal-$DATE.sql"
# Upload to S3
aws s3 cp "$BACKUP_DIR/etcd-$DATE.db" \
s3://provisioning-backups/etcd/
# Cleanup old backups (keep 30 days)
find "$BACKUP_DIR" -mtime +30 -delete
```
---
## Service Management
### Starting Services
#### Individual Service Startup
```bash
# Start one service
export VAULT_MODE=enterprise
cargo run --release -p vault-service
# In another terminal
export REGISTRY_MODE=enterprise
cargo run --release -p extension-registry
```
#### Batch Startup
```bash
#!/bin/bash
# Start all services in dependency order
set -e
MODE=${1:-solo}
export VAULT_MODE=$MODE
export REGISTRY_MODE=$MODE
export RAG_MODE=$MODE
export AI_SERVICE_MODE=$MODE
export DAEMON_MODE=$MODE
echo "Starting provisioning platform in $MODE mode..."
# Core services first
echo "Starting infrastructure..."
cargo run --release -p vault-service &
VAULT_PID=$!
echo "Starting extension registry..."
cargo run --release -p extension-registry &
REGISTRY_PID=$!
# AI layer
echo "Starting AI services..."
cargo run --release -p provisioning-rag &
RAG_PID=$!
cargo run --release -p ai-service &
AI_PID=$!
# Orchestration
echo "Starting orchestration..."
cargo run --release -p orchestrator &
ORCH_PID=$!
echo "All services started. PIDs: $VAULT_PID $REGISTRY_PID $RAG_PID $AI_PID $ORCH_PID"
```
### Stopping Services
```bash
# Stop all services gracefully
pkill -SIGTERM -f "cargo run --release -p"
# Wait for graceful shutdown
sleep 5
# Force kill if needed
pkill -9 -f "cargo run --release -p"
# Verify all stopped
pgrep -f "cargo run --release -p" && echo "Services still running" || echo "All stopped"
```
### Restarting Services
```bash
# Restart single service
pkill -SIGTERM vault-service
sleep 2
cargo run --release -p vault-service &
# Restart all services
./scripts/restart-all.sh $MODE
# Restart with config reload
export VAULT_MODE=multiuser
pkill -SIGTERM vault-service
sleep 2
cargo run --release -p vault-service &
```
### Checking Service Status
```bash
# Check running processes
pgrep -a "cargo run --release"
# Check listening ports
netstat -tlnp | grep -E "8200|8081|8083|8082|9090|8080"
# Or using ss (modern alternative)
ss -tlnp | grep -E "8200|8081|8083|8082|9090|8080"
# Health endpoint checks (bash 4+ associative array maps service to port)
declare -A port=( [vault]=8200 [registry]=8081 [rag]=8083 [ai]=8082 [orchestrator]=9090 )
for service in vault registry rag ai orchestrator; do
echo "=== $service ==="
curl -s "http://localhost:${port[$service]}/health" | jq .
done
```
---
## Health Checks & Monitoring
### Manual Health Verification
```bash
# Vault Service
curl -s http://localhost:8200/health | jq .
# Expected: {"status":"ok","uptime":123.45}
# Extension Registry
curl -s http://localhost:8081/health | jq .
# RAG System
curl -s http://localhost:8083/health | jq .
# Expected: {"status":"ok","embeddings":"ready","vector_db":"connected"}
# AI Service
curl -s http://localhost:8082/health | jq .
# Orchestrator
curl -s http://localhost:9090/health | jq .
# Control Center
curl -s http://localhost:8080/health | jq .
```
### Service Integration Tests
```bash
# Test vault <-> registry integration
curl -X POST http://localhost:8200/api/encrypt \
-H "Content-Type: application/json" \
-d '{"plaintext":"secret"}' | jq .
# Test RAG system
curl -X POST http://localhost:8083/api/ingest \
-H "Content-Type: application/json" \
-d '{"document":"test.md","content":"# Test"}' | jq .
# Test orchestrator
curl -X GET http://localhost:9090/api/v1/status | jq .
# End-to-end workflow
curl -X POST http://localhost:9090/api/v1/provision \
-H "Content-Type: application/json" \
-d '{
"workspace": "test",
"services": ["vault", "registry"],
"mode": "solo"
}' | jq .
```
### Monitoring Dashboards
#### Prometheus Metrics
```bash
# Query service uptime
curl -s 'http://prometheus:9090/api/v1/query?query=up' | jq .
# Query request rate
curl -s 'http://prometheus:9090/api/v1/query?query=rate(http_requests_total[5m])' | jq .
# Query error rate
curl -s 'http://prometheus:9090/api/v1/query?query=rate(http_errors_total[5m])' | jq .
```
#### Log Aggregation
```bash
# Follow vault logs
tail -f /var/log/provisioning/vault-service.log
# Follow all service logs
tail -f /var/log/provisioning/*.log
# Search for errors
grep -r "ERROR" /var/log/provisioning/
# Follow with filtering
tail -f /var/log/provisioning/orchestrator.log | grep -E "ERROR|WARN"
```
### Alerting
```yaml
# Prometheus alerting rules (loaded from prometheus.yml via rule_files)
groups:
  - name: provisioning
    rules:
      - alert: ServiceDown
        expr: up{job=~"vault|registry|rag|orchestrator"} == 0
        for: 5m
        annotations:
          summary: "{{ $labels.job }} is down"
      - alert: HighErrorRate
        expr: rate(http_errors_total[5m]) > 0.05
        annotations:
          summary: "High error rate detected"
      - alert: DiskSpaceWarning
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.2
        annotations:
          summary: "Disk space below 20%"
```
---
## Troubleshooting
### Service Won't Start
**Problem**: `error: failed to bind to port 8200`
**Solutions**:
```bash
# Check if port is in use
lsof -i :8200
ss -tlnp | grep 8200
# Kill existing process
pkill -9 -f vault-service
# Or use different port
export VAULT_SERVER_PORT=8201
cargo run --release -p vault-service
```
### Configuration Loading Fails
**Problem**: `error: failed to load config from mode file`
**Solutions**:
```bash
# Verify schemas exist
ls -la provisioning/schemas/platform/schemas/vault-service.ncl
# Validate schema syntax
nickel typecheck provisioning/schemas/platform/schemas/vault-service.ncl
# Check defaults are present
nickel typecheck provisioning/schemas/platform/defaults/vault-service-defaults.ncl
# Verify deployment mode overlay exists
ls -la provisioning/schemas/platform/defaults/deployment/$VAULT_MODE-defaults.ncl
# Run service with explicit mode
export VAULT_MODE=solo
cargo run --release -p vault-service
```
### Database Connection Issues
**Problem**: `error: failed to connect to database`
**Solutions**:
```bash
# Verify database is running
curl http://surrealdb:8000/health
etcdctl --endpoints=http://etcd:2379 endpoint health
# Check connectivity
nc -zv surrealdb 8000
nc -zv etcd 2379
# Update connection string
export SURREALDB_URL=ws://surrealdb:8000
export ETCD_ENDPOINTS=http://etcd:2379
# Restart service with new config
pkill -9 vault-service
cargo run --release -p vault-service
```
### Service Crashes on Startup
**Problem**: Service exits with code 1 or 139
**Solutions**:
```bash
# Run with verbose logging
RUST_LOG=debug cargo run -p vault-service 2>&1 | head -50
# Check system resources
free -h
df -h
# Check for core dumps
coredumpctl list
# Run under debugger (if crash suspected)
rust-gdb --args target/release/vault-service
```
### High Memory Usage
**Problem**: Service consuming > expected memory
**Solutions**:
```bash
# Check memory usage
ps aux | grep vault-service | grep -v grep
# Monitor over time
watch -n 1 'ps aux | grep vault-service | grep -v grep'
# Reduce worker count
export VAULT_SERVER_WORKERS=2
cargo run --release -p vault-service
# Check for memory leaks
valgrind --leak-check=full target/release/vault-service
```
### Network/DNS Issues
**Problem**: `error: failed to resolve hostname`
**Solutions**:
```bash
# Test DNS resolution
nslookup vault.internal
dig vault.internal
# Test connectivity to service
curl -v http://vault.internal:8200/health
# Add to /etc/hosts if needed (requires root)
echo "10.0.1.10 vault.internal" | sudo tee -a /etc/hosts
# Check network interface
ip addr show
netstat -nr
```
### Data Persistence Issues
**Problem**: Data lost after restart
**Solutions**:
```bash
# Verify backup exists
ls -la /mnt/provisioning-backups/
ls -la /var/lib/provisioning/
# Check disk space
df -h /var/lib/provisioning
# Verify file permissions
ls -l /var/lib/provisioning/vault/
chmod 755 /var/lib/provisioning/vault/*
# Restore from backup
./scripts/restore-backup.sh /mnt/provisioning-backups/vault-20260105.sql
```
### Debugging Checklist
When troubleshooting, use this systematic approach:
```bash
# 1. Check service is running
pgrep -f vault-service || echo "Service not running"
# 2. Check port is listening
ss -tlnp | grep 8200 || echo "Port not listening"
# 3. Check logs for errors
tail -20 /var/log/provisioning/vault-service.log | grep -i error
# 4. Test HTTP endpoint
curl -i http://localhost:8200/health
# 5. Check dependencies
curl http://surrealdb:8000/health
etcdctl --endpoints=http://etcd:2379 endpoint health
# 6. Check schema definition
nickel typecheck provisioning/schemas/platform/schemas/vault-service.ncl
# 7. Verify environment variables
env | grep -E "VAULT_|SURREALDB_|ETCD_"
# 8. Check system resources
free -h && df -h && top -bn1 | head -10
```
---
## Configuration Updates
### Updating Service Configuration
```bash
# 1. Edit the schema definition
vim provisioning/schemas/platform/schemas/vault-service.ncl
# 2. Update defaults if needed
vim provisioning/schemas/platform/defaults/vault-service-defaults.ncl
# 3. Validate syntax
nickel typecheck provisioning/schemas/platform/schemas/vault-service.ncl
# 4. Re-export configuration from schemas
./provisioning/.typedialog/platform/scripts/generate-configs.nu vault-service multiuser
# 5. Restart affected service (expect a brief interruption for a single instance)
pkill -SIGTERM vault-service
sleep 2
cargo run --release -p vault-service &
# 6. Verify configuration loaded
curl http://localhost:8200/api/config | jq .
```
### Mode Migration
```bash
# Migrate from solo to multiuser:
# 1. Stop services
pkill -SIGTERM -f "cargo run"
sleep 5
# 2. Backup current data
tar -czf /backup/provisioning-solo-$(date +%s).tar.gz /var/lib/provisioning/
# 3. Set new mode
export VAULT_MODE=multiuser
export REGISTRY_MODE=multiuser
export RAG_MODE=multiuser
# 4. Start services with new config
cargo run --release -p vault-service &
cargo run --release -p extension-registry &
# 5. Verify new mode
curl http://localhost:8200/api/config | jq .deployment_mode
```
---
## Production Checklist
Before deploying to production:
- [ ] All services compiled in release mode (`--release`)
- [ ] TLS certificates installed and valid
- [ ] Database cluster deployed and healthy
- [ ] Load balancer configured and routing traffic
- [ ] Monitoring and alerting configured
- [ ] Backup system tested and working
- [ ] High availability verified (failover tested)
- [ ] Security hardening applied (firewall rules, etc.)
- [ ] Documentation updated for your environment
- [ ] Team trained on deployment procedures
- [ ] Runbooks created for common operations
- [ ] Disaster recovery plan tested
---
## Getting Help
### Community Resources
- **GitHub Issues**: Report bugs at `github.com/your-org/provisioning/issues`
- **Documentation**: Full docs at `provisioning/docs/`
- **Slack Channel**: `#provisioning-platform`
### Internal Support
- **Platform Team**: [platform@your-org.com](mailto:platform@your-org.com)
- **On-Call**: Check PagerDuty for active rotation
- **Escalation**: Contact infrastructure leadership
### Useful Commands Reference
```bash
# View a service's CLI flags (pick the crate with -p in this workspace)
cargo run --release -p orchestrator -- --help
# View service schemas
ls -la provisioning/schemas/platform/schemas/
ls -la provisioning/schemas/platform/defaults/
# List running services
ps aux | grep cargo
# Monitor service logs in real-time
journalctl -fu provisioning-vault
# Generate diagnostics bundle
./scripts/generate-diagnostics.sh > /tmp/diagnostics-$(date +%s).tar.gz
```