# Platform Deployment Guide

**Version**: 1.0.0
**Last Updated**: 2026-01-05
**Target Audience**: DevOps Engineers, Platform Operators
**Status**: Production Ready

Practical guide for deploying the 9-service provisioning platform in any environment using mode-based configuration.

## Table of Contents

1. [Prerequisites](#prerequisites)
2. [Deployment Modes](#deployment-modes)
3. [Quick Start](#quick-start)
4. [Solo Mode Deployment](#solo-mode-deployment)
5. [Multiuser Mode Deployment](#multiuser-mode-deployment)
6. [CICD Mode Deployment](#cicd-mode-deployment)
7. [Enterprise Mode Deployment](#enterprise-mode-deployment)
8. [Service Management](#service-management)
9. [Health Checks & Monitoring](#health-checks--monitoring)
10. [Troubleshooting](#troubleshooting)

---

## Prerequisites

### Required Software

- **Rust**: 1.70+ (for building services)
- **Nickel**: Latest (for config validation)
- **Nushell**: 0.109.1+ (for scripts)
- **Cargo**: Included with Rust
- **Git**: For cloning and pulling updates

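Before building anything, it's worth confirming the toolchain is actually present. A minimal preflight sketch (the tool names come from the list above; the script itself is illustrative):

```bash
#!/bin/sh
# Preflight: report whether each required tool is on PATH, with its version.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    printf '%-8s %s\n' "$1" "$("$1" --version 2>/dev/null | head -n1)"
  else
    printf '%-8s MISSING\n' "$1"
  fi
}

for tool in rustc cargo nu nickel git; do
  check_tool "$tool"
done
```

Run it from the repository root before `cargo build`; a `MISSING` line means the corresponding prerequisite still needs installing.
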
### Required Tools (Mode-Dependent)

| Tool | Solo | Multiuser | CICD | Enterprise |
|------|------|-----------|------|------------|
| Docker/Podman | No | Optional | Yes | Yes |
| SurrealDB | No | Yes | No | Yes |
| Etcd | No | No | No | Yes |
| PostgreSQL | No | Optional | No | Optional |
| OpenAI/Anthropic API | No | Optional | Yes | Yes |

### System Requirements

| Resource | Solo | Multiuser | CICD | Enterprise |
|----------|------|-----------|------|------------|
| CPU Cores | 2+ | 4+ | 8+ | 16+ |
| Memory | 2 GB | 4 GB | 8 GB | 16 GB |
| Disk | 10 GB | 50 GB | 100 GB | 500 GB |
| Network | Local | Local/Cloud | Cloud | HA Cloud |

### Directory Structure

```bash
# Ensure base directories exist
mkdir -p provisioning/schemas/platform
mkdir -p provisioning/platform/logs
mkdir -p provisioning/platform/data
mkdir -p provisioning/.typedialog/platform
mkdir -p provisioning/config/runtime
```

---

## Deployment Modes

### Mode Selection Matrix

| Requirement | Recommended Mode |
|-------------|------------------|
| Development & testing | **solo** |
| Team environment (2-10 people) | **multiuser** |
| CI/CD pipelines & automation | **cicd** |
| Production with HA | **enterprise** |

### Mode Characteristics

#### Solo Mode

**Use Case**: Development, testing, demonstration

**Characteristics**:

- All services run locally with minimal resources
- Filesystem-based storage (no external databases)
- No TLS/SSL required
- Embedded/in-memory backends
- Single machine only

**Services Configuration**:

- 2-4 workers per service
- 30-60 second timeouts
- No replication or clustering
- Debug-level logging enabled

**Startup Time**: ~2-5 minutes
**Data Persistence**: Local files only

---

#### Multiuser Mode

**Use Case**: Team environments, shared infrastructure

**Characteristics**:

- Shared database backends (SurrealDB)
- Multiple concurrent users
- CORS and multi-user features enabled
- Optional TLS support
- 2-4 machines (or containerized)

**Services Configuration**:

- 4-6 workers per service
- 60-120 second timeouts
- Basic replication available
- Info-level logging

**Startup Time**: ~3-8 minutes (database dependent)
**Data Persistence**: SurrealDB (shared)

---

#### CICD Mode

**Use Case**: CI/CD pipelines, ephemeral environments

**Characteristics**:

- Ephemeral storage (memory, temporary)
- High throughput
- RAG system disabled
- Minimal logging
- Stateless services

**Services Configuration**:

- 8-12 workers per service
- 10-30 second timeouts
- No persistence
- Warn-level logging

**Startup Time**: ~1-2 minutes
**Data Persistence**: None (ephemeral)

---

#### Enterprise Mode

**Use Case**: Production, high availability, compliance

**Characteristics**:

- Distributed, replicated backends
- High availability (HA) clustering
- TLS/SSL encryption
- Audit logging
- Full monitoring and observability

**Services Configuration**:

- 16-32 workers per service
- 120-300 second timeouts
- Active replication across 3+ nodes
- Info-level logging with audit trails

**Startup Time**: ~5-15 minutes (cluster initialization)
**Data Persistence**: Replicated across cluster

---

## Quick Start

### 1. Clone Repository

```bash
git clone https://github.com/your-org/project-provisioning.git
cd project-provisioning
```

### 2. Select Deployment Mode

Choose your mode based on use case:

```bash
# For development
export DEPLOYMENT_MODE=solo

# For team environments
export DEPLOYMENT_MODE=multiuser

# For CI/CD
export DEPLOYMENT_MODE=cicd

# For production
export DEPLOYMENT_MODE=enterprise
```

### 3. Set Environment Variables

All services automatically load their mode-specific TOML configs, selected via environment variables:

```bash
# Vault Service
export VAULT_MODE=$DEPLOYMENT_MODE

# Extension Registry
export REGISTRY_MODE=$DEPLOYMENT_MODE

# RAG System
export RAG_MODE=$DEPLOYMENT_MODE

# AI Service
export AI_SERVICE_MODE=$DEPLOYMENT_MODE

# Provisioning Daemon
export DAEMON_MODE=$DEPLOYMENT_MODE
```

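These five exports recur throughout this guide, so keeping them in a single sourceable helper avoids the modes drifting apart between services. A sketch (the `set-mode.sh` filename is an assumption):

```bash
#!/bin/sh
# set-mode.sh — export every per-service mode variable from one argument.
# Usage: . ./set-mode.sh solo|multiuser|cicd|enterprise
MODE=${1:-solo}
case "$MODE" in
  solo|multiuser|cicd|enterprise) ;;
  *) echo "unknown mode: $MODE" >&2; return 1 2>/dev/null || exit 1 ;;
esac
export DEPLOYMENT_MODE="$MODE"
export VAULT_MODE="$MODE"
export REGISTRY_MODE="$MODE"
export RAG_MODE="$MODE"
export AI_SERVICE_MODE="$MODE"
export DAEMON_MODE="$MODE"
echo "platform mode set to $MODE"
```

Source it rather than executing it, so the exports land in your current shell: `. ./set-mode.sh multiuser`.
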
### 4. Build All Services

```bash
# Build all platform crates
cargo build --release -p vault-service \
  -p extension-registry \
  -p provisioning-rag \
  -p ai-service \
  -p provisioning-daemon \
  -p orchestrator \
  -p control-center \
  -p mcp-server \
  -p installer
```

### 5. Start Services (Order Matters)

```bash
# Start in dependency order:

# 1. Core infrastructure (KMS, storage)
cargo run --release -p vault-service &

# 2. Configuration and extensions
cargo run --release -p extension-registry &

# 3. AI/RAG layer
cargo run --release -p provisioning-rag &
cargo run --release -p ai-service &

# 4. Orchestration layer
cargo run --release -p orchestrator &
cargo run --release -p control-center &
cargo run --release -p mcp-server &

# 5. Background operations
cargo run --release -p provisioning-daemon &

# 6. Installer (optional, for new deployments)
cargo run --release -p installer &
```

### 6. Verify Services

```bash
# Check all services are running
pgrep -l "vault-service|extension-registry|provisioning-rag|ai-service"

# Test endpoints
curl http://localhost:8200/health   # Vault
curl http://localhost:8081/health   # Registry
curl http://localhost:8083/health   # RAG
curl http://localhost:8082/health   # AI Service
curl http://localhost:9090/health   # Orchestrator
curl http://localhost:8080/health   # Control Center
```

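Services take a few seconds to bind their ports, so bare `curl` calls immediately after startup can fail spuriously. A small polling helper (a sketch; the endpoint list mirrors the checks above) waits for an endpoint to become healthy:

```bash
#!/bin/sh
# wait_healthy URL [TIMEOUT_SECONDS] — poll a /health endpoint until it
# answers successfully, or give up after the timeout.
wait_healthy() {
  url=$1
  timeout=${2:-60}
  elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    if curl -fsS --max-time 2 "$url" >/dev/null 2>&1; then
      echo "healthy: $url"
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  echo "timeout waiting for $url" >&2
  return 1
}
```

For example: `for p in 8200 8081 8083 8082 9090 8080; do wait_healthy "http://localhost:$p/health" 60 || exit 1; done`.
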
---
## Solo Mode Deployment

**Perfect for**: Development, testing, learning

### Step 1: Verify Solo Configuration Files

```bash
# Check that solo schemas are available
ls -la provisioning/schemas/platform/defaults/deployment/solo-defaults.ncl

# Available schemas for each service:
# - provisioning/schemas/platform/schemas/vault-service.ncl
# - provisioning/schemas/platform/schemas/extension-registry.ncl
# - provisioning/schemas/platform/schemas/rag.ncl
# - provisioning/schemas/platform/schemas/ai-service.ncl
# - provisioning/schemas/platform/schemas/provisioning-daemon.ncl
```

### Step 2: Set Solo Environment Variables

```bash
# Set all services to solo mode
export VAULT_MODE=solo
export REGISTRY_MODE=solo
export RAG_MODE=solo
export AI_SERVICE_MODE=solo
export DAEMON_MODE=solo

# Verify settings
echo $VAULT_MODE  # Should output: solo
```

### Step 3: Build Services

```bash
# Build in release mode for better performance
cargo build --release
```

### Step 4: Create Local Data Directories

```bash
# Create storage directories for solo mode
mkdir -p /tmp/provisioning-solo/{vault,registry,rag,ai,daemon}
chmod 755 /tmp/provisioning-solo/{vault,registry,rag,ai,daemon}
```

### Step 5: Start Services

```bash
# Start each service in a separate terminal or use tmux:

# Terminal 1: Vault
cargo run --release -p vault-service

# Terminal 2: Registry
cargo run --release -p extension-registry

# Terminal 3: RAG
cargo run --release -p provisioning-rag

# Terminal 4: AI Service
cargo run --release -p ai-service

# Terminal 5: Orchestrator
cargo run --release -p orchestrator

# Terminal 6: Control Center
cargo run --release -p control-center

# Terminal 7: Daemon
cargo run --release -p provisioning-daemon
```

### Step 6: Test Services

```bash
# Wait 10-15 seconds for services to start, then test

# Check service health
curl -s http://localhost:8200/health | jq .
curl -s http://localhost:8081/health | jq .
curl -s http://localhost:8083/health | jq .

# Try a simple operation
curl -X GET http://localhost:9090/api/v1/health
```

### Step 7: Verify Persistence (Optional)

```bash
# Check that data is stored locally
ls -la /tmp/provisioning-solo/vault/
ls -la /tmp/provisioning-solo/registry/

# Data should accumulate as you use the services
```

### Cleanup

```bash
# Stop all services
pkill -f "cargo run --release"

# Remove temporary data (optional)
rm -rf /tmp/provisioning-solo
```

---

## Multiuser Mode Deployment

**Perfect for**: Team environments, shared infrastructure

### Prerequisites

- **SurrealDB**: Running and accessible at `http://surrealdb:8000`
- **Network Access**: All machines can reach SurrealDB
- **DNS/Hostnames**: Services accessible via hostnames (not just localhost)

### Step 1: Deploy SurrealDB

```bash
# Using Docker (recommended)
docker run -d \
  --name surrealdb \
  -p 8000:8000 \
  surrealdb/surrealdb:latest \
  start --user root --pass root

# Or using native installation:
surreal start --user root --pass root
```

### Step 2: Verify SurrealDB Connectivity

```bash
# Test SurrealDB connection
curl -s http://localhost:8000/health

# Should return: {"version":"v1.x.x"}
```

### Step 3: Set Multiuser Environment Variables

```bash
# Configure all services for multiuser mode
export VAULT_MODE=multiuser
export REGISTRY_MODE=multiuser
export RAG_MODE=multiuser
export AI_SERVICE_MODE=multiuser
export DAEMON_MODE=multiuser

# Set database connection
export SURREALDB_URL=http://surrealdb:8000
export SURREALDB_USER=root
export SURREALDB_PASS=root

# Set service hostnames (if not localhost)
export VAULT_SERVICE_HOST=vault.internal
export REGISTRY_HOST=registry.internal
export RAG_HOST=rag.internal
```

### Step 4: Build Services

```bash
cargo build --release
```

### Step 5: Create Shared Data Directories

```bash
# Create directories on shared storage (NFS, etc.)
mkdir -p /mnt/provisioning-data/{vault,registry,rag,ai}
chmod 755 /mnt/provisioning-data/{vault,registry,rag,ai}

# Or use local directories if on separate machines
mkdir -p /var/lib/provisioning/{vault,registry,rag,ai}
```

### Step 6: Start Services on Multiple Machines

```bash
# Machine 1: Infrastructure services
ssh ops@machine1
export VAULT_MODE=multiuser
export REGISTRY_MODE=multiuser
cargo run --release -p vault-service &
cargo run --release -p extension-registry &

# Machine 2: AI services
ssh ops@machine2
export RAG_MODE=multiuser
export AI_SERVICE_MODE=multiuser
cargo run --release -p provisioning-rag &
cargo run --release -p ai-service &

# Machine 3: Orchestration
ssh ops@machine3
cargo run --release -p orchestrator &
cargo run --release -p control-center &

# Machine 4: Background tasks
ssh ops@machine4
export DAEMON_MODE=multiuser
cargo run --release -p provisioning-daemon &
```

### Step 7: Test Multi-Machine Setup

```bash
# From any machine, test cross-machine connectivity
curl -s http://machine1:8200/health
curl -s http://machine2:8083/health
curl -s http://machine3:9090/health

# Test integration
curl -X POST http://machine3:9090/api/v1/provision \
  -H "Content-Type: application/json" \
  -d '{"workspace": "test"}'
```

### Step 8: Enable User Access

```bash
# Create shared credentials
export VAULT_TOKEN=s.xxxxxxxxxxx

# Configure TLS (optional but recommended):
# update configs to use https:// URLs
export VAULT_MODE=multiuser
# Edit provisioning/schemas/platform/schemas/vault-service.ncl
# Add TLS configuration in the schema definition
# See: provisioning/schemas/platform/validators/ for constraints
```

### Monitoring Multiuser Deployment

```bash
# Check all services are connected to SurrealDB
for host in machine1 machine2 machine3 machine4; do
  ssh ops@$host "curl -s http://localhost/api/v1/health | jq .database_connected"
done

# Monitor SurrealDB
curl -s http://surrealdb:8000/version
```

---

## CICD Mode Deployment

**Perfect for**: GitHub Actions, GitLab CI, Jenkins, cloud automation

### Step 1: Understand Ephemeral Nature

CICD mode services:

- Don't persist data between runs
- Use in-memory storage
- Have RAG disabled
- Optimize for startup speed
- Are suitable for containerized deployments

### Step 2: Set CICD Environment Variables

```bash
# Use cicd mode for all services
export VAULT_MODE=cicd
export REGISTRY_MODE=cicd
export RAG_MODE=cicd
export AI_SERVICE_MODE=cicd
export DAEMON_MODE=cicd

# Disable TLS (not needed in CI)
export CI_ENVIRONMENT=true
```

### Step 3: Containerize Services (Optional)

```dockerfile
# Dockerfile for CICD deployments
FROM rust:1.75-slim

WORKDIR /app
COPY . .

# Build all services
RUN cargo build --release

# Set CICD mode
ENV VAULT_MODE=cicd
ENV REGISTRY_MODE=cicd
ENV RAG_MODE=cicd
ENV AI_SERVICE_MODE=cicd

# Expose ports
EXPOSE 8200 8081 8083 8082 9090 8080

# Run services
CMD ["sh", "-c", "\
  cargo run --release -p vault-service & \
  cargo run --release -p extension-registry & \
  cargo run --release -p provisioning-rag & \
  cargo run --release -p ai-service & \
  cargo run --release -p orchestrator & \
  wait"]
```

### Step 4: GitHub Actions Example

```yaml
name: CICD Platform Deployment

on:
  push:
    branches: [main, develop]

jobs:
  test-deployment:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Install Rust
        uses: actions-rs/toolchain@v1
        with:
          toolchain: 1.75
          profile: minimal

      - name: Set CICD Mode
        run: |
          echo "VAULT_MODE=cicd" >> $GITHUB_ENV
          echo "REGISTRY_MODE=cicd" >> $GITHUB_ENV
          echo "RAG_MODE=cicd" >> $GITHUB_ENV
          echo "AI_SERVICE_MODE=cicd" >> $GITHUB_ENV
          echo "DAEMON_MODE=cicd" >> $GITHUB_ENV

      - name: Build Services
        run: cargo build --release

      - name: Run Integration Tests
        run: |
          # Start services in background
          cargo run --release -p vault-service &
          cargo run --release -p extension-registry &
          cargo run --release -p orchestrator &

          # Wait for startup
          sleep 10

          # Run tests
          cargo test --release

      - name: Health Checks
        run: |
          curl -f http://localhost:8200/health
          curl -f http://localhost:8081/health
          curl -f http://localhost:9090/health

  deploy:
    needs: test-deployment
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v3
      - name: Deploy to Production
        run: |
          # Deploy production enterprise cluster
          ./scripts/deploy-enterprise.sh
```

### Step 5: Run CICD Tests

```bash
# Simulate CI environment locally
export VAULT_MODE=cicd
export CI_ENVIRONMENT=true

# Build
cargo build --release

# Run short-lived services for testing
timeout 30 cargo run --release -p vault-service &
timeout 30 cargo run --release -p extension-registry &
timeout 30 cargo run --release -p orchestrator &

# Run tests while services are running
sleep 5
cargo test --release

# Services auto-cleanup after timeout
```

---

## Enterprise Mode Deployment

**Perfect for**: Production, high availability, compliance

### Prerequisites

- **3+ Machines**: Minimum 3 for HA
- **Etcd Cluster**: For distributed consensus
- **Load Balancer**: HAProxy, nginx, or cloud LB
- **TLS Certificates**: Valid certificates for all services
- **Monitoring**: Prometheus, ELK, or cloud monitoring
- **Backup System**: Daily snapshots to S3 or similar

### Step 1: Deploy Infrastructure

#### 1.1 Deploy Etcd Cluster

```bash
# Run on each of nodes 1, 2, 3 (adjust --name and advertise URLs per node)
etcd --name=node-1 \
  --listen-client-urls=http://0.0.0.0:2379 \
  --advertise-client-urls=http://node-1.internal:2379 \
  --listen-peer-urls=http://0.0.0.0:2380 \
  --initial-advertise-peer-urls=http://node-1.internal:2380 \
  --initial-cluster="node-1=http://node-1.internal:2380,node-2=http://node-2.internal:2380,node-3=http://node-3.internal:2380" \
  --initial-cluster-state=new

# Verify cluster
etcdctl --endpoints=http://localhost:2379 member list
```

#### 1.2 Deploy Load Balancer

```text
# HAProxy configuration for vault-service (example)
frontend vault_frontend
    bind *:8200
    mode tcp
    default_backend vault_backend

backend vault_backend
    mode tcp
    balance roundrobin
    server vault-1 10.0.1.10:8200 check
    server vault-2 10.0.1.11:8200 check
    server vault-3 10.0.1.12:8200 check
```

#### 1.3 Configure TLS

```bash
# Generate certificates (or use existing)
mkdir -p /etc/provisioning/tls

# For each service:
openssl req -x509 -newkey rsa:4096 \
  -keyout /etc/provisioning/tls/vault-key.pem \
  -out /etc/provisioning/tls/vault-cert.pem \
  -days 365 -nodes \
  -subj "/CN=vault.provisioning.prod"

# Set permissions
chmod 600 /etc/provisioning/tls/*-key.pem
chmod 644 /etc/provisioning/tls/*-cert.pem
```

### Step 2: Set Enterprise Environment Variables

```bash
# All machines: set enterprise mode
export VAULT_MODE=enterprise
export REGISTRY_MODE=enterprise
export RAG_MODE=enterprise
export AI_SERVICE_MODE=enterprise
export DAEMON_MODE=enterprise

# Database cluster
export SURREALDB_URL="ws://surrealdb-cluster.internal:8000"
export SURREALDB_REPLICAS=3

# Etcd cluster
export ETCD_ENDPOINTS="http://node-1.internal:2379,http://node-2.internal:2379,http://node-3.internal:2379"

# TLS configuration
export TLS_CERT_PATH=/etc/provisioning/tls
export TLS_VERIFY=true
export TLS_CA_CERT=/etc/provisioning/tls/ca.crt

# Monitoring
export PROMETHEUS_URL=http://prometheus.internal:9090
export METRICS_ENABLED=true
export AUDIT_LOG_ENABLED=true
```

### Step 3: Deploy Services Across Cluster

```yaml
# Ansible playbook (simplified)
---
- hosts: provisioning_cluster
  tasks:
    - name: Build services
      shell: cargo build --release

    - name: Start vault-service (machines 1-3)
      shell: "cargo run --release -p vault-service"
      when: "'vault' in group_names"

    - name: Start orchestrator (machines 2-3)
      shell: "cargo run --release -p orchestrator"
      when: "'orchestrator' in group_names"

    - name: Start daemon (machine 3)
      shell: "cargo run --release -p provisioning-daemon"
      when: "'daemon' in group_names"

    - name: Verify cluster health
      uri:
        url: "https://{{ inventory_hostname }}:9090/health"
        validate_certs: yes
```

### Step 4: Monitor Cluster Health

```bash
# Check cluster status
curl -s https://vault.internal:8200/health | jq .state

# Check replication
curl -s https://orchestrator.internal:9090/api/v1/cluster/status

# Monitor etcd
etcdctl --endpoints=https://node-1.internal:2379 endpoint health

# Check leadership (IS LEADER column)
etcdctl --endpoints=https://node-1.internal:2379 endpoint status -w table
```

### Step 5: Enable Monitoring & Alerting

```yaml
# Prometheus configuration
global:
  scrape_interval: 30s
  evaluation_interval: 30s

scrape_configs:
  - job_name: 'vault-service'
    scheme: https
    tls_config:
      ca_file: /etc/provisioning/tls/ca.crt
    static_configs:
      - targets: ['vault-1.internal:8200', 'vault-2.internal:8200', 'vault-3.internal:8200']

  - job_name: 'orchestrator'
    scheme: https
    static_configs:
      - targets: ['orch-1.internal:9090', 'orch-2.internal:9090', 'orch-3.internal:9090']
```

### Step 6: Backup & Recovery

```bash
#!/bin/bash
# Daily backup script
BACKUP_DIR="/mnt/provisioning-backups"
DATE=$(date +%Y%m%d_%H%M%S)

# Backup etcd
etcdctl --endpoints=https://node-1.internal:2379 \
  snapshot save "$BACKUP_DIR/etcd-$DATE.db"

# Backup SurrealDB
curl -X POST https://surrealdb.internal:8000/backup \
  -H "Authorization: Bearer $SURREALDB_TOKEN" \
  > "$BACKUP_DIR/surreal-$DATE.sql"

# Upload to S3
aws s3 cp "$BACKUP_DIR/etcd-$DATE.db" \
  s3://provisioning-backups/etcd/

# Cleanup old backups (keep 30 days)
find "$BACKUP_DIR" -mtime +30 -delete
```

---

## Service Management

### Starting Services

#### Individual Service Startup

```bash
# Start one service
export VAULT_MODE=enterprise
cargo run --release -p vault-service

# In another terminal
export REGISTRY_MODE=enterprise
cargo run --release -p extension-registry
```

#### Batch Startup

```bash
#!/bin/bash
# Start all services (dependency order)
set -e

MODE=${1:-solo}
export VAULT_MODE=$MODE
export REGISTRY_MODE=$MODE
export RAG_MODE=$MODE
export AI_SERVICE_MODE=$MODE
export DAEMON_MODE=$MODE

echo "Starting provisioning platform in $MODE mode..."

# Core services first
echo "Starting infrastructure..."
cargo run --release -p vault-service &
VAULT_PID=$!

echo "Starting extension registry..."
cargo run --release -p extension-registry &
REGISTRY_PID=$!

# AI layer
echo "Starting AI services..."
cargo run --release -p provisioning-rag &
RAG_PID=$!

cargo run --release -p ai-service &
AI_PID=$!

# Orchestration
echo "Starting orchestration..."
cargo run --release -p orchestrator &
ORCH_PID=$!

echo "All services started. PIDs: $VAULT_PID $REGISTRY_PID $RAG_PID $AI_PID $ORCH_PID"
```

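For anything longer-lived than a demo, supervising each binary with a process manager is more robust than backgrounded `cargo run` jobs. A hedged systemd unit sketch for one service — the install path, user, and unit name are assumptions, and each service would get its own unit:

```ini
# /etc/systemd/system/vault-service.service (assumed path and user)
[Unit]
Description=Provisioning Vault Service
After=network-online.target
Wants=network-online.target

[Service]
User=provisioning
Environment=VAULT_MODE=enterprise
ExecStart=/usr/local/bin/vault-service
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Install per service, then `systemctl daemon-reload && systemctl enable --now vault-service`.
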
### Stopping Services

```bash
# Stop all services gracefully
pkill -SIGTERM -f "cargo run --release -p"

# Wait for graceful shutdown
sleep 5

# Force kill if needed
pkill -9 -f "cargo run --release -p"

# Verify all stopped
pgrep -f "cargo run --release -p" && echo "Services still running" || echo "All stopped"
```

### Restarting Services

```bash
# Restart single service
pkill -SIGTERM vault-service
sleep 2
cargo run --release -p vault-service &

# Restart all services
./scripts/restart-all.sh $MODE

# Restart with config reload
export VAULT_MODE=multiuser
pkill -SIGTERM vault-service
sleep 2
cargo run --release -p vault-service &
```

### Checking Service Status

```bash
# Check running processes
pgrep -af "cargo run --release"

# Check listening ports
netstat -tlnp | grep -E "8200|8081|8083|8082|9090|8080"

# Or using ss (modern alternative)
ss -tlnp | grep -E "8200|8081|8083|8082|9090|8080"

# Health endpoint checks
declare -A port=([vault]=8200 [registry]=8081 [rag]=8083 [ai]=8082 [orchestrator]=9090)
for service in vault registry rag ai orchestrator; do
  echo "=== $service ==="
  curl -s "http://localhost:${port[$service]}/health" | jq .
done
```

---

## Health Checks & Monitoring

### Manual Health Verification

```bash
# Vault Service
curl -s http://localhost:8200/health | jq .
# Expected: {"status":"ok","uptime":123.45}

# Extension Registry
curl -s http://localhost:8081/health | jq .

# RAG System
curl -s http://localhost:8083/health | jq .
# Expected: {"status":"ok","embeddings":"ready","vector_db":"connected"}

# AI Service
curl -s http://localhost:8082/health | jq .

# Orchestrator
curl -s http://localhost:9090/health | jq .

# Control Center
curl -s http://localhost:8080/health | jq .
```

### Service Integration Tests

```bash
# Test vault <-> registry integration
curl -X POST http://localhost:8200/api/encrypt \
  -H "Content-Type: application/json" \
  -d '{"plaintext":"secret"}' | jq .

# Test RAG system
curl -X POST http://localhost:8083/api/ingest \
  -H "Content-Type: application/json" \
  -d '{"document":"test.md","content":"# Test"}' | jq .

# Test orchestrator
curl -X GET http://localhost:9090/api/v1/status | jq .

# End-to-end workflow
curl -X POST http://localhost:9090/api/v1/provision \
  -H "Content-Type: application/json" \
  -d '{
    "workspace": "test",
    "services": ["vault", "registry"],
    "mode": "solo"
  }' | jq .
```

### Monitoring Dashboards

#### Prometheus Metrics

```bash
# Query service uptime
curl -s 'http://prometheus:9090/api/v1/query?query=up' | jq .

# Query request rate
curl -s 'http://prometheus:9090/api/v1/query?query=rate(http_requests_total[5m])' | jq .

# Query error rate
curl -s 'http://prometheus:9090/api/v1/query?query=rate(http_errors_total[5m])' | jq .
```

#### Log Aggregation
|
||
|
|
|
||
|
|
```bash
|
||
|
|
# Follow vault logs
|
||
|
|
tail -f /var/log/provisioning/vault-service.log
|
||
|
|
|
||
|
|
# Follow all service logs
|
||
|
|
tail -f /var/log/provisioning/*.log
|
||
|
|
|
||
|
|
# Search for errors
|
||
|
|
grep -r "ERROR" /var/log/provisioning/
|
||
|
|
|
||
|
|
# Follow with filtering
|
||
|
|
tail -f /var/log/provisioning/orchestrator.log | grep -E "ERROR|WARN"
|
||
|
|
```
|
||
|
|
|
||
|
|
### Alerting

```yaml
# Prometheus alerting rules (evaluated by Prometheus, routed via Alertmanager)
groups:
  - name: provisioning
    rules:
      - alert: ServiceDown
        expr: up{job=~"vault|registry|rag|orchestrator"} == 0
        for: 5m
        annotations:
          summary: "{{ $labels.job }} is down"

      - alert: HighErrorRate
        expr: rate(http_errors_total[5m]) > 0.05
        annotations:
          summary: "High error rate detected"

      - alert: DiskSpaceWarning
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.2
        annotations:
          summary: "Disk space below 20%"
```
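These rules only fire alerts; delivering them is Alertmanager's job. A minimal routing sketch follows; the receiver name, webhook URL, and intervals are placeholders, not part of the platform configuration:

```yaml
# alertmanager.yml (sketch; Slack webhook URL is a placeholder)
route:
  receiver: platform-team
  group_by: [alertname, job]
  group_wait: 30s
  repeat_interval: 4h
receivers:
  - name: platform-team
    slack_configs:
      - api_url: https://hooks.slack.com/services/PLACEHOLDER
        channel: '#provisioning-platform'
```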
---
## Troubleshooting

### Service Won't Start

**Problem**: `error: failed to bind to port 8200`

**Solutions**:

```bash
# Check if port is in use
lsof -i :8200
ss -tlnp | grep 8200

# Kill existing process
pkill -9 -f vault-service

# Or use a different port
export VAULT_SERVER_PORT=8201
cargo run --release -p vault-service
```
### Configuration Loading Fails

**Problem**: `error: failed to load config from mode file`

**Solutions**:

```bash
# Verify schemas exist
ls -la provisioning/schemas/platform/schemas/vault-service.ncl

# Validate schema syntax
nickel typecheck provisioning/schemas/platform/schemas/vault-service.ncl

# Check defaults are present
nickel typecheck provisioning/schemas/platform/defaults/vault-service-defaults.ncl

# Verify deployment mode overlay exists
ls -la provisioning/schemas/platform/defaults/deployment/$VAULT_MODE-defaults.ncl

# Run service with explicit mode
export VAULT_MODE=solo
cargo run --release -p vault-service
```
### Database Connection Issues

**Problem**: `error: failed to connect to database`

**Solutions**:

```bash
# Verify database is running
curl http://surrealdb:8000/health
etcdctl --endpoints=http://etcd:2379 endpoint health

# Check connectivity
nc -zv surrealdb 8000
nc -zv etcd 2379

# Update connection string
export SURREALDB_URL=ws://surrealdb:8000
export ETCD_ENDPOINTS=http://etcd:2379

# Restart service with new config
pkill -9 vault-service
cargo run --release -p vault-service
```
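Startup ordering causes many of these errors: the service races its database. A small wait-for helper avoids that; a sketch using `nc` as in the connectivity checks above (hosts, ports, and the retry budget are assumptions):

```bash
# Block until a TCP port accepts connections, or give up after N retries.
wait_for() {
  local host="$1" port="$2" retries="${3:-30}"
  local i
  for i in $(seq 1 "$retries"); do
    if nc -z "$host" "$port" 2>/dev/null; then
      echo "up: ${host}:${port}"
      return 0
    fi
    sleep 1
  done
  echo "timeout waiting for ${host}:${port}" >&2
  return 1
}

# Example: wait_for surrealdb 8000 && cargo run --release -p vault-service
```

Chaining with `&&` ensures the service only starts once its dependency is reachable.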
### Service Crashes on Startup

**Problem**: Service exits with code 1 or 139 (139 = 128 + SIGSEGV, a segmentation fault)

**Solutions**:

```bash
# Run with verbose logging
RUST_LOG=debug cargo run -p vault-service 2>&1 | head -50

# Check system resources
free -h
df -h

# Check for core dumps
coredumpctl list

# Run under debugger (if crash suspected)
rust-gdb --args target/release/vault-service
```
### High Memory Usage

**Problem**: Service consuming more memory than expected

**Solutions**:

```bash
# Check memory usage
ps aux | grep vault-service | grep -v grep

# Monitor over time
watch -n 1 'ps aux | grep vault-service | grep -v grep'

# Reduce worker count
export VAULT_SERVER_WORKERS=2
cargo run --release -p vault-service

# Check for memory leaks
valgrind --leak-check=full target/release/vault-service
```
### Network/DNS Issues

**Problem**: `error: failed to resolve hostname`

**Solutions**:

```bash
# Test DNS resolution
nslookup vault.internal
dig vault.internal

# Test connectivity to service
curl -v http://vault.internal:8200/health

# Add to /etc/hosts if needed (requires root)
echo "10.0.1.10 vault.internal" | sudo tee -a /etc/hosts

# Check network interfaces and routes
ip addr show
netstat -nr
```
### Data Persistence Issues

**Problem**: Data lost after restart

**Solutions**:

```bash
# Verify backups exist
ls -la /mnt/provisioning-backups/
ls -la /var/lib/provisioning/

# Check disk space
df -h /var/lib/provisioning

# Verify file permissions (vault data should stay private to the service user)
ls -l /var/lib/provisioning/vault/
chmod -R 700 /var/lib/provisioning/vault/

# Restore from backup
./scripts/restore-backup.sh /mnt/provisioning-backups/vault-20260105.sql
```
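Lost data is far cheaper to recover when backups run on a schedule. A sketch of a nightly backup with 7-day retention; the paths match the examples above, and the schedule itself would live in cron or a systemd timer:

```bash
# Nightly backup with 7-day retention.
backup_dir=/mnt/provisioning-backups
data_dir=/var/lib/provisioning
tar -czf "${backup_dir}/provisioning-$(date +%Y%m%d).tar.gz" "${data_dir}"

# Drop archives older than 7 days.
find "${backup_dir}" -name 'provisioning-*.tar.gz' -mtime +7 -delete
```

Pair this with a periodic test restore; an untested backup is not a backup.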
### Debugging Checklist

When troubleshooting, use this systematic approach:

```bash
# 1. Check service is running
pgrep -f vault-service || echo "Service not running"

# 2. Check port is listening
ss -tlnp | grep 8200 || echo "Port not listening"

# 3. Check logs for errors
tail -20 /var/log/provisioning/vault-service.log | grep -i error

# 4. Test HTTP endpoint
curl -i http://localhost:8200/health

# 5. Check dependencies
curl http://surrealdb:8000/health
etcdctl --endpoints=http://etcd:2379 endpoint health

# 6. Check schema definition
nickel typecheck provisioning/schemas/platform/schemas/vault-service.ncl

# 7. Verify environment variables
env | grep -E "VAULT_|SURREALDB_|ETCD_"

# 8. Check system resources
free -h && df -h && top -bn1 | head -10
```
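The checklist above can be collapsed into one pass/fail sweep. A minimal sketch; the service name, port, and schema path are the running examples in this guide:

```bash
# Run a command silently and report PASS/FAIL with a label.
check() {
  local desc="$1"; shift
  if "$@" > /dev/null 2>&1; then
    echo "PASS: ${desc}"
  else
    echo "FAIL: ${desc}"
  fi
}

check "process running" pgrep -f vault-service
check "port listening"  sh -c 'ss -tlnp | grep -q :8200'
check "health endpoint" curl -sf --max-time 2 http://localhost:8200/health
check "schema valid"    nickel typecheck provisioning/schemas/platform/schemas/vault-service.ncl
```

Each line maps to a numbered step above; extend the list with the dependency and resource checks as needed.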
---
## Configuration Updates

### Updating Service Configuration

```bash
# 1. Edit the schema definition
vim provisioning/schemas/platform/schemas/vault-service.ncl

# 2. Update defaults if needed
vim provisioning/schemas/platform/defaults/vault-service-defaults.ncl

# 3. Validate syntax
nickel typecheck provisioning/schemas/platform/schemas/vault-service.ncl

# 4. Re-export configuration from schemas
./provisioning/.typedialog/platform/scripts/generate-configs.nu vault-service multiuser

# 5. Restart affected service (clients see a brief interruption)
pkill -SIGTERM vault-service
sleep 2
cargo run --release -p vault-service &

# 6. Verify configuration loaded
curl http://localhost:8200/api/config | jq .
```
### Mode Migration

```bash
# Migrate from solo to multiuser:

# 1. Stop services
pkill -SIGTERM -f "cargo run"
sleep 5

# 2. Backup current data
tar -czf /backup/provisioning-solo-$(date +%s).tar.gz /var/lib/provisioning/

# 3. Set new mode
export VAULT_MODE=multiuser
export REGISTRY_MODE=multiuser
export RAG_MODE=multiuser

# 4. Start services with new config
cargo run --release -p vault-service &
cargo run --release -p extension-registry &

# 5. Verify new mode
curl http://localhost:8200/api/config | jq .deployment_mode
```
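After a migration it is worth checking every service, not just vault. A hedged sketch; the `/api/config` response shape and the port list are assumptions based on the examples above:

```bash
# Report each service's active deployment mode.
for port in 8200 8081 8083; do
  cfg=$(curl -s --max-time 2 "http://localhost:${port}/api/config" || echo '{}')
  mode=$(printf '%s' "$cfg" | jq -r '.deployment_mode // "unknown"')
  echo "port ${port}: ${mode}"
done
```

Any line reporting `unknown` means the service is unreachable or did not pick up the new mode.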
---
## Production Checklist

Before deploying to production:

- [ ] All services compiled in release mode (`--release`)
- [ ] TLS certificates installed and valid
- [ ] Database cluster deployed and healthy
- [ ] Load balancer configured and routing traffic
- [ ] Monitoring and alerting configured
- [ ] Backup system tested and working
- [ ] High availability verified (failover tested)
- [ ] Security hardening applied (firewall rules, etc.)
- [ ] Documentation updated for your environment
- [ ] Team trained on deployment procedures
- [ ] Runbooks created for common operations
- [ ] Disaster recovery plan tested
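The TLS item on the checklist can be automated with `openssl x509 -checkend`; a sketch in which the certificate path is an assumption about your layout:

```bash
# Warn if the serving certificate expires within 30 days.
cert=/etc/provisioning/tls/server.crt
if openssl x509 -checkend $((30 * 24 * 3600)) -noout -in "$cert"; then
  echo "certificate valid for at least 30 more days"
else
  echo "certificate missing or expiring within 30 days" >&2
fi
```

`-checkend SECONDS` exits 0 only if the certificate will still be valid after the given number of seconds, so the same pattern fits a CI preflight job.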
---
## Getting Help

### Community Resources

- **GitHub Issues**: Report bugs at `github.com/your-org/provisioning/issues`
- **Documentation**: Full docs at `provisioning/docs/`
- **Slack Channel**: `#provisioning-platform`

### Internal Support

- **Platform Team**: [platform@your-org.com](mailto:platform@your-org.com)
- **On-Call**: Check PagerDuty for active rotation
- **Escalation**: Contact infrastructure leadership
### Useful Commands Reference

```bash
# View all available commands
cargo run -- --help

# View service schemas
ls -la provisioning/schemas/platform/schemas/
ls -la provisioning/schemas/platform/defaults/

# List running services
ps aux | grep cargo

# Monitor service logs in real-time
journalctl -fu provisioning-vault

# Generate diagnostics bundle
./scripts/generate-diagnostics.sh > /tmp/diagnostics-$(date +%s).tar.gz
```