# Platform Deployment Guide **Version**: 1.0.0 **Last Updated**: 2026-01-05 **Target Audience**: DevOps Engineers, Platform Operators **Status**: Production Ready Practical guide for deploying the 9-service provisioning platform in any environment using mode-based configuration. ## Table of Contents 1. [Prerequisites](#prerequisites) 2. [Deployment Modes](#deployment-modes) 3. [Quick Start](#quick-start) 4. [Solo Mode Deployment](#solo-mode-deployment) 5. [Multiuser Mode Deployment](#multiuser-mode-deployment) 6. [CICD Mode Deployment](#cicd-mode-deployment) 7. [Enterprise Mode Deployment](#enterprise-mode-deployment) 8. [Service Management](#service-management) 9. [Health Checks & Monitoring](#health-checks--monitoring) 10. [Troubleshooting](#troubleshooting) --- ## Prerequisites ### Required Software - **Rust**: 1.70+ (for building services) - **Nickel**: Latest (for config validation) - **Nushell**: 0.109.1+ (for scripts) - **Cargo**: Included with Rust - **Git**: For cloning and pulling updates ### Required Tools (Mode-Dependent) | Tool | Solo | Multiuser | CICD | Enterprise | |------|------|-----------|------|------------| | Docker/Podman | No | Optional | Yes | Yes | | SurrealDB | No | Yes | No | No | | Etcd | No | No | No | Yes | | PostgreSQL | No | Optional | No | Optional | | OpenAI/Anthropic API | No | Optional | Yes | Yes | ### System Requirements | Resource | Solo | Multiuser | CICD | Enterprise | |----------|------|-----------|------|------------| | CPU Cores | 2+ | 4+ | 8+ | 16+ | | Memory | 2 GB | 4 GB | 8 GB | 16 GB | | Disk | 10 GB | 50 GB | 100 GB | 500 GB | | Network | Local | Local/Cloud | Cloud | HA Cloud | ### Directory Structure ```bash # Ensure base directories exist mkdir -p provisioning/schemas/platform mkdir -p provisioning/platform/logs mkdir -p provisioning/platform/data mkdir -p provisioning/.typedialog/platform mkdir -p provisioning/config/runtime ``` --- ## Deployment Modes ### Mode Selection Matrix | Requirement | Recommended Mode | 
|-------------|------------------| | Development & testing | **solo** | | Team environment (2-10 people) | **multiuser** | | CI/CD pipelines & automation | **cicd** | | Production with HA | **enterprise** | ### Mode Characteristics #### Solo Mode **Use Case**: Development, testing, demonstration **Characteristics**: - All services run locally with minimal resources - Filesystem-based storage (no external databases) - No TLS/SSL required - Embedded/in-memory backends - Single machine only **Services Configuration**: - 2-4 workers per service - 30-60 second timeouts - No replication or clustering - Debug-level logging enabled **Startup Time**: ~2-5 minutes **Data Persistence**: Local files only --- #### Multiuser Mode **Use Case**: Team environments, shared infrastructure **Characteristics**: - Shared database backends (SurrealDB) - Multiple concurrent users - CORS and multi-user features enabled - Optional TLS support - 2-4 machines (or containerized) **Services Configuration**: - 4-6 workers per service - 60-120 second timeouts - Basic replication available - Info-level logging **Startup Time**: ~3-8 minutes (database dependent) **Data Persistence**: SurrealDB (shared) --- #### CICD Mode **Use Case**: CI/CD pipelines, ephemeral environments **Characteristics**: - Ephemeral storage (memory, temporary) - High throughput - RAG system disabled - Minimal logging - Stateless services **Services Configuration**: - 8-12 workers per service - 10-30 second timeouts - No persistence - Warn-level logging **Startup Time**: ~1-2 minutes **Data Persistence**: None (ephemeral) --- #### Enterprise Mode **Use Case**: Production, high availability, compliance **Characteristics**: - Distributed, replicated backends - High availability (HA) clustering - TLS/SSL encryption - Audit logging - Full monitoring and observability **Services Configuration**: - 16-32 workers per service - 120-300 second timeouts - Active replication across 3+ nodes - Info-level logging with audit trails 
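Concretely, services read these knobs from their mode-specific TOML configs (generated from the Nickel schemas). A hypothetical enterprise overlay might look like the following sketch — the section and field names are illustrative only, not the authoritative schema:

```toml
# Illustrative enterprise overlay -- consult the Nickel schemas under
# provisioning/schemas/platform/ for the real field names
[server]
workers = 16
timeout_secs = 180

[replication]
enabled = true
min_nodes = 3

[logging]
level = "info"
audit_trail = true
```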
**Startup Time**: ~5-15 minutes (cluster initialization) **Data Persistence**: Replicated across cluster --- ## Quick Start ### 1. Clone Repository ```bash git clone https://github.com/your-org/project-provisioning.git cd project-provisioning ``` ### 2. Select Deployment Mode Choose your mode based on use case: ```bash # For development export DEPLOYMENT_MODE=solo # For team environments export DEPLOYMENT_MODE=multiuser # For CI/CD export DEPLOYMENT_MODE=cicd # For production export DEPLOYMENT_MODE=enterprise ``` ### 3. Set Environment Variables All services use mode-specific TOML configs automatically loaded via environment variables: ```bash # Vault Service export VAULT_MODE=$DEPLOYMENT_MODE # Extension Registry export REGISTRY_MODE=$DEPLOYMENT_MODE # RAG System export RAG_MODE=$DEPLOYMENT_MODE # AI Service export AI_SERVICE_MODE=$DEPLOYMENT_MODE # Provisioning Daemon export DAEMON_MODE=$DEPLOYMENT_MODE ``` ### 4. Build All Services ```bash # Build all platform crates cargo build --release -p vault-service \ -p extension-registry \ -p provisioning-rag \ -p ai-service \ -p provisioning-daemon \ -p orchestrator \ -p control-center \ -p mcp-server \ -p installer ``` ### 5. Start Services (Order Matters) ```bash # Start in dependency order: # 1. Core infrastructure (KMS, storage) cargo run --release -p vault-service & # 2. Configuration and extensions cargo run --release -p extension-registry & # 3. AI/RAG layer cargo run --release -p provisioning-rag & cargo run --release -p ai-service & # 4. Orchestration layer cargo run --release -p orchestrator & cargo run --release -p control-center & cargo run --release -p mcp-server & # 5. Background operations cargo run --release -p provisioning-daemon & # 6. Installer (optional, for new deployments) cargo run --release -p installer & ``` ### 6. 
Verify Services ```bash # Check all services are running pgrep -l "vault-service|extension-registry|provisioning-rag|ai-service" # Test endpoints curl http://localhost:8200/health # Vault curl http://localhost:8081/health # Registry curl http://localhost:8083/health # RAG curl http://localhost:8082/health # AI Service curl http://localhost:9090/health # Orchestrator curl http://localhost:8080/health # Control Center ``` --- ## Solo Mode Deployment **Perfect for**: Development, testing, learning ### Step 1: Verify Solo Configuration Files ```bash # Check that solo schemas are available ls -la provisioning/schemas/platform/defaults/deployment/solo-defaults.ncl # Available schemas for each service: # - provisioning/schemas/platform/schemas/vault-service.ncl # - provisioning/schemas/platform/schemas/extension-registry.ncl # - provisioning/schemas/platform/schemas/rag.ncl # - provisioning/schemas/platform/schemas/ai-service.ncl # - provisioning/schemas/platform/schemas/provisioning-daemon.ncl ``` ### Step 2: Set Solo Environment Variables ```bash # Set all services to solo mode export VAULT_MODE=solo export REGISTRY_MODE=solo export RAG_MODE=solo export AI_SERVICE_MODE=solo export DAEMON_MODE=solo # Verify settings echo $VAULT_MODE # Should output: solo ``` ### Step 3: Build Services ```bash # Build in release mode for better performance cargo build --release ``` ### Step 4: Create Local Data Directories ```bash # Create storage directories for solo mode mkdir -p /tmp/provisioning-solo/{vault,registry,rag,ai,daemon} chmod 755 /tmp/provisioning-solo/{vault,registry,rag,ai,daemon} ``` ### Step 5: Start Services ```bash # Start each service in a separate terminal or use tmux: # Terminal 1: Vault cargo run --release -p vault-service # Terminal 2: Registry cargo run --release -p extension-registry # Terminal 3: RAG cargo run --release -p provisioning-rag # Terminal 4: AI Service cargo run --release -p ai-service # Terminal 5: Orchestrator cargo run --release -p orchestrator 
# Terminal 6: Control Center cargo run --release -p control-center # Terminal 7: Daemon cargo run --release -p provisioning-daemon ``` ### Step 6: Test Services ```bash # Wait 10-15 seconds for services to start, then test # Check service health curl -s http://localhost:8200/health | jq . curl -s http://localhost:8081/health | jq . curl -s http://localhost:8083/health | jq . # Try a simple operation curl -X GET http://localhost:9090/api/v1/health ``` ### Step 7: Verify Persistence (Optional) ```bash # Check that data is stored locally ls -la /tmp/provisioning-solo/vault/ ls -la /tmp/provisioning-solo/registry/ # Data should accumulate as you use the services ``` ### Cleanup ```bash # Stop all services pkill -f "cargo run --release" # Remove temporary data (optional) rm -rf /tmp/provisioning-solo ``` --- ## Multiuser Mode Deployment **Perfect for**: Team environments, shared infrastructure ### Prerequisites - **SurrealDB**: Running and accessible at `http://surrealdb:8000` - **Network Access**: All machines can reach SurrealDB - **DNS/Hostnames**: Services accessible via hostnames (not just localhost) ### Step 1: Deploy SurrealDB ```bash # Using Docker (recommended) docker run -d \ --name surrealdb \ -p 8000:8000 \ surrealdb/surrealdb:latest \ start --user root --pass root # Or using native installation: surreal start --user root --pass root ``` ### Step 2: Verify SurrealDB Connectivity ```bash # Test SurrealDB connection curl -s http://localhost:8000/health # Should return: {"version":"v1.x.x"} ``` ### Step 3: Set Multiuser Environment Variables ```bash # Configure all services for multiuser mode export VAULT_MODE=multiuser export REGISTRY_MODE=multiuser export RAG_MODE=multiuser export AI_SERVICE_MODE=multiuser export DAEMON_MODE=multiuser # Set database connection export SURREALDB_URL=http://surrealdb:8000 export SURREALDB_USER=root export SURREALDB_PASS=root # Set service hostnames (if not localhost) export VAULT_SERVICE_HOST=vault.internal export 
REGISTRY_HOST=registry.internal export RAG_HOST=rag.internal ``` ### Step 4: Build Services ```bash cargo build --release ``` ### Step 5: Create Shared Data Directories ```bash # Create directories on shared storage (NFS, etc.) mkdir -p /mnt/provisioning-data/{vault,registry,rag,ai} chmod 755 /mnt/provisioning-data/{vault,registry,rag,ai} # Or use local directories if on separate machines mkdir -p /var/lib/provisioning/{vault,registry,rag,ai} ``` ### Step 6: Start Services on Multiple Machines ```bash # Machine 1: Infrastructure services ssh ops@machine1 export VAULT_MODE=multiuser cargo run --release -p vault-service & cargo run --release -p extension-registry & # Machine 2: AI services ssh ops@machine2 export RAG_MODE=multiuser export AI_SERVICE_MODE=multiuser cargo run --release -p provisioning-rag & cargo run --release -p ai-service & # Machine 3: Orchestration ssh ops@machine3 cargo run --release -p orchestrator & cargo run --release -p control-center & # Machine 4: Background tasks ssh ops@machine4 export DAEMON_MODE=multiuser cargo run --release -p provisioning-daemon & ``` ### Step 7: Test Multi-Machine Setup ```bash # From any machine, test cross-machine connectivity curl -s http://machine1:8200/health curl -s http://machine2:8083/health curl -s http://machine3:9090/health # Test integration curl -X POST http://machine3:9090/api/v1/provision \ -H "Content-Type: application/json" \ -d '{"workspace": "test"}' ``` ### Step 8: Enable User Access ```bash # Create shared credentials export VAULT_TOKEN=s.xxxxxxxxxxx # Configure TLS (optional but recommended) # Update configs to use https:// URLs export VAULT_MODE=multiuser # Edit provisioning/schemas/platform/schemas/vault-service.ncl # Add TLS configuration in the schema definition # See: provisioning/schemas/platform/validators/ for constraints ``` ### Monitoring Multiuser Deployment ```bash # Check all services are connected to SurrealDB for host in machine1 machine2 machine3 machine4; do ssh ops@$host "curl 
-s http://localhost/api/v1/health | jq .database_connected" done # Monitor SurrealDB curl -s http://surrealdb:8000/version ``` --- ## CICD Mode Deployment **Perfect for**: GitHub Actions, GitLab CI, Jenkins, cloud automation ### Step 1: Understand Ephemeral Nature CICD mode services: - Don't persist data between runs - Use in-memory storage - Have RAG disabled - Optimize for startup speed - Suitable for containerized deployments ### Step 2: Set CICD Environment Variables ```bash # Use cicd mode for all services export VAULT_MODE=cicd export REGISTRY_MODE=cicd export RAG_MODE=cicd export AI_SERVICE_MODE=cicd export DAEMON_MODE=cicd # Disable TLS (not needed in CI) export CI_ENVIRONMENT=true ``` ### Step 3: Containerize Services (Optional) ```dockerfile # Dockerfile for CICD deployments FROM rust:1.75-slim WORKDIR /app COPY . . # Build all services RUN cargo build --release # Set CICD mode ENV VAULT_MODE=cicd ENV REGISTRY_MODE=cicd ENV RAG_MODE=cicd ENV AI_SERVICE_MODE=cicd # Expose ports EXPOSE 8200 8081 8083 8082 9090 8080 # Run services CMD ["sh", "-c", "\ cargo run --release -p vault-service & \ cargo run --release -p extension-registry & \ cargo run --release -p provisioning-rag & \ cargo run --release -p ai-service & \ cargo run --release -p orchestrator & \ wait"] ``` ### Step 4: GitHub Actions Example ```yaml name: CICD Platform Deployment on: push: branches: [main, develop] jobs: test-deployment: runs-on: ubuntu-latest steps: - uses: actions/checkout@v3 - name: Install Rust uses: actions-rs/toolchain@v1 with: toolchain: 1.75 profile: minimal - name: Set CICD Mode run: | echo "VAULT_MODE=cicd" >> $GITHUB_ENV echo "REGISTRY_MODE=cicd" >> $GITHUB_ENV echo "RAG_MODE=cicd" >> $GITHUB_ENV echo "AI_SERVICE_MODE=cicd" >> $GITHUB_ENV echo "DAEMON_MODE=cicd" >> $GITHUB_ENV - name: Build Services run: cargo build --release - name: Run Integration Tests run: | # Start services in background cargo run --release -p vault-service & cargo run --release -p extension-registry 
& cargo run --release -p orchestrator & # Wait for startup sleep 10 # Run tests cargo test --release - name: Health Checks run: | curl -f http://localhost:8200/health curl -f http://localhost:8081/health curl -f http://localhost:9090/health deploy: needs: test-deployment runs-on: ubuntu-latest if: github.ref == 'refs/heads/main' steps: - uses: actions/checkout@v3 - name: Deploy to Production run: | # Deploy production enterprise cluster ./scripts/deploy-enterprise.sh ``` ### Step 5: Run CICD Tests ```bash # Simulate CI environment locally export VAULT_MODE=cicd export CI_ENVIRONMENT=true # Build cargo build --release # Run short-lived services for testing timeout 30 cargo run --release -p vault-service & timeout 30 cargo run --release -p extension-registry & timeout 30 cargo run --release -p orchestrator & # Run tests while services are running sleep 5 cargo test --release # Services auto-cleanup after timeout ``` --- ## Enterprise Mode Deployment **Perfect for**: Production, high availability, compliance ### Prerequisites - **3+ Machines**: Minimum 3 for HA - **Etcd Cluster**: For distributed consensus - **Load Balancer**: HAProxy, nginx, or cloud LB - **TLS Certificates**: Valid certificates for all services - **Monitoring**: Prometheus, ELK, or cloud monitoring - **Backup System**: Daily snapshots to S3 or similar ### Step 1: Deploy Infrastructure #### 1.1 Deploy Etcd Cluster ```bash # Node 1, 2, 3 etcd --name=node-1 \ --listen-client-urls=http://0.0.0.0:2379 \ --advertise-client-urls=http://node-1.internal:2379 \ --initial-cluster="node-1=http://node-1.internal:2380,node-2=http://node-2.internal:2380,node-3=http://node-3.internal:2380" \ --initial-cluster-state=new # Verify cluster etcdctl --endpoints=http://localhost:2379 member list ``` #### 1.2 Deploy Load Balancer ```nginx # HAProxy configuration for vault-service (example) frontend vault_frontend bind *:8200 mode tcp default_backend vault_backend backend vault_backend mode tcp balance roundrobin server 
vault-1 10.0.1.10:8200 check server vault-2 10.0.1.11:8200 check server vault-3 10.0.1.12:8200 check ``` #### 1.3 Configure TLS ```bash # Generate certificates (or use existing) mkdir -p /etc/provisioning/tls # For each service: openssl req -x509 -newkey rsa:4096 \ -keyout /etc/provisioning/tls/vault-key.pem \ -out /etc/provisioning/tls/vault-cert.pem \ -days 365 -nodes \ -subj "/CN=vault.provisioning.prod" # Set permissions chmod 600 /etc/provisioning/tls/*-key.pem chmod 644 /etc/provisioning/tls/*-cert.pem ``` ### Step 2: Set Enterprise Environment Variables ```bash # All machines: Set enterprise mode export VAULT_MODE=enterprise export REGISTRY_MODE=enterprise export RAG_MODE=enterprise export AI_SERVICE_MODE=enterprise export DAEMON_MODE=enterprise # Database cluster export SURREALDB_URL="ws://surrealdb-cluster.internal:8000" export SURREALDB_REPLICAS=3 # Etcd cluster export ETCD_ENDPOINTS="http://node-1.internal:2379,http://node-2.internal:2379,http://node-3.internal:2379" # TLS configuration export TLS_CERT_PATH=/etc/provisioning/tls export TLS_VERIFY=true export TLS_CA_CERT=/etc/provisioning/tls/ca.crt # Monitoring export PROMETHEUS_URL=http://prometheus.internal:9090 export METRICS_ENABLED=true export AUDIT_LOG_ENABLED=true ``` ### Step 3: Deploy Services Across Cluster ```bash # Ansible playbook (simplified) --- - hosts: provisioning_cluster tasks: - name: Build services shell: cargo build --release - name: Start vault-service (machine 1-3) shell: "cargo run --release -p vault-service" when: "'vault' in group_names" - name: Start orchestrator (machine 2-3) shell: "cargo run --release -p orchestrator" when: "'orchestrator' in group_names" - name: Start daemon (machine 3) shell: "cargo run --release -p provisioning-daemon" when: "'daemon' in group_names" - name: Verify cluster health uri: url: "https://{{ inventory_hostname }}:9090/health" validate_certs: yes ``` ### Step 4: Monitor Cluster Health ```bash # Check cluster status curl -s 
https://vault.internal:8200/health | jq .state

# Check replication
curl -s https://orchestrator.internal:9090/api/v1/cluster/status

# Monitor etcd
etcdctl --endpoints=https://node-1.internal:2379 endpoint health

# Check leader election (endpoint status reports which member is leader)
etcdctl --endpoints=https://node-1.internal:2379 endpoint status -w table
```

### Step 5: Enable Monitoring & Alerting

```yaml
# Prometheus configuration
global:
  scrape_interval: 30s
  evaluation_interval: 30s

scrape_configs:
  - job_name: 'vault-service'
    scheme: https
    tls_config:
      ca_file: /etc/provisioning/tls/ca.crt
    static_configs:
      - targets: ['vault-1.internal:8200', 'vault-2.internal:8200', 'vault-3.internal:8200']

  - job_name: 'orchestrator'
    scheme: https
    static_configs:
      - targets: ['orch-1.internal:9090', 'orch-2.internal:9090', 'orch-3.internal:9090']
```

### Step 6: Backup & Recovery

```bash
#!/bin/bash
# Daily backup script
BACKUP_DIR="/mnt/provisioning-backups"
DATE=$(date +%Y%m%d_%H%M%S)

# Backup etcd
etcdctl --endpoints=https://node-1.internal:2379 \
  snapshot save "$BACKUP_DIR/etcd-$DATE.db"

# Backup SurrealDB
curl -X POST https://surrealdb.internal:8000/backup \
  -H "Authorization: Bearer $SURREALDB_TOKEN" \
  > "$BACKUP_DIR/surreal-$DATE.sql"

# Upload to S3
aws s3 cp "$BACKUP_DIR/etcd-$DATE.db" \
  s3://provisioning-backups/etcd/

# Cleanup old backups (keep 30 days)
find "$BACKUP_DIR" -mtime +30 -delete
```

---

## Service Management

### Starting Services

#### Individual Service Startup

```bash
# Start one service
export VAULT_MODE=enterprise
cargo run --release -p vault-service

# In another terminal
export REGISTRY_MODE=enterprise
cargo run --release -p extension-registry
```

#### Batch Startup

```bash
#!/bin/bash
# Start all services (dependency order)
set -e

MODE=${1:-solo}
export VAULT_MODE=$MODE
export REGISTRY_MODE=$MODE
export RAG_MODE=$MODE
export AI_SERVICE_MODE=$MODE
export DAEMON_MODE=$MODE

echo "Starting provisioning platform in $MODE mode..."

# Core services first
echo "Starting infrastructure..."
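# Optional helper (illustrative, not part of the platform scripts): block
# until a service's /health endpoint answers before starting its dependents,
# e.g. `wait_healthy 8200` after launching vault-service.
wait_healthy() {
  local port=$1 tries=${2:-30}
  until curl -sf --max-time 2 "http://localhost:$port/health" >/dev/null 2>&1; do
    tries=$((tries - 1))
    if [ "$tries" -le 0 ]; then
      echo "service on port $port not healthy after timeout" >&2
      return 1
    fi
    sleep 1
  done
}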
cargo run --release -p vault-service &
VAULT_PID=$!

echo "Starting extension registry..."
cargo run --release -p extension-registry &
REGISTRY_PID=$!

# AI layer
echo "Starting AI services..."
cargo run --release -p provisioning-rag &
RAG_PID=$!
cargo run --release -p ai-service &
AI_PID=$!

# Orchestration
echo "Starting orchestration..."
cargo run --release -p orchestrator &
ORCH_PID=$!

echo "All services started. PIDs: $VAULT_PID $REGISTRY_PID $RAG_PID $AI_PID $ORCH_PID"
```

### Stopping Services

```bash
# Stop all services gracefully
pkill -SIGTERM -f "cargo run --release -p"

# Wait for graceful shutdown
sleep 5

# Force kill if needed
pkill -9 -f "cargo run --release -p"

# Verify all stopped
pgrep -f "cargo run --release -p" && echo "Services still running" || echo "All stopped"
```

### Restarting Services

```bash
# Restart single service
pkill -SIGTERM vault-service
sleep 2
cargo run --release -p vault-service &

# Restart all services
./scripts/restart-all.sh $MODE

# Restart with config reload
export VAULT_MODE=multiuser
pkill -SIGTERM vault-service
sleep 2
cargo run --release -p vault-service &
```

### Checking Service Status

```bash
# Check running processes (-f matches the full command line)
pgrep -af "cargo run --release"

# Check listening ports
netstat -tlnp | grep -E "8200|8081|8083|8082|9090|8080"

# Or using ss (modern alternative)
ss -tlnp | grep -E "8200|8081|8083|8082|9090|8080"

# Health endpoint checks
declare -A port=([vault]=8200 [registry]=8081 [rag]=8083 [ai]=8082 [orchestrator]=9090)
for service in vault registry rag ai orchestrator; do
  echo "=== $service ==="
  curl -s http://localhost:${port[$service]}/health | jq .
done
```

---

## Health Checks & Monitoring

### Manual Health Verification

```bash
# Vault Service
curl -s http://localhost:8200/health | jq .
# Expected: {"status":"ok","uptime":123.45}

# Extension Registry
curl -s http://localhost:8081/health | jq .

# RAG System
curl -s http://localhost:8083/health | jq .
# Expected: {"status":"ok","embeddings":"ready","vector_db":"connected"}

# AI Service
curl -s http://localhost:8082/health | jq .
# Orchestrator curl -s http://localhost:9090/health | jq . # Control Center curl -s http://localhost:8080/health | jq . ``` ### Service Integration Tests ```bash # Test vault <-> registry integration curl -X POST http://localhost:8200/api/encrypt \ -H "Content-Type: application/json" \ -d '{"plaintext":"secret"}' | jq . # Test RAG system curl -X POST http://localhost:8083/api/ingest \ -H "Content-Type: application/json" \ -d '{"document":"test.md","content":"# Test"}' | jq . # Test orchestrator curl -X GET http://localhost:9090/api/v1/status | jq . # End-to-end workflow curl -X POST http://localhost:9090/api/v1/provision \ -H "Content-Type: application/json" \ -d '{ "workspace": "test", "services": ["vault", "registry"], "mode": "solo" }' | jq . ``` ### Monitoring Dashboards #### Prometheus Metrics ```bash # Query service uptime curl -s 'http://prometheus:9090/api/v1/query?query=up' | jq . # Query request rate curl -s 'http://prometheus:9090/api/v1/query?query=rate(http_requests_total[5m])' | jq . # Query error rate curl -s 'http://prometheus:9090/api/v1/query?query=rate(http_errors_total[5m])' | jq . 
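# Query p95 request latency -- assumes services export a standard
# http_request_duration_seconds histogram (an assumption; substitute
# your actual metric name)
P95_QUERY='histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))'
curl -sG 'http://prometheus:9090/api/v1/query' --data-urlencode "query=$P95_QUERY" | jq .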
```

#### Log Aggregation

```bash
# Follow vault logs
tail -f /var/log/provisioning/vault-service.log

# Follow all service logs
tail -f /var/log/provisioning/*.log

# Search for errors
grep -r "ERROR" /var/log/provisioning/

# Follow with filtering
tail -f /var/log/provisioning/orchestrator.log | grep -E "ERROR|WARN"
```

### Alerting

```yaml
# Prometheus alerting rules (loaded via rule_files; notifications are
# routed through Alertmanager)
groups:
  - name: provisioning
    rules:
      - alert: ServiceDown
        expr: up{job=~"vault|registry|rag|orchestrator"} == 0
        for: 5m
        annotations:
          summary: "{{ $labels.job }} is down"

      - alert: HighErrorRate
        expr: rate(http_errors_total[5m]) > 0.05
        annotations:
          summary: "High error rate detected"

      - alert: DiskSpaceWarning
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.2
        annotations:
          summary: "Disk space below 20%"
```

---

## Troubleshooting

### Service Won't Start

**Problem**: `error: failed to bind to port 8200`

**Solutions**:

```bash
# Check if port is in use
lsof -i :8200
ss -tlnp | grep 8200

# Kill existing process
pkill -9 -f vault-service

# Or use different port
export VAULT_SERVER_PORT=8201
cargo run --release -p vault-service
```

### Configuration Loading Fails

**Problem**: `error: failed to load config from mode file`

**Solutions**:

```bash
# Verify schemas exist
ls -la provisioning/schemas/platform/schemas/vault-service.ncl

# Validate schema syntax
nickel typecheck provisioning/schemas/platform/schemas/vault-service.ncl

# Check defaults are present
nickel typecheck provisioning/schemas/platform/defaults/vault-service-defaults.ncl

# Verify deployment mode overlay exists
ls -la provisioning/schemas/platform/defaults/deployment/$VAULT_MODE-defaults.ncl

# Run service with explicit mode
export VAULT_MODE=solo
cargo run --release -p vault-service
```

### Database Connection Issues

**Problem**: `error: failed to connect to database`

**Solutions**:

```bash
# Verify database is running
curl http://surrealdb:8000/health
etcdctl --endpoints=http://etcd:2379 endpoint health

# Check connectivity
nc
-zv surrealdb 8000 nc -zv etcd 2379 # Update connection string export SURREALDB_URL=ws://surrealdb:8000 export ETCD_ENDPOINTS=http://etcd:2379 # Restart service with new config pkill -9 vault-service cargo run --release -p vault-service ``` ### Service Crashes on Startup **Problem**: Service exits with code 1 or 139 **Solutions**: ```bash # Run with verbose logging RUST_LOG=debug cargo run -p vault-service 2>&1 | head -50 # Check system resources free -h df -h # Check for core dumps coredumpctl list # Run under debugger (if crash suspected) rust-gdb --args target/release/vault-service ``` ### High Memory Usage **Problem**: Service consuming > expected memory **Solutions**: ```bash # Check memory usage ps aux | grep vault-service | grep -v grep # Monitor over time watch -n 1 'ps aux | grep vault-service | grep -v grep' # Reduce worker count export VAULT_SERVER_WORKERS=2 cargo run --release -p vault-service # Check for memory leaks valgrind --leak-check=full target/release/vault-service ``` ### Network/DNS Issues **Problem**: `error: failed to resolve hostname` **Solutions**: ```bash # Test DNS resolution nslookup vault.internal dig vault.internal # Test connectivity to service curl -v http://vault.internal:8200/health # Add to /etc/hosts if needed echo "10.0.1.10 vault.internal" >> /etc/hosts # Check network interface ip addr show netstat -nr ``` ### Data Persistence Issues **Problem**: Data lost after restart **Solutions**: ```bash # Verify backup exists ls -la /mnt/provisioning-backups/ ls -la /var/lib/provisioning/ # Check disk space df -h /var/lib/provisioning # Verify file permissions ls -l /var/lib/provisioning/vault/ chmod 755 /var/lib/provisioning/vault/* # Restore from backup ./scripts/restore-backup.sh /mnt/provisioning-backups/vault-20260105.sql ``` ### Debugging Checklist When troubleshooting, use this systematic approach: ```bash # 1. Check service is running pgrep -f vault-service || echo "Service not running" # 2. 
Check port is listening
ss -tlnp | grep 8200 || echo "Port not listening"

# 3. Check logs for errors
tail -20 /var/log/provisioning/vault-service.log | grep -i error

# 4. Test HTTP endpoint
curl -i http://localhost:8200/health

# 5. Check dependencies
curl http://surrealdb:8000/health
etcdctl --endpoints=http://etcd:2379 endpoint health

# 6. Check schema definition
nickel typecheck provisioning/schemas/platform/schemas/vault-service.ncl

# 7. Verify environment variables
env | grep -E "VAULT_|SURREALDB_|ETCD_"

# 8. Check system resources
free -h && df -h && top -bn1 | head -10
```

---

## Configuration Updates

### Updating Service Configuration

```bash
# 1. Edit the schema definition
vim provisioning/schemas/platform/schemas/vault-service.ncl

# 2. Update defaults if needed
vim provisioning/schemas/platform/defaults/vault-service-defaults.ncl

# 3. Validate syntax
nickel typecheck provisioning/schemas/platform/schemas/vault-service.ncl

# 4. Re-export configuration from schemas
./provisioning/.typedialog/platform/scripts/generate-configs.nu vault-service multiuser

# 5. Restart affected service (brief interruption for clients)
pkill -SIGTERM vault-service
sleep 2
cargo run --release -p vault-service &

# 6. Verify configuration loaded
curl http://localhost:8200/api/config | jq .
```

### Mode Migration

```bash
# Migrate from solo to multiuser:

# 1. Stop services
pkill -SIGTERM -f "cargo run"
sleep 5

# 2. Backup current data
tar -czf /backup/provisioning-solo-$(date +%s).tar.gz /var/lib/provisioning/

# 3. Set new mode
export VAULT_MODE=multiuser
export REGISTRY_MODE=multiuser
export RAG_MODE=multiuser

# 4. Start services with new config
cargo run --release -p vault-service &
cargo run --release -p extension-registry &

# 5.
Verify new mode curl http://localhost:8200/api/config | jq .deployment_mode ``` --- ## Production Checklist Before deploying to production: - [ ] All services compiled in release mode (`--release`) - [ ] TLS certificates installed and valid - [ ] Database cluster deployed and healthy - [ ] Load balancer configured and routing traffic - [ ] Monitoring and alerting configured - [ ] Backup system tested and working - [ ] High availability verified (failover tested) - [ ] Security hardening applied (firewall rules, etc.) - [ ] Documentation updated for your environment - [ ] Team trained on deployment procedures - [ ] Runbooks created for common operations - [ ] Disaster recovery plan tested --- ## Getting Help ### Community Resources - **GitHub Issues**: Report bugs at `github.com/your-org/provisioning/issues` - **Documentation**: Full docs at `provisioning/docs/` - **Slack Channel**: `#provisioning-platform` ### Internal Support - **Platform Team**: [platform@your-org.com](mailto:platform@your-org.com) - **On-Call**: Check PagerDuty for active rotation - **Escalation**: Contact infrastructure leadership ### Useful Commands Reference ```bash # View all available commands cargo run -- --help # View service schemas ls -la provisioning/schemas/platform/schemas/ ls -la provisioning/schemas/platform/defaults/ # List running services ps aux | grep cargo # Monitor service logs in real-time journalctl -fu provisioning-vault # Generate diagnostics bundle ./scripts/generate-diagnostics.sh > /tmp/diagnostics-$(date +%s).tar.gz ```
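For quick triage, the per-service health checks used throughout this guide can be rolled into a single loop. A minimal sketch, assuming the default localhost ports from this guide's examples (adjust hostnames and ports to your deployment):

```bash
#!/usr/bin/env bash
# One-shot health sweep across all platform services.
# Ports follow this guide's examples; override as needed.
check_platform_health() {
  declare -A ports=(
    [vault]=8200 [registry]=8081 [rag]=8083
    [ai]=8082 [orchestrator]=9090 [control-center]=8080
  )
  local svc
  for svc in vault registry rag ai orchestrator control-center; do
    if curl -sf --max-time 2 "http://localhost:${ports[$svc]}/health" >/dev/null 2>&1; then
      echo "$svc: ok"
    else
      echo "$svc: UNREACHABLE"
    fi
  done
}

check_platform_health
```

Run it after any start or restart; a service reported UNREACHABLE usually means it is still booting or is bound to a different port.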