provisioning/docs/src/architecture/system-overview.md
2026-01-17 03:58:28 +00:00

System Overview

Complete architecture of the Provisioning Infrastructure Automation Platform.

Architecture Layers

Provisioning uses a 5-layer modular architecture:

┌─────────────────────────────────────────────────────────────┐
│ User Interface Layer                                        │
│ • CLI (provisioning command)  • Web Control Center (UI)     │
│ • REST API  • MCP Server (AI) • Batch Scheduler             │
└──────────────────────────┬──────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────────┐
│ Core Engine Layer (provisioning/core/)                      │
│ • 211-line CLI dispatcher (84% code reduction)              │
│ • 476+ configuration accessors (hierarchical)               │
│ • Provider abstraction (multi-cloud support)                │
│ • Workspace management system                               │
│ • Infrastructure validation (54+ Nushell libraries)         │
│ • Secrets management (SOPS + Age integration)               │
└──────────────────────────┬──────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────────┐
│ Orchestration Layer (provisioning/platform/)                │
│ • Hybrid Orchestrator (Rust + Nushell)                      │
│ • Workflow execution with checkpoints                       │
│ • Dependency resolver & task scheduler                      │
│ • File-based persistence                                    │
│ • REST API endpoints (83+)                                  │
│ • State management (SurrealDB)                              │
└──────────────────────────┬──────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────────┐
│ Extension Layer (provisioning/extensions/)                  │
│ • Cloud Providers (UpCloud, AWS, Hetzner, Local)            │
│ • Task Services (50+ services in 18 categories)             │
│ • Clusters (9 pre-built cluster templates)                  │
│ • Batch Workflows (automation templates)                    │
│ • Nushell Plugins (10-50x performance gains)                │
└──────────────────────────┬──────────────────────────────────┘
                           ↓
┌─────────────────────────────────────────────────────────────┐
│ Infrastructure Layer                                        │
│ • Cloud Resources (servers, networks, storage)              │
│ • Running Services (Kubernetes, databases, etc.)            │
│ • State Persistence (SurrealDB, file storage)               │
│ • Monitoring & Logging (Prometheus, Loki)                   │
└─────────────────────────────────────────────────────────────┘

Core System Components

1. CLI Layer (provisioning/core/cli/)

Entry Point: provisioning/core/cli/provisioning

  • Bash wrapper (210 lines) - Minimal bootstrap
  • Routes commands to Nushell dispatcher
  • Loads environment and validates workspace
  • Handles error reporting

Key Features:

  • Single entry point
  • Pluggable architecture
  • Support for 111+ commands
  • 80+ shortcuts for productivity

2. Core Engine (provisioning/core/nulib/)

Structure: 54 Nushell libraries organized by function

Main Components:

Configuration Management (lib_provisioning/config/)

  • Hierarchical loading: 5-layer precedence system
  • 476+ accessors: Type-safe configuration access
  • Variable interpolation: Template expansion
  • TOML merging: Environment-specific overrides
  • Validation: Schema enforcement

Provider Abstraction (lib_provisioning/providers/)

  • Multi-cloud support: UpCloud, AWS, Hetzner, Local
  • Unified interface: Single API for all providers
  • Dynamic loading: Load providers on demand
  • Credential management: Encrypted credential handling
  • State tracking: Provider-specific state persistence
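
A unified interface with on-demand loading could be sketched as follows. This is an illustration in Python rather than the platform's Nushell code, and every name in it (`Provider`, `LocalProvider`, `get_provider`) is hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict


class Provider(ABC):
    """Hypothetical unified interface that every provider implements."""

    @abstractmethod
    def create_server(self, name: str) -> dict: ...

    @abstractmethod
    def delete_server(self, name: str) -> bool: ...


class LocalProvider(Provider):
    """Illustrative in-memory provider standing in for a real backend."""

    def __init__(self) -> None:
        self.servers: Dict[str, dict] = {}

    def create_server(self, name: str) -> dict:
        self.servers[name] = {"name": name, "provider": "local", "state": "running"}
        return self.servers[name]

    def delete_server(self, name: str) -> bool:
        return self.servers.pop(name, None) is not None


# Factories are registered up front; instances are built only on first use.
_FACTORIES: Dict[str, Callable[[], Provider]] = {"local": LocalProvider}
_LOADED: Dict[str, Provider] = {}


def get_provider(name: str) -> Provider:
    """Load a provider on demand and cache the instance."""
    if name not in _LOADED:
        if name not in _FACTORIES:
            raise KeyError(f"unknown provider: {name}")
        _LOADED[name] = _FACTORIES[name]()
    return _LOADED[name]
```

Callers only ever see `Provider`, so adding a backend means registering one more factory.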

Workspace Management (lib_provisioning/workspace/)

  • Workspace registry: Track all workspaces
  • Switching: Atomic workspace transitions
  • Isolation: Independent state per workspace
  • Configuration loading: Workspace-specific overrides
  • Extensions: Inherit from platform extensions
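
One common way to make a workspace switch atomic is to write the updated registry to a temporary file and rename it over the original. The sketch below assumes a JSON registry with `active` and `workspaces` keys; that layout is hypothetical and not taken from the platform.

```python
import json
import os
import tempfile
from pathlib import Path


def switch_workspace(registry_file: Path, name: str) -> None:
    """Atomically mark `name` as the active workspace in a JSON registry."""
    registry = json.loads(registry_file.read_text())
    if name not in registry["workspaces"]:
        raise KeyError(f"workspace not registered: {name}")
    registry["active"] = name
    # Write to a temp file in the same directory, then rename over the
    # original so readers see either the old or the new registry, never
    # a partially written one.
    fd, tmp_path = tempfile.mkstemp(dir=registry_file.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(registry, handle)
    os.replace(tmp_path, registry_file)
```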

Infrastructure Validation (lib_provisioning/infra_validator/)

  • Schema validation: Nickel contract checking
  • Constraint enforcement: Business rule validation
  • Dependency analysis: Infrastructure dependency graph
  • Type checking: Static type validation
  • Error reporting: Detailed error messages with suggestions

Secrets Management (lib_provisioning/secrets/)

  • SOPS integration: Mozilla SOPS for encryption
  • Age encryption: Modern file encryption (X25519 public-key recipients)
  • KMS backends: Cosmian, AWS KMS, local
  • Credential injection: Runtime variable substitution
  • Audit logging: Track secret access

Command Utilities (lib_provisioning/cmd/)

  • SSH operations: Remote command execution
  • Batch operations: Parallel command execution
  • Error handling: Structured error reporting
  • Logging: Comprehensive operation logging
  • Retry logic: Automatic retry with backoff

3. Orchestration Engine (provisioning/platform/)

Technology: Rust + Nushell hybrid

12 Microservices (Rust crates):

| Service | Purpose | Key Features |
| --- | --- | --- |
| orchestrator | Workflow execution | Scheduler, file persistence, REST API |
| control-center | API gateway + auth | RBAC, Cedar policies, audit logging |
| control-center-ui | Web dashboard | Infrastructure view, config management |
| mcp-server | AI integration | Model Context Protocol, auto-completion |
| vault-service | Secrets storage | Encryption, KMS, credential injection |
| extension-registry | OCI registry | Extension distribution, versioning |
| ai-service | LLM features | Prompt optimization, context awareness |
| detector | Anomaly detection | Health monitoring, pattern recognition |
| rag | Knowledge retrieval | Document embedding, semantic search |
| provisioning-daemon | Background service | Event monitoring, task scheduling |
| platform-config | Config management | Schema validation, environment handling |
| service-clients | API clients | SDK for platform services, cloud APIs |

Detailed Services:

Orchestrator (crates/orchestrator/)

  • High-performance scheduler: Rust core
  • File-based persistence: Durable queue
  • Workflow execution: Dependency-aware scheduling
  • Checkpoint recovery: Resume from failures
  • Parallel execution: Multi-task handling
  • State management: Track job status
  • REST API: 9 core endpoints
  • Port: 9090 (health check endpoint)

Control Center (crates/control-center/)

  • Authorization engine: Cedar policy enforcement
  • RBAC system: Role-based access control
  • Audit logging: Complete audit trail
  • API gateway: REST API for all operations
  • System configuration: Central configuration management
  • Health monitoring: Real-time system status

Control Center UI (crates/control-center-ui/)

  • Web dashboard: Real-time infrastructure view
  • Workflow visualization: Batch job monitoring
  • Configuration management: Web-based configuration
  • Resource explorer: Browse infrastructure
  • Audit viewer: Security audit trail

MCP Server (crates/mcp-server/)

  • AI integration: Model Context Protocol support
  • Natural language: Parse infrastructure requests
  • Auto-completion: Intelligent configuration suggestions
  • 7 settings tools: Configuration management via LLM
  • Context-aware: Understand workspace context

Vault Service (crates/vault-service/)

  • Secrets backend: Encrypted credential storage
  • KMS integration: Key Management System support
  • SOPS + Age: SOPS encryption backend
  • Credential injection: Secure credential delivery
  • Audit logging: Secret access tracking

Extension Registry (crates/extension-registry/)

  • OCI distribution: Container image distribution
  • Extension packaging: Provider/taskserv distribution
  • Version management: Semantic versioning
  • Registry API: Content-addressable storage

AI Service (crates/ai-service/)

  • LLM integration: Large Language Model support
  • Prompt optimization: Infrastructure request parsing
  • Context awareness: Workspace context enrichment
  • Response generation: Configuration suggestions

Detector (crates/detector/)

  • Anomaly detection: System health monitoring
  • Pattern recognition: Infrastructure issue identification
  • Alert generation: Alerting system integration
  • Real-time monitoring: Continuous surveillance

Platform Config (crates/platform-config/)

  • Configuration management: Centralized config loading
  • Schema validation: Configuration validation
  • Environment handling: Multi-environment support
  • Default settings: System-wide defaults

Provisioning Daemon (crates/provisioning-daemon/)

  • Background service: Continuous operation
  • Event monitoring: System event handling
  • Task scheduling: Background job execution
  • State synchronization: Infrastructure state sync

RAG Service (crates/rag/)

  • Retrieval-Augmented Generation: Knowledge base integration
  • Document embedding: Semantic search
  • Context retrieval: Intelligent response context
  • Knowledge synthesis: Answer generation

Service Clients (crates/service-clients/)

  • API clients: Client SDK for platform services
  • Cloud providers: Multi-cloud provider SDKs
  • Request handling: HTTP/RPC client utilities
  • Connection pooling: Efficient resource management

4. Extensions (provisioning/extensions/)

Modular infrastructure components:

Providers (5 cloud providers)

  • UpCloud - Primary European cloud
  • AWS - Amazon Web Services
  • Hetzner - Bare-metal & cloud servers
  • Local - Development environment
  • Demo - Testing & mocking

Each provider includes:

  • Nickel schemas for configuration
  • API client implementation
  • Server creation/deletion logic
  • Network management
  • State tracking

Task Services (50+ services in 18 categories)

| Category | Services | Purpose |
| --- | --- | --- |
| Container Runtime | containerd, crio, podman, crun, youki, runc | Container execution |
| Kubernetes | kubernetes, etcd, coredns, cilium, flannel, calico | Orchestration |
| Storage | rook-ceph, local-storage, mayastor, external-nfs | Data persistence |
| Databases | postgres, redis, mysql, mongodb | Data management |
| Networking | ip-aliases, proxy, resolv, kms | Network services |
| Security | webhook, kms, oras, radicle | Security services |
| Observability | prometheus, grafana, loki, jaeger | Monitoring & logging |
| Development | gitea, coder, desktop, buildkit | Developer tools |
| Hypervisor | kvm, qemu, libvirt | Virtualization |

Clusters (9 pre-built templates)

  • web - Web service cluster (nginx + postgres)
  • oci-reg - Container registry
  • git - Git hosting (Gitea)
  • buildkit - Build infrastructure
  • k8s-ha - HA Kubernetes (3 control planes)
  • postgresql - HA PostgreSQL cluster
  • cicd-argocd - GitOps CI/CD
  • cicd-tekton - Tekton pipelines

5. Infrastructure Layer

What Provisioning Manages:

  • Cloud Resources: VMs, networks, storage
  • Services: Kubernetes, databases, monitoring
  • Applications: Web services, APIs, tools
  • State: Configuration, data, logs
  • Monitoring: Metrics, traces, logs

Configuration System

Hierarchical 5-Layer System:

Precedence (High → Low):

1. Runtime Arguments   (CLI flags: --provider upcloud)
   ↓
2. Environment Variables (PROVISIONING_PROVIDER=aws)
   ↓
3. Workspace Config    (workspace/config/provisioning.yaml)
   ↓
4. Environment Defaults (workspace/config/prod-defaults.toml)
   ↓
5. System Defaults     (~/.config/provisioning/ + platform defaults)
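
The precedence order above amounts to folding the five layers from lowest to highest, with later layers overriding earlier ones key by key. A minimal sketch in Python, assuming each layer has already been parsed into a dictionary; the function names and the sample values (plan and zone strings) are illustrative only.

```python
from typing import Any, Dict, List


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Recursively merge `override` into a copy of `base`; override wins."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve(layers_high_to_low: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Apply layers from lowest to highest precedence."""
    result: Dict[str, Any] = {}
    for layer in reversed(layers_high_to_low):
        result = deep_merge(result, layer)
    return result


layers = [
    {"provider": "upcloud"},                                  # 1. runtime arguments
    {"provider": "aws"},                                      # 2. environment variables
    {"server": {"plan": "2xCPU-4GB"}},                        # 3. workspace config
    {"server": {"zone": "de-fra1"}},                          # 4. environment defaults
    {"provider": "local", "server": {"plan": "1xCPU-1GB"}},   # 5. system defaults
]
config = resolve(layers)
# config == {"provider": "upcloud", "server": {"plan": "2xCPU-4GB", "zone": "de-fra1"}}
```

The CLI flag wins over the environment variable, and nested `server` keys from different layers are combined rather than replaced wholesale.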

Configuration Languages:

| Format | Purpose | Validation | Editability |
| --- | --- | --- | --- |
| Nickel | Infrastructure source | Type-safe, contracts | Direct |
| TOML | Settings, defaults | Schema validation | Direct |
| YAML | User config, metadata | Schema validation | Direct |
| JSON | Exported configs | Schema validation | Generated |

Key Features:

  • Lazy evaluation
  • Recursive merging
  • Variable interpolation
  • Constraint checking
  • Automatic validation
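
Variable interpolation can be illustrated with a small resolver that expands references to other configuration keys. The `{{ dotted.key }}` placeholder syntax and the function names below are assumptions made for this sketch; the platform's actual template syntax is not shown in this document.

```python
import re
from typing import Any, Dict

# Hypothetical placeholder syntax: {{ section.key }}
_PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def lookup(config: Dict[str, Any], dotted: str) -> Any:
    """Walk a dotted path such as 'paths.base' through nested dicts."""
    node: Any = config
    for part in dotted.split("."):
        node = node[part]
    return node


def expand(value: str, config: Dict[str, Any], depth: int = 0) -> str:
    """Replace placeholders in `value`, following nested references."""
    if depth > 10:
        raise ValueError("interpolation too deep (possible cycle)")

    def substitute(match: "re.Match[str]") -> str:
        resolved = lookup(config, match.group(1))
        if isinstance(resolved, str):
            return expand(resolved, config, depth + 1)
        return str(resolved)

    return _PLACEHOLDER.sub(substitute, value)


settings = {
    "workspace": {"name": "prod"},
    "paths": {"base": "/srv/{{ workspace.name }}", "logs": "{{ paths.base }}/logs"},
}
logs_dir = expand(settings["paths"]["logs"], settings)  # "/srv/prod/logs"
```

The depth limit is a simple guard so that two keys referring to each other fail with an error instead of recursing forever.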

State Management

SurrealDB Graph Database:

Stores complex infrastructure relationships:

Nodes:
- Servers (compute)
- Networks (connectivity)
- Storage (persistence)
- Services (software)
- Workflows (automation)

Edges:
- Server → Network (connected)
- Server → Storage (mounted)
- Service → Server (running on)
- Workflow → Dependency (depends on)
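
The node and edge model above can be pictured with a small in-memory graph. This sketch does not use SurrealDB; the `StateGraph` class, the `kind:name` identifiers, and the relation names are hypothetical and only show how typed edges support queries in both directions.

```python
from collections import defaultdict
from typing import DefaultDict, List, Tuple


class StateGraph:
    """Directed graph with labelled edges: (source) -[relation]-> (target)."""

    def __init__(self) -> None:
        self._out: DefaultDict[str, List[Tuple[str, str]]] = defaultdict(list)

    def relate(self, source: str, relation: str, target: str) -> None:
        self._out[source].append((relation, target))

    def targets(self, source: str, relation: str) -> List[str]:
        """Follow edges forward, e.g. which networks a server is connected to."""
        return [t for r, t in self._out.get(source, []) if r == relation]

    def sources(self, relation: str, target: str) -> List[str]:
        """Follow edges backward, e.g. which services run on a server."""
        return [
            s
            for s, edges in self._out.items()
            for r, t in edges
            if r == relation and t == target
        ]


graph = StateGraph()
graph.relate("server:web-1", "connected", "network:private")
graph.relate("server:web-1", "mounted", "storage:data")
graph.relate("service:nginx", "running_on", "server:web-1")
```

Here `graph.sources("running_on", "server:web-1")` answers the question of what would be affected if `web-1` were removed.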

File-Based Persistence:

For orchestrator queue and checkpoints:

~/.provisioning/
├── state/              # Infrastructure state
├── checkpoints/        # Workflow checkpoints
├── queue/              # Orchestrator queue
└── logs/               # Operational logs
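
Checkpoint files make it possible to resume a workflow after a failure by skipping tasks that already completed. A minimal sketch, assuming one JSON file per workflow under `checkpoints/`; the file naming and JSON fields are assumptions, not the orchestrator's actual format.

```python
import json
from pathlib import Path
from typing import List


def save_checkpoint(root: Path, workflow_id: str, completed: List[str]) -> Path:
    """Persist the list of completed tasks for a workflow."""
    path = root / "checkpoints" / f"{workflow_id}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"workflow": workflow_id, "completed": completed}))
    tmp.replace(path)  # rename so a crash never leaves a half-written checkpoint
    return path


def remaining_tasks(root: Path, workflow_id: str, tasks: List[str]) -> List[str]:
    """Return the tasks still to run, in their original order."""
    path = root / "checkpoints" / f"{workflow_id}.json"
    done = set()
    if path.exists():
        done = set(json.loads(path.read_text())["completed"])
    return [task for task in tasks if task not in done]
```

A workflow without a checkpoint file simply runs all of its tasks.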

Security Architecture

4-Layer Security Model:

| Layer | Components | Features |
| --- | --- | --- |
| Authentication | JWT, sessions, MFA | 2FA, TOTP, WebAuthn |
| Authorization | Cedar policies, RBAC | Fine-grained permissions |
| Encryption | AES-256-GCM, TLS | At-rest & in-transit |
| Audit | Logging, compliance | 7-year retention |

Security Services:

  • JWT token validation
  • Argon2id password hashing
  • Multi-factor authentication
  • Cedar policy enforcement
  • Encrypted credential storage
  • KMS integration (5 backends)
  • Audit logging (5 export formats)
  • Compliance checking (SOC2, GDPR, HIPAA)

Performance Characteristics

Modular CLI (84% code reduction):

  • Main CLI: 211 lines (vs. 1,329 before)
  • Command discovery: O(1) dispatcher
  • Lazy loading: Commands loaded on demand
  • Caching: Configuration cached after first load

Orchestrator Performance:

  • Dependency resolution: O(n log n) topological sort
  • Parallel execution: Configurable task limit
  • Checkpoint recovery: Resume from failure point
  • Memory efficient: File-based queue
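
Dependency-aware scheduling can be sketched with Kahn's algorithm. This version uses a priority queue so the order is deterministic, and the heap operations are where a logarithmic factor comes from. It is an illustration in Python, not the orchestrator's Rust implementation, and the task names are made up.

```python
import heapq
from typing import Dict, List, Set


def schedule(deps: Dict[str, Set[str]]) -> List[str]:
    """Return tasks ordered so that every task follows its dependencies."""
    pending = {task: set(requires) for task, requires in deps.items()}
    dependents: Dict[str, List[str]] = {task: [] for task in pending}
    for task, requires in pending.items():
        for dep in requires:
            if dep not in pending:
                raise KeyError(f"{task} depends on unknown task {dep}")
            dependents[dep].append(task)

    # Tasks with no unmet dependencies are ready to run.
    ready = [task for task, requires in pending.items() if not requires]
    heapq.heapify(ready)

    order: List[str] = []
    while ready:
        task = heapq.heappop(ready)
        order.append(task)
        for child in dependents[task]:
            pending[child].discard(task)
            if not pending[child]:
                heapq.heappush(ready, child)

    if len(order) != len(pending):
        raise ValueError("dependency cycle detected")
    return order


plan = schedule({
    "network": set(),
    "storage": set(),
    "server": {"network"},
    "kubernetes": {"server", "storage"},
})
# plan == ["network", "server", "storage", "kubernetes"]
```

Tasks that are in the ready set at the same time have no dependencies on each other, so a scheduler could run them in parallel up to the configured task limit.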

Provider Operations:

  • Batch creation: Parallel server provisioning
  • Bulk operations: Multi-resource transactions
  • State tracking: Efficient state queries
  • Rollback: Atomic operation reversal
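
Rollback of a multi-resource operation is often implemented by pairing each step with an undo action and unwinding in reverse order on failure. A minimal sketch under that assumption; `run_with_rollback` and the step structure are hypothetical.

```python
from typing import Callable, List, Tuple

# Each step is an (apply, undo) pair of zero-argument callables.
Step = Tuple[Callable[[], None], Callable[[], None]]


def run_with_rollback(steps: List[Step]) -> bool:
    """Run all steps; if one fails, undo the completed ones in reverse order."""
    undo_stack: List[Callable[[], None]] = []
    try:
        for apply_step, undo_step in steps:
            apply_step()
            undo_stack.append(undo_step)
    except Exception:
        for undo_step in reversed(undo_stack):
            undo_step()
        return False
    return True
```

The undo action of the failing step itself is not called, since that step never completed.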

Nushell Plugins (10-50x speedup):

  • Compiled Rust extensions
  • Direct native code execution
  • Zero-copy data passing
  • Async I/O support

Deployment Modes

Three Operational Modes:

| Mode | Interaction | Configuration | Rollback | Use Case |
| --- | --- | --- | --- | --- |
| Interactive TUI | Ratatui UI | Manual input | Automatic | Development |
| Headless CLI | Command-line | Script-driven | Manual | Automation |
| Unattended CI/CD | Non-interactive | Configuration file | Automatic | CI/CD pipelines |

Technology Stack

| Component | Technology | Why |
| --- | --- | --- |
| IaC Language | Nickel | Type-safe, lazy evaluation, contracts |
| Scripting | Nushell 0.109+ | Structured data pipelines |
| Performance | Rust | Zero-cost abstractions, memory safety |
| State | SurrealDB | Graph database for relationships |
| Encryption | SOPS + Age | Industry-standard encryption |
| Security | Cedar + JWT | Policy enforcement + tokens |
| Orchestration | Custom | Specialized for infrastructure workflows |

File Organization

provisioning/
├── core/                       # CLI engine (Nushell)
│   ├── cli/provisioning       # Main entry point
│   ├── nulib/                 # 54 core libraries
│   ├── plugins/               # Nushell plugins (Rust)
│   └── scripts/               # Utility scripts
│
├── platform/                   # Microservices (Rust)
│   ├── crates/                # 12 microservices
│   │   ├── orchestrator/      # Workflow scheduler
│   │   ├── control-center/    # API gateway + auth
│   │   ├── control-center-ui/ # Web dashboard
│   │   ├── mcp-server/        # AI integration
│   │   ├── vault-service/     # Secrets backend
│   │   ├── extension-registry/ # OCI registry
│   │   ├── ai-service/        # LLM features
│   │   ├── detector/          # Anomaly detection
│   │   ├── rag/               # Knowledge retrieval
│   │   ├── provisioning-daemon/ # Background service
│   │   ├── platform-config/   # Config management
│   │   └── service-clients/   # API clients
│   └── Cargo.toml             # Rust workspace
│
├── extensions/                # Extensible components
│   ├── providers/             # Cloud providers (5)
│   ├── taskservs/             # Task services (50+)
│   ├── clusters/              # Cluster templates (9)
│   └── workflows/             # Automation templates
│
├── schemas/                   # Nickel schemas
│   ├── main.ncl              # Entry point
│   ├── config/               # Configuration schemas
│   ├── infrastructure/       # Infrastructure schemas
│   ├── operations/           # Operational schemas
│   └── [other schemas]       # Additional schemas
│
├── config/                    # System configuration
│   └── config.defaults.toml  # Default settings
│
├── bootstrap/                 # Installation
│   ├── install.sh            # Bash bootstrap
│   └── install.nu            # Nushell installer
│
├── docs/                      # Product documentation
│   └── src/                  # mdBook source
│
└── README.md                  # Project overview

Component Interaction

Typical Workflow:

User Input
   ↓
CLI Dispatcher (provisioning/core/cli/provisioning)
   ↓
Nushell Handler (provisioning/core/nulib/commands/)
   ↓
Configuration Loading (lib_provisioning/config/)
   ↓
Provider Selection (lib_provisioning/providers/)
   ↓
Validation (lib_provisioning/infra_validator/)
   ↓
Orchestrator Queue (provisioning/platform/crates/orchestrator/)
   ↓
Task Execution (provider + task service)
   ↓
State Update (SurrealDB / file storage)
   ↓
Audit Logging (security system)
   ↓
User Feedback

Scalability

Provisioning scales across four deployment profiles, from a single instance to a multi-node cluster:

  • Solo: 2 CPU cores, 4 GB RAM (single instance)
  • MultiUser: 4-8 CPU cores, 8 GB RAM (small team)
  • CICD: 8+ CPU cores, 16 GB RAM (enterprise)
  • Enterprise: Multi-node Kubernetes (unlimited)

Bottlenecks & Solutions:

| Component | Bottleneck | Solution |
| --- | --- | --- |
| Orchestrator | Task queue | Partition by workspace |
| State | SurrealDB | Horizontal scaling |
| Providers | API rate limits | Exponential backoff |
| Storage | Disk I/O | SSD + caching |

Integration Points

Provisioning integrates with:

  • Kubernetes API - Cluster management
  • Cloud Provider APIs - Resource provisioning
  • SOPS + Age - Secrets encryption
  • Prometheus - Metrics collection
  • Cedar - Policy enforcement
  • SurrealDB - State persistence
  • MCP - AI integration
  • KMS - Key management (Cosmian, AWS, local)

Reliability Features

Fault Tolerance:

  • Checkpoint recovery - Resume from failure
  • Automatic rollback - Revert failed operations
  • Retry logic - Exponential backoff
  • Health checks - Continuous monitoring
  • Backup & restore - Data protection
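
Retry with exponential backoff doubles the wait after each failed attempt, up to a cap. The sketch below is a generic illustration; the function name, the default delays, and the injectable `sleep` parameter are assumptions rather than platform settings.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    operation: Callable[[], T],
    attempts: int = 5,
    base: float = 0.5,
    cap: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation` until it succeeds or `attempts` is exhausted."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait base, 2*base, 4*base, ... seconds, never more than cap.
            sleep(min(cap, base * 2 ** attempt))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the function easy to test without real delays. Production code would usually add random jitter to the delay so that many clients do not retry in lockstep.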

High Availability:

  • Multi-node orchestrator
  • Database replication
  • Service redundancy
  • Load balancing
  • Failover automation