provisioning/docs/src/architecture/system-overview.md

543 lines
21 KiB
Markdown
Raw Normal View History

2026-01-14 04:53:21 +00:00
# System Overview
2026-01-17 03:58:28 +00:00
Complete architecture of the Provisioning Infrastructure Automation Platform.
## Architecture Layers
Provisioning uses a 5-layer modular architecture:
```text
┌─────────────────────────────────────────────────────────────┐
│ User Interface Layer │
│ • CLI (provisioning command) • Web Control Center (UI) │
│ • REST API • MCP Server (AI) • Batch Scheduler │
└──────────────────────────┬──────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Core Engine Layer (provisioning/core/) │
│ • 211-line CLI dispatcher (84% code reduction) │
│ • 476+ configuration accessors (hierarchical) │
│ • Provider abstraction (multi-cloud support) │
│ • Workspace management system │
│ • Infrastructure validation (54+ Nushell libraries) │
│ • Secrets management (SOPS + Age integration) │
└──────────────────────────┬──────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Orchestration Layer (provisioning/platform/) │
│ • Hybrid Orchestrator (Rust + Nushell) │
│ • Workflow execution with checkpoints │
│ • Dependency resolver & task scheduler │
│ • File-based persistence │
│ • REST API endpoints (83+) │
│ • State management (SurrealDB) │
└──────────────────────────┬──────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Extension Layer (provisioning/extensions/) │
│ • Cloud Providers (UpCloud, AWS, Hetzner, Local) │
│ • Task Services (50+ services in 18 categories) │
│ • Clusters (9 pre-built cluster templates) │
│ • Batch Workflows (automation templates) │
│ • Nushell Plugins (10-50x performance gains) │
└──────────────────────────┬──────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Infrastructure Layer │
│ • Cloud Resources (servers, networks, storage) │
│ • Running Services (Kubernetes, databases, etc.) │
│ • State Persistence (SurrealDB, file storage) │
│ • Monitoring & Logging (Prometheus, Loki) │
└─────────────────────────────────────────────────────────────┘
2026-01-14 04:53:21 +00:00
```
2026-01-17 03:58:28 +00:00
## Core System Components
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
### 1. CLI Layer (`provisioning/core/cli/`)
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Entry Point**: `provisioning/core/cli/provisioning`
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
- **Bash wrapper** (210 lines) - Minimal bootstrap
- Routes commands to Nushell dispatcher
- Loads environment and validates workspace
- Handles error reporting
2026-01-14 04:53:21 +00:00
**Key Features**:
2026-01-17 03:58:28 +00:00
- Single entry point
- Pluggable architecture
- Support for 111+ commands
- 80+ shortcuts for productivity
### 2. Core Engine (`provisioning/core/nulib/`)
**Structure**: 54 Nushell libraries organized by function
**Main Components**:
#### **Configuration Management** (`lib_provisioning/config/`)
- **Hierarchical loading**: 5-layer precedence system
- **476+ accessors**: Type-safe configuration access
- **Variable interpolation**: Template expansion
- **TOML merging**: Environment-specific overrides
- **Validation**: Schema enforcement
#### **Provider Abstraction** (`lib_provisioning/providers/`)
- **Multi-cloud support**: UpCloud, AWS, Hetzner, Local
- **Unified interface**: Single API for all providers
- **Dynamic loading**: Load providers on-demand
- **Credential management**: Encrypted credential handling
- **State tracking**: Provider-specific state persistence
#### **Workspace Management** (`lib_provisioning/workspace/`)
- **Workspace registry**: Track all workspaces
- **Switching**: Atomic workspace transitions
- **Isolation**: Independent state per workspace
- **Configuration loading**: Workspace-specific overrides
- **Extensions**: Inherit from platform extensions
#### **Infrastructure Validation** (`lib_provisioning/infra_validator/`)
- **Schema validation**: Nickel contract checking
- **Constraint enforcement**: Business rule validation
- **Dependency analysis**: Infrastructure dependency graph
- **Type checking**: Static type validation
- **Error reporting**: Detailed error messages with suggestions
#### **Secrets Management** (`lib_provisioning/secrets/`)
- **SOPS integration**: Mozilla SOPS for encryption
- **Age encryption**: Modern symmetric encryption
- **KMS backends**: Cosmian, AWS KMS, local
- **Credential injection**: Runtime variable substitution
- **Audit logging**: Track secret access
#### **Command Utilities** (`lib_provisioning/cmd/`)
- **SSH operations**: Remote command execution
- **Batch operations**: Parallel command execution
- **Error handling**: Structured error reporting
- **Logging**: Comprehensive operation logging
- **Retry logic**: Automatic retry with backoff
### 3. Orchestration Engine (`provisioning/platform/`)
**Technology**: Rust + Nushell hybrid
**12 Microservices** (Rust crates):
| Service | Purpose | Key Features |
| --- | --- | --- |
| orchestrator | Workflow execution | Scheduler, file persistence, REST API |
| control-center | API gateway + auth | RBAC, Cedar policies, audit logging |
| control-center-ui | Web dashboard | Infrastructure view, config management |
| mcp-server | AI integration | Model Context Protocol, auto-completion |
| vault-service | Secrets storage | Encryption, KMS, credential injection |
| extension-registry | OCI registry | Extension distribution, versioning |
| ai-service | LLM features | Prompt optimization, context awareness |
| detector | Anomaly detection | Health monitoring, pattern recognition |
| rag | Knowledge retrieval | Document embedding, semantic search |
| provisioning-daemon | Background service | Event monitoring, task scheduling |
| platform-config | Config management | Schema validation, environment handling |
| service-clients | API clients | SDK for platform services, cloud APIs |
**Detailed Services**:
#### **Orchestrator** (`crates/orchestrator/`)
- **High-performance scheduler**: Rust core
- **File-based persistence**: Durable queue
- **Workflow execution**: Dependency-aware scheduling
- **Checkpoint recovery**: Resume from failures
- **Parallel execution**: Multi-task handling
- **State management**: Track job status
- **REST API**: 9 core endpoints
- **Port**: 9090 (health check endpoint)
#### **Control Center** (`crates/control-center/`)
- **Authorization engine**: Cedar policy enforcement
- **RBAC system**: Role-based access control
- **Audit logging**: Complete audit trail
- **API gateway**: REST API for all operations
- **System configuration**: Central configuration management
- **Health monitoring**: Real-time system status
#### **Control Center UI** (`crates/control-center-ui/`)
- **Web dashboard**: Real-time infrastructure view
- **Workflow visualization**: Batch job monitoring
- **Configuration management**: Web-based configuration
- **Resource explorer**: Browse infrastructure
- **Audit viewer**: Security audit trail
#### **MCP Server** (`crates/mcp-server/`)
- **AI integration**: Model Context Protocol support
- **Natural language**: Parse infrastructure requests
- **Auto-completion**: Intelligent configuration suggestions
- **7 settings tools**: Configuration management via LLM
- **Context-aware**: Understand workspace context
#### **Vault Service** (`crates/vault-service/`)
- **Secrets backend**: Encrypted credential storage
- **KMS integration**: Key Management System support
- **SOPS + Age**: SOPS encryption backend
- **Credential injection**: Secure credential delivery
- **Audit logging**: Secret access tracking
#### **Extension Registry** (`crates/extension-registry/`)
- **OCI distribution**: Container image distribution
- **Extension packaging**: Provider/taskserv distribution
- **Version management**: Semantic versioning
- **Registry API**: Content addressable storage
#### **AI Service** (`crates/ai-service/`)
- **LLM integration**: Large Language Model support
- **Prompt optimization**: Infrastructure request parsing
- **Context awareness**: Workspace context enrichment
- **Response generation**: Configuration suggestions
#### **Detector** (`crates/detector/`)
- **Anomaly detection**: System health monitoring
- **Pattern recognition**: Infrastructure issue identification
- **Alert generation**: Alerting system integration
- **Real-time monitoring**: Continuous surveillance
#### **Platform Config** (`crates/platform-config/`)
- **Configuration management**: Centralized config loading
- **Schema validation**: Configuration validation
- **Environment handling**: Multi-environment support
- **Default settings**: System-wide defaults
#### **Provisioning Daemon** (`crates/provisioning-daemon/`)
- **Background service**: Continuous operation
- **Event monitoring**: System event handling
- **Task scheduling**: Background job execution
- **State synchronization**: Infrastructure state sync
#### **RAG Service** (`crates/rag/`)
- **Retrieval Augmented Generation**: Knowledge base integration
- **Document embedding**: Semantic search
- **Context retrieval**: Intelligent response context
- **Knowledge synthesis**: Answer generation
#### **Service Clients** (`crates/service-clients/`)
- **API clients**: Client SDK for platform services
- **Cloud providers**: Multi-cloud provider SDKs
- **Request handling**: HTTP/RPC client utilities
- **Connection pooling**: Efficient resource management
### 4. Extensions (`provisioning/extensions/`)
**Modular infrastructure components**:
#### **Providers** (5 cloud providers)
- **UpCloud** - Primary European cloud
- **AWS** - Amazon Web Services
- **Hetzner** - Baremetal & cloud servers
- **Local** - Development environment
- **Demo** - Testing & mocking
Each provider includes:
- Nickel schemas for configuration
- API client implementation
- Server creation/deletion logic
- Network management
- State tracking
#### **Task Services** (50+ services in 18 categories)
| Category | Services | Purpose |
| --- | ---------| - --- |
| Container Runtime | containerd, crio, podman, crun, youki, runc | Container execution |
| Kubernetes | kubernetes, etcd, coredns, cilium, flannel, calico | Orchestration |
| Storage | rook-ceph, local-storage, mayastor, external-nfs | Data persistence |
| Databases | postgres, redis, mysql, mongodb | Data management |
| Networking | ip-aliases, proxy, resolv, kms | Network services |
| Security | webhook, kms, oras, radicle | Security services |
| Observability | prometheus, grafana, loki, jaeger | Monitoring & logging |
| Development | gitea, coder, desktop, buildkit | Developer tools |
| Hypervisor | kvm, qemu, libvirt | Virtualization |
#### **Clusters** (9 pre-built templates)
- **web** - Web service cluster (nginx + postgres)
- **oci-reg** - Container registry
- **git** - Git hosting (Gitea)
- **buildkit** - Build infrastructure
- **k8s-ha** - HA Kubernetes (3 control planes)
- **postgresql** - HA PostgreSQL cluster
- **cicd-argocd** - GitOps CI/CD
- **cicd-tekton** - Tekton pipelines
### 5. Infrastructure Layer
**What Provisioning Manages**:
- **Cloud Resources**: VMs, networks, storage
- **Services**: Kubernetes, databases, monitoring
- **Applications**: Web services, APIs, tools
- **State**: Configuration, data, logs
- **Monitoring**: Metrics, traces, logs
## Configuration System
**Hierarchical 5-Layer System**:
```text
Precedence (High → Low):
1. Runtime Arguments (CLI flags: --provider upcloud)
2. Environment Variables (PROVISIONING_PROVIDER=aws)
3. Workspace Config (~workspace/config/provisioning.yaml)
4. Environment Defaults (workspace/config/prod-defaults.toml)
5. System Defaults (~/.config/provisioning/ + platform defaults)
```
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Configuration Languages**:
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
| Format | Purpose | Validation | Editability |
| --- | --------| - --- | ------------ |
| **Nickel** | Infrastructure source | ✅ Type-safe, contracts | Direct |
| **TOML** | Settings, defaults | Schema validation | Direct |
| **YAML** | User config, metadata | Schema validation | Direct |
| **JSON** | Exported configs | Schema validation | Generated |
2026-01-14 04:53:21 +00:00
**Key Features**:
2026-01-17 03:58:28 +00:00
- Lazy evaluation
- Recursive merging
- Variable interpolation
- Constraint checking
- Automatic validation
## State Management
**SurrealDB Graph Database**:
Stores complex infrastructure relationships:
```text
Nodes:
- Servers (compute)
- Networks (connectivity)
- Storage (persistence)
- Services (software)
- Workflows (automation)
Edges:
- Server → Network (connected)
- Server → Storage (mounted)
- Service → Server (running on)
- Workflow → Dependency (depends on)
2026-01-14 04:53:21 +00:00
```
2026-01-17 03:58:28 +00:00
**File-Based Persistence**:
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
For orchestrator queue and checkpoints:
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
```text
~/.provisioning/
├── state/ # Infrastructure state
├── checkpoints/ # Workflow checkpoints
├── queue/ # Orchestrator queue
└── logs/ # Operational logs
2026-01-14 04:53:21 +00:00
```
2026-01-17 03:58:28 +00:00
## Security Architecture
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**4-Layer Security Model**:
| Layer | Components | Features |
| --- | ----------| - --- |
| **Authentication** | JWT, sessions, MFA | 2FA, TOTP, WebAuthn |
| **Authorization** | Cedar policies, RBAC | Fine-grained permissions |
| **Encryption** | AES-256-GCM, TLS | At-rest & in-transit |
| **Audit** | Logging, compliance | 7-year retention |
**Security Services**:
- JWT token validation
- Argon2id password hashing
- Multi-factor authentication
- Cedar policy enforcement
- Encrypted credential storage
- KMS integration (5 backends)
- Audit logging (5 export formats)
- Compliance checking (SOC2, GDPR, HIPAA)
## Performance Characteristics
**Modular CLI** (84% code reduction):
- Main CLI: 211 lines (vs. 1,329 before)
- Command discovery: O(1) dispatcher
- Lazy loading: Commands loaded on-demand
- Caching: Configuration cached after first load
**Orchestrator Performance**:
- Dependency resolution: O(n log n) topological sort
- Parallel execution: Configurable task limit
- Checkpoint recovery: Resume from failure point
- Memory efficient: File-based queue
**Provider Operations**:
- Batch creation: Parallel server provisioning
- Bulk operations: Multi-resource transactions
- State tracking: Efficient state queries
- Rollback: Atomic operation reversal
**Nushell Plugins** (10-50x speedup):
- Compiled Rust extensions
- Direct native code execution
- Zero-copy data passing
- Async I/O support
## Deployment Modes
**Three Operational Modes**:
| Mode | Interaction | Configuration | Rollback | Use Case |
| --- | ------------| - --- | ---------| - --- |
| **Interactive TUI** | Ratatui UI | Manual input | Automatic | Development |
| **Headless CLI** | Command-line | Script-driven | Manual | Automation |
| **Unattended CI/CD** | Non-interactive | Configuration file | Automatic | CI/CD pipelines |
2026-01-14 04:53:21 +00:00
## Technology Stack
2026-01-17 03:58:28 +00:00
| Component | Technology | Why |
| --- | ----------| - --- |
| **IaC Language** | Nickel | Type-safe, lazy evaluation, contracts |
| **Scripting** | Nushell 0.109+ | Structured data pipelines |
| **Performance** | Rust | Zero-cost abstractions, memory safety |
| **State** | SurrealDB | Graph database for relationships |
| **Encryption** | SOPS + Age | Industry-standard encryption |
| **Security** | Cedar + JWT | Policy enforcement + tokens |
| **Orchestration** | Custom | Specialized for infrastructure workflows |
## File Organization
```text
provisioning/
├── core/ # CLI engine (Nushell)
│ ├── cli/provisioning # Main entry point
│ ├── nulib/ # 54 core libraries
│ ├── plugins/ # Nushell plugins (Rust)
│ └── scripts/ # Utility scripts
├── platform/ # Microservices (Rust)
│ ├── crates/ # 12 microservices
│ │ ├── orchestrator/ # Workflow scheduler
│ │ ├── control-center/ # API gateway + auth
│ │ ├── control-center-ui/ # Web dashboard
│ │ ├── mcp-server/ # AI integration
│ │ ├── vault-service/ # Secrets backend
│ │ ├── extension-registry/ # OCI registry
│ │ ├── ai-service/ # LLM features
│ │ ├── detector/ # Anomaly detection
│ │ ├── rag/ # Knowledge retrieval
│ │ ├── provisioning-daemon/ # Background service
│ │ ├── platform-config/ # Config management
│ │ └── service-clients/ # API clients
│ └── Cargo.toml # Rust workspace
├── extensions/ # Extensible components
│ ├── providers/ # Cloud providers (5)
│ ├── taskservs/ # Task services (50+)
│ ├── clusters/ # Cluster templates (9)
│ └── workflows/ # Automation templates
├── schemas/ # Nickel schemas
│ ├── main.ncl # Entry point
│ ├── config/ # Configuration schemas
│ ├── infrastructure/ # Infrastructure schemas
│ ├── operations/ # Operational schemas
│ └── [other schemas] # Additional schemas
├── config/ # System configuration
│ └── config.defaults.toml # Default settings
├── bootstrap/ # Installation
│ ├── install.sh # Bash bootstrap
│ └── install.nu # Nushell installer
├── docs/ # Product documentation
│ └── src/ # mdBook source
└── README.md # Project overview
```
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
## Component Interaction
**Typical Workflow**:
```text
User Input
CLI Dispatcher (provisioning/core/cli/provisioning)
Nushell Handler (provisioning/core/nulib/commands/)
Configuration Loading (lib_provisioning/config/)
Provider Selection (lib_provisioning/providers/)
Validation (lib_provisioning/infra_validator/)
Orchestrator Queue (provisioning/platform/orchestrator/)
Task Execution (provider + task service)
State Update (SurrealDB / file storage)
Audit Logging (security system)
User Feedback
```
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
## Scalability
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
Provisioning scales from:
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
- **Solo**: 2 CPU cores, 4GB RAM (single instance)
- **MultiUser**: 4-8 CPU cores, 8GB RAM (small team)
- **CICD**: 8+ CPU cores, 16GB RAM (enterprise)
- **Enterprise**: Multi-node Kubernetes (unlimited)
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Bottlenecks & Solutions**:
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
| Component | Bottleneck | Solution |
| --- | ----------| - --- |
| **Orchestrator** | Task queue | Partition by workspace |
| **State** | SurrealDB | Horizontal scaling |
| **Providers** | API rate limits | Exponential backoff |
| **Storage** | Disk I/O | SSD + caching |
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
## Integration Points
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
Provisioning integrates with:
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
- **Kubernetes API** - Cluster management
- **Cloud Provider APIs** - Resource provisioning
- **SOPS + Age** - Secrets encryption
- **Prometheus** - Metrics collection
- **Cedar** - Policy enforcement
- **SurrealDB** - State persistence
- **MCP** - AI integration
- **KMS** - Key management (Cosmian, AWS, local)
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
## Reliability Features
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Fault Tolerance**:
- Checkpoint recovery - Resume from failure
- Automatic rollback - Revert failed operations
- Retry logic - Exponential backoff
- Health checks - Continuous monitoring
- Backup & restore - Data protection
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**High Availability**:
- Multi-node orchestrator
- Database replication
- Service redundancy
- Load balancing
- Failover automation
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
## Related Documentation
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
- [Design Principles](design-principles.md)
- [Component Architecture](component-architecture.md)
- [Integration Patterns](integration-patterns.md)