Provisioning Platform Documentation
Welcome to the Provisioning Platform documentation. This is an enterprise-grade Infrastructure as Code (IaC) platform built with Rust, Nushell, and Nickel.
What is Provisioning
Provisioning is a comprehensive infrastructure automation platform that manages complete infrastructure lifecycles across multiple cloud providers. The platform emphasizes type-safety, configuration-driven design, and workspace-first organization.
Key Features
- Workspace Management: Default mode for organizing infrastructure, settings, schemas, and extensions
- Type-Safe Configuration: Nickel-based configuration system with validation and contracts
- Multi-Cloud Support: Unified interface for AWS, UpCloud, and local providers
- Modular CLI Architecture: 111+ commands with 84% code reduction through modularity
- Batch Workflow Engine: Orchestrate complex multi-cloud operations
- Complete Security System: Authentication, authorization, encryption, and compliance
- Extensible Architecture: Custom providers, task services, and plugins
Getting Started
New users should start with:
- Prerequisites - System requirements and dependencies
- Installation - Install the platform
- Quick Start - 5-minute deployment tutorial
- First Deployment - Comprehensive walkthrough
Documentation Structure
- Getting Started: Installation and initial setup
- User Guides: Workflow tutorials and best practices
- Infrastructure as Code: Nickel configuration and schema reference
- Platform Features: Core capabilities and systems
- Operations: Deployment, monitoring, and maintenance
- Security: Complete security system documentation
- Development: Extension and plugin development
- API Reference: REST API and CLI command reference
- Architecture: System design and ADRs
- Examples: Practical use cases and patterns
- Troubleshooting: Problem-solving guides
Core Technologies
- Rust: Platform services and performance-critical components
- Nushell: Scripting, CLI, and automation
- Nickel: Type-safe infrastructure configuration
- SecretumVault: Secrets management integration
Workspace-First Approach
Provisioning uses workspaces as the default organizational unit. A workspace contains:
- Infrastructure definitions (Nickel schemas)
- Environment-specific settings
- Custom extensions and providers
- Deployment state and metadata
All operations work within workspace context, providing isolation and consistency.
Support and Community
- Issues: Report bugs and request features on GitHub
- Documentation: This documentation site
- Examples: See the Examples section
License
See project LICENSE file for details.
Getting Started
Your journey to infrastructure automation starts here. This section guides you from zero to your first successful deployment in minutes.
Overview
Getting started with Provisioning involves:
- Verifying prerequisites - System requirements, tools, cloud accounts
- Installing platform - Binary or container installation
- Initial configuration - Environment setup, credentials, workspaces
- First deployment - Deploy actual infrastructure in 5 minutes
- Verification - Validate everything is working correctly
By the end of this section, you’ll have a running Provisioning installation and have deployed your first infrastructure.
Quick Start Guides
Starting from Scratch
- Prerequisites - System requirements (Nushell 0.109.1+, Docker/Podman optional), cloud account setup, tool installation.
- Installation - Step-by-step installation: binary download, container, or source build with platform verification.
- Quick Start - 5-minute guide: install → configure → deploy infrastructure (requires 5 minutes and your AWS/UpCloud credentials).
- First Deployment - Deploy your first infrastructure: create workspace, configure provider, deploy resources, verify success.
- Verification - Validate installation: check system health, test CLI commands, verify cloud integration, confirm resource creation.
What You’ll Learn
By completing this section, you’ll know how to:
- ✅ Install and configure Provisioning
- ✅ Create your first workspace
- ✅ Configure cloud providers (AWS, UpCloud, Hetzner, etc.)
- ✅ Write simple Nickel infrastructure definitions
- ✅ Deploy infrastructure using Provisioning
- ✅ Verify and manage deployed resources
Prerequisites Checklist
Before starting, verify you have:
- Linux, macOS, or Windows with WSL2
- Nushell 0.109.1 or newer (nu --version)
- 2GB+ RAM and 100MB disk space
- Internet connectivity
- Cloud account (AWS, UpCloud, Hetzner, or local demo mode)
- Access credentials or API tokens for cloud provider
Missing something? See Prerequisites for detailed instructions.
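If you want to script the checklist above, a minimal POSIX-shell version check can be built on `sort -V`; the helper name `meets_min` and the inlined minimum are illustrative, not part of the platform:

```shell
#!/bin/sh
# meets_min VERSION MINIMUM -> exit 0 if VERSION >= MINIMUM.
# Relies on sort -V (GNU coreutils / BSD sort) for version-aware ordering.
meets_min() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Example: check the Nushell minimum from this page (0.109.1).
if meets_min "$(nu --version 2>/dev/null || echo 0)" "0.109.1"; then
  echo "nushell: ok"
else
  echo "nushell: missing or too old"
fi
```

The same helper works for Nickel, SOPS, and Age by swapping the command and minimum.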
5-Minute Quick Start
If you’re impatient, here’s the ultra-quick path:
# 1. Install (2 minutes)
curl -fsSL https://provisioning.io/install.sh | sh
# 2. Verify installation (30 seconds)
provisioning --version
provisioning status
# 3. Create workspace (30 seconds)
provisioning workspace create --name demo
# 4. Add cloud provider (1 minute)
provisioning config set --workspace demo \
  providers.aws.region us-east-1 \
  providers.aws.credentials_source aws_iam
# 5. Deploy infrastructure (1 minute)
provisioning deploy --workspace demo \
  --config examples/simple-instance.ncl
# 6. Verify (30 seconds)
provisioning resource list --workspace demo
For detailed walkthrough, see Quick Start.
Installation Methods
Option 1: Binary (Recommended)
# Download and extract
curl -fsSL https://provisioning.io/provisioning-latest-linux.tar.gz | tar xz
sudo mv provisioning /usr/local/bin/
provisioning --version
Option 2: Container
docker run -it provisioning/provisioning:latest \
provisioning --version
Option 3: Build from Source
git clone https://github.com/provisioning/provisioning.git
cd provisioning
cargo build --release
./target/release/provisioning --version
See Installation for detailed instructions.
Next Steps After Installation
- Read Quick Start - 5-minute walkthrough
- Complete First Deployment - Deploy real infrastructure
- Run Verification - Validate system health
- Move to Guides - Learn advanced features
- Explore Examples - Real-world scenarios
Common Questions
Q: How long does installation take? A: 5-10 minutes including cloud credential setup.
Q: What if I don’t have a cloud account? A: Try our demo provider in local mode - no cloud account needed.
Q: Can I use Provisioning offline? A: Yes, with local provider. Cloud operations require internet.
Q: What’s the learning curve? A: 30 minutes for basics, days to master advanced features.
Q: Where do I get help? A: See Getting Help or Troubleshooting.
Architecture Overview
Provisioning works in these steps:
1. Install Platform
↓
2. Create Workspace
↓
3. Add Cloud Provider Credentials
↓
4. Write Nickel Configuration
↓
5. Deploy Infrastructure
↓
6. Monitor & Manage
What’s Next
After getting started:
- Learn features → See Features
- Build infrastructure → See Examples
- Write guides → See Guides
- Understand architecture → See Architecture
- Develop extensions → See Development
Getting Help
If you get stuck:
- Check Troubleshooting
- Review Guides for similar scenarios
- Search Examples for your use case
- Ask in community forums or open a GitHub issue
Related Documentation
- Full Guides → See provisioning/docs/src/guides/
- Examples → See provisioning/docs/src/examples/
- Architecture → See provisioning/docs/src/architecture/
- Features → See provisioning/docs/src/features/
- API Reference → See provisioning/docs/src/api-reference/
Prerequisites
Before installing the Provisioning platform, ensure your system meets the following requirements.
Required Software
Nushell 0.109.1+
Nushell is the primary shell and scripting environment for the platform.
Installation:
# macOS (Homebrew)
brew install nushell
# Linux (Cargo)
cargo install nu
# From source
git clone https://github.com/nushell/nushell
cd nushell
cargo install --path .
Verify installation:
nu --version
# Should show: 0.109.1 or higher
Nickel 1.15.1+
Nickel is the infrastructure-as-code language providing type-safe configuration with lazy evaluation.
Installation:
# macOS (Homebrew)
brew install nickel
# Linux (Cargo)
cargo install nickel-lang-cli
# From source
git clone https://github.com/tweag/nickel
cd nickel
cargo install --path cli
Verify installation:
nickel --version
# Should show: 1.15.1 or higher
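To see the type safety in action before writing full infrastructure files, you can evaluate a one-line contract check; the `Server` contract below is a toy example, not a platform schema:

```nickel
# A record contract: evaluation fails if a field is missing or mistyped.
let Server = { name | String, plan | String } in
{ name = "web-01", plan = "small" } | Server
# `nickel eval` prints the record; change plan to a number and it errors.
```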
SOPS 3.10.2+
SOPS (Secrets OPerationS) provides encrypted configuration and secrets management.
Installation:
# macOS (Homebrew)
brew install sops
# Linux (binary download)
wget https://github.com/getsops/sops/releases/download/v3.10.2/sops-v3.10.2.linux.amd64
sudo mv sops-v3.10.2.linux.amd64 /usr/local/bin/sops
sudo chmod +x /usr/local/bin/sops
Verify installation:
sops --version
# Should show: 3.10.2 or higher
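After encryption (see Step 6 of the installation guide), SOPS replaces values in place while keys stay readable, and appends its own metadata block. The shape below is abridged; recipient, MAC, and ciphertext values are placeholders:

```yaml
providers:
  upcloud:
    username: ENC[AES256_GCM,data:...,iv:...,tag:...,type:str]
    password: ENC[AES256_GCM,data:...,iv:...,tag:...,type:str]
sops:
  age:
    - recipient: age1...
      enc: |
        -----BEGIN AGE ENCRYPTED FILE-----
        ...
        -----END AGE ENCRYPTED FILE-----
  lastmodified: "..."
  mac: ENC[AES256_GCM,data:...,type:str]
  version: 3.10.2
```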
Age 1.2.1+
Age provides modern encryption for secrets used by SOPS.
Installation:
# macOS (Homebrew)
brew install age
# Linux (binary download)
wget https://github.com/FiloSottile/age/releases/download/v1.2.1/age-v1.2.1-linux-amd64.tar.gz
tar xzf age-v1.2.1-linux-amd64.tar.gz
sudo mv age/age /usr/local/bin/
sudo chmod +x /usr/local/bin/age
Verify installation:
age --version
# Should show: 1.2.1 or higher
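The key file that `age-keygen` writes (used later when wiring Age into SOPS) has a small, greppable format; the key values below are placeholders:

```text
# created: 2025-01-01T00:00:00Z
# public key: age1...
AGE-SECRET-KEY-1...
```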
K9s 0.50.6+
K9s provides a terminal UI for managing Kubernetes clusters.
Installation:
# macOS (Homebrew)
brew install derailed/k9s/k9s
# Linux (binary download)
wget https://github.com/derailed/k9s/releases/download/v0.50.6/k9s_Linux_amd64.tar.gz
tar xzf k9s_Linux_amd64.tar.gz
sudo mv k9s /usr/local/bin/
Verify installation:
k9s version
# Should show: 0.50.6 or higher
Optional Software
mdBook
For building and serving local documentation.
# Install with Cargo
cargo install mdbook
# Verify
mdbook --version
Docker or Podman
Container runtime for test environments and local development.
# Docker (macOS)
brew install --cask docker
# Podman (Linux)
sudo apt-get install podman
# Verify
docker --version
# or
podman --version
Cargo (Rust)
Required for building platform services and native plugins.
# Install Rust and Cargo
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Verify
cargo --version
Git
Version control for workspace management and configuration.
# Most systems have Git pre-installed
git --version
# Install if needed (macOS)
brew install git
# Install if needed (Linux)
sudo apt-get install git
System Requirements
Minimum Hardware
Development Workstation:
- CPU: 2 cores
- RAM: 4 GB
- Disk: 20 GB available space
- Network: Internet connection for provider APIs
Production Control Plane:
- CPU: 4 cores
- RAM: 8 GB
- Disk: 50 GB available space (SSD recommended)
- Network: Stable internet connection, public IP optional
Supported Operating Systems
Primary Support:
- macOS 12.0+ (Monterey or newer)
- Linux distributions with kernel 5.0+
- Ubuntu 20.04 LTS or newer
- Debian 11 or newer
- Fedora 35 or newer
- RHEL 8 or newer
Limited Support:
- Windows 10/11 via WSL2 (Windows Subsystem for Linux)
Network Requirements
Outbound Access:
- HTTPS (443) to cloud provider APIs
- HTTPS (443) to GitHub (for version updates)
- SSH (22) for server management
Inbound Access (optional, for platform services):
- Port 8080: HTTP API
- Port 8081: MCP server
- Port 5000: Orchestrator service
Cloud Provider Access
At least one cloud provider account with API credentials:
UpCloud:
- API username and password
- Account with sufficient quota for servers
AWS:
- AWS Access Key ID and Secret Access Key
- IAM permissions for EC2, VPC, EBS operations
- Account with sufficient EC2 quota
Local Provider:
- Docker or Podman installed
- Sufficient local system resources
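For AWS, the EC2/VPC/EBS permissions can be granted with a minimal IAM policy. The statement below is intentionally broad for getting started and should be scoped down for production; it is an illustration, not a policy shipped with the platform:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["ec2:*"],
      "Resource": "*"
    }
  ]
}
```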
Permission Requirements
User Permissions
Standard User (recommended):
- Read/write access to workspace directory
- Ability to create symlinks for CLI installation
- SSH key generation capability
Administrative Tasks (optional):
- Installing CLI to /usr/local/bin (requires sudo)
- Installing system-wide dependencies
- Configuring system services
File System Permissions
# Workspace directory
chmod 755 ~/provisioning-workspace
# Configuration files
chmod 600 ~/.config/provisioning/user_config.yaml
chmod 600 ~/.ssh/provisioning_*
# Executable permissions for CLI
chmod +x /path/to/provisioning/core/cli/provisioning
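A quick way to confirm that the 600 mode actually took effect (the scratch file here is illustrative):

```shell
# Create a scratch file, lock it down, and check the mode string ls reports.
f=$(mktemp)
chmod 600 "$f"
perms=$(ls -l "$f" | cut -c2-10)
if [ "$perms" = "rw-------" ]; then
  echo "permissions ok"
else
  echo "unexpected mode: $perms"
fi
rm -f "$f"
```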
Verification Checklist
Before proceeding to installation, verify all prerequisites:
# Check required tools
nu --version # 0.109.1+
nickel --version # 1.15.1+
sops --version # 3.10.2+
age --version # 1.2.1+
k9s version # 0.50.6+
# Check optional tools
mdbook --version # Latest
docker --version # Latest
cargo --version # Latest
git --version # Latest
# Verify system resources
nproc # CPU cores (2+ minimum)
free -h # RAM (4GB+ minimum)
df -h ~ # Disk space (20GB+ minimum)
# Test network connectivity
curl -I https://api.github.com
curl -I https://hub.upcloud.com # UpCloud API
curl -I https://ec2.amazonaws.com # AWS API
Next Steps
Once all prerequisites are met, proceed to:
- Installation - Install the Provisioning platform
- Quick Start - Deploy your first infrastructure in 5 minutes
Installation
This guide covers installing the Provisioning platform on your system.
Prerequisites
Ensure all prerequisites are met before proceeding.
Installation Steps
Step 1: Clone the Repository
# Clone the provisioning repository
git clone https://github.com/your-org/project-provisioning
cd project-provisioning
Step 2: Add CLI to PATH
The CLI can be installed globally or run directly from the repository.
Option A: Symbolic Link (Recommended):
# Create symbolic link to /usr/local/bin
ln -sf "$(pwd)/provisioning/core/cli/provisioning" /usr/local/bin/provisioning
# Verify installation
provisioning version
Option B: PATH Environment Variable:
# Add to ~/.bashrc, ~/.zshrc, or ~/.config/nushell/env.nu
export PATH="$PATH:/path/to/project-provisioning/provisioning/core/cli"
# Reload shell configuration
source ~/.bashrc # or ~/.zshrc
Option C: Direct Execution:
# Run directly from repository (no installation needed)
./provisioning/core/cli/provisioning version
Step 3: Verify Installation
# Check CLI is accessible
provisioning version
# Show environment configuration
provisioning env
# Display help
provisioning help
Expected output:
Provisioning Platform
CLI Version: (current version)
Nushell: 0.109.1+
Nickel: 1.15.1+
Step 4: Initialize Configuration
Generate default configuration files:
# Create user configuration directory
mkdir -p ~/.config/provisioning
# Initialize default user configuration (optional)
provisioning config init
This creates ~/.config/provisioning/user_config.yaml with sensible defaults.
Step 5: Configure Cloud Provider Credentials
Configure credentials for at least one cloud provider.
UpCloud:
# ~/.config/provisioning/user_config.yaml
providers:
  upcloud:
    username: "your-username"
    password: "your-password" # Use SOPS for encryption in production
    default_zone: "de-fra1"
AWS:
# ~/.config/provisioning/user_config.yaml
providers:
  aws:
    access_key_id: "AKIA..."
    secret_access_key: "..." # Use SOPS for encryption in production
    default_region: "us-east-1"
Local Provider (no credentials required):
# ~/.config/provisioning/user_config.yaml
providers:
  local:
    container_runtime: "docker" # or "podman"
Step 6: Encrypt Secrets (Recommended)
Use SOPS to encrypt sensitive configuration:
# Generate Age encryption key
age-keygen -o ~/.config/provisioning/age-key.txt
# Extract public key
export AGE_PUBLIC_KEY=$(grep "public key:" ~/.config/provisioning/age-key.txt | cut -d: -f2 | tr -d ' ')
# Create .sops.yaml configuration
cat > ~/.config/provisioning/.sops.yaml <<EOF
creation_rules:
  - path_regex: .*user_config\.yaml$
    age: $AGE_PUBLIC_KEY
EOF
# Encrypt configuration file
sops -e -i ~/.config/provisioning/user_config.yaml
Decrypting (automatic with SOPS):
# Set Age key path
export SOPS_AGE_KEY_FILE=~/.config/provisioning/age-key.txt
# SOPS will automatically decrypt when accessed
provisioning config show
Step 7: Validate Configuration
# Validate all configuration files
provisioning validate config
# Check provider connectivity
provisioning providers
# Show complete environment
provisioning allenv
Optional: Install Platform Services
Platform services provide additional capabilities like orchestration and web UI.
Orchestrator Service (Rust)
# Build orchestrator
cd provisioning/platform/orchestrator
cargo build --release
# Start orchestrator
./target/release/orchestrator --port 5000
Control Center (Web UI)
# Build control center
cd provisioning/platform/control-center
cargo build --release
# Start control center
./target/release/control-center --port 8080
Native Plugins (Performance)
Install Nushell plugins for 10-50x performance improvements:
# Build and register plugins
cd provisioning/core/plugins
# Auth plugin
cargo build --release --package nu_plugin_auth
nu -c "register target/release/nu_plugin_auth"
# KMS plugin
cargo build --release --package nu_plugin_kms
nu -c "register target/release/nu_plugin_kms"
# Orchestrator plugin
cargo build --release --package nu_plugin_orchestrator
nu -c "register target/release/nu_plugin_orchestrator"
# Verify plugins are registered
nu -c "plugin list"
Workspace Initialization
Create your first workspace for managing infrastructure:
# Initialize new workspace
provisioning workspace init my-project
cd my-project
# Verify workspace structure
ls -la
Expected workspace structure:
my-project/
├── infra/ # Infrastructure Nickel schemas
├── config/ # Workspace configuration
├── extensions/ # Custom extensions
└── runtime/ # Runtime data and state
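If `workspace init` is unavailable (for example in a CI sandbox), the same skeleton can be created by hand; the directory name is illustrative:

```shell
# Recreate the documented workspace layout manually.
ws=my-project
mkdir -p "$ws/infra" "$ws/config" "$ws/extensions" "$ws/runtime"
ls "$ws"
```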
Troubleshooting
Common Issues
CLI not found after installation:
# Verify symlink was created
ls -l /usr/local/bin/provisioning
# Check PATH includes /usr/local/bin
echo $PATH
# Try direct path
/usr/local/bin/provisioning version
Permission denied when creating symlink:
# Use sudo for system-wide installation
sudo ln -sf "$(pwd)/provisioning/core/cli/provisioning" /usr/local/bin/provisioning
# Or use user-local bin directory
mkdir -p ~/.local/bin
ln -sf "$(pwd)/provisioning/core/cli/provisioning" ~/.local/bin/provisioning
export PATH="$PATH:$HOME/.local/bin"
Nushell version mismatch:
# Check Nushell version
nu --version
# Update Nushell
brew upgrade nushell # macOS
cargo install nu --force # Linux
Nickel not found:
# Install Nickel
brew install nickel # macOS
cargo install nickel-lang-cli # Linux
# Verify
nickel --version
Verification
Confirm successful installation:
# Complete installation check
provisioning version # CLI version
provisioning env # Environment configuration
provisioning providers # Available cloud providers
provisioning validate config # Configuration validation
provisioning help # Help system
Next Steps
Once installation is complete:
- Quick Start - Deploy infrastructure in 5 minutes
- First Deployment - Comprehensive deployment walkthrough
- Verification - Validate platform health
Quick Start
Deploy your first infrastructure in 5 minutes using the Provisioning platform.
Prerequisites
- Prerequisites installed
- Platform installed and CLI accessible
- Cloud provider credentials configured (or local provider)
5-Minute Deployment
Step 1: Create Workspace (30 seconds)
# Initialize workspace
provisioning workspace init quickstart-demo
cd quickstart-demo
Workspace structure created:
quickstart-demo/
├── infra/ # Infrastructure definitions
├── config/ # Workspace configuration
├── extensions/ # Custom providers/taskservs
└── runtime/ # State and logs
Step 2: Define Infrastructure (1 minute)
Create a simple server configuration using Nickel:
# Create infrastructure schema
cat > infra/demo-server.ncl <<'EOF'
{
  metadata = {
    name = "demo-server",
    provider = "local", # Use local provider for quick demo
    environment = "development"
  },
  infrastructure = {
    servers = [
      {
        name = "web-01",
        plan = "small",
        role = "web"
      }
    ]
  },
  services = {
    taskservs = ["containerd"] # Simple container runtime
  }
}
EOF
Using UpCloud or AWS? Change provider:
metadata.provider = "upcloud" # or "aws"
Step 3: Validate Configuration (30 seconds)
# Validate Nickel schema
nickel typecheck infra/demo-server.ncl
# Validate provisioning configuration
provisioning validate config
# Preview what will be created
provisioning server create --check --infra demo-server
Expected output:
Infrastructure Plan: demo-server
Provider: local
Servers to create: 1
- web-01 (small, role: web)
Task services: containerd
Estimated resources:
CPU: 2 cores
RAM: 2 GB
Disk: 10 GB
Step 4: Create Infrastructure (2 minutes)
# Create server
provisioning server create --infra demo-server --yes
# Monitor progress
provisioning server status web-01
Progress indicators:
Creating server: web-01...
[████████████████████████] 100% - Server provisioned
[████████████████████████] 100% - SSH configured
[████████████████████████] 100% - Network ready
Server web-01 created successfully
IP Address: 10.0.1.10
Status: running
Step 5: Install Task Service (1 minute)
# Install containerd
provisioning taskserv create containerd --infra demo-server
# Verify installation
provisioning taskserv status containerd
Output:
Installing containerd on web-01...
[████████████████████████] 100% - Dependencies resolved
[████████████████████████] 100% - Containerd installed
[████████████████████████] 100% - Service started
[████████████████████████] 100% - Health check passed
Containerd v1.7.0 installed successfully
Step 6: Verify Deployment (30 seconds)
# SSH into server
provisioning server ssh web-01
# Inside server - verify containerd
sudo systemctl status containerd
sudo ctr version
# Exit server
exit
What You’ve Accomplished
In 5 minutes, you’ve:
- Created a workspace for infrastructure management
- Defined infrastructure using type-safe Nickel schemas
- Validated configuration before deployment
- Provisioned a server on your chosen provider
- Installed and configured containerd
- Verified the deployment
Common Workflows
List Resources
# List all servers
provisioning server list
# List task services
provisioning taskserv list
# Show workspace info
provisioning workspace info
Modify Infrastructure
# Edit infrastructure schema
nano infra/demo-server.ncl
# Validate changes
provisioning validate config --infra demo-server
# Apply changes
provisioning server update --infra demo-server
Cleanup
# Remove task service
provisioning taskserv delete containerd --infra demo-server
# Delete server
provisioning server delete web-01 --yes
# Remove workspace
cd ..
rm -rf quickstart-demo
Next Steps
Deploy Kubernetes
Ready for something more complex?
# infra/kubernetes-cluster.ncl
{
  metadata = {
    name = "k8s-cluster",
    provider = "upcloud"
  },
  infrastructure = {
    servers = [
      { name = "control-01", plan = "medium", role = "control" },
      { name = "worker-01", plan = "large", role = "worker" },
      { name = "worker-02", plan = "large", role = "worker" }
    ]
  },
  services = {
    taskservs = ["kubernetes", "cilium", "rook-ceph"]
  }
}
provisioning server create --infra kubernetes-cluster --yes
provisioning taskserv create kubernetes --infra kubernetes-cluster
Multi-Cloud Deployment
Deploy to multiple providers simultaneously:
# infra/multi-cloud.ncl
{
  batch_workflow = {
    operations = [
      {
        id = "aws-cluster",
        provider = "aws",
        servers = [{ name = "aws-web-01", plan = "t3.medium" }]
      },
      {
        id = "upcloud-cluster",
        provider = "upcloud",
        servers = [{ name = "upcloud-web-01", plan = "medium" }]
      }
    ]
  }
}
provisioning batch submit infra/multi-cloud.ncl
Use Interactive Guides
Access built-in guides for comprehensive walkthroughs:
# Quick command reference
provisioning sc
# Complete from-scratch guide
provisioning guide from-scratch
# Customization patterns
provisioning guide customize
Troubleshooting Quick Issues
Server creation fails
# Check provider connectivity
provisioning providers
# Validate credentials
provisioning validate config
# Enable debug mode
provisioning --debug server create --infra demo-server
Task service installation fails
# Check server connectivity
provisioning server ssh web-01
# Verify dependencies
provisioning taskserv check-deps containerd
# Retry installation
provisioning taskserv create containerd --infra demo-server --force
Configuration validation errors
# Check Nickel syntax
nickel typecheck infra/demo-server.ncl
# Show detailed validation errors
provisioning validate config --verbose
# View configuration
provisioning config show
Reference
Essential Commands
# Workspace management
provisioning workspace init <name>
provisioning workspace list
provisioning workspace switch <name>
# Server operations
provisioning server create --infra <name>
provisioning server list
provisioning server status <hostname>
provisioning server ssh <hostname>
provisioning server delete <hostname>
# Task service operations
provisioning taskserv create <service> --infra <name>
provisioning taskserv list
provisioning taskserv status <service>
provisioning taskserv delete <service>
# Configuration
provisioning config show
provisioning validate config
provisioning env
Quick Reference
# Shortcut for fastest reference
provisioning sc
Further Reading
- First Deployment - Comprehensive walkthrough
- Verification - Platform health checks
- Workspace Management - Advanced workspace usage
- Nickel Guide - Infrastructure-as-code patterns
First Deployment
Comprehensive walkthrough deploying production-ready infrastructure with the Provisioning platform.
Overview
This guide walks through deploying a complete Kubernetes cluster with storage and networking on a cloud provider. You’ll learn workspace management, Nickel schema structure, provider configuration, dependency resolution, and validation workflows.
Deployment Architecture
What we’ll build:
- 3-node Kubernetes cluster (1 control plane, 2 workers)
- Cilium CNI for networking
- Rook-Ceph for persistent storage
- Container runtime (containerd)
- Automated dependency resolution
- Health monitoring
Prerequisites
- Platform installed
- Cloud provider credentials configured (UpCloud or AWS recommended)
- 30-60 minutes for complete deployment
Part 1: Workspace Setup
Create Workspace
# Initialize production workspace
provisioning workspace init production-k8s
cd production-k8s
# Verify structure
ls -la
Workspace contains:
production-k8s/
├── infra/ # Infrastructure Nickel schemas
├── config/ # Workspace configuration
├── extensions/ # Custom providers/taskservs
└── runtime/ # State and logs
Configure Workspace
# Edit workspace configuration
cat > config/provisioning-config.yaml <<'EOF'
workspace:
  name: production-k8s
  environment: production
defaults:
  provider: upcloud # or aws
  region: de-fra1 # UpCloud Frankfurt
  ssh_key_path: ~/.ssh/provisioning_production
servers:
  default_plan: medium
  auto_backup: true
logging:
  level: info
  format: text
EOF
Part 2: Infrastructure Definition
Define Nickel Schema
Create infrastructure definition with type-safe Nickel:
# Create Kubernetes cluster schema
cat > infra/k8s-cluster.ncl <<'EOF'
{
  metadata = {
    name = "k8s-prod",
    provider = "upcloud",
    environment = "production",
    version = "1.0.0"
  },
  infrastructure = {
    servers = [
      {
        name = "k8s-control-01",
        plan = "medium", # 4 CPU, 8 GB RAM
        role = "control",
        zone = "de-fra1",
        disk_size_gb = 50,
        backup_enabled = true
      },
      {
        name = "k8s-worker-01",
        plan = "large", # 8 CPU, 16 GB RAM
        role = "worker",
        zone = "de-fra1",
        disk_size_gb = 100,
        backup_enabled = true
      },
      {
        name = "k8s-worker-02",
        plan = "large",
        role = "worker",
        zone = "de-fra1",
        disk_size_gb = 100,
        backup_enabled = true
      }
    ]
  },
  services = {
    taskservs = [
      "containerd", # Container runtime (dependency)
      "etcd",       # Key-value store (dependency)
      "kubernetes", # Core orchestration
      "cilium",     # CNI networking
      "rook-ceph"   # Persistent storage
    ]
  },
  kubernetes = {
    version = "1.28.0",
    pod_cidr = "10.244.0.0/16",
    service_cidr = "10.96.0.0/12",
    container_runtime = "containerd",
    cri_socket = "/run/containerd/containerd.sock"
  },
  networking = {
    cni = "cilium",
    enable_network_policy = true,
    enable_encryption = true
  },
  storage = {
    provider = "rook-ceph",
    replicas = 3,
    storage_class = "ceph-rbd"
  }
}
EOF
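The pod_cidr above leaves generous headroom: assuming the common /24 per-node pod allocation (kubeadm's default node CIDR mask size, an assumption here rather than a platform setting), a /16 cluster CIDR supports 256 nodes. The arithmetic can be sanity-checked in shell:

```shell
# Capacity check for pod_cidr 10.244.0.0/16 with a /24 allocated per node.
CLUSTER_PREFIX=16
NODE_PREFIX=24
MAX_NODES=$(( 1 << (NODE_PREFIX - CLUSTER_PREFIX) ))
ADDRS_PER_NODE=$(( 1 << (32 - NODE_PREFIX) ))
echo "max nodes: $MAX_NODES"               # 256
echo "pod addresses per node: $ADDRS_PER_NODE"  # 256, minus reserved addresses
```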
Validate Schema
# Type-check Nickel schema
nickel typecheck infra/k8s-cluster.ncl
# Validate against provisioning contracts
provisioning validate config --infra k8s-cluster
Expected output:
Schema validation: PASSED
- Syntax: Valid Nickel
- Type safety: All contracts satisfied
- Dependencies: Resolved (5 taskservs)
- Provider: upcloud (credentials found)
Part 3: Preview and Validation
Preview Infrastructure
# Dry-run to see what will be created
provisioning server create --check --infra k8s-cluster
Output shows:
Infrastructure Plan: k8s-prod
Provider: upcloud
Region: de-fra1
Servers to create: 3
- k8s-control-01 (medium, 4 CPU, 8 GB RAM, 50 GB disk)
- k8s-worker-01 (large, 8 CPU, 16 GB RAM, 100 GB disk)
- k8s-worker-02 (large, 8 CPU, 16 GB RAM, 100 GB disk)
Task services: 5 (with dependencies resolved)
1. containerd (dependency for kubernetes)
2. etcd (dependency for kubernetes)
3. kubernetes
4. cilium (requires kubernetes)
5. rook-ceph (requires kubernetes)
Estimated monthly cost: $xxx.xx
Estimated deployment time: 15-20 minutes
WARNING: Production deployment - ensure backup enabled
Dependency Graph
# Visualize dependency resolution
provisioning taskserv dependencies kubernetes --graph
Shows:
kubernetes
├── containerd (required)
├── etcd (required)
└── cni (cilium) (soft dependency)
cilium
└── kubernetes (required)
rook-ceph
└── kubernetes (required)
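The install order the resolver derives from this graph is a plain topological sort; it can be reproduced with coreutils `tsort` (the edge list below mirrors the graph above and is only a sketch of the resolver's behavior, not its implementation):

```shell
# Each line is "dependency dependent"; tsort emits a valid install order,
# e.g. containerd and etcd before kubernetes, kubernetes before cilium.
printf '%s\n' \
  "containerd kubernetes" \
  "etcd kubernetes" \
  "kubernetes cilium" \
  "kubernetes rook-ceph" | tsort
```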
Part 4: Server Provisioning
Create Servers
# Create all servers in parallel
provisioning server create --infra k8s-cluster --yes
Progress tracking:
Creating 3 servers...
k8s-control-01: [████████████████████████] 100%
k8s-worker-01: [████████████████████████] 100%
k8s-worker-02: [████████████████████████] 100%
Servers created: 3/3
SSH configured: 3/3
Network ready: 3/3
Servers available:
k8s-control-01: 94.237.x.x (running)
k8s-worker-01: 94.237.x.x (running)
k8s-worker-02: 94.237.x.x (running)
Verify Server Access
# Test SSH connectivity
provisioning server ssh k8s-control-01 -- uname -a
# Check all servers
provisioning server list
Part 5: Service Installation
Install Task Services
# Install all task services (automatic dependency resolution)
provisioning taskserv create kubernetes --infra k8s-cluster
Installation flow (automatic):
Resolving dependencies...
containerd → etcd → kubernetes → cilium, rook-ceph
Installing task services: 5
[1/5] Installing containerd...
k8s-control-01: [████████████████████████] 100%
k8s-worker-01: [████████████████████████] 100%
k8s-worker-02: [████████████████████████] 100%
[2/5] Installing etcd...
k8s-control-01: [████████████████████████] 100%
[3/5] Installing kubernetes...
Control plane init: [████████████████████████] 100%
Worker join: [████████████████████████] 100%
Cluster ready: [████████████████████████] 100%
[4/5] Installing cilium...
CNI deployment: [████████████████████████] 100%
Network policies: [████████████████████████] 100%
[5/5] Installing rook-ceph...
Operator: [████████████████████████] 100%
Cluster: [████████████████████████] 100%
Storage class: [████████████████████████] 100%
All task services installed successfully
Verify Kubernetes Cluster
# SSH to control plane
provisioning server ssh k8s-control-01
# Check cluster status
kubectl get nodes
kubectl get pods --all-namespaces
kubectl get storageclass
Expected output:
NAME STATUS ROLES AGE VERSION
k8s-control-01 Ready control-plane 5m v1.28.0
k8s-worker-01 Ready <none> 4m v1.28.0
k8s-worker-02 Ready <none> 4m v1.28.0
NAMESPACE NAME READY STATUS
kube-system cilium-xxxxx 1/1 Running
kube-system cilium-operator-xxxxx 1/1 Running
kube-system etcd-k8s-control-01 1/1 Running
rook-ceph rook-ceph-operator-xxxxx 1/1 Running
NAME PROVISIONER
ceph-rbd rook-ceph.rbd.csi.ceph.com
Part 6: Deployment Verification
Health Checks
# Platform-level health check
provisioning cluster status k8s-cluster
# Individual service health
provisioning taskserv status kubernetes
provisioning taskserv status cilium
provisioning taskserv status rook-ceph
Test Application Deployment
# Deploy test application on K8s cluster
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: test-pvc
spec:
storageClassName: ceph-rbd
accessModes:
- ReadWriteOnce
resources:
requests:
storage: 1Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: test-nginx
spec:
replicas: 2
selector:
matchLabels:
app: nginx
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx:latest
volumeMounts:
- name: storage
mountPath: /usr/share/nginx/html
volumes:
- name: storage
persistentVolumeClaim:
claimName: test-pvc
EOF
# Verify deployment
kubectl get deployment test-nginx
kubectl get pods -l app=nginx
kubectl get pvc test-pvc
Network Policy Test
# Verify Cilium network policies work
kubectl exec -it <pod-name> -- curl http://test-nginx
Part 7: State Management
View State
# Show current workspace state
provisioning workspace info
# List all resources
provisioning server list
provisioning taskserv list
# Export state for backup
provisioning workspace export > k8s-cluster-state.json
Configuration Backup
# Backup workspace configuration
tar -czf k8s-cluster-backup.tar.gz infra/ config/ runtime/
# Store securely (encrypted)
sops -e --input-type binary --output-type binary k8s-cluster-backup.tar.gz > k8s-cluster-backup.tar.gz.enc
What You’ve Learned
This deployment demonstrated:
- Workspace creation and configuration
- Nickel schema structure for infrastructure-as-code
- Type-safe configuration validation
- Automatic dependency resolution
- Multi-server provisioning
- Task service installation with health checks
- Kubernetes cluster deployment
- Storage and networking configuration
- Verification and testing workflows
- State management and backup
Next Steps
- Verification - Comprehensive platform health checks
- Workspace Management - Advanced workspace patterns
- Batch Workflows - Multi-cloud orchestration
- Security System - Secure your infrastructure
Verification
Validate the Provisioning platform installation and infrastructure health.
Installation Verification
CLI and Core Tools
# Check CLI version
provisioning version
# Verify Nushell
nu --version # 0.109.1+
# Verify Nickel
nickel --version # 1.15.1+
# Check SOPS and Age
sops --version # 3.10.2+
age --version # 1.2.1+
# Verify K9s
k9s version # 0.50.6+
Configuration Validation
# Validate all configuration files
provisioning validate config
# Check environment
provisioning env
# Show all configuration
provisioning allenv
Expected output:
Configuration validation: PASSED
- User config: ~/.config/provisioning/user_config.yaml ✓
- System defaults: provisioning/config/config.defaults.toml ✓
- Provider credentials: configured ✓
Provider Connectivity
# List available providers
provisioning providers
# Test provider connection (UpCloud example)
provisioning provider test upcloud
# Test provider connection (AWS example)
provisioning provider test aws
Workspace Verification
Workspace Structure
# List workspaces
provisioning workspace list
# Show current workspace
provisioning workspace current
# Verify workspace structure
ls -la <workspace-name>/
Expected structure:
workspace-name/
├── infra/ # Infrastructure Nickel schemas
├── config/ # Workspace configuration
├── extensions/ # Custom extensions
└── runtime/ # State and logs
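The layout above can be checked mechanically. The following is an illustrative POSIX shell sketch; the `workspace_structure_ok` helper is not a platform command, just a local convenience:

```shell
# Check that a workspace directory contains the expected subdirectories.
# Directory names mirror the structure shown above.
workspace_structure_ok() {
  ws="$1"
  for d in infra config extensions runtime; do
    if [ ! -d "$ws/$d" ]; then
      echo "missing: $d"
      return 1
    fi
  done
  echo "structure ok"
}
```

Run it as `workspace_structure_ok <workspace-name>` after `provisioning workspace create`.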
Workspace Configuration
# Show workspace configuration
provisioning config show
# Validate workspace-specific config
provisioning validate config --workspace <name>
Infrastructure Verification
Server Health
# List all servers
provisioning server list
# Check server status
provisioning server status <hostname>
# Test SSH connectivity
provisioning server ssh <hostname> -- echo "Connection successful"
Task Service Health
# List installed task services
provisioning taskserv list
# Check service status
provisioning taskserv status <service-name>
# Verify service health
provisioning taskserv health <service-name>
Cluster Health
For Kubernetes clusters:
# SSH to control plane
provisioning server ssh <control-hostname>
# Check cluster nodes
kubectl get nodes
# Check system pods
kubectl get pods -n kube-system
# Check cluster info
kubectl cluster-info
Platform Services Verification
Orchestrator Service
# Check orchestrator status
curl http://localhost:5000/health
# View orchestrator version
curl http://localhost:5000/version
# List active workflows
provisioning workflow list
Expected response:
{
"status": "healthy",
"version": "x.x.x",
"uptime": "2h 15m"
}
Control Center
# Check control center
curl http://localhost:8080/health
# Access web UI
open http://localhost:8080      # macOS
xdg-open http://localhost:8080  # Linux
Native Plugins
# List registered plugins
nu -c "plugin list"
# Verify plugins loaded
nu -c "plugin use nu_plugin_auth; plugin use nu_plugin_kms; plugin use nu_plugin_orchestrator"
Security Verification
Secrets Management
# Verify SOPS configuration
cat ~/.config/provisioning/.sops.yaml
# Test encryption/decryption
echo "test secret" > /tmp/test-secret.txt
sops -e /tmp/test-secret.txt > /tmp/test-secret.enc
sops -d /tmp/test-secret.enc
rm /tmp/test-secret.*
SSH Keys
# Verify SSH keys exist
ls -la ~/.ssh/provisioning_*
# Test SSH key permissions
ls -l ~/.ssh/provisioning_* | awk '{print $1}'
# Should show: -rw------- (600)
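That permission check can be scripted; here is a minimal sketch assuming GNU `stat` (on macOS, substitute `stat -f '%Lp'`). The `key_mode_ok` helper name is illustrative:

```shell
# Return success only if the given file is owner-read/write (mode 600),
# the expected permission for SSH private keys.
key_mode_ok() {
  mode=$(stat -c '%a' "$1" 2>/dev/null) || return 1
  [ "$mode" = "600" ]
}
```

Example: `key_mode_ok ~/.ssh/provisioning_rsa && echo "key permissions OK"`.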
Encrypted Configuration
# Verify user config encryption
file ~/.config/provisioning/user_config.yaml
# Should show: SOPS encrypted data or YAML
Troubleshooting Common Issues
CLI Not Found
# Check PATH
echo $PATH | tr ':' '\n' | grep provisioning
# Verify symlink
ls -l /usr/local/bin/provisioning
# Try direct execution
/path/to/project-provisioning/provisioning/core/cli/provisioning version
Provider Authentication Fails
# Verify credentials are set
provisioning config show | grep -A5 providers
# Test with debug mode
provisioning --debug provider test <provider-name>
# Check network connectivity
ping -c 3 api.upcloud.com # UpCloud
ping -c 3 ec2.amazonaws.com # AWS
Nickel Schema Errors
# Type-check schema
nickel typecheck <schema-file>.ncl
# Validate with verbose output
provisioning validate config --verbose
# Format Nickel file
nickel fmt <schema-file>.ncl
Server SSH Fails
# Verify SSH key
ssh-add -l | grep provisioning
# Test direct SSH
ssh -i ~/.ssh/provisioning_rsa root@<server-ip>
# Check server status
provisioning server status <hostname>
Task Service Installation Fails
# Check dependencies
provisioning taskserv dependencies <service>
# Verify server has resources
provisioning server ssh <hostname> -- df -h
provisioning server ssh <hostname> -- free -h
# Enable debug mode
provisioning --debug taskserv create <service>
Health Check Checklist
Complete verification checklist:
# Core tools
[x] Nushell 0.109.1+
[x] Nickel 1.15.1+
[x] SOPS 3.10.2+
[x] Age 1.2.1+
[x] K9s 0.50.6+
# Configuration
[x] User config valid
[x] Provider credentials configured
[x] Workspace initialized
# Provider connectivity
[x] Provider API accessible
[x] Authentication successful
# Infrastructure (if deployed)
[x] Servers running
[x] SSH connectivity working
[x] Task services installed
[x] Cluster healthy
# Platform services (if running)
[x] Orchestrator responsive
[x] Control center accessible
[x] Plugins registered
# Security
[x] Secrets encrypted
[x] SSH keys secured
[x] Configuration protected
Performance Verification
Response Times
# CLI response time
time provisioning version
# Provider API response time
time provisioning provider test <provider>
# Orchestrator response time
time curl http://localhost:5000/health
Acceptable ranges:
- CLI commands: <1 second
- Provider API: <3 seconds
- Orchestrator API: <100ms
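Those budgets can be asserted in a script. Below is a minimal sketch using GNU `date` nanosecond timestamps (not portable to BSD `date`); the `within_budget` helper is illustrative, not a platform command:

```shell
# Run a command and fail if it exceeds a millisecond budget.
within_budget() {
  budget_ms="$1"; shift
  start=$(date +%s%N)
  "$@" >/dev/null 2>&1
  end=$(date +%s%N)
  elapsed_ms=$(( (end - start) / 1000000 ))
  [ "$elapsed_ms" -le "$budget_ms" ]
}
```

For example, `within_budget 1000 provisioning version` would check the 1-second CLI budget above.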
Resource Usage
# Check system resources
htop # Interactive process viewer
# Check disk usage
df -h
# Check memory usage
free -h
Next Steps
Once verification is complete:
- Workspace Management - Manage multiple workspaces
- Nickel Guide - Master infrastructure-as-code
- Batch Workflows - Multi-cloud orchestration
Setup & Configuration
Post-installation configuration and system setup for the Provisioning platform.
Overview
After installation, setup configures your system and prepares workspaces for infrastructure deployment.
Setup encompasses three critical phases:
- Initial Setup - Environment detection, dependency verification, directory creation
- Workspace Setup - Create workspaces, configure providers, initialize schemas
- Configuration - Provider credentials, system settings, profiles, validation
This process validates prerequisites, detects your environment, and bootstraps your first workspace.
Quick Setup
Get up and running in 3 commands:
# 1. Complete initial setup (detects system, creates dirs, validates dependencies)
provisioning setup initial
# 2. Create first workspace (for your infrastructure)
provisioning workspace create --name production
# 3. Add cloud provider credentials (AWS, UpCloud, Hetzner, etc.)
provisioning config set --workspace production \
extensions.providers.aws.enabled true \
extensions.providers.aws.config.region us-east-1
# 4. Verify configuration is valid
provisioning validate config
Setup Process Explained
The setup system automatically:
- System Detection - Detects OS (Linux, macOS, Windows), CPU architecture, RAM, disk space
- Dependency Verification - Validates Nushell, Nickel, SOPS, Age, K9s installation
- Directory Structure - Creates ~/.provisioning/, ~/.config/provisioning/, and workspace directories
- Configuration Creation - Initializes default configuration, security settings, profiles
- Workspace Bootstrap - Creates default workspace with basic configuration
- Health Checks - Validates installation, runs diagnostic tests
All steps are logged and can be verified with provisioning status.
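The dependency-verification step amounts to confirming each required tool is on PATH. A hedged sketch of that logic (the `verify_dependencies` helper here is a local illustration, distinct from the `provisioning setup verify-dependencies` command):

```shell
# Report which of the required tools are missing from PATH.
verify_dependencies() {
  missing=""
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing"
    return 1
  fi
  echo "all dependencies found"
}
```

Example: `verify_dependencies nu nickel sops age k9s`.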
Setup Configuration Guides
Starting Fresh
- Initial Setup - First-time system setup: detection, validation, directory creation, default configuration, health checks.
- Workspace Setup - Create and initialize workspaces: creation, provider configuration, schema management, local customization.
- Configuration Management - Configure system: providers, credentials, profiles, environment variables, validation rules.
Setup Profiles
Pre-configured setup profiles for different use cases:
Developer Profile
provisioning setup profile --profile developer
# Configures for local development with demo provider
Production Profile
provisioning setup profile --profile production
# Configures for production with security hardening
Custom Profile
provisioning setup profile --custom
# Interactive setup with customization
Directory Structure Created
Setup creates this directory structure:
~/.provisioning/
├── workspaces/ # Workspace data
├── cache/ # Build and dependency cache
├── plugins/ # Installed Nushell plugins
└── detectors/ # Custom detectors
~/.config/provisioning/
├── config.toml # Main configuration
├── providers/ # Provider credentials
├── secrets/ # Encrypted secrets (via SOPS)
└── profiles/ # Setup profiles
Quick Setup Verification
# Check system status
provisioning status
# Verify all dependencies
provisioning setup verify-dependencies
# Test cloud provider connection
provisioning provider test --name aws
# Validate configuration
provisioning validate config
# Run health checks
provisioning health check
Environment-Specific Setup
For Single Workspace (Simple)
- Run Initial Setup
- Create one workspace
- Configure provider
- Done!
For Multiple Workspaces (Team)
- Run Initial Setup
- Create multiple workspaces per team
- Configure shared providers
- Set up workspace-specific schemas
For Multi-Cloud (Enterprise)
- Run Initial Setup with production profile
- Create workspace per environment (dev, staging, prod)
- Configure multiple cloud providers
- Enable audit logging and security features
Configuration Hierarchy
Configurations load in priority order:
1. Command-line arguments (highest)
2. Environment variables (PROVISIONING_*)
3. User profile config (~/.config/provisioning/)
4. Workspace config (workspace/config/)
5. System defaults (provisioning/config/) (lowest)
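The precedence rule is simply "first non-empty value wins", scanning from highest to lowest priority. An illustrative shell sketch (the `resolve_setting` helper and variable names are assumptions, not platform internals):

```shell
# Resolve a setting by precedence: pass candidates from highest priority
# (CLI flag) to lowest (system default); the first non-empty one wins.
resolve_setting() {
  for candidate in "$@"; do
    if [ -n "$candidate" ]; then
      echo "$candidate"
      return 0
    fi
  done
  return 1
}
```

Usage might look like `resolve_setting "$cli_provider" "$PROVISIONING_PROVIDER" "$workspace_provider" "local"`.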
Common Setup Tasks
Add a Cloud Provider
provisioning config set --workspace production \
extensions.providers.aws.config.region us-east-1 \
extensions.providers.aws.config.credentials_source aws_iam
Configure Secrets Storage
provisioning config set \
security.secrets.backend secretumvault \
security.secrets.url http://localhost:8200
Enable Audit Logging
provisioning config set \
security.audit.enabled true \
security.audit.retention_days 2555
Set Up Multi-Tenancy
# Create separate workspaces per tenant
provisioning workspace create --name tenant-1
provisioning workspace create --name tenant-2
# Each workspace has isolated configuration
Setup Validation
After setup, validate everything works:
# Run complete validation suite
provisioning setup validate-all
# Or check specific components
provisioning setup validate-system # OS, dependencies
provisioning setup validate-directories # Directory structure
provisioning setup validate-config # Configuration syntax
provisioning setup validate-providers # Cloud provider connectivity
provisioning setup validate-security # Security settings
Troubleshooting Setup
If setup fails:
- Check logs - provisioning setup logs --tail 20
- Verify dependencies - provisioning setup verify-dependencies
- Reset configuration - provisioning setup reset --workspace <name>
- Run diagnostics - provisioning diagnose setup
- Check documentation - See Troubleshooting
Next Steps After Setup
After initial setup completes:
- Create workspaces - See Workspace Setup
- Configure providers - See Configuration Management
- Deploy infrastructure - See Getting Started
- Learn features - See Features
- Explore examples - See Examples
Related Documentation
- Getting Started → See provisioning/docs/src/getting-started/
- Features → See provisioning/docs/src/features/
- Configuration Guide → See provisioning/docs/src/infrastructure/
- Troubleshooting → See provisioning/docs/src/troubleshooting/
Initial Setup
Configure Provisioning after installation.
Overview
Initial setup validates your environment and prepares Provisioning for workspace creation. The setup process performs system detection, dependency verification, and configuration initialization.
Prerequisites
Before initial setup, ensure:
- Provisioning CLI installed and in PATH
- Nushell 0.109.1+ installed
- Nickel installed
- SOPS 3.10.2+ installed
- Age 1.2.1+ installed
- K9s 0.50.6+ installed (for Kubernetes)
Verify installation:
provisioning version
nu --version
nickel --version
sops --version
age --version
Setup Profiles
Provisioning provides configuration profiles for different use cases:
1. Developer Profile
For local development and testing:
provisioning setup profile --profile developer
Includes:
- Local provider (simulation environment)
- Development workspace
- Test environment configuration
- Debug logging enabled
- No MFA required
- Workspace directory: ~/.provisioning-dev/
2. Production Profile
For production deployments:
provisioning setup profile --profile production
Includes:
- Encrypted configuration
- Strict validation rules
- MFA enabled
- Audit logging enabled
- Workspace directory: /opt/provisioning/
3. CI/CD Profile
For unattended automation:
provisioning setup profile --profile cicd
Includes:
- Headless mode (no TUI prompts)
- Service account authentication
- Automated backups
- Policy enforcement
- Unattended upgrade support
Configuration Detection
The setup system automatically detects:
# System detection
OS: $(uname -s)
CPU: $(nproc)
RAM: $(free -h | awk '/Mem:/ {print $2}')
Architecture: $(uname -m)
The system adapts configuration based on detected resources:
| Detected Resource | Configuration |
|---|---|
| 2-4 CPU cores | Solo (single-instance) mode |
| 4-8 CPU cores | MultiUser mode (small cluster) |
| 8+ CPU cores | CICD or Enterprise mode |
| 4GB RAM | Minimal services only |
| 8GB RAM | Standard setup |
| 16GB+ RAM | Full feature set |
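The table's thresholds amount to a simple mapping from detected resources to a mode. An illustrative sketch (mode names lowercased; the `select_mode` helper is an assumption, not the platform's detector):

```shell
# Map detected CPU cores and RAM (GB) to a deployment mode,
# following the thresholds in the table above.
select_mode() {
  cpus="$1"; ram_gb="$2"
  if [ "$cpus" -ge 8 ] && [ "$ram_gb" -ge 16 ]; then
    echo "enterprise"
  elif [ "$cpus" -ge 4 ] && [ "$ram_gb" -ge 8 ]; then
    echo "multiuser"
  else
    echo "solo"
  fi
}
```

For example, `select_mode "$(nproc)" 8` on a 4-core host would report multiuser mode.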
Setup Steps
Step 1: Validate Environment
provisioning setup validate
Checks:
- ✅ All dependencies installed
- ✅ Permission levels
- ✅ Network connectivity
- ✅ Disk space (minimum 20GB recommended)
Step 2: Initialize Configuration
provisioning setup init
Creates:
- ~/.config/provisioning/ - User configuration directory
- ~/.config/provisioning/user_config.yaml - User settings
- ~/.provisioning/workspaces/ - Workspace registry
Step 3: Configure Providers
provisioning setup providers
Interactive configuration for:
- UpCloud (API key, endpoint)
- AWS (Access key, secret, region)
- Hetzner (API token)
- Local (No configuration required)
Store credentials securely:
# Credentials are encrypted with SOPS + Age
~/.config/provisioning/.secrets/providers.enc.yaml
Step 4: Configure Security
provisioning setup security
Sets up:
- JWT secret for authentication
- KMS backend (local, Cosmian, AWS KMS)
- Encryption keys
- Certificate authorities
Step 5: Verify Installation
provisioning verify
Checks:
- ✅ All components running
- ✅ Provider connectivity
- ✅ Configuration validity
- ✅ Security systems operational
User Configuration
User configuration is stored in ~/.config/provisioning/user_config.yaml:
# User preferences
user:
name: "Your Name"
email: "your@email.com"
default_region: "us-east-1"
# Workspace settings
workspaces:
active: "my-project"
directory: "~/.provisioning/workspaces/"
registry:
my-project:
path: "/home/user/.provisioning/workspaces/workspace_my_project"
created: "2026-01-16T10:30:00Z"
template: "default"
# Provider defaults
providers:
default: "upcloud"
upcloud:
endpoint: "https://api.upcloud.com"
aws:
region: "us-east-1"
# Security settings
security:
mfa_enabled: false
kms_backend: "local"
encryption: "aes-256-gcm"
# Display options
ui:
theme: "dark"
table_format: "compact"
colors: true
# Logging
logging:
level: "info"
output: "console"
file: "~/.provisioning/logs/provisioning.log"
Environment Variables
Override settings with environment variables:
# Provider selection
export PROVISIONING_PROVIDER=aws
# Workspace selection
export PROVISIONING_WORKSPACE=my-project
# Logging
export PROVISIONING_LOG_LEVEL=debug
# Configuration path
export PROVISIONING_CONFIG=~/.config/provisioning/
# KMS endpoint
export PROVISIONING_KMS_ENDPOINT=http://localhost:8080
Troubleshooting
Missing Dependencies
# Install missing tools
brew install nushell nickel sops age k9s
# Verify
provisioning setup validate
Permission Errors
# Fix directory permissions
chmod 700 ~/.config/provisioning/
chmod 600 ~/.config/provisioning/user_config.yaml
Provider Connection Failed
# Test provider connectivity
provisioning providers test upcloud --verbose
# Inspect the encrypted credentials file (contents remain SOPS-encrypted)
cat ~/.config/provisioning/.secrets/providers.enc.yaml
Next Steps
After initial setup, create your first workspace.
Workspace Setup
Create and initialize your first Provisioning workspace.
Overview
A workspace is the default organizational unit for all infrastructure work in Provisioning. It groups infrastructure definitions, configurations, extensions, and runtime data in an isolated environment.
Workspace Structure
Every workspace follows a consistent directory structure:
workspace_my_project/
├── config/ # Workspace configuration
│ ├── workspace.ncl # Workspace definition (Nickel)
│ ├── provisioning.yaml # Workspace metadata
│ ├── dev-defaults.toml # Development environment settings
│ ├── test-defaults.toml # Testing environment settings
│ └── prod-defaults.toml # Production environment settings
│
├── infra/ # Infrastructure definitions
│ ├── servers.ncl # Server configurations
│ ├── clusters.ncl # Cluster definitions
│ ├── networks.ncl # Network configurations
│ └── batch-workflows.ncl # Batch workflow definitions
│
├── extensions/ # Workspace-specific extensions (optional)
│ ├── providers/ # Custom providers
│ ├── taskservs/ # Custom task services
│ ├── clusters/ # Custom cluster templates
│ └── workflows/ # Custom workflow definitions
│
└── runtime/ # Runtime data (gitignored)
├── state/ # Infrastructure state files
├── checkpoints/ # Workflow checkpoints
├── logs/ # Operation logs
└── generated/ # Generated configuration files
Creating a Workspace
Method 1: From Built-in Template
# Create from default template
provisioning workspace init my-project
# Create from specific template
provisioning workspace init my-k8s --template kubernetes-ha
# Create with custom path
provisioning workspace init my-project --path /custom/location
Method 2: From Git Repository
# Clone infrastructure repository
git clone https://github.com/org/infra-repo.git my-infra
cd my-infra
# Import as workspace
provisioning workspace init . --import
Available Templates
Provisioning includes templates for common use cases:
| Template | Description | Use Case |
|---|---|---|
default | Minimal structure | General-purpose infrastructure |
kubernetes-ha | HA Kubernetes (3 control planes) | Production Kubernetes deployments |
development | Dev-optimized with Docker Compose | Local testing and development |
multi-cloud | Multiple provider configs | Multi-cloud deployments |
database-cluster | Database-focused | Database infrastructure |
cicd | CI/CD pipeline configs | Automated deployment pipelines |
List available templates:
provisioning workspace templates
# Show template details
provisioning workspace template show kubernetes-ha
Switching Workspaces
List All Workspaces
provisioning workspace list
# Example output:
NAME PATH LAST_USED STATUS
my-project ~/.provisioning/workspace_my 2026-01-16 10:30 Active
dev-env ~/.provisioning/workspace_dev 2026-01-15 15:45
production ~/.provisioning/workspace_prod 2026-01-10 09:00
Switch to a Workspace
# Switch workspace
provisioning workspace switch my-project
# Verify switch
provisioning workspace status
# Quick switch (shortcut)
provisioning ws switch dev-env
When you switch workspaces:
- Active workspace marker updates in user configuration
- Environment variables update for current session
- CLI prompt changes (if configured)
- Last-used timestamp updates
Workspace Registry
The workspace registry is stored in user configuration:
# ~/.config/provisioning/user_config.yaml
workspaces:
active: my-project
registry:
my-project:
path: ~/.provisioning/workspaces/workspace_my_project
created: 2026-01-16T10:30:00Z
last_used: 2026-01-16T14:20:00Z
template: default
Configuring Workspace
Workspace Definition (workspace.ncl)
# workspace.ncl - Workspace configuration
{
  # Workspace metadata
  name = "my-project",
  description = "My infrastructure project",
  version = "1.0.0",

  # Environment settings
  environment = 'production,

  # Default provider
  provider = "upcloud",

  # Region preferences
  region = "de-fra1",

  # Workspace-specific providers (override defaults)
  providers = {
    upcloud = {
      endpoint = "https://api.upcloud.com",
      region = "de-fra1"
    },
    aws = {
      region = "us-east-1"
    }
  },

  # Extensions (inherit from provisioning/extensions/)
  extensions = {
    providers = ["upcloud", "aws"],
    taskservs = ["kubernetes", "docker", "postgres"],
    clusters = ["web", "oci-reg"]
  }
}
Environment-Specific Configuration
Create environment-specific configuration files:
# Development environment
config/dev-defaults.toml:
[server]
plan = "small"
backup_enabled = false
# Production environment
config/prod-defaults.toml:
[server]
plan = "large"
backup_enabled = true
monitoring_enabled = true
Use environment selection:
# Deploy to development
PROVISIONING_ENV=dev provisioning server create
# Deploy to production (stricter validation)
PROVISIONING_ENV=prod provisioning server create --validate
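The PROVISIONING_ENV switch above determines which defaults file is loaded. A simplified sketch of that selection (the `defaults_file` helper is illustrative, not the platform's actual loader):

```shell
# Pick the defaults file for the current environment.
# PROVISIONING_ENV selects dev/test/prod; anything else falls back to dev.
defaults_file() {
  env="${PROVISIONING_ENV:-dev}"
  case "$env" in
    dev|test|prod) echo "config/${env}-defaults.toml" ;;
    *) echo "config/dev-defaults.toml" ;;
  esac
}
```

With `PROVISIONING_ENV=prod`, this resolves to config/prod-defaults.toml.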
Workspace Metadata (provisioning.yaml)
name: "my-project"
version: "1.0.0"
created: "2026-01-16T10:30:00Z"
owner: "team-infra"
# Provider configuration
providers:
default: "upcloud"
upcloud:
api_endpoint: "https://api.upcloud.com"
region: "de-fra1"
aws:
region: "us-east-1"
# Workspace features
features:
workspace_switching: true
batch_workflows: true
test_environment: true
security_system: true
# Validation rules
validation:
strict: true
check_dependencies: true
validate_certificates: true
# Backup settings
backup:
enabled: true
frequency: "daily"
retention_days: 30
Initializing Infrastructure
Step 1: Create Infrastructure Definition
Create infra/servers.ncl:
let defaults = import "defaults.ncl" in
{
  servers = [
    defaults.make_server {
      name = "web-01",
      plan = "medium",
      region = "de-fra1"
    },
    defaults.make_server {
      name = "db-01",
      plan = "large",
      region = "de-fra1",
      backup_enabled = true
    }
  ]
}
Step 2: Validate Configuration
# Validate Nickel configuration
nickel typecheck infra/servers.ncl
# Export and validate
nickel export infra/servers.ncl | provisioning validate config
# Verbose validation
provisioning validate config --verbose
Step 3: Export Configuration
# Export Nickel to TOML (generated output)
nickel export --format toml infra/servers.ncl > infra/servers.toml
# The .toml files are auto-generated, don't edit directly
Workspace Security
Securing Credentials
Credentials are encrypted with SOPS + Age:
# Initialize secrets
provisioning sops init
# Create encrypted secrets file
provisioning sops create .secrets/providers.enc.yaml
# Encrypt existing credentials
sops -e -i infra/credentials.toml
Git Workflow
Version control best practices:
# COMMIT (shared with team)
infra/**/*.ncl # Infrastructure definitions
config/*.toml # Environment configurations
config/provisioning.yaml # Workspace metadata
extensions/**/* # Custom extensions
# GITIGNORE (never commit)
config/local-overrides.toml # Local user settings
runtime/**/* # Runtime data and state
**/*.secret # Credential files
**/*.enc # Encrypted files (if not decrypted locally)
Multi-Workspace Strategies
Strategy 1: Separate Workspaces Per Environment
# Create dedicated workspaces
provisioning workspace init myapp-dev
provisioning workspace init myapp-staging
provisioning workspace init myapp-prod
# Each workspace is completely isolated
provisioning ws switch myapp-prod
provisioning server create # Creates in prod only
Pros: Complete isolation, different credentials, independent state
Cons: More workspace management, configuration duplication
Strategy 2: Single Workspace, Multiple Environments
# Single workspace with environment configs
provisioning workspace init myapp
# Deploy to different environments
PROVISIONING_ENV=dev provisioning server create
PROVISIONING_ENV=staging provisioning server create
PROVISIONING_ENV=prod provisioning server create
Pros: Shared configuration, easier maintenance
Cons: Shared credentials, risk of cross-environment mistakes
Strategy 3: Hybrid Approach
# Dev workspace for experimentation
provisioning workspace init myapp-dev
# Prod workspace for production only
provisioning workspace init myapp-prod
# Use environment flags within workspaces
provisioning ws switch myapp-prod
PROVISIONING_ENV=prod provisioning cluster deploy
Pros: Balances isolation and convenience
Cons: More complex to explain to teams
Workspace Validation
Before deploying infrastructure:
# Validate entire workspace
provisioning validate workspace
# Validate specific configuration
provisioning validate config --infra servers.ncl
# Validate with strict rules
provisioning validate config --strict
Troubleshooting
Workspace Not Found
# Re-register workspace
provisioning workspace register /path/to/workspace
# Or create new workspace
provisioning workspace init my-project
Permission Errors
# Fix workspace permissions
chmod 755 ~/.provisioning/workspaces/workspace_*
chmod 644 ~/.provisioning/workspaces/workspace_*/config/*
Configuration Validation Errors
# Check configuration syntax
nickel typecheck infra/*.ncl
# Inspect the exported configuration as JSON
nickel export infra/*.ncl | jq '.'
# Debug configuration loading
provisioning validate config --verbose
Next Steps
Configuration Management
Configure Provisioning providers, credentials, and system settings.
Overview
Provisioning uses a hierarchical configuration system with 5 layers of precedence. Configuration is type-safe via Nickel schemas and can be overridden at multiple levels.
Configuration Hierarchy
1. Runtime Arguments (Highest Priority)
↓ (CLI flags: --provider upcloud)
2. Environment Variables
↓ (PROVISIONING_PROVIDER=upcloud)
3. Workspace Configuration
↓ (workspace/config/provisioning.yaml)
4. Environment Defaults
↓ (workspace/config/prod-defaults.toml)
5. System Defaults (Lowest Priority)
├─ User Config (~/.config/provisioning/user_config.yaml)
└─ Platform Defaults (provisioning/config/config.defaults.toml)
Configuration Sources
1. System Defaults
Built-in defaults for all Provisioning settings:
Location: provisioning/config/config.defaults.toml
# Default provider
[providers]
default = "local"
# Default server configuration
[server]
plan = "small"
region = "us-east-1"
zone = "a"
backup_enabled = false
monitoring = false
# Default workspace
[workspace]
directory = "~/.provisioning/workspaces/"
# Logging
[logging]
level = "info"
output = "console"
# Security
[security]
mfa_enabled = false
encryption = "aes-256-gcm"
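Layered loading ultimately means reading a key from a defaults file and falling back when it is absent. A simplified sketch for flat string keys (a real implementation should use a proper TOML parser; the `toml_get` helper is purely illustrative):

```shell
# Read a simple `key = "value"` line from a TOML file, with a fallback.
# Only handles flat string keys on a single line.
toml_get() {
  file="$1"; key="$2"; fallback="$3"
  val=$(sed -n "s/^${key}[[:space:]]*=[[:space:]]*\"\(.*\)\"/\1/p" "$file" 2>/dev/null | head -n1)
  if [ -n "$val" ]; then echo "$val"; else echo "$fallback"; fi
}
```

For example, `toml_get config.defaults.toml level info` would return the configured log level or "info" if the key is missing.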
2. User Configuration
User-level settings in home directory:
Location: ~/.config/provisioning/user_config.yaml
user:
name: "Your Name"
email: "user@example.com"
providers:
default: "upcloud"
upcloud:
endpoint: "https://api.upcloud.com"
api_key: "${UPCLOUD_API_KEY}"
aws:
region: "us-east-1"
profile: "default"
workspace:
directory: "~/.provisioning/workspaces/"
default: "my-project"
logging:
level: "info"
file: "~/.provisioning/logs/provisioning.log"
3. Workspace Configuration
Workspace-specific settings:
Location: workspace/config/provisioning.yaml
name: "my-project"
environment: "production"
providers:
default: "upcloud"
upcloud:
region: "de-fra1"
endpoint: "https://api.upcloud.com"
validation:
strict: true
require_approval: false
4. Environment Defaults
Environment-specific configuration files:
Files:
- workspace/config/dev-defaults.toml - Development
- workspace/config/test-defaults.toml - Testing
- workspace/config/prod-defaults.toml - Production
Example prod-defaults.toml:
# Production environment overrides
[server]
plan = "large"
backup_enabled = true
monitoring = true
high_availability = true
[security]
mfa_enabled = true
require_approval = true
[workspace]
require_version_tag = true
require_changelog = true
5. Runtime Arguments
Command-line flags with highest priority:
# Override provider
provisioning --provider aws server create
# Override configuration
provisioning --config /custom/config.yaml
# Override environment
provisioning --env production
# Combined
provisioning --provider aws --env production --format json server list
Provider Configuration
Supported Providers
| Provider | Status | Configuration |
|---|---|---|
| UpCloud | ✅ Active | API endpoint, credentials |
| AWS | ✅ Active | Region, access keys, profile |
| Hetzner | ✅ Active | API token, datacenter |
| Local | ✅ Active | Directory path (no credentials) |
Configuring UpCloud
Interactive setup:
provisioning setup providers
Or manually in ~/.config/provisioning/user_config.yaml:
providers:
default: "upcloud"
upcloud:
endpoint: "https://api.upcloud.com"
api_key: "${UPCLOUD_API_KEY}"
api_secret: "${UPCLOUD_API_SECRET}"
Store credentials securely:
# Set environment variables
export UPCLOUD_API_KEY="your-api-key"
export UPCLOUD_API_SECRET="your-api-secret"
# Or use SOPS for encrypted storage
provisioning sops set providers.upcloud.api_key "your-api-key"
Configuring AWS
providers:
  aws:
    region: "us-east-1"
    access_key_id: "${AWS_ACCESS_KEY_ID}"
    secret_access_key: "${AWS_SECRET_ACCESS_KEY}"
    profile: "default"
Set environment variables:
export AWS_ACCESS_KEY_ID="your-access-key"
export AWS_SECRET_ACCESS_KEY="your-secret-key"
export AWS_REGION="us-east-1"
Configuring Hetzner
providers:
  hetzner:
    api_token: "${HETZNER_API_TOKEN}"
    datacenter: "nbg1-dc3"
Set environment:
export HETZNER_API_TOKEN="your-api-token"
Testing Provider Connectivity
# Test provider connectivity
provisioning providers test upcloud
# Verbose output
provisioning providers test aws --verbose
# Test all configured providers
provisioning providers test --all
Global Configuration Accessors
Provisioning provides 476+ configuration accessors for reading settings:
# Access configuration values
let config = (provisioning config load)
# Provider settings
$config.providers.default
$config.providers.upcloud.endpoint
$config.providers.aws.region
# Workspace settings
$config.workspace.directory
$config.workspace.default
# Server defaults
$config.server.plan
$config.server.region
$config.server.backup_enabled
# Security settings
$config.security.mfa_enabled
$config.security.encryption
Credential Management
Encrypted Credentials
Use SOPS + Age for encrypted secrets:
# Initialize SOPS configuration
provisioning sops init
# Create encrypted credentials file
provisioning sops create .secrets/providers.enc.yaml
# Edit encrypted file
provisioning sops edit .secrets/providers.enc.yaml
# Decrypt for local use
provisioning sops decrypt .secrets/providers.enc.yaml > .secrets/providers.yaml
Using Environment Variables
Override credentials at runtime:
# Provider credentials
export PROVISIONING_PROVIDER=aws
export AWS_ACCESS_KEY_ID="your-key"
export AWS_SECRET_ACCESS_KEY="your-secret"
export AWS_REGION="us-east-1"
# Execute command
provisioning server create
KMS Integration
For enterprise deployments, use KMS backends:
# Configure KMS backend
provisioning kms init --backend cosmian
# Store credentials in KMS
provisioning kms set providers.upcloud.api_key "value"
# Decrypt on-demand
provisioning kms get providers.upcloud.api_key
Configuration Validation
Validate Configuration
# Validate all configuration
provisioning validate config
# Validate specific section
provisioning validate config --section providers
# Strict validation
provisioning validate config --strict
# Verbose output
provisioning validate config --verbose
Validate Infrastructure
# Validate infrastructure schemas
provisioning validate infra
# Validate specific file
provisioning validate infra workspace/infra/servers.ncl
# Type-check with Nickel
nickel typecheck workspace/infra/servers.ncl
Configuration Merging
Configuration is merged from all layers respecting priority:
# View final merged configuration
provisioning config show
# Export merged configuration
provisioning config export --format yaml
# Show configuration source
provisioning config debug --keys providers.default
Working with Configurations
Export Configuration
# Export as YAML
provisioning config export --format yaml > config.yaml
# Export as JSON
provisioning config export --format json | jq '.'
# Export as TOML
provisioning config export --format toml > config.toml
Import Configuration
# Import from file
provisioning config import --file config.yaml
# Merge with existing
provisioning config merge --file config.yaml
Reset Configuration
# Reset to defaults
provisioning config reset
# Reset specific section
provisioning config reset --section providers
# Backup before reset
provisioning config backup
Environment Variables
Common environment variables for overriding configuration:
# Provider selection
export PROVISIONING_PROVIDER=upcloud
export PROVISIONING_PROVIDER_UPCLOUD_ENDPOINT=https://api.upcloud.com
# Workspace
export PROVISIONING_WORKSPACE=my-project
export PROVISIONING_WORKSPACE_DIRECTORY=~/.provisioning/workspaces/
# Environment
export PROVISIONING_ENV=production
# Logging
export PROVISIONING_LOG_LEVEL=debug
export PROVISIONING_LOG_FILE=~/.provisioning/logs/provisioning.log
# Configuration path
export PROVISIONING_CONFIG=~/.config/provisioning/
# KMS endpoint
export PROVISIONING_KMS_ENDPOINT=http://localhost:8080
# Feature flags
export PROVISIONING_FEATURE_BATCH_WORKFLOWS=true
export PROVISIONING_FEATURE_TEST_ENVIRONMENT=true
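Variables with the `PROVISIONING_` prefix map onto configuration keys. One plausible mapping is to strip the prefix, lowercase the remainder, and split on underscores into a key path; the platform's actual rules may differ (for example, for keys that themselves contain underscores), so treat this as a sketch of the override mechanism only:

```python
def env_override(env: dict, prefix: str = "PROVISIONING_") -> dict:
    """Turn PREFIX_SECTION_KEY=value pairs into a nested override dict."""
    overrides: dict = {}
    for name, value in env.items():
        if not name.startswith(prefix):
            continue  # ignore unrelated environment variables
        path = name[len(prefix):].lower().split("_")
        node = overrides
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = value
    return overrides

env = {"PROVISIONING_PROVIDER": "aws", "PROVISIONING_LOG_LEVEL": "debug"}
print(env_override(env))  # → {'provider': 'aws', 'log': {'level': 'debug'}}
```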
Best Practices
1. Secure Credentials
# NEVER commit credentials
echo "config/local-overrides.toml" >> .gitignore
echo ".secrets/" >> .gitignore
# Use SOPS for shared secrets
provisioning sops encrypt config/credentials.toml
git add config/credentials.enc.toml
# Use environment variables for local overrides
export PROVISIONING_PROVIDER_UPCLOUD_API_KEY="your-key"
2. Environment-Specific Configuration
# Development uses different credentials
PROVISIONING_ENV=dev provisioning workspace switch myapp-dev
# Production uses restricted credentials
PROVISIONING_ENV=prod provisioning workspace switch myapp-prod
3. Configuration Documentation
Document your configuration choices:
# provisioning.yaml
configuration:
provider: "upcloud"
reason: "Primary European cloud"
backup_strategy: "daily"
reason: "Compliance requirement"
monitoring: "enabled"
reason: "SLA monitoring"
4. Regular Validation
# Validate before deployment
provisioning validate config --strict
# Export and inspect
provisioning config export --format yaml | less
# Test provider connectivity
provisioning providers test --all
Troubleshooting
Configuration Not Loading
# Check configuration file
cat ~/.config/provisioning/user_config.yaml
# Validate YAML syntax
yamllint ~/.config/provisioning/user_config.yaml
# Debug configuration loading
provisioning config show --verbose
Provider Connection Failed
# Check provider configuration
provisioning config show --section providers
# Test connectivity
provisioning providers test upcloud --verbose
# Check credentials
provisioning kms get providers.upcloud.api_key
Environment Variable Conflicts
# Check environment variables
env | grep PROVISIONING
# Unset conflicting variables
unset PROVISIONING_PROVIDER
# Set correct values
export PROVISIONING_PROVIDER=aws
export AWS_REGION=us-east-1
Next Steps
User Guides
Step-by-step guides for common workflows, best practices, and advanced operational scenarios using the Provisioning platform.
Overview
This section provides practical guides for:
- Getting started - From-scratch deployment and initial setup
- Organization - Workspace management and multi-cloud strategies
- Automation - Advanced workflow orchestration and GitOps
- Operations - Disaster recovery, secrets rotation, cost governance
- Integration - Hybrid cloud setup, zero-trust networks, legacy migration
- Scaling - Multi-tenant environments, high availability, performance optimization
Each guide includes step-by-step instructions, configuration examples, troubleshooting, and best practices.
Getting Started
I’m completely new to Provisioning
Start with: From Scratch Guide - Complete walkthrough from installation through first deployment with explanations and examples.
I want to organize infrastructure
Read: Workspace Management - Best practices for organizing workspaces, isolation, and multi-team setup.
Core Workflow Guides
- From Scratch Guide - Installation, workspace creation, first deployment step-by-step
- Workspace Management - Organization best practices, multi-tenancy, collaboration, customization, schemas
- Multi-Cloud Deployment - Deploy across AWS, UpCloud, Hetzner with abstraction and failover
- Advanced Workflow Orchestration - DAG scheduling, parallel execution, logic, error handling, multi-environment
Advanced Operational Guides
- Hybrid Cloud Deployment - Hub-and-spoke architecture connecting on-premise and cloud infrastructure
- GitOps Infrastructure Deployment - GitHub Actions, reconciliation, drift detection, audit trails
- Advanced Networking - Load balancing, service mesh, DNS, zero-trust architecture, network policies
- Secrets Rotation Strategy - Password, API key, certificate, encryption key rotation with zero downtime
Enterprise Features
- Custom Extensions - Custom providers, task services, detectors, Nushell plugins
- Disaster Recovery Guide - DR planning, backup, failover procedures, testing, recovery time optimization
- Legacy System Migration - Zero-downtime migration with gradual traffic cutover and validation
Quick Navigation
I need to
Deploy infrastructure quickly → From Scratch Guide
Organize multiple workspaces → Workspace Management
Deploy across clouds → Multi-Cloud Deployment
Build complex workflows → Advanced Workflow Orchestration
Set up GitOps → GitOps Infrastructure Deployment
Handle disasters → Disaster Recovery Guide
Rotate secrets safely → Secrets Rotation Strategy
Connect on-premise to cloud → Hybrid Cloud Deployment
Design secure networks → Advanced Networking
Build custom extensions → Custom Extensions
Migrate legacy systems → Legacy System Migration
Guide Structure
Each guide follows this pattern:
- Overview - What you’ll accomplish
- Prerequisites - What you need before starting
- Architecture - Visual diagram of the solution
- Step-by-Step - Detailed instructions with examples
- Configuration - Full Nickel configuration examples
- Verification - How to validate the deployment
- Troubleshooting - Common issues and solutions
- Next Steps - How to extend or customize
- Best Practices - Lessons learned and recommendations
Learning Paths
Path 1: I’m new to Provisioning (Day 1)
- From Scratch Guide - Basic setup
- Workspace Management - Organization
- Multi-Cloud Deployment - Multi-cloud
Path 2: I need production-ready setup (Week 1)
- Workspace Management - Organization
- GitOps Infrastructure Deployment - Automation
- Disaster Recovery Guide - Resilience
- Secrets Rotation Strategy - Security
- Advanced Networking - Enterprise networking
Path 3: I’m migrating from legacy (Month-long project)
- Legacy System Migration - Migration plan
- Advanced Workflow Orchestration - Complex deployments
- Hybrid Cloud Deployment - Coexistence
- GitOps Infrastructure Deployment - Continuous deployment
- Disaster Recovery Guide - Failover strategies
Path 4: I’m building a platform (Team project)
- Custom Extensions - Build extensions
- Workspace Management - Multi-tenant setup
- Advanced Workflow Orchestration - Complex workflows
- GitOps Infrastructure Deployment - CD/GitOps
- Secrets Rotation Strategy - Security at scale
Related Documentation
- Getting Started → See provisioning/docs/src/getting-started/
- Examples → See provisioning/docs/src/examples/
- Features → See provisioning/docs/src/features/
- Operations → See provisioning/docs/src/operations/
- Development → See provisioning/docs/src/development/
From Scratch Guide
Complete walkthrough from zero to production-ready infrastructure deployment using the Provisioning platform. This guide covers installation, configuration, workspace setup, infrastructure definition, and deployment workflows.
Overview
This guide walks you through:
- Installing prerequisites and the Provisioning platform
- Configuring cloud provider credentials
- Creating your first workspace
- Defining infrastructure using Nickel
- Deploying servers and task services
- Setting up Kubernetes clusters
- Implementing security best practices
- Monitoring and maintaining infrastructure
Time commitment: 2-3 hours for complete setup
Prerequisites: Linux or macOS, terminal access, cloud provider account (optional)
Phase 1: Installation
System Prerequisites
Ensure your system meets minimum requirements:
# Check OS (Linux or macOS)
uname -s
# Verify available disk space (minimum 10GB recommended)
df -h ~
# Check internet connectivity
ping -c 3 github.com
Install Required Tools
Nushell (Required)
# macOS
brew install nushell
# Linux
cargo install nu
# Verify installation
nu --version # Expected: 0.109.1+
Nickel (Required)
# macOS
brew install nickel
# Linux
cargo install nickel-lang-cli
# Verify installation
nickel --version # Expected: 1.15.1+
Additional Tools
# SOPS for secrets management
brew install sops # macOS
# or download from https://github.com/getsops/sops/releases
# Age for encryption
brew install age # macOS
cargo install age # Linux
# K9s for Kubernetes management (optional)
brew install derailed/k9s/k9s
# Verify installations
sops --version # Expected: 3.10.2+
age --version # Expected: 1.2.1+
k9s version # Expected: 0.50.6+
Install Provisioning Platform
Option 1: Using Installer Script (Recommended)
# Download and run installer
INSTALL_URL="https://raw.githubusercontent.com/yourusername/provisioning/main/install.sh"
curl -sSL "$INSTALL_URL" | bash
# Follow prompts to configure installation directory and path
# Default: ~/.local/bin/provisioning
Installer performs:
- Downloads latest platform binaries
- Installs CLI to system PATH
- Creates default configuration structure
- Validates dependencies
- Runs health check
Option 2: Build from Source
# Clone repository
git clone https://github.com/yourusername/provisioning.git
cd provisioning
# Build core CLI
cd provisioning/core
cargo build --release
# Install to local bin
cp target/release/provisioning ~/.local/bin/
# Add to PATH (add to ~/.bashrc or ~/.zshrc)
export PATH="$HOME/.local/bin:$PATH"
# Verify installation
provisioning version
Platform Health Check
# Verify installation
provisioning setup check
# Expected output:
# ✓ Nushell 0.109.1 installed
# ✓ Nickel 1.15.1 installed
# ✓ SOPS 3.10.2 installed
# ✓ Age 1.2.1 installed
# ✓ Provisioning CLI installed
# ✓ Configuration directory created
# Platform ready for use
Phase 2: Initial Configuration
Generate User Configuration
# Create user configuration directory
mkdir -p ~/.config/provisioning
# Generate default user config
provisioning setup init-user-config
Generated configuration structure:
~/.config/provisioning/
├── user_config.yaml # User preferences and workspace registry
├── credentials/ # Provider credentials (encrypted)
├── age/ # Age encryption keys
└── cache/ # CLI cache
Configure Encryption
# Generate Age key pair for secrets
age-keygen -o ~/.config/provisioning/age/provisioning.key
# Store public key
age-keygen -y ~/.config/provisioning/age/provisioning.key > ~/.config/provisioning/age/provisioning.pub
# Configure SOPS to use Age
cat > ~/.config/sops/config.yaml <<EOF
creation_rules:
- path_regex: \.secret\.(yaml|toml|json)$
age: $(cat ~/.config/provisioning/age/provisioning.pub)
EOF
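The creation rule's `path_regex` decides which files SOPS encrypts with the Age key. With the pattern intended here (matching `.secret.yaml`, `.secret.toml`, and `.secret.json` suffixes), a quick check of which filenames qualify:

```python
import re

# Same pattern as the SOPS creation rule: only *.secret.{yaml,toml,json} match
pattern = re.compile(r"\.secret\.(yaml|toml|json)$")

for path in [
    "config/secrets.secret.yaml",      # matches
    "config/credentials.secret.toml",  # matches
    "config/provisioning.yaml",        # does not match (stays unencrypted)
]:
    print(path, bool(pattern.search(path)))
```

Anchoring with `$` keeps the rule from accidentally matching files that merely contain `.secret.` somewhere in the middle of the name.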
Provider Credentials
Configure credentials for your chosen cloud provider.
UpCloud Configuration
# Edit user config
nano ~/.config/provisioning/user_config.yaml
# Add provider credentials
cat >> ~/.config/provisioning/user_config.yaml <<EOF
providers:
  upcloud:
    username: "your-upcloud-username"
    password_env: "UPCLOUD_PASSWORD" # Read from environment variable
    default_zone: "de-fra1"
EOF
# Set environment variable (add to ~/.bashrc or ~/.zshrc)
export UPCLOUD_PASSWORD="your-upcloud-password"
AWS Configuration
# Add AWS credentials to user config
cat >> ~/.config/provisioning/user_config.yaml <<EOF
providers:
  aws:
    access_key_id_env: "AWS_ACCESS_KEY_ID"
    secret_access_key_env: "AWS_SECRET_ACCESS_KEY"
    default_region: "eu-west-1"
EOF
# Set environment variables
export AWS_ACCESS_KEY_ID="your-access-key-id"
export AWS_SECRET_ACCESS_KEY="your-secret-access-key"
Local Provider (Development)
# Configure local provider for testing
cat >> ~/.config/provisioning/user_config.yaml <<EOF
providers:
  local:
    backend: "docker" # or "podman", "libvirt"
    storage_path: "$HOME/.local/share/provisioning/local"
EOF
# Ensure Docker is running
docker info
Validate Configuration
# Validate user configuration
provisioning validate config
# Test provider connectivity
provisioning providers
# Expected output:
# PROVIDER STATUS REGION/ZONE
# upcloud connected de-fra1
# local ready localhost
Phase 3: Create First Workspace
Initialize Workspace
# Create workspace for first project
provisioning workspace init my-first-project
# Navigate to workspace
cd workspace_my_first_project
# Verify structure
ls -la
Workspace structure created:
workspace_my_first_project/
├── infra/ # Infrastructure definitions (Nickel)
├── config/ # Workspace configuration
│ ├── provisioning.yaml # Workspace metadata
│ ├── dev-defaults.toml # Development defaults
│ ├── test-defaults.toml # Testing defaults
│ └── prod-defaults.toml # Production defaults
├── extensions/ # Workspace-specific extensions
│ ├── providers/
│ ├── taskservs/
│ └── workflows/
└── runtime/ # State and logs (gitignored)
├── state/
├── checkpoints/
└── logs/
Configure Workspace
# Edit workspace metadata
nano config/provisioning.yaml
Example workspace configuration:
workspace:
  name: my-first-project
  description: Learning Provisioning platform
  environment: development
  created: 2026-01-16T10:00:00Z
defaults:
  provider: local
  region: localhost
  confirmation_required: false
versioning:
  nushell: "0.109.1"
  nickel: "1.15.1"
  kubernetes: "1.29.0"
Phase 4: Define Infrastructure
Simple Server Configuration
Create your first infrastructure definition using Nickel:
# Create server definition
cat > infra/simple-server.ncl <<'EOF'
{
metadata = {
name = "simple-server"
provider = "local"
environment = 'development
}
infrastructure = {
servers = [
{
name = "dev-web-01"
plan = "small"
zone = "localhost"
disk_size_gb = 25
backup_enabled = false
role = 'standalone
}
]
}
services = {
taskservs = ["containerd"]
}
}
EOF
Validate Infrastructure Schema
# Type-check Nickel schema
nickel typecheck infra/simple-server.ncl
# Validate against platform contracts
provisioning validate config --infra simple-server
# Preview deployment
provisioning server create --check --infra simple-server
Expected output:
Infrastructure Plan: simple-server
Provider: local
Environment: development
Servers to create:
- dev-web-01 (small, standalone)
Disk: 25 GB
Backup: disabled
Task services:
- containerd
Estimated resources:
CPU: 1 core
RAM: 1 GB
Disk: 25 GB
Validation: PASSED
Deploy Infrastructure
# Create server
provisioning server create --infra simple-server --yes
# Monitor deployment
provisioning server status dev-web-01
Deployment progress:
Creating server: dev-web-01...
[████████████████████████] 100% - Container created
[████████████████████████] 100% - Network configured
[████████████████████████] 100% - SSH ready
Server dev-web-01 created successfully
IP Address: 172.17.0.2
Status: running
Provider: local (docker)
Install Task Service
# Install containerd
provisioning taskserv create containerd --infra simple-server
# Verify installation
provisioning taskserv status containerd
Installation output:
Installing containerd on dev-web-01...
[████████████████████████] 100% - Dependencies resolved
[████████████████████████] 100% - Containerd installed
[████████████████████████] 100% - Service started
[████████████████████████] 100% - Health check passed
Containerd installed successfully
Version: 1.7.0
Runtime: runc
Verify Deployment
# SSH into server
provisioning server ssh dev-web-01
# Inside server - verify containerd
sudo systemctl status containerd
sudo ctr version
# Exit server
exit
# List all resources
provisioning server list
provisioning taskserv list
Phase 5: Kubernetes Cluster Deployment
Define Kubernetes Infrastructure
# Create Kubernetes cluster definition
cat > infra/k8s-cluster.ncl <<'EOF'
{
metadata = {
name = "k8s-dev-cluster"
provider = "local"
environment = 'development
}
infrastructure = {
servers = [
{
name = "k8s-control-01"
plan = "medium"
role = 'control
zone = "localhost"
disk_size_gb = 50
}
{
name = "k8s-worker-01"
plan = "medium"
role = 'worker
zone = "localhost"
disk_size_gb = 50
}
{
name = "k8s-worker-02"
plan = "medium"
role = 'worker
zone = "localhost"
disk_size_gb = 50
}
]
}
services = {
taskservs = ["containerd", "etcd", "kubernetes", "cilium"]
}
kubernetes = {
version = "1.29.0"
pod_cidr = "10.244.0.0/16"
service_cidr = "10.96.0.0/12"
container_runtime = "containerd"
cri_socket = "/run/containerd/containerd.sock"
}
}
EOF
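The `pod_cidr` and `service_cidr` in the cluster definition must not overlap, or routing inside the cluster breaks. Python's standard library can verify this before deploying:

```python
import ipaddress

# The CIDRs from the k8s-cluster.ncl definition above
pod_cidr = ipaddress.ip_network("10.244.0.0/16")
service_cidr = ipaddress.ip_network("10.96.0.0/12")

# 10.96.0.0/12 spans 10.96.0.0-10.111.255.255, so the ranges are disjoint
print(pod_cidr.overlaps(service_cidr))  # → False
```

The same check is worth running against any node or VPN subnets the cluster will attach to.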
Validate Kubernetes Configuration
# Type-check schema
nickel typecheck infra/k8s-cluster.ncl
# Validate configuration
provisioning validate config --infra k8s-cluster
# Preview deployment
provisioning cluster create --check --infra k8s-cluster
Deploy Kubernetes Cluster
# Create cluster infrastructure
provisioning cluster create --infra k8s-cluster --yes
# Monitor cluster deployment
provisioning cluster status k8s-dev-cluster
Cluster deployment phases:
Phase 1: Creating servers...
[████████████████████████] 100% - 3/3 servers created
Phase 2: Installing containerd...
[████████████████████████] 100% - 3/3 nodes ready
Phase 3: Installing etcd...
[████████████████████████] 100% - Control plane ready
Phase 4: Installing Kubernetes...
[████████████████████████] 100% - API server available
[████████████████████████] 100% - Workers joined
Phase 5: Installing Cilium CNI...
[████████████████████████] 100% - Network ready
Kubernetes cluster deployed successfully
Cluster: k8s-dev-cluster
Control plane: k8s-control-01
Workers: k8s-worker-01, k8s-worker-02
Access Kubernetes Cluster
# Get kubeconfig
provisioning cluster kubeconfig k8s-dev-cluster > ~/.kube/config-dev
# Set KUBECONFIG
export KUBECONFIG=~/.kube/config-dev
# Verify cluster
kubectl get nodes
# Expected output:
# NAME STATUS ROLES AGE VERSION
# k8s-control-01 Ready control-plane 5m v1.29.0
# k8s-worker-01 Ready <none> 4m v1.29.0
# k8s-worker-02 Ready <none> 4m v1.29.0
# Use K9s for interactive management
k9s
Phase 6: Security Configuration
Enable Audit Logging
# Configure audit logging
cat > config/audit-config.toml <<EOF
[audit]
enabled = true
log_path = "runtime/logs/audit"
retention_days = 90
level = "info"
[audit.filters]
include_commands = ["server create", "server delete", "cluster deploy"]
exclude_users = []
EOF
Configure SOPS for Secrets
# Create secrets file
cat > config/secrets.secret.yaml <<EOF
database:
  password: "changeme-db-password"
  admin_user: "admin"
kubernetes:
  service_account_key: "changeme-sa-key"
EOF
# Encrypt secrets with SOPS
sops -e -i config/secrets.secret.yaml
# Verify encryption
cat config/secrets.secret.yaml # Should show encrypted content
# Decrypt when needed
sops -d config/secrets.secret.yaml
Enable MFA (Optional)
# Enable multi-factor authentication
provisioning security mfa enable
# Scan QR code with authenticator app
# Enter verification code
Configure RBAC
# Create role definition
cat > config/rbac-roles.yaml <<EOF
roles:
  - name: developer
    permissions:
      - server:read
      - server:create
      - taskserv:read
      - taskserv:install
    deny:
      - cluster:delete
      - config:modify
  - name: operator
    permissions:
      - "*:read"
      - server:*
      - taskserv:*
      - cluster:read
      - cluster:deploy
  - name: admin
    permissions:
      - "*:*"
EOF
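Permissions follow a `resource:action` shape with `*` wildcards. A sketch of how such a role could be evaluated, assuming deny rules override allows and `*` matches any segment (the platform's actual RBAC evaluation may differ):

```python
from fnmatch import fnmatch

def allowed(role: dict, permission: str) -> bool:
    """Check a `resource:action` string against a role's allow/deny patterns."""
    # Deny rules are checked first and always win
    if any(fnmatch(permission, pat) for pat in role.get("deny", [])):
        return False
    return any(fnmatch(permission, pat) for pat in role.get("permissions", []))

developer = {
    "permissions": ["server:read", "server:create", "taskserv:read", "taskserv:install"],
    "deny": ["cluster:delete", "config:modify"],
}
operator = {"permissions": ["*:read", "server:*", "taskserv:*", "cluster:read", "cluster:deploy"]}

print(allowed(developer, "server:create"))   # → True
print(allowed(developer, "cluster:delete"))  # → False (explicitly denied)
print(allowed(operator, "cluster:read"))     # → True (matches "*:read")
```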
Phase 7: Multi-Cloud Deployment
Define Multi-Cloud Infrastructure
# Create multi-cloud definition
cat > infra/multi-cloud.ncl <<'EOF'
{
batch_workflow = {
operations = [
{
id = "upcloud-frontend"
provider = "upcloud"
region = "de-fra1"
servers = [
{name = "upcloud-web-01", plan = "medium", role = 'web}
]
taskservs = ["containerd", "nginx"]
}
{
id = "aws-backend"
provider = "aws"
region = "eu-west-1"
servers = [
{name = "aws-api-01", plan = "t3.medium", role = 'api}
]
taskservs = ["containerd", "docker"]
dependencies = ["upcloud-frontend"]
}
{
id = "local-database"
provider = "local"
region = "localhost"
servers = [
{name = "local-db-01", plan = "large", role = 'database}
]
taskservs = ["postgresql"]
}
]
parallel_limit = 2
}
}
EOF
Deploy Multi-Cloud Infrastructure
# Submit batch workflow
provisioning batch submit infra/multi-cloud.ncl
# Monitor workflow progress
provisioning batch status
# View detailed operation status
provisioning batch operations
Phase 8: Monitoring and Maintenance
Platform Health Monitoring
# Check platform health
provisioning health
# View service status
provisioning service status orchestrator
provisioning service status control-center
# View logs
provisioning logs --service orchestrator --tail 100
Infrastructure Monitoring
# List all servers
provisioning server list --all-workspaces
# Show server details
provisioning server info k8s-control-01
# Check task service status
provisioning taskserv list
provisioning taskserv health containerd
Backup Configuration
# Create backup
provisioning backup create --type full --output ~/backups/provisioning-$(date +%Y%m%d).tar.gz
# Schedule automatic backups
provisioning backup schedule daily --time "02:00" --retention 7
Phase 9: Advanced Workflows
Custom Workflow Creation
# Create custom workflow
cat > extensions/workflows/deploy-app.ncl <<'EOF'
{
workflow = {
name = "deploy-application"
description = "Deploy application to Kubernetes"
steps = [
{
name = "build-image"
action = "docker-build"
params = {dockerfile = "Dockerfile", tag = "myapp:latest"}
}
{
name = "push-image"
action = "docker-push"
params = {image = "myapp:latest", registry = "registry.example.com"}
depends_on = ["build-image"]
}
{
name = "deploy-k8s"
action = "kubectl-apply"
params = {manifest = "k8s/deployment.yaml"}
depends_on = ["push-image"]
}
{
name = "verify-deployment"
action = "kubectl-rollout-status"
params = {deployment = "myapp"}
depends_on = ["deploy-k8s"]
}
]
}
}
EOF
Execute Custom Workflow
# Run workflow
provisioning workflow run deploy-application
# Monitor workflow
provisioning workflow status deploy-application
# View workflow history
provisioning workflow history
Troubleshooting
Common Issues
Server Creation Fails
# Enable debug logging
provisioning --debug server create --infra simple-server
# Check provider connectivity
provisioning providers
# Validate credentials
provisioning validate config
Task Service Installation Fails
# Check server connectivity
provisioning server ssh dev-web-01
# Verify dependencies
provisioning taskserv check-deps containerd
# Retry installation
provisioning taskserv create containerd --force
Cluster Deployment Fails
# Check cluster status
provisioning cluster status k8s-dev-cluster
# View cluster logs
provisioning cluster logs k8s-dev-cluster
# Reset and retry
provisioning cluster reset k8s-dev-cluster
provisioning cluster create --infra k8s-cluster
Next Steps
Production Deployment
- Review Security Best Practices
- Configure Backup & Recovery
- Set up Monitoring
- Implement Disaster Recovery
Advanced Features
- Explore Batch Workflows
- Configure Orchestrator
- Use Interactive Guides
- Develop Custom Extensions
Learning Resources
- Nickel Guide - Infrastructure as code
- Workspace Management - Advanced workspace usage
- Multi-Cloud Deployment - Multi-cloud strategies
- API Reference - Complete API documentation
Summary
You’ve completed the from-scratch guide and learned:
- Platform installation and configuration
- Provider credential setup
- Workspace creation and management
- Infrastructure definition with Nickel
- Server and task service deployment
- Kubernetes cluster deployment
- Security configuration
- Multi-cloud deployment
- Monitoring and maintenance
- Custom workflow creation
Your Provisioning platform is now ready for production use.
Workspace Management
Multi-Cloud Deployment
Comprehensive guide to deploying and managing infrastructure across multiple cloud providers using the Provisioning platform. This guide covers strategies, patterns, and real-world examples for building resilient multi-cloud architectures.
Overview
Multi-cloud deployment enables:
- Vendor independence - Avoid lock-in to single cloud provider
- Geographic distribution - Deploy closer to users worldwide
- Resilience - Survive provider outages or regional failures
- Cost optimization - Leverage competitive pricing across providers
- Compliance - Meet data residency and sovereignty requirements
- Performance - Optimize latency through strategic placement
Multi-Cloud Strategies
Strategy 1: Primary-Backup Architecture
One provider serves production traffic, another provides disaster recovery.
Use cases:
- Cost-conscious deployments
- Regulatory backup requirements
- Testing multi-cloud capabilities
Example topology:
Primary (UpCloud EU) Backup (AWS US)
├── Production workloads ├── Standby replicas
├── Active databases ├── Read-only databases
├── Live traffic └── Failover ready
└── Real-time sync ────────────>
Pros: Simple management, lower costs, proven failover
Cons: Backup resources underutilized, sync lag
Strategy 2: Active-Active Architecture
Multiple providers serve production traffic simultaneously.
Use cases:
- High availability requirements
- Global user base
- Zero-downtime deployments
Example topology:
UpCloud (EU) AWS (US) Local (Development)
├── EU traffic ├── US traffic ├── Testing
├── Primary database ├── Primary database ├── CI/CD
└── Global load balancer ←────┴──────────────────────────────┘
Pros: Maximum availability, optimized latency, full utilization
Cons: Complex management, higher costs, data consistency challenges
Strategy 3: Specialized Workload Distribution
Different providers for different workload types based on strengths.
Use cases:
- Heterogeneous workloads
- Cost optimization
- Leveraging provider-specific services
Example topology:
UpCloud AWS Local
├── Compute-intensive ├── Object storage (S3) ├── Development
├── Kubernetes clusters ├── Managed databases (RDS) └── Testing
└── High-performance VMs └── Serverless (Lambda)
Pros: Optimize for provider strengths, cost-effective, flexible
Cons: Complex integration, vendor-specific knowledge required
Strategy 4: Compliance-Driven Architecture
Provider selection based on regulatory and data residency requirements.
Use cases:
- GDPR compliance
- Data sovereignty
- Industry regulations (HIPAA, PCI-DSS)
Example topology:
UpCloud (EU - GDPR) AWS (US - FedRAMP) On-Premises (Sensitive)
├── EU customer data ├── US customer data ├── PII storage
├── GDPR-compliant ├── US compliance └── Encrypted backups
└── Regional processing └── Federal workloads
Pros: Meets compliance requirements, data sovereignty
Cons: Geographic constraints, complex data management
Infrastructure Definition
Multi-Provider Server Configuration
Define servers across multiple providers using Nickel:
# infra/multi-cloud-servers.ncl
{
metadata = {
name = "multi-cloud-infrastructure"
environment = 'production
}
infrastructure = {
servers = [
# UpCloud servers (EU region)
{
name = "upcloud-web-01"
provider = "upcloud"
zone = "de-fra1"
plan = "medium"
role = 'web
backup_enabled = true
tags = ["frontend", "europe"]
}
{
name = "upcloud-web-02"
provider = "upcloud"
zone = "fi-hel1"
plan = "medium"
role = 'web
backup_enabled = true
tags = ["frontend", "europe"]
}
# AWS servers (US region)
{
name = "aws-api-01"
provider = "aws"
zone = "us-east-1a"
plan = "t3.large"
role = 'api
backup_enabled = true
tags = ["backend", "americas"]
}
{
name = "aws-api-02"
provider = "aws"
zone = "us-west-2a"
plan = "t3.large"
role = 'api
backup_enabled = true
tags = ["backend", "americas"]
}
# Local provider (development/testing)
{
name = "local-test-01"
provider = "local"
zone = "localhost"
plan = "small"
role = 'test
backup_enabled = false
tags = ["testing", "development"]
}
]
}
networking = {
vpn_mesh = true
cross_provider_routing = true
dns_strategy = 'geo_distributed
}
}
Batch Workflow for Multi-Cloud
Use batch workflows for orchestrated multi-cloud deployments:
# infra/multi-cloud-batch.ncl
{
batch_workflow = {
name = "global-deployment"
description = "Deploy infrastructure across three cloud providers"
operations = [
{
id = "upcloud-eu"
provider = "upcloud"
region = "de-fra1"
servers = [
{name = "upcloud-web-01", plan = "medium", role = 'web}
{name = "upcloud-db-01", plan = "large", role = 'database}
]
taskservs = ["containerd", "nginx", "postgresql"]
priority = 1
}
{
id = "aws-us"
provider = "aws"
region = "us-east-1"
servers = [
{name = "aws-api-01", plan = "t3.large", role = 'api}
{name = "aws-cache-01", plan = "t3.medium", role = 'cache}
]
taskservs = ["containerd", "docker", "redis"]
dependencies = ["upcloud-eu"]
priority = 2
}
{
id = "local-dev"
provider = "local"
region = "localhost"
servers = [
{name = "local-test-01", plan = "small", role = 'test}
]
taskservs = ["containerd"]
priority = 3
}
]
execution = {
parallel_limit = 2
retry_failed = true
max_retries = 3
checkpoint_enabled = true
}
}
}
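To make the execution semantics concrete, here is a minimal Python sketch (illustrative only, not the platform's engine) of how operations with `dependencies` and `priority` could be grouped into execution waves; `parallel_limit` would then cap concurrency within each wave:

```python
def execution_waves(operations):
    """Group operations into waves: an op runs only after its dependencies."""
    done, waves = set(), []
    pending = {op["id"]: op for op in operations}
    while pending:
        # Ready = all dependencies already completed
        ready = [op for op in pending.values()
                 if set(op.get("dependencies", [])) <= done]
        if not ready:
            raise ValueError("dependency cycle detected")
        # Lower priority number runs first within a wave
        ready.sort(key=lambda op: op.get("priority", 0))
        waves.append([op["id"] for op in ready])
        for op in ready:
            done.add(op["id"])
            del pending[op["id"]]
    return waves

ops = [
    {"id": "upcloud-eu", "priority": 1},
    {"id": "aws-us", "priority": 2, "dependencies": ["upcloud-eu"]},
    {"id": "local-dev", "priority": 3},
]
print(execution_waves(ops))  # [['upcloud-eu', 'local-dev'], ['aws-us']]
```

With the workflow above, `upcloud-eu` and `local-dev` have no dependencies and can start together (subject to `parallel_limit = 2`), while `aws-us` waits for `upcloud-eu`.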
Deployment Patterns
Pattern 1: Sequential Deployment
Deploy providers one at a time to minimize risk.
# Deploy to primary provider first
provisioning batch submit infra/upcloud-primary.ncl
# Verify primary deployment
provisioning server list --provider upcloud
provisioning server status upcloud-web-01
# Deploy to secondary provider
provisioning batch submit infra/aws-secondary.ncl
# Verify secondary deployment
provisioning server list --provider aws
Advantages:
- Controlled rollout
- Easy troubleshooting
- Clear rollback path
Disadvantages:
- Slower deployment
- Sequential dependencies
Pattern 2: Parallel Deployment
Deploy to multiple providers simultaneously for speed.
# Submit multi-cloud batch workflow
provisioning batch submit infra/multi-cloud-batch.ncl
# Monitor all operations
provisioning batch status
# Check progress per provider
provisioning batch operations --filter provider=upcloud
provisioning batch operations --filter provider=aws
Advantages:
- Fast deployment
- Efficient resource usage
- Parallel testing
Disadvantages:
- Complex failure handling
- Resource contention
- Harder troubleshooting
Pattern 3: Blue-Green Multi-Cloud
Deploy new infrastructure in parallel, then switch traffic.
# infra/blue-green-multi-cloud.ncl
{
deployment = {
strategy = 'blue_green
blue_environment = {
upcloud = {servers = [{name = "upcloud-web-01-blue", role = 'web}]}
aws = {servers = [{name = "aws-api-01-blue", role = 'api}]}
}
green_environment = {
upcloud = {servers = [{name = "upcloud-web-01-green", role = 'web}]}
aws = {servers = [{name = "aws-api-01-green", role = 'api}]}
}
traffic_switch = {
type = 'dns
validation_required = true
rollback_timeout_seconds = 300
}
}
}
# Deploy green environment
provisioning deployment create --infra blue-green-multi-cloud --target green
# Validate green environment
provisioning deployment validate green
# Switch traffic to green
provisioning deployment switch-traffic green
# Decommission blue environment
provisioning deployment delete blue
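The switch logic implied by `validation_required` and `rollback_timeout_seconds` can be sketched as a small state decision (hypothetical model; the function and callback names are illustrative, not platform APIs):

```python
def switch_traffic(active, candidate, validate, healthy_after_switch,
                   rollback_timeout_seconds=300):
    """Return the environment that should be live after the switch attempt."""
    if not validate(candidate):          # validation_required = true
        return active                    # refuse the switch entirely
    # Traffic now points at the candidate; watch it for the rollback window
    if not healthy_after_switch(candidate, rollback_timeout_seconds):
        return active                    # automatic rollback within the window
    return candidate

live = switch_traffic("blue", "green",
                      validate=lambda env: True,
                      healthy_after_switch=lambda env, timeout: True)
print(live)  # green
```

The key property: blue stays live unless green both validates and remains healthy through the rollback window.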
Network Configuration
Cross-Provider VPN Mesh
Connect servers across providers using VPN mesh:
# infra/vpn-mesh.ncl
{
networking = {
vpn_mesh = {
enabled = true
encryption = 'wireguard
peers = [
{
name = "upcloud-gateway"
provider = "upcloud"
public_ip = "auto"
private_subnet = "10.0.1.0/24"
}
{
name = "aws-gateway"
provider = "aws"
public_ip = "auto"
private_subnet = "10.0.2.0/24"
}
{
name = "local-gateway"
provider = "local"
public_ip = "192.168.1.1"
private_subnet = "10.0.3.0/24"
}
]
routing = {
dynamic_routes = true
bgp_enabled = false
static_routes = [
{from = "10.0.1.0/24", to = "10.0.2.0/24", via = "aws-gateway"}
{from = "10.0.2.0/24", to = "10.0.1.0/24", via = "upcloud-gateway"}
]
}
}
}
}
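A route lookup over the `static_routes` table above can be modeled as follows (illustrative Python sketch using the gateway names from the config; not platform code):

```python
import ipaddress

static_routes = [
    {"from": "10.0.1.0/24", "to": "10.0.2.0/24", "via": "aws-gateway"},
    {"from": "10.0.2.0/24", "to": "10.0.1.0/24", "via": "upcloud-gateway"},
]

def next_hop(src_ip, dst_ip, routes=static_routes):
    """Return the VPN gateway for cross-provider traffic, or None if local."""
    src, dst = ipaddress.ip_address(src_ip), ipaddress.ip_address(dst_ip)
    for r in routes:
        if (src in ipaddress.ip_network(r["from"])
                and dst in ipaddress.ip_network(r["to"])):
            return r["via"]
    return None  # no cross-provider route; traffic stays on the local subnet

print(next_hop("10.0.1.10", "10.0.2.20"))  # aws-gateway
```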
Global DNS Configuration
Configure geo-distributed DNS for optimal routing:
# infra/global-dns.ncl
{
dns = {
provider = 'cloudflare # or 'route53, 'custom
zones = [
{
name = "example.com"
type = 'primary
records = [
{
name = "eu"
type = 'A
ttl = 300
values = ["upcloud-web-01.ip", "upcloud-web-02.ip"]
geo_location = 'europe
}
{
name = "us"
type = 'A
ttl = 300
values = ["aws-api-01.ip", "aws-api-02.ip"]
geo_location = 'americas
}
{
name = "@"
type = 'CNAME
ttl = 60
value = "global-lb.example.com"
geo_routing = 'latency_based
}
]
}
]
health_checks = [
{target = "upcloud-web-01", interval_seconds = 30}
{target = "aws-api-01", interval_seconds = 30}
]
}
}
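Geo-distributed resolution picks the record whose `geo_location` matches the client's region and falls back to the global CNAME otherwise. A minimal sketch (illustrative, using the record names from the config above):

```python
records = [
    {"name": "eu", "type": "A", "geo_location": "europe",
     "values": ["upcloud-web-01.ip", "upcloud-web-02.ip"]},
    {"name": "us", "type": "A", "geo_location": "americas",
     "values": ["aws-api-01.ip", "aws-api-02.ip"]},
    {"name": "@", "type": "CNAME", "value": "global-lb.example.com"},
]

def resolve(client_region, records=records):
    """Return the targets for a client region, falling back to the global LB."""
    for r in records:
        if r.get("geo_location") == client_region:
            return r["values"]
    return [r["value"] for r in records if r["type"] == "CNAME"]

print(resolve("europe"))  # ['upcloud-web-01.ip', 'upcloud-web-02.ip']
print(resolve("asia"))    # ['global-lb.example.com']
```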
Data Replication
Database Replication Across Providers
Configure cross-provider database replication:
# infra/database-replication.ncl
{
databases = {
postgresql = {
primary = {
provider = "upcloud"
server = "upcloud-db-01"
version = "15"
replication_role = 'primary
}
replicas = [
{
provider = "aws"
server = "aws-db-replica-01"
version = "15"
replication_role = 'replica
replication_lag_max_seconds = 30
failover_priority = 1
}
{
provider = "local"
server = "local-db-backup-01"
version = "15"
replication_role = 'replica
replication_lag_max_seconds = 300
failover_priority = 2
}
]
replication = {
method = 'streaming
ssl_required = true
compression = true
conflict_resolution = 'primary_wins
}
}
}
}
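During failover, the replica to promote is chosen by `failover_priority` among replicas whose observed lag is within their configured bound. A sketch of that selection (illustrative model, not the platform's failover engine):

```python
replicas = [
    {"server": "aws-db-replica-01", "failover_priority": 1,
     "replication_lag_max_seconds": 30},
    {"server": "local-db-backup-01", "failover_priority": 2,
     "replication_lag_max_seconds": 300},
]

def pick_promotion_target(observed_lag, replicas=replicas):
    """Promote the lowest-priority-number replica whose lag is acceptable."""
    candidates = [r for r in replicas
                  if observed_lag.get(r["server"], float("inf"))
                  <= r["replication_lag_max_seconds"]]
    candidates.sort(key=lambda r: r["failover_priority"])
    return candidates[0]["server"] if candidates else None

print(pick_promotion_target({"aws-db-replica-01": 12,
                             "local-db-backup-01": 200}))  # aws-db-replica-01
```

If the AWS replica's lag exceeds its 30-second bound, selection falls through to the local backup replica.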
Object Storage Sync
Synchronize object storage across providers:
# Configure cross-provider storage sync
cat > infra/storage-sync.ncl <<'EOF'
{
storage = {
sync_policy = {
source = {
provider = "upcloud"
bucket = "primary-storage"
region = "de-fra1"
}
destinations = [
{
provider = "aws"
bucket = "backup-storage"
region = "us-east-1"
sync_interval_minutes = 15
}
]
filters = {
include_patterns = ["*.pdf", "*.jpg", "backups/*"]
exclude_patterns = ["temp/*", "*.tmp"]
}
conflict_resolution = 'timestamp_wins
}
}
}
EOF
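The filter semantics above (a key syncs only if it matches an include pattern and no exclude pattern, with exclude winning) can be sketched with glob matching (illustrative Python model, assuming shell-style pattern matching):

```python
from fnmatch import fnmatch

include = ["*.pdf", "*.jpg", "backups/*"]
exclude = ["temp/*", "*.tmp"]

def should_sync(key, include=include, exclude=exclude):
    """Exclude patterns take precedence over include patterns."""
    if any(fnmatch(key, p) for p in exclude):
        return False
    return any(fnmatch(key, p) for p in include)

print(should_sync("backups/db.dump"))  # True
print(should_sync("temp/report.pdf"))  # False (excluded despite *.pdf)
```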
Kubernetes Multi-Cloud
Cluster Federation
Deploy Kubernetes clusters across providers with federation:
# infra/k8s-federation.ncl
{
kubernetes_federation = {
clusters = [
{
name = "upcloud-eu-cluster"
provider = "upcloud"
region = "de-fra1"
control_plane_count = 3
worker_count = 5
version = "1.29.0"
}
{
name = "aws-us-cluster"
provider = "aws"
region = "us-east-1"
control_plane_count = 3
worker_count = 5
version = "1.29.0"
}
]
federation = {
enabled = true
control_plane_cluster = "upcloud-eu-cluster"
networking = {
cluster_mesh = true
service_discovery = 'dns
cross_cluster_load_balancing = true
}
workload_distribution = {
strategy = 'geo_aware
prefer_local = true
failover_enabled = true
}
}
}
}
Multi-Cluster Deployments
Deploy applications across multiple Kubernetes clusters:
# k8s/multi-cluster-deployment.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: multi-cloud-app
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
  namespace: multi-cloud-app
  labels:
    app: frontend
    region: europe
spec:
  replicas: 3
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
        - name: nginx
          image: nginx:latest
          ports:
            - containerPort: 80
# Deploy to multiple clusters
export UPCLOUD_KUBECONFIG=~/.kube/config-upcloud
export AWS_KUBECONFIG=~/.kube/config-aws
kubectl --kubeconfig $UPCLOUD_KUBECONFIG apply -f k8s/multi-cluster-deployment.yaml
kubectl --kubeconfig $AWS_KUBECONFIG apply -f k8s/multi-cluster-deployment.yaml
# Verify deployments
kubectl --kubeconfig $UPCLOUD_KUBECONFIG get pods -n multi-cloud-app
kubectl --kubeconfig $AWS_KUBECONFIG get pods -n multi-cloud-app
Cost Optimization
Provider Selection by Workload
Optimize costs by choosing the most cost-effective provider per workload:
# infra/cost-optimized.ncl
{
cost_optimization = {
workloads = [
{
name = "compute-intensive"
provider = "upcloud" # Best compute pricing
plan = "large"
count = 10
}
{
name = "storage-heavy"
provider = "aws" # Best storage pricing with S3
plan = "medium"
count = 5
storage_type = 's3
}
{
name = "development"
provider = "local" # Zero cost
plan = "small"
count = 3
}
]
budget_limits = {
monthly_max_usd = 5000
alerts = [
{threshold_percent = 75, notify = "ops-team@example.com"}
{threshold_percent = 90, notify = "finance@example.com"}
]
}
}
}
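Threshold-based budget alerting as configured in `budget_limits` amounts to a percentage check per alert rule. A minimal sketch (illustrative, not the billing engine):

```python
monthly_max_usd = 5000
alerts = [
    {"threshold_percent": 75, "notify": "ops-team@example.com"},
    {"threshold_percent": 90, "notify": "finance@example.com"},
]

def triggered_alerts(spend_usd, limit=monthly_max_usd, alerts=alerts):
    """Return every notification target whose threshold is met or exceeded."""
    used = 100 * spend_usd / limit
    return [a["notify"] for a in alerts if used >= a["threshold_percent"]]

print(triggered_alerts(4600))  # both thresholds crossed at 92% of budget
```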
Reserved Instance Strategy
Leverage reserved instances for predictable workloads:
# Configure reserved instances
cat > infra/reserved-instances.ncl <<'EOF'
{
reserved_instances = {
upcloud = {
commitment = 'yearly
instances = [
{plan = "medium", count = 5}
{plan = "large", count = 2}
]
}
aws = {
commitment = 'yearly
instances = [
{type = "t3.large", count = 3}
{type = "t3.xlarge", count = 1}
]
savings_plan = true
}
}
}
EOF
Monitoring Multi-Cloud
Centralized Monitoring
Deploy unified monitoring across providers:
# infra/monitoring.ncl
{
monitoring = {
prometheus = {
enabled = true
federation = true
instances = [
{provider = "upcloud", region = "de-fra1"}
{provider = "aws", region = "us-east-1"}
]
scrape_configs = [
{
job_name = "upcloud-nodes"
static_configs = [{targets = ["upcloud-*.internal:9100"]}]
}
{
job_name = "aws-nodes"
static_configs = [{targets = ["aws-*.internal:9100"]}]
}
]
remote_write = {
url = "https://central-prometheus.example.com/api/v1/write"
compression = true
}
}
grafana = {
enabled = true
dashboards = ["multi-cloud-overview", "per-provider", "cost-analysis"]
alerts = ["high-latency", "provider-down", "budget-exceeded"]
}
}
}
Disaster Recovery
Cross-Provider Failover
Configure automatic failover between providers:
# infra/disaster-recovery.ncl
{
disaster_recovery = {
primary_provider = "upcloud"
secondary_provider = "aws"
failover_triggers = [
{condition = 'provider_unavailable, action = 'switch_to_secondary}
{condition = 'health_check_failed, threshold = 3, action = 'switch_to_secondary}
{condition = 'latency_exceeded, threshold_ms = 1000, action = 'switch_to_secondary}
]
failover_process = {
dns_ttl_seconds = 60
health_check_interval_seconds = 10
automatic = true
notification_channels = ["email", "slack"]
}
backup_strategy = {
frequency = 'daily
retention_days = 30
cross_region = true
cross_provider = true
}
}
}
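Evaluating `failover_triggers` against live metrics is essentially a rule scan: any matching condition initiates the switch to the secondary provider. A sketch of that evaluation (illustrative model; the metric field names are assumptions):

```python
triggers = [
    {"condition": "provider_unavailable"},
    {"condition": "health_check_failed", "threshold": 3},
    {"condition": "latency_exceeded", "threshold_ms": 1000},
]

def should_failover(metrics, triggers=triggers):
    """True if any configured trigger condition is met."""
    for t in triggers:
        c = t["condition"]
        if c == "provider_unavailable" and not metrics["provider_up"]:
            return True
        if c == "health_check_failed" and metrics["failed_checks"] >= t["threshold"]:
            return True
        if c == "latency_exceeded" and metrics["latency_ms"] > t["threshold_ms"]:
            return True
    return False

print(should_failover({"provider_up": True, "failed_checks": 3,
                       "latency_ms": 120}))  # True (health check threshold hit)
```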
Best Practices
Configuration Management
- Use Nickel for all infrastructure definitions
- Version control all configuration files
- Use workspace per environment (dev/staging/prod)
- Implement configuration validation before deployment
- Maintain provider abstraction where possible
Security
- Encrypt cross-provider communication (VPN, TLS)
- Use separate credentials per provider
- Implement RBAC consistently across providers
- Enable audit logging on all providers
- Encrypt data at rest and in transit
Deployment
- Test in single-provider environment first
- Use batch workflows for complex multi-cloud deployments
- Enable checkpoints for long-running deployments
- Implement progressive rollout strategies
- Maintain rollback procedures
Monitoring
- Centralize logs and metrics
- Monitor cross-provider network latency
- Track costs per provider
- Alert on provider-specific failures
- Measure failover readiness
Cost Management
- Regular cost audits per provider
- Use reserved instances for predictable loads
- Implement budget alerts
- Optimize data transfer costs
- Consider spot instances for non-critical workloads
Troubleshooting
Provider Connectivity Issues
# Test provider connectivity
provisioning providers
# Test specific provider
provisioning provider test upcloud
provisioning provider test aws
# Debug network connectivity
provisioning network test --from upcloud-web-01 --to aws-api-01
Cross-Provider Communication Failures
# Check VPN mesh status
provisioning network vpn-status
# Test cross-provider routes
provisioning network trace-route --from upcloud-web-01 --to aws-api-01
# Verify firewall rules
provisioning network firewall-check --provider upcloud
provisioning network firewall-check --provider aws
Data Replication Lag
# Check replication status
provisioning database replication-status postgresql
# Force replication sync
provisioning database sync --source upcloud-db-01 --target aws-db-replica-01
# View replication lag metrics
provisioning database metrics --metric replication_lag
See Also
- Batch Workflows - Multi-cloud orchestration
- Providers - Provider configuration
- Disaster Recovery - DR strategies
- Workspace Management - Multi-environment setup
Custom Extensions
Create custom providers, task services, and clusters to extend the Provisioning platform for your specific infrastructure needs.
Overview
Extensions allow you to:
- Add support for new cloud providers
- Create custom task services for specialized software
- Define cluster templates for common deployment patterns
- Integrate with proprietary infrastructure
Extension Types
Providers
Cloud or infrastructure backend integrations.
Use Cases: Custom private cloud, bare metal provisioning, proprietary APIs
Task Services
Installable software components.
Use Cases: Internal applications, specialized databases, custom monitoring
Clusters
Coordinated service groups.
Use Cases: Standard deployment patterns, application stacks, reference architectures
Creating a Custom Provider
Directory Structure
provisioning/extensions/providers/my-provider/
├── provider.ncl # Provider schema
├── resources/
│ ├── server.nu # Server operations
│ ├── network.nu # Network operations
│ └── storage.nu # Storage operations
└── README.md
Provider Schema (provider.ncl)
{
name = "my-provider",
description = "Custom infrastructure provider",
config_schema = {
api_endpoint | String,
api_key | String,
region | String | default = "default",
timeout_seconds | Number | default = 300,
},
capabilities = {
servers = true,
networks = true,
storage = true,
load_balancers = false,
}
}
Server Operations (resources/server.nu)
# Create server
export def "server create" [
name: string
plan: string
--zone: string = "default"
] {
let config = $env.PROVIDER_CONFIG | from json
# Call provider API
http post $"($config.api_endpoint)/servers" {
name: $name,
plan: $plan,
zone: $zone
} | from json
}
# Delete server
export def "server delete" [name: string] {
let config = $env.PROVIDER_CONFIG | from json
http delete $"($config.api_endpoint)/servers/($name)"
}
# List servers
export def "server list" [] {
let config = $env.PROVIDER_CONFIG | from json
http get $"($config.api_endpoint)/servers" | from json
}
Creating a Custom Task Service
Directory Structure
provisioning/extensions/taskservs/my-service/
├── service.ncl # Service schema
├── install.nu # Installation script
├── configure.nu # Configuration script
├── health-check.nu # Health validation
└── README.md
Service Schema (service.ncl)
{
name = "my-service",
version = "1.0.0",
description = "Custom service deployment",
dependencies = ["kubernetes"],
config_schema = {
replicas | Number | default = 3,
port | Number | default = 8080,
storage_size_gb | Number | default = 10,
image | String,
}
}
Installation Script (install.nu)
export def "taskserv install" [config: record] {
  print $"Installing ($config.name)..."
  # Create namespace
  kubectl create namespace $config.name
  # Build the manifest as an interpolated string (Nushell has no heredocs)
  # and pipe it to kubectl. Deployments also require a selector matching
  # the pod template labels.
  let manifest = $"apiVersion: apps/v1
kind: Deployment
metadata:
  name: ($config.name)
  namespace: ($config.name)
spec:
  replicas: ($config.replicas)
  selector:
    matchLabels:
      app: ($config.name)
  template:
    metadata:
      labels:
        app: ($config.name)
    spec:
      containers:
        - name: app
          image: ($config.image)
          ports:
            - containerPort: ($config.port)"
  $manifest | kubectl apply -f -
  {status: "installed"}
}
Health Check (health-check.nu)
export def "taskserv health" [name: string] {
let pods = (kubectl get pods -n $name -o json | from json)
let ready = ($pods.items | all {|p| $p.status.phase == "Running"})
if $ready {
{status: "healthy", ready_pods: ($pods.items | length)}
} else {
{status: "unhealthy", reason: "pods not running"}
}
}
Creating a Custom Cluster
Directory Structure
provisioning/extensions/clusters/my-cluster/
├── cluster.ncl # Cluster definition
├── deploy.nu # Deployment script
└── README.md
Cluster Schema (cluster.ncl)
{
name = "my-cluster",
version = "1.0.0",
description = "Custom application stack",
components = {
servers = [
{name = "app", count = 3, plan = 'medium},
{name = "db", count = 1, plan = 'large},
],
services = ["nginx", "postgresql", "redis"],
},
config_schema = {
domain | String,
app_replicas | Number | default = 3,
db_storage_gb | Number | default = 100,
}
}
Testing Extensions
Local Testing
# Test provider operations
provisioning provider test my-provider --local
# Test task service installation
provisioning taskserv install my-service --dry-run
# Validate cluster definition
provisioning cluster validate my-cluster
Integration Testing
# Create test workspace
provisioning workspace create test-extensions
# Deploy extension
provisioning extension deploy my-provider
# Test deployment
provisioning server create test-server --provider my-provider
Extension Best Practices
- Define clear schemas - Use Nickel contracts for type safety
- Implement health checks - Validate service state
- Handle errors gracefully - Return structured error messages
- Document configuration - Provide clear examples
- Version extensions - Track compatibility
- Test thoroughly - Unit and integration tests
Publishing Extensions
Extension Registry
Share extensions with the community:
# Package extension
provisioning extension package my-provider
# Publish to registry
provisioning extension publish my-provider --registry community
Private Registry
Host internal extensions:
# Configure private registry
provisioning config set extension_registry https://registry.internal
# Publish privately
provisioning extension publish my-provider --private
Examples
Custom Database Provider
Provider for proprietary database platform:
{
name = "mydb-provider",
capabilities = {databases = true},
config_schema = {
cluster_endpoint | String,
admin_token | String,
}
}
Monitoring Stack Service
Complete monitoring deployment:
{
name = "monitoring-stack",
dependencies = ["prometheus", "grafana", "loki"],
config_schema = {
retention_days | Number | default = 30,
alert_email | String,
}
}
Troubleshooting
Extension Not Loading
# Verify extension structure
provisioning extension validate my-extension
# Check logs
provisioning logs extension-loader --tail 100
Deployment Failures
# Enable debug logging
export PROVISIONING_LOG_LEVEL=debug
provisioning taskserv install my-service
# Check service logs
provisioning taskserv logs my-service
References
- Extension Development - Technical details
- Provider Development - Provider implementation
- Task Services - Task service architecture
Disaster Recovery
Comprehensive disaster recovery procedures for the Provisioning platform and managed infrastructure.
Overview
Disaster recovery (DR) ensures business continuity through:
- Automated backups
- Point-in-time recovery
- Multi-region failover
- Data replication
- DR testing procedures
Recovery Objectives
RTO (Recovery Time Objective)
Target time to restore service:
- Critical Services: < 1 hour
- Production Infrastructure: < 4 hours
- Development Environment: < 24 hours
RPO (Recovery Point Objective)
Maximum acceptable data loss:
- Production Databases: < 5 minutes (continuous replication)
- Configuration: < 1 hour (hourly backups)
- Workspace State: < 15 minutes (incremental backups)
Backup Strategy
Automated Backups
Configure automatic backups:
{
backup = {
enabled = true,
schedule = "0 */6 * * *", # Every 6 hours
retention_days = 30,
targets = [
{type = 'workspace_state, enabled = true},
{type = 'infrastructure_config, enabled = true},
{type = 'platform_data, enabled = true},
],
storage = {
backend = 's3,
bucket = "provisioning-backups",
encryption = true,
}
}
}
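The `retention_days` policy reduces to a cutoff comparison: any backup older than the retention window is eligible for pruning. A minimal sketch (illustrative, not the platform's backup manager):

```python
from datetime import date, timedelta

def prune(backup_dates, today, retention_days=30):
    """Return the backup dates older than the retention cutoff."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in backup_dates if d < cutoff)

backups = [date(2024, 1, 1), date(2024, 1, 10), date(2024, 1, 14)]
# With a 30-day window measured from 2024-02-05, the cutoff is 2024-01-06,
# so only the Jan 1 backup is pruned.
print(prune(backups, today=date(2024, 2, 5)))
```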
Backup Types
Full Backups:
# Full platform backup
provisioning backup create --type full --name "pre-upgrade-$(date +%Y%m%d)"
# Full workspace backup
provisioning workspace backup production --full
Incremental Backups:
# Incremental backup (changed files only)
provisioning backup create --type incremental
# Automated incremental
provisioning config set backup.incremental_enabled true
Snapshot Backups:
# Infrastructure snapshot
provisioning infrastructure snapshot --name "stable-v2"
# Database snapshot
provisioning taskserv backup postgresql --snapshot
Data Replication
Cross-Region Replication
Replicate to secondary region:
{
replication = {
enabled = true,
mode = 'async,
primary = {region = "eu-west-1", provider = 'aws},
secondary = {region = "us-east-1", provider = 'aws},
replication_lag_max_seconds = 300,
}
}
Database Replication
# Configure database replication
provisioning taskserv configure postgresql --replication \
--primary db-eu-west-1 \
--standby db-us-east-1 \
--sync-mode async
Disaster Scenarios
Complete Region Failure
Procedure:
- Detect Failure:
# Check region health
provisioning health check --region eu-west-1
- Initiate Failover:
# Promote secondary region
provisioning disaster-recovery failover --to us-east-1 --confirm
# Verify services
provisioning health check --all
- Update DNS:
# Point traffic to secondary region
provisioning dns update --region us-east-1
- Monitor:
# Watch recovery progress
provisioning disaster-recovery status --follow
Data Corruption
Procedure:
- Identify Corruption:
# Validate data integrity
provisioning validate data --workspace production
- Find Clean Backup:
# List available backups
provisioning backup list --before "2024-01-15 10:00"
# Verify backup integrity
provisioning backup verify backup-20240115-0900
- Restore from Backup:
# Restore to point in time
provisioning restore --backup backup-20240115-0900 \
--workspace production --confirm
Platform Service Failure
Procedure:
- Identify Failed Service:
# Check platform health
provisioning platform health
# Service logs
provisioning platform logs orchestrator --tail 100
- Restart Service:
# Restart failed service
provisioning platform restart orchestrator
# Verify health
provisioning platform health orchestrator
- Restore from Backup (if needed):
# Restore service data
provisioning platform restore orchestrator \
--from-backup latest
Failover Procedures
Automated Failover
Configure automatic failover:
{
failover = {
enabled = true,
health_check_interval_seconds = 30,
failure_threshold = 3,
primary = {region = "eu-west-1"},
secondary = {region = "us-east-1"},
auto_failback = false, # Manual failback
}
}
Manual Failover
# Initiate manual failover
provisioning disaster-recovery failover \
--from eu-west-1 \
--to us-east-1 \
--verify-replication \
--confirm
# Verify failover
provisioning disaster-recovery verify
# Update routing
provisioning disaster-recovery update-routing
Recovery Procedures
Workspace Recovery
# List workspace backups
provisioning workspace backups production
# Restore workspace
provisioning workspace restore production \
--backup backup-20240115-1200 \
--target-region us-east-1
# Verify recovery
provisioning workspace validate production
Infrastructure Recovery
# Restore infrastructure from Nickel config
provisioning infrastructure restore \
--config workspace/infra/production.ncl \
--region us-east-1
# Restore from snapshot
provisioning infrastructure restore \
--snapshot infra-snapshot-20240115
# Verify deployment
provisioning infrastructure validate
Platform Recovery
# Reinstall platform services
provisioning platform install --region us-east-1
# Restore platform data
provisioning platform restore --from-backup latest
# Verify platform health
provisioning platform health --all
DR Testing
Test Schedule
- Monthly: Backup restore test
- Quarterly: Regional failover drill
- Annually: Full DR simulation
Backup Restore Test
# Create test workspace
provisioning workspace create dr-test-$(date +%Y%m%d)
# Restore latest backup
provisioning workspace restore dr-test --backup latest
# Validate restore
provisioning workspace validate dr-test
# Cleanup
provisioning workspace delete dr-test --yes
Failover Drill
# Simulate regional failure
provisioning disaster-recovery simulate-failure \
--region eu-west-1 \
--duration 30m
# Monitor automated failover
provisioning disaster-recovery status --follow
# Validate services in secondary region
provisioning health check --region us-east-1 --all
# Manual failback after drill
provisioning disaster-recovery failback --to eu-west-1
Monitoring and Alerts
Backup Monitoring
# Check backup status
provisioning backup status
# Verify backup integrity
provisioning backup verify --all --schedule daily
# Alert on backup failures
provisioning alert create backup-failure \
--condition "backup.status == 'failed'" \
--notify ops@example.com
Replication Monitoring
# Check replication lag
provisioning replication status
# Alert on lag exceeding threshold
provisioning alert create replication-lag \
--condition "replication.lag_seconds > 300" \
--notify ops@example.com
Best Practices
- Regular testing - Test DR procedures quarterly
- Automated backups - Never rely on manual backups
- Multiple regions - Geographic redundancy
- Monitor replication - Track replication lag
- Document procedures - Keep runbooks updated
- Encrypt backups - Protect backup data
- Verify restores - Test backup integrity
- Automate failover - Reduce recovery time
References
- Backup & Recovery - Backup operations
- Monitoring - Health monitoring
- Platform Health - Service health
Infrastructure as Code
Define and manage infrastructure using Nickel, the type-safe configuration language that serves as Provisioning’s source of truth.
Overview
Provisioning’s infrastructure definition system provides:
- Type-safe configuration via Nickel language with mandatory schema validation and contract enforcement
- Complete provider support for AWS, UpCloud, Hetzner, Kubernetes, on-premise, and custom platforms
- 50+ task services for specialized infrastructure operations (databases, monitoring, logging, networking)
- Pre-built clusters for common patterns (web, OCI registry, cache, distributed computing)
- Batch workflows with DAG scheduling, parallel execution, and multi-cloud orchestration
- Schema validation with inheritance, merging, and contracts ensuring correctness
- Configuration composition with includes, profiles, and environment-specific overrides
- Version management with semantic versioning and deprecation paths
All infrastructure is defined in Nickel (never TOML), ensuring compile-time correctness and runtime safety.
Infrastructure Configuration Guides
Core Configuration
- Nickel Guide - Syntax, types, contracts, lazy evaluation, record merging, patterns, best practices for IaC
- Configuration System - Hierarchical loading, environment variables, profiles, composition, inheritance, validation
- Schemas Reference - Contracts, types, validation rules, inheritance, composition, custom schema development
Resources and Operations
- Providers Guide - AWS, UpCloud, Hetzner, Kubernetes, on-premise, demo with capabilities, resources, examples
- Task Services Guide - 50+ services: databases, monitoring, logging, networking, CI/CD, storage
- Clusters Guide - Web cluster (3-tier), OCI registry, cache cluster, distributed computing, Kubernetes operators
- Batch Workflows - DAG-based scheduling, parallel execution, conditional logic, error handling, multi-cloud, state management
Advanced Topics
- Multi-Tenancy Patterns - Workspace isolation, data separation, billing, resource limits, SLAs
- Version Management - Semantic versioning, dependency resolution, compatibility, deprecation, upgrade workflows
- Performance Optimization - Configuration caching, lazy evaluation, parallel validation, incremental updates
Nickel as Source of Truth
Critical principle: Nickel is the source of truth for ALL infrastructure definitions.
- ✅ Nickel: Type-safe, validated, enforced, source of truth
- ❌ TOML: Generated output only, never hand-edited
- ❌ JSON/YAML: Generated output only, never source definitions
- ❌ KCL: Deprecated, completely replaced by Nickel
This ensures:
- Compile-time validation - Errors caught before deployment
- Schema enforcement - All configurations conform to contracts
- Type safety - No runtime configuration errors
- IDE support - Type hints and autocompletion via schema
- Evolution - Breaking changes detected and reported
Configuration Hierarchy
Configurations load in order of precedence:
1. Command-line arguments (highest priority)
2. Environment variables (PROVISIONING_*)
3. User configuration (~/.config/provisioning/user.nickel)
4. Workspace configuration (workspace/config/main.nickel)
5. Infrastructure schemas (provisioning/schemas/)
6. System defaults (provisioning/config/defaults.toml) (lowest priority)
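Precedence resolution over such a hierarchy means the first (highest-priority) layer that defines a key wins. A minimal sketch (illustrative model; the layer contents are made up):

```python
layers = [
    {"region": "us-west-2"},                  # 1. command-line arguments
    {"region": "us-east-1", "log": "debug"},  # 2. environment variables
    {},                                       # 3. user configuration
    {"provider": "aws"},                      # 4. workspace configuration
]

def lookup(key, layers=layers):
    """Return the value from the highest-priority layer defining the key."""
    for layer in layers:
        if key in layer:
            return layer[key]
    return None

print(lookup("region"))  # us-west-2 (CLI beats environment)
```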
Quick Start Paths
I’m new to Nickel
Start with Nickel Guide - language syntax, type system, functions, patterns with infrastructure examples.
I need to define infrastructure
Read Configuration System - how configurations load, compose, and validate.
I want to use AWS/UpCloud/Hetzner
See Providers Guide - capabilities, resources, configuration examples for each cloud.
I need databases, monitoring, logging
Check Task Services Guide - 50+ services with configuration examples.
I want to deploy web applications
Review Clusters Guide - pre-built 3-tier web cluster, load balancer, database, caching.
I need multi-cloud workflows
Learn Batch Workflows - DAG scheduling across multiple providers.
I need multi-tenant setup
Study Multi-Tenancy Patterns - isolation, billing, resource management.
Example Nickel Configuration
{
extensions = {
providers = [
{
name = "aws",
version = "1.2.3",
enabled = true,
config = {
region = "us-east-1",
credentials_source = "aws_iam"
}
}
]
},
infrastructure = {
networks = [
{
name = "main",
provider = "aws",
cidr = "10.0.0.0/16",
subnets = [
{ cidr = "10.0.1.0/24", availability_zone = "us-east-1a" },
{ cidr = "10.0.2.0/24", availability_zone = "us-east-1b" }
]
}
],
instances = [
{
name = "web-server-1",
provider = "aws",
instance_type = "t3.large",
image = "ubuntu-22.04",
network = "main",
subnet = "10.0.1.0/24"
}
]
}
}
Schema Contracts
All infrastructure must conform to schemas. Schemas define:
- Required fields - Must be provided
- Type constraints - Values must match type
- Field contracts - Custom validation logic
- Defaults - Applied automatically
- Documentation - Inline help and examples
Validation and Testing
Before deploying:
- Schema validation - provisioning validate config
- Syntax checking - provisioning validate syntax
- Policy checks - Custom policy validation
- Unit tests - Test configuration logic
- Integration tests - Dry-run with actual providers
Related Documentation
- Provisioning Schemas → See provisioning/schemas/ in codebase
- Configuration Examples → See provisioning/docs/src/examples/
- Provider Examples → See provisioning/docs/src/examples/aws-deployment-examples.md
- Task Services → See provisioning/extensions/ in codebase
- API Reference → See provisioning/docs/src/api-reference/
Nickel Guide
Comprehensive guide to using Nickel as the infrastructure-as-code language for the Provisioning platform.
Critical Principle: Nickel is Source of Truth
TYPE-SAFETY ALWAYS REQUIRED: ALL configurations MUST be type-safe and validated via Nickel. TOML is NOT acceptable as source of truth. Validation is NOT optional, NOT “progressive”, NOT “production-only”. This applies to ALL profiles (developer, production, cicd).
Nickel is the PRIMARY IaC language. TOML files are GENERATED OUTPUT ONLY, never the source.
Why Nickel
Nickel provides:
- Type Safety: Static type checking catches errors before deployment
- Lazy Evaluation: Efficient configuration composition and merging
- Contract System: Schema validation with gradual typing
- Record Merging: Powerful composition without duplication
- LSP Support: IDE integration for autocomplete and validation
- Human-Readable: Clear syntax for infrastructure definition
Installation
# macOS (Homebrew)
brew install nickel
# Linux (Cargo)
cargo install nickel-lang-cli
# Verify installation
nickel --version # 1.15.1+
Core Concepts
Records and Fields
Records are the fundamental data structure in Nickel:
{
name = "my-server"
plan = "medium"
zone = "de-fra1"
}
Type Annotations
Add type safety with contracts:
{
name : String = "my-server"
plan : String = "medium"
cpu_count : Number = 4
enabled : Bool = true
}
Record Merging
Compose configurations by merging records:
let base_config = {
provider = "upcloud"
region = "de-fra1"
} in
let server_config = base_config & {
name = "web-01"
plan = "medium"
} in
server_config
Result:
{
provider = "upcloud"
region = "de-fra1"
name = "web-01"
plan = "medium"
}
Contracts (Schema Validation)
Define contracts to validate structure:
let ServerContract = {
name | String
plan | String | default = "small"
zone | String | default = "de-fra1"
cpu | Number | optional
} in
{
name = "my-server"
plan = "large"
} | ServerContract
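What the contract enforces (required fields, defaults, optional fields) can be modeled outside Nickel. A Python sketch of the same checking, for intuition only:

```python
CONTRACT = {
    "name": {"required": True},
    "plan": {"default": "small"},
    "zone": {"default": "de-fra1"},
    "cpu": {"optional": True},
}

def apply_contract(value, contract=CONTRACT):
    """Fill defaults, pass through provided fields, reject missing required."""
    out = {}
    for field, spec in contract.items():
        if field in value:
            out[field] = value[field]
        elif "default" in spec:
            out[field] = spec["default"]
        elif spec.get("required"):
            raise ValueError(f"missing required field: {field}")
        # optional fields may simply be absent
    return out

print(apply_contract({"name": "my-server", "plan": "large"}))
# {'name': 'my-server', 'plan': 'large', 'zone': 'de-fra1'}
```

Nickel performs the equivalent checks lazily at evaluation time, with far richer error reporting.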
Three-File Pattern (Provisioning Standard)
The platform uses a standardized three-file pattern for all schemas:
1. contracts.ncl - Type Definitions
Defines the schema contracts:
# contracts.ncl
{
Server = {
name | String
plan | String | default = "small"
zone | String | default = "de-fra1"
disk_size_gb | Number | default = 25
backup_enabled | Bool | default = false
role | [| 'control, 'worker, 'standalone |] | optional
}
Infrastructure = {
servers | Array Server
provider | String
environment | [| 'development, 'staging, 'production |]
}
}
2. defaults.ncl - Default Values
Provides sensible defaults:
# defaults.ncl
{
# Fields carry default priority so makers can override them via merging
server = {
name | default = "unnamed-server",
plan | default = "small",
zone | default = "de-fra1",
disk_size_gb | default = 25,
backup_enabled | default = false,
},
infrastructure = {
servers | default = [],
provider | default = "local",
environment | default = 'development,
},
}
3. main.ncl - Entry Point
Combines contracts and defaults, provides makers:
# main.ncl
let contracts_lib = import "./contracts.ncl" in
let defaults_lib = import "./defaults.ncl" in
{
# Direct access to defaults (for inspection)
defaults = defaults_lib,
# Convenience makers (90% of use cases)
make_server | not_exported = fun overrides =>
defaults_lib.server & overrides,
make_infrastructure | not_exported = fun overrides =>
defaults_lib.infrastructure & overrides,
# Default instances (bare defaults)
DefaultServer = defaults_lib.server,
DefaultInfrastructure = defaults_lib.infrastructure,
}
Usage Example
# user-infra.ncl
let infra_lib = import "provisioning/schemas/infrastructure/main.ncl" in
infra_lib.make_infrastructure {
provider = "upcloud",
environment = 'production,
servers = [
infra_lib.make_server {
name = "web-01",
plan = "medium",
backup_enabled = true,
},
infra_lib.make_server {
name = "web-02",
plan = "medium",
backup_enabled = true,
},
],
}
Hybrid Interface Pattern
Records can be used both as functions (makers) and as plain data:
let config_lib = import "./config.ncl" in
# Use as function (with overrides)
let custom_config = config_lib.make_server { name = "custom" } in
# Use as plain data (defaults)
let default_config = config_lib.DefaultServer in
{
custom = custom_config,
"default" = default_config,
}
Record Merging Strategies
Priority Merging (Default)
# `b` has default priority, so the merge can override it
let base = { a = 1, b | default = 2 } in
let override = { b = 3, c = 4 } in
base & override
# Result: { a = 1, b = 3, c = 4 }
Recursive Merging
let base = {
server = { cpu = 2, ram | default = 4 }
} in
let override = {
server = { ram = 8, disk = 100 }
} in
std.record.merge_all [base, override]
# Result: { server = { cpu = 2, ram = 8, disk = 100 } }
Lazy Evaluation
Nickel evaluates expressions lazily, only when needed:
let expensive_computation = std.string.join " " ["a", "b", "c"] in
let environment = 'development in
{
# Only evaluated when accessed
computed_field = expensive_computation,
# Conditional evaluation
conditional = if environment == 'production then
expensive_computation
else
"dev-value",
}
Schema Organization
The platform organizes Nickel schemas by domain:
provisioning/schemas/
├── main.ncl # Top-level entry point
├── config/ # Configuration schemas
│ ├── settings/
│ │ ├── main.ncl
│ │ ├── contracts.ncl
│ │ └── defaults.ncl
│ └── defaults/
│ ├── main.ncl
│ ├── contracts.ncl
│ └── defaults.ncl
├── infrastructure/ # Infrastructure definitions
│ ├── servers/
│ ├── networks/
│ └── storage/
├── deployment/ # Deployment schemas
├── services/ # Service configurations
├── operations/ # Operational schemas
└── generator/ # Runtime schema generation
Type System
Primitive Types
{
string_field : String = "text",
number_field : Number = 42,
bool_field : Bool = true,
}
Array Types
{
names : Array String = ["alice", "bob", "charlie"],
ports : Array Number = [80, 443, 8080],
}
Enum Types
{
environment : [| 'development, 'staging, 'production |] = 'production,
role : [| 'control, 'worker, 'standalone |] = 'worker,
}
Optional Fields
{
required_field : String = "value",
optional_field | String | optional,
}
Default Values
{
with_default | String | default = "default-value",
}
Validation Patterns
Runtime Validation
let validate_plan = fun plan =>
if plan == "small" || plan == "medium" || plan == "large" then
plan
else
std.fail_with "Invalid plan: must be small, medium, or large"
in
{
plan = validate_plan "medium"
}
Contract-Based Validation
let PlanContract = [| 'small, 'medium, 'large |] in
{
plan | PlanContract = 'medium
}
Real-World Examples
Simple Server Configuration
{
metadata = {
name = "demo-server",
provider = "upcloud",
environment = 'development,
},
infrastructure = {
servers = [
{
name = "web-01",
plan = "medium",
zone = "de-fra1",
disk_size_gb = 50,
backup_enabled = true,
role = 'standalone,
}
],
},
services = {
taskservs = ["containerd", "docker"]
},
}
Kubernetes Cluster Configuration
{
metadata = {
name = "k8s-prod",
provider = "upcloud",
environment = 'production,
},
infrastructure = {
servers = [
{
name = "k8s-control-01",
plan = "medium",
role = 'control,
zone = "de-fra1",
disk_size_gb = 50,
backup_enabled = true,
},
{
name = "k8s-worker-01",
plan = "large",
role = 'worker,
zone = "de-fra1",
disk_size_gb = 100,
backup_enabled = true,
},
{
name = "k8s-worker-02",
plan = "large",
role = 'worker,
zone = "de-fra1",
disk_size_gb = 100,
backup_enabled = true,
},
],
},
services = {
taskservs = ["containerd", "etcd", "kubernetes", "cilium", "rook-ceph"]
},
kubernetes = {
version = "1.28.0",
pod_cidr = "10.244.0.0/16",
service_cidr = "10.96.0.0/12",
container_runtime = "containerd",
cri_socket = "/run/containerd/containerd.sock",
},
}
Multi-Provider Batch Workflow
{
batch_workflow = {
operations = [
{
id = "aws-cluster",
provider = "aws",
region = "us-east-1",
servers = [
{ name = "aws-web-01", plan = "t3.medium" }
],
},
{
id = "upcloud-cluster",
provider = "upcloud",
region = "de-fra1",
servers = [
{ name = "upcloud-web-01", plan = "medium" }
],
dependencies = ["aws-cluster"],
}
],
parallel_limit = 2,
},
}
Validation Workflow
Type-Check Schema
# Check syntax and types
nickel typecheck infra/my-cluster.ncl
# Export to JSON (validates during export)
nickel export infra/my-cluster.ncl
# Export to TOML (generated output only)
nickel export --format toml infra/my-cluster.ncl > config.toml
Platform Validation
# Validate against platform contracts
provisioning validate config --infra my-cluster
# Verbose validation
provisioning validate config --verbose
IDE Integration
Language Server (nickel-lang-lsp)
Install LSP for IDE support:
# Install LSP server
cargo install nickel-lang-lsp
# Configure your editor (VS Code example)
# Install "Nickel" extension from marketplace
Features:
- Syntax highlighting
- Type checking on save
- Autocomplete
- Hover documentation
- Go to definition
VS Code Configuration
{
"nickel.lsp.command": "nickel-lang-lsp",
"nickel.lsp.args": ["--stdio"],
"nickel.format.onSave": true
}
Common Patterns
Environment-Specific Configuration
let env_configs = {
development = {
plan = "small",
backup_enabled = false,
},
production = {
plan = "large",
backup_enabled = true,
},
} in
let environment = 'production in
{
servers = [
env_configs."%{std.string.from_enum environment}" & {
name = "server-01"
}
]
}
Configuration Composition
let base_server = {
zone = "de-fra1",
backup_enabled | default = false,
} in
let prod_overrides = {
backup_enabled = true,
disk_size_gb = 100,
} in
{
servers = [
base_server & { name = "dev-01" },
base_server & prod_overrides & { name = "prod-01" },
]
}
Migration from TOML
TOML is ONLY for generated output. Source is always Nickel.
# Generate TOML from Nickel (if needed for external tools)
nickel export --format toml infra/cluster.ncl > cluster.toml
# NEVER edit cluster.toml directly - edit cluster.ncl instead
Best Practices
- Use Three-File Pattern: Separate contracts, defaults, and main entry
- Type Everything: Add type annotations for all fields
- Validate Early: Run nickel typecheck before deployment
- Use Makers: Leverage maker functions for composition
- Document Contracts: Add comments explaining schema requirements
- Avoid Duplication: Use record merging and defaults
- Test Locally: Export and verify before deploying
- Version Schemas: Track schema changes in version control
Debugging
Type Errors
# Detailed type error messages
nickel typecheck --color always infra/cluster.ncl
Schema Inspection
# Export to JSON for inspection
nickel export infra/cluster.ncl | jq '.'
# Check specific field
nickel export infra/cluster.ncl | jq '.metadata'
Format Code
# Auto-format Nickel files
nickel fmt infra/cluster.ncl
# Check formatting without modifying
nickel fmt --check infra/cluster.ncl
Next Steps
- Schemas Reference - Platform schema organization
- Configuration System - Hierarchical configuration
- Providers - Cloud provider schemas
- Batch Workflows - Multi-cloud orchestration with Nickel
Configuration System
The Provisioning platform uses a hierarchical configuration system with Nickel as the source of truth for infrastructure definitions and TOML/YAML for application settings.
Configuration Hierarchy
Configuration is loaded in order of precedence (highest to lowest):
1. Runtime Arguments - CLI flags (--config, --workspace, etc.)
2. Environment Variables - PROVISIONING_* environment variables
3. User Configuration - ~/.config/provisioning/user_config.yaml
4. Infrastructure Config - Nickel schemas in workspace/provisioning
5. System Defaults - provisioning/config/config.defaults.toml
Higher-precedence sources override lower-precedence ones, allowing flexible configuration management across environments.
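The layered override behavior described above can be sketched as a deep merge applied from lowest to highest precedence. This is an illustrative model only, not the platform's actual loader; the layer names and structure are assumptions:

```python
def deep_merge(base: dict, override: dict) -> dict:
    """Merge two nested dicts; values in `override` win on conflict."""
    result = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = value
    return result

def effective_config(layers: list[dict]) -> dict:
    """`layers` is ordered lowest precedence first, highest last."""
    config: dict = {}
    for layer in layers:
        config = deep_merge(config, layer)
    return config

# Hypothetical layers mirroring the hierarchy above
system_defaults = {"general": {"log_level": "info"}, "providers": {"default_provider": "local"}}
user_config = {"providers": {"default_provider": "upcloud"}}
cli_flags = {"general": {"log_level": "debug"}}

cfg = effective_config([system_defaults, user_config, cli_flags])
print(cfg)  # CLI flag wins for log_level; user config wins for default_provider
```

Untouched keys pass through from lower layers, which is why a single CLI flag can override one value without redefining the rest of the configuration.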
Configuration Files
System Defaults
Located at provisioning/config/config.defaults.toml:
[general]
log_level = "info"
workspace_root = "./workspaces"
[providers]
default_provider = "local"
[orchestrator]
max_parallel_tasks = 4
checkpoint_enabled = true
User Configuration
Located at ~/.config/provisioning/user_config.yaml:
general:
preferred_editor: nvim
default_workspace: production
providers:
upcloud:
default_zone: fi-hel1
aws:
default_region: eu-west-1
Workspace Configuration
Nickel-based infrastructure configuration in workspace directories:
workspace/
├── config/
│ ├── main.ncl # Workspace configuration
│ ├── providers.ncl # Provider definitions
│ └── variables.ncl # Workspace variables
├── infra/
│ └── servers.ncl # Infrastructure definitions
└── .workspace/
└── metadata.toml # Workspace metadata
Environment Variables
All configuration can be overridden via environment variables:
export PROVISIONING_LOG_LEVEL=debug
export PROVISIONING_WORKSPACE=production
export PROVISIONING_PROVIDER=upcloud
export PROVISIONING_DRY_RUN=true
Variable naming: PROVISIONING_<SECTION>_<KEY> (uppercase with underscores).
Configuration Accessors
The platform provides 476+ configuration accessors for programmatic access:
# Get configuration value
provisioning config get general.log_level
# Set configuration value (workspace-scoped)
provisioning config set providers.default_provider upcloud
# List all configuration
provisioning config list
# Validate configuration
provisioning config validate
Profiles
Configuration supports profiles for different environments:
[profiles.development]
log_level = "debug"
dry_run = true
[profiles.production]
log_level = "warn"
dry_run = false
checkpoint_enabled = true
Activate profile:
provisioning --profile production deploy
Inheritance and Overrides
Workspace configurations inherit from system defaults:
# workspace/config/main.ncl
let parent = import "../../provisioning/schemas/defaults.ncl" in
parent & {
# Override specific values
general.log_level = "debug",
providers.default_provider = "aws",
}
Secrets Management
Sensitive configuration is encrypted using SOPS/Age:
# Encrypt configuration
sops --encrypt --age <public-key> secrets.yaml > secrets.enc.yaml
# Decrypt and use
provisioning deploy --secrets secrets.enc.yaml
Integration with SecretumVault for enterprise secrets management (see Secrets Management).
Configuration Validation
All Nickel-based configuration is validated before use:
# Validate workspace configuration
provisioning config validate
# Check schema compliance
nickel export --format json workspace/config/main.ncl
Type-safety is mandatory - invalid configuration is rejected at load time.
Best Practices
- Use Nickel for infrastructure - Type-safe, validated infrastructure definitions
- Use TOML for application settings - Simple key-value configuration
- Encrypt secrets - Never commit unencrypted credentials
- Document overrides - Comment why values differ from defaults
- Validate before deploy - Always run config validate before deployment
- Version control - Track configuration changes in Git
- Profile separation - Isolate development/staging/production configs
Troubleshooting
Configuration Not Loading
Check precedence order:
# Show effective configuration
provisioning config show --debug
# Trace configuration loading
PROVISIONING_LOG_LEVEL=trace provisioning config list
Schema Validation Failures
# Check Nickel syntax
nickel typecheck workspace/config/main.ncl
# Export and inspect
nickel export workspace/config/main.ncl
Environment Variable Issues
# List all PROVISIONING_* variables
env | grep PROVISIONING_
# Clear all provisioning env vars
unset $(env | grep PROVISIONING_ | cut -d= -f1 | xargs)
References
- Nickel Guide - Infrastructure configuration
- Schemas Reference - Schema structure
- Secrets Management - SecretumVault integration
Schemas Reference
Provisioning uses Nickel schemas for type-safe infrastructure definitions. This reference documents the schema organization, structure, and usage patterns.
Schema Organization
Schemas are organized in provisioning/schemas/:
provisioning/schemas/
├── main.ncl # Root schema entry point
├── lib/
│ ├── contracts.ncl # Type contracts and validators
│ ├── functions.ncl # Helper functions
│ └── types.ncl # Common type definitions
├── config/
│ ├── providers.ncl # Provider configuration schemas
│ ├── settings.ncl # Platform settings schemas
│ └── workspace.ncl # Workspace configuration schemas
├── infrastructure/
│ ├── servers.ncl # Server resource schemas
│ ├── networks.ncl # Network resource schemas
│ └── storage.ncl # Storage resource schemas
├── operations/
│ ├── deployment.ncl # Deployment workflow schemas
│ └── lifecycle.ncl # Resource lifecycle schemas
├── services/
│ ├── kubernetes.ncl # Kubernetes schemas
│ └── databases.ncl # Database schemas
└── integrations/
├── cloud_providers.ncl # Cloud provider integrations
└── external_services.ncl # External service integrations
Core Contracts
Server Contract
let Server = {
name
| doc "Server identifier (must be unique)"
| String,
plan
| doc "Server size (small, medium, large, xlarge)"
| [| 'small, 'medium, 'large, 'xlarge |],
provider
| doc "Cloud provider (upcloud, aws, local)"
| [| 'upcloud, 'aws, 'local |],
zone
| doc "Availability zone"
| String
| optional,
ip_address
| doc "Public IP address"
| String
| optional,
storage
| doc "Storage configuration"
| Array StorageConfig
| default = [],
metadata
| doc "Custom metadata tags"
| {_ : String}
| default = {},
}
Network Contract
let Network = {
name
| doc "Network identifier"
| String,
cidr
| doc "CIDR block (e.g., 10.0.0.0/16)"
| String
| std.contract.from_predicate (std.string.is_match "^([0-9]{1,3}\\.){3}[0-9]{1,3}/[0-9]{1,2}$"),
subnets
| doc "Subnet definitions"
| Array Subnet,
routing
| doc "Routing configuration"
| RoutingConfig
| optional,
}
Storage Contract
let StorageConfig = {
size_gb
| doc "Storage size in GB"
| Number
| std.contract.from_predicate (fun n => n > 0),
type
| doc "Storage type"
| [| 'ssd, 'hdd, 'nvme |],
mount_point
| doc "Mount path"
| String
| optional,
encrypted
| doc "Enable encryption"
| Bool
| default = false,
}
Workspace Schema
Workspace configuration schema:
let WorkspaceConfig = {
name
| doc "Workspace identifier"
| String,
environment
| doc "Environment type"
| [| 'development, 'staging, 'production |],
providers
| doc "Enabled providers"
| Array [| 'upcloud, 'aws, 'local |]
| default = ['local],
infrastructure
| doc "Infrastructure definitions"
| {
servers | Array Server | default = [],
networks | Array Network | default = [],
storage | Array StorageConfig | default = [],
},
settings
| doc "Workspace-specific settings"
| {_ : _}
| default = {},
}
Provider Schemas
UpCloud Provider
let UpCloudConfig = {
username
| doc "UpCloud username"
| String,
password
| doc "UpCloud password (encrypted)"
| String,
default_zone
| doc "Default zone"
| [| 'fi-hel1, 'fi-hel2, 'de-fra1, 'uk-lon1, 'us-chi1, 'us-sjo1 |]
| default = 'fi-hel1,
timeout_seconds
| doc "API timeout"
| Number
| default = 300,
}
AWS Provider
let AWSConfig = {
access_key_id
| doc "AWS access key"
| String,
secret_access_key
| doc "AWS secret key (encrypted)"
| String,
default_region
| doc "Default AWS region"
| String
| default = "eu-west-1",
assume_role_arn
| doc "IAM role ARN"
| String
| optional,
}
Service Schemas
Kubernetes Schema
let KubernetesCluster = {
name
| doc "Cluster name"
| String,
version
| doc "Kubernetes version"
| String
| std.contract.from_predicate (std.string.is_match "^v[0-9]+\\.[0-9]+\\.[0-9]+$"),
control_plane
| doc "Control plane configuration"
| {
nodes | Number | std.contract.from_predicate (fun n => n > 0),
plan | [| 'small, 'medium, 'large |],
},
workers
| doc "Worker node pools"
| Array NodePool,
networking
| doc "Network configuration"
| {
pod_cidr | String,
service_cidr | String,
cni | [| 'calico, 'cilium, 'flannel |] | default = 'cilium,
},
addons
| doc "Cluster addons"
| Array [| 'metrics-server, 'ingress-nginx, 'cert-manager |]
| default = [],
}
Validation Functions
Custom validation functions in lib/contracts.ncl:
{
is_valid_hostname = fun name =>
std.string.is_match "^[a-z0-9]([-a-z0-9]*[a-z0-9])?$" name,
is_valid_port = fun port =>
std.number.is_integer port && port >= 1 && port <= 65535,
is_valid_email = fun email =>
std.string.is_match "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$" email,
}
Merging and Composition
Schemas support composition through record merging:
let base_server = {
plan | default = 'medium,
provider = 'upcloud,
storage | default = [],
} in
let production_server = base_server & {
plan = 'large,
storage = [{size_gb = 100, type = 'ssd}],
} in
production_server
Contract Enforcement
Type checking is enforced at load time:
# Typecheck schema
nickel typecheck provisioning/schemas/main.ncl
# Export with validation
nickel export --format json workspace/infra/servers.ncl
Invalid configurations are rejected before deployment.
Best Practices
- Define contracts first - Start with type contracts before implementation
- Use enums for choices - Leverage [| 'option1, 'option2 |] for fixed sets
- Document everything - Use | doc "description" annotations
- Validate early - Run nickel typecheck before deployment
- Compose, don’t duplicate - Use record merging for common patterns
- Version schemas - Track schema changes alongside infrastructure
- Test contracts - Validate edge cases and constraints
References
- Nickel Guide - Nickel language reference
- Configuration System - Configuration hierarchy
- Providers - Provider-specific schemas
Providers
Providers are abstraction layers for interacting with cloud platforms and local infrastructure. Provisioning supports multiple providers through a unified interface.
Available Providers
UpCloud Provider
Production-ready cloud provider for European infrastructure.
Configuration:
{
providers.upcloud = {
username = "your-username",
password = std.secret "UPCLOUD_PASSWORD",
default_zone = 'fi-hel1,
timeout_seconds = 300,
}
}
Supported zones:
- fi-hel1, fi-hel2 - Helsinki, Finland
- de-fra1 - Frankfurt, Germany
- uk-lon1 - London, UK
- us-chi1 - Chicago, USA
- us-sjo1 - San Jose, USA
Resources: Servers, networks, storage, firewalls, load balancers
AWS Provider
Amazon Web Services integration for global cloud infrastructure.
Configuration:
{
providers.aws = {
access_key_id = std.secret "AWS_ACCESS_KEY_ID",
secret_access_key = std.secret "AWS_SECRET_ACCESS_KEY",
default_region = "eu-west-1",
}
}
Resources: EC2, VPCs, EBS, security groups, RDS, S3
Local Provider
Local infrastructure for development and testing.
Configuration:
{
providers.local = {
backend = 'libvirt, # or 'docker, 'podman
storage_pool = "/var/lib/libvirt/images",
}
}
Backends: libvirt (KVM/QEMU), docker, podman
Multi-Cloud Deployments
Deploy infrastructure across multiple providers:
{
servers = [
{name = "web-frontend", provider = 'upcloud, zone = "fi-hel1", plan = 'medium},
{name = "api-backend", provider = 'aws, zone = "eu-west-1a", plan = '"t3.large"},
]
}
Provider Abstraction
Abstract resource definitions work across providers:
# Arguments are named to avoid shadowing by the record's own fields
let server_config = fun server_name server_provider => {
name = server_name,
provider = server_provider,
plan = 'medium, # Automatically translated per provider
storage = [{size_gb = 50, type = 'ssd}],
}
Plan translation:
| Abstract | UpCloud | AWS | Local |
|---|---|---|---|
| small | 1xCPU-1GB | t3.micro | 1 vCPU |
| medium | 2xCPU-4GB | t3.medium | 2 vCPU |
| large | 4xCPU-8GB | t3.large | 4 vCPU |
| xlarge | 8xCPU-16GB | t3.xlarge | 8 vCPU |
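The translation table above can be modeled as a simple lookup, resolved per provider at provision time. This is a sketch of the concept, not the platform's implementation; the function name is illustrative:

```python
# Mapping taken directly from the plan-translation table above
PLAN_TRANSLATION = {
    "small":  {"upcloud": "1xCPU-1GB",  "aws": "t3.micro",  "local": "1 vCPU"},
    "medium": {"upcloud": "2xCPU-4GB",  "aws": "t3.medium", "local": "2 vCPU"},
    "large":  {"upcloud": "4xCPU-8GB",  "aws": "t3.large",  "local": "4 vCPU"},
    "xlarge": {"upcloud": "8xCPU-16GB", "aws": "t3.xlarge", "local": "8 vCPU"},
}

def translate_plan(plan: str, provider: str) -> str:
    """Resolve an abstract plan to a provider-specific instance type."""
    try:
        return PLAN_TRANSLATION[plan][provider]
    except KeyError:
        raise ValueError(f"no translation for plan {plan!r} on provider {provider!r}")

print(translate_plan("medium", "aws"))  # t3.medium
```

Keeping infrastructure definitions in abstract plans means a server block can move between providers without editing instance types.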
Best Practices
- Use abstract plans - Avoid provider-specific instance types
- Encrypt credentials - Always use encrypted secrets for API keys
- Test locally first - Validate configurations with local provider
- Document provider choices - Comment why specific providers are used
- Monitor costs - Track cloud provider spending
References
- Configuration System - Provider configuration
- Schemas Reference - Provider schemas
- Batch Workflows - Multi-cloud orchestration
Task Services
Task services are installable infrastructure components that provide specific functionality. Provisioning includes 30+ task services for databases, orchestration, monitoring, and more.
Categories
Kubernetes & Container Orchestration
kubernetes - Complete Kubernetes cluster deployment
- Control plane setup
- Worker node pools
- CNI configuration (Calico, Cilium, Flannel)
- Addon management (metrics-server, ingress-nginx, cert-manager)
containerd - Container runtime configuration
- Systemd integration
- Storage driver configuration
- Runtime class support
docker - Docker engine installation
- Docker Compose integration
- Registry configuration
Databases
postgresql - PostgreSQL database server
- Replication setup
- Backup automation
- Performance tuning
mysql - MySQL/MariaDB deployment
- Cluster configuration
- Backup strategies
mongodb - MongoDB database
- Replica sets
- Sharding configuration
redis - Redis in-memory store
- Persistence configuration
- Cluster mode
Storage
rook-ceph - Cloud-native storage orchestrator
- Block storage (RBD)
- Object storage (S3-compatible)
- Shared filesystem (CephFS)
minio - S3-compatible object storage
- Distributed mode
- Versioning and lifecycle policies
Monitoring & Observability
prometheus - Metrics collection and alerting
- Service discovery
- Alerting rules
- Long-term storage
grafana - Metrics visualization
- Dashboard provisioning
- Data source configuration
loki - Log aggregation system
- Log collection
- Query language
Networking
cilium - eBPF-based networking and security
- Network policies
- Load balancing
- Service mesh capabilities
calico - Network policy engine
- BGP networking
- IP-in-IP tunneling
nginx - Web server and reverse proxy
- Load balancing
- TLS termination
Security
vault - Secrets management (HashiCorp Vault)
- Secret storage
- Dynamic secrets
- Encryption as a service
cert-manager - TLS certificate automation
- Let’s Encrypt integration
- Certificate renewal
Task Service Definition
Task services are defined in provisioning/extensions/taskservs/:
taskservs/
└── kubernetes/
├── service.ncl # Service schema
├── install.nu # Installation script
├── configure.nu # Configuration script
├── health-check.nu # Health validation
└── README.md
Using Task Services
Installation
{
task_services = [
{
name = "kubernetes",
version = "v1.28.0",
config = {
control_plane = {nodes = 3, plan = 'medium},
workers = [{name = "pool-1", nodes = 3, plan = 'large}],
networking = {cni = 'cilium},
}
},
{
name = "prometheus",
version = "latest",
config = {retention = "30d", storage_size_gb = 100}
}
]
}
CLI Commands
# List available task services
provisioning taskserv list
# Show task service details
provisioning taskserv show kubernetes
# Install task service
provisioning taskserv install kubernetes
# Check task service health
provisioning taskserv health kubernetes
# Uninstall task service
provisioning taskserv uninstall kubernetes
Custom Task Services
Create custom task services:
provisioning/extensions/taskservs/my-service/
├── service.ncl # Service definition
├── install.nu # Installation logic
├── configure.nu # Configuration logic
├── health-check.nu # Health checks
└── README.md
service.ncl schema:
{
name = "my-service",
version = "1.0.0",
description = "Custom service description",
dependencies = ["kubernetes"], # Optional dependencies
config_schema = {
port | Number | default = 8080,
replicas | Number | default = 3,
}
}
install.nu implementation:
export def "taskserv install" [config: record] {
# Installation logic
print $"Installing ($config.name)..."
# Deploy resources
kubectl apply -f deployment.yaml
{status: "installed"}
}
Task Service Lifecycle
- Validation - Check dependencies and configuration
- Installation - Execute install script
- Configuration - Apply service configuration
- Health Check - Verify service is running
- Ready - Service available for use
Dependencies
Task services can declare dependencies:
{
name = "grafana",
dependencies = ["prometheus"], # Installed first
}
Provisioning automatically resolves dependency order.
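The resolution described above amounts to a depth-first walk that installs each service only after its dependencies. A minimal sketch (not the platform's actual resolver), reusing the grafana/prometheus example:

```python
def resolve_order(services: dict[str, list[str]]) -> list[str]:
    """Return an install order where every service follows its dependencies."""
    order: list[str] = []
    visiting: set[str] = set()

    def visit(name: str) -> None:
        if name in order:
            return  # already scheduled
        if name in visiting:
            raise ValueError(f"dependency cycle involving {name!r}")
        visiting.add(name)
        for dep in services.get(name, []):
            visit(dep)
        visiting.discard(name)
        order.append(name)

    for name in services:
        visit(name)
    return order

print(resolve_order({"grafana": ["prometheus"], "prometheus": []}))
# → ['prometheus', 'grafana']
```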
Health Checks
Each task service provides health validation:
export def "taskserv health" [] {
let pods = (kubectl get pods -l app=my-service -o json | from json)
if ($pods.items | all {|p| $p.status.phase == "Running"}) {
{status: "healthy"}
} else {
{status: "unhealthy", reason: "pods not running"}
}
}
Best Practices
- Define schemas - Use Nickel schemas for task service configuration
- Declare dependencies - Explicit dependency declaration
- Idempotent installs - Installation should be repeatable
- Health checks - Implement comprehensive health validation
- Version pinning - Specify exact versions for reproducibility
- Document configuration - Provide clear configuration examples
References
- Clusters - Cluster orchestration
- Batch Workflows - Multi-service deployments
- Providers - Infrastructure providers
Clusters
Clusters are coordinated groups of services deployed together. Provisioning provides cluster definitions for common deployment patterns.
Available Clusters
Web Cluster
Production-ready web application deployment with load balancing, TLS, and monitoring.
Components:
- Nginx load balancer
- Application servers (configurable count)
- PostgreSQL database
- Redis cache
- Prometheus monitoring
- Let’s Encrypt TLS certificates
Configuration:
{
clusters = [{
name = "web-production",
type = 'web,
config = {
app_servers = 3,
load_balancer = {
public_ip = true,
tls_enabled = true,
domain = "example.com"
},
database = {
size = 'medium,
replicas = 2,
backup_enabled = true
},
cache = {
size = 'small,
persistence = true
}
}
}]
}
OCI Registry Cluster
Private container registry with S3-compatible storage and authentication.
Components:
- Harbor registry
- MinIO object storage
- PostgreSQL database
- Redis cache
- TLS termination
Configuration:
{
clusters = [{
name = "registry-private",
type = 'oci_registry,
config = {
domain = "registry.example.com",
storage = {
backend = 'minio,
size_gb = 500,
replicas = 3
},
authentication = {
method = 'ldap, # or 'database, 'oidc
admin_password = std.secret "REGISTRY_ADMIN_PASSWORD"
}
}
}]
}
Kubernetes Cluster
Multi-node Kubernetes cluster with networking, storage, and monitoring.
Components:
- Control plane nodes
- Worker node pools
- Cilium CNI
- Rook-Ceph storage
- Metrics server
- Ingress controller
Configuration:
{
clusters = [{
name = "k8s-production",
type = 'kubernetes,
config = {
control_plane = {
nodes = 3,
plan = 'medium,
high_availability = true
},
node_pools = [
{
name = "general",
nodes = 5,
plan = 'large,
labels = {workload = "general"}
},
{
name = "gpu",
nodes = 2,
plan = 'xlarge,
labels = {workload = "ml"}
}
],
networking = {
cni = 'cilium,
pod_cidr = "10.42.0.0/16",
service_cidr = "10.43.0.0/16"
},
storage = {
provider = 'rook-ceph,
default_storage_class = "ceph-block"
}
}
}]
}
Cluster Deployment
CLI Commands
# List available cluster types
provisioning cluster types
# Show cluster configuration template
provisioning cluster template web
# Deploy cluster
provisioning cluster deploy web-production
# Check cluster health
provisioning cluster health web-production
# Scale cluster
provisioning cluster scale web-production --app-servers 5
# Destroy cluster
provisioning cluster destroy web-production
Deployment Lifecycle
- Validation - Validate cluster configuration
- Infrastructure - Provision servers, networks, storage
- Services - Install and configure task services
- Integration - Connect services together
- Health Check - Verify cluster health
- Ready - Cluster operational
Cluster Orchestration
Clusters use dependency graphs for orchestration:
Web Cluster Dependency Graph:
servers ──┐
├──> database ──┐
networks ─┘ ├──> app_servers ──> load_balancer
│
├──> cache ──────────┘
│
└──> monitoring
Services are deployed in dependency order with parallel execution where possible.
Custom Cluster Definitions
Create custom cluster types:
provisioning/extensions/clusters/
└── my-cluster/
├── cluster.ncl # Cluster definition
├── deploy.nu # Deployment script
├── health-check.nu # Health validation
└── README.md
cluster.ncl schema:
{
name = "my-cluster",
version = "1.0.0",
description = "Custom cluster type",
components = {
servers = [{name = "app", count = 3, plan = 'medium}],
services = ["nginx", "postgresql", "redis"],
},
config_schema = {
domain | String,
replicas | Number | default = 3,
}
}
Cluster Management
Scaling
Scale cluster components:
# Scale application servers
provisioning cluster scale web-production --component app_servers --count 5
# Scale database replicas
provisioning cluster scale web-production --component database --replicas 3
Updates
Rolling updates without downtime:
# Update application version
provisioning cluster update web-production --app-version 2.0.0
# Update infrastructure (e.g., server plans)
provisioning cluster update web-production --plan large
Backup and Recovery
# Create cluster backup
provisioning cluster backup web-production
# Restore from backup
provisioning cluster restore web-production --backup 2024-01-15-snapshot
# List backups
provisioning cluster backups web-production
Monitoring
Cluster health monitoring:
# Overall cluster health
provisioning cluster health web-production
# Component health
provisioning cluster health web-production --component database
# Metrics
provisioning cluster metrics web-production
Health checks validate:
- All services running
- Network connectivity
- Storage availability
- Resource utilization
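Per-component checks like those listed above are typically aggregated into an overall cluster status. A sketch under assumed semantics (the healthy/degraded/unhealthy states and check names are illustrative, not the platform's exact output):

```python
def cluster_health(components: dict[str, bool]) -> dict:
    """Aggregate per-component check results into an overall status."""
    failed = [name for name, ok in components.items() if not ok]
    if not failed:
        status = "healthy"
    elif len(failed) < len(components):
        status = "degraded"  # some, but not all, checks failed
    else:
        status = "unhealthy"
    return {"status": status, "failed": failed}

# Checks mirroring the list above; one storage failure yields "degraded"
checks = {
    "services_running": True,
    "network_connectivity": True,
    "storage_availability": False,
    "resource_utilization": True,
}
print(cluster_health(checks))
```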
Best Practices
- Use predefined clusters - Leverage built-in cluster types
- Define dependencies - Explicit service dependencies
- Implement health checks - Comprehensive validation
- Plan for scaling - Design clusters for horizontal scaling
- Automate backups - Regular backup schedules
- Monitor resources - Track resource utilization
- Test disaster recovery - Validate backup/restore procedures
References
- Task Services - Service catalog
- Batch Workflows - Multi-cluster orchestration
- Providers - Infrastructure providers
Batch Workflows
Batch workflows orchestrate complex multi-step operations across multiple clouds and services with dependency resolution, parallel execution, and checkpoint recovery.
Overview
Batch workflows enable:
- Multi-cloud infrastructure orchestration
- Complex deployment pipelines
- Dependency-driven execution
- Parallel task execution
- Checkpoint and recovery
- Rollback on failures
Workflow Definition
Workflows are defined in Nickel:
{
workflows = [{
name = "multi-cloud-deployment",
description = "Deploy application across UpCloud and AWS",
steps = [
{
name = "provision-upcloud",
type = 'provision,
provider = 'upcloud,
resources = {
servers = [{name = "web-eu", plan = 'medium, zone = "fi-hel1"}]
}
},
{
name = "provision-aws",
type = 'provision,
provider = 'aws,
resources = {
servers = [{name = "web-us", plan = '"t3.medium", zone = "us-east-1a"}]
}
},
{
name = "deploy-application",
type = 'task,
depends_on = ["provision-upcloud", "provision-aws"],
tasks = ["install-kubernetes", "deploy-app"]
},
{
name = "configure-dns",
type = 'configure,
depends_on = ["deploy-application"],
config = {
records = [
{name = "eu.example.com", target = "web-eu"},
{name = "us.example.com", target = "web-us"}
]
}
}
],
rollback_on_failure = true,
checkpoint_enabled = true
}]
}
Dependency Resolution
Workflows automatically resolve dependencies:
Execution Graph:
provision-upcloud ──┐
├──> deploy-application ──> configure-dns
provision-aws ──────┘
Steps provision-upcloud and provision-aws run in parallel. deploy-application waits for both to complete.
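This layering can be sketched as a simple topological pass (an illustrative Python sketch, not platform code; step names are taken from the workflow above):

```python
def execution_waves(deps):
    """Group steps into waves; each wave can run in parallel once all
    dependencies from earlier waves have completed."""
    waves, done, remaining = [], set(), dict(deps)
    while remaining:
        ready = sorted(s for s, d in remaining.items() if set(d) <= done)
        if not ready:
            raise ValueError("dependency cycle detected")
        waves.append(ready)
        done.update(ready)
        for step in ready:
            del remaining[step]
    return waves

steps = {
    "provision-upcloud": [],
    "provision-aws": [],
    "deploy-application": ["provision-upcloud", "provision-aws"],
    "configure-dns": ["deploy-application"],
}
print(execution_waves(steps))
# → [['provision-aws', 'provision-upcloud'], ['deploy-application'], ['configure-dns']]
```

The first wave contains both provision steps (no dependencies), mirroring the execution graph above.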
Step Types
Provision Steps
Create infrastructure resources:
{
name = "create-servers",
type = 'provision,
provider = 'upcloud,
resources = {
servers = [...],
networks = [...],
storage = [...]
}
}
Task Steps
Execute task services:
{
name = "install-k8s",
type = 'task,
tasks = ["kubernetes", "helm", "monitoring"]
}
Configure Steps
Apply configuration changes:
{
name = "setup-networking",
type = 'configure,
config = {
firewalls = [...],
routes = [...],
dns = [...]
}
}
Validate Steps
Verify conditions before proceeding:
{
name = "health-check",
type = 'validate,
checks = [
{type = 'http, url = "https://app.example.com", expected_status = 200},
{type = 'command, command = "kubectl get nodes", expected_output = "Ready"}
]
}
Execution Control
Parallel Execution
Steps without dependencies run in parallel:
steps = [
{name = "provision-eu", ...}, # Runs in parallel
{name = "provision-us", ...}, # Runs in parallel
{name = "provision-asia", ...} # Runs in parallel
]
Configure parallelism:
{
max_parallel_tasks = 4, # Max concurrent steps
timeout_seconds = 3600 # Step timeout
}
Conditional Execution
Execute steps based on conditions:
{
name = "scale-up",
type = 'task,
condition = {
type = 'expression,
expression = "cpu_usage > 80"
}
}
Retry Logic
Automatically retry failed steps:
{
name = "deploy-app",
type = 'task,
retry = {
max_attempts = 3,
backoff = 'exponential, # or 'linear, 'constant
initial_delay_seconds = 10
}
}
Checkpoint and Recovery
Checkpointing
Workflows automatically checkpoint state:
# Enable checkpointing
provisioning workflow run multi-cloud --checkpoint
# Checkpoint saved at each step completion
Recovery
Resume from last successful checkpoint:
# Workflow failed at step 3
# Resume from checkpoint
provisioning workflow resume multi-cloud --from-checkpoint latest
# Resume from specific checkpoint
provisioning workflow resume multi-cloud --checkpoint-id abc123
Rollback
Automatic Rollback
Rollback on failure:
{
rollback_on_failure = true,
rollback_steps = [
{name = "destroy-resources", type = 'destroy},
{name = "restore-config", type = 'restore}
]
}
Manual Rollback
# Rollback to previous state
provisioning workflow rollback multi-cloud
# Rollback to specific checkpoint
provisioning workflow rollback multi-cloud --checkpoint-id abc123
Workflow Management
CLI Commands
# List workflows
provisioning workflow list
# Show workflow details
provisioning workflow show multi-cloud
# Run workflow
provisioning workflow run multi-cloud
# Check workflow status
provisioning workflow status multi-cloud
# View workflow logs
provisioning workflow logs multi-cloud
# Cancel running workflow
provisioning workflow cancel multi-cloud
Workflow State
Workflows track execution state:
- pending - Not yet started
- running - Currently executing
- completed - Successfully finished
- failed - Execution failed
- rolling_back - Performing rollback
- cancelled - Manually cancelled
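These states form a small state machine; a sketch of a transition guard follows (the allowed transitions are an assumption inferred from the state descriptions, not a documented transition table):

```python
# Hypothetical transition table inferred from the state descriptions.
ALLOWED = {
    "pending": {"running", "cancelled"},
    "running": {"completed", "failed", "cancelled"},
    "failed": {"rolling_back"},
    "rolling_back": {"cancelled"},
    "completed": set(),
    "cancelled": set(),
}

def transition(current, target):
    """Reject transitions not permitted by the table above."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target

state = transition("pending", "running")
state = transition(state, "failed")
state = transition(state, "rolling_back")
print(state)  # → rolling_back
```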
Advanced Features
Dynamic Workflows
Generate workflows programmatically:
let regions = ["fi-hel1", "de-fra1", "uk-lon1"] in
{
steps = std.array.map (fun region => {
name = "provision-" ++ region,
type = 'provision,
resources = {servers = [{zone = region, ...}]}
}) regions
}
Workflow Templates
Reusable workflow templates:
let DeploymentTemplate = fun app_name regions => {
name = "deploy-" ++ app_name,
steps = std.array.map (fun region => {
name = "deploy-" ++ region,
type = 'task,
tasks = ["deploy-app"],
config = {app_name = app_name, region = region}
}) regions
}
# Use template
{
workflows = [
DeploymentTemplate "frontend" ["eu", "us"],
DeploymentTemplate "backend" ["eu", "us", "asia"]
]
}
Notifications
Send notifications on workflow events:
{
notifications = {
on_success = {
type = 'slack,
webhook_url = std.secret "SLACK_WEBHOOK",
message = "Deployment completed successfully"
},
on_failure = {
type = 'email,
to = ["ops@example.com"],
subject = "Workflow failed"
}
}
}
Best Practices
- Define dependencies explicitly - Clear dependency graph
- Enable checkpointing - Critical for long-running workflows
- Implement rollback - Always have rollback strategy
- Use validation steps - Verify state before proceeding
- Configure retries - Handle transient failures
- Monitor execution - Track workflow progress
- Test workflows - Validate with dry-run mode
Troubleshooting
Workflow Stuck
# Check workflow status
provisioning workflow status <workflow> --verbose
# View logs
provisioning workflow logs <workflow> --tail 100
# Cancel and restart
provisioning workflow cancel <workflow>
provisioning workflow run <workflow>
Step Failures
# View failed step details
provisioning workflow show <workflow> --step <step-name>
# Retry failed step
provisioning workflow retry <workflow> --step <step-name>
# Skip failed step
provisioning workflow skip <workflow> --step <step-name>
References
- Task Services - Available tasks
- Clusters - Cluster orchestration
- Providers - Multi-cloud provisioning
Version Management
Nickel-based version management for infrastructure components, providers, and task services ensures consistent, reproducible deployments.
Overview
Version management in Provisioning:
- Nickel schemas define version constraints
- Semantic versioning (semver) support
- Version locking for reproducibility
- Compatibility validation
- Update strategies
Version Constraints
Define version requirements in Nickel:
{
task_services = [
{
name = "kubernetes",
version = ">=1.28.0, <1.30.0", # Range constraint
},
{
name = "prometheus",
version = "~2.45.0", # Patch versions allowed
},
{
name = "grafana",
version = "^10.0.0", # Minor versions allowed
},
{
name = "nginx",
version = "1.25.3", # Exact version
}
]
}
Constraint Operators
| Operator | Meaning | Example | Matches |
|---|---|---|---|
| = | Exact version | =1.28.0 | 1.28.0 only |
| >= | Greater or equal | >=1.28.0 | 1.28.0, 1.29.0, 2.0.0 |
| <= | Less or equal | <=1.30.0 | 1.28.0, 1.30.0 |
| > | Greater than | >1.28.0 | 1.29.0, 2.0.0 |
| < | Less than | <1.30.0 | 1.28.0, 1.29.0 |
| ~ | Patch updates | ~1.28.0 | 1.28.x |
| ^ | Minor updates | ^1.28.0 | 1.28.0 up to (but not including) 2.0.0 |
| , | AND constraint | >=1.28, <1.30 | 1.28.x, 1.29.x |
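The operators above can be modeled directly. The following Python sketch (illustrative only, not the platform's actual resolver) checks a version against a comma-separated constraint string:

```python
def parse(version):
    """'1.28.3' -> (1, 28, 3), so tuples compare like versions."""
    return tuple(int(part) for part in version.split("."))

def matches(version, constraint):
    """True if version satisfies every comma-separated constraint part."""
    v = parse(version)
    for part in (p.strip() for p in constraint.split(",")):
        if part.startswith(">="):
            ok = v >= parse(part[2:])
        elif part.startswith("<="):
            ok = v <= parse(part[2:])
        elif part.startswith(">"):
            ok = v > parse(part[1:])
        elif part.startswith("<"):
            ok = v < parse(part[1:])
        elif part.startswith("~"):  # patch updates: same major.minor
            base = parse(part[1:])
            ok = v[:2] == base[:2] and v >= base
        elif part.startswith("^"):  # minor updates: same major
            base = parse(part[1:])
            ok = v[0] == base[0] and v >= base
        else:  # "=1.28.0" or bare "1.28.0": exact match
            ok = v == parse(part.lstrip("="))
        if not ok:
            return False
    return True

print(matches("1.28.3", "~1.28.0"))            # → True
print(matches("1.29.0", "^1.28.0"))            # → True
print(matches("1.29.2", ">=1.28.0, <1.30.0"))  # → True
print(matches("2.0.0", "^1.28.0"))             # → False
```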
Version Locking
Generate lock file for reproducible deployments:
# Generate lock file
provisioning version lock
# Creates versions.lock.ncl with exact versions
versions.lock.ncl:
{
task_services = {
kubernetes = "1.28.3",
prometheus = "2.45.2",
grafana = "10.0.5",
nginx = "1.25.3"
},
providers = {
upcloud = "1.2.0",
aws = "3.5.1"
}
}
Use lock file:
let locked = import "versions.lock.ncl" in
{
task_services = [
{name = "kubernetes", version = locked.task_services.kubernetes}
]
}
Version Updates
Check for Updates
# Check available updates
provisioning version check
# Show outdated components
provisioning version outdated
Output:
Component Current Latest Update Available
kubernetes 1.28.0 1.29.2 Minor update
prometheus 2.45.0 2.47.0 Minor update
grafana 10.0.0 11.0.0 Major update (breaking)
Update Strategies
Conservative (patch only):
{
update_policy = 'conservative, # Only patch updates
}
Moderate (minor updates):
{
update_policy = 'moderate, # Patch + minor updates
}
Aggressive (all updates):
{
update_policy = 'aggressive, # All updates including major
}
Performing Updates
# Update all components (respecting constraints)
provisioning version update
# Update specific component
provisioning version update kubernetes
# Update to specific version
provisioning version update kubernetes --version 1.29.0
# Dry-run (show what would update)
provisioning version update --dry-run
Compatibility Validation
Validate version compatibility:
# Check compatibility
provisioning version validate
# Check specific component
provisioning version validate kubernetes
Compatibility rules defined in schemas:
{
name = "grafana",
version = "10.0.0",
compatibility = {
prometheus = ">=2.40.0", # Requires Prometheus 2.40+
kubernetes = ">=1.24.0" # Requires Kubernetes 1.24+
}
}
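A compatibility check walks each rule and compares it against the installed versions. A minimal Python sketch (illustrative; handles only the >= form used in the example above):

```python
def parse(version):
    return tuple(int(part) for part in version.split("."))

def check_compatibility(installed, component):
    """Return a list of problems; an empty list means all of the
    component's compatibility rules (>= constraints) are satisfied."""
    problems = []
    for dep, constraint in component.get("compatibility", {}).items():
        minimum = constraint[2:]  # strip ">=" (only form handled here)
        have = installed.get(dep)
        if have is None:
            problems.append(f"missing dependency: {dep}")
        elif parse(have) < parse(minimum):
            problems.append(f"{dep} {have} does not satisfy {constraint}")
    return problems

installed = {"prometheus": "2.45.2", "kubernetes": "1.28.3"}
grafana = {"name": "grafana", "version": "10.0.0",
           "compatibility": {"prometheus": ">=2.40.0", "kubernetes": ">=1.24.0"}}
print(check_compatibility(installed, grafana))  # → []
```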
Version Resolution
When multiple constraints conflict, resolution strategy:
- Exact version - Highest priority
- Compatibility constraints - From dependencies
- User constraints - From configuration
- Latest compatible - Within constraints
Example resolution:
# Component A requires: kubernetes >=1.28.0
# Component B requires: kubernetes <1.30.0
# User specifies: kubernetes ^1.28.0
# Resolved: kubernetes 1.29.x (latest compatible)
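The resolution above amounts to intersecting all constraints and picking the highest version that survives. A self-contained Python sketch (illustrative; the available-version list is hypothetical):

```python
def parse(version):
    return tuple(int(part) for part in version.split("."))

def satisfies(version, constraint):
    """Minimal matcher for the >=, <, and ^ forms used in this example."""
    v = parse(version)
    if constraint.startswith(">="):
        return v >= parse(constraint[2:])
    if constraint.startswith("<"):
        return v < parse(constraint[1:])
    if constraint.startswith("^"):
        base = parse(constraint[1:])
        return v[0] == base[0] and v >= base
    raise ValueError(f"unsupported constraint: {constraint}")

def resolve(available, constraints):
    """Highest available version satisfying every constraint."""
    candidates = [v for v in available
                  if all(satisfies(v, c) for c in constraints)]
    if not candidates:
        raise ValueError("unsatisfiable constraint set")
    return max(candidates, key=parse)

constraints = [">=1.28.0", "<1.30.0", "^1.28.0"]  # component A, component B, user
available = ["1.27.4", "1.28.3", "1.29.2", "1.30.1"]
print(resolve(available, constraints))  # → 1.29.2
```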
Pinning Versions
Pin critical components:
{
task_services = [
{
name = "kubernetes",
version = "1.28.3",
pinned = true # Never auto-update
}
]
}
Version Rollback
Rollback to previous versions:
# Show version history
provisioning version history
# Rollback to previous version
provisioning version rollback kubernetes
# Rollback to specific version
provisioning version rollback kubernetes --version 1.28.0
Best Practices
- Use version constraints - Avoid the latest tag
- Lock versions - Generate and commit lock files
- Test updates - Validate in non-production first
- Pin critical components - Prevent unexpected updates
- Document compatibility - Specify version requirements
- Monitor updates - Track new releases
- Gradual rollout - Update incrementally
Version Metadata
Access version information programmatically:
# Show component versions
provisioning version list
# Export versions to JSON
provisioning version export --format json
# Compare versions
provisioning version compare <component> <version1> <version2>
Integration with CI/CD
# .gitlab-ci.yml example
deploy:
script:
- provisioning version lock --verify # Verify lock file
- provisioning version validate # Check compatibility
- provisioning deploy # Deploy with locked versions
Troubleshooting
Version Conflicts
# Show dependency tree
provisioning version tree
# Identify conflicting constraints
provisioning version conflicts
Update Failures
# Check why update failed
provisioning version update kubernetes --verbose
# Force update (override constraints)
provisioning version update kubernetes --force --version 1.30.0
References
- Configuration System - Version configuration
- Schemas Reference - Version contracts
- Task Services - Service versions
Platform Features
Complete documentation for the 12 core Provisioning platform capabilities enabling enterprise infrastructure as code across multiple clouds.
Overview
Provisioning provides comprehensive features for:
- Workspace organization - Primary mode for grouping infrastructure, configs, schemas, and extensions with complete isolation
- Intelligent CLI - Modular architecture with 80+ keyboard shortcuts, decentralized command registration, 84% code reduction
- Type-safe configuration - Nickel as source of truth for all infrastructure definitions with mandatory validation
- Batch operations - DAG scheduling, parallel execution, multi-cloud workflows with dependency resolution
- Hybrid orchestration - Execute across Rust and Nushell with file-based persistence and atomic operations
- Interactive guides - Step-by-step guided infrastructure deployment with validation and error recovery
- Testing framework - Container-based test environments for validating infrastructure configurations
- Platform installer - TUI and unattended installation with provider setup and configuration management
- Security system - Complete v4.0.0 with authentication, authorization, encryption, secrets management, audit logging
- Daemon acceleration - 50x performance improvement for script-heavy workloads via persistent Rust process
- Intelligent detection - Automated analysis detecting cost, compliance, performance, security, and reliability issues
- Extension registry - Central marketplace for providers, task services, plugins, and clusters with versioning
Feature Guides
Organization and Management
- Workspace Management - Workspace mode, grouping, multi-tenancy, isolation, customization
- CLI Architecture - Modular design, 80+ shortcuts, decentralized registration, dynamic subcommands, 84% code reduction
- Configuration System - Nickel type-safe configuration, hierarchical loading, profiles, validation
Workflow and Operations
- Batch Workflows - DAG scheduling, parallel execution, conditional logic, error handling, multi-cloud, dependency resolution
- Orchestrator System - Hybrid Rust/Nushell, file-based persistence, atomic operations, event-driven
- Provisioning Daemon - TCP service, 50x performance, connection pooling, LRU caching, graceful shutdown
Developer and Automation Features
- Interactive Guides - Guided deployment, prompts, validation, error recovery, progress tracking
- Test Environment - Container-based testing, sandbox isolation, validation, integration testing
- Extension Registry - Marketplace for providers, task services, plugins, clusters, versioning, dependencies
Platform Capabilities
- Platform Installer - TUI and unattended modes, provider setup, workspace creation, configuration management
- Security System - v4.0.0: JWT/OAuth, Cedar RBAC, MFA, audit logging, encryption, secrets management
- Detector System - Cost optimization, compliance, performance analysis, security detection, reliability assessment
- Nushell Plugins - 17 plugins: tera, nickel, fluentd, secretumvault, 10-50x performance gains
- Version Management - Semantic versioning, dependency resolution, compatibility, deprecation, upgrade workflows
Feature Categories
| Category | Features | Use Case |
|---|---|---|
| Core | Workspace Management, CLI Architecture, Configuration System | Organization, command discovery, type-safety |
| Operations | Batch Workflows, Orchestrator, Version Management | Multi-cloud, DAG scheduling, persistence |
| Performance | Provisioning Daemon, Nushell Plugins | Script acceleration, 10-50x speedup |
| Quality & Testing | Test Environment, Extension Registry | Configuration validation, distribution |
| Setup & Installation | Platform Installer | Installation, initial configuration |
| Intelligence | Detector System | Analysis, anomaly detection, cost optimization |
| Security | Security System (complete v4.0.0) | Authentication, authorization, encryption |
| User Experience | Interactive Guides | Guided deployment, learning |
Quick Navigation
I want to organize my infrastructure
Start with Workspace Management - primary organizational mode with isolation and customization.
I want faster command execution
Use Provisioning Daemon - 50x performance improvement for scripts through persistent process and caching.
I want to automate deployment
Learn Batch Workflows - DAG scheduling and multi-cloud orchestration with error handling.
I need to ensure security
Review Security System - complete authentication, authorization, encryption, audit logging.
I want to validate configurations
Check Test Environment - container-based sandbox testing and policy validation.
I need to extend capabilities
See Extension Registry - marketplace for providers, task services, plugins, clusters.
I need to find infrastructure issues
Use Detector System - automated cost, compliance, performance, and security analysis.
Integration with Platform
All features are integrated via:
- CLI commands - Invoke from Nushell or bash
- REST APIs - Integrate with external systems
- Nushell scripting - Build custom automation
- Nickel configuration - Type-safe definitions
- Extensions - Add custom providers and services
Related Documentation
- Architecture Details → See provisioning/docs/src/architecture/
- Development Guides → See provisioning/docs/src/development/
- API Reference → See provisioning/docs/src/api-reference/
- Operation Guides → See provisioning/docs/src/operations/
- Security Details → See provisioning/docs/src/security/
- Practical Examples → See provisioning/docs/src/examples/
Workspace Management
Workspaces are the default organizational unit for all infrastructure work in Provisioning. Every infrastructure project, deployment environment, or isolated configuration lives within a workspace. This workspace-first approach provides clean separation between projects, environments, and teams while enabling rapid context switching.
Overview
A workspace is an isolated environment that groups together:
- Infrastructure definitions - Nickel schemas, server configs, cluster definitions
- Configuration settings - Environment-specific settings, provider credentials, user preferences
- Runtime data - State files, checkpoints, logs, generated configurations
- Extensions - Custom providers, task services, workflow templates
The workspace system enforces that all infrastructure operations (server creation, task service installation, cluster deployment) require an active workspace. This prevents accidental cross-project modifications and ensures configuration isolation.
Why Workspace-First
Traditional infrastructure tools often mix configurations across projects, leading to:
- Accidental deployments to wrong environments
- Configuration drift between dev/staging/production
- Credential leakage across projects
- Difficulty tracking infrastructure boundaries
Provisioning’s workspace-first approach solves these problems by making workspace boundaries explicit and enforced at the CLI level.
Workspace Structure
Every workspace follows a consistent directory structure:
workspace_my_project/
├── infra/ # Infrastructure definitions (Nickel schemas)
│ ├── my-cluster.ncl # Cluster definition
│ ├── servers.ncl # Server configurations
│ └── batch-workflows.ncl # Batch workflow definitions
│
├── config/ # Workspace configuration
│ ├── local-overrides.toml # User-specific overrides (gitignored)
│ ├── dev-defaults.toml # Development environment defaults
│ ├── test-defaults.toml # Testing environment defaults
│ ├── prod-defaults.toml # Production environment defaults
│ └── provisioning.yaml # Workspace metadata and settings
│
├── extensions/ # Workspace-specific extensions
│ ├── providers/ # Custom cloud providers
│ ├── taskservs/ # Custom task services
│ ├── clusters/ # Custom cluster templates
│ └── workflows/ # Custom workflow definitions
│
└── runtime/ # Runtime data (gitignored)
├── state/ # Infrastructure state files
├── checkpoints/ # Workflow checkpoints
├── logs/ # Operation logs
└── generated/ # Generated configuration files
Configuration Hierarchy
Workspace configurations follow a 5-layer hierarchy:
1. System Defaults (provisioning/config/config.defaults.toml)
↓ overridden by
2. User Config (~/.config/provisioning/user_config.yaml)
↓ overridden by
3. Workspace Config (workspace/config/provisioning.yaml)
↓ overridden by
4. Environment Config (workspace/config/{dev,test,prod}-defaults.toml)
↓ overridden by
5. Runtime Flags (--flag value)
This hierarchy ensures sensible defaults while allowing granular control at every level.
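Layered loading is a chain of deep merges in which later layers win key by key. A sketch (Python, illustrative; the setting names are hypothetical):

```python
from functools import reduce

def deep_merge(base, override):
    """Merge two config layers: override wins, nested maps merge recursively."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

layers = [
    {"servers": {"plan": "small", "zone": "fi-hel1"}},  # 1. system defaults
    {"servers": {"plan": "medium"}},                    # 2. user config
    {"servers": {"zone": "de-fra1"}},                   # 3. workspace config
]
print(reduce(deep_merge, layers))
# → {'servers': {'plan': 'medium', 'zone': 'de-fra1'}}
```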
Core Commands
Creating Workspaces
# Create new workspace
provisioning workspace init my-project
# Create workspace with specific location
provisioning workspace init my-project --path /custom/location
# Create from template
provisioning workspace init my-project --template kubernetes-ha
Listing Workspaces
# List all workspaces
provisioning workspace list
# Show active workspace
provisioning workspace status
# List with details
provisioning workspace list --verbose
Example output:
NAME PATH LAST_USED STATUS
my-project /workspaces/workspace_my_project 2026-01-15 10:30 Active
dev-env /workspaces/workspace_dev_env 2026-01-14 15:45
production /workspaces/workspace_production 2026-01-10 09:00
Switching Workspaces
# Switch to different workspace (single command)
provisioning workspace switch my-project
# Switch with validation
provisioning workspace switch production --validate
# Quick switch using shortcut
provisioning ws switch dev-env
Workspace switching updates:
- Active workspace marker in user configuration
- Environment variables for current session
- CLI prompt indicator (if configured)
- Last-used timestamp
Deleting Workspaces
# Delete workspace (requires confirmation)
provisioning workspace delete old-project
# Force delete without confirmation
provisioning workspace delete old-project --force
# Delete but keep backups
provisioning workspace delete old-project --backup
Deletion safety:
- Requires explicit confirmation unless --force is used
- Optionally creates backup before deletion
- Validates no active operations are running
- Updates workspace registry
Workspace Registry
The workspace registry is stored in user configuration and tracks all workspaces:
# ~/.config/provisioning/user_config.yaml
workspaces:
active: my-project
registry:
my-project:
path: /workspaces/workspace_my_project
created: 2026-01-15T10:30:00Z
last_used: 2026-01-15T14:20:00Z
template: default
dev-env:
path: /workspaces/workspace_dev_env
created: 2026-01-10T08:00:00Z
last_used: 2026-01-14T15:45:00Z
template: development
This centralized registry enables:
- Fast workspace discovery
- Usage tracking and statistics
- Workspace templates
- Path resolution
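Workspace switching, for example, reduces to a registry update: set the active marker and stamp last_used. A sketch (Python, illustrative; field names follow the YAML example above):

```python
from datetime import datetime, timezone

def switch_workspace(registry, name):
    """Mark a workspace active and record usage in the registry."""
    if name not in registry["registry"]:
        raise KeyError(f"Workspace '{name}' not found in registry")
    registry["active"] = name
    registry["registry"][name]["last_used"] = (
        datetime.now(timezone.utc).isoformat())
    return registry

registry = {
    "active": "my-project",
    "registry": {
        "my-project": {"path": "/workspaces/workspace_my_project", "last_used": None},
        "dev-env": {"path": "/workspaces/workspace_dev_env", "last_used": None},
    },
}
switch_workspace(registry, "dev-env")
print(registry["active"])  # → dev-env
```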
Workspace Enforcement
The CLI enforces workspace requirements for all infrastructure operations:
Workspace-exempt commands (work without active workspace):
- provisioning help
- provisioning version
- provisioning workspace *
- provisioning guide *
- provisioning setup *
- provisioning providers (list only)
Workspace-required commands (require active workspace):
- provisioning server create
- provisioning taskserv install
- provisioning cluster deploy
- provisioning batch submit
- All infrastructure modification operations
If no workspace is active, workspace-required commands fail with:
Error: No active workspace
Please activate or create a workspace:
provisioning workspace init <name>
provisioning workspace switch <name>
This enforcement prevents accidental infrastructure modifications outside workspace boundaries.
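The guard itself is simple: check the top-level command against an exempt list before requiring an active workspace. A sketch (Python, illustrative; the exempt set mirrors the list above):

```python
# Hypothetical exempt list mirroring the enforcement rules above.
EXEMPT = {"help", "version", "workspace", "guide", "setup", "providers"}

def require_workspace(command, active_workspace):
    """Reject workspace-required commands when no workspace is active."""
    top_level = command.split()[0]
    if top_level in EXEMPT or active_workspace is not None:
        return True
    raise RuntimeError(
        "No active workspace\n"
        "Please activate or create a workspace:\n"
        "  provisioning workspace init <name>\n"
        "  provisioning workspace switch <name>"
    )

require_workspace("workspace list", None)         # exempt: allowed
require_workspace("server create", "my-project")  # workspace active: allowed
```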
Workspace Templates
Templates provide pre-configured workspace structures for common use cases:
Available Templates
| Template | Description | Use Case |
|---|---|---|
| default | Minimal workspace structure | General purpose infrastructure |
| kubernetes-ha | HA Kubernetes setup with 3 control planes | Production Kubernetes deployments |
| development | Dev-optimized with Docker Compose | Local testing and development |
| multi-cloud | Multiple provider configurations | Multi-cloud deployments |
| database-cluster | Database-focused with backup configs | Database infrastructure |
| cicd | CI/CD pipeline configurations | Automated deployment pipelines |
Using Templates
# Create from template
provisioning workspace init my-k8s --template kubernetes-ha
# List available templates
provisioning workspace templates
# Show template details
provisioning workspace template show kubernetes-ha
Templates pre-populate:
- Infrastructure Nickel schemas
- Provider configurations
- Environment-specific defaults
- Example workflow definitions
- README with usage instructions
Multi-Environment Workflows
Workspaces excel at managing multiple environments:
Strategy 1: Separate Workspaces Per Environment
# Create dedicated workspaces
provisioning workspace init myapp-dev
provisioning workspace init myapp-staging
provisioning workspace init myapp-prod
# Switch between environments
provisioning ws switch myapp-dev
provisioning server create # Creates in dev
provisioning ws switch myapp-prod
provisioning server create # Creates in prod (isolated)
Pros: Complete isolation, different credentials, independent state
Cons: More workspace management, duplicate configuration
Strategy 2: Single Workspace, Multiple Environments
# Single workspace with environment configs
provisioning workspace init myapp
# Deploy to different environments using flags
PROVISIONING_ENV=dev provisioning server create
PROVISIONING_ENV=staging provisioning server create
PROVISIONING_ENV=prod provisioning server create
Pros: Shared configuration, easier to maintain
Cons: Shared credentials, risk of cross-environment mistakes
Strategy 3: Hybrid Approach
# Dev workspace for experimentation
provisioning workspace init myapp-dev
# Prod workspace for production only
provisioning workspace init myapp-prod
# Use environment flags within workspaces
provisioning ws switch myapp-prod
PROVISIONING_ENV=prod provisioning cluster deploy
Pros: Balances isolation and convenience
Cons: More complex to explain to teams
Best Practices
Naming Conventions
# Good names (descriptive, unique)
workspace_librecloud_production
workspace_myapp_dev
workspace_k8s_staging
# Avoid (ambiguous, generic)
workspace_test
workspace_1
workspace_temp
Configuration Management
# Version control: Commit these files
infra/**/*.ncl # Infrastructure definitions
config/*-defaults.toml # Environment defaults
config/provisioning.yaml # Workspace metadata
extensions/**/* # Custom extensions
# Gitignore: Never commit these
config/local-overrides.toml # User-specific overrides
runtime/**/* # Runtime data and state
**/*.secret # Credential files
Environment Separation
# Use dedicated workspaces for production
provisioning workspace init myapp-prod --template production
# Enable extra validation for production
provisioning ws switch myapp-prod
provisioning config set validation.strict true
provisioning config set confirmation.required true
Team Collaboration
# Share workspace structure via git
git clone repo/myapp-infrastructure
cd myapp-infrastructure
provisioning workspace init . --import
# Each team member creates local-overrides.toml
cat > config/local-overrides.toml <<EOF
[user]
default_region = "us-east-1"
confirmation_required = true
EOF
Troubleshooting
No Active Workspace Error
Error: No active workspace
Solution:
# List workspaces
provisioning workspace list
# Switch to workspace
provisioning workspace switch <name>
# Or create new workspace
provisioning workspace init <name>
Workspace Not Found
Error: Workspace 'my-project' not found in registry
Solution:
# Re-register workspace
provisioning workspace register /path/to/workspace_my_project
# Or recreate workspace
provisioning workspace init my-project
Workspace Path Doesn’t Exist
Error: Workspace path '/workspaces/workspace_my_project' does not exist
Solution:
# Remove invalid entry
provisioning workspace unregister my-project
# Re-create workspace
provisioning workspace init my-project
Integration with Other Features
Batch Workflows
Workspaces provide the context for batch workflow execution:
provisioning ws switch production
provisioning batch submit infra/batch-workflows.ncl
Batch workflows access workspace-specific:
- Infrastructure definitions
- Provider credentials
- Configuration settings
- State management
Test Environments
Test environments inherit workspace configuration:
provisioning ws switch dev
provisioning test quick kubernetes
# Uses dev workspace's configuration and providers
Version Management
Workspace configurations can specify tool versions:
# workspace/infra/versions.ncl
{
tools = {
nushell = "0.109.1",
nickel = "1.15.1",
kubernetes = "1.29.0"
}
}
}
Provisioning validates versions match workspace requirements.
See Also
- Configuration System - Hierarchical configuration details
- Nickel Guide - Infrastructure definitions in Nickel
- Batch Workflows - Multi-cloud workflow orchestration
- Test Environment - Container-based testing within workspaces
CLI Architecture
The Provisioning CLI provides a unified command-line interface for all infrastructure operations. It features 111+ commands organized into 7 domain-focused modules with 80+ shortcuts for improved productivity. The modular architecture achieved 84% code reduction while improving maintainability and extensibility.
Overview
The CLI architecture uses domain-driven design, separating concerns across modules. This refactoring reduced the main entry point from monolithic code to 211 lines. The architecture improves discoverability and enables rapid feature development.
Key Metrics
| Metric | Before | After | Improvement |
|---|---|---|---|
| Main CLI lines | 1,329 | 211 | 84% reduction |
| Command domains | 1 (monolithic) | 7 (modular) | 7x organization |
| Commands | ~50 | 111+ | 122% increase |
| Shortcuts | 0 | 80+ | New capability |
| Help categories | 0 | 7 | Improved discovery |
Domain Architecture
The CLI is organized into 7 domain-focused modules:
1. Infrastructure Domain
Commands: Server, TaskServ, Cluster, Infra management
# Server operations
provisioning server create
provisioning server list
provisioning server delete
provisioning server ssh <hostname>
# Task service operations
provisioning taskserv install kubernetes
provisioning taskserv list
provisioning taskserv remove kubernetes
# Cluster operations
provisioning cluster deploy my-cluster
provisioning cluster status my-cluster
provisioning cluster scale my-cluster --nodes 5
Shortcuts: s (server), t/task (taskserv), cl (cluster), i (infra)
2. Orchestration Domain
Commands: Workflow, Batch, Orchestrator management
# Workflow operations
provisioning workflow list
provisioning workflow status <id>
provisioning workflow cancel <id>
# Batch operations
provisioning batch submit infra/batch-workflows.ncl
provisioning batch monitor <workflow-id>
provisioning batch list
# Orchestrator management
provisioning orchestrator start
provisioning orchestrator status
provisioning orchestrator logs
Shortcuts: wf/flow (workflow), bat (batch), orch (orchestrator)
3. Development Domain
Commands: Module, Layer, Version, Pack management
# Module operations
provisioning module create my-module
provisioning module list
provisioning module test my-module
# Layer operations
provisioning layer add <name>
provisioning layer list
# Versioning
provisioning version bump minor
provisioning version list
# Packaging
provisioning pack create my-extension
provisioning pack publish my-extension
Shortcuts: mod (module), l (layer), v (version), p (pack)
4. Workspace Domain
Commands: Workspace management, templates
# Workspace operations
provisioning workspace init my-project
provisioning workspace list
provisioning workspace switch my-project
provisioning workspace delete old-project
# Template operations
provisioning workspace template list
provisioning workspace template show kubernetes-ha
Shortcuts: ws (workspace)
5. Configuration Domain
Commands: Config, Environment, Validate, Setup
# Configuration operations
provisioning config get servers.default_plan
provisioning config set servers.default_plan large
provisioning config validate
# Environment operations
provisioning env
provisioning allenv
# Setup operations
provisioning setup profile --profile developer
provisioning setup versions
# Validation
provisioning validate config
provisioning validate infra
provisioning validate nickel workspace/infra/my-cluster.ncl
Shortcuts: cfg (config), val (validate), st (setup)
6. Utilities Domain
Commands: SSH, SOPS, Cache, Plugin management
# SSH operations
provisioning ssh server-01
provisioning ssh server-01 -- uptime
# SOPS operations
provisioning sops encrypt config.yaml
provisioning sops decrypt config.enc.yaml
# Cache operations
provisioning cache clear
provisioning cache stats
# Plugin operations
provisioning plugin list
provisioning plugin install nu_plugin_auth
provisioning plugin update
Shortcuts: sops, cache, plug (plugin)
7. Generation Domain
Commands: Generate code, configs, docs
# Code generation
provisioning generate provider upcloud-new
provisioning generate taskserv postgresql
provisioning generate cluster k8s-ha
# Config generation
provisioning generate config --profile production
provisioning generate nickel --template kubernetes
# Documentation generation
provisioning generate docs
Shortcuts: g/gen (generate)
Command Shortcuts
The CLI provides 80+ shortcuts for improved productivity:
Infrastructure Shortcuts
| Full Command | Shortcuts | Example |
|---|---|---|
| server | s | provisioning s list |
| taskserv | t, task | provisioning t install kubernetes |
| cluster | cl | provisioning cl deploy my-cluster |
| infrastructure | i, infra | provisioning i list |
Orchestration Shortcuts
| Full Command | Shortcuts | Example |
|---|---|---|
| workflow | wf, flow | provisioning wf list |
| batch | bat | provisioning bat submit workflow.ncl |
| orchestrator | orch | provisioning orch status |
Development Shortcuts
| Full Command | Shortcuts | Example |
|---|---|---|
| module | mod | provisioning mod list |
| layer | l | provisioning l add base |
| version | v | provisioning v bump minor |
| pack | p | provisioning p create extension |
Configuration Shortcuts
| Full Command | Shortcuts | Example |
|---|---|---|
| workspace | ws | provisioning ws switch prod |
| config | cfg | provisioning cfg get servers.plan |
| validate | val | provisioning val config |
| setup | st | provisioning st profile --profile dev |
| environment | env | provisioning env |
Utility Shortcuts
| Full Command | Shortcuts | Example |
|---|---|---|
| generate | g, gen | provisioning g provider aws-new |
| plugin | plug | provisioning plug list |
Quick Reference Shortcuts
| Full Command | Shortcuts | Purpose |
|---|---|---|
| shortcuts | sc | Show shortcuts reference |
| guide | - | Interactive guides |
| howto | - | Quick how-to guides |
Bi-Directional Help System
The CLI features a bi-directional help system: the help keyword and the command name can be given in either order:
# Both of these work identically
provisioning help workspace
provisioning workspace help
# Shortcuts also work
provisioning help ws
provisioning ws help
# Category help
provisioning help infrastructure
provisioning help orchestration
This flexibility improves discoverability and aligns with natural user expectations.
Centralized Flag Handling
All global flags are handled consistently across all commands:
Global Flags
| Flag | Short | Purpose | Example |
|---|---|---|---|
| --debug | -d | Enable debug mode | provisioning --debug server create |
| --check | -c | Dry-run mode (no changes) | provisioning --check server delete |
| --yes | -y | Auto-confirm operations | provisioning --yes cluster delete |
| --infra | -i | Specify infrastructure | provisioning --infra my-cluster server list |
| --verbose | -v | Verbose output | provisioning --verbose workflow list |
| --quiet | -q | Minimal output | provisioning --quiet batch submit |
| --format | -f | Output format (json/yaml/table) | provisioning --format json server list |
Command-Specific Flags
# Server creation flags
provisioning server create --plan large --region us-east-1 --zone a
# TaskServ installation flags
provisioning taskserv install kubernetes --version 1.29.0 --ha
# Cluster deployment flags
provisioning cluster deploy --replicas 3 --storage 100GB
# Batch workflow flags
provisioning batch submit workflow.ncl --parallel 5 --timeout 3600
Command Discovery
Categorized Help
The help system organizes commands by domain:
provisioning help
# Output shows categorized commands:
Infrastructure Commands:
server Manage servers (shortcuts: s)
taskserv Manage task services (shortcuts: t, task)
cluster Manage clusters (shortcuts: cl)
Orchestration Commands:
workflow Manage workflows (shortcuts: wf, flow)
batch Batch operations (shortcuts: bat)
orchestrator Orchestrator management (shortcuts: orch)
Configuration Commands:
workspace Workspace management (shortcuts: ws)
config Configuration management (shortcuts: cfg)
validate Validation operations (shortcuts: val)
setup System setup (shortcuts: st)
Quick Reference
# Fastest command reference
provisioning sc
# Shows comprehensive shortcuts table with examples
Interactive Guides
# Step-by-step guides
provisioning guide from-scratch # Complete deployment guide
provisioning guide quickstart # Command shortcuts reference
provisioning guide customize # Customization patterns
Command Routing
The CLI uses a sophisticated dispatcher for command routing:
# provisioning/core/nulib/main_provisioning/dispatcher.nu
# Route command to appropriate handler
export def dispatch [
command: string
args: list<string>
] {
match $command {
# Infrastructure domain
"server" | "s" => { route-to-handler "infrastructure" "server" $args }
"taskserv" | "t" | "task" => { route-to-handler "infrastructure" "taskserv" $args }
"cluster" | "cl" => { route-to-handler "infrastructure" "cluster" $args }
# Orchestration domain
"workflow" | "wf" | "flow" => { route-to-handler "orchestration" "workflow" $args }
"batch" | "bat" => { route-to-handler "orchestration" "batch" $args }
# Configuration domain
"workspace" | "ws" => { route-to-handler "configuration" "workspace" $args }
"config" | "cfg" => { route-to-handler "configuration" "config" $args }
}
}
This routing enables:
- Consistent error handling
- Centralized logging
- Workspace enforcement
- Permission checks
- Audit trail
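Conceptually, the dispatcher is a lookup from a command name (or any of its shortcuts) to a domain/handler pair. The following Python sketch models that routing table; it is illustrative only and not the actual Nushell implementation:

```python
# Hypothetical model of dispatcher routing (illustrative only).
ROUTES = {
    # canonical names and shortcuts map to the same (domain, handler) pair
    "server": ("infrastructure", "server"),
    "s": ("infrastructure", "server"),
    "taskserv": ("infrastructure", "taskserv"),
    "t": ("infrastructure", "taskserv"),
    "task": ("infrastructure", "taskserv"),
    "workflow": ("orchestration", "workflow"),
    "wf": ("orchestration", "workflow"),
}

def dispatch(command: str, args: list[str]) -> tuple[str, str, list[str]]:
    """Resolve a command or shortcut to its domain handler."""
    try:
        domain, handler = ROUTES[command]
    except KeyError:
        raise ValueError(f"Unknown command: {command}")
    return (domain, handler, args)

print(dispatch("wf", ["list"]))  # ('orchestration', 'workflow', ['list'])
```

Because shortcuts resolve to the same handler entry as the full command, every alias path inherits identical error handling, logging, and permission checks.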
Command Implementation Pattern
All commands follow a consistent implementation pattern:
# Example: provisioning/core/nulib/main_provisioning/commands/server.nu
# Main command handler
export def main [
operation: string # create, list, delete, etc.
...args: string # Operation arguments (names, targets)
--check # Dry-run mode
--yes # Auto-confirm
] {
# 1. Validate workspace requirement
enforce-workspace-requirement "server" $operation
# 2. Load configuration
let config = load-config
# 3. Parse operation
match $operation {
"create" => { create-server $args $config --check=$check --yes=$yes }
"list" => { list-servers $config }
"delete" => { delete-server $args $config --yes=$yes }
"ssh" => { ssh-to-server $args $config }
_ => { error make { msg: $"Unknown server operation: ($operation)" } }
}
# 4. Log operation (audit trail)
log-operation "server" $operation $args
}
This pattern ensures:
- Consistent behavior
- Proper error handling
- Configuration integration
- Workspace enforcement
- Audit logging
Modular Structure
The CLI codebase is organized for maintainability:
provisioning/core/
├── cli/
│ └── provisioning # Main CLI entry point (211 lines)
│
├── nulib/
│ ├── main_provisioning/
│ │ ├── dispatcher.nu # Command routing (central dispatch)
│ │ ├── flags.nu # Centralized flag handling
│ │ ├── help_system_fluent.nu # Categorized help with i18n
│ │ │
│ │ └── commands/ # Domain-specific command handlers
│ │ ├── infrastructure/
│ │ │ ├── server.nu
│ │ │ ├── taskserv.nu
│ │ │ └── cluster.nu
│ │ │
│ │ ├── orchestration/
│ │ │ ├── workflow.nu
│ │ │ ├── batch.nu
│ │ │ └── orchestrator.nu
│ │ │
│ │ ├── configuration/
│ │ │ ├── workspace.nu
│ │ │ ├── config.nu
│ │ │ └── validate.nu
│ │ │
│ │ └── utilities/
│ │ ├── ssh.nu
│ │ ├── sops.nu
│ │ └── cache.nu
│ │
│ └── lib_provisioning/ # Core libraries (used by commands)
│ ├── config/
│ ├── providers/
│ ├── workspace/
│ └── utils/
This structure enables:
- Clear separation of concerns
- Easy addition of new commands
- Testable command handlers
- Reusable core libraries
Internationalization
The CLI supports multiple languages via Fluent catalog:
# Automatic locale detection
export LANG=es_ES.UTF-8
provisioning help # Shows Spanish help if es-ES catalog exists
# Supported locales
en-US (default) # English
es-ES # Spanish
fr-FR # French
de-DE # German
Catalog structure:
provisioning/locales/
├── en-US/
│ └── help.ftl # English help strings
├── es-ES/
│ └── help.ftl # Spanish help strings
└── de-DE/
└── help.ftl # German help strings
Extension Points
The modular architecture provides clean extension points:
Adding New Commands
# 1. Create command handler
provisioning/core/nulib/main_provisioning/commands/my_new_command.nu
# 2. Register in dispatcher
# provisioning/core/nulib/main_provisioning/dispatcher.nu
"my-command" | "mc" => { route-to-handler "utilities" "my-command" $args }
# 3. Add help entry
# provisioning/locales/en-US/help.ftl
my-command-help = Manage my new feature
# 4. Command is now available
provisioning my-command <operation>
provisioning mc <operation> # Shortcut also works
Adding New Domains
# 1. Create domain directory
provisioning/core/nulib/main_provisioning/commands/my_domain/
# 2. Add domain commands
my_domain/
├── command1.nu
├── command2.nu
└── command3.nu
# 3. Register domain in dispatcher
# 4. Add domain help category
# Domain is now available with all commands
Command Aliases
The CLI supports command aliases for common operations:
# Defined in user configuration
# ~/.config/provisioning/user_config.yaml
aliases:
deploy: "cluster deploy"
list-all: "server list && taskserv list && cluster list"
quick-test: "test quick kubernetes"
# Usage
provisioning deploy my-cluster # Expands to: cluster deploy my-cluster
provisioning list-all # Runs multiple commands
provisioning quick-test # Runs test with preset
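Alias expansion amounts to substituting a leading token with its configured expansion. This Python sketch is a conceptual model (the alias names mirror the example configuration above), not the CLI's actual code:

```python
# Hypothetical model of alias expansion from user configuration.
ALIASES = {
    "deploy": "cluster deploy",
    "quick-test": "test quick kubernetes",
}

def expand(argv: list[str]) -> list[str]:
    """Replace a leading alias with its expansion, keeping remaining arguments."""
    if argv and argv[0] in ALIASES:
        return ALIASES[argv[0]].split() + argv[1:]
    return argv

print(expand(["deploy", "my-cluster"]))  # ['cluster', 'deploy', 'my-cluster']
```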
Best Practices
Using Shortcuts Effectively
# Development workflow (frequent commands)
provisioning ws switch dev # Switch to dev workspace
provisioning s list # Quick server list
provisioning t install postgres # Install task service
provisioning cl status my-cluster # Check cluster status
# Production workflow (explicit commands for clarity)
provisioning workspace switch production
provisioning server create --plan large --check
provisioning cluster deploy critical-cluster --yes
Dry-Run Before Execution
# Always check before dangerous operations
provisioning --check server delete old-servers
provisioning --check cluster delete test-cluster
# If output looks good, run for real
provisioning --yes server delete old-servers
Using Output Formats
# JSON output for scripting
provisioning --format json server list | jq '.[] | select(.status == "running")'
# YAML output for readability
provisioning --format yaml cluster status my-cluster
# Table output for humans (default)
provisioning server list
Performance Optimizations
The modular architecture enables several performance optimizations:
Lazy Loading
Commands are loaded on-demand, reducing startup time:
# Only loads server command module when needed
provisioning server list # Fast startup (loads server.nu only)
Command Caching
Frequently-used commands benefit from caching:
# First run: ~200ms (loads modules, config)
provisioning server list
# Subsequent runs: ~50ms (cached config, loaded modules)
provisioning server list
Parallel Execution
Batch operations execute in parallel:
# Executes server creation in parallel (up to configured limit)
provisioning batch submit multi-server-workflow.ncl --parallel 10
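The effect of a --parallel limit can be sketched as a bounded worker pool that never runs more than the configured number of tasks at once. This Python example is a conceptual model, not the orchestrator's implementation:

```python
# Conceptual model of "--parallel 3": a bounded pool runs at most three
# create operations concurrently. create_server is a stand-in, not a real API.
from concurrent.futures import ThreadPoolExecutor

def create_server(name: str) -> str:
    # placeholder for the real provisioning call
    return f"created {name}"

names = [f"web-{i:02d}" for i in range(1, 6)]
with ThreadPoolExecutor(max_workers=3) as pool:  # up to 3 concurrent tasks
    results = list(pool.map(create_server, names))  # result order is preserved
print(results)
```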
Troubleshooting
Command Not Found
Error: Unknown command 'servr'
Did you mean: server (s)
The CLI provides helpful suggestions for typos.
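A "did you mean" hint like the one above can be produced with a fuzzy close-match search over the known command names. This sketch illustrates the idea; the CLI's actual algorithm may differ:

```python
# Illustrative typo suggestion using a standard-library close-match search.
from difflib import get_close_matches

COMMANDS = ["server", "taskserv", "cluster", "workflow", "batch", "config"]

def suggest(typo: str) -> list[str]:
    """Return the closest known command, if any is similar enough."""
    return get_close_matches(typo, COMMANDS, n=1, cutoff=0.6)

print(suggest("servr"))  # ['server']
```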
Missing Workspace
Error: No active workspace
Please activate or create a workspace:
provisioning workspace init <name>
provisioning workspace switch <name>
Workspace enforcement prevents accidental operations.
Permission Denied
Error: Operation requires admin permissions
Please run with elevated privileges or contact administrator
Permission system prevents unauthorized operations.
See Also
- Workspace Management - Workspace-first approach
- Configuration System - Hierarchical configuration
- Interactive Guides - Step-by-step walkthroughs
- Batch Workflows - Multi-cloud orchestration
Nushell Plugins
Provisioning includes 17 high-performance native Rust plugins for Nushell, providing 10-50x speed improvements over HTTP APIs. Plugins handle critical functionality: templates, configuration, encryption, orchestration, and secrets management.
Overview
Performance Benefits
Plugins provide significant performance improvements for frequently-used operations:
| Plugin | Speed Improvement | Use Case |
|---|---|---|
| nu_plugin_tera | 10-15x faster | Template rendering |
| nu_plugin_nickel | 5-8x faster | Configuration processing |
| nu_plugin_orchestrator | 30-50x faster | Query orchestrator state |
| nu_plugin_kms | 10x faster | Encryption/decryption |
| nu_plugin_auth | 5x faster | Authentication operations |
Installation
All plugins install automatically with Provisioning:
# Automatic installation during setup
provisioning install
# Or manual installation
cd /path/to/provisioning
./scripts/install-plugins.nu
# Verify installation
provisioning plugins list
Plugin Management
# List installed plugins with versions
provisioning plugins list
# Check plugin status
provisioning plugins status
# Update all plugins
provisioning plugins update --all
# Update specific plugin
provisioning plugins update nu_plugin_tera
# Remove plugin
provisioning plugins remove nu_plugin_tera
Core Plugins (Priority)
1. nu_plugin_tera
Template Rendering Engine
Nushell plugin for Tera template processing (Jinja2-style syntax).
# Install
provisioning plugins install nu_plugin_tera
# Usage in Nushell
let template = "Hello {{ name }}!"
let context = { name: "World" }
$template | tera render $context
# Output: "Hello World!"
Features:
- Jinja2-compatible syntax
- Built-in filters and functions
- Template inheritance
- Macro support
- Custom filters via Rust
Performance: 10-15x faster than HTTP template service
Use Cases:
- Generating infrastructure configurations
- Creating dynamic scripts
- Building deployment templates
- Rendering documentation
Example: Generate infrastructure config:
let infra_template = "
{
servers = [
{% for server in servers %}
{
name = \"{{ server.name }}\"
cpu = {{ server.cpu }}
memory = {{ server.memory }}
}
{% if not loop.last %},{% endif %}
{% endfor %}
]
}
"
let servers = [
{ name: "web-01", cpu: 4, memory: 8 }
{ name: "web-02", cpu: 4, memory: 8 }
]
$infra_template | tera render { servers: $servers }
2. nu_plugin_nickel
Nickel Configuration Plugin
Native Nickel compilation and validation in Nushell.
# Install
provisioning plugins install nu_plugin_nickel
# Usage in Nushell
let nickel_code = '{ name = "server", cpu = 4 }'
$nickel_code | nickel eval
# Output: { name: "server", cpu: 4 }
Features:
- Parse and evaluate Nickel expressions
- Type checking and validation
- Schema enforcement
- Merge configurations
- Generate JSON/YAML output
Performance: 5-8x faster than CLI invocation
Use Cases:
- Validate infrastructure definitions
- Process Nickel schemas
- Merge configuration files
- Generate typed configurations
Example: Validate and merge configs:
let base_config = open base.ncl | nickel eval
let env_config = open prod-defaults.ncl | nickel eval
let merged = $base_config | nickel merge $env_config
$merged | nickel validate --schema infrastructure-schema.ncl
3. nu_plugin_fluent
Internationalization (i18n) Plugin
Fluent translation system for multi-language support.
# Install
provisioning plugins install nu_plugin_fluent
# Usage in Nushell
fluent load "./locales"
fluent set-locale "es-ES"
fluent get "help-infra-server-create"
# Output: "Crear un nuevo servidor"
Features:
- Load Fluent catalogs (.ftl files)
- Dynamic locale switching
- Pluralization support
- Fallback chains
- Translation coverage reports
Performance: Native Rust implementation, <1ms per translation
Use Cases:
- CLI help text in multiple languages
- Form labels and prompts
- Error messages
- Interactive guides
Supported Locales:
- en-US (English)
- es-ES (Spanish)
- pt-BR (Portuguese - planned)
- fr-FR (French - planned)
- ja-JP (Japanese - planned)
Example: Multi-language help system:
fluent load "provisioning/locales"
# Spanish help
fluent set-locale "es-ES"
fluent get "help-main-title" # "SISTEMA DE PROVISIÓN"
# English help (fallback)
fluent set-locale "fr-FR"
fluent get "help-main-title" # Falls back to "PROVISIONING SYSTEM"
4. nu_plugin_secretumvault
Post-Quantum Cryptography Vault
SecretumVault integration for quantum-resistant secret storage.
# Install
provisioning plugins install nu_plugin_secretumvault
# Usage in Nushell
secretumvault-plugin store "api-key" "secret-value"
let key = secretumvault-plugin retrieve "api-key"
secretumvault-plugin delete "api-key"
Features:
- CRYSTALS-Kyber encryption (post-quantum)
- Hybrid encryption (PQC + AES-256)
- Secure credential injection
- Key rotation
- Audit logging
Performance: <100ms for encrypt/decrypt operations
Use Cases:
- Store infrastructure credentials
- Manage API keys
- Handle database passwords
- Secure configuration values
Example: Secure credential management:
# Store credentials in vault
secretumvault-plugin store "aws-access-key" "AKIAIOSFODNN7EXAMPLE"
secretumvault-plugin store "aws-secret-key" "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
# Retrieve for use
let aws_key = secretumvault-plugin retrieve "aws-access-key"
provisioning aws configure --access-key $aws_key
Performance Plugins
5. nu_plugin_orchestrator
Orchestrator State Query Plugin
High-speed queries to orchestrator state and workflow data.
# Install
provisioning plugins install nu_plugin_orchestrator
# Usage in Nushell
orchestrator query workflows --filter status=running
orchestrator query tasks --limit 100
orchestrator query checkpoints --workflow deploy-k8s
Performance: 30-50x faster than HTTP API
Queries:
- Workflows (list, status, logs)
- Tasks (state, duration, dependencies)
- Checkpoints (recovery points)
- History (audit trail)
Example: Monitor running workflows:
let running = orchestrator query workflows --filter status=running
$running | each { |w|
print $"Workflow: ($w.name) - ($w.progress)%"
}
6. nu_plugin_kms
Key Management System (Encryption) Plugin
Fast encryption/decryption with KMS backends.
# Install
provisioning plugins install nu_plugin_kms
# Usage in Nushell
let encrypted = "secret-data" | kms encrypt --algorithm aes-256-gcm
$encrypted | kms decrypt
Performance: 10x faster than external KMS calls, 5ms encryption
Supported Algorithms:
- AES-256-GCM
- ChaCha20-Poly1305
- Kyber (post-quantum)
- Falcon (signatures)
Features:
- Symmetric encryption
- Key derivation (Argon2id, PBKDF2)
- Authenticated encryption
- HSM integration (optional)
Example: Encrypt infrastructure secrets:
let config = open infrastructure.ncl
let encrypted = $config | kms encrypt --key master-key
# Decrypt when needed
let decrypted = $encrypted | kms decrypt --key master-key
$decrypted | nickel eval
7. nu_plugin_auth
Authentication Plugin
Multi-method authentication with keyring integration.
# Install
provisioning plugins install nu_plugin_auth
# Usage in Nushell
let token = auth login --method jwt --provider openid
auth set-token $token
auth verify-token
Performance: 5x faster local authentication
Features:
- JWT token generation and validation
- OAuth2 support
- SAML support
- OS keyring integration
- MFA support
Methods:
- JWT (JSON Web Tokens)
- OAuth2 (GitHub, Google, Microsoft)
- SAML
- LDAP
- Local keyring
Example: Authenticate and store credentials:
# Login and get token
let token = auth login --method oauth2 --provider github
auth set-token $token --store-keyring
# Verify authentication
auth verify-token # Check if token valid
auth whoami # Show current user
Utility Plugins
8. nu_plugin_hashes
Cryptographic Hashing Plugin
Multiple hash algorithms for data integrity.
# Install
provisioning plugins install nu_plugin_hashes
# Usage in Nushell
"data" | hashes sha256
"data" | hashes blake3
Algorithms:
- SHA256, SHA512
- BLAKE3
- MD5 (legacy)
- SHA1 (legacy)
9. nu_plugin_highlight
Syntax Highlighting Plugin
Code syntax highlighting for display and logging.
# Install
provisioning plugins install nu_plugin_highlight
# Usage in Nushell
open script.sh | highlight --language bash
open config.ncl | highlight --language nickel
Languages:
- Bash/Shell
- Nickel
- YAML
- JSON
- Rust
- SQL
- Others
10. nu_plugin_image
Image Processing Plugin
Image manipulation and format conversion.
# Install
provisioning plugins install nu_plugin_image
# Usage in Nushell
open diagram.png | image resize --width 800 --height 600
open logo.jpg | image convert --format webp
Operations:
- Resize, crop, rotate
- Format conversion
- Compression
- Metadata extraction
11. nu_plugin_clipboard
Clipboard Management Plugin
Read/write system clipboard.
# Install
provisioning plugins install nu_plugin_clipboard
# Usage in Nushell
"api-key" | clipboard copy
clipboard paste
Features:
- Copy to clipboard
- Paste from clipboard
- Manage clipboard history
- Cross-platform support
12. nu_plugin_desktop_notifications
Desktop Notifications Plugin
System notifications for long-running operations.
# Install
provisioning plugins install nu_plugin_desktop_notifications
# Usage in Nushell
notifications notify "Deployment completed" --type success
notifications notify "Errors detected" --type error
Features:
- Success, warning, error notifications
- Custom titles and messages
- Sound alerts
13. nu_plugin_qr_maker
QR Code Generator Plugin
Generate QR codes for configuration sharing.
# Install
provisioning plugins install nu_plugin_qr_maker
# Usage in Nushell
" [https://example.com/config"](https://example.com/config") | qr-maker generate --output config.png
"workspace-setup-command" | qr-maker generate --ascii
14. nu_plugin_port_extension
Port/Network Utilities Plugin
Network port management and diagnostics.
# Install
provisioning plugins install nu_plugin_port_extension
# Usage in Nushell
port-extension list-open --port 8080
port-extension check-available --port 9000
Legacy/Secondary Plugins
15. nu_plugin_kcl
KCL Configuration Plugin (DEPRECATED)
Legacy KCL support (Nickel is preferred).
⚠️ Status: Deprecated - Use nu_plugin_nickel instead
# Install
provisioning plugins install nu_plugin_kcl
# Usage (not recommended)
let config = open config.kcl | kcl eval
16. api_nu_plugin_kcl
KCL API Plugin (DEPRECATED)
HTTP API wrapper for KCL.
⚠️ Status: Deprecated - Use nu_plugin_nickel instead
17. _nu_plugin_inquire (Historical)
Interactive Prompts Plugin (HISTORICAL)
Old inquiry/prompt system, replaced by TypeDialog.
⚠️ Status: Historical/archived
Plugin Installation & Management
Installation Methods
Automatic with Provisioning:
provisioning install
# Installs all recommended plugins automatically
Selective Installation:
# Install specific plugins
provisioning plugins install nu_plugin_tera nu_plugin_nickel nu_plugin_secretumvault
# Install plugin category
provisioning plugins install --category core # Essential plugins
provisioning plugins install --category performance # Performance plugins
provisioning plugins install --category utilities # Utility plugins
Manual Installation:
# Build and install from source
cd /path/to/provisioning/plugins/nushell-plugins/nu_plugin_tera
cargo install --path .
# Then load in Nushell
plugin add nu_plugin_tera
Configuration
Plugin Loading in Nushell:
# In env.nu or config.nu
plugin add nu_plugin_tera
plugin add nu_plugin_nickel
plugin add nu_plugin_secretumvault
plugin add nu_plugin_fluent
plugin add nu_plugin_auth
plugin add nu_plugin_kms
plugin add nu_plugin_orchestrator
# And more...
Plugin Status:
# Check all plugins
provisioning plugins list
# Check specific plugin
provisioning plugins status nu_plugin_tera
# Detailed information
provisioning plugins info nu_plugin_tera --verbose
Best Practices
Use Plugins When
- ✅ Processing large amounts of data (templates, config)
- ✅ Sensitive operations (encryption, secrets)
- ✅ Frequent operations (queries, auth)
- ✅ Performance critical paths
Fallback to HTTP API When
- ❌ Plugin not installed (automatic fallback)
- ❌ Older Nushell version incompatible
- ❌ Special features only in API
# Plugins have automatic fallback
# If nu_plugin_tera not available, uses HTTP API automatically
let template = "{{ name }}" | tera render { name: "test" }
# Works either way
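The try-plugin-then-fall-back-to-API behavior can be sketched as follows; the function names here are hypothetical stand-ins, not the platform's real API:

```python
# Illustrative plugin-with-HTTP-fallback pattern (hypothetical function names).
def render_via_plugin(template: str, ctx: dict) -> str:
    raise RuntimeError("plugin not installed")  # simulate a missing plugin

def render_via_http(template: str, ctx: dict) -> str:
    # trivial stand-in for the HTTP template service
    return template.replace("{{ name }}", ctx["name"])

def render(template: str, ctx: dict) -> str:
    try:
        return render_via_plugin(template, ctx)   # fast native path
    except RuntimeError:
        return render_via_http(template, ctx)     # automatic fallback

print(render("Hello {{ name }}!", {"name": "test"}))  # Hello test!
```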
Troubleshooting
Plugin Not Loading
# Reload Nushell
nu
# Check plugin errors
plugin list --debug
# Reinstall plugin
provisioning plugins remove nu_plugin_tera
provisioning plugins install nu_plugin_tera
Performance Issues
# Check plugin status
provisioning plugins status
# Monitor plugin usage
provisioning monitor plugins
# Profile plugin calls
provisioning profile nu_plugin_tera
Related Documentation
- Features Overview - Feature list
- Nushell Libraries - Core libraries
- CLI Architecture - Command dispatch
- Performance Optimization - Monitoring
Multilingual Support
Provisioning includes comprehensive multilingual support for help text, forms, and interactive interfaces. The system uses Mozilla Fluent for translations with automatic fallback chains.
Supported Languages
Language support status (critical locales have 100% translation coverage):
| Language | Locale | Status | Strings |
|---|---|---|---|
| English (US) | en-US | ✅ Complete | 245 |
| Spanish (Spain) | es-ES | ✅ Complete | 245 |
| Portuguese (Brazil) | pt-BR | 🔄 Planned | - |
| French (France) | fr-FR | 🔄 Planned | - |
| Japanese (Japan) | ja-JP | 🔄 Planned | - |
Coverage Requirement: 95% of strings translated to critical locales (en-US, es-ES).
Using Different Languages
Setting Language via Environment Variable
Select language using the LANG environment variable:
# English (default)
provisioning help infrastructure
# Spanish
LANG=es_ES provisioning help infrastructure
# Fallback to English if locale not available
LANG=fr_FR provisioning help infrastructure
# Output: English (en-US) [fallback chain]
Locale Resolution
Language selection follows this order:
1. Check the LANG environment variable (e.g., es_ES)
2. Match it to a configured locale (es-ES)
3. If not found, follow the fallback chain (es-ES → en-US)
4. Default to en-US if no match
Format note: LANG uses underscores (es_ES) while locales use hyphens (es-ES); the system converts between them automatically.
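The LANG-to-locale conversion amounts to dropping the encoding suffix and swapping the underscore for a hyphen. A minimal sketch (illustrative, not the platform's code):

```python
# Illustrative LANG -> locale normalization.
def lang_to_locale(lang: str) -> str:
    """Convert 'es_ES.UTF-8' or 'es_ES' to the 'es-ES' locale form."""
    base = lang.split(".")[0]          # drop encoding suffix (e.g. .UTF-8)
    return base.replace("_", "-")      # underscore -> hyphen

print(lang_to_locale("es_ES.UTF-8"))  # es-ES
```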
Translation System Architecture
Mozilla Fluent Format
All translations use Mozilla Fluent (.ftl files), which provides:
- Simple Syntax: Key-value pairs with rich formatting
- Pluralization: Support for language-specific plural rules
- Attributes: Multiple values per key for contextual translation
- Automatic Fallback: Chain resolution when keys missing
- Extensibility: Support for custom formatting functions
Example Fluent syntax:
help-infra-server-create = Create a new server
form-database_type-option-postgres = PostgreSQL (Recommended)
form-replicas-prompt = Number of replicas
form-replicas-help = How many replicas to run
File Organization
provisioning/locales/
├── i18n-config.toml # Central i18n configuration
├── en-US/ # English base language
│ ├── help.ftl # Help system strings (65 keys)
│ └── forms.ftl # Form strings (180 keys)
└── es-ES/ # Spanish translations
├── help.ftl # Help system translations
└── forms.ftl # Form translations
String Categories:
- help.ftl (65 strings): Help text, menu items, category descriptions, error messages
- forms.ftl (180 strings): Form labels, placeholders, help text, options
Help System Translations
Help system provides multi-language support for all command categories:
Categories Covered
| Category | Coverage | Example Keys |
|---|---|---|
| Infrastructure | ✅ 21 strings | server commands, taskserv, clusters, VMs |
| Orchestration | ✅ 18 strings | workflows, batch operations, orchestrator |
| Workspace | ✅ Complete | workspace management, templates |
| Setup | ✅ Complete | system configuration, initialization |
| Authentication | ✅ Complete | JWT, MFA, sessions |
| Platform | ✅ Complete | services, Control Center, MCP |
| Development | ✅ Complete | modules, versions, plugins |
| Utilities | ✅ Complete | providers, SOPS, SSH |
Example: Help Output in Spanish
$ LANG=es_ES provisioning help infrastructure
SERVIDOR E INFRAESTRUCTURA
Gestión de servidores, taskserv, clusters, VM e infraestructura.
COMANDOS DE SERVIDOR
server create Crear un nuevo servidor
server delete Eliminar un servidor existente
server list Listar todos los servidores
server status Ver estado de un servidor
COMANDOS DE TASKSERV
taskserv create Crear un nuevo servicio de tarea
taskserv delete Eliminar un servicio de tarea
taskserv configure Configurar un servicio de tarea
taskserv status Ver estado del servicio de tarea
Form Translations (TypeDialog Integration)
Interactive forms automatically use the selected language:
Setup Form
Project information, database configuration, API settings, deployment options, security, etc.
# English form
$ provisioning setup profile
📦 Project name: [my-app]
# Spanish form
$ LANG=es_ES provisioning setup profile
📦 Nombre del proyecto: [mi-app]
Translated Form Fields
Each form field has four translated strings:
| Component | Purpose | Example en-US | Example es-ES |
|---|---|---|---|
| prompt | Field label | “Project name” | “Nombre del proyecto” |
| help | Helper text | “Project name (lowercase alphanumeric with hyphens)” | “Nombre del proyecto (minúsculas alfanuméricas con guiones)” |
| placeholder | Example value | “my-app” | “mi-app” |
| option | Dropdown choice | “PostgreSQL (Recommended)” | “PostgreSQL (Recomendado)” |
Supported Forms
- Unified Setup: Project info, database, API, deployment, security, terms
- Authentication: Login form (username, password, remember me, forgot password)
- Setup Wizard: Quick/standard/advanced modes
- MFA Enrollment: TOTP, SMS, backup codes, device management
- Infrastructure: Delete confirmations, resource prompts, data retention
Fallback Chain Configuration
When a translation string is missing, the system automatically falls back to the parent locale:
# From i18n-config.toml
[fallback_chains]
es-ES = ["en-US"]
pt-BR = ["pt-PT", "es-ES", "en-US"]
fr-FR = ["en-US"]
ja-JP = ["en-US"]
Resolution Example:
1. User requests Spanish (es-ES): provisioning help
2. Look for the string in es-ES/help.ftl
3. If missing, fall back to en-US (help-infra-server-create = "Create a new server")
4. If still missing, use the literal key name as display text
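Fallback resolution can be modeled as walking the configured chain until some catalog contains the key. This Python sketch uses the chains shown in i18n-config.toml; the catalog contents are toy data:

```python
# Illustrative fallback-chain resolver (catalog contents are toy data).
FALLBACK = {"es-ES": ["en-US"], "pt-BR": ["pt-PT", "es-ES", "en-US"]}
CATALOGS = {
    "en-US": {"help-infra-server-create": "Create a new server"},
    "es-ES": {},  # key missing here, so lookup falls through the chain
}

def resolve(locale: str, key: str) -> str:
    """Walk the requested locale, then its fallback chain."""
    for loc in [locale] + FALLBACK.get(locale, []):
        value = CATALOGS.get(loc, {}).get(key)
        if value is not None:
            return value
    return key  # last resort: literal key name as display text

print(resolve("es-ES", "help-infra-server-create"))  # Create a new server
```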
Adding New Languages
1. Add Locale Configuration
Edit provisioning/locales/i18n-config.toml:
[locales.pt-BR]
name = "Portuguese (Brazil)"
direction = "ltr"
plurals = 2
decimal_separator = ","
thousands_separator = "."
date_format = "DD/MM/YYYY"
[fallback_chains]
pt-BR = ["pt-PT", "es-ES", "en-US"]
Configuration Fields:
- name: Display name of locale
- direction: Text direction (ltr/rtl)
- plurals: Number of plural forms (1-6 depending on language)
- decimal_separator: Locale-specific decimal format
- thousands_separator: Number formatting
- date_format: Locale-specific date format
- currency_symbol: Currency symbol (optional)
- currency_position: “prefix” or “suffix” (optional)
2. Create Locale Directory
mkdir -p provisioning/locales/pt-BR
3. Create Translation Files
Copy English files as base:
cp provisioning/locales/en-US/help.ftl provisioning/locales/pt-BR/help.ftl
cp provisioning/locales/en-US/forms.ftl provisioning/locales/pt-BR/forms.ftl
4. Translate Strings
Edit pt-BR/help.ftl and pt-BR/forms.ftl with translated content. Follow naming conventions:
# Help strings: help-{category}-{element}
help-infra-server-create = Criar um novo servidor
# Form prompts: form-{element}-prompt
form-project_name-prompt = Nome do projeto
# Form help: form-{element}-help
form-project_name-help = Nome do projeto (alfanumérico minúsculo com hífens)
# Form options: form-{element}-option-{value}
form-database_type-option-postgres = PostgreSQL (Recomendado)
5. Validate Translation
Check coverage and syntax:
# Validate Fluent file syntax
provisioning i18n validate --locale pt-BR
# Check translation coverage
provisioning i18n coverage --locale pt-BR
# List missing translations
provisioning i18n missing --locale pt-BR
6. Update Documentation
Document new language support in translations_status.md.
Validation & Quality Standards
Translation Quality Rules
Naming Conventions (REQUIRED):
- Help strings: help-{category}-{element} (e.g., help-infra-server-create)
- Form prompts: form-{element}-prompt (e.g., form-project_name-prompt)
- Form help: form-{element}-help (e.g., form-project_name-help)
- Form placeholders: form-{element}-placeholder
- Form options: form-{element}-option-{value} (e.g., form-database_type-option-postgres)
- Section headers: section-{name}-title
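The naming conventions above can be checked mechanically. This is a hypothetical validator sketch, not part of the platform:

```python
# Hypothetical check that Fluent keys follow the documented naming conventions.
import re

PATTERNS = [
    r"^help-[a-z0-9]+(-[a-z0-9_]+)+$",                                # help strings
    r"^form-[a-z0-9_]+-(prompt|help|placeholder|option-[a-z0-9_]+)$", # form strings
    r"^section-[a-z0-9_]+-title$",                                    # section headers
]

def is_valid_key(key: str) -> bool:
    return any(re.match(p, key) for p in PATTERNS)

print(is_valid_key("form-project_name-prompt"))  # True
print(is_valid_key("projectNamePrompt"))         # False
```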
Coverage Requirements:
- Critical Locales: en-US, es-ES require 95% minimum coverage
- Warning Threshold: 80% triggers warnings during build
- Incomplete Locales: 0% coverage allowed (inherit via fallback chain)
Testing Localization
Test translations via different methods:
# Test help system in Spanish
LANG=es_ES provisioning help infrastructure
# Test form display in Spanish
LANG=es_ES provisioning setup profile
# Validate all translation files
provisioning i18n validate --all
# Generate coverage report
provisioning i18n coverage --format=json > coverage.json
Implementation Details
TypeDialog Integration
TypeDialog forms reference Fluent keys via locales_path configuration:
# In form.toml
locales_path = "../../../locales"
[[elements]]
name = "project_name"
prompt = "form-project_name-prompt" # References: locales/*/forms.ftl
help = "form-project_name-help"
placeholder = "form-project_name-placeholder"
Resolution Process:
1. Read `locales_path` from the form configuration
2. Check the `LANG` environment variable (converted to locale format: es_ES → es-ES)
3. Load the Fluent file (e.g., `locales/es-ES/forms.ftl`)
4. Resolve string key → value
5. If the key is missing, follow the fallback chain
6. If still missing, use the literal key name
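The resolution steps above can be sketched in shell. Here `lookup_ftl` is a stand-in for the real Fluent catalog lookup, hard-coded with two entries for illustration; the key name is assumed, not taken from the actual catalogs:

```shell
# Convert a LANG value such as es_ES.UTF-8 to the es-ES locale format
lang_to_locale() {
  echo "${1%%.*}" | tr '_' '-'
}

# Stand-in catalog lookup; real data comes from locales/<locale>/*.ftl
lookup_ftl() {
  case "$1:$2" in
    es-ES:help-infrastructure-title) echo "Infraestructura" ;;
    en-US:help-infrastructure-title) echo "Infrastructure" ;;
    *) return 1 ;;
  esac
}

# Resolve a key: requested locale, then the en-US fallback, then the literal key
resolve_key() {
  local key=$1 locale=$2 value
  for loc in "$locale" en-US; do
    if value=$(lookup_ftl "$loc" "$key"); then
      echo "$value"; return
    fi
  done
  echo "$key"
}
```

With this sketch, `resolve_key help-infrastructure-title pt-BR` falls back to the English string, and an unknown key resolves to its own name.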
Help System Integration
The help system uses a Fluent catalog loader in provisioning/core/nulib/main_provisioning/help_system.nu:
# Load help strings for current locale
let help_strings = (load_fluent_catalog $locale)
# Display localized help text
print ($help_strings | get help-infrastructure-title)
Maintenance
Adding New Translations
When new help text or forms are added:
1. Add English strings to `en-US/help.ftl` or `en-US/forms.ftl`
2. Add Spanish translations to `es-ES/help.ftl` or `es-ES/forms.ftl`
3. Run validation: `provisioning i18n validate`
4. Update `translations_status.md` with the new counts
5. If coverage drops below 95%, fix before release
Updating Existing Translations
To modify an existing translated string:
1. Edit the key in `en-US/*.ftl` and in all locale-specific files
2. Run validation to ensure consistency
3. Test in both languages: `LANG=en_US provisioning help` and `LANG=es_ES provisioning help`
Current Translation Status
Last Updated: 2026-01-13 | Status: 100% Complete
String Count
| Component | en-US | es-ES | Status |
|---|---|---|---|
| Help System | 65 | 65 | ✅ Complete |
| Forms | 180 | 180 | ✅ Complete |
| Total | 245 | 245 | ✅ Complete |
Features Enabled
| Feature | Status | Purpose |
|---|---|---|
| Pluralization | ✅ Enabled | Support language-specific plural rules |
| Number Formatting | ✅ Enabled | Locale-specific number/currency formatting |
| Date Formatting | ✅ Enabled | Locale-specific date display |
| Fallback Chains | ✅ Enabled | Automatic fallback to English |
| Gender Agreement | ⚠️ Disabled | Not needed for Spanish help strings |
| RTL Support | ⚠️ Disabled | No RTL languages configured yet |
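With pluralization enabled, Fluent picks a variant using the locale's plural rules via selector syntax. A hypothetical catalog entry showing the shape (the key name is illustrative, not an actual key in `help.ftl` or `forms.ftl`):

```ftl
# Hypothetical example of a Fluent plural selector
servers-found = { $count ->
    [one] Found { $count } server
   *[other] Found { $count } servers
}
```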
Related Documentation
- System Setup - Configure Provisioning after installation
- Workspace Management - Workspace configuration and usage
- Design Principles - Architecture and design
- API Reference - CLI commands and help system
Operations
Production deployment, monitoring, maintenance, and operational best practices for running Provisioning infrastructure at scale.
Overview
This section covers everything needed to operate Provisioning in production:
- Deployment strategies - Single-cloud, multi-cloud, hybrid with zero-downtime updates
- Service management - Microservice lifecycle, scaling, health checks, failover
- Observability - Metrics (Prometheus), logs (ELK), traces (Jaeger), dashboards
- Incident response - Detection, triage, remediation, postmortem automation
- Backup & recovery - Strategies, testing, disaster recovery, point-in-time restore
- Performance optimization - Profiling, caching, scaling, resource optimization
- Troubleshooting - Debugging, log analysis, diagnostic tools, support
Operational Guides
Deployment and Management
- Deployment Modes - Single-cloud, multi-cloud, hybrid, canary, blue-green, rolling updates with zero downtime.
- Service Management - Microservice lifecycle, scaling policies, health checks, graceful shutdown, rolling restarts.
- Platform Installer - TUI and unattended installation, provider setup, workspace creation, post-install configuration.
Monitoring and Observability
- Monitoring Setup - Prometheus metrics, Grafana dashboards, alerting rules, SLO monitoring across 12 microservices.
- Logging and Analysis - Centralized logging with the ELK Stack, log aggregation, filtering, searching, performance analysis.
- Distributed Tracing - Jaeger integration, span collection, trace visualization, latency analysis across microservices.
Resilience and Recovery
- Incident Response - Severity levels, triage, investigation, mitigation, escalation, postmortems.
- Backup Strategies - Full, incremental, and PITR backups with RTO/RPO targets, testing procedures, recovery workflows.
- Disaster Recovery - DR planning, failover procedures, failback strategies, RTO/RPO targets, testing schedules.
- Performance Optimization - Profiling, bottlenecks, caching strategies, connection pooling, right-sizing.
Troubleshooting
- Troubleshooting Guide - Common issues, debugging techniques, log analysis, diagnostic tools, support resources.
- Platform Health - Health check procedures, system status, component status, SLO metrics, error budgets.
Operational Workflows
I’m deploying to production
Follow: Deployment Modes → Service Management → Monitoring Setup
I need to monitor infrastructure
Setup: Monitoring Setup for metrics, Logging and Analysis for logs, Distributed Tracing for traces
I’m experiencing an incident
Execute: Incident Response with triage, investigation, mitigation, escalation
I need to backup and recover
Implement: Backup Strategies with testing, Disaster Recovery for major outages
I need to optimize performance
Follow: Performance Optimization for profiling and tuning
I need help troubleshooting
Consult: Troubleshooting Guide for common issues and solutions
Deployment Architecture
Development
↓
Staging (test all)
↓
Canary (1% traffic)
↓
Rolling (increase % gradually)
↓
Production (100%)
SLO Targets
| Service | Availability | P99 Latency | Error Budget |
|---|---|---|---|
| API Gateway | 99.99% | <100ms | 4m 26s/month |
| Orchestrator | 99.9% | <500ms | 43m 46s/month |
| Control-Center | 99.95% | <300ms | 21m 56s/month |
| Detector | 99.5% | <2s | 3h 36m/month |
| All Others | 99.9% | <1s | 43m 46s/month |
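The error-budget column follows directly from the availability target. A sketch of the arithmetic, assuming a 30-day (43,200-minute) month; the table's figures appear to use an average month length, so they differ slightly:

```shell
# Monthly error budget in minutes for an availability target (30-day month)
error_budget_minutes() {
  awk -v a="$1" 'BEGIN { printf "%.1f", (100 - a) / 100 * 43200 }'
}
```

For example, `error_budget_minutes 99.5` yields 216.0 minutes, i.e. the 3h 36m budget shown for the Detector service.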
Monitoring Stack
- Metrics - Prometheus (15s scrape interval, 15d retention)
- Logs - ELK Stack (Elasticsearch, Logstash, Kibana) with 30d retention
- Traces - Jaeger (sampling 10%, 24h retention)
- Dashboards - Grafana with pre-built dashboards per microservice
- Alerting - AlertManager with escalation rules and notification channels
Operational Commands
# Check system health
provisioning status health
# View metrics
provisioning metrics view --service orchestrator
# Check SLO status
provisioning slo status
# Run diagnostics
provisioning diagnose system
# Backup infrastructure
provisioning backup create --name daily-$(date +%Y%m%d)
# Restore from backup
provisioning backup restore --backup-id backup-id
Related Documentation
- Architecture → See `provisioning/docs/src/architecture/`
- Features → See `provisioning/docs/src/features/`
- Development → See `provisioning/docs/src/development/`
- Security → See `provisioning/docs/src/security/`
- Examples → See `provisioning/docs/src/examples/`
Deployment Modes
The Provisioning platform supports three deployment modes designed for different operational contexts: interactive TUI for guided setup, headless CLI for automation, and unattended mode for CI/CD pipelines.
Overview
Deployment modes determine how the platform installer and orchestrator interact with the environment:
| Mode | Use Case | User Interaction | Configuration | Rollback |
|---|---|---|---|---|
| Interactive TUI | First-time setup, exploration | Full interactive terminal UI | Guided wizard | Manual intervention |
| Headless CLI | Scripted automation | Command-line flags only | Pre-configured files | Automatic checkpoint |
| Unattended | CI/CD pipelines | Zero interaction | Config file required | Automatic rollback |
Interactive TUI Mode
Beautiful terminal user interface for guided platform installation and configuration.
When to Use
- First-time platform installation
- Exploring configuration options
- Learning platform features
- Development and testing environments
- Manual infrastructure provisioning
Features
Seven interactive screens with real-time validation:
- Welcome Screen - Platform overview and prerequisites check
- Deployment Mode Selection - Solo, MultiUser, CICD, Enterprise
- Component Selection - Choose platform services to install
- Configuration Builder - Interactive settings editor
- Provider Setup - Cloud provider credentials and configuration
- Review and Confirm - Summary before installation
- Installation Progress - Real-time tracking with checkpoint recovery
Starting Interactive Mode
# Launch interactive installer
provisioning-installer
# Or via main CLI
provisioning install --mode tui
Navigation
Tab/Shift+Tab - Navigate fields
Enter - Select/confirm
Esc - Cancel/go back
Arrow keys - Navigate lists
Space - Toggle checkboxes
Ctrl+C - Exit installer
Headless CLI Mode
Command-line interface for scripted automation without graphical interface.
When to Use
- Automated deployment scripts
- Remote server installation via SSH
- Reproducible infrastructure provisioning
- Configuration management systems
- Batch deployments across multiple servers
Features
- Non-interactive installation
- Configuration via command-line flags
- Pre-validation of all inputs
- Structured JSON/YAML output
- Exit codes for script integration
- Checkpoint-based recovery
Command Syntax
provisioning-installer --headless \
--mode <solo|multiuser|cicd|enterprise> \
--components <comma-separated-list> \
--storage-path <path> \
--database <backend> \
--log-level <level> \
[--yes] \
[--config <file>]
Example Deployments
Solo developer setup:
provisioning-installer --headless \
--mode solo \
--components orchestrator,control-center \
--yes
CI/CD pipeline deployment:
provisioning-installer --headless \
--mode cicd \
--components orchestrator,vault-service \
--database surrealdb \
--yes
Enterprise production deployment:
provisioning-installer --headless \
--mode enterprise \
--config /etc/provisioning/enterprise.toml \
--yes
Unattended Mode
Zero-interaction deployment for fully automated CI/CD pipelines.
When to Use
- Continuous integration pipelines
- Continuous deployment workflows
- Infrastructure as Code provisioning
- Automated testing environments
- Container image builds
- Cloud instance initialization
Requirements
- Configuration file must exist and be valid
- All required dependencies must be installed
- Sufficient system resources must be available
- Network connectivity to required services
- Appropriate file system permissions
Command Syntax
provisioning-installer --unattended --config <config-file>
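A configuration file for unattended mode might look like the following. The key names are assumptions inferred from the headless flags above, not a documented schema; verify them against your installer version:

```toml
# ci-config.toml (hypothetical; key names inferred from the headless CLI flags)
mode = "cicd"
components = ["orchestrator", "vault-service"]
storage_path = "/var/lib/provisioning"
database = "surrealdb"
log_level = "info"
```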
Example CI/CD Integrations
GitHub Actions workflow:
name: Deploy Provisioning Platform
on:
push:
branches: [main]
jobs:
deploy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Install prerequisites
run: |
curl -sSL https://install.nushell.sh | sh
curl -sSL https://install.nickel-lang.org | sh
- name: Deploy provisioning platform
env:
PROVISIONING_DB_PASSWORD: ${{ secrets.DB_PASSWORD }}
UPCLOUD_API_TOKEN: ${{ secrets.UPCLOUD_TOKEN }}
run: |
provisioning-installer --unattended --config ci-config.toml
- name: Verify deployment
run: |
curl -f http://localhost:8080/health || exit 1
Resource Requirements by Mode
Solo Mode
- Minimum: 2 CPU, 4GB RAM, 20GB disk
- Recommended: 4 CPU, 8GB RAM, 50GB disk
MultiUser Mode
- Minimum: 4 CPU, 8GB RAM, 50GB disk
- Recommended: 8 CPU, 16GB RAM, 100GB disk
CICD Mode
- Minimum: 8 CPU, 16GB RAM, 100GB disk
- Recommended: 16 CPU, 32GB RAM, 500GB disk
Enterprise Mode
- Minimum: 16 CPU, 32GB RAM, 500GB disk
- Recommended: 32+ CPU, 64GB+ RAM, 1TB+ disk
Choosing the Right Mode
| Scenario | Recommended Mode | Rationale |
|---|---|---|
| First-time installation | Interactive TUI | Guided setup with validation |
| Manual production setup | Interactive TUI | Review all settings before deployment |
| Ansible playbook | Headless CLI | Scriptable without GUI |
| Remote server via SSH | Headless CLI | Works without terminal UI |
| GitHub Actions | Unattended | Zero interaction, strict validation |
| Docker image build | Unattended | Non-interactive environment |
Best Practices
Interactive TUI Mode
- Review all configuration screens carefully
- Save configuration for later reuse
- Document custom settings
Headless CLI Mode
- Test the configuration in a development environment first
- Use the `--check` flag for dry-run validation
- Store configurations in version control
- Use environment variables for sensitive data
Unattended Mode
- Validate configuration files extensively before CI/CD deployment
- Test rollback behavior in non-production environments
- Monitor installation logs in real-time
- Set up alerting for installation failures
- Use idempotent operations to allow retry
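The idempotency point above matters because a retry wrapper is the usual failure-handling pattern in CI. A minimal sketch; the function name and retry count are illustrative:

```shell
# Retry a command up to N times; only safe if the command is idempotent
retry() {
  local attempts=$1; shift
  local i
  for i in $(seq 1 "$attempts"); do
    "$@" && return 0
    echo "attempt $i/$attempts failed (exit $?)" >&2
    sleep 1
  done
  return 1
}

# Example:
# retry 3 provisioning-installer --unattended --config ci-config.toml
```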
Related Documentation
- Service Management - Managing installed services
- Platform Health - Monitoring platform status
- Troubleshooting - Debugging deployment issues
Service Management
Managing the nine core platform services that power the Provisioning infrastructure automation platform.
Platform Services Overview
The platform consists of nine microservices providing execution, management, and supporting infrastructure:
| Service | Purpose | Port | Language | Status |
|---|---|---|---|---|
| orchestrator | Workflow execution and task scheduling | 8080 | Rust + Nushell | Production |
| control-center | Backend management API with RBAC | 8081 | Rust | Production |
| control-center-ui | Web-based management interface | 8082 | Web | Production |
| mcp-server | AI-powered configuration assistance | 8083 | Nushell | Active |
| ai-service | Machine learning and anomaly detection | 8084 | Rust | Active |
| vault-service | Secrets management and KMS | 8085 | Rust | Production |
| extension-registry | OCI registry for extensions | 8086 | Rust | Planned |
| api-gateway | Unified REST API routing | 8087 | Rust | Planned |
| provisioning-daemon | Background service coordination | 8088 | Rust | Development |
Service Lifecycle Management
Starting Services
Systemd management (production):
# Start individual service
sudo systemctl start provisioning-orchestrator
# Start all platform services
sudo systemctl start provisioning-*
# Enable automatic start on boot
sudo systemctl enable provisioning-orchestrator
sudo systemctl enable provisioning-control-center
sudo systemctl enable provisioning-vault-service
Manual start (development):
# Orchestrator
cd provisioning/platform/crates/orchestrator
cargo run --release
# Control Center
cd provisioning/platform/crates/control-center
cargo run --release
# MCP Server
cd provisioning/platform/crates/mcp-server
nu run.nu
Stopping Services
# Stop individual service
sudo systemctl stop provisioning-orchestrator
# Stop all platform services
sudo systemctl stop provisioning-*
# Graceful shutdown (the stop timeout is set via TimeoutStopSec= in the unit file,
# e.g. TimeoutStopSec=30 under [Service])
sudo systemctl stop provisioning-orchestrator
Restarting Services
# Restart after configuration changes
sudo systemctl restart provisioning-orchestrator
# Reload configuration without restart
sudo systemctl reload provisioning-control-center
Checking Service Status
# Status of all services
systemctl status provisioning-*
# Detailed status
provisioning platform status
# Health check endpoints
curl http://localhost:8080/health  # Orchestrator
curl http://localhost:8081/health  # Control Center
curl http://localhost:8085/health  # Vault Service
Service Configuration
Configuration Files
Each service reads configuration from hierarchical sources:
/etc/provisioning/config.toml # System defaults
~/.config/provisioning/user_config.yaml # User overrides
workspace/config/provisioning.yaml # Workspace config
Orchestrator Configuration
# /etc/provisioning/orchestrator.toml
[server]
host = "0.0.0.0"
port = 8080
workers = 8
[storage]
persistence_dir = "/var/lib/provisioning/orchestrator"
checkpoint_interval = 30
[execution]
max_parallel_tasks = 100
retry_attempts = 3
retry_backoff = "exponential"
[api]
enable_rest = true
enable_grpc = false
auth_required = true
Control Center Configuration
# /etc/provisioning/control-center.toml
[server]
host = "0.0.0.0"
port = 8081
[auth]
jwt_algorithm = "RS256"
access_token_ttl = 900
refresh_token_ttl = 604800
[rbac]
policy_dir = "/etc/provisioning/policies"
reload_interval = 60
Vault Service Configuration
# /etc/provisioning/vault-service.toml
[vault]
backend = "secretumvault"
url = "http://localhost:8200"
token_env = "VAULT_TOKEN"
[kms]
envelope_encryption = true
key_rotation_days = 90
Service Dependencies
Understanding service dependencies for proper startup order:
Database (SurrealDB)
↓
orchestrator (requires database)
↓
vault-service (requires orchestrator)
↓
control-center (requires orchestrator + vault)
↓
control-center-ui (requires control-center)
↓
mcp-server (requires control-center)
↓
ai-service (requires mcp-server)
Systemd handles dependencies automatically:
# /etc/systemd/system/provisioning-control-center.service
[Unit]
Description=Provisioning Control Center
After=provisioning-orchestrator.service
Requires=provisioning-orchestrator.service
Service Health Monitoring
Health Check Endpoints
All services expose /health endpoints:
# Check orchestrator health
curl http://localhost:8080/health
# Expected response
{
"status": "healthy",
"version": "5.0.0",
"uptime_seconds": 3600,
"database": "connected",
"active_workflows": 5,
"queued_tasks": 12
}
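For scripting against the endpoint, the `status` field can be pulled out without jq. A sed-based sketch that assumes the JSON shape shown above; the helper name is illustrative:

```shell
# Extract the "status" value from a /health response body
health_status() {
  printf '%s' "$1" | sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Example:
# body=$(curl -s http://localhost:8080/health)
# [ "$(health_status "$body")" = "healthy" ] || echo "orchestrator unhealthy" >&2
```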
Automated Health Monitoring
Use systemd watchdog for automatic restart on failure:
# /etc/systemd/system/provisioning-orchestrator.service
[Service]
WatchdogSec=30
Restart=on-failure
RestartSec=10
Monitor with provisioning CLI:
# Continuous health monitoring
provisioning platform monitor --interval 5
# Alert on unhealthy services
provisioning platform monitor --alert-email ops@example.com
Log Management
Log Locations
Systemd services log to journald:
# View orchestrator logs
sudo journalctl -u provisioning-orchestrator -f
# View last hour of logs
sudo journalctl -u provisioning-orchestrator --since "1 hour ago"
# View errors only
sudo journalctl -u provisioning-orchestrator -p err
# Export logs to file
sudo journalctl -u provisioning-* > platform-logs.txt
File-based logs:
/var/log/provisioning/orchestrator.log
/var/log/provisioning/control-center.log
/var/log/provisioning/vault-service.log
Log Rotation
Configure logrotate for file-based logs:
# /etc/logrotate.d/provisioning
/var/log/provisioning/*.log {
daily
rotate 30
compress
delaycompress
missingok
notifempty
create 0644 provisioning provisioning
sharedscripts
postrotate
systemctl reload provisioning-* || true
endscript
}
Log Levels
Configure log verbosity:
# Set log level via environment
export PROVISIONING_LOG_LEVEL=debug
sudo systemctl restart provisioning-orchestrator
# Or in configuration
provisioning config set logging.level debug
Log levels: trace, debug, info, warn, error
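These five levels form an ordered severity scale, so a filter only needs a numeric comparison. A sketch; the helper names are illustrative:

```shell
# Map a level name to numeric severity (trace lowest, error highest)
level_num() {
  case "$1" in
    trace) echo 0 ;;
    debug) echo 1 ;;
    info)  echo 2 ;;
    warn)  echo 3 ;;
    error) echo 4 ;;
    *) return 1 ;;
  esac
}

# A message is emitted when its level is at or above the configured level
should_log() {  # should_log <message-level> <configured-level>
  [ "$(level_num "$1")" -ge "$(level_num "$2")" ]
}
```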
Performance Tuning
Orchestrator Performance
Adjust worker threads and task limits:
[execution]
max_parallel_tasks = 200 # Increase for high throughput
worker_threads = 16 # Match CPU cores
task_queue_size = 1000
[performance]
enable_metrics = true
metrics_interval = 10
Database Connection Pooling
[database]
max_connections = 100
min_connections = 10
connection_timeout = 30
idle_timeout = 600
Memory Limits
Set memory limits via systemd:
[Service]
MemoryMax=4G
MemoryHigh=3G
Service Updates and Upgrades
Zero-Downtime Upgrades
Rolling upgrade procedure:
# 1. Back up the current binary and install the new version
sudo cp /usr/local/bin/provisioning-orchestrator /usr/local/bin/provisioning-orchestrator.backup
sudo cp provisioning-orchestrator /usr/local/bin/provisioning-orchestrator
# 2. Reload systemd unit definitions
sudo systemctl daemon-reload
# 3. Graceful restart to pick up the new binary
sudo systemctl restart provisioning-orchestrator
Version Management
Check running versions:
provisioning platform versions
# Output:
# orchestrator: 5.0.0
# control-center: 5.0.0
# vault-service: 4.0.0
Rollback Procedure
# 1. Stop new version
sudo systemctl stop provisioning-orchestrator
# 2. Restore previous binary
sudo cp /usr/local/bin/provisioning-orchestrator.backup \
/usr/local/bin/provisioning-orchestrator
# 3. Start service with previous version
sudo systemctl start provisioning-orchestrator
Security Hardening
Service Isolation
Run services with dedicated users:
# Create service user
sudo useradd -r -s /usr/sbin/nologin provisioning
# Set ownership
sudo chown -R provisioning:provisioning /var/lib/provisioning
sudo chown -R provisioning:provisioning /etc/provisioning
Systemd service configuration:
[Service]
User=provisioning
Group=provisioning
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ProtectHome=true
Network Security
Restrict service access with firewall:
# Allow only localhost access
sudo ufw allow from 127.0.0.1 to any port 8080
sudo ufw allow from 127.0.0.1 to any port 8081
# Or use systemd socket activation
Troubleshooting Services
Service Won’t Start
Check service status and logs:
systemctl status provisioning-orchestrator
journalctl -u provisioning-orchestrator -n 100
Common issues:
- Port already in use: check with `lsof -i :8080`
- Configuration error: validate with `provisioning validate config`
- Missing dependencies: check with `ldd /usr/local/bin/provisioning-orchestrator`
- Permission issues: verify file ownership
High Resource Usage
Monitor resource consumption:
# CPU and memory usage
systemctl status provisioning-orchestrator
# Detailed metrics
provisioning platform metrics --service orchestrator
Adjust limits:
# Increase memory limit
sudo systemctl set-property provisioning-orchestrator MemoryMax=8G
# Reduce parallel tasks
provisioning config set execution.max_parallel_tasks 50
sudo systemctl restart provisioning-orchestrator
Service Crashes
Enable core dumps for debugging:
# Enable core dumps
sudo sysctl -w kernel.core_pattern=/var/crash/core.%e.%p
ulimit -c unlimited
# Analyze crash
sudo coredumpctl list
sudo coredumpctl debug
Service Metrics
Prometheus Integration
Services expose Prometheus metrics:
# Orchestrator metrics
curl http://localhost:8080/metrics
# Example metrics:
# provisioning_workflows_total 1234
# provisioning_workflows_active 5
# provisioning_tasks_queued 12
# provisioning_tasks_completed 9876
Grafana Dashboards
Import pre-built dashboards:
provisioning monitoring install-dashboards
Dashboards available at http://localhost:3000
Best Practices
Service Management
- Use systemd for production deployments
- Enable automatic restart on failure
- Monitor health endpoints continuously
- Set appropriate resource limits
- Implement log rotation
- Regular backup of service data
Configuration Management
- Version control all configuration files
- Use hierarchical configuration for flexibility
- Validate configuration before applying
- Document all custom settings
- Use environment variables for secrets
Monitoring and Alerting
- Monitor all service health endpoints
- Set up alerts for service failures
- Track key performance metrics
- Review logs regularly
- Establish incident response procedures
Related Documentation
- Deployment Modes - Installation strategies
- Monitoring - Observability and metrics
- Platform Health - Health check procedures
- Troubleshooting - Common issues and solutions
Monitoring
Comprehensive observability stack for the Provisioning platform using Prometheus, Grafana, and custom metrics.
Monitoring Stack Overview
The platform monitoring system consists of:
| Component | Purpose | Port | Status |
|---|---|---|---|
| Prometheus | Metrics collection and storage | 9090 | Production |
| Grafana | Visualization and dashboards | 3000 | Production |
| Loki | Log aggregation | 3100 | Active |
| Alertmanager | Alert routing and notification | 9093 | Production |
| Node Exporter | System metrics | 9100 | Production |
Quick Start
Install monitoring stack:
# Install all monitoring components
provisioning monitoring install
# Install specific components
provisioning monitoring install --components prometheus,grafana
# Start monitoring services
provisioning monitoring start
Access dashboards:
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (admin/admin)
- Alertmanager: http://localhost:9093
Prometheus Configuration
Service Discovery
Prometheus automatically discovers platform services:
# /etc/provisioning/prometheus/prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'provisioning-orchestrator'
static_configs:
- targets: ['localhost:8080']
metrics_path: '/metrics'
- job_name: 'provisioning-control-center'
static_configs:
- targets: ['localhost:8081']
- job_name: 'provisioning-vault-service'
static_configs:
- targets: ['localhost:8085']
- job_name: 'node-exporter'
static_configs:
- targets: ['localhost:9100']
Retention Configuration
global:
external_labels:
cluster: 'provisioning-production'
# Storage retention is set via Prometheus launch flags, not prometheus.yml:
# --storage.tsdb.retention.time=30d
# --storage.tsdb.retention.size=50GB
Key Metrics
Platform Metrics
Orchestrator metrics:
provisioning_workflows_total - Total workflows created
provisioning_workflows_active - Currently active workflows
provisioning_workflows_completed - Successfully completed workflows
provisioning_workflows_failed - Failed workflows
provisioning_tasks_queued - Tasks in queue
provisioning_tasks_running - Currently executing tasks
provisioning_tasks_completed - Total completed tasks
provisioning_checkpoint_recoveries - Checkpoint recovery count
Control Center metrics:
provisioning_api_requests_total - Total API requests
provisioning_api_requests_duration_seconds - Request latency histogram
provisioning_auth_attempts_total - Authentication attempts
provisioning_auth_failures_total - Failed authentication attempts
provisioning_rbac_denials_total - Authorization denials
Vault Service metrics:
provisioning_secrets_operations_total - Secret operations count
provisioning_kms_encryptions_total - Encryption operations
provisioning_kms_decryptions_total - Decryption operations
provisioning_kms_latency_seconds - KMS operation latency
System Metrics
Node Exporter provides system-level metrics:
node_cpu_seconds_total - CPU time per core
node_memory_MemAvailable_bytes - Available memory
node_disk_io_time_seconds_total - Disk I/O time
node_network_receive_bytes_total - Network RX bytes
node_network_transmit_bytes_total - Network TX bytes
node_filesystem_avail_bytes - Available disk space
Grafana Dashboards
Pre-built Dashboards
Import platform dashboards:
# Install all pre-built dashboards
provisioning monitoring install-dashboards
# List available dashboards
provisioning monitoring list-dashboards
Available dashboards:
- Platform Overview - High-level system status
- Orchestrator Performance - Workflow and task metrics
- Control Center API - API request metrics and latency
- Vault Service KMS - Encryption operations and performance
- System Resources - CPU, memory, disk, network
- Security Events - Authentication, authorization, audit logs
- Database Performance - SurrealDB metrics
Custom Dashboard Creation
Create custom dashboards via the Grafana UI or the provisioning CLI:
{
"dashboard": {
"title": "Custom Infrastructure Dashboard",
"panels": [
{
"title": "Active Workflows",
"targets": [
{
"expr": "provisioning_workflows_active",
"legendFormat": "Active Workflows"
}
],
"type": "graph"
}
]
}
}
Save dashboard:
provisioning monitoring export-dashboard --id 1 --output custom-dashboard.json
Alerting
Alert Rules
Configure alert rules in Prometheus:
# /etc/provisioning/prometheus/alerts/provisioning.yml
groups:
- name: provisioning_alerts
interval: 30s
rules:
- alert: OrchestratorDown
expr: up{job="provisioning-orchestrator"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Orchestrator service is down"
description: "Orchestrator has been down for more than 1 minute"
- alert: HighWorkflowFailureRate
expr: |
rate(provisioning_workflows_failed[5m]) /
rate(provisioning_workflows_total[5m]) > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "High workflow failure rate"
description: "More than 10% of workflows are failing"
- alert: DatabaseConnectionLoss
expr: provisioning_database_connected == 0
for: 30s
labels:
severity: critical
annotations:
summary: "Database connection lost"
- alert: HighMemoryUsage
expr: |
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
for: 5m
labels:
severity: warning
annotations:
summary: "High memory usage"
description: "Memory usage is above 90%"
- alert: DiskSpaceLow
expr: |
(node_filesystem_avail_bytes{mountpoint="/var/lib/provisioning"} /
node_filesystem_size_bytes{mountpoint="/var/lib/provisioning"}) < 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "Low disk space"
description: "Less than 10% disk space available"
Alertmanager Configuration
Route alerts to appropriate channels:
# /etc/provisioning/alertmanager/alertmanager.yml
global:
resolve_timeout: 5m
route:
group_by: ['alertname', 'severity']
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
receiver: 'team-email'
routes:
- match:
severity: critical
receiver: 'pagerduty'
continue: true
- match:
severity: warning
receiver: 'slack'
receivers:
- name: 'team-email'
email_configs:
- to: 'ops@example.com'
from: 'alerts@provisioning.example.com'
smarthost: 'smtp.example.com:587'
- name: 'pagerduty'
pagerduty_configs:
- service_key: '<pagerduty-key>'
- name: 'slack'
slack_configs:
- api_url: '<slack-webhook-url>'
channel: '#provisioning-alerts'
Test alerts:
# Send test alert
provisioning monitoring test-alert --severity critical
# Silence alerts temporarily
provisioning monitoring silence --duration 2h --reason "Maintenance window"
Log Aggregation with Loki
Loki Configuration
# /etc/provisioning/loki/loki.yml
auth_enabled: false
server:
http_listen_port: 3100
ingester:
lifecycler:
ring:
kvstore:
store: inmemory
replication_factor: 1
schema_config:
configs:
- from: 2024-01-01
store: boltdb-shipper
object_store: filesystem
schema: v11
index:
prefix: index_
period: 24h
storage_config:
boltdb_shipper:
active_index_directory: /var/lib/loki/boltdb-shipper-active
cache_location: /var/lib/loki/boltdb-shipper-cache
filesystem:
directory: /var/lib/loki/chunks
limits_config:
retention_period: 720h # 30 days
Promtail for Log Shipping
# /etc/provisioning/promtail/promtail.yml
server:
http_listen_port: 9080
positions:
filename: /tmp/positions.yaml
clients:
- url: http://localhost:3100/loki/api/v1/push
scrape_configs:
- job_name: system
static_configs:
- targets:
- localhost
labels:
job: varlogs
__path__: /var/log/provisioning/*.log
- job_name: journald
journal:
max_age: 12h
labels:
job: systemd-journal
relabel_configs:
- source_labels: ['__journal__systemd_unit']
target_label: 'unit'
Query logs in Grafana:
{job="varlogs"} |= "error"
{unit="provisioning-orchestrator.service"} |= "workflow" | json
Tracing with Tempo
Distributed Tracing
Enable OpenTelemetry tracing in services:
# /etc/provisioning/config.toml
[tracing]
enabled = true
exporter = "otlp"
endpoint = "localhost:4317"
service_name = "provisioning-orchestrator"
Tempo configuration:
# /etc/provisioning/tempo/tempo.yml
server:
http_listen_port: 3200
distributor:
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
storage:
trace:
backend: local
local:
path: /var/lib/tempo/traces
query_frontend:
search:
enabled: true
View traces in Grafana or Tempo UI.
Performance Monitoring
Query Performance
Monitor slow queries:
# 95th percentile API latency
histogram_quantile(0.95,
rate(provisioning_api_requests_duration_seconds_bucket[5m])
)
# Slow workflows (>60s)
provisioning_workflow_duration_seconds > 60
Resource Monitoring
Track resource utilization:
# CPU usage per service
rate(process_cpu_seconds_total{job=~"provisioning-.*"}[5m]) * 100
# Memory usage per service
process_resident_memory_bytes{job=~"provisioning-.*"}
# Disk I/O rate
rate(node_disk_io_time_seconds_total[5m])
Custom Metrics
Adding Custom Metrics
Rust services use prometheus crate:
use prometheus::{Counter, Histogram, HistogramOpts, Registry};
// Create metrics
let workflow_counter = Counter::new(
"provisioning_custom_workflows",
"Custom workflow counter"
)?;
let task_duration = Histogram::with_opts(
HistogramOpts::new("provisioning_task_duration", "Task duration")
.buckets(vec![0.1, 0.5, 1.0, 5.0, 10.0])
)?;
// Register metrics
registry.register(Box::new(workflow_counter))?;
registry.register(Box::new(task_duration))?;
// Use metrics
workflow_counter.inc();
task_duration.observe(duration_seconds);
Nushell scripts export metrics:
# Export metrics in Prometheus format
def export-metrics [] {
[
"# HELP provisioning_custom_metric Custom metric"
"# TYPE provisioning_custom_metric counter"
$"provisioning_custom_metric (get-metric-value)"
] | str join "\n"
}
Monitoring Best Practices
- Set appropriate scrape intervals (15-60s)
- Configure retention based on compliance requirements
- Use labels for multi-dimensional metrics
- Create dashboards for key business metrics
- Set up alerts for critical failures only
- Document alert thresholds and runbooks
- Review and tune alerts regularly
- Use recording rules for expensive queries
- Archive long-term metrics to object storage
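The recording-rule recommendation can be sketched as a Prometheus rules file that precomputes the p95 latency query shown earlier (the group and rule names here are illustrative, not shipped defaults):

```yaml
# /etc/provisioning/prometheus/recording-rules.yml (illustrative path and names)
groups:
  - name: provisioning-recording
    interval: 1m
    rules:
      # Precompute the expensive p95 latency query so dashboards read a cheap series
      - record: job:provisioning_api_request_latency_seconds:p95
        expr: |
          histogram_quantile(0.95,
            rate(provisioning_api_requests_duration_seconds_bucket[5m]))
```

Dashboards and alerts then query `job:provisioning_api_request_latency_seconds:p95` instead of re-evaluating the histogram quantile on every refresh.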
Related Documentation
- Service Management - Service lifecycle
- Platform Health - Health checks
- Troubleshooting - Debugging issues
Backup & Recovery
Comprehensive backup strategies and disaster recovery procedures for the Provisioning platform.
Overview
The platform backup strategy covers:
- Platform service data and state
- Database backups (SurrealDB)
- Configuration files and secrets
- Infrastructure definitions
- Workflow checkpoints and history
- Audit logs and compliance data
Backup Components
Critical Data
| Component | Location | Backup Priority | Recovery Time |
|---|---|---|---|
| Database | /var/lib/provisioning/database | Critical | < 15 min |
| Orchestrator State | /var/lib/provisioning/orchestrator | Critical | < 5 min |
| Configuration | /etc/provisioning | High | < 5 min |
| Secrets | SOPS-encrypted files | Critical | < 5 min |
| Audit Logs | /var/log/provisioning/audit | Compliance | < 30 min |
| Workspace Data | workspace/ | High | < 15 min |
| Infrastructure Schemas | provisioning/schemas | High | < 10 min |
Backup Strategies
Full Backup
Complete system backup including all components:
# Create full backup
provisioning backup create --type full --output /backups/full-$(date +%Y%m%d).tar.gz
# Full backup includes:
# - Database dump
# - Service configuration
# - Workflow state
# - Audit logs
# - User data
Contents of full backup:
full-20260116.tar.gz
├── database/
│ └── surrealdb-dump.sql
├── config/
│ ├── provisioning.toml
│ ├── orchestrator.toml
│ └── control-center.toml
├── state/
│ ├── workflows/
│ └── checkpoints/
├── logs/
│ └── audit/
├── workspace/
│ ├── infra/
│ └── config/
└── metadata.json
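A generic way to sanity-check an archive with this layout before trusting it for restore, using only standard `tar` and `grep` (the scratch directory stands in for a real backup; this is not a platform command):

```shell
# Build a scratch archive mimicking the layout above, then verify key paths exist.
tmp=$(mktemp -d)
mkdir -p "$tmp/full/database" "$tmp/full/config"
echo '{}' > "$tmp/full/metadata.json"
tar -czf "$tmp/full.tar.gz" -C "$tmp" full
# A restore should refuse to proceed if metadata.json is missing from the archive
if tar -tzf "$tmp/full.tar.gz" | grep -q 'full/metadata.json'; then
  echo "archive ok"
fi
```

The same listing check works against a real `full-YYYYMMDD.tar.gz` produced by the backup command.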
Incremental Backup
Backup only changed data since last backup:
# Incremental backup (faster, smaller)
provisioning backup create --type incremental --since-backup full-20260116
# Incremental backup includes:
# - New workflows since last backup
# - Configuration changes
# - New audit log entries
# - Modified workspace files
Continuous Backup
Real-time backup of critical data:
# Enable continuous backup
provisioning backup enable-continuous --destination s3://backups/continuous
# WAL archiving for database
# Real-time checkpoint backup
# Audit log streaming
Backup Commands
Create Backup
# Full backup to local directory
provisioning backup create --type full --output /backups
# Incremental backup
provisioning backup create --type incremental
# Backup specific components
provisioning backup create --components database,config
# Compressed backup
provisioning backup create --compress gzip
# Encrypted backup
provisioning backup create --encrypt --key-file /etc/provisioning/backup.key
List Backups
# List all backups
provisioning backup list
# Output:
# NAME TYPE SIZE DATE STATUS
# full-20260116 Full 2.5GB 2026-01-16 10:00 Complete
# incr-20260116-1200 Incremental 150MB 2026-01-16 12:00 Complete
# full-20260115 Full 2.4GB 2026-01-15 10:00 Complete
Restore Backup
# Restore full backup
provisioning backup restore --backup full-20260116 --confirm
# Restore specific components
provisioning backup restore --backup full-20260116 --components database
# Point-in-time restore
provisioning backup restore --timestamp "2026-01-16 09:30:00"
# Dry-run restore
provisioning backup restore --backup full-20260116 --dry-run
Verify Backup
# Verify backup integrity
provisioning backup verify --backup full-20260116
# Test restore in isolated environment
provisioning backup test-restore --backup full-20260116
Automated Backup Scheduling
Cron-based Backups
# Install backup cron jobs
provisioning backup schedule install
# Default schedule:
# Full backup: Daily at 2 AM
# Incremental: Every 6 hours
# Cleanup old backups: Weekly
Crontab entries:
# Full daily backup
0 2 * * * /usr/local/bin/provisioning backup create --type full --output /backups
# Incremental every 6 hours
0 */6 * * * /usr/local/bin/provisioning backup create --type incremental
# Cleanup backups older than 30 days
0 3 * * 0 /usr/local/bin/provisioning backup cleanup --older-than 30d
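The `--older-than 30d` cleanup can be approximated out-of-band with standard GNU `find` if needed (the directory and filenames here are stand-ins):

```shell
# List archives not modified in the last 30 days; add -delete once the listing looks right.
backups=$(mktemp -d)
touch -d '40 days ago' "$backups/full-old.tar.gz"
touch "$backups/full-new.tar.gz"
find "$backups" -name '*.tar.gz' -mtime +30 -print
```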
Systemd Timer-based Backups
# /etc/systemd/system/provisioning-backup.timer
[Unit]
Description=Provisioning Platform Backup Timer
[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true
[Install]
WantedBy=timers.target
# /etc/systemd/system/provisioning-backup.service
[Unit]
Description=Provisioning Platform Backup
[Service]
Type=oneshot
ExecStart=/usr/local/bin/provisioning backup create --type full
User=provisioning
Enable timer:
sudo systemctl enable provisioning-backup.timer
sudo systemctl start provisioning-backup.timer
Backup Destinations
Local Filesystem
# Backup to local directory
provisioning backup create --output /mnt/backups
Remote Storage
S3-compatible storage:
# Backup to S3
provisioning backup create --destination s3://my-bucket/backups \
--s3-region us-east-1
# Backup to MinIO
provisioning backup create --destination s3://backups \
--s3-endpoint http://minio.local:9000
Network filesystem:
# Backup to NFS mount
provisioning backup create --output /mnt/nfs/backups
# Backup to SMB share
provisioning backup create --output /mnt/smb/backups
Off-site Backup
Rsync to remote server:
# Backup and sync to remote
provisioning backup create --output /backups
rsync -avz /backups/ backup-server:/backups/provisioning/
Database Backup
SurrealDB Backup
# Export database
surreal export --conn http://localhost:8000 \
--user root --pass root \
--ns provisioning --db main \
/backups/database-$(date +%Y%m%d).surql
# Import database
surreal import --conn http://localhost:8000 \
--user root --pass root \
--ns provisioning --db main \
/backups/database-20260116.surql
Automated Database Backups
# Enable automatic database backups
provisioning backup database enable --interval daily
# Backup with point-in-time recovery
provisioning backup database create --enable-pitr
Disaster Recovery
Recovery Procedures
Complete platform recovery from backup:
# 1. Stop all services
sudo systemctl stop provisioning-*
# 2. Restore database
provisioning backup restore --backup full-20260116 --components database
# 3. Restore configuration
provisioning backup restore --backup full-20260116 --components config
# 4. Restore service state
provisioning backup restore --backup full-20260116 --components state
# 5. Verify data integrity
provisioning validate-installation
# 6. Start services
sudo systemctl start provisioning-*
# 7. Verify services
provisioning platform status
Recovery Time Objectives
| Scenario | RTO | RPO | Procedure |
|---|---|---|---|
| Service failure | 5 min | 0 | Restart service from checkpoint |
| Database corruption | 15 min | 6 hours | Restore from incremental backup |
| Complete data loss | 30 min | 24 hours | Restore from full backup |
| Site disaster | 2 hours | 24 hours | Restore from off-site backup |
Point-in-Time Recovery
Restore to specific timestamp:
# List available recovery points
provisioning backup list-recovery-points
# Restore to specific time
provisioning backup restore --timestamp "2026-01-16 09:30:00"
# Recovery with workflow replay
provisioning backup restore --timestamp "2026-01-16 09:30:00" --replay-workflows
Backup Encryption
SOPS Encryption
Encrypt backups with SOPS:
# Create encrypted backup
provisioning backup create --encrypt sops --key-file /etc/provisioning/age.key
# Restore encrypted backup
provisioning backup restore --backup encrypted-20260116.tar.gz.enc \
--decrypt sops --key-file /etc/provisioning/age.key
Age Encryption
# Generate age key pair
age-keygen -o /etc/provisioning/backup-key.txt
# Create encrypted backup with age
provisioning backup create --encrypt age --recipient "age1..."
# Decrypt and restore
age -d -i /etc/provisioning/backup-key.txt backup.tar.gz.age | \
provisioning backup restore --stdin
Backup Retention
Retention Policies
# /etc/provisioning/backup-retention.toml
[retention]
# Keep daily backups for 7 days
daily = 7
# Keep weekly backups for 4 weeks
weekly = 4
# Keep monthly backups for 12 months
monthly = 12
# Keep yearly backups for 7 years (compliance)
yearly = 7
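Evaluating such a policy reduces to date arithmetic per backup; a sketch of the daily-window check using GNU `date` (dates hard-coded for illustration):

```shell
# Is a backup dated 2026-01-10 still inside the daily=7 window on 2026-01-16?
backup_epoch=$(date -d '2026-01-10' +%s)
now_epoch=$(date -d '2026-01-16' +%s)
age_days=$(( (now_epoch - backup_epoch) / 86400 ))
if [ "$age_days" -le 7 ]; then
  verdict="keep (daily window)"
else
  verdict="candidate for cleanup"
fi
echo "$verdict"
```

The weekly, monthly, and yearly tiers apply the same comparison with wider windows and coarser bucketing.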
Apply retention policy:
# Cleanup old backups according to policy
provisioning backup cleanup --policy /etc/provisioning/backup-retention.toml
Backup Monitoring
Backup Alerts
Configure alerts for backup failures:
# Prometheus alert for failed backups
- alert: BackupFailed
expr: provisioning_backup_status{status="failed"} > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Backup failed"
description: "Backup has failed, investigate immediately"
Backup Metrics
Monitor backup health:
# Backup success rate
provisioning_backup_success_rate{type="full"} 1.0
# Time since last backup
time() - provisioning_backup_last_success_timestamp > 86400
# Backup size trend
increase(provisioning_backup_size_bytes[7d])
Testing Recovery Procedures
Regular DR Drills
# Automated disaster recovery test
provisioning backup test-recovery --backup full-20260116 \
--test-environment isolated
# Steps performed:
# 1. Spin up isolated test environment
# 2. Restore backup
# 3. Verify data integrity
# 4. Run smoke tests
# 5. Generate test report
# 6. Teardown test environment
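The smoke tests in step 4 typically follow a small pass/fail harness shape; a minimal sketch where each `check` is a stand-in for a real probe:

```shell
# Tally pass/fail across a set of post-restore checks.
pass=0; fail=0
check() {
  # run a check command silently; count its result
  if "$@" >/dev/null 2>&1; then pass=$((pass + 1)); else fail=$((fail + 1)); fi
}
check true   # stand-in: restored service answers /health
check true   # stand-in: restored row counts match the backup manifest
echo "smoke tests: passed=$pass failed=$fail"
```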
Schedule monthly DR tests:
# Monthly disaster recovery drill
0 4 1 * * /usr/local/bin/provisioning backup test-recovery --latest
Best Practices
- Implement 3-2-1 backup rule: 3 copies, 2 different media, 1 off-site
- Encrypt all backups containing sensitive data
- Test restore procedures regularly (monthly minimum)
- Monitor backup success/failure metrics
- Automate backup verification
- Document recovery procedures and RTO/RPO
- Maintain off-site backups for disaster recovery
- Use incremental backups to reduce storage costs
- Version control infrastructure schemas separately
- Retain audit logs per compliance requirements (7 years)
Related Documentation
- Service Management - Service lifecycle
- Platform Health - Health monitoring
- Troubleshooting - Recovery procedures
Upgrading Provisioning
Upgrade Provisioning to a new version with minimal downtime and automatic rollback support.
Overview
Provisioning supports two upgrade strategies:
- In-Place Upgrade - Update existing installation
- Side-by-Side Upgrade - Run new version alongside old, switch when ready
Both strategies support automatic rollback on failure.
Before Upgrading
Check Current Version
provisioning version
# Example output:
# Provisioning v5.0.0
# Nushell 0.109.0
# Nickel 1.15.1
# SOPS 3.10.2
# Age 1.2.1
Backup Configuration
# Backup entire workspace
provisioning workspace backup
# Backup specific configuration
provisioning config backup
# Backup state
provisioning state backup
Check Changelog
# View latest changes
provisioning changelog
# Check upgrade path
provisioning version --check-upgrade
# Show upgrade recommendations
provisioning upgrade --check
Verify System Health
# Health check
provisioning health check
# Check all services
provisioning platform health
# Verify provider connectivity
provisioning providers test --all
# Validate configuration
provisioning validate config --strict
Upgrade Methods
Method 1: In-Place Upgrade
Upgrade the existing installation with zero downtime:
# Check upgrade compatibility
provisioning upgrade --check
# List breaking changes
provisioning upgrade --breaking-changes
# Show migration guide (if any)
provisioning upgrade --show-migration
# Perform upgrade
provisioning upgrade
Process:
- Validate current installation
- Download new version
- Run migration scripts (if needed)
- Restart services
- Verify health
- Keep old version for rollback (24 hours)
Method 2: Side-by-Side Upgrade
Run new version alongside old version for testing:
# Create staging installation
provisioning upgrade --staging --version v5.1.0
# Test new version
provisioning --staging server list
# Run test suite
provisioning --staging test suite
# Switch to new version
provisioning upgrade --activate
# Remove old version (after confirmation)
provisioning upgrade --cleanup-old
Advantages:
- Test new version before switching
- Zero downtime during upgrade
- Easy rollback to previous version
- Run both versions simultaneously
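The side-by-side approach commonly reduces to a versions-directory-plus-symlink layout; a minimal sketch of the atomic switch (paths are illustrative, not where the platform actually installs):

```shell
# Two versions on disk; a symlink selects the active one.
root=$(mktemp -d)
mkdir -p "$root/versions/v5.0.0" "$root/versions/v5.1.0"
ln -s "$root/versions/v5.0.0" "$root/current"
# Activate the new version atomically; the old tree stays on disk for rollback
ln -sfn "$root/versions/v5.1.0" "$root/current"
basename "$(readlink "$root/current")"
```

Rolling back is the same `ln -sfn` pointed at the previous version directory.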
Upgrade Process
Step 1: Pre-Upgrade Checks
# Check system requirements
provisioning setup validate
# Verify dependencies are up-to-date
provisioning version --check-dependencies
# Check disk space (minimum 2GB required)
df -h /
# Verify all services healthy
provisioning platform health
Step 2: Backup Data
# Backup entire workspace
provisioning workspace backup --compress
# Backup orchestrator state
provisioning orchestrator backup
# Backup configuration
provisioning config backup
# Verify backup
provisioning backup list
provisioning backup verify --latest
Step 3: Download New Version
# Check available versions
provisioning version --available
# Download specific version
provisioning upgrade --download v5.1.0
# Verify download
provisioning upgrade --verify-download v5.1.0
# Check size
provisioning upgrade --show-size v5.1.0
Step 4: Run Migration Scripts
# Show required migrations
provisioning upgrade --show-migrations
# Test migration (dry-run)
provisioning upgrade --dry-run
# Run migrations
provisioning upgrade --migrate
# Verify migration
provisioning upgrade --verify-migration
Step 5: Perform Upgrade
# Stop orchestrator gracefully
provisioning orchestrator stop --graceful
# Install new version
provisioning upgrade --install
# Verify installation
provisioning version
provisioning validate config
# Start services
provisioning orchestrator start
Step 6: Verify Upgrade
# Check version
provisioning version
# Health check
provisioning health check
# Run test suite
provisioning test quick
# Verify provider connectivity
provisioning providers test --all
# Check orchestrator status
provisioning orchestrator status
Breaking Changes
Some upgrades may include breaking changes. Check before upgrading:
# List breaking changes
provisioning upgrade --breaking-changes
# Show migration guide
provisioning upgrade --migration-guide v5.1.0
# Generate migration script
provisioning upgrade --generate-migration v5.1.0 > migrate.nu
Common Migration Scenarios
Scenario 1: Configuration Format Change
If configuration format changes (e.g., TOML → YAML):
# Export old format
provisioning config export --format toml > config.old.toml
# Run migration
provisioning upgrade --migrate-config
# Verify new format
provisioning config export --format yaml | head -20
Scenario 2: Schema Updates
If infrastructure schemas change:
# Validate against new schema
nickel typecheck workspace/infra/*.ncl
# Update schemas if needed
provisioning upgrade --update-schemas
# Regenerate configurations
provisioning config regenerate
# Validate updated config
provisioning validate config --strict
Scenario 3: Provider API Changes
If provider APIs change:
# Test provider connectivity with new version
provisioning providers test upcloud --verbose
# Check provider configuration
provisioning config show --section providers.upcloud
# Update provider configuration if needed
provisioning providers configure upcloud
# Verify connectivity
provisioning server list
Rollback Procedure
Automatic Rollback
If upgrade fails, automatic rollback occurs:
# Monitor rollback progress
provisioning upgrade --watch
# Check rollback status
provisioning upgrade --status
# View rollback logs
provisioning upgrade --logs
Manual Rollback
If needed, manually rollback to previous version:
# List available versions for rollback
provisioning upgrade --rollback-candidates
# Rollback to specific version
provisioning upgrade --rollback v5.0.0
# Verify rollback
provisioning version
provisioning platform health
# Restore from backup
provisioning backup restore --backup-id=<id>
Batch Workflow Handling
If you have running batch workflows:
# Check running workflows
provisioning workflow list --status running
# Graceful shutdown (wait for completion)
provisioning workflow shutdown --graceful
# Force shutdown (immediate)
provisioning workflow shutdown --force
# Resume workflows after upgrade
provisioning workflow resume
Troubleshooting Upgrades
Upgrade Hangs
# Check logs
tail -f ~/.provisioning/logs/upgrade.log
# Monitor process
provisioning upgrade --monitor
# Stop upgrade gracefully
provisioning upgrade --stop --graceful
# Force stop
provisioning upgrade --stop --force
Migration Failure
# Check migration logs
provisioning upgrade --migration-logs
# Rollback to previous version
provisioning upgrade --rollback
# Restore from backup
provisioning backup restore
Service Won’t Start
# Check service logs
provisioning platform logs
# Verify configuration
provisioning validate config --strict
# Restore configuration from backup
provisioning config restore
# Restart services
provisioning orchestrator start
Upgrade Scheduling
Schedule Automated Upgrade
# Schedule upgrade for specific time
provisioning upgrade --schedule "2026-01-20T02:00:00"
# Schedule for next maintenance window
provisioning upgrade --schedule-next-maintenance
# Cancel scheduled upgrade
provisioning upgrade --cancel-scheduled
Unattended Upgrade
For CI/CD environments:
# Non-interactive upgrade
provisioning upgrade --yes --no-confirm
# Upgrade with timeout
provisioning upgrade --timeout 3600
# Skip backup
provisioning upgrade --skip-backup
# Continue even if health checks fail
provisioning upgrade --force-upgrade
Version Management
Version Constraints
Pin versions for workspace reproducibility:
# workspace/versions.ncl
{
provisioning = "5.0.0"
nushell = "0.109.0"
nickel = "1.15.1"
sops = "3.10.2"
age = "1.2.1"
}
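A constraint check of this kind boils down to comparing the pinned version against what is installed; a hedged sketch (the `installed` value stands in for parsing `provisioning version` output):

```shell
pinned="5.0.0"
installed="5.0.0"   # stand-in for parsing `provisioning version` output
if [ "$installed" = "$pinned" ]; then
  echo "constraint satisfied"
else
  echo "version drift: installed $installed, pinned $pinned"
fi
```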
Enforce version constraints:
# Check version compliance
provisioning version --check-constraints
# Enforce constraint
provisioning version --strict-mode
Vendor Versions
Pin provider and task service versions:
# workspace/infra/versions.ncl
{
providers = {
upcloud = "2.0.0"
aws = "5.0.0"
}
taskservs = {
kubernetes = "1.28.0"
postgres = "14.0"
}
}
Best Practices
1. Plan Upgrades
- Schedule during maintenance windows
- Test in staging first
- Communicate with team
- Have rollback plan ready
2. Backup Everything
# Complete backup before upgrade
provisioning workspace backup --compress
provisioning config backup
provisioning state backup
3. Test Before Upgrading
# Use side-by-side upgrade to test
provisioning upgrade --staging
provisioning test suite
4. Monitor After Upgrade
# Watch orchestrator
provisioning orchestrator status --watch
# Monitor platform health
provisioning platform monitor
# Check logs
tail -f ~/.provisioning/logs/provisioning.log
5. Document Changes
# Record what changed
provisioning upgrade --changelog > UPGRADE.md
# Update team documentation
# Update runbooks
# Update dashboards
Upgrade Policies
Automatic Updates
Enable automatic updates:
# ~/.config/provisioning/user_config.yaml
upgrade:
auto_update: true
check_interval: "daily"
update_channel: "stable"
auto_backup: true
Update Channels
Choose update channel:
# Stable releases (recommended)
provisioning upgrade --channel stable
# Beta releases
provisioning upgrade --channel beta
# Development (nightly)
provisioning upgrade --channel development
Related Documentation
- Initial Setup - First-time configuration
- Platform Health - System monitoring
- Backup & Recovery - Data protection
Troubleshooting
Common issues, debugging procedures, and resolution strategies for the Provisioning platform.
Quick Diagnosis
Run platform diagnostics:
# Comprehensive health check
provisioning diagnose
# Check specific component
provisioning diagnose --component orchestrator
# Generate diagnostic report
provisioning diagnose --report /tmp/diagnostics.txt
Common Issues
Services Won’t Start
Symptom: Service fails to start or crashes immediately
Diagnosis:
# Check service status
systemctl status provisioning-orchestrator
# View recent logs
journalctl -u provisioning-orchestrator -n 100 --no-pager
# Check configuration
provisioning validate config
Common Causes:
- Port already in use
# Find process using port
lsof -i :8080
# Kill conflicting process or change port in config
- Configuration error
# Validate configuration
provisioning validate config --strict
# Check for syntax errors
nickel typecheck /etc/provisioning/config.ncl
- Missing dependencies
# Check binary dependencies
ldd /usr/local/bin/provisioning-orchestrator
# Install missing libraries
sudo apt install <missing-library>
- Permission issues
# Fix ownership
sudo chown -R provisioning:provisioning /var/lib/provisioning
sudo chown -R provisioning:provisioning /etc/provisioning
# Fix permissions
sudo chmod 750 /var/lib/provisioning
sudo chmod 640 /etc/provisioning/*.toml
Database Connection Failures
Symptom: Services can’t connect to SurrealDB
Diagnosis:
# Check database status
systemctl status surrealdb
# Test database connectivity
curl http://localhost:8000/health
# Check database logs
journalctl -u surrealdb -n 50
Resolution:
# Restart database
sudo systemctl restart surrealdb
# Verify connection string in config
provisioning config get database.url
# Test manual connection
surreal sql --conn http://localhost:8000 --user root --pass root
High Resource Usage
Symptom: Service consuming excessive CPU or memory
Diagnosis:
# Monitor resource usage
top -p $(pgrep provisioning-orchestrator)
# Detailed metrics
provisioning platform metrics --service orchestrator
# Check for resource leaks
Resolution:
# Adjust worker threads
provisioning config set execution.worker_threads 4
# Reduce parallel tasks
provisioning config set execution.max_parallel_tasks 50
# Increase memory limit
sudo systemctl set-property provisioning-orchestrator MemoryMax=8G
# Restart service
sudo systemctl restart provisioning-orchestrator
Workflow Failures
Symptom: Workflows fail or hang
Diagnosis:
# List failed workflows
provisioning workflow list --status failed
# View workflow details
provisioning workflow show <workflow-id>
# Check workflow logs
provisioning workflow logs <workflow-id>
# Inspect checkpoint state
provisioning workflow checkpoints <workflow-id>
Common Issues:
- Provider API errors
# Check provider credentials
provisioning provider validate upcloud
# Test provider connectivity
provisioning provider test upcloud
- Dependency resolution failures
# Validate infrastructure schema
provisioning validate infra my-cluster.ncl
# Check task service dependencies
provisioning taskserv deps kubernetes
- Timeout issues
# Increase timeout
provisioning config set workflows.task_timeout 600
# Enable detailed logging
provisioning config set logging.level debug
Network Connectivity Issues
Symptom: Can’t reach external services or cloud providers
Diagnosis:
# Test network connectivity
ping -c 3 upcloud.com
# Check DNS resolution
nslookup api.upcloud.com
# Test HTTPS connectivity
curl -v https://api.upcloud.com
# Check proxy settings
env | grep -i proxy
Resolution:
# Configure proxy if needed
export HTTPS_PROXY=http://proxy.example.com:8080
provisioning config set network.proxy http://proxy.example.com:8080
# Verify firewall rules
sudo ufw status
# Check routing
ip route show
Authentication Failures
Symptom: API requests fail with 401 Unauthorized
Diagnosis:
# Check JWT token
provisioning auth status
# Verify user credentials
provisioning auth whoami
# Check authentication logs
journalctl -u provisioning-control-center | grep "auth"
Resolution:
# Refresh authentication token
provisioning auth login --username admin
# Reset user password
provisioning auth reset-password --username admin
# Verify MFA configuration
provisioning auth mfa status
Debugging Workflows
Enable Debug Logging
# Enable debug mode
export PROVISIONING_LOG_LEVEL=debug
provisioning workflow create my-cluster --debug
# Or in configuration
provisioning config set logging.level debug
sudo systemctl restart provisioning-orchestrator
Workflow State Inspection
# View workflow state
provisioning workflow state <workflow-id>
# Export workflow state to JSON
provisioning workflow state <workflow-id> --format json > workflow-state.json
# Inspect checkpoints
provisioning workflow checkpoints <workflow-id>
Manual Workflow Retry
# Retry failed workflow from last checkpoint
provisioning workflow retry <workflow-id>
# Retry from specific checkpoint
provisioning workflow retry <workflow-id> --from-checkpoint 3
# Force retry (skip validation)
provisioning workflow retry <workflow-id> --force
Performance Troubleshooting
Slow Workflow Execution
Diagnosis:
# Profile workflow execution
provisioning workflow profile <workflow-id>
# Identify bottlenecks
provisioning workflow analyze <workflow-id>
Optimization:
# Increase parallelism
provisioning config set execution.max_parallel_tasks 200
# Optimize database queries
provisioning database analyze
# Add caching
provisioning config set cache.enabled true
Database Performance Issues
Diagnosis:
# Check database metrics
curl http://localhost:8000/metrics
# Identify slow queries
provisioning database slow-queries
# Check connection pool
provisioning database pool-status
Optimization:
# Increase connection pool
provisioning config set database.max_connections 200
# Add indexes
provisioning database create-indexes
# Optimize vacuum settings
provisioning database vacuum
Log Analysis
Centralized Log Viewing
# View all platform logs
journalctl -u provisioning-* -f
# Filter by severity
journalctl -u provisioning-* -p err
# Export logs for analysis
journalctl -u provisioning-* --since "1 hour ago" > /tmp/logs.txt
Structured Log Queries
Using Loki with LogQL:
# Find errors in orchestrator
{job="provisioning-orchestrator"} |= "ERROR"
# Workflow failures
{job="provisioning-orchestrator"} | json | status="failed"
# API request latency over 1s
{job="provisioning-control-center"} | json | duration > 1
Log Correlation
# Correlate logs by request ID
journalctl -u provisioning-* | grep "request_id=abc123"
# Trace workflow execution
provisioning workflow trace <workflow-id>
Advanced Debugging
Enable Rust Backtrace
# Enable backtrace for Rust services
export RUST_BACKTRACE=1
sudo systemctl restart provisioning-orchestrator
# Full backtrace
export RUST_BACKTRACE=full
Core Dump Analysis
# Enable core dumps
sudo sysctl -w kernel.core_pattern=/var/crash/core.%e.%p
ulimit -c unlimited
# Analyze core dump
sudo coredumpctl list
sudo coredumpctl debug <pid>
# In gdb:
(gdb) bt
(gdb) info threads
(gdb) thread apply all bt
Network Traffic Analysis
# Capture API traffic
sudo tcpdump -i any -w /tmp/api-traffic.pcap port 8080
# Analyze with tshark
tshark -r /tmp/api-traffic.pcap -Y "http"
Getting Help
Collect Diagnostic Information
# Generate comprehensive diagnostic report
provisioning diagnose --full --output /tmp/diagnostics.tar.gz
# Report includes:
# - Service status
# - Configuration files
# - Recent logs (last 1000 lines per service)
# - Resource usage metrics
# - Database status
# - Network connectivity tests
# - Workflow states
Support Channels
- Check documentation: provisioning help <topic>
- Search logs: journalctl -u provisioning-*
- Review monitoring dashboards: http://localhost:3000
- Run diagnostics: provisioning diagnose
- Contact support with the diagnostic report
Preventive Measures
- Enable comprehensive monitoring and alerting
- Implement regular health checks
- Maintain up-to-date documentation
- Test disaster recovery procedures monthly
- Keep platform and dependencies updated
- Review logs regularly for warning signs
- Monitor resource utilization trends
- Validate configuration changes before applying
Related Documentation
- Service Management - Service lifecycle
- Monitoring - Observability setup
- Platform Health - Health checks
- Backup & Recovery - Recovery procedures
Platform Health
Health monitoring, status checks, and system integrity validation for the Provisioning platform.
Health Check Overview
The platform provides multiple levels of health monitoring:
| Level | Scope | Frequency | Response Time |
|---|---|---|---|
| Service Health | Individual service status | Every 10s | < 100ms |
| System Health | Overall platform status | Every 30s | < 500ms |
| Infrastructure Health | Managed resources | Every 60s | < 2s |
| Dependency Health | External services | Every 60s | < 1s |
Quick Health Check
# Check overall platform health
provisioning health
# Output:
# ✓ Orchestrator: healthy (uptime: 5d 3h)
# ✓ Control Center: healthy
# ✓ Vault Service: healthy
# ✓ Database: healthy (connections: 45/100)
# ✓ Network: healthy
# ✗ MCP Server: degraded (high latency)
# Exit code: 0 = healthy, 1 = degraded, 2 = unhealthy
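Because the exit code encodes the state, automation can branch on it directly; a sketch where `check_health` is a stand-in for the real `provisioning health` call:

```shell
check_health() { return "$1"; }   # stand-in: the real call is `provisioning health`
check_health 1                    # simulate a degraded platform
case $? in
  0) state="healthy" ;;
  1) state="degraded" ;;
  *) state="unhealthy" ;;
esac
echo "platform is $state"
```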
Service Health Endpoints
All services expose /health endpoints returning standardized responses.
Orchestrator Health
curl http://localhost:8080/health
{
"status": "healthy",
"version": "5.0.0",
"uptime_seconds": 432000,
"checks": {
"database": "healthy",
"file_system": "healthy",
"memory": "healthy"
},
"metrics": {
"active_workflows": 12,
"queued_tasks": 45,
"completed_tasks": 9876,
"worker_threads": 8
},
"timestamp": "2026-01-16T10:30:00Z"
}
Health status values:
- healthy - Service operating normally
- degraded - Service functional with reduced capacity
- unhealthy - Service not functioning
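When scripting against these endpoints without `jq`, the status field can be extracted with `sed` (the payload is inlined here in place of a live `curl` to the endpoint):

```shell
# Pull the "status" value out of a /health JSON payload.
payload='{"status":"degraded","version":"5.0.0"}'
status=$(printf '%s' "$payload" | sed -n 's/.*"status":"\([a-z]*\)".*/\1/p')
echo "$status"
```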
Control Center Health
curl http://localhost:8081/health
{
"status": "healthy",
"version": "5.0.0",
"checks": {
"database": "healthy",
"orchestrator": "healthy",
"vault": "healthy",
"auth": "healthy"
},
"metrics": {
"active_sessions": 23,
"api_requests_per_second": 156,
"p95_latency_ms": 45
}
}
Vault Service Health
curl http://localhost:8085/health
{
"status": "healthy",
"checks": {
"kms_backend": "healthy",
"encryption": "healthy",
"key_rotation": "healthy"
},
"metrics": {
"active_secrets": 234,
"encryption_ops_per_second": 50,
"kms_latency_ms": 3
}
}
System Health Checks
Comprehensive Health Check
# Run all health checks
provisioning health check --all
# Check specific components
provisioning health check --components orchestrator,database,network
# Output detailed report
provisioning health check --detailed --output /tmp/health-report.json
Health Check Components
Platform health checking verifies:
- Service Availability - All services responding
- Database Connectivity - SurrealDB reachable and responsive
- Filesystem Health - Disk space and I/O performance
- Network Connectivity - Internal and external connectivity
- Resource Utilization - CPU, memory, disk within limits
- Dependency Status - External services available
- Security Status - Authentication and encryption functional
Database Health
# Check database health
provisioning health database
# Output:
# ✓ Connection: healthy (latency: 2ms)
# ✓ Disk usage: 45% (22GB / 50GB)
# ✓ Active connections: 45 / 100
# ✓ Query performance: healthy (avg: 15ms)
# ✗ Replication: warning (lag: 5s)
Detailed database metrics:
# Connection pool status
provisioning database pool-status
# Slow query analysis
provisioning database slow-queries --threshold 1000ms
# Storage usage
provisioning database storage-stats
Filesystem Health
# Check disk space and I/O
provisioning health filesystem
# Output:
# ✓ Root filesystem: 65% used (325GB / 500GB)
# ✓ Data filesystem: 45% used (225GB / 500GB)
# ✓ I/O latency: healthy (avg: 5ms)
# ✗ Inodes: warning (85% used)
Check specific paths:
# Check data directory
df -h /var/lib/provisioning
# Check I/O performance
iostat -x 1 5
Network Health
# Check network connectivity
provisioning health network
# Test external connectivity
provisioning health network --external
# Test provider connectivity
provisioning health network --provider upcloud
Network health checks:
- Internal service-to-service connectivity
- DNS resolution
- External API reachability (cloud providers)
- Network latency and packet loss
- Firewall rules validation
Resource Monitoring
CPU Health
# Check CPU utilization
provisioning health cpu
# Per-service CPU usage
provisioning platform metrics --metric cpu_usage
# Alert if CPU > 90% for 5 minutes
Monitor CPU load:
# System load average
uptime
# Per-process CPU
top -b -n 1 | grep provisioning
Memory Health
# Check memory utilization
provisioning health memory
# Memory breakdown by service
provisioning platform metrics --metric memory_usage
# Detect memory leaks
provisioning health memory --leak-detection
Memory metrics:
# Available memory
free -h
# Per-service memory
ps aux | grep provisioning | awk '{sum+=$6} END {print sum/1024 " MB"}'
Disk Health
# Check disk health
provisioning health disk
# SMART status (if available)
sudo smartctl -H /dev/sda
Automated Health Monitoring
Health Check Service
Enable continuous health monitoring:
# Start health monitor
provisioning health monitor --interval 30
# Monitor with alerts
provisioning health monitor --interval 30 --alert-email ops@example.com
# Monitor specific components
provisioning health monitor --components orchestrator,database --interval 10
Systemd Health Monitoring
Systemd watchdog for automatic restart on failure:
# /etc/systemd/system/provisioning-orchestrator.service
[Service]
Type=notify
WatchdogSec=30
Restart=on-failure
RestartSec=10
StartLimitIntervalSec=300
StartLimitBurst=5
The service sends a periodic health notification:
// Rust service code: notify the systemd watchdog from the main loop
use sd_notify::NotifyState;
sd_notify::notify(true, &[NotifyState::Watchdog])?;
Health Dashboards
Grafana Health Dashboard
Import platform health dashboard:
provisioning monitoring install-dashboard --name platform-health
Dashboard panels:
- Service status indicators
- Resource utilization gauges
- Error rate graphs
- Latency histograms
- Workflow success rate
- Database connection pool
Access: http://localhost:3000/d/platform-health
CLI Health Dashboard
Real-time health monitoring in terminal:
# Interactive health dashboard
provisioning health dashboard
# Auto-refresh every 5 seconds
provisioning health dashboard --refresh 5
Health Alerts
Prometheus Alert Rules
# Platform health alerts
groups:
  - name: platform_health
    rules:
      - alert: ServiceUnhealthy
        expr: up{job=~"provisioning-.*"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is unhealthy"
      - alert: HighMemoryUsage
        expr: process_resident_memory_bytes > 4e9
        for: 5m
        labels:
          severity: warning
      - alert: DatabaseConnectionPoolExhausted
        expr: database_connection_pool_active / database_connection_pool_max > 0.9
        for: 2m
        labels:
          severity: critical
Health Check Notifications
Configure health check notifications:
# /etc/provisioning/health.toml
[notifications]
enabled = true
[notifications.email]
enabled = true
smtp_server = "smtp.example.com"
from = "[health@provisioning.example.com](mailto:health@provisioning.example.com)"
to = ["[ops@example.com](mailto:ops@example.com)"]
[notifications.slack]
enabled = true
webhook_url = "https://hooks.slack.com/services/..."
channel = "#provisioning-health"
[notifications.pagerduty]
enabled = true
service_key = "..."
Dependency Health
External Service Health
Check health of dependencies:
# Check cloud provider API
provisioning health dependency upcloud
# Check vault service
provisioning health dependency vault
# Check all dependencies
provisioning health dependency --all
Dependency health includes:
- API reachability
- Authentication validity
- API quota/rate limits
- Service degradation status
Third-party Service Monitoring
Monitor integrated services:
# Kubernetes cluster health (if managing K8s)
provisioning health kubernetes
# Database replication health
provisioning health database --replication
# Secret store health
provisioning health secrets
Health Metrics
Key metrics tracked for health monitoring:
Service Metrics
provisioning_service_up{service="orchestrator"} 1
provisioning_service_health_status{service="orchestrator"} 1
provisioning_service_uptime_seconds{service="orchestrator"} 432000
Resource Metrics
provisioning_cpu_usage_percent 45
provisioning_memory_usage_bytes 2.5e9
provisioning_disk_usage_percent{mount="/var/lib/provisioning"} 45
provisioning_network_errors_total 0
Performance Metrics
provisioning_api_latency_p50_ms 25
provisioning_api_latency_p95_ms 85
provisioning_api_latency_p99_ms 150
provisioning_workflow_duration_seconds 45
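These gauges lend themselves to simple PromQL expressions, for example (illustrative queries against the metric names above, not shipped recording rules):

```promql
# 24h availability per service
avg_over_time(provisioning_service_up[1d])

# Services whose p95 API latency currently exceeds 100 ms
provisioning_api_latency_p95_ms > 100
```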
Health Best Practices
- Monitor all critical services continuously
- Set appropriate alert thresholds
- Test alert notifications regularly
- Maintain health check runbooks
- Review health metrics weekly
- Establish health baselines
- Automate remediation where possible
- Document health status definitions
- Integrate health checks with CI/CD
- Monitor upstream dependencies
Troubleshooting Unhealthy State
When a health check fails:
# 1. Identify unhealthy component
provisioning health check --detailed
# 2. View component logs
journalctl -u provisioning-<component> -n 100
# 3. Check resource availability
provisioning health resources
# 4. Restart unhealthy service
sudo systemctl restart provisioning-<component>
# 5. Verify recovery
provisioning health check
# 6. Review recent changes
git log --since="1 day ago" -- /etc/provisioning/
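Steps 1 through 5 are mechanical enough to script. A hedged sketch of such a wrapper (the `provisioning` subcommands are those shown above; the automatic-restart policy is an assumption, so adapt it to your change-management rules):

```shell
#!/usr/bin/env sh
# remediate.sh <component>: run steps 1-5 above for one component.
set -eu

unit_name() { printf 'provisioning-%s' "$1"; }   # systemd unit naming rule

remediate() {
  component=$1
  provisioning health check --detailed || true          # 1. identify
  journalctl -u "$(unit_name "$component")" -n 100      # 2. recent logs
  provisioning health resources || true                 # 3. resources
  sudo systemctl restart "$(unit_name "$component")"    # 4. restart
  provisioning health check                             # 5. verify recovery
}
```

Review step 6 (recent configuration changes) manually before considering the incident closed.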
Related Documentation
- Service Management - Service lifecycle
- Monitoring - Comprehensive monitoring
- Troubleshooting - Issue resolution
- Deployment Modes - Installation modes
Security System
Enterprise-grade security infrastructure with 12 integrated components providing authentication, authorization, encryption, and compliance.
Overview
The Provisioning platform security system delivers comprehensive protection across all layers of the infrastructure automation platform. Built for enterprise deployments, it provides defense-in-depth through multiple security controls working together.
Security Architecture
The security system is organized into 12 core components:
| Component | Purpose | Key Features |
|---|---|---|
| Authentication | User identity verification | JWT tokens, session management, multi-provider auth |
| Authorization | Access control enforcement | Cedar policy engine, RBAC, fine-grained permissions |
| MFA | Multi-factor authentication | TOTP, WebAuthn/FIDO2, backup codes |
| Audit Logging | Comprehensive audit trails | 7-year retention, 5 export formats, compliance reporting |
| KMS | Key management | 5 KMS backends, envelope encryption, key rotation |
| Secrets Management | Secure secret storage | SecretumVault integration, SOPS/Age, dynamic secrets |
| Encryption | Data protection | At-rest and in-transit encryption, AES-256-GCM |
| Secure Communication | Network security | TLS/mTLS, certificate management, secure channels |
| Certificate Management | PKI operations | CA management, certificate issuance, rotation |
| Compliance | Regulatory adherence | SOC2, GDPR, HIPAA, policy enforcement |
| Security Testing | Validation framework | 350+ tests, vulnerability scanning, penetration testing |
| Break-Glass | Emergency access | Multi-party approval, audit trails, time-limited access |
Security Layers
Layer 1: Identity and Access
- Authentication: Verify user identity with JWT tokens and Argon2id password hashing
- Authorization: Enforce access control with Cedar policies and RBAC
- MFA: Add second factor with TOTP or FIDO2 hardware keys
Layer 2: Data Protection
- Encryption: Protect data at rest with AES-256-GCM and in transit with TLS 1.3
- Secrets Management: Store secrets securely in SecretumVault with automatic rotation
- KMS: Manage encryption keys with envelope encryption across 5 backend options
Layer 3: Network Security
- Secure Communication: Enforce TLS/mTLS for all service-to-service communication
- Certificate Management: Automate certificate lifecycle with cert-manager integration
- Network Policies: Control traffic flow with Kubernetes NetworkPolicies
Layer 4: Compliance and Monitoring
- Audit Logging: Record all security events with 7-year retention
- Compliance: Validate against SOC2, GDPR, and HIPAA frameworks
- Security Testing: Continuous validation with automated security test suite
Performance Characteristics
- Authentication Overhead: Less than 20ms per request with JWT verification
- Authorization Decision: Less than 10ms with Cedar policy evaluation
- Encryption Operations: Less than 5ms with KMS-backed envelope encryption
- Audit Logging: Asynchronous with zero blocking on critical path
- MFA Verification: Less than 100ms for TOTP, less than 500ms for WebAuthn
Security Standards
The security system adheres to industry standards and best practices:
- OWASP Top 10: Protection against common web vulnerabilities
- NIST Cybersecurity Framework: Aligned with identify, protect, detect, respond, recover
- Zero Trust Architecture: Never trust, always verify principle
- Defense in Depth: Multiple layers of security controls
- Least Privilege: Minimal access rights for users and services
- Secure by Default: Security controls enabled out of the box
Component Integration
All security components work together as a cohesive system:
┌─────────────────────────────────────────────────────────────┐
│ User Request │
└──────────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Authentication (JWT + Session) │
│ ↓ │
│ Authorization (Cedar Policies) │
│ ↓ │
│ MFA Verification (if required) │
└──────────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Audit Logging (Record all actions) │
└──────────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Secure Communication (TLS/mTLS) │
│ ↓ │
│ Data Access (Encrypted with KMS) │
│ ↓ │
│ Secrets Retrieved (SecretumVault) │
└──────────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Compliance Validation (SOC2/GDPR checks) │
└──────────────────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Response │
└─────────────────────────────────────────────────────────────┘
Security Configuration
Security settings are managed through hierarchical configuration:
# Security defaults in config/security.toml
[security]
auth_enabled = true
mfa_required = true
audit_enabled = true
encryption_at_rest = true
tls_min_version = "1.3"
[security.jwt]
algorithm = "RS256"
access_token_ttl = 900 # 15 minutes
refresh_token_ttl = 604800 # 7 days
[security.mfa]
totp_enabled = true
webauthn_enabled = true
backup_codes_count = 10
[security.kms]
backend = "secretumvault"
envelope_encryption = true
key_rotation_days = 90
[security.audit]
retention_days = 2555 # 7 years
export_formats = ["json", "csv", "parquet", "sqlite", "syslog"]
[security.compliance]
frameworks = ["soc2", "gdpr", "hipaa"]
policy_enforcement = "strict"
Quick Start
Enable security system for your deployment:
# Enable all security features
provisioning config set security.enabled true
# Configure authentication
provisioning config set security.auth.jwt_algorithm RS256
provisioning config set security.auth.mfa_required true
# Set up SecretumVault integration
provisioning config set security.secrets.backend secretumvault
provisioning config set security.secrets.url http://localhost:8200
# Enable audit logging
provisioning config set security.audit.enabled true
provisioning config set security.audit.retention_days 2555
# Configure compliance framework
provisioning config set security.compliance.frameworks soc2,gdpr
# Verify security configuration
provisioning security validate
Documentation Structure
This security documentation is organized into 12 detailed guides:
- Authentication - JWT token-based authentication and session management
- Authorization - Cedar policy engine and RBAC access control
- Multi-Factor Authentication - TOTP and WebAuthn/FIDO2 implementation
- Audit Logging - Comprehensive audit trails and compliance reporting
- Key Management Service - Encryption key management and rotation
- Secrets Management - SecretumVault and SOPS/Age integration
- Encryption - At-rest and in-transit data protection
- Secure Communication - TLS/mTLS and network security
- Certificate Management - PKI and certificate lifecycle
- Compliance - SOC2, GDPR, HIPAA frameworks
- Security Testing - Test suite and vulnerability scanning
- Break-Glass Procedures - Emergency access and recovery
Security Metrics
The security system tracks key metrics for monitoring and reporting:
- Authentication Success Rate: Percentage of successful login attempts
- MFA Adoption Rate: Percentage of users with MFA enabled
- Policy Violations: Count of authorization denials
- Audit Event Rate: Events logged per second
- Secret Rotation Compliance: Percentage of secrets rotated within policy
- Certificate Expiration: Days until certificate expiration
- Compliance Score: Overall compliance posture percentage
- Security Test Pass Rate: Percentage of security tests passing
Best Practices
Follow these security best practices:
- Enable MFA for all users: Require second factor for all accounts
- Rotate secrets regularly: Automate secret rotation every 90 days
- Monitor audit logs: Review security events daily
- Test security controls: Run security test suite before deployments
- Keep certificates current: Automate certificate renewal 30 days before expiration
- Review policies regularly: Audit Cedar policies quarterly
- Limit break-glass access: Require multi-party approval for emergency access
- Encrypt all data: Enable encryption at rest and in transit
- Follow least privilege: Grant minimal required permissions
- Validate compliance: Run compliance checks before production deployments
Getting Help
For security issues and questions:
- Security Documentation: Complete guides in this security section
- CLI Help: provisioning security help
- Security Validation: provisioning security validate
- Audit Query: provisioning security audit query
- Compliance Check: provisioning security compliance check
Security Updates
The security system is continuously updated to address emerging threats and vulnerabilities. Subscribe to security advisories and apply updates promptly.
Next Steps:
- Read Authentication Guide to set up user authentication
- Configure Authorization with Cedar policies
- Enable MFA for all user accounts
- Set up Audit Logging for compliance
Authentication
JWT token-based authentication with session management, login flows, and multi-provider support.
Overview
The authentication system verifies user identity through JWT (JSON Web Token) tokens with RS256 signatures and Argon2id password hashing. It provides secure session management, token refresh capabilities, and support for multiple authentication providers.
Architecture
Authentication Flow
┌──────────┐ ┌──────────────┐ ┌────────────┐
│ Client │ │ Auth Service│ │ Database │
└────┬─────┘ └──────┬───────┘ └─────┬──────┘
│ │ │
│ POST /auth/login │ │
│ {username, password} │ │
│────────────────────────────>│ │
│ │ │
│ │ Find user by username │
│ │─────────────────────────────>│
│ │<─────────────────────────────│
│ │ User record │
│ │ │
│ │ Verify password (Argon2id) │
│ │ │
│ │ Create session │
│ │─────────────────────────────>│
│ │<─────────────────────────────│
│ │ │
│ │ Generate JWT token pair │
│ │ │
│ {access_token, refresh} │ │
│<────────────────────────────│ │
│ │ │
Components
| Component | Purpose | Technology |
|---|---|---|
| AuthService | Core authentication logic | Rust service in control-center |
| JwtService | Token generation and verification | RS256 algorithm with jsonwebtoken crate |
| SessionManager | Session lifecycle management | Database-backed session storage |
| PasswordHasher | Password hashing and verification | Argon2id with configurable parameters |
| UserService | User account management | CRUD operations with role assignment |
JWT Token Structure
Access Token
Short-lived token for API authentication (default: 15 minutes).
{
  "header": {
    "alg": "RS256",
    "typ": "JWT"
  },
  "payload": {
    "sub": "550e8400-e29b-41d4-a716-446655440000",
    "email": "user@example.com",
    "username": "alice",
    "roles": ["user", "developer"],
    "session_id": "sess_abc123",
    "mfa_verified": true,
    "permissions_hash": "sha256:abc123...",
    "iat": 1704067200,
    "exp": 1704068100,
    "iss": "provisioning-platform",
    "aud": "api.provisioning.example.com"
  }
}
Refresh Token
Long-lived token for obtaining new access tokens (default: 7 days).
{
  "header": {
    "alg": "RS256",
    "typ": "JWT"
  },
  "payload": {
    "sub": "550e8400-e29b-41d4-a716-446655440000",
    "session_id": "sess_abc123",
    "token_type": "refresh",
    "iat": 1704067200,
    "exp": 1704672000,
    "iss": "provisioning-platform"
  }
}
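Either payload can be inspected locally without the signing key, which is what `provisioning auth token decode` does. A plain-shell equivalent is sketched below (inspection only; it performs no signature verification):

```shell
# Print the payload segment of a JWT. JWT segments use unpadded base64url,
# so the URL-safe alphabet is mapped back and padding restored before decoding.
jwt_payload() {
  seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  case $(( ${#seg} % 4 )) in
    2) seg="${seg}==" ;;
    3) seg="${seg}=" ;;
  esac
  printf '%s' "$seg" | base64 -d
}
```

Never trust a decoded payload for authorization decisions; only verified claims count.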
Password Security
Argon2id Configuration
Password hashing uses Argon2id with security-hardened parameters:
// Default Argon2id parameters
argon2::Params {
    m_cost: 65536,   // 64 MB memory
    t_cost: 3,       // 3 iterations
    p_cost: 4,       // 4 lanes (parallelism)
    output_len: 32,  // 32-byte hash
}
Password Requirements
Default password policy enforces:
- Minimum 12 characters
- At least one uppercase letter
- At least one lowercase letter
- At least one digit
- At least one special character
- Not in common password list
- Not similar to username or email
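A client-side pre-check can mirror the length and character-class rules before a request ever reaches the server (the server-side policy remains authoritative; the common-password and username-similarity checks are omitted from this sketch):

```shell
# Return non-zero unless the password satisfies the length and
# character-class requirements listed above.
password_ok() {
  p=$1
  [ ${#p} -ge 12 ] || return 1
  case $p in *[A-Z]*) ;; *) return 1 ;; esac          # uppercase letter
  case $p in *[a-z]*) ;; *) return 1 ;; esac          # lowercase letter
  case $p in *[0-9]*) ;; *) return 1 ;; esac          # digit
  case $p in *[!A-Za-z0-9]*) ;; *) return 1 ;; esac   # special character
}
```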
Session Management
Session Lifecycle
- Creation: New session created on successful login
- Active: Session tracked with last activity timestamp
- Refresh: Session extended on token refresh
- Expiration: Session expires after inactivity timeout
- Revocation: Manual logout or security event terminates session
Session Storage
Sessions stored in database with:
pub struct Session {
    pub session_id: Uuid,
    pub user_id: Uuid,
    pub created_at: DateTime<Utc>,
    pub expires_at: DateTime<Utc>,
    pub last_activity: DateTime<Utc>,
    pub ip_address: Option<String>,
    pub user_agent: Option<String>,
    pub is_active: bool,
}
Session Tracking
Track multiple concurrent sessions per user:
# List active sessions for user
provisioning security sessions list --user alice
# Revoke specific session
provisioning security sessions revoke --session-id sess_abc123
# Revoke all sessions except current
provisioning security sessions revoke-all --except-current
Login Flows
Standard Login
Basic username/password authentication:
# CLI login
provisioning auth login --username alice --password <password>
# API login
curl -X POST https://api.provisioning.example.com/auth/login \
  -H "Content-Type: application/json" \
  -d '{
    "username_or_email": "alice",
    "password": "SecurePassword123!",
    "client_info": {
      "ip_address": "192.168.1.100",
      "user_agent": "provisioning-cli/1.0"
    }
  }'
Response:
{
  "access_token": "eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9...",
  "refresh_token": "eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9...",
  "token_type": "Bearer",
  "expires_in": 900,
  "user": {
    "user_id": "550e8400-e29b-41d4-a716-446655440000",
    "username": "alice",
    "email": "alice@example.com",
    "roles": ["user", "developer"]
  }
}
MFA Login
Two-phase authentication with MFA:
# Phase 1: Initial authentication
provisioning auth login --username alice --password <password>
# Response indicates MFA required
# {
# "mfa_required": true,
# "mfa_token": "temp_token_abc123",
# "available_methods": ["totp", "webauthn"]
# }
# Phase 2: MFA verification
provisioning auth mfa-verify --mfa-token temp_token_abc123 --code 123456
SSO Login
Single Sign-On with external providers:
# Initiate SSO flow
provisioning auth sso --provider okta
# Or with SAML
provisioning auth sso --provider azure-ad --protocol saml
Token Refresh
Automatic Refresh
Client libraries automatically refresh tokens before expiration:
// Automatic token refresh in Rust client
let client = ProvisioningClient::new()
.with_auto_refresh(true)
.build()?;
// Tokens refreshed transparently
client.server().list().await?;
Manual Refresh
Explicit token refresh when needed:
# CLI token refresh
provisioning auth refresh
# API token refresh
curl -X POST https://api.provisioning.example.com/auth/refresh \
  -H "Content-Type: application/json" \
  -d '{
    "refresh_token": "eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9..."
  }'
Response:
{
  "access_token": "eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9...",
  "token_type": "Bearer",
  "expires_in": 900
}
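Clients that refresh manually typically schedule the next refresh well before `expires_in` elapses; refreshing at roughly 80% of the token lifetime is a common heuristic (the fraction is a suggestion, not a platform requirement):

```shell
# Seconds to wait before refreshing, given expires_in from the response.
refresh_in() {
  expires_in=$1
  echo $(( expires_in * 80 / 100 ))
}
```

For the 900-second access token above this schedules a refresh every 720 seconds.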
Multi-Provider Authentication
Supported Providers
| Provider | Type | Configuration |
|---|---|---|
| Local | Username/password | Built-in user database |
| LDAP | Directory service | Active Directory, OpenLDAP |
| SAML | SSO | Okta, Azure AD, OneLogin |
| OIDC | OAuth2/OpenID | Google, GitHub, Auth0 |
| mTLS | Certificate | Client certificate authentication |
Provider Configuration
[auth.providers.ldap]
enabled = true
server = "ldap://ldap.example.com"
base_dn = "dc=example,dc=com"
bind_dn = "cn=admin,dc=example,dc=com"
user_filter = "(uid={username})"
[auth.providers.saml]
enabled = true
entity_id = "https://provisioning.example.com"
sso_url = "https://okta.example.com/sso/saml"
certificate_path = "/etc/provisioning/saml-cert.pem"
[auth.providers.oidc]
enabled = true
issuer = "https://accounts.google.com"
client_id = "client_id_here"
client_secret = "client_secret_here"
redirect_uri = "https://provisioning.example.com/auth/callback"
Token Validation
JWT Verification
All API requests validate JWT tokens:
// Middleware validates JWT on every request
pub async fn jwt_auth_middleware(
    headers: HeaderMap,
    State(jwt_service): State<Arc<JwtService>>,
    mut request: Request,
    next: Next,
) -> Result<Response, AuthError> {
    // Extract token from Authorization header
    let token = extract_bearer_token(&headers)?;
    // Verify signature and claims
    let claims = jwt_service.verify_access_token(&token)?;
    // Check expiration
    if claims.exp < Utc::now().timestamp() {
        return Err(AuthError::TokenExpired);
    }
    // Inject user context into request
    request.extensions_mut().insert(claims);
    Ok(next.run(request).await)
}
Token Revocation
Revoke tokens on security events:
# Revoke all tokens for user
provisioning security tokens revoke-user --user alice
# Revoke specific token
provisioning security tokens revoke --token-id token_abc123
# Check token status
provisioning security tokens status --token eyJhbGci...
Security Hardening
Configuration
Secure authentication settings:
[security.auth]
# JWT settings
jwt_algorithm = "RS256"
jwt_issuer = "provisioning-platform"
access_token_ttl = 900 # 15 minutes
refresh_token_ttl = 604800 # 7 days
token_leeway = 30 # 30 seconds clock skew
# Password policy
password_min_length = 12
password_require_uppercase = true
password_require_lowercase = true
password_require_digit = true
password_require_special = true
password_check_common = true
# Session settings
session_timeout = 1800 # 30 minutes inactivity
max_sessions_per_user = 5
remember_me_duration = 2592000 # 30 days
# Security controls
enforce_mfa = true
allow_password_reset = true
lockout_after_attempts = 5
lockout_duration = 900 # 15 minutes
Best Practices
- Use strong passwords: Enforce password policy with minimum 12 characters
- Enable MFA: Require second factor for all users
- Rotate keys regularly: Update JWT signing keys every 90 days
- Monitor failed attempts: Alert on suspicious login patterns
- Limit session duration: Use short access token TTL with refresh tokens
- Secure token storage: Store tokens securely, never in local storage
- Validate on every request: Always verify JWT signature and expiration
- Use HTTPS only: Never transmit tokens over unencrypted connections
CLI Integration
Login and Session Management
# Login with credentials
provisioning auth login --username alice
# Login with MFA
provisioning auth login --username alice --mfa
# Check authentication status
provisioning auth status
# Logout (revoke session)
provisioning auth logout
# List active sessions
provisioning security sessions list
# Refresh token
provisioning auth refresh
Token Management
# Show current token
provisioning auth token show
# Validate token
provisioning auth token validate
# Decode token (without verification)
provisioning auth token decode
# Revoke token
provisioning auth token revoke
API Reference
Endpoints
| Endpoint | Method | Purpose |
|---|---|---|
| /auth/login | POST | Authenticate with credentials |
| /auth/refresh | POST | Refresh access token |
| /auth/logout | POST | Revoke session and tokens |
| /auth/verify | POST | Verify MFA code |
| /auth/sessions | GET | List active sessions |
| /auth/sessions/:id | DELETE | Revoke specific session |
| /auth/password-reset | POST | Initiate password reset |
| /auth/password-change | POST | Change password |
Troubleshooting
Common Issues
Token expired errors:
# Refresh token
provisioning auth refresh
# Or re-login
provisioning auth login
Invalid signature:
# Check JWT configuration
provisioning config get security.auth.jwt_algorithm
# Verify public key is correct
provisioning security keys verify
MFA verification failures:
# Check time sync (TOTP requires accurate time)
ntpdate -q pool.ntp.org
# Re-sync MFA device
provisioning auth mfa-setup --resync
Session not found:
# Clear local session and re-login
provisioning auth logout
provisioning auth login
Monitoring
Metrics
Track authentication metrics:
- Login success rate
- Failed login attempts per user
- Average session duration
- Token refresh rate
- MFA verification success rate
- Active sessions count
Alerts
Configure alerts for security events:
- Multiple failed login attempts
- Login from new location
- Unusual authentication patterns
- Session hijacking attempts
- Token tampering detected
Next Steps:
- Configure Authorization with Cedar policies
- Enable Multi-Factor Authentication
- Set up Audit Logging for authentication events
Authorization
Multi-Factor Authentication
Audit Logging
KMS Guide
Secrets Management
SecretumVault Integration Guide
SecretumVault is a post-quantum cryptography (PQC) secure vault system integrated with Provisioning’s vault-service. It provides quantum-resistant encryption for sensitive credentials and infrastructure secrets.
Overview
SecretumVault combines:
- Post-Quantum Cryptography: Algorithms resistant to quantum computer attacks
- Hardware Acceleration: Optional FPGA acceleration for performance
- Distributed Architecture: Multi-node secure storage
- Compliance: FIPS 140-3 ready, NIST standards
Architecture
Integration Points
Provisioning
├─ CLI (Nushell)
│ └─ nu_plugin_secretumvault
│
├─ vault-service (Rust)
│ ├─ secretumvault backend
│ ├─ rustyvault compatibility
│ └─ SOPS + Age integration
│
└─ Control Center
└─ Secret management UI
Cryptographic Stack
User Secret
↓
KDF (Key Derivation Function)
├─ Argon2id (password-based)
└─ HKDF (key-based)
↓
PQC Encryption Layer
├─ CRYSTALS-Kyber (key encapsulation)
├─ Falcon (signature)
├─ SPHINCS+ (backup signature)
└─ Hybrid: PQC + Classical (AES-256)
↓
Authenticated Encryption
├─ ChaCha20-Poly1305
└─ AES-256-GCM
↓
Secure Storage
├─ Local vault
├─ SurrealDB
└─ Hardware module (optional)
Installation
Install SecretumVault
# Install via provisioning
provisioning install secretumvault
# Or manual installation
cd /Users/Akasha/Development/secretumvault
cargo install --path .
# Verify installation
secretumvault --version
Install Nushell Plugin
# Install plugin
provisioning install nu-plugin-secretumvault
# Reload Nushell
nu -c "plugin add nu_plugin_secretumvault"
# Verify
nu -c "secretumvault-plugin version"
Configuration
Environment Setup
# Set vault location
export SECRETUMVAULT_HOME=~/.secretumvault
# Set encryption algorithm
export SECRETUMVAULT_CIPHER=kyber-aes # kyber-aes, falcon-aes, hybrid
# Set key derivation
export SECRETUMVAULT_KDF=argon2id # argon2id, pbkdf2
# Enable hardware acceleration (optional)
export SECRETUMVAULT_HW_ACCEL=enabled
Configuration File
# ~/.secretumvault/config.yaml
vault:
  storage_backend: surrealdb    # local, surrealdb, redis
  encryption_cipher: kyber-aes  # kyber-aes, falcon-aes, hybrid
  key_derivation: argon2id      # argon2id, pbkdf2

# Argon2id parameters (password strength)
kdf:
  memory: 65536   # KB
  iterations: 3
  parallelism: 4

# Encryption parameters
encryption:
  key_length: 256      # bits
  nonce_length: 12     # bytes
  auth_tag_length: 16  # bytes

# Database backend (if using SurrealDB)
database:
  url: "surrealdb://localhost:8000"
  namespace: "provisioning"
  database: "secrets"

# Hardware acceleration (optional)
hardware:
  use_fpga: false
  fpga_device: "/dev/fpga0"

# Backup configuration
backup:
  enabled: true
  interval: 24   # hours
  retention: 30  # days
  encrypt_backup: true
  backup_path: ~/.secretumvault/backups

# Access logging
audit:
  enabled: true
  log_file: ~/.secretumvault/audit.log
  log_level: info
  rotate_logs: true
  retention_days: 365

# Master key management
master_key:
  protection: none  # none, tpm, hsm, hardware-module
  rotation_enabled: true
  rotation_interval: 90  # days
Usage
Command Line Interface
# Create master key
secretumvault init
# Add secret
secretumvault secret add \
--name database-password \
--value "supersecret" \
--metadata "type=database,app=api"
# Retrieve secret
secretumvault secret get database-password
# List secrets
secretumvault secret list
# Delete secret
secretumvault secret delete database-password
# Rotate key
secretumvault key rotate
# Backup vault
secretumvault backup create --output vault-backup.enc
# Restore vault
secretumvault backup restore vault-backup.enc
Nushell Integration
# Load SecretumVault plugin
plugin add nu_plugin_secretumvault
# Add secret from Nushell
let password = "mypassword"
secretumvault-plugin store "app-secret" $password
# Retrieve secret
let db_pass = (secretumvault-plugin retrieve "database-password")
# List all secrets
secretumvault-plugin list
# Delete secret
secretumvault-plugin delete "old-secret"
# Rotate key
secretumvault-plugin rotate-key
Provisioning Integration
# Configure vault-service to use SecretumVault
provisioning config set security.vault.backend secretumvault
# Enable in form prefill
provisioning setup profile --use-secretumvault
# Manage secrets via CLI
provisioning vault add \
--name aws-access-key \
--value "AKIAIOSFODNN7EXAMPLE" \
--metadata "provider=aws,env=production"
# Use secret in infrastructure
provisioning ai "Create AWS resources using secret aws-access-key"
Post-Quantum Cryptography
Algorithms Supported
| Algorithm | Type | NIST Status | Performance |
|---|---|---|---|
| CRYSTALS-Kyber | KEM | Finalist | Fast |
| Falcon | Signature | Finalist | Medium |
| SPHINCS+ | Hash-based Signature | Finalist | Slower |
| AES-256 | Hybrid (Classical) | Standard | Very fast |
| ChaCha20 | Stream Cipher | Alternative | Fast |
Hybrid Mode (Recommended)
SecretumVault uses hybrid encryption by default:
Secret Input
↓
Key Material: Classical (AES-256) + PQC (Kyber)
├─ Generate AES key
├─ Generate Kyber keypair
└─ Encapsulate using Kyber
↓
Encrypt with both algorithms
├─ AES-256-GCM encryption
└─ Kyber encapsulation (public key cryptography)
↓
Both keys required to decrypt
├─ If quantum computer breaks Kyber → AES still secure
└─ If breakthrough in AES → Kyber still secure
↓
Encrypted Secret Stored
Advantages:
- Protection against quantum computers (PQC)
- Protection against classical attacks (AES-256)
- Compatible with both current and future threats
- No single point of failure
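The "no single point of failure" property can be demonstrated with two independent classical layers: both keys must be presented to decrypt. The sketch below uses openssl AES-256-CBC for both layers purely for illustration; in SecretumVault the outer layer is a Kyber key encapsulation, not a second passphrase.

```shell
# Encrypt stdin under two independent passphrases; decryption reverses
# the layer order. Breaking one layer alone recovers nothing.
layer_encrypt() {  # $1 inner passphrase, $2 outer passphrase
  openssl enc -aes-256-cbc -pbkdf2 -salt -pass "pass:$1" |
    openssl enc -aes-256-cbc -pbkdf2 -salt -pass "pass:$2"
}
layer_decrypt() {  # $1 inner passphrase, $2 outer passphrase
  openssl enc -d -aes-256-cbc -pbkdf2 -pass "pass:$2" |
    openssl enc -d -aes-256-cbc -pbkdf2 -pass "pass:$1"
}
```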
Key Rotation Strategy
# Manual key rotation
secretumvault key rotate --algorithm kyber-aes
# Scheduled rotation (every 90 days)
secretumvault key rotate --schedule 90d
# Emergency rotation
secretumvault key rotate --emergency --force
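The 90-day schedule reduces to an elapsed-time check against the last rotation timestamp. A minimal sketch, assuming the vault records that timestamp (the real scheduler lives inside the vault service):

```rust
use std::time::{Duration, SystemTime};

// Decide whether a key is due for rotation: true once `interval`
// has elapsed since `last_rotated`, measured at `now`.
pub fn rotation_due(last_rotated: SystemTime, interval: Duration, now: SystemTime) -> bool {
    match now.duration_since(last_rotated) {
        Ok(elapsed) => elapsed >= interval,
        Err(_) => false, // clock went backwards; treat as not yet due
    }
}
```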
Security Features
Authentication
# Master key authentication
secretumvault auth login
# MFA for sensitive operations
secretumvault auth mfa enable --method totp
# Biometric unlock (supported platforms)
secretumvault auth enable-biometric
Access Control
# Set vault permissions
secretumvault acl set database-password \
--read "api-service,backup-service" \
--write "admin" \
--delete "admin"
# View access logs
secretumvault audit log --secret database-password
Audit Logging
Every operation is logged:
# View audit log
secretumvault audit log --since 24h
# Export audit log
secretumvault audit export --format json > audit.json
# Monitor real-time
secretumvault audit monitor
Sample Log Entry:
{
"timestamp": "2026-01-16T01:47:00Z",
"operation": "secret_retrieve",
"secret": "database-password",
"user": "api-service",
"status": "success",
"ip_address": "127.0.0.1",
"device_id": "device-123"
}
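Filtering exported entries by secret name, as `audit log --secret` does, is a simple predicate over the log. A hypothetical sketch using only fields from the sample entry above:

```rust
// Minimal in-memory model of an audit entry; the real log carries
// more fields (timestamp, status, ip_address, device_id).
pub struct AuditEntry {
    pub operation: String,
    pub secret: String,
    pub user: String,
}

// Keep only entries that touched the named secret.
pub fn filter_by_secret<'a>(log: &'a [AuditEntry], secret: &str) -> Vec<&'a AuditEntry> {
    log.iter().filter(|e| e.secret == secret).collect()
}
```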
Disaster Recovery
Backup Procedures
# Create encrypted backup
secretumvault backup create \
--output /secure/vault-backup.enc \
--compression gzip
# Verify backup integrity
secretumvault backup verify /secure/vault-backup.enc
# Restore from backup
secretumvault backup restore \
--input /secure/vault-backup.enc \
--verify-checksum
Recovery Key
# Generate recovery key (for emergencies)
secretumvault recovery-key generate \
--threshold 3 \
--shares 5
# Share recovery shards
# Share with 5 trusted people, need 3 to recover
# Recover using shards
secretumvault recovery-key restore \
--shard1 /secure/shard1.key \
--shard2 /secure/shard2.key \
--shard3 /secure/shard3.key
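The 3-of-5 recovery key is an instance of Shamir secret sharing: a degree-2 polynomial is sampled with the secret as its constant term, each shard is one evaluation point, and any 3 points interpolate the secret at x = 0. A toy sketch over a small prime field; real implementations use proper fields, cryptographically random coefficients, and split actual key bytes.

```rust
const P: i128 = 65_537; // small prime field; real schemes use GF(2^8) or larger

// Modular exponentiation by squaring.
fn mod_pow(mut b: i128, mut e: i128, m: i128) -> i128 {
    let mut r = 1;
    b %= m;
    while e > 0 {
        if e & 1 == 1 { r = r * b % m; }
        b = b * b % m;
        e >>= 1;
    }
    r
}

fn inv(a: i128) -> i128 { mod_pow(a.rem_euclid(P), P - 2, P) } // Fermat inverse

/// Split `secret` into `n` shares; any `k` reconstruct it.
/// The k-1 polynomial coefficients are supplied by the caller
/// (a real implementation draws them from a CSPRNG).
pub fn split(secret: i128, k: usize, n: usize, coeffs: &[i128]) -> Vec<(i128, i128)> {
    assert_eq!(coeffs.len(), k - 1);
    (1..=n as i128)
        .map(|x| {
            let mut y = secret;
            for (i, c) in coeffs.iter().enumerate() {
                y = (y + c * mod_pow(x, i as i128 + 1, P)) % P;
            }
            (x, y)
        })
        .collect()
}

/// Lagrange interpolation at x = 0 recovers the secret from k shares.
pub fn recover(shares: &[(i128, i128)]) -> i128 {
    let mut secret = 0;
    for (i, &(xi, yi)) in shares.iter().enumerate() {
        let mut num = 1;
        let mut den = 1;
        for (j, &(xj, _)) in shares.iter().enumerate() {
            if i != j {
                num = num * (-xj).rem_euclid(P) % P;
                den = den * (xi - xj).rem_euclid(P) % P;
            }
        }
        secret = (secret + yi * num % P * inv(den)) % P;
    }
    secret.rem_euclid(P)
}
```

With fewer than the threshold number of shards, interpolation yields an unrelated value, which is why 2 of 5 trusted people learn nothing about the key.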
Performance
Benchmark Results
| Operation | Time | Algorithm |
|---|---|---|
| Store secret | 50-100ms | Kyber-AES |
| Retrieve secret | 30-50ms | Kyber-AES |
| Key rotation | 200-500ms | Kyber-AES |
| Backup 1000 secrets | 2-3 seconds | Kyber-AES |
| Restore from backup | 3-5 seconds | Kyber-AES |
Hardware Acceleration
With FPGA acceleration:
| Operation | Native | FPGA | Speedup |
|---|---|---|---|
| Store secret | 75ms | 15ms | 5x |
| Key rotation | 350ms | 50ms | 7x |
| Backup 1000 | 2.5s | 0.4s | 6x |
Troubleshooting
Cannot Initialize Vault
# Check permissions
ls -la ~/.secretumvault
# Clear corrupted state
rm ~/.secretumvault/state.lock
# Reinitialize
secretumvault init --force
Slow Performance
# Check algorithm
secretumvault config get encryption.cipher
# Switch to faster algorithm
export SECRETUMVAULT_CIPHER=kyber-aes
# Enable hardware acceleration
export SECRETUMVAULT_HW_ACCEL=enabled
Master Key Lost
# Use recovery key (if available)
secretumvault recovery-key restore \
--shard1 ... --shard2 ... --shard3 ...
# If no recovery key exists, vault is unrecoverable
# Use recent backup instead
secretumvault backup restore vault-backup.enc
Compliance & Standards
Certifications
- ✅ NIST PQC Standards: CRYSTALS-Kyber, Falcon, SPHINCS+
- ✅ FIPS 140-3 Ready: Cryptographic module certification path
- ✅ NIST SP 800-175B: Post-quantum cryptography guidance
- ✅ EU Cyber Resilience Act: PQC readiness
Export Controls
SecretumVault is subject to cryptography export controls in some jurisdictions. Ensure compliance with local regulations.
Related Documentation
- Security Overview - Security architecture
- Encryption Guide - Encryption strategies
- Secrets Management - Secret handling
- Vault Service - Microservice architecture
- KMS Guide - Key management system
Encryption
Secure Communication
Certificate Management
Compliance
Security Testing
Development
Comprehensive guides for developers building extensions, custom providers, plugins, and integrations on the Provisioning platform.
Overview
Provisioning is designed to be extended and customized for specific infrastructure needs. This section provides everything needed to:
- Build custom cloud providers interfacing with any infrastructure platform via the Provider SDK
- Create custom detectors for domain-specific infrastructure analysis and anomaly detection
- Develop task services for specialized infrastructure operations beyond built-in services
- Write Nushell plugins for high-performance scripting extensions
- Integrate external systems via REST APIs and the MCP (Model Context Protocol)
- Understand platform internals for daemon architecture, caching, and performance optimization
The platform uses modern Rust with async/await, Nushell for scripting, and Nickel for configuration - all with production-ready code examples.
Development Guides
Extension Development
- Extension Development - Framework for extensions (providers, task services, plugins, clusters) with type-safety
- Custom Provider Development - Build cloud providers with async Rust, credentials, state, error recovery, testing
- Custom Task Services - Specialized service development for infrastructure operations
- Custom Detector Development - Cost, compliance, performance, security risk detection
- Plugin Development - Nushell plugins for high-performance scripting with FFI bindings
Platform Internals
- Provisioning Daemon Internals - TCP server, connection pooling, caching, metrics, shutdown, 50x speedup
Integration and APIs
- API Guide - REST API integration with authentication, pagination, error handling, rate limiting
- Build System - Cargo configuration, feature flags, dependencies, cross-platform compilation
- Testing - Unit, integration, property-based testing, benchmarking, CI/CD patterns
Community
- Contributing - Guidelines, standards, review process, licensing
Quick Start Paths
I want to build a custom provider
Start with Custom Provider Development - includes template, credential patterns, error handling, tests, and publishing workflow.
I want to create custom detectors
See Custom Detector Development - covers analysis frameworks, state tracking, testing, and marketplace distribution.
I want to extend with Nushell
Read Plugin Development - FFI bindings, type safety, performance optimization, and integration patterns.
I want to understand system performance
Study Provisioning Daemon Internals - architecture, caching strategy, connection pooling, metrics collection.
I want to integrate external systems
Check API Guide - REST endpoints, authentication, webhooks, and integration patterns.
Technology Stack
- Language: Rust (async/await with Tokio), Nushell (scripting)
- Configuration: Nickel (type-safe) + TOML (generated)
- Testing: Unit tests, integration tests, property-based tests
- Performance: Prometheus metrics, connection pooling, LRU caching
- Security: Post-quantum cryptography, type-safety, secure defaults
Development Environment
All development builds with:
cargo build --release
cargo test --all
cargo clippy -- -D warnings
Navigation
- For architecture insights → See provisioning/docs/src/architecture/
- For API details → See provisioning/docs/src/api-reference/
- For examples → See provisioning/docs/src/examples/
- For deployment → See provisioning/docs/src/operations/
Extension Development
Creating custom extensions to add providers, task services, and clusters to the Provisioning platform.
Extension Overview
Extensions are modular components that extend platform capabilities:
| Extension Type | Purpose | Implementation | Complexity |
|---|---|---|---|
| Providers | Cloud infrastructure backends | Nushell scripts + Nickel schemas | Moderate |
| Task Services | Infrastructure components | Nushell installation scripts | Simple |
| Clusters | Complete deployments | Nickel schemas + orchestration | Moderate |
| Workflows | Automation templates | Nickel workflow definitions | Simple |
Extension Structure
Standard extension directory layout:
provisioning/extensions/<type>/<name>/
├── nickel/
│ ├── schema.ncl # Nickel type definitions
│ ├── defaults.ncl # Default configuration
│ └── validation.ncl # Validation rules
├── scripts/
│ ├── install.nu # Installation script
│ ├── uninstall.nu # Removal script
│ └── validate.nu # Validation script
├── templates/
│ └── config.template # Configuration templates
├── tests/
│ └── test_*.nu # Test scripts
├── docs/
│ └── README.md # Documentation
└── metadata.toml # Extension metadata
Extension Metadata
Every extension requires metadata.toml:
# metadata.toml
[extension]
name = "my-provider"
type = "provider"
version = "1.0.0"
description = "Custom cloud provider"
author = "Your Name <email@example.com>"
license = "MIT"
[dependencies]
nushell = ">=0.109.0"
nickel = ">=1.15.1"
[dependencies.extensions]
# Other extensions this depends on
base-provider = "1.0.0"
[capabilities]
create_server = true
delete_server = true
create_network = true
[configuration]
required_fields = ["api_key", "region"]
optional_fields = ["timeout", "retry_attempts"]
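A loader consuming the `[configuration]` section above would enforce `required_fields` and reject keys that are neither required nor optional. A hypothetical sketch of that check; the platform's actual loader may differ:

```rust
use std::collections::HashMap;

// Validate a provider config against declared required/optional fields:
// every required field must be present, and no unknown fields are allowed.
pub fn validate_fields(
    config: &HashMap<String, String>,
    required: &[&str],
    optional: &[&str],
) -> Result<(), String> {
    for field in required {
        if !config.contains_key(*field) {
            return Err(format!("missing required field: {field}"));
        }
    }
    for key in config.keys() {
        if !required.contains(&key.as_str()) && !optional.contains(&key.as_str()) {
            return Err(format!("unknown field: {key}"));
        }
    }
    Ok(())
}
```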
Creating a Provider Extension
Providers implement cloud infrastructure backends.
Provider Structure
provisioning/extensions/providers/my-provider/
├── nickel/
│ ├── schema.ncl
│ ├── server.ncl
│ └── network.ncl
├── scripts/
│ ├── create_server.nu
│ ├── delete_server.nu
│ ├── list_servers.nu
│ └── validate.nu
├── templates/
│ └── server.template
├── tests/
│ └── test_provider.nu
└── metadata.toml
Provider Schema (Nickel)
# nickel/schema.ncl
{
Provider = {
name | String,
api_key | String,
region | String,
timeout | Number | default = 30,
server_config = {
default_plan | String | default = "medium",
allowed_plans | Array String,
},
},
Server = {
name | String,
plan | String,
zone | String,
hostname | String,
tags | Array String | default = [],
},
}
Provider Implementation (Nushell)
# scripts/create_server.nu
#!/usr/bin/env nu
# Create server using provider API
export def main [
config: record # Provider configuration
server: record # Server specification
] {
# Validate configuration
validate-config $config
# Construct API request
let request = {
name: $server.name
plan: $server.plan
zone: $server.zone
}
# Call provider API
let response = (http post --content-type application/json --headers {Authorization: $"Bearer ($config.api_key)"} $"($config.api_endpoint)/servers" $request)
# Return server details (http post parses JSON responses automatically)
$response
}
# Validate provider configuration
def validate-config [config: record] {
if ($config.api_key | is-empty) {
error make {msg: "api_key is required"}
}
if ($config.region | is-empty) {
error make {msg: "region is required"}
}
}
Provider Interface Contract
All providers must implement:
# Required operations
create_server # Create new server
delete_server # Delete existing server
get_server # Get server details
list_servers # List all servers
server_status # Check server status
# Optional operations
create_network # Create network
delete_network # Delete network
attach_storage # Attach storage volume
create_snapshot # Create server snapshot
Creating a Task Service Extension
Task services are installable infrastructure components.
Task Service Structure
provisioning/extensions/taskservs/my-service/
├── nickel/
│ ├── schema.ncl
│ └── defaults.ncl
├── scripts/
│ ├── install.nu
│ ├── uninstall.nu
│ ├── health.nu
│ └── validate.nu
├── templates/
│ ├── config.yaml.template
│ └── systemd.service.template
├── tests/
│ └── test_service.nu
├── docs/
│ └── README.md
└── metadata.toml
Task Service Metadata
# metadata.toml
[extension]
name = "my-service"
type = "taskserv"
version = "2.1.0"
description = "Custom infrastructure service"
[dependencies.taskservs]
# Task services this depends on
containerd = ">=1.7.0"
kubernetes = ">=1.28.0"
[installation]
requires_root = true
platforms = ["linux"]
architectures = ["x86_64", "aarch64"]
[health_check]
enabled = true
endpoint = "http://localhost:8000/health"
interval = 30
timeout = 5
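The `[health_check]` fields map naturally onto a poll loop: probe at `interval`, give up once an overall deadline passes. A generic sketch with the probe as a closure, so the example needs no HTTP client; the per-probe `timeout` is assumed to be handled inside the probe itself.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

// Repeatedly run `probe` every `interval` until it reports healthy,
// returning false if `deadline` elapses first.
pub fn wait_healthy<F: FnMut() -> bool>(mut probe: F, interval: Duration, deadline: Duration) -> bool {
    let start = Instant::now();
    loop {
        if probe() {
            return true;
        }
        if start.elapsed() >= deadline {
            return false;
        }
        sleep(interval);
    }
}
```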
Task Service Installation Script
# scripts/install.nu
#!/usr/bin/env nu
export def main [
config: record # Service configuration
server: record # Target server details
] {
print "Installing my-service..."
# Download binaries
let version = $config.version? | default "latest"
download-binary $version
# Install systemd service
install-systemd-service $config
# Configure service
generate-config $config
# Start service
start-service
# Verify installation
verify-installation
print "Installation complete"
}
def download-binary [version: string] {
let url = $"https://github.com/org/my-service/releases/download/($version)/my-service"
http get $url | save /usr/local/bin/my-service
chmod +x /usr/local/bin/my-service
}
def install-systemd-service [config: record] {
let template = open ../templates/systemd.service.template
let rendered = $template | str replace --all "{{VERSION}}" $config.version
$rendered | save /etc/systemd/system/my-service.service
systemctl daemon-reload
}
def start-service [] {
systemctl enable my-service
systemctl start my-service
}
def verify-installation [] {
let status = systemctl is-active my-service
if $status != "active" {
error make {msg: "Service failed to start"}
}
# Health check
sleep 5sec
let health = http get http://localhost:8000/health
if $health.status != "healthy" {
error make {msg: "Health check failed"}
}
}
Creating a Cluster Extension
Clusters combine servers and task services into complete deployments.
Cluster Schema
# nickel/schema.ncl
{
Cluster = {
metadata = {
name | String,
provider | String,
environment | String | default = "production",
},
infrastructure = {
servers | Array {
name | String,
role | std.enum.TagOrString | [| 'control, 'worker, 'storage |],
plan | String,
},
},
services = {
taskservs | Array String,
order | Array String | default = [],
},
networking = {
private_network | Bool | default = true,
cidr | String | default = "10.0.0.0/16",
},
},
}
Cluster Definition Example
# clusters/kubernetes-ha.ncl
{
metadata.name = "k8s-ha-cluster",
metadata.provider = "upcloud",
infrastructure.servers = [
{name = "control-01", role = "control", plan = "large"},
{name = "control-02", role = "control", plan = "large"},
{name = "control-03", role = "control", plan = "large"},
{name = "worker-01", role = "worker", plan = "xlarge"},
{name = "worker-02", role = "worker", plan = "xlarge"},
],
services.taskservs = ["containerd", "etcd", "kubernetes", "cilium"],
services.order = ["containerd", "etcd", "kubernetes", "cilium"],
networking.private_network = true,
networking.cidr = "10.100.0.0/16",
}
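`services.order` must list dependencies before dependents (containerd before kubernetes, kubernetes before cilium, and so on). A sketch of that validation in Rust, with the dependency map supplied as an assumed input rather than read from real taskserv metadata:

```rust
use std::collections::HashMap;

// Check that every service appears after all of its dependencies
// in the installation order.
pub fn order_valid(order: &[&str], deps: &HashMap<&str, Vec<&str>>) -> bool {
    let pos: HashMap<&str, usize> = order.iter().enumerate().map(|(i, s)| (*s, i)).collect();
    deps.iter().all(|(svc, reqs)| {
        reqs.iter().all(|req| match (pos.get(svc), pos.get(req)) {
            (Some(si), Some(ri)) => ri < si, // dependency installed earlier
            _ => false,                      // missing from the order entirely
        })
    })
}
```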
Extension Testing
Test Structure
# tests/test_provider.nu
use std assert
# Test provider configuration validation
export def test_validate_config [] {
let valid_config = {
api_key: "test-key"
region: "us-east-1"
}
let result = validate-config $valid_config
assert equal $result.valid true
}
# Test server creation
export def test_create_server [] {
let config = load-test-config
let server_spec = {
name: "test-server"
plan: "medium"
zone: "us-east-1a"
}
let result = create-server $config $server_spec
assert equal $result.status "created"
}
# Run all tests
export def main [] {
test_validate_config
test_create_server
print "All tests passed"
}
Run tests:
# Test extension
provisioning extension test my-provider
# Test specific component
nu tests/test_provider.nu
Extension Packaging
OCI Registry Publishing
Package and publish extension:
# Build extension package
provisioning extension build my-provider
# Validate package
provisioning extension validate my-provider-1.0.0.tar.gz
# Publish to registry
provisioning extension publish my-provider-1.0.0.tar.gz \
--registry registry.example.com
Package structure:
my-provider-1.0.0.tar.gz
├── metadata.toml
├── nickel/
├── scripts/
├── templates/
├── tests/
├── docs/
└── manifest.json
Extension Installation
Install extension from registry:
# Install from OCI registry
provisioning extension install my-provider --version 1.0.0
# Install from local file
provisioning extension install ./my-provider-1.0.0.tar.gz
# List installed extensions
provisioning extension list
# Update extension
provisioning extension update my-provider --version 1.1.0
# Uninstall extension
provisioning extension uninstall my-provider
Best Practices
- Follow naming conventions: lowercase with hyphens
- Version extensions semantically (semver)
- Document all configuration options
- Provide comprehensive tests
- Include usage examples in docs
- Validate input parameters
- Handle errors gracefully
- Log important operations
- Support idempotent operations
- Keep dependencies minimal
Related Documentation
- Provider Development - Provider specifics
- Nickel Guide - Nickel language
- Build System - Building extensions
- Testing - Testing strategies
Provider Development
Implementing custom cloud provider integrations for the Provisioning platform.
Provider Architecture
Providers abstract cloud infrastructure APIs through a unified interface, allowing infrastructure definitions to be portable across clouds.
Provider Interface
All providers must implement these core operations:
# Server lifecycle
create_server # Provision new server
delete_server # Remove server
get_server # Fetch server details
list_servers # List all servers
update_server # Modify server configuration
server_status # Get current state
# Network operations (optional)
create_network # Create private network
delete_network # Remove network
attach_network # Attach server to network
# Storage operations (optional)
attach_volume # Attach storage volume
detach_volume # Detach storage volume
create_snapshot # Snapshot server disk
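In Rust terms, the contract above is naturally a trait: extensions implement it against their cloud API, and an in-memory stub makes unit testing cheap before any real endpoint exists. Names here are illustrative, not the actual SDK.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Server {
    pub id: String,
    pub name: String,
    pub status: String,
}

// Hypothetical sketch of the required provider operations.
pub trait CloudProvider {
    fn create_server(&mut self, name: &str) -> Server;
    fn delete_server(&mut self, id: &str) -> bool;
    fn get_server(&self, id: &str) -> Option<Server>;
    fn list_servers(&self) -> Vec<Server>;
    fn server_status(&self, id: &str) -> Option<String>;
}

// In-memory stub, useful for tests before wiring a real API.
#[derive(Default)]
pub struct MockProvider {
    servers: Vec<Server>,
    next_id: u64,
}

impl CloudProvider for MockProvider {
    fn create_server(&mut self, name: &str) -> Server {
        self.next_id += 1;
        let s = Server {
            id: format!("srv-{}", self.next_id),
            name: name.into(),
            status: "running".into(),
        };
        self.servers.push(s.clone());
        s
    }
    fn delete_server(&mut self, id: &str) -> bool {
        let before = self.servers.len();
        self.servers.retain(|s| s.id != id);
        self.servers.len() < before
    }
    fn get_server(&self, id: &str) -> Option<Server> {
        self.servers.iter().find(|s| s.id == id).cloned()
    }
    fn list_servers(&self) -> Vec<Server> {
        self.servers.clone()
    }
    fn server_status(&self, id: &str) -> Option<String> {
        self.get_server(id).map(|s| s.status)
    }
}
```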
Provider Template
Use the official provider template:
# Generate provider scaffolding
provisioning generate provider --name my-cloud --template standard
# Creates:
# extensions/providers/my-cloud/
# ├── nickel/
# │ ├── schema.ncl
# │ ├── server.ncl
# │ └── network.ncl
# ├── scripts/
# │ ├── create_server.nu
# │ ├── delete_server.nu
# │ └── list_servers.nu
# └── metadata.toml
Provider Schema (Nickel)
Define provider configuration schema:
# nickel/schema.ncl
{
ProviderConfig = {
name | String,
api_endpoint | String,
api_key | String,
region | String,
timeout | Number | default = 30,
retry_attempts | Number | default = 3,
plans = {
small = {cpu = 2, memory = 4096, disk = 25},
medium = {cpu = 4, memory = 8192, disk = 50},
large = {cpu = 8, memory = 16384, disk = 100},
},
regions | Array String,
},
ServerSpec = {
name | String,
plan | String,
zone | String,
image | String | default = "ubuntu-24.04",
ssh_keys | Array String,
user_data | String | default = "",
},
}
Implementing Server Creation
Create server implementation:
# scripts/create_server.nu
#!/usr/bin/env nu
export def main [
config: record, # Provider configuration
spec: record # Server specification
]: nothing -> record {
# Validate inputs
validate-provider-config $config
validate-server-spec $spec
# Map plan to provider-specific values
let plan = get-plan-details $config $spec.plan
# Construct API request
let request = {
hostname: $spec.name
plan: $plan.name
zone: $spec.zone
storage_devices: [{
action: "create"
storage: $plan.disk
title: "root"
}]
login: {
user: "root"
keys: $spec.ssh_keys
}
}
# Call provider API with retry logic
let server = retry-api-call {||
http post --content-type application/json --headers {Authorization: $"Bearer ($config.api_key)"} $"($config.api_endpoint)/server" $request
} $config.retry_attempts
# Wait for server to be ready
wait-for-server-ready $config $server.uuid
# Return server details
{
id: $server.uuid
name: $server.hostname
ip_address: $server.ip_addresses.0.address
status: "running"
provider: $config.name
}
}
def validate-provider-config [config: record] {
if ($config.api_key | is-empty) {
error make {msg: "API key required"}
}
if ($config.region | is-empty) {
error make {msg: "Region required"}
}
}
def get-plan-details [config: record, plan_name: string]: nothing -> record {
$config.plans | get $plan_name
}
def retry-api-call [operation: closure, max_attempts: int]: nothing -> any {
mut attempt = 1
mut last_error = null
while $attempt <= $max_attempts {
try {
return (do $operation)
} catch {|err|
$last_error = $err
if $attempt < $max_attempts {
sleep (1sec * $attempt) # backoff grows with each attempt
$attempt = $attempt + 1
}
}
}
error make {msg: $"API call failed after ($max_attempts) attempts: ($last_error)"}
}
def wait-for-server-ready [config: record, server_id: string] {
mut ready = false
mut attempts = 0
let max_wait = 120 # 2 minutes
while not $ready and $attempts < $max_wait {
let status = (http get --headers {Authorization: $"Bearer ($config.api_key)"} $"($config.api_endpoint)/server/($server_id)")
if $status.state == "started" {
$ready = true
} else {
sleep 1sec
$attempts = $attempts + 1
}
}
if not $ready {
error make {msg: "Server failed to start within timeout"}
}
}
Provider Testing
Comprehensive provider testing:
# tests/test_provider.nu
use std assert
export def test_create_server [] {
# Mock provider config
let config = {
name: "test-cloud"
api_endpoint: "http://localhost:8080"
api_key: "test-key"
region: "test-region"
plans: {
small: {cpu: 2, memory: 4096, disk: 25}
}
}
# Mock server spec
let spec = {
name: "test-server"
plan: "small"
zone: "test-zone"
ssh_keys: ["ssh-rsa AAAA..."]
}
# Test server creation
let server = create-server $config $spec
assert ($server.id != null)
assert ($server.name == "test-server")
assert ($server.status == "running")
}
export def test_list_servers [] {
let config = load-test-config
let servers = list-servers $config
assert (($servers | length) > 0)
}
export def main [] {
print "Running provider tests..."
test_create_server
test_list_servers
print "All tests passed!"
}
Error Handling
Robust error handling for provider operations:
# Handle API errors gracefully
def handle-api-error [error: record]: nothing -> record {
match $error.status {
401 => {error make {msg: "Authentication failed - check API key"}}
403 => {error make {msg: "Permission denied - insufficient privileges"}}
404 => {error make {msg: "Resource not found"}}
429 => {error make {msg: "Rate limit exceeded - retry later"}}
500 => {error make {msg: "Provider API error - contact support"}}
_ => {error make {msg: $"Unknown error: ($error.message)"}}
}
}
Provider Best Practices
- Implement idempotent operations where possible
- Handle rate limiting with exponential backoff
- Validate all inputs before API calls
- Log all API requests and responses (without secrets)
- Use connection pooling for better performance
- Cache provider capabilities and quotas
- Implement proper timeout handling
- Return consistent error messages
- Test against provider sandbox/staging environment
- Version provider schemas carefully
Related Documentation
- Extension Development - Extension basics
- API Guide - REST API patterns
- Testing - Testing strategies
Plugin Development
Developing Nushell plugins for performance-critical operations in the Provisioning platform.
Plugin Overview
Nushell plugins provide 10-50x performance improvement over HTTP APIs through native Rust implementations.
Available Plugins
| Plugin | Purpose | Performance Gain | Language |
|---|---|---|---|
| nu_plugin_auth | Authentication and OS keyring | 5x faster | Rust |
| nu_plugin_kms | KMS encryption operations | 10x faster | Rust |
| nu_plugin_orchestrator | Orchestrator queries | 30x faster | Rust |
Plugin Architecture
Plugins communicate with Nushell via MessagePack protocol:
Nushell ←→ MessagePack ←→ Plugin Process
   ↓                           ↓
 Script                   Native Rust
Creating a Plugin
Plugin Template
Generate plugin scaffold:
# Create new plugin
cargo new --lib nu_plugin_myfeature
cd nu_plugin_myfeature
Add dependencies to Cargo.toml:
[package]
name = "nu_plugin_myfeature"
version = "0.1.0"
edition = "2021"
[dependencies]
nu-plugin = "0.109.0"
nu-protocol = "0.109.0"
serde = {version = "1.0", features = ["derive"]}
Plugin Implementation
Implement plugin interface:
// src/main.rs
use nu_plugin::{EvaluatedCall, LabeledError, Plugin};
use nu_protocol::{Category, PluginSignature, SyntaxShape, Type, Value};
pub struct MyFeaturePlugin;
impl Plugin for MyFeaturePlugin {
fn signature(&self) -> Vec<PluginSignature> {
vec![
PluginSignature::build("my-feature")
.usage("Perform my feature operation")
.required("input", SyntaxShape::String, "input value")
.input_output_type(Type::String, Type::String)
.category(Category::Custom("provisioning".into())),
]
}
fn run(
&mut self,
name: &str,
call: &EvaluatedCall,
input: &Value,
) -> Result<Value, LabeledError> {
match name {
"my-feature" => self.my_feature(call, input),
_ => Err(LabeledError {
label: "Unknown command".into(),
msg: format!("Unknown command: {}", name),
span: None,
}),
}
}
}
impl MyFeaturePlugin {
fn my_feature(&self, call: &EvaluatedCall, _input: &Value) -> Result<Value, LabeledError> {
let input: String = call.req(0)?;
// Perform operation
let result = perform_operation(&input);
Ok(Value::string(result, call.head))
}
}
fn perform_operation(input: &str) -> String {
// Your implementation here
format!("Processed: {}", input)
}
// Plugin entry point
fn main() {
nu_plugin::serve_plugin(&mut MyFeaturePlugin, nu_plugin::MsgPackSerializer {})
}
Building Plugin
# Build release version
cargo build --release
# Install plugin
nu -c 'plugin add target/release/nu_plugin_myfeature'
nu -c 'plugin use myfeature'
# Test plugin
nu -c 'my-feature "test input"'
Plugin Performance Optimization
Benchmarking
use std::time::Instant;
pub fn benchmark_operation() {
let start = Instant::now();
// Operation to benchmark
perform_expensive_operation();
let duration = start.elapsed();
eprintln!("Operation took: {:?}", duration);
}
Caching
Implement caching for expensive operations:
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
pub struct CachedPlugin {
cache: Arc<Mutex<HashMap<String, String>>>,
}
impl CachedPlugin {
fn get_or_compute(&self, key: &str) -> String {
let mut cache = self.cache.lock().unwrap();
if let Some(value) = cache.get(key) {
return value.clone();
}
let value = expensive_computation(key);
cache.insert(key.to_string(), value.clone());
value
}
}
Testing Plugins
Unit Tests
#[cfg(test)]
mod tests {
use super::*;
use nu_protocol::{Span, Value};
#[test]
fn test_my_feature() {
let plugin = MyFeaturePlugin;
let input = Value::string("test", Span::test_data());
let result = plugin.my_feature(&mock_call(), &input).unwrap();
assert_eq!(result.as_string().unwrap(), "Processed: test");
}
fn mock_call() -> EvaluatedCall {
// Mock EvaluatedCall for testing
todo!()
}
}
Integration Tests
# tests/test_plugin.nu
use std assert
def test_plugin_functionality [] {
let result = my-feature "test input"
assert equal $result "Processed: test input"
}
def main [] {
test_plugin_functionality
print "Plugin tests passed"
}
Plugin Best Practices
- Keep plugin logic focused and single-purpose
- Minimize dependencies to reduce binary size
- Use async operations for I/O-bound tasks
- Implement proper error handling
- Document all plugin commands
- Version plugins with semantic versioning
- Provide fallback to HTTP API if plugin unavailable
- Cache expensive computations
- Profile and benchmark performance improvements
Related Documentation
- Build System - Building Rust plugins
- Extension Development - Extension basics
- Testing - Testing strategies
API Integration Guide
Integrate third-party APIs with Provisioning infrastructure.
API Client Development
Create clients for external APIs:
// src/api_client.rs
use reqwest::{Client, Response, Result};
pub struct ApiClient {
endpoint: String,
api_key: String,
client: Client,
}
impl ApiClient {
pub async fn call(&self, path: &str) -> Result<Response> {
let url = format!("{}{}", self.endpoint, path);
self.client
.get(&url)
.bearer_auth(&self.api_key)
.send()
.await
}
}
Webhook Integration
Handle webhooks from external systems:
#[post("/webhooks/{service}")]
pub async fn handle_webhook(path: web::Path<String>, body: web::Bytes) -> impl Responder {
let service = path.into_inner();
match service.as_str() {
"github" => handle_github_webhook(&body),
"stripe" => handle_stripe_webhook(&body),
_ => HttpResponse::NotFound().finish(),
}
}
Error Handling
Robust error handling for API calls with retries:
use std::time::Duration;
pub async fn call_api_with_retry(
client: &ApiClient,
path: &str,
max_retries: u32,
) -> Result<Response> {
for attempt in 0..max_retries {
match client.call(path).await {
Ok(response) => return Ok(response),
Err(_) if attempt < max_retries - 1 => {
let delay = Duration::from_secs(2_u64.pow(attempt));
tokio::time::sleep(delay).await;
}
Err(e) => return Err(e),
}
}
Err(ApiError::MaxRetriesExceeded.into())
}
Build System
Building, testing, and packaging the Provisioning platform and extensions with Cargo, Just, and Nickel.
Build Tools
| Tool | Purpose | Version Required |
|---|---|---|
| Cargo | Rust compilation and testing | Latest stable |
| Just | Task runner for common operations | Latest |
| Nickel | Schema validation and type checking | 1.15.1+ |
| Nushell | Script execution and testing | 0.109.0+ |
Building Platform Services
Build All Services
# Build all Rust services in release mode
cd provisioning/platform
cargo build --release --workspace
# Or using just task runner
just build-platform
Binary outputs in target/release/:
- provisioning-orchestrator
- provisioning-control-center
- provisioning-vault-service
- provisioning-installer
Build Individual Service
# Orchestrator service
cd provisioning/platform/crates/orchestrator
cargo build --release
# Control Center service
cd provisioning/platform/crates/control-center
cargo build --release
# Development build (faster compilation)
cargo build
Testing
Run All Tests
# Rust unit and integration tests
cargo test --workspace
# Nushell script tests
just test-nushell
# Complete test suite
just test-all
Test Specific Component
# Test orchestrator crate
cargo test -p provisioning-orchestrator
# Test with output visible
cargo test -p provisioning-orchestrator -- --nocapture
# Test specific function
cargo test -p provisioning-orchestrator test_workflow_creation
# Run tests matching pattern
cargo test workflow
Security Tests
# Run 350+ security test cases
cargo test -p security --test '*'
# Specific security component
cargo test -p security authentication
cargo test -p security authorization
cargo test -p security kms
Code Quality
Formatting
# Format all Rust code
cargo fmt --all
# Check formatting without modifying
cargo fmt --all -- --check
# Format Nickel schemas
nickel format provisioning/schemas/**/*.ncl
Linting
# Run Clippy linter
cargo clippy --all -- -D warnings
# Auto-fix Clippy warnings
cargo clippy --all --fix
# Clippy with all features enabled
cargo clippy --all --all-features -- -D warnings
Nickel Validation
# Type check Nickel schemas
nickel typecheck provisioning/schemas/main.ncl
# Evaluate schema
nickel eval provisioning/schemas/main.ncl
# Format Nickel files
nickel format provisioning/schemas/**/*.ncl
Continuous Integration
The platform uses automated CI workflows for quality assurance.
GitHub Actions Pipeline
Key CI jobs:
1. Rust Build and Test
- cargo build --release --workspace
- cargo test --workspace
- cargo clippy --all -- -D warnings
2. Nushell Validation
- nu --check core/cli/provisioning
- Run Nushell test suite
3. Nickel Schema Validation
- nickel typecheck schemas/main.ncl
- Validate all schema files
4. Security Tests
- Run 350+ security test cases
- Vulnerability scanning
5. Documentation Build
- mdbook build docs
- Markdown linting
Packaging and Distribution
Create Release Package
# Build optimized binaries
cargo build --release --workspace
# Strip debug symbols (reduce binary size)
strip target/release/provisioning-orchestrator
strip target/release/provisioning-control-center
# Create distribution archive
just package
Package Structure
provisioning-5.0.0-linux-x86_64.tar.gz
├── bin/
│ ├── provisioning # Main CLI
│ ├── provisioning-orchestrator # Orchestrator service
│ ├── provisioning-control-center # Control Center
│ ├── provisioning-vault-service # Vault service
│ └── provisioning-installer # Platform installer
├── lib/
│ └── nulib/ # Nushell libraries
├── schemas/ # Nickel schemas
├── config/
│ └── config.defaults.toml # Default configuration
├── systemd/
│ └── *.service # Systemd unit files
└── README.md
Cross-Platform Builds
Supported Targets
# Linux x86_64 (primary platform)
cargo build --release --target x86_64-unknown-linux-gnu
# Linux ARM64 (Raspberry Pi, cloud ARM instances)
cargo build --release --target aarch64-unknown-linux-gnu
# macOS x86_64
cargo build --release --target x86_64-apple-darwin
# macOS ARM64 (Apple Silicon)
cargo build --release --target aarch64-apple-darwin
Cross-Compilation Setup
# Add target architectures
rustup target add x86_64-unknown-linux-gnu
rustup target add aarch64-unknown-linux-gnu
# Install cross-compilation tool
cargo install cross
# Cross-compile with Docker
cross build --release --target aarch64-unknown-linux-gnu
Just Task Runner
Common build tasks in justfile:
# Build all components
build-all: build-platform build-plugins
# Build platform services
build-platform:
cd platform && cargo build --release --workspace
# Run all tests
test: test-rust test-nushell test-integration
# Test Rust code
test-rust:
cargo test --workspace
# Test Nushell scripts
test-nushell:
nu scripts/test/test_all.nu
# Format all code
fmt:
cargo fmt --all
nickel fmt schemas/**/*.ncl
# Lint all code
lint:
cargo clippy --all -- -D warnings
nickel typecheck schemas/main.ncl
# Create release package
package:
./scripts/package.nu
# Clean build artifacts
clean:
cargo clean
rm -rf target/
Usage examples:
just build-all # Build everything
just test # Run all tests
just fmt # Format code
just lint # Run linters
just package # Create distribution
just clean # Remove artifacts
Performance Optimization
Release Builds
# Cargo.toml
[profile.release]
opt-level = 3 # Maximum optimization
lto = "fat" # Link-time optimization
codegen-units = 1 # Better optimization, slower compile
strip = true # Strip debug symbols
panic = "abort" # Smaller binary size
Build Time Optimization
# Cargo.toml
[profile.dev]
opt-level = 1 # Basic optimization
incremental = true # Faster recompilation
Speed up compilation:
# Use faster linker (Linux)
sudo apt install lld
export RUSTFLAGS="-C link-arg=-fuse-ld=lld"
# Parallel compilation
cargo build -j 8
# Use cargo-watch for auto-rebuild
cargo install cargo-watch
cargo watch -x build
Development Workflow
Recommended Workflow
# 1. Start development
just clean
just build-all
# 2. Make changes to code
# 3. Test changes quickly
cargo check # Fast syntax check
cargo test <specific-test> # Test specific functionality
# 4. Full validation before commit
just fmt
just lint
just test
# 5. Create package for testing
just package
Hot Reload Development
# Auto-rebuild on file changes
cargo watch -x build
# Auto-test on changes
cargo watch -x test
# Run service with auto-reload
cargo watch -x 'run --bin provisioning-orchestrator'
Debugging Builds
Debug Information
# Build with full debug info
cargo build
# Build with debug info in release mode (requires a custom profile in Cargo.toml;
# --release and --profile cannot be combined)
cargo build --profile release-with-debug
# Run with backtraces
RUST_BACKTRACE=1 cargo run
RUST_BACKTRACE=full cargo run
Build Verbosity
# Verbose build output
cargo build -v
# Very verbose output (shows build commands)
cargo build -vv
# Show timing information
cargo build --timings
Dependency Tree
# View dependency tree
cargo tree
# Duplicate dependencies
cargo tree --duplicates
# Build graph visualization (requires cargo-depgraph and Graphviz)
cargo install cargo-depgraph
cargo depgraph | dot -Tpng > deps.png
Best Practices
- Always run just test before committing
- Use cargo fmt and cargo clippy for code quality
- Test on multiple platforms before release
- Strip binaries for production distributions
- Version binaries with semantic versioning
- Cache dependencies in CI/CD
- Use release profile for production builds
- Document build requirements in README
- Automate common tasks with Just
- Keep build times reasonable (<5 min)
Troubleshooting
Common Build Issues
Compilation fails with linker error:
# Install build dependencies
sudo apt install build-essential pkg-config libssl-dev
Out of memory during build:
# Reduce parallel jobs
cargo build -j 2
# Use more swap space
sudo fallocate -l 8G /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
Clippy warnings:
# Fix automatically where possible
cargo clippy --all --fix
# Allow specific lints temporarily
#[allow(clippy::too_many_arguments)]
Related Documentation
- Testing - Testing strategies and procedures
- Contributing - Contribution guidelines including build requirements
Testing
Comprehensive testing strategies for the Provisioning platform including unit tests, integration tests, and 350+ security tests.
Testing Overview
The platform maintains extensive test coverage across multiple test types:
| Test Type | Count | Coverage Target | Average Runtime |
|---|---|---|---|
| Unit Tests | 200+ | Core logic 80%+ | < 5 seconds |
| Integration Tests | 100+ | Component integration 60%+ | < 30 seconds |
| Security Tests | 350+ | Security components 100% | < 60 seconds |
| End-to-End Tests | 50+ | Full workflows | < 5 minutes |
Running Tests
All Tests
# Run complete test suite
cargo test --workspace
# With output visible
cargo test --workspace -- --nocapture
# Run tests on 8 threads
cargo test --workspace -- --test-threads=8
# Run only the ignored tests
cargo test --workspace -- --ignored
Test by Category
# Unit tests only (--lib)
cargo test --lib
# Integration tests only (--test)
cargo test --test '*'
# Documentation tests
cargo test --doc
# Security test suite
cargo test -p security --test '*'
Test Specific Component
# Test orchestrator crate
cargo test -p provisioning-orchestrator
# Test control center
cargo test -p provisioning-control-center
# Test specific module
cargo test -p provisioning-orchestrator workflows::
# Test specific function
cargo test -p provisioning-orchestrator test_workflow_creation
Unit Testing
Unit tests verify individual functions and modules in isolation.
Rust Unit Tests
// src/workflows.rs
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_create_workflow() {
let config = WorkflowConfig {
name: "test-workflow".into(),
tasks: vec![],
};
let workflow = Workflow::new(config);
assert_eq!(workflow.name(), "test-workflow");
assert_eq!(workflow.status(), WorkflowStatus::Pending);
}
#[test]
fn test_workflow_execution() {
let mut workflow = create_test_workflow();
let result = workflow.execute();
assert!(result.is_ok());
assert_eq!(workflow.status(), WorkflowStatus::Completed);
}
#[test]
#[should_panic(expected = "Invalid workflow")]
fn test_invalid_workflow() {
Workflow::new(invalid_config());
}
}
Nushell Unit Tests
# tests/test_provider.nu
use std assert
export def test_validate_config [] {
let config = {api_key: "test-key", region: "us-east-1"}
let result = validate-config $config
assert equal $result.valid true
}
export def test_create_server [] {
let spec = {name: "test-server", plan: "medium"}
let server = create-server test-config $spec
assert ($server.id != null)
}
export def main [] {
test_validate_config
test_create_server
print "All tests passed"
}
Run Nushell tests:
nu tests/test_provider.nu
Integration Testing
Integration tests verify components work together correctly.
Service Integration Tests
// tests/orchestrator_integration.rs
use provisioning_orchestrator::Orchestrator;
use provisioning_database::Database;
#[tokio::test]
async fn test_workflow_persistence() {
let db = Database::new_test().await;
let orchestrator = Orchestrator::new(db.clone());
let workflow_id = orchestrator.create_workflow(test_config()).await.unwrap();
// Verify workflow persisted to database
let workflow = db.get_workflow(&workflow_id).await.unwrap();
assert_eq!(workflow.id, workflow_id);
}
#[tokio::test]
async fn test_api_integration() {
let app = create_test_app().await;
let response = app
.post("/api/v1/workflows")
.json(&test_workflow())
.send()
.await
.unwrap();
assert_eq!(response.status(), 201);
}
Test Containers
Use Docker containers for realistic integration testing:
use testcontainers::*;
#[tokio::test]
async fn test_with_database() {
let docker = clients::Cli::default();
let postgres = docker.run(images::postgres::Postgres::default());
let db_url = format!(
"postgres://postgres@localhost:{}/test",
postgres.get_host_port_ipv4(5432)
);
// Run tests against real database
let db = Database::connect(&db_url).await.unwrap();
// Test database operations...
}
Security Testing
Comprehensive security testing with 350+ test cases covering all security components.
Authentication Tests
#[tokio::test]
async fn test_jwt_verification() {
let auth = AuthService::new();
let token = auth.generate_token("user123").unwrap();
let claims = auth.verify_token(&token).unwrap();
assert_eq!(claims.sub, "user123");
}
#[tokio::test]
async fn test_invalid_token() {
let auth = AuthService::new();
let result = auth.verify_token("invalid.token.here");
assert!(result.is_err());
}
#[tokio::test]
async fn test_token_expiration() {
let auth = AuthService::new();
let token = create_expired_token();
let result = auth.verify_token(&token);
assert!(matches!(result, Err(AuthError::TokenExpired)));
}
Authorization Tests
#[tokio::test]
async fn test_rbac_enforcement() {
let authz = AuthorizationService::new();
let decision = authz.authorize(
"user:user123",
"workflow:create",
"resource:my-cluster"
).await;
assert_eq!(decision, Decision::Allow);
}
#[tokio::test]
async fn test_policy_denial() {
let authz = AuthorizationService::new();
let decision = authz.authorize(
"user:guest",
"server:delete",
"resource:prod-server"
).await;
assert_eq!(decision, Decision::Deny);
}
Encryption Tests
#[tokio::test]
async fn test_kms_encryption() {
let kms = KmsService::new();
let plaintext = b"secret data";
let ciphertext = kms.encrypt(plaintext).await.unwrap();
let decrypted = kms.decrypt(&ciphertext).await.unwrap();
assert_eq!(plaintext, decrypted.as_slice());
}
#[tokio::test]
async fn test_encryption_performance() {
let kms = KmsService::new();
let plaintext = vec![0u8; 1024]; // 1KB
let start = Instant::now();
kms.encrypt(&plaintext).await.unwrap();
let duration = start.elapsed();
// KMS encryption should complete in < 10ms
assert!(duration < Duration::from_millis(10));
}
End-to-End Testing
Complete workflow testing from start to finish.
Full Workflow Tests
#[tokio::test]
async fn test_complete_workflow() {
let platform = Platform::start_test_instance().await;
// Create infrastructure
let cluster_id = platform
.create_cluster(test_cluster_config())
.await
.unwrap();
// Wait for completion (5 minute timeout)
platform
.wait_for_cluster(&cluster_id, Duration::from_secs(300))
.await;
// Verify cluster health
let health = platform.check_cluster_health(&cluster_id).await;
assert!(health.is_healthy());
// Cleanup
platform.delete_cluster(&cluster_id).await.unwrap();
}
Test Fixtures
Shared test data and utilities.
Common Test Fixtures
// tests/fixtures/mod.rs
pub fn test_workflow_config() -> WorkflowConfig {
WorkflowConfig {
name: "test-workflow".into(),
tasks: vec![
Task::new("task1", TaskType::CreateServer),
Task::new("task2", TaskType::InstallService),
],
}
}
pub fn test_server_spec() -> ServerSpec {
ServerSpec {
name: "test-server".into(),
plan: "medium".into(),
zone: "us-east-1a".into(),
image: "ubuntu-24.04".into(),
}
}
Mocking
Mock external dependencies for isolated testing.
Mock External Services
use mockall::*;
#[automock]
trait CloudProvider {
async fn create_server(&self, spec: &ServerSpec) -> Result<Server>;
}
#[tokio::test]
async fn test_with_mock_provider() {
let mut mock_provider = MockCloudProvider::new();
mock_provider
.expect_create_server()
.returning(|_| Ok(test_server()));
let result = mock_provider.create_server(&test_spec()).await;
assert!(result.is_ok());
}
Test Coverage
Measure and maintain code coverage.
Generate Coverage Report
# Install tarpaulin
cargo install cargo-tarpaulin
# Generate HTML coverage report
cargo tarpaulin --out Html --output-dir coverage
# Generate multiple formats
cargo tarpaulin --out Html --out Xml --out Json
# View coverage
open coverage/index.html
Coverage Goals
- Unit tests: Minimum 80% code coverage
- Integration tests: Minimum 60% component coverage
- Critical paths: 100% coverage required
- Security components: 100% coverage required
Performance Testing
Benchmark critical operations.
Benchmark Tests
use criterion::{black_box, criterion_group, criterion_main, Criterion};
fn benchmark_workflow_creation(c: &mut Criterion) {
c.bench_function("create_workflow", |b| {
b.iter(|| {
Workflow::new(black_box(test_config()))
})
});
}
fn benchmark_database_query(c: &mut Criterion) {
c.bench_function("query_workflows", |b| {
b.iter(|| {
db.query_workflows(black_box(&filter))
})
});
}
criterion_group!(benches, benchmark_workflow_creation, benchmark_database_query);
criterion_main!(benches);
Run benchmarks:
cargo bench
Test Best Practices
- Write tests before or alongside code (TDD approach)
- Keep tests focused and isolated
- Use descriptive test names that explain what is tested
- Clean up test resources (databases, files, containers)
- Mock external dependencies to avoid flaky tests
- Test both success and error conditions
- Maintain shared test fixtures for consistency
- Run tests in CI/CD pipeline
- Monitor test execution time (fail if too slow)
- Refactor tests alongside production code
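One way to act on the execution-time guideline above is a per-test runtime guard. A minimal sketch (illustrative helper, not a platform API):

```rust
use std::time::{Duration, Instant};

// Fail a test if the code under test exceeds a time budget.
// Hypothetical helper; projects may prefer CI-level timeouts instead.
fn assert_within_budget<F: FnOnce()>(budget: Duration, f: F) {
    let start = Instant::now();
    f();
    let elapsed = start.elapsed();
    assert!(
        elapsed <= budget,
        "test exceeded budget: {elapsed:?} > {budget:?}"
    );
}

fn main() {
    assert_within_budget(Duration::from_secs(1), || {
        let _: u64 = (0..1_000).sum(); // stand-in for the code under test
    });
    println!("ok");
}
```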
Continuous Testing
Watch Mode
Auto-run tests on code changes:
# Install cargo-watch
cargo install cargo-watch
# Watch and run tests
cargo watch -x test
# Watch specific package
cargo watch -x 'test -p provisioning-orchestrator'
Pre-Commit Testing
Run tests automatically before commits:
# Install pre-commit hooks
pre-commit install
# Runs on every commit:
# - cargo test
# - cargo clippy
# - cargo fmt --check
Related Documentation
- Build System - Building and running tests
- Contributing - Test requirements for contributions
- API Guide - API testing examples
Contributing
Guidelines for contributing to the Provisioning platform including setup, workflow, and best practices.
Getting Started
Prerequisites
Install required development tools:
# Rust toolchain (latest stable)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Nushell shell
brew install nushell
# Nickel configuration language
brew install nickel
# Just task runner
brew install just
# Additional development tools
cargo install cargo-watch cargo-tarpaulin cargo-audit
Development Workflow
Follow these guidelines for all code changes and ensure adherence to the project’s technical standards.
- Read applicable language guidelines
- Create feature branch from main
- Make changes following project standards
- Write or update tests
- Run full test suite and linting
- Create pull request with clear description
Code Style Guidelines
Rust Code
Rust code guidelines:
- Use idiomatic Rust patterns
- No unwrap() in production code
- Comprehensive error handling with custom error types
- Format with cargo fmt
- Pass cargo clippy -- -D warnings with zero warnings
- Add inline documentation for public APIs
Nushell Scripts
Nushell code guidelines:
- Use structured data pipelines
- Avoid external command dependencies where possible
- Handle errors gracefully with try-catch
- Document functions with comments
- Use type annotations for clarity
Nickel Schemas
Nickel configuration guidelines:
- Define clear type constraints
- Use lazy evaluation appropriately
- Provide default values where sensible
- Document schema fields
- Validate schemas with nickel typecheck
Testing Requirements
All contributions must include appropriate tests:
Required Tests
- Unit tests for all new functions
- Integration tests for component interactions
- Security tests for security-related changes
- Documentation tests for code examples
Running Tests
# Run all tests
just test
# Run specific test suite
cargo test -p provisioning-orchestrator
# Run with coverage
cargo tarpaulin --out Html
Test Coverage Requirements
- Unit tests: Minimum 80% code coverage
- Critical paths: 100% coverage
- Security components: 100% coverage
Documentation
Required Documentation
All code changes must include:
- Inline code documentation for public APIs
- Updated README if adding new components
- Examples showing usage
- Migration guide for breaking changes
Documentation Standards
Documentation standards:
- Use Markdown for all documentation
- Code blocks must specify language
- Keep lines ≤150 characters
- No bare URLs (use markdown links)
- Test all code examples
Commit Message Format
Use conventional commit format:
<type>(<scope>): <subject>
<body>
<footer>
Types:
- feat: New feature
- fix: Bug fix
- docs: Documentation changes
- test: Adding or updating tests
- refactor: Code refactoring
- perf: Performance improvements
- chore: Maintenance tasks
Example:
feat(orchestrator): add workflow retry mechanism
- Implement exponential backoff strategy
- Add max retry configuration option
- Update workflow state tracking
Closes #123
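The subject-line format above can be checked mechanically. A minimal sketch in Rust (illustrative only; the project may enforce this with a git hook or commitlint instead):

```rust
// Validate a "<type>(<scope>): <subject>" commit subject line.
// The scope is optional; the type must be one of the documented types.
fn is_valid_subject(subject: &str) -> bool {
    const TYPES: [&str; 7] = ["feat", "fix", "docs", "test", "refactor", "perf", "chore"];
    // Split at the first ": " into prefix and summary.
    let Some((prefix, summary)) = subject.split_once(": ") else {
        return false;
    };
    if summary.is_empty() {
        return false;
    }
    // Accept "feat" or "feat(orchestrator)".
    let ty = match prefix.split_once('(') {
        Some((ty, rest)) => {
            if !rest.ends_with(')') || rest.len() < 2 {
                return false;
            }
            ty
        }
        None => prefix,
    };
    TYPES.contains(&ty)
}

fn main() {
    assert!(is_valid_subject("feat(orchestrator): add workflow retry mechanism"));
    assert!(is_valid_subject("docs: fix typo"));
    assert!(!is_valid_subject("update stuff"));
    assert!(!is_valid_subject("wip(x): rework"));
    println!("ok");
}
```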
Pull Request Process
Before Creating PR
- Update your branch with latest main
- Run full test suite: just test
- Run linters: just lint
- Format code: just fmt
- Build successfully: just build-all
PR Description Template
## Description
Brief description of changes and motivation
## Type of Change
- [ ] Bug fix (non-breaking change fixing an issue)
- [ ] New feature (non-breaking change adding functionality)
- [ ] Breaking change (fix or feature causing existing functionality to change)
- [ ] Documentation update
## Testing
- [ ] Unit tests added or updated
- [ ] Integration tests pass
- [ ] Manual testing completed
- [ ] Test coverage maintained or improved
## Checklist
- [ ] Code follows project style guidelines
- [ ] Self-review completed
- [ ] Documentation updated
- [ ] No new compiler warnings
- [ ] Tested on relevant platforms
## Related Issues
Closes #<issue-number>
Code Review
All PRs require code review before merging. Reviewers check:
- Correctness and quality of implementation
- Test coverage and quality
- Documentation completeness
- Adherence to style guidelines
- Security implications
- Performance considerations
- Breaking changes properly documented
Development Best Practices
Code Quality
- Write self-documenting code with clear naming
- Keep functions focused and single-purpose
- Avoid premature optimization
- Use meaningful variable and function names
- Comment complex logic, not obvious code
Error Handling
- Use custom error types, not strings
- Provide context in error messages
- Handle errors at appropriate level
- Log errors with sufficient detail
- Never ignore errors silently
Performance
- Profile before optimizing
- Use appropriate data structures
- Minimize allocations in hot paths
- Consider async for I/O-bound operations
- Benchmark performance-critical code
Security
- Validate all inputs
- Never log sensitive data
- Use constant-time comparisons for secrets
- Follow principle of least privilege
- Review security guidelines for security-related changes
Getting Help
Need assistance with contributions?
- Check existing documentation in docs/
- Search for similar closed issues and PRs
- Ask questions in GitHub Discussions
- Reach out to maintainers
Recognition
Contributors are recognized in:
- CONTRIBUTORS.md file
- Release notes for significant contributions
- Project documentation acknowledgments
Thank you for contributing to the Provisioning platform!
API Reference
Complete API documentation for the Provisioning platform, including REST endpoints, CLI commands, and library interfaces.
Available APIs
The Provisioning platform provides multiple API surfaces for different use cases and integration patterns.
REST API
HTTP-based APIs for external integration and programmatic access.
- REST API Documentation - Complete HTTP endpoint reference with 83+ endpoints
- Orchestrator API - Workflow execution and task management
- Control Center API - Platform management and monitoring
Command-Line Interface
Native CLI for interactive and scripted operations.
- CLI Commands Reference - Complete reference for 111+ CLI commands
- Integration Examples - Common integration patterns and workflows
Nushell Libraries
Internal library APIs for extension development and customization.
- Nushell Libraries - Core library modules and functions
API Categories
Infrastructure Management
Manage cloud resources, servers, and infrastructure components.
REST Endpoints:
- Server Management - Create, delete, update, list servers
- Provider Integration - Cloud provider operations
- Network Configuration - Network, firewall, routing
CLI Commands:
- provisioning server - Server lifecycle operations
- provisioning provider - Provider configuration
- provisioning infrastructure - Infrastructure queries
Service Orchestration
Deploy and manage infrastructure services and clusters.
REST Endpoints:
- Task Service Deployment - Install, remove, update services
- Cluster Management - Cluster lifecycle operations
- Dependency Resolution - Automatic dependency handling
CLI Commands:
- provisioning taskserv - Task service operations
- provisioning cluster - Cluster management
- provisioning workflow - Workflow execution
Workflow Automation
Execute batch operations and complex workflows.
REST Endpoints:
- Workflow Submission - Submit and track workflows
- Task Status - Real-time task monitoring
- Checkpoint Recovery - Resume interrupted workflows
CLI Commands:
- provisioning batch - Batch workflow operations
- provisioning workflow - Workflow management
- provisioning orchestrator - Orchestrator control
Configuration Management
Manage configuration across hierarchical layers.
REST Endpoints:
- Configuration Retrieval - Get active configuration
- Validation - Validate configuration files
- Schema Queries - Query configuration schemas
CLI Commands:
- provisioning config - Configuration operations
- provisioning validate - Validation commands
- provisioning schema - Schema management
Security & Authentication
Manage authentication, authorization, secrets, and encryption.
REST Endpoints:
- Authentication - Login, token management, MFA
- Authorization - Policy evaluation, permissions
- Secrets Management - Secret storage and retrieval
- KMS Operations - Key management and encryption
- Audit Logging - Security event tracking
CLI Commands:
- provisioning auth - Authentication operations
- provisioning vault - Secret management
- provisioning kms - Key management
- provisioning audit - Audit log queries
Platform Services
Control platform components and system health.
REST Endpoints:
- Service Health - Health checks and status
- Service Control - Start, stop, restart services
- Configuration - Service configuration management
- Monitoring - Metrics and performance data
CLI Commands:
- provisioning platform - Platform management
- provisioning service - Service control
- provisioning health - Health monitoring
API Conventions
REST API Standards
All REST endpoints follow consistent conventions:
Authentication:
Authorization: Bearer <jwt-token>
Request Format:
Content-Type: application/json
Response Format:
{
"status": "success | error",
"data": { ... },
"message": "Human-readable message",
"timestamp": "2026-01-16T10:30:00Z"
}
Error Responses:
{
"status": "error",
"error": {
"code": "ERR_CODE",
"message": "Error description",
"details": { ... }
},
"timestamp": "2026-01-16T10:30:00Z"
}
CLI Command Patterns
All CLI commands follow consistent patterns:
Common Flags:
- --yes - Skip confirmation prompts
- --check - Dry-run mode, show what would happen
- --wait - Wait for operation completion
- --format json|yaml|table - Output format
- --verbose - Detailed output
- --quiet - Minimal output
Command Structure:
provisioning <domain> <action> <resource> [flags]
Examples:
provisioning server create web-01 --plan medium --yes
provisioning taskserv install kubernetes --cluster prod
provisioning workflow submit deploy.ncl --wait
Library Function Signatures
Nushell library functions follow consistent signatures:
Parameter Order:
- Required positional parameters
- Optional positional parameters
- Named parameters (flags)
Return Values:
- Success: Returns data structure (record, table, list)
- Error: Throws error with structured message
Example:
def create-server [
name: string # Required: server name
--plan: string = "medium" # Optional: server plan
--wait # Optional: wait flag
] {
# Implementation
}
API Versioning
The Provisioning platform uses semantic versioning for APIs:
- Major version - Breaking changes to API contracts
- Minor version - Backwards-compatible additions
- Patch version - Backwards-compatible bug fixes
Current API Version: v1.0.0
Version Compatibility:
- REST API includes version in URL: /api/v1/servers
- CLI maintains backwards compatibility across minor versions
- Libraries use semantic import versioning
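The compatibility rule above can be sketched as a simple version check (illustrative; the platform's actual version negotiation may differ):

```rust
// Parse "1.2.3" or "v1.2.3" into (major, minor, patch).
fn parse(version: &str) -> Option<(u32, u32, u32)> {
    let mut parts = version.trim_start_matches('v').splitn(3, '.');
    Some((
        parts.next()?.parse().ok()?,
        parts.next()?.parse().ok()?,
        parts.next()?.parse().ok()?,
    ))
}

// Semantic-versioning rule: same major, and the server offers at least the
// minor version the client was built against.
fn compatible(client: &str, server: &str) -> bool {
    match (parse(client), parse(server)) {
        (Some((cmaj, cmin, _)), Some((smaj, smin, _))) => cmaj == smaj && smin >= cmin,
        _ => false,
    }
}

fn main() {
    assert!(compatible("v1.0.0", "1.2.3")); // minor additions are compatible
    assert!(!compatible("1.0.0", "2.0.0")); // major bump breaks the contract
    println!("ok");
}
```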
Rate Limiting
REST API endpoints implement rate limiting to ensure platform stability:
- Default Limit: 100 requests per minute per API key
- Burst Limit: 20 requests per second
- Headers: Rate limit information in response headers
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1642334400
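A client can use these headers to throttle itself. A minimal sketch, assuming X-RateLimit-Reset is a Unix timestamp as shown above:

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Decide how long to wait before the next request, given the parsed
// X-RateLimit-Remaining and X-RateLimit-Reset header values.
fn wait_before_next_request(remaining: u64, reset_unix: u64) -> Duration {
    if remaining > 0 {
        return Duration::ZERO; // budget left: proceed immediately
    }
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap_or(Duration::ZERO)
        .as_secs();
    // Window exhausted: sleep until the reset timestamp (zero if in the past).
    Duration::from_secs(reset_unix.saturating_sub(now))
}

fn main() {
    // With requests remaining, no wait is needed.
    assert_eq!(wait_before_next_request(95, 1642334400), Duration::ZERO);
    // With none remaining and a reset already in the past, wait is also zero.
    assert_eq!(wait_before_next_request(0, 0), Duration::ZERO);
    println!("ok");
}
```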
Authentication
All APIs require authentication except public health endpoints.
Supported Methods:
- JWT Tokens - Primary authentication method
- API Keys - For service-to-service integration
- MFA - Multi-factor authentication for sensitive operations
Token Management:
# Login and obtain token
provisioning auth login --user admin
# Use token in requests
curl -H "Authorization: Bearer $TOKEN" https://api.provisioning.local/api/v1/servers
See Authentication Guide for complete details.
API Discovery
Discover available APIs programmatically:
REST API:
# Get API specification (OpenAPI)
curl https://api.provisioning.local/api/v1/openapi.json
CLI:
# List all commands
provisioning help --all
# Get command details
provisioning server help
Libraries:
# List available modules
use lib_provisioning *
$nu.scope.commands | where is_custom
Next Steps
- REST API Reference - Explore HTTP endpoints
- CLI Commands - Master command-line tools
- Integration Examples - See real-world usage patterns
- Nushell Libraries - Extend the platform
Related Documentation
- Security Guide - Authentication and authorization details
- Development Guide - Building with the API
- Orchestrator Architecture - Workflow engine internals
REST API Reference
Complete HTTP API documentation for the Provisioning platform covering 83+ endpoints across 9 platform services.
Base URL
https://api.provisioning.local/api/v1
All endpoints are prefixed with /api/v1 for version compatibility.
Authentication
All API requests require authentication using JWT Bearer tokens:
Authorization: Bearer <your-jwt-token>
Obtain tokens via the Authentication API endpoints.
Common Response Format
All responses follow a consistent structure:
Success Response:
{
"status": "success",
"data": { ... },
"message": "Operation completed successfully",
"timestamp": "2026-01-16T10:30:00Z"
}
Error Response:
{
"status": "error",
"error": {
"code": "ERR_CODE",
"message": "Human-readable error message",
"details": { ... }
},
"timestamp": "2026-01-16T10:30:00Z"
}
HTTP Status Codes
| Code | Meaning | Usage |
|---|---|---|
| 200 | OK | Successful GET, PUT, PATCH requests |
| 201 | Created | Successful POST request creating resource |
| 202 | Accepted | Async operation accepted, check status |
| 204 | No Content | Successful DELETE request |
| 400 | Bad Request | Invalid request parameters |
| 401 | Unauthorized | Missing or invalid authentication |
| 403 | Forbidden | Valid auth but insufficient permissions |
| 404 | Not Found | Resource does not exist |
| 409 | Conflict | Resource conflict (duplicate name, etc.) |
| 429 | Too Many Requests | Rate limit exceeded |
| 500 | Internal Server Error | Server error |
| 503 | Service Unavailable | Service temporarily unavailable |
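For client code, these statuses map naturally onto a retry decision. A minimal sketch (one reasonable policy, not a platform mandate):

```rust
// Only 429 and the transient 5xx responses are worth retrying (with backoff);
// 4xx client errors will fail the same way on every attempt.
fn is_retryable(status: u16) -> bool {
    matches!(status, 429 | 500 | 503)
}

fn main() {
    assert!(is_retryable(429)); // rate limited: retry after backoff
    assert!(is_retryable(503)); // service unavailable: retry
    assert!(!is_retryable(403)); // permission error: retrying won't help
    assert!(!is_retryable(404)); // missing resource: surface to caller
    println!("ok");
}
```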
API Services
The platform exposes 9 distinct services with REST APIs:
- Orchestrator - Workflow execution and task management
- Control Center - Platform management and monitoring
- Extension Registry - Extension distribution
- Auth Service - Authentication and identity
- Vault Service - Secrets management
- KMS Service - Key management and encryption
- Audit Service - Audit logging and compliance
- Policy Service - Authorization policies
- Gateway Service - API gateway and routing
Orchestrator API
Workflow execution, task scheduling, and state management.
Base Path: /api/v1/orchestrator
Submit Workflow
Submit a new workflow for execution.
Endpoint: POST /workflows
Request:
{
"name": "deploy-cluster",
"type": "cluster",
"operations": [
{
"id": "create-servers",
"type": "server",
"action": "create",
"params": {
"infra": "my-cluster.ncl"
}
},
{
"id": "install-k8s",
"type": "taskserv",
"action": "install",
"params": {
"name": "kubernetes"
},
"dependencies": ["create-servers"]
}
],
"priority": "normal",
"checkpoint_enabled": true
}
Response:
{
"status": "success",
"data": {
"workflow_id": "wf-20260116-abc123",
"state": "queued",
"created_at": "2026-01-16T10:30:00Z"
}
}
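The dependencies field in the request above implies an execution order. A minimal sketch of resolving it with Kahn's algorithm (illustrative; the orchestrator's real scheduler also handles parallelism, retries, and checkpoints):

```rust
use std::collections::{HashMap, VecDeque};

// Topologically order operations given (id, dependencies) pairs.
// Returns None if the dependencies contain a cycle.
fn execution_order(ops: &[(&str, Vec<&str>)]) -> Option<Vec<String>> {
    let mut indegree: HashMap<&str, usize> =
        ops.iter().map(|(id, deps)| (*id, deps.len())).collect();
    let mut dependents: HashMap<&str, Vec<&str>> = HashMap::new();
    for (id, deps) in ops {
        for dep in deps {
            dependents.entry(*dep).or_default().push(*id);
        }
    }
    // Start with operations that depend on nothing.
    let mut ready: VecDeque<&str> = indegree
        .iter()
        .filter(|(_, d)| **d == 0)
        .map(|(id, _)| *id)
        .collect();
    let mut order = Vec::new();
    while let Some(id) = ready.pop_front() {
        order.push(id.to_string());
        for next in dependents.get(id).into_iter().flatten() {
            let d = indegree.get_mut(next).unwrap();
            *d -= 1;
            if *d == 0 {
                ready.push_back(next);
            }
        }
    }
    // Fewer scheduled than declared means a dependency cycle.
    (order.len() == ops.len()).then_some(order)
}

fn main() {
    let ops = vec![
        ("create-servers", vec![]),
        ("install-k8s", vec!["create-servers"]),
    ];
    let order = execution_order(&ops).unwrap();
    assert_eq!(order, vec!["create-servers", "install-k8s"]);
    // A self-dependency is a cycle and cannot be scheduled.
    assert!(execution_order(&[("a", vec!["a"])]).is_none());
    println!("ok");
}
```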
Get Workflow Status
Retrieve workflow execution status.
Endpoint: GET /workflows/{workflow_id}
Response:
{
"status": "success",
"data": {
"workflow_id": "wf-20260116-abc123",
"name": "deploy-cluster",
"state": "running",
"progress": {
"total_tasks": 2,
"completed": 1,
"failed": 0,
"running": 1
},
"current_task": {
"id": "install-k8s",
"state": "running",
"started_at": "2026-01-16T10:32:00Z"
},
"created_at": "2026-01-16T10:30:00Z",
"updated_at": "2026-01-16T10:32:15Z"
}
}
List Workflows
List all workflows with optional filtering.
Endpoint: GET /workflows
Query Parameters:
- state (optional) - Filter by state: queued | running | completed | failed
- limit (optional) - Maximum results (default: 50, max: 100)
- offset (optional) - Pagination offset
Response:
{
"status": "success",
"data": {
"workflows": [
{
"workflow_id": "wf-20260116-abc123",
"name": "deploy-cluster",
"state": "running",
"created_at": "2026-01-16T10:30:00Z"
}
],
"total": 1,
"limit": 50,
"offset": 0
}
}
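Paging through results with limit and offset can be sketched as follows (the HTTP call is abstracted as a closure; a real client would issue the request and read data.workflows and data.total from the response envelope):

```rust
// Collect all items from a paginated endpoint.
// fetch_page(offset, limit) returns (items_on_page, total_count).
fn fetch_all<F>(mut fetch_page: F) -> Vec<String>
where
    F: FnMut(usize, usize) -> (Vec<String>, usize),
{
    let limit = 50; // documented default; max is 100
    let mut offset = 0;
    let mut all = Vec::new();
    loop {
        let (items, total) = fetch_page(offset, limit);
        all.extend(items);
        offset += limit;
        if all.len() >= total {
            break;
        }
    }
    all
}

fn main() {
    // Simulated backend holding 120 workflow ids.
    let ids: Vec<String> = (0..120).map(|i| format!("wf-{i}")).collect();
    let got = fetch_all(|offset, limit| {
        let page = ids.iter().skip(offset).take(limit).cloned().collect();
        (page, ids.len())
    });
    assert_eq!(got.len(), 120);
    println!("ok");
}
```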
Cancel Workflow
Cancel a running workflow.
Endpoint: POST /workflows/{workflow_id}/cancel
Response:
{
"status": "success",
"data": {
"workflow_id": "wf-20260116-abc123",
"state": "cancelled",
"cancelled_at": "2026-01-16T10:35:00Z"
}
}
Get Task Logs
Retrieve logs for a specific task in a workflow.
Endpoint: GET /workflows/{workflow_id}/tasks/{task_id}/logs
Query Parameters:
- lines (optional) - Number of lines (default: 100)
- follow (optional) - Stream logs (SSE)
Response:
{
"status": "success",
"data": {
"task_id": "install-k8s",
"logs": [
{
"timestamp": "2026-01-16T10:32:00Z",
"level": "info",
"message": "Starting Kubernetes installation"
},
{
"timestamp": "2026-01-16T10:32:15Z",
"level": "info",
"message": "Downloading Kubernetes binaries"
}
]
}
}
Resume Workflow
Resume a failed workflow from checkpoint.
Endpoint: POST /workflows/{workflow_id}/resume
Request:
{
"from_checkpoint": "create-servers",
"skip_failed": false
}
Response:
{
"status": "success",
"data": {
"workflow_id": "wf-20260116-abc123",
"state": "running",
"resumed_at": "2026-01-16T10:40:00Z"
}
}
Control Center API
Platform management, service control, and monitoring.
Base Path: /api/v1/control-center
List Services
List all platform services and their status.
Endpoint: GET /services
Response:
{
"status": "success",
"data": {
"services": [
{
"name": "orchestrator",
"state": "running",
"health": "healthy",
"uptime": 86400,
"version": "1.0.0"
},
{
"name": "control-center",
"state": "running",
"health": "healthy",
"uptime": 86400,
"version": "1.0.0"
}
]
}
}
Get Service Health
Check health status of a specific service.
Endpoint: GET /services/{service_name}/health
Response:
{
"status": "success",
"data": {
"service": "orchestrator",
"health": "healthy",
"checks": {
"api": "pass",
"database": "pass",
"storage": "pass"
},
"timestamp": "2026-01-16T10:30:00Z"
}
}
Start Service
Start a stopped platform service.
Endpoint: POST /services/{service_name}/start
Response:
{
"status": "success",
"data": {
"service": "orchestrator",
"state": "starting",
"message": "Service start initiated"
}
}
Stop Service
Gracefully stop a running service.
Endpoint: POST /services/{service_name}/stop
Request:
{
"force": false,
"timeout": 30
}
Response:
{
"status": "success",
"data": {
"service": "orchestrator",
"state": "stopped",
"message": "Service stopped gracefully"
}
}
Restart Service
Restart a platform service.
Endpoint: POST /services/{service_name}/restart
Response:
{
"status": "success",
"data": {
"service": "orchestrator",
"state": "restarting",
"message": "Service restart initiated"
}
}
Get Service Configuration
Retrieve service configuration.
Endpoint: GET /services/{service_name}/config
Response:
{
"status": "success",
"data": {
"service": "orchestrator",
"config": {
"port": 8080,
"max_workers": 10,
"checkpoint_enabled": true
}
}
}
Update Service Configuration
Update service configuration (requires restart).
Endpoint: PUT /services/{service_name}/config
Request:
{
"config": {
"max_workers": 20
},
"restart": true
}
Response:
{
"status": "success",
"data": {
"service": "orchestrator",
"config_updated": true,
"restart_required": true
}
}
Get Platform Metrics
Retrieve platform-wide metrics.
Endpoint: GET /metrics
Response:
{
"status": "success",
"data": {
"platform": {
"uptime": 86400,
"version": "1.0.0"
},
"resources": {
"cpu_usage": 45.2,
"memory_usage": 62.8,
"disk_usage": 38.1
},
"workflows": {
"total": 150,
"running": 5,
"queued": 2,
"failed": 3
},
"timestamp": "2026-01-16T10:30:00Z"
}
}
Extension Registry API
Extension distribution, versioning, and discovery.
Base Path: /api/v1/registry
List Extensions
List available extensions.
Endpoint: GET /extensions
Query Parameters:
type(optional) - Filter by type: provider | taskserv | cluster | workflow
search(optional) - Search by name or description
Response:
{
"status": "success",
"data": {
"extensions": [
{
"name": "kubernetes",
"type": "taskserv",
"version": "1.29.0",
"description": "Kubernetes orchestration platform",
"dependencies": ["containerd", "etcd"]
}
],
"total": 1
}
}
Get Extension Details
Get detailed information about an extension.
Endpoint: GET /extensions/{extension_name}
Response:
{
"status": "success",
"data": {
"name": "kubernetes",
"type": "taskserv",
"version": "1.29.0",
"description": "Kubernetes orchestration platform",
"dependencies": ["containerd", "etcd"],
"versions": ["1.29.0", "1.28.5", "1.27.10"],
"metadata": {
"author": "Provisioning Team",
"license": "Apache-2.0",
"homepage": " [https://kubernetes.io"](https://kubernetes.io")
}
}
}
Download Extension
Download an extension package.
Endpoint: GET /extensions/{extension_name}/download
Query Parameters:
version(optional) - Specific version (default: latest)
Response: Binary OCI image blob
Publish Extension
Publish a new extension or version.
Endpoint: POST /extensions
Request: Multipart form data with OCI image
Response:
{
"status": "success",
"data": {
"name": "kubernetes",
"version": "1.29.0",
"published_at": "2026-01-16T10:30:00Z"
}
}
Auth Service API
Authentication, identity management, and MFA.
Base Path: /api/v1/auth
Login
Authenticate user and obtain JWT token.
Endpoint: POST /login
Request:
{
"username": "admin",
"password": "secure-password"
}
Response:
{
"status": "success",
"data": {
"token": "eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9...",
"refresh_token": "refresh-token-abc123",
"expires_in": 3600,
"user": {
"id": "user-123",
"username": "admin",
"roles": ["admin"]
}
}
}
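Client integrations typically wrap this endpoint in a small helper. The sketch below is illustrative Python, with the HTTP transport injected as a callable so it is not tied to any particular client library; the endpoint path and payload mirror the example above.

```python
def login(post, username, password):
    """Authenticate against POST /api/v1/auth/login.

    `post(path, body)` is any callable that performs the HTTP POST and
    returns the decoded JSON body (requests, httpx, a test stub, ...).
    Returns the (access token, refresh token) pair on success.
    """
    resp = post("/api/v1/auth/login", {"username": username, "password": password})
    if resp.get("status") != "success":
        raise RuntimeError(f"login failed: {resp}")
    data = resp["data"]
    return data["token"], data["refresh_token"]
```

The returned access token is then sent as a bearer token on subsequent requests, and the refresh token is kept for the Refresh Token endpoint.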
MFA Challenge
Request MFA challenge for two-factor authentication.
Endpoint: POST /mfa/challenge
Request:
{
"username": "admin",
"password": "secure-password"
}
Response:
{
"status": "success",
"data": {
"challenge_id": "challenge-abc123",
"methods": ["totp", "webauthn"],
"expires_in": 300
}
}
MFA Verify
Verify MFA code and complete authentication.
Endpoint: POST /mfa/verify
Request:
{
"challenge_id": "challenge-abc123",
"method": "totp",
"code": "123456"
}
Response:
{
"status": "success",
"data": {
"token": "eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9...",
"refresh_token": "refresh-token-abc123",
"expires_in": 3600
}
}
Refresh Token
Obtain new access token using refresh token.
Endpoint: POST /refresh
Request:
{
"refresh_token": "refresh-token-abc123"
}
Response:
{
"status": "success",
"data": {
"token": "eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9...",
"expires_in": 3600
}
}
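Clients should refresh proactively rather than waiting for an ERR_AUTH_EXPIRED response. A small helper, assuming the client records the epoch time at which the token was issued; the 60-second skew is an illustrative safety margin, not a platform requirement.

```python
import time


def needs_refresh(issued_at, expires_in, now=None, skew=60):
    """Return True when the access token should be refreshed.

    Refreshes `skew` seconds before the advertised expiry so in-flight
    requests never carry an expired token. `expires_in` comes from the
    login or refresh response (3600 in the examples above).
    """
    now = time.time() if now is None else now
    return now >= issued_at + expires_in - skew
```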
Logout
Invalidate current session and tokens.
Endpoint: POST /logout
Request:
{
"refresh_token": "refresh-token-abc123"
}
Response:
{
"status": "success",
"message": "Logged out successfully"
}
Create User
Create a new user account (admin only).
Endpoint: POST /users
Request:
{
"username": "developer",
"email": "[dev@example.com](mailto:dev@example.com)",
"password": "secure-password",
"roles": ["developer"]
}
Response:
{
"status": "success",
"data": {
"user_id": "user-456",
"username": "developer",
"created_at": "2026-01-16T10:30:00Z"
}
}
List Users
List all users (admin only).
Endpoint: GET /users
Response:
{
"status": "success",
"data": {
"users": [
{
"user_id": "user-123",
"username": "admin",
"email": "[admin@example.com](mailto:admin@example.com)",
"roles": ["admin"],
"created_at": "2026-01-01T00:00:00Z"
}
],
"total": 1
}
}
Vault Service API
Secrets management and dynamic credentials.
Base Path: /api/v1/vault
Store Secret
Store a new secret.
Endpoint: POST /secrets
Request:
{
"path": "database/postgres/password",
"data": {
"username": "dbuser",
"password": "db-password"
},
"metadata": {
"description": "PostgreSQL credentials"
}
}
Response:
{
"status": "success",
"data": {
"path": "database/postgres/password",
"version": 1,
"created_at": "2026-01-16T10:30:00Z"
}
}
Retrieve Secret
Retrieve a stored secret.
Endpoint: GET /secrets/{path}
Query Parameters:
version(optional) - Specific version (default: latest)
Response:
{
"status": "success",
"data": {
"path": "database/postgres/password",
"version": 1,
"data": {
"username": "dbuser",
"password": "db-password"
},
"metadata": {
"description": "PostgreSQL credentials"
},
"created_at": "2026-01-16T10:30:00Z"
}
}
List Secrets
List all secret paths.
Endpoint: GET /secrets
Query Parameters:
prefix(optional) - Filter by path prefix
Response:
{
"status": "success",
"data": {
"secrets": [
{
"path": "database/postgres/password",
"versions": 1,
"updated_at": "2026-01-16T10:30:00Z"
}
],
"total": 1
}
}
Delete Secret
Delete a secret (soft delete, preserves versions).
Endpoint: DELETE /secrets/{path}
Response:
{
"status": "success",
"message": "Secret deleted successfully"
}
Generate Dynamic Credentials
Generate temporary credentials for supported backends.
Endpoint: POST /dynamic/{backend}/generate
Request:
{
"role": "readonly",
"ttl": 3600
}
Response:
{
"status": "success",
"data": {
"credentials": {
"username": "v-readonly-abc123",
"password": "temporary-password"
},
"ttl": 3600,
"expires_at": "2026-01-16T11:30:00Z"
}
}
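Because dynamic credentials expire after their TTL, callers usually cache them and regenerate shortly before expiry. The Python cache below is an illustrative sketch: the injected generate callable stands in for a POST to /dynamic/{backend}/generate, and the 30-second margin is an assumption, not a platform requirement.

```python
class DynamicCredentialCache:
    """Cache dynamic credentials, regenerating before the TTL lapses.

    `generate()` returns the `data` record from the generate response
    (credentials + ttl); `clock()` returns the current epoch time and
    is injected to keep the sketch testable.
    """

    def __init__(self, generate, clock, margin=30):
        self._generate = generate
        self._clock = clock
        self._margin = margin
        self._creds = None
        self._expires = 0.0

    def get(self):
        # Regenerate when no credentials are held or expiry is near.
        if self._creds is None or self._clock() >= self._expires - self._margin:
            data = self._generate()
            self._creds = data["credentials"]
            self._expires = self._clock() + data["ttl"]
        return self._creds
```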
KMS Service API
Key management, encryption, and decryption.
Base Path: /api/v1/kms
Encrypt Data
Encrypt data using a managed key.
Endpoint: POST /encrypt
Request:
{
"key_id": "master-key-01",
"plaintext": "sensitive data",
"context": {
"purpose": "config-encryption"
}
}
Response:
{
"status": "success",
"data": {
"ciphertext": "AQICAHh...",
"key_id": "master-key-01"
}
}
Decrypt Data
Decrypt previously encrypted data.
Endpoint: POST /decrypt
Request:
{
"ciphertext": "AQICAHh...",
"context": {
"purpose": "config-encryption"
}
}
Response:
{
"status": "success",
"data": {
"plaintext": "sensitive data",
"key_id": "master-key-01"
}
}
Create Key
Create a new encryption key.
Endpoint: POST /keys
Request:
{
"key_id": "app-key-01",
"algorithm": "AES-256-GCM",
"metadata": {
"description": "Application encryption key"
}
}
Response:
{
"status": "success",
"data": {
"key_id": "app-key-01",
"algorithm": "AES-256-GCM",
"created_at": "2026-01-16T10:30:00Z"
}
}
List Keys
List all encryption keys.
Endpoint: GET /keys
Response:
{
"status": "success",
"data": {
"keys": [
{
"key_id": "master-key-01",
"algorithm": "AES-256-GCM",
"state": "enabled",
"created_at": "2026-01-01T00:00:00Z"
}
],
"total": 1
}
}
Rotate Key
Rotate an encryption key.
Endpoint: POST /keys/{key_id}/rotate
Response:
{
"status": "success",
"data": {
"key_id": "master-key-01",
"version": 2,
"rotated_at": "2026-01-16T10:30:00Z"
}
}
Audit Service API
Audit logging, compliance tracking, and event queries.
Base Path: /api/v1/audit
Query Audit Logs
Query audit events with filtering.
Endpoint: GET /logs
Query Parameters:
user(optional) - Filter by user ID
action(optional) - Filter by action type
resource(optional) - Filter by resource type
start_time(optional) - Start timestamp
end_time(optional) - End timestamp
limit(optional) - Maximum results (default: 100)
Response:
{
"status": "success",
"data": {
"events": [
{
"event_id": "evt-abc123",
"timestamp": "2026-01-16T10:30:00Z",
"user": "admin",
"action": "workflow.submit",
"resource": "wf-20260116-abc123",
"result": "success",
"metadata": {
"workflow_name": "deploy-cluster"
}
}
],
"total": 1
}
}
Export Audit Logs
Export audit logs in various formats.
Endpoint: GET /export
Query Parameters:
format - Export format: json | csv | syslog | cef | splunk
start_time - Start timestamp
end_time - End timestamp
Response: File download in requested format
Get Compliance Report
Generate compliance report for specific period.
Endpoint: GET /compliance
Query Parameters:
standard - Compliance standard: gdpr | soc2 | iso27001
start_time - Report start time
end_time - Report end time
Response:
{
"status": "success",
"data": {
"standard": "soc2",
"period": {
"start": "2026-01-01T00:00:00Z",
"end": "2026-01-16T23:59:59Z"
},
"controls": [
{
"control_id": "CC6.1",
"status": "compliant",
"evidence_count": 150
}
],
"summary": {
"total_controls": 10,
"compliant": 9,
"non_compliant": 1
}
}
}
Policy Service API
Authorization policy management (Cedar policies).
Base Path: /api/v1/policy
Evaluate Policy
Evaluate authorization request against policies.
Endpoint: POST /evaluate
Request:
{
"principal": "User::\"admin\"",
"action": "Action::\"workflow.submit\"",
"resource": "Workflow::\"deploy-cluster\"",
"context": {
"time": "2026-01-16T10:30:00Z"
}
}
Response:
{
"status": "success",
"data": {
"decision": "allow",
"policies": ["admin-full-access"],
"diagnostics": {
"reason": "User has admin role"
}
}
}
Create Policy
Create a new authorization policy.
Endpoint: POST /policies
Request:
{
"policy_id": "developer-read-only",
"content": "permit(principal in Role::\"developer\", action == Action::\"read\", resource);",
"description": "Developers have read-only access"
}
Response:
{
"status": "success",
"data": {
"policy_id": "developer-read-only",
"created_at": "2026-01-16T10:30:00Z"
}
}
List Policies
List all authorization policies.
Endpoint: GET /policies
Response:
{
"status": "success",
"data": {
"policies": [
{
"policy_id": "admin-full-access",
"description": "Admins have full access",
"created_at": "2026-01-01T00:00:00Z"
}
],
"total": 1
}
}
Update Policy
Update an existing policy (hot reload).
Endpoint: PUT /policies/{policy_id}
Request:
{
"content": "permit(principal in Role::\"developer\", action == Action::\"read\", resource);"
}
Response:
{
"status": "success",
"data": {
"policy_id": "developer-read-only",
"updated_at": "2026-01-16T10:30:00Z",
"reloaded": true
}
}
Delete Policy
Delete an authorization policy.
Endpoint: DELETE /policies/{policy_id}
Response:
{
"status": "success",
"message": "Policy deleted successfully"
}
Gateway Service API
API gateway, routing, and rate limiting.
Base Path: /api/v1/gateway
Get Route Configuration
Retrieve current routing configuration.
Endpoint: GET /routes
Response:
{
"status": "success",
"data": {
"routes": [
{
"path": "/api/v1/orchestrator/*",
"target": " [http://orchestrator:8080",](http://orchestrator:8080",)
"methods": ["GET", "POST", "PUT", "DELETE"],
"auth_required": true
}
]
}
}
Update Routes
Update gateway routing (hot reload).
Endpoint: PUT /routes
Request:
{
"routes": [
{
"path": "/api/v1/custom/*",
"target": " [http://custom-service:9000",](http://custom-service:9000",)
"methods": ["GET", "POST"],
"auth_required": true
}
]
}
Response:
{
"status": "success",
"message": "Routes updated successfully"
}
Get Rate Limits
Retrieve rate limiting configuration.
Endpoint: GET /rate-limits
Response:
{
"status": "success",
"data": {
"global": {
"requests_per_minute": 100,
"burst": 20
},
"per_user": {
"requests_per_minute": 60,
"burst": 10
}
}
}
Error Codes
Common error codes returned by the API:
| Code | Description |
|---|---|
| ERR_AUTH_INVALID | Invalid authentication credentials |
| ERR_AUTH_EXPIRED | Token expired |
| ERR_AUTH_MFA_REQUIRED | MFA verification required |
| ERR_FORBIDDEN | Insufficient permissions |
| ERR_NOT_FOUND | Resource not found |
| ERR_CONFLICT | Resource conflict |
| ERR_VALIDATION | Invalid request parameters |
| ERR_RATE_LIMIT | Rate limit exceeded |
| ERR_WORKFLOW_FAILED | Workflow execution failed |
| ERR_SERVICE_UNAVAILABLE | Service temporarily unavailable |
| ERR_INTERNAL | Internal server error |
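Clients often branch on these codes to decide whether a retry is worthwhile. The classification in this Python sketch is an interpretation of the table, not part of the API contract: throttling and transient server-side failures are treated as retryable, client-side errors are not.

```python
# Retryability is an interpretation, not something the API specifies.
RETRYABLE_CODES = {"ERR_RATE_LIMIT", "ERR_SERVICE_UNAVAILABLE", "ERR_INTERNAL"}


def should_retry(error_code, attempt, max_attempts=3):
    """Decide whether a failed API call is worth retrying."""
    return error_code in RETRYABLE_CODES and attempt < max_attempts
```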
Rate Limiting Headers
All responses include rate limiting headers:
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1642334400
X-RateLimit-Retry-After: 60
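A client can use these headers to back off before retrying. A Python sketch, where headers is any mapping of header names to string values:

```python
def retry_after_seconds(headers, now):
    """Compute how long to wait before retrying a rate-limited request.

    Prefers X-RateLimit-Retry-After when present; otherwise falls back
    to the X-RateLimit-Reset epoch timestamp relative to `now`.
    """
    retry_after = headers.get("X-RateLimit-Retry-After")
    if retry_after is not None:
        return float(retry_after)
    reset = headers.get("X-RateLimit-Reset")
    if reset is not None:
        return max(0.0, float(reset) - now)
    return 0.0
```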
Pagination
List endpoints support pagination using offset-based pagination:
Request:
GET /api/v1/workflows?limit=50&offset=100
Response includes:
{
"data": { ... },
"total": 500,
"limit": 50,
"offset": 100,
"has_more": true
}
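A client can walk all pages by advancing offset until has_more is false. An illustrative Python generator; fetch and items_key (the name of the list inside data, for example workflows) are injected so the sketch stays independent of any HTTP library:

```python
def iter_pages(fetch, items_key, limit=50):
    """Yield every item of an offset-paginated list endpoint.

    `fetch(limit, offset)` returns the JSON body shown above
    (data / total / limit / offset / has_more).
    """
    offset = 0
    while True:
        page = fetch(limit, offset)
        yield from page["data"][items_key]
        if not page["has_more"]:
            break
        offset += page["limit"]
```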
Webhooks
The platform supports webhook notifications for asynchronous operations:
Webhook Payload:
{
"event": "workflow.completed",
"timestamp": "2026-01-16T10:30:00Z",
"data": {
"workflow_id": "wf-20260116-abc123",
"state": "completed"
},
"signature": "sha256=abc123..."
}
Configure webhooks via Control Center API.
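Receivers should verify the signature field before trusting a payload. The Python sketch below assumes the signature is an HMAC-SHA256 of the raw request body keyed with a shared webhook secret, the usual convention for sha256=-prefixed signatures; the platform's exact signing scheme is not specified in this section.

```python
import hashlib
import hmac


def verify_webhook(payload_bytes, signature_header, secret):
    """Check a webhook delivery against its sha256=... signature.

    Assumes HMAC-SHA256 over the raw request body with a shared secret.
    Uses a constant-time comparison to avoid timing side channels.
    """
    expected = "sha256=" + hmac.new(secret, payload_bytes, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)
```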
Related Documentation
- Orchestrator API Details - Deep dive into workflow API
- Control Center API Details - Platform management details
- CLI Commands - CLI alternatives to REST API
- Authentication Guide - Auth implementation details
- API Examples - Integration examples and patterns
CLI Commands Reference
Complete command-line interface documentation for the Provisioning platform covering 111+ commands across 11 domain modules.
Command Structure
All commands follow the pattern:
provisioning <domain> <action> [resource] [flags]
Common Flags (available on most commands):
--yes - Skip confirmation prompts (auto-yes)
--check - Dry-run mode, show what would happen without executing
--wait - Wait for async operations to complete
--format <json | yaml | table> - Output format (default: table)
--verbose - Detailed output with debug information
--quiet - Minimal output, errors only
--help - Show command help
Quick Reference
Shortcuts - Single-letter aliases for common domains:
provisioning s = provisioning server
provisioning t = provisioning taskserv
provisioning c = provisioning cluster
provisioning w = provisioning workspace
provisioning cfg = provisioning config
provisioning b = provisioning batch
Help Navigation - Bi-directional help system:
provisioning help server = provisioning server help
provisioning help ws = provisioning workspace help
Domain Modules
The CLI is organized into 11 domain modules:
- Infrastructure - Server, provider, network management
- Orchestration - Workflow, batch, task execution
- Configuration - Config validation and management
- Workspace - Multi-workspace operations
- Development - Extensions and customization
- Utilities - Tools and helpers
- Generation - Schema and config generation
- Authentication - Auth, MFA, users
- Security - Vault, KMS, audit, policies
- Platform - Service control and monitoring
- Guides - Interactive documentation
Infrastructure Commands
Manage cloud infrastructure, servers, and resources.
Server Commands
provisioning server create [NAME]
Create a new server or servers from infrastructure configuration.
Flags:
--infra <file> - Nickel infrastructure file
--plan <size> - Server plan (small/medium/large/xlarge)
--provider <name> - Cloud provider (upcloud/aws/local)
--zone <name> - Availability zone
--ssh-key <path> - SSH public key path
--tags <key=value> - Server tags (repeatable)
--yes - Skip confirmation
--check - Dry-run mode
--wait - Wait for server creation
Examples:
# Create server from infrastructure file
provisioning server create --infra my-cluster.ncl --yes --wait
# Create single server interactively
provisioning server create web-01 --plan medium --provider upcloud
# Check what would be created (dry-run)
provisioning server create --infra cluster.ncl --check
provisioning server delete [NAME | ID]
Delete servers.
Flags:
--all - Delete all servers in current infrastructure
--force - Force deletion without cleanup
--yes - Skip confirmation
Examples:
# Delete specific server
provisioning server delete web-01 --yes
# Delete all servers
provisioning server delete --all --yes
provisioning server list
List all servers in the current workspace.
Flags:
--provider <name> - Filter by provider
--state <state> - Filter by state (running/stopped/error)
--format <format> - Output format
Examples:
# List all servers
provisioning server list
# List only running servers
provisioning server list --state running --format json
provisioning server status [NAME | ID]
Get detailed server status.
Examples:
provisioning server status web-01
provisioning server status --all
provisioning server ssh [NAME | ID]
SSH into a server.
Examples:
provisioning server ssh web-01
provisioning server ssh web-01 -- "systemctl status kubelet"
Provider Commands
provisioning provider list
List available cloud providers.
provisioning provider validate <NAME>
Validate provider configuration and credentials.
Examples:
provisioning provider validate upcloud
provisioning provider validate aws
provisioning provider zones <NAME>
List available zones for a provider.
Examples:
provisioning provider zones upcloud
provisioning provider zones aws --region us-east-1
Orchestration Commands
Execute workflows, batch operations, and manage tasks.
Workflow Commands
provisioning workflow submit <FILE>
Submit a workflow for execution.
Flags:
--priority <level> - Priority (low/normal/high/critical)
--checkpoint - Enable checkpoint recovery
--wait - Wait for completion
Examples:
# Submit workflow and wait
provisioning workflow submit deploy.ncl --wait
# Submit with high priority
provisioning workflow submit urgent.ncl --priority high
provisioning workflow status <ID>
Get workflow execution status.
Examples:
provisioning workflow status wf-20260116-abc123
provisioning workflow list
List workflows.
Flags:
--state <state> - Filter by state (queued/running/completed/failed)
--limit <num> - Maximum results
Examples:
# List running workflows
provisioning workflow list --state running
# List failed workflows
provisioning workflow list --state failed --format json
provisioning workflow cancel <ID>
Cancel a running workflow.
Examples:
provisioning workflow cancel wf-20260116-abc123 --yes
provisioning workflow resume <ID>
Resume a failed workflow from checkpoint.
Flags:
--from <checkpoint> - Resume from specific checkpoint
--skip-failed - Skip failed tasks
Examples:
# Resume from last checkpoint
provisioning workflow resume wf-20260116-abc123
# Resume from specific checkpoint
provisioning workflow resume wf-20260116-abc123 --from create-servers
provisioning workflow logs <ID>
View workflow logs.
Flags:
--task <id> - Show logs for specific task
--follow - Stream logs in real-time
--lines <num> - Number of lines (default: 100)
Examples:
# View all workflow logs
provisioning workflow logs wf-20260116-abc123
# Follow logs in real-time
provisioning workflow logs wf-20260116-abc123 --follow
# View specific task logs
provisioning workflow logs wf-20260116-abc123 --task install-k8s
Batch Commands
provisioning batch submit <FILE>
Submit a batch workflow with multiple operations.
Flags:
--parallel <num> - Maximum parallel operations
--wait - Wait for completion
Examples:
# Submit batch workflow
provisioning batch submit multi-region.ncl --parallel 3 --wait
provisioning batch status <ID>
Get batch workflow status with progress.
provisioning batch monitor <ID>
Monitor batch execution in real-time.
Configuration Commands
Validate and manage configuration.
provisioning config validate
Validate current configuration.
Flags:
--infra <file> - Specific infrastructure file
--all - Validate all configuration files
Examples:
# Validate workspace configuration
provisioning config validate
# Validate specific infrastructure
provisioning config validate --infra cluster.ncl
provisioning config show
Display effective configuration.
Flags:
--key <path> - Show specific config value
--format <format> - Output format
Examples:
# Show all configuration
provisioning config show
# Show specific value
provisioning config show --key paths.base
# Export as JSON
provisioning config show --format json > config.json
provisioning config reload
Reload configuration from files.
provisioning config diff
Show configuration differences between environments.
Flags:
--from <env> - Source environment
--to <env> - Target environment
Workspace Commands
Manage isolated workspaces.
provisioning workspace init <NAME>
Initialize a new workspace.
Flags:
--template <name> - Workspace template
--path <path> - Custom workspace path
Examples:
# Create workspace from default template
provisioning workspace init my-project
# Create from template
provisioning workspace init prod --template production
provisioning workspace switch <NAME>
Switch to a different workspace.
Examples:
provisioning workspace switch production
provisioning workspace switch dev
provisioning workspace list
List all workspaces.
Flags:
--format <format> - Output format
Examples:
provisioning workspace list
provisioning workspace list --format json
provisioning workspace current
Show current active workspace.
provisioning workspace delete <NAME>
Delete a workspace.
Flags:
--force - Force deletion without cleanup
--yes - Skip confirmation
Development Commands
Develop custom extensions.
provisioning extension create <TYPE> <NAME>
Create a new extension.
Types: provider, taskserv, cluster, workflow
Flags:
--template <name> - Extension template
Examples:
# Create new task service
provisioning extension create taskserv my-service
# Create new provider
provisioning extension create provider my-cloud --template basic
provisioning extension validate <PATH>
Validate extension structure and configuration.
provisioning extension package <PATH>
Package extension for distribution (OCI format).
Flags:
--version <version> - Extension version
--output <path> - Output file path
Examples:
provisioning extension package ./my-service --version 1.0.0
provisioning extension install <NAME | PATH>
Install an extension from registry or file.
Examples:
# Install from registry
provisioning extension install kubernetes
# Install from local file
provisioning extension install ./my-service.tar.gz
provisioning extension list
List installed extensions.
Flags:
--type <type> - Filter by type
--available - Show available (not installed)
Utility Commands
Helper commands and tools.
provisioning version
Show platform version information.
Flags:
--check - Check for updates
Examples:
provisioning version
provisioning version --check
provisioning health
Check platform health.
Flags:
--service <name> - Check specific service
Examples:
# Check all services
provisioning health
# Check specific service
provisioning health --service orchestrator
provisioning diagnostics
Run platform diagnostics.
Flags:
--output <path> - Save diagnostic report
Examples:
provisioning diagnostics --output diagnostics.json
provisioning setup versions
Generate versions file from Nickel schemas.
Examples:
# Generate /provisioning/core/versions file
provisioning setup versions
# Use in shell scripts
source /provisioning/core/versions
echo "Nushell version: $NU_VERSION"
Generation Commands
Generate schemas, configurations, and infrastructure code.
provisioning generate config <TYPE>
Generate configuration templates.
Types: workspace, infrastructure, provider
Flags:
--output <path> - Output file path
--format <format> - Output format (nickel/yaml/toml)
Examples:
# Generate workspace config
provisioning generate config workspace --output config.ncl
# Generate infrastructure template
provisioning generate config infrastructure --format nickel
provisioning generate schema <NAME>
Generate Nickel schema from existing configuration.
provisioning generate docs
Generate documentation from schemas.
Authentication Commands
Manage authentication and user accounts.
provisioning auth login
Authenticate to the platform.
Flags:
--user <username> - Username
--password <password> - Password (prompt if not provided)
--mfa <code> - MFA code
Examples:
# Interactive login
provisioning auth login --user admin
# Login with MFA
provisioning auth login --user admin --mfa 123456
provisioning auth logout
Logout and invalidate tokens.
provisioning auth token
Display or refresh authentication token.
Flags:
--refresh - Refresh the token
provisioning auth user create <USERNAME>
Create a new user (admin only).
Flags:
--email <email> - User email
--roles <roles> - Comma-separated roles
Examples:
provisioning auth user create developer --email dev@example.com --roles developer,operator
provisioning auth user list
List all users (admin only).
provisioning auth user delete <USERNAME>
Delete a user (admin only).
Security Commands
Manage secrets, encryption, audit logs, and policies.
Vault Commands
provisioning vault store <PATH>
Store a secret.
Flags:
--value <value> - Secret value
--file <path> - Read value from file
Examples:
# Store secret interactively
provisioning vault store database/postgres/password
# Store from value
provisioning vault store api/key --value "secret-value"
# Store from file
provisioning vault store ssh/key --file ~/.ssh/id_rsa
provisioning vault get <PATH>
Retrieve a secret.
Flags:
--version <num> - Specific version
--output <path> - Save to file
Examples:
# Get latest secret
provisioning vault get database/postgres/password
# Get specific version
provisioning vault get database/postgres/password --version 2
provisioning vault list
List all secret paths.
Flags:
--prefix <prefix> - Filter by path prefix
provisioning vault delete <PATH>
Delete a secret.
KMS Commands
provisioning kms encrypt <FILE>
Encrypt a file or data.
Flags:
--key <id> - Key ID
--output <path> - Output file
Examples:
# Encrypt file
provisioning kms encrypt config.yaml --key master-key --output config.enc
# Encrypt string
echo "sensitive data" | provisioning kms encrypt --key master-key
provisioning kms decrypt <FILE>
Decrypt encrypted data.
Flags:
--output <path> - Output file
provisioning kms create-key <ID>
Create a new encryption key.
Flags:
--algorithm <algo> - Algorithm (default: AES-256-GCM)
provisioning kms list-keys
List all encryption keys.
provisioning kms rotate-key <ID>
Rotate an encryption key.
Audit Commands
provisioning audit query
Query audit logs.
Flags:
--user <user> - Filter by user
--action <action> - Filter by action
--resource <resource> - Filter by resource
--start <time> - Start time
--end <time> - End time
--limit <num> - Maximum results
Examples:
# Query recent audit logs
provisioning audit query --limit 100
# Query specific user actions
provisioning audit query --user admin --action workflow.submit
# Query time range
provisioning audit query --start "2026-01-15" --end "2026-01-16"
provisioning audit export
Export audit logs.
Flags:
--format <format> - Export format (json/csv/syslog/cef/splunk)
--start <time> - Start time
--end <time> - End time
--output <path> - Output file
Examples:
# Export as JSON
provisioning audit export --format json --output audit.json
# Export last 7 days as CSV
provisioning audit export --format csv --start "7 days ago" --output audit.csv
provisioning audit compliance
Generate compliance report.
Flags:
--standard <standard> - Compliance standard (gdpr/soc2/iso27001)
--start <time> - Report start time
--end <time> - Report end time
Policy Commands
provisioning policy create <ID>
Create an authorization policy.
Flags:
--content <cedar> - Cedar policy content
--file <path> - Load from file
--description <text> - Policy description
Examples:
# Create from file
provisioning policy create developer-read --file policies/read-only.cedar
# Create inline
provisioning policy create admin-full --content "permit(principal in Role::\"admin\", action, resource);"
provisioning policy list
List all authorization policies.
provisioning policy evaluate
Evaluate a policy decision.
Flags:
--principal <entity> - Principal entity
--action <action> - Action
--resource <resource> - Resource
Examples:
provisioning policy evaluate \
--principal "User::\"admin\"" \
--action "Action::\"workflow.submit\"" \
--resource "Workflow::\"deploy\""
provisioning policy update <ID>
Update an existing policy (hot reload).
provisioning policy delete <ID>
Delete an authorization policy.
Platform Commands
Control platform services.
provisioning platform service list
List all platform services and status.
provisioning platform service start <NAME>
Start a platform service.
Examples:
provisioning platform service start orchestrator
provisioning platform service stop <NAME>
Stop a platform service.
Flags:
--force - Force stop without graceful shutdown
--timeout <seconds> - Graceful shutdown timeout
provisioning platform service restart <NAME>
Restart a platform service.
provisioning platform service health <NAME>
Check service health.
provisioning platform metrics
Display platform-wide metrics.
Flags:
--watch - Continuously update metrics
Guides Commands
Access interactive guides and documentation.
provisioning guide from-scratch
Complete walkthrough from installation to first deployment.
provisioning guide update
Guide for updating the platform.
provisioning guide customize
Guide for customizing extensions.
provisioning sc
Quick reference shortcut guide (fastest).
provisioning help [COMMAND]
Display help for any command.
Examples:
# General help
provisioning help
# Command-specific help
provisioning help server create
provisioning server create --help # Same result
Task Service Commands
provisioning taskserv install <NAME>
Install a task service on servers.
Flags:
--cluster <name> - Target cluster
--version <version> - Specific version
--servers <names> - Target servers (comma-separated)
--wait - Wait for installation
--yes - Skip confirmation
Examples:
# Install Kubernetes on cluster
provisioning taskserv install kubernetes --cluster prod --wait
# Install specific version
provisioning taskserv install kubernetes --version 1.29.0
# Install on specific servers
provisioning taskserv install containerd --servers web-01,web-02
provisioning taskserv remove <NAME>
Remove a task service.
Flags:
--cluster <name> - Target cluster
--purge - Remove all data
--yes - Skip confirmation
provisioning taskserv list
List installed task services.
Flags:
--available - Show available (not installed) services
provisioning taskserv status <NAME>
Get task service status.
Examples:
provisioning taskserv status kubernetes
Cluster Commands
provisioning cluster create <NAME>
Create a complete cluster from configuration.
Flags:
--infra <file> - Nickel infrastructure file
--type <type> - Cluster type (kubernetes/etcd/postgres)
--wait - Wait for creation
--yes - Skip confirmation
--check - Dry-run mode
Examples:
# Create Kubernetes cluster
provisioning cluster create prod-k8s --infra k8s-cluster.ncl --wait
# Check what would be created
provisioning cluster create staging --infra staging.ncl --check
provisioning cluster delete <NAME>
Delete a cluster and all resources.
Flags:
--keep-data - Preserve data volumes
--yes - Skip confirmation
provisioning cluster list
List all clusters.
provisioning cluster status <NAME>
Get detailed cluster status.
Examples:
provisioning cluster status prod-k8s
provisioning cluster scale <NAME>
Scale cluster nodes.
Flags:
--workers <num> - Number of worker nodes
--control-plane <num> - Number of control plane nodes
Examples:
# Scale workers to 5 nodes
provisioning cluster scale prod-k8s --workers 5
Test Commands
provisioning test quick <TASKSERV>
Quick test of a task service in container.
Examples:
provisioning test quick kubernetes
provisioning test quick postgres
provisioning test topology load <NAME>
Load a test topology template.
provisioning test env create
Create a test environment.
Flags:
--topology <name> - Topology template
--services <names> - Services to install
provisioning test env list
List active test environments.
provisioning test env cleanup <ID>
Cleanup a test environment.
Environment Variables
The CLI respects these environment variables:
PROVISIONING_WORKSPACE - Override active workspace
PROVISIONING_CONFIG - Custom config file path
PROVISIONING_LOG_LEVEL - Log level (debug/info/warn/error)
PROVISIONING_API_URL - API endpoint URL
PROVISIONING_TOKEN - Auth token (overrides login)
Exit Codes
| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | General error |
| 2 | Invalid usage |
| 3 | Configuration error |
| 4 | Authentication error |
| 5 | Permission denied |
| 6 | Resource not found |
| 7 | Operation failed |
| 8 | Timeout |
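Automation can branch on these codes. A minimal sketch, with the table above encoded as a Python mapping; the `is_retryable` policy is a hypothetical choice for illustration, not part of the CLI:

```python
# The CLI's documented exit codes, encoded as a lookup table.
EXIT_CODES = {
    0: "success",
    1: "general error",
    2: "invalid usage",
    3: "configuration error",
    4: "authentication error",
    5: "permission denied",
    6: "resource not found",
    7: "operation failed",
    8: "timeout",
}

def is_retryable(code: int) -> bool:
    """One plausible policy: retry transient failures (operation failed, timeout),
    never retry usage, config, auth, or permission errors."""
    return code in (7, 8)

print(EXIT_CODES[6])    # resource not found
print(is_retryable(8))  # True
```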
Shell Completion
Generate shell completion scripts:
# Bash
provisioning completion bash > /etc/bash_completion.d/provisioning
# Zsh
provisioning completion zsh > ~/.zsh/completion/_provisioning
# Fish
provisioning completion fish > ~/.config/fish/completions/provisioning.fish
Related Documentation
- REST API Reference - HTTP API alternatives
- Nushell Libraries - Library functions
- Integration Examples - Real-world usage patterns
- Quick Start Guide - Getting started
- Interactive Guides - In-platform tutorials
Nushell Libraries
Orchestrator API
Control Center API
Examples
Architecture
A deep dive into the Provisioning platform's architecture, design principles, and the architectural decisions that shape the system.
Overview
The Provisioning platform uses a modular, microservice-based architecture to deliver enterprise infrastructure as code across multiple clouds. This section documents the foundational architectural decisions and system design that enable:
- Multi-cloud orchestration across AWS, UpCloud, Hetzner, Kubernetes, and on-premise systems
- Workspace-first organization with complete infrastructure isolation and multi-tenancy support
- Type-safe configuration using Nickel language as source of truth
- Autonomous operations through intelligent detectors and automated incident response
- Post-quantum security with hybrid encryption protecting against future threats
Architecture Documentation
System Understanding
- System Overview - Platform architecture with 12 microservices, 80+ CLI commands, multi-tenancy model, cloud integration
- Design Principles - Configuration-driven design, workspace isolation, type-safety mandates, autonomous operations, security-first
- Component Architecture - 12 microservices: Orchestrator, Control-Center, Vault-Service, Extension-Registry, AI-Service, Detector, RAG, MCP-Server, KMS, Platform-Config, Service-Clients
- Integration Patterns - REST APIs, async message queues, event-driven workflows, service discovery, state management
Architectural Decisions
- Architecture Decision Records (ADRs) - 10 decisions: modular CLI, workspace-first design, Nickel type-safety, microservice distribution, communication, post-quantum cryptography, encryption, observability, SLO management, incident automation
Key Architectural Patterns
Modular Design (ADR-001)
- Decentralized CLI command registration reducing code by 84%
- Dynamic command discovery and 80+ keyboard shortcuts
- Extensible architecture supporting custom commands
Workspace-First Organization (ADR-002)
- Workspaces as primary organizational unit grouping infrastructure, configs, and state
- Complete isolation for multi-tenancy and team collaboration
- Local schema and extension customization per workspace
Type-Safe Configuration (ADR-003)
- Nickel language as source of truth for all infrastructure definitions
- Mandatory schema validation at parse time (not runtime)
- Complete migration from KCL with backward compatibility
Distributed Microservices (ADR-004)
- 12 specialized microservices handling specific domains
- Independent scaling and deployment per service
- Service communication via REST + async queues
Security Architecture (ADR-006 & ADR-007)
- Post-quantum cryptography with CRYSTALS-Kyber hybrid encryption
- Multi-layer encryption: at-rest (KMS), in-transit (TLS 1.3), field-level, end-to-end
- Centralized secrets management via SecretumVault
Observability & Resilience (ADR-008, ADR-009, ADR-010)
- Unified observability: Prometheus metrics, ELK logging, Jaeger tracing
- SLO-driven operations with error budget enforcement
- Autonomous incident detection and self-healing
Navigation
- For implementation details → See provisioning/docs/src/features/
- For API documentation → See provisioning/docs/src/api-reference/
- For deployment guides → See provisioning/docs/src/operations/
- For security details → See provisioning/docs/src/security/
- For development → See provisioning/docs/src/development/
System Overview
Complete architecture of the Provisioning Infrastructure Automation Platform.
Architecture Layers
Provisioning uses a 5-layer modular architecture:
┌─────────────────────────────────────────────────────────────┐
│ User Interface Layer │
│ • CLI (provisioning command) • Web Control Center (UI) │
│ • REST API • MCP Server (AI) • Batch Scheduler │
└──────────────────────────┬──────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Core Engine Layer (provisioning/core/) │
│ • 211-line CLI dispatcher (84% code reduction) │
│ • 476+ configuration accessors (hierarchical) │
│ • Provider abstraction (multi-cloud support) │
│ • Workspace management system │
│ • Infrastructure validation (54+ Nushell libraries) │
│ • Secrets management (SOPS + Age integration) │
└──────────────────────────┬──────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Orchestration Layer (provisioning/platform/) │
│ • Hybrid Orchestrator (Rust + Nushell) │
│ • Workflow execution with checkpoints │
│ • Dependency resolver & task scheduler │
│ • File-based persistence │
│ • REST API endpoints (83+) │
│ • State management (SurrealDB) │
└──────────────────────────┬──────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Extension Layer (provisioning/extensions/) │
│ • Cloud Providers (UpCloud, AWS, Hetzner, Local) │
│ • Task Services (50+ services in 18 categories) │
│ • Clusters (9 pre-built cluster templates) │
│ • Batch Workflows (automation templates) │
│ • Nushell Plugins (10-50x performance gains) │
└──────────────────────────┬──────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Infrastructure Layer │
│ • Cloud Resources (servers, networks, storage) │
│ • Running Services (Kubernetes, databases, etc.) │
│ • State Persistence (SurrealDB, file storage) │
│ • Monitoring & Logging (Prometheus, Loki) │
└─────────────────────────────────────────────────────────────┘
Core System Components
1. CLI Layer (provisioning/core/cli/)
Entry Point: provisioning/core/cli/provisioning
- Bash wrapper (210 lines) - Minimal bootstrap
- Routes commands to Nushell dispatcher
- Loads environment and validates workspace
- Handles error reporting
Key Features:
- Single entry point
- Pluggable architecture
- Support for 111+ commands
- 80+ shortcuts for productivity
2. Core Engine (provisioning/core/nulib/)
Structure: 54 Nushell libraries organized by function
Main Components:
Configuration Management (lib_provisioning/config/)
- Hierarchical loading: 5-layer precedence system
- 476+ accessors: Type-safe configuration access
- Variable interpolation: Template expansion
- TOML merging: Environment-specific overrides
- Validation: Schema enforcement
Provider Abstraction (lib_provisioning/providers/)
- Multi-cloud support: UpCloud, AWS, Hetzner, Local
- Unified interface: Single API for all providers
- Dynamic loading: Load providers on-demand
- Credential management: Encrypted credential handling
- State tracking: Provider-specific state persistence
Workspace Management (lib_provisioning/workspace/)
- Workspace registry: Track all workspaces
- Switching: Atomic workspace transitions
- Isolation: Independent state per workspace
- Configuration loading: Workspace-specific overrides
- Extensions: Inherit from platform extensions
Infrastructure Validation (lib_provisioning/infra_validator/)
- Schema validation: Nickel contract checking
- Constraint enforcement: Business rule validation
- Dependency analysis: Infrastructure dependency graph
- Type checking: Static type validation
- Error reporting: Detailed error messages with suggestions
Secrets Management (lib_provisioning/secrets/)
- SOPS integration: Mozilla SOPS for encryption
- Age encryption: Modern symmetric encryption
- KMS backends: Cosmian, AWS KMS, local
- Credential injection: Runtime variable substitution
- Audit logging: Track secret access
Command Utilities (lib_provisioning/cmd/)
- SSH operations: Remote command execution
- Batch operations: Parallel command execution
- Error handling: Structured error reporting
- Logging: Comprehensive operation logging
- Retry logic: Automatic retry with backoff
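The retry-with-backoff behavior above can be sketched as follows (illustrative Python, not the Nushell implementation; function and variable names are hypothetical):

```python
import time

def retry_with_backoff(op, max_attempts=5, base_delay=0.01):
    """Retry `op`, doubling the delay after each failure;
    re-raise the error once the last attempt fails."""
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Demo: an operation that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result = retry_with_backoff(flaky)
print(result, calls["n"])  # succeeds on the third attempt
```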
3. Orchestration Engine (provisioning/platform/)
Technology: Rust + Nushell hybrid
12 Microservices (Rust crates):
| Service | Purpose | Key Features |
|---|---|---|
| orchestrator | Workflow execution | Scheduler, file persistence, REST API |
| control-center | API gateway + auth | RBAC, Cedar policies, audit logging |
| control-center-ui | Web dashboard | Infrastructure view, config management |
| mcp-server | AI integration | Model Context Protocol, auto-completion |
| vault-service | Secrets storage | Encryption, KMS, credential injection |
| extension-registry | OCI registry | Extension distribution, versioning |
| ai-service | LLM features | Prompt optimization, context awareness |
| detector | Anomaly detection | Health monitoring, pattern recognition |
| rag | Knowledge retrieval | Document embedding, semantic search |
| provisioning-daemon | Background service | Event monitoring, task scheduling |
| platform-config | Config management | Schema validation, environment handling |
| service-clients | API clients | SDK for platform services, cloud APIs |
Detailed Services:
Orchestrator (crates/orchestrator/)
- High-performance scheduler: Rust core
- File-based persistence: Durable queue
- Workflow execution: Dependency-aware scheduling
- Checkpoint recovery: Resume from failures
- Parallel execution: Multi-task handling
- State management: Track job status
- REST API: 9 core endpoints
- Port: 9090 (health check endpoint)
Control Center (crates/control-center/)
- Authorization engine: Cedar policy enforcement
- RBAC system: Role-based access control
- Audit logging: Complete audit trail
- API gateway: REST API for all operations
- System configuration: Central configuration management
- Health monitoring: Real-time system status
Control Center UI (crates/control-center-ui/)
- Web dashboard: Real-time infrastructure view
- Workflow visualization: Batch job monitoring
- Configuration management: Web-based configuration
- Resource explorer: Browse infrastructure
- Audit viewer: Security audit trail
MCP Server (crates/mcp-server/)
- AI integration: Model Context Protocol support
- Natural language: Parse infrastructure requests
- Auto-completion: Intelligent configuration suggestions
- 7 settings tools: Configuration management via LLM
- Context-aware: Understand workspace context
Vault Service (crates/vault-service/)
- Secrets backend: Encrypted credential storage
- KMS integration: Key Management System support
- SOPS + Age: SOPS encryption backend
- Credential injection: Secure credential delivery
- Audit logging: Secret access tracking
Extension Registry (crates/extension-registry/)
- OCI distribution: Container image distribution
- Extension packaging: Provider/taskserv distribution
- Version management: Semantic versioning
- Registry API: Content addressable storage
AI Service (crates/ai-service/)
- LLM integration: Large Language Model support
- Prompt optimization: Infrastructure request parsing
- Context awareness: Workspace context enrichment
- Response generation: Configuration suggestions
Detector (crates/detector/)
- Anomaly detection: System health monitoring
- Pattern recognition: Infrastructure issue identification
- Alert generation: Alerting system integration
- Real-time monitoring: Continuous surveillance
Platform Config (crates/platform-config/)
- Configuration management: Centralized config loading
- Schema validation: Configuration validation
- Environment handling: Multi-environment support
- Default settings: System-wide defaults
Provisioning Daemon (crates/provisioning-daemon/)
- Background service: Continuous operation
- Event monitoring: System event handling
- Task scheduling: Background job execution
- State synchronization: Infrastructure state sync
RAG Service (crates/rag/)
- Retrieval Augmented Generation: Knowledge base integration
- Document embedding: Semantic search
- Context retrieval: Intelligent response context
- Knowledge synthesis: Answer generation
Service Clients (crates/service-clients/)
- API clients: Client SDK for platform services
- Cloud providers: Multi-cloud provider SDKs
- Request handling: HTTP/RPC client utilities
- Connection pooling: Efficient resource management
4. Extensions (provisioning/extensions/)
Modular infrastructure components:
Providers (5 cloud providers)
- UpCloud - Primary European cloud
- AWS - Amazon Web Services
- Hetzner - Baremetal & cloud servers
- Local - Development environment
- Demo - Testing & mocking
Each provider includes:
- Nickel schemas for configuration
- API client implementation
- Server creation/deletion logic
- Network management
- State tracking
Task Services (50+ services in 18 categories)
| Category | Services | Purpose |
|---|---|---|
| Container Runtime | containerd, crio, podman, crun, youki, runc | Container execution |
| Kubernetes | kubernetes, etcd, coredns, cilium, flannel, calico | Orchestration |
| Storage | rook-ceph, local-storage, mayastor, external-nfs | Data persistence |
| Databases | postgres, redis, mysql, mongodb | Data management |
| Networking | ip-aliases, proxy, resolv, kms | Network services |
| Security | webhook, kms, oras, radicle | Security services |
| Observability | prometheus, grafana, loki, jaeger | Monitoring & logging |
| Development | gitea, coder, desktop, buildkit | Developer tools |
| Hypervisor | kvm, qemu, libvirt | Virtualization |
Clusters (9 pre-built templates)
- web - Web service cluster (nginx + postgres)
- oci-reg - Container registry
- git - Git hosting (Gitea)
- buildkit - Build infrastructure
- k8s-ha - HA Kubernetes (3 control planes)
- postgresql - HA PostgreSQL cluster
- cicd-argocd - GitOps CI/CD
- cicd-tekton - Tekton pipelines
5. Infrastructure Layer
What Provisioning Manages:
- Cloud Resources: VMs, networks, storage
- Services: Kubernetes, databases, monitoring
- Applications: Web services, APIs, tools
- State: Configuration, data, logs
- Monitoring: Metrics, traces, logs
Configuration System
Hierarchical 5-Layer System:
Precedence (High → Low):
1. Runtime Arguments (CLI flags: --provider upcloud)
↓
2. Environment Variables (PROVISIONING_PROVIDER=aws)
↓
3. Workspace Config (~workspace/config/provisioning.yaml)
↓
4. Environment Defaults (workspace/config/prod-defaults.toml)
↓
5. System Defaults (~/.config/provisioning/ + platform defaults)
Configuration Languages:
| Format | Purpose | Validation | Editability |
|---|---|---|---|
| Nickel | Infrastructure source | ✅ Type-safe, contracts | Direct |
| TOML | Settings, defaults | Schema validation | Direct |
| YAML | User config, metadata | Schema validation | Direct |
| JSON | Exported configs | Schema validation | Generated |
Key Features:
- Lazy evaluation
- Recursive merging
- Variable interpolation
- Constraint checking
- Automatic validation
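The precedence rules above can be illustrated with a small recursive merge (a Python sketch assuming nested tables merge key-by-key with the higher layer winning; the real loader also performs interpolation and validation, and the layer contents here are hypothetical):

```python
def merge_config(*layers):
    """Merge configuration layers; earlier arguments take precedence
    (runtime args > env vars > workspace > environment defaults > system defaults)."""
    merged = {}
    for layer in reversed(layers):  # apply lowest precedence first, then override
        for key, value in layer.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                # Recurse so the higher-precedence layer wins inside nested tables.
                merged[key] = merge_config(value, merged[key])
            else:
                merged[key] = value
    return merged

# Hypothetical layers standing in for the real loaders:
runtime   = {"provider": "upcloud"}                           # CLI flag
workspace = {"provider": "aws", "server": {"plan": "large"}}  # workspace config
system    = {"server": {"plan": "small", "zone": "de-fra1"}}  # system defaults

cfg = merge_config(runtime, workspace, system)
print(cfg)  # provider from runtime, plan from workspace, zone from system defaults
```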
State Management
SurrealDB Graph Database:
Stores complex infrastructure relationships:
Nodes:
- Servers (compute)
- Networks (connectivity)
- Storage (persistence)
- Services (software)
- Workflows (automation)
Edges:
- Server → Network (connected)
- Server → Storage (mounted)
- Service → Server (running on)
- Workflow → Dependency (depends on)
File-Based Persistence:
For orchestrator queue and checkpoints:
~/.provisioning/
├── state/ # Infrastructure state
├── checkpoints/ # Workflow checkpoints
├── queue/ # Orchestrator queue
└── logs/ # Operational logs
Security Architecture
4-Layer Security Model:
| Layer | Components | Features |
|---|---|---|
| Authentication | JWT, sessions, MFA | 2FA, TOTP, WebAuthn |
| Authorization | Cedar policies, RBAC | Fine-grained permissions |
| Encryption | AES-256-GCM, TLS | At-rest & in-transit |
| Audit | Logging, compliance | 7-year retention |
Security Services:
- JWT token validation
- Argon2id password hashing
- Multi-factor authentication
- Cedar policy enforcement
- Encrypted credential storage
- KMS integration (5 backends)
- Audit logging (5 export formats)
- Compliance checking (SOC2, GDPR, HIPAA)
Performance Characteristics
Modular CLI (84% code reduction):
- Main CLI: 211 lines (vs. 1,329 before)
- Command discovery: O(1) dispatcher
- Lazy loading: Commands loaded on-demand
- Caching: Configuration cached after first load
Orchestrator Performance:
- Dependency resolution: O(n log n) topological sort
- Parallel execution: Configurable task limit
- Checkpoint recovery: Resume from failure point
- Memory efficient: File-based queue
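Dependency-aware scheduling of this kind can be sketched with Kahn's algorithm, using a heap to stand in for priority ordering (illustrative Python, not the Rust implementation; the task names are hypothetical):

```python
import heapq

def schedule(tasks, deps):
    """Order tasks so every prerequisite runs before its dependents.
    `deps` maps a task to the set of tasks it depends on."""
    indegree = {t: len(deps.get(t, ())) for t in tasks}
    dependents = {t: [] for t in tasks}
    for task, prereqs in deps.items():
        for p in prereqs:
            dependents[p].append(task)
    ready = [t for t, d in indegree.items() if d == 0]
    heapq.heapify(ready)  # heap stands in for priority-based selection
    order = []
    while ready:
        t = heapq.heappop(ready)
        order.append(t)
        for nxt in dependents[t]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                heapq.heappush(ready, nxt)
    if len(order) != len(tasks):
        raise ValueError("dependency cycle detected")
    return order

tasks = ["network", "server", "kubernetes", "storage"]
deps = {"server": {"network"}, "kubernetes": {"server", "storage"}}
order = schedule(tasks, deps)
print(order)  # prerequisites always precede their dependents
```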
Provider Operations:
- Batch creation: Parallel server provisioning
- Bulk operations: Multi-resource transactions
- State tracking: Efficient state queries
- Rollback: Atomic operation reversal
Nushell Plugins (10-50x speedup):
- Compiled Rust extensions
- Direct native code execution
- Zero-copy data passing
- Async I/O support
Deployment Modes
Three Operational Modes:
| Mode | Interaction | Configuration | Rollback | Use Case |
|---|---|---|---|---|
| Interactive TUI | Ratatui UI | Manual input | Automatic | Development |
| Headless CLI | Command-line | Script-driven | Manual | Automation |
| Unattended CI/CD | Non-interactive | Configuration file | Automatic | CI/CD pipelines |
Technology Stack
| Component | Technology | Why |
|---|---|---|
| IaC Language | Nickel | Type-safe, lazy evaluation, contracts |
| Scripting | Nushell 0.109+ | Structured data pipelines |
| Performance | Rust | Zero-cost abstractions, memory safety |
| State | SurrealDB | Graph database for relationships |
| Encryption | SOPS + Age | Industry-standard encryption |
| Security | Cedar + JWT | Policy enforcement + tokens |
| Orchestration | Custom | Specialized for infrastructure workflows |
File Organization
provisioning/
├── core/ # CLI engine (Nushell)
│ ├── cli/provisioning # Main entry point
│ ├── nulib/ # 54 core libraries
│ ├── plugins/ # Nushell plugins (Rust)
│ └── scripts/ # Utility scripts
│
├── platform/ # Microservices (Rust)
│ ├── crates/ # 12 microservices
│ │ ├── orchestrator/ # Workflow scheduler
│ │ ├── control-center/ # API gateway + auth
│ │ ├── control-center-ui/ # Web dashboard
│ │ ├── mcp-server/ # AI integration
│ │ ├── vault-service/ # Secrets backend
│ │ ├── extension-registry/ # OCI registry
│ │ ├── ai-service/ # LLM features
│ │ ├── detector/ # Anomaly detection
│ │ ├── rag/ # Knowledge retrieval
│ │ ├── provisioning-daemon/ # Background service
│ │ ├── platform-config/ # Config management
│ │ └── service-clients/ # API clients
│ └── Cargo.toml # Rust workspace
│
├── extensions/ # Extensible components
│ ├── providers/ # Cloud providers (5)
│ ├── taskservs/ # Task services (50+)
│ ├── clusters/ # Cluster templates (9)
│ └── workflows/ # Automation templates
│
├── schemas/ # Nickel schemas
│ ├── main.ncl # Entry point
│ ├── config/ # Configuration schemas
│ ├── infrastructure/ # Infrastructure schemas
│ ├── operations/ # Operational schemas
│ └── [other schemas] # Additional schemas
│
├── config/ # System configuration
│ └── config.defaults.toml # Default settings
│
├── bootstrap/ # Installation
│ ├── install.sh # Bash bootstrap
│ └── install.nu # Nushell installer
│
├── docs/ # Product documentation
│ └── src/ # mdBook source
│
└── README.md # Project overview
Component Interaction
Typical Workflow:
User Input
↓
CLI Dispatcher (provisioning/core/cli/provisioning)
↓
Nushell Handler (provisioning/core/nulib/commands/)
↓
Configuration Loading (lib_provisioning/config/)
↓
Provider Selection (lib_provisioning/providers/)
↓
Validation (lib_provisioning/infra_validator/)
↓
Orchestrator Queue (provisioning/platform/orchestrator/)
↓
Task Execution (provider + task service)
↓
State Update (SurrealDB / file storage)
↓
Audit Logging (security system)
↓
User Feedback
Scalability
Provisioning scales across four deployment tiers:
- Solo: 2 CPU cores, 4GB RAM (single instance)
- MultiUser: 4-8 CPU cores, 8GB RAM (small team)
- CICD: 8+ CPU cores, 16GB RAM (enterprise)
- Enterprise: Multi-node Kubernetes (unlimited)
Bottlenecks & Solutions:
| Component | Bottleneck | Solution |
|---|---|---|
| Orchestrator | Task queue | Partition by workspace |
| State | SurrealDB | Horizontal scaling |
| Providers | API rate limits | Exponential backoff |
| Storage | Disk I/O | SSD + caching |
Integration Points
Provisioning integrates with:
- Kubernetes API - Cluster management
- Cloud Provider APIs - Resource provisioning
- SOPS + Age - Secrets encryption
- Prometheus - Metrics collection
- Cedar - Policy enforcement
- SurrealDB - State persistence
- MCP - AI integration
- KMS - Key management (Cosmian, AWS, local)
Reliability Features
Fault Tolerance:
- Checkpoint recovery - Resume from failure
- Automatic rollback - Revert failed operations
- Retry logic - Exponential backoff
- Health checks - Continuous monitoring
- Backup & restore - Data protection
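Checkpoint recovery, the first item above, can be sketched as follows (illustrative Python; the real orchestrator persists much richer state under ~/.provisioning/checkpoints/, and the step names are hypothetical):

```python
import json
import os
import tempfile

def run_with_checkpoints(steps, state_path):
    """Run steps in order; after each success persist its index so a
    restart skips work that already completed."""
    done = -1
    if os.path.exists(state_path):
        with open(state_path) as f:
            done = json.load(f)["last_completed"]
    for i, step in enumerate(steps):
        if i <= done:
            continue  # already completed in a previous run
        step()
        with open(state_path, "w") as f:
            json.dump({"last_completed": i}, f)

log = []
def make(name, fail=False):
    def step():
        if fail:
            raise RuntimeError(f"{name} failed")
        log.append(name)
    return step

path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
steps = [make("create-network"), make("create-server"), make("install-k8s", fail=True)]
try:
    run_with_checkpoints(steps, path)  # fails at the third step
except RuntimeError:
    pass
steps[2] = make("install-k8s")         # fix the failing step and resume
run_with_checkpoints(steps, path)
print(log)  # the first two steps were not re-executed
```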
High Availability:
- Multi-node orchestrator
- Database replication
- Service redundancy
- Load balancing
- Failover automation
Related Documentation
Design Principles
Core principles guiding Provisioning architecture and development.
1. Workspace-First Design
Principle: Workspaces are the default organizational unit for ALL infrastructure work.
Why:
- Explicit project isolation
- Prevent accidental cross-project modifications
- Independent credential management
- Clear configuration boundaries
- Team collaboration enablement
Application:
- Every workspace has independent state
- Workspace switching is atomic
- Configuration per workspace
- Extensions inherited from platform
Code Example:
# Workspace-enforced workflow
provisioning workspace init my-project
provisioning workspace switch my-project
# This command requires active workspace
provisioning server create --name web-01
Impact: All commands validate active workspace before execution.
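The workspace guard described above can be sketched as follows (illustrative Python; the registry fields follow the example shown later in this documentation, but the function itself is hypothetical):

```python
class NoActiveWorkspace(Exception):
    pass

def require_workspace(registry: dict) -> str:
    """Fail fast before any command runs when no workspace is active
    or the active name is not registered."""
    active = registry.get("active")
    if not active or active not in registry.get("registry", {}):
        raise NoActiveWorkspace(
            "no active workspace; run `provisioning workspace switch <name>` first"
        )
    return active

registry = {"active": "my-project", "registry": {"my-project": {"template": "default"}}}
print(require_workspace(registry))  # my-project
```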
2. Type-Safety Mandatory
Principle: ALL configurations MUST be type-safe. Validation is NEVER optional.
Why:
- Catch errors at configuration time
- Prevent runtime failures
- Enable IDE support (LSP)
- Enforce consistency
- Reduce deployment risk
Application:
- Nickel is source of truth (NOT TOML)
- Type contracts on ALL schemas
- Gradual typing not allowed
- Validation in ALL profiles (dev, prod, cicd)
- Static analysis before deployment
Code Example:
# Type-safe infrastructure definition
{
  name : String = "server-01",
  plan : [| 'small, 'medium, 'large |] = 'medium,
  zone : String = "de-fra1",
  backup_enabled : Bool = false,
} | ServerContract
Impact: Type errors caught before infrastructure changes.
3. Configuration-Driven, Never Hardcoded
Principle: Configuration is the source of truth. Hardcoded values are forbidden.
Why:
- Enable environment-specific behavior
- Support multiple deployment modes
- Allow runtime reconfiguration
- Audit configuration changes
- Team collaboration
Application:
- 5-layer configuration hierarchy
- 476+ configuration accessors
- Variable interpolation
- Environment-specific overrides
- Schema validation
Code Example:
# Configuration drives behavior
provisioning server create --plan $(config.server.default_plan)
# Environment-specific configs
PROVISIONING_ENV=prod provisioning server create
Forbidden:
# ❌ WRONG - Hardcoded values
let server_plan = "medium"
# ✅ RIGHT - Configuration-driven
let server_plan = (config.server.plan)
Impact: Single codebase supports all environments.
4. Multi-Cloud Abstraction
Principle: Provider-agnostic interfaces enable multi-cloud deployments.
Why:
- Avoid vendor lock-in
- Reuse infrastructure code
- Support multiple cloud strategies
- Easy provider switching
Application:
- Unified provider interface
- Abstract resource definitions
- Provider-specific implementation
- Automatic provider selection
Code Example:
# Provider-agnostic configuration
{
servers = [
{
name = "web-01"
plan = "medium" # Abstract plan size
provider = "upcloud" # Swappable provider
}
]
}
Impact: Same Nickel schema deploys to UpCloud, AWS, or Hetzner.
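The provider-agnostic interface can be sketched as follows (illustrative Python; the class names, plan mapping, and return shapes are hypothetical stand-ins for the real Nushell provider modules):

```python
from abc import ABC, abstractmethod

class Provider(ABC):
    """Unified interface every cloud provider implements."""
    @abstractmethod
    def create_server(self, name: str, plan: str) -> dict: ...

class UpCloud(Provider):
    # Hypothetical mapping from abstract plan sizes to provider-specific plans.
    PLANS = {"small": "1xCPU-2GB", "medium": "2xCPU-4GB", "large": "4xCPU-8GB"}
    def create_server(self, name, plan):
        return {"provider": "upcloud", "name": name, "plan": self.PLANS[plan]}

class Local(Provider):
    def create_server(self, name, plan):
        return {"provider": "local", "name": name, "plan": plan}

PROVIDERS = {"upcloud": UpCloud(), "local": Local()}

def deploy(spec):
    """Route the same abstract spec to whichever provider it names."""
    return PROVIDERS[spec["provider"]].create_server(spec["name"], spec["plan"])

server = deploy({"name": "web-01", "plan": "medium", "provider": "upcloud"})
print(server)  # the abstract "medium" plan became a provider-specific plan
```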
5. Modular, Extensible Architecture
Principle: Components are loosely coupled, independently deployable.
Why:
- Easy to add features
- Support custom extensions
- Avoid monolithic growth
- Enable community contributions
- Flexible deployment options
Application:
- 54 core Nushell libraries
- 111+ CLI commands in 7 domains
- 50+ task services
- 5 cloud providers
- 9 cluster templates
- Pluggable provider interface
Impact: Add features without modifying core system.
6. Hybrid Rust + Nushell
Principle: Rust for performance-critical components, Nushell for orchestration.
Why:
- Rust: Type safety, zero-cost abstractions, performance
- Nushell: Structured data, productivity, easy automation
- Hybrid: Best of both worlds
Application:
- Core CLI: Bash wrapper → Nushell dispatcher
- Orchestrator: Rust scheduler + Nushell task execution
- Libraries: Nushell for business logic
- Performance: Rust plugins for 10-50x speedup
Impact: Fast, type-safe, productive infrastructure automation.
7. State Management via Graph Database
Principle: Infrastructure relationships tracked via SurrealDB graph.
Why:
- Model complex infrastructure relationships
- Query relationships efficiently
- Track dependencies
- Support rollback via state history
- Audit trail
Application:
- SurrealDB for relationship queries
- File-based persistence for queue
- Event-driven state updates
- Checkpoint-based recovery
Example Relationships:
Server → Network (connected to)
Server → Storage (mounts)
Cluster → Service (runs)
Workflow → Dependency (depends on)
Impact: Complex infrastructure relationships handled gracefully.
8. Security-First Design
Principle: Security is built-in, not bolted-on.
Why:
- Enterprise compliance
- Data protection
- Access control
- Audit trails
- Threat detection
Application:
- 4-layer security model (auth, authz, encryption, audit)
- JWT authentication
- Cedar policy enforcement
- AES-256-GCM encryption
- 7-year audit retention
- MFA support (TOTP, WebAuthn)
Impact: Enterprise-grade security by default.
9. Progressive Disclosure
Principle: Simple for common cases, powerful for advanced use cases.
Why:
- Low barrier to entry
- Professional productivity
- Advanced features available
- Avoid overwhelming users
- Gradual learning curve
Application:
- Simple: Interactive TUI installer
- Productive: CLI with 80+ shortcuts
- Powerful: Batch workflows, policies
- Advanced: Custom extensions, hooks
Impact: All skill levels supported.
10. Fail-Fast, Recover Gracefully
Principle: Detect issues early, provide recovery mechanisms.
Why:
- Prevent invalid deployments
- Enable safe recovery
- Minimize blast radius
- Audit failures for learning
Application:
- Validation before execution
- Checkpoint-based recovery
- Automatic rollback on failure
- Detailed error messages
- Retry with exponential backoff
Code Example:
# Validate before deployment
provisioning validate config --strict
# Dry-run to check impact
provisioning --check server create
# Safe rollback on failure
provisioning workflow rollback --to-checkpoint
Impact: Safe infrastructure changes with confidence.
11. Observable & Auditable
Principle: All operations traceable, all changes auditable.
Why:
- Compliance & regulation
- Troubleshooting
- Security investigation
- Team accountability
- Historical analysis
Application:
- Comprehensive audit logging
- 5 export formats (JSON, YAML, CSV, syslog, CloudWatch)
- Structured log entries
- Operation tracing
- Resource change tracking
Impact: Complete visibility into infrastructure changes.
12. No Shortcuts on Reliability
Principle: Reliability features are standard, not optional.
Why:
- Production requirements
- Minimize downtime
- Data protection
- Business continuity
- Trust & confidence
Application:
- Checkpoint recovery
- Automatic rollback
- Health monitoring
- Backup & restore
- Multi-node deployment
- Service redundancy
Impact: Enterprise-grade reliability standard.
Architectural Decision Records (ADRs)
Key decisions documenting rationale:
| ADR | Decision | Rationale |
|---|---|---|
| ADR-011 | Nickel Migration | Type-safety over KCL flexibility |
| ADR-010 | Config Strategy | 5-layer hierarchy over flat config |
| ADR-009 | SurrealDB | Graph relationships over relational |
| ADR-008 | Modular CLI | 80+ shortcuts over verbose commands |
| ADR-007 | Workspace-First | Isolation over global state |
| ADR-006 | Hybrid Architecture | Rust + Nushell for best of both |
Design Trade-offs
| Decision | Gain | Cost |
|---|---|---|
| Type-Safety | Fewer errors | Learning curve |
| Config Hierarchy | Flexibility | Complexity |
| Workspace Isolation | Safety | Duplication |
| Modular CLI | Discoverability | No single command |
| SurrealDB | Relationships | Resource overhead |
| Validation Strict | Safety | Fast iteration friction |
Related Documentation
Component Architecture
Detailed architecture of each major Provisioning component.
Core Components Map
User Interface
├─ CLI (Nushell dispatcher)
├─ Web Dashboard (Control Center UI)
├─ REST API (Control Center)
└─ MCP Server (AI Integration)
↓
Core Engine (54 Nushell libraries)
├─ Configuration Management
├─ Provider Abstraction
├─ Workspace Management
├─ Infrastructure Validation
├─ Secrets Management
└─ Command Utilities
↓
Platform Services (12 Rust microservices)
├─ Orchestrator (Workflow execution)
├─ Control Center (API + Auth)
├─ Control Center UI (Web dashboard)
├─ MCP Server (AI integration)
├─ Vault Service (Secrets backend)
├─ Extension Registry (OCI distribution)
├─ AI Service (LLM features)
├─ Detector (Anomaly detection)
├─ RAG (Knowledge retrieval)
├─ Provisioning Daemon (Background service)
├─ Platform Config (Configuration management)
└─ Service Clients (API clients)
↓
Extensions (Modular infrastructure)
├─ Providers (5 cloud providers)
├─ Task Services (50+ services)
├─ Clusters (9 templates)
└─ Workflows (Automation)
↓
Infrastructure (Running resources)
├─ Cloud Compute
├─ Networks & Storage
├─ Services
└─ Monitoring
1. CLI Layer
Location: provisioning/core/cli/
Main Entry Point (provisioning)
Bash wrapper that:
- Detects Nushell installation
- Loads environment variables
- Validates workspace requirement
- Routes command to dispatcher
- Handles error reporting
Command Dispatcher
Location: provisioning/core/nulib/main_provisioning/dispatcher.nu
Supports:
- 111+ commands across 7 domains
- 80+ shortcuts for productivity
- Bi-directional help (help workspace / workspace help)
- Dynamic loading of command modules
2. Core Engine Components
Configuration Management
Location: provisioning/core/nulib/lib_provisioning/config/
Key Features:
- Load merged configuration from 5 layers
- 476+ accessors for config values
- Variable interpolation & TOML merging
- Schema validation
- Configuration caching
Provider Abstraction
Location: provisioning/core/nulib/lib_provisioning/providers/
Supported Providers (5):
- UpCloud - Primary European cloud
- AWS - Amazon Web Services
- Hetzner - Baremetal & cloud
- Local - Development environment
- Demo - Testing & mocking
Features:
- Unified cloud provider interface
- Dynamic provider loading
- Credential management
- Provider state tracking
Workspace Management
Location: provisioning/core/nulib/lib_provisioning/workspace/
Responsibilities:
- Workspace registry tracking
- Atomic workspace switching
- Configuration isolation
- Extension inheritance
- State management
Workspace Registry:
workspaces:
active: "my-project"
registry:
my-project:
path: ~/.provisioning/workspaces/workspace_my_project
created: 2026-01-16T10:30:00Z
template: default
Infrastructure Validation
Location: provisioning/core/nulib/lib_provisioning/infra_validator/
Validation Stages:
- Syntax check - Valid Nickel syntax
- Type check - Type correctness
- Schema check - Matches expected schema
- Constraint check - Business rule validation
- Dependency check - Infrastructure dependencies
- Security check - Security policies
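The syntax, type, schema, and constraint stages map directly onto what Nickel contracts can express. A minimal sketch, using hypothetical field names rather than the platform's real schema:

```nickel
# Hypothetical server schema; not the platform's actual contract.
let Server = {
  name | String,
  plan | [| 'small, 'medium, 'large |],
  # Constraint check: encode a business rule as a predicate contract.
  disk_gb | Number
          | std.contract.from_predicate (fun n => n >= 10),
} in
# Type, schema, and constraint violations all fail at validation time.
{ name = "web-01", plan = 'medium, disk_gb = 50 } | Server
```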
Secrets Management
Location: provisioning/core/nulib/lib_provisioning/secrets/
Backends:
- SOPS + Age (default)
- Cosmian KMS (enterprise)
- AWS KMS (AWS)
- Local KMS (development)
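Backend selection might be expressed as a small configuration record. The field names here are assumptions for illustration only, not the platform's schema:

```nickel
# Hypothetical secrets configuration sketch.
{
  secrets = {
    # One of the four supported backends.
    backend | [| 'sops_age, 'cosmian_kms, 'aws_kms, 'local_kms |]
            | default = 'sops_age,
    sops = {
      # Placeholder recipient; a real Age public key goes here.
      age_recipients | default = ["age1..."],
    },
  },
}
```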
3. Platform Services
Orchestrator
Location: provisioning/platform/crates/orchestrator/
Technology: Rust + Nushell
Key Features:
- High-performance workflow execution
- File-based persistence
- Checkpoint recovery
- Parallel execution with dependencies
- REST API (83+ endpoints)
- Priority-based task scheduling
State Persistence:
~/.provisioning/
├── queue/ # Task queue
├── checkpoints/ # Workflow checkpoints
└── state/ # Infrastructure state
Control Center
Location: provisioning/platform/crates/control-center/
Technology: Rust (Axum)
Features:
- JWT authentication
- Cedar policy authorization
- RBAC system
- Audit logging
- REST API for all operations
Authorization Model:
- User roles (admin, user, viewer)
- Fine-grained permissions
- Cedar policy enforcement
- Attribute-based access control
Control Center UI
Location: provisioning/platform/crates/control-center-ui/
Features:
- Real-time infrastructure view
- Workflow visualization
- Configuration management
- Resource monitoring
- Audit log viewer
MCP Server
Location: provisioning/platform/crates/mcp-server/
Technology: Rust
Features:
- AI-powered assistance via MCP
- Natural language command parsing
- Auto-completion of configurations
- 7 configuration tools for LLM
- Context-aware recommendations
Vault Service
Location: provisioning/platform/crates/vault-service/
Features:
- Encrypted credential storage
- KMS integration (5 backends)
- SOPS + Age encryption
- Secure credential injection
- Audit logging for secret access
Extension Registry
Location: provisioning/platform/crates/extension-registry/
Features:
- OCI-compliant distribution
- Provider/taskserv packaging
- Semantic version management
- Content addressable storage
- Registry API endpoints
AI Service
Location: provisioning/platform/crates/ai-service/
Features:
- LLM integration platform
- Infrastructure request parsing
- Workspace context enrichment
- Configuration suggestion generation
- Multi-provider LLM support
Detector
Location: provisioning/platform/crates/detector/
Features:
- System health monitoring
- Anomaly pattern detection
- Infrastructure issue identification
- Real-time surveillance
- Alerting system integration
RAG Service
Location: provisioning/platform/crates/rag/
Features:
- Retrieval Augmented Generation
- Document semantic embedding
- Knowledge base integration
- Context-aware answer generation
- Multi-source knowledge synthesis
Provisioning Daemon
Location: provisioning/platform/crates/provisioning-daemon/
Features:
- Background service operation
- System event monitoring
- Background job execution
- Infrastructure state synchronization
- Event-driven architecture
Platform Config
Location: provisioning/platform/crates/platform-config/
Features:
- Centralized configuration loading
- Schema-based validation
- Multi-environment support
- System-wide default settings
- Configuration hot-reload support
Service Clients
Location: provisioning/platform/crates/service-clients/
Features:
- Platform service client SDKs
- Cloud provider API clients
- HTTP/RPC request handling
- Connection pooling and management
- Retry logic and error handling
4. Extension Components
Providers
Location: provisioning/extensions/providers/
Structure:
providers/
├── upcloud/ # UpCloud provider
├── aws/ # AWS provider
├── hetzner/ # Hetzner provider
├── local/ # Local dev provider
├── demo/ # Demo/test provider
└── prov_lib/ # Shared utilities
Provider Interface:
- Create/delete resources
- List resources
- Query resource status
- Network/storage management
- Credential validation
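One way to picture the unified interface is as a contract over what each provider extension declares. The names below are assumptions, not the actual Nushell module layout:

```nickel
# Hypothetical capability declaration for a provider extension.
let Provider = {
  name | String,
  zones | Array String,
  capabilities | Array [| 'create, 'delete, 'list, 'status, 'network, 'storage |],
} in
{
  name = "upcloud",
  zones = ["fi-hel1", "de-fra1"],
  capabilities = ['create, 'delete, 'list, 'status, 'network, 'storage],
} | Provider
```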
Task Services
Location: provisioning/extensions/taskservs/
50+ Services in 18 categories, including:
- Container runtimes (containerd, podman, crio)
- Kubernetes (etcd, coredns, cilium, calico)
- Storage (rook-ceph, mayastor, nfs)
- Databases (postgres, redis, mongodb)
- Networking (ip-aliases, proxy, kms)
- Security (webhook, kms, oras)
- Observability (prometheus, grafana, loki)
- Development (gitea, coder, buildkit)
- Hypervisor (kvm, qemu, libvirt)
Clusters
Location: provisioning/extensions/clusters/
9 Pre-built Templates:
- web - Web service cluster
- oci-reg - Container registry
- git - Git hosting (Gitea)
- buildkit - Build infrastructure
- k8s-ha - HA Kubernetes
- postgresql - HA PostgreSQL
- cicd-argocd - GitOps CI/CD
- cicd-tekton - Tekton pipelines
5. Configuration Layer
Nickel Schemas
Location: provisioning/schemas/
Structure (27 directories):
schemas/
├── main.ncl # Entry point
├── lib/ # Utilities
├── config/ # Settings
├── infrastructure/ # Servers, networks
├── operations/ # Workflows
├── deployment/ # Kubernetes
├── services/ # Service defs
└── versions.ncl # Tool versions
3-File Pattern:
- contracts.ncl - Type definitions
- defaults.ncl - Default values
- main.ncl - Entry point + makers
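A minimal sketch of the 3-file pattern. The contents are hypothetical, and the first two files are shown as comments so the block stays self-contained:

```nickel
# contracts.ncl — type definitions, e.g.:
#   { Server = { name | String, plan | [| 'small, 'medium, 'large |] } }
# defaults.ncl — default values, e.g.:
#   { server_defaults = { plan | default = 'small } }

# main.ncl — entry point exposing a maker function.
let contracts = import "contracts.ncl" in
let defaults = import "defaults.ncl" in
{
  make_server = fun server_name =>
    (defaults.server_defaults & { name = server_name }) | contracts.Server,
}
```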
Component Dependencies
CLI
├─ Configuration
├─ Workspace
├─ Validation
├─ Secrets
└─ Providers
Providers
└─ Orchestrator
Orchestrator
├─ Task Services
├─ Control Center
└─ State Manager
Control Center
├─ Authorization
├─ Audit Logging
└─ State Manager
Communication Patterns
Synchronous (Request-Response)
CLI → Orchestrator → Provider → Cloud API
Asynchronous (Queue)
CLI → Orchestrator (queue) → [Background execution]
Event-Driven
Provider Event → Orchestrator → State Update
→ Control Center
→ Monitoring
Related Documentation
Integration Patterns
Design patterns for extending and integrating with Provisioning.
1. Provider Integration Pattern
Pattern: Add a new cloud provider to Provisioning.
2. Task Service Integration Pattern
Pattern: Add infrastructure component.
3. Cluster Template Pattern
Pattern: Create pre-configured cluster template.
4. Batch Workflow Pattern
Pattern: Create automation workflow for complex operations.
5. Custom Extension Pattern
Pattern: Create custom Nushell library.
6. Authorization Policy Pattern
Pattern: Define fine-grained access control via Cedar.
7. Webhook Integration
Pattern: Trigger Provisioning from external systems.
8. Monitoring Integration
Pattern: Export metrics and logs to monitoring systems.
9. CI/CD Integration
Pattern: Use Provisioning in automated pipelines.
10. MCP Tool Integration
Pattern: Add AI-powered tool via MCP.
Integration Scenarios
Multi-Cloud Deployment
Deploy across UpCloud, AWS, and Hetzner in single workflow.
GitOps Workflow
Git changes trigger infrastructure updates via webhooks.
Self-Service Deployment
Non-technical users request infrastructure via natural language.
Best Practices
- Use type-safe Nickel schemas
- Implement proper error handling
- Log all operations for audit trails
- Test extensions before production
- Document configuration & usage
- Version extensions independently
- Support backward compatibility
- Validate inputs & encrypt credentials
Related Documentation
Architecture Decision Records
This section contains Architecture Decision Records (ADRs) documenting key architectural decisions and their rationale for the Provisioning platform.
ADR Index
Core Architecture Decisions
- ADR-001: Modular CLI Architecture - Decentralized CLI registration reducing code by 84%, 80+ keyboard shortcuts, dynamic subcommands.
- ADR-002: Workspace-First Architecture - Workspaces as primary organizational unit with isolation boundaries.
- ADR-003: Nickel as Source of Truth - Nickel for type-safe configuration, mandatory validation, KCL migration.
- ADR-004: 12-Microservice Architecture - Distributed microservices for independent scaling and deployment.
- ADR-005: Service Communication - HTTP REST for sync operations, message queues for async, pub/sub for events.
Security and Cryptography
- ADR-006: Post-Quantum Cryptography - Hybrid encryption: CRYSTALS-Kyber, SPHINCS+, Falcon with AES-256 fallback.
- ADR-007: Multi-Layer Data Encryption - Encryption at-rest, in-transit, field-level, with key rotation policies.
Operations and Observability
- ADR-008: Unified Observability Stack - Prometheus metrics, ELK Stack, Jaeger distributed tracing.
- ADR-009: SLO and Error Budget Management - Service Level Objectives with automatic remediation on SLO violations.
- ADR-010: Automated Incident Response - Autonomous detection, automatic remediation, escalation, chaos engineering.
Decision Format
Each ADR follows this structure:
- Status: Accepted, Proposed, Deprecated, Superseded
- Context: Problem statement and constraints
- Decision: The chosen approach
- Consequences: Benefits and trade-offs
- Alternatives: Other options considered
- References: Related ADRs and external docs
Rationale for ADRs
ADRs document the “why” behind architectural choices:
- Modular CLI - Scales command set without monolithic registration
- Workspace-First - Isolates infrastructure and supports multi-tenancy
- Nickel Source of Truth - Ensures type-safe configuration and prevents runtime errors
- Microservice Distribution - Enables independent scaling and deployment
- Communication Protocol - Balances synchronous needs with async event processing
- Post-Quantum Crypto - Protects against future quantum computing threats
- Multi-Layer Encryption - Defense in depth against data breaches
- Observability - Enables rapid troubleshooting and performance analysis
- SLO Management - Aligns infrastructure quality with business objectives
- Incident Automation - Reduces MTTR and improves system resilience
Cross-References
These ADRs interact with:
- Platform Documentation - See provisioning/docs/src/architecture/
- Features - See provisioning/docs/src/features/ for implementation details
- Development Guides - See provisioning/docs/src/development/ for extending systems
- Security Documentation - See provisioning/docs/src/security/ for compliance details
- Operations Guides - See provisioning/docs/src/operations/ for deployment procedures
Examples
Real-world infrastructure as code examples demonstrating Provisioning across multi-cloud, Kubernetes, security, and operational scenarios.
Overview
This section contains production-ready examples showing how to:
- Deploy infrastructure from basic single-cloud to complex multi-cloud environments
- Orchestrate Kubernetes clusters with Provisioning automation
- Implement security patterns including encryption, secrets management, and compliance
- Build custom workflows for specialized infrastructure operations
- Handle disaster recovery with backup strategies and failover procedures
- Optimize costs through resource analysis and right-sizing
- Migrate legacy systems from traditional infrastructure to cloud-native architectures
- Test infrastructure as code with validation, policy checks, and integration tests
All examples use Nickel for type-safe configuration and are designed as learning resources and templates for your own deployments.
Quick Start Examples
Basic Infrastructure Setup
- Basic Setup - Single-cloud with networking, compute, storage - perfect starting point
- E-Commerce Platform - Multi-tier application across AWS and UpCloud with load balancing, databases
Multi-Cloud Deployments
- Multi-Cloud Deployment - Deploy across AWS, UpCloud, Hetzner with provider abstraction
- Kubernetes Deployment - Kubernetes clusters, workloads, networking, operators via Nickel
- Machine Learning Infrastructure - Training clusters, GPU resources, features, inference services
- Hybrid Cloud Setup - Hub-and-spoke architecture connecting on-premise and cloud
Operational Examples
- Disaster Recovery Drills - Database failover, complete infrastructure failover, backup recovery testing procedures.
- FinOps Cost Governance - Budget frameworks, cost monitoring, chargeback models, and cost optimization strategies.
- Legacy System Migration - Zero-downtime migration with gradual traffic cutover (5% → 100%).
Advanced Patterns
- Batch Workflow Orchestration - DAG scheduling, parallel execution, conditional logic, error handling.
- Advanced Networking - Load balancing, service mesh, DNS management, zero-trust architecture.
- GitOps Infrastructure Deployment - GitHub Actions, automated reconciliation, drift detection, audit trails.
- Secrets Rotation Strategy - Passwords, API keys, certificates with zero-downtime rotation.
Security and Compliance
- Compliance and Audit - SOC2, GDPR, HIPAA, PCI-DSS compliance with audit logging.
- Security Examples - Encryption, authentication, MFA, secrets management, and audit patterns.
- Infrastructure as Code Testing - Syntax validation, schema checks, policy compliance, unit and integration tests.
Cloud Provider Specific
- AWS Deployment Guide - EC2, RDS, S3, VPC, Load Balancers, IAM with cost optimization.
- UpCloud Deployment Guide - Compute, Storage, Networking, Backups with managed services.
- Hetzner Deployment Guide - Dedicated servers, cloud infrastructure, networking with cost efficiency.
- Kubernetes Examples - Deployments, StatefulSets, DaemonSets, Jobs, Custom Resources, Operators.
Configuration and Migration
- Terraform to Nickel Migration - Convert existing Terraform HCL to Nickel type-safe configuration with validation examples.
- KCL to Nickel Migration - Upgrade from deprecated KCL to Nickel with schema examples and best practices.
Example Organization
Each example follows this structure:
example-name.md
├── Overview - What this example demonstrates
├── Prerequisites - Required setup
├── Architecture Diagram - Visual representation
├── Nickel Configuration - Complete, runnable configuration
├── Deployment Steps - Command-by-command instructions
├── Verification - How to validate deployment
├── Troubleshooting - Common issues and solutions
└── Next Steps - How to extend or customize
Learning Paths
I’m new to Provisioning
- Start with Basic Setup
- Read Real-World Scenario
- Try Kubernetes Deployment
I need multi-cloud infrastructure
- Review Multi-Cloud Deployment
- Study Hybrid Cloud Setup
- Implement Advanced Networking
I need to migrate existing infrastructure
- Start with Legacy System Migration
- Add Terraform Migration if applicable
- Set up GitOps Deployment
I need enterprise features
- Implement Compliance and Audit
- Set up Disaster Recovery
- Deploy Cost Governance
- Configure Secrets Rotation
Copy and Customize
All examples are self-contained and can be:
- Copied into your workspace and adapted
- Extended with additional resources and customizations
- Tested using Provisioning’s validation framework
- Deployed directly via provisioning apply
Use them as templates, learning resources, or reference implementations for your own infrastructure.
Related Documentation
- Configuration Guide → See provisioning/docs/src/infrastructure/nickel-guide.md
- API Reference → See provisioning/docs/src/api-reference/
- Development → See provisioning/docs/src/development/
- Operations → See provisioning/docs/src/operations/
Basic Setup
Simple infrastructure setup examples for getting started with the Provisioning platform.
Single Server Deployment
Deploy a simple web server with UpCloud:
# workspace/infra/web-server.ncl
{
servers = [
{
name = "web-01",
provider = 'upcloud,
plan = 'medium,
zone = "fi-hel1",
storage = [
{size_gb = 50, type = 'ssd}
]
}
]
}
Deploy:
provisioning workspace create basic-web
cd basic-web
cp ../examples/web-server.ncl infra/
provisioning deploy --workspace basic-web --yes
Three-Tier Application
Web frontend, application backend, database:
{
servers = [
{name = "web-01", provider = 'upcloud, plan = 'small, zone = "fi-hel1"},
{name = "app-01", provider = 'upcloud, plan = 'medium, zone = "fi-hel1"},
{name = "db-01", provider = 'upcloud, plan = 'large, zone = "fi-hel1",
storage = [{size_gb = 100, type = 'ssd}]},
],
task_services = [
{name = "nginx", target = "web-01"},
{name = "nodejs", target = "app-01"},
{name = "postgresql", target = "db-01"},
]
}
Development Environment
Local development stack with Docker:
{
servers = [
{name = "dev-local", provider = 'local, plan = 'medium}
],
task_services = [
{name = "docker"},
{name = "postgresql"},
{name = "redis"},
]
}
References
Multi-Cloud Examples
Deploy infrastructure across multiple cloud providers for redundancy and geographic distribution.
Primary-Backup Configuration
UpCloud primary in Europe, AWS backup in US:
{
servers = [
# Primary (UpCloud EU)
{name = "web-eu", provider = 'upcloud, zone = "fi-hel1", plan = 'medium},
{name = "db-eu", provider = 'upcloud, zone = "fi-hel1", plan = 'large},
# Backup (AWS US)
{name = "web-us", provider = 'aws, zone = "us-east-1a", plan = 't3.medium},
{name = "db-us", provider = 'aws, zone = "us-east-1a", plan = 'm5.large},
],
replication = {
enabled = true,
pairs = [
{primary = "db-eu", standby = "db-us", mode = 'async}
]
}
}
Geographic Distribution
Deploy to multiple regions for low latency:
{
servers = [
{name = "web-eu", provider = 'upcloud, zone = "fi-hel1"},
{name = "web-us", provider = 'aws, zone = "us-west-2a"},
{name = "web-asia", provider = 'aws, zone = "ap-southeast-1a"},
],
load_balancing = {
global = true,
geo_routing = true
}
}
References
Kubernetes Deployment Examples
Deploy production-ready Kubernetes clusters with the Provisioning platform.
Basic Kubernetes Cluster
3-node cluster with Cilium CNI:
{
task_services = [
{
name = "kubernetes",
config = {
control_plane = {nodes = 3, plan = 'medium},
workers = [{name = "default", nodes = 3, plan = 'large}],
networking = {
cni = 'cilium,
pod_cidr = "10.42.0.0/16",
service_cidr = "10.43.0.0/16"
}
}
}
]
}
Production Cluster with Storage
Kubernetes with Rook-Ceph storage:
{
task_services = [
{
name = "kubernetes",
config = {
control_plane = {nodes = 3, plan = 'medium},
workers = [
{name = "general", nodes = 5, plan = 'large},
{name = "storage", nodes = 3, plan = 'xlarge,
storage = [{size_gb = 500, type = 'ssd}]}
],
networking = {cni = 'cilium}
}
},
{
name = "rook-ceph",
config = {
storage_nodes = ["storage-0", "storage-1", "storage-2"],
osd_per_device = 1
}
}
]
}
References
Custom Workflow Examples
Build complex deployment workflows with dependency management and parallel execution.
Multi-Stage Deployment
{
workflows = [{
name = "app-deployment",
steps = [
{name = "provision-infrastructure", type = 'provision},
{name = "install-kubernetes", type = 'task, depends_on = ["provision-infrastructure"]},
{name = "deploy-application", type = 'task, depends_on = ["install-kubernetes"]},
{name = "configure-monitoring", type = 'task, depends_on = ["deploy-application"]}
]
}]
}
Parallel Regional Deployment
{
workflows = [{
name = "global-rollout",
steps = [
{name = "deploy-eu", type = 'task},
{name = "deploy-us", type = 'task},
{name = "deploy-asia", type = 'task},
{name = "configure-dns", type = 'configure,
depends_on = ["deploy-eu", "deploy-us", "deploy-asia"]}
]
}]
}
References
Security Configuration Examples
Security configuration examples for authentication, encryption, and secrets management.
Complete Security Configuration
{
security = {
authentication = {
enabled = true,
jwt_algorithm = "RS256",
mfa_required = true
},
secrets = {
backend = "secretumvault",
url = "https://vault.example.com",
auto_rotate = true,
rotation_days = 90
},
encryption = {
at_rest = true,
algorithm = "AES-256-GCM",
kms_backend = "secretumvault"
},
audit = {
enabled = true,
retention_days = 2555,
export_format = "json"
}
}
}
SecretumVault Integration
# Configure SecretumVault
provisioning config set security.secrets.backend secretumvault
provisioning config set security.secrets.url http://localhost:8200
# Store secrets
provisioning vault put database/password --value="secret123"
# Retrieve secrets
provisioning vault get database/password
Encrypted Infrastructure Configuration
{
providers.upcloud = {
username = "admin",
password = std.secret "UPCLOUD_PASSWORD" # Encrypted
},
databases = [{
name = "production-db",
password = std.secret "DB_PASSWORD" # Encrypted
}]
}
References
Troubleshooting
Systematic problem-solving guides and debugging procedures for diagnosing and resolving issues with the Provisioning platform.
Overview
This section helps you:
- Solve common issues - Database connection errors, authentication failures, deployment failures
- Debug problems - Diagnostic tools, log analysis, tracing execution paths
- Analyze logs - Log aggregation, filtering, searching, pattern recognition
- Understand errors - Error message interpretation and root cause analysis
- Get support - Knowledge base, community resources, professional support
Organized by problem type and component for quick navigation.
Troubleshooting Guides
Quick Problem Solving
- Common Issues - Authentication failures, deployment errors, configuration, resource limits, network problems
- Debug Guide - Debug logging, verbose output, trace execution, collect diagnostics, analyze stack traces
- Logs Analysis - Find logs, search techniques, log patterns, interpreting errors, diagnostics
Component-Specific Troubleshooting
Each microservice and component has its own troubleshooting section:
- Orchestrator Issues - Workflow failures, scheduling problems, state inconsistencies
- Control Center Issues - API errors, permission problems, configuration issues
- Vault Service Issues - Secret access failures, key rotation problems, authentication errors
- Detector Issues - Analysis failures, false positives, configuration problems
- Extension Registry Issues - Provider loading, dependency resolution, versioning conflicts
Infrastructure and Configuration
- Configuration Problems - Nickel syntax errors, schema validation failures, type mismatches
- Provider Issues - Authentication failures, API limits, resource creation failures
- Task Service Failures - Service-specific errors, timeout issues, state management problems
- Network Problems - Connectivity issues, DNS resolution, firewall rules, certificate problems
Problem Diagnosis Flowchart
Issue Occurs
↓
Is it an authentication issue? → See [Common Issues](./common-issues.md) - Authentication
↓ No
Is it a deployment failure? → See [Common Issues](./common-issues.md) - Deployment
↓ No
Is it a configuration error? → See [Debug Guide](./debug-guide.md) - Configuration
↓ No
Enable debug logging → See [Debug Guide](./debug-guide.md)
↓
Collect logs and traces → See [Logs Analysis](./logs-analysis.md)
↓
Analyze patterns → Identify root cause
↓
Apply fix or escalate
Quick Reference: Common Problems
| Problem | Solution | Guide |
|---|---|---|
| “Authentication failed” | Check credentials, enable MFA | Common Issues |
| “Permission denied” | Verify RBAC policies, check Cedar rules | Common Issues |
| “Deployment failed” | Check logs, verify resources, test connectivity | Debug Guide |
| “Configuration invalid” | Validate Nickel schema, check types | Common Issues |
| “Provider unavailable” | Check API keys, verify connectivity | Common Issues |
| “Resource creation failed” | Check resource limits, verify account | Debug Guide |
| “Timeout” | Increase timeouts, check performance | Debug Guide |
| “Database error” | Check connections, verify schema | Common Issues |
Debugging Workflow
- Reproduce - Can you consistently reproduce the issue?
- Enable Debug Logging - Set RUST_LOG=debug and PROVISIONING_LOG_LEVEL=debug
- Collect Evidence - Logs, configuration, error messages, stack traces
- Analyze Patterns - Look for errors, warnings, unusual timing
- Identify Cause - Root cause analysis
- Test Fix - Verify the fix resolves the issue
- Prevent Recurrence - Update documentation, add tests
Enable Diagnostic Logging
# Set log level to debug
export RUST_LOG=debug
export PROVISIONING_LOG_LEVEL=debug
# Collect logs to file
provisioning config set logging.file /var/log/provisioning.log
provisioning config set logging.level debug
# Enable verbose output
provisioning --verbose <command>
# Run with tracing
RUST_BACKTRACE=1 provisioning <command>
Common Error Codes
| Code | Meaning | Action |
|---|---|---|
| 401 | Unauthorized | Check authentication credentials |
| 403 | Forbidden | Check authorization policies |
| 404 | Not Found | Verify resource exists |
| 409 | Conflict | Resolve state conflicts |
| 422 | Invalid | Verify configuration schema |
| 500 | Internal Error | Check server logs |
| 503 | Service Unavailable | Wait for service to recover |
Escalation Paths
Community Support
- Check Common Issues
- Search community forums
- Ask on GitHub discussions
Professional Support
- Open a support ticket
- Provide: logs, configuration, reproduction steps
- Wait for response
Emergency Issues (Security, Data Loss)
- Contact security team immediately
- Provide all evidence
- Document timeline
Support Resources
- Documentation → Complete guides in provisioning/docs/src/
- GitHub Issues → Community issues and discussions
- Slack Community → Real-time community support
- Email Support → professional@provisioning.io
- Chat Support → Available during business hours
Related Documentation
- Operations Guide → See provisioning/docs/src/operations/
- Architecture → See provisioning/docs/src/architecture/
- Features → See provisioning/docs/src/features/
- Development → See provisioning/docs/src/development/
- Examples → See provisioning/docs/src/examples/
Common Issues
Debug Guide
Logs Analysis
Getting Help
AI & Machine Learning
Provisioning includes comprehensive AI capabilities for infrastructure automation via natural language, intelligent configuration suggestions, and anomaly detection.
Overview
The AI system consists of three integrated components:
- TypeDialog AI Backends - Interactive form intelligence and agent automation
- AI Service Microservice - Central AI processing and coordination
- Core AI Libraries - Nushell query processing and LLM integration
Key Capabilities
Natural Language Infrastructure
Request infrastructure changes in plain English:
# Natural language request
provisioning ai "Create 3 web servers with load balancing and auto-scaling"
# Returns:
# - Parsed infrastructure requirements
# - Generated Nickel configuration
# - Deployment confirmation
Intelligent Configuration
AI suggests optimal configurations based on context:
- Database selection and tuning
- Network topology recommendations
- Security policy generation
- Resource allocation optimization
Anomaly Detection
Continuous monitoring and intelligent alerting:
- Infrastructure health anomalies
- Performance pattern detection
- Security issue identification
- Predictive alerting
Components at a Glance
| Component | Purpose | Technology |
|---|---|---|
| typedialog-ai | Form intelligence & suggestions | HTTP server, SurrealDB |
| typedialog-ag | AI agents & workflow automation | Type-safe agents, Nickel transpilation |
| ai-service | Central AI microservice | Rust, LLM integration |
| rag | Knowledge base retrieval | Semantic search, embeddings |
| mcp-server | Model Context Protocol | AI tool interface |
| detector | Anomaly detection system | Pattern recognition |
Quick Start
Enable AI Features
# Install AI tools
provisioning install ai-tools
# Configure AI service
provisioning ai configure --provider openai --model gpt-4
# Test AI capabilities
provisioning ai test
Use Natural Language
# Simple request
provisioning ai "Create a Kubernetes cluster"
# Complex request with options
provisioning ai "Deploy PostgreSQL HA cluster with replication in AWS, backup to S3"
# Get help on AI features
provisioning help ai
Architecture
The AI system follows a layered architecture:
┌─────────────────────────────────┐
│ User Interface Layer │
│ • Natural language input │
│ • TypeDialog AI forms │
│ • Chat interface │
└────────────┬────────────────────┘
↓
┌─────────────────────────────────┐
│ AI Orchestration Layer │
│ • AI Service (Rust) │
│ • Query processing (Nushell) │
│ • Intent recognition │
└────────────┬────────────────────┘
↓
┌─────────────────────────────────┐
│ Knowledge & Processing Layer │
│ • RAG (Retrieval) │
│ • LLM Integration │
│ • MCP Server │
│ • Detector (anomalies) │
└────────────┬────────────────────┘
↓
┌─────────────────────────────────┐
│ Infrastructure Layer │
│ • Nickel configuration │
│ • Deployment execution │
│ • Monitoring & feedback │
└─────────────────────────────────┘
Topics
- AI Architecture - System design and components
- TypeDialog Integration - AI forms and agents
- AI Service Crate - Core AI microservice
- RAG & Knowledge - Knowledge retrieval system
- Natural Language Infrastructure - LLM-driven IaC
Configuration
Environment Variables
# LLM Provider
export PROVISIONING_AI_PROVIDER=openai # openai, anthropic, local
export PROVISIONING_AI_MODEL=gpt-4 # Model identifier
export PROVISIONING_AI_API_KEY=sk-... # API key
# AI Service
export PROVISIONING_AI_SERVICE_PORT=9091 # AI service port
export PROVISIONING_AI_ENABLE_ANOMALY=true # Enable detector
export PROVISIONING_AI_RAG_THRESHOLD=0.75 # Similarity threshold
Configuration File
# ~/.config/provisioning/ai.yaml
ai:
enabled: true
provider: openai
model: gpt-4
api_key: ${PROVISIONING_AI_API_KEY}
service:
port: 9091
timeout: 30
max_retries: 3
typedialog:
ai_enabled: true
ag_enabled: true
suggestions: true
rag:
enabled: true
similarity_threshold: 0.75
max_results: 5
detector:
enabled: true
update_interval: 60
alert_threshold: 0.8
Use Cases
1. Infrastructure from Description
Describe infrastructure in natural language, get Nickel configuration:
provisioning ai deploy "
Create a production Kubernetes cluster with:
- 3 control planes
- 5 worker nodes
- HA PostgreSQL (3 nodes)
- Prometheus monitoring
- Encrypted networking
"
2. Configuration Assistance
Get AI suggestions while filling out forms:
provisioning setup profile
# TypeDialog shows suggestions based on context
# Database recommendations based on workload
# Security settings optimized for environment
3. Troubleshooting
AI analyzes logs and suggests fixes:
provisioning ai troubleshoot --service orchestrator
# Output:
# Issue detected: High memory usage
# Likely cause: Task queue backlog
# Suggestion: Scale orchestrator replicas to 3
# Command: provisioning orchestrator scale --replicas 3
4. Anomaly Detection
Continuous monitoring with intelligent alerts:
provisioning ai anomalies --since 1h
# Output:
# ⚠️ Unusual pattern detected
# Time: 2026-01-16T01:47:00Z
# Service: control-center
# Metric: API response time
# Baseline: 45ms → Current: 320ms (+611%)
# Likelihood: Query performance regression
Limitations
- LLM Dependency: Requires external LLM provider (OpenAI, Anthropic, etc.)
- Network Required: Cloud-based LLM providers need internet connectivity
- Context Window: Large infrastructures may exceed LLM context limits
- Cost: API calls incur per-token charges
- Latency: Natural language processing adds response latency (2-5 seconds)
Configuration Files
Key files for AI configuration:
| File | Purpose |
|---|---|
| .typedialog/ai.db | AI SurrealDB database (typedialog-ai) |
| .typedialog/agent-*.yaml | AI agent definitions (typedialog-ag) |
| ~/.config/provisioning/ai.yaml | User AI settings |
| provisioning/core/versions.ncl | TypeDialog versions |
| core/nulib/lib_provisioning/ai/ | Core AI libraries |
| platform/crates/ai-service/ | AI service crate |
Performance
Typical Latencies
| Operation | Latency |
|---|---|
| Simple request parsing | 100-200ms |
| LLM inference | 2-5 seconds |
| Configuration generation | 500ms-1s |
| Anomaly detection | 50-100ms |
Scalability
- Concurrent requests: 100+ (load balanced)
- Query processing: 10,000+ queries/second
- RAG similarity search: <50ms for 1M documents
- Anomaly detection: Real-time on 1000+ metrics
Security
API Keys
- Stored encrypted in vault-service
- Never logged or persisted in plain text
- Rotated automatically (configurable)
- Audit trail for all API usage
Data Privacy
- Natural language queries not stored by default
- LLM provider agreements (OpenAI terms, etc.)
- Local-only RAG option available
- GDPR compliance support
Related Documentation
- Features Overview - AI feature list
- MCP Server - LLM integration
- Security System - API key management
- Operations Guide - AI service management
AI Architecture
Complete system architecture of Provisioning’s AI capabilities, from user interface through infrastructure generation.
System Overview
┌──────────────────────────────────────────────────┐
│ User Interface Layer │
│ • CLI (natural language) │
│ • TypeDialog AI forms │
│ • Interactive wizards │
│ • Web dashboard │
└────────────────────┬─────────────────────────────┘
↓
┌──────────────────────────────────────────────────┐
│ Request Processing Layer │
│ • Intent recognition │
│ • Entity extraction │
│ • Context parsing │
│ • Request validation │
└────────────────────┬─────────────────────────────┘
↓
┌──────────────────────────────────────────────────┐
│ Knowledge & Retrieval Layer (RAG) │
│ • Document embedding │
│ • Vector similarity search │
│ • Keyword matching (BM25) │
│ • Hybrid ranking │
└────────────────────┬─────────────────────────────┘
↓
┌──────────────────────────────────────────────────┐
│ LLM Integration Layer │
│ • MCP tool registration │
│ • Context augmentation │
│ • Prompt engineering │
│ • LLM API calls (OpenAI, Anthropic, etc.) │
└────────────────────┬─────────────────────────────┘
↓
┌──────────────────────────────────────────────────┐
│ Configuration Generation Layer │
│ • Nickel code generation │
│ • Schema validation │
│ • Constraint checking │
│ • Cost estimation │
└────────────────────┬─────────────────────────────┘
↓
┌──────────────────────────────────────────────────┐
│ Execution & Feedback Layer │
│ • DAG planning │
│ • Dry-run simulation │
│ • Deployment execution │
│ • Performance monitoring │
└──────────────────────────────────────────────────┘
Component Architecture
1. User Interface Layer
Entry Points:
Natural Language Input
├─ CLI: provisioning ai "create kubernetes cluster"
├─ Interactive: provisioning ai interactive
├─ Forms: TypeDialog AI-enhanced forms
└─ Web Dashboard: /ai/infrastructure-builder
Processing:
- Tokenization and normalization
- Command pattern matching
- Ambiguity resolution
- Confidence scoring
2. Intent Recognition
User Request
↓
Intent Classification
├─ Create infrastructure (60%)
├─ Modify configuration (25%)
├─ Query knowledge (10%)
└─ Troubleshoot issue (5%)
↓
Entity Extraction
├─ Resource type (server, database, cluster)
├─ Cloud provider (AWS, UpCloud, Hetzner)
├─ Count/Scale (3 nodes, 10GB)
├─ Requirements (HA, encrypted, monitoring)
└─ Constraints (budget, region, environment)
↓
Request Structure
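The entity-extraction step above can be illustrated with a small rule-based sketch. This is purely illustrative Python (the platform's extractor is more sophisticated, and these patterns are made up for the example):

```python
import re

def extract_entities(request: str) -> dict:
    """Tiny rule-based entity extractor: resource type, count, provider, flags."""
    entities: dict = {}
    # "3 web servers" -> count=3, resource_type=server
    m = re.search(r"\b(\d+)\s+(?:\w+\s+)?(server|node|database|cluster)s?\b", request, re.I)
    if m:
        entities["count"] = int(m.group(1))
        entities["resource_type"] = m.group(2).lower()
    for provider in ("aws", "upcloud", "hetzner"):
        if re.search(rf"\b{provider}\b", request, re.I):
            entities.setdefault("providers", []).append(provider)
    if re.search(r"\bload.?balancer\b", request, re.I):
        entities["load_balancer"] = True
    return entities

entities = extract_entities("Create 3 web servers with load balancer on UpCloud")
```

In practice this stage would combine such patterns with an ML classifier and confidence scoring, as described above.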
3. RAG Knowledge Retrieval
Embedding Process:
Query: "Create 3 web servers with load balancer"
↓
Embed Query → Vector [0.234, 0.567, 0.891, ...]
↓
Search Relevant Documents
├─ Vector similarity (semantic)
├─ BM25 keyword matching (syntactic)
└─ Hybrid ranking
↓
Top Results:
1. "Web Server HA Patterns" (0.94 similarity)
2. "Load Balancing Best Practices" (0.87)
3. "Auto-Scaling Configuration" (0.76)
↓
Extract Context & Augment Prompt
Knowledge Organization:
knowledge/
├── infrastructure/ (450 docs)
│ ├── kubernetes/
│ ├── databases/
│ ├── networking/
│ └── web-services/
├── best-practices/ (300 docs)
│ ├── high-availability/
│ ├── disaster-recovery/
│ └── performance/
├── providers/ (250 docs)
│ ├── aws/
│ ├── upcloud/
│ └── hetzner/
└── security/ (200 docs)
├── encryption/
├── authentication/
└── compliance/
4. LLM Integration (MCP)
Tool Registration:
LLM (GPT-4, Claude 3)
↓
MCP Server (provisioning-mcp)
↓
Available Tools:
├─ create_infrastructure
├─ analyze_configuration
├─ generate_policies
├─ estimate_costs
├─ check_compatibility
├─ validate_nickel
├─ query_knowledge_base
└─ get_recommendations
↓
Tool Execution
Prompt Engineering Pipeline:
Base Prompt Template
↓
Add Context (RAG results)
↓
Add Constraints
├─ Budget limit
├─ Region restrictions
├─ Compliance requirements
└─ Performance targets
↓
Add Examples
├─ Successful deployments
├─ Error patterns
└─ Best practices
↓
Enhanced Prompt
↓
LLM Inference
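The pipeline above amounts to concatenating a base template with retrieved context, constraints, and examples before inference. A minimal Python sketch of that assembly (section labels and ordering are illustrative assumptions, not the platform's actual prompt format):

```python
def build_prompt(base: str, context: list[str], constraints: dict, examples: list[str]) -> str:
    """Assemble an enhanced prompt in the order: base, RAG context, constraints, examples."""
    parts = [base]
    if context:
        parts.append("Context:\n" + "\n".join(f"- {c}" for c in context))
    if constraints:
        parts.append("Constraints:\n" + "\n".join(f"- {k}: {v}" for k, v in constraints.items()))
    if examples:
        parts.append("Examples:\n" + "\n".join(f"- {e}" for e in examples))
    return "\n\n".join(parts)

prompt = build_prompt(
    "Generate Nickel config for 3 web servers with load balancer.",
    ["Web Server HA Patterns"],
    {"budget": "$500/month", "region": "eu-west"},
    ["Successful 3-node web tier deployment"],
)
```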
5. Configuration Generation
Nickel Code Generation:
LLM Output (structured)
↓
Nickel Template Filling
├─ Server definitions
├─ Network configuration
├─ Storage setup
└─ Monitoring config
↓
Generated Nickel File
↓
Syntax Validation
↓
Schema Validation (Type Checking)
↓
Constraint Verification
├─ Resource limits
├─ Budget constraints
├─ Compliance policies
└─ Provider capabilities
↓
Cost Estimation
↓
Final Configuration
6. Execution & Feedback
Deployment Planning:
Configuration
↓
DAG Generation (Directed Acyclic Graph)
├─ Task decomposition
├─ Dependency analysis
├─ Parallelization
└─ Scheduling
↓
Dry-Run Simulation
├─ Check resources available
├─ Validate API access
├─ Estimate time
└─ Identify risks
↓
Execution with Checkpoints
├─ Create resources
├─ Monitor progress
├─ Collect metrics
└─ Save checkpoints
↓
Post-Deployment
├─ Verify functionality
├─ Run health checks
├─ Collect performance data
└─ Store feedback for future improvements
Data Flow Examples
Example 1: Simple Request
User: "Create 3 web servers with load balancer"
↓
Intent: Create Infrastructure
Entities: type=server, count=3, load_balancer=true
↓
RAG Retrieval: "Web Server Patterns", "Load Balancing"
↓
LLM Prompt:
"Generate Nickel config for 3 web servers with load balancer.
Context: [web server best practices from knowledge base]
Constraints: High availability, auto-scaling enabled"
↓
Generated Nickel:
{
servers = [
{name = "web-01", cpu = 4, memory = 8},
{name = "web-02", cpu = 4, memory = 8},
{name = "web-03", cpu = 4, memory = 8}
]
load_balancer = {
type = "application"
health_check = "/health"
}
}
↓
Configuration Generated & Validated ✓
↓
User Approval
↓
Deployment
Example 2: Complex Multi-Cloud Request
User: "Deploy Kubernetes to AWS, UpCloud, and Hetzner with replication"
↓
Intent: Multi-Cloud Infrastructure
Entities: type=kubernetes, providers=[aws, upcloud, hetzner], replicas=3
↓
RAG Retrieval:
- "Multi-Cloud Kubernetes Patterns"
- "Inter-Region Replication"
- "AWS Kubernetes Setup"
- "UpCloud Kubernetes Setup"
- "Hetzner Kubernetes Setup"
↓
LLM Processes:
1. Analyze multi-cloud topology
2. Identify networking requirements
3. Plan data replication strategy
4. Consider regional compliance
↓
Generated Nickel:
- Infrastructure definitions for each provider
- Inter-region networking configuration
- Replication topology
- Failover policies
↓
Cost Breakdown:
AWS: $2,500/month
UpCloud: $1,800/month
Hetzner: $1,500/month
Total: $5,800/month
↓
Compliance Check: EU GDPR ✓, US HIPAA ✓
↓
Ready for Deployment
Key Technologies
LLM Providers
Supported external LLM providers:
| Provider | Models | Latency | Cost |
|---|---|---|---|
| OpenAI | GPT-4, GPT-3.5 | 2-3s | $0.05-0.15/1K tokens |
| Anthropic | Claude 3 Opus | 2-4s | $0.015-0.03/1K tokens |
| Local (Ollama) | Llama 2, Mistral | 5-10s | Free |
Vector Databases
- SurrealDB (default): Embedded vector database with HNSW indexing
- Pinecone: Cloud vector database (optional)
- Milvus: Open-source vector database (optional)
Embedding Models
- text-embedding-3-small (OpenAI): 1,536 dimensions
- text-embedding-3-large (OpenAI): 3,072 dimensions
- all-MiniLM-L6-v2 (local): 384 dimensions
Performance Characteristics
Latency Breakdown
For a typical infrastructure creation request:
| Stage | Latency | Details |
|---|---|---|
| Intent Recognition | 50-100ms | Local NLP |
| RAG Retrieval | 50-100ms | Vector search |
| LLM Inference | 2-5s | External API |
| Nickel Generation | 100-200ms | Template filling |
| Validation | 200-500ms | Type checking |
| Total | 2.5-6 seconds | End-to-end |
Concurrency
- Concurrent Requests: 100+ (with load balancing)
- RAG QPS: 50+ searches/second
- LLM Throughput: 10+ concurrent requests per API key
- Memory: 500MB-2GB (depends on cache size)
Security Architecture
Data Protection
User Input
↓
Input Sanitization
├─ Remove PII
├─ Validate constraints
└─ Check permissions
↓
Processing (encrypted in transit)
├─ TLS 1.3 to LLM provider
├─ Secrets stored in vault-service
└─ Credentials never logged
↓
Generated Configuration
├─ Encrypted at rest (AES-256)
├─ Signed for integrity
└─ Audit trail maintained
↓
Output
Access Control
- API key validation
- RBAC permission checking
- Rate limiting per user/key
- Audit logging of all operations
Extensibility
Custom Tools
Register custom tools with MCP:
// Custom tool example
register_tool("custom-validator", |config| {
    validate_custom_requirements(&config)
});
Custom RAG Documents
Add domain-specific knowledge:
provisioning ai knowledge import \
--source ./custom-docs \
--category infrastructure
Fine-tuning (Future)
- Support for fine-tuned LLM models
- Custom prompt templates
- Organization-specific knowledge bases
Related Documentation
- AI Overview - Quick start
- AI Service Crate - Microservice implementation
- RAG & Knowledge - Knowledge retrieval
- TypeDialog Integration - Form integration
- Natural Language Infrastructure - Usage guide
TypeDialog AI & AG Integration
TypeDialog provides two AI-powered tools for Provisioning: typedialog-ai (configuration assistant) and typedialog-ag (agent automation).
TypeDialog Components
typedialog-ai v0.1.0
AI Assistant - HTTP server backend for intelligent form suggestions and infrastructure recommendations.
Purpose: Enhance interactive forms with AI-powered suggestions and natural language parsing.
Architecture:
TypeDialog Form
↓
typedialog-ai HTTP Server
↓
SurrealDB Backend
↓
LLM Provider (OpenAI, Anthropic, etc.)
↓
Suggestions → Deployed Config
Key Features:
- Form Intelligence: Context-aware field suggestions
- Database Recommendations: Suggest database type/configuration based on workload
- Network Optimization: Generate optimal network topology
- Security Policies: AI-generated Cedar policies
- Cost Estimation: Predict infrastructure costs
Installation:
# Via provisioning script
provisioning install ai-tools
# Manual installation
wget https://github.com/typedialog/typedialog-ai/releases/download/v0.1.0/typedialog-ai-<os>-<arch>
chmod +x typedialog-ai
mv typedialog-ai ~/.local/bin/
Usage:
# Start AI server
typedialog ai serve --db-path ~/.typedialog/ai.db --port 9000
# Test connection
curl http://localhost:9000/health
# Get suggestion for database
curl -X POST http://localhost:9000/suggest/database \
-H "Content-Type: application/json" \
-d '{"workload": "transactional", "size": "1TB", "replicas": 3}'
# Response:
# {"suggestion": "PostgreSQL 15 with pgvector", "confidence": 0.92}
Configuration:
# ~/.typedialog/ai-config.yaml
typedialog-ai:
port: 9000
db_path: ~/.typedialog/ai.db
loglevel: info
llm:
provider: openai # or: anthropic, local
model: gpt-4
api_key: ${OPENAI_API_KEY}
temperature: 0.7
features:
form_suggestions: true
database_recommendations: true
network_optimization: true
security_policy_generation: true
cost_estimation: true
cache:
enabled: true
ttl: 3600
Database Schema:
-- SurrealDB schema for AI suggestions
DEFINE TABLE ai_suggestions SCHEMAFULL;
DEFINE FIELD timestamp ON ai_suggestions TYPE datetime DEFAULT now();
DEFINE FIELD context ON ai_suggestions TYPE object;
DEFINE FIELD suggestion ON ai_suggestions TYPE string;
DEFINE FIELD confidence ON ai_suggestions TYPE float;
DEFINE FIELD accepted ON ai_suggestions TYPE bool;
DEFINE TABLE ai_models SCHEMAFULL;
DEFINE FIELD name ON ai_models TYPE string;
DEFINE FIELD version ON ai_models TYPE string;
DEFINE FIELD provider ON ai_models TYPE string;
Endpoints:
| Endpoint | Method | Purpose |
|---|---|---|
| /health | GET | Health check |
| /suggest/database | POST | Database recommendations |
| /suggest/network | POST | Network topology |
| /suggest/security | POST | Security policies |
| /estimate/cost | POST | Cost estimation |
| /parse/natural-language | POST | Parse natural language |
| /feedback | POST | Store suggestion feedback |
typedialog-ag v0.1.0
AI Agents - Type-safe agents for automation workflows and Nickel transpilation.
Purpose: Define complex automation workflows using type-safe agent descriptions, then transpile to executable Nickel.
Architecture:
Agent Definition (.agent.yaml)
↓
typedialog-ag Type Checker
↓
Agent Execution Plan
↓
Nickel Transpilation
↓
Provisioning Execution
Key Features:
- Type-Safe Agents: Strongly-typed agent definitions
- Workflow Automation: Chain multiple infrastructure tasks
- Nickel Transpilation: Generate Nickel IaC automatically
- Agent Orchestration: Parallel and sequential execution
- Rollback Support: Automatic rollback on failure
Installation:
# Via provisioning script
provisioning install ai-tools
# Manual installation
wget https://github.com/typedialog/typedialog-ag/releases/download/v0.1.0/typedialog-ag-<os>-<arch>
chmod +x typedialog-ag
mv typedialog-ag ~/.local/bin/
Agent Definition Syntax:
# provisioning/workflows/deploy-k8s.agent.yaml
version: "1.0"
agent: deploy-k8s
description: "Deploy HA Kubernetes cluster with observability stack"
types:
CloudProvider:
enum: ["aws", "upcloud", "hetzner"]
NodeConfig:
cpu: int # 2..64
memory: int # 4..256 (GB)
disk: int # 10..1000 (GB)
input:
provider: CloudProvider
name: string # cluster name
nodes: int # 3..100
node_config: NodeConfig
enable_monitoring: bool = true
enable_backup: bool = true
workflow:
- name: validate
task: validate_cluster_config
args:
provider: $input.provider
nodes: $input.nodes
node_config: $input.node_config
- name: create_network
task: create_vpc
depends_on: [validate]
args:
provider: $input.provider
cidr: "10.0.0.0/16"
- name: create_nodes
task: create_nodes
depends_on: [create_network]
parallel: true
args:
provider: $input.provider
count: $input.nodes
config: $input.node_config
- name: install_kubernetes
task: install_kubernetes
depends_on: [create_nodes]
args:
nodes: $create_nodes.output.node_ids
version: "1.28.0"
- name: add_monitoring
task: deploy_observability_stack
depends_on: [install_kubernetes]
when: $input.enable_monitoring
args:
cluster_name: $input.name
storage_class: "ebs"
- name: setup_backup
task: configure_backup
depends_on: [install_kubernetes]
when: $input.enable_backup
args:
cluster_name: $input.name
backup_interval: "daily"
output:
cluster_name: string
cluster_id: string
kubeconfig_path: string
monitoring_url: string
Usage:
# Type-check agent
typedialog ag check deploy-k8s.agent.yaml
# Run agent interactively
typedialog ag run deploy-k8s.agent.yaml \
--provider upcloud \
--name production-k8s \
--nodes 5 \
--node-config '{"cpu": 8, "memory": 32, "disk": 100}'
# Transpile to Nickel
typedialog ag transpile deploy-k8s.agent.yaml > deploy-k8s.ncl
# Execute generated Nickel
provisioning apply deploy-k8s.ncl
Generated Nickel Output (example):
{
metadata = {
agent = "deploy-k8s"
version = "1.0"
generated_at = "2026-01-16T01:47:00Z"
}
resources = {
network = {
provider = "upcloud"
vpc = { cidr = "10.0.0.0/16" }
}
compute = {
provider = "upcloud"
nodes = [
{ count = 5, cpu = 8, memory = 32, disk = 100 }
]
}
kubernetes = {
version = "1.28.0"
high_availability = true
monitoring = {
enabled = true
stack = "prometheus-grafana"
}
backup = {
enabled = true
interval = "daily"
}
}
}
}
Agent Features:
| Feature | Purpose |
|---|---|
| Dependencies | Declare task ordering (depends_on) |
| Parallelism | Run independent tasks in parallel |
| Conditionals | Execute tasks based on input conditions |
| Type Safety | Strong typing on inputs and outputs |
| Rollback | Automatic rollback on failure |
| Logging | Full execution trace for debugging |
Integration with Provisioning
Using typedialog-ai in Forms
# .typedialog/provisioning/form.toml
[[elements]]
name = "database_type"
prompt = "form-database_type-prompt"
type = "select"
options = ["postgres", "mysql", "mongodb"]
# Enable AI suggestions
[elements.ai_suggestions]
enabled = true
context = "workload"
provider = "typedialog-ai"
endpoint = "http://localhost:9000/suggest/database"
Using typedialog-ag in Workflows
# Define agent-based workflow
provisioning workflow define \
--agent deploy-k8s.agent.yaml \
--name k8s-deployment \
--auto-execute
# Run workflow
provisioning workflow run k8s-deployment \
--provider upcloud \
--nodes 5
Performance
typedialog-ai
- Suggestion latency: 500ms-2s per suggestion
- Database queries: <100ms (cached)
- Concurrent users: 50+
- SurrealDB storage: <1GB for 10K suggestions
typedialog-ag
- Type checking: <100ms per agent
- Transpilation: <500ms to Nickel
- Parallel task execution: O(1) overhead
- Agent memory: <50MB per agent
Configuration
Enable AI in Provisioning
# provisioning/config/config.defaults.toml
[ai]
enabled = true
typedialog_ai = true
typedialog_ag = true
[ai.typedialog]
ai_server_url = "http://localhost:9000"
ag_executable = "typedialog-ag"
[ai.form_suggestions]
enabled = true
providers = ["database", "network", "security"]
confidence_threshold = 0.75
Related Documentation
- AI Architecture - System design
- Natural Language Infrastructure - LLM usage
- AI Service Crate - Core microservice
AI Service Crate
The AI Service crate (provisioning/platform/crates/ai-service/) is the central AI processing
microservice for Provisioning. It coordinates LLM integration, knowledge retrieval, and
infrastructure recommendation generation.
Architecture
Core Modules
The AI Service is organized into specialized modules:
| Module | Purpose |
|---|---|
| config.rs | Configuration management and AI service settings |
| service.rs | Main service logic and request handling |
| mcp.rs | Model Context Protocol integration for LLM tools |
| knowledge.rs | Knowledge base management and retrieval |
| dag.rs | Directed Acyclic Graph for workflow orchestration |
| handlers.rs | HTTP endpoint handlers |
| tool_integration.rs | Tool registration and execution |
Request Flow
User Request (natural language)
↓
Handlers (HTTP endpoint)
↓
Intent Recognition (config.rs)
↓
Knowledge Retrieval (knowledge.rs)
↓
MCP Tool Selection (mcp.rs)
↓
LLM Processing (external provider)
↓
DAG Execution Planning (dag.rs)
↓
Infrastructure Generation
↓
Response to User
Configuration
Environment Variables
# LLM Configuration
export PROVISIONING_AI_PROVIDER=openai
export PROVISIONING_AI_MODEL=gpt-4
export PROVISIONING_AI_API_KEY=sk-...
# Service Configuration
export PROVISIONING_AI_PORT=9091
export PROVISIONING_AI_LOG_LEVEL=info
export PROVISIONING_AI_TIMEOUT=30
# Knowledge Base
export PROVISIONING_AI_KNOWLEDGE_PATH=~/.provisioning/knowledge
export PROVISIONING_AI_CACHE_TTL=3600
# RAG Configuration
export PROVISIONING_AI_RAG_ENABLED=true
export PROVISIONING_AI_RAG_SIMILARITY_THRESHOLD=0.75
Configuration File
# provisioning/config/ai-service.toml
[ai_service]
port = 9091
timeout = 30
max_concurrent_requests = 100
[llm]
provider = "openai" # openai, anthropic, local
model = "gpt-4"
api_key = "${PROVISIONING_AI_API_KEY}"
temperature = 0.7
max_tokens = 2000
[knowledge]
enabled = true
path = "~/.provisioning/knowledge"
cache_ttl = 3600
update_interval = 3600
[rag]
enabled = true
similarity_threshold = 0.75
max_results = 5
embedding_model = "text-embedding-3-small"
[dag]
max_parallel_tasks = 10
timeout_per_task = 60
enable_rollback = true
[security]
validate_inputs = true
rate_limit = 1000 # requests/minute
audit_logging = true
HTTP API
Endpoints
Create Infrastructure Request
POST /v1/infrastructure/create
Content-Type: application/json
{
"request": "Create 3 web servers with load balancing",
"context": {
"workspace": "production",
"provider": "upcloud",
"environment": "prod"
},
"options": {
"auto_apply": false,
"return_nickel": true,
"validate": true
}
}
Response:
{
"request_id": "req-12345",
"status": "success",
"infrastructure": {
"servers": [
{"name": "web-01", "cpu": 4, "memory": 8},
{"name": "web-02", "cpu": 4, "memory": 8},
{"name": "web-03", "cpu": 4, "memory": 8}
],
"load_balancer": {"name": "lb-01", "type": "round-robin"}
},
"nickel_config": "{ servers = [...] }",
"confidence": 0.92,
"notes": ["All servers in same availability zone", "Load balancer configured for health checks"]
}
Analyze Configuration
POST /v1/configuration/analyze
Content-Type: application/json
{
"configuration": "{ name = \"server-01\", cpu = 2, memory = 4 }",
"context": {"provider": "upcloud", "environment": "prod"}
}
Response:
{
"analysis": {
"resources": {
"cpu_score": "low",
"memory_score": "minimal",
"recommendation": "Increase to cpu=4, memory=8 for production"
},
"security": {
"findings": ["No backup configured", "No monitoring"],
"recommendations": ["Enable automated backups", "Deploy monitoring agent"]
},
"cost": {
"estimated_monthly": "$45",
"optimization_potential": "20% cost reduction possible"
}
}
}
Generate Policies
POST /v1/policies/generate
Content-Type: application/json
{
"requirements": "Allow developers to create servers but not delete, admins full access",
"format": "cedar"
}
Response:
{
"policies": [
{
"effect": "permit",
"principal": {"role": "developer"},
"action": "CreateServer",
"resource": "Server::*"
},
{
"effect": "permit",
"principal": {"role": "admin"},
"action": ["CreateServer", "DeleteServer", "ModifyServer"],
"resource": "Server::*"
}
],
"format": "cedar",
"validation": "valid"
}
Get Suggestions
GET /v1/suggestions?context=database&workload=transactional&scale=large
Response:
{
"suggestions": [
{
"type": "database",
"recommendation": "PostgreSQL 15 with pgvector",
"rationale": "Optimal for transactional workload with vector support",
"confidence": 0.95,
"config": {
"engine": "postgres",
"version": "15",
"extensions": ["pgvector"],
"replicas": 3,
"backup": "daily"
}
}
]
}
Get Health Status
GET /v1/health
Response:
{
"status": "healthy",
"version": "0.1.0",
"llm": {
"provider": "openai",
"model": "gpt-4",
"available": true
},
"knowledge": {
"documents": 1250,
"last_update": "2026-01-16T01:00:00Z"
},
"rag": {
"enabled": true,
"embeddings": 1250,
"search_latency_ms": 45
},
"uptime_seconds": 86400
}
MCP Tool Integration
Available Tools
The AI Service registers tools with the MCP server for LLM access:
// Tools available to LLM
tools = [
"create_infrastructure",
"analyze_configuration",
"generate_policies",
"get_recommendations",
"query_knowledge_base",
"estimate_costs",
"check_compatibility",
"validate_nickel"
]
Tool Definitions
{
"name": "create_infrastructure",
"description": "Create infrastructure from natural language description",
"parameters": {
"type": "object",
"properties": {
"request": {"type": "string"},
"provider": {"type": "string"},
"context": {"type": "object"}
},
"required": ["request"]
}
}
Knowledge Base
Structure
knowledge/
├── infrastructure/ # Infrastructure patterns
│ ├── kubernetes/
│ ├── databases/
│ ├── networking/
│ └── security/
├── patterns/ # Design patterns
│ ├── high-availability/
│ ├── disaster-recovery/
│ └── performance/
├── providers/ # Provider-specific docs
│ ├── aws/
│ ├── upcloud/
│ └── hetzner/
└── best-practices/ # Best practices
├── security/
├── operations/
└── cost-optimization/
Updating Knowledge
# Add new knowledge document
curl -X POST http://localhost:9091/v1/knowledge/add \
-H "Content-Type: application/json" \
-d '{
"category": "kubernetes",
"title": "HA Kubernetes Setup",
"content": "..."
}'
# Update embeddings
curl -X POST http://localhost:9091/v1/knowledge/reindex
# Get knowledge status
curl http://localhost:9091/v1/knowledge/status
DAG Execution
Workflow Planning
The AI Service uses DAGs to plan complex infrastructure deployments:
Validate Config
├→ Create Network
│ └→ Create Nodes
│ └→ Install Kubernetes
│ ├→ Add Monitoring (optional)
│ └→ Setup Backup (optional)
│
└→ Verify Compatibility
└→ Estimate Costs
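Wave-based scheduling over such a DAG can be sketched with Kahn's algorithm: every task whose dependencies have completed runs in parallel as one wave. This is an illustrative Python sketch, not the service's internal planner; task names mirror the diagram above:

```python
def plan_waves(tasks: dict[str, list[str]]) -> list[list[str]]:
    """Group DAG tasks into waves; each wave can execute in parallel."""
    remaining = {name: set(deps) for name, deps in tasks.items()}
    waves = []
    while remaining:
        ready = sorted(t for t, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("cycle detected in DAG")
        waves.append(ready)
        for t in ready:
            del remaining[t]
        for deps in remaining.values():
            deps.difference_update(ready)
    return waves

waves = plan_waves({
    "validate": [],
    "create_network": ["validate"],
    "verify_compat": ["validate"],
    "create_nodes": ["create_network"],
    "estimate_costs": ["verify_compat"],
    "install_kubernetes": ["create_nodes"],
})
# waves: validate first, then network/compat checks in parallel, and so on
```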
Task Execution
# Execute DAG workflow
curl -X POST [http://localhost:9091/v1/workflow/execute](http://localhost:9091/v1/workflow/execute) \
-H "Content-Type: application/json" \
-d '{
"dag": {
"tasks": [
{"name": "validate", "action": "validate_config"},
{"name": "network", "action": "create_network", "depends_on": ["validate"]},
{"name": "nodes", "action": "create_nodes", "depends_on": ["network"]}
]
}
}'
Performance Characteristics
Latency
| Operation | Latency |
|---|---|
| Intent recognition | 50-100ms |
| Knowledge retrieval | 100-200ms |
| LLM inference | 2-5 seconds |
| Nickel generation | 500ms-1s |
| DAG planning | 100-500ms |
| Policy generation | 1-2 seconds |
Throughput
- Concurrent requests: 100+
- QPS: 50+ requests/second
- Knowledge search: <50ms for 1000+ documents
Resource Usage
- Memory: 500MB-2GB (with cache)
- CPU: 1-4 cores
- Storage: 10GB-50GB (knowledge base)
- Network: 10Mbps-100Mbps (LLM requests)
Monitoring & Observability
Metrics
# Prometheus metrics exposed at /metrics
provisioning_ai_requests_total{endpoint="/v1/infrastructure/create"}
provisioning_ai_request_duration_seconds{endpoint="/v1/infrastructure/create"}
provisioning_ai_llm_tokens{provider="openai", model="gpt-4"}
provisioning_ai_knowledge_documents_total
provisioning_ai_cache_hit_ratio
Logging
# View AI Service logs
provisioning logs service ai-service --tail 100
# Debug mode
PROVISIONING_AI_LOG_LEVEL=debug provisioning service start ai-service
Troubleshooting
LLM Connection Issues
# Test LLM connection
curl http://localhost:9091/v1/health
# Check configuration
provisioning config get ai.llm
# View logs
provisioning logs service ai-service --filter "llm|openai"
Slow Knowledge Retrieval
# Check knowledge base status
curl http://localhost:9091/v1/knowledge/status
# Reindex embeddings
curl -X POST http://localhost:9091/v1/knowledge/reindex
# Monitor RAG performance
curl http://localhost:9091/v1/rag/benchmark
Related Documentation
- AI Architecture - System design
- RAG & Knowledge - Knowledge retrieval
- MCP Server - Model Context Protocol
- Orchestrator - Workflow execution
RAG & Knowledge Base
The RAG (Retrieval Augmented Generation) system enhances AI-generated infrastructure with domain-specific knowledge. It retrieves relevant documentation, best practices, and patterns to inform infrastructure recommendations.
Architecture
Components
User Query
↓
Query Embedder (text-embedding-3-small)
↓
Vector Similarity Search (SurrealDB)
↓
Knowledge Retrieval (semantic matching)
↓
Context Augmentation
↓
LLM Processing (with knowledge context)
↓
Infrastructure Recommendation
Knowledge Flow
Documentation Input
↓
Document Chunking (512 tokens)
↓
Semantic Embedding
↓
Vector Storage (SurrealDB)
↓
Similarity Indexing
↓
Query Time Retrieval
Knowledge Base Organization
Document Categories
| Category | Purpose | Examples |
|---|---|---|
| Infrastructure | IaC patterns and templates | Kubernetes, databases, networking |
| Best Practices | Operational guidelines | HA patterns, disaster recovery |
| Provider Guides | Cloud provider documentation | AWS, UpCloud, Hetzner specifics |
| Performance | Optimization guidelines | Resource sizing, caching strategies |
| Security | Security hardening guides | Encryption, authentication, compliance |
| Troubleshooting | Common issues and solutions | Performance, deployment, debugging |
Document Structure
id: "doc-k8s-ha-001"
category: "infrastructure"
subcategory: "kubernetes"
title: "High Availability Kubernetes Cluster Setup"
tags: ["kubernetes", "high-availability", "production"]
created: "2026-01-10T00:00:00Z"
updated: "2026-01-16T00:00:00Z"
content: |
# High Availability Kubernetes Cluster
For production Kubernetes deployments, ensure:
- Minimum 3 control planes
- Distributed across availability zones
- etcd with persistent storage
- CNI plugin with network policies
embedding: [0.123, 0.456]
metadata:
provider: ["aws", "upcloud", "hetzner"]
environment: ["production"]
cost_profile: "medium"
RAG Retrieval Process
Similarity Search
When processing a user query, the system:
1. Embed Query: Convert natural language to vector
2. Search Index: Find similar documents (cosine similarity > threshold)
3. Rank Results: Score by relevance
4. Extract Context: Select top N chunks
5. Augment Prompt: Add context to LLM request
Example:
User Query: "Create a Kubernetes cluster in AWS with auto-scaling"
Vector Embedding: [0.234, 0.567, 0.891]
Top Matches:
1. "HA Kubernetes Setup" (similarity: 0.94)
2. "AWS Auto-Scaling Patterns" (similarity: 0.87)
3. "Kubernetes Security Hardening" (similarity: 0.76)
Retrieved Context:
- Minimum 3 control planes for HA
- Use AWS ASGs with cluster autoscaler
- Enable Pod Disruption Budgets
- Configure network policies
LLM Prompt with Context:
"Create a Kubernetes cluster with the following context:
[...retrieved knowledge...]
User request: Create a Kubernetes cluster in AWS with auto-scaling"
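The similarity ranking behind step 2 is plain cosine similarity between the query vector and document vectors. A self-contained Python sketch using toy 3-dimensional vectors (real embeddings have 384-3,072 dimensions; the documents and vectors below are made up for illustration):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product over the product of magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy document embeddings (illustrative values only)
docs = {
    "HA Kubernetes Setup": [0.9, 0.1, 0.3],
    "AWS Auto-Scaling Patterns": [0.7, 0.5, 0.2],
    "PostgreSQL Tuning": [0.1, 0.9, 0.8],
}
query = [0.85, 0.2, 0.3]  # embedding of the Kubernetes query

ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
```

At production scale this exhaustive comparison is replaced by an approximate HNSW index (see the vector store configuration below), which keeps search sub-50ms.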
Configuration
[rag]
enabled = true
similarity_threshold = 0.75
max_results = 5
chunk_size = 512
chunk_overlap = 50
[embeddings]
model = "text-embedding-3-small"
provider = "openai"
cache_embeddings = true
[vector_store]
backend = "surrealdb"
index_type = "hnsw"
ef_construction = 400
ef_search = 200
[retrieval]
bm25_weight = 0.3
semantic_weight = 0.7
date_boost = 0.1
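The chunk_size and chunk_overlap settings imply a sliding window with stride chunk_size - chunk_overlap. A Python sketch over a toy token list (tokenization itself is out of scope here; this only shows the windowing):

```python
def chunk(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Split a token list into overlapping chunks with stride = size - overlap."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    stride = size - overlap
    # max(..., 1) ensures a document shorter than one chunk still yields one chunk
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), stride)]

tokens = [f"t{i}" for i in range(1200)]
chunks = chunk(tokens)  # 3 chunks; consecutive chunks share 50 tokens
```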
Managing Knowledge
Adding Documents
Via API:
curl -X POST http://localhost:9091/v1/knowledge/add \
-H "Content-Type: application/json" \
-d '{
"category": "infrastructure",
"title": "PostgreSQL HA Setup",
"content": "For production PostgreSQL: 3+ replicas, streaming replication",
"tags": ["database", "postgresql", "ha"],
"metadata": {
"provider": ["aws", "upcloud"],
"environment": ["production"]
}
}'
Batch Import:
# Import from markdown files
provisioning ai knowledge import \
--source ./docs/knowledge \
--category infrastructure \
--auto-tag
# Import from existing documentation
provisioning ai knowledge import \
--source provisioning/docs/src \
--recursive
Organizing Knowledge
# List knowledge documents
provisioning ai knowledge list --category infrastructure
# Search knowledge base
provisioning ai knowledge search "kubernetes high availability"
# View document
provisioning ai knowledge view doc-k8s-ha-001
# Update document
provisioning ai knowledge update doc-k8s-ha-001 \
--content "Updated content..." \
--tags "kubernetes,ha,production,v1.28"
# Delete document
provisioning ai knowledge delete doc-k8s-ha-001
Reindexing
# Reindex all documents
provisioning ai knowledge reindex --all
# Reindex specific category
provisioning ai knowledge reindex --category infrastructure
# Check indexing status
provisioning ai knowledge index-status
# Rebuild vector index
provisioning ai knowledge rebuild-vectors --model text-embedding-3-small
Knowledge Query API
Search Endpoint
POST /v1/knowledge/search
Content-Type: application/json
{
"query": "kubernetes cluster setup",
"category": "infrastructure",
"tags": ["kubernetes"],
"limit": 5,
"similarity_threshold": 0.75,
"metadata_filter": {
"provider": ["aws", "upcloud"],
"environment": ["production"]
}
}
Response:
{
"results": [
{
"id": "doc-k8s-ha-001",
"title": "High Availability Kubernetes Cluster",
"category": "infrastructure",
"similarity": 0.94,
"excerpt": "For production Kubernetes deployments, ensure minimum 3 control planes",
"tags": ["kubernetes", "ha", "production"],
"metadata": {
"provider": ["aws", "upcloud", "hetzner"],
"environment": ["production"]
}
}
],
"search_time_ms": 45,
"total_matches": 12
}
Knowledge Quality
Maintenance
# Check knowledge quality
provisioning ai knowledge quality-report
# Remove duplicate documents
provisioning ai knowledge deduplicate
# Fix broken references
provisioning ai knowledge validate-refs
# Update outdated docs
provisioning ai knowledge mark-outdated \
--category infrastructure \
--older-than 180d
Metrics
# Knowledge base statistics
curl http://localhost:9091/v1/knowledge/stats
Response:
{
"total_documents": 1250,
"total_chunks": 8432,
"categories": {
"infrastructure": 450,
"security": 200,
"best_practices": 300
},
"embedding_coverage": 0.98,
"indexed_chunks": 8256,
"vector_index_size_mb": 245,
"last_reindex": "2026-01-15T23:00:00Z"
}
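The `embedding_coverage` field is the ratio of indexed chunks to total chunks. Cross-checking it against the sample numbers above:

```python
# embedding_coverage = indexed_chunks / total_chunks,
# using the sample values from the stats response above.
total_chunks = 8432
indexed_chunks = 8256

embedding_coverage = round(indexed_chunks / total_chunks, 2)
unindexed = total_chunks - indexed_chunks  # chunks still awaiting embeddings
```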
Hybrid Search
RAG uses hybrid search combining semantic and keyword matching:
BM25 Score (Keyword Match): 0.7
Semantic Score (Vector Similarity): 0.92
Hybrid Score = (0.3 × 0.7) + (0.7 × 0.92)
= 0.21 + 0.644
= 0.854
Relevance: High ✓
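The weighted combination above can be expressed as a small function; the default weights mirror the `[hybrid_search]` configuration:

```python
# Hybrid relevance: weighted sum of BM25 (keyword) and
# vector-similarity (semantic) scores, per the worked example above.
def hybrid_score(bm25: float, semantic: float,
                 bm25_weight: float = 0.3, semantic_weight: float = 0.7) -> float:
    return bm25_weight * bm25 + semantic_weight * semantic

score = hybrid_score(0.7, 0.92)  # the worked example: 0.854
```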
Configuration
[hybrid_search]
bm25_weight = 0.3
semantic_weight = 0.7
Performance
Retrieval Latency
| Operation | Latency |
|---|---|
| Embed query (512 tokens) | 100-200ms |
| Vector similarity search | 20-50ms |
| BM25 keyword search | 10-30ms |
| Hybrid ranking | 5-10ms |
| Total retrieval (sum of stages) | 135-290ms |
Vector Index Size
- Documents: 1000 → 8GB storage
- Documents: 10000 → 80GB storage
- Search latency: Consistent <50ms regardless of size (with HNSW indexing)
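The storage figures above scale linearly at roughly 8 MB per document. A back-of-the-envelope estimator built on that assumption (actual size depends on chunking and embedding model):

```python
# Rough storage estimate from the linear scaling shown above
# (~8 MB per document; an assumption, not a guarantee).
MB_PER_DOCUMENT = 8

def estimated_storage_gb(documents: int) -> float:
    return documents * MB_PER_DOCUMENT / 1000  # GB, decimal units

small = estimated_storage_gb(1_000)    # matches the 8 GB figure
large = estimated_storage_gb(10_000)   # matches the 80 GB figure
```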
Security & Privacy
Access Control
# Restrict knowledge access
provisioning ai knowledge acl set doc-k8s-ha-001 \
--read "admin,developer" \
--write "admin"
# Audit knowledge access
provisioning ai knowledge audit --document doc-k8s-ha-001
Data Protection
- Sensitive Info: Automatically redacted from queries (API keys, passwords)
- Document Encryption: Optional at-rest encryption
- Query Logging: Audit trail for compliance
[security]
redact_patterns = ["password", "api_key", "secret"]
encrypt_documents = true
audit_queries = true
Related Documentation
- AI Architecture - System design
- AI Service Crate - Core microservice
- Natural Language Infrastructure - LLM usage
- MCP Server - Tool integration
Natural Language Infrastructure
Use natural language to describe infrastructure requirements and get automatically generated Nickel configurations and deployment plans.
Overview
Natural Language Infrastructure (NLI) allows requesting infrastructure changes in plain English:
# Instead of writing complex Nickel...
provisioning ai "Deploy a 3-node HA PostgreSQL cluster with automatic backups in AWS"
# Or interactively...
provisioning ai interactive
# Interactive mode guides you through requirements
How It Works
Request Processing Pipeline
User Natural Language Input
↓
Intent Recognition
├─ Extract resource type (server, database, cluster)
├─ Identify constraints (HA, region, size)
└─ Detect options (monitoring, backup, encryption)
↓
RAG Knowledge Retrieval
├─ Find similar deployments
├─ Retrieve best practices
└─ Get provider-specific guidance
↓
LLM Inference (GPT-4, Claude 3)
├─ Generate Nickel schema
├─ Calculate resource requirements
└─ Create deployment plan
↓
Configuration Validation
├─ Type checking via Nickel compiler
├─ Schema validation
└─ Constraint verification
↓
Infrastructure Deployment
├─ Dry-run simulation
├─ Cost estimation
└─ User confirmation
↓
Execution & Monitoring
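The intent-recognition stage at the top of the pipeline can be sketched as keyword extraction. This is a toy illustration only: the real stage uses an LLM, and every name below is hypothetical.

```python
# Toy intent recognizer for the first pipeline stage.
# The production stage uses an LLM; this keyword matcher only
# illustrates the shape of the extracted intent.
RESOURCE_KEYWORDS = {
    "database": ("postgresql", "mysql", "database"),
    "cluster": ("kubernetes", "cluster"),
    "server": ("server", "nginx"),
}
CONSTRAINT_KEYWORDS = {"ha": ("ha", "high availability"), "backup": ("backup",)}

def recognize_intent(request: str) -> dict:
    text = request.lower()
    resource = next((r for r, kws in RESOURCE_KEYWORDS.items()
                     if any(k in text for k in kws)), "unknown")
    constraints = [c for c, kws in CONSTRAINT_KEYWORDS.items()
                   if any(k in text for k in kws)]
    return {"resource": resource, "constraints": constraints}

intent = recognize_intent("Deploy a 3-node HA PostgreSQL cluster with automatic backups")
```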
Command Usage
Simple Requests
# Web servers with load balancing
provisioning ai "Create 3 web servers with load balancer"
# Database setup
provisioning ai "Deploy PostgreSQL with 2 replicas and daily backups"
# Kubernetes cluster
provisioning ai "Create production Kubernetes cluster with Prometheus monitoring"
Complex Requests
# Multi-cloud deployment
provisioning ai "
Deploy:
- 3 HA Kubernetes clusters (AWS, UpCloud, Hetzner)
- PostgreSQL 15 with synchronous replication
- Redis cluster for caching
- ELK stack for logging
- Prometheus for monitoring
Constraints:
- Cross-region high availability
- Encrypted inter-region communication
- Auto-scaling based on CPU (70%)
"
# Disaster recovery setup
provisioning ai "
Set up disaster recovery for production environment:
- Active-passive failover to secondary region
- Daily automated backups (30-day retention)
- Monthly DR tests with automated reports
- RTO: 4 hours, RPO: 1 hour
- Test failover every week
"
Interactive Mode
# Start interactive mode
provisioning ai interactive
# System asks clarifying questions:
# Q: What type of infrastructure? (server, database, cluster, other)
# Q: Which cloud provider? (aws, upcloud, hetzner, local)
# Q: Production or development?
# Q: High availability required?
# Q: Expected load? (small, medium, large, enterprise)
# Q: Monitoring and logging?
# Q: Backup strategy?
# Shows generated configuration for approval
Example: Web Application Deployment
Request
provisioning ai "
Deploy a production web application:
- Frontend: 3 nginx servers with auto-scaling
- API: 5 FastAPI instances behind load balancer
- Database: HA PostgreSQL with read replicas
- Cache: Redis cluster
- Monitoring: Prometheus and Grafana
- Logs: Elasticsearch + Kibana
Environment: AWS
Region: us-east-1 (primary), eu-west-1 (DR)
Budget: $5000/month max
"
Generated Configuration
The system generates:
1. Nickel Infrastructure Definition:
{
metadata = {
generated_by = "provisioning-ai"
timestamp = "2026-01-16T01:47:00Z"
confidence = 0.94
}
infrastructure = {
frontend = {
servers = 3
cpu = 4
memory = 8
type = "t3.large"
auto_scaling = {
min = 3
max = 10
target_cpu = 70
}
}
api = {
servers = 5
cpu = 8
memory = 16
type = "t3.xlarge"
load_balancer = {
type = "application"
health_check = "/health"
}
}
database = {
type = "postgresql"
version = "15"
replicas = 2
backup = {
enabled = true
frequency = "daily"
retention_days = 30
}
}
cache = {
type = "redis"
nodes = 3
cluster_mode = true
replication = true
}
monitoring = {
prometheus = true
grafana = true
retention_days = 30
}
logging = {
elasticsearch = true
kibana = true
retention_days = 30
}
}
provider = "aws"
region_primary = "us-east-1"
region_dr = "eu-west-1"
cost_estimate = {
monthly = "$4850"
breakdown = {
compute = "$2500"
database = "$1200"
cache = "$600"
monitoring = "$400"
networking = "$150"
}
}
}
2. Deployment Plan:
Deployment Plan: Web Application (Production)
Phase 1: Network & Storage (2-3 hours)
- Create VPCs in us-east-1 and eu-west-1
- Set up inter-region VPN
- Create EBS volumes for database
- Create EFS for shared storage
Phase 2: Compute Instances (4-5 hours)
- Launch 3 frontend servers
- Launch 5 API servers
- Create load balancers
- Set up auto-scaling groups
Phase 3: Databases (3-4 hours)
- Create PostgreSQL primary
- Create read replicas
- Configure replication
- Run initial backup
Phase 4: Cache & Services (2-3 hours)
- Create Redis cluster
- Deploy Prometheus
- Deploy Grafana
- Deploy Elasticsearch/Kibana
Phase 5: Configuration (2-3 hours)
- Configure health checks
- Set up monitoring alerts
- Configure log shipping
- Deploy TLS certificates
Total Estimated Time: 13-18 hours
3. Cost Breakdown:
Monthly Cost Estimate: $4,850
Compute $2,500 (EC2 instances)
Database $1,200 (RDS PostgreSQL)
Cache $600 (ElastiCache Redis)
Monitoring $400 (CloudWatch + Grafana)
Networking $150 (NAT Gateway, VPN)
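The line items above can be cross-checked against the monthly total and the budget stated in the request:

```python
# Cross-check the cost breakdown against the $4,850 monthly
# estimate and the $5,000/month budget from the request.
breakdown = {
    "compute": 2_500, "database": 1_200, "cache": 600,
    "monitoring": 400, "networking": 150,
}
monthly_total = sum(breakdown.values())
within_budget = monthly_total <= 5_000
```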
4. Risk Assessment:
Warnings:
- Estimate is near the budget limit: $4,850 of $5,000 maximum
- Cross-region networking latency: 80-100ms
- Database failover time: 1-2 minutes
Recommendations:
- Implement connection pooling in API
- Use read replicas for analytics queries
- Consider spot instances for non-critical services (30% cost savings)
Output Formats
Get Deployment Script
# Get Bash deployment script
provisioning ai "..." --output bash > deploy.sh
# Get Nushell script
provisioning ai "..." --output nushell > deploy.nu
# Get Terraform
provisioning ai "..." --output terraform > main.tf
# Get Nickel (default)
provisioning ai "..." --output nickel > infrastructure.ncl
Save for Later
# Save configuration for review
provisioning ai "..." --save deployment-plan --review
# Deploy from saved plan
provisioning apply deployment-plan
# Compare with current state
provisioning diff deployment-plan
Configuration
LLM Provider Selection
# Use OpenAI (default)
export PROVISIONING_AI_PROVIDER=openai
export PROVISIONING_AI_MODEL=gpt-4
# Use Anthropic
export PROVISIONING_AI_PROVIDER=anthropic
export PROVISIONING_AI_MODEL=claude-3-opus
# Use local model
export PROVISIONING_AI_PROVIDER=local
export PROVISIONING_AI_MODEL=llama2:70b
Response Options
# ~/.config/provisioning/ai.yaml
natural_language:
output_format: nickel # nickel, terraform, bash, nushell
include_cost_estimate: true
include_risk_assessment: true
include_deployment_plan: true
auto_review: false # Require approval before deploy
dry_run: true # Simulate before execution
confidence_threshold: 0.85 # Reject low-confidence results
style:
verbosity: detailed
include_alternatives: true
explain_reasoning: true
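The `confidence_threshold` option gates generated configurations before they are offered for review. A minimal sketch of that check, using the `confidence = 0.94` metadata from the generated Nickel earlier (field names are illustrative):

```python
# Gate generated plans on model confidence, as configured by
# confidence_threshold in ai.yaml. Field names are illustrative.
CONFIDENCE_THRESHOLD = 0.85

def accept_plan(metadata: dict) -> bool:
    return metadata.get("confidence", 0.0) >= CONFIDENCE_THRESHOLD

ok = accept_plan({"generated_by": "provisioning-ai", "confidence": 0.94})
rejected = accept_plan({"confidence": 0.60})
```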
Advanced Features
Conditional Infrastructure
provisioning ai "
Deploy web cluster:
- If environment is production: HA setup with 5 nodes
- If environment is staging: Standard setup with 2 nodes
- If environment is dev: Single node with development tools
"
Cost-Optimized Variants
# Generate cost-optimized alternative
provisioning ai "..." --optimize-for cost
# Generate performance-optimized alternative
provisioning ai "..." --optimize-for performance
# Generate high-availability alternative
provisioning ai "..." --optimize-for availability
Template-Based Generation
# Use existing templates as base
provisioning ai "..." --template kubernetes-ha
# List available templates
provisioning ai templates list
Safety & Validation
Review Before Deploy
# Generate and review (no auto-execute)
provisioning ai "..." --review
# Review generated Nickel
cat deployment-plan.ncl
# Validate configuration
provisioning validate deployment-plan.ncl
# Dry-run to see what changes
provisioning apply --dry-run deployment-plan.ncl
# Apply after approval
provisioning apply deployment-plan.ncl
Rollback Support
# Create deployment with automatic rollback
provisioning ai "..." --with-rollback
# Manual rollback if issues
provisioning workflow rollback --to-checkpoint
# View deployment history
provisioning history list --type infrastructure
Limitations
- Context Window: Very large infrastructure descriptions may exceed LLM limits
- Ambiguity: Unclear requirements may produce suboptimal configurations
- Provider Specifics: Some provider-specific features may require manual adjustment
- Cost: API calls incur per-token charges
- Latency: Processing takes 2-10 seconds depending on complexity
Related Documentation
- AI Architecture - System design
- AI Service Crate - Core microservice
- RAG & Knowledge - Knowledge retrieval
- TypeDialog Integration - AI-assisted forms
- Nickel Guide - Configuration syntax