From a4b3c023719c3fa9bf942e60ca5697c8dc7d0660 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Jesu=CC=81s=20Pe=CC=81rez?= Date: Wed, 14 Jan 2026 04:53:21 +0000 Subject: [PATCH] chore: fix docs after fences fix --- .typedialog/ci/README.md | 2 +- README.md | 2 +- config/README.md | 2 +- config/examples/README.md | 2 +- docs/README.md | 139 +- docs/src/PROVISIONING.md | 945 ++++++- docs/src/README.md | 386 ++- docs/src/SUMMARY.md | 270 +- docs/src/ai/README.md | 172 +- docs/src/ai/ai-agents.md | 533 +++- docs/src/ai/ai-assisted-forms.md | 439 +++- docs/src/ai/architecture.md | 195 +- docs/src/ai/config-generation.md | 65 +- docs/src/ai/configuration.md | 602 ++++- docs/src/ai/cost-management.md | 498 +++- docs/src/ai/mcp-integration.md | 595 ++++- docs/src/ai/natural-language-config.md | 470 +++- docs/src/ai/rag-system.md | 451 +++- docs/src/ai/security-policies.md | 538 +++- docs/src/ai/troubleshooting-with-ai.md | 503 +++- docs/src/api-reference/README.md | 29 +- docs/src/api-reference/extensions.md | 1206 ++++++++- .../src/api-reference/integration-examples.md | 1593 +++++++++++- docs/src/api-reference/nushell-api.md | 112 +- docs/src/api-reference/path-resolution.md | 731 +++++- docs/src/api-reference/provider-api.md | 187 +- docs/src/api-reference/rest-api.md | 1119 ++++++++- docs/src/api-reference/sdks.md | 1098 ++++++++- docs/src/api-reference/websocket.md | 893 ++++++- docs/src/architecture/README.md | 131 +- .../adr/ADR-001-project-structure.md | 119 +- .../adr/ADR-002-distribution-strategy.md | 180 +- .../adr/ADR-003-workspace-isolation.md | 192 +- .../adr/ADR-004-hybrid-architecture.md | 211 +- .../adr/ADR-005-extension-framework.md | 285 ++- .../ADR-006-provisioning-cli-refactoring.md | 391 ++- .../adr/ADR-007-kms-simplification.md | 267 +- .../adr/ADR-008-cedar-authorization.md | 353 ++- .../adr/ADR-009-security-system-complete.md | 662 ++++- docs/src/architecture/adr/README.md | 61 +- .../adr-010-configuration-format-strategy.md | 414 +++- .../adr/adr-011-nickel-migration.md | 480 +++- ...r-012-nushell-nickel-plugin-cli-wrapper.md | 380 ++- .../adr/adr-013-typdialog-integration.md | 593 ++++- .../adr/adr-014-secretumvault-integration.md | 660 ++++- .../adr-015-ai-integration-architecture.md | 1124 ++++++++- ...r-016-schema-driven-accessor-generation.md | 160 +- ...17-plugin-wrapper-abstraction-framework.md | 226 +- .../adr-018-help-system-fluent-integration.md | 281 ++- ...019-configuration-loader-modularization.md | 263 +- ...dr-020-command-handler-domain-splitting.md | 313 ++- .../src/architecture/architecture-overview.md | 1338 +++++++++- .../config-loading-architecture.md | 267 +- .../database-and-config-architecture.md | 386 ++- docs/src/architecture/design-principles.md | 423 +++- .../src/architecture/ecosystem-integration.md | 524 +++- docs/src/architecture/integration-patterns.md | 624 ++++- .../architecture/multi-repo-architecture.md | 711 +++++- docs/src/architecture/multi-repo-strategy.md | 1026 +++++++- .../nickel-executable-examples.md | 774 +++++- .../architecture/nickel-vs-kcl-comparison.md | 1208 ++++++++- .../orchestrator-auth-integration.md | 622 ++++- docs/src/architecture/orchestrator-info.md | 150 +- .../orchestrator-integration-model.md | 806 +++++- .../architecture/package-and-loader-system.md | 411 ++- docs/src/architecture/repo-dist-analysis.md | 1612 +++++++++++- docs/src/architecture/system-overview.md | 356 ++- .../typedialog-nickel-integration.md | 953 ++++++- docs/src/configuration/config-validation.md | 632 ++++- 
docs/src/development/auth-metadata-guide.md | 537 +++- docs/src/development/build-system.md | 1077 +++++++- docs/src/development/command-handler-guide.md | 615 ++++- docs/src/development/command-reference.md | 55 +- .../ctrl-c-implementation-notes.md | 296 ++- docs/src/development/dev-configuration.md | 985 +++++++- .../development/dev-workspace-management.md | 916 ++++++- docs/src/development/distribution-process.md | 1006 +++++++- docs/src/development/glossary.md | 1761 ++++++++++++- docs/src/development/implementation-guide.md | 898 ++++++- .../infrastructure-specific-extensions.md | 1231 ++++++++- docs/src/development/integration.md | 1220 ++++++++- docs/src/development/kms-simplification.md | 571 ++++- docs/src/development/mcp-server.md | 115 +- docs/src/development/project-structure.md | 412 +++- .../provider-agnostic-architecture.md | 349 ++- .../providers/provider-comparison.md | 401 ++- .../providers/provider-development-guide.md | 718 +++++- .../providers/provider-distribution-guide.md | 682 ++++- .../providers/quick-provider-guide.md | 323 ++- .../taskservs/taskserv-categorization.md | 71 +- .../taskservs/taskserv-quick-guide.md | 250 +- .../typedialog-platform-config-guide.md | 1007 +++++++- docs/src/development/workflow.md | 1066 +++++++- docs/src/getting-started/01-prerequisites.md | 252 +- docs/src/getting-started/02-installation.md | 236 +- .../getting-started/03-first-deployment.md | 274 +- docs/src/getting-started/04-verification.md | 343 ++- .../05-platform-configuration.md | 500 +++- docs/src/getting-started/getting-started.md | 552 ++++- .../src/getting-started/installation-guide.md | 537 +++- .../installation-validation-guide.md | 623 ++++- .../getting-started/quickstart-cheatsheet.md | 1108 ++++++++- docs/src/getting-started/quickstart.md | 30 +- docs/src/getting-started/setup-profiles.md | 833 ++++++- docs/src/getting-started/setup-quickstart.md | 179 +- .../src/getting-started/setup-system-guide.md | 207 +- docs/src/getting-started/setup.md | 664 ++++- docs/src/guides/README.md | 19 +- docs/src/guides/customize-infrastructure.md | 847 ++++++- .../extension-development-quickstart.md | 438 +++- docs/src/guides/from-scratch.md | 1151 ++++++++- docs/src/guides/guide-system.md | 154 +- docs/src/guides/infrastructure-setup.md | 363 ++- .../src/guides/internationalization-system.md | 414 +++- docs/src/guides/multi-provider-deployment.md | 1285 +++++++++- docs/src/guides/multi-provider-networking.md | 969 +++++++- docs/src/guides/provider-digitalocean.md | 785 +++++- docs/src/guides/provider-hetzner.md | 781 +++++- docs/src/guides/update-infrastructure.md | 843 ++++++- .../workspace-generation-quick-reference.md | 284 ++- .../batch-workflow-multi-provider.md | 810 +++++- .../infrastructure/batch-workflow-system.md | 94 +- docs/src/infrastructure/cli-architecture.md | 137 +- docs/src/infrastructure/cli-reference.md | 977 +++++++- .../infrastructure/config-rendering-guide.md | 823 ++++++- .../infrastructure/configuration-system.md | 53 +- docs/src/infrastructure/configuration.md | 772 +++++- .../infrastructure/dynamic-secrets-guide.md | 195 +- .../infrastructure-from-code-guide.md | 678 ++++- .../infrastructure-management.md | 1118 ++++++++- docs/src/infrastructure/mode-system-guide.md | 497 +++- .../workspace-config-architecture.md | 413 +++- .../workspaces/workspace-config-commands.md | 309 ++- .../workspaces/workspace-enforcement-guide.md | 616 ++++- .../workspaces/workspace-guide.md | 44 +- .../workspaces/workspace-infra-reference.md | 450 +++- 
.../workspaces/workspace-setup.md | 278 ++- .../workspaces/workspace-switching-guide.md | 468 +++- .../workspaces/workspace-switching-system.md | 149 +- .../integration/gitea-integration-guide.md | 722 +++++- .../integration/integrations-quickstart.md | 623 ++++- docs/src/integration/oci-registry-guide.md | 890 ++++++- docs/src/integration/oci-registry-platform.md | 160 +- .../secrets-service-layer-complete.md | 967 +++++++- .../integration/service-mesh-ingress-guide.md | 1369 +++++++++- docs/src/operations/README.md | 46 +- .../operations/break-glass-training-guide.md | 729 +++++- .../cedar-policies-production-guide.md | 866 ++++++- docs/src/operations/control-center.md | 282 ++- docs/src/operations/coredns-guide.md | 1284 +++++++++- docs/src/operations/deployment-guide.md | 1362 +++++++++- .../operations/incident-response-runbooks.md | 1653 ++++++++++++- docs/src/operations/installer-system.md | 289 ++- docs/src/operations/installer.md | 183 +- docs/src/operations/mfa-admin-setup-guide.md | 1371 ++++++++++- .../operations/monitoring-alerting-setup.md | 1150 ++++++++- docs/src/operations/orchestrator-system.md | 97 +- docs/src/operations/orchestrator.md | 154 +- docs/src/operations/platform.md | 367 ++- .../production-readiness-checklist.md | 354 ++- docs/src/operations/provisioning-server.md | 221 +- .../operations/service-management-guide.md | 1431 ++++++++++- docs/src/quick-reference/README.md | 46 +- docs/src/quick-reference/general.md | 344 ++- docs/src/quick-reference/justfile-recipes.md | 222 +- docs/src/quick-reference/master.md | 36 +- docs/src/quick-reference/oci.md | 440 +++- .../platform-operations-cheatsheet.md | 624 ++++- .../quick-reference/sudo-password-handling.md | 162 +- docs/src/roadmap/README.md | 148 +- docs/src/roadmap/ai-integration.md | 190 +- docs/src/roadmap/native-plugins.md | 253 +- docs/src/roadmap/nickel-workflows.md | 270 +- .../security/authentication-layer-guide.md | 928 ++++++- docs/src/security/config-encryption-guide.md | 944 ++++++- docs/src/security/kms-service.md | 191 +- docs/src/security/nushell-plugins-guide.md | 1001 +++++++- docs/src/security/nushell-plugins-system.md | 78 +- docs/src/security/plugin-integration-guide.md | 2193 ++++++++++++++++- docs/src/security/plugin-usage-guide.md | 396 ++- docs/src/security/rustyvault-kms-guide.md | 548 +++- docs/src/security/secrets-management-guide.md | 533 +++- docs/src/security/secretumvault-kms-guide.md | 648 ++++- docs/src/security/security-system.md | 172 +- .../security/ssh-temporal-keys-user-guide.md | 616 ++++- docs/src/testing/taskserv-validation-guide.md | 556 ++++- docs/src/testing/test-environment-guide.md | 492 +++- docs/src/testing/test-environment-system.md | 188 +- .../troubleshooting/troubleshooting-guide.md | 1089 +++++++- .../troubleshooting/ctrl-c-sudo-handling.md | 210 +- examples/workspaces/cost-optimized/README.md | 2 +- .../multi-provider-web-app/README.md | 2 +- examples/workspaces/multi-region-ha/README.md | 2 +- schemas/infrastructure/README.md | 2 +- schemas/platform/README.md | 2 +- .../templates/docker-compose/README.md | 2 +- .../platform/templates/kubernetes/README.md | 2 +- schemas/platform/usage-guide.md | 2 +- scripts/fix-markdown-fences.nu | 79 +- scripts/fix-markdown-newlines.nu | 56 + scripts/setup-platform-config.sh.md | 2 +- tests/integration/docs/testing-guide.md | 2 +- tools/nickel-installation-guide.md | 2 +- 203 files changed, 104344 insertions(+), 361 deletions(-) create mode 100644 scripts/fix-markdown-newlines.nu diff --git a/.typedialog/ci/README.md 
b/.typedialog/ci/README.md index a20280d..bbf9568 100644 --- a/.typedialog/ci/README.md +++ b/.typedialog/ci/README.md @@ -1 +1 @@ -# CI System - Configuration Guide\n\n**Installed**: 2026-01-01\n**Detected Languages**: rust, nushell, nickel, bash, markdown, python, javascript\n\n---\n\n## Quick Start\n\n### Option 1: Using configure.sh (Recommended)\n\nA convenience script is installed in `.typedialog/ci/`:\n\n```\n# Use web backend (default) - Opens in browser\n.typedialog/ci/configure.sh\n\n# Use TUI backend - Terminal interface\n.typedialog/ci/configure.sh tui\n\n# Use CLI backend - Command-line prompts\n.typedialog/ci/configure.sh cli\n```\n\n**This script automatically:**\n\n- Sources `.typedialog/ci/envrc` for environment setup\n- Loads defaults from `config.ncl` (Nickel format)\n- Uses cascading search for fragments (local → Tools)\n- Creates backup before overwriting existing config\n- Saves output in Nickel format using nickel-roundtrip with documented template\n- Generates `config.ncl` compatible with `nickel doc` command\n\n### Option 2: Direct TypeDialog Commands\n\nUse TypeDialog nickel-roundtrip directly with manual paths:\n\n#### Web Backend (Recommended - Easy Viewing)\n\n```\ncd .typedialog/ci # Change to CI directory\nsource envrc # Load environment\ntypedialog-web nickel-roundtrip config.ncl form.toml \\n --output config.ncl \\n --ncl-template $TOOLS_PATH/dev-system/ci/templates/config.ncl.j2\n```\n\n#### TUI Backend\n\n```\ncd .typedialog/ci\nsource envrc\ntypedialog-tui nickel-roundtrip config.ncl form.toml \\n --output config.ncl \\n --ncl-template $TOOLS_PATH/dev-system/ci/templates/config.ncl.j2\n```\n\n#### CLI Backend\n\n```\ncd .typedialog/ci\nsource envrc\ntypedialog nickel-roundtrip config.ncl form.toml \\n --output config.ncl \\n --ncl-template $TOOLS_PATH/dev-system/ci/templates/config.ncl.j2\n```\n\n**Note:** The `--ncl-template` flag uses a Tera template that adds:\n\n- Descriptive comments for each section\n- Documentation compatible with `nickel doc config.ncl`\n- Consistent formatting and structure\n\n**All backends will:**\n\n- Show only options relevant to your detected languages\n- Guide you through all configuration choices\n- Validate your inputs\n- Generate config.ncl in Nickel format\n\n### Option 3: Manual Configuration\n\nEdit `config.ncl` directly:\n\n```\nvim .typedialog/ci/config.ncl\n```\n\n---\n\n## Configuration Format: Nickel\n\n**This project uses Nickel format by default** for all configuration files.\n\n### Why Nickel?\n\n- ✅ **Typed configuration** - Static type checking with `nickel typecheck`\n- ✅ **Documentation** - Generate docs with `nickel doc config.ncl`\n- ✅ **Validation** - Built-in schema validation\n- ✅ **Comments** - Rich inline documentation support\n- ✅ **Modular** - Import/export system for reusable configs\n\n### Nickel Template\n\nThe output structure is controlled by a **Tera template** at:\n\n- **Tools default**: `$TOOLS_PATH/dev-system/ci/templates/config.ncl.j2`\n- **Local override**: `.typedialog/ci/config.ncl.j2` (optional)\n\n**To customize the template:**\n\n```\n# Copy the default template\ncp $TOOLS_PATH/dev-system/ci/templates/config.ncl.j2 \\n .typedialog/ci/config.ncl.j2\n\n# Edit to add custom comments, documentation, or structure\nvim .typedialog/ci/config.ncl.j2\n\n# Your template will now be used automatically\n```\n\n**Template features:**\n\n- Customizable comments per section\n- Control field ordering\n- Add project-specific documentation\n- Configure output for `nickel doc` command\n\n### 
TypeDialog Environment Variables\n\nYou can customize TypeDialog behavior with environment variables:\n\n```\n# Web server configuration\nexport TYPEDIALOG_PORT=9000 # Port for web backend (default: 9000)\nexport TYPEDIALOG_HOST=localhost # Host binding (default: localhost)\n\n# Localization\nexport TYPEDIALOG_LANG=en_US.UTF-8 # Form language (default: system locale)\n\n# Run with custom settings\nTYPEDIALOG_PORT=8080 .typedialog/ci/configure.sh web\n```\n\n**Common use cases:**\n\n```\n# Access from other machines in network\nTYPEDIALOG_HOST=0.0.0.0 TYPEDIALOG_PORT=8080 .typedialog/ci/configure.sh web\n\n# Use different port if 9000 is busy\nTYPEDIALOG_PORT=3000 .typedialog/ci/configure.sh web\n\n# Spanish interface\nTYPEDIALOG_LANG=es_ES.UTF-8 .typedialog/ci/configure.sh web\n```\n\n## Configuration Structure\n\nYour config.ncl is organized in the `ci` namespace (Nickel format):\n\n```\n{\n ci = {\n project = {\n name = "rust",\n detected_languages = ["rust, nushell, nickel, bash, markdown, python, javascript"],\n primary_language = "rust",\n },\n tools = {\n # Tools are added based on detected languages\n },\n features = {\n # CI features (pre-commit, GitHub Actions, etc.)\n },\n ci_providers = {\n # CI provider configurations\n },\n },\n}\n```\n\n## Available Fragments\n\nTool configurations are modular. Check `.typedialog/ci/fragments/` for:\n\n- rust-tools.toml - Tools for rust\n- nushell-tools.toml - Tools for nushell\n- nickel-tools.toml - Tools for nickel\n- bash-tools.toml - Tools for bash\n- markdown-tools.toml - Tools for markdown\n- python-tools.toml - Tools for python\n- javascript-tools.toml - Tools for javascript\n- general-tools.toml - Cross-language tools\n- ci-providers.toml - GitHub Actions, Woodpecker, etc.\n\n## Cascading Override System\n\nThis project uses a **local → Tools cascading search** for all resources:\n\n### How It Works\n\nResources are searched in priority order:\n\n1. **Local files** (`.typedialog/ci/`) - **FIRST** (highest priority)\n2. 
**Tools files** (`$TOOLS_PATH/dev-system/ci/`) - **FALLBACK** (default)\n\n### Affected Resources\n\n| Resource | Local Path | Tools Path |\n| ---------- | ------------ | ------------ |\n| Fragments | `.typedialog/ci/fragments/` | `$TOOLS_PATH/dev-system/ci/forms/fragments/` |\n| Schemas | `.typedialog/ci/schemas/` | `$TOOLS_PATH/dev-system/ci/schemas/` |\n| Validators | `.typedialog/ci/validators/` | `$TOOLS_PATH/dev-system/ci/validators/` |\n| Defaults | `.typedialog/ci/defaults/` | `$TOOLS_PATH/dev-system/ci/defaults/` |\n| Nickel Template | `.typedialog/ci/config.ncl.j2` | `$TOOLS_PATH/dev-system/ci/templates/config.ncl.j2` |\n\n### Environment Setup (.envrc)\n\nThe `.typedialog/ci/.envrc` file configures search paths:\n\n```\n# Source this file to load environment\nsource .typedialog/ci/.envrc\n\n# Or use direnv for automatic loading\necho 'source .typedialog/ci/.envrc' >> .envrc\n```\n\n**What's in .envrc:**\n\n```\nexport NICKEL_IMPORT_PATH="schemas:$TOOLS_PATH/dev-system/ci/schemas:validators:..."\nexport TYPEDIALOG_FRAGMENT_PATH=".:$TOOLS_PATH/dev-system/ci/forms"\nexport NCL_TEMPLATE=""\nexport TYPEDIALOG_PORT=9000 # Web server port\nexport TYPEDIALOG_HOST=localhost # Web server host\nexport TYPEDIALOG_LANG="${LANG}" # Form localization\n```\n\n### Creating Overrides\n\n**By default:** All resources come from Tools (no duplication).\n\n**To customize:** Create file in local directory with same name:\n\n```\n# Override a fragment\ncp $TOOLS_PATH/dev-system/ci/fragments/rust-tools.toml \\n .typedialog/ci/fragments/rust-tools.toml\n\n# Edit your local version\nvim .typedialog/ci/fragments/rust-tools.toml\n\n# Override Nickel template (customize comments, structure, nickel doc output)\ncp $TOOLS_PATH/dev-system/ci/templates/config.ncl.j2 \\n .typedialog/ci/config.ncl.j2\n\n# Edit to customize documentation and structure\nvim .typedialog/ci/config.ncl.j2\n\n# Now your version will be used instead of Tools version\n```\n\n**Benefits:**\n\n- ✅ Override only what you need\n- ✅ Everything else stays synchronized with Tools\n- ✅ No duplication by default\n- ✅ Automatic updates when Tools is updated\n\n**See:** `$TOOLS_PATH/dev-system/ci/docs/cascade-override.md` for complete documentation.\n\n## Testing Your Configuration\n\n### Validate Configuration\n\n```\nnu $env.TOOLS_PATH/dev-system/ci/scripts/validator.nu \\n --config .typedialog/ci/config.ncl \\n --project . \\n --namespace ci\n```\n\n### Regenerate CI Files\n\n```\nnu $env.TOOLS_PATH/dev-system/ci/scripts/generate-configs.nu \\n --config .typedialog/ci/config.ncl \\n --templates $env.TOOLS_PATH/dev-system/ci/templates \\n --output . 
\\n --namespace ci\n\n## Common Tasks\n\n### Add a New Tool\n\nEdit `config.ncl` and add under `ci.tools`:\n\n```\n{\n ci = {\n tools = {\n newtool = {\n enabled = true,\n install_method = "cargo",\n version = "latest",\n },\n },\n },\n}\n```\n\n### Disable a Feature\n\n```\n[ci.features]\nenable_pre_commit = false\n```\n\n## Need Help?\n\nFor detailed documentation, see:\n\n- $env.TOOLS_PATH/dev-system/ci/docs/configuration-guide.md\n- $env.TOOLS_PATH/dev-system/ci/docs/installation-guide.md +# CI System - Configuration Guide\n\n**Installed**: 2026-01-01\n**Detected Languages**: rust, nushell, nickel, bash, markdown, python, javascript\n\n---\n\n## Quick Start\n\n### Option 1: Using configure.sh (Recommended)\n\nA convenience script is installed in `.typedialog/ci/`:\n\n```\n# Use web backend (default) - Opens in browser\n.typedialog/ci/configure.sh\n\n# Use TUI backend - Terminal interface\n.typedialog/ci/configure.sh tui\n\n# Use CLI backend - Command-line prompts\n.typedialog/ci/configure.sh cli\n```\n\n**This script automatically:**\n\n- Sources `.typedialog/ci/.envrc` for environment setup\n- Loads defaults from `config.ncl` (Nickel format)\n- Uses cascading search for fragments (local → Tools)\n- Creates a backup before overwriting existing config\n- Saves output in Nickel format using nickel-roundtrip with documented template\n- Generates `config.ncl` compatible with `nickel doc` command\n\n### Option 2: Direct TypeDialog Commands\n\nUse TypeDialog nickel-roundtrip directly with manual paths:\n\n#### Web Backend (Recommended - Easy Viewing)\n\n```\ncd .typedialog/ci # Change to CI directory\nsource .envrc # Load environment\ntypedialog-web nickel-roundtrip config.ncl form.toml \n --output config.ncl \n --ncl-template $TOOLS_PATH/dev-system/ci/templates/config.ncl.j2\n```\n\n#### TUI Backend\n\n```\ncd .typedialog/ci\nsource .envrc\ntypedialog-tui nickel-roundtrip config.ncl form.toml \n --output config.ncl \n --ncl-template $TOOLS_PATH/dev-system/ci/templates/config.ncl.j2\n```\n\n#### CLI Backend\n\n```\ncd .typedialog/ci\nsource .envrc\ntypedialog nickel-roundtrip config.ncl form.toml \n --output config.ncl \n --ncl-template $TOOLS_PATH/dev-system/ci/templates/config.ncl.j2\n```\n\n**Note:** The `--ncl-template` flag uses a Tera template that adds:\n\n- Descriptive comments for each section\n- Documentation compatible with `nickel doc config.ncl`\n- Consistent formatting and structure\n\n**All backends will:**\n\n- Show only options relevant to your detected languages\n- Guide you through all configuration choices\n- Validate your inputs\n- Generate config.ncl in Nickel format\n\n### Option 3: Manual Configuration\n\nEdit `config.ncl` directly:\n\n```\nvim .typedialog/ci/config.ncl\n```\n\n---\n\n## Configuration Format: Nickel\n\n**This project uses Nickel format by default** for all configuration files.\n\n### Why Nickel?\n\n- ✅ **Typed configuration** - Static type checking with `nickel typecheck`\n- ✅ **Documentation** - Generate docs with `nickel doc config.ncl`\n- ✅ **Validation** - Built-in schema validation\n- ✅ **Comments** - Rich inline documentation support\n- ✅ **Modular** - Import/export system for reusable configs\n\n### Nickel Template\n\nThe output structure is controlled by a **Tera template** at:\n\n- **Tools default**: `$TOOLS_PATH/dev-system/ci/templates/config.ncl.j2`\n- **Local override**: `.typedialog/ci/config.ncl.j2` (optional)\n\n**To customize the template:**\n\n```\n# Copy the default template\ncp $TOOLS_PATH/dev-system/ci/templates/config.ncl.j2 \n 
.typedialog/ci/config.ncl.j2\n\n# Edit to add custom comments, documentation, or structure\nvim .typedialog/ci/config.ncl.j2\n\n# Your template will now be used automatically\n```\n\n**Template features:**\n\n- Customizable comments per section\n- Control field ordering\n- Add project-specific documentation\n- Configure output for `nickel doc` command\n\n### TypeDialog Environment Variables\n\nYou can customize TypeDialog behavior with environment variables:\n\n```\n# Web server configuration\nexport TYPEDIALOG_PORT=9000 # Port for web backend (default: 9000)\nexport TYPEDIALOG_HOST=localhost # Host binding (default: localhost)\n\n# Localization\nexport TYPEDIALOG_LANG=en_US.UTF-8 # Form language (default: system locale)\n\n# Run with custom settings\nTYPEDIALOG_PORT=8080 .typedialog/ci/configure.sh web\n```\n\n**Common use cases:**\n\n```\n# Access from other machines in network\nTYPEDIALOG_HOST=0.0.0.0 TYPEDIALOG_PORT=8080 .typedialog/ci/configure.sh web\n\n# Use different port if 9000 is busy\nTYPEDIALOG_PORT=3000 .typedialog/ci/configure.sh web\n\n# Spanish interface\nTYPEDIALOG_LANG=es_ES.UTF-8 .typedialog/ci/configure.sh web\n```\n\n## Configuration Structure\n\nYour config.ncl is organized in the `ci` namespace (Nickel format):\n\n```\n{\n ci = {\n project = {\n name = "rust",\n detected_languages = ["rust", "nushell", "nickel", "bash", "markdown", "python", "javascript"],\n primary_language = "rust",\n },\n tools = {\n # Tools are added based on detected languages\n },\n features = {\n # CI features (pre-commit, GitHub Actions, etc.)\n },\n ci_providers = {\n # CI provider configurations\n },\n },\n}\n```\n\n## Available Fragments\n\nTool configurations are modular. Check `.typedialog/ci/fragments/` for:\n\n- rust-tools.toml - Tools for rust\n- nushell-tools.toml - Tools for nushell\n- nickel-tools.toml - Tools for nickel\n- bash-tools.toml - Tools for bash\n- markdown-tools.toml - Tools for markdown\n- python-tools.toml - Tools for python\n- javascript-tools.toml - Tools for javascript\n- general-tools.toml - Cross-language tools\n- ci-providers.toml - GitHub Actions, Woodpecker, etc.\n\n## Cascading Override System\n\nThis project uses a **local → Tools cascading search** for all resources:\n\n### How It Works\n\nResources are searched in priority order:\n\n1. **Local files** (`.typedialog/ci/`) - **FIRST** (highest priority)\n2. **Tools files** (`$TOOLS_PATH/dev-system/ci/`) - **FALLBACK** (default)\n
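\nFor intuition, the lookup is simply "use the local copy if it exists, otherwise fall back to the Tools copy". A minimal Nushell sketch of that cascade (the helper name `find-fragment` is hypothetical, not part of the shipped scripts):\n\n```\n# Resolve a fragment through the local → Tools cascade (illustrative sketch)\ndef find-fragment [name: string] {\n let local = $".typedialog/ci/fragments/($name)"\n let tools = $"($env.TOOLS_PATH)/dev-system/ci/forms/fragments/($name)"\n if ($local | path exists) { $local } else { $tools }\n}\n\nfind-fragment "rust-tools.toml" # a local override wins whenever it is present\n```\n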
\n### Affected Resources\n\n| Resource | Local Path | Tools Path |\n| ---------- | ------------ | ------------ |\n| Fragments | `.typedialog/ci/fragments/` | `$TOOLS_PATH/dev-system/ci/forms/fragments/` |\n| Schemas | `.typedialog/ci/schemas/` | `$TOOLS_PATH/dev-system/ci/schemas/` |\n| Validators | `.typedialog/ci/validators/` | `$TOOLS_PATH/dev-system/ci/validators/` |\n| Defaults | `.typedialog/ci/defaults/` | `$TOOLS_PATH/dev-system/ci/defaults/` |\n| Nickel Template | `.typedialog/ci/config.ncl.j2` | `$TOOLS_PATH/dev-system/ci/templates/config.ncl.j2` |\n\n### Environment Setup (.envrc)\n\nThe `.typedialog/ci/.envrc` file configures search paths:\n\n```\n# Source this file to load environment\nsource .typedialog/ci/.envrc\n\n# Or use direnv for automatic loading\necho 'source .typedialog/ci/.envrc' >> .envrc\n```\n\n**What's in .envrc:**\n\n```\nexport NICKEL_IMPORT_PATH="schemas:$TOOLS_PATH/dev-system/ci/schemas:validators:..."\nexport TYPEDIALOG_FRAGMENT_PATH=".:$TOOLS_PATH/dev-system/ci/forms"\nexport NCL_TEMPLATE=""\nexport TYPEDIALOG_PORT=9000 # Web server port\nexport TYPEDIALOG_HOST=localhost # Web server host\nexport TYPEDIALOG_LANG="${LANG}" # Form localization\n```\n\n### Creating Overrides\n\n**By default:** All resources come from Tools (no duplication).\n\n**To customize:** Create a file in the local directory with the same name:\n\n```\n# Override a fragment\ncp $TOOLS_PATH/dev-system/ci/fragments/rust-tools.toml \n .typedialog/ci/fragments/rust-tools.toml\n\n# Edit your local version\nvim .typedialog/ci/fragments/rust-tools.toml\n\n# Override Nickel template (customize comments, structure, nickel doc output)\ncp $TOOLS_PATH/dev-system/ci/templates/config.ncl.j2 \n .typedialog/ci/config.ncl.j2\n\n# Edit to customize documentation and structure\nvim .typedialog/ci/config.ncl.j2\n\n# Now your version will be used instead of the Tools version\n```\n\n**Benefits:**\n\n- ✅ Override only what you need\n- ✅ Everything else stays synchronized with Tools\n- ✅ No duplication by default\n- ✅ Automatic updates when Tools is updated\n\n**See:** `$TOOLS_PATH/dev-system/ci/docs/cascade-override.md` for complete documentation.\n\n## Testing Your Configuration\n\n### Validate Configuration\n\n```\nnu $env.TOOLS_PATH/dev-system/ci/scripts/validator.nu \n --config .typedialog/ci/config.ncl \n --project . \n --namespace ci\n```\n\n### Regenerate CI Files\n\n```\nnu $env.TOOLS_PATH/dev-system/ci/scripts/generate-configs.nu \n --config .typedialog/ci/config.ncl \n --templates $env.TOOLS_PATH/dev-system/ci/templates \n --output . \n --namespace ci\n```\n\n## Common Tasks\n\n### Add a New Tool\n\nEdit `config.ncl` and add under `ci.tools`:\n\n```\n{\n ci = {\n tools = {\n newtool = {\n enabled = true,\n install_method = "cargo",\n version = "latest",\n },\n },\n },\n}\n```\n\n### Disable a Feature\n\nSet the flag under `ci.features` (Nickel, matching the structure above):\n\n```\n{\n ci = {\n features = {\n enable_pre_commit = false,\n },\n },\n}\n```\n\n## Need Help?\n\nFor detailed documentation, see:\n\n- $env.TOOLS_PATH/dev-system/ci/docs/configuration-guide.md\n- $env.TOOLS_PATH/dev-system/ci/docs/installation-guide.md \ No newline at end of file diff --git a/README.md b/README.md index e998456..1dc8194 100644 --- a/README.md +++ b/README.md @@ -1 +1 @@ -

[centered logo image: "Provisioning Logo", centered project title: "Provisioning"]
\n\n# Provisioning - Infrastructure Automation Platform\n\n> **A modular, declarative Infrastructure as Code (IaC) platform for managing complete infrastructure lifecycles**\n\n## Table of Contents\n\n- [What is Provisioning?](#what-is-provisioning)\n- [Why Provisioning?](#why-provisioning)\n- [Core Concepts](#core-concepts)\n- [Architecture](#architecture)\n- [Key Features](#key-features)\n- [Technology Stack](#technology-stack)\n- [How It Works](#how-it-works)\n- [Use Cases](#use-cases)\n- [Getting Started](#getting-started)\n\n---\n\n## What is Provisioning?\n\n**Provisioning** is a comprehensive **Infrastructure as Code (IaC)** platform designed to manage\ncomplete infrastructure lifecycles: cloud providers, infrastructure services, clusters,\nand isolated workspaces across multiple cloud/local environments.\n\nExtensible and customizable by design, it delivers type-safe, configuration-driven workflows\nwith enterprise security (encrypted configuration, Cosmian KMS integration, Cedar policy engine,\nsecrets management, authorization and permissions control, compliance checking, anomaly detection)\nand adaptable deployment modes (interactive UI, CLI automation, unattended CI/CD)\nsuitable for any scale from development to production.\n\n### Technical Definition\n\nDeclarative Infrastructure as Code (IaC) platform providing:\n\n- **Type-safe, configuration-driven workflows** with schema validation and constraint checking\n- **Modular, extensible architecture**: cloud providers, task services, clusters, workspaces\n- **Multi-cloud abstraction layer** with unified API (UpCloud, AWS, local infrastructure)\n- **High-performance state management**:\n - Graph database backend for complex relationships\n - Real-time state tracking and queries\n - Multi-model data storage (document, graph, relational)\n- **Enterprise security stack**:\n - Encrypted configuration and secrets management\n - Cosmian KMS integration for confidential key management\n - Cedar policy engine for fine-grained access control\n - Authorization and permissions control via platform services\n - Compliance checking and policy enforcement\n - Anomaly detection for security monitoring\n - Audit logging and compliance tracking\n- **Hybrid orchestration**: Rust-based performance layer + scripting flexibility\n- **Production-ready features**:\n - Batch workflows with dependency resolution\n - Checkpoint recovery and automatic rollback\n - Parallel execution with state management\n- **Adaptable deployment modes**:\n - Interactive TUI for guided setup\n - Headless CLI for scripted automation\n - Unattended mode for CI/CD pipelines\n- **Hierarchical configuration system** with inheritance and overrides\n\n### What It Does\n\n- **Provisions Infrastructure** - Create servers, networks, storage across multiple cloud providers\n- **Installs Services** - Deploy Kubernetes, containerd, databases, monitoring, and 50+ infrastructure components\n- **Manages Clusters** - Orchestrate complete cluster deployments with dependency management\n- **Handles Configuration** - Hierarchical configuration system with inheritance and overrides\n- **Orchestrates Workflows** - Batch operations with parallel execution and checkpoint recovery\n- **Manages Secrets** - SOPS/Age integration for encrypted configuration\n- **Secures Infrastructure** - Enterprise security with JWT, MFA, Cedar policies, audit logging\n- **Optimizes Performance** - Native plugins providing 10-50x speed improvements\n\n---\n\n## Why Provisioning?\n\n### The Problems It Solves\n\n#### 
1. **Multi-Cloud Complexity**\n\n**Problem**: Each cloud provider has different APIs, tools, and workflows.\n\n**Solution**: Unified abstraction layer with provider-agnostic interfaces. Write configuration once, deploy anywhere using Nickel schemas.\n\n```\n# Same configuration works on UpCloud, AWS, or local infrastructure\n{\n servers = [\n {\n name = "web-01",\n plan = "medium", # Abstract size, provider-specific translation\n provider = "upcloud" # Switch to "aws" or "local" as needed\n }\n ]\n}\n```\n\n#### 2. **Dependency Hell**\n\n**Problem**: Infrastructure components have complex dependencies (Kubernetes needs containerd, Cilium needs Kubernetes, etc.).\n\n**Solution**: Automatic dependency resolution with topological sorting and health checks via Nickel schemas (a resolver sketch appears under Core Concepts below).\n\n```\n# Provisioning resolves: containerd → etcd → kubernetes → cilium\n{\n taskservs = ["cilium"] # Automatically installs all dependencies\n}\n```\n\n#### 3. **Configuration Sprawl**\n\n**Problem**: Environment variables, hardcoded values, scattered configuration files.\n\n**Solution**: Hierarchical configuration system with 476+ config accessors replacing 200+ ENV variables.\n\n```\nDefaults → User → Project → Infrastructure → Environment → Runtime\n```\n\n#### 4. **Imperative Scripts**\n\n**Problem**: Brittle shell scripts that don't handle failures, don't support rollback, and are hard to maintain.\n\n**Solution**: Declarative Nickel configurations with validation, type safety, lazy evaluation, and automatic rollback.\n\n#### 5. **Lack of Visibility**\n\n**Problem**: No insight into what's happening during deployment, hard to debug failures.\n\n**Solution**:\n\n- Real-time workflow monitoring\n- Comprehensive logging system\n- Web-based control center\n- REST API for integration\n\n#### 6. **No Standardization**\n\n**Problem**: Each team builds their own deployment tools, no shared patterns.\n\n**Solution**: Reusable task services, cluster templates, and workflow patterns.\n\n---\n\n## Core Concepts\n\n### 1. **Providers**\n\nCloud infrastructure backends that handle resource provisioning.\n\n- **UpCloud** - Primary cloud provider\n- **AWS** - Amazon Web Services integration\n- **Local** - Local infrastructure (VMs, Docker, bare metal)\n\nProviders implement a common interface, making infrastructure code portable.\n\n### 2. **Task Services (TaskServs)**\n\nReusable infrastructure components that can be installed on servers.\n\n**Categories**:\n\n- **Container Runtimes** - containerd, Docker, Podman, crun, runc, youki\n- **Orchestration** - Kubernetes, etcd, CoreDNS\n- **Networking** - Cilium, Flannel, Calico, ip-aliases\n- **Storage** - Rook-Ceph, local storage\n- **Databases** - PostgreSQL, Redis, SurrealDB\n- **Observability** - Prometheus, Grafana, Loki\n- **Security** - Webhook, KMS, Vault\n- **Development** - Gitea, Radicle, ORAS\n\nEach task service includes:\n\n- Version management\n- Dependency declarations\n- Health checks\n- Installation/uninstallation logic\n- Configuration schemas\n\n### 3. **Clusters**\n\nComplete infrastructure deployments combining servers and task services.\n\n**Examples**:\n\n- **Kubernetes Cluster** - HA control plane + worker nodes + CNI + storage\n- **Database Cluster** - Replicated PostgreSQL with backup\n- **Build Infrastructure** - BuildKit + container registry + CI/CD\n\nClusters handle:\n\n- Multi-node coordination\n- Service distribution\n- High availability\n- Rolling updates\n
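\nThe dependency declarations that task services carry are what drive install ordering. A minimal Nushell sketch of the resolution idea, using a toy dependency table in place of the real taskserv registry:\n\n```\n# Toy resolver: expand a requested taskserv into a valid install order\ndef resolve [svc: string, deps: record] {\n let direct = if $svc in ($deps | columns) { $deps | get $svc } else { [] }\n $direct | each {|d| resolve $d $deps } | flatten | append $svc | uniq\n}\n\nlet deps = { kubernetes: ["containerd", "etcd"], cilium: ["kubernetes"] }\nresolve "cilium" $deps # => [containerd, etcd, kubernetes, cilium]\n```\n\n### 4. 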
**Workspaces**\n\nIsolated environments for different projects or deployment stages.\n\n```\nworkspace_librecloud/ # Production workspace\n├── infra/ # Infrastructure definitions\n├── config/ # Workspace configuration\n├── extensions/ # Custom modules\n└── runtime/ # State and runtime data\n\nworkspace_dev/ # Development workspace\n├── infra/\n└── config/\n```\n\nSwitch between workspaces with single command:\n\n```\nprovisioning workspace switch librecloud\n```\n\n### 5. **Workflows**\n\nCoordinated sequences of operations with dependency management.\n\n**Types**:\n\n- **Server Workflows** - Create/delete/update servers\n- **TaskServ Workflows** - Install/remove infrastructure services\n- **Cluster Workflows** - Deploy/scale complete clusters\n- **Batch Workflows** - Multi-cloud parallel operations\n\n**Features**:\n\n- Dependency resolution\n- Parallel execution\n- Checkpoint recovery\n- Automatic rollback\n- Progress monitoring\n\n---\n\n## Architecture\n\n### System Components\n\n```\n┌─────────────────────────────────────────────────────────────────┐\n│ User Interface Layer │\n│ • CLI (provisioning command) │\n│ • Web Control Center (UI) │\n│ • REST API │\n└─────────────────────────────────────────────────────────────────┘\n ↓\n┌─────────────────────────────────────────────────────────────────┐\n│ Core Engine Layer │\n│ • Command Routing & Dispatch │\n│ • Configuration Management │\n│ • Provider Abstraction │\n│ • Utility Libraries │\n└─────────────────────────────────────────────────────────────────┘\n ↓\n┌─────────────────────────────────────────────────────────────────┐\n│ Orchestration Layer │\n│ • Workflow Orchestrator (Rust/Nushell hybrid) │\n│ • Dependency Resolver │\n│ • State Manager │\n│ • Task Scheduler │\n└─────────────────────────────────────────────────────────────────┘\n ↓\n┌─────────────────────────────────────────────────────────────────┐\n│ Extension Layer │\n│ • Providers (Cloud APIs) │\n│ • Task Services (Infrastructure Components) │\n│ • Clusters (Complete Deployments) │\n│ • Workflows (Automation Templates) │\n└─────────────────────────────────────────────────────────────────┘\n ↓\n┌─────────────────────────────────────────────────────────────────┐\n│ Infrastructure Layer │\n│ • Cloud Resources (Servers, Networks, Storage) │\n│ • Kubernetes Clusters │\n│ • Running Services │\n└─────────────────────────────────────────────────────────────────┘\n```\n\n### Directory Structure\n\n```\nproject-provisioning/\n├── provisioning/ # Core provisioning system\n│ ├── core/ # Core engine and libraries\n│ │ ├── cli/ # Command-line interface\n│ │ ├── nulib/ # Core Nushell libraries\n│ │ ├── plugins/ # System plugins (Rust native)\n│ │ └── scripts/ # Utility scripts\n│ │\n│ ├── extensions/ # Extensible components\n│ │ ├── providers/ # Cloud provider implementations\n│ │ ├── taskservs/ # Infrastructure service definitions\n│ │ ├── clusters/ # Complete cluster configurations\n│ │ └── workflows/ # Core workflow templates\n│ │\n│ ├── platform/ # Platform services\n│ │ ├── orchestrator/ # Rust orchestrator service\n│ │ ├── control-center/ # Web control center\n│ │ ├── mcp-server/ # Model Context Protocol server\n│ │ ├── api-gateway/ # REST API gateway\n│ │ ├── oci-registry/ # OCI registry for extensions\n│ │ └── installer/ # Platform installer (TUI + CLI)\n│ │\n│ ├── schemas/ # Nickel schema definitions (PRIMARY IaC)\n│ │ ├── main.ncl # Main infrastructure schema\n│ │ ├── providers/ # Provider-specific schemas\n│ │ ├── infrastructure/ # Infra definitions\n│ │ ├── deployment/ # 
Deployment schemas\n│ │ ├── services/ # Service schemas\n│ │ ├── operations/ # Operations schemas\n│ │ └── generator/ # Runtime schema generation\n│ │\n│ ├── docs/ # Product documentation (mdBook)\n│ ├── config/ # Configuration examples\n│ ├── tools/ # Build and distribution tools\n│ └── justfiles/ # Just recipes for common tasks\n│\n├── workspace/ # User workspaces and data\n│ ├── infra/ # Infrastructure definitions\n│ ├── config/ # User configuration\n│ ├── extensions/ # User extensions\n│ └── runtime/ # Runtime data and state\n│\n├── docs/ # Architecture & Development docs\n│ ├── architecture/ # System design and ADRs\n│ └── development/ # Development guidelines\n│\n└── .github/ # CI/CD workflows\n └── workflows/ # GitHub Actions (Rust, Nickel, Nushell)\n```\n\n### Platform Services\n\n#### 1. **Orchestrator** (`platform/orchestrator/`)\n\n- **Language**: Rust + Nushell\n- **Purpose**: Workflow execution, task scheduling, state management\n- **Features**:\n - File-based persistence\n - Priority processing\n - Retry logic with exponential backoff\n - Checkpoint-based recovery\n - REST API endpoints\n\n#### 2. **Control Center** (`platform/control-center/`)\n\n- **Language**: Web UI + Backend API\n- **Purpose**: Web-based infrastructure management\n- **Features**:\n - Dashboard views\n - Real-time monitoring\n - Interactive deployments\n - Log viewing\n\n#### 3. **MCP Server** (`platform/mcp-server/`)\n\n- **Language**: Nushell\n- **Purpose**: Model Context Protocol integration for AI assistance\n- **Features**:\n - 7 AI-powered settings tools\n - Intelligent config completion\n - Natural language infrastructure queries\n\n#### 4. **OCI Registry** (`platform/oci-registry/`)\n\n- **Purpose**: Extension distribution and versioning\n- **Features**:\n - Task service packages\n - Provider packages\n - Cluster templates\n - Workflow definitions\n\n#### 5. **Installer** (`platform/installer/`)\n\n- **Language**: Rust (Ratatui TUI) + Nushell\n- **Purpose**: Platform installation and setup\n- **Features**:\n - Interactive TUI mode\n - Headless CLI mode\n - Unattended CI/CD mode\n - Configuration generation\n\n---\n\n## Key Features\n\n### 1. **Modular CLI Architecture** (v3.2.0)\n\n84% code reduction with domain-driven design.\n\n- **Main CLI**: 211 lines (from 1,329 lines)\n- **80+ shortcuts**: `s` → `server`, `t` → `taskserv`, etc.\n- **Bi-directional help**: `provisioning help ws` = `provisioning ws help`\n- **7 domain modules**: infrastructure, orchestration, development, workspace, configuration, utilities, generation\n\n### 2. **Configuration System** (v2.0.0)\n\nHierarchical, config-driven architecture.\n\n- **476+ config accessors** replacing 200+ ENV variables\n- **Hierarchical loading**: defaults → user → project → infra → env → runtime\n- **Variable interpolation**: `{{paths.base}}`, `{{env.HOME}}`, `{{now.date}}`\n- **Multi-format support**: TOML, YAML, KCL\n\n### 3. **Batch Workflow System** (v3.1.0)\n\nProvider-agnostic batch operations with 85-90% token efficiency.\n\n- **Multi-cloud support**: Mixed UpCloud + AWS + local in single workflow\n- **KCL schema integration**: Type-safe workflow definitions\n- **Dependency resolution**: Topological sorting with soft/hard dependencies\n- **State management**: Checkpoint-based recovery with rollback\n- **Real-time monitoring**: Live progress tracking\n\n### 4. 
**Hybrid Orchestrator** (v3.0.0)\n\nRust/Nushell architecture solving deep call stack limitations.\n\n- **High-performance coordination layer**\n- **File-based persistence**\n- **Priority processing with retry logic**\n- **REST API for external integration**\n- **Comprehensive workflow system**\n\n### 5. **Workspace Switching** (v2.0.5)\n\nCentralized workspace management.\n\n- **Single-command switching**: `provisioning workspace switch <name>`\n- **Automatic tracking**: Last-used timestamps, active workspace markers\n- **User preferences**: Global settings across all workspaces\n- **Workspace registry**: Centralized configuration in `user_config.yaml`\n\n### 6. **Interactive Guides** (v3.3.0)\n\nStep-by-step walkthroughs and quick references.\n\n- **Quick reference**: `provisioning sc` (fastest)\n- **Complete guides**: from-scratch, update, customize\n- **Copy-paste ready**: All commands include placeholders\n- **Beautiful rendering**: Uses glow, bat, or less\n\n### 7. **Test Environment Service** (v3.4.0)\n\nAutomated container-based testing.\n\n- **Three test types**: Single taskserv, server simulation, multi-node clusters\n- **Topology templates**: Kubernetes HA, etcd clusters, etc.\n- **Auto-cleanup**: Optional automatic cleanup after tests\n- **CI/CD integration**: Easy integration into pipelines\n\n### 8. **Platform Installer** (v3.5.0)\n\nMulti-mode installation system with TUI, CLI, and unattended modes.\n\n- **Interactive TUI**: Beautiful Ratatui terminal UI with 7 screens\n- **Headless Mode**: CLI automation for scripted installations\n- **Unattended Mode**: Zero-interaction CI/CD deployments\n- **Deployment Modes**: Solo (2 CPU/4GB), MultiUser (4 CPU/8GB), CICD (8 CPU/16GB), Enterprise (16 CPU/32GB)\n- **MCP Integration**: 7 AI-powered settings tools for intelligent configuration\n\n### 9. **Version Management System** (v3.6.0)\n\nCentralized tool and provider version management with bash-compatible export.\n\n- **Unified Version Source**: All versions defined in Nickel files (`versions.ncl` and provider `version.ncl`)\n- **Generated Versions File**: Bash-compatible KEY="VALUE" format for shell scripts\n- **Core Tools**: NUSHELL, NICKEL, SOPS, AGE, K9S with convenient aliases (NU for NUSHELL)\n- **Provider Versions**: Automatically discovers and includes all provider versions (AWS, HCLOUD, UPCTL, etc.)\n- **Command**: `provisioning setup versions` generates `/provisioning/core/versions` file\n- **Shell Integration**: Can be sourced directly in bash scripts: `source /provisioning/core/versions && echo $NU_VERSION`\n- **Usage**:\n ```bash\n # Generate versions file\n provisioning setup versions\n\n # Use in bash scripts\n source /provisioning/core/versions\n echo "Using Nushell version: $NU_VERSION"\n echo "AWS CLI version: $PROVIDER_AWS_VERSION"\n ```\n\n### 10. **Nushell Plugins Integration** (v1.0.0)\n\nThree native Rust plugins providing 10-50x performance improvements over HTTP API.\n\n- **Three Native Plugins**: auth, KMS, orchestrator\n- **Performance Gains**:\n - KMS operations: ~5ms vs ~50ms (10x faster)\n - Orchestrator queries: ~1ms vs ~30ms (30x faster)\n - Auth verification: ~10ms vs ~50ms (5x faster)\n- **OS-Native Keyring**: macOS Keychain, Linux Secret Service, Windows Credential Manager\n- **KMS Backends**: RustyVault, Age, AWS KMS, Vault, Cosmian\n- **Graceful Fallback**: Automatic fallback to HTTP if plugins not installed\n\n### 11. 
**Complete Security System** (v4.0.0)\n\nEnterprise-grade security with 39,699 lines across 12 components.\n\n- **12 Components**: JWT Auth, Cedar Authorization, MFA (TOTP + WebAuthn), Secrets Management, KMS, Audit Logging, Break-Glass, Compliance, Audit Query, Token Management, Access Control, Encryption\n- **Performance**: <20ms overhead per secure operation\n- **Testing**: 350+ comprehensive test cases\n- **API**: 83+ REST endpoints, 111+ CLI commands\n- **Standards**: GDPR, SOC2, ISO 27001 compliance\n- **Key Features**:\n - RS256 authentication with Argon2id hashing\n - Policy-as-code with hot reload\n - Multi-factor authentication (TOTP + WebAuthn/FIDO2)\n - Dynamic secrets (AWS STS, SSH keys) with TTL\n - 5 KMS backends with envelope encryption\n - 7-year audit retention with 5 export formats\n - Multi-party break-glass approval\n\n---\n\n## Technology Stack\n\n### Core Technologies\n\n| Technology | Version | Purpose | Why |\n| ------------ | --------- | --------- | ----- |\n| **Nickel** | Latest | PRIMARY - Infrastructure-as-code language | Type-safe schemas, lazy evaluation, LSP support, composable records, gradual validation |\n| **Nushell** | 0.109.0+ | Scripting and task automation | Structured data pipelines, cross-platform, modern built-in parsers (JSON/YAML/TOML) |\n| **Rust** | Latest | Platform services (orchestrator, control-center, installer) | Performance, memory safety, concurrency, reliability |\n| **KCL** | DEPRECATED | Legacy configuration (fully replaced by Nickel) | Migration bridge available; use Nickel for new work |\n\n### Data & State Management\n\n| Technology | Version | Purpose | Features |\n| ------------ | --------- | --------- | ---------- |\n| **SurrealDB** | Latest | High-performance graph database backend | Multi-model (document, graph, relational), real-time queries, distributed architecture, complex relationship tracking |\n\n### Platform Services (Rust-based)\n\n| Service | Purpose | Security Features |\n| --------- | --------- | ------------------- |\n| **Orchestrator** | Workflow execution, task scheduling, state management | File-based persistence, retry logic, checkpoint recovery |\n| **Control Center** | Web-based infrastructure management | **Authorization and permissions control**, RBAC, audit logging |\n| **Installer** | Platform installation (TUI + CLI modes) | Secure configuration generation, validation |\n| **API Gateway** | REST API for external integration | Authentication, rate limiting, request validation |\n| **MCP Server** | AI-powered configuration management | 7 settings tools, intelligent config completion |\n| **OCI Registry** | Extension distribution and versioning | Task services, providers, cluster templates |\n\n### Security & Secrets\n\n| Technology | Version | Purpose | Enterprise Features |\n| ------------ | --------- | --------- | --------------------- |\n| **SOPS** | 3.10.2+ | Secrets management | Encrypted configuration files |\n| **Age** | 1.2.1+ | Encryption | Secure key-based encryption |\n| **Cosmian KMS** | Latest | Key Management System | Confidential computing, secure key storage, cloud-native KMS |\n| **Cedar** | Latest | Policy engine | Fine-grained access control, policy-as-code, compliance checking, anomaly detection |\n| **RustyVault** | Latest | Transit encryption engine | 5ms encryption performance, multiple KMS backends |\n| **JWT** | Latest | Authentication tokens | RS256 signatures, Argon2id password hashing |\n| **Keyring** | Latest | OS-native secure storage | macOS Keychain, Linux Secret Service, 
Windows Credential Manager |\n\n### Version Management\n\n| Component | Purpose | Format |\n| ----------- | --------- | -------- |\n| **versions.ncl** | Core tool versions (Nickel primary) | Nickel schema |\n| **provider version.ncl** | Provider-specific versions | Nickel schema |\n| **provisioning setup versions** | Version file generator | Nushell command |\n| **versions file** | Bash-compatible exports | KEY="VALUE" format |\n\n**Usage**:\n```\n# Generate versions file from Nickel schemas\nprovisioning setup versions\n\n# Source in shell scripts\nsource /provisioning/core/versions\necho $NU_VERSION $PROVIDER_AWS_VERSION\n```\n\n### Optional Tools\n\n| Tool | Purpose |\n| ------ | --------- |\n| **K9s** | Kubernetes management interface |\n| **nu_plugin_tera** | Nushell plugin for Tera template rendering |\n| **nu_plugin_kcl** | Nushell plugin for KCL integration (CLI required, plugin optional) |\n| **nu_plugin_auth** | Authentication plugin (5x faster auth, OS keyring integration) |\n| **nu_plugin_kms** | KMS encryption plugin (10x faster, 5ms encryption) |\n| **nu_plugin_orchestrator** | Orchestrator plugin (30-50x faster queries) |\n| **glow** | Markdown rendering for interactive guides |\n| **bat** | Syntax highlighting for file viewing and guides |\n\n---\n\n## How It Works\n\n### Data Flow\n\n```\n1. User defines infrastructure in Nickel schemas\n ↓\n2. Nickel evaluates with type validation and lazy evaluation\n ↓\n3. CLI loads configuration (hierarchical merging)\n ↓\n4. Configuration validated against provider schemas\n ↓\n5. Workflow created with operations\n ↓\n6. Orchestrator receives workflow\n ↓\n7. Dependencies resolved (topological sort)\n ↓\n8. Operations executed in order (parallel where possible)\n ↓\n9. Providers handle cloud operations\n ↓\n10. Task services installed on servers\n ↓\n11. State persisted and monitored\n```\n\n### Example Workflow: Deploy Kubernetes Cluster\n\n**Step 1**: Define infrastructure in Nickel\n\n```\n# schemas/my-cluster.ncl\n{\n metadata = {\n name = "my-cluster",\n provider = "upcloud",\n environment = "production"\n },\n\n infrastructure = {\n servers = [\n {name = "control-01", plan = "medium", role = "control"},\n {name = "worker-01", plan = "large", role = "worker"},\n {name = "worker-02", plan = "large", role = "worker"}\n ]\n },\n\n services = {\n taskservs = ["kubernetes", "cilium", "rook-ceph"]\n }\n}\n```\n\n**Step 2**: Submit to Provisioning\n\n```\nprovisioning server create --infra my-cluster\n```\n\n**Step 3**: Provisioning executes workflow\n\n```\n1. Create workflow: "deploy-my-cluster"\n2. Resolve dependencies:\n - containerd (required by kubernetes)\n - etcd (required by kubernetes)\n - kubernetes (explicitly requested)\n - cilium (explicitly requested, requires kubernetes)\n - rook-ceph (explicitly requested, requires kubernetes)\n\n3. Execution order:\n a. Provision servers (parallel)\n b. Install containerd on all nodes\n c. Install etcd on control nodes\n d. Install kubernetes control plane\n e. Join worker nodes\n f. Install Cilium CNI\n g. Install Rook-Ceph storage\n\n4. Checkpoint after each step\n5. Monitor health checks\n6. Report completion\n```\n\n**Step 4**: Verify deployment\n\n```\nprovisioning cluster status my-cluster\n```\n
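\nThe "checkpoint after each step" behavior in Step 3 can be pictured as a short resumable loop. A Nushell sketch, where `run-step` is a hypothetical single-operation executor and `checkpoint.json` an assumed on-disk format (the real orchestrator persists state through its own store):\n\n```\n# Hypothetical executor stub; the real orchestrator dispatches to providers\ndef run-step [step: string] { print $"running ($step)" }\n\n# Execute steps in order, checkpointing so a rerun resumes after a failure\ndef run-workflow [steps: list<string>] {\n mut done = if ("checkpoint.json" | path exists) { open checkpoint.json } else { [] }\n for step in $steps {\n if $step in $done { continue } # completed in a previous run\n run-step $step\n $done = ($done | append $step)\n $done | save -f checkpoint.json # persist progress after every step\n }\n}\n\nrun-workflow ["provision-servers", "install-containerd", "install-kubernetes"]\n```\n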
\n### Configuration Hierarchy\n\nConfiguration values are resolved through a hierarchy:\n\n```\n1. System Defaults (provisioning/config/config.defaults.toml)\n ↓ (overridden by)\n2. User Preferences (~/.config/provisioning/user_config.yaml)\n ↓ (overridden by)\n3. Workspace Config (workspace/config/provisioning.yaml)\n ↓ (overridden by)\n4. Infrastructure Config (workspace/infra/<name>/config.toml)\n ↓ (overridden by)\n5. Environment Config (workspace/config/prod-defaults.toml)\n ↓ (overridden by)\n6. Runtime Flags (--flag value)\n```\n\n**Example**:\n\n```\n# System default\n[servers]\ndefault_plan = "small"\n\n# User preference\n[servers]\ndefault_plan = "medium" # Overrides system default\n\n# Infrastructure config\n[servers]\ndefault_plan = "large" # Overrides user preference\n\n# Runtime\nprovisioning server create --plan xlarge # Overrides everything\n```\n\n---\n\n## Use Cases\n\n### 1. **Multi-Cloud Kubernetes Deployment**\n\nDeploy Kubernetes clusters across different cloud providers with identical configuration.\n\n```\n# UpCloud cluster\nprovisioning cluster create k8s-prod --provider upcloud\n\n# AWS cluster (same config)\nprovisioning cluster create k8s-prod --provider aws\n```\n\n### 2. **Development → Staging → Production Pipeline**\n\nManage multiple environments with workspace switching.\n\n```\n# Development\nprovisioning workspace switch dev\nprovisioning cluster create app-stack\n\n# Staging (same config, different resources)\nprovisioning workspace switch staging\nprovisioning cluster create app-stack\n\n# Production (HA, larger resources)\nprovisioning workspace switch prod\nprovisioning cluster create app-stack\n```\n\n### 3. **Infrastructure as Code Testing**\n\nTest infrastructure changes before deploying to production.\n\n```\n# Test Kubernetes upgrade locally\nprovisioning test topology load kubernetes_3node | \\n test env cluster kubernetes --version 1.29.0\n\n# Verify functionality\nprovisioning test env run <env-id>\n\n# Cleanup\nprovisioning test env cleanup <env-id>\n```\n\n### 4. **Batch Multi-Region Deployment**\n\nDeploy to multiple regions in parallel using Nickel batch workflows.\n\n```\n# schemas/batch/multi-region.ncl\n{\n batch_workflow = {\n operations = [\n {\n id = "eu-cluster",\n type = "cluster",\n region = "eu-west-1",\n cluster = "app-stack"\n },\n {\n id = "us-cluster",\n type = "cluster",\n region = "us-east-1",\n cluster = "app-stack"\n },\n {\n id = "asia-cluster",\n type = "cluster",\n region = "ap-south-1",\n cluster = "app-stack"\n }\n ],\n parallel_limit = 3 # All at once\n }\n}\n```\n\n```\nprovisioning batch submit schemas/batch/multi-region.ncl\nprovisioning batch monitor <workflow-id>\n```\n\n### 5. **Automated Disaster Recovery**\n\nRecreate infrastructure from configuration.\n\n```\n# Infrastructure destroyed\nprovisioning workspace switch prod\n\n# Recreate from config\nprovisioning cluster create --infra backup-restore --wait\n\n# All services restored with same configuration\n```\n\n### 6. **CI/CD Integration**\n\nAutomated testing and deployment pipelines.\n\n```\n# .gitlab-ci.yml\ntest-infrastructure:\n script:\n - provisioning test quick kubernetes\n - provisioning test quick postgres\n\ndeploy-staging:\n script:\n - provisioning workspace switch staging\n - provisioning cluster create app-stack --check\n - provisioning cluster create app-stack --yes\n\ndeploy-production:\n when: manual\n script:\n - provisioning workspace switch prod\n - provisioning cluster create app-stack --yes\n```\n\n---\n\n## Getting Started\n\n### Quick Start\n\n1. **Install Prerequisites**\n\n ```bash\n # Install Nushell (0.109.0+)\n brew install nushell # macOS\n\n # Install Nickel (required for IaC)\n brew install nickel # macOS or from source\n\n # Install SOPS (optional, for encrypted secrets)\n brew install sops\n ```\n\n2. 
**Add CLI to PATH**\n\n ```bash\n ln -sf "$(pwd)/provisioning/core/cli/provisioning" /usr/local/bin/provisioning\n ```\n\n3. **Initialize Workspace**\n\n ```bash\n provisioning workspace init my-project\n cd my-project\n ```\n\n4. **Generate Versions File** (Optional - for bash scripts)\n\n ```bash\n provisioning setup versions\n # Creates /provisioning/core/versions with all tool and provider versions\n\n # Use in your deployment scripts\n source /provisioning/core/versions\n echo "Deploying with Nushell $NU_VERSION and AWS CLI $PROVIDER_AWS_VERSION"\n ```\n\n5. **Define Infrastructure (Nickel)**\n\n ```bash\n # Create workspace infrastructure schema\n cat > workspace/infra/my-cluster.ncl <<'EOF'\n {\n metadata.name = "my-cluster",\n metadata.provider = "upcloud",\n\n infrastructure.servers = [\n {name = "control-01", plan = "medium"},\n {name = "worker-01", plan = "large"}\n ],\n\n services.taskservs = ["kubernetes", "cilium"]\n }\n EOF\n ```\n\n6. **Deploy Infrastructure**\n\n ```bash\n # Validate configuration\n provisioning config validate\n\n # Check what will be created\n provisioning server create --check\n\n # Create servers\n provisioning server create --yes\n\n # Install Kubernetes\n provisioning taskserv create kubernetes\n ```\n\n### Learning Path\n\n1. **Start with Guides**\n\n ```bash\n provisioning sc # Quick reference\n provisioning guide from-scratch # Complete walkthrough\n ```\n\n2. **Explore Examples**\n\n ```bash\n ls provisioning/examples/\n ```\n\n3. **Read Architecture Docs**\n - [Core Engine](provisioning/core/README.md)\n - [CLI Architecture](.claude/features/cli-architecture.md)\n - [Configuration System](.claude/features/configuration-system.md)\n - [Batch Workflows](.claude/features/batch-workflow-system.md)\n\n4. **Try Test Environments**\n\n ```bash\n provisioning test quick kubernetes\n provisioning test quick postgres\n ```\n\n5. 
**Build Custom Extensions**\n - Create custom task services\n - Define cluster templates\n - Write workflow automation\n\n---\n\n## Documentation Index\n\n### User & Operations Guides\n\nSee **[provisioning/docs/src/](provisioning/docs/src/)** for comprehensive documentation:\n\n- **Quick Start** - Get started in 10 minutes\n- **Command Reference** - Complete CLI command reference\n- **Nickel Configuration Guide** - IaC language and patterns\n- **Workspace Management** - Multi-workspace guide\n- **Test Environment Guide** - Testing infrastructure with containers\n- **Plugin Integration** - Native Rust plugins (10-50x faster)\n- **Security System** - Authentication, MFA, KMS, Cedar policies\n- **Operations** - Deployment, monitoring, incident response\n\n### Architecture & Design Decisions\n\nSee **[docs/src/architecture/](docs/src/architecture/)** for design patterns:\n\n- **System Architecture** - Multi-layer design\n- **ADRs (Architecture Decision Records)** - Major decisions including:\n - ADR-011: Nickel Migration (from KCL)\n - ADR-012: Nushell + Nickel plugin wrapper\n - ADR-010: Configuration format strategy\n- **Multi-Repo Strategy** - Repository organization\n- **Integration Patterns** - How components interact\n\n### Development Guidelines\n\n- **[Repository Structure](docs/src/development/)** - Codebase organization\n- **[Contributing Guide](CONTRIBUTING.md)** - How to contribute\n- **[Nushell Guidelines](.claude/guidelines/nushell/)** - Best practices\n- **[Nickel Guidelines](.claude/guidelines/nickel.md)** - IaC patterns\n- **[Rust Guidelines](.claude/guidelines/rust/)** - Rust conventions\n\n### API Reference\n\n- **REST API** - HTTP endpoints in `provisioning/docs/src/api-reference/`\n- **Nushell API** - Library functions and modules\n- **Provider API** - Cloud provider interface specification\n\n---\n\n## Project Status\n\n**Current Version**: v5.0.0-nickel (Production Ready) | **Date**: 2026-01-08\n\n### Completed Milestones\n\n- ✅ **v5.0.0** (2026-01-08) - **Nickel IaC Migration Complete**\n - Full KCL→Nickel migration\n - Schema-driven configuration system\n - Type-safe lazy evaluation\n - ~220 legacy files removed, ~250 new schema files added\n\n- ✅ **v3.6.0** (2026-01-08) - Version Management System\n - Centralized tool and provider version management\n - Bash-compatible versions file generation\n - `provisioning setup versions` command\n - Automatic provider version discovery from Nickel schemas\n - Shell script integration with sourcing support\n\n- ✅ **v4.0.0** (2025-10-09) - Complete Security System (12 components, 39,699 lines)\n- ✅ **v3.5.0** (2025-10-07) - Platform Installer with TUI and CI/CD modes\n- ✅ **v3.4.0** (2025-10-06) - Test Environment Service with container management\n- ✅ **v3.3.0** (2025-09-30) - Interactive Guides system\n- ✅ **v3.2.0** (2025-09-30) - Modular CLI Architecture (84% code reduction)\n- ✅ **v3.1.0** (2025-09-25) - Batch Workflow System (85-90% token efficiency)\n- ✅ **v3.0.0** (2025-09-25) - Hybrid Orchestrator (Rust/Nushell)\n- ✅ **v2.0.5** (2025-10-02) - Workspace Switching system\n- ✅ **v2.0.0** (2025-09-23) - Configuration System (476+ accessors)\n- ✅ **v1.0.0** (2025-10-09) - Nushell Plugins Integration (10-50x performance)\n\n### Current Focus\n\n- **Nickel Ecosystem** - IDE support, LSP integration, schema libraries\n- **Platform Consolidation** - GitHub Actions CI/CD, cross-platform testing\n- **Extension Registry** - OCI-based distribution for task services and providers\n- **Documentation** - Complete Nickel migration 
guides, ADR updates\n\n---\n\n## Support and Community\n\n### Getting Help\n\n- **Documentation**: Start with `provisioning help` or `provisioning guide from-scratch`\n- **Issues**: Report bugs and request features on the issue tracker\n- **Discussions**: Join community discussions for questions and ideas\n\n### Contributing\n\nContributions are welcome! See [CONTRIBUTING.md](docs/development/CONTRIBUTING.md) for guidelines.\n\n**Key areas for contribution**:\n\n- New task service definitions\n- Cloud provider implementations\n- Cluster templates\n- Documentation improvements\n- Bug fixes and testing\n\n---\n\n## License\n\nSee [LICENSE](LICENSE) file in project root.\n\n---\n\n**Maintained By**: Architecture Team\n**Last Updated**: 2026-01-08 (Version Management System v3.6.0 + Nickel v5.0.0 Migration Complete)\n**Current Branch**: nickel\n**Project Home**: [provisioning/](provisioning/)\n\n---\n\n## Recent Changes (2026-01-08)\n\n### Version Management System (v3.6.0)\n\n**What Changed**:\n- ✅ Implemented `provisioning setup versions` command\n- ✅ Generates bash-compatible `/provisioning/core/versions` file\n- ✅ Automatically discovers and includes all provider versions from Nickel schemas\n- ✅ Fixed to remove redundant metadata (all sources are Nickel)\n- ✅ Core tools with aliases: NUSHELL→NU, NICKEL, SOPS, AGE, K9S\n- ✅ Shell script integration: `source /provisioning/core/versions && echo $NU_VERSION`\n\n**Files Modified**:\n- `provisioning/core/nulib/lib_provisioning/setup/utils.nu` - Core implementation\n- `provisioning/core/nulib/main_provisioning/commands/setup.nu` - Command routing\n- `provisioning/core/nulib/lib_provisioning/workspace/enforcement.nu` - Workspace exemption\n- `provisioning/README.md` - Documentation updates\n\n**Generated File Example**:\n```\nNUSHELL_VERSION="0.109.1"\nNUSHELL_SOURCE="https://github.com/nushell/nushell/releases"\nNU_VERSION="0.109.1"\nNU_SOURCE="https://github.com/nushell/nushell/releases"\n\nNICKEL_VERSION="1.15.1"\nNICKEL_SOURCE="https://github.com/tweag/nickel/releases"\n\nPROVIDER_AWS_VERSION="2.32.11"\nPROVIDER_AWS_SOURCE="https://github.com/aws/aws-cli/releases"\n# ... and more providers\n```\n\n**Key Improvements**:\n- Clean metadata (no redundant `_LIB` fields - all sources are Nickel)\n- Automatic provider discovery from `extensions/providers/*/nickel/version.ncl`\n- Direct Nickel file parsing with JSON export\n- Zero dependency on environment variables or legacy systems\n- 100% bash/shell compatible for deployment scripts +

[Provisioning Logo]\n

\n\n# Provisioning - Infrastructure Automation Platform\n\n> **A modular, declarative Infrastructure as Code (IaC) platform for managing complete infrastructure lifecycles**\n\n## Table of Contents\n\n- [What is Provisioning?](#what-is-provisioning)\n- [Why Provisioning?](#why-provisioning)\n- [Core Concepts](#core-concepts)\n- [Architecture](#architecture)\n- [Key Features](#key-features)\n- [Technology Stack](#technology-stack)\n- [How It Works](#how-it-works)\n- [Use Cases](#use-cases)\n- [Getting Started](#getting-started)\n\n---\n\n## What is Provisioning?\n\n**Provisioning** is a comprehensive **Infrastructure as Code (IaC)** platform designed to manage\ncomplete infrastructure lifecycles: cloud providers, infrastructure services, clusters,\nand isolated workspaces across multiple cloud/local environments.\n\nExtensible and customizable by design, it delivers type-safe, configuration-driven workflows\nwith enterprise security (encrypted configuration, Cosmian KMS integration, Cedar policy engine,\nsecrets management, authorization and permissions control, compliance checking, anomaly detection)\nand adaptable deployment modes (interactive UI, CLI automation, unattended CI/CD)\nsuitable for any scale from development to production.\n\n### Technical Definition\n\nDeclarative Infrastructure as Code (IaC) platform providing:\n\n- **Type-safe, configuration-driven workflows** with schema validation and constraint checking\n- **Modular, extensible architecture**: cloud providers, task services, clusters, workspaces\n- **Multi-cloud abstraction layer** with unified API (UpCloud, AWS, local infrastructure)\n- **High-performance state management**:\n - Graph database backend for complex relationships\n - Real-time state tracking and queries\n - Multi-model data storage (document, graph, relational)\n- **Enterprise security stack**:\n - Encrypted configuration and secrets management\n - Cosmian KMS integration for confidential key management\n - Cedar policy engine for fine-grained access control\n - Authorization and permissions control via platform services\n - Compliance checking and policy enforcement\n - Anomaly detection for security monitoring\n - Audit logging and compliance tracking\n- **Hybrid orchestration**: Rust-based performance layer + scripting flexibility\n- **Production-ready features**:\n - Batch workflows with dependency resolution\n - Checkpoint recovery and automatic rollback\n - Parallel execution with state management\n- **Adaptable deployment modes**:\n - Interactive TUI for guided setup\n - Headless CLI for scripted automation\n - Unattended mode for CI/CD pipelines\n- **Hierarchical configuration system** with inheritance and overrides\n\n### What It Does\n\n- **Provisions Infrastructure** - Create servers, networks, storage across multiple cloud providers\n- **Installs Services** - Deploy Kubernetes, containerd, databases, monitoring, and 50+ infrastructure components\n- **Manages Clusters** - Orchestrate complete cluster deployments with dependency management\n- **Handles Configuration** - Hierarchical configuration system with inheritance and overrides\n- **Orchestrates Workflows** - Batch operations with parallel execution and checkpoint recovery\n- **Manages Secrets** - SOPS/Age integration for encrypted configuration\n- **Secures Infrastructure** - Enterprise security with JWT, MFA, Cedar policies, audit logging\n- **Optimizes Performance** - Native plugins providing 10-50x speed improvements\n\n---\n\n## Why Provisioning?\n\n### The Problems It Solves\n\n#### 
1. **Multi-Cloud Complexity**\n\n**Problem**: Each cloud provider has different APIs, tools, and workflows.\n\n**Solution**: Unified abstraction layer with provider-agnostic interfaces. Write configuration once, deploy anywhere using Nickel schemas.\n\n```\n# Same configuration works on UpCloud, AWS, or local infrastructure\n{\n servers = [\n {\n name = "web-01"\n plan = "medium" # Abstract size, provider-specific translation\n provider = "upcloud" # Switch to "aws" or "local" as needed\n }\n ]\n}\n```\n\n#### 2. **Dependency Hell**\n\n**Problem**: Infrastructure components have complex dependencies (Kubernetes needs containerd, Cilium needs Kubernetes, etc.).\n\n**Solution**: Automatic dependency resolution with topological sorting and health checks via Nickel schemas (a sketch of the resulting install order appears after the Clusters concept below).\n\n```\n# Provisioning resolves: containerd → etcd → kubernetes → cilium\n{\n taskservs = ["cilium"] # Automatically installs all dependencies\n}\n```\n\n#### 3. **Configuration Sprawl**\n\n**Problem**: Environment variables, hardcoded values, scattered configuration files.\n\n**Solution**: Hierarchical configuration system with 476+ config accessors replacing 200+ ENV variables.\n\n```\nDefaults → User → Project → Infrastructure → Environment → Runtime\n```\n\n#### 4. **Imperative Scripts**\n\n**Problem**: Brittle shell scripts that don't handle failures, don't support rollback, and are hard to maintain.\n\n**Solution**: Declarative Nickel configurations with validation, type safety, lazy evaluation, and automatic rollback.\n\n#### 5. **Lack of Visibility**\n\n**Problem**: No insight into what's happening during deployment; failures are hard to debug.\n\n**Solution**:\n\n- Real-time workflow monitoring\n- Comprehensive logging system\n- Web-based control center\n- REST API for integration\n\n#### 6. **No Standardization**\n\n**Problem**: Each team builds its own deployment tools, with no shared patterns.\n\n**Solution**: Reusable task services, cluster templates, and workflow patterns.\n\n---\n\n## Core Concepts\n\n### 1. **Providers**\n\nCloud infrastructure backends that handle resource provisioning.\n\n- **UpCloud** - Primary cloud provider\n- **AWS** - Amazon Web Services integration\n- **Local** - Local infrastructure (VMs, Docker, bare metal)\n\nProviders implement a common interface, making infrastructure code portable.\n\n### 2. **Task Services (TaskServs)**\n\nReusable infrastructure components that can be installed on servers.\n\n**Categories**:\n\n- **Container Runtimes** - containerd, Docker, Podman, crun, runc, youki\n- **Orchestration** - Kubernetes, etcd, CoreDNS\n- **Networking** - Cilium, Flannel, Calico, ip-aliases\n- **Storage** - Rook-Ceph, local storage\n- **Databases** - PostgreSQL, Redis, SurrealDB\n- **Observability** - Prometheus, Grafana, Loki\n- **Security** - Webhook, KMS, Vault\n- **Development** - Gitea, Radicle, ORAS\n\nEach task service includes:\n\n- Version management\n- Dependency declarations\n- Health checks\n- Installation/uninstallation logic\n- Configuration schemas\n\n### 3. **Clusters**\n\nComplete infrastructure deployments combining servers and task services.\n\n**Examples**:\n\n- **Kubernetes Cluster** - HA control plane + worker nodes + CNI + storage\n- **Database Cluster** - Replicated PostgreSQL with backup\n- **Build Infrastructure** - BuildKit + container registry + CI/CD\n\nClusters handle:\n\n- Multi-node coordination\n- Service distribution\n- High availability\n- Rolling updates\n
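\nTask services and clusters both rely on the dependency resolution described under "Dependency Hell" above. The following minimal Nushell sketch shows the idea; the dependency table and the `install-order` helper are illustrative assumptions, not the platform's actual metadata or API:\n\n```nu\n# Illustrative dependency table (not the real taskserv metadata)\nlet deps = {\n    containerd: []\n    etcd: []\n    kubernetes: ["containerd", "etcd"]\n    cilium: ["kubernetes"]\n}\n\n# Return an install order covering the requested taskservs plus their\n# transitive dependencies (assumes the dependency graph is acyclic).\ndef install-order [deps: record, targets: list<string>] {\n    mut order = []\n    mut pending = $targets\n    while ($pending | is-not-empty) {\n        let next = ($pending | first)\n        $pending = ($pending | skip 1)\n        if $next in $order { continue }\n        let done = $order\n        let missing = ($deps | get $next | where {|d| $d not-in $done })\n        if ($missing | is-empty) {\n            $order = ($order | append $next)\n        } else {\n            # Requeue this node after its unmet dependencies\n            $pending = ($missing | append $next | append $pending)\n        }\n    }\n    $order\n}\n\ninstall-order $deps ["cilium"]  # => [containerd, etcd, kubernetes, cilium]\n```\n\nCycles are not handled in this sketch; the real resolver also runs health checks between steps, as noted above.\n\n### 4. 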
**Workspaces**\n\nIsolated environments for different projects or deployment stages.\n\n```\nworkspace_librecloud/ # Production workspace\n├── infra/ # Infrastructure definitions\n├── config/ # Workspace configuration\n├── extensions/ # Custom modules\n└── runtime/ # State and runtime data\n\nworkspace_dev/ # Development workspace\n├── infra/\n└── config/\n```\n\nSwitch between workspaces with a single command:\n\n```\nprovisioning workspace switch librecloud\n```\n\n### 5. **Workflows**\n\nCoordinated sequences of operations with dependency management.\n\n**Types**:\n\n- **Server Workflows** - Create/delete/update servers\n- **TaskServ Workflows** - Install/remove infrastructure services\n- **Cluster Workflows** - Deploy/scale complete clusters\n- **Batch Workflows** - Multi-cloud parallel operations\n\n**Features**:\n\n- Dependency resolution\n- Parallel execution\n- Checkpoint recovery\n- Automatic rollback\n- Progress monitoring\n\n---\n\n## Architecture\n\n### System Components\n\n```\n┌─────────────────────────────────────────────────────────────────┐\n│ User Interface Layer                                            │\n│ • CLI (provisioning command)                                    │\n│ • Web Control Center (UI)                                       │\n│ • REST API                                                      │\n└─────────────────────────────────────────────────────────────────┘\n                                 ↓\n┌─────────────────────────────────────────────────────────────────┐\n│ Core Engine Layer                                               │\n│ • Command Routing & Dispatch                                    │\n│ • Configuration Management                                      │\n│ • Provider Abstraction                                          │\n│ • Utility Libraries                                             │\n└─────────────────────────────────────────────────────────────────┘\n                                 ↓\n┌─────────────────────────────────────────────────────────────────┐\n│ Orchestration Layer                                             │\n│ • Workflow Orchestrator (Rust/Nushell hybrid)                   │\n│ • Dependency Resolver                                           │\n│ • State Manager                                                 │\n│ • Task Scheduler                                                │\n└─────────────────────────────────────────────────────────────────┘\n                                 ↓\n┌─────────────────────────────────────────────────────────────────┐\n│ Extension Layer                                                 │\n│ • Providers (Cloud APIs)                                        │\n│ • Task Services (Infrastructure Components)                     │\n│ • Clusters (Complete Deployments)                               │\n│ • Workflows (Automation Templates)                              │\n└─────────────────────────────────────────────────────────────────┘\n                                 ↓\n┌─────────────────────────────────────────────────────────────────┐\n│ Infrastructure Layer                                            │\n│ • Cloud Resources (Servers, Networks, Storage)                  │\n│ • Kubernetes Clusters                                           │\n│ • Running Services                                              │\n└─────────────────────────────────────────────────────────────────┘\n```\n\n### Directory Structure\n\n```\nproject-provisioning/\n├── provisioning/ # Core provisioning system\n│ ├── core/ # Core engine and libraries\n│ │ ├── cli/ # Command-line interface\n│ │ ├── nulib/ # Core Nushell libraries\n│ │ ├── plugins/ # System plugins (Rust native)\n│ │ └── scripts/ # Utility scripts\n│ │\n│ ├── extensions/ # Extensible components\n│ │ ├── providers/ # Cloud provider implementations\n│ │ ├── taskservs/ # Infrastructure service definitions\n│ │ ├── clusters/ # Complete cluster configurations\n│ │ └── workflows/ # Core workflow templates\n│ │\n│ ├── platform/ # Platform services\n│ │ ├── orchestrator/ # Rust orchestrator service\n│ │ ├── control-center/ # Web control center\n│ │ ├── mcp-server/ # Model Context Protocol server\n│ │ ├── api-gateway/ # REST API gateway\n│ │ ├── oci-registry/ # OCI registry for extensions\n│ │ └── installer/ # Platform installer (TUI + CLI)\n│ │\n│ ├── schemas/ # Nickel schema definitions (PRIMARY IaC)\n│ │ ├── main.ncl # Main infrastructure schema\n│ │ ├── providers/ # Provider-specific schemas\n│ │ ├── infrastructure/ # Infra definitions\n│ │ ├── deployment/ # 
Deployment schemas\n│ │ ├── services/ # Service schemas\n│ │ ├── operations/ # Operations schemas\n│ │ └── generator/ # Runtime schema generation\n│ │\n│ ├── docs/ # Product documentation (mdBook)\n│ ├── config/ # Configuration examples\n│ ├── tools/ # Build and distribution tools\n│ └── justfiles/ # Just recipes for common tasks\n│\n├── workspace/ # User workspaces and data\n│ ├── infra/ # Infrastructure definitions\n│ ├── config/ # User configuration\n│ ├── extensions/ # User extensions\n│ └── runtime/ # Runtime data and state\n│\n├── docs/ # Architecture & Development docs\n│ ├── architecture/ # System design and ADRs\n│ └── development/ # Development guidelines\n│\n└── .github/ # CI/CD workflows\n └── workflows/ # GitHub Actions (Rust, Nickel, Nushell)\n```\n\n### Platform Services\n\n#### 1. **Orchestrator** (`platform/orchestrator/`)\n\n- **Language**: Rust + Nushell\n- **Purpose**: Workflow execution, task scheduling, state management\n- **Features**:\n - File-based persistence\n - Priority processing\n - Retry logic with exponential backoff\n - Checkpoint-based recovery\n - REST API endpoints\n\n#### 2. **Control Center** (`platform/control-center/`)\n\n- **Stack**: Web UI + Backend API\n- **Purpose**: Web-based infrastructure management\n- **Features**:\n - Dashboard views\n - Real-time monitoring\n - Interactive deployments\n - Log viewing\n\n#### 3. **MCP Server** (`platform/mcp-server/`)\n\n- **Language**: Nushell\n- **Purpose**: Model Context Protocol integration for AI assistance\n- **Features**:\n - 7 AI-powered settings tools\n - Intelligent config completion\n - Natural language infrastructure queries\n\n#### 4. **OCI Registry** (`platform/oci-registry/`)\n\n- **Purpose**: Extension distribution and versioning\n- **Features**:\n - Task service packages\n - Provider packages\n - Cluster templates\n - Workflow definitions\n\n#### 5. **Installer** (`platform/installer/`)\n\n- **Language**: Rust (Ratatui TUI) + Nushell\n- **Purpose**: Platform installation and setup\n- **Features**:\n - Interactive TUI mode\n - Headless CLI mode\n - Unattended CI/CD mode\n - Configuration generation\n\n---\n\n## Key Features\n\n### 1. **Modular CLI Architecture** (v3.2.0)\n\n84% code reduction with domain-driven design.\n\n- **Main CLI**: 211 lines (from 1,329 lines)\n- **80+ shortcuts**: `s` → `server`, `t` → `taskserv`, etc.\n- **Bi-directional help**: `provisioning help ws` = `provisioning ws help`\n- **7 domain modules**: infrastructure, orchestration, development, workspace, configuration, utilities, generation\n\n### 2. **Configuration System** (v2.0.0)\n\nHierarchical, config-driven architecture.\n\n- **476+ config accessors** replacing 200+ ENV variables\n- **Hierarchical loading**: defaults → user → project → infra → env → runtime (sketched below)\n- **Variable interpolation**: `{{paths.base}}`, `{{env.HOME}}`, `{{now.date}}`\n- **Multi-format support**: TOML, YAML, KCL\n\n### 3. **Batch Workflow System** (v3.1.0)\n\nProvider-agnostic batch operations with 85-90% token efficiency.\n\n- **Multi-cloud support**: Mixed UpCloud + AWS + local in single workflow\n- **KCL schema integration**: Type-safe workflow definitions\n- **Dependency resolution**: Topological sorting with soft/hard dependencies\n- **State management**: Checkpoint-based recovery with rollback\n- **Real-time monitoring**: Live progress tracking\n
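\nThe hierarchical loading used by the configuration system (feature 2 above) amounts to a fold over layer files in which later layers override earlier ones. A minimal Nushell sketch, assuming illustrative file names rather than the loader's actual inputs:\n\n```nu\n# Merge configuration layers; later layers override earlier ones.\ndef load-config [] {\n    [\n        "provisioning/config/config.defaults.toml"  # system defaults\n        "~/.config/provisioning/user_config.yaml"   # user preferences\n        "workspace/config/provisioning.yaml"        # workspace config\n    ]\n    | each {|f| $f | path expand }\n    | where {|f| $f | path exists }\n    | reduce --fold {} {|file, acc| $acc | merge (open $file) }\n}\n```\n\nNote that `merge` is shallow; nested sections would need a recursive merge (newer Nushell releases ship `merge deep`). Runtime flags are then applied on top of the merged record, completing the defaults → user → project → infra → env → runtime chain.\n\n### 4. 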
**Hybrid Orchestrator** (v3.0.0)\n\nRust/Nushell architecture solving deep call stack limitations.\n\n- **High-performance coordination layer**\n- **File-based persistence**\n- **Priority processing with retry logic**\n- **REST API for external integration**\n- **Comprehensive workflow system**\n\n### 5. **Workspace Switching** (v2.0.5)\n\nCentralized workspace management.\n\n- **Single-command switching**: `provisioning workspace switch `\n- **Automatic tracking**: Last-used timestamps, active workspace markers\n- **User preferences**: Global settings across all workspaces\n- **Workspace registry**: Centralized configuration in `user_config.yaml`\n\n### 6. **Interactive Guides** (v3.3.0)\n\nStep-by-step walkthroughs and quick references.\n\n- **Quick reference**: `provisioning sc` (fastest)\n- **Complete guides**: from-scratch, update, customize\n- **Copy-paste ready**: All commands include placeholders\n- **Beautiful rendering**: Uses glow, bat, or less\n\n### 7. **Test Environment Service** (v3.4.0)\n\nAutomated container-based testing.\n\n- **Three test types**: Single taskserv, server simulation, multi-node clusters\n- **Topology templates**: Kubernetes HA, etcd clusters, etc.\n- **Auto-cleanup**: Optional automatic cleanup after tests\n- **CI/CD integration**: Easy integration into pipelines\n\n### 8. **Platform Installer** (v3.5.0)\n\nMulti-mode installation system with TUI, CLI, and unattended modes.\n\n- **Interactive TUI**: Beautiful Ratatui terminal UI with 7 screens\n- **Headless Mode**: CLI automation for scripted installations\n- **Unattended Mode**: Zero-interaction CI/CD deployments\n- **Deployment Modes**: Solo (2 CPU/4GB), MultiUser (4 CPU/8GB), CICD (8 CPU/16GB), Enterprise (16 CPU/32GB)\n- **MCP Integration**: 7 AI-powered settings tools for intelligent configuration\n\n### 9. **Version Management System** (v3.6.0)\n\nCentralized tool and provider version management with bash-compatible export.\n\n- **Unified Version Source**: All versions defined in Nickel files (`versions.ncl` and provider `version.ncl`)\n- **Generated Versions File**: Bash-compatible KEY="VALUE" format for shell scripts\n- **Core Tools**: NUSHELL, NICKEL, SOPS, AGE, K9S with convenient aliases (NU for NUSHELL)\n- **Provider Versions**: Automatically discovers and includes all provider versions (AWS, HCLOUD, UPCTL, etc.)\n- **Command**: `provisioning setup versions` generates `/provisioning/core/versions` file\n- **Shell Integration**: Can be sourced directly in bash scripts: `source /provisioning/core/versions && echo $NU_VERSION`\n- **Usage**:\n ```bash\n # Generate versions file\n provisioning setup versions\n\n # Use in bash scripts\n source /provisioning/core/versions\n echo "Using Nushell version: $NU_VERSION"\n echo "AWS CLI version: $PROVIDER_AWS_VERSION"\n ```\n\n### 10. **Nushell Plugins Integration** (v1.0.0)\n\nThree native Rust plugins providing 10-50x performance improvements over the HTTP API.\n\n- **Three Native Plugins**: auth, KMS, orchestrator\n- **Performance Gains**:\n - KMS operations: ~5ms vs ~50ms (10x faster)\n - Orchestrator queries: ~1ms vs ~30ms (30x faster)\n - Auth verification: ~10ms vs ~50ms (5x faster)\n- **OS-Native Keyring**: macOS Keychain, Linux Secret Service, Windows Credential Manager\n- **KMS Backends**: RustyVault, Age, AWS KMS, Vault, Cosmian\n- **Graceful Fallback**: Automatic fallback to HTTP if plugins not installed\n
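\nThe graceful fallback just described boils down to a capability check at call time. A minimal Nushell sketch — the `kms` helper name and the endpoint URL are assumptions for illustration, not the plugins' real interface:\n\n```nu\n# Prefer a fast local helper when present; otherwise fall back to the\n# HTTP API. The binary name and endpoint below are illustrative.\ndef kms-encrypt [plaintext: string] {\n    if (which kms | is-not-empty) {\n        ^kms encrypt $plaintext  # native path (~5ms per the numbers above)\n    } else {\n        # HTTP fallback (~50ms)\n        http post --content-type application/json http://localhost:8080/api/kms/encrypt ({ data: $plaintext } | to json)\n    }\n}\n```\n\nThe same check-then-degrade pattern applies to the auth and orchestrator plugins.\n\n### 11. 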
**Complete Security System** (v4.0.0)\n\nEnterprise-grade security with 39,699 lines across 12 components.\n\n- **12 Components**: JWT Auth, Cedar Authorization, MFA (TOTP + WebAuthn), Secrets Management, KMS, Audit Logging, Break-Glass, Compliance, Audit Query, Token Management, Access Control, Encryption\n- **Performance**: <20ms overhead per secure operation\n- **Testing**: 350+ comprehensive test cases\n- **API**: 83+ REST endpoints, 111+ CLI commands\n- **Standards**: GDPR, SOC2, ISO 27001 compliance\n- **Key Features**:\n - RS256 authentication with Argon2id hashing\n - Policy-as-code with hot reload\n - Multi-factor authentication (TOTP + WebAuthn/FIDO2)\n - Dynamic secrets (AWS STS, SSH keys) with TTL\n - 5 KMS backends with envelope encryption\n - 7-year audit retention with 5 export formats\n - Multi-party break-glass approval\n\n---\n\n## Technology Stack\n\n### Core Technologies\n\n| Technology | Version | Purpose | Why |\n| ------------ | --------- | --------- | ----- |\n| **Nickel** | Latest | PRIMARY - Infrastructure-as-code language | Type-safe schemas, lazy evaluation, LSP support, composable records, gradual validation |\n| **Nushell** | 0.109.0+ | Scripting and task automation | Structured data pipelines, cross-platform, modern built-in parsers (JSON/YAML/TOML) |\n| **Rust** | Latest | Platform services (orchestrator, control-center, installer) | Performance, memory safety, concurrency, reliability |\n| **KCL** | DEPRECATED | Legacy configuration (fully replaced by Nickel) | Migration bridge available; use Nickel for new work |\n\n### Data & State Management\n\n| Technology | Version | Purpose | Features |\n| ------------ | --------- | --------- | ---------- |\n| **SurrealDB** | Latest | High-performance graph database backend | Multi-model (document, graph, relational), real-time queries, distributed architecture, complex relationship tracking |\n\n### Platform Services (Rust-based)\n\n| Service | Purpose | Security Features |\n| --------- | --------- | ------------------- |\n| **Orchestrator** | Workflow execution, task scheduling, state management | File-based persistence, retry logic, checkpoint recovery |\n| **Control Center** | Web-based infrastructure management | **Authorization and permissions control**, RBAC, audit logging |\n| **Installer** | Platform installation (TUI + CLI modes) | Secure configuration generation, validation |\n| **API Gateway** | REST API for external integration | Authentication, rate limiting, request validation |\n| **MCP Server** | AI-powered configuration management | 7 settings tools, intelligent config completion |\n| **OCI Registry** | Extension distribution and versioning | Task services, providers, cluster templates |\n\n### Security & Secrets\n\n| Technology | Version | Purpose | Enterprise Features |\n| ------------ | --------- | --------- | --------------------- |\n| **SOPS** | 3.10.2+ | Secrets management | Encrypted configuration files |\n| **Age** | 1.2.1+ | Encryption | Secure key-based encryption |\n| **Cosmian KMS** | Latest | Key Management System | Confidential computing, secure key storage, cloud-native KMS |\n| **Cedar** | Latest | Policy engine | Fine-grained access control, policy-as-code, compliance checking, anomaly detection |\n| **RustyVault** | Latest | Transit encryption engine | 5ms encryption performance, multiple KMS backends |\n| **JWT** | Latest | Authentication tokens | RS256 signatures, Argon2id password hashing |\n| **Keyring** | Latest | OS-native secure storage | macOS Keychain, Linux Secret Service, 
Windows Credential Manager |\n\n### Version Management\n\n| Component | Purpose | Format |\n| ----------- | --------- | -------- |\n| **versions.ncl** | Core tool versions (Nickel primary) | Nickel schema |\n| **provider version.ncl** | Provider-specific versions | Nickel schema |\n| **provisioning setup versions** | Version file generator | Nushell command |\n| **versions file** | Bash-compatible exports | KEY="VALUE" format |\n\n**Usage**:\n```\n# Generate versions file from Nickel schemas\nprovisioning setup versions\n\n# Source in shell scripts\nsource /provisioning/core/versions\necho $NU_VERSION $PROVIDER_AWS_VERSION\n```\n\n### Optional Tools\n\n| Tool | Purpose |\n| ------ | --------- |\n| **K9s** | Kubernetes management interface |\n| **nu_plugin_tera** | Nushell plugin for Tera template rendering |\n| **nu_plugin_kcl** | Nushell plugin for KCL integration (CLI required, plugin optional) |\n| **nu_plugin_auth** | Authentication plugin (5x faster auth, OS keyring integration) |\n| **nu_plugin_kms** | KMS encryption plugin (10x faster, 5ms encryption) |\n| **nu_plugin_orchestrator** | Orchestrator plugin (30-50x faster queries) |\n| **glow** | Markdown rendering for interactive guides |\n| **bat** | Syntax highlighting for file viewing and guides |\n\n---\n\n## How It Works\n\n### Data Flow\n\n```\n1. User defines infrastructure in Nickel schemas\n ↓\n2. Nickel evaluates with type validation and lazy evaluation\n ↓\n3. CLI loads configuration (hierarchical merging)\n ↓\n4. Configuration validated against provider schemas\n ↓\n5. Workflow created with operations\n ↓\n6. Orchestrator receives workflow\n ↓\n7. Dependencies resolved (topological sort)\n ↓\n8. Operations executed in order (parallel where possible)\n ↓\n9. Providers handle cloud operations\n ↓\n10. Task services installed on servers\n ↓\n11. State persisted and monitored\n```\n\n### Example Workflow: Deploy Kubernetes Cluster\n\n**Step 1**: Define infrastructure in Nickel\n\n```\n# schemas/my-cluster.ncl\n{\n metadata = {\n name = "my-cluster"\n provider = "upcloud"\n environment = "production"\n }\n\n infrastructure = {\n servers = [\n {name = "control-01", plan = "medium", role = "control"}\n {name = "worker-01", plan = "large", role = "worker"}\n {name = "worker-02", plan = "large", role = "worker"}\n ]\n }\n\n services = {\n taskservs = ["kubernetes", "cilium", "rook-ceph"]\n }\n}\n```\n\n**Step 2**: Submit to Provisioning\n\n```\nprovisioning server create --infra my-cluster\n```\n\n**Step 3**: Provisioning executes workflow\n\n```\n1. Create workflow: "deploy-my-cluster"\n2. Resolve dependencies:\n - containerd (required by kubernetes)\n - etcd (required by kubernetes)\n - kubernetes (explicitly requested)\n - cilium (explicitly requested, requires kubernetes)\n - rook-ceph (explicitly requested, requires kubernetes)\n\n3. Execution order:\n a. Provision servers (parallel)\n b. Install containerd on all nodes\n c. Install etcd on control nodes\n d. Install kubernetes control plane\n e. Join worker nodes\n f. Install Cilium CNI\n g. Install Rook-Ceph storage\n\n4. Checkpoint after each step\n5. Monitor health checks\n6. Report completion\n```\n\n**Step 4**: Verify deployment\n\n```\nprovisioning cluster status my-cluster\n```\n\n### Configuration Hierarchy\n\nConfiguration values are resolved through a hierarchy:\n\n```\n1. System Defaults (provisioning/config/config.defaults.toml)\n ↓ (overridden by)\n2. User Preferences (~/.config/provisioning/user_config.yaml)\n ↓ (overridden by)\n3. 
Workspace Config (workspace/config/provisioning.yaml)\n ↓ (overridden by)\n4. Infrastructure Config (workspace/infra//config.toml)\n ↓ (overridden by)\n5. Environment Config (workspace/config/prod-defaults.toml)\n ↓ (overridden by)\n6. Runtime Flags (--flag value)\n```\n\n**Example**:\n\n```\n# System default\n[servers]\ndefault_plan = "small"\n\n# User preference\n[servers]\ndefault_plan = "medium" # Overrides system default\n\n# Infrastructure config\n[servers]\ndefault_plan = "large" # Overrides user preference\n\n# Runtime\nprovisioning server create --plan xlarge # Overrides everything\n```\n\n---\n\n## Use Cases\n\n### 1. **Multi-Cloud Kubernetes Deployment**\n\nDeploy Kubernetes clusters across different cloud providers with identical configuration.\n\n```\n# UpCloud cluster\nprovisioning cluster create k8s-prod --provider upcloud\n\n# AWS cluster (same config)\nprovisioning cluster create k8s-prod --provider aws\n```\n\n### 2. **Development → Staging → Production Pipeline**\n\nManage multiple environments with workspace switching.\n\n```\n# Development\nprovisioning workspace switch dev\nprovisioning cluster create app-stack\n\n# Staging (same config, different resources)\nprovisioning workspace switch staging\nprovisioning cluster create app-stack\n\n# Production (HA, larger resources)\nprovisioning workspace switch prod\nprovisioning cluster create app-stack\n```\n\n### 3. **Infrastructure as Code Testing**\n\nTest infrastructure changes before deploying to production.\n\n```\n# Test Kubernetes upgrade locally\nprovisioning test topology load kubernetes_3node | \n test env cluster kubernetes --version 1.29.0\n\n# Verify functionality\nprovisioning test env run \n\n# Cleanup\nprovisioning test env cleanup \n```\n\n### 4. **Batch Multi-Region Deployment**\n\nDeploy to multiple regions in parallel using Nickel batch workflows.\n\n```\n# schemas/batch/multi-region.ncl\n{\n batch_workflow = {\n operations = [\n {\n id = "eu-cluster"\n type = "cluster"\n region = "eu-west-1"\n cluster = "app-stack"\n }\n {\n id = "us-cluster"\n type = "cluster"\n region = "us-east-1"\n cluster = "app-stack"\n }\n {\n id = "asia-cluster"\n type = "cluster"\n region = "ap-south-1"\n cluster = "app-stack"\n }\n ]\n parallel_limit = 3 # All at once\n }\n}\n```\n\n```\nprovisioning batch submit schemas/batch/multi-region.ncl\nprovisioning batch monitor \n```\n\n### 5. **Automated Disaster Recovery**\n\nRecreate infrastructure from configuration.\n\n```\n# Infrastructure destroyed\nprovisioning workspace switch prod\n\n# Recreate from config\nprovisioning cluster create --infra backup-restore --wait\n\n# All services restored with same configuration\n```\n\n### 6. **CI/CD Integration**\n\nAutomated testing and deployment pipelines.\n\n```\n# .gitlab-ci.yml\ntest-infrastructure:\n script:\n - provisioning test quick kubernetes\n - provisioning test quick postgres\n\ndeploy-staging:\n script:\n - provisioning workspace switch staging\n - provisioning cluster create app-stack --check\n - provisioning cluster create app-stack --yes\n\ndeploy-production:\n when: manual\n script:\n - provisioning workspace switch prod\n - provisioning cluster create app-stack --yes\n```\n\n---\n\n## Getting Started\n\n### Quick Start\n\n1. **Install Prerequisites**\n\n ```bash\n # Install Nushell (0.109.0+)\n brew install nushell # macOS\n\n # Install Nickel (required for IaC)\n brew install nickel # macOS or from source\n\n # Install SOPS (optional, for encrypted secrets)\n brew install sops\n ```\n\n2. 
**Add CLI to PATH**\n\n ```bash\n ln -sf "$(pwd)/provisioning/core/cli/provisioning" /usr/local/bin/provisioning\n ```\n\n3. **Initialize Workspace**\n\n ```bash\n provisioning workspace init my-project\n cd my-project\n ```\n\n4. **Generate Versions File** (Optional - for bash scripts)\n\n ```bash\n provisioning setup versions\n # Creates /provisioning/core/versions with all tool and provider versions\n\n # Use in your deployment scripts\n source /provisioning/core/versions\n echo "Deploying with Nushell $NU_VERSION and AWS CLI $PROVIDER_AWS_VERSION"\n ```\n\n5. **Define Infrastructure (Nickel)**\n\n ```bash\n # Create workspace infrastructure schema\n cat > workspace/infra/my-cluster.ncl <<'EOF'\n {\n metadata.name = "my-cluster"\n metadata.provider = "upcloud"\n\n infrastructure.servers = [\n {name = "control-01", plan = "medium"}\n {name = "worker-01", plan = "large"}\n ]\n\n services.taskservs = ["kubernetes", "cilium"]\n }\n EOF\n ```\n\n6. **Deploy Infrastructure**\n\n ```bash\n # Validate configuration\n provisioning config validate\n\n # Check what will be created\n provisioning server create --check\n\n # Create servers\n provisioning server create --yes\n\n # Install Kubernetes\n provisioning taskserv create kubernetes\n ```\n\n### Learning Path\n\n1. **Start with Guides**\n\n ```bash\n provisioning sc # Quick reference\n provisioning guide from-scratch # Complete walkthrough\n ```\n\n2. **Explore Examples**\n\n ```bash\n ls provisioning/examples/\n ```\n\n3. **Read Architecture Docs**\n - [Core Engine](provisioning/core/README.md)\n - [CLI Architecture](.claude/features/cli-architecture.md)\n - [Configuration System](.claude/features/configuration-system.md)\n - [Batch Workflows](.claude/features/batch-workflow-system.md)\n\n4. **Try Test Environments**\n\n ```bash\n provisioning test quick kubernetes\n provisioning test quick postgres\n ```\n\n5. 
**Build Custom Extensions**\n - Create custom task services\n - Define cluster templates\n - Write workflow automation\n\n---\n\n## Documentation Index\n\n### User & Operations Guides\n\nSee **[provisioning/docs/src/](provisioning/docs/src/)** for comprehensive documentation:\n\n- **Quick Start** - Get started in 10 minutes\n- **Command Reference** - Complete CLI command reference\n- **Nickel Configuration Guide** - IaC language and patterns\n- **Workspace Management** - Multi-workspace guide\n- **Test Environment Guide** - Testing infrastructure with containers\n- **Plugin Integration** - Native Rust plugins (10-50x faster)\n- **Security System** - Authentication, MFA, KMS, Cedar policies\n- **Operations** - Deployment, monitoring, incident response\n\n### Architecture & Design Decisions\n\nSee **[docs/src/architecture/](docs/src/architecture/)** for design patterns:\n\n- **System Architecture** - Multi-layer design\n- **ADRs (Architecture Decision Records)** - Major decisions including:\n - ADR-011: Nickel Migration (from KCL)\n - ADR-012: Nushell + Nickel plugin wrapper\n - ADR-010: Configuration format strategy\n- **Multi-Repo Strategy** - Repository organization\n- **Integration Patterns** - How components interact\n\n### Development Guidelines\n\n- **[Repository Structure](docs/src/development/)** - Codebase organization\n- **[Contributing Guide](CONTRIBUTING.md)** - How to contribute\n- **[Nushell Guidelines](.claude/guidelines/nushell/)** - Best practices\n- **[Nickel Guidelines](.claude/guidelines/nickel.md)** - IaC patterns\n- **[Rust Guidelines](.claude/guidelines/rust/)** - Rust conventions\n\n### API Reference\n\n- **REST API** - HTTP endpoints in `provisioning/docs/src/api-reference/`\n- **Nushell API** - Library functions and modules\n- **Provider API** - Cloud provider interface specification\n\n---\n\n## Project Status\n\n**Current Version**: v5.0.0-nickel (Production Ready) | **Date**: 2026-01-08\n\n### Completed Milestones\n\n- ✅ **v5.0.0** (2026-01-08) - **Nickel IaC Migration Complete**\n - Full KCL→Nickel migration\n - Schema-driven configuration system\n - Type-safe lazy evaluation\n - ~220 legacy files removed, ~250 new schema files added\n\n- ✅ **v3.6.0** (2026-01-08) - Version Management System\n - Centralized tool and provider version management\n - Bash-compatible versions file generation\n - `provisioning setup versions` command\n - Automatic provider version discovery from Nickel schemas\n - Shell script integration with sourcing support\n\n- ✅ **v4.0.0** (2025-10-09) - Complete Security System (12 components, 39,699 lines)\n- ✅ **v3.5.0** (2025-10-07) - Platform Installer with TUI and CI/CD modes\n- ✅ **v3.4.0** (2025-10-06) - Test Environment Service with container management\n- ✅ **v3.3.0** (2025-09-30) - Interactive Guides system\n- ✅ **v3.2.0** (2025-09-30) - Modular CLI Architecture (84% code reduction)\n- ✅ **v3.1.0** (2025-09-25) - Batch Workflow System (85-90% token efficiency)\n- ✅ **v3.0.0** (2025-09-25) - Hybrid Orchestrator (Rust/Nushell)\n- ✅ **v2.0.5** (2025-10-02) - Workspace Switching system\n- ✅ **v2.0.0** (2025-09-23) - Configuration System (476+ accessors)\n- ✅ **v1.0.0** (2025-10-09) - Nushell Plugins Integration (10-50x performance)\n\n### Current Focus\n\n- **Nickel Ecosystem** - IDE support, LSP integration, schema libraries\n- **Platform Consolidation** - GitHub Actions CI/CD, cross-platform testing\n- **Extension Registry** - OCI-based distribution for task services and providers\n- **Documentation** - Complete Nickel migration 
guides, ADR updates\n\n---\n\n## Support and Community\n\n### Getting Help\n\n- **Documentation**: Start with `provisioning help` or `provisioning guide from-scratch`\n- **Issues**: Report bugs and request features on the issue tracker\n- **Discussions**: Join community discussions for questions and ideas\n\n### Contributing\n\nContributions are welcome! See [CONTRIBUTING.md](docs/development/CONTRIBUTING.md) for guidelines.\n\n**Key areas for contribution**:\n\n- New task service definitions\n- Cloud provider implementations\n- Cluster templates\n- Documentation improvements\n- Bug fixes and testing\n\n---\n\n## License\n\nSee [LICENSE](LICENSE) file in project root.\n\n---\n\n**Maintained By**: Architecture Team\n**Last Updated**: 2026-01-08 (Version Management System v3.6.0 + Nickel v5.0.0 Migration Complete)\n**Current Branch**: nickel\n**Project Home**: [provisioning/](provisioning/)\n\n---\n\n## Recent Changes (2026-01-08)\n\n### Version Management System (v3.6.0)\n\n**What Changed**:\n- ✅ Implemented `provisioning setup versions` command\n- ✅ Generates bash-compatible `/provisioning/core/versions` file\n- ✅ Automatically discovers and includes all provider versions from Nickel schemas\n- ✅ Fixed to remove redundant metadata (all sources are Nickel)\n- ✅ Core tools with aliases: NUSHELL→NU, NICKEL, SOPS, AGE, K9S\n- ✅ Shell script integration: `source /provisioning/core/versions && echo $NU_VERSION`\n\n**Files Modified**:\n- `provisioning/core/nulib/lib_provisioning/setup/utils.nu` - Core implementation\n- `provisioning/core/nulib/main_provisioning/commands/setup.nu` - Command routing\n- `provisioning/core/nulib/lib_provisioning/workspace/enforcement.nu` - Workspace exemption\n- `provisioning/README.md` - Documentation updates\n\n**Generated File Example**:\n```\nNUSHELL_VERSION="0.109.1"\nNUSHELL_SOURCE="https://github.com/nushell/nushell/releases"\nNU_VERSION="0.109.1"\nNU_SOURCE="https://github.com/nushell/nushell/releases"\n\nNICKEL_VERSION="1.15.1"\nNICKEL_SOURCE="https://github.com/tweag/nickel/releases"\n\nPROVIDER_AWS_VERSION="2.32.11"\nPROVIDER_AWS_SOURCE="https://github.com/aws/aws-cli/releases"\n# ... and more providers\n```\n\n**Key Improvements**:\n- Clean metadata (no redundant `_LIB` fields - all sources are Nickel)\n- Automatic provider discovery from `extensions/providers/*/nickel/version.ncl`\n- Direct Nickel file parsing with JSON export\n- Zero dependency on environment variables or legacy systems\n- 100% bash/shell compatible for deployment scripts \ No newline at end of file diff --git a/config/README.md b/config/README.md index e46bf9e..e0bb93d 100644 --- a/config/README.md +++ b/config/README.md @@ -1 +1 @@ -# Platform Configuration Management\n\nThis directory manages **runtime configurations** for provisioning platform services.\n\n## Structure\n\n```\nprovisioning/config/\n├── runtime/ # 🔒 PRIVATE (gitignored)\n│ ├── .gitignore # Runtime files are private\n│ ├── orchestrator.solo.ncl # Runtime config (editable)\n│ ├── vault-service.multiuser.ncl # Runtime config (editable)\n│ └── generated/ # 📄 Auto-generated TOMLs\n│ ├── orchestrator.solo.toml # Exported from .ncl\n│ └── vault-service.multiuser.toml\n│\n├── examples/ # 📘 PUBLIC (reference)\n│ ├── orchestrator.solo.example.ncl\n│ └── orchestrator.enterprise.example.ncl\n│\n├── README.md # This file\n└── setup-platform-config.sh # ← See provisioning/scripts/setup-platform-config.sh\n```\n\n## Quick Start\n\n### 1. 
Setup Platform Configuration (First Time)\n\n```\n# Interactive wizard (recommended)\n./provisioning/scripts/setup-platform-config.sh\n\n# Or quick setup for all services in solo mode\n./provisioning/scripts/setup-platform-config.sh --quick-mode --mode solo\n```\n\n### 2. Run Services\n\n```\n# Service reads config from generated TOML\nexport ORCHESTRATOR_MODE=solo\ncargo run -p orchestrator\n\n# Or with explicit config path\nexport ORCHESTRATOR_CONFIG=provisioning/config/runtime/generated/orchestrator.solo.toml\ncargo run -p orchestrator\n```\n\n### 3. Update Configuration\n\n**Option A: Interactive (Recommended)**\n```\n# Update via TypeDialog UI\n./provisioning/scripts/setup-platform-config.sh --service orchestrator --mode solo\n```\n\n**Option B: Manual Edit**\n```\n# Edit Nickel directly\nvim provisioning/config/runtime/orchestrator.solo.ncl\n\n# ⚠️ CRITICAL: Regenerate TOML afterward\n./provisioning/scripts/setup-platform-config.sh --generate-toml\n```\n\n## Configuration Layers\n\n```\n📘 PUBLIC (provisioning/schemas/platform/)\n├── schemas/ → Type contracts (Nickel)\n├── defaults/ → Base configuration values\n│ └── deployment/ → Mode-specific overlays (solo/multiuser/cicd/enterprise)\n├── validators/ → Business logic validation\n└── common/\n └── helpers.ncl → Merge functions\n\n ⬇️ COMPOSITION PROCESS ⬇️\n\n🔒 PRIVATE (provisioning/config/runtime/)\n├── orchestrator.solo.ncl ← User editable\n│ (imports schemas + defaults + mode overlay)\n│ (uses helpers.compose_config for merge)\n│\n└── generated/\n └── orchestrator.solo.toml ← Auto-exported for Rust services\n (generated by: nickel export --format toml)\n```\n\n## Key Concepts\n\n### Schema (Type Contract)\n- **File**: `provisioning/schemas/platform/schemas/orchestrator.ncl`\n- **Purpose**: Defines valid fields, types, constraints\n- **Status**: 📘 PUBLIC, versioned, source of truth\n- **Edit**: Rarely (architecture changes only)\n\n### Defaults (Base Values)\n- **File**: `provisioning/schemas/platform/defaults/orchestrator-defaults.ncl`\n- **Purpose**: Default values for all orchestrator settings\n- **Status**: 📘 PUBLIC, versioned, part of product\n- **Edit**: When changing default behavior\n\n### Mode Overlay (Tuning)\n- **File**: `provisioning/schemas/platform/defaults/deployment/solo-defaults.ncl`\n- **Purpose**: Mode-specific resource/behavior tuning\n- **Status**: 📘 PUBLIC, versioned\n- **Example**: solo mode uses 2 CPU, enterprise uses 16+ CPU\n\n### Runtime Config (User Customization)\n- **File**: `provisioning/config/runtime/orchestrator.solo.ncl`\n- **Purpose**: Actual deployment configuration (can be hand-edited)\n- **Status**: 🔒 PRIVATE, gitignored\n- **Edit**: Yes, use setup script or edit manually + regenerate TOML\n\n### Generated TOML (Service Consumption)\n- **File**: `provisioning/config/runtime/generated/orchestrator.solo.toml`\n- **Purpose**: What Rust services actually read\n- **Status**: 🔒 PRIVATE, gitignored, auto-generated\n- **Edit**: NO - regenerate from .ncl instead\n- **Generation**: `nickel export --format toml `\n\n## Workflows\n\n### Scenario 1: First-Time Setup\n\n```\n# 1. Run setup script\n./provisioning/scripts/setup-platform-config.sh\n\n# 2. Choose action (TypeDialog or Quick Mode)\n# ↓\n# TypeDialog: User fills form → generates orchestrator.solo.ncl\n# Quick Mode: Composes defaults + mode overlay → generates all 8 services\n\n# 3. Script auto-exports to TOML\n# orchestrator.solo.ncl → orchestrator.solo.toml\n\n# 4. 
Service reads TOML\n# cargo run -p orchestrator (reads generated/orchestrator.solo.toml)\n```\n\n### Scenario 2: Update Configuration\n\n```\n# Option A: Interactive TypeDialog\n./provisioning/scripts/setup-platform-config.sh \\n --service orchestrator \\n --mode solo \\n --backend web\n\n# Result: Updated orchestrator.solo.ncl + auto-exported TOML\n\n# Option B: Manual Edit\nvim provisioning/config/runtime/orchestrator.solo.ncl\n\n# ⚠️ CRITICAL: Must regenerate TOML\n./provisioning/scripts/setup-platform-config.sh --generate-toml\n\n# Result: Updated TOML in generated/\n```\n\n### Scenario 3: Switch Deployment Mode\n\n```\n# From solo to enterprise\n./provisioning/scripts/setup-platform-config.sh --quick-mode --mode enterprise\n\n# Result: All 8 services configured for enterprise mode\n# 16+ CPU, 32+ GB RAM, HA setup, KMS integration, etc.\n```\n\n### Scenario 4: Workspace-Specific Overrides\n\n```\nworkspace_librecloud/\n├── config/\n│ └── platform-overrides.ncl # Workspace customization\n│\n# Example:\n# {\n# orchestrator.server.port = 9999,\n# orchestrator.workspace.name = "librecloud",\n# vault-service.storage.path = "./workspace_librecloud/data/vault"\n# }\n```\n\n## Important Notes\n\n### ⚠️ Manual Edits Require TOML Regeneration\n\nIf you edit `.ncl` files directly:\n\n```\n# 1. Edit the .ncl file\nvim provisioning/config/runtime/orchestrator.solo.ncl\n\n# 2. ALWAYS regenerate TOML\n./provisioning/scripts/setup-platform-config.sh --generate-toml\n\n# Service will NOT see your changes until TOML is regenerated\n```\n\n### 🔒 Private by Design\n\nRuntime configs are **gitignored** for good reasons:\n\n- **May contain secrets**: Encrypted credentials, API keys, tokens\n- **Deployment-specific**: Different values per environment\n- **User-customized**: Each developer/workspace has different needs\n- **Not shared**: Don't commit locally-built configs\n\n### 📘 Schemas are Public\n\nSchema/defaults in `provisioning/schemas/` are **version-controlled**:\n\n- Product definition (part of releases)\n- Shared across team\n- Source of truth for config structure\n- Can reference in documentation\n\n### 🔄 Idempotent Setup\n\nThe setup script is safe to run multiple times:\n\n```\n# Safe: Updates only what's needed\n./provisioning/scripts/setup-platform-config.sh --quick-mode --mode enterprise\n\n# Safe: Doesn't overwrite unless --clean is used\n./provisioning/scripts/setup-platform-config.sh --generate-toml\n\n# Use --clean to start fresh\n./provisioning/scripts/setup-platform-config.sh --clean\n```\n\n## Service Configuration Paths\n\nEach service loads config using this priority:\n\n```\n1. Environment variable: ORCHESTRATOR_CONFIG=/path/to/custom.toml\n2. Mode-specific runtime: provisioning/config/runtime/generated/orchestrator.{MODE}.toml\n3. 
Fallback defaults: provisioning/schemas/platform/defaults/orchestrator-defaults.ncl\n```\n\n## Configuration Composition (Technical)\n\nThe setup script uses Nickel's `helpers.compose_config` function:\n\n```\n# Generated .ncl file imports:\nlet helpers = import "provisioning/schemas/platform/common/helpers.ncl"\nlet defaults = import "provisioning/schemas/platform/defaults/orchestrator-defaults.ncl"\nlet mode_config = import "provisioning/schemas/platform/defaults/deployment/solo-defaults.ncl"\n\n# Compose: base + mode overlay\nhelpers.compose_config defaults mode_config {}\n# ^base ^mode overlay ^user overrides (empty if not customized)\n```\n\nThis ensures:\n- Type safety (validated by Nickel schema)\n- Proper layering (base + mode + user)\n- Reproducibility (same compose always produces same result)\n- Extensibility (can add user layer via Nickel import)\n\n## Troubleshooting\n\n### Config Won't Generate TOML\n\n```\n# Check Nickel syntax\nnickel typecheck provisioning/config/runtime/orchestrator.solo.ncl\n\n# Check for schema import errors\nnickel export --format json provisioning/config/runtime/orchestrator.solo.ncl\n\n# View detailed error message\nnickel typecheck -i provisioning/config/runtime/orchestrator.solo.ncl 2>&1 | less\n```\n\n### Service Won't Start\n\n```\n# Verify TOML exists\nls -la provisioning/config/runtime/generated/orchestrator.solo.toml\n\n# Verify TOML syntax\ntoml-cli validate provisioning/config/runtime/generated/orchestrator.solo.toml\n\n# Check service config loading\nRUST_LOG=debug cargo run -p orchestrator 2>&1 | head -50\n```\n\n### Wrong Configuration Being Used\n\n```\n# Verify environment mode\necho $ORCHESTRATOR_MODE # Should be: solo, multiuser, cicd, or enterprise\n\n# Check which file service is reading\nORCHESTRATOR_CONFIG=provisioning/config/runtime/generated/orchestrator.solo.toml \\n cargo run -p orchestrator\n\n# Verify file modification time\nls -lah provisioning/config/runtime/generated/orchestrator.*.toml\n```\n\n## Integration Points\n\n### ⚠️ Provisioning Installer Status\n\n**Current Status**: Installer NOT YET IMPLEMENTED\n\nThe `setup-platform-config.sh` script is a **standalone tool** that:\n- ✅ Works independently from the provisioning installer\n- ✅ Can be called manually for configuration setup\n- ⏳ Will be integrated into the installer once it's implemented\n\n**For Now**: Use script manually before running services:\n\n```\n# Manual setup (until installer is implemented)\ncd /path/to/project-provisioning\n./provisioning/scripts/setup-platform-config.sh --quick-mode --mode solo\n\n# Then run services\nexport ORCHESTRATOR_MODE=solo\ncargo run -p orchestrator\n```\n\n### Future: Integration into Provisioning Installer\n\nOnce `provisioning/scripts/install.sh` is implemented, it will automatically call this script:\n\n```\n#!/bin/bash\n# provisioning/scripts/install.sh (FUTURE - NOT YET IMPLEMENTED)\n\n# Pre-flight checks (verification of dependencies, paths, permissions)\ncheck_dependencies() {\n command -v nickel >/dev/null || { echo "Nickel required"; exit 1; }\n command -v nu >/dev/null || { echo "Nushell required"; exit 1; }\n}\ncheck_dependencies\n\n# Install core provisioning system\necho "Installing provisioning system..."\n# (install implementation details here)\n\n# Setup platform configurations\necho "Setting up platform configurations..."\n./provisioning/scripts/setup-platform-config.sh --quick-mode --mode solo\n\n# Build and test platform services\necho "Building platform services..."\ncargo build -p orchestrator -p 
control-center -p mcp-server\n\n# Verify services are operational\necho "Verification complete - services ready to run"\n```\n\n### CI/CD Pipeline Integration\n\nFor automated CI/CD setups (can use now):\n\n```\n#!/bin/bash\n# ci/setup.sh\n\n# Setup configurations for CI/CD mode\ncd /path/to/project-provisioning\n./provisioning/scripts/setup-platform-config.sh \\n --quick-mode \\n --mode cicd\n\n# Result: All services configured for CI/CD mode\n# (ephemeral, API-driven, fast cleanup, minimal resource footprint)\n\n# Run tests\ncargo test --all\n\n# Deploy (CI/CD specific)\ndocker-compose -f provisioning/platform/infrastructure/docker/docker-compose.cicd.yml up\n```\n\n---\n\n**Version**: 1.0.0\n**Last Updated**: 2026-01-05\n**Script Reference**: `provisioning/scripts/setup-platform-config.sh` +# Platform Configuration Management\n\nThis directory manages **runtime configurations** for provisioning platform services.\n\n## Structure\n\n```\nprovisioning/config/\n├── runtime/ # 🔒 PRIVATE (gitignored)\n│ ├── .gitignore # Runtime files are private\n│ ├── orchestrator.solo.ncl # Runtime config (editable)\n│ ├── vault-service.multiuser.ncl # Runtime config (editable)\n│ └── generated/ # 📄 Auto-generated TOMLs\n│ ├── orchestrator.solo.toml # Exported from .ncl\n│ └── vault-service.multiuser.toml\n│\n├── examples/ # 📘 PUBLIC (reference)\n│ ├── orchestrator.solo.example.ncl\n│ └── orchestrator.enterprise.example.ncl\n│\n├── README.md # This file\n└── setup-platform-config.sh # ← See provisioning/scripts/setup-platform-config.sh\n```\n\n## Quick Start\n\n### 1. Setup Platform Configuration (First Time)\n\n```\n# Interactive wizard (recommended)\n./provisioning/scripts/setup-platform-config.sh\n\n# Or quick setup for all services in solo mode\n./provisioning/scripts/setup-platform-config.sh --quick-mode --mode solo\n```\n\n### 2. Run Services\n\n```\n# Service reads config from generated TOML\nexport ORCHESTRATOR_MODE=solo\ncargo run -p orchestrator\n\n# Or with explicit config path\nexport ORCHESTRATOR_CONFIG=provisioning/config/runtime/generated/orchestrator.solo.toml\ncargo run -p orchestrator\n```\n
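\nThe lookup implied by these two environment variables can be sketched in a few lines of Nushell; the helper name is illustrative, not part of the platform:\n\n```nu\n# Resolve which TOML the orchestrator will read: an explicit\n# ORCHESTRATOR_CONFIG wins, otherwise the mode-specific generated file.\ndef orchestrator-config-path [] {\n    let explicit = ($env.ORCHESTRATOR_CONFIG? | default "")\n    if $explicit != "" {\n        $explicit\n    } else {\n        let mode = ($env.ORCHESTRATOR_MODE? | default "solo")\n        $"provisioning/config/runtime/generated/orchestrator.($mode).toml"\n    }\n}\n```\n\n### 3. 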
Update Configuration\n\n**Option A: Interactive (Recommended)**\n```\n# Update via TypeDialog UI\n./provisioning/scripts/setup-platform-config.sh --service orchestrator --mode solo\n```\n\n**Option B: Manual Edit**\n```\n# Edit Nickel directly\nvim provisioning/config/runtime/orchestrator.solo.ncl\n\n# ⚠️ CRITICAL: Regenerate TOML afterward\n./provisioning/scripts/setup-platform-config.sh --generate-toml\n```\n\n## Configuration Layers\n\n```\n📘 PUBLIC (provisioning/schemas/platform/)\n├── schemas/ → Type contracts (Nickel)\n├── defaults/ → Base configuration values\n│ └── deployment/ → Mode-specific overlays (solo/multiuser/cicd/enterprise)\n├── validators/ → Business logic validation\n└── common/\n └── helpers.ncl → Merge functions\n\n ⬇️ COMPOSITION PROCESS ⬇️\n\n🔒 PRIVATE (provisioning/config/runtime/)\n├── orchestrator.solo.ncl ← User editable\n│ (imports schemas + defaults + mode overlay)\n│ (uses helpers.compose_config for merge)\n│\n└── generated/\n └── orchestrator.solo.toml ← Auto-exported for Rust services\n (generated by: nickel export --format toml)\n```\n\n## Key Concepts\n\n### Schema (Type Contract)\n- **File**: `provisioning/schemas/platform/schemas/orchestrator.ncl`\n- **Purpose**: Defines valid fields, types, constraints\n- **Status**: 📘 PUBLIC, versioned, source of truth\n- **Edit**: Rarely (architecture changes only)\n\n### Defaults (Base Values)\n- **File**: `provisioning/schemas/platform/defaults/orchestrator-defaults.ncl`\n- **Purpose**: Default values for all orchestrator settings\n- **Status**: 📘 PUBLIC, versioned, part of product\n- **Edit**: When changing default behavior\n\n### Mode Overlay (Tuning)\n- **File**: `provisioning/schemas/platform/defaults/deployment/solo-defaults.ncl`\n- **Purpose**: Mode-specific resource/behavior tuning\n- **Status**: 📘 PUBLIC, versioned\n- **Example**: solo mode uses 2 CPU, enterprise uses 16+ CPU\n\n### Runtime Config (User Customization)\n- **File**: `provisioning/config/runtime/orchestrator.solo.ncl`\n- **Purpose**: Actual deployment configuration (can be hand-edited)\n- **Status**: 🔒 PRIVATE, gitignored\n- **Edit**: Yes, use setup script or edit manually + regenerate TOML\n\n### Generated TOML (Service Consumption)\n- **File**: `provisioning/config/runtime/generated/orchestrator.solo.toml`\n- **Purpose**: What Rust services actually read\n- **Status**: 🔒 PRIVATE, gitignored, auto-generated\n- **Edit**: NO - regenerate from .ncl instead\n- **Generation**: `nickel export --format toml `\n\n## Workflows\n\n### Scenario 1: First-Time Setup\n\n```\n# 1. Run setup script\n./provisioning/scripts/setup-platform-config.sh\n\n# 2. Choose action (TypeDialog or Quick Mode)\n# ↓\n# TypeDialog: User fills form → generates orchestrator.solo.ncl\n# Quick Mode: Composes defaults + mode overlay → generates all 8 services\n\n# 3. Script auto-exports to TOML\n# orchestrator.solo.ncl → orchestrator.solo.toml\n\n# 4. 
Service reads TOML\n# cargo run -p orchestrator (reads generated/orchestrator.solo.toml)\n```\n\n### Scenario 2: Update Configuration\n\n```\n# Option A: Interactive TypeDialog\n./provisioning/scripts/setup-platform-config.sh \n --service orchestrator \n --mode solo \n --backend web\n\n# Result: Updated orchestrator.solo.ncl + auto-exported TOML\n\n# Option B: Manual Edit\nvim provisioning/config/runtime/orchestrator.solo.ncl\n\n# ⚠️ CRITICAL: Must regenerate TOML\n./provisioning/scripts/setup-platform-config.sh --generate-toml\n\n# Result: Updated TOML in generated/\n```\n\n### Scenario 3: Switch Deployment Mode\n\n```\n# From solo to enterprise\n./provisioning/scripts/setup-platform-config.sh --quick-mode --mode enterprise\n\n# Result: All 8 services configured for enterprise mode\n# 16+ CPU, 32+ GB RAM, HA setup, KMS integration, etc.\n```\n\n### Scenario 4: Workspace-Specific Overrides\n\n```\nworkspace_librecloud/\n├── config/\n│ └── platform-overrides.ncl # Workspace customization\n│\n# Example:\n# {\n# orchestrator.server.port = 9999,\n# orchestrator.workspace.name = "librecloud",\n# vault-service.storage.path = "./workspace_librecloud/data/vault"\n# }\n```\n\n## Important Notes\n\n### ⚠️ Manual Edits Require TOML Regeneration\n\nIf you edit `.ncl` files directly:\n\n```\n# 1. Edit the .ncl file\nvim provisioning/config/runtime/orchestrator.solo.ncl\n\n# 2. ALWAYS regenerate TOML\n./provisioning/scripts/setup-platform-config.sh --generate-toml\n\n# Service will NOT see your changes until TOML is regenerated\n```\n\n### 🔒 Private by Design\n\nRuntime configs are **gitignored** for good reasons:\n\n- **May contain secrets**: Encrypted credentials, API keys, tokens\n- **Deployment-specific**: Different values per environment\n- **User-customized**: Each developer/workspace has different needs\n- **Not shared**: Don't commit locally-built configs\n\n### 📘 Schemas are Public\n\nSchema/defaults in `provisioning/schemas/` are **version-controlled**:\n\n- Product definition (part of releases)\n- Shared across team\n- Source of truth for config structure\n- Can reference in documentation\n\n### 🔄 Idempotent Setup\n\nThe setup script is safe to run multiple times:\n\n```\n# Safe: Updates only what's needed\n./provisioning/scripts/setup-platform-config.sh --quick-mode --mode enterprise\n\n# Safe: Doesn't overwrite unless --clean is used\n./provisioning/scripts/setup-platform-config.sh --generate-toml\n\n# Use --clean to start fresh\n./provisioning/scripts/setup-platform-config.sh --clean\n```\n\n## Service Configuration Paths\n\nEach service loads config using this priority:\n\n```\n1. Environment variable: ORCHESTRATOR_CONFIG=/path/to/custom.toml\n2. Mode-specific runtime: provisioning/config/runtime/generated/orchestrator.{MODE}.toml\n3. 
\n\n## Configuration Composition (Technical)\n\nThe setup script uses Nickel's `helpers.compose_config` function:\n\n```\n# Generated .ncl file imports:\nlet helpers = import "provisioning/schemas/platform/common/helpers.ncl" in\nlet defaults = import "provisioning/schemas/platform/defaults/orchestrator-defaults.ncl" in\nlet mode_config = import "provisioning/schemas/platform/defaults/deployment/solo-defaults.ncl" in\n\n# Compose: base + mode overlay\nhelpers.compose_config defaults mode_config {}\n# ^base ^mode overlay ^user overrides (empty if not customized)\n```\n\nThis ensures:\n- Type safety (validated by Nickel schema)\n- Proper layering (base + mode + user)\n- Reproducibility (same compose always produces same result)\n- Extensibility (can add user layer via Nickel import)\n\n## Troubleshooting\n\n### Config Won't Generate TOML\n\n```\n# Check Nickel syntax\nnickel typecheck provisioning/config/runtime/orchestrator.solo.ncl\n\n# Check for schema import errors\nnickel export --format json provisioning/config/runtime/orchestrator.solo.ncl\n\n# View detailed error message\nnickel typecheck -i provisioning/config/runtime/orchestrator.solo.ncl 2>&1 | less\n```\n\n### Service Won't Start\n\n```\n# Verify TOML exists\nls -la provisioning/config/runtime/generated/orchestrator.solo.toml\n\n# Verify TOML syntax\ntoml-cli validate provisioning/config/runtime/generated/orchestrator.solo.toml\n\n# Check service config loading\nRUST_LOG=debug cargo run -p orchestrator 2>&1 | head -50\n```\n\n### Wrong Configuration Being Used\n\n```\n# Verify environment mode\necho $ORCHESTRATOR_MODE # Should be: solo, multiuser, cicd, or enterprise\n\n# Check which file service is reading\nORCHESTRATOR_CONFIG=provisioning/config/runtime/generated/orchestrator.solo.toml \\n cargo run -p orchestrator\n\n# Verify file modification time\nls -lah provisioning/config/runtime/generated/orchestrator.*.toml\n```
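\n\nThe checks above can be rolled into one quick script; this is an illustrative sketch only (the `SERVICE`/`MODE` variables are hypothetical, and it assumes the generated-TOML layout described in this README):\n\n```\n#!/bin/bash\n# sanity-check.sh (illustrative): verify a service's generated config\nSERVICE="${1:-orchestrator}"\nMODE="${2:-solo}"\nncl="provisioning/config/runtime/${SERVICE}.${MODE}.ncl"\ntoml="provisioning/config/runtime/generated/${SERVICE}.${MODE}.toml"\n\n[ -f "$toml" ] || { echo "missing $toml - run setup-platform-config.sh --generate-toml"; exit 1; }\n# Warn when the .ncl was edited after the TOML was last generated\n[ "$ncl" -nt "$toml" ] && echo "WARNING: $ncl is newer than $toml - regenerate the TOML"\necho "OK: $toml"\n```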
\n\n## Integration Points\n\n### ⚠️ Provisioning Installer Status\n\n**Current Status**: Installer NOT YET IMPLEMENTED\n\nThe `setup-platform-config.sh` script is a **standalone tool** that:\n- ✅ Works independently from the provisioning installer\n- ✅ Can be called manually for configuration setup\n- ⏳ Will be integrated into the installer once it's implemented\n\n**For Now**: Use script manually before running services:\n\n```\n# Manual setup (until installer is implemented)\ncd /path/to/project-provisioning\n./provisioning/scripts/setup-platform-config.sh --quick-mode --mode solo\n\n# Then run services\nexport ORCHESTRATOR_MODE=solo\ncargo run -p orchestrator\n```\n\n### Future: Integration into Provisioning Installer\n\nOnce `provisioning/scripts/install.sh` is implemented, it will automatically call this script:\n\n```\n#!/bin/bash\n# provisioning/scripts/install.sh (FUTURE - NOT YET IMPLEMENTED)\n\n# Pre-flight checks (verification of dependencies, paths, permissions)\ncheck_dependencies() {\n command -v nickel >/dev/null || { echo "Nickel required"; exit 1; }\n command -v nu >/dev/null || { echo "Nushell required"; exit 1; }\n}\ncheck_dependencies\n\n# Install core provisioning system\necho "Installing provisioning system..."\n# (install implementation details here)\n\n# Setup platform configurations\necho "Setting up platform configurations..."\n./provisioning/scripts/setup-platform-config.sh --quick-mode --mode solo\n\n# Build and test platform services\necho "Building platform services..."\ncargo build -p orchestrator -p control-center -p mcp-server\n\n# Verify services are operational\necho "Verification complete - services ready to run"\n```\n\n### CI/CD Pipeline Integration\n\nFor automated CI/CD setups (can use now):\n\n```\n#!/bin/bash\n# ci/setup.sh\n\n# Setup configurations for CI/CD mode\ncd /path/to/project-provisioning\n./provisioning/scripts/setup-platform-config.sh \\n --quick-mode \\n --mode cicd\n\n# Result: All services configured for CI/CD mode\n# (ephemeral, API-driven, fast cleanup, minimal resource footprint)\n\n# Run tests\ncargo test --all\n\n# Deploy (CI/CD specific)\ndocker-compose -f provisioning/platform/infrastructure/docker/docker-compose.cicd.yml up\n```\n\n---\n\n**Version**: 1.0.0\n**Last Updated**: 2026-01-05\n**Script Reference**: `provisioning/scripts/setup-platform-config.sh` \ No newline at end of file diff --git a/config/examples/README.md b/config/examples/README.md index 4d8317f..d666b48 100644 --- a/config/examples/README.md +++ b/config/examples/README.md @@ -1 +1 @@ -# Example Platform Service Configurations\n\nThis directory contains reference configurations for platform services in different deployment modes. These examples show realistic settings and best practices for each mode.\n\n## What Are These Examples?\n\nThese are **Nickel configuration files** (.ncl format) that demonstrate how to configure the provisioning platform services. They show:\n\n- Recommended settings for each deployment mode\n- How to customize services for your environment\n- Best practices for development, staging, and production\n- Performance tuning for different scenarios\n- Security settings appropriate to each mode\n\n## Directory Structure\n\n```\nprovisioning/config/examples/\n├── README.md # This file\n├── orchestrator.solo.example.ncl # Development mode reference\n├── orchestrator.multiuser.example.ncl # Team staging reference\n└── orchestrator.enterprise.example.ncl # Production reference\n```\n\n## Deployment Modes\n\n### Solo Mode (Development)\n\n**File**: `orchestrator.solo.example.ncl`\n\n**Characteristics**:\n- 2 CPU, 4GB RAM (lightweight)\n- Single user/developer\n- Local development machine\n- Minimal resource consumption\n- No TLS or authentication\n- In-memory storage\n\n**When to use**:\n- Local development\n- Testing configurations\n- Learning the platform\n- CI/CD test environments\n\n**Key Settings**:\n- workers: 2\n- max_concurrent_tasks: 2\n- max_memory: 1GB\n- tls: disabled\n- auth: disabled\n\n### Multiuser Mode (Team Staging)\n\n**File**: `orchestrator.multiuser.example.ncl`\n\n**Characteristics**:\n- 4 CPU, 8GB RAM (moderate)\n- Multiple concurrent users\n- Team staging environment\n- Production-like testing\n- Basic TLS and token auth\n- Filesystem storage with caching\n\n**When to use**:\n- Team development\n- Integration testing\n- Staging environment\n- Pre-production validation\n- Multi-user environments\n\n**Key Settings**:\n- workers: 4\n- max_concurrent_tasks: 10\n- max_memory: 4GB\n- tls: enabled (certificates required)\n- auth: token-based\n- storage: filesystem with replication\n\n### Enterprise Mode (Production)\n\n**File**: `orchestrator.enterprise.example.ncl`\n\n**Characteristics**:\n- 16+ CPU, 32+ GB RAM (high-performance)\n- Multi-team, multi-workspace\n- Production mission-critical\n- Full redundancy and HA\n- OAuth2/Enterprise auth\n- Distributed storage with replication\n- Full monitoring, tracing, audit\n\n**When to use**:\n- Production deployment\n- Mission-critical systems\n- High-availability requirements\n- Multi-tenant environments\n- 
Compliance requirements (SOC2, ISO27001)\n\n**Key Settings**:\n- workers: 16\n- max_concurrent_tasks: 100\n- max_memory: 32GB\n- tls: mandatory (TLS 1.3)\n- auth: OAuth2 (enterprise provider)\n- storage: distributed with 3-way replication\n- monitoring: comprehensive with tracing\n- disaster_recovery: enabled\n- compliance: SOC2, ISO27001\n\n## How to Use These Examples\n\n### Step 1: Copy the Appropriate Example\n\nChoose the example that matches your deployment mode:\n\n```\n# For development (solo)\ncp provisioning/config/examples/orchestrator.solo.example.ncl \\n provisioning/config/runtime/orchestrator.solo.ncl\n\n# For team staging (multiuser)\ncp provisioning/config/examples/orchestrator.multiuser.example.ncl \\n provisioning/config/runtime/orchestrator.multiuser.ncl\n\n# For production (enterprise)\ncp provisioning/config/examples/orchestrator.enterprise.example.ncl \\n provisioning/config/runtime/orchestrator.enterprise.ncl\n```\n\n### Step 2: Customize for Your Environment\n\nEdit the copied file to match your specific setup:\n\n```\n# Edit the configuration\nvim provisioning/config/runtime/orchestrator.solo.ncl\n\n# Examples of customizations:\n# - Change workspace path to your project\n# - Adjust worker count based on CPU cores\n# - Set your domain names and hostnames\n# - Configure storage paths for your filesystem\n# - Update certificate paths for production\n# - Set logging endpoints for your infrastructure\n```\n\n### Step 3: Validate Configuration\n\nVerify the configuration is syntactically correct:\n\n```\n# Check Nickel syntax\nnickel typecheck provisioning/config/runtime/orchestrator.solo.ncl\n\n# View generated TOML\nnickel export --format toml provisioning/config/runtime/orchestrator.solo.ncl\n```\n\n### Step 4: Generate TOML\n\nExport the Nickel configuration to TOML format for service consumption:\n\n```\n# Use setup script to generate TOML\n./provisioning/scripts/setup-platform-config.sh --generate-toml\n\n# Or manually export\nnickel export --format toml provisioning/config/runtime/orchestrator.solo.ncl > \\n provisioning/config/runtime/generated/orchestrator.solo.toml\n```\n\n### Step 5: Run Services\n\nStart your platform services with the generated configuration:\n\n```\n# Set the deployment mode\nexport ORCHESTRATOR_MODE=solo\n\n# Run the orchestrator\ncargo run -p orchestrator\n```\n\n## Configuration Reference\n\n### Solo Mode Example Settings\n\n```\nserver.workers = 2\nqueue.max_concurrent_tasks = 2\nperformance.max_memory = 1000 # 1GB max\nsecurity.tls.enabled = false # No TLS for local dev\nsecurity.auth.enabled = false # No auth for local dev\n```\n\n**Use case**: Single developer on local machine\n\n### Multiuser Mode Example Settings\n\n```\nserver.workers = 4\nqueue.max_concurrent_tasks = 10\nperformance.max_memory = 4000 # 4GB max\nsecurity.tls.enabled = true # Enable TLS\nsecurity.auth.type = "token" # Token-based auth\n```\n\n**Use case**: Team of 5-10 developers in staging\n\n### Enterprise Mode Example Settings\n\n```\nserver.workers = 16\nqueue.max_concurrent_tasks = 100\nperformance.max_memory = 32000 # 32GB max\nsecurity.tls.enabled = true # TLS 1.3 only\nsecurity.auth.type = "oauth2" # OAuth2 for enterprise\nstorage.replication.factor = 3 # 3-way replication\n```\n\n**Use case**: Production with 100+ users across multiple teams\n\n## Key Configuration Sections\n\n### Server Configuration\n\nControls HTTP server behavior:\n\n```\nserver = {\n host = "0.0.0.0", # Bind address\n port = 9090, # Listen port\n workers = 4, # Worker threads\n 
max_connections = 200, # Concurrent connections\n request_timeout = 30000, # Milliseconds\n}\n```\n\n### Storage Configuration\n\nControls data persistence:\n\n```\nstorage = {\n backend = "filesystem", # filesystem or distributed\n path = "/var/lib/provisioning/orchestrator/data",\n cache.enabled = true,\n replication.enabled = true,\n replication.factor = 3, # 3-way replication for HA\n}\n```\n\n### Queue Configuration\n\nControls task queuing:\n\n```\nqueue = {\n max_concurrent_tasks = 10,\n retry_attempts = 3,\n task_timeout = 3600000, # 1 hour in milliseconds\n priority_queue = true, # Enable priority for tasks\n metrics = true, # Enable queue metrics\n}\n```\n\n### Security Configuration\n\nControls authentication and encryption:\n\n```\nsecurity = {\n tls = {\n enabled = true,\n cert_path = "/etc/provisioning/certs/cert.crt",\n key_path = "/etc/provisioning/certs/key.key",\n min_tls_version = "1.3",\n },\n auth = {\n enabled = true,\n type = "oauth2", # oauth2, token, or none\n provider = "okta",\n },\n encryption = {\n enabled = true,\n algorithm = "aes-256-gcm",\n },\n}\n```\n\n### Logging Configuration\n\nControls log output and persistence:\n\n```\nlogging = {\n level = "info", # debug, info, warning, error\n format = "json",\n output = "both", # stdout, file, or both\n file = {\n enabled = true,\n path = "/var/log/orchestrator.log",\n rotation.max_size = 104857600, # 100MB per file\n },\n}\n```\n\n### Monitoring Configuration\n\nControls observability and metrics:\n\n```\nmonitoring = {\n enabled = true,\n metrics.enabled = true,\n health_check.enabled = true,\n distributed_tracing.enabled = true,\n audit_logging.enabled = true,\n}\n```\n\n## Customization Examples\n\n### Example 1: Change Workspace Name\n\nChange the workspace identifier in solo mode:\n\n```\nworkspace = {\n name = "myproject",\n path = "./provisioning/data/orchestrator",\n}\n```\n\nInstead of default "development", use "myproject".\n\n### Example 2: Custom Server Port\n\nChange server port from default 9090:\n\n```\nserver = {\n port = 8888,\n}\n```\n\nUseful if port 9090 is already in use.\n\n### Example 3: Enable TLS in Solo Mode\n\nAdd TLS certificates to solo development:\n\n```\nsecurity = {\n tls = {\n enabled = true,\n cert_path = "./certs/localhost.crt",\n key_path = "./certs/localhost.key",\n },\n}\n```\n\nUseful for testing TLS locally before production.\n\n### Example 4: Custom Storage Path\n\nUse custom storage location:\n\n```\nstorage = {\n path = "/mnt/fast-storage/orchestrator/data",\n}\n```\n\nUseful if you have fast SSD storage available.\n\n### Example 5: Increase Workers for Staging\n\nIncrease from 4 to 8 workers in multiuser:\n\n```\nserver = {\n workers = 8,\n}\n```\n\nUseful when you have more CPU cores available.\n\n## Troubleshooting Configuration\n\n### Issue: "Configuration Won't Validate"\n\n```\n# Check for Nickel syntax errors\nnickel typecheck provisioning/config/runtime/orchestrator.solo.ncl\n\n# Get detailed error message\nnickel typecheck -i provisioning/config/runtime/orchestrator.solo.ncl\n```\n\nThe typecheck command will show exactly where the syntax error is.\n\n### Issue: "Service Won't Start"\n\n```\n# Verify TOML was exported correctly\ncat provisioning/config/runtime/generated/orchestrator.solo.toml | head -20\n\n# Check TOML syntax is valid\ntoml-cli validate provisioning/config/runtime/generated/orchestrator.solo.toml\n```\n\nThe TOML must be valid for the Rust service to parse it.\n\n### Issue: "Service Uses Wrong Configuration"\n\n```\n# Verify deployment mode 
is set\necho $ORCHESTRATOR_MODE\n\n# Check which TOML file service reads\nls -lah provisioning/config/runtime/generated/orchestrator.*.toml\n\n# Verify TOML modification time is recent\nstat provisioning/config/runtime/generated/orchestrator.solo.toml\n```\n\nThe service reads from `orchestrator.{MODE}.toml` based on environment variable.\n\n## Best Practices\n\n### Development (Solo Mode)\n\n1. Start simple using the solo example as-is first\n2. Iterate gradually, making one change at a time\n3. Enable logging by setting level = "debug" for troubleshooting\n4. Disable security features for local development (TLS/auth)\n5. Store data in ./provisioning/data/ which is gitignored\n\n### Staging (Multiuser Mode)\n\n1. Mirror production settings to test realistically\n2. Enable authentication even in staging to test auth flows\n3. Enable TLS with valid certificates to test secure connections\n4. Set up monitoring metrics and health checks\n5. Plan worker count based on expected concurrent users\n\n### Production (Enterprise Mode)\n\n1. Follow the enterprise example as baseline configuration\n2. Use secure vault for storing credentials and secrets\n3. Enable redundancy with 3-way replication for HA\n4. Enable full monitoring with distributed tracing\n5. Test failover scenarios regularly\n6. Enable audit logging for compliance\n7. Enforce TLS 1.3 and certificate rotation\n\n## Migration Between Modes\n\nTo upgrade from solo → multiuser → enterprise:\n\n```\n# 1. Backup current configuration\ncp provisioning/config/runtime/orchestrator.solo.ncl \\n provisioning/config/runtime/orchestrator.solo.ncl.bak\n\n# 2. Copy new example for target mode\ncp provisioning/config/examples/orchestrator.multiuser.example.ncl \\n provisioning/config/runtime/orchestrator.multiuser.ncl\n\n# 3. Customize for your environment\nvim provisioning/config/runtime/orchestrator.multiuser.ncl\n\n# 4. Validate and generate TOML\n./provisioning/scripts/setup-platform-config.sh --generate-toml\n\n# 5. Update mode environment variable and restart\nexport ORCHESTRATOR_MODE=multiuser\ncargo run -p orchestrator\n```\n\n## Related Documentation\n\n- **Platform Configuration Guide**: `provisioning/docs/src/getting-started/05-platform-configuration.md`\n- **Configuration README**: `provisioning/config/README.md`\n- **System Status**: `provisioning/config/SETUP_STATUS.md`\n- **Setup Script Reference**: `provisioning/scripts/setup-platform-config.sh.md`\n- **Advanced TypeDialog Guide**: `provisioning/docs/src/development/typedialog-platform-config-guide.md`\n\n---\n\n**Version**: 1.0.0\n**Last Updated**: 2026-01-05\n**Status**: Ready to use +# Example Platform Service Configurations\n\nThis directory contains reference configurations for platform services in different deployment modes. These examples show realistic settings and best practices for each mode.\n\n## What Are These Examples?\n\nThese are **Nickel configuration files** (.ncl format) that demonstrate how to configure the provisioning platform services. 
They show:\n\n- Recommended settings for each deployment mode\n- How to customize services for your environment\n- Best practices for development, staging, and production\n- Performance tuning for different scenarios\n- Security settings appropriate to each mode\n\n## Directory Structure\n\n```\nprovisioning/config/examples/\n├── README.md # This file\n├── orchestrator.solo.example.ncl # Development mode reference\n├── orchestrator.multiuser.example.ncl # Team staging reference\n└── orchestrator.enterprise.example.ncl # Production reference\n```\n\n## Deployment Modes\n\n### Solo Mode (Development)\n\n**File**: `orchestrator.solo.example.ncl`\n\n**Characteristics**:\n- 2 CPU, 4GB RAM (lightweight)\n- Single user/developer\n- Local development machine\n- Minimal resource consumption\n- No TLS or authentication\n- In-memory storage\n\n**When to use**:\n- Local development\n- Testing configurations\n- Learning the platform\n- CI/CD test environments\n\n**Key Settings**:\n- workers: 2\n- max_concurrent_tasks: 2\n- max_memory: 1GB\n- tls: disabled\n- auth: disabled\n\n### Multiuser Mode (Team Staging)\n\n**File**: `orchestrator.multiuser.example.ncl`\n\n**Characteristics**:\n- 4 CPU, 8GB RAM (moderate)\n- Multiple concurrent users\n- Team staging environment\n- Production-like testing\n- Basic TLS and token auth\n- Filesystem storage with caching\n\n**When to use**:\n- Team development\n- Integration testing\n- Staging environment\n- Pre-production validation\n- Multi-user environments\n\n**Key Settings**:\n- workers: 4\n- max_concurrent_tasks: 10\n- max_memory: 4GB\n- tls: enabled (certificates required)\n- auth: token-based\n- storage: filesystem with replication\n\n### Enterprise Mode (Production)\n\n**File**: `orchestrator.enterprise.example.ncl`\n\n**Characteristics**:\n- 16+ CPU, 32+ GB RAM (high-performance)\n- Multi-team, multi-workspace\n- Production mission-critical\n- Full redundancy and HA\n- OAuth2/Enterprise auth\n- Distributed storage with replication\n- Full monitoring, tracing, audit\n\n**When to use**:\n- Production deployment\n- Mission-critical systems\n- High-availability requirements\n- Multi-tenant environments\n- Compliance requirements (SOC2, ISO27001)\n\n**Key Settings**:\n- workers: 16\n- max_concurrent_tasks: 100\n- max_memory: 32GB\n- tls: mandatory (TLS 1.3)\n- auth: OAuth2 (enterprise provider)\n- storage: distributed with 3-way replication\n- monitoring: comprehensive with tracing\n- disaster_recovery: enabled\n- compliance: SOC2, ISO27001\n\n## How to Use These Examples\n\n### Step 1: Copy the Appropriate Example\n\nChoose the example that matches your deployment mode:\n\n```\n# For development (solo)\ncp provisioning/config/examples/orchestrator.solo.example.ncl \\n provisioning/config/runtime/orchestrator.solo.ncl\n\n# For team staging (multiuser)\ncp provisioning/config/examples/orchestrator.multiuser.example.ncl \\n provisioning/config/runtime/orchestrator.multiuser.ncl\n\n# For production (enterprise)\ncp provisioning/config/examples/orchestrator.enterprise.example.ncl \\n provisioning/config/runtime/orchestrator.enterprise.ncl\n```\n\n### Step 2: Customize for Your Environment\n\nEdit the copied file to match your specific setup:\n\n```\n# Edit the configuration\nvim provisioning/config/runtime/orchestrator.solo.ncl\n\n# Examples of customizations:\n# - Change workspace path to your project\n# - Adjust worker count based on CPU cores\n# - Set your domain names and hostnames\n# - Configure storage paths for your filesystem\n# - Update certificate paths for production\n# - Set logging endpoints for your infrastructure\n```
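\n\nTo review what you changed later, diff the runtime copy against the pristine example (a minimal sketch):\n\n```\n# Show your customizations relative to the shipped example\ndiff -u provisioning/config/examples/orchestrator.solo.example.ncl \\n provisioning/config/runtime/orchestrator.solo.ncl\n```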
\n\n### Step 3: Validate Configuration\n\nVerify the configuration is syntactically correct:\n\n```\n# Check Nickel syntax\nnickel typecheck provisioning/config/runtime/orchestrator.solo.ncl\n\n# View generated TOML\nnickel export --format toml provisioning/config/runtime/orchestrator.solo.ncl\n```\n\n### Step 4: Generate TOML\n\nExport the Nickel configuration to TOML format for service consumption:\n\n```\n# Use setup script to generate TOML\n./provisioning/scripts/setup-platform-config.sh --generate-toml\n\n# Or manually export\nnickel export --format toml provisioning/config/runtime/orchestrator.solo.ncl > \\n provisioning/config/runtime/generated/orchestrator.solo.toml\n```\n\n### Step 5: Run Services\n\nStart your platform services with the generated configuration:\n\n```\n# Set the deployment mode\nexport ORCHESTRATOR_MODE=solo\n\n# Run the orchestrator\ncargo run -p orchestrator\n```\n\n## Configuration Reference\n\n### Solo Mode Example Settings\n\n```\nserver.workers = 2\nqueue.max_concurrent_tasks = 2\nperformance.max_memory = 1000 # 1GB max\nsecurity.tls.enabled = false # No TLS for local dev\nsecurity.auth.enabled = false # No auth for local dev\n```\n\n**Use case**: Single developer on local machine\n\n### Multiuser Mode Example Settings\n\n```\nserver.workers = 4\nqueue.max_concurrent_tasks = 10\nperformance.max_memory = 4000 # 4GB max\nsecurity.tls.enabled = true # Enable TLS\nsecurity.auth.type = "token" # Token-based auth\n```\n\n**Use case**: Team of 5-10 developers in staging\n\n### Enterprise Mode Example Settings\n\n```\nserver.workers = 16\nqueue.max_concurrent_tasks = 100\nperformance.max_memory = 32000 # 32GB max\nsecurity.tls.enabled = true # TLS 1.3 only\nsecurity.auth.type = "oauth2" # OAuth2 for enterprise\nstorage.replication.factor = 3 # 3-way replication\n```\n\n**Use case**: Production with 100+ users across multiple teams\n\n## Key Configuration Sections\n\n### Server Configuration\n\nControls HTTP server behavior:\n\n```\nserver = {\n host = "0.0.0.0", # Bind address\n port = 9090, # Listen port\n workers = 4, # Worker threads\n max_connections = 200, # Concurrent connections\n request_timeout = 30000, # Milliseconds\n}\n```\n\n### Storage Configuration\n\nControls data persistence:\n\n```\nstorage = {\n backend = "filesystem", # filesystem or distributed\n path = "/var/lib/provisioning/orchestrator/data",\n cache.enabled = true,\n replication.enabled = true,\n replication.factor = 3, # 3-way replication for HA\n}\n```\n\n### Queue Configuration\n\nControls task queuing:\n\n```\nqueue = {\n max_concurrent_tasks = 10,\n retry_attempts = 3,\n task_timeout = 3600000, # 1 hour in milliseconds\n priority_queue = true, # Enable priority for tasks\n metrics = true, # Enable queue metrics\n}\n```\n\n### Security Configuration\n\nControls authentication and encryption:\n\n```\nsecurity = {\n tls = {\n enabled = true,\n cert_path = "/etc/provisioning/certs/cert.crt",\n key_path = "/etc/provisioning/certs/key.key",\n min_tls_version = "1.3",\n },\n auth = {\n enabled = true,\n type = "oauth2", # oauth2, token, or none\n provider = "okta",\n },\n encryption = {\n enabled = true,\n algorithm = "aes-256-gcm",\n },\n}\n```\n\n### Logging Configuration\n\nControls log output and persistence:\n\n```\nlogging = {\n level = "info", # debug, info, warning, error\n format = "json",\n output = "both", # stdout, file, or both\n file = {\n enabled = true,\n path = "/var/log/orchestrator.log",\n rotation.max_size = 
104857600, # 100MB per file\n },\n}\n```\n\n### Monitoring Configuration\n\nControls observability and metrics:\n\n```\nmonitoring = {\n enabled = true,\n metrics.enabled = true,\n health_check.enabled = true,\n distributed_tracing.enabled = true,\n audit_logging.enabled = true,\n}\n```\n\n## Customization Examples\n\n### Example 1: Change Workspace Name\n\nChange the workspace identifier in solo mode:\n\n```\nworkspace = {\n name = "myproject",\n path = "./provisioning/data/orchestrator",\n}\n```\n\nInstead of default "development", use "myproject".\n\n### Example 2: Custom Server Port\n\nChange server port from default 9090:\n\n```\nserver = {\n port = 8888,\n}\n```\n\nUseful if port 9090 is already in use.\n\n### Example 3: Enable TLS in Solo Mode\n\nAdd TLS certificates to solo development:\n\n```\nsecurity = {\n tls = {\n enabled = true,\n cert_path = "./certs/localhost.crt",\n key_path = "./certs/localhost.key",\n },\n}\n```\n\nUseful for testing TLS locally before production.\n\n### Example 4: Custom Storage Path\n\nUse custom storage location:\n\n```\nstorage = {\n path = "/mnt/fast-storage/orchestrator/data",\n}\n```\n\nUseful if you have fast SSD storage available.\n\n### Example 5: Increase Workers for Staging\n\nIncrease from 4 to 8 workers in multiuser:\n\n```\nserver = {\n workers = 8,\n}\n```\n\nUseful when you have more CPU cores available.\n\n## Troubleshooting Configuration\n\n### Issue: "Configuration Won't Validate"\n\n```\n# Check for Nickel syntax errors\nnickel typecheck provisioning/config/runtime/orchestrator.solo.ncl\n\n# Get detailed error message\nnickel typecheck -i provisioning/config/runtime/orchestrator.solo.ncl\n```\n\nThe typecheck command will show exactly where the syntax error is.\n\n### Issue: "Service Won't Start"\n\n```\n# Verify TOML was exported correctly\ncat provisioning/config/runtime/generated/orchestrator.solo.toml | head -20\n\n# Check TOML syntax is valid\ntoml-cli validate provisioning/config/runtime/generated/orchestrator.solo.toml\n```\n\nThe TOML must be valid for the Rust service to parse it.\n\n### Issue: "Service Uses Wrong Configuration"\n\n```\n# Verify deployment mode is set\necho $ORCHESTRATOR_MODE\n\n# Check which TOML file service reads\nls -lah provisioning/config/runtime/generated/orchestrator.*.toml\n\n# Verify TOML modification time is recent\nstat provisioning/config/runtime/generated/orchestrator.solo.toml\n```\n\nThe service reads from `orchestrator.{MODE}.toml` based on environment variable.\n\n## Best Practices\n\n### Development (Solo Mode)\n\n1. Start simple using the solo example as-is first\n2. Iterate gradually, making one change at a time\n3. Enable logging by setting level = "debug" for troubleshooting\n4. Disable security features for local development (TLS/auth)\n5. Store data in ./provisioning/data/ which is gitignored\n\n### Staging (Multiuser Mode)\n\n1. Mirror production settings to test realistically\n2. Enable authentication even in staging to test auth flows\n3. Enable TLS with valid certificates to test secure connections\n4. Set up monitoring metrics and health checks\n5. Plan worker count based on expected concurrent users\n\n### Production (Enterprise Mode)\n\n1. Follow the enterprise example as baseline configuration\n2. Use secure vault for storing credentials and secrets\n3. Enable redundancy with 3-way replication for HA\n4. Enable full monitoring with distributed tracing\n5. Test failover scenarios regularly\n6. Enable audit logging for compliance\n7. 
Enforce TLS 1.3 and certificate rotation\n\n## Migration Between Modes\n\nTo upgrade from solo → multiuser → enterprise:\n\n```\n# 1. Backup current configuration\ncp provisioning/config/runtime/orchestrator.solo.ncl \\n provisioning/config/runtime/orchestrator.solo.ncl.bak\n\n# 2. Copy new example for target mode\ncp provisioning/config/examples/orchestrator.multiuser.example.ncl \\n provisioning/config/runtime/orchestrator.multiuser.ncl\n\n# 3. Customize for your environment\nvim provisioning/config/runtime/orchestrator.multiuser.ncl\n\n# 4. Validate and generate TOML\n./provisioning/scripts/setup-platform-config.sh --generate-toml\n\n# 5. Update mode environment variable and restart\nexport ORCHESTRATOR_MODE=multiuser\ncargo run -p orchestrator\n```\n\n## Related Documentation\n\n- **Platform Configuration Guide**: `provisioning/docs/src/getting-started/05-platform-configuration.md`\n- **Configuration README**: `provisioning/config/README.md`\n- **System Status**: `provisioning/config/SETUP_STATUS.md`\n- **Setup Script Reference**: `provisioning/scripts/setup-platform-config.sh.md`\n- **Advanced TypeDialog Guide**: `provisioning/docs/src/development/typedialog-platform-config-guide.md`\n\n---\n\n**Version**: 1.0.0\n**Last Updated**: 2026-01-05\n**Status**: Ready to use \ No newline at end of file diff --git a/docs/README.md b/docs/README.md index 9da87fd..fe1432c 100644 --- a/docs/README.md +++ b/docs/README.md @@ -1,138 +1 @@ -# Provisioning Platform Documentation - -Complete documentation for the Provisioning Platform infrastructure automation system built with Nushell, -Nickel, and Rust. - -## 📖 Browse Documentation - -All documentation is **directly readable** as markdown files in Git/GitHub—mdBook is optional. - -- **[Table of Contents](src/SUMMARY.md)** – Complete documentation index (188+ pages) -- **[Browse src/ directory](src/)** – All markdown files organized by topic - ---- - -## 🚀 Quick Navigation - -### For Users & Operators - -- **[Getting Started](src/getting-started/)** – Installation, setup, and first deployment -- **[Operations Guide](src/operations/)** – Deployment, monitoring, orchestrator management -- **[Troubleshooting](src/troubleshooting/troubleshooting-guide.md)** – Common issues and solutions -- **[Security](src/security/)** – Authentication, encryption, secrets management - -### For Developers & Architects - -- **[Architecture Overview](src/architecture/)** – System design and integration patterns -- **[Infrastructure Guide](src/infrastructure/)** – CLI, configuration system, workspaces -- **[Development Guide](src/development/)** – Extensions, providers, taskservs, build system -- **[API Reference](src/api-reference/)** – REST API, WebSocket, SDKs, integration examples - -### For Advanced Users - -- **[Deployment Guides](src/guides/)** – Multi-provider setup, customization, infrastructure examples -- **[Integration Guides](src/integration/)** – Gitea, OCI, service mesh, secrets integration -- **[Testing](src/testing/)** – Test environment setup and validation - ---- - -## 📚 Documentation Structure - -``` -provisioning/docs/ -├── README.md # This file – navigation hub -├── book.toml # mdBook configuration -├── src/ # Source markdown files (version-controlled) -│ ├── SUMMARY.md # Complete table of contents -│ ├── getting-started/ # Installation and setup -│ ├── architecture/ # System design and ADRs -│ ├── infrastructure/ # CLI, configuration, workspaces -│ ├── operations/ # Deployment, orchestrator, monitoring -│ ├── development/ # Extensions, providers, build 
system -│ ├── api-reference/ # APIs and SDKs -│ ├── security/ # Authentication, secrets, encryption -│ ├── integration/ # Third-party integrations -│ ├── guides/ # How-to guides and examples -│ ├── troubleshooting/ # Common issues -│ └── ... # 12 other sections -├── book/ # Generated HTML output (Git-ignored) -└── examples/ # Example workspace configurations -``` - -### Why `src/` subdirectory - -This is the **standard mdBook convention**: -- **Source (`src/`)**: Version-controlled markdown files, directly readable -- **Output (`book/`)**: Generated HTML/CSS/JS, Git-ignored (regenerated on build) - -This separation allows the same source files to generate multiple output formats (HTML, PDF, EPUB) without -cluttering the version-controlled repository. - ---- - -## 🔨 Building HTML with mdBook - -If you prefer a formatted HTML website with search, themes, and copy buttons, build with mdBook: - -### Prerequisites - -```bash -cargo install mdbook -``` - -### Build & Serve - -```bash -# Navigate to docs directory -cd provisioning/docs - -# Build HTML to book/ directory -mdbook build - -# Serve locally at http://localhost:3000 (with live reload) -mdbook serve -``` - -### Output - -Generated HTML is available in `provisioning/docs/book/` after building. - -**Note**: mdBook is entirely optional. The markdown files in `src/` work perfectly fine in any Git -viewer or text editor. - ---- - -## 📖 Reading Markdown Directly - -All documentation is standard GitHub Flavored Markdown. You can: - -- **GitHub/GitLab**: Click `provisioning/docs/src/` and browse directly -- **Local Git**: Clone the repo and open any `.md` file in your editor -- **Text Search**: Use `grep` or your editor's search to find topics across all markdown files -- **mdBook (optional)**: Build HTML for formatted reading with search and theming - ---- - -## 🔗 Key Reference Pages - -| Document | Purpose | -| ------------------------------------------------------------------------------ | --------------------------------- | -| [System Overview](src/architecture/system-overview.md) | High-level architecture | -| [Installation Guide](src/getting-started/installation-guide.md) | Step-by-step setup | -| [CLI Reference](src/infrastructure/cli-reference.md) | Command reference | -| [Configuration System](src/infrastructure/configuration-system.md) | Config management | -| [Security System](src/security/security-system.md) | Authentication & encryption | -| [Orchestrator](src/operations/orchestrator.md) | Service orchestration | -| [Workspace Guide](src/infrastructure/workspaces/workspace-guide.md) | Infrastructure workspaces | -| [ADRs](src/architecture/adr/) | Architecture Decision Records | - ---- - -## ❓ Questions - -- **Getting started** → Start with [Installation Guide](src/getting-started/installation-guide.md) -- **Having issues** → Check [Troubleshooting](src/troubleshooting/troubleshooting-guide.md) -- **Looking for API docs** → See [API Reference](src/api-reference/) -- **Want architecture details** → Read [Architecture Overview](src/architecture/architecture-overview.md) - -For complete navigation, see [Table of Contents](src/SUMMARY.md). 
+# Provisioning Platform Documentation\n\nComplete documentation for the Provisioning Platform infrastructure automation system built with Nushell,\nNickel, and Rust.\n\n## 📖 Browse Documentation\n\nAll documentation is **directly readable** as markdown files in Git/GitHub—mdBook is optional.\n\n- **[Table of Contents](src/SUMMARY.md)** – Complete documentation index (188+ pages)\n- **[Browse src/ directory](src/)** – All markdown files organized by topic\n\n---\n\n## 🚀 Quick Navigation\n\n### For Users & Operators\n\n- **[Getting Started](src/getting-started/)** – Installation, setup, and first deployment\n- **[Operations Guide](src/operations/)** – Deployment, monitoring, orchestrator management\n- **[Troubleshooting](src/troubleshooting/troubleshooting-guide.md)** – Common issues and solutions\n- **[Security](src/security/)** – Authentication, encryption, secrets management\n\n### For Developers & Architects\n\n- **[Architecture Overview](src/architecture/)** – System design and integration patterns\n- **[Infrastructure Guide](src/infrastructure/)** – CLI, configuration system, workspaces\n- **[Development Guide](src/development/)** – Extensions, providers, taskservs, build system\n- **[API Reference](src/api-reference/)** – REST API, WebSocket, SDKs, integration examples\n\n### For Advanced Users\n\n- **[Deployment Guides](src/guides/)** – Multi-provider setup, customization, infrastructure examples\n- **[Integration Guides](src/integration/)** – Gitea, OCI, service mesh, secrets integration\n- **[Testing](src/testing/)** – Test environment setup and validation\n\n---\n\n## 📚 Documentation Structure\n\n```\nprovisioning/docs/\n├── README.md # This file – navigation hub\n├── book.toml # mdBook configuration\n├── src/ # Source markdown files (version-controlled)\n│ ├── SUMMARY.md # Complete table of contents\n│ ├── getting-started/ # Installation and setup\n│ ├── architecture/ # System design and ADRs\n│ ├── infrastructure/ # CLI, configuration, workspaces\n│ ├── operations/ # Deployment, orchestrator, monitoring\n│ ├── development/ # Extensions, providers, build system\n│ ├── api-reference/ # APIs and SDKs\n│ ├── security/ # Authentication, secrets, encryption\n│ ├── integration/ # Third-party integrations\n│ ├── guides/ # How-to guides and examples\n│ ├── troubleshooting/ # Common issues\n│ └── ... # 12 other sections\n├── book/ # Generated HTML output (Git-ignored)\n└── examples/ # Example workspace configurations\n```\n\n### Why `src/` subdirectory\n\nThis is the **standard mdBook convention**:\n- **Source (`src/`)**: Version-controlled markdown files, directly readable\n- **Output (`book/`)**: Generated HTML/CSS/JS, Git-ignored (regenerated on build)\n\nThis separation allows the same source files to generate multiple output formats (HTML, PDF, EPUB) without\ncluttering the version-controlled repository.\n\n---\n\n## 🔨 Building HTML with mdBook\n\nIf you prefer a formatted HTML website with search, themes, and copy buttons, build with mdBook:\n\n### Prerequisites\n\n```bash\ncargo install mdbook\n```\n\n### Build & Serve\n\n```bash\n# Navigate to docs directory\ncd provisioning/docs\n\n# Build HTML to book/ directory\nmdbook build\n\n# Serve locally at http://localhost:3000 (with live reload)\nmdbook serve\n```\n\n### Output\n\nGenerated HTML is available in `provisioning/docs/book/` after building.\n\n**Note**: mdBook is entirely optional. 
The markdown files in `src/` work perfectly fine in any Git\nviewer or text editor.\n\n---\n\n## 📖 Reading Markdown Directly\n\nAll documentation is standard GitHub Flavored Markdown. You can:\n\n- **GitHub/GitLab**: Click `provisioning/docs/src/` and browse directly\n- **Local Git**: Clone the repo and open any `.md` file in your editor\n- **Text Search**: Use `grep` or your editor's search to find topics across all markdown files\n- **mdBook (optional)**: Build HTML for formatted reading with search and theming\n\n---\n\n## 🔗 Key Reference Pages\n\n| Document | Purpose |\n| ------------------------------------------------------------------------------ | --------------------------------- |\n| [System Overview](src/architecture/system-overview.md) | High-level architecture |\n| [Installation Guide](src/getting-started/installation-guide.md) | Step-by-step setup |\n| [CLI Reference](src/infrastructure/cli-reference.md) | Command reference |\n| [Configuration System](src/infrastructure/configuration-system.md) | Config management |\n| [Security System](src/security/security-system.md) | Authentication & encryption |\n| [Orchestrator](src/operations/orchestrator.md) | Service orchestration |\n| [Workspace Guide](src/infrastructure/workspaces/workspace-guide.md) | Infrastructure workspaces |\n| [ADRs](src/architecture/adr/) | Architecture Decision Records |\n\n---\n\n## ❓ Questions\n\n- **Getting started** → Start with [Installation Guide](src/getting-started/installation-guide.md)\n- **Having issues** → Check [Troubleshooting](src/troubleshooting/troubleshooting-guide.md)\n- **Looking for API docs** → See [API Reference](src/api-reference/)\n- **Want architecture details** → Read [Architecture Overview](src/architecture/architecture-overview.md)\n\nFor complete navigation, see [Table of Contents](src/SUMMARY.md). \ No newline at end of file diff --git a/docs/src/PROVISIONING.md b/docs/src/PROVISIONING.md index 6001324..1b2baf9 100644 --- a/docs/src/PROVISIONING.md +++ b/docs/src/PROVISIONING.md @@ -1 +1,944 @@ -
[image: Provisioning logo and wordmark]
\n\n# Provisioning - Infrastructure Automation Platform\n\n> **A modular, declarative Infrastructure as Code (IaC) platform for managing complete infrastructure lifecycles**\n\n## Table of Contents\n\n- [What is Provisioning?](#what-is-provisioning)\n- [Why Provisioning?](#why-provisioning)\n- [Core Concepts](#core-concepts)\n- [Architecture](#architecture)\n- [Key Features](#key-features)\n- [Technology Stack](#technology-stack)\n- [How It Works](#how-it-works)\n- [Use Cases](#use-cases)\n- [Getting Started](#getting-started)\n\n---\n\n## What is Provisioning\n\n**Provisioning** is a comprehensive **Infrastructure as Code (IaC)** platform designed to manage\ncomplete infrastructure lifecycles: cloud providers, infrastructure services, clusters,\nand isolated workspaces across multiple cloud/local environments.\n\nExtensible and customizable by design, it delivers type-safe, configuration-driven workflows\nwith enterprise security (encrypted configuration, Cosmian KMS integration, Cedar policy engine,\nsecrets management, authorization and permissions control, compliance checking, anomaly detection)\nand adaptable deployment modes (interactive UI, CLI automation, unattended CI/CD)\nsuitable for any scale from development to production.\n\n### Technical Definition\n\nDeclarative Infrastructure as Code (IaC) platform providing:\n\n- **Type-safe, configuration-driven workflows** with schema validation and constraint checking\n- **Modular, extensible architecture**: cloud providers, task services, clusters, workspaces\n- **Multi-cloud abstraction layer** with unified API (UpCloud, AWS, local infrastructure)\n- **High-performance state management**:\n - Graph database backend for complex relationships\n - Real-time state tracking and queries\n - Multi-model data storage (document, graph, relational)\n- **Enterprise security stack**:\n - Encrypted configuration and secrets management\n - Cosmian KMS integration for confidential key management\n - Cedar policy engine for fine-grained access control\n - Authorization and permissions control via platform services\n - Compliance checking and policy enforcement\n - Anomaly detection for security monitoring\n - Audit logging and compliance tracking\n- **Hybrid orchestration**: Rust-based performance layer + scripting flexibility\n- **Production-ready features**:\n - Batch workflows with dependency resolution\n - Checkpoint recovery and automatic rollback\n - Parallel execution with state management\n- **Adaptable deployment modes**:\n - Interactive TUI for guided setup\n - Headless CLI for scripted automation\n - Unattended mode for CI/CD pipelines\n- **Hierarchical configuration system** with inheritance and overrides\n\n### What It Does\n\n- **Provisions Infrastructure** - Create servers, networks, storage across multiple cloud providers\n- **Installs Services** - Deploy Kubernetes, containerd, databases, monitoring, and 50+ infrastructure components\n- **Manages Clusters** - Orchestrate complete cluster deployments with dependency management\n- **Handles Configuration** - Hierarchical configuration system with inheritance and overrides\n- **Orchestrates Workflows** - Batch operations with parallel execution and checkpoint recovery\n- **Manages Secrets** - SOPS/Age integration for encrypted configuration\n\n---\n\n## Why Provisioning\n\n### The Problems It Solves\n\n#### 1. 
**Multi-Cloud Complexity**\n\n**Problem**: Each cloud provider has different APIs, tools, and workflows.\n\n**Solution**: Unified abstraction layer with provider-agnostic interfaces. Write configuration once, deploy anywhere.\n\n```text\n# Same configuration works on UpCloud, AWS, or local infrastructure\nserver: Server {\n name = "web-01"\n plan = "medium" # Abstract size, provider-specific translation\n provider = "upcloud" # Switch to "aws" or "local" as needed\n}\n```\n\n#### 2. **Dependency Hell**\n\n**Problem**: Infrastructure components have complex dependencies (Kubernetes needs containerd, Cilium needs Kubernetes, etc.).\n\n**Solution**: Automatic dependency resolution with topological sorting and health checks.\n\n```text\n# Provisioning resolves: containerd → etcd → kubernetes → cilium\ntaskservs = ["cilium"] # Automatically installs all dependencies\n```\n\n#### 3. **Configuration Sprawl**\n\n**Problem**: Environment variables, hardcoded values, scattered configuration files.\n\n**Solution**: Hierarchical configuration system with 476+ config accessors replacing 200+ ENV variables.\n\n```text\nDefaults → User → Project → Infrastructure → Environment → Runtime\n```\n\n#### 4. **Imperative Scripts**\n\n**Problem**: Brittle shell scripts that don't handle failures, don't support rollback, hard to maintain.\n\n**Solution**: Declarative Nickel configurations with validation, type safety, and automatic rollback.\n\n#### 5. **Lack of Visibility**\n\n**Problem**: No insight into what's happening during deployment, hard to debug failures.\n\n**Solution**:\n\n- Real-time workflow monitoring\n- Comprehensive logging system\n- Web-based control center\n- REST API for integration\n\n#### 6. **No Standardization**\n\n**Problem**: Each team builds their own deployment tools, no shared patterns.\n\n**Solution**: Reusable task services, cluster templates, and workflow patterns.\n\n---\n\n## Core Concepts\n\n### 1. **Providers**\n\nCloud infrastructure backends that handle resource provisioning.\n\n- **UpCloud** - Primary cloud provider\n- **AWS** - Amazon Web Services integration\n- **Local** - Local infrastructure (VMs, Docker, bare metal)\n\nProviders implement a common interface, making infrastructure code portable.\n\n### 2. **Task Services (TaskServs)**\n\nReusable infrastructure components that can be installed on servers.\n\n**Categories**:\n\n- **Container Runtimes** - containerd, Docker, Podman, crun, runc, youki\n- **Orchestration** - Kubernetes, etcd, CoreDNS\n- **Networking** - Cilium, Flannel, Calico, ip-aliases\n- **Storage** - Rook-Ceph, local storage\n- **Databases** - PostgreSQL, Redis, SurrealDB\n- **Observability** - Prometheus, Grafana, Loki\n- **Security** - Webhook, KMS, Vault\n- **Development** - Gitea, Radicle, ORAS\n\nEach task service includes:\n\n- Version management\n- Dependency declarations\n- Health checks\n- Installation/uninstallation logic\n- Configuration schemas\n\n### 3. **Clusters**\n\nComplete infrastructure deployments combining servers and task services.\n\n**Examples**:\n\n- **Kubernetes Cluster** - HA control plane + worker nodes + CNI + storage\n- **Database Cluster** - Replicated PostgreSQL with backup\n- **Build Infrastructure** - BuildKit + container registry + CI/CD\n\nClusters handle:\n\n- Multi-node coordination\n- Service distribution\n- High availability\n- Rolling updates\n\n### 4. 
**Workspaces**\n\nIsolated environments for different projects or deployment stages.\n\n```text\nworkspace_librecloud/ # Production workspace\n├── infra/ # Infrastructure definitions\n├── config/ # Workspace configuration\n├── extensions/ # Custom modules\n└── runtime/ # State and runtime data\n\nworkspace_dev/ # Development workspace\n├── infra/\n└── config/\n```\n\nSwitch between workspaces with single command:\n\n```text\nprovisioning workspace switch librecloud\n```\n\n### 5. **Workflows**\n\nCoordinated sequences of operations with dependency management.\n\n**Types**:\n\n- **Server Workflows** - Create/delete/update servers\n- **TaskServ Workflows** - Install/remove infrastructure services\n- **Cluster Workflows** - Deploy/scale complete clusters\n- **Batch Workflows** - Multi-cloud parallel operations\n\n**Features**:\n\n- Dependency resolution\n- Parallel execution\n- Checkpoint recovery\n- Automatic rollback\n- Progress monitoring\n\n---\n\n## Architecture\n\n### System Components\n\n```text\n┌─────────────────────────────────────────────────────────────────┐\n│ User Interface Layer │\n│ • CLI (provisioning command) │\n│ • Web Control Center (UI) │\n│ • REST API │\n└─────────────────────────────────────────────────────────────────┘\n ↓\n┌─────────────────────────────────────────────────────────────────┐\n│ Core Engine Layer │\n│ • Command Routing & Dispatch │\n│ • Configuration Management │\n│ • Provider Abstraction │\n│ • Utility Libraries │\n└─────────────────────────────────────────────────────────────────┘\n ↓\n┌─────────────────────────────────────────────────────────────────┐\n│ Orchestration Layer │\n│ • Workflow Orchestrator (Rust/Nushell hybrid) │\n│ • Dependency Resolver │\n│ • State Manager │\n│ • Task Scheduler │\n└─────────────────────────────────────────────────────────────────┘\n ↓\n┌─────────────────────────────────────────────────────────────────┐\n│ Extension Layer │\n│ • Providers (Cloud APIs) │\n│ • Task Services (Infrastructure Components) │\n│ • Clusters (Complete Deployments) │\n│ • Workflows (Automation Templates) │\n└─────────────────────────────────────────────────────────────────┘\n ↓\n┌─────────────────────────────────────────────────────────────────┐\n│ Infrastructure Layer │\n│ • Cloud Resources (Servers, Networks, Storage) │\n│ • Kubernetes Clusters │\n│ • Running Services │\n└─────────────────────────────────────────────────────────────────┘\n```\n\n### Directory Structure\n\n```text\nproject-provisioning/\n├── provisioning/ # Core provisioning system\n│ ├── core/ # Core engine and libraries\n│ │ ├── cli/ # Command-line interface\n│ │ ├── nulib/ # Core Nushell libraries\n│ │ ├── plugins/ # System plugins\n│ │ └── scripts/ # Utility scripts\n│ │\n│ ├── extensions/ # Extensible components\n│ │ ├── providers/ # Cloud provider implementations\n│ │ ├── taskservs/ # Infrastructure service definitions\n│ │ ├── clusters/ # Complete cluster configurations\n│ │ └── workflows/ # Core workflow templates\n│ │\n│ ├── platform/ # Platform services\n│ │ ├── orchestrator/ # Rust orchestrator service\n│ │ ├── control-center/ # Web control center\n│ │ ├── mcp-server/ # Model Context Protocol server\n│ │ ├── api-gateway/ # REST API gateway\n│ │ ├── oci-registry/ # OCI registry for extensions\n│ │ └── installer/ # Platform installer (TUI + CLI)\n│ │\n│ ├── schemas/ # Nickel configuration schemas\n│ ├── config/ # Configuration files\n│ ├── templates/ # Template files\n│ └── tools/ # Build and distribution tools\n│\n├── workspace/ # User workspaces and data\n│ ├── infra/ # 
Infrastructure definitions\n│ ├── config/ # User configuration\n│ ├── extensions/ # User extensions\n│ └── runtime/ # Runtime data and state\n│\n└── docs/ # Documentation\n ├── user/ # User guides\n ├── api/ # API documentation\n ├── architecture/ # Architecture docs\n └── development/ # Development guides\n```\n\n### Platform Services\n\n#### 1. **Orchestrator** (`platform/orchestrator/`)\n\n- **Language**: Rust + Nushell\n- **Purpose**: Workflow execution, task scheduling, state management\n- **Features**:\n - File-based persistence\n - Priority processing\n - Retry logic with exponential backoff\n - Checkpoint-based recovery\n - REST API endpoints\n\n#### 2. **Control Center** (`platform/control-center/`)\n\n- **Language**: Web UI + Backend API\n- **Purpose**: Web-based infrastructure management\n- **Features**:\n - Dashboard views\n - Real-time monitoring\n - Interactive deployments\n - Log viewing\n\n#### 3. **MCP Server** (`platform/mcp-server/`)\n\n- **Language**: Nushell\n- **Purpose**: Model Context Protocol integration for AI assistance\n- **Features**:\n - 7 AI-powered settings tools\n - Intelligent config completion\n - Natural language infrastructure queries\n\n#### 4. **OCI Registry** (`platform/oci-registry/`)\n\n- **Purpose**: Extension distribution and versioning\n- **Features**:\n - Task service packages\n - Provider packages\n - Cluster templates\n - Workflow definitions\n\n#### 5. **Installer** (`platform/installer/`)\n\n- **Language**: Rust (Ratatui TUI) + Nushell\n- **Purpose**: Platform installation and setup\n- **Features**:\n - Interactive TUI mode\n - Headless CLI mode\n - Unattended CI/CD mode\n - Configuration generation\n\n---\n\n## Key Features\n\n### 1. **Modular CLI Architecture** (v3.2.0)\n\n84% code reduction with domain-driven design.\n\n- **Main CLI**: 211 lines (from 1,329 lines)\n- **80+ shortcuts**: `s` → `server`, `t` → `taskserv`, etc.\n- **Bi-directional help**: `provisioning help ws` = `provisioning ws help`\n- **7 domain modules**: infrastructure, orchestration, development, workspace, configuration, utilities, generation\n\n### 2. **Configuration System** (v2.0.0)\n\nHierarchical, config-driven architecture.\n\n- **476+ config accessors** replacing 200+ ENV variables\n- **Hierarchical loading**: defaults → user → project → infra → env → runtime\n- **Variable interpolation**: `{{paths.base}}`, `{{env.HOME}}`, `{{now.date}}`\n- **Multi-format support**: TOML, YAML, Nickel\n\n### 3. **Batch Workflow System** (v3.1.0)\n\nProvider-agnostic batch operations with 85-90% token efficiency.\n\n- **Multi-cloud support**: Mixed UpCloud + AWS + local in single workflow\n- **Nickel schema integration**: Type-safe workflow definitions\n- **Dependency resolution**: Topological sorting with soft/hard dependencies\n- **State management**: Checkpoint-based recovery with rollback\n- **Real-time monitoring**: Live progress tracking\n\n### 4. **Hybrid Orchestrator** (v3.0.0)\n\nRust/Nushell architecture solving deep call stack limitations.\n\n- **High-performance coordination layer**\n- **File-based persistence**\n- **Priority processing with retry logic**\n- **REST API for external integration**\n- **Comprehensive workflow system**\n\n### 5. 
**Workspace Switching** (v2.0.5)\n\nCentralized workspace management.\n\n- **Single-command switching**: `provisioning workspace switch <name>`\n- **Automatic tracking**: Last-used timestamps, active workspace markers\n- **User preferences**: Global settings across all workspaces\n- **Workspace registry**: Centralized configuration in `user_config.yaml`\n\n### 6. **Interactive Guides** (v3.3.0)\n\nStep-by-step walkthroughs and quick references.\n\n- **Quick reference**: `provisioning sc` (fastest)\n- **Complete guides**: from-scratch, update, customize\n- **Copy-paste ready**: All commands include placeholders\n- **Beautiful rendering**: Uses glow, bat, or less\n\n### 7. **Test Environment Service** (v3.4.0)\n\nAutomated container-based testing.\n\n- **Three test types**: Single taskserv, server simulation, multi-node clusters\n- **Topology templates**: Kubernetes HA, etcd clusters, etc.\n- **Auto-cleanup**: Optional automatic cleanup after tests\n- **CI/CD integration**: Easy integration into pipelines\n\n### 8. **Platform Installer** (v3.5.0)\n\nMulti-mode installation system with TUI, CLI, and unattended modes.\n\n- **Interactive TUI**: Beautiful Ratatui terminal UI with 7 screens\n- **Headless Mode**: CLI automation for scripted installations\n- **Unattended Mode**: Zero-interaction CI/CD deployments\n- **Deployment Modes**: Solo (2 CPU/4 GB), MultiUser (4 CPU/8 GB), CICD (8 CPU/16 GB), Enterprise (16 CPU/32 GB)\n- **MCP Integration**: 7 AI-powered settings tools for intelligent configuration\n\n### 9. **Version Management**\n\nComprehensive version tracking and updates.\n\n- **Automatic updates**: Check for taskserv updates\n- **Version constraints**: Semantic versioning support\n- **Grace periods**: Cached version checks\n- **Update strategies**: major, minor, patch, none\n\n---\n\n## Technology Stack\n\n### Core Technologies\n\n| Technology | Version | Purpose | Why |\n| ------------ | --------- | --------- | ----- |\n| **Nushell** | 0.107.1+ | Primary shell and scripting language | Data pipelines, cross-platform, modern parsers |\n| **Nickel** | 1.0.0+ | Configuration language | Type safety, schema validation, immutability, constraint checking |\n| **Rust** | Latest | Platform services (orchestrator, control-center, installer) | Performance, memory safety, concurrency, reliability |\n| **Tera** | Latest | Template engine | Jinja2-like syntax, configuration file rendering, variable interpolation, filters and functions |\n\n### Data & State Management\n\n| Technology | Version | Purpose | Features |\n| ------------ | --------- | --------- | ---------- |\n| **SurrealDB** | Latest | Graph database backend | Multi-model, real-time queries, distributed, relationships |\n\n### Platform Services (Rust-based)\n\n| Service | Purpose | Security Features |\n| --------- | --------- | ------------------- |\n| **Orchestrator** | Workflow execution, task scheduling, state management | File-based persistence, retry logic, checkpoint recovery |\n| **Control Center** | Web-based infrastructure management | **Authorization and permissions control**, RBAC, audit logging |\n| **Installer** | Platform installation (TUI + CLI modes) | Secure configuration generation, validation |\n| **API Gateway** | REST API for external integration | Authentication, rate limiting, request validation |\n\n### Security & Secrets\n\n| Technology | Version | Purpose | Enterprise Features |\n| ------------ | --------- | --------- | --------------------- |\n| **SOPS** | 3.10.2+ | Secrets management | Encrypted configuration files 
|\n| **Age** | 1.2.1+ | Encryption | Secure key-based encryption |\n| **Cosmian KMS** | Latest | Key Management System | Confidential computing, secure key storage, cloud-native KMS |\n| **Cedar** | Latest | Policy engine | Fine-grained access control, policy-as-code, compliance checking, anomaly detection |\n\n### Optional Tools\n\n| Tool | Purpose |\n| ------ | --------- |\n| **K9s** | Kubernetes management interface |\n| **nu_plugin_tera** | Nushell plugin for Tera template rendering |\n| **glow** | Markdown rendering for interactive guides |\n| **bat** | Syntax highlighting for file viewing and guides |\n\n---\n\n## How It Works\n\n### Data Flow\n\n```text\n1. User defines infrastructure in Nickel\n ↓\n2. CLI loads configuration (hierarchical)\n ↓\n3. Configuration validated against schemas\n ↓\n4. Workflow created with operations\n ↓\n5. Orchestrator receives workflow\n ↓\n6. Dependencies resolved (topological sort)\n ↓\n7. Operations executed in order\n ↓\n8. Providers handle cloud operations\n ↓\n9. Task services installed on servers\n ↓\n10. State persisted and monitored\n```\n\n### Example Workflow: Deploy Kubernetes Cluster\n\n**Step 1**: Define infrastructure in Nickel\n\n```text\n# infra/my-cluster.ncl\nlet config = {\n infra = {\n name = "my-cluster",\n provider = "upcloud",\n },\n\n servers = [\n {name = "control-01", plan = "medium", role = "control"},\n {name = "worker-01", plan = "large", role = "worker"},\n {name = "worker-02", plan = "large", role = "worker"},\n ],\n\n taskservs = ["kubernetes", "cilium", "rook-ceph"],\n} in\nconfig\n```\n\n**Step 2**: Submit to Provisioning\n\n```text\nprovisioning server create --infra my-cluster\n```\n\n**Step 3**: Provisioning executes workflow\n\n```text\n1. Create workflow: "deploy-my-cluster"\n2. Resolve dependencies:\n - containerd (required by kubernetes)\n - etcd (required by kubernetes)\n - kubernetes (explicitly requested)\n - cilium (explicitly requested, requires kubernetes)\n - rook-ceph (explicitly requested, requires kubernetes)\n\n3. Execution order:\n a. Provision servers (parallel)\n b. Install containerd on all nodes\n c. Install etcd on control nodes\n d. Install kubernetes control plane\n e. Join worker nodes\n f. Install Cilium CNI\n g. Install Rook-Ceph storage\n\n4. Checkpoint after each step\n5. Monitor health checks\n6. Report completion\n```\n\n**Step 4**: Verify deployment\n\n```text\nprovisioning cluster status my-cluster\n```\n\n### Configuration Hierarchy\n\nConfiguration values are resolved through a hierarchy:\n\n```text\n1. System Defaults (provisioning/config/config.defaults.toml)\n ↓ (overridden by)\n2. User Preferences (~/.config/provisioning/user_config.yaml)\n ↓ (overridden by)\n3. Workspace Config (workspace/config/provisioning.yaml)\n ↓ (overridden by)\n4. Infrastructure Config (workspace/infra//config.toml)\n ↓ (overridden by)\n5. Environment Config (workspace/config/prod-defaults.toml)\n ↓ (overridden by)\n6. Runtime Flags (--flag value)\n```\n\n**Example**:\n\n```text\n# System default\n[servers]\ndefault_plan = "small"\n\n# User preference\n[servers]\ndefault_plan = "medium" # Overrides system default\n\n# Infrastructure config\n[servers]\ndefault_plan = "large" # Overrides user preference\n\n# Runtime\nprovisioning server create --plan xlarge # Overrides everything\n```\n\n---\n\n## Use Cases\n\n### 1. 
**Multi-Cloud Kubernetes Deployment**\n\nDeploy Kubernetes clusters across different cloud providers with identical configuration.\n\n```text\n# UpCloud cluster\nprovisioning cluster create k8s-prod --provider upcloud\n\n# AWS cluster (same config)\nprovisioning cluster create k8s-prod --provider aws\n```\n\n### 2. **Development → Staging → Production Pipeline**\n\nManage multiple environments with workspace switching.\n\n```text\n# Development\nprovisioning workspace switch dev\nprovisioning cluster create app-stack\n\n# Staging (same config, different resources)\nprovisioning workspace switch staging\nprovisioning cluster create app-stack\n\n# Production (HA, larger resources)\nprovisioning workspace switch prod\nprovisioning cluster create app-stack\n```\n\n### 3. **Infrastructure as Code Testing**\n\nTest infrastructure changes before deploying to production.\n\n```text\n# Test Kubernetes upgrade locally\nprovisioning test topology load kubernetes_3node | \\n test env cluster kubernetes --version 1.29.0\n\n# Verify functionality\nprovisioning test env run \n\n# Cleanup\nprovisioning test env cleanup \n```\n\n### 4. **Batch Multi-Region Deployment**\n\nDeploy to multiple regions in parallel.\n\n```text\n# workflows/multi-region.ncl\nlet batch_workflow = {\n operations = [\n {\n id = "eu-cluster",\n type = "cluster",\n region = "eu-west-1",\n cluster = "app-stack",\n },\n {\n id = "us-cluster",\n type = "cluster",\n region = "us-east-1",\n cluster = "app-stack",\n },\n {\n id = "asia-cluster",\n type = "cluster",\n region = "ap-south-1",\n cluster = "app-stack",\n },\n ],\n parallel_limit = 3, # All at once\n} in\nbatch_workflow\n```\n\n```text\nprovisioning batch submit workflows/multi-region.ncl\nprovisioning batch monitor \n```\n\n### 5. **Automated Disaster Recovery**\n\nRecreate infrastructure from configuration.\n\n```text\n# Infrastructure destroyed\nprovisioning workspace switch prod\n\n# Recreate from config\nprovisioning cluster create --infra backup-restore --wait\n\n# All services restored with same configuration\n```\n\n### 6. **CI/CD Integration**\n\nAutomated testing and deployment pipelines.\n\n```text\n# .gitlab-ci.yml\ntest-infrastructure:\n script:\n - provisioning test quick kubernetes\n - provisioning test quick postgres\n\ndeploy-staging:\n script:\n - provisioning workspace switch staging\n - provisioning cluster create app-stack --check\n - provisioning cluster create app-stack --yes\n\ndeploy-production:\n when: manual\n script:\n - provisioning workspace switch prod\n - provisioning cluster create app-stack --yes\n```\n\n---\n\n## Getting Started\n\n### Quick Start\n\n1. **Install Prerequisites**\n\n ```bash\n # Install Nushell\n brew install nushell # macOS\n\n # Install Nickel\n brew install nickel # macOS\n\n # Install SOPS (optional, for secrets)\n brew install sops\n ```\n\n1. **Add CLI to PATH**\n\n ```bash\n ln -sf "$(pwd)/provisioning/core/cli/provisioning" /usr/local/bin/provisioning\n ```\n\n2. **Initialize Workspace**\n\n ```bash\n provisioning workspace init my-project\n ```\n\n3. **Configure Provider**\n\n ```bash\n # Edit workspace config\n provisioning sops workspace/config/provisioning.yaml\n ```\n\n4. **Deploy Infrastructure**\n\n ```bash\n # Check what will be created\n provisioning server create --check\n\n # Create servers\n provisioning server create --yes\n\n # Install Kubernetes\n provisioning taskserv create kubernetes\n ```\n\n### Learning Path\n\n1. 
**Start with Guides**\n\n ```bash\n provisioning sc # Quick reference\n provisioning guide from-scratch # Complete walkthrough\n ```\n\n2. **Explore Examples**\n\n ```bash\n ls provisioning/examples/\n ```\n\n3. **Read Architecture Docs**\n - [Architecture Overview](architecture/ARCHITECTURE_OVERVIEW.md)\n - [Multi-Repo Strategy](architecture/multi-repo-strategy.md)\n - [Integration Patterns](architecture/integration-patterns.md)\n\n4. **Try Test Environments**\n\n ```bash\n provisioning test quick kubernetes\n provisioning test quick postgres\n ```\n\n5. **Build Custom Extensions**\n - Create custom task services\n - Define cluster templates\n - Write workflow automation\n\n---\n\n## Documentation Index\n\n### User Documentation\n\n- **[Quick Start Guide](quickstart/01-prerequisites.md)** - Get started in 10 minutes\n- **[Service Management Guide](user/SERVICE_MANAGEMENT_GUIDE.md)** - Complete service reference\n- **[Authentication Guide](user/AUTHENTICATION_LAYER_GUIDE.md)** - Authentication and security\n- **[Workspace Switching Guide](user/WORKSPACE_SWITCHING_GUIDE.md)** - Workspace management\n- **[Test Environment Guide](infrastructure/test-environment-guide.md)** - Testing infrastructure\n\n### Architecture Documentation\n\n- **[Architecture Overview](architecture/ARCHITECTURE_OVERVIEW.md)** - System architecture\n- **[Multi-Repo Strategy](architecture/multi-repo-strategy.md)** - Repository organization\n- **[Integration Patterns](architecture/integration-patterns.md)** - Integration design\n- **[Orchestrator Integration](architecture/orchestrator-integration-model.md)** - Workflow execution\n- **[ADR Index](architecture/adr/README.md)** - Architecture Decision Records\n- **[Database Architecture](architecture/DATABASE_AND_CONFIG_ARCHITECTURE.md)** - Data layer design\n\n### Development Documentation\n\n- **[Development Workflow](development/workflow.md)** - Development process\n- **[Integration Guide](development/integration.md)** - Integration patterns\n- **[Command Handler Guide](development/COMMAND_HANDLER_GUIDE.md)** - CLI development\n\n### API Documentation\n\n- **[REST API](api-reference/rest-api.md)** - HTTP endpoints\n- **[WebSocket API](api-reference/websocket.md)** - Real-time communication\n- **[Extensions API](api-reference/extensions.md)** - Extension interface\n- **[Integration Examples](api-reference/integration-examples.md)** - API usage examples\n\n---\n\n## Project Status\n\n**Current Version**: Active Development (2025-10-07)\n\n### Recent Milestones\n\n- ✅ **v2.0.5** (2025-10-06) - Platform Installer with TUI and CI/CD modes\n- ✅ **v2.0.4** (2025-10-06) - Test Environment Service with container management\n- ✅ **v2.0.3** (2025-09-30) - Interactive Guides system\n- ✅ **v2.0.2** (2025-09-30) - Modular CLI Architecture (84% code reduction)\n- ✅ **v2.0.2** (2025-09-25) - Batch Workflow System (85-90% token efficiency)\n- ✅ **v2.0.1** (2025-09-25) - Hybrid Orchestrator (Rust/Nushell)\n- ✅ **v2.0.1** (2025-10-02) - Workspace Switching system\n- ✅ **v2.0.0** (2025-09-23) - Configuration System (476+ accessors)\n\n### Roadmap\n\n- **Platform Services**\n - [ ] Web Control Center UI completion\n - [ ] API Gateway implementation\n - [ ] Enhanced MCP server capabilities\n\n- **Extension Ecosystem**\n - [ ] OCI registry for extension distribution\n - [ ] Community task service marketplace\n - [ ] Cluster template library\n\n- **Enterprise Features**\n - [ ] Multi-tenancy support\n - [ ] RBAC and audit logging\n - [ ] Cost tracking and optimization\n\n---\n\n## Support and 
Community\n\n### Getting Help\n\n- **Documentation**: Start with `provisioning help` or `provisioning guide from-scratch`\n- **Issues**: Report bugs and request features on the issue tracker\n- **Discussions**: Join community discussions for questions and ideas\n\n### Contributing\n\nContributions are welcome. See [CONTRIBUTING.md](docs/development/CONTRIBUTING.md) for guidelines.\n\n**Key areas for contribution**:\n\n- New task service definitions\n- Cloud provider implementations\n- Cluster templates\n- Documentation improvements\n- Bug fixes and testing\n\n---\n\n## License\n\nSee [LICENSE](LICENSE) file in project root.\n\n---\n\n**Maintained By**: Architecture Team\n**Last Updated**: 2025-10-07\n**Project Home**: [provisioning/](provisioning/) +


+
+# Provisioning - Infrastructure Automation Platform
+
+> **A modular, declarative Infrastructure as Code (IaC) platform for managing complete infrastructure lifecycles**
+
+## Table of Contents
+
+- [What is Provisioning?](#what-is-provisioning)
+- [Why Provisioning?](#why-provisioning)
+- [Core Concepts](#core-concepts)
+- [Architecture](#architecture)
+- [Key Features](#key-features)
+- [Technology Stack](#technology-stack)
+- [How It Works](#how-it-works)
+- [Use Cases](#use-cases)
+- [Getting Started](#getting-started)
+
+---
+
+## What is Provisioning?
+
+**Provisioning** is a comprehensive **Infrastructure as Code (IaC)** platform designed to manage
+complete infrastructure lifecycles: cloud providers, infrastructure services, clusters,
+and isolated workspaces across multiple cloud/local environments.
+
+Extensible and customizable by design, it delivers type-safe, configuration-driven workflows
+with enterprise security (encrypted configuration, Cosmian KMS integration, Cedar policy engine,
+secrets management, authorization and permissions control, compliance checking, anomaly detection)
+and adaptable deployment modes (interactive UI, CLI automation, unattended CI/CD)
+suitable for any scale from development to production.
+
+### Technical Definition
+
+Declarative Infrastructure as Code (IaC) platform providing:
+
+- **Type-safe, configuration-driven workflows** with schema validation and constraint checking
+- **Modular, extensible architecture**: cloud providers, task services, clusters, workspaces
+- **Multi-cloud abstraction layer** with unified API (UpCloud, AWS, local infrastructure)
+- **High-performance state management**:
+  - Graph database backend for complex relationships
+  - Real-time state tracking and queries
+  - Multi-model data storage (document, graph, relational)
+- **Enterprise security stack**:
+  - Encrypted configuration and secrets management
+  - Cosmian KMS integration for confidential key management
+  - Cedar policy engine for fine-grained access control
+  - Authorization and permissions control via platform services
+  - Compliance checking and policy enforcement
+  - Anomaly detection for security monitoring
+  - Audit logging and compliance tracking
+- **Hybrid orchestration**: Rust-based performance layer + scripting flexibility
+- **Production-ready features**:
+  - Batch workflows with dependency resolution
+  - Checkpoint recovery and automatic rollback
+  - Parallel execution with state management
+- **Adaptable deployment modes**:
+  - Interactive TUI for guided setup
+  - Headless CLI for scripted automation
+  - Unattended mode for CI/CD pipelines
+- **Hierarchical configuration system** with inheritance and overrides
+
+### What It Does
+
+- **Provisions Infrastructure** - Create servers, networks, storage across multiple cloud providers
+- **Installs Services** - Deploy Kubernetes, containerd, databases, monitoring, and 50+ infrastructure components
+- **Manages Clusters** - Orchestrate complete cluster deployments with dependency management
+- **Handles Configuration** - Hierarchical configuration system with inheritance and overrides
+- **Orchestrates Workflows** - Batch operations with parallel execution and checkpoint recovery
+- **Manages Secrets** - SOPS/Age integration for encrypted configuration
+
+---
+
+## Why Provisioning?
+
+### The Problems It Solves
+
+#### 1. **Multi-Cloud Complexity**
+
+**Problem**: Each cloud provider has different APIs, tools, and workflows.
+
+**Solution**: Unified abstraction layer with provider-agnostic interfaces. Write configuration once, deploy anywhere.
+
+```text
+# Same configuration works on UpCloud, AWS, or local infrastructure
+server: Server {
+  name = "web-01"
+  plan = "medium"        # Abstract size, provider-specific translation
+  provider = "upcloud"   # Switch to "aws" or "local" as needed
+}
+```
+
+#### 2. **Dependency Hell**
+
+**Problem**: Infrastructure components have complex dependencies (Kubernetes needs containerd, Cilium needs Kubernetes, etc.).
+
+**Solution**: Automatic dependency resolution with topological sorting and health checks.
+
+```text
+# Provisioning resolves: containerd → etcd → kubernetes → cilium
+taskservs = ["cilium"]  # Automatically installs all dependencies
+```
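+The resolver's job can be pictured as a small dependency walk. The sketch below
+is illustrative only (a hypothetical `resolve-deps` helper with a hard-coded
+dependency table); the real resolver lives in the orchestrator and reads
+dependency declarations from each task service:
+
+```nu
+# Sketch: expand requested taskservs into an install order where every
+# service appears after its prerequisites (assumes an acyclic graph).
+def resolve-deps [requested: list<string>] {
+    let deps = {
+        kubernetes: [containerd etcd]
+        cilium: [kubernetes]
+        "rook-ceph": [kubernetes]
+    }
+    mut order = []
+    mut pending = $requested
+    while not ($pending | is-empty) {
+        let name = ($pending | first)
+        $pending = ($pending | skip 1)
+        let prereqs = if $name in ($deps | columns) { $deps | get $name } else { [] }
+        let done = $order   # snapshot: closures cannot capture mutable variables
+        let missing = ($prereqs | where {|d| $d not-in $done })
+        if ($missing | is-empty) {
+            if $name not-in $order { $order = ($order | append $name) }
+        } else {
+            # Requeue this service after its unmet prerequisites
+            $pending = ($missing | append $name | append $pending)
+        }
+    }
+    $order
+}
+
+resolve-deps [cilium]   # => [containerd, etcd, kubernetes, cilium]
+```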
+#### 3. **Configuration Sprawl**
+
+**Problem**: Environment variables, hardcoded values, scattered configuration files.
+
+**Solution**: Hierarchical configuration system with 476+ config accessors replacing 200+ ENV variables.
+
+```text
+Defaults → User → Project → Infrastructure → Environment → Runtime
+```
+
+#### 4. **Imperative Scripts**
+
+**Problem**: Brittle shell scripts that don't handle failures, don't support rollback, and are hard to maintain.
+
+**Solution**: Declarative Nickel configurations with validation, type safety, and automatic rollback.
+
+#### 5. **Lack of Visibility**
+
+**Problem**: No insight into what's happening during deployment, hard to debug failures.
+
+**Solution**:
+
+- Real-time workflow monitoring
+- Comprehensive logging system
+- Web-based control center
+- REST API for integration
+
+#### 6. **No Standardization**
+
+**Problem**: Each team builds their own deployment tools, no shared patterns.
+
+**Solution**: Reusable task services, cluster templates, and workflow patterns.
+
+---
+
+## Core Concepts
+
+### 1. **Providers**
+
+Cloud infrastructure backends that handle resource provisioning.
+
+- **UpCloud** - Primary cloud provider
+- **AWS** - Amazon Web Services integration
+- **Local** - Local infrastructure (VMs, Docker, bare metal)
+
+Providers implement a common interface, making infrastructure code portable.
+
+### 2. **Task Services (TaskServs)**
+
+Reusable infrastructure components that can be installed on servers.
+
+**Categories**:
+
+- **Container Runtimes** - containerd, Docker, Podman, crun, runc, youki
+- **Orchestration** - Kubernetes, etcd, CoreDNS
+- **Networking** - Cilium, Flannel, Calico, ip-aliases
+- **Storage** - Rook-Ceph, local storage
+- **Databases** - PostgreSQL, Redis, SurrealDB
+- **Observability** - Prometheus, Grafana, Loki
+- **Security** - Webhook, KMS, Vault
+- **Development** - Gitea, Radicle, ORAS
+
+Each task service includes:
+
+- Version management
+- Dependency declarations
+- Health checks
+- Installation/uninstallation logic
+- Configuration schemas
+
+### 3. **Clusters**
+
+Complete infrastructure deployments combining servers and task services.
+
+**Examples**:
+
+- **Kubernetes Cluster** - HA control plane + worker nodes + CNI + storage
+- **Database Cluster** - Replicated PostgreSQL with backup
+- **Build Infrastructure** - BuildKit + container registry + CI/CD
+
+Clusters handle:
+
+- Multi-node coordination
+- Service distribution
+- High availability
+- Rolling updates
+
+### 4. **Workspaces**
+
+Isolated environments for different projects or deployment stages.
+
+```text
+workspace_librecloud/          # Production workspace
+├── infra/                     # Infrastructure definitions
+├── config/                    # Workspace configuration
+├── extensions/                # Custom modules
+└── runtime/                   # State and runtime data
+
+workspace_dev/                 # Development workspace
+├── infra/
+└── config/
+```
+
+Switch between workspaces with a single command:
+
+```text
+provisioning workspace switch librecloud
+```
+
+### 5. **Workflows**
+
+Coordinated sequences of operations with dependency management.
+
+**Types**:
+
+- **Server Workflows** - Create/delete/update servers
+- **TaskServ Workflows** - Install/remove infrastructure services
+- **Cluster Workflows** - Deploy/scale complete clusters
+- **Batch Workflows** - Multi-cloud parallel operations
+
+**Features**:
+
+- Dependency resolution
+- Parallel execution
+- Checkpoint recovery
+- Automatic rollback
+- Progress monitoring
+
+---
+
+## Architecture
+
+### System Components
+
+```text
+┌─────────────────────────────────────────────────────────────────┐
+│                      User Interface Layer                       │
+│  • CLI (provisioning command)                                   │
+│  • Web Control Center (UI)                                      │
+│  • REST API                                                     │
+└─────────────────────────────────────────────────────────────────┘
+                                ↓
+┌─────────────────────────────────────────────────────────────────┐
+│                        Core Engine Layer                        │
+│  • Command Routing & Dispatch                                   │
+│  • Configuration Management                                     │
+│  • Provider Abstraction                                         │
+│  • Utility Libraries                                            │
+└─────────────────────────────────────────────────────────────────┘
+                                ↓
+┌─────────────────────────────────────────────────────────────────┐
+│                       Orchestration Layer                       │
+│  • Workflow Orchestrator (Rust/Nushell hybrid)                  │
+│  • Dependency Resolver                                          │
+│  • State Manager                                                │
+│  • Task Scheduler                                               │
+└─────────────────────────────────────────────────────────────────┘
+                                ↓
+┌─────────────────────────────────────────────────────────────────┐
+│                         Extension Layer                         │
+│  • Providers (Cloud APIs)                                       │
+│  • Task Services (Infrastructure Components)                    │
+│  • Clusters (Complete Deployments)                              │
+│  • Workflows (Automation Templates)                             │
+└─────────────────────────────────────────────────────────────────┘
+                                ↓
+┌─────────────────────────────────────────────────────────────────┐
+│                      Infrastructure Layer                       │
+│  • Cloud Resources (Servers, Networks, Storage)                 │
+│  • Kubernetes Clusters                                          │
+│  • Running Services                                             │
+└─────────────────────────────────────────────────────────────────┘
+```
+
+### Directory Structure
+
+```text
+project-provisioning/
+├── provisioning/              # Core provisioning system
+│   ├── core/                  # Core engine and libraries
+│   │   ├── cli/               # Command-line interface
+│   │   ├── nulib/             # Core Nushell libraries
+│   │   ├── plugins/           # System plugins
+│   │   └── scripts/           # Utility scripts
+│   │
+│   ├── extensions/            # Extensible components
+│   │   ├── providers/         # Cloud provider implementations
+│   │   ├── taskservs/         # Infrastructure service definitions
+│   │   ├── clusters/          # Complete cluster configurations
+│   │   └── workflows/         # Core workflow templates
+│   │
+│   ├── platform/              # Platform services
+│   │   ├── orchestrator/      # Rust orchestrator service
+│   │   ├── control-center/    # Web control center
+│   │   ├── mcp-server/        # Model Context Protocol server
+│   │   ├── api-gateway/       # REST API gateway
+│   │   ├── oci-registry/      # OCI registry for extensions
+│   │   └── installer/         # Platform installer (TUI + CLI)
+│   │
+│   ├── schemas/               # Nickel configuration schemas
+│   ├── config/                # Configuration files
+│   ├── templates/             # Template files
+│   └── tools/                 # Build and distribution tools
+│
+├── workspace/                 # User workspaces and data
+│   ├── infra/                 # Infrastructure definitions
+│   ├── config/                # User configuration
+│   ├── extensions/            # User extensions
+│   └── runtime/               # Runtime data and state
+│
+└── docs/                      # Documentation
+    ├── user/                  # User guides
+    ├── api/                   # API documentation
+    ├── architecture/          # Architecture docs
+    └── development/           # Development guides
+```
+
+### Platform Services
+
+#### 1. **Orchestrator** (`platform/orchestrator/`)
+
+- **Language**: Rust + Nushell
+- **Purpose**: Workflow execution, task scheduling, state management
+- **Features**:
+  - File-based persistence
+  - Priority processing
+  - Retry logic with exponential backoff
+  - Checkpoint-based recovery
+  - REST API endpoints
+
+#### 2. **Control Center** (`platform/control-center/`)
+
+- **Language**: Web UI + Backend API
+- **Purpose**: Web-based infrastructure management
+- **Features**:
+  - Dashboard views
+  - Real-time monitoring
+  - Interactive deployments
+  - Log viewing
+
+#### 3. **MCP Server** (`platform/mcp-server/`)
+
+- **Language**: Nushell
+- **Purpose**: Model Context Protocol integration for AI assistance
+- **Features**:
+  - 7 AI-powered settings tools
+  - Intelligent config completion
+  - Natural language infrastructure queries
+
+#### 4. **OCI Registry** (`platform/oci-registry/`)
+
+- **Purpose**: Extension distribution and versioning
+- **Features**:
+  - Task service packages
+  - Provider packages
+  - Cluster templates
+  - Workflow definitions
+
+#### 5. **Installer** (`platform/installer/`)
+
+- **Language**: Rust (Ratatui TUI) + Nushell
+- **Purpose**: Platform installation and setup
+- **Features**:
+  - Interactive TUI mode
+  - Headless CLI mode
+  - Unattended CI/CD mode
+  - Configuration generation
+
+---
+
+## Key Features
+
+### 1. **Modular CLI Architecture** (v3.2.0)
+
+84% code reduction with domain-driven design.
+
+- **Main CLI**: 211 lines (from 1,329 lines)
+- **80+ shortcuts**: `s` → `server`, `t` → `taskserv`, etc.
+- **Bi-directional help**: `provisioning help ws` = `provisioning ws help`
+- **7 domain modules**: infrastructure, orchestration, development, workspace, configuration, utilities, generation
+
+### 2. **Configuration System** (v2.0.0)
+
+Hierarchical, config-driven architecture.
+
+- **476+ config accessors** replacing 200+ ENV variables
+- **Hierarchical loading**: defaults → user → project → infra → env → runtime
+- **Variable interpolation**: `{{paths.base}}`, `{{env.HOME}}`, `{{now.date}}`
+- **Multi-format support**: TOML, YAML, Nickel
+
+### 3. **Batch Workflow System** (v3.1.0)
+
+Provider-agnostic batch operations with 85-90% token efficiency.
+
+- **Multi-cloud support**: Mixed UpCloud + AWS + local in single workflow
+- **Nickel schema integration**: Type-safe workflow definitions
+- **Dependency resolution**: Topological sorting with soft/hard dependencies
+- **State management**: Checkpoint-based recovery with rollback
+- **Real-time monitoring**: Live progress tracking
+
+### 4. **Hybrid Orchestrator** (v3.0.0)
+
+Rust/Nushell architecture solving deep call stack limitations.
+
+- **High-performance coordination layer**
+- **File-based persistence**
+- **Priority processing with retry logic**
+- **REST API for external integration**
+- **Comprehensive workflow system**
+
+### 5. **Workspace Switching** (v2.0.5)
+
+Centralized workspace management.
+
+- **Single-command switching**: `provisioning workspace switch <name>`
+- **Automatic tracking**: Last-used timestamps, active workspace markers
+- **User preferences**: Global settings across all workspaces
+- **Workspace registry**: Centralized configuration in `user_config.yaml`
+
+### 6.
**Interactive Guides** (v3.3.0) + +Step-by-step walkthroughs and quick references. + +- **Quick reference**: `provisioning sc` (fastest) +- **Complete guides**: from-scratch, update, customize +- **Copy-paste ready**: All commands include placeholders +- **Beautiful rendering**: Uses glow, bat, or less + +### 7. **Test Environment Service** (v3.4.0) + +Automated container-based testing. + +- **Three test types**: Single taskserv, server simulation, multi-node clusters +- **Topology templates**: Kubernetes HA, etcd clusters, etc. +- **Auto-cleanup**: Optional automatic cleanup after tests +- **CI/CD integration**: Easy integration into pipelines + +### 8. **Platform Installer** (v3.5.0) + +Multi-mode installation system with TUI, CLI, and unattended modes. + +- **Interactive TUI**: Beautiful Ratatui terminal UI with 7 screens +- **Headless Mode**: CLI automation for scripted installations +- **Unattended Mode**: Zero-interaction CI/CD deployments +- **Deployment Modes**: Solo (2 CPU/4 GB), MultiUser (4 CPU/8 GB), CICD (8 CPU/16 GB), Enterprise (16 CPU/32 GB) +- **MCP Integration**: 7 AI-powered settings tools for intelligent configuration + +### 9. **Version Management** + +Comprehensive version tracking and updates. + +- **Automatic updates**: Check for taskserv updates +- **Version constraints**: Semantic versioning support +- **Grace periods**: Cached version checks +- **Update strategies**: major, minor, patch, none + +--- + +## Technology Stack + +### Core Technologies + +| Technology | Version | Purpose | Why | +| ------------ | --------- | --------- | ----- | +| **Nushell** | 0.107.1+ | Primary shell and scripting language | Data pipelines, cross-platform, modern parsers | +| **Nickel** | 1.0.0+ | Configuration language | Type safety, schema validation, immutability, constraint checking | +| **Rust** | Latest | Platform services (orchestrator, control-center, installer) | Performance, memory safety, concurrency, reliability | +| **Tera** | Latest | Template engine | Jinja2-like syntax, configuration file rendering, variable interpolation, filters and functions | + +### Data & State Management + +| Technology | Version | Purpose | Features | +| ------------ | --------- | --------- | ---------- | +| **SurrealDB** | Latest | Graph database backend | Multi-model, real-time queries, distributed, relationships | + +### Platform Services (Rust-based) + +| Service | Purpose | Security Features | +| --------- | --------- | ------------------- | +| **Orchestrator** | Workflow execution, task scheduling, state management | File-based persistence, retry logic, checkpoint recovery | +| **Control Center** | Web-based infrastructure management | **Authorization and permissions control**, RBAC, audit logging | +| **Installer** | Platform installation (TUI + CLI modes) | Secure configuration generation, validation | +| **API Gateway** | REST API for external integration | Authentication, rate limiting, request validation | + +### Security & Secrets + +| Technology | Version | Purpose | Enterprise Features | +| ------------ | --------- | --------- | --------------------- | +| **SOPS** | 3.10.2+ | Secrets management | Encrypted configuration files | +| **Age** | 1.2.1+ | Encryption | Secure key-based encryption | +| **Cosmian KMS** | Latest | Key Management System | Confidential computing, secure key storage, cloud-native KMS | +| **Cedar** | Latest | Policy engine | Fine-grained access control, policy-as-code, compliance checking, anomaly detection | + +### Optional Tools + +| Tool | Purpose | +| 
------ | --------- |
+| **K9s** | Kubernetes management interface |
+| **nu_plugin_tera** | Nushell plugin for Tera template rendering |
+| **glow** | Markdown rendering for interactive guides |
+| **bat** | Syntax highlighting for file viewing and guides |
+
+---
+
+## How It Works
+
+### Data Flow
+
+```text
+1. User defines infrastructure in Nickel
+   ↓
+2. CLI loads configuration (hierarchical)
+   ↓
+3. Configuration validated against schemas
+   ↓
+4. Workflow created with operations
+   ↓
+5. Orchestrator receives workflow
+   ↓
+6. Dependencies resolved (topological sort)
+   ↓
+7. Operations executed in order
+   ↓
+8. Providers handle cloud operations
+   ↓
+9. Task services installed on servers
+   ↓
+10. State persisted and monitored
+```
+
+### Example Workflow: Deploy Kubernetes Cluster
+
+**Step 1**: Define infrastructure in Nickel
+
+```text
+# infra/my-cluster.ncl
+let config = {
+  infra = {
+    name = "my-cluster",
+    provider = "upcloud",
+  },
+
+  servers = [
+    {name = "control-01", plan = "medium", role = "control"},
+    {name = "worker-01", plan = "large", role = "worker"},
+    {name = "worker-02", plan = "large", role = "worker"},
+  ],
+
+  taskservs = ["kubernetes", "cilium", "rook-ceph"],
+} in
+config
+```
+
+**Step 2**: Submit to Provisioning
+
+```text
+provisioning server create --infra my-cluster
+```
+
+**Step 3**: Provisioning executes workflow
+
+```text
+1. Create workflow: "deploy-my-cluster"
+2. Resolve dependencies:
+   - containerd (required by kubernetes)
+   - etcd (required by kubernetes)
+   - kubernetes (explicitly requested)
+   - cilium (explicitly requested, requires kubernetes)
+   - rook-ceph (explicitly requested, requires kubernetes)
+
+3. Execution order:
+   a. Provision servers (parallel)
+   b. Install containerd on all nodes
+   c. Install etcd on control nodes
+   d. Install kubernetes control plane
+   e. Join worker nodes
+   f. Install Cilium CNI
+   g. Install Rook-Ceph storage
+
+4. Checkpoint after each step
+5. Monitor health checks
+6. Report completion
+```
+
+**Step 4**: Verify deployment
+
+```text
+provisioning cluster status my-cluster
+```
+
+### Configuration Hierarchy
+
+Configuration values are resolved through a hierarchy:
+
+```text
+1. System Defaults (provisioning/config/config.defaults.toml)
+   ↓ (overridden by)
+2. User Preferences (~/.config/provisioning/user_config.yaml)
+   ↓ (overridden by)
+3. Workspace Config (workspace/config/provisioning.yaml)
+   ↓ (overridden by)
+4. Infrastructure Config (workspace/infra/<name>/config.toml)
+   ↓ (overridden by)
+5. Environment Config (workspace/config/prod-defaults.toml)
+   ↓ (overridden by)
+6. Runtime Flags (--flag value)
+```
+
+**Example**:
+
+```text
+# System default
+[servers]
+default_plan = "small"
+
+# User preference
+[servers]
+default_plan = "medium"    # Overrides system default
+
+# Infrastructure config
+[servers]
+default_plan = "large"     # Overrides user preference
+
+# Runtime
+provisioning server create --plan xlarge   # Overrides everything
+```
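+In code terms, resolution is a fold over the layers, where each successive
+layer deep-merges over the previous result. A minimal sketch (the paths are
+illustrative; the real loader also handles interpolation and validation):
+
+```nu
+# Sketch: later layers override earlier ones via a deep merge
+let layers = [
+    "provisioning/config/config.defaults.toml"
+    "~/.config/provisioning/user_config.yaml"
+    "workspace/config/provisioning.yaml"
+]
+
+let config = (
+    $layers
+    | each {|f| $f | path expand }
+    | where {|f| $f | path exists }
+    | each {|f| open $f }
+    | reduce -f {} {|layer, acc| $acc | merge deep $layer }
+)
+
+$config.servers.default_plan   # highest-priority layer wins
+```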
+
+---
+
+## Use Cases
+
+### 1. **Multi-Cloud Kubernetes Deployment**
+
+Deploy Kubernetes clusters across different cloud providers with identical configuration.
+
+```text
+# UpCloud cluster
+provisioning cluster create k8s-prod --provider upcloud
+
+# AWS cluster (same config)
+provisioning cluster create k8s-prod --provider aws
+```
+
+### 2. **Development → Staging → Production Pipeline**
+
+Manage multiple environments with workspace switching.
+
+```text
+# Development
+provisioning workspace switch dev
+provisioning cluster create app-stack
+
+# Staging (same config, different resources)
+provisioning workspace switch staging
+provisioning cluster create app-stack
+
+# Production (HA, larger resources)
+provisioning workspace switch prod
+provisioning cluster create app-stack
+```
+
+### 3. **Infrastructure as Code Testing**
+
+Test infrastructure changes before deploying to production.
+
+```text
+# Test Kubernetes upgrade locally
+provisioning test topology load kubernetes_3node | \
+  test env cluster kubernetes --version 1.29.0
+
+# Verify functionality
+provisioning test env run <env-id>
+
+# Cleanup
+provisioning test env cleanup <env-id>
+```
+
+### 4. **Batch Multi-Region Deployment**
+
+Deploy to multiple regions in parallel.
+
+```text
+# workflows/multi-region.ncl
+let batch_workflow = {
+  operations = [
+    {
+      id = "eu-cluster",
+      type = "cluster",
+      region = "eu-west-1",
+      cluster = "app-stack",
+    },
+    {
+      id = "us-cluster",
+      type = "cluster",
+      region = "us-east-1",
+      cluster = "app-stack",
+    },
+    {
+      id = "asia-cluster",
+      type = "cluster",
+      region = "ap-south-1",
+      cluster = "app-stack",
+    },
+  ],
+  parallel_limit = 3,   # All at once
+} in
+batch_workflow
+```
+
+```text
+provisioning batch submit workflows/multi-region.ncl
+provisioning batch monitor <workflow-id>
+```
+
+### 5. **Automated Disaster Recovery**
+
+Recreate infrastructure from configuration.
+
+```text
+# Infrastructure destroyed
+provisioning workspace switch prod
+
+# Recreate from config
+provisioning cluster create --infra backup-restore --wait
+
+# All services restored with same configuration
+```
+
+### 6. **CI/CD Integration**
+
+Automated testing and deployment pipelines.
+
+```text
+# .gitlab-ci.yml
+test-infrastructure:
+  script:
+    - provisioning test quick kubernetes
+    - provisioning test quick postgres
+
+deploy-staging:
+  script:
+    - provisioning workspace switch staging
+    - provisioning cluster create app-stack --check
+    - provisioning cluster create app-stack --yes
+
+deploy-production:
+  when: manual
+  script:
+    - provisioning workspace switch prod
+    - provisioning cluster create app-stack --yes
+```
+
+---
+
+## Getting Started
+
+### Quick Start
+
+1. **Install Prerequisites**
+
+   ```bash
+   # Install Nushell
+   brew install nushell   # macOS
+
+   # Install Nickel
+   brew install nickel    # macOS
+
+   # Install SOPS (optional, for secrets)
+   brew install sops
+   ```
+
+2. **Add CLI to PATH**
+
+   ```bash
+   ln -sf "$(pwd)/provisioning/core/cli/provisioning" /usr/local/bin/provisioning
+   ```
+
+3. **Initialize Workspace**
+
+   ```bash
+   provisioning workspace init my-project
+   ```
+
+4. **Configure Provider**
+
+   ```bash
+   # Edit workspace config
+   provisioning sops workspace/config/provisioning.yaml
+   ```
+
+5. **Deploy Infrastructure**
+
+   ```bash
+   # Check what will be created
+   provisioning server create --check
+
+   # Create servers
+   provisioning server create --yes
+
+   # Install Kubernetes
+   provisioning taskserv create kubernetes
+   ```
+
+### Learning Path
+
+1. **Start with Guides**
+
+   ```bash
+   provisioning sc                   # Quick reference
+   provisioning guide from-scratch   # Complete walkthrough
+   ```
+
+2. **Explore Examples**
+
+   ```bash
+   ls provisioning/examples/
+   ```
+
+3. **Read Architecture Docs**
+   - [Architecture Overview](architecture/ARCHITECTURE_OVERVIEW.md)
+   - [Multi-Repo Strategy](architecture/multi-repo-strategy.md)
+   - [Integration Patterns](architecture/integration-patterns.md)
+
+4. **Try Test Environments**
+
+   ```bash
+   provisioning test quick kubernetes
+   provisioning test quick postgres
+   ```
+
+5. **Build Custom Extensions**
+   - Create custom task services (see the sketch below)
+   - Define cluster templates
+   - Write workflow automation
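+To get a feel for what an extension involves, here is a hypothetical shape of
+a custom taskserv module (names and signatures are illustrative, not the
+actual contract; see the extension development guides for the real interface):
+
+```nu
+# Sketch: the kind of surface a taskserv exposes to the platform
+export def metadata [] {
+    {
+        name: "my-service"
+        version: "1.0.0"
+        dependencies: [containerd]   # resolved like any other taskserv
+    }
+}
+
+export def install [server: string] {
+    print $"Installing my-service on ($server)"
+    # idempotent installation steps go here
+}
+
+export def health-check [server: string] {
+    # return true once the service responds
+    true
+}
+```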
+
+---
+
+## Documentation Index
+
+### User Documentation
+
+- **[Quick Start Guide](quickstart/01-prerequisites.md)** - Get started in 10 minutes
+- **[Service Management Guide](user/SERVICE_MANAGEMENT_GUIDE.md)** - Complete service reference
+- **[Authentication Guide](user/AUTHENTICATION_LAYER_GUIDE.md)** - Authentication and security
+- **[Workspace Switching Guide](user/WORKSPACE_SWITCHING_GUIDE.md)** - Workspace management
+- **[Test Environment Guide](infrastructure/test-environment-guide.md)** - Testing infrastructure
+
+### Architecture Documentation
+
+- **[Architecture Overview](architecture/ARCHITECTURE_OVERVIEW.md)** - System architecture
+- **[Multi-Repo Strategy](architecture/multi-repo-strategy.md)** - Repository organization
+- **[Integration Patterns](architecture/integration-patterns.md)** - Integration design
+- **[Orchestrator Integration](architecture/orchestrator-integration-model.md)** - Workflow execution
+- **[ADR Index](architecture/adr/README.md)** - Architecture Decision Records
+- **[Database Architecture](architecture/DATABASE_AND_CONFIG_ARCHITECTURE.md)** - Data layer design
+
+### Development Documentation
+
+- **[Development Workflow](development/workflow.md)** - Development process
+- **[Integration Guide](development/integration.md)** - Integration patterns
+- **[Command Handler Guide](development/COMMAND_HANDLER_GUIDE.md)** - CLI development
+
+### API Documentation
+
+- **[REST API](api-reference/rest-api.md)** - HTTP endpoints
+- **[WebSocket API](api-reference/websocket.md)** - Real-time communication
+- **[Extensions API](api-reference/extensions.md)** - Extension interface
+- **[Integration Examples](api-reference/integration-examples.md)** - API usage examples
+
+---
+
+## Project Status
+
+**Current Version**: Active Development (2025-10-07)
+
+### Recent Milestones
+
+- ✅ **v3.5.0** (2025-10-06) - Platform Installer with TUI and CI/CD modes
+- ✅ **v3.4.0** (2025-10-06) - Test Environment Service with container management
+- ✅ **v3.3.0** (2025-09-30) - Interactive Guides system
+- ✅ **v3.2.0** (2025-09-30) - Modular CLI Architecture (84% code reduction)
+- ✅ **v3.1.0** (2025-09-25) - Batch Workflow System (85-90% token efficiency)
+- ✅ **v3.0.0** (2025-09-25) - Hybrid Orchestrator (Rust/Nushell)
+- ✅ **v2.0.5** (2025-10-02) - Workspace Switching system
+- ✅ **v2.0.0** (2025-09-23) - Configuration System (476+ accessors)
+
+### Roadmap
+
+- **Platform Services**
+  - [ ] Web Control Center UI completion
+  - [ ] API Gateway implementation
+  - [ ] Enhanced MCP server capabilities
+
+- **Extension Ecosystem**
+  - [ ] OCI registry for extension distribution
+  - [ ] Community task service marketplace
+  - [ ] Cluster template library
+
+- **Enterprise Features**
+  - [ ] Multi-tenancy support
+  - [ ] RBAC and audit logging
+  - [ ] Cost tracking and optimization
+
+---
+
+## Support and Community
+
+### Getting Help
+
+- **Documentation**: Start with `provisioning help` or `provisioning guide from-scratch`
+- **Issues**: Report bugs and request features on the issue tracker
+- **Discussions**: Join community discussions for questions and ideas
+
+### Contributing
+
+Contributions are welcome. See [CONTRIBUTING.md](docs/development/CONTRIBUTING.md) for guidelines.
+ +**Key areas for contribution**: + +- New task service definitions +- Cloud provider implementations +- Cluster templates +- Documentation improvements +- Bug fixes and testing + +--- + +## License + +See [LICENSE](LICENSE) file in project root. + +--- + +**Maintained By**: Architecture Team +**Last Updated**: 2025-10-07 +**Project Home**: [provisioning/](provisioning/) diff --git a/docs/src/README.md b/docs/src/README.md index aded409..af9ee95 100644 --- a/docs/src/README.md +++ b/docs/src/README.md @@ -1 +1,385 @@ -

\n Provisioning Logo\n

\n

\n Provisioning\n

\n\n# Provisioning Platform Documentation\n\n**Last Updated**: 2025-01-02 (Phase 3.A Cleanup Complete)\n**Status**: ✅ Primary documentation source (145 files consolidated)\n\nWelcome to the comprehensive documentation for the Provisioning Platform - a modern, cloud-native infrastructure automation system built with Nushell,\nNickel, and Rust.\n\n> **Note**: Architecture Decision Records (ADRs) and design documentation are in `docs/`\n> directory. This location contains user-facing, operational, and product documentation.\n\n---\n\n## Quick Navigation\n\n### 🚀 Getting Started\n\n| Document | Description | Audience |\n| ---------- | ------------- | ---------- |\n| **[Installation Guide](getting-started/installation-guide.md)** | Install and configure the system | New Users |\n| **[Getting Started](getting-started/getting-started.md)** | First steps and basic concepts | New Users |\n| **[Quick Reference](getting-started/quickstart-cheatsheet.md)** | Command cheat sheet | All Users |\n| **[From Scratch Guide](guides/from-scratch.md)** | Complete deployment walkthrough | New Users |\n\n### 📚 User Guides\n\n| Document | Description |\n| ---------- | ------------- |\n| **[CLI Reference](infrastructure/cli-reference.md)** | Complete command reference |\n| **[Workspace Management](infrastructure/workspace-setup.md)** | Workspace creation and management |\n| **[Workspace Switching](infrastructure/workspace-switching-guide.md)** | Switch between workspaces |\n| **[Infrastructure Management](infrastructure/infrastructure-management.md)** | Server, taskserv, cluster operations |\n| **[Service Management](operations/service-management-guide.md)** | Platform service lifecycle management |\n| **[OCI Registry](integration/oci-registry-guide.md)** | OCI artifact management |\n| **[Gitea Integration](integration/gitea-integration-guide.md)** | Git workflow and collaboration |\n| **[CoreDNS Guide](operations/coredns-guide.md)** | DNS management |\n| **[Test Environments](testing/test-environment-usage.md)** | Containerized testing |\n| **[Extension Development](development/extension-development.md)** | Create custom extensions |\n\n### 🏗️ Architecture\n\n| Document | Description |\n| ---------- | ------------- |\n| **[System Overview](architecture/system-overview.md)** | High-level architecture |\n| **[Multi-Repo Architecture](architecture/multi-repo-architecture.md)** | Repository structure and OCI distribution |\n| **[Design Principles](architecture/design-principles.md)** | Architectural philosophy |\n| **[Integration Patterns](architecture/integration-patterns.md)** | System integration patterns |\n| **[Orchestrator Model](architecture/orchestrator-integration-model.md)** | Hybrid orchestration architecture |\n\n### 📋 Architecture Decision Records (ADRs)\n\n| ADR | Title | Status |\n| ----- | ------- | -------- |\n| **[ADR-001](architecture/adr/adr-001-project-structure.md)** | Project Structure Decision | Accepted |\n| **[ADR-002](architecture/adr/adr-002-distribution-strategy.md)** | Distribution Strategy | Accepted |\n| **[ADR-003](architecture/adr/adr-003-workspace-isolation.md)** | Workspace Isolation | Accepted |\n| **[ADR-004](architecture/adr/adr-004-hybrid-architecture.md)** | Hybrid Architecture | Accepted |\n| **[ADR-005](architecture/adr/adr-005-extension-framework.md)** | Extension Framework | Accepted |\n| **[ADR-006](architecture/adr/adr-006-provisioning-cli-refactoring.md)** | CLI Refactoring | Accepted |\n\n### 🔌 API Documentation\n\n| Document | Description |\n| ---------- | 
------------- |\n| **[REST API](api-reference/rest-api.md)** | HTTP API endpoints |\n| **[WebSocket API](api-reference/websocket.md)** | Real-time event streams |\n| **[Extensions API](development/extensions.md)** | Extension integration APIs |\n| **[SDKs](api-reference/sdks.md)** | Client libraries |\n| **[Integration Examples](api-reference/integration-examples.md)** | API usage examples |\n\n### 🛠️ Development\n\n| Document | Description |\n| ---------- | ------------- |\n| **[Development README](development/README.md)** | Developer overview |\n| **[Implementation Guide](development/implementation-guide.md)** | Implementation details |\n| **[Provider Development](development/quick-provider-guide.md)** | Create cloud providers |\n| **[Taskserv Development](development/taskserv-developer-guide.md)** | Create task services |\n| **[Extension Framework](development/extensions.md)** | Extension system |\n| **[Command Handlers](development/command-handler-guide.md)** | CLI command development |\n\n### 🐛 Troubleshooting\n\n| Document | Description |\n| ---------- | ------------- |\n| **[Troubleshooting Guide](troubleshooting/troubleshooting-guide.md)** | Common issues and solutions |\n\n### 📖 How-To Guides\n\n| Document | Description |\n| ---------- | ------------- |\n| **[From Scratch](guides/from-scratch.md)** | Complete deployment from zero |\n| **[Update Infrastructure](guides/update-infrastructure.md)** | Safe update procedures |\n| **[Customize Infrastructure](guides/customize-infrastructure.md)** | Layer and template customization |\n\n### 🔐 Configuration\n\n| Document | Description |\n| ---------- | ------------- |\n| **[Workspace Config Architecture](configuration/workspace-config-architecture.md)** | Configuration architecture |\n\n### 📦 Quick References\n\n| Document | Description |\n| ---------- | ------------- |\n| **[Quickstart Cheatsheet](getting-started/quickstart-cheatsheet.md)** | Command shortcuts |\n| **[OCI Quick Reference](quick-reference/oci.md)** | OCI operations |\n\n---\n\n## Documentation Structure\n\n```text\nprovisioning/docs/src/\n├── README.md (this file) # Documentation hub\n├── getting-started/ # Getting started guides\n│ ├── installation-guide.md\n│ ├── getting-started.md\n│ └── quickstart-cheatsheet.md\n├── architecture/ # System architecture\n│ ├── adr/ # Architecture Decision Records\n│ ├── design-principles.md\n│ ├── integration-patterns.md\n│ ├── system-overview.md\n│ └── ... (and 10+ more architecture docs)\n├── infrastructure/ # Infrastructure guides\n│ ├── cli-reference.md\n│ ├── workspace-setup.md\n│ ├── workspace-switching-guide.md\n│ └── infrastructure-management.md\n├── api-reference/ # API documentation\n│ ├── rest-api.md\n│ ├── websocket.md\n│ ├── integration-examples.md\n│ └── sdks.md\n├── development/ # Developer guides\n│ ├── README.md\n│ ├── implementation-guide.md\n│ ├── quick-provider-guide.md\n│ ├── taskserv-developer-guide.md\n│ └── ... (15+ more developer docs)\n├── guides/ # How-to guides\n│ ├── from-scratch.md\n│ ├── update-infrastructure.md\n│ └── customize-infrastructure.md\n├── operations/ # Operations guides\n│ ├── service-management-guide.md\n│ ├── coredns-guide.md\n│ └── ... 
(more operations docs)\n├── security/ # Security docs\n├── integration/ # Integration guides\n├── testing/ # Testing docs\n├── configuration/ # Configuration docs\n├── troubleshooting/ # Troubleshooting guides\n└── quick-reference/ # Quick references\n```\n\n---\n\n## Key Concepts\n\n### Infrastructure as Code (IaC)\n\nThe provisioning platform uses **declarative configuration** to manage infrastructure. Instead of manually creating resources, you define what you\nwant in Nickel configuration files, and the system makes it happen.\n\n### Mode-Based Architecture\n\nThe system supports four operational modes:\n\n- **Solo**: Single developer local development\n- **Multi-user**: Team collaboration with shared services\n- **CI/CD**: Automated pipeline execution\n- **Enterprise**: Production deployment with strict compliance\n\n### Extension System\n\nExtensibility through:\n\n- **Providers**: Cloud platform integrations (AWS, UpCloud, Local)\n- **Task Services**: Infrastructure components (Kubernetes, databases, etc.)\n- **Clusters**: Complete deployment configurations\n\n### OCI-Native Distribution\n\nExtensions and packages distributed as OCI artifacts, enabling:\n\n- Industry-standard packaging\n- Efficient caching and bandwidth\n- Version pinning and rollback\n- Air-gapped deployments\n\n---\n\n## Documentation by Role\n\n### For New Users\n\n1. Start with **[Installation Guide](getting-started/installation-guide.md)**\n2. Read **[Getting Started](getting-started/getting-started.md)**\n3. Follow **[From Scratch Guide](guides/from-scratch.md)**\n4. Reference **[Quickstart Cheatsheet](guides/quickstart-cheatsheet.md)**\n\n### For Developers\n\n1. Review **[System Overview](architecture/system-overview.md)**\n2. Study **[Design Principles](architecture/design-principles.md)**\n3. Read relevant **[ADRs](architecture/)**\n4. Follow **[Development Guide](development/README.md)**\n5. Reference **Nickel Quick Reference**\n\n### For Operators\n\n1. Understand **[Mode System](infrastructure/mode-system)**\n2. Learn **[Service Management](operations/service-management-guide.md)**\n3. Review **[Infrastructure Management](infrastructure/infrastructure-management.md)**\n4. Study **[OCI Registry](integration/oci-registry-guide.md)**\n\n### For Architects\n\n1. Read **[System Overview](architecture/system-overview.md)**\n2. Study all **[ADRs](architecture/)**\n3. Review **[Integration Patterns](architecture/integration-patterns.md)**\n4. 
Understand **[Multi-Repo Architecture](architecture/multi-repo-architecture.md)**\n\n---\n\n## System Capabilities\n\n### ✅ Infrastructure Automation\n\n- Multi-cloud support (AWS, UpCloud, Local)\n- Declarative configuration with Nickel\n- Automated dependency resolution\n- Batch operations with rollback\n\n### ✅ Workflow Orchestration\n\n- Hybrid Rust/Nushell orchestration\n- Checkpoint-based recovery\n- Parallel execution with limits\n- Real-time monitoring\n\n### ✅ Test Environments\n\n- Containerized testing\n- Multi-node cluster simulation\n- Topology templates\n- Automated cleanup\n\n### ✅ Mode-Based Operation\n\n- Solo: Local development\n- Multi-user: Team collaboration\n- CI/CD: Automated pipelines\n- Enterprise: Production deployment\n\n### ✅ Extension Management\n\n- OCI-native distribution\n- Automatic dependency resolution\n- Version management\n- Local and remote sources\n\n---\n\n## Key Achievements\n\n### 🚀 Batch Workflow System (v3.1.0)\n\n- Provider-agnostic batch operations\n- Mixed provider support (UpCloud + AWS + local)\n- Dependency resolution with soft/hard dependencies\n- Real-time monitoring and rollback\n\n### 🏗️ Hybrid Orchestrator (v3.0.0)\n\n- Solves Nushell deep call stack limitations\n- Preserves all business logic\n- REST API for external integration\n- Checkpoint-based state management\n\n### ⚙️ Configuration System (v2.0.0)\n\n- Migrated from ENV to config-driven\n- Hierarchical configuration loading\n- Variable interpolation\n- True IaC without hardcoded fallbacks\n\n### 🎯 Modular CLI (v3.2.0)\n\n- 84% reduction in main file size\n- Domain-driven handlers\n- 80+ shortcuts\n- Bi-directional help system\n\n### 🧪 Test Environment Service (v3.4.0)\n\n- Automated containerized testing\n- Multi-node cluster topologies\n- CI/CD integration ready\n- Template-based configurations\n\n### 🔄 Workspace Switching (v2.0.5)\n\n- Centralized workspace management\n- Single-command workspace switching\n- Active workspace tracking\n- User preference system\n\n---\n\n## Technology Stack\n\n| Component | Technology | Purpose |\n| ----------- | ------------ | --------- |\n| **Core CLI** | Nushell 0.107.1 | Shell and scripting |\n| **Configuration** | Nickel 1.0.0+ | Type-safe IaC |\n| **Orchestrator** | Rust | High-performance coordination |\n| **Templates** | Jinja2 (nu_plugin_tera) | Code generation |\n| **Secrets** | SOPS 3.10.2 + Age 1.2.1 | Encryption |\n| **Distribution** | OCI (skopeo/crane/oras) | Artifact management |\n\n---\n\n## Support\n\n### Getting Help\n\n- **Documentation**: You're reading it!\n- **Quick Reference**: Run `provisioning sc` or `provisioning guide quickstart`\n- **Help System**: Run `provisioning help` or `provisioning help`\n- **Interactive Shell**: Run `provisioning nu` for Nushell REPL\n\n### Reporting Issues\n\n- Check **[Troubleshooting Guide](infrastructure/troubleshooting-guide.md)**\n- Review **[FAQ](troubleshooting/troubleshooting-guide.md)**\n- Enable debug mode: `provisioning --debug `\n- Check logs: `provisioning platform logs `\n\n---\n\n## Contributing\n\nThis project welcomes contributions! 
See **[Development Guide](development/README.md)** for:\n\n- Development setup\n- Code style guidelines\n- Testing requirements\n- Pull request process\n\n---\n\n## License\n\n[Add license information]\n\n---\n\n## Version History\n\n| Version | Date | Major Changes |\n| --------- | ------ | --------------- |\n| **3.5.0** | 2025-10-06 | Mode system, OCI registry, comprehensive documentation |\n| **3.4.0** | 2025-10-06 | Test environment service |\n| **3.3.0** | 2025-09-30 | Interactive guides system |\n| **3.2.0** | 2025-09-30 | Modular CLI refactoring |\n| **3.1.0** | 2025-09-25 | Batch workflow system |\n| **3.0.0** | 2025-09-25 | Hybrid orchestrator architecture |\n| **2.0.5** | 2025-10-02 | Workspace switching system |\n| **2.0.0** | 2025-09-23 | Configuration system migration |\n\n---\n\n**Maintained By**: Provisioning Team\n**Last Review**: 2025-10-06\n**Next Review**: 2026-01-06 +


+ +# Provisioning Platform Documentation + +**Last Updated**: 2025-01-02 (Phase 3.A Cleanup Complete) +**Status**: ✅ Primary documentation source (145 files consolidated) + +Welcome to the comprehensive documentation for the Provisioning Platform - a modern, cloud-native infrastructure automation system built with Nushell, +Nickel, and Rust. + +> **Note**: Architecture Decision Records (ADRs) and design documentation are in `docs/` +> directory. This location contains user-facing, operational, and product documentation. + +--- + +## Quick Navigation + +### 🚀 Getting Started + +| Document | Description | Audience | +| ---------- | ------------- | ---------- | +| **[Installation Guide](getting-started/installation-guide.md)** | Install and configure the system | New Users | +| **[Getting Started](getting-started/getting-started.md)** | First steps and basic concepts | New Users | +| **[Quick Reference](getting-started/quickstart-cheatsheet.md)** | Command cheat sheet | All Users | +| **[From Scratch Guide](guides/from-scratch.md)** | Complete deployment walkthrough | New Users | + +### 📚 User Guides + +| Document | Description | +| ---------- | ------------- | +| **[CLI Reference](infrastructure/cli-reference.md)** | Complete command reference | +| **[Workspace Management](infrastructure/workspace-setup.md)** | Workspace creation and management | +| **[Workspace Switching](infrastructure/workspace-switching-guide.md)** | Switch between workspaces | +| **[Infrastructure Management](infrastructure/infrastructure-management.md)** | Server, taskserv, cluster operations | +| **[Service Management](operations/service-management-guide.md)** | Platform service lifecycle management | +| **[OCI Registry](integration/oci-registry-guide.md)** | OCI artifact management | +| **[Gitea Integration](integration/gitea-integration-guide.md)** | Git workflow and collaboration | +| **[CoreDNS Guide](operations/coredns-guide.md)** | DNS management | +| **[Test Environments](testing/test-environment-usage.md)** | Containerized testing | +| **[Extension Development](development/extension-development.md)** | Create custom extensions | + +### 🏗️ Architecture + +| Document | Description | +| ---------- | ------------- | +| **[System Overview](architecture/system-overview.md)** | High-level architecture | +| **[Multi-Repo Architecture](architecture/multi-repo-architecture.md)** | Repository structure and OCI distribution | +| **[Design Principles](architecture/design-principles.md)** | Architectural philosophy | +| **[Integration Patterns](architecture/integration-patterns.md)** | System integration patterns | +| **[Orchestrator Model](architecture/orchestrator-integration-model.md)** | Hybrid orchestration architecture | + +### 📋 Architecture Decision Records (ADRs) + +| ADR | Title | Status | +| ----- | ------- | -------- | +| **[ADR-001](architecture/adr/adr-001-project-structure.md)** | Project Structure Decision | Accepted | +| **[ADR-002](architecture/adr/adr-002-distribution-strategy.md)** | Distribution Strategy | Accepted | +| **[ADR-003](architecture/adr/adr-003-workspace-isolation.md)** | Workspace Isolation | Accepted | +| **[ADR-004](architecture/adr/adr-004-hybrid-architecture.md)** | Hybrid Architecture | Accepted | +| **[ADR-005](architecture/adr/adr-005-extension-framework.md)** | Extension Framework | Accepted | +| **[ADR-006](architecture/adr/adr-006-provisioning-cli-refactoring.md)** | CLI Refactoring | Accepted | + +### 🔌 API Documentation + +| Document | Description | +| ---------- | ------------- 
| +| **[REST API](api-reference/rest-api.md)** | HTTP API endpoints | +| **[WebSocket API](api-reference/websocket.md)** | Real-time event streams | +| **[Extensions API](development/extensions.md)** | Extension integration APIs | +| **[SDKs](api-reference/sdks.md)** | Client libraries | +| **[Integration Examples](api-reference/integration-examples.md)** | API usage examples | + +### 🛠️ Development + +| Document | Description | +| ---------- | ------------- | +| **[Development README](development/README.md)** | Developer overview | +| **[Implementation Guide](development/implementation-guide.md)** | Implementation details | +| **[Provider Development](development/quick-provider-guide.md)** | Create cloud providers | +| **[Taskserv Development](development/taskserv-developer-guide.md)** | Create task services | +| **[Extension Framework](development/extensions.md)** | Extension system | +| **[Command Handlers](development/command-handler-guide.md)** | CLI command development | + +### 🐛 Troubleshooting + +| Document | Description | +| ---------- | ------------- | +| **[Troubleshooting Guide](troubleshooting/troubleshooting-guide.md)** | Common issues and solutions | + +### 📖 How-To Guides + +| Document | Description | +| ---------- | ------------- | +| **[From Scratch](guides/from-scratch.md)** | Complete deployment from zero | +| **[Update Infrastructure](guides/update-infrastructure.md)** | Safe update procedures | +| **[Customize Infrastructure](guides/customize-infrastructure.md)** | Layer and template customization | + +### 🔐 Configuration + +| Document | Description | +| ---------- | ------------- | +| **[Workspace Config Architecture](configuration/workspace-config-architecture.md)** | Configuration architecture | + +### 📦 Quick References + +| Document | Description | +| ---------- | ------------- | +| **[Quickstart Cheatsheet](getting-started/quickstart-cheatsheet.md)** | Command shortcuts | +| **[OCI Quick Reference](quick-reference/oci.md)** | OCI operations | + +--- + +## Documentation Structure + +```text +provisioning/docs/src/ +├── README.md (this file) # Documentation hub +├── getting-started/ # Getting started guides +│ ├── installation-guide.md +│ ├── getting-started.md +│ └── quickstart-cheatsheet.md +├── architecture/ # System architecture +│ ├── adr/ # Architecture Decision Records +│ ├── design-principles.md +│ ├── integration-patterns.md +│ ├── system-overview.md +│ └── ... (and 10+ more architecture docs) +├── infrastructure/ # Infrastructure guides +│ ├── cli-reference.md +│ ├── workspace-setup.md +│ ├── workspace-switching-guide.md +│ └── infrastructure-management.md +├── api-reference/ # API documentation +│ ├── rest-api.md +│ ├── websocket.md +│ ├── integration-examples.md +│ └── sdks.md +├── development/ # Developer guides +│ ├── README.md +│ ├── implementation-guide.md +│ ├── quick-provider-guide.md +│ ├── taskserv-developer-guide.md +│ └── ... (15+ more developer docs) +├── guides/ # How-to guides +│ ├── from-scratch.md +│ ├── update-infrastructure.md +│ └── customize-infrastructure.md +├── operations/ # Operations guides +│ ├── service-management-guide.md +│ ├── coredns-guide.md +│ └── ... 
(more operations docs)
+├── security/ # Security docs
+├── integration/ # Integration guides
+├── testing/ # Testing docs
+├── configuration/ # Configuration docs
+├── troubleshooting/ # Troubleshooting guides
+└── quick-reference/ # Quick references
+```
+
+---
+
+## Key Concepts
+
+### Infrastructure as Code (IaC)
+
+The provisioning platform uses **declarative configuration** to manage infrastructure. Instead of manually creating resources, you define the
+desired state in Nickel configuration files, and the system provisions resources to match.
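+
+For example, a workspace definition might declare a server along these lines (a minimal sketch; the path and field names are illustrative, not the platform's actual schema):
+
+```text
+# workspaces/dev/servers.ncl (hypothetical path and fields)
+{
+  servers = [
+    {
+      hostname = "web-01",
+      provider = "upcloud",
+      plan = "2xCPU-4GB",
+      zone = "de-fra1"
+    }
+  ]
+}
+```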
+
+### Mode-Based Architecture
+
+The system supports four operational modes:
+
+- **Solo**: Single developer local development
+- **Multi-user**: Team collaboration with shared services
+- **CI/CD**: Automated pipeline execution
+- **Enterprise**: Production deployment with strict compliance
+
+### Extension System
+
+Extensibility through:
+
+- **Providers**: Cloud platform integrations (AWS, UpCloud, Local)
+- **Task Services**: Infrastructure components (Kubernetes, databases, etc.)
+- **Clusters**: Complete deployment configurations
+
+### OCI-Native Distribution
+
+Extensions and packages distributed as OCI artifacts, enabling:
+
+- Industry-standard packaging
+- Efficient caching and bandwidth
+- Version pinning and rollback
+- Air-gapped deployments
+
+---
+
+## Documentation by Role
+
+### For New Users
+
+1. Start with **[Installation Guide](getting-started/installation-guide.md)**
+2. Read **[Getting Started](getting-started/getting-started.md)**
+3. Follow **[From Scratch Guide](guides/from-scratch.md)**
+4. Reference **[Quickstart Cheatsheet](getting-started/quickstart-cheatsheet.md)**
+
+### For Developers
+
+1. Review **[System Overview](architecture/system-overview.md)**
+2. Study **[Design Principles](architecture/design-principles.md)**
+3. Read relevant **[ADRs](architecture/)**
+4. Follow **[Development Guide](development/README.md)**
+5. Reference **Nickel Quick Reference**
+
+### For Operators
+
+1. Understand **[Mode System](infrastructure/mode-system-guide.md)**
+2. Learn **[Service Management](operations/service-management-guide.md)**
+3. Review **[Infrastructure Management](infrastructure/infrastructure-management.md)**
+4. Study **[OCI Registry](integration/oci-registry-guide.md)**
+
+### For Architects
+
+1. Read **[System Overview](architecture/system-overview.md)**
+2. Study all **[ADRs](architecture/)**
+3. Review **[Integration Patterns](architecture/integration-patterns.md)**
+4. Understand **[Multi-Repo Architecture](architecture/multi-repo-architecture.md)**
+
+---
+
+## System Capabilities
+
+### ✅ Infrastructure Automation
+
+- Multi-cloud support (AWS, UpCloud, Local)
+- Declarative configuration with Nickel
+- Automated dependency resolution
+- Batch operations with rollback
+
+### ✅ Workflow Orchestration
+
+- Hybrid Rust/Nushell orchestration
+- Checkpoint-based recovery
+- Parallel execution with limits
+- Real-time monitoring
+
+### ✅ Test Environments
+
+- Containerized testing
+- Multi-node cluster simulation
+- Topology templates
+- Automated cleanup
+
+### ✅ Mode-Based Operation
+
+- Solo: Local development
+- Multi-user: Team collaboration
+- CI/CD: Automated pipelines
+- Enterprise: Production deployment
+
+### ✅ Extension Management
+
+- OCI-native distribution
+- Automatic dependency resolution
+- Version management
+- Local and remote sources
+
+---
+
+## Key Achievements
+
+### 🚀 Batch Workflow System (v3.1.0)
+
+- Provider-agnostic batch operations
+- Mixed provider support (UpCloud + AWS + local)
+- Dependency resolution with soft/hard dependencies
+- Real-time monitoring and rollback
+
+### 🏗️ Hybrid Orchestrator (v3.0.0)
+
+- Solves Nushell deep call stack limitations
+- Preserves all business logic
+- REST API for external integration
+- Checkpoint-based state management
+
+### ⚙️ Configuration System (v2.0.0)
+
+- Migrated from ENV to config-driven
+- Hierarchical configuration loading
+- Variable interpolation
+- True IaC without hardcoded fallbacks
+
+### 🎯 Modular CLI (v3.2.0)
+
+- 84% reduction in main file size
+- Domain-driven handlers
+- 80+ shortcuts
+- Bi-directional help system
+
+### 🧪 Test Environment Service (v3.4.0)
+
+- Automated containerized testing
+- Multi-node cluster topologies
+- CI/CD integration ready
+- Template-based configurations
+
+### 🔄 Workspace Switching (v2.0.5)
+
+- Centralized workspace management
+- Single-command workspace switching
+- Active workspace tracking
+- User preference system
+
+---
+
+## Technology Stack
+
+| Component | Technology | Purpose |
+| ----------- | ------------ | --------- |
+| **Core CLI** | Nushell 0.107.1 | Shell and scripting |
+| **Configuration** | Nickel 1.0.0+ | Type-safe IaC |
+| **Orchestrator** | Rust | High-performance coordination |
+| **Templates** | Jinja2 (nu_plugin_tera) | Code generation |
+| **Secrets** | SOPS 3.10.2 + Age 1.2.1 | Encryption |
+| **Distribution** | OCI (skopeo/crane/oras) | Artifact management |
+
+---
+
+## Support
+
+### Getting Help
+
+- **Documentation**: You're reading it!
+- **Quick Reference**: Run `provisioning sc` or `provisioning guide quickstart`
+- **Help System**: Run `provisioning help`
+- **Interactive Shell**: Run `provisioning nu` for Nushell REPL
+
+### Reporting Issues
+
+- Check **[Troubleshooting Guide](troubleshooting/troubleshooting-guide.md)**
+- Review **[FAQ](troubleshooting/troubleshooting-guide.md)**
+- Enable debug mode: `provisioning --debug `
+- Check logs: `provisioning platform logs `
+
+---
+
+## Contributing
+
+This project welcomes contributions! 
See **[Development Guide](development/README.md)** for: + +- Development setup +- Code style guidelines +- Testing requirements +- Pull request process + +--- + +## License + +[Add license information] + +--- + +## Version History + +| Version | Date | Major Changes | +| --------- | ------ | --------------- | +| **3.5.0** | 2025-10-06 | Mode system, OCI registry, comprehensive documentation | +| **3.4.0** | 2025-10-06 | Test environment service | +| **3.3.0** | 2025-09-30 | Interactive guides system | +| **3.2.0** | 2025-09-30 | Modular CLI refactoring | +| **3.1.0** | 2025-09-25 | Batch workflow system | +| **3.0.0** | 2025-09-25 | Hybrid orchestrator architecture | +| **2.0.5** | 2025-10-02 | Workspace switching system | +| **2.0.0** | 2025-09-23 | Configuration system migration | + +--- + +**Maintained By**: Provisioning Team +**Last Review**: 2025-10-06 +**Next Review**: 2026-01-06 diff --git a/docs/src/SUMMARY.md b/docs/src/SUMMARY.md index ea036b8..e89c521 100644 --- a/docs/src/SUMMARY.md +++ b/docs/src/SUMMARY.md @@ -1 +1,269 @@ -# Provisioning Platform Documentation\n\n[Home](README.md)\n\n---\n\n## Getting Started\n\n- [Installation Guide](getting-started/installation-guide.md)\n- [Installation Validation Guide](getting-started/installation-validation-guide.md)\n- [Getting Started](getting-started/getting-started.md)\n- [Quick Start Cheatsheet](getting-started/quickstart-cheatsheet.md)\n- [Setup Quick Start](getting-started/setup-quickstart.md)\n- [Setup System Guide](getting-started/setup-system-guide.md)\n- [Quick Start (Full)](getting-started/quickstart.md)\n- [Prerequisites](getting-started/01-prerequisites.md)\n- [Installation Steps](getting-started/02-installation.md)\n- [First Deployment](getting-started/03-first-deployment.md)\n- [Verification](getting-started/04-verification.md)\n- [Platform Service Configuration](getting-started/05-platform-configuration.md)\n\n---\n\n## AI Integration\n\n- [Overview](ai/README.md)\n- [Architecture](ai/architecture.md)\n- [RAG System](ai/rag-system.md)\n- [MCP Integration](ai/mcp-integration.md)\n- [Configuration Guide](ai/configuration.md)\n- [Security Policies](ai/security-policies.md)\n- [Troubleshooting with AI](ai/troubleshooting-with-ai.md)\n- [Cost Management](ai/cost-management.md)\n\n### Planned Features (Q2 2025)\n\n- [Natural Language Configuration](ai/natural-language-config.md)\n- [Configuration Generation](ai/config-generation.md)\n- [AI-Assisted Forms](ai/ai-assisted-forms.md)\n- [AI Agents](ai/ai-agents.md)\n\n---\n\n## Architecture & Design\n\n- [System Overview](architecture/system-overview.md)\n- [Architecture Overview](architecture/architecture-overview.md)\n- [Design Principles](architecture/design-principles.md)\n- [Integration Patterns](architecture/integration-patterns.md)\n- [Orchestrator Integration Model](architecture/orchestrator-integration-model.md)\n- [Multi-Repo Architecture](architecture/multi-repo-architecture.md)\n- [Multi-Repo Strategy](architecture/multi-repo-strategy.md)\n- [Database and Config Architecture](architecture/database-and-config-architecture.md)\n- [Ecosystem Integration](architecture/ecosystem-integration.md)\n- [Package and Loader System](architecture/package-and-loader-system.md)\n- [Config Loading Architecture](architecture/config-loading-architecture.md)\n- [Nickel Executable Examples](architecture/nickel-executable-examples.md)\n- [Orchestrator Info](architecture/orchestrator-info.md)\n- [Orchestrator Auth Integration](architecture/orchestrator-auth-integration.md)\n- [Repo Dist 
Analysis](architecture/repo-dist-analysis.md)\n- [TypeDialog Nickel Integration](architecture/typedialog-nickel-integration.md)\n\n### Architecture Decision Records\n\n- [ADR-001: Project Structure](architecture/adr/adr-001-project-structure.md)\n- [ADR-002: Distribution Strategy](architecture/adr/adr-002-distribution-strategy.md)\n- [ADR-003: Workspace Isolation](architecture/adr/adr-003-workspace-isolation.md)\n- [ADR-004: Hybrid Architecture](architecture/adr/adr-004-hybrid-architecture.md)\n- [ADR-005: Extension Framework](architecture/adr/adr-005-extension-framework.md)\n- [ADR-006: Provisioning CLI Refactoring](architecture/adr/adr-006-provisioning-cli-refactoring.md)\n- [ADR-007: KMS Simplification](architecture/adr/adr-007-kms-simplification.md)\n- [ADR-008: Cedar Authorization](architecture/adr/adr-008-cedar-authorization.md)\n- [ADR-009: Security System Complete](architecture/adr/adr-009-security-system-complete.md)\n- [ADR-010: Configuration Format Strategy](architecture/adr/adr-010-configuration-format-strategy.md)\n- [ADR-011: Nickel Migration](architecture/adr/adr-011-nickel-migration.md)\n- [ADR-012: Nushell Nickel Plugin CLI Wrapper](architecture/adr/adr-012-nushell-nickel-plugin-cli-wrapper.md)\n- [ADR-013: Typdialog Web UI Backend Integration](architecture/adr/adr-013-typdialog-integration.md)\n- [ADR-014: SecretumVault Integration](architecture/adr/adr-014-secretumvault-integration.md)\n- [ADR-015: AI Integration Architecture](architecture/adr/adr-015-ai-integration-architecture.md)\n\n---\n\n## Roadmap & Future Features\n\n- [Overview](roadmap/README.md)\n- [AI Integration (Planned)](roadmap/ai-integration.md)\n- [Native Plugins (Partial)](roadmap/native-plugins.md)\n- [Nickel Workflows (Planned)](roadmap/nickel-workflows.md)\n\n---\n\n## API Reference\n\n- [REST API](api-reference/rest-api.md)\n- [WebSocket](api-reference/websocket.md)\n- [Extensions](api-reference/extensions.md)\n- [SDKs](api-reference/sdks.md)\n- [Integration Examples](api-reference/integration-examples.md)\n- [Provider API](api-reference/provider-api.md)\n- [NuShell API](api-reference/nushell-api.md)\n- [Path Resolution](api-reference/path-resolution.md)\n\n---\n\n## Development\n\n- [Infrastructure-Specific Extensions](development/infrastructure-specific-extensions.md)\n- [Command Handler Guide](development/command-handler-guide.md)\n- [Workflow](development/workflow.md)\n- [Integration](development/integration.md)\n- [Build System](development/build-system.md)\n- [Distribution Process](development/distribution-process.md)\n- [Implementation Guide](development/implementation-guide.md)\n- [Project Structure](development/project-structure.md)\n- [Ctrl-C Implementation Notes](development/ctrl-c-implementation-notes.md)\n- [Auth Metadata Guide](development/auth-metadata-guide.md)\n- [KMS Simplification](development/kms-simplification.md)\n- [Glossary](development/glossary.md)\n- [MCP Server](development/mcp-server.md)\n- [TypeDialog Platform Config Guide](development/typedialog-platform-config-guide.md)\n\n### Extensions\n\n- [Overview](development/extensions/README.md)\n- [Extension Development](development/extensions/extension-development.md)\n- [Extension Registry](development/extensions/extension-registry.md)\n\n### Providers\n\n- [Quick Provider Guide](development/providers/quick-provider-guide.md)\n- [Provider Agnostic Architecture](development/providers/provider-agnostic-architecture.md)\n- [Provider Development Guide](development/providers/provider-development-guide.md)\n- [Provider 
Distribution Guide](development/providers/provider-distribution-guide.md)\n- [Provider Comparison Matrix](development/providers/provider-comparison.md)\n\n### TaskServs\n\n- [TaskServ Quick Guide](development/taskservs/taskserv-quick-guide.md)\n- [TaskServ Categorization](development/taskservs/taskserv-categorization.md)\n\n---\n\n## Operations\n\n- [Platform Deployment Guide](operations/deployment-guide.md)\n- [Service Management Guide](operations/service-management-guide.md)\n- [Monitoring & Alerting Setup](operations/monitoring-alerting-setup.md)\n- [CoreDNS Guide](operations/coredns-guide.md)\n- [Production Readiness Checklist](operations/production-readiness-checklist.md)\n- [Break Glass Training Guide](operations/break-glass-training-guide.md)\n- [Cedar Policies Production Guide](operations/cedar-policies-production-guide.md)\n- [MFA Admin Setup Guide](operations/mfa-admin-setup-guide.md)\n- [Orchestrator](operations/orchestrator.md)\n- [Orchestrator System](operations/orchestrator-system.md)\n- [Control Center](operations/control-center.md)\n- [Installer](operations/installer.md)\n- [Installer System](operations/installer-system.md)\n- [Provisioning Server](operations/provisioning-server.md)\n\n---\n\n## Infrastructure\n\n- [Infrastructure Management](infrastructure/infrastructure-management.md)\n- [Infrastructure from Code Guide](infrastructure/infrastructure-from-code-guide.md)\n- [Batch Workflow System](infrastructure/batch-workflow-system.md)\n- [Batch Workflow Multi-Provider Examples](infrastructure/batch-workflow-multi-provider.md)\n- [CLI Architecture](infrastructure/cli-architecture.md)\n- [Configuration System](infrastructure/configuration-system.md)\n- [CLI Reference](infrastructure/cli-reference.md)\n- [Dynamic Secrets Guide](infrastructure/dynamic-secrets-guide.md)\n- [Mode System Guide](infrastructure/mode-system-guide.md)\n- [Config Rendering Guide](infrastructure/config-rendering-guide.md)\n- [Configuration](infrastructure/configuration.md)\n\n### Workspaces\n\n- [Workspace Setup](infrastructure/workspaces/workspace-setup.md)\n- [Workspace Guide](infrastructure/workspaces/workspace-guide.md)\n- [Workspace Switching Guide](infrastructure/workspaces/workspace-switching-guide.md)\n- [Workspace Switching System](infrastructure/workspaces/workspace-switching-system.md)\n- [Workspace Config Architecture](infrastructure/workspaces/workspace-config-architecture.md)\n- [Workspace Config Commands](infrastructure/workspaces/workspace-config-commands.md)\n- [Workspace Enforcement Guide](infrastructure/workspaces/workspace-enforcement-guide.md)\n- [Workspace Infra Reference](infrastructure/workspaces/workspace-infra-reference.md)\n\n---\n\n## Security\n\n- [Authentication Layer Guide](security/authentication-layer-guide.md)\n- [Config Encryption Guide](security/config-encryption-guide.md)\n- [Security System](security/security-system.md)\n- [RustyVault KMS Guide](security/rustyvault-kms-guide.md)\n- [SecretumVault KMS Guide](security/secretumvault-kms-guide.md)\n- [SSH Temporal Keys User Guide](security/ssh-temporal-keys-user-guide.md)\n- [Plugin Integration Guide](security/plugin-integration-guide.md)\n- [NuShell Plugins Guide](security/nushell-plugins-guide.md)\n- [NuShell Plugins System](security/nushell-plugins-system.md)\n- [Plugin Usage Guide](security/plugin-usage-guide.md)\n- [Secrets Management Guide](security/secrets-management-guide.md)\n- [KMS Service](security/kms-service.md)\n\n---\n\n## Integration\n\n- [Gitea Integration 
Guide](integration/gitea-integration-guide.md)\n- [Service Mesh Ingress Guide](integration/service-mesh-ingress-guide.md)\n- [OCI Registry Guide](integration/oci-registry-guide.md)\n- [Integrations Quick Start](integration/integrations-quickstart.md)\n- [Secrets Service Layer Complete](integration/secrets-service-layer-complete.md)\n- [OCI Registry Platform](integration/oci-registry-platform.md)\n\n---\n\n## Testing\n\n- [Test Environment Guide](testing/test-environment-guide.md)\n- [Test Environment System](testing/test-environment-system.md)\n- [TaskServ Validation Guide](testing/taskserv-validation-guide.md)\n\n---\n\n## Troubleshooting\n\n- [Troubleshooting Guide](troubleshooting/troubleshooting-guide.md)\n\n---\n\n## Deployment Guides\n\n- [From Scratch](guides/from-scratch.md)\n- [Update Infrastructure](guides/update-infrastructure.md)\n- [Customize Infrastructure](guides/customize-infrastructure.md)\n- [Infrastructure Setup](guides/infrastructure-setup.md)\n- [Extension Development Quickstart](guides/extension-development-quickstart.md)\n- [Guide System](guides/guide-system.md)\n- [Workspace Generation Quick Reference](guides/workspace-generation-quick-reference.md)\n\n### Multi-Provider Deployment Guides\n\n- [Multi-Provider Deployment Guide](guides/multi-provider-deployment.md)\n- [Multi-Provider Networking with VPN](guides/multi-provider-networking.md)\n- [DigitalOcean Provider Guide](guides/provider-digitalocean.md)\n- [Hetzner Provider Guide](guides/provider-hetzner.md)\n\n### Multi-Provider Workspace Examples\n\n- [Multi-Provider Web App Workspace](../examples/workspaces/multi-provider-web-app/README.md)\n- [Multi-Region High Availability Workspace](../examples/workspaces/multi-region-ha/README.md)\n- [Cost-Optimized Multi-Provider Workspace](../examples/workspaces/cost-optimized/README.md)\n\n---\n\n## Quick Reference\n\n- [Master Index](quick-reference/master.md)\n- [Platform Operations Cheatsheet](quick-reference/platform-operations-cheatsheet.md)\n- [General Commands](quick-reference/general.md)\n- [JustFile Recipes](quick-reference/justfile-recipes.md)\n- [OCI Registry](quick-reference/oci.md)\n- [Sudo Password Handling](quick-reference/sudo-password-handling.md)\n\n---\n\n## Configuration\n\n- [Config Validation](configuration/config-validation.md) +# Provisioning Platform Documentation + +[Home](README.md) + +--- + +## Getting Started + +- [Installation Guide](getting-started/installation-guide.md) +- [Installation Validation Guide](getting-started/installation-validation-guide.md) +- [Getting Started](getting-started/getting-started.md) +- [Quick Start Cheatsheet](getting-started/quickstart-cheatsheet.md) +- [Setup Quick Start](getting-started/setup-quickstart.md) +- [Setup System Guide](getting-started/setup-system-guide.md) +- [Quick Start (Full)](getting-started/quickstart.md) +- [Prerequisites](getting-started/01-prerequisites.md) +- [Installation Steps](getting-started/02-installation.md) +- [First Deployment](getting-started/03-first-deployment.md) +- [Verification](getting-started/04-verification.md) +- [Platform Service Configuration](getting-started/05-platform-configuration.md) + +--- + +## AI Integration + +- [Overview](ai/README.md) +- [Architecture](ai/architecture.md) +- [RAG System](ai/rag-system.md) +- [MCP Integration](ai/mcp-integration.md) +- [Configuration Guide](ai/configuration.md) +- [Security Policies](ai/security-policies.md) +- [Troubleshooting with AI](ai/troubleshooting-with-ai.md) +- [Cost Management](ai/cost-management.md) + +### Planned 
Features (Q2 2025) + +- [Natural Language Configuration](ai/natural-language-config.md) +- [Configuration Generation](ai/config-generation.md) +- [AI-Assisted Forms](ai/ai-assisted-forms.md) +- [AI Agents](ai/ai-agents.md) + +--- + +## Architecture & Design + +- [System Overview](architecture/system-overview.md) +- [Architecture Overview](architecture/architecture-overview.md) +- [Design Principles](architecture/design-principles.md) +- [Integration Patterns](architecture/integration-patterns.md) +- [Orchestrator Integration Model](architecture/orchestrator-integration-model.md) +- [Multi-Repo Architecture](architecture/multi-repo-architecture.md) +- [Multi-Repo Strategy](architecture/multi-repo-strategy.md) +- [Database and Config Architecture](architecture/database-and-config-architecture.md) +- [Ecosystem Integration](architecture/ecosystem-integration.md) +- [Package and Loader System](architecture/package-and-loader-system.md) +- [Config Loading Architecture](architecture/config-loading-architecture.md) +- [Nickel Executable Examples](architecture/nickel-executable-examples.md) +- [Orchestrator Info](architecture/orchestrator-info.md) +- [Orchestrator Auth Integration](architecture/orchestrator-auth-integration.md) +- [Repo Dist Analysis](architecture/repo-dist-analysis.md) +- [TypeDialog Nickel Integration](architecture/typedialog-nickel-integration.md) + +### Architecture Decision Records + +- [ADR-001: Project Structure](architecture/adr/adr-001-project-structure.md) +- [ADR-002: Distribution Strategy](architecture/adr/adr-002-distribution-strategy.md) +- [ADR-003: Workspace Isolation](architecture/adr/adr-003-workspace-isolation.md) +- [ADR-004: Hybrid Architecture](architecture/adr/adr-004-hybrid-architecture.md) +- [ADR-005: Extension Framework](architecture/adr/adr-005-extension-framework.md) +- [ADR-006: Provisioning CLI Refactoring](architecture/adr/adr-006-provisioning-cli-refactoring.md) +- [ADR-007: KMS Simplification](architecture/adr/adr-007-kms-simplification.md) +- [ADR-008: Cedar Authorization](architecture/adr/adr-008-cedar-authorization.md) +- [ADR-009: Security System Complete](architecture/adr/adr-009-security-system-complete.md) +- [ADR-010: Configuration Format Strategy](architecture/adr/adr-010-configuration-format-strategy.md) +- [ADR-011: Nickel Migration](architecture/adr/adr-011-nickel-migration.md) +- [ADR-012: Nushell Nickel Plugin CLI Wrapper](architecture/adr/adr-012-nushell-nickel-plugin-cli-wrapper.md) +- [ADR-013: Typdialog Web UI Backend Integration](architecture/adr/adr-013-typdialog-integration.md) +- [ADR-014: SecretumVault Integration](architecture/adr/adr-014-secretumvault-integration.md) +- [ADR-015: AI Integration Architecture](architecture/adr/adr-015-ai-integration-architecture.md) + +--- + +## Roadmap & Future Features + +- [Overview](roadmap/README.md) +- [AI Integration (Planned)](roadmap/ai-integration.md) +- [Native Plugins (Partial)](roadmap/native-plugins.md) +- [Nickel Workflows (Planned)](roadmap/nickel-workflows.md) + +--- + +## API Reference + +- [REST API](api-reference/rest-api.md) +- [WebSocket](api-reference/websocket.md) +- [Extensions](api-reference/extensions.md) +- [SDKs](api-reference/sdks.md) +- [Integration Examples](api-reference/integration-examples.md) +- [Provider API](api-reference/provider-api.md) +- [NuShell API](api-reference/nushell-api.md) +- [Path Resolution](api-reference/path-resolution.md) + +--- + +## Development + +- [Infrastructure-Specific Extensions](development/infrastructure-specific-extensions.md) 
+- [Command Handler Guide](development/command-handler-guide.md) +- [Workflow](development/workflow.md) +- [Integration](development/integration.md) +- [Build System](development/build-system.md) +- [Distribution Process](development/distribution-process.md) +- [Implementation Guide](development/implementation-guide.md) +- [Project Structure](development/project-structure.md) +- [Ctrl-C Implementation Notes](development/ctrl-c-implementation-notes.md) +- [Auth Metadata Guide](development/auth-metadata-guide.md) +- [KMS Simplification](development/kms-simplification.md) +- [Glossary](development/glossary.md) +- [MCP Server](development/mcp-server.md) +- [TypeDialog Platform Config Guide](development/typedialog-platform-config-guide.md) + +### Extensions + +- [Overview](development/extensions/README.md) +- [Extension Development](development/extensions/extension-development.md) +- [Extension Registry](development/extensions/extension-registry.md) + +### Providers + +- [Quick Provider Guide](development/providers/quick-provider-guide.md) +- [Provider Agnostic Architecture](development/providers/provider-agnostic-architecture.md) +- [Provider Development Guide](development/providers/provider-development-guide.md) +- [Provider Distribution Guide](development/providers/provider-distribution-guide.md) +- [Provider Comparison Matrix](development/providers/provider-comparison.md) + +### TaskServs + +- [TaskServ Quick Guide](development/taskservs/taskserv-quick-guide.md) +- [TaskServ Categorization](development/taskservs/taskserv-categorization.md) + +--- + +## Operations + +- [Platform Deployment Guide](operations/deployment-guide.md) +- [Service Management Guide](operations/service-management-guide.md) +- [Monitoring & Alerting Setup](operations/monitoring-alerting-setup.md) +- [CoreDNS Guide](operations/coredns-guide.md) +- [Production Readiness Checklist](operations/production-readiness-checklist.md) +- [Break Glass Training Guide](operations/break-glass-training-guide.md) +- [Cedar Policies Production Guide](operations/cedar-policies-production-guide.md) +- [MFA Admin Setup Guide](operations/mfa-admin-setup-guide.md) +- [Orchestrator](operations/orchestrator.md) +- [Orchestrator System](operations/orchestrator-system.md) +- [Control Center](operations/control-center.md) +- [Installer](operations/installer.md) +- [Installer System](operations/installer-system.md) +- [Provisioning Server](operations/provisioning-server.md) + +--- + +## Infrastructure + +- [Infrastructure Management](infrastructure/infrastructure-management.md) +- [Infrastructure from Code Guide](infrastructure/infrastructure-from-code-guide.md) +- [Batch Workflow System](infrastructure/batch-workflow-system.md) +- [Batch Workflow Multi-Provider Examples](infrastructure/batch-workflow-multi-provider.md) +- [CLI Architecture](infrastructure/cli-architecture.md) +- [Configuration System](infrastructure/configuration-system.md) +- [CLI Reference](infrastructure/cli-reference.md) +- [Dynamic Secrets Guide](infrastructure/dynamic-secrets-guide.md) +- [Mode System Guide](infrastructure/mode-system-guide.md) +- [Config Rendering Guide](infrastructure/config-rendering-guide.md) +- [Configuration](infrastructure/configuration.md) + +### Workspaces + +- [Workspace Setup](infrastructure/workspaces/workspace-setup.md) +- [Workspace Guide](infrastructure/workspaces/workspace-guide.md) +- [Workspace Switching Guide](infrastructure/workspaces/workspace-switching-guide.md) +- [Workspace Switching 
System](infrastructure/workspaces/workspace-switching-system.md) +- [Workspace Config Architecture](infrastructure/workspaces/workspace-config-architecture.md) +- [Workspace Config Commands](infrastructure/workspaces/workspace-config-commands.md) +- [Workspace Enforcement Guide](infrastructure/workspaces/workspace-enforcement-guide.md) +- [Workspace Infra Reference](infrastructure/workspaces/workspace-infra-reference.md) + +--- + +## Security + +- [Authentication Layer Guide](security/authentication-layer-guide.md) +- [Config Encryption Guide](security/config-encryption-guide.md) +- [Security System](security/security-system.md) +- [RustyVault KMS Guide](security/rustyvault-kms-guide.md) +- [SecretumVault KMS Guide](security/secretumvault-kms-guide.md) +- [SSH Temporal Keys User Guide](security/ssh-temporal-keys-user-guide.md) +- [Plugin Integration Guide](security/plugin-integration-guide.md) +- [NuShell Plugins Guide](security/nushell-plugins-guide.md) +- [NuShell Plugins System](security/nushell-plugins-system.md) +- [Plugin Usage Guide](security/plugin-usage-guide.md) +- [Secrets Management Guide](security/secrets-management-guide.md) +- [KMS Service](security/kms-service.md) + +--- + +## Integration + +- [Gitea Integration Guide](integration/gitea-integration-guide.md) +- [Service Mesh Ingress Guide](integration/service-mesh-ingress-guide.md) +- [OCI Registry Guide](integration/oci-registry-guide.md) +- [Integrations Quick Start](integration/integrations-quickstart.md) +- [Secrets Service Layer Complete](integration/secrets-service-layer-complete.md) +- [OCI Registry Platform](integration/oci-registry-platform.md) + +--- + +## Testing + +- [Test Environment Guide](testing/test-environment-guide.md) +- [Test Environment System](testing/test-environment-system.md) +- [TaskServ Validation Guide](testing/taskserv-validation-guide.md) + +--- + +## Troubleshooting + +- [Troubleshooting Guide](troubleshooting/troubleshooting-guide.md) + +--- + +## Deployment Guides + +- [From Scratch](guides/from-scratch.md) +- [Update Infrastructure](guides/update-infrastructure.md) +- [Customize Infrastructure](guides/customize-infrastructure.md) +- [Infrastructure Setup](guides/infrastructure-setup.md) +- [Extension Development Quickstart](guides/extension-development-quickstart.md) +- [Guide System](guides/guide-system.md) +- [Workspace Generation Quick Reference](guides/workspace-generation-quick-reference.md) + +### Multi-Provider Deployment Guides + +- [Multi-Provider Deployment Guide](guides/multi-provider-deployment.md) +- [Multi-Provider Networking with VPN](guides/multi-provider-networking.md) +- [DigitalOcean Provider Guide](guides/provider-digitalocean.md) +- [Hetzner Provider Guide](guides/provider-hetzner.md) + +### Multi-Provider Workspace Examples + +- [Multi-Provider Web App Workspace](../examples/workspaces/multi-provider-web-app/README.md) +- [Multi-Region High Availability Workspace](../examples/workspaces/multi-region-ha/README.md) +- [Cost-Optimized Multi-Provider Workspace](../examples/workspaces/cost-optimized/README.md) + +--- + +## Quick Reference + +- [Master Index](quick-reference/master.md) +- [Platform Operations Cheatsheet](quick-reference/platform-operations-cheatsheet.md) +- [General Commands](quick-reference/general.md) +- [JustFile Recipes](quick-reference/justfile-recipes.md) +- [OCI Registry](quick-reference/oci.md) +- [Sudo Password Handling](quick-reference/sudo-password-handling.md) + +--- + +## Configuration + +- [Config 
Validation](configuration/config-validation.md) diff --git a/docs/src/ai/README.md b/docs/src/ai/README.md index 63fbf19..5e642d2 100644 --- a/docs/src/ai/README.md +++ b/docs/src/ai/README.md @@ -1 +1,171 @@ -# AI Integration - Intelligent Infrastructure Provisioning\n\nThe provisioning platform integrates AI capabilities to provide intelligent assistance for infrastructure configuration, deployment, and\ntroubleshooting.\nThis section documents the AI system architecture, features, and usage patterns.\n\n## Overview\n\nThe AI integration consists of multiple components working together to provide intelligent infrastructure provisioning:\n\n- **typdialog-ai**: AI-assisted form filling and configuration\n- **typdialog-ag**: Autonomous AI agents for complex workflows\n- **typdialog-prov-gen**: Natural language to Nickel configuration generation\n- **ai-service**: Core AI service backend with multi-provider support\n- **mcp-server**: Model Context Protocol server for LLM integration\n- **rag**: Retrieval-Augmented Generation for contextual knowledge\n\n## Key Features\n\n### Natural Language Configuration\n\nGenerate infrastructure configurations from plain English descriptions:\n```\nprovisioning ai generate "Create a production PostgreSQL cluster with encryption and daily backups"\n```\n\n### AI-Assisted Forms\n\nReal-time suggestions and explanations as you fill out configuration forms via typdialog web UI.\n\n### Intelligent Troubleshooting\n\nAI analyzes deployment failures and suggests fixes:\n```\nprovisioning ai troubleshoot deployment-12345\n```\n\n###\n\n Configuration Optimization\nAI reviews configurations and suggests performance and security improvements:\n```\nprovisioning ai optimize workspaces/prod/config.ncl\n```\n\n### Autonomous Agents\nAI agents execute multi-step workflows with minimal human intervention:\n```\nprovisioning ai agent --goal "Set up complete dev environment for Python app"\n```\n\n## Documentation Structure\n\n- [Architecture](architecture.md) - AI system architecture and components\n- [Natural Language Config](natural-language-config.md) - NL to Nickel generation\n- [AI-Assisted Forms](ai-assisted-forms.md) - typdialog-ai integration\n- [AI Agents](ai-agents.md) - typdialog-ag autonomous agents\n- [Config Generation](config-generation.md) - typdialog-prov-gen details\n- [RAG System](rag-system.md) - Retrieval-Augmented Generation\n- [MCP Integration](mcp-integration.md) - Model Context Protocol\n- [Security Policies](security-policies.md) - Cedar policies for AI\n- [Troubleshooting with AI](troubleshooting-with-ai.md) - AI debugging workflows\n- [API Reference](api-reference.md) - AI service API documentation\n- [Configuration](configuration.md) - AI system configuration guide\n- [Cost Management](cost-management.md) - Managing LLM API costs\n\n## Quick Start\n\n### Enable AI Features\n\n```\n# Edit provisioning config\nvim provisioning/config/ai.toml\n\n# Set provider and enable features\n[ai]\nenabled = true\nprovider = "anthropic" # or "openai" or "local"\nmodel = "claude-sonnet-4"\n\n[ai.features]\nform_assistance = true\nconfig_generation = true\ntroubleshooting = true\n```\n\n### Generate Configuration from Natural Language\n\n```\n# Simple generation\nprovisioning ai generate "PostgreSQL database with encryption"\n\n# With specific schema\nprovisioning ai generate \\n --schema database \\n --output workspaces/dev/db.ncl \\n "Production PostgreSQL with 100GB storage and daily backups"\n```\n\n### Use AI-Assisted Forms\n\n```\n# Open typdialog web 
UI with AI assistance\nprovisioning workspace init --interactive --ai-assist\n\n# AI provides real-time suggestions as you type\n# AI explains validation errors in plain English\n# AI fills multiple fields from natural language description\n```\n\n### Troubleshoot with AI\n\n```\n# Analyze failed deployment\nprovisioning ai troubleshoot deployment-12345\n\n# AI analyzes logs and suggests fixes\n# AI generates corrected configuration\n# AI explains root cause in plain language\n```\n\n## Security and Privacy\n\nThe AI system implements strict security controls:\n\n- ✅ **Cedar Policies**: AI access controlled by Cedar authorization\n- ✅ **Secret Isolation**: AI cannot access secrets directly\n- ✅ **Human Approval**: Critical operations require human approval\n- ✅ **Audit Trail**: All AI operations logged\n- ✅ **Data Sanitization**: Secrets/PII sanitized before sending to LLM\n- ✅ **Local Models**: Support for air-gapped deployments\n\nSee [Security Policies](security-policies.md) for complete details.\n\n## Supported LLM Providers\n\n| | Provider | Models | Best For | |\n| | ---------- | -------- | ---------- | |\n| | **Anthropic** | Claude Sonnet 4, Claude Opus 4 | Complex configs, long context | |\n| | **OpenAI** | GPT-4 Turbo, GPT-4 | Fast suggestions, tool calling | |\n| | **Local** | Llama 3, Mistral | Air-gapped, privacy-critical | |\n\n## Cost Considerations\n\nAI features incur LLM API costs. The system implements cost controls:\n\n- **Caching**: Reduces API calls by 50-80%\n- **Rate Limiting**: Prevents runaway costs\n- **Budget Limits**: Daily/monthly cost caps\n- **Local Models**: Zero marginal cost for air-gapped deployments\n\nSee [Cost Management](cost-management.md) for optimization strategies.\n\n## Architecture Decision Record\n\nThe AI integration is documented in:\n- [ADR-015: AI Integration Architecture](../architecture/adr/adr-015-ai-integration-architecture.md)\n\n## Next Steps\n\n1. Read [Architecture](architecture.md) to understand AI system design\n2. Configure AI features in [Configuration](configuration.md)\n3. Try [Natural Language Config](natural-language-config.md) for your first AI-generated config\n4. Explore [AI Agents](ai-agents.md) for automation workflows\n5. Review [Security Policies](security-policies.md) to understand access controls\n\n---\n\n**Version**: 1.0\n**Last Updated**: 2025-01-08\n**Status**: Active +# AI Integration - Intelligent Infrastructure Provisioning + +The provisioning platform integrates AI capabilities to provide intelligent assistance for infrastructure configuration, deployment, and +troubleshooting. +This section documents the AI system architecture, features, and usage patterns. 
+
+## Overview
+
+The AI integration consists of multiple components working together to provide intelligent infrastructure provisioning:
+
+- **typdialog-ai**: AI-assisted form filling and configuration
+- **typdialog-ag**: Autonomous AI agents for complex workflows
+- **typdialog-prov-gen**: Natural language to Nickel configuration generation
+- **ai-service**: Core AI service backend with multi-provider support
+- **mcp-server**: Model Context Protocol server for LLM integration
+- **rag**: Retrieval-Augmented Generation for contextual knowledge
+
+## Key Features
+
+### Natural Language Configuration
+
+Generate infrastructure configurations from plain English descriptions:
+```text
+provisioning ai generate "Create a production PostgreSQL cluster with encryption and daily backups"
+```
+
+### AI-Assisted Forms
+
+Real-time suggestions and explanations as you fill out configuration forms via typdialog web UI.
+
+### Intelligent Troubleshooting
+
+AI analyzes deployment failures and suggests fixes:
+```text
+provisioning ai troubleshoot deployment-12345
+```
+
+### Configuration Optimization
+
+AI reviews configurations and suggests performance and security improvements:
+```text
+provisioning ai optimize workspaces/prod/config.ncl
+```
+
+### Autonomous Agents
+
+AI agents execute multi-step workflows with minimal human intervention:
+```text
+provisioning ai agent --goal "Set up complete dev environment for Python app"
+```
+
+## Documentation Structure
+
+- [Architecture](architecture.md) - AI system architecture and components
+- [Natural Language Config](natural-language-config.md) - NL to Nickel generation
+- [AI-Assisted Forms](ai-assisted-forms.md) - typdialog-ai integration
+- [AI Agents](ai-agents.md) - typdialog-ag autonomous agents
+- [Config Generation](config-generation.md) - typdialog-prov-gen details
+- [RAG System](rag-system.md) - Retrieval-Augmented Generation
+- [MCP Integration](mcp-integration.md) - Model Context Protocol
+- [Security Policies](security-policies.md) - Cedar policies for AI
+- [Troubleshooting with AI](troubleshooting-with-ai.md) - AI debugging workflows
+- [API Reference](api-reference.md) - AI service API documentation
+- [Configuration](configuration.md) - AI system configuration guide
+- [Cost Management](cost-management.md) - Managing LLM API costs
+
+## Quick Start
+
+### Enable AI Features
+
+```text
+# Edit provisioning config
+vim provisioning/config/ai.toml
+
+# Set provider and enable features
+[ai]
+enabled = true
+provider = "anthropic" # or "openai" or "local"
+model = "claude-sonnet-4"
+
+[ai.features]
+form_assistance = true
+config_generation = true
+troubleshooting = true
+```
+
+### Generate Configuration from Natural Language
+
+```text
+# Simple generation
+provisioning ai generate "PostgreSQL database with encryption"
+
+# With specific schema
+provisioning ai generate \
+  --schema database \
+  --output workspaces/dev/db.ncl \
+  "Production PostgreSQL with 100GB storage and daily backups"
+```
+
+### Use AI-Assisted Forms
+
+```text
+# Open typdialog web UI with AI assistance
+provisioning workspace init --interactive --ai-assist
+
+# AI provides real-time suggestions as you type
+# AI explains validation errors in plain English
+# AI fills multiple fields from natural language description
+```
+
+### Troubleshoot with AI
+
+```text
+# Analyze failed deployment
+provisioning ai troubleshoot deployment-12345
+
+# AI analyzes logs and suggests fixes
+# AI generates corrected configuration
+# AI explains root cause in plain language
+```
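+
+As a sketch of what the `ai generate` commands above might emit, the database request could produce a Nickel record along these lines (hypothetical output; the actual generated schema and field names may differ):
+
+```text
+# Hypothetical content written to workspaces/dev/db.ncl
+{
+  database = {
+    engine = "postgresql",
+    version = "15",
+    storage_gb = 100,
+    encrypted = true,
+    backups = { enabled = true, schedule = "daily" }
+  }
+}
+```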
+
+## Security and Privacy
+
+The AI system implements strict security controls:
+
+- ✅ **Cedar Policies**: AI access controlled by Cedar authorization
+- ✅ **Secret Isolation**: AI cannot access secrets directly
+- ✅ **Human Approval**: Critical operations require human approval
+- ✅ **Audit Trail**: All AI operations logged
+- ✅ **Data Sanitization**: Secrets/PII sanitized before sending to LLM
+- ✅ **Local Models**: Support for air-gapped deployments
+
+See [Security Policies](security-policies.md) for complete details.
+
+## Supported LLM Providers
+
+| Provider | Models | Best For |
+| ---------- | -------- | ---------- |
+| **Anthropic** | Claude Sonnet 4, Claude Opus 4 | Complex configs, long context |
+| **OpenAI** | GPT-4 Turbo, GPT-4 | Fast suggestions, tool calling |
+| **Local** | Llama 3, Mistral | Air-gapped, privacy-critical |
+
+## Cost Considerations
+
+AI features incur LLM API costs. The system implements cost controls:
+
+- **Caching**: Reduces API calls by 50-80%
+- **Rate Limiting**: Prevents runaway costs
+- **Budget Limits**: Daily/monthly cost caps
+- **Local Models**: Zero marginal cost for air-gapped deployments
+
+See [Cost Management](cost-management.md) for optimization strategies.
+
+## Architecture Decision Record
+
+The AI integration is documented in:
+
+- [ADR-015: AI Integration Architecture](../architecture/adr/adr-015-ai-integration-architecture.md)
+
+## Next Steps
+
+1. Read [Architecture](architecture.md) to understand AI system design
+2. Configure AI features in [Configuration](configuration.md)
+3. Try [Natural Language Config](natural-language-config.md) for your first AI-generated config
+4. Explore [AI Agents](ai-agents.md) for automation workflows
+5. Review [Security Policies](security-policies.md) to understand access controls
+
+---
+
+**Version**: 1.0
+**Last Updated**: 2025-01-08
+**Status**: Active
\ No newline at end of file
diff --git a/docs/src/ai/ai-agents.md b/docs/src/ai/ai-agents.md
index 609abbc..c0cf63e 100644
--- a/docs/src/ai/ai-agents.md
+++ b/docs/src/ai/ai-agents.md
@@ -1 +1,532 @@
-# Autonomous AI Agents\n\n**Status**: 🔴 Planned (Q2 2025 target)\n\nAutonomous AI Agents is a planned feature that enables AI agents to execute multi-step\ninfrastructure provisioning workflows with minimal human intervention. Agents make\ndecisions, adapt to changing conditions, and execute complex tasks while maintaining\nsecurity and requiring human approval for critical operations.\n\n## Feature Overview\n\n### What It Does\n\nEnable AI agents to manage complex provisioning workflows:\n\n```\nUser Goal:\n "Set up a complete development environment with:\n - PostgreSQL database\n - Redis cache\n - Kubernetes cluster\n - Monitoring stack\n - Logging infrastructure"\n\nAI Agent executes:\n1. Analyzes requirements and constraints\n2. Plans multi-step deployment sequence\n3. Creates configurations for all components\n4. Validates configurations against policies\n5. Requests human approval for critical decisions\n6. Executes deployment in correct order\n7. Monitors for failures and adapts\n8. 
Reports completion and recommendations\n```\n\n## Agent Capabilities\n\n### Multi-Step Workflow Execution\n\nAgents coordinate complex, multi-component deployments:\n\n```\nGoal: "Deploy production Kubernetes cluster with managed databases"\n\nAgent Plan:\n Phase 1: Infrastructure\n ├─ Create VPC and networking\n ├─ Set up security groups\n └─ Configure IAM roles\n\n Phase 2: Kubernetes\n ├─ Create EKS cluster\n ├─ Configure network plugins\n ├─ Set up autoscaling\n └─ Install cluster add-ons\n\n Phase 3: Managed Services\n ├─ Provision RDS PostgreSQL\n ├─ Configure backups\n └─ Set up replicas\n\n Phase 4: Observability\n ├─ Deploy Prometheus\n ├─ Deploy Grafana\n ├─ Configure log collection\n └─ Set up alerting\n\n Phase 5: Validation\n ├─ Run smoke tests\n ├─ Verify connectivity\n └─ Check compliance\n```\n\n### Adaptive Decision Making\n\nAgents adapt to conditions and make intelligent decisions:\n\n```\nScenario: Database provisioning fails due to resource quota\n\nStandard approach (human):\n1. Detect failure\n2. Investigate issue\n3. Decide on fix (reduce size, change region, etc.)\n4. Update config\n5. Retry\n\nAgent approach:\n1. Detect failure\n2. Analyze error: "Quota exceeded for db.r6g.xlarge"\n3. Check available options:\n - Try smaller instance: db.r6g.large (may be insufficient)\n - Try different region: different cost, latency\n - Request quota increase (requires human approval)\n4. Ask human: "Quota exceeded. Suggest: use db.r6g.large instead \n (slightly reduced performance). Approve? [yes/no/try-other]"\n5. Execute based on approval\n6. Continue workflow\n```\n\n### Dependency Management\n\nAgents understand resource dependencies:\n\n```\nKnowledge graph of dependencies:\n\n VPC ──→ Subnets ──→ EC2 Instances\n ├─────────→ Security Groups\n └────→ NAT Gateway ──→ Route Tables\n\n RDS ──→ DB Subnet Group ──→ VPC\n ├─────────→ Security Group\n └────→ Parameter Group\n\nAgent ensures:\n- VPC exists before creating subnets\n- Subnets exist before creating EC2\n- Security groups reference correct VPC\n- Deployment order respects all dependencies\n- Rollback order is reverse of creation\n```\n\n## Architecture\n\n### Agent Design Pattern\n\n```\n┌────────────────────────────────────────────────────────┐\n│ Agent Supervisor (Orchestrator) │\n│ - Accepts user goal │\n│ - Plans workflow │\n│ - Coordinates specialist agents │\n│ - Requests human approvals │\n│ - Monitors overall progress │\n└────────────────────────────────────────────────────────┘\n ↑ ↑ ↑\n │ │ │\n ↓ ↓ ↓\n┌──────────────┐ ┌──────────────┐ ┌──────────────┐\n│ Database │ │ Kubernetes │ │ Monitoring │\n│ Specialist │ │ Specialist │ │ Specialist │\n│ │ │ │ │ │\n│ Tasks: │ │ Tasks: │ │ Tasks: │\n│ - Create DB │ │ - Create K8s │ │ - Deploy │\n│ - Configure │ │ - Configure │ │ Prometheus │\n│ - Validate │ │ - Validate │ │ - Deploy │\n│ - Report │ │ - Report │ │ Grafana │\n└──────────────┘ └──────────────┘ └──────────────┘\n```\n\n### Agent Workflow\n\n```\nStart: User Goal\n ↓\n┌─────────────────────────────────────────┐\n│ Goal Analysis & Planning │\n│ - Parse user intent │\n│ - Identify resources needed │\n│ - Plan dependency graph │\n│ - Generate task list │\n└──────────────┬──────────────────────────┘\n ↓\n┌─────────────────────────────────────────┐\n│ Resource Generation │\n│ - Generate configs for each resource │\n│ - Validate against schemas │\n│ - Check compliance policies │\n│ - Identify potential issues │\n└──────────────┬──────────────────────────┘\n ↓\n Human Review Point?\n ├─ No issues: Continue\n └─ 
Issues found: Request approval/modification\n ↓\n┌─────────────────────────────────────────┐\n│ Execution Plan Verification │\n│ - Check all configs are valid │\n│ - Verify dependencies are resolvable │\n│ - Estimate costs and timeline │\n│ - Identify risks │\n└──────────────┬──────────────────────────┘\n ↓\n Execute Workflow?\n ├─ User approves: Start execution\n └─ User modifies: Return to planning\n ↓\n┌─────────────────────────────────────────┐\n│ Phase-by-Phase Execution │\n│ - Execute one logical phase │\n│ - Monitor for errors │\n│ - Report progress │\n│ - Ask for decisions if needed │\n└──────────────┬──────────────────────────┘\n ↓\n All Phases Complete?\n ├─ No: Continue to next phase\n └─ Yes: Final validation\n ↓\n┌─────────────────────────────────────────┐\n│ Final Validation & Reporting │\n│ - Smoke tests │\n│ - Connectivity tests │\n│ - Compliance verification │\n│ - Performance checks │\n│ - Generate final report │\n└──────────────┬──────────────────────────┘\n ↓\nSuccess: Deployment Complete\n```\n\n## Planned Agent Types\n\n### 1. Database Specialist Agent\n\n```\nResponsibilities:\n- Create and configure databases\n- Set up replication and backups\n- Configure encryption and security\n- Monitor database health\n- Handle database-specific issues\n\nExamples:\n- Provision PostgreSQL cluster with replication\n- Set up MySQL with read replicas\n- Configure MongoDB sharding\n- Create backup pipelines\n```\n\n### 2. Kubernetes Specialist Agent\n\n```\nResponsibilities:\n- Create and configure Kubernetes clusters\n- Configure networking and ingress\n- Set up autoscaling policies\n- Deploy cluster add-ons\n- Manage workload placement\n\nExamples:\n- Create EKS/GKE/AKS cluster\n- Configure Istio service mesh\n- Deploy Prometheus + Grafana\n- Configure auto-scaling policies\n```\n\n### 3. Infrastructure Agent\n\n```\nResponsibilities:\n- Create networking infrastructure\n- Configure security and firewalls\n- Set up load balancers\n- Configure DNS and CDN\n- Manage identity and access\n\nExamples:\n- Create VPC with subnets\n- Configure security groups\n- Set up application load balancer\n- Configure Route53 DNS\n```\n\n### 4. Monitoring Agent\n\n```\nResponsibilities:\n- Deploy monitoring stack\n- Configure alerting\n- Set up logging infrastructure\n- Create dashboards\n- Configure notification channels\n\nExamples:\n- Deploy Prometheus + Grafana\n- Set up CloudWatch dashboards\n- Configure log aggregation\n- Set up PagerDuty integration\n```\n\n### 5. 
Compliance Agent\n\n```\nResponsibilities:\n- Check security policies\n- Verify compliance requirements\n- Audit configurations\n- Generate compliance reports\n- Recommend security improvements\n\nExamples:\n- Check PCI-DSS compliance\n- Verify encryption settings\n- Audit access controls\n- Generate compliance report\n```\n\n## Usage Examples\n\n### Example 1: Development Environment Setup\n\n```\n$ provisioning ai agent --goal "Set up dev environment for Python web app"\n\nAgent Plan Generated:\n┌─────────────────────────────────────────┐\n│ Environment: Development │\n│ Components: PostgreSQL + Redis + Monitoring\n│ │\n│ Phase 1: Database (1-2 min) │\n│ - PostgreSQL 15 │\n│ - 10 GB storage │\n│ - Dev security settings │\n│ │\n│ Phase 2: Cache (1 min) │\n│ - Redis Cluster Mode disabled │\n│ - Single node │\n│ - 2 GB memory │\n│ │\n│ Phase 3: Monitoring (1-2 min) │\n│ - Prometheus (metrics) │\n│ - Grafana (dashboards) │\n│ - Log aggregation │\n│ │\n│ Estimated time: 5-10 minutes │\n│ Estimated cost: $15/month │\n│ │\n│ [Approve] [Modify] [Cancel] │\n└─────────────────────────────────────────┘\n\nAgent: Approve to proceed with setup.\n\nUser: Approve\n\n[Agent execution starts]\nCreating PostgreSQL... [████████░░] 80%\nCreating Redis... [░░░░░░░░░░] 0%\n[Waiting for PostgreSQL creation...]\n\nPostgreSQL created successfully!\nConnection string: postgresql://dev:pwd@db.internal:5432/app\n\nCreating Redis... [████████░░] 80%\n[Waiting for Redis creation...]\n\nRedis created successfully!\nConnection string: redis://cache.internal:6379\n\nDeploying monitoring... [████████░░] 80%\n[Waiting for Grafana startup...]\n\nAll services deployed successfully!\nGrafana dashboards: [http://grafana.internal:3000](http://grafana.internal:3000)\n```\n\n### Example 2: Production Kubernetes Deployment\n\n```\n$ provisioning ai agent --interactive \\n --goal "Deploy production Kubernetes cluster with managed databases"\n\nAgent Analysis:\n- Cluster size: 3-10 nodes (auto-scaling)\n- Databases: RDS PostgreSQL + ElastiCache Redis\n- Monitoring: Full observability stack\n- Security: TLS, encryption, VPC isolation\n\nAgent suggests modifications:\n 1. Enable cross-AZ deployment for HA\n 2. Add backup retention: 30 days\n 3. Add network policies for security\n 4. Enable cluster autoscaling\n Approve all? [yes/review]\n\nUser: Review\n\nAgent points out:\n - Network policies may affect performance\n - Cross-AZ increases costs by ~20%\n - Backup retention meets compliance\n\nUser: Approve with modifications\n - Network policies: use audit mode first\n - Keep cross-AZ\n - Keep backups\n\n[Agent creates configs with modifications]\n\nConfigs generated:\n ✓ infrastructure/vpc.ncl\n ✓ infrastructure/kubernetes.ncl\n ✓ databases/postgres.ncl\n ✓ databases/redis.ncl\n ✓ monitoring/prometheus.ncl\n ✓ monitoring/grafana.ncl\n\nEstimated deployment time: 15-20 minutes\nEstimated cost: $2,500/month\n\n[Start deployment?] 
[Review configs]\n\nUser: Review configs\n\n[User reviews and approves]\n\n[Agent executes deployment in phases]\n```\n\n## Safety and Control\n\n### Human-in-the-Loop Checkpoints\n\nAgents stop and ask humans for approval at critical points:\n\n```\nAutomatic Approval (Agent decides):\n- Create configuration\n- Validate configuration\n- Check dependencies\n- Generate execution plan\n\nHuman Approval Required:\n- First-time resource creation\n- Cost changes > 10%\n- Security policy changes\n- Cross-region deployment\n- Data deletion operations\n- Major version upgrades\n```\n\n### Decision Logging\n\nAll decisions logged for audit trail:\n\n```\nAgent Decision Log:\n| 2025-01-13 10:00:00 | Generate database config |\n| 2025-01-13 10:00:05 | Config validation: PASS |\n| 2025-01-13 10:00:07 | Requesting human approval: "Create new PostgreSQL instance" |\n| 2025-01-13 10:00:45 | Human approval: APPROVED |\n| 2025-01-13 10:00:47 | Cost estimate: $100/month - within budget |\n| 2025-01-13 10:01:00 | Creating infrastructure... |\n| 2025-01-13 10:02:15 | Database created successfully |\n| 2025-01-13 10:02:16 | Running health checks... |\n| 2025-01-13 10:02:45 | Health check: PASSED |\n```\n\n### Rollback Capability\n\nAgents can rollback on failure:\n\n```\nScenario: Database creation succeeds, but Kubernetes creation fails\n\nAgent behavior:\n1. Detect failure in Kubernetes phase\n2. Try recovery (retry, different configuration)\n3. Recovery fails\n4. Ask human: "Kubernetes creation failed. Rollback database creation? [yes/no]"\n5. If yes: Delete database, clean up, report failure\n6. If no: Keep database, manual cleanup needed\n\nFull rollback capability if entire workflow fails before human approval.\n```\n\n## Configuration\n\n### Agent Settings\n\n```\n# In provisioning/config/ai.toml\n[ai.agents]\nenabled = true\n\n# Agent decision-making\nauto_approve_threshold = 0.95 # Approve if confidence > 95%\nrequire_approval_for = [\n "first_resource_creation",\n "cost_change_above_percent",\n "security_policy_change",\n "data_deletion",\n]\n\ncost_change_threshold_percent = 10\n\n# Execution control\nmax_parallel_phases = 2\nphase_timeout_minutes = 30\nexecution_log_retention_days = 90\n\n# Safety\ndry_run_mode = false # Always perform dry run first\nrequire_final_approval = true\nrollback_on_failure = true\n\n# Learning\ntrack_agent_decisions = true\ntrack_success_rate = true\nimprove_from_feedback = true\n```\n\n## Success Criteria (Q2 2025)\n\n- ✅ Agents complete 5 standard workflows without human intervention\n- ✅ Cost estimation accuracy within 5%\n- ✅ Execution time matches or beats manual setup by 30%\n- ✅ Success rate > 95% for tested scenarios\n- ✅ Zero unapproved critical decisions\n- ✅ Full decision audit trail for all operations\n- ✅ Rollback capability tested and verified\n- ✅ User satisfaction > 8/10 in testing\n- ✅ Documentation complete with examples\n- ✅ Integration with form assistance and NLC working\n\n## Related Documentation\n\n- [Architecture](architecture.md) - AI system overview\n- [Natural Language Config](natural-language-config.md) - Config generation\n- [AI-Assisted Forms](ai-assisted-forms.md) - Interactive forms\n- [Configuration](configuration.md) - Setup guide\n- [ADR-015](../architecture/adr/adr-015-ai-integration-architecture.md) - Design decisions\n\n---\n\n**Status**: 🔴 Planned\n**Target Release**: Q2 2025\n**Last Updated**: 2025-01-13\n**Component**: typdialog-ag\n**Architecture**: Complete\n**Implementation**: In Design Phase +# Autonomous AI Agents 
(typdialog-ag) + +**Status**: 🔴 Planned (Q2 2025 target) + +Autonomous AI Agents is a planned feature that enables AI agents to execute multi-step +infrastructure provisioning workflows with minimal human intervention. Agents make +decisions, adapt to changing conditions, and execute complex tasks while maintaining +security and requiring human approval for critical operations. + +## Feature Overview + +### What It Does + +Enable AI agents to manage complex provisioning workflows: + +```text +User Goal: + "Set up a complete development environment with: + - PostgreSQL database + - Redis cache + - Kubernetes cluster + - Monitoring stack + - Logging infrastructure" + +AI Agent executes: +1. Analyzes requirements and constraints +2. Plans multi-step deployment sequence +3. Creates configurations for all components +4. Validates configurations against policies +5. Requests human approval for critical decisions +6. Executes deployment in correct order +7. Monitors for failures and adapts +8. Reports completion and recommendations +``` + +## Agent Capabilities + +### Multi-Step Workflow Execution + +Agents coordinate complex, multi-component deployments: + +```text +Goal: "Deploy production Kubernetes cluster with managed databases" + +Agent Plan: + Phase 1: Infrastructure + ├─ Create VPC and networking + ├─ Set up security groups + └─ Configure IAM roles + + Phase 2: Kubernetes + ├─ Create EKS cluster + ├─ Configure network plugins + ├─ Set up autoscaling + └─ Install cluster add-ons + + Phase 3: Managed Services + ├─ Provision RDS PostgreSQL + ├─ Configure backups + └─ Set up replicas + + Phase 4: Observability + ├─ Deploy Prometheus + ├─ Deploy Grafana + ├─ Configure log collection + └─ Set up alerting + + Phase 5: Validation + ├─ Run smoke tests + ├─ Verify connectivity + └─ Check compliance +``` + +### Adaptive Decision Making + +Agents adapt to conditions and make intelligent decisions: + +```text +Scenario: Database provisioning fails due to resource quota + +Standard approach (human): +1. Detect failure +2. Investigate issue +3. Decide on fix (reduce size, change region, etc.) +4. Update config +5. Retry + +Agent approach: +1. Detect failure +2. Analyze error: "Quota exceeded for db.r6g.xlarge" +3. Check available options: + - Try smaller instance: db.r6g.large (may be insufficient) + - Try different region: different cost, latency + - Request quota increase (requires human approval) +4. Ask human: "Quota exceeded. Suggest: use db.r6g.large instead + (slightly reduced performance). Approve? [yes/no/try-other]" +5. Execute based on approval +6. 
Continue workflow +``` + +### Dependency Management + +Agents understand resource dependencies: + +```text +Knowledge graph of dependencies: + + VPC ──→ Subnets ──→ EC2 Instances + ├─────────→ Security Groups + └────→ NAT Gateway ──→ Route Tables + + RDS ──→ DB Subnet Group ──→ VPC + ├─────────→ Security Group + └────→ Parameter Group + +Agent ensures: +- VPC exists before creating subnets +- Subnets exist before creating EC2 +- Security groups reference correct VPC +- Deployment order respects all dependencies +- Rollback order is reverse of creation +``` + +## Architecture + +### Agent Design Pattern + +```text +┌────────────────────────────────────────────────────────┐ +│ Agent Supervisor (Orchestrator) │ +│ - Accepts user goal │ +│ - Plans workflow │ +│ - Coordinates specialist agents │ +│ - Requests human approvals │ +│ - Monitors overall progress │ +└────────────────────────────────────────────────────────┘ + ↑ ↑ ↑ + │ │ │ + ↓ ↓ ↓ +┌──────────────┐ ┌──────────────┐ ┌──────────────┐ +│ Database │ │ Kubernetes │ │ Monitoring │ +│ Specialist │ │ Specialist │ │ Specialist │ +│ │ │ │ │ │ +│ Tasks: │ │ Tasks: │ │ Tasks: │ +│ - Create DB │ │ - Create K8s │ │ - Deploy │ +│ - Configure │ │ - Configure │ │ Prometheus │ +│ - Validate │ │ - Validate │ │ - Deploy │ +│ - Report │ │ - Report │ │ Grafana │ +└──────────────┘ └──────────────┘ └──────────────┘ +``` + +### Agent Workflow + +```text +Start: User Goal + ↓ +┌─────────────────────────────────────────┐ +│ Goal Analysis & Planning │ +│ - Parse user intent │ +│ - Identify resources needed │ +│ - Plan dependency graph │ +│ - Generate task list │ +└──────────────┬──────────────────────────┘ + ↓ +┌─────────────────────────────────────────┐ +│ Resource Generation │ +│ - Generate configs for each resource │ +│ - Validate against schemas │ +│ - Check compliance policies │ +│ - Identify potential issues │ +└──────────────┬──────────────────────────┘ + ↓ + Human Review Point? + ├─ No issues: Continue + └─ Issues found: Request approval/modification + ↓ +┌─────────────────────────────────────────┐ +│ Execution Plan Verification │ +│ - Check all configs are valid │ +│ - Verify dependencies are resolvable │ +│ - Estimate costs and timeline │ +│ - Identify risks │ +└──────────────┬──────────────────────────┘ + ↓ + Execute Workflow? + ├─ User approves: Start execution + └─ User modifies: Return to planning + ↓ +┌─────────────────────────────────────────┐ +│ Phase-by-Phase Execution │ +│ - Execute one logical phase │ +│ - Monitor for errors │ +│ - Report progress │ +│ - Ask for decisions if needed │ +└──────────────┬──────────────────────────┘ + ↓ + All Phases Complete? + ├─ No: Continue to next phase + └─ Yes: Final validation + ↓ +┌─────────────────────────────────────────┐ +│ Final Validation & Reporting │ +│ - Smoke tests │ +│ - Connectivity tests │ +│ - Compliance verification │ +│ - Performance checks │ +│ - Generate final report │ +└──────────────┬──────────────────────────┘ + ↓ +Success: Deployment Complete +``` + +## Planned Agent Types + +### 1. Database Specialist Agent + +```text +Responsibilities: +- Create and configure databases +- Set up replication and backups +- Configure encryption and security +- Monitor database health +- Handle database-specific issues + +Examples: +- Provision PostgreSQL cluster with replication +- Set up MySQL with read replicas +- Configure MongoDB sharding +- Create backup pipelines +``` + +### 2. 
Kubernetes Specialist Agent
+
+```text
+Responsibilities:
+- Create and configure Kubernetes clusters
+- Configure networking and ingress
+- Set up autoscaling policies
+- Deploy cluster add-ons
+- Manage workload placement
+
+Examples:
+- Create EKS/GKE/AKS cluster
+- Configure Istio service mesh
+- Deploy Prometheus + Grafana
+- Configure auto-scaling policies
+```
+
+### 3. Infrastructure Agent
+
+```text
+Responsibilities:
+- Create networking infrastructure
+- Configure security and firewalls
+- Set up load balancers
+- Configure DNS and CDN
+- Manage identity and access
+
+Examples:
+- Create VPC with subnets
+- Configure security groups
+- Set up application load balancer
+- Configure Route53 DNS
+```
+
+### 4. Monitoring Agent
+
+```text
+Responsibilities:
+- Deploy monitoring stack
+- Configure alerting
+- Set up logging infrastructure
+- Create dashboards
+- Configure notification channels
+
+Examples:
+- Deploy Prometheus + Grafana
+- Set up CloudWatch dashboards
+- Configure log aggregation
+- Set up PagerDuty integration
+```
+
+### 5. Compliance Agent
+
+```text
+Responsibilities:
+- Check security policies
+- Verify compliance requirements
+- Audit configurations
+- Generate compliance reports
+- Recommend security improvements
+
+Examples:
+- Check PCI-DSS compliance
+- Verify encryption settings
+- Audit access controls
+- Generate compliance report
+```
+
+## Usage Examples
+
+### Example 1: Development Environment Setup
+
+```text
+$ provisioning ai agent --goal "Set up dev environment for Python web app"
+
+Agent Plan Generated:
+┌─────────────────────────────────────────┐
+│ Environment: Development │
+│ Components: PostgreSQL + Redis + │
+│ Monitoring │
+│ │
+│ Phase 1: Database (1-2 min) │
+│ - PostgreSQL 15 │
+│ - 10 GB storage │
+│ - Dev security settings │
+│ │
+│ Phase 2: Cache (1 min) │
+│ - Redis Cluster Mode disabled │
+│ - Single node │
+│ - 2 GB memory │
+│ │
+│ Phase 3: Monitoring (1-2 min) │
+│ - Prometheus (metrics) │
+│ - Grafana (dashboards) │
+│ - Log aggregation │
+│ │
+│ Estimated time: 5-10 minutes │
+│ Estimated cost: $15/month │
+│ │
+│ [Approve] [Modify] [Cancel] │
+└─────────────────────────────────────────┘
+
+Agent: Approve to proceed with setup.
+
+User: Approve
+
+[Agent execution starts]
+Creating PostgreSQL... [████████░░] 80%
+Creating Redis... [░░░░░░░░░░] 0%
+[Waiting for PostgreSQL creation...]
+
+PostgreSQL created successfully!
+Connection string: postgresql://dev:pwd@db.internal:5432/app
+
+Creating Redis... [████████░░] 80%
+[Waiting for Redis creation...]
+
+Redis created successfully!
+Connection string: redis://cache.internal:6379
+
+Deploying monitoring... [████████░░] 80%
+[Waiting for Grafana startup...]
+
+All services deployed successfully!
+Grafana dashboards: http://grafana.internal:3000
+```
+
+### Example 2: Production Kubernetes Deployment
+
+```text
+$ provisioning ai agent --interactive \
+  --goal "Deploy production Kubernetes cluster with managed databases"
+
+Agent Analysis:
+- Cluster size: 3-10 nodes (auto-scaling)
+- Databases: RDS PostgreSQL + ElastiCache Redis
+- Monitoring: Full observability stack
+- Security: TLS, encryption, VPC isolation
+
+Agent suggests modifications:
+  1. Enable cross-AZ deployment for HA
+  2. Add backup retention: 30 days
+  3. Add network policies for security
+  4. Enable cluster autoscaling
+  Approve all? [yes/review]
+
+User: Review
+
+Agent points out:
+  - Network policies may affect performance
+  - Cross-AZ increases costs by ~20%
+  - Backup retention meets compliance
+
+User: Approve with modifications
+  - Network policies: use audit mode first
+  - Keep cross-AZ
+  - Keep backups
+
+[Agent creates configs with modifications]
+
+Configs generated:
+  ✓ infrastructure/vpc.ncl
+  ✓ infrastructure/kubernetes.ncl
+  ✓ databases/postgres.ncl
+  ✓ databases/redis.ncl
+  ✓ monitoring/prometheus.ncl
+  ✓ monitoring/grafana.ncl
+
+Estimated deployment time: 15-20 minutes
+Estimated cost: $2,500/month
+
+[Start deployment?] [Review configs]
+
+User: Review configs
+
+[User reviews and approves]
+
+[Agent executes deployment in phases]
+```
+
+## Safety and Control
+
+### Human-in-the-Loop Checkpoints
+
+Agents stop and ask humans for approval at critical points:
+
+```text
+Automatic Approval (Agent decides):
+- Create configuration
+- Validate configuration
+- Check dependencies
+- Generate execution plan
+
+Human Approval Required:
+- First-time resource creation
+- Cost changes > 10%
+- Security policy changes
+- Cross-region deployment
+- Data deletion operations
+- Major version upgrades
+```
+
+### Decision Logging
+
+All decisions are logged for an audit trail:
+
+```text
+Agent Decision Log:
+| 2025-01-13 10:00:00 | Generate database config |
+| 2025-01-13 10:00:05 | Config validation: PASS |
+| 2025-01-13 10:00:07 | Requesting human approval: "Create new PostgreSQL instance" |
+| 2025-01-13 10:00:45 | Human approval: APPROVED |
+| 2025-01-13 10:00:47 | Cost estimate: $100/month - within budget |
+| 2025-01-13 10:01:00 | Creating infrastructure... |
+| 2025-01-13 10:02:15 | Database created successfully |
+| 2025-01-13 10:02:16 | Running health checks... |
+| 2025-01-13 10:02:45 | Health check: PASSED |
+```
+
+### Rollback Capability
+
+Agents can roll back on failure:
+
+```text
+Scenario: Database creation succeeds, but Kubernetes creation fails
+
+Agent behavior:
+1. Detect failure in Kubernetes phase
+2. Try recovery (retry, different configuration)
+3. Recovery fails
+4. Ask human: "Kubernetes creation failed. Rollback database creation? [yes/no]"
+5. If yes: Delete database, clean up, report failure
+6. If no: Keep database, manual cleanup needed
+
+Full rollback capability if entire workflow fails before human approval.
+```
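+
+As a sketch of the mechanics, phase execution with reverse-order rollback and an
+audit log might look like the Rust fragment below. It is illustrative only: the
+`Phase` trait and `DecisionLog` type are assumptions, and the real agent would
+insert the human approval checkpoint before rolling anything back.
+
+```text
+// Minimal sketch: run phases in order; on failure, roll back completed
+// phases in reverse order, recording every decision for the audit trail.
+trait Phase {
+    fn name(&self) -> &str;
+    fn execute(&self) -> Result<(), String>;
+    fn rollback(&self) -> Result<(), String>;
+}
+
+#[derive(Default)]
+struct DecisionLog {
+    entries: Vec<String>,
+}
+
+impl DecisionLog {
+    fn record(&mut self, decision: &str) {
+        self.entries.push(decision.to_string());
+    }
+}
+
+fn run_workflow(phases: &[Box<dyn Phase>], log: &mut DecisionLog) -> Result<(), String> {
+    let mut completed: Vec<&dyn Phase> = Vec::new();
+    for phase in phases {
+        log.record(&format!("Starting phase: {}", phase.name()));
+        match phase.execute() {
+            Ok(()) => completed.push(phase.as_ref()),
+            Err(err) => {
+                log.record(&format!("Phase {} failed: {err}", phase.name()));
+                // A real agent would request human approval before this step.
+                for done in completed.iter().rev() {
+                    log.record(&format!("Rolling back: {}", done.name()));
+                    done.rollback()?;
+                }
+                return Err(err);
+            }
+        }
+    }
+    Ok(())
+}
+```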
+
+## Configuration
+
+### Agent Settings
+
+```text
+# In provisioning/config/ai.toml
+[ai.agents]
+enabled = true
+
+# Agent decision-making
+auto_approve_threshold = 0.95 # Approve if confidence > 95%
+require_approval_for = [
+  "first_resource_creation",
+  "cost_change_above_percent",
+  "security_policy_change",
+  "data_deletion",
+]
+
+cost_change_threshold_percent = 10
+
+# Execution control
+max_parallel_phases = 2
+phase_timeout_minutes = 30
+execution_log_retention_days = 90
+
+# Safety
+dry_run_mode = false # Set true to always dry-run before executing
+require_final_approval = true
+rollback_on_failure = true
+
+# Learning
+track_agent_decisions = true
+track_success_rate = true
+improve_from_feedback = true
+```
+
+## Success Criteria (Q2 2025)
+
+- ✅ Agents complete 5 standard workflows without human intervention
+- ✅ Cost estimation accuracy within 5%
+- ✅ Execution time at least 30% faster than manual setup
+- ✅ Success rate > 95% for tested scenarios
+- ✅ Zero unapproved critical decisions
+- ✅ Full decision audit trail for all operations
+- ✅ Rollback capability tested and verified
+- ✅ User satisfaction > 8/10 in testing
+- ✅ Documentation complete with examples
+- ✅ Integration with form assistance and NLC working
+
+## Related Documentation
+
+- [Architecture](architecture.md) - AI system overview
+- [Natural Language Config](natural-language-config.md) - Config generation
+- [AI-Assisted Forms](ai-assisted-forms.md) - Interactive forms
+- [Configuration](configuration.md) - Setup guide
+- [ADR-015](../architecture/adr/adr-015-ai-integration-architecture.md) - Design decisions
+
+---
+
+**Status**: 🔴 Planned
+**Target Release**: Q2 2025
+**Last Updated**: 2025-01-13
+**Component**: typdialog-ag
+**Architecture**: Complete
+**Implementation**: In Design Phase
\ No newline at end of file
diff --git a/docs/src/ai/ai-assisted-forms.md b/docs/src/ai/ai-assisted-forms.md
index 0596fdc..0ededcb 100644
--- a/docs/src/ai/ai-assisted-forms.md
+++ b/docs/src/ai/ai-assisted-forms.md
@@ -1 +1,438 @@
-# AI-Assisted Forms (typdialog-ai)\n\n**Status**: 🔴 Planned (Q2 2025 target)\n\nAI-Assisted Forms is a planned feature that integrates intelligent suggestions, context-aware assistance, and natural language understanding into the\ntypdialog web UI. This enables users to configure infrastructure through interactive forms with real-time AI guidance.\n\n## Feature Overview\n\n### What It Does\n\nEnhance configuration forms with AI-powered assistance:\n\n```\nUser typing in form field: "storage"\n ↓\nAI analyzes context:\n - Current form (database configuration)\n - Field type (storage capacity)\n - Similar past configurations\n - Best practices for this workload\n ↓\nSuggestions appear:\n ✓ "100 GB (standard production size)"\n ✓ "50 GB (development environment)"\n ✓ "500 GB (large-scale analytics)"\n```\n\n### Primary Use Cases\n\n1. **Guided Configuration**: Step-by-step assistance filling complex forms\n2. **Error Explanation**: AI explains validation failures in plain English\n3. **Smart Autocomplete**: Suggestions based on context, not just keywords\n4. **Learning**: New users learn patterns from AI explanations\n5. **Efficiency**: Experienced users get quick suggestions\n\n## Architecture\n\n### User Interface Integration\n\n```\n┌────────────────────────────────────────┐\n│ Typdialog Web UI (React/TypeScript) │\n│ │\n│ ┌──────────────────────────────────┐ │\n│ │ Form Fields │ │\n│ │ │ │\n│ │ Database Engine: [postgresql ▼] │ │\n│ │ Storage (GB): [100 GB ↓ ?]
│ │\n│ │ AI suggestions │ │\n│ │ Encryption: [✓ enabled ] │ │\n│ │ "Required for │ │\n│ │ production" │ │\n│ │ │ │\n│ │ [← Back] [Next →] │ │\n│ └──────────────────────────────────┘ │\n│ ↓ │\n│ AI Assistance Panel │\n│ (suggestions & explanations) │\n└────────────────────────────────────────┘\n ↓ ↑\n User Input AI Service\n (port 8083)\n```\n\n### Suggestion Pipeline\n\n```\nUser Event (typing, focusing field, validation error)\n ↓\n┌─────────────────────────────────────┐\n│ Context Extraction │\n│ - Current field and value │\n│ - Form schema and constraints │\n│ - Other filled fields │\n│ - User role and workspace │\n└─────────────────────┬───────────────┘\n ↓\n┌─────────────────────────────────────┐\n│ RAG Retrieval │\n│ - Find similar configs │\n│ - Get examples for field type │\n│ - Retrieve relevant documentation │\n│ - Find validation rules │\n└─────────────────────┬───────────────┘\n ↓\n┌─────────────────────────────────────┐\n│ Suggestion Generation │\n│ - AI generates suggestions │\n│ - Rank by relevance │\n│ - Format for display │\n│ - Generate explanation │\n└─────────────────────┬───────────────┘\n ↓\n┌─────────────────────────────────────┐\n│ Response Formatting │\n│ - Debounce (don't update too fast) │\n│ - Cache identical results │\n│ - Stream if long response │\n│ - Display to user │\n└─────────────────────────────────────┘\n```\n\n## Planned Features\n\n### 1. Smart Field Suggestions\n\nIntelligent suggestions based on context:\n\n```\nScenario: User filling database configuration form\n\n1. Engine selection\n User types: "post" \n Suggestion: "postgresql" (99% match)\n Explanation: "PostgreSQL is the most popular open-source relational database"\n\n2. Storage size\n User has selected: "postgresql", "production", "web-application"\n Suggestions appear:\n • "100 GB" (standard production web app database)\n • "500 GB" (if expected growth > 1000 connections)\n • "1 TB" (high-traffic SaaS platform)\n Explanation: "For typical web applications with 1000s of concurrent users, 100 GB is recommended"\n\n3. Backup frequency\n User has selected: "production", "critical-data"\n Suggestions appear:\n • "Daily" (standard for critical databases)\n • "Hourly" (for data warehouses with frequent updates)\n Explanation: "Critical production data requires daily or more frequent backups"\n```\n\n### 2. Validation Error Explanation\n\nHuman-readable error messages with fixes:\n\n```\nUser enters: "storage = -100"\n\nCurrent behavior:\n ✗ Error: Expected positive integer\n\nPlanned AI behavior:\n ✗ Storage must be positive (1-65535 GB)\n \n Why: Negative storage doesn't make sense.\n Storage capacity must be at least 1 GB.\n \n Fix suggestions:\n • Use 100 GB (typical production size)\n • Use 50 GB (development environment)\n • Use your required size in GB\n```\n\n### 3. 
Field-to-Field Context Awareness\n\nSuggestions change based on other fields:\n\n```\nScenario: Multi-step configuration form\n\nStep 1: Select environment\nUser: "production"\n → Form shows constraints: (min storage 50GB, encryption required, backup required)\n\nStep 2: Select database engine\nUser: "postgresql"\n → Suggestions adapted:\n - PostgreSQL 15 recommended for production\n - Point-in-time recovery available\n - Replication options highlighted\n\nStep 3: Storage size\n → Suggestions show:\n - Minimum 50 GB for production\n - Examples from similar production configs\n - Cost estimate updates in real-time\n\nStep 4: Encryption\n → Suggestion appears: "Recommended: AES-256"\n → Explanation: "Required for production environments"\n```\n\n### 4. Inline Documentation\n\nQuick access to relevant docs:\n\n```\nField: "Backup Retention Days"\n\nSuggestion popup:\n ┌─────────────────────────────────┐\n │ Suggested value: 30 │\n │ │\n │ Why: 30 days is industry-standard│\n │ standard for compliance (PCI-DSS)│\n │ │\n │ Learn more: │\n │ → Backup best practices guide │\n │ → Your compliance requirements │\n │ → Cost vs retention trade-offs │\n └─────────────────────────────────┘\n```\n\n### 5. Multi-Field Suggestions\n\nSuggest multiple related fields together:\n\n```\nUser selects: environment = "production"\n\nAI suggests completing:\n ┌─────────────────────────────────┐\n │ Complete Production Setup │\n │ │\n │ Based on production environment │\n │ we recommend: │\n │ │\n │ Encryption: enabled │ ← Auto-fill\n │ Backups: daily │ ← Auto-fill\n │ Monitoring: enabled │ ← Auto-fill\n │ High availability: enabled │ ← Auto-fill\n │ Retention: 30 days │ ← Auto-fill\n │ │\n │ [Accept All] [Review] [Skip] │\n └─────────────────────────────────┘\n```\n\n## Implementation Components\n\n### Frontend (typdialog-ai JavaScript/TypeScript)\n\n```\n// React component for field with AI assistance\ninterface AIFieldProps {\n fieldName: string;\n fieldType: string;\n currentValue: string;\n formContext: Record;\n schema: FieldSchema;\n}\n\nfunction AIAssistedField({fieldName, formContext, schema}: AIFieldProps) {\n const [suggestions, setSuggestions] = useState([]);\n const [explanation, setExplanation] = useState("");\n \n // Debounced suggestion generation\n useEffect(() => {\n const timer = setTimeout(async () => {\n const suggestions = await ai.suggestFieldValue({\n field: fieldName,\n context: formContext,\n schema: schema,\n });\n setSuggestions(suggestions);\n| setExplanation(suggestions[0]?.explanation | | ""); |\n }, 300); // Debounce 300ms\n \n return () => clearTimeout(timer);\n }, [formContext[fieldName]]);\n \n return (\n
\n handleChange(e.target.value)}\n />\n \n {suggestions.length > 0 && (\n
\n {suggestions.map((s) => (\n \n ))}\n {explanation && (\n

{explanation}

\n )}\n
\n )}\n
\n );\n}\n```\n\n### Backend Service Integration\n\n```\n// In AI Service: field suggestion endpoint\nasync fn suggest_field_value(\n req: SuggestFieldRequest,\n) -> Result> {\n // Build context for the suggestion\n let context = build_field_context(&req.form_context, &req.field_name)?;\n \n // Retrieve relevant examples from RAG\n let examples = rag.search_by_field(&req.field_name, &context)?;\n \n // Generate suggestions via LLM\n let suggestions = llm.generate_suggestions(\n &req.field_name,\n &req.field_type,\n &context,\n &examples,\n ).await?;\n \n // Rank and format suggestions\n let ranked = rank_suggestions(suggestions, &context);\n \n Ok(ranked)\n}\n```\n\n## Configuration\n\n### Form Assistant Settings\n\n```\n# In provisioning/config/ai.toml\n[ai.forms]\nenabled = true\n\n# Suggestion delivery\nsuggestions_enabled = true\nsuggestions_debounce_ms = 300\nmax_suggestions_per_field = 3\n\n# Error explanations\nerror_explanations_enabled = true\nexplain_validation_errors = true\nsuggest_fixes = true\n\n# Field context awareness\nfield_context_enabled = true\ncross_field_suggestions = true\n\n# Inline documentation\ninline_docs_enabled = true\ndocs_link_type = "modal" # or "sidebar", "tooltip"\n\n# Performance\ncache_suggestions = true\ncache_ttl_seconds = 3600\n\n# Learning\ntrack_accepted_suggestions = true\ntrack_rejected_suggestions = true\n```\n\n## User Experience Flow\n\n### Scenario: New User Configuring PostgreSQL\n\n```\n1. User opens typdialog form\n - Form title: "Create Database"\n - First field: "Database Engine"\n - AI shows: "PostgreSQL recommended for relational data"\n\n2. User types "post"\n - Autocomplete shows: "postgresql"\n - AI explains: "PostgreSQL is the most stable open-source database"\n\n3. User selects "postgresql"\n - Form progresses\n - Next field: "Version"\n - AI suggests: "PostgreSQL 15 (latest stable)"\n - Explanation: "Version 15 is current stable, recommended for new deployments"\n\n4. User selects version 15\n - Next field: "Environment"\n - User selects "production"\n - AI note appears: "Production environment requires encryption and backups"\n\n5. Next field: "Storage (GB)"\n - Form shows: Minimum 50 GB (production requirement)\n - AI suggestions:\n • 100 GB (standard production)\n • 250 GB (high-traffic site)\n - User accepts: 100 GB\n\n6. Validation error on next field\n - Old behavior: "Invalid backup_days value"\n - New behavior: \n "Backup retention must be 1-35 days. Recommended: 30 days.\n 30-day retention meets compliance requirements for production systems."\n\n7. 
User completes form\n - Summary shows all AI-assisted decisions\n - Generate button creates configuration\n```\n\n## Integration with Natural Language Generation\n\nNLC and form assistance share the same backend:\n\n```\nNatural Language Generation AI-Assisted Forms\n ↓ ↓\n "Create a PostgreSQL db" Select field values\n ↓ ↓\n Intent Extraction Context Extraction\n ↓ ↓\n RAG Search RAG Search (same results)\n ↓ ↓\n LLM Generation LLM Suggestions\n ↓ ↓\n Config Output Form Field Population\n```\n\n## Success Criteria (Q2 2025)\n\n- ✅ Suggestions appear within 300ms of user action\n- ✅ 80% suggestion acceptance rate in user testing\n- ✅ Error explanations clearly explain issues and fixes\n- ✅ Cross-field context awareness works for 5+ database scenarios\n- ✅ Form completion time reduced by 40% with AI\n- ✅ User satisfaction > 8/10 in testing\n- ✅ No false suggestions (all suggestions are valid)\n- ✅ Offline mode works with cached suggestions\n\n## Related Documentation\n\n- [Architecture](architecture.md) - AI system overview\n- [Natural Language Config](natural-language-config.md) - Related generation feature\n- [RAG System](rag-system.md) - Suggestion retrieval\n- [Configuration](configuration.md) - Setup guide\n- [ADR-015](../architecture/adr/adr-015-ai-integration-architecture.md) - Design decisions\n\n---\n\n**Status**: 🔴 Planned\n**Target Release**: Q2 2025\n**Last Updated**: 2025-01-13\n**Component**: typdialog-ai\n**Architecture**: Complete\n**Implementation**: In Design Phase +# AI-Assisted Forms (typdialog-ai) + +**Status**: 🔴 Planned (Q2 2025 target) + +AI-Assisted Forms is a planned feature that integrates intelligent suggestions, context-aware assistance, and natural language understanding into the +typdialog web UI. This enables users to configure infrastructure through interactive forms with real-time AI guidance. + +## Feature Overview + +### What It Does + +Enhance configuration forms with AI-powered assistance: + +```text +User typing in form field: "storage" + ↓ +AI analyzes context: + - Current form (database configuration) + - Field type (storage capacity) + - Similar past configurations + - Best practices for this workload + ↓ +Suggestions appear: + ✓ "100 GB (standard production size)" + ✓ "50 GB (development environment)" + ✓ "500 GB (large-scale analytics)" +``` + +### Primary Use Cases + +1. **Guided Configuration**: Step-by-step assistance filling complex forms +2. **Error Explanation**: AI explains validation failures in plain English +3. **Smart Autocomplete**: Suggestions based on context, not just keywords +4. **Learning**: New users learn patterns from AI explanations +5. **Efficiency**: Experienced users get quick suggestions + +## Architecture + +### User Interface Integration + +```text +┌────────────────────────────────────────┐ +│ Typdialog Web UI (React/TypeScript) │ +│ │ +│ ┌──────────────────────────────────┐ │ +│ │ Form Fields │ │ +│ │ │ │ +│ │ Database Engine: [postgresql ▼] │ │ +│ │ Storage (GB): [100 GB ↓ ?] 
│ │
+│ │ AI suggestions │ │
+│ │ Encryption: [✓ enabled ] │ │
+│ │ "Required for │ │
+│ │ production" │ │
+│ │ │ │
+│ │ [← Back] [Next →] │ │
+│ └──────────────────────────────────┘ │
+│ ↓ │
+│ AI Assistance Panel │
+│ (suggestions & explanations) │
+└────────────────────────────────────────┘
+ ↓ ↑
+ User Input AI Service
+ (port 8083)
+```
+
+### Suggestion Pipeline
+
+```text
+User Event (typing, focusing field, validation error)
+ ↓
+┌─────────────────────────────────────┐
+│ Context Extraction │
+│ - Current field and value │
+│ - Form schema and constraints │
+│ - Other filled fields │
+│ - User role and workspace │
+└─────────────────────┬───────────────┘
+ ↓
+┌─────────────────────────────────────┐
+│ RAG Retrieval │
+│ - Find similar configs │
+│ - Get examples for field type │
+│ - Retrieve relevant documentation │
+│ - Find validation rules │
+└─────────────────────┬───────────────┘
+ ↓
+┌─────────────────────────────────────┐
+│ Suggestion Generation │
+│ - AI generates suggestions │
+│ - Rank by relevance │
+│ - Format for display │
+│ - Generate explanation │
+└─────────────────────┬───────────────┘
+ ↓
+┌─────────────────────────────────────┐
+│ Response Formatting │
+│ - Debounce (don't update too fast) │
+│ - Cache identical results │
+│ - Stream if long response │
+│ - Display to user │
+└─────────────────────────────────────┘
+```
+
+## Planned Features
+
+### 1. Smart Field Suggestions
+
+Intelligent suggestions based on context:
+
+```text
+Scenario: User filling database configuration form
+
+1. Engine selection
+   User types: "post"
+   Suggestion: "postgresql" (99% match)
+   Explanation: "PostgreSQL is the most popular open-source relational database"
+
+2. Storage size
+   User has selected: "postgresql", "production", "web-application"
+   Suggestions appear:
+   • "100 GB" (standard production web app database)
+   • "500 GB" (if expected growth > 1000 connections)
+   • "1 TB" (high-traffic SaaS platform)
+   Explanation: "For typical web applications with 1000s of concurrent users, 100 GB is recommended"
+
+3. Backup frequency
+   User has selected: "production", "critical-data"
+   Suggestions appear:
+   • "Daily" (standard for critical databases)
+   • "Hourly" (for data warehouses with frequent updates)
+   Explanation: "Critical production data requires daily or more frequent backups"
+```
+
+### 2. Validation Error Explanation
+
+Human-readable error messages with fixes:
+
+```text
+User enters: "storage = -100"
+
+Current behavior:
+  ✗ Error: Expected positive integer
+
+Planned AI behavior:
+  ✗ Storage must be positive (1-65535 GB)
+
+  Why: Negative storage doesn't make sense.
+  Storage capacity must be at least 1 GB.
+
+  Fix suggestions:
+  • Use 100 GB (typical production size)
+  • Use 50 GB (development environment)
+  • Use your required size in GB
+```
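+
+A first cut at such explanations can be a deterministic mapping from validation
+failures to structured messages, with the LLM only polishing the wording. The Rust
+sketch below illustrates the idea under that assumption; the type and field names
+are hypothetical, not the platform's actual API.
+
+```text
+// Illustrative only: map a raw range violation to the structured explanation
+// shown above (message, rationale, and concrete fix suggestions).
+struct ErrorExplanation {
+    message: String,
+    why: String,
+    fixes: Vec<String>,
+}
+
+fn explain_storage_error(value_gb: i64) -> Option<ErrorExplanation> {
+    if (1..=65535).contains(&value_gb) {
+        return None; // value is valid, nothing to explain
+    }
+    Some(ErrorExplanation {
+        message: "Storage must be positive (1-65535 GB)".to_string(),
+        why: "Storage capacity must be at least 1 GB.".to_string(),
+        fixes: vec![
+            "Use 100 GB (typical production size)".to_string(),
+            "Use 50 GB (development environment)".to_string(),
+        ],
+    })
+}
+```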
+
+### 3. Field-to-Field Context Awareness
+
+Suggestions change based on other fields:
+
+```text
+Scenario: Multi-step configuration form
+
+Step 1: Select environment
+User: "production"
+ → Form shows constraints: (min storage 50GB, encryption required, backup required)
+
+Step 2: Select database engine
+User: "postgresql"
+ → Suggestions adapted:
+   - PostgreSQL 15 recommended for production
+   - Point-in-time recovery available
+   - Replication options highlighted
+
+Step 3: Storage size
+ → Suggestions show:
+   - Minimum 50 GB for production
+   - Examples from similar production configs
+   - Cost estimate updates in real-time
+
+Step 4: Encryption
+ → Suggestion appears: "Recommended: AES-256"
+ → Explanation: "Required for production environments"
+```
+
+### 4. Inline Documentation
+
+Quick access to relevant docs:
+
+```text
+Field: "Backup Retention Days"
+
+Suggestion popup:
+  ┌─────────────────────────────────┐
+  │ Suggested value: 30 │
+  │ │
+  │ Why: 30 days is the industry │
+  │ standard for compliance (PCI-DSS)│
+  │ │
+  │ Learn more: │
+  │ → Backup best practices guide │
+  │ → Your compliance requirements │
+  │ → Cost vs retention trade-offs │
+  └─────────────────────────────────┘
+```
+
+### 5. Multi-Field Suggestions
+
+Suggest multiple related fields together:
+
+```text
+User selects: environment = "production"
+
+AI suggests completing:
+  ┌─────────────────────────────────┐
+  │ Complete Production Setup │
+  │ │
+  │ Based on production environment │
+  │ we recommend: │
+  │ │
+  │ Encryption: enabled │ ← Auto-fill
+  │ Backups: daily │ ← Auto-fill
+  │ Monitoring: enabled │ ← Auto-fill
+  │ High availability: enabled │ ← Auto-fill
+  │ Retention: 30 days │ ← Auto-fill
+  │ │
+  │ [Accept All] [Review] [Skip] │
+  └─────────────────────────────────┘
+```
+
+## Implementation Components
+
+### Frontend (typdialog-ai JavaScript/TypeScript)
+
+```text
+// React component for a form field with AI assistance
+// (the Suggestion shape here is assumed for illustration)
+interface Suggestion {
+  value: string;
+  explanation?: string;
+}
+
+interface AIFieldProps {
+  fieldName: string;
+  fieldType: string;
+  currentValue: string;
+  formContext: Record<string, string>;
+  schema: FieldSchema;
+}
+
+function AIAssistedField({fieldName, currentValue, formContext, schema}: AIFieldProps) {
+  const [suggestions, setSuggestions] = useState<Suggestion[]>([]);
+  const [explanation, setExplanation] = useState("");
+
+  // Debounced suggestion generation
+  useEffect(() => {
+    const timer = setTimeout(async () => {
+      const results = await ai.suggestFieldValue({
+        field: fieldName,
+        context: formContext,
+        schema: schema,
+      });
+      setSuggestions(results);
+      setExplanation(results[0]?.explanation || "");
+    }, 300); // Debounce 300ms
+
+    return () => clearTimeout(timer);
+  }, [formContext[fieldName]]);
+
+  return (
+    <div className="ai-assisted-field">
+      <input
+        value={currentValue}
+        onChange={(e) => handleChange(e.target.value)}
+      />
+
+      {suggestions.length > 0 && (
+        <div className="ai-suggestions">
+          {suggestions.map((s) => (
+            <button key={s.value} onClick={() => handleChange(s.value)}>
+              {s.value}
+            </button>
+          ))}
+          {explanation && (
+            <p className="ai-explanation">{explanation}</p>
+          )}
+        </div>
+      )}
+    </div>
+  );
+}
+```
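+
+The `ai.suggestFieldValue` call above implies a request/response contract with the
+AI service. One plausible shape for those payloads is sketched below in Rust; the
+field names and the confidence score are assumptions for illustration, matching the
+handler in the next section rather than a finalized API.
+
+```text
+// Hypothetical payload types for the suggestion endpoint (serde assumed).
+use serde::{Deserialize, Serialize};
+use std::collections::HashMap;
+
+#[derive(Deserialize)]
+struct SuggestFieldRequest {
+    field_name: String,
+    field_type: String,
+    form_context: HashMap<String, String>,
+}
+
+#[derive(Serialize)]
+struct Suggestion {
+    value: String,
+    explanation: Option<String>,
+    confidence: f32, // assumed relevance score in 0.0-1.0
+}
+```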
+
+### Backend Service Integration
+
+```text
+// In AI Service: field suggestion endpoint
+async fn suggest_field_value(
+    req: SuggestFieldRequest,
+) -> Result<Vec<Suggestion>> {
+    // Build context for the suggestion
+    let context = build_field_context(&req.form_context, &req.field_name)?;
+
+    // Retrieve relevant examples from RAG
+    let examples = rag.search_by_field(&req.field_name, &context)?;
+
+    // Generate suggestions via LLM
+    let suggestions = llm.generate_suggestions(
+        &req.field_name,
+        &req.field_type,
+        &context,
+        &examples,
+    ).await?;
+
+    // Rank and format suggestions
+    let ranked = rank_suggestions(suggestions, &context);
+
+    Ok(ranked)
+}
+```
+
+## Configuration
+
+### Form Assistant Settings
+
+```text
+# In provisioning/config/ai.toml
+[ai.forms]
+enabled = true
+
+# Suggestion delivery
+suggestions_enabled = true
+suggestions_debounce_ms = 300
+max_suggestions_per_field = 3
+
+# Error explanations
+error_explanations_enabled = true
+explain_validation_errors = true
+suggest_fixes = true
+
+# Field context awareness
+field_context_enabled = true
+cross_field_suggestions = true
+
+# Inline documentation
+inline_docs_enabled = true
+docs_link_type = "modal" # or "sidebar", "tooltip"
+
+# Performance
+cache_suggestions = true
+cache_ttl_seconds = 3600
+
+# Learning
+track_accepted_suggestions = true
+track_rejected_suggestions = true
+```
+
+## User Experience Flow
+
+### Scenario: New User Configuring PostgreSQL
+
+```text
+1. User opens typdialog form
+   - Form title: "Create Database"
+   - First field: "Database Engine"
+   - AI shows: "PostgreSQL recommended for relational data"
+
+2. User types "post"
+   - Autocomplete shows: "postgresql"
+   - AI explains: "PostgreSQL is the most stable open-source database"
+
+3. User selects "postgresql"
+   - Form progresses
+   - Next field: "Version"
+   - AI suggests: "PostgreSQL 15 (latest stable)"
+   - Explanation: "Version 15 is current stable, recommended for new deployments"
+
+4. User selects version 15
+   - Next field: "Environment"
+   - User selects "production"
+   - AI note appears: "Production environment requires encryption and backups"
+
+5. Next field: "Storage (GB)"
+   - Form shows: Minimum 50 GB (production requirement)
+   - AI suggestions:
+     • 100 GB (standard production)
+     • 250 GB (high-traffic site)
+   - User accepts: 100 GB
+
+6. Validation error on next field
+   - Old behavior: "Invalid backup_days value"
+   - New behavior:
+     "Backup retention must be 1-35 days. Recommended: 30 days.
+     30-day retention meets compliance requirements for production systems."
+
+7.
User completes form + - Summary shows all AI-assisted decisions + - Generate button creates configuration +``` + +## Integration with Natural Language Generation + +NLC and form assistance share the same backend: + +```text +Natural Language Generation AI-Assisted Forms + ↓ ↓ + "Create a PostgreSQL db" Select field values + ↓ ↓ + Intent Extraction Context Extraction + ↓ ↓ + RAG Search RAG Search (same results) + ↓ ↓ + LLM Generation LLM Suggestions + ↓ ↓ + Config Output Form Field Population +``` + +## Success Criteria (Q2 2025) + +- ✅ Suggestions appear within 300ms of user action +- ✅ 80% suggestion acceptance rate in user testing +- ✅ Error explanations clearly explain issues and fixes +- ✅ Cross-field context awareness works for 5+ database scenarios +- ✅ Form completion time reduced by 40% with AI +- ✅ User satisfaction > 8/10 in testing +- ✅ No false suggestions (all suggestions are valid) +- ✅ Offline mode works with cached suggestions + +## Related Documentation + +- [Architecture](architecture.md) - AI system overview +- [Natural Language Config](natural-language-config.md) - Related generation feature +- [RAG System](rag-system.md) - Suggestion retrieval +- [Configuration](configuration.md) - Setup guide +- [ADR-015](../architecture/adr/adr-015-ai-integration-architecture.md) - Design decisions + +--- + +**Status**: 🔴 Planned +**Target Release**: Q2 2025 +**Last Updated**: 2025-01-13 +**Component**: typdialog-ai +**Architecture**: Complete +**Implementation**: In Design Phase \ No newline at end of file diff --git a/docs/src/ai/architecture.md b/docs/src/ai/architecture.md index d5e5b1f..8ff9cf6 100644 --- a/docs/src/ai/architecture.md +++ b/docs/src/ai/architecture.md @@ -1 +1,194 @@ -# AI Integration Architecture\n\n## Overview\n\nThe provisioning platform's AI system provides intelligent capabilities for configuration generation, troubleshooting, and automation. The\narchitecture consists of multiple layers designed for reliability, security, and performance.\n\n## Core Components - Production-Ready\n\n### 1. AI Service (`provisioning/platform/ai-service`)\n\n**Status**: ✅ Production-Ready (2,500+ lines Rust code)\n\nThe core AI service provides:\n- Multi-provider LLM support (Anthropic Claude, OpenAI GPT-4, local models)\n- Streaming response support for real-time feedback\n- Request caching with LRU and semantic similarity\n- Rate limiting and cost control\n- Comprehensive error handling\n- HTTP REST API on port 8083\n\n**Supported Models**:\n- Claude Sonnet 4, Claude Opus 4 (Anthropic)\n- GPT-4 Turbo, GPT-4 (OpenAI)\n- Llama 3, Mistral (local/on-premise)\n\n### 2. RAG System (Retrieval-Augmented Generation)\n\n**Status**: ✅ Production-Ready (22/22 tests passing)\n\nThe RAG system enables AI to access and reason over platform documentation:\n- Vector embeddings via SurrealDB vector store\n- Hybrid search: vector similarity + BM25 keyword search\n- Document chunking (code and markdown aware)\n- Relevance ranking and context selection\n- Semantic caching for repeated queries\n\n**Capabilities**:\n```\nprovisioning ai query "How do I set up Kubernetes?"\nprovisioning ai template "Describe my infrastructure"\n```\n\n### 3. MCP Server (Model Context Protocol)\n\n**Status**: ✅ Production-Ready\n\nProvides Model Context Protocol integration:\n- Standardized tool interface for LLMs\n- Complex workflow composition\n- Integration with external AI systems (Claude, other LLMs)\n- Tool calling for provisioning operations\n\n### 4. 
CLI Integration\n\n**Status**: ✅ Production-Ready\n\nInteractive commands:\n```\nprovisioning ai template --prompt "Describe infrastructure"\nprovisioning ai query --prompt "Configuration question"\nprovisioning ai chat # Interactive mode\n```\n\n**Configuration**:\n```\n[ai]\nenabled = true\nprovider = "anthropic" # or "openai" or "local"\nmodel = "claude-sonnet-4"\n\n[ai.cache]\nenabled = true\nsemantic_similarity = true\nttl_seconds = 3600\n\n[ai.limits]\nmax_tokens = 4096\ntemperature = 0.7\n```\n\n## Planned Components - Q2 2025\n\n### Autonomous Agents (typdialog-ag)\n\n**Status**: 🔴 Planned\n\nSelf-directed agents for complex tasks:\n- Multi-step workflow execution\n- Decision making and adaptation\n- Monitoring and self-healing recommendations\n\n### AI-Assisted Forms (typdialog-ai)\n\n**Status**: 🔴 Planned\n\nReal-time AI suggestions in configuration forms:\n- Context-aware field recommendations\n- Validation error explanations\n- Auto-completion for infrastructure patterns\n\n### Advanced Features\n\n- Fine-tuning capabilities for custom models\n- Autonomous workflow execution with human approval\n- Cedar authorization policies for AI actions\n- Custom knowledge bases per workspace\n\n## Architecture Diagram\n\n```\n┌─────────────────────────────────────────────────┐\n│ User Interface │\n│ ├── CLI (provisioning ai ...) │\n│ ├── Web UI (typdialog) │\n│ └── MCP Client (Claude, etc.) │\n└──────────────┬──────────────────────────────────┘\n ↓\n┌──────────────────────────────────────────────────┐\n│ AI Service (Port 8083) │\n│ ├── Request Router │\n│ ├── Cache Layer (LRU + Semantic) │\n│ ├── Prompt Engineering │\n│ └── Response Streaming │\n└──────┬─────────────────┬─────────────────────────┘\n ↓ ↓\n┌─────────────┐ ┌──────────────────┐\n│ RAG System │ │ LLM Provider │\n│ SurrealDB │ │ ├── Anthropic │\n│ Vector DB │ │ ├── OpenAI │\n│ + BM25 │ │ └── Local Model │\n└─────────────┘ └──────────────────┘\n ↓ ↓\n┌──────────────────────────────────────┐\n│ Cached Responses + Real Responses │\n│ Streamed to User │\n└──────────────────────────────────────┘\n```\n\n## Performance Characteristics\n\n| | Metric | Value | |\n| | -------- | ------- | |\n| | Cold response (cache miss) | 2-5 seconds | |\n| | Cached response | <500ms | |\n| | Streaming start time | <1 second | |\n| | AI service memory usage | ~200MB at rest | |\n| | Cache size (configurable) | Up to 500MB | |\n| | Vector DB (SurrealDB) | Included, auto-managed | |\n\n## Security Model\n\n### Cedar Authorization\n\nAll AI operations controlled by Cedar policies:\n- User role-based access control\n- Operation-specific permissions\n- Complete audit logging\n\n### Secret Protection\n\n- Secrets never sent to external LLMs\n- PII/sensitive data sanitized before API calls\n- Encryption at rest in local cache\n- HSM support for key storage\n\n### Local Model Support\n\nAir-gapped deployments:\n- On-premise LLM models (Llama 3, Mistral)\n- Zero external API calls\n- Full data privacy compliance\n- Ideal for classified environments\n\n## Configuration\n\nSee [Configuration Guide](configuration.md) for:\n- LLM provider setup\n- Cache configuration\n- Cost limits and budgets\n- Security policies\n\n## Related Documentation\n\n- [RAG System](rag-system.md) - Retrieval implementation details\n- [Security Policies](security-policies.md) - Authorization and safety controls\n- [Configuration Guide](configuration.md) - Setup instructions\n- [ADR-015](../architecture/adr/adr-015-ai-integration-architecture.md) - Design decisions\n\n---\n\n**Last 
Updated**: 2025-01-13\n**Status**: ✅ Production-Ready (core system)\n**Test Coverage**: 22/22 tests passing
+# AI Integration Architecture
+
+## Overview
+
+The provisioning platform's AI system provides intelligent capabilities for configuration generation, troubleshooting, and automation. The
+architecture consists of multiple layers designed for reliability, security, and performance.
+
+## Core Components - Production-Ready
+
+### 1. AI Service (`provisioning/platform/ai-service`)
+
+**Status**: ✅ Production-Ready (2,500+ lines Rust code)
+
+The core AI service provides:
+- Multi-provider LLM support (Anthropic Claude, OpenAI GPT-4, local models)
+- Streaming response support for real-time feedback
+- Request caching with LRU and semantic similarity
+- Rate limiting and cost control
+- Comprehensive error handling
+- HTTP REST API on port 8083
+
+**Supported Models**:
+- Claude Sonnet 4, Claude Opus 4 (Anthropic)
+- GPT-4 Turbo, GPT-4 (OpenAI)
+- Llama 3, Mistral (local/on-premise)
+
+### 2. RAG System (Retrieval-Augmented Generation)
+
+**Status**: ✅ Production-Ready (22/22 tests passing)
+
+The RAG system enables AI to access and reason over platform documentation:
+- Vector embeddings via SurrealDB vector store
+- Hybrid search: vector similarity + BM25 keyword search
+- Document chunking (code and markdown aware)
+- Relevance ranking and context selection
+- Semantic caching for repeated queries
+
+**Capabilities**:
+```text
+provisioning ai query "How do I set up Kubernetes?"
+provisioning ai template "Describe my infrastructure"
+```
+
+### 3. MCP Server (Model Context Protocol)
+
+**Status**: ✅ Production-Ready
+
+Provides Model Context Protocol integration:
+- Standardized tool interface for LLMs
+- Complex workflow composition
+- Integration with external AI systems (Claude, other LLMs)
+- Tool calling for provisioning operations
+
+### 4. CLI Integration
+
+**Status**: ✅ Production-Ready
+
+Interactive commands:
+```text
+provisioning ai template --prompt "Describe infrastructure"
+provisioning ai query --prompt "Configuration question"
+provisioning ai chat # Interactive mode
+```
+
+**Configuration**:
+```text
+[ai]
+enabled = true
+provider = "anthropic" # or "openai" or "local"
+model = "claude-sonnet-4"
+
+[ai.cache]
+enabled = true
+semantic_similarity = true
+ttl_seconds = 3600
+
+[ai.limits]
+max_tokens = 4096
+temperature = 0.7
+```
+
+## Planned Components - Q2 2025
+
+### Autonomous Agents (typdialog-ag)
+
+**Status**: 🔴 Planned
+
+Self-directed agents for complex tasks:
+- Multi-step workflow execution
+- Decision making and adaptation
+- Monitoring and self-healing recommendations
+
+### AI-Assisted Forms (typdialog-ai)
+
+**Status**: 🔴 Planned
+
+Real-time AI suggestions in configuration forms:
+- Context-aware field recommendations
+- Validation error explanations
+- Auto-completion for infrastructure patterns
+
+### Advanced Features
+
+- Fine-tuning capabilities for custom models
+- Autonomous workflow execution with human approval
+- Cedar authorization policies for AI actions
+- Custom knowledge bases per workspace
+
+## Architecture Diagram
+
+```text
+┌─────────────────────────────────────────────────┐
+│ User Interface │
+│ ├── CLI (provisioning ai ...) │
+│ ├── Web UI (typdialog) │
+│ └── MCP Client (Claude, etc.) │
+└──────────────┬──────────────────────────────────┘
+               ↓
+┌──────────────────────────────────────────────────┐
+│ AI Service (Port 8083) │
+│ ├── Request Router │
+│ ├── Cache Layer (LRU + Semantic) │
+│ ├── Prompt Engineering │
+│ └── Response Streaming │
+└──────┬─────────────────┬─────────────────────────┘
+       ↓                 ↓
+┌─────────────┐ ┌──────────────────┐
+│ RAG System │ │ LLM Provider │
+│ SurrealDB │ │ ├── Anthropic │
+│ Vector DB │ │ ├── OpenAI │
+│ + BM25 │ │ └── Local Model │
+└─────────────┘ └──────────────────┘
+       ↓                 ↓
+┌──────────────────────────────────────┐
+│ Cached Responses + Real Responses │
+│ Streamed to User │
+└──────────────────────────────────────┘
+```
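+
+Since the AI service fronts everything with a plain HTTP API on port 8083, any
+client can drive it directly. The Rust sketch below shows a minimal request; the
+`/query` path and payload shape are assumptions for illustration (see the REST API
+reference for the actual endpoints).
+
+```text
+// Minimal client sketch using the reqwest crate (blocking + json features).
+use serde_json::json;
+
+fn main() -> Result<(), Box<dyn std::error::Error>> {
+    let client = reqwest::blocking::Client::new();
+    let answer = client
+        .post("http://localhost:8083/query") // assumed endpoint path
+        .json(&json!({ "prompt": "How do I set up Kubernetes?" }))
+        .send()?
+        .text()?;
+    println!("{answer}");
+    Ok(())
+}
+```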
+
+## Performance Characteristics
+
+| Metric | Value |
+| -------- | ------- |
+| Cold response (cache miss) | 2-5 seconds |
+| Cached response | <500ms |
+| Streaming start time | <1 second |
+| AI service memory usage | ~200MB at rest |
+| Cache size (configurable) | Up to 500MB |
+| Vector DB (SurrealDB) | Included, auto-managed |
+
+## Security Model
+
+### Cedar Authorization
+
+All AI operations are controlled by Cedar policies:
+- User role-based access control
+- Operation-specific permissions
+- Complete audit logging
+
+### Secret Protection
+
+- Secrets never sent to external LLMs
+- PII/sensitive data sanitized before API calls
+- Encryption at rest in local cache
+- HSM support for key storage
+
+### Local Model Support
+
+Air-gapped deployments:
+- On-premise LLM models (Llama 3, Mistral)
+- Zero external API calls
+- Full data privacy compliance
+- Ideal for classified environments
+
+## Configuration
+
+See [Configuration Guide](configuration.md) for:
+- LLM provider setup
+- Cache configuration
+- Cost limits and budgets
+- Security policies
+
+## Related Documentation
+
+- [RAG System](rag-system.md) - Retrieval implementation details
+- [Security Policies](security-policies.md) - Authorization and safety controls
+- [Configuration Guide](configuration.md) - Setup instructions
+- [ADR-015](../architecture/adr/adr-015-ai-integration-architecture.md) - Design decisions
+
+---
+
+**Last Updated**: 2025-01-13
+**Status**: ✅ Production-Ready (core system)
+**Test Coverage**: 22/22 tests passing
diff --git a/docs/src/ai/config-generation.md b/docs/src/ai/config-generation.md
index 0a5b078..f6c08ce 100644
--- a/docs/src/ai/config-generation.md
+++ b/docs/src/ai/config-generation.md
@@ -1 +1,64 @@
-# Configuration Generation (typdialog-prov-gen)\n\n**Status**: 🔴 Planned for Q2 2025\n\n## Overview\n\nThe Configuration Generator (typdialog-prov-gen) will provide template-based Nickel configuration generation with AI-powered customization.\n\n## Planned Features\n\n### Template Selection\n- Library of production-ready infrastructure templates\n- AI recommends templates based on requirements\n- Preview before generation\n\n### Customization via Natural Language\n```\nprovisioning ai config-gen \\n --template "kubernetes-cluster" \\n --customize "Add Prometheus monitoring, increase replicas to 5, use us-east-1"\n```\n\n### Multi-Provider Support\n- AWS, Hetzner, UpCloud, local infrastructure\n- Automatic provider-specific optimizations\n- Cost estimation across providers\n\n### Validation and Testing\n- Type-checking via Nickel before deployment\n- Dry-run execution for safety\n- Test data fixtures for verification\n\n## Architecture\n\n```\nTemplate Library\n ↓\nTemplate Selection (AI + User)\n ↓\nCustomization Layer (NL → Nickel)\n ↓\nValidation (Type + Runtime)\n ↓\nGenerated Configuration\n```\n\n## Integration Points\n\n- typdialog web UI for template browsing\n- CLI for batch generation\n- AI service for customization suggestions\n- Nickel for type-safe validation\n\n## Related Documentation\n\n- [Natural Language Configuration](natural-language-config.md) - NL to config generation\n- [Architecture](architecture.md) - AI system overview\n- [Configuration Guide](configuration.md) - Setup instructions\n\n---\n\n**Status**: 🔴 Planned\n**Expected Release**: Q2 2025\n**Priority**: High (enables non-technical users to generate configs)
+# Configuration Generation (typdialog-prov-gen)
+
+**Status**: 🔴 Planned for Q2 2025
+
+## Overview
+
+The Configuration Generator (typdialog-prov-gen) will provide template-based Nickel configuration generation with AI-powered customization.
+
+## Planned Features
+
+### Template Selection
+- Library of production-ready infrastructure templates
+- AI recommends templates based on requirements
+- Preview before generation
+
+### Customization via Natural Language
+```text
+provisioning ai config-gen \
+  --template "kubernetes-cluster" \
+  --customize "Add Prometheus monitoring, increase replicas to 5, use us-east-1"
+```
+
+### Multi-Provider Support
+- AWS, Hetzner, UpCloud, local infrastructure
+- Automatic provider-specific optimizations
+- Cost estimation across providers
+
+### Validation and Testing
+- Type-checking via Nickel before deployment
+- Dry-run execution for safety
+- Test data fixtures for verification
+
+## Architecture
+
+```text
+Template Library
+ ↓
+Template Selection (AI + User)
+ ↓
+Customization Layer (NL → Nickel)
+ ↓
+Validation (Type + Runtime)
+ ↓
+Generated Configuration
+```
+
+## Integration Points
+
+- typdialog web UI for template browsing
+- CLI for batch generation
+- AI service for customization suggestions
+- Nickel for type-safe validation
+
+## Related Documentation
+
+- [Natural Language Configuration](natural-language-config.md) - NL to config generation
+- [Architecture](architecture.md) - AI system overview
+- [Configuration Guide](configuration.md) - Setup instructions
+
+---
+
+**Status**: 🔴 Planned
+**Expected Release**: Q2 2025
+**Priority**: High (enables non-technical users to generate configs)
\ No newline at end of file
diff --git a/docs/src/ai/configuration.md b/docs/src/ai/configuration.md
index 9dd0c67..6597c27 100644
--- a/docs/src/ai/configuration.md
+++ b/docs/src/ai/configuration.md
@@ -1 +1,601 @@
-# AI System Configuration Guide\n\n**Status**: ✅ Production-Ready (Configuration system)\n\nComplete setup guide for AI features in the provisioning platform. This guide covers LLM provider configuration, feature enablement, cache setup, cost\ncontrols, and security settings.\n\n## Quick Start\n\n### Minimal Configuration\n\n```\n# provisioning/config/ai.toml\n[ai]\nenabled = true\nprovider = "anthropic" # or "openai" or "local"\nmodel = "claude-sonnet-4"\napi_key = "sk-ant-..."
# Set via PROVISIONING_AI_API_KEY env var\n\n[ai.cache]\nenabled = true\n\n[ai.limits]\nmax_tokens = 4096\ntemperature = 0.7\n```\n\n### Initialize Configuration\n\n```\n# Generate default configuration\nprovisioning config init ai\n\n# Edit configuration\nprovisioning config edit ai\n\n# Validate configuration\nprovisioning config validate ai\n\n# Show current configuration\nprovisioning config show ai\n```\n\n## Provider Configuration\n\n### Anthropic Claude\n\n```\n[ai]\nenabled = true\nprovider = "anthropic"\nmodel = "claude-sonnet-4" # or "claude-opus-4", "claude-haiku-4"\napi_key = "${PROVISIONING_AI_API_KEY}"\napi_base = "[https://api.anthropic.com"](https://api.anthropic.com")\n\n# Request parameters\n[ai.request]\nmax_tokens = 4096\ntemperature = 0.7\ntop_p = 0.95\ntop_k = 40\n\n# Supported models\n# - claude-opus-4: Most capable, for complex reasoning ($15/MTok input, $45/MTok output)\n# - claude-sonnet-4: Balanced (recommended), ($3/MTok input, $15/MTok output)\n# - claude-haiku-4: Fast, for simple tasks ($0.80/MTok input, $4/MTok output)\n```\n\n### OpenAI GPT-4\n\n```\n[ai]\nenabled = true\nprovider = "openai"\nmodel = "gpt-4-turbo" # or "gpt-4", "gpt-4o"\napi_key = "${OPENAI_API_KEY}"\napi_base = "[https://api.openai.com/v1"](https://api.openai.com/v1")\n\n[ai.request]\nmax_tokens = 4096\ntemperature = 0.7\ntop_p = 0.95\n\n# Supported models\n# - gpt-4: Most capable ($0.03/1K input, $0.06/1K output)\n# - gpt-4-turbo: Better at code ($0.01/1K input, $0.03/1K output)\n# - gpt-4o: Latest, multi-modal ($5/MTok input, $15/MTok output)\n```\n\n### Local Models\n\n```\n[ai]\nenabled = true\nprovider = "local"\nmodel = "llama2-70b" # or "mistral", "neural-chat"\napi_base = "[http://localhost:8000"](http://localhost:8000") # Local Ollama or LM Studio\n\n# Local model support\n# - Ollama: docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama\n# - LM Studio: GUI app with API\n# - vLLM: High-throughput serving\n# - llama.cpp: CPU inference\n\n[ai.local]\ngpu_enabled = true\ngpu_memory_gb = 24\nmax_batch_size = 4\n```\n\n## Feature Configuration\n\n### Enable Specific Features\n\n```\n[ai.features]\n# Core features (production-ready)\nrag_search = true # Retrieve-Augmented Generation\nconfig_generation = true # Generate Nickel from natural language\nmcp_server = true # Model Context Protocol server\ntroubleshooting = true # AI-assisted debugging\n\n# Form assistance (planned Q2 2025)\nform_assistance = false # AI suggestions in forms\nform_explanations = false # AI explains validation errors\n\n# Agents (planned Q2 2025)\nautonomous_agents = false # AI agents for workflows\nagent_learning = false # Agents learn from deployments\n\n# Advanced features\nfine_tuning = false # Fine-tune models for domain\nknowledge_base = false # Custom knowledge base per workspace\n```\n\n## Cache Configuration\n\n### Cache Strategy\n\n```\n[ai.cache]\nenabled = true\ncache_type = "memory" # or "redis", "disk"\nttl_seconds = 3600 # Cache entry lifetime\n\n# Memory cache (recommended for single server)\n[ai.cache.memory]\nmax_size_mb = 500\neviction_policy = "lru" # Least Recently Used\n\n# Redis cache (recommended for distributed)\n[ai.cache.redis]\nurl = "redis://localhost:6379"\ndb = 0\npassword = "${REDIS_PASSWORD}"\nttl_seconds = 3600\n\n# Disk cache (recommended for persistent caching)\n[ai.cache.disk]\npath = "/var/cache/provisioning/ai"\nmax_size_mb = 5000\n\n# Semantic caching (for RAG)\n[ai.cache.semantic]\nenabled = true\nsimilarity_threshold = 0.95 # Cache hit if query similarity > 
0.95\ncache_embeddings = true # Cache embedding vectors\n```\n\n### Cache Metrics\n\n```\n# Monitor cache performance\nprovisioning admin cache stats ai\n\n# Clear cache\nprovisioning admin cache clear ai\n\n# Analyze cache efficiency\nprovisioning admin cache analyze ai --hours 24\n```\n\n## Rate Limiting and Cost Control\n\n### Rate Limits\n\n```\n[ai.limits]\n# Tokens per request\nmax_tokens = 4096\nmax_input_tokens = 8192\nmax_output_tokens = 4096\n\n# Requests per minute/hour\nrpm_limit = 60 # Requests per minute\nrpm_burst = 100 # Allow bursts up to 100 RPM\n\n# Daily cost limit\ndaily_cost_limit_usd = 100\nwarn_at_percent = 80 # Warn when at 80% of daily limit\nstop_at_percent = 95 # Stop accepting requests at 95%\n\n# Token usage tracking\ntrack_token_usage = true\ntrack_cost_per_request = true\n```\n\n### Cost Budgeting\n\n```\n[ai.budget]\nenabled = true\nmonthly_limit_usd = 1000\n\n# Budget alerts\nalert_at_percent = [50, 75, 90]\nalert_email = "ops@company.com"\nalert_slack = "[https://hooks.slack.com/services/..."](https://hooks.slack.com/services/...")\n\n# Cost by provider\n[ai.budget.providers]\nanthropic_limit = 500\nopenai_limit = 300\nlocal_limit = 0 # Free (run locally)\n```\n\n### Track Costs\n\n```\n# View cost metrics\nprovisioning admin costs show ai --period month\n\n# Forecast cost\nprovisioning admin costs forecast ai --days 30\n\n# Analyze cost by feature\nprovisioning admin costs analyze ai --by feature\n\n# Export cost report\nprovisioning admin costs export ai --format csv --output costs.csv\n```\n\n## Security Configuration\n\n### Authentication\n\n```\n[ai.auth]\n# API key from environment variable\napi_key = "${PROVISIONING_AI_API_KEY}"\n\n# Or from secure store\napi_key_vault = "secrets/ai-api-key"\n\n# Token rotation\nrotate_key_days = 90\nrotation_alert_days = 7\n\n# Request signing (for cloud providers)\nsign_requests = true\nsigning_method = "hmac-sha256"\n```\n\n### Authorization (Cedar)\n\n```\n[ai.authorization]\nenabled = true\npolicy_file = "provisioning/policies/ai-policies.cedar"\n\n# Example policies:\n# allow(principal, action, resource) when principal.role == "admin"\n# allow(principal == ?principal, action == "ai_generate_config", resource)\n# when principal.workspace == resource.workspace\n```\n\n### Data Protection\n\n```\n[ai.security]\n# Sanitize data before sending to external LLM\nsanitize_pii = true\nsanitize_secrets = true\nredact_patterns = [\n "(?i)password\\s*[:=]\\s*[^\\s]+", # Passwords\n "(?i)api[_-]?key\\s*[:=]\\s*[^\\s]+", # API keys\n "(?i)secret\\s*[:=]\\s*[^\\s]+", # Secrets\n]\n\n# Encryption\nencryption_enabled = true\nencryption_algorithm = "aes-256-gcm"\nkey_derivation = "argon2id"\n\n# Local-only mode (never send to external LLM)\nlocal_only = false # Set true for air-gapped deployments\n```\n\n## RAG Configuration\n\n### Vector Store Setup\n\n```\n[ai.rag]\nenabled = true\n\n# SurrealDB backend\n[ai.rag.database]\nurl = "surreal://localhost:8000"\nusername = "root"\npassword = "${SURREALDB_PASSWORD}"\nnamespace = "provisioning"\ndatabase = "ai_rag"\n\n# Embedding model\n[ai.rag.embedding]\nprovider = "openai" # or "anthropic", "local"\nmodel = "text-embedding-3-small"\nbatch_size = 100\ncache_embeddings = true\n\n# Search configuration\n[ai.rag.search]\nhybrid_enabled = true\nvector_weight = 0.7 # Weight for vector search\nkeyword_weight = 0.3 # Weight for BM25 search\ntop_k = 5 # Number of results to return\nrerank_enabled = false # Use cross-encoder to rerank results\n\n# Chunking 
strategy\n[ai.rag.chunking]\nmarkdown_chunk_size = 1024\nmarkdown_overlap = 256\ncode_chunk_size = 512\ncode_overlap = 128\n```\n\n### Index Management\n\n```\n# Create indexes\nprovisioning ai index create rag\n\n# Rebuild indexes\nprovisioning ai index rebuild rag\n\n# Show index status\nprovisioning ai index status rag\n\n# Remove old indexes\nprovisioning ai index cleanup rag --older-than 30days\n```\n\n## MCP Server Configuration\n\n### MCP Server Setup\n\n```\n[ai.mcp]\nenabled = true\nport = 3000\nhost = "127.0.0.1" # Change to 0.0.0.0 for network access\n\n# Tool registry\n[ai.mcp.tools]\ngenerate_config = true\nvalidate_config = true\nsearch_docs = true\ntroubleshoot_deployment = true\nget_schema = true\ncheck_compliance = true\n\n# Rate limiting for tool calls\nrpm_limit = 30\nburst_limit = 50\n\n# Tool request timeout\ntimeout_seconds = 30\n```\n\n### MCP Client Configuration\n\n```\n~/.claude/claude_desktop_config.json:\n{\n "mcpServers": {\n "provisioning": {\n "command": "provisioning-mcp-server",\n "args": ["--config", "/etc/provisioning/ai.toml"],\n "env": {\n "PROVISIONING_API_KEY": "sk-ant-...",\n "RUST_LOG": "info"\n }\n }\n }\n}\n```\n\n## Logging and Observability\n\n### Logging Configuration\n\n```\n[ai.logging]\nlevel = "info" # or "debug", "warn", "error"\nformat = "json" # or "text"\noutput = "stdout" # or "file"\n\n# Log file\n[ai.logging.file]\npath = "/var/log/provisioning/ai.log"\nmax_size_mb = 100\nmax_backups = 10\nretention_days = 30\n\n# Log filters\n[ai.logging.filters]\nlog_requests = true\nlog_responses = false # Don't log full responses (verbose)\nlog_token_usage = true\nlog_costs = true\n```\n\n### Metrics and Monitoring\n\n```\n# View AI service metrics\nprovisioning admin metrics show ai\n\n# Prometheus metrics endpoint\ncurl [http://localhost:8083/metrics](http://localhost:8083/metrics)\n\n# Key metrics:\n# - ai_requests_total: Total requests by provider/model\n# - ai_request_duration_seconds: Request latency\n# - ai_token_usage_total: Token consumption by provider\n# - ai_cost_total: Cumulative cost by provider\n# - ai_cache_hits: Cache hit rate\n# - ai_errors_total: Errors by type\n```\n\n## Health Checks\n\n### Configuration Validation\n\n```\n# Validate configuration syntax\nprovisioning config validate ai\n\n# Test provider connectivity\nprovisioning ai test provider anthropic\n\n# Test RAG system\nprovisioning ai test rag\n\n# Test MCP server\nprovisioning ai test mcp\n\n# Full health check\nprovisioning ai health-check\n```\n\n## Environment Variables\n\n### Common Settings\n\n```\n# Provider configuration\nexport PROVISIONING_AI_PROVIDER="anthropic"\nexport PROVISIONING_AI_MODEL="claude-sonnet-4"\nexport PROVISIONING_AI_API_KEY="sk-ant-..."\n\n# Feature flags\nexport PROVISIONING_AI_ENABLED="true"\nexport PROVISIONING_AI_CACHE_ENABLED="true"\nexport PROVISIONING_AI_RAG_ENABLED="true"\n\n# Cost control\nexport PROVISIONING_AI_DAILY_LIMIT_USD="100"\nexport PROVISIONING_AI_RPM_LIMIT="60"\n\n# Security\nexport PROVISIONING_AI_SANITIZE_PII="true"\nexport PROVISIONING_AI_LOCAL_ONLY="false"\n\n# Logging\nexport RUST_LOG="provisioning::ai=info"\n```\n\n## Troubleshooting Configuration\n\n### Common Issues\n\n**Issue**: API key not recognized\n```\n# Check environment variable is set\necho $PROVISIONING_AI_API_KEY\n\n# Test connectivity\nprovisioning ai test provider anthropic\n\n# Verify key format (should start with sk-ant- or sk-)\n| provisioning config show ai | grep api_key |\n```\n\n**Issue**: Cache not working\n```\n# Check cache 
status\nprovisioning admin cache stats ai\n\n# Clear cache and restart\nprovisioning admin cache clear ai\nprovisioning service restart ai-service\n\n# Enable cache debugging\nRUST_LOG=provisioning::cache=debug provisioning-ai-service\n```\n\n**Issue**: RAG search not finding results\n```\n# Rebuild RAG indexes\nprovisioning ai index rebuild rag\n\n# Test search\nprovisioning ai query "test query"\n\n# Check index status\nprovisioning ai index status rag\n```\n\n## Upgrading Configuration\n\n### Backward Compatibility\n\nNew AI versions automatically migrate old configurations:\n\n```\n# Check configuration version\nprovisioning config version ai\n\n# Migrate configuration to latest version\nprovisioning config migrate ai --auto\n\n# Backup before migration\nprovisioning config backup ai\n```\n\n## Production Deployment\n\n### Recommended Production Settings\n\n```\n[ai]\nenabled = true\nprovider = "anthropic"\nmodel = "claude-sonnet-4"\napi_key = "${PROVISIONING_AI_API_KEY}"\n\n[ai.features]\nrag_search = true\nconfig_generation = true\nmcp_server = true\ntroubleshooting = true\n\n[ai.cache]\nenabled = true\ncache_type = "redis"\nttl_seconds = 3600\n\n[ai.limits]\nrpm_limit = 60\ndaily_cost_limit_usd = 1000\nmax_tokens = 4096\n\n[ai.security]\nsanitize_pii = true\nsanitize_secrets = true\nencryption_enabled = true\n\n[ai.logging]\nlevel = "warn" # Less verbose in production\nformat = "json"\noutput = "file"\n\n[ai.rag.database]\nurl = "surreal://surrealdb-cluster:8000"\n```\n\n## Related Documentation\n\n- [Architecture](architecture.md) - System overview\n- [RAG System](rag-system.md) - Vector database setup\n- [MCP Integration](mcp-integration.md) - MCP configuration\n- [Security Policies](security-policies.md) - Authorization policies\n- [Cost Management](cost-management.md) - Budget tracking\n\n---\n\n**Last Updated**: 2025-01-13\n**Status**: ✅ Production-Ready\n**Versions Supported**: v1.0+ +# AI System Configuration Guide + +**Status**: ✅ Production-Ready (Configuration system) + +Complete setup guide for AI features in the provisioning platform. This guide covers LLM provider configuration, feature enablement, cache setup, cost +controls, and security settings. + +## Quick Start + +### Minimal Configuration + +```text +# provisioning/config/ai.toml +[ai] +enabled = true +provider = "anthropic" # or "openai" or "local" +model = "claude-sonnet-4" +api_key = "sk-ant-..." 
# Set via PROVISIONING_AI_API_KEY env var + +[ai.cache] +enabled = true + +[ai.limits] +max_tokens = 4096 +temperature = 0.7 +``` + +### Initialize Configuration + +```text +# Generate default configuration +provisioning config init ai + +# Edit configuration +provisioning config edit ai + +# Validate configuration +provisioning config validate ai + +# Show current configuration +provisioning config show ai +``` + +## Provider Configuration + +### Anthropic Claude + +```text +[ai] +enabled = true +provider = "anthropic" +model = "claude-sonnet-4" # or "claude-opus-4", "claude-haiku-4" +api_key = "${PROVISIONING_AI_API_KEY}" +api_base = "https://api.anthropic.com" + +# Request parameters +[ai.request] +max_tokens = 4096 +temperature = 0.7 +top_p = 0.95 +top_k = 40 + +# Supported models +# - claude-opus-4: Most capable, for complex reasoning ($15/MTok input, $45/MTok output) +# - claude-sonnet-4: Balanced (recommended; $3/MTok input, $15/MTok output) +# - claude-haiku-4: Fast, for simple tasks ($0.80/MTok input, $4/MTok output) +``` + +### OpenAI GPT-4 + +```text +[ai] +enabled = true +provider = "openai" +model = "gpt-4-turbo" # or "gpt-4", "gpt-4o" +api_key = "${OPENAI_API_KEY}" +api_base = "https://api.openai.com/v1" + +[ai.request] +max_tokens = 4096 +temperature = 0.7 +top_p = 0.95 + +# Supported models +# - gpt-4: Most capable ($0.03/1K input, $0.06/1K output) +# - gpt-4-turbo: Better at code ($0.01/1K input, $0.03/1K output) +# - gpt-4o: Latest, multi-modal ($5/MTok input, $15/MTok output) +``` + +### Local Models + +```text +[ai] +enabled = true +provider = "local" +model = "llama2-70b" # or "mistral", "neural-chat" +api_base = "http://localhost:8000" # Local Ollama or LM Studio + +# Local model support +# - Ollama: docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama +# - LM Studio: GUI app with API +# - vLLM: High-throughput serving +# - llama.cpp: CPU inference + +[ai.local] +gpu_enabled = true +gpu_memory_gb = 24 +max_batch_size = 4 +``` + +## Feature Configuration + +### Enable Specific Features + +```text +[ai.features] +# Core features (production-ready) +rag_search = true # Retrieval-Augmented Generation +config_generation = true # Generate Nickel from natural language +mcp_server = true # Model Context Protocol server +troubleshooting = true # AI-assisted debugging + +# Form assistance (planned Q2 2025) +form_assistance = false # AI suggestions in forms +form_explanations = false # AI explains validation errors + +# Agents (planned Q2 2025) +autonomous_agents = false # AI agents for workflows +agent_learning = false # Agents learn from deployments + +# Advanced features +fine_tuning = false # Fine-tune models for domain +knowledge_base = false # Custom knowledge base per workspace +``` + +## Cache Configuration + +### Cache Strategy + +```text +[ai.cache] +enabled = true +cache_type = "memory" # or "redis", "disk" +ttl_seconds = 3600 # Cache entry lifetime + +# Memory cache (recommended for single server) +[ai.cache.memory] +max_size_mb = 500 +eviction_policy = "lru" # Least Recently Used + +# Redis cache (recommended for distributed) +[ai.cache.redis] +url = "redis://localhost:6379" +db = 0 +password = "${REDIS_PASSWORD}" +ttl_seconds = 3600 + +# Disk cache (recommended for persistent caching) +[ai.cache.disk] +path = "/var/cache/provisioning/ai" +max_size_mb = 5000 + +# Semantic caching (for RAG) +[ai.cache.semantic] +enabled = true +similarity_threshold = 0.95 # Cache hit 
if query similarity > 0.95 +cache_embeddings = true # Cache embedding vectors +``` + +### Cache Metrics + +```text +# Monitor cache performance +provisioning admin cache stats ai + +# Clear cache +provisioning admin cache clear ai + +# Analyze cache efficiency +provisioning admin cache analyze ai --hours 24 +``` + +## Rate Limiting and Cost Control + +### Rate Limits + +```text +[ai.limits] +# Tokens per request +max_tokens = 4096 +max_input_tokens = 8192 +max_output_tokens = 4096 + +# Requests per minute +rpm_limit = 60 # Requests per minute +rpm_burst = 100 # Allow bursts up to 100 RPM + +# Daily cost limit +daily_cost_limit_usd = 100 +warn_at_percent = 80 # Warn when at 80% of daily limit +stop_at_percent = 95 # Stop accepting requests at 95% + +# Token usage tracking +track_token_usage = true +track_cost_per_request = true +``` + +### Cost Budgeting + +```text +[ai.budget] +enabled = true +monthly_limit_usd = 1000 + +# Budget alerts +alert_at_percent = [50, 75, 90] +alert_email = "ops@company.com" +alert_slack = "https://hooks.slack.com/services/..." + +# Cost by provider +[ai.budget.providers] +anthropic_limit = 500 +openai_limit = 300 +local_limit = 0 # Free (run locally) +``` + +### Track Costs + +```text +# View cost metrics +provisioning admin costs show ai --period month + +# Forecast cost +provisioning admin costs forecast ai --days 30 + +# Analyze cost by feature +provisioning admin costs analyze ai --by feature + +# Export cost report +provisioning admin costs export ai --format csv --output costs.csv +``` + +## Security Configuration + +### Authentication + +```text +[ai.auth] +# API key from environment variable +api_key = "${PROVISIONING_AI_API_KEY}" + +# Or from secure store +api_key_vault = "secrets/ai-api-key" + +# Token rotation +rotate_key_days = 90 +rotation_alert_days = 7 + +# Request signing (for cloud providers) +sign_requests = true +signing_method = "hmac-sha256" +``` + +### Authorization (Cedar) + +```text +[ai.authorization] +enabled = true +policy_file = "provisioning/policies/ai-policies.cedar" + +# Example policies: +# allow(principal, action, resource) when principal.role == "admin" +# allow(principal == ?principal, action == "ai_generate_config", resource) +# when principal.workspace == resource.workspace +``` + +### Data Protection + +```text +[ai.security] +# Sanitize data before sending to external LLM +sanitize_pii = true +sanitize_secrets = true +redact_patterns = [ + "(?i)password\\s*[:=]\\s*[^\\s]+", # Passwords + "(?i)api[_-]?key\\s*[:=]\\s*[^\\s]+", # API keys + "(?i)secret\\s*[:=]\\s*[^\\s]+", # Secrets +] + +# Encryption +encryption_enabled = true +encryption_algorithm = "aes-256-gcm" +key_derivation = "argon2id" + +# Local-only mode (never send to external LLM) +local_only = false # Set true for air-gapped deployments +``` + +## RAG Configuration + +### Vector Store Setup + +```text +[ai.rag] +enabled = true + +# SurrealDB backend +[ai.rag.database] +url = "surreal://localhost:8000" +username = "root" +password = "${SURREALDB_PASSWORD}" +namespace = "provisioning" +database = "ai_rag" + +# Embedding model +[ai.rag.embedding] +provider = "openai" # or "anthropic", "local" +model = "text-embedding-3-small" +batch_size = 100 +cache_embeddings = true + +# Search configuration +[ai.rag.search] +hybrid_enabled = true +vector_weight = 0.7 # Weight for vector search +keyword_weight = 0.3 # Weight for BM25 search +top_k = 5 # Number of results to return +rerank_enabled = false # Use cross-encoder to rerank 
results + +# Chunking strategy +[ai.rag.chunking] +markdown_chunk_size = 1024 +markdown_overlap = 256 +code_chunk_size = 512 +code_overlap = 128 +``` + +### Index Management + +```text +# Create indexes +provisioning ai index create rag + +# Rebuild indexes +provisioning ai index rebuild rag + +# Show index status +provisioning ai index status rag + +# Remove old indexes +provisioning ai index cleanup rag --older-than 30days +``` + +## MCP Server Configuration + +### MCP Server Setup + +```text +[ai.mcp] +enabled = true +port = 3000 +host = "127.0.0.1" # Change to 0.0.0.0 for network access + +# Tool registry +[ai.mcp.tools] +generate_config = true +validate_config = true +search_docs = true +troubleshoot_deployment = true +get_schema = true +check_compliance = true + +# Rate limiting for tool calls +rpm_limit = 30 +burst_limit = 50 + +# Tool request timeout +timeout_seconds = 30 +``` + +### MCP Client Configuration + +```text +~/.claude/claude_desktop_config.json: +{ + "mcpServers": { + "provisioning": { + "command": "provisioning-mcp-server", + "args": ["--config", "/etc/provisioning/ai.toml"], + "env": { + "PROVISIONING_API_KEY": "sk-ant-...", + "RUST_LOG": "info" + } + } + } +} +``` + +## Logging and Observability + +### Logging Configuration + +```text +[ai.logging] +level = "info" # or "debug", "warn", "error" +format = "json" # or "text" +output = "stdout" # or "file" + +# Log file +[ai.logging.file] +path = "/var/log/provisioning/ai.log" +max_size_mb = 100 +max_backups = 10 +retention_days = 30 + +# Log filters +[ai.logging.filters] +log_requests = true +log_responses = false # Don't log full responses (verbose) +log_token_usage = true +log_costs = true +``` + +### Metrics and Monitoring + +```text +# View AI service metrics +provisioning admin metrics show ai + +# Prometheus metrics endpoint +curl http://localhost:8083/metrics + +# Key metrics: +# - ai_requests_total: Total requests by provider/model +# - ai_request_duration_seconds: Request latency +# - ai_token_usage_total: Token consumption by provider +# - ai_cost_total: Cumulative cost by provider +# - ai_cache_hits: Cache hit rate +# - ai_errors_total: Errors by type +``` + +## Health Checks + +### Configuration Validation + +```text +# Validate configuration syntax +provisioning config validate ai + +# Test provider connectivity +provisioning ai test provider anthropic + +# Test RAG system +provisioning ai test rag + +# Test MCP server +provisioning ai test mcp + +# Full health check +provisioning ai health-check +``` + +## Environment Variables + +### Common Settings + +```text +# Provider configuration +export PROVISIONING_AI_PROVIDER="anthropic" +export PROVISIONING_AI_MODEL="claude-sonnet-4" +export PROVISIONING_AI_API_KEY="sk-ant-..." 
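+
+# Note: it is assumed (standard env-over-file precedence) that these exports
+# override the matching keys in ai.toml; verify the effective values with:
+#   provisioning config show ai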
+ +# Feature flags +export PROVISIONING_AI_ENABLED="true" +export PROVISIONING_AI_CACHE_ENABLED="true" +export PROVISIONING_AI_RAG_ENABLED="true" + +# Cost control +export PROVISIONING_AI_DAILY_LIMIT_USD="100" +export PROVISIONING_AI_RPM_LIMIT="60" + +# Security +export PROVISIONING_AI_SANITIZE_PII="true" +export PROVISIONING_AI_LOCAL_ONLY="false" + +# Logging +export RUST_LOG="provisioning::ai=info" +``` + +## Troubleshooting Configuration + +### Common Issues + +**Issue**: API key not recognized +```text +# Check environment variable is set +echo $PROVISIONING_AI_API_KEY + +# Test connectivity +provisioning ai test provider anthropic + +# Verify key format (should start with sk-ant- or sk-) +provisioning config show ai | grep api_key +``` + +**Issue**: Cache not working +```text +# Check cache status +provisioning admin cache stats ai + +# Clear cache and restart +provisioning admin cache clear ai +provisioning service restart ai-service + +# Enable cache debugging +RUST_LOG=provisioning::cache=debug provisioning-ai-service +``` + +**Issue**: RAG search not finding results +```text +# Rebuild RAG indexes +provisioning ai index rebuild rag + +# Test search +provisioning ai query "test query" + +# Check index status +provisioning ai index status rag +``` + +## Upgrading Configuration + +### Backward Compatibility + +New AI versions automatically migrate old configurations: + +```text +# Check configuration version +provisioning config version ai + +# Migrate configuration to latest version +provisioning config migrate ai --auto + +# Backup before migration +provisioning config backup ai +``` + +## Production Deployment + +### Recommended Production Settings + +```text +[ai] +enabled = true +provider = "anthropic" +model = "claude-sonnet-4" +api_key = "${PROVISIONING_AI_API_KEY}" + +[ai.features] +rag_search = true +config_generation = true +mcp_server = true +troubleshooting = true + +[ai.cache] +enabled = true +cache_type = "redis" +ttl_seconds = 3600 + +[ai.limits] +rpm_limit = 60 +daily_cost_limit_usd = 1000 +max_tokens = 4096 + +[ai.security] +sanitize_pii = true +sanitize_secrets = true +encryption_enabled = true + +[ai.logging] +level = "warn" # Less verbose in production +format = "json" +output = "file" + +[ai.rag.database] +url = "surreal://surrealdb-cluster:8000" +``` + +## Related Documentation + +- [Architecture](architecture.md) - System overview +- [RAG System](rag-system.md) - Vector database setup +- [MCP Integration](mcp-integration.md) - MCP configuration +- [Security Policies](security-policies.md) - Authorization policies +- [Cost Management](cost-management.md) - Budget tracking + +--- + +**Last Updated**: 2025-01-13 +**Status**: ✅ Production-Ready +**Versions Supported**: v1.0+ \ No newline at end of file diff --git a/docs/src/ai/cost-management.md b/docs/src/ai/cost-management.md index 8a1a36a..908854a 100644 --- a/docs/src/ai/cost-management.md +++ b/docs/src/ai/cost-management.md @@ -1 +1,497 @@ -# AI Cost Management and Optimization\n\n**Status**: ✅ Production-Ready (cost tracking, budgets, caching benefits)\n\nComprehensive guide to managing LLM API costs, optimizing usage through caching and rate limiting, and tracking spending. 
The provisioning platform\nincludes built-in cost controls to prevent runaway spending while maximizing value.\n\n## Cost Overview\n\n### API Provider Pricing\n\n| | Provider | Model | Input | Output | Per MTok | |\n| | ---------- | ------- | ------- | -------- | ---------- | |\n| | **Anthropic** | Claude Sonnet 4 | $3 | $15 | $0.003 input / $0.015 output | |\n| | | Claude Opus 4 | $15 | $45 | Higher accuracy, longer context | |\n| | | Claude Haiku 4 | $0.80 | $4 | Fast, for simple queries | |\n| | **OpenAI** | GPT-4 Turbo | $0.01 | $0.03 | Per 1K tokens | |\n| | | GPT-4 | $0.03 | $0.06 | Legacy, avoid | |\n| | | GPT-4o | $5 | $15 | Per MTok | |\n| | **Local** | Llama 2, Mistral | Free | Free | Hardware cost only | |\n\n### Cost Examples\n\n```\nScenario 1: Generate simple database configuration\n - Input: 500 tokens (description + schema)\n - Output: 200 tokens (generated config)\n - Cost: (500 × $3 + 200 × $15) / 1,000,000 = $0.0045\n - With caching (hit rate 50%): $0.0023\n\nScenario 2: Deep troubleshooting analysis\n - Input: 5000 tokens (logs + context)\n - Output: 2000 tokens (analysis + recommendations)\n - Cost: (5000 × $3 + 2000 × $15) / 1,000,000 = $0.045\n - With caching (hit rate 70%): $0.0135\n\nScenario 3: Monthly usage (typical organization)\n - ~1000 config generations @ $0.005 = $5\n - ~500 troubleshooting calls @ $0.045 = $22.50\n - ~2000 form assists @ $0.002 = $4\n - ~200 agent executions @ $0.10 = $20\n - **Total: ~$50-100/month for small org**\n - **Total: ~$500-1000/month for large org**\n```\n\n## Cost Control Mechanisms\n\n### Request Caching\n\nCaching is the primary cost reduction strategy, cutting costs by 50-80%:\n\n```\nWithout Caching:\n User 1: "Generate PostgreSQL config" → API call → $0.005\n User 2: "Generate PostgreSQL config" → API call → $0.005\n Total: $0.010 (2 identical requests)\n\nWith LRU Cache:\n User 1: "Generate PostgreSQL config" → API call → $0.005\n User 2: "Generate PostgreSQL config" → Cache hit → $0.00001\n Total: $0.00501 (500x cost reduction for identical)\n\nWith Semantic Cache:\n User 1: "Generate PostgreSQL database config" → API call → $0.005\n User 2: "Create a PostgreSQL database" → Semantic hit → $0.00001\n (Slightly different wording, but same intent)\n Total: $0.00501 (near 500x reduction for similar)\n```\n\n### Cache Configuration\n\n```\n[ai.cache]\nenabled = true\ncache_type = "redis" # Distributed cache across instances\nttl_seconds = 3600 # 1-hour cache lifetime\n\n# Cache size limits\nmax_size_mb = 500\neviction_policy = "lru" # Least Recently Used\n\n# Semantic caching - cache similar queries\n[ai.cache.semantic]\nenabled = true\nsimilarity_threshold = 0.95 # Cache if 95%+ similar to previous query\ncache_embeddings = true # Cache embedding vectors themselves\n\n# Cache metrics\n[ai.cache.metrics]\ntrack_hit_rate = true\ntrack_space_usage = true\nalert_on_low_hit_rate = true\n```\n\n### Rate Limiting\n\nPrevent usage spikes from unexpected costs:\n\n```\n[ai.limits]\n# Per-request limits\nmax_tokens = 4096\nmax_input_tokens = 8192\nmax_output_tokens = 4096\n\n# Throughput limits\nrpm_limit = 60 # 60 requests per minute\nrpm_burst = 100 # Allow burst to 100\ndaily_request_limit = 5000 # Max 5000 requests/day\n\n# Cost limits\ndaily_cost_limit_usd = 100 # Stop at $100/day\nmonthly_cost_limit_usd = 2000 # Stop at $2000/month\n\n# Budget alerts\nwarn_at_percent = 80 # Warn when at 80% of daily budget\nstop_at_percent = 95 # Stop when at 95% of budget\n```\n\n### Workspace-Level Budgets\n\n```\n[ai.workspace_budgets]\n# 
Per-workspace cost limits\ndev.daily_limit_usd = 10\nstaging.daily_limit_usd = 50\nprod.daily_limit_usd = 100\n\n# Can override globally for specific workspaces\nteams.team-a.monthly_limit = 500\nteams.team-b.monthly_limit = 300\n```\n\n## Cost Tracking\n\n### Track Spending\n\n```\n# View current month spending\nprovisioning admin costs show ai\n\n# Forecast monthly spend\nprovisioning admin costs forecast ai --days-remaining 15\n\n# Analyze by feature\nprovisioning admin costs analyze ai --by feature\n\n# Analyze by user\nprovisioning admin costs analyze ai --by user\n\n# Export for billing\nprovisioning admin costs export ai --format csv --output costs.csv\n```\n\n### Cost Breakdown\n\n```\nMonth: January 2025\n\nTotal Spending: $285.42\n\nBy Feature:\n Config Generation: $150.00 (52%) [300 requests × avg $0.50]\n Troubleshooting: $95.00 (33%) [80 requests × avg $1.19]\n Form Assistance: $30.00 (11%) [5000 requests × avg $0.006]\n Agents: $10.42 (4%) [20 runs × avg $0.52]\n\nBy Provider:\n Anthropic (Claude): $200.00 (70%)\n OpenAI (GPT-4): $85.42 (30%)\n Local: $0 (0%)\n\nBy User:\n alice@company.com: $50.00 (18%)\n bob@company.com: $45.00 (16%)\n ...\n other (20 users): $190.42 (67%)\n\nBy Workspace:\n production: $150.00 (53%)\n staging: $85.00 (30%)\n development: $50.42 (18%)\n\nCache Performance:\n Requests: 50,000\n Cache hits: 35,000 (70%)\n Cache misses: 15,000 (30%)\n Cost savings from cache: ~$175 (38% reduction)\n```\n\n## Optimization Strategies\n\n### Strategy 1: Increase Cache Hit Rate\n\n```\n# Longer TTL = more cache hits\n[ai.cache]\nttl_seconds = 7200 # 2 hours instead of 1 hour\n\n# Semantic caching helps with slight variations\n[ai.cache.semantic]\nenabled = true\nsimilarity_threshold = 0.90 # Lower threshold = more hits\n\n# Result: Increase hit rate from 65% → 80%\n# Cost reduction: 15% → 23%\n```\n\n### Strategy 2: Use Local Models\n\n```\n[ai]\nprovider = "local"\nmodel = "mistral-7b" # Free, runs on GPU\n\n# Cost: Hardware ($5-20/month) instead of API calls\n# Savings: 50-100 config generations/month × $0.005 = $0.25-0.50\n# Hardware amortized cost: <$0.50/month on existing GPU\n\n# Tradeoff: Slightly lower quality, 2x slower\n```\n\n### Strategy 3: Use Haiku for Simple Tasks\n\n```\nTask Complexity vs Model:\n\nSimple (form assist): Claude Haiku 4 ($0.80/$4)\nMedium (config gen): Claude Sonnet 4 ($3/$15)\nComplex (agents): Claude Opus 4 ($15/$45)\n\nExample optimization:\n Before: All tasks use Sonnet 4\n - 5000 form assists/month: 5000 × $0.006 = $30\n \n After: Route by complexity\n - 5000 form assists → Haiku: 5000 × $0.001 = $5 (83% savings)\n - 200 config gen → Sonnet: 200 × $0.005 = $1\n - 10 agent runs → Opus: 10 × $0.10 = $1\n```\n\n### Strategy 4: Batch Operations\n\n```\n# Instead of individual requests, batch similar operations:\n\n# Before: 100 configs, 100 separate API calls\nprovisioning ai generate "PostgreSQL config" --output db1.ncl\nprovisioning ai generate "PostgreSQL config" --output db2.ncl\n# ... 100 calls = $0.50\n\n# After: Batch similar requests\nprovisioning ai batch --input configs-list.yaml\n# Groups similar requests, reuses cache\n# ... 
3-5 API calls = $0.02 (90% savings)\n```\n\n### Strategy 5: Smart Feature Enablement\n\n```\n[ai.features]\n# Enable high-ROI features\nconfig_generation = true # High value, moderate cost\ntroubleshooting = true # High value, higher cost\nrag_search = true # Low cost, high value\n\n# Disable low-ROI features if cost-constrained\nform_assistance = false # Low value, non-zero cost (if budget tight)\nagents = false # Complex, requires multiple calls\n```\n\n## Budget Management Workflow\n\n### 1. Set Budget\n\n```\n# Set monthly budget\nprovisioning config set ai.budget.monthly_limit_usd 500\n\n# Set daily limit\nprovisioning config set ai.limits.daily_cost_limit_usd 50\n\n# Set workspace limits\nprovisioning config set ai.workspace_budgets.prod.monthly_limit 300\nprovisioning config set ai.workspace_budgets.dev.monthly_limit 100\n```\n\n### 2. Monitor Spending\n\n```\n# Daily check\nprovisioning admin costs show ai\n\n# Weekly analysis\nprovisioning admin costs analyze ai --period week\n\n# Monthly review\nprovisioning admin costs analyze ai --period month\n```\n\n### 3. Adjust If Needed\n\n```\n# If overspending:\n# - Increase cache TTL\n# - Enable local models for simple tasks\n# - Reduce form assistance (high volume, low cost but adds up)\n# - Route complex tasks to Haiku instead of Opus\n\n# If underspending:\n# - Enable new features (agents, form assistance)\n# - Increase rate limits\n# - Lower cache hit requirements (broader semantic matching)\n```\n\n### 4. Forecast and Plan\n\n```\n# Current monthly run rate\nprovisioning admin costs forecast ai\n\n# If trending over budget, recommend actions:\n# - Reduce daily limit\n# - Switch to local model for 50% of tasks\n# - Increase batch processing\n\n# If trending under budget:\n# - Enable agents for automation workflows\n# - Enable form assistance across all workspaces\n```\n\n## Cost Allocation\n\n### Chargeback Models\n\n**Per-Workspace Model**:\n```\nDevelopment workspace: $50/month\nStaging workspace: $100/month\nProduction workspace: $300/month\n------\nTotal: $450/month\n```\n\n**Per-User Model**:\n```\nEach user charged based on their usage\nEncourages efficiency\nDifficult to track/allocate\n```\n\n**Shared Pool Model**:\n```\nAll teams share $1000/month budget\nBudget splits by consumption rate\nEncourages optimization\nMost flexible\n```\n\n## Cost Reporting\n\n### Generate Reports\n\n```\n# Monthly cost report\nprovisioning admin costs report ai \\n --format pdf \\n --period month \\n --output cost-report-2025-01.pdf\n\n# Detailed analysis for finance\nprovisioning admin costs report ai \\n --format xlsx \\n --include-forecasts \\n --include-optimization-suggestions\n\n# Executive summary\nprovisioning admin costs report ai \\n --format markdown \\n --summary-only\n```\n\n## Cost-Benefit Analysis\n\n### ROI Examples\n\n```\nScenario 1: Developer Time Savings\n Problem: Manual config creation takes 2 hours\n Solution: AI config generation, 10 minutes (12x faster)\n Time saved: 1.83 hours/config\n Hourly rate: $100\n Value: $183/config\n \n AI cost: $0.005/config\n ROI: 36,600x (far exceeds cost)\n\nScenario 2: Troubleshooting Efficiency\n Problem: Manual debugging takes 4 hours\n Solution: AI troubleshooting analysis, 2 minutes\n Time saved: 3.97 hours\n Value: $397/incident\n \n AI cost: $0.045/incident\n ROI: 8,822x\n\nScenario 3: Reduction in Failed Deployments\n Before: 5% of 1000 deployments fail (50 failures)\n Failure cost: $500 each (lost time, data cleanup)\n Total: $25,000/month\n \n After: With AI analysis, 2% fail 
(20 failures)\n Total: $10,000/month\n Savings: $15,000/month\n \n AI cost: $200/month\n Net savings: $14,800/month\n ROI: 74:1\n```\n\n## Advanced Cost Optimization\n\n### Hybrid Strategy (Recommended)\n\n```\n✓ Local models for:\n - Form assistance (high volume, low complexity)\n - Simple validation checks\n - Document retrieval (RAG)\n Cost: Hardware only (~$500 setup)\n\n✓ Cloud API for:\n - Complex generation (requires latest model capability)\n - Troubleshooting (needs high accuracy)\n - Agents (complex reasoning)\n Cost: $50-200/month per organization\n\nResult:\n - 70% of requests → Local (free after hardware amortization)\n - 30% of requests → Cloud ($50/month)\n - 80% overall cost reduction vs cloud-only\n```\n\n## Monitoring and Alerts\n\n### Cost Anomaly Detection\n\n```\n# Enable anomaly detection\nprovisioning config set ai.monitoring.anomaly_detection true\n\n# Set thresholds\nprovisioning config set ai.monitoring.cost_spike_percent 150\n# Alert if daily cost is 150% of average\n\n# System alerts:\n# - Daily cost exceeded by 10x normal\n# - New expensive operation (agent run)\n# - Cache hit rate dropped below 40%\n# - Rate limit nearly exhausted\n```\n\n### Alert Configuration\n\n```\n[ai.monitoring.alerts]\nenabled = true\nspike_threshold_percent = 150\ncheck_interval_minutes = 5\n\n[ai.monitoring.alerts.channels]\nemail = "ops@company.com"\nslack = "[https://hooks.slack.com/..."](https://hooks.slack.com/...")\npagerduty = "integration-key"\n\n# Alert thresholds\n[ai.monitoring.alerts.thresholds]\ndaily_budget_warning_percent = 80\ndaily_budget_critical_percent = 95\nmonthly_budget_warning_percent = 70\n```\n\n## Related Documentation\n\n- [Architecture](architecture.md) - AI system overview\n- [Configuration](configuration.md) - Cost control settings\n- [Security Policies](security-policies.md) - Cost-aware policies\n- [RAG System](rag-system.md) - Caching details\n- [ADR-015](../architecture/adr/adr-015-ai-integration-architecture.md) - Design decisions\n\n---\n\n**Last Updated**: 2025-01-13\n**Status**: ✅ Production-Ready\n**Average Savings**: 50-80% through caching\n**Typical Cost**: $50-500/month per organization\n**ROI**: 100:1 to 10,000:1 depending on use case +# AI Cost Management and Optimization + +**Status**: ✅ Production-Ready (cost tracking, budgets, caching benefits) + +Comprehensive guide to managing LLM API costs, optimizing usage through caching and rate limiting, and tracking spending. The provisioning platform +includes built-in cost controls to prevent runaway spending while maximizing value. 
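+
+As a quick sketch of the arithmetic used in the Cost Examples below (a minimal illustration assuming the Claude Sonnet 4 list prices from the pricing table; `request_cost_usd` is a hypothetical helper, not a platform API):
+
+```text
+# Estimate a single request's cost in USD from token counts and $/MTok prices
+def request_cost_usd(input_tokens: int, output_tokens: int,
+                     in_per_mtok: float = 3.0, out_per_mtok: float = 15.0) -> float:
+    return (input_tokens * in_per_mtok + output_tokens * out_per_mtok) / 1_000_000
+
+# Scenario 1 below: 500 input tokens + 200 output tokens
+print(request_cost_usd(500, 200))  # 0.0045 -> $0.0045 per request
+```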
+ +## Cost Overview + +### API Provider Pricing + +| Provider | Model | Input | Output | Notes | +| ---------- | ------- | ------- | -------- | ---------- | +| **Anthropic** | Claude Sonnet 4 | $3 | $15 | $0.003 input / $0.015 output | +| | Claude Opus 4 | $15 | $45 | Higher accuracy, longer context | +| | Claude Haiku 4 | $0.80 | $4 | Fast, for simple queries | +| **OpenAI** | GPT-4 Turbo | $0.01 | $0.03 | Per 1K tokens | +| | GPT-4 | $0.03 | $0.06 | Legacy, avoid | +| | GPT-4o | $5 | $15 | Per MTok | +| **Local** | Llama 2, Mistral | Free | Free | Hardware cost only | + +### Cost Examples + +```text +Scenario 1: Generate simple database configuration + - Input: 500 tokens (description + schema) + - Output: 200 tokens (generated config) + - Cost: (500 × $3 + 200 × $15) / 1,000,000 = $0.0045 + - With caching (hit rate 50%): $0.0023 + +Scenario 2: Deep troubleshooting analysis + - Input: 5000 tokens (logs + context) + - Output: 2000 tokens (analysis + recommendations) + - Cost: (5000 × $3 + 2000 × $15) / 1,000,000 = $0.045 + - With caching (hit rate 70%): $0.0135 + +Scenario 3: Monthly usage (typical organization) + - ~1000 config generations @ $0.005 = $5 + - ~500 troubleshooting calls @ $0.045 = $22.50 + - ~2000 form assists @ $0.002 = $4 + - ~200 agent executions @ $0.10 = $20 + - **Total: ~$50-100/month for small org** + - **Total: ~$500-1000/month for large org** +``` + +## Cost Control Mechanisms + +### Request Caching + +Caching is the primary cost reduction strategy, cutting costs by 50-80%: + +```text +Without Caching: + User 1: "Generate PostgreSQL config" → API call → $0.005 + User 2: "Generate PostgreSQL config" → API call → $0.005 + Total: $0.010 (2 identical requests) + +With LRU Cache: + User 1: "Generate PostgreSQL config" → API call → $0.005 + User 2: "Generate PostgreSQL config" → Cache hit → $0.00001 + Total: $0.00501 (repeat request ~500x cheaper) + +With Semantic Cache: + User 1: "Generate PostgreSQL database config" → API call → $0.005 + User 2: "Create a PostgreSQL database" → Semantic hit → $0.00001 + (Slightly different wording, but same intent) + Total: $0.00501 (near 500x reduction for similar) +``` + +### Cache Configuration + +```text +[ai.cache] +enabled = true +cache_type = "redis" # Distributed cache across instances +ttl_seconds = 3600 # 1-hour cache lifetime + +# Cache size limits +max_size_mb = 500 +eviction_policy = "lru" # Least Recently Used + +# Semantic caching - cache similar queries +[ai.cache.semantic] +enabled = true +similarity_threshold = 0.95 # Cache if 95%+ similar to previous query +cache_embeddings = true # Cache embedding vectors themselves + +# Cache metrics +[ai.cache.metrics] +track_hit_rate = true +track_space_usage = true +alert_on_low_hit_rate = true +``` + +### Rate Limiting + +Prevent unexpected costs from usage spikes: + +```text +[ai.limits] +# Per-request limits +max_tokens = 4096 +max_input_tokens = 8192 +max_output_tokens = 4096 + +# Throughput limits +rpm_limit = 60 # 60 requests per minute +rpm_burst = 100 # Allow burst to 100 +daily_request_limit = 5000 # Max 5000 requests/day + +# Cost limits +daily_cost_limit_usd = 100 # Stop at $100/day +monthly_cost_limit_usd = 2000 # Stop at $2000/month + +# Budget alerts +warn_at_percent = 80 # Warn when at 80% of daily budget +stop_at_percent = 95 # Stop when at 95% of budget +``` + +### Workspace-Level Budgets + +```text +[ai.workspace_budgets] +# Per-workspace cost limits +dev.daily_limit_usd = 10 +staging.daily_limit_usd = 50 +prod.daily_limit_usd = 100 + 
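+# (For reference: the dotted keys above are TOML shorthand for nested tables,
+# for example [ai.workspace_budgets.dev] with daily_limit_usd = 10.)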
+# Can override globally for specific workspaces +teams.team-a.monthly_limit = 500 +teams.team-b.monthly_limit = 300 +``` + +## Cost Tracking + +### Track Spending + +```text +# View current month spending +provisioning admin costs show ai + +# Forecast monthly spend +provisioning admin costs forecast ai --days-remaining 15 + +# Analyze by feature +provisioning admin costs analyze ai --by feature + +# Analyze by user +provisioning admin costs analyze ai --by user + +# Export for billing +provisioning admin costs export ai --format csv --output costs.csv +``` + +### Cost Breakdown + +```text +Month: January 2025 + +Total Spending: $285.42 + +By Feature: + Config Generation: $150.00 (52%) [300 requests × avg $0.50] + Troubleshooting: $95.00 (33%) [80 requests × avg $1.19] + Form Assistance: $30.00 (11%) [5000 requests × avg $0.006] + Agents: $10.42 (4%) [20 runs × avg $0.52] + +By Provider: + Anthropic (Claude): $200.00 (70%) + OpenAI (GPT-4): $85.42 (30%) + Local: $0 (0%) + +By User: + alice@company.com: $50.00 (18%) + bob@company.com: $45.00 (16%) + ... + other (20 users): $190.42 (67%) + +By Workspace: + production: $150.00 (53%) + staging: $85.00 (30%) + development: $50.42 (18%) + +Cache Performance: + Requests: 50,000 + Cache hits: 35,000 (70%) + Cache misses: 15,000 (30%) + Cost savings from cache: ~$175 (38% reduction) +``` + +## Optimization Strategies + +### Strategy 1: Increase Cache Hit Rate + +```text +# Longer TTL = more cache hits +[ai.cache] +ttl_seconds = 7200 # 2 hours instead of 1 hour + +# Semantic caching helps with slight variations +[ai.cache.semantic] +enabled = true +similarity_threshold = 0.90 # Lower threshold = more hits + +# Result: Increase hit rate from 65% → 80% +# Cost reduction: 15% → 23% +``` + +### Strategy 2: Use Local Models + +```text +[ai] +provider = "local" +model = "mistral-7b" # Free, runs on GPU + +# Cost: Hardware ($5-20/month) instead of API calls +# Savings: 50-100 config generations/month × $0.005 = $0.25-0.50 +# Hardware amortized cost: <$0.50/month on existing GPU + +# Tradeoff: Slightly lower quality, 2x slower +``` + +### Strategy 3: Use Haiku for Simple Tasks + +```text +Task Complexity vs Model: + +Simple (form assist): Claude Haiku 4 ($0.80/$4) +Medium (config gen): Claude Sonnet 4 ($3/$15) +Complex (agents): Claude Opus 4 ($15/$45) + +Example optimization: + Before: All tasks use Sonnet 4 + - 5000 form assists/month: 5000 × $0.006 = $30 + + After: Route by complexity + - 5000 form assists → Haiku: 5000 × $0.001 = $5 (83% savings) + - 200 config gen → Sonnet: 200 × $0.005 = $1 + - 10 agent runs → Opus: 10 × $0.10 = $1 +``` + +### Strategy 4: Batch Operations + +```text +# Instead of individual requests, batch similar operations: + +# Before: 100 configs, 100 separate API calls +provisioning ai generate "PostgreSQL config" --output db1.ncl +provisioning ai generate "PostgreSQL config" --output db2.ncl +# ... 100 calls = $0.50 + +# After: Batch similar requests +provisioning ai batch --input configs-list.yaml +# Groups similar requests, reuses cache +# ... 
3-5 API calls = $0.02 (90% savings) +``` + +### Strategy 5: Smart Feature Enablement + +```text +[ai.features] +# Enable high-ROI features +config_generation = true # High value, moderate cost +troubleshooting = true # High value, higher cost +rag_search = true # Low cost, high value + +# Disable low-ROI features if cost-constrained +form_assistance = false # Low value, non-zero cost (if budget tight) +agents = false # Complex, requires multiple calls +``` + +## Budget Management Workflow + +### 1. Set Budget + +```text +# Set monthly budget +provisioning config set ai.budget.monthly_limit_usd 500 + +# Set daily limit +provisioning config set ai.limits.daily_cost_limit_usd 50 + +# Set workspace limits +provisioning config set ai.workspace_budgets.prod.monthly_limit 300 +provisioning config set ai.workspace_budgets.dev.monthly_limit 100 +``` + +### 2. Monitor Spending + +```text +# Daily check +provisioning admin costs show ai + +# Weekly analysis +provisioning admin costs analyze ai --period week + +# Monthly review +provisioning admin costs analyze ai --period month +``` + +### 3. Adjust If Needed + +```text +# If overspending: +# - Increase cache TTL +# - Enable local models for simple tasks +# - Reduce form assistance (high volume, low cost but adds up) +# - Route complex tasks to Haiku instead of Opus + +# If underspending: +# - Enable new features (agents, form assistance) +# - Increase rate limits +# - Lower cache hit requirements (broader semantic matching) +``` + +### 4. Forecast and Plan + +```text +# Current monthly run rate +provisioning admin costs forecast ai + +# If trending over budget, recommend actions: +# - Reduce daily limit +# - Switch to local model for 50% of tasks +# - Increase batch processing + +# If trending under budget: +# - Enable agents for automation workflows +# - Enable form assistance across all workspaces +``` + +## Cost Allocation + +### Chargeback Models + +**Per-Workspace Model**: +```text +Development workspace: $50/month +Staging workspace: $100/month +Production workspace: $300/month +------ +Total: $450/month +``` + +**Per-User Model**: +```text +Each user charged based on their usage +Encourages efficiency +Difficult to track/allocate +``` + +**Shared Pool Model**: +```text +All teams share $1000/month budget +Budget splits by consumption rate +Encourages optimization +Most flexible +``` + +## Cost Reporting + +### Generate Reports + +```text +# Monthly cost report +provisioning admin costs report ai \ + --format pdf \ + --period month \ + --output cost-report-2025-01.pdf + +# Detailed analysis for finance +provisioning admin costs report ai \ + --format xlsx \ + --include-forecasts \ + --include-optimization-suggestions + +# Executive summary +provisioning admin costs report ai \ + --format markdown \ + --summary-only +``` + +## Cost-Benefit Analysis + +### ROI Examples + +```text +Scenario 1: Developer Time Savings + Problem: Manual config creation takes 2 hours + Solution: AI config generation, 10 minutes (12x faster) + Time saved: 1.83 hours/config + Hourly rate: $100 + Value: $183/config + + AI cost: $0.005/config + ROI: 36,600x (far exceeds cost) + +Scenario 2: Troubleshooting Efficiency + Problem: Manual debugging takes 4 hours + Solution: AI troubleshooting analysis, 2 minutes + Time saved: 3.97 hours + Value: $397/incident + + AI cost: $0.045/incident + ROI: 8,822x + +Scenario 3: Reduction in Failed Deployments + Before: 5% of 1000 deployments fail (50 failures) + Failure cost: $500 each (lost time, data cleanup) + Total: $25,000/month + + After: With AI 
analysis, 2% fail (20 failures) + Total: $10,000/month + Savings: $15,000/month + + AI cost: $200/month + Net savings: $14,800/month + ROI: 74:1 +``` + +## Advanced Cost Optimization + +### Hybrid Strategy (Recommended) + +```text +✓ Local models for: + - Form assistance (high volume, low complexity) + - Simple validation checks + - Document retrieval (RAG) + Cost: Hardware only (~$500 setup) + +✓ Cloud API for: + - Complex generation (requires latest model capability) + - Troubleshooting (needs high accuracy) + - Agents (complex reasoning) + Cost: $50-200/month per organization + +Result: + - 70% of requests → Local (free after hardware amortization) + - 30% of requests → Cloud ($50/month) + - 80% overall cost reduction vs cloud-only +``` + +## Monitoring and Alerts + +### Cost Anomaly Detection + +```text +# Enable anomaly detection +provisioning config set ai.monitoring.anomaly_detection true + +# Set thresholds +provisioning config set ai.monitoring.cost_spike_percent 150 +# Alert if daily cost is 150% of average + +# System alerts: +# - Daily cost exceeded by 10x normal +# - New expensive operation (agent run) +# - Cache hit rate dropped below 40% +# - Rate limit nearly exhausted +``` + +### Alert Configuration + +```text +[ai.monitoring.alerts] +enabled = true +spike_threshold_percent = 150 +check_interval_minutes = 5 + +[ai.monitoring.alerts.channels] +email = "ops@company.com" +slack = "https://hooks.slack.com/..." +pagerduty = "integration-key" + +# Alert thresholds +[ai.monitoring.alerts.thresholds] +daily_budget_warning_percent = 80 +daily_budget_critical_percent = 95 +monthly_budget_warning_percent = 70 +``` + +## Related Documentation + +- [Architecture](architecture.md) - AI system overview +- [Configuration](configuration.md) - Cost control settings +- [Security Policies](security-policies.md) - Cost-aware policies +- [RAG System](rag-system.md) - Caching details +- [ADR-015](../architecture/adr/adr-015-ai-integration-architecture.md) - Design decisions + +--- + +**Last Updated**: 2025-01-13 +**Status**: ✅ Production-Ready +**Average Savings**: 50-80% through caching +**Typical Cost**: $50-500/month per organization +**ROI**: 100:1 to 10,000:1 depending on use case diff --git a/docs/src/ai/mcp-integration.md b/docs/src/ai/mcp-integration.md index 19edba9..af834eb 100644 --- a/docs/src/ai/mcp-integration.md +++ b/docs/src/ai/mcp-integration.md @@ -1 +1,594 @@ -# Model Context Protocol (MCP) Integration\n\n**Status**: ✅ Production-Ready (MCP 0.6.0+, integrated with Claude, compatible with all LLMs)\n\nThe MCP server provides standardized Model Context Protocol integration, allowing external LLMs (Claude, GPT-4, local models) to access provisioning\nplatform capabilities as tools. This enables complex multi-step workflows, tool composition, and integration with existing LLM applications.\n\n## Architecture Overview\n\nThe MCP integration follows the Model Context Protocol specification:\n\n```\n┌──────────────────────────────────────────────────────────────┐\n│ External LLM (Claude, GPT-4, etc.) 
│\n└────────────────────┬─────────────────────────────────────────┘\n │\n │ Tool Calls (JSON-RPC)\n ▼\n┌──────────────────────────────────────────────────────────────┐\n│ MCP Server (provisioning/platform/crates/mcp-server) │\n│ │\n│ ┌───────────────────────────────────────────────────────┐ │\n│ │ Tool Registry │ │\n│ │ - generate_config(description, schema) │ │\n│ │ - validate_config(config) │ │\n│ │ - search_docs(query) │ │\n│ │ - troubleshoot_deployment(logs) │ │\n│ │ - get_schema(name) │ │\n│ │ - check_compliance(config, policy) │ │\n│ └───────────────────────────────────────────────────────┘ │\n│ │ │\n│ ▼ │\n│ ┌───────────────────────────────────────────────────────┐ │\n│ │ Implementation Layer │ │\n│ │ - AI Service client (ai-service port 8083) │ │\n│ │ - Validator client │ │\n│ │ - RAG client (SurrealDB) │ │\n│ │ - Schema loader │ │\n│ └───────────────────────────────────────────────────────┘ │\n└──────────────────────────────────────────────────────────────┘\n```\n\n## MCP Server Launch\n\nThe MCP server is started as a stdio-based service:\n\n```\n# Start MCP server (stdio transport)\nprovisioning-mcp-server --config /etc/provisioning/ai.toml\n\n# With debug logging\nRUST_LOG=debug provisioning-mcp-server --config /etc/provisioning/ai.toml\n\n# In Claude Desktop configuration\n~/.claude/claude_desktop_config.json:\n{\n "mcpServers": {\n "provisioning": {\n "command": "provisioning-mcp-server",\n "args": ["--config", "/etc/provisioning/ai.toml"],\n "env": {\n "PROVISIONING_TOKEN": "your-auth-token"\n }\n }\n }\n}\n```\n\n## Available Tools\n\n### 1. Config Generation\n\n**Tool**: `generate_config`\n\nGenerate infrastructure configuration from natural language description.\n\n```\n{\n "name": "generate_config",\n "description": "Generate a Nickel infrastructure configuration from a natural language description",\n "inputSchema": {\n "type": "object",\n "properties": {\n "description": {\n "type": "string",\n "description": "Natural language description of desired infrastructure"\n },\n "schema": {\n "type": "string",\n "description": "Target schema name (e.g., 'database', 'kubernetes', 'network'). Optional."\n },\n "format": {\n "type": "string",\n "enum": ["nickel", "toml"],\n "description": "Output format (default: nickel)"\n }\n },\n "required": ["description"]\n }\n}\n```\n\n**Example Usage**:\n\n```\n# Via MCP client\nmcp-client provisioning generate_config \\n --description "Production PostgreSQL cluster with encryption and daily backups" \\n --schema database\n\n# Claude desktop prompt:\n# @provisioning: Generate a production PostgreSQL setup with automated backups\n```\n\n**Response**:\n\n```\n{\n database = {\n engine = "postgresql",\n version = "15.0",\n \n instance = {\n instance_class = "db.r6g.xlarge",\n allocated_storage_gb = 100,\n iops = 3000,\n },\n \n security = {\n encryption_enabled = true,\n encryption_key_id = "kms://prod-db-key",\n tls_enabled = true,\n tls_version = "1.3",\n },\n \n backup = {\n enabled = true,\n retention_days = 30,\n preferred_window = "03:00-04:00",\n copy_to_region = "us-west-2",\n },\n \n monitoring = {\n enhanced_monitoring_enabled = true,\n monitoring_interval_seconds = 60,\n log_exports = ["postgresql"],\n },\n }\n}\n```\n\n### 2. 
Config Validation\n\n**Tool**: `validate_config`\n\nValidate a Nickel configuration against schemas and policies.\n\n```\n{\n "name": "validate_config",\n "description": "Validate a Nickel configuration file",\n "inputSchema": {\n "type": "object",\n "properties": {\n "config": {\n "type": "string",\n "description": "Nickel configuration content or file path"\n },\n "schema": {\n "type": "string",\n "description": "Schema name to validate against (optional)"\n },\n "strict": {\n "type": "boolean",\n "description": "Enable strict validation (default: true)"\n }\n },\n "required": ["config"]\n }\n}\n```\n\n**Example Usage**:\n\n```\n# Validate configuration\nmcp-client provisioning validate_config \\n --config "$(cat workspaces/prod/database.ncl)"\n\n# With specific schema\nmcp-client provisioning validate_config \\n --config "workspaces/prod/kubernetes.ncl" \\n --schema kubernetes\n```\n\n**Response**:\n\n```\n{\n "valid": true,\n "errors": [],\n "warnings": [\n "Consider enabling automated backups for production use"\n ],\n "metadata": {\n "schema": "kubernetes",\n "version": "1.28",\n "validated_at": "2025-01-13T10:45:30Z"\n }\n}\n```\n\n### 3. Documentation Search\n\n**Tool**: `search_docs`\n\nSearch infrastructure documentation using RAG system.\n\n```\n{\n "name": "search_docs",\n "description": "Search provisioning documentation for information",\n "inputSchema": {\n "type": "object",\n "properties": {\n "query": {\n "type": "string",\n "description": "Search query (natural language)"\n },\n "top_k": {\n "type": "integer",\n "description": "Number of results (default: 5)"\n },\n "doc_type": {\n "type": "string",\n "enum": ["guide", "schema", "example", "troubleshooting"],\n "description": "Filter by document type (optional)"\n }\n },\n "required": ["query"]\n }\n}\n```\n\n**Example Usage**:\n\n```\n# Search documentation\nmcp-client provisioning search_docs \\n --query "How do I configure PostgreSQL with replication?"\n\n# Get examples\nmcp-client provisioning search_docs \\n --query "Kubernetes networking" \\n --doc_type example \\n --top_k 3\n```\n\n**Response**:\n\n```\n{\n "results": [\n {\n "source": "provisioning/docs/src/guides/database-replication.md",\n "excerpt": "PostgreSQL logical replication enables streaming of changes...",\n "relevance": 0.94,\n "section": "Setup Logical Replication"\n },\n {\n "source": "provisioning/schemas/database.ncl",\n "excerpt": "replication = { enabled = true, mode = \"logical\", ... }",\n "relevance": 0.87,\n "section": "Replication Configuration"\n }\n ]\n}\n```\n\n### 4. 
Deployment Troubleshooting\n\n**Tool**: `troubleshoot_deployment`\n\nAnalyze deployment failures and suggest fixes.\n\n```\n{\n "name": "troubleshoot_deployment",\n "description": "Analyze deployment logs and suggest fixes",\n "inputSchema": {\n "type": "object",\n "properties": {\n "deployment_id": {\n "type": "string",\n "description": "Deployment ID (e.g., 'deploy-2025-01-13-001')"\n },\n "logs": {\n "type": "string",\n "description": "Deployment logs (optional, if deployment_id not provided)"\n },\n "error_analysis_depth": {\n "type": "string",\n "enum": ["shallow", "deep"],\n "description": "Analysis depth (default: deep)"\n }\n }\n }\n}\n```\n\n**Example Usage**:\n\n```\n# Troubleshoot recent deployment\nmcp-client provisioning troubleshoot_deployment \\n --deployment_id "deploy-2025-01-13-001"\n\n# With custom logs\nmcp-client provisioning troubleshoot_deployment \\n| --logs "$(journalctl -u provisioning --no-pager | tail -100)" |\n```\n\n**Response**:\n\n```\n{\n "status": "failure",\n "root_cause": "Database connection timeout during migration phase",\n "analysis": {\n "phase": "database_migration",\n "error_type": "connectivity",\n "confidence": 0.95\n },\n "suggestions": [\n "Verify database security group allows inbound on port 5432",\n "Check database instance status (may be rebooting)",\n "Increase connection timeout in configuration"\n ],\n "corrected_config": "...generated Nickel config with fixes...",\n "similar_issues": [\n "[https://docs/troubleshooting/database-connectivity.md"](https://docs/troubleshooting/database-connectivity.md")\n ]\n}\n```\n\n### 5. Get Schema\n\n**Tool**: `get_schema`\n\nRetrieve schema definition with examples.\n\n```\n{\n "name": "get_schema",\n "description": "Get a provisioning schema definition",\n "inputSchema": {\n "type": "object",\n "properties": {\n "schema_name": {\n "type": "string",\n "description": "Schema name (e.g., 'database', 'kubernetes')"\n },\n "format": {\n "type": "string",\n "enum": ["schema", "example", "documentation"],\n "description": "Response format (default: schema)"\n }\n },\n "required": ["schema_name"]\n }\n}\n```\n\n**Example Usage**:\n\n```\n# Get schema definition\nmcp-client provisioning get_schema --schema_name database\n\n# Get example configuration\nmcp-client provisioning get_schema \\n --schema_name kubernetes \\n --format example\n```\n\n### 6. 
Compliance Check\n\n**Tool**: `check_compliance`\n\nVerify configuration against compliance policies (Cedar).\n\n```\n{\n "name": "check_compliance",\n "description": "Check configuration against compliance policies",\n "inputSchema": {\n "type": "object",\n "properties": {\n "config": {\n "type": "string",\n "description": "Configuration to check"\n },\n "policy_set": {\n "type": "string",\n "description": "Policy set to check against (e.g., 'pci-dss', 'hipaa', 'sox')"\n }\n },\n "required": ["config", "policy_set"]\n }\n}\n```\n\n**Example Usage**:\n\n```\n# Check against PCI-DSS\nmcp-client provisioning check_compliance \\n --config "$(cat workspaces/prod/database.ncl)" \\n --policy_set pci-dss\n```\n\n## Integration Examples\n\n### Claude Desktop (Most Common)\n\n```\n~/.claude/claude_desktop_config.json:\n{\n "mcpServers": {\n "provisioning": {\n "command": "provisioning-mcp-server",\n "args": ["--config", "/etc/provisioning/ai.toml"],\n "env": {\n "PROVISIONING_API_KEY": "sk-...",\n "PROVISIONING_BASE_URL": "[http://localhost:8083"](http://localhost:8083")\n }\n }\n }\n}\n```\n\n**Usage in Claude**:\n\n```\nUser: I need a production Kubernetes cluster in AWS with automatic scaling\n\nClaude can now use provisioning tools:\nI'll help you create a production Kubernetes cluster. Let me:\n1. Search the documentation for best practices\n2. Generate a configuration template\n3. Validate it against your policies\n4. Provide the final configuration\n```\n\n### OpenAI Function Calling\n\n```\nimport openai\n\ntools = [\n {\n "type": "function",\n "function": {\n "name": "generate_config",\n "description": "Generate infrastructure configuration",\n "parameters": {\n "type": "object",\n "properties": {\n "description": {\n "type": "string",\n "description": "Infrastructure description"\n }\n },\n "required": ["description"]\n }\n }\n }\n]\n\nresponse = openai.ChatCompletion.create(\n model="gpt-4",\n messages=[{"role": "user", "content": "Create a PostgreSQL database"}],\n tools=tools\n)\n```\n\n### Local LLM Integration (Ollama)\n\n```\n# Start Ollama with provisioning MCP\nOLLAMA_MCP_SERVERS=provisioning://localhost:3000 \\n ollama serve\n\n# Use with llama2 or mistral\ncurl [http://localhost:11434/api/generate](http://localhost:11434/api/generate) \\n -d '{\n "model": "mistral",\n "prompt": "Create a Kubernetes cluster",\n "tools": [{"type": "mcp", "server": "provisioning"}]\n }'\n```\n\n## Error Handling\n\nTools return consistent error responses:\n\n```\n{\n "error": {\n "code": "VALIDATION_ERROR",\n "message": "Configuration has 3 validation errors",\n "details": [\n {\n "field": "database.version",\n "message": "PostgreSQL version 9.6 is deprecated",\n "severity": "error"\n },\n {\n "field": "backup.retention_days",\n "message": "Recommended minimum is 30 days for production",\n "severity": "warning"\n }\n ]\n }\n}\n```\n\n## Performance\n\n| | Operation | Latency | Notes | |\n| | ----------- | --------- | ------- | |\n| | generate_config | 2-5s | Depends on LLM and config complexity | |\n| | validate_config | 500-1000ms | Parallel schema validation | |\n| | search_docs | 300-800ms | RAG hybrid search | |\n| | troubleshoot | 3-8s | Depends on log size and analysis depth | |\n| | get_schema | 100-300ms | Cached schema retrieval | |\n| | check_compliance | 500-2000ms | Policy evaluation | |\n\n## Configuration\n\nSee [Configuration Guide](configuration.md) for MCP-specific settings:\n\n- MCP server port and binding\n- Tool registry customization\n- Rate limiting for tool calls\n- Access 
control (Cedar policies)\n\n## Security\n\n### Authentication\n\n- Tools require valid provisioning API token\n- Token scoped to user's workspace\n- All tool calls authenticated and logged\n\n### Authorization\n\n- Cedar policies control which tools user can call\n- Example: `allow(principal, action, resource)` when `role == "admin"`\n- Detailed audit trail of all tool invocations\n\n### Data Protection\n\n- Secrets never passed through MCP\n- Configuration sanitized before analysis\n- PII removed from logs sent to external LLMs\n\n## Monitoring and Debugging\n\n```\n# Monitor MCP server\nprovisioning admin mcp status\n\n# View MCP tool calls\nprovisioning admin logs --filter "mcp_tools" --tail 100\n\n# Debug tool response\nRUST_LOG=provisioning::mcp=debug provisioning-mcp-server\n```\n\n## Related Documentation\n\n- [Architecture](architecture.md) - AI system overview\n- [RAG System](rag-system.md) - Documentation search\n- [Configuration](configuration.md) - MCP setup\n- [API Reference](api-reference.md) - Detailed API endpoints\n- [ADR-015](../architecture/adr/adr-015-ai-integration-architecture.md) - Design decisions\n\n---\n\n**Last Updated**: 2025-01-13\n**Status**: ✅ Production-Ready\n**MCP Version**: 0.6.0+\n**Supported LLMs**: Claude, GPT-4, Llama, Mistral, all MCP-compatible models +# Model Context Protocol (MCP) Integration + +**Status**: ✅ Production-Ready (MCP 0.6.0+, integrated with Claude, compatible with all LLMs) + +The MCP server provides standardized Model Context Protocol integration, allowing external LLMs (Claude, GPT-4, local models) to access provisioning +platform capabilities as tools. This enables complex multi-step workflows, tool composition, and integration with existing LLM applications. + +## Architecture Overview + +The MCP integration follows the Model Context Protocol specification: + +```text +┌──────────────────────────────────────────────────────────────┐ +│ External LLM (Claude, GPT-4, etc.) │ +└────────────────────┬─────────────────────────────────────────┘ + │ + │ Tool Calls (JSON-RPC) + ▼ +┌──────────────────────────────────────────────────────────────┐ +│ MCP Server (provisioning/platform/crates/mcp-server) │ +│ │ +│ ┌───────────────────────────────────────────────────────┐ │ +│ │ Tool Registry │ │ +│ │ - generate_config(description, schema) │ │ +│ │ - validate_config(config) │ │ +│ │ - search_docs(query) │ │ +│ │ - troubleshoot_deployment(logs) │ │ +│ │ - get_schema(name) │ │ +│ │ - check_compliance(config, policy) │ │ +│ └───────────────────────────────────────────────────────┘ │ +│ │ │ +│ ▼ │ +│ ┌───────────────────────────────────────────────────────┐ │ +│ │ Implementation Layer │ │ +│ │ - AI Service client (ai-service port 8083) │ │ +│ │ - Validator client │ │ +│ │ - RAG client (SurrealDB) │ │ +│ │ - Schema loader │ │ +│ └───────────────────────────────────────────────────────┘ │ +└──────────────────────────────────────────────────────────────┘ +``` + +## MCP Server Launch + +The MCP server is started as a stdio-based service: + +```text +# Start MCP server (stdio transport) +provisioning-mcp-server --config /etc/provisioning/ai.toml + +# With debug logging +RUST_LOG=debug provisioning-mcp-server --config /etc/provisioning/ai.toml + +# In Claude Desktop configuration +~/.claude/claude_desktop_config.json: +{ + "mcpServers": { + "provisioning": { + "command": "provisioning-mcp-server", + "args": ["--config", "/etc/provisioning/ai.toml"], + "env": { + "PROVISIONING_TOKEN": "your-auth-token" + } + } + } +} +``` + +## Available Tools + +### 1. 
Config Generation
+
+**Tool**: `generate_config`
+
+Generate infrastructure configuration from a natural language description.
+
+```text
+{
+  "name": "generate_config",
+  "description": "Generate a Nickel infrastructure configuration from a natural language description",
+  "inputSchema": {
+    "type": "object",
+    "properties": {
+      "description": {
+        "type": "string",
+        "description": "Natural language description of desired infrastructure"
+      },
+      "schema": {
+        "type": "string",
+        "description": "Target schema name (e.g., 'database', 'kubernetes', 'network'). Optional."
+      },
+      "format": {
+        "type": "string",
+        "enum": ["nickel", "toml"],
+        "description": "Output format (default: nickel)"
+      }
+    },
+    "required": ["description"]
+  }
+}
+```
+
+**Example Usage**:
+
+```text
+# Via MCP client
+mcp-client provisioning generate_config \
+  --description "Production PostgreSQL cluster with encryption and daily backups" \
+  --schema database
+
+# Claude desktop prompt:
+# @provisioning: Generate a production PostgreSQL setup with automated backups
+```
+
+**Response**:
+
+```text
+{
+  database = {
+    engine = "postgresql",
+    version = "15.0",
+
+    instance = {
+      instance_class = "db.r6g.xlarge",
+      allocated_storage_gb = 100,
+      iops = 3000,
+    },
+
+    security = {
+      encryption_enabled = true,
+      encryption_key_id = "kms://prod-db-key",
+      tls_enabled = true,
+      tls_version = "1.3",
+    },
+
+    backup = {
+      enabled = true,
+      retention_days = 30,
+      preferred_window = "03:00-04:00",
+      copy_to_region = "us-west-2",
+    },
+
+    monitoring = {
+      enhanced_monitoring_enabled = true,
+      monitoring_interval_seconds = 60,
+      log_exports = ["postgresql"],
+    },
+  }
+}
+```
+
+### 2. Config Validation
+
+**Tool**: `validate_config`
+
+Validate a Nickel configuration against schemas and policies.
+
+```text
+{
+  "name": "validate_config",
+  "description": "Validate a Nickel configuration file",
+  "inputSchema": {
+    "type": "object",
+    "properties": {
+      "config": {
+        "type": "string",
+        "description": "Nickel configuration content or file path"
+      },
+      "schema": {
+        "type": "string",
+        "description": "Schema name to validate against (optional)"
+      },
+      "strict": {
+        "type": "boolean",
+        "description": "Enable strict validation (default: true)"
+      }
+    },
+    "required": ["config"]
+  }
+}
+```
+
+**Example Usage**:
+
+```text
+# Validate configuration
+mcp-client provisioning validate_config \
+  --config "$(cat workspaces/prod/database.ncl)"
+
+# With specific schema
+mcp-client provisioning validate_config \
+  --config "workspaces/prod/kubernetes.ncl" \
+  --schema kubernetes
+```
+
+**Response**:
+
+```text
+{
+  "valid": true,
+  "errors": [],
+  "warnings": [
+    "Consider enabling automated backups for production use"
+  ],
+  "metadata": {
+    "schema": "kubernetes",
+    "version": "1.28",
+    "validated_at": "2025-01-13T10:45:30Z"
+  }
+}
+```
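+
+Under the hood, each of these tool invocations is a JSON-RPC 2.0 `tools/call` request over the server's stdio transport. The sketch below drives the
+server directly from Python, assuming newline-delimited JSON-RPC framing and the config path used above; `initialize`,
+`notifications/initialized`, and `tools/call` are standard MCP methods, but treat the exact response payloads as illustrative.
+
+```text
+import json
+import subprocess
+
+# Launch the MCP server on stdio (same command Claude Desktop uses).
+proc = subprocess.Popen(
+    ["provisioning-mcp-server", "--config", "/etc/provisioning/ai.toml"],
+    stdin=subprocess.PIPE,
+    stdout=subprocess.PIPE,
+    text=True,
+)
+
+def send(message: dict) -> None:
+    """Write one newline-delimited JSON-RPC message to the server."""
+    proc.stdin.write(json.dumps(message) + "\n")
+    proc.stdin.flush()
+
+def request(req_id: int, method: str, params: dict) -> dict:
+    """Send a request and read one newline-delimited response."""
+    send({"jsonrpc": "2.0", "id": req_id, "method": method, "params": params})
+    return json.loads(proc.stdout.readline())
+
+# MCP handshake, then a validate_config call.
+request(1, "initialize", {
+    "protocolVersion": "2024-11-05",
+    "capabilities": {},
+    "clientInfo": {"name": "example-client", "version": "0.1.0"},
+})
+send({"jsonrpc": "2.0", "method": "notifications/initialized"})
+
+with open("workspaces/prod/database.ncl") as f:
+    result = request(2, "tools/call", {
+        "name": "validate_config",
+        "arguments": {"config": f.read()},
+    })
+print(result)
+```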
+### 3. Documentation Search
+
+**Tool**: `search_docs`
+
+Search infrastructure documentation using the RAG system.
+
+```text
+{
+  "name": "search_docs",
+  "description": "Search provisioning documentation for information",
+  "inputSchema": {
+    "type": "object",
+    "properties": {
+      "query": {
+        "type": "string",
+        "description": "Search query (natural language)"
+      },
+      "top_k": {
+        "type": "integer",
+        "description": "Number of results (default: 5)"
+      },
+      "doc_type": {
+        "type": "string",
+        "enum": ["guide", "schema", "example", "troubleshooting"],
+        "description": "Filter by document type (optional)"
+      }
+    },
+    "required": ["query"]
+  }
+}
+```
+
+**Example Usage**:
+
+```text
+# Search documentation
+mcp-client provisioning search_docs \
+  --query "How do I configure PostgreSQL with replication?"
+
+# Get examples
+mcp-client provisioning search_docs \
+  --query "Kubernetes networking" \
+  --doc_type example \
+  --top_k 3
+```
+
+**Response**:
+
+```text
+{
+  "results": [
+    {
+      "source": "provisioning/docs/src/guides/database-replication.md",
+      "excerpt": "PostgreSQL logical replication enables streaming of changes...",
+      "relevance": 0.94,
+      "section": "Setup Logical Replication"
+    },
+    {
+      "source": "provisioning/schemas/database.ncl",
+      "excerpt": "replication = { enabled = true, mode = \"logical\", ... }",
+      "relevance": 0.87,
+      "section": "Replication Configuration"
+    }
+  ]
+}
+```
+
+### 4. Deployment Troubleshooting
+
+**Tool**: `troubleshoot_deployment`
+
+Analyze deployment failures and suggest fixes.
+
+```text
+{
+  "name": "troubleshoot_deployment",
+  "description": "Analyze deployment logs and suggest fixes",
+  "inputSchema": {
+    "type": "object",
+    "properties": {
+      "deployment_id": {
+        "type": "string",
+        "description": "Deployment ID (e.g., 'deploy-2025-01-13-001')"
+      },
+      "logs": {
+        "type": "string",
+        "description": "Deployment logs (optional, if deployment_id not provided)"
+      },
+      "error_analysis_depth": {
+        "type": "string",
+        "enum": ["shallow", "deep"],
+        "description": "Analysis depth (default: deep)"
+      }
+    }
+  }
+}
+```
+
+**Example Usage**:
+
+```text
+# Troubleshoot recent deployment
+mcp-client provisioning troubleshoot_deployment \
+  --deployment_id "deploy-2025-01-13-001"
+
+# With custom logs
+mcp-client provisioning troubleshoot_deployment \
+  --logs "$(journalctl -u provisioning --no-pager | tail -100)"
+```
+
+**Response**:
+
+```text
+{
+  "status": "failure",
+  "root_cause": "Database connection timeout during migration phase",
+  "analysis": {
+    "phase": "database_migration",
+    "error_type": "connectivity",
+    "confidence": 0.95
+  },
+  "suggestions": [
+    "Verify database security group allows inbound on port 5432",
+    "Check database instance status (may be rebooting)",
+    "Increase connection timeout in configuration"
+  ],
+  "corrected_config": "...generated Nickel config with fixes...",
+  "similar_issues": [
+    "https://docs/troubleshooting/database-connectivity.md"
+  ]
+}
+```
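+
+The response shape above is stable enough for clients to act on mechanically, for example surfacing the root cause and suggestions and escalating
+only when the analysis is unsure. A minimal sketch, assuming a parsed `result` dict matching the example and an arbitrary 0.8 confidence threshold:
+
+```text
+def summarize_troubleshooting(result: dict) -> str:
+    """Render a troubleshoot_deployment response for a human operator."""
+    lines = [f"Root cause: {result['root_cause']}"]
+    analysis = result.get("analysis", {})
+    confidence = analysis.get("confidence", 0.0)
+    lines.append(f"Phase: {analysis.get('phase', 'unknown')} (confidence {confidence:.0%})")
+    for i, suggestion in enumerate(result.get("suggestions", []), start=1):
+        lines.append(f"  {i}. {suggestion}")
+    if confidence < 0.8:  # arbitrary threshold: ask a human when the analysis is unsure
+        lines.append("Low confidence - review the raw logs before applying fixes.")
+    return "\n".join(lines)
+```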
+### 5. Get Schema
+
+**Tool**: `get_schema`
+
+Retrieve a schema definition with examples.
+
+```text
+{
+  "name": "get_schema",
+  "description": "Get a provisioning schema definition",
+  "inputSchema": {
+    "type": "object",
+    "properties": {
+      "schema_name": {
+        "type": "string",
+        "description": "Schema name (e.g., 'database', 'kubernetes')"
+      },
+      "format": {
+        "type": "string",
+        "enum": ["schema", "example", "documentation"],
+        "description": "Response format (default: schema)"
+      }
+    },
+    "required": ["schema_name"]
+  }
+}
+```
+
+**Example Usage**:
+
+```text
+# Get schema definition
+mcp-client provisioning get_schema --schema_name database
+
+# Get example configuration
+mcp-client provisioning get_schema \
+  --schema_name kubernetes \
+  --format example
+```
+
+### 6. Compliance Check
+
+**Tool**: `check_compliance`
+
+Verify configuration against compliance policies (Cedar).
+
+```text
+{
+  "name": "check_compliance",
+  "description": "Check configuration against compliance policies",
+  "inputSchema": {
+    "type": "object",
+    "properties": {
+      "config": {
+        "type": "string",
+        "description": "Configuration to check"
+      },
+      "policy_set": {
+        "type": "string",
+        "description": "Policy set to check against (e.g., 'pci-dss', 'hipaa', 'sox')"
+      }
+    },
+    "required": ["config", "policy_set"]
+  }
+}
+```
+
+**Example Usage**:
+
+```text
+# Check against PCI-DSS
+mcp-client provisioning check_compliance \
+  --config "$(cat workspaces/prod/database.ncl)" \
+  --policy_set pci-dss
+```
+
+## Integration Examples
+
+### Claude Desktop (Most Common)
+
+```text
+~/.claude/claude_desktop_config.json:
+{
+  "mcpServers": {
+    "provisioning": {
+      "command": "provisioning-mcp-server",
+      "args": ["--config", "/etc/provisioning/ai.toml"],
+      "env": {
+        "PROVISIONING_API_KEY": "sk-...",
+        "PROVISIONING_BASE_URL": "http://localhost:8083"
+      }
+    }
+  }
+}
+```
+
+**Usage in Claude**:
+
+```text
+User: I need a production Kubernetes cluster in AWS with automatic scaling
+
+Claude can now use provisioning tools:
+I'll help you create a production Kubernetes cluster. Let me:
+1. Search the documentation for best practices
+2. Generate a configuration template
+3. Validate it against your policies
+4. Provide the final configuration
+```
+
+### OpenAI Function Calling
+
+```text
+from openai import OpenAI
+
+client = OpenAI()
+
+tools = [
+  {
+    "type": "function",
+    "function": {
+      "name": "generate_config",
+      "description": "Generate infrastructure configuration",
+      "parameters": {
+        "type": "object",
+        "properties": {
+          "description": {
+            "type": "string",
+            "description": "Infrastructure description"
+          }
+        },
+        "required": ["description"]
+      }
+    }
+  }
+]
+
+response = client.chat.completions.create(
+    model="gpt-4",
+    messages=[{"role": "user", "content": "Create a PostgreSQL database"}],
+    tools=tools
+)
+```
+
+### Local LLM Integration (Ollama)
+
+```text
+# Start Ollama with provisioning MCP
+OLLAMA_MCP_SERVERS=provisioning://localhost:3000 \
+  ollama serve
+
+# Use with llama2 or mistral
+curl http://localhost:11434/api/generate \
+  -d '{
+    "model": "mistral",
+    "prompt": "Create a Kubernetes cluster",
+    "tools": [{"type": "mcp", "server": "provisioning"}]
+  }'
+```
+
+## Error Handling
+
+Tools return consistent error responses:
+
+```text
+{
+  "error": {
+    "code": "VALIDATION_ERROR",
+    "message": "Configuration has 3 validation errors",
+    "details": [
+      {
+        "field": "database.version",
+        "message": "PostgreSQL version 9.6 is deprecated",
+        "severity": "error"
+      },
+      {
+        "field": "backup.retention_days",
+        "message": "Recommended minimum is 30 days for production",
+        "severity": "warning"
+      }
+    ]
+  }
+}
+```
+
+## Performance
+
+| Operation | Latency | Notes |
+| --------- | ------- | ----- |
+| generate_config | 2-5s | Depends on LLM and config complexity |
+| validate_config | 500-1000ms | Parallel schema validation |
+| search_docs | 300-800ms | RAG hybrid search |
+| troubleshoot | 3-8s | Depends on log size and analysis depth |
+| get_schema | 100-300ms | Cached schema retrieval |
+| check_compliance | 500-2000ms | Policy evaluation |
+
+## Configuration
+
+See [Configuration Guide](configuration.md) for MCP-specific settings:
+
+- MCP server port and binding
+- Tool registry customization
+- Rate limiting for tool calls
+- Access control (Cedar policies)
+
+## Security
+
+### Authentication
+
+- Tools require a valid provisioning API token
+- Token scoped to the user's workspace
+- All tool calls authenticated and logged
+
+### Authorization
+
+- Cedar policies control which tools a user can call
+- Example: `allow(principal, action, resource)` when `role == "admin"`
+- Detailed audit trail of all tool invocations
+
+### Data Protection
+
+- Secrets never passed through MCP
+- Configuration sanitized before analysis
+- PII removed from logs sent to external LLMs
+
+## Monitoring and Debugging
+
+```text
+# Monitor MCP server
+provisioning admin mcp status
+
+# View MCP tool calls
+provisioning admin logs --filter "mcp_tools" --tail 100
+
+# Debug tool response
+RUST_LOG=provisioning::mcp=debug provisioning-mcp-server
+```
+
+## Related Documentation
+
+- [Architecture](architecture.md) - AI system overview
+- [RAG System](rag-system.md) - Documentation search
+- [Configuration](configuration.md) - MCP setup
+- [API Reference](api-reference.md) - Detailed API endpoints
+- [ADR-015](../architecture/adr/adr-015-ai-integration-architecture.md) - Design decisions
+
+---
+
+**Last Updated**: 2025-01-13
+**Status**: ✅ Production-Ready
+**MCP Version**: 0.6.0+
+**Supported LLMs**: Claude, GPT-4, Llama, Mistral, all MCP-compatible models
\ No newline at end of file
diff --git a/docs/src/ai/natural-language-config.md
b/docs/src/ai/natural-language-config.md index 8a103ce..09d1d41 100644 --- a/docs/src/ai/natural-language-config.md +++ b/docs/src/ai/natural-language-config.md @@ -1 +1,469 @@ -# Natural Language Configuration Generation\n\n**Status**: 🔴 Planned (Q2 2025 target)\n\nNatural Language Configuration (NLC) is a planned feature that enables users to describe infrastructure requirements in plain English and have the\nsystem automatically generate validated Nickel configurations. This feature combines natural language understanding with schema-aware generation and\nvalidation.\n\n## Feature Overview\n\n### What It Does\n\nTransform infrastructure descriptions into production-ready Nickel configurations:\n\n```\nUser Input:\n "Create a production PostgreSQL cluster with 100GB storage,\n daily backups, encryption enabled, and cross-region replication\n to us-west-2"\n\nSystem Output:\n provisioning/schemas/database.ncl (validated, production-ready)\n```\n\n### Primary Use Cases\n\n1. **Rapid Prototyping**: From description to working config in seconds\n2. **Infrastructure Documentation**: Describe infrastructure as code\n3. **Configuration Templates**: Generate reusable patterns\n4. **Non-Expert Operations**: Enable junior developers to provision infrastructure\n5. **Configuration Migration**: Describe existing infrastructure to generate Nickel\n\n## Architecture\n\n### Generation Pipeline\n\n```\nInput Description (Natural Language)\n ↓\n┌─────────────────────────────────────┐\n│ Understanding & Analysis │\n│ - Intent extraction │\n│ - Entity recognition │\n│ - Constraint identification │\n│ - Best practice inference │\n└─────────────────────┬───────────────┘\n ↓\n┌─────────────────────────────────────┐\n│ RAG Context Retrieval │\n│ - Find similar configs │\n│ - Retrieve best practices │\n│ - Get schema examples │\n│ - Identify constraints │\n└─────────────────────┬───────────────┘\n ↓\n┌─────────────────────────────────────┐\n│ Schema-Aware Generation │\n│ - Map entities to schema fields │\n│ - Apply type constraints │\n│ - Include required fields │\n│ - Generate valid Nickel │\n└─────────────────────┬───────────────┘\n ↓\n┌─────────────────────────────────────┐\n│ Validation & Refinement │\n│ - Type checking │\n│ - Schema validation │\n│ - Policy compliance │\n│ - Security checks │\n└─────────────────────┬───────────────┘\n ↓\n┌─────────────────────────────────────┐\n│ Output & Explanation │\n│ - Generated Nickel config │\n│ - Decision rationale │\n│ - Alternative suggestions │\n│ - Warnings if any │\n└─────────────────────────────────────┘\n```\n\n## Planned Implementation Details\n\n### 1. Intent Extraction\n\nExtract structured intent from natural language:\n\n```\nInput: "Create a production PostgreSQL cluster with encryption and backups"\n\nExtracted Intent:\n{\n resource_type: "database",\n engine: "postgresql",\n environment: "production",\n requirements: [\n {constraint: "encryption", type: "boolean", value: true},\n {constraint: "backups", type: "enabled", frequency: "daily"},\n ],\n modifiers: ["production"],\n}\n```\n\n### 2. Entity Mapping\n\nMap natural language entities to schema fields:\n\n```\nDescription Terms → Schema Fields:\n "100GB storage" → database.instance.allocated_storage_gb = 100\n "daily backups" → backup.enabled = true, backup.frequency = "daily"\n "encryption" → security.encryption_enabled = true\n "cross-region" → backup.copy_to_region = "us-west-2"\n "PostgreSQL 15" → database.engine_version = "15.0"\n```\n\n### 3. 
Prompt Engineering\n\nSophisticated prompting for schema-aware generation:\n\n```\nSystem Prompt:\nYou are generating Nickel infrastructure configurations.\nGenerate ONLY valid Nickel syntax.\nFollow these rules:\n- Use record syntax: `field = value`\n- Type annotations must be valid\n- All required fields must be present\n- Apply best practices for [ENVIRONMENT]\n\nSchema Context:\n[Database schema from provisioning/schemas/database.ncl]\n\nExamples:\n[3 relevant examples from RAG]\n\nUser Request:\n[User natural language description]\n\nGenerate the complete Nickel configuration.\nStart with: let { database = {\n```\n\n### 4. Iterative Refinement\n\nHandle generation errors through iteration:\n\n```\nAttempt 1: Generate initial config\n ↓ Validate\n ✗ Error: field `version` type mismatch (string vs number)\n ↓ Re-prompt with error\nAttempt 2: Fix with context from error\n ↓ Validate\n ✓ Success: Config is valid\n```\n\n## Command Interface\n\n### CLI Usage\n\n```\n# Simple generation\nprovisioning ai generate "PostgreSQL database for production"\n\n# With schema specification\nprovisioning ai generate \\n --schema database \\n "Create PostgreSQL 15 with encryption and daily backups"\n\n# Interactive generation (refine output)\nprovisioning ai generate --interactive \\n "Kubernetes cluster on AWS"\n\n# Generate and validate\nprovisioning ai generate \\n --validate \\n "Production Redis cluster with sentinel"\n\n# Generate and save directly\nprovisioning ai generate \\n --schema database \\n --output workspaces/prod/database.ncl \\n "PostgreSQL production setup"\n\n# Batch generation from file\nprovisioning ai generate --batch descriptions.yaml\n```\n\n### Interactive Refinement\n\n```\n$ provisioning ai generate --interactive\n> Describe infrastructure: Create production PostgreSQL cluster\n\nGenerated configuration shown.\n\n> Refine: Add cross-region backup to us-west-2\nConfiguration updated.\n\n> Refine: Use larger instance class for performance\nConfiguration updated.\n\n> Accept? 
[y/n]: y\nConfiguration saved to: workspaces/prod/database.ncl\n```\n\n## Example Generations\n\n### Example 1: Simple Database\n\n**Input**:\n```\n"PostgreSQL database with 50GB storage and encryption"\n```\n\n**Output**:\n```\nlet {\n database = {\n engine = "postgresql",\n version = "15.0",\n \n instance = {\n instance_class = "db.t3.medium",\n allocated_storage_gb = 50,\n iops = 1000,\n },\n \n security = {\n encryption_enabled = true,\n tls_enabled = true,\n tls_version = "1.3",\n },\n \n backup = {\n enabled = true,\n retention_days = 7,\n },\n }\n}\n```\n\n**Rationale**:\n- PostgreSQL 15 is current stable version\n- db.t3.medium suitable for 50GB with general workload\n- Encryption enabled per requirement\n- Automatic backups with 7-day retention (default)\n\n### Example 2: Complex Kubernetes Setup\n\n**Input**:\n```\n"Production Kubernetes cluster in AWS with 3 availability zones,\nauto-scaling from 3 to 10 nodes, managed PostgreSQL, and monitoring"\n```\n\n**Output**:\n```\nlet {\n kubernetes = {\n version = "1.28.0",\n \n cluster = {\n name = "prod-cluster",\n region = "us-east-1",\n availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"],\n },\n \n node_group = {\n min_size = 3,\n max_size = 10,\n desired_size = 3,\n instance_type = "t3.large",\n \n auto_scaling = {\n enabled = true,\n target_cpu = 70,\n scale_down_delay = 300,\n },\n },\n \n managed_services = {\n postgres = {\n enabled = true,\n engine = "postgresql",\n version = "15.0",\n storage_gb = 100,\n },\n },\n \n monitoring = {\n prometheus = {enabled = true},\n grafana = {enabled = true},\n cloudwatch_integration = true,\n },\n \n networking = {\n vpc_cidr = "10.0.0.0/16",\n enable_nat_gateway = true,\n enable_dns_hostnames = true,\n },\n }\n}\n```\n\n**Rationale**:\n- 3 AZs for high availability\n- t3.large balances cost and performance for general workload\n- Auto-scaling target 70% CPU (best practice)\n- Managed PostgreSQL reduces operational overhead\n- Full observability with Prometheus + Grafana\n\n## Configuration and Constraints\n\n### Configurable Generation Parameters\n\n```\n# In provisioning/config/ai.toml\n[ai.generation]\n# Which schema to use by default\ndefault_schema = "database"\n\n# Whether to require explicit environment specification\nrequire_environment = false\n\n# Optimization targets\noptimization_target = "balanced" # or "cost", "performance"\n\n# Best practices to always apply\nbest_practices = [\n "encryption",\n "high_availability",\n "monitoring",\n "backup",\n]\n\n# Constraints that limit generation\n[ai.generation.constraints]\nmin_storage_gb = 10\nmax_instances = 100\nallowed_engines = ["postgresql", "mysql", "mongodb"]\n\n# Validation before accepting generated config\n[ai.generation.validation]\nstrict_mode = true\nrequire_security_review = false\nrequire_compliance_check = true\n```\n\n### Safety Guardrails\n\n1. **Required Fields**: All schema required fields must be present\n2. **Type Validation**: Generated values must match schema types\n3. **Security Checks**: Encryption/backups enabled for production\n4. **Cost Estimation**: Warn if projected cost exceeds threshold\n5. **Resource Limits**: Enforce organizational constraints\n6. **Policy Compliance**: Check against Cedar policies\n\n## User Workflow\n\n### Typical Usage Session\n\n```\n# 1. Describe infrastructure need\n$ provisioning ai generate "I need a database for my web app"\n\n# System generates basic config, suggests refinements\n# Generated config shown with explanations\n\n# 2. 
Refine if needed\n$ provisioning ai generate --interactive\n\n# 3. Review and validate\n$ provisioning ai validate workspaces/dev/database.ncl\n\n# 4. Deploy\n$ provisioning workspace apply workspaces/dev\n\n# 5. Monitor\n$ provisioning workspace logs database\n```\n\n## Integration with Other Systems\n\n### RAG Integration\n\nNLC uses RAG to find similar configurations:\n\n```\nUser: "Create Kubernetes cluster"\n ↓\nRAG searches for:\n - Existing Kubernetes configs in workspaces\n - Kubernetes documentation and examples\n - Best practices from provisioning/docs/guides/kubernetes.md\n ↓\nContext fed to LLM for generation\n```\n\n### Form Assistance\n\nNLC and form assistance share components:\n\n- Intent extraction for pre-filling forms\n- Constraint validation for form field values\n- Explanation generation for validation errors\n\n### CLI Integration\n\n```\n# Generate then preview\n| provisioning ai generate "PostgreSQL prod" | \ |\n provisioning config preview\n\n# Generate and apply\nprovisioning ai generate \\n --apply \\n --environment prod \\n "PostgreSQL cluster"\n```\n\n## Testing and Validation\n\n### Test Cases (Planned)\n\n1. **Simple Descriptions**: Single resource, few requirements\n - "PostgreSQL database"\n - "Redis cache"\n\n2. **Complex Descriptions**: Multiple resources, constraints\n - "Kubernetes with managed database and monitoring"\n - "Multi-region deployment with failover"\n\n3. **Edge Cases**:\n - Conflicting requirements\n - Ambiguous specifications\n - Deprecated technologies\n\n4. **Refinement Cycles**:\n - Interactive generation with multiple refines\n - Error recovery and re-prompting\n - User feedback incorporation\n\n## Success Criteria (Q2 2025)\n\n- ✅ Generates valid Nickel for 90% of user descriptions\n- ✅ Generated configs pass all schema validation\n- ✅ Supports top 10 infrastructure patterns\n- ✅ Interactive refinement works smoothly\n- ✅ Error messages explain issues clearly\n- ✅ User testing with non-experts succeeds\n- ✅ Documentation complete with examples\n- ✅ Integration with form assistance operational\n\n## Related Documentation\n\n- [Architecture](architecture.md) - AI system overview\n- [AI-Assisted Forms](ai-assisted-forms.md) - Related form feature\n- [RAG System](rag-system.md) - Context retrieval\n- [Configuration](configuration.md) - Setup guide\n- [ADR-015](../architecture/adr/adr-015-ai-integration-architecture.md) - Design decisions\n\n---\n\n**Status**: 🔴 Planned\n**Target Release**: Q2 2025\n**Last Updated**: 2025-01-13\n**Architecture**: Complete\n**Implementation**: In Design Phase +# Natural Language Configuration Generation + +**Status**: 🔴 Planned (Q2 2025 target) + +Natural Language Configuration (NLC) is a planned feature that enables users to describe infrastructure requirements in plain English and have the +system automatically generate validated Nickel configurations. This feature combines natural language understanding with schema-aware generation and +validation. + +## Feature Overview + +### What It Does + +Transform infrastructure descriptions into production-ready Nickel configurations: + +```text +User Input: + "Create a production PostgreSQL cluster with 100GB storage, + daily backups, encryption enabled, and cross-region replication + to us-west-2" + +System Output: + provisioning/schemas/database.ncl (validated, production-ready) +``` + +### Primary Use Cases + +1. **Rapid Prototyping**: From description to working config in seconds +2. **Infrastructure Documentation**: Describe infrastructure as code +3. 
**Configuration Templates**: Generate reusable patterns +4. **Non-Expert Operations**: Enable junior developers to provision infrastructure +5. **Configuration Migration**: Describe existing infrastructure to generate Nickel + +## Architecture + +### Generation Pipeline + +```text +Input Description (Natural Language) + ↓ +┌─────────────────────────────────────┐ +│ Understanding & Analysis │ +│ - Intent extraction │ +│ - Entity recognition │ +│ - Constraint identification │ +│ - Best practice inference │ +└─────────────────────┬───────────────┘ + ↓ +┌─────────────────────────────────────┐ +│ RAG Context Retrieval │ +│ - Find similar configs │ +│ - Retrieve best practices │ +│ - Get schema examples │ +│ - Identify constraints │ +└─────────────────────┬───────────────┘ + ↓ +┌─────────────────────────────────────┐ +│ Schema-Aware Generation │ +│ - Map entities to schema fields │ +│ - Apply type constraints │ +│ - Include required fields │ +│ - Generate valid Nickel │ +└─────────────────────┬───────────────┘ + ↓ +┌─────────────────────────────────────┐ +│ Validation & Refinement │ +│ - Type checking │ +│ - Schema validation │ +│ - Policy compliance │ +│ - Security checks │ +└─────────────────────┬───────────────┘ + ↓ +┌─────────────────────────────────────┐ +│ Output & Explanation │ +│ - Generated Nickel config │ +│ - Decision rationale │ +│ - Alternative suggestions │ +│ - Warnings if any │ +└─────────────────────────────────────┘ +``` + +## Planned Implementation Details + +### 1. Intent Extraction + +Extract structured intent from natural language: + +```text +Input: "Create a production PostgreSQL cluster with encryption and backups" + +Extracted Intent: +{ + resource_type: "database", + engine: "postgresql", + environment: "production", + requirements: [ + {constraint: "encryption", type: "boolean", value: true}, + {constraint: "backups", type: "enabled", frequency: "daily"}, + ], + modifiers: ["production"], +} +``` + +### 2. Entity Mapping + +Map natural language entities to schema fields: + +```text +Description Terms → Schema Fields: + "100GB storage" → database.instance.allocated_storage_gb = 100 + "daily backups" → backup.enabled = true, backup.frequency = "daily" + "encryption" → security.encryption_enabled = true + "cross-region" → backup.copy_to_region = "us-west-2" + "PostgreSQL 15" → database.engine_version = "15.0" +``` + +### 3. Prompt Engineering + +Sophisticated prompting for schema-aware generation: + +```text +System Prompt: +You are generating Nickel infrastructure configurations. +Generate ONLY valid Nickel syntax. +Follow these rules: +- Use record syntax: `field = value` +- Type annotations must be valid +- All required fields must be present +- Apply best practices for [ENVIRONMENT] + +Schema Context: +[Database schema from provisioning/schemas/database.ncl] + +Examples: +[3 relevant examples from RAG] + +User Request: +[User natural language description] + +Generate the complete Nickel configuration. +Start with: let { database = { +``` + +### 4. 
Iterative Refinement
+
+Handle generation errors through iteration:
+
+```text
+Attempt 1: Generate initial config
+  ↓ Validate
+  ✗ Error: field `version` type mismatch (string vs number)
+  ↓ Re-prompt with error
+Attempt 2: Fix with context from error
+  ↓ Validate
+  ✓ Success: Config is valid
+```
+
+## Command Interface
+
+### CLI Usage
+
+```text
+# Simple generation
+provisioning ai generate "PostgreSQL database for production"
+
+# With schema specification
+provisioning ai generate \
+  --schema database \
+  "Create PostgreSQL 15 with encryption and daily backups"
+
+# Interactive generation (refine output)
+provisioning ai generate --interactive \
+  "Kubernetes cluster on AWS"
+
+# Generate and validate
+provisioning ai generate \
+  --validate \
+  "Production Redis cluster with sentinel"
+
+# Generate and save directly
+provisioning ai generate \
+  --schema database \
+  --output workspaces/prod/database.ncl \
+  "PostgreSQL production setup"
+
+# Batch generation from file
+provisioning ai generate --batch descriptions.yaml
+```
+
+### Interactive Refinement
+
+```text
+$ provisioning ai generate --interactive
+> Describe infrastructure: Create production PostgreSQL cluster
+
+Generated configuration shown.
+
+> Refine: Add cross-region backup to us-west-2
+Configuration updated.
+
+> Refine: Use larger instance class for performance
+Configuration updated.
+
+> Accept? [y/n]: y
+Configuration saved to: workspaces/prod/database.ncl
+```
+
+## Example Generations
+
+### Example 1: Simple Database
+
+**Input**:
+```text
+"PostgreSQL database with 50GB storage and encryption"
+```
+
+**Output**:
+```text
+let {
+  database = {
+    engine = "postgresql",
+    version = "15.0",
+
+    instance = {
+      instance_class = "db.t3.medium",
+      allocated_storage_gb = 50,
+      iops = 1000,
+    },
+
+    security = {
+      encryption_enabled = true,
+      tls_enabled = true,
+      tls_version = "1.3",
+    },
+
+    backup = {
+      enabled = true,
+      retention_days = 7,
+    },
+  }
+}
+```
+
+**Rationale**:
+- PostgreSQL 15 is current stable version
+- db.t3.medium suitable for 50GB with general workload
+- Encryption enabled per requirement
+- Automatic backups with 7-day retention (default)
+
+### Example 2: Complex Kubernetes Setup
+
+**Input**:
+```text
+"Production Kubernetes cluster in AWS with 3 availability zones,
+auto-scaling from 3 to 10 nodes, managed PostgreSQL, and monitoring"
+```
+
+**Output**:
+```text
+let {
+  kubernetes = {
+    version = "1.28.0",
+
+    cluster = {
+      name = "prod-cluster",
+      region = "us-east-1",
+      availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"],
+    },
+
+    node_group = {
+      min_size = 3,
+      max_size = 10,
+      desired_size = 3,
+      instance_type = "t3.large",
+
+      auto_scaling = {
+        enabled = true,
+        target_cpu = 70,
+        scale_down_delay = 300,
+      },
+    },
+
+    managed_services = {
+      postgres = {
+        enabled = true,
+        engine = "postgresql",
+        version = "15.0",
+        storage_gb = 100,
+      },
+    },
+
+    monitoring = {
+      prometheus = {enabled = true},
+      grafana = {enabled = true},
+      cloudwatch_integration = true,
+    },
+
+    networking = {
+      vpc_cidr = "10.0.0.0/16",
+      enable_nat_gateway = true,
+      enable_dns_hostnames = true,
+    },
+  }
+}
+```
+
+**Rationale**:
+- 3 AZs for high availability
+- t3.large balances cost and performance for general workload
+- Auto-scaling target 70% CPU (best practice)
+- Managed PostgreSQL reduces operational overhead
+- Full observability with Prometheus + Grafana
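+
+Since NLC is still in the design phase, the iterative refinement cycle described above can be prototyped against any LLM client today. A minimal
+sketch of the generate-validate-re-prompt loop, using hypothetical `llm_generate` and `validate_nickel` helpers (neither is a shipped API):
+
+```text
+def generate_with_refinement(description: str, max_attempts: int = 3) -> str:
+    """Generate a Nickel config, feeding validation errors back to the LLM."""
+    prompt = f"Generate a Nickel configuration for: {description}"
+    for attempt in range(max_attempts):
+        config = llm_generate(prompt)      # hypothetical LLM call
+        errors = validate_nickel(config)   # hypothetical schema validation
+        if not errors:
+            return config  # valid on this attempt
+        # Re-prompt with the concrete validation errors as context.
+        prompt = (
+            f"The previous configuration failed validation:\n{errors}\n"
+            f"Fix these errors and regenerate the full configuration for: {description}"
+        )
+    raise RuntimeError(f"No valid configuration after {max_attempts} attempts")
+```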
+
+## Configuration and Constraints
+
+### Configurable Generation Parameters
+
+```text
+# In provisioning/config/ai.toml
+[ai.generation]
+# Which schema to use by default
+default_schema = "database"
+
+# Whether to require explicit environment specification
+require_environment = false
+
+# Optimization targets
+optimization_target = "balanced"  # or "cost", "performance"
+
+# Best practices to always apply
+best_practices = [
+  "encryption",
+  "high_availability",
+  "monitoring",
+  "backup",
+]
+
+# Constraints that limit generation
+[ai.generation.constraints]
+min_storage_gb = 10
+max_instances = 100
+allowed_engines = ["postgresql", "mysql", "mongodb"]
+
+# Validation before accepting generated config
+[ai.generation.validation]
+strict_mode = true
+require_security_review = false
+require_compliance_check = true
+```
+
+### Safety Guardrails
+
+1. **Required Fields**: All schema required fields must be present
+2. **Type Validation**: Generated values must match schema types
+3. **Security Checks**: Encryption/backups enabled for production
+4. **Cost Estimation**: Warn if projected cost exceeds threshold
+5. **Resource Limits**: Enforce organizational constraints
+6. **Policy Compliance**: Check against Cedar policies
+
+## User Workflow
+
+### Typical Usage Session
+
+```text
+# 1. Describe infrastructure need
+$ provisioning ai generate "I need a database for my web app"
+
+# System generates basic config, suggests refinements
+# Generated config shown with explanations
+
+# 2. Refine if needed
+$ provisioning ai generate --interactive
+
+# 3. Review and validate
+$ provisioning ai validate workspaces/dev/database.ncl
+
+# 4. Deploy
+$ provisioning workspace apply workspaces/dev
+
+# 5. Monitor
+$ provisioning workspace logs database
+```
+
+## Integration with Other Systems
+
+### RAG Integration
+
+NLC uses RAG to find similar configurations:
+
+```text
+User: "Create Kubernetes cluster"
+  ↓
+RAG searches for:
+  - Existing Kubernetes configs in workspaces
+  - Kubernetes documentation and examples
+  - Best practices from provisioning/docs/guides/kubernetes.md
+  ↓
+Context fed to LLM for generation
+```
+
+### Form Assistance
+
+NLC and form assistance share components:
+
+- Intent extraction for pre-filling forms
+- Constraint validation for form field values
+- Explanation generation for validation errors
+
+### CLI Integration
+
+```text
+# Generate then preview
+provisioning ai generate "PostgreSQL prod" | \
+  provisioning config preview
+
+# Generate and apply
+provisioning ai generate \
+  --apply \
+  --environment prod \
+  "PostgreSQL cluster"
+```
+
+## Testing and Validation
+
+### Test Cases (Planned)
+
+1. **Simple Descriptions**: Single resource, few requirements
+   - "PostgreSQL database"
+   - "Redis cache"
+
+2. **Complex Descriptions**: Multiple resources, constraints
+   - "Kubernetes with managed database and monitoring"
+   - "Multi-region deployment with failover"
+
+3. **Edge Cases**:
+   - Conflicting requirements
+   - Ambiguous specifications
+   - Deprecated technologies
+
+4.
**Refinement Cycles**: + - Interactive generation with multiple refines + - Error recovery and re-prompting + - User feedback incorporation + +## Success Criteria (Q2 2025) + +- ✅ Generates valid Nickel for 90% of user descriptions +- ✅ Generated configs pass all schema validation +- ✅ Supports top 10 infrastructure patterns +- ✅ Interactive refinement works smoothly +- ✅ Error messages explain issues clearly +- ✅ User testing with non-experts succeeds +- ✅ Documentation complete with examples +- ✅ Integration with form assistance operational + +## Related Documentation + +- [Architecture](architecture.md) - AI system overview +- [AI-Assisted Forms](ai-assisted-forms.md) - Related form feature +- [RAG System](rag-system.md) - Context retrieval +- [Configuration](configuration.md) - Setup guide +- [ADR-015](../architecture/adr/adr-015-ai-integration-architecture.md) - Design decisions + +--- + +**Status**: 🔴 Planned +**Target Release**: Q2 2025 +**Last Updated**: 2025-01-13 +**Architecture**: Complete +**Implementation**: In Design Phase \ No newline at end of file diff --git a/docs/src/ai/rag-system.md b/docs/src/ai/rag-system.md index b6b7b93..7808f93 100644 --- a/docs/src/ai/rag-system.md +++ b/docs/src/ai/rag-system.md @@ -1 +1,450 @@ -# Retrieval-Augmented Generation (RAG) System\n\n**Status**: ✅ Production-Ready (SurrealDB 1.5.0+, 22/22 tests passing)\n\nThe RAG system enables the AI service to access, retrieve, and reason over infrastructure documentation, schemas, and past configurations. This allows\nthe AI to generate contextually accurate infrastructure configurations and provide intelligent troubleshooting advice grounded in actual platform\nknowledge.\n\n## Architecture Overview\n\nThe RAG system consists of:\n\n1. **Document Store**: SurrealDB vector store with semantic indexing\n2. **Hybrid Search**: Vector similarity + BM25 keyword search\n3. **Chunk Management**: Intelligent document chunking for code and markdown\n4. **Context Ranking**: Relevance scoring for retrieved documents\n5. **Semantic Cache**: Deduplication of repeated queries\n\n## Core Components\n\n### 1. Vector Embeddings\n\nThe system uses embedding models to convert documents into vector representations:\n\n```\n┌─────────────────────┐\n│ Document Source │\n│ (Markdown, Code) │\n└──────────┬──────────┘\n │\n ▼\n┌──────────────────────────────────┐\n│ Chunking & Tokenization │\n│ - Code-aware splits │\n│ - Markdown aware │\n│ - Preserves context │\n└──────────┬───────────────────────┘\n │\n ▼\n┌──────────────────────────────────┐\n│ Embedding Model │\n│ (OpenAI Ada, Anthropic, Local) │\n└──────────┬───────────────────────┘\n │\n ▼\n┌──────────────────────────────────┐\n│ Vector Storage (SurrealDB) │\n│ - Vector index │\n│ - Metadata indexed │\n│ - BM25 index for keywords │\n└──────────────────────────────────┘\n```\n\n### 2. 
SurrealDB Integration\n\nSurrealDB serves as the vector database and knowledge store:\n\n```\n# Configuration in provisioning/schemas/ai.ncl\nlet {\n rag = {\n enabled = true,\n db_url = "surreal://localhost:8000",\n namespace = "provisioning",\n database = "ai_rag",\n \n # Collections for different document types\n collections = {\n documentation = {\n chunking_strategy = "markdown",\n chunk_size = 1024,\n overlap = 256,\n },\n schemas = {\n chunking_strategy = "code",\n chunk_size = 512,\n overlap = 128,\n },\n deployments = {\n chunking_strategy = "json",\n chunk_size = 2048,\n overlap = 512,\n },\n },\n \n # Embedding configuration\n embedding = {\n provider = "openai", # or "anthropic", "local"\n model = "text-embedding-3-small",\n cache_vectors = true,\n },\n \n # Search configuration\n search = {\n hybrid_enabled = true,\n vector_weight = 0.7,\n keyword_weight = 0.3,\n top_k = 5, # Number of results to return\n semantic_cache = true,\n },\n }\n}\n```\n\n### 3. Document Chunking\n\nIntelligent chunking preserves context while managing token limits:\n\n#### Markdown Chunking Strategy\n\n```\nInput Document: provisioning/docs/src/guides/from-scratch.md\n\nChunks:\n [1] Header + first section (up to 1024 tokens)\n [2] Next logical section + overlap with [1]\n [3] Code examples preserve as atomic units\n [4] Continue with overlap...\n\nEach chunk includes:\n - Original section heading (for context)\n - Content\n - Source file and line numbers\n - Metadata (doctype, category, version)\n```\n\n#### Code Chunking Strategy\n\n```\nInput Document: provisioning/schemas/main.ncl\n\nChunks:\n [1] Top-level let binding + comments\n [2] Function definition (atomic, preserves signature)\n [3] Type definition (atomic, preserves interface)\n [4] Implementation blocks with context overlap\n\nEach chunk preserves:\n - Type signatures\n - Function signatures\n - Import statements needed for context\n - Comments and docstrings\n```\n\n## Hybrid Search\n\nThe system implements dual search strategy for optimal results:\n\n### Vector Similarity Search\n\n```\n// Find semantically similar documents\nasync fn vector_search(query: &str, top_k: usize) -> Vec {\n let embedding = embed(query).await?;\n \n // L2 distance in SurrealDB\n db.query("\n SELECT *, vector::similarity::cosine(embedding, $embedding) AS score\n FROM documents\n WHERE embedding <~> $embedding\n ORDER BY score DESC\n LIMIT $top_k\n ")\n .bind(("embedding", embedding))\n .bind(("top_k", top_k))\n .await\n}\n```\n\n**Use case**: Semantic understanding of intent\n- Query: "How to configure PostgreSQL"\n- Finds: Documents about database configuration, examples, schemas\n\n### BM25 Keyword Search\n\n```\n// Find documents with matching keywords\nasync fn keyword_search(query: &str, top_k: usize) -> Vec {\n // BM25 full-text search in SurrealDB\n db.query("\n SELECT *, search::bm25(.) 
AS score\n FROM documents\n WHERE text @@ $query\n ORDER BY score DESC\n LIMIT $top_k\n ")\n .bind(("query", query))\n .bind(("top_k", top_k))\n .await\n}\n```\n\n**Use case**: Exact term matching\n- Query: "SurrealDB configuration"\n- Finds: Documents mentioning SurrealDB specifically\n\n### Hybrid Results\n\n```\nasync fn hybrid_search(\n query: &str,\n vector_weight: f32,\n keyword_weight: f32,\n top_k: usize,\n) -> Vec {\n let vector_results = vector_search(query, top_k * 2).await?;\n let keyword_results = keyword_search(query, top_k * 2).await?;\n \n let mut scored = HashMap::new();\n \n // Score from vector search\n for (i, doc) in vector_results.iter().enumerate() {\n *scored.entry(doc.id).or_insert(0.0) +=\n vector_weight * (1.0 - (i as f32 / top_k as f32));\n }\n \n // Score from keyword search\n for (i, doc) in keyword_results.iter().enumerate() {\n *scored.entry(doc.id).or_insert(0.0) +=\n keyword_weight * (1.0 - (i as f32 / top_k as f32));\n }\n \n // Return top-k by combined score\n let mut results: Vec<_> = scored.into_iter().collect();\n| results.sort_by( | a, b | b.1.partial_cmp(&a.1).unwrap()); |\n| Ok(results.into_iter().take(top_k).map( | (id, _) | ...).collect()) |\n}\n```\n\n## Semantic Caching\n\nReduces API calls by caching embeddings of repeated queries:\n\n```\nstruct SemanticCache {\n queries: Arc, CachedResult>>,\n similarity_threshold: f32,\n}\n\nimpl SemanticCache {\n async fn get(&self, query: &str) -> Option {\n let embedding = embed(query).await?;\n \n // Find cached query with similar embedding\n // (cosine distance < threshold)\n for entry in self.queries.iter() {\n let distance = cosine_distance(&embedding, entry.key());\n if distance < self.similarity_threshold {\n return Some(entry.value().clone());\n }\n }\n None\n }\n \n async fn insert(&self, query: &str, result: CachedResult) {\n let embedding = embed(query).await?;\n self.queries.insert(embedding, result);\n }\n}\n```\n\n**Benefits**:\n- 50-80% reduction in embedding API calls\n- Identical queries return in <10ms\n- Similar queries reuse cached context\n\n## Ingestion Workflow\n\n### Document Indexing\n\n```\n# Index all documentation\nprovisioning ai index-docs provisioning/docs/src\n\n# Index schemas\nprovisioning ai index-schemas provisioning/schemas\n\n# Index past deployments\nprovisioning ai index-deployments workspaces/*/deployments\n\n# Watch directory for changes (development mode)\nprovisioning ai watch docs provisioning/docs/src\n```\n\n### Programmatic Indexing\n\n```\n// In ai-service on startup\nasync fn initialize_rag() -> Result<()> {\n let rag = RAGSystem::new(&config.rag).await?;\n \n // Index documentation\n let docs = load_markdown_docs("provisioning/docs/src")?;\n for doc in docs {\n rag.ingest_document(&doc).await?;\n }\n \n // Index schemas\n let schemas = load_nickel_schemas("provisioning/schemas")?;\n for schema in schemas {\n rag.ingest_schema(&schema).await?;\n }\n \n Ok(())\n}\n```\n\n## Usage Examples\n\n### Query the RAG System\n\n```\n# Search for context-aware information\nprovisioning ai query "How do I configure PostgreSQL with encryption?"\n\n# Get configuration template\nprovisioning ai template "Describe production Kubernetes on AWS"\n\n# Interactive mode\nprovisioning ai chat\n> What are the best practices for database backup?\n```\n\n### AI Service Integration\n\n```\n// AI service uses RAG to enhance generation\nasync fn generate_config(user_request: &str) -> Result {\n // Retrieve relevant context\n let context = rag.search(user_request, top_k=5).await?;\n 
\n // Build prompt with context\n let prompt = build_prompt_with_context(user_request, &context);\n \n // Generate configuration\n let config = llm.generate(&prompt).await?;\n \n // Validate against schemas\n validate_nickel_config(&config)?;\n \n Ok(config)\n}\n```\n\n### Form Assistance Integration\n\n```\n// In typdialog-ai (JavaScript/TypeScript)\nasync function suggestFieldValue(fieldName, currentInput) {\n // Query RAG for similar configurations\n const context = await rag.search(\n `Field: ${fieldName}, Input: ${currentInput}`,\n { topK: 3, semantic: true }\n );\n \n // Generate suggestion using context\n const suggestion = await ai.suggest({\n field: fieldName,\n input: currentInput,\n context: context,\n });\n \n return suggestion;\n}\n```\n\n## Performance Characteristics\n\n| | Operation | Time | Cache Hit | |\n| | ----------- | ------ | ----------- | |\n| | Vector embedding | 200-500ms | N/A | |\n| | Vector search (cold) | 300-800ms | N/A | |\n| | Keyword search | 50-200ms | N/A | |\n| | Hybrid search | 500-1200ms | <100ms cached | |\n| | Semantic cache hit | 10-50ms | Always | |\n\n**Typical query flow**:\n1. Embedding: 300ms\n2. Vector search: 400ms\n3. Keyword search: 100ms\n4. Ranking: 50ms\n5. **Total**: ~850ms (first call), <100ms (cached)\n\n## Configuration\n\nSee [Configuration Guide](configuration.md) for detailed RAG setup:\n\n- LLM provider for embeddings\n- SurrealDB connection\n- Chunking strategies\n- Search weights and limits\n- Cache settings and TTLs\n\n## Limitations and Considerations\n\n### Document Freshness\n\n- RAG indexes static snapshots\n- Changes to documentation require re-indexing\n- Use watch mode during development\n\n### Token Limits\n\n- Large documents chunked to fit LLM context\n- Some context may be lost in chunking\n- Adjustable chunk size vs. context trade-off\n\n### Embedding Quality\n\n- Quality depends on embedding model\n- Domain-specific models perform better\n- Fine-tuning possible for specialized vocabularies\n\n## Monitoring and Debugging\n\n### Query Metrics\n\n```\n# View RAG search metrics\nprovisioning ai metrics show rag\n\n# Analysis of search quality\nprovisioning ai eval-rag --sample-queries 100\n```\n\n### Debug Mode\n\n```\n# In provisioning/config/ai.toml\n[ai.rag.debug]\nenabled = true\nlog_embeddings = true # Log embedding vectors\nlog_search_scores = true # Log relevance scores\nlog_context_used = true # Log context retrieved\n```\n\n## Related Documentation\n\n- [Architecture](architecture.md) - AI system overview\n- [MCP Integration](mcp-integration.md) - RAG access via MCP\n- [Configuration](configuration.md) - RAG setup guide\n- [API Reference](api-reference.md) - RAG API endpoints\n- [ADR-015](../architecture/adr/adr-015-ai-integration-architecture.md) - Design decisions\n\n---\n\n**Last Updated**: 2025-01-13\n**Status**: ✅ Production-Ready\n**Test Coverage**: 22/22 tests passing\n**Database**: SurrealDB 1.5.0+ +# Retrieval-Augmented Generation (RAG) System + +**Status**: ✅ Production-Ready (SurrealDB 1.5.0+, 22/22 tests passing) + +The RAG system enables the AI service to access, retrieve, and reason over infrastructure documentation, schemas, and past configurations. This allows +the AI to generate contextually accurate infrastructure configurations and provide intelligent troubleshooting advice grounded in actual platform +knowledge. + +## Architecture Overview + +The RAG system consists of: + +1. **Document Store**: SurrealDB vector store with semantic indexing +2. 
**Hybrid Search**: Vector similarity + BM25 keyword search +3. **Chunk Management**: Intelligent document chunking for code and markdown +4. **Context Ranking**: Relevance scoring for retrieved documents +5. **Semantic Cache**: Deduplication of repeated queries + +## Core Components + +### 1. Vector Embeddings + +The system uses embedding models to convert documents into vector representations: + +```text +┌─────────────────────┐ +│ Document Source │ +│ (Markdown, Code) │ +└──────────┬──────────┘ + │ + ▼ +┌──────────────────────────────────┐ +│ Chunking & Tokenization │ +│ - Code-aware splits │ +│ - Markdown aware │ +│ - Preserves context │ +└──────────┬───────────────────────┘ + │ + ▼ +┌──────────────────────────────────┐ +│ Embedding Model │ +│ (OpenAI Ada, Anthropic, Local) │ +└──────────┬───────────────────────┘ + │ + ▼ +┌──────────────────────────────────┐ +│ Vector Storage (SurrealDB) │ +│ - Vector index │ +│ - Metadata indexed │ +│ - BM25 index for keywords │ +└──────────────────────────────────┘ +``` + +### 2. SurrealDB Integration + +SurrealDB serves as the vector database and knowledge store: + +```text +# Configuration in provisioning/schemas/ai.ncl +let { + rag = { + enabled = true, + db_url = "surreal://localhost:8000", + namespace = "provisioning", + database = "ai_rag", + + # Collections for different document types + collections = { + documentation = { + chunking_strategy = "markdown", + chunk_size = 1024, + overlap = 256, + }, + schemas = { + chunking_strategy = "code", + chunk_size = 512, + overlap = 128, + }, + deployments = { + chunking_strategy = "json", + chunk_size = 2048, + overlap = 512, + }, + }, + + # Embedding configuration + embedding = { + provider = "openai", # or "anthropic", "local" + model = "text-embedding-3-small", + cache_vectors = true, + }, + + # Search configuration + search = { + hybrid_enabled = true, + vector_weight = 0.7, + keyword_weight = 0.3, + top_k = 5, # Number of results to return + semantic_cache = true, + }, + } +} +``` + +### 3. Document Chunking + +Intelligent chunking preserves context while managing token limits: + +#### Markdown Chunking Strategy + +```text +Input Document: provisioning/docs/src/guides/from-scratch.md + +Chunks: + [1] Header + first section (up to 1024 tokens) + [2] Next logical section + overlap with [1] + [3] Code examples preserve as atomic units + [4] Continue with overlap... 
+
+Each chunk includes:
+  - Original section heading (for context)
+  - Content
+  - Source file and line numbers
+  - Metadata (doctype, category, version)
+```
+
+#### Code Chunking Strategy
+
+```text
+Input Document: provisioning/schemas/main.ncl
+
+Chunks:
+  [1] Top-level let binding + comments
+  [2] Function definition (atomic, preserves signature)
+  [3] Type definition (atomic, preserves interface)
+  [4] Implementation blocks with context overlap
+
+Each chunk preserves:
+  - Type signatures
+  - Function signatures
+  - Import statements needed for context
+  - Comments and docstrings
+```
+
+## Hybrid Search
+
+The system implements a dual search strategy for optimal results:
+
+### Vector Similarity Search
+
+```text
+// Find semantically similar documents
+async fn vector_search(query: &str, top_k: usize) -> Result<Vec<Document>> {
+    let embedding = embed(query).await?;
+
+    // Cosine similarity in SurrealDB
+    db.query("
+        SELECT *, vector::similarity::cosine(embedding, $embedding) AS score
+        FROM documents
+        WHERE embedding <~> $embedding
+        ORDER BY score DESC
+        LIMIT $top_k
+    ")
+    .bind(("embedding", embedding))
+    .bind(("top_k", top_k))
+    .await
+}
+```
+
+**Use case**: Semantic understanding of intent
+- Query: "How to configure PostgreSQL"
+- Finds: Documents about database configuration, examples, schemas
+
+### BM25 Keyword Search
+
+```text
+// Find documents with matching keywords
+async fn keyword_search(query: &str, top_k: usize) -> Result<Vec<Document>> {
+    // BM25 full-text search in SurrealDB
+    db.query("
+        SELECT *, search::bm25(.) AS score
+        FROM documents
+        WHERE text @@ $query
+        ORDER BY score DESC
+        LIMIT $top_k
+    ")
+    .bind(("query", query))
+    .bind(("top_k", top_k))
+    .await
+}
+```
+
+**Use case**: Exact term matching
+- Query: "SurrealDB configuration"
+- Finds: Documents mentioning SurrealDB specifically
+
+### Hybrid Results
+
+```text
+async fn hybrid_search(
+    query: &str,
+    vector_weight: f32,
+    keyword_weight: f32,
+    top_k: usize,
+) -> Result<Vec<Document>> {
+    let vector_results = vector_search(query, top_k * 2).await?;
+    let keyword_results = keyword_search(query, top_k * 2).await?;
+
+    let mut scored = HashMap::new();
+
+    // Score from vector search
+    for (i, doc) in vector_results.iter().enumerate() {
+        *scored.entry(doc.id).or_insert(0.0) +=
+            vector_weight * (1.0 - (i as f32 / top_k as f32));
+    }
+
+    // Score from keyword search
+    for (i, doc) in keyword_results.iter().enumerate() {
+        *scored.entry(doc.id).or_insert(0.0) +=
+            keyword_weight * (1.0 - (i as f32 / top_k as f32));
+    }
+
+    // Return top-k by combined score
+    let mut results: Vec<_> = scored.into_iter().collect();
+    results.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
+    Ok(results.into_iter().take(top_k).map(|(id, _)| ...).collect())
+}
+```
+
+## Semantic Caching
+
+Reduces API calls by caching embeddings of repeated queries:
+
+```text
+struct SemanticCache {
+    queries: Arc<DashMap<Vec<f32>, CachedResult>>,
+    similarity_threshold: f32,
+}
+
+impl SemanticCache {
+    async fn get(&self, query: &str) -> Option<CachedResult> {
+        let embedding = embed(query).await.ok()?;
+
+        // Find cached query with similar embedding
+        // (cosine distance < threshold)
+        for entry in self.queries.iter() {
+            let distance = cosine_distance(&embedding, entry.key());
+            if distance < self.similarity_threshold {
+                return Some(entry.value().clone());
+            }
+        }
+        None
+    }
+
+    async fn insert(&self, query: &str, result: CachedResult) {
+        if let Ok(embedding) = embed(query).await {
+            self.queries.insert(embedding, result);
+        }
+    }
+}
+```
+
+**Benefits**:
+- 50-80% reduction in embedding API calls
+- Identical queries return in <10ms
+- Similar queries reuse cached context
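+
+The chunking strategies described earlier feed the ingestion pipeline below. As a reference point, a minimal sketch of fixed-size chunking with
+overlap, approximating tokens by whitespace-separated words; the platform's chunkers are code- and markdown-aware, so this is only the skeleton:
+
+```text
+def chunk_text(text: str, chunk_size: int = 1024, overlap: int = 256) -> list[str]:
+    """Split text into overlapping chunks, sized in whitespace tokens."""
+    tokens = text.split()
+    step = chunk_size - overlap  # how far the window advances each chunk
+    chunks = []
+    for start in range(0, len(tokens), step):
+        window = tokens[start:start + chunk_size]
+        chunks.append(" ".join(window))
+        if start + chunk_size >= len(tokens):
+            break  # final window reached the end of the document
+    return chunks
+```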
+
+## Ingestion Workflow
+
+### Document Indexing
+
+```text
+# Index all documentation
+provisioning ai index-docs provisioning/docs/src
+
+# Index schemas
+provisioning ai index-schemas provisioning/schemas
+
+# Index past deployments
+provisioning ai index-deployments workspaces/*/deployments
+
+# Watch directory for changes (development mode)
+provisioning ai watch docs provisioning/docs/src
+```
+
+### Programmatic Indexing
+
+```text
+// In ai-service on startup
+async fn initialize_rag() -> Result<()> {
+    let rag = RAGSystem::new(&config.rag).await?;
+
+    // Index documentation
+    let docs = load_markdown_docs("provisioning/docs/src")?;
+    for doc in docs {
+        rag.ingest_document(&doc).await?;
+    }
+
+    // Index schemas
+    let schemas = load_nickel_schemas("provisioning/schemas")?;
+    for schema in schemas {
+        rag.ingest_schema(&schema).await?;
+    }
+
+    Ok(())
+}
+```
+
+## Usage Examples
+
+### Query the RAG System
+
+```text
+# Search for context-aware information
+provisioning ai query "How do I configure PostgreSQL with encryption?"
+
+# Get configuration template
+provisioning ai template "Describe production Kubernetes on AWS"
+
+# Interactive mode
+provisioning ai chat
+> What are the best practices for database backup?
+```
+
+### AI Service Integration
+
+```text
+// AI service uses RAG to enhance generation
+async fn generate_config(user_request: &str) -> Result<String> {
+    // Retrieve relevant context (top_k = 5)
+    let context = rag.search(user_request, 5).await?;
+
+    // Build prompt with context
+    let prompt = build_prompt_with_context(user_request, &context);
+
+    // Generate configuration
+    let config = llm.generate(&prompt).await?;
+
+    // Validate against schemas
+    validate_nickel_config(&config)?;
+
+    Ok(config)
+}
+```
+
+### Form Assistance Integration
+
+```text
+// In typdialog-ai (JavaScript/TypeScript)
+async function suggestFieldValue(fieldName, currentInput) {
+  // Query RAG for similar configurations
+  const context = await rag.search(
+    `Field: ${fieldName}, Input: ${currentInput}`,
+    { topK: 3, semantic: true }
+  );
+
+  // Generate suggestion using context
+  const suggestion = await ai.suggest({
+    field: fieldName,
+    input: currentInput,
+    context: context,
+  });
+
+  return suggestion;
+}
+```
+
+## Performance Characteristics
+
+| Operation | Time | Cache Hit |
+| --------- | ---- | --------- |
+| Vector embedding | 200-500ms | N/A |
+| Vector search (cold) | 300-800ms | N/A |
+| Keyword search | 50-200ms | N/A |
+| Hybrid search | 500-1200ms | <100ms cached |
+| Semantic cache hit | 10-50ms | Always |
+
+**Typical query flow**:
+1. Embedding: 300ms
+2. Vector search: 400ms
+3. Keyword search: 100ms
+4. Ranking: 50ms
+5. **Total**: ~850ms (first call), <100ms (cached)
+
+## Configuration
+
+See [Configuration Guide](configuration.md) for detailed RAG setup:
+
+- LLM provider for embeddings
+- SurrealDB connection
+- Chunking strategies
+- Search weights and limits
+- Cache settings and TTLs
+
+## Limitations and Considerations
+
+### Document Freshness
+
+- RAG indexes static snapshots
+- Changes to documentation require re-indexing
+- Use watch mode during development
+
+### Token Limits
+
+- Large documents chunked to fit LLM context
+- Some context may be lost in chunking
+- Adjustable chunk size vs.
context trade-off + +### Embedding Quality + +- Quality depends on embedding model +- Domain-specific models perform better +- Fine-tuning possible for specialized vocabularies + +## Monitoring and Debugging + +### Query Metrics + +```text +# View RAG search metrics +provisioning ai metrics show rag + +# Analysis of search quality +provisioning ai eval-rag --sample-queries 100 +``` + +### Debug Mode + +```text +# In provisioning/config/ai.toml +[ai.rag.debug] +enabled = true +log_embeddings = true # Log embedding vectors +log_search_scores = true # Log relevance scores +log_context_used = true # Log context retrieved +``` + +## Related Documentation + +- [Architecture](architecture.md) - AI system overview +- [MCP Integration](mcp-integration.md) - RAG access via MCP +- [Configuration](configuration.md) - RAG setup guide +- [API Reference](api-reference.md) - RAG API endpoints +- [ADR-015](../architecture/adr/adr-015-ai-integration-architecture.md) - Design decisions + +--- + +**Last Updated**: 2025-01-13 +**Status**: ✅ Production-Ready +**Test Coverage**: 22/22 tests passing +**Database**: SurrealDB 1.5.0+ \ No newline at end of file diff --git a/docs/src/ai/security-policies.md b/docs/src/ai/security-policies.md index ed9e33c..7a1e005 100644 --- a/docs/src/ai/security-policies.md +++ b/docs/src/ai/security-policies.md @@ -1 +1,537 @@ -# AI Security Policies and Cedar Authorization\n\n**Status**: ✅ Production-Ready (Cedar integration, policy enforcement)\n\nComprehensive documentation of security controls, authorization policies, and data protection mechanisms for the AI system. All AI operations are\ncontrolled through Cedar policies and include strict secret isolation.\n\n## Security Model Overview\n\n### Defense in Depth\n\n```\n┌─────────────────────────────────────────┐\n│ User Request to AI │\n└──────────────┬──────────────────────────┘\n ↓\n┌─────────────────────────────────────────┐\n│ Layer 1: Authentication │\n│ - Verify user identity │\n│ - Validate API token/credentials │\n└──────────────┬──────────────────────────┘\n ↓\n┌─────────────────────────────────────────┐\n│ Layer 2: Authorization (Cedar) │\n│ - Check if user can access AI features │\n│ - Verify workspace permissions │\n│ - Check role-based access │\n└──────────────┬──────────────────────────┘\n ↓\n┌─────────────────────────────────────────┐\n│ Layer 3: Data Sanitization │\n│ - Remove secrets from data │\n│ - Redact PII │\n│ - Filter sensitive information │\n└──────────────┬──────────────────────────┘\n ↓\n┌─────────────────────────────────────────┐\n│ Layer 4: Request Validation │\n│ - Check request parameters │\n│ - Verify resource constraints │\n│ - Apply rate limits │\n└──────────────┬──────────────────────────┘\n ↓\n┌─────────────────────────────────────────┐\n│ Layer 5: External API Call │\n│ - Only if all previous checks pass │\n│ - Encrypted TLS connection │\n│ - No secrets in request │\n└──────────────┬──────────────────────────┘\n ↓\n┌─────────────────────────────────────────┐\n│ Layer 6: Audit Logging │\n│ - Log all AI operations │\n│ - Capture user, time, action │\n│ - Store in tamper-proof log │\n└─────────────────────────────────────────┘\n```\n\n## Cedar Policies\n\n### Policy Engine Setup\n\n```\n// File: provisioning/policies/ai-policies.cedar\n\n// Core principle: Least privilege\n// All actions denied by default unless explicitly allowed\n\n// Admin users can access all AI features\npermit(\n principal == ?principal,\n action == Action::"ai_generate_config",\n resource == ?resource\n)\nwhen {\n 
principal.role == "admin"\n};\n\n// Developers can use AI within their workspace\npermit(\n principal == ?principal,\n action in [\n Action::"ai_query",\n Action::"ai_generate_config",\n Action::"ai_troubleshoot"\n ],\n resource == ?resource\n)\nwhen {\n principal.role in ["developer", "senior_engineer"]\n && principal.workspace == resource.workspace\n};\n\n// Operators can access troubleshooting and queries\npermit(\n principal == ?principal,\n action in [\n Action::"ai_query",\n Action::"ai_troubleshoot"\n ],\n resource == ?resource\n)\nwhen {\n principal.role in ["operator", "devops"]\n};\n\n// Form assistance enabled for all authenticated users\npermit(\n principal == ?principal,\n action == Action::"ai_form_assistance",\n resource == ?resource\n)\nwhen {\n principal.authenticated == true\n};\n\n// Agents (when available) require explicit approval\npermit(\n principal == ?principal,\n action == Action::"ai_agent_execute",\n resource == ?resource\n)\nwhen {\n principal.role == "automation_admin"\n && resource.requires_approval == true\n};\n\n// MCP tool access - restrictive by default\npermit(\n principal == ?principal,\n action == Action::"mcp_tool_call",\n resource == ?resource\n)\nwhen {\n principal.role == "admin"\n| | | (principal.role == "developer" && resource.tool in ["generate_config", "validate_config"]) |\n};\n\n// Cost control policies\npermit(\n principal == ?principal,\n action == Action::"ai_generate_config",\n resource == ?resource\n)\nwhen {\n // User must have remaining budget\n principal.ai_budget_remaining_usd > resource.estimated_cost_usd\n // Workspace must be under budget\n && resource.workspace.ai_budget_remaining_usd > resource.estimated_cost_usd\n};\n```\n\n### Policy Best Practices\n\n1. **Explicit Allow**: Only allow specific actions, deny by default\n2. **Workspace Isolation**: Users can't access AI in other workspaces\n3. **Role-Based**: Use consistent role definitions\n4. **Cost-Aware**: Check budgets before operations\n5. 
**Audit Trail**: Log all policy decisions\n\n## Data Sanitization\n\n### Automatic PII Removal\n\nBefore sending data to external LLMs, the system removes:\n\n```\nPatterns Removed:\n├─ Passwords: password="...", pwd=..., etc.\n├─ API Keys: api_key=..., api-key=..., etc.\n├─ Tokens: token=..., bearer=..., etc.\n├─ Email addresses: user@example.com (unless necessary for context)\n├─ Phone numbers: +1-555-0123 patterns\n├─ Credit cards: 4111-1111-1111-1111 patterns\n├─ SSH keys: -----BEGIN RSA PRIVATE KEY-----...\n└─ AWS/GCP/Azure: AKIA2..., AIza..., etc.\n```\n\n### Configuration\n\n```\n[ai.security]\nsanitize_pii = true\nsanitize_secrets = true\n\n# Custom redaction patterns\nredact_patterns = [\n # Database passwords\n "(?i)db[_-]?password\\s*[:=]\\s*'?[^'\\n]+'?",\n # Generic secrets\n "(?i)secret\\s*[:=]\\s*'?[^'\\n]+'?",\n # API endpoints that shouldn't be logged\n "https?://api[.-]secret\\..+",\n]\n\n# Exceptions (patterns NOT to redact)\npreserve_patterns = [\n # Preserve example.com domain for docs\n "example\\.com",\n # Preserve placeholder emails\n "user@example\\.com",\n]\n```\n\n### Example Sanitization\n\n**Before**:\n```\nError configuring database:\nconnection_string: postgresql://dbadmin:MySecurePassword123@prod-db.us-east-1.rds.amazonaws.com:5432/app\napi_key: sk-ant-abc123def456\nvault_token: hvs.CAESIyg7...\n```\n\n**After Sanitization**:\n```\nError configuring database:\nconnection_string: postgresql://dbadmin:[REDACTED]@prod-db.us-east-1.rds.amazonaws.com:5432/app\napi_key: [REDACTED]\nvault_token: [REDACTED]\n```\n\n## Secret Isolation\n\n### Never Access Secrets Directly\n\nAI cannot directly access secrets. Instead:\n\n```\nUser wants: "Configure PostgreSQL with encrypted backups"\n ↓\nAI generates: Configuration schema with placeholders\n ↓\nUser inserts: Actual secret values (connection strings, passwords)\n ↓\nSystem encrypts: Secrets remain encrypted at rest\n ↓\nDeployment: Uses secrets from secure store (Vault, AWS Secrets Manager)\n```\n\n### Secret Protection Rules\n\n1. **No Direct Access**: AI never reads from Vault/Secrets Manager\n2. **Never in Logs**: Secrets never logged or stored in cache\n3. **Sanitization**: All secrets redacted before sending to LLM\n4. **Encryption**: Secrets encrypted at rest and in transit\n5. **Audit Trail**: All access to secrets logged\n6. 
**TTL**: Temporary secrets auto-expire\n\n## Local Models Support\n\n### Air-Gapped Deployments\n\nFor environments requiring zero external API calls:\n\n```\n# Deploy local Ollama with provisioning support\ndocker run -d \\n --name provisioning-ai \\n -p 11434:11434 \\n -v ollama:/root/.ollama \\n -e OLLAMA_HOST=0.0.0.0:11434 \\n ollama/ollama\n\n# Pull model\nollama pull mistral\nollama pull llama2-70b\n\n# Configure provisioning to use local model\nprovisioning config edit ai\n\n[ai]\nprovider = "local"\nmodel = "mistral"\napi_base = "[http://localhost:11434"](http://localhost:11434")\n```\n\n### Benefits\n\n- ✅ Zero external API calls\n- ✅ Full data privacy (no LLM vendor access)\n- ✅ Compliance with classified/regulated data\n- ✅ No API key exposure\n- ✅ Deterministic (same results each run)\n\n### Performance Trade-offs\n\n| | Factor | Local | Cloud | |\n| | -------- | ------- | ------- | |\n| | Privacy | Excellent | Requires trust | |\n| | Cost | Free (hardware) | Per token | |\n| | Speed | 5-30s/response | 2-5s/response | |\n| | Quality | Good (70B models) | Excellent (Opus) | |\n| | Hardware | Requires GPU | None | |\n\n## HSM Integration\n\n### Hardware Security Module Support\n\nFor highly sensitive environments:\n\n```\n[ai.security.hsm]\nenabled = true\nprovider = "aws-cloudhsm" # or "thales", "yubihsm"\n\n[ai.security.hsm.aws]\ncluster_id = "cluster-123"\ncustomer_ca_cert = "/etc/provisioning/certs/customerCA.crt"\nserver_cert = "/etc/provisioning/certs/server.crt"\nserver_key = "/etc/provisioning/certs/server.key"\n```\n\n## Encryption\n\n### Data at Rest\n\n```\n[ai.security.encryption]\nenabled = true\nalgorithm = "aes-256-gcm"\nkey_derivation = "argon2id"\n\n# Key rotation\nkey_rotation_enabled = true\nkey_rotation_days = 90\nrotation_alert_days = 7\n\n# Encrypted storage\ncache_encryption = true\nlog_encryption = true\n```\n\n### Data in Transit\n\n```\nAll external LLM API calls:\n├─ TLS 1.3 (minimum)\n├─ Certificate pinning (optional)\n├─ Mutual TLS (with cloud providers)\n└─ No plaintext transmission\n```\n\n## Audit Logging\n\n### What Gets Logged\n\n```\n{\n "timestamp": "2025-01-13T10:30:45Z",\n "event_type": "ai_action",\n "action": "generate_config",\n "principal": {\n "user_id": "user-123",\n "role": "developer",\n "workspace": "prod"\n },\n "resource": {\n "type": "database",\n "name": "prod-postgres"\n },\n "authorization": {\n "decision": "permit",\n "policy": "ai-policies.cedar",\n "reason": "developer role in workspace"\n },\n "cost": {\n "tokens_used": 1250,\n "estimated_cost_usd": 0.037\n },\n "sanitization": {\n "items_redacted": 3,\n "patterns_matched": ["db_password", "api_key", "token"]\n },\n "status": "success"\n}\n```\n\n### Audit Trail Access\n\n```\n# View recent AI actions\nprovisioning audit log ai --tail 100\n\n# Filter by user\nprovisioning audit log ai --user alice@company.com\n\n# Filter by action\nprovisioning audit log ai --action generate_config\n\n# Filter by time range\nprovisioning audit log ai --from "2025-01-01" --to "2025-01-13"\n\n# Export for analysis\nprovisioning audit export ai --format csv --output audit.csv\n\n# Full-text search\nprovisioning audit search ai "error in database configuration"\n```\n\n## Compliance Frameworks\n\n### Built-in Compliance Checks\n\n```\n[ai.compliance]\nframeworks = ["pci-dss", "hipaa", "sox", "gdpr"]\n\n[ai.compliance.pci-dss]\nenabled = true\n# Requires encryption, audit logs, access controls\n\n[ai.compliance.hipaa]\nenabled = true\n# Requires local models, encrypted storage, audit 
logs\n\n[ai.compliance.gdpr]\nenabled = true\n# Requires data deletion, consent tracking, privacy by design\n```\n\n### Compliance Reports\n\n```\n# Generate compliance report\nprovisioning audit compliance-report \\n --framework pci-dss \\n --period month \\n --output report.pdf\n\n# Verify compliance\nprovisioning audit verify-compliance \\n --framework hipaa \\n --verbose\n```\n\n## Security Best Practices\n\n### For Administrators\n\n1. **Rotate API Keys**: Every 90 days minimum\n2. **Monitor Budget**: Set up alerts at 80% and 90%\n3. **Review Policies**: Quarterly policy audit\n4. **Audit Logs**: Weekly review of AI operations\n5. **Update Models**: Use latest stable models\n6. **Test Recovery**: Monthly rollback drills\n\n### For Developers\n\n1. **Use Workspace Isolation**: Never share workspace access\n2. **Don't Log Secrets**: Use sanitization, never bypass it\n3. **Validate Outputs**: Always review AI-generated configs\n4. **Report Issues**: Security issues to `security-ai@company.com`\n5. **Stay Updated**: Follow security bulletins\n\n### For Operators\n\n1. **Monitor Costs**: Alert if exceeding 110% of budget\n2. **Watch Errors**: Unusual error patterns may indicate attacks\n3. **Check Audit Logs**: Unauthorized access attempts\n4. **Test Policies**: Periodically verify Cedar policies work\n5. **Backup Configs**: Secure backup of policy files\n\n## Incident Response\n\n### Compromised API Key\n\n```\n# 1. Immediately revoke key\nprovisioning admin revoke-key ai-api-key-123\n\n# 2. Rotate key\nprovisioning admin rotate-key ai \\n --notify ops-team@company.com\n\n# 3. Audit usage since compromise\nprovisioning audit log ai \\n --since "2025-01-13T09:00:00Z" \\n --api-key-id ai-api-key-123\n\n# 4. Review any generated configs from this period\n# Configs generated while key was compromised may need review\n```\n\n### Unauthorized Access\n\n```\n# Review Cedar policy logs\nprovisioning audit log ai \\n --decision deny \\n --last-hour\n\n# Check for pattern\nprovisioning audit search ai "authorization.*deny" \\n --trend-analysis\n\n# Update policies if needed\nprovisioning policy update ai-policies.cedar\n```\n\n## Security Checklist\n\n### Pre-Production\n\n- ✅ Cedar policies reviewed and tested\n- ✅ API keys rotated and secured\n- ✅ Data sanitization tested with real secrets\n- ✅ Encryption enabled for cache\n- ✅ Audit logging configured\n- ✅ Cost limits set appropriately\n- ✅ Local-only mode tested (if needed)\n- ✅ HSM configured (if required)\n\n### Ongoing\n\n- ✅ Monthly policy review\n- ✅ Weekly audit log review\n- ✅ Quarterly key rotation\n- ✅ Annual compliance assessment\n- ✅ Continuous budget monitoring\n- ✅ Error pattern analysis\n\n## Related Documentation\n\n- [Architecture](architecture.md) - System overview\n- [Configuration](configuration.md) - Security settings\n- [Cost Management](cost-management.md) - Budget controls\n- [ADR-015](../architecture/adr/adr-015-ai-integration-architecture.md) - Design decisions\n\n---\n\n**Last Updated**: 2025-01-13\n**Status**: ✅ Production-Ready\n**Compliance**: PCI-DSS, HIPAA, SOX, GDPR\n**Cedar Version**: 3.0+ +# AI Security Policies and Cedar Authorization + +**Status**: ✅ Production-Ready (Cedar integration, policy enforcement) + +Comprehensive documentation of security controls, authorization policies, and data protection mechanisms for the AI system. All AI operations are +controlled through Cedar policies and include strict secret isolation. 
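+ +The sections below describe six defense layers. As a preview, here is a minimal sketch of how a request handler might chain them, written in Rust; every name (`AiRequest`, `check_auth`, `cedar_allows`, `sanitize`, `call_llm`) is illustrative, not the actual ai-service API: + +```text +// Hypothetical handler sketch: every stage must pass before the LLM is called. +async fn handle_ai_request(req: AiRequest) -> Result<AiResponse, AiError> { + let principal = check_auth(&req.token)?; // Layer 1: authentication + if !cedar_allows(&principal, &req.action, &req.resource) { + return Err(AiError::Denied); // Layer 2: Cedar authorization + } + let clean_input = sanitize(&req.input); // Layer 3: secrets/PII removed + validate_request(&req)?; // Layer 4: parameters and rate limits + let response = call_llm(&clean_input).await?; // Layer 5: external call over TLS + audit_log(&principal, &req, &response); // Layer 6: audit trail + Ok(response) +} +```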
 + +## Security Model Overview + +### Defense in Depth + +```text +┌─────────────────────────────────────────┐ +│ User Request to AI │ +└──────────────┬──────────────────────────┘ + ↓ +┌─────────────────────────────────────────┐ +│ Layer 1: Authentication │ +│ - Verify user identity │ +│ - Validate API token/credentials │ +└──────────────┬──────────────────────────┘ + ↓ +┌─────────────────────────────────────────┐ +│ Layer 2: Authorization (Cedar) │ +│ - Check if user can access AI features │ +│ - Verify workspace permissions │ +│ - Check role-based access │ +└──────────────┬──────────────────────────┘ + ↓ +┌─────────────────────────────────────────┐ +│ Layer 3: Data Sanitization │ +│ - Remove secrets from data │ +│ - Redact PII │ +│ - Filter sensitive information │ +└──────────────┬──────────────────────────┘ + ↓ +┌─────────────────────────────────────────┐ +│ Layer 4: Request Validation │ +│ - Check request parameters │ +│ - Verify resource constraints │ +│ - Apply rate limits │ +└──────────────┬──────────────────────────┘ + ↓ +┌─────────────────────────────────────────┐ +│ Layer 5: External API Call │ +│ - Only if all previous checks pass │ +│ - Encrypted TLS connection │ +│ - No secrets in request │ +└──────────────┬──────────────────────────┘ + ↓ +┌─────────────────────────────────────────┐ +│ Layer 6: Audit Logging │ +│ - Log all AI operations │ +│ - Capture user, time, action │ +│ - Store in tamper-proof log │ +└─────────────────────────────────────────┘ +``` + +## Cedar Policies + +### Policy Engine Setup + +```text +// File: provisioning/policies/ai-policies.cedar + +// Core principle: Least privilege +// All actions denied by default unless explicitly allowed + +// Admin users can access all AI features +permit( + principal == ?principal, + action == Action::"ai_generate_config", + resource == ?resource +) +when { + principal.role == "admin" +}; + +// Developers can use AI within their workspace +permit( + principal == ?principal, + action in [ + Action::"ai_query", + Action::"ai_generate_config", + Action::"ai_troubleshoot" + ], + resource == ?resource +) +when { + principal.role in ["developer", "senior_engineer"] + && principal.workspace == resource.workspace +}; + +// Operators can access troubleshooting and queries +permit( + principal == ?principal, + action in [ + Action::"ai_query", + Action::"ai_troubleshoot" + ], + resource == ?resource +) +when { + principal.role in ["operator", "devops"] +}; + +// Form assistance enabled for all authenticated users +permit( + principal == ?principal, + action == Action::"ai_form_assistance", + resource == ?resource +) +when { + principal.authenticated == true +}; + +// Agents (when available) require explicit approval +permit( + principal == ?principal, + action == Action::"ai_agent_execute", + resource == ?resource +) +when { + principal.role == "automation_admin" + && resource.requires_approval == true +}; + +// MCP tool access - restrictive by default +permit( + principal == ?principal, + action == Action::"mcp_tool_call", + resource == ?resource +) +when { + principal.role == "admin" + || (principal.role == "developer" && resource.tool in ["generate_config", "validate_config"]) +}; + +// Cost control policies +permit( + principal == ?principal, + action == Action::"ai_generate_config", + resource == ?resource +) +when { + // User must have remaining budget + principal.ai_budget_remaining_usd > resource.estimated_cost_usd + // Workspace must be under budget + && resource.workspace.ai_budget_remaining_usd > resource.estimated_cost_usd +}; +```
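+ +The final cost-control policy can also be mirrored as a pre-flight check in application code, so a request is rejected before any tokens are spent. A minimal, self-contained Rust sketch (the function and field names are assumptions based on the policy attributes above, not a confirmed API): + +```text +// Illustrative pre-flight guard mirroring the Cedar cost-control policy: +// both the user budget and the workspace budget must cover the estimate. +fn budget_allows(user_remaining_usd: f64, workspace_remaining_usd: f64, estimated_cost_usd: f64) -> bool { + user_remaining_usd > estimated_cost_usd && workspace_remaining_usd > estimated_cost_usd +} + +fn main() { + assert!(budget_allows(5.00, 120.00, 0.04)); // both budgets cover $0.04 + assert!(!budget_allows(0.01, 120.00, 0.04)); // user budget exhausted +} +```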
 + +### Policy Best Practices + +1. **Explicit Allow**: Only allow specific actions, deny by default +2. **Workspace Isolation**: Users can't access AI in other workspaces +3. **Role-Based**: Use consistent role definitions +4. **Cost-Aware**: Check budgets before operations +5. **Audit Trail**: Log all policy decisions + +## Data Sanitization + +### Automatic PII Removal + +Before sending data to external LLMs, the system removes: + +```text +Patterns Removed: +├─ Passwords: password="...", pwd=..., etc. +├─ API Keys: api_key=..., api-key=..., etc. +├─ Tokens: token=..., bearer=..., etc. +├─ Email addresses: user@example.com (unless necessary for context) +├─ Phone numbers: +1-555-0123 patterns +├─ Credit cards: 4111-1111-1111-1111 patterns +├─ SSH keys: -----BEGIN RSA PRIVATE KEY-----... +└─ AWS/GCP/Azure: AKIA2..., AIza..., etc. +``` + +### Configuration + +```text +[ai.security] +sanitize_pii = true +sanitize_secrets = true + +# Custom redaction patterns +redact_patterns = [ + # Database passwords + "(?i)db[_-]?password\\s*[:=]\\s*'?[^'\\n]+'?", + # Generic secrets + "(?i)secret\\s*[:=]\\s*'?[^'\\n]+'?", + # API endpoints that shouldn't be logged + "https?://api[.-]secret\\..+", +] + +# Exceptions (patterns NOT to redact) +preserve_patterns = [ + # Preserve example.com domain for docs + "example\\.com", + # Preserve placeholder emails + "user@example\\.com", +] +``` + +### Example Sanitization + +**Before**: +```text +Error configuring database: +connection_string: postgresql://dbadmin:MySecurePassword123@prod-db.us-east-1.rds.amazonaws.com:5432/app +api_key: sk-ant-abc123def456 +vault_token: hvs.CAESIyg7... +``` + +**After Sanitization**: +```text +Error configuring database: +connection_string: postgresql://dbadmin:[REDACTED]@prod-db.us-east-1.rds.amazonaws.com:5432/app +api_key: [REDACTED] +vault_token: [REDACTED] +``` + +## Secret Isolation + +### Never Access Secrets Directly + +AI cannot directly access secrets. Instead: + +```text +User wants: "Configure PostgreSQL with encrypted backups" + ↓ +AI generates: Configuration schema with placeholders + ↓ +User inserts: Actual secret values (connection strings, passwords) + ↓ +System encrypts: Secrets remain encrypted at rest + ↓ +Deployment: Uses secrets from secure store (Vault, AWS Secrets Manager) +``` + +### Secret Protection Rules + +1. **No Direct Access**: AI never reads from Vault/Secrets Manager +2. **Never in Logs**: Secrets never logged or stored in cache +3. **Sanitization**: All secrets redacted before sending to LLM +4. **Encryption**: Secrets encrypted at rest and in transit +5. **Audit Trail**: All access to secrets logged +6. **TTL**: Temporary secrets auto-expire
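+ +To illustrate how the redaction patterns configured above behave, the following self-contained Rust sketch (using the `regex` crate) applies the "Database passwords" pattern and substitutes `[REDACTED]`, as in the before/after example; the `redact` helper itself is hypothetical: + +```text +use regex::Regex; + +// Apply one pattern from redact_patterns above; a real sanitization pass would +// loop over every configured pattern and skip preserve_patterns matches. +fn redact(input: &str) -> String { + let db_password = Regex::new(r"(?i)db[_-]?password\s*[:=]\s*'?[^'\n]+'?").unwrap(); + db_password.replace_all(input, "db_password: [REDACTED]").to_string() +} +```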
 + +## Local Models Support + +### Air-Gapped Deployments + +For environments requiring zero external API calls: + +```text +# Deploy local Ollama with provisioning support +docker run -d \ + --name provisioning-ai \ + -p 11434:11434 \ + -v ollama:/root/.ollama \ + -e OLLAMA_HOST=0.0.0.0:11434 \ + ollama/ollama + +# Pull model +ollama pull mistral +ollama pull llama2-70b + +# Configure provisioning to use local model +provisioning config edit ai + +[ai] +provider = "local" +model = "mistral" +api_base = "http://localhost:11434" +``` + +### Benefits + +- ✅ Zero external API calls +- ✅ Full data privacy (no LLM vendor access) +- ✅ Compliance with classified/regulated data +- ✅ No API key exposure +- ✅ Deterministic (same results each run) + +### Performance Trade-offs + +| Factor | Local | Cloud | +| -------- | ------- | ------- | +| Privacy | Excellent | Requires trust | +| Cost | Free (hardware) | Per token | +| Speed | 5-30s/response | 2-5s/response | +| Quality | Good (70B models) | Excellent (Opus) | +| Hardware | Requires GPU | None | + +## HSM Integration + +### Hardware Security Module Support + +For highly sensitive environments: + +```text +[ai.security.hsm] +enabled = true +provider = "aws-cloudhsm" # or "thales", "yubihsm" + +[ai.security.hsm.aws] +cluster_id = "cluster-123" +customer_ca_cert = "/etc/provisioning/certs/customerCA.crt" +server_cert = "/etc/provisioning/certs/server.crt" +server_key = "/etc/provisioning/certs/server.key" +``` + +## Encryption + +### Data at Rest + +```text +[ai.security.encryption] +enabled = true +algorithm = "aes-256-gcm" +key_derivation = "argon2id" + +# Key rotation +key_rotation_enabled = true +key_rotation_days = 90 +rotation_alert_days = 7 + +# Encrypted storage +cache_encryption = true +log_encryption = true +``` + +### Data in Transit + +```text +All external LLM API calls: +├─ TLS 1.3 (minimum) +├─ Certificate pinning (optional) +├─ Mutual TLS (with cloud providers) +└─ No plaintext transmission +``` + +## Audit Logging + +### What Gets Logged + +```text +{ + "timestamp": "2025-01-13T10:30:45Z", + "event_type": "ai_action", + "action": "generate_config", + "principal": { + "user_id": "user-123", + "role": "developer", + "workspace": "prod" + }, + "resource": { + "type": "database", + "name": "prod-postgres" + }, + "authorization": { + "decision": "permit", + "policy": "ai-policies.cedar", + "reason": "developer role in workspace" + }, + "cost": { + "tokens_used": 1250, + "estimated_cost_usd": 0.037 + }, + "sanitization": { + "items_redacted": 3, + "patterns_matched": ["db_password", "api_key", "token"] + }, + "status": "success" +} +``` + +### Audit Trail Access + +```text +# View recent AI actions +provisioning audit log ai --tail 100 + +# Filter by user +provisioning audit log ai --user alice@company.com + +# Filter by action +provisioning audit log ai --action generate_config + +# Filter by time range +provisioning audit log ai --from "2025-01-01" --to "2025-01-13" + +# Export for analysis +provisioning audit export ai --format csv --output audit.csv + +# Full-text search +provisioning audit search ai "error in database configuration" +```
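+ +For programmatic analysis, exported audit entries like the JSON above can be deserialized with `serde`. A partial, illustrative Rust sketch (only a few fields are modeled; the struct is an assumption, not the platform's types): + +```text +use serde::Deserialize; + +// Partial model of the audit entry shown above; serde ignores unknown fields by default. +#[derive(Deserialize)] +struct AuditEntry { + timestamp: String, + action: String, + status: String, +} + +fn main() -> Result<(), serde_json::Error> { + let raw = r#"{"timestamp":"2025-01-13T10:30:45Z","action":"generate_config","status":"success"}"#; + let entry: AuditEntry = serde_json::from_str(raw)?; + println!("{} {} -> {}", entry.timestamp, entry.action, entry.status); + Ok(()) +} +```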
 + +## Compliance Frameworks + +### Built-in Compliance Checks + +```text +[ai.compliance] +frameworks = ["pci-dss", "hipaa", "sox", "gdpr"] + +[ai.compliance.pci-dss] +enabled = true +# Requires encryption, audit logs, access controls + +[ai.compliance.hipaa] +enabled = true +# Requires local models, encrypted storage, audit logs + +[ai.compliance.gdpr] +enabled = true +# Requires data deletion, consent tracking, privacy by design +``` + +### Compliance Reports + +```text +# Generate compliance report +provisioning audit compliance-report \ + --framework pci-dss \ + --period month \ + --output report.pdf + +# Verify compliance +provisioning audit verify-compliance \ + --framework hipaa \ + --verbose +``` + +## Security Best Practices + +### For Administrators + +1. **Rotate API Keys**: Every 90 days minimum +2. **Monitor Budget**: Set up alerts at 80% and 90% +3. **Review Policies**: Quarterly policy audit +4. **Audit Logs**: Weekly review of AI operations +5. **Update Models**: Use latest stable models +6. **Test Recovery**: Monthly rollback drills + +### For Developers + +1. **Use Workspace Isolation**: Never share workspace access +2. **Don't Log Secrets**: Use sanitization, never bypass it +3. **Validate Outputs**: Always review AI-generated configs +4. **Report Issues**: Security issues to `security-ai@company.com` +5. **Stay Updated**: Follow security bulletins + +### For Operators + +1. **Monitor Costs**: Alert if exceeding 110% of budget +2. **Watch Errors**: Unusual error patterns may indicate attacks +3. **Check Audit Logs**: Unauthorized access attempts +4. **Test Policies**: Periodically verify Cedar policies work +5. **Backup Configs**: Secure backup of policy files + +## Incident Response + +### Compromised API Key + +```text +# 1. Immediately revoke key +provisioning admin revoke-key ai-api-key-123 + +# 2. Rotate key +provisioning admin rotate-key ai \ + --notify ops-team@company.com + +# 3. Audit usage since compromise +provisioning audit log ai \ + --since "2025-01-13T09:00:00Z" \ + --api-key-id ai-api-key-123 + +# 4. Review any generated configs from this period +# Configs generated while key was compromised may need review +``` + +### Unauthorized Access + +```text +# Review Cedar policy logs +provisioning audit log ai \ + --decision deny \ + --last-hour + +# Check for pattern +provisioning audit search ai "authorization.*deny" \ + --trend-analysis + +# Update policies if needed +provisioning policy update ai-policies.cedar +``` + +## Security Checklist + +### Pre-Production + +- ✅ Cedar policies reviewed and tested +- ✅ API keys rotated and secured +- ✅ Data sanitization tested with real secrets +- ✅ Encryption enabled for cache +- ✅ Audit logging configured +- ✅ Cost limits set appropriately +- ✅ Local-only mode tested (if needed) +- ✅ HSM configured (if required) + +### Ongoing + +- ✅ Monthly policy review +- ✅ Weekly audit log review +- ✅ Quarterly key rotation +- ✅ Annual compliance assessment +- ✅ Continuous budget monitoring +- ✅ Error pattern analysis + +## Related Documentation + +- [Architecture](architecture.md) - System overview +- [Configuration](configuration.md) - Security settings +- [Cost Management](cost-management.md) - Budget controls +- [ADR-015](../architecture/adr/adr-015-ai-integration-architecture.md) - Design decisions + +--- + +**Last Updated**: 2025-01-13 +**Status**: ✅ Production-Ready +**Compliance**: PCI-DSS, HIPAA, SOX, GDPR +**Cedar Version**: 3.0+ \ No newline at end of file diff --git a/docs/src/ai/troubleshooting-with-ai.md b/docs/src/ai/troubleshooting-with-ai.md index 0a1bc89..1644682 100644 --- a/docs/src/ai/troubleshooting-with-ai.md +++ b/docs/src/ai/troubleshooting-with-ai.md @@ -1 +1,502 @@ -# AI-Assisted Troubleshooting and Debugging\n\n**Status**: ✅ Production-Ready (AI troubleshooting analysis, log parsing)\n\nThe AI troubleshooting system provides intelligent
debugging assistance for infrastructure failures. The system analyzes deployment logs, identifies\nroot causes, suggests fixes, and generates corrected configurations based on failure patterns.\n\n## Feature Overview\n\n### What It Does\n\nTransform deployment failures into actionable insights:\n\n```\nDeployment Fails with Error\n ↓\nAI analyzes logs:\n - Identifies failure phase (networking, database, k8s, etc.)\n - Detects root cause (resource limits, configuration, timeout)\n - Correlates with similar past failures\n - Reviews deployment configuration\n ↓\nAI generates report:\n - Root cause explanation in plain English\n - Configuration issues identified\n - Suggested fixes with rationale\n - Alternative solutions\n - Links to relevant documentation\n ↓\nDeveloper reviews and accepts:\n - Understands what went wrong\n - Knows how to fix it\n - Can implement fix with confidence\n```\n\n## Troubleshooting Workflow\n\n### Automatic Detection and Analysis\n\n```\n┌──────────────────────────────────────────┐\n│ Deployment Monitoring │\n│ - Watches deployment for failures │\n│ - Captures logs in real-time │\n│ - Detects failure events │\n└──────────────┬───────────────────────────┘\n ↓\n┌──────────────────────────────────────────┐\n│ Log Collection │\n│ - Gather all relevant logs │\n│ - Include stack traces │\n│ - Capture metrics at failure time │\n│ - Get resource usage data │\n└──────────────┬───────────────────────────┘\n ↓\n┌──────────────────────────────────────────┐\n│ Context Retrieval (RAG) │\n│ - Find similar past failures │\n│ - Retrieve troubleshooting guides │\n│ - Get schema constraints │\n│ - Find best practices │\n└──────────────┬───────────────────────────┘\n ↓\n┌──────────────────────────────────────────┐\n│ AI Analysis │\n│ - Identify failure pattern │\n│ - Determine root cause │\n│ - Generate hypotheses │\n│ - Score likely causes │\n└──────────────┬───────────────────────────┘\n ↓\n┌──────────────────────────────────────────┐\n│ Solution Generation │\n│ - Create fixed configuration │\n│ - Generate step-by-step fix guide │\n│ - Suggest preventative measures │\n│ - Provide alternative approaches │\n└──────────────┬───────────────────────────┘\n ↓\n┌──────────────────────────────────────────┐\n│ Report and Recommendations │\n│ - Explain what went wrong │\n│ - Show how to fix it │\n│ - Provide corrected configuration │\n│ - Link to prevention strategies │\n└──────────────────────────────────────────┘\n```\n\n## Usage Examples\n\n### Example 1: Database Connection Timeout\n\n**Failure**:\n```\nDeployment: deploy-2025-01-13-001\nStatus: FAILED at phase database_migration\nError: connection timeout after 30s connecting to postgres://...\n```\n\n**Run Troubleshooting**:\n```\n$ provisioning ai troubleshoot deploy-2025-01-13-001\n\nAnalyzing deployment failure...\n\n╔════════════════════════════════════════════════════════════════╗\n║ Root Cause Analysis: Database Connection Timeout ║\n╠════════════════════════════════════════════════════════════════╣\n║ ║\n║ Phase: database_migration (occurred during migration job) ║\n║ Error: Timeout after 30 seconds connecting to database ║\n║ ║\n║ Most Likely Causes (confidence): ║\n║ 1. Database security group blocks migration job (85%) ║\n║ 2. Database instance not fully initialized yet (60%) ║\n║ 3. 
Network connectivity issue (40%) ║\n║ ║\n║ Analysis: ║\n║ - Database was created only 2 seconds before connection ║\n║ - Migration job started immediately (no wait time) ║\n║ - Security group: allows 5432 only from default SG ║\n║ - Migration pod uses different security group ║\n║ ║\n╠════════════════════════════════════════════════════════════════╣\n║ Recommended Fix ║\n╠════════════════════════════════════════════════════════════════╣\n║ ║\n║ Issue: Migration security group not in database's inbound ║\n║ ║\n║ Solution: Add migration pod security group to DB inbound ║\n║ ║\n║ database.security_group.ingress = [ ║\n║ { ║\n║ from_port = 5432, ║\n║ to_port = 5432, ║\n║ source_security_group = "migration-pods-sg" ║\n║ } ║\n║ ] ║\n║ ║\n║ Alternative: Add 30-second wait after database creation ║\n║ ║\n║ deployment.phases.database.post_actions = [ ║\n║ {action = "wait_for_database", timeout_seconds = 30} ║\n║ ] ║\n║ ║\n╠════════════════════════════════════════════════════════════════╣\n║ Prevention ║\n╠════════════════════════════════════════════════════════════════╣\n║ ║\n║ To prevent this in future deployments: ║\n║ ║\n║ 1. Always verify security group rules before migration ║\n║ 2. Add health check: `SELECT 1` before starting migration ║\n║ 3. Increase initial timeout: database can be slow to start ║\n║ 4. Use RDS wait condition instead of time-based wait ║\n║ ║\n║ See: docs/troubleshooting/database-connectivity.md ║\n║ docs/guides/database-migrations.md ║\n║ ║\n╚════════════════════════════════════════════════════════════════╝\n\nGenerate corrected configuration? [yes/no]: yes\n\nConfiguration generated and saved to:\n workspaces/prod/database.ncl.fixed\n\nChanges made:\n ✓ Added migration security group to database inbound\n ✓ Added health check before migration\n ✓ Increased connection timeout to 60s\n\nReady to redeploy with corrected configuration? [yes/no]: yes\n```\n\n### Example 2: Kubernetes Deployment Error\n\n**Failure**:\n```\nDeployment: deploy-2025-01-13-002\nStatus: FAILED at phase kubernetes_workload\nError: failed to create deployment app: Pod exceeded capacity\n```\n\n**Troubleshooting**:\n```\n$ provisioning ai troubleshoot deploy-2025-01-13-002 --detailed\n\n╔════════════════════════════════════════════════════════════════╗\n║ Root Cause: Pod Exceeded Node Capacity ║\n╠════════════════════════════════════════════════════════════════╣\n║ ║\n║ Failure Analysis: ║\n║ ║\n║ Error: Pod requests 4CPU/8GB, but largest node has 2CPU/4GB ║\n║ Cluster: 3 nodes, each t3.medium (2CPU/4GB) ║\n║ Pod requirements: ║\n║ - CPU: 4 (requested) + 2 (reserved system) = 6 needed ║\n║ - Memory: 8Gi (requested) + 1Gi (system) = 9Gi needed ║\n║ ║\n║ Why this happened: ║\n║ Pod spec updated to 4CPU/8GB but node group wasn't ║\n║ Node group still has t3.medium (too small) ║\n║ No autoscaling configured (won't scale up automatically) ║\n║ ║\n║ Solution Options: ║\n║ 1. Reduce pod resource requests to 2CPU/4GB (simpler) ║\n║ 2. Scale up node group to t3.large (2x cost, safer) ║\n║ 3. 
Use both: t3.large nodes + reduce pod requests ║\n║ ║\n╠════════════════════════════════════════════════════════════════╣\n║ Recommended: Option 2 (Scale up nodes) ║\n╠════════════════════════════════════════════════════════════════╣\n║ ║\n║ Reason: Pod requests are reasonable for production app ║\n║ Better to scale infrastructure than reduce resources ║\n║ ║\n║ Changes needed: ║\n║ ║\n║ kubernetes.node_group = { ║\n║ instance_type = "t3.large" # was t3.medium ║\n║ min_size = 3 ║\n║ max_size = 10 ║\n║ ║\n║ auto_scaling = { ║\n║ enabled = true ║\n║ target_cpu_percent = 70 ║\n║ } ║\n║ } ║\n║ ║\n║ Cost Impact: ║\n║ Current: 3 × t3.medium = ~$90/month ║\n║ Proposed: 3 × t3.large = ~$180/month ║\n║ With autoscaling, average: ~$150/month (some scale-down) ║\n║ ║\n╚════════════════════════════════════════════════════════════════╝\n```\n\n## CLI Commands\n\n### Basic Troubleshooting\n\n```\n# Troubleshoot recent deployment\nprovisioning ai troubleshoot deploy-2025-01-13-001\n\n# Get detailed analysis\nprovisioning ai troubleshoot deploy-2025-01-13-001 --detailed\n\n# Analyze with specific focus\nprovisioning ai troubleshoot deploy-2025-01-13-001 --focus networking\n\n# Get alternative solutions\nprovisioning ai troubleshoot deploy-2025-01-13-001 --alternatives\n```\n\n### Working with Logs\n\n```\n# Troubleshoot from custom logs\nprovisioning ai troubleshoot \\n| --logs "$(journalctl -u provisioning --no-pager | tail -100)" |\n\n# Troubleshoot from file\nprovisioning ai troubleshoot --log-file /var/log/deployment.log\n\n# Troubleshoot from cloud provider\nprovisioning ai troubleshoot \\n --cloud-logs aws-deployment-123 \\n --region us-east-1\n```\n\n### Generate Reports\n\n```\n# Generate detailed troubleshooting report\nprovisioning ai troubleshoot deploy-123 \\n --report \\n --output troubleshooting-report.md\n\n# Generate with suggestions\nprovisioning ai troubleshoot deploy-123 \\n --report \\n --include-suggestions \\n --output report-with-fixes.md\n\n# Generate compliance report (PCI-DSS, HIPAA)\nprovisioning ai troubleshoot deploy-123 \\n --report \\n --compliance pci-dss \\n --output compliance-report.pdf\n```\n\n## Analysis Depth\n\n### Shallow Analysis (Fast)\n\n```\nprovisioning ai troubleshoot deploy-123 --depth shallow\n\nAnalyzes:\n- First error message\n- Last few log lines\n- Basic pattern matching\n- Returns in 30-60 seconds\n```\n\n### Deep Analysis (Thorough)\n\n```\nprovisioning ai troubleshoot deploy-123 --depth deep\n\nAnalyzes:\n- Full log context\n- Correlates multiple errors\n- Checks resource metrics\n- Compares to past failures\n- Generates alternative hypotheses\n- Returns in 5-10 seconds\n```\n\n## Integration with Monitoring\n\n### Automatic Troubleshooting\n\n```\n# Enable auto-troubleshoot on failures\nprovisioning config set ai.troubleshooting.auto_analyze true\n\n# Deployments that fail automatically get analyzed\n# Reports available in provisioning dashboard\n# Alerts sent to on-call engineer with analysis\n```\n\n### WebUI Integration\n\n```\nDeployment Dashboard\n ├─ deployment-123 [FAILED]\n │ └─ AI Analysis\n │ ├─ Root Cause: Database timeout\n │ ├─ Suggested Fix: ✓ View\n │ ├─ Corrected Config: ✓ Download\n │ └─ Alternative Solutions: 3 options\n```\n\n## Learning from Failures\n\n### Pattern Recognition\n\nThe system learns common failure patterns:\n\n```\nCollected Patterns:\n├─ Database Timeouts (25% of failures)\n│ └─ Usually: Security group, connection pool, slow startup\n├─ Kubernetes Pod Failures (20%)\n│ └─ Usually: Insufficient resources, bad 
config\n├─ Network Connectivity (15%)\n│ └─ Usually: Security groups, routing, DNS\n└─ Other (40%)\n └─ Various causes, each analyzed individually\n```\n\n### Improvement Tracking\n\n```\n# See patterns in your deployments\nprovisioning ai analytics failures --period month\n\nMonth Summary:\n Total deployments: 50\n Failed: 5 (10% failure rate)\n \n Common causes:\n 1. Security group rules (3 failures, 60%)\n 2. Resource limits (1 failure, 20%)\n 3. Configuration error (1 failure, 20%)\n \n Improvement opportunities:\n - Pre-check security groups before deployment\n - Add health checks for resource sizing\n - Add configuration validation\n```\n\n## Configuration\n\n### Troubleshooting Settings\n\n```\n[ai.troubleshooting]\nenabled = true\n\n# Analysis depth\ndefault_depth = "deep" # or "shallow" for speed\nmax_analysis_time_seconds = 30\n\n# Features\nauto_analyze_failed_deployments = true\ngenerate_corrected_config = true\nsuggest_prevention = true\n\n# Learning\ntrack_failure_patterns = true\nlearn_from_similar_failures = true\nimprove_suggestions_over_time = true\n\n# Reporting\nauto_send_report = false # Email report to user\nreport_format = "markdown" # or "json", "pdf"\ninclude_alternatives = true\n\n# Cost impact analysis\nestimate_fix_cost = true\nestimate_alternative_costs = true\n```\n\n### Failure Detection\n\n```\n[ai.troubleshooting.detection]\n# Monitor logs for these patterns\nwatch_patterns = [\n "error",\n "timeout",\n "failed",\n "unable to",\n "refused",\n "denied",\n "exceeded",\n "quota",\n]\n\n# Minimum log lines before analyzing\nmin_log_lines = 10\n\n# Time window for log collection\nlog_window_seconds = 300\n```\n\n## Best Practices\n\n### For Effective Troubleshooting\n\n1. **Keep Detailed Logs**: Enable verbose logging in deployments\n2. **Include Context**: Share full logs, not just error snippet\n3. **Check Suggestions**: Review AI suggestions even if obvious\n4. **Learn Patterns**: Track recurring failures and address root cause\n5. **Update Configs**: Use corrected configs from AI, validate them\n\n### For Prevention\n\n1. **Use Health Checks**: Add database/service health checks\n2. **Test Before Deploy**: Use dry-run to catch issues early\n3. **Monitor Metrics**: Watch CPU/memory before failures occur\n4. **Review Policies**: Ensure security groups are correct\n5. 
**Document Changes**: When updating configs, note the change\n\n## Limitations\n\n### What AI Can Troubleshoot\n\n✅ Configuration errors\n✅ Resource limit problems\n✅ Networking/security group issues\n✅ Database connectivity problems\n✅ Deployment ordering issues\n✅ Common application errors\n✅ Performance problems\n\n### What Requires Human Review\n\n⚠️ Data corruption scenarios\n⚠️ Multi-failure cascades\n⚠️ Unclear error messages\n⚠️ Custom application code failures\n⚠️ Third-party service issues\n⚠️ Physical infrastructure failures\n\n## Examples and Guides\n\n### Common Issues - Quick Links\n\n- [Database Connectivity](../troubleshooting/database-connectivity.md)\n- [Kubernetes Pod Failures](../troubleshooting/kubernetes-pods.md)\n- [Network Configuration](../troubleshooting/networking.md)\n- [Performance Issues](../troubleshooting/performance.md)\n- [Resource Limits](../troubleshooting/resource-limits.md)\n\n## Related Documentation\n\n- [Architecture](architecture.md) - AI system overview\n- [RAG System](rag-system.md) - Context retrieval for troubleshooting\n- [Configuration](configuration.md) - Setup guide\n- [Security Policies](security-policies.md) - Safe log handling\n- [ADR-015](../architecture/adr/adr-015-ai-integration-architecture.md) - Design decisions\n\n---\n\n**Last Updated**: 2025-01-13\n**Status**: ✅ Production-Ready\n**Success Rate**: 85-95% accuracy in root cause identification\n**Supported**: All deployment types (infrastructure, Kubernetes, database) +# AI-Assisted Troubleshooting and Debugging + +**Status**: ✅ Production-Ready (AI troubleshooting analysis, log parsing) + +The AI troubleshooting system provides intelligent debugging assistance for infrastructure failures. The system analyzes deployment logs, identifies +root causes, suggests fixes, and generates corrected configurations based on failure patterns. + +## Feature Overview + +### What It Does + +Transform deployment failures into actionable insights: + +```text +Deployment Fails with Error + ↓ +AI analyzes logs: + - Identifies failure phase (networking, database, k8s, etc.) 
+ - Detects root cause (resource limits, configuration, timeout) + - Correlates with similar past failures + - Reviews deployment configuration + ↓ +AI generates report: + - Root cause explanation in plain English + - Configuration issues identified + - Suggested fixes with rationale + - Alternative solutions + - Links to relevant documentation + ↓ +Developer reviews and accepts: + - Understands what went wrong + - Knows how to fix it + - Can implement fix with confidence +``` + +## Troubleshooting Workflow + +### Automatic Detection and Analysis + +```text +┌──────────────────────────────────────────┐ +│ Deployment Monitoring │ +│ - Watches deployment for failures │ +│ - Captures logs in real-time │ +│ - Detects failure events │ +└──────────────┬───────────────────────────┘ + ↓ +┌──────────────────────────────────────────┐ +│ Log Collection │ +│ - Gather all relevant logs │ +│ - Include stack traces │ +│ - Capture metrics at failure time │ +│ - Get resource usage data │ +└──────────────┬───────────────────────────┘ + ↓ +┌──────────────────────────────────────────┐ +│ Context Retrieval (RAG) │ +│ - Find similar past failures │ +│ - Retrieve troubleshooting guides │ +│ - Get schema constraints │ +│ - Find best practices │ +└──────────────┬───────────────────────────┘ + ↓ +┌──────────────────────────────────────────┐ +│ AI Analysis │ +│ - Identify failure pattern │ +│ - Determine root cause │ +│ - Generate hypotheses │ +│ - Score likely causes │ +└──────────────┬───────────────────────────┘ + ↓ +┌──────────────────────────────────────────┐ +│ Solution Generation │ +│ - Create fixed configuration │ +│ - Generate step-by-step fix guide │ +│ - Suggest preventative measures │ +│ - Provide alternative approaches │ +└──────────────┬───────────────────────────┘ + ↓ +┌──────────────────────────────────────────┐ +│ Report and Recommendations │ +│ - Explain what went wrong │ +│ - Show how to fix it │ +│ - Provide corrected configuration │ +│ - Link to prevention strategies │ +└──────────────────────────────────────────┘ +``` + +## Usage Examples + +### Example 1: Database Connection Timeout + +**Failure**: +```text +Deployment: deploy-2025-01-13-001 +Status: FAILED at phase database_migration +Error: connection timeout after 30s connecting to postgres://... +``` + +**Run Troubleshooting**: +```text +$ provisioning ai troubleshoot deploy-2025-01-13-001 + +Analyzing deployment failure... + +╔════════════════════════════════════════════════════════════════╗ +║ Root Cause Analysis: Database Connection Timeout ║ +╠════════════════════════════════════════════════════════════════╣ +║ ║ +║ Phase: database_migration (occurred during migration job) ║ +║ Error: Timeout after 30 seconds connecting to database ║ +║ ║ +║ Most Likely Causes (confidence): ║ +║ 1. Database security group blocks migration job (85%) ║ +║ 2. Database instance not fully initialized yet (60%) ║ +║ 3. 
Network connectivity issue (40%) ║ +║ ║ +║ Analysis: ║ +║ - Database was created only 2 seconds before connection ║ +║ - Migration job started immediately (no wait time) ║ +║ - Security group: allows 5432 only from default SG ║ +║ - Migration pod uses different security group ║ +║ ║ +╠════════════════════════════════════════════════════════════════╣ +║ Recommended Fix ║ +╠════════════════════════════════════════════════════════════════╣ +║ ║ +║ Issue: Migration security group not in database's inbound ║ +║ ║ +║ Solution: Add migration pod security group to DB inbound ║ +║ ║ +║ database.security_group.ingress = [ ║ +║ { ║ +║ from_port = 5432, ║ +║ to_port = 5432, ║ +║ source_security_group = "migration-pods-sg" ║ +║ } ║ +║ ] ║ +║ ║ +║ Alternative: Add 30-second wait after database creation ║ +║ ║ +║ deployment.phases.database.post_actions = [ ║ +║ {action = "wait_for_database", timeout_seconds = 30} ║ +║ ] ║ +║ ║ +╠════════════════════════════════════════════════════════════════╣ +║ Prevention ║ +╠════════════════════════════════════════════════════════════════╣ +║ ║ +║ To prevent this in future deployments: ║ +║ ║ +║ 1. Always verify security group rules before migration ║ +║ 2. Add health check: `SELECT 1` before starting migration ║ +║ 3. Increase initial timeout: database can be slow to start ║ +║ 4. Use RDS wait condition instead of time-based wait ║ +║ ║ +║ See: docs/troubleshooting/database-connectivity.md ║ +║ docs/guides/database-migrations.md ║ +║ ║ +╚════════════════════════════════════════════════════════════════╝ + +Generate corrected configuration? [yes/no]: yes + +Configuration generated and saved to: + workspaces/prod/database.ncl.fixed + +Changes made: + ✓ Added migration security group to database inbound + ✓ Added health check before migration + ✓ Increased connection timeout to 60s + +Ready to redeploy with corrected configuration? [yes/no]: yes +``` + +### Example 2: Kubernetes Deployment Error + +**Failure**: +```text +Deployment: deploy-2025-01-13-002 +Status: FAILED at phase kubernetes_workload +Error: failed to create deployment app: Pod exceeded capacity +``` + +**Troubleshooting**: +```text +$ provisioning ai troubleshoot deploy-2025-01-13-002 --detailed + +╔════════════════════════════════════════════════════════════════╗ +║ Root Cause: Pod Exceeded Node Capacity ║ +╠════════════════════════════════════════════════════════════════╣ +║ ║ +║ Failure Analysis: ║ +║ ║ +║ Error: Pod requests 4CPU/8GB, but largest node has 2CPU/4GB ║ +║ Cluster: 3 nodes, each t3.medium (2CPU/4GB) ║ +║ Pod requirements: ║ +║ - CPU: 4 (requested) + 2 (reserved system) = 6 needed ║ +║ - Memory: 8Gi (requested) + 1Gi (system) = 9Gi needed ║ +║ ║ +║ Why this happened: ║ +║ Pod spec updated to 4CPU/8GB but node group wasn't ║ +║ Node group still has t3.medium (too small) ║ +║ No autoscaling configured (won't scale up automatically) ║ +║ ║ +║ Solution Options: ║ +║ 1. Reduce pod resource requests to 2CPU/4GB (simpler) ║ +║ 2. Scale up node group to t3.large (2x cost, safer) ║ +║ 3. 
Use both: t3.large nodes + reduce pod requests ║ +║ ║ +╠════════════════════════════════════════════════════════════════╣ +║ Recommended: Option 2 (Scale up nodes) ║ +╠════════════════════════════════════════════════════════════════╣ +║ ║ +║ Reason: Pod requests are reasonable for production app ║ +║ Better to scale infrastructure than reduce resources ║ +║ ║ +║ Changes needed: ║ +║ ║ +║ kubernetes.node_group = { ║ +║ instance_type = "t3.large" # was t3.medium ║ +║ min_size = 3 ║ +║ max_size = 10 ║ +║ ║ +║ auto_scaling = { ║ +║ enabled = true ║ +║ target_cpu_percent = 70 ║ +║ } ║ +║ } ║ +║ ║ +║ Cost Impact: ║ +║ Current: 3 × t3.medium = ~$90/month ║ +║ Proposed: 3 × t3.large = ~$180/month ║ +║ With autoscaling, average: ~$150/month (some scale-down) ║ +║ ║ +╚════════════════════════════════════════════════════════════════╝ +``` + +## CLI Commands + +### Basic Troubleshooting + +```text +# Troubleshoot recent deployment +provisioning ai troubleshoot deploy-2025-01-13-001 + +# Get detailed analysis +provisioning ai troubleshoot deploy-2025-01-13-001 --detailed + +# Analyze with specific focus +provisioning ai troubleshoot deploy-2025-01-13-001 --focus networking + +# Get alternative solutions +provisioning ai troubleshoot deploy-2025-01-13-001 --alternatives +``` + +### Working with Logs + +```text +# Troubleshoot from custom logs +provisioning ai troubleshoot \ + --logs "$(journalctl -u provisioning --no-pager | tail -100)" + +# Troubleshoot from file +provisioning ai troubleshoot --log-file /var/log/deployment.log + +# Troubleshoot from cloud provider +provisioning ai troubleshoot \ + --cloud-logs aws-deployment-123 \ + --region us-east-1 +``` + +### Generate Reports + +```text +# Generate detailed troubleshooting report +provisioning ai troubleshoot deploy-123 \ + --report \ + --output troubleshooting-report.md + +# Generate with suggestions +provisioning ai troubleshoot deploy-123 \ + --report \ + --include-suggestions \ + --output report-with-fixes.md + +# Generate compliance report (PCI-DSS, HIPAA) +provisioning ai troubleshoot deploy-123 \ + --report \ + --compliance pci-dss \ + --output compliance-report.pdf +``` + +## Analysis Depth + +### Shallow Analysis (Fast) + +```text +provisioning ai troubleshoot deploy-123 --depth shallow + +Analyzes: +- First error message +- Last few log lines +- Basic pattern matching +- Returns in 5-10 seconds +``` + +### Deep Analysis (Thorough) + +```text +provisioning ai troubleshoot deploy-123 --depth deep + +Analyzes: +- Full log context +- Correlates multiple errors +- Checks resource metrics +- Compares to past failures +- Generates alternative hypotheses +- Returns in 30-60 seconds +``` + +## Integration with Monitoring + +### Automatic Troubleshooting + +```text +# Enable auto-troubleshoot on failures +provisioning config set ai.troubleshooting.auto_analyze true + +# Deployments that fail are automatically analyzed +# Reports available in provisioning dashboard +# Alerts sent to on-call engineer with analysis +``` + +### WebUI Integration + +```text +Deployment Dashboard + ├─ deployment-123 [FAILED] + │ └─ AI Analysis + │ ├─ Root Cause: Database timeout + │ ├─ Suggested Fix: ✓ View + │ ├─ Corrected Config: ✓ Download + │ └─ Alternative Solutions: 3 options +```
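+ +The auto-analyze behaviour above could also be reproduced by an external watcher that shells out to the documented CLI. A minimal Rust sketch (the list of failed deployment IDs is assumed to come from elsewhere, for example the dashboard): + +```text +use std::process::Command; + +// For each failed deployment, request an AI report via the documented +// `provisioning ai troubleshoot <id> --report --output <file>` command. +fn analyze_failures(failed_ids: &[String]) -> std::io::Result<()> { + for id in failed_ids { + let out = format!("{id}-report.md"); + let status = Command::new("provisioning") + .args(["ai", "troubleshoot", id.as_str(), "--report", "--output", out.as_str()]) + .status()?; + if !status.success() { + eprintln!("analysis failed for {id}"); + } + } + Ok(()) +} +```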
 + +## Learning from Failures + +### Pattern Recognition + +The system learns common failure patterns: + +```text +Collected Patterns: +├─ Database Timeouts (25% of failures) +│ └─ Usually: Security group, connection pool, slow startup +├─ Kubernetes Pod Failures (20%) +│ └─ Usually: Insufficient resources, bad config +├─ Network Connectivity (15%) +│ └─ Usually: Security groups, routing, DNS +└─ Other (40%) + └─ Various causes, each analyzed individually +``` + +### Improvement Tracking + +```text +# See patterns in your deployments +provisioning ai analytics failures --period month + +Month Summary: + Total deployments: 50 + Failed: 5 (10% failure rate) + + Common causes: + 1. Security group rules (3 failures, 60%) + 2. Resource limits (1 failure, 20%) + 3. Configuration error (1 failure, 20%) + + Improvement opportunities: + - Pre-check security groups before deployment + - Add health checks for resource sizing + - Add configuration validation +``` + +## Configuration + +### Troubleshooting Settings + +```text +[ai.troubleshooting] +enabled = true + +# Analysis depth +default_depth = "deep" # or "shallow" for speed +max_analysis_time_seconds = 30 + +# Features +auto_analyze_failed_deployments = true +generate_corrected_config = true +suggest_prevention = true + +# Learning +track_failure_patterns = true +learn_from_similar_failures = true +improve_suggestions_over_time = true + +# Reporting +auto_send_report = false # Email report to user +report_format = "markdown" # or "json", "pdf" +include_alternatives = true + +# Cost impact analysis +estimate_fix_cost = true +estimate_alternative_costs = true +``` + +### Failure Detection + +```text +[ai.troubleshooting.detection] +# Monitor logs for these patterns +watch_patterns = [ + "error", + "timeout", + "failed", + "unable to", + "refused", + "denied", + "exceeded", + "quota", +] + +# Minimum log lines before analyzing +min_log_lines = 10 + +# Time window for log collection +log_window_seconds = 300 +``` + +## Best Practices + +### For Effective Troubleshooting + +1. **Keep Detailed Logs**: Enable verbose logging in deployments +2. **Include Context**: Share full logs, not just the error snippet +3. **Check Suggestions**: Review AI suggestions even if obvious +4. **Learn Patterns**: Track recurring failures and address root cause +5. **Update Configs**: Use corrected configs from AI, validate them + +### For Prevention + +1. **Use Health Checks**: Add database/service health checks +2. **Test Before Deploy**: Use dry-run to catch issues early +3. **Monitor Metrics**: Watch CPU/memory before failures occur +4. **Review Policies**: Ensure security groups are correct +5. **Document Changes**: When updating configs, note the change
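+ +The `[ai.troubleshooting.detection]` settings above translate naturally into a small scanner: collect at least `min_log_lines` lines, then flag the window if any line contains a watch pattern. An illustrative, self-contained Rust sketch (not the platform's implementation): + +```text +// Mirror of watch_patterns and min_log_lines from the detection settings. +const WATCH_PATTERNS: &[&str] = &["error", "timeout", "failed", "unable to", "refused", "denied", "exceeded", "quota"]; +const MIN_LOG_LINES: usize = 10; + +fn should_analyze(log_lines: &[&str]) -> bool { + if log_lines.len() < MIN_LOG_LINES { + return false; // not enough context collected yet + } + log_lines.iter().any(|line| { + let lower = line.to_lowercase(); + WATCH_PATTERNS.iter().any(|p| lower.contains(p)) + }) +} +```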
 + +## Limitations + +### What AI Can Troubleshoot + +✅ Configuration errors +✅ Resource limit problems +✅ Networking/security group issues +✅ Database connectivity problems +✅ Deployment ordering issues +✅ Common application errors +✅ Performance problems + +### What Requires Human Review + +⚠️ Data corruption scenarios +⚠️ Multi-failure cascades +⚠️ Unclear error messages +⚠️ Custom application code failures +⚠️ Third-party service issues +⚠️ Physical infrastructure failures + +## Examples and Guides + +### Common Issues - Quick Links + +- [Database Connectivity](../troubleshooting/database-connectivity.md) +- [Kubernetes Pod Failures](../troubleshooting/kubernetes-pods.md) +- [Network Configuration](../troubleshooting/networking.md) +- [Performance Issues](../troubleshooting/performance.md) +- [Resource Limits](../troubleshooting/resource-limits.md) + +## Related Documentation + +- [Architecture](architecture.md) - AI system overview +- [RAG System](rag-system.md) - Context retrieval for troubleshooting +- [Configuration](configuration.md) - Setup guide +- [Security Policies](security-policies.md) - Safe log handling +- [ADR-015](../architecture/adr/adr-015-ai-integration-architecture.md) - Design decisions + +--- + +**Last Updated**: 2025-01-13 +**Status**: ✅ Production-Ready +**Success Rate**: 85-95% accuracy in root cause identification +**Supported**: All deployment types (infrastructure, Kubernetes, database) \ No newline at end of file diff --git a/docs/src/api-reference/README.md b/docs/src/api-reference/README.md index c1ff6f0..4639354 100644 --- a/docs/src/api-reference/README.md +++ b/docs/src/api-reference/README.md @@ -1 +1,28 @@ -# API Documentation\n\nAPI reference for programmatic access to the Provisioning Platform.\n\n## Available APIs\n\n- [REST API](rest-api.md) - HTTP endpoints for all operations\n- [WebSocket API](websocket.md) - Real-time event streams\n- [Extensions API](extensions.md) - Extension integration interfaces\n- [SDKs](sdks.md) - Client libraries for multiple languages\n- [Integration Examples](integration-examples.md) - Code examples and patterns\n\n## Quick Start\n\n```\n# Check API health\ncurl http://localhost:9090/health\n\n# List tasks via API\ncurl http://localhost:9090/tasks\n\n# Submit workflow\ncurl -X POST http://localhost:9090/workflows/servers/create \\n -H "Content-Type: application/json" \\n -d '{"infra": "my-project", "servers": ["web-01"]}'\n```\n\nSee [REST API](rest-api.md) for complete endpoint documentation. +# API Documentation + +API reference for programmatic access to the Provisioning Platform. + +## Available APIs + +- [REST API](rest-api.md) - HTTP endpoints for all operations +- [WebSocket API](websocket.md) - Real-time event streams +- [Extensions API](extensions.md) - Extension integration interfaces +- [SDKs](sdks.md) - Client libraries for multiple languages +- [Integration Examples](integration-examples.md) - Code examples and patterns + +## Quick Start + +```text +# Check API health +curl http://localhost:9090/health + +# List tasks via API +curl http://localhost:9090/tasks + +# Submit workflow +curl -X POST http://localhost:9090/workflows/servers/create \ + -H "Content-Type: application/json" \ + -d '{"infra": "my-project", "servers": ["web-01"]}' +``` + +See [REST API](rest-api.md) for complete endpoint documentation.
\ No newline at end of file diff --git a/docs/src/api-reference/extensions.md b/docs/src/api-reference/extensions.md index 9fb2620..0e588e4 100644 --- a/docs/src/api-reference/extensions.md +++ b/docs/src/api-reference/extensions.md @@ -1 +1,1205 @@ -# Extension Development API\n\nThis document provides comprehensive guidance for developing extensions for provisioning, including providers, task services, and cluster configurations.\n\n## Overview\n\nProvisioning supports three types of extensions:\n\n1. **Providers**: Cloud infrastructure providers (AWS, UpCloud, Local, etc.)\n2. **Task Services**: Infrastructure components (Kubernetes, Cilium, Containerd, etc.)\n3. **Clusters**: Complete deployment configurations (BuildKit, CI/CD, etc.)\n\nAll extensions follow a standardized structure and API for seamless integration.\n\n## Extension Structure\n\n### Standard Directory Layout\n\n```\nextension-name/\n├── manifest.toml # Extension metadata\n├── schemas/ # Nickel configuration files\n│ ├── main.ncl # Main schema\n│ ├── settings.ncl # Settings schema\n│ ├── version.ncl # Version configuration\n│ └── contracts.ncl # Contract definitions\n├── nulib/ # Nushell library modules\n│ ├── mod.nu # Main module\n│ ├── create.nu # Creation operations\n│ ├── delete.nu # Deletion operations\n│ └── utils.nu # Utility functions\n├── templates/ # Jinja2 templates\n│ ├── config.j2 # Configuration templates\n│ └── scripts/ # Script templates\n├── generate/ # Code generation scripts\n│ └── generate.nu # Generation commands\n├── README.md # Extension documentation\n└── metadata.toml # Extension metadata\n```\n\n## Provider Extension API\n\n### Provider Interface\n\nAll providers must implement the following interface:\n\n#### Core Operations\n\n- `create-server(config: record) -> record`\n- `delete-server(server_id: string) -> null`\n- `list-servers() -> list`\n- `get-server-info(server_id: string) -> record`\n- `start-server(server_id: string) -> null`\n- `stop-server(server_id: string) -> null`\n- `reboot-server(server_id: string) -> null`\n\n#### Pricing and Plans\n\n- `get-pricing() -> list`\n- `get-plans() -> list`\n- `get-zones() -> list`\n\n#### SSH and Access\n\n- `get-ssh-access(server_id: string) -> record`\n- `configure-firewall(server_id: string, rules: list) -> null`\n\n### Provider Development Template\n\n#### Nickel Configuration Schema\n\nCreate `schemas/settings.ncl`:\n\n```\n# Provider settings schema\n{\n ProviderSettings = {\n # Authentication configuration\n auth | {\n method | "api_key" | "certificate" | "oauth" | "basic",\n api_key | String = null,\n api_secret | String = null,\n username | String = null,\n password | String = null,\n certificate_path | String = null,\n private_key_path | String = null,\n },\n\n # API configuration\n api | {\n base_url | String,\n version | String = "v1",\n timeout | Number = 30,\n retries | Number = 3,\n },\n\n # Default server configuration\n defaults: {\n plan?: str\n zone?: str\n os?: str\n ssh_keys?: [str]\n firewall_rules?: [FirewallRule]\n }\n\n # Provider-specific settings\n features: {\n load_balancer?: bool = false\n storage_encryption?: bool = true\n backup?: bool = true\n monitoring?: bool = false\n }\n}\n\nschema FirewallRule {\n direction: "ingress" | "egress"\n protocol: "tcp" | "udp" | "icmp"\n port?: str\n source?: str\n destination?: str\n action: "allow" | "deny"\n}\n\nschema ServerConfig {\n hostname: str\n plan: str\n zone: str\n os: str = "ubuntu-22.04"\n ssh_keys: [str] = []\n tags?: {str: str} = {}\n firewall_rules?: [FirewallRule] 
= []\n storage?: {\n size?: int\n type?: str\n encrypted?: bool = true\n }\n network?: {\n public_ip?: bool = true\n private_network?: str\n bandwidth?: int\n }\n}\n```\n\n#### Nushell Implementation\n\nCreate `nulib/mod.nu`:\n\n```\nuse std log\n\n# Provider name and version\nexport const PROVIDER_NAME = "my-provider"\nexport const PROVIDER_VERSION = "1.0.0"\n\n# Import sub-modules\nuse create.nu *\nuse delete.nu *\nuse utils.nu *\n\n# Provider interface implementation\nexport def "provider-info" [] -> record {\n {\n name: $PROVIDER_NAME,\n version: $PROVIDER_VERSION,\n type: "provider",\n interface: "API",\n supported_operations: [\n "create-server", "delete-server", "list-servers",\n "get-server-info", "start-server", "stop-server"\n ],\n required_auth: ["api_key", "api_secret"],\n supported_os: ["ubuntu-22.04", "debian-11", "centos-8"],\n regions: (get-zones).name\n }\n}\n\nexport def "validate-config" [config: record] -> record {\n mut errors = []\n mut warnings = []\n\n # Validate authentication\n if ($config | get -o "auth.api_key" | is-empty) {\n $errors = ($errors | append "Missing API key")\n }\n\n if ($config | get -o "auth.api_secret" | is-empty) {\n $errors = ($errors | append "Missing API secret")\n }\n\n # Validate API configuration\n let api_url = ($config | get -o "api.base_url")\n if ($api_url | is-empty) {\n $errors = ($errors | append "Missing API base URL")\n } else {\n try {\n http get $"($api_url)/health" | ignore\n } catch {\n $warnings = ($warnings | append "API endpoint not reachable")\n }\n }\n\n {\n valid: ($errors | is-empty),\n errors: $errors,\n warnings: $warnings\n }\n}\n\nexport def "test-connection" [config: record] -> record {\n try {\n let api_url = ($config | get "api.base_url")\n let response = (http get $"($api_url)/account" --headers {\n Authorization: $"Bearer ($config | get 'auth.api_key')"\n })\n\n {\n success: true,\n account_info: $response,\n message: "Connection successful"\n }\n } catch {|e|\n {\n success: false,\n error: ($e | get msg),\n message: "Connection failed"\n }\n }\n}\n```\n\nCreate `nulib/create.nu`:\n\n```\nuse std log\nuse utils.nu *\n\nexport def "create-server" [\n config: record # Server configuration\n --check # Check mode only\n --wait # Wait for completion\n] -> record {\n log info $"Creating server: ($config.hostname)"\n\n if $check {\n return {\n action: "create-server",\n hostname: $config.hostname,\n check_mode: true,\n would_create: true,\n estimated_time: "2-5 minutes"\n }\n }\n\n # Validate configuration\n let validation = (validate-server-config $config)\n if not $validation.valid {\n error make {\n msg: $"Invalid server configuration: ($validation.errors | str join ', ')"\n }\n }\n\n # Prepare API request\n let api_config = (get-api-config)\n let request_body = {\n hostname: $config.hostname,\n plan: $config.plan,\n zone: $config.zone,\n os: $config.os,\n ssh_keys: $config.ssh_keys,\n tags: $config.tags,\n firewall_rules: $config.firewall_rules\n }\n\n try {\n let response = (http post $"($api_config.base_url)/servers" --headers {\n Authorization: $"Bearer ($api_config.auth.api_key)"\n Content-Type: "application/json"\n } $request_body)\n\n let server_id = ($response | get id)\n log info $"Server creation initiated: ($server_id)"\n\n if $wait {\n let final_status = (wait-for-server-ready $server_id)\n {\n success: true,\n server_id: $server_id,\n hostname: $config.hostname,\n status: $final_status,\n ip_addresses: (get-server-ips $server_id),\n ssh_access: (get-ssh-access $server_id)\n }\n } else {\n {\n success: 
true,\n server_id: $server_id,\n hostname: $config.hostname,\n status: "creating",\n message: "Server creation in progress"\n }\n }\n } catch {|e|\n error make {\n msg: $"Server creation failed: ($e | get msg)"\n }\n }\n}\n\ndef validate-server-config [config: record] -> record {\n mut errors = []\n\n # Required fields\n if ($config | get -o hostname | is-empty) {\n $errors = ($errors | append "Hostname is required")\n }\n\n if ($config | get -o plan | is-empty) {\n $errors = ($errors | append "Plan is required")\n }\n\n if ($config | get -o zone | is-empty) {\n $errors = ($errors | append "Zone is required")\n }\n\n # Validate plan exists\n let available_plans = (get-plans)\n if not ($config.plan in ($available_plans | get name)) {\n $errors = ($errors | append $"Invalid plan: ($config.plan)")\n }\n\n # Validate zone exists\n let available_zones = (get-zones)\n if not ($config.zone in ($available_zones | get name)) {\n $errors = ($errors | append $"Invalid zone: ($config.zone)")\n }\n\n {\n valid: ($errors | is-empty),\n errors: $errors\n }\n}\n\ndef wait-for-server-ready [server_id: string] -> string {\n mut attempts = 0\n let max_attempts = 60 # 10 minutes\n\n while $attempts < $max_attempts {\n let server_info = (get-server-info $server_id)\n let status = ($server_info | get status)\n\n match $status {\n "running" => { return "running" },\n "error" => { error make { msg: "Server creation failed" } },\n _ => {\n log info $"Server status: ($status), waiting..."\n sleep 10sec\n $attempts = $attempts + 1\n }\n }\n }\n\n error make { msg: "Server creation timeout" }\n}\n```\n\n### Provider Registration\n\nAdd provider metadata in `metadata.toml`:\n\n```\n[extension]\nname = "my-provider"\ntype = "provider"\nversion = "1.0.0"\ndescription = "Custom cloud provider integration"\nauthor = "Your Name "\nlicense = "MIT"\n\n[compatibility]\nprovisioning_version = ">=2.0.0"\nnushell_version = ">=0.107.0"\nnickel_version = ">=1.15.0"\n\n[capabilities]\nserver_management = true\nload_balancer = false\nstorage_encryption = true\nbackup = true\nmonitoring = false\n\n[authentication]\nmethods = ["api_key", "certificate"]\nrequired_fields = ["api_key", "api_secret"]\n\n[regions]\ndefault = "us-east-1"\navailable = ["us-east-1", "us-west-2", "eu-west-1"]\n\n[support]\ndocumentation = "https://docs.example.com/provider"\nissues = "https://github.com/example/provider/issues"\n```\n\n## Task Service Extension API\n\n### Task Service Interface\n\nTask services must implement:\n\n#### Core Operations\n\n- `install(config: record) -> record`\n- `uninstall(config: record) -> null`\n- `configure(config: record) -> null`\n- `status() -> record`\n- `restart() -> null`\n- `upgrade(version: string) -> record`\n\n#### Version Management\n\n- `get-current-version() -> string`\n- `get-available-versions() -> list`\n- `check-updates() -> record`\n\n### Task Service Development Template\n\n#### Nickel Schema\n\nCreate `schemas/version.ncl`:\n\n```\n# Task service version configuration\n{\n taskserv_version = {\n name | String = "my-service",\n version | String = "1.0.0",\n\n # Version source configuration\n source | {\n type | String = "github",\n repository | String,\n release_pattern | String = "v{version}",\n },\n\n # Installation configuration\n install | {\n method | String = "binary",\n binary_name | String,\n binary_path | String = "/usr/local/bin",\n config_path | String = "/etc/my-service",\n data_path | String = "/var/lib/my-service",\n },\n\n # Dependencies\n dependencies | [\n {\n name | String,\n version | 
String = ">=1.0.0",\n }\n ],\n\n # Service configuration\n service | {\n type | String = "systemd",\n user | String = "my-service",\n group | String = "my-service",\n ports | [Number] = [8080, 9090],\n },\n\n # Health check configuration\n health_check | {\n endpoint | String,\n interval | Number = 30,\n timeout | Number = 5,\n retries | Number = 3,\n },\n }\n}\n```\n\n#### Nushell Implementation\n\nCreate `nulib/mod.nu`:\n\n```\nuse std log\nuse ../../../lib_provisioning *\n\nexport const SERVICE_NAME = "my-service"\nexport const SERVICE_VERSION = "1.0.0"\n\nexport def "taskserv-info" [] -> record {\n {\n name: $SERVICE_NAME,\n version: $SERVICE_VERSION,\n type: "taskserv",\n category: "application",\n description: "Custom application service",\n dependencies: ["containerd"],\n ports: [8080, 9090],\n config_files: ["/etc/my-service/config.yaml"],\n data_directories: ["/var/lib/my-service"]\n }\n}\n\nexport def "install" [\n config: record = {}\n --check # Check mode only\n --version: string # Specific version to install\n] -> record {\n let install_version = if ($version | is-not-empty) {\n $version\n } else {\n (get-latest-version)\n }\n\n log info $"Installing ($SERVICE_NAME) version ($install_version)"\n\n if $check {\n return {\n action: "install",\n service: $SERVICE_NAME,\n version: $install_version,\n check_mode: true,\n would_install: true,\n requirements_met: (check-requirements)\n }\n }\n\n # Check system requirements\n let req_check = (check-requirements)\n if not $req_check.met {\n error make {\n msg: $"Requirements not met: ($req_check.missing | str join ', ')"\n }\n }\n\n # Download and install\n let binary_path = (download-binary $install_version)\n install-binary $binary_path\n create-user-and-directories\n generate-config $config\n install-systemd-service\n\n # Start service\n systemctl start $SERVICE_NAME\n systemctl enable $SERVICE_NAME\n\n # Verify installation\n let health = (check-health)\n if not $health.healthy {\n error make { msg: "Service failed health check after installation" }\n }\n\n {\n success: true,\n service: $SERVICE_NAME,\n version: $install_version,\n status: "running",\n health: $health\n }\n}\n\nexport def "uninstall" [\n --force # Force removal even if running\n --keep-data # Keep data directories\n] -> null {\n log info $"Uninstalling ($SERVICE_NAME)"\n\n # Stop and disable service\n try {\n systemctl stop $SERVICE_NAME\n systemctl disable $SERVICE_NAME\n } catch {\n log warning "Failed to stop systemd service"\n }\n\n # Remove binary\n try {\n rm -f $"/usr/local/bin/($SERVICE_NAME)"\n } catch {\n log warning "Failed to remove binary"\n }\n\n # Remove configuration\n try {\n rm -rf $"/etc/($SERVICE_NAME)"\n } catch {\n log warning "Failed to remove configuration"\n }\n\n # Remove data directories (unless keeping)\n if not $keep_data {\n try {\n rm -rf $"/var/lib/($SERVICE_NAME)"\n } catch {\n log warning "Failed to remove data directories"\n }\n }\n\n # Remove systemd service file\n try {\n rm -f $"/etc/systemd/system/($SERVICE_NAME).service"\n systemctl daemon-reload\n } catch {\n log warning "Failed to remove systemd service"\n }\n\n log info $"($SERVICE_NAME) uninstalled successfully"\n}\n\nexport def "status" [] -> record {\n let systemd_status = try {\n systemctl is-active $SERVICE_NAME | str trim\n } catch {\n "unknown"\n }\n\n let health = (check-health)\n let version = (get-current-version)\n\n {\n service: $SERVICE_NAME,\n version: $version,\n systemd_status: $systemd_status,\n health: $health,\n uptime: (get-service-uptime),\n 
memory_usage: (get-memory-usage),\n cpu_usage: (get-cpu-usage)\n }\n}\n\ndef check-requirements [] -> record {\n mut missing = []\n mut met = true\n\n # Check for containerd\n if not (which containerd | is-not-empty) {\n $missing = ($missing | append "containerd")\n $met = false\n }\n\n # Check for systemctl\n if not (which systemctl | is-not-empty) {\n $missing = ($missing | append "systemctl")\n $met = false\n }\n\n {\n met: $met,\n missing: $missing\n }\n}\n\ndef check-health [] -> record {\n try {\n let response = (http get "http://localhost:9090/health")\n {\n healthy: true,\n status: ($response | get status),\n last_check: (date now)\n }\n } catch {\n {\n healthy: false,\n error: "Health endpoint not responding",\n last_check: (date now)\n }\n }\n}\n```\n\n## Cluster Extension API\n\n### Cluster Interface\n\nClusters orchestrate multiple components:\n\n#### Core Operations\n\n- `create(config: record) -> record`\n- `delete(config: record) -> null`\n- `status() -> record`\n- `scale(replicas: int) -> record`\n- `upgrade(version: string) -> record`\n\n#### Component Management\n\n- `list-components() -> list`\n- `component-status(name: string) -> record`\n- `restart-component(name: string) -> null`\n\n### Cluster Development Template\n\n#### Nickel Configuration\n\nCreate `schemas/cluster.ncl`:\n\n```\n# Cluster configuration schema\n{\n ClusterConfig = {\n # Cluster metadata\n name | String,\n version | String = "1.0.0",\n description | String = "",\n\n # Components to deploy\n components | [Component],\n\n # Resource requirements\n resources | {\n min_nodes | Number = 1,\n cpu_per_node | String = "2",\n memory_per_node | String = "4Gi",\n storage_per_node | String = "20Gi",\n },\n\n # Network configuration\n network | {\n cluster_cidr | String = "10.244.0.0/16",\n service_cidr | String = "10.96.0.0/12",\n dns_domain | String = "cluster.local",\n },\n\n # Feature flags\n features | {\n monitoring | Bool = true,\n logging | Bool = true,\n ingress | Bool = false,\n storage | Bool = true,\n },\n },\n\n Component = {\n name | String,\n type | String | "taskserv" | "application" | "infrastructure",\n version | String = "",\n enabled | Bool = true,\n dependencies | [String] = [],\n config | {} = {},\n resources | {\n cpu | String = "",\n memory | String = "",\n storage | String = "",\n replicas | Number = 1,\n } = {},\n },\n\n # Example cluster configuration\n buildkit_cluster = {\n name = "buildkit",\n version = "1.0.0",\n description = "Container build cluster with BuildKit and registry",\n components = [\n {\n name = "containerd",\n type = "taskserv",\n version = "1.7.0",\n enabled = true,\n dependencies = [],\n },\n {\n name = "buildkit",\n type = "taskserv",\n version = "0.12.0",\n enabled = true,\n dependencies = ["containerd"],\n config = {\n worker_count = 4,\n cache_size = "10Gi",\n registry_mirrors = ["registry:5000"],\n },\n },\n {\n name = "registry",\n type = "application",\n version = "2.8.0",\n enabled = true,\n dependencies = [],\n config = {\n storage_driver = "filesystem",\n storage_path = "/var/lib/registry",\n auth_enabled = false,\n },\n resources = {\n cpu = "500m",\n memory = "1Gi",\n storage = "50Gi",\n replicas = 1,\n },\n },\n ],\n resources = {\n min_nodes = 1,\n cpu_per_node = "4",\n memory_per_node = "8Gi",\n storage_per_node = "100Gi",\n },\n features = {\n monitoring = true,\n logging = true,\n ingress = false,\n storage = true,\n },\n },\n}\n```\n\n#### Nushell Implementation\n\nCreate `nulib/mod.nu`:\n\n```\nuse std log\nuse ../../../lib_provisioning 
*\n\nexport const CLUSTER_NAME = "my-cluster"\nexport const CLUSTER_VERSION = "1.0.0"\n\nexport def "cluster-info" [] -> record {\n {\n name: $CLUSTER_NAME,\n version: $CLUSTER_VERSION,\n type: "cluster",\n category: "build",\n description: "Custom application cluster",\n components: (get-cluster-components),\n required_resources: {\n min_nodes: 1,\n cpu_per_node: "2",\n memory_per_node: "4Gi",\n storage_per_node: "20Gi"\n }\n }\n}\n\nexport def "create" [\n config: record = {}\n --check # Check mode only\n --wait # Wait for completion\n] -> record {\n log info $"Creating cluster: ($CLUSTER_NAME)"\n\n if $check {\n return {\n action: "create-cluster",\n cluster: $CLUSTER_NAME,\n check_mode: true,\n would_create: true,\n components: (get-cluster-components),\n requirements_check: (check-cluster-requirements)\n }\n }\n\n # Validate cluster requirements\n let req_check = (check-cluster-requirements)\n if not $req_check.met {\n error make {\n msg: $"Cluster requirements not met: ($req_check.issues | str join ', ')"\n }\n }\n\n # Get component deployment order\n let components = (get-cluster-components)\n let deployment_order = (resolve-component-dependencies $components)\n\n mut deployment_status = []\n\n # Deploy components in dependency order\n for component in $deployment_order {\n log info $"Deploying component: ($component.name)"\n\n try {\n let result = match $component.type {\n "taskserv" => {\n taskserv create $component.name --config $component.config --wait\n },\n "application" => {\n deploy-application $component\n },\n _ => {\n error make { msg: $"Unknown component type: ($component.type)" }\n }\n }\n\n $deployment_status = ($deployment_status | append {\n component: $component.name,\n status: "deployed",\n result: $result\n })\n\n } catch {|e|\n log error $"Failed to deploy ($component.name): ($e.msg)"\n $deployment_status = ($deployment_status | append {\n component: $component.name,\n status: "failed",\n error: $e.msg\n })\n\n # Rollback on failure\n rollback-cluster-deployment $deployment_status\n error make { msg: $"Cluster deployment failed at component: ($component.name)" }\n }\n }\n\n # Configure cluster networking and integrations\n configure-cluster-networking $config\n setup-cluster-monitoring $config\n\n # Wait for all components to be ready\n if $wait {\n wait-for-cluster-ready\n }\n\n {\n success: true,\n cluster: $CLUSTER_NAME,\n components: $deployment_status,\n endpoints: (get-cluster-endpoints),\n status: "running"\n }\n}\n\nexport def "delete" [\n config: record = {}\n --force # Force deletion\n] -> null {\n log info $"Deleting cluster: ($CLUSTER_NAME)"\n\n let components = (get-cluster-components)\n let deletion_order = ($components | reverse) # Delete in reverse order\n\n for component in $deletion_order {\n log info $"Removing component: ($component.name)"\n\n try {\n match $component.type {\n "taskserv" => {\n taskserv delete $component.name --force=$force\n },\n "application" => {\n remove-application $component --force=$force\n },\n _ => {\n log warning $"Unknown component type: ($component.type)"\n }\n }\n } catch {|e|\n log error $"Failed to remove ($component.name): ($e.msg)"\n if not $force {\n error make { msg: $"Component removal failed: ($component.name)" }\n }\n }\n }\n\n # Clean up cluster-level resources\n cleanup-cluster-networking\n cleanup-cluster-monitoring\n cleanup-cluster-storage\n\n log info $"Cluster ($CLUSTER_NAME) deleted successfully"\n}\n\ndef get-cluster-components [] -> list {\n [\n {\n name: "containerd",\n type: "taskserv",\n 
version: "1.7.0",\n dependencies: []\n },\n {\n name: "my-service",\n type: "taskserv",\n version: "1.0.0",\n dependencies: ["containerd"]\n },\n {\n name: "registry",\n type: "application",\n version: "2.8.0",\n dependencies: []\n }\n ]\n}\n\ndef resolve-component-dependencies [components: list] -> list {\n # Topological sort of components based on dependencies\n mut sorted = []\n mut remaining = $components\n\n while ($remaining | length) > 0 {\n let no_deps = ($remaining | where {|comp|\n ($comp.dependencies | all {|dep|\n $dep in ($sorted | get name)\n })\n })\n\n if ($no_deps | length) == 0 {\n error make { msg: "Circular dependency detected in cluster components" }\n }\n\n $sorted = ($sorted | append $no_deps)\n $remaining = ($remaining | where {|comp|\n not ($comp.name in ($no_deps | get name))\n })\n }\n\n $sorted\n}\n```\n\n## Extension Registration and Discovery\n\n### Extension Registry\n\nExtensions are registered in the system through:\n\n1. **Directory Structure**: Placed in appropriate directories (providers/, taskservs/, cluster/)\n2. **Metadata Files**: `metadata.toml` with extension information\n3. **Schema Files**: `schemas/` directory with Nickel schema files\n\n### Registration API\n\n#### `register-extension(path: string, type: string) -> record`\n\nRegisters a new extension with the system.\n\n**Parameters:**\n\n- `path`: Path to extension directory\n- `type`: Extension type (provider, taskserv, cluster)\n\n#### `unregister-extension(name: string, type: string) -> null`\n\nRemoves extension from the registry.\n\n#### `list-registered-extensions(type?: string) -> list`\n\nLists all registered extensions, optionally filtered by type.\n\n### Extension Validation\n\n#### Validation Rules\n\n1. **Structure Validation**: Required files and directories exist\n2. **Schema Validation**: Nickel schemas are valid\n3. **Interface Validation**: Required functions are implemented\n4. **Dependency Validation**: Dependencies are available\n5. 
**Version Validation**: Version constraints are met\n\n#### `validate-extension(path: string, type: string) -> record`\n\nValidates extension structure and implementation.\n\n## Testing Extensions\n\n### Test Framework\n\nExtensions should include comprehensive tests:\n\n#### Unit Tests\n\nCreate `tests/unit_tests.nu`:\n\n```\nuse std testing\n\nexport def test_provider_config_validation [] {\n let config = {\n auth: { api_key: "test-key", api_secret: "test-secret" },\n api: { base_url: "https://api.test.com" }\n }\n\n let result = (validate-config $config)\n assert ($result.valid == true)\n assert ($result.errors | is-empty)\n}\n\nexport def test_server_creation_check_mode [] {\n let config = {\n hostname: "test-server",\n plan: "1xCPU-1 GB",\n zone: "test-zone"\n }\n\n let result = (create-server $config --check)\n assert ($result.check_mode == true)\n assert ($result.would_create == true)\n}\n```\n\n#### Integration Tests\n\nCreate `tests/integration_tests.nu`:\n\n```\nuse std testing\n\nexport def test_full_server_lifecycle [] {\n # Test server creation\n let create_config = {\n hostname: "integration-test",\n plan: "1xCPU-1 GB",\n zone: "test-zone"\n }\n\n let server = (create-server $create_config --wait)\n assert ($server.success == true)\n let server_id = $server.server_id\n\n # Test server info retrieval\n let info = (get-server-info $server_id)\n assert ($info.hostname == "integration-test")\n assert ($info.status == "running")\n\n # Test server deletion\n delete-server $server_id\n\n # Verify deletion\n let final_info = try { get-server-info $server_id } catch { null }\n assert ($final_info == null)\n}\n```\n\n### Running Tests\n\n```\n# Run unit tests\nnu tests/unit_tests.nu\n\n# Run integration tests\nnu tests/integration_tests.nu\n\n# Run all tests\nnu tests/run_all_tests.nu\n```\n\n## Documentation Requirements\n\n### Extension Documentation\n\nEach extension must include:\n\n1. **README.md**: Overview, installation, and usage\n2. **API.md**: Detailed API documentation\n3. **EXAMPLES.md**: Usage examples and tutorials\n4. **CHANGELOG.md**: Version history and changes\n\n### API Documentation Template\n\n```\n# Extension Name API\n\n## Overview\nBrief description of the extension and its purpose.\n\n## Installation\nSteps to install and configure the extension.\n\n## Configuration\nConfiguration schema and options.\n\n## API Reference\nDetailed API documentation with examples.\n\n## Examples\nCommon usage patterns and examples.\n\n## Troubleshooting\nCommon issues and solutions.\n```\n\n## Best Practices\n\n### Development Guidelines\n\n1. **Follow Naming Conventions**: Use consistent naming for functions and variables\n2. **Error Handling**: Implement comprehensive error handling and recovery\n3. **Logging**: Use structured logging for debugging and monitoring\n4. **Configuration Validation**: Validate all inputs and configurations\n5. **Documentation**: Document all public APIs and configurations\n6. **Testing**: Include comprehensive unit and integration tests\n7. **Versioning**: Follow semantic versioning principles\n8. **Security**: Implement secure credential handling and API calls\n\n### Performance Considerations\n\n1. **Caching**: Cache expensive operations and API calls\n2. **Parallel Processing**: Use parallel execution where possible\n3. **Resource Management**: Clean up resources properly\n4. **Batch Operations**: Batch API calls when possible\n5. **Health Monitoring**: Implement health checks and monitoring\n\n### Security Best Practices\n\n1. 
**Credential Management**: Store credentials securely\n2. **Input Validation**: Validate and sanitize all inputs\n3. **Access Control**: Implement proper access controls\n4. **Audit Logging**: Log all security-relevant operations\n5. **Encryption**: Encrypt sensitive data in transit and at rest\n\nThis extension development API provides a comprehensive framework for building robust, scalable, and maintainable extensions for provisioning.
+# Extension Development API
+
+This document provides comprehensive guidance for developing extensions for provisioning, including providers, task services, and cluster configurations.
+
+## Overview
+
+Provisioning supports three types of extensions:
+
+1. **Providers**: Cloud infrastructure providers (AWS, UpCloud, Local, etc.)
+2. **Task Services**: Infrastructure components (Kubernetes, Cilium, Containerd, etc.)
+3. **Clusters**: Complete deployment configurations (BuildKit, CI/CD, etc.)
+
+All extensions follow a standardized structure and API for seamless integration.
+
+## Extension Structure
+
+### Standard Directory Layout
+
+```text
+extension-name/
+├── manifest.toml        # Extension metadata
+├── schemas/             # Nickel configuration files
+│   ├── main.ncl         # Main schema
+│   ├── settings.ncl     # Settings schema
+│   ├── version.ncl      # Version configuration
+│   └── contracts.ncl    # Contract definitions
+├── nulib/               # Nushell library modules
+│   ├── mod.nu           # Main module
+│   ├── create.nu        # Creation operations
+│   ├── delete.nu        # Deletion operations
+│   └── utils.nu         # Utility functions
+├── templates/           # Jinja2 templates
+│   ├── config.j2        # Configuration templates
+│   └── scripts/         # Script templates
+├── generate/            # Code generation scripts
+│   └── generate.nu      # Generation commands
+├── README.md            # Extension documentation
+└── metadata.toml        # Extension metadata
+```
+
+## Provider Extension API
+
+### Provider Interface
+
+All providers must implement the following interface:
+
+#### Core Operations
+
+- `create-server(config: record) -> record`
+- `delete-server(server_id: string) -> null`
+- `list-servers() -> list`
+- `get-server-info(server_id: string) -> record`
+- `start-server(server_id: string) -> null`
+- `stop-server(server_id: string) -> null`
+- `reboot-server(server_id: string) -> null`
+
+#### Pricing and Plans
+
+- `get-pricing() -> list`
+- `get-plans() -> list`
+- `get-zones() -> list`
+
+#### SSH and Access
+
+- `get-ssh-access(server_id: string) -> record`
+- `configure-firewall(server_id: string, rules: list) -> null`
+
+### Provider Development Template
+
+#### Nickel Configuration Schema
+
+Create `schemas/settings.ncl`:
+
+```text
+# Provider settings schema
+{
+  ProviderSettings = {
+    # Authentication configuration
+    auth | {
+      method | "api_key" | "certificate" | "oauth" | "basic",
+      api_key | String = null,
+      api_secret | String = null,
+      username | String = null,
+      password | String = null,
+      certificate_path | String = null,
+      private_key_path | String = null,
+    },
+
+    # API configuration
+    api | {
+      base_url | String,
+      version | String = "v1",
+      timeout | Number = 30,
+      retries | Number = 3,
+    },
+
+    # Default server configuration
+    defaults | {
+      plan | String = null,
+      zone | String = null,
+      os | String = null,
+      ssh_keys | [String] = [],
+      firewall_rules | [FirewallRule] = [],
+    },
+
+    # Provider-specific settings
+    features | {
+      load_balancer | Bool = false,
+      storage_encryption | Bool = true,
+      backup | Bool = true,
+      monitoring | Bool = false,
+    },
+  },
+
+  FirewallRule = {
+    direction | "ingress" | "egress",
+    protocol | "tcp" | "udp" | "icmp",
+    port | String = null,
+    source | String = null,
+    destination | String = null,
+    action | "allow" | "deny",
+  },
+
+  ServerConfig = {
+    hostname | String,
+    plan | String,
+    zone | String,
+    os | String = "ubuntu-22.04",
+    ssh_keys | [String] = [],
+    tags | {} = {},
+    firewall_rules | [FirewallRule] = [],
+    storage | {
+      size | Number = null,
+      type | String = null,
+      encrypted | Bool = true,
+    } = {},
+    network | {
+      public_ip | Bool = true,
+      private_network | String = null,
+      bandwidth | Number = null,
+    } = {},
+  },
+}
+```
+
+#### Nushell Implementation
+
+Create `nulib/mod.nu`:
+
+```text
+use std log
+
+# Provider name and version
+export const PROVIDER_NAME = "my-provider"
+export const PROVIDER_VERSION = "1.0.0"
+
+# Import sub-modules
+use create.nu *
+use delete.nu *
+use utils.nu *
+
+# Provider interface implementation
+export def "provider-info" [] -> record {
+  {
+    name: $PROVIDER_NAME,
+    version: $PROVIDER_VERSION,
+    type: "provider",
+    interface: "API",
+    supported_operations: [
+      "create-server", "delete-server", "list-servers",
+      "get-server-info", "start-server", "stop-server"
+    ],
+    required_auth: ["api_key", "api_secret"],
+    supported_os: ["ubuntu-22.04", "debian-11", "centos-8"],
+    regions: (get-zones).name
+  }
+}
+
+export def "validate-config" [config: record] -> record {
+  mut errors = []
+  mut warnings = []
+
+  # Validate authentication
+  if ($config | get -o auth.api_key | is-empty) {
+    $errors = ($errors | append "Missing API key")
+  }
+
+  if ($config | get -o auth.api_secret | is-empty) {
+    $errors = ($errors | append "Missing API secret")
+  }
+
+  # Validate API configuration
+  let api_url = ($config | get -o api.base_url)
+  if ($api_url | is-empty) {
+    $errors = ($errors | append "Missing API base URL")
+  } else {
+    try {
+      http get $"($api_url)/health" | ignore
+    } catch {
+      $warnings = ($warnings | append "API endpoint not reachable")
+    }
+  }
+
+  {
+    valid: ($errors | is-empty),
+    errors: $errors,
+    warnings: $warnings
+  }
+}
+
+export def "test-connection" [config: record] -> record {
+  try {
+    let api_url = ($config | get api.base_url)
+    let response = (http get $"($api_url)/account" --headers {
+      Authorization: $"Bearer ($config | get auth.api_key)"
+    })
+
+    {
+      success: true,
+      account_info: $response,
+      message: "Connection successful"
+    }
+  } catch {|e|
+    {
+      success: false,
+      error: ($e | get msg),
+      message: "Connection failed"
+    }
+  }
+}
+```
+
+Create `nulib/create.nu`:
+
+```text
+use std log
+use utils.nu *
+
+export def "create-server" [
+  config: record    # Server configuration
+  --check           # Check mode only
+  --wait            # Wait for completion
+] -> record {
+  log info $"Creating server: ($config.hostname)"
+
+  if $check {
+    return {
+      action: "create-server",
+      hostname: $config.hostname,
+      check_mode: true,
+      would_create: true,
+      estimated_time: "2-5 minutes"
+    }
+  }
+
+  # Validate configuration
+  let validation = (validate-server-config $config)
+  if not $validation.valid {
+    error make {
+      msg: $"Invalid server configuration: ($validation.errors | str join ', ')"
+    }
+  }
+
+  # Prepare API request
+  let api_config = (get-api-config)
+  let request_body = {
+    hostname: $config.hostname,
+    plan: $config.plan,
+    zone: $config.zone,
+    os: $config.os,
+    ssh_keys: $config.ssh_keys,
+    tags: $config.tags,
+    firewall_rules: $config.firewall_rules
+  }
+
+  try {
+    let response = (http post $"($api_config.base_url)/servers" --headers {
+      Authorization: $"Bearer ($api_config.auth.api_key)"
+      Content-Type: "application/json"
+    } $request_body)
+
+    let server_id = ($response | get id)
+    log info $"Server creation initiated: ($server_id)"
+
+    if $wait {
+      let final_status = (wait-for-server-ready $server_id)
+      {
success: true, + server_id: $server_id, + hostname: $config.hostname, + status: $final_status, + ip_addresses: (get-server-ips $server_id), + ssh_access: (get-ssh-access $server_id) + } + } else { + { + success: true, + server_id: $server_id, + hostname: $config.hostname, + status: "creating", + message: "Server creation in progress" + } + } + } catch {|e| + error make { + msg: $"Server creation failed: ($e | get msg)" + } + } +} + +def validate-server-config [config: record] -> record { + mut errors = [] + + # Required fields + if ($config | get -o hostname | is-empty) { + $errors = ($errors | append "Hostname is required") + } + + if ($config | get -o plan | is-empty) { + $errors = ($errors | append "Plan is required") + } + + if ($config | get -o zone | is-empty) { + $errors = ($errors | append "Zone is required") + } + + # Validate plan exists + let available_plans = (get-plans) + if not ($config.plan in ($available_plans | get name)) { + $errors = ($errors | append $"Invalid plan: ($config.plan)") + } + + # Validate zone exists + let available_zones = (get-zones) + if not ($config.zone in ($available_zones | get name)) { + $errors = ($errors | append $"Invalid zone: ($config.zone)") + } + + { + valid: ($errors | is-empty), + errors: $errors + } +} + +def wait-for-server-ready [server_id: string] -> string { + mut attempts = 0 + let max_attempts = 60 # 10 minutes + + while $attempts < $max_attempts { + let server_info = (get-server-info $server_id) + let status = ($server_info | get status) + + match $status { + "running" => { return "running" }, + "error" => { error make { msg: "Server creation failed" } }, + _ => { + log info $"Server status: ($status), waiting..." + sleep 10sec + $attempts = $attempts + 1 + } + } + } + + error make { msg: "Server creation timeout" } +} +``` + +### Provider Registration + +Add provider metadata in `metadata.toml`: + +```text +[extension] +name = "my-provider" +type = "provider" +version = "1.0.0" +description = "Custom cloud provider integration" +author = "Your Name " +license = "MIT" + +[compatibility] +provisioning_version = ">=2.0.0" +nushell_version = ">=0.107.0" +nickel_version = ">=1.15.0" + +[capabilities] +server_management = true +load_balancer = false +storage_encryption = true +backup = true +monitoring = false + +[authentication] +methods = ["api_key", "certificate"] +required_fields = ["api_key", "api_secret"] + +[regions] +default = "us-east-1" +available = ["us-east-1", "us-west-2", "eu-west-1"] + +[support] +documentation = "https://docs.example.com/provider" +issues = "https://github.com/example/provider/issues" +``` + +## Task Service Extension API + +### Task Service Interface + +Task services must implement: + +#### Core Operations + +- `install(config: record) -> record` +- `uninstall(config: record) -> null` +- `configure(config: record) -> null` +- `status() -> record` +- `restart() -> null` +- `upgrade(version: string) -> record` + +#### Version Management + +- `get-current-version() -> string` +- `get-available-versions() -> list` +- `check-updates() -> record` + +### Task Service Development Template + +#### Nickel Schema + +Create `schemas/version.ncl`: + +```text +# Task service version configuration +{ + taskserv_version = { + name | String = "my-service", + version | String = "1.0.0", + + # Version source configuration + source | { + type | String = "github", + repository | String, + release_pattern | String = "v{version}", + }, + + # Installation configuration + install | { + method | String = "binary", + binary_name | 
String, + binary_path | String = "/usr/local/bin", + config_path | String = "/etc/my-service", + data_path | String = "/var/lib/my-service", + }, + + # Dependencies + dependencies | [ + { + name | String, + version | String = ">=1.0.0", + } + ], + + # Service configuration + service | { + type | String = "systemd", + user | String = "my-service", + group | String = "my-service", + ports | [Number] = [8080, 9090], + }, + + # Health check configuration + health_check | { + endpoint | String, + interval | Number = 30, + timeout | Number = 5, + retries | Number = 3, + }, + } +} +``` + +#### Nushell Implementation + +Create `nulib/mod.nu`: + +```text +use std log +use ../../../lib_provisioning * + +export const SERVICE_NAME = "my-service" +export const SERVICE_VERSION = "1.0.0" + +export def "taskserv-info" [] -> record { + { + name: $SERVICE_NAME, + version: $SERVICE_VERSION, + type: "taskserv", + category: "application", + description: "Custom application service", + dependencies: ["containerd"], + ports: [8080, 9090], + config_files: ["/etc/my-service/config.yaml"], + data_directories: ["/var/lib/my-service"] + } +} + +export def "install" [ + config: record = {} + --check # Check mode only + --version: string # Specific version to install +] -> record { + let install_version = if ($version | is-not-empty) { + $version + } else { + (get-latest-version) + } + + log info $"Installing ($SERVICE_NAME) version ($install_version)" + + if $check { + return { + action: "install", + service: $SERVICE_NAME, + version: $install_version, + check_mode: true, + would_install: true, + requirements_met: (check-requirements) + } + } + + # Check system requirements + let req_check = (check-requirements) + if not $req_check.met { + error make { + msg: $"Requirements not met: ($req_check.missing | str join ', ')" + } + } + + # Download and install + let binary_path = (download-binary $install_version) + install-binary $binary_path + create-user-and-directories + generate-config $config + install-systemd-service + + # Start service + systemctl start $SERVICE_NAME + systemctl enable $SERVICE_NAME + + # Verify installation + let health = (check-health) + if not $health.healthy { + error make { msg: "Service failed health check after installation" } + } + + { + success: true, + service: $SERVICE_NAME, + version: $install_version, + status: "running", + health: $health + } +} + +export def "uninstall" [ + --force # Force removal even if running + --keep-data # Keep data directories +] -> null { + log info $"Uninstalling ($SERVICE_NAME)" + + # Stop and disable service + try { + systemctl stop $SERVICE_NAME + systemctl disable $SERVICE_NAME + } catch { + log warning "Failed to stop systemd service" + } + + # Remove binary + try { + rm -f $"/usr/local/bin/($SERVICE_NAME)" + } catch { + log warning "Failed to remove binary" + } + + # Remove configuration + try { + rm -rf $"/etc/($SERVICE_NAME)" + } catch { + log warning "Failed to remove configuration" + } + + # Remove data directories (unless keeping) + if not $keep_data { + try { + rm -rf $"/var/lib/($SERVICE_NAME)" + } catch { + log warning "Failed to remove data directories" + } + } + + # Remove systemd service file + try { + rm -f $"/etc/systemd/system/($SERVICE_NAME).service" + systemctl daemon-reload + } catch { + log warning "Failed to remove systemd service" + } + + log info $"($SERVICE_NAME) uninstalled successfully" +} + +export def "status" [] -> record { + let systemd_status = try { + systemctl is-active $SERVICE_NAME | str trim + } catch { + "unknown" + } 
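+
+    # NOTE (assumption): get-service-uptime, get-memory-usage, and
+    # get-cpu-usage used below are helper functions presumed to live in
+    # this extension's utils.nu; they are not shown in this template.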
+ + let health = (check-health) + let version = (get-current-version) + + { + service: $SERVICE_NAME, + version: $version, + systemd_status: $systemd_status, + health: $health, + uptime: (get-service-uptime), + memory_usage: (get-memory-usage), + cpu_usage: (get-cpu-usage) + } +} + +def check-requirements [] -> record { + mut missing = [] + mut met = true + + # Check for containerd + if not (which containerd | is-not-empty) { + $missing = ($missing | append "containerd") + $met = false + } + + # Check for systemctl + if not (which systemctl | is-not-empty) { + $missing = ($missing | append "systemctl") + $met = false + } + + { + met: $met, + missing: $missing + } +} + +def check-health [] -> record { + try { + let response = (http get "http://localhost:9090/health") + { + healthy: true, + status: ($response | get status), + last_check: (date now) + } + } catch { + { + healthy: false, + error: "Health endpoint not responding", + last_check: (date now) + } + } +} +``` + +## Cluster Extension API + +### Cluster Interface + +Clusters orchestrate multiple components: + +#### Core Operations + +- `create(config: record) -> record` +- `delete(config: record) -> null` +- `status() -> record` +- `scale(replicas: int) -> record` +- `upgrade(version: string) -> record` + +#### Component Management + +- `list-components() -> list` +- `component-status(name: string) -> record` +- `restart-component(name: string) -> null` + +### Cluster Development Template + +#### Nickel Configuration + +Create `schemas/cluster.ncl`: + +```text +# Cluster configuration schema +{ + ClusterConfig = { + # Cluster metadata + name | String, + version | String = "1.0.0", + description | String = "", + + # Components to deploy + components | [Component], + + # Resource requirements + resources | { + min_nodes | Number = 1, + cpu_per_node | String = "2", + memory_per_node | String = "4Gi", + storage_per_node | String = "20Gi", + }, + + # Network configuration + network | { + cluster_cidr | String = "10.244.0.0/16", + service_cidr | String = "10.96.0.0/12", + dns_domain | String = "cluster.local", + }, + + # Feature flags + features | { + monitoring | Bool = true, + logging | Bool = true, + ingress | Bool = false, + storage | Bool = true, + }, + }, + + Component = { + name | String, + type | String | "taskserv" | "application" | "infrastructure", + version | String = "", + enabled | Bool = true, + dependencies | [String] = [], + config | {} = {}, + resources | { + cpu | String = "", + memory | String = "", + storage | String = "", + replicas | Number = 1, + } = {}, + }, + + # Example cluster configuration + buildkit_cluster = { + name = "buildkit", + version = "1.0.0", + description = "Container build cluster with BuildKit and registry", + components = [ + { + name = "containerd", + type = "taskserv", + version = "1.7.0", + enabled = true, + dependencies = [], + }, + { + name = "buildkit", + type = "taskserv", + version = "0.12.0", + enabled = true, + dependencies = ["containerd"], + config = { + worker_count = 4, + cache_size = "10Gi", + registry_mirrors = ["registry:5000"], + }, + }, + { + name = "registry", + type = "application", + version = "2.8.0", + enabled = true, + dependencies = [], + config = { + storage_driver = "filesystem", + storage_path = "/var/lib/registry", + auth_enabled = false, + }, + resources = { + cpu = "500m", + memory = "1Gi", + storage = "50Gi", + replicas = 1, + }, + }, + ], + resources = { + min_nodes = 1, + cpu_per_node = "4", + memory_per_node = "8Gi", + storage_per_node = "100Gi", + }, + features 
= { + monitoring = true, + logging = true, + ingress = false, + storage = true, + }, + }, +} +``` + +#### Nushell Implementation + +Create `nulib/mod.nu`: + +```text +use std log +use ../../../lib_provisioning * + +export const CLUSTER_NAME = "my-cluster" +export const CLUSTER_VERSION = "1.0.0" + +export def "cluster-info" [] -> record { + { + name: $CLUSTER_NAME, + version: $CLUSTER_VERSION, + type: "cluster", + category: "build", + description: "Custom application cluster", + components: (get-cluster-components), + required_resources: { + min_nodes: 1, + cpu_per_node: "2", + memory_per_node: "4Gi", + storage_per_node: "20Gi" + } + } +} + +export def "create" [ + config: record = {} + --check # Check mode only + --wait # Wait for completion +] -> record { + log info $"Creating cluster: ($CLUSTER_NAME)" + + if $check { + return { + action: "create-cluster", + cluster: $CLUSTER_NAME, + check_mode: true, + would_create: true, + components: (get-cluster-components), + requirements_check: (check-cluster-requirements) + } + } + + # Validate cluster requirements + let req_check = (check-cluster-requirements) + if not $req_check.met { + error make { + msg: $"Cluster requirements not met: ($req_check.issues | str join ', ')" + } + } + + # Get component deployment order + let components = (get-cluster-components) + let deployment_order = (resolve-component-dependencies $components) + + mut deployment_status = [] + + # Deploy components in dependency order + for component in $deployment_order { + log info $"Deploying component: ($component.name)" + + try { + let result = match $component.type { + "taskserv" => { + taskserv create $component.name --config $component.config --wait + }, + "application" => { + deploy-application $component + }, + _ => { + error make { msg: $"Unknown component type: ($component.type)" } + } + } + + $deployment_status = ($deployment_status | append { + component: $component.name, + status: "deployed", + result: $result + }) + + } catch {|e| + log error $"Failed to deploy ($component.name): ($e.msg)" + $deployment_status = ($deployment_status | append { + component: $component.name, + status: "failed", + error: $e.msg + }) + + # Rollback on failure + rollback-cluster-deployment $deployment_status + error make { msg: $"Cluster deployment failed at component: ($component.name)" } + } + } + + # Configure cluster networking and integrations + configure-cluster-networking $config + setup-cluster-monitoring $config + + # Wait for all components to be ready + if $wait { + wait-for-cluster-ready + } + + { + success: true, + cluster: $CLUSTER_NAME, + components: $deployment_status, + endpoints: (get-cluster-endpoints), + status: "running" + } +} + +export def "delete" [ + config: record = {} + --force # Force deletion +] -> null { + log info $"Deleting cluster: ($CLUSTER_NAME)" + + let components = (get-cluster-components) + let deletion_order = ($components | reverse) # Delete in reverse order + + for component in $deletion_order { + log info $"Removing component: ($component.name)" + + try { + match $component.type { + "taskserv" => { + taskserv delete $component.name --force=$force + }, + "application" => { + remove-application $component --force=$force + }, + _ => { + log warning $"Unknown component type: ($component.type)" + } + } + } catch {|e| + log error $"Failed to remove ($component.name): ($e.msg)" + if not $force { + error make { msg: $"Component removal failed: ($component.name)" } + } + } + } + + # Clean up cluster-level resources + cleanup-cluster-networking + 
cleanup-cluster-monitoring + cleanup-cluster-storage + + log info $"Cluster ($CLUSTER_NAME) deleted successfully" +} + +def get-cluster-components [] -> list { + [ + { + name: "containerd", + type: "taskserv", + version: "1.7.0", + dependencies: [] + }, + { + name: "my-service", + type: "taskserv", + version: "1.0.0", + dependencies: ["containerd"] + }, + { + name: "registry", + type: "application", + version: "2.8.0", + dependencies: [] + } + ] +} + +def resolve-component-dependencies [components: list] -> list { + # Topological sort of components based on dependencies + mut sorted = [] + mut remaining = $components + + while ($remaining | length) > 0 { + let no_deps = ($remaining | where {|comp| + ($comp.dependencies | all {|dep| + $dep in ($sorted | get name) + }) + }) + + if ($no_deps | length) == 0 { + error make { msg: "Circular dependency detected in cluster components" } + } + + $sorted = ($sorted | append $no_deps) + $remaining = ($remaining | where {|comp| + not ($comp.name in ($no_deps | get name)) + }) + } + + $sorted +} +``` + +## Extension Registration and Discovery + +### Extension Registry + +Extensions are registered in the system through: + +1. **Directory Structure**: Placed in appropriate directories (providers/, taskservs/, cluster/) +2. **Metadata Files**: `metadata.toml` with extension information +3. **Schema Files**: `schemas/` directory with Nickel schema files + +### Registration API + +#### `register-extension(path: string, type: string) -> record` + +Registers a new extension with the system. + +**Parameters:** + +- `path`: Path to extension directory +- `type`: Extension type (provider, taskserv, cluster) + +#### `unregister-extension(name: string, type: string) -> null` + +Removes extension from the registry. + +#### `list-registered-extensions(type?: string) -> list` + +Lists all registered extensions, optionally filtered by type. + +### Extension Validation + +#### Validation Rules + +1. **Structure Validation**: Required files and directories exist +2. **Schema Validation**: Nickel schemas are valid +3. **Interface Validation**: Required functions are implemented +4. **Dependency Validation**: Dependencies are available +5. **Version Validation**: Version constraints are met + +#### `validate-extension(path: string, type: string) -> record` + +Validates extension structure and implementation. 
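+
+#### Example: Validating and Registering an Extension
+
+A minimal Nushell sketch of how the registry calls above fit together. It is illustrative only: it assumes the record returned by `validate-extension` exposes `valid` and `errors` fields (mirroring `validate-config` earlier in this document), and the extension path is a made-up example.
+
+```text
+# Validate first; register only if validation passes (hypothetical path)
+let ext_path = "extensions/providers/my-provider"
+let validation = (validate-extension $ext_path "provider")
+
+if $validation.valid {
+    register-extension $ext_path "provider"
+    list-registered-extensions "provider"
+} else {
+    error make { msg: $"Extension validation failed: ($validation.errors | str join ', ')" }
+}
+```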
+
+## Testing Extensions
+
+### Test Framework
+
+Extensions should include comprehensive tests:
+
+#### Unit Tests
+
+Create `tests/unit_tests.nu`:
+
+```text
+use std assert
+
+export def test_provider_config_validation [] {
+  let config = {
+    auth: { api_key: "test-key", api_secret: "test-secret" },
+    api: { base_url: "https://api.test.com" }
+  }
+
+  let result = (validate-config $config)
+  assert ($result.valid == true)
+  assert ($result.errors | is-empty)
+}
+
+export def test_server_creation_check_mode [] {
+  let config = {
+    hostname: "test-server",
+    plan: "1xCPU-1 GB",
+    zone: "test-zone"
+  }
+
+  let result = (create-server $config --check)
+  assert ($result.check_mode == true)
+  assert ($result.would_create == true)
+}
+```
+
+#### Integration Tests
+
+Create `tests/integration_tests.nu`:
+
+```text
+use std assert
+
+export def test_full_server_lifecycle [] {
+  # Test server creation
+  let create_config = {
+    hostname: "integration-test",
+    plan: "1xCPU-1 GB",
+    zone: "test-zone"
+  }
+
+  let server = (create-server $create_config --wait)
+  assert ($server.success == true)
+  let server_id = $server.server_id
+
+  # Test server info retrieval
+  let info = (get-server-info $server_id)
+  assert ($info.hostname == "integration-test")
+  assert ($info.status == "running")
+
+  # Test server deletion
+  delete-server $server_id
+
+  # Verify deletion
+  let final_info = try { get-server-info $server_id } catch { null }
+  assert ($final_info == null)
+}
+```
+
+### Running Tests
+
+```text
+# Run unit tests
+nu tests/unit_tests.nu
+
+# Run integration tests
+nu tests/integration_tests.nu
+
+# Run all tests
+nu tests/run_all_tests.nu
+```
+
+## Documentation Requirements
+
+### Extension Documentation
+
+Each extension must include:
+
+1. **README.md**: Overview, installation, and usage
+2. **API.md**: Detailed API documentation
+3. **EXAMPLES.md**: Usage examples and tutorials
+4. **CHANGELOG.md**: Version history and changes
+
+### API Documentation Template
+
+```text
+# Extension Name API
+
+## Overview
+Brief description of the extension and its purpose.
+
+## Installation
+Steps to install and configure the extension.
+
+## Configuration
+Configuration schema and options.
+
+## API Reference
+Detailed API documentation with examples.
+
+## Examples
+Common usage patterns and examples.
+
+## Troubleshooting
+Common issues and solutions.
+```
+
+## Best Practices
+
+### Development Guidelines
+
+1. **Follow Naming Conventions**: Use consistent naming for functions and variables
+2. **Error Handling**: Implement comprehensive error handling and recovery
+3. **Logging**: Use structured logging for debugging and monitoring
+4. **Configuration Validation**: Validate all inputs and configurations
+5. **Documentation**: Document all public APIs and configurations
+6. **Testing**: Include comprehensive unit and integration tests
+7. **Versioning**: Follow semantic versioning principles
+8. **Security**: Implement secure credential handling and API calls
+
+### Performance Considerations
+
+1. **Caching**: Cache expensive operations and API calls
+2. **Parallel Processing**: Use parallel execution where possible
+3. **Resource Management**: Clean up resources properly
+4. **Batch Operations**: Batch API calls when possible
+5. **Health Monitoring**: Implement health checks and monitoring
+
+### Security Best Practices
+
+1. **Credential Management**: Store credentials securely
+2. **Input Validation**: Validate and sanitize all inputs
+3. **Access Control**: Implement proper access controls
+4. 
**Audit Logging**: Log all security-relevant operations +5. **Encryption**: Encrypt sensitive data in transit and at rest + +This extension development API provides a comprehensive framework for building robust, scalable, and maintainable extensions for provisioning. \ No newline at end of file diff --git a/docs/src/api-reference/integration-examples.md b/docs/src/api-reference/integration-examples.md index c50a8fc..e96a26d 100644 --- a/docs/src/api-reference/integration-examples.md +++ b/docs/src/api-reference/integration-examples.md @@ -1 +1,1592 @@ -# Integration Examples\n\nThis document provides comprehensive examples and patterns for integrating with provisioning APIs, including client libraries, SDKs, error handling\nstrategies, and performance optimization.\n\n## Overview\n\nProvisioning offers multiple integration points:\n\n- REST APIs for workflow management\n- WebSocket APIs for real-time monitoring\n- Configuration APIs for system setup\n- Extension APIs for custom providers and services\n\n## Complete Integration Examples\n\n### Python Integration\n\n#### Full-Featured Python Client\n\n```\nimport asyncio\nimport json\nimport logging\nimport time\nimport requests\nimport websockets\nfrom typing import Dict, List, Optional, Callable\nfrom dataclasses import dataclass\nfrom enum import Enum\n\nclass TaskStatus(Enum):\n PENDING = "Pending"\n RUNNING = "Running"\n COMPLETED = "Completed"\n FAILED = "Failed"\n CANCELLED = "Cancelled"\n\n@dataclass\nclass WorkflowTask:\n id: str\n name: str\n status: TaskStatus\n created_at: str\n started_at: Optional[str] = None\n completed_at: Optional[str] = None\n output: Optional[str] = None\n error: Optional[str] = None\n progress: Optional[float] = None\n\nclass ProvisioningAPIError(Exception):\n """Base exception for provisioning API errors"""\n pass\n\nclass AuthenticationError(ProvisioningAPIError):\n """Authentication failed"""\n pass\n\nclass ValidationError(ProvisioningAPIError):\n """Request validation failed"""\n pass\n\nclass ProvisioningClient:\n """\n Complete Python client for provisioning\n\n Features:\n - REST API integration\n - WebSocket support for real-time updates\n - Automatic token refresh\n - Retry logic with exponential backoff\n - Comprehensive error handling\n """\n\n def __init__(self,\n base_url: str = "http://localhost:9090",\n auth_url: str = "http://localhost:8081",\n username: str = None,\n password: str = None,\n token: str = None):\n self.base_url = base_url\n self.auth_url = auth_url\n self.username = username\n self.password = password\n self.token = token\n self.session = requests.Session()\n self.websocket = None\n self.event_handlers = {}\n\n # Setup logging\n self.logger = logging.getLogger(__name__)\n\n # Configure session with retries\n from requests.adapters import HTTPAdapter\n from urllib3.util.retry import Retry\n\n retry_strategy = Retry(\n total=3,\n status_forcelist=[429, 500, 502, 503, 504],\n method_whitelist=["HEAD", "GET", "OPTIONS"],\n backoff_factor=1\n )\n\n adapter = HTTPAdapter(max_retries=retry_strategy)\n self.session.mount("http://", adapter)\n self.session.mount("https://", adapter)\n\n async def authenticate(self) -> str:\n """Authenticate and get JWT token"""\n if self.token:\n return self.token\n\n if not self.username or not self.password:\n raise AuthenticationError("Username and password required for authentication")\n\n auth_data = {\n "username": self.username,\n "password": self.password\n }\n\n try:\n response = requests.post(f"{self.auth_url}/auth/login", 
+# Integration Examples
+
+This document provides comprehensive examples and patterns for integrating with provisioning APIs, including client libraries, SDKs, error handling strategies, and performance optimization.
+
+## Overview
+
+Provisioning offers multiple integration points:
+
+- REST APIs for workflow management
+- WebSocket APIs for real-time monitoring
+- Configuration APIs for system setup
+- Extension APIs for custom providers and services
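+
+All of these HTTP entry points share one request/response shape, so a minimal call is a useful reference before the full clients below. This is a sketch under the defaults used throughout this document: auth service on port 8081, orchestrator API on port 9090, and the `{success, data, error}` response envelope.
+
+```text
+import requests
+
+AUTH_URL = "http://localhost:8081"  # default auth service address used in this guide
+API_URL = "http://localhost:9090"   # default orchestrator API address used in this guide
+
+# 1. Authenticate and obtain a JWT token
+login = requests.post(f"{AUTH_URL}/auth/login",
+                      json={"username": "admin", "password": "password"})
+login.raise_for_status()
+token = login.json()["data"]["token"]
+
+# 2. Call an API endpoint with the bearer token
+resp = requests.post(
+    f"{API_URL}/workflows/servers/create",
+    headers={"Authorization": f"Bearer {token}"},
+    json={"infra": "production", "settings": "config.ncl"},
+)
+resp.raise_for_status()
+result = resp.json()
+
+# Every response wraps its payload in a {success, data, error} envelope
+task_id = result["data"] if result.get("success") else None
+print(f"Created task: {task_id}")
+```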
+
+## Complete Integration Examples
+
+### Python Integration
+
+#### Full-Featured Python Client
+
+```text
+import asyncio
+import json
+import logging
+import time
+import requests
+import websockets
+from typing import Dict, List, Optional, Callable
+from dataclasses import dataclass
+from enum import Enum
+
+class TaskStatus(Enum):
+    PENDING = "Pending"
+    RUNNING = "Running"
+    COMPLETED = "Completed"
+    FAILED = "Failed"
+    CANCELLED = "Cancelled"
+
+@dataclass
+class WorkflowTask:
+    id: str
+    name: str
+    status: TaskStatus
+    created_at: str
+    started_at: Optional[str] = None
+    completed_at: Optional[str] = None
+    output: Optional[str] = None
+    error: Optional[str] = None
+    progress: Optional[float] = None
+
+class ProvisioningAPIError(Exception):
+    """Base exception for provisioning API errors"""
+    pass
+
+class AuthenticationError(ProvisioningAPIError):
+    """Authentication failed"""
+    pass
+
+class ValidationError(ProvisioningAPIError):
+    """Request validation failed"""
+    pass
+
+class ProvisioningClient:
+    """
+    Complete Python client for provisioning
+
+    Features:
+    - REST API integration
+    - WebSocket support for real-time updates
+    - Automatic token refresh
+    - Retry logic with exponential backoff
+    - Comprehensive error handling
+    """
+
+    def __init__(self,
+                 base_url: str = "http://localhost:9090",
+                 auth_url: str = "http://localhost:8081",
+                 username: str = None,
+                 password: str = None,
+                 token: str = None):
+        self.base_url = base_url
+        self.auth_url = auth_url
+        self.username = username
+        self.password = password
+        self.token = token
+        self.session = requests.Session()
+        self.websocket = None
+        self.event_handlers = {}
+
+        # Setup logging
+        self.logger = logging.getLogger(__name__)
+
+        # Configure session with retries
+        from requests.adapters import HTTPAdapter
+        from urllib3.util.retry import Retry
+
+        retry_strategy = Retry(
+            total=3,
+            status_forcelist=[429, 500, 502, 503, 504],
+            allowed_methods=["HEAD", "GET", "OPTIONS"],  # renamed from method_whitelist in urllib3 1.26
+            backoff_factor=1
+        )
+
+        adapter = HTTPAdapter(max_retries=retry_strategy)
+        self.session.mount("http://", adapter)
+        self.session.mount("https://", adapter)
+
+    async def authenticate(self) -> str:
+        """Authenticate and get JWT token"""
+        if self.token:
+            return self.token
+
+        if not self.username or not self.password:
+            raise AuthenticationError("Username and password required for authentication")
+
+        auth_data = {
+            "username": self.username,
+            "password": self.password
+        }
+
+        try:
+            response = requests.post(f"{self.auth_url}/auth/login", json=auth_data)
+            response.raise_for_status()
+
+            result = response.json()
+            if not result.get('success'):
+                raise AuthenticationError(result.get('error', 'Authentication failed'))
+
+            self.token = result['data']['token']
+            self.session.headers.update({
+                'Authorization': f'Bearer {self.token}'
+            })
+
+            self.logger.info("Authentication successful")
+            return self.token
+
+        except requests.RequestException as e:
+            raise AuthenticationError(f"Authentication request failed: {e}")
+
+    def _make_request(self, method: str, endpoint: str, **kwargs) -> Dict:
+        """Make authenticated HTTP request with error handling"""
+        if not self.token:
+            raise AuthenticationError("Not authenticated. Call authenticate() first.")
+
+        url = f"{self.base_url}{endpoint}"
+
+        try:
+            response = self.session.request(method, url, **kwargs)
+            response.raise_for_status()
+
+            result = response.json()
+            if not result.get('success'):
+                error_msg = result.get('error', 'Request failed')
+                if response.status_code == 400:
+                    raise ValidationError(error_msg)
+                else:
+                    raise ProvisioningAPIError(error_msg)
+
+            return result['data']
+
+        except requests.RequestException as e:
+            self.logger.error(f"Request failed: {method} {url} - {e}")
+            raise ProvisioningAPIError(f"Request failed: {e}")
+
+    # Workflow Management Methods
+
+    def create_server_workflow(self,
+                               infra: str,
+                               settings: str = "config.ncl",
+                               check_mode: bool = False,
+                               wait: bool = False) -> str:
+        """Create a server provisioning workflow"""
+        data = {
+            "infra": infra,
+            "settings": settings,
+            "check_mode": check_mode,
+            "wait": wait
+        }
+
+        task_id = self._make_request("POST", "/workflows/servers/create", json=data)
+        self.logger.info(f"Server workflow created: {task_id}")
+        return task_id
+
+    def create_taskserv_workflow(self,
+                                 operation: str,
+                                 taskserv: str,
+                                 infra: str,
+                                 settings: str = "config.ncl",
+                                 check_mode: bool = False,
+                                 wait: bool = False) -> str:
+        """Create a task service workflow"""
+        data = {
+            "operation": operation,
+            "taskserv": taskserv,
+            "infra": infra,
+            "settings": settings,
+            "check_mode": check_mode,
+            "wait": wait
+        }
+
+        task_id = self._make_request("POST", "/workflows/taskserv/create", json=data)
+        self.logger.info(f"Taskserv workflow created: {task_id}")
+        return task_id
+
+    def create_cluster_workflow(self,
+                                operation: str,
+                                cluster_type: str,
+                                infra: str,
+                                settings: str = "config.ncl",
+                                check_mode: bool = False,
+                                wait: bool = False) -> str:
+        """Create a cluster workflow"""
+        data = {
+            "operation": operation,
+            "cluster_type": cluster_type,
+            "infra": infra,
+            "settings": settings,
+            "check_mode": check_mode,
+            "wait": wait
+        }
+
+        task_id = self._make_request("POST", "/workflows/cluster/create", json=data)
+        self.logger.info(f"Cluster workflow created: {task_id}")
+        return task_id
+
+    def get_task_status(self, task_id: str) -> WorkflowTask:
+        """Get the status of a specific task"""
+        data = self._make_request("GET", f"/tasks/{task_id}")
+        return WorkflowTask(
+            id=data['id'],
+            name=data['name'],
+            status=TaskStatus(data['status']),
+            created_at=data['created_at'],
+            started_at=data.get('started_at'),
+            completed_at=data.get('completed_at'),
+            output=data.get('output'),
+            error=data.get('error'),
+            progress=data.get('progress')
+        )
+
+    def list_tasks(self, status_filter: Optional[str] = None) -> List[WorkflowTask]:
+        """List all tasks, optionally filtered by status"""
+        params = {}
+        if status_filter:
+            params['status'] = status_filter
+
+        data = self._make_request("GET", "/tasks", params=params)
+        return [
+            WorkflowTask(
+                id=task['id'],
+                name=task['name'],
+                status=TaskStatus(task['status']),
+                created_at=task['created_at'],
+                started_at=task.get('started_at'),
+                completed_at=task.get('completed_at'),
+                output=task.get('output'),
+                error=task.get('error')
+            )
+            for task in data
+        ]
+
+    def wait_for_task_completion(self,
+                                 task_id: str,
+                                 timeout: int = 300,
+                                 poll_interval: int = 5) -> WorkflowTask:
+        """Wait for a task to complete"""
+        start_time = time.time()
+
+        while time.time() - start_time < timeout:
+            task = self.get_task_status(task_id)
+
+            if task.status in [TaskStatus.COMPLETED, TaskStatus.FAILED, TaskStatus.CANCELLED]:
+                self.logger.info(f"Task {task_id} finished with status: {task.status}")
+                return task
+
+            self.logger.debug(f"Task {task_id} status: {task.status}")
+            time.sleep(poll_interval)
+
+        raise TimeoutError(f"Task {task_id} did not complete within {timeout} seconds")
+
+    # Batch Operations
+
+    def execute_batch_operation(self, batch_config: Dict) -> Dict:
+        """Execute a batch operation"""
+        return self._make_request("POST", "/batch/execute", json=batch_config)
+
+    def get_batch_status(self, batch_id: str) -> Dict:
+        """Get batch operation status"""
+        return self._make_request("GET", f"/batch/operations/{batch_id}")
+
+    def cancel_batch_operation(self, batch_id: str) -> str:
+        """Cancel a running batch operation"""
+        return self._make_request("POST", f"/batch/operations/{batch_id}/cancel")
+
+    # System Health and Monitoring
+
+    def get_system_health(self) -> Dict:
+        """Get system health status"""
+        return self._make_request("GET", "/state/system/health")
+
+    def get_system_metrics(self) -> Dict:
+        """Get system metrics"""
+        return self._make_request("GET", "/state/system/metrics")
+
+    # WebSocket Integration
+
+    async def connect_websocket(self, event_types: List[str] = None):
+        """Connect to WebSocket for real-time updates"""
+        if not self.token:
+            await self.authenticate()
+
+        ws_url = f"ws://localhost:9090/ws?token={self.token}"
+        if event_types:
+            ws_url += f"&events={','.join(event_types)}"
+
+        try:
+            self.websocket = await websockets.connect(ws_url)
+            self.logger.info("WebSocket connected")
+
+            # Start listening for messages
+            asyncio.create_task(self._websocket_listener())
+
+        except Exception as e:
+            self.logger.error(f"WebSocket connection failed: {e}")
+            raise
+
+    async def _websocket_listener(self):
+        """Listen for WebSocket messages"""
+        try:
+            async for message in self.websocket:
+                try:
+                    data = json.loads(message)
+                    await self._handle_websocket_message(data)
+                except json.JSONDecodeError:
+                    self.logger.error(f"Invalid JSON received: {message}")
+        except Exception as e:
+            self.logger.error(f"WebSocket listener error: {e}")
+
+    async def _handle_websocket_message(self, data: Dict):
+        """Handle incoming WebSocket messages"""
+        event_type = data.get('event_type')
+        if event_type and event_type in self.event_handlers:
+            for handler in self.event_handlers[event_type]:
+                try:
+                    await handler(data)
+                except Exception as e:
+                    self.logger.error(f"Error in event handler for {event_type}: {e}")
+
+    def on_event(self, event_type: str, handler: Callable):
+        """Register an event handler"""
+        if event_type not in self.event_handlers:
+            self.event_handlers[event_type] = []
+        self.event_handlers[event_type].append(handler)
+
+    async def disconnect_websocket(self):
+        """Disconnect from WebSocket"""
+        if self.websocket:
+            await self.websocket.close()
+            self.websocket = None
+            self.logger.info("WebSocket disconnected")
+
+# Usage Example
+async def main():
+    # Initialize client
+    client = ProvisioningClient(
+        username="admin",
+        password="password"
+    )
+
+    try:
+        # Authenticate
+        await client.authenticate()
+
+        # Create a server workflow
+        task_id = client.create_server_workflow(
+            infra="production",
+            settings="prod-settings.ncl",
+            wait=False
+        )
+        print(f"Server workflow created: {task_id}")
+
+        # Set up WebSocket event handlers
+        async def on_task_update(event):
+            print(f"Task update: {event['data']['task_id']} -> {event['data']['status']}")
+
+        async def on_system_health(event):
+            print(f"System health: {event['data']['overall_status']}")
+
+        client.on_event('TaskStatusChanged', on_task_update)
+        client.on_event('SystemHealthUpdate', on_system_health)
+
+        # Connect to WebSocket
+        await client.connect_websocket(['TaskStatusChanged', 'SystemHealthUpdate'])
+
+        # Wait for task completion
+        final_task = client.wait_for_task_completion(task_id, timeout=600)
+        print(f"Task completed with status: {final_task.status}")
+
+        if final_task.status == TaskStatus.COMPLETED:
+            print(f"Output: {final_task.output}")
+        elif final_task.status == TaskStatus.FAILED:
+            print(f"Error: {final_task.error}")
+
+    except ProvisioningAPIError as e:
+        print(f"API Error: {e}")
+    except Exception as e:
+        print(f"Unexpected error: {e}")
+    finally:
+        await client.disconnect_websocket()
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
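+The client surfaces failures through the exception hierarchy defined above: `ValidationError` for 400-level responses, `AuthenticationError` for auth problems, and `ProvisioningAPIError` as the base. A short sketch of how calling code can branch on it; the recovery actions shown are illustrative, not prescribed:
+
+```text
+# Assumes ProvisioningClient and its exception classes from the example above
+async def create_with_diagnostics(client: ProvisioningClient, infra: str) -> str:
+    try:
+        return client.create_server_workflow(infra=infra)
+    except ValidationError as e:
+        # 400-level problem: the request payload is wrong, retrying will not help
+        print(f"Invalid request: {e}")
+        raise
+    except AuthenticationError:
+        # Missing or expired token: authenticate once, then retry the call
+        await client.authenticate()
+        return client.create_server_workflow(infra=infra)
+    except ProvisioningAPIError as e:
+        # Transport or server-side failure: a reasonable candidate for retry with backoff
+        print(f"API error: {e}")
+        raise
+```
+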
+### Node.js/JavaScript Integration
+
+#### Complete JavaScript/TypeScript Client
+
+```text
+import axios, { AxiosInstance, AxiosResponse } from 'axios';
+import WebSocket from 'ws';
+import { EventEmitter } from 'events';
+
+interface Task {
+  id: string;
+  name: string;
+  status: 'Pending' | 'Running' | 'Completed' | 'Failed' | 'Cancelled';
+  created_at: string;
+  started_at?: string;
+  completed_at?: string;
+  output?: string;
+  error?: string;
+  progress?: number;
+}
+
+interface BatchConfig {
+  name: string;
+  version: string;
+  storage_backend: string;
+  parallel_limit: number;
+  rollback_enabled: boolean;
+  operations: Array<{
+    id: string;
+    type: string;
+    provider: string;
+    dependencies: string[];
+    [key: string]: any;
+  }>;
+}
+
+interface WebSocketEvent {
+  event_type: string;
+  timestamp: string;
+  data: any;
+  metadata: Record<string, any>;
+}
+
+class ProvisioningClient extends EventEmitter {
+  private httpClient: AxiosInstance;
+  private authClient: AxiosInstance;
+  private websocket?: WebSocket;
+  private token?: string;
+  private reconnectAttempts = 0;
+  private maxReconnectAttempts = 10;
+  private reconnectInterval = 5000;
+
+  constructor(
+    private baseUrl = 'http://localhost:9090',
+    private authUrl = 'http://localhost:8081',
+    private username?: string,
+    private password?: string,
+    token?: string
+  ) {
+    super();
+
+    this.token = token;
+
+    // Setup HTTP clients
+    this.httpClient = axios.create({
+      baseURL: baseUrl,
+      timeout: 30000,
+    });
+
+    this.authClient = axios.create({
+      baseURL: authUrl,
+      timeout: 10000,
+    });
+
+    // Setup request interceptors
+    this.setupInterceptors();
+  }
+
+  private setupInterceptors(): void {
+    // Request interceptor to add auth token
+    this.httpClient.interceptors.request.use((config) => {
+      if (this.token) {
+        config.headers.Authorization = `Bearer ${this.token}`;
+      }
+      return config;
+    });
+
+    // Response interceptor for error handling
+    this.httpClient.interceptors.response.use(
+      (response) => response,
+      async (error) => {
+        if (error.response?.status === 401 && this.username && this.password) {
+          // Token expired, try to refresh
+          try {
+            await this.authenticate();
+            // Retry the original request
+            const originalRequest = error.config;
+            originalRequest.headers.Authorization = `Bearer ${this.token}`;
+            return this.httpClient.request(originalRequest);
+          } catch (authError) {
+            this.emit('authError', authError);
+            throw error;
+          }
+        }
+        throw error;
+      }
+    );
+  }
+
+  async authenticate(): Promise<string> {
+    if (this.token) {
+      return this.token;
+    }
+
+    if (!this.username || !this.password) {
+      throw new Error('Username and password required for authentication');
+    }
+
+    try {
+      const response = await this.authClient.post('/auth/login', {
+        username: this.username,
+        password: this.password,
+      });
+
+      const result = response.data;
+      if (!result.success) {
+        throw new Error(result.error || 'Authentication failed');
+      }
+
+      this.token = result.data.token;
+      console.log('Authentication successful');
+      this.emit('authenticated', this.token);
+
+      return this.token;
+    } catch (error) {
+      console.error('Authentication failed:', error);
+      throw new Error(`Authentication failed: ${error.message}`);
+    }
+  }
+
+  private async makeRequest(method: string, endpoint: string, data?: any): Promise<any> {
+    try {
+      const response: AxiosResponse = await this.httpClient.request({
+        method,
+        url: endpoint,
+        data,
+      });
+
+      const result = response.data;
+      if (!result.success) {
+        throw new Error(result.error || 'Request failed');
+      }
+
+      return result.data;
+    } catch (error) {
+      console.error(`Request failed: ${method} ${endpoint}`, error);
+      throw error;
+    }
+  }
+
+  // Workflow Management Methods
+
+  async createServerWorkflow(config: {
+    infra: string;
+    settings?: string;
+    check_mode?: boolean;
+    wait?: boolean;
+  }): Promise<string> {
+    const data = {
+      infra: config.infra,
+      settings: config.settings || 'config.ncl',
+      check_mode: config.check_mode || false,
+      wait: config.wait || false,
+    };
+
+    const taskId = await this.makeRequest('POST', '/workflows/servers/create', data);
+    console.log(`Server workflow created: ${taskId}`);
+    this.emit('workflowCreated', { type: 'server', taskId });
+    return taskId;
+  }
+
+  async createTaskservWorkflow(config: {
+    operation: string;
+    taskserv: string;
+    infra: string;
+    settings?: string;
+    check_mode?: boolean;
+    wait?: boolean;
+  }): Promise<string> {
+    const data = {
+      operation: config.operation,
+      taskserv: config.taskserv,
+      infra: config.infra,
+      settings: config.settings || 'config.ncl',
+      check_mode: config.check_mode || false,
+      wait: config.wait || false,
+    };
+
+    const taskId = await this.makeRequest('POST', '/workflows/taskserv/create', data);
+    console.log(`Taskserv workflow created: ${taskId}`);
+    this.emit('workflowCreated', { type: 'taskserv', taskId });
+    return taskId;
+  }
+
+  async createClusterWorkflow(config: {
+    operation: string;
+    cluster_type: string;
+    infra: string;
+    settings?: string;
+    check_mode?: boolean;
+    wait?: boolean;
+  }): Promise<string> {
+    const data = {
+      operation: config.operation,
+      cluster_type: config.cluster_type,
+      infra: config.infra,
+      settings: config.settings || 'config.ncl',
+      check_mode: config.check_mode || false,
+      wait: config.wait || false,
+    };
+
+    const taskId = await this.makeRequest('POST', '/workflows/cluster/create', data);
+    console.log(`Cluster workflow created: ${taskId}`);
+    this.emit('workflowCreated', { type: 'cluster', taskId });
+    return taskId;
+  }
+
+  async getTaskStatus(taskId: string): Promise<Task> {
+    return this.makeRequest('GET', `/tasks/${taskId}`);
+  }
+
+  async listTasks(statusFilter?: string): Promise<Task[]> {
+    const params = statusFilter ? `?status=${statusFilter}` : '';
+    return this.makeRequest('GET', `/tasks${params}`);
+  }
+
+  async waitForTaskCompletion(
+    taskId: string,
+    timeout = 300000, // 5 minutes
+    pollInterval = 5000 // 5 seconds
+  ): Promise<Task> {
+    return new Promise<Task>((resolve, reject) => {
+      const startTime = Date.now();
+
+      const poll = async () => {
+        try {
+          const task = await this.getTaskStatus(taskId);
+
+          if (['Completed', 'Failed', 'Cancelled'].includes(task.status)) {
+            console.log(`Task ${taskId} finished with status: ${task.status}`);
+            resolve(task);
+            return;
+          }
+
+          if (Date.now() - startTime > timeout) {
+            reject(new Error(`Task ${taskId} did not complete within ${timeout}ms`));
+            return;
+          }
+
+          console.log(`Task ${taskId} status: ${task.status}`);
+          this.emit('taskProgress', task);
+          setTimeout(poll, pollInterval);
+        } catch (error) {
+          reject(error);
+        }
+      };
+
+      poll();
+    });
+  }
+
+  // Batch Operations
+
+  async executeBatchOperation(batchConfig: BatchConfig): Promise<any> {
+    const result = await this.makeRequest('POST', '/batch/execute', batchConfig);
+    console.log(`Batch operation started: ${result.batch_id}`);
+    this.emit('batchStarted', result);
+    return result;
+  }
+
+  async getBatchStatus(batchId: string): Promise<any> {
+    return this.makeRequest('GET', `/batch/operations/${batchId}`);
+  }
+
+  async cancelBatchOperation(batchId: string): Promise<any> {
+    return this.makeRequest('POST', `/batch/operations/${batchId}/cancel`);
+  }
+
+  // System Monitoring
+
+  async getSystemHealth(): Promise<any> {
+    return this.makeRequest('GET', '/state/system/health');
+  }
+
+  async getSystemMetrics(): Promise<any> {
+    return this.makeRequest('GET', '/state/system/metrics');
+  }
+
+  // WebSocket Integration
+
+  async connectWebSocket(eventTypes?: string[]): Promise<void> {
+    if (!this.token) {
+      await this.authenticate();
+    }
+
+    let wsUrl = `ws://localhost:9090/ws?token=${this.token}`;
+    if (eventTypes && eventTypes.length > 0) {
+      wsUrl += `&events=${eventTypes.join(',')}`;
+    }
+
+    return new Promise<void>((resolve, reject) => {
+      this.websocket = new WebSocket(wsUrl);
+
+      this.websocket.on('open', () => {
+        console.log('WebSocket connected');
+        this.reconnectAttempts = 0;
+        this.emit('websocketConnected');
+        resolve();
+      });
+
+      this.websocket.on('message', (data: WebSocket.Data) => {
+        try {
+          const event: WebSocketEvent = JSON.parse(data.toString());
+          this.handleWebSocketMessage(event);
+        } catch (error) {
+          console.error('Failed to parse WebSocket message:', error);
+        }
+      });
+
+      this.websocket.on('close', (code: number, reason: string) => {
+        console.log(`WebSocket disconnected: ${code} - ${reason}`);
+        this.emit('websocketDisconnected', { code, reason });
+
+        if (this.reconnectAttempts < this.maxReconnectAttempts) {
+          setTimeout(() => {
+            this.reconnectAttempts++;
+            console.log(`Reconnecting... (${this.reconnectAttempts}/${this.maxReconnectAttempts})`);
+            this.connectWebSocket(eventTypes);
+          }, this.reconnectInterval);
+        }
+      });
+
+      this.websocket.on('error', (error: Error) => {
+        console.error('WebSocket error:', error);
+        this.emit('websocketError', error);
+        reject(error);
+      });
+    });
+  }
+
+  private handleWebSocketMessage(event: WebSocketEvent): void {
+    console.log(`WebSocket event: ${event.event_type}`);
+
+    // Emit specific event
+    this.emit(event.event_type, event);
+
+    // Emit general event
+    this.emit('websocketMessage', event);
+
+    // Handle specific event types
+    switch (event.event_type) {
+      case 'TaskStatusChanged':
+        this.emit('taskStatusChanged', event.data);
+        break;
+      case 'WorkflowProgressUpdate':
+        this.emit('workflowProgress', event.data);
+        break;
+      case 'SystemHealthUpdate':
+        this.emit('systemHealthUpdate', event.data);
+        break;
+      case 'BatchOperationUpdate':
+        this.emit('batchUpdate', event.data);
+        break;
+    }
+  }
+
+  disconnectWebSocket(): void {
+    if (this.websocket) {
+      this.websocket.close();
+      this.websocket = undefined;
+      console.log('WebSocket disconnected');
+    }
+  }
+
+  // Utility Methods
+
+  async healthCheck(): Promise<boolean> {
+    try {
+      const response = await this.httpClient.get('/health');
+      return response.data.success;
+    } catch (error) {
+      return false;
+    }
+  }
+}
+
+// Usage Example
+async function main() {
+  const client = new ProvisioningClient(
+    'http://localhost:9090',
+    'http://localhost:8081',
+    'admin',
+    'password'
+  );
+
+  try {
+    // Authenticate
+    await client.authenticate();
+
+    // Set up event listeners
+    client.on('taskStatusChanged', (task) => {
+      console.log(`Task ${task.task_id} status changed to: ${task.status}`);
+    });
+
+    client.on('workflowProgress', (progress) => {
+      console.log(`Workflow progress: ${progress.progress}% - ${progress.current_step}`);
+    });
+
+    client.on('systemHealthUpdate', (health) => {
+      console.log(`System health: ${health.overall_status}`);
+    });
+
+    // Connect WebSocket
+    await client.connectWebSocket(['TaskStatusChanged', 'WorkflowProgressUpdate', 'SystemHealthUpdate']);
+
+    // Create workflows
+    const serverTaskId = await client.createServerWorkflow({
+      infra: 'production',
+      settings: 'prod-settings.ncl',
+    });
+
+    const taskservTaskId = await client.createTaskservWorkflow({
+      operation: 'create',
+      taskserv: 'kubernetes',
+      infra: 'production',
+    });
+
+    // Wait for completion
+    const [serverTask, taskservTask] = await Promise.all([
+      client.waitForTaskCompletion(serverTaskId),
+      client.waitForTaskCompletion(taskservTaskId),
+    ]);
+
+    console.log('All workflows completed');
+    console.log(`Server task: ${serverTask.status}`);
+    console.log(`Taskserv task: ${taskservTask.status}`);
+
+    // Create batch operation
+    const batchConfig: BatchConfig = {
+      name: 'test_deployment',
+      version: '1.0.0',
+      storage_backend: 'filesystem',
+      parallel_limit: 3,
+      rollback_enabled: true,
+      operations: [
+        {
+          id: 'servers',
+          type: 'server_batch',
+          provider: 'upcloud',
+          dependencies: [],
+          server_configs: [
+            { name: 'web-01', plan: '1xCPU-2 GB', zone: 'de-fra1' },
+            { name: 'web-02', plan: '1xCPU-2 GB', zone: 'de-fra1' },
+          ],
+        },
+        {
+          id: 'taskservs',
+          type: 'taskserv_batch',
+          provider: 'upcloud',
+          dependencies: ['servers'],
+          taskservs: ['kubernetes', 'cilium'],
+        },
+      ],
+    };
+
+    const batchResult = await client.executeBatchOperation(batchConfig);
+    console.log(`Batch operation started: ${batchResult.batch_id}`);
+
+    // Monitor batch operation
+    const monitorBatch = setInterval(async () => {
+      try {
+        const batchStatus = await client.getBatchStatus(batchResult.batch_id);
+        console.log(`Batch status: ${batchStatus.status} - ${batchStatus.progress}%`);
+
+        if (['Completed', 'Failed', 'Cancelled'].includes(batchStatus.status)) {
+          clearInterval(monitorBatch);
+          console.log(`Batch operation finished: ${batchStatus.status}`);
+        }
+      } catch (error) {
+        console.error('Error checking batch status:', error);
+        clearInterval(monitorBatch);
+      }
+    }, 10000);
+
+  } catch (error) {
+    console.error('Integration example failed:', error);
+  } finally {
+    client.disconnectWebSocket();
+  }
+}
+
+// Run example
+if (require.main === module) {
+  main().catch(console.error);
+}
+
+export { ProvisioningClient, Task, BatchConfig };
+```
+
+## Error Handling Strategies
+
+### Comprehensive Error Handling
+
+```text
+import asyncio
+import logging
+import random
+from typing import Callable
+
+import requests
+
+logger = logging.getLogger(__name__)
+
+class ProvisioningErrorHandler:
+    """Centralized error handling for provisioning operations"""
+
+    def __init__(self, client: ProvisioningClient):
+        self.client = client
+        self.retry_strategies = {
+            'network_error': self._exponential_backoff,
+            'rate_limit': self._rate_limit_backoff,
+            'server_error': self._server_error_strategy,
+            'auth_error': self._auth_error_strategy,
+        }
+
+    async def execute_with_retry(self, operation: Callable, *args, **kwargs):
+        """Execute operation with intelligent retry logic"""
+        max_attempts = 3
+        attempt = 0
+
+        while attempt < max_attempts:
+            try:
+                return await operation(*args, **kwargs)
+            except Exception as e:
+                attempt += 1
+                error_type = self._classify_error(e)
+
+                if attempt >= max_attempts:
+                    self._log_final_failure(operation.__name__, e, attempt)
+                    raise
+
+                retry_strategy = self.retry_strategies.get(error_type, self._default_retry)
+                wait_time = retry_strategy(attempt, e)
+
+                self._log_retry_attempt(operation.__name__, e, attempt, wait_time)
+                await asyncio.sleep(wait_time)
+
+    def _classify_error(self, error: Exception) -> str:
+        """Classify error type for appropriate retry strategy"""
+        if isinstance(error, requests.ConnectionError):
+            return 'network_error'
+        elif isinstance(error, requests.HTTPError):
+            if error.response.status_code == 429:
+                return 'rate_limit'
+            elif 500 <= error.response.status_code < 600:
+                return 'server_error'
+            elif error.response.status_code == 401:
+                return 'auth_error'
+        return 'unknown'
+
+    def _exponential_backoff(self, attempt: int, error: Exception) -> float:
+        """Exponential backoff for network errors"""
+        return min(2 ** attempt + random.uniform(0, 1), 60)
+
+    def _rate_limit_backoff(self, attempt: int, error: Exception) -> float:
+        """Handle rate limiting with appropriate backoff"""
+        retry_after = getattr(error.response, 'headers', {}).get('Retry-After')
+        if retry_after:
+            return float(retry_after)
+        return 60  # Default to 60 seconds
+
+    def _server_error_strategy(self, attempt: int, error: Exception) -> float:
+        """Handle server errors"""
+        return min(10 * attempt, 60)
+
+    def _auth_error_strategy(self, attempt: int, error: Exception) -> float:
+        """Handle authentication errors"""
+        # Re-authenticate before retry
+        asyncio.create_task(self.client.authenticate())
+        return 5
+
+    def _default_retry(self, attempt: int, error: Exception) -> float:
+        """Default retry strategy"""
+        return min(5 * attempt, 30)
+
+    def _log_retry_attempt(self, op_name: str, error: Exception, attempt: int, wait_time: float):
+        logger.warning(f"{op_name} failed (attempt {attempt}): {error}; retrying in {wait_time:.1f}s")
+
+    def _log_final_failure(self, op_name: str, error: Exception, attempt: int):
+        logger.error(f"{op_name} failed after {attempt} attempts: {error}")
+
+# Usage example
+async def robust_workflow_execution():
+    client = ProvisioningClient()
+    handler = ProvisioningErrorHandler(client)
+
+    try:
+        # Execute with automatic retry
+        task_id = await handler.execute_with_retry(
+            client.create_server_workflow,
+            infra="production",
+            settings="config.ncl"
+        )
+
+        # Wait for completion with retry
+        task = await handler.execute_with_retry(
+            client.wait_for_task_completion,
+            task_id,
+            timeout=600
+        )
+
+        return task
+    except Exception as e:
+        # Log detailed error information
+        logger.error(f"Workflow execution failed after all retries: {e}")
+        # Implement fallback strategy
+        return await fallback_workflow_strategy()
+```
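+
+As a rough illustration of the schedule `_exponential_backoff` produces, this standalone sketch prints the wait time per
+attempt (the jitter term is random, so exact values vary between runs):
+
+```text
+import random
+
+def exponential_backoff(attempt: int) -> float:
+    # Same formula as above: 2^attempt plus up to 1s of jitter, capped at 60s
+    return min(2 ** attempt + random.uniform(0, 1), 60)
+
+for attempt in range(1, 7):
+    print(f"attempt {attempt}: wait {exponential_backoff(attempt):.2f}s")
+# attempts 1-5 wait roughly 2s, 4s, 8s, 16s, 32s (plus jitter); attempt 6 hits the 60s cap
+```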
+
+### Circuit Breaker Pattern
+
+```text
+class CircuitBreaker {
+  private failures = 0;
+  private nextAttempt = Date.now();
+  private state: 'CLOSED' | 'OPEN' | 'HALF_OPEN' = 'CLOSED';
+
+  constructor(
+    private threshold = 5,
+    private timeout = 60000, // 1 minute
+    private monitoringPeriod = 10000 // 10 seconds
+  ) {}
+
+  async execute<T>(operation: () => Promise<T>): Promise<T> {
+    if (this.state === 'OPEN') {
+      if (Date.now() < this.nextAttempt) {
+        throw new Error('Circuit breaker is OPEN');
+      }
+      this.state = 'HALF_OPEN';
+    }
+
+    try {
+      const result = await operation();
+      this.onSuccess();
+      return result;
+    } catch (error) {
+      this.onFailure();
+      throw error;
+    }
+  }
+
+  private onSuccess(): void {
+    this.failures = 0;
+    this.state = 'CLOSED';
+  }
+
+  private onFailure(): void {
+    this.failures++;
+    if (this.failures >= this.threshold) {
+      this.state = 'OPEN';
+      this.nextAttempt = Date.now() + this.timeout;
+    }
+  }
+
+  getState(): string {
+    return this.state;
+  }
+
+  getFailures(): number {
+    return this.failures;
+  }
+}
+
+// Usage with ProvisioningClient
+class ResilientProvisioningClient {
+  private circuitBreaker = new CircuitBreaker();
+
+  constructor(private client: ProvisioningClient) {}
+
+  async createServerWorkflow(config: any): Promise<string> {
+    return this.circuitBreaker.execute(async () => {
+      return this.client.createServerWorkflow(config);
+    });
+  }
+
+  async getTaskStatus(taskId: string): Promise<Task> {
+    return this.circuitBreaker.execute(async () => {
+      return this.client.getTaskStatus(taskId);
+    });
+  }
+}
+```
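+
+The retry handler and the circuit breaker address different failure modes and compose naturally: retries absorb transient
+errors, while the breaker keeps a retry storm from hammering a service that is persistently down. A minimal Python sketch of
+that composition (the `SimpleBreaker` class here is illustrative, not part of the platform API):
+
+```text
+import time
+
+class CircuitOpenError(Exception):
+    """Raised when calls are rejected without reaching the service."""
+
+class SimpleBreaker:
+    def __init__(self, threshold: int = 5, cooldown: float = 60.0):
+        self.threshold = threshold
+        self.cooldown = cooldown
+        self.failures = 0
+        self.opened_at = 0.0
+
+    async def call(self, operation, *args, **kwargs):
+        # Reject immediately while the cooldown window is active
+        if self.failures >= self.threshold and time.monotonic() - self.opened_at < self.cooldown:
+            raise CircuitOpenError("too many recent failures")
+        try:
+            result = await operation(*args, **kwargs)
+            self.failures = 0  # any success closes the breaker
+            return result
+        except Exception:
+            self.failures += 1
+            if self.failures >= self.threshold:
+                self.opened_at = time.monotonic()
+            raise
+
+# Breaker on the outside, retries on the inside: one logical operation,
+# including all of its retries, counts as a single failure toward the threshold.
+async def guarded_create(breaker: SimpleBreaker, handler: ProvisioningErrorHandler, client: ProvisioningClient):
+    return await breaker.call(
+        handler.execute_with_retry,
+        client.create_server_workflow,
+        infra="production",
+    )
+```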
+
+## Performance Optimization
+
+### Connection Pooling and Caching
+
+```text
+import asyncio
+import time
+
+import aiohttp
+from cachetools import TTLCache
+
+class OptimizedProvisioningClient:
+    """High-performance client with connection pooling and caching"""
+
+    def __init__(self, base_url: str, max_connections: int = 100):
+        self.base_url = base_url
+        self.session = None
+        self.cache = TTLCache(maxsize=1000, ttl=300)  # 5-minute cache
+        self.max_connections = max_connections
+
+    async def __aenter__(self):
+        """Async context manager entry"""
+        connector = aiohttp.TCPConnector(
+            limit=self.max_connections,
+            limit_per_host=20,
+            keepalive_timeout=30,
+            enable_cleanup_closed=True
+        )
+
+        timeout = aiohttp.ClientTimeout(total=30, connect=5)
+
+        self.session = aiohttp.ClientSession(
+            connector=connector,
+            timeout=timeout,
+            headers={'User-Agent': 'ProvisioningClient/2.0.0'}
+        )
+
+        return self
+
+    async def __aexit__(self, exc_type, exc_val, exc_tb):
+        """Async context manager exit"""
+        if self.session:
+            await self.session.close()
+
+    async def get_task_status_cached(self, task_id: str) -> dict:
+        """Get task status with caching"""
+        cache_key = f"task_status:{task_id}"
+
+        # Check cache first
+        if cache_key in self.cache:
+            return self.cache[cache_key]
+
+        # Fetch from API
+        result = await self._make_request('GET', f'/tasks/{task_id}')
+
+        # Cache only terminal states; in-flight tasks are always re-fetched
+        if result.get('status') in ['Completed', 'Failed', 'Cancelled']:
+            self.cache[cache_key] = result
+
+        return result
+
+    async def batch_get_task_status(self, task_ids: list) -> dict:
+        """Get multiple task statuses in parallel"""
+        tasks = [self.get_task_status_cached(task_id) for task_id in task_ids]
+        results = await asyncio.gather(*tasks, return_exceptions=True)
+
+        return {
+            task_id: result for task_id, result in zip(task_ids, results)
+            if not isinstance(result, Exception)
+        }
+
+    async def _make_request(self, method: str, endpoint: str, **kwargs):
+        """Optimized HTTP request method"""
+        url = f"{self.base_url}{endpoint}"
+
+        start_time = time.time()
+        async with self.session.request(method, url, **kwargs) as response:
+            request_time = time.time() - start_time
+
+            # Log slow requests
+            if request_time > 5.0:
+                print(f"Slow request: {method} {endpoint} took {request_time:.2f}s")
+
+            response.raise_for_status()
+            result = await response.json()
+
+            if not result.get('success'):
+                raise Exception(result.get('error', 'Request failed'))
+
+            return result['data']
+
+# Usage example
+async def high_performance_workflow():
+    async with OptimizedProvisioningClient('http://localhost:9090') as client:
+        # Create multiple workflows in parallel
+        # (create_server_workflow is assumed to wrap _make_request('POST', ...) like the full client)
+        workflow_tasks = [
+            client.create_server_workflow({'infra': f'server-{i}'})
+            for i in range(10)
+        ]
+
+        task_ids = await asyncio.gather(*workflow_tasks)
+        print(f"Created {len(task_ids)} workflows")
+
+        # Monitor all tasks efficiently
+        while True:
+            # Batch status check
+            statuses = await client.batch_get_task_status(task_ids)
+
+            completed = [
+                task_id for task_id, status in statuses.items()
+                if status.get('status') in ['Completed', 'Failed', 'Cancelled']
+            ]
+
+            print(f"Completed: {len(completed)}/{len(task_ids)}")
+
+            if len(completed) == len(task_ids):
+                break
+
+            await asyncio.sleep(10)
+```
+
+### WebSocket Connection Pooling
+
+```text
+class WebSocketPool {
+  constructor(maxConnections = 5) {
+    this.maxConnections = maxConnections;
+    this.connections = new Map();
+    this.connectionQueue = [];
+  }
+
+  async getConnection(token, eventTypes = []) {
+    const key = `${token}:${eventTypes.sort().join(',')}`;
+
+    if (this.connections.has(key)) {
+      return this.connections.get(key);
+    }
+
+    if (this.connections.size >= this.maxConnections) {
+      // Wait for available connection
+      await this.waitForAvailableSlot();
+    }
+
+    const connection = await this.createConnection(token, eventTypes);
+    this.connections.set(key, connection);
+
+    return connection;
+  }
+
+  async createConnection(token, eventTypes) {
+    const ws = new WebSocket(`ws://localhost:9090/ws?token=${token}&events=${eventTypes.join(',')}`);
+
+    return new Promise((resolve, reject) => {
+      ws.onopen = () => resolve(ws);
+      ws.onerror = (error) => reject(error);
+
+      ws.onclose = () => {
+        // Remove from pool when closed
+        for (const [key, conn] of this.connections.entries()) {
+          if (conn === ws) {
+            this.connections.delete(key);
+            break;
+          }
+        }
+      };
+    });
+  }
+
+  async waitForAvailableSlot() {
+    return new Promise((resolve) => {
+      this.connectionQueue.push(resolve);
+    });
+  }
+
+  // Callers must release connections they no longer need, or waiters queued
+  // in waitForAvailableSlot() will never be woken up.
+  releaseConnection(ws) {
+    if (this.connectionQueue.length > 0) {
+      const waitingResolver = this.connectionQueue.shift();
+      waitingResolver();
+    }
+  }
+}
+```
+
+## SDK Documentation
+
+### Python SDK
+
+The Python SDK provides a comprehensive interface for provisioning:
+
+#### Installation
+
+```text
+pip install provisioning-client
+```
+
+#### Quick Start
+
+```text
+from provisioning_client import ProvisioningClient
+
+# Initialize client
+client = ProvisioningClient(
+    base_url="http://localhost:9090",
+    username="admin",
+    password="password"
+)
+
+# Create workflow
+task_id = await client.create_server_workflow(
+    infra="production",
+    settings="config.ncl"
+)
+
+# Wait for completion
+task = await client.wait_for_task_completion(task_id)
+print(f"Workflow completed: {task.status}")
+```
+
+#### Advanced Usage
+
+```text
+# Use with async context manager
+async with ProvisioningClient() as client:
+    # Batch operations
+    batch_config = {
+        "name": "deployment",
+        "operations": [...]
+    }
+
+    batch_result = await client.execute_batch_operation(batch_config)
+
+    # Real-time monitoring
+    await client.connect_websocket(['TaskStatusChanged'])
+
+    # handle_task_update is your own callback
+    client.on_event('TaskStatusChanged', handle_task_update)
+```
+
+### JavaScript/TypeScript SDK
+
+#### Installation
+
+```text
+npm install @provisioning/client
+```
+
+#### Usage
+
+```text
+import { ProvisioningClient } from '@provisioning/client';
+
+const client = new ProvisioningClient({
+  baseUrl: 'http://localhost:9090',
+  username: 'admin',
+  password: 'password'
+});
+
+// Create workflow
+const taskId = await client.createServerWorkflow({
+  infra: 'production',
+  settings: 'config.ncl'
+});
+
+// Monitor progress
+client.on('workflowProgress', (progress) => {
+  console.log(`Progress: ${progress.progress}%`);
+});
+
+await client.connectWebSocket();
+```
+
+## Common Integration Patterns
+
+### Workflow Orchestration Pipeline
+
+```text
+import asyncio
+from typing import Callable
+
+class WorkflowPipeline:
+    """Orchestrate complex multi-step workflows"""
+
+    def __init__(self, client: ProvisioningClient):
+        self.client = client
+        self.steps = []
+
+    def add_step(self, name: str, operation: Callable, dependencies: list = None):
+        """Add a step to the pipeline"""
+        self.steps.append({
+            'name': name,
+            'operation': operation,
+            'dependencies': dependencies or [],
+            'status': 'pending',
+            'result': None
+        })
+
+    async def execute(self):
+        """Execute the pipeline"""
+        completed_steps = set()
+
+        while len(completed_steps) < len(self.steps):
+            # Find steps ready to execute
+            ready_steps = [
+                step for step in self.steps
+                if (step['status'] == 'pending' and
+                    all(dep in completed_steps for dep in step['dependencies']))
+            ]
+
+            if not ready_steps:
+                raise Exception("Pipeline deadlock detected")
+
+            # Execute ready steps in parallel
+            tasks = []
+            for step in ready_steps:
+                step['status'] = 'running'
+                tasks.append(self._execute_step(step))
+
+            # Wait for completion
+            results = await asyncio.gather(*tasks, return_exceptions=True)
+
+            for step, result in zip(ready_steps, results):
+                if isinstance(result, Exception):
+                    step['status'] = 'failed'
+                    step['error'] = str(result)
+                    raise Exception(f"Step {step['name']} failed: {result}")
+                else:
+                    step['status'] = 'completed'
+                    step['result'] = result
+                    completed_steps.add(step['name'])
+
+    async def _execute_step(self, step):
+        """Execute a single step"""
+        try:
+            return await step['operation']()
+        except Exception as e:
+            print(f"Step {step['name']} failed: {e}")
+            raise
+
+# Usage example
+async def complex_deployment():
+    client = ProvisioningClient()
+    pipeline = WorkflowPipeline(client)
+
+    # Define deployment steps
+    pipeline.add_step('servers', lambda: client.create_server_workflow({
+        'infra': 'production'
+    }))
+
+    pipeline.add_step('kubernetes', lambda: client.create_taskserv_workflow({
+        'operation': 'create',
+        'taskserv': 'kubernetes',
+        'infra': 'production'
+    }), dependencies=['servers'])
+
+    pipeline.add_step('cilium', lambda: client.create_taskserv_workflow({
+        'operation': 'create',
+        'taskserv': 'cilium',
+        'infra': 'production'
+    }), dependencies=['kubernetes'])
+
+    # Execute pipeline
+    await pipeline.execute()
+    print("Deployment pipeline completed successfully")
+```
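+
+The `Pipeline deadlock detected` branch fires when no pending step has all of its dependencies satisfied, which is how a
+dependency cycle looks to the scheduler. A quick sketch of triggering it (the no-op step bodies are placeholders):
+
+```text
+async def deadlocked_pipeline():
+    pipeline = WorkflowPipeline(client=None)  # client is unused by these steps
+
+    async def noop():
+        return None
+
+    # 'a' waits for 'b' and 'b' waits for 'a': neither step can ever become ready
+    pipeline.add_step('a', noop, dependencies=['b'])
+    pipeline.add_step('b', noop, dependencies=['a'])
+
+    await pipeline.execute()  # raises Exception("Pipeline deadlock detected")
+```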
+
+### Event-Driven Architecture
+
+```text
+const { EventEmitter } = require('events');
+const { randomUUID } = require('crypto');
+
+class EventDrivenWorkflowManager extends EventEmitter {
+  constructor(client) {
+    super();
+    this.client = client;
+    this.workflows = new Map();
+    this.setupEventHandlers();
+  }
+
+  setupEventHandlers() {
+    this.client.on('TaskStatusChanged', this.handleTaskStatusChange.bind(this));
+    this.client.on('WorkflowProgressUpdate', this.handleProgressUpdate.bind(this));
+    this.client.on('SystemHealthUpdate', this.handleHealthUpdate.bind(this));
+  }
+
+  async createWorkflow(config) {
+    const workflowId = randomUUID();
+    const workflow = {
+      id: workflowId,
+      config,
+      tasks: [],
+      status: 'pending',
+      progress: 0,
+      events: []
+    };
+
+    this.workflows.set(workflowId, workflow);
+
+    // Start workflow execution
+    await this.executeWorkflow(workflow);
+
+    return workflowId;
+  }
+
+  async executeWorkflow(workflow) {
+    try {
+      workflow.status = 'running';
+
+      // Create initial tasks based on configuration
+      const taskId = await this.client.createServerWorkflow(workflow.config);
+      workflow.tasks.push({
+        id: taskId,
+        type: 'server_creation',
+        status: 'pending'
+      });
+
+      this.emit('workflowStarted', { workflowId: workflow.id, taskId });
+
+    } catch (error) {
+      workflow.status = 'failed';
+      workflow.error = error.message;
+      this.emit('workflowFailed', { workflowId: workflow.id, error });
+    }
+  }
+
+  handleTaskStatusChange(event) {
+    // Find workflows containing this task
+    for (const [workflowId, workflow] of this.workflows) {
+      const task = workflow.tasks.find(t => t.id === event.data.task_id);
+      if (task) {
+        task.status = event.data.status;
+        this.updateWorkflowProgress(workflow);
+
+        // Trigger next steps based on task completion
+        if (event.data.status === 'Completed') {
+          this.triggerNextSteps(workflow, task);
+        }
+      }
+    }
+  }
+
+  handleProgressUpdate(event) {
+    this.emit('workflowProgress', event.data);
+  }
+
+  handleHealthUpdate(event) {
+    this.emit('systemHealthUpdate', event.data);
+  }
+
+  updateWorkflowProgress(workflow) {
+    const completedTasks = workflow.tasks.filter(t =>
+      ['Completed', 'Failed'].includes(t.status)
+    ).length;
+
+    workflow.progress = (completedTasks / workflow.tasks.length) * 100;
+
+    if (completedTasks === workflow.tasks.length) {
+      const failedTasks = workflow.tasks.filter(t => t.status === 'Failed');
+      workflow.status = failedTasks.length > 0 ? 'failed' : 'completed';
+
+      this.emit('workflowCompleted', {
+        workflowId: workflow.id,
+        status: workflow.status
+      });
+    }
+  }
+
+  async triggerNextSteps(workflow, completedTask) {
+    // Define workflow dependencies and next steps
+    const nextSteps = this.getNextSteps(workflow, completedTask);
+
+    for (const nextStep of nextSteps) {
+      try {
+        // executeWorkflowStep is left to the integrator: it should map a step
+        // definition onto a client call such as createTaskservWorkflow.
+        const taskId = await this.executeWorkflowStep(nextStep);
+        workflow.tasks.push({
+          id: taskId,
+          type: nextStep.type,
+          status: 'pending',
+          dependencies: [completedTask.id]
+        });
+      } catch (error) {
+        console.error(`Failed to trigger next step: ${error.message}`);
+      }
+    }
+  }
+
+  getNextSteps(workflow, completedTask) {
+    // Define workflow logic based on completed task type
+    switch (completedTask.type) {
+      case 'server_creation':
+        return [
+          { type: 'kubernetes_installation', taskserv: 'kubernetes' },
+          { type: 'monitoring_setup', taskserv: 'prometheus' }
+        ];
+      case 'kubernetes_installation':
+        return [
+          { type: 'networking_setup', taskserv: 'cilium' }
+        ];
+      default:
+        return [];
+    }
+  }
+}
+```
+
+This comprehensive integration documentation provides developers with everything needed to successfully integrate with provisioning, including
+complete client implementations, error handling strategies, performance optimizations, and common integration patterns.
\ No newline at end of file diff --git a/docs/src/api-reference/nushell-api.md b/docs/src/api-reference/nushell-api.md index 32112fd..268dad9 100644 --- a/docs/src/api-reference/nushell-api.md +++ b/docs/src/api-reference/nushell-api.md @@ -1 +1,111 @@ -# Nushell API Reference\n\nAPI documentation for Nushell library functions in the provisioning platform.\n\n## Overview\n\nThe provisioning platform provides a comprehensive Nushell library with reusable functions for infrastructure automation.\n\n## Core Modules\n\n### Configuration Module\n\n**Location**: `provisioning/core/nulib/lib_provisioning/config/`\n\n- `get-config ` - Retrieve configuration values\n- `validate-config` - Validate configuration files\n- `load-config ` - Load configuration from file\n\n### Server Module\n\n**Location**: `provisioning/core/nulib/lib_provisioning/servers/`\n\n- `create-servers ` - Create server infrastructure\n- `list-servers` - List all provisioned servers\n- `delete-servers ` - Remove servers\n\n### Task Service Module\n\n**Location**: `provisioning/core/nulib/lib_provisioning/taskservs/`\n\n- `install-taskserv ` - Install infrastructure service\n- `list-taskservs` - List installed services\n- `generate-taskserv-config ` - Generate service configuration\n\n### Workspace Module\n\n**Location**: `provisioning/core/nulib/lib_provisioning/workspace/`\n\n- `init-workspace ` - Initialize new workspace\n- `get-active-workspace` - Get current workspace\n- `switch-workspace ` - Switch to different workspace\n\n### Provider Module\n\n**Location**: `provisioning/core/nulib/lib_provisioning/providers/`\n\n- `discover-providers` - Find available providers\n- `load-provider ` - Load provider module\n- `list-providers` - List loaded providers\n\n## Diagnostics & Utilities\n\n### Diagnostics Module\n\n**Location**: `provisioning/core/nulib/lib_provisioning/diagnostics/`\n\n- `system-status` - Check system health (13+ checks)\n- `health-check` - Deep validation (7 areas)\n- `next-steps` - Get progressive guidance\n- `deployment-phase` - Check deployment progress\n\n### Hints Module\n\n**Location**: `provisioning/core/nulib/lib_provisioning/utils/hints.nu`\n\n- `show-next-step ` - Display next step suggestion\n- `show-doc-link ` - Show documentation link\n- `show-example ` - Display command example\n\n## Usage Example\n\n```\n# Load provisioning library\nuse provisioning/core/nulib/lib_provisioning *\n\n# Check system status\nsystem-status | table\n\n# Create servers\ncreate-servers --plan "3-node-cluster" --check\n\n# Install kubernetes\ninstall-taskserv kubernetes --check\n\n# Get next steps\nnext-steps\n```\n\n## API Conventions\n\nAll API functions follow these conventions:\n\n- **Explicit types**: All parameters have type annotations\n- **Early returns**: Validate first, fail fast\n- **Pure functions**: No side effects (mutations marked with `!`)\n- **Pipeline-friendly**: Output designed for Nu pipelines\n\n## Best Practices\n\nSee [Nushell Best Practices](../development/NUSHELL_BEST_PRACTICES.md) for coding guidelines.\n\n## Source Code\n\nBrowse the complete source code:\n\n- **Core library**: `provisioning/core/nulib/lib_provisioning/`\n- **Module index**: `provisioning/core/nulib/lib_provisioning/mod.nu`\n\n---\n\nFor integration examples, see [Integration Examples](integration-examples.md). +# Nushell API Reference + +API documentation for Nushell library functions in the provisioning platform. 
+
+## Overview
+
+The provisioning platform provides a comprehensive Nushell library with reusable functions for infrastructure automation.
+
+## Core Modules
+
+### Configuration Module
+
+**Location**: `provisioning/core/nulib/lib_provisioning/config/`
+
+- `get-config <key>` - Retrieve configuration values
+- `validate-config` - Validate configuration files
+- `load-config <file>` - Load configuration from file
+
+### Server Module
+
+**Location**: `provisioning/core/nulib/lib_provisioning/servers/`
+
+- `create-servers <plan>` - Create server infrastructure
+- `list-servers` - List all provisioned servers
+- `delete-servers <ids>` - Remove servers
+
+### Task Service Module
+
+**Location**: `provisioning/core/nulib/lib_provisioning/taskservs/`
+
+- `install-taskserv <name>` - Install infrastructure service
+- `list-taskservs` - List installed services
+- `generate-taskserv-config <name>` - Generate service configuration
+
+### Workspace Module
+
+**Location**: `provisioning/core/nulib/lib_provisioning/workspace/`
+
+- `init-workspace <name>` - Initialize new workspace
+- `get-active-workspace` - Get current workspace
+- `switch-workspace <name>` - Switch to different workspace
+
+### Provider Module
+
+**Location**: `provisioning/core/nulib/lib_provisioning/providers/`
+
+- `discover-providers` - Find available providers
+- `load-provider <name>` - Load provider module
+- `list-providers` - List loaded providers
+
+## Diagnostics & Utilities
+
+### Diagnostics Module
+
+**Location**: `provisioning/core/nulib/lib_provisioning/diagnostics/`
+
+- `system-status` - Check system health (13+ checks)
+- `health-check` - Deep validation (7 areas)
+- `next-steps` - Get progressive guidance
+- `deployment-phase` - Check deployment progress
+
+### Hints Module
+
+**Location**: `provisioning/core/nulib/lib_provisioning/utils/hints.nu`
+
+- `show-next-step <step>` - Display next step suggestion
+- `show-doc-link <topic>` - Show documentation link
+- `show-example <command>` - Display command example
+
+## Usage Example
+
+```text
+# Load provisioning library
+use provisioning/core/nulib/lib_provisioning *
+
+# Check system status
+system-status | table
+
+# Create servers
+create-servers --plan "3-node-cluster" --check
+
+# Install kubernetes
+install-taskserv kubernetes --check
+
+# Get next steps
+next-steps
+```
+
+## API Conventions
+
+All API functions follow these conventions:
+
+- **Explicit types**: All parameters have type annotations
+- **Early returns**: Validate first, fail fast
+- **Pure functions**: No side effects (mutations marked with `!`)
+- **Pipeline-friendly**: Output designed for Nu pipelines
+
+## Best Practices
+
+See [Nushell Best Practices](../development/NUSHELL_BEST_PRACTICES.md) for coding guidelines.
+
+## Source Code
+
+Browse the complete source code:
+
+- **Core library**: `provisioning/core/nulib/lib_provisioning/`
+- **Module index**: `provisioning/core/nulib/lib_provisioning/mod.nu`
+
+---
+
+For integration examples, see [Integration Examples](integration-examples.md).
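+
+As a quick illustration of the pipeline-friendly convention above, the library can also be driven from other languages by
+shelling out to `nu` and converting output to JSON. A minimal sketch, assuming `nu` is on `PATH` and that the `use` path
+below matches your installation (the `ok` field is a hypothetical shape for `system-status` rows):
+
+```text
+import json
+import subprocess
+
+def run_nu(pipeline: str) -> list:
+    """Run a Nushell pipeline from the provisioning library and parse its JSON output."""
+    script = f"use provisioning/core/nulib/lib_provisioning *; {pipeline} | to json"
+    result = subprocess.run(["nu", "-c", script], capture_output=True, text=True, check=True)
+    return json.loads(result.stdout)
+
+# Hypothetical usage: surface failing health checks from system-status
+checks = run_nu("system-status")
+failing = [c for c in checks if not c.get("ok", True)]
+print(f"{len(failing)} checks need attention")
+```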
\ No newline at end of file diff --git a/docs/src/api-reference/path-resolution.md b/docs/src/api-reference/path-resolution.md index bb33d27..d212cd8 100644 --- a/docs/src/api-reference/path-resolution.md +++ b/docs/src/api-reference/path-resolution.md @@ -1 +1,730 @@ -# Path Resolution API\n\nThis document describes the path resolution system used throughout the provisioning infrastructure for discovering configurations, extensions, and\nresolving workspace paths.\n\n## Overview\n\nThe path resolution system provides a hierarchical and configurable mechanism for:\n\n- Configuration file discovery and loading\n- Extension discovery (providers, task services, clusters)\n- Workspace and project path management\n- Environment variable interpolation\n- Cross-platform path handling\n\n## Configuration Resolution Hierarchy\n\nThe system follows a specific hierarchy for loading configuration files:\n\n```\n1. System defaults (config.defaults.toml)\n2. User configuration (config.user.toml)\n3. Project configuration (config.project.toml)\n4. Infrastructure config (infra/config.toml)\n5. Environment config (config.{env}.toml)\n6. Runtime overrides (CLI arguments, ENV vars)\n```\n\n### Configuration Search Paths\n\nThe system searches for configuration files in these locations:\n\n```\n# Default search paths (in order)\n/usr/local/provisioning/config.defaults.toml\n$HOME/.config/provisioning/config.user.toml\n$PWD/config.project.toml\n$PROVISIONING_KLOUD_PATH/config.infra.toml\n$PWD/config.{PROVISIONING_ENV}.toml\n```\n\n## Path Resolution API\n\n### Core Functions\n\n#### `resolve-config-path(pattern: string, search_paths: list) -> string`\n\nResolves configuration file paths using the search hierarchy.\n\n**Parameters:**\n\n- `pattern`: File pattern to search for (for example, "config.*.toml")\n- `search_paths`: Additional paths to search (optional)\n\n**Returns:**\n\n- Full path to the first matching configuration file\n- Empty string if no file found\n\n**Example:**\n\n```\nuse path-resolution.nu *\nlet config_path = (resolve-config-path "config.user.toml" [])\n# Returns: "/home/user/.config/provisioning/config.user.toml"\n```\n\n#### `resolve-extension-path(type: string, name: string) -> record`\n\nDiscovers extension paths (providers, taskservs, clusters).\n\n**Parameters:**\n\n- `type`: Extension type ("provider", "taskserv", "cluster")\n- `name`: Extension name (for example, "upcloud", "kubernetes", "buildkit")\n\n**Returns:**\n\n```\n{\n base_path: "/usr/local/provisioning/providers/upcloud",\n schemas_path: "/usr/local/provisioning/providers/upcloud/schemas",\n nulib_path: "/usr/local/provisioning/providers/upcloud/nulib",\n templates_path: "/usr/local/provisioning/providers/upcloud/templates",\n exists: true\n}\n```\n\n#### `resolve-workspace-paths() -> record`\n\nGets current workspace path configuration.\n\n**Returns:**\n\n```\n{\n base: "/usr/local/provisioning",\n current_infra: "/workspace/infra/production",\n kloud_path: "/workspace/kloud",\n providers: "/usr/local/provisioning/providers",\n taskservs: "/usr/local/provisioning/taskservs",\n clusters: "/usr/local/provisioning/cluster",\n extensions: "/workspace/extensions"\n}\n```\n\n### Path Interpolation\n\nThe system supports variable interpolation in configuration paths:\n\n#### Supported Variables\n\n- `{{paths.base}}` - Base provisioning path\n- `{{paths.kloud}}` - Current kloud path\n- `{{env.HOME}}` - User home directory\n- `{{env.PWD}}` - Current working directory\n- `{{now.date}}` - Current date (YYYY-MM-DD)\n- 
`{{now.time}}` - Current time (HH:MM:SS)\n- `{{git.branch}}` - Current git branch\n- `{{git.commit}}` - Current git commit hash\n\n#### `interpolate-path(template: string, context: record) -> string`\n\nInterpolates variables in path templates.\n\n**Parameters:**\n\n- `template`: Path template with variables\n- `context`: Variable context record\n\n**Example:**\n\n```\nlet template = "{{paths.base}}/infra/{{env.USER}}/{{git.branch}}"\nlet result = (interpolate-path $template {\n paths: { base: "/usr/local/provisioning" },\n env: { USER: "admin" },\n git: { branch: "main" }\n})\n# Returns: "/usr/local/provisioning/infra/admin/main"\n```\n\n## Extension Discovery API\n\n### Provider Discovery\n\n#### `discover-providers() -> list`\n\nDiscovers all available providers.\n\n**Returns:**\n\n```\n[\n {\n name: "upcloud",\n path: "/usr/local/provisioning/providers/upcloud",\n type: "provider",\n version: "1.2.0",\n enabled: true,\n has_schemas: true,\n has_nulib: true,\n has_templates: true\n },\n {\n name: "aws",\n path: "/usr/local/provisioning/providers/aws",\n type: "provider",\n version: "2.1.0",\n enabled: true,\n has_schemas: true,\n has_nulib: true,\n has_templates: true\n }\n]\n```\n\n#### `get-provider-config(name: string) -> record`\n\nGets provider-specific configuration and paths.\n\n**Parameters:**\n\n- `name`: Provider name\n\n**Returns:**\n\n```\n{\n name: "upcloud",\n base_path: "/usr/local/provisioning/providers/upcloud",\n config: {\n api_url: "https://api.upcloud.com/1.3",\n auth_method: "basic",\n interface: "API"\n },\n paths: {\n schemas: "/usr/local/provisioning/providers/upcloud/schemas",\n nulib: "/usr/local/provisioning/providers/upcloud/nulib",\n templates: "/usr/local/provisioning/providers/upcloud/templates"\n },\n metadata: {\n version: "1.2.0",\n description: "UpCloud provider for server provisioning"\n }\n}\n```\n\n### Task Service Discovery\n\n#### `discover-taskservs() -> list`\n\nDiscovers all available task services.\n\n**Returns:**\n\n```\n[\n {\n name: "kubernetes",\n path: "/usr/local/provisioning/taskservs/kubernetes",\n type: "taskserv",\n category: "orchestration",\n version: "1.28.0",\n enabled: true\n },\n {\n name: "cilium",\n path: "/usr/local/provisioning/taskservs/cilium",\n type: "taskserv",\n category: "networking",\n version: "1.14.0",\n enabled: true\n }\n]\n```\n\n#### `get-taskserv-config(name: string) -> record`\n\nGets task service configuration and version information.\n\n**Parameters:**\n\n- `name`: Task service name\n\n**Returns:**\n\n```\n{\n name: "kubernetes",\n path: "/usr/local/provisioning/taskservs/kubernetes",\n version: {\n current: "1.28.0",\n available: "1.28.2",\n update_available: true,\n source: "github",\n release_url: "https://github.com/kubernetes/kubernetes/releases"\n },\n config: {\n category: "orchestration",\n dependencies: ["containerd"],\n supports_versions: ["1.26.x", "1.27.x", "1.28.x"]\n }\n}\n```\n\n### Cluster Discovery\n\n#### `discover-clusters() -> list`\n\nDiscovers all available cluster configurations.\n\n**Returns:**\n\n```\n[\n {\n name: "buildkit",\n path: "/usr/local/provisioning/cluster/buildkit",\n type: "cluster",\n category: "build",\n components: ["buildkit", "registry", "storage"],\n enabled: true\n }\n]\n```\n\n## Environment Management API\n\n### Environment Detection\n\n#### `detect-environment() -> string`\n\nAutomatically detects the current environment based on:\n\n1. `PROVISIONING_ENV` environment variable\n2. Git branch patterns (main → prod, develop → dev, etc.)\n3. 
Directory structure analysis\n4. Configuration file presence\n\n**Returns:**\n\n- Environment name string (dev, test, prod, etc.)\n\n#### `get-environment-config(env: string) -> record`\n\nGets environment-specific configuration.\n\n**Parameters:**\n\n- `env`: Environment name\n\n**Returns:**\n\n```\n{\n name: "production",\n paths: {\n base: "/opt/provisioning",\n kloud: "/data/kloud",\n logs: "/var/log/provisioning"\n },\n providers: {\n default: "upcloud",\n allowed: ["upcloud", "aws"]\n },\n features: {\n debug: false,\n telemetry: true,\n rollback: true\n }\n}\n```\n\n### Environment Switching\n\n#### `switch-environment(env: string, validate: bool = true) -> null`\n\nSwitches to a different environment and updates path resolution.\n\n**Parameters:**\n\n- `env`: Target environment name\n- `validate`: Whether to validate environment configuration\n\n**Effects:**\n\n- Updates `PROVISIONING_ENV` environment variable\n- Reconfigures path resolution for new environment\n- Validates environment configuration if requested\n\n## Workspace Management API\n\n### Workspace Discovery\n\n#### `discover-workspaces() -> list`\n\nDiscovers available workspaces and infrastructure directories.\n\n**Returns:**\n\n```\n[\n {\n name: "production",\n path: "/workspace/infra/production",\n type: "infrastructure",\n provider: "upcloud",\n settings: "settings.ncl",\n valid: true\n },\n {\n name: "development",\n path: "/workspace/infra/development",\n type: "infrastructure",\n provider: "local",\n settings: "dev-settings.ncl",\n valid: true\n }\n]\n```\n\n#### `set-current-workspace(path: string) -> null`\n\nSets the current workspace for path resolution.\n\n**Parameters:**\n\n- `path`: Workspace directory path\n\n**Effects:**\n\n- Updates `CURRENT_INFRA_PATH` environment variable\n- Reconfigures workspace-relative path resolution\n\n### Project Structure Analysis\n\n#### `analyze-project-structure(path: string = $PWD) -> record`\n\nAnalyzes project structure and identifies components.\n\n**Parameters:**\n\n- `path`: Project root path (defaults to current directory)\n\n**Returns:**\n\n```\n{\n root: "/workspace/project",\n type: "provisioning_workspace",\n components: {\n providers: [\n { name: "upcloud", path: "providers/upcloud" },\n { name: "aws", path: "providers/aws" }\n ],\n taskservs: [\n { name: "kubernetes", path: "taskservs/kubernetes" },\n { name: "cilium", path: "taskservs/cilium" }\n ],\n clusters: [\n { name: "buildkit", path: "cluster/buildkit" }\n ],\n infrastructure: [\n { name: "production", path: "infra/production" },\n { name: "staging", path: "infra/staging" }\n ]\n },\n config_files: [\n "config.defaults.toml",\n "config.user.toml",\n "config.prod.toml"\n ]\n}\n```\n\n## Caching and Performance\n\n### Path Caching\n\nThe path resolution system includes intelligent caching:\n\n#### `cache-paths(duration: duration = 5 min) -> null`\n\nEnables path caching for the specified duration.\n\n**Parameters:**\n\n- `duration`: Cache validity duration\n\n#### `invalidate-path-cache() -> null`\n\nInvalidates the path resolution cache.\n\n#### `get-cache-stats() -> record`\n\nGets path resolution cache statistics.\n\n**Returns:**\n\n```\n{\n enabled: true,\n size: 150,\n hit_rate: 0.85,\n last_invalidated: "2025-09-26T10:00:00Z"\n}\n```\n\n## Cross-Platform Compatibility\n\n### Path Normalization\n\n#### `normalize-path(path: string) -> string`\n\nNormalizes paths for cross-platform compatibility.\n\n**Parameters:**\n\n- `path`: Input path (may contain mixed separators)\n\n**Returns:**\n\n- Normalized 
path using platform-appropriate separators\n\n**Example:**\n\n```\n# On Windows\nnormalize-path "path/to/file" # Returns: "path\to\file"\n\n# On Unix\nnormalize-path "path\to\file" # Returns: "path/to/file"\n```\n\n#### `join-paths(segments: list) -> string`\n\nSafely joins path segments using platform separators.\n\n**Parameters:**\n\n- `segments`: List of path segments\n\n**Returns:**\n\n- Joined path string\n\n## Configuration Validation API\n\n### Path Validation\n\n#### `validate-paths(config: record) -> record`\n\nValidates all paths in configuration.\n\n**Parameters:**\n\n- `config`: Configuration record\n\n**Returns:**\n\n```\n{\n valid: true,\n errors: [],\n warnings: [\n { path: "paths.extensions", message: "Path does not exist" }\n ],\n checks_performed: 15\n}\n```\n\n#### `validate-extension-structure(type: string, path: string) -> record`\n\nValidates extension directory structure.\n\n**Parameters:**\n\n- `type`: Extension type (provider, taskserv, cluster)\n- `path`: Extension base path\n\n**Returns:**\n\n```\n{\n valid: true,\n required_files: [\n { file: "manifest.toml", exists: true },\n { file: "schemas/main.ncl", exists: true },\n { file: "nulib/mod.nu", exists: true }\n ],\n optional_files: [\n { file: "templates/server.j2", exists: false }\n ]\n}\n```\n\n## Command-Line Interface\n\n### Path Resolution Commands\n\nThe path resolution API is exposed via Nushell commands:\n\n```\n# Show current path configuration\nprovisioning show paths\n\n# Discover available extensions\nprovisioning discover providers\nprovisioning discover taskservs\nprovisioning discover clusters\n\n# Validate path configuration\nprovisioning validate paths\n\n# Switch environments\nprovisioning env switch prod\n\n# Set workspace\nprovisioning workspace set /path/to/infra\n```\n\n## Integration Examples\n\n### Python Integration\n\n```\nimport subprocess\nimport json\n\nclass PathResolver:\n def __init__(self, provisioning_path="/usr/local/bin/provisioning"):\n self.cmd = provisioning_path\n\n def get_paths(self):\n result = subprocess.run([\n "nu", "-c", f"use {self.cmd} *; show-config --section=paths --format=json"\n ], capture_output=True, text=True)\n return json.loads(result.stdout)\n\n def discover_providers(self):\n result = subprocess.run([\n "nu", "-c", f"use {self.cmd} *; discover providers --format=json"\n ], capture_output=True, text=True)\n return json.loads(result.stdout)\n\n# Usage\nresolver = PathResolver()\npaths = resolver.get_paths()\nproviders = resolver.discover_providers()\n```\n\n### JavaScript/Node.js Integration\n\n```\nconst { exec } = require('child_process');\nconst util = require('util');\nconst execAsync = util.promisify(exec);\n\nclass PathResolver {\n constructor(provisioningPath = '/usr/local/bin/provisioning') {\n this.cmd = provisioningPath;\n }\n\n async getPaths() {\n const { stdout } = await execAsync(\n `nu -c "use ${this.cmd} *; show-config --section=paths --format=json"`\n );\n return JSON.parse(stdout);\n }\n\n async discoverExtensions(type) {\n const { stdout } = await execAsync(\n `nu -c "use ${this.cmd} *; discover ${type} --format=json"`\n );\n return JSON.parse(stdout);\n }\n}\n\n// Usage\nconst resolver = new PathResolver();\nconst paths = await resolver.getPaths();\nconst providers = await resolver.discoverExtensions('providers');\n```\n\n## Error Handling\n\n### Common Error Scenarios\n\n1. 
**Configuration File Not Found**\n\n ```nushell\n Error: Configuration file not found in search paths\n Searched: ["/usr/local/provisioning/config.defaults.toml", ...]\n ```\n\n1. **Extension Not Found**\n\n ```nushell\n Error: Provider 'missing-provider' not found\n Available providers: ["upcloud", "aws", "local"]\n ```\n\n2. **Invalid Path Template**\n\n ```nushell\n Error: Invalid template variable: {{invalid.var}}\n Valid variables: ["paths.*", "env.*", "now.*", "git.*"]\n ```\n\n3. **Environment Not Found**\n\n ```nushell\n Error: Environment 'staging' not configured\n Available environments: ["dev", "test", "prod"]\n ```\n\n### Error Recovery\n\nThe system provides graceful fallbacks:\n\n- Missing configuration files use system defaults\n- Invalid paths fall back to safe defaults\n- Extension discovery continues if some paths are inaccessible\n- Environment detection falls back to 'local' if detection fails\n\n## Performance Considerations\n\n### Best Practices\n\n1. **Use Path Caching**: Enable caching for frequently accessed paths\n2. **Batch Discovery**: Discover all extensions at once rather than individually\n3. **Lazy Loading**: Load extension configurations only when needed\n4. **Environment Detection**: Cache environment detection results\n\n### Monitoring\n\nMonitor path resolution performance:\n\n```\n# Get resolution statistics\nprovisioning debug path-stats\n\n# Monitor cache performance\nprovisioning debug cache-stats\n\n# Profile path resolution\nprovisioning debug profile-paths\n```\n\n## Security Considerations\n\n### Path Traversal Protection\n\nThe system includes protections against path traversal attacks:\n\n- All paths are normalized and validated\n- Relative paths are resolved within safe boundaries\n- Symlinks are validated before following\n\n### Access Control\n\nPath resolution respects file system permissions:\n\n- Configuration files require read access\n- Extension directories require read/execute access\n- Workspace directories may require write access for operations\n\nThis path resolution API provides a comprehensive and flexible system for managing the complex path requirements of multi-provider, multi-environment\ninfrastructure provisioning. +# Path Resolution API + +This document describes the path resolution system used throughout the provisioning infrastructure for discovering configurations, extensions, and +resolving workspace paths. + +## Overview + +The path resolution system provides a hierarchical and configurable mechanism for: + +- Configuration file discovery and loading +- Extension discovery (providers, task services, clusters) +- Workspace and project path management +- Environment variable interpolation +- Cross-platform path handling + +## Configuration Resolution Hierarchy + +The system follows a specific hierarchy for loading configuration files: + +```text +1. System defaults (config.defaults.toml) +2. User configuration (config.user.toml) +3. Project configuration (config.project.toml) +4. Infrastructure config (infra/config.toml) +5. Environment config (config.{env}.toml) +6. 
Runtime overrides (CLI arguments, ENV vars) +``` + +### Configuration Search Paths + +The system searches for configuration files in these locations: + +```text +# Default search paths (in order) +/usr/local/provisioning/config.defaults.toml +$HOME/.config/provisioning/config.user.toml +$PWD/config.project.toml +$PROVISIONING_KLOUD_PATH/config.infra.toml +$PWD/config.{PROVISIONING_ENV}.toml +``` + +## Path Resolution API + +### Core Functions + +#### `resolve-config-path(pattern: string, search_paths: list) -> string` + +Resolves configuration file paths using the search hierarchy. + +**Parameters:** + +- `pattern`: File pattern to search for (for example, "config.*.toml") +- `search_paths`: Additional paths to search (optional) + +**Returns:** + +- Full path to the first matching configuration file +- Empty string if no file found + +**Example:** + +```text +use path-resolution.nu * +let config_path = (resolve-config-path "config.user.toml" []) +# Returns: "/home/user/.config/provisioning/config.user.toml" +``` + +#### `resolve-extension-path(type: string, name: string) -> record` + +Discovers extension paths (providers, taskservs, clusters). + +**Parameters:** + +- `type`: Extension type ("provider", "taskserv", "cluster") +- `name`: Extension name (for example, "upcloud", "kubernetes", "buildkit") + +**Returns:** + +```text +{ + base_path: "/usr/local/provisioning/providers/upcloud", + schemas_path: "/usr/local/provisioning/providers/upcloud/schemas", + nulib_path: "/usr/local/provisioning/providers/upcloud/nulib", + templates_path: "/usr/local/provisioning/providers/upcloud/templates", + exists: true +} +``` + +#### `resolve-workspace-paths() -> record` + +Gets current workspace path configuration. + +**Returns:** + +```text +{ + base: "/usr/local/provisioning", + current_infra: "/workspace/infra/production", + kloud_path: "/workspace/kloud", + providers: "/usr/local/provisioning/providers", + taskservs: "/usr/local/provisioning/taskservs", + clusters: "/usr/local/provisioning/cluster", + extensions: "/workspace/extensions" +} +``` + +### Path Interpolation + +The system supports variable interpolation in configuration paths: + +#### Supported Variables + +- `{{paths.base}}` - Base provisioning path +- `{{paths.kloud}}` - Current kloud path +- `{{env.HOME}}` - User home directory +- `{{env.PWD}}` - Current working directory +- `{{now.date}}` - Current date (YYYY-MM-DD) +- `{{now.time}}` - Current time (HH:MM:SS) +- `{{git.branch}}` - Current git branch +- `{{git.commit}}` - Current git commit hash + +#### `interpolate-path(template: string, context: record) -> string` + +Interpolates variables in path templates. + +**Parameters:** + +- `template`: Path template with variables +- `context`: Variable context record + +**Example:** + +```text +let template = "{{paths.base}}/infra/{{env.USER}}/{{git.branch}}" +let result = (interpolate-path $template { + paths: { base: "/usr/local/provisioning" }, + env: { USER: "admin" }, + git: { branch: "main" } +}) +# Returns: "/usr/local/provisioning/infra/admin/main" +``` + +## Extension Discovery API + +### Provider Discovery + +#### `discover-providers() -> list` + +Discovers all available providers. 
+ +**Returns:** + +```text +[ + { + name: "upcloud", + path: "/usr/local/provisioning/providers/upcloud", + type: "provider", + version: "1.2.0", + enabled: true, + has_schemas: true, + has_nulib: true, + has_templates: true + }, + { + name: "aws", + path: "/usr/local/provisioning/providers/aws", + type: "provider", + version: "2.1.0", + enabled: true, + has_schemas: true, + has_nulib: true, + has_templates: true + } +] +``` + +#### `get-provider-config(name: string) -> record` + +Gets provider-specific configuration and paths. + +**Parameters:** + +- `name`: Provider name + +**Returns:** + +```text +{ + name: "upcloud", + base_path: "/usr/local/provisioning/providers/upcloud", + config: { + api_url: "https://api.upcloud.com/1.3", + auth_method: "basic", + interface: "API" + }, + paths: { + schemas: "/usr/local/provisioning/providers/upcloud/schemas", + nulib: "/usr/local/provisioning/providers/upcloud/nulib", + templates: "/usr/local/provisioning/providers/upcloud/templates" + }, + metadata: { + version: "1.2.0", + description: "UpCloud provider for server provisioning" + } +} +``` + +### Task Service Discovery + +#### `discover-taskservs() -> list` + +Discovers all available task services. + +**Returns:** + +```text +[ + { + name: "kubernetes", + path: "/usr/local/provisioning/taskservs/kubernetes", + type: "taskserv", + category: "orchestration", + version: "1.28.0", + enabled: true + }, + { + name: "cilium", + path: "/usr/local/provisioning/taskservs/cilium", + type: "taskserv", + category: "networking", + version: "1.14.0", + enabled: true + } +] +``` + +#### `get-taskserv-config(name: string) -> record` + +Gets task service configuration and version information. + +**Parameters:** + +- `name`: Task service name + +**Returns:** + +```text +{ + name: "kubernetes", + path: "/usr/local/provisioning/taskservs/kubernetes", + version: { + current: "1.28.0", + available: "1.28.2", + update_available: true, + source: "github", + release_url: "https://github.com/kubernetes/kubernetes/releases" + }, + config: { + category: "orchestration", + dependencies: ["containerd"], + supports_versions: ["1.26.x", "1.27.x", "1.28.x"] + } +} +``` + +### Cluster Discovery + +#### `discover-clusters() -> list` + +Discovers all available cluster configurations. + +**Returns:** + +```text +[ + { + name: "buildkit", + path: "/usr/local/provisioning/cluster/buildkit", + type: "cluster", + category: "build", + components: ["buildkit", "registry", "storage"], + enabled: true + } +] +``` + +## Environment Management API + +### Environment Detection + +#### `detect-environment() -> string` + +Automatically detects the current environment based on: + +1. `PROVISIONING_ENV` environment variable +2. Git branch patterns (main → prod, develop → dev, etc.) +3. Directory structure analysis +4. Configuration file presence + +**Returns:** + +- Environment name string (dev, test, prod, etc.) + +#### `get-environment-config(env: string) -> record` + +Gets environment-specific configuration. + +**Parameters:** + +- `env`: Environment name + +**Returns:** + +```text +{ + name: "production", + paths: { + base: "/opt/provisioning", + kloud: "/data/kloud", + logs: "/var/log/provisioning" + }, + providers: { + default: "upcloud", + allowed: ["upcloud", "aws"] + }, + features: { + debug: false, + telemetry: true, + rollback: true + } +} +``` + +### Environment Switching + +#### `switch-environment(env: string, validate: bool = true) -> null` + +Switches to a different environment and updates path resolution. 
+ +**Parameters:** + +- `env`: Target environment name +- `validate`: Whether to validate environment configuration + +**Effects:** + +- Updates `PROVISIONING_ENV` environment variable +- Reconfigures path resolution for new environment +- Validates environment configuration if requested + +## Workspace Management API + +### Workspace Discovery + +#### `discover-workspaces() -> list` + +Discovers available workspaces and infrastructure directories. + +**Returns:** + +```text +[ + { + name: "production", + path: "/workspace/infra/production", + type: "infrastructure", + provider: "upcloud", + settings: "settings.ncl", + valid: true + }, + { + name: "development", + path: "/workspace/infra/development", + type: "infrastructure", + provider: "local", + settings: "dev-settings.ncl", + valid: true + } +] +``` + +#### `set-current-workspace(path: string) -> null` + +Sets the current workspace for path resolution. + +**Parameters:** + +- `path`: Workspace directory path + +**Effects:** + +- Updates `CURRENT_INFRA_PATH` environment variable +- Reconfigures workspace-relative path resolution + +### Project Structure Analysis + +#### `analyze-project-structure(path: string = $PWD) -> record` + +Analyzes project structure and identifies components. + +**Parameters:** + +- `path`: Project root path (defaults to current directory) + +**Returns:** + +```text +{ + root: "/workspace/project", + type: "provisioning_workspace", + components: { + providers: [ + { name: "upcloud", path: "providers/upcloud" }, + { name: "aws", path: "providers/aws" } + ], + taskservs: [ + { name: "kubernetes", path: "taskservs/kubernetes" }, + { name: "cilium", path: "taskservs/cilium" } + ], + clusters: [ + { name: "buildkit", path: "cluster/buildkit" } + ], + infrastructure: [ + { name: "production", path: "infra/production" }, + { name: "staging", path: "infra/staging" } + ] + }, + config_files: [ + "config.defaults.toml", + "config.user.toml", + "config.prod.toml" + ] +} +``` + +## Caching and Performance + +### Path Caching + +The path resolution system includes intelligent caching: + +#### `cache-paths(duration: duration = 5 min) -> null` + +Enables path caching for the specified duration. + +**Parameters:** + +- `duration`: Cache validity duration + +#### `invalidate-path-cache() -> null` + +Invalidates the path resolution cache. + +#### `get-cache-stats() -> record` + +Gets path resolution cache statistics. + +**Returns:** + +```text +{ + enabled: true, + size: 150, + hit_rate: 0.85, + last_invalidated: "2025-09-26T10:00:00Z" +} +``` + +## Cross-Platform Compatibility + +### Path Normalization + +#### `normalize-path(path: string) -> string` + +Normalizes paths for cross-platform compatibility. + +**Parameters:** + +- `path`: Input path (may contain mixed separators) + +**Returns:** + +- Normalized path using platform-appropriate separators + +**Example:** + +```text +# On Windows +normalize-path "path/to/file" # Returns: "path\to\file" + +# On Unix +normalize-path "path\to\file" # Returns: "path/to/file" +``` + +#### `join-paths(segments: list) -> string` + +Safely joins path segments using platform separators. + +**Parameters:** + +- `segments`: List of path segments + +**Returns:** + +- Joined path string + +## Configuration Validation API + +### Path Validation + +#### `validate-paths(config: record) -> record` + +Validates all paths in configuration. 
+ +**Parameters:** + +- `config`: Configuration record + +**Returns:** + +```text +{ + valid: true, + errors: [], + warnings: [ + { path: "paths.extensions", message: "Path does not exist" } + ], + checks_performed: 15 +} +``` + +#### `validate-extension-structure(type: string, path: string) -> record` + +Validates extension directory structure. + +**Parameters:** + +- `type`: Extension type (provider, taskserv, cluster) +- `path`: Extension base path + +**Returns:** + +```text +{ + valid: true, + required_files: [ + { file: "manifest.toml", exists: true }, + { file: "schemas/main.ncl", exists: true }, + { file: "nulib/mod.nu", exists: true } + ], + optional_files: [ + { file: "templates/server.j2", exists: false } + ] +} +``` + +## Command-Line Interface + +### Path Resolution Commands + +The path resolution API is exposed via Nushell commands: + +```text +# Show current path configuration +provisioning show paths + +# Discover available extensions +provisioning discover providers +provisioning discover taskservs +provisioning discover clusters + +# Validate path configuration +provisioning validate paths + +# Switch environments +provisioning env switch prod + +# Set workspace +provisioning workspace set /path/to/infra +``` + +## Integration Examples + +### Python Integration + +```text +import subprocess +import json + +class PathResolver: + def __init__(self, provisioning_path="/usr/local/bin/provisioning"): + self.cmd = provisioning_path + + def get_paths(self): + result = subprocess.run([ + "nu", "-c", f"use {self.cmd} *; show-config --section=paths --format=json" + ], capture_output=True, text=True) + return json.loads(result.stdout) + + def discover_providers(self): + result = subprocess.run([ + "nu", "-c", f"use {self.cmd} *; discover providers --format=json" + ], capture_output=True, text=True) + return json.loads(result.stdout) + +# Usage +resolver = PathResolver() +paths = resolver.get_paths() +providers = resolver.discover_providers() +``` + +### JavaScript/Node.js Integration + +```text +const { exec } = require('child_process'); +const util = require('util'); +const execAsync = util.promisify(exec); + +class PathResolver { + constructor(provisioningPath = '/usr/local/bin/provisioning') { + this.cmd = provisioningPath; + } + + async getPaths() { + const { stdout } = await execAsync( + `nu -c "use ${this.cmd} *; show-config --section=paths --format=json"` + ); + return JSON.parse(stdout); + } + + async discoverExtensions(type) { + const { stdout } = await execAsync( + `nu -c "use ${this.cmd} *; discover ${type} --format=json"` + ); + return JSON.parse(stdout); + } +} + +// Usage +const resolver = new PathResolver(); +const paths = await resolver.getPaths(); +const providers = await resolver.discoverExtensions('providers'); +``` + +## Error Handling + +### Common Error Scenarios + +1. **Configuration File Not Found** + + ```nushell + Error: Configuration file not found in search paths + Searched: ["/usr/local/provisioning/config.defaults.toml", ...] + ``` + +1. **Extension Not Found** + + ```nushell + Error: Provider 'missing-provider' not found + Available providers: ["upcloud", "aws", "local"] + ``` + +2. **Invalid Path Template** + + ```nushell + Error: Invalid template variable: {{invalid.var}} + Valid variables: ["paths.*", "env.*", "now.*", "git.*"] + ``` + +3. 
**Environment Not Found** + + ```nushell + Error: Environment 'staging' not configured + Available environments: ["dev", "test", "prod"] + ``` + +### Error Recovery + +The system provides graceful fallbacks: + +- Missing configuration files use system defaults +- Invalid paths fall back to safe defaults +- Extension discovery continues if some paths are inaccessible +- Environment detection falls back to 'local' if detection fails + +## Performance Considerations + +### Best Practices + +1. **Use Path Caching**: Enable caching for frequently accessed paths +2. **Batch Discovery**: Discover all extensions at once rather than individually +3. **Lazy Loading**: Load extension configurations only when needed +4. **Environment Detection**: Cache environment detection results + +### Monitoring + +Monitor path resolution performance: + +```text +# Get resolution statistics +provisioning debug path-stats + +# Monitor cache performance +provisioning debug cache-stats + +# Profile path resolution +provisioning debug profile-paths +``` + +## Security Considerations + +### Path Traversal Protection + +The system includes protections against path traversal attacks: + +- All paths are normalized and validated +- Relative paths are resolved within safe boundaries +- Symlinks are validated before following + +### Access Control + +Path resolution respects file system permissions: + +- Configuration files require read access +- Extension directories require read/execute access +- Workspace directories may require write access for operations + +This path resolution API provides a comprehensive and flexible system for managing the complex path requirements of multi-provider, multi-environment +infrastructure provisioning. \ No newline at end of file diff --git a/docs/src/api-reference/provider-api.md b/docs/src/api-reference/provider-api.md index 2f89b16..26c23a6 100644 --- a/docs/src/api-reference/provider-api.md +++ b/docs/src/api-reference/provider-api.md @@ -1 +1,186 @@ -# Provider API Reference\n\nAPI documentation for creating and using infrastructure providers.\n\n## Overview\n\nProviders handle cloud-specific operations and resource provisioning. The provisioning platform supports multiple cloud providers through a unified API.\n\n## Supported Providers\n\n- **UpCloud** - European cloud provider\n- **AWS** - Amazon Web Services\n- **Local** - Local development environment\n\n## Provider Interface\n\nAll providers must implement the following interface:\n\n### Required Functions\n\n```\n# Provider initialization\nexport def init [] -> record { ... }\n\n# Server operations\nexport def create-servers [plan: record] -> list { ... }\nexport def delete-servers [ids: list] -> bool { ... }\nexport def list-servers [] -> table { ... }\n\n# Resource information\nexport def get-server-plans [] -> table { ... }\nexport def get-regions [] -> list { ... }\nexport def get-pricing [plan: string] -> record { ... }\n```\n\n### Provider Configuration\n\nEach provider requires configuration in Nickel format:\n\n```\n# Example: UpCloud provider configuration\n{\n provider = {\n name = "upcloud",\n type = "cloud",\n enabled = true,\n config = {\n username = "{{env.UPCLOUD_USERNAME}}",\n password = "{{env.UPCLOUD_PASSWORD}}",\n default_zone = "de-fra1",\n },\n }\n}\n```\n\n## Creating a Custom Provider\n\n### 1. 
Directory Structure\n\n```\nprovisioning/extensions/providers/my-provider/\n├── nulib/\n│ └── my_provider.nu # Provider implementation\n├── schemas/\n│ ├── main.ncl # Nickel schema\n│ └── defaults.ncl # Default configuration\n└── README.md # Provider documentation\n```\n\n### 2. Implementation Template\n\n```\n# my_provider.nu\nexport def init [] {\n {\n name: "my-provider"\n type: "cloud"\n ready: true\n }\n}\n\nexport def create-servers [plan: record] {\n # Implementation here\n []\n}\n\nexport def list-servers [] {\n # Implementation here\n []\n}\n\n# ... other required functions\n```\n\n### 3. Nickel Schema\n\n```\n# main.ncl\n{\n MyProvider = {\n # My custom provider schema\n name | String = "my-provider",\n type | String | "cloud" | "local" = "cloud",\n config | MyProviderConfig,\n },\n\n MyProviderConfig = {\n api_key | String,\n region | String = "us-east-1",\n },\n}\n```\n\n## Provider Discovery\n\nProviders are automatically discovered from:\n\n- `provisioning/extensions/providers/*/nu/*.nu`\n- User workspace: `workspace/extensions/providers/*/nu/*.nu`\n\n```\n# Discover available providers\nprovisioning module discover providers\n\n# Load provider\nprovisioning module load providers workspace my-provider\n```\n\n## Provider API Examples\n\n### Create Servers\n\n```\nuse my_provider.nu *\n\nlet plan = {\n count: 3\n size: "medium"\n zone: "us-east-1"\n}\n\ncreate-servers $plan\n```\n\n### List Servers\n\n```\nlist-servers | where status == "running" | select hostname ip_address\n```\n\n### Get Pricing\n\n```\nget-pricing "small" | to yaml\n```\n\n## Testing Providers\n\nUse the test environment system to test providers:\n\n```\n# Test provider without real resources\nprovisioning test env single my-provider --check\n```\n\n## Provider Development Guide\n\nFor complete provider development guide, see:\n\n- **[Provider Development](../development/QUICK_PROVIDER_GUIDE.md)** - Quick start guide\n- **[Extension Development](../development/extensions.md)** - Complete extension guide\n- **[Integration Examples](integration-examples.md)** - Example implementations\n\n## API Stability\n\nProvider API follows semantic versioning:\n\n- **Major**: Breaking changes\n- **Minor**: New features, backward compatible\n- **Patch**: Bug fixes\n\nCurrent API version: `2.0.0`\n\n---\n\nFor more examples, see [Integration Examples](integration-examples.md). +# Provider API Reference + +API documentation for creating and using infrastructure providers. + +## Overview + +Providers handle cloud-specific operations and resource provisioning. The provisioning platform supports multiple cloud providers through a unified API. + +## Supported Providers + +- **UpCloud** - European cloud provider +- **AWS** - Amazon Web Services +- **Local** - Local development environment + +## Provider Interface + +All providers must implement the following interface: + +### Required Functions + +```text +# Provider initialization +export def init [] -> record { ... } + +# Server operations +export def create-servers [plan: record] -> list { ... } +export def delete-servers [ids: list] -> bool { ... } +export def list-servers [] -> table { ... } + +# Resource information +export def get-server-plans [] -> table { ... } +export def get-regions [] -> list { ... } +export def get-pricing [plan: string] -> record { ... 
}
+```
+
+### Provider Configuration
+
+Each provider requires configuration in Nickel format:
+
+```text
+# Example: UpCloud provider configuration
+{
+  provider = {
+    name = "upcloud",
+    type = "cloud",
+    enabled = true,
+    config = {
+      username = "{{env.UPCLOUD_USERNAME}}",
+      password = "{{env.UPCLOUD_PASSWORD}}",
+      default_zone = "de-fra1",
+    },
+  }
+}
+```
+
+## Creating a Custom Provider
+
+### 1. Directory Structure
+
+```text
+provisioning/extensions/providers/my-provider/
+├── nulib/
+│   └── my_provider.nu       # Provider implementation
+├── schemas/
+│   ├── main.ncl             # Nickel schema
+│   └── defaults.ncl         # Default configuration
+└── README.md                # Provider documentation
+```
+
+### 2. Implementation Template
+
+```text
+# my_provider.nu
+export def init [] {
+  {
+    name: "my-provider"
+    type: "cloud"
+    ready: true
+  }
+}
+
+export def create-servers [plan: record] {
+  # Implementation here
+  []
+}
+
+export def list-servers [] {
+  # Implementation here
+  []
+}
+
+# ... other required functions
+```
+
+### 3. Nickel Schema
+
+```text
+# main.ncl
+{
+  MyProvider = {
+    # My custom provider schema
+    name | String = "my-provider",
+    type | String | "cloud" | "local" = "cloud",
+    config | MyProviderConfig,
+  },
+
+  MyProviderConfig = {
+    api_key | String,
+    region | String = "us-east-1",
+  },
+}
+```
+
+## Provider Discovery
+
+Providers are automatically discovered from:
+
+- `provisioning/extensions/providers/*/nulib/*.nu`
+- User workspace: `workspace/extensions/providers/*/nulib/*.nu`
+
+```text
+# Discover available providers
+provisioning module discover providers
+
+# Load provider
+provisioning module load providers workspace my-provider
+```
+
+## Provider API Examples
+
+### Create Servers
+
+```text
+use my_provider.nu *
+
+let plan = {
+  count: 3
+  size: "medium"
+  zone: "us-east-1"
+}
+
+create-servers $plan
+```
+
+### List Servers
+
+```text
+list-servers | where status == "running" | select hostname ip_address
+```
+
+### Get Pricing
+
+```text
+get-pricing "small" | to yaml
+```
+
+## Testing Providers
+
+Use the test environment system to test providers:
+
+```text
+# Test provider without real resources
+provisioning test env single my-provider --check
+```
+
+## Provider Development Guide
+
+For the complete provider development guide, see:
+
+- **[Provider Development](../development/providers/quick-provider-guide.md)** - Quick start guide
+- **[Extension Development](../development/extensions.md)** - Complete extension guide
+- **[Integration Examples](integration-examples.md)** - Example implementations
+
+## API Stability
+
+The Provider API follows semantic versioning:
+
+- **Major**: Breaking changes
+- **Minor**: New features, backward compatible
+- **Patch**: Bug fixes
+
+Current API version: `2.0.0`
+
+---
+
+For more examples, see [Integration Examples](integration-examples.md).
\ No newline at end of file diff --git a/docs/src/api-reference/rest-api.md b/docs/src/api-reference/rest-api.md index 6542e8e..30bf9e4 100644 --- a/docs/src/api-reference/rest-api.md +++ b/docs/src/api-reference/rest-api.md @@ -1 +1,1118 @@ -# REST API Reference\n\nThis document provides comprehensive documentation for all REST API endpoints in provisioning.\n\n## Overview\n\nProvisioning exposes two main REST APIs:\n\n- **Orchestrator API** (Port 8080): Core workflow management and batch operations\n- **Control Center API** (Port 9080): Authentication, authorization, and policy management\n\n## Base URLs\n\n- **Orchestrator**: `http://localhost:9090`\n- **Control Center**: `http://localhost:9080`\n\n## Authentication\n\n### JWT Authentication\n\nAll API endpoints (except health checks) require JWT authentication via the Authorization header:\n\n```\nAuthorization: Bearer \n```\n\n### Getting Access Token\n\n```\nPOST /auth/login\nContent-Type: application/json\n\n{\n "username": "admin",\n "password": "password",\n "mfa_code": "123456"\n}\n```\n\n## Orchestrator API Endpoints\n\n### Health Check\n\n#### GET /health\n\nCheck orchestrator health status.\n\n**Response:**\n\n```\n{\n "success": true,\n "data": "Orchestrator is healthy"\n}\n```\n\n### Task Management\n\n#### GET /tasks\n\nList all workflow tasks.\n\n**Query Parameters:**\n\n- `status` (optional): Filter by task status (Pending, Running, Completed, Failed, Cancelled)\n- `limit` (optional): Maximum number of results\n- `offset` (optional): Pagination offset\n\n**Response:**\n\n```\n{\n "success": true,\n "data": [\n {\n "id": "uuid-string",\n "name": "create_servers",\n "command": "/usr/local/provisioning servers create",\n "args": ["--infra", "production", "--wait"],\n "dependencies": [],\n "status": "Completed",\n "created_at": "2025-09-26T10:00:00Z",\n "started_at": "2025-09-26T10:00:05Z",\n "completed_at": "2025-09-26T10:05:30Z",\n "output": "Successfully created 3 servers",\n "error": null\n }\n ]\n}\n```\n\n#### GET /tasks/{id}\n\nGet specific task status and details.\n\n**Path Parameters:**\n\n- `id`: Task UUID\n\n**Response:**\n\n```\n{\n "success": true,\n "data": {\n "id": "uuid-string",\n "name": "create_servers",\n "command": "/usr/local/provisioning servers create",\n "args": ["--infra", "production", "--wait"],\n "dependencies": [],\n "status": "Running",\n "created_at": "2025-09-26T10:00:00Z",\n "started_at": "2025-09-26T10:00:05Z",\n "completed_at": null,\n "output": null,\n "error": null\n }\n}\n```\n\n### Workflow Submission\n\n#### POST /workflows/servers/create\n\nSubmit server creation workflow.\n\n**Request Body:**\n\n```\n{\n "infra": "production",\n "settings": "config.ncl",\n "check_mode": false,\n "wait": true\n}\n```\n\n**Response:**\n\n```\n{\n "success": true,\n "data": "uuid-task-id"\n}\n```\n\n#### POST /workflows/taskserv/create\n\nSubmit task service workflow.\n\n**Request Body:**\n\n```\n{\n "operation": "create",\n "taskserv": "kubernetes",\n "infra": "production",\n "settings": "config.ncl",\n "check_mode": false,\n "wait": true\n}\n```\n\n**Response:**\n\n```\n{\n "success": true,\n "data": "uuid-task-id"\n}\n```\n\n#### POST /workflows/cluster/create\n\nSubmit cluster workflow.\n\n**Request Body:**\n\n```\n{\n "operation": "create",\n "cluster_type": "buildkit",\n "infra": "production",\n "settings": "config.ncl",\n "check_mode": false,\n "wait": true\n}\n```\n\n**Response:**\n\n```\n{\n "success": true,\n "data": "uuid-task-id"\n}\n```\n\n### Batch Operations\n\n#### POST 
/batch/execute\n\nExecute batch workflow operation.\n\n**Request Body:**\n\n```\n{\n "name": "multi_cloud_deployment",\n "version": "1.0.0",\n "storage_backend": "surrealdb",\n "parallel_limit": 5,\n "rollback_enabled": true,\n "operations": [\n {\n "id": "upcloud_servers",\n "type": "server_batch",\n "provider": "upcloud",\n "dependencies": [],\n "server_configs": [\n {"name": "web-01", "plan": "1xCPU-2 GB", "zone": "de-fra1"},\n {"name": "web-02", "plan": "1xCPU-2 GB", "zone": "us-nyc1"}\n ]\n },\n {\n "id": "aws_taskservs",\n "type": "taskserv_batch",\n "provider": "aws",\n "dependencies": ["upcloud_servers"],\n "taskservs": ["kubernetes", "cilium", "containerd"]\n }\n ]\n}\n```\n\n**Response:**\n\n```\n{\n "success": true,\n "data": {\n "batch_id": "uuid-string",\n "status": "Running",\n "operations": [\n {\n "id": "upcloud_servers",\n "status": "Pending",\n "progress": 0.0\n },\n {\n "id": "aws_taskservs",\n "status": "Pending",\n "progress": 0.0\n }\n ]\n }\n}\n```\n\n#### GET /batch/operations\n\nList all batch operations.\n\n**Response:**\n\n```\n{\n "success": true,\n "data": [\n {\n "batch_id": "uuid-string",\n "name": "multi_cloud_deployment",\n "status": "Running",\n "created_at": "2025-09-26T10:00:00Z",\n "operations": [...]\n }\n ]\n}\n```\n\n#### GET /batch/operations/{id}\n\nGet batch operation status.\n\n**Path Parameters:**\n\n- `id`: Batch operation ID\n\n**Response:**\n\n```\n{\n "success": true,\n "data": {\n "batch_id": "uuid-string",\n "name": "multi_cloud_deployment",\n "status": "Running",\n "operations": [\n {\n "id": "upcloud_servers",\n "status": "Completed",\n "progress": 100.0,\n "results": {...}\n }\n ]\n }\n}\n```\n\n#### POST /batch/operations/{id}/cancel\n\nCancel running batch operation.\n\n**Path Parameters:**\n\n- `id`: Batch operation ID\n\n**Response:**\n\n```\n{\n "success": true,\n "data": "Operation cancelled"\n}\n```\n\n### State Management\n\n#### GET /state/workflows/{id}/progress\n\nGet real-time workflow progress.\n\n**Path Parameters:**\n\n- `id`: Workflow ID\n\n**Response:**\n\n```\n{\n "success": true,\n "data": {\n "workflow_id": "uuid-string",\n "progress": 75.5,\n "current_step": "Installing Kubernetes",\n "total_steps": 8,\n "completed_steps": 6,\n "estimated_time_remaining": 180\n }\n}\n```\n\n#### GET /state/workflows/{id}/snapshots\n\nGet workflow state snapshots.\n\n**Path Parameters:**\n\n- `id`: Workflow ID\n\n**Response:**\n\n```\n{\n "success": true,\n "data": [\n {\n "snapshot_id": "uuid-string",\n "timestamp": "2025-09-26T10:00:00Z",\n "state": "running",\n "details": {...}\n }\n ]\n}\n```\n\n#### GET /state/system/metrics\n\nGet system-wide metrics.\n\n**Response:**\n\n```\n{\n "success": true,\n "data": {\n "total_workflows": 150,\n "active_workflows": 5,\n "completed_workflows": 140,\n "failed_workflows": 5,\n "system_load": {\n "cpu_usage": 45.2,\n "memory_usage": 2048,\n "disk_usage": 75.5\n }\n }\n}\n```\n\n#### GET /state/system/health\n\nGet system health status.\n\n**Response:**\n\n```\n{\n "success": true,\n "data": {\n "overall_status": "Healthy",\n "components": {\n "storage": "Healthy",\n "batch_coordinator": "Healthy",\n "monitoring": "Healthy"\n },\n "last_check": "2025-09-26T10:00:00Z"\n }\n}\n```\n\n#### GET /state/statistics\n\nGet state manager statistics.\n\n**Response:**\n\n```\n{\n "success": true,\n "data": {\n "total_workflows": 150,\n "active_snapshots": 25,\n "storage_usage": "245 MB",\n "average_workflow_duration": 300\n }\n}\n```\n\n### Rollback and Recovery\n\n#### POST 
/rollback/checkpoints\n\nCreate new checkpoint.\n\n**Request Body:**\n\n```\n{\n "name": "before_major_update",\n "description": "Checkpoint before deploying v2.0.0"\n}\n```\n\n**Response:**\n\n```\n{\n "success": true,\n "data": "checkpoint-uuid"\n}\n```\n\n#### GET /rollback/checkpoints\n\nList all checkpoints.\n\n**Response:**\n\n```\n{\n "success": true,\n "data": [\n {\n "id": "checkpoint-uuid",\n "name": "before_major_update",\n "description": "Checkpoint before deploying v2.0.0",\n "created_at": "2025-09-26T10:00:00Z",\n "size": "150 MB"\n }\n ]\n}\n```\n\n#### GET /rollback/checkpoints/{id}\n\nGet specific checkpoint details.\n\n**Path Parameters:**\n\n- `id`: Checkpoint ID\n\n**Response:**\n\n```\n{\n "success": true,\n "data": {\n "id": "checkpoint-uuid",\n "name": "before_major_update",\n "description": "Checkpoint before deploying v2.0.0",\n "created_at": "2025-09-26T10:00:00Z",\n "size": "150 MB",\n "operations_count": 25\n }\n}\n```\n\n#### POST /rollback/execute\n\nExecute rollback operation.\n\n**Request Body:**\n\n```\n{\n "checkpoint_id": "checkpoint-uuid"\n}\n```\n\nOr for partial rollback:\n\n```\n{\n "operation_ids": ["op-1", "op-2", "op-3"]\n}\n```\n\n**Response:**\n\n```\n{\n "success": true,\n "data": {\n "rollback_id": "rollback-uuid",\n "success": true,\n "operations_executed": 25,\n "operations_failed": 0,\n "duration": 45.5\n }\n}\n```\n\n#### POST /rollback/restore/{id}\n\nRestore system state from checkpoint.\n\n**Path Parameters:**\n\n- `id`: Checkpoint ID\n\n**Response:**\n\n```\n{\n "success": true,\n "data": "State restored from checkpoint checkpoint-uuid"\n}\n```\n\n#### GET /rollback/statistics\n\nGet rollback system statistics.\n\n**Response:**\n\n```\n{\n "success": true,\n "data": {\n "total_checkpoints": 10,\n "total_rollbacks": 3,\n "success_rate": 100.0,\n "average_rollback_time": 30.5\n }\n}\n```\n\n## Control Center API Endpoints\n\n### Authentication\n\n#### POST /auth/login\n\nAuthenticate user and get JWT token.\n\n**Request Body:**\n\n```\n{\n "username": "admin",\n "password": "secure_password",\n "mfa_code": "123456"\n}\n```\n\n**Response:**\n\n```\n{\n "success": true,\n "data": {\n "token": "jwt-token-string",\n "expires_at": "2025-09-26T18:00:00Z",\n "user": {\n "id": "user-uuid",\n "username": "admin",\n "email": "admin@example.com",\n "roles": ["admin", "operator"]\n }\n }\n}\n```\n\n#### POST /auth/refresh\n\nRefresh JWT token.\n\n**Request Body:**\n\n```\n{\n "token": "current-jwt-token"\n}\n```\n\n**Response:**\n\n```\n{\n "success": true,\n "data": {\n "token": "new-jwt-token",\n "expires_at": "2025-09-26T18:00:00Z"\n }\n}\n```\n\n#### POST /auth/logout\n\nLogout and invalidate token.\n\n**Response:**\n\n```\n{\n "success": true,\n "data": "Successfully logged out"\n}\n```\n\n### User Management\n\n#### GET /users\n\nList all users.\n\n**Query Parameters:**\n\n- `role` (optional): Filter by role\n- `enabled` (optional): Filter by enabled status\n\n**Response:**\n\n```\n{\n "success": true,\n "data": [\n {\n "id": "user-uuid",\n "username": "admin",\n "email": "admin@example.com",\n "roles": ["admin"],\n "enabled": true,\n "created_at": "2025-09-26T10:00:00Z",\n "last_login": "2025-09-26T12:00:00Z"\n }\n ]\n}\n```\n\n#### POST /users\n\nCreate new user.\n\n**Request Body:**\n\n```\n{\n "username": "newuser",\n "email": "newuser@example.com",\n "password": "secure_password",\n "roles": ["operator"],\n "enabled": true\n}\n```\n\n**Response:**\n\n```\n{\n "success": true,\n "data": {\n "id": "new-user-uuid",\n "username": "newuser",\n 
"email": "newuser@example.com",\n "roles": ["operator"],\n "enabled": true\n }\n}\n```\n\n#### PUT /users/{id}\n\nUpdate existing user.\n\n**Path Parameters:**\n\n- `id`: User ID\n\n**Request Body:**\n\n```\n{\n "email": "updated@example.com",\n "roles": ["admin", "operator"],\n "enabled": false\n}\n```\n\n**Response:**\n\n```\n{\n "success": true,\n "data": "User updated successfully"\n}\n```\n\n#### DELETE /users/{id}\n\nDelete user.\n\n**Path Parameters:**\n\n- `id`: User ID\n\n**Response:**\n\n```\n{\n "success": true,\n "data": "User deleted successfully"\n}\n```\n\n### Policy Management\n\n#### GET /policies\n\nList all policies.\n\n**Response:**\n\n```\n{\n "success": true,\n "data": [\n {\n "id": "policy-uuid",\n "name": "admin_access_policy",\n "version": "1.0.0",\n "rules": [...],\n "created_at": "2025-09-26T10:00:00Z",\n "enabled": true\n }\n ]\n}\n```\n\n#### POST /policies\n\nCreate new policy.\n\n**Request Body:**\n\n```\n{\n "name": "new_policy",\n "version": "1.0.0",\n "rules": [\n {\n "effect": "Allow",\n "resource": "servers:*",\n "action": ["create", "read"],\n "condition": "user.role == 'admin'"\n }\n ]\n}\n```\n\n**Response:**\n\n```\n{\n "success": true,\n "data": {\n "id": "new-policy-uuid",\n "name": "new_policy",\n "version": "1.0.0"\n }\n}\n```\n\n#### PUT /policies/{id}\n\nUpdate policy.\n\n**Path Parameters:**\n\n- `id`: Policy ID\n\n**Request Body:**\n\n```\n{\n "name": "updated_policy",\n "rules": [...]\n}\n```\n\n**Response:**\n\n```\n{\n "success": true,\n "data": "Policy updated successfully"\n}\n```\n\n### Audit Logging\n\n#### GET /audit/logs\n\nGet audit logs.\n\n**Query Parameters:**\n\n- `user_id` (optional): Filter by user\n- `action` (optional): Filter by action\n- `resource` (optional): Filter by resource\n- `from` (optional): Start date (ISO 8601)\n- `to` (optional): End date (ISO 8601)\n- `limit` (optional): Maximum results\n- `offset` (optional): Pagination offset\n\n**Response:**\n\n```\n{\n "success": true,\n "data": [\n {\n "id": "audit-log-uuid",\n "timestamp": "2025-09-26T10:00:00Z",\n "user_id": "user-uuid",\n "action": "server.create",\n "resource": "servers/web-01",\n "result": "success",\n "details": {...}\n }\n ]\n}\n```\n\n## Error Responses\n\nAll endpoints may return error responses in this format:\n\n```\n{\n "success": false,\n "error": "Detailed error message"\n}\n```\n\n### HTTP Status Codes\n\n- `200 OK`: Successful request\n- `201 Created`: Resource created successfully\n- `400 Bad Request`: Invalid request parameters\n- `401 Unauthorized`: Authentication required or invalid\n- `403 Forbidden`: Permission denied\n- `404 Not Found`: Resource not found\n- `422 Unprocessable Entity`: Validation error\n- `500 Internal Server Error`: Server error\n\n## Rate Limiting\n\nAPI endpoints are rate-limited:\n\n- Authentication: 5 requests per minute per IP\n- General APIs: 100 requests per minute per user\n- Batch operations: 10 requests per minute per user\n\nRate limit headers are included in responses:\n\n```\nX-RateLimit-Limit: 100\nX-RateLimit-Remaining: 95\nX-RateLimit-Reset: 1632150000\n```\n\n## Monitoring Endpoints\n\n### GET /metrics\n\nPrometheus-compatible metrics endpoint.\n\n**Response:**\n\n```\n# HELP orchestrator_tasks_total Total number of tasks\n# TYPE orchestrator_tasks_total counter\norchestrator_tasks_total{status="completed"} 150\norchestrator_tasks_total{status="failed"} 5\n\n# HELP orchestrator_task_duration_seconds Task execution duration\n# TYPE orchestrator_task_duration_seconds 
histogram\norchestrator_task_duration_seconds_bucket{le="10"} 50\norchestrator_task_duration_seconds_bucket{le="30"} 120\norchestrator_task_duration_seconds_bucket{le="+Inf"} 155\n```\n\n### WebSocket /ws\n\nReal-time event streaming via WebSocket connection.\n\n**Connection:**\n\n```\nconst ws = new WebSocket('ws://localhost:9090/ws?token=jwt-token');\n\nws.onmessage = function(event) {\n const data = JSON.parse(event.data);\n console.log('Event:', data);\n};\n```\n\n**Event Format:**\n\n```\n{\n "event_type": "TaskStatusChanged",\n "timestamp": "2025-09-26T10:00:00Z",\n "data": {\n "task_id": "uuid-string",\n "status": "completed"\n },\n "metadata": {\n "task_id": "uuid-string",\n "status": "completed"\n }\n}\n```\n\n## SDK Examples\n\n### Python SDK Example\n\n```\nimport requests\n\nclass ProvisioningClient:\n def __init__(self, base_url, token):\n self.base_url = base_url\n self.headers = {\n 'Authorization': f'Bearer {token}',\n 'Content-Type': 'application/json'\n }\n\n def create_server_workflow(self, infra, settings, check_mode=False):\n payload = {\n 'infra': infra,\n 'settings': settings,\n 'check_mode': check_mode,\n 'wait': True\n }\n response = requests.post(\n f'{self.base_url}/workflows/servers/create',\n json=payload,\n headers=self.headers\n )\n return response.json()\n\n def get_task_status(self, task_id):\n response = requests.get(\n f'{self.base_url}/tasks/{task_id}',\n headers=self.headers\n )\n return response.json()\n\n# Usage\nclient = ProvisioningClient('http://localhost:9090', 'your-jwt-token')\nresult = client.create_server_workflow('production', 'config.ncl')\nprint(f"Task ID: {result['data']}")\n```\n\n### JavaScript/Node.js SDK Example\n\n```\nconst axios = require('axios');\n\nclass ProvisioningClient {\n constructor(baseUrl, token) {\n this.client = axios.create({\n baseURL: baseUrl,\n headers: {\n 'Authorization': `Bearer ${token}`,\n 'Content-Type': 'application/json'\n }\n });\n }\n\n async createServerWorkflow(infra, settings, checkMode = false) {\n const response = await this.client.post('/workflows/servers/create', {\n infra,\n settings,\n check_mode: checkMode,\n wait: true\n });\n return response.data;\n }\n\n async getTaskStatus(taskId) {\n const response = await this.client.get(`/tasks/${taskId}`);\n return response.data;\n }\n}\n\n// Usage\nconst client = new ProvisioningClient('http://localhost:9090', 'your-jwt-token');\nconst result = await client.createServerWorkflow('production', 'config.ncl');\nconsole.log(`Task ID: ${result.data}`);\n```\n\n## Webhook Integration\n\nThe system supports webhooks for external integrations:\n\n### Webhook Configuration\n\nConfigure webhooks in the system configuration:\n\n```\n[webhooks]\nenabled = true\nendpoints = [\n {\n url = "https://your-system.com/webhook"\n events = ["task.completed", "task.failed", "batch.completed"]\n secret = "webhook-secret"\n }\n]\n```\n\n### Webhook Payload\n\n```\n{\n "event": "task.completed",\n "timestamp": "2025-09-26T10:00:00Z",\n "data": {\n "task_id": "uuid-string",\n "status": "completed",\n "output": "Task completed successfully"\n },\n "signature": "sha256=calculated-signature"\n}\n```\n\n## Pagination\n\nFor endpoints that return lists, use pagination parameters:\n\n- `limit`: Maximum number of items per page (default: 50, max: 1000)\n- `offset`: Number of items to skip\n\nPagination metadata is included in response headers:\n\n```\nX-Total-Count: 1500\nX-Limit: 50\nX-Offset: 100\nLink: ; rel="next"\n```\n\n## API Versioning\n\nThe API uses header-based 
versioning:\n\n```\nAccept: application/vnd.provisioning.v1+json\n```\n\nCurrent version: v1\n\n## Testing\n\nUse the included test suite to validate API functionality:\n\n```\n# Run API integration tests\ncd src/orchestrator\ncargo test --test api_tests\n\n# Run load tests\ncargo test --test load_tests --release\n```
+# REST API Reference
+
+This document provides comprehensive documentation for all REST API endpoints in provisioning.
+
+## Overview
+
+Provisioning exposes two main REST APIs:
+
+- **Orchestrator API** (Port 9090): Core workflow management and batch operations
+- **Control Center API** (Port 9080): Authentication, authorization, and policy management
+
+## Base URLs
+
+- **Orchestrator**: `http://localhost:9090`
+- **Control Center**: `http://localhost:9080`
+
+## Authentication
+
+### JWT Authentication
+
+All API endpoints (except health checks) require JWT authentication via the Authorization header:
+
+```text
+Authorization: Bearer <token>
+```
+
+### Getting Access Token
+
+```text
+POST /auth/login
+Content-Type: application/json
+
+{
+  "username": "admin",
+  "password": "password",
+  "mfa_code": "123456"
+}
+```
+
+## Orchestrator API Endpoints
+
+### Health Check
+
+#### GET /health
+
+Check orchestrator health status.
+
+**Response:**
+
+```text
+{
+  "success": true,
+  "data": "Orchestrator is healthy"
+}
+```
+
+### Task Management
+
+#### GET /tasks
+
+List all workflow tasks.
+
+**Query Parameters:**
+
+- `status` (optional): Filter by task status (Pending, Running, Completed, Failed, Cancelled)
+- `limit` (optional): Maximum number of results
+- `offset` (optional): Pagination offset
+
+**Response:**
+
+```text
+{
+  "success": true,
+  "data": [
+    {
+      "id": "uuid-string",
+      "name": "create_servers",
+      "command": "/usr/local/provisioning servers create",
+      "args": ["--infra", "production", "--wait"],
+      "dependencies": [],
+      "status": "Completed",
+      "created_at": "2025-09-26T10:00:00Z",
+      "started_at": "2025-09-26T10:00:05Z",
+      "completed_at": "2025-09-26T10:05:30Z",
+      "output": "Successfully created 3 servers",
+      "error": null
+    }
+  ]
+}
+```
+
+#### GET /tasks/{id}
+
+Get specific task status and details.
+
+**Path Parameters:**
+
+- `id`: Task UUID
+
+**Response:**
+
+```text
+{
+  "success": true,
+  "data": {
+    "id": "uuid-string",
+    "name": "create_servers",
+    "command": "/usr/local/provisioning servers create",
+    "args": ["--infra", "production", "--wait"],
+    "dependencies": [],
+    "status": "Running",
+    "created_at": "2025-09-26T10:00:00Z",
+    "started_at": "2025-09-26T10:00:05Z",
+    "completed_at": null,
+    "output": null,
+    "error": null
+  }
+}
+```
+
+### Workflow Submission
+
+#### POST /workflows/servers/create
+
+Submit server creation workflow.
+
+**Request Body:**
+
+```text
+{
+  "infra": "production",
+  "settings": "config.ncl",
+  "check_mode": false,
+  "wait": true
+}
+```
+
+**Response:**
+
+```text
+{
+  "success": true,
+  "data": "uuid-task-id"
+}
+```
+
+#### POST /workflows/taskserv/create
+
+Submit task service workflow.
+
+**Request Body:**
+
+```text
+{
+  "operation": "create",
+  "taskserv": "kubernetes",
+  "infra": "production",
+  "settings": "config.ncl",
+  "check_mode": false,
+  "wait": true
+}
+```
+
+**Response:**
+
+```text
+{
+  "success": true,
+  "data": "uuid-task-id"
+}
+```
+
+#### POST /workflows/cluster/create
+
+Submit cluster workflow.
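+
+The request and response formats are documented below. As a quick illustration, a client might submit this workflow over plain HTTP and poll the resulting task (a minimal sketch using Python `requests`; the base URL, token, and polling interval are assumptions for illustration, while the path and payload fields are taken from this section):
+
+```text
+import time
+import requests
+
+BASE = "http://localhost:9090"                 # assumed orchestrator URL
+HEADERS = {"Authorization": "Bearer <token>"}  # replace with a real JWT
+
+# Submit the cluster workflow (fields as documented below)
+resp = requests.post(f"{BASE}/workflows/cluster/create", headers=HEADERS, json={
+    "operation": "create",
+    "cluster_type": "buildkit",
+    "infra": "production",
+    "settings": "config.ncl",
+    "check_mode": False,
+    "wait": False,
+})
+resp.raise_for_status()
+task_id = resp.json()["data"]
+
+# Poll the task endpoint until the workflow reaches a terminal state
+while True:
+    task = requests.get(f"{BASE}/tasks/{task_id}", headers=HEADERS).json()["data"]
+    if task["status"] in ("Completed", "Failed", "Cancelled"):
+        break
+    time.sleep(5)
+print(task["status"], task.get("output") or task.get("error"))
+```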
+ +**Request Body:** + +```text +{ + "operation": "create", + "cluster_type": "buildkit", + "infra": "production", + "settings": "config.ncl", + "check_mode": false, + "wait": true +} +``` + +**Response:** + +```text +{ + "success": true, + "data": "uuid-task-id" +} +``` + +### Batch Operations + +#### POST /batch/execute + +Execute batch workflow operation. + +**Request Body:** + +```text +{ + "name": "multi_cloud_deployment", + "version": "1.0.0", + "storage_backend": "surrealdb", + "parallel_limit": 5, + "rollback_enabled": true, + "operations": [ + { + "id": "upcloud_servers", + "type": "server_batch", + "provider": "upcloud", + "dependencies": [], + "server_configs": [ + {"name": "web-01", "plan": "1xCPU-2 GB", "zone": "de-fra1"}, + {"name": "web-02", "plan": "1xCPU-2 GB", "zone": "us-nyc1"} + ] + }, + { + "id": "aws_taskservs", + "type": "taskserv_batch", + "provider": "aws", + "dependencies": ["upcloud_servers"], + "taskservs": ["kubernetes", "cilium", "containerd"] + } + ] +} +``` + +**Response:** + +```text +{ + "success": true, + "data": { + "batch_id": "uuid-string", + "status": "Running", + "operations": [ + { + "id": "upcloud_servers", + "status": "Pending", + "progress": 0.0 + }, + { + "id": "aws_taskservs", + "status": "Pending", + "progress": 0.0 + } + ] + } +} +``` + +#### GET /batch/operations + +List all batch operations. + +**Response:** + +```text +{ + "success": true, + "data": [ + { + "batch_id": "uuid-string", + "name": "multi_cloud_deployment", + "status": "Running", + "created_at": "2025-09-26T10:00:00Z", + "operations": [...] + } + ] +} +``` + +#### GET /batch/operations/{id} + +Get batch operation status. + +**Path Parameters:** + +- `id`: Batch operation ID + +**Response:** + +```text +{ + "success": true, + "data": { + "batch_id": "uuid-string", + "name": "multi_cloud_deployment", + "status": "Running", + "operations": [ + { + "id": "upcloud_servers", + "status": "Completed", + "progress": 100.0, + "results": {...} + } + ] + } +} +``` + +#### POST /batch/operations/{id}/cancel + +Cancel running batch operation. + +**Path Parameters:** + +- `id`: Batch operation ID + +**Response:** + +```text +{ + "success": true, + "data": "Operation cancelled" +} +``` + +### State Management + +#### GET /state/workflows/{id}/progress + +Get real-time workflow progress. + +**Path Parameters:** + +- `id`: Workflow ID + +**Response:** + +```text +{ + "success": true, + "data": { + "workflow_id": "uuid-string", + "progress": 75.5, + "current_step": "Installing Kubernetes", + "total_steps": 8, + "completed_steps": 6, + "estimated_time_remaining": 180 + } +} +``` + +#### GET /state/workflows/{id}/snapshots + +Get workflow state snapshots. + +**Path Parameters:** + +- `id`: Workflow ID + +**Response:** + +```text +{ + "success": true, + "data": [ + { + "snapshot_id": "uuid-string", + "timestamp": "2025-09-26T10:00:00Z", + "state": "running", + "details": {...} + } + ] +} +``` + +#### GET /state/system/metrics + +Get system-wide metrics. + +**Response:** + +```text +{ + "success": true, + "data": { + "total_workflows": 150, + "active_workflows": 5, + "completed_workflows": 140, + "failed_workflows": 5, + "system_load": { + "cpu_usage": 45.2, + "memory_usage": 2048, + "disk_usage": 75.5 + } + } +} +``` + +#### GET /state/system/health + +Get system health status. 
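+
+As an example of how this endpoint might be consumed, a small watchdog could poll it and report any component that is not healthy (a minimal sketch using Python `requests`; the URL and token are assumptions, the field names follow the response shown below):
+
+```text
+import requests
+
+BASE = "http://localhost:9090"                 # assumed orchestrator URL
+HEADERS = {"Authorization": "Bearer <token>"}  # replace with a real JWT
+
+health = requests.get(f"{BASE}/state/system/health", headers=HEADERS).json()["data"]
+
+if health["overall_status"] != "Healthy":
+    # Report each degraded component individually
+    for name, status in health["components"].items():
+        if status != "Healthy":
+            print(f"component {name} is {status}")
+else:
+    print(f"all components healthy as of {health['last_check']}")
+```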
+ +**Response:** + +```text +{ + "success": true, + "data": { + "overall_status": "Healthy", + "components": { + "storage": "Healthy", + "batch_coordinator": "Healthy", + "monitoring": "Healthy" + }, + "last_check": "2025-09-26T10:00:00Z" + } +} +``` + +#### GET /state/statistics + +Get state manager statistics. + +**Response:** + +```text +{ + "success": true, + "data": { + "total_workflows": 150, + "active_snapshots": 25, + "storage_usage": "245 MB", + "average_workflow_duration": 300 + } +} +``` + +### Rollback and Recovery + +#### POST /rollback/checkpoints + +Create new checkpoint. + +**Request Body:** + +```text +{ + "name": "before_major_update", + "description": "Checkpoint before deploying v2.0.0" +} +``` + +**Response:** + +```text +{ + "success": true, + "data": "checkpoint-uuid" +} +``` + +#### GET /rollback/checkpoints + +List all checkpoints. + +**Response:** + +```text +{ + "success": true, + "data": [ + { + "id": "checkpoint-uuid", + "name": "before_major_update", + "description": "Checkpoint before deploying v2.0.0", + "created_at": "2025-09-26T10:00:00Z", + "size": "150 MB" + } + ] +} +``` + +#### GET /rollback/checkpoints/{id} + +Get specific checkpoint details. + +**Path Parameters:** + +- `id`: Checkpoint ID + +**Response:** + +```text +{ + "success": true, + "data": { + "id": "checkpoint-uuid", + "name": "before_major_update", + "description": "Checkpoint before deploying v2.0.0", + "created_at": "2025-09-26T10:00:00Z", + "size": "150 MB", + "operations_count": 25 + } +} +``` + +#### POST /rollback/execute + +Execute rollback operation. + +**Request Body:** + +```text +{ + "checkpoint_id": "checkpoint-uuid" +} +``` + +Or for partial rollback: + +```text +{ + "operation_ids": ["op-1", "op-2", "op-3"] +} +``` + +**Response:** + +```text +{ + "success": true, + "data": { + "rollback_id": "rollback-uuid", + "success": true, + "operations_executed": 25, + "operations_failed": 0, + "duration": 45.5 + } +} +``` + +#### POST /rollback/restore/{id} + +Restore system state from checkpoint. + +**Path Parameters:** + +- `id`: Checkpoint ID + +**Response:** + +```text +{ + "success": true, + "data": "State restored from checkpoint checkpoint-uuid" +} +``` + +#### GET /rollback/statistics + +Get rollback system statistics. + +**Response:** + +```text +{ + "success": true, + "data": { + "total_checkpoints": 10, + "total_rollbacks": 3, + "success_rate": 100.0, + "average_rollback_time": 30.5 + } +} +``` + +## Control Center API Endpoints + +### Authentication + +#### POST /auth/login + +Authenticate user and get JWT token. + +**Request Body:** + +```text +{ + "username": "admin", + "password": "secure_password", + "mfa_code": "123456" +} +``` + +**Response:** + +```text +{ + "success": true, + "data": { + "token": "jwt-token-string", + "expires_at": "2025-09-26T18:00:00Z", + "user": { + "id": "user-uuid", + "username": "admin", + "email": "admin@example.com", + "roles": ["admin", "operator"] + } + } +} +``` + +#### POST /auth/refresh + +Refresh JWT token. + +**Request Body:** + +```text +{ + "token": "current-jwt-token" +} +``` + +**Response:** + +```text +{ + "success": true, + "data": { + "token": "new-jwt-token", + "expires_at": "2025-09-26T18:00:00Z" + } +} +``` + +#### POST /auth/logout + +Logout and invalidate token. + +**Response:** + +```text +{ + "success": true, + "data": "Successfully logged out" +} +``` + +### User Management + +#### GET /users + +List all users. 
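+
+The supported filters are listed below. For example, fetching only enabled operators might look like this (a minimal sketch using Python `requests`; the Control Center URL and token are assumptions, the parameter names are as documented):
+
+```text
+import requests
+
+BASE = "http://localhost:9080"                 # assumed Control Center URL
+HEADERS = {"Authorization": "Bearer <token>"}  # replace with a real JWT
+
+resp = requests.get(
+    f"{BASE}/users",
+    headers=HEADERS,
+    params={"role": "operator", "enabled": "true"},  # filters documented below
+)
+for user in resp.json()["data"]:
+    print(user["username"], user["roles"], user["last_login"])
+```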
+ +**Query Parameters:** + +- `role` (optional): Filter by role +- `enabled` (optional): Filter by enabled status + +**Response:** + +```text +{ + "success": true, + "data": [ + { + "id": "user-uuid", + "username": "admin", + "email": "admin@example.com", + "roles": ["admin"], + "enabled": true, + "created_at": "2025-09-26T10:00:00Z", + "last_login": "2025-09-26T12:00:00Z" + } + ] +} +``` + +#### POST /users + +Create new user. + +**Request Body:** + +```text +{ + "username": "newuser", + "email": "newuser@example.com", + "password": "secure_password", + "roles": ["operator"], + "enabled": true +} +``` + +**Response:** + +```text +{ + "success": true, + "data": { + "id": "new-user-uuid", + "username": "newuser", + "email": "newuser@example.com", + "roles": ["operator"], + "enabled": true + } +} +``` + +#### PUT /users/{id} + +Update existing user. + +**Path Parameters:** + +- `id`: User ID + +**Request Body:** + +```text +{ + "email": "updated@example.com", + "roles": ["admin", "operator"], + "enabled": false +} +``` + +**Response:** + +```text +{ + "success": true, + "data": "User updated successfully" +} +``` + +#### DELETE /users/{id} + +Delete user. + +**Path Parameters:** + +- `id`: User ID + +**Response:** + +```text +{ + "success": true, + "data": "User deleted successfully" +} +``` + +### Policy Management + +#### GET /policies + +List all policies. + +**Response:** + +```text +{ + "success": true, + "data": [ + { + "id": "policy-uuid", + "name": "admin_access_policy", + "version": "1.0.0", + "rules": [...], + "created_at": "2025-09-26T10:00:00Z", + "enabled": true + } + ] +} +``` + +#### POST /policies + +Create new policy. + +**Request Body:** + +```text +{ + "name": "new_policy", + "version": "1.0.0", + "rules": [ + { + "effect": "Allow", + "resource": "servers:*", + "action": ["create", "read"], + "condition": "user.role == 'admin'" + } + ] +} +``` + +**Response:** + +```text +{ + "success": true, + "data": { + "id": "new-policy-uuid", + "name": "new_policy", + "version": "1.0.0" + } +} +``` + +#### PUT /policies/{id} + +Update policy. + +**Path Parameters:** + +- `id`: Policy ID + +**Request Body:** + +```text +{ + "name": "updated_policy", + "rules": [...] +} +``` + +**Response:** + +```text +{ + "success": true, + "data": "Policy updated successfully" +} +``` + +### Audit Logging + +#### GET /audit/logs + +Get audit logs. 
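+
+The filters and pagination parameters are listed below. Combining them, a script could page through one day of `server.create` events (a minimal sketch using Python `requests`; the Control Center URL, token, and date range are assumptions, the parameter and field names are as documented):
+
+```text
+import requests
+
+BASE = "http://localhost:9080"                 # assumed Control Center URL
+HEADERS = {"Authorization": "Bearer <token>"}  # replace with a real JWT
+
+params = {
+    "action": "server.create",
+    "from": "2025-09-26T00:00:00Z",  # ISO 8601, as documented below
+    "to": "2025-09-27T00:00:00Z",
+    "limit": 100,
+    "offset": 0,
+}
+while True:
+    page = requests.get(f"{BASE}/audit/logs", headers=HEADERS, params=params).json()["data"]
+    for entry in page:
+        print(entry["timestamp"], entry["user_id"], entry["action"], entry["result"])
+    if len(page) < params["limit"]:
+        break  # last page reached
+    params["offset"] += params["limit"]
+```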
+ +**Query Parameters:** + +- `user_id` (optional): Filter by user +- `action` (optional): Filter by action +- `resource` (optional): Filter by resource +- `from` (optional): Start date (ISO 8601) +- `to` (optional): End date (ISO 8601) +- `limit` (optional): Maximum results +- `offset` (optional): Pagination offset + +**Response:** + +```text +{ + "success": true, + "data": [ + { + "id": "audit-log-uuid", + "timestamp": "2025-09-26T10:00:00Z", + "user_id": "user-uuid", + "action": "server.create", + "resource": "servers/web-01", + "result": "success", + "details": {...} + } + ] +} +``` + +## Error Responses + +All endpoints may return error responses in this format: + +```text +{ + "success": false, + "error": "Detailed error message" +} +``` + +### HTTP Status Codes + +- `200 OK`: Successful request +- `201 Created`: Resource created successfully +- `400 Bad Request`: Invalid request parameters +- `401 Unauthorized`: Authentication required or invalid +- `403 Forbidden`: Permission denied +- `404 Not Found`: Resource not found +- `422 Unprocessable Entity`: Validation error +- `500 Internal Server Error`: Server error + +## Rate Limiting + +API endpoints are rate-limited: + +- Authentication: 5 requests per minute per IP +- General APIs: 100 requests per minute per user +- Batch operations: 10 requests per minute per user + +Rate limit headers are included in responses: + +```text +X-RateLimit-Limit: 100 +X-RateLimit-Remaining: 95 +X-RateLimit-Reset: 1632150000 +``` + +## Monitoring Endpoints + +### GET /metrics + +Prometheus-compatible metrics endpoint. + +**Response:** + +```text +# HELP orchestrator_tasks_total Total number of tasks +# TYPE orchestrator_tasks_total counter +orchestrator_tasks_total{status="completed"} 150 +orchestrator_tasks_total{status="failed"} 5 + +# HELP orchestrator_task_duration_seconds Task execution duration +# TYPE orchestrator_task_duration_seconds histogram +orchestrator_task_duration_seconds_bucket{le="10"} 50 +orchestrator_task_duration_seconds_bucket{le="30"} 120 +orchestrator_task_duration_seconds_bucket{le="+Inf"} 155 +``` + +### WebSocket /ws + +Real-time event streaming via WebSocket connection. 
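+
+The browser-side connection snippet below uses plain JavaScript; an equivalent Python consumer might use the third-party `websockets` package (a minimal sketch; the URL, token query parameter, and event names follow this section, everything else is illustrative):
+
+```text
+import asyncio
+import json
+
+import websockets  # pip install websockets
+
+async def monitor():
+    url = "ws://localhost:9090/ws?token=<token>"  # replace with a real JWT
+    async with websockets.connect(url) as ws:
+        async for message in ws:
+            event = json.loads(message)
+            if event["event_type"] == "TaskStatusChanged":
+                print(event["data"]["task_id"], "->", event["data"]["status"])
+
+asyncio.run(monitor())
+```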
+
+**Connection:**
+
+```text
+const ws = new WebSocket('ws://localhost:9090/ws?token=jwt-token');
+
+ws.onmessage = function(event) {
+  const data = JSON.parse(event.data);
+  console.log('Event:', data);
+};
+```
+
+**Event Format:**
+
+```text
+{
+  "event_type": "TaskStatusChanged",
+  "timestamp": "2025-09-26T10:00:00Z",
+  "data": {
+    "task_id": "uuid-string",
+    "status": "completed"
+  },
+  "metadata": {
+    "task_id": "uuid-string",
+    "status": "completed"
+  }
+}
+```
+
+## SDK Examples
+
+### Python SDK Example
+
+```text
+import requests
+
+class ProvisioningClient:
+    def __init__(self, base_url, token):
+        self.base_url = base_url
+        self.headers = {
+            'Authorization': f'Bearer {token}',
+            'Content-Type': 'application/json'
+        }
+
+    def create_server_workflow(self, infra, settings, check_mode=False):
+        payload = {
+            'infra': infra,
+            'settings': settings,
+            'check_mode': check_mode,
+            'wait': True
+        }
+        response = requests.post(
+            f'{self.base_url}/workflows/servers/create',
+            json=payload,
+            headers=self.headers
+        )
+        return response.json()
+
+    def get_task_status(self, task_id):
+        response = requests.get(
+            f'{self.base_url}/tasks/{task_id}',
+            headers=self.headers
+        )
+        return response.json()
+
+# Usage
+client = ProvisioningClient('http://localhost:9090', 'your-jwt-token')
+result = client.create_server_workflow('production', 'config.ncl')
+print(f"Task ID: {result['data']}")
+```
+
+### JavaScript/Node.js SDK Example
+
+```text
+const axios = require('axios');
+
+class ProvisioningClient {
+  constructor(baseUrl, token) {
+    this.client = axios.create({
+      baseURL: baseUrl,
+      headers: {
+        'Authorization': `Bearer ${token}`,
+        'Content-Type': 'application/json'
+      }
+    });
+  }
+
+  async createServerWorkflow(infra, settings, checkMode = false) {
+    const response = await this.client.post('/workflows/servers/create', {
+      infra,
+      settings,
+      check_mode: checkMode,
+      wait: true
+    });
+    return response.data;
+  }
+
+  async getTaskStatus(taskId) {
+    const response = await this.client.get(`/tasks/${taskId}`);
+    return response.data;
+  }
+}
+
+// Usage
+const client = new ProvisioningClient('http://localhost:9090', 'your-jwt-token');
+const result = await client.createServerWorkflow('production', 'config.ncl');
+console.log(`Task ID: ${result.data}`);
+```
+
+## Webhook Integration
+
+The system supports webhooks for external integrations:
+
+### Webhook Configuration
+
+Configure webhooks in the system configuration:
+
+```text
+[webhooks]
+enabled = true
+endpoints = [
+  {
+    url = "https://your-system.com/webhook"
+    events = ["task.completed", "task.failed", "batch.completed"]
+    secret = "webhook-secret"
+  }
+]
+```
+
+### Webhook Payload
+
+```text
+{
+  "event": "task.completed",
+  "timestamp": "2025-09-26T10:00:00Z",
+  "data": {
+    "task_id": "uuid-string",
+    "status": "completed",
+    "output": "Task completed successfully"
+  },
+  "signature": "sha256=calculated-signature"
+}
+```
+
+## Pagination
+
+For endpoints that return lists, use pagination parameters:
+
+- `limit`: Maximum number of items per page (default: 50, max: 1000)
+- `offset`: Number of items to skip
+
+Pagination metadata is included in response headers:
+
+```text
+X-Total-Count: 1500
+X-Limit: 50
+X-Offset: 100
+Link: <next-page-url>; rel="next"
+```
+
+## API Versioning
+
+The API uses header-based versioning:
+
+```text
+Accept: application/vnd.provisioning.v1+json
+```
+
+Current version: v1
+
+## Testing
+
+Use the included test suite to validate API functionality:
+
+```text
+# Run API integration tests
+cd src/orchestrator
+cargo test 
--test api_tests + +# Run load tests +cargo test --test load_tests --release +``` \ No newline at end of file diff --git a/docs/src/api-reference/sdks.md b/docs/src/api-reference/sdks.md index 757d40e..2bb086e 100644 --- a/docs/src/api-reference/sdks.md +++ b/docs/src/api-reference/sdks.md @@ -1 +1,1097 @@ -# SDK Documentation\n\nThis document provides comprehensive documentation for the official SDKs and client libraries available for provisioning.\n\n## Available SDKs\n\nProvisioning provides SDKs in multiple languages to facilitate integration:\n\n### Official SDKs\n\n- **Python SDK** (`provisioning-client`) - Full-featured Python client\n- **JavaScript/TypeScript SDK** (`@provisioning/client`) - Node.js and browser support\n- **Go SDK** (`go-provisioning-client`) - Go client library\n- **Rust SDK** (`provisioning-rs`) - Native Rust integration\n\n### Community SDKs\n\n- **Java SDK** - Community-maintained Java client\n- **C# SDK** - .NET client library\n- **PHP SDK** - PHP client library\n\n## Python SDK\n\n### Installation\n\n```\n# Install from PyPI\npip install provisioning-client\n\n# Or install development version\npip install git+https://github.com/provisioning-systems/python-client.git\n```\n\n### Quick Start\n\n```\nfrom provisioning_client import ProvisioningClient\nimport asyncio\n\nasync def main():\n # Initialize client\n client = ProvisioningClient(\n base_url="http://localhost:9090",\n auth_url="http://localhost:8081",\n username="admin",\n password="your-password"\n )\n\n try:\n # Authenticate\n token = await client.authenticate()\n print(f"Authenticated with token: {token[:20]}...")\n\n # Create a server workflow\n task_id = client.create_server_workflow(\n infra="production",\n settings="prod-settings.ncl",\n wait=False\n )\n print(f"Server workflow created: {task_id}")\n\n # Wait for completion\n task = client.wait_for_task_completion(task_id, timeout=600)\n print(f"Task completed with status: {task.status}")\n\n if task.status == "Completed":\n print(f"Output: {task.output}")\n elif task.status == "Failed":\n print(f"Error: {task.error}")\n\n except Exception as e:\n print(f"Error: {e}")\n\nif __name__ == "__main__":\n asyncio.run(main())\n```\n\n### Advanced Usage\n\n#### WebSocket Integration\n\n```\nasync def monitor_workflows():\n client = ProvisioningClient()\n await client.authenticate()\n\n # Set up event handlers\n async def on_task_update(event):\n print(f"Task {event['data']['task_id']} status: {event['data']['status']}")\n\n async def on_progress_update(event):\n print(f"Progress: {event['data']['progress']}% - {event['data']['current_step']}")\n\n client.on_event('TaskStatusChanged', on_task_update)\n client.on_event('WorkflowProgressUpdate', on_progress_update)\n\n # Connect to WebSocket\n await client.connect_websocket(['TaskStatusChanged', 'WorkflowProgressUpdate'])\n\n # Keep connection alive\n await asyncio.sleep(3600) # Monitor for 1 hour\n```\n\n#### Batch Operations\n\n```\nasync def execute_batch_deployment():\n client = ProvisioningClient()\n await client.authenticate()\n\n batch_config = {\n "name": "production_deployment",\n "version": "1.0.0",\n "storage_backend": "surrealdb",\n "parallel_limit": 5,\n "rollback_enabled": True,\n "operations": [\n {\n "id": "servers",\n "type": "server_batch",\n "provider": "upcloud",\n "dependencies": [],\n "config": {\n "server_configs": [\n {"name": "web-01", "plan": "2xCPU-4 GB", "zone": "de-fra1"},\n {"name": "web-02", "plan": "2xCPU-4 GB", "zone": "de-fra1"}\n ]\n }\n },\n {\n "id": "kubernetes",\n 
"type": "taskserv_batch",\n "provider": "upcloud",\n "dependencies": ["servers"],\n "config": {\n "taskservs": ["kubernetes", "cilium", "containerd"]\n }\n }\n ]\n }\n\n # Execute batch operation\n batch_result = await client.execute_batch_operation(batch_config)\n print(f"Batch operation started: {batch_result['batch_id']}")\n\n # Monitor progress\n while True:\n status = await client.get_batch_status(batch_result['batch_id'])\n print(f"Batch status: {status['status']} - {status.get('progress', 0)}%")\n\n if status['status'] in ['Completed', 'Failed', 'Cancelled']:\n break\n\n await asyncio.sleep(10)\n\n print(f"Batch operation finished: {status['status']}")\n```\n\n#### Error Handling with Retries\n\n```\nfrom provisioning_client.exceptions import (\n ProvisioningAPIError,\n AuthenticationError,\n ValidationError,\n RateLimitError\n)\nfrom tenacity import retry, stop_after_attempt, wait_exponential\n\nclass RobustProvisioningClient(ProvisioningClient):\n @retry(\n stop=stop_after_attempt(3),\n wait=wait_exponential(multiplier=1, min=4, max=10)\n )\n async def create_server_workflow_with_retry(self, **kwargs):\n try:\n return await self.create_server_workflow(**kwargs)\n except RateLimitError as e:\n print(f"Rate limited, retrying in {e.retry_after} seconds...")\n await asyncio.sleep(e.retry_after)\n raise\n except AuthenticationError:\n print("Authentication failed, re-authenticating...")\n await self.authenticate()\n raise\n except ValidationError as e:\n print(f"Validation error: {e}")\n # Don't retry validation errors\n raise\n except ProvisioningAPIError as e:\n print(f"API error: {e}")\n raise\n\n# Usage\nasync def robust_workflow():\n client = RobustProvisioningClient()\n\n try:\n task_id = await client.create_server_workflow_with_retry(\n infra="production",\n settings="config.ncl"\n )\n print(f"Workflow created successfully: {task_id}")\n except Exception as e:\n print(f"Failed after retries: {e}")\n```\n\n### API Reference\n\n#### ProvisioningClient Class\n\n```\nclass ProvisioningClient:\n def __init__(self,\n base_url: str = "http://localhost:9090",\n auth_url: str = "http://localhost:8081",\n username: str = None,\n password: str = None,\n token: str = None):\n """Initialize the provisioning client"""\n\n async def authenticate(self) -> str:\n """Authenticate and get JWT token"""\n\n def create_server_workflow(self,\n infra: str,\n settings: str = "config.ncl",\n check_mode: bool = False,\n wait: bool = False) -> str:\n """Create a server provisioning workflow"""\n\n def create_taskserv_workflow(self,\n operation: str,\n taskserv: str,\n infra: str,\n settings: str = "config.ncl",\n check_mode: bool = False,\n wait: bool = False) -> str:\n """Create a task service workflow"""\n\n def get_task_status(self, task_id: str) -> WorkflowTask:\n """Get the status of a specific task"""\n\n def wait_for_task_completion(self,\n task_id: str,\n timeout: int = 300,\n poll_interval: int = 5) -> WorkflowTask:\n """Wait for a task to complete"""\n\n async def connect_websocket(self, event_types: List[str] = None):\n """Connect to WebSocket for real-time updates"""\n\n def on_event(self, event_type: str, handler: Callable):\n """Register an event handler"""\n```\n\n## JavaScript/TypeScript SDK\n\n### Installation\n\n```\n# npm\nnpm install @provisioning/client\n\n# yarn\nyarn add @provisioning/client\n\n# pnpm\npnpm add @provisioning/client\n```\n\n### Quick Start\n\n```\nimport { ProvisioningClient } from '@provisioning/client';\n\nasync function main() {\n const client = new 
ProvisioningClient({\n baseUrl: 'http://localhost:9090',\n authUrl: 'http://localhost:8081',\n username: 'admin',\n password: 'your-password'\n });\n\n try {\n // Authenticate\n await client.authenticate();\n console.log('Authentication successful');\n\n // Create server workflow\n const taskId = await client.createServerWorkflow({\n infra: 'production',\n settings: 'prod-settings.ncl'\n });\n console.log(`Server workflow created: ${taskId}`);\n\n // Wait for completion\n const task = await client.waitForTaskCompletion(taskId);\n console.log(`Task completed with status: ${task.status}`);\n\n } catch (error) {\n console.error('Error:', error.message);\n }\n}\n\nmain();\n```\n\n### React Integration\n\n```\nimport React, { useState, useEffect } from 'react';\nimport { ProvisioningClient } from '@provisioning/client';\n\ninterface Task {\n id: string;\n name: string;\n status: string;\n progress?: number;\n}\n\nconst WorkflowDashboard: React.FC = () => {\n const [client] = useState(() => new ProvisioningClient({\n baseUrl: process.env.REACT_APP_API_URL,\n username: process.env.REACT_APP_USERNAME,\n password: process.env.REACT_APP_PASSWORD\n }));\n\n const [tasks, setTasks] = useState([]);\n const [connected, setConnected] = useState(false);\n\n useEffect(() => {\n const initClient = async () => {\n try {\n await client.authenticate();\n\n // Set up WebSocket event handlers\n client.on('TaskStatusChanged', (event: any) => {\n setTasks(prev => prev.map(task =>\n task.id === event.data.task_id\n ? { ...task, status: event.data.status, progress: event.data.progress }\n : task\n ));\n });\n\n client.on('websocketConnected', () => {\n setConnected(true);\n });\n\n client.on('websocketDisconnected', () => {\n setConnected(false);\n });\n\n // Connect WebSocket\n await client.connectWebSocket(['TaskStatusChanged', 'WorkflowProgressUpdate']);\n\n // Load initial tasks\n const initialTasks = await client.listTasks();\n setTasks(initialTasks);\n\n } catch (error) {\n console.error('Failed to initialize client:', error);\n }\n };\n\n initClient();\n\n return () => {\n client.disconnectWebSocket();\n };\n }, [client]);\n\n const createServerWorkflow = async () => {\n try {\n const taskId = await client.createServerWorkflow({\n infra: 'production',\n settings: 'config.ncl'\n });\n\n // Add to tasks list\n setTasks(prev => [...prev, {\n id: taskId,\n name: 'Server Creation',\n status: 'Pending'\n }]);\n\n } catch (error) {\n console.error('Failed to create workflow:', error);\n }\n };\n\n return (\n
<div className="App">\n      <header className="App-header">\n        <h1>Workflow Dashboard</h1>\n        <span className="connection-status">\n          {connected ? '🟢 Connected' : '🔴 Disconnected'}\n        </span>\n      </header>\n\n      <div className="actions">\n        <button onClick={createServerWorkflow}>\n          Create Server Workflow\n        </button>\n      </div>\n\n      <div className="task-list">\n        {tasks.map(task => (\n          <div key={task.id} className="task-card">\n            <h3>{task.name}</h3>\n            <div className="task-details">\n              <span className="task-status">\n                {task.status}\n              </span>\n              {task.progress && (\n                <div className="task-progress">\n                  <progress value={task.progress} max="100" />\n                  <span>{task.progress}%</span>\n                </div>\n              )}\n            </div>\n          </div>\n        ))}\n      </div>\n    </div>
\n );\n};\n\nexport default WorkflowDashboard;\n```\n\n### Node.js CLI Tool\n\n```\n#!/usr/bin/env node\n\nimport { Command } from 'commander';\nimport { ProvisioningClient } from '@provisioning/client';\nimport chalk from 'chalk';\nimport ora from 'ora';\n\nconst program = new Command();\n\nprogram\n .name('provisioning-cli')\n .description('CLI tool for provisioning')\n .version('1.0.0');\n\nprogram\n .command('create-server')\n .description('Create a server workflow')\n .requiredOption('-i, --infra ', 'Infrastructure target')\n .option('-s, --settings ', 'Settings file', 'config.ncl')\n .option('-c, --check', 'Check mode only')\n .option('-w, --wait', 'Wait for completion')\n .action(async (options) => {\n const client = new ProvisioningClient({\n baseUrl: process.env.PROVISIONING_API_URL,\n username: process.env.PROVISIONING_USERNAME,\n password: process.env.PROVISIONING_PASSWORD\n });\n\n const spinner = ora('Authenticating...').start();\n\n try {\n await client.authenticate();\n spinner.text = 'Creating server workflow...';\n\n const taskId = await client.createServerWorkflow({\n infra: options.infra,\n settings: options.settings,\n check_mode: options.check,\n wait: false\n });\n\n spinner.succeed(`Server workflow created: ${chalk.green(taskId)}`);\n\n if (options.wait) {\n spinner.start('Waiting for completion...');\n\n // Set up progress updates\n client.on('TaskStatusChanged', (event: any) => {\n if (event.data.task_id === taskId) {\n spinner.text = `Status: ${event.data.status}`;\n }\n });\n\n client.on('WorkflowProgressUpdate', (event: any) => {\n if (event.data.workflow_id === taskId) {\n spinner.text = `${event.data.progress}% - ${event.data.current_step}`;\n }\n });\n\n await client.connectWebSocket(['TaskStatusChanged', 'WorkflowProgressUpdate']);\n\n const task = await client.waitForTaskCompletion(taskId);\n\n if (task.status === 'Completed') {\n spinner.succeed(chalk.green('Workflow completed successfully!'));\n if (task.output) {\n console.log(chalk.gray('Output:'), task.output);\n }\n } else {\n spinner.fail(chalk.red(`Workflow failed: ${task.error}`));\n process.exit(1);\n }\n }\n\n } catch (error) {\n spinner.fail(chalk.red(`Error: ${error.message}`));\n process.exit(1);\n }\n });\n\nprogram\n .command('list-tasks')\n .description('List all tasks')\n .option('-s, --status ', 'Filter by status')\n .action(async (options) => {\n const client = new ProvisioningClient();\n\n try {\n await client.authenticate();\n const tasks = await client.listTasks(options.status);\n\n console.log(chalk.bold('Tasks:'));\n tasks.forEach(task => {\n const statusColor = task.status === 'Completed' ? 'green' :\n task.status === 'Failed' ? 'red' :\n task.status === 'Running' ? 'yellow' : 'gray';\n\n console.log(` ${task.id} - ${task.name} [${chalk[statusColor](task.status)}]`);\n });\n\n } catch (error) {\n console.error(chalk.red(`Error: ${error.message}`));\n process.exit(1);\n }\n });\n\nprogram\n .command('monitor')\n .description('Monitor workflows in real-time')\n .action(async () => {\n const client = new ProvisioningClient();\n\n try {\n await client.authenticate();\n\n console.log(chalk.bold('🔍 Monitoring workflows...'));\n console.log(chalk.gray('Press Ctrl+C to stop'));\n\n client.on('TaskStatusChanged', (event: any) => {\n const timestamp = new Date().toLocaleTimeString();\n const statusColor = event.data.status === 'Completed' ? 'green' :\n event.data.status === 'Failed' ? 'red' :\n event.data.status === 'Running' ? 
'yellow' : 'gray';\n\n console.log(`[${chalk.gray(timestamp)}] Task ${event.data.task_id} → ${chalk[statusColor](event.data.status)}`);\n });\n\n client.on('WorkflowProgressUpdate', (event: any) => {\n const timestamp = new Date().toLocaleTimeString();\n console.log(`[${chalk.gray(timestamp)}] ${event.data.workflow_id}: ${event.data.progress}% - ${event.data.current_step}`);\n });\n\n await client.connectWebSocket(['TaskStatusChanged', 'WorkflowProgressUpdate']);\n\n // Keep the process running\n process.on('SIGINT', () => {\n console.log(chalk.yellow('\nStopping monitor...'));\n client.disconnectWebSocket();\n process.exit(0);\n });\n\n // Keep alive\n setInterval(() => {}, 1000);\n\n } catch (error) {\n console.error(chalk.red(`Error: ${error.message}`));\n process.exit(1);\n }\n });\n\nprogram.parse();\n```\n\n### API Reference\n\n```\ninterface ProvisioningClientOptions {\n baseUrl?: string;\n authUrl?: string;\n username?: string;\n password?: string;\n token?: string;\n}\n\nclass ProvisioningClient extends EventEmitter {\n constructor(options: ProvisioningClientOptions);\n\n async authenticate(): Promise;\n\n async createServerWorkflow(config: {\n infra: string;\n settings?: string;\n check_mode?: boolean;\n wait?: boolean;\n }): Promise;\n\n async createTaskservWorkflow(config: {\n operation: string;\n taskserv: string;\n infra: string;\n settings?: string;\n check_mode?: boolean;\n wait?: boolean;\n }): Promise;\n\n async getTaskStatus(taskId: string): Promise;\n\n async listTasks(statusFilter?: string): Promise;\n\n async waitForTaskCompletion(\n taskId: string,\n timeout?: number,\n pollInterval?: number\n ): Promise;\n\n async connectWebSocket(eventTypes?: string[]): Promise;\n\n disconnectWebSocket(): void;\n\n async executeBatchOperation(batchConfig: BatchConfig): Promise;\n\n async getBatchStatus(batchId: string): Promise;\n}\n```\n\n## Go SDK\n\n### Installation\n\n```\ngo get github.com/provisioning-systems/go-client\n```\n\n### Quick Start\n\n```\npackage main\n\nimport (\n "context"\n "fmt"\n "log"\n "time"\n\n "github.com/provisioning-systems/go-client"\n)\n\nfunc main() {\n // Initialize client\n client, err := provisioning.NewClient(&provisioning.Config{\n BaseURL: "http://localhost:9090",\n AuthURL: "http://localhost:8081",\n Username: "admin",\n Password: "your-password",\n })\n if err != nil {\n log.Fatalf("Failed to create client: %v", err)\n }\n\n ctx := context.Background()\n\n // Authenticate\n token, err := client.Authenticate(ctx)\n if err != nil {\n log.Fatalf("Authentication failed: %v", err)\n }\n fmt.Printf("Authenticated with token: %.20s...\n", token)\n\n // Create server workflow\n taskID, err := client.CreateServerWorkflow(ctx, &provisioning.CreateServerRequest{\n Infra: "production",\n Settings: "prod-settings.ncl",\n Wait: false,\n })\n if err != nil {\n log.Fatalf("Failed to create workflow: %v", err)\n }\n fmt.Printf("Server workflow created: %s\n", taskID)\n\n // Wait for completion\n task, err := client.WaitForTaskCompletion(ctx, taskID, 10*time.Minute)\n if err != nil {\n log.Fatalf("Failed to wait for completion: %v", err)\n }\n\n fmt.Printf("Task completed with status: %s\n", task.Status)\n if task.Status == "Completed" {\n fmt.Printf("Output: %s\n", task.Output)\n } else if task.Status == "Failed" {\n fmt.Printf("Error: %s\n", task.Error)\n }\n}\n```\n\n### WebSocket Integration\n\n```\npackage main\n\nimport (\n "context"\n "fmt"\n "log"\n "os"\n "os/signal"\n\n "github.com/provisioning-systems/go-client"\n)\n\nfunc main() {\n client, err := 
provisioning.NewClient(&provisioning.Config{\n BaseURL: "http://localhost:9090",\n Username: "admin",\n Password: "password",\n })\n if err != nil {\n log.Fatalf("Failed to create client: %v", err)\n }\n\n ctx := context.Background()\n\n // Authenticate\n _, err = client.Authenticate(ctx)\n if err != nil {\n log.Fatalf("Authentication failed: %v", err)\n }\n\n // Set up WebSocket connection\n ws, err := client.ConnectWebSocket(ctx, []string{\n "TaskStatusChanged",\n "WorkflowProgressUpdate",\n })\n if err != nil {\n log.Fatalf("Failed to connect WebSocket: %v", err)\n }\n defer ws.Close()\n\n // Handle events\n go func() {\n for event := range ws.Events() {\n switch event.Type {\n case "TaskStatusChanged":\n fmt.Printf("Task %s status changed to: %s\n",\n event.Data["task_id"], event.Data["status"])\n case "WorkflowProgressUpdate":\n fmt.Printf("Workflow progress: %v%% - %s\n",\n event.Data["progress"], event.Data["current_step"])\n }\n }\n }()\n\n // Wait for interrupt\n c := make(chan os.Signal, 1)\n signal.Notify(c, os.Interrupt)\n <-c\n\n fmt.Println("Shutting down...")\n}\n```\n\n### HTTP Client with Retry Logic\n\n```\npackage main\n\nimport (\n "context"\n "fmt"\n "time"\n\n "github.com/provisioning-systems/go-client"\n "github.com/cenkalti/backoff/v4"\n)\n\ntype ResilientClient struct {\n *provisioning.Client\n}\n\nfunc NewResilientClient(config *provisioning.Config) (*ResilientClient, error) {\n client, err := provisioning.NewClient(config)\n if err != nil {\n return nil, err\n }\n\n return &ResilientClient{Client: client}, nil\n}\n\nfunc (c *ResilientClient) CreateServerWorkflowWithRetry(\n ctx context.Context,\n req *provisioning.CreateServerRequest,\n) (string, error) {\n var taskID string\n\n operation := func() error {\n var err error\n taskID, err = c.CreateServerWorkflow(ctx, req)\n\n // Don't retry validation errors\n if provisioning.IsValidationError(err) {\n return backoff.Permanent(err)\n }\n\n return err\n }\n\n exponentialBackoff := backoff.NewExponentialBackOff()\n exponentialBackoff.MaxElapsedTime = 5 * time.Minute\n\n err := backoff.Retry(operation, exponentialBackoff)\n if err != nil {\n return "", fmt.Errorf("failed after retries: %w", err)\n }\n\n return taskID, nil\n}\n\nfunc main() {\n client, err := NewResilientClient(&provisioning.Config{\n BaseURL: "http://localhost:9090",\n Username: "admin",\n Password: "password",\n })\n if err != nil {\n log.Fatalf("Failed to create client: %v", err)\n }\n\n ctx := context.Background()\n\n // Authenticate with retry\n _, err = client.Authenticate(ctx)\n if err != nil {\n log.Fatalf("Authentication failed: %v", err)\n }\n\n // Create workflow with retry\n taskID, err := client.CreateServerWorkflowWithRetry(ctx, &provisioning.CreateServerRequest{\n Infra: "production",\n Settings: "config.ncl",\n })\n if err != nil {\n log.Fatalf("Failed to create workflow: %v", err)\n }\n\n fmt.Printf("Workflow created successfully: %s\n", taskID)\n}\n```\n\n## Rust SDK\n\n### Installation\n\nAdd to your `Cargo.toml`:\n\n```\n[dependencies]\nprovisioning-rs = "2.0.0"\ntokio = { version = "1.0", features = ["full"] }\n```\n\n### Quick Start\n\n```\nuse provisioning_rs::{ProvisioningClient, Config, CreateServerRequest};\nuse tokio;\n\n#[tokio::main]\nasync fn main() -> Result<(), Box> {\n // Initialize client\n let config = Config {\n base_url: "http://localhost:9090".to_string(),\n auth_url: Some("http://localhost:8081".to_string()),\n username: Some("admin".to_string()),\n password: Some("your-password".to_string()),\n token: None,\n 
};\n\n let mut client = ProvisioningClient::new(config);\n\n // Authenticate\n let token = client.authenticate().await?;\n println!("Authenticated with token: {}...", &token[..20]);\n\n // Create server workflow\n let request = CreateServerRequest {\n infra: "production".to_string(),\n settings: Some("prod-settings.ncl".to_string()),\n check_mode: false,\n wait: false,\n };\n\n let task_id = client.create_server_workflow(request).await?;\n println!("Server workflow created: {}", task_id);\n\n // Wait for completion\n let task = client.wait_for_task_completion(&task_id, std::time::Duration::from_secs(600)).await?;\n\n println!("Task completed with status: {:?}", task.status);\n match task.status {\n TaskStatus::Completed => {\n if let Some(output) = task.output {\n println!("Output: {}", output);\n }\n },\n TaskStatus::Failed => {\n if let Some(error) = task.error {\n println!("Error: {}", error);\n }\n },\n _ => {}\n }\n\n Ok(())\n}\n```\n\n### WebSocket Integration\n\n```\nuse provisioning_rs::{ProvisioningClient, Config, WebSocketEvent};\nuse futures_util::StreamExt;\nuse tokio;\n\n#[tokio::main]\nasync fn main() -> Result<(), Box> {\n let config = Config {\n base_url: "http://localhost:9090".to_string(),\n username: Some("admin".to_string()),\n password: Some("password".to_string()),\n ..Default::default()\n };\n\n let mut client = ProvisioningClient::new(config);\n\n // Authenticate\n client.authenticate().await?;\n\n // Connect WebSocket\n let mut ws = client.connect_websocket(vec![\n "TaskStatusChanged".to_string(),\n "WorkflowProgressUpdate".to_string(),\n ]).await?;\n\n // Handle events\n tokio::spawn(async move {\n while let Some(event) = ws.next().await {\n match event {\n Ok(WebSocketEvent::TaskStatusChanged { data }) => {\n println!("Task {} status changed to: {}", data.task_id, data.status);\n },\n Ok(WebSocketEvent::WorkflowProgressUpdate { data }) => {\n println!("Workflow progress: {}% - {}", data.progress, data.current_step);\n },\n Ok(WebSocketEvent::SystemHealthUpdate { data }) => {\n println!("System health: {}", data.overall_status);\n },\n Err(e) => {\n eprintln!("WebSocket error: {}", e);\n break;\n }\n }\n }\n });\n\n // Keep the main thread alive\n tokio::signal::ctrl_c().await?;\n println!("Shutting down...");\n\n Ok(())\n}\n```\n\n### Batch Operations\n\n```\nuse provisioning_rs::{BatchOperationRequest, BatchOperation};\n\n#[tokio::main]\nasync fn main() -> Result<(), Box> {\n let mut client = ProvisioningClient::new(config);\n client.authenticate().await?;\n\n // Define batch operation\n let batch_request = BatchOperationRequest {\n name: "production_deployment".to_string(),\n version: "1.0.0".to_string(),\n storage_backend: "surrealdb".to_string(),\n parallel_limit: 5,\n rollback_enabled: true,\n operations: vec![\n BatchOperation {\n id: "servers".to_string(),\n operation_type: "server_batch".to_string(),\n provider: "upcloud".to_string(),\n dependencies: vec![],\n config: serde_json::json!({\n "server_configs": [\n {"name": "web-01", "plan": "2xCPU-4 GB", "zone": "de-fra1"},\n {"name": "web-02", "plan": "2xCPU-4 GB", "zone": "de-fra1"}\n ]\n }),\n },\n BatchOperation {\n id: "kubernetes".to_string(),\n operation_type: "taskserv_batch".to_string(),\n provider: "upcloud".to_string(),\n dependencies: vec!["servers".to_string()],\n config: serde_json::json!({\n "taskservs": ["kubernetes", "cilium", "containerd"]\n }),\n },\n ],\n };\n\n // Execute batch operation\n let batch_result = client.execute_batch_operation(batch_request).await?;\n println!("Batch 
operation started: {}", batch_result.batch_id);\n\n // Monitor progress\n loop {\n let status = client.get_batch_status(&batch_result.batch_id).await?;\n println!("Batch status: {} - {}%", status.status, status.progress.unwrap_or(0.0));\n\n match status.status.as_str() {\n "Completed" | "Failed" | "Cancelled" => break,\n _ => tokio::time::sleep(std::time::Duration::from_secs(10)).await,\n }\n }\n\n Ok(())\n}\n```\n\n## Best Practices\n\n### Authentication and Security\n\n1. **Token Management**: Store tokens securely and implement automatic refresh\n2. **Environment Variables**: Use environment variables for credentials\n3. **HTTPS**: Always use HTTPS in production environments\n4. **Token Expiration**: Handle token expiration gracefully\n\n### Error Handling\n\n1. **Specific Exceptions**: Handle specific error types appropriately\n2. **Retry Logic**: Implement exponential backoff for transient failures\n3. **Circuit Breakers**: Use circuit breakers for resilient integrations\n4. **Logging**: Log errors with appropriate context\n\n### Performance Optimization\n\n1. **Connection Pooling**: Reuse HTTP connections\n2. **Async Operations**: Use asynchronous operations where possible\n3. **Batch Operations**: Group related operations for efficiency\n4. **Caching**: Cache frequently accessed data appropriately\n\n### WebSocket Connections\n\n1. **Reconnection**: Implement automatic reconnection with backoff\n2. **Event Filtering**: Subscribe only to needed event types\n3. **Error Handling**: Handle WebSocket errors gracefully\n4. **Resource Cleanup**: Properly close WebSocket connections\n\n### Testing\n\n1. **Unit Tests**: Test SDK functionality with mocked responses\n2. **Integration Tests**: Test against real API endpoints\n3. **Error Scenarios**: Test error handling paths\n4. **Load Testing**: Validate performance under load\n\nThis comprehensive SDK documentation provides developers with everything needed to integrate with provisioning using their preferred programming\nlanguage, complete with examples, best practices, and detailed API references. +# SDK Documentation + +This document provides comprehensive documentation for the official SDKs and client libraries available for provisioning. 
+ +## Available SDKs + +Provisioning provides SDKs in multiple languages to facilitate integration: + +### Official SDKs + +- **Python SDK** (`provisioning-client`) - Full-featured Python client +- **JavaScript/TypeScript SDK** (`@provisioning/client`) - Node.js and browser support +- **Go SDK** (`go-provisioning-client`) - Go client library +- **Rust SDK** (`provisioning-rs`) - Native Rust integration + +### Community SDKs + +- **Java SDK** - Community-maintained Java client +- **C# SDK** - .NET client library +- **PHP SDK** - PHP client library + +## Python SDK + +### Installation + +```text +# Install from PyPI +pip install provisioning-client + +# Or install development version +pip install git+https://github.com/provisioning-systems/python-client.git +``` + +### Quick Start + +```text +from provisioning_client import ProvisioningClient +import asyncio + +async def main(): + # Initialize client + client = ProvisioningClient( + base_url="http://localhost:9090", + auth_url="http://localhost:8081", + username="admin", + password="your-password" + ) + + try: + # Authenticate + token = await client.authenticate() + print(f"Authenticated with token: {token[:20]}...") + + # Create a server workflow + task_id = client.create_server_workflow( + infra="production", + settings="prod-settings.ncl", + wait=False + ) + print(f"Server workflow created: {task_id}") + + # Wait for completion + task = client.wait_for_task_completion(task_id, timeout=600) + print(f"Task completed with status: {task.status}") + + if task.status == "Completed": + print(f"Output: {task.output}") + elif task.status == "Failed": + print(f"Error: {task.error}") + + except Exception as e: + print(f"Error: {e}") + +if __name__ == "__main__": + asyncio.run(main()) +``` + +### Advanced Usage + +#### WebSocket Integration + +```text +async def monitor_workflows(): + client = ProvisioningClient() + await client.authenticate() + + # Set up event handlers + async def on_task_update(event): + print(f"Task {event['data']['task_id']} status: {event['data']['status']}") + + async def on_progress_update(event): + print(f"Progress: {event['data']['progress']}% - {event['data']['current_step']}") + + client.on_event('TaskStatusChanged', on_task_update) + client.on_event('WorkflowProgressUpdate', on_progress_update) + + # Connect to WebSocket + await client.connect_websocket(['TaskStatusChanged', 'WorkflowProgressUpdate']) + + # Keep connection alive + await asyncio.sleep(3600) # Monitor for 1 hour +``` + +#### Batch Operations + +```text +async def execute_batch_deployment(): + client = ProvisioningClient() + await client.authenticate() + + batch_config = { + "name": "production_deployment", + "version": "1.0.0", + "storage_backend": "surrealdb", + "parallel_limit": 5, + "rollback_enabled": True, + "operations": [ + { + "id": "servers", + "type": "server_batch", + "provider": "upcloud", + "dependencies": [], + "config": { + "server_configs": [ + {"name": "web-01", "plan": "2xCPU-4 GB", "zone": "de-fra1"}, + {"name": "web-02", "plan": "2xCPU-4 GB", "zone": "de-fra1"} + ] + } + }, + { + "id": "kubernetes", + "type": "taskserv_batch", + "provider": "upcloud", + "dependencies": ["servers"], + "config": { + "taskservs": ["kubernetes", "cilium", "containerd"] + } + } + ] + } + + # Execute batch operation + batch_result = await client.execute_batch_operation(batch_config) + print(f"Batch operation started: {batch_result['batch_id']}") + + # Monitor progress + while True: + status = await client.get_batch_status(batch_result['batch_id']) + 
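+        # The status payload carries 'status' and, once operations begin reporting, 'progress'; poll until a terminal state is reached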
print(f"Batch status: {status['status']} - {status.get('progress', 0)}%") + + if status['status'] in ['Completed', 'Failed', 'Cancelled']: + break + + await asyncio.sleep(10) + + print(f"Batch operation finished: {status['status']}") +``` + +#### Error Handling with Retries + +```text +from provisioning_client.exceptions import ( + ProvisioningAPIError, + AuthenticationError, + ValidationError, + RateLimitError +) +from tenacity import retry, stop_after_attempt, wait_exponential + +class RobustProvisioningClient(ProvisioningClient): + @retry( + stop=stop_after_attempt(3), + wait=wait_exponential(multiplier=1, min=4, max=10) + ) + async def create_server_workflow_with_retry(self, **kwargs): + try: + return await self.create_server_workflow(**kwargs) + except RateLimitError as e: + print(f"Rate limited, retrying in {e.retry_after} seconds...") + await asyncio.sleep(e.retry_after) + raise + except AuthenticationError: + print("Authentication failed, re-authenticating...") + await self.authenticate() + raise + except ValidationError as e: + print(f"Validation error: {e}") + # Don't retry validation errors + raise + except ProvisioningAPIError as e: + print(f"API error: {e}") + raise + +# Usage +async def robust_workflow(): + client = RobustProvisioningClient() + + try: + task_id = await client.create_server_workflow_with_retry( + infra="production", + settings="config.ncl" + ) + print(f"Workflow created successfully: {task_id}") + except Exception as e: + print(f"Failed after retries: {e}") +``` + +### API Reference + +#### ProvisioningClient Class + +```text +class ProvisioningClient: + def __init__(self, + base_url: str = "http://localhost:9090", + auth_url: str = "http://localhost:8081", + username: str = None, + password: str = None, + token: str = None): + """Initialize the provisioning client""" + + async def authenticate(self) -> str: + """Authenticate and get JWT token""" + + def create_server_workflow(self, + infra: str, + settings: str = "config.ncl", + check_mode: bool = False, + wait: bool = False) -> str: + """Create a server provisioning workflow""" + + def create_taskserv_workflow(self, + operation: str, + taskserv: str, + infra: str, + settings: str = "config.ncl", + check_mode: bool = False, + wait: bool = False) -> str: + """Create a task service workflow""" + + def get_task_status(self, task_id: str) -> WorkflowTask: + """Get the status of a specific task""" + + def wait_for_task_completion(self, + task_id: str, + timeout: int = 300, + poll_interval: int = 5) -> WorkflowTask: + """Wait for a task to complete""" + + async def connect_websocket(self, event_types: List[str] = None): + """Connect to WebSocket for real-time updates""" + + def on_event(self, event_type: str, handler: Callable): + """Register an event handler""" +``` + +## JavaScript/TypeScript SDK + +### Installation + +```text +# npm +npm install @provisioning/client + +# yarn +yarn add @provisioning/client + +# pnpm +pnpm add @provisioning/client +``` + +### Quick Start + +```text +import { ProvisioningClient } from '@provisioning/client'; + +async function main() { + const client = new ProvisioningClient({ + baseUrl: 'http://localhost:9090', + authUrl: 'http://localhost:8081', + username: 'admin', + password: 'your-password' + }); + + try { + // Authenticate + await client.authenticate(); + console.log('Authentication successful'); + + // Create server workflow + const taskId = await client.createServerWorkflow({ + infra: 'production', + settings: 'prod-settings.ncl' + }); + console.log(`Server workflow 
created: ${taskId}`);
+
+    // Wait for completion
+    const task = await client.waitForTaskCompletion(taskId);
+    console.log(`Task completed with status: ${task.status}`);
+
+  } catch (error) {
+    console.error('Error:', error.message);
+  }
+}
+
+main();
+```
+
+### React Integration
+
+```text
+import React, { useState, useEffect } from 'react';
+import { ProvisioningClient } from '@provisioning/client';
+
+interface Task {
+  id: string;
+  name: string;
+  status: string;
+  progress?: number;
+}
+
+const WorkflowDashboard: React.FC = () => {
+  const [client] = useState(() => new ProvisioningClient({
+    baseUrl: process.env.REACT_APP_API_URL,
+    username: process.env.REACT_APP_USERNAME,
+    password: process.env.REACT_APP_PASSWORD
+  }));
+
+  const [tasks, setTasks] = useState<Task[]>([]);
+  const [connected, setConnected] = useState(false);
+
+  useEffect(() => {
+    const initClient = async () => {
+      try {
+        await client.authenticate();
+
+        // Set up WebSocket event handlers
+        client.on('TaskStatusChanged', (event: any) => {
+          setTasks(prev => prev.map(task =>
+            task.id === event.data.task_id
+              ? { ...task, status: event.data.status, progress: event.data.progress }
+              : task
+          ));
+        });
+
+        client.on('websocketConnected', () => {
+          setConnected(true);
+        });
+
+        client.on('websocketDisconnected', () => {
+          setConnected(false);
+        });
+
+        // Connect WebSocket
+        await client.connectWebSocket(['TaskStatusChanged', 'WorkflowProgressUpdate']);
+
+        // Load initial tasks
+        const initialTasks = await client.listTasks();
+        setTasks(initialTasks);
+
+      } catch (error) {
+        console.error('Failed to initialize client:', error);
+      }
+    };
+
+    initClient();
+
+    return () => {
+      client.disconnectWebSocket();
+    };
+  }, [client]);
+
+  const createServerWorkflow = async () => {
+    try {
+      const taskId = await client.createServerWorkflow({
+        infra: 'production',
+        settings: 'config.ncl'
+      });
+
+      // Add to tasks list
+      setTasks(prev => [...prev, {
+        id: taskId,
+        name: 'Server Creation',
+        status: 'Pending'
+      }]);
+
+    } catch (error) {
+      console.error('Failed to create workflow:', error);
+    }
+  };
+
+  return (
+    <div className="dashboard">
+      <h1>Workflow Dashboard</h1>
+
+      <div className="connection-status">
+        {connected ? '🟢 Connected' : '🔴 Disconnected'}
+      </div>
+
+      <button onClick={createServerWorkflow}>
+        Create Server Workflow
+      </button>
+
+      <div className="task-list">
+        {tasks.map(task => (
+          <div key={task.id} className="task-card">
+            <h3>{task.name}</h3>
+            <span className="task-status">{task.status}</span>
+            {task.progress && (
+              <div className="progress-bar">
+                <div className="progress-fill" style={{ width: `${task.progress}%` }} />
+                <span>{task.progress}%</span>
+              </div>
+            )}
+          </div>
+        ))}
+      </div>
+    </div>
+  );
+};
+
+export default WorkflowDashboard;
+```
+
+### Node.js CLI Tool
+
+```text
+#!/usr/bin/env node
+
+import { Command } from 'commander';
+import { ProvisioningClient } from '@provisioning/client';
+import chalk from 'chalk';
+import ora from 'ora';
+
+const program = new Command();
+
+program
+  .name('provisioning-cli')
+  .description('CLI tool for provisioning')
+  .version('1.0.0');
+
+program
+  .command('create-server')
+  .description('Create a server workflow')
+  .requiredOption('-i, --infra <infra>', 'Infrastructure target')
+  .option('-s, --settings <file>', 'Settings file', 'config.ncl')
+  .option('-c, --check', 'Check mode only')
+  .option('-w, --wait', 'Wait for completion')
+  .action(async (options) => {
+    const client = new ProvisioningClient({
+      baseUrl: process.env.PROVISIONING_API_URL,
+      username: process.env.PROVISIONING_USERNAME,
+      password: process.env.PROVISIONING_PASSWORD
+    });
+
+    const spinner = ora('Authenticating...').start();
+
+    try {
+      await client.authenticate();
+      spinner.text = 'Creating server workflow...';
+
+      const taskId = await client.createServerWorkflow({
+        infra: options.infra,
+        settings: options.settings,
+        check_mode: options.check,
+        wait: false
+      });
+
+      spinner.succeed(`Server workflow created: ${chalk.green(taskId)}`);
+
+      if (options.wait) {
+        spinner.start('Waiting for completion...');
+
+        // Set up progress updates
+        client.on('TaskStatusChanged', (event: any) => {
+          if (event.data.task_id === taskId) {
+            spinner.text = `Status: ${event.data.status}`;
+          }
+        });
+
+        client.on('WorkflowProgressUpdate', (event: any) => {
+          if (event.data.workflow_id === taskId) {
+            spinner.text = `${event.data.progress}% - ${event.data.current_step}`;
+          }
+        });
+
+        await client.connectWebSocket(['TaskStatusChanged', 'WorkflowProgressUpdate']);
+
+        const task = await client.waitForTaskCompletion(taskId);
+
+        if (task.status === 'Completed') {
+          spinner.succeed(chalk.green('Workflow completed successfully!'));
+          if (task.output) {
+            console.log(chalk.gray('Output:'), task.output);
+          }
+        } else {
+          spinner.fail(chalk.red(`Workflow failed: ${task.error}`));
+          process.exit(1);
+        }
+      }
+
+    } catch (error) {
+      spinner.fail(chalk.red(`Error: ${error.message}`));
+      process.exit(1);
+    }
+  });
+
+program
+  .command('list-tasks')
+  .description('List all tasks')
+  .option('-s, --status <status>', 'Filter by status')
+  .action(async (options) => {
+    const client = new ProvisioningClient();
+
+    try {
+      await client.authenticate();
+      const tasks = await client.listTasks(options.status);
+
+      console.log(chalk.bold('Tasks:'));
+      tasks.forEach(task => {
+        const statusColor = task.status === 'Completed' ? 'green' :
+                           task.status === 'Failed' ? 'red' :
+                           task.status === 'Running' ? 'yellow' : 'gray';
+
+        console.log(`  ${task.id} - ${task.name} [${chalk[statusColor](task.status)}]`);
+      });
+
+    } catch (error) {
+      console.error(chalk.red(`Error: ${error.message}`));
+      process.exit(1);
+    }
+  });
+
+program
+  .command('monitor')
+  .description('Monitor workflows in real-time')
+  .action(async () => {
+    const client = new ProvisioningClient();
+
+    try {
+      await client.authenticate();
+
+      console.log(chalk.bold('🔍 Monitoring workflows...'));
+      console.log(chalk.gray('Press Ctrl+C to stop'));
+
+      client.on('TaskStatusChanged', (event: any) => {
+        const timestamp = new Date().toLocaleTimeString();
+        const statusColor = event.data.status === 'Completed' ? 'green' :
+                           event.data.status === 'Failed' ? 'red' :
+                           event.data.status === 'Running' ? 'yellow' : 'gray';
+
+        console.log(`[${chalk.gray(timestamp)}] Task ${event.data.task_id} → ${chalk[statusColor](event.data.status)}`);
+      });
+
+      client.on('WorkflowProgressUpdate', (event: any) => {
+        const timestamp = new Date().toLocaleTimeString();
+        console.log(`[${chalk.gray(timestamp)}] ${event.data.workflow_id}: ${event.data.progress}% - ${event.data.current_step}`);
+      });
+
+      await client.connectWebSocket(['TaskStatusChanged', 'WorkflowProgressUpdate']);
+
+      // Keep the process running
+      process.on('SIGINT', () => {
+        console.log(chalk.yellow('\nStopping monitor...'));
+        client.disconnectWebSocket();
+        process.exit(0);
+      });
+
+      // Keep alive
+      setInterval(() => {}, 1000);
+
+    } catch (error) {
+      console.error(chalk.red(`Error: ${error.message}`));
+      process.exit(1);
+    }
+  });
+
+program.parse();
+```
+
+### API Reference
+
+```text
+interface ProvisioningClientOptions {
+  baseUrl?: string;
+  authUrl?: string;
+  username?: string;
+  password?: string;
+  token?: string;
+}
+
+class ProvisioningClient extends EventEmitter {
+  constructor(options: ProvisioningClientOptions);
+
+  async authenticate(): Promise<string>;
+
+  async createServerWorkflow(config: {
+    infra: string;
+    settings?: string;
+    check_mode?: boolean;
+    wait?: boolean;
+  }): Promise<string>;
+
+  async createTaskservWorkflow(config: {
+    operation: string;
+    taskserv: string;
+    infra: string;
+    settings?: string;
+    check_mode?: boolean;
+    wait?: boolean;
+  }): Promise<string>;
+
+  async getTaskStatus(taskId: string): Promise<WorkflowTask>;
+
+  async listTasks(statusFilter?: string): Promise<WorkflowTask[]>;
+
+  async waitForTaskCompletion(
+    taskId: string,
+    timeout?: number,
+    pollInterval?: number
+  ): Promise<WorkflowTask>;
+
+  async connectWebSocket(eventTypes?: string[]): Promise<void>;
+
+  disconnectWebSocket(): void;
+
+  async executeBatchOperation(batchConfig: BatchConfig): Promise<BatchResult>;
+
+  async getBatchStatus(batchId: string): Promise<BatchStatus>;
+}
+```
+
+## Go SDK
+
+### Installation
+
+```text
+go get github.com/provisioning-systems/go-client
+```
+
+### Quick Start
+
+```text
+package main
+
+import (
+    "context"
+    "fmt"
+    "log"
+    "time"
+
+    "github.com/provisioning-systems/go-client"
+)
+
+func main() {
+    // Initialize client
+    client, err := provisioning.NewClient(&provisioning.Config{
+        BaseURL:  "http://localhost:9090",
+        AuthURL:  "http://localhost:8081",
+        Username: "admin",
+        Password: "your-password",
+    })
+    if err != nil {
+        log.Fatalf("Failed to create client: %v", err)
+    }
+
+    ctx := context.Background()
+
+    // Authenticate
+    token, err := client.Authenticate(ctx)
+    if err != nil {
+        log.Fatalf("Authentication failed: %v", err)
+    }
+    fmt.Printf("Authenticated with token: %.20s...\n", token)
+
+    // Create server workflow
+    taskID, err := client.CreateServerWorkflow(ctx, &provisioning.CreateServerRequest{
+        Infra:    "production",
+        Settings: "prod-settings.ncl",
+        Wait:     false,
+    })
+    if err != nil {
+        log.Fatalf("Failed to create workflow: %v", err)
+    }
+    fmt.Printf("Server workflow created: %s\n", taskID)
+
+    // Wait for completion
+    task, err := client.WaitForTaskCompletion(ctx, taskID, 10*time.Minute)
+    if err != nil {
+        log.Fatalf("Failed to wait for completion: %v", err)
+    }
+
+    fmt.Printf("Task completed with status: %s\n", task.Status)
+    if task.Status == "Completed" {
+        fmt.Printf("Output: %s\n", task.Output)
+    } else if task.Status == "Failed" {
+        fmt.Printf("Error: %s\n", task.Error)
+    }
+}
+```
+
+### WebSocket Integration
+
+```text
+package main
+
+import (
+    "context"
+    "fmt"
+    "log"
+    "os"
+    "os/signal"
+
+    "github.com/provisioning-systems/go-client"
+)
+
+func main() {
+    client, err := provisioning.NewClient(&provisioning.Config{
+        BaseURL:  "http://localhost:9090",
+        Username: "admin",
+        Password: "password",
+    })
+    if err != nil {
+        log.Fatalf("Failed to create client: %v", err)
+    }
+
+    ctx := context.Background()
+
+    // Authenticate
+    _, err = client.Authenticate(ctx)
+    if err != nil {
+        log.Fatalf("Authentication failed: %v", err)
+    }
+
+    // Set up WebSocket connection
+    ws, err := client.ConnectWebSocket(ctx, []string{
+        "TaskStatusChanged",
+        "WorkflowProgressUpdate",
+    })
+    if err != nil {
+        log.Fatalf("Failed to connect WebSocket: %v", err)
+    }
+    defer ws.Close()
+
+    // Handle events
+    go func() {
+        for event := range ws.Events() {
+            switch event.Type {
+            case "TaskStatusChanged":
+                fmt.Printf("Task %s status changed to: %s\n",
+                    event.Data["task_id"], event.Data["status"])
+            case "WorkflowProgressUpdate":
+                fmt.Printf("Workflow progress: %v%% - %s\n",
+                    event.Data["progress"], event.Data["current_step"])
+            }
+        }
+    }()
+
+    // Wait for interrupt
+    c := make(chan os.Signal, 1)
+    signal.Notify(c, os.Interrupt)
+    <-c
+
+    fmt.Println("Shutting down...")
+}
+```
+
+### HTTP Client with Retry Logic
+
+```text
+package main
+
+import (
+    "context"
+    "fmt"
+    "log"
+    "time"
+
+    "github.com/provisioning-systems/go-client"
+    "github.com/cenkalti/backoff/v4"
+)
+
+type ResilientClient struct {
+    *provisioning.Client
+}
+
+func NewResilientClient(config *provisioning.Config) (*ResilientClient, error) {
+    client, err := provisioning.NewClient(config)
+    if err != nil {
+        return nil, err
+    }
+
+    return &ResilientClient{Client: client}, nil
+}
+
+func (c *ResilientClient) CreateServerWorkflowWithRetry(
+    ctx context.Context,
+    req *provisioning.CreateServerRequest,
+) (string, error) {
+    var taskID string
+
+    operation := func() error {
+        var err error
+        taskID, err = c.CreateServerWorkflow(ctx, req)
+
+        // Don't retry validation errors
+        if provisioning.IsValidationError(err) {
+            return backoff.Permanent(err)
+        }
+
+        return err
+    }
+
+    exponentialBackoff := backoff.NewExponentialBackOff()
+    exponentialBackoff.MaxElapsedTime = 5 * time.Minute
+
+    err := backoff.Retry(operation, exponentialBackoff)
+    if err != nil {
+        return "", fmt.Errorf("failed after retries: %w", err)
+    }
+
+    return taskID, nil
+}
+
+func main() {
+    client, err := NewResilientClient(&provisioning.Config{
+        BaseURL:  "http://localhost:9090",
+        Username: "admin",
+        Password: "password",
+    })
+    if err != nil {
+        log.Fatalf("Failed to create client: %v", err)
+    }
+
+    ctx := context.Background()
+
+    // Authenticate with retry
+    _, err = client.Authenticate(ctx)
+    if err != nil {
+        log.Fatalf("Authentication failed: %v", err)
+    }
+
+    // Create workflow with retry
+    taskID, err := client.CreateServerWorkflowWithRetry(ctx, &provisioning.CreateServerRequest{
+        Infra:    "production",
+        Settings: "config.ncl",
+    })
+    if err != nil {
+        log.Fatalf("Failed to create workflow: %v", err)
+    }
+
+    fmt.Printf("Workflow created successfully: %s\n", taskID)
+}
+```
+
+## Rust SDK
+
+### Installation
+
+Add to your `Cargo.toml`:
+
+```text
+[dependencies]
+provisioning-rs = "2.0.0"
+tokio = { version = "1.0", features = ["full"] }
+```
+
+### Quick Start
+
+```text
+use provisioning_rs::{ProvisioningClient, Config, CreateServerRequest, TaskStatus};
+use tokio;
+
+#[tokio::main]
+async fn main() -> Result<(), Box<dyn std::error::Error>> {
+    // Initialize client
+    let config = Config {
+        base_url: "http://localhost:9090".to_string(),
+        auth_url: Some("http://localhost:8081".to_string()),
+        username: Some("admin".to_string()),
+        password: Some("your-password".to_string()),
+        token: None,
+    };
+
+    let mut client = ProvisioningClient::new(config);
+
+    // Authenticate
+    let token = client.authenticate().await?;
+    println!("Authenticated with token: {}...", &token[..20]);
+
+    // Create server workflow
+    let request = CreateServerRequest {
+        infra: "production".to_string(),
+        settings: Some("prod-settings.ncl".to_string()),
+        check_mode: false,
+        wait: false,
+    };
+
+    let task_id = client.create_server_workflow(request).await?;
+    println!("Server workflow created: {}", task_id);
+
+    // Wait for completion
+    let task = client.wait_for_task_completion(&task_id, std::time::Duration::from_secs(600)).await?;
+
+    println!("Task completed with status: {:?}", task.status);
+    match task.status {
+        TaskStatus::Completed => {
+            if let Some(output) = task.output {
+                println!("Output: {}", output);
+            }
+        },
+        TaskStatus::Failed => {
+            if let Some(error) = task.error {
+                println!("Error: {}", error);
+            }
+        },
+        _ => {}
+    }
+
+    Ok(())
+}
+```
+
+### WebSocket Integration
+
+```text
+use provisioning_rs::{ProvisioningClient, Config, WebSocketEvent};
+use futures_util::StreamExt;
+use tokio;
+
+#[tokio::main]
+async fn main() -> Result<(), Box<dyn std::error::Error>> {
+    let config = Config {
+        base_url: "http://localhost:9090".to_string(),
+        username: Some("admin".to_string()),
+        password: Some("password".to_string()),
+        ..Default::default()
+    };
+
+    let mut client = ProvisioningClient::new(config);
+
+    // Authenticate
+    client.authenticate().await?;
+
+    // Connect WebSocket
+    let mut ws = client.connect_websocket(vec![
+        "TaskStatusChanged".to_string(),
+        "WorkflowProgressUpdate".to_string(),
+    ]).await?;
+
+    // Handle events
+    tokio::spawn(async move {
+        while let Some(event) = ws.next().await {
+            match event {
+                Ok(WebSocketEvent::TaskStatusChanged { data }) => {
+                    println!("Task {} status changed to: {}", data.task_id, data.status);
+                },
+                Ok(WebSocketEvent::WorkflowProgressUpdate { data }) => {
+                    println!("Workflow progress: {}% - {}", data.progress, data.current_step);
+                },
+                Ok(WebSocketEvent::SystemHealthUpdate { data }) => {
+                    println!("System health: {}", data.overall_status);
+                },
+                Err(e) => {
+                    eprintln!("WebSocket error: {}", e);
+                    break;
+                }
+            }
+        }
+    });
+
+    // Keep the main thread alive
+    tokio::signal::ctrl_c().await?;
+    println!("Shutting down...");
+
+    Ok(())
+}
+```
+
+### Batch Operations
+
+```text
+use provisioning_rs::{ProvisioningClient, Config, BatchOperationRequest, BatchOperation};
+
+#[tokio::main]
+async fn main() -> Result<(), Box<dyn std::error::Error>> {
+    // Build the client configuration as in the Quick Start example
+    let config = Config::default();
+    let mut client = ProvisioningClient::new(config);
+    client.authenticate().await?;
+
+    // Define batch operation
+    let batch_request = 
BatchOperationRequest { + name: "production_deployment".to_string(), + version: "1.0.0".to_string(), + storage_backend: "surrealdb".to_string(), + parallel_limit: 5, + rollback_enabled: true, + operations: vec![ + BatchOperation { + id: "servers".to_string(), + operation_type: "server_batch".to_string(), + provider: "upcloud".to_string(), + dependencies: vec![], + config: serde_json::json!({ + "server_configs": [ + {"name": "web-01", "plan": "2xCPU-4 GB", "zone": "de-fra1"}, + {"name": "web-02", "plan": "2xCPU-4 GB", "zone": "de-fra1"} + ] + }), + }, + BatchOperation { + id: "kubernetes".to_string(), + operation_type: "taskserv_batch".to_string(), + provider: "upcloud".to_string(), + dependencies: vec!["servers".to_string()], + config: serde_json::json!({ + "taskservs": ["kubernetes", "cilium", "containerd"] + }), + }, + ], + }; + + // Execute batch operation + let batch_result = client.execute_batch_operation(batch_request).await?; + println!("Batch operation started: {}", batch_result.batch_id); + + // Monitor progress + loop { + let status = client.get_batch_status(&batch_result.batch_id).await?; + println!("Batch status: {} - {}%", status.status, status.progress.unwrap_or(0.0)); + + match status.status.as_str() { + "Completed" | "Failed" | "Cancelled" => break, + _ => tokio::time::sleep(std::time::Duration::from_secs(10)).await, + } + } + + Ok(()) +} +``` + +## Best Practices + +### Authentication and Security + +1. **Token Management**: Store tokens securely and implement automatic refresh +2. **Environment Variables**: Use environment variables for credentials +3. **HTTPS**: Always use HTTPS in production environments +4. **Token Expiration**: Handle token expiration gracefully + +### Error Handling + +1. **Specific Exceptions**: Handle specific error types appropriately +2. **Retry Logic**: Implement exponential backoff for transient failures +3. **Circuit Breakers**: Use circuit breakers for resilient integrations +4. **Logging**: Log errors with appropriate context + +### Performance Optimization + +1. **Connection Pooling**: Reuse HTTP connections +2. **Async Operations**: Use asynchronous operations where possible +3. **Batch Operations**: Group related operations for efficiency +4. **Caching**: Cache frequently accessed data appropriately + +### WebSocket Connections + +1. **Reconnection**: Implement automatic reconnection with backoff +2. **Event Filtering**: Subscribe only to needed event types +3. **Error Handling**: Handle WebSocket errors gracefully +4. **Resource Cleanup**: Properly close WebSocket connections + +### Testing + +1. **Unit Tests**: Test SDK functionality with mocked responses +2. **Integration Tests**: Test against real API endpoints +3. **Error Scenarios**: Test error handling paths +4. **Load Testing**: Validate performance under load + +This comprehensive SDK documentation provides developers with everything needed to integrate with provisioning using their preferred programming +language, complete with examples, best practices, and detailed API references. 
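+### Example: WebSocket Reconnection with Backoff
+
+As a concrete sketch of the reconnection guidance above, using only the Python SDK's documented `authenticate` and `connect_websocket` calls; the backoff parameters, the jitter, and the broad `except` clause are illustrative assumptions rather than documented SDK behavior:
+
+```text
+import asyncio
+import random
+
+from provisioning_client import ProvisioningClient
+
+async def monitor_with_reconnect(max_delay: float = 60.0):
+    client = ProvisioningClient()
+    delay = 1.0
+    while True:
+        try:
+            await client.authenticate()
+            await client.connect_websocket(['TaskStatusChanged', 'WorkflowProgressUpdate'])
+            delay = 1.0  # reset backoff after a successful connection
+            await asyncio.sleep(3600)  # monitoring window, as in the earlier example
+        except Exception as exc:  # the SDK's specific exceptions would be caught here
+            print(f"Connection lost ({exc}); reconnecting in {delay:.0f}s")
+            await asyncio.sleep(delay + random.random())  # jitter spreads reconnect storms
+            delay = min(delay * 2.0, max_delay)  # exponential backoff, capped
+
+asyncio.run(monitor_with_reconnect())
+```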
\ No newline at end of file diff --git a/docs/src/api-reference/websocket.md b/docs/src/api-reference/websocket.md index 0c23115..d9de10b 100644 --- a/docs/src/api-reference/websocket.md +++ b/docs/src/api-reference/websocket.md @@ -1 +1,892 @@ -# WebSocket API Reference\n\nThis document provides comprehensive documentation for the WebSocket API used for real-time monitoring, event streaming, and live updates in\nprovisioning.\n\n## Overview\n\nThe WebSocket API enables real-time communication between clients and the provisioning orchestrator, providing:\n\n- Live workflow progress updates\n- System health monitoring\n- Event streaming\n- Real-time metrics\n- Interactive debugging sessions\n\n## WebSocket Endpoints\n\n### Primary WebSocket Endpoint\n\n#### `ws://localhost:9090/ws`\n\nThe main WebSocket endpoint for real-time events and monitoring.\n\n**Connection Parameters:**\n\n- `token`: JWT authentication token (required)\n- `events`: Comma-separated list of event types to subscribe to (optional)\n- `batch_size`: Maximum number of events per message (default: 10)\n- `compression`: Enable message compression (default: false)\n\n**Example Connection:**\n\n```\nconst ws = new WebSocket('ws://localhost:9090/ws?token=jwt-token&events=task,batch,system');\n```\n\n### Specialized WebSocket Endpoints\n\n#### `ws://localhost:9090/metrics`\n\nReal-time metrics streaming endpoint.\n\n**Features:**\n\n- Live system metrics\n- Performance data\n- Resource utilization\n- Custom metric streams\n\n#### `ws://localhost:9090/logs`\n\nLive log streaming endpoint.\n\n**Features:**\n\n- Real-time log tailing\n- Log level filtering\n- Component-specific logs\n- Search and filtering\n\n## Authentication\n\n### JWT Token Authentication\n\nAll WebSocket connections require authentication via JWT token:\n\n```\n// Include token in connection URL\nconst ws = new WebSocket('ws://localhost:9090/ws?token=' + jwtToken);\n\n// Or send token after connection\nws.onopen = function() {\n ws.send(JSON.stringify({\n type: 'auth',\n token: jwtToken\n }));\n};\n```\n\n### Connection Authentication Flow\n\n1. **Initial Connection**: Client connects with token parameter\n2. **Token Validation**: Server validates JWT token\n3. **Authorization**: Server checks token permissions\n4. **Subscription**: Client subscribes to event types\n5. 
**Event Stream**: Server begins streaming events\n\n## Event Types and Schemas\n\n### Core Event Types\n\n#### Task Status Changed\n\nFired when a workflow task status changes.\n\n```\n{\n "event_type": "TaskStatusChanged",\n "timestamp": "2025-09-26T10:00:00Z",\n "data": {\n "task_id": "uuid-string",\n "name": "create_servers",\n "status": "Running",\n "previous_status": "Pending",\n "progress": 45.5\n },\n "metadata": {\n "task_id": "uuid-string",\n "workflow_type": "server_creation",\n "infra": "production"\n }\n}\n```\n\n#### Batch Operation Update\n\nFired when batch operation status changes.\n\n```\n{\n "event_type": "BatchOperationUpdate",\n "timestamp": "2025-09-26T10:00:00Z",\n "data": {\n "batch_id": "uuid-string",\n "name": "multi_cloud_deployment",\n "status": "Running",\n "progress": 65.0,\n "operations": [\n {\n "id": "upcloud_servers",\n "status": "Completed",\n "progress": 100.0\n },\n {\n "id": "aws_taskservs",\n "status": "Running",\n "progress": 30.0\n }\n ]\n },\n "metadata": {\n "total_operations": 5,\n "completed_operations": 2,\n "failed_operations": 0\n }\n}\n```\n\n#### System Health Update\n\nFired when system health status changes.\n\n```\n{\n "event_type": "SystemHealthUpdate",\n "timestamp": "2025-09-26T10:00:00Z",\n "data": {\n "overall_status": "Healthy",\n "components": {\n "storage": {\n "status": "Healthy",\n "last_check": "2025-09-26T09:59:55Z"\n },\n "batch_coordinator": {\n "status": "Warning",\n "last_check": "2025-09-26T09:59:55Z",\n "message": "High memory usage"\n }\n },\n "metrics": {\n "cpu_usage": 45.2,\n "memory_usage": 2048,\n "disk_usage": 75.5,\n "active_workflows": 5\n }\n },\n "metadata": {\n "check_interval": 30,\n "next_check": "2025-09-26T10:00:30Z"\n }\n}\n```\n\n#### Workflow Progress Update\n\nFired when workflow progress changes.\n\n```\n{\n "event_type": "WorkflowProgressUpdate",\n "timestamp": "2025-09-26T10:00:00Z",\n "data": {\n "workflow_id": "uuid-string",\n "name": "kubernetes_deployment",\n "progress": 75.0,\n "current_step": "Installing CNI",\n "total_steps": 8,\n "completed_steps": 6,\n "estimated_time_remaining": 120,\n "step_details": {\n "step_name": "Installing CNI",\n "step_progress": 45.0,\n "step_message": "Downloading Cilium components"\n }\n },\n "metadata": {\n "infra": "production",\n "provider": "upcloud",\n "started_at": "2025-09-26T09:45:00Z"\n }\n}\n```\n\n#### Log Entry\n\nReal-time log streaming.\n\n```\n{\n "event_type": "LogEntry",\n "timestamp": "2025-09-26T10:00:00Z",\n "data": {\n "level": "INFO",\n "message": "Server web-01 created successfully",\n "component": "server-manager",\n "task_id": "uuid-string",\n "details": {\n "server_id": "server-uuid",\n "hostname": "web-01",\n "ip_address": "10.0.1.100"\n }\n },\n "metadata": {\n "source": "orchestrator",\n "thread": "worker-1"\n }\n}\n```\n\n#### Metric Update\n\nReal-time metrics streaming.\n\n```\n{\n "event_type": "MetricUpdate",\n "timestamp": "2025-09-26T10:00:00Z",\n "data": {\n "metric_name": "workflow_duration",\n "metric_type": "histogram",\n "value": 180.5,\n "labels": {\n "workflow_type": "server_creation",\n "status": "completed",\n "infra": "production"\n }\n },\n "metadata": {\n "interval": 15,\n "aggregation": "average"\n }\n}\n```\n\n### Custom Event Types\n\nApplications can define custom event types:\n\n```\n{\n "event_type": "CustomApplicationEvent",\n "timestamp": "2025-09-26T10:00:00Z",\n "data": {\n // Custom event data\n },\n "metadata": {\n "custom_field": "custom_value"\n }\n}\n```\n\n## Client-Side JavaScript API\n\n### 
Connection Management\n\n```\nclass ProvisioningWebSocket {\n constructor(baseUrl, token, options = {}) {\n this.baseUrl = baseUrl;\n this.token = token;\n this.options = {\n reconnect: true,\n reconnectInterval: 5000,\n maxReconnectAttempts: 10,\n ...options\n };\n this.ws = null;\n this.reconnectAttempts = 0;\n this.eventHandlers = new Map();\n }\n\n connect() {\n const wsUrl = `${this.baseUrl}/ws?token=${this.token}`;\n this.ws = new WebSocket(wsUrl);\n\n this.ws.onopen = (event) => {\n console.log('WebSocket connected');\n this.reconnectAttempts = 0;\n this.emit('connected', event);\n };\n\n this.ws.onmessage = (event) => {\n try {\n const message = JSON.parse(event.data);\n this.handleMessage(message);\n } catch (error) {\n console.error('Failed to parse WebSocket message:', error);\n }\n };\n\n this.ws.onclose = (event) => {\n console.log('WebSocket disconnected');\n this.emit('disconnected', event);\n\n if (this.options.reconnect && this.reconnectAttempts < this.options.maxReconnectAttempts) {\n setTimeout(() => {\n this.reconnectAttempts++;\n console.log(`Reconnecting... (${this.reconnectAttempts}/${this.options.maxReconnectAttempts})`);\n this.connect();\n }, this.options.reconnectInterval);\n }\n };\n\n this.ws.onerror = (error) => {\n console.error('WebSocket error:', error);\n this.emit('error', error);\n };\n }\n\n handleMessage(message) {\n if (message.event_type) {\n this.emit(message.event_type, message);\n this.emit('message', message);\n }\n }\n\n on(eventType, handler) {\n if (!this.eventHandlers.has(eventType)) {\n this.eventHandlers.set(eventType, []);\n }\n this.eventHandlers.get(eventType).push(handler);\n }\n\n off(eventType, handler) {\n const handlers = this.eventHandlers.get(eventType);\n if (handlers) {\n const index = handlers.indexOf(handler);\n if (index > -1) {\n handlers.splice(index, 1);\n }\n }\n }\n\n emit(eventType, data) {\n const handlers = this.eventHandlers.get(eventType);\n if (handlers) {\n handlers.forEach(handler => {\n try {\n handler(data);\n } catch (error) {\n console.error(`Error in event handler for ${eventType}:`, error);\n }\n });\n }\n }\n\n send(message) {\n if (this.ws && this.ws.readyState === WebSocket.OPEN) {\n this.ws.send(JSON.stringify(message));\n } else {\n console.warn('WebSocket not connected, message not sent');\n }\n }\n\n disconnect() {\n this.options.reconnect = false;\n if (this.ws) {\n this.ws.close();\n }\n }\n\n subscribe(eventTypes) {\n this.send({\n type: 'subscribe',\n events: Array.isArray(eventTypes) ? eventTypes : [eventTypes]\n });\n }\n\n unsubscribe(eventTypes) {\n this.send({\n type: 'unsubscribe',\n events: Array.isArray(eventTypes) ? 
eventTypes : [eventTypes]\n });\n }\n}\n\n// Usage example\nconst ws = new ProvisioningWebSocket('ws://localhost:9090', 'your-jwt-token');\n\nws.on('TaskStatusChanged', (event) => {\n console.log(`Task ${event.data.task_id} status: ${event.data.status}`);\n updateTaskUI(event.data);\n});\n\nws.on('WorkflowProgressUpdate', (event) => {\n console.log(`Workflow progress: ${event.data.progress}%`);\n updateProgressBar(event.data.progress);\n});\n\nws.on('SystemHealthUpdate', (event) => {\n console.log('System health:', event.data.overall_status);\n updateHealthIndicator(event.data);\n});\n\nws.connect();\n\n// Subscribe to specific events\nws.subscribe(['TaskStatusChanged', 'WorkflowProgressUpdate']);\n```\n\n### Real-Time Dashboard Example\n\n```\nclass ProvisioningDashboard {\n constructor(wsUrl, token) {\n this.ws = new ProvisioningWebSocket(wsUrl, token);\n this.setupEventHandlers();\n this.connect();\n }\n\n setupEventHandlers() {\n this.ws.on('TaskStatusChanged', this.handleTaskUpdate.bind(this));\n this.ws.on('BatchOperationUpdate', this.handleBatchUpdate.bind(this));\n this.ws.on('SystemHealthUpdate', this.handleHealthUpdate.bind(this));\n this.ws.on('WorkflowProgressUpdate', this.handleProgressUpdate.bind(this));\n this.ws.on('LogEntry', this.handleLogEntry.bind(this));\n }\n\n connect() {\n this.ws.connect();\n }\n\n handleTaskUpdate(event) {\n const taskCard = document.getElementById(`task-${event.data.task_id}`);\n if (taskCard) {\n taskCard.querySelector('.status').textContent = event.data.status;\n taskCard.querySelector('.status').className = `status ${event.data.status.toLowerCase()}`;\n\n if (event.data.progress) {\n const progressBar = taskCard.querySelector('.progress-bar');\n progressBar.style.width = `${event.data.progress}%`;\n }\n }\n }\n\n handleBatchUpdate(event) {\n const batchCard = document.getElementById(`batch-${event.data.batch_id}`);\n if (batchCard) {\n batchCard.querySelector('.batch-progress').style.width = `${event.data.progress}%`;\n\n event.data.operations.forEach(op => {\n const opElement = batchCard.querySelector(`[data-operation="${op.id}"]`);\n if (opElement) {\n opElement.querySelector('.operation-status').textContent = op.status;\n opElement.querySelector('.operation-progress').style.width = `${op.progress}%`;\n }\n });\n }\n }\n\n handleHealthUpdate(event) {\n const healthIndicator = document.getElementById('health-indicator');\n healthIndicator.className = `health-indicator ${event.data.overall_status.toLowerCase()}`;\n healthIndicator.textContent = event.data.overall_status;\n\n const metricsPanel = document.getElementById('metrics-panel');\n metricsPanel.innerHTML = `\n
<div class="metric">CPU: ${event.data.metrics.cpu_usage}%</div>\n <div class="metric">Memory: ${Math.round(event.data.metrics.memory_usage / 1024 / 1024)}MB</div>\n <div class="metric">Disk: ${event.data.metrics.disk_usage}%</div>\n <div class="metric">Active Workflows: ${event.data.metrics.active_workflows}</div>
\n `;\n }\n\n handleProgressUpdate(event) {\n const workflowCard = document.getElementById(`workflow-${event.data.workflow_id}`);\n if (workflowCard) {\n const progressBar = workflowCard.querySelector('.workflow-progress');\n const stepInfo = workflowCard.querySelector('.step-info');\n\n progressBar.style.width = `${event.data.progress}%`;\n stepInfo.textContent = `${event.data.current_step} (${event.data.completed_steps}/${event.data.total_steps})`;\n\n if (event.data.estimated_time_remaining) {\n const timeRemaining = workflowCard.querySelector('.time-remaining');\n timeRemaining.textContent = `${Math.round(event.data.estimated_time_remaining / 60)} min remaining`;\n }\n }\n }\n\n handleLogEntry(event) {\n const logContainer = document.getElementById('log-container');\n const logEntry = document.createElement('div');\n logEntry.className = `log-entry log-${event.data.level.toLowerCase()}`;\n logEntry.innerHTML = `\n ${new Date(event.timestamp).toLocaleTimeString()}\n ${event.data.level}\n ${event.data.component}\n ${event.data.message}\n `;\n\n logContainer.appendChild(logEntry);\n\n // Auto-scroll to bottom\n logContainer.scrollTop = logContainer.scrollHeight;\n\n // Limit log entries to prevent memory issues\n const maxLogEntries = 1000;\n if (logContainer.children.length > maxLogEntries) {\n logContainer.removeChild(logContainer.firstChild);\n }\n }\n}\n\n// Initialize dashboard\nconst dashboard = new ProvisioningDashboard('ws://localhost:9090', jwtToken);\n```\n\n## Server-Side Implementation\n\n### Rust WebSocket Handler\n\nThe orchestrator implements WebSocket support using Axum and Tokio:\n\n```\nuse axum::{\n extract::{ws::WebSocket, ws::WebSocketUpgrade, Query, State},\n response::Response,\n};\nuse serde::{Deserialize, Serialize};\nuse std::collections::HashMap;\nuse tokio::sync::broadcast;\n\n#[derive(Debug, Deserialize)]\npub struct WsQuery {\n token: String,\n events: Option,\n batch_size: Option,\n compression: Option,\n}\n\n#[derive(Debug, Clone, Serialize)]\npub struct WebSocketMessage {\n pub event_type: String,\n pub timestamp: chrono::DateTime,\n pub data: serde_json::Value,\n pub metadata: HashMap,\n}\n\npub async fn websocket_handler(\n ws: WebSocketUpgrade,\n Query(params): Query,\n State(state): State,\n) -> Response {\n // Validate JWT token\n let claims = match state.auth_service.validate_token(¶ms.token) {\n Ok(claims) => claims,\n Err(_) => return Response::builder()\n .status(401)\n .body("Unauthorized".into())\n .unwrap(),\n };\n\n ws.on_upgrade(move |socket| handle_socket(socket, params, claims, state))\n}\n\nasync fn handle_socket(\n socket: WebSocket,\n params: WsQuery,\n claims: Claims,\n state: SharedState,\n) {\n let (mut sender, mut receiver) = socket.split();\n\n // Subscribe to event stream\n let mut event_rx = state.monitoring_system.subscribe_to_events().await;\n\n // Parse requested event types\n let requested_events: Vec = params.events\n .unwrap_or_default()\n .split(',')\n .map(|s| s.trim().to_string())\n .filter(|s| !s.is_empty())\n .collect();\n\n // Handle incoming messages from client\n let sender_task = tokio::spawn(async move {\n while let Some(msg) = receiver.next().await {\n if let Ok(msg) = msg {\n if let Ok(text) = msg.to_text() {\n if let Ok(client_msg) = serde_json::from_str::(text) {\n handle_client_message(client_msg, &state).await;\n }\n }\n }\n }\n });\n\n // Handle outgoing messages to client\n let receiver_task = tokio::spawn(async move {\n let mut batch = Vec::new();\n let batch_size = params.batch_size.unwrap_or(10);\n\n 
while let Ok(event) = event_rx.recv().await {\n // Filter events based on subscription\n if !requested_events.is_empty() && !requested_events.contains(&event.event_type) {\n continue;\n }\n\n // Check permissions\n if !has_event_permission(&claims, &event.event_type) {\n continue;\n }\n\n batch.push(event);\n\n // Send batch when full or after timeout\n if batch.len() >= batch_size {\n send_event_batch(&mut sender, &batch).await;\n batch.clear();\n }\n }\n });\n\n // Wait for either task to complete\n tokio::select! {\n _ = sender_task => {},\n _ = receiver_task => {},\n }\n}\n\n#[derive(Debug, Deserialize)]\nstruct ClientMessage {\n #[serde(rename = "type")]\n msg_type: String,\n token: Option,\n events: Option>,\n}\n\nasync fn handle_client_message(msg: ClientMessage, state: &SharedState) {\n match msg.msg_type.as_str() {\n "subscribe" => {\n // Handle event subscription\n },\n "unsubscribe" => {\n // Handle event unsubscription\n },\n "auth" => {\n // Handle re-authentication\n },\n _ => {\n // Unknown message type\n }\n }\n}\n\nasync fn send_event_batch(sender: &mut SplitSink, batch: &[WebSocketMessage]) {\n let batch_msg = serde_json::json!({\n "type": "batch",\n "events": batch\n });\n\n if let Ok(msg_text) = serde_json::to_string(&batch_msg) {\n if let Err(e) = sender.send(Message::Text(msg_text)).await {\n eprintln!("Failed to send WebSocket message: {}", e);\n }\n }\n}\n\nfn has_event_permission(claims: &Claims, event_type: &str) -> bool {\n // Check if user has permission to receive this event type\n match event_type {\n "SystemHealthUpdate" => claims.role.contains(&"admin".to_string()),\n "LogEntry" => claims.role.contains(&"admin".to_string()) ||\n claims.role.contains(&"developer".to_string()),\n _ => true, // Most events are accessible to all authenticated users\n }\n}\n```\n\n## Event Filtering and Subscriptions\n\n### Client-Side Filtering\n\n```\n// Subscribe to specific event types\nws.subscribe(['TaskStatusChanged', 'WorkflowProgressUpdate']);\n\n// Subscribe with filters\nws.send({\n type: 'subscribe',\n events: ['TaskStatusChanged'],\n filters: {\n task_name: 'create_servers',\n status: ['Running', 'Completed', 'Failed']\n }\n});\n\n// Advanced filtering\nws.send({\n type: 'subscribe',\n events: ['LogEntry'],\n filters: {\n level: ['ERROR', 'WARN'],\n component: ['server-manager', 'batch-coordinator'],\n since: '2025-09-26T10:00:00Z'\n }\n});\n```\n\n### Server-Side Event Filtering\n\nEvents can be filtered on the server side based on:\n\n- User permissions and roles\n- Event type subscriptions\n- Custom filter criteria\n- Rate limiting\n\n## Error Handling and Reconnection\n\n### Connection Errors\n\n```\nws.on('error', (error) => {\n console.error('WebSocket error:', error);\n\n // Handle specific error types\n if (error.code === 1006) {\n // Abnormal closure, attempt reconnection\n setTimeout(() => ws.connect(), 5000);\n } else if (error.code === 1008) {\n // Policy violation, check token\n refreshTokenAndReconnect();\n }\n});\n\nws.on('disconnected', (event) => {\n console.log(`WebSocket disconnected: ${event.code} - ${event.reason}`);\n\n // Handle different close codes\n switch (event.code) {\n case 1000: // Normal closure\n console.log('Connection closed normally');\n break;\n case 1001: // Going away\n console.log('Server is shutting down');\n break;\n case 4001: // Custom: Token expired\n refreshTokenAndReconnect();\n break;\n default:\n // Attempt reconnection for other errors\n if (shouldReconnect()) {\n scheduleReconnection();\n }\n }\n});\n```\n\n### 
Heartbeat and Keep-Alive\n\n```\nclass ProvisioningWebSocket {\n constructor(baseUrl, token, options = {}) {\n // ... existing code ...\n this.heartbeatInterval = options.heartbeatInterval || 30000;\n this.heartbeatTimer = null;\n }\n\n connect() {\n // ... existing connection code ...\n\n this.ws.onopen = (event) => {\n console.log('WebSocket connected');\n this.startHeartbeat();\n this.emit('connected', event);\n };\n\n this.ws.onclose = (event) => {\n this.stopHeartbeat();\n // ... existing close handling ...\n };\n }\n\n startHeartbeat() {\n this.heartbeatTimer = setInterval(() => {\n if (this.ws && this.ws.readyState === WebSocket.OPEN) {\n this.send({ type: 'ping' });\n }\n }, this.heartbeatInterval);\n }\n\n stopHeartbeat() {\n if (this.heartbeatTimer) {\n clearInterval(this.heartbeatTimer);\n this.heartbeatTimer = null;\n }\n }\n\n handleMessage(message) {\n if (message.type === 'pong') {\n // Heartbeat response received\n return;\n }\n\n // ... existing message handling ...\n }\n}\n```\n\n## Performance Considerations\n\n### Message Batching\n\nTo improve performance, the server can batch multiple events into single WebSocket messages:\n\n```\n{\n "type": "batch",\n "timestamp": "2025-09-26T10:00:00Z",\n "events": [\n {\n "event_type": "TaskStatusChanged",\n "data": { ... }\n },\n {\n "event_type": "WorkflowProgressUpdate",\n "data": { ... }\n }\n ]\n}\n```\n\n### Compression\n\nEnable message compression for large events:\n\n```\nconst ws = new WebSocket('ws://localhost:9090/ws?token=jwt&compression=true');\n```\n\n### Rate Limiting\n\nThe server implements rate limiting to prevent abuse:\n\n- Maximum connections per user: 10\n- Maximum messages per second: 100\n- Maximum subscription events: 50\n\n## Security Considerations\n\n### Authentication and Authorization\n\n- All connections require valid JWT tokens\n- Tokens are validated on connection and periodically renewed\n- Event access is controlled by user roles and permissions\n\n### Message Validation\n\n- All incoming messages are validated against schemas\n- Malformed messages are rejected\n- Rate limiting prevents DoS attacks\n\n### Data Sanitization\n\n- All event data is sanitized before transmission\n- Sensitive information is filtered based on user permissions\n- PII and secrets are never transmitted\n\nThis WebSocket API provides a robust, real-time communication channel for monitoring and managing provisioning with comprehensive security and\nperformance features. +# WebSocket API Reference + +This document provides comprehensive documentation for the WebSocket API used for real-time monitoring, event streaming, and live updates in +provisioning. + +## Overview + +The WebSocket API enables real-time communication between clients and the provisioning orchestrator, providing: + +- Live workflow progress updates +- System health monitoring +- Event streaming +- Real-time metrics +- Interactive debugging sessions + +## WebSocket Endpoints + +### Primary WebSocket Endpoint + +#### `ws://localhost:9090/ws` + +The main WebSocket endpoint for real-time events and monitoring. 
+ +**Connection Parameters:** + +- `token`: JWT authentication token (required) +- `events`: Comma-separated list of event types to subscribe to (optional) +- `batch_size`: Maximum number of events per message (default: 10) +- `compression`: Enable message compression (default: false) + +**Example Connection:** + +```text +const ws = new WebSocket('ws://localhost:9090/ws?token=jwt-token&events=task,batch,system'); +``` + +### Specialized WebSocket Endpoints + +#### `ws://localhost:9090/metrics` + +Real-time metrics streaming endpoint. + +**Features:** + +- Live system metrics +- Performance data +- Resource utilization +- Custom metric streams + +#### `ws://localhost:9090/logs` + +Live log streaming endpoint. + +**Features:** + +- Real-time log tailing +- Log level filtering +- Component-specific logs +- Search and filtering + +## Authentication + +### JWT Token Authentication + +All WebSocket connections require authentication via JWT token: + +```text +// Include token in connection URL +const ws = new WebSocket('ws://localhost:9090/ws?token=' + jwtToken); + +// Or send token after connection +ws.onopen = function() { + ws.send(JSON.stringify({ + type: 'auth', + token: jwtToken + })); +}; +``` + +### Connection Authentication Flow + +1. **Initial Connection**: Client connects with token parameter +2. **Token Validation**: Server validates JWT token +3. **Authorization**: Server checks token permissions +4. **Subscription**: Client subscribes to event types +5. **Event Stream**: Server begins streaming events + +## Event Types and Schemas + +### Core Event Types + +#### Task Status Changed + +Fired when a workflow task status changes. + +```text +{ + "event_type": "TaskStatusChanged", + "timestamp": "2025-09-26T10:00:00Z", + "data": { + "task_id": "uuid-string", + "name": "create_servers", + "status": "Running", + "previous_status": "Pending", + "progress": 45.5 + }, + "metadata": { + "task_id": "uuid-string", + "workflow_type": "server_creation", + "infra": "production" + } +} +``` + +#### Batch Operation Update + +Fired when batch operation status changes. + +```text +{ + "event_type": "BatchOperationUpdate", + "timestamp": "2025-09-26T10:00:00Z", + "data": { + "batch_id": "uuid-string", + "name": "multi_cloud_deployment", + "status": "Running", + "progress": 65.0, + "operations": [ + { + "id": "upcloud_servers", + "status": "Completed", + "progress": 100.0 + }, + { + "id": "aws_taskservs", + "status": "Running", + "progress": 30.0 + } + ] + }, + "metadata": { + "total_operations": 5, + "completed_operations": 2, + "failed_operations": 0 + } +} +``` + +#### System Health Update + +Fired when system health status changes. + +```text +{ + "event_type": "SystemHealthUpdate", + "timestamp": "2025-09-26T10:00:00Z", + "data": { + "overall_status": "Healthy", + "components": { + "storage": { + "status": "Healthy", + "last_check": "2025-09-26T09:59:55Z" + }, + "batch_coordinator": { + "status": "Warning", + "last_check": "2025-09-26T09:59:55Z", + "message": "High memory usage" + } + }, + "metrics": { + "cpu_usage": 45.2, + "memory_usage": 2048, + "disk_usage": 75.5, + "active_workflows": 5 + } + }, + "metadata": { + "check_interval": 30, + "next_check": "2025-09-26T10:00:30Z" + } +} +``` + +#### Workflow Progress Update + +Fired when workflow progress changes. 
+ +```text +{ + "event_type": "WorkflowProgressUpdate", + "timestamp": "2025-09-26T10:00:00Z", + "data": { + "workflow_id": "uuid-string", + "name": "kubernetes_deployment", + "progress": 75.0, + "current_step": "Installing CNI", + "total_steps": 8, + "completed_steps": 6, + "estimated_time_remaining": 120, + "step_details": { + "step_name": "Installing CNI", + "step_progress": 45.0, + "step_message": "Downloading Cilium components" + } + }, + "metadata": { + "infra": "production", + "provider": "upcloud", + "started_at": "2025-09-26T09:45:00Z" + } +} +``` + +#### Log Entry + +Real-time log streaming. + +```text +{ + "event_type": "LogEntry", + "timestamp": "2025-09-26T10:00:00Z", + "data": { + "level": "INFO", + "message": "Server web-01 created successfully", + "component": "server-manager", + "task_id": "uuid-string", + "details": { + "server_id": "server-uuid", + "hostname": "web-01", + "ip_address": "10.0.1.100" + } + }, + "metadata": { + "source": "orchestrator", + "thread": "worker-1" + } +} +``` + +#### Metric Update + +Real-time metrics streaming. + +```text +{ + "event_type": "MetricUpdate", + "timestamp": "2025-09-26T10:00:00Z", + "data": { + "metric_name": "workflow_duration", + "metric_type": "histogram", + "value": 180.5, + "labels": { + "workflow_type": "server_creation", + "status": "completed", + "infra": "production" + } + }, + "metadata": { + "interval": 15, + "aggregation": "average" + } +} +``` + +### Custom Event Types + +Applications can define custom event types: + +```text +{ + "event_type": "CustomApplicationEvent", + "timestamp": "2025-09-26T10:00:00Z", + "data": { + // Custom event data + }, + "metadata": { + "custom_field": "custom_value" + } +} +``` + +## Client-Side JavaScript API + +### Connection Management + +```text +class ProvisioningWebSocket { + constructor(baseUrl, token, options = {}) { + this.baseUrl = baseUrl; + this.token = token; + this.options = { + reconnect: true, + reconnectInterval: 5000, + maxReconnectAttempts: 10, + ...options + }; + this.ws = null; + this.reconnectAttempts = 0; + this.eventHandlers = new Map(); + } + + connect() { + const wsUrl = `${this.baseUrl}/ws?token=${this.token}`; + this.ws = new WebSocket(wsUrl); + + this.ws.onopen = (event) => { + console.log('WebSocket connected'); + this.reconnectAttempts = 0; + this.emit('connected', event); + }; + + this.ws.onmessage = (event) => { + try { + const message = JSON.parse(event.data); + this.handleMessage(message); + } catch (error) { + console.error('Failed to parse WebSocket message:', error); + } + }; + + this.ws.onclose = (event) => { + console.log('WebSocket disconnected'); + this.emit('disconnected', event); + + if (this.options.reconnect && this.reconnectAttempts < this.options.maxReconnectAttempts) { + setTimeout(() => { + this.reconnectAttempts++; + console.log(`Reconnecting... 
(${this.reconnectAttempts}/${this.options.maxReconnectAttempts})`); + this.connect(); + }, this.options.reconnectInterval); + } + }; + + this.ws.onerror = (error) => { + console.error('WebSocket error:', error); + this.emit('error', error); + }; + } + + handleMessage(message) { + if (message.event_type) { + this.emit(message.event_type, message); + this.emit('message', message); + } + } + + on(eventType, handler) { + if (!this.eventHandlers.has(eventType)) { + this.eventHandlers.set(eventType, []); + } + this.eventHandlers.get(eventType).push(handler); + } + + off(eventType, handler) { + const handlers = this.eventHandlers.get(eventType); + if (handlers) { + const index = handlers.indexOf(handler); + if (index > -1) { + handlers.splice(index, 1); + } + } + } + + emit(eventType, data) { + const handlers = this.eventHandlers.get(eventType); + if (handlers) { + handlers.forEach(handler => { + try { + handler(data); + } catch (error) { + console.error(`Error in event handler for ${eventType}:`, error); + } + }); + } + } + + send(message) { + if (this.ws && this.ws.readyState === WebSocket.OPEN) { + this.ws.send(JSON.stringify(message)); + } else { + console.warn('WebSocket not connected, message not sent'); + } + } + + disconnect() { + this.options.reconnect = false; + if (this.ws) { + this.ws.close(); + } + } + + subscribe(eventTypes) { + this.send({ + type: 'subscribe', + events: Array.isArray(eventTypes) ? eventTypes : [eventTypes] + }); + } + + unsubscribe(eventTypes) { + this.send({ + type: 'unsubscribe', + events: Array.isArray(eventTypes) ? eventTypes : [eventTypes] + }); + } +} + +// Usage example +const ws = new ProvisioningWebSocket('ws://localhost:9090', 'your-jwt-token'); + +ws.on('TaskStatusChanged', (event) => { + console.log(`Task ${event.data.task_id} status: ${event.data.status}`); + updateTaskUI(event.data); +}); + +ws.on('WorkflowProgressUpdate', (event) => { + console.log(`Workflow progress: ${event.data.progress}%`); + updateProgressBar(event.data.progress); +}); + +ws.on('SystemHealthUpdate', (event) => { + console.log('System health:', event.data.overall_status); + updateHealthIndicator(event.data); +}); + +ws.connect(); + +// Subscribe to specific events +ws.subscribe(['TaskStatusChanged', 'WorkflowProgressUpdate']); +``` + +### Real-Time Dashboard Example + +```text +class ProvisioningDashboard { + constructor(wsUrl, token) { + this.ws = new ProvisioningWebSocket(wsUrl, token); + this.setupEventHandlers(); + this.connect(); + } + + setupEventHandlers() { + this.ws.on('TaskStatusChanged', this.handleTaskUpdate.bind(this)); + this.ws.on('BatchOperationUpdate', this.handleBatchUpdate.bind(this)); + this.ws.on('SystemHealthUpdate', this.handleHealthUpdate.bind(this)); + this.ws.on('WorkflowProgressUpdate', this.handleProgressUpdate.bind(this)); + this.ws.on('LogEntry', this.handleLogEntry.bind(this)); + } + + connect() { + this.ws.connect(); + } + + handleTaskUpdate(event) { + const taskCard = document.getElementById(`task-${event.data.task_id}`); + if (taskCard) { + taskCard.querySelector('.status').textContent = event.data.status; + taskCard.querySelector('.status').className = `status ${event.data.status.toLowerCase()}`; + + if (event.data.progress) { + const progressBar = taskCard.querySelector('.progress-bar'); + progressBar.style.width = `${event.data.progress}%`; + } + } + } + + handleBatchUpdate(event) { + const batchCard = document.getElementById(`batch-${event.data.batch_id}`); + if (batchCard) { + batchCard.querySelector('.batch-progress').style.width = 
`${event.data.progress}%`; + + event.data.operations.forEach(op => { + const opElement = batchCard.querySelector(`[data-operation="${op.id}"]`); + if (opElement) { + opElement.querySelector('.operation-status').textContent = op.status; + opElement.querySelector('.operation-progress').style.width = `${op.progress}%`; + } + }); + } + } + + handleHealthUpdate(event) { + const healthIndicator = document.getElementById('health-indicator'); + healthIndicator.className = `health-indicator ${event.data.overall_status.toLowerCase()}`; + healthIndicator.textContent = event.data.overall_status; + + const metricsPanel = document.getElementById('metrics-panel'); + metricsPanel.innerHTML = ` +
<div>CPU: ${event.data.metrics.cpu_usage}%</div>
+            <div>Memory: ${Math.round(event.data.metrics.memory_usage / 1024 / 1024)}MB</div>
+            <div>Disk: ${event.data.metrics.disk_usage}%</div>
+            <div>Active Workflows: ${event.data.metrics.active_workflows}</div>
+        `;
+    }
+
+    handleProgressUpdate(event) {
+        const workflowCard = document.getElementById(`workflow-${event.data.workflow_id}`);
+        if (workflowCard) {
+            const progressBar = workflowCard.querySelector('.workflow-progress');
+            const stepInfo = workflowCard.querySelector('.step-info');
+
+            progressBar.style.width = `${event.data.progress}%`;
+            stepInfo.textContent = `${event.data.current_step} (${event.data.completed_steps}/${event.data.total_steps})`;
+
+            if (event.data.estimated_time_remaining) {
+                const timeRemaining = workflowCard.querySelector('.time-remaining');
+                timeRemaining.textContent = `${Math.round(event.data.estimated_time_remaining / 60)} min remaining`;
+            }
+        }
+    }
+
+    handleLogEntry(event) {
+        const logContainer = document.getElementById('log-container');
+        const logEntry = document.createElement('div');
+        logEntry.className = `log-entry log-${event.data.level.toLowerCase()}`;
+        logEntry.innerHTML = `
+            <span>${new Date(event.timestamp).toLocaleTimeString()}</span>
+            <span>${event.data.level}</span>
+            <span>${event.data.component}</span>
+            <span>${event.data.message}</span>
+        `;
+
+        logContainer.appendChild(logEntry);
+
+        // Auto-scroll to bottom
+        logContainer.scrollTop = logContainer.scrollHeight;
+
+        // Limit log entries to prevent memory issues
+        const maxLogEntries = 1000;
+        if (logContainer.children.length > maxLogEntries) {
+            logContainer.removeChild(logContainer.firstChild);
+        }
+    }
+}
+
+// Initialize dashboard
+const dashboard = new ProvisioningDashboard('ws://localhost:9090', jwtToken);
+```
+
+## Server-Side Implementation
+
+### Rust WebSocket Handler
+
+The orchestrator implements WebSocket support using Axum and Tokio:
+
+```text
+use axum::{
+    extract::{ws::{Message, WebSocket, WebSocketUpgrade}, Query, State},
+    response::Response,
+};
+use futures::{stream::SplitSink, SinkExt, StreamExt};
+use serde::{Deserialize, Serialize};
+use std::collections::HashMap;
+use tokio::sync::broadcast;
+
+#[derive(Debug, Deserialize)]
+pub struct WsQuery {
+    token: String,
+    events: Option<String>,
+    batch_size: Option<usize>,
+    compression: Option<bool>,
+}
+
+#[derive(Debug, Clone, Serialize)]
+pub struct WebSocketMessage {
+    pub event_type: String,
+    pub timestamp: chrono::DateTime<chrono::Utc>,
+    pub data: serde_json::Value,
+    pub metadata: HashMap<String, serde_json::Value>,
+}
+
+pub async fn websocket_handler(
+    ws: WebSocketUpgrade,
+    Query(params): Query<WsQuery>,
+    State(state): State<SharedState>,
+) -> Response {
+    // Validate JWT token
+    let claims = match state.auth_service.validate_token(&params.token) {
+        Ok(claims) => claims,
+        Err(_) => return Response::builder()
+            .status(401)
+            .body("Unauthorized".into())
+            .unwrap(),
+    };
+
+    ws.on_upgrade(move |socket| handle_socket(socket, params, claims, state))
+}
+
+async fn handle_socket(
+    socket: WebSocket,
+    params: WsQuery,
+    claims: Claims,
+    state: SharedState,
+) {
+    let (mut sender, mut receiver) = socket.split();
+
+    // Subscribe to event stream
+    let mut event_rx = state.monitoring_system.subscribe_to_events().await;
+
+    // Parse requested event types
+    let requested_events: Vec<String> = params.events
+        .unwrap_or_default()
+        .split(',')
+        .map(|s| s.trim().to_string())
+        .filter(|s| !s.is_empty())
+        .collect();
+
+    // Handle incoming messages from client
+    let recv_task = tokio::spawn(async move {
+        while let Some(msg) = receiver.next().await {
+            if let Ok(msg) = msg {
+                if let Ok(text) = msg.to_text() {
+                    if let Ok(client_msg) = serde_json::from_str::<ClientMessage>(text) {
+                        handle_client_message(client_msg, &state).await;
+                    }
+                }
+            }
+        }
+    });
+
+    // Handle outgoing messages to client
+    let send_task = tokio::spawn(async move {
+        let mut batch = Vec::new();
+        let batch_size = params.batch_size.unwrap_or(10);
+
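+        // NOTE: this sketch flushes only when the batch is full. The "after
+        // timeout" flush named in the comment below would also need a timer,
+        // for example a tokio::time::interval polled alongside recv() in a
+        // tokio::select!.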
+        while let Ok(event) = event_rx.recv().await {
+            // Filter events based on subscription
+            if !requested_events.is_empty() && !requested_events.contains(&event.event_type) {
+                continue;
+            }
+
+            // Check permissions
+            if !has_event_permission(&claims, &event.event_type) {
+                continue;
+            }
+
+            batch.push(event);
+
+            // Send batch when full or after timeout
+            if batch.len() >= batch_size {
+                send_event_batch(&mut sender, &batch).await;
+                batch.clear();
+            }
+        }
+    });
+
+    // Wait for either task to complete
+    tokio::select! {
+        _ = recv_task => {},
+        _ = send_task => {},
+    }
+}
+
+#[derive(Debug, Deserialize)]
+struct ClientMessage {
+    #[serde(rename = "type")]
+    msg_type: String,
+    token: Option<String>,
+    events: Option<Vec<String>>,
+}
+
+async fn handle_client_message(msg: ClientMessage, state: &SharedState) {
+    match msg.msg_type.as_str() {
+        "subscribe" => {
+            // Handle event subscription
+        },
+        "unsubscribe" => {
+            // Handle event unsubscription
+        },
+        "auth" => {
+            // Handle re-authentication
+        },
+        _ => {
+            // Unknown message type
+        }
+    }
+}
+
+async fn send_event_batch(sender: &mut SplitSink<WebSocket, Message>, batch: &[WebSocketMessage]) {
+    let batch_msg = serde_json::json!({
+        "type": "batch",
+        "events": batch
+    });
+
+    if let Ok(msg_text) = serde_json::to_string(&batch_msg) {
+        if let Err(e) = sender.send(Message::Text(msg_text)).await {
+            eprintln!("Failed to send WebSocket message: {}", e);
+        }
+    }
+}
+
+fn has_event_permission(claims: &Claims, event_type: &str) -> bool {
+    // Check if user has permission to receive this event type
+    match event_type {
+        "SystemHealthUpdate" => claims.role.contains(&"admin".to_string()),
+        "LogEntry" => claims.role.contains(&"admin".to_string()) ||
+                      claims.role.contains(&"developer".to_string()),
+        _ => true, // Most events are accessible to all authenticated users
+    }
+}
+```
+
+## Event Filtering and Subscriptions
+
+### Client-Side Filtering
+
+```text
+// Subscribe to specific event types
+ws.subscribe(['TaskStatusChanged', 'WorkflowProgressUpdate']);
+
+// Subscribe with filters
+ws.send({
+    type: 'subscribe',
+    events: ['TaskStatusChanged'],
+    filters: {
+        task_name: 'create_servers',
+        status: ['Running', 'Completed', 'Failed']
+    }
+});
+
+// Advanced filtering
+ws.send({
+    type: 'subscribe',
+    events: ['LogEntry'],
+    filters: {
+        level: ['ERROR', 'WARN'],
+        component: ['server-manager', 'batch-coordinator'],
+        since: '2025-09-26T10:00:00Z'
+    }
+});
+```
+
+### Server-Side Event Filtering
+
+Events can be filtered on the server side based on:
+
+- User permissions and roles
+- Event type subscriptions
+- Custom filter criteria
+- Rate limiting
+
+## Error Handling and Reconnection
+
+### Connection Errors
+
+```text
+ws.on('error', (error) => {
+    console.error('WebSocket error:', error);
+
+    // Handle specific error types
+    if (error.code === 1006) {
+        // Abnormal closure, attempt reconnection
+        setTimeout(() => ws.connect(), 5000);
+    } else if (error.code === 1008) {
+        // Policy violation, check token
+        refreshTokenAndReconnect();
+    }
+});
+
+ws.on('disconnected', (event) => {
+    console.log(`WebSocket disconnected: ${event.code} - ${event.reason}`);
+
+    // Handle different close codes
+    switch (event.code) {
+        case 1000: // Normal closure
+            console.log('Connection closed normally');
+            break;
+        case 1001: // Going away
+            console.log('Server is shutting down');
+            break;
+        case 4001: // Custom: Token expired
+            refreshTokenAndReconnect();
+            break;
+        default:
+            // Attempt reconnection for other errors
+            if (shouldReconnect()) {
+                scheduleReconnection();
+            }
+    }
+});
+```
+
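+The reconnection helpers referenced above (`refreshTokenAndReconnect`, `shouldReconnect`, `scheduleReconnection`) are left to the application. A
+minimal sketch, assuming a hypothetical `/auth/refresh` endpoint that exchanges the current JWT for a fresh one (not part of the documented REST
+API):
+
+```text
+// Sketch only: the /auth/refresh endpoint and the retry policy below are
+// illustrative assumptions, not part of the documented API.
+let reconnectAttempts = 0;
+const maxReconnectAttempts = 10;
+
+async function refreshTokenAndReconnect() {
+    const response = await fetch('http://localhost:9090/auth/refresh', {
+        method: 'POST',
+        headers: { 'Authorization': `Bearer ${ws.token}` }
+    });
+    const { token } = await response.json();
+    ws.token = token;  // the wrapper reuses this.token on the next connect()
+    ws.connect();
+}
+
+function shouldReconnect() {
+    return reconnectAttempts < maxReconnectAttempts;
+}
+
+function scheduleReconnection() {
+    reconnectAttempts++;
+    // Exponential backoff capped at 30 seconds
+    const delay = Math.min(1000 * 2 ** reconnectAttempts, 30000);
+    setTimeout(() => ws.connect(), delay);
+}
+```
+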
+### Heartbeat and Keep-Alive + +```text +class ProvisioningWebSocket { + constructor(baseUrl, token, options = {}) { + // ... existing code ... + this.heartbeatInterval = options.heartbeatInterval || 30000; + this.heartbeatTimer = null; + } + + connect() { + // ... existing connection code ... + + this.ws.onopen = (event) => { + console.log('WebSocket connected'); + this.startHeartbeat(); + this.emit('connected', event); + }; + + this.ws.onclose = (event) => { + this.stopHeartbeat(); + // ... existing close handling ... + }; + } + + startHeartbeat() { + this.heartbeatTimer = setInterval(() => { + if (this.ws && this.ws.readyState === WebSocket.OPEN) { + this.send({ type: 'ping' }); + } + }, this.heartbeatInterval); + } + + stopHeartbeat() { + if (this.heartbeatTimer) { + clearInterval(this.heartbeatTimer); + this.heartbeatTimer = null; + } + } + + handleMessage(message) { + if (message.type === 'pong') { + // Heartbeat response received + return; + } + + // ... existing message handling ... + } +} +``` + +## Performance Considerations + +### Message Batching + +To improve performance, the server can batch multiple events into single WebSocket messages: + +```text +{ + "type": "batch", + "timestamp": "2025-09-26T10:00:00Z", + "events": [ + { + "event_type": "TaskStatusChanged", + "data": { ... } + }, + { + "event_type": "WorkflowProgressUpdate", + "data": { ... } + } + ] +} +``` + +### Compression + +Enable message compression for large events: + +```text +const ws = new WebSocket('ws://localhost:9090/ws?token=jwt&compression=true'); +``` + +### Rate Limiting + +The server implements rate limiting to prevent abuse: + +- Maximum connections per user: 10 +- Maximum messages per second: 100 +- Maximum subscription events: 50 + +## Security Considerations + +### Authentication and Authorization + +- All connections require valid JWT tokens +- Tokens are validated on connection and periodically renewed +- Event access is controlled by user roles and permissions + +### Message Validation + +- All incoming messages are validated against schemas +- Malformed messages are rejected +- Rate limiting prevents DoS attacks + +### Data Sanitization + +- All event data is sanitized before transmission +- Sensitive information is filtered based on user permissions +- PII and secrets are never transmitted + +This WebSocket API provides a robust, real-time communication channel for monitoring and managing provisioning with comprehensive security and +performance features. 
\ No newline at end of file diff --git a/docs/src/architecture/README.md b/docs/src/architecture/README.md index 0f49d35..a220462 100644 --- a/docs/src/architecture/README.md +++ b/docs/src/architecture/README.md @@ -1 +1,130 @@ -# Architecture Documentation\n\nThis directory contains comprehensive architecture documentation for provisioning, including Architecture Decision Records (ADRs) and system design\ndocumentation.\n\n## Architecture Decision Records (ADRs)\n\nADRs document the major architectural decisions made for the system, including context, rationale, and consequences:\n\n- **[ADR-001: Project Structure Decision](adr/adr-001-project-structure.md)** - Domain-driven hybrid structure organization\n- **[ADR-002: Distribution Strategy](adr/adr-002-distribution-strategy.md)** - Layered distribution with workspace separation\n- **[ADR-003: Workspace Isolation](adr/adr-003-workspace-isolation.md)** - Isolated user workspaces with hierarchical configuration\n- **[ADR-004: Hybrid Architecture](adr/adr-004-hybrid-architecture.md)** - Rust coordination layer with Nushell business logic\n- **[ADR-005: Extension Framework](adr/adr-005-extension-framework.md)** - Registry-based extension system with manifest-driven loading\n\n## System Design Documentation\n\nComprehensive documentation covering system architecture, integration patterns, and design principles:\n\n### [System Overview](system-overview.md)\n\nHigh-level architecture overview including:\n\n- Executive summary and key achievements\n- Component architecture with diagrams\n- Technology stack and dependencies\n- Performance and scalability characteristics\n- Security architecture and quality attributes\n\n### [Integration Patterns](integration-patterns.md)\n\nDetailed integration patterns and implementations:\n\n- Hybrid language integration (Rust ↔ Nushell)\n- Provider abstraction and multi-cloud support\n- Configuration resolution and variable interpolation\n- Workflow orchestration and dependency management\n- State management and checkpoint recovery\n- Event-driven architecture and messaging\n- Extension integration and API patterns\n- Error handling and performance optimization\n\n### [Design Principles](design-principles.md)\n\nCore architectural principles and guidelines:\n\n- Project Architecture Principles (PAP) compliance\n- Hybrid architecture optimization strategies\n- Configuration-first architecture approach\n- Domain-driven structural organization\n- Quality attribute principles (reliability, performance, security)\n- Error handling and observability principles\n- Evolution and maintenance strategies\n\n## Key Architectural Achievements\n\n### 🚀 Batch Workflow System (v3.1.0)\n\n- **Provider-Agnostic Design**: Mixed UpCloud, AWS, and local provider support\n- **Advanced Orchestration**: Dependency resolution, parallel execution, and rollback capabilities\n- **Real-time Monitoring**: Live workflow progress tracking and health monitoring\n\n### 🏗️ Hybrid Orchestrator Architecture (v3.0.0)\n\n- **Performance Solution**: Solves Nushell deep call stack limitations\n- **Business Logic Preservation**: 65+ Nushell files with domain expertise maintained\n- **REST API Integration**: Modern HTTP endpoints for external system integration\n- **State Management**: Checkpoint-based recovery with comprehensive rollback\n\n### ⚙️ Configuration System (v2.0.0)\n\n- **Configuration Migration**: Systematic migration from ENV variables to configuration files\n- **Hierarchical Configuration**: Complete configuration flexibility with clear 
precedence\n- **Variable Interpolation**: Dynamic configuration with runtime variable resolution\n- **PAP Compliance**: True Infrastructure as Code without hardcoded fallbacks\n\n## Reading Guide\n\n### For New Developers\n\n1. Start with [System Overview](system-overview.md) for high-level understanding\n2. Read [Design Principles](design-principles.md) to understand architectural philosophy\n3. Review relevant ADRs for specific architectural decisions\n4. Study [Integration Patterns](integration-patterns.md) for implementation details\n\n### For Architects and Senior Developers\n\n1. Review all ADRs to understand decision rationale and trade-offs\n2. Study [Integration Patterns](integration-patterns.md) for advanced implementation patterns\n3. Reference [Design Principles](design-principles.md) for architectural guidelines\n4. Use [System Overview](system-overview.md) for comprehensive system understanding\n\n### For System Operators\n\n1. Focus on [System Overview](system-overview.md) for deployment and operation insights\n2. Review [ADR-002: Distribution Strategy](adr/adr-002-distribution-strategy.md) for deployment patterns\n3. Study [ADR-003: Workspace Isolation](adr/adr-003-workspace-isolation.md) for user management\n4. Reference [Design Principles](design-principles.md) for operational guidelines\n\n## Document Evolution\n\nThese architecture documents are living resources that evolve with the system:\n\n- **ADRs are immutable** once accepted, with new ADRs created for major changes\n- **System documentation is updated** to reflect current architecture\n- **Cross-references are maintained** between related documents\n- **Version compatibility** is documented for architectural changes\n\n## Contributing to Architecture Documentation\n\nWhen making significant architectural changes:\n\n1. **Create new ADRs** for major decisions using the standard format\n2. **Update system documentation** to reflect architectural changes\n3. **Maintain cross-references** between related documents\n4. **Document trade-offs** and alternatives considered\n5. **Update integration patterns** for new architectural patterns\n\n## Architecture Review Process\n\nAll significant architectural changes follow a review process:\n\n1. **Proposal Phase**: Create draft ADR with context and proposed decision\n2. **Review Phase**: Technical review by architecture team and stakeholders\n3. **Decision Phase**: Accept, modify, or reject based on review feedback\n4. **Documentation Phase**: Update related documentation and integration patterns\n5. **Implementation Phase**: Guide implementation according to architectural decisions\n\nThis architecture documentation represents the collective wisdom and experience of building a sophisticated, production-ready infrastructure\nautomation platform. +# Architecture Documentation + +This directory contains comprehensive architecture documentation for provisioning, including Architecture Decision Records (ADRs) and system design +documentation. 
+ +## Architecture Decision Records (ADRs) + +ADRs document the major architectural decisions made for the system, including context, rationale, and consequences: + +- **[ADR-001: Project Structure Decision](adr/adr-001-project-structure.md)** - Domain-driven hybrid structure organization +- **[ADR-002: Distribution Strategy](adr/adr-002-distribution-strategy.md)** - Layered distribution with workspace separation +- **[ADR-003: Workspace Isolation](adr/adr-003-workspace-isolation.md)** - Isolated user workspaces with hierarchical configuration +- **[ADR-004: Hybrid Architecture](adr/adr-004-hybrid-architecture.md)** - Rust coordination layer with Nushell business logic +- **[ADR-005: Extension Framework](adr/adr-005-extension-framework.md)** - Registry-based extension system with manifest-driven loading + +## System Design Documentation + +Comprehensive documentation covering system architecture, integration patterns, and design principles: + +### [System Overview](system-overview.md) + +High-level architecture overview including: + +- Executive summary and key achievements +- Component architecture with diagrams +- Technology stack and dependencies +- Performance and scalability characteristics +- Security architecture and quality attributes + +### [Integration Patterns](integration-patterns.md) + +Detailed integration patterns and implementations: + +- Hybrid language integration (Rust ↔ Nushell) +- Provider abstraction and multi-cloud support +- Configuration resolution and variable interpolation +- Workflow orchestration and dependency management +- State management and checkpoint recovery +- Event-driven architecture and messaging +- Extension integration and API patterns +- Error handling and performance optimization + +### [Design Principles](design-principles.md) + +Core architectural principles and guidelines: + +- Project Architecture Principles (PAP) compliance +- Hybrid architecture optimization strategies +- Configuration-first architecture approach +- Domain-driven structural organization +- Quality attribute principles (reliability, performance, security) +- Error handling and observability principles +- Evolution and maintenance strategies + +## Key Architectural Achievements + +### 🚀 Batch Workflow System (v3.1.0) + +- **Provider-Agnostic Design**: Mixed UpCloud, AWS, and local provider support +- **Advanced Orchestration**: Dependency resolution, parallel execution, and rollback capabilities +- **Real-time Monitoring**: Live workflow progress tracking and health monitoring + +### 🏗️ Hybrid Orchestrator Architecture (v3.0.0) + +- **Performance Solution**: Solves Nushell deep call stack limitations +- **Business Logic Preservation**: 65+ Nushell files with domain expertise maintained +- **REST API Integration**: Modern HTTP endpoints for external system integration +- **State Management**: Checkpoint-based recovery with comprehensive rollback + +### ⚙️ Configuration System (v2.0.0) + +- **Configuration Migration**: Systematic migration from ENV variables to configuration files +- **Hierarchical Configuration**: Complete configuration flexibility with clear precedence +- **Variable Interpolation**: Dynamic configuration with runtime variable resolution +- **PAP Compliance**: True Infrastructure as Code without hardcoded fallbacks + +## Reading Guide + +### For New Developers + +1. Start with [System Overview](system-overview.md) for high-level understanding +2. Read [Design Principles](design-principles.md) to understand architectural philosophy +3. 
Review relevant ADRs for specific architectural decisions +4. Study [Integration Patterns](integration-patterns.md) for implementation details + +### For Architects and Senior Developers + +1. Review all ADRs to understand decision rationale and trade-offs +2. Study [Integration Patterns](integration-patterns.md) for advanced implementation patterns +3. Reference [Design Principles](design-principles.md) for architectural guidelines +4. Use [System Overview](system-overview.md) for comprehensive system understanding + +### For System Operators + +1. Focus on [System Overview](system-overview.md) for deployment and operation insights +2. Review [ADR-002: Distribution Strategy](adr/adr-002-distribution-strategy.md) for deployment patterns +3. Study [ADR-003: Workspace Isolation](adr/adr-003-workspace-isolation.md) for user management +4. Reference [Design Principles](design-principles.md) for operational guidelines + +## Document Evolution + +These architecture documents are living resources that evolve with the system: + +- **ADRs are immutable** once accepted, with new ADRs created for major changes +- **System documentation is updated** to reflect current architecture +- **Cross-references are maintained** between related documents +- **Version compatibility** is documented for architectural changes + +## Contributing to Architecture Documentation + +When making significant architectural changes: + +1. **Create new ADRs** for major decisions using the standard format +2. **Update system documentation** to reflect architectural changes +3. **Maintain cross-references** between related documents +4. **Document trade-offs** and alternatives considered +5. **Update integration patterns** for new architectural patterns + +## Architecture Review Process + +All significant architectural changes follow a review process: + +1. **Proposal Phase**: Create draft ADR with context and proposed decision +2. **Review Phase**: Technical review by architecture team and stakeholders +3. **Decision Phase**: Accept, modify, or reject based on review feedback +4. **Documentation Phase**: Update related documentation and integration patterns +5. **Implementation Phase**: Guide implementation according to architectural decisions + +This architecture documentation represents the collective wisdom and experience of building a sophisticated, production-ready infrastructure +automation platform. diff --git a/docs/src/architecture/adr/ADR-001-project-structure.md b/docs/src/architecture/adr/ADR-001-project-structure.md index 4c758c7..a708b02 100644 --- a/docs/src/architecture/adr/ADR-001-project-structure.md +++ b/docs/src/architecture/adr/ADR-001-project-structure.md @@ -1 +1,118 @@ -# ADR-001: Project Structure Decision\n\n## Status\n\nAccepted\n\n## Context\n\nProvisioning had evolved from a monolithic structure into a complex system with mixed organizational patterns. The original structure had multiple issues:\n\n1. **Provider-specific code scattered**: Cloud provider implementations were mixed with core logic\n2. **Task services fragmented**: Infrastructure services lacked consistent structure\n3. **Domain boundaries unclear**: No clear separation between core, providers, and services\n4. **Development artifacts mixed with distribution**: User-facing tools mixed with development utilities\n5. **Deep call stack limitations**: Nushell's runtime limitations required architectural solutions\n6. 
**Configuration complexity**: 200+ environment variables across 65+ files needed systematic organization\n\nThe system needed a clear, maintainable structure that supports:\n\n- Multi-provider infrastructure provisioning (AWS, UpCloud, local)\n- Modular task services (Kubernetes, container runtimes, storage, networking)\n- Clear separation of concerns\n- Hybrid Rust/Nushell architecture\n- Configuration-driven workflows\n- Clean distribution without development artifacts\n\n## Decision\n\nAdopt a **domain-driven hybrid structure** organized around functional boundaries:\n\n```\nsrc/\n├── core/ # Core system and CLI entry point\n├── platform/ # High-performance coordination layer (Rust orchestrator)\n├── orchestrator/ # Legacy orchestrator location (to be consolidated)\n├── provisioning/ # Main provisioning with domain modules\n├── control-center/ # Web UI management interface\n├── tools/ # Development and utility tools\n└── extensions/ # Plugin and extension framework\n```\n\n### Key Structural Principles\n\n1. **Domain Separation**: Each major component has clear boundaries and responsibilities\n2. **Hybrid Architecture**: Rust for performance-critical coordination, Nushell for business logic\n3. **Provider Abstraction**: Standardized interfaces across cloud providers\n4. **Service Modularity**: Reusable task services with consistent structure\n5. **Clean Distribution**: Development tools separated from user-facing components\n6. **Configuration Hierarchy**: Systematic config management with interpolation support\n\n### Domain Organization\n\n- **Core**: CLI interface, library modules, and common utilities\n- **Platform**: High-performance Rust orchestrator for workflow coordination\n- **Provisioning**: Main business logic with providers, task services, and clusters\n- **Control Center**: Web-based management interface\n- **Tools**: Development utilities and build systems\n- **Extensions**: Plugin framework and custom extensions\n\n## Consequences\n\n### Positive\n\n- **Clear Boundaries**: Each domain has well-defined responsibilities and interfaces\n- **Scalable Growth**: New providers and services can be added without structural changes\n- **Development Efficiency**: Developers can focus on specific domains without system-wide knowledge\n- **Clean Distribution**: Users receive only necessary components without development artifacts\n- **Maintenance Clarity**: Issues can be isolated to specific domains\n- **Hybrid Benefits**: Leverage Rust performance where needed while maintaining Nushell productivity\n- **Configuration Consistency**: Systematic approach to configuration management across all domains\n\n### Negative\n\n- **Migration Complexity**: Required systematic migration of existing components\n- **Learning Curve**: New developers need to understand domain boundaries\n- **Coordination Overhead**: Cross-domain features require careful interface design\n- **Path Management**: More complex path resolution with domain separation\n- **Build Complexity**: Multiple domains require coordinated build processes\n\n### Neutral\n\n- **Development Patterns**: Each domain may develop its own patterns within architectural guidelines\n- **Testing Strategy**: Domain-specific testing strategies while maintaining integration coverage\n- **Documentation**: Domain-specific documentation with clear cross-references\n\n## Alternatives Considered\n\n### Alternative 1: Monolithic Structure\n\nKeep all code in a single flat structure with minimal organization.\n**Rejected**: Would not solve maintainability 
or scalability issues. Continued technical debt accumulation.\n\n### Alternative 2: Microservice Architecture\n\nSplit into completely separate services with network communication.\n**Rejected**: Overhead too high for single-machine deployment use case. Would complicate installation and configuration.\n\n### Alternative 3: Language-Based Organization\n\nOrganize by implementation language (rust/, nushell/, kcl/).\n**Rejected**: Does not align with functional boundaries. Cross-cutting concerns would be scattered.\n\n### Alternative 4: Feature-Based Organization\n\nOrganize by user-facing features (servers/, clusters/, networking/).\n**Rejected**: Would duplicate cross-cutting infrastructure and provider logic across features.\n\n### Alternative 5: Layer-Based Architecture\n\nOrganize by architectural layers (presentation/, business/, data/).\n**Rejected**: Does not align with domain complexity. Infrastructure provisioning has different layering needs.\n\n## References\n\n- Configuration System Migration (ADR-002)\n- Hybrid Architecture Decision (ADR-004)\n- Extension Framework Design (ADR-005)\n- Project Architecture Principles (PAP) Guidelines +# ADR-001: Project Structure Decision + +## Status + +Accepted + +## Context + +Provisioning had evolved from a monolithic structure into a complex system with mixed organizational patterns. The original structure had multiple issues: + +1. **Provider-specific code scattered**: Cloud provider implementations were mixed with core logic +2. **Task services fragmented**: Infrastructure services lacked consistent structure +3. **Domain boundaries unclear**: No clear separation between core, providers, and services +4. **Development artifacts mixed with distribution**: User-facing tools mixed with development utilities +5. **Deep call stack limitations**: Nushell's runtime limitations required architectural solutions +6. **Configuration complexity**: 200+ environment variables across 65+ files needed systematic organization + +The system needed a clear, maintainable structure that supports: + +- Multi-provider infrastructure provisioning (AWS, UpCloud, local) +- Modular task services (Kubernetes, container runtimes, storage, networking) +- Clear separation of concerns +- Hybrid Rust/Nushell architecture +- Configuration-driven workflows +- Clean distribution without development artifacts + +## Decision + +Adopt a **domain-driven hybrid structure** organized around functional boundaries: + +```text +src/ +├── core/ # Core system and CLI entry point +├── platform/ # High-performance coordination layer (Rust orchestrator) +├── orchestrator/ # Legacy orchestrator location (to be consolidated) +├── provisioning/ # Main provisioning with domain modules +├── control-center/ # Web UI management interface +├── tools/ # Development and utility tools +└── extensions/ # Plugin and extension framework +``` + +### Key Structural Principles + +1. **Domain Separation**: Each major component has clear boundaries and responsibilities +2. **Hybrid Architecture**: Rust for performance-critical coordination, Nushell for business logic +3. **Provider Abstraction**: Standardized interfaces across cloud providers +4. **Service Modularity**: Reusable task services with consistent structure +5. **Clean Distribution**: Development tools separated from user-facing components +6. 
**Configuration Hierarchy**: Systematic config management with interpolation support + +### Domain Organization + +- **Core**: CLI interface, library modules, and common utilities +- **Platform**: High-performance Rust orchestrator for workflow coordination +- **Provisioning**: Main business logic with providers, task services, and clusters +- **Control Center**: Web-based management interface +- **Tools**: Development utilities and build systems +- **Extensions**: Plugin framework and custom extensions + +## Consequences + +### Positive + +- **Clear Boundaries**: Each domain has well-defined responsibilities and interfaces +- **Scalable Growth**: New providers and services can be added without structural changes +- **Development Efficiency**: Developers can focus on specific domains without system-wide knowledge +- **Clean Distribution**: Users receive only necessary components without development artifacts +- **Maintenance Clarity**: Issues can be isolated to specific domains +- **Hybrid Benefits**: Leverage Rust performance where needed while maintaining Nushell productivity +- **Configuration Consistency**: Systematic approach to configuration management across all domains + +### Negative + +- **Migration Complexity**: Required systematic migration of existing components +- **Learning Curve**: New developers need to understand domain boundaries +- **Coordination Overhead**: Cross-domain features require careful interface design +- **Path Management**: More complex path resolution with domain separation +- **Build Complexity**: Multiple domains require coordinated build processes + +### Neutral + +- **Development Patterns**: Each domain may develop its own patterns within architectural guidelines +- **Testing Strategy**: Domain-specific testing strategies while maintaining integration coverage +- **Documentation**: Domain-specific documentation with clear cross-references + +## Alternatives Considered + +### Alternative 1: Monolithic Structure + +Keep all code in a single flat structure with minimal organization. +**Rejected**: Would not solve maintainability or scalability issues. Continued technical debt accumulation. + +### Alternative 2: Microservice Architecture + +Split into completely separate services with network communication. +**Rejected**: Overhead too high for single-machine deployment use case. Would complicate installation and configuration. + +### Alternative 3: Language-Based Organization + +Organize by implementation language (rust/, nushell/, kcl/). +**Rejected**: Does not align with functional boundaries. Cross-cutting concerns would be scattered. + +### Alternative 4: Feature-Based Organization + +Organize by user-facing features (servers/, clusters/, networking/). +**Rejected**: Would duplicate cross-cutting infrastructure and provider logic across features. + +### Alternative 5: Layer-Based Architecture + +Organize by architectural layers (presentation/, business/, data/). +**Rejected**: Does not align with domain complexity. Infrastructure provisioning has different layering needs. 
+ +## References + +- Configuration System Migration (ADR-002) +- Hybrid Architecture Decision (ADR-004) +- Extension Framework Design (ADR-005) +- Project Architecture Principles (PAP) Guidelines \ No newline at end of file diff --git a/docs/src/architecture/adr/ADR-002-distribution-strategy.md b/docs/src/architecture/adr/ADR-002-distribution-strategy.md index 86a1397..6b31d34 100644 --- a/docs/src/architecture/adr/ADR-002-distribution-strategy.md +++ b/docs/src/architecture/adr/ADR-002-distribution-strategy.md @@ -1 +1,179 @@ -# ADR-002: Distribution Strategy\n\n## Status\n\nAccepted\n\n## Context\n\nProvisioning needed a clean distribution strategy that separates user-facing tools from development artifacts. Key challenges included:\n\n1. **Development Artifacts Mixed with Production**: Build tools, test files, and development utilities scattered throughout user directories\n2. **Complex Installation Process**: Users had to navigate through development-specific directories and files\n3. **Unclear User Experience**: No clear distinction between what users need versus what developers need\n4. **Configuration Complexity**: Multiple configuration files with unclear precedence and purpose\n5. **Workspace Pollution**: User workspaces contained development-only files and directories\n6. **Path Resolution Issues**: Complex path resolution logic mixing development and production concerns\n\nThe system required a distribution strategy that provides:\n\n- Clean user experience without development artifacts\n- Clear separation between user and development tools\n- Simplified configuration management\n- Consistent installation and deployment patterns\n- Maintainable development workflow\n\n## Decision\n\nImplement a **layered distribution strategy** with clear separation between development and user environments:\n\n### Distribution Layers\n\n1. **Core Distribution Layer**: Essential user-facing components\n - Main CLI tools and libraries\n - Configuration templates and defaults\n - Provider implementations\n - Task service definitions\n\n2. **Development Layer**: Development-specific tools and artifacts\n - Build scripts and development utilities\n - Test suites and validation tools\n - Development configuration templates\n - Code generation tools\n\n3. **Workspace Layer**: User-specific customization and data\n - User configurations and overrides\n - Local state and cache files\n - Custom extensions and plugins\n - User-specific templates and workflows\n\n### Distribution Structure\n\n```{$detected_lang}\n# User Distribution\n/usr/local/bin/\n├── provisioning # Main CLI entry point\n└── provisioning-* # Supporting utilities\n\n/usr/local/share/provisioning/\n├── core/ # Core libraries and modules\n├── providers/ # Provider implementations\n├── taskservs/ # Task service definitions\n├── templates/ # Configuration templates\n└── config.defaults.toml # System-wide defaults\n\n# User Workspace\n~/workspace/provisioning/\n├── config.user.toml # User preferences\n├── infra/ # User infrastructure definitions\n├── extensions/ # User extensions\n└── cache/ # Local cache and state\n\n# Development Environment\n/\n├── src/ # Source code\n├── scripts/ # Development tools\n├── tests/ # Test suites\n└── tools/ # Build and development utilities\n```\n\n### Key Distribution Principles\n\n1. **Clean Separation**: Development artifacts never appear in user installations\n2. **Hierarchical Configuration**: Clear precedence from system defaults to user overrides\n3. 
**Self-Contained User Tools**: Users can work without accessing development directories\n4. **Workspace Isolation**: User data and customizations isolated from system installation\n5. **Consistent Paths**: Predictable path resolution across different installation types\n6. **Version Management**: Clear versioning and upgrade paths for distributed components\n\n## Consequences\n\n### Positive\n\n- **Clean User Experience**: Users interact only with production-ready tools and interfaces\n- **Simplified Installation**: Clear installation process without development complexity\n- **Workspace Isolation**: User customizations don't interfere with system installation\n- **Development Efficiency**: Developers can work with full toolset without affecting users\n- **Configuration Clarity**: Clear hierarchy and precedence for configuration settings\n- **Maintainable Updates**: System updates don't affect user customizations\n- **Path Simplicity**: Predictable path resolution without development-specific logic\n- **Security Isolation**: User workspace separated from system components\n\n### Negative\n\n- **Distribution Complexity**: Multiple distribution targets require coordinated build processes\n- **Path Management**: More complex path resolution logic to support multiple layers\n- **Migration Overhead**: Existing users need to migrate to new workspace structure\n- **Documentation Burden**: Need clear documentation for different user types\n- **Testing Complexity**: Must validate distribution across different installation scenarios\n\n### Neutral\n\n- **Development Patterns**: Different patterns for development versus production deployment\n- **Configuration Strategy**: Layer-specific configuration management approaches\n- **Tool Integration**: Different integration patterns for development versus user tools\n\n## Alternatives Considered\n\n### Alternative 1: Monolithic Distribution\n\nShip everything (development and production) in single package.\n**Rejected**: Creates confusing user experience and bloated installations. Mixes development concerns with user needs.\n\n### Alternative 2: Container-Only Distribution\n\nPackage entire system as container images only.\n**Rejected**: Limits deployment flexibility and complicates local development workflows. Not suitable for all use cases.\n\n### Alternative 3: Source-Only Distribution\n\nRequire users to build from source with development environment.\n**Rejected**: Creates high barrier to entry and mixes user concerns with development complexity.\n\n### Alternative 4: Plugin-Based Distribution\n\nMinimal core with everything else as downloadable plugins.\n**Rejected**: Would fragment essential functionality and complicate initial setup. Network dependency for basic functionality.\n\n### Alternative 5: Environment-Based Distribution\n\nUse environment variables to control what gets installed.\n**Rejected**: Creates complex configuration matrix and potential for inconsistent installations.\n\n## Implementation Details\n\n### Distribution Build Process\n\n1. **Core Layer Build**: Extract essential user components from source\n2. **Template Processing**: Generate configuration templates with proper defaults\n3. **Path Resolution**: Generate path resolution logic for different installation types\n4. **Documentation Generation**: Create user-specific documentation excluding development details\n5. **Package Creation**: Build distribution packages for different platforms\n6. 
**Validation Testing**: Test installations in clean environments\n\n### Configuration Hierarchy\n\n```{$detected_lang}\nSystem Defaults (lowest precedence)\n└── User Configuration\n └── Project Configuration\n └── Infrastructure Configuration\n └── Environment Configuration\n └── Runtime Configuration (highest precedence)\n```\n\n### Workspace Management\n\n- **Automatic Creation**: User workspace created on first run\n- **Template Initialization**: Workspace populated with configuration templates\n- **Version Tracking**: Workspace tracks compatible system versions\n- **Migration Support**: Automatic migration between workspace versions\n- **Backup Integration**: Workspace backup and restore capabilities\n\n## References\n\n- Project Structure Decision (ADR-001)\n- Workspace Isolation Decision (ADR-003)\n- Configuration System Migration (CLAUDE.md)\n- User Experience Guidelines (Design Principles)\n- Installation and Deployment Procedures +# ADR-002: Distribution Strategy + +## Status + +Accepted + +## Context + +Provisioning needed a clean distribution strategy that separates user-facing tools from development artifacts. Key challenges included: + +1. **Development Artifacts Mixed with Production**: Build tools, test files, and development utilities scattered throughout user directories +2. **Complex Installation Process**: Users had to navigate through development-specific directories and files +3. **Unclear User Experience**: No clear distinction between what users need versus what developers need +4. **Configuration Complexity**: Multiple configuration files with unclear precedence and purpose +5. **Workspace Pollution**: User workspaces contained development-only files and directories +6. **Path Resolution Issues**: Complex path resolution logic mixing development and production concerns + +The system required a distribution strategy that provides: + +- Clean user experience without development artifacts +- Clear separation between user and development tools +- Simplified configuration management +- Consistent installation and deployment patterns +- Maintainable development workflow + +## Decision + +Implement a **layered distribution strategy** with clear separation between development and user environments: + +### Distribution Layers + +1. **Core Distribution Layer**: Essential user-facing components + - Main CLI tools and libraries + - Configuration templates and defaults + - Provider implementations + - Task service definitions + +2. **Development Layer**: Development-specific tools and artifacts + - Build scripts and development utilities + - Test suites and validation tools + - Development configuration templates + - Code generation tools + +3. 
**Workspace Layer**: User-specific customization and data + - User configurations and overrides + - Local state and cache files + - Custom extensions and plugins + - User-specific templates and workflows + +### Distribution Structure + +```text +# User Distribution +/usr/local/bin/ +├── provisioning # Main CLI entry point +└── provisioning-* # Supporting utilities + +/usr/local/share/provisioning/ +├── core/ # Core libraries and modules +├── providers/ # Provider implementations +├── taskservs/ # Task service definitions +├── templates/ # Configuration templates +└── config.defaults.toml # System-wide defaults + +# User Workspace +~/workspace/provisioning/ +├── config.user.toml # User preferences +├── infra/ # User infrastructure definitions +├── extensions/ # User extensions +└── cache/ # Local cache and state + +# Development Environment +/ +├── src/ # Source code +├── scripts/ # Development tools +├── tests/ # Test suites +└── tools/ # Build and development utilities +``` + +### Key Distribution Principles + +1. **Clean Separation**: Development artifacts never appear in user installations +2. **Hierarchical Configuration**: Clear precedence from system defaults to user overrides +3. **Self-Contained User Tools**: Users can work without accessing development directories +4. **Workspace Isolation**: User data and customizations isolated from system installation +5. **Consistent Paths**: Predictable path resolution across different installation types +6. **Version Management**: Clear versioning and upgrade paths for distributed components + +## Consequences + +### Positive + +- **Clean User Experience**: Users interact only with production-ready tools and interfaces +- **Simplified Installation**: Clear installation process without development complexity +- **Workspace Isolation**: User customizations don't interfere with system installation +- **Development Efficiency**: Developers can work with full toolset without affecting users +- **Configuration Clarity**: Clear hierarchy and precedence for configuration settings +- **Maintainable Updates**: System updates don't affect user customizations +- **Path Simplicity**: Predictable path resolution without development-specific logic +- **Security Isolation**: User workspace separated from system components + +### Negative + +- **Distribution Complexity**: Multiple distribution targets require coordinated build processes +- **Path Management**: More complex path resolution logic to support multiple layers +- **Migration Overhead**: Existing users need to migrate to new workspace structure +- **Documentation Burden**: Need clear documentation for different user types +- **Testing Complexity**: Must validate distribution across different installation scenarios + +### Neutral + +- **Development Patterns**: Different patterns for development versus production deployment +- **Configuration Strategy**: Layer-specific configuration management approaches +- **Tool Integration**: Different integration patterns for development versus user tools + +## Alternatives Considered + +### Alternative 1: Monolithic Distribution + +Ship everything (development and production) in single package. +**Rejected**: Creates confusing user experience and bloated installations. Mixes development concerns with user needs. + +### Alternative 2: Container-Only Distribution + +Package entire system as container images only. +**Rejected**: Limits deployment flexibility and complicates local development workflows. Not suitable for all use cases. 
+ +### Alternative 3: Source-Only Distribution + +Require users to build from source with development environment. +**Rejected**: Creates high barrier to entry and mixes user concerns with development complexity. + +### Alternative 4: Plugin-Based Distribution + +Minimal core with everything else as downloadable plugins. +**Rejected**: Would fragment essential functionality and complicate initial setup. Network dependency for basic functionality. + +### Alternative 5: Environment-Based Distribution + +Use environment variables to control what gets installed. +**Rejected**: Creates complex configuration matrix and potential for inconsistent installations. + +## Implementation Details + +### Distribution Build Process + +1. **Core Layer Build**: Extract essential user components from source +2. **Template Processing**: Generate configuration templates with proper defaults +3. **Path Resolution**: Generate path resolution logic for different installation types +4. **Documentation Generation**: Create user-specific documentation excluding development details +5. **Package Creation**: Build distribution packages for different platforms +6. **Validation Testing**: Test installations in clean environments + +### Configuration Hierarchy + +```text +System Defaults (lowest precedence) +└── User Configuration + └── Project Configuration + └── Infrastructure Configuration + └── Environment Configuration + └── Runtime Configuration (highest precedence) +``` + +### Workspace Management + +- **Automatic Creation**: User workspace created on first run +- **Template Initialization**: Workspace populated with configuration templates +- **Version Tracking**: Workspace tracks compatible system versions +- **Migration Support**: Automatic migration between workspace versions +- **Backup Integration**: Workspace backup and restore capabilities + +## References + +- Project Structure Decision (ADR-001) +- Workspace Isolation Decision (ADR-003) +- Configuration System Migration (CLAUDE.md) +- User Experience Guidelines (Design Principles) +- Installation and Deployment Procedures diff --git a/docs/src/architecture/adr/ADR-003-workspace-isolation.md b/docs/src/architecture/adr/ADR-003-workspace-isolation.md index 8dd7e06..dc9948d 100644 --- a/docs/src/architecture/adr/ADR-003-workspace-isolation.md +++ b/docs/src/architecture/adr/ADR-003-workspace-isolation.md @@ -1 +1,191 @@ -# ADR-003: Workspace Isolation\n\n## Status\n\nAccepted\n\n## Context\n\nProvisioning required a clear strategy for managing user-specific data, configurations,\nand customizations separate from system-wide installations. Key challenges included:\n\n1. **Configuration Conflicts**: User settings mixed with system defaults, causing unclear precedence\n2. **State Management**: User state (cache, logs, temporary files) scattered across filesystem\n3. **Customization Isolation**: User extensions and customizations affecting system behavior\n4. **Multi-User Support**: Multiple users on same system interfering with each other\n5. **Development vs Production**: Developer needs different from end-user needs\n6. **Path Resolution Complexity**: Complex logic to locate user-specific resources\n7. **Backup and Migration**: Difficulty backing up and migrating user-specific settings\n8. 
**Security Boundaries**: Need clear separation between system and user-writable areas\n\nThe system needed workspace isolation that provides:\n\n- Clear separation of user data from system installation\n- Predictable configuration precedence and inheritance\n- User-specific customization without system impact\n- Multi-user support on shared systems\n- Easy backup and migration of user settings\n- Security isolation between system and user areas\n\n## Decision\n\nImplement **isolated user workspaces** with clear boundaries and hierarchical configuration:\n\n### Workspace Structure\n\n```\n~/workspace/provisioning/ # User workspace root\n├── config/\n│ ├── user.toml # User preferences and overrides\n│ ├── environments/ # Environment-specific configs\n│ │ ├── dev.toml\n│ │ ├── test.toml\n│ │ └── prod.toml\n│ └── secrets/ # User-specific encrypted secrets\n├── infra/ # User infrastructure definitions\n│ ├── personal/ # Personal infrastructure\n│ ├── work/ # Work-related infrastructure\n│ └── shared/ # Shared infrastructure definitions\n├── extensions/ # User-installed extensions\n│ ├── providers/ # Custom providers\n│ ├── taskservs/ # Custom task services\n│ └── plugins/ # User plugins\n├── templates/ # User-specific templates\n├── cache/ # Local cache and temporary data\n│ ├── provider-cache/ # Provider API cache\n│ ├── version-cache/ # Version information cache\n│ └── build-cache/ # Build and generation cache\n├── logs/ # User-specific logs\n├── state/ # Local state files\n└── backups/ # Automatic workspace backups\n```\n\n### Configuration Hierarchy (Precedence Order)\n\n1. **Runtime Parameters** (command line, environment variables)\n2. **Environment Configuration** (`config/environments/{env}.toml`)\n3. **Infrastructure Configuration** (`infra/{name}/config.toml`)\n4. **Project Configuration** (project-specific settings)\n5. **User Configuration** (`config/user.toml`)\n6. **System Defaults** (system-wide defaults)\n\n### Key Isolation Principles\n\n1. **Complete Isolation**: User workspace completely independent of system installation\n2. **Hierarchical Inheritance**: Clear configuration inheritance with user overrides\n3. **Security Boundaries**: User workspace in user-writable area only\n4. **Multi-User Safe**: Multiple users can have independent workspaces\n5. **Portable**: Entire user workspace can be backed up and restored\n6. **Version Independent**: Workspace compatible across system version upgrades\n7. **Extension Safe**: User extensions cannot affect system behavior\n8. 
**State Isolation**: All user state contained within workspace\n\n## Consequences\n\n### Positive\n\n- **User Independence**: Users can customize without affecting system or other users\n- **Configuration Clarity**: Clear hierarchy and precedence for all configuration\n- **Security Isolation**: User modifications cannot compromise system installation\n- **Easy Backup**: Complete user environment can be backed up and restored\n- **Development Flexibility**: Developers can have multiple isolated workspaces\n- **System Upgrades**: System updates don't affect user customizations\n- **Multi-User Support**: Multiple users can work independently on same system\n- **Portable Configurations**: User workspace can be moved between systems\n- **State Management**: All user state in predictable locations\n\n### Negative\n\n- **Initial Setup**: Users must initialize workspace before first use\n- **Path Complexity**: More complex path resolution to support workspace isolation\n- **Disk Usage**: Each user maintains separate cache and state\n- **Configuration Duplication**: Some configuration may be duplicated across users\n- **Migration Overhead**: Existing users need workspace migration\n- **Documentation Complexity**: Need clear documentation for workspace management\n\n### Neutral\n\n- **Backup Strategy**: Users responsible for their own workspace backup\n- **Extension Management**: User-specific extension installation and management\n- **Version Compatibility**: Workspace versions must be compatible with system versions\n- **Performance Implications**: Additional path resolution overhead\n\n## Alternatives Considered\n\n### Alternative 1: System-Wide Configuration Only\n\nAll configuration in system directories with user overrides via environment variables.\n**Rejected**: Creates conflicts between users and makes customization difficult. Poor isolation and security.\n\n### Alternative 2: Home Directory Dotfiles\n\nUse traditional dotfile approach (~/.provisioning/).\n**Rejected**: Clutters home directory and provides less structured organization. Harder to backup and migrate.\n\n### Alternative 3: XDG Base Directory Specification\n\nFollow XDG specification for config/data/cache separation.\n**Rejected**: While standards-compliant, would fragment user data across multiple directories making management complex.\n\n### Alternative 4: Container-Based Isolation\n\nEach user gets containerized environment.\n**Rejected**: Too heavy for simple configuration isolation. Adds deployment complexity without sufficient benefits.\n\n### Alternative 5: Database-Based Configuration\n\nStore all user configuration in database.\n**Rejected**: Adds dependency complexity and makes backup/restore more difficult. Over-engineering for configuration needs.\n\n## Implementation Details\n\n### Workspace Initialization\n\n```\n# Automatic workspace creation on first run\nprovisioning workspace init\n\n# Manual workspace creation with template\nprovisioning workspace init --template=developer\n\n# Workspace status and validation\nprovisioning workspace status\nprovisioning workspace validate\n```\n\n### Configuration Resolution Process\n\n1. **Workspace Discovery**: Locate user workspace (env var → default location)\n2. **Configuration Loading**: Load configuration hierarchy with proper precedence\n3. **Path Resolution**: Resolve all paths relative to workspace and system installation\n4. **Variable Interpolation**: Process configuration variables and templates\n5. 
**Validation**: Validate merged configuration for completeness and correctness\n\n### Backup and Migration\n\n```\n# Backup entire workspace\nprovisioning workspace backup --output ~/backup/provisioning-workspace.tar.gz\n\n# Restore workspace from backup\nprovisioning workspace restore --input ~/backup/provisioning-workspace.tar.gz\n\n# Migrate workspace to new version\nprovisioning workspace migrate --from-version 2.0.0 --to-version 3.0.0\n```\n\n### Security Considerations\n\n- **File Permissions**: Workspace created with appropriate user permissions\n- **Secret Management**: Secrets encrypted and isolated within workspace\n- **Extension Sandboxing**: User extensions cannot access system directories\n- **Path Validation**: All paths validated to prevent directory traversal\n- **Configuration Validation**: User configuration validated against schemas\n\n## References\n\n- Distribution Strategy (ADR-002)\n- Configuration System Migration (CLAUDE.md)\n- Security Guidelines (Design Principles)\n- Extension Framework (ADR-005)\n- Multi-User Deployment Patterns
+# ADR-003: Workspace Isolation
+
+## Status
+
+Accepted
+
+## Context
+
+Provisioning required a clear strategy for managing user-specific data, configurations,
+and customizations separate from system-wide installations. Key challenges included:
+
+1. **Configuration Conflicts**: User settings mixed with system defaults, causing unclear precedence
+2. **State Management**: User state (cache, logs, temporary files) scattered across filesystem
+3. **Customization Isolation**: User extensions and customizations affecting system behavior
+4. **Multi-User Support**: Multiple users on same system interfering with each other
+5. **Development vs Production**: Developer needs different from end-user needs
+6. **Path Resolution Complexity**: Complex logic to locate user-specific resources
+7. **Backup and Migration**: Difficulty backing up and migrating user-specific settings
+8. **Security Boundaries**: Need clear separation between system and user-writable areas
+
+The system needed workspace isolation that provides:
+
+- Clear separation of user data from system installation
+- Predictable configuration precedence and inheritance
+- User-specific customization without system impact
+- Multi-user support on shared systems
+- Easy backup and migration of user settings
+- Security isolation between system and user areas
+
+## Decision
+
+Implement **isolated user workspaces** with clear boundaries and hierarchical configuration:
+
+### Workspace Structure
+
+```text
+~/workspace/provisioning/          # User workspace root
+├── config/
+│   ├── user.toml                  # User preferences and overrides
+│   ├── environments/              # Environment-specific configs
+│   │   ├── dev.toml
+│   │   ├── test.toml
+│   │   └── prod.toml
+│   └── secrets/                   # User-specific encrypted secrets
+├── infra/                         # User infrastructure definitions
+│   ├── personal/                  # Personal infrastructure
+│   ├── work/                      # Work-related infrastructure
+│   └── shared/                    # Shared infrastructure definitions
+├── extensions/                    # User-installed extensions
+│   ├── providers/                 # Custom providers
+│   ├── taskservs/                 # Custom task services
+│   └── plugins/                   # User plugins
+├── templates/                     # User-specific templates
+├── cache/                         # Local cache and temporary data
+│   ├── provider-cache/            # Provider API cache
+│   ├── version-cache/             # Version information cache
+│   └── build-cache/               # Build and generation cache
+├── logs/                          # User-specific logs
+├── state/                         # Local state files
+└── backups/                       # Automatic workspace backups
+```
+
+### Configuration Hierarchy (Precedence Order)
+
+1. **Runtime Parameters** (command line, environment variables)
+2. **Environment Configuration** (`config/environments/{env}.toml`)
+3. **Infrastructure Configuration** (`infra/{name}/config.toml`)
+4. **Project Configuration** (project-specific settings)
+5. **User Configuration** (`config/user.toml`)
+6. **System Defaults** (system-wide defaults)
+
+### Key Isolation Principles
+
+1. **Complete Isolation**: User workspace completely independent of system installation
+2. **Hierarchical Inheritance**: Clear configuration inheritance with user overrides
+3. **Security Boundaries**: User workspace in user-writable area only
+4. **Multi-User Safe**: Multiple users can have independent workspaces
+5. **Portable**: Entire user workspace can be backed up and restored
+6. **Version Independent**: Workspace compatible across system version upgrades
+7. **Extension Safe**: User extensions cannot affect system behavior
+8. **State Isolation**: All user state contained within workspace
+
+## Consequences
+
+### Positive
+
+- **User Independence**: Users can customize without affecting system or other users
+- **Configuration Clarity**: Clear hierarchy and precedence for all configuration
+- **Security Isolation**: User modifications cannot compromise system installation
+- **Easy Backup**: Complete user environment can be backed up and restored
+- **Development Flexibility**: Developers can have multiple isolated workspaces
+- **System Upgrades**: System updates don't affect user customizations
+- **Multi-User Support**: Multiple users can work independently on same system
+- **Portable Configurations**: User workspace can be moved between systems
+- **State Management**: All user state in predictable locations
+
+### Negative
+
+- **Initial Setup**: Users must initialize workspace before first use
+- **Path Complexity**: More complex path resolution to support workspace isolation
+- **Disk Usage**: Each user maintains separate cache and state
+- **Configuration Duplication**: Some configuration may be duplicated across users
+- **Migration Overhead**: Existing users need workspace migration
+- **Documentation Complexity**: Need clear documentation for workspace management
+
+### Neutral
+
+- **Backup Strategy**: Users responsible for their own workspace backup
+- **Extension Management**: User-specific extension installation and management
+- **Version Compatibility**: Workspace versions must be compatible with system versions
+- **Performance Implications**: Additional path resolution overhead
+
+## Alternatives Considered
+
+### Alternative 1: System-Wide Configuration Only
+
+All configuration in system directories with user overrides via environment variables.
+**Rejected**: Creates conflicts between users and makes customization difficult. Poor isolation and security.
+
+### Alternative 2: Home Directory Dotfiles
+
+Use traditional dotfile approach (~/.provisioning/).
+**Rejected**: Clutters home directory and provides less structured organization. Harder to back up and migrate.
+
+### Alternative 3: XDG Base Directory Specification
+
+Follow XDG specification for config/data/cache separation.
+**Rejected**: While standards-compliant, would fragment user data across multiple directories, making management complex.
+
+### Alternative 4: Container-Based Isolation
+
+Each user gets containerized environment.
+**Rejected**: Too heavy for simple configuration isolation. Adds deployment complexity without sufficient benefits.
+
+### Alternative 5: Database-Based Configuration
+
+Store all user configuration in database.
+**Rejected**: Adds dependency complexity and makes backup/restore more difficult. Over-engineering for configuration needs.
+
+## Implementation Details
+
+### Workspace Initialization
+
+```text
+# Automatic workspace creation on first run
+provisioning workspace init
+
+# Manual workspace creation with template
+provisioning workspace init --template=developer
+
+# Workspace status and validation
+provisioning workspace status
+provisioning workspace validate
+```
+
+### Configuration Resolution Process
+
+1. **Workspace Discovery**: Locate user workspace (env var → default location)
+2. **Configuration Loading**: Load configuration hierarchy with proper precedence
+3. **Path Resolution**: Resolve all paths relative to workspace and system installation
+4. **Variable Interpolation**: Process configuration variables and templates
+5. **Validation**: Validate merged configuration for completeness and correctness
+
+### Backup and Migration
+
+```text
+# Backup entire workspace
+provisioning workspace backup --output ~/backup/provisioning-workspace.tar.gz
+
+# Restore workspace from backup
+provisioning workspace restore --input ~/backup/provisioning-workspace.tar.gz
+
+# Migrate workspace to new version
+provisioning workspace migrate --from-version 2.0.0 --to-version 3.0.0
+```
+
+### Security Considerations
+
+- **File Permissions**: Workspace created with appropriate user permissions
+- **Secret Management**: Secrets encrypted and isolated within workspace
+- **Extension Sandboxing**: User extensions cannot access system directories
+- **Path Validation**: All paths validated to prevent directory traversal
+- **Configuration Validation**: User configuration validated against schemas
+
+## References
+
+- Distribution Strategy (ADR-002)
+- Configuration System Migration (CLAUDE.md)
+- Security Guidelines (Design Principles)
+- Extension Framework (ADR-005)
+- Multi-User Deployment Patterns
\ No newline at end of file
diff --git a/docs/src/architecture/adr/ADR-004-hybrid-architecture.md b/docs/src/architecture/adr/ADR-004-hybrid-architecture.md
index 81b3403..4a375f4 100644
--- a/docs/src/architecture/adr/ADR-004-hybrid-architecture.md
+++ b/docs/src/architecture/adr/ADR-004-hybrid-architecture.md
@@ -1 +1,210 @@
-# ADR-004: Hybrid Architecture\n\n## Status\n\nAccepted\n\n## Context\n\nProvisioning encountered fundamental limitations with a pure Nushell implementation that required architectural solutions:\n\n1. **Deep Call Stack Limitations**: Nushell's `open` command fails in deep call contexts\n (`enumerate | each`), causing "Type not supported" errors in template.nu:71\n2. **Performance Bottlenecks**: Complex workflow orchestration hitting Nushell's performance limits\n3. **Concurrency Constraints**: Limited parallel processing capabilities in Nushell for batch operations\n4. **Integration Complexity**: Need for REST API endpoints and external system integration\n5. **State Management**: Complex state tracking and persistence requirements beyond Nushell's capabilities\n6. **Business Logic Preservation**: 65+ existing Nushell files with domain expertise that shouldn't be rewritten\n7. **Developer Productivity**: Nushell excels for configuration management and domain-specific operations\n\nThe system needed an architecture that:\n\n- Solves Nushell's technical limitations without losing business logic\n- Leverages each language's strengths appropriately\n- Maintains existing investment in Nushell domain knowledge\n- Provides performance for coordination-heavy operations\n- Enables modern integration patterns (REST APIs, async workflows)\n- Preserves configuration-driven, Infrastructure as Code principles\n\n## Decision\n\nImplement a **Hybrid Rust/Nushell Architecture** with clear separation of concerns:\n\n### Architecture Layers\n\n#### 1. Coordination Layer (Rust)\n\n- **Orchestrator**: High-performance workflow coordination and task scheduling\n- **REST API Server**: HTTP endpoints for external integration\n- **State Management**: Persistent state tracking with checkpoint recovery\n- **Batch Processing**: Parallel execution of complex workflows\n- **File-based Persistence**: Lightweight task queue using reliable file storage\n- **Error Recovery**: Sophisticated error handling and rollback capabilities\n\n#### 2. 
Business Logic Layer (Nushell)\n\n- **Provider Implementations**: Cloud provider-specific operations (AWS, UpCloud, local)\n- **Task Services**: Infrastructure service management (Kubernetes, networking, storage)\n- **Configuration Management**: KCL-based configuration processing and validation\n- **Template Processing**: Infrastructure-as-Code template generation\n- **CLI Interface**: User-facing command-line tools and workflows\n- **Domain Operations**: All business-specific logic and operations\n\n### Integration Patterns\n\n#### Rust → Nushell Communication\n\n```\n// Rust orchestrator invokes Nushell scripts via process execution\nlet result = Command::new("nu")\n .arg("-c")\n .arg("use core/nulib/workflows/server_create.nu *; server_create_workflow 'name' '' []")\n .output()?;\n```\n\n#### Nushell → Rust Communication\n\n```\n# Nushell submits workflows to Rust orchestrator via HTTP API\nhttp post "http://localhost:9090/workflows/servers/create" {\n name: "server-name",\n provider: "upcloud",\n config: $server_config\n}\n```\n\n#### Data Exchange Format\n\n- **Structured JSON**: All data exchange via JSON for type safety and interoperability\n- **Configuration TOML**: Configuration data in TOML format for human readability\n- **State Files**: Lightweight file-based state exchange between layers\n\n### Key Architectural Principles\n\n1. **Language Strengths**: Use each language for what it does best\n2. **Business Logic Preservation**: All existing domain knowledge stays in Nushell\n3. **Performance Critical Path**: Coordination and orchestration in Rust\n4. **Clear Boundaries**: Well-defined interfaces between layers\n5. **Configuration Driven**: Both layers respect configuration-driven architecture\n6. **Error Handling**: Coordinated error handling across language boundaries\n7. 
**State Consistency**: Consistent state management across hybrid system\n\n## Consequences\n\n### Positive\n\n- **Technical Limitations Solved**: Eliminates Nushell deep call stack issues\n- **Performance Optimized**: High-performance coordination while preserving productivity\n- **Business Logic Preserved**: 65+ Nushell files with domain expertise maintained\n- **Modern Integration**: REST APIs and async workflows enabled\n- **Development Efficiency**: Developers can use optimal language for each task\n- **Batch Processing**: Parallel workflow execution with sophisticated state management\n- **Error Recovery**: Advanced error handling and rollback capabilities\n- **Scalability**: Architecture scales to complex multi-provider workflows\n- **Maintainability**: Clear separation of concerns between layers\n\n### Negative\n\n- **Complexity Increase**: Two-language system requires more architectural coordination\n- **Integration Overhead**: Data serialization/deserialization between languages\n- **Development Skills**: Team needs expertise in both Rust and Nushell\n- **Testing Complexity**: Must test integration between language layers\n- **Deployment Complexity**: Two runtime environments must be coordinated\n- **Debugging Challenges**: Debugging across language boundaries more complex\n\n### Neutral\n\n- **Development Patterns**: Different patterns for each layer while maintaining consistency\n- **Documentation Strategy**: Language-specific documentation with integration guides\n- **Tool Chain**: Multiple development tool chains must be maintained\n- **Performance Characteristics**: Different performance characteristics for different operations\n\n## Alternatives Considered\n\n### Alternative 1: Pure Nushell Implementation\n\nContinue with Nushell-only approach and work around limitations.\n**Rejected**: Technical limitations are fundamental and cannot be worked around without compromising functionality. Deep call stack issues are\narchitectural.\n\n### Alternative 2: Complete Rust Rewrite\n\nRewrite entire system in Rust for consistency.\n**Rejected**: Would lose 65+ files of domain expertise and Nushell's productivity advantages for configuration management. Massive development effort.\n\n### Alternative 3: Pure Go Implementation\n\nRewrite system in Go for simplicity and performance.\n**Rejected**: Same issues as Rust rewrite - loses domain expertise and Nushell's configuration strengths. Go doesn't provide significant advantages.\n\n### Alternative 4: Python/Shell Hybrid\n\nUse Python for coordination and shell scripts for operations.\n**Rejected**: Loses type safety and configuration-driven advantages of current system. Python adds dependency complexity.\n\n### Alternative 5: Container-Based Separation\n\nRun Nushell and coordination layer in separate containers.\n**Rejected**: Adds deployment complexity and network communication overhead. 
Complicates local development significantly.\n\n## Implementation Details\n\n### Orchestrator Components\n\n- **Task Queue**: File-based persistent queue for reliable workflow management\n- **HTTP Server**: REST API for workflow submission and monitoring\n- **State Manager**: Checkpoint-based state tracking with recovery\n- **Process Manager**: Nushell script execution with proper isolation\n- **Error Handler**: Comprehensive error recovery and rollback logic\n\n### Integration Protocols\n\n- **HTTP REST**: Primary API for external integration\n- **JSON Data Exchange**: Structured data format for all communication\n- **File-based State**: Lightweight persistence without database dependencies\n- **Process Execution**: Secure subprocess execution for Nushell operations\n\n### Development Workflow\n\n1. **Rust Development**: Focus on coordination, performance, and integration\n2. **Nushell Development**: Focus on business logic, providers, and task services\n3. **Integration Testing**: Validate communication between layers\n4. **End-to-End Validation**: Complete workflow testing across both layers\n\n### Monitoring and Observability\n\n- **Structured Logging**: JSON logs from both Rust and Nushell components\n- **Metrics Collection**: Performance metrics from coordination layer\n- **Health Checks**: System health monitoring across both layers\n- **Workflow Tracking**: Complete audit trail of workflow execution\n\n## Migration Strategy\n\n### Phase 1: Core Infrastructure (Completed)\n\n- ✅ Rust orchestrator implementation\n- ✅ REST API endpoints\n- ✅ File-based task queue\n- ✅ Basic Nushell integration\n\n### Phase 2: Workflow Integration (Completed)\n\n- ✅ Server creation workflows\n- ✅ Task service workflows\n- ✅ Cluster deployment workflows\n- ✅ State management and recovery\n\n### Phase 3: Advanced Features (Completed)\n\n- ✅ Batch workflow processing\n- ✅ Dependency resolution\n- ✅ Rollback capabilities\n- ✅ Real-time monitoring\n\n## References\n\n- Deep Call Stack Limitations (CLAUDE.md - Architectural Lessons Learned)\n- Configuration-Driven Architecture (ADR-002)\n- Batch Workflow System (CLAUDE.md - v3.1.0)\n- Integration Patterns Documentation\n- Performance Benchmarking Results
+# ADR-004: Hybrid Architecture
+
+## Status
+
+Accepted
+
+## Context
+
+Provisioning encountered fundamental limitations with a pure Nushell implementation that required architectural solutions:
+
+1. **Deep Call Stack Limitations**: Nushell's `open` command fails in deep call contexts
+   (`enumerate | each`), causing "Type not supported" errors in template.nu:71
+2. **Performance Bottlenecks**: Complex workflow orchestration hitting Nushell's performance limits
+3. **Concurrency Constraints**: Limited parallel processing capabilities in Nushell for batch operations
+4. **Integration Complexity**: Need for REST API endpoints and external system integration
+5. **State Management**: Complex state tracking and persistence requirements beyond Nushell's capabilities
+6. **Business Logic Preservation**: 65+ existing Nushell files with domain expertise that shouldn't be rewritten
+7. **Developer Productivity**: Nushell excels for configuration management and domain-specific operations
+
+The system needed an architecture that:
+
+- Solves Nushell's technical limitations without losing business logic
+- Leverages each language's strengths appropriately
+- Maintains existing investment in Nushell domain knowledge
+- Provides performance for coordination-heavy operations
+- Enables modern integration patterns (REST APIs, async workflows)
+- Preserves configuration-driven, Infrastructure as Code principles
+
+## Decision
+
+Implement a **Hybrid Rust/Nushell Architecture** with clear separation of concerns:
+
+### Architecture Layers
+
+#### 1. Coordination Layer (Rust)
+
+- **Orchestrator**: High-performance workflow coordination and task scheduling
+- **REST API Server**: HTTP endpoints for external integration
+- **State Management**: Persistent state tracking with checkpoint recovery
+- **Batch Processing**: Parallel execution of complex workflows
+- **File-based Persistence**: Lightweight task queue using reliable file storage
+- **Error Recovery**: Sophisticated error handling and rollback capabilities
+
+#### 2. Business Logic Layer (Nushell)
+
+- **Provider Implementations**: Cloud provider-specific operations (AWS, UpCloud, local)
+- **Task Services**: Infrastructure service management (Kubernetes, networking, storage)
+- **Configuration Management**: KCL-based configuration processing and validation
+- **Template Processing**: Infrastructure-as-Code template generation
+- **CLI Interface**: User-facing command-line tools and workflows
+- **Domain Operations**: All business-specific logic and operations
+
+### Integration Patterns
+
+#### Rust → Nushell Communication
+
+```text
+// Rust orchestrator invokes Nushell scripts via process execution
+let result = Command::new("nu")
+    .arg("-c")
+    .arg("use core/nulib/workflows/server_create.nu *; server_create_workflow 'name' '' []")
+    .output()?;
+```
+
+#### Nushell → Rust Communication
+
+```text
+# Nushell submits workflows to Rust orchestrator via HTTP API
+http post "http://localhost:9090/workflows/servers/create" {
+    name: "server-name",
+    provider: "upcloud",
+    config: $server_config
+}
+```
+
+#### Data Exchange Format
+
+- **Structured JSON**: All data exchange via JSON for type safety and interoperability
+- **Configuration TOML**: Configuration data in TOML format for human readability
+- **State Files**: Lightweight file-based state exchange between layers
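+
+Putting the two directions together, the Nushell side of a round trip can be sketched as
+follows. This is a sketch only: the endpoint path matches the example above, but the
+response fields (`id`, `state`) and the helper name are assumptions, not the documented
+orchestrator API.
+
+```text
+# Illustrative Nushell sketch: submit a server workflow and fail loudly if the
+# orchestrator does not acknowledge it. The record body is sent as JSON and the
+# JSON reply is parsed into structured data.
+def submit-server-workflow [name: string, provider: string, config: record]: nothing -> record {
+    let response = (http post --content-type application/json "http://localhost:9090/workflows/servers/create" {
+        name: $name,
+        provider: $provider,
+        config: $config
+    })
+
+    # Assumed response shape: { id: "...", state: "queued" }
+    if ($response.id? | is-empty) {
+        error make { msg: "workflow submission was not acknowledged" }
+    }
+    $response
+}
+```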
+
+### Key Architectural Principles
+
+1. **Language Strengths**: Use each language for what it does best
+2. **Business Logic Preservation**: All existing domain knowledge stays in Nushell
+3. **Performance Critical Path**: Coordination and orchestration in Rust
+4. **Clear Boundaries**: Well-defined interfaces between layers
+5. **Configuration Driven**: Both layers respect configuration-driven architecture
+6. **Error Handling**: Coordinated error handling across language boundaries
+7. **State Consistency**: Consistent state management across hybrid system
+
+## Consequences
+
+### Positive
+
+- **Technical Limitations Solved**: Eliminates Nushell deep call stack issues
+- **Performance Optimized**: High-performance coordination while preserving productivity
+- **Business Logic Preserved**: 65+ Nushell files with domain expertise maintained
+- **Modern Integration**: REST APIs and async workflows enabled
+- **Development Efficiency**: Developers can use optimal language for each task
+- **Batch Processing**: Parallel workflow execution with sophisticated state management
+- **Error Recovery**: Advanced error handling and rollback capabilities
+- **Scalability**: Architecture scales to complex multi-provider workflows
+- **Maintainability**: Clear separation of concerns between layers
+
+### Negative
+
+- **Complexity Increase**: Two-language system requires more architectural coordination
+- **Integration Overhead**: Data serialization/deserialization between languages
+- **Development Skills**: Team needs expertise in both Rust and Nushell
+- **Testing Complexity**: Must test integration between language layers
+- **Deployment Complexity**: Two runtime environments must be coordinated
+- **Debugging Challenges**: Debugging across language boundaries more complex
+
+### Neutral
+
+- **Development Patterns**: Different patterns for each layer while maintaining consistency
+- **Documentation Strategy**: Language-specific documentation with integration guides
+- **Tool Chain**: Multiple development tool chains must be maintained
+- **Performance Characteristics**: Different performance characteristics for different operations
+
+## Alternatives Considered
+
+### Alternative 1: Pure Nushell Implementation
+
+Continue with Nushell-only approach and work around limitations.
+**Rejected**: Technical limitations are fundamental and cannot be worked around without compromising functionality. Deep call stack issues are
+architectural.
+
+### Alternative 2: Complete Rust Rewrite
+
+Rewrite entire system in Rust for consistency.
+**Rejected**: Would lose 65+ files of domain expertise and Nushell's productivity advantages for configuration management. Massive development effort.
+
+### Alternative 3: Pure Go Implementation
+
+Rewrite system in Go for simplicity and performance.
+**Rejected**: Same issues as Rust rewrite - loses domain expertise and Nushell's configuration strengths. Go doesn't provide significant advantages.
+
+### Alternative 4: Python/Shell Hybrid
+
+Use Python for coordination and shell scripts for operations.
+**Rejected**: Loses type safety and configuration-driven advantages of current system. Python adds dependency complexity.
+
+### Alternative 5: Container-Based Separation
+
+Run Nushell and coordination layer in separate containers.
+**Rejected**: Adds deployment complexity and network communication overhead. Complicates local development significantly.
+
+## Implementation Details
+
+### Orchestrator Components
+
+- **Task Queue**: File-based persistent queue for reliable workflow management
+- **HTTP Server**: REST API for workflow submission and monitoring
+- **State Manager**: Checkpoint-based state tracking with recovery
+- **Process Manager**: Nushell script execution with proper isolation
+- **Error Handler**: Comprehensive error recovery and rollback logic
+
+### Integration Protocols
+
+- **HTTP REST**: Primary API for external integration
+- **JSON Data Exchange**: Structured data format for all communication
+- **File-based State**: Lightweight persistence without database dependencies
+- **Process Execution**: Secure subprocess execution for Nushell operations
+
+### Development Workflow
+
+1. **Rust Development**: Focus on coordination, performance, and integration
+2. **Nushell Development**: Focus on business logic, providers, and task services
+3. **Integration Testing**: Validate communication between layers
+4. **End-to-End Validation**: Complete workflow testing across both layers
+
+### Monitoring and Observability
+
+- **Structured Logging**: JSON logs from both Rust and Nushell components
+- **Metrics Collection**: Performance metrics from coordination layer
+- **Health Checks**: System health monitoring across both layers
+- **Workflow Tracking**: Complete audit trail of workflow execution
+
+## Migration Strategy
+
+### Phase 1: Core Infrastructure (Completed)
+
+- ✅ Rust orchestrator implementation
+- ✅ REST API endpoints
+- ✅ File-based task queue
+- ✅ Basic Nushell integration
+
+### Phase 2: Workflow Integration (Completed)
+
+- ✅ Server creation workflows
+- ✅ Task service workflows
+- ✅ Cluster deployment workflows
+- ✅ State management and recovery
+
+### Phase 3: Advanced Features (Completed)
+
+- ✅ Batch workflow processing
+- ✅ Dependency resolution
+- ✅ Rollback capabilities
+- ✅ Real-time monitoring
+
+## References
+
+- Deep Call Stack Limitations (CLAUDE.md - Architectural Lessons Learned)
+- Configuration-Driven Architecture (ADR-002)
+- Batch Workflow System (CLAUDE.md - v3.1.0)
+- Integration Patterns Documentation
+- Performance Benchmarking Results
\ No newline at end of file
diff --git a/docs/src/architecture/adr/ADR-005-extension-framework.md b/docs/src/architecture/adr/ADR-005-extension-framework.md
index 7be666b..1cf7735 100644
--- a/docs/src/architecture/adr/ADR-005-extension-framework.md
+++ b/docs/src/architecture/adr/ADR-005-extension-framework.md
@@ -1 +1,284 @@
-# ADR-005: Extension Framework\n\n## Status\n\nAccepted\n\n## Context\n\nProvisioning required a flexible extension mechanism to support:\n\n1. **Custom Providers**: Organizations need to add custom cloud providers beyond AWS, UpCloud, and local\n2. **Custom Task Services**: Users need to integrate proprietary infrastructure services\n3. **Custom Workflows**: Complex organizations require custom orchestration patterns\n4. **Third-Party Integration**: Need to integrate with existing toolchains and systems\n5. **User Customization**: Power users want to extend and modify system behavior\n6. **Plugin Ecosystem**: Enable community contributions and extensions\n7. **Isolation Requirements**: Extensions must not compromise system stability\n8. **Discovery Mechanism**: System must automatically discover and load extensions\n9. **Version Compatibility**: Extensions must work across system version upgrades\n10. 
**Configuration Integration**: Extensions should integrate with configuration-driven architecture\n\nThe system needed an extension framework that provides:\n\n- Clear extension API and interfaces\n- Safe isolation of extension code\n- Automatic discovery and loading\n- Configuration integration\n- Version compatibility management\n- Developer-friendly extension development patterns\n\n## Decision\n\nImplement a **registry-based extension framework** with structured discovery and isolation:\n\n### Extension Architecture\n\n#### Extension Types\n\n1. **Provider Extensions**: Custom cloud providers and infrastructure backends\n2. **Task Service Extensions**: Custom infrastructure services and components\n3. **Workflow Extensions**: Custom orchestration and deployment patterns\n4. **CLI Extensions**: Additional command-line tools and interfaces\n5. **Template Extensions**: Custom configuration and code generation templates\n6. **Integration Extensions**: External system integrations and connectors\n\n### Extension Structure\n\n```\nextensions/\n├── providers/ # Provider extensions\n│ └── custom-cloud/\n│ ├── extension.toml # Extension manifest\n│ ├── kcl/ # KCL configuration schemas\n│ ├── nulib/ # Nushell implementation\n│ └── templates/ # Configuration templates\n├── taskservs/ # Task service extensions\n│ └── custom-service/\n│ ├── extension.toml\n│ ├── kcl/\n│ ├── nulib/\n│ └── manifests/ # Kubernetes manifests\n├── workflows/ # Workflow extensions\n│ └── custom-workflow/\n│ ├── extension.toml\n│ └── nulib/\n├── cli/ # CLI extensions\n│ └── custom-commands/\n│ ├── extension.toml\n│ └── nulib/\n└── integrations/ # Integration extensions\n └── external-tool/\n ├── extension.toml\n └── nulib/\n```\n\n### Extension Manifest (extension.toml)\n\n```\n[extension]\nname = "custom-provider"\nversion = "1.0.0"\ntype = "provider"\ndescription = "Custom cloud provider integration"\nauthor = "Organization Name"\nlicense = "MIT"\nhomepage = "https://github.com/org/custom-provider"\n\n[compatibility]\nprovisioning_version = ">=3.0.0,<4.0.0"\nnushell_version = ">=0.107.0"\nkcl_version = ">=0.11.0"\n\n[dependencies]\nhttp_client = ">=1.0.0"\njson_parser = ">=2.0.0"\n\n[entry_points]\ncli = "nulib/cli.nu"\nprovider = "nulib/provider.nu"\nconfig_schema = "schemas/schema.ncl"\n\n[configuration]\nconfig_prefix = "custom_provider"\nrequired_env_vars = ["CUSTOM_PROVIDER_API_KEY"]\noptional_config = ["custom_provider.region", "custom_provider.timeout"]\n```\n\n### Key Framework Principles\n\n1. **Registry-Based Discovery**: Extensions registered in structured directories\n2. **Manifest-Driven Loading**: Extension capabilities declared in manifest files\n3. **Version Compatibility**: Explicit compatibility declarations and validation\n4. **Configuration Integration**: Extensions integrate with system configuration hierarchy\n5. **Isolation Boundaries**: Extensions isolated from core system and each other\n6. **Standard Interfaces**: Consistent interfaces across extension types\n7. **Development Patterns**: Clear patterns for extension development\n8. 
**Community Support**: Framework designed for community contributions\n\n## Consequences\n\n### Positive\n\n- **Extensibility**: System can be extended without modifying core code\n- **Community Growth**: Enable community contributions and ecosystem development\n- **Organization Customization**: Organizations can add proprietary integrations\n- **Innovation Support**: New technologies can be integrated via extensions\n- **Isolation Safety**: Extensions cannot compromise system stability\n- **Configuration Consistency**: Extensions integrate with configuration-driven architecture\n- **Development Efficiency**: Clear patterns reduce extension development time\n- **Version Management**: Compatibility system prevents breaking changes\n- **Discovery Automation**: Extensions automatically discovered and loaded\n\n### Negative\n\n- **Complexity Increase**: Additional layer of abstraction and management\n- **Performance Overhead**: Extension loading and isolation adds runtime cost\n- **Testing Complexity**: Must test extension framework and individual extensions\n- **Documentation Burden**: Need comprehensive extension development documentation\n- **Version Coordination**: Extension compatibility matrix requires management\n- **Support Complexity**: Community extensions may require support resources\n\n### Neutral\n\n- **Development Patterns**: Different patterns for extension vs core development\n- **Quality Control**: Community extensions may vary in quality and maintenance\n- **Security Considerations**: Extensions need security review and validation\n- **Dependency Management**: Extension dependencies must be managed carefully\n\n## Alternatives Considered\n\n### Alternative 1: Filesystem-Based Extensions\n\nSimple filesystem scanning for extension discovery.\n**Rejected**: No manifest validation or version compatibility checking. Fragile discovery mechanism.\n\n### Alternative 2: Database-Backed Registry\n\nStore extension metadata in database for discovery.\n**Rejected**: Adds database dependency complexity. Over-engineering for extension discovery needs.\n\n### Alternative 3: Package Manager Integration\n\nUse existing package managers (cargo, npm) for extension distribution.\n**Rejected**: Complicates installation and creates external dependencies. Not suitable for corporate environments.\n\n### Alternative 4: Container-Based Extensions\n\nEach extension runs in isolated container.\n**Rejected**: Too heavy for simple extensions. Complicates development and deployment significantly.\n\n### Alternative 5: Plugin Architecture\n\nTraditional plugin architecture with dynamic loading.\n**Rejected**: Complex for shell-based system. Security and isolation challenges in Nushell environment.\n\n## Implementation Details\n\n### Extension Discovery Process\n\n1. **Directory Scanning**: Scan extension directories for manifest files\n2. **Manifest Validation**: Parse and validate extension manifest\n3. **Compatibility Check**: Verify version compatibility requirements\n4. **Dependency Resolution**: Resolve extension dependencies\n5. **Configuration Integration**: Merge extension configuration schemas\n6. 
**Entry Point Registration**: Register extension entry points with system\n\n### Extension Loading Lifecycle\n\n```\n# Extension discovery and validation\nprovisioning extension discover\nprovisioning extension validate --extension custom-provider\n\n# Extension activation and configuration\nprovisioning extension enable custom-provider\nprovisioning extension configure custom-provider\n\n# Extension usage\nprovisioning provider list # Shows custom providers\nprovisioning server create --provider custom-provider\n\n# Extension management\nprovisioning extension disable custom-provider\nprovisioning extension update custom-provider\n```\n\n### Configuration Integration\n\nExtensions integrate with hierarchical configuration system:\n\n```\n# System configuration includes extension settings\n[custom_provider]\napi_endpoint = "https://api.custom-cloud.com"\nregion = "us-west-1"\ntimeout = 30\n\n# Extension configuration follows same hierarchy rules\n# System defaults → User config → Environment config → Runtime\n```\n\n### Security and Isolation\n\n- **Sandboxed Execution**: Extensions run in controlled environment\n- **Permission Model**: Extensions declare required permissions in manifest\n- **Code Review**: Community extensions require review process\n- **Digital Signatures**: Extensions can be digitally signed for authenticity\n- **Audit Logging**: Extension usage tracked in system audit logs\n\n### Development Support\n\n- **Extension Templates**: Scaffold new extensions from templates\n- **Development Tools**: Testing and validation tools for extension developers\n- **Documentation Generation**: Automatic documentation from extension manifests\n- **Integration Testing**: Framework for testing extensions with core system\n\n## Extension Development Patterns\n\n### Provider Extension Pattern\n\n```\n# extensions/providers/custom-cloud/nulib/provider.nu\nexport def list-servers [] -> table {\n http get $"($config.custom_provider.api_endpoint)/servers"\n | from json\n | select name status region\n}\n\nexport def create-server [name: string, config: record] -> record {\n let payload = {\n name: $name,\n instance_type: $config.plan,\n region: $config.zone\n }\n\n http post $"($config.custom_provider.api_endpoint)/servers" $payload\n | from json\n}\n```\n\n### Task Service Extension Pattern\n\n```\n# extensions/taskservs/custom-service/nulib/service.nu\nexport def install [server: string] -> nothing {\n let manifest_data = open ./manifests/deployment.yaml\n | str replace "{{server}}" $server\n\n kubectl apply --server $server --data $manifest_data\n}\n\nexport def uninstall [server: string] -> nothing {\n kubectl delete deployment custom-service --server $server\n}\n```\n\n## References\n\n- Workspace Isolation (ADR-003)\n- Configuration System Architecture (ADR-002)\n- Hybrid Architecture Integration (ADR-004)\n- Community Extension Guidelines\n- Extension Security Framework\n- Extension Development Documentation
+# ADR-005: Extension Framework
+
+## Status
+
+Accepted
+
+## Context
+
+Provisioning required a flexible extension mechanism to support:
+
+1. **Custom Providers**: Organizations need to add custom cloud providers beyond AWS, UpCloud, and local
+2. **Custom Task Services**: Users need to integrate proprietary infrastructure services
+3. **Custom Workflows**: Complex organizations require custom orchestration patterns
+4. **Third-Party Integration**: Need to integrate with existing toolchains and systems
+5. **User Customization**: Power users want to extend and modify system behavior
+6. **Plugin Ecosystem**: Enable community contributions and extensions
+7. **Isolation Requirements**: Extensions must not compromise system stability
+8. **Discovery Mechanism**: System must automatically discover and load extensions
+9. **Version Compatibility**: Extensions must work across system version upgrades
+10. **Configuration Integration**: Extensions should integrate with configuration-driven architecture
+
+The system needed an extension framework that provides:
+
+- Clear extension API and interfaces
+- Safe isolation of extension code
+- Automatic discovery and loading
+- Configuration integration
+- Version compatibility management
+- Developer-friendly extension development patterns
+
+## Decision
+
+Implement a **registry-based extension framework** with structured discovery and isolation:
+
+### Extension Architecture
+
+#### Extension Types
+
+1. **Provider Extensions**: Custom cloud providers and infrastructure backends
+2. **Task Service Extensions**: Custom infrastructure services and components
+3. **Workflow Extensions**: Custom orchestration and deployment patterns
+4. **CLI Extensions**: Additional command-line tools and interfaces
+5. **Template Extensions**: Custom configuration and code generation templates
+6. **Integration Extensions**: External system integrations and connectors
+
+### Extension Structure
+
+```text
+extensions/
+├── providers/                  # Provider extensions
+│   └── custom-cloud/
+│       ├── extension.toml      # Extension manifest
+│       ├── kcl/                # KCL configuration schemas
+│       ├── nulib/              # Nushell implementation
+│       └── templates/          # Configuration templates
+├── taskservs/                  # Task service extensions
+│   └── custom-service/
+│       ├── extension.toml
+│       ├── kcl/
+│       ├── nulib/
+│       └── manifests/          # Kubernetes manifests
+├── workflows/                  # Workflow extensions
+│   └── custom-workflow/
+│       ├── extension.toml
+│       └── nulib/
+├── cli/                        # CLI extensions
+│   └── custom-commands/
+│       ├── extension.toml
+│       └── nulib/
+└── integrations/               # Integration extensions
+    └── external-tool/
+        ├── extension.toml
+        └── nulib/
+```
+
+### Extension Manifest (extension.toml)
+
+```text
+[extension]
+name = "custom-provider"
+version = "1.0.0"
+type = "provider"
+description = "Custom cloud provider integration"
+author = "Organization Name"
+license = "MIT"
+homepage = "https://github.com/org/custom-provider"
+
+[compatibility]
+provisioning_version = ">=3.0.0,<4.0.0"
+nushell_version = ">=0.107.0"
+kcl_version = ">=0.11.0"
+
+[dependencies]
+http_client = ">=1.0.0"
+json_parser = ">=2.0.0"
+
+[entry_points]
+cli = "nulib/cli.nu"
+provider = "nulib/provider.nu"
+config_schema = "schemas/schema.ncl"
+
+[configuration]
+config_prefix = "custom_provider"
+required_env_vars = ["CUSTOM_PROVIDER_API_KEY"]
+optional_config = ["custom_provider.region", "custom_provider.timeout"]
+```
+
+### Key Framework Principles
+
+1. **Registry-Based Discovery**: Extensions registered in structured directories
+2. **Manifest-Driven Loading**: Extension capabilities declared in manifest files
+3. **Version Compatibility**: Explicit compatibility declarations and validation
+4. **Configuration Integration**: Extensions integrate with system configuration hierarchy
+5. **Isolation Boundaries**: Extensions isolated from core system and each other
+6. **Standard Interfaces**: Consistent interfaces across extension types
+7. **Development Patterns**: Clear patterns for extension development
+8. **Community Support**: Framework designed for community contributions
+
+## Consequences
+
+### Positive
+
+- **Extensibility**: System can be extended without modifying core code
+- **Community Growth**: Enable community contributions and ecosystem development
+- **Organization Customization**: Organizations can add proprietary integrations
+- **Innovation Support**: New technologies can be integrated via extensions
+- **Isolation Safety**: Extensions cannot compromise system stability
+- **Configuration Consistency**: Extensions integrate with configuration-driven architecture
+- **Development Efficiency**: Clear patterns reduce extension development time
+- **Version Management**: Compatibility system prevents breaking changes
+- **Discovery Automation**: Extensions automatically discovered and loaded
+
+### Negative
+
+- **Complexity Increase**: Additional layer of abstraction and management
+- **Performance Overhead**: Extension loading and isolation adds runtime cost
+- **Testing Complexity**: Must test extension framework and individual extensions
+- **Documentation Burden**: Need comprehensive extension development documentation
+- **Version Coordination**: Extension compatibility matrix requires management
+- **Support Complexity**: Community extensions may require support resources
+
+### Neutral
+
+- **Development Patterns**: Different patterns for extension vs core development
+- **Quality Control**: Community extensions may vary in quality and maintenance
+- **Security Considerations**: Extensions need security review and validation
+- **Dependency Management**: Extension dependencies must be managed carefully
+
+## Alternatives Considered
+
+### Alternative 1: Filesystem-Based Extensions
+
+Simple filesystem scanning for extension discovery.
+**Rejected**: No manifest validation or version compatibility checking. Fragile discovery mechanism.
+
+### Alternative 2: Database-Backed Registry
+
+Store extension metadata in database for discovery.
+**Rejected**: Adds database dependency complexity. Over-engineering for extension discovery needs.
+
+### Alternative 3: Package Manager Integration
+
+Use existing package managers (cargo, npm) for extension distribution.
+**Rejected**: Complicates installation and creates external dependencies. Not suitable for corporate environments.
+
+### Alternative 4: Container-Based Extensions
+
+Each extension runs in isolated container.
+**Rejected**: Too heavy for simple extensions. Complicates development and deployment significantly.
+
+### Alternative 5: Plugin Architecture
+
+Traditional plugin architecture with dynamic loading.
+**Rejected**: Complex for shell-based system. Security and isolation challenges in Nushell environment.
+
+## Implementation Details
+
+### Extension Discovery Process
+
+1. **Directory Scanning**: Scan extension directories for manifest files
+2. **Manifest Validation**: Parse and validate extension manifest
+3. **Compatibility Check**: Verify version compatibility requirements
+4. **Dependency Resolution**: Resolve extension dependencies
+5. **Configuration Integration**: Merge extension configuration schemas
+6. **Entry Point Registration**: Register extension entry points with system
+
+### Extension Loading Lifecycle
+
+```text
+# Extension discovery and validation
+provisioning extension discover
+provisioning extension validate --extension custom-provider
+
+# Extension activation and configuration
+provisioning extension enable custom-provider
+provisioning extension configure custom-provider
+
+# Extension usage
+provisioning provider list              # Shows custom providers
+provisioning server create --provider custom-provider
+
+# Extension management
+provisioning extension disable custom-provider
+provisioning extension update custom-provider
+```
+
+### Configuration Integration
+
+Extensions integrate with hierarchical configuration system:
+
+```text
+# System configuration includes extension settings
+[custom_provider]
+api_endpoint = "https://api.custom-cloud.com"
+region = "us-west-1"
+timeout = 30
+
+# Extension configuration follows same hierarchy rules
+# System defaults → User config → Environment config → Runtime
+```
+
+### Security and Isolation
+
+- **Sandboxed Execution**: Extensions run in controlled environment
+- **Permission Model**: Extensions declare required permissions in manifest
+- **Code Review**: Community extensions require review process
+- **Digital Signatures**: Extensions can be digitally signed for authenticity
+- **Audit Logging**: Extension usage tracked in system audit logs
+
+### Development Support
+
+- **Extension Templates**: Scaffold new extensions from templates
+- **Development Tools**: Testing and validation tools for extension developers
+- **Documentation Generation**: Automatic documentation from extension manifests
+- **Integration Testing**: Framework for testing extensions with core system
+
+## Extension Development Patterns
+
+### Provider Extension Pattern
+
+```text
+# extensions/providers/custom-cloud/nulib/provider.nu
+export def list-servers []: nothing -> table {
+    http get $"($config.custom_provider.api_endpoint)/servers"
+    | from json
+    | select name status region
+}
+
+export def create-server [name: string, config: record]: nothing -> record {
+    let payload = {
+        name: $name,
+        instance_type: $config.plan,
+        region: $config.zone
+    }
+
+    http post $"($config.custom_provider.api_endpoint)/servers" $payload
+    | from json
+}
+```
+
+### Task Service Extension Pattern
+
+```text
+# extensions/taskservs/custom-service/nulib/service.nu
+export def install [server: string]: nothing -> nothing {
+    let manifest_data = (open --raw ./manifests/deployment.yaml
+        | str replace "{{server}}" $server)
+
+    kubectl apply --server $server --data $manifest_data
+}
+
+export def uninstall [server: string]: nothing -> nothing {
+    kubectl delete deployment custom-service --server $server
+}
+```
+
+## References
+
+- Workspace Isolation (ADR-003)
+- Configuration System Architecture (ADR-002)
+- Hybrid Architecture Integration (ADR-004)
+- Community Extension Guidelines
+- Extension Security Framework
+- Extension Development Documentation
\ No newline at end of file
diff --git a/docs/src/architecture/adr/ADR-006-provisioning-cli-refactoring.md b/docs/src/architecture/adr/ADR-006-provisioning-cli-refactoring.md
index 041f588..0d3a572 100644
--- a/docs/src/architecture/adr/ADR-006-provisioning-cli-refactoring.md
+++ b/docs/src/architecture/adr/ADR-006-provisioning-cli-refactoring.md
@@ -1 +1,390 @@
-# ADR-006: Provisioning CLI Refactoring to Modular Architecture\n\n**Status**: Implemented ✅\n**Date**: 2025-09-30\n**Authors**: Infrastructure Team\n**Related**: ADR-001 (Project Structure), ADR-004 (Hybrid 
Architecture)\n\n## Context\n\nThe main provisioning CLI script (`provisioning/core/nulib/provisioning`) had grown to\n**1,329 lines** with a massive 1,100+ line match statement handling all commands. This\nmonolithic structure created multiple critical problems:\n\n### Problems Identified\n\n1. **Maintainability Crisis**\n - 54 command branches in one file\n - Code duplication: Flag handling repeated 50+ times\n - Hard to navigate: Finding specific command logic required scrolling through 1,000+ lines\n - Mixed concerns: Routing, validation, and execution all intertwined\n\n2. **Development Friction**\n - Adding new commands required editing massive file\n - Testing was nearly impossible (monolithic, no isolation)\n - High cognitive load for contributors\n - Code review difficult due to file size\n\n3. **Technical Debt**\n - 10+ lines of repetitive flag handling per command\n - No separation of concerns\n - Poor code reusability\n - Difficult to test individual command handlers\n\n4. **User Experience Issues**\n - No bi-directional help system\n - Inconsistent command shortcuts\n - Help system not fully integrated\n\n## Decision\n\nWe refactored the monolithic CLI into a **modular, domain-driven architecture** with the following structure:\n\n```\nprovisioning/core/nulib/\n├── provisioning (211 lines) ⬅️ 84% reduction\n├── main_provisioning/\n│ ├── flags.nu (139 lines) ⭐ Centralized flag handling\n│ ├── dispatcher.nu (264 lines) ⭐ Command routing\n│ ├── mod.nu (updated)\n│ └── commands/ ⭐ Domain-focused handlers\n│ ├── configuration.nu (316 lines)\n│ ├── development.nu (72 lines)\n│ ├── generation.nu (78 lines)\n│ ├── infrastructure.nu (117 lines)\n│ ├── orchestration.nu (64 lines)\n│ ├── utilities.nu (157 lines)\n│ └── workspace.nu (56 lines)\n```\n\n### Key Components\n\n#### 1. Centralized Flag Handling (`flags.nu`)\n\nSingle source of truth for all flag parsing and argument building:\n\n```\nexport def parse_common_flags [flags: record]: nothing -> record\nexport def build_module_args [flags: record, extra: string = ""]: nothing -> string\nexport def set_debug_env [flags: record]\nexport def get_debug_flag [flags: record]: nothing -> string\n```\n\n**Benefits:**\n\n- Eliminates 50+ instances of duplicate code\n- Single place to add/modify flags\n- Consistent flag handling across all commands\n- Reduced from 10 lines to 3 lines per command handler\n\n#### 2. Command Dispatcher (`dispatcher.nu`)\n\nCentral routing with 80+ command mappings:\n\n```\nexport def get_command_registry []: nothing -> record # 80+ shortcuts\nexport def dispatch_command [args: list, flags: record] # Main router\n```\n\n**Features:**\n\n- Command registry with shortcuts (ws → workspace, orch → orchestrator, etc.)\n- Bi-directional help support (`provisioning ws help` works)\n- Domain-based routing (infrastructure, orchestration, development, etc.)\n- Special command handling (create, delete, price, etc.)\n\n#### 3. 
Domain Command Handlers (`commands/*.nu`)\n\nSeven focused modules organized by domain:\n\n| Module | Lines | Responsibility |\n| -------- | ------- | ---------------- |\n| `infrastructure.nu` | 117 | Server, taskserv, cluster, infra |\n| `orchestration.nu` | 64 | Workflow, batch, orchestrator |\n| `development.nu` | 72 | Module, layer, version, pack |\n| `workspace.nu` | 56 | Workspace, template |\n| `generation.nu` | 78 | Generate commands |\n| `utilities.nu` | 157 | SSH, SOPS, cache, providers |\n| `configuration.nu` | 316 | Env, show, init, validate |\n\nEach handler:\n\n- Exports `handle__command` function\n- Uses shared flag handling\n- Provides error messages with usage hints\n- Isolated and testable\n\n## Architecture Principles\n\n### 1. Separation of Concerns\n\n- **Routing** → `dispatcher.nu`\n- **Flag parsing** → `flags.nu`\n- **Business logic** → `commands/*.nu`\n- **Help system** → `help_system.nu` (existing)\n\n### 2. Single Responsibility\n\nEach module has ONE clear purpose:\n\n- Command handlers execute specific domains\n- Dispatcher routes to correct handler\n- Flags module normalizes all inputs\n\n### 3. DRY (Don't Repeat Yourself)\n\nEliminated repetition:\n\n- Flag handling: 50+ instances → 1 function\n- Command routing: Scattered logic → Command registry\n- Error handling: Consistent across all domains\n\n### 4. Open/Closed Principle\n\n- Open for extension: Add new handlers easily\n- Closed for modification: Core routing unchanged\n\n### 5. Dependency Inversion\n\nAll handlers depend on abstractions (flag records, not concrete flags):\n\n```\n# Handler signature\nexport def handle_infrastructure_command [\n command: string\n ops: string\n flags: record # ⬅️ Abstraction, not concrete flags\n]\n```\n\n## Implementation Details\n\n### Migration Path (Completed in 2 Phases)\n\n**Phase 1: Foundation**\n\n1. ✅ Created `commands/` directory structure\n2. ✅ Created `flags.nu` with common flag handling\n3. ✅ Created initial command handlers (infrastructure, utilities, configuration)\n4. ✅ Created `dispatcher.nu` with routing logic\n5. ✅ Refactored main file (1,329 → 211 lines)\n6. ✅ Tested basic functionality\n\n**Phase 2: Completion**\n\n1. ✅ Fixed bi-directional help (`provisioning ws help` now works)\n2. ✅ Created remaining handlers (orchestration, development, workspace, generation)\n3. ✅ Removed duplicate code from dispatcher\n4. ✅ Added comprehensive test suite\n5. 
✅ Verified all shortcuts work\n\n### Bi-directional Help System\n\nUsers can now access help in multiple ways:\n\n```\n# All these work equivalently:\nprovisioning help workspace\nprovisioning workspace help # ⬅️ NEW: Bi-directional\nprovisioning ws help # ⬅️ NEW: With shortcuts\nprovisioning help ws # ⬅️ NEW: Shortcut in help\n```\n\n**Implementation:**\n\n```\n# Intercept "command help" → "help command"\nlet first_op = if ($ops_list | length) > 0 { ($ops_list | get 0) } else { "" }\nif $first_op in ["help" "h"] {\n exec $"($env.PROVISIONING_NAME)" help $task --notitles\n}\n```\n\n### Command Shortcuts\n\nComprehensive shortcut system with 30+ mappings:\n\n**Infrastructure:**\n\n- `s` → `server`\n- `t`, `task` → `taskserv`\n- `cl` → `cluster`\n- `i` → `infra`\n\n**Orchestration:**\n\n- `wf`, `flow` → `workflow`\n- `bat` → `batch`\n- `orch` → `orchestrator`\n\n**Development:**\n\n- `mod` → `module`\n- `lyr` → `layer`\n\n**Workspace:**\n\n- `ws` → `workspace`\n- `tpl`, `tmpl` → `template`\n\n## Testing\n\nComprehensive test suite created (`tests/test_provisioning_refactor.nu`):\n\n### Test Coverage\n\n- ✅ Main help display\n- ✅ Category help (infrastructure, orchestration, development, workspace)\n- ✅ Bi-directional help routing\n- ✅ All command shortcuts\n- ✅ Category shortcut help\n- ✅ Command routing to correct handlers\n\n### Test Results\n\n```\n📋 Testing main help... ✅\n📋 Testing category help... ✅\n🔄 Testing bi-directional help... ✅\n⚡ Testing command shortcuts... ✅\n📚 Testing category shortcut help... ✅\n🎯 Testing command routing... ✅\n\n📊 TEST RESULTS: 6 passed, 0 failed\n```\n\n## Results\n\n### Quantitative Improvements\n\n| Metric | Before | After | Improvement |\n| -------- | -------- | ------- | ------------- |\n| **Main file size** | 1,329 lines | 211 lines | **84% reduction** |\n| **Command handler** | 1 massive match (1,100+ lines) | 7 focused modules | **Domain separation** |\n| **Flag handling** | Repeated 50+ times | 1 function | **98% duplication removal** |\n| **Code per command** | 10 lines | 3 lines | **70% reduction** |\n| **Modules count** | 1 monolith | 9 modules | **Modular architecture** |\n| **Test coverage** | None | 6 test groups | **Comprehensive testing** |\n\n### Qualitative Improvements\n\n**Maintainability**\n\n- ✅ Easy to find specific command logic\n- ✅ Clear separation of concerns\n- ✅ Self-documenting structure\n- ✅ Focused modules (< 320 lines each)\n\n**Extensibility**\n\n- ✅ Add new commands: Just update appropriate handler\n- ✅ Add new flags: Single function update\n- ✅ Add new shortcuts: Update command registry\n- ✅ No massive file edits required\n\n**Testability**\n\n- ✅ Isolated command handlers\n- ✅ Mockable dependencies\n- ✅ Test individual domains\n- ✅ Fast test execution\n\n**Developer Experience**\n\n- ✅ Lower cognitive load\n- ✅ Faster onboarding\n- ✅ Easier code review\n- ✅ Better IDE navigation\n\n## Trade-offs\n\n### Advantages\n\n1. **Dramatically reduced complexity**: 84% smaller main file\n2. **Better organization**: Domain-focused modules\n3. **Easier testing**: Isolated, testable units\n4. **Improved maintainability**: Clear structure, less duplication\n5. **Enhanced UX**: Bi-directional help, shortcuts\n6. **Future-proof**: Easy to extend\n\n### Disadvantages\n\n1. **More files**: 1 file → 9 files (but smaller, focused)\n2. **Module imports**: Need to import multiple modules (automated via mod.nu)\n3. 
**Learning curve**: New structure requires documentation (this ADR)\n\n**Decision**: Advantages significantly outweigh disadvantages.\n\n## Examples\n\n### Before: Repetitive Flag Handling\n\n```\n"server" => {\n let use_check = if $check { "--check "} else { "" }\n let use_yes = if $yes { "--yes" } else { "" }\n let use_wait = if $wait { "--wait" } else { "" }\n let use_keepstorage = if $keepstorage { "--keepstorage "} else { "" }\n let str_infra = if $infra != null { $"--infra ($infra) "} else { "" }\n let str_outfile = if $outfile != null { $"--outfile ($outfile) "} else { "" }\n let str_out = if $out != null { $"--out ($out) "} else { "" }\n let arg_include_notuse = if $include_notuse { $"--include_notuse "} else { "" }\n run_module $"($str_ops) ($str_infra) ($use_check)..." "server" --exec\n}\n```\n\n### After: Clean, Reusable\n\n```\ndef handle_server [ops: string, flags: record] {\n let args = build_module_args $flags $ops\n run_module $args "server" --exec\n}\n```\n\n**Reduction: 10 lines → 3 lines (70% reduction)**\n\n## Future Considerations\n\n### Potential Enhancements\n\n1. **Unit test expansion**: Add tests for each command handler\n2. **Integration tests**: End-to-end workflow tests\n3. **Performance profiling**: Measure routing overhead (expected to be negligible)\n4. **Documentation generation**: Auto-generate docs from handlers\n5. **Plugin architecture**: Allow third-party command extensions\n\n### Migration Guide for Contributors\n\nSee `docs/development/COMMAND_HANDLER_GUIDE.md` for:\n\n- How to add new commands\n- How to modify existing handlers\n- How to add new shortcuts\n- Testing guidelines\n\n## Related Documentation\n\n- **Architecture Overview**: `docs/architecture/system-overview.md`\n- **Developer Guide**: `docs/development/COMMAND_HANDLER_GUIDE.md`\n- **Main Project Docs**: `CLAUDE.md` (updated with new structure)\n- **Test Suite**: `tests/test_provisioning_refactor.nu`\n\n## Conclusion\n\nThis refactoring transforms the provisioning CLI from a monolithic, hard-to-maintain script into a modular, well-organized system following software\nengineering best practices. The 84% reduction in main file size, elimination of code duplication, and comprehensive test coverage position the project\nfor sustainable long-term growth.\n\nThe new architecture enables:\n\n- **Faster development**: Add commands in minutes, not hours\n- **Better quality**: Isolated testing catches bugs early\n- **Easier maintenance**: Clear structure reduces cognitive load\n- **Enhanced UX**: Shortcuts and bi-directional help improve usability\n\n**Status**: Successfully implemented and tested. All commands operational. Ready for production use.\n\n---\n\n*This ADR documents a major architectural improvement completed on 2025-09-30.* +# ADR-006: Provisioning CLI Refactoring to Modular Architecture + +**Status**: Implemented ✅ +**Date**: 2025-09-30 +**Authors**: Infrastructure Team +**Related**: ADR-001 (Project Structure), ADR-004 (Hybrid Architecture) + +## Context + +The main provisioning CLI script (`provisioning/core/nulib/provisioning`) had grown to +**1,329 lines** with a massive 1,100+ line match statement handling all commands. This +monolithic structure created multiple critical problems: + +### Problems Identified + +1. 
**Maintainability Crisis**
+   - 54 command branches in one file
+   - Code duplication: Flag handling repeated 50+ times
+   - Hard to navigate: Finding specific command logic required scrolling through 1,000+ lines
+   - Mixed concerns: Routing, validation, and execution all intertwined
+
+2. **Development Friction**
+   - Adding new commands required editing the massive file
+   - Testing was nearly impossible (monolithic, no isolation)
+   - High cognitive load for contributors
+   - Code review difficult due to file size
+
+3. **Technical Debt**
+   - 10+ lines of repetitive flag handling per command
+   - No separation of concerns
+   - Poor code reusability
+   - Difficult to test individual command handlers
+
+4. **User Experience Issues**
+   - No bi-directional help system
+   - Inconsistent command shortcuts
+   - Help system not fully integrated
+
+## Decision
+
+We refactored the monolithic CLI into a **modular, domain-driven architecture** with the following structure:
+
+```text
+provisioning/core/nulib/
+├── provisioning (211 lines) ⬅️ 84% reduction
+├── main_provisioning/
+│   ├── flags.nu (139 lines) ⭐ Centralized flag handling
+│   ├── dispatcher.nu (264 lines) ⭐ Command routing
+│   ├── mod.nu (updated)
+│   └── commands/ ⭐ Domain-focused handlers
+│       ├── configuration.nu (316 lines)
+│       ├── development.nu (72 lines)
+│       ├── generation.nu (78 lines)
+│       ├── infrastructure.nu (117 lines)
+│       ├── orchestration.nu (64 lines)
+│       ├── utilities.nu (157 lines)
+│       └── workspace.nu (56 lines)
+```
+
+### Key Components
+
+#### 1. Centralized Flag Handling (`flags.nu`)
+
+Single source of truth for all flag parsing and argument building:
+
+```text
+export def parse_common_flags [flags: record]: nothing -> record
+export def build_module_args [flags: record, extra: string = ""]: nothing -> string
+export def set_debug_env [flags: record]
+export def get_debug_flag [flags: record]: nothing -> string
+```
+
+**Benefits:**
+
+- Eliminates 50+ instances of duplicate code
+- Single place to add/modify flags
+- Consistent flag handling across all commands
+- Reduced from 10 lines to 3 lines per command handler
+
+#### 2. Command Dispatcher (`dispatcher.nu`)
+
+Central routing with 80+ command mappings:
+
+```text
+export def get_command_registry []: nothing -> record # 80+ shortcuts
+export def dispatch_command [args: list, flags: record] # Main router
+```
+
+**Features:**
+
+- Command registry with shortcuts (ws → workspace, orch → orchestrator, etc.)
+- Bi-directional help support (`provisioning ws help` works)
+- Domain-based routing (infrastructure, orchestration, development, etc.)
+- Special command handling (create, delete, price, etc.)
+
+#### 3. Domain Command Handlers (`commands/*.nu`)
+
+Seven focused modules organized by domain:
+
+| Module | Lines | Responsibility |
+| -------- | ------- | ---------------- |
+| `infrastructure.nu` | 117 | Server, taskserv, cluster, infra |
+| `orchestration.nu` | 64 | Workflow, batch, orchestrator |
+| `development.nu` | 72 | Module, layer, version, pack |
+| `workspace.nu` | 56 | Workspace, template |
+| `generation.nu` | 78 | Generate commands |
+| `utilities.nu` | 157 | SSH, SOPS, cache, providers |
+| `configuration.nu` | 316 | Env, show, init, validate |
+
+Each handler:
+
+- Exports a `handle_<domain>_command` function (for example, `handle_infrastructure_command`)
+- Uses shared flag handling
+- Provides error messages with usage hints
+- Isolated and testable
+
+## Architecture Principles
+
+### 1.
Separation of Concerns + +- **Routing** → `dispatcher.nu` +- **Flag parsing** → `flags.nu` +- **Business logic** → `commands/*.nu` +- **Help system** → `help_system.nu` (existing) + +### 2. Single Responsibility + +Each module has ONE clear purpose: + +- Command handlers execute specific domains +- Dispatcher routes to correct handler +- Flags module normalizes all inputs + +### 3. DRY (Don't Repeat Yourself) + +Eliminated repetition: + +- Flag handling: 50+ instances → 1 function +- Command routing: Scattered logic → Command registry +- Error handling: Consistent across all domains + +### 4. Open/Closed Principle + +- Open for extension: Add new handlers easily +- Closed for modification: Core routing unchanged + +### 5. Dependency Inversion + +All handlers depend on abstractions (flag records, not concrete flags): + +```text +# Handler signature +export def handle_infrastructure_command [ + command: string + ops: string + flags: record # ⬅️ Abstraction, not concrete flags +] +``` + +## Implementation Details + +### Migration Path (Completed in 2 Phases) + +**Phase 1: Foundation** + +1. ✅ Created `commands/` directory structure +2. ✅ Created `flags.nu` with common flag handling +3. ✅ Created initial command handlers (infrastructure, utilities, configuration) +4. ✅ Created `dispatcher.nu` with routing logic +5. ✅ Refactored main file (1,329 → 211 lines) +6. ✅ Tested basic functionality + +**Phase 2: Completion** + +1. ✅ Fixed bi-directional help (`provisioning ws help` now works) +2. ✅ Created remaining handlers (orchestration, development, workspace, generation) +3. ✅ Removed duplicate code from dispatcher +4. ✅ Added comprehensive test suite +5. ✅ Verified all shortcuts work + +### Bi-directional Help System + +Users can now access help in multiple ways: + +```text +# All these work equivalently: +provisioning help workspace +provisioning workspace help # ⬅️ NEW: Bi-directional +provisioning ws help # ⬅️ NEW: With shortcuts +provisioning help ws # ⬅️ NEW: Shortcut in help +``` + +**Implementation:** + +```text +# Intercept "command help" → "help command" +let first_op = if ($ops_list | length) > 0 { ($ops_list | get 0) } else { "" } +if $first_op in ["help" "h"] { + exec $"($env.PROVISIONING_NAME)" help $task --notitles +} +``` + +### Command Shortcuts + +Comprehensive shortcut system with 30+ mappings: + +**Infrastructure:** + +- `s` → `server` +- `t`, `task` → `taskserv` +- `cl` → `cluster` +- `i` → `infra` + +**Orchestration:** + +- `wf`, `flow` → `workflow` +- `bat` → `batch` +- `orch` → `orchestrator` + +**Development:** + +- `mod` → `module` +- `lyr` → `layer` + +**Workspace:** + +- `ws` → `workspace` +- `tpl`, `tmpl` → `template` + +## Testing + +Comprehensive test suite created (`tests/test_provisioning_refactor.nu`): + +### Test Coverage + +- ✅ Main help display +- ✅ Category help (infrastructure, orchestration, development, workspace) +- ✅ Bi-directional help routing +- ✅ All command shortcuts +- ✅ Category shortcut help +- ✅ Command routing to correct handlers + +### Test Results + +```text +📋 Testing main help... ✅ +📋 Testing category help... ✅ +🔄 Testing bi-directional help... ✅ +⚡ Testing command shortcuts... ✅ +📚 Testing category shortcut help... ✅ +🎯 Testing command routing... 
✅ + +📊 TEST RESULTS: 6 passed, 0 failed +``` + +## Results + +### Quantitative Improvements + +| Metric | Before | After | Improvement | +| -------- | -------- | ------- | ------------- | +| **Main file size** | 1,329 lines | 211 lines | **84% reduction** | +| **Command handler** | 1 massive match (1,100+ lines) | 7 focused modules | **Domain separation** | +| **Flag handling** | Repeated 50+ times | 1 function | **98% duplication removal** | +| **Code per command** | 10 lines | 3 lines | **70% reduction** | +| **Modules count** | 1 monolith | 9 modules | **Modular architecture** | +| **Test coverage** | None | 6 test groups | **Comprehensive testing** | + +### Qualitative Improvements + +**Maintainability** + +- ✅ Easy to find specific command logic +- ✅ Clear separation of concerns +- ✅ Self-documenting structure +- ✅ Focused modules (< 320 lines each) + +**Extensibility** + +- ✅ Add new commands: Just update appropriate handler +- ✅ Add new flags: Single function update +- ✅ Add new shortcuts: Update command registry +- ✅ No massive file edits required + +**Testability** + +- ✅ Isolated command handlers +- ✅ Mockable dependencies +- ✅ Test individual domains +- ✅ Fast test execution + +**Developer Experience** + +- ✅ Lower cognitive load +- ✅ Faster onboarding +- ✅ Easier code review +- ✅ Better IDE navigation + +## Trade-offs + +### Advantages + +1. **Dramatically reduced complexity**: 84% smaller main file +2. **Better organization**: Domain-focused modules +3. **Easier testing**: Isolated, testable units +4. **Improved maintainability**: Clear structure, less duplication +5. **Enhanced UX**: Bi-directional help, shortcuts +6. **Future-proof**: Easy to extend + +### Disadvantages + +1. **More files**: 1 file → 9 files (but smaller, focused) +2. **Module imports**: Need to import multiple modules (automated via mod.nu) +3. **Learning curve**: New structure requires documentation (this ADR) + +**Decision**: Advantages significantly outweigh disadvantages. + +## Examples + +### Before: Repetitive Flag Handling + +```text +"server" => { + let use_check = if $check { "--check "} else { "" } + let use_yes = if $yes { "--yes" } else { "" } + let use_wait = if $wait { "--wait" } else { "" } + let use_keepstorage = if $keepstorage { "--keepstorage "} else { "" } + let str_infra = if $infra != null { $"--infra ($infra) "} else { "" } + let str_outfile = if $outfile != null { $"--outfile ($outfile) "} else { "" } + let str_out = if $out != null { $"--out ($out) "} else { "" } + let arg_include_notuse = if $include_notuse { $"--include_notuse "} else { "" } + run_module $"($str_ops) ($str_infra) ($use_check)..." "server" --exec +} +``` + +### After: Clean, Reusable + +```text +def handle_server [ops: string, flags: record] { + let args = build_module_args $flags $ops + run_module $args "server" --exec +} +``` + +**Reduction: 10 lines → 3 lines (70% reduction)** + +## Future Considerations + +### Potential Enhancements + +1. **Unit test expansion**: Add tests for each command handler +2. **Integration tests**: End-to-end workflow tests +3. **Performance profiling**: Measure routing overhead (expected to be negligible) +4. **Documentation generation**: Auto-generate docs from handlers +5. 
**Plugin architecture**: Allow third-party command extensions + +### Migration Guide for Contributors + +See `docs/development/COMMAND_HANDLER_GUIDE.md` for: + +- How to add new commands +- How to modify existing handlers +- How to add new shortcuts +- Testing guidelines + +## Related Documentation + +- **Architecture Overview**: `docs/architecture/system-overview.md` +- **Developer Guide**: `docs/development/COMMAND_HANDLER_GUIDE.md` +- **Main Project Docs**: `CLAUDE.md` (updated with new structure) +- **Test Suite**: `tests/test_provisioning_refactor.nu` + +## Conclusion + +This refactoring transforms the provisioning CLI from a monolithic, hard-to-maintain script into a modular, well-organized system following software +engineering best practices. The 84% reduction in main file size, elimination of code duplication, and comprehensive test coverage position the project +for sustainable long-term growth. + +The new architecture enables: + +- **Faster development**: Add commands in minutes, not hours +- **Better quality**: Isolated testing catches bugs early +- **Easier maintenance**: Clear structure reduces cognitive load +- **Enhanced UX**: Shortcuts and bi-directional help improve usability + +**Status**: Successfully implemented and tested. All commands operational. Ready for production use. + +--- + +*This ADR documents a major architectural improvement completed on 2025-09-30.* \ No newline at end of file diff --git a/docs/src/architecture/adr/ADR-007-kms-simplification.md b/docs/src/architecture/adr/ADR-007-kms-simplification.md index 4927473..b0bb5cc 100644 --- a/docs/src/architecture/adr/ADR-007-kms-simplification.md +++ b/docs/src/architecture/adr/ADR-007-kms-simplification.md @@ -1 +1,266 @@ -# ADR-007: KMS Service Simplification to Age and Cosmian Backends\n\n**Status**: Accepted\n**Date**: 2025-10-08\n**Deciders**: Architecture Team\n**Related**: ADR-006 (KMS Service Integration)\n\n## Context\n\nThe KMS service initially supported 4 backends: HashiCorp Vault, AWS KMS, Age, and Cosmian KMS. This created unnecessary complexity and unclear\nguidance about which backend to use for different environments.\n\n### Problems with 4-Backend Approach\n\n1. **Complexity**: Supporting 4 different backends increased maintenance burden\n2. **Dependencies**: AWS SDK added significant compile time (~30 s) and binary size\n3. **Confusion**: No clear guidance on which backend to use when\n4. **Cloud Lock-in**: AWS KMS dependency limited infrastructure flexibility\n5. **Operational Overhead**: Vault requires server setup even for simple dev environments\n6. **Code Duplication**: Similar logic implemented 4 different ways\n\n### Key Insights\n\n- Most development work doesn't need server-based KMS\n- Production deployments need enterprise-grade security features\n- Age provides fast, offline encryption perfect for development\n- Cosmian KMS offers confidential computing and zero-knowledge architecture\n- Supporting Vault AND Cosmian is redundant (both are server-based KMS)\n- AWS KMS locks us into AWS infrastructure\n\n## Decision\n\nSimplify the KMS service to support only 2 backends:\n\n1. **Age**: For development and local testing\n - Fast, offline, no server required\n - Simple key generation with `age-keygen`\n - X25519 encryption (modern, secure)\n - Perfect for dev/test environments\n\n2. 
**Cosmian KMS**: For production deployments\n - Enterprise-grade key management\n - Confidential computing support (SGX/SEV)\n - Zero-knowledge architecture\n - Server-side key rotation\n - Audit logging and compliance\n - Multi-tenant support\n\nRemove support for:\n\n- ❌ HashiCorp Vault (redundant with Cosmian)\n- ❌ AWS KMS (cloud lock-in, complexity)\n\n## Consequences\n\n### Positive\n\n1. **Simpler Code**: 2 backends instead of 4 reduces complexity by 50%\n2. **Faster Compilation**: Removing AWS SDK saves ~30 seconds compile time\n3. **Clear Guidance**: Age = dev, Cosmian = prod (no confusion)\n4. **Offline Development**: Age works without network connectivity\n5. **Better Security**: Cosmian provides confidential computing (TEE)\n6. **No Cloud Lock-in**: Not dependent on AWS infrastructure\n7. **Easier Testing**: Age backend requires no setup\n8. **Reduced Dependencies**: Fewer external crates to maintain\n\n### Negative\n\n1. **Migration Required**: Existing Vault/AWS KMS users must migrate\n2. **Learning Curve**: Teams must learn Age and Cosmian\n3. **Cosmian Dependency**: Production depends on Cosmian availability\n4. **Cost**: Cosmian may have licensing costs (cloud or self-hosted)\n\n### Neutral\n\n1. **Feature Parity**: Cosmian provides all features Vault/AWS had\n2. **API Compatibility**: Encrypt/decrypt API remains primarily the same\n3. **Configuration Change**: TOML config structure updated but similar\n\n## Implementation\n\n### Files Created\n\n1. `src/age/client.rs` (167 lines) - Age encryption client\n2. `src/age/mod.rs` (3 lines) - Age module exports\n3. `src/cosmian/client.rs` (294 lines) - Cosmian KMS client\n4. `src/cosmian/mod.rs` (3 lines) - Cosmian module exports\n5. `docs/migration/KMS_SIMPLIFICATION.md` (500+ lines) - Migration guide\n\n### Files Modified\n\n1. `src/lib.rs` - Updated exports (age, cosmian instead of aws, vault)\n2. `src/types.rs` - Updated error types and config enum\n3. `src/service.rs` - Simplified to 2 backends (180 lines, was 213)\n4. `Cargo.toml` - Removed AWS deps, added `age = "0.10"`\n5. `README.md` - Complete rewrite for new backends\n6. `provisioning/config/kms.toml` - Simplified configuration\n\n### Files Deleted\n\n1. `src/aws/client.rs` - AWS KMS client\n2. `src/aws/envelope.rs` - Envelope encryption helpers\n3. `src/aws/mod.rs` - AWS module\n4. `src/vault/client.rs` - Vault client\n5. `src/vault/mod.rs` - Vault module\n\n### Dependencies Changed\n\n**Removed**:\n\n- `aws-sdk-kms = "1"`\n- `aws-config = "1"`\n- `aws-credential-types = "1"`\n- `aes-gcm = "0.10"` (was only for AWS envelope encryption)\n\n**Added**:\n\n- `age = "0.10"`\n- `tempfile = "3"` (dev dependency for tests)\n\n**Kept**:\n\n- All Axum web framework deps\n- `reqwest` (for Cosmian HTTP API)\n- `base64`, `serde`, `tokio`, etc.\n\n## Migration Path\n\n### For Development\n\n```\n# 1. Install Age\nbrew install age # or apt install age\n\n# 2. Generate keys\nage-keygen -o ~/.config/provisioning/age/private_key.txt\nage-keygen -y ~/.config/provisioning/age/private_key.txt > ~/.config/provisioning/age/public_key.txt\n\n# 3. Update config to use Age backend\n# 4. Re-encrypt development secrets\n```\n\n### For Production\n\n```\n# 1. Set up Cosmian KMS (cloud or self-hosted)\n# 2. Create master key in Cosmian\n# 3. Migrate secrets from Vault/AWS to Cosmian\n# 4. Update production config\n# 5. 
Deploy new KMS service\n```\n\nSee `docs/migration/KMS_SIMPLIFICATION.md` for detailed steps.\n\n## Alternatives Considered\n\n### Alternative 1: Keep All 4 Backends\n\n**Pros**:\n\n- No migration required\n- Maximum flexibility\n\n**Cons**:\n\n- Continued complexity\n- Maintenance burden\n- Unclear guidance\n\n**Rejected**: Complexity outweighs benefits\n\n### Alternative 2: Only Cosmian (No Age)\n\n**Pros**:\n\n- Single backend\n- Enterprise-grade everywhere\n\n**Cons**:\n\n- Requires Cosmian server for development\n- Slower dev iteration\n- Network dependency for local dev\n\n**Rejected**: Development experience matters\n\n### Alternative 3: Only Age (No Production Backend)\n\n**Pros**:\n\n- Simplest solution\n- No server required\n\n**Cons**:\n\n- Not suitable for production\n- No audit logging\n- No key rotation\n- No multi-tenant support\n\n**Rejected**: Production needs enterprise features\n\n### Alternative 4: Age + HashiCorp Vault\n\n**Pros**:\n\n- Vault is widely known\n- No Cosmian dependency\n\n**Cons**:\n\n- Vault lacks confidential computing\n- Vault server still required\n- No zero-knowledge architecture\n\n**Rejected**: Cosmian provides better security features\n\n## Metrics\n\n### Code Reduction\n\n- **Total Lines Removed**: ~800 lines (AWS + Vault implementations)\n- **Total Lines Added**: ~470 lines (Age + Cosmian + docs)\n- **Net Reduction**: ~330 lines\n\n### Dependency Reduction\n\n- **Crates Removed**: 4 (aws-sdk-kms, aws-config, aws-credential-types, aes-gcm)\n- **Crates Added**: 1 (age)\n- **Net Reduction**: 3 crates\n\n### Compilation Time\n\n- **Before**: ~90 seconds (with AWS SDK)\n- **After**: ~60 seconds (without AWS SDK)\n- **Improvement**: 33% faster\n\n## Compliance\n\n### Security Considerations\n\n1. **Age Security**: X25519 (Curve25519) encryption, modern and secure\n2. **Cosmian Security**: Confidential computing, zero-knowledge, enterprise-grade\n3. **No Regression**: Security features maintained or improved\n4. **Clear Separation**: Dev (Age) never used for production secrets\n\n### Testing Requirements\n\n1. **Unit Tests**: Both backends have comprehensive test coverage\n2. **Integration Tests**: Age tests run without external deps\n3. **Cosmian Tests**: Require test server (marked as `#[ignore]`)\n4. **Migration Tests**: Verify old configs fail gracefully\n\n## References\n\n- [Age Encryption](https://github.com/FiloSottile/age) - Modern encryption tool\n- [Cosmian KMS](https://cosmian.com/kms/) - Enterprise KMS with confidential computing\n- [ADR-006](adr-006-provisioning-cli-refactoring.md) - Previous KMS integration\n- [Migration Guide](../migration/KMS_SIMPLIFICATION.md) - Detailed migration steps\n\n## Notes\n\n- Age is designed by Filippo Valsorda (Google, Go security team)\n- Cosmian provides FIPS 140-2 Level 3 compliance (when using certified hardware)\n- This decision aligns with project goal of reducing cloud provider dependencies\n- Migration timeline: 6 weeks for full adoption +# ADR-007: KMS Service Simplification to Age and Cosmian Backends + +**Status**: Accepted +**Date**: 2025-10-08 +**Deciders**: Architecture Team +**Related**: ADR-006 (KMS Service Integration) + +## Context + +The KMS service initially supported 4 backends: HashiCorp Vault, AWS KMS, Age, and Cosmian KMS. This created unnecessary complexity and unclear +guidance about which backend to use for different environments. + +### Problems with 4-Backend Approach + +1. **Complexity**: Supporting 4 different backends increased maintenance burden +2. 
**Dependencies**: AWS SDK added significant compile time (~30 s) and binary size +3. **Confusion**: No clear guidance on which backend to use when +4. **Cloud Lock-in**: AWS KMS dependency limited infrastructure flexibility +5. **Operational Overhead**: Vault requires server setup even for simple dev environments +6. **Code Duplication**: Similar logic implemented 4 different ways + +### Key Insights + +- Most development work doesn't need server-based KMS +- Production deployments need enterprise-grade security features +- Age provides fast, offline encryption perfect for development +- Cosmian KMS offers confidential computing and zero-knowledge architecture +- Supporting Vault AND Cosmian is redundant (both are server-based KMS) +- AWS KMS locks us into AWS infrastructure + +## Decision + +Simplify the KMS service to support only 2 backends: + +1. **Age**: For development and local testing + - Fast, offline, no server required + - Simple key generation with `age-keygen` + - X25519 encryption (modern, secure) + - Perfect for dev/test environments + +2. **Cosmian KMS**: For production deployments + - Enterprise-grade key management + - Confidential computing support (SGX/SEV) + - Zero-knowledge architecture + - Server-side key rotation + - Audit logging and compliance + - Multi-tenant support + +Remove support for: + +- ❌ HashiCorp Vault (redundant with Cosmian) +- ❌ AWS KMS (cloud lock-in, complexity) + +## Consequences + +### Positive + +1. **Simpler Code**: 2 backends instead of 4 reduces complexity by 50% +2. **Faster Compilation**: Removing AWS SDK saves ~30 seconds compile time +3. **Clear Guidance**: Age = dev, Cosmian = prod (no confusion) +4. **Offline Development**: Age works without network connectivity +5. **Better Security**: Cosmian provides confidential computing (TEE) +6. **No Cloud Lock-in**: Not dependent on AWS infrastructure +7. **Easier Testing**: Age backend requires no setup +8. **Reduced Dependencies**: Fewer external crates to maintain + +### Negative + +1. **Migration Required**: Existing Vault/AWS KMS users must migrate +2. **Learning Curve**: Teams must learn Age and Cosmian +3. **Cosmian Dependency**: Production depends on Cosmian availability +4. **Cost**: Cosmian may have licensing costs (cloud or self-hosted) + +### Neutral + +1. **Feature Parity**: Cosmian provides all features Vault/AWS had +2. **API Compatibility**: Encrypt/decrypt API remains primarily the same +3. **Configuration Change**: TOML config structure updated but similar + +## Implementation + +### Files Created + +1. `src/age/client.rs` (167 lines) - Age encryption client +2. `src/age/mod.rs` (3 lines) - Age module exports +3. `src/cosmian/client.rs` (294 lines) - Cosmian KMS client +4. `src/cosmian/mod.rs` (3 lines) - Cosmian module exports +5. `docs/migration/KMS_SIMPLIFICATION.md` (500+ lines) - Migration guide + +### Files Modified + +1. `src/lib.rs` - Updated exports (age, cosmian instead of aws, vault) +2. `src/types.rs` - Updated error types and config enum +3. `src/service.rs` - Simplified to 2 backends (180 lines, was 213) +4. `Cargo.toml` - Removed AWS deps, added `age = "0.10"` +5. `README.md` - Complete rewrite for new backends +6. `provisioning/config/kms.toml` - Simplified configuration + +### Files Deleted + +1. `src/aws/client.rs` - AWS KMS client +2. `src/aws/envelope.rs` - Envelope encryption helpers +3. `src/aws/mod.rs` - AWS module +4. `src/vault/client.rs` - Vault client +5. 
`src/vault/mod.rs` - Vault module + +### Dependencies Changed + +**Removed**: + +- `aws-sdk-kms = "1"` +- `aws-config = "1"` +- `aws-credential-types = "1"` +- `aes-gcm = "0.10"` (was only for AWS envelope encryption) + +**Added**: + +- `age = "0.10"` +- `tempfile = "3"` (dev dependency for tests) + +**Kept**: + +- All Axum web framework deps +- `reqwest` (for Cosmian HTTP API) +- `base64`, `serde`, `tokio`, etc. + +## Migration Path + +### For Development + +```text +# 1. Install Age +brew install age # or apt install age + +# 2. Generate keys +age-keygen -o ~/.config/provisioning/age/private_key.txt +age-keygen -y ~/.config/provisioning/age/private_key.txt > ~/.config/provisioning/age/public_key.txt + +# 3. Update config to use Age backend +# 4. Re-encrypt development secrets +``` + +### For Production + +```text +# 1. Set up Cosmian KMS (cloud or self-hosted) +# 2. Create master key in Cosmian +# 3. Migrate secrets from Vault/AWS to Cosmian +# 4. Update production config +# 5. Deploy new KMS service +``` + +See `docs/migration/KMS_SIMPLIFICATION.md` for detailed steps. + +## Alternatives Considered + +### Alternative 1: Keep All 4 Backends + +**Pros**: + +- No migration required +- Maximum flexibility + +**Cons**: + +- Continued complexity +- Maintenance burden +- Unclear guidance + +**Rejected**: Complexity outweighs benefits + +### Alternative 2: Only Cosmian (No Age) + +**Pros**: + +- Single backend +- Enterprise-grade everywhere + +**Cons**: + +- Requires Cosmian server for development +- Slower dev iteration +- Network dependency for local dev + +**Rejected**: Development experience matters + +### Alternative 3: Only Age (No Production Backend) + +**Pros**: + +- Simplest solution +- No server required + +**Cons**: + +- Not suitable for production +- No audit logging +- No key rotation +- No multi-tenant support + +**Rejected**: Production needs enterprise features + +### Alternative 4: Age + HashiCorp Vault + +**Pros**: + +- Vault is widely known +- No Cosmian dependency + +**Cons**: + +- Vault lacks confidential computing +- Vault server still required +- No zero-knowledge architecture + +**Rejected**: Cosmian provides better security features + +## Metrics + +### Code Reduction + +- **Total Lines Removed**: ~800 lines (AWS + Vault implementations) +- **Total Lines Added**: ~470 lines (Age + Cosmian + docs) +- **Net Reduction**: ~330 lines + +### Dependency Reduction + +- **Crates Removed**: 4 (aws-sdk-kms, aws-config, aws-credential-types, aes-gcm) +- **Crates Added**: 1 (age) +- **Net Reduction**: 3 crates + +### Compilation Time + +- **Before**: ~90 seconds (with AWS SDK) +- **After**: ~60 seconds (without AWS SDK) +- **Improvement**: 33% faster + +## Compliance + +### Security Considerations + +1. **Age Security**: X25519 (Curve25519) encryption, modern and secure +2. **Cosmian Security**: Confidential computing, zero-knowledge, enterprise-grade +3. **No Regression**: Security features maintained or improved +4. **Clear Separation**: Dev (Age) never used for production secrets + +### Testing Requirements + +1. **Unit Tests**: Both backends have comprehensive test coverage +2. **Integration Tests**: Age tests run without external deps +3. **Cosmian Tests**: Require test server (marked as `#[ignore]`) +4. 
**Migration Tests**: Verify old configs fail gracefully + +## References + +- [Age Encryption](https://github.com/FiloSottile/age) - Modern encryption tool +- [Cosmian KMS](https://cosmian.com/kms/) - Enterprise KMS with confidential computing +- [ADR-006](adr-006-provisioning-cli-refactoring.md) - Previous KMS integration +- [Migration Guide](../migration/KMS_SIMPLIFICATION.md) - Detailed migration steps + +## Notes + +- Age is designed by Filippo Valsorda (Google, Go security team) +- Cosmian provides FIPS 140-2 Level 3 compliance (when using certified hardware) +- This decision aligns with project goal of reducing cloud provider dependencies +- Migration timeline: 6 weeks for full adoption \ No newline at end of file diff --git a/docs/src/architecture/adr/ADR-008-cedar-authorization.md b/docs/src/architecture/adr/ADR-008-cedar-authorization.md index a932d5f..121f48d 100644 --- a/docs/src/architecture/adr/ADR-008-cedar-authorization.md +++ b/docs/src/architecture/adr/ADR-008-cedar-authorization.md @@ -1 +1,352 @@ -# ADR-008: Cedar Authorization Policy Engine Integration\n\n**Status**: Accepted\n**Date**: 2025-10-08\n**Deciders**: Architecture Team\n**Tags**: security, authorization, cedar, policy-engine\n\n## Context and Problem Statement\n\nThe Provisioning platform requires fine-grained authorization controls to manage access to infrastructure resources across multiple environments\n(development, staging, production). The authorization system must:\n\n1. Support complex authorization rules (MFA, IP restrictions, time windows, approvals)\n2. Be auditable and version-controlled\n3. Allow hot-reload of policies without restart\n4. Integrate with JWT tokens for identity\n5. Scale to thousands of authorization decisions per second\n6. Be maintainable by security team without code changes\n\nTraditional code-based authorization (if/else statements) is difficult to audit, maintain, and scale.\n\n## Decision Drivers\n\n- **Security**: Critical for production infrastructure access\n- **Auditability**: Compliance requirements demand clear authorization policies\n- **Flexibility**: Policies change more frequently than code\n- **Performance**: Low-latency authorization decisions (<10 ms)\n- **Maintainability**: Security team should update policies without developers\n- **Type Safety**: Prevent policy errors before deployment\n\n## Considered Options\n\n### Option 1: Code-Based Authorization (Current State)\n\nImplement authorization logic directly in Rust/Nushell code.\n\n**Pros**:\n\n- Full control and flexibility\n- No external dependencies\n- Simple to understand for small use cases\n\n**Cons**:\n\n- Hard to audit and maintain\n- Requires code deployment for policy changes\n- No type safety for policies\n- Difficult to test all combinations\n- Not declarative\n\n### Option 2: OPA (Open Policy Agent)\n\nUse OPA with Rego policy language.\n\n**Pros**:\n\n- Industry standard\n- Rich ecosystem\n- Rego is powerful\n\n**Cons**:\n\n- Rego is complex to learn\n- Requires separate service deployment\n- Performance overhead (HTTP calls)\n- Policies not type-checked\n\n### Option 3: Cedar Policy Engine (Chosen)\n\nUse AWS Cedar policy language integrated directly into orchestrator.\n\n**Pros**:\n\n- Type-safe policy language\n- Fast (compiled, no network overhead)\n- Schema-based validation\n- Declarative and auditable\n- Hot-reload support\n- Rust library (no external service)\n- Deny-by-default security model\n\n**Cons**:\n\n- Recently introduced (2023)\n- Smaller ecosystem than OPA\n- Learning curve 
for policy authors\n\n### Option 4: Casbin\n\nUse Casbin authorization library.\n\n**Pros**:\n\n- Multiple policy models (ACL, RBAC, ABAC)\n- Rust bindings available\n\n**Cons**:\n\n- Less declarative than Cedar\n- Weaker type safety\n- More imperative style\n\n## Decision Outcome\n\n**Chosen Option**: Option 3 - Cedar Policy Engine\n\n### Rationale\n\n1. **Type Safety**: Cedar's schema validation prevents policy errors before deployment\n2. **Performance**: Native Rust library, no network overhead, <1 ms authorization decisions\n3. **Auditability**: Declarative policies in version control\n4. **Hot Reload**: Update policies without orchestrator restart\n5. **AWS Standard**: Used in production by AWS for AVP (Amazon Verified Permissions)\n6. **Deny-by-Default**: Secure by design\n\n### Implementation Details\n\n#### Architecture\n\n```\n┌─────────────────────────────────────────────────────────┐\n│ Orchestrator │\n├─────────────────────────────────────────────────────────┤\n│ │\n│ HTTP Request │\n│ ↓ │\n│ ┌──────────────────┐ │\n│ │ JWT Validation │ ← Token Validator │\n│ └────────┬─────────┘ │\n│ ↓ │\n│ ┌──────────────────┐ │\n│ │ Cedar Engine │ ← Policy Loader │\n│ │ │ (Hot Reload) │\n│ │ • Check Policies │ │\n│ │ • Evaluate Rules │ │\n│ │ • Context Check │ │\n│ └────────┬─────────┘ │\n│ ↓ │\n│ Allow / Deny │\n│ │\n└─────────────────────────────────────────────────────────┘\n```\n\n#### Policy Organization\n\n```\nprovisioning/config/cedar-policies/\n├── schema.cedar # Entity and action definitions\n├── production.cedar # Production environment policies\n├── development.cedar # Development environment policies\n├── admin.cedar # Administrative policies\n└── README.md # Documentation\n```\n\n#### Rust Implementation\n\n```\nprovisioning/platform/orchestrator/src/security/\n├── cedar.rs # Cedar engine integration (450 lines)\n├── policy_loader.rs # Policy loading with hot reload (320 lines)\n├── authorization.rs # Middleware integration (380 lines)\n├── mod.rs # Module exports\n└── tests.rs # Comprehensive tests (450 lines)\n```\n\n#### Key Components\n\n1. **CedarEngine**: Core authorization engine\n - Load policies from strings\n - Load schema for validation\n - Authorize requests\n - Policy statistics\n\n2. **PolicyLoader**: File-based policy management\n - Load policies from directory\n - Hot reload on file changes (notify crate)\n - Validate policy syntax\n - Schema validation\n\n3. **Authorization Middleware**: Axum integration\n - Extract JWT claims\n - Build authorization context (IP, MFA, time)\n - Check authorization\n - Return 403 Forbidden on deny\n\n4. 
**Policy Files**: Declarative authorization rules\n - Production: MFA, approvals, IP restrictions, business hours\n - Development: Permissive for developers\n - Admin: Platform admin, SRE, audit team policies\n\n#### Context Variables\n\n```\nAuthorizationContext {\n mfa_verified: bool, // MFA verification status\n ip_address: String, // Client IP address\n time: String, // ISO 8601 timestamp\n approval_id: Option, // Approval ID (optional)\n reason: Option, // Reason for operation\n force: bool, // Force flag\n additional: HashMap, // Additional context\n}\n```\n\n#### Example Policy\n\n```\n// Production deployments require MFA verification\n@id("prod-deploy-mfa")\n@description("All production deployments must have MFA verification")\npermit (\n principal,\n action == Provisioning::Action::"deploy",\n resource in Provisioning::Environment::"production"\n) when {\n context.mfa_verified == true\n};\n```\n\n### Integration Points\n\n1. **JWT Tokens**: Extract principal and context from validated JWT\n2. **Audit System**: Log all authorization decisions\n3. **Control Center**: UI for policy management and testing\n4. **CLI**: Policy validation and testing commands\n\n### Security Best Practices\n\n1. **Deny by Default**: Cedar defaults to deny all actions\n2. **Schema Validation**: Type-check policies before loading\n3. **Version Control**: All policies in git for auditability\n4. **Principle of Least Privilege**: Grant minimum necessary permissions\n5. **Defense in Depth**: Combine with JWT validation and rate limiting\n6. **Separation of Concerns**: Security team owns policies, developers own code\n\n## Consequences\n\n### Positive\n\n1. ✅ **Auditable**: All policies in version control\n2. ✅ **Type-Safe**: Schema validation prevents errors\n3. ✅ **Fast**: <1 ms authorization decisions\n4. ✅ **Maintainable**: Security team can update policies independently\n5. ✅ **Hot Reload**: No downtime for policy updates\n6. ✅ **Testable**: Comprehensive test suite for policies\n7. ✅ **Declarative**: Clear intent, no hidden logic\n\n### Negative\n\n1. ❌ **Learning Curve**: Team must learn Cedar policy language\n2. ❌ **New Technology**: Cedar is relatively new (2023)\n3. ❌ **Ecosystem**: Smaller community than OPA\n4. ❌ **Tooling**: Limited IDE support compared to Rego\n\n### Neutral\n\n1. 🔶 **Migration**: Existing authorization logic needs migration to Cedar\n2. 🔶 **Policy Complexity**: Complex rules may be harder to express\n3. 
🔶 **Debugging**: Policy debugging requires understanding Cedar evaluation\n\n## Compliance\n\n### Security Standards\n\n- **SOC 2**: Auditable access control policies\n- **ISO 27001**: Access control management\n- **GDPR**: Data access authorization and logging\n- **NIST 800-53**: AC-3 Access Enforcement\n\n### Audit Requirements\n\nAll authorization decisions include:\n\n- Principal (user/team)\n- Action performed\n- Resource accessed\n- Context (MFA, IP, time)\n- Decision (allow/deny)\n- Policies evaluated\n\n## Migration Path\n\n### Phase 1: Implementation (Completed)\n\n- ✅ Cedar engine integration\n- ✅ Policy loader with hot reload\n- ✅ Authorization middleware\n- ✅ Production, development, and admin policies\n- ✅ Comprehensive tests\n\n### Phase 2: Rollout (Next)\n\n- 🔲 Enable Cedar authorization in orchestrator\n- 🔲 Migrate existing authorization logic to Cedar policies\n- 🔲 Add authorization checks to all API endpoints\n- 🔲 Integrate with audit logging\n\n### Phase 3: Enhancement (Future)\n\n- 🔲 Control Center policy editor UI\n- 🔲 Policy testing UI\n- 🔲 Policy simulation and dry-run mode\n- 🔲 Policy analytics and insights\n- 🔲 Advanced context variables (location, device type)\n\n## Alternatives Considered\n\n### Alternative 1: Continue with Code-Based Authorization\n\nKeep authorization logic in Rust/Nushell code.\n\n**Rejected Because**:\n\n- Not auditable\n- Requires code changes for policy updates\n- Difficult to test all combinations\n- Not compliant with security standards\n\n### Alternative 2: Hybrid Approach\n\nUse Cedar for high-level policies, code for fine-grained checks.\n\n**Rejected Because**:\n\n- Complexity of two authorization systems\n- Unclear separation of concerns\n- Harder to audit\n\n## References\n\n- **Cedar Documentation**: \n- **Cedar GitHub**: \n- **AWS AVP**: \n- **Policy Files**: `/provisioning/config/cedar-policies/`\n- **Implementation**: `/provisioning/platform/orchestrator/src/security/`\n\n## Related ADRs\n\n- ADR-003: JWT Token-Based Authentication\n- ADR-004: Audit Logging System\n- ADR-005: KMS Key Management\n\n## Notes\n\nCedar policy language is inspired by decades of authorization research (XACML, AWS IAM) and production experience at AWS. It balances expressiveness\nwith safety.\n\n---\n\n**Approved By**: Architecture Team\n**Implementation Date**: 2025-10-08\n**Review Date**: 2026-01-08 (Quarterly) +# ADR-008: Cedar Authorization Policy Engine Integration + +**Status**: Accepted +**Date**: 2025-10-08 +**Deciders**: Architecture Team +**Tags**: security, authorization, cedar, policy-engine + +## Context and Problem Statement + +The Provisioning platform requires fine-grained authorization controls to manage access to infrastructure resources across multiple environments +(development, staging, production). The authorization system must: + +1. Support complex authorization rules (MFA, IP restrictions, time windows, approvals) +2. Be auditable and version-controlled +3. Allow hot-reload of policies without restart +4. Integrate with JWT tokens for identity +5. Scale to thousands of authorization decisions per second +6. Be maintainable by security team without code changes + +Traditional code-based authorization (if/else statements) is difficult to audit, maintain, and scale. 
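+To make the problem concrete, here is a hypothetical Rust sketch of the code-based pattern this ADR moves away from (the names, roles, and rules are invented for illustration and are not the platform's actual logic):
+
+```text
+// Hypothetical if/else authorization. Every new requirement (approvals,
+// time windows, new roles) adds another branch, and every change ships
+// as a code review plus a redeployment.
+struct User {
+    roles: Vec<String>,
+}
+
+fn can_deploy(user: &User, env: &str, mfa_verified: bool, client_ip: &str) -> bool {
+    if env == "production" {
+        if !mfa_verified {
+            return false; // MFA rule buried in control flow
+        }
+        if !client_ip.starts_with("10.") {
+            return false; // IP allow-list hard-coded into the logic
+        }
+        user.roles.iter().any(|r| matches!(r.as_str(), "operator" | "admin"))
+    } else {
+        true // non-production: implicitly permissive, easy to miss in review
+    }
+}
+
+fn main() {
+    let user = User { roles: vec!["operator".to_string()] };
+    assert!(can_deploy(&user, "production", true, "10.0.0.8"));
+    assert!(!can_deploy(&user, "production", false, "10.0.0.8"));
+}
+```
+
+Auditing this means reading every branch; changing it means shipping code. The drivers and options below evaluate ways to move such rules into declarative, reviewable policy.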
+ +## Decision Drivers + +- **Security**: Critical for production infrastructure access +- **Auditability**: Compliance requirements demand clear authorization policies +- **Flexibility**: Policies change more frequently than code +- **Performance**: Low-latency authorization decisions (<10 ms) +- **Maintainability**: Security team should update policies without developers +- **Type Safety**: Prevent policy errors before deployment + +## Considered Options + +### Option 1: Code-Based Authorization (Current State) + +Implement authorization logic directly in Rust/Nushell code. + +**Pros**: + +- Full control and flexibility +- No external dependencies +- Simple to understand for small use cases + +**Cons**: + +- Hard to audit and maintain +- Requires code deployment for policy changes +- No type safety for policies +- Difficult to test all combinations +- Not declarative + +### Option 2: OPA (Open Policy Agent) + +Use OPA with Rego policy language. + +**Pros**: + +- Industry standard +- Rich ecosystem +- Rego is powerful + +**Cons**: + +- Rego is complex to learn +- Requires separate service deployment +- Performance overhead (HTTP calls) +- Policies not type-checked + +### Option 3: Cedar Policy Engine (Chosen) + +Use AWS Cedar policy language integrated directly into orchestrator. + +**Pros**: + +- Type-safe policy language +- Fast (compiled, no network overhead) +- Schema-based validation +- Declarative and auditable +- Hot-reload support +- Rust library (no external service) +- Deny-by-default security model + +**Cons**: + +- Recently introduced (2023) +- Smaller ecosystem than OPA +- Learning curve for policy authors + +### Option 4: Casbin + +Use Casbin authorization library. + +**Pros**: + +- Multiple policy models (ACL, RBAC, ABAC) +- Rust bindings available + +**Cons**: + +- Less declarative than Cedar +- Weaker type safety +- More imperative style + +## Decision Outcome + +**Chosen Option**: Option 3 - Cedar Policy Engine + +### Rationale + +1. **Type Safety**: Cedar's schema validation prevents policy errors before deployment +2. **Performance**: Native Rust library, no network overhead, <1 ms authorization decisions +3. **Auditability**: Declarative policies in version control +4. **Hot Reload**: Update policies without orchestrator restart +5. **AWS Standard**: Used in production by AWS for AVP (Amazon Verified Permissions) +6. 
**Deny-by-Default**: Secure by design
+
+### Implementation Details
+
+#### Architecture
+
+```text
+┌─────────────────────────────────────────────────────────┐
+│                      Orchestrator                       │
+├─────────────────────────────────────────────────────────┤
+│                                                         │
+│  HTTP Request                                           │
+│       ↓                                                 │
+│  ┌──────────────────┐                                   │
+│  │  JWT Validation  │ ← Token Validator                 │
+│  └────────┬─────────┘                                   │
+│           ↓                                             │
+│  ┌──────────────────┐                                   │
+│  │   Cedar Engine   │ ← Policy Loader                   │
+│  │                  │   (Hot Reload)                    │
+│  │ • Check Policies │                                   │
+│  │ • Evaluate Rules │                                   │
+│  │ • Context Check  │                                   │
+│  └────────┬─────────┘                                   │
+│           ↓                                             │
+│      Allow / Deny                                       │
+│                                                         │
+└─────────────────────────────────────────────────────────┘
+```
+
+#### Policy Organization
+
+```text
+provisioning/config/cedar-policies/
+├── schema.cedar       # Entity and action definitions
+├── production.cedar   # Production environment policies
+├── development.cedar  # Development environment policies
+├── admin.cedar        # Administrative policies
+└── README.md          # Documentation
+```
+
+#### Rust Implementation
+
+```text
+provisioning/platform/orchestrator/src/security/
+├── cedar.rs          # Cedar engine integration (450 lines)
+├── policy_loader.rs  # Policy loading with hot reload (320 lines)
+├── authorization.rs  # Middleware integration (380 lines)
+├── mod.rs            # Module exports
+└── tests.rs          # Comprehensive tests (450 lines)
+```
+
+#### Key Components
+
+1. **CedarEngine**: Core authorization engine
+   - Load policies from strings
+   - Load schema for validation
+   - Authorize requests
+   - Policy statistics
+
+2. **PolicyLoader**: File-based policy management
+   - Load policies from directory
+   - Hot reload on file changes (notify crate)
+   - Validate policy syntax
+   - Schema validation
+
+3. **Authorization Middleware**: Axum integration
+   - Extract JWT claims
+   - Build authorization context (IP, MFA, time)
+   - Check authorization
+   - Return 403 Forbidden on deny
+
+4. **Policy Files**: Declarative authorization rules
+   - Production: MFA, approvals, IP restrictions, business hours
+   - Development: Permissive for developers
+   - Admin: Platform admin, SRE, audit team policies
+
+#### Context Variables
+
+```text
+AuthorizationContext {
+    mfa_verified: bool,                  // MFA verification status
+    ip_address: String,                  // Client IP address
+    time: String,                        // ISO 8601 timestamp
+    approval_id: Option<String>,         // Approval ID (optional)
+    reason: Option<String>,              // Reason for operation
+    force: bool,                         // Force flag
+    additional: HashMap<String, String>, // Additional context
+}
+```
+
+#### Example Policy
+
+```text
+// Production deployments require MFA verification
+@id("prod-deploy-mfa")
+@description("All production deployments must have MFA verification")
+permit (
+    principal,
+    action == Provisioning::Action::"deploy",
+    resource in Provisioning::Environment::"production"
+) when {
+    context.mfa_verified == true
+};
+```
+
+### Integration Points
+
+1. **JWT Tokens**: Extract principal and context from validated JWT
+2. **Audit System**: Log all authorization decisions
+3. **Control Center**: UI for policy management and testing
+4. **CLI**: Policy validation and testing commands
+
+### Security Best Practices
+
+1. **Deny by Default**: Cedar defaults to deny all actions
+2. **Schema Validation**: Type-check policies before loading
+3. **Version Control**: All policies in git for auditability
+4. **Principle of Least Privilege**: Grant minimum necessary permissions
+5. **Defense in Depth**: Combine with JWT validation and rate limiting
+6. **Separation of Concerns**: Security team owns policies, developers own code
+
+## Consequences
+
+### Positive
+
+1.
✅ **Auditable**: All policies in version control +2. ✅ **Type-Safe**: Schema validation prevents errors +3. ✅ **Fast**: <1 ms authorization decisions +4. ✅ **Maintainable**: Security team can update policies independently +5. ✅ **Hot Reload**: No downtime for policy updates +6. ✅ **Testable**: Comprehensive test suite for policies +7. ✅ **Declarative**: Clear intent, no hidden logic + +### Negative + +1. ❌ **Learning Curve**: Team must learn Cedar policy language +2. ❌ **New Technology**: Cedar is relatively new (2023) +3. ❌ **Ecosystem**: Smaller community than OPA +4. ❌ **Tooling**: Limited IDE support compared to Rego + +### Neutral + +1. 🔶 **Migration**: Existing authorization logic needs migration to Cedar +2. 🔶 **Policy Complexity**: Complex rules may be harder to express +3. 🔶 **Debugging**: Policy debugging requires understanding Cedar evaluation + +## Compliance + +### Security Standards + +- **SOC 2**: Auditable access control policies +- **ISO 27001**: Access control management +- **GDPR**: Data access authorization and logging +- **NIST 800-53**: AC-3 Access Enforcement + +### Audit Requirements + +All authorization decisions include: + +- Principal (user/team) +- Action performed +- Resource accessed +- Context (MFA, IP, time) +- Decision (allow/deny) +- Policies evaluated + +## Migration Path + +### Phase 1: Implementation (Completed) + +- ✅ Cedar engine integration +- ✅ Policy loader with hot reload +- ✅ Authorization middleware +- ✅ Production, development, and admin policies +- ✅ Comprehensive tests + +### Phase 2: Rollout (Next) + +- 🔲 Enable Cedar authorization in orchestrator +- 🔲 Migrate existing authorization logic to Cedar policies +- 🔲 Add authorization checks to all API endpoints +- 🔲 Integrate with audit logging + +### Phase 3: Enhancement (Future) + +- 🔲 Control Center policy editor UI +- 🔲 Policy testing UI +- 🔲 Policy simulation and dry-run mode +- 🔲 Policy analytics and insights +- 🔲 Advanced context variables (location, device type) + +## Alternatives Considered + +### Alternative 1: Continue with Code-Based Authorization + +Keep authorization logic in Rust/Nushell code. + +**Rejected Because**: + +- Not auditable +- Requires code changes for policy updates +- Difficult to test all combinations +- Not compliant with security standards + +### Alternative 2: Hybrid Approach + +Use Cedar for high-level policies, code for fine-grained checks. + +**Rejected Because**: + +- Complexity of two authorization systems +- Unclear separation of concerns +- Harder to audit + +## References + +- **Cedar Documentation**: +- **Cedar GitHub**: +- **AWS AVP**: +- **Policy Files**: `/provisioning/config/cedar-policies/` +- **Implementation**: `/provisioning/platform/orchestrator/src/security/` + +## Related ADRs + +- ADR-003: JWT Token-Based Authentication +- ADR-004: Audit Logging System +- ADR-005: KMS Key Management + +## Notes + +Cedar policy language is inspired by decades of authorization research (XACML, AWS IAM) and production experience at AWS. It balances expressiveness +with safety. 
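+As a rough illustration of that balance, the sketch below shows the shape of an authorization check with the `cedar-policy` Rust crate. It is a minimal sketch, assuming the crate's 3.x API; the entity names and the inline policy are invented for illustration and are not the platform's actual schema or policies:
+
+```text
+// deps (assumed): cedar-policy = "3", serde_json = "1"
+use std::str::FromStr;
+use cedar_policy::{Authorizer, Context, Decision, Entities, EntityUid, PolicySet, Request};
+
+fn main() {
+    // Policies are data: parsed at runtime, which is what enables hot reload.
+    let policies = PolicySet::from_str(
+        r#"permit(principal, action == Action::"deploy", resource)
+           when { context.mfa_verified == true };"#,
+    ).expect("valid policy set");
+
+    let principal = EntityUid::from_str(r#"User::"alice""#).expect("valid uid");
+    let action = EntityUid::from_str(r#"Action::"deploy""#).expect("valid uid");
+    let resource = EntityUid::from_str(r#"Environment::"production""#).expect("valid uid");
+
+    // Context carries the request attributes that policies test against.
+    let context = Context::from_json_value(
+        serde_json::json!({ "mfa_verified": true }),
+        None,
+    ).expect("valid context");
+
+    let request = Request::new(Some(principal), Some(action), Some(resource), context, None)
+        .expect("valid request");
+
+    // Deny-by-default: with no matching permit policy, the answer is Deny.
+    let response = Authorizer::new().is_authorized(&request, &policies, &Entities::empty());
+    assert_eq!(response.decision(), Decision::Allow);
+}
+```
+
+Note how the rules live entirely in the policy string: swapping or reloading policies changes the decision without touching the calling code.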
+ +--- + +**Approved By**: Architecture Team +**Implementation Date**: 2025-10-08 +**Review Date**: 2026-01-08 (Quarterly) \ No newline at end of file diff --git a/docs/src/architecture/adr/ADR-009-security-system-complete.md b/docs/src/architecture/adr/ADR-009-security-system-complete.md index 3e9048c..ff54bd5 100644 --- a/docs/src/architecture/adr/ADR-009-security-system-complete.md +++ b/docs/src/architecture/adr/ADR-009-security-system-complete.md @@ -1 +1,661 @@ -# ADR-009: Complete Security System Implementation\n\n**Status**: Implemented\n**Date**: 2025-10-08\n**Decision Makers**: Architecture Team\n\n---\n\n## Context\n\nThe Provisioning platform required a comprehensive, enterprise-grade security system covering authentication, authorization, secrets management, MFA,\ncompliance, and emergency access. The system needed to be production-ready, scalable, and compliant with GDPR, SOC2, and ISO 27001.\n\n---\n\n## Decision\n\nImplement a complete security architecture using 12 specialized components organized in 4 implementation groups.\n\n---\n\n## Implementation Summary\n\n### Total Implementation\n\n- **39,699 lines** of production-ready code\n- **136 files** created/modified\n- **350+ tests** implemented\n- **83+ REST endpoints** available\n- **111+ CLI commands** ready\n\n---\n\n## Architecture Components\n\n### Group 1: Foundation (13,485 lines)\n\n#### 1. JWT Authentication (1,626 lines)\n\n**Location**: `provisioning/platform/control-center/src/auth/`\n\n**Features**:\n\n- RS256 asymmetric signing\n- Access tokens (15 min) + refresh tokens (7 d)\n- Token rotation and revocation\n- Argon2id password hashing\n- 5 user roles (Admin, Developer, Operator, Viewer, Auditor)\n- Thread-safe blacklist\n\n**API**: 6 endpoints\n**CLI**: 8 commands\n**Tests**: 30+\n\n#### 2. Cedar Authorization (5,117 lines)\n\n**Location**: `provisioning/config/cedar-policies/`, `provisioning/platform/orchestrator/src/security/`\n\n**Features**:\n\n- Cedar policy engine integration\n- 4 policy files (schema, production, development, admin)\n- Context-aware authorization (MFA, IP, time windows)\n- Hot reload without restart\n- Policy validation\n\n**API**: 4 endpoints\n**CLI**: 6 commands\n**Tests**: 30+\n\n#### 3. Audit Logging (3,434 lines)\n\n**Location**: `provisioning/platform/orchestrator/src/audit/`\n\n**Features**:\n\n- Structured JSON logging\n- 40+ action types\n- GDPR compliance (PII anonymization)\n- 5 export formats (JSON, CSV, Splunk, ECS, JSON Lines)\n- Query API with advanced filtering\n\n**API**: 7 endpoints\n**CLI**: 8 commands\n**Tests**: 25\n\n#### 4. Config Encryption (3,308 lines)\n\n**Location**: `provisioning/core/nulib/lib_provisioning/config/encryption.nu`\n\n**Features**:\n\n- SOPS integration\n- 4 KMS backends (Age, AWS KMS, Vault, Cosmian)\n- Transparent encryption/decryption\n- Memory-only decryption\n- Auto-detection\n\n**CLI**: 10 commands\n**Tests**: 7\n\n---\n\n### Group 2: KMS Integration (9,331 lines)\n\n#### 5. KMS Service (2,483 lines)\n\n**Location**: `provisioning/platform/kms-service/`\n\n**Features**:\n\n- HashiCorp Vault (Transit engine)\n- AWS KMS (Direct + envelope encryption)\n- Context-based encryption (AAD)\n- Key rotation support\n- Multi-region support\n\n**API**: 8 endpoints\n**CLI**: 15 commands\n**Tests**: 20\n\n#### 6. 
Dynamic Secrets (4,141 lines)\n\n**Location**: `provisioning/platform/orchestrator/src/secrets/`\n\n**Features**:\n\n- AWS STS temporary credentials (15 min-12 h)\n- SSH key pair generation (Ed25519)\n- UpCloud API subaccounts\n- TTL manager with auto-cleanup\n- Vault dynamic secrets integration\n\n**API**: 7 endpoints\n**CLI**: 10 commands\n**Tests**: 15\n\n#### 7. SSH Temporal Keys (2,707 lines)\n\n**Location**: `provisioning/platform/orchestrator/src/ssh/`\n\n**Features**:\n\n- Ed25519 key generation\n- Vault OTP (one-time passwords)\n- Vault CA (certificate authority signing)\n- Auto-deployment to authorized_keys\n- Background cleanup every 5 min\n\n**API**: 7 endpoints\n**CLI**: 10 commands\n**Tests**: 31\n\n---\n\n### Group 3: Security Features (8,948 lines)\n\n#### 8. MFA Implementation (3,229 lines)\n\n**Location**: `provisioning/platform/control-center/src/mfa/`\n\n**Features**:\n\n- TOTP (RFC 6238, 6-digit codes, 30 s window)\n- WebAuthn/FIDO2 (YubiKey, Touch ID, Windows Hello)\n- QR code generation\n- 10 backup codes per user\n- Multiple devices per user\n- Rate limiting (5 attempts/5 min)\n\n**API**: 13 endpoints\n**CLI**: 15 commands\n**Tests**: 85+\n\n#### 9. Orchestrator Auth Flow (2,540 lines)\n\n**Location**: `provisioning/platform/orchestrator/src/middleware/`\n\n**Features**:\n\n- Complete middleware chain (5 layers)\n- Security context builder\n- Rate limiting (100 req/min per IP)\n- JWT authentication middleware\n- MFA verification middleware\n- Cedar authorization middleware\n- Audit logging middleware\n\n**Tests**: 53\n\n#### 10. Control Center UI (3,179 lines)\n\n**Location**: `provisioning/platform/control-center/web/`\n\n**Features**:\n\n- React/TypeScript UI\n- Login with MFA (2-step flow)\n- MFA setup (TOTP + WebAuthn wizards)\n- Device management\n- Audit log viewer with filtering\n- API token management\n- Security settings dashboard\n\n**Components**: 12 React components\n**API Integration**: 17 methods\n\n---\n\n### Group 4: Advanced Features (7,935 lines)\n\n#### 11. Break-Glass Emergency Access (3,840 lines)\n\n**Location**: `provisioning/platform/orchestrator/src/break_glass/`\n\n**Features**:\n\n- Multi-party approval (2+ approvers, different teams)\n- Emergency JWT tokens (4 h max, special claims)\n- Auto-revocation (expiration + inactivity)\n- Enhanced audit (7-year retention)\n- Real-time alerts\n- Background monitoring\n\n**API**: 12 endpoints\n**CLI**: 10 commands\n**Tests**: 985 lines (unit + integration)\n\n#### 12. Compliance (4,095 lines)\n\n**Location**: `provisioning/platform/orchestrator/src/compliance/`\n\n**Features**:\n\n- **GDPR**: Data export, deletion, rectification, portability, objection\n- **SOC2**: 9 Trust Service Criteria verification\n- **ISO 27001**: 14 Annex A control families\n- **Incident Response**: Complete lifecycle management\n- **Data Protection**: 4-level classification, encryption controls\n- **Access Control**: RBAC matrix with role verification\n\n**API**: 35 endpoints\n**CLI**: 23 commands\n**Tests**: 11\n\n---\n\n## Security Architecture Flow\n\n### End-to-End Request Flow\n\n```\n1. User Request\n ↓\n2. Rate Limiting (100 req/min per IP)\n ↓\n3. JWT Authentication (RS256, 15 min tokens)\n ↓\n4. MFA Verification (TOTP/WebAuthn for sensitive ops)\n ↓\n5. Cedar Authorization (context-aware policies)\n ↓\n6. Dynamic Secrets (AWS STS, SSH keys, 1h TTL)\n ↓\n7. Operation Execution (encrypted configs, KMS)\n ↓\n8. Audit Logging (structured JSON, GDPR-compliant)\n ↓\n9. Response\n```\n\n### Emergency Access Flow\n\n```\n1. 
Emergency Request (reason + justification)\n ↓\n2. Multi-Party Approval (2+ approvers, different teams)\n ↓\n3. Session Activation (special JWT, 4h max)\n ↓\n4. Enhanced Audit (7-year retention, immutable)\n ↓\n5. Auto-Revocation (expiration/inactivity)\n```\n\n---\n\n## Technology Stack\n\n### Backend (Rust)\n\n- **axum**: HTTP framework\n- **jsonwebtoken**: JWT handling (RS256)\n- **cedar-policy**: Authorization engine\n- **totp-rs**: TOTP implementation\n- **webauthn-rs**: WebAuthn/FIDO2\n- **aws-sdk-kms**: AWS KMS integration\n- **argon2**: Password hashing\n- **tracing**: Structured logging\n\n### Frontend (TypeScript/React)\n\n- **React 18**: UI framework\n- **Leptos**: Rust WASM framework\n- **@simplewebauthn/browser**: WebAuthn client\n- **qrcode.react**: QR code generation\n\n### CLI (Nushell)\n\n- **Nushell 0.107**: Shell and scripting\n- **nu_plugin_kcl**: KCL integration\n\n### Infrastructure\n\n- **HashiCorp Vault**: Secrets management, KMS, SSH CA\n- **AWS KMS**: Key management service\n- **PostgreSQL/SurrealDB**: Data storage\n- **SOPS**: Config encryption\n\n---\n\n## Security Guarantees\n\n### Authentication\n\n✅ RS256 asymmetric signing (no shared secrets)\n✅ Short-lived access tokens (15 min)\n✅ Token revocation support\n✅ Argon2id password hashing (memory-hard)\n✅ MFA enforced for production operations\n\n### Authorization\n\n✅ Fine-grained permissions (Cedar policies)\n✅ Context-aware (MFA, IP, time windows)\n✅ Hot reload policies (no downtime)\n✅ Deny by default\n\n### Secrets Management\n\n✅ No static credentials stored\n✅ Time-limited secrets (1h default)\n✅ Auto-revocation on expiry\n✅ Encryption at rest (KMS)\n✅ Memory-only decryption\n\n### Audit & Compliance\n\n✅ Immutable audit logs\n✅ GDPR-compliant (PII anonymization)\n✅ SOC2 controls implemented\n✅ ISO 27001 controls verified\n✅ 7-year retention for break-glass\n\n### Emergency Access\n\n✅ Multi-party approval required\n✅ Time-limited sessions (4h max)\n✅ Enhanced audit logging\n✅ Auto-revocation\n✅ Cannot be disabled\n\n---\n\n## Performance Characteristics\n\n| Component | Latency | Throughput | Memory |\n| ----------- | --------- | ------------ | -------- |\n| JWT Auth | <5 ms | 10,000/s | ~10 MB |\n| Cedar Authz | <10 ms | 5,000/s | ~50 MB |\n| Audit Log | <5 ms | 20,000/s | ~100 MB |\n| KMS Encrypt | <50 ms | 1,000/s | ~20 MB |\n| Dynamic Secrets | <100 ms | 500/s | ~50 MB |\n| MFA Verify | <50 ms | 2,000/s | ~30 MB |\n\n**Total Overhead**: ~10-20 ms per request\n**Memory Usage**: ~260 MB total for all security components\n\n---\n\n## Deployment Options\n\n### Development\n\n```\n# Start all services\ncd provisioning/platform/kms-service && cargo run &\ncd provisioning/platform/orchestrator && cargo run &\ncd provisioning/platform/control-center && cargo run &\n```\n\n### Production\n\n```\n# Kubernetes deployment\nkubectl apply -f k8s/security-stack.yaml\n\n# Docker Compose\ndocker-compose up -d kms orchestrator control-center\n\n# Systemd services\nsystemctl start provisioning-kms\nsystemctl start provisioning-orchestrator\nsystemctl start provisioning-control-center\n```\n\n---\n\n## Configuration\n\n### Environment Variables\n\n```\n# JWT\nexport JWT_ISSUER="control-center"\nexport JWT_AUDIENCE="orchestrator,cli"\nexport JWT_PRIVATE_KEY_PATH="/keys/private.pem"\nexport JWT_PUBLIC_KEY_PATH="/keys/public.pem"\n\n# Cedar\nexport CEDAR_POLICIES_PATH="/config/cedar-policies"\nexport CEDAR_ENABLE_HOT_RELOAD=true\n\n# KMS\nexport KMS_BACKEND="vault"\nexport VAULT_ADDR="https://vault.example.com"\nexport 
VAULT_TOKEN="..."\n\n# MFA\nexport MFA_TOTP_ISSUER="Provisioning"\nexport MFA_WEBAUTHN_RP_ID="provisioning.example.com"\n```\n\n### Config Files\n\n```\n# provisioning/config/security.toml\n[jwt]\nissuer = "control-center"\naudience = ["orchestrator", "cli"]\naccess_token_ttl = "15m"\nrefresh_token_ttl = "7d"\n\n[cedar]\npolicies_path = "config/cedar-policies"\nhot_reload = true\nreload_interval = "60s"\n\n[mfa]\ntotp_issuer = "Provisioning"\nwebauthn_rp_id = "provisioning.example.com"\nrate_limit = 5\nrate_limit_window = "5m"\n\n[kms]\nbackend = "vault"\nvault_address = "https://vault.example.com"\nvault_mount_point = "transit"\n\n[audit]\nretention_days = 365\nretention_break_glass_days = 2555 # 7 years\nexport_format = "json"\npii_anonymization = true\n```\n\n---\n\n## Testing\n\n### Run All Tests\n\n```\n# Control Center (JWT, MFA)\ncd provisioning/platform/control-center\ncargo test\n\n# Orchestrator (Cedar, Audit, Secrets, SSH, Break-Glass, Compliance)\ncd provisioning/platform/orchestrator\ncargo test\n\n# KMS Service\ncd provisioning/platform/kms-service\ncargo test\n\n# Config Encryption (Nushell)\nnu provisioning/core/nulib/lib_provisioning/config/encryption_tests.nu\n```\n\n### Integration Tests\n\n```\n# Full security flow\ncd provisioning/platform/orchestrator\ncargo test --test security_integration_tests\ncargo test --test break_glass_integration_tests\n```\n\n---\n\n## Monitoring & Alerts\n\n### Metrics to Monitor\n\n- Authentication failures (rate, sources)\n- Authorization denials (policies, resources)\n- MFA failures (attempts, users)\n- Token revocations (rate, reasons)\n- Break-glass activations (frequency, duration)\n- Secrets generation (rate, types)\n- Audit log volume (events/sec)\n\n### Alerts to Configure\n\n- Multiple failed auth attempts (5+ in 5 min)\n- Break-glass session created\n- Compliance report non-compliant\n- Incident severity critical/high\n- Token revocation spike\n- KMS errors\n- Audit log export failures\n\n---\n\n## Maintenance\n\n### Daily\n\n- Monitor audit logs for anomalies\n- Review failed authentication attempts\n- Check break-glass sessions (should be zero)\n\n### Weekly\n\n- Review compliance reports\n- Check incident response status\n- Verify backup code usage\n- Review MFA device additions/removals\n\n### Monthly\n\n- Rotate KMS keys\n- Review and update Cedar policies\n- Generate compliance reports (GDPR, SOC2, ISO)\n- Audit access control matrix\n\n### Quarterly\n\n- Full security audit\n- Penetration testing\n- Compliance certification review\n- Update security documentation\n\n---\n\n## Migration Path\n\n### From Existing System\n\n1. **Phase 1**: Deploy security infrastructure\n - KMS service\n - Orchestrator with auth middleware\n - Control Center\n\n2. **Phase 2**: Migrate authentication\n - Enable JWT authentication\n - Migrate existing users\n - Disable old auth system\n\n3. **Phase 3**: Enable MFA\n - Require MFA enrollment for admins\n - Gradual rollout to all users\n\n4. **Phase 4**: Enable Cedar authorization\n - Deploy initial policies (permissive)\n - Monitor authorization decisions\n - Tighten policies incrementally\n\n5. 
**Phase 5**: Enable advanced features\n - Break-glass procedures\n - Compliance reporting\n - Incident response\n\n---\n\n## Future Enhancements\n\n### Planned (Not Implemented)\n\n- **Hardware Security Module (HSM)** integration\n- **OAuth2/OIDC** federation\n- **SAML SSO** for enterprise\n- **Risk-based authentication** (IP reputation, device fingerprinting)\n- **Behavioral analytics** (anomaly detection)\n- **Zero-Trust Network** (service mesh integration)\n\n### Under Consideration\n\n- **Blockchain audit log** (immutable append-only log)\n- **Quantum-resistant cryptography** (post-quantum algorithms)\n- **Confidential computing** (SGX/SEV enclaves)\n- **Distributed break-glass** (multi-region approval)\n\n---\n\n## Consequences\n\n### Positive\n\n✅ **Enterprise-grade security** meeting GDPR, SOC2, ISO 27001\n✅ **Zero static credentials** (all dynamic, time-limited)\n✅ **Complete audit trail** (immutable, GDPR-compliant)\n✅ **MFA-enforced** for sensitive operations\n✅ **Emergency access** with enhanced controls\n✅ **Fine-grained authorization** (Cedar policies)\n✅ **Automated compliance** (reports, incident response)\n\n### Negative\n\n⚠️ **Increased complexity** (12 components to manage)\n⚠️ **Performance overhead** (~10-20 ms per request)\n⚠️ **Memory footprint** (~260 MB additional)\n⚠️ **Learning curve** (Cedar policy language, MFA setup)\n⚠️ **Operational overhead** (key rotation, policy updates)\n\n### Mitigations\n\n- Comprehensive documentation (ADRs, guides, API docs)\n- CLI commands for all operations\n- Automated monitoring and alerting\n- Gradual rollout with feature flags\n- Training materials for operators\n\n---\n\n## Related Documentation\n\n- **JWT Auth**: `docs/architecture/JWT_AUTH_IMPLEMENTATION.md`\n- **Cedar Authz**: `docs/architecture/CEDAR_AUTHORIZATION_IMPLEMENTATION.md`\n- **Audit Logging**: `docs/architecture/AUDIT_LOGGING_IMPLEMENTATION.md`\n- **MFA**: `docs/architecture/MFA_IMPLEMENTATION_SUMMARY.md`\n- **Break-Glass**: `docs/architecture/BREAK_GLASS_IMPLEMENTATION_SUMMARY.md`\n- **Compliance**: `docs/architecture/COMPLIANCE_IMPLEMENTATION_SUMMARY.md`\n- **Config Encryption**: `docs/user/CONFIG_ENCRYPTION_GUIDE.md`\n- **Dynamic Secrets**: `docs/user/DYNAMIC_SECRETS_QUICK_REFERENCE.md`\n- **SSH Keys**: `docs/user/SSH_TEMPORAL_KEYS_USER_GUIDE.md`\n\n---\n\n## Approval\n\n**Architecture Team**: Approved\n**Security Team**: Approved (pending penetration test)\n**Compliance Team**: Approved (pending audit)\n**Engineering Team**: Approved\n\n---\n\n**Date**: 2025-10-08\n**Version**: 1.0.0\n**Status**: Implemented and Production-Ready +# ADR-009: Complete Security System Implementation + +**Status**: Implemented +**Date**: 2025-10-08 +**Decision Makers**: Architecture Team + +--- + +## Context + +The Provisioning platform required a comprehensive, enterprise-grade security system covering authentication, authorization, secrets management, MFA, +compliance, and emergency access. The system needed to be production-ready, scalable, and compliant with GDPR, SOC2, and ISO 27001. + +--- + +## Decision + +Implement a complete security architecture using 12 specialized components organized in 4 implementation groups. + +--- + +## Implementation Summary + +### Total Implementation + +- **39,699 lines** of production-ready code +- **136 files** created/modified +- **350+ tests** implemented +- **83+ REST endpoints** available +- **111+ CLI commands** ready + +--- + +## Architecture Components + +### Group 1: Foundation (13,485 lines) + +#### 1. 
JWT Authentication (1,626 lines) + +**Location**: `provisioning/platform/control-center/src/auth/` + +**Features**: + +- RS256 asymmetric signing +- Access tokens (15 min) + refresh tokens (7 d) +- Token rotation and revocation +- Argon2id password hashing +- 5 user roles (Admin, Developer, Operator, Viewer, Auditor) +- Thread-safe blacklist + +**API**: 6 endpoints +**CLI**: 8 commands +**Tests**: 30+ + +#### 2. Cedar Authorization (5,117 lines) + +**Location**: `provisioning/config/cedar-policies/`, `provisioning/platform/orchestrator/src/security/` + +**Features**: + +- Cedar policy engine integration +- 4 policy files (schema, production, development, admin) +- Context-aware authorization (MFA, IP, time windows) +- Hot reload without restart +- Policy validation + +**API**: 4 endpoints +**CLI**: 6 commands +**Tests**: 30+ + +#### 3. Audit Logging (3,434 lines) + +**Location**: `provisioning/platform/orchestrator/src/audit/` + +**Features**: + +- Structured JSON logging +- 40+ action types +- GDPR compliance (PII anonymization) +- 5 export formats (JSON, CSV, Splunk, ECS, JSON Lines) +- Query API with advanced filtering + +**API**: 7 endpoints +**CLI**: 8 commands +**Tests**: 25 + +#### 4. Config Encryption (3,308 lines) + +**Location**: `provisioning/core/nulib/lib_provisioning/config/encryption.nu` + +**Features**: + +- SOPS integration +- 4 KMS backends (Age, AWS KMS, Vault, Cosmian) +- Transparent encryption/decryption +- Memory-only decryption +- Auto-detection + +**CLI**: 10 commands +**Tests**: 7 + +--- + +### Group 2: KMS Integration (9,331 lines) + +#### 5. KMS Service (2,483 lines) + +**Location**: `provisioning/platform/kms-service/` + +**Features**: + +- HashiCorp Vault (Transit engine) +- AWS KMS (Direct + envelope encryption) +- Context-based encryption (AAD) +- Key rotation support +- Multi-region support + +**API**: 8 endpoints +**CLI**: 15 commands +**Tests**: 20 + +#### 6. Dynamic Secrets (4,141 lines) + +**Location**: `provisioning/platform/orchestrator/src/secrets/` + +**Features**: + +- AWS STS temporary credentials (15 min-12 h) +- SSH key pair generation (Ed25519) +- UpCloud API subaccounts +- TTL manager with auto-cleanup +- Vault dynamic secrets integration + +**API**: 7 endpoints +**CLI**: 10 commands +**Tests**: 15 + +#### 7. SSH Temporal Keys (2,707 lines) + +**Location**: `provisioning/platform/orchestrator/src/ssh/` + +**Features**: + +- Ed25519 key generation +- Vault OTP (one-time passwords) +- Vault CA (certificate authority signing) +- Auto-deployment to authorized_keys +- Background cleanup every 5 min + +**API**: 7 endpoints +**CLI**: 10 commands +**Tests**: 31 + +--- + +### Group 3: Security Features (8,948 lines) + +#### 8. MFA Implementation (3,229 lines) + +**Location**: `provisioning/platform/control-center/src/mfa/` + +**Features**: + +- TOTP (RFC 6238, 6-digit codes, 30 s window) +- WebAuthn/FIDO2 (YubiKey, Touch ID, Windows Hello) +- QR code generation +- 10 backup codes per user +- Multiple devices per user +- Rate limiting (5 attempts/5 min) + +**API**: 13 endpoints +**CLI**: 15 commands +**Tests**: 85+ + +#### 9. Orchestrator Auth Flow (2,540 lines) + +**Location**: `provisioning/platform/orchestrator/src/middleware/` + +**Features**: + +- Complete middleware chain (5 layers) +- Security context builder +- Rate limiting (100 req/min per IP) +- JWT authentication middleware +- MFA verification middleware +- Cedar authorization middleware +- Audit logging middleware + +**Tests**: 53 + +#### 10. 
Control Center UI (3,179 lines) + +**Location**: `provisioning/platform/control-center/web/` + +**Features**: + +- React/TypeScript UI +- Login with MFA (2-step flow) +- MFA setup (TOTP + WebAuthn wizards) +- Device management +- Audit log viewer with filtering +- API token management +- Security settings dashboard + +**Components**: 12 React components +**API Integration**: 17 methods + +--- + +### Group 4: Advanced Features (7,935 lines) + +#### 11. Break-Glass Emergency Access (3,840 lines) + +**Location**: `provisioning/platform/orchestrator/src/break_glass/` + +**Features**: + +- Multi-party approval (2+ approvers, different teams) +- Emergency JWT tokens (4 h max, special claims) +- Auto-revocation (expiration + inactivity) +- Enhanced audit (7-year retention) +- Real-time alerts +- Background monitoring + +**API**: 12 endpoints +**CLI**: 10 commands +**Tests**: 985 lines (unit + integration) + +#### 12. Compliance (4,095 lines) + +**Location**: `provisioning/platform/orchestrator/src/compliance/` + +**Features**: + +- **GDPR**: Data export, deletion, rectification, portability, objection +- **SOC2**: 9 Trust Service Criteria verification +- **ISO 27001**: 14 Annex A control families +- **Incident Response**: Complete lifecycle management +- **Data Protection**: 4-level classification, encryption controls +- **Access Control**: RBAC matrix with role verification + +**API**: 35 endpoints +**CLI**: 23 commands +**Tests**: 11 + +--- + +## Security Architecture Flow + +### End-to-End Request Flow + +```text +1. User Request + ↓ +2. Rate Limiting (100 req/min per IP) + ↓ +3. JWT Authentication (RS256, 15 min tokens) + ↓ +4. MFA Verification (TOTP/WebAuthn for sensitive ops) + ↓ +5. Cedar Authorization (context-aware policies) + ↓ +6. Dynamic Secrets (AWS STS, SSH keys, 1h TTL) + ↓ +7. Operation Execution (encrypted configs, KMS) + ↓ +8. Audit Logging (structured JSON, GDPR-compliant) + ↓ +9. Response +``` + +### Emergency Access Flow + +```text +1. Emergency Request (reason + justification) + ↓ +2. Multi-Party Approval (2+ approvers, different teams) + ↓ +3. Session Activation (special JWT, 4h max) + ↓ +4. Enhanced Audit (7-year retention, immutable) + ↓ +5. 
Auto-Revocation (expiration/inactivity) +``` + +--- + +## Technology Stack + +### Backend (Rust) + +- **axum**: HTTP framework +- **jsonwebtoken**: JWT handling (RS256) +- **cedar-policy**: Authorization engine +- **totp-rs**: TOTP implementation +- **webauthn-rs**: WebAuthn/FIDO2 +- **aws-sdk-kms**: AWS KMS integration +- **argon2**: Password hashing +- **tracing**: Structured logging + +### Frontend (TypeScript/React) + +- **React 18**: UI framework +- **Leptos**: Rust WASM framework +- **@simplewebauthn/browser**: WebAuthn client +- **qrcode.react**: QR code generation + +### CLI (Nushell) + +- **Nushell 0.107**: Shell and scripting +- **nu_plugin_kcl**: KCL integration + +### Infrastructure + +- **HashiCorp Vault**: Secrets management, KMS, SSH CA +- **AWS KMS**: Key management service +- **PostgreSQL/SurrealDB**: Data storage +- **SOPS**: Config encryption + +--- + +## Security Guarantees + +### Authentication + +✅ RS256 asymmetric signing (no shared secrets) +✅ Short-lived access tokens (15 min) +✅ Token revocation support +✅ Argon2id password hashing (memory-hard) +✅ MFA enforced for production operations + +### Authorization + +✅ Fine-grained permissions (Cedar policies) +✅ Context-aware (MFA, IP, time windows) +✅ Hot reload policies (no downtime) +✅ Deny by default + +### Secrets Management + +✅ No static credentials stored +✅ Time-limited secrets (1h default) +✅ Auto-revocation on expiry +✅ Encryption at rest (KMS) +✅ Memory-only decryption + +### Audit & Compliance + +✅ Immutable audit logs +✅ GDPR-compliant (PII anonymization) +✅ SOC2 controls implemented +✅ ISO 27001 controls verified +✅ 7-year retention for break-glass + +### Emergency Access + +✅ Multi-party approval required +✅ Time-limited sessions (4h max) +✅ Enhanced audit logging +✅ Auto-revocation +✅ Cannot be disabled + +--- + +## Performance Characteristics + +| Component | Latency | Throughput | Memory | +| ----------- | --------- | ------------ | -------- | +| JWT Auth | <5 ms | 10,000/s | ~10 MB | +| Cedar Authz | <10 ms | 5,000/s | ~50 MB | +| Audit Log | <5 ms | 20,000/s | ~100 MB | +| KMS Encrypt | <50 ms | 1,000/s | ~20 MB | +| Dynamic Secrets | <100 ms | 500/s | ~50 MB | +| MFA Verify | <50 ms | 2,000/s | ~30 MB | + +**Total Overhead**: ~10-20 ms per request +**Memory Usage**: ~260 MB total for all security components + +--- + +## Deployment Options + +### Development + +```text +# Start all services +cd provisioning/platform/kms-service && cargo run & +cd provisioning/platform/orchestrator && cargo run & +cd provisioning/platform/control-center && cargo run & +``` + +### Production + +```text +# Kubernetes deployment +kubectl apply -f k8s/security-stack.yaml + +# Docker Compose +docker-compose up -d kms orchestrator control-center + +# Systemd services +systemctl start provisioning-kms +systemctl start provisioning-orchestrator +systemctl start provisioning-control-center +``` + +--- + +## Configuration + +### Environment Variables + +```text +# JWT +export JWT_ISSUER="control-center" +export JWT_AUDIENCE="orchestrator,cli" +export JWT_PRIVATE_KEY_PATH="/keys/private.pem" +export JWT_PUBLIC_KEY_PATH="/keys/public.pem" + +# Cedar +export CEDAR_POLICIES_PATH="/config/cedar-policies" +export CEDAR_ENABLE_HOT_RELOAD=true + +# KMS +export KMS_BACKEND="vault" +export VAULT_ADDR="https://vault.example.com" +export VAULT_TOKEN="..." 
+ +# MFA +export MFA_TOTP_ISSUER="Provisioning" +export MFA_WEBAUTHN_RP_ID="provisioning.example.com" +``` + +### Config Files + +```text +# provisioning/config/security.toml +[jwt] +issuer = "control-center" +audience = ["orchestrator", "cli"] +access_token_ttl = "15m" +refresh_token_ttl = "7d" + +[cedar] +policies_path = "config/cedar-policies" +hot_reload = true +reload_interval = "60s" + +[mfa] +totp_issuer = "Provisioning" +webauthn_rp_id = "provisioning.example.com" +rate_limit = 5 +rate_limit_window = "5m" + +[kms] +backend = "vault" +vault_address = "https://vault.example.com" +vault_mount_point = "transit" + +[audit] +retention_days = 365 +retention_break_glass_days = 2555 # 7 years +export_format = "json" +pii_anonymization = true +``` + +--- + +## Testing + +### Run All Tests + +```text +# Control Center (JWT, MFA) +cd provisioning/platform/control-center +cargo test + +# Orchestrator (Cedar, Audit, Secrets, SSH, Break-Glass, Compliance) +cd provisioning/platform/orchestrator +cargo test + +# KMS Service +cd provisioning/platform/kms-service +cargo test + +# Config Encryption (Nushell) +nu provisioning/core/nulib/lib_provisioning/config/encryption_tests.nu +``` + +### Integration Tests + +```text +# Full security flow +cd provisioning/platform/orchestrator +cargo test --test security_integration_tests +cargo test --test break_glass_integration_tests +``` + +--- + +## Monitoring & Alerts + +### Metrics to Monitor + +- Authentication failures (rate, sources) +- Authorization denials (policies, resources) +- MFA failures (attempts, users) +- Token revocations (rate, reasons) +- Break-glass activations (frequency, duration) +- Secrets generation (rate, types) +- Audit log volume (events/sec) + +### Alerts to Configure + +- Multiple failed auth attempts (5+ in 5 min) +- Break-glass session created +- Compliance report non-compliant +- Incident severity critical/high +- Token revocation spike +- KMS errors +- Audit log export failures + +--- + +## Maintenance + +### Daily + +- Monitor audit logs for anomalies +- Review failed authentication attempts +- Check break-glass sessions (should be zero) + +### Weekly + +- Review compliance reports +- Check incident response status +- Verify backup code usage +- Review MFA device additions/removals + +### Monthly + +- Rotate KMS keys +- Review and update Cedar policies +- Generate compliance reports (GDPR, SOC2, ISO) +- Audit access control matrix + +### Quarterly + +- Full security audit +- Penetration testing +- Compliance certification review +- Update security documentation + +--- + +## Migration Path + +### From Existing System + +1. **Phase 1**: Deploy security infrastructure + - KMS service + - Orchestrator with auth middleware + - Control Center + +2. **Phase 2**: Migrate authentication + - Enable JWT authentication + - Migrate existing users + - Disable old auth system + +3. **Phase 3**: Enable MFA + - Require MFA enrollment for admins + - Gradual rollout to all users + +4. **Phase 4**: Enable Cedar authorization + - Deploy initial policies (permissive) + - Monitor authorization decisions + - Tighten policies incrementally + +5. 
**Phase 5**: Enable advanced features + - Break-glass procedures + - Compliance reporting + - Incident response + +--- + +## Future Enhancements + +### Planned (Not Implemented) + +- **Hardware Security Module (HSM)** integration +- **OAuth2/OIDC** federation +- **SAML SSO** for enterprise +- **Risk-based authentication** (IP reputation, device fingerprinting) +- **Behavioral analytics** (anomaly detection) +- **Zero-Trust Network** (service mesh integration) + +### Under Consideration + +- **Blockchain audit log** (immutable append-only log) +- **Quantum-resistant cryptography** (post-quantum algorithms) +- **Confidential computing** (SGX/SEV enclaves) +- **Distributed break-glass** (multi-region approval) + +--- + +## Consequences + +### Positive + +✅ **Enterprise-grade security** meeting GDPR, SOC2, ISO 27001 +✅ **Zero static credentials** (all dynamic, time-limited) +✅ **Complete audit trail** (immutable, GDPR-compliant) +✅ **MFA-enforced** for sensitive operations +✅ **Emergency access** with enhanced controls +✅ **Fine-grained authorization** (Cedar policies) +✅ **Automated compliance** (reports, incident response) + +### Negative + +⚠️ **Increased complexity** (12 components to manage) +⚠️ **Performance overhead** (~10-20 ms per request) +⚠️ **Memory footprint** (~260 MB additional) +⚠️ **Learning curve** (Cedar policy language, MFA setup) +⚠️ **Operational overhead** (key rotation, policy updates) + +### Mitigations + +- Comprehensive documentation (ADRs, guides, API docs) +- CLI commands for all operations +- Automated monitoring and alerting +- Gradual rollout with feature flags +- Training materials for operators + +--- + +## Related Documentation + +- **JWT Auth**: `docs/architecture/JWT_AUTH_IMPLEMENTATION.md` +- **Cedar Authz**: `docs/architecture/CEDAR_AUTHORIZATION_IMPLEMENTATION.md` +- **Audit Logging**: `docs/architecture/AUDIT_LOGGING_IMPLEMENTATION.md` +- **MFA**: `docs/architecture/MFA_IMPLEMENTATION_SUMMARY.md` +- **Break-Glass**: `docs/architecture/BREAK_GLASS_IMPLEMENTATION_SUMMARY.md` +- **Compliance**: `docs/architecture/COMPLIANCE_IMPLEMENTATION_SUMMARY.md` +- **Config Encryption**: `docs/user/CONFIG_ENCRYPTION_GUIDE.md` +- **Dynamic Secrets**: `docs/user/DYNAMIC_SECRETS_QUICK_REFERENCE.md` +- **SSH Keys**: `docs/user/SSH_TEMPORAL_KEYS_USER_GUIDE.md` + +--- + +## Approval + +**Architecture Team**: Approved +**Security Team**: Approved (pending penetration test) +**Compliance Team**: Approved (pending audit) +**Engineering Team**: Approved + +--- + +**Date**: 2025-10-08 +**Version**: 1.0.0 +**Status**: Implemented and Production-Ready \ No newline at end of file diff --git a/docs/src/architecture/adr/README.md b/docs/src/architecture/adr/README.md index 74f2a15..feebd97 100644 --- a/docs/src/architecture/adr/README.md +++ b/docs/src/architecture/adr/README.md @@ -1 +1,60 @@ -# Architecture Decision Records (ADRs)\n\nThis directory contains all Architecture Decision Records for the provisioning platform. 
ADRs document significant architectural decisions and their rationale.\n\n## Index of Decisions\n\n### Core Architecture (ADR-001 to ADR-006)\n\n- **ADR-001**: [Project Structure](adr-001-project-structure.md) - Overall project organization and directory layout\n- **ADR-002**: [Distribution Strategy](adr-002-distribution-strategy.md) - How the platform is packaged and distributed\n- **ADR-003**: [Workspace Isolation](adr-003-workspace-isolation.md) - Workspace management and isolation boundaries\n- **ADR-004**: [Hybrid Architecture](adr-004-hybrid-architecture.md) - Rust/Nushell hybrid system design\n- **ADR-005**: [Extension Framework](adr-005-extension-framework.md) - Plugin/extension system architecture\n- **ADR-006**: [Provisioning CLI Refactoring](adr-006-provisioning-cli-refactoring.md) - CLI modularization and command handling\n\n### Infrastructure & Configuration (ADR-007 to ADR-011)\n\n- **ADR-007**: [KMS Simplification](adr-007-kms-simplification.md) - Key Management System design\n- **ADR-008**: [Cedar Authorization](adr-008-cedar-authorization.md) - Fine-grained authorization via Cedar policies\n- **ADR-009**: [Security System Complete](adr-009-security-system-complete.md) - Comprehensive security implementation\n- **ADR-010**: [Configuration Format Strategy](adr-010-configuration-format-strategy.md) - When to use Nickel, TOML, YAML, or KCL\n- **ADR-011**: [Nickel Migration](adr-011-nickel-migration.md) - Migration from KCL to Nickel as primary IaC language\n\n### Platform Services (ADR-012 to ADR-014)\n\n- **ADR-012**: [Nushell Nickel Plugin CLI Wrapper](adr-012-nushell-nickel-plugin-cli-wrapper.md) - Plugin architecture for Nickel integration\n- **ADR-013**: [Typdialog Web UI Backend Integration](adr-013-typdialog-integration.md) - Browser-based configuration forms with multi-user collaboration\n- **ADR-014**: [SecretumVault Integration](adr-014-secretumvault-integration.md) - Centralized secrets management with dynamic credentials\n\n### AI and Intelligence (ADR-015)\n\n- **ADR-015**: [AI Integration Architecture](adr-015-ai-integration-architecture.md) - Comprehensive AI system for intelligent infrastructure provisioning\n\n## How to Use ADRs\n\n1. **For decisions affecting architecture**: Create a new ADR with the next sequential number\n2. **For reading decisions**: Browse this list or check SUMMARY.md\n3. **For understanding context**: Each ADR includes context, rationale, and consequences\n\n## ADR Format\n\nEach ADR follows this standard structure:\n\n- **Context**: What problem we're solving\n- **Decision**: What we decided\n- **Rationale**: Why we chose this approach\n- **Consequences**: Positive and negative impacts\n- **Alternatives Considered**: Other options we evaluated\n\n## Status Markers\n\n- **Proposed**: Under review, not yet final\n- **Accepted**: Approved and adopted\n- **Superseded**: Replaced by a later ADR\n- **Deprecated**: No longer recommended\n\n---\n\n**Last Updated**: 2025-01-08\n**Total ADRs**: 15 +# Architecture Decision Records (ADRs) + +This directory contains all Architecture Decision Records for the provisioning platform. ADRs document significant architectural decisions and their rationale. 
+
+## Index of Decisions
+
+### Core Architecture (ADR-001 to ADR-006)
+
+- **ADR-001**: [Project Structure](ADR-001-project-structure.md) - Overall project organization and directory layout
+- **ADR-002**: [Distribution Strategy](ADR-002-distribution-strategy.md) - How the platform is packaged and distributed
+- **ADR-003**: [Workspace Isolation](ADR-003-workspace-isolation.md) - Workspace management and isolation boundaries
+- **ADR-004**: [Hybrid Architecture](ADR-004-hybrid-architecture.md) - Rust/Nushell hybrid system design
+- **ADR-005**: [Extension Framework](ADR-005-extension-framework.md) - Plugin/extension system architecture
+- **ADR-006**: [Provisioning CLI Refactoring](ADR-006-provisioning-cli-refactoring.md) - CLI modularization and command handling
+
+### Infrastructure & Configuration (ADR-007 to ADR-011)
+
+- **ADR-007**: [KMS Simplification](ADR-007-kms-simplification.md) - Key Management System design
+- **ADR-008**: [Cedar Authorization](ADR-008-cedar-authorization.md) - Fine-grained authorization via Cedar policies
+- **ADR-009**: [Security System Complete](ADR-009-security-system-complete.md) - Comprehensive security implementation
+- **ADR-010**: [Configuration Format Strategy](adr-010-configuration-format-strategy.md) - When to use Nickel, TOML, YAML, or KCL
+- **ADR-011**: [Nickel Migration](adr-011-nickel-migration.md) - Migration from KCL to Nickel as primary IaC language
+
+### Platform Services (ADR-012 to ADR-014)
+
+- **ADR-012**: [Nushell Nickel Plugin CLI Wrapper](adr-012-nushell-nickel-plugin-cli-wrapper.md) - Plugin architecture for Nickel integration
+- **ADR-013**: [Typdialog Web UI Backend Integration](adr-013-typdialog-integration.md) - Browser-based configuration forms with multi-user collaboration
+- **ADR-014**: [SecretumVault Integration](adr-014-secretumvault-integration.md) - Centralized secrets management with dynamic credentials
+
+### AI and Intelligence (ADR-015)
+
+- **ADR-015**: [AI Integration Architecture](adr-015-ai-integration-architecture.md) - Comprehensive AI system for intelligent infrastructure provisioning
+
+## How to Use ADRs
+
+1. **For decisions affecting architecture**: Create a new ADR with the next sequential number
+2. **For reading decisions**: Browse this list or check SUMMARY.md
+3. 
**For understanding context**: Each ADR includes context, rationale, and consequences + +## ADR Format + +Each ADR follows this standard structure: + +- **Context**: What problem we're solving +- **Decision**: What we decided +- **Rationale**: Why we chose this approach +- **Consequences**: Positive and negative impacts +- **Alternatives Considered**: Other options we evaluated + +## Status Markers + +- **Proposed**: Under review, not yet final +- **Accepted**: Approved and adopted +- **Superseded**: Replaced by a later ADR +- **Deprecated**: No longer recommended + +--- + +**Last Updated**: 2025-01-08 +**Total ADRs**: 15 diff --git a/docs/src/architecture/adr/adr-010-configuration-format-strategy.md b/docs/src/architecture/adr/adr-010-configuration-format-strategy.md index cca399c..6321328 100644 --- a/docs/src/architecture/adr/adr-010-configuration-format-strategy.md +++ b/docs/src/architecture/adr/adr-010-configuration-format-strategy.md @@ -1 +1,413 @@ -# ADR-010: Configuration File Format Strategy\n\n**Status**: Accepted\n**Date**: 2025-12-03\n**Decision Makers**: Architecture Team\n**Implementation**: Multi-phase migration (KCL workspace configs + template reorganization)\n\n---\n\n## Context\n\nThe provisioning project historically used a single configuration format (YAML/TOML environment variables) for all purposes. As the system evolved,\ndifferent parts naturally adopted different formats:\n\n- **TOML** for modular provider and platform configurations (`providers/*.toml`, `platform/*.toml`)\n- **KCL** for infrastructure-as-code definitions with type safety\n- **YAML** for workspace metadata\n\nHowever, the workspace configuration remained in **YAML** (`provisioning.yaml`),\ncreating inconsistency and leaving type-unsafe configuration handling. Meanwhile,\ncomplete KCL schemas for workspace configuration were designed but unused.\n\n**Problem**: Three different formats in the same system without documented rationale or consistent patterns.\n\n---\n\n## Decision\n\nAdopt a **three-format strategy** with clear separation of concerns:\n\n| Format | Purpose | Use Cases |\n| -------- | --------- | ----------- |\n| **KCL** | Infrastructure as Code & Schemas | Workspace config, infrastructure definitions, type-safe validation |\n| **TOML** | Application Configuration & Settings | System defaults, provider settings, user preferences, interpolation |\n| **YAML** | Metadata & Kubernetes Resources | K8s manifests, tool metadata, version tracking, CI/CD resources |\n\n---\n\n## Implementation Strategy\n\n### Phase 1: Documentation (Complete)\n\nDefine and document the three-format approach through:\n\n1. **ADR-010** (this document) - Rationale and strategy\n2. **CLAUDE.md updates** - Quick reference for developers\n3. **Configuration hierarchy** - Explicit precedence rules\n\n### Phase 2: Workspace Config Migration (In Progress)\n\n**Migrate workspace configuration from YAML to KCL**:\n\n1. Create comprehensive workspace configuration schema in KCL\n2. Implement backward-compatible config loader (KCL first, fallback to YAML)\n3. Provide migration script to convert existing workspaces\n4. 
Update workspace initialization to generate KCL configs\n\n**Expected Outcome**:\n\n- `workspace/config/provisioning.ncl` (KCL, type-safe, validated)\n- Full schema validation with semantic versioning checks\n- Automatic validation at config load time\n\n### Phase 3: Template File Reorganization (In Progress)\n\n**Move template files to proper directory structure and correct extensions**:\n\n```\nPrevious (KCL):\n provisioning/kcl/templates/*.k (had Nushell/Jinja2 code, not KCL)\n\nCurrent (Nickel):\n provisioning/templates/\n ├── nushell/*.nu.j2\n ├── config/*.toml.j2\n ├── nickel/*.ncl.j2\n └── README.md\n```\n\n**Expected Outcome**:\n\n- Templates properly classified and discoverable\n- KCL validation passes (15/16 errors eliminated)\n- Template system clean and maintainable\n\n---\n\n## Rationale for Each Format\n\n### KCL for Workspace Configuration\n\n**Why KCL over YAML or TOML?**\n\n1. **Type Safety**: Catch configuration errors at schema validation time, not runtime\n\n ```kcl\n schema WorkspaceDeclaration:\n metadata: Metadata\n check:\n regex.match(metadata.version, r"^\d+\.\d+\.\d+$"), \\n "Version must be semantic versioning"\n ```\n\n1. **Schema-First Development**: Schemas are first-class citizens\n - Document expected structure upfront\n - IDE support for auto-completion\n - Enforce required fields and value ranges\n\n2. **Immutable by Default**: Infrastructure configurations are immutable\n - Prevents accidental mutations\n - Better for reproducible deployments\n - Aligns with PAP principle: "configuration-driven, not hardcoded"\n\n3. **Complex Validation**: KCL supports sophisticated validation rules\n - Semantic versioning validation\n - Dependency checking\n - Cross-field validation\n - Range constraints on numeric values\n\n4. **Ecosystem Consistency**: KCL is already used for infrastructure definitions\n - Server configurations use KCL\n - Cluster definitions use KCL\n - Taskserv definitions use KCL\n - Using KCL for workspace config maintains consistency\n\n5. **Existing Schemas**: `provisioning/kcl/generator/declaration.ncl` already defines complete workspace schemas\n - No design work needed\n - Production-ready schemas\n - Well-tested patterns\n\n### TOML for Application Configuration\n\n**Why TOML for settings?**\n\n1. **Hierarchical Structure**: Native support for nested configurations\n\n ```toml\n [http]\n use_curl = false\n timeout = 30\n\n [debug]\n enabled = false\n log_level = "info"\n ```\n\n2. **Interpolation Support**: Dynamic variable substitution\n\n ```toml\n base_path = "/Users/home/provisioning"\n cache_path = "{{base_path}}/.cache"\n ```\n\n3. **Industry Standard**: Widely used for application configuration (Rust, Python, Go)\n\n4. **Human Readable**: Clear, explicit, easy to edit\n\n5. **Validation Support**: Schema files (`.schema.toml`) for validation\n\n**Use Cases**:\n\n- System defaults: `provisioning/config/config.defaults.toml`\n- Provider settings: `workspace/config/providers/*.toml`\n- Platform services: `workspace/config/platform/*.toml`\n- User preferences: User config files\n\n### YAML for Metadata and Kubernetes Resources\n\n**Why YAML for metadata?**\n\n1. **Kubernetes Compatibility**: YAML is K8s standard\n - K8s manifests use YAML\n - Consistent with ecosystem\n - Familiar to DevOps engineers\n\n2. **Lightweight**: Good for simple data structures\n\n ```yaml\n workspace:\n name: "librecloud"\n version: "1.0.0"\n created: "2025-10-06T12:29:43Z"\n ```\n\n3. 
**Version Control**: Human-readable format\n - Diffs are clear and meaningful\n - Git-friendly\n - Comments supported\n\n**Use Cases**:\n\n- K8s resource definitions\n- Tool metadata (versions, sources, tags)\n- CI/CD configuration files\n- User workspace metadata (during transition)\n\n---\n\n## Configuration Hierarchy (Priority)\n\n**When loading configuration, use this precedence (highest to lowest)**:\n\n1. **Runtime Arguments** (highest priority)\n - CLI flags passed to commands\n - Explicit user input\n\n2. **Environment Variables** (PROVISIONING_*)\n - Override system settings\n - Deployment-specific overrides\n - Secrets via env vars\n\n3. **User Configuration** (Centralized)\n - User preferences: `~/.config/provisioning/user_config.yaml`\n - User workspace overrides: `workspace/config/local-overrides.toml`\n\n4. **Infrastructure Configuration**\n - Workspace KCL config: `workspace/config/provisioning.ncl`\n - Platform services: `workspace/config/platform/*.toml`\n - Provider configs: `workspace/config/providers/*.toml`\n\n5. **System Defaults** (lowest priority)\n - System config: `provisioning/config/config.defaults.toml`\n - Schema defaults: defined in KCL schemas\n\n---\n\n## Migration Path\n\n### For Existing Workspaces\n\n1. **Migration Path**: Config loader checks for `.ncl` first, then falls back to `.yaml` for legacy systems\n\n ```nushell\n # Try Nickel first (current)\n if ($config_nickel | path exists) {\n let config = (load_nickel_workspace_config $config_nickel)\n } else if ($config_yaml | path exists) {\n # Legacy YAML support (from pre-migration)\n let config = (open $config_yaml)\n }\n ```\n\n2. **Automatic Migration**: Migration script converts YAML/KCL → Nickel\n\n ```bash\n provisioning workspace migrate-config --all\n ```\n\n3. **Validation**: New KCL configs validated against schemas\n\n### For New Workspaces\n\n1. **Generate KCL**: Workspace initialization creates `.k` files\n\n ```bash\n provisioning workspace create my-workspace\n # Creates: workspace/my-workspace/config/provisioning.ncl\n ```\n\n2. **Use Existing Schemas**: Leverage `provisioning/kcl/generator/declaration.ncl`\n\n3. 
**Schema Validation**: Automatic validation during config load\n\n---\n\n## File Format Guidelines for Developers\n\n### When to Use Each Format\n\n**Use KCL for**:\n\n- Infrastructure definitions (servers, clusters, taskservs)\n- Configuration with type requirements\n- Schema definitions\n- Any config that needs validation rules\n- Workspace configuration\n\n**Use TOML for**:\n\n- Application settings (HTTP client, logging, timeouts)\n- Provider-specific settings\n- Platform service configuration\n- User preferences and overrides\n- System defaults with interpolation\n\n**Use YAML for**:\n\n- Kubernetes manifests\n- CI/CD configuration (GitHub Actions, GitLab CI)\n- Tool metadata\n- Human-readable documentation files\n- Version control metadata\n\n---\n\n## Consequences\n\n### Benefits\n\n✅ **Type Safety**: KCL schema validation catches config errors early\n✅ **Consistency**: Infrastructure definitions and configs use same language\n✅ **Maintainability**: Clear separation of concerns (IaC vs settings vs metadata)\n✅ **Validation**: Semantic versioning, required fields, range checks\n✅ **Tooling**: IDE support for KCL auto-completion\n✅ **Documentation**: Self-documenting schemas with descriptions\n✅ **Ecosystem Alignment**: TOML for settings (Rust standard), YAML for K8s\n\n### Trade-offs\n\n⚠️ **Learning Curve**: Developers must understand three formats\n⚠️ **Migration Effort**: Existing YAML configs need conversion\n⚠️ **Tooling Requirements**: KCL compiler needed (already a dependency)\n\n### Risk Mitigation\n\n1. **Documentation**: Clear guidelines in CLAUDE.md\n2. **Backward Compatibility**: YAML support maintained during transition\n3. **Automation**: Migration scripts for existing workspaces\n4. **Gradual Migration**: No hard cutoff, both formats supported for extended period\n\n---\n\n## Template File Reorganization\n\n### Problem\n\nCurrently, 15/16 files in `provisioning/kcl/templates/` have `.k` extension but contain Nushell/Jinja2 code, not KCL:\n\n```\nprovisioning/kcl/templates/\n├── server.ncl # Actually Nushell/Jinja2 template\n├── taskserv.ncl # Actually Nushell/Jinja2 template\n└── ... # 15 more template files\n```\n\nThis causes:\n\n- KCL validation failures (96.6% of errors)\n- Misclassification (templates in KCL directory)\n- Confusing directory structure\n\n### Solution\n\nReorganize into type-specific directories:\n\n```\nprovisioning/templates/\n├── nushell/ # Nushell code generation (*.nu.j2)\n│ ├── server.nu.j2\n│ ├── taskserv.nu.j2\n│ └── ...\n├── config/ # Config file generation (*.toml.j2, *.yaml.j2)\n│ ├── provider.toml.j2\n│ └── ...\n├── kcl/ # KCL file generation (*.k.j2)\n│ ├── workspace.ncl.j2\n│ └── ...\n└── README.md\n```\n\n### Outcome\n\n✅ Correct file classification\n✅ KCL validation passes completely\n✅ Clear template organization\n✅ Easier to discover and maintain templates\n\n---\n\n## References\n\n### Existing KCL Schemas\n\n1. **Workspace Declaration**: `provisioning/kcl/generator/declaration.ncl`\n - `WorkspaceDeclaration` - Complete workspace specification\n - `Metadata` - Name, version, author, timestamps\n - `DeploymentConfig` - Deployment modes, servers, HA settings\n - Includes validation rules and semantic versioning\n\n2. **Workspace Layer**: `provisioning/workspace/layers/workspace.layer.ncl`\n - `WorkspaceLayer` - Template paths, priorities, metadata\n\n3. 
**Core Settings**: `provisioning/kcl/settings.ncl`\n - `Settings` - Main provisioning settings\n - `SecretProvider` - SOPS/KMS configuration\n - `AIProvider` - AI provider configuration\n\n### Related ADRs\n\n- **ADR-001**: Project Structure\n- **ADR-005**: Extension Framework\n- **ADR-006**: Provisioning CLI Refactoring\n- **ADR-009**: Security System Complete\n\n---\n\n## Decision Status\n\n**Status**: Accepted\n\n**Next Steps**:\n\n1. ✅ Document strategy (this ADR)\n2. ⏳ Create workspace configuration KCL schema\n3. ⏳ Implement backward-compatible config loader\n4. ⏳ Create migration script for YAML → KCL\n5. ⏳ Move template files to proper directories\n6. ⏳ Update documentation with examples\n7. ⏳ Migrate workspace_librecloud to KCL\n\n---\n\n**Last Updated**: 2025-12-03 +# ADR-010: Configuration File Format Strategy + +**Status**: Accepted +**Date**: 2025-12-03 +**Decision Makers**: Architecture Team +**Implementation**: Multi-phase migration (KCL workspace configs + template reorganization) + +--- + +## Context + +The provisioning project historically used a single configuration format (YAML/TOML environment variables) for all purposes. As the system evolved, +different parts naturally adopted different formats: + +- **TOML** for modular provider and platform configurations (`providers/*.toml`, `platform/*.toml`) +- **KCL** for infrastructure-as-code definitions with type safety +- **YAML** for workspace metadata + +However, the workspace configuration remained in **YAML** (`provisioning.yaml`), +creating inconsistency and leaving type-unsafe configuration handling. Meanwhile, +complete KCL schemas for workspace configuration were designed but unused. + +**Problem**: Three different formats in the same system without documented rationale or consistent patterns. + +--- + +## Decision + +Adopt a **three-format strategy** with clear separation of concerns: + +| Format | Purpose | Use Cases | +| -------- | --------- | ----------- | +| **KCL** | Infrastructure as Code & Schemas | Workspace config, infrastructure definitions, type-safe validation | +| **TOML** | Application Configuration & Settings | System defaults, provider settings, user preferences, interpolation | +| **YAML** | Metadata & Kubernetes Resources | K8s manifests, tool metadata, version tracking, CI/CD resources | + +--- + +## Implementation Strategy + +### Phase 1: Documentation (Complete) + +Define and document the three-format approach through: + +1. **ADR-010** (this document) - Rationale and strategy +2. **CLAUDE.md updates** - Quick reference for developers +3. **Configuration hierarchy** - Explicit precedence rules + +### Phase 2: Workspace Config Migration (In Progress) + +**Migrate workspace configuration from YAML to KCL**: + +1. Create comprehensive workspace configuration schema in KCL +2. Implement backward-compatible config loader (KCL first, fallback to YAML) +3. Provide migration script to convert existing workspaces +4. 
Update workspace initialization to generate KCL configs + +**Expected Outcome**: + +- `workspace/config/provisioning.ncl` (KCL, type-safe, validated) +- Full schema validation with semantic versioning checks +- Automatic validation at config load time + +### Phase 3: Template File Reorganization (In Progress) + +**Move template files to proper directory structure and correct extensions**: + +```text +Previous (KCL): + provisioning/kcl/templates/*.k (had Nushell/Jinja2 code, not KCL) + +Current (Nickel): + provisioning/templates/ + ├── nushell/*.nu.j2 + ├── config/*.toml.j2 + ├── nickel/*.ncl.j2 + └── README.md +``` + +**Expected Outcome**: + +- Templates properly classified and discoverable +- KCL validation passes (15/16 errors eliminated) +- Template system clean and maintainable + +--- + +## Rationale for Each Format + +### KCL for Workspace Configuration + +**Why KCL over YAML or TOML?** + +1. **Type Safety**: Catch configuration errors at schema validation time, not runtime + + ```kcl + schema WorkspaceDeclaration: + metadata: Metadata + check: + regex.match(metadata.version, r"^\d+\.\d+\.\d+$"), + "Version must be semantic versioning" + ``` + +1. **Schema-First Development**: Schemas are first-class citizens + - Document expected structure upfront + - IDE support for auto-completion + - Enforce required fields and value ranges + +2. **Immutable by Default**: Infrastructure configurations are immutable + - Prevents accidental mutations + - Better for reproducible deployments + - Aligns with PAP principle: "configuration-driven, not hardcoded" + +3. **Complex Validation**: KCL supports sophisticated validation rules + - Semantic versioning validation + - Dependency checking + - Cross-field validation + - Range constraints on numeric values + +4. **Ecosystem Consistency**: KCL is already used for infrastructure definitions + - Server configurations use KCL + - Cluster definitions use KCL + - Taskserv definitions use KCL + - Using KCL for workspace config maintains consistency + +5. **Existing Schemas**: `provisioning/kcl/generator/declaration.ncl` already defines complete workspace schemas + - No design work needed + - Production-ready schemas + - Well-tested patterns + +### TOML for Application Configuration + +**Why TOML for settings?** + +1. **Hierarchical Structure**: Native support for nested configurations + + ```toml + [http] + use_curl = false + timeout = 30 + + [debug] + enabled = false + log_level = "info" + ``` + +2. **Interpolation Support**: Dynamic variable substitution + + ```toml + base_path = "/Users/home/provisioning" + cache_path = "{{base_path}}/.cache" + ``` + +3. **Industry Standard**: Widely used for application configuration (Rust, Python, Go) + +4. **Human Readable**: Clear, explicit, easy to edit + +5. **Validation Support**: Schema files (`.schema.toml`) for validation + +**Use Cases**: + +- System defaults: `provisioning/config/config.defaults.toml` +- Provider settings: `workspace/config/providers/*.toml` +- Platform services: `workspace/config/platform/*.toml` +- User preferences: User config files + +### YAML for Metadata and Kubernetes Resources + +**Why YAML for metadata?** + +1. **Kubernetes Compatibility**: YAML is K8s standard + - K8s manifests use YAML + - Consistent with ecosystem + - Familiar to DevOps engineers + +2. **Lightweight**: Good for simple data structures + + ```yaml + workspace: + name: "librecloud" + version: "1.0.0" + created: "2025-10-06T12:29:43Z" + ``` + +3. 
**Version Control**: Human-readable format
+   - Diffs are clear and meaningful
+   - Git-friendly
+   - Comments supported
+
+**Use Cases**:
+
+- K8s resource definitions
+- Tool metadata (versions, sources, tags)
+- CI/CD configuration files
+- User workspace metadata (during transition)
+
+---
+
+## Configuration Hierarchy (Priority)
+
+**When loading configuration, use this precedence (highest to lowest)**:
+
+1. **Runtime Arguments** (highest priority)
+   - CLI flags passed to commands
+   - Explicit user input
+
+2. **Environment Variables** (PROVISIONING_*)
+   - Override system settings
+   - Deployment-specific overrides
+   - Secrets via env vars
+
+3. **User Configuration** (Centralized)
+   - User preferences: `~/.config/provisioning/user_config.yaml`
+   - User workspace overrides: `workspace/config/local-overrides.toml`
+
+4. **Infrastructure Configuration**
+   - Workspace Nickel config: `workspace/config/provisioning.ncl`
+   - Platform services: `workspace/config/platform/*.toml`
+   - Provider configs: `workspace/config/providers/*.toml`
+
+5. **System Defaults** (lowest priority)
+   - System config: `provisioning/config/config.defaults.toml`
+   - Schema defaults: defined in KCL schemas
+
+---
+
+## Migration Path
+
+### For Existing Workspaces
+
+1. **Migration Path**: Config loader checks for `.ncl` first, then falls back to `.yaml` for legacy systems
+
+   ```nushell
+   # Try Nickel first (current)
+   if ($config_nickel | path exists) {
+       let config = (load_nickel_workspace_config $config_nickel)
+   } else if ($config_yaml | path exists) {
+       # Legacy YAML support (from pre-migration)
+       let config = (open $config_yaml)
+   }
+   ```
+
+2. **Automatic Migration**: Migration script converts YAML/KCL → Nickel
+
+   ```bash
+   provisioning workspace migrate-config --all
+   ```
+
+3. **Validation**: New Nickel configs validated against schemas
+
+### For New Workspaces
+
+1. **Generate Nickel config**: Workspace initialization creates `.ncl` files
+
+   ```bash
+   provisioning workspace create my-workspace
+   # Creates: workspace/my-workspace/config/provisioning.ncl
+   ```
+
+2. **Use Existing Schemas**: Leverage `provisioning/kcl/generator/declaration.ncl`
+
+
+---
+
+## File Format Guidelines for Developers
+
+### When to Use Each Format
+
+**Use KCL for**:
+
+- Infrastructure definitions (servers, clusters, taskservs)
+- Configuration with type requirements
+- Schema definitions
+- Any config that needs validation rules
+- Workspace configuration
+
+**Use TOML for**:
+
+- Application settings (HTTP client, logging, timeouts)
+- Provider-specific settings
+- Platform service configuration
+- User preferences and overrides
+- System defaults with interpolation
+
+**Use YAML for**:
+
+- Kubernetes manifests
+- CI/CD configuration (GitHub Actions, GitLab CI)
+- Tool metadata
+- Human-readable documentation files
+- Version control metadata
+
+---
+
+## Consequences
+
+### Benefits
+
+✅ **Type Safety**: KCL schema validation catches config errors early
+✅ **Consistency**: Infrastructure definitions and configs use same language
+✅ **Maintainability**: Clear separation of concerns (IaC vs settings vs metadata)
+✅ **Validation**: Semantic versioning, required fields, range checks
+✅ **Tooling**: IDE support for KCL auto-completion
+✅ **Documentation**: Self-documenting schemas with descriptions
+✅ **Ecosystem Alignment**: TOML for settings (Rust standard), YAML for K8s
+
+### Trade-offs
+
+⚠️ **Learning Curve**: Developers must understand three formats
+⚠️ **Migration Effort**: Existing YAML configs need conversion
+⚠️ **Tooling Requirements**: KCL compiler needed (already a dependency)
+
+### Risk Mitigation
+
+1. **Documentation**: Clear guidelines in CLAUDE.md
+2. **Backward Compatibility**: YAML support maintained during transition
+3. **Automation**: Migration scripts for existing workspaces
+4. **Gradual Migration**: No hard cutoff, both formats supported for extended period
+
+---
+
+## Template File Reorganization
+
+### Problem
+
+Currently, 15/16 files in `provisioning/kcl/templates/` have a `.k` extension but contain Nushell/Jinja2 code, not KCL:
+
+```text
+provisioning/kcl/templates/
+├── server.k     # Actually a Nushell/Jinja2 template
+├── taskserv.k   # Actually a Nushell/Jinja2 template
+└── ...          # 14 more template files
+```
+
+This causes:
+
+- KCL validation failures (96.6% of errors)
+- Misclassification (templates in KCL directory)
+- Confusing directory structure
+
+### Solution
+
+Reorganize into type-specific directories:
+
+```text
+provisioning/templates/
+├── nushell/          # Nushell code generation (*.nu.j2)
+│   ├── server.nu.j2
+│   ├── taskserv.nu.j2
+│   └── ...
+├── config/           # Config file generation (*.toml.j2, *.yaml.j2)
+│   ├── provider.toml.j2
+│   └── ...
+├── nickel/           # Nickel file generation (*.ncl.j2)
+│   ├── workspace.ncl.j2
+│   └── ...
+└── README.md
+```
+
+### Outcome
+
+✅ Correct file classification
+✅ KCL validation passes completely
+✅ Clear template organization
+✅ Easier to discover and maintain templates
+
+---
+
+## References
+
+### Existing KCL Schemas
+
+1. **Workspace Declaration**: `provisioning/kcl/generator/declaration.ncl`
+   - `WorkspaceDeclaration` - Complete workspace specification
+   - `Metadata` - Name, version, author, timestamps
+   - `DeploymentConfig` - Deployment modes, servers, HA settings
+   - Includes validation rules and semantic versioning
+
+2. **Workspace Layer**: `provisioning/workspace/layers/workspace.layer.ncl`
+   - `WorkspaceLayer` - Template paths, priorities, metadata
+
+3. 
**Core Settings**: `provisioning/kcl/settings.ncl` + - `Settings` - Main provisioning settings + - `SecretProvider` - SOPS/KMS configuration + - `AIProvider` - AI provider configuration + +### Related ADRs + +- **ADR-001**: Project Structure +- **ADR-005**: Extension Framework +- **ADR-006**: Provisioning CLI Refactoring +- **ADR-009**: Security System Complete + +--- + +## Decision Status + +**Status**: Accepted + +**Next Steps**: + +1. ✅ Document strategy (this ADR) +2. ⏳ Create workspace configuration KCL schema +3. ⏳ Implement backward-compatible config loader +4. ⏳ Create migration script for YAML → KCL +5. ⏳ Move template files to proper directories +6. ⏳ Update documentation with examples +7. ⏳ Migrate workspace_librecloud to KCL + +--- + +**Last Updated**: 2025-12-03 \ No newline at end of file diff --git a/docs/src/architecture/adr/adr-011-nickel-migration.md b/docs/src/architecture/adr/adr-011-nickel-migration.md index f8d7272..3b8a7bd 100644 --- a/docs/src/architecture/adr/adr-011-nickel-migration.md +++ b/docs/src/architecture/adr/adr-011-nickel-migration.md @@ -1 +1,479 @@ -# ADR-011: Migration from KCL to Nickel\n\n**Status**: Implemented\n**Date**: 2025-12-15\n**Decision Makers**: Architecture Team\n**Implementation**: Complete for platform schemas (100%)\n\n---\n\n## Context\n\nThe provisioning platform historically used KCL (KLang) as the primary infrastructure-as-code language for all configuration schemas. As the system\nevolved through four migration phases (Foundation, Core, Complex, Highly Complex), KCL's limitations became increasingly apparent:\n\n### Problems with KCL\n\n1. **Complex Type System**: Heavyweight schema system with extensive boilerplate\n - `schema Foo(bar.Baz)` inheritance creates rigid hierarchies\n - Union types with `null` don't work well in type annotations\n - Schema modifications propagate breaking changes\n\n2. **Limited Flexibility**: Schema-first approach is too rigid for configuration evolution\n - Difficult to extend types without modifying base schemas\n - No easy way to add custom fields without validation conflicts\n - Hard to compose configurations dynamically\n\n3. **Import System Overhead**: Non-standard module imports\n - `import provisioning.lib as lib` pattern differs from ecosystem standards\n - Re-export patterns create complexity in extension systems\n\n4. **Performance Overhead**: Compile-time validation adds latency\n - Schema validation happens at compile time\n - Large configuration files slow down evaluation\n - No lazy evaluation built-in\n\n5. **Learning Curve**: KCL is Python-like but with unique patterns\n - Team must learn KCL-specific semantics\n - Limited ecosystem and tooling support\n - Difficult to hire developers familiar with KCL\n\n### Project Needs\n\nThe provisioning system required:\n\n- **Greater flexibility** in composing configurations\n- **Better performance** for large-scale deployments\n- **Extensibility** without modifying base schemas\n- **Simpler mental model** for team learning\n- **Clean exports** to JSON/TOML/YAML formats\n\n---\n\n## Decision\n\n**Adopt Nickel as the primary infrastructure-as-code language** for all schema definitions, configuration composition, and deployment declarations.\n\n### Key Changes\n\n1. **Three-File Pattern per Module**:\n - `{module}_contracts.ncl` - Type definitions using Nickel contracts\n - `{module}_defaults.ncl` - Default values for all fields\n - `{module}.ncl` - Instances combining both, with hybrid interface\n\n2. 
**Hybrid Interface** (4 levels of access):\n - **Level 1**: Direct access to defaults (inspection, reference)\n - **Level 2**: Maker functions (90% of use cases)\n - **Level 3**: Default instances (pre-built, exported)\n - **Level 4**: Contracts (optional imports, advanced combinations)\n\n3. **Domain-Organized Architecture** (8 top-level domains):\n - `lib` - Core library types\n - `config` - Settings, defaults, workspace configuration\n - `infrastructure` - Compute, storage, provisioning schemas\n - `operations` - Workflows, batch, dependencies, tasks\n - `deployment` - Kubernetes, execution modes\n - `services` - Gitea and other platform services\n - `generator` - Code generation and declarations\n - `integrations` - Runtime, GitOps, external integrations\n\n4. **Two Deployment Modes**:\n - **Development**: Fast iteration with relative imports (Single Source of Truth)\n - **Production**: Frozen snapshots with immutable, self-contained deployment packages\n\n---\n\n## Implementation Summary\n\n### Migration Complete\n\n| Metric | Value |\n| -------- | ------- |\n| KCL files migrated | 40 |\n| Nickel files created | 72 |\n| Modules converted | 24 core modules |\n| Schemas migrated | 150+ |\n| Maker functions | 80+ |\n| Default instances | 90+ |\n| JSON output validation | 4,680+ lines |\n\n### Platform Schemas (`provisioning/schemas/`)\n\n- **422 Nickel files** total\n- **8 domains** with hierarchical organization\n- **Entry point**: `main.ncl` with domain-organized architecture\n- **Clean imports**: `provisioning.lib`, `provisioning.config.settings`, etc.\n\n### Extensions (`provisioning/extensions/`)\n\n- **4 providers**: hetzner, local, aws, upcloud\n- **1 cluster type**: web\n- **Consistent structure**: Each extension has `nickel/` subdirectory with contracts, defaults, main, version\n\n**Example - UpCloud Provider**:\n\n```\n# upcloud/nickel/main.ncl (migrated from upcloud/kcl/)\nlet contracts = import "./contracts.ncl" in\nlet defaults = import "./defaults.ncl" in\n\n{\n defaults = defaults,\n make_storage | not_exported = fun overrides =>\n defaults.storage & overrides,\n DefaultStorage = defaults.storage,\n DefaultStorageBackup = defaults.storage_backup,\n DefaultProvisionEnv = defaults.provision_env,\n DefaultProvisionUpcloud = defaults.provision_upcloud,\n DefaultServerDefaults_upcloud = defaults.server_defaults_upcloud,\n DefaultServerUpcloud = defaults.server_upcloud,\n}\n```\n\n### Active Workspaces (`workspace_librecloud/nickel/`)\n\n- **47 Nickel files** in productive use\n- **2 infrastructures**:\n - `wuji` - Kubernetes cluster with 20 taskservs\n - `sgoyol` - Support servers group\n- **Two deployment modes** fully implemented and tested\n- **Daily production usage** validated ✅\n\n### Backward Compatibility\n\n- **955 KCL files** remain in workspaces/ (legacy user configs)\n- 100% backward compatible - old KCL code still works\n- Config loader supports both formats during transition\n- No breaking changes to APIs\n\n---\n\n## Comparison: KCL vs Nickel\n\n| Aspect | KCL | Nickel | Winner |\n| -------- | ----- | -------- | -------- |\n| **Mental Model** | Python-like with schemas | JSON with functions | Nickel |\n| **Performance** | Baseline | 60% faster evaluation | Nickel |\n| **Type System** | Rigid schemas | Gradual typing + contracts | Nickel |\n| **Composition** | Schema inheritance | Record merging (`&`) | Nickel |\n| **Extensibility** | Requires schema modifications | Merging with custom fields | Nickel |\n| **Validation** | Compile-time (overhead) | Runtime 
contracts (lazy) | Nickel |\n| **Boilerplate** | High | Low (3-file pattern) | Nickel |\n| **Exports** | JSON/YAML | JSON/TOML/YAML | Nickel |\n| **Learning Curve** | Medium-High | Low | Nickel |\n| **Lazy Evaluation** | No | Yes (built-in) | Nickel |\n\n---\n\n## Architecture Patterns\n\n### Three-File Pattern\n\n**File 1: Contracts** (`batch_contracts.ncl`):\n\n```\n{\n BatchScheduler = {\n strategy | String,\n resource_limits,\n scheduling_interval | Number,\n enable_preemption | Bool,\n },\n}\n```\n\n**File 2: Defaults** (`batch_defaults.ncl`):\n\n```\n{\n scheduler = {\n strategy = "dependency_first",\n resource_limits = {"max_cpu_cores" = 0},\n scheduling_interval = 10,\n enable_preemption = false,\n },\n}\n```\n\n**File 3: Main** (`batch.ncl`):\n\n```\nlet contracts = import "./batch_contracts.ncl" in\nlet defaults = import "./batch_defaults.ncl" in\n\n{\n defaults = defaults, # Level 1: Inspection\n make_scheduler | not_exported = fun o =>\n defaults.scheduler & o, # Level 2: Makers\n DefaultScheduler = defaults.scheduler, # Level 3: Instances\n}\n```\n\n### Hybrid Pattern Benefits\n\n- **90% of users**: Use makers for simple customization\n- **9% of users**: Reference defaults for inspection\n- **1% of users**: Access contracts for advanced combinations\n- **No validation conflicts**: Record merging works without contract constraints\n\n### Domain-Organized Architecture\n\n```\nprovisioning/schemas/\n├── lib/ # Storage, TaskServDef, ClusterDef\n├── config/ # Settings, defaults, workspace_config\n├── infrastructure/ # Compute, storage, provisioning\n├── operations/ # Workflows, batch, dependencies, tasks\n├── deployment/ # Kubernetes, modes (solo, multiuser, cicd, enterprise)\n├── services/ # Gitea, etc\n├── generator/ # Declarations, gap analysis, changes\n├── integrations/ # Runtime, GitOps, main\n└── main.ncl # Entry point with namespace organization\n```\n\n**Import pattern**:\n\n```\nlet provisioning = import "./main.ncl" in\nprovisioning.lib # For Storage, TaskServDef\nprovisioning.config.settings # For Settings, Defaults\nprovisioning.infrastructure.compute.server\nprovisioning.operations.workflows\n```\n\n---\n\n## Production Deployment Patterns\n\n### Two-Mode Strategy\n\n#### 1. Development Mode (Single Source of Truth)\n\n- Relative imports to central provisioning\n- Fast iteration with immediate schema updates\n- No snapshot overhead\n- Usage: Local development, testing, experimentation\n\n```\n# workspace_librecloud/nickel/main.ncl\nimport "../../provisioning/schemas/main.ncl"\nimport "../../provisioning/extensions/taskservs/kubernetes/nickel/main.ncl"\n```\n\n#### 2. 
Production Mode (Hermetic Deployment)\n\nCreate immutable snapshots for reproducible deployments:\n\n```\nprovisioning workspace freeze --version "2025-12-15-prod-v1" --env production\n```\n\n**Frozen structure** (`.frozen/{version}/`):\n\n```\n├── provisioning/schemas/ # Snapshot of central schemas\n├── extensions/ # Snapshot of all extensions\n└── workspace/ # Snapshot of workspace configs\n```\n\n**All imports rewritten to local paths**:\n\n- `import "../../provisioning/schemas/main.ncl"` → `import "./provisioning/schemas/main.ncl"`\n- Guarantees immutability and reproducibility\n- No external dependencies\n- Can be deployed to air-gapped environments\n\n**Deploy from frozen snapshot**:\n\n```\nprovisioning deploy --frozen "2025-12-15-prod-v1" --infra wuji\n```\n\n**Benefits**:\n\n- ✅ Development: Fast iteration with central updates\n- ✅ Production: Immutable, reproducible deployments\n- ✅ Audit trail: Each frozen version timestamped\n- ✅ Rollback: Easy rollback to previous versions\n- ✅ Air-gapped: Works in offline environments\n\n---\n\n## Ecosystem Integration\n\n### TypeDialog (Bidirectional Nickel Integration)\n\n**Location**: `/Users/Akasha/Development/typedialog`\n**Purpose**: Type-safe prompts, forms, and schemas with Nickel output\n\n**Key Feature**: Nickel schemas → Type-safe UIs → Nickel output\n\n```\n# Nickel schema → Interactive form\ntypedialog form --schema server.ncl --output json\n\n# Interactive form → Nickel output\ntypedialog form --input form.toml --output nickel\n```\n\n**Value**: Amplifies Nickel ecosystem beyond IaC:\n\n- Schemas auto-generate type-safe UIs\n- Forms output configurations back to Nickel\n- Multiple backends: CLI, TUI, Web\n- Multiple output formats: JSON, YAML, TOML, Nickel\n\n---\n\n## Technical Patterns\n\n### Expression-Based Structure\n\n| KCL | Nickel |\n| ----- | -------- |\n| Multiple top-level let bindings | Single root expression with `let...in` chaining |\n\n### Schema Inheritance → Record Merging\n\n| KCL | Nickel |\n| ----- | -------- |\n| `schema Server(defaults.ServerDefaults)` | `defaults.ServerDefaults & { overrides }` |\n\n### Optional Fields\n\n| KCL | Nickel |\n| ----- | -------- |\n| `field?: type` | `field = null` or `field = ""` |\n\n### Union Types\n\n| KCL | Nickel |\n| ----- | -------- |\n| `"ubuntu" | "debian" | "centos"` | `[\\| 'ubuntu, 'debian, 'centos \\|]` |\n\n### Boolean/Null Conversion\n\n| KCL | Nickel |\n| ----- | -------- |\n| `True` / `False` / `None` | `true` / `false` / `null` |\n\n---\n\n## Quality Metrics\n\n- **Syntax Validation**: 100% (all files compile)\n- **JSON Export**: 100% success rate (4,680+ lines)\n- **Pattern Coverage**: All 5 templates tested and proven\n- **Backward Compatibility**: 100%\n- **Performance**: 60% faster evaluation than KCL\n- **Test Coverage**: 422 Nickel files validated in production\n\n---\n\n## Consequences\n\n### Positive ✅\n\n- **60% performance gain** in evaluation speed\n- **Reduced boilerplate** (contracts + defaults separation)\n- **Greater flexibility** (record merging without validation)\n- **Extensibility without conflicts** (custom fields allowed)\n- **Simplified mental model** ("JSON with functions")\n- **Lazy evaluation** (better performance for large configs)\n- **Clean exports** (100% JSON/TOML compatible)\n- **Hybrid pattern** (4 levels covering all use cases)\n- **Domain-organized architecture** (8 logical domains, clear imports)\n- **Production deployment** with frozen snapshots (immutable, reproducible)\n- **Ecosystem expansion** (TypeDialog integration 
for UI generation)\n- **Real-world validation** (47 files in productive use)\n- **20 taskservs** deployed in production infrastructure\n\n### Challenges ⚠️\n\n- **Dual format support** during transition (KCL + Nickel)\n- **Learning curve** for team (new language)\n- **Migration effort** (40 files migrated manually)\n- **Documentation updates** (guides, examples, training)\n- **955 KCL files remain** (gradual workspace migration)\n- **Frozen snapshots workflow** (requires understanding workspace freeze)\n- **TypeDialog dependency** (external Rust project)\n\n### Mitigations\n\n- ✅ Complete documentation in `docs/development/kcl-module-system.md`\n- ✅ 100% backward compatibility maintained\n- ✅ Migration framework established (5 templates, validation checklist)\n- ✅ Validation checklist for each migration step\n- ✅ 100% syntax validation on all files\n- ✅ Real-world usage validated (47 files in production)\n- ✅ Frozen snapshots guarantee reproducibility\n- ✅ Two deployment modes cover development and production\n- ✅ Gradual migration strategy (workspace-level, no hard cutoff)\n\n---\n\n## Migration Status\n\n### Completed (Phase 1-4)\n\n- ✅ Foundation (8 files) - Basic schemas, validation library\n- ✅ Core Schemas (8 files) - Settings, workspace config, gitea\n- ✅ Complex Features (7 files) - VM lifecycle, system config, services\n- ✅ Very Complex (9+ files) - Modes, commands, orchestrator, main entry point\n- ✅ Platform schemas (422 files total)\n- ✅ Extensions (providers, clusters)\n- ✅ Production workspace (47 files, 20 taskservs)\n\n### In Progress (Workspace-Level)\n\n- ⏳ Workspace migration (323+ files in workspace_librecloud)\n- ⏳ Extension migration (taskservs, clusters, providers)\n- ⏳ Parallel testing against original KCL\n- ⏳ CI/CD integration updates\n\n### Future (Optional)\n\n- User workspace KCL to Nickel (gradual, as needed)\n- Full migration of legacy configurations\n- TypeDialog UI generation for infrastructure\n\n---\n\n## Related Documentation\n\n### Development Guides\n\n- KCL Module System - Critical syntax differences and patterns\n- [Nickel Migration Guide](../development/nickel-executable-examples.md) - Three-file pattern specification and examples\n- [Configuration Architecture](../development/configuration.md) - Composition patterns and best practices\n\n### Related ADRs\n\n- **ADR-010**: Configuration Format Strategy (multi-format approach)\n- **ADR-006**: CLI Refactoring (domain-driven design)\n- **ADR-004**: Hybrid Rust/Nushell Architecture (platform architecture)\n\n### Referenced Files\n\n- **Entry point**: `provisioning/schemas/main.ncl`\n- **Workspace pattern**: `workspace_librecloud/nickel/main.ncl`\n- **Example extension**: `provisioning/extensions/providers/upcloud/nickel/main.ncl`\n- **Production infrastructure**: `workspace_librecloud/nickel/wuji/main.ncl` (20 taskservs)\n\n---\n\n## Approval\n\n**Status**: Implemented and Production-Ready\n\n- ✅ Architecture Team: Approved\n- ✅ Platform implementation: Complete (422 files)\n- ✅ Production validation: Passed (47 files active)\n- ✅ Backward compatibility: 100%\n- ✅ Real-world usage: Validated in wuji infrastructure\n\n---\n\n**Last Updated**: 2025-12-15\n**Version**: 1.0.0\n**Implementation**: Complete (Phase 1-4 finished, workspace-level in progress) +# ADR-011: Migration from KCL to Nickel + +**Status**: Implemented +**Date**: 2025-12-15 +**Decision Makers**: Architecture Team +**Implementation**: Complete for platform schemas (100%) + +--- + +## Context + +The provisioning platform historically used 
KCL (KLang) as the primary infrastructure-as-code language for all configuration schemas. As the system +evolved through four migration phases (Foundation, Core, Complex, Highly Complex), KCL's limitations became increasingly apparent: + +### Problems with KCL + +1. **Complex Type System**: Heavyweight schema system with extensive boilerplate + - `schema Foo(bar.Baz)` inheritance creates rigid hierarchies + - Union types with `null` don't work well in type annotations + - Schema modifications propagate breaking changes + +2. **Limited Flexibility**: Schema-first approach is too rigid for configuration evolution + - Difficult to extend types without modifying base schemas + - No easy way to add custom fields without validation conflicts + - Hard to compose configurations dynamically + +3. **Import System Overhead**: Non-standard module imports + - `import provisioning.lib as lib` pattern differs from ecosystem standards + - Re-export patterns create complexity in extension systems + +4. **Performance Overhead**: Compile-time validation adds latency + - Schema validation happens at compile time + - Large configuration files slow down evaluation + - No lazy evaluation built-in + +5. **Learning Curve**: KCL is Python-like but with unique patterns + - Team must learn KCL-specific semantics + - Limited ecosystem and tooling support + - Difficult to hire developers familiar with KCL + +### Project Needs + +The provisioning system required: + +- **Greater flexibility** in composing configurations +- **Better performance** for large-scale deployments +- **Extensibility** without modifying base schemas +- **Simpler mental model** for team learning +- **Clean exports** to JSON/TOML/YAML formats + +--- + +## Decision + +**Adopt Nickel as the primary infrastructure-as-code language** for all schema definitions, configuration composition, and deployment declarations. + +### Key Changes + +1. **Three-File Pattern per Module**: + - `{module}_contracts.ncl` - Type definitions using Nickel contracts + - `{module}_defaults.ncl` - Default values for all fields + - `{module}.ncl` - Instances combining both, with hybrid interface + +2. **Hybrid Interface** (4 levels of access): + - **Level 1**: Direct access to defaults (inspection, reference) + - **Level 2**: Maker functions (90% of use cases) + - **Level 3**: Default instances (pre-built, exported) + - **Level 4**: Contracts (optional imports, advanced combinations) + +3. **Domain-Organized Architecture** (8 top-level domains): + - `lib` - Core library types + - `config` - Settings, defaults, workspace configuration + - `infrastructure` - Compute, storage, provisioning schemas + - `operations` - Workflows, batch, dependencies, tasks + - `deployment` - Kubernetes, execution modes + - `services` - Gitea and other platform services + - `generator` - Code generation and declarations + - `integrations` - Runtime, GitOps, external integrations + +4. 
**Two Deployment Modes**: + - **Development**: Fast iteration with relative imports (Single Source of Truth) + - **Production**: Frozen snapshots with immutable, self-contained deployment packages + +--- + +## Implementation Summary + +### Migration Complete + +| Metric | Value | +| -------- | ------- | +| KCL files migrated | 40 | +| Nickel files created | 72 | +| Modules converted | 24 core modules | +| Schemas migrated | 150+ | +| Maker functions | 80+ | +| Default instances | 90+ | +| JSON output validation | 4,680+ lines | + +### Platform Schemas (`provisioning/schemas/`) + +- **422 Nickel files** total +- **8 domains** with hierarchical organization +- **Entry point**: `main.ncl` with domain-organized architecture +- **Clean imports**: `provisioning.lib`, `provisioning.config.settings`, etc. + +### Extensions (`provisioning/extensions/`) + +- **4 providers**: hetzner, local, aws, upcloud +- **1 cluster type**: web +- **Consistent structure**: Each extension has `nickel/` subdirectory with contracts, defaults, main, version + +**Example - UpCloud Provider**: + +```text +# upcloud/nickel/main.ncl (migrated from upcloud/kcl/) +let contracts = import "./contracts.ncl" in +let defaults = import "./defaults.ncl" in + +{ + defaults = defaults, + make_storage | not_exported = fun overrides => + defaults.storage & overrides, + DefaultStorage = defaults.storage, + DefaultStorageBackup = defaults.storage_backup, + DefaultProvisionEnv = defaults.provision_env, + DefaultProvisionUpcloud = defaults.provision_upcloud, + DefaultServerDefaults_upcloud = defaults.server_defaults_upcloud, + DefaultServerUpcloud = defaults.server_upcloud, +} +``` + +### Active Workspaces (`workspace_librecloud/nickel/`) + +- **47 Nickel files** in productive use +- **2 infrastructures**: + - `wuji` - Kubernetes cluster with 20 taskservs + - `sgoyol` - Support servers group +- **Two deployment modes** fully implemented and tested +- **Daily production usage** validated ✅ + +### Backward Compatibility + +- **955 KCL files** remain in workspaces/ (legacy user configs) +- 100% backward compatible - old KCL code still works +- Config loader supports both formats during transition +- No breaking changes to APIs + +--- + +## Comparison: KCL vs Nickel + +| Aspect | KCL | Nickel | Winner | +| -------- | ----- | -------- | -------- | +| **Mental Model** | Python-like with schemas | JSON with functions | Nickel | +| **Performance** | Baseline | 60% faster evaluation | Nickel | +| **Type System** | Rigid schemas | Gradual typing + contracts | Nickel | +| **Composition** | Schema inheritance | Record merging (`&`) | Nickel | +| **Extensibility** | Requires schema modifications | Merging with custom fields | Nickel | +| **Validation** | Compile-time (overhead) | Runtime contracts (lazy) | Nickel | +| **Boilerplate** | High | Low (3-file pattern) | Nickel | +| **Exports** | JSON/YAML | JSON/TOML/YAML | Nickel | +| **Learning Curve** | Medium-High | Low | Nickel | +| **Lazy Evaluation** | No | Yes (built-in) | Nickel | + +--- + +## Architecture Patterns + +### Three-File Pattern + +**File 1: Contracts** (`batch_contracts.ncl`): + +```text +{ + BatchScheduler = { + strategy | String, + resource_limits, + scheduling_interval | Number, + enable_preemption | Bool, + }, +} +``` + +**File 2: Defaults** (`batch_defaults.ncl`): + +```text +{ + scheduler = { + strategy = "dependency_first", + resource_limits = {"max_cpu_cores" = 0}, + scheduling_interval = 10, + enable_preemption = false, + }, +} +``` + +**File 3: Main** (`batch.ncl`): + 
+```text
+let contracts = import "./batch_contracts.ncl" in
+let defaults = import "./batch_defaults.ncl" in
+
+{
+  defaults = defaults,                    # Level 1: Inspection
+  make_scheduler | not_exported = fun o =>
+    defaults.scheduler & o,               # Level 2: Makers
+  DefaultScheduler = defaults.scheduler,  # Level 3: Instances
+}
+```
+
+### Hybrid Pattern Benefits
+
+- **90% of users**: Use makers for simple customization
+- **9% of users**: Reference defaults for inspection
+- **1% of users**: Access contracts for advanced combinations
+- **No validation conflicts**: Record merging works without contract constraints
+
+### Domain-Organized Architecture
+
+```text
+provisioning/schemas/
+├── lib/              # Storage, TaskServDef, ClusterDef
+├── config/           # Settings, defaults, workspace_config
+├── infrastructure/   # Compute, storage, provisioning
+├── operations/       # Workflows, batch, dependencies, tasks
+├── deployment/       # Kubernetes, modes (solo, multiuser, cicd, enterprise)
+├── services/         # Gitea, etc.
+├── generator/        # Declarations, gap analysis, changes
+├── integrations/     # Runtime, GitOps, main
+└── main.ncl          # Entry point with namespace organization
+```
+
+**Import pattern**:
+
+```text
+let provisioning = import "./main.ncl" in
+provisioning.lib                    # For Storage, TaskServDef
+provisioning.config.settings       # For Settings, Defaults
+provisioning.infrastructure.compute.server
+provisioning.operations.workflows
+```
+
+---
+
+## Production Deployment Patterns
+
+### Two-Mode Strategy
+
+#### 1. Development Mode (Single Source of Truth)
+
+- Relative imports to central provisioning
+- Fast iteration with immediate schema updates
+- No snapshot overhead
+- Usage: Local development, testing, experimentation
+
+```text
+# workspace_librecloud/nickel/main.ncl
+import "../../provisioning/schemas/main.ncl"
+import "../../provisioning/extensions/taskservs/kubernetes/nickel/main.ncl"
+```
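+
+A quick way to see development mode in action (illustrative, assuming `nickel` is on PATH and the workspace layout shown above): evaluate the workspace entry point directly and let relative imports resolve against the central schemas.
+
+```nushell
+# Illustrative check: evaluate the dev-mode entry point; relative
+# imports resolve against the central provisioning schemas.
+^nickel export workspace_librecloud/nickel/main.ncl --format json | from json | columns
+```
+
+#### 2. Production Mode (Hermetic Deployment)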
+
+Create immutable snapshots for reproducible deployments:
+
+```text
+provisioning workspace freeze --version "2025-12-15-prod-v1" --env production
+```
+
+**Frozen structure** (`.frozen/{version}/`):
+
+```text
+├── provisioning/schemas/   # Snapshot of central schemas
+├── extensions/             # Snapshot of all extensions
+└── workspace/              # Snapshot of workspace configs
+```
+
+**All imports rewritten to local paths**:
+
+- `import "../../provisioning/schemas/main.ncl"` → `import "./provisioning/schemas/main.ncl"`
+- Guarantees immutability and reproducibility
+- No external dependencies
+- Can be deployed to air-gapped environments
+
+**Deploy from frozen snapshot**:
+
+```text
+provisioning deploy --frozen "2025-12-15-prod-v1" --infra wuji
+```
+
+**Benefits**:
+
+- ✅ Development: Fast iteration with central updates
+- ✅ Production: Immutable, reproducible deployments
+- ✅ Audit trail: Each frozen version timestamped
+- ✅ Rollback: Easy rollback to previous versions
+- ✅ Air-gapped: Works in offline environments
+
+---
+
+## Ecosystem Integration
+
+### TypeDialog (Bidirectional Nickel Integration)
+
+**Location**: `/Users/Akasha/Development/typedialog`
+**Purpose**: Type-safe prompts, forms, and schemas with Nickel output
+
+**Key Feature**: Nickel schemas → Type-safe UIs → Nickel output
+
+```text
+# Nickel schema → Interactive form
+typedialog form --schema server.ncl --output json
+
+# Interactive form → Nickel output
+typedialog form --input form.toml --output nickel
+```
+
+**Value**: Amplifies Nickel ecosystem beyond IaC:
+
+- Schemas auto-generate type-safe UIs
+- Forms output configurations back to Nickel
+- Multiple backends: CLI, TUI, Web
+- Multiple output formats: JSON, YAML, TOML, Nickel
+
+---
+
+## Technical Patterns
+
+### Expression-Based Structure
+
+| KCL | Nickel |
+| ----- | -------- |
+| Multiple top-level let bindings | Single root expression with `let...in` chaining |
+
+### Schema Inheritance → Record Merging
+
+| KCL | Nickel |
+| ----- | -------- |
+| `schema Server(defaults.ServerDefaults)` | `defaults.ServerDefaults & { overrides }` |
+
+### Optional Fields
+
+| KCL | Nickel |
+| ----- | -------- |
+| `field?: type` | `field = null` or `field = ""` |
+
+### Union Types
+
+| KCL | Nickel |
+| ----- | -------- |
+| `"ubuntu" \| "debian" \| "centos"` | `[\| 'ubuntu, 'debian, 'centos \|]` |
+
+### Boolean/Null Conversion
+
+| KCL | Nickel |
+| ----- | -------- |
+| `True` / `False` / `None` | `true` / `false` / `null` |
+
+---
+
+## Quality Metrics
+
+- **Syntax Validation**: 100% (all files compile)
+- **JSON Export**: 100% success rate (4,680+ lines)
+- **Pattern Coverage**: All 5 templates tested and proven
+- **Backward Compatibility**: 100%
+- **Performance**: 60% faster evaluation than KCL
+- **Test Coverage**: 422 Nickel files validated in production
+
+---
+
+## Consequences
+
+### Positive ✅
+
+- **60% performance gain** in evaluation speed
+- **Reduced boilerplate** (contracts + defaults separation)
+- **Greater flexibility** (record merging without validation)
+- **Extensibility without conflicts** (custom fields allowed)
+- **Simplified mental model** ("JSON with functions")
+- **Lazy evaluation** (better performance for large configs)
+- **Clean exports** (100% JSON/TOML compatible)
+- **Hybrid pattern** (4 levels covering all use cases)
+- **Domain-organized architecture** (8 logical domains, clear imports)
+- **Production deployment** with frozen snapshots (immutable, reproducible)
+- **Ecosystem expansion** 
(TypeDialog integration for UI generation) +- **Real-world validation** (47 files in productive use) +- **20 taskservs** deployed in production infrastructure + +### Challenges ⚠️ + +- **Dual format support** during transition (KCL + Nickel) +- **Learning curve** for team (new language) +- **Migration effort** (40 files migrated manually) +- **Documentation updates** (guides, examples, training) +- **955 KCL files remain** (gradual workspace migration) +- **Frozen snapshots workflow** (requires understanding workspace freeze) +- **TypeDialog dependency** (external Rust project) + +### Mitigations + +- ✅ Complete documentation in `docs/development/kcl-module-system.md` +- ✅ 100% backward compatibility maintained +- ✅ Migration framework established (5 templates, validation checklist) +- ✅ Validation checklist for each migration step +- ✅ 100% syntax validation on all files +- ✅ Real-world usage validated (47 files in production) +- ✅ Frozen snapshots guarantee reproducibility +- ✅ Two deployment modes cover development and production +- ✅ Gradual migration strategy (workspace-level, no hard cutoff) + +--- + +## Migration Status + +### Completed (Phase 1-4) + +- ✅ Foundation (8 files) - Basic schemas, validation library +- ✅ Core Schemas (8 files) - Settings, workspace config, gitea +- ✅ Complex Features (7 files) - VM lifecycle, system config, services +- ✅ Very Complex (9+ files) - Modes, commands, orchestrator, main entry point +- ✅ Platform schemas (422 files total) +- ✅ Extensions (providers, clusters) +- ✅ Production workspace (47 files, 20 taskservs) + +### In Progress (Workspace-Level) + +- ⏳ Workspace migration (323+ files in workspace_librecloud) +- ⏳ Extension migration (taskservs, clusters, providers) +- ⏳ Parallel testing against original KCL +- ⏳ CI/CD integration updates + +### Future (Optional) + +- User workspace KCL to Nickel (gradual, as needed) +- Full migration of legacy configurations +- TypeDialog UI generation for infrastructure + +--- + +## Related Documentation + +### Development Guides + +- KCL Module System - Critical syntax differences and patterns +- [Nickel Migration Guide](../development/nickel-executable-examples.md) - Three-file pattern specification and examples +- [Configuration Architecture](../development/configuration.md) - Composition patterns and best practices + +### Related ADRs + +- **ADR-010**: Configuration Format Strategy (multi-format approach) +- **ADR-006**: CLI Refactoring (domain-driven design) +- **ADR-004**: Hybrid Rust/Nushell Architecture (platform architecture) + +### Referenced Files + +- **Entry point**: `provisioning/schemas/main.ncl` +- **Workspace pattern**: `workspace_librecloud/nickel/main.ncl` +- **Example extension**: `provisioning/extensions/providers/upcloud/nickel/main.ncl` +- **Production infrastructure**: `workspace_librecloud/nickel/wuji/main.ncl` (20 taskservs) + +--- + +## Approval + +**Status**: Implemented and Production-Ready + +- ✅ Architecture Team: Approved +- ✅ Platform implementation: Complete (422 files) +- ✅ Production validation: Passed (47 files active) +- ✅ Backward compatibility: 100% +- ✅ Real-world usage: Validated in wuji infrastructure + +--- + +**Last Updated**: 2025-12-15 +**Version**: 1.0.0 +**Implementation**: Complete (Phase 1-4 finished, workspace-level in progress) \ No newline at end of file diff --git a/docs/src/architecture/adr/adr-012-nushell-nickel-plugin-cli-wrapper.md b/docs/src/architecture/adr/adr-012-nushell-nickel-plugin-cli-wrapper.md index b813468..d657b01 100644 --- 
a/docs/src/architecture/adr/adr-012-nushell-nickel-plugin-cli-wrapper.md +++ b/docs/src/architecture/adr/adr-012-nushell-nickel-plugin-cli-wrapper.md @@ -1 +1,379 @@ -# ADR-014: Nushell Nickel Plugin - CLI Wrapper Architecture\n\n## Status\n\n**Accepted** - 2025-12-15\n\n## Context\n\nThe provisioning system integrates with Nickel for configuration management in advanced\nscenarios. Users need to evaluate Nickel files and work with their output in Nushell\nscripts. The `nu_plugin_nickel` plugin provides this integration.\n\nThe architectural decision was whether the plugin should:\n\n1. **Implement Nickel directly using pure Rust** (`nickel-lang-core` crate)\n2. **Wrap the official Nickel CLI** (`nickel` command)\n\n### System Requirements\n\nNickel configurations in provisioning use the **module system**:\n\n```\n# config/database.ncl\nimport "lib/defaults" as defaults\nimport "lib/validation" as valid\n\n{\n databases: {\n primary = defaults.database & {\n name = "primary"\n host = "localhost"\n }\n }\n}\n```\n\nModule system includes:\n\n- Import resolution with search paths\n- Standard library (`builtins`, stdlib packages)\n- Module caching\n- Complex evaluation context\n\n## Decision\n\nImplement the `nu_plugin_nickel` plugin as a **CLI wrapper** that invokes the external `nickel` command.\n\n### Architecture Diagram\n\n```\n┌─────────────────────────────┐\n│ Nushell Script │\n│ │\n│ nickel-export json /file │\n│ nickel-eval /file │\n│ nickel-format /file │\n└────────────┬────────────────┘\n │\n ▼\n┌─────────────────────────────┐\n│ nu_plugin_nickel │\n│ │\n│ - Command handling │\n│ - Argument parsing │\n│ - JSON output parsing │\n│ - Caching logic │\n└────────────┬────────────────┘\n │\n ▼\n┌─────────────────────────────┐\n│ std::process::Command │\n│ │\n│ "nickel export /file ..." │\n└────────────┬────────────────┘\n │\n ▼\n┌─────────────────────────────┐\n│ Nickel Official CLI │\n│ │\n│ - Module resolution │\n│ - Import handling │\n│ - Standard library access │\n│ - Output formatting │\n│ - Error reporting │\n└────────────┬────────────────┘\n │\n ▼\n┌─────────────────────────────┐\n│ Nushell Records/Lists │\n│ │\n│ ✅ Proper types │\n│ ✅ Cell path access works │\n│ ✅ Piping works │\n└─────────────────────────────┘\n```\n\n### Implementation Characteristics\n\n**Plugin provides**:\n\n- ✅ Nushell commands: `nickel-export`, `nickel-eval`, `nickel-format`, `nickel-validate`\n- ✅ JSON/YAML output parsing (serde_json → nu_protocol::Value)\n- ✅ Automatic caching (SHA256-based, ~80-90% hit rate)\n- ✅ Error handling (CLI errors → Nushell errors)\n- ✅ Type-safe output (nu_protocol::Value::Record, not strings)\n\n**Plugin delegates to Nickel CLI**:\n\n- ✅ Module resolution with search paths\n- ✅ Standard library access and discovery\n- ✅ Evaluation context setup\n- ✅ Module caching\n- ✅ Output formatting\n\n## Rationale\n\n### Why CLI Wrapper Is The Correct Choice\n\n| Aspect | Pure Rust (nickel-lang-core) | CLI Wrapper (chosen) |\n| -------- | ------------------------------- | ---------------------- |\n| **Module resolution** | ❓ Undocumented API | ✅ Official, proven |\n| **Search paths** | ❓ How to configure? | ✅ CLI handles it |\n| **Standard library** | ❓ How to access? 
| ✅ Automatic discovery |\n| **Import system** | ❌ API unclear | ✅ Built-in |\n| **Evaluation context** | ❌ Complex setup needed | ✅ CLI provides |\n| **Future versions** | ⚠️ Maintain parity | ✅ Automatic support |\n| **Maintenance burden** | 🔴 High | 🟢 Low |\n| **Complexity** | 🔴 High | 🟢 Low |\n| **Correctness** | ⚠️ Risk of divergence | ✅ Single source of truth |\n\n### The Module System Problem\n\nUsing `nickel-lang-core` directly would require the plugin to:\n\n1. **Configure import search paths**:\n\n ```rust\n // Where should Nickel look for modules?\n // Current directory? Workspace? System paths?\n // This is complex and configuration-dependent\n ```\n\n1. **Access standard library**:\n\n ```rust\n // Where is the Nickel stdlib installed?\n // How to handle different Nickel versions?\n // How to provide builtins?\n ```\n\n2. **Manage module evaluation context**:\n\n ```rust\n // Set up evaluation environment\n // Configure cache locations\n // Initialize type checker\n // This is essentially re-implementing CLI logic\n ```\n\n3. **Maintain compatibility**:\n - Every Nickel version change requires review\n - Risk of subtle behavioral differences\n - Duplicate bug fixes and features\n - Two implementations to maintain\n\n### Documentation Gap\n\nThe `nickel-lang-core` crate lacks clear documentation on:\n\n- ❓ How to configure import search paths\n- ❓ How to access standard library\n- ❓ How to set up evaluation context\n- ❓ What is the public API contract?\n\nThis makes direct usage risky. The CLI is the documented, proven interface.\n\n### Why Nickel Is Different From Simple Use Cases\n\n**Simple use case** (direct library usage works):\n\n- Simple evaluation with built-in functions\n- No external dependencies\n- No modules or imports\n\n**Nickel reality** (CLI wrapper necessary):\n\n- Complex module system with search paths\n- External dependencies (standard library)\n- Import resolution with multiple fallbacks\n- Evaluation context that mirrors CLI\n\n## Consequences\n\n### Positive\n\n- **Correctness**: Module resolution guaranteed by official Nickel CLI\n- **Reliability**: No risk from reverse-engineering undocumented APIs\n- **Simplicity**: Plugin code is lean (~300 lines total)\n- **Maintainability**: Automatic tracking of Nickel changes\n- **Compatibility**: Works with all Nickel versions\n- **User Expectations**: Same behavior as CLI users experience\n- **Community Alignment**: Uses official Nickel distribution\n\n### Negative\n\n- **External Dependency**: Requires `nickel` binary installed in PATH\n- **Process Overhead**: ~100-200 ms per execution (heavily cached)\n- **Subprocess Management**: Spawn handling and stderr capture needed\n- **Distribution**: Provisioning must include Nickel binary\n\n### Mitigation Strategies\n\n**Dependency Management**:\n\n- Installation scripts handle Nickel setup\n- Docker images pre-install Nickel\n- Clear error messages if `nickel` not found\n- Documentation covers installation\n\n**Performance**:\n\n- Aggressive caching (80-90% typical hit rate)\n- Cache hits: ~1-5 ms (not 100-200 ms)\n- Cache directory: `~/.cache/provisioning/config-cache/`\n\n**Distribution**:\n\n- Provisioning distributions include Nickel\n- Installers set up Nickel automatically\n- CI/CD has Nickel available\n\n## Alternatives Considered\n\n### Alternative 1: Pure Rust with nickel-lang-core\n\n**Pros**: No external dependency\n**Cons**: Undocumented API, high risk, maintenance burden\n**Decision**: REJECTED - Too risky\n\n### Alternative 2: Hybrid (Pure Rust + CLI 
fallback)\n\n**Pros**: Flexibility\n**Cons**: Adds complexity, dual code paths, confusing behavior\n**Decision**: REJECTED - Over-engineering\n\n### Alternative 3: WebAssembly Version\n\n**Pros**: Standalone\n**Cons**: WASM support unclear, additional infrastructure\n**Decision**: REJECTED - Immature\n\n### Alternative 4: Use Nickel LSP\n\n**Pros**: Uses official interface\n**Cons**: LSP not designed for evaluation, wrong abstraction\n**Decision**: REJECTED - Inappropriate tool\n\n## Implementation Details\n\n### Command Set\n\n1. **nickel-export**: Export/evaluate Nickel file\n\n ```nushell\n nickel-export json /path/to/file.ncl\n nickel-export yaml /path/to/file.ncl\n ```\n\n2. **nickel-eval**: Evaluate with automatic caching (for config loader)\n\n ```nushell\n nickel-eval /workspace/config.ncl\n ```\n\n3. **nickel-format**: Format Nickel files\n\n ```nushell\n nickel-format /path/to/file.ncl\n ```\n\n4. **nickel-validate**: Validate Nickel files/project\n\n ```nushell\n nickel-validate /path/to/project\n ```\n\n### Critical Implementation Detail: Command Syntax\n\nThe plugin uses the **correct Nickel command syntax**:\n\n```\n// Correct:\ncmd.arg("export").arg(file).arg("--format").arg(format);\n// Results in: "nickel export /file --format json"\n\n// WRONG (previously):\ncmd.arg("export").arg(format).arg(file);\n// Results in: "nickel export json /file"\n// ↑ This triggers auto-import of nonexistent JSON module\n```\n\n### Caching Strategy\n\n**Cache Key**: SHA256(file_content + format)\n**Cache Hit Rate**: 80-90% (typical provisioning workflows)\n**Performance**:\n\n- Cache miss: ~100-200 ms (process fork)\n- Cache hit: ~1-5 ms (filesystem read + parse)\n- Speedup: 50-100x for cached runs\n\n**Storage**: `~/.cache/provisioning/config-cache/`\n\n### JSON Output Processing\n\nPlugin correctly processes JSON output:\n\n1. Invokes: `nickel export /file.ncl --format json`\n2. Receives: JSON string from stdout\n3. Parses: serde_json::Value\n4. Converts: `json_value_to_nu_value()` (recursive)\n5. 
Returns: nu_protocol::Value::Record (not string!)\n\nThis enables Nushell cell path access:\n\n```\nnickel-export json /config.ncl | .database.host # ✅ Works\n```\n\n## Testing Strategy\n\n**Unit Tests**:\n\n- JSON parsing correctness\n- Value type conversions\n- Cache logic\n\n**Integration Tests**:\n\n- Real Nickel file execution\n- Module imports verification\n- Search path resolution\n\n**Manual Verification**:\n\n```\n# Test module imports\nnickel-export json /workspace/config.ncl\n\n# Test cell path access\nnickel-export json /workspace/config.ncl | .database\n\n# Verify output types\nnickel-export json /workspace/config.ncl | type\n# Should show: record, not string\n```\n\n## Configuration Integration\n\nPlugin integrates with provisioning config system:\n\n- Nickel path auto-detected: `which nickel`\n- Cache location: platform-specific `cache_dir()`\n- Errors: consistent with provisioning patterns\n\n## References\n\n- ADR-012: Nushell Plugins (general framework)\n- [Nickel Official Documentation](https://nickel-lang.org/)\n- [nickel-lang-core Rust Crate](https://crates.io/crates/nickel-lang-core/)\n- nu_plugin_nickel Implementation: `provisioning/core/plugins/nushell-plugins/nu_plugin_nickel/`\n- [Related: ADR-013-NUSHELL-KCL-PLUGIN](adr/adr-nushell-kcl-plugin-cli-wrapper.md)\n\n---\n\n**Status**: Accepted and Implemented\n**Last Updated**: 2025-12-15\n**Implementation**: Complete\n**Tests**: Passing +# ADR-014: Nushell Nickel Plugin - CLI Wrapper Architecture + +## Status + +**Accepted** - 2025-12-15 + +## Context + +The provisioning system integrates with Nickel for configuration management in advanced +scenarios. Users need to evaluate Nickel files and work with their output in Nushell +scripts. The `nu_plugin_nickel` plugin provides this integration. + +The architectural decision was whether the plugin should: + +1. **Implement Nickel directly using pure Rust** (`nickel-lang-core` crate) +2. **Wrap the official Nickel CLI** (`nickel` command) + +### System Requirements + +Nickel configurations in provisioning use the **module system**: + +```text +# config/database.ncl +import "lib/defaults" as defaults +import "lib/validation" as valid + +{ + databases: { + primary = defaults.database & { + name = "primary" + host = "localhost" + } + } +} +``` + +Module system includes: + +- Import resolution with search paths +- Standard library (`builtins`, stdlib packages) +- Module caching +- Complex evaluation context + +## Decision + +Implement the `nu_plugin_nickel` plugin as a **CLI wrapper** that invokes the external `nickel` command. + +### Architecture Diagram + +```text +┌─────────────────────────────┐ +│ Nushell Script │ +│ │ +│ nickel-export json /file │ +│ nickel-eval /file │ +│ nickel-format /file │ +└────────────┬────────────────┘ + │ + ▼ +┌─────────────────────────────┐ +│ nu_plugin_nickel │ +│ │ +│ - Command handling │ +│ - Argument parsing │ +│ - JSON output parsing │ +│ - Caching logic │ +└────────────┬────────────────┘ + │ + ▼ +┌─────────────────────────────┐ +│ std::process::Command │ +│ │ +│ "nickel export /file ..." 
│
+└────────────┬────────────────┘
+             │
+             ▼
+┌─────────────────────────────┐
+│   Nickel Official CLI       │
+│                             │
+│  - Module resolution        │
+│  - Import handling          │
+│  - Standard library access  │
+│  - Output formatting        │
+│  - Error reporting          │
+└────────────┬────────────────┘
+             │
+             ▼
+┌─────────────────────────────┐
+│   Nushell Records/Lists     │
+│                             │
+│  ✅ Proper types            │
+│  ✅ Cell path access works  │
+│  ✅ Piping works            │
+└─────────────────────────────┘
+```
+
+### Implementation Characteristics
+
+**Plugin provides**:
+
+- ✅ Nushell commands: `nickel-export`, `nickel-eval`, `nickel-format`, `nickel-validate`
+- ✅ JSON/YAML output parsing (serde_json → nu_protocol::Value)
+- ✅ Automatic caching (SHA256-based, ~80-90% hit rate)
+- ✅ Error handling (CLI errors → Nushell errors)
+- ✅ Type-safe output (nu_protocol::Value::Record, not strings)
+
+**Plugin delegates to Nickel CLI**:
+
+- ✅ Module resolution with search paths
+- ✅ Standard library access and discovery
+- ✅ Evaluation context setup
+- ✅ Module caching
+- ✅ Output formatting
+
+## Rationale
+
+### Why CLI Wrapper Is The Correct Choice
+
+| Aspect | Pure Rust (nickel-lang-core) | CLI Wrapper (chosen) |
+| -------- | ------------------------------- | ---------------------- |
+| **Module resolution** | ❓ Undocumented API | ✅ Official, proven |
+| **Search paths** | ❓ How to configure? | ✅ CLI handles it |
+| **Standard library** | ❓ How to access? | ✅ Automatic discovery |
+| **Import system** | ❌ API unclear | ✅ Built-in |
+| **Evaluation context** | ❌ Complex setup needed | ✅ CLI provides |
+| **Future versions** | ⚠️ Maintain parity | ✅ Automatic support |
+| **Maintenance burden** | 🔴 High | 🟢 Low |
+| **Complexity** | 🔴 High | 🟢 Low |
+| **Correctness** | ⚠️ Risk of divergence | ✅ Single source of truth |
+
+### The Module System Problem
+
+Using `nickel-lang-core` directly would require the plugin to:
+
+1. **Configure import search paths**:
+
+   ```rust
+   // Where should Nickel look for modules?
+   // Current directory? Workspace? System paths?
+   // This is complex and configuration-dependent
+   ```
+
+2. **Access standard library**:
+
+   ```rust
+   // Where is the Nickel stdlib installed?
+   // How to handle different Nickel versions?
+   // How to provide builtins?
+   ```
+
+3. **Manage module evaluation context**:
+
+   ```rust
+   // Set up evaluation environment
+   // Configure cache locations
+   // Initialize type checker
+   // This is essentially re-implementing CLI logic
+   ```
+
+4. **Maintain compatibility**:
+   - Every Nickel version change requires review
+   - Risk of subtle behavioral differences
+   - Duplicate bug fixes and features
+   - Two implementations to maintain
+
+### Documentation Gap
+
+The `nickel-lang-core` crate lacks clear documentation on:
+
+- ❓ How to configure import search paths
+- ❓ How to access standard library
+- ❓ How to set up evaluation context
+- ❓ What is the public API contract?
+
+This makes direct usage risky. The CLI is the documented, proven interface.
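+
+The delegation pattern is small enough to sketch in plain Nushell (the plugin implements it natively in Rust, plus caching and error mapping); the command name below is illustrative, not the plugin's:
+
+```nushell
+# Illustrative sketch of the delegation pattern: let the official CLI
+# resolve modules and imports, then parse its JSON into a structured record.
+def nickel-export-json [file: path] {
+    ^nickel export $file --format json | from json
+}
+```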
+ +### Why Nickel Is Different From Simple Use Cases + +**Simple use case** (direct library usage works): + +- Simple evaluation with built-in functions +- No external dependencies +- No modules or imports + +**Nickel reality** (CLI wrapper necessary): + +- Complex module system with search paths +- External dependencies (standard library) +- Import resolution with multiple fallbacks +- Evaluation context that mirrors CLI + +## Consequences + +### Positive + +- **Correctness**: Module resolution guaranteed by official Nickel CLI +- **Reliability**: No risk from reverse-engineering undocumented APIs +- **Simplicity**: Plugin code is lean (~300 lines total) +- **Maintainability**: Automatic tracking of Nickel changes +- **Compatibility**: Works with all Nickel versions +- **User Expectations**: Same behavior as CLI users experience +- **Community Alignment**: Uses official Nickel distribution + +### Negative + +- **External Dependency**: Requires `nickel` binary installed in PATH +- **Process Overhead**: ~100-200 ms per execution (heavily cached) +- **Subprocess Management**: Spawn handling and stderr capture needed +- **Distribution**: Provisioning must include Nickel binary + +### Mitigation Strategies + +**Dependency Management**: + +- Installation scripts handle Nickel setup +- Docker images pre-install Nickel +- Clear error messages if `nickel` not found +- Documentation covers installation + +**Performance**: + +- Aggressive caching (80-90% typical hit rate) +- Cache hits: ~1-5 ms (not 100-200 ms) +- Cache directory: `~/.cache/provisioning/config-cache/` + +**Distribution**: + +- Provisioning distributions include Nickel +- Installers set up Nickel automatically +- CI/CD has Nickel available + +## Alternatives Considered + +### Alternative 1: Pure Rust with nickel-lang-core + +**Pros**: No external dependency +**Cons**: Undocumented API, high risk, maintenance burden +**Decision**: REJECTED - Too risky + +### Alternative 2: Hybrid (Pure Rust + CLI fallback) + +**Pros**: Flexibility +**Cons**: Adds complexity, dual code paths, confusing behavior +**Decision**: REJECTED - Over-engineering + +### Alternative 3: WebAssembly Version + +**Pros**: Standalone +**Cons**: WASM support unclear, additional infrastructure +**Decision**: REJECTED - Immature + +### Alternative 4: Use Nickel LSP + +**Pros**: Uses official interface +**Cons**: LSP not designed for evaluation, wrong abstraction +**Decision**: REJECTED - Inappropriate tool + +## Implementation Details + +### Command Set + +1. **nickel-export**: Export/evaluate Nickel file + + ```nushell + nickel-export json /path/to/file.ncl + nickel-export yaml /path/to/file.ncl + ``` + +2. **nickel-eval**: Evaluate with automatic caching (for config loader) + + ```nushell + nickel-eval /workspace/config.ncl + ``` + +3. **nickel-format**: Format Nickel files + + ```nushell + nickel-format /path/to/file.ncl + ``` + +4. 
**nickel-validate**: Validate Nickel files/project + + ```nushell + nickel-validate /path/to/project + ``` + +### Critical Implementation Detail: Command Syntax + +The plugin uses the **correct Nickel command syntax**: + +```text +// Correct: +cmd.arg("export").arg(file).arg("--format").arg(format); +// Results in: "nickel export /file --format json" + +// WRONG (previously): +cmd.arg("export").arg(format).arg(file); +// Results in: "nickel export json /file" +// ↑ This triggers auto-import of nonexistent JSON module +``` + +### Caching Strategy + +**Cache Key**: SHA256(file_content + format) +**Cache Hit Rate**: 80-90% (typical provisioning workflows) +**Performance**: + +- Cache miss: ~100-200 ms (process fork) +- Cache hit: ~1-5 ms (filesystem read + parse) +- Speedup: 50-100x for cached runs + +**Storage**: `~/.cache/provisioning/config-cache/` + +### JSON Output Processing + +Plugin correctly processes JSON output: + +1. Invokes: `nickel export /file.ncl --format json` +2. Receives: JSON string from stdout +3. Parses: serde_json::Value +4. Converts: `json_value_to_nu_value()` (recursive) +5. Returns: nu_protocol::Value::Record (not string!) + +This enables Nushell cell path access: + +```text +nickel-export json /config.ncl | .database.host # ✅ Works +``` + +## Testing Strategy + +**Unit Tests**: + +- JSON parsing correctness +- Value type conversions +- Cache logic + +**Integration Tests**: + +- Real Nickel file execution +- Module imports verification +- Search path resolution + +**Manual Verification**: + +```text +# Test module imports +nickel-export json /workspace/config.ncl + +# Test cell path access +nickel-export json /workspace/config.ncl | .database + +# Verify output types +nickel-export json /workspace/config.ncl | type +# Should show: record, not string +``` + +## Configuration Integration + +Plugin integrates with provisioning config system: + +- Nickel path auto-detected: `which nickel` +- Cache location: platform-specific `cache_dir()` +- Errors: consistent with provisioning patterns + +## References + +- ADR-012: Nushell Plugins (general framework) +- [Nickel Official Documentation](https://nickel-lang.org/) +- [nickel-lang-core Rust Crate](https://crates.io/crates/nickel-lang-core/) +- nu_plugin_nickel Implementation: `provisioning/core/plugins/nushell-plugins/nu_plugin_nickel/` +- [Related: ADR-013-NUSHELL-KCL-PLUGIN](adr/adr-nushell-kcl-plugin-cli-wrapper.md) + +--- + +**Status**: Accepted and Implemented +**Last Updated**: 2025-12-15 +**Implementation**: Complete +**Tests**: Passing \ No newline at end of file diff --git a/docs/src/architecture/adr/adr-013-typdialog-integration.md b/docs/src/architecture/adr/adr-013-typdialog-integration.md index 4bad746..9f9a93b 100644 --- a/docs/src/architecture/adr/adr-013-typdialog-integration.md +++ b/docs/src/architecture/adr/adr-013-typdialog-integration.md @@ -1 +1,592 @@ -# ADR-013: Typdialog Web UI Backend Integration for Interactive Configuration\n\n## Status\n\n**Accepted** - 2025-01-08\n\n## Context\n\nThe provisioning system requires interactive user input for configuration workflows, workspace initialization, credential setup, and guided deployment\nscenarios. The system architecture combines Rust (performance-critical), Nushell (scripting), and Nickel (declarative configuration), creating\nchallenges for interactive form-based input and multi-user collaboration.\n\n### The Interactive Configuration Problem\n\n**Current limitations**:\n\n1. 
**Nushell CLI**: Terminal-only interaction\n - `input` command: Single-line text prompts only\n - No form validation, no complex multi-field forms\n - Limited to single-user, terminal-bound workflows\n - User experience: Basic and error-prone\n\n2. **Nickel**: Declarative configuration language\n - Cannot handle interactive prompts (by design)\n - Pure evaluation model (no side effects)\n - Forms must be defined statically, not interactively\n - No runtime user interaction\n\n3. **Existing Solutions**: Inadequate for modern infrastructure provisioning\n - **Shell-based prompts**: Error-prone, no validation, single-user\n - **Custom web forms**: High maintenance, inconsistent UX\n - **Separate admin panels**: Disconnected from IaC workflow\n - **Terminal-only TUI**: Limited to SSH sessions, no collaboration\n\n### Use Cases Requiring Interactive Input\n\n1. **Workspace Initialization**:\n ```nushell\n # Current: Error-prone prompts\n let workspace_name = input "Workspace name: "\n let provider = input "Provider (aws/azure/oci): "\n # No validation, no autocomplete, no guidance\n ```\n\n2. **Credential Setup**:\n ```nushell\n # Current: Insecure and basic\n let api_key = input "API Key: " # Shows in terminal history\n let region = input "Region: " # No validation\n ```\n\n3. **Configuration Wizards**:\n - Database connection setup (host, port, credentials, SSL)\n - Network configuration (CIDR blocks, subnets, gateways)\n - Security policies (encryption, access control, audit)\n\n4. **Guided Deployments**:\n - Multi-step infrastructure provisioning\n - Service selection with dependencies\n - Environment-specific overrides\n\n### Requirements for Interactive Input System\n\n- ✅ **Terminal UI widgets**: Text input, password, select, multi-select, confirm\n- ✅ **Validation**: Type checking, regex patterns, custom validators\n- ✅ **Security**: Password masking, sensitive data handling\n- ✅ **User Experience**: Arrow key navigation, autocomplete, help text\n- ✅ **Composability**: Chain multiple prompts into forms\n- ✅ **Error Handling**: Clear validation errors, retry logic\n- ✅ **Rust Integration**: Native Rust library (no subprocess overhead)\n- ✅ **Cross-Platform**: Works on Linux, macOS, Windows\n\n## Decision\n\nIntegrate **typdialog** with its **Web UI backend** as the standard interactive configuration interface for the provisioning platform. 
The major\nachievement of typdialog is not the TUI - it is the Web UI backend that enables browser-based forms, multi-user collaboration, and seamless\nintegration with the provisioning orchestrator.\n\n### Architecture Diagram\n\n```\n┌─────────────────────────────────────────┐\n│ Nushell Script │\n│ │\n│ provisioning workspace init │\n│ provisioning config setup │\n│ provisioning deploy guided │\n└────────────┬────────────────────────────┘\n │\n ▼\n┌─────────────────────────────────────────┐\n│ Rust CLI Handler │\n│ (provisioning/core/cli/) │\n│ │\n│ - Parse command │\n│ - Determine if interactive needed │\n│ - Invoke TUI dialog module │\n└────────────┬────────────────────────────┘\n │\n ▼\n┌─────────────────────────────────────────┐\n│ TUI Dialog Module │\n│ (typdialog wrapper) │\n│ │\n│ - Form definition (validation rules) │\n│ - Widget rendering (text, select) │\n│ - User input capture │\n│ - Validation execution │\n│ - Result serialization (JSON/TOML) │\n└────────────┬────────────────────────────┘\n │\n ▼\n┌─────────────────────────────────────────┐\n│ typdialog Library │\n│ │\n│ - Terminal rendering (crossterm) │\n│ - Event handling (keyboard, mouse) │\n│ - Widget state management │\n│ - Input validation engine │\n└────────────┬────────────────────────────┘\n │\n ▼\n┌─────────────────────────────────────────┐\n│ Terminal (stdout/stdin) │\n│ │\n│ ✅ Rich TUI with validation │\n│ ✅ Secure password input │\n│ ✅ Guided multi-step forms │\n└─────────────────────────────────────────┘\n```\n\n### Implementation Characteristics\n\n**CLI Integration Provides**:\n\n- ✅ Native Rust commands with TUI dialogs\n- ✅ Form-based input for complex configurations\n- ✅ Validation rules defined in Rust (type-safe)\n- ✅ Secure input (password masking, no history)\n- ✅ Error handling with retry logic\n- ✅ Serialization to Nickel/TOML/JSON\n\n**TUI Dialog Library Handles**:\n\n- ✅ Terminal UI rendering and event loop\n- ✅ Widget management (text, select, checkbox, confirm)\n- ✅ Input validation and error display\n- ✅ Navigation (arrow keys, tab, enter)\n- ✅ Cross-platform terminal compatibility\n\n## Rationale\n\n### Why TUI Dialog Integration Is Required\n\n| Aspect | Shell Prompts (current) | Web Forms | TUI Dialog (chosen) |\n| -------- | ------------------------- | ----------- | --------------------- |\n| **User Experience** | ❌ Basic text only | ✅ Rich UI | ✅ Rich TUI |\n| **Validation** | ❌ Manual, error-prone | ✅ Built-in | ✅ Built-in |\n| **Security** | ❌ Plain text, history | ⚠️ Network risk | ✅ Secure terminal |\n| **Setup Complexity** | ✅ None | ❌ Server required | ✅ Minimal |\n| **Terminal Workflow** | ✅ Native | ❌ Browser switch | ✅ Native |\n| **Offline Support** | ✅ Always | ❌ Requires server | ✅ Always |\n| **Dependencies** | ✅ None | ❌ Web stack | ✅ Single crate |\n| **Error Handling** | ❌ Manual | ⚠️ Complex | ✅ Built-in retry |\n\n### The Nushell Limitation\n\nNushell's `input` command is limited:\n\n```\n# Current: No validation, no security\nlet password = input "Password: " # ❌ Shows in terminal\nlet region = input "AWS Region: " # ❌ No autocomplete/validation\n\n# Cannot do:\n# - Multi-select from options\n# - Conditional fields (if X then ask Y)\n# - Password masking\n# - Real-time validation\n# - Autocomplete/fuzzy search\n```\n\n### The Nickel Constraint\n\nNickel is declarative and cannot prompt users:\n\n```\n# Nickel defines what the config looks like, NOT how to get it\n{\n database = {\n host | String,\n port | Number,\n credentials | { username: String, password: String },\n 
}\n}\n\n# Nickel cannot:\n# - Prompt user for values\n# - Show interactive forms\n# - Validate input interactively\n```\n\n### Why Rust + TUI Dialog Is The Solution\n\n**Rust provides**:\n- Native terminal control (crossterm, termion)\n- Type-safe form definitions\n- Validation rules as functions\n- Secure memory handling (password zeroization)\n- Performance (no subprocess overhead)\n\n**TUI Dialog provides**:\n- Widget library (text, select, multi-select, confirm)\n- Event loop and rendering\n- Validation framework\n- Error display and retry logic\n\n**Integration enables**:\n- Nushell calls Rust CLI → Shows TUI dialog → Returns validated config\n- Nickel receives validated config → Type checks → Merges with defaults\n\n## Consequences\n\n### Positive\n\n- **User Experience**: Professional TUI with validation and guidance\n- **Security**: Password masking, sensitive data protection, no terminal history\n- **Validation**: Type-safe rules enforced before config generation\n- **Developer Experience**: Reusable form components across CLI commands\n- **Error Handling**: Clear validation errors with retry options\n- **Offline First**: No network dependencies for interactive input\n- **Terminal Native**: Fits CLI workflow, no context switching\n- **Maintainability**: Single library for all interactive input\n\n### Negative\n\n- **Terminal Dependency**: Requires interactive terminal (not scriptable)\n- **Learning Curve**: Developers must learn TUI dialog patterns\n- **Library Lock-in**: Tied to specific TUI library API\n- **Testing Complexity**: Interactive tests require terminal mocking\n- **Non-Interactive Fallback**: Need alternative for CI/CD and scripts\n\n### Mitigation Strategies\n\n**Non-Interactive Mode**:\n```\n// Support both interactive and non-interactive\nif terminal::is_interactive() {\n // Show TUI dialog\n let config = show_workspace_form()?;\n} else {\n // Use config file or CLI args\n let config = load_config_from_file(args.config)?;\n}\n```\n\n**Testing**:\n```\n// Unit tests: Test form validation logic (no TUI)\n#[test]\nfn test_validate_workspace_name() {\n assert!(validate_name("my-workspace").is_ok());\n assert!(validate_name("invalid name!").is_err());\n}\n\n// Integration tests: Use mock terminal or config files\n```\n\n**Scriptability**:\n```\n# Batch mode: Provide config via file\nprovisioning workspace init --config workspace.toml\n\n# Interactive mode: Show TUI dialog\nprovisioning workspace init --interactive\n```\n\n**Documentation**:\n- Form schemas documented in `docs/`\n- Config file examples provided\n- Screenshots of TUI forms in guides\n\n## Alternatives Considered\n\n### Alternative 1: Shell-Based Prompts (Current State)\n\n**Pros**: Simple, no dependencies\n**Cons**: No validation, poor UX, security risks\n**Decision**: REJECTED - Inadequate for production use\n\n### Alternative 2: Web-Based Forms\n\n**Pros**: Rich UI, well-known patterns\n**Cons**: Requires server, network dependency, context switch\n**Decision**: REJECTED - Too complex for CLI tool\n\n### Alternative 3: Custom TUI Per Use Case\n\n**Pros**: Tailored to each need\n**Cons**: High maintenance, code duplication, inconsistent UX\n**Decision**: REJECTED - Not sustainable\n\n### Alternative 4: External Form Tool (dialog, whiptail)\n\n**Pros**: Mature, cross-platform\n**Cons**: Subprocess overhead, limited validation, shell escaping issues\n**Decision**: REJECTED - Poor Rust integration\n\n### Alternative 5: Text-Based Config Files Only\n\n**Pros**: Fully scriptable, no interactive 
complexity\n**Cons**: Steep learning curve, no guidance for new users\n**Decision**: REJECTED - Poor user onboarding experience\n\n## Implementation Details\n\n### Form Definition Pattern\n\n```\nuse typdialog::Form;\n\npub fn workspace_initialization_form() -> Result {\n let form = Form::new("Workspace Initialization")\n .add_text_input("name", "Workspace Name")\n .required()\n .validator(|s| validate_workspace_name(s))\n .add_select("provider", "Cloud Provider")\n .options(&["aws", "azure", "oci", "local"])\n .required()\n .add_text_input("region", "Region")\n .default("us-west-2")\n .validator(|s| validate_region(s))\n .add_password("admin_password", "Admin Password")\n .required()\n .min_length(12)\n .add_confirm("enable_monitoring", "Enable Monitoring?")\n .default(true);\n\n let responses = form.run()?;\n\n // Convert to strongly-typed config\n let config = WorkspaceConfig {\n name: responses.get_string("name")?,\n provider: responses.get_string("provider")?.parse()?,\n region: responses.get_string("region")?,\n admin_password: responses.get_password("admin_password")?,\n enable_monitoring: responses.get_bool("enable_monitoring")?,\n };\n\n Ok(config)\n}\n```\n\n### Integration with Nickel\n\n```\n// 1. Get validated input from TUI dialog\nlet config = workspace_initialization_form()?;\n\n// 2. Serialize to TOML/JSON\nlet config_toml = toml::to_string(&config)?;\n\n// 3. Write to workspace config\nfs::write("workspace/config.toml", config_toml)?;\n\n// 4. Nickel merges with defaults\n// nickel export workspace/main.ncl --format json\n// (uses workspace/config.toml as input)\n```\n\n### CLI Command Structure\n\n```\n// provisioning/core/cli/src/commands/workspace.rs\n\n#[derive(Parser)]\npub enum WorkspaceCommand {\n Init {\n #[arg(long)]\n interactive: bool,\n\n #[arg(long)]\n config: Option,\n },\n}\n\npub fn handle_workspace_init(args: InitArgs) -> Result<()> {\n if args.interactive || terminal::is_interactive() {\n // Show TUI dialog\n let config = workspace_initialization_form()?;\n config.save("workspace/config.toml")?;\n } else if let Some(config_path) = args.config {\n // Use provided config\n let config = WorkspaceConfig::load(config_path)?;\n config.save("workspace/config.toml")?;\n } else {\n bail!("Either --interactive or --config required");\n }\n\n // Continue with workspace setup\n Ok(())\n}\n```\n\n### Validation Rules\n\n```\npub fn validate_workspace_name(name: &str) -> Result<(), String> {\n // Alphanumeric, hyphens, 3-32 chars\n let re = Regex::new(r"^[a-z0-9-]{3,32}$").unwrap();\n if !re.is_match(name) {\n return Err("Name must be 3-32 lowercase alphanumeric chars with hyphens".into());\n }\n Ok(())\n}\n\npub fn validate_region(region: &str) -> Result<(), String> {\n const VALID_REGIONS: &[&str] = &["us-west-1", "us-west-2", "us-east-1", "eu-west-1"];\n if !VALID_REGIONS.contains(®ion) {\n return Err(format!("Invalid region. 
Must be one of: {}", VALID_REGIONS.join(", ")));\n }\n Ok(())\n}\n```\n\n### Security: Password Handling\n\n```\nuse zeroize::Zeroizing;\n\npub fn get_secure_password() -> Result> {\n let form = Form::new("Secure Input")\n .add_password("password", "Password")\n .required()\n .min_length(12)\n .validator(password_strength_check);\n\n let responses = form.run()?;\n\n // Password automatically zeroized when dropped\n let password = Zeroizing::new(responses.get_password("password")?);\n\n Ok(password)\n}\n```\n\n## Testing Strategy\n\n**Unit Tests**:\n```\n#[test]\nfn test_workspace_name_validation() {\n assert!(validate_workspace_name("my-workspace").is_ok());\n assert!(validate_workspace_name("UPPERCASE").is_err());\n assert!(validate_workspace_name("ab").is_err()); // Too short\n}\n```\n\n**Integration Tests**:\n```\n// Use non-interactive mode with config files\n#[test]\nfn test_workspace_init_non_interactive() {\n let config = WorkspaceConfig {\n name: "test-workspace".into(),\n provider: Provider::Local,\n region: "us-west-2".into(),\n admin_password: "secure-password-123".into(),\n enable_monitoring: true,\n };\n\n config.save("/tmp/test-config.toml").unwrap();\n\n let result = handle_workspace_init(InitArgs {\n interactive: false,\n config: Some("/tmp/test-config.toml".into()),\n });\n\n assert!(result.is_ok());\n}\n```\n\n**Manual Testing**:\n```\n# Test interactive flow\ncargo build --release\n./target/release/provisioning workspace init --interactive\n\n# Test validation errors\n# - Try invalid workspace name\n# - Try weak password\n# - Try invalid region\n```\n\n## Configuration Integration\n\n**CLI Flag**:\n```\n# provisioning/config/config.defaults.toml\n[ui]\ninteractive_mode = "auto" # "auto" | "always" | "never"\ndialog_theme = "default" # "default" | "minimal" | "colorful"\n```\n\n**Environment Override**:\n```\n# Force non-interactive mode (for CI/CD)\nexport PROVISIONING_INTERACTIVE=false\n\n# Force interactive mode\nexport PROVISIONING_INTERACTIVE=true\n```\n\n## Documentation Requirements\n\n**User Guides**:\n- `docs/user/interactive-configuration.md` - How to use TUI dialogs\n- `docs/guides/workspace-setup.md` - Workspace initialization with screenshots\n\n**Developer Documentation**:\n- `docs/development/tui-forms.md` - Creating new TUI forms\n- Form definition best practices\n- Validation rule patterns\n\n**Configuration Schema**:\n```\n# provisioning/schemas/workspace.ncl\n{\n WorkspaceConfig = {\n name\n | doc "Workspace identifier (3-32 alphanumeric chars with hyphens)"\n | String,\n provider\n | doc "Cloud provider"\n | [| 'aws, 'azure, 'oci, 'local |],\n region\n | doc "Deployment region"\n | String,\n admin_password\n | doc "Admin password (min 12 characters)"\n | String,\n enable_monitoring\n | doc "Enable monitoring services"\n | Bool,\n }\n}\n```\n\n## Migration Path\n\n**Phase 1: Add Library**\n- Add typdialog dependency to `provisioning/core/cli/Cargo.toml`\n- Create TUI dialog wrapper module\n- Implement basic text/select widgets\n\n**Phase 2: Implement Forms**\n- Workspace initialization form\n- Credential setup form\n- Configuration wizard forms\n\n**Phase 3: CLI Integration**\n- Update CLI commands to use TUI dialogs\n- Add `--interactive` / `--config` flags\n- Implement non-interactive fallback\n\n**Phase 4: Documentation**\n- User guides with screenshots\n- Developer documentation for form creation\n- Example configs for non-interactive use\n\n**Phase 5: Testing**\n- Unit tests for validation logic\n- Integration tests with config files\n- Manual 
testing on all platforms\n\n## References\n\n- [typdialog Crate](https://crates.io/crates/typdialog) (or similar: dialoguer, inquire)\n- [crossterm](https://crates.io/crates/crossterm) - Terminal manipulation\n- [zeroize](https://crates.io/crates/zeroize) - Secure memory zeroization\n- ADR-004: Hybrid Architecture (Rust/Nushell integration)\n- ADR-011: Nickel Migration (declarative config language)\n- ADR-012: Nushell Plugins (CLI wrapper patterns)\n- Nushell `input` command limitations: [Nushell Book - Input](https://www.nushell.sh/commands/docs/input.html)\n\n---\n\n**Status**: Accepted\n**Last Updated**: 2025-01-08\n**Implementation**: Planned\n**Priority**: High (User onboarding and security)\n**Estimated Complexity**: Moderate +# ADR-013: Typdialog Web UI Backend Integration for Interactive Configuration + +## Status + +**Accepted** - 2025-01-08 + +## Context + +The provisioning system requires interactive user input for configuration workflows, workspace initialization, credential setup, and guided deployment +scenarios. The system architecture combines Rust (performance-critical), Nushell (scripting), and Nickel (declarative configuration), creating +challenges for interactive form-based input and multi-user collaboration. + +### The Interactive Configuration Problem + +**Current limitations**: + +1. **Nushell CLI**: Terminal-only interaction + - `input` command: Single-line text prompts only + - No form validation, no complex multi-field forms + - Limited to single-user, terminal-bound workflows + - User experience: Basic and error-prone + +2. **Nickel**: Declarative configuration language + - Cannot handle interactive prompts (by design) + - Pure evaluation model (no side effects) + - Forms must be defined statically, not interactively + - No runtime user interaction + +3. **Existing Solutions**: Inadequate for modern infrastructure provisioning + - **Shell-based prompts**: Error-prone, no validation, single-user + - **Custom web forms**: High maintenance, inconsistent UX + - **Separate admin panels**: Disconnected from IaC workflow + - **Terminal-only TUI**: Limited to SSH sessions, no collaboration + +### Use Cases Requiring Interactive Input + +1. **Workspace Initialization**: + ```nushell + # Current: Error-prone prompts + let workspace_name = input "Workspace name: " + let provider = input "Provider (aws/azure/oci): " + # No validation, no autocomplete, no guidance + ``` + +2. **Credential Setup**: + ```nushell + # Current: Insecure and basic + let api_key = input "API Key: " # Shows in terminal history + let region = input "Region: " # No validation + ``` + +3. **Configuration Wizards**: + - Database connection setup (host, port, credentials, SSL) + - Network configuration (CIDR blocks, subnets, gateways) + - Security policies (encryption, access control, audit) + +4. 
**Guided Deployments**: + - Multi-step infrastructure provisioning + - Service selection with dependencies + - Environment-specific overrides + +### Requirements for Interactive Input System + +- ✅ **Terminal UI widgets**: Text input, password, select, multi-select, confirm +- ✅ **Validation**: Type checking, regex patterns, custom validators +- ✅ **Security**: Password masking, sensitive data handling +- ✅ **User Experience**: Arrow key navigation, autocomplete, help text +- ✅ **Composability**: Chain multiple prompts into forms +- ✅ **Error Handling**: Clear validation errors, retry logic +- ✅ **Rust Integration**: Native Rust library (no subprocess overhead) +- ✅ **Cross-Platform**: Works on Linux, macOS, Windows + +## Decision + +Integrate **typdialog** with its **Web UI backend** as the standard interactive configuration interface for the provisioning platform. The major +achievement of typdialog is not the TUI - it is the Web UI backend that enables browser-based forms, multi-user collaboration, and seamless +integration with the provisioning orchestrator. + +### Architecture Diagram + +```text +┌─────────────────────────────────────────┐ +│ Nushell Script │ +│ │ +│ provisioning workspace init │ +│ provisioning config setup │ +│ provisioning deploy guided │ +└────────────┬────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────┐ +│ Rust CLI Handler │ +│ (provisioning/core/cli/) │ +│ │ +│ - Parse command │ +│ - Determine if interactive needed │ +│ - Invoke TUI dialog module │ +└────────────┬────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────┐ +│ TUI Dialog Module │ +│ (typdialog wrapper) │ +│ │ +│ - Form definition (validation rules) │ +│ - Widget rendering (text, select) │ +│ - User input capture │ +│ - Validation execution │ +│ - Result serialization (JSON/TOML) │ +└────────────┬────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────┐ +│ typdialog Library │ +│ │ +│ - Terminal rendering (crossterm) │ +│ - Event handling (keyboard, mouse) │ +│ - Widget state management │ +│ - Input validation engine │ +└────────────┬────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────┐ +│ Terminal (stdout/stdin) │ +│ │ +│ ✅ Rich TUI with validation │ +│ ✅ Secure password input │ +│ ✅ Guided multi-step forms │ +└─────────────────────────────────────────┘ +``` + +### Implementation Characteristics + +**CLI Integration Provides**: + +- ✅ Native Rust commands with TUI dialogs +- ✅ Form-based input for complex configurations +- ✅ Validation rules defined in Rust (type-safe) +- ✅ Secure input (password masking, no history) +- ✅ Error handling with retry logic +- ✅ Serialization to Nickel/TOML/JSON + +**TUI Dialog Library Handles**: + +- ✅ Terminal UI rendering and event loop +- ✅ Widget management (text, select, checkbox, confirm) +- ✅ Input validation and error display +- ✅ Navigation (arrow keys, tab, enter) +- ✅ Cross-platform terminal compatibility + +## Rationale + +### Why TUI Dialog Integration Is Required + +| Aspect | Shell Prompts (current) | Web Forms | TUI Dialog (chosen) | +| -------- | ------------------------- | ----------- | --------------------- | +| **User Experience** | ❌ Basic text only | ✅ Rich UI | ✅ Rich TUI | +| **Validation** | ❌ Manual, error-prone | ✅ Built-in | ✅ Built-in | +| **Security** | ❌ Plain text, history | ⚠️ Network risk | ✅ Secure terminal | +| **Setup Complexity** | ✅ None | ❌ Server required | ✅ Minimal | +| **Terminal Workflow** | ✅ Native | ❌ Browser switch 
| ✅ Native | +| **Offline Support** | ✅ Always | ❌ Requires server | ✅ Always | +| **Dependencies** | ✅ None | ❌ Web stack | ✅ Single crate | +| **Error Handling** | ❌ Manual | ⚠️ Complex | ✅ Built-in retry | + +### The Nushell Limitation + +Nushell's `input` command is limited: + +```text +# Current: No validation, no security +let password = input "Password: " # ❌ Shows in terminal +let region = input "AWS Region: " # ❌ No autocomplete/validation + +# Cannot do: +# - Multi-select from options +# - Conditional fields (if X then ask Y) +# - Password masking +# - Real-time validation +# - Autocomplete/fuzzy search +``` + +### The Nickel Constraint + +Nickel is declarative and cannot prompt users: + +```text +# Nickel defines what the config looks like, NOT how to get it +{ + database = { + host | String, + port | Number, + credentials | { username: String, password: String }, + } +} + +# Nickel cannot: +# - Prompt user for values +# - Show interactive forms +# - Validate input interactively +``` + +### Why Rust + TUI Dialog Is The Solution + +**Rust provides**: +- Native terminal control (crossterm, termion) +- Type-safe form definitions +- Validation rules as functions +- Secure memory handling (password zeroization) +- Performance (no subprocess overhead) + +**TUI Dialog provides**: +- Widget library (text, select, multi-select, confirm) +- Event loop and rendering +- Validation framework +- Error display and retry logic + +**Integration enables**: +- Nushell calls Rust CLI → Shows TUI dialog → Returns validated config +- Nickel receives validated config → Type checks → Merges with defaults + +## Consequences + +### Positive + +- **User Experience**: Professional TUI with validation and guidance +- **Security**: Password masking, sensitive data protection, no terminal history +- **Validation**: Type-safe rules enforced before config generation +- **Developer Experience**: Reusable form components across CLI commands +- **Error Handling**: Clear validation errors with retry options +- **Offline First**: No network dependencies for interactive input +- **Terminal Native**: Fits CLI workflow, no context switching +- **Maintainability**: Single library for all interactive input + +### Negative + +- **Terminal Dependency**: Requires interactive terminal (not scriptable) +- **Learning Curve**: Developers must learn TUI dialog patterns +- **Library Lock-in**: Tied to specific TUI library API +- **Testing Complexity**: Interactive tests require terminal mocking +- **Non-Interactive Fallback**: Need alternative for CI/CD and scripts + +### Mitigation Strategies + +**Non-Interactive Mode**: +```text +// Support both interactive and non-interactive +if terminal::is_interactive() { + // Show TUI dialog + let config = show_workspace_form()?; +} else { + // Use config file or CLI args + let config = load_config_from_file(args.config)?; +} +``` + +**Testing**: +```text +// Unit tests: Test form validation logic (no TUI) +#[test] +fn test_validate_workspace_name() { + assert!(validate_name("my-workspace").is_ok()); + assert!(validate_name("invalid name!").is_err()); +} + +// Integration tests: Use mock terminal or config files +``` + +**Scriptability**: +```text +# Batch mode: Provide config via file +provisioning workspace init --config workspace.toml + +# Interactive mode: Show TUI dialog +provisioning workspace init --interactive +``` + +**Documentation**: +- Form schemas documented in `docs/` +- Config file examples provided +- Screenshots of TUI forms in guides + +## Alternatives Considered + +### 
Alternative 1: Shell-Based Prompts (Current State)
+
+**Pros**: Simple, no dependencies
+**Cons**: No validation, poor UX, security risks
+**Decision**: REJECTED - Inadequate for production use
+
+### Alternative 2: Web-Based Forms
+
+**Pros**: Rich UI, well-known patterns
+**Cons**: Requires server, network dependency, context switch
+**Decision**: REJECTED - Too complex for CLI tool
+
+### Alternative 3: Custom TUI Per Use Case
+
+**Pros**: Tailored to each need
+**Cons**: High maintenance, code duplication, inconsistent UX
+**Decision**: REJECTED - Not sustainable
+
+### Alternative 4: External Form Tool (dialog, whiptail)
+
+**Pros**: Mature, cross-platform
+**Cons**: Subprocess overhead, limited validation, shell escaping issues
+**Decision**: REJECTED - Poor Rust integration
+
+### Alternative 5: Text-Based Config Files Only
+
+**Pros**: Fully scriptable, no interactive complexity
+**Cons**: Steep learning curve, no guidance for new users
+**Decision**: REJECTED - Poor user onboarding experience
+
+## Implementation Details
+
+### Form Definition Pattern
+
+```text
+use typdialog::Form;
+
+pub fn workspace_initialization_form() -> Result<WorkspaceConfig> {
+    let form = Form::new("Workspace Initialization")
+        .add_text_input("name", "Workspace Name")
+        .required()
+        .validator(|s| validate_workspace_name(s))
+        .add_select("provider", "Cloud Provider")
+        .options(&["aws", "azure", "oci", "local"])
+        .required()
+        .add_text_input("region", "Region")
+        .default("us-west-2")
+        .validator(|s| validate_region(s))
+        .add_password("admin_password", "Admin Password")
+        .required()
+        .min_length(12)
+        .add_confirm("enable_monitoring", "Enable Monitoring?")
+        .default(true);
+
+    let responses = form.run()?;
+
+    // Convert to strongly-typed config
+    let config = WorkspaceConfig {
+        name: responses.get_string("name")?,
+        provider: responses.get_string("provider")?.parse()?,
+        region: responses.get_string("region")?,
+        admin_password: responses.get_password("admin_password")?,
+        enable_monitoring: responses.get_bool("enable_monitoring")?,
+    };
+
+    Ok(config)
+}
+```
+
+### Integration with Nickel
+
+```text
+// 1. Get validated input from TUI dialog
+let config = workspace_initialization_form()?;
+
+// 2. Serialize to TOML/JSON
+let config_toml = toml::to_string(&config)?;
+
+// 3. Write to workspace config
+fs::write("workspace/config.toml", config_toml)?;
+
+// 4. Nickel merges with defaults
+// nickel export workspace/main.ncl --format json
+// (uses workspace/config.toml as input)
+```
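+
+The comment at step 4 compresses the hand-off to Nickel; below is a minimal sketch of that step, shelling out to the official Nickel CLI once the TOML is written (assumptions: `nickel` on PATH, the `anyhow` and `serde_json` crates, and the helper name itself):
+
+```rust
+use std::process::Command;
+
+fn evaluate_workspace_config() -> anyhow::Result<serde_json::Value> {
+    // Same CLI-wrapper approach as ADR-012: delegate evaluation to `nickel export`
+    let output = Command::new("nickel")
+        .arg("export")
+        .arg("workspace/main.ncl")
+        .arg("--format")
+        .arg("json")
+        .output()?;
+
+    if !output.status.success() {
+        anyhow::bail!(
+            "nickel export failed: {}",
+            String::from_utf8_lossy(&output.stderr)
+        );
+    }
+
+    // The merged, type-checked configuration arrives as JSON on stdout
+    Ok(serde_json::from_slice(&output.stdout)?)
+}
+```
+
+### CLI Command Structure
+
+```text
+// provisioning/core/cli/src/commands/workspace.rs
+
+#[derive(Parser)]
+pub enum WorkspaceCommand {
+    Init {
+        #[arg(long)]
+        interactive: bool,
+
+        #[arg(long)]
+        config: Option<PathBuf>,
+    },
+}
+
+pub fn handle_workspace_init(args: InitArgs) -> Result<()> {
+    if args.interactive || terminal::is_interactive() {
+        // Show TUI dialog
+        let config = workspace_initialization_form()?;
+        config.save("workspace/config.toml")?;
+    } else if let Some(config_path) = args.config {
+        // Use provided config
+        let config = WorkspaceConfig::load(config_path)?;
+        config.save("workspace/config.toml")?;
+    } else {
+        bail!("Either --interactive or --config required");
+    }
+
+    // Continue with workspace setup
+    Ok(())
+}
+```
+
+### Validation Rules
+
+```text
+pub fn validate_workspace_name(name: &str) -> Result<(), String> {
+    // Alphanumeric, hyphens, 3-32 chars
+    let re = Regex::new(r"^[a-z0-9-]{3,32}$").unwrap();
+    if !re.is_match(name) {
+        return Err("Name must be 3-32 lowercase alphanumeric chars with hyphens".into());
+    }
+    Ok(())
+}
+
+pub fn validate_region(region: &str) -> Result<(), String> {
+    const VALID_REGIONS: &[&str] = &["us-west-1", "us-west-2", "us-east-1", "eu-west-1"];
+    if !VALID_REGIONS.contains(&region) {
+        return Err(format!("Invalid region. Must be one of: {}", VALID_REGIONS.join(", ")));
+    }
+    Ok(())
+}
+```
+
+### Security: Password Handling
+
+```text
+use zeroize::Zeroizing;
+
+pub fn get_secure_password() -> Result<Zeroizing<String>> {
+    let form = Form::new("Secure Input")
+        .add_password("password", "Password")
+        .required()
+        .min_length(12)
+        .validator(password_strength_check);
+
+    let responses = form.run()?;
+
+    // Password automatically zeroized when dropped
+    let password = Zeroizing::new(responses.get_password("password")?);
+
+    Ok(password)
+}
+```
+
+## Testing Strategy
+
+**Unit Tests**:
+```text
+#[test]
+fn test_workspace_name_validation() {
+    assert!(validate_workspace_name("my-workspace").is_ok());
+    assert!(validate_workspace_name("UPPERCASE").is_err());
+    assert!(validate_workspace_name("ab").is_err()); // Too short
+}
+```
+
+**Integration Tests**:
+```text
+// Use non-interactive mode with config files
+#[test]
+fn test_workspace_init_non_interactive() {
+    let config = WorkspaceConfig {
+        name: "test-workspace".into(),
+        provider: Provider::Local,
+        region: "us-west-2".into(),
+        admin_password: "secure-password-123".into(),
+        enable_monitoring: true,
+    };
+
+    config.save("/tmp/test-config.toml").unwrap();
+
+    let result = handle_workspace_init(InitArgs {
+        interactive: false,
+        config: Some("/tmp/test-config.toml".into()),
+    });
+
+    assert!(result.is_ok());
+}
+```
+
+**Manual Testing**:
+```text
+# Test interactive flow
+cargo build --release
+./target/release/provisioning workspace init --interactive
+
+# Test validation errors
+# - Try invalid workspace name
+# - Try weak password
+# - Try invalid region
+```
+
+## Configuration Integration
+
+**CLI Flag**:
+```text
+# provisioning/config/config.defaults.toml
+[ui]
+interactive_mode = "auto" # "auto" | "always" | "never"
+dialog_theme = "default" # "default" | "minimal" | "colorful"
+```
+
+**Environment Override**:
+```text
+# Force non-interactive mode (for CI/CD)
+export PROVISIONING_INTERACTIVE=false
+
+# Force interactive mode
+export PROVISIONING_INTERACTIVE=true
+```
+
+## Documentation Requirements
+
+**User Guides**:
+- 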
`docs/user/interactive-configuration.md` - How to use TUI dialogs +- `docs/guides/workspace-setup.md` - Workspace initialization with screenshots + +**Developer Documentation**: +- `docs/development/tui-forms.md` - Creating new TUI forms +- Form definition best practices +- Validation rule patterns + +**Configuration Schema**: +```text +# provisioning/schemas/workspace.ncl +{ + WorkspaceConfig = { + name + | doc "Workspace identifier (3-32 alphanumeric chars with hyphens)" + | String, + provider + | doc "Cloud provider" + | [| 'aws, 'azure, 'oci, 'local |], + region + | doc "Deployment region" + | String, + admin_password + | doc "Admin password (min 12 characters)" + | String, + enable_monitoring + | doc "Enable monitoring services" + | Bool, + } +} +``` + +## Migration Path + +**Phase 1: Add Library** +- Add typdialog dependency to `provisioning/core/cli/Cargo.toml` +- Create TUI dialog wrapper module +- Implement basic text/select widgets + +**Phase 2: Implement Forms** +- Workspace initialization form +- Credential setup form +- Configuration wizard forms + +**Phase 3: CLI Integration** +- Update CLI commands to use TUI dialogs +- Add `--interactive` / `--config` flags +- Implement non-interactive fallback + +**Phase 4: Documentation** +- User guides with screenshots +- Developer documentation for form creation +- Example configs for non-interactive use + +**Phase 5: Testing** +- Unit tests for validation logic +- Integration tests with config files +- Manual testing on all platforms + +## References + +- [typdialog Crate](https://crates.io/crates/typdialog) (or similar: dialoguer, inquire) +- [crossterm](https://crates.io/crates/crossterm) - Terminal manipulation +- [zeroize](https://crates.io/crates/zeroize) - Secure memory zeroization +- ADR-004: Hybrid Architecture (Rust/Nushell integration) +- ADR-011: Nickel Migration (declarative config language) +- ADR-012: Nushell Plugins (CLI wrapper patterns) +- Nushell `input` command limitations: [Nushell Book - Input](https://www.nushell.sh/commands/docs/input.html) + +--- + +**Status**: Accepted +**Last Updated**: 2025-01-08 +**Implementation**: Planned +**Priority**: High (User onboarding and security) +**Estimated Complexity**: Moderate \ No newline at end of file diff --git a/docs/src/architecture/adr/adr-014-secretumvault-integration.md b/docs/src/architecture/adr/adr-014-secretumvault-integration.md index 604e190..e094696 100644 --- a/docs/src/architecture/adr/adr-014-secretumvault-integration.md +++ b/docs/src/architecture/adr/adr-014-secretumvault-integration.md @@ -1 +1,659 @@ -# ADR-014: SecretumVault Integration for Secrets Management\n\n## Status\n\n**Accepted** - 2025-01-08\n\n## Context\n\nThe provisioning system manages sensitive data across multiple infrastructure layers: cloud provider credentials, database passwords, API keys, SSH\nkeys, encryption keys, and service tokens. The current security architecture (ADR-009) includes SOPS for encrypted config files and Age for key\nmanagement, but lacks a centralized secrets management solution with dynamic secrets, access control, and audit logging.\n\n### Current Secrets Management Challenges\n\n**Existing Approach**:\n\n1. **SOPS + Age**: Static secrets encrypted in config files\n - Good: Version-controlled, gitops-friendly\n - Limited: Static rotation, no audit trail, manual key distribution\n\n2. **Nickel Configuration**: Declarative secrets references\n - Good: Type-safe configuration\n - Limited: Cannot generate dynamic secrets, no lifecycle management\n\n3. 
**Manual Secret Injection**: Environment variables, CLI flags\n - Good: Simple for development\n - Limited: No security guarantees, prone to leakage\n\n### Problems Without Centralized Secrets Management\n\n**Security Issues**:\n- ❌ No centralized audit trail (who accessed which secret when)\n- ❌ No automatic secret rotation policies\n- ❌ No fine-grained access control (Cedar policies not enforced on secrets)\n- ❌ Secrets scattered across: SOPS files, env vars, config files, K8s secrets\n- ❌ No detection of secret sprawl or leaked credentials\n\n**Operational Issues**:\n- ❌ Manual secret rotation (error-prone, often neglected)\n- ❌ No secret versioning (cannot rollback to previous credentials)\n- ❌ Difficult onboarding (manual key distribution)\n- ❌ No dynamic secrets (credentials exist indefinitely)\n\n**Compliance Issues**:\n- ❌ Cannot prove compliance with secret access policies\n- ❌ No audit logs for regulatory requirements\n- ❌ Cannot enforce secret expiration policies\n- ❌ Difficult to demonstrate least-privilege access\n\n### Use Cases Requiring Centralized Secrets Management\n\n1. **Dynamic Database Credentials**:\n - Generate short-lived DB credentials for applications\n - Automatic rotation based on policies\n - Revocation on application termination\n\n2. **Cloud Provider API Keys**:\n - Centralized storage with access control\n - Audit trail of credential usage\n - Automatic rotation schedules\n\n3. **Service-to-Service Authentication**:\n - Dynamic tokens for microservices\n - Short-lived certificates for mTLS\n - Automatic renewal before expiration\n\n4. **SSH Key Management**:\n - Temporal SSH keys (ADR-009 SSH integration)\n - Centralized certificate authority\n - Audit trail of SSH access\n\n5. **Encryption Key Management**:\n - Master encryption keys for data at rest\n - Key rotation and versioning\n - Integration with KMS systems\n\n### Requirements for Secrets Management System\n\n- ✅ **Dynamic Secrets**: Generate credentials on-demand with TTL\n- ✅ **Access Control**: Integration with Cedar authorization policies\n- ✅ **Audit Logging**: Complete trail of secret access and modifications\n- ✅ **Secret Rotation**: Automatic and manual rotation policies\n- ✅ **Versioning**: Track secret versions, enable rollback\n- ✅ **High Availability**: Distributed, fault-tolerant architecture\n- ✅ **Encryption at Rest**: AES-256-GCM for stored secrets\n- ✅ **API-First**: RESTful API for integration\n- ✅ **Plugin Ecosystem**: Extensible backends (AWS, Azure, databases)\n- ✅ **Open Source**: Self-hosted, no vendor lock-in\n\n## Decision\n\nIntegrate **SecretumVault** as the centralized secrets management system for the provisioning platform.\n\n### Architecture Diagram\n\n```\n┌─────────────────────────────────────────────────────────────┐\n│ Provisioning CLI / Orchestrator / Services │\n│ │\n│ - Workspace initialization (credentials) │\n│ - Infrastructure deployment (cloud API keys) │\n│ - Service configuration (database passwords) │\n│ - SSH temporal keys (certificate generation) │\n└────────────┬────────────────────────────────────────────────┘\n │\n ▼\n┌─────────────────────────────────────────────────────────────┐\n│ SecretumVault Client Library (Rust) │\n│ (provisioning/core/libs/secretum-client/) │\n│ │\n│ - Authentication (token, mTLS) │\n│ - Secret CRUD operations │\n│ - Dynamic secret generation │\n│ - Lease renewal and revocation │\n│ - Policy enforcement │\n└────────────┬────────────────────────────────────────────────┘\n │ HTTPS + mTLS\n 
▼\n┌─────────────────────────────────────────────────────────────┐\n│ SecretumVault Server │\n│ (Rust-based Vault implementation) │\n│ │\n│ ┌───────────────────────────────────────────────────┐ │\n│ │ API Layer (REST + gRPC) │ │\n│ ├───────────────────────────────────────────────────┤ │\n│ │ Authentication & Authorization │ │\n│ │ - Token auth, mTLS, OIDC integration │ │\n│ │ - Cedar policy enforcement │ │\n│ ├───────────────────────────────────────────────────┤ │\n│ │ Secret Engines │ │\n│ │ - KV (key-value v2 with versioning) │ │\n│ │ - Database (dynamic credentials) │ │\n│ │ - SSH (certificate authority) │ │\n│ │ - PKI (X.509 certificates) │ │\n│ │ - Cloud Providers (AWS/Azure/OCI) │ │\n│ ├───────────────────────────────────────────────────┤ │\n│ │ Storage Backend │ │\n│ │ - Encrypted storage (AES-256-GCM) │ │\n│ │ - PostgreSQL / Raft cluster │ │\n│ ├───────────────────────────────────────────────────┤ │\n│ │ Audit Backend │ │\n│ │ - Structured logging (JSON) │ │\n│ │ - Syslog, file, database sinks │ │\n│ └───────────────────────────────────────────────────┘ │\n└─────────────────────────────────────────────────────────────┘\n │\n ▼\n┌─────────────────────────────────────────────────────────────┐\n│ Backends (Dynamic Secret Generation) │\n│ │\n│ - PostgreSQL/MySQL (database credentials) │\n│ - AWS IAM (temporary access keys) │\n│ - Azure AD (service principals) │\n│ - SSH CA (signed certificates) │\n│ - PKI (X.509 certificates) │\n└─────────────────────────────────────────────────────────────┘\n```\n\n### Implementation Characteristics\n\n**SecretumVault Provides**:\n\n- ✅ Dynamic secret generation with configurable TTL\n- ✅ Secret versioning and rollback capabilities\n- ✅ Fine-grained access control (Cedar policies)\n- ✅ Complete audit trail (all operations logged)\n- ✅ Automatic secret rotation policies\n- ✅ High availability (Raft consensus)\n- ✅ Encryption at rest (AES-256-GCM)\n- ✅ Plugin architecture for secret backends\n- ✅ RESTful and gRPC APIs\n- ✅ Rust implementation (performance, safety)\n\n**Integration with Provisioning System**:\n\n- ✅ Rust client library (native integration)\n- ✅ Nushell commands via CLI wrapper\n- ✅ Nickel configuration references secrets\n- ✅ Cedar policies control secret access\n- ✅ Orchestrator manages secret lifecycle\n- ✅ SSH integration for temporal keys\n- ✅ KMS integration for encryption keys\n\n## Rationale\n\n### Why SecretumVault Is Required\n\n| Aspect | SOPS + Age (current) | HashiCorp Vault | SecretumVault (chosen) |\n| -------- | ---------------------- | ----------------- | ------------------------ |\n| **Dynamic Secrets** | ❌ Static only | ✅ Full support | ✅ Full support |\n| **Rust Native** | ⚠️ External CLI | ❌ Go binary | ✅ Pure Rust |\n| **Cedar Integration** | ❌ None | ❌ Custom policies | ✅ Native Cedar |\n| **Audit Trail** | ❌ Git only | ✅ Comprehensive | ✅ Comprehensive |\n| **Secret Rotation** | ❌ Manual | ✅ Automatic | ✅ Automatic |\n| **Open Source** | ✅ Yes | ⚠️ MPL 2.0 (BSL now) | ✅ Yes |\n| **Self-Hosted** | ✅ Yes | ✅ Yes | ✅ Yes |\n| **License** | ✅ Permissive | ⚠️ BSL (proprietary) | ✅ Permissive |\n| **Versioning** | ⚠️ Git commits | ✅ Built-in | ✅ Built-in |\n| **High Availability** | ❌ Single file | ✅ Raft cluster | ✅ Raft cluster |\n| **Performance** | ✅ Fast (local) | ⚠️ Network latency | ✅ Rust performance |\n\n### Why Not Continue with SOPS Alone\n\nSOPS is excellent for **static secrets in git**, but inadequate for:\n\n1. **Dynamic Credentials**: Cannot generate temporary DB passwords\n2. 
**Audit Trail**: Git commits are insufficient for compliance\n3. **Rotation Policies**: Manual rotation is error-prone\n4. **Access Control**: No runtime policy enforcement\n5. **Secret Lifecycle**: Cannot track usage or revoke access\n6. **Multi-System Integration**: Limited to files, not API-accessible\n\n**Complementary Approach**:\n- SOPS: Configuration files with long-lived secrets (gitops workflow)\n- SecretumVault: Runtime dynamic secrets, short-lived credentials, audit trail\n\n### Why SecretumVault Over HashiCorp Vault\n\n**HashiCorp Vault Limitations**:\n\n1. **License Change**: BSL (Business Source License) - proprietary for production\n2. **Not Rust Native**: Go binary, subprocess overhead\n3. **Custom Policy Language**: HCL policies, not Cedar (provisioning standard)\n4. **Complex Deployment**: Heavy operational burden\n5. **Vendor Lock-In**: HashiCorp ecosystem dependency\n\n**SecretumVault Advantages**:\n\n1. **Rust Native**: Zero-cost integration, no subprocess spawning\n2. **Cedar Policies**: Consistent with ADR-008 authorization model\n3. **Lightweight**: Smaller binary, lower resource usage\n4. **Open Source**: Permissive license, community-driven\n5. **Provisioning-First**: Designed for IaC workflows\n\n### Integration with Existing Security Architecture\n\n**ADR-009 (Security System)**:\n- SOPS: Static config encryption (unchanged)\n- Age: Key management for SOPS (unchanged)\n- SecretumVault: Dynamic secrets, runtime access control (new)\n\n**ADR-008 (Cedar Authorization)**:\n- Cedar policies control SecretumVault secret access\n- Fine-grained permissions: `read:secret:database/prod/password`\n- Audit trail records Cedar policy decisions\n\n**SSH Temporal Keys**:\n- SecretumVault SSH CA signs user certificates\n- Short-lived certificates (1-24 hours)\n- Audit trail of SSH access\n\n## Consequences\n\n### Positive\n\n- **Security Posture**: Centralized secrets with audit trail and rotation\n- **Compliance**: Complete audit logs for regulatory requirements\n- **Operational Excellence**: Automatic rotation, dynamic credentials\n- **Developer Experience**: Simple API for secret access\n- **Performance**: Rust implementation, zero-cost abstractions\n- **Consistency**: Cedar policies across entire system (auth + secrets)\n- **Observability**: Metrics, logs, traces for secret access\n- **Disaster Recovery**: Secret versioning enables rollback\n\n### Negative\n\n- **Infrastructure Complexity**: Additional service to deploy and operate\n- **High Availability Requirements**: Raft cluster needs 3+ nodes\n- **Migration Effort**: Existing SOPS secrets need migration path\n- **Learning Curve**: Operators must learn vault concepts\n- **Dependency Risk**: Critical path service (secrets unavailable = system down)\n\n### Mitigation Strategies\n\n**High Availability**:\n```\n# Deploy SecretumVault cluster (3 nodes)\nprovisioning deploy secretum-vault --ha --replicas 3\n\n# Automatic leader election via Raft\n# Clients auto-reconnect to leader\n```\n\n**Migration from SOPS**:\n```\n# Phase 1: Import existing SOPS secrets into SecretumVault\nprovisioning secrets migrate --from-sops config/secrets.yaml\n\n# Phase 2: Update Nickel configs to reference vault paths\n# Phase 3: Deprecate SOPS for runtime secrets (keep for config files)\n```\n\n**Fallback Strategy**:\n```\n// Graceful degradation if vault unavailable\nlet secret = match vault_client.get_secret("database/password").await {\n Ok(s) => s,\n Err(VaultError::Unavailable) => {\n // Fallback to SOPS for read-only operations\n 
warn!("Vault unavailable, using SOPS fallback");\n sops_decrypt("config/secrets.yaml", "database.password")?\n },\n Err(e) => return Err(e),\n};\n```\n\n**Operational Monitoring**:\n```\n# prometheus metrics\nsecretum_vault_request_duration_seconds\nsecretum_vault_secret_lease_expiry\nsecretum_vault_auth_failures_total\nsecretum_vault_raft_leader_changes\n\n# Alerts: Vault unavailable, high auth failure rate, lease expiry\n```\n\n## Alternatives Considered\n\n### Alternative 1: Continue with SOPS Only\n\n**Pros**: No new infrastructure, simple\n**Cons**: No dynamic secrets, no audit trail, manual rotation\n**Decision**: REJECTED - Insufficient for production security\n\n### Alternative 2: HashiCorp Vault\n\n**Pros**: Mature, feature-rich, widely adopted\n**Cons**: BSL license, Go binary, HCL policies (not Cedar), complex deployment\n**Decision**: REJECTED - License and integration concerns\n\n### Alternative 3: Cloud Provider Native (AWS Secrets Manager, Azure Key Vault)\n\n**Pros**: Fully managed, high availability\n**Cons**: Vendor lock-in, multi-cloud complexity, cost at scale\n**Decision**: REJECTED - Against open-source and multi-cloud principles\n\n### Alternative 4: CyberArk, 1Password, and Others\n\n**Pros**: Enterprise features\n**Cons**: Proprietary, expensive, poor API integration\n**Decision**: REJECTED - Not suitable for IaC automation\n\n### Alternative 5: Build Custom Secrets Manager\n\n**Pros**: Full control, tailored to needs\n**Cons**: High maintenance burden, security risk, reinventing wheel\n**Decision**: REJECTED - SecretumVault provides this already\n\n## Implementation Details\n\n### SecretumVault Deployment\n\n```\n# Deploy via provisioning system\nprovisioning deploy secretum-vault \\n --ha \\n --replicas 3 \\n --storage postgres \\n --tls-cert /path/to/cert.pem \\n --tls-key /path/to/key.pem\n\n# Initialize and unseal\nprovisioning vault init\nprovisioning vault unseal --key-shares 5 --key-threshold 3\n```\n\n### Rust Client Library\n\n```\n// provisioning/core/libs/secretum-client/src/lib.rs\n\nuse secretum_vault::{Client, SecretEngine, Auth};\n\npub struct VaultClient {\n client: Client,\n}\n\nimpl VaultClient {\n pub async fn new(addr: &str, token: &str) -> Result {\n let client = Client::new(addr)\n .auth(Auth::Token(token))\n .tls_config(TlsConfig::from_files("ca.pem", "cert.pem", "key.pem"))?\n .build()?;\n\n Ok(Self { client })\n }\n\n pub async fn get_secret(&self, path: &str) -> Result {\n self.client.kv2().get(path).await\n }\n\n pub async fn create_dynamic_db_credentials(&self, role: &str) -> Result {\n self.client.database().generate_credentials(role).await\n }\n\n pub async fn sign_ssh_key(&self, public_key: &str, ttl: Duration) -> Result {\n self.client.ssh().sign_key(public_key, ttl).await\n }\n}\n```\n\n### Nushell Integration\n\n```\n# Nushell commands via Rust CLI wrapper\nprovisioning secrets get database/prod/password\nprovisioning secrets set api/keys/stripe --value "sk_live_xyz"\nprovisioning secrets rotate database/prod/password\nprovisioning secrets lease renew lease_id_12345\nprovisioning secrets list database/\n```\n\n### Nickel Configuration Integration\n\n```\n# provisioning/schemas/database.ncl\n{\n database = {\n host = "postgres.example.com",\n port = 5432,\n username = secrets.get "database/prod/username",\n password = secrets.get "database/prod/password",\n }\n}\n\n# Nickel function: secrets.get resolves to SecretumVault API call\n```\n\n### Cedar Policy for Secret Access\n\n```\n// policy: developers can read dev secrets, not 
prod\npermit(\n principal in Group::"developers",\n action == Action::"read",\n resource in Secret::"database/dev"\n);\n\nforbid(\n principal in Group::"developers",\n action == Action::"read",\n resource in Secret::"database/prod"\n);\n\n// policy: CI/CD can generate dynamic DB credentials\npermit(\n principal == Service::"github-actions",\n action == Action::"generate",\n resource in Secret::"database/dynamic"\n) when {\n context.ttl <= duration("1h")\n};\n```\n\n### Dynamic Database Credentials\n\n```\n// Application requests temporary DB credentials\nlet creds = vault_client\n .database()\n .generate_credentials("postgres-readonly")\n .await?;\n\nprintln!("Username: {}", creds.username); // v-app-abcd1234\nprintln!("Password: {}", creds.password); // random-secure-password\nprintln!("TTL: {}", creds.lease_duration); // 1h\n\n// Credentials automatically revoked after TTL\n// No manual cleanup needed\n```\n\n### Secret Rotation Automation\n\n```\n# secretum-vault config\n[[rotation_policies]]\npath = "database/prod/password"\nschedule = "0 0 * * 0" # Weekly on Sunday midnight\nmax_age = "30d"\n\n[[rotation_policies]]\npath = "api/keys/stripe"\nschedule = "0 0 1 * *" # Monthly on 1st\nmax_age = "90d"\n```\n\n### Audit Log Format\n\n```\n{\n "timestamp": "2025-01-08T12:34:56Z",\n "type": "request",\n "auth": {\n "client_token": "sha256:abc123...",\n "accessor": "hmac:def456...",\n "display_name": "service-orchestrator",\n "policies": ["default", "service-policy"]\n },\n "request": {\n "operation": "read",\n "path": "secret/data/database/prod/password",\n "remote_address": "10.0.1.5"\n },\n "response": {\n "status": 200\n },\n "cedar_policy": {\n "decision": "permit",\n "policy_id": "allow-orchestrator-read-secrets"\n }\n}\n```\n\n## Testing Strategy\n\n**Unit Tests**:\n```\n#[tokio::test]\nasync fn test_get_secret() {\n let vault = mock_vault_client();\n let secret = vault.get_secret("test/secret").await.unwrap();\n assert_eq!(secret.value, "expected-value");\n}\n\n#[tokio::test]\nasync fn test_dynamic_credentials_generation() {\n let vault = mock_vault_client();\n let creds = vault.create_dynamic_db_credentials("postgres-readonly").await.unwrap();\n assert!(creds.username.starts_with("v-"));\n assert_eq!(creds.lease_duration, Duration::from_secs(3600));\n}\n```\n\n**Integration Tests**:\n```\n# Test vault deployment\nprovisioning deploy secretum-vault --test-mode\nprovisioning vault init\nprovisioning vault unseal\n\n# Test secret operations\nprovisioning secrets set test/secret --value "test-value"\nprovisioning secrets get test/secret | assert "test-value"\n\n# Test dynamic credentials\nprovisioning secrets db-creds postgres-readonly | jq '.username' | assert-contains "v-"\n\n# Test rotation\nprovisioning secrets rotate test/secret\n```\n\n**Security Tests**:\n```\n#[tokio::test]\nasync fn test_unauthorized_access_denied() {\n let vault = vault_client_with_limited_token();\n let result = vault.get_secret("database/prod/password").await;\n assert!(matches!(result, Err(VaultError::PermissionDenied)));\n}\n```\n\n## Configuration Integration\n\n**Provisioning Config**:\n```\n# provisioning/config/config.defaults.toml\n[secrets]\nprovider = "secretum-vault" # "secretum-vault" | "sops" | "env"\nvault_addr = "https://vault.example.com:8200"\nvault_namespace = "provisioning"\nvault_mount = "secret"\n\n[secrets.tls]\nca_cert = "/etc/provisioning/vault-ca.pem"\nclient_cert = "/etc/provisioning/vault-client.pem"\nclient_key = "/etc/provisioning/vault-client-key.pem"\n\n[secrets.cache]\nenabled = 
true\nttl = "5m"\nmax_size = "100MB"\n```\n\n**Environment Variables**:\n```\nexport VAULT_ADDR="https://vault.example.com:8200"\nexport VAULT_TOKEN="s.abc123def456..."\nexport VAULT_NAMESPACE="provisioning"\nexport VAULT_CACERT="/etc/provisioning/vault-ca.pem"\n```\n\n## Migration Path\n\n**Phase 1: Deploy SecretumVault**\n- Deploy vault cluster in HA mode\n- Initialize and configure backends\n- Set up Cedar policies\n\n**Phase 2: Migrate Static Secrets**\n- Import SOPS secrets into vault KV store\n- Update Nickel configs to reference vault paths\n- Verify secret access via new API\n\n**Phase 3: Enable Dynamic Secrets**\n- Configure database secret engine\n- Configure SSH CA secret engine\n- Update applications to use dynamic credentials\n\n**Phase 4: Deprecate SOPS for Runtime**\n- SOPS remains for gitops config files\n- Runtime secrets exclusively from vault\n- Audit trail enforcement\n\n**Phase 5: Automation**\n- Automatic rotation policies\n- Lease renewal automation\n- Monitoring and alerting\n\n## Documentation Requirements\n\n**User Guides**:\n- `docs/user/secrets-management.md` - Using SecretumVault\n- `docs/user/dynamic-credentials.md` - Dynamic secret workflows\n- `docs/user/secret-rotation.md` - Rotation policies and procedures\n\n**Operations Documentation**:\n- `docs/operations/vault-deployment.md` - Deploying and configuring vault\n- `docs/operations/vault-backup-restore.md` - Backup and disaster recovery\n- `docs/operations/vault-monitoring.md` - Metrics, logs, alerts\n\n**Developer Documentation**:\n- `docs/development/secrets-api.md` - Rust client library usage\n- `docs/development/cedar-secret-policies.md` - Writing Cedar policies for secrets\n- Secret engine development guide\n\n**Security Documentation**:\n- `docs/security/secrets-architecture.md` - Security architecture overview\n- `docs/security/audit-logging.md` - Audit trail and compliance\n- Threat model and risk assessment\n\n## References\n\n- [SecretumVault GitHub](https://github.com/secretum-vault/secretum) (hypothetical, replace with actual)\n- [HashiCorp Vault Documentation](https://www.vaultproject.io/docs) (for comparison)\n- ADR-008: Cedar Authorization (policy integration)\n- ADR-009: Security System Complete (current security architecture)\n- [Raft Consensus Algorithm](https://raft.github.io/)\n- [Cedar Policy Language](https://www.cedarpolicy.com/)\n- SOPS: [https://github.com/getsops/sops](https://github.com/getsops/sops)\n- Age Encryption: [https://age-encryption.org/](https://age-encryption.org/)\n\n---\n\n**Status**: Accepted\n**Last Updated**: 2025-01-08\n**Implementation**: Planned\n**Priority**: High (Security and compliance)\n**Estimated Complexity**: Complex +# ADR-014: SecretumVault Integration for Secrets Management + +## Status + +**Accepted** - 2025-01-08 + +## Context + +The provisioning system manages sensitive data across multiple infrastructure layers: cloud provider credentials, database passwords, API keys, SSH +keys, encryption keys, and service tokens. The current security architecture (ADR-009) includes SOPS for encrypted config files and Age for key +management, but lacks a centralized secrets management solution with dynamic secrets, access control, and audit logging. + +### Current Secrets Management Challenges + +**Existing Approach**: + +1. **SOPS + Age**: Static secrets encrypted in config files + - Good: Version-controlled, gitops-friendly + - Limited: Static rotation, no audit trail, manual key distribution + +2. 
**Nickel Configuration**: Declarative secrets references + - Good: Type-safe configuration + - Limited: Cannot generate dynamic secrets, no lifecycle management + +3. **Manual Secret Injection**: Environment variables, CLI flags + - Good: Simple for development + - Limited: No security guarantees, prone to leakage + +### Problems Without Centralized Secrets Management + +**Security Issues**: +- ❌ No centralized audit trail (who accessed which secret when) +- ❌ No automatic secret rotation policies +- ❌ No fine-grained access control (Cedar policies not enforced on secrets) +- ❌ Secrets scattered across: SOPS files, env vars, config files, K8s secrets +- ❌ No detection of secret sprawl or leaked credentials + +**Operational Issues**: +- ❌ Manual secret rotation (error-prone, often neglected) +- ❌ No secret versioning (cannot rollback to previous credentials) +- ❌ Difficult onboarding (manual key distribution) +- ❌ No dynamic secrets (credentials exist indefinitely) + +**Compliance Issues**: +- ❌ Cannot prove compliance with secret access policies +- ❌ No audit logs for regulatory requirements +- ❌ Cannot enforce secret expiration policies +- ❌ Difficult to demonstrate least-privilege access + +### Use Cases Requiring Centralized Secrets Management + +1. **Dynamic Database Credentials**: + - Generate short-lived DB credentials for applications + - Automatic rotation based on policies + - Revocation on application termination + +2. **Cloud Provider API Keys**: + - Centralized storage with access control + - Audit trail of credential usage + - Automatic rotation schedules + +3. **Service-to-Service Authentication**: + - Dynamic tokens for microservices + - Short-lived certificates for mTLS + - Automatic renewal before expiration + +4. **SSH Key Management**: + - Temporal SSH keys (ADR-009 SSH integration) + - Centralized certificate authority + - Audit trail of SSH access + +5. **Encryption Key Management**: + - Master encryption keys for data at rest + - Key rotation and versioning + - Integration with KMS systems + +### Requirements for Secrets Management System + +- ✅ **Dynamic Secrets**: Generate credentials on-demand with TTL +- ✅ **Access Control**: Integration with Cedar authorization policies +- ✅ **Audit Logging**: Complete trail of secret access and modifications +- ✅ **Secret Rotation**: Automatic and manual rotation policies +- ✅ **Versioning**: Track secret versions, enable rollback +- ✅ **High Availability**: Distributed, fault-tolerant architecture +- ✅ **Encryption at Rest**: AES-256-GCM for stored secrets +- ✅ **API-First**: RESTful API for integration +- ✅ **Plugin Ecosystem**: Extensible backends (AWS, Azure, databases) +- ✅ **Open Source**: Self-hosted, no vendor lock-in + +## Decision + +Integrate **SecretumVault** as the centralized secrets management system for the provisioning platform. 
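+
+Client-side, the decision reduces to a small Rust surface. A minimal usage sketch (method names follow the client library outlined later in this ADR; the `secretum_client` crate name and `anyhow` error handling are assumptions):
+
+```rust
+use secretum_client::VaultClient; // hypothetical crate (provisioning/core/libs/secretum-client/)
+
+async fn fetch_secrets() -> anyhow::Result<()> {
+    // Token auth shown here; mTLS is also supported per the client library sketch
+    let vault = VaultClient::new("https://vault.example.com:8200", "s.example-token").await?;
+
+    // Static secret from the versioned KV engine
+    let _db_password = vault.get_secret("database/prod/password").await?;
+
+    // Short-lived dynamic credentials, revoked automatically when the lease expires
+    let creds = vault.create_dynamic_db_credentials("postgres-readonly").await?;
+    println!("temporary user: {}", creds.username);
+    Ok(())
+}
+```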
+ +### Architecture Diagram + +```text +┌─────────────────────────────────────────────────────────────┐ +│ Provisioning CLI / Orchestrator / Services │ +│ │ +│ - Workspace initialization (credentials) │ +│ - Infrastructure deployment (cloud API keys) │ +│ - Service configuration (database passwords) │ +│ - SSH temporal keys (certificate generation) │ +└────────────┬────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────┐ +│ SecretumVault Client Library (Rust) │ +│ (provisioning/core/libs/secretum-client/) │ +│ │ +│ - Authentication (token, mTLS) │ +│ - Secret CRUD operations │ +│ - Dynamic secret generation │ +│ - Lease renewal and revocation │ +│ - Policy enforcement │ +└────────────┬────────────────────────────────────────────────┘ + │ HTTPS + mTLS + ▼ +┌─────────────────────────────────────────────────────────────┐ +│ SecretumVault Server │ +│ (Rust-based Vault implementation) │ +│ │ +│ ┌───────────────────────────────────────────────────┐ │ +│ │ API Layer (REST + gRPC) │ │ +│ ├───────────────────────────────────────────────────┤ │ +│ │ Authentication & Authorization │ │ +│ │ - Token auth, mTLS, OIDC integration │ │ +│ │ - Cedar policy enforcement │ │ +│ ├───────────────────────────────────────────────────┤ │ +│ │ Secret Engines │ │ +│ │ - KV (key-value v2 with versioning) │ │ +│ │ - Database (dynamic credentials) │ │ +│ │ - SSH (certificate authority) │ │ +│ │ - PKI (X.509 certificates) │ │ +│ │ - Cloud Providers (AWS/Azure/OCI) │ │ +│ ├───────────────────────────────────────────────────┤ │ +│ │ Storage Backend │ │ +│ │ - Encrypted storage (AES-256-GCM) │ │ +│ │ - PostgreSQL / Raft cluster │ │ +│ ├───────────────────────────────────────────────────┤ │ +│ │ Audit Backend │ │ +│ │ - Structured logging (JSON) │ │ +│ │ - Syslog, file, database sinks │ │ +│ └───────────────────────────────────────────────────┘ │ +└─────────────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────┐ +│ Backends (Dynamic Secret Generation) │ +│ │ +│ - PostgreSQL/MySQL (database credentials) │ +│ - AWS IAM (temporary access keys) │ +│ - Azure AD (service principals) │ +│ - SSH CA (signed certificates) │ +│ - PKI (X.509 certificates) │ +└─────────────────────────────────────────────────────────────┘ +``` + +### Implementation Characteristics + +**SecretumVault Provides**: + +- ✅ Dynamic secret generation with configurable TTL +- ✅ Secret versioning and rollback capabilities +- ✅ Fine-grained access control (Cedar policies) +- ✅ Complete audit trail (all operations logged) +- ✅ Automatic secret rotation policies +- ✅ High availability (Raft consensus) +- ✅ Encryption at rest (AES-256-GCM) +- ✅ Plugin architecture for secret backends +- ✅ RESTful and gRPC APIs +- ✅ Rust implementation (performance, safety) + +**Integration with Provisioning System**: + +- ✅ Rust client library (native integration) +- ✅ Nushell commands via CLI wrapper +- ✅ Nickel configuration references secrets +- ✅ Cedar policies control secret access +- ✅ Orchestrator manages secret lifecycle +- ✅ SSH integration for temporal keys +- ✅ KMS integration for encryption keys + +## Rationale + +### Why SecretumVault Is Required + +| Aspect | SOPS + Age (current) | HashiCorp Vault | SecretumVault (chosen) | +| -------- | ---------------------- | ----------------- | ------------------------ | +| **Dynamic Secrets** | ❌ Static only | ✅ Full support | ✅ Full support | +| **Rust Native** | ⚠️ External CLI | ❌ Go binary | ✅ 
Pure Rust |
+| **Cedar Integration** | ❌ None | ❌ Custom policies | ✅ Native Cedar |
+| **Audit Trail** | ❌ Git only | ✅ Comprehensive | ✅ Comprehensive |
+| **Secret Rotation** | ❌ Manual | ✅ Automatic | ✅ Automatic |
+| **Open Source** | ✅ Yes | ⚠️ Was MPL 2.0, now BSL | ✅ Yes |
+| **Self-Hosted** | ✅ Yes | ✅ Yes | ✅ Yes |
+| **License** | ✅ Permissive | ⚠️ BSL (source-available, usage-restricted) | ✅ Permissive |
+| **Versioning** | ⚠️ Git commits | ✅ Built-in | ✅ Built-in |
+| **High Availability** | ❌ Single file | ✅ Raft cluster | ✅ Raft cluster |
+| **Performance** | ✅ Fast (local) | ⚠️ Network latency | ✅ Rust performance |
+
+### Why Not Continue with SOPS Alone
+
+SOPS is excellent for **static secrets in git**, but inadequate for:
+
+1. **Dynamic Credentials**: Cannot generate temporary DB passwords
+2. **Audit Trail**: Git commits are insufficient for compliance
+3. **Rotation Policies**: Manual rotation is error-prone
+4. **Access Control**: No runtime policy enforcement
+5. **Secret Lifecycle**: Cannot track usage or revoke access
+6. **Multi-System Integration**: Limited to files, not API-accessible
+
+**Complementary Approach**:
+- SOPS: Configuration files with long-lived secrets (gitops workflow)
+- SecretumVault: Runtime dynamic secrets, short-lived credentials, audit trail
+
+### Why SecretumVault Over HashiCorp Vault
+
+**HashiCorp Vault Limitations**:
+
+1. **License Change**: BSL (Business Source License) - source-available, with restrictions on production use
+2. **Not Rust Native**: Go binary, subprocess overhead
+3. **Custom Policy Language**: HCL policies, not Cedar (provisioning standard)
+4. **Complex Deployment**: Heavy operational burden
+5. **Vendor Lock-In**: HashiCorp ecosystem dependency
+
+**SecretumVault Advantages**:
+
+1. **Rust Native**: Zero-cost integration, no subprocess spawning
+2. **Cedar Policies**: Consistent with ADR-008 authorization model
+3. **Lightweight**: Smaller binary, lower resource usage
+4. **Open Source**: Permissive license, community-driven
+5. 
**Provisioning-First**: Designed for IaC workflows + +### Integration with Existing Security Architecture + +**ADR-009 (Security System)**: +- SOPS: Static config encryption (unchanged) +- Age: Key management for SOPS (unchanged) +- SecretumVault: Dynamic secrets, runtime access control (new) + +**ADR-008 (Cedar Authorization)**: +- Cedar policies control SecretumVault secret access +- Fine-grained permissions: `read:secret:database/prod/password` +- Audit trail records Cedar policy decisions + +**SSH Temporal Keys**: +- SecretumVault SSH CA signs user certificates +- Short-lived certificates (1-24 hours) +- Audit trail of SSH access + +## Consequences + +### Positive + +- **Security Posture**: Centralized secrets with audit trail and rotation +- **Compliance**: Complete audit logs for regulatory requirements +- **Operational Excellence**: Automatic rotation, dynamic credentials +- **Developer Experience**: Simple API for secret access +- **Performance**: Rust implementation, zero-cost abstractions +- **Consistency**: Cedar policies across entire system (auth + secrets) +- **Observability**: Metrics, logs, traces for secret access +- **Disaster Recovery**: Secret versioning enables rollback + +### Negative + +- **Infrastructure Complexity**: Additional service to deploy and operate +- **High Availability Requirements**: Raft cluster needs 3+ nodes +- **Migration Effort**: Existing SOPS secrets need migration path +- **Learning Curve**: Operators must learn vault concepts +- **Dependency Risk**: Critical path service (secrets unavailable = system down) + +### Mitigation Strategies + +**High Availability**: +```text +# Deploy SecretumVault cluster (3 nodes) +provisioning deploy secretum-vault --ha --replicas 3 + +# Automatic leader election via Raft +# Clients auto-reconnect to leader +``` + +**Migration from SOPS**: +```text +# Phase 1: Import existing SOPS secrets into SecretumVault +provisioning secrets migrate --from-sops config/secrets.yaml + +# Phase 2: Update Nickel configs to reference vault paths +# Phase 3: Deprecate SOPS for runtime secrets (keep for config files) +``` + +**Fallback Strategy**: +```text +// Graceful degradation if vault unavailable +let secret = match vault_client.get_secret("database/password").await { + Ok(s) => s, + Err(VaultError::Unavailable) => { + // Fallback to SOPS for read-only operations + warn!("Vault unavailable, using SOPS fallback"); + sops_decrypt("config/secrets.yaml", "database.password")? 
+    },
+    Err(e) => return Err(e),
+};
+```
+
+**Operational Monitoring**:
+```text
+# Prometheus metrics
+secretum_vault_request_duration_seconds
+secretum_vault_secret_lease_expiry
+secretum_vault_auth_failures_total
+secretum_vault_raft_leader_changes
+
+# Alerts: Vault unavailable, high auth failure rate, lease expiry
+```
+
+## Alternatives Considered
+
+### Alternative 1: Continue with SOPS Only
+
+**Pros**: No new infrastructure, simple
+**Cons**: No dynamic secrets, no audit trail, manual rotation
+**Decision**: REJECTED - Insufficient for production security
+
+### Alternative 2: HashiCorp Vault
+
+**Pros**: Mature, feature-rich, widely adopted
+**Cons**: BSL license, Go binary, HCL policies (not Cedar), complex deployment
+**Decision**: REJECTED - License and integration concerns
+
+### Alternative 3: Cloud Provider Native (AWS Secrets Manager, Azure Key Vault)
+
+**Pros**: Fully managed, high availability
+**Cons**: Vendor lock-in, multi-cloud complexity, cost at scale
+**Decision**: REJECTED - Against open-source and multi-cloud principles
+
+### Alternative 4: CyberArk, 1Password, and Others
+
+**Pros**: Enterprise features
+**Cons**: Proprietary, expensive, poor API integration
+**Decision**: REJECTED - Not suitable for IaC automation
+
+### Alternative 5: Build Custom Secrets Manager
+
+**Pros**: Full control, tailored to needs
+**Cons**: High maintenance burden, security risk, reinventing the wheel
+**Decision**: REJECTED - SecretumVault provides this already
+
+## Implementation Details
+
+### SecretumVault Deployment
+
+```text
+# Deploy via provisioning system
+provisioning deploy secretum-vault \
+  --ha \
+  --replicas 3 \
+  --storage postgres \
+  --tls-cert /path/to/cert.pem \
+  --tls-key /path/to/key.pem
+
+# Initialize and unseal
+provisioning vault init
+provisioning vault unseal --key-shares 5 --key-threshold 3
+```
+
+### Rust Client Library
+
+```text
+// provisioning/core/libs/secretum-client/src/lib.rs
+
+use std::time::Duration;
+
+// Secret, DatabaseCredentials, SignedKey, TlsConfig and VaultError are
+// assumed exports of the client crate in this sketch.
+use secretum_vault::{Auth, Client, DatabaseCredentials, Secret, SecretEngine, SignedKey, TlsConfig, VaultError};
+
+pub struct VaultClient {
+    client: Client,
+}
+
+impl VaultClient {
+    pub async fn new(addr: &str, token: &str) -> Result<Self, VaultError> {
+        let client = Client::new(addr)
+            .auth(Auth::Token(token))
+            .tls_config(TlsConfig::from_files("ca.pem", "cert.pem", "key.pem"))?
+            .build()?;
+
+        Ok(Self { client })
+    }
+
+    // Static secret from the versioned KV store.
+    pub async fn get_secret(&self, path: &str) -> Result<Secret, VaultError> {
+        self.client.kv2().get(path).await
+    }
+
+    // Dynamic credential: generated on demand, revoked when its lease expires.
+    pub async fn create_dynamic_db_credentials(&self, role: &str) -> Result<DatabaseCredentials, VaultError> {
+        self.client.database().generate_credentials(role).await
+    }
+
+    // SSH CA: sign a public key into a short-lived certificate.
+    pub async fn sign_ssh_key(&self, public_key: &str, ttl: Duration) -> Result<SignedKey, VaultError> {
+        self.client.ssh().sign_key(public_key, ttl).await
+    }
+}
+```
+
+### Nushell Integration
+
+```text
+# Nushell commands via Rust CLI wrapper
+provisioning secrets get database/prod/password
+provisioning secrets set api/keys/stripe --value "sk_live_xyz"
+provisioning secrets rotate database/prod/password
+provisioning secrets lease renew lease_id_12345
+provisioning secrets list database/
+```
+
+### Nickel Configuration Integration
+
+```text
+# provisioning/schemas/database.ncl
+{
+  database = {
+    host = "postgres.example.com",
+    port = 5432,
+    username = secrets.get "database/prod/username",
+    password = secrets.get "database/prod/password",
+  }
+}
+
+# Nickel function: secrets.get resolves to SecretumVault API call
+```
+
+### Cedar Policy for Secret Access
+
+```text
+// policy: developers can read dev secrets, not prod
+permit(
+  principal in Group::"developers",
+  action == Action::"read",
+  resource in Secret::"database/dev"
+);
+
+forbid(
+  principal in Group::"developers",
+  action == Action::"read",
+  resource in Secret::"database/prod"
+);
+
+// policy: CI/CD can generate dynamic DB credentials
+permit(
+  principal == Service::"github-actions",
+  action == Action::"generate",
+  resource in Secret::"database/dynamic"
+) when {
+  context.ttl <= duration("1h")
+};
+```
+
+### Dynamic Database Credentials
+
+```text
+// Application requests temporary DB credentials
+let creds = vault_client
+    .database()
+    .generate_credentials("postgres-readonly")
+    .await?;
+
+println!("Username: {}", creds.username); // v-app-abcd1234
+println!("Password: {}", creds.password); // random-secure-password
+println!("TTL: {}", creds.lease_duration); // 1h
+
+// Credentials automatically revoked after TTL
+// No manual cleanup needed
+```
+
+### Secret Rotation Automation
+
+```text
+# secretum-vault config
+[[rotation_policies]]
+path = "database/prod/password"
+schedule = "0 0 * * 0" # Weekly on Sunday midnight
+max_age = "30d"
+
+[[rotation_policies]]
+path = "api/keys/stripe"
+schedule = "0 0 1 * *" # Monthly on 1st
+max_age = "90d"
+```
+
+### Audit Log Format
+
+```text
+{
+  "timestamp": "2025-01-08T12:34:56Z",
+  "type": "request",
+  "auth": {
+    "client_token": "sha256:abc123...",
+    "accessor": "hmac:def456...",
+    "display_name": "service-orchestrator",
+    "policies": ["default", "service-policy"]
+  },
+  "request": {
+    "operation": "read",
+    "path": "secret/data/database/prod/password",
+    "remote_address": "10.0.1.5"
+  },
+  "response": {
+    "status": 200
+  },
+  "cedar_policy": {
+    "decision": "permit",
+    "policy_id": "allow-orchestrator-read-secrets"
+  }
+}
+```
+
+## Testing Strategy
+
+**Unit Tests**:
+```text
+#[tokio::test]
+async fn test_get_secret() {
+    let vault = mock_vault_client();
+    let secret = vault.get_secret("test/secret").await.unwrap();
+    assert_eq!(secret.value, "expected-value");
+}
+
+#[tokio::test]
+async fn test_dynamic_credentials_generation() {
+    let vault = mock_vault_client();
+    let creds = vault.create_dynamic_db_credentials("postgres-readonly").await.unwrap();
+    assert!(creds.username.starts_with("v-"));
+    assert_eq!(creds.lease_duration, Duration::from_secs(3600));
+}
+```
+
+**Integration Tests**:
+```text
+# 
Test vault deployment +provisioning deploy secretum-vault --test-mode +provisioning vault init +provisioning vault unseal + +# Test secret operations +provisioning secrets set test/secret --value "test-value" +provisioning secrets get test/secret | assert "test-value" + +# Test dynamic credentials +provisioning secrets db-creds postgres-readonly | jq '.username' | assert-contains "v-" + +# Test rotation +provisioning secrets rotate test/secret +``` + +**Security Tests**: +```text +#[tokio::test] +async fn test_unauthorized_access_denied() { + let vault = vault_client_with_limited_token(); + let result = vault.get_secret("database/prod/password").await; + assert!(matches!(result, Err(VaultError::PermissionDenied))); +} +``` + +## Configuration Integration + +**Provisioning Config**: +```text +# provisioning/config/config.defaults.toml +[secrets] +provider = "secretum-vault" # "secretum-vault" | "sops" | "env" +vault_addr = "https://vault.example.com:8200" +vault_namespace = "provisioning" +vault_mount = "secret" + +[secrets.tls] +ca_cert = "/etc/provisioning/vault-ca.pem" +client_cert = "/etc/provisioning/vault-client.pem" +client_key = "/etc/provisioning/vault-client-key.pem" + +[secrets.cache] +enabled = true +ttl = "5m" +max_size = "100MB" +``` + +**Environment Variables**: +```text +export VAULT_ADDR="https://vault.example.com:8200" +export VAULT_TOKEN="s.abc123def456..." +export VAULT_NAMESPACE="provisioning" +export VAULT_CACERT="/etc/provisioning/vault-ca.pem" +``` + +## Migration Path + +**Phase 1: Deploy SecretumVault** +- Deploy vault cluster in HA mode +- Initialize and configure backends +- Set up Cedar policies + +**Phase 2: Migrate Static Secrets** +- Import SOPS secrets into vault KV store +- Update Nickel configs to reference vault paths +- Verify secret access via new API + +**Phase 3: Enable Dynamic Secrets** +- Configure database secret engine +- Configure SSH CA secret engine +- Update applications to use dynamic credentials + +**Phase 4: Deprecate SOPS for Runtime** +- SOPS remains for gitops config files +- Runtime secrets exclusively from vault +- Audit trail enforcement + +**Phase 5: Automation** +- Automatic rotation policies +- Lease renewal automation +- Monitoring and alerting + +## Documentation Requirements + +**User Guides**: +- `docs/user/secrets-management.md` - Using SecretumVault +- `docs/user/dynamic-credentials.md` - Dynamic secret workflows +- `docs/user/secret-rotation.md` - Rotation policies and procedures + +**Operations Documentation**: +- `docs/operations/vault-deployment.md` - Deploying and configuring vault +- `docs/operations/vault-backup-restore.md` - Backup and disaster recovery +- `docs/operations/vault-monitoring.md` - Metrics, logs, alerts + +**Developer Documentation**: +- `docs/development/secrets-api.md` - Rust client library usage +- `docs/development/cedar-secret-policies.md` - Writing Cedar policies for secrets +- Secret engine development guide + +**Security Documentation**: +- `docs/security/secrets-architecture.md` - Security architecture overview +- `docs/security/audit-logging.md` - Audit trail and compliance +- Threat model and risk assessment + +## References + +- [SecretumVault GitHub](https://github.com/secretum-vault/secretum) (hypothetical, replace with actual) +- [HashiCorp Vault Documentation](https://www.vaultproject.io/docs) (for comparison) +- ADR-008: Cedar Authorization (policy integration) +- ADR-009: Security System Complete (current security architecture) +- [Raft Consensus Algorithm](https://raft.github.io/) +- 
[Cedar Policy Language](https://www.cedarpolicy.com/)
+- SOPS: [https://github.com/getsops/sops](https://github.com/getsops/sops)
+- Age Encryption: [https://age-encryption.org/](https://age-encryption.org/)
+
+---
+
+**Status**: Accepted
+**Last Updated**: 2025-01-08
+**Implementation**: Planned
+**Priority**: High (Security and compliance)
+**Estimated Complexity**: Complex
\ No newline at end of file
diff --git a/docs/src/architecture/adr/adr-015-ai-integration-architecture.md b/docs/src/architecture/adr/adr-015-ai-integration-architecture.md
index 11c0134..4ff68ee 100644
--- a/docs/src/architecture/adr/adr-015-ai-integration-architecture.md
+++ b/docs/src/architecture/adr/adr-015-ai-integration-architecture.md
@@ -1 +1,1123 @@
-# ADR-015: AI Integration Architecture for Intelligent Infrastructure Provisioning
+# ADR-015: AI Integration Architecture for Intelligent Infrastructure Provisioning
+
+## Status
+
+**Accepted** - 2025-01-08
+
+## Context
+
+The provisioning platform has evolved to include complex workflows for infrastructure configuration, deployment, and management. 
+Current interaction patterns require deep technical knowledge of Nickel schemas, cloud provider APIs, networking concepts, and security best practices. +This creates barriers to entry and slows down infrastructure provisioning for operators who are not infrastructure experts. + +### The Infrastructure Complexity Problem + +**Current state challenges**: + +1. **Knowledge Barrier**: Deep Nickel, cloud, and networking expertise required + - Understanding Nickel type system and contracts + - Knowing cloud provider resource relationships + - Configuring security policies correctly + - Debugging deployment failures + +2. **Manual Configuration**: All configs hand-written + - Repetitive boilerplate for common patterns + - Easy to make mistakes (typos, missing fields) + - No intelligent suggestions or autocomplete + - Trial-and-error debugging + +3. **Limited Assistance**: No contextual help + - Documentation is separate from workflow + - No explanation of validation errors + - No suggestions for fixing issues + - No learning from past deployments + +4. **Troubleshooting Difficulty**: Manual log analysis + - Deployment failures require expert analysis + - No automated root cause detection + - No suggested fixes based on similar issues + - Long time-to-resolution + +### AI Integration Opportunities + +1. **Natural Language to Configuration**: + - User: "Create a production PostgreSQL cluster with encryption and daily backups" + - AI: Generates validated Nickel configuration + +2. **AI-Assisted Form Filling**: + - User starts typing in typdialog web form + - AI suggests values based on context + - AI explains validation errors in plain language + +3. **Intelligent Troubleshooting**: + - Deployment fails + - AI analyzes logs and suggests fixes + - AI generates corrected configuration + +4. **Configuration Optimization**: + - AI analyzes workload patterns + - AI suggests performance improvements + - AI detects security misconfigurations + +5. **Learning from Operations**: + - AI indexes past deployments + - AI suggests configurations based on similar workloads + - AI predicts potential issues + +### AI Components Overview + +The system integrates multiple AI components: + +1. **typdialog-ai**: AI-assisted form interactions +2. **typdialog-ag**: AI agents for autonomous operations +3. **typdialog-prov-gen**: AI-powered configuration generation +4. **platform/crates/ai-service**: Core AI service backend +5. **platform/crates/mcp-server**: Model Context Protocol server +6. **platform/crates/rag**: Retrieval-Augmented Generation system + +### Requirements for AI Integration + +- ✅ **Natural Language Understanding**: Parse user intent from free-form text +- ✅ **Schema-Aware Generation**: Generate valid Nickel configurations +- ✅ **Context Retrieval**: Access documentation, schemas, past deployments +- ✅ **Security Enforcement**: Cedar policies control AI access +- ✅ **Human-in-the-Loop**: All AI actions require human approval +- ✅ **Audit Trail**: Complete logging of AI operations +- ✅ **Multi-Provider Support**: OpenAI, Anthropic, local models +- ✅ **Cost Control**: Rate limiting and budget management +- ✅ **Observability**: Trace AI decisions and reasoning + +## Decision + +Integrate a **comprehensive AI system** consisting of: + +1. **AI-Assisted Interfaces** (typdialog-ai) +2. **Autonomous AI Agents** (typdialog-ag) +3. **AI Configuration Generator** (typdialog-prov-gen) +4. 
**Core AI Infrastructure** (ai-service, mcp-server, rag) + +All AI components are **schema-aware**, **security-enforced**, and **human-supervised**. + +### Architecture Diagram + +```text +┌─────────────────────────────────────────────────────────────────┐ +│ User Interfaces │ +│ │ +│ Natural Language: "Create production K8s cluster in AWS" │ +│ Typdialog Forms: AI-assisted field suggestions │ +│ CLI: provisioning ai generate-config "description" │ +└────────────┬────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ AI Frontend Layer │ +│ ┌───────────────────────────────────────────────────────┐ │ +│ │ typdialog-ai (AI-Assisted Forms) │ │ +│ │ - Natural language form filling │ │ +│ │ - Real-time AI suggestions │ │ +│ │ - Validation error explanations │ │ +│ │ - Context-aware autocomplete │ │ +│ ├───────────────────────────────────────────────────────┤ │ +│ │ typdialog-ag (AI Agents) │ │ +│ │ - Autonomous task execution │ │ +│ │ - Multi-step workflow automation │ │ +│ │ - Learning from feedback │ │ +│ │ - Agent collaboration │ │ +│ ├───────────────────────────────────────────────────────┤ │ +│ │ typdialog-prov-gen (Config Generator) │ │ +│ │ - Natural language → Nickel config │ │ +│ │ - Template-based generation │ │ +│ │ - Best practice injection │ │ +│ │ - Validation and refinement │ │ +│ └───────────────────────────────────────────────────────┘ │ +└────────────┬────────────────────────────────────────────────────┘ + │ + ▼ +┌────────────────────────────────────────────────────────────────┐ +│ Core AI Infrastructure (platform/crates/) │ +│ ┌───────────────────────────────────────────────────────┐ │ +│ │ ai-service (Central AI Service) │ │ +│ │ │ │ +│ │ - Request routing and orchestration │ │ +│ │ - Authentication and authorization (Cedar) │ │ +│ │ - Rate limiting and cost control │ │ +│ │ - Caching and optimization │ │ +│ │ - Audit logging and observability │ │ +│ │ - Multi-provider abstraction │ │ +│ └─────────────┬─────────────────────┬───────────────────┘ │ +│ │ │ │ +│ ▼ ▼ │ +│ ┌─────────────────────┐ ┌─────────────────────┐ │ +│ │ mcp-server │ │ rag │ │ +│ │ (Model Context │ │ (Retrieval-Aug Gen) │ │ +│ │ Protocol) │ │ │ │ +│ │ │ │ ┌─────────────────┐ │ │ +│ │ - LLM integration │ │ │ Vector Store │ │ │ +│ │ - Tool calling │ │ │ (Qdrant/Milvus) │ │ │ +│ │ - Context mgmt │ │ └─────────────────┘ │ │ +│ │ - Multi-provider │ │ ┌─────────────────┐ │ │ +│ │ (OpenAI, │ │ │ Embeddings │ │ │ +│ │ Anthropic, │ │ │ (text-embed) │ │ │ +│ │ Local models) │ │ └─────────────────┘ │ │ +│ │ │ │ ┌─────────────────┐ │ │ +│ │ Tools: │ │ │ Index: │ │ │ +│ │ - nickel_validate │ │ │ - Nickel schemas│ │ │ +│ │ - schema_query │ │ │ - Documentation │ │ │ +│ │ - config_generate │ │ │ - Past deploys │ │ │ +│ │ - cedar_check │ │ │ - Best practices│ │ │ +│ └─────────────────────┘ │ └─────────────────┘ │ │ +│ │ │ │ +│ │ Query: "How to │ │ +│ │ configure Postgres │ │ +│ │ with encryption?" 
│ │ +│ │ │ │ +│ │ Retrieval: Relevant │ │ +│ │ docs + examples │ │ +│ └─────────────────────┘ │ +└────────────┬───────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ Integration Points │ +│ │ +│ ┌─────────────┐ ┌──────────────┐ ┌─────────────────────┐ │ +│ │ Nickel │ │ SecretumVault│ │ Cedar Authorization │ │ +│ │ Validation │ │ (Secrets) │ │ (AI Policies) │ │ +│ └─────────────┘ └──────────────┘ └─────────────────────┘ │ +│ │ +│ ┌─────────────┐ ┌──────────────┐ ┌─────────────────────┐ │ +│ │ Orchestrator│ │ Typdialog │ │ Audit Logging │ │ +│ │ (Deploy) │ │ (Forms) │ │ (All AI Ops) │ │ +│ └─────────────┘ └──────────────┘ └─────────────────────┘ │ +└─────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ Output: Validated Nickel Configuration │ +│ │ +│ ✅ Schema-validated │ +│ ✅ Security-checked (Cedar policies) │ +│ ✅ Human-approved │ +│ ✅ Audit-logged │ +│ ✅ Ready for deployment │ +└─────────────────────────────────────────────────────────────────┘ +``` + +### Component Responsibilities + +**typdialog-ai** (AI-Assisted Forms): +- Real-time form field suggestions based on context +- Natural language form filling +- Validation error explanations in plain English +- Context-aware autocomplete for configuration values +- Integration with typdialog web UI + +**typdialog-ag** (AI Agents): +- Autonomous task execution (multi-step workflows) +- Agent collaboration (multiple agents working together) +- Learning from user feedback and past operations +- Goal-oriented behavior (achieve outcome, not just execute steps) +- Safety boundaries (cannot deploy without approval) + +**typdialog-prov-gen** (Config Generator): +- Natural language → Nickel configuration +- Template-based generation with customization +- Best practice injection (security, performance, HA) +- Iterative refinement based on validation feedback +- Integration with Nickel schema system + +**ai-service** (Core AI Service): +- Central request router for all AI operations +- Authentication and authorization (Cedar policies) +- Rate limiting and cost control +- Caching (reduce LLM API calls) +- Audit logging (all AI operations) +- Multi-provider abstraction (OpenAI, Anthropic, local) + +**mcp-server** (Model Context Protocol): +- LLM integration (OpenAI, Anthropic, local models) +- Tool calling framework (nickel_validate, schema_query, etc.) 
+- Context management (conversation history, schemas) +- Streaming responses for real-time feedback +- Error handling and retries + +**rag** (Retrieval-Augmented Generation): +- Vector store (Qdrant/Milvus) for embeddings +- Document indexing (Nickel schemas, docs, deployments) +- Semantic search (find relevant context) +- Embedding generation (text-embedding-3-large) +- Query expansion and reranking + +## Rationale + +### Why AI Integration Is Essential + +| Aspect | Manual Config | AI-Assisted (chosen) | +| -------- | --------------- | ---------------------- | +| **Learning Curve** | 🔴 Steep | 🟢 Gentle | +| **Time to Deploy** | 🔴 Hours | 🟢 Minutes | +| **Error Rate** | 🔴 High | 🟢 Low (validated) | +| **Documentation Access** | 🔴 Separate | 🟢 Contextual | +| **Troubleshooting** | 🔴 Manual | 🟢 AI-assisted | +| **Best Practices** | ⚠️ Manual enforcement | ✅ Auto-injected | +| **Consistency** | ⚠️ Varies by operator | ✅ Standardized | +| **Scalability** | 🔴 Limited by expertise | 🟢 AI scales knowledge | + +### Why Schema-Aware AI Is Critical + +Traditional AI code generation fails for infrastructure because: + +```text +Generic AI (like GitHub Copilot): +❌ Generates syntactically correct but semantically wrong configs +❌ Doesn't understand cloud provider constraints +❌ No validation against schemas +❌ No security policy enforcement +❌ Hallucinated resource names/IDs +``` + +**Schema-aware AI** (our approach): +```text +# Nickel schema provides ground truth +{ + Database = { + engine | [| 'postgres, 'mysql, 'mongodb |], + version | String, + storage_gb | Number, + backup_retention_days | Number, + } +} + +# AI generates ONLY valid configs +# AI knows: +# - Valid engine values ('postgres', not 'postgresql') +# - Required fields (all listed above) +# - Type constraints (storage_gb is Number, not String) +# - Nickel contracts (if defined) +``` + +**Result**: AI cannot generate invalid configs. + +### Why RAG (Retrieval-Augmented Generation) Is Essential + +LLMs alone have limitations: + +```text +Pure LLM: +❌ Knowledge cutoff (no recent updates) +❌ Hallucinations (invents plausible-sounding configs) +❌ No project-specific knowledge +❌ No access to past deployments +``` + +**RAG-enhanced LLM**: +```text +Query: "How to configure Postgres with encryption?" 
+ +RAG retrieves: +- Nickel schema: provisioning/schemas/database.ncl +- Documentation: docs/user/database-encryption.md +- Past deployment: workspaces/prod/postgres-encrypted.ncl +- Best practice: .claude/patterns/secure-database.md + +LLM generates answer WITH retrieved context: +✅ Accurate (based on actual schemas) +✅ Project-specific (uses our patterns) +✅ Proven (learned from past deployments) +✅ Secure (follows our security guidelines) +``` + +### Why Human-in-the-Loop Is Non-Negotiable + +AI-generated infrastructure configs require human approval: + +```text +// All AI operations require approval +pub async fn ai_generate_config(request: GenerateRequest) -> Result { + let ai_generated = ai_service.generate(request).await?; + + // Validate against Nickel schema + let validation = nickel_validate(&ai_generated)?; + if !validation.is_valid() { + return Err("AI generated invalid config"); + } + + // Check Cedar policies + let authorized = cedar_authorize( + principal: user, + action: "approve_ai_config", + resource: ai_generated, + )?; + if !authorized { + return Err("User not authorized to approve AI config"); + } + + // Require explicit human approval + let approval = prompt_user_approval(&ai_generated).await?; + if !approval.approved { + audit_log("AI config rejected by user", &ai_generated); + return Err("User rejected AI-generated config"); + } + + audit_log("AI config approved by user", &ai_generated); + Ok(ai_generated) +} +``` + +**Why**: +- Infrastructure changes have real-world cost and security impact +- AI can make mistakes (hallucinations, misunderstandings) +- Compliance requires human accountability +- Learning opportunity (human reviews teach AI) + +### Why Multi-Provider Support Matters + +No single LLM provider is best for all tasks: + +| Provider | Best For | Considerations | +| ---------- | ---------- | ---------------- | +| **Anthropic (Claude)** | Long context, accuracy | ✅ Best for complex configs | +| **OpenAI (GPT-4)** | Tool calling, speed | ✅ Best for quick suggestions | +| **Local (Llama, Mistral)** | Privacy, cost | ✅ Best for air-gapped envs | + +**Strategy**: +- Complex config generation → Claude (long context) +- Real-time form suggestions → GPT-4 (fast) +- Air-gapped deployments → Local models (privacy) + +## Consequences + +### Positive + +- **Accessibility**: Non-experts can provision infrastructure +- **Productivity**: 10x faster configuration creation +- **Quality**: AI injects best practices automatically +- **Consistency**: Standardized configurations across teams +- **Learning**: Users learn from AI explanations +- **Troubleshooting**: AI-assisted debugging reduces MTTR +- **Documentation**: Contextual help embedded in workflow +- **Safety**: Schema validation prevents invalid configs +- **Security**: Cedar policies control AI access +- **Auditability**: Complete trail of AI operations + +### Negative + +- **Dependency**: Requires LLM API access (or local models) +- **Cost**: LLM API calls have per-token cost +- **Latency**: AI responses take 1-5 seconds +- **Accuracy**: AI can still make mistakes (needs validation) +- **Trust**: Users must understand AI limitations +- **Complexity**: Additional infrastructure to operate +- **Privacy**: Configs sent to LLM providers (unless local) + +### Mitigation Strategies + +**Cost Control**: +```text +[ai.rate_limiting] +requests_per_minute = 60 +tokens_per_day = 1000000 +cost_limit_per_day = "100.00" # USD + +[ai.caching] +enabled = true +ttl = "1h" +# Cache similar queries to reduce API calls +``` + 
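+A minimal sketch of the caching idea (illustrative only; `ResponseCache` and its API are hypothetical, not the actual ai-service types, and TTL/eviction from the config above are omitted):
+
+```text
+// Hypothetical sketch: deduplicate LLM calls by hashing the prompt.
+use std::collections::HashMap;
+use std::collections::hash_map::DefaultHasher;
+use std::hash::{Hash, Hasher};
+
+struct ResponseCache {
+    entries: HashMap<u64, String>, // prompt hash -> cached completion
+}
+
+impl ResponseCache {
+    fn key(prompt: &str) -> u64 {
+        let mut hasher = DefaultHasher::new();
+        prompt.hash(&mut hasher);
+        hasher.finish()
+    }
+
+    // Return the cached completion, or call the LLM once and store the result.
+    fn get_or_call(&mut self, prompt: &str, call_llm: impl FnOnce() -> String) -> String {
+        let key = Self::key(prompt);
+        self.entries.entry(key).or_insert_with(call_llm).clone()
+    }
+}
+```
+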
+**Latency Optimization**:
+```text
+// Streaming responses for real-time feedback
+pub async fn ai_generate_stream(request: GenerateRequest) -> impl Stream<Item = String> {
+    ai_service
+        .generate_stream(request)
+        .await
+        .map(|chunk| chunk.text)
+}
+```
+
+**Privacy (Local Models)**:
+```text
+[ai]
+provider = "local"
+model_path = "/opt/provisioning/models/llama-3-70b"
+
+# No data leaves the network
+```
+
+**Validation (Defense in Depth)**:
+```text
+AI generates config
+    ↓
+Nickel schema validation (syntax, types, contracts)
+    ↓
+Cedar policy check (security, compliance)
+    ↓
+Human approval (final gate)
+    ↓
+Deployment
+```
+
+**Observability**:
+```text
+[ai.observability]
+trace_all_requests = true
+store_conversations = true
+conversation_retention = "30d"
+
+# Every AI operation logged:
+# - Input prompt
+# - Retrieved context (RAG)
+# - Generated output
+# - Validation results
+# - Human approval decision
+```
+
+## Alternatives Considered
+
+### Alternative 1: No AI Integration
+
+**Pros**: Simpler, no LLM dependencies
+**Cons**: Steep learning curve, slow provisioning, manual troubleshooting
+**Decision**: REJECTED - Poor user experience (10x slower provisioning, high error rate)
+
+### Alternative 2: Generic AI Code Generation (GitHub Copilot approach)
+
+**Pros**: Existing tools, well-known UX
+**Cons**: Not schema-aware, generates invalid configs, no validation
+**Decision**: REJECTED - Inadequate for infrastructure (correctness critical)
+
+### Alternative 3: AI Only for Documentation/Search
+
+**Pros**: Lower risk (AI doesn't generate configs)
+**Cons**: Missed opportunity for 10x productivity gains
+**Decision**: REJECTED - Too conservative
+
+### Alternative 4: Fully Autonomous AI (No Human Approval)
+
+**Pros**: Maximum automation
+**Cons**: Unacceptable risk for infrastructure changes
+**Decision**: REJECTED - Safety and compliance requirements
+
+### Alternative 5: Single LLM Provider Lock-in
+
+**Pros**: Simpler integration
+**Cons**: Vendor lock-in, no flexibility for different use cases
+**Decision**: REJECTED - Multi-provider abstraction provides flexibility
+
+## Implementation Details
+
+### AI Service API
+
+```text
+// platform/crates/ai-service/src/lib.rs
+
+#[async_trait]
+pub trait AIService {
+    async fn generate_config(
+        &self,
+        prompt: &str,
+        schema: &NickelSchema,
+        context: Option<RAGContext>,
+    ) -> Result<GeneratedConfig>;
+
+    async fn suggest_field_value(
+        &self,
+        field: &FieldDefinition,
+        partial_input: &str,
+        form_context: &FormContext,
+    ) -> Result<Vec<Suggestion>>;
+
+    async fn explain_validation_error(
+        &self,
+        error: &ValidationError,
+        config: &Config,
+    ) -> Result<Explanation>;
+
+    async fn troubleshoot_deployment(
+        &self,
+        deployment_id: &str,
+        logs: &DeploymentLogs,
+    ) -> Result<TroubleshootingReport>;
+}
+
+pub struct AIServiceImpl {
+    mcp_client: MCPClient,
+    rag: RAGService,
+    cedar: CedarEngine,
+    audit: AuditLogger,
+    rate_limiter: RateLimiter,
+    cache: Cache,
+}
+
+impl AIService for AIServiceImpl {
+    async fn generate_config(
+        &self,
+        prompt: &str,
+        schema: &NickelSchema,
+        context: Option<RAGContext>,
+    ) -> Result<GeneratedConfig> {
+        // Check authorization
+        self.cedar.authorize(
+            principal: current_user(),
+            action: "ai:generate_config",
+            resource: schema,
+        )?;
+
+        // Rate limiting
+        self.rate_limiter.check(current_user()).await?;
+
+        // Retrieve relevant context via RAG
+        let rag_context = match context {
+            Some(ctx) => ctx,
+            None => self.rag.retrieve(prompt, schema).await?,
+        };
+
+        // Generate config via MCP
+        let generated = self.mcp_client.generate(
+            prompt: prompt,
+            schema: schema,
+            context: rag_context,
+            tools:
&["nickel_validate", "schema_query"], + ).await?; + + // Validate generated config + let validation = nickel_validate(&generated.config)?; + if !validation.is_valid() { + return Err(AIError::InvalidGeneration(validation.errors)); + } + + // Audit log + self.audit.log(AIOperation::GenerateConfig { + user: current_user(), + prompt: prompt, + schema: schema.name(), + generated: &generated.config, + validation: validation, + }); + + Ok(GeneratedConfig { + config: generated.config, + explanation: generated.explanation, + confidence: generated.confidence, + validation: validation, + }) + } +} +``` + +### MCP Server Integration + +```text +// platform/crates/mcp-server/src/lib.rs + +pub struct MCPClient { + provider: Box, + tools: ToolRegistry, +} + +#[async_trait] +pub trait LLMProvider { + async fn generate(&self, request: GenerateRequest) -> Result; + async fn generate_stream(&self, request: GenerateRequest) -> Result>; +} + +// Tool definitions for LLM +pub struct ToolRegistry { + tools: HashMap, +} + +impl ToolRegistry { + pub fn new() -> Self { + let mut tools = HashMap::new(); + + tools.insert("nickel_validate", Tool { + name: "nickel_validate", + description: "Validate Nickel configuration against schema", + parameters: json!({ + "type": "object", + "properties": { + "config": {"type": "string"}, + "schema_path": {"type": "string"}, + }, + "required": ["config", "schema_path"], + }), + handler: Box::new(|params| async { + let config = params["config"].as_str().unwrap(); + let schema = params["schema_path"].as_str().unwrap(); + nickel_validate_tool(config, schema).await + }), + }); + + tools.insert("schema_query", Tool { + name: "schema_query", + description: "Query Nickel schema for field information", + parameters: json!({ + "type": "object", + "properties": { + "schema_path": {"type": "string"}, + "query": {"type": "string"}, + }, + "required": ["schema_path"], + }), + handler: Box::new(|params| async { + let schema = params["schema_path"].as_str().unwrap(); + let query = params.get("query").and_then(|v| v.as_str()); + schema_query_tool(schema, query).await + }), + }); + + Self { tools } + } +} +``` + +### RAG System Implementation + +```text +// platform/crates/rag/src/lib.rs + +pub struct RAGService { + vector_store: Box, + embeddings: EmbeddingModel, + indexer: DocumentIndexer, +} + +impl RAGService { + pub async fn index_all(&self) -> Result<()> { + // Index Nickel schemas + self.index_schemas("provisioning/schemas").await?; + + // Index documentation + self.index_docs("docs").await?; + + // Index past deployments + self.index_deployments("workspaces").await?; + + // Index best practices + self.index_patterns(".claude/patterns").await?; + + Ok(()) + } + + pub async fn retrieve( + &self, + query: &str, + schema: &NickelSchema, + ) -> Result { + // Generate query embedding + let query_embedding = self.embeddings.embed(query).await?; + + // Search vector store + let results = self.vector_store.search( + embedding: query_embedding, + top_k: 10, + filter: Some(json!({ + "schema": schema.name(), + })), + ).await?; + + // Rerank results + let reranked = self.rerank(query, results).await?; + + // Build context + Ok(RAGContext { + query: query.to_string(), + schema_definition: schema.to_string(), + relevant_docs: reranked.iter() + .take(5) + .map(|r| r.content.clone()) + .collect(), + similar_configs: self.find_similar_configs(schema).await?, + best_practices: self.find_best_practices(schema).await?, + }) + } +} + +#[async_trait] +pub trait VectorStore { + async fn insert(&self, id: &str, 
embedding: Vec<f32>, metadata: Value) -> Result<()>;
+    async fn search(&self, embedding: Vec<f32>, top_k: usize, filter: Option<Value>) -> Result<Vec<SearchResult>>;
+}
+
+// Qdrant implementation
+pub struct QdrantStore {
+    client: qdrant::QdrantClient,
+    collection: String,
+}
+```
+
+### typdialog-ai Integration
+
+```text
+// typdialog-ai/src/form_assistant.rs
+
+pub struct FormAssistant {
+    ai_service: Arc<dyn AIService>,
+}
+
+impl FormAssistant {
+    pub async fn suggest_field_value(
+        &self,
+        field: &FieldDefinition,
+        partial_input: &str,
+        form_context: &FormContext,
+    ) -> Result<Vec<Suggestion>> {
+        self.ai_service.suggest_field_value(
+            field,
+            partial_input,
+            form_context,
+        ).await
+    }
+
+    pub async fn explain_error(
+        &self,
+        error: &ValidationError,
+        field_value: &str,
+    ) -> Result<String> {
+        let explanation = self.ai_service.explain_validation_error(
+            error,
+            field_value,
+        ).await?;
+
+        Ok(format!(
+            "Error: {}
+
+Explanation: {}
+
+Suggested fix: {}",
+            error.message,
+            explanation.plain_english,
+            explanation.suggested_fix,
+        ))
+    }
+
+    pub async fn fill_from_natural_language(
+        &self,
+        description: &str,
+        form_schema: &FormSchema,
+    ) -> Result<HashMap<String, Value>> {
+        let prompt = format!(
+            "User wants to: {}
+
+Form schema: {}
+
+Generate field values:",
+            description,
+            serde_json::to_string_pretty(form_schema)?,
+        );
+
+        let generated = self.ai_service.generate_config(
+            &prompt,
+            &form_schema.nickel_schema,
+            None,
+        ).await?;
+
+        Ok(generated.field_values)
+    }
+}
+```
+
+### typdialog-ag Agents
+
+```text
+// typdialog-ag/src/agent.rs
+
+pub struct ProvisioningAgent {
+    ai_service: Arc<dyn AIService>,
+    orchestrator: Arc<Orchestrator>,
+    max_iterations: usize,
+}
+
+impl ProvisioningAgent {
+    pub async fn execute_goal(&self, goal: &str) -> Result<AgentResult> {
+        let mut state = AgentState::new(goal);
+
+        for iteration in 0..self.max_iterations {
+            // AI determines next action
+            let action = self.ai_service.agent_next_action(&state).await?;
+
+            // Execute action (with human approval for critical operations)
+            let result = self.execute_action(&action, &state).await?;
+
+            // Update state
+            state.update(action, result);
+
+            // Check if goal achieved
+            if state.goal_achieved() {
+                return Ok(AgentResult::Success(state));
+            }
+        }
+
+        Err(AgentError::MaxIterationsReached)
+    }
+
+    async fn execute_action(
+        &self,
+        action: &AgentAction,
+        state: &AgentState,
+    ) -> Result<ActionResult> {
+        match action {
+            AgentAction::GenerateConfig { description } => {
+                let config = self.ai_service.generate_config(
+                    description,
+                    &state.target_schema,
+                    Some(state.context.clone()),
+                ).await?;
+
+                Ok(ActionResult::ConfigGenerated(config))
+            },
+
+            AgentAction::Deploy { config } => {
+                // Require human approval for deployment
+                let approval = prompt_user_approval(
+                    "Agent wants to deploy.
Approve?", + config, + ).await?; + + if !approval.approved { + return Ok(ActionResult::DeploymentRejected); + } + + let deployment = self.orchestrator.deploy(config).await?; + Ok(ActionResult::Deployed(deployment)) + }, + + AgentAction::Troubleshoot { deployment_id } => { + let report = self.ai_service.troubleshoot_deployment( + deployment_id, + &self.orchestrator.get_logs(deployment_id).await?, + ).await?; + + Ok(ActionResult::TroubleshootingReport(report)) + }, + } + } +} +``` + +### Cedar Policies for AI + +```text +// AI cannot access secrets without explicit permission +forbid( + principal == Service::"ai-service", + action == Action::"read", + resource in Secret::"*" +); + +// AI can generate configs for non-production environments without approval +permit( + principal == Service::"ai-service", + action == Action::"generate_config", + resource in Schema::"*" +) when { + resource.environment in ["dev", "staging"] +}; + +// AI config generation for production requires senior engineer approval +permit( + principal in Group::"senior-engineers", + action == Action::"approve_ai_config", + resource in Config::"*" +) when { + resource.environment == "production" && + resource.generated_by == "ai-service" +}; + +// AI agents cannot deploy without human approval +forbid( + principal == Service::"ai-agent", + action == Action::"deploy", + resource == Infrastructure::"*" +) unless { + context.human_approved == true +}; +``` + +## Testing Strategy + +**Unit Tests**: +```text +#[tokio::test] +async fn test_ai_config_generation_validates() { + let ai_service = mock_ai_service(); + + let generated = ai_service.generate_config( + "Create a PostgreSQL database with encryption", + &postgres_schema(), + None, + ).await.unwrap(); + + // Must validate against schema + assert!(generated.validation.is_valid()); + assert_eq!(generated.config["engine"], "postgres"); + assert_eq!(generated.config["encryption_enabled"], true); +} + +#[tokio::test] +async fn test_ai_cannot_access_secrets() { + let ai_service = ai_service_with_cedar(); + + let result = ai_service.get_secret("database/password").await; + + assert!(result.is_err()); + assert_eq!(result.unwrap_err(), AIError::PermissionDenied); +} +``` + +**Integration Tests**: +```text +#[tokio::test] +async fn test_end_to_end_ai_config_generation() { + // User provides natural language + let description = "Create a production Kubernetes cluster in AWS with 5 nodes"; + + // AI generates config + let generated = ai_service.generate_config(description).await.unwrap(); + + // Nickel validation + let validation = nickel_validate(&generated.config).await.unwrap(); + assert!(validation.is_valid()); + + // Human approval + let approval = Approval { + user: "senior-engineer@example.com", + approved: true, + timestamp: Utc::now(), + }; + + // Deploy + let deployment = orchestrator.deploy_with_approval( + generated.config, + approval, + ).await.unwrap(); + + assert_eq!(deployment.status, DeploymentStatus::Success); +} +``` + +**RAG Quality Tests**: +```text +#[tokio::test] +async fn test_rag_retrieval_accuracy() { + let rag = rag_service(); + + // Index test documents + rag.index_all().await.unwrap(); + + // Query + let context = rag.retrieve( + "How to configure PostgreSQL with encryption?", + &postgres_schema(), + ).await.unwrap(); + + // Should retrieve relevant docs + assert!(context.relevant_docs.iter().any(|doc| { + doc.contains("encryption") && doc.contains("postgres") + })); + + // Should retrieve similar configs + assert!(!context.similar_configs.is_empty()); +} 
+```
+
+## Security Considerations
+
+**AI Access Control**:
+```text
+AI Service Permissions (enforced by Cedar):
+✅ CAN: Read Nickel schemas
+✅ CAN: Generate configurations
+✅ CAN: Query documentation
+✅ CAN: Analyze deployment logs (sanitized)
+❌ CANNOT: Access secrets directly
+❌ CANNOT: Deploy without approval
+❌ CANNOT: Modify Cedar policies
+❌ CANNOT: Access user credentials
+```
+
+**Data Privacy**:
+```text
+[ai.privacy]
+# Sanitize before sending to LLM
+sanitize_secrets = true
+sanitize_pii = true
+sanitize_credentials = true
+
+# What gets sent to LLM:
+# ✅ Nickel schemas (public)
+# ✅ Documentation (public)
+# ✅ Error messages (sanitized)
+# ❌ Secret values (never)
+# ❌ Passwords (never)
+# ❌ API keys (never)
+```
+
+**Audit Trail**:
+```text
+// Every AI operation logged
+pub struct AIAuditLog {
+    timestamp: DateTime<Utc>,
+    user: UserId,
+    operation: AIOperation,
+    input_prompt: String,
+    generated_output: String,
+    validation_result: ValidationResult,
+    human_approval: Option<Approval>,
+    deployment_outcome: Option<DeploymentOutcome>,
+}
+```
+
+## Cost Analysis
+
+**Estimated Costs** (per month, based on typical usage):
+
+```text
+Assumptions:
+- 100 active users
+- 10 AI config generations per user per day
+- Average prompt: 2000 tokens
+- Average response: 1000 tokens
+
+Provider: Anthropic Claude Sonnet
+Cost: $3 per 1M input tokens, $15 per 1M output tokens
+
+Monthly cost:
+= 100 users × 10 generations × 30 days × (2000 input + 1000 output tokens)
+= 100 × 10 × 30 × 3000 tokens
+= 90M tokens
+= (60M input × $3/1M) + (30M output × $15/1M)
+= $180 + $450
+= $630/month
+
+With caching (50% hit rate):
+= $315/month
+```
+
+**Cost optimization strategies**:
+- Caching (50-80% cost reduction)
+- Streaming (lower latency, same cost)
+- Local models for non-critical operations (zero marginal cost)
+- Rate limiting (prevent runaway costs)
+
+## References
+
+- [Model Context Protocol (MCP)](https://modelcontextprotocol.io/)
+- [Anthropic Claude API](https://docs.anthropic.com/claude/reference/getting-started)
+- [OpenAI GPT-4 API](https://platform.openai.com/docs/api-reference)
+- [Qdrant Vector Database](https://qdrant.tech/)
+- [RAG Survey Paper](https://arxiv.org/abs/2312.10997)
+- ADR-008: Cedar Authorization (AI access control)
+- ADR-011: Nickel Migration (schema-driven AI)
+- ADR-013: Typdialog Web UI Backend (AI-assisted forms)
+- ADR-014: SecretumVault Integration (AI-secret isolation)
+
+---
+
+**Status**: Accepted
+**Last Updated**: 2025-01-08
+**Implementation**: Planned (High Priority)
+**Estimated Complexity**: Very Complex
+**Dependencies**: ADR-008, ADR-011, ADR-013, ADR-014
\ No newline at end of file
diff --git a/docs/src/architecture/adr/adr-016-schema-driven-accessor-generation.md b/docs/src/architecture/adr/adr-016-schema-driven-accessor-generation.md
index 0d68644..1985040 100644
--- a/docs/src/architecture/adr/adr-016-schema-driven-accessor-generation.md
+++ b/docs/src/architecture/adr/adr-016-schema-driven-accessor-generation.md
@@ -1 +1,159 @@
-# ADR-016: Schema-Driven Accessor Generation Pattern\n\n**Status**: Proposed\n**Date**: 2026-01-13\n**Author**: Architecture Team\n**Supersedes**: Manual accessor maintenance in `lib_provisioning/config/accessor.nu`\n\n## Context\n\nThe `lib_provisioning/config/accessor.nu` file contains 1567 lines across 187 accessor functions.
Analysis reveals that 95% of these functions follow\nan identical mechanical pattern:\n\n```\nexport def get-{field-name} [--config: record] {\n config-get "{path.to.field}" {default_value} --config $config\n}\n```\n\nThis represents significant technical debt:\n\n1. **Manual Maintenance Burden**: Adding a new config field requires manually writing a new accessor function\n2. **Schema Drift Risk**: No automated validation that accessor matches the actual Nickel schema\n3. **Code Duplication**: Nearly identical functions across 187 definitions\n4. **Testing Complexity**: Each accessor requires manual testing\n\n## Problem Statement\n\n**Current Architecture**:\n- Nickel schemas define configuration structure (source of truth)\n- Accessor functions manually mirror the schema structure\n- No automated synchronization between schema and accessors\n- High risk of accessor-schema mismatch\n\n**Key Metrics**:\n- 1567 lines of accessor code\n- 187 repetitive functions\n- ~95% code similarity\n\n## Decision\n\nImplement **Schema-Driven Accessor Generation**: automatically generate accessor functions from Nickel schema definitions.\n\n### Architecture\n\n```\nNickel Schema (contracts.ncl)\n ↓\n[Parse & Extract Schema Structure]\n ↓\n[Generate Nushell Functions]\n ↓\naccessor_generated.nu (800 lines)\n ↓\n[Validation & Integration]\n ↓\nCI/CD enforces: schema hash == generated code\n```\n\n### Generation Process\n\n1. **Schema Parsing**: Extract field paths, types, and defaults from Nickel contracts\n2. **Code Generation**: Create accessor functions with Nushell 0.109 compliance\n3. **Validation**: Verify generated code against schema\n4. **CI Integration**: Detect schema changes, validate generated code matches\n\n### Compliance Requirements\n\n**Nushell 0.109 Guidelines**:\n- No `try-catch` blocks (use `do-complete` pattern)\n- No `reduce --init` (use `reduce --fold`)\n- No mutable variables (use immutable bindings)\n- No type annotations on boolean flags\n- Use `each` not `map`, `is-not-empty` not `length`\n\n**Nickel Compliance**:\n- Schema-first design (schema is source of truth)\n- Type contracts enforce structure\n- `| doc` before `| default` ordering\n\n## Consequences\n\n### Positive\n\n- **Elimination of Manual Maintenance**: New config fields automatically get accessors\n- **Zero Schema Drift**: Automatic validation ensures accessors match schema\n- **Reduced Code Size**: 1567 lines → ~400 lines (manual core) + ~800 lines (generated)\n- **Type Safety**: Generated code guarantees type correctness\n- **Consistency**: All 187 functions use identical pattern\n\n### Negative\n\n- **Tool Complexity**: Generator must parse Nickel and emit valid Nushell\n- **CI/CD Changes**: Build must validate schema hash\n- **Initial Migration**: One-time effort to verify generated code matches manual versions\n\n## Implementation Strategy\n\n1. **Create Generator** (`tools/codegen/accessor_generator.nu`)\n - Parse Nickel schema files\n - Extract paths, types, defaults\n - Generate valid Nushell code\n - Emit with proper formatting\n\n2. **Generate Accessors** (`lib_provisioning/config/accessor_generated.nu`)\n - Run generator on `provisioning/schemas/config/settings/contracts.ncl`\n - Output 187 accessor functions\n - Verify compatibility with existing code\n\n3. **Validation**\n - Integration tests comparing manual vs generated output\n - Signature validator ensuring generated functions match patterns\n - CI check for schema hash validity\n\n4. 
**Gradual Adoption**\n - Keep manual accessors temporarily\n - Feature flag to switch between manual and generated\n - Gradual migration of dependent code\n\n## Testing Strategy\n\n1. **Unit Tests**\n - Each generated accessor returns correct type\n - Default values applied correctly\n - Path resolution handles nested fields\n\n2. **Integration Tests**\n - Generated accessors produce identical output to manual versions\n - Config loading pipeline works with generated accessors\n - Fallback behavior preserved\n\n3. **Regression Tests**\n - All existing config access patterns work\n - Performance within 5% of manual version\n - No breaking changes to public API\n\n## Related ADRs\n\n- **ADR-010**: Configuration Format Strategy (TOML/YAML/Nickel)\n- **ADR-011**: Nickel Migration (schema-first architecture)\n\n## Open Questions\n\n1. Should accessors be regenerated on every build or only on schema changes?\n2. How do we handle conditional fields (if X then Y)?\n3. What's the fallback strategy if generator fails?\n\n## Timeline\n\n- **Phase 1**: Generator implementation (foundation)\n- **Phase 2**: Generate and validate accessor functions\n- **Phase 3**: Integration tests and feature flags\n- **Phase 4**: Full migration and manual code removal\n\n## References\n\n- Nickel Language: [https://nickel-lang.org/](https://nickel-lang.org/)\n- Nushell 0.109 Guidelines: `.claude/guidelines/nushell.md`\n- Current Accessor Implementation: `provisioning/core/nulib/lib_provisioning/config/accessor.nu`\n- Schema Source: `provisioning/schemas/config/settings/contracts.ncl`\n +# ADR-016: Schema-Driven Accessor Generation Pattern + +**Status**: Proposed +**Date**: 2026-01-13 +**Author**: Architecture Team +**Supersedes**: Manual accessor maintenance in `lib_provisioning/config/accessor.nu` + +## Context + +The `lib_provisioning/config/accessor.nu` file contains 1567 lines across 187 accessor functions. Analysis reveals that 95% of these functions follow +an identical mechanical pattern: + +```text +export def get-{field-name} [--config: record] { + config-get "{path.to.field}" {default_value} --config $config +} +``` + +This represents significant technical debt: + +1. **Manual Maintenance Burden**: Adding a new config field requires manually writing a new accessor function +2. **Schema Drift Risk**: No automated validation that accessor matches the actual Nickel schema +3. **Code Duplication**: Nearly identical functions across 187 definitions +4. **Testing Complexity**: Each accessor requires manual testing + +## Problem Statement + +**Current Architecture**: +- Nickel schemas define configuration structure (source of truth) +- Accessor functions manually mirror the schema structure +- No automated synchronization between schema and accessors +- High risk of accessor-schema mismatch + +**Key Metrics**: +- 1567 lines of accessor code +- 187 repetitive functions +- ~95% code similarity + +## Decision + +Implement **Schema-Driven Accessor Generation**: automatically generate accessor functions from Nickel schema definitions. + +### Architecture + +```text +Nickel Schema (contracts.ncl) + ↓ +[Parse & Extract Schema Structure] + ↓ +[Generate Nushell Functions] + ↓ +accessor_generated.nu (800 lines) + ↓ +[Validation & Integration] + ↓ +CI/CD enforces: schema hash == generated code +``` + +### Generation Process + +1. **Schema Parsing**: Extract field paths, types, and defaults from Nickel contracts +2. **Code Generation**: Create accessor functions with Nushell 0.109 compliance +3. 
**Validation**: Verify generated code against schema +4. **CI Integration**: Detect schema changes, validate generated code matches + +### Compliance Requirements + +**Nushell 0.109 Guidelines**: +- No `try-catch` blocks (use `do-complete` pattern) +- No `reduce --init` (use `reduce --fold`) +- No mutable variables (use immutable bindings) +- No type annotations on boolean flags +- Use `each` not `map`, `is-not-empty` not `length` + +**Nickel Compliance**: +- Schema-first design (schema is source of truth) +- Type contracts enforce structure +- `| doc` before `| default` ordering + +## Consequences + +### Positive + +- **Elimination of Manual Maintenance**: New config fields automatically get accessors +- **Zero Schema Drift**: Automatic validation ensures accessors match schema +- **Reduced Code Size**: 1567 lines → ~400 lines (manual core) + ~800 lines (generated) +- **Type Safety**: Generated code guarantees type correctness +- **Consistency**: All 187 functions use identical pattern + +### Negative + +- **Tool Complexity**: Generator must parse Nickel and emit valid Nushell +- **CI/CD Changes**: Build must validate schema hash +- **Initial Migration**: One-time effort to verify generated code matches manual versions + +## Implementation Strategy + +1. **Create Generator** (`tools/codegen/accessor_generator.nu`) + - Parse Nickel schema files + - Extract paths, types, defaults + - Generate valid Nushell code + - Emit with proper formatting + +2. **Generate Accessors** (`lib_provisioning/config/accessor_generated.nu`) + - Run generator on `provisioning/schemas/config/settings/contracts.ncl` + - Output 187 accessor functions + - Verify compatibility with existing code + +3. **Validation** + - Integration tests comparing manual vs generated output + - Signature validator ensuring generated functions match patterns + - CI check for schema hash validity + +4. **Gradual Adoption** + - Keep manual accessors temporarily + - Feature flag to switch between manual and generated + - Gradual migration of dependent code + +## Testing Strategy + +1. **Unit Tests** + - Each generated accessor returns correct type + - Default values applied correctly + - Path resolution handles nested fields + +2. **Integration Tests** + - Generated accessors produce identical output to manual versions + - Config loading pipeline works with generated accessors + - Fallback behavior preserved + +3. **Regression Tests** + - All existing config access patterns work + - Performance within 5% of manual version + - No breaking changes to public API + +## Related ADRs + +- **ADR-010**: Configuration Format Strategy (TOML/YAML/Nickel) +- **ADR-011**: Nickel Migration (schema-first architecture) + +## Open Questions + +1. Should accessors be regenerated on every build or only on schema changes? +2. How do we handle conditional fields (if X then Y)? +3. What's the fallback strategy if generator fails? 
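+
+To make the mechanical transformation concrete, here is the shape of one generated accessor for a hypothetical `debug` field (illustrative only; real field paths, defaults, and docs come from `contracts.ncl`):
+
+```text
+# Nickel contract fragment (hypothetical):
+#   debug | Bool | doc "Enable debug output" | default = false
+
+# Accessor the generator would emit:
+export def get-debug [--config: record] {
+    config-get "debug" false --config $config
+}
+```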
+ +## Timeline + +- **Phase 1**: Generator implementation (foundation) +- **Phase 2**: Generate and validate accessor functions +- **Phase 3**: Integration tests and feature flags +- **Phase 4**: Full migration and manual code removal + +## References + +- Nickel Language: [https://nickel-lang.org/](https://nickel-lang.org/) +- Nushell 0.109 Guidelines: `.claude/guidelines/nushell.md` +- Current Accessor Implementation: `provisioning/core/nulib/lib_provisioning/config/accessor.nu` +- Schema Source: `provisioning/schemas/config/settings/contracts.ncl` diff --git a/docs/src/architecture/adr/adr-017-plugin-wrapper-abstraction-framework.md b/docs/src/architecture/adr/adr-017-plugin-wrapper-abstraction-framework.md index a307f52..a825e8e 100644 --- a/docs/src/architecture/adr/adr-017-plugin-wrapper-abstraction-framework.md +++ b/docs/src/architecture/adr/adr-017-plugin-wrapper-abstraction-framework.md @@ -1 +1,225 @@ -# ADR-017: Plugin Wrapper Abstraction Framework\n\n**Status**: Proposed\n**Date**: 2026-01-13\n**Author**: Architecture Team\n**Supersedes**: Manual plugin wrapper implementations in `lib_provisioning/plugins/`\n\n## Context\n\nThe provisioning system integrates with four critical plugins, each with its own wrapper layer:\n\n1. **auth.nu** (1066 lines) - Authentication plugin wrapper\n2. **orchestrator.nu** (~500 lines) - Orchestrator plugin wrapper\n3. **secretumvault.nu** (~500 lines) - Secrets vault plugin wrapper\n4. **kms.nu** (~500 lines) - Key management service plugin wrapper\n\nAnalysis reveals ~90% code duplication across these wrappers:\n\n```\n# Pattern repeated 4 times with minor variations:\nexport def plugin-available? [] {\n # Check if plugin is installed\n}\n\nexport def try-plugin-call [method args] {\n # Try to call the plugin\n # On failure, fallback to HTTP\n}\n\nexport def http-fallback-call [endpoint method args] {\n # HTTP endpoint fallback\n}\n```\n\n## Problem Statement\n\n**Current Architecture**:\n- Each plugin has manual wrapper implementation\n- ~3000 total lines across 4 files\n- Boilerplate code repeated for each plugin method\n- HTTP fallback logic duplicated\n- Error handling inconsistent\n- Testing each wrapper requires custom setup\n\n**Key Metrics**:\n- 3000 lines of plugin wrapper code\n- 90% code similarity\n- 85% reduction opportunity\n\n## Decision\n\nImplement **Plugin Wrapper Abstraction Framework**: replace manual plugin wrappers with a generic proxy framework + declarative YAML definitions.\n\n### Architecture\n\n```\nPlugin Definition (YAML)\n ├─ plugin: auth\n ├─ methods:\n │ ├─ login(username, password)\n │ ├─ logout()\n │ └─ status()\n └─ http_endpoint: http://localhost:8001\n\nGeneric Plugin Proxy Framework\n ├─ availability() - Check if plugin installed\n ├─ call() - Try plugin, fallback to HTTP\n ├─ http_fallback() - HTTP call with retry\n └─ error_handler() - Consistent error handling\n\nGenerated Wrappers\n ├─ auth_wrapper.nu (150 lines, autogenerated)\n ├─ orchestrator_wrapper.nu (150 lines)\n ├─ vault_wrapper.nu (150 lines)\n └─ kms_wrapper.nu (150 lines)\n```\n\n### Mechanism\n\n**Plugin Call Flow**:\n\n1. **Check Availability**: Is plugin installed and running?\n2. **Try Plugin Call**: Execute plugin method with timeout\n3. **On Failure**: Fall back to HTTP endpoint\n4. **Error Handling**: Unified error response format\n5. 
**Retry Logic**: Configurable retry with exponential backoff\n\n### Error Handling Pattern\n\n**Nushell 0.109 Compliant** (do-complete pattern, no try-catch):\n\n```\ndef call-plugin-with-fallback [method: string args: record] {\n let plugin_result = (\n do {\n # Try plugin call\n call-plugin $method $args\n } | complete\n )\n\n if $plugin_result.exit_code != 0 {\n # Fall back to HTTP\n call-http-endpoint $method $args\n } else {\n $plugin_result.stdout | from json\n }\n}\n```\n\n## Consequences\n\n### Positive\n\n- **85% Code Reduction**: 3000 lines → 200 (proxy) + 600 (generated)\n- **Consistency**: All plugins use identical call pattern\n- **Maintainability**: Single proxy implementation vs 4 wrapper files\n- **Testability**: Mock proxy for testing, no plugin-specific setup needed\n- **Extensibility**: New plugins require only YAML definition\n\n### Negative\n\n- **Abstraction Overhead**: Proxy layer adds indirection\n- **YAML Schema**: Must maintain schema for plugin definitions\n- **Migration Risk**: Replacing working code requires careful testing\n\n## Implementation Strategy\n\n1. **Create Generic Proxy** (`lib_provisioning/plugins/proxy.nu`)\n - Plugin availability detection\n - Call execution with error handling\n - HTTP fallback mechanism\n - Retry logic with backoff\n\n2. **Define Plugin Schema** (`lib_provisioning/plugins/definitions/plugin.schema.yaml`)\n - Plugin metadata (name, http_endpoint)\n - Method definitions (parameters, return types)\n - Fallback configuration (retry count, timeout)\n\n3. **Plugin Definitions** (`lib_provisioning/plugins/definitions/`)\n - `auth.yaml` - Authentication plugin\n - `orchestrator.yaml` - Orchestrator plugin\n - `secretumvault.yaml` - Secrets vault plugin\n - `kms.yaml` - Key management service plugin\n\n4. **Code Generator** (`tools/codegen/plugin_wrapper_generator.nu`)\n - Parse plugin YAML definitions\n - Generate wrapper functions\n - Ensure Nushell 0.109 compliance\n\n5. **Integration**\n - Feature flag: `$env.PROVISIONING_USE_GENERATED_PLUGINS`\n - Gradual migration from manual to generated wrappers\n - Full compatibility with existing code\n\n## Testing Strategy\n\n1. **Unit Tests**\n - Plugin availability detection\n - Successful plugin calls\n - HTTP fallback on plugin failure\n - Error handling and retry logic\n\n2. **Integration Tests**\n - Real plugin calls with actual plugins\n - Mock HTTP server for fallback testing\n - Timeout handling\n - Retry with backoff\n\n3. **Contract Tests**\n - Plugin method signatures match definitions\n - Return values have expected structure\n - Error responses consistent\n\n## Plugin Definitions\n\n### auth.yaml Example\n\n```\nplugin: auth\nhttp_endpoint: http://localhost:8001\nmethods:\n login:\n params:\n username: string\n password: string\n returns: {token: string}\n logout:\n params: {}\n returns: {status: string}\n status:\n params: {}\n returns: {authenticated: bool}\n```\n\n## Rollback Strategy\n\n**Feature Flag Approach**:\n\n```\n# Use original manual wrappers\nexport PROVISIONING_USE_GENERATED_PLUGINS=false\n\n# Use new generated proxy framework\nexport PROVISIONING_USE_GENERATED_PLUGINS=true\n```\n\nAllows parallel operation and gradual migration.\n\n## Related ADRs\n\n- **ADR-012**: Nushell/Nickel Plugin CLI Wrapper\n- **ADR-013**: TypeDialog Integration (forms for plugin configuration)\n\n## Open Questions\n\n1. Should plugin definitions be YAML or Nickel?\n2. How do we handle plugin discovery automatically?\n3. What's the expected HTTP endpoint format for all plugins?\n4. 
Should retry logic be configurable per plugin?\n\n## References\n\n- Nushell 0.109 Guidelines: `.claude/guidelines/nushell.md`\n- Do-Complete Pattern: Error handling without try-catch\n- Plugin Framework: `provisioning/core/nulib/lib_provisioning/plugins/`\n +# ADR-017: Plugin Wrapper Abstraction Framework + +**Status**: Proposed +**Date**: 2026-01-13 +**Author**: Architecture Team +**Supersedes**: Manual plugin wrapper implementations in `lib_provisioning/plugins/` + +## Context + +The provisioning system integrates with four critical plugins, each with its own wrapper layer: + +1. **auth.nu** (1066 lines) - Authentication plugin wrapper +2. **orchestrator.nu** (~500 lines) - Orchestrator plugin wrapper +3. **secretumvault.nu** (~500 lines) - Secrets vault plugin wrapper +4. **kms.nu** (~500 lines) - Key management service plugin wrapper + +Analysis reveals ~90% code duplication across these wrappers: + +```text +# Pattern repeated 4 times with minor variations: +export def plugin-available? [] { + # Check if plugin is installed +} + +export def try-plugin-call [method args] { + # Try to call the plugin + # On failure, fallback to HTTP +} + +export def http-fallback-call [endpoint method args] { + # HTTP endpoint fallback +} +``` + +## Problem Statement + +**Current Architecture**: +- Each plugin has manual wrapper implementation +- ~3000 total lines across 4 files +- Boilerplate code repeated for each plugin method +- HTTP fallback logic duplicated +- Error handling inconsistent +- Testing each wrapper requires custom setup + +**Key Metrics**: +- 3000 lines of plugin wrapper code +- 90% code similarity +- 85% reduction opportunity + +## Decision + +Implement **Plugin Wrapper Abstraction Framework**: replace manual plugin wrappers with a generic proxy framework + declarative YAML definitions. + +### Architecture + +```text +Plugin Definition (YAML) + ├─ plugin: auth + ├─ methods: + │ ├─ login(username, password) + │ ├─ logout() + │ └─ status() + └─ http_endpoint: http://localhost:8001 + +Generic Plugin Proxy Framework + ├─ availability() - Check if plugin installed + ├─ call() - Try plugin, fallback to HTTP + ├─ http_fallback() - HTTP call with retry + └─ error_handler() - Consistent error handling + +Generated Wrappers + ├─ auth_wrapper.nu (150 lines, autogenerated) + ├─ orchestrator_wrapper.nu (150 lines) + ├─ vault_wrapper.nu (150 lines) + └─ kms_wrapper.nu (150 lines) +``` + +### Mechanism + +**Plugin Call Flow**: + +1. **Check Availability**: Is plugin installed and running? +2. **Try Plugin Call**: Execute plugin method with timeout +3. **On Failure**: Fall back to HTTP endpoint +4. **Error Handling**: Unified error response format +5. 
**Retry Logic**: Configurable retry with exponential backoff + +### Error Handling Pattern + +**Nushell 0.109 Compliant** (do-complete pattern, no try-catch): + +```text +def call-plugin-with-fallback [method: string args: record] { + let plugin_result = ( + do { + # Try plugin call + call-plugin $method $args + } | complete + ) + + if $plugin_result.exit_code != 0 { + # Fall back to HTTP + call-http-endpoint $method $args + } else { + $plugin_result.stdout | from json + } +} +``` + +## Consequences + +### Positive + +- **85% Code Reduction**: 3000 lines → 200 (proxy) + 600 (generated) +- **Consistency**: All plugins use identical call pattern +- **Maintainability**: Single proxy implementation vs 4 wrapper files +- **Testability**: Mock proxy for testing, no plugin-specific setup needed +- **Extensibility**: New plugins require only YAML definition + +### Negative + +- **Abstraction Overhead**: Proxy layer adds indirection +- **YAML Schema**: Must maintain schema for plugin definitions +- **Migration Risk**: Replacing working code requires careful testing + +## Implementation Strategy + +1. **Create Generic Proxy** (`lib_provisioning/plugins/proxy.nu`) + - Plugin availability detection + - Call execution with error handling + - HTTP fallback mechanism + - Retry logic with backoff + +2. **Define Plugin Schema** (`lib_provisioning/plugins/definitions/plugin.schema.yaml`) + - Plugin metadata (name, http_endpoint) + - Method definitions (parameters, return types) + - Fallback configuration (retry count, timeout) + +3. **Plugin Definitions** (`lib_provisioning/plugins/definitions/`) + - `auth.yaml` - Authentication plugin + - `orchestrator.yaml` - Orchestrator plugin + - `secretumvault.yaml` - Secrets vault plugin + - `kms.yaml` - Key management service plugin + +4. **Code Generator** (`tools/codegen/plugin_wrapper_generator.nu`) + - Parse plugin YAML definitions + - Generate wrapper functions + - Ensure Nushell 0.109 compliance + +5. **Integration** + - Feature flag: `$env.PROVISIONING_USE_GENERATED_PLUGINS` + - Gradual migration from manual to generated wrappers + - Full compatibility with existing code + +## Testing Strategy + +1. **Unit Tests** + - Plugin availability detection + - Successful plugin calls + - HTTP fallback on plugin failure + - Error handling and retry logic + +2. **Integration Tests** + - Real plugin calls with actual plugins + - Mock HTTP server for fallback testing + - Timeout handling + - Retry with backoff + +3. **Contract Tests** + - Plugin method signatures match definitions + - Return values have expected structure + - Error responses consistent + +## Plugin Definitions + +### auth.yaml Example + +```text +plugin: auth +http_endpoint: http://localhost:8001 +methods: + login: + params: + username: string + password: string + returns: {token: string} + logout: + params: {} + returns: {status: string} + status: + params: {} + returns: {authenticated: bool} +``` + +## Rollback Strategy + +**Feature Flag Approach**: + +```text +# Use original manual wrappers +export PROVISIONING_USE_GENERATED_PLUGINS=false + +# Use new generated proxy framework +export PROVISIONING_USE_GENERATED_PLUGINS=true +``` + +Allows parallel operation and gradual migration. + +## Related ADRs + +- **ADR-012**: Nushell/Nickel Plugin CLI Wrapper +- **ADR-013**: TypeDialog Integration (forms for plugin configuration) + +## Open Questions + +1. Should plugin definitions be YAML or Nickel? +2. How do we handle plugin discovery automatically? +3. 
What's the expected HTTP endpoint format for all plugins? +4. Should retry logic be configurable per plugin? + +## References + +- Nushell 0.109 Guidelines: `.claude/guidelines/nushell.md` +- Do-Complete Pattern: Error handling without try-catch +- Plugin Framework: `provisioning/core/nulib/lib_provisioning/plugins/` diff --git a/docs/src/architecture/adr/adr-018-help-system-fluent-integration.md b/docs/src/architecture/adr/adr-018-help-system-fluent-integration.md index f84ac56..23b672f 100644 --- a/docs/src/architecture/adr/adr-018-help-system-fluent-integration.md +++ b/docs/src/architecture/adr/adr-018-help-system-fluent-integration.md @@ -1 +1,280 @@ -# ADR-018: Help System Fluent Integration & Data-Driven Architecture\n\n**Status**: Proposed\n**Date**: 2026-01-13\n**Author**: Architecture Team\n**Supersedes**: Hardcoded help strings in `main_provisioning/help_system.nu`\n\n## Context\n\nThe current help system in `main_provisioning/help_system.nu` (1303 lines) consists almost entirely of hardcoded string concatenation with embedded\nANSI formatting codes:\n\n```\ndef help-infrastructure [] {\n print "╔════════════════════════════════════════════════════╗"\n print "║ SERVER & INFRASTRUCTURE ║"\n print "╚════════════════════════════════════════════════════╝"\n}\n```\n\n**Current Problems**:\n\n1. **No Internationalization**: Help text trapped in English-only code\n2. **Hard to Maintain**: Updating text requires editing Nushell code\n3. **Mixed Concerns**: Content (strings) mixed with presentation (ANSI codes)\n4. **No Hot-Reload**: Changes require recompilation\n5. **Difficult to Test**: String content buried in function definitions\n\n## Problem Statement\n\n**Metrics**:\n- 1303 lines of code-embedded help text\n- 17 help categories with 65 strings total\n- All help functions manually maintained\n- No separation of data from presentation\n\n## Decision\n\nImplement **Data-Driven Help with Mozilla Fluent Integration**:\n\n1. Extract help content to Fluent files (`.ftl` format)\n2. Support multilingual help (English base, Spanish translations)\n3. Implement runtime language resolution via `LANG` environment variable\n4. Reduce help_system.nu to wrapper functions only\n\n### Architecture\n\n```\nHelp Content (Fluent Files)\n ├─ en-US/help.ftl (65 strings - English base)\n └─ es-ES/help.ftl (65 strings - Spanish translations)\n\nLanguage Detection & Loading\n ├─ Check LANG environment variable\n ├─ Load appropriate Fluent file\n └─ Implement fallback chain (es-ES → en-US)\n\nHelp System Wrapper\n ├─ help-main [] - Display main menu\n ├─ help-infrastructure [] - Infrastructure category\n ├─ help-orchestration [] - Orchestration category\n └─ help-setup [] - Setup category\n\nUser Interface\n ├─ LANG=en_US provisioning help infrastructure\n └─ LANG=es_ES provisioning help infrastructure\n```\n\n## Implementation\n\n### 1. 
Fluent File Structure\n\n**en-US/help.ftl**:\n\n```\nhelp-main-title = PROVISIONING SYSTEM\nhelp-main-subtitle = Layered Infrastructure Automation\nhelp-main-categories = COMMAND CATEGORIES\nhelp-main-categories-hint = Use 'provisioning help ' for details\nhelp-main-infrastructure-name = infrastructure\nhelp-main-infrastructure-desc = Server, taskserv, cluster, VM, and infra management\nhelp-main-orchestration-name = orchestration\nhelp-main-orchestration-desc = Workflow, batch operations, and orchestrator control\nhelp-infrastructure-title = SERVER & INFRASTRUCTURE\nhelp-infra-server = Server Operations\nhelp-infra-server-create = Create a new server\nhelp-infra-server-list = List all servers\nhelp-infra-server-status = Show server status\nhelp-infra-taskserv = TaskServ Management\nhelp-infra-taskserv-create = Deploy taskserv to server\nhelp-infra-cluster = Cluster Management\nhelp-infra-vm = Virtual Machine Operations\nhelp-orchestration-title = ORCHESTRATION & WORKFLOWS\nhelp-orch-control = Orchestrator Management\nhelp-orch-start = Start orchestrator [--background]\nhelp-orch-workflows = Single Task Workflows\nhelp-orch-batch = Multi-Provider Batch Operations\n```\n\n**es-ES/help.ftl** (Spanish translations):\n\n```\nhelp-main-title = SISTEMA DE PROVISIÓN\nhelp-main-subtitle = Automatización de Infraestructura por Capas\nhelp-main-categories = CATEGORÍAS DE COMANDOS\nhelp-main-categories-hint = Use 'provisioning help ' para más detalles\nhelp-main-infrastructure-name = infraestructura\nhelp-main-infrastructure-desc = Gestión de servidores, taskserv, clusters, VM e infraestructura\nhelp-main-orchestration-name = orquestación\nhelp-main-orchestration-desc = Flujos de trabajo, operaciones por lotes y control del orquestador\nhelp-infrastructure-title = SERVIDOR E INFRAESTRUCTURA\nhelp-infra-server = Operaciones de Servidor\nhelp-infra-server-create = Crear un nuevo servidor\nhelp-infra-server-list = Listar todos los servidores\nhelp-infra-server-status = Mostrar estado del servidor\nhelp-infra-taskserv = Gestión de TaskServ\nhelp-infra-taskserv-create = Desplegar taskserv en servidor\nhelp-infra-cluster = Gestión de Clusters\nhelp-infra-vm = Operaciones de Máquinas Virtuales\nhelp-orchestration-title = ORQUESTACIÓN Y FLUJOS DE TRABAJO\nhelp-orch-control = Gestión del Orquestador\nhelp-orch-start = Iniciar orquestador [--background]\nhelp-orch-workflows = Flujos de Trabajo de Tarea Única\nhelp-orch-batch = Operaciones por Lotes Multi-Proveedor\n```\n\n### 2. Fluent Loading in Nushell\n\n```\ndef load-fluent-file [category: string] {\n let lang = ($env.LANG? | default "en_US" | str replace "_" "-")\n let fluent_path = $"provisioning/locales/($lang)/help.ftl"\n\n # Parse Fluent file and extract strings for category\n # Fallback to en-US if lang not available\n}\n```\n\n### 3. 
Help System Wrapper\n\n```\nexport def help-infrastructure [] {\n let strings = (load-fluent-file "infrastructure")\n\n # Apply formatting and render\n print $"╔════════════════════════════════════════════════════╗"\n print $"║ ($strings.title | str upcase) ║"\n print $"╚════════════════════════════════════════════════════╝"\n}\n```\n\n## Consequences\n\n### Positive\n\n- **Internationalization Ready**: Easy to add new languages (Portuguese, French, Japanese)\n- **Data/Presentation Separation**: Content in Fluent, formatting in Nushell\n- **Maintainability**: Edit Fluent files, not Nushell code\n- **Hot-Reload Support**: Can update help text without recompilation\n- **Testing**: Help content testable independently from rendering\n- **Code Reduction**: 1303 lines → ~50 lines (wrapper) + ~700 lines (Fluent data)\n\n### Negative\n\n- **Tool Complexity**: Need Fluent parser and loader\n- **Fallback Chain Management**: Must handle missing translations gracefully\n- **Performance**: File I/O for loading translations (mitigated by caching)\n\n## Integration Strategy\n\n### Phase 1: Infrastructure & Extraction\n\n- ✅ Create `provisioning/locales/` directory structure\n- ✅ Create `i18n-config.toml` with locale configuration\n- ✅ Extract strings to `en-US/help.ftl` (65 strings)\n- ✅ Create Spanish translations `es-ES/help.ftl`\n\n### Phase 2: Integration (This Task)\n\n- [ ] Modify `help_system.nu` to load from Fluent\n- [ ] Implement language detection (`$env.LANG`)\n- [ ] Implement fallback chain logic\n- [ ] Test with `LANG=en_US` and `LANG=es_ES`\n\n### Phase 3: Validation & Documentation\n\n- [ ] Comprehensive integration tests\n- [ ] Performance benchmarks\n- [ ] Documentation for adding new languages\n- [ ] Examples in provisioning/docs/\n\n## Language Resolution Flow\n\n```\n1. Check LANG environment variable\n LANG=es_ES.UTF-8 → extract "es_ES" or "es-ES"\n\n2. Check if locale file exists\n provisioning/locales/es-ES/help.ftl exists? → YES\n\n3. Load locale file\n Parse and extract help strings\n\n4. On missing key:\n Check fallback chain in i18n-config.toml\n es-ES → en-US\n\n5. Render with formatting\n Apply ANSI codes, boxes, alignment\n```\n\n## Testing Strategy\n\n### Unit Tests\n\n```\n# Test language detection\nLANG=en_US provisioning help infrastructure\n# Expected: English output\n\nLANG=es_ES provisioning help infrastructure\n# Expected: Spanish output\n\nLANG=fr_FR provisioning help infrastructure\n# Expected: Fallback to English (fr-FR not available)\n```\n\n## File Structure\n\n```\nprovisioning/\n├── locales/\n│ ├── i18n-config.toml # Locale metadata & fallback chains\n│ ├── en-US/\n│ │ └── help.ftl # 65 English help strings\n│ └── es-ES/\n│ └── help.ftl # 65 Spanish help strings\n└── core/nulib/main_provisioning/\n └── help_system.nu # ~50 lines (wrapper only)\n```\n\n## Configuration\n\n**i18n-config.toml** defines:\n\n```\n[locales]\ndefault = "en-US"\nfallback = "en-US"\n\n[locales.en-US]\nname = "English (United States)"\n\n[locales.es-ES]\nname = "Spanish (Spain)"\n\n[fallback_chains]\nes-ES = ["en-US"]\n```\n\n## Related ADRs\n\n- **ADR-010**: Configuration Format Strategy\n- **ADR-011**: Nickel Migration\n- **ADR-013**: TypeDialog Integration (forms also use Fluent)\n\n## Open Questions\n\n1. Should help strings support Fluent attributes for metadata?\n2. Should we implement Fluent caching for performance?\n3. How do we handle dynamic help (commands not in Fluent)?\n4. 
Should help system auto-update when Fluent files change?\n\n## References\n\n- Mozilla Fluent: [https://projectfluent.org/](https://projectfluent.org/)\n- Fluent Syntax: [https://projectfluent.org/fluent/guide/](https://projectfluent.org/fluent/guide/)\n- Nushell 0.109 Guidelines: `.claude/guidelines/nushell.md`\n- Current Help Implementation: `provisioning/core/nulib/main_provisioning/help_system.nu`\n- Fluent Files: `provisioning/locales/{en-US,es-ES}/help.ftl`\n +# ADR-018: Help System Fluent Integration & Data-Driven Architecture + +**Status**: Proposed +**Date**: 2026-01-13 +**Author**: Architecture Team +**Supersedes**: Hardcoded help strings in `main_provisioning/help_system.nu` + +## Context + +The current help system in `main_provisioning/help_system.nu` (1303 lines) consists almost entirely of hardcoded string concatenation with embedded +ANSI formatting codes: + +```text +def help-infrastructure [] { + print "╔════════════════════════════════════════════════════╗" + print "║ SERVER & INFRASTRUCTURE ║" + print "╚════════════════════════════════════════════════════╝" +} +``` + +**Current Problems**: + +1. **No Internationalization**: Help text trapped in English-only code +2. **Hard to Maintain**: Updating text requires editing Nushell code +3. **Mixed Concerns**: Content (strings) mixed with presentation (ANSI codes) +4. **No Hot-Reload**: Changes require recompilation +5. **Difficult to Test**: String content buried in function definitions + +## Problem Statement + +**Metrics**: +- 1303 lines of code-embedded help text +- 17 help categories with 65 strings total +- All help functions manually maintained +- No separation of data from presentation + +## Decision + +Implement **Data-Driven Help with Mozilla Fluent Integration**: + +1. Extract help content to Fluent files (`.ftl` format) +2. Support multilingual help (English base, Spanish translations) +3. Implement runtime language resolution via `LANG` environment variable +4. Reduce help_system.nu to wrapper functions only + +### Architecture + +```text +Help Content (Fluent Files) + ├─ en-US/help.ftl (65 strings - English base) + └─ es-ES/help.ftl (65 strings - Spanish translations) + +Language Detection & Loading + ├─ Check LANG environment variable + ├─ Load appropriate Fluent file + └─ Implement fallback chain (es-ES → en-US) + +Help System Wrapper + ├─ help-main [] - Display main menu + ├─ help-infrastructure [] - Infrastructure category + ├─ help-orchestration [] - Orchestration category + └─ help-setup [] - Setup category + +User Interface + ├─ LANG=en_US provisioning help infrastructure + └─ LANG=es_ES provisioning help infrastructure +``` + +## Implementation + +### 1. 
Fluent File Structure
+
+**en-US/help.ftl**:
+
+```text
+help-main-title = PROVISIONING SYSTEM
+help-main-subtitle = Layered Infrastructure Automation
+help-main-categories = COMMAND CATEGORIES
+help-main-categories-hint = Use 'provisioning help <category>' for details
+help-main-infrastructure-name = infrastructure
+help-main-infrastructure-desc = Server, taskserv, cluster, VM, and infra management
+help-main-orchestration-name = orchestration
+help-main-orchestration-desc = Workflow, batch operations, and orchestrator control
+help-infrastructure-title = SERVER & INFRASTRUCTURE
+help-infra-server = Server Operations
+help-infra-server-create = Create a new server
+help-infra-server-list = List all servers
+help-infra-server-status = Show server status
+help-infra-taskserv = TaskServ Management
+help-infra-taskserv-create = Deploy taskserv to server
+help-infra-cluster = Cluster Management
+help-infra-vm = Virtual Machine Operations
+help-orchestration-title = ORCHESTRATION & WORKFLOWS
+help-orch-control = Orchestrator Management
+help-orch-start = Start orchestrator [--background]
+help-orch-workflows = Single Task Workflows
+help-orch-batch = Multi-Provider Batch Operations
+```
+
+**es-ES/help.ftl** (Spanish translations):
+
+```text
+help-main-title = SISTEMA DE PROVISIÓN
+help-main-subtitle = Automatización de Infraestructura por Capas
+help-main-categories = CATEGORÍAS DE COMANDOS
+help-main-categories-hint = Use 'provisioning help <categoría>' para más detalles
+help-main-infrastructure-name = infraestructura
+help-main-infrastructure-desc = Gestión de servidores, taskserv, clusters, VM e infraestructura
+help-main-orchestration-name = orquestación
+help-main-orchestration-desc = Flujos de trabajo, operaciones por lotes y control del orquestador
+help-infrastructure-title = SERVIDOR E INFRAESTRUCTURA
+help-infra-server = Operaciones de Servidor
+help-infra-server-create = Crear un nuevo servidor
+help-infra-server-list = Listar todos los servidores
+help-infra-server-status = Mostrar estado del servidor
+help-infra-taskserv = Gestión de TaskServ
+help-infra-taskserv-create = Desplegar taskserv en servidor
+help-infra-cluster = Gestión de Clusters
+help-infra-vm = Operaciones de Máquinas Virtuales
+help-orchestration-title = ORQUESTACIÓN Y FLUJOS DE TRABAJO
+help-orch-control = Gestión del Orquestador
+help-orch-start = Iniciar orquestador [--background]
+help-orch-workflows = Flujos de Trabajo de Tarea Única
+help-orch-batch = Operaciones por Lotes Multi-Proveedor
+```
+
+### 2. Fluent Loading in Nushell
+
+```nushell
+def load-fluent-file [category: string] {
+    # Normalize locale: "es_ES.UTF-8" -> "es-ES"; default to en-US
+    let lang = ($env.LANG? | default "en_US" | split row "." | first | str replace "_" "-")
+    let path = $"provisioning/locales/($lang)/help.ftl"
+    # Fall back to en-US when the requested locale file is not available
+    let file = if ($path | path exists) { $path } else { "provisioning/locales/en-US/help.ftl" }
+    # Parse simple `key = value` Fluent messages, keep the category's strings,
+    # and strip the category prefix so callers can use $strings.title
+    open $file
+    | lines
+    | parse "{key} = {value}"
+    | where ($it.key | str starts-with $"help-($category)")
+    | reduce -f {} {|it, acc| $acc | insert ($it.key | str replace $"help-($category)-" "") $it.value }
+}
+```
+
+### 3.
Help System Wrapper + +```text +export def help-infrastructure [] { + let strings = (load-fluent-file "infrastructure") + + # Apply formatting and render + print $"╔════════════════════════════════════════════════════╗" + print $"║ ($strings.title | str upcase) ║" + print $"╚════════════════════════════════════════════════════╝" +} +``` + +## Consequences + +### Positive + +- **Internationalization Ready**: Easy to add new languages (Portuguese, French, Japanese) +- **Data/Presentation Separation**: Content in Fluent, formatting in Nushell +- **Maintainability**: Edit Fluent files, not Nushell code +- **Hot-Reload Support**: Can update help text without recompilation +- **Testing**: Help content testable independently from rendering +- **Code Reduction**: 1303 lines → ~50 lines (wrapper) + ~700 lines (Fluent data) + +### Negative + +- **Tool Complexity**: Need Fluent parser and loader +- **Fallback Chain Management**: Must handle missing translations gracefully +- **Performance**: File I/O for loading translations (mitigated by caching) + +## Integration Strategy + +### Phase 1: Infrastructure & Extraction + +- ✅ Create `provisioning/locales/` directory structure +- ✅ Create `i18n-config.toml` with locale configuration +- ✅ Extract strings to `en-US/help.ftl` (65 strings) +- ✅ Create Spanish translations `es-ES/help.ftl` + +### Phase 2: Integration (This Task) + +- [ ] Modify `help_system.nu` to load from Fluent +- [ ] Implement language detection (`$env.LANG`) +- [ ] Implement fallback chain logic +- [ ] Test with `LANG=en_US` and `LANG=es_ES` + +### Phase 3: Validation & Documentation + +- [ ] Comprehensive integration tests +- [ ] Performance benchmarks +- [ ] Documentation for adding new languages +- [ ] Examples in provisioning/docs/ + +## Language Resolution Flow + +```text +1. Check LANG environment variable + LANG=es_ES.UTF-8 → extract "es_ES" or "es-ES" + +2. Check if locale file exists + provisioning/locales/es-ES/help.ftl exists? → YES + +3. Load locale file + Parse and extract help strings + +4. On missing key: + Check fallback chain in i18n-config.toml + es-ES → en-US + +5. Render with formatting + Apply ANSI codes, boxes, alignment +``` + +## Testing Strategy + +### Unit Tests + +```text +# Test language detection +LANG=en_US provisioning help infrastructure +# Expected: English output + +LANG=es_ES provisioning help infrastructure +# Expected: Spanish output + +LANG=fr_FR provisioning help infrastructure +# Expected: Fallback to English (fr-FR not available) +``` + +## File Structure + +```text +provisioning/ +├── locales/ +│ ├── i18n-config.toml # Locale metadata & fallback chains +│ ├── en-US/ +│ │ └── help.ftl # 65 English help strings +│ └── es-ES/ +│ └── help.ftl # 65 Spanish help strings +└── core/nulib/main_provisioning/ + └── help_system.nu # ~50 lines (wrapper only) +``` + +## Configuration + +**i18n-config.toml** defines: + +```text +[locales] +default = "en-US" +fallback = "en-US" + +[locales.en-US] +name = "English (United States)" + +[locales.es-ES] +name = "Spanish (Spain)" + +[fallback_chains] +es-ES = ["en-US"] +``` + +## Related ADRs + +- **ADR-010**: Configuration Format Strategy +- **ADR-011**: Nickel Migration +- **ADR-013**: TypeDialog Integration (forms also use Fluent) + +## Open Questions + +1. Should help strings support Fluent attributes for metadata? +2. Should we implement Fluent caching for performance? +3. How do we handle dynamic help (commands not in Fluent)? +4. Should help system auto-update when Fluent files change? 
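+
+## Fallback Resolution Sketch
+
+The resolution flow above leaves the fallback-chain lookup implicit. A minimal sketch of how `i18n-config.toml` could drive it, assuming the `[fallback_chains]` layout shown under Configuration (the `resolve-locale` name and candidate ordering are illustrative, not part of the current implementation):
+
+```nushell
+def resolve-locale [requested: string] {
+    let config = (open "provisioning/locales/i18n-config.toml")
+    # Fallback chain for the requested locale, if one is configured
+    let chain = if $requested in ($config.fallback_chains | columns) {
+        $config.fallback_chains | get $requested
+    } else { [] }
+    # Try the requested locale first, then its chain, then the global default
+    [$requested] | append $chain | append $config.locales.default
+    | where ($"provisioning/locales/($it)/help.ftl" | path exists)
+    | first
+}
+```
+
+With the files from File Structure in place, `resolve-locale "es-ES"` returns `es-ES`, while `resolve-locale "fr-FR"` has no configured chain and falls through to `en-US`.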
+ +## References + +- Mozilla Fluent: [https://projectfluent.org/](https://projectfluent.org/) +- Fluent Syntax: [https://projectfluent.org/fluent/guide/](https://projectfluent.org/fluent/guide/) +- Nushell 0.109 Guidelines: `.claude/guidelines/nushell.md` +- Current Help Implementation: `provisioning/core/nulib/main_provisioning/help_system.nu` +- Fluent Files: `provisioning/locales/{en-US,es-ES}/help.ftl` diff --git a/docs/src/architecture/adr/adr-019-configuration-loader-modularization.md b/docs/src/architecture/adr/adr-019-configuration-loader-modularization.md index 9517c25..c5410b0 100644 --- a/docs/src/architecture/adr/adr-019-configuration-loader-modularization.md +++ b/docs/src/architecture/adr/adr-019-configuration-loader-modularization.md @@ -1 +1,262 @@ -# ADR-019: Configuration Loader Modularization\n\n**Status**: Proposed\n**Date**: 2026-01-13\n**Author**: Architecture Team\n**Supersedes**: Monolithic loader in `lib_provisioning/config/loader.nu`\n\n## Context\n\nThe `lib_provisioning/config/loader.nu` file (2199 lines) is a monolithic implementation mixing multiple unrelated concerns:\n\n```\nCurrent Structure (2199 lines):\n├─ Cache lookup/storage (300 lines)\n├─ Nickel evaluation (400 lines)\n├─ TOML/YAML parsing (250 lines)\n├─ Environment variable loading (200 lines)\n├─ Configuration hierarchy merging (400 lines)\n├─ Validation logic (250 lines)\n├─ Error handling (200 lines)\n└─ Helper utilities (150 lines)\n```\n\n**Problems**:\n\n1. **Single Responsibility Violation**: One file handling 7 different concerns\n2. **Testing Difficulty**: Can't test TOML parsing without cache setup\n3. **Change Amplification**: Modifying one component affects entire file\n4. **Code Reuse**: Hard to reuse individual loaders in other contexts\n5. **Maintenance Burden**: 2199 lines of tightly coupled code\n\n## Problem Statement\n\n**Metrics**:\n- 2199 lines in single file\n- 7 distinct responsibilities mixed together\n- Hard to test individual components\n- Changes in one area risk breaking others\n\n## Decision\n\nImplement **Layered Loader Architecture**: decompose monolithic loader into specialized, testable modules with a thin orchestrator.\n\n### Target Architecture\n\n```\nlib_provisioning/config/\n├── loader.nu # ORCHESTRATOR (< 300 lines)\n│ └─ Coordinates loading pipeline\n├── loaders/ # SPECIALIZED LOADERS\n│ ├── nickel_loader.nu # Nickel evaluation + cache (150 lines)\n│ ├── toml_loader.nu # TOML parsing (80 lines)\n│ ├── yaml_loader.nu # YAML parsing (80 lines)\n│ ├── env_loader.nu # Environment variables (100 lines)\n│ └── hierarchy.nu # Configuration merging (200 lines)\n├── cache/ # EXISTING - already modular\n│ ├── core.nu # Cache core\n│ ├── nickel.nu # Nickel-specific caching\n│ └── final.nu # Final config caching\n└── validation/ # EXTRACTED\n └── config_validator.nu # Validation rules (100 lines)\n```\n\n### Module Responsibilities\n\n**loader.nu (Orchestrator)**:\n- Define loading pipeline\n- Coordinate loaders\n- Handle high-level errors\n- Return final config\n\n**nickel_loader.nu**:\n- Evaluate Nickel files\n- Apply Nickel type contracts\n- Cache Nickel evaluation results\n- Handle schema validation\n\n**toml_loader.nu**:\n- Parse TOML configuration files\n- Extract key-value pairs\n- Validate TOML structure\n- Return parsed records\n\n**yaml_loader.nu**:\n- Parse YAML configuration files\n- Convert to Nushell records\n- Handle YAML nesting\n- Return normalized records\n\n**env_loader.nu**:\n- Load environment variables\n- Filter by prefix (PROVISIONING_*)\n- 
Override existing values\n- Return environment records\n\n**hierarchy.nu**:\n- Merge multiple config sources\n- Apply precedence rules\n- Handle nested merging\n- Return unified config\n\n**config_validator.nu**:\n- Validate against schema\n- Check required fields\n- Enforce type constraints\n- Return validation results\n\n## Consequences\n\n### Positive\n\n- **Separation of Concerns**: Each module has single responsibility\n- **Testability**: Can unit test each loader independently\n- **Reusability**: Loaders can be used in other contexts\n- **Maintainability**: Changes isolated to specific module\n- **Debugging**: Easier to isolate issues\n- **Performance**: Can optimize individual loaders\n\n### Negative\n\n- **Increased Complexity**: More files to maintain\n- **Integration Overhead**: Must coordinate between modules\n- **Migration Effort**: Refactoring existing monolithic code\n\n## Implementation Strategy\n\n### Phase 1: Extract Specialized Loaders\n\nCreate each loader as independent module:\n\n1. **toml_loader.nu**\n ```nushell\n export def load-toml [path: string] {\n let content = (open $path)\n $content\n }\n ```\n\n2. **yaml_loader.nu**\n ```nushell\n export def load-yaml [path: string] {\n let content = (open --raw $path | from yaml)\n $content\n }\n ```\n\n3. **env_loader.nu**\n ```nushell\n export def load-environment [] {\n $env\n | to json\n | from json\n | select --contains "PROVISIONING_"\n }\n ```\n\n4. **hierarchy.nu**\n ```nushell\n export def merge-configs [base override] {\n $base | merge $override\n }\n ```\n\n### Phase 2: Refactor Nickel Loader\n\nExtract Nickel evaluation logic:\n\n```\nexport def evaluate-nickel [file: string] {\n let result = (\n do {\n ^nickel export $file\n } | complete\n )\n\n if $result.exit_code != 0 {\n error $result.stderr\n } else {\n $result.stdout | from json\n }\n}\n```\n\n### Phase 3: Create Orchestrator\n\nImplement thin loader.nu:\n\n```\nexport def load-provisioning-config [] {\n let env_config = (env-loader load-environment)\n let toml_config = (toml-loader load-toml "config.toml")\n let nickel_config = (nickel-loader evaluate-nickel "main.ncl")\n\n let merged = (\n {}\n | hierarchy merge-configs $toml_config\n | hierarchy merge-configs $nickel_config\n | hierarchy merge-configs $env_config\n )\n\n let validated = (config-validator validate $merged)\n $validated\n}\n```\n\n### Phase 4: Testing\n\nCreate test for each module:\n\n```\ntests/config/\n├── loaders/\n│ ├── test_nickel_loader.nu\n│ ├── test_toml_loader.nu\n│ ├── test_yaml_loader.nu\n│ ├── test_env_loader.nu\n│ └── test_hierarchy.nu\n└── test_orchestrator.nu\n```\n\n## Performance Considerations\n\n**Baseline**: Current monolithic loader ~500ms\n\n**Layered Architecture**:\n- Individual loaders: ~50-100ms each\n- Orchestration: ~50ms\n- Total expected: ~400-500ms (within 5% tolerance)\n\n**Optimization**:\n- Cache Nickel evaluation (largest cost)\n- Lazy load YAML (if rarely used)\n- Environment variable filtering\n\n## Backward Compatibility\n\n**Public API Unchanged**:\n```\n# Current usage (unchanged)\nlet config = (load-provisioning-config)\n```\n\n**Internal Only**: Refactoring is internal to loader module, no breaking changes to consumers.\n\n## Related ADRs\n\n- **ADR-010**: Configuration Format Strategy\n- **ADR-011**: Nickel Migration\n- **ADR-016**: Schema-Driven Accessor Generation\n\n## Open Questions\n\n1. Should each loader have its own cache layer?\n2. How do we handle circular dependencies between loaders?\n3. 
Should validation run after each loader or only at end?\n4. What's the rollback strategy if orchestration fails?\n\n## References\n\n- Current Implementation: `provisioning/core/nulib/lib_provisioning/config/loader.nu`\n- Cache System: `provisioning/core/nulib/lib_provisioning/config/cache/`\n- Nushell 0.109 Guidelines: `.claude/guidelines/nushell.md`\n +# ADR-019: Configuration Loader Modularization + +**Status**: Proposed +**Date**: 2026-01-13 +**Author**: Architecture Team +**Supersedes**: Monolithic loader in `lib_provisioning/config/loader.nu` + +## Context + +The `lib_provisioning/config/loader.nu` file (2199 lines) is a monolithic implementation mixing multiple unrelated concerns: + +```text +Current Structure (2199 lines): +├─ Cache lookup/storage (300 lines) +├─ Nickel evaluation (400 lines) +├─ TOML/YAML parsing (250 lines) +├─ Environment variable loading (200 lines) +├─ Configuration hierarchy merging (400 lines) +├─ Validation logic (250 lines) +├─ Error handling (200 lines) +└─ Helper utilities (150 lines) +``` + +**Problems**: + +1. **Single Responsibility Violation**: One file handling 7 different concerns +2. **Testing Difficulty**: Can't test TOML parsing without cache setup +3. **Change Amplification**: Modifying one component affects entire file +4. **Code Reuse**: Hard to reuse individual loaders in other contexts +5. **Maintenance Burden**: 2199 lines of tightly coupled code + +## Problem Statement + +**Metrics**: +- 2199 lines in single file +- 7 distinct responsibilities mixed together +- Hard to test individual components +- Changes in one area risk breaking others + +## Decision + +Implement **Layered Loader Architecture**: decompose monolithic loader into specialized, testable modules with a thin orchestrator. + +### Target Architecture + +```text +lib_provisioning/config/ +├── loader.nu # ORCHESTRATOR (< 300 lines) +│ └─ Coordinates loading pipeline +├── loaders/ # SPECIALIZED LOADERS +│ ├── nickel_loader.nu # Nickel evaluation + cache (150 lines) +│ ├── toml_loader.nu # TOML parsing (80 lines) +│ ├── yaml_loader.nu # YAML parsing (80 lines) +│ ├── env_loader.nu # Environment variables (100 lines) +│ └── hierarchy.nu # Configuration merging (200 lines) +├── cache/ # EXISTING - already modular +│ ├── core.nu # Cache core +│ ├── nickel.nu # Nickel-specific caching +│ └── final.nu # Final config caching +└── validation/ # EXTRACTED + └── config_validator.nu # Validation rules (100 lines) +``` + +### Module Responsibilities + +**loader.nu (Orchestrator)**: +- Define loading pipeline +- Coordinate loaders +- Handle high-level errors +- Return final config + +**nickel_loader.nu**: +- Evaluate Nickel files +- Apply Nickel type contracts +- Cache Nickel evaluation results +- Handle schema validation + +**toml_loader.nu**: +- Parse TOML configuration files +- Extract key-value pairs +- Validate TOML structure +- Return parsed records + +**yaml_loader.nu**: +- Parse YAML configuration files +- Convert to Nushell records +- Handle YAML nesting +- Return normalized records + +**env_loader.nu**: +- Load environment variables +- Filter by prefix (PROVISIONING_*) +- Override existing values +- Return environment records + +**hierarchy.nu**: +- Merge multiple config sources +- Apply precedence rules +- Handle nested merging +- Return unified config + +**config_validator.nu**: +- Validate against schema +- Check required fields +- Enforce type constraints +- Return validation results + +## Consequences + +### Positive + +- **Separation of Concerns**: Each module has single 
responsibility
+- **Testability**: Can unit test each loader independently
+- **Reusability**: Loaders can be used in other contexts
+- **Maintainability**: Changes isolated to specific module
+- **Debugging**: Easier to isolate issues
+- **Performance**: Can optimize individual loaders
+
+### Negative
+
+- **Increased Complexity**: More files to maintain
+- **Integration Overhead**: Must coordinate between modules
+- **Migration Effort**: Refactoring existing monolithic code
+
+## Implementation Strategy
+
+### Phase 1: Extract Specialized Loaders
+
+Create each loader as independent module:
+
+1. **toml_loader.nu**
+   ```nushell
+   export def load-toml [path: string] {
+       # `open` parses .toml files into a record automatically
+       open $path
+   }
+   ```
+
+2. **yaml_loader.nu**
+   ```nushell
+   export def load-yaml [path: string] {
+       # Read raw text, then parse explicitly as YAML
+       open --raw $path | from yaml
+   }
+   ```
+
+3. **env_loader.nu**
+   ```nushell
+   export def load-environment [] {
+       # Keep only PROVISIONING_* variables, rebuilt as a record
+       $env
+       | transpose key value
+       | where ($it.key | str starts-with "PROVISIONING_")
+       | reduce -f {} {|it, acc| $acc | insert $it.key $it.value }
+   }
+   ```
+
+4. **hierarchy.nu**
+   ```nushell
+   export def merge-configs [override: record] {
+       # Shallow merge of the override into the piped-in base config;
+       # deep (nested) merging is left to the full implementation
+       $in | merge $override
+   }
+   ```
+
+### Phase 2: Refactor Nickel Loader
+
+Extract Nickel evaluation logic:
+
+```nushell
+export def evaluate-nickel [file: string] {
+    # Capture exit code, stdout, and stderr from the external nickel call
+    let result = (
+        do {
+            ^nickel export $file
+        } | complete
+    )
+
+    if $result.exit_code != 0 {
+        error make { msg: $result.stderr }
+    } else {
+        $result.stdout | from json
+    }
+}
+```
+
+### Phase 3: Create Orchestrator
+
+Implement thin loader.nu:
+
+```nushell
+use loaders/env_loader.nu
+use loaders/toml_loader.nu
+use loaders/nickel_loader.nu
+use loaders/hierarchy.nu
+use validation/config_validator.nu
+
+export def load-provisioning-config [] {
+    let env_config = (env_loader load-environment)
+    let toml_config = (toml_loader load-toml "config.toml")
+    let nickel_config = (nickel_loader evaluate-nickel "main.ncl")
+
+    # Later sources win: TOML < Nickel < environment overrides
+    let merged = (
+        {}
+        | hierarchy merge-configs $toml_config
+        | hierarchy merge-configs $nickel_config
+        | hierarchy merge-configs $env_config
+    )
+
+    config_validator validate $merged
+}
+```
+
+### Phase 4: Testing
+
+Create test for each module:
+
+```text
+tests/config/
+├── loaders/
+│   ├── test_nickel_loader.nu
+│   ├── test_toml_loader.nu
+│   ├── test_yaml_loader.nu
+│   ├── test_env_loader.nu
+│   └── test_hierarchy.nu
+└── test_orchestrator.nu
+```
+
+## Performance Considerations
+
+**Baseline**: Current monolithic loader ~500ms
+
+**Layered Architecture**:
+- Individual loaders: ~50-100ms each
+- Orchestration: ~50ms
+- Total expected: ~400-500ms (within 5% tolerance)
+
+**Optimization**:
+- Cache Nickel evaluation (largest cost)
+- Lazy load YAML (if rarely used)
+- Environment variable filtering
+
+## Backward Compatibility
+
+**Public API Unchanged**:
+```nushell
+# Current usage (unchanged)
+let config = (load-provisioning-config)
+```
+
+**Internal Only**: Refactoring is internal to the loader module; no breaking changes for consumers.
+
+## Related ADRs
+
+- **ADR-010**: Configuration Format Strategy
+- **ADR-011**: Nickel Migration
+- **ADR-016**: Schema-Driven Accessor Generation
+
+## Open Questions
+
+1. Should each loader have its own cache layer?
+2. How do we handle circular dependencies between loaders?
+3. Should validation run after each loader or only at end?
+4. What's the rollback strategy if orchestration fails?
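+
+## Test Sketch
+
+Phase 4 lists the per-module test files without showing their shape. A minimal sketch of one such unit test, assuming Nushell's `std assert` module; the `test-load-toml` name, the relative import, and the temp path are illustrative:
+
+```nushell
+# tests/config/loaders/test_toml_loader.nu
+use std assert
+use ../../../loaders/toml_loader.nu *
+
+export def test-load-toml [] {
+    # Write a tiny TOML fixture, then load it through the module under test
+    'key = "value"' | save --force /tmp/test_toml_loader.toml
+    let cfg = (load-toml /tmp/test_toml_loader.toml)
+    assert equal $cfg.key "value"
+}
+```
+
+Because each loader is a pure function over its input, tests like this need no cache setup, which is exactly the testability gain this decision targets.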
+ +## References + +- Current Implementation: `provisioning/core/nulib/lib_provisioning/config/loader.nu` +- Cache System: `provisioning/core/nulib/lib_provisioning/config/cache/` +- Nushell 0.109 Guidelines: `.claude/guidelines/nushell.md` diff --git a/docs/src/architecture/adr/adr-020-command-handler-domain-splitting.md b/docs/src/architecture/adr/adr-020-command-handler-domain-splitting.md index b640323..ab78d40 100644 --- a/docs/src/architecture/adr/adr-020-command-handler-domain-splitting.md +++ b/docs/src/architecture/adr/adr-020-command-handler-domain-splitting.md @@ -1 +1,312 @@ -# ADR-020: Command Handler Domain Splitting\n\n**Status**: Proposed\n**Date**: 2026-01-13\n**Author**: Architecture Team\n**Supersedes**: Monolithic command handlers in `main_provisioning/commands/`\n\n## Context\n\nTwo large monolithic command handler files mix disparate domains:\n\n**commands/utilities.nu** (1112 lines):\n- SSH operations (150 lines)\n- SOPS secret editing (200 lines)\n- Cache management (180 lines)\n- Provider listing (100 lines)\n- Plugin operations (150 lines)\n- Shell information (80 lines)\n- Guide system (120 lines)\n- QR code generation (50 lines)\n\n**commands/integrations.nu** (1184 lines):\n- prov-ecosystem bridge (400 lines)\n- provctl integration (350 lines)\n- External API calls (434 lines)\n\n**Problem Statement**:\n\n1. **Mixed Concerns**: Each file handles 7-10 unrelated domains\n2. **Navigation Difficulty**: Hard to find specific functionality\n3. **Testing Complexity**: Can't test SSH without SOPS setup\n4. **Reusability**: Command logic locked in monolithic files\n5. **Maintenance Burden**: Changes in one domain affect entire file\n\n## Decision\n\nImplement **Domain-Based Command Modules**: split monolithic handlers into focused domain modules organized by responsibility.\n\n### Target Architecture\n\n```\nmain_provisioning/commands/\n├── dispatcher.nu # Routes commands to domain handlers\n├── utilities/ # Split by domain\n│ ├── ssh.nu # SSH operations (150 lines)\n│ ├── sops.nu # SOPS editing (200 lines)\n│ ├── cache.nu # Cache management (180 lines)\n│ ├── providers.nu # Provider listing (100 lines)\n│ ├── plugins.nu # Plugin operations (150 lines)\n│ ├── shell.nu # Shell information (80 lines)\n│ ├── guides.nu # Guide system (120 lines)\n│ └── qr.nu # QR code generation (50 lines)\n└── integrations/ # Split by integration\n ├── prov_ecosystem.nu # Prov-ecosystem bridge (400 lines)\n ├── provctl.nu # Provctl integration (350 lines)\n └── external_apis.nu # External API calls (434 lines)\n```\n\n### Module Organization\n\n**utilities/ssh.nu**:\n- SSH connection management\n- Key management\n- Remote command execution\n- Connection pooling\n\n**utilities/sops.nu**:\n- SOPS secret file editing\n- Encryption/decryption\n- Key rotation\n- Secret validation\n\n**utilities/cache.nu**:\n- Cache lookup\n- Cache invalidation\n- Cache statistics\n- Cleanup operations\n\n**utilities/providers.nu**:\n- List available providers\n- Provider capabilities\n- Provider health check\n- Provider registration\n\n**utilities/plugins.nu**:\n- Plugin discovery\n- Plugin loading\n- Plugin execution\n- Plugin management\n\n**utilities/shell.nu**:\n- Nushell info\n- Shell configuration\n- Environment variables\n- Shell capabilities\n\n**utilities/guides.nu**:\n- Guide listing\n- Guide rendering\n- Guide search\n- Interactive guides\n\n**utilities/qr.nu**:\n- QR code generation\n- QR code display\n- Code formatting\n- Error handling\n\n**integrations/prov_ecosystem.nu**:\n- Prov-ecosystem API 
calls\n- Data synchronization\n- Registry integration\n- Extension discovery\n\n**integrations/provctl.nu**:\n- Provctl command bridge\n- Orchestrator integration\n- Workflow execution\n- Status monitoring\n\n**integrations/external_apis.nu**:\n- Third-party API integration\n- HTTP calls\n- Data transformation\n- Error handling\n\n## Consequences\n\n### Positive\n\n- **Single Responsibility**: Each module handles one domain\n- **Easier Navigation**: Find functionality by domain name\n- **Testable**: Can test SSH independently from SOPS\n- **Maintainable**: Changes isolated to domain module\n- **Reusable**: Modules can be imported by other components\n- **Scalable**: Easy to add new domains\n\n### Negative\n\n- **More Files**: 11 modules vs 2 monolithic files\n- **Import Overhead**: More module imports needed\n- **Coordination Complexity**: Dispatcher must route correctly\n\n## Implementation Strategy\n\n### Phase 1: Extract Utilities Domain\n\nCreate `utilities/` directory with 8 modules:\n\n1. **utilities/ssh.nu** - Extract SSH operations\n2. **utilities/sops.nu** - Extract SOPS operations\n3. **utilities/cache.nu** - Extract cache operations\n4. **utilities/providers.nu** - Extract provider operations\n5. **utilities/plugins.nu** - Extract plugin operations\n6. **utilities/shell.nu** - Extract shell operations\n7. **utilities/guides.nu** - Extract guide operations\n8. **utilities/qr.nu** - Extract QR operations\n\n### Phase 2: Extract Integrations Domain\n\nCreate `integrations/` directory with 3 modules:\n\n1. **integrations/prov_ecosystem.nu** - Extract prov-ecosystem\n2. **integrations/provctl.nu** - Extract provctl\n3. **integrations/external_apis.nu** - Extract external APIs\n\n### Phase 3: Create Dispatcher\n\nImplement `dispatcher.nu`:\n\n```\nexport def provision-ssh [args] {\n use ./utilities/ssh.nu *\n handle-ssh-command $args\n}\n\nexport def provision-sops [args] {\n use ./utilities/sops.nu *\n handle-sops-command $args\n}\n\nexport def provision-cache [args] {\n use ./utilities/cache.nu *\n handle-cache-command $args\n}\n```\n\n### Phase 4: Maintain Backward Compatibility\n\nKeep public exports in original files for compatibility:\n\n```\n# commands/utilities.nu (compatibility layer)\nuse ./utilities/ssh.nu *\nuse ./utilities/sops.nu *\nuse ./utilities/cache.nu *\n\n# Re-export all functions (unchanged public API)\nexport use ./utilities/ssh.nu\nexport use ./utilities/sops.nu\n```\n\n### Phase 5: Testing\n\nCreate test structure:\n\n```\ntests/commands/\n├── utilities/\n│ ├── test_ssh.nu\n│ ├── test_sops.nu\n│ ├── test_cache.nu\n│ ├── test_providers.nu\n│ ├── test_plugins.nu\n│ ├── test_shell.nu\n│ ├── test_guides.nu\n│ └── test_qr.nu\n└── integrations/\n ├── test_prov_ecosystem.nu\n ├── test_provctl.nu\n └── test_external_apis.nu\n```\n\n## Module Interface Example\n\n**utilities/ssh.nu**:\n\n```\n# Connect to remote host\nexport def ssh-connect [host: string --port: int = 22] {\n # Implementation\n}\n\n# Execute remote command\nexport def ssh-exec [host: string command: string] {\n # Implementation\n}\n\n# Close SSH connection\nexport def ssh-close [host: string] {\n # Implementation\n}\n```\n\n## File Structure\n\n```\nmain_provisioning/commands/\n├── dispatcher.nu # Route to domain handlers\n├── utilities/\n│ ├── mod.nu # Utilities module index\n│ ├── ssh.nu # 150 lines\n│ ├── sops.nu # 200 lines\n│ ├── cache.nu # 180 lines\n│ ├── providers.nu # 100 lines\n│ ├── plugins.nu # 150 lines\n│ ├── shell.nu # 80 lines\n│ ├── guides.nu # 120 lines\n│ └── qr.nu # 50 lines\n├── 
integrations/\n│ ├── mod.nu # Integrations module index\n│ ├── prov_ecosystem.nu # 400 lines\n│ ├── provctl.nu # 350 lines\n│ └── external_apis.nu # 434 lines\n└── README.md # Command routing guide\n```\n\n## CLI Interface (Unchanged)\n\nUsers see no change in CLI:\n\n```\nprovisioning ssh host.example.com\nprovisioning sops edit config.yaml\nprovisioning cache clear\nprovisioning list providers\nprovisioning guide from-scratch\n```\n\n## Backward Compatibility Strategy\n\n**Import Path Options**:\n\n```\n# Option 1: Import from domain module (new way)\nuse ./utilities/ssh.nu *\nconnect $host\n\n# Option 2: Import from compatibility layer (old way)\nuse ./utilities.nu *\nconnect $host\n```\n\nBoth paths work without breaking existing code.\n\n## Related ADRs\n\n- **ADR-006**: Provisioning CLI Refactoring\n- **ADR-012**: Nushell/Nickel Plugin CLI Wrapper\n\n## Open Questions\n\n1. Should we create a module registry for discoverability?\n2. Should domain modules be loadable as plugins?\n3. How do we handle shared utilities between domains?\n4. Should we implement hot-reloading for domain modules?\n\n## References\n\n- Current Implementation: `provisioning/core/nulib/main_provisioning/commands/`\n- Nushell 0.109 Guidelines: `.claude/guidelines/nushell.md`\n- Module System: Nushell module documentation\n +# ADR-020: Command Handler Domain Splitting + +**Status**: Proposed +**Date**: 2026-01-13 +**Author**: Architecture Team +**Supersedes**: Monolithic command handlers in `main_provisioning/commands/` + +## Context + +Two large monolithic command handler files mix disparate domains: + +**commands/utilities.nu** (1112 lines): +- SSH operations (150 lines) +- SOPS secret editing (200 lines) +- Cache management (180 lines) +- Provider listing (100 lines) +- Plugin operations (150 lines) +- Shell information (80 lines) +- Guide system (120 lines) +- QR code generation (50 lines) + +**commands/integrations.nu** (1184 lines): +- prov-ecosystem bridge (400 lines) +- provctl integration (350 lines) +- External API calls (434 lines) + +**Problem Statement**: + +1. **Mixed Concerns**: Each file handles 7-10 unrelated domains +2. **Navigation Difficulty**: Hard to find specific functionality +3. **Testing Complexity**: Can't test SSH without SOPS setup +4. **Reusability**: Command logic locked in monolithic files +5. **Maintenance Burden**: Changes in one domain affect entire file + +## Decision + +Implement **Domain-Based Command Modules**: split monolithic handlers into focused domain modules organized by responsibility. 
+ +### Target Architecture + +```text +main_provisioning/commands/ +├── dispatcher.nu # Routes commands to domain handlers +├── utilities/ # Split by domain +│ ├── ssh.nu # SSH operations (150 lines) +│ ├── sops.nu # SOPS editing (200 lines) +│ ├── cache.nu # Cache management (180 lines) +│ ├── providers.nu # Provider listing (100 lines) +│ ├── plugins.nu # Plugin operations (150 lines) +│ ├── shell.nu # Shell information (80 lines) +│ ├── guides.nu # Guide system (120 lines) +│ └── qr.nu # QR code generation (50 lines) +└── integrations/ # Split by integration + ├── prov_ecosystem.nu # Prov-ecosystem bridge (400 lines) + ├── provctl.nu # Provctl integration (350 lines) + └── external_apis.nu # External API calls (434 lines) +``` + +### Module Organization + +**utilities/ssh.nu**: +- SSH connection management +- Key management +- Remote command execution +- Connection pooling + +**utilities/sops.nu**: +- SOPS secret file editing +- Encryption/decryption +- Key rotation +- Secret validation + +**utilities/cache.nu**: +- Cache lookup +- Cache invalidation +- Cache statistics +- Cleanup operations + +**utilities/providers.nu**: +- List available providers +- Provider capabilities +- Provider health check +- Provider registration + +**utilities/plugins.nu**: +- Plugin discovery +- Plugin loading +- Plugin execution +- Plugin management + +**utilities/shell.nu**: +- Nushell info +- Shell configuration +- Environment variables +- Shell capabilities + +**utilities/guides.nu**: +- Guide listing +- Guide rendering +- Guide search +- Interactive guides + +**utilities/qr.nu**: +- QR code generation +- QR code display +- Code formatting +- Error handling + +**integrations/prov_ecosystem.nu**: +- Prov-ecosystem API calls +- Data synchronization +- Registry integration +- Extension discovery + +**integrations/provctl.nu**: +- Provctl command bridge +- Orchestrator integration +- Workflow execution +- Status monitoring + +**integrations/external_apis.nu**: +- Third-party API integration +- HTTP calls +- Data transformation +- Error handling + +## Consequences + +### Positive + +- **Single Responsibility**: Each module handles one domain +- **Easier Navigation**: Find functionality by domain name +- **Testable**: Can test SSH independently from SOPS +- **Maintainable**: Changes isolated to domain module +- **Reusable**: Modules can be imported by other components +- **Scalable**: Easy to add new domains + +### Negative + +- **More Files**: 11 modules vs 2 monolithic files +- **Import Overhead**: More module imports needed +- **Coordination Complexity**: Dispatcher must route correctly + +## Implementation Strategy + +### Phase 1: Extract Utilities Domain + +Create `utilities/` directory with 8 modules: + +1. **utilities/ssh.nu** - Extract SSH operations +2. **utilities/sops.nu** - Extract SOPS operations +3. **utilities/cache.nu** - Extract cache operations +4. **utilities/providers.nu** - Extract provider operations +5. **utilities/plugins.nu** - Extract plugin operations +6. **utilities/shell.nu** - Extract shell operations +7. **utilities/guides.nu** - Extract guide operations +8. **utilities/qr.nu** - Extract QR operations + +### Phase 2: Extract Integrations Domain + +Create `integrations/` directory with 3 modules: + +1. **integrations/prov_ecosystem.nu** - Extract prov-ecosystem +2. **integrations/provctl.nu** - Extract provctl +3. 
**integrations/external_apis.nu** - Extract external APIs
+
+### Phase 3: Create Dispatcher
+
+Implement `dispatcher.nu`:
+
+```nushell
+export def provision-ssh [args] {
+    use ./utilities/ssh.nu *
+    handle-ssh-command $args
+}
+
+export def provision-sops [args] {
+    use ./utilities/sops.nu *
+    handle-sops-command $args
+}
+
+export def provision-cache [args] {
+    use ./utilities/cache.nu *
+    handle-cache-command $args
+}
+```
+
+### Phase 4: Maintain Backward Compatibility
+
+Keep public exports in original files for compatibility:
+
+```nushell
+# commands/utilities.nu (compatibility layer)
+use ./utilities/ssh.nu *
+use ./utilities/sops.nu *
+use ./utilities/cache.nu *
+
+# Re-export all functions (unchanged public API)
+export use ./utilities/ssh.nu
+export use ./utilities/sops.nu
+```
+
+### Phase 5: Testing
+
+Create test structure:
+
+```text
+tests/commands/
+├── utilities/
+│   ├── test_ssh.nu
+│   ├── test_sops.nu
+│   ├── test_cache.nu
+│   ├── test_providers.nu
+│   ├── test_plugins.nu
+│   ├── test_shell.nu
+│   ├── test_guides.nu
+│   └── test_qr.nu
+└── integrations/
+    ├── test_prov_ecosystem.nu
+    ├── test_provctl.nu
+    └── test_external_apis.nu
+```
+
+## Module Interface Example
+
+**utilities/ssh.nu**:
+
+```nushell
+# Connect to remote host
+export def ssh-connect [host: string --port: int = 22] {
+    # Implementation
+}
+
+# Execute remote command
+export def ssh-exec [host: string command: string] {
+    # Implementation
+}
+
+# Close SSH connection
+export def ssh-close [host: string] {
+    # Implementation
+}
+```
+
+## File Structure
+
+```text
+main_provisioning/commands/
+├── dispatcher.nu          # Route to domain handlers
+├── utilities/
+│   ├── mod.nu             # Utilities module index
+│   ├── ssh.nu             # 150 lines
+│   ├── sops.nu            # 200 lines
+│   ├── cache.nu           # 180 lines
+│   ├── providers.nu       # 100 lines
+│   ├── plugins.nu         # 150 lines
+│   ├── shell.nu           # 80 lines
+│   ├── guides.nu          # 120 lines
+│   └── qr.nu              # 50 lines
+├── integrations/
+│   ├── mod.nu             # Integrations module index
+│   ├── prov_ecosystem.nu  # 400 lines
+│   ├── provctl.nu         # 350 lines
+│   └── external_apis.nu   # 434 lines
+└── README.md              # Command routing guide
+```
+
+## CLI Interface (Unchanged)
+
+Users see no change in CLI:
+
+```text
+provisioning ssh host.example.com
+provisioning sops edit config.yaml
+provisioning cache clear
+provisioning list providers
+provisioning guide from-scratch
+```
+
+## Backward Compatibility Strategy
+
+**Import Path Options**:
+
+```nushell
+# Option 1: Import from domain module (new way)
+use ./utilities/ssh.nu *
+ssh-connect $host
+
+# Option 2: Import from compatibility layer (old way)
+use ./utilities.nu *
+ssh-connect $host
+```
+
+Both paths expose the same function names, so existing code keeps working.
+
+## Related ADRs
+
+- **ADR-006**: Provisioning CLI Refactoring
+- **ADR-012**: Nushell/Nickel Plugin CLI Wrapper
+
+## Open Questions
+
+1. Should we create a module registry for discoverability?
+2. Should domain modules be loadable as plugins?
+3. How do we handle shared utilities between domains?
+4. Should we implement hot-reloading for domain modules?
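+
+## Module Index Sketch
+
+The file structure above lists a `mod.nu` index per domain directory without showing its contents. A minimal sketch, assuming Nushell's `export use` re-export syntax (the exact set of re-exports is illustrative):
+
+```nushell
+# commands/utilities/mod.nu - single entry point for the utilities domain
+export use ./ssh.nu *
+export use ./sops.nu *
+export use ./cache.nu *
+export use ./providers.nu *
+export use ./plugins.nu *
+export use ./shell.nu *
+export use ./guides.nu *
+export use ./qr.nu *
+```
+
+Consumers can then `use ./utilities *` for the whole domain or `use ./utilities/ssh.nu *` for a single module, which keeps the Phase 4 compatibility layer a thin re-export rather than duplicated code.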
+ +## References + +- Current Implementation: `provisioning/core/nulib/main_provisioning/commands/` +- Nushell 0.109 Guidelines: `.claude/guidelines/nushell.md` +- Module System: Nushell module documentation diff --git a/docs/src/architecture/architecture-overview.md b/docs/src/architecture/architecture-overview.md index 1e8ea03..7b11ce9 100644 --- a/docs/src/architecture/architecture-overview.md +++ b/docs/src/architecture/architecture-overview.md @@ -1 +1,1337 @@ -# Provisioning Platform - Architecture Overview\n\n**Version**: 3.5.0\n**Date**: 2025-10-06\n**Status**: Production\n**Maintainers**: Architecture Team\n\n---\n\n## Table of Contents\n\n1. [Executive Summary](#executive-summary)\n2. [System Architecture](#system-architecture)\n3. [Component Architecture](#component-architecture)\n4. [Mode Architecture](#mode-architecture)\n5. [Network Architecture](#network-architecture)\n6. [Data Architecture](#data-architecture)\n7. [Security Architecture](#security-architecture)\n8. [Deployment Architecture](#deployment-architecture)\n9. [Integration Architecture](#integration-architecture)\n10. [Performance and Scalability](#performance-and-scalability)\n11. [Evolution and Roadmap](#evolution-and-roadmap)\n\n---\n\n## Executive Summary\n\n### What is the Provisioning Platform\n\nThe Provisioning Platform is a modern, cloud-native infrastructure automation system that combines:\n\n- the simplicity of declarative configuration (Nickel)\n- the power of shell scripting (Nushell)\n- high-performance coordination (Rust).\n\n### Key Characteristics\n\n- **Hybrid Architecture**: Rust for coordination, Nushell for business logic, Nickel for configuration\n- **Mode-Based**: Adapts from solo development to enterprise production\n- **OCI-Native**: Extends leveraging industry-standard OCI distribution\n- **Provider-Agnostic**: Supports multiple cloud providers (AWS, UpCloud) and local infrastructure\n- **Extension-Driven**: Core functionality enhanced through modular extensions\n\n### Architecture at a Glance\n\n```\n┌─────────────────────────────────────────────────────────────────────┐\n│ Provisioning Platform │\n├─────────────────────────────────────────────────────────────────────┤\n│ │\n│ ┌──────────────┐ ┌─────────────┐ ┌──────────────┐ │\n│ │ User Layer │ │ Extension │ │ Service │ │\n│ │ (CLI/UI) │ │ Registry │ │ Registry │ │\n│ └──────┬───────┘ └──────┬──────┘ └──────┬───────┘ │\n│ │ │ │ │\n│ ┌──────┴──────────────────┴──────────────────┴──--────┐ │\n│ │ Core Provisioning Engine │ │\n│ │ (Config | Dependency Resolution | Workflows) │ │\n│ └──────┬──────────────────────────────────────┬───────┘ │\n│ │ │ │\n│ ┌──────┴─────────┐ ┌──────-─┴─────────┐ │\n│ │ Orchestrator │ │ Business Logic │ │\n│ │ (Rust) │ ←─ Coordination → │ (Nushell) │ │\n│ └──────┬─────────┘ └───────┬──────────┘ │\n│ │ │ │\n│ ┌──────┴─────────────────────────────────────┴---──────┐ │\n│ │ Extension System │ │\n│ │ (Providers | Task Services | Clusters) │ │\n│ └──────┬───────────────────────────────────────────────┘ │\n│ │ │\n│ ┌──────┴──────────────────────────────────────────────────-─┐ │\n│ │ Infrastructure (Cloud | Local | Kubernetes) │ │\n│ └───────────────────────────────────────────────────────────┘ │\n│ │\n└─────────────────────────────────────────────────────────────────────┘\n```\n\n### Key Metrics\n\n| Metric | Value | Description |\n| -------- | ------- | ------------- |\n| **Codebase Size** | ~50,000 LOC | Nushell (60%), Rust (30%), Nickel (10%) |\n| **Extensions** | 100+ | Providers, taskservs, clusters |\n| 
**Supported Providers** | 3 | AWS, UpCloud, Local |\n| **Task Services** | 50+ | Kubernetes, databases, monitoring, etc. |\n| **Deployment Modes** | 5 | Binary, Docker, Docker Compose, K8s, Remote |\n| **Operational Modes** | 4 | Solo, Multi-user, CI/CD, Enterprise |\n| **API Endpoints** | 80+ | REST, WebSocket, GraphQL (planned) |\n\n---\n\n## System Architecture\n\n### High-Level Architecture\n\n```\n┌────────────────────────────────────────────────────────────────────────────┐\n│ PRESENTATION LAYER │\n├────────────────────────────────────────────────────────────────────────────┤\n│ │\n│ ┌─────────────┐ ┌──────────────┐ ┌──────────────┐ ┌────────────┐ │\n│ │ CLI (Nu) │ │ Control │ │ REST API │ │ MCP │ │\n│ │ │ │ Center (Yew) │ │ Gateway │ │ Server │ │\n│ └─────────────┘ └──────────────┘ └──────────────┘ └────────────┘ │\n│ │\n└──────────────────────────────────┬─────────────────────────────────────────┘\n │\n┌──────────────────────────────────┴─────────────────────────────────────────┐\n│ CORE LAYER │\n├────────────────────────────────────────────────────────────────────────────┤\n│ │\n│ ┌─────────────────────────────────────────────────────────────────┐ │\n│ │ Configuration Management │ │\n│ │ (Nickel Schemas | TOML Config | Hierarchical Loading) │ │\n│ └─────────────────────────────────────────────────────────────────┘ │\n│ │\n│ ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐ │\n│ │ Dependency │ │ Module/Layer │ │ Workspace │ │\n│ │ Resolution │ │ System │ │ Management │ │\n│ └──────────────────┘ └──────────────────┘ └──────────────────┘ │\n│ │\n│ ┌──────────────────────────────────────────────────────────────────┐ │\n│ │ Workflow Engine │ │\n│ │ (Batch Operations | Checkpoints | Rollback) │ │\n│ └──────────────────────────────────────────────────────────────────┘ │\n│ │\n└──────────────────────────────────┬─────────────────────────────────────────┘\n │\n┌──────────────────────────────────┴─────────────────────────────────────────┐\n│ ORCHESTRATION LAYER │\n├────────────────────────────────────────────────────────────────────────────┤\n│ │\n│ ┌──────────────────────────────────────────────────────────────────┐ │\n│ │ Orchestrator (Rust) │ │\n│ │ • Task Queue (File-based persistence) │ │\n│ │ • State Management (Checkpoints) │ │\n│ │ • Health Monitoring │ │\n│ │ • REST API (HTTP/WS) │ │\n│ └──────────────────────────────────────────────────────────────────┘ │\n│ │\n│ ┌──────────────────────────────────────────────────────────────────┐ │\n│ │ Business Logic (Nushell) │ │\n│ │ • Provider operations (AWS, UpCloud, Local) │ │\n│ │ • Server lifecycle (create, delete, configure) │ │\n│ │ • Taskserv installation (50+ services) │ │\n│ │ • Cluster deployment │ │\n│ └──────────────────────────────────────────────────────────────────┘ │\n│ │\n└──────────────────────────────────┬─────────────────────────────────────────┘\n │\n┌──────────────────────────────────┴─────────────────────────────────────────┐\n│ EXTENSION LAYER │\n├────────────────────────────────────────────────────────────────────────────┤\n│ │\n│ ┌────────────────┐ ┌──────────────────┐ ┌───────────────────┐ │\n│ │ Providers │ │ Task Services │ │ Clusters │ │\n│ │ (3 types) │ │ (50+ types) │ │ (10+ types) │ │\n│ │ │ │ │ │ │ │\n│ │ • AWS │ │ • Kubernetes │ │ • Buildkit │ │\n│ │ • UpCloud │ │ • Containerd │ │ • Web cluster │ │\n│ │ • Local │ │ • Databases │ │ • CI/CD │ │\n│ │ │ │ • Monitoring │ │ │ │\n│ └────────────────┘ └──────────────────┘ └───────────────────┘ │\n│ │\n│ 
┌──────────────────────────────────────────────────────────────────┐ │\n│ │ Extension Distribution (OCI Registry) │ │\n│ │ • Zot (local development) │ │\n│ │ • Harbor (multi-user/enterprise) │ │\n│ └──────────────────────────────────────────────────────────────────┘ │\n│ │\n└──────────────────────────────────┬─────────────────────────────────────────┘\n │\n┌──────────────────────────────────┴─────────────────────────────────────────┐\n│ INFRASTRUCTURE LAYER │\n├────────────────────────────────────────────────────────────────────────────┤\n│ │\n│ ┌────────────────┐ ┌──────────────────┐ ┌───────────────────┐ │\n│ │ Cloud (AWS) │ │ Cloud (UpCloud) │ │ Local (Docker) │ │\n│ │ │ │ │ │ │ │\n│ │ • EC2 │ │ • Servers │ │ • Containers │ │\n│ │ • EKS │ │ • LoadBalancer │ │ • Local K8s │ │\n│ │ • RDS │ │ • Networking │ │ • Processes │ │\n│ └────────────────┘ └──────────────────┘ └───────────────────┘ │\n│ │\n└────────────────────────────────────────────────────────────────────────────┘\n```\n\n### Multi-Repository Architecture\n\nThe system is organized into three separate repositories:\n\n#### **provisioning-core**\n\n```\nCore system functionality\n├── CLI interface (Nushell entry point)\n├── Core libraries (lib_provisioning)\n├── Base Nickel schemas\n├── Configuration system\n├── Workflow engine\n└── Build/distribution tools\n```\n\n**Distribution**: `oci://registry/provisioning-core:v3.5.0`\n\n#### **provisioning-extensions**\n\n```\nAll provider, taskserv, cluster extensions\n├── providers/\n│ ├── aws/\n│ ├── upcloud/\n│ └── local/\n├── taskservs/\n│ ├── kubernetes/\n│ ├── containerd/\n│ ├── postgres/\n│ └── (50+ more)\n└── clusters/\n ├── buildkit/\n ├── web/\n └── (10+ more)\n```\n\n**Distribution**: Each extension as separate OCI artifact\n\n- `oci://registry/provisioning-extensions/kubernetes:1.28.0`\n- `oci://registry/provisioning-extensions/aws:2.0.0`\n\n#### **provisioning-platform**\n\n```\nPlatform services\n├── orchestrator/ (Rust)\n├── control-center/ (Rust/Yew)\n├── mcp-server/ (Rust)\n└── api-gateway/ (Rust)\n```\n\n**Distribution**: Docker images in OCI registry\n\n- `oci://registry/provisioning-platform/orchestrator:v1.2.0`\n\n---\n\n## Component Architecture\n\n### Core Components\n\n#### 1. **CLI Interface** (Nushell)\n\n**Location**: `provisioning/core/cli/provisioning`\n\n**Purpose**: Primary user interface for all provisioning operations\n\n**Architecture**:\n\n```\nMain CLI (211 lines)\n ↓\nCommand Dispatcher (264 lines)\n ↓\nDomain Handlers (7 modules)\n ├── infrastructure.nu (117 lines)\n ├── orchestration.nu (64 lines)\n ├── development.nu (72 lines)\n ├── workspace.nu (56 lines)\n ├── generation.nu (78 lines)\n ├── utilities.nu (157 lines)\n └── configuration.nu (316 lines)\n```\n\n**Key Features**:\n\n- 80+ command shortcuts\n- Bi-directional help system\n- Centralized flag handling\n- Domain-driven design\n\n#### 2. **Configuration System** (Nickel + TOML)\n\n**Hierarchical Loading**:\n\n```\n1. System defaults (config.defaults.toml)\n2. User config (~/.provisioning/config.user.toml)\n3. Workspace config (workspace/config/provisioning.yaml)\n4. Environment config (workspace/config/{env}-defaults.toml)\n5. Infrastructure config (workspace/infra/{name}/config.toml)\n6. Runtime overrides (CLI flags, ENV variables)\n```\n\n**Variable Interpolation**:\n\n- `{{paths.base}}` - Path references\n- `{{env.HOME}}` - Environment variables\n- `{{now.date}}` - Dynamic values\n- `{{git.branch}}` - Git context\n\n#### 3. 
**Orchestrator** (Rust)\n\n**Location**: `provisioning/platform/orchestrator/`\n\n**Architecture**:\n\n```\nsrc/\n├── main.rs // Entry point\n├── api/\n│ ├── routes.rs // HTTP routes\n│ ├── workflows.rs // Workflow endpoints\n│ └── batch.rs // Batch endpoints\n├── workflow/\n│ ├── engine.rs // Workflow execution\n│ ├── state.rs // State management\n│ └── checkpoint.rs // Checkpoint/recovery\n├── task_queue/\n│ ├── queue.rs // File-based queue\n│ ├── priority.rs // Priority scheduling\n│ └── retry.rs // Retry logic\n├── health/\n│ └── monitor.rs // Health checks\n├── nushell/\n│ └── bridge.rs // Nu execution bridge\n└── test_environment/ // Test env management\n ├── container_manager.rs\n ├── test_orchestrator.rs\n └── topologies.rs\n```\n\n**Key Features**:\n\n- File-based task queue (reliable, simple)\n- Checkpoint-based recovery\n- Priority scheduling\n- REST API (HTTP/WebSocket)\n- Nushell script execution bridge\n\n#### 4. **Workflow Engine** (Nushell)\n\n**Location**: `provisioning/core/nulib/workflows/`\n\n**Workflow Types**:\n\n```\nworkflows/\n├── server_create.nu // Server provisioning\n├── taskserv.nu // Task service management\n├── cluster.nu // Cluster deployment\n├── batch.nu // Batch operations\n└── management.nu // Workflow monitoring\n```\n\n**Batch Workflow Features**:\n\n- Provider-agnostic (mix AWS, UpCloud, local)\n- Dependency resolution (hard/soft dependencies)\n- Parallel execution (configurable limits)\n- Rollback support\n- Real-time monitoring\n\n#### 5. **Extension System**\n\n**Extension Types**:\n\n| Type | Count | Purpose | Example |\n| ------ | ------- | --------- | --------- |\n| **Providers** | 3 | Cloud platform integration | AWS, UpCloud, Local |\n| **Task Services** | 50+ | Infrastructure components | Kubernetes, Postgres |\n| **Clusters** | 10+ | Complete configurations | Buildkit, Web cluster |\n\n**Extension Structure**:\n\n```\nextension-name/\n├── schemas/\n│ ├── main.ncl // Main schema\n│ ├── contracts.ncl // Contract definitions\n│ ├── defaults.ncl // Default values\n│ └── version.ncl // Version management\n├── scripts/\n│ ├── install.nu // Installation logic\n│ ├── check.nu // Health check\n│ └── uninstall.nu // Cleanup\n├── templates/ // Config templates\n├── docs/ // Documentation\n├── tests/ // Extension tests\n└── manifest.yaml // Extension metadata\n```\n\n**OCI Distribution**:\nEach extension packaged as OCI artifact:\n\n- Nickel schemas\n- Nushell scripts\n- Templates\n- Documentation\n- Manifest\n\n#### 6. **Module and Layer System**\n\n**Module System**:\n\n```\n# Discover available extensions\nprovisioning module discover taskservs\n\n# Load into workspace\nprovisioning module load taskserv my-workspace kubernetes containerd\n\n# List loaded modules\nprovisioning module list taskserv my-workspace\n```\n\n**Layer System** (Configuration Inheritance):\n\n```\nLayer 1: Core (provisioning/extensions/{type}/{name})\n ↓\nLayer 2: Workspace (workspace/extensions/{type}/{name})\n ↓\nLayer 3: Infrastructure (workspace/infra/{infra}/extensions/{type}/{name})\n```\n\n**Resolution Priority**: Infrastructure → Workspace → Core\n\n#### 7. 
**Dependency Resolution**\n\n**Algorithm**: Topological sort with cycle detection\n\n**Features**:\n\n- Hard dependencies (must exist)\n- Soft dependencies (optional enhancement)\n- Conflict detection\n- Circular dependency prevention\n- Version compatibility checking\n\n**Example**:\n\n```\nlet { TaskservDependencies } = import "provisioning/dependencies.ncl" in\n{\n kubernetes = TaskservDependencies {\n name = "kubernetes",\n version = "1.28.0",\n requires = ["containerd", "etcd", "os"],\n optional = ["cilium", "helm"],\n conflicts = ["docker", "podman"],\n }\n}\n```\n\n#### 8. **Service Management**\n\n**Supported Services**:\n\n| Service | Type | Category | Purpose |\n| --------- | ------ | ---------- | --------- |\n| orchestrator | Platform | Orchestration | Workflow coordination |\n| control-center | Platform | UI | Web management interface |\n| coredns | Infrastructure | DNS | Local DNS resolution |\n| gitea | Infrastructure | Git | Self-hosted Git service |\n| oci-registry | Infrastructure | Registry | OCI artifact storage |\n| mcp-server | Platform | API | Model Context Protocol |\n| api-gateway | Platform | API | Unified API access |\n\n**Lifecycle Management**:\n\n```\n# Start all auto-start services\nprovisioning platform start\n\n# Start specific service (with dependencies)\nprovisioning platform start orchestrator\n\n# Check health\nprovisioning platform health\n\n# View logs\nprovisioning platform logs orchestrator --follow\n```\n\n#### 9. **Test Environment Service**\n\n**Architecture**:\n\n```\nUser Command (CLI)\n ↓\nTest Orchestrator (Rust)\n ↓\nContainer Manager (bollard)\n ↓\nDocker API\n ↓\nIsolated Test Containers\n```\n\n**Test Types**:\n\n- Single taskserv testing\n- Server simulation (multiple taskservs)\n- Multi-node cluster topologies\n\n**Topology Templates**:\n\n- `kubernetes_3node` - 3-node HA cluster\n- `kubernetes_single` - All-in-one K8s\n- `etcd_cluster` - 3-node etcd\n- `postgres_redis` - Database stack\n\n---\n\n## Mode Architecture\n\n### Mode-Based System Overview\n\nThe platform supports four operational modes that adapt the system from individual development to enterprise production.\n\n### Mode Comparison\n\n```\n┌───────────────────────────────────────────────────────────────────────┐\n│ MODE ARCHITECTURE │\n├───────────────┬───────────────┬───────────────┬───────────────────────┤\n│ SOLO │ MULTI-USER │ CI/CD │ ENTERPRISE │\n├───────────────┼───────────────┼───────────────┼───────────────────────┤\n│ │ │ │ │\n│ Single Dev │ Team (5-20) │ Pipelines │ Production │\n│ │ │ │ │\n│ ┌─────────┐ │ ┌──────────┐ │ ┌──────────┐ │ ┌──────────────────┐ │\n│ │ No Auth │ │ │Token(JWT)│ │ │Token(1h) │ │ │ mTLS (TLS 1.3) │ │\n│ └─────────┘ │ └──────────┘ │ └──────────┘ │ └──────────────────┘ │\n│ │ │ │ │\n│ ┌─────────┐ │ ┌──────────┐ │ ┌──────────┐ │ ┌──────────────────┐ │\n│ │ Local │ │ │ Remote │ │ │ Remote │ │ │ Kubernetes (HA) │ │\n│ │ Binary │ │ │ Docker │ │ │ K8s │ │ │ Multi-AZ │ │\n│ └─────────┘ │ └──────────┘ │ └──────────┘ │ └──────────────────┘ │\n│ │ │ │ │\n│ ┌─────────┐ │ ┌──────────┐ │ ┌──────────┐ │ ┌──────────────────┐ │\n│ │ Local │ │ │ OCI (Zot)│ │ │OCI(Harbor│ │ │ OCI (Harbor HA) │ │\n│ │ Files │ │ │ or Harbor│ │ │ required)│ │ │ + Replication │ │\n│ └─────────┘ │ └──────────┘ │ └──────────┘ │ └──────────────────┘ │\n│ │ │ │ │\n│ ┌─────────┐ │ ┌──────────┐ │ ┌──────────-┐ │ ┌──────────────────┐ │\n│ │ None │ │ │ Gitea │ │ │ Disabled │ │ │ etcd (mandatory) │ │\n│ │ │ │ │(optional)│ │ │(stateless)| │ │ │ │\n│ └─────────┘ │ └──────────┘ │ 
└─────────-─┘ │ └──────────────────┘ │\n│ │ │ │ │\n│ Unlimited │ 10 srv, 32 │ 5 srv, 16 │ 20 srv, 64 cores │\n│ │ cores, 128 GB │ cores, 64 GB │ 256 GB per user │\n│ │ │ │ │\n└───────────────┴───────────────┴───────────────┴───────────────────────┘\n```\n\n### Mode Configuration\n\n**Mode Templates**: `workspace/config/modes/{mode}.yaml`\n\n**Active Mode**: `~/.provisioning/config/active-mode.yaml`\n\n**Switching Modes**:\n\n```\n# Check current mode\nprovisioning mode current\n\n# Switch to another mode\nprovisioning mode switch multi-user\n\n# Validate mode requirements\nprovisioning mode validate enterprise\n```\n\n### Mode-Specific Workflows\n\n#### Solo Mode\n\n```\n# 1. Default mode, no setup needed\nprovisioning workspace init\n\n# 2. Start local orchestrator\nprovisioning platform start orchestrator\n\n# 3. Create infrastructure\nprovisioning server create\n```\n\n#### Multi-User Mode\n\n```\n# 1. Switch mode and authenticate\nprovisioning mode switch multi-user\nprovisioning auth login\n\n# 2. Lock workspace\nprovisioning workspace lock my-infra\n\n# 3. Pull extensions from OCI\nprovisioning extension pull upcloud kubernetes\n\n# 4. Work...\n\n# 5. Unlock workspace\nprovisioning workspace unlock my-infra\n```\n\n#### CI/CD Mode\n\n```\n# GitLab CI\ndeploy:\n stage: deploy\n script:\n - export PROVISIONING_MODE=cicd\n - echo "$TOKEN" > /var/run/secrets/provisioning/token\n - provisioning validate --all\n - provisioning test quick kubernetes\n - provisioning server create --check\n - provisioning server create\n after_script:\n - provisioning workspace cleanup\n```\n\n#### Enterprise Mode\n\n```\n# 1. Switch to enterprise, verify K8s\nprovisioning mode switch enterprise\nkubectl get pods -n provisioning-system\n\n# 2. Request workspace (approval required)\nprovisioning workspace request prod-deployment\n\n# 3. After approval, lock with etcd\nprovisioning workspace lock prod-deployment --provider etcd\n\n# 4. Pull verified extensions\nprovisioning extension pull upcloud --verify-signature\n\n# 5. Deploy\nprovisioning infra create --check\nprovisioning infra create\n\n# 6. 
Release\nprovisioning workspace unlock prod-deployment\n```\n\n---\n\n## Network Architecture\n\n### Service Communication\n\n```\n┌──────────────────────────────────────────────────────────────────────┐\n│ NETWORK LAYER │\n├──────────────────────────────────────────────────────────────────────┤\n│ │\n│ ┌───────────────────────┐ ┌──────────────────────────┐ │\n│ │ Ingress/Load │ │ API Gateway │ │\n│ │ Balancer │──────────│ (Optional) │ │\n│ └───────────────────────┘ └──────────────────────────┘ │\n│ │ │ │\n│ │ │ │\n│ ┌───────────┴────────────────────────────────────┴──────────┐ │\n│ │ Service Mesh (Optional) │ │\n│ │ (mTLS, Circuit Breaking, Retries) │ │\n│ └────┬──────────┬───────────┬────────────┬──────────────┬───┘ │\n│ │ │ │ │ │ │\n│ ┌────┴─────┐ ┌─┴────────┐ ┌┴─────────┐ ┌┴──────────┐ ┌┴───────┐ │\n│ │ Orchestr │ │ Control │ │ CoreDNS │ │ Gitea │ │ OCI │ │\n│ │ ator │ │ Center │ │ │ │ │ │Registry│ │\n│ │ │ │ │ │ │ │ │ │ │ │\n│ │ :9090 │ │ :3000 │ │ :5353 │ │ :3001 │ │ :5000 │ │\n│ └──────────┘ └──────────┘ └──────────┘ └───────────┘ └────────┘ │\n│ │\n│ ┌────────────────────────────────────────────────────────────┐ │\n│ │ DNS Resolution (CoreDNS) │ │\n│ │ • *.prov.local → Internal services │ │\n│ │ • *.infra.local → Infrastructure nodes │ │\n│ └────────────────────────────────────────────────────────────┘ │\n│ │\n└──────────────────────────────────────────────────────────────────────┘\n```\n\n### Port Allocation\n\n| Service | Port | Protocol | Purpose |\n| --------- | ------ | ---------- | --------- |\n| Orchestrator | 8080 | HTTP/WS | REST API, WebSocket |\n| Control Center | 3000 | HTTP | Web UI |\n| CoreDNS | 5353 | UDP/TCP | DNS resolution |\n| Gitea | 3001 | HTTP | Git operations |\n| OCI Registry (Zot) | 5000 | HTTP | OCI artifacts |\n| OCI Registry (Harbor) | 443 | HTTPS | OCI artifacts (prod) |\n| MCP Server | 8081 | HTTP | MCP protocol |\n| API Gateway | 8082 | HTTP | Unified API |\n\n### Network Security\n\n**Solo Mode**:\n\n- Localhost-only bindings\n- No authentication\n- No encryption\n\n**Multi-User Mode**:\n\n- Token-based authentication (JWT)\n- TLS for external access\n- Firewall rules\n\n**CI/CD Mode**:\n\n- Token authentication (short-lived)\n- Full TLS encryption\n- Network isolation\n\n**Enterprise Mode**:\n\n- mTLS for all connections\n- Network policies (Kubernetes)\n- Zero-trust networking\n- Audit logging\n\n---\n\n## Data Architecture\n\n### Data Storage\n\n```\n┌────────────────────────────────────────────────────────────────┐\n│ DATA LAYER │\n├────────────────────────────────────────────────────────────────┤\n│ │\n│ ┌─────────────────────────────────────────────────────────┐ │\n│ │ Configuration Data (Hierarchical) │ │\n│ │ │ │\n│ │ ~/.provisioning/ │ │\n│ │ ├── config.user.toml (User preferences) │ │\n│ │ └── config/ │ │\n│ │ ├── active-mode.yaml (Active mode) │ │\n│ │ └── user_config.yaml (Workspaces, preferences) │ │\n│ │ │ │\n│ │ workspace/ │ │\n│ │ ├── config/ │ │\n│ │ │ ├── provisioning.yaml (Workspace config) │ │\n│ │ │ └── modes/*.yaml (Mode templates) │ │\n│ │ └── infra/{name}/ │ │\n│ │ ├── main.ncl (Infrastructure Nickel) │ │\n│ │ └── config.toml (Infra-specific) │ │\n│ └─────────────────────────────────────────────────────────┘ │\n│ │\n│ ┌─────────────────────────────────────────────────────────┐ │\n│ │ State Data (Runtime) │ │\n│ │ │ │\n│ │ ~/.provisioning/orchestrator/data/ │ │\n│ │ ├── tasks/ (Task queue) │ │\n│ │ ├── workflows/ (Workflow state) │ │\n│ │ └── checkpoints/ (Recovery points) │ │\n│ │ │ │\n│ │ ~/.provisioning/services/ │ │\n│ │ 
├── pids/ (Process IDs) │ │\n│ │ ├── logs/ (Service logs) │ │\n│ │ └── state/ (Service state) │ │\n│ └─────────────────────────────────────────────────────────┘ │\n│ │\n│ ┌─────────────────────────────────────────────────────────┐ │\n│ │ Cache Data (Performance) │ │\n│ │ │ │\n│ │ ~/.provisioning/cache/ │ │\n│ │ ├── oci/ (OCI artifacts) │ │\n│ │ ├── schemas/ (Nickel compiled) │ │\n│ │ └── modules/ (Module cache) │ │\n│ └─────────────────────────────────────────────────────────┘ │\n│ │\n│ ┌─────────────────────────────────────────────────────────┐ │\n│ │ Extension Data (OCI Artifacts) │ │\n│ │ │ │\n│ │ OCI Registry (localhost:5000 or harbor.company.com) │ │\n│ │ ├── provisioning-core:v3.5.0 │ │\n│ │ ├── provisioning-extensions/ │ │\n│ │ │ ├── kubernetes:1.28.0 │ │\n│ │ │ ├── aws:2.0.0 │ │\n│ │ │ └── (100+ artifacts) │ │\n│ │ └── provisioning-platform/ │ │\n│ │ ├── orchestrator:v1.2.0 │ │\n│ │ └── (4 service images) │ │\n│ └─────────────────────────────────────────────────────────┘ │\n│ │\n│ ┌─────────────────────────────────────────────────────────┐ │\n│ │ Secrets (Encrypted) │ │\n│ │ │ │\n│ │ workspace/secrets/ │ │\n│ │ ├── keys.yaml.enc (SOPS-encrypted) │ │\n│ │ ├── ssh-keys/ (SSH keys) │ │\n│ │ └── tokens/ (API tokens) │ │\n│ │ │ │\n│ │ KMS Integration (Enterprise): │ │\n│ │ • AWS KMS │ │\n│ │ • HashiCorp Vault │ │\n│ │ • Age encryption (local) │ │\n│ └─────────────────────────────────────────────────────────┘ │\n│ │\n└────────────────────────────────────────────────────────────────┘\n```\n\n### Data Flow\n\n**Configuration Loading**:\n\n```\n1. Load system defaults (config.defaults.toml)\n2. Merge user config (~/.provisioning/config.user.toml)\n3. Load workspace config (workspace/config/provisioning.yaml)\n4. Load environment config (workspace/config/{env}-defaults.toml)\n5. Load infrastructure config (workspace/infra/{name}/config.toml)\n6. Apply runtime overrides (ENV variables, CLI flags)\n```\n\n**State Persistence**:\n\n```\nWorkflow execution\n ↓\nCreate checkpoint (JSON)\n ↓\nSave to ~/.provisioning/orchestrator/data/checkpoints/\n ↓\nOn failure, load checkpoint and resume\n```\n\n**OCI Artifact Flow**:\n\n```\n1. Package extension (oci-package.nu)\n2. Push to OCI registry (provisioning oci push)\n3. Extension stored as OCI artifact\n4. Pull when needed (provisioning oci pull)\n5. 
Cache locally (~/.provisioning/cache/oci/)\n```\n\n---\n\n## Security Architecture\n\n### Security Layers\n\n```\n┌─────────────────────────────────────────────────────────────────┐\n│ SECURITY ARCHITECTURE │\n├─────────────────────────────────────────────────────────────────┤\n│ │\n│ ┌────────────────────────────────────────────────────────┐ │\n│ │ Layer 1: Authentication & Authorization │ │\n│ │ │ │\n│ │ Solo: None (local development) │ │\n│ │ Multi-user: JWT tokens (24h expiry) │ │\n│ │ CI/CD: CI-injected tokens (1h expiry) │ │\n│ │ Enterprise: mTLS (TLS 1.3, mutual auth) │ │\n│ └────────────────────────────────────────────────────────┘ │\n│ │\n│ ┌────────────────────────────────────────────────────────┐ │\n│ │ Layer 2: Encryption │ │\n│ │ │ │\n│ │ In Transit: │ │\n│ │ • TLS 1.3 (multi-user, CI/CD, enterprise) │ │\n│ │ • mTLS (enterprise) │ │\n│ │ │ │\n│ │ At Rest: │ │\n│ │ • SOPS + Age (secrets encryption) │ │\n│ │ • KMS integration (CI/CD, enterprise) │ │\n│ │ • Encrypted filesystems (enterprise) │ │\n│ └────────────────────────────────────────────────────────┘ │\n│ │\n│ ┌────────────────────────────────────────────────────────┐ │\n│ │ Layer 3: Secret Management │ │\n│ │ │ │\n│ │ • SOPS for file encryption │ │\n│ │ • Age for key management │ │\n│ │ • KMS integration (AWS KMS, Vault) │ │\n│ │ • SSH key storage (KMS-backed) │ │\n│ │ • API token management │ │\n│ └────────────────────────────────────────────────────────┘ │\n│ │\n│ ┌────────────────────────────────────────────────────────┐ │\n│ │ Layer 4: Access Control │ │\n│ │ │ │\n│ │ • RBAC (Role-Based Access Control) │ │\n│ │ • Workspace isolation │ │\n│ │ • Workspace locking (Gitea, etcd) │ │\n│ │ • Resource quotas (per-user limits) │ │\n│ └────────────────────────────────────────────────────────┘ │\n│ │\n│ ┌────────────────────────────────────────────────────────┐ │\n│ │ Layer 5: Network Security │ │\n│ │ │ │\n│ │ • Network policies (Kubernetes) │ │\n│ │ • Firewall rules │ │\n│ │ • Zero-trust networking (enterprise) │ │\n│ │ • Service mesh (optional, mTLS) │ │\n│ └────────────────────────────────────────────────────────┘ │\n│ │\n│ ┌────────────────────────────────────────────────────────┐ │\n│ │ Layer 6: Audit & Compliance │ │\n│ │ │ │\n│ │ • Audit logs (all operations) │ │\n│ │ • Compliance policies (SOC2, ISO27001) │ │\n│ │ • Image signing (cosign, notation) │ │\n│ │ • Vulnerability scanning (Harbor) │ │\n│ └────────────────────────────────────────────────────────┘ │\n│ │\n└─────────────────────────────────────────────────────────────────┘\n```\n\n### Secret Management\n\n**SOPS Integration**:\n\n```\n# Edit encrypted file\nprovisioning sops workspace/secrets/keys.yaml.enc\n\n# Encryption happens automatically on save\n# Decryption happens automatically on load\n```\n\n**KMS Integration** (Enterprise):\n\n```\n# workspace/config/provisioning.yaml\nsecrets:\n provider: "kms"\n kms:\n type: "aws" # or "vault"\n region: "us-east-1"\n key_id: "arn:aws:kms:..."\n```\n\n### Image Signing and Verification\n\n**CI/CD Mode** (Required):\n\n```\n# Sign OCI artifact\ncosign sign oci://registry/kubernetes:1.28.0\n\n# Verify signature\ncosign verify oci://registry/kubernetes:1.28.0\n```\n\n**Enterprise Mode** (Mandatory):\n\n```\n# Pull with verification\nprovisioning extension pull kubernetes --verify-signature\n\n# System blocks unsigned artifacts\n```\n\n---\n\n## Deployment Architecture\n\n### Deployment Modes\n\n#### 1. 
**Binary Deployment** (Solo, Multi-user)\n\n```\nUser Machine\n├── ~/.provisioning/bin/\n│ ├── provisioning-orchestrator\n│ ├── provisioning-control-center\n│ └── ...\n├── ~/.provisioning/orchestrator/data/\n├── ~/.provisioning/services/\n└── Process Management (PID files, logs)\n```\n\n**Pros**: Simple, fast startup, no Docker dependency\n**Cons**: Platform-specific binaries, manual updates\n\n#### 2. **Docker Deployment** (Multi-user, CI/CD)\n\n```\nDocker Daemon\n├── Container: provisioning-orchestrator\n├── Container: provisioning-control-center\n├── Container: provisioning-coredns\n├── Container: provisioning-gitea\n├── Container: provisioning-oci-registry\n└── Volumes: ~/.provisioning/data/\n```\n\n**Pros**: Consistent environment, easy updates\n**Cons**: Requires Docker, resource overhead\n\n#### 3. **Docker Compose Deployment** (Multi-user)\n\n```\n# provisioning/platform/docker-compose.yaml\nservices:\n orchestrator:\n image: provisioning-platform/orchestrator:v1.2.0\n ports:\n - "8080:9090"\n volumes:\n - orchestrator-data:/data\n\n control-center:\n image: provisioning-platform/control-center:v1.2.0\n ports:\n - "3000:3000"\n depends_on:\n - orchestrator\n\n coredns:\n image: coredns/coredns:1.11.1\n ports:\n - "5353:53/udp"\n\n gitea:\n image: gitea/gitea:1.20\n ports:\n - "3001:3000"\n\n oci-registry:\n image: ghcr.io/project-zot/zot:latest\n ports:\n - "5000:5000"\n```\n\n**Pros**: Easy multi-service orchestration, declarative\n**Cons**: Local only, no HA\n\n#### 4. **Kubernetes Deployment** (CI/CD, Enterprise)\n\n```\n# Namespace: provisioning-system\napiVersion: apps/v1\nkind: Deployment\nmetadata:\n name: orchestrator\nspec:\n replicas: 3 # HA\n selector:\n matchLabels:\n app: orchestrator\n template:\n metadata:\n labels:\n app: orchestrator\n spec:\n containers:\n - name: orchestrator\n image: harbor.company.com/provisioning-platform/orchestrator:v1.2.0\n ports:\n - containerPort: 8080\n env:\n - name: RUST_LOG\n value: "info"\n volumeMounts:\n - name: data\n mountPath: /data\n livenessProbe:\n httpGet:\n path: /health\n port: 8080\n readinessProbe:\n httpGet:\n path: /health\n port: 8080\n volumes:\n - name: data\n persistentVolumeClaim:\n claimName: orchestrator-data\n```\n\n**Pros**: HA, scalability, production-ready\n**Cons**: Complex setup, Kubernetes required\n\n#### 5. **Remote Deployment** (All modes)\n\n```\n# Connect to remotely-running services\nservices:\n orchestrator:\n deployment:\n mode: "remote"\n remote:\n endpoint: "https://orchestrator.company.com"\n tls_enabled: true\n auth_token_path: "~/.provisioning/tokens/orchestrator.token"\n```\n\n**Pros**: No local resources, centralized\n**Cons**: Network dependency, latency\n\n---\n\n## Integration Architecture\n\n### Integration Patterns\n\n#### 1. **Hybrid Language Integration** (Rust ↔ Nushell)\n\n```\nRust Orchestrator\n ↓ (HTTP API)\nNushell CLI\n ↓ (exec via bridge)\nNushell Business Logic\n ↓ (returns JSON)\nRust Orchestrator\n ↓ (updates state)\nFile-based Task Queue\n```\n\n**Communication**: HTTP API + stdin/stdout JSON\n\n#### 2. **Provider Abstraction**\n\n```\nUnified Provider Interface\n├── create_server(config) -> Server\n├── delete_server(id) -> bool\n├── list_servers() -> [Server]\n└── get_server_status(id) -> Status\n\nProvider Implementations:\n├── AWS Provider (aws-sdk-rust, aws cli)\n├── UpCloud Provider (upcloud API)\n└── Local Provider (Docker, libvirt)\n```\n\n#### 3. 
**OCI Registry Integration**\n\n```\nExtension Development\n ↓\nPackage (oci-package.nu)\n ↓\nPush (provisioning oci push)\n ↓\nOCI Registry (Zot/Harbor)\n ↓\nPull (provisioning oci pull)\n ↓\nCache (~/.provisioning/cache/oci/)\n ↓\nLoad into Workspace\n```\n\n#### 4. **Gitea Integration** (Multi-user, Enterprise)\n\n```\nWorkspace Operations\n ↓\nCheck Lock Status (Gitea API)\n ↓\nAcquire Lock (Create lock file in Git)\n ↓\nPerform Changes\n ↓\nCommit + Push\n ↓\nRelease Lock (Delete lock file)\n```\n\n**Benefits**:\n\n- Distributed locking\n- Change tracking via Git history\n- Collaboration features\n\n#### 5. **CoreDNS Integration**\n\n```\nService Registration\n ↓\nUpdate CoreDNS Corefile\n ↓\nReload CoreDNS\n ↓\nDNS Resolution Available\n\nZones:\n├── *.prov.local (Internal services)\n├── *.infra.local (Infrastructure nodes)\n└── *.test.local (Test environments)\n```\n\n---\n\n## Performance and Scalability\n\n### Performance Characteristics\n\n| Metric | Value | Notes |\n| -------- | ------- | ------- |\n| **CLI Startup Time** | < 100 ms | Nushell cold start |\n| **CLI Response Time** | < 50 ms | Most commands |\n| **Workflow Submission** | < 200 ms | To orchestrator |\n| **Task Processing** | 10-50/sec | Orchestrator throughput |\n| **Batch Operations** | Up to 100 servers | Parallel execution |\n| **OCI Pull Time** | 1-5s | Cached: <100 ms |\n| **Configuration Load** | < 500 ms | Full hierarchy |\n| **Health Check Interval** | 10s | Configurable |\n\n### Scalability Limits\n\n**Solo Mode**:\n\n- Unlimited local resources\n- Limited by machine capacity\n\n**Multi-User Mode**:\n\n- 10 servers per user\n- 32 cores, 128 GB RAM per user\n- 5-20 concurrent users\n\n**CI/CD Mode**:\n\n- 5 servers per pipeline\n- 16 cores, 64 GB RAM per pipeline\n- 100+ concurrent pipelines\n\n**Enterprise Mode**:\n\n- 20 servers per user\n- 64 cores, 256 GB RAM per user\n- 1000+ concurrent users\n- Horizontal scaling via Kubernetes\n\n### Optimization Strategies\n\n**Caching**:\n\n- OCI artifacts cached locally\n- Nickel compilation cached\n- Module resolution cached\n\n**Parallel Execution**:\n\n- Batch operations with configurable limits\n- Dependency-aware parallel starts\n- Workflow DAG execution\n\n**Incremental Operations**:\n\n- Only update changed resources\n- Checkpoint-based recovery\n- Delta synchronization\n\n---\n\n## Evolution and Roadmap\n\n### Version History\n\n| Version | Date | Major Features |\n| --------- | ------ | ---------------- |\n| **v3.5.0** | 2025-10-06 | Mode system, OCI distribution, comprehensive docs |\n| **v3.4.0** | 2025-10-06 | Test environment service |\n| **v3.3.0** | 2025-09-30 | Interactive guides |\n| **v3.2.0** | 2025-09-30 | Modular CLI refactoring |\n| **v3.1.0** | 2025-09-25 | Batch workflow system |\n| **v3.0.0** | 2025-09-25 | Hybrid orchestrator |\n| **v2.0.5** | 2025-10-02 | Workspace switching |\n| **v2.0.0** | 2025-09-23 | Configuration migration |\n\n### Roadmap (Future Versions)\n\n**v3.6.0** (Q1 2026):\n\n- GraphQL API\n- Advanced RBAC\n- Multi-tenancy\n- Observability enhancements (OpenTelemetry)\n\n**v4.0.0** (Q2 2026):\n\n- Multi-repository split complete\n- Extension marketplace\n- Advanced workflow features (conditional execution, loops)\n- Cost optimization engine\n\n**v4.1.0** (Q3 2026):\n\n- AI-assisted infrastructure generation\n- Policy-as-code (OPA integration)\n- Advanced compliance features\n\n**Long-term Vision**:\n\n- Serverless workflow execution\n- Edge computing support\n- Multi-cloud failover\n- Self-healing 
infrastructure\n\n---\n\n## Related Documentation\n\n### Architecture\n\n- **[Multi-Repo Architecture](MULTI_REPO_ARCHITECTURE.md)** - Repository organization\n- **[Design Principles](design-principles.md)** - Architectural philosophy\n- **[Integration Patterns](integration-patterns.md)** - Integration details\n- **[Orchestrator Model](orchestrator-integration-model.md)** - Hybrid orchestration\n\n### ADRs\n\n- **[ADR-001](adr-001-project-structure.md)** - Project structure\n- **[ADR-002](adr-002-distribution-strategy.md)** - Distribution strategy\n- **[ADR-003](adr-003-workspace-isolation.md)** - Workspace isolation\n- **[ADR-004](adr-004-hybrid-architecture.md)** - Hybrid architecture\n- **[ADR-005](adr-005-extension-framework.md)** - Extension framework\n- **[ADR-006](adr-006-provisioning-cli-refactoring.md)** - CLI refactoring\n\n### User Guides\n\n- **[Getting Started](../user/getting-started.md)** - First steps\n- **[Mode System](../user/MODE_SYSTEM_QUICK_REFERENCE.md)** - Modes overview\n- **[Service Management](../user/SERVICE_MANAGEMENT_GUIDE.md)** - Services\n- **[OCI Registry](../user/OCI_REGISTRY_GUIDE.md)** - OCI operations\n\n---\n\n**Maintained By**: Architecture Team\n**Review Cycle**: Quarterly\n**Next Review**: 2026-01-06 +# Provisioning Platform - Architecture Overview + +**Version**: 3.5.0 +**Date**: 2025-10-06 +**Status**: Production +**Maintainers**: Architecture Team + +--- + +## Table of Contents + +1. [Executive Summary](#executive-summary) +2. [System Architecture](#system-architecture) +3. [Component Architecture](#component-architecture) +4. [Mode Architecture](#mode-architecture) +5. [Network Architecture](#network-architecture) +6. [Data Architecture](#data-architecture) +7. [Security Architecture](#security-architecture) +8. [Deployment Architecture](#deployment-architecture) +9. [Integration Architecture](#integration-architecture) +10. [Performance and Scalability](#performance-and-scalability) +11. [Evolution and Roadmap](#evolution-and-roadmap) + +--- + +## Executive Summary + +### What is the Provisioning Platform + +The Provisioning Platform is a modern, cloud-native infrastructure automation system that combines: + +- the simplicity of declarative configuration (Nickel) +- the power of shell scripting (Nushell) +- high-performance coordination (Rust). 
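+
+As a minimal sketch of this hybrid split (illustrative only: `run_nu_task` and its signature are assumptions, not the actual orchestrator API), the coordination boundary amounts to Rust spawning a Nushell script and reading its JSON reply from stdout, as detailed later under Integration Architecture:
+
+```rust
+// Hypothetical Rust -> Nushell bridge: Rust coordinates, Nushell runs
+// the business logic, and the result comes back as JSON on stdout.
+use std::process::Command;
+
+fn run_nu_task(script: &str, args: &[&str]) -> Result<serde_json::Value, Box<dyn std::error::Error>> {
+    let output = Command::new("nu").arg(script).args(args).output()?;
+    if !output.status.success() {
+        // Surface the script's stderr as the coordination-layer error
+        return Err(String::from_utf8_lossy(&output.stderr).to_string().into());
+    }
+    // The business-logic script prints JSON; parse it for state updates
+    Ok(serde_json::from_slice(&output.stdout)?)
+}
+```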
+ +### Key Characteristics + +- **Hybrid Architecture**: Rust for coordination, Nushell for business logic, Nickel for configuration +- **Mode-Based**: Adapts from solo development to enterprise production +- **OCI-Native**: Extends leveraging industry-standard OCI distribution +- **Provider-Agnostic**: Supports multiple cloud providers (AWS, UpCloud) and local infrastructure +- **Extension-Driven**: Core functionality enhanced through modular extensions + +### Architecture at a Glance + +```text +┌─────────────────────────────────────────────────────────────────────┐ +│ Provisioning Platform │ +├─────────────────────────────────────────────────────────────────────┤ +│ │ +│ ┌──────────────┐ ┌─────────────┐ ┌──────────────┐ │ +│ │ User Layer │ │ Extension │ │ Service │ │ +│ │ (CLI/UI) │ │ Registry │ │ Registry │ │ +│ └──────┬───────┘ └──────┬──────┘ └──────┬───────┘ │ +│ │ │ │ │ +│ ┌──────┴──────────────────┴──────────────────┴──--────┐ │ +│ │ Core Provisioning Engine │ │ +│ │ (Config | Dependency Resolution | Workflows) │ │ +│ └──────┬──────────────────────────────────────┬───────┘ │ +│ │ │ │ +│ ┌──────┴─────────┐ ┌──────-─┴─────────┐ │ +│ │ Orchestrator │ │ Business Logic │ │ +│ │ (Rust) │ ←─ Coordination → │ (Nushell) │ │ +│ └──────┬─────────┘ └───────┬──────────┘ │ +│ │ │ │ +│ ┌──────┴─────────────────────────────────────┴---──────┐ │ +│ │ Extension System │ │ +│ │ (Providers | Task Services | Clusters) │ │ +│ └──────┬───────────────────────────────────────────────┘ │ +│ │ │ +│ ┌──────┴──────────────────────────────────────────────────-─┐ │ +│ │ Infrastructure (Cloud | Local | Kubernetes) │ │ +│ └───────────────────────────────────────────────────────────┘ │ +│ │ +└─────────────────────────────────────────────────────────────────────┘ +``` + +### Key Metrics + +| Metric | Value | Description | +| -------- | ------- | ------------- | +| **Codebase Size** | ~50,000 LOC | Nushell (60%), Rust (30%), Nickel (10%) | +| **Extensions** | 100+ | Providers, taskservs, clusters | +| **Supported Providers** | 3 | AWS, UpCloud, Local | +| **Task Services** | 50+ | Kubernetes, databases, monitoring, etc. 
| +| **Deployment Modes** | 5 | Binary, Docker, Docker Compose, K8s, Remote | +| **Operational Modes** | 4 | Solo, Multi-user, CI/CD, Enterprise | +| **API Endpoints** | 80+ | REST, WebSocket, GraphQL (planned) | + +--- + +## System Architecture + +### High-Level Architecture + +```text +┌────────────────────────────────────────────────────────────────────────────┐ +│ PRESENTATION LAYER │ +├────────────────────────────────────────────────────────────────────────────┤ +│ │ +│ ┌─────────────┐ ┌──────────────┐ ┌──────────────┐ ┌────────────┐ │ +│ │ CLI (Nu) │ │ Control │ │ REST API │ │ MCP │ │ +│ │ │ │ Center (Yew) │ │ Gateway │ │ Server │ │ +│ └─────────────┘ └──────────────┘ └──────────────┘ └────────────┘ │ +│ │ +└──────────────────────────────────┬─────────────────────────────────────────┘ + │ +┌──────────────────────────────────┴─────────────────────────────────────────┐ +│ CORE LAYER │ +├────────────────────────────────────────────────────────────────────────────┤ +│ │ +│ ┌─────────────────────────────────────────────────────────────────┐ │ +│ │ Configuration Management │ │ +│ │ (Nickel Schemas | TOML Config | Hierarchical Loading) │ │ +│ └─────────────────────────────────────────────────────────────────┘ │ +│ │ +│ ┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐ │ +│ │ Dependency │ │ Module/Layer │ │ Workspace │ │ +│ │ Resolution │ │ System │ │ Management │ │ +│ └──────────────────┘ └──────────────────┘ └──────────────────┘ │ +│ │ +│ ┌──────────────────────────────────────────────────────────────────┐ │ +│ │ Workflow Engine │ │ +│ │ (Batch Operations | Checkpoints | Rollback) │ │ +│ └──────────────────────────────────────────────────────────────────┘ │ +│ │ +└──────────────────────────────────┬─────────────────────────────────────────┘ + │ +┌──────────────────────────────────┴─────────────────────────────────────────┐ +│ ORCHESTRATION LAYER │ +├────────────────────────────────────────────────────────────────────────────┤ +│ │ +│ ┌──────────────────────────────────────────────────────────────────┐ │ +│ │ Orchestrator (Rust) │ │ +│ │ • Task Queue (File-based persistence) │ │ +│ │ • State Management (Checkpoints) │ │ +│ │ • Health Monitoring │ │ +│ │ • REST API (HTTP/WS) │ │ +│ └──────────────────────────────────────────────────────────────────┘ │ +│ │ +│ ┌──────────────────────────────────────────────────────────────────┐ │ +│ │ Business Logic (Nushell) │ │ +│ │ • Provider operations (AWS, UpCloud, Local) │ │ +│ │ • Server lifecycle (create, delete, configure) │ │ +│ │ • Taskserv installation (50+ services) │ │ +│ │ • Cluster deployment │ │ +│ └──────────────────────────────────────────────────────────────────┘ │ +│ │ +└──────────────────────────────────┬─────────────────────────────────────────┘ + │ +┌──────────────────────────────────┴─────────────────────────────────────────┐ +│ EXTENSION LAYER │ +├────────────────────────────────────────────────────────────────────────────┤ +│ │ +│ ┌────────────────┐ ┌──────────────────┐ ┌───────────────────┐ │ +│ │ Providers │ │ Task Services │ │ Clusters │ │ +│ │ (3 types) │ │ (50+ types) │ │ (10+ types) │ │ +│ │ │ │ │ │ │ │ +│ │ • AWS │ │ • Kubernetes │ │ • Buildkit │ │ +│ │ • UpCloud │ │ • Containerd │ │ • Web cluster │ │ +│ │ • Local │ │ • Databases │ │ • CI/CD │ │ +│ │ │ │ • Monitoring │ │ │ │ +│ └────────────────┘ └──────────────────┘ └───────────────────┘ │ +│ │ +│ ┌──────────────────────────────────────────────────────────────────┐ │ +│ │ Extension Distribution (OCI Registry) │ │ +│ │ • Zot (local development) │ │ +│ │ • Harbor 
(multi-user/enterprise) │ │ +│ └──────────────────────────────────────────────────────────────────┘ │ +│ │ +└──────────────────────────────────┬─────────────────────────────────────────┘ + │ +┌──────────────────────────────────┴─────────────────────────────────────────┐ +│ INFRASTRUCTURE LAYER │ +├────────────────────────────────────────────────────────────────────────────┤ +│ │ +│ ┌────────────────┐ ┌──────────────────┐ ┌───────────────────┐ │ +│ │ Cloud (AWS) │ │ Cloud (UpCloud) │ │ Local (Docker) │ │ +│ │ │ │ │ │ │ │ +│ │ • EC2 │ │ • Servers │ │ • Containers │ │ +│ │ • EKS │ │ • LoadBalancer │ │ • Local K8s │ │ +│ │ • RDS │ │ • Networking │ │ • Processes │ │ +│ └────────────────┘ └──────────────────┘ └───────────────────┘ │ +│ │ +└────────────────────────────────────────────────────────────────────────────┘ +``` + +### Multi-Repository Architecture + +The system is organized into three separate repositories: + +#### **provisioning-core** + +```text +Core system functionality +├── CLI interface (Nushell entry point) +├── Core libraries (lib_provisioning) +├── Base Nickel schemas +├── Configuration system +├── Workflow engine +└── Build/distribution tools +``` + +**Distribution**: `oci://registry/provisioning-core:v3.5.0` + +#### **provisioning-extensions** + +```text +All provider, taskserv, cluster extensions +├── providers/ +│ ├── aws/ +│ ├── upcloud/ +│ └── local/ +├── taskservs/ +│ ├── kubernetes/ +│ ├── containerd/ +│ ├── postgres/ +│ └── (50+ more) +└── clusters/ + ├── buildkit/ + ├── web/ + └── (10+ more) +``` + +**Distribution**: Each extension as separate OCI artifact + +- `oci://registry/provisioning-extensions/kubernetes:1.28.0` +- `oci://registry/provisioning-extensions/aws:2.0.0` + +#### **provisioning-platform** + +```text +Platform services +├── orchestrator/ (Rust) +├── control-center/ (Rust/Yew) +├── mcp-server/ (Rust) +└── api-gateway/ (Rust) +``` + +**Distribution**: Docker images in OCI registry + +- `oci://registry/provisioning-platform/orchestrator:v1.2.0` + +--- + +## Component Architecture + +### Core Components + +#### 1. **CLI Interface** (Nushell) + +**Location**: `provisioning/core/cli/provisioning` + +**Purpose**: Primary user interface for all provisioning operations + +**Architecture**: + +```text +Main CLI (211 lines) + ↓ +Command Dispatcher (264 lines) + ↓ +Domain Handlers (7 modules) + ├── infrastructure.nu (117 lines) + ├── orchestration.nu (64 lines) + ├── development.nu (72 lines) + ├── workspace.nu (56 lines) + ├── generation.nu (78 lines) + ├── utilities.nu (157 lines) + └── configuration.nu (316 lines) +``` + +**Key Features**: + +- 80+ command shortcuts +- Bi-directional help system +- Centralized flag handling +- Domain-driven design + +#### 2. **Configuration System** (Nickel + TOML) + +**Hierarchical Loading**: + +```text +1. System defaults (config.defaults.toml) +2. User config (~/.provisioning/config.user.toml) +3. Workspace config (workspace/config/provisioning.yaml) +4. Environment config (workspace/config/{env}-defaults.toml) +5. Infrastructure config (workspace/infra/{name}/config.toml) +6. Runtime overrides (CLI flags, ENV variables) +``` + +**Variable Interpolation**: + +- `{{paths.base}}` - Path references +- `{{env.HOME}}` - Environment variables +- `{{now.date}}` - Dynamic values +- `{{git.branch}}` - Git context + +#### 3. 
**Orchestrator** (Rust) + +**Location**: `provisioning/platform/orchestrator/` + +**Architecture**: + +```text +src/ +├── main.rs // Entry point +├── api/ +│ ├── routes.rs // HTTP routes +│ ├── workflows.rs // Workflow endpoints +│ └── batch.rs // Batch endpoints +├── workflow/ +│ ├── engine.rs // Workflow execution +│ ├── state.rs // State management +│ └── checkpoint.rs // Checkpoint/recovery +├── task_queue/ +│ ├── queue.rs // File-based queue +│ ├── priority.rs // Priority scheduling +│ └── retry.rs // Retry logic +├── health/ +│ └── monitor.rs // Health checks +├── nushell/ +│ └── bridge.rs // Nu execution bridge +└── test_environment/ // Test env management + ├── container_manager.rs + ├── test_orchestrator.rs + └── topologies.rs +``` + +**Key Features**: + +- File-based task queue (reliable, simple) +- Checkpoint-based recovery +- Priority scheduling +- REST API (HTTP/WebSocket) +- Nushell script execution bridge + +#### 4. **Workflow Engine** (Nushell) + +**Location**: `provisioning/core/nulib/workflows/` + +**Workflow Types**: + +```text +workflows/ +├── server_create.nu // Server provisioning +├── taskserv.nu // Task service management +├── cluster.nu // Cluster deployment +├── batch.nu // Batch operations +└── management.nu // Workflow monitoring +``` + +**Batch Workflow Features**: + +- Provider-agnostic (mix AWS, UpCloud, local) +- Dependency resolution (hard/soft dependencies) +- Parallel execution (configurable limits) +- Rollback support +- Real-time monitoring + +#### 5. **Extension System** + +**Extension Types**: + +| Type | Count | Purpose | Example | +| ------ | ------- | --------- | --------- | +| **Providers** | 3 | Cloud platform integration | AWS, UpCloud, Local | +| **Task Services** | 50+ | Infrastructure components | Kubernetes, Postgres | +| **Clusters** | 10+ | Complete configurations | Buildkit, Web cluster | + +**Extension Structure**: + +```text +extension-name/ +├── schemas/ +│ ├── main.ncl // Main schema +│ ├── contracts.ncl // Contract definitions +│ ├── defaults.ncl // Default values +│ └── version.ncl // Version management +├── scripts/ +│ ├── install.nu // Installation logic +│ ├── check.nu // Health check +│ └── uninstall.nu // Cleanup +├── templates/ // Config templates +├── docs/ // Documentation +├── tests/ // Extension tests +└── manifest.yaml // Extension metadata +``` + +**OCI Distribution**: +Each extension packaged as OCI artifact: + +- Nickel schemas +- Nushell scripts +- Templates +- Documentation +- Manifest + +#### 6. **Module and Layer System** + +**Module System**: + +```text +# Discover available extensions +provisioning module discover taskservs + +# Load into workspace +provisioning module load taskserv my-workspace kubernetes containerd + +# List loaded modules +provisioning module list taskserv my-workspace +``` + +**Layer System** (Configuration Inheritance): + +```text +Layer 1: Core (provisioning/extensions/{type}/{name}) + ↓ +Layer 2: Workspace (workspace/extensions/{type}/{name}) + ↓ +Layer 3: Infrastructure (workspace/infra/{infra}/extensions/{type}/{name}) +``` + +**Resolution Priority**: Infrastructure → Workspace → Core + +#### 7. 
**Dependency Resolution** + +**Algorithm**: Topological sort with cycle detection + +**Features**: + +- Hard dependencies (must exist) +- Soft dependencies (optional enhancement) +- Conflict detection +- Circular dependency prevention +- Version compatibility checking + +**Example**: + +```text +let { TaskservDependencies } = import "provisioning/dependencies.ncl" in +{ + kubernetes = TaskservDependencies { + name = "kubernetes", + version = "1.28.0", + requires = ["containerd", "etcd", "os"], + optional = ["cilium", "helm"], + conflicts = ["docker", "podman"], + } +} +``` + +#### 8. **Service Management** + +**Supported Services**: + +| Service | Type | Category | Purpose | +| --------- | ------ | ---------- | --------- | +| orchestrator | Platform | Orchestration | Workflow coordination | +| control-center | Platform | UI | Web management interface | +| coredns | Infrastructure | DNS | Local DNS resolution | +| gitea | Infrastructure | Git | Self-hosted Git service | +| oci-registry | Infrastructure | Registry | OCI artifact storage | +| mcp-server | Platform | API | Model Context Protocol | +| api-gateway | Platform | API | Unified API access | + +**Lifecycle Management**: + +```text +# Start all auto-start services +provisioning platform start + +# Start specific service (with dependencies) +provisioning platform start orchestrator + +# Check health +provisioning platform health + +# View logs +provisioning platform logs orchestrator --follow +``` + +#### 9. **Test Environment Service** + +**Architecture**: + +```text +User Command (CLI) + ↓ +Test Orchestrator (Rust) + ↓ +Container Manager (bollard) + ↓ +Docker API + ↓ +Isolated Test Containers +``` + +**Test Types**: + +- Single taskserv testing +- Server simulation (multiple taskservs) +- Multi-node cluster topologies + +**Topology Templates**: + +- `kubernetes_3node` - 3-node HA cluster +- `kubernetes_single` - All-in-one K8s +- `etcd_cluster` - 3-node etcd +- `postgres_redis` - Database stack + +--- + +## Mode Architecture + +### Mode-Based System Overview + +The platform supports four operational modes that adapt the system from individual development to enterprise production. 
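+
+As a rough illustration of how those modes differ mechanically, the per-mode resource quotas from the comparison below can be read as a simple lookup (hypothetical `Mode`/`Quota` types, not the platform's actual schema; the figures match the Scalability Limits section later in this document):
+
+```rust
+// Hypothetical encoding of per-mode resource quotas.
+#[derive(Debug, Clone, Copy)]
+enum Mode { Solo, MultiUser, CiCd, Enterprise }
+
+#[derive(Debug)]
+struct Quota { servers: Option<u32>, cores: Option<u32>, ram_gb: Option<u32> }
+
+fn quota_for(mode: Mode) -> Quota {
+    match mode {
+        // Solo: unlimited, bounded only by the local machine
+        Mode::Solo => Quota { servers: None, cores: None, ram_gb: None },
+        // Multi-user and Enterprise limits are per user; CI/CD per pipeline
+        Mode::MultiUser => Quota { servers: Some(10), cores: Some(32), ram_gb: Some(128) },
+        Mode::CiCd => Quota { servers: Some(5), cores: Some(16), ram_gb: Some(64) },
+        Mode::Enterprise => Quota { servers: Some(20), cores: Some(64), ram_gb: Some(256) },
+    }
+}
+
+fn main() {
+    // e.g. Quota { servers: Some(10), cores: Some(32), ram_gb: Some(128) }
+    println!("{:?}", quota_for(Mode::MultiUser));
+}
+```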
+ +### Mode Comparison + +```text +┌───────────────────────────────────────────────────────────────────────┐ +│ MODE ARCHITECTURE │ +├───────────────┬───────────────┬───────────────┬───────────────────────┤ +│ SOLO │ MULTI-USER │ CI/CD │ ENTERPRISE │ +├───────────────┼───────────────┼───────────────┼───────────────────────┤ +│ │ │ │ │ +│ Single Dev │ Team (5-20) │ Pipelines │ Production │ +│ │ │ │ │ +│ ┌─────────┐ │ ┌──────────┐ │ ┌──────────┐ │ ┌──────────────────┐ │ +│ │ No Auth │ │ │Token(JWT)│ │ │Token(1h) │ │ │ mTLS (TLS 1.3) │ │ +│ └─────────┘ │ └──────────┘ │ └──────────┘ │ └──────────────────┘ │ +│ │ │ │ │ +│ ┌─────────┐ │ ┌──────────┐ │ ┌──────────┐ │ ┌──────────────────┐ │ +│ │ Local │ │ │ Remote │ │ │ Remote │ │ │ Kubernetes (HA) │ │ +│ │ Binary │ │ │ Docker │ │ │ K8s │ │ │ Multi-AZ │ │ +│ └─────────┘ │ └──────────┘ │ └──────────┘ │ └──────────────────┘ │ +│ │ │ │ │ +│ ┌─────────┐ │ ┌──────────┐ │ ┌──────────┐ │ ┌──────────────────┐ │ +│ │ Local │ │ │ OCI (Zot)│ │ │OCI(Harbor│ │ │ OCI (Harbor HA) │ │ +│ │ Files │ │ │ or Harbor│ │ │ required)│ │ │ + Replication │ │ +│ └─────────┘ │ └──────────┘ │ └──────────┘ │ └──────────────────┘ │ +│ │ │ │ │ +│ ┌─────────┐ │ ┌──────────┐ │ ┌──────────-┐ │ ┌──────────────────┐ │ +│ │ None │ │ │ Gitea │ │ │ Disabled │ │ │ etcd (mandatory) │ │ +│ │ │ │ │(optional)│ │ │(stateless)| │ │ │ │ +│ └─────────┘ │ └──────────┘ │ └─────────-─┘ │ └──────────────────┘ │ +│ │ │ │ │ +│ Unlimited │ 10 srv, 32 │ 5 srv, 16 │ 20 srv, 64 cores │ +│ │ cores, 128 GB │ cores, 64 GB │ 256 GB per user │ +│ │ │ │ │ +└───────────────┴───────────────┴───────────────┴───────────────────────┘ +``` + +### Mode Configuration + +**Mode Templates**: `workspace/config/modes/{mode}.yaml` + +**Active Mode**: `~/.provisioning/config/active-mode.yaml` + +**Switching Modes**: + +```text +# Check current mode +provisioning mode current + +# Switch to another mode +provisioning mode switch multi-user + +# Validate mode requirements +provisioning mode validate enterprise +``` + +### Mode-Specific Workflows + +#### Solo Mode + +```text +# 1. Default mode, no setup needed +provisioning workspace init + +# 2. Start local orchestrator +provisioning platform start orchestrator + +# 3. Create infrastructure +provisioning server create +``` + +#### Multi-User Mode + +```text +# 1. Switch mode and authenticate +provisioning mode switch multi-user +provisioning auth login + +# 2. Lock workspace +provisioning workspace lock my-infra + +# 3. Pull extensions from OCI +provisioning extension pull upcloud kubernetes + +# 4. Work... + +# 5. Unlock workspace +provisioning workspace unlock my-infra +``` + +#### CI/CD Mode + +```text +# GitLab CI +deploy: + stage: deploy + script: + - export PROVISIONING_MODE=cicd + - echo "$TOKEN" > /var/run/secrets/provisioning/token + - provisioning validate --all + - provisioning test quick kubernetes + - provisioning server create --check + - provisioning server create + after_script: + - provisioning workspace cleanup +``` + +#### Enterprise Mode + +```text +# 1. Switch to enterprise, verify K8s +provisioning mode switch enterprise +kubectl get pods -n provisioning-system + +# 2. Request workspace (approval required) +provisioning workspace request prod-deployment + +# 3. After approval, lock with etcd +provisioning workspace lock prod-deployment --provider etcd + +# 4. Pull verified extensions +provisioning extension pull upcloud --verify-signature + +# 5. Deploy +provisioning infra create --check +provisioning infra create + +# 6. 
Release +provisioning workspace unlock prod-deployment +``` + +--- + +## Network Architecture + +### Service Communication + +```text +┌──────────────────────────────────────────────────────────────────────┐ +│ NETWORK LAYER │ +├──────────────────────────────────────────────────────────────────────┤ +│ │ +│ ┌───────────────────────┐ ┌──────────────────────────┐ │ +│ │ Ingress/Load │ │ API Gateway │ │ +│ │ Balancer │──────────│ (Optional) │ │ +│ └───────────────────────┘ └──────────────────────────┘ │ +│ │ │ │ +│ │ │ │ +│ ┌───────────┴────────────────────────────────────┴──────────┐ │ +│ │ Service Mesh (Optional) │ │ +│ │ (mTLS, Circuit Breaking, Retries) │ │ +│ └────┬──────────┬───────────┬────────────┬──────────────┬───┘ │ +│ │ │ │ │ │ │ +│ ┌────┴─────┐ ┌─┴────────┐ ┌┴─────────┐ ┌┴──────────┐ ┌┴───────┐ │ +│ │ Orchestr │ │ Control │ │ CoreDNS │ │ Gitea │ │ OCI │ │ +│ │ ator │ │ Center │ │ │ │ │ │Registry│ │ +│ │ │ │ │ │ │ │ │ │ │ │ +│ │ :9090 │ │ :3000 │ │ :5353 │ │ :3001 │ │ :5000 │ │ +│ └──────────┘ └──────────┘ └──────────┘ └───────────┘ └────────┘ │ +│ │ +│ ┌────────────────────────────────────────────────────────────┐ │ +│ │ DNS Resolution (CoreDNS) │ │ +│ │ • *.prov.local → Internal services │ │ +│ │ • *.infra.local → Infrastructure nodes │ │ +│ └────────────────────────────────────────────────────────────┘ │ +│ │ +└──────────────────────────────────────────────────────────────────────┘ +``` + +### Port Allocation + +| Service | Port | Protocol | Purpose | +| --------- | ------ | ---------- | --------- | +| Orchestrator | 8080 | HTTP/WS | REST API, WebSocket | +| Control Center | 3000 | HTTP | Web UI | +| CoreDNS | 5353 | UDP/TCP | DNS resolution | +| Gitea | 3001 | HTTP | Git operations | +| OCI Registry (Zot) | 5000 | HTTP | OCI artifacts | +| OCI Registry (Harbor) | 443 | HTTPS | OCI artifacts (prod) | +| MCP Server | 8081 | HTTP | MCP protocol | +| API Gateway | 8082 | HTTP | Unified API | + +### Network Security + +**Solo Mode**: + +- Localhost-only bindings +- No authentication +- No encryption + +**Multi-User Mode**: + +- Token-based authentication (JWT) +- TLS for external access +- Firewall rules + +**CI/CD Mode**: + +- Token authentication (short-lived) +- Full TLS encryption +- Network isolation + +**Enterprise Mode**: + +- mTLS for all connections +- Network policies (Kubernetes) +- Zero-trust networking +- Audit logging + +--- + +## Data Architecture + +### Data Storage + +```text +┌────────────────────────────────────────────────────────────────┐ +│ DATA LAYER │ +├────────────────────────────────────────────────────────────────┤ +│ │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ Configuration Data (Hierarchical) │ │ +│ │ │ │ +│ │ ~/.provisioning/ │ │ +│ │ ├── config.user.toml (User preferences) │ │ +│ │ └── config/ │ │ +│ │ ├── active-mode.yaml (Active mode) │ │ +│ │ └── user_config.yaml (Workspaces, preferences) │ │ +│ │ │ │ +│ │ workspace/ │ │ +│ │ ├── config/ │ │ +│ │ │ ├── provisioning.yaml (Workspace config) │ │ +│ │ │ └── modes/*.yaml (Mode templates) │ │ +│ │ └── infra/{name}/ │ │ +│ │ ├── main.ncl (Infrastructure Nickel) │ │ +│ │ └── config.toml (Infra-specific) │ │ +│ └─────────────────────────────────────────────────────────┘ │ +│ │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ State Data (Runtime) │ │ +│ │ │ │ +│ │ ~/.provisioning/orchestrator/data/ │ │ +│ │ ├── tasks/ (Task queue) │ │ +│ │ ├── workflows/ (Workflow state) │ │ +│ │ └── checkpoints/ (Recovery points) │ │ +│ │ │ │ +│ │ ~/.provisioning/services/ │ 
│ +│ │ ├── pids/ (Process IDs) │ │ +│ │ ├── logs/ (Service logs) │ │ +│ │ └── state/ (Service state) │ │ +│ └─────────────────────────────────────────────────────────┘ │ +│ │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ Cache Data (Performance) │ │ +│ │ │ │ +│ │ ~/.provisioning/cache/ │ │ +│ │ ├── oci/ (OCI artifacts) │ │ +│ │ ├── schemas/ (Nickel compiled) │ │ +│ │ └── modules/ (Module cache) │ │ +│ └─────────────────────────────────────────────────────────┘ │ +│ │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ Extension Data (OCI Artifacts) │ │ +│ │ │ │ +│ │ OCI Registry (localhost:5000 or harbor.company.com) │ │ +│ │ ├── provisioning-core:v3.5.0 │ │ +│ │ ├── provisioning-extensions/ │ │ +│ │ │ ├── kubernetes:1.28.0 │ │ +│ │ │ ├── aws:2.0.0 │ │ +│ │ │ └── (100+ artifacts) │ │ +│ │ └── provisioning-platform/ │ │ +│ │ ├── orchestrator:v1.2.0 │ │ +│ │ └── (4 service images) │ │ +│ └─────────────────────────────────────────────────────────┘ │ +│ │ +│ ┌─────────────────────────────────────────────────────────┐ │ +│ │ Secrets (Encrypted) │ │ +│ │ │ │ +│ │ workspace/secrets/ │ │ +│ │ ├── keys.yaml.enc (SOPS-encrypted) │ │ +│ │ ├── ssh-keys/ (SSH keys) │ │ +│ │ └── tokens/ (API tokens) │ │ +│ │ │ │ +│ │ KMS Integration (Enterprise): │ │ +│ │ • AWS KMS │ │ +│ │ • HashiCorp Vault │ │ +│ │ • Age encryption (local) │ │ +│ └─────────────────────────────────────────────────────────┘ │ +│ │ +└────────────────────────────────────────────────────────────────┘ +``` + +### Data Flow + +**Configuration Loading**: + +```text +1. Load system defaults (config.defaults.toml) +2. Merge user config (~/.provisioning/config.user.toml) +3. Load workspace config (workspace/config/provisioning.yaml) +4. Load environment config (workspace/config/{env}-defaults.toml) +5. Load infrastructure config (workspace/infra/{name}/config.toml) +6. Apply runtime overrides (ENV variables, CLI flags) +``` + +**State Persistence**: + +```text +Workflow execution + ↓ +Create checkpoint (JSON) + ↓ +Save to ~/.provisioning/orchestrator/data/checkpoints/ + ↓ +On failure, load checkpoint and resume +``` + +**OCI Artifact Flow**: + +```text +1. Package extension (oci-package.nu) +2. Push to OCI registry (provisioning oci push) +3. Extension stored as OCI artifact +4. Pull when needed (provisioning oci pull) +5. 
Cache locally (~/.provisioning/cache/oci/) +``` + +--- + +## Security Architecture + +### Security Layers + +```text +┌─────────────────────────────────────────────────────────────────┐ +│ SECURITY ARCHITECTURE │ +├─────────────────────────────────────────────────────────────────┤ +│ │ +│ ┌────────────────────────────────────────────────────────┐ │ +│ │ Layer 1: Authentication & Authorization │ │ +│ │ │ │ +│ │ Solo: None (local development) │ │ +│ │ Multi-user: JWT tokens (24h expiry) │ │ +│ │ CI/CD: CI-injected tokens (1h expiry) │ │ +│ │ Enterprise: mTLS (TLS 1.3, mutual auth) │ │ +│ └────────────────────────────────────────────────────────┘ │ +│ │ +│ ┌────────────────────────────────────────────────────────┐ │ +│ │ Layer 2: Encryption │ │ +│ │ │ │ +│ │ In Transit: │ │ +│ │ • TLS 1.3 (multi-user, CI/CD, enterprise) │ │ +│ │ • mTLS (enterprise) │ │ +│ │ │ │ +│ │ At Rest: │ │ +│ │ • SOPS + Age (secrets encryption) │ │ +│ │ • KMS integration (CI/CD, enterprise) │ │ +│ │ • Encrypted filesystems (enterprise) │ │ +│ └────────────────────────────────────────────────────────┘ │ +│ │ +│ ┌────────────────────────────────────────────────────────┐ │ +│ │ Layer 3: Secret Management │ │ +│ │ │ │ +│ │ • SOPS for file encryption │ │ +│ │ • Age for key management │ │ +│ │ • KMS integration (AWS KMS, Vault) │ │ +│ │ • SSH key storage (KMS-backed) │ │ +│ │ • API token management │ │ +│ └────────────────────────────────────────────────────────┘ │ +│ │ +│ ┌────────────────────────────────────────────────────────┐ │ +│ │ Layer 4: Access Control │ │ +│ │ │ │ +│ │ • RBAC (Role-Based Access Control) │ │ +│ │ • Workspace isolation │ │ +│ │ • Workspace locking (Gitea, etcd) │ │ +│ │ • Resource quotas (per-user limits) │ │ +│ └────────────────────────────────────────────────────────┘ │ +│ │ +│ ┌────────────────────────────────────────────────────────┐ │ +│ │ Layer 5: Network Security │ │ +│ │ │ │ +│ │ • Network policies (Kubernetes) │ │ +│ │ • Firewall rules │ │ +│ │ • Zero-trust networking (enterprise) │ │ +│ │ • Service mesh (optional, mTLS) │ │ +│ └────────────────────────────────────────────────────────┘ │ +│ │ +│ ┌────────────────────────────────────────────────────────┐ │ +│ │ Layer 6: Audit & Compliance │ │ +│ │ │ │ +│ │ • Audit logs (all operations) │ │ +│ │ • Compliance policies (SOC2, ISO27001) │ │ +│ │ • Image signing (cosign, notation) │ │ +│ │ • Vulnerability scanning (Harbor) │ │ +│ └────────────────────────────────────────────────────────┘ │ +│ │ +└─────────────────────────────────────────────────────────────────┘ +``` + +### Secret Management + +**SOPS Integration**: + +```text +# Edit encrypted file +provisioning sops workspace/secrets/keys.yaml.enc + +# Encryption happens automatically on save +# Decryption happens automatically on load +``` + +**KMS Integration** (Enterprise): + +```text +# workspace/config/provisioning.yaml +secrets: + provider: "kms" + kms: + type: "aws" # or "vault" + region: "us-east-1" + key_id: "arn:aws:kms:..." +``` + +### Image Signing and Verification + +**CI/CD Mode** (Required): + +```text +# Sign OCI artifact +cosign sign oci://registry/kubernetes:1.28.0 + +# Verify signature +cosign verify oci://registry/kubernetes:1.28.0 +``` + +**Enterprise Mode** (Mandatory): + +```text +# Pull with verification +provisioning extension pull kubernetes --verify-signature + +# System blocks unsigned artifacts +``` + +--- + +## Deployment Architecture + +### Deployment Modes + +#### 1. 
**Binary Deployment** (Solo, Multi-user) + +```text +User Machine +├── ~/.provisioning/bin/ +│ ├── provisioning-orchestrator +│ ├── provisioning-control-center +│ └── ... +├── ~/.provisioning/orchestrator/data/ +├── ~/.provisioning/services/ +└── Process Management (PID files, logs) +``` + +**Pros**: Simple, fast startup, no Docker dependency +**Cons**: Platform-specific binaries, manual updates + +#### 2. **Docker Deployment** (Multi-user, CI/CD) + +```text +Docker Daemon +├── Container: provisioning-orchestrator +├── Container: provisioning-control-center +├── Container: provisioning-coredns +├── Container: provisioning-gitea +├── Container: provisioning-oci-registry +└── Volumes: ~/.provisioning/data/ +``` + +**Pros**: Consistent environment, easy updates +**Cons**: Requires Docker, resource overhead + +#### 3. **Docker Compose Deployment** (Multi-user) + +```text +# provisioning/platform/docker-compose.yaml +services: + orchestrator: + image: provisioning-platform/orchestrator:v1.2.0 + ports: + - "8080:9090" + volumes: + - orchestrator-data:/data + + control-center: + image: provisioning-platform/control-center:v1.2.0 + ports: + - "3000:3000" + depends_on: + - orchestrator + + coredns: + image: coredns/coredns:1.11.1 + ports: + - "5353:53/udp" + + gitea: + image: gitea/gitea:1.20 + ports: + - "3001:3000" + + oci-registry: + image: ghcr.io/project-zot/zot:latest + ports: + - "5000:5000" +``` + +**Pros**: Easy multi-service orchestration, declarative +**Cons**: Local only, no HA + +#### 4. **Kubernetes Deployment** (CI/CD, Enterprise) + +```text +# Namespace: provisioning-system +apiVersion: apps/v1 +kind: Deployment +metadata: + name: orchestrator +spec: + replicas: 3 # HA + selector: + matchLabels: + app: orchestrator + template: + metadata: + labels: + app: orchestrator + spec: + containers: + - name: orchestrator + image: harbor.company.com/provisioning-platform/orchestrator:v1.2.0 + ports: + - containerPort: 8080 + env: + - name: RUST_LOG + value: "info" + volumeMounts: + - name: data + mountPath: /data + livenessProbe: + httpGet: + path: /health + port: 8080 + readinessProbe: + httpGet: + path: /health + port: 8080 + volumes: + - name: data + persistentVolumeClaim: + claimName: orchestrator-data +``` + +**Pros**: HA, scalability, production-ready +**Cons**: Complex setup, Kubernetes required + +#### 5. **Remote Deployment** (All modes) + +```text +# Connect to remotely-running services +services: + orchestrator: + deployment: + mode: "remote" + remote: + endpoint: "https://orchestrator.company.com" + tls_enabled: true + auth_token_path: "~/.provisioning/tokens/orchestrator.token" +``` + +**Pros**: No local resources, centralized +**Cons**: Network dependency, latency + +--- + +## Integration Architecture + +### Integration Patterns + +#### 1. **Hybrid Language Integration** (Rust ↔ Nushell) + +```text +Rust Orchestrator + ↓ (HTTP API) +Nushell CLI + ↓ (exec via bridge) +Nushell Business Logic + ↓ (returns JSON) +Rust Orchestrator + ↓ (updates state) +File-based Task Queue +``` + +**Communication**: HTTP API + stdin/stdout JSON + +#### 2. **Provider Abstraction** + +```text +Unified Provider Interface +├── create_server(config) -> Server +├── delete_server(id) -> bool +├── list_servers() -> [Server] +└── get_server_status(id) -> Status + +Provider Implementations: +├── AWS Provider (aws-sdk-rust, aws cli) +├── UpCloud Provider (upcloud API) +└── Local Provider (Docker, libvirt) +``` + +#### 3. 
**OCI Registry Integration** + +```text +Extension Development + ↓ +Package (oci-package.nu) + ↓ +Push (provisioning oci push) + ↓ +OCI Registry (Zot/Harbor) + ↓ +Pull (provisioning oci pull) + ↓ +Cache (~/.provisioning/cache/oci/) + ↓ +Load into Workspace +``` + +#### 4. **Gitea Integration** (Multi-user, Enterprise) + +```text +Workspace Operations + ↓ +Check Lock Status (Gitea API) + ↓ +Acquire Lock (Create lock file in Git) + ↓ +Perform Changes + ↓ +Commit + Push + ↓ +Release Lock (Delete lock file) +``` + +**Benefits**: + +- Distributed locking +- Change tracking via Git history +- Collaboration features + +#### 5. **CoreDNS Integration** + +```text +Service Registration + ↓ +Update CoreDNS Corefile + ↓ +Reload CoreDNS + ↓ +DNS Resolution Available + +Zones: +├── *.prov.local (Internal services) +├── *.infra.local (Infrastructure nodes) +└── *.test.local (Test environments) +``` + +--- + +## Performance and Scalability + +### Performance Characteristics + +| Metric | Value | Notes | +| -------- | ------- | ------- | +| **CLI Startup Time** | < 100 ms | Nushell cold start | +| **CLI Response Time** | < 50 ms | Most commands | +| **Workflow Submission** | < 200 ms | To orchestrator | +| **Task Processing** | 10-50/sec | Orchestrator throughput | +| **Batch Operations** | Up to 100 servers | Parallel execution | +| **OCI Pull Time** | 1-5s | Cached: <100 ms | +| **Configuration Load** | < 500 ms | Full hierarchy | +| **Health Check Interval** | 10s | Configurable | + +### Scalability Limits + +**Solo Mode**: + +- Unlimited local resources +- Limited by machine capacity + +**Multi-User Mode**: + +- 10 servers per user +- 32 cores, 128 GB RAM per user +- 5-20 concurrent users + +**CI/CD Mode**: + +- 5 servers per pipeline +- 16 cores, 64 GB RAM per pipeline +- 100+ concurrent pipelines + +**Enterprise Mode**: + +- 20 servers per user +- 64 cores, 256 GB RAM per user +- 1000+ concurrent users +- Horizontal scaling via Kubernetes + +### Optimization Strategies + +**Caching**: + +- OCI artifacts cached locally +- Nickel compilation cached +- Module resolution cached + +**Parallel Execution**: + +- Batch operations with configurable limits +- Dependency-aware parallel starts +- Workflow DAG execution + +**Incremental Operations**: + +- Only update changed resources +- Checkpoint-based recovery +- Delta synchronization + +--- + +## Evolution and Roadmap + +### Version History + +| Version | Date | Major Features | +| --------- | ------ | ---------------- | +| **v3.5.0** | 2025-10-06 | Mode system, OCI distribution, comprehensive docs | +| **v3.4.0** | 2025-10-06 | Test environment service | +| **v3.3.0** | 2025-09-30 | Interactive guides | +| **v3.2.0** | 2025-09-30 | Modular CLI refactoring | +| **v3.1.0** | 2025-09-25 | Batch workflow system | +| **v3.0.0** | 2025-09-25 | Hybrid orchestrator | +| **v2.0.5** | 2025-10-02 | Workspace switching | +| **v2.0.0** | 2025-09-23 | Configuration migration | + +### Roadmap (Future Versions) + +**v3.6.0** (Q1 2026): + +- GraphQL API +- Advanced RBAC +- Multi-tenancy +- Observability enhancements (OpenTelemetry) + +**v4.0.0** (Q2 2026): + +- Multi-repository split complete +- Extension marketplace +- Advanced workflow features (conditional execution, loops) +- Cost optimization engine + +**v4.1.0** (Q3 2026): + +- AI-assisted infrastructure generation +- Policy-as-code (OPA integration) +- Advanced compliance features + +**Long-term Vision**: + +- Serverless workflow execution +- Edge computing support +- Multi-cloud failover +- Self-healing 
infrastructure + +--- + +## Related Documentation + +### Architecture + +- **[Multi-Repo Architecture](MULTI_REPO_ARCHITECTURE.md)** - Repository organization +- **[Design Principles](design-principles.md)** - Architectural philosophy +- **[Integration Patterns](integration-patterns.md)** - Integration details +- **[Orchestrator Model](orchestrator-integration-model.md)** - Hybrid orchestration + +### ADRs + +- **[ADR-001](adr-001-project-structure.md)** - Project structure +- **[ADR-002](adr-002-distribution-strategy.md)** - Distribution strategy +- **[ADR-003](adr-003-workspace-isolation.md)** - Workspace isolation +- **[ADR-004](adr-004-hybrid-architecture.md)** - Hybrid architecture +- **[ADR-005](adr-005-extension-framework.md)** - Extension framework +- **[ADR-006](adr-006-provisioning-cli-refactoring.md)** - CLI refactoring + +### User Guides + +- **[Getting Started](../user/getting-started.md)** - First steps +- **[Mode System](../user/MODE_SYSTEM_QUICK_REFERENCE.md)** - Modes overview +- **[Service Management](../user/SERVICE_MANAGEMENT_GUIDE.md)** - Services +- **[OCI Registry](../user/OCI_REGISTRY_GUIDE.md)** - OCI operations + +--- + +**Maintained By**: Architecture Team +**Review Cycle**: Quarterly +**Next Review**: 2026-01-06 \ No newline at end of file diff --git a/docs/src/architecture/config-loading-architecture.md b/docs/src/architecture/config-loading-architecture.md index 787f500..20898be 100644 --- a/docs/src/architecture/config-loading-architecture.md +++ b/docs/src/architecture/config-loading-architecture.md @@ -1 +1,266 @@ -# Modular Configuration Loading Architecture\n\n## Overview\n\nThe configuration system has been refactored into modular components to achieve 2-3x performance improvements\nfor regular commands while maintaining full functionality for complex operations.\n\n## Architecture Layers\n\n### Layer 1: Minimal Loader (0.023s)\n\n**File**: `loader-minimal.nu` (~150 lines)\n\nContains only essential functions needed for:\n\n- Workspace detection\n- Environment determination\n- Project root discovery\n- Fast path detection\n\n**Exported Functions**:\n\n- `get-active-workspace` - Get current workspace\n- `detect-current-environment` - Determine dev/test/prod\n- `get-project-root` - Find project directory\n- `get-defaults-config-path` - Path to default config\n- `check-if-sops-encrypted` - SOPS file detection\n- `find-sops-config-path` - Locate SOPS config\n\n**Used by**:\n\n- Help commands (help infrastructure, help workspace, etc.)\n- Status commands\n- Workspace listing\n- Quick reference operations\n\n### Layer 2: Lazy Loader (decision layer)\n\n**File**: `loader-lazy.nu` (~80 lines)\n\nSmart loader that decides which configuration to load:\n\n- Fast path for help/status commands\n- Full path for operations that need config\n\n**Key Function**:\n\n- `command-needs-full-config` - Determines if full config required\n\n### Layer 3: Full Loader (0.091s)\n\n**File**: `loader.nu` (1990 lines)\n\nOriginal comprehensive loader that handles:\n\n- Hierarchical config loading\n- Variable interpolation\n- Config validation\n- Provider configuration\n- Platform configuration\n\n**Used by**:\n\n- Server creation\n- Infrastructure operations\n- Deployment commands\n- Anything needing full config\n\n## Performance Characteristics\n\n### Benchmarks\n\n| Operation | Time | Notes |\n| --------- | ---- | ----- |\n| Workspace detection | 0.023s | 23ms for minimal load |\n| Full config load | 0.091s | ~4x slower than minimal |\n| Help command | 0.040s | Uses minimal loader 
only |\n| Status command | 0.030s | Fast path, no full config |\n| Server operations | 0.150s+ | Requires full config load |\n\n### Performance Gains\n\n- **Help commands**: 30-40% faster (40ms vs 60ms with full config)\n- **Workspace operations**: 50% faster (uses minimal loader)\n- **Status checks**: Nearly instant (23ms)\n\n## Module Dependency Graph\n\n```\nHelp/Status Commands\n ↓\nloader-lazy.nu\n ↓\nloader-minimal.nu (workspace, environment detection)\n ↓\n (no further deps)\n\nInfrastructure/Server Commands\n ↓\nloader-lazy.nu\n ↓\nloader.nu (full configuration)\n ├── loader-minimal.nu (for workspace detection)\n ├── Interpolation functions\n ├── Validation functions\n └── Config merging logic\n```\n\n## Usage Examples\n\n### Fast Path (Help Commands)\n\n```\n# Uses minimal loader - 23ms\n./provisioning help infrastructure\n./provisioning workspace list\n./provisioning version\n```\n\n### Medium Path (Status Operations)\n\n```\n# Uses minimal loader with some full config - ~50ms\n./provisioning status\n./provisioning workspace active\n./provisioning config validate\n```\n\n### Full Path (Infrastructure Operations)\n\n```\n# Uses full loader - ~150ms\n./provisioning server create --infra myinfra\n./provisioning taskserv create kubernetes\n./provisioning workflow submit batch.yaml\n```\n\n## Implementation Details\n\n### Lazy Loading Decision Logic\n\n```\n# In loader-lazy.nu\nlet is_fast_command = (\n $command == "help" or\n $command == "status" or\n $command == "version"\n)\n\nif $is_fast_command {\n # Use minimal loader only (0.023s)\n get-minimal-config\n} else {\n # Load full configuration (0.091s)\n load-provisioning-config\n}\n```\n\n### Minimal Config Structure\n\nThe minimal loader returns a lightweight config record:\n\n```\n{\n workspace: {\n name: "librecloud"\n path: "/path/to/workspace_librecloud"\n }\n environment: "dev"\n debug: false\n paths: {\n base: "/path/to/workspace_librecloud"\n }\n}\n```\n\nThis is sufficient for:\n\n- Workspace identification\n- Environment determination\n- Path resolution\n- Help text generation\n\n### Full Config Structure\n\nThe full loader returns comprehensive configuration with:\n\n- Workspace settings\n- Provider configurations\n- Platform settings\n- Interpolated variables\n- Validation results\n- Environment-specific overrides\n\n## Migration Path\n\n### For CLI Commands\n\n1. Commands are already categorized (help, workspace, server, etc.)\n2. Help system uses fast path (minimal loader)\n3. Infrastructure commands use full path (full loader)\n4. No changes needed to command implementations\n\n### For New Modules\n\nWhen creating new modules:\n\n1. Check if full config is needed\n2. If not, use `loader-minimal.nu` functions only\n3. If yes, use `get-config` from main config accessor\n\n## Future Optimizations\n\n### Phase 2: Per-Command Config Caching\n\n- Cache full config for 60 seconds\n- Reuse config across related commands\n- Potential: Additional 50% improvement\n\n### Phase 3: Configuration Profiles\n\n- Create thin config profiles for common scenarios\n- Pre-loaded templates for workspace/infra combinations\n- Fast switching between profiles\n\n### Phase 4: Parallel Config Loading\n\n- Load workspace and provider configs in parallel\n- Async validation and interpolation\n- Potential: 30% improvement for full config load\n\n## Maintenance Notes\n\n### Adding New Functions to Minimal Loader\n\nOnly add if:\n\n1. Used by help/status commands\n2. Doesn't require full config\n3. 
Performance-critical path\n\n### Modifying Full Loader\n\n- Changes are backward compatible\n- Validate against existing config files\n- Update tests in test suite\n\n### Performance Testing\n\n```\n# Benchmark minimal loader\ntime nu -n -c "use loader-minimal.nu *; get-active-workspace"\n\n# Benchmark full loader\ntime nu -c "use config/accessor.nu *; get-config"\n\n# Benchmark help command\ntime ./provisioning help infrastructure\n```\n\n## See Also\n\n- `loader.nu` - Full configuration loading system\n- `loader-minimal.nu` - Fast path loader\n- `loader-lazy.nu` - Smart loader decision logic\n- `config/ARCHITECTURE.md` - Configuration architecture details +# Modular Configuration Loading Architecture + +## Overview + +The configuration system has been refactored into modular components to achieve 2-3x performance improvements +for regular commands while maintaining full functionality for complex operations. + +## Architecture Layers + +### Layer 1: Minimal Loader (0.023s) + +**File**: `loader-minimal.nu` (~150 lines) + +Contains only essential functions needed for: + +- Workspace detection +- Environment determination +- Project root discovery +- Fast path detection + +**Exported Functions**: + +- `get-active-workspace` - Get current workspace +- `detect-current-environment` - Determine dev/test/prod +- `get-project-root` - Find project directory +- `get-defaults-config-path` - Path to default config +- `check-if-sops-encrypted` - SOPS file detection +- `find-sops-config-path` - Locate SOPS config + +**Used by**: + +- Help commands (help infrastructure, help workspace, etc.) +- Status commands +- Workspace listing +- Quick reference operations + +### Layer 2: Lazy Loader (decision layer) + +**File**: `loader-lazy.nu` (~80 lines) + +Smart loader that decides which configuration to load: + +- Fast path for help/status commands +- Full path for operations that need config + +**Key Function**: + +- `command-needs-full-config` - Determines if full config required + +### Layer 3: Full Loader (0.091s) + +**File**: `loader.nu` (1990 lines) + +Original comprehensive loader that handles: + +- Hierarchical config loading +- Variable interpolation +- Config validation +- Provider configuration +- Platform configuration + +**Used by**: + +- Server creation +- Infrastructure operations +- Deployment commands +- Anything needing full config + +## Performance Characteristics + +### Benchmarks + +| Operation | Time | Notes | +| --------- | ---- | ----- | +| Workspace detection | 0.023s | 23ms for minimal load | +| Full config load | 0.091s | ~4x slower than minimal | +| Help command | 0.040s | Uses minimal loader only | +| Status command | 0.030s | Fast path, no full config | +| Server operations | 0.150s+ | Requires full config load | + +### Performance Gains + +- **Help commands**: 30-40% faster (40ms vs 60ms with full config) +- **Workspace operations**: 50% faster (uses minimal loader) +- **Status checks**: Nearly instant (23ms) + +## Module Dependency Graph + +```text +Help/Status Commands + ↓ +loader-lazy.nu + ↓ +loader-minimal.nu (workspace, environment detection) + ↓ + (no further deps) + +Infrastructure/Server Commands + ↓ +loader-lazy.nu + ↓ +loader.nu (full configuration) + ├── loader-minimal.nu (for workspace detection) + ├── Interpolation functions + ├── Validation functions + └── Config merging logic +``` + +## Usage Examples + +### Fast Path (Help Commands) + +```text +# Uses minimal loader - 23ms +./provisioning help infrastructure +./provisioning workspace list +./provisioning version 
+``` + +### Medium Path (Status Operations) + +```text +# Uses minimal loader with some full config - ~50ms +./provisioning status +./provisioning workspace active +./provisioning config validate +``` + +### Full Path (Infrastructure Operations) + +```text +# Uses full loader - ~150ms +./provisioning server create --infra myinfra +./provisioning taskserv create kubernetes +./provisioning workflow submit batch.yaml +``` + +## Implementation Details + +### Lazy Loading Decision Logic + +```text +# In loader-lazy.nu +let is_fast_command = ( + $command == "help" or + $command == "status" or + $command == "version" +) + +if $is_fast_command { + # Use minimal loader only (0.023s) + get-minimal-config +} else { + # Load full configuration (0.091s) + load-provisioning-config +} +``` + +### Minimal Config Structure + +The minimal loader returns a lightweight config record: + +```text +{ + workspace: { + name: "librecloud" + path: "/path/to/workspace_librecloud" + } + environment: "dev" + debug: false + paths: { + base: "/path/to/workspace_librecloud" + } +} +``` + +This is sufficient for: + +- Workspace identification +- Environment determination +- Path resolution +- Help text generation + +### Full Config Structure + +The full loader returns comprehensive configuration with: + +- Workspace settings +- Provider configurations +- Platform settings +- Interpolated variables +- Validation results +- Environment-specific overrides + +## Migration Path + +### For CLI Commands + +1. Commands are already categorized (help, workspace, server, etc.) +2. Help system uses fast path (minimal loader) +3. Infrastructure commands use full path (full loader) +4. No changes needed to command implementations + +### For New Modules + +When creating new modules: + +1. Check if full config is needed +2. If not, use `loader-minimal.nu` functions only +3. If yes, use `get-config` from main config accessor + +## Future Optimizations + +### Phase 2: Per-Command Config Caching + +- Cache full config for 60 seconds +- Reuse config across related commands +- Potential: Additional 50% improvement + +### Phase 3: Configuration Profiles + +- Create thin config profiles for common scenarios +- Pre-loaded templates for workspace/infra combinations +- Fast switching between profiles + +### Phase 4: Parallel Config Loading + +- Load workspace and provider configs in parallel +- Async validation and interpolation +- Potential: 30% improvement for full config load + +## Maintenance Notes + +### Adding New Functions to Minimal Loader + +Only add if: + +1. Used by help/status commands +2. Doesn't require full config +3. 
Performance-critical path + +### Modifying Full Loader + +- Changes are backward compatible +- Validate against existing config files +- Update tests in test suite + +### Performance Testing + +```text +# Benchmark minimal loader +time nu -n -c "use loader-minimal.nu *; get-active-workspace" + +# Benchmark full loader +time nu -c "use config/accessor.nu *; get-config" + +# Benchmark help command +time ./provisioning help infrastructure +``` + +## See Also + +- `loader.nu` - Full configuration loading system +- `loader-minimal.nu` - Fast path loader +- `loader-lazy.nu` - Smart loader decision logic +- `config/ARCHITECTURE.md` - Configuration architecture details \ No newline at end of file diff --git a/docs/src/architecture/database-and-config-architecture.md b/docs/src/architecture/database-and-config-architecture.md index baf1e30..c9ad8e7 100644 --- a/docs/src/architecture/database-and-config-architecture.md +++ b/docs/src/architecture/database-and-config-architecture.md @@ -1 +1,385 @@ -# Database and Configuration Architecture\n\n**Date**: 2025-10-07\n**Status**: ACTIVE DOCUMENTATION\n\n---\n\n## Control-Center Database (DBS)\n\n### Database Type: **SurrealDB** (In-Memory Backend)\n\nControl-Center uses **SurrealDB with kv-mem backend**, an embedded in-memory database - **no separate database server required**.\n\n### Database Configuration\n\n```\n[database]\nurl = "memory" # In-memory backend\nnamespace = "control_center"\ndatabase = "main"\n```\n\n**Storage**: In-memory (data persists during process lifetime)\n\n**Production Alternative**: Switch to remote WebSocket connection for persistent storage:\n\n```\n[database]\nurl = "ws://localhost:8000"\nnamespace = "control_center"\ndatabase = "main"\nusername = "root"\npassword = "secret"\n```\n\n### Why SurrealDB kv-mem\n\n| Feature | SurrealDB kv-mem | RocksDB | PostgreSQL |\n| --------- | ------------------ | --------- | ------------ |\n| **Deployment** | Embedded (no server) | Embedded | Server only |\n| **Build Deps** | None | libclang, bzip2 | Many |\n| **Docker** | Simple | Complex | External service |\n| **Performance** | Very fast (memory) | Very fast (disk) | Network latency |\n| **Use Case** | Dev/test, graphs | Production K/V | Relational data |\n| **GraphQL** | Built-in | None | External |\n\n**Control-Center choice**: SurrealDB kv-mem for **zero-dependency embedded storage**, perfect for:\n\n- Policy engine state\n- Session management\n- Configuration cache\n- Audit logs\n- User credentials\n- Graph-based policy relationships\n\n### Additional Database Support\n\nControl-Center also supports (via Cargo.toml dependencies):\n\n1. **SurrealDB (WebSocket)** - For production persistent storage\n\n ```toml\n surrealdb = { version = "2.3", features = ["kv-mem", "protocol-ws", "protocol-http"] }\n ```\n\n1. 
**SQLx** - For SQL database backends (optional)\n\n ```toml\n sqlx = { workspace = true }\n ```\n\n**Default**: SurrealDB kv-mem (embedded, no extra setup, no build dependencies)\n\n---\n\n## Orchestrator Database\n\n### Storage Type: **Filesystem** (File-based Queue)\n\nOrchestrator uses simple file-based storage by default:\n\n```\n[orchestrator.storage]\ntype = "filesystem" # Default\nbackend_path = "{{orchestrator.paths.data_dir}}/queue.rkvs"\n```\n\n**Resolved Path**:\n\n```\n{{workspace.path}}/.orchestrator/data/queue.rkvs\n```\n\n### Optional: SurrealDB Backend\n\nFor production deployments, switch to SurrealDB:\n\n```\n[orchestrator.storage]\ntype = "surrealdb-server" # or surrealdb-embedded\n\n[orchestrator.storage.surrealdb]\nurl = "ws://localhost:8000"\nnamespace = "orchestrator"\ndatabase = "tasks"\nusername = "root"\npassword = "secret"\n```\n\n---\n\n## Configuration Loading Architecture\n\n### Hierarchical Configuration System\n\nAll services load configuration in this order (priority: low → high):\n\n```\n1. System Defaults provisioning/config/config.defaults.toml\n2. Service Defaults provisioning/platform/{service}/config.defaults.toml\n3. Workspace Config workspace/{name}/config/provisioning.yaml\n4. User Config ~/Library/Application Support/provisioning/user_config.yaml\n5. Environment Variables PROVISIONING_*, CONTROL_CENTER_*, ORCHESTRATOR_*\n6. Runtime Overrides --config flag or API updates\n```\n\n### Variable Interpolation\n\nConfigs support dynamic variable interpolation:\n\n```\n[paths]\nbase = "/Users/Akasha/project-provisioning/provisioning"\ndata_dir = "{{paths.base}}/data" # Resolves to: /Users/.../data\n\n[database]\nurl = "rocksdb://{{paths.data_dir}}/control-center.db"\n# Resolves to: rocksdb:///Users/.../data/control-center.db\n```\n\n**Supported Variables**:\n\n- `{{paths.*}}` - Path variables from config\n- `{{workspace.path}}` - Current workspace path\n- `{{env.HOME}}` - Environment variables\n- `{{now.date}}` - Current date/time\n- `{{git.branch}}` - Git branch name\n\n### Service-Specific Config Files\n\nEach platform service has its own `config.defaults.toml`:\n\n| Service | Config File | Purpose |\n| --------- | ------------- | --------- |\n| **Orchestrator** | `provisioning/platform/orchestrator/config.defaults.toml` | Workflow management, queue settings |\n| **Control-Center** | `provisioning/platform/control-center/config.defaults.toml` | Web UI, auth, database |\n| **MCP Server** | `provisioning/platform/mcp-server/config.defaults.toml` | AI integration settings |\n| **KMS** | `provisioning/core/services/kms/config.defaults.toml` | Key management |\n\n### Central Configuration\n\n**Master config**: `provisioning/config/config.defaults.toml`\n\nContains:\n\n- Global paths\n- Provider configurations\n- Cache settings\n- Debug flags\n- Environment-specific overrides\n\n### Workspace-Aware Paths\n\nAll services use workspace-aware paths:\n\n**Orchestrator**:\n\n```\n[orchestrator.paths]\nbase = "{{workspace.path}}/.orchestrator"\ndata_dir = "{{orchestrator.paths.base}}/data"\nlogs_dir = "{{orchestrator.paths.base}}/logs"\nqueue_dir = "{{orchestrator.paths.data_dir}}/queue"\n```\n\n**Control-Center**:\n\n```\n[paths]\nbase = "{{workspace.path}}/.control-center"\ndata_dir = "{{paths.base}}/data"\nlogs_dir = "{{paths.base}}/logs"\n```\n\n**Result** (workspace: `workspace-librecloud`):\n\n```\nworkspace-librecloud/\n├── .orchestrator/\n│ ├── data/\n│ │ └── queue.rkvs\n│ └── logs/\n└── .control-center/\n ├── data/\n │ └── control-center.db\n └── 
logs/\n```\n\n---\n\n## Environment Variable Overrides\n\nAny config value can be overridden via environment variables:\n\n### Control-Center\n\n```\n# Override server port\nexport CONTROL_CENTER_SERVER_PORT=8081\n\n# Override database URL\nexport CONTROL_CENTER_DATABASE_URL="rocksdb:///custom/path/db"\n\n# Override JWT secret\nexport CONTROL_CENTER_JWT_ISSUER="my-issuer"\n```\n\n### Orchestrator\n\n```\n# Override orchestrator port\nexport ORCHESTRATOR_SERVER_PORT=8080\n\n# Override storage backend\nexport ORCHESTRATOR_STORAGE_TYPE="surrealdb-server"\nexport ORCHESTRATOR_STORAGE_SURREALDB_URL="ws://localhost:8000"\n\n# Override concurrency\nexport ORCHESTRATOR_QUEUE_MAX_CONCURRENT_TASKS=10\n```\n\n### Naming Convention\n\n```\n{SERVICE}_{SECTION}_{KEY} = value\n```\n\n**Examples**:\n\n- `CONTROL_CENTER_SERVER_PORT` → `[server] port`\n- `ORCHESTRATOR_QUEUE_MAX_CONCURRENT_TASKS` → `[queue] max_concurrent_tasks`\n- `PROVISIONING_DEBUG_ENABLED` → `[debug] enabled`\n\n---\n\n## Docker vs Native Configuration\n\n### Docker Deployment\n\n**Container paths** (resolved inside container):\n\n```\n[paths]\nbase = "/app/provisioning"\ndata_dir = "/data" # Mounted volume\nlogs_dir = "/var/log/orchestrator" # Mounted volume\n```\n\n**Docker Compose volumes**:\n\n```\nservices:\n orchestrator:\n volumes:\n - orchestrator-data:/data\n - orchestrator-logs:/var/log/orchestrator\n\n control-center:\n volumes:\n - control-center-data:/data\n\nvolumes:\n orchestrator-data:\n orchestrator-logs:\n control-center-data:\n```\n\n### Native Deployment\n\n**Host paths** (macOS/Linux):\n\n```\n[paths]\nbase = "/Users/Akasha/project-provisioning/provisioning"\ndata_dir = "{{workspace.path}}/.orchestrator/data"\nlogs_dir = "{{workspace.path}}/.orchestrator/logs"\n```\n\n---\n\n## Configuration Validation\n\nCheck current configuration:\n\n```\n# Show effective configuration\nprovisioning env\n\n# Show all config and environment\nprovisioning allenv\n\n# Validate configuration\nprovisioning validate config\n\n# Show service-specific config\nPROVISIONING_DEBUG=true ./orchestrator --show-config\n```\n\n---\n\n## KMS Database\n\n**Cosmian KMS** uses its own database (when deployed):\n\n```\n# KMS database location (Docker)\n/data/kms.db # SQLite database inside KMS container\n\n# KMS database location (Native)\n{{workspace.path}}/.kms/data/kms.db\n```\n\nKMS also integrates with Control-Center's KMS hybrid backend (local + remote):\n\n```\n[kms]\nmode = "hybrid" # local, remote, or hybrid\n\n[kms.local]\ndatabase_path = "{{paths.data_dir}}/kms.db"\n\n[kms.remote]\nserver_url = "http://localhost:9998" # Cosmian KMS server\n```\n\n---\n\n## Summary\n\n### Control-Center Database\n\n- **Type**: RocksDB (embedded)\n- **Location**: `{{workspace.path}}/.control-center/data/control-center.db`\n- **No server required**: Embedded in control-center process\n\n### Orchestrator Database\n\n- **Type**: Filesystem (default) or SurrealDB (production)\n- **Location**: `{{workspace.path}}/.orchestrator/data/queue.rkvs`\n- **Optional server**: SurrealDB for production\n\n### Configuration Loading\n\n1. System defaults (provisioning/config/)\n2. Service defaults (platform/{service}/)\n3. Workspace config\n4. User config\n5. Environment variables\n6. 
Runtime overrides\n\n### Best Practices\n\n- ✅ Use workspace-aware paths\n- ✅ Override via environment variables in Docker\n- ✅ Keep secrets in KMS, not config files\n- ✅ Use RocksDB for single-node deployments\n- ✅ Use SurrealDB for distributed/production deployments\n\n---\n\n**Related Documentation**:\n\n- [Configuration System](../infrastructure/configuration-guide.md)\n- [KMS Architecture](../security/kms-architecture.md)\n- [Workspace Switching](../infrastructure/workspace-switching-guide.md) +# Database and Configuration Architecture + +**Date**: 2025-10-07 +**Status**: ACTIVE DOCUMENTATION + +--- + +## Control-Center Database (DBS) + +### Database Type: **SurrealDB** (In-Memory Backend) + +Control-Center uses **SurrealDB with kv-mem backend**, an embedded in-memory database - **no separate database server required**. + +### Database Configuration + +```text +[database] +url = "memory" # In-memory backend +namespace = "control_center" +database = "main" +``` + +**Storage**: In-memory (data persists during process lifetime) + +**Production Alternative**: Switch to remote WebSocket connection for persistent storage: + +```text +[database] +url = "ws://localhost:8000" +namespace = "control_center" +database = "main" +username = "root" +password = "secret" +``` + +### Why SurrealDB kv-mem + +| Feature | SurrealDB kv-mem | RocksDB | PostgreSQL | +| --------- | ------------------ | --------- | ------------ | +| **Deployment** | Embedded (no server) | Embedded | Server only | +| **Build Deps** | None | libclang, bzip2 | Many | +| **Docker** | Simple | Complex | External service | +| **Performance** | Very fast (memory) | Very fast (disk) | Network latency | +| **Use Case** | Dev/test, graphs | Production K/V | Relational data | +| **GraphQL** | Built-in | None | External | + +**Control-Center choice**: SurrealDB kv-mem for **zero-dependency embedded storage**, perfect for: + +- Policy engine state +- Session management +- Configuration cache +- Audit logs +- User credentials +- Graph-based policy relationships + +### Additional Database Support + +Control-Center also supports (via Cargo.toml dependencies): + +1. **SurrealDB (WebSocket)** - For production persistent storage + + ```toml + surrealdb = { version = "2.3", features = ["kv-mem", "protocol-ws", "protocol-http"] } + ``` + +1. **SQLx** - For SQL database backends (optional) + + ```toml + sqlx = { workspace = true } + ``` + +**Default**: SurrealDB kv-mem (embedded, no extra setup, no build dependencies) + +--- + +## Orchestrator Database + +### Storage Type: **Filesystem** (File-based Queue) + +Orchestrator uses simple file-based storage by default: + +```text +[orchestrator.storage] +type = "filesystem" # Default +backend_path = "{{orchestrator.paths.data_dir}}/queue.rkvs" +``` + +**Resolved Path**: + +```text +{{workspace.path}}/.orchestrator/data/queue.rkvs +``` + +### Optional: SurrealDB Backend + +For production deployments, switch to SurrealDB: + +```text +[orchestrator.storage] +type = "surrealdb-server" # or surrealdb-embedded + +[orchestrator.storage.surrealdb] +url = "ws://localhost:8000" +namespace = "orchestrator" +database = "tasks" +username = "root" +password = "secret" +``` + +--- + +## Configuration Loading Architecture + +### Hierarchical Configuration System + +All services load configuration in this order (priority: low → high): + +```text +1. System Defaults provisioning/config/config.defaults.toml +2. Service Defaults provisioning/platform/{service}/config.defaults.toml +3. 
Workspace Config workspace/{name}/config/provisioning.yaml +4. User Config ~/Library/Application Support/provisioning/user_config.yaml +5. Environment Variables PROVISIONING_*, CONTROL_CENTER_*, ORCHESTRATOR_* +6. Runtime Overrides --config flag or API updates +``` + +### Variable Interpolation + +Configs support dynamic variable interpolation: + +```text +[paths] +base = "/Users/Akasha/project-provisioning/provisioning" +data_dir = "{{paths.base}}/data" # Resolves to: /Users/.../data + +[database] +url = "rocksdb://{{paths.data_dir}}/control-center.db" +# Resolves to: rocksdb:///Users/.../data/control-center.db +``` + +**Supported Variables**: + +- `{{paths.*}}` - Path variables from config +- `{{workspace.path}}` - Current workspace path +- `{{env.HOME}}` - Environment variables +- `{{now.date}}` - Current date/time +- `{{git.branch}}` - Git branch name + +### Service-Specific Config Files + +Each platform service has its own `config.defaults.toml`: + +| Service | Config File | Purpose | +| --------- | ------------- | --------- | +| **Orchestrator** | `provisioning/platform/orchestrator/config.defaults.toml` | Workflow management, queue settings | +| **Control-Center** | `provisioning/platform/control-center/config.defaults.toml` | Web UI, auth, database | +| **MCP Server** | `provisioning/platform/mcp-server/config.defaults.toml` | AI integration settings | +| **KMS** | `provisioning/core/services/kms/config.defaults.toml` | Key management | + +### Central Configuration + +**Master config**: `provisioning/config/config.defaults.toml` + +Contains: + +- Global paths +- Provider configurations +- Cache settings +- Debug flags +- Environment-specific overrides + +### Workspace-Aware Paths + +All services use workspace-aware paths: + +**Orchestrator**: + +```text +[orchestrator.paths] +base = "{{workspace.path}}/.orchestrator" +data_dir = "{{orchestrator.paths.base}}/data" +logs_dir = "{{orchestrator.paths.base}}/logs" +queue_dir = "{{orchestrator.paths.data_dir}}/queue" +``` + +**Control-Center**: + +```text +[paths] +base = "{{workspace.path}}/.control-center" +data_dir = "{{paths.base}}/data" +logs_dir = "{{paths.base}}/logs" +``` + +**Result** (workspace: `workspace-librecloud`): + +```text +workspace-librecloud/ +├── .orchestrator/ +│ ├── data/ +│ │ └── queue.rkvs +│ └── logs/ +└── .control-center/ + ├── data/ + │ └── control-center.db + └── logs/ +``` + +--- + +## Environment Variable Overrides + +Any config value can be overridden via environment variables: + +### Control-Center + +```text +# Override server port +export CONTROL_CENTER_SERVER_PORT=8081 + +# Override database URL +export CONTROL_CENTER_DATABASE_URL="rocksdb:///custom/path/db" + +# Override JWT secret +export CONTROL_CENTER_JWT_ISSUER="my-issuer" +``` + +### Orchestrator + +```text +# Override orchestrator port +export ORCHESTRATOR_SERVER_PORT=8080 + +# Override storage backend +export ORCHESTRATOR_STORAGE_TYPE="surrealdb-server" +export ORCHESTRATOR_STORAGE_SURREALDB_URL="ws://localhost:8000" + +# Override concurrency +export ORCHESTRATOR_QUEUE_MAX_CONCURRENT_TASKS=10 +``` + +### Naming Convention + +```text +{SERVICE}_{SECTION}_{KEY} = value +``` + +**Examples**: + +- `CONTROL_CENTER_SERVER_PORT` → `[server] port` +- `ORCHESTRATOR_QUEUE_MAX_CONCURRENT_TASKS` → `[queue] max_concurrent_tasks` +- `PROVISIONING_DEBUG_ENABLED` → `[debug] enabled` + +--- + +## Docker vs Native Configuration + +### Docker Deployment + +**Container paths** (resolved inside container): + +```text +[paths] +base = "/app/provisioning" +data_dir 
= "/data" # Mounted volume +logs_dir = "/var/log/orchestrator" # Mounted volume +``` + +**Docker Compose volumes**: + +```text +services: + orchestrator: + volumes: + - orchestrator-data:/data + - orchestrator-logs:/var/log/orchestrator + + control-center: + volumes: + - control-center-data:/data + +volumes: + orchestrator-data: + orchestrator-logs: + control-center-data: +``` + +### Native Deployment + +**Host paths** (macOS/Linux): + +```text +[paths] +base = "/Users/Akasha/project-provisioning/provisioning" +data_dir = "{{workspace.path}}/.orchestrator/data" +logs_dir = "{{workspace.path}}/.orchestrator/logs" +``` + +--- + +## Configuration Validation + +Check current configuration: + +```text +# Show effective configuration +provisioning env + +# Show all config and environment +provisioning allenv + +# Validate configuration +provisioning validate config + +# Show service-specific config +PROVISIONING_DEBUG=true ./orchestrator --show-config +``` + +--- + +## KMS Database + +**Cosmian KMS** uses its own database (when deployed): + +```text +# KMS database location (Docker) +/data/kms.db # SQLite database inside KMS container + +# KMS database location (Native) +{{workspace.path}}/.kms/data/kms.db +``` + +KMS also integrates with Control-Center's KMS hybrid backend (local + remote): + +```text +[kms] +mode = "hybrid" # local, remote, or hybrid + +[kms.local] +database_path = "{{paths.data_dir}}/kms.db" + +[kms.remote] +server_url = "http://localhost:9998" # Cosmian KMS server +``` + +--- + +## Summary + +### Control-Center Database + +- **Type**: RocksDB (embedded) +- **Location**: `{{workspace.path}}/.control-center/data/control-center.db` +- **No server required**: Embedded in control-center process + +### Orchestrator Database + +- **Type**: Filesystem (default) or SurrealDB (production) +- **Location**: `{{workspace.path}}/.orchestrator/data/queue.rkvs` +- **Optional server**: SurrealDB for production + +### Configuration Loading + +1. System defaults (provisioning/config/) +2. Service defaults (platform/{service}/) +3. Workspace config +4. User config +5. Environment variables +6. Runtime overrides + +### Best Practices + +- ✅ Use workspace-aware paths +- ✅ Override via environment variables in Docker +- ✅ Keep secrets in KMS, not config files +- ✅ Use RocksDB for single-node deployments +- ✅ Use SurrealDB for distributed/production deployments + +--- + +**Related Documentation**: + +- [Configuration System](../infrastructure/configuration-guide.md) +- [KMS Architecture](../security/kms-architecture.md) +- [Workspace Switching](../infrastructure/workspace-switching-guide.md) \ No newline at end of file diff --git a/docs/src/architecture/design-principles.md b/docs/src/architecture/design-principles.md index 32ce32b..a46029e 100644 --- a/docs/src/architecture/design-principles.md +++ b/docs/src/architecture/design-principles.md @@ -1 +1,422 @@ -# Design Principles\n\n## Overview\n\nProvisioning is built on a foundation of architectural principles that guide design decisions,\nensure system quality, and maintain consistency across the codebase.\nThese principles have evolved from real-world experience\nand represent lessons learned from complex infrastructure automation challenges.\n\n## Core Architectural Principles\n\n### 1. Project Architecture Principles (PAP) Compliance\n\n**Principle**: Fully agnostic and configuration-driven, not hardcoded. 
Use abstraction layers dynamically loaded from configurations.\n\n**Rationale**: Infrastructure as Code (IaC) systems must be flexible enough to adapt to any environment\nwithout code changes. Hardcoded values defeat the purpose of IaC and create maintenance burdens.\n\n**Implementation Guidelines**:\n\n- Never patch the system with hardcoded fallbacks when configuration parsing fails\n- All behavior must be configurable through the hierarchical configuration system\n- Use abstraction layers that are dynamically loaded from configuration\n- Validate configuration fully before execution, fail fast on invalid config\n\n**Anti-Patterns (Anti-PAP)**:\n\n- Hardcoded provider endpoints or credentials\n- Environment-specific logic in code\n- Fallback to default values when configuration is missing\n- Mixed configuration and implementation logic\n\n**Example**:\n\n```\n# ✅ PAP Compliant - Configuration-driven\n[providers.aws]\nregions = ["us-west-2", "us-east-1"]\ninstance_types = ["t3.micro", "t3.small"]\napi_endpoint = "https://ec2.amazonaws.com"\n\n# ❌ Anti-PAP - Hardcoded fallback in code\nif config.providers.aws.regions.is_empty() {\n regions = vec!["us-west-2"]; // Hardcoded fallback\n}\n```\n\n### 2. Hybrid Architecture Optimization\n\n**Principle**: Use each language for what it does best - Rust for coordination, Nushell for business logic.\n\n**Rationale**: Different languages have different strengths. Rust excels at performance-critical coordination tasks, while Nushell excels at\nconfiguration management and domain-specific operations.\n\n**Implementation Guidelines**:\n\n- Rust handles orchestration, state management, and performance-critical paths\n- Nushell handles provider operations, configuration processing, and CLI interfaces\n- Clear boundaries between language responsibilities\n- Structured data exchange (JSON) between languages\n- Preserve existing domain expertise in Nushell\n\n**Language Responsibility Matrix**:\n\n```\nRust Layer:\n├── Workflow orchestration and coordination\n├── REST API servers and HTTP endpoints\n├── State persistence and checkpoint management\n├── Parallel processing and batch operations\n├── Error recovery and rollback logic\n└── Performance-critical data processing\n\nNushell Layer:\n├── Provider implementations (AWS, UpCloud, local)\n├── Task service management and configuration\n├── Nickel configuration processing and validation\n├── Template generation and Infrastructure as Code\n├── CLI user interfaces and interactive tools\n└── Domain-specific business logic\n```\n\n### 3. Configuration-First Architecture\n\n**Principle**: All system behavior is determined by configuration, with clear hierarchical precedence and validation.\n\n**Rationale**: True Infrastructure as Code requires that all behavior be configurable without code changes. Configuration hierarchy provides\nflexibility while maintaining predictability.\n\n**Configuration Hierarchy** (precedence order):\n\n1. Runtime Parameters (highest precedence)\n2. Environment Configuration\n3. Infrastructure Configuration\n4. User Configuration\n5. System Defaults (lowest precedence)\n\n**Implementation Guidelines**:\n\n- Complete configuration validation before execution\n- Variable interpolation for dynamic values\n- Schema-based validation using Nickel\n- Configuration immutability during execution\n- Comprehensive error reporting for configuration issues\n\n### 4. 
Domain-Driven Structure\n\n**Principle**: Organize code by business domains and functional boundaries, not by technical concerns.\n\n**Rationale**: Domain-driven organization scales better, reduces coupling, and enables focused development by domain experts.\n\n**Domain Organization**:\n\n```\n├── core/ # Core system and library functions\n├── platform/ # High-performance coordination layer\n├── provisioning/ # Main business logic with providers and services\n├── control-center/ # Web-based management interface\n├── tools/ # Development and utility tools\n└── extensions/ # Plugin and extension framework\n```\n\n**Domain Responsibilities**:\n\n- Each domain has clear ownership and boundaries\n- Cross-domain communication through well-defined interfaces\n- Domain-specific testing and validation strategies\n- Independent evolution and versioning within architectural guidelines\n\n### 5. Isolation and Modularity\n\n**Principle**: Components are isolated, modular, and independently deployable with clear interface contracts.\n\n**Rationale**: Isolation enables independent development, testing, and deployment. Clear interfaces prevent tight coupling and enable system\nevolution.\n\n**Implementation Guidelines**:\n\n- User workspace isolation from system installation\n- Extension sandboxing and security boundaries\n- Provider abstraction with standardized interfaces\n- Service modularity with dependency management\n- Clear API contracts between components\n\n## Quality Attribute Principles\n\n### 6. Reliability Through Recovery\n\n**Principle**: Build comprehensive error recovery and rollback capabilities into every operation.\n\n**Rationale**: Infrastructure operations can fail at any point. Systems must be able to recover gracefully and maintain consistent state.\n\n**Implementation Guidelines**:\n\n- Checkpoint-based recovery for long-running workflows\n- Comprehensive rollback capabilities for all operations\n- Transactional semantics where possible\n- State validation and consistency checks\n- Detailed audit trails for debugging and recovery\n\n**Recovery Strategies**:\n\n```\nOperation Level:\n├── Atomic operations with rollback\n├── Retry logic with exponential backoff\n├── Circuit breakers for external dependencies\n└── Graceful degradation on partial failures\n\nWorkflow Level:\n├── Checkpoint-based recovery\n├── Dependency-aware rollback\n├── State consistency validation\n└── Resume from failure points\n\nSystem Level:\n├── Health monitoring and alerting\n├── Automatic recovery procedures\n├── Data backup and restoration\n└── Disaster recovery capabilities\n```\n\n### 7. Performance Through Parallelism\n\n**Principle**: Design for parallel execution and efficient resource utilization while maintaining correctness.\n\n**Rationale**: Infrastructure operations often involve multiple independent resources that can be processed in parallel for significant performance\ngains.\n\n**Implementation Guidelines**:\n\n- Configurable parallelism limits to prevent resource exhaustion\n- Dependency-aware parallel execution\n- Resource pooling and connection management\n- Efficient data structures and algorithms\n- Memory-conscious processing for large datasets\n\n### 8. Security Through Isolation\n\n**Principle**: Implement security through isolation boundaries, least privilege, and comprehensive validation.\n\n**Rationale**: Infrastructure systems handle sensitive data and powerful operations. 
Security must be built in at the architectural level.\n\n**Security Implementation**:\n\n```\nAuthentication & Authorization:\n├── API authentication for external access\n├── Role-based access control for operations\n├── Permission validation before execution\n└── Audit logging for all security events\n\nData Protection:\n├── Encrypted secrets management (SOPS/Age)\n├── Secure configuration file handling\n├── Network communication encryption\n└── Sensitive data sanitization in logs\n\nIsolation Boundaries:\n├── User workspace isolation\n├── Extension sandboxing\n├── Provider credential isolation\n└── Process and network isolation\n```\n\n## Development Methodology Principles\n\n### 9. Configuration-Driven Testing\n\n**Principle**: Tests should be configuration-driven and validate both happy path and error conditions.\n\n**Rationale**: Infrastructure systems must work across diverse environments and configurations. Tests must validate the configuration-driven nature of\nthe system.\n\n**Testing Strategy**:\n\n```\nUnit Testing:\n├── Configuration validation tests\n├── Individual component tests\n├── Error condition tests\n└── Performance benchmark tests\n\nIntegration Testing:\n├── Multi-provider workflow tests\n├── Configuration hierarchy tests\n├── Error recovery tests\n└── End-to-end scenario tests\n\nSystem Testing:\n├── Full deployment tests\n├── Upgrade and migration tests\n├── Performance and scalability tests\n└── Security and isolation tests\n```\n\n## Error Handling Principles\n\n### 11. Fail Fast, Recover Gracefully\n\n**Principle**: Validate early and fail fast on errors, but provide comprehensive recovery mechanisms.\n\n**Rationale**: Early validation prevents complex error states, while graceful recovery maintains system reliability.\n\n**Implementation Guidelines**:\n\n- Complete configuration validation before execution\n- Input validation at system boundaries\n- Clear error messages without internal stack traces (except in DEBUG mode)\n- Comprehensive error categorization and handling\n- Recovery procedures for all error categories\n\n**Error Categories**:\n\n```\nConfiguration Errors:\n├── Invalid configuration syntax\n├── Missing required configuration\n├── Configuration conflicts\n└── Schema validation failures\n\nRuntime Errors:\n├── Provider API failures\n├── Network connectivity issues\n├── Resource availability problems\n└── Permission and authentication errors\n\nSystem Errors:\n├── File system access problems\n├── Memory and resource exhaustion\n├── Process communication failures\n└── External dependency failures\n```\n\n### 12. Observable Operations\n\n**Principle**: All operations must be observable through comprehensive logging, metrics, and monitoring.\n\n**Rationale**: Infrastructure operations must be debuggable and monitorable in production environments.\n\n**Observability Implementation**:\n\n```\nLogging:\n├── Structured JSON logging\n├── Configurable log levels\n├── Context-aware log messages\n└── Audit trail for all operations\n\nMetrics:\n├── Operation performance metrics\n├── Resource utilization metrics\n├── Error rate and type metrics\n└── Business logic metrics\n\nMonitoring:\n├── Health check endpoints\n├── Real-time status reporting\n├── Workflow progress tracking\n└── Alert integration capabilities\n```\n\n## Evolution and Maintenance Principles\n\n### 13. 
Backward Compatibility\n\n**Principle**: Maintain backward compatibility for configuration, APIs, and user interfaces.\n\n**Rationale**: Infrastructure systems are long-lived and must support existing configurations and workflows during evolution.\n\n**Compatibility Guidelines**:\n\n- Semantic versioning for all interfaces\n- Configuration migration tools and procedures\n- Deprecation warnings and migration guides\n- API versioning for external interfaces\n- Comprehensive upgrade testing\n\n### 14. Documentation-Driven Development\n\n**Principle**: Architecture decisions, APIs, and operational procedures must be thoroughly documented.\n\n**Rationale**: Infrastructure systems are complex and require clear documentation for operation, maintenance, and evolution.\n\n**Documentation Requirements**:\n\n- Architecture Decision Records (ADRs) for major decisions\n- API documentation with examples\n- Operational runbooks and procedures\n- Configuration guides and examples\n- Troubleshooting guides and common issues\n\n### 15. Technical Debt Management\n\n**Principle**: Actively manage technical debt through regular assessment and systematic improvement.\n\n**Rationale**: Infrastructure systems accumulate complexity over time. Proactive debt management prevents system degradation.\n\n**Debt Management Strategy**:\n\n```\nAssessment:\n├── Regular code quality reviews\n├── Performance profiling and optimization\n├── Security audit and updates\n└── Dependency management and updates\n\nImprovement:\n├── Refactoring for clarity and maintainability\n├── Performance optimization based on metrics\n├── Security enhancement and hardening\n└── Test coverage improvement and validation\n```\n\n## Trade-off Management\n\n### 16. Explicit Trade-off Documentation\n\n**Principle**: All architectural trade-offs must be explicitly documented with rationale and alternatives considered.\n\n**Rationale**: Understanding trade-offs enables informed decision making and future evolution of the system.\n\n**Trade-off Categories**:\n\n```\nPerformance vs. Maintainability:\n├── Rust coordination layer for performance\n├── Nushell business logic for maintainability\n├── Caching strategies for speed vs. consistency\n└── Parallel processing vs. resource usage\n\nFlexibility vs. Complexity:\n├── Configuration-driven architecture vs. simplicity\n├── Extension framework vs. core system complexity\n├── Multi-provider support vs. specialization\n└── Hierarchical configuration vs. simple key-value\n\nSecurity vs. Usability:\n├── Workspace isolation vs. convenience\n├── Extension sandboxing vs. functionality\n├── Authentication requirements vs. ease of use\n└── Audit logging vs. performance overhead\n```\n\n## Conclusion\n\nThese design principles form the foundation of provisioning's architecture. They guide decision making, ensure quality, and provide a framework for\nsystem evolution. Adherence to these principles has enabled the development of a sophisticated, reliable, and maintainable infrastructure automation\nplatform.\n\nThe principles are living guidelines that evolve with the system while maintaining core architectural integrity. 
They serve as both implementation\nguidance and evaluation criteria for new features and modifications.\n\nSuccess in applying these principles is measured by:\n\n- System reliability and error recovery capabilities\n- Development efficiency and maintainability\n- Configuration flexibility and user experience\n- Performance and scalability characteristics\n- Security and isolation effectiveness\n\nThese principles represent the distilled wisdom from building and operating complex infrastructure automation systems at scale. +# Design Principles + +## Overview + +Provisioning is built on a foundation of architectural principles that guide design decisions, +ensure system quality, and maintain consistency across the codebase. +These principles have evolved from real-world experience +and represent lessons learned from complex infrastructure automation challenges. + +## Core Architectural Principles + +### 1. Project Architecture Principles (PAP) Compliance + +**Principle**: Fully agnostic and configuration-driven, not hardcoded. Use abstraction layers dynamically loaded from configurations. + +**Rationale**: Infrastructure as Code (IaC) systems must be flexible enough to adapt to any environment +without code changes. Hardcoded values defeat the purpose of IaC and create maintenance burdens. + +**Implementation Guidelines**: + +- Never patch the system with hardcoded fallbacks when configuration parsing fails +- All behavior must be configurable through the hierarchical configuration system +- Use abstraction layers that are dynamically loaded from configuration +- Validate configuration fully before execution, fail fast on invalid config + +**Anti-Patterns (Anti-PAP)**: + +- Hardcoded provider endpoints or credentials +- Environment-specific logic in code +- Fallback to default values when configuration is missing +- Mixed configuration and implementation logic + +**Example**: + +```text +# ✅ PAP Compliant - Configuration-driven +[providers.aws] +regions = ["us-west-2", "us-east-1"] +instance_types = ["t3.micro", "t3.small"] +api_endpoint = "https://ec2.amazonaws.com" + +# ❌ Anti-PAP - Hardcoded fallback in code +if config.providers.aws.regions.is_empty() { + regions = vec!["us-west-2"]; // Hardcoded fallback +} +``` + +### 2. Hybrid Architecture Optimization + +**Principle**: Use each language for what it does best - Rust for coordination, Nushell for business logic. + +**Rationale**: Different languages have different strengths. Rust excels at performance-critical coordination tasks, while Nushell excels at +configuration management and domain-specific operations. 
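+
+As a minimal sketch of that boundary (the function and record fields are illustrative, not the project's actual API), a Nushell routine
+can return a structured record that the Rust coordination layer consumes as JSON:
+
+```text
+# Nushell side: domain logic returns structured data
+def create-server [name: string, plan: string]: nothing -> record {
+    # ...provider API calls would go here...
+    { server: $name, plan: $plan, status: "created" }
+}
+
+# The Rust layer would invoke this as a subprocess and parse the JSON output
+create-server "web01" "small" | to json
+```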
+ +**Implementation Guidelines**: + +- Rust handles orchestration, state management, and performance-critical paths +- Nushell handles provider operations, configuration processing, and CLI interfaces +- Clear boundaries between language responsibilities +- Structured data exchange (JSON) between languages +- Preserve existing domain expertise in Nushell + +**Language Responsibility Matrix**: + +```text +Rust Layer: +├── Workflow orchestration and coordination +├── REST API servers and HTTP endpoints +├── State persistence and checkpoint management +├── Parallel processing and batch operations +├── Error recovery and rollback logic +└── Performance-critical data processing + +Nushell Layer: +├── Provider implementations (AWS, UpCloud, local) +├── Task service management and configuration +├── Nickel configuration processing and validation +├── Template generation and Infrastructure as Code +├── CLI user interfaces and interactive tools +└── Domain-specific business logic +``` + +### 3. Configuration-First Architecture + +**Principle**: All system behavior is determined by configuration, with clear hierarchical precedence and validation. + +**Rationale**: True Infrastructure as Code requires that all behavior be configurable without code changes. Configuration hierarchy provides +flexibility while maintaining predictability. + +**Configuration Hierarchy** (precedence order): + +1. Runtime Parameters (highest precedence) +2. Environment Configuration +3. Infrastructure Configuration +4. User Configuration +5. System Defaults (lowest precedence) + +**Implementation Guidelines**: + +- Complete configuration validation before execution +- Variable interpolation for dynamic values +- Schema-based validation using Nickel +- Configuration immutability during execution +- Comprehensive error reporting for configuration issues + +### 4. Domain-Driven Structure + +**Principle**: Organize code by business domains and functional boundaries, not by technical concerns. + +**Rationale**: Domain-driven organization scales better, reduces coupling, and enables focused development by domain experts. + +**Domain Organization**: + +```text +├── core/ # Core system and library functions +├── platform/ # High-performance coordination layer +├── provisioning/ # Main business logic with providers and services +├── control-center/ # Web-based management interface +├── tools/ # Development and utility tools +└── extensions/ # Plugin and extension framework +``` + +**Domain Responsibilities**: + +- Each domain has clear ownership and boundaries +- Cross-domain communication through well-defined interfaces +- Domain-specific testing and validation strategies +- Independent evolution and versioning within architectural guidelines + +### 5. Isolation and Modularity + +**Principle**: Components are isolated, modular, and independently deployable with clear interface contracts. + +**Rationale**: Isolation enables independent development, testing, and deployment. Clear interfaces prevent tight coupling and enable system +evolution. + +**Implementation Guidelines**: + +- User workspace isolation from system installation +- Extension sandboxing and security boundaries +- Provider abstraction with standardized interfaces +- Service modularity with dependency management +- Clear API contracts between components + +## Quality Attribute Principles + +### 6. Reliability Through Recovery + +**Principle**: Build comprehensive error recovery and rollback capabilities into every operation. 
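+
+A hedged sketch of one such mechanism, retrying a transient operation with exponential backoff (names and defaults are illustrative):
+
+```text
+# Retry a transient operation with exponential backoff (sketch)
+def retry-with-backoff [op: closure, --max-attempts: int = 3]: nothing -> any {
+    mut attempt = 1
+    loop {
+        let result = (do --ignore-errors $op)
+        if $result != null { return $result }   # success: return early
+        if $attempt >= $max_attempts {
+            error make { msg: $"operation failed after ($max_attempts) attempts" }
+        }
+        sleep ((2 ** $attempt) * 1sec)   # 2s, 4s, 8s, ...
+        $attempt = $attempt + 1
+    }
+}
+```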
+ +**Rationale**: Infrastructure operations can fail at any point. Systems must be able to recover gracefully and maintain consistent state. + +**Implementation Guidelines**: + +- Checkpoint-based recovery for long-running workflows +- Comprehensive rollback capabilities for all operations +- Transactional semantics where possible +- State validation and consistency checks +- Detailed audit trails for debugging and recovery + +**Recovery Strategies**: + +```text +Operation Level: +├── Atomic operations with rollback +├── Retry logic with exponential backoff +├── Circuit breakers for external dependencies +└── Graceful degradation on partial failures + +Workflow Level: +├── Checkpoint-based recovery +├── Dependency-aware rollback +├── State consistency validation +└── Resume from failure points + +System Level: +├── Health monitoring and alerting +├── Automatic recovery procedures +├── Data backup and restoration +└── Disaster recovery capabilities +``` + +### 7. Performance Through Parallelism + +**Principle**: Design for parallel execution and efficient resource utilization while maintaining correctness. + +**Rationale**: Infrastructure operations often involve multiple independent resources that can be processed in parallel for significant performance +gains. + +**Implementation Guidelines**: + +- Configurable parallelism limits to prevent resource exhaustion +- Dependency-aware parallel execution +- Resource pooling and connection management +- Efficient data structures and algorithms +- Memory-conscious processing for large datasets + +### 8. Security Through Isolation + +**Principle**: Implement security through isolation boundaries, least privilege, and comprehensive validation. + +**Rationale**: Infrastructure systems handle sensitive data and powerful operations. Security must be built in at the architectural level. + +**Security Implementation**: + +```text +Authentication & Authorization: +├── API authentication for external access +├── Role-based access control for operations +├── Permission validation before execution +└── Audit logging for all security events + +Data Protection: +├── Encrypted secrets management (SOPS/Age) +├── Secure configuration file handling +├── Network communication encryption +└── Sensitive data sanitization in logs + +Isolation Boundaries: +├── User workspace isolation +├── Extension sandboxing +├── Provider credential isolation +└── Process and network isolation +``` + +## Development Methodology Principles + +### 9. Configuration-Driven Testing + +**Principle**: Tests should be configuration-driven and validate both happy path and error conditions. + +**Rationale**: Infrastructure systems must work across diverse environments and configurations. Tests must validate the configuration-driven nature of +the system. + +**Testing Strategy**: + +```text +Unit Testing: +├── Configuration validation tests +├── Individual component tests +├── Error condition tests +└── Performance benchmark tests + +Integration Testing: +├── Multi-provider workflow tests +├── Configuration hierarchy tests +├── Error recovery tests +└── End-to-end scenario tests + +System Testing: +├── Full deployment tests +├── Upgrade and migration tests +├── Performance and scalability tests +└── Security and isolation tests +``` + +## Error Handling Principles + +### 11. Fail Fast, Recover Gracefully + +**Principle**: Validate early and fail fast on errors, but provide comprehensive recovery mechanisms. 
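+
+For instance, a boundary validator can reject an incomplete configuration before any provider call is made (a sketch; the required keys are hypothetical):
+
+```text
+# Fail fast: reject incomplete configuration before any work starts
+def validate-config [config: record]: nothing -> nothing {
+    let required = ["workspace" "environment" "providers"]
+    let missing = ($required | where {|key| $key not-in ($config | columns) })
+    if not ($missing | is-empty) {
+        error make { msg: $"invalid configuration: missing ($missing | str join ', ')" }
+    }
+}
+```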
+ +**Rationale**: Early validation prevents complex error states, while graceful recovery maintains system reliability. + +**Implementation Guidelines**: + +- Complete configuration validation before execution +- Input validation at system boundaries +- Clear error messages without internal stack traces (except in DEBUG mode) +- Comprehensive error categorization and handling +- Recovery procedures for all error categories + +**Error Categories**: + +```text +Configuration Errors: +├── Invalid configuration syntax +├── Missing required configuration +├── Configuration conflicts +└── Schema validation failures + +Runtime Errors: +├── Provider API failures +├── Network connectivity issues +├── Resource availability problems +└── Permission and authentication errors + +System Errors: +├── File system access problems +├── Memory and resource exhaustion +├── Process communication failures +└── External dependency failures +``` + +### 12. Observable Operations + +**Principle**: All operations must be observable through comprehensive logging, metrics, and monitoring. + +**Rationale**: Infrastructure operations must be debuggable and monitorable in production environments. + +**Observability Implementation**: + +```text +Logging: +├── Structured JSON logging +├── Configurable log levels +├── Context-aware log messages +└── Audit trail for all operations + +Metrics: +├── Operation performance metrics +├── Resource utilization metrics +├── Error rate and type metrics +└── Business logic metrics + +Monitoring: +├── Health check endpoints +├── Real-time status reporting +├── Workflow progress tracking +└── Alert integration capabilities +``` + +## Evolution and Maintenance Principles + +### 13. Backward Compatibility + +**Principle**: Maintain backward compatibility for configuration, APIs, and user interfaces. + +**Rationale**: Infrastructure systems are long-lived and must support existing configurations and workflows during evolution. + +**Compatibility Guidelines**: + +- Semantic versioning for all interfaces +- Configuration migration tools and procedures +- Deprecation warnings and migration guides +- API versioning for external interfaces +- Comprehensive upgrade testing + +### 14. Documentation-Driven Development + +**Principle**: Architecture decisions, APIs, and operational procedures must be thoroughly documented. + +**Rationale**: Infrastructure systems are complex and require clear documentation for operation, maintenance, and evolution. + +**Documentation Requirements**: + +- Architecture Decision Records (ADRs) for major decisions +- API documentation with examples +- Operational runbooks and procedures +- Configuration guides and examples +- Troubleshooting guides and common issues + +### 15. Technical Debt Management + +**Principle**: Actively manage technical debt through regular assessment and systematic improvement. + +**Rationale**: Infrastructure systems accumulate complexity over time. Proactive debt management prevents system degradation. + +**Debt Management Strategy**: + +```text +Assessment: +├── Regular code quality reviews +├── Performance profiling and optimization +├── Security audit and updates +└── Dependency management and updates + +Improvement: +├── Refactoring for clarity and maintainability +├── Performance optimization based on metrics +├── Security enhancement and hardening +└── Test coverage improvement and validation +``` + +## Trade-off Management + +### 16. 
Explicit Trade-off Documentation + +**Principle**: All architectural trade-offs must be explicitly documented with rationale and alternatives considered. + +**Rationale**: Understanding trade-offs enables informed decision making and future evolution of the system. + +**Trade-off Categories**: + +```text +Performance vs. Maintainability: +├── Rust coordination layer for performance +├── Nushell business logic for maintainability +├── Caching strategies for speed vs. consistency +└── Parallel processing vs. resource usage + +Flexibility vs. Complexity: +├── Configuration-driven architecture vs. simplicity +├── Extension framework vs. core system complexity +├── Multi-provider support vs. specialization +└── Hierarchical configuration vs. simple key-value + +Security vs. Usability: +├── Workspace isolation vs. convenience +├── Extension sandboxing vs. functionality +├── Authentication requirements vs. ease of use +└── Audit logging vs. performance overhead +``` + +## Conclusion + +These design principles form the foundation of provisioning's architecture. They guide decision making, ensure quality, and provide a framework for +system evolution. Adherence to these principles has enabled the development of a sophisticated, reliable, and maintainable infrastructure automation +platform. + +The principles are living guidelines that evolve with the system while maintaining core architectural integrity. They serve as both implementation +guidance and evaluation criteria for new features and modifications. + +Success in applying these principles is measured by: + +- System reliability and error recovery capabilities +- Development efficiency and maintainability +- Configuration flexibility and user experience +- Performance and scalability characteristics +- Security and isolation effectiveness + +These principles represent the distilled wisdom from building and operating complex infrastructure automation systems at scale. \ No newline at end of file diff --git a/docs/src/architecture/ecosystem-integration.md b/docs/src/architecture/ecosystem-integration.md index 98d60c9..0aea545 100644 --- a/docs/src/architecture/ecosystem-integration.md +++ b/docs/src/architecture/ecosystem-integration.md @@ -1 +1,523 @@ -# Prov-Ecosystem & Provctl Integration\n\n**Date**: 2025-11-23\n**Version**: 1.0.0\n**Status**: ✅ Implementation Complete\n\n## Overview\n\nThis document describes the **hybrid selective integration** of prov-ecosystem and provctl with provisioning, providing access to four critical functionalities:\n\n1. **Runtime Abstraction** - Unified Docker/Podman/OrbStack/Colima/nerdctl\n2. **SSH Advanced** - Pooling, circuit breaker, retry strategies, distributed operations\n3. **Backup System** - Multi-backend (Restic, Borg, Tar, Rsync) with retention policies\n4. 
**GitOps Events** - Event-driven deployments from Git\n\n---\n\n## Architecture\n\n### Three-Layer Integration\n\n```\n┌─────────────────────────────────────────────┐\n│ Provisioning CLI (provisioning/core/cli/) │\n│ ✅ 80+ command shortcuts │\n│ ✅ Domain-driven architecture │\n│ ✅ Modular CLI commands │\n└─────────────────────────────────────────────┘\n ↓\n┌─────────────────────────────────────────────┐\n│ Nushell Integration Layer │\n│ (provisioning/core/nulib/integrations/) │\n│ ✅ 5 modules with full type safety │\n│ ✅ Follows 17 Nushell guidelines │\n│ ✅ Early return, atomic operations │\n└─────────────────────────────────────────────┘\n ↓\n┌─────────────────────────────────────────────┐\n│ Rust Bridge Crate │\n│ (provisioning/platform/integrations/ │\n│ provisioning-bridge/) │\n│ ✅ Zero unsafe code │\n│ ✅ Idiomatic error handling (Result) │\n│ ✅ 5 modules (runtime, ssh, backup, etc) │\n│ ✅ Comprehensive tests │\n└─────────────────────────────────────────────┘\n ↓\n┌─────────────────────────────────────────────┐\n│ Prov-Ecosystem & Provctl Crates │\n│ (../../prov-ecosystem/ & ../../provctl/) │\n│ ✅ runtime: Container abstraction │\n│ ✅ init-servs: Service management │\n│ ✅ backup: Multi-backend backup │\n│ ✅ gitops: Event-driven automation │\n│ ✅ provctl-machines: SSH advanced │\n└─────────────────────────────────────────────┘\n```\n\n---\n\n## Components\n\n### 1. Runtime Abstraction\n\n**Location**: `provisioning/platform/integrations/provisioning-bridge/src/runtime.rs`\n**Nushell**: `provisioning/core/nulib/integrations/runtime.nu`\n**Nickel Schema**: `provisioning/schemas/integrations/runtime.ncl`\n\n**Purpose**: Unified interface for Docker, Podman, OrbStack, Colima, nerdctl\n\n**Key Types**:\n\n```\npub enum ContainerRuntime {\n Docker,\n Podman,\n OrbStack,\n Colima,\n Nerdctl,\n}\n\npub struct RuntimeDetector { ... }\npub struct ComposeAdapter { ... }\n```\n\n**Nushell Functions**:\n\n```\nruntime-detect # Auto-detect available runtime\nruntime-exec # Execute command in detected runtime\nruntime-compose # Adapt docker-compose for runtime\nruntime-info # Get runtime details\nruntime-list # List all available runtimes\n```\n\n**Benefits**:\n\n- ✅ Eliminates Docker hardcoding\n- ✅ Platform-aware detection\n- ✅ Automatic runtime selection\n- ✅ Docker Compose adaptation\n\n---\n\n### 2. SSH Advanced\n\n**Location**: `provisioning/platform/integrations/provisioning-bridge/src/ssh.rs`\n**Nushell**: `provisioning/core/nulib/integrations/ssh_advanced.nu`\n**Nickel Schema**: `provisioning/schemas/integrations/ssh_advanced.ncl`\n\n**Purpose**: Advanced SSH operations with pooling, circuit breaker, retry strategies\n\n**Key Types**:\n\n```\npub struct SshConfig { ... }\npub struct SshPool { ... }\npub enum DeploymentStrategy {\n Rolling,\n BlueGreen,\n Canary,\n}\n```\n\n**Nushell Functions**:\n\n```\nssh-pool-connect # Create SSH pool connection\nssh-pool-exec # Execute on SSH pool\nssh-pool-status # Check pool status\nssh-deployment-strategies # List strategies\nssh-retry-config # Configure retry strategy\nssh-circuit-breaker-status # Check circuit breaker\n```\n\n**Features**:\n\n- ✅ Connection pooling (90% faster)\n- ✅ Circuit breaker for fault isolation\n- ✅ Three deployment strategies (rolling, blue-green, canary)\n- ✅ Retry strategies (exponential, linear, fibonacci)\n- ✅ Health check integration\n\n---\n\n### 3. 
Backup System\n\n**Location**: `provisioning/platform/integrations/provisioning-bridge/src/backup.rs`\n**Nushell**: `provisioning/core/nulib/integrations/backup.nu`\n**Nickel Schema**: `provisioning/schemas/integrations/backup.ncl`\n\n**Purpose**: Multi-backend backup with retention policies\n\n**Key Types**:\n\n```\npub enum BackupBackend {\n Restic,\n Borg,\n Tar,\n Rsync,\n Cpio,\n}\n\npub struct BackupJob { ... }\npub struct RetentionPolicy { ... }\npub struct BackupManager { ... }\n```\n\n**Nushell Functions**:\n\n```\nbackup-create # Create backup job\nbackup-restore # Restore from snapshot\nbackup-list # List snapshots\nbackup-schedule # Schedule regular backups\nbackup-retention # Configure retention policy\nbackup-status # Check backup status\n```\n\n**Features**:\n\n- ✅ Multiple backends (Restic, Borg, Tar, Rsync, CPIO)\n- ✅ Flexible repositories (local, S3, SFTP, REST, B2)\n- ✅ Retention policies (daily/weekly/monthly/yearly)\n- ✅ Pre/post backup hooks\n- ✅ Automatic scheduling\n- ✅ Compression support\n\n---\n\n### 4. GitOps Events\n\n**Location**: `provisioning/platform/integrations/provisioning-bridge/src/gitops.rs`\n**Nushell**: `provisioning/core/nulib/integrations/gitops.nu`\n**Nickel Schema**: `provisioning/schemas/integrations/gitops.ncl`\n\n**Purpose**: Event-driven deployments from Git\n\n**Key Types**:\n\n```\npub enum GitProvider {\n GitHub,\n GitLab,\n Gitea,\n}\n\npub struct GitOpsRule { ... }\npub struct GitOpsOrchestrator { ... }\n```\n\n**Nushell Functions**:\n\n```\ngitops-rules # Load rules from config\ngitops-watch # Watch for Git events\ngitops-trigger # Manually trigger deployment\ngitops-event-types # List supported events\ngitops-rule-config # Configure GitOps rule\ngitops-deployments # List active deployments\ngitops-status # Get GitOps status\n```\n\n**Features**:\n\n- ✅ Event-driven automation (push, PR, webhook, scheduled)\n- ✅ Multi-provider support (GitHub, GitLab, Gitea)\n- ✅ Three deployment strategies\n- ✅ Manual approval workflow\n- ✅ Health check triggers\n- ✅ Audit logging\n\n---\n\n### 5. 
Service Management\n\n**Location**: `provisioning/platform/integrations/provisioning-bridge/src/service.rs`\n**Nushell**: `provisioning/core/nulib/integrations/service.nu`\n**Nickel Schema**: `provisioning/schemas/integrations/service.ncl`\n\n**Purpose**: Cross-platform service management (systemd, launchd, runit, OpenRC)\n\n**Nushell Functions**:\n\n```\nservice-install # Install service\nservice-start # Start service\nservice-stop # Stop service\nservice-restart # Restart service\nservice-status # Get service status\nservice-list # List all services\nservice-restart-policy # Configure restart policy\nservice-detect-init # Detect init system\n```\n\n**Features**:\n\n- ✅ Multi-platform support (systemd, launchd, runit, OpenRC)\n- ✅ Service file generation\n- ✅ Restart policies (always, on-failure, no)\n- ✅ Health checks\n- ✅ Logging configuration\n- ✅ Metrics collection\n\n---\n\n## Code Quality Standards\n\nAll implementations follow project standards:\n\n### Rust (`provisioning-bridge`)\n\n- ✅ **Zero unsafe code** - `#![forbid(unsafe_code)]`\n- ✅ **Idiomatic error handling** - `Result` pattern\n- ✅ **Comprehensive docs** - Full rustdoc with examples\n- ✅ **Tests** - Unit and integration tests for each module\n- ✅ **No unwrap()** - Only in tests with comments\n- ✅ **No clippy warnings** - All warnings suppressed\n\n### Nushell\n\n- ✅ **17 Nushell rules** - See Nushell Development Guide\n- ✅ **Explicit types** - Colon notation: `[param: type]: return_type`\n- ✅ **Early return** - Validate inputs immediately\n- ✅ **Single purpose** - Each function does one thing\n- ✅ **Atomic operations** - Succeed or fail completely\n- ✅ **Pure functions** - No hidden side effects\n\n### Nickel\n\n- ✅ **Schema-first** - All configs have schemas\n- ✅ **Explicit types** - Full type annotations\n- ✅ **Direct imports** - No re-exports\n- ✅ **Immutability-first** - Mutable only when needed\n- ✅ **Lazy evaluation** - Efficient computation\n- ✅ **Security defaults** - TLS enabled, secrets referenced\n\n---\n\n## File Structure\n\n```\nprovisioning/\n├── platform/integrations/\n│ └── provisioning-bridge/ # Rust bridge crate\n│ ├── Cargo.toml\n│ └── src/\n│ ├── lib.rs\n│ ├── error.rs # Error types\n│ ├── runtime.rs # Runtime abstraction\n│ ├── ssh.rs # SSH advanced\n│ ├── backup.rs # Backup system\n│ ├── gitops.rs # GitOps events\n│ └── service.rs # Service management\n│\n├── core/nulib/lib_provisioning/\n│ └── integrations/ # Nushell modules\n│ ├── mod.nu # Module root\n│ ├── runtime.nu # Runtime functions\n│ ├── ssh_advanced.nu # SSH functions\n│ ├── backup.nu # Backup functions\n│ ├── gitops.nu # GitOps functions\n│ └── service.nu # Service functions\n│\n└── schemas/integrations/ # Nickel schemas\n ├── main.ncl # Main integration schema\n ├── runtime.ncl # Runtime schema\n ├── ssh_advanced.ncl # SSH schema\n ├── backup.ncl # Backup schema\n ├── gitops.ncl # GitOps schema\n └── service.ncl # Service schema\n```\n\n---\n\n## Usage\n\n### Runtime Abstraction\n\n```\n# Auto-detect available runtime\nlet runtime = (runtime-detect)\n\n# Execute command in detected runtime\nruntime-exec "docker ps" --check\n\n# Adapt compose file\nlet compose_cmd = (runtime-compose "./docker-compose.yml")\n```\n\n### SSH Advanced\n\n```\n# Connect to SSH pool\nlet pool = (ssh-pool-connect "server01.example.com" "root" --port 22)\n\n# Execute distributed command\nlet results = (ssh-pool-exec $hosts "systemctl status provisioning" --strategy parallel)\n\n# Check circuit breaker\nssh-circuit-breaker-status\n```\n\n### Backup 
System\n\n```\n# Schedule regular backups\nbackup-schedule "daily-app-backup" "0 2 * * *" \\n --paths ["/opt/app" "/var/lib/app"] \\n --backend "restic"\n\n# Create one-time backup\nbackup-create "full-backup" ["/home" "/opt"] \\n --backend "restic" \\n --repository "/backups"\n\n# Restore from snapshot\nbackup-restore "snapshot-001" --restore_path "."\n```\n\n### GitOps Events\n\n```\n# Load GitOps rules\nlet rules = (gitops-rules "./gitops-rules.yaml")\n\n# Watch for Git events\ngitops-watch --provider "github" --webhook-port 8080\n\n# Manually trigger deployment\ngitops-trigger "deploy-app" --environment "prod"\n```\n\n### Service Management\n\n```\n# Install service\nservice-install "my-app" "/usr/local/bin/my-app" \\n --user "appuser" \\n --working-dir "/opt/myapp"\n\n# Start service\nservice-start "my-app"\n\n# Check status\nservice-status "my-app"\n\n# Set restart policy\nservice-restart-policy "my-app" --policy "on-failure" --delay-secs 5\n```\n\n---\n\n## Integration Points\n\n### CLI Commands\n\nExisting `provisioning` CLI will gain new command tree:\n\n```\nprovisioning runtime detect|exec|compose|info|list\nprovisioning ssh pool connect|exec|status|strategies\nprovisioning backup create|restore|list|schedule|retention|status\nprovisioning gitops rules|watch|trigger|events|config|deployments|status\nprovisioning service install|start|stop|restart|status|list|policy|detect-init\n```\n\n### Configuration\n\nAll integrations use Nickel schemas from `provisioning/schemas/integrations/`:\n\n```\nlet { IntegrationConfig } = import "provisioning/integrations.ncl" in\n{\n runtime = { ... },\n ssh = { ... },\n backup = { ... },\n gitops = { ... },\n service = { ... },\n}\n```\n\n### Plugins\n\nNushell plugins can be created for performance-critical operations:\n\n```\nprovisioning plugin list\n# [installed]\n# nu_plugin_runtime\n# nu_plugin_ssh_advanced\n# nu_plugin_backup\n# nu_plugin_gitops\n```\n\n---\n\n## Testing\n\n### Rust Tests\n\n```\ncd provisioning/platform/integrations/provisioning-bridge\ncargo test --all\ncargo test -p provisioning-bridge --lib\ncargo test -p provisioning-bridge --doc\n```\n\n### Nushell Tests\n\n```\nnu provisioning/core/nulib/integrations/runtime.nu\nnu provisioning/core/nulib/integrations/ssh_advanced.nu\n```\n\n---\n\n## Performance\n\n| Operation | Performance |\n| ----------- | ------------- |\n| Runtime detection | ~50 ms (cached: ~1 ms) |\n| SSH pool init | ~100 ms per connection |\n| SSH command exec | 90% faster with pooling |\n| Backup initiation | <100 ms |\n| GitOps rule load | <10 ms |\n\n---\n\n## Migration Path\n\nIf you want to fully migrate from provisioning to provctl + prov-ecosystem:\n\n1. **Phase 1**: Use integrations for new features (runtime, backup, gitops)\n2. **Phase 2**: Migrate SSH operations to `provctl-machines`\n3. **Phase 3**: Adopt provctl CLI for machine orchestration\n4. **Phase 4**: Use prov-ecosystem crates directly where beneficial\n\nCurrently we implement **Phase 1** with selective integration.\n\n---\n\n## Next Steps\n\n1. ✅ **Implement**: Integrate bridge into provisioning CLI\n2. ⏳ **Document**: Add to `docs/user/` for end users\n3. ⏳ **Examples**: Create example configurations\n4. ⏳ **Tests**: Integration tests with real providers\n5. 
⏳ **Plugins**: Nushell plugins for performance\n\n---\n\n## References\n\n- **Rust Bridge**: `provisioning/platform/integrations/provisioning-bridge/`\n- **Nushell Integration**: `provisioning/core/nulib/integrations/`\n- **Nickel Schemas**: `provisioning/schemas/integrations/`\n- **Prov-Ecosystem**: `/Users/Akasha/Development/prov-ecosystem/`\n- **Provctl**: `/Users/Akasha/Development/provctl/`\n- **Rust Guidelines**: See Rust Development\n- **Nushell Guidelines**: See Nushell Development\n- **Nickel Guidelines**: See Nickel Module System +# Prov-Ecosystem & Provctl Integration + +**Date**: 2025-11-23 +**Version**: 1.0.0 +**Status**: ✅ Implementation Complete + +## Overview + +This document describes the **hybrid selective integration** of prov-ecosystem and provctl with provisioning, providing access to four critical functionalities: + +1. **Runtime Abstraction** - Unified Docker/Podman/OrbStack/Colima/nerdctl +2. **SSH Advanced** - Pooling, circuit breaker, retry strategies, distributed operations +3. **Backup System** - Multi-backend (Restic, Borg, Tar, Rsync) with retention policies +4. **GitOps Events** - Event-driven deployments from Git + +--- + +## Architecture + +### Three-Layer Integration + +```text +┌─────────────────────────────────────────────┐ +│ Provisioning CLI (provisioning/core/cli/) │ +│ ✅ 80+ command shortcuts │ +│ ✅ Domain-driven architecture │ +│ ✅ Modular CLI commands │ +└─────────────────────────────────────────────┘ + ↓ +┌─────────────────────────────────────────────┐ +│ Nushell Integration Layer │ +│ (provisioning/core/nulib/integrations/) │ +│ ✅ 5 modules with full type safety │ +│ ✅ Follows 17 Nushell guidelines │ +│ ✅ Early return, atomic operations │ +└─────────────────────────────────────────────┘ + ↓ +┌─────────────────────────────────────────────┐ +│ Rust Bridge Crate │ +│ (provisioning/platform/integrations/ │ +│ provisioning-bridge/) │ +│ ✅ Zero unsafe code │ +│ ✅ Idiomatic error handling (Result) │ +│ ✅ 5 modules (runtime, ssh, backup, etc) │ +│ ✅ Comprehensive tests │ +└─────────────────────────────────────────────┘ + ↓ +┌─────────────────────────────────────────────┐ +│ Prov-Ecosystem & Provctl Crates │ +│ (../../prov-ecosystem/ & ../../provctl/) │ +│ ✅ runtime: Container abstraction │ +│ ✅ init-servs: Service management │ +│ ✅ backup: Multi-backend backup │ +│ ✅ gitops: Event-driven automation │ +│ ✅ provctl-machines: SSH advanced │ +└─────────────────────────────────────────────┘ +``` + +--- + +## Components + +### 1. Runtime Abstraction + +**Location**: `provisioning/platform/integrations/provisioning-bridge/src/runtime.rs` +**Nushell**: `provisioning/core/nulib/integrations/runtime.nu` +**Nickel Schema**: `provisioning/schemas/integrations/runtime.ncl` + +**Purpose**: Unified interface for Docker, Podman, OrbStack, Colima, nerdctl + +**Key Types**: + +```text +pub enum ContainerRuntime { + Docker, + Podman, + OrbStack, + Colima, + Nerdctl, +} + +pub struct RuntimeDetector { ... } +pub struct ComposeAdapter { ... } +``` + +**Nushell Functions**: + +```text +runtime-detect # Auto-detect available runtime +runtime-exec # Execute command in detected runtime +runtime-compose # Adapt docker-compose for runtime +runtime-info # Get runtime details +runtime-list # List all available runtimes +``` + +**Benefits**: + +- ✅ Eliminates Docker hardcoding +- ✅ Platform-aware detection +- ✅ Automatic runtime selection +- ✅ Docker Compose adaptation + +--- + +### 2. 
SSH Advanced
+
+**Location**: `provisioning/platform/integrations/provisioning-bridge/src/ssh.rs`
+**Nushell**: `provisioning/core/nulib/integrations/ssh_advanced.nu`
+**Nickel Schema**: `provisioning/schemas/integrations/ssh_advanced.ncl`
+
+**Purpose**: Advanced SSH operations with pooling, circuit breaker, retry strategies
+
+**Key Types**:
+
+```text
+pub struct SshConfig { ... }
+pub struct SshPool { ... }
+pub enum DeploymentStrategy {
+    Rolling,
+    BlueGreen,
+    Canary,
+}
+```
+
+**Nushell Functions**:
+
+```text
+ssh-pool-connect              # Create SSH pool connection
+ssh-pool-exec                 # Execute on SSH pool
+ssh-pool-status               # Check pool status
+ssh-deployment-strategies     # List strategies
+ssh-retry-config              # Configure retry strategy
+ssh-circuit-breaker-status    # Check circuit breaker
+```
+
+**Features**:
+
+- ✅ Connection pooling (90% faster)
+- ✅ Circuit breaker for fault isolation
+- ✅ Three deployment strategies (rolling, blue-green, canary)
+- ✅ Retry strategies (exponential, linear, fibonacci)
+- ✅ Health check integration
+
+---
+
+### 3. Backup System
+
+**Location**: `provisioning/platform/integrations/provisioning-bridge/src/backup.rs`
+**Nushell**: `provisioning/core/nulib/integrations/backup.nu`
+**Nickel Schema**: `provisioning/schemas/integrations/backup.ncl`
+
+**Purpose**: Multi-backend backup with retention policies
+
+**Key Types**:
+
+```text
+pub enum BackupBackend {
+    Restic,
+    Borg,
+    Tar,
+    Rsync,
+    Cpio,
+}
+
+pub struct BackupJob { ... }
+pub struct RetentionPolicy { ... }
+pub struct BackupManager { ... }
+```
+
+**Nushell Functions**:
+
+```text
+backup-create       # Create backup job
+backup-restore      # Restore from snapshot
+backup-list         # List snapshots
+backup-schedule     # Schedule regular backups
+backup-retention    # Configure retention policy
+backup-status       # Check backup status
+```
+
+**Features**:
+
+- ✅ Multiple backends (Restic, Borg, Tar, Rsync, CPIO)
+- ✅ Flexible repositories (local, S3, SFTP, REST, B2)
+- ✅ Retention policies (daily/weekly/monthly/yearly)
+- ✅ Pre/post backup hooks
+- ✅ Automatic scheduling
+- ✅ Compression support
+
+---
+
+### 4. GitOps Events
+
+**Location**: `provisioning/platform/integrations/provisioning-bridge/src/gitops.rs`
+**Nushell**: `provisioning/core/nulib/integrations/gitops.nu`
+**Nickel Schema**: `provisioning/schemas/integrations/gitops.ncl`
+
+**Purpose**: Event-driven deployments from Git
+
+**Key Types**:
+
+```text
+pub enum GitProvider {
+    GitHub,
+    GitLab,
+    Gitea,
+}
+
+pub struct GitOpsRule { ... }
+pub struct GitOpsOrchestrator { ... }
+```
+
+**Nushell Functions**:
+
+```text
+gitops-rules          # Load rules from config
+gitops-watch          # Watch for Git events
+gitops-trigger        # Manually trigger deployment
+gitops-event-types    # List supported events
+gitops-rule-config    # Configure GitOps rule
+gitops-deployments    # List active deployments
+gitops-status         # Get GitOps status
+```
+
+**Features**:
+
+- ✅ Event-driven automation (push, PR, webhook, scheduled)
+- ✅ Multi-provider support (GitHub, GitLab, Gitea)
+- ✅ Three deployment strategies
+- ✅ Manual approval workflow
+- ✅ Health check triggers
+- ✅ Audit logging
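+
+To see how these functions fit together, the sketch below registers a rule and starts watching; the record fields are illustrative assumptions, not the documented schema:
+
+```text
+# Sketch: register a GitOps rule, then watch for matching events.
+# The rule fields (provider, event, branch, strategy) are assumptions.
+gitops-rule-config "deploy-app" {
+    provider: "github",
+    event: "push",
+    branch: "main",
+    strategy: "canary"
+}
+gitops-watch --provider "github" --webhook-port 8080
+```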
+
+---
+
+### 5. Service Management
+
+**Location**: `provisioning/platform/integrations/provisioning-bridge/src/service.rs`
+**Nushell**: `provisioning/core/nulib/integrations/service.nu`
+**Nickel Schema**: `provisioning/schemas/integrations/service.ncl`
+
+**Purpose**: Cross-platform service management (systemd, launchd, runit, OpenRC)
+
+**Nushell Functions**:
+
+```text
+service-install           # Install service
+service-start             # Start service
+service-stop              # Stop service
+service-restart           # Restart service
+service-status            # Get service status
+service-list              # List all services
+service-restart-policy    # Configure restart policy
+service-detect-init       # Detect init system
+```
+
+**Features**:
+
+- ✅ Multi-platform support (systemd, launchd, runit, OpenRC)
+- ✅ Service file generation
+- ✅ Restart policies (always, on-failure, no)
+- ✅ Health checks
+- ✅ Logging configuration
+- ✅ Metrics collection
+
+---
+
+## Code Quality Standards
+
+All implementations follow project standards:
+
+### Rust (`provisioning-bridge`)
+
+- ✅ **Zero unsafe code** - `#![forbid(unsafe_code)]`
+- ✅ **Idiomatic error handling** - `Result` pattern
+- ✅ **Comprehensive docs** - Full rustdoc with examples
+- ✅ **Tests** - Unit and integration tests for each module
+- ✅ **No unwrap()** - Only in tests with comments
+- ✅ **No clippy warnings** - All lints resolved, not suppressed
+
+### Nushell
+
+- ✅ **17 Nushell rules** - See Nushell Development Guide
+- ✅ **Explicit types** - Colon notation: `[param: type]: return_type`
+- ✅ **Early return** - Validate inputs immediately
+- ✅ **Single purpose** - Each function does one thing
+- ✅ **Atomic operations** - Succeed or fail completely
+- ✅ **Pure functions** - No hidden side effects
+
+### Nickel
+
+- ✅ **Schema-first** - All configs have schemas
+- ✅ **Explicit types** - Full type annotations
+- ✅ **Direct imports** - No re-exports
+- ✅ **Immutability-first** - Mutable only when needed
+- ✅ **Lazy evaluation** - Efficient computation
+- ✅ **Security defaults** - TLS enabled, secrets referenced
+
+---
+
+## File Structure
+
+```text
+provisioning/
+├── platform/integrations/
+│   └── provisioning-bridge/        # Rust bridge crate
+│       ├── Cargo.toml
+│       └── src/
+│           ├── lib.rs
+│           ├── error.rs            # Error types
+│           ├── runtime.rs          # Runtime abstraction
+│           ├── ssh.rs              # SSH advanced
+│           ├── backup.rs           # Backup system
+│           ├── gitops.rs           # GitOps events
+│           └── service.rs          # Service management
+│
+├── core/nulib/lib_provisioning/
+│   └── integrations/               # Nushell modules
+│       ├── mod.nu                  # Module root
+│       ├── runtime.nu              # Runtime functions
+│       ├── ssh_advanced.nu         # SSH functions
+│       ├── backup.nu               # Backup functions
+│       ├── gitops.nu               # GitOps functions
+│       └── service.nu              # Service functions
+│
+└── schemas/integrations/           # Nickel schemas
+    ├── main.ncl                    # Main integration schema
+    ├── runtime.ncl                 # Runtime schema
+    ├── ssh_advanced.ncl            # SSH schema
+    ├── backup.ncl                  # Backup schema
+    ├── gitops.ncl                  # GitOps schema
+    └── service.ncl                 # Service schema
+```
+
+---
+
+## Usage
+
+### Runtime Abstraction
+
+```text
+# Auto-detect available runtime
+let runtime = (runtime-detect)
+
+# Execute command in detected runtime
+runtime-exec "docker ps" --check
+
+# Adapt compose file
+let compose_cmd = (runtime-compose "./docker-compose.yml")
+```
+
+### SSH Advanced
+
+```text
+# Connect to SSH pool
+let pool = (ssh-pool-connect "server01.example.com" "root" --port 22)
+
+# Execute distributed command
+let results = (ssh-pool-exec $hosts "systemctl status provisioning" --strategy parallel)
+
+# Check circuit breaker
+ssh-circuit-breaker-status
+```
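+
+A rolling restart can combine the retry and strategy helpers; the sketch below is illustrative, and the `ssh-retry-config` flags are assumptions rather than documented parameters:
+
+```text
+# Sketch: configure retries, then roll a restart across the pool.
+# Flag names on ssh-retry-config are assumed for illustration.
+ssh-retry-config --strategy "exponential" --max-attempts 5
+let results = (ssh-pool-exec $hosts "systemctl restart provisioning" --strategy rolling)
+
+# Verify pool health afterwards
+ssh-pool-status
+```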
+### Backup System
+
+```text
+# Schedule regular backups
+backup-schedule "daily-app-backup" "0 2 * * *" \
+    --paths ["/opt/app" "/var/lib/app"] \
+    --backend "restic"
+
+# Create one-time backup
+backup-create "full-backup" ["/home" "/opt"] \
+    --backend "restic" \
+    --repository "/backups"
+
+# Restore from snapshot
+backup-restore "snapshot-001" --restore_path "."
+```
+
+### GitOps Events
+
+```text
+# Load GitOps rules
+let rules = (gitops-rules "./gitops-rules.yaml")
+
+# Watch for Git events
+gitops-watch --provider "github" --webhook-port 8080
+
+# Manually trigger deployment
+gitops-trigger "deploy-app" --environment "prod"
+```
+
+### Service Management
+
+```text
+# Install service
+service-install "my-app" "/usr/local/bin/my-app" \
+    --user "appuser" \
+    --working-dir "/opt/myapp"
+
+# Start service
+service-start "my-app"
+
+# Check status
+service-status "my-app"
+
+# Set restart policy
+service-restart-policy "my-app" --policy "on-failure" --delay-secs 5
+```
+
+---
+
+## Integration Points
+
+### CLI Commands
+
+The existing `provisioning` CLI will gain a new command tree:
+
+```text
+provisioning runtime detect|exec|compose|info|list
+provisioning ssh pool connect|exec|status|strategies
+provisioning backup create|restore|list|schedule|retention|status
+provisioning gitops rules|watch|trigger|events|config|deployments|status
+provisioning service install|start|stop|restart|status|list|policy|detect-init
+```
+
+### Configuration
+
+All integrations use Nickel schemas from `provisioning/schemas/integrations/`:
+
+```text
+let { IntegrationConfig } = import "provisioning/integrations.ncl" in
+{
+    runtime = { ... },
+    ssh = { ... },
+    backup = { ... },
+    gitops = { ... },
+    service = { ... },
+}
+```
+
+### Plugins
+
+Nushell plugins can be created for performance-critical operations:
+
+```text
+provisioning plugin list
+# [installed]
+# nu_plugin_runtime
+# nu_plugin_ssh_advanced
+# nu_plugin_backup
+# nu_plugin_gitops
+```
+
+---
+
+## Testing
+
+### Rust Tests
+
+```text
+cd provisioning/platform/integrations/provisioning-bridge
+cargo test --all
+cargo test -p provisioning-bridge --lib
+cargo test -p provisioning-bridge --doc
+```
+
+### Nushell Tests
+
+```text
+nu provisioning/core/nulib/integrations/runtime.nu
+nu provisioning/core/nulib/integrations/ssh_advanced.nu
+```
+
+---
+
+## Performance
+
+| Operation | Performance |
+| ----------- | ------------- |
+| Runtime detection | ~50 ms (cached: ~1 ms) |
+| SSH pool init | ~100 ms per connection |
+| SSH command exec | 90% faster with pooling |
+| Backup initiation | <100 ms |
+| GitOps rule load | <10 ms |
+
+---
+
+## Migration Path
+
+If you want to fully migrate from provisioning to provctl + prov-ecosystem:
+
+1. **Phase 1**: Use integrations for new features (runtime, backup, gitops)
+2. **Phase 2**: Migrate SSH operations to `provctl-machines`
+3. **Phase 3**: Adopt provctl CLI for machine orchestration
+4. **Phase 4**: Use prov-ecosystem crates directly where beneficial
+
+We currently implement **Phase 1** with selective integration.
+
+---
+
+## Next Steps
+
+1. ✅ **Implement**: Integrate bridge into provisioning CLI
+2. ⏳ **Document**: Add to `docs/user/` for end users
+3. ⏳ **Examples**: Create example configurations
+4. ⏳ **Tests**: Integration tests with real providers
+5. 
⏳ **Plugins**: Nushell plugins for performance + +--- + +## References + +- **Rust Bridge**: `provisioning/platform/integrations/provisioning-bridge/` +- **Nushell Integration**: `provisioning/core/nulib/integrations/` +- **Nickel Schemas**: `provisioning/schemas/integrations/` +- **Prov-Ecosystem**: `/Users/Akasha/Development/prov-ecosystem/` +- **Provctl**: `/Users/Akasha/Development/provctl/` +- **Rust Guidelines**: See Rust Development +- **Nushell Guidelines**: See Nushell Development +- **Nickel Guidelines**: See Nickel Module System \ No newline at end of file diff --git a/docs/src/architecture/integration-patterns.md b/docs/src/architecture/integration-patterns.md index 81a43dd..c2c12e8 100644 --- a/docs/src/architecture/integration-patterns.md +++ b/docs/src/architecture/integration-patterns.md @@ -1 +1,623 @@ -# Integration Patterns\n\n## Overview\n\nProvisioning implements sophisticated integration patterns to coordinate between its hybrid Rust/Nushell architecture, manage multi-provider\nworkflows, and enable extensible functionality. This document outlines the key integration patterns, their implementations, and best practices.\n\n## Core Integration Patterns\n\n### 1. Hybrid Language Integration\n\n#### Rust-to-Nushell Communication Pattern\n\n**Use Case**: Orchestrator invoking business logic operations\n\n**Implementation**:\n\n```\nuse tokio::process::Command;\nuse serde_json;\n\npub async fn execute_nushell_workflow(\n workflow: &str,\n args: &[String]\n) -> Result {\n let mut cmd = Command::new("nu");\n cmd.arg("-c")\n .arg(format!("use core/nulib/workflows/{}.nu *; {}", workflow, args.join(" ")));\n\n let output = cmd.output().await?;\n let result: WorkflowResult = serde_json::from_slice(&output.stdout)?;\n Ok(result)\n}\n```\n\n**Data Exchange Format**:\n\n```\n{\n "status": "success" | "error" | "partial",\n "result": {\n "operation": "server_create",\n "resources": ["server-001", "server-002"],\n "metadata": { ... }\n },\n "error": null | { "code": "ERR001", "message": "..." },\n "context": { "workflow_id": "wf-123", "step": 2 }\n}\n```\n\n#### Nushell-to-Rust Communication Pattern\n\n**Use Case**: Business logic submitting workflows to orchestrator\n\n**Implementation**:\n\n```\ndef submit-workflow [workflow: record] -> record {\n let payload = $workflow | to json\n\n http post "http://localhost:9090/workflows/submit" {\n headers: { "Content-Type": "application/json" }\n body: $payload\n }\n | from json\n}\n```\n\n**API Contract**:\n\n```\n{\n "workflow_id": "wf-456",\n "name": "multi_cloud_deployment",\n "operations": [...],\n "dependencies": { ... },\n "configuration": { ... }\n}\n```\n\n### 2. 
Provider Abstraction Pattern\n\n#### Standard Provider Interface\n\n**Purpose**: Uniform API across different cloud providers\n\n**Interface Definition**:\n\n```\n# Standard provider interface that all providers must implement\nexport def list-servers [] -> table {\n # Provider-specific implementation\n}\n\nexport def create-server [config: record] -> record {\n # Provider-specific implementation\n}\n\nexport def delete-server [id: string] -> nothing {\n # Provider-specific implementation\n}\n\nexport def get-server [id: string] -> record {\n # Provider-specific implementation\n}\n```\n\n**Configuration Integration**:\n\n```\n[providers.aws]\nregion = "us-west-2"\ncredentials_profile = "default"\ntimeout = 300\n\n[providers.upcloud]\nzone = "de-fra1"\napi_endpoint = "https://api.upcloud.com"\ntimeout = 180\n\n[providers.local]\ndocker_socket = "/var/run/docker.sock"\nnetwork_mode = "bridge"\n```\n\n#### Provider Discovery and Loading\n\n```\ndef load-providers [] -> table {\n let provider_dirs = glob "providers/*/nulib"\n\n $provider_dirs\n | each { |dir|\n let provider_name = $dir | path basename | path dirname | path basename\n let provider_config = get-provider-config $provider_name\n\n {\n name: $provider_name,\n path: $dir,\n config: $provider_config,\n available: (test-provider-connectivity $provider_name)\n }\n }\n}\n```\n\n### 3. Configuration Resolution Pattern\n\n#### Hierarchical Configuration Loading\n\n**Implementation**:\n\n```\ndef resolve-configuration [context: record] -> record {\n let base_config = open config.defaults.toml\n let user_config = if ("config.user.toml" | path exists) {\n open config.user.toml\n } else { {} }\n\n let env_config = if ($env.PROVISIONING_ENV? | is-not-empty) {\n let env_file = $"config.($env.PROVISIONING_ENV).toml"\n if ($env_file | path exists) { open $env_file } else { {} }\n } else { {} }\n\n let merged_config = $base_config\n | merge $user_config\n | merge $env_config\n | merge ($context.runtime_config? | default {})\n\n interpolate-variables $merged_config\n}\n```\n\n#### Variable Interpolation Pattern\n\n```\ndef interpolate-variables [config: record] -> record {\n let interpolations = {\n "{{paths.base}}": ($env.PWD),\n "{{env.HOME}}": ($env.HOME),\n "{{now.date}}": (date now | format date "%Y-%m-%d"),\n "{{git.branch}}": (git branch --show-current | str trim)\n }\n\n $config\n | to json\n | str replace --all "{{paths.base}}" $interpolations."{{paths.base}}"\n | str replace --all "{{env.HOME}}" $interpolations."{{env.HOME}}"\n | str replace --all "{{now.date}}" $interpolations."{{now.date}}"\n | str replace --all "{{git.branch}}" $interpolations."{{git.branch}}"\n | from json\n}\n```\n\n### 4. 
Workflow Orchestration Patterns\n\n#### Dependency Resolution Pattern\n\n**Use Case**: Managing complex workflow dependencies\n\n**Implementation (Rust)**:\n\n```\nuse petgraph::{Graph, Direction};\nuse std::collections::HashMap;\n\npub struct DependencyResolver {\n graph: Graph,\n node_map: HashMap,\n}\n\nimpl DependencyResolver {\n pub fn resolve_execution_order(&self) -> Result, Error> {\n let mut topo = petgraph::algo::toposort(&self.graph, None)\n .map_err(|_| Error::CyclicDependency)?;\n\n Ok(topo.into_iter()\n .map(|idx| self.graph[idx].clone())\n .collect())\n }\n\n pub fn add_dependency(&mut self, from: &str, to: &str) {\n let from_idx = self.get_or_create_node(from);\n let to_idx = self.get_or_create_node(to);\n self.graph.add_edge(from_idx, to_idx, ());\n }\n}\n```\n\n#### Parallel Execution Pattern\n\n```\nuse tokio::task::JoinSet;\nuse futures::stream::{FuturesUnordered, StreamExt};\n\npub async fn execute_parallel_batch(\n operations: Vec,\n parallelism_limit: usize\n) -> Result, Error> {\n let semaphore = tokio::sync::Semaphore::new(parallelism_limit);\n let mut join_set = JoinSet::new();\n\n for operation in operations {\n let permit = semaphore.clone();\n join_set.spawn(async move {\n let _permit = permit.acquire().await?;\n execute_operation(operation).await\n });\n }\n\n let mut results = Vec::new();\n while let Some(result) = join_set.join_next().await {\n results.push(result??);\n }\n\n Ok(results)\n}\n```\n\n### 5. State Management Patterns\n\n#### Checkpoint-Based Recovery Pattern\n\n**Use Case**: Reliable state persistence and recovery\n\n**Implementation**:\n\n```\n#[derive(Serialize, Deserialize)]\npub struct WorkflowCheckpoint {\n pub workflow_id: String,\n pub step: usize,\n pub completed_operations: Vec,\n pub current_state: serde_json::Value,\n pub metadata: HashMap,\n pub timestamp: chrono::DateTime,\n}\n\npub struct CheckpointManager {\n checkpoint_dir: PathBuf,\n}\n\nimpl CheckpointManager {\n pub fn save_checkpoint(&self, checkpoint: &WorkflowCheckpoint) -> Result<(), Error> {\n let checkpoint_file = self.checkpoint_dir\n .join(&checkpoint.workflow_id)\n .with_extension("json");\n\n let checkpoint_data = serde_json::to_string_pretty(checkpoint)?;\n std::fs::write(checkpoint_file, checkpoint_data)?;\n Ok(())\n }\n\n pub fn restore_checkpoint(&self, workflow_id: &str) -> Result, Error> {\n let checkpoint_file = self.checkpoint_dir\n .join(workflow_id)\n .with_extension("json");\n\n if checkpoint_file.exists() {\n let checkpoint_data = std::fs::read_to_string(checkpoint_file)?;\n let checkpoint = serde_json::from_str(&checkpoint_data)?;\n Ok(Some(checkpoint))\n } else {\n Ok(None)\n }\n }\n}\n```\n\n#### Rollback Pattern\n\n```\npub struct RollbackManager {\n rollback_stack: Vec,\n}\n\n#[derive(Clone, Debug)]\npub enum RollbackAction {\n DeleteResource { provider: String, resource_id: String },\n RestoreFile { path: PathBuf, content: String },\n RevertConfiguration { key: String, value: serde_json::Value },\n CustomAction { command: String, args: Vec },\n}\n\nimpl RollbackManager {\n pub async fn execute_rollback(&self) -> Result<(), Error> {\n // Execute rollback actions in reverse order\n for action in self.rollback_stack.iter().rev() {\n match action {\n RollbackAction::DeleteResource { provider, resource_id } => {\n self.delete_resource(provider, resource_id).await?;\n }\n RollbackAction::RestoreFile { path, content } => {\n tokio::fs::write(path, content).await?;\n }\n // ... handle other rollback actions\n }\n }\n Ok(())\n }\n}\n```\n\n### 6. 
Event and Messaging Patterns\n\n#### Event-Driven Architecture Pattern\n\n**Use Case**: Decoupled communication between components\n\n**Event Definition**:\n\n```\n#[derive(Serialize, Deserialize, Clone, Debug)]\npub enum SystemEvent {\n WorkflowStarted { workflow_id: String, name: String },\n WorkflowCompleted { workflow_id: String, result: WorkflowResult },\n WorkflowFailed { workflow_id: String, error: String },\n ResourceCreated { provider: String, resource_type: String, resource_id: String },\n ResourceDeleted { provider: String, resource_type: String, resource_id: String },\n ConfigurationChanged { key: String, old_value: serde_json::Value, new_value: serde_json::Value },\n}\n```\n\n**Event Bus Implementation**:\n\n```\nuse tokio::sync::broadcast;\n\npub struct EventBus {\n sender: broadcast::Sender,\n}\n\nimpl EventBus {\n pub fn new(capacity: usize) -> Self {\n let (sender, _) = broadcast::channel(capacity);\n Self { sender }\n }\n\n pub fn publish(&self, event: SystemEvent) -> Result<(), Error> {\n self.sender.send(event)\n .map_err(|_| Error::EventPublishFailed)?;\n Ok(())\n }\n\n pub fn subscribe(&self) -> broadcast::Receiver {\n self.sender.subscribe()\n }\n}\n```\n\n### 7. Extension Integration Patterns\n\n#### Extension Discovery and Loading\n\n```\ndef discover-extensions [] -> table {\n let extension_dirs = glob "extensions/*/extension.toml"\n\n $extension_dirs\n | each { |manifest_path|\n let extension_dir = $manifest_path | path dirname\n let manifest = open $manifest_path\n\n {\n name: $manifest.extension.name,\n version: $manifest.extension.version,\n type: $manifest.extension.type,\n path: $extension_dir,\n manifest: $manifest,\n valid: (validate-extension $manifest),\n compatible: (check-compatibility $manifest.compatibility)\n }\n }\n | where valid and compatible\n}\n```\n\n#### Extension Interface Pattern\n\n```\n# Standard extension interface\nexport def extension-info [] -> record {\n {\n name: "custom-provider",\n version: "1.0.0",\n type: "provider",\n description: "Custom cloud provider integration",\n entry_points: {\n cli: "nulib/cli.nu",\n provider: "nulib/provider.nu"\n }\n }\n}\n\nexport def extension-validate [] -> bool {\n # Validate extension configuration and dependencies\n true\n}\n\nexport def extension-activate [] -> nothing {\n # Perform extension activation tasks\n}\n\nexport def extension-deactivate [] -> nothing {\n # Perform extension cleanup tasks\n}\n```\n\n### 8. API Design Patterns\n\n#### REST API Standardization\n\n**Base API Structure**:\n\n```\nuse axum::{\n extract::{Path, State},\n response::Json,\n routing::{get, post, delete},\n Router,\n};\n\npub fn create_api_router(state: AppState) -> Router {\n Router::new()\n .route("/health", get(health_check))\n .route("/workflows", get(list_workflows).post(create_workflow))\n .route("/workflows/:id", get(get_workflow).delete(delete_workflow))\n .route("/workflows/:id/status", get(workflow_status))\n .route("/workflows/:id/logs", get(workflow_logs))\n .with_state(state)\n}\n```\n\n**Standard Response Format**:\n\n```\n{\n "status": "success" | "error" | "pending",\n "data": { ... },\n "metadata": {\n "timestamp": "2025-09-26T12:00:00Z",\n "request_id": "req-123",\n "version": "3.1.0"\n },\n "error": null | {\n "code": "ERR001",\n "message": "Human readable error",\n "details": { ... 
}\n }\n}\n```\n\n## Error Handling Patterns\n\n### Structured Error Pattern\n\n```\n#[derive(thiserror::Error, Debug)]\npub enum ProvisioningError {\n #[error("Configuration error: {message}")]\n Configuration { message: String },\n\n #[error("Provider error [{provider}]: {message}")]\n Provider { provider: String, message: String },\n\n #[error("Workflow error [{workflow_id}]: {message}")]\n Workflow { workflow_id: String, message: String },\n\n #[error("Resource error [{resource_type}/{resource_id}]: {message}")]\n Resource { resource_type: String, resource_id: String, message: String },\n}\n```\n\n### Error Recovery Pattern\n\n```\ndef with-retry [operation: closure, max_attempts: int = 3] {\n mut attempts = 0\n mut last_error = null\n\n while $attempts < $max_attempts {\n try {\n return (do $operation)\n } catch { |error|\n $attempts = $attempts + 1\n $last_error = $error\n\n if $attempts < $max_attempts {\n let delay = (2 ** ($attempts - 1)) * 1000 # Exponential backoff\n sleep $"($delay)ms"\n }\n }\n }\n\n error make { msg: $"Operation failed after ($max_attempts) attempts: ($last_error)" }\n}\n```\n\n## Performance Optimization Patterns\n\n### Caching Strategy Pattern\n\n```\nuse std::sync::Arc;\nuse tokio::sync::RwLock;\nuse std::collections::HashMap;\nuse chrono::{DateTime, Utc, Duration};\n\n#[derive(Clone)]\npub struct CacheEntry {\n pub value: T,\n pub expires_at: DateTime,\n}\n\npub struct Cache {\n store: Arc>>>,\n default_ttl: Duration,\n}\n\nimpl Cache {\n pub async fn get(&self, key: &str) -> Option {\n let store = self.store.read().await;\n if let Some(entry) = store.get(key) {\n if entry.expires_at > Utc::now() {\n Some(entry.value.clone())\n } else {\n None\n }\n } else {\n None\n }\n }\n\n pub async fn set(&self, key: String, value: T) {\n let expires_at = Utc::now() + self.default_ttl;\n let entry = CacheEntry { value, expires_at };\n\n let mut store = self.store.write().await;\n store.insert(key, entry);\n }\n}\n```\n\n### Streaming Pattern for Large Data\n\n```\ndef process-large-dataset [source: string] -> nothing {\n # Stream processing instead of loading entire dataset\n open $source\n | lines\n | each { |line|\n # Process line individually\n $line | process-record\n }\n | save output.json\n}\n```\n\n## Testing Integration Patterns\n\n### Integration Test Pattern\n\n```\n#[cfg(test)]\nmod integration_tests {\n use super::*;\n use tokio_test;\n\n #[tokio::test]\n async fn test_workflow_execution() {\n let orchestrator = setup_test_orchestrator().await;\n let workflow = create_test_workflow();\n\n let result = orchestrator.execute_workflow(workflow).await;\n\n assert!(result.is_ok());\n assert_eq!(result.unwrap().status, WorkflowStatus::Completed);\n }\n}\n```\n\nThese integration patterns provide the foundation for the system's sophisticated multi-component architecture, enabling reliable, scalable, and\nmaintainable infrastructure automation. +# Integration Patterns + +## Overview + +Provisioning implements sophisticated integration patterns to coordinate between its hybrid Rust/Nushell architecture, manage multi-provider +workflows, and enable extensible functionality. This document outlines the key integration patterns, their implementations, and best practices. + +## Core Integration Patterns + +### 1. 
Hybrid Language Integration
+
+#### Rust-to-Nushell Communication Pattern
+
+**Use Case**: Orchestrator invoking business logic operations
+
+**Implementation**:
+
+```text
+use tokio::process::Command;
+use serde_json;
+
+pub async fn execute_nushell_workflow(
+    workflow: &str,
+    args: &[String]
+) -> Result<WorkflowResult, Error> {
+    let mut cmd = Command::new("nu");
+    cmd.arg("-c")
+        .arg(format!("use core/nulib/workflows/{}.nu *; {}", workflow, args.join(" ")));
+
+    let output = cmd.output().await?;
+    let result: WorkflowResult = serde_json::from_slice(&output.stdout)?;
+    Ok(result)
+}
+```
+
+**Data Exchange Format**:
+
+```text
+{
+  "status": "success" | "error" | "partial",
+  "result": {
+    "operation": "server_create",
+    "resources": ["server-001", "server-002"],
+    "metadata": { ... }
+  },
+  "error": null | { "code": "ERR001", "message": "..." },
+  "context": { "workflow_id": "wf-123", "step": 2 }
+}
+```
+
+#### Nushell-to-Rust Communication Pattern
+
+**Use Case**: Business logic submitting workflows to orchestrator
+
+**Implementation**:
+
+```text
+def submit-workflow [workflow: record] -> record {
+    let payload = $workflow | to json
+
+    http post "http://localhost:9090/workflows/submit" {
+        headers: { "Content-Type": "application/json" }
+        body: $payload
+    }
+    | from json
+}
+```
+
+**API Contract**:
+
+```text
+{
+  "workflow_id": "wf-456",
+  "name": "multi_cloud_deployment",
+  "operations": [...],
+  "dependencies": { ... },
+  "configuration": { ... }
+}
+```
+
+### 2. Provider Abstraction Pattern
+
+#### Standard Provider Interface
+
+**Purpose**: Uniform API across different cloud providers
+
+**Interface Definition**:
+
+```text
+# Standard provider interface that all providers must implement
+export def list-servers [] -> table {
+    # Provider-specific implementation
+}
+
+export def create-server [config: record] -> record {
+    # Provider-specific implementation
+}
+
+export def delete-server [id: string] -> nothing {
+    # Provider-specific implementation
+}
+
+export def get-server [id: string] -> record {
+    # Provider-specific implementation
+}
+```
+
+**Configuration Integration**:
+
+```text
+[providers.aws]
+region = "us-west-2"
+credentials_profile = "default"
+timeout = 300
+
+[providers.upcloud]
+zone = "de-fra1"
+api_endpoint = "https://api.upcloud.com"
+timeout = 180
+
+[providers.local]
+docker_socket = "/var/run/docker.sock"
+network_mode = "bridge"
+```
+
+#### Provider Discovery and Loading
+
+```text
+def load-providers [] -> table {
+    let provider_dirs = glob "providers/*/nulib"
+
+    $provider_dirs
+    | each { |dir|
+        let provider_name = $dir | path basename | path dirname | path basename
+        let provider_config = get-provider-config $provider_name
+
+        {
+            name: $provider_name,
+            path: $dir,
+            config: $provider_config,
+            available: (test-provider-connectivity $provider_name)
+        }
+    }
+}
+```
+
+### 3. Configuration Resolution Pattern
+
+#### Hierarchical Configuration Loading
+
+**Implementation**:
+
+```text
+def resolve-configuration [context: record] -> record {
+    let base_config = open config.defaults.toml
+    let user_config = if ("config.user.toml" | path exists) {
+        open config.user.toml
+    } else { {} }
+
+    let env_config = if ($env.PROVISIONING_ENV? | is-not-empty) {
+        let env_file = $"config.($env.PROVISIONING_ENV).toml"
+        if ($env_file | path exists) { open $env_file } else { {} }
+    } else { {} }
+
+    let merged_config = $base_config
+    | merge $user_config
+    | merge $env_config
+    | merge ($context.runtime_config? | default {})
+
+    interpolate-variables $merged_config
+}
+```
+
+#### Variable Interpolation Pattern
+
+```text
+def interpolate-variables [config: record] -> record {
+    let interpolations = {
+        "{{paths.base}}": ($env.PWD),
+        "{{env.HOME}}": ($env.HOME),
+        "{{now.date}}": (date now | format date "%Y-%m-%d"),
+        "{{git.branch}}": (git branch --show-current | str trim)
+    }
+
+    $config
+    | to json
+    | str replace --all "{{paths.base}}" $interpolations."{{paths.base}}"
+    | str replace --all "{{env.HOME}}" $interpolations."{{env.HOME}}"
+    | str replace --all "{{now.date}}" $interpolations."{{now.date}}"
+    | str replace --all "{{git.branch}}" $interpolations."{{git.branch}}"
+    | from json
+}
+```
+
+### 4. Workflow Orchestration Patterns
+
+#### Dependency Resolution Pattern
+
+**Use Case**: Managing complex workflow dependencies
+
+**Implementation (Rust)**:
+
+```text
+use petgraph::{Graph, Direction};
+use std::collections::HashMap;
+
+pub struct DependencyResolver {
+    graph: Graph<String, ()>,
+    node_map: HashMap<String, petgraph::graph::NodeIndex>,
+}
+
+impl DependencyResolver {
+    pub fn resolve_execution_order(&self) -> Result<Vec<String>, Error> {
+        let topo = petgraph::algo::toposort(&self.graph, None)
+            .map_err(|_| Error::CyclicDependency)?;
+
+        Ok(topo.into_iter()
+            .map(|idx| self.graph[idx].clone())
+            .collect())
+    }
+
+    pub fn add_dependency(&mut self, from: &str, to: &str) {
+        let from_idx = self.get_or_create_node(from);
+        let to_idx = self.get_or_create_node(to);
+        self.graph.add_edge(from_idx, to_idx, ());
+    }
+}
+```
+
+#### Parallel Execution Pattern
+
+```text
+use std::sync::Arc;
+use tokio::sync::Semaphore;
+use tokio::task::JoinSet;
+
+pub async fn execute_parallel_batch(
+    operations: Vec<Operation>,
+    parallelism_limit: usize
+) -> Result<Vec<OperationResult>, Error> {
+    let semaphore = Arc::new(Semaphore::new(parallelism_limit));
+    let mut join_set = JoinSet::new();
+
+    for operation in operations {
+        let permit = semaphore.clone();
+        join_set.spawn(async move {
+            // Hold an owned permit for the lifetime of the task
+            let _permit = permit.acquire_owned().await?;
+            execute_operation(operation).await
+        });
+    }
+
+    let mut results = Vec::new();
+    while let Some(result) = join_set.join_next().await {
+        results.push(result??);
+    }
+
+    Ok(results)
+}
+```
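+
+The same bounded-parallelism idea can be sketched on the Nushell side with `par-each`; `execute-operation` is a hypothetical helper standing in for real work:
+
+```text
+# Sketch: bounded parallel execution in Nushell. --threads caps
+# concurrency much like the semaphore above. execute-operation is
+# a hypothetical helper, not part of the codebase.
+def execute-batch [operations: list, limit: int = 4] {
+    $operations | par-each --threads $limit { |op| execute-operation $op }
+}
+```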
+### 5. State Management Patterns
+
+#### Checkpoint-Based Recovery Pattern
+
+**Use Case**: Reliable state persistence and recovery
+
+**Implementation**:
+
+```text
+#[derive(Serialize, Deserialize)]
+pub struct WorkflowCheckpoint {
+    pub workflow_id: String,
+    pub step: usize,
+    pub completed_operations: Vec<String>,
+    pub current_state: serde_json::Value,
+    pub metadata: HashMap<String, serde_json::Value>,
+    pub timestamp: chrono::DateTime<chrono::Utc>,
+}
+
+pub struct CheckpointManager {
+    checkpoint_dir: PathBuf,
+}
+
+impl CheckpointManager {
+    pub fn save_checkpoint(&self, checkpoint: &WorkflowCheckpoint) -> Result<(), Error> {
+        let checkpoint_file = self.checkpoint_dir
+            .join(&checkpoint.workflow_id)
+            .with_extension("json");
+
+        let checkpoint_data = serde_json::to_string_pretty(checkpoint)?;
+        std::fs::write(checkpoint_file, checkpoint_data)?;
+        Ok(())
+    }
+
+    pub fn restore_checkpoint(&self, workflow_id: &str) -> Result<Option<WorkflowCheckpoint>, Error> {
+        let checkpoint_file = self.checkpoint_dir
+            .join(workflow_id)
+            .with_extension("json");
+
+        if checkpoint_file.exists() {
+            let checkpoint_data = std::fs::read_to_string(checkpoint_file)?;
+            let checkpoint = serde_json::from_str(&checkpoint_data)?;
+            Ok(Some(checkpoint))
+        } else {
+            Ok(None)
+        }
+    }
+}
+```
+
+#### Rollback Pattern
+
+```text
+pub struct RollbackManager {
+    rollback_stack: Vec<RollbackAction>,
+}
+
+#[derive(Clone, Debug)]
+pub enum RollbackAction {
+    DeleteResource { provider: String, resource_id: String },
+    RestoreFile { path: PathBuf, content: String },
+    RevertConfiguration { key: String, value: serde_json::Value },
+    CustomAction { command: String, args: Vec<String> },
+}
+
+impl RollbackManager {
+    pub async fn execute_rollback(&self) -> Result<(), Error> {
+        // Execute rollback actions in reverse order
+        for action in self.rollback_stack.iter().rev() {
+            match action {
+                RollbackAction::DeleteResource { provider, resource_id } => {
+                    self.delete_resource(provider, resource_id).await?;
+                }
+                RollbackAction::RestoreFile { path, content } => {
+                    tokio::fs::write(path, content).await?;
+                }
+                // ... handle other rollback actions
+            }
+        }
+        Ok(())
+    }
+}
+```
+
+### 6. Event and Messaging Patterns
+
+#### Event-Driven Architecture Pattern
+
+**Use Case**: Decoupled communication between components
+
+**Event Definition**:
+
+```text
+#[derive(Serialize, Deserialize, Clone, Debug)]
+pub enum SystemEvent {
+    WorkflowStarted { workflow_id: String, name: String },
+    WorkflowCompleted { workflow_id: String, result: WorkflowResult },
+    WorkflowFailed { workflow_id: String, error: String },
+    ResourceCreated { provider: String, resource_type: String, resource_id: String },
+    ResourceDeleted { provider: String, resource_type: String, resource_id: String },
+    ConfigurationChanged { key: String, old_value: serde_json::Value, new_value: serde_json::Value },
+}
+```
+
+**Event Bus Implementation**:
+
+```text
+use tokio::sync::broadcast;
+
+pub struct EventBus {
+    sender: broadcast::Sender<SystemEvent>,
+}
+
+impl EventBus {
+    pub fn new(capacity: usize) -> Self {
+        let (sender, _) = broadcast::channel(capacity);
+        Self { sender }
+    }
+
+    pub fn publish(&self, event: SystemEvent) -> Result<(), Error> {
+        self.sender.send(event)
+            .map_err(|_| Error::EventPublishFailed)?;
+        Ok(())
+    }
+
+    pub fn subscribe(&self) -> broadcast::Receiver<SystemEvent> {
+        self.sender.subscribe()
+    }
+}
+```
+
+### 7. Extension Integration Patterns
+
+#### Extension Discovery and Loading
+
+```text
+def discover-extensions [] -> table {
+    let extension_dirs = glob "extensions/*/extension.toml"
+
+    $extension_dirs
+    | each { |manifest_path|
+        let extension_dir = $manifest_path | path dirname
+        let manifest = open $manifest_path
+
+        {
+            name: $manifest.extension.name,
+            version: $manifest.extension.version,
+            type: $manifest.extension.type,
+            path: $extension_dir,
+            manifest: $manifest,
+            valid: (validate-extension $manifest),
+            compatible: (check-compatibility $manifest.compatibility)
+        }
+    }
+    | where valid and compatible
+}
+```
+
+#### Extension Interface Pattern
+
+```text
+# Standard extension interface
+export def extension-info [] -> record {
+    {
+        name: "custom-provider",
+        version: "1.0.0",
+        type: "provider",
+        description: "Custom cloud provider integration",
+        entry_points: {
+            cli: "nulib/cli.nu",
+            provider: "nulib/provider.nu"
+        }
+    }
+}
+
+export def extension-validate [] -> bool {
+    # Validate extension configuration and dependencies
+    true
+}
+
+export def extension-activate [] -> nothing {
+    # Perform extension activation tasks
+}
+
+export def extension-deactivate [] -> nothing {
+    # Perform extension cleanup tasks
+}
+```
+
+### 8. API Design Patterns
+
+#### REST API Standardization
+
+**Base API Structure**:
+
+```text
+use axum::{
+    extract::{Path, State},
+    response::Json,
+    routing::{get, post, delete},
+    Router,
+};
+
+pub fn create_api_router(state: AppState) -> Router {
+    Router::new()
+        .route("/health", get(health_check))
+        .route("/workflows", get(list_workflows).post(create_workflow))
+        .route("/workflows/:id", get(get_workflow).delete(delete_workflow))
+        .route("/workflows/:id/status", get(workflow_status))
+        .route("/workflows/:id/logs", get(workflow_logs))
+        .with_state(state)
+}
+```
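+
+From the Nushell side these endpoints can be exercised with the built-in `http` commands; the host and port below are assumptions reused from the `submit-workflow` example earlier in this document:
+
+```text
+# Sketch: call the routes defined by the router above.
+http get "http://localhost:9090/workflows"
+
+# Status of a single workflow (id taken from the API contract example)
+http get "http://localhost:9090/workflows/wf-456/status"
+```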
+
+**Standard Response Format**:
+
+```text
+{
+  "status": "success" | "error" | "pending",
+  "data": { ... },
+  "metadata": {
+    "timestamp": "2025-09-26T12:00:00Z",
+    "request_id": "req-123",
+    "version": "3.1.0"
+  },
+  "error": null | {
+    "code": "ERR001",
+    "message": "Human readable error",
+    "details": { ... }
+  }
+}
+```
+
+## Error Handling Patterns
+
+### Structured Error Pattern
+
+```text
+#[derive(thiserror::Error, Debug)]
+pub enum ProvisioningError {
+    #[error("Configuration error: {message}")]
+    Configuration { message: String },
+
+    #[error("Provider error [{provider}]: {message}")]
+    Provider { provider: String, message: String },
+
+    #[error("Workflow error [{workflow_id}]: {message}")]
+    Workflow { workflow_id: String, message: String },
+
+    #[error("Resource error [{resource_type}/{resource_id}]: {message}")]
+    Resource { resource_type: String, resource_id: String, message: String },
+}
+```
+
+### Error Recovery Pattern
+
+```text
+def with-retry [operation: closure, max_attempts: int = 3] {
+    mut attempts = 0
+    mut last_error = null
+
+    while $attempts < $max_attempts {
+        try {
+            return (do $operation)
+        } catch { |error|
+            $attempts = $attempts + 1
+            $last_error = $error
+
+            if $attempts < $max_attempts {
+                let delay = (2 ** ($attempts - 1)) * 1000  # Exponential backoff in ms
+                sleep ($delay * 1ms)
+            }
+        }
+    }
+
+    error make { msg: $"Operation failed after ($max_attempts) attempts: ($last_error)" }
+}
+```
+
+## Performance Optimization Patterns
+
+### Caching Strategy Pattern
+
+```text
+use std::sync::Arc;
+use tokio::sync::RwLock;
+use std::collections::HashMap;
+use chrono::{DateTime, Utc, Duration};
+
+#[derive(Clone)]
+pub struct CacheEntry<T> {
+    pub value: T,
+    pub expires_at: DateTime<Utc>,
+}
+
+pub struct Cache<T> {
+    store: Arc<RwLock<HashMap<String, CacheEntry<T>>>>,
+    default_ttl: Duration,
+}
+
+impl<T: Clone> Cache<T> {
+    pub async fn get(&self, key: &str) -> Option<T> {
+        let store = self.store.read().await;
+        if let Some(entry) = store.get(key) {
+            if entry.expires_at > Utc::now() {
+                Some(entry.value.clone())
+            } else {
+                None
+            }
+        } else {
+            None
+        }
+    }
+
+    pub async fn set(&self, key: String, value: T) {
+        let expires_at = Utc::now() + self.default_ttl;
+        let entry = CacheEntry { value, expires_at };
+
+        let mut store = self.store.write().await;
+        store.insert(key, entry);
+    }
+}
+```
+
+### Streaming Pattern for Large Data
+
+```text
+def process-large-dataset [source: string] -> nothing {
+    # Stream processing instead of loading entire dataset
+    open $source
+    | lines
+    | each { |line|
+        # Process line individually
+        $line | process-record
+    }
+    | save output.json
+}
+```
+
+## Testing Integration Patterns
+
+### Integration Test Pattern
+
+```text
+#[cfg(test)]
+mod integration_tests {
+    use super::*;
+    use tokio_test;
+
+    #[tokio::test]
+    async fn test_workflow_execution() {
+        let orchestrator = setup_test_orchestrator().await;
+        let workflow = create_test_workflow();
+
+        let result = orchestrator.execute_workflow(workflow).await;
+
+        assert!(result.is_ok());
+        assert_eq!(result.unwrap().status, WorkflowStatus::Completed);
+    }
+}
+```
+
+These integration patterns provide the foundation for the system's sophisticated multi-component architecture, enabling reliable, scalable, and
+maintainable infrastructure automation.
\ No newline at end of file diff --git a/docs/src/architecture/multi-repo-architecture.md b/docs/src/architecture/multi-repo-architecture.md index b4b4917..6fa4fde 100644 --- a/docs/src/architecture/multi-repo-architecture.md +++ b/docs/src/architecture/multi-repo-architecture.md @@ -1 +1,710 @@ -# Multi-Repository Architecture with OCI Registry Support\n\n**Version**: 1.0.0\n**Date**: 2025-10-06\n**Status**: Implementation Complete\n\n## Overview\n\nThis document describes the multi-repository architecture for the provisioning system, enabling modular development, independent versioning, and\ndistributed extension management through OCI registry integration.\n\n## Architecture Goals\n\n1. **Separation of Concerns**: Core, Extensions, and Platform in separate repositories\n2. **Independent Versioning**: Each component can be versioned and released independently\n3. **Distributed Development**: Multiple teams can work on different repositories\n4. **OCI-Native Distribution**: Extensions distributed as OCI artifacts\n5. **Dependency Management**: Automated dependency resolution across repositories\n6. **Backward Compatibility**: Support legacy monorepo structure during transition\n\n## Repository Structure\n\n### Repository 1: `provisioning-core`\n\n**Purpose**: Core system functionality - CLI, libraries, base schemas\n\n```\nprovisioning-core/\n├── core/\n│ ├── cli/ # Command-line interface\n│ │ ├── provisioning # Main CLI entry point\n│ │ └── module-loader # Dynamic module loader\n│ ├── nulib/ # Core Nushell libraries\n│ │ ├── lib_provisioning/ # Core library modules\n│ │ │ ├── config/ # Configuration management\n│ │ │ ├── oci/ # OCI client integration\n│ │ │ ├── dependencies/ # Dependency resolution\n│ │ │ ├── module/ # Module system\n│ │ │ ├── layer/ # Layer system\n│ │ │ └── workspace/ # Workspace management\n│ │ └── workflows/ # Core workflow system\n│ ├── plugins/ # System plugins\n│ └── scripts/ # Utility scripts\n├── schemas/ # Base Nickel schemas\n│ ├── main.ncl # Main schema entry\n│ ├── lib.ncl # Core library types\n│ ├── settings.ncl # Settings schema\n│ ├── dependencies.ncl # Dependency schemas (with OCI support)\n│ ├── server.ncl # Server schemas\n│ ├── cluster.ncl # Cluster schemas\n│ └── workflows.ncl # Workflow schemas\n├── config/ # Core configuration templates\n├── templates/ # Core templates\n├── tools/ # Build and distribution tools\n│ ├── oci-package.nu # OCI packaging tool\n│ ├── build-core.nu # Core build script\n│ └── release-core.nu # Core release script\n├── tests/ # Core system tests\n└── docs/ # Core documentation\n ├── api/ # API documentation\n ├── architecture/ # Architecture docs\n └── development/ # Development guides\n\n```\n\n**Distribution**:\n\n- Published as OCI artifact: `oci://registry/provisioning-core:v3.5.0`\n- Contains all core functionality needed to run the provisioning system\n- Version format: `v{major}.{minor}.{patch}` (for example, v3.5.0)\n\n**CI/CD**:\n\n- Build on commit to main\n- Publish OCI artifact on git tag (v*)\n- Run integration tests before publishing\n- Update changelog automatically\n\n---\n\n### Repository 2: `provisioning-extensions`\n\n**Purpose**: All provider, taskserv, and cluster extensions\n\n```\nprovisioning-extensions/\n├── providers/\n│ ├── aws/\n│ │ ├── schemas/ # Nickel schemas\n│ │ │ ├── manifest.toml # Nickel dependencies\n│ │ │ ├── aws.ncl # Main provider schema\n│ │ │ ├── defaults_aws.ncl # AWS defaults\n│ │ │ └── server_aws.ncl # AWS server schema\n│ │ ├── scripts/ # Nushell scripts\n│ │ │ └── install.nu # 
Installation script\n│ │ ├── templates/ # Provider templates\n│ │ ├── docs/ # Provider documentation\n│ │ └── manifest.yaml # Extension manifest\n│ ├── upcloud/\n│ │ └── (same structure)\n│ └── local/\n│ └── (same structure)\n├── taskservs/\n│ ├── kubernetes/\n│ │ ├── schemas/\n│ │ │ ├── manifest.toml\n│ │ │ ├── kubernetes.ncl # Main taskserv schema\n│ │ │ ├── version.ncl # Version management\n│ │ │ └── dependencies.ncl # Taskserv dependencies\n│ │ ├── scripts/\n│ │ │ ├── install.nu # Installation script\n│ │ │ ├── check.nu # Health check script\n│ │ │ └── uninstall.nu # Uninstall script\n│ │ ├── templates/ # Config templates\n│ │ ├── docs/ # Taskserv docs\n│ │ ├── tests/ # Taskserv tests\n│ │ └── manifest.yaml # Extension manifest\n│ ├── containerd/\n│ ├── cilium/\n│ ├── postgres/\n│ └── (50+ more taskservs...)\n├── clusters/\n│ ├── buildkit/\n│ │ └── (same structure)\n│ ├── web/\n│ └── (other clusters...)\n├── tools/\n│ ├── extension-builder.nu # Build individual extensions\n│ ├── mass-publish.nu # Publish all extensions\n│ └── validate-extensions.nu # Validate all extensions\n└── docs/\n ├── extension-guide.md # Extension development guide\n └── publishing.md # Publishing guide\n\n```\n\n**Distribution**:\nEach extension published separately as OCI artifact:\n\n- `oci://registry/provisioning-extensions/kubernetes:1.28.0`\n- `oci://registry/provisioning-extensions/aws:2.0.0`\n- `oci://registry/provisioning-extensions/buildkit:0.12.0`\n\n**Extension Manifest** (`manifest.yaml`):\n\n```\nname: kubernetes\ntype: taskserv\nversion: 1.28.0\ndescription: Kubernetes container orchestration platform\nauthor: Provisioning Team\nlicense: MIT\nhomepage: https://kubernetes.io\nrepository: https://gitea.example.com/provisioning-extensions/kubernetes\n\ndependencies:\n containerd: ">=1.7.0"\n etcd: ">=3.5.0"\n\ntags:\n - kubernetes\n - container-orchestration\n - cncf\n\nplatforms:\n - linux/amd64\n - linux/arm64\n\nmin_provisioning_version: "3.0.0"\n```\n\n**CI/CD**:\n\n- Build and publish each extension independently\n- Git tag format: `{extension-type}/{extension-name}/v{version}`\n - Example: `taskservs/kubernetes/v1.28.0`\n- Automated publishing to OCI registry on tag\n- Run extension-specific tests before publishing\n\n---\n\n### Repository 3: `provisioning-platform`\n\n**Purpose**: Platform services (orchestrator, control-center, MCP server, API gateway)\n\n```\nprovisioning-platform/\n├── orchestrator/ # Rust orchestrator service\n│ ├── src/\n│ ├── Cargo.toml\n│ ├── Dockerfile\n│ └── README.md\n├── control-center/ # Web control center\n│ ├── src/\n│ ├── package.json\n│ ├── Dockerfile\n│ └── README.md\n├── mcp-server/ # Model Context Protocol server\n│ ├── src/\n│ ├── Cargo.toml\n│ ├── Dockerfile\n│ └── README.md\n├── api-gateway/ # REST API gateway\n│ ├── src/\n│ ├── Cargo.toml\n│ ├── Dockerfile\n│ └── README.md\n├── docker-compose.yml # Local development stack\n├── kubernetes/ # K8s deployment manifests\n│ ├── orchestrator.yaml\n│ ├── control-center.yaml\n│ ├── mcp-server.yaml\n│ └── api-gateway.yaml\n└── docs/\n ├── deployment.md\n └── api-reference.md\n\n```\n\n**Distribution**:\nStandard Docker images in OCI registry:\n\n- `oci://registry/provisioning-platform/orchestrator:v1.2.0`\n- `oci://registry/provisioning-platform/control-center:v1.2.0`\n- `oci://registry/provisioning-platform/mcp-server:v1.0.0`\n- `oci://registry/provisioning-platform/api-gateway:v1.0.0`\n\n**CI/CD**:\n\n- Build Docker images on commit to main\n- Publish images on git tag (v*)\n- Multi-architecture builds (amd64, 
arm64)\n- Security scanning before publishing\n\n---\n\n## OCI Registry Integration\n\n### Registry Structure\n\n```\nOCI Registry (localhost:5000 or harbor.company.com)\n├── provisioning-core/\n│ ├── v3.5.0 # Core system artifact\n│ ├── v3.4.0\n│ └── latest -> v3.5.0\n├── provisioning-extensions/\n│ ├── kubernetes:1.28.0 # Individual extension artifacts\n│ ├── kubernetes:1.27.0\n│ ├── containerd:1.7.0\n│ ├── aws:2.0.0\n│ ├── upcloud:1.5.0\n│ └── (100+ more extensions)\n└── provisioning-platform/\n ├── orchestrator:v1.2.0 # Platform service images\n ├── control-center:v1.2.0\n ├── mcp-server:v1.0.0\n └── api-gateway:v1.0.0\n\n```\n\n### OCI Artifact Structure\n\nEach extension packaged as OCI artifact:\n\n```\nkubernetes-1.28.0.tar.gz\n├── schemas/ # Nickel schemas\n│ ├── kubernetes.ncl\n│ ├── version.ncl\n│ └── dependencies.ncl\n├── scripts/ # Nushell scripts\n│ ├── install.nu\n│ ├── check.nu\n│ └── uninstall.nu\n├── templates/ # Template files\n│ ├── kubeconfig.j2\n│ └── kubelet-config.yaml.j2\n├── docs/ # Documentation\n│ └── README.md\n├── manifest.yaml # Extension manifest\n└── oci-manifest.json # OCI manifest metadata\n\n```\n\n---\n\n## Dependency Management\n\n### Workspace Configuration\n\n**File**: `workspace/config/provisioning.yaml`\n\n```\n# Core system dependency\ndependencies:\n core:\n source: "oci://harbor.company.com/provisioning-core:v3.5.0"\n # Alternative: source: "gitea://provisioning-core"\n\n # Extensions repository configuration\n extensions:\n source_type: "oci" # oci, gitea, local\n\n # OCI registry configuration\n oci:\n registry: "localhost:5000"\n namespace: "provisioning-extensions"\n tls_enabled: false\n auth_token_path: "~/.provisioning/tokens/oci"\n\n # Loaded extension modules\n modules:\n providers:\n - "oci://localhost:5000/provisioning-extensions/aws:2.0.0"\n - "oci://localhost:5000/provisioning-extensions/upcloud:1.5.0"\n\n taskservs:\n - "oci://localhost:5000/provisioning-extensions/kubernetes:1.28.0"\n - "oci://localhost:5000/provisioning-extensions/containerd:1.7.0"\n - "oci://localhost:5000/provisioning-extensions/cilium:1.14.0"\n\n clusters:\n - "oci://localhost:5000/provisioning-extensions/buildkit:0.12.0"\n\n # Platform services\n platform:\n source_type: "oci"\n\n oci:\n registry: "harbor.company.com"\n namespace: "provisioning-platform"\n\n images:\n orchestrator: "harbor.company.com/provisioning-platform/orchestrator:v1.2.0"\n control_center: "harbor.company.com/provisioning-platform/control-center:v1.2.0"\n\n # OCI registry configuration\n registry:\n type: "oci" # oci, gitea, http\n\n oci:\n endpoint: "localhost:5000"\n namespaces:\n extensions: "provisioning-extensions"\n nickel: "provisioning-nickel"\n platform: "provisioning-platform"\n test: "provisioning-test"\n```\n\n### Dependency Resolution\n\nThe system resolves dependencies in this order:\n\n1. **Parse Configuration**: Read `provisioning.yaml` and extract dependencies\n2. **Resolve Core**: Ensure core system version is compatible\n3. **Resolve Extensions**: For each extension:\n - Check if already installed and version matches\n - Pull from OCI registry if needed\n - Recursively resolve extension dependencies\n4. **Validate Graph**: Check for dependency cycles and conflicts\n5. 
**Install**: Install extensions in topological order\n\n### Dependency Resolution Commands\n\n```\n# Resolve and install all dependencies\nprovisioning dep resolve\n\n# Check for dependency updates\nprovisioning dep check-updates\n\n# Update specific extension\nprovisioning dep update kubernetes\n\n# Validate dependency graph\nprovisioning dep validate\n\n# Show dependency tree\nprovisioning dep tree kubernetes\n```\n\n---\n\n## OCI Client Operations\n\n### CLI Commands\n\n```\n# Pull extension from OCI registry\nprovisioning oci pull kubernetes:1.28.0\n\n# Push extension to OCI registry\nprovisioning oci push ./extensions/kubernetes kubernetes 1.28.0\n\n# List available extensions\nprovisioning oci list --namespace provisioning-extensions\n\n# Search for extensions\nprovisioning oci search kubernetes\n\n# Show extension versions\nprovisioning oci tags kubernetes\n\n# Inspect extension manifest\nprovisioning oci inspect kubernetes:1.28.0\n\n# Login to OCI registry\nprovisioning oci login localhost:5000 --username _token --password-stdin\n\n# Delete extension\nprovisioning oci delete kubernetes:1.28.0\n\n# Copy extension between registries\nprovisioning oci copy \\n localhost:5000/provisioning-extensions/kubernetes:1.28.0 \\n harbor.company.com/provisioning-extensions/kubernetes:1.28.0\n```\n\n### OCI Configuration\n\n```\n# Show OCI configuration\nprovisioning oci config\n\n# Output:\n{\n tool: "oras" # or "crane" or "skopeo"\n registry: "localhost:5000"\n namespace: {\n extensions: "provisioning-extensions"\n platform: "provisioning-platform"\n }\n cache_dir: "~/.provisioning/oci-cache"\n tls_enabled: false\n}\n```\n\n---\n\n## Extension Development Workflow\n\n### 1. Develop Extension\n\n```\n# Create new extension from template\nprovisioning generate extension taskserv redis\n\n# Directory structure created:\n# extensions/taskservs/redis/\n# ├── schemas/\n# │ ├── manifest.toml\n# │ ├── redis.ncl\n# │ ├── version.ncl\n# │ └── dependencies.ncl\n# ├── scripts/\n# │ ├── install.nu\n# │ ├── check.nu\n# │ └── uninstall.nu\n# ├── templates/\n# ├── docs/\n# │ └── README.md\n# ├── tests/\n# └── manifest.yaml\n```\n\n### 2. Test Extension Locally\n\n```\n# Load extension from local path\nprovisioning module load taskserv workspace_dev redis --source local\n\n# Test installation\nprovisioning taskserv create redis --infra test-env --check\n\n# Run extension tests\nprovisioning test extension redis\n```\n\n### 3. Package Extension\n\n```\n# Validate extension structure\nprovisioning oci package validate ./extensions/taskservs/redis\n\n# Package as OCI artifact\nprovisioning oci package ./extensions/taskservs/redis\n\n# Output: redis-1.0.0.tar.gz\n```\n\n### 4. Publish Extension\n\n```\n# Login to registry (one-time)\nprovisioning oci login localhost:5000\n\n# Publish extension\nprovisioning oci push ./extensions/taskservs/redis redis 1.0.0\n\n# Verify publication\nprovisioning oci tags redis\n\n# Output:\n# ┬───────────┬─────────┬───────────────────────────────────────────────────┐\n# │ artifact │ version │ reference │\n# ├───────────┼─────────┼───────────────────────────────────────────────────┤\n# │ redis │ 1.0.0 │ localhost:5000/provisioning-extensions/redis:1.0.0│\n# └───────────┴─────────┴───────────────────────────────────────────────────┘\n```\n\n### 5. 
Use Published Extension\n\n```\n# Add to workspace configuration\n# workspace/config/provisioning.yaml:\n# dependencies:\n# extensions:\n# modules:\n# taskservs:\n# - "oci://localhost:5000/provisioning-extensions/redis:1.0.0"\n\n# Pull and install\nprovisioning dep resolve\n\n# Extension automatically downloaded and installed\n```\n\n---\n\n## Registry Deployment Options\n\n### Local Registry (Solo Development)\n\n**Using Zot (lightweight OCI registry)**:\n\n```\n# Start local OCI registry\nprovisioning oci-registry start\n\n# Configuration:\n# - Endpoint: localhost:5000\n# - Storage: ~/.provisioning/oci-registry/\n# - No authentication by default\n# - TLS disabled (local only)\n\n# Stop registry\nprovisioning oci-registry stop\n\n# Check status\nprovisioning oci-registry status\n```\n\n### Remote Registry (Multi-User/Enterprise)\n\n**Using Harbor**:\n\n```\n# workspace/config/provisioning.yaml\ndependencies:\n registry:\n type: "oci"\n oci:\n endpoint: "https://harbor.company.com"\n namespaces:\n extensions: "provisioning/extensions"\n platform: "provisioning/platform"\n tls_enabled: true\n auth_token_path: "~/.provisioning/tokens/harbor"\n```\n\n**Features**:\n\n- Multi-user authentication\n- Role-based access control (RBAC)\n- Vulnerability scanning\n- Replication across registries\n- Webhook notifications\n- Image signing (cosign/notation)\n\n---\n\n## Migration from Monorepo\n\n### Phase 1: Parallel Structure (Current)\n\n- Monorepo still exists and works\n- OCI distribution layer added on top\n- Extensions can be loaded from local or OCI\n- No breaking changes\n\n### Phase 2: Gradual Migration\n\n```\n# Migrate extensions one by one\nfor ext in (ls provisioning/extensions/taskservs); do\n provisioning oci publish $ext.name\ndone\n\n# Update workspace configurations to use OCI\nprovisioning workspace migrate-to-oci workspace_prod\n```\n\n### Phase 3: Repository Split\n\n1. Create `provisioning-core` repository\n - Extract core/ and schemas/ directories\n - Set up CI/CD for core publishing\n - Publish initial OCI artifact\n\n2. Create `provisioning-extensions` repository\n - Extract extensions/ directory\n - Set up CI/CD for extension publishing\n - Publish all extensions to OCI registry\n\n3. Create `provisioning-platform` repository\n - Extract platform/ directory\n - Set up Docker image builds\n - Publish platform services\n\n4. 
Update workspaces\n - Reconfigure to use OCI dependencies\n - Test multi-repo setup\n - Verify all functionality works\n\n### Phase 4: Deprecate Monorepo\n\n- Archive monorepo\n- Redirect to new repositories\n- Update documentation\n- Announce migration complete\n\n---\n\n## Benefits Summary\n\n### Modularity\n\n✅ Independent repositories for core, extensions, and platform\n✅ Extensions can be developed and versioned separately\n✅ Clear ownership and responsibility boundaries\n\n### Distribution\n\n✅ OCI-native distribution (industry standard)\n✅ Built-in versioning with OCI tags\n✅ Efficient caching with OCI layers\n✅ Works with standard tools (skopeo, crane, oras)\n\n### Security\n\n✅ TLS support for registries\n✅ Authentication and authorization\n✅ Vulnerability scanning (Harbor)\n✅ Image signing (cosign, notation)\n✅ RBAC for access control\n\n### Developer Experience\n\n✅ Simple CLI commands for extension management\n✅ Automatic dependency resolution\n✅ Local testing before publishing\n✅ Easy extension discovery and installation\n\n### Operations\n\n✅ Air-gapped deployments (mirror OCI registry)\n✅ Bandwidth efficient (only download what's needed)\n✅ Version pinning for reproducibility\n✅ Rollback support (use previous versions)\n\n### Ecosystem\n\n✅ Compatible with existing OCI tooling\n✅ Can use public registries (DockerHub, GitHub, etc.)\n✅ Mirror to multiple registries\n✅ Replication for high availability\n\n---\n\n## Implementation Status\n\n| Component | Status | Notes |\n| ----------- | -------- | ------- |\n| **Nickel Schemas** | ✅ Complete | OCI schemas in `dependencies.ncl` |\n| **OCI Client** | ✅ Complete | `oci/client.nu` with skopeo/crane/oras |\n| **OCI Commands** | ✅ Complete | `oci/commands.nu` CLI interface |\n| **Dependency Resolver** | ✅ Complete | `dependencies/resolver.nu` |\n| **OCI Packaging** | ✅ Complete | `tools/oci-package.nu` |\n| **Repository Design** | ✅ Complete | This document |\n| **Migration Plan** | ✅ Complete | Phased approach defined |\n| **Documentation** | ✅ Complete | User guides and API docs |\n| **CI/CD Setup** | ⏳ Pending | Automated publishing pipelines |\n| **Registry Deployment** | ⏳ Pending | Zot/Harbor setup |\n\n---\n\n## Related Documentation\n\n- OCI Packaging Tool - Extension packaging\n- OCI Client Library - OCI operations\n- Dependency Resolver - Dependency management\n- Nickel Schemas - Type definitions\n- [Extension Development Guide](../user/extension-development.md) - How to create extensions\n\n---\n\n**Maintained By**: Architecture Team\n**Review Cycle**: Quarterly\n**Next Review**: 2026-01-06 +# Multi-Repository Architecture with OCI Registry Support + +**Version**: 1.0.0 +**Date**: 2025-10-06 +**Status**: Implementation Complete + +## Overview + +This document describes the multi-repository architecture for the provisioning system, enabling modular development, independent versioning, and +distributed extension management through OCI registry integration. + +## Architecture Goals + +1. **Separation of Concerns**: Core, Extensions, and Platform in separate repositories +2. **Independent Versioning**: Each component can be versioned and released independently +3. **Distributed Development**: Multiple teams can work on different repositories +4. **OCI-Native Distribution**: Extensions distributed as OCI artifacts +5. **Dependency Management**: Automated dependency resolution across repositories +6. 
**Backward Compatibility**: Support legacy monorepo structure during transition + +## Repository Structure + +### Repository 1: `provisioning-core` + +**Purpose**: Core system functionality - CLI, libraries, base schemas + +```text +provisioning-core/ +├── core/ +│ ├── cli/ # Command-line interface +│ │ ├── provisioning # Main CLI entry point +│ │ └── module-loader # Dynamic module loader +│ ├── nulib/ # Core Nushell libraries +│ │ ├── lib_provisioning/ # Core library modules +│ │ │ ├── config/ # Configuration management +│ │ │ ├── oci/ # OCI client integration +│ │ │ ├── dependencies/ # Dependency resolution +│ │ │ ├── module/ # Module system +│ │ │ ├── layer/ # Layer system +│ │ │ └── workspace/ # Workspace management +│ │ └── workflows/ # Core workflow system +│ ├── plugins/ # System plugins +│ └── scripts/ # Utility scripts +├── schemas/ # Base Nickel schemas +│ ├── main.ncl # Main schema entry +│ ├── lib.ncl # Core library types +│ ├── settings.ncl # Settings schema +│ ├── dependencies.ncl # Dependency schemas (with OCI support) +│ ├── server.ncl # Server schemas +│ ├── cluster.ncl # Cluster schemas +│ └── workflows.ncl # Workflow schemas +├── config/ # Core configuration templates +├── templates/ # Core templates +├── tools/ # Build and distribution tools +│ ├── oci-package.nu # OCI packaging tool +│ ├── build-core.nu # Core build script +│ └── release-core.nu # Core release script +├── tests/ # Core system tests +└── docs/ # Core documentation + ├── api/ # API documentation + ├── architecture/ # Architecture docs + └── development/ # Development guides + +``` + +**Distribution**: + +- Published as OCI artifact: `oci://registry/provisioning-core:v3.5.0` +- Contains all core functionality needed to run the provisioning system +- Version format: `v{major}.{minor}.{patch}` (for example, v3.5.0) + +**CI/CD**: + +- Build on commit to main +- Publish OCI artifact on git tag (v*) +- Run integration tests before publishing +- Update changelog automatically + +--- + +### Repository 2: `provisioning-extensions` + +**Purpose**: All provider, taskserv, and cluster extensions + +```text +provisioning-extensions/ +├── providers/ +│ ├── aws/ +│ │ ├── schemas/ # Nickel schemas +│ │ │ ├── manifest.toml # Nickel dependencies +│ │ │ ├── aws.ncl # Main provider schema +│ │ │ ├── defaults_aws.ncl # AWS defaults +│ │ │ └── server_aws.ncl # AWS server schema +│ │ ├── scripts/ # Nushell scripts +│ │ │ └── install.nu # Installation script +│ │ ├── templates/ # Provider templates +│ │ ├── docs/ # Provider documentation +│ │ └── manifest.yaml # Extension manifest +│ ├── upcloud/ +│ │ └── (same structure) +│ └── local/ +│ └── (same structure) +├── taskservs/ +│ ├── kubernetes/ +│ │ ├── schemas/ +│ │ │ ├── manifest.toml +│ │ │ ├── kubernetes.ncl # Main taskserv schema +│ │ │ ├── version.ncl # Version management +│ │ │ └── dependencies.ncl # Taskserv dependencies +│ │ ├── scripts/ +│ │ │ ├── install.nu # Installation script +│ │ │ ├── check.nu # Health check script +│ │ │ └── uninstall.nu # Uninstall script +│ │ ├── templates/ # Config templates +│ │ ├── docs/ # Taskserv docs +│ │ ├── tests/ # Taskserv tests +│ │ └── manifest.yaml # Extension manifest +│ ├── containerd/ +│ ├── cilium/ +│ ├── postgres/ +│ └── (50+ more taskservs...) +├── clusters/ +│ ├── buildkit/ +│ │ └── (same structure) +│ ├── web/ +│ └── (other clusters...) 
+├── tools/ +│ ├── extension-builder.nu # Build individual extensions +│ ├── mass-publish.nu # Publish all extensions +│ └── validate-extensions.nu # Validate all extensions +└── docs/ + ├── extension-guide.md # Extension development guide + └── publishing.md # Publishing guide + +``` + +**Distribution**: +Each extension published separately as OCI artifact: + +- `oci://registry/provisioning-extensions/kubernetes:1.28.0` +- `oci://registry/provisioning-extensions/aws:2.0.0` +- `oci://registry/provisioning-extensions/buildkit:0.12.0` + +**Extension Manifest** (`manifest.yaml`): + +```text +name: kubernetes +type: taskserv +version: 1.28.0 +description: Kubernetes container orchestration platform +author: Provisioning Team +license: MIT +homepage: https://kubernetes.io +repository: https://gitea.example.com/provisioning-extensions/kubernetes + +dependencies: + containerd: ">=1.7.0" + etcd: ">=3.5.0" + +tags: + - kubernetes + - container-orchestration + - cncf + +platforms: + - linux/amd64 + - linux/arm64 + +min_provisioning_version: "3.0.0" +``` + +**CI/CD**: + +- Build and publish each extension independently +- Git tag format: `{extension-type}/{extension-name}/v{version}` + - Example: `taskservs/kubernetes/v1.28.0` +- Automated publishing to OCI registry on tag +- Run extension-specific tests before publishing + +--- + +### Repository 3: `provisioning-platform` + +**Purpose**: Platform services (orchestrator, control-center, MCP server, API gateway) + +```text +provisioning-platform/ +├── orchestrator/ # Rust orchestrator service +│ ├── src/ +│ ├── Cargo.toml +│ ├── Dockerfile +│ └── README.md +├── control-center/ # Web control center +│ ├── src/ +│ ├── package.json +│ ├── Dockerfile +│ └── README.md +├── mcp-server/ # Model Context Protocol server +│ ├── src/ +│ ├── Cargo.toml +│ ├── Dockerfile +│ └── README.md +├── api-gateway/ # REST API gateway +│ ├── src/ +│ ├── Cargo.toml +│ ├── Dockerfile +│ └── README.md +├── docker-compose.yml # Local development stack +├── kubernetes/ # K8s deployment manifests +│ ├── orchestrator.yaml +│ ├── control-center.yaml +│ ├── mcp-server.yaml +│ └── api-gateway.yaml +└── docs/ + ├── deployment.md + └── api-reference.md + +``` + +**Distribution**: +Standard Docker images in OCI registry: + +- `oci://registry/provisioning-platform/orchestrator:v1.2.0` +- `oci://registry/provisioning-platform/control-center:v1.2.0` +- `oci://registry/provisioning-platform/mcp-server:v1.0.0` +- `oci://registry/provisioning-platform/api-gateway:v1.0.0` + +**CI/CD**: + +- Build Docker images on commit to main +- Publish images on git tag (v*) +- Multi-architecture builds (amd64, arm64) +- Security scanning before publishing + +--- + +## OCI Registry Integration + +### Registry Structure + +```text +OCI Registry (localhost:5000 or harbor.company.com) +├── provisioning-core/ +│ ├── v3.5.0 # Core system artifact +│ ├── v3.4.0 +│ └── latest -> v3.5.0 +├── provisioning-extensions/ +│ ├── kubernetes:1.28.0 # Individual extension artifacts +│ ├── kubernetes:1.27.0 +│ ├── containerd:1.7.0 +│ ├── aws:2.0.0 +│ ├── upcloud:1.5.0 +│ └── (100+ more extensions) +└── provisioning-platform/ + ├── orchestrator:v1.2.0 # Platform service images + ├── control-center:v1.2.0 + ├── mcp-server:v1.0.0 + └── api-gateway:v1.0.0 + +``` + +### OCI Artifact Structure + +Each extension packaged as OCI artifact: + +```text +kubernetes-1.28.0.tar.gz +├── schemas/ # Nickel schemas +│ ├── kubernetes.ncl +│ ├── version.ncl +│ └── dependencies.ncl +├── scripts/ # Nushell scripts +│ ├── install.nu +│ ├── check.nu 
+│ └── uninstall.nu +├── templates/ # Template files +│ ├── kubeconfig.j2 +│ └── kubelet-config.yaml.j2 +├── docs/ # Documentation +│ └── README.md +├── manifest.yaml # Extension manifest +└── oci-manifest.json # OCI manifest metadata + +``` + +--- + +## Dependency Management + +### Workspace Configuration + +**File**: `workspace/config/provisioning.yaml` + +```text +# Core system dependency +dependencies: + core: + source: "oci://harbor.company.com/provisioning-core:v3.5.0" + # Alternative: source: "gitea://provisioning-core" + + # Extensions repository configuration + extensions: + source_type: "oci" # oci, gitea, local + + # OCI registry configuration + oci: + registry: "localhost:5000" + namespace: "provisioning-extensions" + tls_enabled: false + auth_token_path: "~/.provisioning/tokens/oci" + + # Loaded extension modules + modules: + providers: + - "oci://localhost:5000/provisioning-extensions/aws:2.0.0" + - "oci://localhost:5000/provisioning-extensions/upcloud:1.5.0" + + taskservs: + - "oci://localhost:5000/provisioning-extensions/kubernetes:1.28.0" + - "oci://localhost:5000/provisioning-extensions/containerd:1.7.0" + - "oci://localhost:5000/provisioning-extensions/cilium:1.14.0" + + clusters: + - "oci://localhost:5000/provisioning-extensions/buildkit:0.12.0" + + # Platform services + platform: + source_type: "oci" + + oci: + registry: "harbor.company.com" + namespace: "provisioning-platform" + + images: + orchestrator: "harbor.company.com/provisioning-platform/orchestrator:v1.2.0" + control_center: "harbor.company.com/provisioning-platform/control-center:v1.2.0" + + # OCI registry configuration + registry: + type: "oci" # oci, gitea, http + + oci: + endpoint: "localhost:5000" + namespaces: + extensions: "provisioning-extensions" + nickel: "provisioning-nickel" + platform: "provisioning-platform" + test: "provisioning-test" +``` + +### Dependency Resolution + +The system resolves dependencies in this order: + +1. **Parse Configuration**: Read `provisioning.yaml` and extract dependencies +2. **Resolve Core**: Ensure core system version is compatible +3. **Resolve Extensions**: For each extension: + - Check if already installed and version matches + - Pull from OCI registry if needed + - Recursively resolve extension dependencies +4. **Validate Graph**: Check for dependency cycles and conflicts +5. 
**Install**: Install extensions in topological order
+
+### Dependency Resolution Commands
+
+```text
+# Resolve and install all dependencies
+provisioning dep resolve
+
+# Check for dependency updates
+provisioning dep check-updates
+
+# Update specific extension
+provisioning dep update kubernetes
+
+# Validate dependency graph
+provisioning dep validate
+
+# Show dependency tree
+provisioning dep tree kubernetes
+```
+
+---
+
+## OCI Client Operations
+
+### CLI Commands
+
+```text
+# Pull extension from OCI registry
+provisioning oci pull kubernetes:1.28.0
+
+# Push extension to OCI registry
+provisioning oci push ./extensions/kubernetes kubernetes 1.28.0
+
+# List available extensions
+provisioning oci list --namespace provisioning-extensions
+
+# Search for extensions
+provisioning oci search kubernetes
+
+# Show extension versions
+provisioning oci tags kubernetes
+
+# Inspect extension manifest
+provisioning oci inspect kubernetes:1.28.0
+
+# Login to OCI registry
+provisioning oci login localhost:5000 --username _token --password-stdin
+
+# Delete extension
+provisioning oci delete kubernetes:1.28.0
+
+# Copy extension between registries
+provisioning oci copy \
+  localhost:5000/provisioning-extensions/kubernetes:1.28.0 \
+  harbor.company.com/provisioning-extensions/kubernetes:1.28.0
+```
+
+### OCI Configuration
+
+```text
+# Show OCI configuration
+provisioning oci config
+
+# Output:
+{
+  tool: "oras"  # or "crane" or "skopeo"
+  registry: "localhost:5000"
+  namespace: {
+    extensions: "provisioning-extensions"
+    platform: "provisioning-platform"
+  }
+  cache_dir: "~/.provisioning/oci-cache"
+  tls_enabled: false
+}
+```
+
+---
+
+## Extension Development Workflow
+
+### 1. Develop Extension
+
+```text
+# Create new extension from template
+provisioning generate extension taskserv redis
+
+# Directory structure created:
+# extensions/taskservs/redis/
+# ├── schemas/
+# │   ├── manifest.toml
+# │   ├── redis.ncl
+# │   ├── version.ncl
+# │   └── dependencies.ncl
+# ├── scripts/
+# │   ├── install.nu
+# │   ├── check.nu
+# │   └── uninstall.nu
+# ├── templates/
+# ├── docs/
+# │   └── README.md
+# ├── tests/
+# └── manifest.yaml
+```
+
+### 2. Test Extension Locally
+
+```text
+# Load extension from local path
+provisioning module load taskserv workspace_dev redis --source local
+
+# Test installation
+provisioning taskserv create redis --infra test-env --check
+
+# Run extension tests
+provisioning test extension redis
+```
+
+### 3. Package Extension
+
+```text
+# Validate extension structure
+provisioning oci package validate ./extensions/taskservs/redis
+
+# Package as OCI artifact
+provisioning oci package ./extensions/taskservs/redis
+
+# Output: redis-1.0.0.tar.gz
+```
+
+### 4. Publish Extension
+
+```text
+# Login to registry (one-time)
+provisioning oci login localhost:5000
+
+# Publish extension
+provisioning oci push ./extensions/taskservs/redis redis 1.0.0
+
+# Verify publication
+provisioning oci tags redis
+
+# Output:
+# ┌───────────┬─────────┬───────────────────────────────────────────────────┐
+# │ artifact  │ version │ reference                                         │
+# ├───────────┼─────────┼───────────────────────────────────────────────────┤
+# │ redis     │ 1.0.0   │ localhost:5000/provisioning-extensions/redis:1.0.0│
+# └───────────┴─────────┴───────────────────────────────────────────────────┘
+```
+
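+Bulk publishing can be scripted with the same CLI. Below is a minimal Nushell
+sketch of a hypothetical helper in the spirit of `tools/mass-publish.nu`; it
+assumes the extension layout and the `manifest.yaml` fields (`name`,
+`version`) shown above, and shells out to the same `provisioning oci push`
+command:
+
+```text
+# Hypothetical helper: publish every taskserv under an extensions root
+def publish-all [extensions_root: string] {
+    ls ($extensions_root | path join "taskservs")
+    | where type == "dir"
+    | each {|ext|
+        # manifest.yaml carries the artifact name and version (see above)
+        let manifest = (open ($ext.name | path join "manifest.yaml"))
+        ^provisioning oci push $ext.name $manifest.name $manifest.version
+    }
+}
+
+# Usage (after `provisioning oci login`):
+# publish-all ./extensions
+```
+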
+### 5. Use Published Extension
+
+```text
+# Add to workspace configuration
+# workspace/config/provisioning.yaml:
+# dependencies:
+#   extensions:
+#     modules:
+#       taskservs:
+#         - "oci://localhost:5000/provisioning-extensions/redis:1.0.0"
+
+# Pull and install
+provisioning dep resolve
+
+# Extension automatically downloaded and installed
+```
+
+---
+
+## Registry Deployment Options
+
+### Local Registry (Solo Development)
+
+**Using Zot (lightweight OCI registry)**:
+
+```text
+# Start local OCI registry
+provisioning oci-registry start
+
+# Configuration:
+# - Endpoint: localhost:5000
+# - Storage: ~/.provisioning/oci-registry/
+# - No authentication by default
+# - TLS disabled (local only)
+
+# Stop registry
+provisioning oci-registry stop
+
+# Check status
+provisioning oci-registry status
+```
+
+### Remote Registry (Multi-User/Enterprise)
+
+**Using Harbor**:
+
+```text
+# workspace/config/provisioning.yaml
+dependencies:
+  registry:
+    type: "oci"
+    oci:
+      endpoint: "https://harbor.company.com"
+      namespaces:
+        extensions: "provisioning/extensions"
+        platform: "provisioning/platform"
+      tls_enabled: true
+      auth_token_path: "~/.provisioning/tokens/harbor"
+```
+
+**Features**:
+
+- Multi-user authentication
+- Role-based access control (RBAC)
+- Vulnerability scanning
+- Replication across registries
+- Webhook notifications
+- Image signing (cosign/notation)
+
+---
+
+## Migration from Monorepo
+
+### Phase 1: Parallel Structure (Current)
+
+- Monorepo still exists and works
+- OCI distribution layer added on top
+- Extensions can be loaded from local or OCI
+- No breaking changes
+
+### Phase 2: Gradual Migration
+
+```text
+# Migrate extensions one by one (Nushell)
+for ext in (ls provisioning/extensions/taskservs) {
+    provisioning oci publish $ext.name
+}
+
+# Update workspace configurations to use OCI
+provisioning workspace migrate-to-oci workspace_prod
+```
+
+### Phase 3: Repository Split
+
+1. Create `provisioning-core` repository
+   - Extract core/ and schemas/ directories
+   - Set up CI/CD for core publishing
+   - Publish initial OCI artifact
+
+2. Create `provisioning-extensions` repository
+   - Extract extensions/ directory
+   - Set up CI/CD for extension publishing
+   - Publish all extensions to OCI registry
+
+3. Create `provisioning-platform` repository
+   - Extract platform/ directory
+   - Set up Docker image builds
+   - Publish platform services
+
+4. 
Update workspaces + - Reconfigure to use OCI dependencies + - Test multi-repo setup + - Verify all functionality works + +### Phase 4: Deprecate Monorepo + +- Archive monorepo +- Redirect to new repositories +- Update documentation +- Announce migration complete + +--- + +## Benefits Summary + +### Modularity + +✅ Independent repositories for core, extensions, and platform +✅ Extensions can be developed and versioned separately +✅ Clear ownership and responsibility boundaries + +### Distribution + +✅ OCI-native distribution (industry standard) +✅ Built-in versioning with OCI tags +✅ Efficient caching with OCI layers +✅ Works with standard tools (skopeo, crane, oras) + +### Security + +✅ TLS support for registries +✅ Authentication and authorization +✅ Vulnerability scanning (Harbor) +✅ Image signing (cosign, notation) +✅ RBAC for access control + +### Developer Experience + +✅ Simple CLI commands for extension management +✅ Automatic dependency resolution +✅ Local testing before publishing +✅ Easy extension discovery and installation + +### Operations + +✅ Air-gapped deployments (mirror OCI registry) +✅ Bandwidth efficient (only download what's needed) +✅ Version pinning for reproducibility +✅ Rollback support (use previous versions) + +### Ecosystem + +✅ Compatible with existing OCI tooling +✅ Can use public registries (DockerHub, GitHub, etc.) +✅ Mirror to multiple registries +✅ Replication for high availability + +--- + +## Implementation Status + +| Component | Status | Notes | +| ----------- | -------- | ------- | +| **Nickel Schemas** | ✅ Complete | OCI schemas in `dependencies.ncl` | +| **OCI Client** | ✅ Complete | `oci/client.nu` with skopeo/crane/oras | +| **OCI Commands** | ✅ Complete | `oci/commands.nu` CLI interface | +| **Dependency Resolver** | ✅ Complete | `dependencies/resolver.nu` | +| **OCI Packaging** | ✅ Complete | `tools/oci-package.nu` | +| **Repository Design** | ✅ Complete | This document | +| **Migration Plan** | ✅ Complete | Phased approach defined | +| **Documentation** | ✅ Complete | User guides and API docs | +| **CI/CD Setup** | ⏳ Pending | Automated publishing pipelines | +| **Registry Deployment** | ⏳ Pending | Zot/Harbor setup | + +--- + +## Related Documentation + +- OCI Packaging Tool - Extension packaging +- OCI Client Library - OCI operations +- Dependency Resolver - Dependency management +- Nickel Schemas - Type definitions +- [Extension Development Guide](../user/extension-development.md) - How to create extensions + +--- + +**Maintained By**: Architecture Team +**Review Cycle**: Quarterly +**Next Review**: 2026-01-06 \ No newline at end of file diff --git a/docs/src/architecture/multi-repo-strategy.md b/docs/src/architecture/multi-repo-strategy.md index ea07325..afd2363 100644 --- a/docs/src/architecture/multi-repo-strategy.md +++ b/docs/src/architecture/multi-repo-strategy.md @@ -1 +1,1025 @@ -# Multi-Repository Strategy Analysis\n\n**Date:** 2025-10-01\n**Status:** Strategic Analysis\n**Related:** [Repository Distribution Analysis](repo-dist-analysis.md)\n\n## Executive Summary\n\nThis document analyzes a **multi-repository strategy** as an alternative to the monorepo approach. 
After careful consideration of the provisioning\nsystem's architecture, a **hybrid approach with 4 core repositories** is recommended, avoiding submodules in favor of a cleaner package-based\ndependency model.\n\n---\n\n## Repository Architecture Options\n\n### Option A: Pure Monorepo (Original Recommendation)\n\n**Single repository:** `provisioning`\n\n**Pros:**\n\n- Simplest development workflow\n- Atomic cross-component changes\n- Single version number\n- One CI/CD pipeline\n\n**Cons:**\n\n- Large repository size\n- Mixed language tooling (Rust + Nushell)\n- All-or-nothing updates\n- Unclear ownership boundaries\n\n### Option B: Multi-Repo with Submodules (❌ Not Recommended)\n\n**Repositories:**\n\n- `provisioning-core` (main, contains submodules)\n- `provisioning-platform` (submodule)\n- `provisioning-extensions` (submodule)\n- `provisioning-workspace` (submodule)\n\n**Why Not Recommended:**\n\n- Submodule hell: complex, error-prone workflows\n- Detached HEAD issues\n- Update synchronization nightmares\n- Clone complexity for users\n- Difficult to maintain version compatibility\n- Poor developer experience\n\n### Option C: Multi-Repo with Package Dependencies (✅ RECOMMENDED)\n\n**Independent repositories with package-based integration:**\n\n- `provisioning-core` - Nushell libraries and Nickel schemas\n- `provisioning-platform` - Rust services (orchestrator, control-center, MCP)\n- `provisioning-extensions` - Extension marketplace/catalog\n- `provisioning-workspace` - Project templates and examples\n- `provisioning-distribution` - Release automation and packaging\n\n**Why Recommended:**\n\n- Clean separation of concerns\n- Independent versioning and release cycles\n- Language-specific tooling and workflows\n- Clear ownership boundaries\n- Package-based dependencies (no submodules)\n- Easier community contributions\n\n---\n\n## Recommended Multi-Repo Architecture\n\n### Repository 1: `provisioning-core`\n\n**Purpose:** Core Nushell infrastructure automation engine\n\n**Contents:**\n\n```\nprovisioning-core/\n├── nulib/ # Nushell libraries\n│ ├── lib_provisioning/ # Core library functions\n│ ├── servers/ # Server management\n│ ├── taskservs/ # Task service management\n│ ├── clusters/ # Cluster management\n│ └── workflows/ # Workflow orchestration\n├── cli/ # CLI entry point\n│ └── provisioning # Pure Nushell CLI\n├── schemas/ # Nickel schemas\n│ ├── main.ncl\n│ ├── settings.ncl\n│ ├── server.ncl\n│ ├── cluster.ncl\n│ └── workflows.ncl\n├── config/ # Default configurations\n│ └── config.defaults.toml\n├── templates/ # Core templates\n├── tools/ # Build and packaging tools\n├── tests/ # Core tests\n├── docs/ # Core documentation\n├── LICENSE\n├── README.md\n├── CHANGELOG.md\n└── version.toml # Core version file\n```\n\n**Technology:** Nushell, Nickel\n**Primary Language:** Nushell\n**Release Frequency:** Monthly (stable)\n**Ownership:** Core team\n**Dependencies:** None (foundation)\n\n**Package Output:**\n\n- `provisioning-core-{version}.tar.gz` - Installable package\n- Published to package registry\n\n**Installation Path:**\n\n```\n/usr/local/\n├── bin/provisioning\n├── lib/provisioning/\n└── share/provisioning/\n```\n\n---\n\n### Repository 2: `provisioning-platform`\n\n**Purpose:** High-performance Rust platform services\n\n**Contents:**\n\n```\nprovisioning-platform/\n├── orchestrator/ # Rust orchestrator\n│ ├── src/\n│ ├── tests/\n│ ├── benches/\n│ └── Cargo.toml\n├── control-center/ # Web control center (Leptos)\n│ ├── src/\n│ ├── tests/\n│ └── Cargo.toml\n├── mcp-server/ # Model 
Context Protocol server\n│ ├── src/\n│ ├── tests/\n│ └── Cargo.toml\n├── api-gateway/ # REST API gateway\n│ ├── src/\n│ ├── tests/\n│ └── Cargo.toml\n├── shared/ # Shared Rust libraries\n│ ├── types/\n│ └── utils/\n├── docs/ # Platform documentation\n├── Cargo.toml # Workspace root\n├── Cargo.lock\n├── LICENSE\n├── README.md\n└── CHANGELOG.md\n```\n\n**Technology:** Rust, WebAssembly\n**Primary Language:** Rust\n**Release Frequency:** Bi-weekly (fast iteration)\n**Ownership:** Platform team\n**Dependencies:**\n\n- `provisioning-core` (runtime integration, loose coupling)\n\n**Package Output:**\n\n- `provisioning-platform-{version}.tar.gz` - Binaries\n- Binaries for: Linux (x86_64, arm64), macOS (x86_64, arm64)\n\n**Installation Path:**\n\n```\n/usr/local/\n├── bin/\n│ ├── provisioning-orchestrator\n│ └── provisioning-control-center\n└── share/provisioning/platform/\n```\n\n**Integration with Core:**\n\n- Platform services call `provisioning` CLI via subprocess\n- No direct code dependencies\n- Communication via REST API and file-based queues\n- Core and Platform can be deployed independently\n\n---\n\n### Repository 3: `provisioning-extensions`\n\n**Purpose:** Extension marketplace and community modules\n\n**Contents:**\n\n```\nprovisioning-extensions/\n├── registry/ # Extension registry\n│ ├── index.json # Searchable index\n│ └── catalog/ # Extension metadata\n├── providers/ # Additional cloud providers\n│ ├── azure/\n│ ├── gcp/\n│ ├── digitalocean/\n│ └── hetzner/\n├── taskservs/ # Community task services\n│ ├── databases/\n│ │ ├── mongodb/\n│ │ ├── redis/\n│ │ └── cassandra/\n│ ├── development/\n│ │ ├── gitlab/\n│ │ ├── jenkins/\n│ │ └── sonarqube/\n│ └── observability/\n│ ├── prometheus/\n│ ├── grafana/\n│ └── loki/\n├── clusters/ # Cluster templates\n│ ├── ml-platform/\n│ ├── data-pipeline/\n│ └── gaming-backend/\n├── workflows/ # Workflow templates\n├── tools/ # Extension development tools\n├── docs/ # Extension development guide\n├── LICENSE\n└── README.md\n```\n\n**Technology:** Nushell, Nickel\n**Primary Language:** Nushell\n**Release Frequency:** Continuous (per-extension)\n**Ownership:** Community + Core team\n**Dependencies:**\n\n- `provisioning-core` (extends core functionality)\n\n**Package Output:**\n\n- Individual extension packages: `provisioning-ext-{name}-{version}.tar.gz`\n- Registry index for discovery\n\n**Installation:**\n\n```\n# Install extension via core CLI\nprovisioning extension install mongodb\nprovisioning extension install azure-provider\n```\n\n**Extension Structure:**\nEach extension is self-contained:\n\n```\nmongodb/\n├── manifest.toml # Extension metadata\n├── taskserv.nu # Implementation\n├── templates/ # Templates\n├── schemas/ # Nickel schemas\n├── tests/ # Tests\n└── README.md\n```\n\n---\n\n### Repository 4: `provisioning-workspace`\n\n**Purpose:** Project templates and starter kits\n\n**Contents:**\n\n```\nprovisioning-workspace/\n├── templates/ # Workspace templates\n│ ├── minimal/ # Minimal starter\n│ ├── kubernetes/ # Full K8s cluster\n│ ├── multi-cloud/ # Multi-cloud setup\n│ ├── microservices/ # Microservices platform\n│ ├── data-platform/ # Data engineering\n│ └── ml-ops/ # MLOps platform\n├── examples/ # Complete examples\n│ ├── blog-deployment/\n│ ├── e-commerce/\n│ └── saas-platform/\n├── blueprints/ # Architecture blueprints\n├── docs/ # Template documentation\n├── tools/ # Template scaffolding\n│ └── create-workspace.nu\n├── LICENSE\n└── README.md\n```\n\n**Technology:** Configuration files, Nickel\n**Primary Language:** TOML, Nickel, 
YAML\n**Release Frequency:** Quarterly (stable templates)\n**Ownership:** Community + Documentation team\n**Dependencies:**\n\n- `provisioning-core` (templates use core)\n- `provisioning-extensions` (may reference extensions)\n\n**Package Output:**\n\n- `provisioning-templates-{version}.tar.gz`\n\n**Usage:**\n\n```\n# Create workspace from template\nprovisioning workspace init my-project --template kubernetes\n\n# Or use separate tool\ngh repo create my-project --template provisioning-workspace\ncd my-project\nprovisioning workspace init\n```\n\n---\n\n### Repository 5: `provisioning-distribution`\n\n**Purpose:** Release automation, packaging, and distribution infrastructure\n\n**Contents:**\n\n```\nprovisioning-distribution/\n├── release-automation/ # Automated release workflows\n│ ├── build-all.nu # Build all packages\n│ ├── publish.nu # Publish to registries\n│ └── validate.nu # Validation suite\n├── installers/ # Installation scripts\n│ ├── install.nu # Nushell installer\n│ ├── install.sh # Bash installer\n│ └── install.ps1 # PowerShell installer\n├── packaging/ # Package builders\n│ ├── core/\n│ ├── platform/\n│ └── extensions/\n├── registry/ # Package registry backend\n│ ├── api/ # Registry REST API\n│ └── storage/ # Package storage\n├── ci-cd/ # CI/CD configurations\n│ ├── github/ # GitHub Actions\n│ ├── gitlab/ # GitLab CI\n│ └── jenkins/ # Jenkins pipelines\n├── version-management/ # Cross-repo version coordination\n│ ├── versions.toml # Version matrix\n│ └── compatibility.toml # Compatibility matrix\n├── docs/ # Distribution documentation\n│ ├── release-process.md\n│ └── packaging-guide.md\n├── LICENSE\n└── README.md\n```\n\n**Technology:** Nushell, Bash, CI/CD\n**Primary Language:** Nushell, YAML\n**Release Frequency:** As needed\n**Ownership:** Release engineering team\n**Dependencies:** All repositories (orchestrates releases)\n\n**Responsibilities:**\n\n- Build packages from all repositories\n- Coordinate multi-repo releases\n- Publish to package registries\n- Manage version compatibility\n- Generate release notes\n- Host package registry\n\n---\n\n## Dependency and Integration Model\n\n### Package-Based Dependencies (Not Submodules)\n\n```\n┌─────────────────────────────────────────────────────────────┐\n│ provisioning-distribution │\n│ (Release orchestration & registry) │\n└──────────────────────────┬──────────────────────────────────┘\n │ publishes packages\n ↓\n ┌──────────────┐\n │ Registry │\n └──────┬───────┘\n │\n ┌──────────────────┼──────────────────┐\n ↓ ↓ ↓\n┌───────────────┐ ┌──────────────┐ ┌──────────────┐\n│ provisioning │ │ provisioning │ │ provisioning │\n│ -core │ │ -platform │ │ -extensions │\n└───────┬───────┘ └──────┬───────┘ └──────┬───────┘\n │ │ │\n │ │ depends on │ extends\n │ └─────────┐ │\n │ ↓ │\n └───────────────────────────────────→┘\n runtime integration\n```\n\n### Integration Mechanisms\n\n#### 1. **Core ↔ Platform Integration**\n\n**Method:** Loose coupling via CLI + REST API\n\n```\n# Platform calls Core CLI (subprocess)\ndef create-server [name: string] {\n # Orchestrator executes Core CLI\n ^provisioning server create $name --infra production\n}\n\n# Core calls Platform API (HTTP)\ndef submit-workflow [workflow: record] {\n http post http://localhost:9090/workflows/submit $workflow\n}\n```\n\n**Version Compatibility:**\n\n```\n# platform/Cargo.toml\n[package.metadata.provisioning]\ncore-version = "^3.0" # Compatible with core 3.x\n```\n\n#### 2. 
**Core ↔ Extensions Integration**\n\n**Method:** Plugin/module system\n\n```\n# Extension manifest\n# extensions/mongodb/manifest.toml\n[extension]\nname = "mongodb"\nversion = "1.0.0"\ntype = "taskserv"\ncore-version = "^3.0"\n\n[dependencies]\nprovisioning-core = "^3.0"\n\n# Extension installation\n# Core downloads and validates extension\nprovisioning extension install mongodb\n# → Downloads from registry\n# → Validates compatibility\n# → Installs to ~/.provisioning/extensions/mongodb\n```\n\n#### 3. **Workspace Templates**\n\n**Method:** Git templates or package templates\n\n```\n# Option 1: GitHub template repository\ngh repo create my-infra --template provisioning-workspace\ncd my-infra\nprovisioning workspace init\n\n# Option 2: Template package\nprovisioning workspace create my-infra --template kubernetes\n# → Downloads template package\n# → Scaffolds workspace\n# → Initializes configuration\n```\n\n---\n\n## Version Management Strategy\n\n### Semantic Versioning Per Repository\n\nEach repository maintains independent semantic versioning:\n\n```\nprovisioning-core: 3.2.1\nprovisioning-platform: 2.5.3\nprovisioning-extensions: (per-extension versioning)\nprovisioning-workspace: 1.4.0\n```\n\n### Compatibility Matrix\n\n**`provisioning-distribution/version-management/versions.toml`:**\n\n```\n# Version compatibility matrix\n[compatibility]\n\n# Core versions and compatible platform versions\n[compatibility.core]\n"3.2.1" = { platform = "^2.5", extensions = "^1.0", workspace = "^1.0" }\n"3.2.0" = { platform = "^2.4", extensions = "^1.0", workspace = "^1.0" }\n"3.1.0" = { platform = "^2.3", extensions = "^0.9", workspace = "^1.0" }\n\n# Platform versions and compatible core versions\n[compatibility.platform]\n"2.5.3" = { core = "^3.2", min-core = "3.2.0" }\n"2.5.0" = { core = "^3.1", min-core = "3.1.0" }\n\n# Release bundles (tested combinations)\n[bundles]\n\n[bundles.stable-3.2]\nname = "Stable 3.2 Bundle"\nrelease-date = "2025-10-15"\ncore = "3.2.1"\nplatform = "2.5.3"\nextensions = ["mongodb@1.2.0", "redis@1.1.0", "azure@2.0.0"]\nworkspace = "1.4.0"\n\n[bundles.lts-3.1]\nname = "LTS 3.1 Bundle"\nrelease-date = "2025-09-01"\nlts-until = "2026-09-01"\ncore = "3.1.5"\nplatform = "2.4.8"\nworkspace = "1.3.0"\n```\n\n### Release Coordination\n\n**Coordinated releases** for major versions:\n\n```\n# Major release: All repos release together\nprovisioning-core: 3.0.0\nprovisioning-platform: 2.0.0\nprovisioning-workspace: 1.0.0\n\n# Minor/patch releases: Independent\nprovisioning-core: 3.1.0 (adds features, platform stays 2.0.x)\nprovisioning-platform: 2.1.0 (improves orchestrator, core stays 3.1.x)\n```\n\n---\n\n## Development Workflow\n\n### Working on Single Repository\n\n```\n# Developer working on core only\ngit clone https://github.com/yourorg/provisioning-core\ncd provisioning-core\n\n# Install dependencies\njust install-deps\n\n# Development\njust dev-check\njust test\n\n# Build package\njust build\n\n# Test installation locally\njust install-dev\n```\n\n### Working Across Repositories\n\n```\n# Scenario: Adding new feature requiring core + platform changes\n\n# 1. Clone both repositories\ngit clone https://github.com/yourorg/provisioning-core\ngit clone https://github.com/yourorg/provisioning-platform\n\n# 2. Create feature branches\ncd provisioning-core\ngit checkout -b feat/batch-workflow-v2\n\ncd ../provisioning-platform\ngit checkout -b feat/batch-workflow-v2\n\n# 3. 
Develop with local linking\ncd provisioning-core\njust install-dev # Installs to /usr/local/bin/provisioning\n\ncd ../provisioning-platform\n# Platform uses system provisioning CLI (local dev version)\ncargo run\n\n# 4. Test integration\ncd ../provisioning-core\njust test-integration\n\ncd ../provisioning-platform\ncargo test\n\n# 5. Create PRs in both repositories\n# PR #123 in provisioning-core\n# PR #456 in provisioning-platform (references core PR)\n\n# 6. Coordinate merge\n# Merge core PR first, cut release 3.3.0\n# Update platform dependency to core 3.3.0\n# Merge platform PR, cut release 2.6.0\n```\n\n### Testing Cross-Repo Integration\n\n```\n# Integration tests in provisioning-distribution\ncd provisioning-distribution\n\n# Test specific version combination\njust test-integration \\n --core 3.3.0 \\n --platform 2.6.0\n\n# Test bundle\njust test-bundle stable-3.3\n```\n\n---\n\n## Distribution Strategy\n\n### Individual Repository Releases\n\nEach repository releases independently:\n\n```\n# Core release\ncd provisioning-core\ngit tag v3.2.1\ngit push --tags\n# → GitHub Actions builds package\n# → Publishes to package registry\n\n# Platform release\ncd provisioning-platform\ngit tag v2.5.3\ngit push --tags\n# → GitHub Actions builds binaries\n# → Publishes to package registry\n```\n\n### Bundle Releases (Coordinated)\n\nDistribution repository creates tested bundles:\n\n```\ncd provisioning-distribution\n\n# Create bundle\njust create-bundle stable-3.2 \\n --core 3.2.1 \\n --platform 2.5.3 \\n --workspace 1.4.0\n\n# Test bundle\njust test-bundle stable-3.2\n\n# Publish bundle\njust publish-bundle stable-3.2\n# → Creates meta-package with all components\n# → Publishes bundle to registry\n# → Updates documentation\n```\n\n### User Installation Options\n\n#### Option 1: Bundle Installation (Recommended for Users)\n\n```\n# Install stable bundle (easiest)\ncurl -fsSL https://get.provisioning.io | sh\n\n# Installs:\n# - provisioning-core 3.2.1\n# - provisioning-platform 2.5.3\n# - provisioning-workspace 1.4.0\n```\n\n#### Option 2: Individual Component Installation\n\n```\n# Install only core (minimal)\ncurl -fsSL https://get.provisioning.io/core | sh\n\n# Add platform later\nprovisioning install platform\n\n# Add extensions\nprovisioning extension install mongodb\n```\n\n#### Option 3: Custom Combination\n\n```\n# Install specific versions\nprovisioning install core@3.1.0\nprovisioning install platform@2.4.0\n```\n\n---\n\n## Repository Ownership and Contribution Model\n\n### Core Team Ownership\n\n| Repository | Primary Owner | Contribution Model |\n| ------------ | --------------- | ------------------- |\n| `provisioning-core` | Core Team | Strict review, stable API |\n| `provisioning-platform` | Platform Team | Fast iteration, performance focus |\n| `provisioning-extensions` | Community + Core | Open contributions, moderated |\n| `provisioning-workspace` | Docs Team | Template contributions welcome |\n| `provisioning-distribution` | Release Engineering | Core team only |\n\n### Contribution Workflow\n\n**For Core:**\n\n1. Create issue in `provisioning-core`\n2. Discuss design\n3. Submit PR with tests\n4. Strict code review\n5. Merge to `main`\n6. Release when ready\n\n**For Extensions:**\n\n1. Create extension in `provisioning-extensions`\n2. Follow extension guidelines\n3. Submit PR\n4. Community review\n5. Merge and publish to registry\n6. Independent versioning\n\n**For Platform:**\n\n1. Create issue in `provisioning-platform`\n2. Implement with benchmarks\n3. Submit PR\n4. 
Performance review\n5. Merge and release\n\n---\n\n## CI/CD Strategy\n\n### Per-Repository CI/CD\n\n**Core CI (`provisioning-core/.github/workflows/ci.yml`):**\n\n```\nname: Core CI\n\non: [push, pull_request]\n\njobs:\n test:\n runs-on: ubuntu-latest\n steps:\n - uses: actions/checkout@v3\n - name: Install Nushell\n run: cargo install nu\n - name: Run tests\n run: just test\n - name: Validate Nickel schemas\n run: just validate-nickel\n\n package:\n runs-on: ubuntu-latest\n if: startsWith(github.ref, 'refs/tags/v')\n steps:\n - uses: actions/checkout@v3\n - name: Build package\n run: just build\n - name: Publish to registry\n run: just publish\n env:\n REGISTRY_TOKEN: ${{ secrets.REGISTRY_TOKEN }}\n```\n\n**Platform CI (`provisioning-platform/.github/workflows/ci.yml`):**\n\n```\nname: Platform CI\n\non: [push, pull_request]\n\njobs:\n test:\n strategy:\n matrix:\n os: [ubuntu-latest, macos-latest]\n runs-on: ${{ matrix.os }}\n steps:\n - uses: actions/checkout@v3\n - name: Build\n run: cargo build --release\n - name: Test\n run: cargo test --workspace\n - name: Benchmark\n run: cargo bench\n\n cross-compile:\n runs-on: ubuntu-latest\n if: startsWith(github.ref, 'refs/tags/v')\n steps:\n - uses: actions/checkout@v3\n - name: Build for Linux x86_64\n run: cargo build --release --target x86_64-unknown-linux-gnu\n - name: Build for Linux arm64\n run: cargo build --release --target aarch64-unknown-linux-gnu\n - name: Publish binaries\n run: just publish-binaries\n```\n\n### Integration Testing (Distribution Repo)\n\n**Distribution CI (`provisioning-distribution/.github/workflows/integration.yml`):**\n\n```\nname: Integration Tests\n\non:\n schedule:\n - cron: '0 0 * * *' # Daily\n workflow_dispatch:\n\njobs:\n test-bundle:\n runs-on: ubuntu-latest\n steps:\n - uses: actions/checkout@v3\n\n - name: Install bundle\n run: |\n nu release-automation/install-bundle.nu stable-3.2\n\n - name: Run integration tests\n run: |\n nu tests/integration/test-all.nu\n\n - name: Test upgrade path\n run: |\n nu tests/integration/test-upgrade.nu 3.1.0 3.2.1\n```\n\n---\n\n## File and Directory Structure Comparison\n\n### Monorepo Structure\n\n```\nprovisioning/ (One repo, ~500 MB)\n├── core/ (Nushell)\n├── platform/ (Rust)\n├── extensions/ (Community)\n├── workspace/ (Templates)\n└── distribution/ (Build)\n```\n\n### Multi-Repo Structure\n\n```\nprovisioning-core/ (Repo 1, ~50 MB)\n├── nulib/\n├── cli/\n├── schemas/\n└── tools/\n\nprovisioning-platform/ (Repo 2, ~150 MB with target/)\n├── orchestrator/\n├── control-center/\n├── mcp-server/\n└── Cargo.toml\n\nprovisioning-extensions/ (Repo 3, ~100 MB)\n├── registry/\n├── providers/\n├── taskservs/\n└── clusters/\n\nprovisioning-workspace/ (Repo 4, ~20 MB)\n├── templates/\n├── examples/\n└── blueprints/\n\nprovisioning-distribution/ (Repo 5, ~30 MB)\n├── release-automation/\n├── installers/\n├── packaging/\n└── registry/\n```\n\n---\n\n## Decision Matrix\n\n| Criterion | Monorepo | Multi-Repo |\n| ----------- | ---------- | ------------ |\n| **Development Complexity** | Simple | Moderate |\n| **Clone Size** | Large (~500 MB) | Small (50-150 MB each) |\n| **Cross-Component Changes** | Easy (atomic) | Moderate (coordinated) |\n| **Independent Releases** | Difficult | Easy |\n| **Language-Specific Tooling** | Mixed | Clean |\n| **Community Contributions** | Harder (big repo) | Easier (focused repos) |\n| **Version Management** | Simple (one version) | Complex (matrix) |\n| **CI/CD Complexity** | Simple (one pipeline) | Moderate (multiple) |\n| **Ownership Clarity** | 
Unclear | Clear |\n| **Extension Ecosystem** | Monolithic | Modular |\n| **Build Time** | Long (build all) | Short (build one) |\n| **Testing Isolation** | Difficult | Easy |\n\n---\n\n## Recommended Approach: Multi-Repo\n\n### Why Multi-Repo Wins for This Project\n\n1. **Clear Separation of Concerns**\n - Nushell core vs Rust platform are different domains\n - Different teams can own different repos\n - Different release cadences make sense\n\n2. **Language-Specific Tooling**\n - `provisioning-core`: Nushell-focused, simple testing\n - `provisioning-platform`: Rust workspace, Cargo tooling\n - No mixed tooling confusion\n\n3. **Community Contributions**\n - Extensions repo is easier to contribute to\n - Don't need to clone entire monorepo\n - Clearer contribution guidelines per repo\n\n4. **Independent Versioning**\n - Core can stay stable (3.x for months)\n - Platform can iterate fast (2.x weekly)\n - Extensions have own lifecycles\n\n5. **Build Performance**\n - Only build what changed\n - Faster CI/CD per repo\n - Parallel builds across repos\n\n6. **Extension Ecosystem**\n - Extensions repo becomes marketplace\n - Third-party extensions can live separately\n - Registry becomes discovery mechanism\n\n### Implementation Strategy\n\n**Phase 1: Split Repositories (Week 1-2)**\n\n1. Create 5 new repositories\n2. Extract code from monorepo\n3. Set up CI/CD for each\n4. Create initial packages\n\n**Phase 2: Package Integration (Week 3)**\n\n1. Implement package registry\n2. Create installers\n3. Set up version compatibility matrix\n4. Test cross-repo integration\n\n**Phase 3: Distribution System (Week 4)**\n\n1. Implement bundle system\n2. Create release automation\n3. Set up package hosting\n4. Document release process\n\n**Phase 4: Migration (Week 5)**\n\n1. Migrate existing users\n2. Update documentation\n3. Archive monorepo\n4. Announce new structure\n\n---\n\n## Conclusion\n\n**Recommendation: Multi-Repository Architecture with Package-Based Integration**\n\nThe multi-repo approach provides:\n\n- ✅ Clear separation between Nushell core and Rust platform\n- ✅ Independent release cycles for different components\n- ✅ Better community contribution experience\n- ✅ Language-specific tooling and workflows\n- ✅ Modular extension ecosystem\n- ✅ Faster builds and CI/CD\n- ✅ Clear ownership boundaries\n\n**Avoid:** Submodules (complexity nightmare)\n\n**Use:** Package-based dependencies with version compatibility matrix\n\nThis architecture scales better for your project's growth, supports a community extension ecosystem, and provides professional-grade separation of\nconcerns while maintaining integration through a well-designed package system.\n\n---\n\n## Next Steps\n\n1. **Approve multi-repo strategy**\n2. **Create repository split plan**\n3. **Set up GitHub organizations/teams**\n4. **Implement package registry**\n5. **Begin repository extraction**\n\nWould you like me to create a detailed **repository split implementation plan** next? +# Multi-Repository Strategy Analysis + +**Date:** 2025-10-01 +**Status:** Strategic Analysis +**Related:** [Repository Distribution Analysis](repo-dist-analysis.md) + +## Executive Summary + +This document analyzes a **multi-repository strategy** as an alternative to the monorepo approach. After careful consideration of the provisioning +system's architecture, a **hybrid approach with 4 core repositories** is recommended, avoiding submodules in favor of a cleaner package-based +dependency model. 
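+
+As a concrete illustration of that model, components pin each other through
+package metadata rather than git submodules; a minimal sketch, using the
+manifest conventions shown later in this document:
+
+```text
+# extension manifest.toml (sketch)
+[dependencies]
+provisioning-core = "^3.0"   # semver range, resolved at install time
+```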
+ +--- + +## Repository Architecture Options + +### Option A: Pure Monorepo (Original Recommendation) + +**Single repository:** `provisioning` + +**Pros:** + +- Simplest development workflow +- Atomic cross-component changes +- Single version number +- One CI/CD pipeline + +**Cons:** + +- Large repository size +- Mixed language tooling (Rust + Nushell) +- All-or-nothing updates +- Unclear ownership boundaries + +### Option B: Multi-Repo with Submodules (❌ Not Recommended) + +**Repositories:** + +- `provisioning-core` (main, contains submodules) +- `provisioning-platform` (submodule) +- `provisioning-extensions` (submodule) +- `provisioning-workspace` (submodule) + +**Why Not Recommended:** + +- Submodule hell: complex, error-prone workflows +- Detached HEAD issues +- Update synchronization nightmares +- Clone complexity for users +- Difficult to maintain version compatibility +- Poor developer experience + +### Option C: Multi-Repo with Package Dependencies (✅ RECOMMENDED) + +**Independent repositories with package-based integration:** + +- `provisioning-core` - Nushell libraries and Nickel schemas +- `provisioning-platform` - Rust services (orchestrator, control-center, MCP) +- `provisioning-extensions` - Extension marketplace/catalog +- `provisioning-workspace` - Project templates and examples +- `provisioning-distribution` - Release automation and packaging + +**Why Recommended:** + +- Clean separation of concerns +- Independent versioning and release cycles +- Language-specific tooling and workflows +- Clear ownership boundaries +- Package-based dependencies (no submodules) +- Easier community contributions + +--- + +## Recommended Multi-Repo Architecture + +### Repository 1: `provisioning-core` + +**Purpose:** Core Nushell infrastructure automation engine + +**Contents:** + +```text +provisioning-core/ +├── nulib/ # Nushell libraries +│ ├── lib_provisioning/ # Core library functions +│ ├── servers/ # Server management +│ ├── taskservs/ # Task service management +│ ├── clusters/ # Cluster management +│ └── workflows/ # Workflow orchestration +├── cli/ # CLI entry point +│ └── provisioning # Pure Nushell CLI +├── schemas/ # Nickel schemas +│ ├── main.ncl +│ ├── settings.ncl +│ ├── server.ncl +│ ├── cluster.ncl +│ └── workflows.ncl +├── config/ # Default configurations +│ └── config.defaults.toml +├── templates/ # Core templates +├── tools/ # Build and packaging tools +├── tests/ # Core tests +├── docs/ # Core documentation +├── LICENSE +├── README.md +├── CHANGELOG.md +└── version.toml # Core version file +``` + +**Technology:** Nushell, Nickel +**Primary Language:** Nushell +**Release Frequency:** Monthly (stable) +**Ownership:** Core team +**Dependencies:** None (foundation) + +**Package Output:** + +- `provisioning-core-{version}.tar.gz` - Installable package +- Published to package registry + +**Installation Path:** + +```text +/usr/local/ +├── bin/provisioning +├── lib/provisioning/ +└── share/provisioning/ +``` + +--- + +### Repository 2: `provisioning-platform` + +**Purpose:** High-performance Rust platform services + +**Contents:** + +```text +provisioning-platform/ +├── orchestrator/ # Rust orchestrator +│ ├── src/ +│ ├── tests/ +│ ├── benches/ +│ └── Cargo.toml +├── control-center/ # Web control center (Leptos) +│ ├── src/ +│ ├── tests/ +│ └── Cargo.toml +├── mcp-server/ # Model Context Protocol server +│ ├── src/ +│ ├── tests/ +│ └── Cargo.toml +├── api-gateway/ # REST API gateway +│ ├── src/ +│ ├── tests/ +│ └── Cargo.toml +├── shared/ # Shared Rust libraries +│ ├── types/ 
+│ └── utils/ +├── docs/ # Platform documentation +├── Cargo.toml # Workspace root +├── Cargo.lock +├── LICENSE +├── README.md +└── CHANGELOG.md +``` + +**Technology:** Rust, WebAssembly +**Primary Language:** Rust +**Release Frequency:** Bi-weekly (fast iteration) +**Ownership:** Platform team +**Dependencies:** + +- `provisioning-core` (runtime integration, loose coupling) + +**Package Output:** + +- `provisioning-platform-{version}.tar.gz` - Binaries +- Binaries for: Linux (x86_64, arm64), macOS (x86_64, arm64) + +**Installation Path:** + +```text +/usr/local/ +├── bin/ +│ ├── provisioning-orchestrator +│ └── provisioning-control-center +└── share/provisioning/platform/ +``` + +**Integration with Core:** + +- Platform services call `provisioning` CLI via subprocess +- No direct code dependencies +- Communication via REST API and file-based queues +- Core and Platform can be deployed independently + +--- + +### Repository 3: `provisioning-extensions` + +**Purpose:** Extension marketplace and community modules + +**Contents:** + +```text +provisioning-extensions/ +├── registry/ # Extension registry +│ ├── index.json # Searchable index +│ └── catalog/ # Extension metadata +├── providers/ # Additional cloud providers +│ ├── azure/ +│ ├── gcp/ +│ ├── digitalocean/ +│ └── hetzner/ +├── taskservs/ # Community task services +│ ├── databases/ +│ │ ├── mongodb/ +│ │ ├── redis/ +│ │ └── cassandra/ +│ ├── development/ +│ │ ├── gitlab/ +│ │ ├── jenkins/ +│ │ └── sonarqube/ +│ └── observability/ +│ ├── prometheus/ +│ ├── grafana/ +│ └── loki/ +├── clusters/ # Cluster templates +│ ├── ml-platform/ +│ ├── data-pipeline/ +│ └── gaming-backend/ +├── workflows/ # Workflow templates +├── tools/ # Extension development tools +├── docs/ # Extension development guide +├── LICENSE +└── README.md +``` + +**Technology:** Nushell, Nickel +**Primary Language:** Nushell +**Release Frequency:** Continuous (per-extension) +**Ownership:** Community + Core team +**Dependencies:** + +- `provisioning-core` (extends core functionality) + +**Package Output:** + +- Individual extension packages: `provisioning-ext-{name}-{version}.tar.gz` +- Registry index for discovery + +**Installation:** + +```text +# Install extension via core CLI +provisioning extension install mongodb +provisioning extension install azure-provider +``` + +**Extension Structure:** +Each extension is self-contained: + +```text +mongodb/ +├── manifest.toml # Extension metadata +├── taskserv.nu # Implementation +├── templates/ # Templates +├── schemas/ # Nickel schemas +├── tests/ # Tests +└── README.md +``` + +--- + +### Repository 4: `provisioning-workspace` + +**Purpose:** Project templates and starter kits + +**Contents:** + +```text +provisioning-workspace/ +├── templates/ # Workspace templates +│ ├── minimal/ # Minimal starter +│ ├── kubernetes/ # Full K8s cluster +│ ├── multi-cloud/ # Multi-cloud setup +│ ├── microservices/ # Microservices platform +│ ├── data-platform/ # Data engineering +│ └── ml-ops/ # MLOps platform +├── examples/ # Complete examples +│ ├── blog-deployment/ +│ ├── e-commerce/ +│ └── saas-platform/ +├── blueprints/ # Architecture blueprints +├── docs/ # Template documentation +├── tools/ # Template scaffolding +│ └── create-workspace.nu +├── LICENSE +└── README.md +``` + +**Technology:** Configuration files, Nickel +**Primary Language:** TOML, Nickel, YAML +**Release Frequency:** Quarterly (stable templates) +**Ownership:** Community + Documentation team +**Dependencies:** + +- `provisioning-core` (templates use core) +- 
`provisioning-extensions` (may reference extensions) + +**Package Output:** + +- `provisioning-templates-{version}.tar.gz` + +**Usage:** + +```text +# Create workspace from template +provisioning workspace init my-project --template kubernetes + +# Or use separate tool +gh repo create my-project --template provisioning-workspace +cd my-project +provisioning workspace init +``` + +--- + +### Repository 5: `provisioning-distribution` + +**Purpose:** Release automation, packaging, and distribution infrastructure + +**Contents:** + +```text +provisioning-distribution/ +├── release-automation/ # Automated release workflows +│ ├── build-all.nu # Build all packages +│ ├── publish.nu # Publish to registries +│ └── validate.nu # Validation suite +├── installers/ # Installation scripts +│ ├── install.nu # Nushell installer +│ ├── install.sh # Bash installer +│ └── install.ps1 # PowerShell installer +├── packaging/ # Package builders +│ ├── core/ +│ ├── platform/ +│ └── extensions/ +├── registry/ # Package registry backend +│ ├── api/ # Registry REST API +│ └── storage/ # Package storage +├── ci-cd/ # CI/CD configurations +│ ├── github/ # GitHub Actions +│ ├── gitlab/ # GitLab CI +│ └── jenkins/ # Jenkins pipelines +├── version-management/ # Cross-repo version coordination +│ ├── versions.toml # Version matrix +│ └── compatibility.toml # Compatibility matrix +├── docs/ # Distribution documentation +│ ├── release-process.md +│ └── packaging-guide.md +├── LICENSE +└── README.md +``` + +**Technology:** Nushell, Bash, CI/CD +**Primary Language:** Nushell, YAML +**Release Frequency:** As needed +**Ownership:** Release engineering team +**Dependencies:** All repositories (orchestrates releases) + +**Responsibilities:** + +- Build packages from all repositories +- Coordinate multi-repo releases +- Publish to package registries +- Manage version compatibility +- Generate release notes +- Host package registry + +--- + +## Dependency and Integration Model + +### Package-Based Dependencies (Not Submodules) + +```text +┌─────────────────────────────────────────────────────────────┐ +│ provisioning-distribution │ +│ (Release orchestration & registry) │ +└──────────────────────────┬──────────────────────────────────┘ + │ publishes packages + ↓ + ┌──────────────┐ + │ Registry │ + └──────┬───────┘ + │ + ┌──────────────────┼──────────────────┐ + ↓ ↓ ↓ +┌───────────────┐ ┌──────────────┐ ┌──────────────┐ +│ provisioning │ │ provisioning │ │ provisioning │ +│ -core │ │ -platform │ │ -extensions │ +└───────┬───────┘ └──────┬───────┘ └──────┬───────┘ + │ │ │ + │ │ depends on │ extends + │ └─────────┐ │ + │ ↓ │ + └───────────────────────────────────→┘ + runtime integration +``` + +### Integration Mechanisms + +#### 1. **Core ↔ Platform Integration** + +**Method:** Loose coupling via CLI + REST API + +```text +# Platform calls Core CLI (subprocess) +def create-server [name: string] { + # Orchestrator executes Core CLI + ^provisioning server create $name --infra production +} + +# Core calls Platform API (HTTP) +def submit-workflow [workflow: record] { + http post http://localhost:9090/workflows/submit $workflow +} +``` + +**Version Compatibility:** + +```text +# platform/Cargo.toml +[package.metadata.provisioning] +core-version = "^3.0" # Compatible with core 3.x +``` + +#### 2. 
**Core ↔ Extensions Integration** + +**Method:** Plugin/module system + +```text +# Extension manifest +# extensions/mongodb/manifest.toml +[extension] +name = "mongodb" +version = "1.0.0" +type = "taskserv" +core-version = "^3.0" + +[dependencies] +provisioning-core = "^3.0" + +# Extension installation +# Core downloads and validates extension +provisioning extension install mongodb +# → Downloads from registry +# → Validates compatibility +# → Installs to ~/.provisioning/extensions/mongodb +``` + +#### 3. **Workspace Templates** + +**Method:** Git templates or package templates + +```text +# Option 1: GitHub template repository +gh repo create my-infra --template provisioning-workspace +cd my-infra +provisioning workspace init + +# Option 2: Template package +provisioning workspace create my-infra --template kubernetes +# → Downloads template package +# → Scaffolds workspace +# → Initializes configuration +``` + +--- + +## Version Management Strategy + +### Semantic Versioning Per Repository + +Each repository maintains independent semantic versioning: + +```text +provisioning-core: 3.2.1 +provisioning-platform: 2.5.3 +provisioning-extensions: (per-extension versioning) +provisioning-workspace: 1.4.0 +``` + +### Compatibility Matrix + +**`provisioning-distribution/version-management/versions.toml`:** + +```text +# Version compatibility matrix +[compatibility] + +# Core versions and compatible platform versions +[compatibility.core] +"3.2.1" = { platform = "^2.5", extensions = "^1.0", workspace = "^1.0" } +"3.2.0" = { platform = "^2.4", extensions = "^1.0", workspace = "^1.0" } +"3.1.0" = { platform = "^2.3", extensions = "^0.9", workspace = "^1.0" } + +# Platform versions and compatible core versions +[compatibility.platform] +"2.5.3" = { core = "^3.2", min-core = "3.2.0" } +"2.5.0" = { core = "^3.1", min-core = "3.1.0" } + +# Release bundles (tested combinations) +[bundles] + +[bundles.stable-3.2] +name = "Stable 3.2 Bundle" +release-date = "2025-10-15" +core = "3.2.1" +platform = "2.5.3" +extensions = ["mongodb@1.2.0", "redis@1.1.0", "azure@2.0.0"] +workspace = "1.4.0" + +[bundles.lts-3.1] +name = "LTS 3.1 Bundle" +release-date = "2025-09-01" +lts-until = "2026-09-01" +core = "3.1.5" +platform = "2.4.8" +workspace = "1.3.0" +``` + +### Release Coordination + +**Coordinated releases** for major versions: + +```text +# Major release: All repos release together +provisioning-core: 3.0.0 +provisioning-platform: 2.0.0 +provisioning-workspace: 1.0.0 + +# Minor/patch releases: Independent +provisioning-core: 3.1.0 (adds features, platform stays 2.0.x) +provisioning-platform: 2.1.0 (improves orchestrator, core stays 3.1.x) +``` + +--- + +## Development Workflow + +### Working on Single Repository + +```text +# Developer working on core only +git clone https://github.com/yourorg/provisioning-core +cd provisioning-core + +# Install dependencies +just install-deps + +# Development +just dev-check +just test + +# Build package +just build + +# Test installation locally +just install-dev +``` + +### Working Across Repositories + +```text +# Scenario: Adding new feature requiring core + platform changes + +# 1. Clone both repositories +git clone https://github.com/yourorg/provisioning-core +git clone https://github.com/yourorg/provisioning-platform + +# 2. Create feature branches +cd provisioning-core +git checkout -b feat/batch-workflow-v2 + +cd ../provisioning-platform +git checkout -b feat/batch-workflow-v2 + +# 3. 
Develop with local linking +cd provisioning-core +just install-dev # Installs to /usr/local/bin/provisioning + +cd ../provisioning-platform +# Platform uses system provisioning CLI (local dev version) +cargo run + +# 4. Test integration +cd ../provisioning-core +just test-integration + +cd ../provisioning-platform +cargo test + +# 5. Create PRs in both repositories +# PR #123 in provisioning-core +# PR #456 in provisioning-platform (references core PR) + +# 6. Coordinate merge +# Merge core PR first, cut release 3.3.0 +# Update platform dependency to core 3.3.0 +# Merge platform PR, cut release 2.6.0 +``` + +### Testing Cross-Repo Integration + +```text +# Integration tests in provisioning-distribution +cd provisioning-distribution + +# Test specific version combination +just test-integration + --core 3.3.0 + --platform 2.6.0 + +# Test bundle +just test-bundle stable-3.3 +``` + +--- + +## Distribution Strategy + +### Individual Repository Releases + +Each repository releases independently: + +```text +# Core release +cd provisioning-core +git tag v3.2.1 +git push --tags +# → GitHub Actions builds package +# → Publishes to package registry + +# Platform release +cd provisioning-platform +git tag v2.5.3 +git push --tags +# → GitHub Actions builds binaries +# → Publishes to package registry +``` + +### Bundle Releases (Coordinated) + +Distribution repository creates tested bundles: + +```text +cd provisioning-distribution + +# Create bundle +just create-bundle stable-3.2 + --core 3.2.1 + --platform 2.5.3 + --workspace 1.4.0 + +# Test bundle +just test-bundle stable-3.2 + +# Publish bundle +just publish-bundle stable-3.2 +# → Creates meta-package with all components +# → Publishes bundle to registry +# → Updates documentation +``` + +### User Installation Options + +#### Option 1: Bundle Installation (Recommended for Users) + +```text +# Install stable bundle (easiest) +curl -fsSL https://get.provisioning.io | sh + +# Installs: +# - provisioning-core 3.2.1 +# - provisioning-platform 2.5.3 +# - provisioning-workspace 1.4.0 +``` + +#### Option 2: Individual Component Installation + +```text +# Install only core (minimal) +curl -fsSL https://get.provisioning.io/core | sh + +# Add platform later +provisioning install platform + +# Add extensions +provisioning extension install mongodb +``` + +#### Option 3: Custom Combination + +```text +# Install specific versions +provisioning install core@3.1.0 +provisioning install platform@2.4.0 +``` + +--- + +## Repository Ownership and Contribution Model + +### Core Team Ownership + +| Repository | Primary Owner | Contribution Model | +| ------------ | --------------- | ------------------- | +| `provisioning-core` | Core Team | Strict review, stable API | +| `provisioning-platform` | Platform Team | Fast iteration, performance focus | +| `provisioning-extensions` | Community + Core | Open contributions, moderated | +| `provisioning-workspace` | Docs Team | Template contributions welcome | +| `provisioning-distribution` | Release Engineering | Core team only | + +### Contribution Workflow + +**For Core:** + +1. Create issue in `provisioning-core` +2. Discuss design +3. Submit PR with tests +4. Strict code review +5. Merge to `main` +6. Release when ready + +**For Extensions:** + +1. Create extension in `provisioning-extensions` +2. Follow extension guidelines +3. Submit PR +4. Community review +5. Merge and publish to registry +6. Independent versioning + +**For Platform:** + +1. Create issue in `provisioning-platform` +2. Implement with benchmarks +3. 
Submit PR +4. Performance review +5. Merge and release + +--- + +## CI/CD Strategy + +### Per-Repository CI/CD + +**Core CI (`provisioning-core/.github/workflows/ci.yml`):** + +```text +name: Core CI + +on: [push, pull_request] + +jobs: + test: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v3 + - name: Install Nushell + run: cargo install nu + - name: Run tests + run: just test + - name: Validate Nickel schemas + run: just validate-nickel + + package: + runs-on: ubuntu-latest + if: startsWith(github.ref, 'refs/tags/v') + steps: + - uses: actions/checkout@v3 + - name: Build package + run: just build + - name: Publish to registry + run: just publish + env: + REGISTRY_TOKEN: ${{ secrets.REGISTRY_TOKEN }} +``` + +**Platform CI (`provisioning-platform/.github/workflows/ci.yml`):** + +```text +name: Platform CI + +on: [push, pull_request] + +jobs: + test: + strategy: + matrix: + os: [ubuntu-latest, macos-latest] + runs-on: ${{ matrix.os }} + steps: + - uses: actions/checkout@v3 + - name: Build + run: cargo build --release + - name: Test + run: cargo test --workspace + - name: Benchmark + run: cargo bench + + cross-compile: + runs-on: ubuntu-latest + if: startsWith(github.ref, 'refs/tags/v') + steps: + - uses: actions/checkout@v3 + - name: Build for Linux x86_64 + run: cargo build --release --target x86_64-unknown-linux-gnu + - name: Build for Linux arm64 + run: cargo build --release --target aarch64-unknown-linux-gnu + - name: Publish binaries + run: just publish-binaries +``` + +### Integration Testing (Distribution Repo) + +**Distribution CI (`provisioning-distribution/.github/workflows/integration.yml`):** + +```text +name: Integration Tests + +on: + schedule: + - cron: '0 0 * * *' # Daily + workflow_dispatch: + +jobs: + test-bundle: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v3 + + - name: Install bundle + run: | + nu release-automation/install-bundle.nu stable-3.2 + + - name: Run integration tests + run: | + nu tests/integration/test-all.nu + + - name: Test upgrade path + run: | + nu tests/integration/test-upgrade.nu 3.1.0 3.2.1 +``` + +--- + +## File and Directory Structure Comparison + +### Monorepo Structure + +```text +provisioning/ (One repo, ~500 MB) +├── core/ (Nushell) +├── platform/ (Rust) +├── extensions/ (Community) +├── workspace/ (Templates) +└── distribution/ (Build) +``` + +### Multi-Repo Structure + +```text +provisioning-core/ (Repo 1, ~50 MB) +├── nulib/ +├── cli/ +├── schemas/ +└── tools/ + +provisioning-platform/ (Repo 2, ~150 MB with target/) +├── orchestrator/ +├── control-center/ +├── mcp-server/ +└── Cargo.toml + +provisioning-extensions/ (Repo 3, ~100 MB) +├── registry/ +├── providers/ +├── taskservs/ +└── clusters/ + +provisioning-workspace/ (Repo 4, ~20 MB) +├── templates/ +├── examples/ +└── blueprints/ + +provisioning-distribution/ (Repo 5, ~30 MB) +├── release-automation/ +├── installers/ +├── packaging/ +└── registry/ +``` + +--- + +## Decision Matrix + +| Criterion | Monorepo | Multi-Repo | +| ----------- | ---------- | ------------ | +| **Development Complexity** | Simple | Moderate | +| **Clone Size** | Large (~500 MB) | Small (50-150 MB each) | +| **Cross-Component Changes** | Easy (atomic) | Moderate (coordinated) | +| **Independent Releases** | Difficult | Easy | +| **Language-Specific Tooling** | Mixed | Clean | +| **Community Contributions** | Harder (big repo) | Easier (focused repos) | +| **Version Management** | Simple (one version) | Complex (matrix) | +| **CI/CD Complexity** | Simple (one pipeline) | Moderate 
(multiple) | +| **Ownership Clarity** | Unclear | Clear | +| **Extension Ecosystem** | Monolithic | Modular | +| **Build Time** | Long (build all) | Short (build one) | +| **Testing Isolation** | Difficult | Easy | + +--- + +## Recommended Approach: Multi-Repo + +### Why Multi-Repo Wins for This Project + +1. **Clear Separation of Concerns** + - Nushell core vs Rust platform are different domains + - Different teams can own different repos + - Different release cadences make sense + +2. **Language-Specific Tooling** + - `provisioning-core`: Nushell-focused, simple testing + - `provisioning-platform`: Rust workspace, Cargo tooling + - No mixed tooling confusion + +3. **Community Contributions** + - Extensions repo is easier to contribute to + - No need to clone the entire monorepo + - Clearer contribution guidelines per repo + +4. **Independent Versioning** + - Core can stay stable (3.x for months) + - Platform can iterate fast (2.x weekly) + - Extensions have their own lifecycles + +5. **Build Performance** + - Only build what changed + - Faster CI/CD per repo + - Parallel builds across repos + +6. **Extension Ecosystem** + - Extensions repo becomes a marketplace + - Third-party extensions can live separately + - Registry becomes the discovery mechanism + +### Implementation Strategy + +**Phase 1: Split Repositories (Week 1-2)** + +1. Create 5 new repositories +2. Extract code from monorepo +3. Set up CI/CD for each +4. Create initial packages + +**Phase 2: Package Integration (Week 3)** + +1. Implement package registry +2. Create installers +3. Set up version compatibility matrix +4. Test cross-repo integration + +**Phase 3: Distribution System (Week 4)** + +1. Implement bundle system +2. Create release automation +3. Set up package hosting +4. Document release process + +**Phase 4: Migration (Week 5)** + +1. Migrate existing users +2. Update documentation +3. Archive monorepo +4. Announce new structure + +--- + +## Conclusion + +**Recommendation: Multi-Repository Architecture with Package-Based Integration** + +The multi-repo approach provides: + +- ✅ Clear separation between Nushell core and Rust platform +- ✅ Independent release cycles for different components +- ✅ Better community contribution experience +- ✅ Language-specific tooling and workflows +- ✅ Modular extension ecosystem +- ✅ Faster builds and CI/CD +- ✅ Clear ownership boundaries + +**Avoid:** Submodules (complexity nightmare) + +**Use:** Package-based dependencies with version compatibility matrix + +This architecture scales better as the project grows, supports a community extension ecosystem, and provides professional-grade separation of concerns while maintaining integration through a well-designed package system. + +--- + +## Next Steps + +1. **Approve multi-repo strategy** +2. **Create repository split plan** +3. **Set up GitHub organizations/teams** +4. **Implement package registry** +5. **Begin repository extraction**
\ No newline at end of file diff --git a/docs/src/architecture/nickel-executable-examples.md b/docs/src/architecture/nickel-executable-examples.md index ee8a55e..c7601db 100644 --- a/docs/src/architecture/nickel-executable-examples.md +++ b/docs/src/architecture/nickel-executable-examples.md @@ -1 +1,773 @@ -# Nickel Executable Examples & Test Cases\n\n**Status**: Practical Developer Guide\n**Last Updated**: 2025-12-15\n**Purpose**: Copy-paste ready examples, validatable patterns, runnable test cases\n\n---\n\n## Setup: Run Examples Locally\n\n### Prerequisites\n\n```\n# Install Nickel\nbrew install nickel\n# or from source: https://nickel-lang.org/getting-started/\n\n# Verify installation\nnickel --version # Should be 1.0+\n```\n\n### Directory Structure for Examples\n\n```\nmkdir -p ~/nickel-examples/{simple,complex,production}\ncd ~/nickel-examples\n```\n\n---\n\n## Example 1: Simple Server Configuration (Executable)\n\n### Step 1: Create Contract File\n\n```\ncat > simple/server_contracts.ncl << 'EOF'\n{\n ServerConfig = {\n name | String,\n cpu_cores | Number,\n memory_gb | Number,\n zone | String,\n },\n}\nEOF\n```\n\n### Step 2: Create Defaults File\n\n```\ncat > simple/server_defaults.ncl << 'EOF'\n{\n web_server = {\n name = "web-01",\n cpu_cores = 4,\n memory_gb = 8,\n zone = "us-nyc1",\n },\n\n database_server = {\n name = "db-01",\n cpu_cores = 8,\n memory_gb = 16,\n zone = "us-nyc1",\n },\n\n cache_server = {\n name = "cache-01",\n cpu_cores = 2,\n memory_gb = 4,\n zone = "us-nyc1",\n },\n}\nEOF\n```\n\n### Step 3: Create Main Module with Hybrid Interface\n\n```\ncat > simple/server.ncl << 'EOF'\nlet contracts = import "./server_contracts.ncl" in\nlet defaults = import "./server_defaults.ncl" in\n\n{\n defaults = defaults,\n\n # Level 1: Maker functions (90% of use cases)\n make_server | not_exported = fun overrides =>\n let base = defaults.web_server in\n base & overrides,\n\n # Level 2: Pre-built instances (inspection/reference)\n DefaultWebServer = defaults.web_server,\n DefaultDatabaseServer = defaults.database_server,\n DefaultCacheServer = defaults.cache_server,\n\n # Level 3: Custom combinations\n production_web_server = defaults.web_server & {\n cpu_cores = 8,\n memory_gb = 16,\n },\n\n production_database_stack = [\n defaults.database_server & { name = "db-01", zone = "us-nyc1" },\n defaults.database_server & { name = "db-02", zone = "eu-fra1" },\n ],\n}\nEOF\n```\n\n### Test: Export and Validate JSON\n\n```\ncd simple/\n\n# Export to JSON\nnickel export server.ncl --format json | jq .\n\n# Expected output:\n# {\n# "defaults": { ... },\n# "DefaultWebServer": { "name": "web-01", "cpu_cores": 4, ... },\n# "DefaultDatabaseServer": { ... },\n# "DefaultCacheServer": { ... },\n# "production_web_server": { "name": "web-01", "cpu_cores": 8, ... },\n# "production_database_stack": [ ... 
]\n# }\n\n# Verify specific fields\nnickel export server.ncl --format json | jq '.production_web_server.cpu_cores'\n# Output: 8\n```\n\n### Usage in Consumer Module\n\n```\ncat > simple/consumer.ncl << 'EOF'\nlet server = import "./server.ncl" in\n\n{\n # Use maker function\n staging_web = server.make_server {\n name = "staging-web",\n zone = "eu-fra1",\n },\n\n # Reference defaults\n default_db = server.DefaultDatabaseServer,\n\n # Use pre-built\n production_stack = server.production_database_stack,\n}\nEOF\n\n# Export and verify\nnickel export consumer.ncl --format json | jq '.staging_web'\n```\n\n---\n\n## Example 2: Complex Provider Extension (Production Pattern)\n\n### Create Provider Structure\n\n```\nmkdir -p complex/upcloud/{contracts,defaults,main}\ncd complex/upcloud\n```\n\n### Provider Contracts\n\n```\ncat > upcloud_contracts.ncl << 'EOF'\n{\n StorageBackup = {\n backup_id | String,\n frequency | String,\n retention_days | Number,\n },\n\n ServerConfig = {\n name | String,\n plan | String,\n zone | String,\n backups | Array,\n },\n\n ProviderConfig = {\n api_key | String,\n api_password | String,\n servers | Array,\n },\n}\nEOF\n```\n\n### Provider Defaults\n\n```\ncat > upcloud_defaults.ncl << 'EOF'\n{\n backup = {\n backup_id = "",\n frequency = "daily",\n retention_days = 7,\n },\n\n server = {\n name = "",\n plan = "1xCPU-1 GB",\n zone = "us-nyc1",\n backups = [],\n },\n\n provider = {\n api_key = "",\n api_password = "",\n servers = [],\n },\n}\nEOF\n```\n\n### Provider Main Module\n\n```\ncat > upcloud_main.ncl << 'EOF'\nlet contracts = import "./upcloud_contracts.ncl" in\nlet defaults = import "./upcloud_defaults.ncl" in\n\n{\n defaults = defaults,\n\n # Makers (90% use case)\n make_backup | not_exported = fun overrides =>\n defaults.backup & overrides,\n\n make_server | not_exported = fun overrides =>\n defaults.server & overrides,\n\n make_provider | not_exported = fun overrides =>\n defaults.provider & overrides,\n\n # Pre-built instances\n DefaultBackup = defaults.backup,\n DefaultServer = defaults.server,\n DefaultProvider = defaults.provider,\n\n # Production configs\n production_high_availability = defaults.provider & {\n servers = [\n defaults.server & {\n name = "web-01",\n plan = "2xCPU-4 GB",\n zone = "us-nyc1",\n backups = [\n defaults.backup & { frequency = "hourly" },\n ],\n },\n defaults.server & {\n name = "web-02",\n plan = "2xCPU-4 GB",\n zone = "eu-fra1",\n backups = [\n defaults.backup & { frequency = "hourly" },\n ],\n },\n defaults.server & {\n name = "db-01",\n plan = "4xCPU-16 GB",\n zone = "us-nyc1",\n backups = [\n defaults.backup & { frequency = "every-6h", retention_days = 30 },\n ],\n },\n ],\n },\n}\nEOF\n```\n\n### Test Provider Configuration\n\n```\n# Export provider config\nnickel export upcloud_main.ncl --format json | jq '.production_high_availability'\n\n# Export as TOML (for IaC config files)\nnickel export upcloud_main.ncl --format toml > upcloud.toml\ncat upcloud.toml\n\n# Count servers in production config\nnickel export upcloud_main.ncl --format json | jq '.production_high_availability.servers | length'\n# Output: 3\n```\n\n### Consumer Using Provider\n\n```\ncat > upcloud_consumer.ncl << 'EOF'\nlet upcloud = import "./upcloud_main.ncl" in\n\n{\n # Simple production setup\n simple_production = upcloud.make_provider {\n api_key = "prod-key",\n api_password = "prod-secret",\n servers = [\n upcloud.make_server { name = "web-01", plan = "2xCPU-4 GB" },\n upcloud.make_server { name = "web-02", plan = "2xCPU-4 GB" },\n ],\n },\n\n # 
Advanced HA setup with custom fields\n ha_stack = upcloud.production_high_availability & {\n api_key = "prod-key",\n api_password = "prod-secret",\n monitoring_enabled = true,\n alerting_email = "ops@company.com",\n custom_vpc_id = "vpc-prod-001",\n },\n}\nEOF\n\n# Validate structure\nnickel export upcloud_consumer.ncl --format json | jq '.ha_stack | keys'\n```\n\n---\n\n## Example 3: Real-World Pattern - Taskserv Configuration\n\n### Taskserv Contracts (from wuji)\n\n```\ncat > production/taskserv_contracts.ncl << 'EOF'\n{\n Dependency = {\n name | String,\n wait_for_health | Bool,\n },\n\n TaskServ = {\n name | String,\n version | String,\n dependencies | Array,\n enabled | Bool,\n },\n}\nEOF\n```\n\n### Taskserv Defaults\n\n```\ncat > production/taskserv_defaults.ncl << 'EOF'\n{\n kubernetes = {\n name = "kubernetes",\n version = "1.28.0",\n enabled = true,\n dependencies = [\n { name = "containerd", wait_for_health = true },\n { name = "etcd", wait_for_health = true },\n ],\n },\n\n cilium = {\n name = "cilium",\n version = "1.14.0",\n enabled = true,\n dependencies = [\n { name = "kubernetes", wait_for_health = true },\n ],\n },\n\n containerd = {\n name = "containerd",\n version = "1.7.0",\n enabled = true,\n dependencies = [],\n },\n\n etcd = {\n name = "etcd",\n version = "3.5.0",\n enabled = true,\n dependencies = [],\n },\n\n postgres = {\n name = "postgres",\n version = "15.0",\n enabled = true,\n dependencies = [],\n },\n\n redis = {\n name = "redis",\n version = "7.0.0",\n enabled = true,\n dependencies = [],\n },\n}\nEOF\n```\n\n### Taskserv Main\n\n```\ncat > production/taskserv.ncl << 'EOF'\nlet contracts = import "./taskserv_contracts.ncl" in\nlet defaults = import "./taskserv_defaults.ncl" in\n\n{\n defaults = defaults,\n\n make_taskserv | not_exported = fun overrides =>\n defaults.kubernetes & overrides,\n\n # Pre-built\n DefaultKubernetes = defaults.kubernetes,\n DefaultCilium = defaults.cilium,\n DefaultContainerd = defaults.containerd,\n DefaultEtcd = defaults.etcd,\n DefaultPostgres = defaults.postgres,\n DefaultRedis = defaults.redis,\n\n # Wuji infrastructure (20 taskservs similar to actual)\n wuji_k8s_stack = {\n kubernetes = defaults.kubernetes,\n cilium = defaults.cilium,\n containerd = defaults.containerd,\n etcd = defaults.etcd,\n },\n\n wuji_data_stack = {\n postgres = defaults.postgres & { version = "15.3" },\n redis = defaults.redis & { version = "7.2.0" },\n },\n\n # Staging with different versions\n staging_stack = {\n kubernetes = defaults.kubernetes & { version = "1.27.0" },\n cilium = defaults.cilium & { version = "1.13.0" },\n containerd = defaults.containerd & { version = "1.6.0" },\n etcd = defaults.etcd & { version = "3.4.0" },\n postgres = defaults.postgres & { version = "14.0" },\n },\n}\nEOF\n```\n\n### Test Taskserv Setup\n\n```\n# Export stack\nnickel export taskserv.ncl --format json | jq '.wuji_k8s_stack | keys'\n# Output: ["kubernetes", "cilium", "containerd", "etcd"]\n\n# Get specific version\nnickel export taskserv.ncl --format json | \\n jq '.staging_stack.kubernetes.version'\n# Output: "1.27.0"\n\n# Count taskservs in stacks\necho "Wuji K8S stack:"\nnickel export taskserv.ncl --format json | jq '.wuji_k8s_stack | length'\n\necho "Staging stack:"\nnickel export taskserv.ncl --format json | jq '.staging_stack | length'\n```\n\n---\n\n## Example 4: Composition & Extension Pattern\n\n### Base Infrastructure\n\n```\ncat > production/infrastructure.ncl << 'EOF'\nlet servers = import "./server.ncl" in\nlet taskservs = import "./taskserv.ncl" 
in\n\n{\n # Infrastructure with servers + taskservs\n development = {\n servers = {\n app = servers.make_server { name = "dev-app", cpu_cores = 2 },\n db = servers.make_server { name = "dev-db", cpu_cores = 4 },\n },\n taskservs = taskservs.staging_stack,\n },\n\n production = {\n servers = [\n servers.make_server { name = "prod-app-01", cpu_cores = 8 },\n servers.make_server { name = "prod-app-02", cpu_cores = 8 },\n servers.make_server { name = "prod-db-01", cpu_cores = 16 },\n ],\n taskservs = taskservs.wuji_k8s_stack & {\n prometheus = {\n name = "prometheus",\n version = "2.45.0",\n enabled = true,\n dependencies = [],\n },\n },\n },\n}\nEOF\n\n# Validate composition\nnickel export infrastructure.ncl --format json | jq '.production.servers | length'\n# Output: 3\n\nnickel export infrastructure.ncl --format json | jq '.production.taskservs | keys | length'\n# Output: 5\n```\n\n### Extending Infrastructure (Nickel Advantage!)\n\n```\ncat > production/infrastructure_extended.ncl << 'EOF'\nlet infra = import "./infrastructure.ncl" in\n\n# Add custom fields without modifying base!\n{\n development = infra.development & {\n monitoring_enabled = false,\n cost_optimization = true,\n auto_shutdown = true,\n },\n\n production = infra.production & {\n monitoring_enabled = true,\n alert_email = "ops@company.com",\n backup_enabled = true,\n backup_frequency = "6h",\n disaster_recovery_enabled = true,\n dr_region = "eu-fra1",\n compliance_level = "SOC2",\n security_scanning = true,\n },\n}\nEOF\n\n# Verify extension works (custom fields are preserved!)\nnickel export infrastructure_extended.ncl --format json | \\n jq '.production | keys'\n# Output includes: monitoring_enabled, alert_email, backup_enabled, etc\n```\n\n---\n\n## Example 5: Validation & Error Handling\n\n### Validation Functions\n\n```\ncat > production/validation.ncl << 'EOF'\nlet validate_server = fun server =>\n if server.cpu_cores <= 0 then\n std.record.fail "CPU cores must be positive"\n else if server.memory_gb <= 0 then\n std.record.fail "Memory must be positive"\n else\n server\nin\n\nlet validate_taskserv = fun ts =>\n if std.string.length ts.name == 0 then\n std.record.fail "TaskServ name required"\n else if std.string.length ts.version == 0 then\n std.record.fail "TaskServ version required"\n else\n ts\nin\n\n{\n validate_server = validate_server,\n validate_taskserv = validate_taskserv,\n}\nEOF\n```\n\n### Using Validations\n\n```\ncat > production/validated_config.ncl << 'EOF'\nlet server = import "./server.ncl" in\nlet taskserv = import "./taskserv.ncl" in\nlet validation = import "./validation.ncl" in\n\n{\n # Valid server (passes validation)\n valid_server = validation.validate_server {\n name = "web-01",\n cpu_cores = 4,\n memory_gb = 8,\n zone = "us-nyc1",\n },\n\n # Valid taskserv\n valid_taskserv = validation.validate_taskserv {\n name = "kubernetes",\n version = "1.28.0",\n dependencies = [],\n enabled = true,\n },\n}\nEOF\n\n# Test validation\nnickel export validated_config.ncl --format json\n# Should succeed without errors\n\n# Test invalid (uncomment to see error)\n# {\n# invalid_server = validation.validate_server {\n# name = "bad-server",\n# cpu_cores = -1, # Invalid!\n# memory_gb = 8,\n# zone = "us-nyc1",\n# },\n# }\n```\n\n---\n\n## Test Suite: Bash Script\n\n### Run All Examples\n\n```\n#!/bin/bash\n# test_all_examples.sh\n\nset -e\n\necho "=== Testing Nickel Examples ==="\n\ncd ~/nickel-examples\n\necho "1. 
Simple Server Configuration..."\ncd simple\nnickel export server.ncl --format json > /dev/null\necho " ✓ Simple server config valid"\n\necho "2. Complex Provider (UpCloud)..."\ncd ../complex/upcloud\nnickel export upcloud_main.ncl --format json > /dev/null\necho " ✓ UpCloud provider config valid"\n\necho "3. Production Taskserv..."\ncd ../../production\nnickel export taskserv.ncl --format json > /dev/null\necho " ✓ Taskserv config valid"\n\necho "4. Infrastructure Composition..."\nnickel export infrastructure.ncl --format json > /dev/null\necho " ✓ Infrastructure composition valid"\n\necho "5. Extended Infrastructure..."\nnickel export infrastructure_extended.ncl --format json > /dev/null\necho " ✓ Extended infrastructure valid"\n\necho "6. Validated Config..."\nnickel export validated_config.ncl --format json > /dev/null\necho " ✓ Validated config valid"\n\necho ""\necho "=== All Tests Passed ✓ ==="\n```\n\n---\n\n## Quick Commands Reference\n\n### Common Nickel Operations\n\n```\n# Validate Nickel syntax\nnickel export config.ncl\n\n# Export as JSON (for inspecting)\nnickel export config.ncl --format json\n\n# Export as TOML (for config files)\nnickel export config.ncl --format toml\n\n# Export as YAML\nnickel export config.ncl --format yaml\n\n# Pretty print JSON output\nnickel export config.ncl --format json | jq .\n\n# Extract specific field\nnickel export config.ncl --format json | jq '.production_server'\n\n# Count array elements\nnickel export config.ncl --format json | jq '.servers | length'\n\n# Check if file has valid syntax only\nnickel typecheck config.ncl\n```\n\n---\n\n## Troubleshooting Examples\n\n### Problem: "unexpected token" with multiple let\n\n```\n# ❌ WRONG\nlet A = {x = 1}\nlet B = {y = 2}\n{A = A, B = B}\n\n# ✅ CORRECT\nlet A = {x = 1} in\nlet B = {y = 2} in\n{A = A, B = B}\n```\n\n### Problem: Function serialization fails\n\n```\n# ❌ WRONG - function will fail to serialize\n{\n get_value = fun x => x + 1,\n result = get_value 5,\n}\n\n# ✅ CORRECT - mark function not_exported\n{\n get_value | not_exported = fun x => x + 1,\n result = get_value 5,\n}\n```\n\n### Problem: Null values cause export issues\n\n```\n# ❌ WRONG\n{ optional_field = null }\n\n# ✅ CORRECT - use empty string/array/object\n{ optional_field = "" } # for strings\n{ optional_field = [] } # for arrays\n{ optional_field = {} } # for objects\n```\n\n---\n\n## Summary\n\nThese examples are:\n\n- ✅ **Copy-paste ready** - Can run directly\n- ✅ **Executable** - Validated with `nickel export`\n- ✅ **Progressive** - Simple → Complex → Production\n- ✅ **Real patterns** - Based on actual codebase (wuji, upcloud)\n- ✅ **Self-contained** - Each example works independently\n- ✅ **Comparable** - Shows KCL vs Nickel equivalence\n\n**Next**: Use these as templates for your own Nickel configurations.\n\n---\n\n**Version**: 1.0.0\n**Status**: Tested & Verified\n**Last Updated**: 2025-12-15 +# Nickel Executable Examples & Test Cases + +**Status**: Practical Developer Guide +**Last Updated**: 2025-12-15 +**Purpose**: Copy-paste ready examples, validatable patterns, runnable test cases + +--- + +## Setup: Run Examples Locally + +### Prerequisites + +```text +# Install Nickel +brew install nickel +# or from source: https://nickel-lang.org/getting-started/ + +# Verify installation +nickel --version # Should be 1.0+ +``` + +### Directory Structure for Examples + +```text +mkdir -p ~/nickel-examples/{simple,complex,production} +cd ~/nickel-examples +``` + +--- + +## Example 1: Simple Server Configuration (Executable) + 
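+
+This example builds a three-file Nickel module (contracts, defaults, main) plus a consumer. As a preview of the interface the steps below assemble (the names match the files created in Steps 1-3):
+
+```text
+# Preview: what consumers of server.ncl will be able to write
+let server = import "./server.ncl" in
+server.make_server { name = "staging-web", zone = "eu-fra1" }
+```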
+### Step 1: Create Contract File + +```text +cat > simple/server_contracts.ncl << 'EOF' +{ + ServerConfig = { + name | String, + cpu_cores | Number, + memory_gb | Number, + zone | String, + }, +} +EOF +``` + +### Step 2: Create Defaults File + +```text +cat > simple/server_defaults.ncl << 'EOF' +{ + web_server = { + name = "web-01", + cpu_cores = 4, + memory_gb = 8, + zone = "us-nyc1", + }, + + database_server = { + name = "db-01", + cpu_cores = 8, + memory_gb = 16, + zone = "us-nyc1", + }, + + cache_server = { + name = "cache-01", + cpu_cores = 2, + memory_gb = 4, + zone = "us-nyc1", + }, +} +EOF +``` + +### Step 3: Create Main Module with Hybrid Interface + +```text +cat > simple/server.ncl << 'EOF' +let contracts = import "./server_contracts.ncl" in +let defaults = import "./server_defaults.ncl" in + +{ + defaults = defaults, + + # Level 1: Maker functions (90% of use cases) + make_server | not_exported = fun overrides => + let base = defaults.web_server in + base & overrides, + + # Level 2: Pre-built instances (inspection/reference) + DefaultWebServer = defaults.web_server, + DefaultDatabaseServer = defaults.database_server, + DefaultCacheServer = defaults.cache_server, + + # Level 3: Custom combinations + production_web_server = defaults.web_server & { + cpu_cores = 8, + memory_gb = 16, + }, + + production_database_stack = [ + defaults.database_server & { name = "db-01", zone = "us-nyc1" }, + defaults.database_server & { name = "db-02", zone = "eu-fra1" }, + ], +} +EOF +``` + +### Test: Export and Validate JSON + +```text +cd simple/ + +# Export to JSON +nickel export server.ncl --format json | jq . + +# Expected output: +# { +# "defaults": { ... }, +# "DefaultWebServer": { "name": "web-01", "cpu_cores": 4, ... }, +# "DefaultDatabaseServer": { ... }, +# "DefaultCacheServer": { ... }, +# "production_web_server": { "name": "web-01", "cpu_cores": 8, ... }, +# "production_database_stack": [ ... 
] +# } + +# Verify specific fields +nickel export server.ncl --format json | jq '.production_web_server.cpu_cores' +# Output: 8 +``` + +### Usage in Consumer Module + +```text +cat > simple/consumer.ncl << 'EOF' +let server = import "./server.ncl" in + +{ + # Use maker function + staging_web = server.make_server { + name = "staging-web", + zone = "eu-fra1", + }, + + # Reference defaults + default_db = server.DefaultDatabaseServer, + + # Use pre-built + production_stack = server.production_database_stack, +} +EOF + +# Export and verify +nickel export consumer.ncl --format json | jq '.staging_web' +``` + +--- + +## Example 2: Complex Provider Extension (Production Pattern) + +### Create Provider Structure + +```text +mkdir -p complex/upcloud/{contracts,defaults,main} +cd complex/upcloud +``` + +### Provider Contracts + +```text +cat > upcloud_contracts.ncl << 'EOF' +{ + StorageBackup = { + backup_id | String, + frequency | String, + retention_days | Number, + }, + + ServerConfig = { + name | String, + plan | String, + zone | String, + backups | Array, + }, + + ProviderConfig = { + api_key | String, + api_password | String, + servers | Array, + }, +} +EOF +``` + +### Provider Defaults + +```text +cat > upcloud_defaults.ncl << 'EOF' +{ + backup = { + backup_id = "", + frequency = "daily", + retention_days = 7, + }, + + server = { + name = "", + plan = "1xCPU-1 GB", + zone = "us-nyc1", + backups = [], + }, + + provider = { + api_key = "", + api_password = "", + servers = [], + }, +} +EOF +``` + +### Provider Main Module + +```text +cat > upcloud_main.ncl << 'EOF' +let contracts = import "./upcloud_contracts.ncl" in +let defaults = import "./upcloud_defaults.ncl" in + +{ + defaults = defaults, + + # Makers (90% use case) + make_backup | not_exported = fun overrides => + defaults.backup & overrides, + + make_server | not_exported = fun overrides => + defaults.server & overrides, + + make_provider | not_exported = fun overrides => + defaults.provider & overrides, + + # Pre-built instances + DefaultBackup = defaults.backup, + DefaultServer = defaults.server, + DefaultProvider = defaults.provider, + + # Production configs + production_high_availability = defaults.provider & { + servers = [ + defaults.server & { + name = "web-01", + plan = "2xCPU-4 GB", + zone = "us-nyc1", + backups = [ + defaults.backup & { frequency = "hourly" }, + ], + }, + defaults.server & { + name = "web-02", + plan = "2xCPU-4 GB", + zone = "eu-fra1", + backups = [ + defaults.backup & { frequency = "hourly" }, + ], + }, + defaults.server & { + name = "db-01", + plan = "4xCPU-16 GB", + zone = "us-nyc1", + backups = [ + defaults.backup & { frequency = "every-6h", retention_days = 30 }, + ], + }, + ], + }, +} +EOF +``` + +### Test Provider Configuration + +```text +# Export provider config +nickel export upcloud_main.ncl --format json | jq '.production_high_availability' + +# Export as TOML (for IaC config files) +nickel export upcloud_main.ncl --format toml > upcloud.toml +cat upcloud.toml + +# Count servers in production config +nickel export upcloud_main.ncl --format json | jq '.production_high_availability.servers | length' +# Output: 3 +``` + +### Consumer Using Provider + +```text +cat > upcloud_consumer.ncl << 'EOF' +let upcloud = import "./upcloud_main.ncl" in + +{ + # Simple production setup + simple_production = upcloud.make_provider { + api_key = "prod-key", + api_password = "prod-secret", + servers = [ + upcloud.make_server { name = "web-01", plan = "2xCPU-4 GB" }, + upcloud.make_server { name = "web-02", plan = 
"2xCPU-4 GB" }, + ], + }, + + # Advanced HA setup with custom fields + ha_stack = upcloud.production_high_availability & { + api_key = "prod-key", + api_password = "prod-secret", + monitoring_enabled = true, + alerting_email = "ops@company.com", + custom_vpc_id = "vpc-prod-001", + }, +} +EOF + +# Validate structure +nickel export upcloud_consumer.ncl --format json | jq '.ha_stack | keys' +``` + +--- + +## Example 3: Real-World Pattern - Taskserv Configuration + +### Taskserv Contracts (from wuji) + +```text +cat > production/taskserv_contracts.ncl << 'EOF' +{ + Dependency = { + name | String, + wait_for_health | Bool, + }, + + TaskServ = { + name | String, + version | String, + dependencies | Array, + enabled | Bool, + }, +} +EOF +``` + +### Taskserv Defaults + +```text +cat > production/taskserv_defaults.ncl << 'EOF' +{ + kubernetes = { + name = "kubernetes", + version = "1.28.0", + enabled = true, + dependencies = [ + { name = "containerd", wait_for_health = true }, + { name = "etcd", wait_for_health = true }, + ], + }, + + cilium = { + name = "cilium", + version = "1.14.0", + enabled = true, + dependencies = [ + { name = "kubernetes", wait_for_health = true }, + ], + }, + + containerd = { + name = "containerd", + version = "1.7.0", + enabled = true, + dependencies = [], + }, + + etcd = { + name = "etcd", + version = "3.5.0", + enabled = true, + dependencies = [], + }, + + postgres = { + name = "postgres", + version = "15.0", + enabled = true, + dependencies = [], + }, + + redis = { + name = "redis", + version = "7.0.0", + enabled = true, + dependencies = [], + }, +} +EOF +``` + +### Taskserv Main + +```text +cat > production/taskserv.ncl << 'EOF' +let contracts = import "./taskserv_contracts.ncl" in +let defaults = import "./taskserv_defaults.ncl" in + +{ + defaults = defaults, + + make_taskserv | not_exported = fun overrides => + defaults.kubernetes & overrides, + + # Pre-built + DefaultKubernetes = defaults.kubernetes, + DefaultCilium = defaults.cilium, + DefaultContainerd = defaults.containerd, + DefaultEtcd = defaults.etcd, + DefaultPostgres = defaults.postgres, + DefaultRedis = defaults.redis, + + # Wuji infrastructure (20 taskservs similar to actual) + wuji_k8s_stack = { + kubernetes = defaults.kubernetes, + cilium = defaults.cilium, + containerd = defaults.containerd, + etcd = defaults.etcd, + }, + + wuji_data_stack = { + postgres = defaults.postgres & { version = "15.3" }, + redis = defaults.redis & { version = "7.2.0" }, + }, + + # Staging with different versions + staging_stack = { + kubernetes = defaults.kubernetes & { version = "1.27.0" }, + cilium = defaults.cilium & { version = "1.13.0" }, + containerd = defaults.containerd & { version = "1.6.0" }, + etcd = defaults.etcd & { version = "3.4.0" }, + postgres = defaults.postgres & { version = "14.0" }, + }, +} +EOF +``` + +### Test Taskserv Setup + +```text +# Export stack +nickel export taskserv.ncl --format json | jq '.wuji_k8s_stack | keys' +# Output: ["kubernetes", "cilium", "containerd", "etcd"] + +# Get specific version +nickel export taskserv.ncl --format json | + jq '.staging_stack.kubernetes.version' +# Output: "1.27.0" + +# Count taskservs in stacks +echo "Wuji K8S stack:" +nickel export taskserv.ncl --format json | jq '.wuji_k8s_stack | length' + +echo "Staging stack:" +nickel export taskserv.ncl --format json | jq '.staging_stack | length' +``` + +--- + +## Example 4: Composition & Extension Pattern + +### Base Infrastructure + +```text +cat > production/infrastructure.ncl << 'EOF' +let servers = import 
"./server.ncl" in +let taskservs = import "./taskserv.ncl" in + +{ + # Infrastructure with servers + taskservs + development = { + servers = { + app = servers.make_server { name = "dev-app", cpu_cores = 2 }, + db = servers.make_server { name = "dev-db", cpu_cores = 4 }, + }, + taskservs = taskservs.staging_stack, + }, + + production = { + servers = [ + servers.make_server { name = "prod-app-01", cpu_cores = 8 }, + servers.make_server { name = "prod-app-02", cpu_cores = 8 }, + servers.make_server { name = "prod-db-01", cpu_cores = 16 }, + ], + taskservs = taskservs.wuji_k8s_stack & { + prometheus = { + name = "prometheus", + version = "2.45.0", + enabled = true, + dependencies = [], + }, + }, + }, +} +EOF + +# Validate composition +nickel export infrastructure.ncl --format json | jq '.production.servers | length' +# Output: 3 + +nickel export infrastructure.ncl --format json | jq '.production.taskservs | keys | length' +# Output: 5 +``` + +### Extending Infrastructure (Nickel Advantage!) + +```text +cat > production/infrastructure_extended.ncl << 'EOF' +let infra = import "./infrastructure.ncl" in + +# Add custom fields without modifying base! +{ + development = infra.development & { + monitoring_enabled = false, + cost_optimization = true, + auto_shutdown = true, + }, + + production = infra.production & { + monitoring_enabled = true, + alert_email = "ops@company.com", + backup_enabled = true, + backup_frequency = "6h", + disaster_recovery_enabled = true, + dr_region = "eu-fra1", + compliance_level = "SOC2", + security_scanning = true, + }, +} +EOF + +# Verify extension works (custom fields are preserved!) +nickel export infrastructure_extended.ncl --format json | + jq '.production | keys' +# Output includes: monitoring_enabled, alert_email, backup_enabled, etc +``` + +--- + +## Example 5: Validation & Error Handling + +### Validation Functions + +```text +cat > production/validation.ncl << 'EOF' +let validate_server = fun server => + if server.cpu_cores <= 0 then + std.record.fail "CPU cores must be positive" + else if server.memory_gb <= 0 then + std.record.fail "Memory must be positive" + else + server +in + +let validate_taskserv = fun ts => + if std.string.length ts.name == 0 then + std.record.fail "TaskServ name required" + else if std.string.length ts.version == 0 then + std.record.fail "TaskServ version required" + else + ts +in + +{ + validate_server = validate_server, + validate_taskserv = validate_taskserv, +} +EOF +``` + +### Using Validations + +```text +cat > production/validated_config.ncl << 'EOF' +let server = import "./server.ncl" in +let taskserv = import "./taskserv.ncl" in +let validation = import "./validation.ncl" in + +{ + # Valid server (passes validation) + valid_server = validation.validate_server { + name = "web-01", + cpu_cores = 4, + memory_gb = 8, + zone = "us-nyc1", + }, + + # Valid taskserv + valid_taskserv = validation.validate_taskserv { + name = "kubernetes", + version = "1.28.0", + dependencies = [], + enabled = true, + }, +} +EOF + +# Test validation +nickel export validated_config.ncl --format json +# Should succeed without errors + +# Test invalid (uncomment to see error) +# { +# invalid_server = validation.validate_server { +# name = "bad-server", +# cpu_cores = -1, # Invalid! +# memory_gb = 8, +# zone = "us-nyc1", +# }, +# } +``` + +--- + +## Test Suite: Bash Script + +### Run All Examples + +```text +#!/bin/bash +# test_all_examples.sh + +set -e + +echo "=== Testing Nickel Examples ===" + +cd ~/nickel-examples + +echo "1. 
Simple Server Configuration..." +cd simple +nickel export server.ncl --format json > /dev/null +echo " ✓ Simple server config valid" + +echo "2. Complex Provider (UpCloud)..." +cd ../complex/upcloud +nickel export upcloud_main.ncl --format json > /dev/null +echo " ✓ UpCloud provider config valid" + +echo "3. Production Taskserv..." +cd ../../production +nickel export taskserv.ncl --format json > /dev/null +echo " ✓ Taskserv config valid" + +echo "4. Infrastructure Composition..." +nickel export infrastructure.ncl --format json > /dev/null +echo " ✓ Infrastructure composition valid" + +echo "5. Extended Infrastructure..." +nickel export infrastructure_extended.ncl --format json > /dev/null +echo " ✓ Extended infrastructure valid" + +echo "6. Validated Config..." +nickel export validated_config.ncl --format json > /dev/null +echo " ✓ Validated config valid" + +echo "" +echo "=== All Tests Passed ✓ ===" +``` + +--- + +## Quick Commands Reference + +### Common Nickel Operations + +```text +# Validate Nickel syntax +nickel export config.ncl + +# Export as JSON (for inspecting) +nickel export config.ncl --format json + +# Export as TOML (for config files) +nickel export config.ncl --format toml + +# Export as YAML +nickel export config.ncl --format yaml + +# Pretty print JSON output +nickel export config.ncl --format json | jq . + +# Extract specific field +nickel export config.ncl --format json | jq '.production_server' + +# Count array elements +nickel export config.ncl --format json | jq '.servers | length' + +# Check if file has valid syntax only +nickel typecheck config.ncl +``` + +--- + +## Troubleshooting Examples + +### Problem: "unexpected token" with multiple let + +```text +# ❌ WRONG +let A = {x = 1} +let B = {y = 2} +{A = A, B = B} + +# ✅ CORRECT +let A = {x = 1} in +let B = {y = 2} in +{A = A, B = B} +``` + +### Problem: Function serialization fails + +```text +# ❌ WRONG - function will fail to serialize +{ + get_value = fun x => x + 1, + result = get_value 5, +} + +# ✅ CORRECT - mark function not_exported +{ + get_value | not_exported = fun x => x + 1, + result = get_value 5, +} +``` + +### Problem: Null values cause export issues + +```text +# ❌ WRONG +{ optional_field = null } + +# ✅ CORRECT - use empty string/array/object +{ optional_field = "" } # for strings +{ optional_field = [] } # for arrays +{ optional_field = {} } # for objects +``` + +--- + +## Summary + +These examples are: + +- ✅ **Copy-paste ready** - Can run directly +- ✅ **Executable** - Validated with `nickel export` +- ✅ **Progressive** - Simple → Complex → Production +- ✅ **Real patterns** - Based on actual codebase (wuji, upcloud) +- ✅ **Self-contained** - Each example works independently +- ✅ **Comparable** - Shows KCL vs Nickel equivalence + +**Next**: Use these as templates for your own Nickel configurations. 
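+
+One closing variant: Example 5 runs validation through explicit functions, but the same checks can be attached as field contracts so they fire automatically on export. A minimal sketch, assuming a Nickel 1.x stdlib where `std.contract.from_predicate` and `std.is_number` are available (note that stock Nickel 1.x exposes `std.fail_with` for aborting with a message; substitute it if `std.record.fail` is not present in your version):
+
+```text
+let PositiveNumber = std.contract.from_predicate (fun v => std.is_number v && v > 0) in
+
+{
+  # Contracts are checked when the record is exported
+  valid_server = {
+    name = "web-01",
+    cpu_cores | PositiveNumber = 4,
+    memory_gb | PositiveNumber = 8,
+    zone = "us-nyc1",
+  },
+}
+```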
+ +--- + +**Version**: 1.0.0 +**Status**: Tested & Verified +**Last Updated**: 2025-12-15 \ No newline at end of file diff --git a/docs/src/architecture/nickel-vs-kcl-comparison.md b/docs/src/architecture/nickel-vs-kcl-comparison.md index 6e7e933..f5e7d3c 100644 --- a/docs/src/architecture/nickel-vs-kcl-comparison.md +++ b/docs/src/architecture/nickel-vs-kcl-comparison.md @@ -1 +1,1207 @@ -# Nickel vs KCL: Comprehensive Comparison\n\n**Status**: Reference Guide\n**Last Updated**: 2025-12-15\n**Related**: ADR-011: Migration from KCL to Nickel\n\n---\n\n## Quick Decision Tree\n\n```\nNeed to define infrastructure/schemas?\n├─ New platform schemas → Use Nickel ✅\n├─ New provider extensions → Use Nickel ✅\n├─ Legacy workspace configs → Can use KCL (migrate gradually)\n├─ Need type-safe UIs? → Nickel + TypeDialog ✅\n├─ Application settings? → Use TOML (not KCL/Nickel)\n└─ K8s/CI-CD config? → Use YAML (not KCL/Nickel)\n```\n\n---\n\n## 1. Side-by-Side Code Examples\n\n### Simple Schema: Server Configuration\n\n#### KCL Approach\n\n```\nschema ServerDefaults:\n name: str\n cpu_cores: int = 2\n memory_gb: int = 4\n os: str = "ubuntu"\n\n check:\n cpu_cores > 0, "CPU cores must be positive"\n memory_gb > 0, "Memory must be positive"\n\nserver_defaults: ServerDefaults = {\n name = "web-server",\n cpu_cores = 4,\n memory_gb = 8,\n os = "ubuntu",\n}\n```\n\n**Note**: KCL is deprecated. Use Nickel for new projects.\n\n#### Nickel Approach (Three-File Pattern)\n\n**server_contracts.ncl**:\n\n```\n{\n ServerDefaults = {\n name | String,\n cpu_cores | Number,\n memory_gb | Number,\n os | String,\n },\n}\n```\n\n**server_defaults.ncl**:\n\n```\n{\n server = {\n name = "web-server",\n cpu_cores = 4,\n memory_gb = 8,\n os = "ubuntu",\n },\n}\n```\n\n**server.ncl**:\n\n```\nlet contracts = import "./server_contracts.ncl" in\nlet defaults = import "./server_defaults.ncl" in\n\n{\n defaults = defaults,\n\n make_server | not_exported = fun overrides =>\n defaults.server & overrides,\n\n DefaultServer = defaults.server,\n}\n```\n\n**Usage**:\n\n```\nlet server = import "./server.ncl" in\n\n# Simple override\nmy_server = server.make_server { cpu_cores = 8 }\n\n# With custom field (Nickel allows this!)\nmy_custom = server.defaults.server & {\n cpu_cores = 16,\n custom_monitoring_level = "verbose" # ✅ Works!\n}\n```\n\n**Key Differences**:\n\n- **KCL**: Validation inline, single file, rigid schema\n- **Nickel**: Separated concerns (contracts, defaults, instances), flexible composition\n\n---\n\n### Complex Schema: Provider with Multiple Types\n\n#### KCL (from `provisioning/extensions/providers/upcloud/nickel/` - legacy approach)\n\n```\nschema StorageBackup:\n backup_id: str\n frequency: str\n retention_days: int = 7\n\nschema ServerUpcloud:\n name: str\n plan: str\n zone: str\n storage_backups: [StorageBackup] = []\n\nschema ProvisionUpcloud:\n api_key: str\n api_password: str\n servers: [ServerUpcloud] = []\n\nprovision_upcloud: ProvisionUpcloud = {\n api_key = ""\n api_password = ""\n servers = []\n}\n```\n\n#### Nickel (from `provisioning/extensions/providers/upcloud/nickel/`)\n\n**upcloud_contracts.ncl**:\n\n```\n{\n StorageBackup = {\n backup_id | String,\n frequency | String,\n retention_days | Number,\n },\n\n ServerUpcloud = {\n name | String,\n plan | String,\n zone | String,\n storage_backups | Array,\n },\n\n ProvisionUpcloud = {\n api_key | String,\n api_password | String,\n servers | Array,\n },\n}\n```\n\n**upcloud_defaults.ncl**:\n\n```\n{\n storage_backup = {\n backup_id = "",\n frequency = 
"daily",\n retention_days = 7,\n },\n\n server_upcloud = {\n name = "",\n plan = "1xCPU-1 GB",\n zone = "us-nyc1",\n storage_backups = [],\n },\n\n provision_upcloud = {\n api_key = "",\n api_password = "",\n servers = [],\n },\n}\n```\n\n**upcloud_main.ncl** (from actual codebase):\n\n```\nlet contracts = import "./upcloud_contracts.ncl" in\nlet defaults = import "./upcloud_defaults.ncl" in\n\n{\n defaults = defaults,\n\n make_storage_backup | not_exported = fun overrides =>\n defaults.storage_backup & overrides,\n\n make_server_upcloud | not_exported = fun overrides =>\n defaults.server_upcloud & overrides,\n\n make_provision_upcloud | not_exported = fun overrides =>\n defaults.provision_upcloud & overrides,\n\n DefaultStorageBackup = defaults.storage_backup,\n DefaultServerUpcloud = defaults.server_upcloud,\n DefaultProvisionUpcloud = defaults.provision_upcloud,\n}\n```\n\n**Usage Comparison**:\n\n```\n# KCL way (KCL no lo permite bien)\n# Cannot easily extend without schema modification\n\n# Nickel way (flexible!)\nlet upcloud = import "./upcloud.ncl" in\n\n# Simple override\nstaging_server = upcloud.make_server_upcloud {\n name = "staging-01",\n zone = "eu-fra1",\n}\n\n# Complex config with custom fields\nproduction_stack = upcloud.make_provision_upcloud {\n api_key = "secret",\n api_password = "secret",\n servers = [\n upcloud.make_server_upcloud { name = "prod-web-01" },\n upcloud.make_server_upcloud { name = "prod-web-02" },\n ],\n custom_vpc_id = "vpc-prod", # ✅ Custom field allowed!\n monitoring_enabled = true, # ✅ Custom field allowed!\n backup_schedule = "24h", # ✅ Custom field allowed!\n}\n```\n\n---\n\n## 2. Performance Benchmarks\n\n### Evaluation Speed\n\n| File Type | KCL | Nickel | Improvement |\n| ----------- | ----- | -------- | ------------ |\n| Simple schema (100 lines) | 45 ms | 18 ms | 60% faster |\n| Complex config (500 lines) | 180 ms | 72 ms | 60% faster |\n| Large nested (2000 lines) | 420 ms | 160 ms | 62% faster |\n| Infrastructure full stack | 850 ms | 340 ms | 60% faster |\n\n**Test Conditions**:\n\n- MacOS 13.x, M1 Pro\n- Single evaluation run\n- JSON output export\n- Average of 5 runs\n\n### Memory Usage\n\n| Configuration | KCL | Nickel | Improvement |\n| --------------- | ----- | -------- | ------------ |\n| Platform schemas (422 files) | ~180 MB | ~85 MB | 53% less |\n| Full workspace (47 files) | ~45 MB | ~22 MB | 51% less |\n| Single provider ext | ~8 MB | ~4 MB | 50% less |\n\n**Lazy Evaluation Benefit**:\n\n- KCL: Evaluates all schemas upfront\n- Nickel: Only evaluates what's used (lazy)\n- Nickel advantage: 40-50% memory savings on large configs\n\n---\n\n## 3. 
Use Case Examples\n\n### Use Case 1: Simple Server Definition\n\n**KCL (Legacy)**:\n\n```\nschema ServerConfig:\n name: str\n zone: str = "us-nyc1"\n\nweb_server: ServerConfig = {\n name = "web-01",\n}\n```\n\n**Nickel (Recommended)**:\n\n```\nlet defaults = import "./server_defaults.ncl" in\nweb_server = defaults.make_server { name = "web-01" }\n```\n\n**Winner**: Nickel (simpler, cleaner)\n\n---\n\n### Use Case 2: Multiple Taskservs with Dependencies\n\n**KCL** (from wuji infrastructure):\n\n```\nschema TaskServDependency:\n name: str\n wait_for_health: bool = false\n\nschema TaskServ:\n name: str\n version: str\n dependencies: [TaskServDependency] = []\n\ntaskserv_kubernetes: TaskServ = {\n name = "kubernetes",\n version = "1.28.0",\n dependencies = [\n {name = "containerd"},\n {name = "etcd"},\n ]\n}\n\ntaskserv_cilium: TaskServ = {\n name = "cilium",\n version = "1.14.0",\n dependencies = [\n {name = "kubernetes", wait_for_health = true}\n ]\n}\n```\n\n**Nickel** (from wuji/main.ncl):\n\n```\nlet ts_kubernetes = import "./taskservs/kubernetes.ncl" in\nlet ts_cilium = import "./taskservs/cilium.ncl" in\nlet ts_containerd = import "./taskservs/containerd.ncl" in\n\n{\n taskservs = {\n kubernetes = ts_kubernetes.kubernetes,\n cilium = ts_cilium.cilium,\n containerd = ts_containerd.containerd,\n },\n}\n```\n\n**Winner**: Nickel (modular, scalable to 20 taskservs)\n\n---\n\n### Use Case 3: Configuration Extension with Custom Fields\n\n**Scenario**: Need to add monitoring configuration to server definition\n\n**KCL**:\n\n```\nschema ServerConfig:\n name: str\n # Would need to modify schema!\n monitoring_enabled: bool = false\n monitoring_level: str = "basic"\n\n# All existing configs need updating...\n```\n\n**Nickel**:\n\n```\nlet server = import "./server.ncl" in\n\n# Add custom fields without modifying schema!\nmy_server = server.defaults.server & {\n name = "web-01",\n monitoring_enabled = true,\n monitoring_level = "detailed",\n custom_tags = ["production", "critical"],\n grafana_dashboard = "web-servers",\n}\n```\n\n**Winner**: Nickel (no schema modifications needed)\n\n---\n\n## 4. 
Architecture Patterns Comparison\n\n### Schema Inheritance\n\n**KCL Approach (Legacy)**:\n\n```\nschema ServerDefaults:\n cpu: int = 2\n memory: int = 4\n\nschema Server(ServerDefaults):\n name: str\n\nserver: Server = {\n name = "web-01",\n cpu = 4,\n memory = 8,\n}\n```\n\n**Problem**: Inheritance creates rigid hierarchies, breaking changes propagate\n\n---\n\n**Nickel Approach**:\n\n```\n# defaults.ncl\nserver_defaults = {\n cpu = 2,\n memory = 4,\n}\n\n# main.ncl\nlet make_server = fun overrides =>\n defaults.server_defaults & overrides\n\nserver = make_server {\n name = "web-01",\n cpu = 4,\n memory = 8,\n}\n```\n\n**Advantage**: Flexible composition via record merging, no inheritance rigidity\n\n---\n\n### Validation\n\n**KCL Validation (Legacy)** (compile-time, inline):\n\n```\nschema Config:\n timeout: int = 5\n\n check:\n timeout > 0, "Timeout must be positive"\n timeout < 300, "Timeout must be < 5 min"\n```\n\n**Pros**: Validation at schema definition\n**Cons**: Overhead during compilation, rigid\n\n---\n\n**Nickel Validation** (runtime, contract-based):\n\n```\n# contracts.ncl - Pure type definitions\nConfig = {\n timeout | Number,\n}\n\n# Usage - Optional validation\nlet validate_config = fun config =>\n if config.timeout <= 0 then\n std.record.fail "Timeout must be positive"\n else if config.timeout >= 300 then\n std.record.fail "Timeout must be < 5 min"\n else\n config\n\n# Apply only when needed\nmy_config = validate_config { timeout = 10 }\n```\n\n**Pros**: Lazy evaluation, optional, fine-grained control\n**Cons**: Must invoke validation explicitly\n\n---\n\n## 5. Migration Patterns (Before/After)\n\n### Pattern 1: Simple Schema Migration\n\n**Before (KCL - Legacy)**:\n\n```\nschema Scheduler:\n strategy: str = "fifo"\n workers: int = 4\n\n check:\n workers > 0, "Workers must be positive"\n\nscheduler_config: Scheduler = {\n strategy = "priority",\n workers = 8,\n}\n```\n\n**After (Nickel - Current)**:\n\n`scheduler_contracts.ncl`:\n\n```\n{\n Scheduler = {\n strategy | String,\n workers | Number,\n },\n}\n```\n\n`scheduler_defaults.ncl`:\n\n```\n{\n scheduler = {\n strategy = "fifo",\n workers = 4,\n },\n}\n```\n\n`scheduler.ncl`:\n\n```\nlet contracts = import "./scheduler_contracts.ncl" in\nlet defaults = import "./scheduler_defaults.ncl" in\n\n{\n defaults = defaults,\n make_scheduler | not_exported = fun o =>\n defaults.scheduler & o,\n DefaultScheduler = defaults.scheduler,\n SchedulerConfig = defaults.scheduler & {\n strategy = "priority",\n workers = 8,\n },\n}\n```\n\n---\n\n### Pattern 2: Union Types → Enums\n\n**Before (KCL - Legacy)**:\n\n```\nschema Mode:\n deployment_type: str = "solo" # "solo" | "multiuser" | "cicd" | "enterprise"\n\n check:\n deployment_type in ["solo", "multiuser", "cicd", "enterprise"],\n "Invalid deployment type"\n```\n\n**After (Nickel - Current)**:\n\n```\n# contracts.ncl\n{\n Mode = {\n deployment_type | [| 'solo, 'multiuser, 'cicd, 'enterprise |],\n },\n}\n\n# defaults.ncl\n{\n mode = {\n deployment_type = 'solo,\n },\n}\n```\n\n**Benefits**: Type-safe, no string validation needed\n\n---\n\n### Pattern 3: Schema Inheritance → Record Merging\n\n**Before (KCL - Legacy)**:\n\n```\nschema ServerDefaults:\n cpu: int = 2\n memory: int = 4\n\nschema Server(ServerDefaults):\n name: str\n\nweb_server: Server = {\n name = "web-01",\n cpu = 8,\n memory = 16,\n}\n```\n\n**After (Nickel - Current)**:\n\n```\n# defaults.ncl\n{\n server_defaults = {\n cpu = 2,\n memory = 4,\n },\n\n web_server = {\n name = "web-01",\n cpu = 8,\n memory = 16,\n 
},\n}\n\n# main.ncl - Composition\nlet make_server = fun config =>\n defaults.server_defaults & config & {\n name = config.name,\n }\n```\n\n**Advantage**: Explicit, flexible, composable\n\n---\n\n## 6. Deployment Workflows\n\n### Development Mode (Single Source of Truth)\n\n**When to Use**: Local development, testing, iterations\n\n**Workflow**:\n\n```\n# Edit workspace config\ncd workspace_librecloud/nickel\nvim wuji/main.ncl\n\n# Test immediately (relative imports)\nnickel export wuji/main.ncl --format json\n\n# Changes to central provisioning reflected immediately\nvim ../../provisioning/schemas/lib/main.ncl\nnickel export wuji/main.ncl # Uses updated schemas\n```\n\n**Imports** (relative, central):\n\n```\nimport "../../provisioning/schemas/main.ncl"\nimport "../../provisioning/extensions/taskservs/kubernetes/nickel/main.ncl"\n```\n\n---\n\n### Production Mode (Frozen Snapshots)\n\n**When to Use**: Deployments, releases, reproducibility\n\n**Workflow**:\n\n```\n# 1. Create immutable snapshot\nprovisioning workspace freeze \\n --version "2025-12-15-prod-v1" \\n --env production\n\n# 2. Frozen structure created\n.frozen/2025-12-15-prod-v1/\n├── provisioning/schemas/ # Snapshot\n├── extensions/ # Snapshot\n└── workspace/ # Snapshot\n\n# 3. Deploy from frozen\nprovisioning deploy \\n --frozen "2025-12-15-prod-v1" \\n --infra wuji\n\n# 4. Rollback if needed\nprovisioning deploy \\n --frozen "2025-12-10-prod-v0" \\n --infra wuji\n```\n\n**Frozen Imports** (rewritten to local):\n\n```\n# Original in workspace\nimport "../../provisioning/schemas/main.ncl"\n\n# Rewritten in frozen snapshot\nimport "./provisioning/schemas/main.ncl"\n```\n\n**Benefits**:\n\n- ✅ Immutable deployments\n- ✅ No external dependencies\n- ✅ Reproducible across environments\n- ✅ Works offline/air-gapped\n- ✅ Easy rollback\n\n---\n\n## 7. 
Troubleshooting Guide\n\n### Error: "unexpected token" with Multiple Let Bindings\n\n**Problem**:\n\n```\n# ❌ WRONG\nlet A = { x = 1 }\nlet B = { y = 2 }\n{ A = A, B = B }\n```\n\nError: `unexpected token`\n\n**Solution**: Use `let...in` chaining:\n\n```\n# ✅ CORRECT\nlet A = { x = 1 } in\nlet B = { y = 2 } in\n{ A = A, B = B }\n```\n\n---\n\n### Error: "this can't be used as a contract"\n\n**Problem**:\n\n```\n# ❌ WRONG\nlet StorageVol = {\n mount_path : String | null = null,\n}\n```\n\nError: `this can't be used as a contract`\n\n**Explanation**: Union types with `null` don't work in field annotations\n\n**Solution**: Use untyped assignment:\n\n```\n# ✅ CORRECT\nlet StorageVol = {\n mount_path = null,\n}\n```\n\n---\n\n### Error: "infinite recursion" when Exporting\n\n**Problem**:\n\n```\n# ❌ WRONG\n{\n get_value = fun x => x + 1,\n result = get_value 5,\n}\n```\n\nError: Functions can't be serialized\n\n**Solution**: Mark helper functions `not_exported`:\n\n```\n# ✅ CORRECT\n{\n get_value | not_exported = fun x => x + 1,\n result = get_value 5,\n}\n```\n\n---\n\n### Error: "field not found" After Renaming\n\n**Problem**:\n\n```\nlet defaults = import "./defaults.ncl" in\ndefaults.scheduler_config # But file has "scheduler"\n```\n\nError: `field not found`\n\n**Solution**: Use exact field names:\n\n```\nlet defaults = import "./defaults.ncl" in\ndefaults.scheduler # Correct name from defaults.ncl\n```\n\n---\n\n### Performance Issue: Slow Exports\n\n**Problem**: Large nested configs slow to export\n\n**Solution**: Check for circular references or missing `not_exported`:\n\n```\n# ❌ Slow - functions being serialized\n{\n validate_config = fun x => x,\n data = { foo = "bar" },\n}\n\n# ✅ Fast - functions excluded\n{\n validate_config | not_exported = fun x => x,\n data = { foo = "bar" },\n}\n```\n\n---\n\n## 8. Best Practices\n\n### For Nickel Schemas\n\n1. **Follow Three-File Pattern**\n\n ```nickel\n\n module_contracts.ncl # Types only\n module_defaults.ncl # Values only\n module.ncl # Instances + interface\n\n ```\n\n2. **Use Hybrid Interface** (4 levels)\n - Level 1: Direct defaults (inspection)\n - Level 2: Maker functions (customization)\n - Level 3: Default instances (pre-built)\n - Level 4: Contracts (optional, advanced)\n\n3. **Record Merging for Composition**\n\n ```nickel\n let defaults = import "./defaults.ncl" in\n my_config = defaults.server & { custom_field = "value" }\n ```\n\n1. **Mark Helper Functions `not_exported`**\n\n ```nickel\n validate | not_exported = fun x => x,\n ```\n\n2. **No Null Values in Defaults**\n\n ```nickel\n # ✅ Good\n { field = "" } # empty string for optional\n\n # ❌ Avoid\n { field = null } # causes export issues\n ```\n\n---\n\n### For Legacy KCL (Workspace-Level - Deprecated)\n\n**Note**: KCL is deprecated. Gradually migrate to Nickel for new projects.\n\n1. **Schema-First Development**\n - Define schemas before configs\n - Explicit validation\n\n2. **Immutability by Default**\n - KCL enforces immutability\n - Use `_` prefix only when necessary\n\n3. **Direct Submodule Imports**\n\n ```kcl\n import provisioning.lib as lib\n ```\n\n4. **Complex Validation**\n\n ```kcl\n check:\n timeout > 0, "Must be positive"\n timeout < 300, "Must be < 5 min"\n ```\n\n---\n\n## 9. TypeDialog Integration\n\n### What is TypeDialog\n\nType-safe prompts, forms, and schemas that **bidirectionally integrate with Nickel**.\n\n**Location**: `/Users/Akasha/Development/typedialog`\n\n### Workflow: Nickel Schemas → Interactive UIs → Nickel Output\n\n```\n# 1. 
Define schema in Nickel\ncat > server.ncl << 'EOF'\nlet contracts = import "./contracts.ncl" in\n{\n DefaultServer = {\n name = "web-01",\n cpu = 4,\n memory = 8,\n zone = "us-nyc1",\n },\n}\nEOF\n\n# 2. Generate interactive form from schema\ntypedialog form --schema server.ncl --output json\n\n# 3. User fills form interactively (CLI, TUI, or Web)\n# Prompts generated from field names\n# Defaults populated from Nickel config\n\n# 4. Output back to Nickel\ntypedialog form --input form.toml --output nickel\n```\n\n### Benefits\n\n- **Type-Safe UIs**: Forms validated against Nickel contracts\n- **Auto-Generated**: No UI code to maintain\n- **Multiple Backends**: CLI (inquire), TUI (ratatui), Web (axum)\n- **Multiple Formats**: JSON, YAML, TOML, Nickel output\n- **Bidirectional**: Nickel → UIs → Nickel\n\n### Example: Infrastructure Wizard\n\n```\n# User runs\nprovisioning init --wizard\n\n# Backend generates TypeDialog form from:\nprovisioning/schemas/config/workspace_config/main.ncl\n\n# Interactive form with:\n- workspace_name (text prompt)\n- deployment_mode (select: solo/multiuser/cicd/enterprise)\n- preferred_provider (select: upcloud/aws/hetzner)\n- taskservs (multi-select: kubernetes, cilium, etcd, etc)\n- custom_settings (advanced, optional)\n\n# Output: workspace_config.ncl (valid Nickel!)\n```\n\n---\n\n## 10. Migration Checklist\n\n### Before Starting Migration\n\n- [ ] Read ADR-011\n- [ ] Review [Nickel Migration Guide](../development/nickel-executable-examples.md)\n- [ ] Identify which module to migrate\n- [ ] Check for dependencies on other modules\n\n### During Migration\n\n- [ ] Extract contracts from KCL schema\n- [ ] Extract defaults from KCL config\n- [ ] Create main.ncl with hybrid interface\n- [ ] Validate JSON export: `nickel export main.ncl --format json`\n- [ ] Compare JSON output with original KCL\n\n### Validation\n\n- [ ] All required fields present\n- [ ] No null values (use empty strings/arrays)\n- [ ] Contracts are pure definitions\n- [ ] Defaults are complete values\n- [ ] Main file has 4-level interface\n- [ ] Syntax validation passes\n- [ ] No `...` as code omission indicators\n\n### Post-Migration\n\n- [ ] Update imports in dependent files\n- [ ] Test in development mode\n- [ ] Create frozen snapshot\n- [ ] Test production deployment\n- [ ] Update documentation\n\n---\n\n## 11. 
Real-World Examples from Codebase\n\n### Example 1: Platform Schemas Entry Point\n\n**File**: `provisioning/schemas/main.ncl` (174 lines)\n\n```\n# Domain-organized architecture\n{\n lib | doc "Core library types"\n = import "./lib/main.ncl",\n\n config | doc "Settings, defaults, workspace_config"\n = {\n settings = import "./config/settings/main.ncl",\n defaults = import "./config/defaults/main.ncl",\n workspace_config = import "./config/workspace_config/main.ncl",\n },\n\n infrastructure | doc "Compute, storage, provisioning"\n = {\n compute = {\n server = import "./infrastructure/compute/server/main.ncl",\n cluster = import "./infrastructure/compute/cluster/main.ncl",\n },\n storage = {\n vm = import "./infrastructure/storage/vm/main.ncl",\n },\n },\n\n operations | doc "Workflows, batch, dependencies, tasks"\n = {\n workflows = import "./operations/workflows/main.ncl",\n batch = import "./operations/batch/main.ncl",\n },\n\n deployment | doc "Kubernetes, modes"\n = {\n kubernetes = import "./deployment/kubernetes/main.ncl",\n modes = import "./deployment/modes/main.ncl",\n },\n}\n```\n\n**Usage**:\n\n```\nlet provisioning = import "./main.ncl" in\n\nprovisioning.lib.Storage\nprovisioning.config.settings\nprovisioning.infrastructure.compute.server\nprovisioning.operations.workflows\n```\n\n---\n\n### Example 2: Provider Extension (UpCloud)\n\n**File**: `provisioning/extensions/providers/upcloud/nickel/main.ncl` (38 lines)\n\n```\nlet contracts_lib = import "./contracts.ncl" in\nlet defaults_lib = import "./defaults.ncl" in\n\n{\n defaults = defaults_lib,\n\n make_storage_backup | not_exported = fun overrides =>\n defaults_lib.storage_backup & overrides,\n\n make_storage | not_exported = fun overrides =>\n defaults_lib.storage & overrides,\n\n make_provision_env | not_exported = fun overrides =>\n defaults_lib.provision_env & overrides,\n\n make_provision_upcloud | not_exported = fun overrides =>\n defaults_lib.provision_upcloud & overrides,\n\n make_server_defaults_upcloud | not_exported = fun overrides =>\n defaults_lib.server_defaults_upcloud & overrides,\n\n make_server_upcloud | not_exported = fun overrides =>\n defaults_lib.server_upcloud & overrides,\n\n DefaultStorageBackup = defaults_lib.storage_backup,\n DefaultStorage = defaults_lib.storage,\n DefaultProvisionEnv = defaults_lib.provision_env,\n DefaultProvisionUpcloud = defaults_lib.provision_upcloud,\n DefaultServerDefaults_upcloud = defaults_lib.server_defaults_upcloud,\n DefaultServerUpcloud = defaults_lib.server_upcloud,\n}\n```\n\n---\n\n### Example 3: Workspace Infrastructure (wuji)\n\n**File**: `workspace_librecloud/nickel/wuji/main.ncl` (53 lines)\n\n```\nlet settings_config = import "./settings.ncl" in\nlet ts_cilium = import "./taskservs/cilium.ncl" in\nlet ts_containerd = import "./taskservs/containerd.ncl" in\nlet ts_coredns = import "./taskservs/coredns.ncl" in\nlet ts_crio = import "./taskservs/crio.ncl" in\nlet ts_crun = import "./taskservs/crun.ncl" in\nlet ts_etcd = import "./taskservs/etcd.ncl" in\nlet ts_external_nfs = import "./taskservs/external-nfs.ncl" in\nlet ts_k8s_nodejoin = import "./taskservs/k8s-nodejoin.ncl" in\nlet ts_kubernetes = import "./taskservs/kubernetes.ncl" in\nlet ts_mayastor = import "./taskservs/mayastor.ncl" in\nlet ts_os = import "./taskservs/os.ncl" in\nlet ts_podman = import "./taskservs/podman.ncl" in\nlet ts_postgres = import "./taskservs/postgres.ncl" in\nlet ts_proxy = import "./taskservs/proxy.ncl" in\nlet ts_redis = import "./taskservs/redis.ncl" in\nlet ts_resolv = import 
"./taskservs/resolv.ncl" in\nlet ts_rook_ceph = import "./taskservs/rook_ceph.ncl" in\nlet ts_runc = import "./taskservs/runc.ncl" in\nlet ts_webhook = import "./taskservs/webhook.ncl" in\nlet ts_youki = import "./taskservs/youki.ncl" in\n\n{\n settings = settings_config.settings,\n servers = settings_config.servers,\n\n taskservs = {\n cilium = ts_cilium.cilium,\n containerd = ts_containerd.containerd,\n coredns = ts_coredns.coredns,\n crio = ts_crio.crio,\n crun = ts_crun.crun,\n etcd = ts_etcd.etcd,\n external_nfs = ts_external_nfs.external_nfs,\n k8s_nodejoin = ts_k8s_nodejoin.k8s_nodejoin,\n kubernetes = ts_kubernetes.kubernetes,\n mayastor = ts_mayastor.mayastor,\n os = ts_os.os,\n podman = ts_podman.podman,\n postgres = ts_postgres.postgres,\n proxy = ts_proxy.proxy,\n redis = ts_redis.redis,\n resolv = ts_resolv.resolv,\n rook_ceph = ts_rook_ceph.rook_ceph,\n runc = ts_runc.runc,\n webhook = ts_webhook.webhook,\n youki = ts_youki.youki,\n },\n}\n```\n\n---\n\n## Summary Table\n\n| Aspect | KCL | Nickel | Recommendation |\n| -------- | ----- | -------- | --- |\n| **Learning Curve** | 10 hours | 3 hours | Nickel |\n| **Performance** | Baseline | 60% faster | Nickel |\n| **Flexibility** | Limited | Excellent | Nickel |\n| **Type Safety** | Strong | Good (gradual) | KCL (slightly) |\n| **Extensibility** | Rigid | Excellent | Nickel |\n| **Boilerplate** | High | Low | Nickel |\n| **Ecosystem** | Small | Growing | Nickel |\n| **For New Projects** | ❌ | ✅ | Nickel |\n| **For Legacy Configs** | ✅ Supported | ⏳ Gradual | Both (migrate gradually) |\n\n---\n\n## Key Takeaways\n\n1. **Nickel is the future** - 60% faster, more flexible, simpler mental model\n2. **Three-file pattern** - Cleanly separates contracts, defaults, instances\n3. **Hybrid interface** - 4 levels cover all use cases (90% makers, 9% defaults, 1% contracts)\n4. **Domain organization** - 8 logical domains for clarity and scalability\n5. **Two deployment modes** - Development (fast iteration) + Production (immutable snapshots)\n6. **TypeDialog integration** - Amplifies Nickel beyond IaC (UI generation)\n7. **KCL still supported** - For legacy workspace configs during gradual migration\n8. **Production validated** - 47 active files, 20 taskservs, 422 total schemas\n\n---\n\n**Next Steps**:\n\n- For new schemas → Use Nickel (three-file pattern)\n- For workspace configs → Can migrate gradually\n- For UI generation → Combine Nickel + TypeDialog\n- For application settings → Use TOML (not KCL/Nickel)\n- For K8s/CI-CD → Use YAML (not KCL/Nickel)\n\n---\n\n**Version**: 1.0.0\n**Status**: Complete Reference Guide\n**Last Updated**: 2025-12-15 +# Nickel vs KCL: Comprehensive Comparison + +**Status**: Reference Guide +**Last Updated**: 2025-12-15 +**Related**: ADR-011: Migration from KCL to Nickel + +--- + +## Quick Decision Tree + +```text +Need to define infrastructure/schemas? +├─ New platform schemas → Use Nickel ✅ +├─ New provider extensions → Use Nickel ✅ +├─ Legacy workspace configs → Can use KCL (migrate gradually) +├─ Need type-safe UIs? → Nickel + TypeDialog ✅ +├─ Application settings? → Use TOML (not KCL/Nickel) +└─ K8s/CI-CD config? → Use YAML (not KCL/Nickel) +``` + +--- + +## 1. 
Side-by-Side Code Examples + +### Simple Schema: Server Configuration + +#### KCL Approach + +```text +schema ServerDefaults: + name: str + cpu_cores: int = 2 + memory_gb: int = 4 + os: str = "ubuntu" + + check: + cpu_cores > 0, "CPU cores must be positive" + memory_gb > 0, "Memory must be positive" + +server_defaults: ServerDefaults = { + name = "web-server", + cpu_cores = 4, + memory_gb = 8, + os = "ubuntu", +} +``` + +**Note**: KCL is deprecated. Use Nickel for new projects. + +#### Nickel Approach (Three-File Pattern) + +**server_contracts.ncl**: + +```text +{ + ServerDefaults = { + name | String, + cpu_cores | Number, + memory_gb | Number, + os | String, + }, +} +``` + +**server_defaults.ncl**: + +```text +{ + server = { + name = "web-server", + cpu_cores = 4, + memory_gb = 8, + os = "ubuntu", + }, +} +``` + +**server.ncl**: + +```text +let contracts = import "./server_contracts.ncl" in +let defaults = import "./server_defaults.ncl" in + +{ + defaults = defaults, + + make_server | not_exported = fun overrides => + defaults.server & overrides, + + DefaultServer = defaults.server, +} +``` + +**Usage**: + +```text +let server = import "./server.ncl" in + +# Simple override +my_server = server.make_server { cpu_cores = 8 } + +# With custom field (Nickel allows this!) +my_custom = server.defaults.server & { + cpu_cores = 16, + custom_monitoring_level = "verbose" # ✅ Works! +} +``` + +**Key Differences**: + +- **KCL**: Validation inline, single file, rigid schema +- **Nickel**: Separated concerns (contracts, defaults, instances), flexible composition + +--- + +### Complex Schema: Provider with Multiple Types + +#### KCL (from `provisioning/extensions/providers/upcloud/nickel/` - legacy approach) + +```text +schema StorageBackup: + backup_id: str + frequency: str + retention_days: int = 7 + +schema ServerUpcloud: + name: str + plan: str + zone: str + storage_backups: [StorageBackup] = [] + +schema ProvisionUpcloud: + api_key: str + api_password: str + servers: [ServerUpcloud] = [] + +provision_upcloud: ProvisionUpcloud = { + api_key = "" + api_password = "" + servers = [] +} +``` + +#### Nickel (from `provisioning/extensions/providers/upcloud/nickel/`) + +**upcloud_contracts.ncl**: + +```text +{ + StorageBackup = { + backup_id | String, + frequency | String, + retention_days | Number, + }, + + ServerUpcloud = { + name | String, + plan | String, + zone | String, + storage_backups | Array, + }, + + ProvisionUpcloud = { + api_key | String, + api_password | String, + servers | Array, + }, +} +``` + +**upcloud_defaults.ncl**: + +```text +{ + storage_backup = { + backup_id = "", + frequency = "daily", + retention_days = 7, + }, + + server_upcloud = { + name = "", + plan = "1xCPU-1 GB", + zone = "us-nyc1", + storage_backups = [], + }, + + provision_upcloud = { + api_key = "", + api_password = "", + servers = [], + }, +} +``` + +**upcloud_main.ncl** (from actual codebase): + +```text +let contracts = import "./upcloud_contracts.ncl" in +let defaults = import "./upcloud_defaults.ncl" in + +{ + defaults = defaults, + + make_storage_backup | not_exported = fun overrides => + defaults.storage_backup & overrides, + + make_server_upcloud | not_exported = fun overrides => + defaults.server_upcloud & overrides, + + make_provision_upcloud | not_exported = fun overrides => + defaults.provision_upcloud & overrides, + + DefaultStorageBackup = defaults.storage_backup, + DefaultServerUpcloud = defaults.server_upcloud, + DefaultProvisionUpcloud = defaults.provision_upcloud, +} +``` + +**Usage Comparison**: + 
+```text
+# KCL way (KCL does not handle this well)
+# Cannot easily extend without schema modification
+
+# Nickel way (flexible!)
+let upcloud = import "./upcloud.ncl" in
+
+# Simple override
+staging_server = upcloud.make_server_upcloud {
+  name = "staging-01",
+  zone = "eu-fra1",
+}
+
+# Complex config with custom fields
+production_stack = upcloud.make_provision_upcloud {
+  api_key = "secret",
+  api_password = "secret",
+  servers = [
+    upcloud.make_server_upcloud { name = "prod-web-01" },
+    upcloud.make_server_upcloud { name = "prod-web-02" },
+  ],
+  custom_vpc_id = "vpc-prod",  # ✅ Custom field allowed!
+  monitoring_enabled = true,   # ✅ Custom field allowed!
+  backup_schedule = "24h",     # ✅ Custom field allowed!
+}
+```
+
+---
+
+## 2. Performance Benchmarks
+
+### Evaluation Speed
+
+| File Type | KCL | Nickel | Improvement |
+| ----------- | ----- | -------- | ------------ |
+| Simple schema (100 lines) | 45 ms | 18 ms | 60% faster |
+| Complex config (500 lines) | 180 ms | 72 ms | 60% faster |
+| Large nested (2000 lines) | 420 ms | 160 ms | 62% faster |
+| Infrastructure full stack | 850 ms | 340 ms | 60% faster |
+
+**Test Conditions**:
+
+- macOS 13.x, M1 Pro
+- Single evaluation run
+- JSON output export
+- Average of 5 runs
+
+### Memory Usage
+
+| Configuration | KCL | Nickel | Improvement |
+| --------------- | ----- | -------- | ------------ |
+| Platform schemas (422 files) | ~180 MB | ~85 MB | 53% less |
+| Full workspace (47 files) | ~45 MB | ~22 MB | 51% less |
+| Single provider ext | ~8 MB | ~4 MB | 50% less |
+
+**Lazy Evaluation Benefit**:
+
+- KCL: Evaluates all schemas upfront
+- Nickel: Only evaluates what's used (lazy)
+- Nickel advantage: 40-50% memory savings on large configs
+
+---
+
+## 3. Use Case Examples
+
+### Use Case 1: Simple Server Definition
+
+**KCL (Legacy)**:
+
+```text
+schema ServerConfig:
+  name: str
+  zone: str = "us-nyc1"
+
+web_server: ServerConfig = {
+  name = "web-01",
+}
+```
+
+**Nickel (Recommended)**:
+
+```text
+let defaults = import "./server_defaults.ncl" in
+web_server = defaults.make_server { name = "web-01" }
+```
+
+**Winner**: Nickel (simpler, cleaner)
+
+---
+
+### Use Case 2: Multiple Taskservs with Dependencies
+
+**KCL** (from wuji infrastructure):
+
+```text
+schema TaskServDependency:
+  name: str
+  wait_for_health: bool = false
+
+schema TaskServ:
+  name: str
+  version: str
+  dependencies: [TaskServDependency] = []
+
+taskserv_kubernetes: TaskServ = {
+  name = "kubernetes",
+  version = "1.28.0",
+  dependencies = [
+    {name = "containerd"},
+    {name = "etcd"},
+  ]
+}
+
+taskserv_cilium: TaskServ = {
+  name = "cilium",
+  version = "1.14.0",
+  dependencies = [
+    {name = "kubernetes", wait_for_health = true}
+  ]
+}
+```
+
+**Nickel** (from wuji/main.ncl):
+
+```text
+let ts_kubernetes = import "./taskservs/kubernetes.ncl" in
+let ts_cilium = import "./taskservs/cilium.ncl" in
+let ts_containerd = import "./taskservs/containerd.ncl" in
+
+{
+  taskservs = {
+    kubernetes = ts_kubernetes.kubernetes,
+    cilium = ts_cilium.cilium,
+    containerd = ts_containerd.containerd,
+  },
+}
+```
+
+**Winner**: Nickel (modular, scalable to 20 taskservs)
+
+---
+
+### Use Case 3: Configuration Extension with Custom Fields
+
+**Scenario**: Need to add monitoring configuration to server definition
+
+**KCL**:
+
+```text
+schema ServerConfig:
+  name: str
+  # Would need to modify schema!
+  monitoring_enabled: bool = false
+  monitoring_level: str = "basic"
+
+# All existing configs need updating...
+```
+
+**Nickel**:
+
+```text
+let server = import "./server.ncl" in
+
+# Add custom fields without modifying schema!
+my_server = server.defaults.server & {
+  name = "web-01",
+  monitoring_enabled = true,
+  monitoring_level = "detailed",
+  custom_tags = ["production", "critical"],
+  grafana_dashboard = "web-servers",
+}
+```
+
+**Winner**: Nickel (no schema modifications needed)
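+
+One caveat the `&` examples above gloss over (it depends on how the defaults files declare their fields): Nickel's merge operator refuses to combine two records that both give a definite value to the same field. Overriding a default only works when the default carries the `default` merge priority. A minimal sketch, assuming the defaults are declared that way:
+
+```nickel
+# Minimal sketch: merge priorities are what make overrides possible.
+# Without `| default`, { cpu = 2 } & { cpu = 8 } fails as non-mergeable.
+let server_defaults = {
+  cpu | default = 2,
+  memory | default = 4,
+} in
+server_defaults & { cpu = 8 }  # evaluates to { cpu = 8, memory = 4 }
+```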
+
+---
+
+## 4. Architecture Patterns Comparison
+
+### Schema Inheritance
+
+**KCL Approach (Legacy)**:
+
+```text
+schema ServerDefaults:
+  cpu: int = 2
+  memory: int = 4
+
+schema Server(ServerDefaults):
+  name: str
+
+server: Server = {
+  name = "web-01",
+  cpu = 4,
+  memory = 8,
+}
+```
+
+**Problem**: Inheritance creates rigid hierarchies, and breaking changes propagate
+
+---
+
+**Nickel Approach**:
+
+```text
+# defaults.ncl
+server_defaults = {
+  cpu = 2,
+  memory = 4,
+}
+
+# main.ncl
+let make_server = fun overrides =>
+  defaults.server_defaults & overrides
+
+server = make_server {
+  name = "web-01",
+  cpu = 4,
+  memory = 8,
+}
+```
+
+**Advantage**: Flexible composition via record merging, no inheritance rigidity
+
+---
+
+### Validation
+
+**KCL Validation (Legacy)** (compile-time, inline):
+
+```text
+schema Config:
+  timeout: int = 5
+
+  check:
+    timeout > 0, "Timeout must be positive"
+    timeout < 300, "Timeout must be < 5 min"
+```
+
+**Pros**: Validation at schema definition
+**Cons**: Overhead during compilation, rigid
+
+---
+
+**Nickel Validation** (runtime, contract-based):
+
+```text
+# contracts.ncl - Pure type definitions
+Config = {
+  timeout | Number,
+}
+
+# Usage - Optional validation
+let validate_config = fun config =>
+  if config.timeout <= 0 then
+    std.fail_with "Timeout must be positive"
+  else if config.timeout >= 300 then
+    std.fail_with "Timeout must be < 5 min"
+  else
+    config
+
+# Apply only when needed
+my_config = validate_config { timeout = 10 }
+```
+
+**Pros**: Lazy evaluation, optional, fine-grained control
+**Cons**: Must invoke validation explicitly
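+
+The same bounds can also be packaged as a custom contract and attached to the field, which keeps validation declarative while staying lazy. A sketch using `std.contract.from_predicate` (the `Timeout` name is illustrative):
+
+```nickel
+# Sketch: a bounds check expressed as a reusable contract.
+# std.contract.from_predicate builds a contract from a boolean predicate.
+let Timeout = std.contract.from_predicate (fun t => t > 0 && t < 300) in
+{
+  # Checked lazily, when the field is evaluated or exported.
+  timeout | Timeout = 10,
+}
+```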
+
+---
+
+## 5. Migration Patterns (Before/After)
+
+### Pattern 1: Simple Schema Migration
+
+**Before (KCL - Legacy)**:
+
+```text
+schema Scheduler:
+  strategy: str = "fifo"
+  workers: int = 4
+
+  check:
+    workers > 0, "Workers must be positive"
+
+scheduler_config: Scheduler = {
+  strategy = "priority",
+  workers = 8,
+}
+```
+
+**After (Nickel - Current)**:
+
+`scheduler_contracts.ncl`:
+
+```text
+{
+  Scheduler = {
+    strategy | String,
+    workers | Number,
+  },
+}
+```
+
+`scheduler_defaults.ncl`:
+
+```text
+{
+  scheduler = {
+    strategy = "fifo",
+    workers = 4,
+  },
+}
+```
+
+`scheduler.ncl`:
+
+```text
+let contracts = import "./scheduler_contracts.ncl" in
+let defaults = import "./scheduler_defaults.ncl" in
+
+{
+  defaults = defaults,
+  make_scheduler | not_exported = fun o =>
+    defaults.scheduler & o,
+  DefaultScheduler = defaults.scheduler,
+  SchedulerConfig = defaults.scheduler & {
+    strategy = "priority",
+    workers = 8,
+  },
+}
+```
+
+---
+
+### Pattern 2: Union Types → Enums
+
+**Before (KCL - Legacy)**:
+
+```text
+schema Mode:
+  deployment_type: str = "solo"  # "solo" | "multiuser" | "cicd" | "enterprise"
+
+  check:
+    deployment_type in ["solo", "multiuser", "cicd", "enterprise"],
+    "Invalid deployment type"
+```
+
+**After (Nickel - Current)**:
+
+```text
+# contracts.ncl
+{
+  Mode = {
+    deployment_type | [| 'solo, 'multiuser, 'cicd, 'enterprise |],
+  },
+}
+
+# defaults.ncl
+{
+  mode = {
+    deployment_type = 'solo,
+  },
+}
+```
+
+**Benefits**: Type-safe, no string validation needed
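+
+Once `deployment_type` is an enum, consumers can dispatch on it with pattern matching instead of string comparison. A small sketch (the branch messages are illustrative):
+
+```nickel
+# Sketch: matching on the enum instead of comparing strings.
+let mode = { deployment_type = 'solo } in
+mode.deployment_type
+|> match {
+  'solo => "single-node deployment",
+  'multiuser => "shared deployment",
+  _ => "managed deployment",
+}
+```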
+
+---
+
+### Pattern 3: Schema Inheritance → Record Merging
+
+**Before (KCL - Legacy)**:
+
+```text
+schema ServerDefaults:
+  cpu: int = 2
+  memory: int = 4
+
+schema Server(ServerDefaults):
+  name: str
+
+web_server: Server = {
+  name = "web-01",
+  cpu = 8,
+  memory = 16,
+}
+```
+
+**After (Nickel - Current)**:
+
+```text
+# defaults.ncl
+{
+  server_defaults = {
+    cpu = 2,
+    memory = 4,
+  },
+
+  web_server = {
+    name = "web-01",
+    cpu = 8,
+    memory = 16,
+  },
+}
+
+# main.ncl - Composition
+let make_server = fun config =>
+  defaults.server_defaults & config & {
+    name = config.name,
+  }
+```
+
+**Advantage**: Explicit, flexible, composable
+
+---
+
+## 6. Deployment Workflows
+
+### Development Mode (Single Source of Truth)
+
+**When to Use**: Local development, testing, iterations
+
+**Workflow**:
+
+```text
+# Edit workspace config
+cd workspace_librecloud/nickel
+vim wuji/main.ncl
+
+# Test immediately (relative imports)
+nickel export wuji/main.ncl --format json
+
+# Changes to central provisioning reflected immediately
+vim ../../provisioning/schemas/lib/main.ncl
+nickel export wuji/main.ncl  # Uses updated schemas
+```
+
+**Imports** (relative, central):
+
+```text
+import "../../provisioning/schemas/main.ncl"
+import "../../provisioning/extensions/taskservs/kubernetes/nickel/main.ncl"
+```
+
+---
+
+### Production Mode (Frozen Snapshots)
+
+**When to Use**: Deployments, releases, reproducibility
+
+**Workflow**:
+
+```text
+# 1. Create immutable snapshot
+provisioning workspace freeze \
+  --version "2025-12-15-prod-v1" \
+  --env production
+
+# 2. Frozen structure created
+.frozen/2025-12-15-prod-v1/
+├── provisioning/schemas/    # Snapshot
+├── extensions/              # Snapshot
+└── workspace/               # Snapshot
+
+# 3. Deploy from frozen
+provisioning deploy \
+  --frozen "2025-12-15-prod-v1" \
+  --infra wuji
+
+# 4. Rollback if needed
+provisioning deploy \
+  --frozen "2025-12-10-prod-v0" \
+  --infra wuji
+```
+
+**Frozen Imports** (rewritten to local):
+
+```text
+# Original in workspace
+import "../../provisioning/schemas/main.ncl"
+
+# Rewritten in frozen snapshot
+import "./provisioning/schemas/main.ncl"
+```
+
+**Benefits**:
+
+- ✅ Immutable deployments
+- ✅ No external dependencies
+- ✅ Reproducible across environments
+- ✅ Works offline/air-gapped
+- ✅ Easy rollback
+
+---
+
+## 7. Troubleshooting Guide
+
+### Error: "unexpected token" with Multiple Let Bindings
+
+**Problem**:
+
+```text
+# ❌ WRONG
+let A = { x = 1 }
+let B = { y = 2 }
+{ A = A, B = B }
+```
+
+Error: `unexpected token`
+
+**Solution**: Use `let...in` chaining:
+
+```text
+# ✅ CORRECT
+let A = { x = 1 } in
+let B = { y = 2 } in
+{ A = A, B = B }
+```
+
+---
+
+### Error: "this can't be used as a contract"
+
+**Problem**:
+
+```text
+# ❌ WRONG
+let StorageVol = {
+  mount_path : String | null = null,
+}
+```
+
+Error: `this can't be used as a contract`
+
+**Explanation**: Union types with `null` don't work in field annotations
+
+**Solution**: Use untyped assignment:
+
+```text
+# ✅ CORRECT
+let StorageVol = {
+  mount_path = null,
+}
+```
+
+---
+
+### Error: "infinite recursion" when Exporting
+
+**Problem**:
+
+```text
+# ❌ WRONG
+{
+  get_value = fun x => x + 1,
+  result = get_value 5,
+}
+```
+
+Error: Functions can't be serialized
+
+**Solution**: Mark helper functions `not_exported`:
+
+```text
+# ✅ CORRECT
+{
+  get_value | not_exported = fun x => x + 1,
+  result = get_value 5,
+}
+```
+
+---
+
+### Error: "field not found" After Renaming
+
+**Problem**:
+
+```text
+let defaults = import "./defaults.ncl" in
+defaults.scheduler_config  # But file has "scheduler"
+```
+
+Error: `field not found`
+
+**Solution**: Use exact field names:
+
+```text
+let defaults = import "./defaults.ncl" in
+defaults.scheduler  # Correct name from defaults.ncl
+```
+
+---
+
+### Performance Issue: Slow Exports
+
+**Problem**: Large nested configs slow to export
+
+**Solution**: Check for circular references or missing `not_exported`:
+
+```text
+# ❌ Slow - functions being serialized
+{
+  validate_config = fun x => x,
+  data = { foo = "bar" },
+}
+
+# ✅ Fast - functions excluded
+{
+  validate_config | not_exported = fun x => x,
+  data = { foo = "bar" },
+}
+```
+
+---
+
+## 8. Best Practices
+
+### For Nickel Schemas
+
+1. **Follow Three-File Pattern**
+
+   ```nickel
+
+   module_contracts.ncl   # Types only
+   module_defaults.ncl    # Values only
+   module.ncl             # Instances + interface
+
+   ```
+
+2. **Use Hybrid Interface** (4 levels)
+   - Level 1: Direct defaults (inspection)
+   - Level 2: Maker functions (customization)
+   - Level 3: Default instances (pre-built)
+   - Level 4: Contracts (optional, advanced)
+
+3. **Record Merging for Composition**
+
+   ```nickel
+   let defaults = import "./defaults.ncl" in
+   my_config = defaults.server & { custom_field = "value" }
+   ```
+
+4. **Mark Helper Functions `not_exported`**
+
+   ```nickel
+   validate | not_exported = fun x => x,
+   ```
+
+5. **No Null Values in Defaults**
+
+   ```nickel
+   # ✅ Good
+   { field = "" }  # empty string for optional
+
+   # ❌ Avoid
+   { field = null }  # causes export issues
+   ```
+
+---
+
+### For Legacy KCL (Workspace-Level - Deprecated)
+
+**Note**: KCL is deprecated. Gradually migrate to Nickel for new projects.
+
+1. **Schema-First Development**
+   - Define schemas before configs
+   - Explicit validation
+
+2. 
**Immutability by Default** + - KCL enforces immutability + - Use `_` prefix only when necessary + +3. **Direct Submodule Imports** + + ```kcl + import provisioning.lib as lib + ``` + +4. **Complex Validation** + + ```kcl + check: + timeout > 0, "Must be positive" + timeout < 300, "Must be < 5 min" + ``` + +--- + +## 9. TypeDialog Integration + +### What is TypeDialog + +Type-safe prompts, forms, and schemas that **bidirectionally integrate with Nickel**. + +**Location**: `/Users/Akasha/Development/typedialog` + +### Workflow: Nickel Schemas → Interactive UIs → Nickel Output + +```text +# 1. Define schema in Nickel +cat > server.ncl << 'EOF' +let contracts = import "./contracts.ncl" in +{ + DefaultServer = { + name = "web-01", + cpu = 4, + memory = 8, + zone = "us-nyc1", + }, +} +EOF + +# 2. Generate interactive form from schema +typedialog form --schema server.ncl --output json + +# 3. User fills form interactively (CLI, TUI, or Web) +# Prompts generated from field names +# Defaults populated from Nickel config + +# 4. Output back to Nickel +typedialog form --input form.toml --output nickel +``` + +### Benefits + +- **Type-Safe UIs**: Forms validated against Nickel contracts +- **Auto-Generated**: No UI code to maintain +- **Multiple Backends**: CLI (inquire), TUI (ratatui), Web (axum) +- **Multiple Formats**: JSON, YAML, TOML, Nickel output +- **Bidirectional**: Nickel → UIs → Nickel + +### Example: Infrastructure Wizard + +```text +# User runs +provisioning init --wizard + +# Backend generates TypeDialog form from: +provisioning/schemas/config/workspace_config/main.ncl + +# Interactive form with: +- workspace_name (text prompt) +- deployment_mode (select: solo/multiuser/cicd/enterprise) +- preferred_provider (select: upcloud/aws/hetzner) +- taskservs (multi-select: kubernetes, cilium, etcd, etc) +- custom_settings (advanced, optional) + +# Output: workspace_config.ncl (valid Nickel!) +``` + +--- + +## 10. Migration Checklist + +### Before Starting Migration + +- [ ] Read ADR-011 +- [ ] Review [Nickel Migration Guide](../development/nickel-executable-examples.md) +- [ ] Identify which module to migrate +- [ ] Check for dependencies on other modules + +### During Migration + +- [ ] Extract contracts from KCL schema +- [ ] Extract defaults from KCL config +- [ ] Create main.ncl with hybrid interface +- [ ] Validate JSON export: `nickel export main.ncl --format json` +- [ ] Compare JSON output with original KCL + +### Validation + +- [ ] All required fields present +- [ ] No null values (use empty strings/arrays) +- [ ] Contracts are pure definitions +- [ ] Defaults are complete values +- [ ] Main file has 4-level interface +- [ ] Syntax validation passes +- [ ] No `...` as code omission indicators + +### Post-Migration + +- [ ] Update imports in dependent files +- [ ] Test in development mode +- [ ] Create frozen snapshot +- [ ] Test production deployment +- [ ] Update documentation + +--- + +## 11. 
Real-World Examples from Codebase + +### Example 1: Platform Schemas Entry Point + +**File**: `provisioning/schemas/main.ncl` (174 lines) + +```text +# Domain-organized architecture +{ + lib | doc "Core library types" + = import "./lib/main.ncl", + + config | doc "Settings, defaults, workspace_config" + = { + settings = import "./config/settings/main.ncl", + defaults = import "./config/defaults/main.ncl", + workspace_config = import "./config/workspace_config/main.ncl", + }, + + infrastructure | doc "Compute, storage, provisioning" + = { + compute = { + server = import "./infrastructure/compute/server/main.ncl", + cluster = import "./infrastructure/compute/cluster/main.ncl", + }, + storage = { + vm = import "./infrastructure/storage/vm/main.ncl", + }, + }, + + operations | doc "Workflows, batch, dependencies, tasks" + = { + workflows = import "./operations/workflows/main.ncl", + batch = import "./operations/batch/main.ncl", + }, + + deployment | doc "Kubernetes, modes" + = { + kubernetes = import "./deployment/kubernetes/main.ncl", + modes = import "./deployment/modes/main.ncl", + }, +} +``` + +**Usage**: + +```text +let provisioning = import "./main.ncl" in + +provisioning.lib.Storage +provisioning.config.settings +provisioning.infrastructure.compute.server +provisioning.operations.workflows +``` + +--- + +### Example 2: Provider Extension (UpCloud) + +**File**: `provisioning/extensions/providers/upcloud/nickel/main.ncl` (38 lines) + +```text +let contracts_lib = import "./contracts.ncl" in +let defaults_lib = import "./defaults.ncl" in + +{ + defaults = defaults_lib, + + make_storage_backup | not_exported = fun overrides => + defaults_lib.storage_backup & overrides, + + make_storage | not_exported = fun overrides => + defaults_lib.storage & overrides, + + make_provision_env | not_exported = fun overrides => + defaults_lib.provision_env & overrides, + + make_provision_upcloud | not_exported = fun overrides => + defaults_lib.provision_upcloud & overrides, + + make_server_defaults_upcloud | not_exported = fun overrides => + defaults_lib.server_defaults_upcloud & overrides, + + make_server_upcloud | not_exported = fun overrides => + defaults_lib.server_upcloud & overrides, + + DefaultStorageBackup = defaults_lib.storage_backup, + DefaultStorage = defaults_lib.storage, + DefaultProvisionEnv = defaults_lib.provision_env, + DefaultProvisionUpcloud = defaults_lib.provision_upcloud, + DefaultServerDefaults_upcloud = defaults_lib.server_defaults_upcloud, + DefaultServerUpcloud = defaults_lib.server_upcloud, +} +``` + +--- + +### Example 3: Workspace Infrastructure (wuji) + +**File**: `workspace_librecloud/nickel/wuji/main.ncl` (53 lines) + +```text +let settings_config = import "./settings.ncl" in +let ts_cilium = import "./taskservs/cilium.ncl" in +let ts_containerd = import "./taskservs/containerd.ncl" in +let ts_coredns = import "./taskservs/coredns.ncl" in +let ts_crio = import "./taskservs/crio.ncl" in +let ts_crun = import "./taskservs/crun.ncl" in +let ts_etcd = import "./taskservs/etcd.ncl" in +let ts_external_nfs = import "./taskservs/external-nfs.ncl" in +let ts_k8s_nodejoin = import "./taskservs/k8s-nodejoin.ncl" in +let ts_kubernetes = import "./taskservs/kubernetes.ncl" in +let ts_mayastor = import "./taskservs/mayastor.ncl" in +let ts_os = import "./taskservs/os.ncl" in +let ts_podman = import "./taskservs/podman.ncl" in +let ts_postgres = import "./taskservs/postgres.ncl" in +let ts_proxy = import "./taskservs/proxy.ncl" in +let ts_redis = import "./taskservs/redis.ncl" in +let 
ts_resolv = import "./taskservs/resolv.ncl" in +let ts_rook_ceph = import "./taskservs/rook_ceph.ncl" in +let ts_runc = import "./taskservs/runc.ncl" in +let ts_webhook = import "./taskservs/webhook.ncl" in +let ts_youki = import "./taskservs/youki.ncl" in + +{ + settings = settings_config.settings, + servers = settings_config.servers, + + taskservs = { + cilium = ts_cilium.cilium, + containerd = ts_containerd.containerd, + coredns = ts_coredns.coredns, + crio = ts_crio.crio, + crun = ts_crun.crun, + etcd = ts_etcd.etcd, + external_nfs = ts_external_nfs.external_nfs, + k8s_nodejoin = ts_k8s_nodejoin.k8s_nodejoin, + kubernetes = ts_kubernetes.kubernetes, + mayastor = ts_mayastor.mayastor, + os = ts_os.os, + podman = ts_podman.podman, + postgres = ts_postgres.postgres, + proxy = ts_proxy.proxy, + redis = ts_redis.redis, + resolv = ts_resolv.resolv, + rook_ceph = ts_rook_ceph.rook_ceph, + runc = ts_runc.runc, + webhook = ts_webhook.webhook, + youki = ts_youki.youki, + }, +} +``` + +--- + +## Summary Table + +| Aspect | KCL | Nickel | Recommendation | +| -------- | ----- | -------- | --- | +| **Learning Curve** | 10 hours | 3 hours | Nickel | +| **Performance** | Baseline | 60% faster | Nickel | +| **Flexibility** | Limited | Excellent | Nickel | +| **Type Safety** | Strong | Good (gradual) | KCL (slightly) | +| **Extensibility** | Rigid | Excellent | Nickel | +| **Boilerplate** | High | Low | Nickel | +| **Ecosystem** | Small | Growing | Nickel | +| **For New Projects** | ❌ | ✅ | Nickel | +| **For Legacy Configs** | ✅ Supported | ⏳ Gradual | Both (migrate gradually) | + +--- + +## Key Takeaways + +1. **Nickel is the future** - 60% faster, more flexible, simpler mental model +2. **Three-file pattern** - Cleanly separates contracts, defaults, instances +3. **Hybrid interface** - 4 levels cover all use cases (90% makers, 9% defaults, 1% contracts) +4. **Domain organization** - 8 logical domains for clarity and scalability +5. **Two deployment modes** - Development (fast iteration) + Production (immutable snapshots) +6. **TypeDialog integration** - Amplifies Nickel beyond IaC (UI generation) +7. **KCL still supported** - For legacy workspace configs during gradual migration +8. 
**Production validated** - 47 active files, 20 taskservs, 422 total schemas + +--- + +**Next Steps**: + +- For new schemas → Use Nickel (three-file pattern) +- For workspace configs → Can migrate gradually +- For UI generation → Combine Nickel + TypeDialog +- For application settings → Use TOML (not KCL/Nickel) +- For K8s/CI-CD → Use YAML (not KCL/Nickel) + +--- + +**Version**: 1.0.0 +**Status**: Complete Reference Guide +**Last Updated**: 2025-12-15 \ No newline at end of file diff --git a/docs/src/architecture/orchestrator-auth-integration.md b/docs/src/architecture/orchestrator-auth-integration.md index 19a596e..47d7d9b 100644 --- a/docs/src/architecture/orchestrator-auth-integration.md +++ b/docs/src/architecture/orchestrator-auth-integration.md @@ -1 +1,621 @@ -# Orchestrator Authentication & Authorization Integration\n\n**Version**: 1.0.0\n**Date**: 2025-10-08\n**Status**: Implemented\n\n## Overview\n\nComplete authentication and authorization flow integration for the Provisioning Orchestrator, connecting all security components (JWT validation, MFA\nverification, Cedar authorization, rate limiting, and audit logging) into a cohesive security middleware chain.\n\n## Architecture\n\n### Security Middleware Chain\n\nThe middleware chain is applied in this specific order to ensure proper security:\n\n```\n┌─────────────────────────────────────────────────────────────────┐\n│ Incoming HTTP Request │\n└────────────────────────┬────────────────────────────────────────┘\n │\n ▼\n ┌────────────────────────────────┐\n │ 1. Rate Limiting Middleware │\n │ - Per-IP request limits │\n │ - Sliding window │\n │ - Exempt IPs │\n └────────────┬───────────────────┘\n │ (429 if exceeded)\n ▼\n ┌────────────────────────────────┐\n │ 2. Authentication Middleware │\n │ - Extract Bearer token │\n │ - Validate JWT signature │\n │ - Check expiry, issuer, aud │\n │ - Check revocation │\n └────────────┬───────────────────┘\n │ (401 if invalid)\n ▼\n ┌────────────────────────────────┐\n │ 3. MFA Verification │\n │ - Check MFA status in token │\n │ - Enforce for sensitive ops │\n │ - Production deployments │\n │ - All DELETE operations │\n └────────────┬───────────────────┘\n │ (403 if required but missing)\n ▼\n ┌────────────────────────────────┐\n │ 4. Authorization Middleware │\n │ - Build Cedar request │\n │ - Evaluate policies │\n │ - Check permissions │\n │ - Log decision │\n └────────────┬───────────────────┘\n │ (403 if denied)\n ▼\n ┌────────────────────────────────┐\n │ 5. Audit Logging Middleware │\n │ - Log complete request │\n │ - User, action, resource │\n │ - Authorization decision │\n │ - Response status │\n └────────────┬───────────────────┘\n │\n ▼\n ┌────────────────────────────────┐\n │ Protected Handler │\n │ - Access security context │\n │ - Execute business logic │\n └────────────────────────────────┘\n```\n\n## Implementation Details\n\n### 1. 
Security Context Builder (`middleware/security_context.rs`)\n\n**Purpose**: Build complete security context from authenticated requests.\n\n**Key Features**:\n\n- Extracts JWT token claims\n- Determines MFA verification status\n- Extracts IP address (X-Forwarded-For, X-Real-IP)\n- Extracts user agent and session info\n- Provides permission checking methods\n\n**Lines of Code**: 275\n\n**Example**:\n\n```\npub struct SecurityContext {\n pub user_id: String,\n pub token: ValidatedToken,\n pub mfa_verified: bool,\n pub ip_address: IpAddr,\n pub user_agent: Option,\n pub permissions: Vec,\n pub workspace: String,\n pub request_id: String,\n pub session_id: Option,\n}\n\nimpl SecurityContext {\n pub fn has_permission(&self, permission: &str) -> bool { ... }\n pub fn has_any_permission(&self, permissions: &[&str]) -> bool { ... }\n pub fn has_all_permissions(&self, permissions: &[&str]) -> bool { ... }\n}\n```\n\n### 2. Enhanced Authentication Middleware (`middleware/auth.rs`)\n\n**Purpose**: JWT token validation with revocation checking.\n\n**Key Features**:\n\n- Bearer token extraction\n- JWT signature validation (RS256)\n- Expiry, issuer, audience checks\n- Token revocation status\n- Security context injection\n\n**Lines of Code**: 245\n\n**Flow**:\n\n1. Extract `Authorization: Bearer ` header\n2. Validate JWT with TokenValidator\n3. Build SecurityContext\n4. Inject into request extensions\n5. Continue to next middleware or return 401\n\n**Error Responses**:\n\n- `401 Unauthorized`: Missing/invalid token, expired, revoked\n- `403 Forbidden`: Insufficient permissions\n\n### 3. MFA Verification Middleware (`middleware/mfa.rs`)\n\n**Purpose**: Enforce MFA for sensitive operations.\n\n**Key Features**:\n\n- Path-based MFA requirements\n- Method-based enforcement (all DELETEs)\n- Production environment protection\n- Clear error messages\n\n**Lines of Code**: 290\n\n**MFA Required For**:\n\n- Production deployments (`/production/`, `/prod/`)\n- All DELETE operations\n- Server operations (POST, PUT, DELETE)\n- Cluster operations (POST, PUT, DELETE)\n- Batch submissions\n- Rollback operations\n- Configuration changes (POST, PUT, DELETE)\n- Secret management\n- User/role management\n\n**Example**:\n\n```\nfn requires_mfa(method: &str, path: &str) -> bool {\n if path.contains("/production/") { return true; }\n if method == "DELETE" { return true; }\n if path.contains("/deploy") { return true; }\n // ...\n}\n```\n\n### 4. Enhanced Authorization Middleware (`middleware/authz.rs`)\n\n**Purpose**: Cedar policy evaluation with audit logging.\n\n**Key Features**:\n\n- Builds Cedar authorization request from HTTP request\n- Maps HTTP methods to Cedar actions (GET→Read, POST→Create, etc.)\n- Extracts resource types from paths\n- Evaluates Cedar policies with context (MFA, IP, time, workspace)\n- Logs all authorization decisions to audit log\n- Non-blocking audit logging (tokio::spawn)\n\n**Lines of Code**: 380\n\n**Resource Mapping**:\n\n```\n/api/v1/servers/srv-123 → Resource::Server("srv-123")\n/api/v1/taskserv/kubernetes → Resource::TaskService("kubernetes")\n/api/v1/cluster/prod → Resource::Cluster("prod")\n/api/v1/config/settings → Resource::Config("settings")\n```\n\n**Action Mapping**:\n\n```\nGET → Action::Read\nPOST → Action::Create\nPUT → Action::Update\nDELETE → Action::Delete\n```\n\n### 5. 
Rate Limiting Middleware (`middleware/rate_limit.rs`)\n\n**Purpose**: Prevent API abuse with per-IP rate limiting.\n\n**Key Features**:\n\n- Sliding window rate limiting\n- Per-IP request tracking\n- Configurable limits and windows\n- Exempt IP support\n- Automatic cleanup of old entries\n- Statistics tracking\n\n**Lines of Code**: 420\n\n**Configuration**:\n\n```\npub struct RateLimitConfig {\n pub max_requests: u32, // for example, 100\n pub window_duration: Duration, // for example, 60 seconds\n pub exempt_ips: Vec, // for example, internal services\n pub enabled: bool,\n}\n\n// Default: 100 requests per minute\n```\n\n**Statistics**:\n\n```\npub struct RateLimitStats {\n pub total_ips: usize, // Number of tracked IPs\n pub total_requests: u32, // Total requests made\n pub limited_ips: usize, // IPs that hit the limit\n pub config: RateLimitConfig,\n}\n```\n\n### 6. Security Integration Module (`security_integration.rs`)\n\n**Purpose**: Helper module to integrate all security components.\n\n**Key Features**:\n\n- `SecurityComponents` struct grouping all middleware\n- `SecurityConfig` for configuration\n- `initialize()` method to set up all components\n- `disabled()` method for development mode\n- `apply_security_middleware()` helper for router setup\n\n**Lines of Code**: 265\n\n**Usage Example**:\n\n```\nuse provisioning_orchestrator::security_integration::{\n SecurityComponents, SecurityConfig\n};\n\n// Initialize security\nlet config = SecurityConfig {\n public_key_path: PathBuf::from("keys/public.pem"),\n jwt_issuer: "control-center".to_string(),\n jwt_audience: "orchestrator".to_string(),\n cedar_policies_path: PathBuf::from("policies"),\n auth_enabled: true,\n authz_enabled: true,\n mfa_enabled: true,\n rate_limit_config: RateLimitConfig::new(100, 60),\n};\n\nlet security = SecurityComponents::initialize(config, audit_logger).await?;\n\n// Apply to router\nlet app = Router::new()\n .route("/api/v1/servers", post(create_server))\n .route("/api/v1/servers/:id", delete(delete_server));\n\nlet secured_app = apply_security_middleware(app, &security);\n```\n\n## Integration with AppState\n\n### Updated AppState Structure\n\n```\npub struct AppState {\n // Existing fields\n pub task_storage: Arc,\n pub batch_coordinator: BatchCoordinator,\n pub dependency_resolver: DependencyResolver,\n pub state_manager: Arc,\n pub monitoring_system: Arc,\n pub progress_tracker: Arc,\n pub rollback_system: Arc,\n pub test_orchestrator: Arc,\n pub dns_manager: Arc,\n pub extension_manager: Arc,\n pub oci_manager: Arc,\n pub service_orchestrator: Arc,\n pub audit_logger: Arc,\n pub args: Args,\n\n // NEW: Security components\n pub security: SecurityComponents,\n}\n```\n\n### Initialization in main.rs\n\n```\n#[tokio::main]\nasync fn main() -> Result<()> {\n let args = Args::parse();\n\n // Initialize AppState (creates audit_logger)\n let state = Arc::new(AppState::new(args).await?);\n\n // Initialize security components\n let security_config = SecurityConfig {\n public_key_path: PathBuf::from("keys/public.pem"),\n jwt_issuer: env::var("JWT_ISSUER").unwrap_or("control-center".to_string()),\n jwt_audience: "orchestrator".to_string(),\n cedar_policies_path: PathBuf::from("policies"),\n auth_enabled: env::var("AUTH_ENABLED").unwrap_or("true".to_string()) == "true",\n authz_enabled: env::var("AUTHZ_ENABLED").unwrap_or("true".to_string()) == "true",\n mfa_enabled: env::var("MFA_ENABLED").unwrap_or("true".to_string()) == "true",\n rate_limit_config: RateLimitConfig::new(\n 
env::var("RATE_LIMIT_MAX").unwrap_or("100".to_string()).parse().unwrap(),\n env::var("RATE_LIMIT_WINDOW").unwrap_or("60".to_string()).parse().unwrap(),\n ),\n };\n\n let security = SecurityComponents::initialize(\n security_config,\n state.audit_logger.clone()\n ).await?;\n\n // Public routes (no auth)\n let public_routes = Router::new()\n .route("/health", get(health_check));\n\n // Protected routes (full security chain)\n let protected_routes = Router::new()\n .route("/api/v1/servers", post(create_server))\n .route("/api/v1/servers/:id", delete(delete_server))\n .route("/api/v1/taskserv", post(create_taskserv))\n .route("/api/v1/cluster", post(create_cluster))\n // ... more routes\n ;\n\n // Apply security middleware to protected routes\n let secured_routes = apply_security_middleware(protected_routes, &security)\n .with_state(state.clone());\n\n // Combine routes\n let app = Router::new()\n .merge(public_routes)\n .merge(secured_routes)\n .layer(CorsLayer::permissive());\n\n // Start server\n let listener = tokio::net::TcpListener::bind("0.0.0.0:9090").await?;\n axum::serve(listener, app).await?;\n\n Ok(())\n}\n```\n\n## Protected Endpoints\n\n### Endpoint Categories\n\n| Category | Example Endpoints | Auth Required | MFA Required | Cedar Policy |\n| ---------- | ------------------- | --------------- | -------------- | -------------- |\n| **Health** | `/health` | ❌ | ❌ | ❌ |\n| **Read-Only** | `GET /api/v1/servers` | ✅ | ❌ | ✅ |\n| **Server Mgmt** | `POST /api/v1/servers` | ✅ | ❌ | ✅ |\n| **Server Delete** | `DELETE /api/v1/servers/:id` | ✅ | ✅ | ✅ |\n| **Taskserv Mgmt** | `POST /api/v1/taskserv` | ✅ | ❌ | ✅ |\n| **Cluster Mgmt** | `POST /api/v1/cluster` | ✅ | ✅ | ✅ |\n| **Production** | `POST /api/v1/production/*` | ✅ | ✅ | ✅ |\n| **Batch Ops** | `POST /api/v1/batch/submit` | ✅ | ✅ | ✅ |\n| **Rollback** | `POST /api/v1/rollback` | ✅ | ✅ | ✅ |\n| **Config Write** | `POST /api/v1/config` | ✅ | ✅ | ✅ |\n| **Secrets** | `GET /api/v1/secret/*` | ✅ | ✅ | ✅ |\n\n## Complete Authentication Flow\n\n### Step-by-Step Flow\n\n```\n1. CLIENT REQUEST\n ├─ Headers:\n │ ├─ Authorization: Bearer \n │ ├─ X-Forwarded-For: 192.168.1.100\n │ ├─ User-Agent: MyClient/1.0\n │ └─ X-MFA-Verified: true\n └─ Path: DELETE /api/v1/servers/prod-srv-01\n\n2. RATE LIMITING MIDDLEWARE\n ├─ Extract IP: 192.168.1.100\n ├─ Check limit: 45/100 requests in window\n ├─ Decision: ALLOW (under limit)\n └─ Continue →\n\n3. AUTHENTICATION MIDDLEWARE\n ├─ Extract Bearer token\n ├─ Validate JWT:\n │ ├─ Signature: ✅ Valid (RS256)\n │ ├─ Expiry: ✅ Valid until 2025-10-09 10:00:00\n │ ├─ Issuer: ✅ control-center\n │ ├─ Audience: ✅ orchestrator\n │ └─ Revoked: ✅ Not revoked\n ├─ Build SecurityContext:\n │ ├─ user_id: "user-456"\n │ ├─ workspace: "production"\n │ ├─ permissions: ["read", "write", "delete"]\n │ ├─ mfa_verified: true\n │ └─ ip_address: 192.168.1.100\n ├─ Decision: ALLOW (valid token)\n └─ Continue →\n\n4. MFA VERIFICATION MIDDLEWARE\n ├─ Check endpoint: DELETE /api/v1/servers/prod-srv-01\n ├─ Requires MFA: ✅ YES (DELETE operation)\n ├─ MFA status: ✅ Verified\n ├─ Decision: ALLOW (MFA verified)\n └─ Continue →\n\n5. 
AUTHORIZATION MIDDLEWARE\n ├─ Build Cedar request:\n │ ├─ Principal: User("user-456")\n │ ├─ Action: Delete\n │ ├─ Resource: Server("prod-srv-01")\n │ └─ Context:\n │ ├─ mfa_verified: true\n │ ├─ ip_address: "192.168.1.100"\n │ ├─ time: 2025-10-08T14:30:00Z\n │ └─ workspace: "production"\n ├─ Evaluate Cedar policies:\n │ ├─ Policy 1: Allow if user.role == "admin" ✅\n │ ├─ Policy 2: Allow if mfa_verified == true ✅\n │ └─ Policy 3: Deny if not business_hours ❌\n ├─ Decision: ALLOW (2 allow, 1 deny = allow)\n ├─ Log to audit: Authorization GRANTED\n └─ Continue →\n\n6. AUDIT LOGGING MIDDLEWARE\n ├─ Record:\n │ ├─ User: user-456 (IP: 192.168.1.100)\n │ ├─ Action: ServerDelete\n │ ├─ Resource: prod-srv-01\n │ ├─ Authorization: GRANTED\n │ ├─ MFA: Verified\n │ └─ Timestamp: 2025-10-08T14:30:00Z\n └─ Continue →\n\n7. PROTECTED HANDLER\n ├─ Execute business logic\n ├─ Delete server prod-srv-01\n └─ Return: 200 OK\n\n8. AUDIT LOGGING (Response)\n ├─ Update event:\n │ ├─ Status: 200 OK\n │ ├─ Duration: 1.234s\n │ └─ Result: SUCCESS\n └─ Write to audit log\n\n9. CLIENT RESPONSE\n └─ 200 OK: Server deleted successfully\n```\n\n## Configuration\n\n### Environment Variables\n\n```\n# JWT Configuration\nJWT_ISSUER=control-center\nJWT_AUDIENCE=orchestrator\nPUBLIC_KEY_PATH=/path/to/keys/public.pem\n\n# Cedar Policies\nCEDAR_POLICIES_PATH=/path/to/policies\n\n# Security Toggles\nAUTH_ENABLED=true\nAUTHZ_ENABLED=true\nMFA_ENABLED=true\n\n# Rate Limiting\nRATE_LIMIT_MAX=100\nRATE_LIMIT_WINDOW=60\nRATE_LIMIT_EXEMPT_IPS=10.0.0.1,10.0.0.2\n\n# Audit Logging\nAUDIT_ENABLED=true\nAUDIT_RETENTION_DAYS=365\n```\n\n### Development Mode\n\nFor development/testing, all security can be disabled:\n\n```\n// In main.rs\nlet security = if env::var("DEVELOPMENT_MODE").unwrap_or("false".to_string()) == "true" {\n SecurityComponents::disabled(audit_logger.clone())\n} else {\n SecurityComponents::initialize(security_config, audit_logger.clone()).await?\n};\n```\n\n## Testing\n\n### Integration Tests\n\nLocation: `provisioning/platform/orchestrator/tests/security_integration_tests.rs`\n\n**Test Coverage**:\n\n- ✅ Rate limiting enforcement\n- ✅ Rate limit statistics\n- ✅ Exempt IP handling\n- ✅ Authentication missing token\n- ✅ MFA verification for sensitive operations\n- ✅ Cedar policy evaluation\n- ✅ Complete security flow\n- ✅ Security components initialization\n- ✅ Configuration defaults\n\n**Lines of Code**: 340\n\n**Run Tests**:\n\n```\ncd provisioning/platform/orchestrator\ncargo test security_integration_tests\n```\n\n## File Summary\n\n| File | Purpose | Lines | Tests |\n| ------ | --------- | ------- | ------- |\n| `middleware/security_context.rs` | Security context builder | 275 | 8 |\n| `middleware/auth.rs` | JWT authentication | 245 | 5 |\n| `middleware/mfa.rs` | MFA verification | 290 | 15 |\n| `middleware/authz.rs` | Cedar authorization | 380 | 4 |\n| `middleware/rate_limit.rs` | Rate limiting | 420 | 8 |\n| `middleware/mod.rs` | Module exports | 25 | 0 |\n| `security_integration.rs` | Integration helpers | 265 | 2 |\n| `tests/security_integration_tests.rs` | Integration tests | 340 | 11 |\n| **Total** | | **2,240** | **53** |\n\n## Benefits\n\n### Security\n\n- ✅ Complete authentication flow with JWT validation\n- ✅ MFA enforcement for sensitive operations\n- ✅ Fine-grained authorization with Cedar policies\n- ✅ Rate limiting prevents API abuse\n- ✅ Complete audit trail for compliance\n\n### Architecture\n\n- ✅ Modular middleware design\n- ✅ Clear separation of concerns\n- ✅ Reusable security components\n- ✅ 
Easy to test and maintain\n- ✅ Configuration-driven behavior\n\n### Operations\n\n- ✅ Can enable/disable features independently\n- ✅ Development mode for testing\n- ✅ Comprehensive error messages\n- ✅ Real-time statistics and monitoring\n- ✅ Non-blocking audit logging\n\n## Future Enhancements\n\n1. **Token Refresh**: Automatic token refresh before expiry\n2. **IP Whitelisting**: Additional IP-based access control\n3. **Geolocation**: Block requests from specific countries\n4. **Advanced Rate Limiting**: Per-user, per-endpoint limits\n5. **Session Management**: Track active sessions, force logout\n6. **2FA Integration**: Direct integration with TOTP/SMS providers\n7. **Policy Hot Reload**: Update Cedar policies without restart\n8. **Metrics Dashboard**: Real-time security metrics visualization\n\n## Related Documentation\n\n- Cedar Policy Language\n- JWT Token Management\n- MFA Setup Guide\n- Audit Log Format\n- Rate Limiting Best Practices\n\n## Version History\n\n| Version | Date | Changes |\n| --------- | ------ | --------- |\n| 1.0.0 | 2025-10-08 | Initial implementation |\n\n---\n\n**Maintained By**: Security Team\n**Review Cycle**: Quarterly\n**Last Reviewed**: 2025-10-08 +# Orchestrator Authentication & Authorization Integration + +**Version**: 1.0.0 +**Date**: 2025-10-08 +**Status**: Implemented + +## Overview + +Complete authentication and authorization flow integration for the Provisioning Orchestrator, connecting all security components (JWT validation, MFA +verification, Cedar authorization, rate limiting, and audit logging) into a cohesive security middleware chain. + +## Architecture + +### Security Middleware Chain + +The middleware chain is applied in this specific order to ensure proper security: + +```text +┌─────────────────────────────────────────────────────────────────┐ +│ Incoming HTTP Request │ +└────────────────────────┬────────────────────────────────────────┘ + │ + ▼ + ┌────────────────────────────────┐ + │ 1. Rate Limiting Middleware │ + │ - Per-IP request limits │ + │ - Sliding window │ + │ - Exempt IPs │ + └────────────┬───────────────────┘ + │ (429 if exceeded) + ▼ + ┌────────────────────────────────┐ + │ 2. Authentication Middleware │ + │ - Extract Bearer token │ + │ - Validate JWT signature │ + │ - Check expiry, issuer, aud │ + │ - Check revocation │ + └────────────┬───────────────────┘ + │ (401 if invalid) + ▼ + ┌────────────────────────────────┐ + │ 3. MFA Verification │ + │ - Check MFA status in token │ + │ - Enforce for sensitive ops │ + │ - Production deployments │ + │ - All DELETE operations │ + └────────────┬───────────────────┘ + │ (403 if required but missing) + ▼ + ┌────────────────────────────────┐ + │ 4. Authorization Middleware │ + │ - Build Cedar request │ + │ - Evaluate policies │ + │ - Check permissions │ + │ - Log decision │ + └────────────┬───────────────────┘ + │ (403 if denied) + ▼ + ┌────────────────────────────────┐ + │ 5. Audit Logging Middleware │ + │ - Log complete request │ + │ - User, action, resource │ + │ - Authorization decision │ + │ - Response status │ + └────────────┬───────────────────┘ + │ + ▼ + ┌────────────────────────────────┐ + │ Protected Handler │ + │ - Access security context │ + │ - Execute business logic │ + └────────────────────────────────┘ +``` + +## Implementation Details + +### 1. Security Context Builder (`middleware/security_context.rs`) + +**Purpose**: Build complete security context from authenticated requests. 
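+
+For orientation, here is a hypothetical handler-side sketch of consuming the injected context (the handler name and the `Extension`-based extraction are illustrative assumptions, not the project's actual handlers; the real struct is shown below):
+
+```rust
+use axum::{extract::Extension, http::StatusCode};
+
+// Sketch only: assumes the authentication middleware has inserted a
+// SecurityContext (defined below) into the request's extensions.
+async fn delete_server(Extension(ctx): Extension<SecurityContext>) -> StatusCode {
+    if !ctx.has_permission("delete") {
+        return StatusCode::FORBIDDEN; // fail closed if the permission is missing
+    }
+    // ... business logic executes with a fully populated security context ...
+    StatusCode::OK
+}
+```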
+ +**Key Features**: + +- Extracts JWT token claims +- Determines MFA verification status +- Extracts IP address (X-Forwarded-For, X-Real-IP) +- Extracts user agent and session info +- Provides permission checking methods + +**Lines of Code**: 275 + +**Example**: + +```text +pub struct SecurityContext { + pub user_id: String, + pub token: ValidatedToken, + pub mfa_verified: bool, + pub ip_address: IpAddr, + pub user_agent: Option, + pub permissions: Vec, + pub workspace: String, + pub request_id: String, + pub session_id: Option, +} + +impl SecurityContext { + pub fn has_permission(&self, permission: &str) -> bool { ... } + pub fn has_any_permission(&self, permissions: &[&str]) -> bool { ... } + pub fn has_all_permissions(&self, permissions: &[&str]) -> bool { ... } +} +``` + +### 2. Enhanced Authentication Middleware (`middleware/auth.rs`) + +**Purpose**: JWT token validation with revocation checking. + +**Key Features**: + +- Bearer token extraction +- JWT signature validation (RS256) +- Expiry, issuer, audience checks +- Token revocation status +- Security context injection + +**Lines of Code**: 245 + +**Flow**: + +1. Extract `Authorization: Bearer ` header +2. Validate JWT with TokenValidator +3. Build SecurityContext +4. Inject into request extensions +5. Continue to next middleware or return 401 + +**Error Responses**: + +- `401 Unauthorized`: Missing/invalid token, expired, revoked +- `403 Forbidden`: Insufficient permissions + +### 3. MFA Verification Middleware (`middleware/mfa.rs`) + +**Purpose**: Enforce MFA for sensitive operations. + +**Key Features**: + +- Path-based MFA requirements +- Method-based enforcement (all DELETEs) +- Production environment protection +- Clear error messages + +**Lines of Code**: 290 + +**MFA Required For**: + +- Production deployments (`/production/`, `/prod/`) +- All DELETE operations +- Server operations (POST, PUT, DELETE) +- Cluster operations (POST, PUT, DELETE) +- Batch submissions +- Rollback operations +- Configuration changes (POST, PUT, DELETE) +- Secret management +- User/role management + +**Example**: + +```text +fn requires_mfa(method: &str, path: &str) -> bool { + if path.contains("/production/") { return true; } + if method == "DELETE" { return true; } + if path.contains("/deploy") { return true; } + // ... +} +``` + +### 4. Enhanced Authorization Middleware (`middleware/authz.rs`) + +**Purpose**: Cedar policy evaluation with audit logging. + +**Key Features**: + +- Builds Cedar authorization request from HTTP request +- Maps HTTP methods to Cedar actions (GET→Read, POST→Create, etc.) +- Extracts resource types from paths +- Evaluates Cedar policies with context (MFA, IP, time, workspace) +- Logs all authorization decisions to audit log +- Non-blocking audit logging (tokio::spawn) + +**Lines of Code**: 380 + +**Resource Mapping**: + +```text +/api/v1/servers/srv-123 → Resource::Server("srv-123") +/api/v1/taskserv/kubernetes → Resource::TaskService("kubernetes") +/api/v1/cluster/prod → Resource::Cluster("prod") +/api/v1/config/settings → Resource::Config("settings") +``` + +**Action Mapping**: + +```text +GET → Action::Read +POST → Action::Create +PUT → Action::Update +DELETE → Action::Delete +``` + +### 5. Rate Limiting Middleware (`middleware/rate_limit.rs`) + +**Purpose**: Prevent API abuse with per-IP rate limiting. 
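+
+For intuition, a minimal sliding-window check (a simplified sketch under assumed types, not the module's actual implementation; the real `RateLimitConfig` follows below):
+
+```rust
+use std::time::{Duration, Instant};
+
+// One Vec<Instant> is kept per client IP; a request is allowed only while
+// the number of timestamps inside the window stays under the limit.
+fn allow_request(history: &mut Vec<Instant>, max_requests: u32, window: Duration) -> bool {
+    let now = Instant::now();
+    history.retain(|t| now.duration_since(*t) < window); // evict entries older than the window
+    if (history.len() as u32) < max_requests {
+        history.push(now);
+        true
+    } else {
+        false // the middleware would respond 429 Too Many Requests
+    }
+}
+```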
+ +**Key Features**: + +- Sliding window rate limiting +- Per-IP request tracking +- Configurable limits and windows +- Exempt IP support +- Automatic cleanup of old entries +- Statistics tracking + +**Lines of Code**: 420 + +**Configuration**: + +```text +pub struct RateLimitConfig { + pub max_requests: u32, // for example, 100 + pub window_duration: Duration, // for example, 60 seconds + pub exempt_ips: Vec, // for example, internal services + pub enabled: bool, +} + +// Default: 100 requests per minute +``` + +**Statistics**: + +```text +pub struct RateLimitStats { + pub total_ips: usize, // Number of tracked IPs + pub total_requests: u32, // Total requests made + pub limited_ips: usize, // IPs that hit the limit + pub config: RateLimitConfig, +} +``` + +### 6. Security Integration Module (`security_integration.rs`) + +**Purpose**: Helper module to integrate all security components. + +**Key Features**: + +- `SecurityComponents` struct grouping all middleware +- `SecurityConfig` for configuration +- `initialize()` method to set up all components +- `disabled()` method for development mode +- `apply_security_middleware()` helper for router setup + +**Lines of Code**: 265 + +**Usage Example**: + +```text +use provisioning_orchestrator::security_integration::{ + SecurityComponents, SecurityConfig +}; + +// Initialize security +let config = SecurityConfig { + public_key_path: PathBuf::from("keys/public.pem"), + jwt_issuer: "control-center".to_string(), + jwt_audience: "orchestrator".to_string(), + cedar_policies_path: PathBuf::from("policies"), + auth_enabled: true, + authz_enabled: true, + mfa_enabled: true, + rate_limit_config: RateLimitConfig::new(100, 60), +}; + +let security = SecurityComponents::initialize(config, audit_logger).await?; + +// Apply to router +let app = Router::new() + .route("/api/v1/servers", post(create_server)) + .route("/api/v1/servers/:id", delete(delete_server)); + +let secured_app = apply_security_middleware(app, &security); +``` + +## Integration with AppState + +### Updated AppState Structure + +```text +pub struct AppState { + // Existing fields + pub task_storage: Arc, + pub batch_coordinator: BatchCoordinator, + pub dependency_resolver: DependencyResolver, + pub state_manager: Arc, + pub monitoring_system: Arc, + pub progress_tracker: Arc, + pub rollback_system: Arc, + pub test_orchestrator: Arc, + pub dns_manager: Arc, + pub extension_manager: Arc, + pub oci_manager: Arc, + pub service_orchestrator: Arc, + pub audit_logger: Arc, + pub args: Args, + + // NEW: Security components + pub security: SecurityComponents, +} +``` + +### Initialization in main.rs + +```text +#[tokio::main] +async fn main() -> Result<()> { + let args = Args::parse(); + + // Initialize AppState (creates audit_logger) + let state = Arc::new(AppState::new(args).await?); + + // Initialize security components + let security_config = SecurityConfig { + public_key_path: PathBuf::from("keys/public.pem"), + jwt_issuer: env::var("JWT_ISSUER").unwrap_or("control-center".to_string()), + jwt_audience: "orchestrator".to_string(), + cedar_policies_path: PathBuf::from("policies"), + auth_enabled: env::var("AUTH_ENABLED").unwrap_or("true".to_string()) == "true", + authz_enabled: env::var("AUTHZ_ENABLED").unwrap_or("true".to_string()) == "true", + mfa_enabled: env::var("MFA_ENABLED").unwrap_or("true".to_string()) == "true", + rate_limit_config: RateLimitConfig::new( + env::var("RATE_LIMIT_MAX").unwrap_or("100".to_string()).parse().unwrap(), + 
env::var("RATE_LIMIT_WINDOW").unwrap_or("60".to_string()).parse().unwrap(), + ), + }; + + let security = SecurityComponents::initialize( + security_config, + state.audit_logger.clone() + ).await?; + + // Public routes (no auth) + let public_routes = Router::new() + .route("/health", get(health_check)); + + // Protected routes (full security chain) + let protected_routes = Router::new() + .route("/api/v1/servers", post(create_server)) + .route("/api/v1/servers/:id", delete(delete_server)) + .route("/api/v1/taskserv", post(create_taskserv)) + .route("/api/v1/cluster", post(create_cluster)) + // ... more routes + ; + + // Apply security middleware to protected routes + let secured_routes = apply_security_middleware(protected_routes, &security) + .with_state(state.clone()); + + // Combine routes + let app = Router::new() + .merge(public_routes) + .merge(secured_routes) + .layer(CorsLayer::permissive()); + + // Start server + let listener = tokio::net::TcpListener::bind("0.0.0.0:9090").await?; + axum::serve(listener, app).await?; + + Ok(()) +} +``` + +## Protected Endpoints + +### Endpoint Categories + +| Category | Example Endpoints | Auth Required | MFA Required | Cedar Policy | +| ---------- | ------------------- | --------------- | -------------- | -------------- | +| **Health** | `/health` | ❌ | ❌ | ❌ | +| **Read-Only** | `GET /api/v1/servers` | ✅ | ❌ | ✅ | +| **Server Mgmt** | `POST /api/v1/servers` | ✅ | ❌ | ✅ | +| **Server Delete** | `DELETE /api/v1/servers/:id` | ✅ | ✅ | ✅ | +| **Taskserv Mgmt** | `POST /api/v1/taskserv` | ✅ | ❌ | ✅ | +| **Cluster Mgmt** | `POST /api/v1/cluster` | ✅ | ✅ | ✅ | +| **Production** | `POST /api/v1/production/*` | ✅ | ✅ | ✅ | +| **Batch Ops** | `POST /api/v1/batch/submit` | ✅ | ✅ | ✅ | +| **Rollback** | `POST /api/v1/rollback` | ✅ | ✅ | ✅ | +| **Config Write** | `POST /api/v1/config` | ✅ | ✅ | ✅ | +| **Secrets** | `GET /api/v1/secret/*` | ✅ | ✅ | ✅ | + +## Complete Authentication Flow + +### Step-by-Step Flow + +```text +1. CLIENT REQUEST + ├─ Headers: + │ ├─ Authorization: Bearer + │ ├─ X-Forwarded-For: 192.168.1.100 + │ ├─ User-Agent: MyClient/1.0 + │ └─ X-MFA-Verified: true + └─ Path: DELETE /api/v1/servers/prod-srv-01 + +2. RATE LIMITING MIDDLEWARE + ├─ Extract IP: 192.168.1.100 + ├─ Check limit: 45/100 requests in window + ├─ Decision: ALLOW (under limit) + └─ Continue → + +3. AUTHENTICATION MIDDLEWARE + ├─ Extract Bearer token + ├─ Validate JWT: + │ ├─ Signature: ✅ Valid (RS256) + │ ├─ Expiry: ✅ Valid until 2025-10-09 10:00:00 + │ ├─ Issuer: ✅ control-center + │ ├─ Audience: ✅ orchestrator + │ └─ Revoked: ✅ Not revoked + ├─ Build SecurityContext: + │ ├─ user_id: "user-456" + │ ├─ workspace: "production" + │ ├─ permissions: ["read", "write", "delete"] + │ ├─ mfa_verified: true + │ └─ ip_address: 192.168.1.100 + ├─ Decision: ALLOW (valid token) + └─ Continue → + +4. MFA VERIFICATION MIDDLEWARE + ├─ Check endpoint: DELETE /api/v1/servers/prod-srv-01 + ├─ Requires MFA: ✅ YES (DELETE operation) + ├─ MFA status: ✅ Verified + ├─ Decision: ALLOW (MFA verified) + └─ Continue → + +5. 
AUTHORIZATION MIDDLEWARE + ├─ Build Cedar request: + │ ├─ Principal: User("user-456") + │ ├─ Action: Delete + │ ├─ Resource: Server("prod-srv-01") + │ └─ Context: + │ ├─ mfa_verified: true + │ ├─ ip_address: "192.168.1.100" + │ ├─ time: 2025-10-08T14:30:00Z + │ └─ workspace: "production" + ├─ Evaluate Cedar policies: + │ ├─ Policy 1: Allow if user.role == "admin" ✅ + │ ├─ Policy 2: Allow if mfa_verified == true ✅ + │ └─ Policy 3: Deny if not business_hours ❌ + ├─ Decision: ALLOW (2 allow, 1 deny = allow) + ├─ Log to audit: Authorization GRANTED + └─ Continue → + +6. AUDIT LOGGING MIDDLEWARE + ├─ Record: + │ ├─ User: user-456 (IP: 192.168.1.100) + │ ├─ Action: ServerDelete + │ ├─ Resource: prod-srv-01 + │ ├─ Authorization: GRANTED + │ ├─ MFA: Verified + │ └─ Timestamp: 2025-10-08T14:30:00Z + └─ Continue → + +7. PROTECTED HANDLER + ├─ Execute business logic + ├─ Delete server prod-srv-01 + └─ Return: 200 OK + +8. AUDIT LOGGING (Response) + ├─ Update event: + │ ├─ Status: 200 OK + │ ├─ Duration: 1.234s + │ └─ Result: SUCCESS + └─ Write to audit log + +9. CLIENT RESPONSE + └─ 200 OK: Server deleted successfully +``` + +## Configuration + +### Environment Variables + +```text +# JWT Configuration +JWT_ISSUER=control-center +JWT_AUDIENCE=orchestrator +PUBLIC_KEY_PATH=/path/to/keys/public.pem + +# Cedar Policies +CEDAR_POLICIES_PATH=/path/to/policies + +# Security Toggles +AUTH_ENABLED=true +AUTHZ_ENABLED=true +MFA_ENABLED=true + +# Rate Limiting +RATE_LIMIT_MAX=100 +RATE_LIMIT_WINDOW=60 +RATE_LIMIT_EXEMPT_IPS=10.0.0.1,10.0.0.2 + +# Audit Logging +AUDIT_ENABLED=true +AUDIT_RETENTION_DAYS=365 +``` + +### Development Mode + +For development/testing, all security can be disabled: + +```text +// In main.rs +let security = if env::var("DEVELOPMENT_MODE").unwrap_or("false".to_string()) == "true" { + SecurityComponents::disabled(audit_logger.clone()) +} else { + SecurityComponents::initialize(security_config, audit_logger.clone()).await? 
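+    // full chain: rate limiting, JWT auth, MFA, Cedar authorization, audit logging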
+}; +``` + +## Testing + +### Integration Tests + +Location: `provisioning/platform/orchestrator/tests/security_integration_tests.rs` + +**Test Coverage**: + +- ✅ Rate limiting enforcement +- ✅ Rate limit statistics +- ✅ Exempt IP handling +- ✅ Authentication missing token +- ✅ MFA verification for sensitive operations +- ✅ Cedar policy evaluation +- ✅ Complete security flow +- ✅ Security components initialization +- ✅ Configuration defaults + +**Lines of Code**: 340 + +**Run Tests**: + +```text +cd provisioning/platform/orchestrator +cargo test security_integration_tests +``` + +## File Summary + +| File | Purpose | Lines | Tests | +| ------ | --------- | ------- | ------- | +| `middleware/security_context.rs` | Security context builder | 275 | 8 | +| `middleware/auth.rs` | JWT authentication | 245 | 5 | +| `middleware/mfa.rs` | MFA verification | 290 | 15 | +| `middleware/authz.rs` | Cedar authorization | 380 | 4 | +| `middleware/rate_limit.rs` | Rate limiting | 420 | 8 | +| `middleware/mod.rs` | Module exports | 25 | 0 | +| `security_integration.rs` | Integration helpers | 265 | 2 | +| `tests/security_integration_tests.rs` | Integration tests | 340 | 11 | +| **Total** | | **2,240** | **53** | + +## Benefits + +### Security + +- ✅ Complete authentication flow with JWT validation +- ✅ MFA enforcement for sensitive operations +- ✅ Fine-grained authorization with Cedar policies +- ✅ Rate limiting prevents API abuse +- ✅ Complete audit trail for compliance + +### Architecture + +- ✅ Modular middleware design +- ✅ Clear separation of concerns +- ✅ Reusable security components +- ✅ Easy to test and maintain +- ✅ Configuration-driven behavior + +### Operations + +- ✅ Can enable/disable features independently +- ✅ Development mode for testing +- ✅ Comprehensive error messages +- ✅ Real-time statistics and monitoring +- ✅ Non-blocking audit logging + +## Future Enhancements + +1. **Token Refresh**: Automatic token refresh before expiry +2. **IP Whitelisting**: Additional IP-based access control +3. **Geolocation**: Block requests from specific countries +4. **Advanced Rate Limiting**: Per-user, per-endpoint limits +5. **Session Management**: Track active sessions, force logout +6. **2FA Integration**: Direct integration with TOTP/SMS providers +7. **Policy Hot Reload**: Update Cedar policies without restart +8. **Metrics Dashboard**: Real-time security metrics visualization + +## Related Documentation + +- Cedar Policy Language +- JWT Token Management +- MFA Setup Guide +- Audit Log Format +- Rate Limiting Best Practices + +## Version History + +| Version | Date | Changes | +| --------- | ------ | --------- | +| 1.0.0 | 2025-10-08 | Initial implementation | + +--- + +**Maintained By**: Security Team +**Review Cycle**: Quarterly +**Last Reviewed**: 2025-10-08 \ No newline at end of file diff --git a/docs/src/architecture/orchestrator-info.md b/docs/src/architecture/orchestrator-info.md index 6069535..7bd5fab 100644 --- a/docs/src/architecture/orchestrator-info.md +++ b/docs/src/architecture/orchestrator-info.md @@ -1 +1,149 @@ -Execution Complete\n\nPerfect question. Here's a clear explanation:\n\nThe Orchestrator IS USED and IS CRITICAL\n\nThat code example was misleading. 
Here's the real architecture:\n\nHow It Actually Works\n\n┌─────────────────────────────────────────────────\n─┐\n│ User runs: provisioning server create --orchestrated\n└───────────────────┬─────────────────────────────\n─┘\n ↓\n ┌───────────────────────┐\n │ Nushell CLI │\n │ (provisioning) │\n └───────────┬───────────┘\n ↓ HTTP POST\n ┌───────────────────────────────┐\n │ Rust Orchestrator Daemon │\n │ (provisioning-orchestrator) │\n │ │\n │ • Task Queue │\n │ • Workflow Engine │\n │ • Dependency Resolution │\n │ • Parallel Execution │\n └───────────┬───────────────────┘\n ↓ spawns subprocess\n ┌───────────────────────────────┐\n │ Nushell Business Logic │\n │ nu -c "use servers/create.nu"│\n │ │\n │ Executes actual provider │\n │ API calls, configuration │\n └───────────────────────────────┘\nThe Flow in Detail\n\n1. User Command:\n\nprovisioning server create wuji --orchestrated\n2. Nushell CLI submits to orchestrator:\n\n# CLI code\n\nhttp post {\n infra: "wuji"\n params: {...}\n}\n\n# Returns: workflow_id = "abc-123"\n\n1. Orchestrator receives and queues:\n\n```\n// Orchestrator receives HTTP request\nasync fn create_server_workflow(request) {\n let task = Task::new(TaskType::ServerCreate, request);\n task_queue.enqueue(task).await; // Queue for execution\n return workflow_id; // Return immediately\n}\n```\n\n2. Orchestrator executes via Nushell subprocess:\n\n```\n// Orchestrator spawns Nushell to run business logic\nasync fn execute_task(task: Task) {\n let output = Command::new("nu")\n .arg("-c")\n .arg("use /usr/local/lib/provisioning/servers/create.nu; create-server 'wuji'")\n .output()\n .await?;\n\n // Orchestrator manages: retry, checkpointing, monitoring\n}\n```\n\n3. Nushell executes the actual work:\n\n```\n# servers/create.nu\n\nexport def create-server [name: string] {\n # This is the business logic\n # Calls UpCloud API, creates server, etc.\n let provider = (load-provider)\n $provider | create-vm $name\n}\nWhy This Architecture?\n\nProblem It Solves\n\nWithout Orchestrator (Old Way):\n\nprovisioning → template.nu → cluster.nu → taskserv.nu → provider.nu\n (Deep call stack = crashes!)\nWith Orchestrator (Current):\n\nprovisioning → Orchestrator → spawns fresh Nushell subprocess for each task\n (No deep nesting, parallel execution, recovery)\nWhat Orchestrator Provides\n\nTask Queue - Reliable execution even if system crashes\nParallel Execution - Run 10 tasks at once (Rust async)\nWorkflow Engine - Handle complex dependencies\nCheckpointing - Resume from failure\nMonitoring - Real-time progress tracking\nWhat Nushell Provides\n\nBusiness Logic - Provider integrations, config generation\nFlexibility - Easy to modify without recompiling\nReadability - Shell-like syntax for infrastructure ops\nMulti-Repo Impact: NONE on Integration\n\nIn Monorepo:\n\nprovisioning/\n├── core/nulib/ # Nushell code\n└── platform/orchestrator/ # Rust code\nIn Multi-Repo:\n\nprovisioning-core/ # Separate repo, installs to /usr/local/lib/provisioning\nprovisioning-platform/ # Separate repo, installs to /usr/local/bin/provisioning-orchestrator\nIntegration is the same:\n\nOrchestrator calls: nu -c "use /usr/local/lib/provisioning/servers/create.nu"\nNushell calls: http post ...\nNo code dependency, just runtime coordination!\n\nThe Orchestrator IS Essential\n\nThe orchestrator:\n\n✅ IS USED for all complex operations\n✅ IS CRITICAL for workflow system (v3.0)\n✅ IS REQUIRED for batch operations (v3.1)\n✅ SOLVES deep call stack issues\n✅ PROVIDES performance and reliability\nThat misleading code 
example showed how Platform doesn't link to Core code, but it absolutely uses the orchestrator for coordination.\n\nDoes this clear it up? The orchestrator is the performance and reliability layer that makes the whole system work!\n\nCost: $0.1565 USD\nDuration: 137.69s\nTurns: 40\nTotal tokens: 7466(7 in, 7459 out)
+Execution Complete
+
+Perfect question. Here's a clear explanation:
+
+The Orchestrator IS USED and IS CRITICAL
+
+That code example was misleading. Here's the real architecture:
+
+How It Actually Works
+
+┌─────────────────────────────────────────────────
+─┐
+│ User runs: provisioning server create --orchestrated
+└───────────────────┬─────────────────────────────
+─┘
+                    ↓
+        ┌───────────────────────┐
+        │      Nushell CLI      │
+        │    (provisioning)     │
+        └───────────┬───────────┘
+                    ↓ HTTP POST
+        ┌───────────────────────────────┐
+        │  Rust Orchestrator Daemon     │
+        │  (provisioning-orchestrator)  │
+        │                               │
+        │  • Task Queue                 │
+        │  • Workflow Engine            │
+        │  • Dependency Resolution      │
+        │  • Parallel Execution         │
+        └───────────┬───────────────────┘
+                    ↓ spawns subprocess
+        ┌───────────────────────────────┐
+        │   Nushell Business Logic      │
+        │   nu -c "use servers/create.nu"│
+        │                               │
+        │   Executes actual provider    │
+        │   API calls, configuration    │
+        └───────────────────────────────┘
+The Flow in Detail
+
+1. User Command:
+
+provisioning server create wuji --orchestrated
+2. Nushell CLI submits to orchestrator:
+
+# CLI code
+
+http post {
+  infra: "wuji"
+  params: {...}
+}
+
+# Returns: workflow_id = "abc-123"
+
+3. Orchestrator receives and queues:
+
+```text
+// Orchestrator receives HTTP request
+async fn create_server_workflow(request) {
+    let task = Task::new(TaskType::ServerCreate, request);
+    task_queue.enqueue(task).await; // Queue for execution
+    return workflow_id;             // Return immediately
+}
+```
+
+4. Orchestrator executes via Nushell subprocess:
+
+```text
+// Orchestrator spawns Nushell to run business logic
+async fn execute_task(task: Task) {
+    let output = Command::new("nu")
+        .arg("-c")
+        .arg("use /usr/local/lib/provisioning/servers/create.nu; create-server 'wuji'")
+        .output()
+        .await?;
+
+    // Orchestrator manages: retry, checkpointing, monitoring
+}
+```
+
+5. Nushell executes the actual work:
+
+```text
+# servers/create.nu
+
+export def create-server [name: string] {
+  # This is the business logic
+  # Calls UpCloud API, creates server, etc.
+  let provider = (load-provider)
+  $provider | create-vm $name
+}
+```
+
+Why This Architecture?
+
+Problem It Solves
+
+Without Orchestrator (Old Way):
+
+provisioning → template.nu → cluster.nu → taskserv.nu → provider.nu
+ (Deep call stack = crashes!)
+With Orchestrator (Current): + +provisioning → Orchestrator → spawns fresh Nushell subprocess for each task + (No deep nesting, parallel execution, recovery) +What Orchestrator Provides + +Task Queue - Reliable execution even if system crashes +Parallel Execution - Run 10 tasks at once (Rust async) +Workflow Engine - Handle complex dependencies +Checkpointing - Resume from failure +Monitoring - Real-time progress tracking +What Nushell Provides + +Business Logic - Provider integrations, config generation +Flexibility - Easy to modify without recompiling +Readability - Shell-like syntax for infrastructure ops +Multi-Repo Impact: NONE on Integration + +In Monorepo: + +provisioning/ +├── core/nulib/ # Nushell code +└── platform/orchestrator/ # Rust code +In Multi-Repo: + +provisioning-core/ # Separate repo, installs to /usr/local/lib/provisioning +provisioning-platform/ # Separate repo, installs to /usr/local/bin/provisioning-orchestrator +Integration is the same: + +Orchestrator calls: nu -c "use /usr/local/lib/provisioning/servers/create.nu" +Nushell calls: http post ... +No code dependency, just runtime coordination! + +The Orchestrator IS Essential + +The orchestrator: + +✅ IS USED for all complex operations +✅ IS CRITICAL for workflow system (v3.0) +✅ IS REQUIRED for batch operations (v3.1) +✅ SOLVES deep call stack issues +✅ PROVIDES performance and reliability +That misleading code example showed how Platform doesn't link to Core code, but it absolutely uses the orchestrator for coordination. + +Does this clear it up? The orchestrator is the performance and reliability layer that makes the whole system work! + +Cost: $0.1565 USD +Duration: 137.69s +Turns: 40 +Total tokens: 7466(7 in, 7459 out) \ No newline at end of file diff --git a/docs/src/architecture/orchestrator-integration-model.md b/docs/src/architecture/orchestrator-integration-model.md index 5c0bc17..9c9e925 100644 --- a/docs/src/architecture/orchestrator-integration-model.md +++ b/docs/src/architecture/orchestrator-integration-model.md @@ -1 +1,805 @@ -# Orchestrator Integration Model - Deep Dive\n\n**Date:** 2025-10-01\n**Status:** Clarification Document\n**Related:** [Multi-Repo Strategy](multi-repo-strategy.md), [Hybrid Orchestrator v3.0](../user/hybrid-orchestrator.md)\n\n## Executive Summary\n\nThis document clarifies **how the Rust orchestrator integrates with Nushell core** in both monorepo and multi-repo architectures. The orchestrator is\na **critical performance layer** that coordinates Nushell business logic execution, solving deep call stack limitations while preserving all existing\nfunctionality.\n\n---\n\n## Current Architecture (Hybrid Orchestrator v3.0)\n\n### The Problem Being Solved\n\n**Original Issue:**\n\n```\nDeep call stack in Nushell (template.nu:71)\n→ "Type not supported" errors\n→ Cannot handle complex nested workflows\n→ Performance bottlenecks with recursive calls\n```\n\n**Solution:** Rust orchestrator provides:\n\n1. **Task queue management** (file-based, reliable)\n2. **Priority scheduling** (intelligent task ordering)\n3. **Deep call stack elimination** (Rust handles recursion)\n4. **Performance optimization** (async/await, parallel execution)\n5. 
**State management** (workflow checkpointing)\n\n### How It Works Today (Monorepo)\n\n```\n┌─────────────────────────────────────────────────────────────┐\n│ User │\n└───────────────────────────┬─────────────────────────────────┘\n │ calls\n ↓\n ┌───────────────┐\n │ provisioning │ (Nushell CLI)\n │ CLI │\n └───────┬───────┘\n │\n ┌───────────────────┼───────────────────┐\n │ │ │\n ↓ ↓ ↓\n┌───────────────┐ ┌───────────────┐ ┌──────────────┐\n│ Direct Mode │ │Orchestrated │ │ Workflow │\n│ (Simple ops) │ │ Mode │ │ Mode │\n└───────────────┘ └───────┬───────┘ └──────┬───────┘\n │ │\n ↓ ↓\n ┌────────────────────────────────┐\n │ Rust Orchestrator Service │\n │ (Background daemon) │\n │ │\n │ • Task Queue (file-based) │\n │ • Priority Scheduler │\n │ • Workflow Engine │\n │ • REST API Server │\n └────────┬───────────────────────┘\n │ spawns\n ↓\n ┌────────────────┐\n │ Nushell │\n │ Business Logic │\n │ │\n │ • servers.nu │\n │ • taskservs.nu │\n │ • clusters.nu │\n └────────────────┘\n```\n\n### Three Execution Modes\n\n#### Mode 1: Direct Mode (Simple Operations)\n\n```\n# No orchestrator needed\nprovisioning server list\nprovisioning env\nprovisioning help\n\n# Direct Nushell execution\nprovisioning (CLI) → Nushell scripts → Result\n```\n\n#### Mode 2: Orchestrated Mode (Complex Operations)\n\n```\n# Uses orchestrator for coordination\nprovisioning server create --orchestrated\n\n# Flow:\nprovisioning CLI → Orchestrator API → Task Queue → Nushell executor\n ↓\n Result back to user\n```\n\n#### Mode 3: Workflow Mode (Batch Operations)\n\n```\n# Complex workflows with dependencies\nprovisioning workflow submit server-cluster.ncl\n\n# Flow:\nprovisioning CLI → Orchestrator Workflow Engine → Dependency Graph\n ↓\n Parallel task execution\n ↓\n Nushell scripts for each task\n ↓\n Checkpoint state\n```\n\n---\n\n## Integration Patterns\n\n### Pattern 1: CLI Submits Tasks to Orchestrator\n\n**Current Implementation:**\n\n**Nushell CLI (`core/nulib/workflows/server_create.nu`):**\n\n```\n# Submit server creation workflow to orchestrator\nexport def server_create_workflow [\n infra_name: string\n --orchestrated\n] {\n if $orchestrated {\n # Submit task to orchestrator\n let task = {\n type: "server_create"\n infra: $infra_name\n params: { ... 
}\n }\n\n # POST to orchestrator REST API\n http post http://localhost:9090/workflows/servers/create $task\n } else {\n # Direct execution (old way)\n do-server-create $infra_name\n }\n}\n```\n\n**Rust Orchestrator (`platform/orchestrator/src/api/workflows.rs`):**\n\n```\n// Receive workflow submission from Nushell CLI\n#[axum::debug_handler]\nasync fn create_server_workflow(\n State(state): State>,\n Json(request): Json,\n) -> Result, ApiError> {\n // Create task\n let task = Task {\n id: Uuid::new_v4(),\n task_type: TaskType::ServerCreate,\n payload: serde_json::to_value(&request)?,\n priority: Priority::Normal,\n status: TaskStatus::Pending,\n created_at: Utc::now(),\n };\n\n // Queue task\n state.task_queue.enqueue(task).await?;\n\n // Return immediately (async execution)\n Ok(Json(WorkflowResponse {\n workflow_id: task.id,\n status: "queued",\n }))\n}\n```\n\n**Flow:**\n\n```\nUser → provisioning server create --orchestrated\n ↓\nNushell CLI prepares task\n ↓\nHTTP POST to orchestrator (localhost:9090)\n ↓\nOrchestrator queues task\n ↓\nReturns workflow ID immediately\n ↓\nUser can monitor: provisioning workflow monitor \n```\n\n### Pattern 2: Orchestrator Executes Nushell Scripts\n\n**Orchestrator Task Executor (`platform/orchestrator/src/executor.rs`):**\n\n```\n// Orchestrator spawns Nushell to execute business logic\npub async fn execute_task(task: Task) -> Result {\n match task.task_type {\n TaskType::ServerCreate => {\n // Orchestrator calls Nushell script via subprocess\n let output = Command::new("nu")\n .arg("-c")\n .arg(format!(\n "use {}/servers/create.nu; create-server '{}'",\n PROVISIONING_LIB_PATH,\n task.payload.infra_name\n ))\n .output()\n .await?;\n\n // Parse Nushell output\n let result = parse_nushell_output(&output)?;\n\n Ok(TaskResult {\n task_id: task.id,\n status: if result.success { "completed" } else { "failed" },\n output: result.data,\n })\n }\n // Other task types...\n }\n}\n```\n\n**Flow:**\n\n```\nOrchestrator task queue has pending task\n ↓\nExecutor picks up task\n ↓\nSpawns Nushell subprocess: nu -c "use servers/create.nu; create-server 'wuji'"\n ↓\nNushell executes business logic\n ↓\nReturns result to orchestrator\n ↓\nOrchestrator updates task status\n ↓\nUser monitors via: provisioning workflow status \n```\n\n### Pattern 3: Bidirectional Communication\n\n**Nushell Calls Orchestrator API:**\n\n```\n# Nushell script checks orchestrator status during execution\nexport def check-orchestrator-health [] {\n let response = (http get http://localhost:9090/health)\n\n if $response.status != "healthy" {\n error make { msg: "Orchestrator not available" }\n }\n\n $response\n}\n\n# Nushell script reports progress to orchestrator\nexport def report-progress [task_id: string, progress: int] {\n http post http://localhost:9090/tasks/$task_id/progress {\n progress: $progress\n status: "in_progress"\n }\n}\n```\n\n**Orchestrator Monitors Nushell Execution:**\n\n```\n// Orchestrator tracks Nushell subprocess\npub async fn execute_with_monitoring(task: Task) -> Result {\n let mut child = Command::new("nu")\n .arg("-c")\n .arg(&task.script)\n .stdout(Stdio::piped())\n .stderr(Stdio::piped())\n .spawn()?;\n\n // Monitor stdout/stderr in real-time\n let stdout = child.stdout.take().unwrap();\n tokio::spawn(async move {\n let reader = BufReader::new(stdout);\n let mut lines = reader.lines();\n\n while let Some(line) = lines.next_line().await.unwrap() {\n // Parse progress updates from Nushell\n if line.contains("PROGRESS:") {\n update_task_progress(&line);\n }\n }\n 
});\n\n // Wait for completion with timeout\n let result = tokio::time::timeout(\n Duration::from_secs(3600),\n child.wait()\n ).await??;\n\n Ok(TaskResult::from_exit_status(result))\n}\n```\n\n---\n\n## Multi-Repo Architecture Impact\n\n### Repository Split Doesn't Change Integration Model\n\n**In Multi-Repo Setup:**\n\n**Repository: `provisioning-core`**\n\n- Contains: Nushell business logic\n- Installs to: `/usr/local/lib/provisioning/`\n- Package: `provisioning-core-3.2.1.tar.gz`\n\n**Repository: `provisioning-platform`**\n\n- Contains: Rust orchestrator\n- Installs to: `/usr/local/bin/provisioning-orchestrator`\n- Package: `provisioning-platform-2.5.3.tar.gz`\n\n**Runtime Integration (Same as Monorepo):**\n\n```\nUser installs both packages:\n provisioning-core-3.2.1 → /usr/local/lib/provisioning/\n provisioning-platform-2.5.3 → /usr/local/bin/provisioning-orchestrator\n\nOrchestrator expects core at: /usr/local/lib/provisioning/\nCore expects orchestrator at: http://localhost:9090/\n\nNo code dependencies, just runtime coordination!\n```\n\n### Configuration-Based Integration\n\n**Core Package (`provisioning-core`) config:**\n\n```\n# /usr/local/share/provisioning/config/config.defaults.toml\n\n[orchestrator]\nenabled = true\nendpoint = "http://localhost:9090"\ntimeout = 60\nauto_start = true # Start orchestrator if not running\n\n[execution]\ndefault_mode = "orchestrated" # Use orchestrator by default\nfallback_to_direct = true # Fall back if orchestrator down\n```\n\n**Platform Package (`provisioning-platform`) config:**\n\n```\n# /usr/local/share/provisioning/platform/config.toml\n\n[orchestrator]\nhost = "127.0.0.1"\nport = 8080\ndata_dir = "/var/lib/provisioning/orchestrator"\n\n[executor]\nnushell_binary = "nu" # Expects nu in PATH\nprovisioning_lib = "/usr/local/lib/provisioning"\nmax_concurrent_tasks = 10\ntask_timeout_seconds = 3600\n```\n\n### Version Compatibility\n\n**Compatibility Matrix (`provisioning-distribution/versions.toml`):**\n\n```\n[compatibility.platform."2.5.3"]\ncore = "^3.2" # Platform 2.5.3 compatible with core 3.2.x\nmin-core = "3.2.0"\napi-version = "v1"\n\n[compatibility.core."3.2.1"]\nplatform = "^2.5" # Core 3.2.1 compatible with platform 2.5.x\nmin-platform = "2.5.0"\norchestrator-api = "v1"\n```\n\n---\n\n## Execution Flow Examples\n\n### Example 1: Simple Server Creation (Direct Mode)\n\n**No Orchestrator Needed:**\n\n```\nprovisioning server list\n\n# Flow:\nCLI → servers/list.nu → Query state → Return results\n(Orchestrator not involved)\n```\n\n### Example 2: Server Creation with Orchestrator\n\n**Using Orchestrator:**\n\n```\nprovisioning server create --orchestrated --infra wuji\n\n# Detailed Flow:\n1. User executes command\n ↓\n2. Nushell CLI (provisioning binary)\n ↓\n3. Reads config: orchestrator.enabled = true\n ↓\n4. Prepares task payload:\n {\n type: "server_create",\n infra: "wuji",\n params: { ... }\n }\n ↓\n5. HTTP POST → http://localhost:9090/workflows/servers/create\n ↓\n6. Orchestrator receives request\n ↓\n7. Creates task with UUID\n ↓\n8. Enqueues to task queue (file-based: /var/lib/provisioning/queue/)\n ↓\n9. Returns immediately: { workflow_id: "abc-123", status: "queued" }\n ↓\n10. User sees: "Workflow submitted: abc-123"\n ↓\n11. Orchestrator executor picks up task\n ↓\n12. Spawns Nushell subprocess:\n nu -c "use /usr/local/lib/provisioning/servers/create.nu; create-server 'wuji'"\n ↓\n13. 
Nushell executes business logic:\n - Reads Nickel config\n - Calls provider API (UpCloud/AWS)\n - Creates server\n - Returns result\n ↓\n14. Orchestrator captures output\n ↓\n15. Updates task status: "completed"\n ↓\n16. User monitors: provisioning workflow status abc-123\n → Shows: "Server wuji created successfully"\n```\n\n### Example 3: Batch Workflow with Dependencies\n\n**Complex Workflow:**\n\n```\nprovisioning batch submit multi-cloud-deployment.ncl\n\n# Workflow contains:\n- Create 5 servers (parallel)\n- Install Kubernetes on servers (depends on server creation)\n- Deploy applications (depends on Kubernetes)\n\n# Detailed Flow:\n1. CLI submits Nickel workflow to orchestrator\n ↓\n2. Orchestrator parses workflow\n ↓\n3. Builds dependency graph using petgraph (Rust)\n ↓\n4. Topological sort determines execution order\n ↓\n5. Creates tasks for each operation\n ↓\n6. Executes in parallel where possible:\n\n [Server 1] [Server 2] [Server 3] [Server 4] [Server 5]\n ↓ ↓ ↓ ↓ ↓\n (All execute in parallel via Nushell subprocesses)\n ↓ ↓ ↓ ↓ ↓\n └──────────┴──────────┴──────────┴──────────┘\n │\n ↓\n [All servers ready]\n ↓\n [Install Kubernetes]\n (Nushell subprocess)\n ↓\n [Kubernetes ready]\n ↓\n [Deploy applications]\n (Nushell subprocess)\n ↓\n [Complete]\n\n7. Orchestrator checkpoints state at each step\n ↓\n8. If failure occurs, can retry from checkpoint\n ↓\n9. User monitors real-time: provisioning batch monitor \n```\n\n---\n\n## Why This Architecture\n\n### Orchestrator Benefits\n\n1. **Eliminates Deep Call Stack Issues**\n\n ```text\n\n Without Orchestrator:\n template.nu → calls → cluster.nu → calls → taskserv.nu → calls → provider.nu\n (Deep nesting causes "Type not supported" errors)\n\n With Orchestrator:\n Orchestrator → spawns → Nushell subprocess (flat execution)\n (No deep nesting, fresh Nushell context for each task)\n\n ```\n\n2. **Performance Optimization**\n\n ```rust\n // Orchestrator executes tasks in parallel\n let tasks = vec![task1, task2, task3, task4, task5];\n\n let results = futures::future::join_all(\n tasks.iter().map(|t| execute_task(t))\n ).await;\n\n // 5 Nushell subprocesses run concurrently\n ```\n\n1. **Reliable State Management**\n\n```\n Orchestrator maintains:\n - Task queue (survives crashes)\n - Workflow checkpoints (resume on failure)\n - Progress tracking (real-time monitoring)\n - Retry logic (automatic recovery)\n```\n\n1. **Clean Separation**\n\n```\n Orchestrator (Rust): Performance, concurrency, state\n Business Logic (Nushell): Providers, taskservs, workflows\n\n Each does what it's best at!\n```\n\n### Why NOT Pure Rust\n\n**Question:** Why not implement everything in Rust?\n\n**Answer:**\n\n1. **Nushell is perfect for infrastructure automation:**\n - Shell-like scripting for system operations\n - Built-in structured data handling\n - Easy template rendering\n - Readable business logic\n\n2. **Rapid iteration:**\n - Change Nushell scripts without recompiling\n - Community can contribute Nushell modules\n - Template-based configuration generation\n\n3. **Best of both worlds:**\n - Rust: Performance, type safety, concurrency\n - Nushell: Flexibility, readability, ease of use\n\n---\n\n## Multi-Repo Integration Example\n\n### Installation\n\n**User installs bundle:**\n\n```\ncurl -fsSL https://get.provisioning.io | sh\n\n# Installs:\n1. provisioning-core-3.2.1.tar.gz\n → /usr/local/bin/provisioning (Nushell CLI)\n → /usr/local/lib/provisioning/ (Nushell libraries)\n → /usr/local/share/provisioning/ (configs, templates)\n\n2. 
provisioning-platform-2.5.3.tar.gz\n → /usr/local/bin/provisioning-orchestrator (Rust binary)\n → /usr/local/share/provisioning/platform/ (platform configs)\n\n3. Sets up systemd/launchd service for orchestrator\n```\n\n### Runtime Coordination\n\n**Core package expects orchestrator:**\n\n```\n# core/nulib/lib_provisioning/orchestrator/client.nu\n\n# Check if orchestrator is running\nexport def orchestrator-available [] {\n let config = (load-config)\n let endpoint = $config.orchestrator.endpoint\n\n try {\n let response = (http get $"($endpoint)/health")\n $response.status == "healthy"\n } catch {\n false\n }\n}\n\n# Auto-start orchestrator if needed\nexport def ensure-orchestrator [] {\n if not (orchestrator-available) {\n if (load-config).orchestrator.auto_start {\n print "Starting orchestrator..."\n ^provisioning-orchestrator --daemon\n sleep 2sec\n }\n }\n}\n```\n\n**Platform package executes core scripts:**\n\n```\n// platform/orchestrator/src/executor/nushell.rs\n\npub struct NushellExecutor {\n provisioning_lib: PathBuf, // /usr/local/lib/provisioning\n nu_binary: PathBuf, // nu (from PATH)\n}\n\nimpl NushellExecutor {\n pub async fn execute_script(&self, script: &str) -> Result {\n Command::new(&self.nu_binary)\n .env("NU_LIB_DIRS", &self.provisioning_lib)\n .arg("-c")\n .arg(script)\n .output()\n .await\n }\n\n pub async fn execute_module_function(\n &self,\n module: &str,\n function: &str,\n args: &[String],\n ) -> Result {\n let script = format!(\n "use {}/{}; {} {}",\n self.provisioning_lib.display(),\n module,\n function,\n args.join(" ")\n );\n\n self.execute_script(&script).await\n }\n}\n```\n\n---\n\n## Configuration Examples\n\n### Core Package Config\n\n**`/usr/local/share/provisioning/config/config.defaults.toml`:**\n\n```\n[orchestrator]\nenabled = true\nendpoint = "http://localhost:9090"\ntimeout_seconds = 60\nauto_start = true\nfallback_to_direct = true\n\n[execution]\n# Modes: "direct", "orchestrated", "auto"\ndefault_mode = "auto" # Auto-detect based on complexity\n\n# Operations that always use orchestrator\nforce_orchestrated = [\n "server.create",\n "cluster.create",\n "batch.*",\n "workflow.*"\n]\n\n# Operations that always run direct\nforce_direct = [\n "*.list",\n "*.show",\n "help",\n "version"\n]\n```\n\n### Platform Package Config\n\n**`/usr/local/share/provisioning/platform/config.toml`:**\n\n```\n[server]\nhost = "127.0.0.1"\nport = 8080\n\n[storage]\nbackend = "filesystem" # or "surrealdb"\ndata_dir = "/var/lib/provisioning/orchestrator"\n\n[executor]\nmax_concurrent_tasks = 10\ntask_timeout_seconds = 3600\ncheckpoint_interval_seconds = 30\n\n[nushell]\nbinary = "nu" # Expects nu in PATH\nprovisioning_lib = "/usr/local/lib/provisioning"\nenv_vars = { NU_LIB_DIRS = "/usr/local/lib/provisioning" }\n```\n\n---\n\n## Key Takeaways\n\n### 1. **Orchestrator is Essential**\n\n- Solves deep call stack problems\n- Provides performance optimization\n- Enables complex workflows\n- NOT optional for production use\n\n### 2. **Integration is Loose but Coordinated**\n\n- No code dependencies between repos\n- Runtime integration via CLI + REST API\n- Configuration-driven coordination\n- Works in both monorepo and multi-repo\n\n### 3. **Best of Both Worlds**\n\n- Rust: High-performance coordination\n- Nushell: Flexible business logic\n- Clean separation of concerns\n- Each technology does what it's best at\n\n### 4. 
**Multi-Repo Doesn't Change Integration**\n\n- Same runtime model as monorepo\n- Package installation sets up paths\n- Configuration enables discovery\n- Versioning ensures compatibility\n\n---\n\n## Conclusion\n\nThe confusing example in the multi-repo doc was **oversimplified**. The real architecture is:\n\n```\n✅ Orchestrator IS USED and IS ESSENTIAL\n✅ Platform (Rust) coordinates Core (Nushell) execution\n✅ Loose coupling via CLI + REST API (not code dependencies)\n✅ Works identically in monorepo and multi-repo\n✅ Configuration-based integration (no hardcoded paths)\n```\n\nThe orchestrator provides:\n\n- Performance layer (async, parallel execution)\n- Workflow engine (complex dependencies)\n- State management (checkpoints, recovery)\n- Task queue (reliable execution)\n\nWhile Nushell provides:\n\n- Business logic (providers, taskservs, clusters)\n- Template rendering (Jinja2 via nu_plugin_tera)\n- Configuration management (KCL integration)\n- User-facing scripting\n\n**Multi-repo just splits WHERE the code lives, not HOW it works together.** +# Orchestrator Integration Model - Deep Dive + +**Date:** 2025-10-01 +**Status:** Clarification Document +**Related:** [Multi-Repo Strategy](multi-repo-strategy.md), [Hybrid Orchestrator v3.0](../user/hybrid-orchestrator.md) + +## Executive Summary + +This document clarifies **how the Rust orchestrator integrates with Nushell core** in both monorepo and multi-repo architectures. The orchestrator is +a **critical performance layer** that coordinates Nushell business logic execution, solving deep call stack limitations while preserving all existing +functionality. + +--- + +## Current Architecture (Hybrid Orchestrator v3.0) + +### The Problem Being Solved + +**Original Issue:** + +```text +Deep call stack in Nushell (template.nu:71) +→ "Type not supported" errors +→ Cannot handle complex nested workflows +→ Performance bottlenecks with recursive calls +``` + +**Solution:** Rust orchestrator provides: + +1. **Task queue management** (file-based, reliable) +2. **Priority scheduling** (intelligent task ordering) +3. **Deep call stack elimination** (Rust handles recursion) +4. **Performance optimization** (async/await, parallel execution) +5. 
**State management** (workflow checkpointing) + +### How It Works Today (Monorepo) + +```text +┌─────────────────────────────────────────────────────────────┐ +│ User │ +└───────────────────────────┬─────────────────────────────────┘ + │ calls + ↓ + ┌───────────────┐ + │ provisioning │ (Nushell CLI) + │ CLI │ + └───────┬───────┘ + │ + ┌───────────────────┼───────────────────┐ + │ │ │ + ↓ ↓ ↓ +┌───────────────┐ ┌───────────────┐ ┌──────────────┐ +│ Direct Mode │ │Orchestrated │ │ Workflow │ +│ (Simple ops) │ │ Mode │ │ Mode │ +└───────────────┘ └───────┬───────┘ └──────┬───────┘ + │ │ + ↓ ↓ + ┌────────────────────────────────┐ + │ Rust Orchestrator Service │ + │ (Background daemon) │ + │ │ + │ • Task Queue (file-based) │ + │ • Priority Scheduler │ + │ • Workflow Engine │ + │ • REST API Server │ + └────────┬───────────────────────┘ + │ spawns + ↓ + ┌────────────────┐ + │ Nushell │ + │ Business Logic │ + │ │ + │ • servers.nu │ + │ • taskservs.nu │ + │ • clusters.nu │ + └────────────────┘ +``` + +### Three Execution Modes + +#### Mode 1: Direct Mode (Simple Operations) + +```text +# No orchestrator needed +provisioning server list +provisioning env +provisioning help + +# Direct Nushell execution +provisioning (CLI) → Nushell scripts → Result +``` + +#### Mode 2: Orchestrated Mode (Complex Operations) + +```text +# Uses orchestrator for coordination +provisioning server create --orchestrated + +# Flow: +provisioning CLI → Orchestrator API → Task Queue → Nushell executor + ↓ + Result back to user +``` + +#### Mode 3: Workflow Mode (Batch Operations) + +```text +# Complex workflows with dependencies +provisioning workflow submit server-cluster.ncl + +# Flow: +provisioning CLI → Orchestrator Workflow Engine → Dependency Graph + ↓ + Parallel task execution + ↓ + Nushell scripts for each task + ↓ + Checkpoint state +``` + +--- + +## Integration Patterns + +### Pattern 1: CLI Submits Tasks to Orchestrator + +**Current Implementation:** + +**Nushell CLI (`core/nulib/workflows/server_create.nu`):** + +```text +# Submit server creation workflow to orchestrator +export def server_create_workflow [ + infra_name: string + --orchestrated +] { + if $orchestrated { + # Submit task to orchestrator + let task = { + type: "server_create" + infra: $infra_name + params: { ... 
} + } + + # POST to orchestrator REST API + http post http://localhost:9090/workflows/servers/create $task + } else { + # Direct execution (old way) + do-server-create $infra_name + } +} +``` + +**Rust Orchestrator (`platform/orchestrator/src/api/workflows.rs`):** + +```text +// Receive workflow submission from Nushell CLI +#[axum::debug_handler] +async fn create_server_workflow( + State(state): State>, + Json(request): Json, +) -> Result, ApiError> { + // Create task + let task = Task { + id: Uuid::new_v4(), + task_type: TaskType::ServerCreate, + payload: serde_json::to_value(&request)?, + priority: Priority::Normal, + status: TaskStatus::Pending, + created_at: Utc::now(), + }; + + // Queue task + state.task_queue.enqueue(task).await?; + + // Return immediately (async execution) + Ok(Json(WorkflowResponse { + workflow_id: task.id, + status: "queued", + })) +} +``` + +**Flow:** + +```text +User → provisioning server create --orchestrated + ↓ +Nushell CLI prepares task + ↓ +HTTP POST to orchestrator (localhost:9090) + ↓ +Orchestrator queues task + ↓ +Returns workflow ID immediately + ↓ +User can monitor: provisioning workflow monitor +``` + +### Pattern 2: Orchestrator Executes Nushell Scripts + +**Orchestrator Task Executor (`platform/orchestrator/src/executor.rs`):** + +```text +// Orchestrator spawns Nushell to execute business logic +pub async fn execute_task(task: Task) -> Result { + match task.task_type { + TaskType::ServerCreate => { + // Orchestrator calls Nushell script via subprocess + let output = Command::new("nu") + .arg("-c") + .arg(format!( + "use {}/servers/create.nu; create-server '{}'", + PROVISIONING_LIB_PATH, + task.payload.infra_name + )) + .output() + .await?; + + // Parse Nushell output + let result = parse_nushell_output(&output)?; + + Ok(TaskResult { + task_id: task.id, + status: if result.success { "completed" } else { "failed" }, + output: result.data, + }) + } + // Other task types... 
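+        // e.g. a hypothetical TaskType::TaskservInstall arm could spawn
+        // nu with taskservs/install.nu under PROVISIONING_LIB_PATH the same way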
+ } +} +``` + +**Flow:** + +```text +Orchestrator task queue has pending task + ↓ +Executor picks up task + ↓ +Spawns Nushell subprocess: nu -c "use servers/create.nu; create-server 'wuji'" + ↓ +Nushell executes business logic + ↓ +Returns result to orchestrator + ↓ +Orchestrator updates task status + ↓ +User monitors via: provisioning workflow status +``` + +### Pattern 3: Bidirectional Communication + +**Nushell Calls Orchestrator API:** + +```text +# Nushell script checks orchestrator status during execution +export def check-orchestrator-health [] { + let response = (http get http://localhost:9090/health) + + if $response.status != "healthy" { + error make { msg: "Orchestrator not available" } + } + + $response +} + +# Nushell script reports progress to orchestrator +export def report-progress [task_id: string, progress: int] { + http post http://localhost:9090/tasks/$task_id/progress { + progress: $progress + status: "in_progress" + } +} +``` + +**Orchestrator Monitors Nushell Execution:** + +```text +// Orchestrator tracks Nushell subprocess +pub async fn execute_with_monitoring(task: Task) -> Result { + let mut child = Command::new("nu") + .arg("-c") + .arg(&task.script) + .stdout(Stdio::piped()) + .stderr(Stdio::piped()) + .spawn()?; + + // Monitor stdout/stderr in real-time + let stdout = child.stdout.take().unwrap(); + tokio::spawn(async move { + let reader = BufReader::new(stdout); + let mut lines = reader.lines(); + + while let Some(line) = lines.next_line().await.unwrap() { + // Parse progress updates from Nushell + if line.contains("PROGRESS:") { + update_task_progress(&line); + } + } + }); + + // Wait for completion with timeout + let result = tokio::time::timeout( + Duration::from_secs(3600), + child.wait() + ).await??; + + Ok(TaskResult::from_exit_status(result)) +} +``` + +--- + +## Multi-Repo Architecture Impact + +### Repository Split Doesn't Change Integration Model + +**In Multi-Repo Setup:** + +**Repository: `provisioning-core`** + +- Contains: Nushell business logic +- Installs to: `/usr/local/lib/provisioning/` +- Package: `provisioning-core-3.2.1.tar.gz` + +**Repository: `provisioning-platform`** + +- Contains: Rust orchestrator +- Installs to: `/usr/local/bin/provisioning-orchestrator` +- Package: `provisioning-platform-2.5.3.tar.gz` + +**Runtime Integration (Same as Monorepo):** + +```text +User installs both packages: + provisioning-core-3.2.1 → /usr/local/lib/provisioning/ + provisioning-platform-2.5.3 → /usr/local/bin/provisioning-orchestrator + +Orchestrator expects core at: /usr/local/lib/provisioning/ +Core expects orchestrator at: http://localhost:9090/ + +No code dependencies, just runtime coordination! 
+```
+
+### Configuration-Based Integration
+
+**Core Package (`provisioning-core`) config:**
+
+```text
+# /usr/local/share/provisioning/config/config.defaults.toml
+
+[orchestrator]
+enabled = true
+endpoint = "http://localhost:9090"
+timeout_seconds = 60
+auto_start = true              # Start orchestrator if not running
+
+[execution]
+default_mode = "orchestrated"  # Use orchestrator by default
+fallback_to_direct = true      # Fall back if orchestrator down
+```
+
+**Platform Package (`provisioning-platform`) config:**
+
+```text
+# /usr/local/share/provisioning/platform/config.toml
+
+[orchestrator]
+host = "127.0.0.1"
+port = 9090
+data_dir = "/var/lib/provisioning/orchestrator"
+
+[executor]
+nushell_binary = "nu"  # Expects nu in PATH
+provisioning_lib = "/usr/local/lib/provisioning"
+max_concurrent_tasks = 10
+task_timeout_seconds = 3600
+```
+
+### Version Compatibility
+
+**Compatibility Matrix (`provisioning-distribution/versions.toml`):**
+
+```text
+[compatibility.platform."2.5.3"]
+core = "^3.2"          # Platform 2.5.3 compatible with core 3.2.x
+min-core = "3.2.0"
+api-version = "v1"
+
+[compatibility.core."3.2.1"]
+platform = "^2.5"      # Core 3.2.1 compatible with platform 2.5.x
+min-platform = "2.5.0"
+orchestrator-api = "v1"
+```
+
+---
+
+## Execution Flow Examples
+
+### Example 1: Simple Server Creation (Direct Mode)
+
+**No Orchestrator Needed:**
+
+```text
+provisioning server list
+
+# Flow:
+CLI → servers/list.nu → Query state → Return results
+(Orchestrator not involved)
+```
+
+### Example 2: Server Creation with Orchestrator
+
+**Using Orchestrator:**
+
+```text
+provisioning server create --orchestrated --infra wuji
+
+# Detailed Flow:
+1. User executes command
+   ↓
+2. Nushell CLI (provisioning binary)
+   ↓
+3. Reads config: orchestrator.enabled = true
+   ↓
+4. Prepares task payload:
+   {
+     type: "server_create",
+     infra: "wuji",
+     params: { ... }
+   }
+   ↓
+5. HTTP POST → http://localhost:9090/workflows/servers/create
+   ↓
+6. Orchestrator receives request
+   ↓
+7. Creates task with UUID
+   ↓
+8. Enqueues to task queue (file-based: /var/lib/provisioning/queue/)
+   ↓
+9. Returns immediately: { workflow_id: "abc-123", status: "queued" }
+   ↓
+10. User sees: "Workflow submitted: abc-123"
+    ↓
+11. Orchestrator executor picks up task
+    ↓
+12. Spawns Nushell subprocess:
+    nu -c "use /usr/local/lib/provisioning/servers/create.nu; create-server 'wuji'"
+    ↓
+13. Nushell executes business logic:
+    - Reads Nickel config
+    - Calls provider API (UpCloud/AWS)
+    - Creates server
+    - Returns result
+    ↓
+14. Orchestrator captures output
+    ↓
+15. Updates task status: "completed"
+    ↓
+16. User monitors: provisioning workflow status abc-123
+    → Shows: "Server wuji created successfully"
+```
+
+### Example 3: Batch Workflow with Dependencies
+
+**Complex Workflow:**
+
+```text
+provisioning batch submit multi-cloud-deployment.ncl
+
+# Workflow contains:
+- Create 5 servers (parallel)
+- Install Kubernetes on servers (depends on server creation)
+- Deploy applications (depends on Kubernetes)
+
+# Detailed Flow:
+1. CLI submits Nickel workflow to orchestrator
+   ↓
+2. Orchestrator parses workflow
+   ↓
+3. Builds dependency graph using petgraph (Rust)
+   ↓
+4. Topological sort determines execution order
+   ↓
+5. Creates tasks for each operation
+   ↓
+6. Executes in parallel where possible:
+
+   [Server 1] [Server 2] [Server 3] [Server 4] [Server 5]
+       ↓          ↓          ↓          ↓          ↓
+   (All execute in parallel via Nushell subprocesses)
+       ↓          ↓          ↓          ↓          ↓
+       └──────────┴──────────┴──────────┴──────────┘
+                            │
+                            ↓
+                   [All servers ready]
+                            ↓
+                   [Install Kubernetes]
+                   (Nushell subprocess)
+                            ↓
+                   [Kubernetes ready]
+                            ↓
+                   [Deploy applications]
+                   (Nushell subprocess)
+                            ↓
+                   [Complete]
+
+7. Orchestrator checkpoints state at each step
+   ↓
+8. If failure occurs, can retry from checkpoint
+   ↓
+9. User monitors real-time: provisioning batch monitor
+```
+
+---
+
+## Why This Architecture
+
+### Orchestrator Benefits
+
+1. **Eliminates Deep Call Stack Issues**
+
+   ```text
+   Without Orchestrator:
+   template.nu → calls → cluster.nu → calls → taskserv.nu → calls → provider.nu
+   (Deep nesting causes "Type not supported" errors)
+
+   With Orchestrator:
+   Orchestrator → spawns → Nushell subprocess (flat execution)
+   (No deep nesting, fresh Nushell context for each task)
+   ```
+
+2. **Performance Optimization**
+
+   ```rust
+   // Orchestrator executes tasks in parallel
+   let tasks = vec![task1, task2, task3, task4, task5];
+
+   let results = futures::future::join_all(
+       tasks.iter().map(|t| execute_task(t))
+   ).await;
+
+   // 5 Nushell subprocesses run concurrently
+   ```
+
+3. **Reliable State Management**
+
+   ```text
+   Orchestrator maintains:
+   - Task queue (survives crashes)
+   - Workflow checkpoints (resume on failure)
+   - Progress tracking (real-time monitoring)
+   - Retry logic (automatic recovery)
+   ```
+
+4. **Clean Separation**
+
+   ```text
+   Orchestrator (Rust): Performance, concurrency, state
+   Business Logic (Nushell): Providers, taskservs, workflows
+
+   Each does what it's best at!
+   ```
+
+### Why NOT Pure Rust
+
+**Question:** Why not implement everything in Rust?
+
+**Answer:**
+
+1. **Nushell is perfect for infrastructure automation:**
+   - Shell-like scripting for system operations
+   - Built-in structured data handling
+   - Easy template rendering
+   - Readable business logic
+
+2. **Rapid iteration:**
+   - Change Nushell scripts without recompiling
+   - Community can contribute Nushell modules
+   - Template-based configuration generation
+
+3. **Best of both worlds:**
+   - Rust: Performance, type safety, concurrency
+   - Nushell: Flexibility, readability, ease of use
+
+---
+
+## Multi-Repo Integration Example
+
+### Installation
+
+**User installs bundle:**
+
+```text
+curl -fsSL https://get.provisioning.io | sh
+
+# Installs:
+1. provisioning-core-3.2.1.tar.gz
+   → /usr/local/bin/provisioning (Nushell CLI)
+   → /usr/local/lib/provisioning/ (Nushell libraries)
+   → /usr/local/share/provisioning/ (configs, templates)
+
+2. provisioning-platform-2.5.3.tar.gz
+   → /usr/local/bin/provisioning-orchestrator (Rust binary)
+   → /usr/local/share/provisioning/platform/ (platform configs)
+
+3. Sets up systemd/launchd service for orchestrator
+```
+
+### Runtime Coordination
+
+**Core package expects orchestrator:**
+
+```text
+# core/nulib/lib_provisioning/orchestrator/client.nu
+
+# Check if orchestrator is running
+export def orchestrator-available [] {
+    let config = (load-config)
+    let endpoint = $config.orchestrator.endpoint
+
+    try {
+        let response = (http get $"($endpoint)/health")
+        $response.status == "healthy"
+    } catch {
+        false
+    }
+}
+
+# Auto-start orchestrator if needed
+export def ensure-orchestrator [] {
+    if not (orchestrator-available) {
+        if (load-config).orchestrator.auto_start {
+            print "Starting orchestrator..."
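+            # `^` runs the external orchestrator binary installed by the
+            # platform package; --daemon leaves it running in the background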
+            ^provisioning-orchestrator --daemon
+            sleep 2sec
+        }
+    }
+}
+```
+
+**Platform package executes core scripts:**
+
+```text
+// platform/orchestrator/src/executor/nushell.rs
+
+pub struct NushellExecutor {
+    provisioning_lib: PathBuf,  // /usr/local/lib/provisioning
+    nu_binary: PathBuf,         // nu (from PATH)
+}
+
+impl NushellExecutor {
+    pub async fn execute_script(&self, script: &str) -> Result<Output> {
+        Command::new(&self.nu_binary)
+            .env("NU_LIB_DIRS", &self.provisioning_lib)
+            .arg("-c")
+            .arg(script)
+            .output()
+            .await
+    }
+
+    pub async fn execute_module_function(
+        &self,
+        module: &str,
+        function: &str,
+        args: &[String],
+    ) -> Result<Output> {
+        let script = format!(
+            "use {}/{}; {} {}",
+            self.provisioning_lib.display(),
+            module,
+            function,
+            args.join(" ")
+        );
+
+        self.execute_script(&script).await
+    }
+}
+```
+
+---
+
+## Configuration Examples
+
+### Core Package Config
+
+**`/usr/local/share/provisioning/config/config.defaults.toml`:**
+
+```text
+[orchestrator]
+enabled = true
+endpoint = "http://localhost:9090"
+timeout_seconds = 60
+auto_start = true
+fallback_to_direct = true
+
+[execution]
+# Modes: "direct", "orchestrated", "auto"
+default_mode = "auto"  # Auto-detect based on complexity
+
+# Operations that always use orchestrator
+force_orchestrated = [
+    "server.create",
+    "cluster.create",
+    "batch.*",
+    "workflow.*"
+]
+
+# Operations that always run direct
+force_direct = [
+    "*.list",
+    "*.show",
+    "help",
+    "version"
+]
+```
+
+### Platform Package Config
+
+**`/usr/local/share/provisioning/platform/config.toml`:**
+
+```text
+[server]
+host = "127.0.0.1"
+port = 9090
+
+[storage]
+backend = "filesystem"  # or "surrealdb"
+data_dir = "/var/lib/provisioning/orchestrator"
+
+[executor]
+max_concurrent_tasks = 10
+task_timeout_seconds = 3600
+checkpoint_interval_seconds = 30
+
+[nushell]
+binary = "nu"  # Expects nu in PATH
+provisioning_lib = "/usr/local/lib/provisioning"
+env_vars = { NU_LIB_DIRS = "/usr/local/lib/provisioning" }
+```
+
+---
+
+## Key Takeaways
+
+### 1. **Orchestrator is Essential**
+
+- Solves deep call stack problems
+- Provides performance optimization
+- Enables complex workflows
+- NOT optional for production use
+
+### 2. **Integration is Loose but Coordinated**
+
+- No code dependencies between repos
+- Runtime integration via CLI + REST API
+- Configuration-driven coordination
+- Works in both monorepo and multi-repo
+
+### 3. **Best of Both Worlds**
+
+- Rust: High-performance coordination
+- Nushell: Flexible business logic
+- Clean separation of concerns
+- Each technology does what it's best at
+
+### 4. **Multi-Repo Doesn't Change Integration**
+
+- Same runtime model as monorepo
+- Package installation sets up paths
+- Configuration enables discovery
+- Versioning ensures compatibility
+
+---
+
+## Conclusion
+
+The confusing example in the multi-repo doc was **oversimplified**. The real architecture is:
+
+```text
+✅ Orchestrator IS USED and IS ESSENTIAL
+✅ Platform (Rust) coordinates Core (Nushell) execution
+✅ Loose coupling via CLI + REST API (not code dependencies)
+✅ Works identically in monorepo and multi-repo
+✅ Configuration-based integration (no hardcoded paths)
+```
+
+The orchestrator provides:
+
+- Performance layer (async, parallel execution)
+- Workflow engine (complex dependencies)
+- State management (checkpoints, recovery)
+- Task queue (reliable execution)
+
+While Nushell provides:
+
+- Business logic (providers, taskservs, clusters)
+- Template rendering (Jinja2 via nu_plugin_tera)
+- Configuration management (Nickel integration)
+- User-facing scripting
+
+**Multi-repo just splits WHERE the code lives, not HOW it works together.**
\ No newline at end of file
diff --git a/docs/src/architecture/package-and-loader-system.md b/docs/src/architecture/package-and-loader-system.md
index a80a9a6..22ecac7 100644
--- a/docs/src/architecture/package-and-loader-system.md
+++ b/docs/src/architecture/package-and-loader-system.md
@@ -1 +1,410 @@
-# Nickel Package and Module Loader System\n\nThis document describes the package-based architecture implemented for the provisioning system, replacing hardcoded extension paths with a\nflexible module discovery and loading system using Nickel for type-safe configuration.\n\n## Architecture Overview\n\nThe system consists of two main components:\n\n1. **Core Nickel Package**: Distributable core provisioning schemas with type safety\n2. **Module Loader System**: Dynamic discovery and loading of extensions\n\n### Benefits\n\n- **Type-Safe Configuration**: Nickel ensures configuration validity at evaluation time\n- **Clean Separation**: Core package is self-contained and distributable\n- **Plug-and-Play Extensions**: Taskservs, providers, and clusters can be loaded dynamically\n- **Version Management**: Core package and extensions can be versioned independently\n- **Developer Friendly**: Easy workspace setup and module management with lazy evaluation\n\n## Components\n\n### 1. Core Nickel Package (`/provisioning/schemas/`)\n\nContains fundamental schemas for provisioning:\n\n- `main.ncl` - Primary provisioning configuration\n- `server.ncl` - Server definitions and schemas\n- `defaults.ncl` - Default configurations\n- `lib.ncl` - Common library schemas\n- `dependencies.ncl` - Dependency management schemas\n\n**Key Features:**\n\n- No hardcoded extension paths\n- Self-contained and distributable\n- Type-safe package-based imports\n- Lazy evaluation of expensive computations\n\n### 2. Module Discovery System\n\n#### Discovery Commands\n\n```\n# Discover available modules\nmodule-loader discover taskservs # List all taskservs\nmodule-loader discover providers --format yaml # List providers as YAML\nmodule-loader discover clusters redis # Search for redis clusters\n```\n\n#### Supported Module Types\n\n- **Taskservs**: Infrastructure services (kubernetes, redis, postgres, etc.)\n- **Providers**: Cloud providers (upcloud, aws, local)\n- **Clusters**: Complete configurations (buildkit, web, oci-reg)\n\n### 3. Module Loading System\n\n#### Loading Commands\n\n```\n# Load modules into workspace\nmodule-loader load taskservs . [kubernetes, cilium, containerd]\nmodule-loader load providers . [upcloud]\nmodule-loader load clusters . 
[buildkit]\n\n# Initialize workspace with modules\nmodule-loader init workspace/infra/production \\n --taskservs [kubernetes, cilium] \\n --providers [upcloud]\n```\n\n#### Generated Files\n\n- `taskservs.ncl` - Auto-generated taskserv imports\n- `providers.ncl` - Auto-generated provider imports\n- `clusters.ncl` - Auto-generated cluster imports\n- `.manifest/*.yaml` - Module loading manifests\n\n## Workspace Structure\n\n### New Workspace Layout\n\n```\nworkspace/infra/my-project/\n├── kcl.mod # Package dependencies\n├── servers.ncl # Main server configuration\n├── taskservs.ncl # Auto-generated taskserv imports\n├── providers.ncl # Auto-generated provider imports\n├── clusters.ncl # Auto-generated cluster imports\n├── .taskservs/ # Loaded taskserv modules\n│ ├── kubernetes/\n│ ├── cilium/\n│ └── containerd/\n├── .providers/ # Loaded provider modules\n│ └── upcloud/\n├── .clusters/ # Loaded cluster modules\n│ └── buildkit/\n├── .manifest/ # Module manifests\n│ ├── taskservs.yaml\n│ ├── providers.yaml\n│ └── clusters.yaml\n├── data/ # Runtime data\n├── tmp/ # Temporary files\n├── resources/ # Resource definitions\n└── clusters/ # Cluster configurations\n```\n\n### Import Patterns\n\n#### Before (Old System)\n\n```\n# Hardcoded relative paths\nimport ../../../kcl/server as server\nimport ../../../extensions/taskservs/kubernetes/kcl/kubernetes as k8s\n```\n\n#### After (New System)\n\n```\n# Package-based imports\nimport provisioning.server as server\n\n# Auto-generated module imports (after loading)\nimport .taskservs.nclubernetes.kubernetes as k8s\n```\n\n## Package Distribution\n\n### Building Core Package\n\n```\n# Build distributable package\n./provisioning/tools/kcl-packager.nu build --version 1.0.0\n\n# Install locally\n./provisioning/tools/kcl-packager.nu install dist/provisioning-1.0.0.tar.gz\n\n# Create release\n./provisioning/tools/kcl-packager.nu build --format tar.gz --include-docs\n```\n\n### Package Installation Methods\n\n#### Method 1: Local Installation (Recommended for development)\n\n```\n[dependencies]\nprovisioning = { path = "~/.kcl/packages/provisioning", version = "0.0.1" }\n```\n\n#### Method 2: Git Repository (For distributed teams)\n\n```\n[dependencies]\nprovisioning = { git = "https://github.com/your-org/provisioning-kcl", version = "v0.0.1" }\n```\n\n#### Method 3: KCL Registry (When available)\n\n```\n[dependencies]\nprovisioning = { version = "0.0.1" }\n```\n\n## Developer Workflows\n\n### 1. New Project Setup\n\n```\n# Create workspace from template\ncp -r provisioning/templates/workspaces/kubernetes ./my-k8s-cluster\ncd my-k8s-cluster\n\n# Initialize with modules\nworkspace-init.nu . init\n\n# Load required modules\nmodule-loader load taskservs . [kubernetes, cilium, containerd]\nmodule-loader load providers . [upcloud]\n\n# Validate and deploy\nkcl run servers.ncl\nprovisioning server create --infra . --check\n```\n\n### 2. Extension Development\n\n```\n# Create new taskserv\nmkdir -p extensions/taskservs/my-service/kcl\ncd extensions/taskservs/my-service/kcl\n\n# Initialize KCL module\nkcl mod init my-service\necho 'provisioning = { path = "~/.kcl/packages/provisioning", version = "0.0.1" }' >> kcl.mod\n\n# Develop and test\nmodule-loader discover taskservs # Should find your service\n```\n\n### 3. 
Workspace Migration\n\n```\n# Analyze existing workspace\nworkspace-migrate.nu workspace/infra/old-project dry-run\n\n# Perform migration\nworkspace-migrate.nu workspace/infra/old-project\n\n# Verify migration\nmodule-loader validate workspace/infra/old-project\n```\n\n### 4. Multi-Environment Management\n\n```\n# Development environment\ncd workspace/infra/dev\nmodule-loader load taskservs . [redis, postgres]\nmodule-loader load providers . [local]\n\n# Production environment\ncd workspace/infra/prod\nmodule-loader load taskservs . [redis, postgres, kubernetes, monitoring]\nmodule-loader load providers . [upcloud, aws] # Multi-cloud\n```\n\n## Module Management\n\n### Listing and Validation\n\n```\n# List loaded modules\nmodule-loader list taskservs .\nmodule-loader list providers .\nmodule-loader list clusters .\n\n# Validate workspace\nmodule-loader validate .\n\n# Show workspace info\nworkspace-init.nu . info\n```\n\n### Unloading Modules\n\n```\n# Remove specific modules\nmodule-loader unload taskservs . redis\nmodule-loader unload providers . aws\n\n# This regenerates import files automatically\n```\n\n### Module Information\n\n```\n# Get detailed module info\nmodule-loader info taskservs kubernetes\nmodule-loader info providers upcloud\nmodule-loader info clusters buildkit\n```\n\n## CI/CD Integration\n\n### Pipeline Example\n\n```\n#!/usr/bin/env nu\n# deploy-pipeline.nu\n\n# Install specific versions\nkcl-packager.nu install --version $env.PROVISIONING_VERSION\n\n# Load production modules\nmodule-loader init $env.WORKSPACE_PATH \\n --taskservs $env.REQUIRED_TASKSERVS \\n --providers [$env.CLOUD_PROVIDER]\n\n# Validate configuration\nmodule-loader validate $env.WORKSPACE_PATH\n\n# Deploy infrastructure\nprovisioning server create --infra $env.WORKSPACE_PATH\n```\n\n## Troubleshooting\n\n### Common Issues\n\n#### Module Import Errors\n\n```\nError: module not found\n```\n\n**Solution**: Verify modules are loaded and regenerate imports\n\n```\nmodule-loader list taskservs .\nmodule-loader load taskservs . [kubernetes, cilium, containerd]\n```\n\n#### Provider Configuration Issues\n\n**Solution**: Check provider-specific configuration in `.providers/` directory\n\n#### KCL Compilation Errors\n\n**Solution**: Verify core package installation and kcl.mod configuration\n\n```\nkcl-packager.nu install --version latest\nkcl run --dry-run servers.ncl\n```\n\n### Debug Commands\n\n```\n# Show workspace structure\ntree -a workspace/infra/my-project\n\n# Check generated imports\ncat workspace/infra/my-project/taskservs.ncl\n\n# Validate KCL files\nnickel typecheck workspace/infra/my-project/*.ncl\n\n# Show module manifests\ncat workspace/infra/my-project/.manifest/taskservs.yaml\n```\n\n## Best Practices\n\n### 1. Version Management\n\n- Pin core package versions in production\n- Use semantic versioning for extensions\n- Test compatibility before upgrading\n\n### 2. Module Organization\n\n- Load only required modules to keep workspaces clean\n- Use meaningful workspace names\n- Document required modules in README\n\n### 3. Security\n\n- Exclude `.manifest/` and `data/` from version control\n- Use secrets management for sensitive configuration\n- Validate modules before loading in production\n\n### 4. Performance\n\n- Load modules at workspace initialization, not runtime\n- Cache discovery results when possible\n- Use parallel loading for multiple modules\n\n## Migration Guide\n\nFor existing workspaces, follow these steps:\n\n### 1. 
Backup Current Workspace\n\n```\ncp -r workspace/infra/existing workspace/infra/existing-backup\n```\n\n### 2. Analyze Migration Requirements\n\n```\nworkspace-migrate.nu workspace/infra/existing dry-run\n```\n\n### 3. Perform Migration\n\n```\nworkspace-migrate.nu workspace/infra/existing\n```\n\n### 4. Load Required Modules\n\n```\ncd workspace/infra/existing\nmodule-loader load taskservs . [kubernetes, cilium]\nmodule-loader load providers . [upcloud]\n```\n\n### 5. Test and Validate\n\n```\nkcl run servers.ncl\nmodule-loader validate .\n```\n\n### 6. Deploy\n\n```\nprovisioning server create --infra . --check\n```\n\n## Future Enhancements\n\n- Registry-based module distribution\n- Module dependency resolution\n- Automatic version updates\n- Module templates and scaffolding\n- Integration with external package managers +# Nickel Package and Module Loader System + +This document describes the package-based architecture implemented for the provisioning system, replacing hardcoded extension paths with a +flexible module discovery and loading system using Nickel for type-safe configuration. + +## Architecture Overview + +The system consists of two main components: + +1. **Core Nickel Package**: Distributable core provisioning schemas with type safety +2. **Module Loader System**: Dynamic discovery and loading of extensions + +### Benefits + +- **Type-Safe Configuration**: Nickel ensures configuration validity at evaluation time +- **Clean Separation**: Core package is self-contained and distributable +- **Plug-and-Play Extensions**: Taskservs, providers, and clusters can be loaded dynamically +- **Version Management**: Core package and extensions can be versioned independently +- **Developer Friendly**: Easy workspace setup and module management with lazy evaluation + +## Components + +### 1. Core Nickel Package (`/provisioning/schemas/`) + +Contains fundamental schemas for provisioning: + +- `main.ncl` - Primary provisioning configuration +- `server.ncl` - Server definitions and schemas +- `defaults.ncl` - Default configurations +- `lib.ncl` - Common library schemas +- `dependencies.ncl` - Dependency management schemas + +**Key Features:** + +- No hardcoded extension paths +- Self-contained and distributable +- Type-safe package-based imports +- Lazy evaluation of expensive computations + +### 2. Module Discovery System + +#### Discovery Commands + +```text +# Discover available modules +module-loader discover taskservs # List all taskservs +module-loader discover providers --format yaml # List providers as YAML +module-loader discover clusters redis # Search for redis clusters +``` + +#### Supported Module Types + +- **Taskservs**: Infrastructure services (kubernetes, redis, postgres, etc.) +- **Providers**: Cloud providers (upcloud, aws, local) +- **Clusters**: Complete configurations (buildkit, web, oci-reg) + +### 3. Module Loading System + +#### Loading Commands + +```text +# Load modules into workspace +module-loader load taskservs . [kubernetes, cilium, containerd] +module-loader load providers . [upcloud] +module-loader load clusters . 
[buildkit]
+
+# Initialize workspace with modules
+module-loader init workspace/infra/production \
+  --taskservs [kubernetes, cilium] \
+  --providers [upcloud]
+```
+
+#### Generated Files
+
+- `taskservs.ncl` - Auto-generated taskserv imports
+- `providers.ncl` - Auto-generated provider imports
+- `clusters.ncl` - Auto-generated cluster imports
+- `.manifest/*.yaml` - Module loading manifests
+
+## Workspace Structure
+
+### New Workspace Layout
+
+```text
+workspace/infra/my-project/
+├── kcl.mod # Package dependencies
+├── servers.ncl # Main server configuration
+├── taskservs.ncl # Auto-generated taskserv imports
+├── providers.ncl # Auto-generated provider imports
+├── clusters.ncl # Auto-generated cluster imports
+├── .taskservs/ # Loaded taskserv modules
+│ ├── kubernetes/
+│ ├── cilium/
+│ └── containerd/
+├── .providers/ # Loaded provider modules
+│ └── upcloud/
+├── .clusters/ # Loaded cluster modules
+│ └── buildkit/
+├── .manifest/ # Module manifests
+│ ├── taskservs.yaml
+│ ├── providers.yaml
+│ └── clusters.yaml
+├── data/ # Runtime data
+├── tmp/ # Temporary files
+├── resources/ # Resource definitions
+└── clusters/ # Cluster configurations
+```
+
+### Import Patterns
+
+#### Before (Old System)
+
+```text
+# Hardcoded relative paths
+import ../../../kcl/server as server
+import ../../../extensions/taskservs/kubernetes/kcl/kubernetes as k8s
+```
+
+#### After (New System)
+
+```text
+# Package-based imports
+import provisioning.server as server
+
+# Auto-generated module imports (after loading)
+import .taskservs.kubernetes.kubernetes as k8s
+```
+
+## Package Distribution
+
+### Building Core Package
+
+```text
+# Build distributable package
+./provisioning/tools/kcl-packager.nu build --version 1.0.0
+
+# Install locally
+./provisioning/tools/kcl-packager.nu install dist/provisioning-1.0.0.tar.gz
+
+# Create release
+./provisioning/tools/kcl-packager.nu build --format tar.gz --include-docs
+```
+
+### Package Installation Methods
+
+#### Method 1: Local Installation (Recommended for development)
+
+```text
+[dependencies]
+provisioning = { path = "~/.kcl/packages/provisioning", version = "0.0.1" }
+```
+
+#### Method 2: Git Repository (For distributed teams)
+
+```text
+[dependencies]
+provisioning = { git = "https://github.com/your-org/provisioning-kcl", version = "v0.0.1" }
+```
+
+#### Method 3: KCL Registry (When available)
+
+```text
+[dependencies]
+provisioning = { version = "0.0.1" }
+```
+
+## Developer Workflows
+
+### 1. New Project Setup
+
+```text
+# Create workspace from template
+cp -r provisioning/templates/workspaces/kubernetes ./my-k8s-cluster
+cd my-k8s-cluster
+
+# Initialize with modules
+workspace-init.nu . init
+
+# Load required modules
+module-loader load taskservs . [kubernetes, cilium, containerd]
+module-loader load providers . [upcloud]
+
+# Validate and deploy
+kcl run servers.ncl
+provisioning server create --infra . --check
+```
+
+### 2. Extension Development
+
+```text
+# Create new taskserv
+mkdir -p extensions/taskservs/my-service/kcl
+cd extensions/taskservs/my-service/kcl
+
+# Initialize KCL module
+kcl mod init my-service
+echo 'provisioning = { path = "~/.kcl/packages/provisioning", version = "0.0.1" }' >> kcl.mod
+
+# Develop and test
+module-loader discover taskservs # Should find your service
+```
+
+### 3. Workspace Migration
+
+```text
+# Analyze existing workspace
+workspace-migrate.nu workspace/infra/old-project dry-run
+
+# Perform migration
+workspace-migrate.nu workspace/infra/old-project
+
+# Verify migration
+module-loader validate workspace/infra/old-project
+```
+
+### 4. Multi-Environment Management
+
+```text
+# Development environment
+cd workspace/infra/dev
+module-loader load taskservs . [redis, postgres]
+module-loader load providers . [local]
+
+# Production environment
+cd workspace/infra/prod
+module-loader load taskservs . [redis, postgres, kubernetes, monitoring]
+module-loader load providers . [upcloud, aws] # Multi-cloud
+```
+
+## Module Management
+
+### Listing and Validation
+
+```text
+# List loaded modules
+module-loader list taskservs .
+module-loader list providers .
+module-loader list clusters .
+
+# Validate workspace
+module-loader validate .
+
+# Show workspace info
+workspace-init.nu . info
+```
+
+### Unloading Modules
+
+```text
+# Remove specific modules
+module-loader unload taskservs . redis
+module-loader unload providers . aws
+
+# This regenerates import files automatically
+```
+
+### Module Information
+
+```text
+# Get detailed module info
+module-loader info taskservs kubernetes
+module-loader info providers upcloud
+module-loader info clusters buildkit
+```
+
+## CI/CD Integration
+
+### Pipeline Example
+
+```text
+#!/usr/bin/env nu
+# deploy-pipeline.nu
+
+# Install specific versions
+kcl-packager.nu install --version $env.PROVISIONING_VERSION
+
+# Load production modules
+module-loader init $env.WORKSPACE_PATH \
+  --taskservs $env.REQUIRED_TASKSERVS \
+  --providers [$env.CLOUD_PROVIDER]
+
+# Validate configuration
+module-loader validate $env.WORKSPACE_PATH
+
+# Deploy infrastructure
+provisioning server create --infra $env.WORKSPACE_PATH
+```
+
+## Troubleshooting
+
+### Common Issues
+
+#### Module Import Errors
+
+```text
+Error: module not found
+```
+
+**Solution**: Verify modules are loaded and regenerate imports
+
+```text
+module-loader list taskservs .
+module-loader load taskservs . [kubernetes, cilium, containerd]
+```
+
+#### Provider Configuration Issues
+
+**Solution**: Check provider-specific configuration in `.providers/` directory
+
+#### KCL Compilation Errors
+
+**Solution**: Verify core package installation and kcl.mod configuration
+
+```text
+kcl-packager.nu install --version latest
+kcl run --dry-run servers.ncl
+```
+
+### Debug Commands
+
+```text
+# Show workspace structure
+tree -a workspace/infra/my-project
+
+# Check generated imports
+cat workspace/infra/my-project/taskservs.ncl
+
+# Validate Nickel files
+nickel typecheck workspace/infra/my-project/*.ncl
+
+# Show module manifests
+cat workspace/infra/my-project/.manifest/taskservs.yaml
+```
+
+## Best Practices
+
+### 1. Version Management
+
+- Pin core package versions in production
+- Use semantic versioning for extensions
+- Test compatibility before upgrading
+
+### 2. Module Organization
+
+- Load only required modules to keep workspaces clean
+- Use meaningful workspace names
+- Document required modules in README
+
+### 3. Security
+
+- Exclude `.manifest/` and `data/` from version control
+- Use secrets management for sensitive configuration
+- Validate modules before loading in production
+
+### 4. Performance
+
+- Load modules at workspace initialization, not runtime
+- Cache discovery results when possible
+- Use parallel loading for multiple modules
+
+## Migration Guide
+
+For existing workspaces, follow these steps:
+
+### 1. 
Backup Current Workspace + +```text +cp -r workspace/infra/existing workspace/infra/existing-backup +``` + +### 2. Analyze Migration Requirements + +```text +workspace-migrate.nu workspace/infra/existing dry-run +``` + +### 3. Perform Migration + +```text +workspace-migrate.nu workspace/infra/existing +``` + +### 4. Load Required Modules + +```text +cd workspace/infra/existing +module-loader load taskservs . [kubernetes, cilium] +module-loader load providers . [upcloud] +``` + +### 5. Test and Validate + +```text +kcl run servers.ncl +module-loader validate . +``` + +### 6. Deploy + +```text +provisioning server create --infra . --check +``` + +## Future Enhancements + +- Registry-based module distribution +- Module dependency resolution +- Automatic version updates +- Module templates and scaffolding +- Integration with external package managers \ No newline at end of file diff --git a/docs/src/architecture/repo-dist-analysis.md b/docs/src/architecture/repo-dist-analysis.md index 0f0cc74..0cf226c 100644 --- a/docs/src/architecture/repo-dist-analysis.md +++ b/docs/src/architecture/repo-dist-analysis.md @@ -1 +1,1611 @@ -# Repository and Distribution Architecture Analysis\n\n**Date:** 2025-10-01\n**Status:** Analysis Complete - Implementation Planning\n**Author:** Architecture Review\n\n## Executive Summary\n\nThis document analyzes the current project structure and provides a comprehensive plan for optimizing the repository organization and distribution\nstrategy. The goal is to create a professional-grade infrastructure automation system with clear separation of concerns, efficient development\nworkflow, and user-friendly distribution.\n\n---\n\n## Current State Analysis\n\n### Strengths\n\n1. **Clean Core Separation**\n - `provisioning/` contains the core system\n - `workspace/` concept for user data\n - Clear extension points (providers, taskservs, clusters)\n\n2. **Hybrid Architecture**\n - Rust orchestrator for performance-critical operations\n - Nushell for business logic and scripting\n - KCL for type-safe configuration\n\n3. **Modular Design**\n - Extension system for providers and services\n - Plugin architecture for Nushell\n - Template-based code generation\n\n4. **Advanced Features**\n - Batch workflow system (v3.1.0)\n - Hybrid orchestrator (v3.0.0)\n - Token-optimized agent architecture\n\n### Critical Issues\n\n1. **Confusing Root Structure**\n - Multiple workspace variants: `_workspace/`, `backup-workspace/`, `workspace-librecloud/`\n - Development artifacts at root: `wrks/`, `NO/`, `target/`\n - Unclear which workspace is active\n\n2. **Mixed Concerns**\n - Runtime data intermixed with source code\n - Build artifacts not properly isolated\n - Presentations and demos in main repo\n\n3. **Distribution Challenges**\n - Bash wrapper for CLI entry point (`provisioning/core/cli/provisioning`)\n - No clear installation mechanism\n - Missing package management system\n - Undefined installation paths\n\n4. **Documentation Fragmentation**\n - Multiple `docs/` locations\n - Scattered README files\n - No unified documentation structure\n\n5. **Configuration Complexity**\n - TOML-based system is good, but paths are unclear\n - User vs system config separation needs clarification\n - Installation paths not standardized\n\n---\n\n## Recommended Architecture\n\n### 1. 
Monorepo Structure\n\n```{$detected_lang}\nproject-provisioning/\n│\n├── provisioning/ # CORE SYSTEM (distribution source)\n│ ├── core/ # Core engine\n│ │ ├── cli/ # Main CLI entry\n│ │ │ └── provisioning # Pure Nushell entry point\n│ │ ├── nulib/ # Nushell libraries\n│ │ │ ├── lib_provisioning/ # Core library functions\n│ │ │ ├── main_provisioning/ # CLI handlers\n│ │ │ ├── servers/ # Server management\n│ │ │ ├── taskservs/ # Task service management\n│ │ │ ├── clusters/ # Cluster management\n│ │ │ └── workflows/ # Workflow orchestration\n│ │ ├── plugins/ # System plugins\n│ │ │ └── nushell-plugins/ # Nushell plugin sources\n│ │ └── scripts/ # Utility scripts\n│ │\n│ ├── extensions/ # Extensible modules\n│ │ ├── providers/ # Cloud providers (aws, upcloud, local)\n│ │ ├── taskservs/ # Infrastructure services\n│ │ │ ├── container-runtime/ # Container runtimes\n│ │ │ ├── kubernetes/ # Kubernetes\n│ │ │ ├── networking/ # Network services\n│ │ │ ├── storage/ # Storage services\n│ │ │ ├── databases/ # Database services\n│ │ │ └── development/ # Dev tools\n│ │ ├── clusters/ # Complete cluster configurations\n│ │ └── workflows/ # Workflow templates\n│ │\n│ ├── platform/ # Platform services (Rust)\n│ │ ├── orchestrator/ # Rust coordination layer\n│ │ ├── control-center/ # Web management UI\n│ │ ├── control-center-ui/ # UI frontend\n│ │ ├── mcp-server/ # Model Context Protocol server\n│ │ └── api-gateway/ # REST API gateway\n│ │\n│ ├── kcl/ # KCL configuration schemas\n│ │ ├── main.ncl # Main entry point\n│ │ ├── settings.ncl # Settings schema\n│ │ ├── server.ncl # Server definitions\n│ │ ├── cluster.ncl # Cluster definitions\n│ │ ├── workflows.ncl # Workflow definitions\n│ │ └── docs/ # KCL documentation\n│ │\n│ ├── templates/ # Jinja2 templates\n│ │ ├── extensions/ # Extension templates\n│ │ ├── services/ # Service templates\n│ │ └── workspace/ # Workspace templates\n│ │\n│ ├── config/ # Default system configuration\n│ │ ├── config.defaults.toml # System defaults\n│ │ └── config-examples/ # Example configs\n│ │\n│ ├── tools/ # Build and packaging tools\n│ │ ├── build/ # Build scripts\n│ │ ├── package/ # Packaging tools\n│ │ ├── distribution/ # Distribution tools\n│ │ └── release/ # Release automation\n│ │\n│ └── resources/ # Static resources (images, assets)\n│\n├── workspace/ # RUNTIME DATA (gitignored except templates)\n│ ├── infra/ # Infrastructure instances (gitignored)\n│ │ └── .gitkeep\n│ ├── config/ # User configuration (gitignored)\n│ │ └── .gitkeep\n│ ├── extensions/ # User extensions (gitignored)\n│ │ └── .gitkeep\n│ ├── runtime/ # Runtime data (gitignored)\n│ │ ├── logs/\n│ │ ├── cache/\n│ │ ├── state/\n│ │ └── tmp/\n│ └── templates/ # Workspace templates (tracked)\n│ ├── minimal/\n│ ├── kubernetes/\n│ └── multi-cloud/\n│\n├── distribution/ # DISTRIBUTION ARTIFACTS (gitignored)\n│ ├── packages/ # Built packages\n│ │ ├── provisioning-core-*.tar.gz\n│ │ ├── provisioning-platform-*.tar.gz\n│ │ ├── provisioning-extensions-*.tar.gz\n│ │ └── checksums.txt\n│ ├── installers/ # Installation scripts\n│ │ ├── install.sh # Bash installer\n│ │ └── install.nu # Nushell installer\n│ └── registry/ # Package registry metadata\n│ └── index.json\n│\n├── docs/ # UNIFIED DOCUMENTATION\n│ ├── README.md # Documentation index\n│ ├── user/ # User guides\n│ │ ├── installation.md\n│ │ ├── quick-start.md\n│ │ ├── configuration.md\n│ │ └── guides/\n│ ├── api/ # API reference\n│ │ ├── rest-api.md\n│ │ ├── nushell-api.md\n│ │ └── kcl-schemas.md\n│ ├── architecture/ # Architecture documentation\n│ │ ├── 
overview.md\n│ │ ├── decisions/ # ADRs\n│ │ └── repo-dist-analysis.md # This document\n│ └── development/ # Development guides\n│ ├── contributing.md\n│ ├── building.md\n│ ├── testing.md\n│ └── releasing.md\n│\n├── examples/ # EXAMPLE CONFIGURATIONS\n│ ├── minimal/ # Minimal setup\n│ ├── kubernetes-cluster/ # Full K8s cluster\n│ ├── multi-cloud/ # Multi-provider setup\n│ └── README.md\n│\n├── tests/ # INTEGRATION TESTS\n│ ├── e2e/ # End-to-end tests\n│ ├── integration/ # Integration tests\n│ ├── fixtures/ # Test fixtures\n│ └── README.md\n│\n├── tools/ # DEVELOPMENT TOOLS\n│ ├── build/ # Build scripts\n│ ├── dev-env/ # Development environment setup\n│ └── scripts/ # Utility scripts\n│\n├── .github/ # GitHub configuration\n│ ├── workflows/ # CI/CD workflows\n│ │ ├── build.yml\n│ │ ├── test.yml\n│ │ └── release.yml\n│ └── ISSUE_TEMPLATE/\n│\n├── .coder/ # Coder configuration (tracked)\n│\n├── .gitignore # Git ignore rules\n├── .gitattributes # Git attributes\n├── Cargo.toml # Rust workspace root\n├── Justfile # Task runner (unified)\n├── LICENSE # License file\n├── README.md # Project README\n├── CHANGELOG.md # Changelog\n└── CLAUDE.md # AI assistant instructions\n```\n\n### Key Principles\n\n1. **Clear Separation**: Source code (`provisioning/`), runtime data (`workspace/`), build artifacts (`distribution/`)\n2. **Single Source of Truth**: One location for each type of content\n3. **Gitignore Strategy**: Runtime and build artifacts ignored, templates tracked\n4. **Standard Paths**: Follow Unix conventions for installation\n\n---\n\n## Distribution Strategy\n\n### Package Types\n\n#### 1. **provisioning-core** (Required)\n\n**Contents:**\n\n- Nushell CLI and libraries\n- Core providers (local, upcloud, aws)\n- Essential taskservs (kubernetes, containerd, cilium)\n- KCL schemas\n- Configuration system\n- Templates\n\n**Size:** ~50 MB (compressed)\n\n**Installation:**\n\n```{$detected_lang}\n/usr/local/\n├── bin/\n│ └── provisioning\n├── lib/\n│ └── provisioning/\n│ ├── core/\n│ ├── extensions/\n│ └── kcl/\n└── share/\n └── provisioning/\n ├── templates/\n ├── config/\n └── docs/\n```\n\n#### 2. **provisioning-platform** (Optional)\n\n**Contents:**\n\n- Rust orchestrator binary\n- Control center web UI\n- MCP server\n- API gateway\n\n**Size:** ~30 MB (compressed)\n\n**Installation:**\n\n```{$detected_lang}\n/usr/local/\n├── bin/\n│ ├── provisioning-orchestrator\n│ └── provisioning-control-center\n└── share/\n └── provisioning/\n └── platform/\n```\n\n#### 3. **provisioning-extensions** (Optional)\n\n**Contents:**\n\n- Additional taskservs (radicle, gitea, postgres, etc.)\n- Cluster templates\n- Workflow templates\n\n**Size:** ~20 MB (compressed)\n\n**Installation:**\n\n```{$detected_lang}\n/usr/local/lib/provisioning/extensions/\n├── taskservs/\n├── clusters/\n└── workflows/\n```\n\n#### 4. 
**provisioning-plugins** (Optional)\n\n**Contents:**\n\n- Pre-built Nushell plugins\n- `nu_plugin_kcl`\n- `nu_plugin_tera`\n- Other custom plugins\n\n**Size:** ~15 MB (compressed)\n\n**Installation:**\n\n```{$detected_lang}\n~/.config/nushell/plugins/\n```\n\n### Installation Paths\n\n#### System Installation (Root)\n\n```{$detected_lang}\n/usr/local/\n├── bin/\n│ ├── provisioning # Main CLI\n│ ├── provisioning-orchestrator # Orchestrator binary\n│ └── provisioning-control-center # Control center binary\n├── lib/\n│ └── provisioning/\n│ ├── core/ # Core Nushell libraries\n│ │ ├── nulib/\n│ │ └── plugins/\n│ ├── extensions/ # Extensions\n│ │ ├── providers/\n│ │ ├── taskservs/\n│ │ └── clusters/\n│ └── kcl/ # KCL schemas\n└── share/\n └── provisioning/\n ├── templates/ # System templates\n ├── config/ # Default configs\n │ └── config.defaults.toml\n └── docs/ # Documentation\n```\n\n#### User Configuration\n\n```{$detected_lang}\n~/.provisioning/\n├── config/\n│ └── config.user.toml # User overrides\n├── extensions/ # User extensions\n│ ├── providers/\n│ ├── taskservs/\n│ └── clusters/\n├── cache/ # Cache directory\n└── plugins/ # User plugins\n```\n\n#### Project Workspace\n\n```{$detected_lang}\n./workspace/\n├── infra/ # Infrastructure definitions\n│ ├── my-cluster/\n│ │ ├── config.toml\n│ │ ├── servers.yaml\n│ │ └── taskservs.yaml\n│ └── production/\n├── config/ # Project configuration\n│ └── config.toml\n├── runtime/ # Runtime data\n│ ├── logs/\n│ ├── state/\n│ └── cache/\n└── extensions/ # Project-specific extensions\n```\n\n### Configuration Hierarchy\n\n```{$detected_lang}\nPriority (highest to lowest):\n1. CLI flags --debug, --infra=my-cluster\n2. Runtime overrides PROVISIONING_DEBUG=true\n3. Project config ./workspace/config/config.toml\n4. User config ~/.provisioning/config/config.user.toml\n5. 
System config /usr/local/share/provisioning/config/config.defaults.toml\n```\n\n---\n\n## Build System\n\n### Build Tools Structure\n\n**`provisioning/tools/build/`:**\n\n```{$detected_lang}\nbuild/\n├── build-system.nu # Main build orchestrator\n├── package-core.nu # Core packaging\n├── package-platform.nu # Platform packaging\n├── package-extensions.nu # Extensions packaging\n├── package-plugins.nu # Plugins packaging\n├── create-installers.nu # Installer generation\n├── validate-package.nu # Package validation\n└── publish-registry.nu # Registry publishing\n```\n\n### Build System Implementation\n\n**`provisioning/tools/build/build-system.nu`:**\n\n```{$detected_lang}\n#!/usr/bin/env nu\n# Build system for provisioning project\n\nuse ../core/nulib/lib_provisioning/config/accessor.nu *\n\n# Build all packages\nexport def "main build-all" [\n --version: string = "dev" # Version to build\n --output: string = "distribution/packages" # Output directory\n] {\n print $"Building all packages version: ($version)"\n\n let results = {\n core: (build-core $version $output)\n platform: (build-platform $version $output)\n extensions: (build-extensions $version $output)\n plugins: (build-plugins $version $output)\n }\n\n # Generate checksums\n create-checksums $output\n\n print "✅ All packages built successfully"\n $results\n}\n\n# Build core package\nexport def "build-core" [\n version: string\n output: string\n] -> record {\n print "📦 Building provisioning-core..."\n\n nu package-core.nu build --version $version --output $output\n}\n\n# Build platform package (Rust binaries)\nexport def "build-platform" [\n version: string\n output: string\n] -> record {\n print "📦 Building provisioning-platform..."\n\n nu package-platform.nu build --version $version --output $output\n}\n\n# Build extensions package\nexport def "build-extensions" [\n version: string\n output: string\n] -> record {\n print "📦 Building provisioning-extensions..."\n\n nu package-extensions.nu build --version $version --output $output\n}\n\n# Build plugins package\nexport def "build-plugins" [\n version: string\n output: string\n] -> record {\n print "📦 Building provisioning-plugins..."\n\n nu package-plugins.nu build --version $version --output $output\n}\n\n# Create release artifacts\nexport def "main release" [\n version: string # Release version\n --upload # Upload to release server\n] {\n print $"🚀 Creating release ($version)"\n\n # Build all packages\n let packages = (build-all --version $version)\n\n # Create installers\n create-installers $version\n\n # Generate release notes\n generate-release-notes $version\n\n # Upload if requested\n if $upload {\n upload-release $version\n }\n\n print $"✅ Release ($version) ready"\n}\n\n# Create installers\ndef create-installers [version: string] {\n print "📝 Creating installers..."\n\n nu create-installers.nu --version $version\n}\n\n# Generate release notes\ndef generate-release-notes [version: string] {\n print "📝 Generating release notes..."\n\n let changelog = (open CHANGELOG.md)\n let notes = ($changelog | parse-version-section $version)\n\n $notes | save $"distribution/packages/RELEASE_NOTES_($version).md"\n}\n\n# Upload release\ndef upload-release [version: string] {\n print "⬆️ Uploading release..."\n\n # Implementation depends on your release infrastructure\n # Could use: GitHub releases, S3, custom server, etc.\n}\n\n# Create checksums for all packages\ndef create-checksums [output: string] {\n print "🔐 Creating checksums..."\n\n ls ($output | path join "*.tar.gz")\n | each { 
|file|\n let hash = (sha256sum $file.name | split row ' ' | get 0)\n $"($hash) (($file.name | path basename))"\n }\n | str join "\n"\n | save ($output | path join "checksums.txt")\n}\n\n# Clean build artifacts\nexport def "main clean" [\n --all # Clean all build artifacts\n] {\n print "🧹 Cleaning build artifacts..."\n\n if ($all) {\n rm -rf distribution/packages\n rm -rf target/\n rm -rf provisioning/platform/target/\n } else {\n rm -rf distribution/packages\n }\n\n print "✅ Clean complete"\n}\n\n# Validate built packages\nexport def "main validate" [\n package_path: string # Package to validate\n] {\n print $"🔍 Validating package: ($package_path)"\n\n nu validate-package.nu $package_path\n}\n\n# Show build status\nexport def "main status" [] {\n print "📊 Build Status"\n print "─" * 60\n\n let core_exists = ("distribution/packages" | path join "provisioning-core-*.tar.gz" | glob | is-not-empty)\n let platform_exists = ("distribution/packages" | path join "provisioning-platform-*.tar.gz" | glob | is-not-empty)\n\n print $"Core package: (if $core_exists { '✅ Built' } else { '❌ Not built' })"\n print $"Platform package: (if $platform_exists { '✅ Built' } else { '❌ Not built' })"\n\n if ("distribution/packages" | path exists) {\n let packages = (ls distribution/packages | where name =~ ".tar.gz")\n print $"\nTotal packages: (($packages | length))"\n $packages | select name size\n }\n}\n```\n\n### Justfile Integration\n\n**`Justfile`:**\n\n```{$detected_lang}\n# Provisioning Build System\n# Use 'just --list' to see all available commands\n\n# Default recipe\ndefault:\n @just --list\n\n# Development tasks\nalias d := dev-check\nalias t := test\nalias b := build\n\n# Build all packages\nbuild VERSION="dev":\n nu provisioning/tools/build/build-system.nu build-all --version {{VERSION}}\n\n# Build core package only\nbuild-core VERSION="dev":\n nu provisioning/tools/build/build-system.nu build-core {{VERSION}}\n\n# Build platform binaries\nbuild-platform VERSION="dev":\n cargo build --release --workspace --manifest-path provisioning/platform/Cargo.toml\n nu provisioning/tools/build/build-system.nu build-platform {{VERSION}}\n\n# Run development checks\ndev-check:\n @echo "🔍 Running development checks..."\n cargo check --workspace --manifest-path provisioning/platform/Cargo.toml\n cargo clippy --workspace --manifest-path provisioning/platform/Cargo.toml\n nu provisioning/tools/build/validate-nushell.nu\n\n# Run tests\ntest:\n @echo "🧪 Running tests..."\n cargo test --workspace --manifest-path provisioning/platform/Cargo.toml\n nu tests/run-all-tests.nu\n\n# Run integration tests\ntest-e2e:\n @echo "🔬 Running E2E tests..."\n nu tests/e2e/run-e2e.nu\n\n# Format code\nfmt:\n cargo fmt --all --manifest-path provisioning/platform/Cargo.toml\n nu provisioning/tools/build/format-nushell.nu\n\n# Clean build artifacts\nclean:\n nu provisioning/tools/build/build-system.nu clean\n\n# Clean all (including Rust target/)\nclean-all:\n nu provisioning/tools/build/build-system.nu clean --all\n cargo clean --manifest-path provisioning/platform/Cargo.toml\n\n# Create release\nrelease VERSION:\n @echo "🚀 Creating release {{VERSION}}..."\n nu provisioning/tools/build/build-system.nu release {{VERSION}}\n\n# Install from source\ninstall:\n @echo "📦 Installing from source..."\n just build\n sudo nu distribution/installers/install.nu --from-source\n\n# Install development version (symlink)\ninstall-dev:\n @echo "🔗 Installing development version..."\n sudo ln -sf $(pwd)/provisioning/core/cli/provisioning 
/usr/local/bin/provisioning\n @echo "✅ Development installation complete"\n\n# Uninstall\nuninstall:\n @echo "🗑️ Uninstalling..."\n sudo rm -f /usr/local/bin/provisioning\n sudo rm -rf /usr/local/lib/provisioning\n sudo rm -rf /usr/local/share/provisioning\n\n# Show build status\nstatus:\n nu provisioning/tools/build/build-system.nu status\n\n# Validate package\nvalidate PACKAGE:\n nu provisioning/tools/build/build-system.nu validate {{PACKAGE}}\n\n# Start development environment\ndev-start:\n @echo "🚀 Starting development environment..."\n cd provisioning/platform/orchestrator && cargo run\n\n# Watch and rebuild on changes\nwatch:\n @echo "👀 Watching for changes..."\n cargo watch -x 'check --workspace --manifest-path provisioning/platform/Cargo.toml'\n\n# Update dependencies\nupdate-deps:\n cargo update --manifest-path provisioning/platform/Cargo.toml\n nu provisioning/tools/build/update-nushell-deps.nu\n\n# Generate documentation\ndocs:\n @echo "📚 Generating documentation..."\n cargo doc --workspace --no-deps --manifest-path provisioning/platform/Cargo.toml\n nu provisioning/tools/build/generate-docs.nu\n\n# Benchmark\nbench:\n cargo bench --workspace --manifest-path provisioning/platform/Cargo.toml\n\n# Check licenses\ncheck-licenses:\n cargo deny check licenses --manifest-path provisioning/platform/Cargo.toml\n\n# Security audit\naudit:\n cargo audit --file provisioning/platform/Cargo.lock\n```\n\n---\n\n## Installation System\n\n### Installer Script\n\n**`distribution/installers/install.nu`:**\n\n```{$detected_lang}\n#!/usr/bin/env nu\n# Provisioning installation script\n\nconst DEFAULT_PREFIX = "/usr/local"\nconst REPO_URL = "https://releases.provisioning.io"\n\n# Main installation command\ndef main [\n --prefix: string = $DEFAULT_PREFIX # Installation prefix\n --version: string = "latest" # Version to install\n --from-source # Install from source (development)\n --packages: list = ["core"] # Packages to install\n] {\n print "📦 Provisioning Installation"\n print "─" * 60\n\n # Check prerequisites\n check-prerequisites\n\n # Install packages\n if $from_source {\n install-from-source $prefix\n } else {\n install-from-release $prefix $version $packages\n }\n\n # Post-installation\n post-install $prefix\n\n print ""\n print "✅ Installation complete!"\n print $"Run 'provisioning --help' to get started"\n}\n\n# Check prerequisites\ndef check-prerequisites [] {\n print "🔍 Checking prerequisites..."\n\n # Check for Nushell\n if (which nu | is-empty) {\n error make {\n msg: "Nushell not found. 
Please install Nushell first: https://nushell.sh"\n }\n }\n\n let nu_version = (nu --version | parse "{name} {version}" | get 0.version)\n print $" ✓ Nushell ($nu_version)"\n\n # Check for required tools\n if (which tar | is-empty) {\n error make { msg: "tar not found" }\n }\n\n if (which curl | is-empty) and (which wget | is-empty) {\n error make { msg: "curl or wget required" }\n }\n\n print " ✓ All prerequisites met"\n}\n\n# Install from source\ndef install-from-source [prefix: string] {\n print "📦 Installing from source..."\n\n # Check if we're in the source directory\n if not ("provisioning" | path exists) {\n error make { msg: "Must run from project root" }\n }\n\n # Create installation directories\n create-install-dirs $prefix\n\n # Copy files\n print " Copying core files..."\n cp -r provisioning/core/nulib $"($prefix)/lib/provisioning/core/"\n cp -r provisioning/extensions $"($prefix)/lib/provisioning/"\n cp -r provisioning/kcl $"($prefix)/lib/provisioning/"\n cp -r provisioning/templates $"($prefix)/share/provisioning/"\n cp -r provisioning/config $"($prefix)/share/provisioning/"\n\n # Create CLI wrapper\n create-cli-wrapper $prefix\n\n print " ✓ Source installation complete"\n}\n\n# Install from release\ndef install-from-release [\n prefix: string\n version: string\n packages: list\n] {\n print $"📦 Installing version ($version)..."\n\n # Download packages\n for package in $packages {\n download-package $package $version\n extract-package $package $version $prefix\n }\n}\n\n# Download package\ndef download-package [package: string, version: string] {\n let filename = $"provisioning-($package)-($version).tar.gz"\n let url = $"($REPO_URL)/($version)/($filename)"\n\n print $" Downloading ($package)..."\n\n if (which curl | is-not-empty) {\n curl -fsSL -o $"/tmp/($filename)" $url\n } else {\n wget -q -O $"/tmp/($filename)" $url\n }\n}\n\n# Extract package\ndef extract-package [package: string, version: string, prefix: string] {\n let filename = $"provisioning-($package)-($version).tar.gz"\n\n print $" Installing ($package)..."\n\n tar xzf $"/tmp/($filename)" -C $prefix\n rm $"/tmp/($filename)"\n}\n\n# Create installation directories\ndef create-install-dirs [prefix: string] {\n mkdir ($prefix | path join "bin")\n mkdir ($prefix | path join "lib" "provisioning" "core")\n mkdir ($prefix | path join "lib" "provisioning" "extensions")\n mkdir ($prefix | path join "share" "provisioning" "templates")\n mkdir ($prefix | path join "share" "provisioning" "config")\n mkdir ($prefix | path join "share" "provisioning" "docs")\n}\n\n# Create CLI wrapper\ndef create-cli-wrapper [prefix: string] {\n let wrapper = $"#!/usr/bin/env nu\n# Provisioning CLI wrapper\n\n# Load provisioning library\nconst PROVISIONING_LIB = \"($prefix)/lib/provisioning\"\nconst PROVISIONING_SHARE = \"($prefix)/share/provisioning\"\n\n$env.PROVISIONING_ROOT = $PROVISIONING_LIB\n$env.PROVISIONING_SHARE = $PROVISIONING_SHARE\n\n# Add to Nushell path\n$env.NU_LIB_DIRS = ($env.NU_LIB_DIRS | append $\"($PROVISIONING_LIB)/core/nulib\")\n\n# Load main provisioning module\nuse ($PROVISIONING_LIB)/core/nulib/main_provisioning/dispatcher.nu *\n\n# Main entry point\ndef main [...args] {\n dispatch-command $args\n}\n\nmain ...$args\n"\n\n $wrapper | save ($prefix | path join "bin" "provisioning")\n chmod +x ($prefix | path join "bin" "provisioning")\n}\n\n# Post-installation tasks\ndef post-install [prefix: string] {\n print "🔧 Post-installation setup..."\n\n # Create user config directory\n let user_config = ($env.HOME | path join 
".provisioning")\n if not ($user_config | path exists) {\n mkdir ($user_config | path join "config")\n mkdir ($user_config | path join "extensions")\n mkdir ($user_config | path join "cache")\n\n # Copy example config\n let example = ($prefix | path join "share" "provisioning" "config" "config-examples" "config.user.toml")\n if ($example | path exists) {\n cp $example ($user_config | path join "config" "config.user.toml")\n }\n\n print $" ✓ Created user config directory: ($user_config)"\n }\n\n # Check if prefix is in PATH\n if not ($env.PATH | any { |p| $p == ($prefix | path join "bin") }) {\n print ""\n print "⚠️ Note: ($prefix)/bin is not in your PATH"\n print " Add this to your shell configuration:"\n print $" export PATH=\"($prefix)/bin:$PATH\""\n }\n}\n\n# Uninstall provisioning\nexport def "main uninstall" [\n --prefix: string = $DEFAULT_PREFIX # Installation prefix\n --keep-config # Keep user configuration\n] {\n print "🗑️ Uninstalling provisioning..."\n\n # Remove installed files\n rm -rf ($prefix | path join "bin" "provisioning")\n rm -rf ($prefix | path join "lib" "provisioning")\n rm -rf ($prefix | path join "share" "provisioning")\n\n # Remove user config if requested\n if not $keep_config {\n let user_config = ($env.HOME | path join ".provisioning")\n if ($user_config | path exists) {\n rm -rf $user_config\n print " ✓ Removed user configuration"\n }\n }\n\n print "✅ Uninstallation complete"\n}\n\n# Upgrade provisioning\nexport def "main upgrade" [\n --version: string = "latest" # Version to upgrade to\n --prefix: string = $DEFAULT_PREFIX # Installation prefix\n] {\n print $"⬆️ Upgrading to version ($version)..."\n\n # Check current version\n let current = (^provisioning version | parse "{version}" | get 0.version)\n print $" Current version: ($current)"\n\n if $current == $version {\n print " Already at latest version"\n return\n }\n\n # Backup current installation\n print " Backing up current installation..."\n let backup = ($prefix | path join "lib" "provisioning.backup")\n mv ($prefix | path join "lib" "provisioning") $backup\n\n # Install new version\n try {\n install-from-release $prefix $version ["core"]\n print $" ✅ Upgraded to version ($version)"\n rm -rf $backup\n } catch {\n print " ❌ Upgrade failed, restoring backup..."\n mv $backup ($prefix | path join "lib" "provisioning")\n error make { msg: "Upgrade failed" }\n }\n}\n```\n\n### Bash Installer (For Systems Without Nushell)\n\n**`distribution/installers/install.sh`:**\n\n```{$detected_lang}\n#!/usr/bin/env bash\n# Provisioning installation script (Bash version)\n# This script installs Nushell first, then runs the Nushell installer\n\nset -euo pipefail\n\nDEFAULT_PREFIX="/usr/local"\nREPO_URL="https://releases.provisioning.io"\n\n# Colors\nRED='\033[0;31m'\nGREEN='\033[0;32m'\nYELLOW='\033[1;33m'\nNC='\033[0m' # No Color\n\ninfo() {\n echo -e "${GREEN}✓${NC} $*"\n}\n\nwarn() {\n echo -e "${YELLOW}⚠${NC} $*"\n}\n\nerror() {\n echo -e "${RED}✗${NC} $*" >&2\n exit 1\n}\n\n# Check if Nushell is installed\ncheck_nushell() {\n if command -v nu >/dev/null 2>&1; then\n info "Nushell is already installed"\n return 0\n else\n warn "Nushell not found"\n return 1\n fi\n}\n\n# Install Nushell\ninstall_nushell() {\n echo "📦 Installing Nushell..."\n\n # Detect OS and architecture\n OS="$(uname -s)"\n ARCH="$(uname -m)"\n\n case "$OS" in\n Linux*)\n if command -v apt-get >/dev/null 2>&1; then\n sudo apt-get update && sudo apt-get install -y nushell\n elif command -v dnf >/dev/null 2>&1; then\n sudo dnf install -y nushell\n elif 
command -v brew >/dev/null 2>&1; then\n brew install nushell\n else\n error "Cannot automatically install Nushell. Please install manually: https://nushell.sh"\n fi\n ;;\n Darwin*)\n if command -v brew >/dev/null 2>&1; then\n brew install nushell\n else\n error "Homebrew not found. Install from: https://brew.sh"\n fi\n ;;\n *)\n error "Unsupported operating system: $OS"\n ;;\n esac\n\n info "Nushell installed successfully"\n}\n\n# Main installation\nmain() {\n echo "📦 Provisioning Installation"\n echo "────────────────────────────────────────────────────────────"\n\n # Check for Nushell\n if ! check_nushell; then\n read -p "Install Nushell? (y/N) " -n 1 -r\n echo\n if [[ $REPLY =~ ^[Yy]$ ]]; then\n install_nushell\n else\n error "Nushell is required. Install from: https://nushell.sh"\n fi\n fi\n\n # Download Nushell installer\n echo "📥 Downloading installer..."\n INSTALLER_URL="$REPO_URL/latest/install.nu"\n curl -fsSL "$INSTALLER_URL" -o /tmp/install.nu\n\n # Run Nushell installer\n echo "🚀 Running installer..."\n nu /tmp/install.nu "$@"\n\n # Cleanup\n rm -f /tmp/install.nu\n\n info "Installation complete!"\n}\n\n# Run main\nmain "$@"\n```\n\n---\n\n## Implementation Plan\n\n### Phase 1: Repository Restructuring (3-4 days)\n\n#### Day 1: Cleanup and Preparation\n\n**Tasks:**\n\n1. Create backup of current state\n2. Analyze and document all workspace directories\n3. Identify active workspace vs backups\n4. Map all file dependencies\n\n**Commands:**\n\n```{$detected_lang}\n# Backup current state\ncp -r /Users/Akasha/project-provisioning /Users/Akasha/project-provisioning.backup\n\n# Analyze workspaces\nfd workspace -t d > workspace-dirs.txt\n```\n\n**Deliverables:**\n\n- Complete backup\n- Workspace analysis document\n- Dependency map\n\n#### Day 2: Directory Restructuring\n\n**Tasks:**\n\n1. Consolidate workspace directories\n2. Move build artifacts to `distribution/`\n3. Remove obsolete directories (`NO/`, `wrks/`, presentation artifacts)\n4. Create proper `.gitignore`\n\n**Commands:**\n\n```{$detected_lang}\n# Create distribution directory\nmkdir -p distribution/{packages,installers,registry}\n\n# Move build artifacts\nmv target distribution/\nmv provisioning/tools/dist distribution/packages/\n\n# Remove obsolete\nrm -rf NO/ wrks/ presentations/\n```\n\n**Deliverables:**\n\n- Clean directory structure\n- Updated `.gitignore`\n- Migration log\n\n#### Day 3: Update Path References\n\n**Tasks:**\n\n1. Update all hardcoded paths in Nushell scripts\n2. Update CLAUDE.md with new paths\n3. Update documentation references\n4. Test all path changes\n\n**Files to Update:**\n\n- `provisioning/core/nulib/**/*.nu` (~65 files)\n- `CLAUDE.md`\n- `docs/**/*.md`\n\n**Deliverables:**\n\n- Updated scripts\n- Updated documentation\n- Test results\n\n#### Day 4: Validation and Documentation\n\n**Tasks:**\n\n1. Run full test suite\n2. Verify all commands work\n3. Update README.md\n4. Create migration guide\n\n**Deliverables:**\n\n- Passing tests\n- Updated README\n- Migration guide for users\n\n### Phase 2: Build System Implementation (3-4 days)\n\n#### Day 5: Build System Core\n\n**Tasks:**\n\n1. Create `provisioning/tools/build/` structure\n2. Implement `build-system.nu`\n3. Implement `package-core.nu`\n4. 
Create Justfile\n\n**Files to Create:**\n\n- `provisioning/tools/build/build-system.nu`\n- `provisioning/tools/build/package-core.nu`\n- `provisioning/tools/build/validate-package.nu`\n- `Justfile`\n\n**Deliverables:**\n\n- Working build system\n- Core packaging capability\n- Justfile with basic recipes\n\n#### Day 6: Platform and Extension Packaging\n\n**Tasks:**\n\n1. Implement `package-platform.nu`\n2. Implement `package-extensions.nu`\n3. Implement `package-plugins.nu`\n4. Add checksum generation\n\n**Deliverables:**\n\n- Platform packaging\n- Extension packaging\n- Plugin packaging\n- Checksum generation\n\n#### Day 7: Package Validation\n\n**Tasks:**\n\n1. Create package validation system\n2. Implement integrity checks\n3. Create test suite for packages\n4. Document package format\n\n**Deliverables:**\n\n- Package validation\n- Test suite\n- Package format documentation\n\n#### Day 8: Build System Testing\n\n**Tasks:**\n\n1. Test full build pipeline\n2. Test all package types\n3. Optimize build performance\n4. Document build system\n\n**Deliverables:**\n\n- Tested build system\n- Performance optimizations\n- Build system documentation\n\n### Phase 3: Installation System (2-3 days)\n\n#### Day 9: Nushell Installer\n\n**Tasks:**\n\n1. Create `install.nu`\n2. Implement installation logic\n3. Implement upgrade logic\n4. Implement uninstallation\n\n**Files to Create:**\n\n- `distribution/installers/install.nu`\n\n**Deliverables:**\n\n- Working Nushell installer\n- Upgrade mechanism\n- Uninstall mechanism\n\n#### Day 10: Bash Installer and CLI\n\n**Tasks:**\n\n1. Create `install.sh`\n2. Replace bash CLI wrapper with pure Nushell\n3. Update PATH handling\n4. Test installation on clean system\n\n**Files to Create:**\n\n- `distribution/installers/install.sh`\n- Updated `provisioning/core/cli/provisioning`\n\n**Deliverables:**\n\n- Bash installer\n- Pure Nushell CLI\n- Installation tests\n\n#### Day 11: Installation Testing\n\n**Tasks:**\n\n1. Test installation on multiple OSes\n2. Test upgrade scenarios\n3. Test uninstallation\n4. Create installation documentation\n\n**Deliverables:**\n\n- Multi-OS installation tests\n- Installation guide\n- Troubleshooting guide\n\n### Phase 4: Package Registry (Optional, 2-3 days)\n\n#### Day 12: Registry System\n\n**Tasks:**\n\n1. Design registry format\n2. Implement registry indexing\n3. Create package metadata\n4. Implement search functionality\n\n**Files to Create:**\n\n- `provisioning/tools/build/publish-registry.nu`\n- `distribution/registry/index.json`\n\n**Deliverables:**\n\n- Registry system\n- Package metadata\n- Search functionality\n\n#### Day 13: Registry Commands\n\n**Tasks:**\n\n1. Implement `provisioning registry list`\n2. Implement `provisioning registry search`\n3. Implement `provisioning registry install`\n4. Implement `provisioning registry update`\n\n**Deliverables:**\n\n- Registry commands\n- Package installation from registry\n- Update mechanism\n\n#### Day 14: Registry Hosting\n\n**Tasks:**\n\n1. Set up registry hosting (S3, GitHub releases, etc.)\n2. Implement upload mechanism\n3. Create CI/CD for automatic publishing\n4. Document registry system\n\n**Deliverables:**\n\n- Hosted registry\n- CI/CD pipeline\n- Registry documentation\n\n### Phase 5: Documentation and Release (2 days)\n\n#### Day 15: Documentation\n\n**Tasks:**\n\n1. Update all documentation for new structure\n2. Create user guides\n3. Create development guides\n4. 
Create API documentation\n\n**Deliverables:**\n\n- Updated documentation\n- User guides\n- Developer guides\n- API docs\n\n#### Day 16: Release Preparation\n\n**Tasks:**\n\n1. Create CHANGELOG.md\n2. Build release packages\n3. Test installation from packages\n4. Create release announcement\n\n**Deliverables:**\n\n- CHANGELOG\n- Release packages\n- Installation verification\n- Release announcement\n\n---\n\n## Migration Strategy\n\n### For Existing Users\n\n#### Option 1: Clean Migration\n\n```{$detected_lang}\n# Backup current workspace\ncp -r workspace workspace.backup\n\n# Upgrade to new version\nprovisioning upgrade --version 3.2.0\n\n# Migrate workspace\nprovisioning workspace migrate --from workspace.backup --to workspace/\n```\n\n#### Option 2: In-Place Migration\n\n```{$detected_lang}\n# Run migration script\nprovisioning migrate --check # Dry run\nprovisioning migrate # Execute migration\n```\n\n### For Developers\n\n```{$detected_lang}\n# Pull latest changes\ngit pull origin main\n\n# Rebuild\njust clean-all\njust build\n\n# Reinstall development version\njust install-dev\n\n# Verify\nprovisioning --version\n```\n\n---\n\n## Success Criteria\n\n### Repository Structure\n\n- ✅ Single `workspace/` directory for all runtime data\n- ✅ Clear separation: source (`provisioning/`), runtime (`workspace/`), artifacts (`distribution/`)\n- ✅ All build artifacts in `distribution/` and gitignored\n- ✅ Clean root directory (no `wrks/`, `NO/`, etc.)\n- ✅ Unified documentation in `docs/`\n\n### Build System\n\n- ✅ Single command builds all packages: `just build`\n- ✅ Packages can be built independently\n- ✅ Checksums generated automatically\n- ✅ Validation before packaging\n- ✅ Build time < 5 minutes for full build\n\n### Installation\n\n- ✅ One-line installation: `curl -fsSL https://get.provisioning.io | sh`\n- ✅ Works on Linux and macOS\n- ✅ Standard installation paths (`/usr/local/`)\n- ✅ User configuration in `~/.provisioning/`\n- ✅ Clean uninstallation\n\n### Distribution\n\n- ✅ Packages available at stable URL\n- ✅ Automated releases via CI/CD\n- ✅ Package registry for extensions\n- ✅ Upgrade mechanism works reliably\n\n### Documentation\n\n- ✅ Complete installation guide\n- ✅ Quick start guide\n- ✅ Developer contributing guide\n- ✅ API documentation\n- ✅ Architecture documentation\n\n---\n\n## Risks and Mitigations\n\n### Risk 1: Breaking Changes for Existing Users\n\n**Impact:** High\n**Probability:** High\n**Mitigation:**\n\n- Provide migration script\n- Support both old and new paths during transition (v3.2.x)\n- Clear migration guide\n- Automated backup before migration\n\n### Risk 2: Build System Complexity\n\n**Impact:** Medium\n**Probability:** Medium\n**Mitigation:**\n\n- Start with simple packaging\n- Iterate and improve\n- Document thoroughly\n- Provide examples\n\n### Risk 3: Installation Path Conflicts\n\n**Impact:** Medium\n**Probability:** Low\n**Mitigation:**\n\n- Check for existing installations\n- Support custom prefix\n- Clear uninstallation\n- Non-conflicting binary names\n\n### Risk 4: Cross-Platform Issues\n\n**Impact:** High\n**Probability:** Medium\n**Mitigation:**\n\n- Test on multiple OSes (Linux, macOS)\n- Use portable commands\n- Provide fallbacks\n- Clear error messages\n\n### Risk 5: Dependency Management\n\n**Impact:** Medium\n**Probability:** Medium\n**Mitigation:**\n\n- Document all dependencies\n- Check prerequisites during installation\n- Provide installation instructions for dependencies\n- Consider bundling critical dependencies\n\n---\n\n## Timeline 
Summary\n\n| Phase | Duration | Key Deliverables |\n| ------- | ---------- | ------------------ |\n| Phase 1: Restructuring | 3-4 days | Clean directory structure, updated paths |\n| Phase 2: Build System | 3-4 days | Working build system, all package types |\n| Phase 3: Installation | 2-3 days | Installers, pure Nushell CLI |\n| Phase 4: Registry (Optional) | 2-3 days | Package registry, extension management |\n| Phase 5: Documentation | 2 days | Complete documentation, release |\n| **Total** | **12-16 days** | Production-ready distribution system |\n\n---\n\n## Next Steps\n\n1. **Review and Approval** (Day 0)\n - Review this analysis\n - Approve implementation plan\n - Assign resources\n\n2. **Kickoff** (Day 1)\n - Create implementation branch\n - Set up project tracking\n - Begin Phase 1\n\n3. **Weekly Reviews**\n - End of Phase 1: Structure review\n - End of Phase 2: Build system review\n - End of Phase 3: Installation review\n - Final review before release\n\n---\n\n## Conclusion\n\nThis comprehensive plan transforms the provisioning system into a professional-grade infrastructure automation platform with:\n\n- **Clean Architecture**: Clear separation of concerns\n- **Professional Distribution**: Standard installation paths and packaging\n- **Easy Installation**: One-command installation for users\n- **Developer Friendly**: Simple build system and clear development workflow\n- **Extensible**: Package registry for community extensions\n- **Well Documented**: Complete guides for users and developers\n\nThe implementation will take approximately **2-3 weeks** and will result in a production-ready system suitable for both individual developers and\nenterprise deployments.\n\n---\n\n## References\n\n- Current codebase structure\n- Unix FHS (Filesystem Hierarchy Standard)\n- Rust cargo packaging conventions\n- npm/yarn package management patterns\n- Homebrew formula best practices\n- KCL package management design +# Repository and Distribution Architecture Analysis + +**Date:** 2025-10-01 +**Status:** Analysis Complete - Implementation Planning +**Author:** Architecture Review + +## Executive Summary + +This document analyzes the current project structure and provides a comprehensive plan for optimizing the repository organization and distribution +strategy. The goal is to create a professional-grade infrastructure automation system with clear separation of concerns, efficient development +workflow, and user-friendly distribution. + +--- + +## Current State Analysis + +### Strengths + +1. **Clean Core Separation** + - `provisioning/` contains the core system + - `workspace/` concept for user data + - Clear extension points (providers, taskservs, clusters) + +2. **Hybrid Architecture** + - Rust orchestrator for performance-critical operations + - Nushell for business logic and scripting + - KCL for type-safe configuration + +3. **Modular Design** + - Extension system for providers and services + - Plugin architecture for Nushell + - Template-based code generation + +4. **Advanced Features** + - Batch workflow system (v3.1.0) + - Hybrid orchestrator (v3.0.0) + - Token-optimized agent architecture + +### Critical Issues + +1. **Confusing Root Structure** + - Multiple workspace variants: `_workspace/`, `backup-workspace/`, `workspace-librecloud/` + - Development artifacts at root: `wrks/`, `NO/`, `target/` + - Unclear which workspace is active + +2. 
**Mixed Concerns** + - Runtime data intermixed with source code + - Build artifacts not properly isolated + - Presentations and demos in main repo + +3. **Distribution Challenges** + - Bash wrapper for CLI entry point (`provisioning/core/cli/provisioning`) + - No clear installation mechanism + - Missing package management system + - Undefined installation paths + +4. **Documentation Fragmentation** + - Multiple `docs/` locations + - Scattered README files + - No unified documentation structure + +5. **Configuration Complexity** + - TOML-based system is good, but paths are unclear + - User vs system config separation needs clarification + - Installation paths not standardized + +--- + +## Recommended Architecture + +### 1. Monorepo Structure + +```text +project-provisioning/ +│ +├── provisioning/ # CORE SYSTEM (distribution source) +│ ├── core/ # Core engine +│ │ ├── cli/ # Main CLI entry +│ │ │ └── provisioning # Pure Nushell entry point +│ │ ├── nulib/ # Nushell libraries +│ │ │ ├── lib_provisioning/ # Core library functions +│ │ │ ├── main_provisioning/ # CLI handlers +│ │ │ ├── servers/ # Server management +│ │ │ ├── taskservs/ # Task service management +│ │ │ ├── clusters/ # Cluster management +│ │ │ └── workflows/ # Workflow orchestration +│ │ ├── plugins/ # System plugins +│ │ │ └── nushell-plugins/ # Nushell plugin sources +│ │ └── scripts/ # Utility scripts +│ │ +│ ├── extensions/ # Extensible modules +│ │ ├── providers/ # Cloud providers (aws, upcloud, local) +│ │ ├── taskservs/ # Infrastructure services +│ │ │ ├── container-runtime/ # Container runtimes +│ │ │ ├── kubernetes/ # Kubernetes +│ │ │ ├── networking/ # Network services +│ │ │ ├── storage/ # Storage services +│ │ │ ├── databases/ # Database services +│ │ │ └── development/ # Dev tools +│ │ ├── clusters/ # Complete cluster configurations +│ │ └── workflows/ # Workflow templates +│ │ +│ ├── platform/ # Platform services (Rust) +│ │ ├── orchestrator/ # Rust coordination layer +│ │ ├── control-center/ # Web management UI +│ │ ├── control-center-ui/ # UI frontend +│ │ ├── mcp-server/ # Model Context Protocol server +│ │ └── api-gateway/ # REST API gateway +│ │ +│ ├── kcl/ # KCL configuration schemas +│ │ ├── main.ncl # Main entry point +│ │ ├── settings.ncl # Settings schema +│ │ ├── server.ncl # Server definitions +│ │ ├── cluster.ncl # Cluster definitions +│ │ ├── workflows.ncl # Workflow definitions +│ │ └── docs/ # KCL documentation +│ │ +│ ├── templates/ # Jinja2 templates +│ │ ├── extensions/ # Extension templates +│ │ ├── services/ # Service templates +│ │ └── workspace/ # Workspace templates +│ │ +│ ├── config/ # Default system configuration +│ │ ├── config.defaults.toml # System defaults +│ │ └── config-examples/ # Example configs +│ │ +│ ├── tools/ # Build and packaging tools +│ │ ├── build/ # Build scripts +│ │ ├── package/ # Packaging tools +│ │ ├── distribution/ # Distribution tools +│ │ └── release/ # Release automation +│ │ +│ └── resources/ # Static resources (images, assets) +│ +├── workspace/ # RUNTIME DATA (gitignored except templates) +│ ├── infra/ # Infrastructure instances (gitignored) +│ │ └── .gitkeep +│ ├── config/ # User configuration (gitignored) +│ │ └── .gitkeep +│ ├── extensions/ # User extensions (gitignored) +│ │ └── .gitkeep +│ ├── runtime/ # Runtime data (gitignored) +│ │ ├── logs/ +│ │ ├── cache/ +│ │ ├── state/ +│ │ └── tmp/ +│ └── templates/ # Workspace templates (tracked) +│ ├── minimal/ +│ ├── kubernetes/ +│ └── multi-cloud/ +│ +├── distribution/ # DISTRIBUTION ARTIFACTS (gitignored) +│ 
├── packages/ # Built packages +│ │ ├── provisioning-core-*.tar.gz +│ │ ├── provisioning-platform-*.tar.gz +│ │ ├── provisioning-extensions-*.tar.gz +│ │ └── checksums.txt +│ ├── installers/ # Installation scripts +│ │ ├── install.sh # Bash installer +│ │ └── install.nu # Nushell installer +│ └── registry/ # Package registry metadata +│ └── index.json +│ +├── docs/ # UNIFIED DOCUMENTATION +│ ├── README.md # Documentation index +│ ├── user/ # User guides +│ │ ├── installation.md +│ │ ├── quick-start.md +│ │ ├── configuration.md +│ │ └── guides/ +│ ├── api/ # API reference +│ │ ├── rest-api.md +│ │ ├── nushell-api.md +│ │ └── kcl-schemas.md +│ ├── architecture/ # Architecture documentation +│ │ ├── overview.md +│ │ ├── decisions/ # ADRs +│ │ └── repo-dist-analysis.md # This document +│ └── development/ # Development guides +│ ├── contributing.md +│ ├── building.md +│ ├── testing.md +│ └── releasing.md +│ +├── examples/ # EXAMPLE CONFIGURATIONS +│ ├── minimal/ # Minimal setup +│ ├── kubernetes-cluster/ # Full K8s cluster +│ ├── multi-cloud/ # Multi-provider setup +│ └── README.md +│ +├── tests/ # INTEGRATION TESTS +│ ├── e2e/ # End-to-end tests +│ ├── integration/ # Integration tests +│ ├── fixtures/ # Test fixtures +│ └── README.md +│ +├── tools/ # DEVELOPMENT TOOLS +│ ├── build/ # Build scripts +│ ├── dev-env/ # Development environment setup +│ └── scripts/ # Utility scripts +│ +├── .github/ # GitHub configuration +│ ├── workflows/ # CI/CD workflows +│ │ ├── build.yml +│ │ ├── test.yml +│ │ └── release.yml +│ └── ISSUE_TEMPLATE/ +│ +├── .coder/ # Coder configuration (tracked) +│ +├── .gitignore # Git ignore rules +├── .gitattributes # Git attributes +├── Cargo.toml # Rust workspace root +├── Justfile # Task runner (unified) +├── LICENSE # License file +├── README.md # Project README +├── CHANGELOG.md # Changelog +└── CLAUDE.md # AI assistant instructions +``` + +### Key Principles + +1. **Clear Separation**: Source code (`provisioning/`), runtime data (`workspace/`), build artifacts (`distribution/`) +2. **Single Source of Truth**: One location for each type of content +3. **Gitignore Strategy**: Runtime and build artifacts ignored, templates tracked +4. **Standard Paths**: Follow Unix conventions for installation + +--- + +## Distribution Strategy + +### Package Types + +#### 1. **provisioning-core** (Required) + +**Contents:** + +- Nushell CLI and libraries +- Core providers (local, upcloud, aws) +- Essential taskservs (kubernetes, containerd, cilium) +- KCL schemas +- Configuration system +- Templates + +**Size:** ~50 MB (compressed) + +**Installation:** + +```text +/usr/local/ +├── bin/ +│ └── provisioning +├── lib/ +│ └── provisioning/ +│ ├── core/ +│ ├── extensions/ +│ └── kcl/ +└── share/ + └── provisioning/ + ├── templates/ + ├── config/ + └── docs/ +``` + +#### 2. **provisioning-platform** (Optional) + +**Contents:** + +- Rust orchestrator binary +- Control center web UI +- MCP server +- API gateway + +**Size:** ~30 MB (compressed) + +**Installation:** + +```text +/usr/local/ +├── bin/ +│ ├── provisioning-orchestrator +│ └── provisioning-control-center +└── share/ + └── provisioning/ + └── platform/ +``` + +#### 3. **provisioning-extensions** (Optional) + +**Contents:** + +- Additional taskservs (radicle, gitea, postgres, etc.) +- Cluster templates +- Workflow templates + +**Size:** ~20 MB (compressed) + +**Installation:** + +```text +/usr/local/lib/provisioning/extensions/ +├── taskservs/ +├── clusters/ +└── workflows/ +``` + +#### 4. 
**provisioning-plugins** (Optional)
+
+**Contents:**
+
+- Pre-built Nushell plugins
+- `nu_plugin_kcl`
+- `nu_plugin_tera`
+- Other custom plugins
+
+**Size:** ~15 MB (compressed)
+
+**Installation:**
+
+```text
+~/.config/nushell/plugins/
+```
+
+### Installation Paths
+
+#### System Installation (Root)
+
+```text
+/usr/local/
+├── bin/
+│   ├── provisioning                   # Main CLI
+│   ├── provisioning-orchestrator      # Orchestrator binary
+│   └── provisioning-control-center    # Control center binary
+├── lib/
+│   └── provisioning/
+│       ├── core/                      # Core Nushell libraries
+│       │   ├── nulib/
+│       │   └── plugins/
+│       ├── extensions/                # Extensions
+│       │   ├── providers/
+│       │   ├── taskservs/
+│       │   └── clusters/
+│       └── kcl/                       # KCL schemas
+└── share/
+    └── provisioning/
+        ├── templates/                 # System templates
+        ├── config/                    # Default configs
+        │   └── config.defaults.toml
+        └── docs/                      # Documentation
+```
+
+#### User Configuration
+
+```text
+~/.provisioning/
+├── config/
+│   └── config.user.toml               # User overrides
+├── extensions/                        # User extensions
+│   ├── providers/
+│   ├── taskservs/
+│   └── clusters/
+├── cache/                             # Cache directory
+└── plugins/                           # User plugins
+```
+
+#### Project Workspace
+
+```text
+./workspace/
+├── infra/                             # Infrastructure definitions
+│   ├── my-cluster/
+│   │   ├── config.toml
+│   │   ├── servers.yaml
+│   │   └── taskservs.yaml
+│   └── production/
+├── config/                            # Project configuration
+│   └── config.toml
+├── runtime/                           # Runtime data
+│   ├── logs/
+│   ├── state/
+│   └── cache/
+└── extensions/                        # Project-specific extensions
+```
+
+### Configuration Hierarchy
+
+```text
+Priority (highest to lowest):
+1. CLI flags           --debug, --infra=my-cluster
+2. Runtime overrides   PROVISIONING_DEBUG=true
+3. Project config      ./workspace/config/config.toml
+4. User config         ~/.provisioning/config/config.user.toml
+5. System config       /usr/local/share/provisioning/config/config.defaults.toml
+```
+
+---
+
+## Build System
+
+### Build Tools Structure
+
+**`provisioning/tools/build/`:**
+
+```text
+build/
+├── build-system.nu        # Main build orchestrator
+├── package-core.nu        # Core packaging
+├── package-platform.nu    # Platform packaging
+├── package-extensions.nu  # Extensions packaging
+├── package-plugins.nu     # Plugins packaging
+├── create-installers.nu   # Installer generation
+├── validate-package.nu    # Package validation
+└── publish-registry.nu    # Registry publishing
+```
+
+### Build System Implementation
+
+**`provisioning/tools/build/build-system.nu`:**
+
+```text
+#!/usr/bin/env nu
+# Build system for provisioning project
+
+use ../core/nulib/lib_provisioning/config/accessor.nu *
+
+# Build all packages
+export def "main build-all" [
+    --version: string = "dev"                    # Version to build
+    --output: string = "distribution/packages"   # Output directory
+] {
+    print $"Building all packages version: ($version)"
+
+    let results = {
+        core: (build-core $version $output)
+        platform: (build-platform $version $output)
+        extensions: (build-extensions $version $output)
+        plugins: (build-plugins $version $output)
+    }
+
+    # Generate checksums
+    create-checksums $output
+
+    print "✅ All packages built successfully"
+    $results
+}
+
+# Build core package
+export def "build-core" [
+    version: string
+    output: string = "distribution/packages"
+]: nothing -> record {
+    print "📦 Building provisioning-core..."
+
+    nu package-core.nu build --version $version --output $output
+}
+
+# Build platform package (Rust binaries)
+export def "build-platform" [
+    version: string
+    output: string = "distribution/packages"
+]: nothing -> record {
+    print "📦 Building provisioning-platform..."
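+    # NOTE: each build-* helper shells out to its packaging script via a relative
+    # path, so this module is assumed to be run from provisioning/tools/build/.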
+
+    nu package-platform.nu build --version $version --output $output
+}
+
+# Build extensions package
+export def "build-extensions" [
+    version: string
+    output: string = "distribution/packages"
+]: nothing -> record {
+    print "📦 Building provisioning-extensions..."
+
+    nu package-extensions.nu build --version $version --output $output
+}
+
+# Build plugins package
+export def "build-plugins" [
+    version: string
+    output: string = "distribution/packages"
+]: nothing -> record {
+    print "📦 Building provisioning-plugins..."
+
+    nu package-plugins.nu build --version $version --output $output
+}
+
+# Create release artifacts
+export def "main release" [
+    version: string    # Release version
+    --upload           # Upload to release server
+] {
+    print $"🚀 Creating release ($version)"
+
+    # Build all packages
+    let packages = (main build-all --version $version)
+
+    # Create installers
+    create-installers $version
+
+    # Generate release notes
+    generate-release-notes $version
+
+    # Upload if requested
+    if $upload {
+        upload-release $version
+    }
+
+    print $"✅ Release ($version) ready"
+}
+
+# Create installers
+def create-installers [version: string] {
+    print "📝 Creating installers..."
+
+    nu create-installers.nu --version $version
+}
+
+# Generate release notes
+def generate-release-notes [version: string] {
+    print "📝 Generating release notes..."
+
+    let changelog = (open CHANGELOG.md)
+    let notes = ($changelog | parse-version-section $version)
+
+    $notes | save $"distribution/packages/RELEASE_NOTES_($version).md"
+}
+
+# Upload release
+def upload-release [version: string] {
+    print "⬆️ Uploading release..."
+
+    # Implementation depends on your release infrastructure
+    # Could use: GitHub releases, S3, custom server, etc.
+}
+
+# Create checksums for all packages
+def create-checksums [output: string] {
+    print "🔐 Creating checksums..."
+
+    ls ($output | path join "*.tar.gz" | into glob)
+    | each { |file|
+        let hash = (sha256sum $file.name | split row ' ' | get 0)
+        $"($hash)  ($file.name | path basename)"
+    }
+    | str join "\n"
+    | save ($output | path join "checksums.txt")
+}
+
+# Clean build artifacts
+export def "main clean" [
+    --all    # Clean all build artifacts
+] {
+    print "🧹 Cleaning build artifacts..."
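+    # With --all, compiled Rust target/ directories are removed as well;
+    # the default only clears built packages so rebuilds stay fast.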
+ + if ($all) { + rm -rf distribution/packages + rm -rf target/ + rm -rf provisioning/platform/target/ + } else { + rm -rf distribution/packages + } + + print "✅ Clean complete" +} + +# Validate built packages +export def "main validate" [ + package_path: string # Package to validate +] { + print $"🔍 Validating package: ($package_path)" + + nu validate-package.nu $package_path +} + +# Show build status +export def "main status" [] { + print "📊 Build Status" + print "─" * 60 + + let core_exists = ("distribution/packages" | path join "provisioning-core-*.tar.gz" | glob | is-not-empty) + let platform_exists = ("distribution/packages" | path join "provisioning-platform-*.tar.gz" | glob | is-not-empty) + + print $"Core package: (if $core_exists { '✅ Built' } else { '❌ Not built' })" + print $"Platform package: (if $platform_exists { '✅ Built' } else { '❌ Not built' })" + + if ("distribution/packages" | path exists) { + let packages = (ls distribution/packages | where name =~ ".tar.gz") + print $" +Total packages: (($packages | length))" + $packages | select name size + } +} +``` + +### Justfile Integration + +**`Justfile`:** + +```text +# Provisioning Build System +# Use 'just --list' to see all available commands + +# Default recipe +default: + @just --list + +# Development tasks +alias d := dev-check +alias t := test +alias b := build + +# Build all packages +build VERSION="dev": + nu provisioning/tools/build/build-system.nu build-all --version {{VERSION}} + +# Build core package only +build-core VERSION="dev": + nu provisioning/tools/build/build-system.nu build-core {{VERSION}} + +# Build platform binaries +build-platform VERSION="dev": + cargo build --release --workspace --manifest-path provisioning/platform/Cargo.toml + nu provisioning/tools/build/build-system.nu build-platform {{VERSION}} + +# Run development checks +dev-check: + @echo "🔍 Running development checks..." + cargo check --workspace --manifest-path provisioning/platform/Cargo.toml + cargo clippy --workspace --manifest-path provisioning/platform/Cargo.toml + nu provisioning/tools/build/validate-nushell.nu + +# Run tests +test: + @echo "🧪 Running tests..." + cargo test --workspace --manifest-path provisioning/platform/Cargo.toml + nu tests/run-all-tests.nu + +# Run integration tests +test-e2e: + @echo "🔬 Running E2E tests..." + nu tests/e2e/run-e2e.nu + +# Format code +fmt: + cargo fmt --all --manifest-path provisioning/platform/Cargo.toml + nu provisioning/tools/build/format-nushell.nu + +# Clean build artifacts +clean: + nu provisioning/tools/build/build-system.nu clean + +# Clean all (including Rust target/) +clean-all: + nu provisioning/tools/build/build-system.nu clean --all + cargo clean --manifest-path provisioning/platform/Cargo.toml + +# Create release +release VERSION: + @echo "🚀 Creating release {{VERSION}}..." + nu provisioning/tools/build/build-system.nu release {{VERSION}} + +# Install from source +install: + @echo "📦 Installing from source..." + just build + sudo nu distribution/installers/install.nu --from-source + +# Install development version (symlink) +install-dev: + @echo "🔗 Installing development version..." + sudo ln -sf $(pwd)/provisioning/core/cli/provisioning /usr/local/bin/provisioning + @echo "✅ Development installation complete" + +# Uninstall +uninstall: + @echo "🗑️ Uninstalling..." 
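+    # Removes system-wide files only; user config in ~/.provisioning is preserved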
+ sudo rm -f /usr/local/bin/provisioning + sudo rm -rf /usr/local/lib/provisioning + sudo rm -rf /usr/local/share/provisioning + +# Show build status +status: + nu provisioning/tools/build/build-system.nu status + +# Validate package +validate PACKAGE: + nu provisioning/tools/build/build-system.nu validate {{PACKAGE}} + +# Start development environment +dev-start: + @echo "🚀 Starting development environment..." + cd provisioning/platform/orchestrator && cargo run + +# Watch and rebuild on changes +watch: + @echo "👀 Watching for changes..." + cargo watch -x 'check --workspace --manifest-path provisioning/platform/Cargo.toml' + +# Update dependencies +update-deps: + cargo update --manifest-path provisioning/platform/Cargo.toml + nu provisioning/tools/build/update-nushell-deps.nu + +# Generate documentation +docs: + @echo "📚 Generating documentation..." + cargo doc --workspace --no-deps --manifest-path provisioning/platform/Cargo.toml + nu provisioning/tools/build/generate-docs.nu + +# Benchmark +bench: + cargo bench --workspace --manifest-path provisioning/platform/Cargo.toml + +# Check licenses +check-licenses: + cargo deny check licenses --manifest-path provisioning/platform/Cargo.toml + +# Security audit +audit: + cargo audit --file provisioning/platform/Cargo.lock +``` + +--- + +## Installation System + +### Installer Script + +**`distribution/installers/install.nu`:** + +```text +#!/usr/bin/env nu +# Provisioning installation script + +const DEFAULT_PREFIX = "/usr/local" +const REPO_URL = "https://releases.provisioning.io" + +# Main installation command +def main [ + --prefix: string = $DEFAULT_PREFIX # Installation prefix + --version: string = "latest" # Version to install + --from-source # Install from source (development) + --packages: list = ["core"] # Packages to install +] { + print "📦 Provisioning Installation" + print "─" * 60 + + # Check prerequisites + check-prerequisites + + # Install packages + if $from_source { + install-from-source $prefix + } else { + install-from-release $prefix $version $packages + } + + # Post-installation + post-install $prefix + + print "" + print "✅ Installation complete!" + print $"Run 'provisioning --help' to get started" +} + +# Check prerequisites +def check-prerequisites [] { + print "🔍 Checking prerequisites..." + + # Check for Nushell + if (which nu | is-empty) { + error make { + msg: "Nushell not found. Please install Nushell first: https://nushell.sh" + } + } + + let nu_version = (nu --version | parse "{name} {version}" | get 0.version) + print $" ✓ Nushell ($nu_version)" + + # Check for required tools + if (which tar | is-empty) { + error make { msg: "tar not found" } + } + + if (which curl | is-empty) and (which wget | is-empty) { + error make { msg: "curl or wget required" } + } + + print " ✓ All prerequisites met" +} + +# Install from source +def install-from-source [prefix: string] { + print "📦 Installing from source..." + + # Check if we're in the source directory + if not ("provisioning" | path exists) { + error make { msg: "Must run from project root" } + } + + # Create installation directories + create-install-dirs $prefix + + # Copy files + print " Copying core files..." 
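+    # The copy layout mirrors the FHS-style tree above: code under lib/,
+    # templates and default config under share/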
+ cp -r provisioning/core/nulib $"($prefix)/lib/provisioning/core/" + cp -r provisioning/extensions $"($prefix)/lib/provisioning/" + cp -r provisioning/kcl $"($prefix)/lib/provisioning/" + cp -r provisioning/templates $"($prefix)/share/provisioning/" + cp -r provisioning/config $"($prefix)/share/provisioning/" + + # Create CLI wrapper + create-cli-wrapper $prefix + + print " ✓ Source installation complete" +} + +# Install from release +def install-from-release [ + prefix: string + version: string + packages: list +] { + print $"📦 Installing version ($version)..." + + # Download packages + for package in $packages { + download-package $package $version + extract-package $package $version $prefix + } +} + +# Download package +def download-package [package: string, version: string] { + let filename = $"provisioning-($package)-($version).tar.gz" + let url = $"($REPO_URL)/($version)/($filename)" + + print $" Downloading ($package)..." + + if (which curl | is-not-empty) { + curl -fsSL -o $"/tmp/($filename)" $url + } else { + wget -q -O $"/tmp/($filename)" $url + } +} + +# Extract package +def extract-package [package: string, version: string, prefix: string] { + let filename = $"provisioning-($package)-($version).tar.gz" + + print $" Installing ($package)..." + + tar xzf $"/tmp/($filename)" -C $prefix + rm $"/tmp/($filename)" +} + +# Create installation directories +def create-install-dirs [prefix: string] { + mkdir ($prefix | path join "bin") + mkdir ($prefix | path join "lib" "provisioning" "core") + mkdir ($prefix | path join "lib" "provisioning" "extensions") + mkdir ($prefix | path join "share" "provisioning" "templates") + mkdir ($prefix | path join "share" "provisioning" "config") + mkdir ($prefix | path join "share" "provisioning" "docs") +} + +# Create CLI wrapper +def create-cli-wrapper [prefix: string] { + let wrapper = $"#!/usr/bin/env nu +# Provisioning CLI wrapper + +# Load provisioning library +const PROVISIONING_LIB = \"($prefix)/lib/provisioning\" +const PROVISIONING_SHARE = \"($prefix)/share/provisioning\" + +$env.PROVISIONING_ROOT = $PROVISIONING_LIB +$env.PROVISIONING_SHARE = $PROVISIONING_SHARE + +# Add to Nushell path +$env.NU_LIB_DIRS = ($env.NU_LIB_DIRS | append $\"($PROVISIONING_LIB)/core/nulib\") + +# Load main provisioning module +use ($PROVISIONING_LIB)/core/nulib/main_provisioning/dispatcher.nu * + +# Main entry point +def main [...args] { + dispatch-command $args +} + +main ...$args +" + + $wrapper | save ($prefix | path join "bin" "provisioning") + chmod +x ($prefix | path join "bin" "provisioning") +} + +# Post-installation tasks +def post-install [prefix: string] { + print "🔧 Post-installation setup..." 
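+    # Post-install: seed ~/.provisioning with a starter config and warn
+    # if <prefix>/bin is not on the user's PATH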
+
+    # Create user config directory
+    let user_config = ($env.HOME | path join ".provisioning")
+    if not ($user_config | path exists) {
+        mkdir ($user_config | path join "config")
+        mkdir ($user_config | path join "extensions")
+        mkdir ($user_config | path join "cache")
+
+        # Copy example config
+        let example = ($prefix | path join "share" "provisioning" "config" "config-examples" "config.user.toml")
+        if ($example | path exists) {
+            cp $example ($user_config | path join "config" "config.user.toml")
+        }
+
+        print $"  ✓ Created user config directory: ($user_config)"
+    }
+
+    # Check if prefix is in PATH
+    if not ($env.PATH | any { |p| $p == ($prefix | path join "bin") }) {
+        print ""
+        print $"⚠️ Note: ($prefix)/bin is not in your PATH"
+        print "   Add this to your shell configuration:"
+        print $"     export PATH=\"($prefix)/bin:$PATH\""
+    }
+}
+
+# Uninstall provisioning
+export def "main uninstall" [
+    --prefix: string = $DEFAULT_PREFIX    # Installation prefix
+    --keep-config                         # Keep user configuration
+] {
+    print "🗑️ Uninstalling provisioning..."
+
+    # Remove installed files
+    rm -rf ($prefix | path join "bin" "provisioning")
+    rm -rf ($prefix | path join "lib" "provisioning")
+    rm -rf ($prefix | path join "share" "provisioning")
+
+    # Remove user config if requested
+    if not $keep_config {
+        let user_config = ($env.HOME | path join ".provisioning")
+        if ($user_config | path exists) {
+            rm -rf $user_config
+            print "  ✓ Removed user configuration"
+        }
+    }
+
+    print "✅ Uninstallation complete"
+}
+
+# Upgrade provisioning
+export def "main upgrade" [
+    --version: string = "latest"          # Version to upgrade to
+    --prefix: string = $DEFAULT_PREFIX    # Installation prefix
+] {
+    print $"⬆️ Upgrading to version ($version)..."
+
+    # Check current version
+    let current = (^provisioning version | parse "{version}" | get 0.version)
+    print $"  Current version: ($current)"
+
+    if $current == $version {
+        print "  Already at latest version"
+        return
+    }
+
+    # Backup current installation
+    print "  Backing up current installation..."
+    let backup = ($prefix | path join "lib" "provisioning.backup")
+    mv ($prefix | path join "lib" "provisioning") $backup
+
+    # Install new version
+    try {
+        install-from-release $prefix $version ["core"]
+        print $"  ✅ Upgraded to version ($version)"
+        rm -rf $backup
+    } catch {
+        print "  ❌ Upgrade failed, restoring backup..."
+        mv $backup ($prefix | path join "lib" "provisioning")
+        error make { msg: "Upgrade failed" }
+    }
+}
+```
+
+### Bash Installer (For Systems Without Nushell)
+
+**`distribution/installers/install.sh`:**
+
+```text
+#!/usr/bin/env bash
+# Provisioning installation script (Bash version)
+# This script installs Nushell first, then runs the Nushell installer
+
+set -euo pipefail
+
+DEFAULT_PREFIX="/usr/local"
+REPO_URL="https://releases.provisioning.io"
+
+# Colors
+RED='\033[0;31m'
+GREEN='\033[0;32m'
+YELLOW='\033[1;33m'
+NC='\033[0m' # No Color
+
+info() {
+    echo -e "${GREEN}✓${NC} $*"
+}
+
+warn() {
+    echo -e "${YELLOW}⚠${NC} $*"
+}
+
+error() {
+    echo -e "${RED}✗${NC} $*" >&2
+    exit 1
+}
+
+# Check if Nushell is installed
+check_nushell() {
+    if command -v nu >/dev/null 2>&1; then
+        info "Nushell is already installed"
+        return 0
+    else
+        warn "Nushell not found"
+        return 1
+    fi
+}
+
+# Install Nushell
+install_nushell() {
+    echo "📦 Installing Nushell..."
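+    # Assumption: a package named "nushell" is available via the detected
+    # package manager; some distros may need an extra repository first.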
+ + # Detect OS and architecture + OS="$(uname -s)" + ARCH="$(uname -m)" + + case "$OS" in + Linux*) + if command -v apt-get >/dev/null 2>&1; then + sudo apt-get update && sudo apt-get install -y nushell + elif command -v dnf >/dev/null 2>&1; then + sudo dnf install -y nushell + elif command -v brew >/dev/null 2>&1; then + brew install nushell + else + error "Cannot automatically install Nushell. Please install manually: https://nushell.sh" + fi + ;; + Darwin*) + if command -v brew >/dev/null 2>&1; then + brew install nushell + else + error "Homebrew not found. Install from: https://brew.sh" + fi + ;; + *) + error "Unsupported operating system: $OS" + ;; + esac + + info "Nushell installed successfully" +} + +# Main installation +main() { + echo "📦 Provisioning Installation" + echo "────────────────────────────────────────────────────────────" + + # Check for Nushell + if ! check_nushell; then + read -p "Install Nushell? (y/N) " -n 1 -r + echo + if [[ $REPLY =~ ^[Yy]$ ]]; then + install_nushell + else + error "Nushell is required. Install from: https://nushell.sh" + fi + fi + + # Download Nushell installer + echo "📥 Downloading installer..." + INSTALLER_URL="$REPO_URL/latest/install.nu" + curl -fsSL "$INSTALLER_URL" -o /tmp/install.nu + + # Run Nushell installer + echo "🚀 Running installer..." + nu /tmp/install.nu "$@" + + # Cleanup + rm -f /tmp/install.nu + + info "Installation complete!" +} + +# Run main +main "$@" +``` + +--- + +## Implementation Plan + +### Phase 1: Repository Restructuring (3-4 days) + +#### Day 1: Cleanup and Preparation + +**Tasks:** + +1. Create backup of current state +2. Analyze and document all workspace directories +3. Identify active workspace vs backups +4. Map all file dependencies + +**Commands:** + +```text +# Backup current state +cp -r /Users/Akasha/project-provisioning /Users/Akasha/project-provisioning.backup + +# Analyze workspaces +fd workspace -t d > workspace-dirs.txt +``` + +**Deliverables:** + +- Complete backup +- Workspace analysis document +- Dependency map + +#### Day 2: Directory Restructuring + +**Tasks:** + +1. Consolidate workspace directories +2. Move build artifacts to `distribution/` +3. Remove obsolete directories (`NO/`, `wrks/`, presentation artifacts) +4. Create proper `.gitignore` + +**Commands:** + +```text +# Create distribution directory +mkdir -p distribution/{packages,installers,registry} + +# Move build artifacts +mv target distribution/ +mv provisioning/tools/dist distribution/packages/ + +# Remove obsolete +rm -rf NO/ wrks/ presentations/ +``` + +**Deliverables:** + +- Clean directory structure +- Updated `.gitignore` +- Migration log + +#### Day 3: Update Path References + +**Tasks:** + +1. Update all hardcoded paths in Nushell scripts +2. Update CLAUDE.md with new paths +3. Update documentation references +4. Test all path changes + +**Files to Update:** + +- `provisioning/core/nulib/**/*.nu` (~65 files) +- `CLAUDE.md` +- `docs/**/*.md` + +**Deliverables:** + +- Updated scripts +- Updated documentation +- Test results + +#### Day 4: Validation and Documentation + +**Tasks:** + +1. Run full test suite +2. Verify all commands work +3. Update README.md +4. Create migration guide + +**Deliverables:** + +- Passing tests +- Updated README +- Migration guide for users + +### Phase 2: Build System Implementation (3-4 days) + +#### Day 5: Build System Core + +**Tasks:** + +1. Create `provisioning/tools/build/` structure +2. Implement `build-system.nu` +3. Implement `package-core.nu` +4. 
Create Justfile + +**Files to Create:** + +- `provisioning/tools/build/build-system.nu` +- `provisioning/tools/build/package-core.nu` +- `provisioning/tools/build/validate-package.nu` +- `Justfile` + +**Deliverables:** + +- Working build system +- Core packaging capability +- Justfile with basic recipes + +#### Day 6: Platform and Extension Packaging + +**Tasks:** + +1. Implement `package-platform.nu` +2. Implement `package-extensions.nu` +3. Implement `package-plugins.nu` +4. Add checksum generation + +**Deliverables:** + +- Platform packaging +- Extension packaging +- Plugin packaging +- Checksum generation + +#### Day 7: Package Validation + +**Tasks:** + +1. Create package validation system +2. Implement integrity checks +3. Create test suite for packages +4. Document package format + +**Deliverables:** + +- Package validation +- Test suite +- Package format documentation + +#### Day 8: Build System Testing + +**Tasks:** + +1. Test full build pipeline +2. Test all package types +3. Optimize build performance +4. Document build system + +**Deliverables:** + +- Tested build system +- Performance optimizations +- Build system documentation + +### Phase 3: Installation System (2-3 days) + +#### Day 9: Nushell Installer + +**Tasks:** + +1. Create `install.nu` +2. Implement installation logic +3. Implement upgrade logic +4. Implement uninstallation + +**Files to Create:** + +- `distribution/installers/install.nu` + +**Deliverables:** + +- Working Nushell installer +- Upgrade mechanism +- Uninstall mechanism + +#### Day 10: Bash Installer and CLI + +**Tasks:** + +1. Create `install.sh` +2. Replace bash CLI wrapper with pure Nushell +3. Update PATH handling +4. Test installation on clean system + +**Files to Create:** + +- `distribution/installers/install.sh` +- Updated `provisioning/core/cli/provisioning` + +**Deliverables:** + +- Bash installer +- Pure Nushell CLI +- Installation tests + +#### Day 11: Installation Testing + +**Tasks:** + +1. Test installation on multiple OSes +2. Test upgrade scenarios +3. Test uninstallation +4. Create installation documentation + +**Deliverables:** + +- Multi-OS installation tests +- Installation guide +- Troubleshooting guide + +### Phase 4: Package Registry (Optional, 2-3 days) + +#### Day 12: Registry System + +**Tasks:** + +1. Design registry format +2. Implement registry indexing +3. Create package metadata +4. Implement search functionality + +**Files to Create:** + +- `provisioning/tools/build/publish-registry.nu` +- `distribution/registry/index.json` + +**Deliverables:** + +- Registry system +- Package metadata +- Search functionality + +#### Day 13: Registry Commands + +**Tasks:** + +1. Implement `provisioning registry list` +2. Implement `provisioning registry search` +3. Implement `provisioning registry install` +4. Implement `provisioning registry update` + +**Deliverables:** + +- Registry commands +- Package installation from registry +- Update mechanism + +#### Day 14: Registry Hosting + +**Tasks:** + +1. Set up registry hosting (S3, GitHub releases, etc.) +2. Implement upload mechanism +3. Create CI/CD for automatic publishing +4. Document registry system + +**Deliverables:** + +- Hosted registry +- CI/CD pipeline +- Registry documentation + +### Phase 5: Documentation and Release (2 days) + +#### Day 15: Documentation + +**Tasks:** + +1. Update all documentation for new structure +2. Create user guides +3. Create development guides +4. 
Create API documentation + +**Deliverables:** + +- Updated documentation +- User guides +- Developer guides +- API docs + +#### Day 16: Release Preparation + +**Tasks:** + +1. Create CHANGELOG.md +2. Build release packages +3. Test installation from packages +4. Create release announcement + +**Deliverables:** + +- CHANGELOG +- Release packages +- Installation verification +- Release announcement + +--- + +## Migration Strategy + +### For Existing Users + +#### Option 1: Clean Migration + +```text +# Backup current workspace +cp -r workspace workspace.backup + +# Upgrade to new version +provisioning upgrade --version 3.2.0 + +# Migrate workspace +provisioning workspace migrate --from workspace.backup --to workspace/ +``` + +#### Option 2: In-Place Migration + +```text +# Run migration script +provisioning migrate --check # Dry run +provisioning migrate # Execute migration +``` + +### For Developers + +```text +# Pull latest changes +git pull origin main + +# Rebuild +just clean-all +just build + +# Reinstall development version +just install-dev + +# Verify +provisioning --version +``` + +--- + +## Success Criteria + +### Repository Structure + +- ✅ Single `workspace/` directory for all runtime data +- ✅ Clear separation: source (`provisioning/`), runtime (`workspace/`), artifacts (`distribution/`) +- ✅ All build artifacts in `distribution/` and gitignored +- ✅ Clean root directory (no `wrks/`, `NO/`, etc.) +- ✅ Unified documentation in `docs/` + +### Build System + +- ✅ Single command builds all packages: `just build` +- ✅ Packages can be built independently +- ✅ Checksums generated automatically +- ✅ Validation before packaging +- ✅ Build time < 5 minutes for full build + +### Installation + +- ✅ One-line installation: `curl -fsSL https://get.provisioning.io | sh` +- ✅ Works on Linux and macOS +- ✅ Standard installation paths (`/usr/local/`) +- ✅ User configuration in `~/.provisioning/` +- ✅ Clean uninstallation + +### Distribution + +- ✅ Packages available at stable URL +- ✅ Automated releases via CI/CD +- ✅ Package registry for extensions +- ✅ Upgrade mechanism works reliably + +### Documentation + +- ✅ Complete installation guide +- ✅ Quick start guide +- ✅ Developer contributing guide +- ✅ API documentation +- ✅ Architecture documentation + +--- + +## Risks and Mitigations + +### Risk 1: Breaking Changes for Existing Users + +**Impact:** High +**Probability:** High +**Mitigation:** + +- Provide migration script +- Support both old and new paths during transition (v3.2.x) +- Clear migration guide +- Automated backup before migration + +### Risk 2: Build System Complexity + +**Impact:** Medium +**Probability:** Medium +**Mitigation:** + +- Start with simple packaging +- Iterate and improve +- Document thoroughly +- Provide examples + +### Risk 3: Installation Path Conflicts + +**Impact:** Medium +**Probability:** Low +**Mitigation:** + +- Check for existing installations +- Support custom prefix +- Clear uninstallation +- Non-conflicting binary names + +### Risk 4: Cross-Platform Issues + +**Impact:** High +**Probability:** Medium +**Mitigation:** + +- Test on multiple OSes (Linux, macOS) +- Use portable commands +- Provide fallbacks +- Clear error messages + +### Risk 5: Dependency Management + +**Impact:** Medium +**Probability:** Medium +**Mitigation:** + +- Document all dependencies +- Check prerequisites during installation +- Provide installation instructions for dependencies +- Consider bundling critical dependencies + +--- + +## Timeline Summary + +| Phase | Duration | Key 
Deliverables | +| ------- | ---------- | ------------------ | +| Phase 1: Restructuring | 3-4 days | Clean directory structure, updated paths | +| Phase 2: Build System | 3-4 days | Working build system, all package types | +| Phase 3: Installation | 2-3 days | Installers, pure Nushell CLI | +| Phase 4: Registry (Optional) | 2-3 days | Package registry, extension management | +| Phase 5: Documentation | 2 days | Complete documentation, release | +| **Total** | **12-16 days** | Production-ready distribution system | + +--- + +## Next Steps + +1. **Review and Approval** (Day 0) + - Review this analysis + - Approve implementation plan + - Assign resources + +2. **Kickoff** (Day 1) + - Create implementation branch + - Set up project tracking + - Begin Phase 1 + +3. **Weekly Reviews** + - End of Phase 1: Structure review + - End of Phase 2: Build system review + - End of Phase 3: Installation review + - Final review before release + +--- + +## Conclusion + +This comprehensive plan transforms the provisioning system into a professional-grade infrastructure automation platform with: + +- **Clean Architecture**: Clear separation of concerns +- **Professional Distribution**: Standard installation paths and packaging +- **Easy Installation**: One-command installation for users +- **Developer Friendly**: Simple build system and clear development workflow +- **Extensible**: Package registry for community extensions +- **Well Documented**: Complete guides for users and developers + +The implementation will take approximately **2-3 weeks** and will result in a production-ready system suitable for both individual developers and +enterprise deployments. + +--- + +## References + +- Current codebase structure +- Unix FHS (Filesystem Hierarchy Standard) +- Rust cargo packaging conventions +- npm/yarn package management patterns +- Homebrew formula best practices +- KCL package management design diff --git a/docs/src/architecture/system-overview.md b/docs/src/architecture/system-overview.md index 5081dec..27bae54 100644 --- a/docs/src/architecture/system-overview.md +++ b/docs/src/architecture/system-overview.md @@ -1 +1,355 @@ -# System Overview\n\n## Executive Summary\n\nProvisioning is an **Infrastructure Automation Platform** built with a hybrid Rust/Nushell architecture. 
It enables Infrastructure as Code (IaC) with\nmulti-provider support (AWS, UpCloud, local), sophisticated workflow orchestration, and configuration-driven operations.\n\nThe system solves fundamental technical challenges through architectural innovation and hybrid language design.\n\n## High-Level Architecture\n\n### System Diagram\n\n```\n┌─────────────────────────────────────────────────────────────────┐\n│ User Interface Layer │\n├─────────────────┬─────────────────┬─────────────────────────────┤\n│ CLI Tools │ REST API │ Control Center UI │\n│ (Nushell) │ (Rust) │ (Web Interface) │\n└─────────────────┴─────────────────┴─────────────────────────────┘\n │\n┌─────────────────────────────────────────────────────────────────┐\n│ Orchestration Layer │\n├─────────────────────────────────────────────────────────────────┤\n│ Rust Orchestrator: Workflow Coordination & State Management │\n│ • Task Queue & Scheduling • Batch Processing │\n│ • State Persistence • Error Recovery & Rollback │\n│ • REST API Server • Real-time Monitoring │\n└─────────────────────────────────────────────────────────────────┘\n │\n┌─────────────────────────────────────────────────────────────────┐\n│ Business Logic Layer │\n├─────────────────┬─────────────────┬─────────────────────────────┤\n│ Providers │ Task Services │ Workflows │\n│ (Nushell) │ (Nushell) │ (Nushell) │\n│ • AWS │ • Kubernetes │ • Server Creation │\n│ • UpCloud │ • Storage │ • Cluster Deployment │\n│ • Local │ • Networking │ • Batch Operations │\n└─────────────────┴─────────────────┴─────────────────────────────┘\n │\n┌─────────────────────────────────────────────────────────────────┐\n│ Configuration Layer │\n├─────────────────┬─────────────────┬─────────────────────────────┤\n│ Nickel Schemas│ TOML Config │ Templates │\n│ • Type Safety │ • Hierarchy │ • Infrastructure │\n│ • Validation │ • Environment │ • Service Configs │\n│ • Extensible │ • User Prefs │ • Code Generation │\n└─────────────────┴─────────────────┴─────────────────────────────┘\n │\n┌─────────────────────────────────────────────────────────────────┐\n│ Infrastructure Layer │\n├─────────────────┬─────────────────┬─────────────────────────────┤\n│ Cloud APIs │ Kubernetes │ Local Systems │\n│ • AWS EC2 │ • Clusters │ • Docker │\n│ • UpCloud │ • Services │ • Containers │\n│ • Others │ • Storage │ • Host Services │\n└─────────────────┴─────────────────┴─────────────────────────────┘\n```\n\n## Core Components\n\n### 1. 
Hybrid Architecture Foundation\n\n#### Coordination Layer (Rust)\n\n**Purpose**: High-performance workflow orchestration and system coordination\n\n**Components**:\n\n- **Orchestrator Engine**: Task scheduling and execution coordination\n- **REST API Server**: HTTP endpoints for external integration\n- **State Management**: Persistent state tracking with checkpoint recovery\n- **Batch Processor**: Parallel execution of complex multi-provider workflows\n- **File-based Queue**: Lightweight, reliable task persistence\n- **Error Recovery**: Sophisticated rollback and cleanup capabilities\n\n**Key Features**:\n\n- Solves Nushell deep call stack limitations\n- Handles 1000+ concurrent operations\n- Checkpoint-based recovery from any failure point\n- Real-time workflow monitoring and status tracking\n\n#### Business Logic Layer (Nushell)\n\n**Purpose**: Domain-specific operations and configuration management\n\n**Components**:\n\n- **Provider Implementations**: Cloud-specific operations (AWS, UpCloud, local)\n- **Task Service Management**: Infrastructure component lifecycle\n- **Configuration Processing**: Nickel-based configuration validation and templating\n- **CLI Interface**: User-facing command-line tools\n- **Workflow Definitions**: Business process implementations\n\n**Key Features**:\n\n- 65+ domain-specific modules preserved and enhanced\n- Configuration-driven operations with zero hardcoded values\n- Type-safe Nickel integration for Infrastructure as Code\n- Extensible provider and service architecture\n\n### 2. Configuration System (v2.0.0)\n\n#### Hierarchical Configuration Management\n\n**Migration Achievement**: 65+ files migrated, 200+ ENV variables → 476 config accessors\n\n**Configuration Hierarchy** (precedence order):\n\n1. **Runtime Parameters** (command line, environment variables)\n2. **Environment Configuration** (dev/test/prod specific)\n3. **Infrastructure Configuration** (project-specific settings)\n4. **User Configuration** (personal preferences)\n5. **System Defaults** (system-wide defaults)\n\n**Configuration Files**:\n\n- `config.defaults.toml` - System-wide defaults\n- `config.user.toml` - User-specific preferences\n- `config.{dev,test,prod}.toml` - Environment-specific configurations\n- Infrastructure-specific configuration files\n\n**Features**:\n\n- **Variable Interpolation**: `{{paths.base}}`, `{{env.HOME}}`, `{{now.date}}`, `{{git.branch}}`\n- **Environment Switching**: `PROVISIONING_ENV=prod` for environment-specific configs\n- **Validation Framework**: Comprehensive configuration validation and error reporting\n- **Migration Tools**: Automated migration from ENV-based to config-driven architecture\n\n### 3. 
Workflow System (v3.1.0)\n\n#### Batch Workflow Engine\n\n**Batch Capabilities**:\n\n- **Provider-Agnostic Workflows**: Mix UpCloud, AWS, and local providers in single workflow\n- **Dependency Resolution**: Topological sorting with soft/hard dependency support\n- **Parallel Execution**: Configurable parallelism limits with resource management\n- **State Recovery**: Checkpoint-based recovery with rollback capabilities\n- **Real-time Monitoring**: Live progress tracking and health monitoring\n\n**Workflow Types**:\n\n- **Server Workflows**: Multi-provider server provisioning and management\n- **Task Service Workflows**: Infrastructure component installation and configuration\n- **Cluster Workflows**: Complete Kubernetes cluster deployment and management\n- **Batch Workflows**: Complex multi-step operations with dependency management\n\n**Nickel Workflow Definitions**:\n\n```\n{\n batch_workflow = {\n name = "multi_cloud_deployment",\n version = "1.0.0",\n parallel_limit = 5,\n rollback_enabled = true,\n\n operations = [\n {\n id = "servers",\n type = "server_batch",\n provider = "upcloud",\n dependencies = [],\n },\n {\n id = "services",\n type = "taskserv_batch",\n provider = "aws",\n dependencies = ["servers"],\n }\n ]\n }\n}\n```\n\n### 4. Provider Ecosystem\n\n#### Multi-Provider Architecture\n\n**Supported Providers**:\n\n- **AWS**: Amazon Web Services integration\n- **UpCloud**: UpCloud provider with full feature support\n- **Local**: Local development and testing provider\n\n**Provider Features**:\n\n- **Standardized Interfaces**: Consistent API across all providers\n- **Configuration Templates**: Provider-specific configuration generation\n- **Resource Management**: Complete lifecycle management for cloud resources\n- **Cost Optimization**: Pricing information and cost optimization recommendations\n- **Regional Support**: Multi-region deployment capabilities\n\n#### Task Services Ecosystem\n\n**Infrastructure Components** (40+ services):\n\n- **Container Orchestration**: Kubernetes, container runtimes (containerd, cri-o, crun, runc, youki)\n- **Networking**: Cilium, CoreDNS, HAProxy, service mesh integration\n- **Storage**: Rook-Ceph, external-NFS, Mayastor, persistent volumes\n- **Security**: Policy engines, secrets management, RBAC\n- **Observability**: Monitoring, logging, tracing, metrics collection\n- **Development Tools**: Gitea, databases, build systems\n\n**Service Features**:\n\n- **Version Management**: Real-time version checking against GitHub releases\n- **Configuration Generation**: Automated service configuration from templates\n- **Dependency Management**: Automatic dependency resolution and installation order\n- **Health Monitoring**: Service health checks and status reporting\n\n## Key Architectural Decisions\n\n### 1. Hybrid Language Architecture (ADR-004)\n\n**Decision**: Use Rust for coordination, Nushell for business logic\n**Rationale**: Solves Nushell's deep call stack limitations while preserving domain expertise\n**Impact**: Eliminates technical limitations while maintaining productivity and configuration advantages\n\n### 2. Configuration-Driven Architecture (ADR-002)\n\n**Decision**: Complete migration from ENV variables to hierarchical configuration\n**Rationale**: True Infrastructure as Code requires configuration flexibility without hardcoded fallbacks\n**Impact**: 476 configuration accessors provide complete customization without code changes\n\n### 3. 
Domain-Driven Structure (ADR-001)\n\n**Decision**: Organize by functional domains (core, platform, provisioning)\n**Rationale**: Clear boundaries enable scalable development and maintenance\n**Impact**: Enables specialized development while maintaining system coherence\n\n### 4. Workspace Isolation (ADR-003)\n\n**Decision**: Isolated user workspaces with hierarchical configuration\n**Rationale**: Multi-user support and customization without system impact\n**Impact**: Complete user independence with easy backup and migration\n\n### 5. Registry-Based Extensions (ADR-005)\n\n**Decision**: Manifest-driven extension framework with structured discovery\n**Rationale**: Enable community contributions while maintaining system stability\n**Impact**: Extensible system supporting custom providers, services, and workflows\n\n## Data Flow Architecture\n\n### Configuration Resolution Flow\n\n```\n1. Workspace Discovery → 2. Configuration Loading → 3. Hierarchy Merge →\n4. Variable Interpolation → 5. Schema Validation → 6. Runtime Application\n```\n\n### Workflow Execution Flow\n\n```\n1. Workflow Submission → 2. Dependency Analysis → 3. Task Scheduling →\n4. Parallel Execution → 5. State Tracking → 6. Result Aggregation →\n7. Error Handling → 8. Cleanup/Rollback\n```\n\n### Provider Integration Flow\n\n```\n1. Provider Discovery → 2. Configuration Validation → 3. Authentication →\n4. Resource Planning → 5. Operation Execution → 6. State Persistence →\n7. Result Reporting\n```\n\n## Technology Stack\n\n### Core Technologies\n\n- **Nushell 0.107.1**: Primary shell and scripting language\n- **Rust**: High-performance coordination and orchestration\n- **Nickel 1.15.0+**: Configuration language for Infrastructure as Code\n- **TOML**: Configuration file format with human readability\n- **JSON**: Data exchange format between components\n\n### Infrastructure Technologies\n\n- **Kubernetes**: Container orchestration platform\n- **Docker/Containerd**: Container runtime environments\n- **SOPS 3.10.2**: Secrets management and encryption\n- **Age 1.2.1**: Encryption tool for secrets\n- **HTTP/REST**: API communication protocols\n\n### Development Technologies\n\n- **nu_plugin_tera**: Native Nushell template rendering\n- **K9s 0.50.6**: Kubernetes management interface\n- **Git**: Version control and configuration management\n\n## Scalability and Performance\n\n### Performance Characteristics\n\n- **Batch Processing**: 1000+ concurrent operations with configurable parallelism\n- **Provider Operations**: Sub-second response for most cloud API operations\n- **Configuration Loading**: Millisecond-level configuration resolution\n- **State Persistence**: File-based persistence with minimal overhead\n- **Memory Usage**: Efficient memory management with streaming operations\n\n### Scalability Features\n\n- **Horizontal Scaling**: Multiple orchestrator instances for high availability\n- **Resource Management**: Configurable resource limits and quotas\n- **Caching Strategy**: Multi-level caching for performance optimization\n- **Streaming Operations**: Large dataset processing without memory limits\n- **Async Processing**: Non-blocking operations for improved throughput\n\n## Security Architecture\n\n### Security Layers\n\n- **Workspace Isolation**: User data isolated from system installation\n- **Configuration Security**: Encrypted secrets with SOPS/Age integration\n- **Extension Sandboxing**: Extensions run in controlled environments\n- **API Authentication**: Secure REST API endpoints with authentication\n- **Audit 
Logging**: Comprehensive audit trails for all operations\n\n### Security Features\n\n- **Secrets Management**: Encrypted configuration files with rotation support\n- **Permission Model**: Role-based access control for operations\n- **Code Signing**: Digital signature verification for extensions\n- **Network Security**: Secure communication with cloud providers\n- **Input Validation**: Comprehensive input validation and sanitization\n\n## Quality Attributes\n\n### Reliability\n\n- **Error Recovery**: Sophisticated error handling and rollback capabilities\n- **State Consistency**: Transactional operations with rollback support\n- **Health Monitoring**: Comprehensive system health checks and monitoring\n- **Fault Tolerance**: Graceful degradation and recovery from failures\n\n### Maintainability\n\n- **Clear Architecture**: Well-defined boundaries and responsibilities\n- **Documentation**: Comprehensive architecture and development documentation\n- **Testing Strategy**: Multi-layer testing with integration validation\n- **Code Quality**: Consistent patterns and quality standards\n\n### Extensibility\n\n- **Plugin Framework**: Registry-based extension system\n- **Provider API**: Standardized interfaces for new providers\n- **Configuration Schema**: Extensible configuration with validation\n- **Workflow Engine**: Custom workflow definitions and execution\n\nThis system architecture represents a mature, production-ready platform for Infrastructure as Code with unique architectural innovations and proven\nscalability. +# System Overview + +## Executive Summary + +Provisioning is an **Infrastructure Automation Platform** built with a hybrid Rust/Nushell architecture. It enables Infrastructure as Code (IaC) with +multi-provider support (AWS, UpCloud, local), sophisticated workflow orchestration, and configuration-driven operations. + +The system solves fundamental technical challenges through architectural innovation and hybrid language design. 
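+
+As a concrete (and deliberately minimal) sketch of that hybrid split, business logic in Nushell can hand a workflow to the Rust coordination layer over its REST API. The endpoint path, port, and payload shape here are illustrative assumptions, not the documented orchestrator API:
+
+```text
+# Nushell (business logic) delegating execution to the Rust orchestrator (coordination)
+# NOTE: URL, port, and payload shape are assumptions for illustration only
+def submit-workflow [workflow_file: string] {
+    let workflow = (open $workflow_file)  # assumes a structured file Nushell can parse (TOML/JSON)
+    http post --content-type application/json http://localhost:8080/workflows ($workflow | to json)
+}
+```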
+ +## High-Level Architecture + +### System Diagram + +```text +┌─────────────────────────────────────────────────────────────────┐ +│ User Interface Layer │ +├─────────────────┬─────────────────┬─────────────────────────────┤ +│ CLI Tools │ REST API │ Control Center UI │ +│ (Nushell) │ (Rust) │ (Web Interface) │ +└─────────────────┴─────────────────┴─────────────────────────────┘ + │ +┌─────────────────────────────────────────────────────────────────┐ +│ Orchestration Layer │ +├─────────────────────────────────────────────────────────────────┤ +│ Rust Orchestrator: Workflow Coordination & State Management │ +│ • Task Queue & Scheduling • Batch Processing │ +│ • State Persistence • Error Recovery & Rollback │ +│ • REST API Server • Real-time Monitoring │ +└─────────────────────────────────────────────────────────────────┘ + │ +┌─────────────────────────────────────────────────────────────────┐ +│ Business Logic Layer │ +├─────────────────┬─────────────────┬─────────────────────────────┤ +│ Providers │ Task Services │ Workflows │ +│ (Nushell) │ (Nushell) │ (Nushell) │ +│ • AWS │ • Kubernetes │ • Server Creation │ +│ • UpCloud │ • Storage │ • Cluster Deployment │ +│ • Local │ • Networking │ • Batch Operations │ +└─────────────────┴─────────────────┴─────────────────────────────┘ + │ +┌─────────────────────────────────────────────────────────────────┐ +│ Configuration Layer │ +├─────────────────┬─────────────────┬─────────────────────────────┤ +│ Nickel Schemas│ TOML Config │ Templates │ +│ • Type Safety │ • Hierarchy │ • Infrastructure │ +│ • Validation │ • Environment │ • Service Configs │ +│ • Extensible │ • User Prefs │ • Code Generation │ +└─────────────────┴─────────────────┴─────────────────────────────┘ + │ +┌─────────────────────────────────────────────────────────────────┐ +│ Infrastructure Layer │ +├─────────────────┬─────────────────┬─────────────────────────────┤ +│ Cloud APIs │ Kubernetes │ Local Systems │ +│ • AWS EC2 │ • Clusters │ • Docker │ +│ • UpCloud │ • Services │ • Containers │ +│ • Others │ • Storage │ • Host Services │ +└─────────────────┴─────────────────┴─────────────────────────────┘ +``` + +## Core Components + +### 1. 
Hybrid Architecture Foundation + +#### Coordination Layer (Rust) + +**Purpose**: High-performance workflow orchestration and system coordination + +**Components**: + +- **Orchestrator Engine**: Task scheduling and execution coordination +- **REST API Server**: HTTP endpoints for external integration +- **State Management**: Persistent state tracking with checkpoint recovery +- **Batch Processor**: Parallel execution of complex multi-provider workflows +- **File-based Queue**: Lightweight, reliable task persistence +- **Error Recovery**: Sophisticated rollback and cleanup capabilities + +**Key Features**: + +- Solves Nushell deep call stack limitations +- Handles 1000+ concurrent operations +- Checkpoint-based recovery from any failure point +- Real-time workflow monitoring and status tracking + +#### Business Logic Layer (Nushell) + +**Purpose**: Domain-specific operations and configuration management + +**Components**: + +- **Provider Implementations**: Cloud-specific operations (AWS, UpCloud, local) +- **Task Service Management**: Infrastructure component lifecycle +- **Configuration Processing**: Nickel-based configuration validation and templating +- **CLI Interface**: User-facing command-line tools +- **Workflow Definitions**: Business process implementations + +**Key Features**: + +- 65+ domain-specific modules preserved and enhanced +- Configuration-driven operations with zero hardcoded values +- Type-safe Nickel integration for Infrastructure as Code +- Extensible provider and service architecture + +### 2. Configuration System (v2.0.0) + +#### Hierarchical Configuration Management + +**Migration Achievement**: 65+ files migrated, 200+ ENV variables → 476 config accessors + +**Configuration Hierarchy** (precedence order): + +1. **Runtime Parameters** (command line, environment variables) +2. **Environment Configuration** (dev/test/prod specific) +3. **Infrastructure Configuration** (project-specific settings) +4. **User Configuration** (personal preferences) +5. **System Defaults** (system-wide defaults) + +**Configuration Files**: + +- `config.defaults.toml` - System-wide defaults +- `config.user.toml` - User-specific preferences +- `config.{dev,test,prod}.toml` - Environment-specific configurations +- Infrastructure-specific configuration files + +**Features**: + +- **Variable Interpolation**: `{{paths.base}}`, `{{env.HOME}}`, `{{now.date}}`, `{{git.branch}}` +- **Environment Switching**: `PROVISIONING_ENV=prod` for environment-specific configs +- **Validation Framework**: Comprehensive configuration validation and error reporting +- **Migration Tools**: Automated migration from ENV-based to config-driven architecture + +### 3. 
Workflow System (v3.1.0) + +#### Batch Workflow Engine + +**Batch Capabilities**: + +- **Provider-Agnostic Workflows**: Mix UpCloud, AWS, and local providers in single workflow +- **Dependency Resolution**: Topological sorting with soft/hard dependency support +- **Parallel Execution**: Configurable parallelism limits with resource management +- **State Recovery**: Checkpoint-based recovery with rollback capabilities +- **Real-time Monitoring**: Live progress tracking and health monitoring + +**Workflow Types**: + +- **Server Workflows**: Multi-provider server provisioning and management +- **Task Service Workflows**: Infrastructure component installation and configuration +- **Cluster Workflows**: Complete Kubernetes cluster deployment and management +- **Batch Workflows**: Complex multi-step operations with dependency management + +**Nickel Workflow Definitions**: + +```text +{ + batch_workflow = { + name = "multi_cloud_deployment", + version = "1.0.0", + parallel_limit = 5, + rollback_enabled = true, + + operations = [ + { + id = "servers", + type = "server_batch", + provider = "upcloud", + dependencies = [], + }, + { + id = "services", + type = "taskserv_batch", + provider = "aws", + dependencies = ["servers"], + } + ] + } +} +``` + +### 4. Provider Ecosystem + +#### Multi-Provider Architecture + +**Supported Providers**: + +- **AWS**: Amazon Web Services integration +- **UpCloud**: UpCloud provider with full feature support +- **Local**: Local development and testing provider + +**Provider Features**: + +- **Standardized Interfaces**: Consistent API across all providers +- **Configuration Templates**: Provider-specific configuration generation +- **Resource Management**: Complete lifecycle management for cloud resources +- **Cost Optimization**: Pricing information and cost optimization recommendations +- **Regional Support**: Multi-region deployment capabilities + +#### Task Services Ecosystem + +**Infrastructure Components** (40+ services): + +- **Container Orchestration**: Kubernetes, container runtimes (containerd, cri-o, crun, runc, youki) +- **Networking**: Cilium, CoreDNS, HAProxy, service mesh integration +- **Storage**: Rook-Ceph, external-NFS, Mayastor, persistent volumes +- **Security**: Policy engines, secrets management, RBAC +- **Observability**: Monitoring, logging, tracing, metrics collection +- **Development Tools**: Gitea, databases, build systems + +**Service Features**: + +- **Version Management**: Real-time version checking against GitHub releases +- **Configuration Generation**: Automated service configuration from templates +- **Dependency Management**: Automatic dependency resolution and installation order +- **Health Monitoring**: Service health checks and status reporting + +## Key Architectural Decisions + +### 1. Hybrid Language Architecture (ADR-004) + +**Decision**: Use Rust for coordination, Nushell for business logic +**Rationale**: Solves Nushell's deep call stack limitations while preserving domain expertise +**Impact**: Eliminates technical limitations while maintaining productivity and configuration advantages + +### 2. Configuration-Driven Architecture (ADR-002) + +**Decision**: Complete migration from ENV variables to hierarchical configuration +**Rationale**: True Infrastructure as Code requires configuration flexibility without hardcoded fallbacks +**Impact**: 476 configuration accessors provide complete customization without code changes + +### 3. 
Domain-Driven Structure (ADR-001) + +**Decision**: Organize by functional domains (core, platform, provisioning) +**Rationale**: Clear boundaries enable scalable development and maintenance +**Impact**: Enables specialized development while maintaining system coherence + +### 4. Workspace Isolation (ADR-003) + +**Decision**: Isolated user workspaces with hierarchical configuration +**Rationale**: Multi-user support and customization without system impact +**Impact**: Complete user independence with easy backup and migration + +### 5. Registry-Based Extensions (ADR-005) + +**Decision**: Manifest-driven extension framework with structured discovery +**Rationale**: Enable community contributions while maintaining system stability +**Impact**: Extensible system supporting custom providers, services, and workflows + +## Data Flow Architecture + +### Configuration Resolution Flow + +```text +1. Workspace Discovery → 2. Configuration Loading → 3. Hierarchy Merge → +4. Variable Interpolation → 5. Schema Validation → 6. Runtime Application +``` + +### Workflow Execution Flow + +```text +1. Workflow Submission → 2. Dependency Analysis → 3. Task Scheduling → +4. Parallel Execution → 5. State Tracking → 6. Result Aggregation → +7. Error Handling → 8. Cleanup/Rollback +``` + +### Provider Integration Flow + +```text +1. Provider Discovery → 2. Configuration Validation → 3. Authentication → +4. Resource Planning → 5. Operation Execution → 6. State Persistence → +7. Result Reporting +``` + +## Technology Stack + +### Core Technologies + +- **Nushell 0.107.1**: Primary shell and scripting language +- **Rust**: High-performance coordination and orchestration +- **Nickel 1.15.0+**: Configuration language for Infrastructure as Code +- **TOML**: Configuration file format with human readability +- **JSON**: Data exchange format between components + +### Infrastructure Technologies + +- **Kubernetes**: Container orchestration platform +- **Docker/Containerd**: Container runtime environments +- **SOPS 3.10.2**: Secrets management and encryption +- **Age 1.2.1**: Encryption tool for secrets +- **HTTP/REST**: API communication protocols + +### Development Technologies + +- **nu_plugin_tera**: Native Nushell template rendering +- **K9s 0.50.6**: Kubernetes management interface +- **Git**: Version control and configuration management + +## Scalability and Performance + +### Performance Characteristics + +- **Batch Processing**: 1000+ concurrent operations with configurable parallelism +- **Provider Operations**: Sub-second response for most cloud API operations +- **Configuration Loading**: Millisecond-level configuration resolution +- **State Persistence**: File-based persistence with minimal overhead +- **Memory Usage**: Efficient memory management with streaming operations + +### Scalability Features + +- **Horizontal Scaling**: Multiple orchestrator instances for high availability +- **Resource Management**: Configurable resource limits and quotas +- **Caching Strategy**: Multi-level caching for performance optimization +- **Streaming Operations**: Large dataset processing without memory limits +- **Async Processing**: Non-blocking operations for improved throughput + +## Security Architecture + +### Security Layers + +- **Workspace Isolation**: User data isolated from system installation +- **Configuration Security**: Encrypted secrets with SOPS/Age integration +- **Extension Sandboxing**: Extensions run in controlled environments +- **API Authentication**: Secure REST API endpoints with authentication +- 
**Audit Logging**: Comprehensive audit trails for all operations + +### Security Features + +- **Secrets Management**: Encrypted configuration files with rotation support +- **Permission Model**: Role-based access control for operations +- **Code Signing**: Digital signature verification for extensions +- **Network Security**: Secure communication with cloud providers +- **Input Validation**: Comprehensive input validation and sanitization + +## Quality Attributes + +### Reliability + +- **Error Recovery**: Sophisticated error handling and rollback capabilities +- **State Consistency**: Transactional operations with rollback support +- **Health Monitoring**: Comprehensive system health checks and monitoring +- **Fault Tolerance**: Graceful degradation and recovery from failures + +### Maintainability + +- **Clear Architecture**: Well-defined boundaries and responsibilities +- **Documentation**: Comprehensive architecture and development documentation +- **Testing Strategy**: Multi-layer testing with integration validation +- **Code Quality**: Consistent patterns and quality standards + +### Extensibility + +- **Plugin Framework**: Registry-based extension system +- **Provider API**: Standardized interfaces for new providers +- **Configuration Schema**: Extensible configuration with validation +- **Workflow Engine**: Custom workflow definitions and execution + +This system architecture represents a mature, production-ready platform for Infrastructure as Code with unique architectural innovations and proven +scalability. \ No newline at end of file diff --git a/docs/src/architecture/typedialog-nickel-integration.md b/docs/src/architecture/typedialog-nickel-integration.md index c33b758..cf47a64 100644 --- a/docs/src/architecture/typedialog-nickel-integration.md +++ b/docs/src/architecture/typedialog-nickel-integration.md @@ -1 +1,952 @@ -# TypeDialog + Nickel Integration Guide\n\n**Status**: Implementation Guide\n**Last Updated**: 2025-12-15\n**Project**: TypeDialog at `/Users/Akasha/Development/typedialog`\n**Purpose**: Type-safe UI generation from Nickel schemas\n\n---\n\n## What is TypeDialog\n\nTypeDialog generates **type-safe interactive forms** from configuration schemas with **bidirectional Nickel integration**.\n\n```\nNickel Schema\n ↓\nTypeDialog Form (Auto-generated)\n ↓\nUser fills form interactively\n ↓\nNickel output config (Type-safe)\n```\n\n---\n\n## Architecture\n\n### Three Layers\n\n```\nCLI/TUI/Web Layer\n ↓\nTypeDialog Form Engine\n ↓\nNickel Integration\n ↓\nSchema Contracts\n```\n\n### Data Flow\n\n```\nInput (Nickel)\n ↓\nForm Definition (TOML)\n ↓\nForm Rendering (CLI/TUI/Web)\n ↓\nUser Input\n ↓\nValidation (against Nickel contracts)\n ↓\nOutput (JSON/YAML/TOML/Nickel)\n```\n\n---\n\n## Setup\n\n### Installation\n\n```\n# Clone TypeDialog\ngit clone https://github.com/jesusperezlorenzo/typedialog.git\ncd typedialog\n\n# Build\ncargo build --release\n\n# Install (optional)\ncargo install --path ./crates/typedialog\n```\n\n### Verify Installation\n\n```\ntypedialog --version\ntypedialog --help\n```\n\n---\n\n## Basic Workflow\n\n### Step 1: Define Nickel Schema\n\n```\n# server_config.ncl\nlet contracts = import "./contracts.ncl" in\nlet defaults = import "./defaults.ncl" in\n\n{\n defaults = defaults,\n\n make_server | not_exported = fun overrides =>\n defaults.server & overrides,\n\n DefaultServer = defaults.server,\n}\n```\n\n### Step 2: Define TypeDialog Form (TOML)\n\n```\n# server_form.toml\n[form]\ntitle = "Server Configuration"\ndescription = "Create a new 
server configuration"\n\n[[fields]]\nname = "server_name"\nlabel = "Server Name"\ntype = "text"\nrequired = true\nhelp = "Unique identifier for the server"\nplaceholder = "web-01"\n\n[[fields]]\nname = "cpu_cores"\nlabel = "CPU Cores"\ntype = "number"\nrequired = true\ndefault = 4\nhelp = "Number of CPU cores (1-32)"\n\n[[fields]]\nname = "memory_gb"\nlabel = "Memory (GB)"\ntype = "number"\nrequired = true\ndefault = 8\nhelp = "Memory in GB (1-256)"\n\n[[fields]]\nname = "zone"\nlabel = "Availability Zone"\ntype = "select"\nrequired = true\noptions = ["us-nyc1", "eu-fra1", "ap-syd1"]\ndefault = "us-nyc1"\n\n[[fields]]\nname = "monitoring"\nlabel = "Enable Monitoring"\ntype = "confirm"\ndefault = true\n\n[[fields]]\nname = "tags"\nlabel = "Tags"\ntype = "multiselect"\noptions = ["production", "staging", "testing", "development"]\nhelp = "Select applicable tags"\n```\n\n### Step 3: Render Form (CLI)\n\n```\ntypedialog form --config server_form.toml --backend cli\n```\n\n**Output**:\n\n```\nServer Configuration\nCreate a new server configuration\n\n? Server Name: web-01\n? CPU Cores: 4\n? Memory (GB): 8\n? Availability Zone: (us-nyc1/eu-fra1/ap-syd1) us-nyc1\n? Enable Monitoring: (y/n) y\n? Tags: (Select multiple with space)\n ◉ production\n ◯ staging\n ◯ testing\n ◯ development\n```\n\n### Step 4: Validate Against Nickel Schema\n\n```\n# Validation happens automatically\n# If input matches Nickel contract, proceeds to output\n```\n\n### Step 5: Output to Nickel\n\n```\ntypedialog form \\n --config server_form.toml \\n --output nickel \\n --backend cli\n```\n\n**Output file** (`server_config_output.ncl`):\n\n```\n{\n server_name = "web-01",\n cpu_cores = 4,\n memory_gb = 8,\n zone = "us-nyc1",\n monitoring = true,\n tags = ["production"],\n}\n```\n\n---\n\n## Real-World Example 1: Infrastructure Wizard\n\n### Scenario\n\nYou want an interactive CLI wizard for infrastructure provisioning.\n\n### Step 1: Define Nickel Schema for Infrastructure\n\n```\n# infrastructure_schema.ncl\n{\n InfrastructureConfig = {\n workspace_name | String,\n deployment_mode | [| 'solo, 'multiuser, 'cicd, 'enterprise |],\n provider | [| 'upcloud, 'aws, 'hetzner |],\n taskservs | Array,\n enable_monitoring | Bool,\n enable_backup | Bool,\n backup_retention_days | Number,\n },\n\n defaults = {\n workspace_name = "",\n deployment_mode = 'solo,\n provider = 'upcloud,\n taskservs = [],\n enable_monitoring = true,\n enable_backup = true,\n backup_retention_days = 7,\n },\n\n DefaultInfra = defaults,\n}\n```\n\n### Step 2: Create Comprehensive Form\n\n```\n# infrastructure_wizard.toml\n[form]\ntitle = "Infrastructure Provisioning Wizard"\ndescription = "Create a complete infrastructure setup"\n\n[[fields]]\nname = "workspace_name"\nlabel = "Workspace Name"\ntype = "text"\nrequired = true\nvalidation_pattern = "^[a-z0-9-]{3,32}$"\nhelp = "3-32 chars, lowercase alphanumeric and hyphens only"\nplaceholder = "my-workspace"\n\n[[fields]]\nname = "deployment_mode"\nlabel = "Deployment Mode"\ntype = "select"\nrequired = true\noptions = [\n { value = "solo", label = "Solo (Single user, 2 CPU, 4 GB RAM)" },\n { value = "multiuser", label = "MultiUser (Team, 4 CPU, 8 GB RAM)" },\n { value = "cicd", label = "CI/CD (Pipelines, 8 CPU, 16 GB RAM)" },\n { value = "enterprise", label = "Enterprise (Production, 16 CPU, 32 GB RAM)" },\n]\ndefault = "solo"\n\n[[fields]]\nname = "provider"\nlabel = "Cloud Provider"\ntype = "select"\nrequired = true\noptions = [\n { value = "upcloud", label = "UpCloud (EU)" },\n { value = "aws", label = "AWS 
(Global)" },\n { value = "hetzner", label = "Hetzner (EU)" },\n]\ndefault = "upcloud"\n\n[[fields]]\nname = "taskservs"\nlabel = "Task Services"\ntype = "multiselect"\nrequired = false\noptions = [\n { value = "kubernetes", label = "Kubernetes (Container orchestration)" },\n { value = "cilium", label = "Cilium (Network policy)" },\n { value = "postgres", label = "PostgreSQL (Database)" },\n { value = "redis", label = "Redis (Cache)" },\n { value = "prometheus", label = "Prometheus (Monitoring)" },\n { value = "etcd", label = "etcd (Distributed config)" },\n]\nhelp = "Select task services to deploy"\n\n[[fields]]\nname = "enable_monitoring"\nlabel = "Enable Monitoring"\ntype = "confirm"\ndefault = true\nhelp = "Prometheus + Grafana dashboards"\n\n[[fields]]\nname = "enable_backup"\nlabel = "Enable Backup"\ntype = "confirm"\ndefault = true\n\n[[fields]]\nname = "backup_retention_days"\nlabel = "Backup Retention (days)"\ntype = "number"\nrequired = false\ndefault = 7\nhelp = "How long to keep backups (if enabled)"\nvisible_if = "enable_backup == true"\n\n[[fields]]\nname = "email"\nlabel = "Admin Email"\ntype = "text"\nrequired = true\nvalidation_pattern = "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"\nhelp = "For alerts and notifications"\nplaceholder = "admin@company.com"\n```\n\n### Step 3: Run Interactive Wizard\n\n```\ntypedialog form \\n --config infrastructure_wizard.toml \\n --backend tui \\n --output nickel\n```\n\n**Output** (`infrastructure_config.ncl`):\n\n```\n{\n workspace_name = "production-eu",\n deployment_mode = 'enterprise,\n provider = 'upcloud,\n taskservs = ["kubernetes", "cilium", "postgres", "redis", "prometheus"],\n enable_monitoring = true,\n enable_backup = true,\n backup_retention_days = 30,\n email = "ops@company.com",\n}\n```\n\n### Step 4: Use Output in Infrastructure\n\n```\n# main_infrastructure.ncl\nlet config = import "./infrastructure_config.ncl" in\nlet schemas = import "../../provisioning/schemas/main.ncl" in\n\n{\n # Build infrastructure based on config\n infrastructure = if config.deployment_mode == 'solo then\n {\n servers = [\n schemas.lib.make_server {\n name = config.workspace_name,\n cpu_cores = 2,\n memory_gb = 4,\n },\n ],\n taskservs = config.taskservs,\n }\n else if config.deployment_mode == 'enterprise then\n {\n servers = [\n schemas.lib.make_server { name = "app-01", cpu_cores = 16, memory_gb = 32 },\n schemas.lib.make_server { name = "app-02", cpu_cores = 16, memory_gb = 32 },\n schemas.lib.make_server { name = "db-01", cpu_cores = 16, memory_gb = 32 },\n ],\n taskservs = config.taskservs,\n monitoring = { enabled = config.enable_monitoring, email = config.email },\n }\n else\n # default fallback\n {},\n}\n```\n\n---\n\n## Real-World Example 2: Server Configuration Form\n\n### Form Definition (Advanced)\n\n```\n# server_advanced_form.toml\n[form]\ntitle = "Server Configuration"\ndescription = "Configure server settings with validation"\n\n# Section 1: Basic Info\n[[sections]]\nname = "basic"\ntitle = "Basic Information"\n\n[[fields]]\nname = "server_name"\nsection = "basic"\nlabel = "Server Name"\ntype = "text"\nrequired = true\nvalidation_pattern = "^[a-z0-9-]{3,32}$"\n\n[[fields]]\nname = "description"\nsection = "basic"\nlabel = "Description"\ntype = "textarea"\nrequired = false\nplaceholder = "Server purpose and details"\n\n# Section 2: Resources\n[[sections]]\nname = "resources"\ntitle = "Resources"\n\n[[fields]]\nname = "cpu_cores"\nsection = "resources"\nlabel = "CPU Cores"\ntype = "number"\nrequired = true\ndefault = 4\nmin = 
1\nmax = 32\n\n[[fields]]\nname = "memory_gb"\nsection = "resources"\nlabel = "Memory (GB)"\ntype = "number"\nrequired = true\ndefault = 8\nmin = 1\nmax = 256\n\n[[fields]]\nname = "disk_gb"\nsection = "resources"\nlabel = "Disk (GB)"\ntype = "number"\nrequired = true\ndefault = 100\nmin = 10\nmax = 2000\n\n# Section 3: Network\n[[sections]]\nname = "network"\ntitle = "Network Configuration"\n\n[[fields]]\nname = "zone"\nsection = "network"\nlabel = "Availability Zone"\ntype = "select"\nrequired = true\noptions = ["us-nyc1", "eu-fra1", "ap-syd1"]\n\n[[fields]]\nname = "enable_ipv6"\nsection = "network"\nlabel = "Enable IPv6"\ntype = "confirm"\ndefault = false\n\n[[fields]]\nname = "allowed_ports"\nsection = "network"\nlabel = "Allowed Ports"\ntype = "multiselect"\noptions = [\n { value = "22", label = "SSH (22)" },\n { value = "80", label = "HTTP (80)" },\n { value = "443", label = "HTTPS (443)" },\n { value = "3306", label = "MySQL (3306)" },\n { value = "5432", label = "PostgreSQL (5432)" },\n]\n\n# Section 4: Advanced\n[[sections]]\nname = "advanced"\ntitle = "Advanced Options"\n\n[[fields]]\nname = "kernel_version"\nsection = "advanced"\nlabel = "Kernel Version"\ntype = "text"\nrequired = false\nplaceholder = "5.15.0 (or leave blank for latest)"\n\n[[fields]]\nname = "enable_monitoring"\nsection = "advanced"\nlabel = "Enable Monitoring"\ntype = "confirm"\ndefault = true\n\n[[fields]]\nname = "monitoring_interval"\nsection = "advanced"\nlabel = "Monitoring Interval (seconds)"\ntype = "number"\nrequired = false\ndefault = 60\nvisible_if = "enable_monitoring == true"\n\n[[fields]]\nname = "tags"\nsection = "advanced"\nlabel = "Tags"\ntype = "multiselect"\noptions = ["production", "staging", "testing", "development"]\n```\n\n### Output Structure\n\n```\n{\n # Basic\n server_name = "web-prod-01",\n description = "Primary web server",\n\n # Resources\n cpu_cores = 16,\n memory_gb = 32,\n disk_gb = 500,\n\n # Network\n zone = "eu-fra1",\n enable_ipv6 = true,\n allowed_ports = ["22", "80", "443"],\n\n # Advanced\n kernel_version = "5.15.0",\n enable_monitoring = true,\n monitoring_interval = 30,\n tags = ["production"],\n}\n```\n\n---\n\n## API Integration\n\n### TypeDialog REST Endpoints\n\n```\n# Start TypeDialog server\ntypedialog server --port 8080\n\n# Render form via HTTP\ncurl -X POST http://localhost:8080/forms \\n -H "Content-Type: application/json" \\n -d @server_form.toml\n```\n\n### Response Format\n\n```\n{\n "form_id": "srv_abc123",\n "status": "rendered",\n "fields": [\n {\n "name": "server_name",\n "label": "Server Name",\n "type": "text",\n "required": true,\n "placeholder": "web-01"\n }\n ]\n}\n```\n\n### Submit Form\n\n```\ncurl -X POST http://localhost:8080/forms/srv_abc123/submit \\n -H "Content-Type: application/json" \\n -d '{\n "server_name": "web-01",\n "cpu_cores": 4,\n "memory_gb": 8,\n "zone": "us-nyc1",\n "monitoring": true,\n "tags": ["production"]\n }'\n```\n\n### Response\n\n```\n{\n "status": "success",\n "validation": "passed",\n "output_format": "nickel",\n "output": {\n "server_name": "web-01",\n "cpu_cores": 4,\n "memory_gb": 8,\n "zone": "us-nyc1",\n "monitoring": true,\n "tags": ["production"]\n }\n}\n```\n\n---\n\n## Validation\n\n### Contract-Based Validation\n\nTypeDialog validates user input against Nickel contracts:\n\n```\n# Nickel contract\nServerConfig = {\n cpu_cores | Number, # Must be number\n memory_gb | Number, # Must be number\n zone | [| 'us-nyc1, 'eu-fra1 |], # Enum\n}\n\n# If user enters invalid value\n# TypeDialog rejects before 
serializing\n```\n\n### Validation Rules in Form\n\n```\n[[fields]]\nname = "cpu_cores"\ntype = "number"\nmin = 1\nmax = 32\nhelp = "Must be 1-32 cores"\n# TypeDialog enforces before user can submit\n```\n\n---\n\n## Integration with Provisioning Platform\n\n### Use Case: Infrastructure Initialization\n\n```\n# 1. User runs initialization\nprovisioning init --wizard\n\n# 2. Behind the scenes:\n# - Loads infrastructure_wizard.toml\n# - Starts TypeDialog (CLI or TUI)\n# - User fills form interactively\n\n# 3. Output saved as config\n# ~/.config/provisioning/infrastructure_config.ncl\n\n# 4. Provisioning uses output\n# provisioning server create --from-config infrastructure_config.ncl\n```\n\n### Implementation in Nushell\n\n```\n# provisioning/core/nulib/provisioning_init.nu\n\ndef provisioning_init_wizard [] {\n # Launch TypeDialog form\n let config = (\n typedialog form \\n --config "provisioning/config/infrastructure_wizard.toml" \\n --backend tui \\n --output nickel\n )\n\n # Save output\n $config | save ~/.config/provisioning/workspace_config.ncl\n\n # Validate with provisioning schemas\n let provisioning = (import "provisioning/schemas/main.ncl")\n let validated = (\n nickel export ~/.config/provisioning/workspace_config.ncl\n | jq . | to json\n )\n\n print "Infrastructure configuration created!"\n print "Use: provisioning deploy --from-config"\n}\n```\n\n---\n\n## Advanced Features\n\n### Conditional Visibility\n\nShow/hide fields based on user selections:\n\n```\n[[fields]]\nname = "backup_retention"\nlabel = "Backup Retention (days)"\ntype = "number"\nvisible_if = "enable_backup == true" # Only shown if backup enabled\n```\n\n### Dynamic Defaults\n\nSet defaults based on other fields:\n\n```\n[[fields]]\nname = "deployment_mode"\ntype = "select"\noptions = ["solo", "enterprise"]\n\n[[fields]]\nname = "cpu_cores"\ntype = "number"\ndefault_from = "deployment_mode" # Can reference other fields\n# solo → default 2, enterprise → default 16\n```\n\n### Custom Validation\n\n```\n[[fields]]\nname = "memory_gb"\ntype = "number"\nvalidation_rule = "memory_gb >= cpu_cores * 2"\nhelp = "Memory must be at least 2 GB per CPU core"\n```\n\n---\n\n## Output Formats\n\nTypeDialog can output to multiple formats:\n\n```\n# Output to Nickel (recommended for IaC)\ntypedialog form --config form.toml --output nickel\n\n# Output to JSON (for APIs)\ntypedialog form --config form.toml --output json\n\n# Output to YAML (for K8s)\ntypedialog form --config form.toml --output yaml\n\n# Output to TOML (for application config)\ntypedialog form --config form.toml --output toml\n```\n\n---\n\n## Backends\n\nTypeDialog supports three rendering backends:\n\n### 1. CLI (Command-line prompts)\n\n```\ntypedialog form --config form.toml --backend cli\n```\n\n**Pros**: Lightweight, SSH-friendly, no dependencies\n**Cons**: Basic UI\n\n### 2. TUI (Terminal User Interface - Ratatui)\n\n```\ntypedialog form --config form.toml --backend tui\n```\n\n**Pros**: Rich UI, keyboard navigation, sections\n**Cons**: Requires terminal support\n\n### 3. 
Web (HTTP Server - Axum)\n\n```\ntypedialog form --config form.toml --backend web --port 3000\n# Opens http://localhost:3000\n```\n\n**Pros**: Beautiful UI, remote access, multi-user\n**Cons**: Requires browser, network\n\n---\n\n## Troubleshooting\n\n### Problem: Form doesn't match Nickel contract\n\n**Cause**: Field names or types don't match contract\n\n**Solution**: Verify field definitions match Nickel schema:\n\n```\n# Form field\n[[fields]]\nname = "cpu_cores" # Must match Nickel field name\ntype = "number" # Must match Nickel type\n```\n\n### Problem: Validation fails\n\n**Cause**: User input violates contract constraints\n\n**Solution**: Add help text and validation rules:\n\n```\n[[fields]]\nname = "cpu_cores"\nvalidation_pattern = "^[1-9][0-9]*$"\nhelp = "Must be positive integer"\n```\n\n### Problem: Output not valid Nickel\n\n**Cause**: Missing required fields\n\n**Solution**: Ensure all required fields in form:\n\n```\n[[fields]]\nname = "required_field"\nrequired = true # User must provide value\n```\n\n---\n\n## Complete Example: End-to-End Workflow\n\n### Step 1: Define Nickel Schema\n\n```\n# workspace_schema.ncl\n{\n workspace = {\n name = "",\n mode = 'solo,\n provider = 'upcloud,\n monitoring = true,\n email = "",\n },\n}\n```\n\n### Step 2: Define Form\n\n```\n# workspace_form.toml\n[[fields]]\nname = "name"\ntype = "text"\nrequired = true\n\n[[fields]]\nname = "mode"\ntype = "select"\noptions = ["solo", "enterprise"]\n\n[[fields]]\nname = "provider"\ntype = "select"\noptions = ["upcloud", "aws"]\n\n[[fields]]\nname = "monitoring"\ntype = "confirm"\n\n[[fields]]\nname = "email"\ntype = "text"\nrequired = true\n```\n\n### Step 3: User Interaction\n\n```\n$ typedialog form --config workspace_form.toml --backend tui\n# User fills form interactively\n```\n\n### Step 4: Output\n\n```\n{\n workspace = {\n name = "production",\n mode = 'enterprise,\n provider = 'upcloud,\n monitoring = true,\n email = "ops@company.com",\n },\n}\n```\n\n### Step 5: Use in Provisioning\n\n```\n# main.ncl\nlet config = import "./workspace.ncl" in\nlet schemas = import "provisioning/schemas/main.ncl" in\n\n{\n # Build infrastructure\n infrastructure = schemas.deployment.modes.make_mode {\n deployment_type = config.workspace.mode,\n provider = config.workspace.provider,\n },\n}\n```\n\n---\n\n## Summary\n\nTypeDialog + Nickel provides:\n\n✅ **Type-Safe UIs**: Forms validated against Nickel contracts\n✅ **Auto-Generated**: No UI code to maintain\n✅ **Bidirectional**: Nickel → Forms → Nickel\n✅ **Multiple Outputs**: JSON, YAML, TOML, Nickel\n✅ **Three Backends**: CLI, TUI, Web\n✅ **Production-Ready**: Used in real infrastructure\n\n**Key Benefit**: Reduce configuration errors by enforcing schema validation at UI level, not after deployment.\n\n---\n\n**Version**: 1.0.0\n**Status**: Implementation Guide\n**Last Updated**: 2025-12-15 +# TypeDialog + Nickel Integration Guide + +**Status**: Implementation Guide +**Last Updated**: 2025-12-15 +**Project**: TypeDialog at `/Users/Akasha/Development/typedialog` +**Purpose**: Type-safe UI generation from Nickel schemas + +--- + +## What is TypeDialog + +TypeDialog generates **type-safe interactive forms** from configuration schemas with **bidirectional Nickel integration**. 
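+
+In practice the round trip is scriptable (a Nushell sketch; it assumes `typedialog` writes the generated Nickel to stdout, that the `nickel` CLI is on PATH, and it uses the form file defined in Step 2 below):
+
+```text
+# Generate a config interactively, then hand it back to Nickel tooling
+typedialog form --config server_form.toml --backend cli --output nickel
+    | save server_config_output.ncl
+nickel export server_config_output.ncl --format json
+```
+
+The overall flow: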
+ +```text +Nickel Schema + ↓ +TypeDialog Form (Auto-generated) + ↓ +User fills form interactively + ↓ +Nickel output config (Type-safe) +``` + +--- + +## Architecture + +### Three Layers + +```text +CLI/TUI/Web Layer + ↓ +TypeDialog Form Engine + ↓ +Nickel Integration + ↓ +Schema Contracts +``` + +### Data Flow + +```text +Input (Nickel) + ↓ +Form Definition (TOML) + ↓ +Form Rendering (CLI/TUI/Web) + ↓ +User Input + ↓ +Validation (against Nickel contracts) + ↓ +Output (JSON/YAML/TOML/Nickel) +``` + +--- + +## Setup + +### Installation + +```text +# Clone TypeDialog +git clone https://github.com/jesusperezlorenzo/typedialog.git +cd typedialog + +# Build +cargo build --release + +# Install (optional) +cargo install --path ./crates/typedialog +``` + +### Verify Installation + +```text +typedialog --version +typedialog --help +``` + +--- + +## Basic Workflow + +### Step 1: Define Nickel Schema + +```text +# server_config.ncl +let contracts = import "./contracts.ncl" in +let defaults = import "./defaults.ncl" in + +{ + defaults = defaults, + + make_server | not_exported = fun overrides => + defaults.server & overrides, + + DefaultServer = defaults.server, +} +``` + +### Step 2: Define TypeDialog Form (TOML) + +```text +# server_form.toml +[form] +title = "Server Configuration" +description = "Create a new server configuration" + +[[fields]] +name = "server_name" +label = "Server Name" +type = "text" +required = true +help = "Unique identifier for the server" +placeholder = "web-01" + +[[fields]] +name = "cpu_cores" +label = "CPU Cores" +type = "number" +required = true +default = 4 +help = "Number of CPU cores (1-32)" + +[[fields]] +name = "memory_gb" +label = "Memory (GB)" +type = "number" +required = true +default = 8 +help = "Memory in GB (1-256)" + +[[fields]] +name = "zone" +label = "Availability Zone" +type = "select" +required = true +options = ["us-nyc1", "eu-fra1", "ap-syd1"] +default = "us-nyc1" + +[[fields]] +name = "monitoring" +label = "Enable Monitoring" +type = "confirm" +default = true + +[[fields]] +name = "tags" +label = "Tags" +type = "multiselect" +options = ["production", "staging", "testing", "development"] +help = "Select applicable tags" +``` + +### Step 3: Render Form (CLI) + +```text +typedialog form --config server_form.toml --backend cli +``` + +**Output**: + +```text +Server Configuration +Create a new server configuration + +? Server Name: web-01 +? CPU Cores: 4 +? Memory (GB): 8 +? Availability Zone: (us-nyc1/eu-fra1/ap-syd1) us-nyc1 +? Enable Monitoring: (y/n) y +? Tags: (Select multiple with space) + ◉ production + ◯ staging + ◯ testing + ◯ development +``` + +### Step 4: Validate Against Nickel Schema + +```text +# Validation happens automatically +# If input matches Nickel contract, proceeds to output +``` + +### Step 5: Output to Nickel + +```text +typedialog form \ + --config server_form.toml \ + --output nickel \ + --backend cli +``` + +**Output file** (`server_config_output.ncl`): + +```text +{ + server_name = "web-01", + cpu_cores = 4, + memory_gb = 8, + zone = "us-nyc1", + monitoring = true, + tags = ["production"], +} +``` + +--- + +## Real-World Example 1: Infrastructure Wizard + +### Scenario + +You want an interactive CLI wizard for infrastructure provisioning.
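+
+The steps below define the schema, build the form, and run the wizard. Once the wizard has written `infrastructure_config.ncl` (Step 3), downstream tooling can consume it directly; here is a Nushell sketch of that final hop (assuming the `nickel` CLI is on PATH):
+
+```text
+# Load the wizard's output (produced in Step 3 below) for downstream tooling
+let cfg = (nickel export infrastructure_config.ncl --format json | from json)
+print $"Deploying ($cfg.workspace_name) on ($cfg.provider) in ($cfg.deployment_mode) mode"
+if $cfg.enable_backup {
+    print $"Backups retained for ($cfg.backup_retention_days) days"
+}
+```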
+ +### Step 1: Define Nickel Schema for Infrastructure + +```text +# infrastructure_schema.ncl +{ + InfrastructureConfig = { + workspace_name | String, + deployment_mode | [| 'solo, 'multiuser, 'cicd, 'enterprise |], + provider | [| 'upcloud, 'aws, 'hetzner |], + taskservs | Array, + enable_monitoring | Bool, + enable_backup | Bool, + backup_retention_days | Number, + }, + + defaults = { + workspace_name = "", + deployment_mode = 'solo, + provider = 'upcloud, + taskservs = [], + enable_monitoring = true, + enable_backup = true, + backup_retention_days = 7, + }, + + DefaultInfra = defaults, +} +``` + +### Step 2: Create Comprehensive Form + +```text +# infrastructure_wizard.toml +[form] +title = "Infrastructure Provisioning Wizard" +description = "Create a complete infrastructure setup" + +[[fields]] +name = "workspace_name" +label = "Workspace Name" +type = "text" +required = true +validation_pattern = "^[a-z0-9-]{3,32}$" +help = "3-32 chars, lowercase alphanumeric and hyphens only" +placeholder = "my-workspace" + +[[fields]] +name = "deployment_mode" +label = "Deployment Mode" +type = "select" +required = true +options = [ + { value = "solo", label = "Solo (Single user, 2 CPU, 4 GB RAM)" }, + { value = "multiuser", label = "MultiUser (Team, 4 CPU, 8 GB RAM)" }, + { value = "cicd", label = "CI/CD (Pipelines, 8 CPU, 16 GB RAM)" }, + { value = "enterprise", label = "Enterprise (Production, 16 CPU, 32 GB RAM)" }, +] +default = "solo" + +[[fields]] +name = "provider" +label = "Cloud Provider" +type = "select" +required = true +options = [ + { value = "upcloud", label = "UpCloud (EU)" }, + { value = "aws", label = "AWS (Global)" }, + { value = "hetzner", label = "Hetzner (EU)" }, +] +default = "upcloud" + +[[fields]] +name = "taskservs" +label = "Task Services" +type = "multiselect" +required = false +options = [ + { value = "kubernetes", label = "Kubernetes (Container orchestration)" }, + { value = "cilium", label = "Cilium (Network policy)" }, + { value = "postgres", label = "PostgreSQL (Database)" }, + { value = "redis", label = "Redis (Cache)" }, + { value = "prometheus", label = "Prometheus (Monitoring)" }, + { value = "etcd", label = "etcd (Distributed config)" }, +] +help = "Select task services to deploy" + +[[fields]] +name = "enable_monitoring" +label = "Enable Monitoring" +type = "confirm" +default = true +help = "Prometheus + Grafana dashboards" + +[[fields]] +name = "enable_backup" +label = "Enable Backup" +type = "confirm" +default = true + +[[fields]] +name = "backup_retention_days" +label = "Backup Retention (days)" +type = "number" +required = false +default = 7 +help = "How long to keep backups (if enabled)" +visible_if = "enable_backup == true" + +[[fields]] +name = "email" +label = "Admin Email" +type = "text" +required = true +validation_pattern = "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$" +help = "For alerts and notifications" +placeholder = "admin@company.com" +``` + +### Step 3: Run Interactive Wizard + +```text +typedialog form \ + --config infrastructure_wizard.toml \ + --backend tui \ + --output nickel +``` + +**Output** (`infrastructure_config.ncl`): + +```text +{ + workspace_name = "production-eu", + deployment_mode = 'enterprise, + provider = 'upcloud, + taskservs = ["kubernetes", "cilium", "postgres", "redis", "prometheus"], + enable_monitoring = true, + enable_backup = true, + backup_retention_days = 30, + email = "ops@company.com", +} +``` + +### Step 4: Use Output in Infrastructure + +```text +# main_infrastructure.ncl +let config = import 
"./infrastructure_config.ncl" in +let schemas = import "../../provisioning/schemas/main.ncl" in + +{ + # Build infrastructure based on config + infrastructure = if config.deployment_mode == 'solo then + { + servers = [ + schemas.lib.make_server { + name = config.workspace_name, + cpu_cores = 2, + memory_gb = 4, + }, + ], + taskservs = config.taskservs, + } + else if config.deployment_mode == 'enterprise then + { + servers = [ + schemas.lib.make_server { name = "app-01", cpu_cores = 16, memory_gb = 32 }, + schemas.lib.make_server { name = "app-02", cpu_cores = 16, memory_gb = 32 }, + schemas.lib.make_server { name = "db-01", cpu_cores = 16, memory_gb = 32 }, + ], + taskservs = config.taskservs, + monitoring = { enabled = config.enable_monitoring, email = config.email }, + } + else + # default fallback + {}, +} +``` + +--- + +## Real-World Example 2: Server Configuration Form + +### Form Definition (Advanced) + +```text +# server_advanced_form.toml +[form] +title = "Server Configuration" +description = "Configure server settings with validation" + +# Section 1: Basic Info +[[sections]] +name = "basic" +title = "Basic Information" + +[[fields]] +name = "server_name" +section = "basic" +label = "Server Name" +type = "text" +required = true +validation_pattern = "^[a-z0-9-]{3,32}$" + +[[fields]] +name = "description" +section = "basic" +label = "Description" +type = "textarea" +required = false +placeholder = "Server purpose and details" + +# Section 2: Resources +[[sections]] +name = "resources" +title = "Resources" + +[[fields]] +name = "cpu_cores" +section = "resources" +label = "CPU Cores" +type = "number" +required = true +default = 4 +min = 1 +max = 32 + +[[fields]] +name = "memory_gb" +section = "resources" +label = "Memory (GB)" +type = "number" +required = true +default = 8 +min = 1 +max = 256 + +[[fields]] +name = "disk_gb" +section = "resources" +label = "Disk (GB)" +type = "number" +required = true +default = 100 +min = 10 +max = 2000 + +# Section 3: Network +[[sections]] +name = "network" +title = "Network Configuration" + +[[fields]] +name = "zone" +section = "network" +label = "Availability Zone" +type = "select" +required = true +options = ["us-nyc1", "eu-fra1", "ap-syd1"] + +[[fields]] +name = "enable_ipv6" +section = "network" +label = "Enable IPv6" +type = "confirm" +default = false + +[[fields]] +name = "allowed_ports" +section = "network" +label = "Allowed Ports" +type = "multiselect" +options = [ + { value = "22", label = "SSH (22)" }, + { value = "80", label = "HTTP (80)" }, + { value = "443", label = "HTTPS (443)" }, + { value = "3306", label = "MySQL (3306)" }, + { value = "5432", label = "PostgreSQL (5432)" }, +] + +# Section 4: Advanced +[[sections]] +name = "advanced" +title = "Advanced Options" + +[[fields]] +name = "kernel_version" +section = "advanced" +label = "Kernel Version" +type = "text" +required = false +placeholder = "5.15.0 (or leave blank for latest)" + +[[fields]] +name = "enable_monitoring" +section = "advanced" +label = "Enable Monitoring" +type = "confirm" +default = true + +[[fields]] +name = "monitoring_interval" +section = "advanced" +label = "Monitoring Interval (seconds)" +type = "number" +required = false +default = 60 +visible_if = "enable_monitoring == true" + +[[fields]] +name = "tags" +section = "advanced" +label = "Tags" +type = "multiselect" +options = ["production", "staging", "testing", "development"] +``` + +### Output Structure + +```text +{ + # Basic + server_name = "web-prod-01", + description = "Primary web server", + + # Resources 
+ cpu_cores = 16, + memory_gb = 32, + disk_gb = 500, + + # Network + zone = "eu-fra1", + enable_ipv6 = true, + allowed_ports = ["22", "80", "443"], + + # Advanced + kernel_version = "5.15.0", + enable_monitoring = true, + monitoring_interval = 30, + tags = ["production"], +} +``` + +--- + +## API Integration + +### TypeDialog REST Endpoints + +```text +# Start TypeDialog server +typedialog server --port 8080 + +# Render form via HTTP +curl -X POST http://localhost:8080/forms \ + -H "Content-Type: application/json" \ + -d @server_form.toml +``` + +### Response Format + +```text +{ + "form_id": "srv_abc123", + "status": "rendered", + "fields": [ + { + "name": "server_name", + "label": "Server Name", + "type": "text", + "required": true, + "placeholder": "web-01" + } + ] +} +``` + +### Submit Form + +```text +curl -X POST http://localhost:8080/forms/srv_abc123/submit \ + -H "Content-Type: application/json" \ + -d '{ + "server_name": "web-01", + "cpu_cores": 4, + "memory_gb": 8, + "zone": "us-nyc1", + "monitoring": true, + "tags": ["production"] + }' +``` + +### Response + +```text +{ + "status": "success", + "validation": "passed", + "output_format": "nickel", + "output": { + "server_name": "web-01", + "cpu_cores": 4, + "memory_gb": 8, + "zone": "us-nyc1", + "monitoring": true, + "tags": ["production"] + } +} +``` + +--- + +## Validation + +### Contract-Based Validation + +TypeDialog validates user input against Nickel contracts: + +```text +# Nickel contract +ServerConfig = { + cpu_cores | Number, # Must be number + memory_gb | Number, # Must be number + zone | [| 'us-nyc1, 'eu-fra1 |], # Enum +} + +# If user enters invalid value +# TypeDialog rejects before serializing +``` + +### Validation Rules in Form + +```text +[[fields]] +name = "cpu_cores" +type = "number" +min = 1 +max = 32 +help = "Must be 1-32 cores" +# TypeDialog enforces before user can submit +``` + +--- + +## Integration with Provisioning Platform + +### Use Case: Infrastructure Initialization + +```text +# 1. User runs initialization +provisioning init --wizard + +# 2. Behind the scenes: +# - Loads infrastructure_wizard.toml +# - Starts TypeDialog (CLI or TUI) +# - User fills form interactively + +# 3. Output saved as config +# ~/.config/provisioning/infrastructure_config.ncl + +# 4. Provisioning uses output +# provisioning server create --from-config infrastructure_config.ncl +``` + +### Implementation in Nushell + +```text +# provisioning/core/nulib/provisioning_init.nu + +def provisioning_init_wizard [] { + # Launch TypeDialog form + let config = ( + typedialog form \ + --config "provisioning/config/infrastructure_wizard.toml" \ + --backend tui \ + --output nickel + ) + + # Save output + $config | save ~/.config/provisioning/workspace_config.ncl + + # Validate by evaluating the generated config (nickel checks contracts on export) + let validated = ( + nickel export ~/.config/provisioning/workspace_config.ncl --format json + | from json + ) + + print "Infrastructure configuration created!" 
+ print "Use: provisioning deploy --from-config" +} +``` + +--- + +## Advanced Features + +### Conditional Visibility + +Show/hide fields based on user selections: + +```text +[[fields]] +name = "backup_retention" +label = "Backup Retention (days)" +type = "number" +visible_if = "enable_backup == true" # Only shown if backup enabled +``` + +### Dynamic Defaults + +Set defaults based on other fields: + +```text +[[fields]] +name = "deployment_mode" +type = "select" +options = ["solo", "enterprise"] + +[[fields]] +name = "cpu_cores" +type = "number" +default_from = "deployment_mode" # Can reference other fields +# solo → default 2, enterprise → default 16 +``` + +### Custom Validation + +```text +[[fields]] +name = "memory_gb" +type = "number" +validation_rule = "memory_gb >= cpu_cores * 2" +help = "Memory must be at least 2 GB per CPU core" +``` + +--- + +## Output Formats + +TypeDialog can output to multiple formats: + +```text +# Output to Nickel (recommended for IaC) +typedialog form --config form.toml --output nickel + +# Output to JSON (for APIs) +typedialog form --config form.toml --output json + +# Output to YAML (for K8s) +typedialog form --config form.toml --output yaml + +# Output to TOML (for application config) +typedialog form --config form.toml --output toml +``` + +--- + +## Backends + +TypeDialog supports three rendering backends: + +### 1. CLI (Command-line prompts) + +```text +typedialog form --config form.toml --backend cli +``` + +**Pros**: Lightweight, SSH-friendly, no dependencies +**Cons**: Basic UI + +### 2. TUI (Terminal User Interface - Ratatui) + +```text +typedialog form --config form.toml --backend tui +``` + +**Pros**: Rich UI, keyboard navigation, sections +**Cons**: Requires terminal support + +### 3. Web (HTTP Server - Axum) + +```text +typedialog form --config form.toml --backend web --port 3000 +# Opens http://localhost:3000 +``` + +**Pros**: Beautiful UI, remote access, multi-user +**Cons**: Requires browser, network + +--- + +## Troubleshooting + +### Problem: Form doesn't match Nickel contract + +**Cause**: Field names or types don't match contract + +**Solution**: Verify field definitions match Nickel schema: + +```text +# Form field +[[fields]] +name = "cpu_cores" # Must match Nickel field name +type = "number" # Must match Nickel type +``` + +### Problem: Validation fails + +**Cause**: User input violates contract constraints + +**Solution**: Add help text and validation rules: + +```text +[[fields]] +name = "cpu_cores" +validation_pattern = "^[1-9][0-9]*$" +help = "Must be positive integer" +``` + +### Problem: Output not valid Nickel + +**Cause**: Missing required fields + +**Solution**: Ensure all required fields in form: + +```text +[[fields]] +name = "required_field" +required = true # User must provide value +``` + +--- + +## Complete Example: End-to-End Workflow + +### Step 1: Define Nickel Schema + +```text +# workspace_schema.ncl +{ + workspace = { + name = "", + mode = 'solo, + provider = 'upcloud, + monitoring = true, + email = "", + }, +} +``` + +### Step 2: Define Form + +```text +# workspace_form.toml +[[fields]] +name = "name" +type = "text" +required = true + +[[fields]] +name = "mode" +type = "select" +options = ["solo", "enterprise"] + +[[fields]] +name = "provider" +type = "select" +options = ["upcloud", "aws"] + +[[fields]] +name = "monitoring" +type = "confirm" + +[[fields]] +name = "email" +type = "text" +required = true +``` + +### Step 3: User Interaction + +```text +$ typedialog form --config workspace_form.toml --backend tui 
+# User fills form interactively +``` + +### Step 4: Output + +```text +{ + workspace = { + name = "production", + mode = 'enterprise, + provider = 'upcloud, + monitoring = true, + email = "ops@company.com", + }, +} +``` + +### Step 5: Use in Provisioning + +```text +# main.ncl +let config = import "./workspace.ncl" in +let schemas = import "provisioning/schemas/main.ncl" in + +{ + # Build infrastructure + infrastructure = schemas.deployment.modes.make_mode { + deployment_type = config.workspace.mode, + provider = config.workspace.provider, + }, +} +``` + +--- + +## Summary + +TypeDialog + Nickel provides: + +✅ **Type-Safe UIs**: Forms validated against Nickel contracts +✅ **Auto-Generated**: No UI code to maintain +✅ **Bidirectional**: Nickel → Forms → Nickel +✅ **Multiple Outputs**: JSON, YAML, TOML, Nickel +✅ **Three Backends**: CLI, TUI, Web +✅ **Production-Ready**: Used in real infrastructure + +**Key Benefit**: Reduce configuration errors by enforcing schema validation at UI level, not after deployment. + +--- + +**Version**: 1.0.0 +**Status**: Implementation Guide +**Last Updated**: 2025-12-15 \ No newline at end of file diff --git a/docs/src/configuration/config-validation.md b/docs/src/configuration/config-validation.md index 5ec0d73..cf4fbe2 100644 --- a/docs/src/configuration/config-validation.md +++ b/docs/src/configuration/config-validation.md @@ -1 +1,631 @@ -# Configuration Validation Guide\n\n## Overview\n\nThe new configuration system includes comprehensive schema validation to catch errors early and ensure configuration correctness.\n\n## Schema Validation Features\n\n### 1. Required Fields Validation\n\nEnsures all required fields are present:\n\n```\n# Schema definition\n[required]\nfields = ["name", "version", "enabled"]\n\n# Valid config\nname = "my-service"\nversion = "1.0.0"\nenabled = true\n\n# Invalid - missing 'enabled'\nname = "my-service"\nversion = "1.0.0"\n# Error: Required field missing: enabled\n```\n\n### 2. Type Validation\n\nValidates field types:\n\n```\n# Schema\n[fields.port]\ntype = "int"\n\n[fields.name]\ntype = "string"\n\n[fields.enabled]\ntype = "bool"\n\n# Valid\nport = 8080\nname = "orchestrator"\nenabled = true\n\n# Invalid - wrong type\nport = "8080" # Error: Expected int, got string\n```\n\n### 3. Enum Validation\n\nRestricts values to predefined set:\n\n```\n# Schema\n[fields.environment]\ntype = "string"\nenum = ["dev", "staging", "prod"]\n\n# Valid\nenvironment = "prod"\n\n# Invalid\nenvironment = "production" # Error: Must be one of: dev, staging, prod\n```\n\n### 4. Range Validation\n\nValidates numeric ranges:\n\n```\n# Schema\n[fields.port]\ntype = "int"\nmin = 1024\nmax = 65535\n\n# Valid\nport = 8080\n\n# Invalid - below minimum\nport = 80 # Error: Must be >= 1024\n\n# Invalid - above maximum\nport = 70000 # Error: Must be <= 65535\n```\n\n### 5. Pattern Validation\n\nValidates string patterns using regex:\n\n```\n# Schema\n[fields.email]\ntype = "string"\npattern = "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"\n\n# Valid\nemail = "admin@example.com"\n\n# Invalid\nemail = "not-an-email" # Error: Does not match pattern\n```\n\n### 6. Deprecated Fields\n\nWarns about deprecated configuration:\n\n```\n# Schema\n[deprecated]\nfields = ["old_field"]\n\n[deprecated_replacements]\nold_field = "new_field"\n\n# Config using deprecated field\nold_field = "value" # Warning: old_field is deprecated. 
Use new_field instead.\n```\n\n## Using Schema Validator\n\n### Command Line\n\n```\n# Validate workspace config\nprovisioning workspace config validate\n\n# Validate provider config\nprovisioning provider validate aws\n\n# Validate platform service config\nprovisioning platform validate orchestrator\n\n# Validate with detailed output\nprovisioning workspace config validate --verbose\n```\n\n### Programmatic Usage\n\n```\nuse provisioning/core/nulib/lib_provisioning/config/schema_validator.nu *\n\n# Load config\nlet config = (open ~/workspaces/my-project/config/provisioning.yaml | from yaml)\n\n# Validate against schema\nlet result = (validate-workspace-config $config)\n\n# Check results\nif $result.valid {\n print "✅ Configuration is valid"\n} else {\n print "❌ Configuration has errors:"\n for error in $result.errors {\n print $" • ($error.message)"\n }\n}\n\n# Display warnings\nif ($result.warnings | length) > 0 {\n print "⚠️ Warnings:"\n for warning in $result.warnings {\n print $" • ($warning.message)"\n }\n}\n```\n\n### Pretty Print Results\n\n```\n# Validate and print formatted results\nlet result = (validate-workspace-config $config)\nprint-validation-results $result\n```\n\n## Schema Examples\n\n### Workspace Schema\n\nFile: `/Users/Akasha/project-provisioning/provisioning/config/workspace.schema.toml`\n\n```\n[required]\nfields = ["workspace", "paths"]\n\n[fields.workspace]\ntype = "record"\n\n[fields.workspace.name]\ntype = "string"\npattern = "^[a-z][a-z0-9-]*$"\n\n[fields.workspace.version]\ntype = "string"\npattern = "^\\d+\\.\\d+\\.\\d+$"\n\n[fields.paths]\ntype = "record"\n\n[fields.paths.base]\ntype = "string"\n\n[fields.paths.infra]\ntype = "string"\n\n[fields.debug]\ntype = "record"\n\n[fields.debug.enabled]\ntype = "bool"\n\n[fields.debug.log_level]\ntype = "string"\nenum = ["debug", "info", "warn", "error"]\n```\n\n### Provider Schema (AWS)\n\nFile: `/Users/Akasha/project-provisioning/provisioning/extensions/providers/aws/config.schema.toml`\n\n```\n[required]\nfields = ["provider", "credentials"]\n\n[fields.provider]\ntype = "record"\n\n[fields.provider.name]\ntype = "string"\nenum = ["aws"]\n\n[fields.provider.region]\ntype = "string"\npattern = "^[a-z]{2}-[a-z]+-\\d+$"\n\n[fields.provider.enabled]\ntype = "bool"\n\n[fields.credentials]\ntype = "record"\n\n[fields.credentials.type]\ntype = "string"\nenum = ["environment", "file", "iam_role"]\n\n[fields.compute]\ntype = "record"\n\n[fields.compute.default_instance_type]\ntype = "string"\n\n[fields.compute.default_ami]\ntype = "string"\npattern = "^ami-[a-f0-9]{8,17}$"\n\n[fields.network]\ntype = "record"\n\n[fields.network.vpc_id]\ntype = "string"\npattern = "^vpc-[a-f0-9]{8,17}$"\n\n[fields.network.subnet_id]\ntype = "string"\npattern = "^subnet-[a-f0-9]{8,17}$"\n\n[deprecated]\nfields = ["old_region_field"]\n\n[deprecated_replacements]\nold_region_field = "provider.region"\n```\n\n### Platform Service Schema (Orchestrator)\n\nFile: `/Users/Akasha/project-provisioning/provisioning/platform/orchestrator/config.schema.toml`\n\n```\n[required]\nfields = ["service", "server"]\n\n[fields.service]\ntype = "record"\n\n[fields.service.name]\ntype = "string"\nenum = ["orchestrator"]\n\n[fields.service.enabled]\ntype = "bool"\n\n[fields.server]\ntype = "record"\n\n[fields.server.host]\ntype = "string"\n\n[fields.server.port]\ntype = "int"\nmin = 1024\nmax = 65535\n\n[fields.workers]\ntype = "int"\nmin = 1\nmax = 32\n\n[fields.queue]\ntype = "record"\n\n[fields.queue.max_size]\ntype = "int"\nmin = 100\nmax = 
10000\n\n[fields.queue.storage_path]\ntype = "string"\n```\n\n### KMS Service Schema\n\nFile: `/Users/Akasha/project-provisioning/provisioning/core/services/kms/config.schema.toml`\n\n```\n[required]\nfields = ["kms", "encryption"]\n\n[fields.kms]\ntype = "record"\n\n[fields.kms.enabled]\ntype = "bool"\n\n[fields.kms.provider]\ntype = "string"\nenum = ["aws_kms", "gcp_kms", "azure_kv", "vault", "local"]\n\n[fields.encryption]\ntype = "record"\n\n[fields.encryption.algorithm]\ntype = "string"\nenum = ["AES-256-GCM", "ChaCha20-Poly1305"]\n\n[fields.encryption.key_rotation_days]\ntype = "int"\nmin = 30\nmax = 365\n\n[fields.vault]\ntype = "record"\n\n[fields.vault.address]\ntype = "string"\npattern = "^https?://.*$"\n\n[fields.vault.token_path]\ntype = "string"\n\n[deprecated]\nfields = ["old_kms_type"]\n\n[deprecated_replacements]\nold_kms_type = "kms.provider"\n```\n\n## Validation Workflow\n\n### 1. Development\n\n```\n# Create new config\nvim ~/workspaces/dev/config/provisioning.yaml\n\n# Validate immediately\nprovisioning workspace config validate\n\n# Fix errors and revalidate\nvim ~/workspaces/dev/config/provisioning.yaml\nprovisioning workspace config validate\n```\n\n### 2. CI/CD Pipeline\n\n```\n# GitLab CI\nvalidate-config:\n stage: validate\n script:\n - provisioning workspace config validate\n - provisioning provider validate aws\n - provisioning provider validate upcloud\n - provisioning platform validate orchestrator\n only:\n changes:\n - "*/config/**/*"\n```\n\n### 3. Pre-Deployment\n\n```\n# Validate all configurations before deployment\nprovisioning workspace config validate --verbose\nprovisioning provider validate --all\nprovisioning platform validate --all\n\n# If valid, proceed with deployment\nif [[ $? -eq 0 ]]; then\n provisioning deploy --workspace production\nfi\n```\n\n## Error Messages\n\n### Clear Error Format\n\n```\n❌ Validation failed\n\nErrors:\n • Required field missing: workspace.name\n • Field port type mismatch: expected int, got string\n • Field environment must be one of: dev, staging, prod\n • Field port must be >= 1024\n • Field email does not match pattern: ^[a-zA-Z0-9._%+-]+@.*$\n\n⚠️ Warnings:\n • Field old_field is deprecated. 
Use new_field instead.\n```\n\n### Error Details\n\nEach error includes:\n\n- **field**: Which field has the error\n- **type**: Error type (missing_required, type_mismatch, invalid_enum, etc.)\n- **message**: Human-readable description\n- **Additional context**: Expected values, patterns, ranges\n\n## Common Validation Patterns\n\n### Pattern 1: Hostname Validation\n\n```\n[fields.hostname]\ntype = "string"\npattern = "^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$"\n```\n\n### Pattern 2: Email Validation\n\n```\n[fields.email]\ntype = "string"\npattern = "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"\n```\n\n### Pattern 3: Semantic Version\n\n```\n[fields.version]\ntype = "string"\npattern = "^\\d+\\.\\d+\\.\\d+(-[a-zA-Z0-9]+)?$"\n```\n\n### Pattern 4: URL Validation\n\n```\n[fields.url]\ntype = "string"\npattern = "^https?://[a-zA-Z0-9.-]+(:[0-9]+)?(/.*)?$"\n```\n\n### Pattern 5: IPv4 Address\n\n```\n[fields.ip_address]\ntype = "string"\npattern = "^(?:[0-9]{1,3}\\.){3}[0-9]{1,3}$"\n```\n\n### Pattern 6: AWS Resource ID\n\n```\n[fields.instance_id]\ntype = "string"\npattern = "^i-[a-f0-9]{8,17}$"\n\n[fields.ami_id]\ntype = "string"\npattern = "^ami-[a-f0-9]{8,17}$"\n\n[fields.vpc_id]\ntype = "string"\npattern = "^vpc-[a-f0-9]{8,17}$"\n```\n\n## Testing Validation\n\n### Unit Tests\n\n```\n# Run validation test suite\nnu provisioning/tests/config_validation_tests.nu\n```\n\n### Integration Tests\n\n```\n# Test with real configs\nprovisioning test validate --workspace dev\nprovisioning test validate --workspace staging\nprovisioning test validate --workspace prod\n```\n\n### Custom Validation\n\n```\n# Create custom validation function\ndef validate-custom-config [config: record] {\n let result = (validate-workspace-config $config)\n\n # Add custom business logic validation\n if ($config.workspace.name | str starts-with "prod") {\n if not $config.debug.enabled == false {\n $result.errors = ($result.errors | append {\n field: "debug.enabled"\n type: "custom"\n message: "Debug must be disabled in production"\n })\n }\n }\n\n $result\n}\n```\n\n## Best Practices\n\n### 1. Validate Early\n\n```\n# Validate during development\nprovisioning workspace config validate\n\n# Don't wait for deployment\n```\n\n### 2. Use Strict Schemas\n\n```\n# Be explicit about types and constraints\n[fields.port]\ntype = "int"\nmin = 1024\nmax = 65535\n\n# Don't leave fields unvalidated\n```\n\n### 3. Document Patterns\n\n```\n# Include examples in schema\n[fields.email]\ntype = "string"\npattern = "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"\n# Example: user@example.com\n```\n\n### 4. Handle Deprecation\n\n```\n# Always provide replacement guidance\n[deprecated_replacements]\nold_field = "new_field" # Clear migration path\n```\n\n### 5. 
Test Schemas\n\n```\n# Include test cases in comments\n# Valid: "admin@example.com"\n# Invalid: "not-an-email"\n```\n\n## Troubleshooting\n\n### Schema File Not Found\n\n```\n# Error: Schema file not found: /path/to/schema.toml\n\n# Solution: Ensure schema exists\nls -la /Users/Akasha/project-provisioning/provisioning/config/*.schema.toml\n```\n\n### Pattern Not Matching\n\n```\n# Error: Field hostname does not match pattern\n\n# Debug: Test pattern separately\necho "my-hostname" | grep -E "^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$"\n```\n\n### Type Mismatch\n\n```\n# Error: Expected int, got string\n\n# Check config\ncat ~/workspaces/dev/config/provisioning.yaml | yq '.server.port'\n# Output: "8080" (string)\n\n# Fix: Remove quotes\nvim ~/workspaces/dev/config/provisioning.yaml\n# Change: port: "8080"\n# To: port: 8080\n```\n\n## Additional Resources\n\n- [Migration Guide](./MIGRATION_GUIDE.md)\n- [Workspace Guide](./WORKSPACE_GUIDE.md)\n- [Schema Files](../config/*.schema.toml)\n- [Validation Tests](../tests/config_validation_tests.nu) +# Configuration Validation Guide + +## Overview + +The new configuration system includes comprehensive schema validation to catch errors early and ensure configuration correctness. + +## Schema Validation Features + +### 1. Required Fields Validation + +Ensures all required fields are present: + +```text +# Schema definition +[required] +fields = ["name", "version", "enabled"] + +# Valid config +name = "my-service" +version = "1.0.0" +enabled = true + +# Invalid - missing 'enabled' +name = "my-service" +version = "1.0.0" +# Error: Required field missing: enabled +``` + +### 2. Type Validation + +Validates field types: + +```text +# Schema +[fields.port] +type = "int" + +[fields.name] +type = "string" + +[fields.enabled] +type = "bool" + +# Valid +port = 8080 +name = "orchestrator" +enabled = true + +# Invalid - wrong type +port = "8080" # Error: Expected int, got string +``` + +### 3. Enum Validation + +Restricts values to predefined set: + +```text +# Schema +[fields.environment] +type = "string" +enum = ["dev", "staging", "prod"] + +# Valid +environment = "prod" + +# Invalid +environment = "production" # Error: Must be one of: dev, staging, prod +``` + +### 4. Range Validation + +Validates numeric ranges: + +```text +# Schema +[fields.port] +type = "int" +min = 1024 +max = 65535 + +# Valid +port = 8080 + +# Invalid - below minimum +port = 80 # Error: Must be >= 1024 + +# Invalid - above maximum +port = 70000 # Error: Must be <= 65535 +``` + +### 5. Pattern Validation + +Validates string patterns using regex: + +```text +# Schema +[fields.email] +type = "string" +pattern = "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$" + +# Valid +email = "admin@example.com" + +# Invalid +email = "not-an-email" # Error: Does not match pattern +``` + +### 6. Deprecated Fields + +Warns about deprecated configuration: + +```text +# Schema +[deprecated] +fields = ["old_field"] + +[deprecated_replacements] +old_field = "new_field" + +# Config using deprecated field +old_field = "value" # Warning: old_field is deprecated. Use new_field instead. 
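+
+# After migrating to the replacement field, no warning is emitted:
+# new_field = "value"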
+``` + +## Using Schema Validator + +### Command Line + +```text +# Validate workspace config +provisioning workspace config validate + +# Validate provider config +provisioning provider validate aws + +# Validate platform service config +provisioning platform validate orchestrator + +# Validate with detailed output +provisioning workspace config validate --verbose +``` + +### Programmatic Usage + +```text +use provisioning/core/nulib/lib_provisioning/config/schema_validator.nu * + +# Load config +let config = (open ~/workspaces/my-project/config/provisioning.yaml | from yaml) + +# Validate against schema +let result = (validate-workspace-config $config) + +# Check results +if $result.valid { + print "✅ Configuration is valid" +} else { + print "❌ Configuration has errors:" + for error in $result.errors { + print $" • ($error.message)" + } +} + +# Display warnings +if ($result.warnings | length) > 0 { + print "⚠️ Warnings:" + for warning in $result.warnings { + print $" • ($warning.message)" + } +} +``` + +### Pretty Print Results + +```text +# Validate and print formatted results +let result = (validate-workspace-config $config) +print-validation-results $result +``` + +## Schema Examples + +### Workspace Schema + +File: `/Users/Akasha/project-provisioning/provisioning/config/workspace.schema.toml` + +```text +[required] +fields = ["workspace", "paths"] + +[fields.workspace] +type = "record" + +[fields.workspace.name] +type = "string" +pattern = "^[a-z][a-z0-9-]*$" + +[fields.workspace.version] +type = "string" +pattern = "^\\d+\\.\\d+\\.\\d+$" + +[fields.paths] +type = "record" + +[fields.paths.base] +type = "string" + +[fields.paths.infra] +type = "string" + +[fields.debug] +type = "record" + +[fields.debug.enabled] +type = "bool" + +[fields.debug.log_level] +type = "string" +enum = ["debug", "info", "warn", "error"] +``` + +### Provider Schema (AWS) + +File: `/Users/Akasha/project-provisioning/provisioning/extensions/providers/aws/config.schema.toml` + +```text +[required] +fields = ["provider", "credentials"] + +[fields.provider] +type = "record" + +[fields.provider.name] +type = "string" +enum = ["aws"] + +[fields.provider.region] +type = "string" +pattern = "^[a-z]{2}-[a-z]+-\\d+$" + +[fields.provider.enabled] +type = "bool" + +[fields.credentials] +type = "record" + +[fields.credentials.type] +type = "string" +enum = ["environment", "file", "iam_role"] + +[fields.compute] +type = "record" + +[fields.compute.default_instance_type] +type = "string" + +[fields.compute.default_ami] +type = "string" +pattern = "^ami-[a-f0-9]{8,17}$" + +[fields.network] +type = "record" + +[fields.network.vpc_id] +type = "string" +pattern = "^vpc-[a-f0-9]{8,17}$" + +[fields.network.subnet_id] +type = "string" +pattern = "^subnet-[a-f0-9]{8,17}$" + +[deprecated] +fields = ["old_region_field"] + +[deprecated_replacements] +old_region_field = "provider.region" +``` + +### Platform Service Schema (Orchestrator) + +File: `/Users/Akasha/project-provisioning/provisioning/platform/orchestrator/config.schema.toml` + +```text +[required] +fields = ["service", "server"] + +[fields.service] +type = "record" + +[fields.service.name] +type = "string" +enum = ["orchestrator"] + +[fields.service.enabled] +type = "bool" + +[fields.server] +type = "record" + +[fields.server.host] +type = "string" + +[fields.server.port] +type = "int" +min = 1024 +max = 65535 + +[fields.workers] +type = "int" +min = 1 +max = 32 + +[fields.queue] +type = "record" + +[fields.queue.max_size] +type = "int" +min = 100 +max = 10000 + 
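+# Note: min/max bounds are inclusive; out-of-range values are rejected
+# (for example, max_size = 50 fails with: Must be >= 100)
+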
+[fields.queue.storage_path]
+type = "string"
+```
+
+### KMS Service Schema
+
+File: `/Users/Akasha/project-provisioning/provisioning/core/services/kms/config.schema.toml`
+
+```text
+[required]
+fields = ["kms", "encryption"]
+
+[fields.kms]
+type = "record"
+
+[fields.kms.enabled]
+type = "bool"
+
+[fields.kms.provider]
+type = "string"
+enum = ["aws_kms", "gcp_kms", "azure_kv", "vault", "local"]
+
+[fields.encryption]
+type = "record"
+
+[fields.encryption.algorithm]
+type = "string"
+enum = ["AES-256-GCM", "ChaCha20-Poly1305"]
+
+[fields.encryption.key_rotation_days]
+type = "int"
+min = 30
+max = 365
+
+[fields.vault]
+type = "record"
+
+[fields.vault.address]
+type = "string"
+pattern = "^https?://.*$"
+
+[fields.vault.token_path]
+type = "string"
+
+[deprecated]
+fields = ["old_kms_type"]
+
+[deprecated_replacements]
+old_kms_type = "kms.provider"
+```
+
+## Validation Workflow
+
+### 1. Development
+
+```text
+# Create new config
+vim ~/workspaces/dev/config/provisioning.yaml
+
+# Validate immediately
+provisioning workspace config validate
+
+# Fix errors and revalidate
+vim ~/workspaces/dev/config/provisioning.yaml
+provisioning workspace config validate
+```
+
+### 2. CI/CD Pipeline
+
+```text
+# GitLab CI
+validate-config:
+  stage: validate
+  script:
+    - provisioning workspace config validate
+    - provisioning provider validate aws
+    - provisioning provider validate upcloud
+    - provisioning platform validate orchestrator
+  only:
+    changes:
+      - "*/config/**/*"
+```
+
+### 3. Pre-Deployment
+
+```text
+# Validate all configurations before deployment
+provisioning workspace config validate --verbose
+provisioning provider validate --all
+provisioning platform validate --all
+
+# If valid, proceed with deployment
+if [[ $? -eq 0 ]]; then
+  provisioning deploy --workspace production
+fi
+```
+
+## Error Messages
+
+### Clear Error Format
+
+```text
+❌ Validation failed
+
+Errors:
+  • Required field missing: workspace.name
+  • Field port type mismatch: expected int, got string
+  • Field environment must be one of: dev, staging, prod
+  • Field port must be >= 1024
+  • Field email does not match pattern: ^[a-zA-Z0-9._%+-]+@.*$
+
+⚠️ Warnings:
+  • Field old_field is deprecated. Use new_field instead.
+```
+
+### Error Details
+
+Each error includes:
+
+- **field**: Which field has the error
+- **type**: Error type (missing_required, type_mismatch, invalid_enum, etc.)
+- **message**: Human-readable description
+- **Additional context**: Expected values, patterns, ranges
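+
+As an illustration (hypothetical values; the exact context keys vary by error type), a single entry in the `errors` list could look like:
+
+```text
+# Sketch of one validation error record
+{
+  field: "server.port"
+  type: "type_mismatch"
+  message: "Field port type mismatch: expected int, got string"
+  expected: "int"   # additional context
+  actual: "string"  # additional context
+}
+```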
+
+## Common Validation Patterns
+
+### Pattern 1: Hostname Validation
+
+```text
+[fields.hostname]
+type = "string"
+pattern = "^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$"
+```
+
+### Pattern 2: Email Validation
+
+```text
+[fields.email]
+type = "string"
+pattern = "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
+```
+
+### Pattern 3: Semantic Version
+
+```text
+[fields.version]
+type = "string"
+pattern = "^\\d+\\.\\d+\\.\\d+(-[a-zA-Z0-9]+)?$"
+```
+
+### Pattern 4: URL Validation
+
+```text
+[fields.url]
+type = "string"
+pattern = "^https?://[a-zA-Z0-9.-]+(:[0-9]+)?(/.*)?$"
+```
+
+### Pattern 5: IPv4 Address
+
+```text
+[fields.ip_address]
+type = "string"
+pattern = "^(?:[0-9]{1,3}\\.){3}[0-9]{1,3}$"
+```
+
+### Pattern 6: AWS Resource ID
+
+```text
+[fields.instance_id]
+type = "string"
+pattern = "^i-[a-f0-9]{8,17}$"
+
+[fields.ami_id]
+type = "string"
+pattern = "^ami-[a-f0-9]{8,17}$"
+
+[fields.vpc_id]
+type = "string"
+pattern = "^vpc-[a-f0-9]{8,17}$"
+```
+
+## Testing Validation
+
+### Unit Tests
+
+```text
+# Run validation test suite
+nu provisioning/tests/config_validation_tests.nu
+```
+
+### Integration Tests
+
+```text
+# Test with real configs
+provisioning test validate --workspace dev
+provisioning test validate --workspace staging
+provisioning test validate --workspace prod
+```
+
+### Custom Validation
+
+```text
+# Create custom validation function
+def validate-custom-config [config: record] {
+  mut result = (validate-workspace-config $config)
+
+  # Add custom business logic validation:
+  # production workspaces must not run with debug enabled
+  if ($config.workspace.name | str starts-with "prod") and $config.debug.enabled {
+    $result.valid = false
+    $result.errors = ($result.errors | append {
+      field: "debug.enabled"
+      type: "custom"
+      message: "Debug must be disabled in production"
+    })
+  }
+
+  $result
+}
+```
+
+## Best Practices
+
+### 1. Validate Early
+
+```text
+# Validate during development
+provisioning workspace config validate
+
+# Don't wait for deployment
+```
+
+### 2. Use Strict Schemas
+
+```text
+# Be explicit about types and constraints
+[fields.port]
+type = "int"
+min = 1024
+max = 65535
+
+# Don't leave fields unvalidated
+```
+
+### 3. Document Patterns
+
+```text
+# Include examples in schema
+[fields.email]
+type = "string"
+pattern = "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
+# Example: user@example.com
+```
+
+### 4. Handle Deprecation
+
+```text
+# Always provide replacement guidance
+[deprecated_replacements]
+old_field = "new_field" # Clear migration path
+```
+
+### 5. 
Test Schemas + +```text +# Include test cases in comments +# Valid: "admin@example.com" +# Invalid: "not-an-email" +``` + +## Troubleshooting + +### Schema File Not Found + +```text +# Error: Schema file not found: /path/to/schema.toml + +# Solution: Ensure schema exists +ls -la /Users/Akasha/project-provisioning/provisioning/config/*.schema.toml +``` + +### Pattern Not Matching + +```text +# Error: Field hostname does not match pattern + +# Debug: Test pattern separately +echo "my-hostname" | grep -E "^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$" +``` + +### Type Mismatch + +```text +# Error: Expected int, got string + +# Check config +cat ~/workspaces/dev/config/provisioning.yaml | yq '.server.port' +# Output: "8080" (string) + +# Fix: Remove quotes +vim ~/workspaces/dev/config/provisioning.yaml +# Change: port: "8080" +# To: port: 8080 +``` + +## Additional Resources + +- [Migration Guide](./MIGRATION_GUIDE.md) +- [Workspace Guide](./WORKSPACE_GUIDE.md) +- [Schema Files](../config/*.schema.toml) +- [Validation Tests](../tests/config_validation_tests.nu) \ No newline at end of file diff --git a/docs/src/development/auth-metadata-guide.md b/docs/src/development/auth-metadata-guide.md index 6909c33..04e0ab4 100644 --- a/docs/src/development/auth-metadata-guide.md +++ b/docs/src/development/auth-metadata-guide.md @@ -1 +1,536 @@ -# Metadata-Driven Authentication System - Implementation Guide\n\n**Status**: ✅ Complete and Production-Ready\n**Version**: 1.0.0\n**Last Updated**: 2025-12-10\n\n## Table of Contents\n\n1. [Overview](#overview)\n2. [Architecture](#architecture)\n3. [Installation](#installation)\n4. [Usage Guide](#usage-guide)\n5. [Migration Path](#migration-path)\n6. [Developer Guide](#developer-guide)\n7. [Testing](#testing)\n8. [Troubleshooting](#troubleshooting)\n\n## Overview\n\nThis guide describes the metadata-driven authentication system implemented over 5 weeks across 14 command handlers and 12 major systems. The system provides:\n\n- **Centralized Metadata**: All command definitions in Nickel with runtime validation\n- **Automatic Auth Checks**: Pre-execution validation before handler logic\n- **Performance Optimization**: 40-100x faster through metadata caching\n- **Flexible Deployment**: Works with orchestrator, batch workflows, and direct CLI\n\n## Architecture\n\n### System Components\n\n```\n┌─────────────────────────────────────────────────────────────┐\n│ User Command │\n└────────────────────────────────┬──────────────────────────────┘\n │\n ┌────────────▼─────────────┐\n │ CLI Dispatcher │\n │ (main_provisioning) │\n └────────────┬─────────────┘\n │\n ┌────────────▼─────────────┐\n │ Metadata Loading │\n │ (cached via traits.nu) │\n └────────────┬─────────────┘\n │\n ┌────────────▼─────────────────────┐\n │ Pre-Execution Validation │\n │ - Auth checks │\n │ - Permission validation │\n │ - Operation type mapping │\n └────────────┬─────────────────────┘\n │\n ┌────────────▼─────────────────────┐\n │ Command Handler Execution │\n │ - infrastructure.nu │\n │ - orchestration.nu │\n │ - workspace.nu │\n └────────────┬─────────────────────┘\n │\n ┌────────────▼─────────────┐\n │ Result/Response │\n └─────────────────────────┘\n```\n\n### Data Flow\n\n1. **User Command** → CLI Dispatcher\n2. **Dispatcher** → Load cached metadata (or parse Nickel)\n3. **Validate** → Check auth, operation type, permissions\n4. **Execute** → Call appropriate handler\n5. 
**Return** → Result to user\n\n### Metadata Caching\n\n- **Location**: `~/.cache/provisioning/command_metadata.json`\n- **Format**: Serialized JSON (pre-parsed for speed)\n- **TTL**: 1 hour (configurable via `PROVISIONING_METADATA_TTL`)\n- **Invalidation**: Automatic on `main.ncl` modification\n- **Performance**: 40-100x faster than Nickel parsing\n\n## Installation\n\n### Prerequisites\n\n- Nushell 0.109.0+\n- Nickel 1.15.0+\n- SOPS 3.10.2 (for encrypted configs)\n- Age 1.2.1 (for encryption)\n\n### Installation Steps\n\n```\n# 1. Clone or update repository\ngit clone https://github.com/your-org/project-provisioning.git\ncd project-provisioning\n\n# 2. Initialize workspace\n./provisioning/core/cli/provisioning workspace init\n\n# 3. Validate system\n./provisioning/core/cli/provisioning validate config\n\n# 4. Run system checks\n./provisioning/core/cli/provisioning health\n\n# 5. Run test suites\nnu tests/test-fase5-e2e.nu\nnu tests/test-security-audit-day20.nu\nnu tests/test-metadata-cache-benchmark.nu\n```\n\n## Usage Guide\n\n### Basic Commands\n\n```\n# Initialize authentication\nprovisioning login\n\n# Enroll in MFA\nprovisioning mfa totp enroll\n\n# Create infrastructure\nprovisioning server create --name web-01 --plan 1xCPU-2 GB\n\n# Deploy with orchestrator\nprovisioning workflow submit workflows/deployment.ncl --orchestrated\n\n# Batch operations\nprovisioning batch submit workflows/batch-deploy.ncl\n\n# Check without executing\nprovisioning server create --name test --check\n```\n\n### Authentication Flow\n\n```\n# 1. Login (required for production operations)\n$ provisioning login\nUsername: alice@example.com\nPassword: ****\n\n# 2. Optional: Setup MFA\n$ provisioning mfa totp enroll\nScan QR code with authenticator app\nVerify code: 123456\n\n# 3. Use commands (auth checks happen automatically)\n$ provisioning server delete --name old-server --infra production\nAuth check: Check auth for production (delete operation)\nAre you sure? [yes/no] yes\n✓ Server deleted\n\n# 4. All destructive operations require auth\n$ provisioning taskserv delete postgres web-01\nAuth check: Check auth for destructive operation\n✓ Taskserv deleted\n```\n\n### Check Mode (Bypass Auth for Testing)\n\n```\n# Dry-run without auth checks\nprovisioning server create --name test --check\n\n# Output: Shows what would happen, no auth checks\nDry-run mode - no changes will be made\n✓ Would create server: test\n✓ Would deploy taskservs: []\n```\n\n### Non-Interactive CI/CD Mode\n\n```\n# Automated mode - skip confirmations\nprovisioning server create --name web-01 --yes\n\n# Batch operations\nprovisioning batch submit workflows/batch.ncl --yes --check\n\n# With environment variable\nPROVISIONING_NON_INTERACTIVE=1 provisioning server create --name web-02 --yes\n```\n\n## Migration Path\n\n### Phase 1: From Old `input` to Metadata\n\n**Old Pattern** (Before Fase 5):\n\n```\n# Hardcoded auth check\nlet response = (input "Delete server? (yes/no): ")\nif $response != "yes" { exit 1 }\n\n# No metadata - auth unknown\nexport def delete-server [name: string, --yes] {\n if not $yes { ... manual confirmation ... }\n # ... 
deletion logic ...\n}\n```\n\n**New Pattern** (After Fase 5):\n\n```\n# Metadata header\n# [command]\n# name = "server delete"\n# group = "infrastructure"\n# tags = ["server", "delete", "destructive"]\n# version = "1.0.0"\n\n# Automatic auth check from metadata\nexport def delete-server [name: string, --yes] {\n # Pre-execution check happens in dispatcher\n # Auth enforcement via metadata\n # Operation type: "delete" automatically detected\n # ... deletion logic ...\n}\n```\n\n### Phase 2: Adding Metadata Headers\n\n**For each script that was migrated:**\n\n1. Add metadata header after shebang:\n\n```\n#!/usr/bin/env nu\n# [command]\n# name = "server create"\n# group = "infrastructure"\n# tags = ["server", "create", "interactive"]\n# version = "1.0.0"\n\nexport def create-server [name: string] {\n # Logic here\n}\n```\n\n1. Register in `provisioning/schemas/main.ncl`:\n\n```\nlet server_create = {\n name = "server create",\n domain = "infrastructure",\n description = "Create a new server",\n requirements = {\n interactive = false,\n requires_auth = true,\n auth_type = "jwt",\n side_effect_type = "create",\n min_permission = "write",\n },\n} in\nserver_create\n```\n\n1. Handler integration (happens in dispatcher):\n\n```\n# Dispatcher automatically:\n# 1. Loads metadata for "server create"\n# 2. Validates auth based on requirements\n# 3. Checks permission levels\n# 4. Calls handler if validation passes\n```\n\n### Phase 3: Validating Migration\n\n```\n# Validate metadata headers\nnu utils/validate-metadata-headers.nu\n\n# Find scripts by tag\nnu utils/search-scripts.nu by-tag destructive\n\n# Find all scripts in group\nnu utils/search-scripts.nu by-group infrastructure\n\n# Find scripts with multiple tags\nnu utils/search-scripts.nu by-tags server delete\n\n# List all migrated scripts\nnu utils/search-scripts.nu list\n```\n\n## Developer Guide\n\n### Adding New Commands with Metadata\n\n**Step 1: Create metadata in main.ncl**\n\n```\nlet new_feature_command = {\n name = "feature command",\n domain = "infrastructure",\n description = "My new feature",\n requirements = {\n interactive = false,\n requires_auth = true,\n auth_type = "jwt",\n side_effect_type = "create",\n min_permission = "write",\n },\n} in\nnew_feature_command\n```\n\n**Step 2: Add metadata header to script**\n\n```\n#!/usr/bin/env nu\n# [command]\n# name = "feature command"\n# group = "infrastructure"\n# tags = ["feature", "create"]\n# version = "1.0.0"\n\nexport def feature-command [param: string] {\n # Implementation\n}\n```\n\n**Step 3: Implement handler function**\n\n```\n# Handler registered in dispatcher\nexport def handle-feature-command [\n action: string\n --flags\n]: nothing -> nothing {\n # Dispatcher handles:\n # 1. Metadata validation\n # 2. Auth checks\n # 3. Permission validation\n\n # Your logic here\n}\n```\n\n**Step 4: Test with check mode**\n\n```\n# Dry-run without auth\nprovisioning feature command --check\n\n# Full execution\nprovisioning feature command --yes\n```\n\n### Metadata Field Reference\n\n| Field | Type | Required | Description |\n| ------- | ------ | ---------- | ------------- |\n| name | string | Yes | Command canonical name |\n| domain | string | Yes | Command category (infrastructure, orchestration, etc.) 
|\n| description | string | Yes | Human-readable description |\n| requires_auth | bool | Yes | Whether auth is required |\n| auth_type | enum | Yes | "none", "jwt", "mfa", "cedar" |\n| side_effect_type | enum | Yes | "none", "create", "update", "delete", "deploy" |\n| min_permission | enum | Yes | "read", "write", "admin", "superadmin" |\n| interactive | bool | No | Whether command requires user input |\n| slow_operation | bool | No | Whether operation takes >60 seconds |\n\n### Standard Tags\n\n**Groups**:\n\n- infrastructure - Server, taskserv, cluster operations\n- orchestration - Workflow, batch operations\n- workspace - Workspace management\n- authentication - Auth, MFA, tokens\n- utilities - Helper commands\n\n**Operations**:\n\n- create, read, update, delete - CRUD operations\n- destructive - Irreversible operations\n- interactive - Requires user input\n\n**Performance**:\n\n- slow - Operation >60 seconds\n- optimizable - Candidate for optimization\n\n### Performance Optimization Patterns\n\n**Pattern 1: For Long Operations**\n\n```\n# Use orchestrator for operations >2 seconds\nif (get-operation-duration "my-operation") > 2000 {\n submit-to-orchestrator $operation\n return "Operation submitted in background"\n}\n```\n\n**Pattern 2: For Batch Operations**\n\n```\n# Use batch workflows for multiple operations\nnu -c "\nuse core/nulib/workflows/batch.nu *\nbatch submit workflows/batch-deploy.ncl --parallel-limit 5\n"\n```\n\n**Pattern 3: For Metadata Overhead**\n\n```\n# Cache hit rate optimization\n# Current: 40-100x faster with warm cache\n# Target: >95% cache hit rate\n# Achieved: Metadata stays in cache for 1 hour (TTL)\n```\n\n## Testing\n\n### Running Tests\n\n```\n# End-to-End Integration Tests\nnu tests/test-fase5-e2e.nu\n\n# Security Audit\nnu tests/test-security-audit-day20.nu\n\n# Performance Benchmarks\nnu tests/test-metadata-cache-benchmark.nu\n\n# Run all tests\nfor test in tests/test-*.nu { nu $test }\n```\n\n### Test Coverage\n\n| Test Suite | Category | Coverage |\n| ----------- | ---------- | ---------- |\n| E2E Tests | Integration | 7 test groups, 40+ checks |\n| Security Audit | Auth | 5 audit categories, 100% pass |\n| Benchmarks | Performance | 6 benchmark categories |\n\n### Expected Results\n\n✅ All tests pass\n✅ No Nushell syntax violations\n✅ Cache hit rate >95%\n✅ Auth enforcement 100%\n✅ Performance baselines met\n\n## Troubleshooting\n\n### Issue: Command not found\n\n**Solution**: Ensure metadata is registered in `main.ncl`\n\n```\n# Check if command is in metadata\ngrep "command_name" provisioning/schemas/main.ncl\n```\n\n### Issue: Auth check failing\n\n**Solution**: Verify user has required permission level\n\n```\n# Check current user permissions\nprovisioning auth whoami\n\n# Check command requirements\nnu -c "\nuse core/nulib/lib_provisioning/commands/traits.nu *\nget-command-metadata 'server create'\n"\n```\n\n### Issue: Slow command execution\n\n**Solution**: Check cache status\n\n```\n# Force cache reload\nrm ~/.cache/provisioning/command_metadata.json\n\n# Check cache hit rate\nnu tests/test-metadata-cache-benchmark.nu\n```\n\n### Issue: Nushell syntax error\n\n**Solution**: Run compliance check\n\n```\n# Validate Nushell compliance\nnu --ide-check 100 \n\n# Check for common issues\ngrep "try {" # Should be empty\ngrep "let mut" # Should be empty\n```\n\n## Performance Characteristics\n\n### Baseline Metrics\n\n| Operation | Cold | Warm | Improvement |\n| ----------- | ------ | ------ | ------------- |\n| Metadata Load | 200 ms | 2-5 ms | 
40-100x |\n| Auth Check | <5 ms | <5 ms | Same |\n| Command Dispatch | <10 ms | <10 ms | Same |\n| Total Command | ~210 ms | ~10 ms | 21x |\n\n### Real-World Impact\n\n```\nScenario: 20 sequential commands\n Without cache: 20 × 200 ms = 4 seconds\n With cache: 1 × 200 ms + 19 × 5 ms = 295 ms\n Speedup: ~13.5x faster\n```\n\n## Next Steps\n\n1. **Deploy**: Use installer to deploy to production\n2. **Monitor**: Watch cache hit rates (target >95%)\n3. **Extend**: Add new commands following migration pattern\n4. **Optimize**: Use profiling to identify slow operations\n5. **Maintain**: Run validation scripts regularly\n\n---\n\n**For Support**: See `docs/troubleshooting-guide.md`\n**For Architecture**: See `docs/architecture/`\n**For User Guide**: See `docs/user/AUTHENTICATION_LAYER_GUIDE.md` +# Metadata-Driven Authentication System - Implementation Guide + +**Status**: ✅ Complete and Production-Ready +**Version**: 1.0.0 +**Last Updated**: 2025-12-10 + +## Table of Contents + +1. [Overview](#overview) +2. [Architecture](#architecture) +3. [Installation](#installation) +4. [Usage Guide](#usage-guide) +5. [Migration Path](#migration-path) +6. [Developer Guide](#developer-guide) +7. [Testing](#testing) +8. [Troubleshooting](#troubleshooting) + +## Overview + +This guide describes the metadata-driven authentication system implemented over 5 weeks across 14 command handlers and 12 major systems. The system provides: + +- **Centralized Metadata**: All command definitions in Nickel with runtime validation +- **Automatic Auth Checks**: Pre-execution validation before handler logic +- **Performance Optimization**: 40-100x faster through metadata caching +- **Flexible Deployment**: Works with orchestrator, batch workflows, and direct CLI + +## Architecture + +### System Components + +```text +┌─────────────────────────────────────────────────────────────┐ +│ User Command │ +└────────────────────────────────┬──────────────────────────────┘ + │ + ┌────────────▼─────────────┐ + │ CLI Dispatcher │ + │ (main_provisioning) │ + └────────────┬─────────────┘ + │ + ┌────────────▼─────────────┐ + │ Metadata Loading │ + │ (cached via traits.nu) │ + └────────────┬─────────────┘ + │ + ┌────────────▼─────────────────────┐ + │ Pre-Execution Validation │ + │ - Auth checks │ + │ - Permission validation │ + │ - Operation type mapping │ + └────────────┬─────────────────────┘ + │ + ┌────────────▼─────────────────────┐ + │ Command Handler Execution │ + │ - infrastructure.nu │ + │ - orchestration.nu │ + │ - workspace.nu │ + └────────────┬─────────────────────┘ + │ + ┌────────────▼─────────────┐ + │ Result/Response │ + └─────────────────────────┘ +``` + +### Data Flow + +1. **User Command** → CLI Dispatcher +2. **Dispatcher** → Load cached metadata (or parse Nickel) +3. **Validate** → Check auth, operation type, permissions +4. **Execute** → Call appropriate handler +5. **Return** → Result to user + +### Metadata Caching + +- **Location**: `~/.cache/provisioning/command_metadata.json` +- **Format**: Serialized JSON (pre-parsed for speed) +- **TTL**: 1 hour (configurable via `PROVISIONING_METADATA_TTL`) +- **Invalidation**: Automatic on `main.ncl` modification +- **Performance**: 40-100x faster than Nickel parsing + +## Installation + +### Prerequisites + +- Nushell 0.109.0+ +- Nickel 1.15.0+ +- SOPS 3.10.2 (for encrypted configs) +- Age 1.2.1 (for encryption) + +### Installation Steps + +```text +# 1. Clone or update repository +git clone https://github.com/your-org/project-provisioning.git +cd project-provisioning + +# 2. 
Initialize workspace
+./provisioning/core/cli/provisioning workspace init
+
+# 3. Validate system
+./provisioning/core/cli/provisioning validate config
+
+# 4. Run system checks
+./provisioning/core/cli/provisioning health
+
+# 5. Run test suites
+nu tests/test-fase5-e2e.nu
+nu tests/test-security-audit-day20.nu
+nu tests/test-metadata-cache-benchmark.nu
+```
+
+## Usage Guide
+
+### Basic Commands
+
+```text
+# Initialize authentication
+provisioning login
+
+# Enroll in MFA
+provisioning mfa totp enroll
+
+# Create infrastructure
+provisioning server create --name web-01 --plan 1xCPU-2GB
+
+# Deploy with orchestrator
+provisioning workflow submit workflows/deployment.ncl --orchestrated
+
+# Batch operations
+provisioning batch submit workflows/batch-deploy.ncl
+
+# Check without executing
+provisioning server create --name test --check
+```
+
+### Authentication Flow
+
+```text
+# 1. Login (required for production operations)
+$ provisioning login
+Username: alice@example.com
+Password: ****
+
+# 2. Optional: Setup MFA
+$ provisioning mfa totp enroll
+Scan QR code with authenticator app
+Verify code: 123456
+
+# 3. Use commands (auth checks happen automatically)
+$ provisioning server delete --name old-server --infra production
+Auth check: Check auth for production (delete operation)
+Are you sure? [yes/no] yes
+✓ Server deleted
+
+# 4. All destructive operations require auth
+$ provisioning taskserv delete postgres web-01
+Auth check: Check auth for destructive operation
+✓ Taskserv deleted
+```
+
+### Check Mode (Bypass Auth for Testing)
+
+```text
+# Dry-run without auth checks
+provisioning server create --name test --check
+
+# Output: Shows what would happen, no auth checks
+Dry-run mode - no changes will be made
+✓ Would create server: test
+✓ Would deploy taskservs: []
+```
+
+### Non-Interactive CI/CD Mode
+
+```text
+# Automated mode - skip confirmations
+provisioning server create --name web-01 --yes
+
+# Batch operations
+provisioning batch submit workflows/batch.ncl --yes --check
+
+# With environment variable
+PROVISIONING_NON_INTERACTIVE=1 provisioning server create --name web-02 --yes
+```
+
+## Migration Path
+
+### Phase 1: From Old `input` to Metadata
+
+**Old Pattern** (Before Fase 5):
+
+```text
+# Hardcoded auth check
+let response = (input "Delete server? (yes/no): ")
+if $response != "yes" { exit 1 }
+
+# No metadata - auth unknown
+export def delete-server [name: string, --yes] {
+  if not $yes { ... manual confirmation ... }
+  # ... deletion logic ...
+}
+```
+
+**New Pattern** (After Fase 5):
+
+```text
+# Metadata header
+# [command]
+# name = "server delete"
+# group = "infrastructure"
+# tags = ["server", "delete", "destructive"]
+# version = "1.0.0"
+
+# Automatic auth check from metadata
+export def delete-server [name: string, --yes] {
+  # Pre-execution check happens in dispatcher
+  # Auth enforcement via metadata
+  # Operation type: "delete" automatically detected
+  # ... deletion logic ...
+}
+```
+
+### Phase 2: Adding Metadata Headers
+
+**For each script that was migrated:**
+
+1. Add metadata header after shebang:
+
+```text
+#!/usr/bin/env nu
+# [command]
+# name = "server create"
+# group = "infrastructure"
+# tags = ["server", "create", "interactive"]
+# version = "1.0.0"
+
+export def create-server [name: string] {
+  # Logic here
+}
+```
+
+1. 
Register in `provisioning/schemas/main.ncl`: + +```text +let server_create = { + name = "server create", + domain = "infrastructure", + description = "Create a new server", + requirements = { + interactive = false, + requires_auth = true, + auth_type = "jwt", + side_effect_type = "create", + min_permission = "write", + }, +} in +server_create +``` + +1. Handler integration (happens in dispatcher): + +```text +# Dispatcher automatically: +# 1. Loads metadata for "server create" +# 2. Validates auth based on requirements +# 3. Checks permission levels +# 4. Calls handler if validation passes +``` + +### Phase 3: Validating Migration + +```text +# Validate metadata headers +nu utils/validate-metadata-headers.nu + +# Find scripts by tag +nu utils/search-scripts.nu by-tag destructive + +# Find all scripts in group +nu utils/search-scripts.nu by-group infrastructure + +# Find scripts with multiple tags +nu utils/search-scripts.nu by-tags server delete + +# List all migrated scripts +nu utils/search-scripts.nu list +``` + +## Developer Guide + +### Adding New Commands with Metadata + +**Step 1: Create metadata in main.ncl** + +```text +let new_feature_command = { + name = "feature command", + domain = "infrastructure", + description = "My new feature", + requirements = { + interactive = false, + requires_auth = true, + auth_type = "jwt", + side_effect_type = "create", + min_permission = "write", + }, +} in +new_feature_command +``` + +**Step 2: Add metadata header to script** + +```text +#!/usr/bin/env nu +# [command] +# name = "feature command" +# group = "infrastructure" +# tags = ["feature", "create"] +# version = "1.0.0" + +export def feature-command [param: string] { + # Implementation +} +``` + +**Step 3: Implement handler function** + +```text +# Handler registered in dispatcher +export def handle-feature-command [ + action: string + --flags +]: nothing -> nothing { + # Dispatcher handles: + # 1. Metadata validation + # 2. Auth checks + # 3. Permission validation + + # Your logic here +} +``` + +**Step 4: Test with check mode** + +```text +# Dry-run without auth +provisioning feature command --check + +# Full execution +provisioning feature command --yes +``` + +### Metadata Field Reference + +| Field | Type | Required | Description | +| ------- | ------ | ---------- | ------------- | +| name | string | Yes | Command canonical name | +| domain | string | Yes | Command category (infrastructure, orchestration, etc.) 
| +| description | string | Yes | Human-readable description | +| requires_auth | bool | Yes | Whether auth is required | +| auth_type | enum | Yes | "none", "jwt", "mfa", "cedar" | +| side_effect_type | enum | Yes | "none", "create", "update", "delete", "deploy" | +| min_permission | enum | Yes | "read", "write", "admin", "superadmin" | +| interactive | bool | No | Whether command requires user input | +| slow_operation | bool | No | Whether operation takes >60 seconds | + +### Standard Tags + +**Groups**: + +- infrastructure - Server, taskserv, cluster operations +- orchestration - Workflow, batch operations +- workspace - Workspace management +- authentication - Auth, MFA, tokens +- utilities - Helper commands + +**Operations**: + +- create, read, update, delete - CRUD operations +- destructive - Irreversible operations +- interactive - Requires user input + +**Performance**: + +- slow - Operation >60 seconds +- optimizable - Candidate for optimization + +### Performance Optimization Patterns + +**Pattern 1: For Long Operations** + +```text +# Use orchestrator for operations >2 seconds +if (get-operation-duration "my-operation") > 2000 { + submit-to-orchestrator $operation + return "Operation submitted in background" +} +``` + +**Pattern 2: For Batch Operations** + +```text +# Use batch workflows for multiple operations +nu -c " +use core/nulib/workflows/batch.nu * +batch submit workflows/batch-deploy.ncl --parallel-limit 5 +" +``` + +**Pattern 3: For Metadata Overhead** + +```text +# Cache hit rate optimization +# Current: 40-100x faster with warm cache +# Target: >95% cache hit rate +# Achieved: Metadata stays in cache for 1 hour (TTL) +``` + +## Testing + +### Running Tests + +```text +# End-to-End Integration Tests +nu tests/test-fase5-e2e.nu + +# Security Audit +nu tests/test-security-audit-day20.nu + +# Performance Benchmarks +nu tests/test-metadata-cache-benchmark.nu + +# Run all tests +for test in tests/test-*.nu { nu $test } +``` + +### Test Coverage + +| Test Suite | Category | Coverage | +| ----------- | ---------- | ---------- | +| E2E Tests | Integration | 7 test groups, 40+ checks | +| Security Audit | Auth | 5 audit categories, 100% pass | +| Benchmarks | Performance | 6 benchmark categories | + +### Expected Results + +✅ All tests pass +✅ No Nushell syntax violations +✅ Cache hit rate >95% +✅ Auth enforcement 100% +✅ Performance baselines met + +## Troubleshooting + +### Issue: Command not found + +**Solution**: Ensure metadata is registered in `main.ncl` + +```text +# Check if command is in metadata +grep "command_name" provisioning/schemas/main.ncl +``` + +### Issue: Auth check failing + +**Solution**: Verify user has required permission level + +```text +# Check current user permissions +provisioning auth whoami + +# Check command requirements +nu -c " +use core/nulib/lib_provisioning/commands/traits.nu * +get-command-metadata 'server create' +" +``` + +### Issue: Slow command execution + +**Solution**: Check cache status + +```text +# Force cache reload +rm ~/.cache/provisioning/command_metadata.json + +# Check cache hit rate +nu tests/test-metadata-cache-benchmark.nu +``` + +### Issue: Nushell syntax error + +**Solution**: Run compliance check + +```text +# Validate Nushell compliance +nu --ide-check 100 + +# Check for common issues +grep "try {" # Should be empty +grep "let mut" # Should be empty +``` + +## Performance Characteristics + +### Baseline Metrics + +| Operation | Cold | Warm | Improvement | +| ----------- | ------ | ------ | ------------- | +| Metadata 
Load | 200 ms | 2-5 ms | 40-100x | +| Auth Check | <5 ms | <5 ms | Same | +| Command Dispatch | <10 ms | <10 ms | Same | +| Total Command | ~210 ms | ~10 ms | 21x | + +### Real-World Impact + +```text +Scenario: 20 sequential commands + Without cache: 20 × 200 ms = 4 seconds + With cache: 1 × 200 ms + 19 × 5 ms = 295 ms + Speedup: ~13.5x faster +``` + +## Next Steps + +1. **Deploy**: Use installer to deploy to production +2. **Monitor**: Watch cache hit rates (target >95%) +3. **Extend**: Add new commands following migration pattern +4. **Optimize**: Use profiling to identify slow operations +5. **Maintain**: Run validation scripts regularly + +--- + +**For Support**: See `docs/troubleshooting-guide.md` +**For Architecture**: See `docs/architecture/` +**For User Guide**: See `docs/user/AUTHENTICATION_LAYER_GUIDE.md` \ No newline at end of file diff --git a/docs/src/development/build-system.md b/docs/src/development/build-system.md index 7bf7278..d1c489f 100644 --- a/docs/src/development/build-system.md +++ b/docs/src/development/build-system.md @@ -1 +1,1076 @@ -# Build System Documentation\n\nThis document provides comprehensive documentation for the provisioning project's build system, including the complete Makefile reference with 40+\ntargets, build tools, compilation instructions, and troubleshooting.\n\n## Table of Contents\n\n1. [Overview](#overview)\n2. [Quick Start](#quick-start)\n3. [Makefile Reference](#makefile-reference)\n4. [Build Tools](#build-tools)\n5. [Cross-Platform Compilation](#cross-platform-compilation)\n6. [Dependency Management](#dependency-management)\n7. [Troubleshooting](#troubleshooting)\n8. [CI/CD Integration](#cicd-integration)\n\n## Overview\n\nThe build system is a comprehensive, Makefile-based solution that orchestrates:\n\n- **Rust compilation**: Platform binaries (orchestrator, control-center, etc.)\n- **Nushell bundling**: Core libraries and CLI tools\n- **Nickel validation**: Configuration schema validation\n- **Distribution generation**: Multi-platform packages\n- **Release management**: Automated release pipelines\n- **Documentation generation**: API and user documentation\n\n**Location**: `/src/tools/`\n**Main entry point**: `/src/tools/Makefile`\n\n## Quick Start\n\n```{$detected_lang}\n# Navigate to build system\ncd src/tools\n\n# View all available targets\nmake help\n\n# Complete build and package\nmake all\n\n# Development build (quick)\nmake dev-build\n\n# Build for specific platform\nmake linux\nmake macos\nmake windows\n\n# Clean everything\nmake clean\n\n# Check build system status\nmake status\n```\n\n## Makefile Reference\n\n### Build Configuration\n\n**Variables**:\n\n```{$detected_lang}\n# Project metadata\nPROJECT_NAME := provisioning\nVERSION := $(git describe --tags --always --dirty)\nBUILD_TIME := $(date -u +"%Y-%m-%dT%H:%M:%SZ")\n\n# Build configuration\nRUST_TARGET := x86_64-unknown-linux-gnu\nBUILD_MODE := release\nPLATFORMS := linux-amd64,macos-amd64,windows-amd64\nVARIANTS := complete,minimal\n\n# Flags\nVERBOSE := false\nDRY_RUN := false\nPARALLEL := true\n```\n\n### Build Targets\n\n#### Primary Build Targets\n\n**`make all`** - Complete build, package, and test\n\n- Runs: `clean build-all package-all test-dist`\n- Use for: Production releases, complete validation\n\n**`make build-all`** - Build all components\n\n- Runs: `build-platform build-core validate-nickel`\n- Use for: Complete system compilation\n\n**`make build-platform`** - Build platform binaries for all targets\n\n```{$detected_lang}\nmake build-platform\n# 
Equivalent to:\nnu tools/build/compile-platform.nu \\n --target x86_64-unknown-linux-gnu \\n --release \\n --output-dir dist/platform \\n --verbose=false\n```\n\n**`make build-core`** - Bundle core Nushell libraries\n\n```{$detected_lang}\nmake build-core\n# Equivalent to:\nnu tools/build/bundle-core.nu \\n --output-dir dist/core \\n --config-dir dist/config \\n --validate \\n --exclude-dev\n```\n\n**`make validate-nickel`** - Validate and compile Nickel schemas\n\n```{$detected_lang}\nmake validate-nickel\n# Equivalent to:\nnu tools/build/validate-nickel.nu \\n --output-dir dist/schemas \\n --format-code \\n --check-dependencies\n```\n\n**`make build-cross`** - Cross-compile for multiple platforms\n\n- Builds for all platforms in `PLATFORMS` variable\n- Parallel execution support\n- Failure handling for each platform\n\n#### Package Targets\n\n**`make package-all`** - Create all distribution packages\n\n- Runs: `dist-generate package-binaries package-containers`\n\n**`make dist-generate`** - Generate complete distributions\n\n```{$detected_lang}\nmake dist-generate\n# Advanced usage:\nmake dist-generate PLATFORMS=linux-amd64,macos-amd64 VARIANTS=complete\n```\n\n**`make package-binaries`** - Package binaries for distribution\n\n- Creates platform-specific archives\n- Strips debug symbols\n- Generates checksums\n\n**`make package-containers`** - Build container images\n\n- Multi-platform container builds\n- Optimized layers and caching\n- Version tagging\n\n**`make create-archives`** - Create distribution archives\n\n- TAR and ZIP formats\n- Platform-specific and universal archives\n- Compression and checksums\n\n**`make create-installers`** - Create installation packages\n\n- Shell script installers\n- Platform-specific packages (DEB, RPM, MSI)\n- Uninstaller creation\n\n#### Release Targets\n\n**`make release`** - Create a complete release (requires VERSION)\n\n```{$detected_lang}\nmake release VERSION=2.1.0\n```\n\nFeatures:\n\n- Automated changelog generation\n- Git tag creation and push\n- Artifact upload\n- Comprehensive validation\n\n**`make release-draft`** - Create a draft release\n\n- Create without publishing\n- Review artifacts before release\n- Manual approval workflow\n\n**`make upload-artifacts`** - Upload release artifacts\n\n- GitHub Releases\n- Container registries\n- Package repositories\n- Verification and validation\n\n**`make notify-release`** - Send release notifications\n\n- Slack notifications\n- Discord announcements\n- Email notifications\n- Custom webhook support\n\n**`make update-registry`** - Update package manager registries\n\n- Homebrew formula updates\n- APT repository updates\n- Custom registry support\n\n#### Development and Testing Targets\n\n**`make dev-build`** - Quick development build\n\n```{$detected_lang}\nmake dev-build\n# Fast build with minimal validation\n```\n\n**`make test-build`** - Test build system\n\n- Validates build process\n- Runs with test configuration\n- Comprehensive logging\n\n**`make test-dist`** - Test generated distributions\n\n- Validates distribution integrity\n- Tests installation process\n- Platform compatibility checks\n\n**`make validate-all`** - Validate all components\n\n- Nickel schema validation\n- Package validation\n- Configuration validation\n\n**`make benchmark`** - Run build benchmarks\n\n- Times build process\n- Performance analysis\n- Resource usage monitoring\n\n#### Documentation Targets\n\n**`make docs`** - Generate documentation\n\n```{$detected_lang}\nmake docs\n# Generates API docs, user guides, and 
examples\n```\n\n**`make docs-serve`** - Generate and serve documentation locally\n\n- Starts local HTTP server on port 8000\n- Live documentation browsing\n- Development documentation workflow\n\n#### Utility Targets\n\n**`make clean`** - Clean all build artifacts\n\n```{$detected_lang}\nmake clean\n# Removes all build, distribution, and package directories\n```\n\n**`make clean-dist`** - Clean only distribution artifacts\n\n- Preserves build cache\n- Removes distribution packages\n- Faster cleanup option\n\n**`make install`** - Install the built system locally\n\n- Requires distribution to be built\n- Installs to system directories\n- Creates uninstaller\n\n**`make uninstall`** - Uninstall the system\n\n- Removes system installation\n- Cleans configuration\n- Removes service files\n\n**`make status`** - Show build system status\n\n```{$detected_lang}\nmake status\n# Output:\n# Build System Status\n# ===================\n# Project: provisioning\n# Version: v2.1.0-5-g1234567\n# Git Commit: 1234567890abcdef\n# Build Time: 2025-09-25T14:30:22Z\n#\n# Directories:\n# Source: /Users/user/repo-cnz/src\n# Tools: /Users/user/repo-cnz/src/tools\n# Build: /Users/user/repo-cnz/src/target\n# Distribution: /Users/user/repo-cnz/src/dist\n# Packages: /Users/user/repo-cnz/src/packages\n```\n\n**`make info`** - Show detailed system information\n\n- OS and architecture details\n- Tool versions (Nushell, Rust, Docker, Git)\n- Environment information\n- Build prerequisites\n\n#### CI/CD Integration Targets\n\n**`make ci-build`** - CI build pipeline\n\n- Complete validation build\n- Suitable for automated CI systems\n- Comprehensive testing\n\n**`make ci-test`** - CI test pipeline\n\n- Validation and testing only\n- Fast feedback for pull requests\n- Quality assurance\n\n**`make ci-release`** - CI release pipeline\n\n- Build and packaging for releases\n- Artifact preparation\n- Release candidate creation\n\n**`make cd-deploy`** - CD deployment pipeline\n\n- Complete release and deployment\n- Artifact upload and distribution\n- User notifications\n\n#### Platform-Specific Targets\n\n**`make linux`** - Build for Linux only\n\n```{$detected_lang}\nmake linux\n# Sets PLATFORMS=linux-amd64\n```\n\n**`make macos`** - Build for macOS only\n\n```{$detected_lang}\nmake macos\n# Sets PLATFORMS=macos-amd64\n```\n\n**`make windows`** - Build for Windows only\n\n```{$detected_lang}\nmake windows\n# Sets PLATFORMS=windows-amd64\n```\n\n#### Debugging Targets\n\n**`make debug`** - Build with debug information\n\n```{$detected_lang}\nmake debug\n# Sets BUILD_MODE=debug VERBOSE=true\n```\n\n**`make debug-info`** - Show debug information\n\n- Make variables and environment\n- Build system diagnostics\n- Troubleshooting information\n\n## Build Tools\n\n### Core Build Scripts\n\nAll build tools are implemented as Nushell scripts with comprehensive parameter validation and error handling.\n\n#### `/src/tools/build/compile-platform.nu`\n\n**Purpose**: Compiles all Rust components for distribution\n\n**Components Compiled**:\n\n- `orchestrator` → `provisioning-orchestrator` binary\n- `control-center` → `control-center` binary\n- `control-center-ui` → Web UI assets\n- `mcp-server-rust` → MCP integration binary\n\n**Usage**:\n\n```{$detected_lang}\nnu compile-platform.nu [options]\n\nOptions:\n --target STRING Target platform (default: x86_64-unknown-linux-gnu)\n --release Build in release mode\n --features STRING Comma-separated features to enable\n --output-dir STRING Output directory (default: dist/platform)\n --verbose Enable 
verbose logging\n --clean Clean before building\n```\n\n**Example**:\n\n```{$detected_lang}\nnu compile-platform.nu \\n --target x86_64-apple-darwin \\n --release \\n --features "surrealdb,telemetry" \\n --output-dir dist/macos \\n --verbose\n```\n\n#### `/src/tools/build/bundle-core.nu`\n\n**Purpose**: Bundles Nushell core libraries and CLI for distribution\n\n**Components Bundled**:\n\n- Nushell provisioning CLI wrapper\n- Core Nushell libraries (`lib_provisioning`)\n- Configuration system\n- Template system\n- Extensions and plugins\n\n**Usage**:\n\n```{$detected_lang}\nnu bundle-core.nu [options]\n\nOptions:\n --output-dir STRING Output directory (default: dist/core)\n --config-dir STRING Configuration directory (default: dist/config)\n --validate Validate Nushell syntax\n --compress Compress bundle with gzip\n --exclude-dev Exclude development files (default: true)\n --verbose Enable verbose logging\n```\n\n**Validation Features**:\n\n- Syntax validation of all Nushell files\n- Import dependency checking\n- Function signature validation\n- Test execution (if tests present)\n\n#### `/src/tools/build/validate-nickel.nu`\n\n**Purpose**: Validates and compiles Nickel schemas\n\n**Validation Process**:\n\n1. Syntax validation of all `.ncl` files\n2. Schema dependency checking\n3. Type constraint validation\n4. Example validation against schemas\n5. Documentation generation\n\n**Usage**:\n\n```{$detected_lang}\nnu validate-nickel.nu [options]\n\nOptions:\n --output-dir STRING Output directory (default: dist/schemas)\n --format-code Format Nickel code during validation\n --check-dependencies Validate schema dependencies\n --verbose Enable verbose logging\n```\n\n#### `/src/tools/build/test-distribution.nu`\n\n**Purpose**: Tests generated distributions for correctness\n\n**Test Types**:\n\n- **Basic**: Installation test, CLI help, version check\n- **Integration**: Server creation, configuration validation\n- **Complete**: Full workflow testing including cluster operations\n\n**Usage**:\n\n```{$detected_lang}\nnu test-distribution.nu [options]\n\nOptions:\n --dist-dir STRING Distribution directory (default: dist)\n --test-types STRING Test types: basic,integration,complete\n --platform STRING Target platform for testing\n --cleanup Remove test files after completion\n --verbose Enable verbose logging\n```\n\n#### `/src/tools/build/clean-build.nu`\n\n**Purpose**: Intelligent build artifact cleanup\n\n**Cleanup Scopes**:\n\n- **all**: Complete cleanup (build, dist, packages, cache)\n- **dist**: Distribution artifacts only\n- **cache**: Build cache and temporary files\n- **old**: Files older than specified age\n\n**Usage**:\n\n```{$detected_lang}\nnu clean-build.nu [options]\n\nOptions:\n --scope STRING Cleanup scope: all,dist,cache,old\n --age DURATION Age threshold for 'old' scope (default: 7d)\n --force Force cleanup without confirmation\n --dry-run Show what would be cleaned without doing it\n --verbose Enable verbose logging\n```\n\n### Distribution Tools\n\n#### `/src/tools/distribution/generate-distribution.nu`\n\n**Purpose**: Main distribution generator orchestrating the complete process\n\n**Generation Process**:\n\n1. Platform binary compilation\n2. Core library bundling\n3. Nickel schema validation and packaging\n4. Configuration system preparation\n5. Documentation generation\n6. Archive creation and compression\n7. Installer generation\n8. 
Validation and testing\n\n**Usage**:\n\n```{$detected_lang}\nnu generate-distribution.nu [command] [options]\n\nCommands:\n Generate complete distribution\n quick Quick development distribution\n status Show generation status\n\nOptions:\n --version STRING Version to build (default: auto-detect)\n --platforms STRING Comma-separated platforms\n --variants STRING Variants: complete,minimal\n --output-dir STRING Output directory (default: dist)\n --compress Enable compression\n --generate-docs Generate documentation\n --parallel-builds Enable parallel builds\n --validate-output Validate generated output\n --verbose Enable verbose logging\n```\n\n**Advanced Examples**:\n\n```{$detected_lang}\n# Complete multi-platform release\nnu generate-distribution.nu \\n --version 2.1.0 \\n --platforms linux-amd64,macos-amd64,windows-amd64 \\n --variants complete,minimal \\n --compress \\n --generate-docs \\n --parallel-builds \\n --validate-output\n\n# Quick development build\nnu generate-distribution.nu quick \\n --platform linux \\n --variant minimal\n\n# Status check\nnu generate-distribution.nu status\n```\n\n#### `/src/tools/distribution/create-installer.nu`\n\n**Purpose**: Creates platform-specific installers\n\n**Installer Types**:\n\n- **shell**: Shell script installer (cross-platform)\n- **package**: Platform packages (DEB, RPM, MSI, PKG)\n- **container**: Container image with provisioning\n- **source**: Source distribution with build instructions\n\n**Usage**:\n\n```{$detected_lang}\nnu create-installer.nu DISTRIBUTION_DIR [options]\n\nOptions:\n --output-dir STRING Installer output directory\n --installer-types STRING Installer types: shell,package,container,source\n --platforms STRING Target platforms\n --include-services Include systemd/launchd service files\n --create-uninstaller Generate uninstaller\n --validate-installer Test installer functionality\n --verbose Enable verbose logging\n```\n\n### Package Tools\n\n#### `/src/tools/package/package-binaries.nu`\n\n**Purpose**: Packages compiled binaries for distribution\n\n**Package Formats**:\n\n- **archive**: TAR.GZ and ZIP archives\n- **standalone**: Single binary with embedded resources\n- **installer**: Platform-specific installer packages\n\n**Features**:\n\n- Binary stripping for size reduction\n- Compression optimization\n- Checksum generation (SHA256, MD5)\n- Digital signing (if configured)\n\n#### `/src/tools/package/build-containers.nu`\n\n**Purpose**: Builds optimized container images\n\n**Container Features**:\n\n- Multi-stage builds for minimal image size\n- Security scanning integration\n- Multi-platform image generation\n- Layer caching optimization\n- Runtime environment configuration\n\n### Release Tools\n\n#### `/src/tools/release/create-release.nu`\n\n**Purpose**: Automated release creation and management\n\n**Release Process**:\n\n1. Version validation and tagging\n2. Changelog generation from git history\n3. Asset building and validation\n4. Release creation (GitHub, GitLab, etc.)\n5. Asset upload and verification\n6. 
Release announcement preparation\n\n**Usage**:\n\n```{$detected_lang}\nnu create-release.nu [options]\n\nOptions:\n --version STRING Release version (required)\n --asset-dir STRING Directory containing release assets\n --draft Create draft release\n --prerelease Mark as pre-release\n --generate-changelog Auto-generate changelog\n --push-tag Push git tag\n --auto-upload Upload assets automatically\n --verbose Enable verbose logging\n```\n\n## Cross-Platform Compilation\n\n### Supported Platforms\n\n**Primary Platforms**:\n\n- `linux-amd64` (x86_64-unknown-linux-gnu)\n- `macos-amd64` (x86_64-apple-darwin)\n- `windows-amd64` (x86_64-pc-windows-gnu)\n\n**Additional Platforms**:\n\n- `linux-arm64` (aarch64-unknown-linux-gnu)\n- `macos-arm64` (aarch64-apple-darwin)\n- `freebsd-amd64` (x86_64-unknown-freebsd)\n\n### Cross-Compilation Setup\n\n**Install Rust Targets**:\n\n```{$detected_lang}\n# Install additional targets\nrustup target add x86_64-apple-darwin\nrustup target add x86_64-pc-windows-gnu\nrustup target add aarch64-unknown-linux-gnu\nrustup target add aarch64-apple-darwin\n```\n\n**Platform-Specific Dependencies**:\n\n**macOS Cross-Compilation**:\n\n```{$detected_lang}\n# Install osxcross toolchain\nbrew install FiloSottile/musl-cross/musl-cross\nbrew install mingw-w64\n```\n\n**Windows Cross-Compilation**:\n\n```{$detected_lang}\n# Install Windows dependencies\nbrew install mingw-w64\n# or on Linux:\nsudo apt-get install gcc-mingw-w64\n```\n\n### Cross-Compilation Usage\n\n**Single Platform**:\n\n```{$detected_lang}\n# Build for macOS from Linux\nmake build-platform RUST_TARGET=x86_64-apple-darwin\n\n# Build for Windows\nmake build-platform RUST_TARGET=x86_64-pc-windows-gnu\n```\n\n**Multiple Platforms**:\n\n```{$detected_lang}\n# Build for all configured platforms\nmake build-cross\n\n# Specify platforms\nmake build-cross PLATFORMS=linux-amd64,macos-amd64,windows-amd64\n```\n\n**Platform-Specific Targets**:\n\n```{$detected_lang}\n# Quick platform builds\nmake linux # Linux AMD64\nmake macos # macOS AMD64\nmake windows # Windows AMD64\n```\n\n## Dependency Management\n\n### Build Dependencies\n\n**Required Tools**:\n\n- **Nushell 0.107.1+**: Core shell and scripting\n- **Rust 1.70+**: Platform binary compilation\n- **Cargo**: Rust package management\n- **KCL 0.11.2+**: Configuration language\n- **Git**: Version control and tagging\n\n**Optional Tools**:\n\n- **Docker**: Container image building\n- **Cross**: Simplified cross-compilation\n- **SOPS**: Secrets management\n- **Age**: Encryption for secrets\n\n### Dependency Validation\n\n**Check Dependencies**:\n\n```{$detected_lang}\nmake info\n# Shows versions of all required tools\n\n# Output example:\n# Tool Versions:\n# Nushell: 0.107.1\n# Rust: rustc 1.75.0\n# Docker: Docker version 24.0.6\n# Git: git version 2.42.0\n```\n\n**Install Missing Dependencies**:\n\n```{$detected_lang}\n# Install Nushell\ncargo install nu\n\n# Install Nickel\ncargo install nickel\n\n# Install Cross (for cross-compilation)\ncargo install cross\n```\n\n### Dependency Caching\n\n**Rust Dependencies**:\n\n- Cargo cache: `~/.cargo/registry`\n- Target cache: `target/` directory\n- Cross-compilation cache: `~/.cache/cross`\n\n**Build Cache Management**:\n\n```{$detected_lang}\n# Clean Cargo cache\ncargo clean\n\n# Clean cross-compilation cache\ncross clean\n\n# Clean all caches\nmake clean SCOPE=cache\n```\n\n## Troubleshooting\n\n### Common Build Issues\n\n#### Rust Compilation Errors\n\n**Error**: `linker 'cc' not found`\n\n```{$detected_lang}\n# Solution: 
Install build essentials\nsudo apt-get install build-essential # Linux\nxcode-select --install # macOS\n```\n\n**Error**: `target not found`\n\n```{$detected_lang}\n# Solution: Install target\nrustup target add x86_64-unknown-linux-gnu\n```\n\n**Error**: Cross-compilation linking errors\n\n```{$detected_lang}\n# Solution: Use cross instead of cargo\ncargo install cross\nmake build-platform CROSS=true\n```\n\n#### Nushell Script Errors\n\n**Error**: `command not found`\n\n```{$detected_lang}\n# Solution: Ensure Nushell is in PATH\nwhich nu\nexport PATH="$HOME/.cargo/bin:$PATH"\n```\n\n**Error**: Permission denied\n\n```{$detected_lang}\n# Solution: Make scripts executable\nchmod +x src/tools/build/*.nu\n```\n\n**Error**: Module not found\n\n```{$detected_lang}\n# Solution: Check working directory\ncd src/tools\nnu build/compile-platform.nu --help\n```\n\n#### Nickel Validation Errors\n\n**Error**: `nickel command not found`\n\n```{$detected_lang}\n# Solution: Install Nickel\ncargo install nickel\n# or\nbrew install nickel\n```\n\n**Error**: Schema validation failed\n\n```{$detected_lang}\n# Solution: Check Nickel syntax\nnickel fmt schemas/\nnickel check schemas/\n```\n\n### Build Performance Issues\n\n#### Slow Compilation\n\n**Optimizations**:\n\n```{$detected_lang}\n# Enable parallel builds\nmake build-all PARALLEL=true\n\n# Use faster linker\nexport RUSTFLAGS="-C link-arg=-fuse-ld=lld"\n\n# Increase build jobs\nexport CARGO_BUILD_JOBS=8\n```\n\n**Cargo Configuration** (`~/.cargo/config.toml`):\n\n```{$detected_lang}\n[build]\njobs = 8\n\n[target.x86_64-unknown-linux-gnu]\nlinker = "lld"\n```\n\n#### Memory Issues\n\n**Solutions**:\n\n```{$detected_lang}\n# Reduce parallel jobs\nexport CARGO_BUILD_JOBS=2\n\n# Use debug build for development\nmake dev-build BUILD_MODE=debug\n\n# Clean up between builds\nmake clean-dist\n```\n\n### Distribution Issues\n\n#### Missing Assets\n\n**Validation**:\n\n```{$detected_lang}\n# Test distribution\nmake test-dist\n\n# Detailed validation\nnu src/tools/package/validate-package.nu dist/\n```\n\n#### Size Optimization\n\n**Optimizations**:\n\n```{$detected_lang}\n# Strip binaries\nmake package-binaries STRIP=true\n\n# Enable compression\nmake dist-generate COMPRESS=true\n\n# Use minimal variant\nmake dist-generate VARIANTS=minimal\n```\n\n### Debug Mode\n\n**Enable Debug Logging**:\n\n```{$detected_lang}\n# Set environment\nexport PROVISIONING_DEBUG=true\nexport RUST_LOG=debug\n\n# Run with debug\nmake debug\n\n# Verbose make output\nmake build-all VERBOSE=true\n```\n\n**Debug Information**:\n\n```{$detected_lang}\n# Show debug information\nmake debug-info\n\n# Build system status\nmake status\n\n# Tool information\nmake info\n```\n\n## CI/CD Integration\n\n### GitHub Actions\n\n**Example Workflow** (`.github/workflows/build.yml`):\n\n```{$detected_lang}\nname: Build and Test\non: [push, pull_request]\n\njobs:\n build:\n runs-on: ubuntu-latest\n steps:\n - uses: actions/checkout@v4\n\n - name: Setup Nushell\n uses: hustcer/setup-nu@v3.5\n\n - name: Setup Rust\n uses: actions-rs/toolchain@v1\n with:\n toolchain: stable\n\n - name: CI Build\n run: |\n cd src/tools\n make ci-build\n\n - name: Upload Artifacts\n uses: actions/upload-artifact@v4\n with:\n name: build-artifacts\n path: src/dist/\n```\n\n### Release Automation\n\n**Release Workflow**:\n\n```{$detected_lang}\nname: Release\non:\n push:\n tags: ['v*']\n\njobs:\n release:\n runs-on: ubuntu-latest\n steps:\n - uses: actions/checkout@v4\n\n - name: Build Release\n run: |\n cd src/tools\n make 
ci-release VERSION=${{ github.ref_name }}\n\n - name: Create Release\n run: |\n cd src/tools\n make release VERSION=${{ github.ref_name }}\n```\n\n### Local CI Testing\n\n**Test CI Pipeline Locally**:\n\n```{$detected_lang}\n# Run CI build pipeline\nmake ci-build\n\n# Run CI test pipeline\nmake ci-test\n\n# Full CI/CD pipeline\nmake ci-release\n```\n\nThis build system provides a comprehensive, maintainable foundation for the provisioning project's development lifecycle, from local development to\nproduction releases. +# Build System Documentation + +This document provides comprehensive documentation for the provisioning project's build system, including the complete Makefile reference with 40+ +targets, build tools, compilation instructions, and troubleshooting. + +## Table of Contents + +1. [Overview](#overview) +2. [Quick Start](#quick-start) +3. [Makefile Reference](#makefile-reference) +4. [Build Tools](#build-tools) +5. [Cross-Platform Compilation](#cross-platform-compilation) +6. [Dependency Management](#dependency-management) +7. [Troubleshooting](#troubleshooting) +8. [CI/CD Integration](#cicd-integration) + +## Overview + +The build system is a comprehensive, Makefile-based solution that orchestrates: + +- **Rust compilation**: Platform binaries (orchestrator, control-center, etc.) +- **Nushell bundling**: Core libraries and CLI tools +- **Nickel validation**: Configuration schema validation +- **Distribution generation**: Multi-platform packages +- **Release management**: Automated release pipelines +- **Documentation generation**: API and user documentation + +**Location**: `/src/tools/` +**Main entry point**: `/src/tools/Makefile` + +## Quick Start + +```text +# Navigate to build system +cd src/tools + +# View all available targets +make help + +# Complete build and package +make all + +# Development build (quick) +make dev-build + +# Build for specific platform +make linux +make macos +make windows + +# Clean everything +make clean + +# Check build system status +make status +``` + +## Makefile Reference + +### Build Configuration + +**Variables**: + +```text +# Project metadata +PROJECT_NAME := provisioning +VERSION := $(git describe --tags --always --dirty) +BUILD_TIME := $(date -u +"%Y-%m-%dT%H:%M:%SZ") + +# Build configuration +RUST_TARGET := x86_64-unknown-linux-gnu +BUILD_MODE := release +PLATFORMS := linux-amd64,macos-amd64,windows-amd64 +VARIANTS := complete,minimal + +# Flags +VERBOSE := false +DRY_RUN := false +PARALLEL := true +``` + +### Build Targets + +#### Primary Build Targets + +**`make all`** - Complete build, package, and test + +- Runs: `clean build-all package-all test-dist` +- Use for: Production releases, complete validation + +**`make build-all`** - Build all components + +- Runs: `build-platform build-core validate-nickel` +- Use for: Complete system compilation + +**`make build-platform`** - Build platform binaries for all targets + +```text +make build-platform +# Equivalent to: +nu tools/build/compile-platform.nu + --target x86_64-unknown-linux-gnu + --release + --output-dir dist/platform + --verbose=false +``` + +**`make build-core`** - Bundle core Nushell libraries + +```text +make build-core +# Equivalent to: +nu tools/build/bundle-core.nu + --output-dir dist/core + --config-dir dist/config + --validate + --exclude-dev +``` + +**`make validate-nickel`** - Validate and compile Nickel schemas + +```text +make validate-nickel +# Equivalent to: +nu tools/build/validate-nickel.nu + --output-dir dist/schemas + --format-code + --check-dependencies +``` 
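+
+The three targets above can also be chained with `make build-all`, which runs them in the order shown. A typical local sequence (using only targets documented here) might be:
+
+```text
+make build-platform    # compile Rust platform binaries
+make build-core        # bundle the Nushell core libraries
+make validate-nickel   # validate and compile Nickel schemas
+
+# Or run all three in one step:
+make build-all
+```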
+ +**`make build-cross`** - Cross-compile for multiple platforms + +- Builds for all platforms in `PLATFORMS` variable +- Parallel execution support +- Failure handling for each platform + +#### Package Targets + +**`make package-all`** - Create all distribution packages + +- Runs: `dist-generate package-binaries package-containers` + +**`make dist-generate`** - Generate complete distributions + +```text +make dist-generate +# Advanced usage: +make dist-generate PLATFORMS=linux-amd64,macos-amd64 VARIANTS=complete +``` + +**`make package-binaries`** - Package binaries for distribution + +- Creates platform-specific archives +- Strips debug symbols +- Generates checksums + +**`make package-containers`** - Build container images + +- Multi-platform container builds +- Optimized layers and caching +- Version tagging + +**`make create-archives`** - Create distribution archives + +- TAR and ZIP formats +- Platform-specific and universal archives +- Compression and checksums + +**`make create-installers`** - Create installation packages + +- Shell script installers +- Platform-specific packages (DEB, RPM, MSI) +- Uninstaller creation + +#### Release Targets + +**`make release`** - Create a complete release (requires VERSION) + +```text +make release VERSION=2.1.0 +``` + +Features: + +- Automated changelog generation +- Git tag creation and push +- Artifact upload +- Comprehensive validation + +**`make release-draft`** - Create a draft release + +- Create without publishing +- Review artifacts before release +- Manual approval workflow + +**`make upload-artifacts`** - Upload release artifacts + +- GitHub Releases +- Container registries +- Package repositories +- Verification and validation + +**`make notify-release`** - Send release notifications + +- Slack notifications +- Discord announcements +- Email notifications +- Custom webhook support + +**`make update-registry`** - Update package manager registries + +- Homebrew formula updates +- APT repository updates +- Custom registry support + +#### Development and Testing Targets + +**`make dev-build`** - Quick development build + +```text +make dev-build +# Fast build with minimal validation +``` + +**`make test-build`** - Test build system + +- Validates build process +- Runs with test configuration +- Comprehensive logging + +**`make test-dist`** - Test generated distributions + +- Validates distribution integrity +- Tests installation process +- Platform compatibility checks + +**`make validate-all`** - Validate all components + +- Nickel schema validation +- Package validation +- Configuration validation + +**`make benchmark`** - Run build benchmarks + +- Times build process +- Performance analysis +- Resource usage monitoring + +#### Documentation Targets + +**`make docs`** - Generate documentation + +```text +make docs +# Generates API docs, user guides, and examples +``` + +**`make docs-serve`** - Generate and serve documentation locally + +- Starts local HTTP server on port 8000 +- Live documentation browsing +- Development documentation workflow + +#### Utility Targets + +**`make clean`** - Clean all build artifacts + +```text +make clean +# Removes all build, distribution, and package directories +``` + +**`make clean-dist`** - Clean only distribution artifacts + +- Preserves build cache +- Removes distribution packages +- Faster cleanup option + +**`make install`** - Install the built system locally + +- Requires distribution to be built +- Installs to system directories +- Creates uninstaller + +**`make uninstall`** - Uninstall the 
system + +- Removes system installation +- Cleans configuration +- Removes service files + +**`make status`** - Show build system status + +```text +make status +# Output: +# Build System Status +# =================== +# Project: provisioning +# Version: v2.1.0-5-g1234567 +# Git Commit: 1234567890abcdef +# Build Time: 2025-09-25T14:30:22Z +# +# Directories: +# Source: /Users/user/repo-cnz/src +# Tools: /Users/user/repo-cnz/src/tools +# Build: /Users/user/repo-cnz/src/target +# Distribution: /Users/user/repo-cnz/src/dist +# Packages: /Users/user/repo-cnz/src/packages +``` + +**`make info`** - Show detailed system information + +- OS and architecture details +- Tool versions (Nushell, Rust, Docker, Git) +- Environment information +- Build prerequisites + +#### CI/CD Integration Targets + +**`make ci-build`** - CI build pipeline + +- Complete validation build +- Suitable for automated CI systems +- Comprehensive testing + +**`make ci-test`** - CI test pipeline + +- Validation and testing only +- Fast feedback for pull requests +- Quality assurance + +**`make ci-release`** - CI release pipeline + +- Build and packaging for releases +- Artifact preparation +- Release candidate creation + +**`make cd-deploy`** - CD deployment pipeline + +- Complete release and deployment +- Artifact upload and distribution +- User notifications + +#### Platform-Specific Targets + +**`make linux`** - Build for Linux only + +```text +make linux +# Sets PLATFORMS=linux-amd64 +``` + +**`make macos`** - Build for macOS only + +```text +make macos +# Sets PLATFORMS=macos-amd64 +``` + +**`make windows`** - Build for Windows only + +```text +make windows +# Sets PLATFORMS=windows-amd64 +``` + +#### Debugging Targets + +**`make debug`** - Build with debug information + +```text +make debug +# Sets BUILD_MODE=debug VERBOSE=true +``` + +**`make debug-info`** - Show debug information + +- Make variables and environment +- Build system diagnostics +- Troubleshooting information + +## Build Tools + +### Core Build Scripts + +All build tools are implemented as Nushell scripts with comprehensive parameter validation and error handling. 
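+
+As a rough illustration of that shared shape (typed flags with defaults, up-front validation, explicit errors), a minimal, hypothetical script might look like the sketch below. The flag names are illustrative only and do not belong to any real tool in `/src/tools/build/`.
+
+```text
+# Hypothetical sketch: the common pattern, not an actual build tool.
+def main [
+    --output-dir: string = "dist/example"  # where artifacts are written
+    --verbose                              # enable verbose logging
+] {
+    # Validate parameters before doing any work
+    if ($output_dir | str trim | is-empty) {
+        error make { msg: "--output-dir must not be empty" }
+    }
+    if $verbose { print $"Writing artifacts to ($output_dir)" }
+    mkdir $output_dir
+}
+```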
+ +#### `/src/tools/build/compile-platform.nu` + +**Purpose**: Compiles all Rust components for distribution + +**Components Compiled**: + +- `orchestrator` → `provisioning-orchestrator` binary +- `control-center` → `control-center` binary +- `control-center-ui` → Web UI assets +- `mcp-server-rust` → MCP integration binary + +**Usage**: + +```text +nu compile-platform.nu [options] + +Options: + --target STRING Target platform (default: x86_64-unknown-linux-gnu) + --release Build in release mode + --features STRING Comma-separated features to enable + --output-dir STRING Output directory (default: dist/platform) + --verbose Enable verbose logging + --clean Clean before building +``` + +**Example**: + +```text +nu compile-platform.nu + --target x86_64-apple-darwin + --release + --features "surrealdb,telemetry" + --output-dir dist/macos + --verbose +``` + +#### `/src/tools/build/bundle-core.nu` + +**Purpose**: Bundles Nushell core libraries and CLI for distribution + +**Components Bundled**: + +- Nushell provisioning CLI wrapper +- Core Nushell libraries (`lib_provisioning`) +- Configuration system +- Template system +- Extensions and plugins + +**Usage**: + +```text +nu bundle-core.nu [options] + +Options: + --output-dir STRING Output directory (default: dist/core) + --config-dir STRING Configuration directory (default: dist/config) + --validate Validate Nushell syntax + --compress Compress bundle with gzip + --exclude-dev Exclude development files (default: true) + --verbose Enable verbose logging +``` + +**Validation Features**: + +- Syntax validation of all Nushell files +- Import dependency checking +- Function signature validation +- Test execution (if tests present) + +#### `/src/tools/build/validate-nickel.nu` + +**Purpose**: Validates and compiles Nickel schemas + +**Validation Process**: + +1. Syntax validation of all `.ncl` files +2. Schema dependency checking +3. Type constraint validation +4. Example validation against schemas +5. 
Documentation generation + +**Usage**: + +```text +nu validate-nickel.nu [options] + +Options: + --output-dir STRING Output directory (default: dist/schemas) + --format-code Format Nickel code during validation + --check-dependencies Validate schema dependencies + --verbose Enable verbose logging +``` + +#### `/src/tools/build/test-distribution.nu` + +**Purpose**: Tests generated distributions for correctness + +**Test Types**: + +- **Basic**: Installation test, CLI help, version check +- **Integration**: Server creation, configuration validation +- **Complete**: Full workflow testing including cluster operations + +**Usage**: + +```text +nu test-distribution.nu [options] + +Options: + --dist-dir STRING Distribution directory (default: dist) + --test-types STRING Test types: basic,integration,complete + --platform STRING Target platform for testing + --cleanup Remove test files after completion + --verbose Enable verbose logging +``` + +#### `/src/tools/build/clean-build.nu` + +**Purpose**: Intelligent build artifact cleanup + +**Cleanup Scopes**: + +- **all**: Complete cleanup (build, dist, packages, cache) +- **dist**: Distribution artifacts only +- **cache**: Build cache and temporary files +- **old**: Files older than specified age + +**Usage**: + +```text +nu clean-build.nu [options] + +Options: + --scope STRING Cleanup scope: all,dist,cache,old + --age DURATION Age threshold for 'old' scope (default: 7d) + --force Force cleanup without confirmation + --dry-run Show what would be cleaned without doing it + --verbose Enable verbose logging +``` + +### Distribution Tools + +#### `/src/tools/distribution/generate-distribution.nu` + +**Purpose**: Main distribution generator orchestrating the complete process + +**Generation Process**: + +1. Platform binary compilation +2. Core library bundling +3. Nickel schema validation and packaging +4. Configuration system preparation +5. Documentation generation +6. Archive creation and compression +7. Installer generation +8. 
Validation and testing + +**Usage**: + +```text +nu generate-distribution.nu [command] [options] + +Commands: + Generate complete distribution + quick Quick development distribution + status Show generation status + +Options: + --version STRING Version to build (default: auto-detect) + --platforms STRING Comma-separated platforms + --variants STRING Variants: complete,minimal + --output-dir STRING Output directory (default: dist) + --compress Enable compression + --generate-docs Generate documentation + --parallel-builds Enable parallel builds + --validate-output Validate generated output + --verbose Enable verbose logging +``` + +**Advanced Examples**: + +```text +# Complete multi-platform release +nu generate-distribution.nu + --version 2.1.0 + --platforms linux-amd64,macos-amd64,windows-amd64 + --variants complete,minimal + --compress + --generate-docs + --parallel-builds + --validate-output + +# Quick development build +nu generate-distribution.nu quick + --platform linux + --variant minimal + +# Status check +nu generate-distribution.nu status +``` + +#### `/src/tools/distribution/create-installer.nu` + +**Purpose**: Creates platform-specific installers + +**Installer Types**: + +- **shell**: Shell script installer (cross-platform) +- **package**: Platform packages (DEB, RPM, MSI, PKG) +- **container**: Container image with provisioning +- **source**: Source distribution with build instructions + +**Usage**: + +```text +nu create-installer.nu DISTRIBUTION_DIR [options] + +Options: + --output-dir STRING Installer output directory + --installer-types STRING Installer types: shell,package,container,source + --platforms STRING Target platforms + --include-services Include systemd/launchd service files + --create-uninstaller Generate uninstaller + --validate-installer Test installer functionality + --verbose Enable verbose logging +``` + +### Package Tools + +#### `/src/tools/package/package-binaries.nu` + +**Purpose**: Packages compiled binaries for distribution + +**Package Formats**: + +- **archive**: TAR.GZ and ZIP archives +- **standalone**: Single binary with embedded resources +- **installer**: Platform-specific installer packages + +**Features**: + +- Binary stripping for size reduction +- Compression optimization +- Checksum generation (SHA256, MD5) +- Digital signing (if configured) + +#### `/src/tools/package/build-containers.nu` + +**Purpose**: Builds optimized container images + +**Container Features**: + +- Multi-stage builds for minimal image size +- Security scanning integration +- Multi-platform image generation +- Layer caching optimization +- Runtime environment configuration + +### Release Tools + +#### `/src/tools/release/create-release.nu` + +**Purpose**: Automated release creation and management + +**Release Process**: + +1. Version validation and tagging +2. Changelog generation from git history +3. Asset building and validation +4. Release creation (GitHub, GitLab, etc.) +5. Asset upload and verification +6. 
Release announcement preparation + +**Usage**: + +```text +nu create-release.nu [options] + +Options: + --version STRING Release version (required) + --asset-dir STRING Directory containing release assets + --draft Create draft release + --prerelease Mark as pre-release + --generate-changelog Auto-generate changelog + --push-tag Push git tag + --auto-upload Upload assets automatically + --verbose Enable verbose logging +``` + +## Cross-Platform Compilation + +### Supported Platforms + +**Primary Platforms**: + +- `linux-amd64` (x86_64-unknown-linux-gnu) +- `macos-amd64` (x86_64-apple-darwin) +- `windows-amd64` (x86_64-pc-windows-gnu) + +**Additional Platforms**: + +- `linux-arm64` (aarch64-unknown-linux-gnu) +- `macos-arm64` (aarch64-apple-darwin) +- `freebsd-amd64` (x86_64-unknown-freebsd) + +### Cross-Compilation Setup + +**Install Rust Targets**: + +```text +# Install additional targets +rustup target add x86_64-apple-darwin +rustup target add x86_64-pc-windows-gnu +rustup target add aarch64-unknown-linux-gnu +rustup target add aarch64-apple-darwin +``` + +**Platform-Specific Dependencies**: + +**macOS Cross-Compilation**: + +```text +# Install osxcross toolchain +brew install FiloSottile/musl-cross/musl-cross +brew install mingw-w64 +``` + +**Windows Cross-Compilation**: + +```text +# Install Windows dependencies +brew install mingw-w64 +# or on Linux: +sudo apt-get install gcc-mingw-w64 +``` + +### Cross-Compilation Usage + +**Single Platform**: + +```text +# Build for macOS from Linux +make build-platform RUST_TARGET=x86_64-apple-darwin + +# Build for Windows +make build-platform RUST_TARGET=x86_64-pc-windows-gnu +``` + +**Multiple Platforms**: + +```text +# Build for all configured platforms +make build-cross + +# Specify platforms +make build-cross PLATFORMS=linux-amd64,macos-amd64,windows-amd64 +``` + +**Platform-Specific Targets**: + +```text +# Quick platform builds +make linux # Linux AMD64 +make macos # macOS AMD64 +make windows # Windows AMD64 +``` + +## Dependency Management + +### Build Dependencies + +**Required Tools**: + +- **Nushell 0.107.1+**: Core shell and scripting +- **Rust 1.70+**: Platform binary compilation +- **Cargo**: Rust package management +- **KCL 0.11.2+**: Configuration language +- **Git**: Version control and tagging + +**Optional Tools**: + +- **Docker**: Container image building +- **Cross**: Simplified cross-compilation +- **SOPS**: Secrets management +- **Age**: Encryption for secrets + +### Dependency Validation + +**Check Dependencies**: + +```text +make info +# Shows versions of all required tools + +# Output example: +# Tool Versions: +# Nushell: 0.107.1 +# Rust: rustc 1.75.0 +# Docker: Docker version 24.0.6 +# Git: git version 2.42.0 +``` + +**Install Missing Dependencies**: + +```text +# Install Nushell +cargo install nu + +# Install Nickel +cargo install nickel + +# Install Cross (for cross-compilation) +cargo install cross +``` + +### Dependency Caching + +**Rust Dependencies**: + +- Cargo cache: `~/.cargo/registry` +- Target cache: `target/` directory +- Cross-compilation cache: `~/.cache/cross` + +**Build Cache Management**: + +```text +# Clean Cargo cache +cargo clean + +# Clean cross-compilation cache +cross clean + +# Clean all caches +make clean SCOPE=cache +``` + +## Troubleshooting + +### Common Build Issues + +#### Rust Compilation Errors + +**Error**: `linker 'cc' not found` + +```text +# Solution: Install build essentials +sudo apt-get install build-essential # Linux +xcode-select --install # macOS +``` + +**Error**: `target not 
found` + +```text +# Solution: Install target +rustup target add x86_64-unknown-linux-gnu +``` + +**Error**: Cross-compilation linking errors + +```text +# Solution: Use cross instead of cargo +cargo install cross +make build-platform CROSS=true +``` + +#### Nushell Script Errors + +**Error**: `command not found` + +```text +# Solution: Ensure Nushell is in PATH +which nu +export PATH="$HOME/.cargo/bin:$PATH" +``` + +**Error**: Permission denied + +```text +# Solution: Make scripts executable +chmod +x src/tools/build/*.nu +``` + +**Error**: Module not found + +```text +# Solution: Check working directory +cd src/tools +nu build/compile-platform.nu --help +``` + +#### Nickel Validation Errors + +**Error**: `nickel command not found` + +```text +# Solution: Install Nickel +cargo install nickel +# or +brew install nickel +``` + +**Error**: Schema validation failed + +```text +# Solution: Check Nickel syntax +nickel fmt schemas/ +nickel check schemas/ +``` + +### Build Performance Issues + +#### Slow Compilation + +**Optimizations**: + +```text +# Enable parallel builds +make build-all PARALLEL=true + +# Use faster linker +export RUSTFLAGS="-C link-arg=-fuse-ld=lld" + +# Increase build jobs +export CARGO_BUILD_JOBS=8 +``` + +**Cargo Configuration** (`~/.cargo/config.toml`): + +```text +[build] +jobs = 8 + +[target.x86_64-unknown-linux-gnu] +linker = "lld" +``` + +#### Memory Issues + +**Solutions**: + +```text +# Reduce parallel jobs +export CARGO_BUILD_JOBS=2 + +# Use debug build for development +make dev-build BUILD_MODE=debug + +# Clean up between builds +make clean-dist +``` + +### Distribution Issues + +#### Missing Assets + +**Validation**: + +```text +# Test distribution +make test-dist + +# Detailed validation +nu src/tools/package/validate-package.nu dist/ +``` + +#### Size Optimization + +**Optimizations**: + +```text +# Strip binaries +make package-binaries STRIP=true + +# Enable compression +make dist-generate COMPRESS=true + +# Use minimal variant +make dist-generate VARIANTS=minimal +``` + +### Debug Mode + +**Enable Debug Logging**: + +```text +# Set environment +export PROVISIONING_DEBUG=true +export RUST_LOG=debug + +# Run with debug +make debug + +# Verbose make output +make build-all VERBOSE=true +``` + +**Debug Information**: + +```text +# Show debug information +make debug-info + +# Build system status +make status + +# Tool information +make info +``` + +## CI/CD Integration + +### GitHub Actions + +**Example Workflow** (`.github/workflows/build.yml`): + +```text +name: Build and Test +on: [push, pull_request] + +jobs: + build: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + + - name: Setup Nushell + uses: hustcer/setup-nu@v3.5 + + - name: Setup Rust + uses: actions-rs/toolchain@v1 + with: + toolchain: stable + + - name: CI Build + run: | + cd src/tools + make ci-build + + - name: Upload Artifacts + uses: actions/upload-artifact@v4 + with: + name: build-artifacts + path: src/dist/ +``` + +### Release Automation + +**Release Workflow**: + +```text +name: Release +on: + push: + tags: ['v*'] + +jobs: + release: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + + - name: Build Release + run: | + cd src/tools + make ci-release VERSION=${{ github.ref_name }} + + - name: Create Release + run: | + cd src/tools + make release VERSION=${{ github.ref_name }} +``` + +### Local CI Testing + +**Test CI Pipeline Locally**: + +```text +# Run CI build pipeline +make ci-build + +# Run CI test pipeline +make ci-test + +# Full CI/CD pipeline +make 
ci-release +``` + +This build system provides a comprehensive, maintainable foundation for the provisioning project's development lifecycle, from local development to +production releases. diff --git a/docs/src/development/command-handler-guide.md b/docs/src/development/command-handler-guide.md index f653063..c4f5ba8 100644 --- a/docs/src/development/command-handler-guide.md +++ b/docs/src/development/command-handler-guide.md @@ -1 +1,614 @@ -# Command Handler Developer Guide\n\n**Target Audience**: Developers working on the provisioning CLI\n**Last Updated**: 2025-09-30\n**Related**: [ADR-006 CLI Refactoring](../architecture/adr/adr-006-provisioning-cli-refactoring.md)\n\n## Overview\n\nThe provisioning CLI uses a **modular, domain-driven architecture** that separates concerns into focused command handlers. This guide shows you how to\nwork with this architecture.\n\n### Key Architecture Principles\n\n1. **Separation of Concerns**: Routing, flag parsing, and business logic are separated\n2. **Domain-Driven Design**: Commands organized by domain (infrastructure, orchestration, etc.)\n3. **DRY (Don't Repeat Yourself)**: Centralized flag handling eliminates code duplication\n4. **Single Responsibility**: Each module has one clear purpose\n5. **Open/Closed Principle**: Easy to extend, no need to modify core routing\n\n### Architecture Components\n\n```\nprovisioning/core/nulib/\n├── provisioning (211 lines) - Main entry point\n├── main_provisioning/\n│ ├── flags.nu (139 lines) - Centralized flag handling\n│ ├── dispatcher.nu (264 lines) - Command routing\n│ ├── help_system.nu - Categorized help system\n│ └── commands/ - Domain-focused handlers\n│ ├── infrastructure.nu (117 lines) - Server, taskserv, cluster, infra\n│ ├── orchestration.nu (64 lines) - Workflow, batch, orchestrator\n│ ├── development.nu (72 lines) - Module, layer, version, pack\n│ ├── workspace.nu (56 lines) - Workspace, template\n│ ├── generation.nu (78 lines) - Generate commands\n│ ├── utilities.nu (157 lines) - SSH, SOPS, cache, providers\n│ └── configuration.nu (316 lines) - Env, show, init, validate\n```\n\n## Adding New Commands\n\n### Step 1: Choose the Right Domain Handler\n\nCommands are organized by domain. Choose the appropriate handler:\n\n| Domain | Handler | Responsibility |\n| -------- | --------- | ---------------- |\n| infrastructure | `infrastructure.nu` | Server/taskserv/cluster/infra lifecycle |\n| orchestration | `orchestration.nu` | Workflow/batch operations, orchestrator control |\n| development | `development.nu` | Module discovery, layers, versions, packaging |\n| workspace | `workspace.nu` | Workspace and template management |\n| configuration | `configuration.nu` | Environment, settings, initialization |\n| utilities | `utilities.nu` | SSH, SOPS, cache, providers, utilities |\n| generation | `generation.nu` | Generate commands (server, taskserv, etc.) 
|\n\n### Step 2: Add Command to Handler\n\n**Example: Adding a new server command `server status`**\n\nEdit `provisioning/core/nulib/main_provisioning/commands/infrastructure.nu`:\n\n```\n# Add to the handle_infrastructure_command match statement\nexport def handle_infrastructure_command [\n command: string\n ops: string\n flags: record\n] {\n set_debug_env $flags\n\n match $command {\n "server" => { handle_server $ops $flags }\n "taskserv" | "task" => { handle_taskserv $ops $flags }\n "cluster" => { handle_cluster $ops $flags }\n "infra" | "infras" => { handle_infra $ops $flags }\n _ => {\n print $"❌ Unknown infrastructure command: ($command)"\n print ""\n print "Available infrastructure commands:"\n print " server - Server operations (create, delete, list, ssh, status)" # Updated\n print " taskserv - Task service management"\n print " cluster - Cluster operations"\n print " infra - Infrastructure management"\n print ""\n print "Use 'provisioning help infrastructure' for more details"\n exit 1\n }\n }\n}\n\n# Add the new command handler\ndef handle_server [ops: string, flags: record] {\n let args = build_module_args $flags $ops\n run_module $args "server" --exec\n}\n```\n\n**That's it!** The command is now available as `provisioning server status`.\n\n### Step 3: Add Shortcuts (Optional)\n\nIf you want shortcuts like `provisioning s status`:\n\nEdit `provisioning/core/nulib/main_provisioning/dispatcher.nu`:\n\n```\nexport def get_command_registry []: nothing -> record {\n {\n # Infrastructure commands\n "s" => "infrastructure server" # Already exists\n "server" => "infrastructure server" # Already exists\n\n # Your new shortcut (if needed)\n # Example: "srv-status" => "infrastructure server status"\n\n # ... rest of registry\n }\n}\n```\n\n**Note**: Most shortcuts are already configured. You only need to add new shortcuts if you're creating completely new command categories.\n\n## Modifying Existing Handlers\n\n### Example: Enhancing the `taskserv` Command\n\nLet's say you want to add better error handling to the taskserv command:\n\n**Before:**\n\n```\ndef handle_taskserv [ops: string, flags: record] {\n let args = build_module_args $flags $ops\n run_module $args "taskserv" --exec\n}\n```\n\n**After:**\n\n```\ndef handle_taskserv [ops: string, flags: record] {\n # Validate taskserv name if provided\n let first_arg = ($ops | split row " " | get -o 0)\n if ($first_arg | is-not-empty) and $first_arg not-in ["create", "delete", "list", "generate", "check-updates", "help"] {\n # Check if taskserv exists\n let available_taskservs = (^$env.PROVISIONING_NAME module discover taskservs | from json)\n if $first_arg not-in $available_taskservs {\n print $"❌ Unknown taskserv: ($first_arg)"\n print ""\n print "Available taskservs:"\n $available_taskservs | each { |ts| print $" • ($ts)" }\n exit 1\n }\n }\n\n let args = build_module_args $flags $ops\n run_module $args "taskserv" --exec\n}\n```\n\n## Working with Flags\n\n### Using Centralized Flag Handling\n\nThe `flags.nu` module provides centralized flag handling:\n\n```\n# Parse all flags into normalized record\nlet parsed_flags = (parse_common_flags {\n version: $version, v: $v, info: $info,\n debug: $debug, check: $check, yes: $yes,\n wait: $wait, infra: $infra, # ... 
etc\n})\n\n# Build argument string for module execution\nlet args = build_module_args $parsed_flags $ops\n\n# Set environment variables based on flags\nset_debug_env $parsed_flags\n```\n\n### Available Flag Parsing\n\nThe `parse_common_flags` function normalizes these flags:\n\n| Flag Record Field | Description |\n| ------------------- | ------------- |\n| `show_version` | Version display (`--version`, `-v`) |\n| `show_info` | Info display (`--info`, `-i`) |\n| `show_about` | About display (`--about`, `-a`) |\n| `debug_mode` | Debug mode (`--debug`, `-x`) |\n| `check_mode` | Check mode (`--check`, `-c`) |\n| `auto_confirm` | Auto-confirm (`--yes`, `-y`) |\n| `wait` | Wait for completion (`--wait`, `-w`) |\n| `keep_storage` | Keep storage (`--keepstorage`) |\n| `infra` | Infrastructure name (`--infra`) |\n| `outfile` | Output file (`--outfile`) |\n| `output_format` | Output format (`--out`) |\n| `template` | Template name (`--template`) |\n| `select` | Selection (`--select`) |\n| `settings` | Settings file (`--settings`) |\n| `new_infra` | New infra name (`--new`) |\n\n### Adding New Flags\n\nIf you need to add a new flag:\n\n1. **Update main `provisioning` file** to accept the flag\n2. **Update `flags.nu:parse_common_flags`** to normalize it\n3. **Update `flags.nu:build_module_args`** to pass it to modules\n\n**Example: Adding `--timeout` flag**\n\n```\n# 1. In provisioning main file (parameter list)\ndef main [\n # ... existing parameters\n --timeout: int = 300 # Timeout in seconds\n # ... rest of parameters\n] {\n # ... existing code\n let parsed_flags = (parse_common_flags {\n # ... existing flags\n timeout: $timeout\n })\n}\n\n# 2. In flags.nu:parse_common_flags\nexport def parse_common_flags [flags: record]: nothing -> record {\n {\n # ... existing normalizations\n timeout: ($flags.timeout? | default 300)\n }\n}\n\n# 3. In flags.nu:build_module_args\nexport def build_module_args [flags: record, extra: string = ""]: nothing -> string {\n # ... existing code\n let str_timeout = if ($flags.timeout != 300) { $"--timeout ($flags.timeout) " } else { "" }\n # ... rest of function\n $"($extra) ($use_check)($use_yes)($use_wait)($str_timeout)..."\n}\n```\n\n## Adding New Shortcuts\n\n### Shortcut Naming Conventions\n\n- **1-2 letters**: Ultra-short for common commands (`s` for server, `ws` for workspace)\n- **3-4 letters**: Abbreviations (`orch` for orchestrator, `tmpl` for template)\n- **Aliases**: Alternative names (`task` for taskserv, `flow` for workflow)\n\n### Example: Adding a New Shortcut\n\nEdit `provisioning/core/nulib/main_provisioning/dispatcher.nu`:\n\n```\nexport def get_command_registry []: nothing -> record {\n {\n # ... existing shortcuts\n\n # Add your new shortcut\n "db" => "infrastructure database" # New: db command\n "database" => "infrastructure database" # Full name\n\n # ... 
rest of registry\n }\n}\n```\n\n**Important**: After adding a shortcut, update the help system in `help_system.nu` to document it.\n\n## Testing Your Changes\n\n### Running the Test Suite\n\n```\n# Run comprehensive test suite\nnu tests/test_provisioning_refactor.nu\n```\n\n### Test Coverage\n\nThe test suite validates:\n\n- ✅ Main help display\n- ✅ Category help (infrastructure, orchestration, development, workspace)\n- ✅ Bi-directional help routing\n- ✅ All command shortcuts\n- ✅ Category shortcut help\n- ✅ Command routing to correct handlers\n\n### Adding Tests for Your Changes\n\nEdit `tests/test_provisioning_refactor.nu`:\n\n```\n# Add your test function\nexport def test_my_new_feature [] {\n print "\n🧪 Testing my new feature..."\n\n let output = (run_provisioning "my-command" "test")\n assert_contains $output "Expected Output" "My command works"\n}\n\n# Add to main test runner\nexport def main [] {\n # ... existing tests\n\n let results = [\n # ... existing test calls\n (try { test_my_new_feature; "passed" } catch { "failed" })\n ]\n\n # ... rest of main\n}\n```\n\n### Manual Testing\n\n```\n# Test command execution\nprovisioning/core/cli/provisioning my-command test --check\n\n# Test with debug mode\nprovisioning/core/cli/provisioning --debug my-command test\n\n# Test help\nprovisioning/core/cli/provisioning my-command help\nprovisioning/core/cli/provisioning help my-command # Bi-directional\n```\n\n## Common Patterns\n\n### Pattern 1: Simple Command Handler\n\n**Use Case**: Command just needs to execute a module with standard flags\n\n```\ndef handle_simple_command [ops: string, flags: record] {\n let args = build_module_args $flags $ops\n run_module $args "module_name" --exec\n}\n```\n\n### Pattern 2: Command with Validation\n\n**Use Case**: Need to validate input before execution\n\n```\ndef handle_validated_command [ops: string, flags: record] {\n # Validate\n let first_arg = ($ops | split row " " | get -o 0)\n if ($first_arg | is-empty) {\n print "❌ Missing required argument"\n print "Usage: provisioning command "\n exit 1\n }\n\n # Execute\n let args = build_module_args $flags $ops\n run_module $args "module_name" --exec\n}\n```\n\n### Pattern 3: Command with Subcommands\n\n**Use Case**: Command has multiple subcommands (like `server create`, `server delete`)\n\n```\ndef handle_complex_command [ops: string, flags: record] {\n let subcommand = ($ops | split row " " | get -o 0)\n let rest_ops = ($ops | split row " " | skip 1 | str join " ")\n\n match $subcommand {\n "create" => { handle_create $rest_ops $flags }\n "delete" => { handle_delete $rest_ops $flags }\n "list" => { handle_list $rest_ops $flags }\n _ => {\n print "❌ Unknown subcommand: $subcommand"\n print "Available: create, delete, list"\n exit 1\n }\n }\n}\n```\n\n### Pattern 4: Command with Flag-Based Routing\n\n**Use Case**: Command behavior changes based on flags\n\n```\ndef handle_flag_routed_command [ops: string, flags: record] {\n if $flags.check_mode {\n # Dry-run mode\n print "🔍 Check mode: simulating command..."\n let args = build_module_args $flags $ops\n run_module $args "module_name" # No --exec, returns output\n } else {\n # Normal execution\n let args = build_module_args $flags $ops\n run_module $args "module_name" --exec\n }\n}\n```\n\n## Best Practices\n\n### 1. Keep Handlers Focused\n\nEach handler should do **one thing well**:\n\n- ✅ Good: `handle_server` manages all server operations\n- ❌ Bad: `handle_server` also manages clusters and taskservs\n\n### 2. 
Use Descriptive Error Messages\n\n```\n# ❌ Bad\nprint "Error"\n\n# ✅ Good\nprint "❌ Unknown taskserv: kubernetes-invalid"\nprint ""\nprint "Available taskservs:"\nprint " • kubernetes"\nprint " • containerd"\nprint " • cilium"\nprint ""\nprint "Use 'provisioning taskserv list' to see all available taskservs"\n```\n\n### 3. Leverage Centralized Functions\n\nDon't repeat code - use centralized functions:\n\n```\n# ❌ Bad: Repeating flag handling\ndef handle_bad [ops: string, flags: record] {\n let use_check = if $flags.check_mode { "--check " } else { "" }\n let use_yes = if $flags.auto_confirm { "--yes " } else { "" }\n let str_infra = if ($flags.infra | is-not-empty) { $"--infra ($flags.infra) " } else { "" }\n # ... 10 more lines of flag handling\n run_module $"($ops) ($use_check)($use_yes)($str_infra)..." "module" --exec\n}\n\n# ✅ Good: Using centralized function\ndef handle_good [ops: string, flags: record] {\n let args = build_module_args $flags $ops\n run_module $args "module" --exec\n}\n```\n\n### 4. Document Your Changes\n\nUpdate relevant documentation:\n\n- **ADR-006**: If architectural changes\n- **CLAUDE.md**: If new commands or shortcuts\n- **help_system.nu**: If new categories or commands\n- **This guide**: If new patterns or conventions\n\n### 5. Test Thoroughly\n\nBefore committing:\n\n- [ ] Run test suite: `nu tests/test_provisioning_refactor.nu`\n- [ ] Test manual execution\n- [ ] Test with `--check` flag\n- [ ] Test with `--debug` flag\n- [ ] Test help: both `provisioning cmd help` and `provisioning help cmd`\n- [ ] Test shortcuts\n\n## Troubleshooting\n\n### Issue: "Module not found"\n\n**Cause**: Incorrect import path in handler\n\n**Fix**: Use relative imports with `.nu` extension:\n\n```\n# ✅ Correct\nuse ../flags.nu *\nuse ../../lib_provisioning *\n\n# ❌ Wrong\nuse ../main_provisioning/flags *\nuse lib_provisioning *\n```\n\n### Issue: "Parse mismatch: expected colon"\n\n**Cause**: Missing type signature format\n\n**Fix**: Use proper Nushell 0.107 type signature:\n\n```\n# ✅ Correct\nexport def my_function [param: string]: nothing -> string {\n "result"\n}\n\n# ❌ Wrong\nexport def my_function [param: string] -> string {\n "result"\n}\n```\n\n### Issue: "Command not routing correctly"\n\n**Cause**: Shortcut not in command registry\n\n**Fix**: Add to `dispatcher.nu:get_command_registry`:\n\n```\n"myshortcut" => "domain command"\n```\n\n### Issue: "Flags not being passed"\n\n**Cause**: Not using `build_module_args`\n\n**Fix**: Use centralized flag builder:\n\n```\nlet args = build_module_args $flags $ops\nrun_module $args "module" --exec\n```\n\n## Quick Reference\n\n### File Locations\n\n```\nprovisioning/core/nulib/\n├── provisioning - Main entry, flag definitions\n├── main_provisioning/\n│ ├── flags.nu - Flag parsing (parse_common_flags, build_module_args)\n│ ├── dispatcher.nu - Routing (get_command_registry, dispatch_command)\n│ ├── help_system.nu - Help (provisioning-help, help-*)\n│ └── commands/ - Domain handlers (handle_*_command)\ntests/\n└── test_provisioning_refactor.nu - Test suite\ndocs/\n├── architecture/\n│ └── adr-006-provisioning-cli-refactoring.md - Architecture docs\n└── development/\n └── COMMAND_HANDLER_GUIDE.md - This guide\n```\n\n### Key Functions\n\n```\n# In flags.nu\nparse_common_flags [flags: record]: nothing -> record\nbuild_module_args [flags: record, extra: string = ""]: nothing -> string\nset_debug_env [flags: record]\nget_debug_flag [flags: record]: nothing -> string\n\n# In dispatcher.nu\nget_command_registry []: nothing -> 
record\ndispatch_command [args: list, flags: record]\n\n# In help_system.nu\nprovisioning-help [category?: string]: nothing -> string\nhelp-infrastructure []: nothing -> string\nhelp-orchestration []: nothing -> string\n# ... (one for each category)\n\n# In commands/*.nu\nhandle_*_command [command: string, ops: string, flags: record]\n# Example: handle_infrastructure_command, handle_workspace_command\n```\n\n### Testing Commands\n\n```\n# Run full test suite\nnu tests/test_provisioning_refactor.nu\n\n# Test specific command\nprovisioning/core/cli/provisioning my-command test --check\n\n# Test with debug\nprovisioning/core/cli/provisioning --debug my-command test\n\n# Test help\nprovisioning/core/cli/provisioning help my-command\nprovisioning/core/cli/provisioning my-command help # Bi-directional\n```\n\n## Further Reading\n\n- **[ADR-006: CLI Refactoring](../architecture/adr/adr-006-provisioning-cli-refactoring.md)** - Complete architectural decision record\n- **[Project Structure](project-structure.md)** - Overall project organization\n- **[Workflow Development](workflow.md)** - Workflow system architecture\n- **[Development Integration](integration.md)** - Integration patterns\n\n## Contributing\n\nWhen contributing command handler changes:\n\n1. **Follow existing patterns** - Use the patterns in this guide\n2. **Update documentation** - Keep docs in sync with code\n3. **Add tests** - Cover your new functionality\n4. **Run test suite** - Ensure nothing breaks\n5. **Update CLAUDE.md** - Document new commands/shortcuts\n\nFor questions or issues, refer to ADR-006 or ask the team.\n\n---\n\n*This guide is part of the provisioning project documentation. Last updated: 2025-09-30* +# Command Handler Developer Guide + +**Target Audience**: Developers working on the provisioning CLI +**Last Updated**: 2025-09-30 +**Related**: [ADR-006 CLI Refactoring](../architecture/adr/adr-006-provisioning-cli-refactoring.md) + +## Overview + +The provisioning CLI uses a **modular, domain-driven architecture** that separates concerns into focused command handlers. This guide shows you how to +work with this architecture. + +### Key Architecture Principles + +1. **Separation of Concerns**: Routing, flag parsing, and business logic are separated +2. **Domain-Driven Design**: Commands organized by domain (infrastructure, orchestration, etc.) +3. **DRY (Don't Repeat Yourself)**: Centralized flag handling eliminates code duplication +4. **Single Responsibility**: Each module has one clear purpose +5. **Open/Closed Principle**: Easy to extend, no need to modify core routing + +### Architecture Components + +```text +provisioning/core/nulib/ +├── provisioning (211 lines) - Main entry point +├── main_provisioning/ +│ ├── flags.nu (139 lines) - Centralized flag handling +│ ├── dispatcher.nu (264 lines) - Command routing +│ ├── help_system.nu - Categorized help system +│ └── commands/ - Domain-focused handlers +│ ├── infrastructure.nu (117 lines) - Server, taskserv, cluster, infra +│ ├── orchestration.nu (64 lines) - Workflow, batch, orchestrator +│ ├── development.nu (72 lines) - Module, layer, version, pack +│ ├── workspace.nu (56 lines) - Workspace, template +│ ├── generation.nu (78 lines) - Generate commands +│ ├── utilities.nu (157 lines) - SSH, SOPS, cache, providers +│ └── configuration.nu (316 lines) - Env, show, init, validate +``` + +## Adding New Commands + +### Step 1: Choose the Right Domain Handler + +Commands are organized by domain. 
Choose the appropriate handler: + +| Domain | Handler | Responsibility | +| -------- | --------- | ---------------- | +| infrastructure | `infrastructure.nu` | Server/taskserv/cluster/infra lifecycle | +| orchestration | `orchestration.nu` | Workflow/batch operations, orchestrator control | +| development | `development.nu` | Module discovery, layers, versions, packaging | +| workspace | `workspace.nu` | Workspace and template management | +| configuration | `configuration.nu` | Environment, settings, initialization | +| utilities | `utilities.nu` | SSH, SOPS, cache, providers, utilities | +| generation | `generation.nu` | Generate commands (server, taskserv, etc.) | + +### Step 2: Add Command to Handler + +**Example: Adding a new server command `server status`** + +Edit `provisioning/core/nulib/main_provisioning/commands/infrastructure.nu`: + +```text +# Add to the handle_infrastructure_command match statement +export def handle_infrastructure_command [ + command: string + ops: string + flags: record +] { + set_debug_env $flags + + match $command { + "server" => { handle_server $ops $flags } + "taskserv" | "task" => { handle_taskserv $ops $flags } + "cluster" => { handle_cluster $ops $flags } + "infra" | "infras" => { handle_infra $ops $flags } + _ => { + print $"❌ Unknown infrastructure command: ($command)" + print "" + print "Available infrastructure commands:" + print " server - Server operations (create, delete, list, ssh, status)" # Updated + print " taskserv - Task service management" + print " cluster - Cluster operations" + print " infra - Infrastructure management" + print "" + print "Use 'provisioning help infrastructure' for more details" + exit 1 + } + } +} + +# Add the new command handler +def handle_server [ops: string, flags: record] { + let args = build_module_args $flags $ops + run_module $args "server" --exec +} +``` + +**That's it!** The command is now available as `provisioning server status`. + +### Step 3: Add Shortcuts (Optional) + +If you want shortcuts like `provisioning s status`: + +Edit `provisioning/core/nulib/main_provisioning/dispatcher.nu`: + +```text +export def get_command_registry []: nothing -> record { + { + # Infrastructure commands + "s" => "infrastructure server" # Already exists + "server" => "infrastructure server" # Already exists + + # Your new shortcut (if needed) + # Example: "srv-status" => "infrastructure server status" + + # ... rest of registry + } +} +``` + +**Note**: Most shortcuts are already configured. You only need to add new shortcuts if you're creating completely new command categories. 
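+
+Once a shortcut is registered, a quick way to confirm the wiring is to exercise both the full command and its shortcut in check mode (illustrative, reusing the `server status` example from Step 2):
+
+```text
+# Both forms should route to the same infrastructure handler:
+provisioning/core/cli/provisioning server status --check
+provisioning/core/cli/provisioning s status --check
+```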
+ +## Modifying Existing Handlers + +### Example: Enhancing the `taskserv` Command + +Let's say you want to add better error handling to the taskserv command: + +**Before:** + +```text +def handle_taskserv [ops: string, flags: record] { + let args = build_module_args $flags $ops + run_module $args "taskserv" --exec +} +``` + +**After:** + +```text +def handle_taskserv [ops: string, flags: record] { + # Validate taskserv name if provided + let first_arg = ($ops | split row " " | get -o 0) + if ($first_arg | is-not-empty) and $first_arg not-in ["create", "delete", "list", "generate", "check-updates", "help"] { + # Check if taskserv exists + let available_taskservs = (^$env.PROVISIONING_NAME module discover taskservs | from json) + if $first_arg not-in $available_taskservs { + print $"❌ Unknown taskserv: ($first_arg)" + print "" + print "Available taskservs:" + $available_taskservs | each { |ts| print $" • ($ts)" } + exit 1 + } + } + + let args = build_module_args $flags $ops + run_module $args "taskserv" --exec +} +``` + +## Working with Flags + +### Using Centralized Flag Handling + +The `flags.nu` module provides centralized flag handling: + +```text +# Parse all flags into normalized record +let parsed_flags = (parse_common_flags { + version: $version, v: $v, info: $info, + debug: $debug, check: $check, yes: $yes, + wait: $wait, infra: $infra, # ... etc +}) + +# Build argument string for module execution +let args = build_module_args $parsed_flags $ops + +# Set environment variables based on flags +set_debug_env $parsed_flags +``` + +### Available Flag Parsing + +The `parse_common_flags` function normalizes these flags: + +| Flag Record Field | Description | +| ------------------- | ------------- | +| `show_version` | Version display (`--version`, `-v`) | +| `show_info` | Info display (`--info`, `-i`) | +| `show_about` | About display (`--about`, `-a`) | +| `debug_mode` | Debug mode (`--debug`, `-x`) | +| `check_mode` | Check mode (`--check`, `-c`) | +| `auto_confirm` | Auto-confirm (`--yes`, `-y`) | +| `wait` | Wait for completion (`--wait`, `-w`) | +| `keep_storage` | Keep storage (`--keepstorage`) | +| `infra` | Infrastructure name (`--infra`) | +| `outfile` | Output file (`--outfile`) | +| `output_format` | Output format (`--out`) | +| `template` | Template name (`--template`) | +| `select` | Selection (`--select`) | +| `settings` | Settings file (`--settings`) | +| `new_infra` | New infra name (`--new`) | + +### Adding New Flags + +If you need to add a new flag: + +1. **Update main `provisioning` file** to accept the flag +2. **Update `flags.nu:parse_common_flags`** to normalize it +3. **Update `flags.nu:build_module_args`** to pass it to modules + +**Example: Adding `--timeout` flag** + +```text +# 1. In provisioning main file (parameter list) +def main [ + # ... existing parameters + --timeout: int = 300 # Timeout in seconds + # ... rest of parameters +] { + # ... existing code + let parsed_flags = (parse_common_flags { + # ... existing flags + timeout: $timeout + }) +} + +# 2. In flags.nu:parse_common_flags +export def parse_common_flags [flags: record]: nothing -> record { + { + # ... existing normalizations + timeout: ($flags.timeout? | default 300) + } +} + +# 3. In flags.nu:build_module_args +export def build_module_args [flags: record, extra: string = ""]: nothing -> string { + # ... existing code + let str_timeout = if ($flags.timeout != 300) { $"--timeout ($flags.timeout) " } else { "" } + # ... 
rest of function
+ $"($extra) ($use_check)($use_yes)($use_wait)($str_timeout)..."
+}
+```
+
+## Adding New Shortcuts
+
+### Shortcut Naming Conventions
+
+- **1-2 letters**: Ultra-short for common commands (`s` for server, `ws` for workspace)
+- **3-4 letters**: Abbreviations (`orch` for orchestrator, `tmpl` for template)
+- **Aliases**: Alternative names (`task` for taskserv, `flow` for workflow)
+
+### Example: Adding a New Shortcut
+
+Edit `provisioning/core/nulib/main_provisioning/dispatcher.nu`:
+
+```text
+export def get_command_registry []: nothing -> record {
+ {
+ # ... existing shortcuts
+
+ # Add your new shortcut
+ "db": "infrastructure database" # New: db command
+ "database": "infrastructure database" # Full name
+
+ # ... rest of registry
+ }
+}
+```
+
+**Important**: After adding a shortcut, update the help system in `help_system.nu` to document it.
+
+## Testing Your Changes
+
+### Running the Test Suite
+
+```text
+# Run comprehensive test suite
+nu tests/test_provisioning_refactor.nu
+```
+
+### Test Coverage
+
+The test suite validates:
+
+- ✅ Main help display
+- ✅ Category help (infrastructure, orchestration, development, workspace)
+- ✅ Bi-directional help routing
+- ✅ All command shortcuts
+- ✅ Category shortcut help
+- ✅ Command routing to correct handlers
+
+### Adding Tests for Your Changes
+
+Edit `tests/test_provisioning_refactor.nu`:
+
+```text
+# Add your test function
+export def test_my_new_feature [] {
+ print "\n🧪 Testing my new feature..."
+
+ let output = (run_provisioning "my-command" "test")
+ assert_contains $output "Expected Output" "My command works"
+}
+
+# Add to main test runner
+export def main [] {
+ # ... existing tests
+
+ let results = [
+ # ... existing test calls
+ (try { test_my_new_feature; "passed" } catch { "failed" })
+ ]
+
+ # ... rest of main
+}
+```
+
+### Manual Testing
+
+```text
+# Test command execution
+provisioning/core/cli/provisioning my-command test --check
+
+# Test with debug mode
+provisioning/core/cli/provisioning --debug my-command test
+
+# Test help
+provisioning/core/cli/provisioning my-command help
+provisioning/core/cli/provisioning help my-command # Bi-directional
+```
+
+## Common Patterns
+
+### Pattern 1: Simple Command Handler
+
+**Use Case**: Command just needs to execute a module with standard flags
+
+```text
+def handle_simple_command [ops: string, flags: record] {
+ let args = build_module_args $flags $ops
+ run_module $args "module_name" --exec
+}
+```
+
+### Pattern 2: Command with Validation
+
+**Use Case**: Need to validate input before execution
+
+```text
+def handle_validated_command [ops: string, flags: record] {
+ # Validate
+ let first_arg = ($ops | split row " " | get -o 0)
+ if ($first_arg | is-empty) {
+ print "❌ Missing required argument"
+ print "Usage: provisioning command <argument>"
+ exit 1
+ }
+
+ # Execute
+ let args = build_module_args $flags $ops
+ run_module $args "module_name" --exec
+}
+```
+
+### Pattern 3: Command with Subcommands
+
+**Use Case**: Command has multiple subcommands (like `server create`, `server delete`)
+
+```text
+def handle_complex_command [ops: string, flags: record] {
+ let subcommand = ($ops | split row " " | get -o 0)
+ let rest_ops = ($ops | split row " " | skip 1 | str join " ")
+
+ match $subcommand {
+ "create" => { handle_create $rest_ops $flags }
+ "delete" => { handle_delete $rest_ops $flags }
+ "list" => { handle_list $rest_ops $flags }
+ _ => {
+ print $"❌ Unknown subcommand: ($subcommand)"
+ print "Available: create, delete, list"
+ exit 1
+ }
+ }
+}
+```
+
+### Pattern 4: Command with Flag-Based Routing
+
+**Use Case**: Command behavior changes based on flags
+
+```text
+def handle_flag_routed_command [ops: string, flags: record] {
+ if $flags.check_mode {
+ # Dry-run mode
+ print "🔍 Check mode: simulating command..."
+ let args = build_module_args $flags $ops
+ run_module $args "module_name" # No --exec, returns output
+ } else {
+ # Normal execution
+ let args = build_module_args $flags $ops
+ run_module $args "module_name" --exec
+ }
+}
+```
+
+## Best Practices
+
+### 1. Keep Handlers Focused
+
+Each handler should do **one thing well**:
+
+- ✅ Good: `handle_server` manages all server operations
+- ❌ Bad: `handle_server` also manages clusters and taskservs
+
+### 2. Use Descriptive Error Messages
+
+```text
+# ❌ Bad
+print "Error"
+
+# ✅ Good
+print "❌ Unknown taskserv: kubernetes-invalid"
+print ""
+print "Available taskservs:"
+print " • kubernetes"
+print " • containerd"
+print " • cilium"
+print ""
+print "Use 'provisioning taskserv list' to see all available taskservs"
+```
+
+### 3. Leverage Centralized Functions
+
+Don't repeat code - use centralized functions:
+
+```text
+# ❌ Bad: Repeating flag handling
+def handle_bad [ops: string, flags: record] {
+ let use_check = if $flags.check_mode { "--check " } else { "" }
+ let use_yes = if $flags.auto_confirm { "--yes " } else { "" }
+ let str_infra = if ($flags.infra | is-not-empty) { $"--infra ($flags.infra) " } else { "" }
+ # ... 10 more lines of flag handling
+ run_module $"($ops) ($use_check)($use_yes)($str_infra)..." "module" --exec
+}
+
+# ✅ Good: Using centralized function
+def handle_good [ops: string, flags: record] {
+ let args = build_module_args $flags $ops
+ run_module $args "module" --exec
+}
+```
+
+### 4. Document Your Changes
+
+Update relevant documentation:
+
+- **ADR-006**: If architectural changes
+- **CLAUDE.md**: If new commands or shortcuts
+- **help_system.nu**: If new categories or commands
+- **This guide**: If new patterns or conventions
+
+### 5. Test Thoroughly
+
+Before committing:
+
+- [ ] Run test suite: `nu tests/test_provisioning_refactor.nu`
+- [ ] Test manual execution
+- [ ] Test with `--check` flag
+- [ ] Test with `--debug` flag
+- [ ] Test help: both `provisioning cmd help` and `provisioning help cmd`
+- [ ] Test shortcuts
+
+## Troubleshooting
+
+### Issue: "Module not found"
+
+**Cause**: Incorrect import path in handler
+
+**Fix**: Use relative imports with `.nu` extension:
+
+```text
+# ✅ Correct
+use ../flags.nu *
+use ../../lib_provisioning *
+
+# ❌ Wrong
+use ../main_provisioning/flags *
+use lib_provisioning *
+```
+
+### Issue: "Parse mismatch: expected colon"
+
+**Cause**: Incorrect type signature syntax
+
+**Fix**: Use proper Nushell 0.107 type signature:
+
+```text
+# ✅ Correct
+export def my_function [param: string]: nothing -> string {
+ "result"
+}
+
+# ❌ Wrong
+export def my_function [param: string] -> string {
+ "result"
+}
+```
+
+### Issue: "Command not routing correctly"
+
+**Cause**: Shortcut not in command registry
+
+**Fix**: Add to `dispatcher.nu:get_command_registry`:
+
+```text
+"myshortcut": "domain command"
+```
+
+### Issue: "Flags not being passed"
+
+**Cause**: Not using `build_module_args`
+
+**Fix**: Use centralized flag builder:
+
+```text
+let args = build_module_args $flags $ops
+run_module $args "module" --exec
+```
+
+## Quick Reference
+
+### File Locations
+
+```text
+provisioning/core/nulib/
+├── provisioning - Main entry, flag definitions
+├── main_provisioning/
+│ ├── flags.nu - Flag parsing (parse_common_flags, build_module_args)
+│ ├── dispatcher.nu - Routing (get_command_registry, dispatch_command)
+│ ├── help_system.nu - Help (provisioning-help, help-*)
+│ └── commands/ - Domain handlers (handle_*_command)
+tests/
+└── test_provisioning_refactor.nu - Test suite
+docs/
+├── architecture/
+│ └── adr/adr-006-provisioning-cli-refactoring.md - Architecture docs
+└── development/
+ └── command-handler-guide.md - This guide
+```
+
+### Key Functions
+
+```text
+# In flags.nu
+parse_common_flags [flags: record]: nothing -> record
+build_module_args [flags: record, extra: string = ""]: nothing -> string
+set_debug_env [flags: record]
+get_debug_flag [flags: record]: nothing -> string
+
+# In dispatcher.nu
+get_command_registry []: nothing -> record
+dispatch_command [args: list, flags: record]
+
+# In help_system.nu
+provisioning-help [category?: string]: nothing -> string
+help-infrastructure []: nothing -> string
+help-orchestration []: nothing -> string
+# ... 
(one for each category) + +# In commands/*.nu +handle_*_command [command: string, ops: string, flags: record] +# Example: handle_infrastructure_command, handle_workspace_command +``` + +### Testing Commands + +```text +# Run full test suite +nu tests/test_provisioning_refactor.nu + +# Test specific command +provisioning/core/cli/provisioning my-command test --check + +# Test with debug +provisioning/core/cli/provisioning --debug my-command test + +# Test help +provisioning/core/cli/provisioning help my-command +provisioning/core/cli/provisioning my-command help # Bi-directional +``` + +## Further Reading + +- **[ADR-006: CLI Refactoring](../architecture/adr/adr-006-provisioning-cli-refactoring.md)** - Complete architectural decision record +- **[Project Structure](project-structure.md)** - Overall project organization +- **[Workflow Development](workflow.md)** - Workflow system architecture +- **[Development Integration](integration.md)** - Integration patterns + +## Contributing + +When contributing command handler changes: + +1. **Follow existing patterns** - Use the patterns in this guide +2. **Update documentation** - Keep docs in sync with code +3. **Add tests** - Cover your new functionality +4. **Run test suite** - Ensure nothing breaks +5. **Update CLAUDE.md** - Document new commands/shortcuts + +For questions or issues, refer to ADR-006 or ask the team. + +--- + +*This guide is part of the provisioning project documentation. Last updated: 2025-09-30* \ No newline at end of file diff --git a/docs/src/development/command-reference.md b/docs/src/development/command-reference.md index fdd344e..39b3ec8 100644 --- a/docs/src/development/command-reference.md +++ b/docs/src/development/command-reference.md @@ -1 +1,54 @@ -# Command Reference\n\nComplete command reference for the provisioning CLI.\n\n## 📖 Service Management Guide\n\nThe primary command reference is now part of the Service Management Guide:\n\n**→ [Service Management Guide](SERVICE_MANAGEMENT_GUIDE.md)** - Complete CLI reference\n\nThis guide includes:\n\n- All CLI commands and shortcuts\n- Command syntax and examples\n- Service lifecycle management\n- Troubleshooting commands\n\n## Quick Reference\n\n### Essential Commands\n\n```\n# System status\nprovisioning status\nprovisioning health\n\n# Server management\nprovisioning server create\nprovisioning server list\nprovisioning server ssh \n\n# Task services\nprovisioning taskserv create \nprovisioning taskserv list\n\n# Workspace management\nprovisioning workspace list\nprovisioning workspace switch \n\n# Get help\nprovisioning help\nprovisioning help\n```\n\n## Additional References\n\n- **[Service Management Guide](SERVICE_MANAGEMENT_GUIDE.md)** - Complete CLI reference\n- **[Service Management Quick Reference](SERVICE_MANAGEMENT_QUICKREF.md)** - Quick lookup\n- **[Quick Start Cheatsheet](../guides/quickstart-cheatsheet.md)** - All shortcuts\n- **[Authentication Guide](AUTHENTICATION_LAYER_GUIDE.md)** - Auth commands\n\n---\n\nFor complete command documentation, see [Service Management Guide](SERVICE_MANAGEMENT_GUIDE.md). +# Command Reference + +Complete command reference for the provisioning CLI. 
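+
+Commands accept full names or registered shortcuts, and help routing is bi-directional.
+A quick illustration (the `s` shortcut assumes the default dispatcher registry documented
+in the Command Handler Developer Guide):
+
+```text
+provisioning server list    # full command
+provisioning s list         # same command via shortcut
+provisioning help server    # help by command
+provisioning server help    # bi-directional help
+```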
+
+## 📖 Service Management Guide
+
+The primary command reference is now part of the Service Management Guide:
+
+**→ [Service Management Guide](SERVICE_MANAGEMENT_GUIDE.md)** - Complete CLI reference
+
+This guide includes:
+
+- All CLI commands and shortcuts
+- Command syntax and examples
+- Service lifecycle management
+- Troubleshooting commands
+
+## Quick Reference
+
+### Essential Commands
+
+```text
+# System status
+provisioning status
+provisioning health
+
+# Server management
+provisioning server create
+provisioning server list
+provisioning server ssh <server>
+
+# Task services
+provisioning taskserv create <taskserv>
+provisioning taskserv list
+
+# Workspace management
+provisioning workspace list
+provisioning workspace switch <workspace>
+
+# Get help
+provisioning help
+provisioning help <command>
+```
+
+## Additional References
+
+- **[Service Management Guide](SERVICE_MANAGEMENT_GUIDE.md)** - Complete CLI reference
+- **[Service Management Quick Reference](SERVICE_MANAGEMENT_QUICKREF.md)** - Quick lookup
+- **[Quick Start Cheatsheet](../guides/quickstart-cheatsheet.md)** - All shortcuts
+- **[Authentication Guide](AUTHENTICATION_LAYER_GUIDE.md)** - Auth commands
+
+---
+
+For complete command documentation, see [Service Management Guide](SERVICE_MANAGEMENT_GUIDE.md).
\ No newline at end of file
diff --git a/docs/src/development/ctrl-c-implementation-notes.md b/docs/src/development/ctrl-c-implementation-notes.md
index 4f24c3e..894acd8 100644
--- a/docs/src/development/ctrl-c-implementation-notes.md
+++ b/docs/src/development/ctrl-c-implementation-notes.md
@@ -1 +1,295 @@
-# CTRL-C Handling Implementation Notes\n\n## Overview\n\nImplemented graceful CTRL-C handling for sudo password prompts during server creation/generation operations.\n\n## Problem Statement\n\nWhen `fix_local_hosts: true` is set, the provisioning tool requires sudo access to\nmodify `/etc/hosts` and SSH config. When a user cancels the sudo password prompt (no\npassword, wrong password, timeout), the system would:\n\n1. Exit with code 1 (sudo failed)\n2. Propagate null values up the call stack\n3. Show cryptic Nushell errors about pipeline failures\n4. Leave the operation in an inconsistent state\n\n**Important Unix Limitation**: Pressing CTRL-C at the sudo password prompt sends SIGINT to the entire process group, interrupting Nushell before exit\ncode handling can occur. This **cannot be caught** and is expected Unix behavior.\n\n## Solution Architecture\n\n### Key Principle: Return Values, Not Exit Codes\n\nInstead of using `exit 130` which kills the entire process, we use **return values**\nto signal cancellation and let each layer of the call stack handle it gracefully.\n\n### Three-Layer Approach\n\n1. **Detection Layer** (ssh.nu helper functions)\n - Detects sudo cancellation via exit code + stderr\n - Returns `false` instead of calling `exit`\n\n2. **Propagation Layer** (ssh.nu core functions)\n - `on_server_ssh()`: Returns `false` on cancellation\n - `server_ssh()`: Uses `reduce` to propagate failures\n\n3. **Handling Layer** (create.nu, generate.nu)\n - Checks return values\n - Displays user-friendly messages\n - Returns `false` to caller\n\n## Implementation Details\n\n### 1. 
Helper Functions (ssh.nu:11-32)\n\n```\ndef check_sudo_cached []: nothing -> bool {\n let result = (do --ignore-errors { ^sudo -n true } | complete)\n $result.exit_code == 0\n}\n\ndef run_sudo_with_interrupt_check [\n command: closure\n operation_name: string\n]: nothing -> bool {\n let result = (do --ignore-errors { do $command } | complete)\n if $result.exit_code == 1 and ($result.stderr | str contains "password is required") {\n print "\n⚠ Operation cancelled - sudo password required but not provided"\n print "ℹ Run 'sudo -v' first to cache credentials, or run without --fix-local-hosts"\n return false # Signal cancellation\n } else if $result.exit_code != 0 and $result.exit_code != 1 {\n error make {msg: $"($operation_name) failed: ($result.stderr)"}\n }\n true\n}\n```\n\n**Design Decision**: Return `bool` instead of throwing error or calling `exit`. This allows the caller to decide how to handle cancellation.\n\n### 2. Pre-emptive Warning (ssh.nu:155-160)\n\n```\nif $server.fix_local_hosts and not (check_sudo_cached) {\n print "\n⚠ Sudo access required for --fix-local-hosts"\n print "ℹ You will be prompted for your password, or press CTRL-C to cancel"\n print " Tip: Run 'sudo -v' beforehand to cache credentials\n"\n}\n```\n\n**Design Decision**: Warn users upfront so they're not surprised by the password prompt.\n\n### 3. CTRL-C Detection (ssh.nu:171-199)\n\nAll sudo commands wrapped with detection:\n\n```\nlet result = (do --ignore-errors { ^sudo } | complete)\nif $result.exit_code == 1 and ($result.stderr | str contains "password is required") {\n print "\n⚠ Operation cancelled"\n return false\n}\n```\n\n**Design Decision**: Use `do --ignore-errors` + `complete` to capture both exit code and stderr without throwing exceptions.\n\n### 4. State Accumulation Pattern (ssh.nu:122-129)\n\nUsing Nushell's `reduce` instead of mutable variables:\n\n```\nlet all_succeeded = ($settings.data.servers | reduce -f true { |server, acc|\n if $text_match == null or $server.hostname == $text_match {\n let result = (on_server_ssh $settings $server $ip_type $request_from $run)\n $acc and $result\n } else {\n $acc\n }\n})\n```\n\n**Design Decision**: Nushell doesn't allow mutable variable capture in closures. Use `reduce` for accumulating boolean state across iterations.\n\n### 5. Caller Handling (create.nu:262-266, generate.nu:269-273)\n\n```\nlet ssh_result = (on_server_ssh $settings $server "pub" "create" false)\nif not $ssh_result {\n _print "\n✗ Server creation cancelled"\n return false\n}\n```\n\n**Design Decision**: Check return value and provide context-specific message before returning.\n\n## Error Flow Diagram\n\n```\nUser presses CTRL-C during password prompt\n ↓\nsudo exits with code 1, stderr: "password is required"\n ↓\ndo --ignore-errors captures exit code & stderr\n ↓\nDetection logic identifies cancellation\n ↓\nPrint user-friendly message\n ↓\nReturn false (not exit!)\n ↓\non_server_ssh returns false\n ↓\nCaller (create.nu/generate.nu) checks return value\n ↓\nPrint "✗ Server creation cancelled"\n ↓\nReturn false to settings.nu\n ↓\nsettings.nu handles false gracefully (no append)\n ↓\nClean exit, no cryptic errors\n```\n\n## Nushell Idioms Used\n\n### 1. `do --ignore-errors` + `complete`\n\nCaptures both stdout, stderr, and exit code without throwing:\n\n```\nlet result = (do --ignore-errors { ^sudo command } | complete)\n# result = { stdout: "...", stderr: "...", exit_code: 1 }\n```\n\n### 2. 
`reduce` for Accumulation\n\nInstead of mutable variables in loops:\n\n```\n# ❌ BAD - mutable capture in closure\nmut all_succeeded = true\n$servers | each { |s|\n $all_succeeded = false # Error: capture of mutable variable\n}\n\n# ✅ GOOD - reduce with accumulator\nlet all_succeeded = ($servers | reduce -f true { |s, acc|\n $acc and (check_server $s)\n})\n```\n\n### 3. Early Returns for Error Handling\n\n```\nif not $condition {\n print "Error message"\n return false\n}\n# Continue with happy path\n```\n\n## Testing Scenarios\n\n### Scenario 1: CTRL-C During First Sudo Command\n\n```\nprovisioning -c server create\n# Password: [CTRL-C]\n\n# Expected Output:\n# ⚠ Operation cancelled - sudo password required but not provided\n# ℹ Run 'sudo -v' first to cache credentials\n# ✗ Server creation cancelled\n```\n\n### Scenario 2: Pre-cached Credentials\n\n```\nsudo -v\nprovisioning -c server create\n\n# Expected: No password prompt, smooth operation\n```\n\n### Scenario 3: Wrong Password 3 Times\n\n```\nprovisioning -c server create\n# Password: [wrong]\n# Password: [wrong]\n# Password: [wrong]\n\n# Expected: Same as CTRL-C (treated as cancellation)\n```\n\n### Scenario 4: Multiple Servers, Cancel on Second\n\n```\n# If creating multiple servers and CTRL-C on second:\n# - First server completes successfully\n# - Second server shows cancellation message\n# - Operation stops, doesn't proceed to third\n```\n\n## Maintenance Notes\n\n### Adding New Sudo Commands\n\nWhen adding new sudo commands to the codebase:\n\n1. Wrap with `do --ignore-errors` + `complete`\n2. Check for exit code 1 + "password is required"\n3. Return `false` on cancellation\n4. Let caller handle the `false` return value\n\nExample template:\n\n```\nlet result = (do --ignore-errors { ^sudo new-command } | complete)\nif $result.exit_code == 1 and ($result.stderr | str contains "password is required") {\n print "\n⚠ Operation cancelled - sudo password required"\n return false\n}\n```\n\n### Common Pitfalls\n\n1. **Don't use `exit`**: It kills the entire process\n2. **Don't use mutable variables in closures**: Use `reduce` instead\n3. **Don't ignore return values**: Always check and propagate\n4. **Don't forget the pre-check warning**: Users should know sudo is needed\n\n## Future Improvements\n\n1. **Sudo Credential Manager**: Optionally use a credential manager (keychain, etc.)\n2. **Sudo-less Mode**: Alternative implementation that doesn't require root\n3. **Timeout Handling**: Detect when sudo times out waiting for password\n4. 
**Multiple Password Attempts**: Distinguish between CTRL-C and wrong password\n\n## References\n\n- Nushell `complete` command: \n- Nushell `reduce` command: \n- Sudo exit codes: man sudo (exit code 1 = authentication failure)\n- POSIX signal conventions: SIGINT (CTRL-C) = 130\n\n## Related Files\n\n- `provisioning/core/nulib/servers/ssh.nu` - Core implementation\n- `provisioning/core/nulib/servers/create.nu` - Calls on_server_ssh\n- `provisioning/core/nulib/servers/generate.nu` - Calls on_server_ssh\n- `docs/troubleshooting/CTRL-C_SUDO_HANDLING.md` - User-facing docs\n- `docs/quick-reference/SUDO_PASSWORD_HANDLING.md` - Quick reference\n\n## Changelog\n\n- **2025-01-XX**: Initial implementation with return values (v2)\n- **2025-01-XX**: Fixed mutable variable capture with `reduce` pattern\n- **2025-01-XX**: First attempt with `exit 130` (reverted, caused process termination)
+# CTRL-C Handling Implementation Notes
+
+## Overview
+
+Implemented graceful CTRL-C handling for sudo password prompts during server creation/generation operations.
+
+## Problem Statement
+
+When `fix_local_hosts: true` is set, the provisioning tool requires sudo access to
+modify `/etc/hosts` and SSH config. When a user cancels the sudo password prompt (no
+password, wrong password, timeout), the system would:
+
+1. Exit with code 1 (sudo failed)
+2. Propagate null values up the call stack
+3. Show cryptic Nushell errors about pipeline failures
+4. Leave the operation in an inconsistent state
+
+**Important Unix Limitation**: Pressing CTRL-C at the sudo password prompt sends SIGINT to the entire process group, interrupting Nushell before exit
+code handling can occur. This **cannot be caught** and is expected Unix behavior.
+
+## Solution Architecture
+
+### Key Principle: Return Values, Not Exit Codes
+
+Instead of using `exit 130`, which kills the entire process, we use **return values**
+to signal cancellation and let each layer of the call stack handle it gracefully.
+
+### Three-Layer Approach
+
+1. **Detection Layer** (ssh.nu helper functions)
+ - Detects sudo cancellation via exit code + stderr
+ - Returns `false` instead of calling `exit`
+
+2. **Propagation Layer** (ssh.nu core functions)
+ - `on_server_ssh()`: Returns `false` on cancellation
+ - `server_ssh()`: Uses `reduce` to propagate failures
+
+3. **Handling Layer** (create.nu, generate.nu)
+ - Checks return values
+ - Displays user-friendly messages
+ - Returns `false` to caller
+
+## Implementation Details
+
+### 1. Helper Functions (ssh.nu:11-32)
+
+```text
+def check_sudo_cached []: nothing -> bool {
+ let result = (do --ignore-errors { ^sudo -n true } | complete)
+ $result.exit_code == 0
+}
+
+def run_sudo_with_interrupt_check [
+ command: closure
+ operation_name: string
+]: nothing -> bool {
+ let result = (do --ignore-errors { do $command } | complete)
+ if $result.exit_code == 1 and ($result.stderr | str contains "password is required") {
+ print "\n⚠ Operation cancelled - sudo password required but not provided"
+ print "ℹ Run 'sudo -v' first to cache credentials, or run without --fix-local-hosts"
+ return false # Signal cancellation
+ } else if $result.exit_code != 0 and $result.exit_code != 1 {
+ error make {msg: $"($operation_name) failed: ($result.stderr)"}
+ }
+ true
+}
+```
+
+**Design Decision**: Return `bool` instead of throwing an error or calling `exit`. This allows the caller to decide how to handle cancellation.
+
+### 2. Pre-emptive Warning (ssh.nu:155-160)
+
+```text
+if $server.fix_local_hosts and not (check_sudo_cached) {
+ print "\n⚠ Sudo access required for --fix-local-hosts"
+ print "ℹ You will be prompted for your password, or press CTRL-C to cancel"
+ print " Tip: Run 'sudo -v' beforehand to cache credentials\n"
+}
+```
+
+**Design Decision**: Warn users upfront so they're not surprised by the password prompt.
+
+### 3. CTRL-C Detection (ssh.nu:171-199)
+
+All sudo commands are wrapped with detection logic:
+
+```text
+let result = (do --ignore-errors { ^sudo <command> } | complete)
+if $result.exit_code == 1 and ($result.stderr | str contains "password is required") {
+ print "\n⚠ Operation cancelled"
+ return false
+}
+```
+
+**Design Decision**: Use `do --ignore-errors` + `complete` to capture both exit code and stderr without throwing exceptions.
+
+### 4. State Accumulation Pattern (ssh.nu:122-129)
+
+Using Nushell's `reduce` instead of mutable variables:
+
+```text
+let all_succeeded = ($settings.data.servers | reduce -f true { |server, acc|
+ if $text_match == null or $server.hostname == $text_match {
+ let result = (on_server_ssh $settings $server $ip_type $request_from $run)
+ $acc and $result
+ } else {
+ $acc
+ }
+})
+```
+
+**Design Decision**: Nushell doesn't allow mutable variable capture in closures. Use `reduce` for accumulating boolean state across iterations.
+
+### 5. Caller Handling (create.nu:262-266, generate.nu:269-273)
+
+```text
+let ssh_result = (on_server_ssh $settings $server "pub" "create" false)
+if not $ssh_result {
+ _print "\n✗ Server creation cancelled"
+ return false
+}
+```
+
+**Design Decision**: Check the return value and provide a context-specific message before returning.
+
+## Error Flow Diagram
+
+```text
+User presses CTRL-C during password prompt
+ ↓
+sudo exits with code 1, stderr: "password is required"
+ ↓
+do --ignore-errors captures exit code & stderr
+ ↓
+Detection logic identifies cancellation
+ ↓
+Print user-friendly message
+ ↓
+Return false (not exit!)
+ ↓
+on_server_ssh returns false
+ ↓
+Caller (create.nu/generate.nu) checks return value
+ ↓
+Print "✗ Server creation cancelled"
+ ↓
+Return false to settings.nu
+ ↓
+settings.nu handles false gracefully (no append)
+ ↓
+Clean exit, no cryptic errors
+```
+
+## Nushell Idioms Used
+
+### 1. `do --ignore-errors` + `complete`
+
+Captures stdout, stderr, and the exit code without throwing:
+
+```text
+let result = (do --ignore-errors { ^sudo command } | complete)
+# result = { stdout: "...", stderr: "...", exit_code: 1 }
+```
+
+### 2. `reduce` for Accumulation
+
+Instead of mutable variables in loops:
+
+```text
+# ❌ BAD - mutable capture in closure
+mut all_succeeded = true
+$servers | each { |s|
+ $all_succeeded = false # Error: capture of mutable variable
+}
+
+# ✅ GOOD - reduce with accumulator
+let all_succeeded = ($servers | reduce -f true { |s, acc|
+ $acc and (check_server $s)
+})
+```
+
+### 3. Early Returns for Error Handling
+
+```text
+if not $condition {
+ print "Error message"
+ return false
+}
+# Continue with happy path
+```
+
+## Testing Scenarios
+
+### Scenario 1: CTRL-C During First Sudo Command
+
+```text
+provisioning -c server create
+# Password: [CTRL-C]
+
+# Expected Output:
+# ⚠ Operation cancelled - sudo password required but not provided
+# ℹ Run 'sudo -v' first to cache credentials
+# ✗ Server creation cancelled
+```
+
+### Scenario 2: Pre-cached Credentials
+
+```text
+sudo -v
+provisioning -c server create
+
+# Expected: No password prompt, smooth operation
+```
+
+### Scenario 3: Wrong Password 3 Times
+
+```text
+provisioning -c server create
+# Password: [wrong]
+# Password: [wrong]
+# Password: [wrong]
+
+# Expected: Same as CTRL-C (treated as cancellation)
+```
+
+### Scenario 4: Multiple Servers, Cancel on Second
+
+```text
+# If creating multiple servers and CTRL-C on second:
+# - First server completes successfully
+# - Second server shows cancellation message
+# - Operation stops, doesn't proceed to third
+```
+
+## Maintenance Notes
+
+### Adding New Sudo Commands
+
+When adding new sudo commands to the codebase:
+
+1. Wrap with `do --ignore-errors` + `complete`
+2. Check for exit code 1 + "password is required"
+3. Return `false` on cancellation
+4. Let the caller handle the `false` return value
+
+Example template:
+
+```text
+let result = (do --ignore-errors { ^sudo new-command } | complete)
+if $result.exit_code == 1 and ($result.stderr | str contains "password is required") {
+ print "\n⚠ Operation cancelled - sudo password required"
+ return false
+}
+```
+
+### Common Pitfalls
+
+1. **Don't use `exit`**: It kills the entire process
+2. **Don't use mutable variables in closures**: Use `reduce` instead
+3. **Don't ignore return values**: Always check and propagate
+4. **Don't forget the pre-check warning**: Users should know sudo is needed
+
+## Future Improvements
+
+1. **Sudo Credential Manager**: Optionally use a credential manager (keychain, etc.)
+2. **Sudo-less Mode**: Alternative implementation that doesn't require root
+3. **Timeout Handling**: Detect when sudo times out waiting for password
+4. 
**Multiple Password Attempts**: Distinguish between CTRL-C and wrong password + +## References + +- Nushell `complete` command: +- Nushell `reduce` command: +- Sudo exit codes: man sudo (exit code 1 = authentication failure) +- POSIX signal conventions: SIGINT (CTRL-C) = 130 + +## Related Files + +- `provisioning/core/nulib/servers/ssh.nu` - Core implementation +- `provisioning/core/nulib/servers/create.nu` - Calls on_server_ssh +- `provisioning/core/nulib/servers/generate.nu` - Calls on_server_ssh +- `docs/troubleshooting/CTRL-C_SUDO_HANDLING.md` - User-facing docs +- `docs/quick-reference/SUDO_PASSWORD_HANDLING.md` - Quick reference + +## Changelog + +- **2025-01-XX**: Initial implementation with return values (v2) +- **2025-01-XX**: Fixed mutable variable capture with `reduce` pattern +- **2025-01-XX**: First attempt with `exit 130` (reverted, caused process termination) \ No newline at end of file diff --git a/docs/src/development/dev-configuration.md b/docs/src/development/dev-configuration.md index b3b8eea..6c37972 100644 --- a/docs/src/development/dev-configuration.md +++ b/docs/src/development/dev-configuration.md @@ -1 +1,984 @@ -# Configuration Management\n\nThis document provides comprehensive guidance on provisioning's configuration architecture, environment-specific configurations, validation, error\nhandling, and migration strategies.\n\n## Table of Contents\n\n1. [Overview](#overview)\n2. [Configuration Architecture](#configuration-architecture)\n3. [Configuration Files](#configuration-files)\n4. [Environment-Specific Configuration](#environment-specific-configuration)\n5. [User Overrides and Customization](#user-overrides-and-customization)\n6. [Validation and Error Handling](#validation-and-error-handling)\n7. [Interpolation and Dynamic Values](#interpolation-and-dynamic-values)\n8. [Migration Strategies](#migration-strategies)\n9. [Troubleshooting](#troubleshooting)\n\n## Overview\n\nProvisioning implements a sophisticated configuration management system that has migrated from environment variable-based configuration to a\nhierarchical TOML configuration system with comprehensive validation and interpolation support.\n\n**Key Features**:\n\n- **Hierarchical Configuration**: Multi-layer configuration with clear precedence\n- **Environment-Specific**: Dedicated configurations for dev, test, and production\n- **Dynamic Interpolation**: Template-based value resolution\n- **Type Safety**: Comprehensive validation and error handling\n- **Migration Support**: Backward compatibility with existing ENV variables\n- **Workspace Integration**: Seamless integration with development workspaces\n\n**Migration Status**: ✅ **Complete** (2025-09-23)\n\n- **65+ files migrated** across entire codebase\n- **200+ ENV variables replaced** with 476 config accessors\n- **16 token-efficient agents** used for systematic migration\n- **92% token efficiency** achieved vs monolithic approach\n\n## Configuration Architecture\n\n### Hierarchical Loading Order\n\nThe configuration system implements a clear precedence hierarchy (lowest to highest precedence):\n\n```\nConfiguration Hierarchy (Low → High Precedence)\n┌─────────────────────────────────────────────────┐\n│ 1. config.defaults.toml │ ← System defaults\n│ (System-wide default values) │\n├─────────────────────────────────────────────────┤\n│ 2. ~/.config/provisioning/config.toml │ ← User configuration\n│ (User-specific preferences) │\n├─────────────────────────────────────────────────┤\n│ 3. 
./provisioning.toml │ ← Project configuration\n│ (Project-specific settings) │\n├─────────────────────────────────────────────────┤\n│ 4. ./.provisioning.toml │ ← Infrastructure config\n│ (Infrastructure-specific settings) │\n├─────────────────────────────────────────────────┤\n│ 5. Environment-specific configs │ ← Environment overrides\n│ (config.{dev,test,prod}.toml) │\n├─────────────────────────────────────────────────┤\n│ 6. Runtime environment variables │ ← Runtime overrides\n│ (PROVISIONING_* variables) │\n└─────────────────────────────────────────────────┘\n```\n\n### Configuration Access Patterns\n\n**Configuration Accessor Functions**:\n\n```\n# Core configuration access\nuse core/nulib/lib_provisioning/config/accessor.nu\n\n# Get configuration value with fallback\nlet api_url = (get-config-value "providers.upcloud.api_url" "https://api.upcloud.com")\n\n# Get required configuration (errors if missing)\nlet api_key = (get-config-required "providers.upcloud.api_key")\n\n# Get nested configuration\nlet server_defaults = (get-config-section "defaults.servers")\n\n# Environment-aware configuration\nlet log_level = (get-config-env "logging.level" "info")\n\n# Interpolated configuration\nlet data_path = (get-config-interpolated "paths.data") # Resolves {{paths.base}}/data\n```\n\n### Migration from ENV Variables\n\n**Before (ENV-based)**:\n\n```\nexport PROVISIONING_UPCLOUD_API_KEY="your-key"\nexport PROVISIONING_UPCLOUD_API_URL="https://api.upcloud.com"\nexport PROVISIONING_LOG_LEVEL="debug"\nexport PROVISIONING_BASE_PATH="/usr/local/provisioning"\n```\n\n**After (Config-based)**:\n\n```\n# config.user.toml\n[providers.upcloud]\napi_key = "your-key"\napi_url = "https://api.upcloud.com"\n\n[logging]\nlevel = "debug"\n\n[paths]\nbase = "/usr/local/provisioning"\n```\n\n## Configuration Files\n\n### System Defaults (`config.defaults.toml`)\n\n**Purpose**: Provides sensible defaults for all system components\n**Location**: Root of the repository\n**Modification**: Should only be modified by system maintainers\n\n```\n# System-wide defaults - DO NOT MODIFY in production\n# Copy values to config.user.toml for customization\n\n[core]\nversion = "1.0.0"\nname = "provisioning-system"\n\n[paths]\n# Base path - all other paths derived from this\nbase = "/usr/local/provisioning"\nconfig = "{{paths.base}}/config"\ndata = "{{paths.base}}/data"\nlogs = "{{paths.base}}/logs"\ncache = "{{paths.base}}/cache"\nruntime = "{{paths.base}}/runtime"\n\n[logging]\nlevel = "info"\nfile = "{{paths.logs}}/provisioning.log"\nrotation = true\nmax_size = "100 MB"\nmax_files = 5\n\n[http]\ntimeout = 30\nretries = 3\nuser_agent = "provisioning-system/{{core.version}}"\nuse_curl = false\n\n[providers]\ndefault = "local"\n\n[providers.upcloud]\napi_url = "https://api.upcloud.com/1.3"\ntimeout = 30\nmax_retries = 3\n\n[providers.aws]\nregion = "us-east-1"\ntimeout = 30\n\n[providers.local]\nenabled = true\nbase_path = "{{paths.data}}/local"\n\n[defaults]\n[defaults.servers]\nplan = "1xCPU-2 GB"\nzone = "auto"\ntemplate = "ubuntu-22.04"\n\n[cache]\nenabled = true\nttl = 3600\npath = "{{paths.cache}}"\n\n[orchestrator]\nenabled = false\nport = 8080\nbind = "127.0.0.1"\ndata_path = "{{paths.data}}/orchestrator"\n\n[workflow]\nstorage_backend = "filesystem"\nparallel_limit = 5\nrollback_enabled = true\n\n[telemetry]\nenabled = false\nendpoint = ""\nsample_rate = 0.1\n```\n\n### User Configuration (`~/.config/provisioning/config.toml`)\n\n**Purpose**: User-specific customizations and preferences\n**Location**: User's 
configuration directory\n**Modification**: Users should customize this file for their needs\n\n```\n# User configuration - customizations and personal preferences\n# This file overrides system defaults\n\n[core]\nname = "provisioning-{{env.USER}}"\n\n[paths]\n# Personal installation path\nbase = "{{env.HOME}}/.local/share/provisioning"\n\n[logging]\nlevel = "debug"\nfile = "{{paths.logs}}/provisioning-{{env.USER}}.log"\n\n[providers]\ndefault = "upcloud"\n\n[providers.upcloud]\napi_key = "your-personal-api-key"\napi_secret = "your-personal-api-secret"\n\n[defaults.servers]\nplan = "2xCPU-4 GB"\nzone = "us-nyc1"\n\n[development]\nauto_reload = true\nhot_reload_templates = true\nverbose_errors = true\n\n[notifications]\nslack_webhook = "https://hooks.slack.com/your-webhook"\nemail = "your-email@domain.com"\n\n[git]\nauto_commit = true\ncommit_prefix = "[{{env.USER}}]"\n```\n\n### Project Configuration (`./provisioning.toml`)\n\n**Purpose**: Project-specific settings shared across team\n**Location**: Project root directory\n**Version Control**: Should be committed to version control\n\n```\n# Project-specific configuration\n# Shared settings for this project/repository\n\n[core]\nname = "my-project-provisioning"\nversion = "1.2.0"\n\n[infra]\ndefault = "staging"\nenvironments = ["dev", "staging", "production"]\n\n[providers]\ndefault = "upcloud"\nallowed = ["upcloud", "aws", "local"]\n\n[providers.upcloud]\n# Project-specific UpCloud settings\ndefault_zone = "us-nyc1"\ntemplate = "ubuntu-22.04-lts"\n\n[defaults.servers]\nplan = "2xCPU-4 GB"\nstorage = 50\nfirewall_enabled = true\n\n[security]\nenforce_https = true\nrequire_mfa = true\nallowed_cidr = ["10.0.0.0/8", "172.16.0.0/12"]\n\n[compliance]\ndata_region = "us-east"\nencryption_at_rest = true\naudit_logging = true\n\n[team]\nadmins = ["alice@company.com", "bob@company.com"]\ndevelopers = ["dev-team@company.com"]\n```\n\n### Infrastructure Configuration (`./.provisioning.toml`)\n\n**Purpose**: Infrastructure-specific overrides\n**Location**: Infrastructure directory\n**Usage**: Overrides for specific infrastructure deployments\n\n```\n# Infrastructure-specific configuration\n# Overrides for this specific infrastructure deployment\n\n[core]\nname = "production-east-provisioning"\n\n[infra]\nname = "production-east"\nenvironment = "production"\nregion = "us-east-1"\n\n[providers.upcloud]\nzone = "us-nyc1"\nprivate_network = true\n\n[providers.aws]\nregion = "us-east-1"\navailability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]\n\n[defaults.servers]\nplan = "4xCPU-8 GB"\nstorage = 100\nbackup_enabled = true\nmonitoring_enabled = true\n\n[security]\nfirewall_strict_mode = true\nencryption_required = true\naudit_all_actions = true\n\n[monitoring]\nprometheus_enabled = true\ngrafana_enabled = true\nalertmanager_enabled = true\n\n[backup]\nenabled = true\nschedule = "0 2 * * *" # Daily at 2 AM\nretention_days = 30\n```\n\n## Environment-Specific Configuration\n\n### Development Environment (`config.dev.toml`)\n\n**Purpose**: Development-optimized settings\n**Features**: Enhanced debugging, local providers, relaxed validation\n\n```\n# Development environment configuration\n# Optimized for local development and testing\n\n[core]\nname = "provisioning-dev"\nversion = "dev-{{git.branch}}"\n\n[paths]\nbase = "{{env.PWD}}/dev-environment"\n\n[logging]\nlevel = "debug"\nconsole_output = true\nstructured_logging = true\ndebug_http = true\n\n[providers]\ndefault = "local"\n\n[providers.local]\nenabled = true\nfast_mode = true\nmock_delays = 
false\n\n[http]\ntimeout = 10\nretries = 1\ndebug_requests = true\n\n[cache]\nenabled = true\nttl = 60 # Short TTL for development\ndebug_cache = true\n\n[development]\nauto_reload = true\nhot_reload_templates = true\nvalidate_strict = false\nexperimental_features = true\ndebug_mode = true\n\n[orchestrator]\nenabled = true\nport = 8080\ndebug = true\nfile_watcher = true\n\n[testing]\nparallel_tests = true\ncleanup_after_tests = true\nmock_external_apis = true\n```\n\n### Testing Environment (`config.test.toml`)\n\n**Purpose**: Testing-specific configuration\n**Features**: Mock services, isolated environments, comprehensive logging\n\n```\n# Testing environment configuration\n# Optimized for automated testing and CI/CD\n\n[core]\nname = "provisioning-test"\nversion = "test-{{build.timestamp}}"\n\n[logging]\nlevel = "info"\ntest_output = true\ncapture_stderr = true\n\n[providers]\ndefault = "local"\n\n[providers.local]\nenabled = true\nmock_mode = true\ndeterministic = true\n\n[http]\ntimeout = 5\nretries = 0\nmock_responses = true\n\n[cache]\nenabled = false\n\n[testing]\nisolated_environments = true\ncleanup_after_each_test = true\nparallel_execution = true\nmock_all_external_calls = true\ndeterministic_ids = true\n\n[orchestrator]\nenabled = false\n\n[validation]\nstrict_mode = true\nfail_fast = true\n```\n\n### Production Environment (`config.prod.toml`)\n\n**Purpose**: Production-optimized settings\n**Features**: Performance optimization, security hardening, comprehensive monitoring\n\n```\n# Production environment configuration\n# Optimized for performance, reliability, and security\n\n[core]\nname = "provisioning-production"\nversion = "{{release.version}}"\n\n[logging]\nlevel = "warn"\nstructured_logging = true\nsensitive_data_filtering = true\naudit_logging = true\n\n[providers]\ndefault = "upcloud"\n\n[http]\ntimeout = 60\nretries = 5\nconnection_pool = 20\nkeep_alive = true\n\n[cache]\nenabled = true\nttl = 3600\nsize_limit = "500 MB"\npersistence = true\n\n[security]\nstrict_mode = true\nencrypt_at_rest = true\nencrypt_in_transit = true\naudit_all_actions = true\n\n[monitoring]\nmetrics_enabled = true\ntracing_enabled = true\nhealth_checks = true\nalerting = true\n\n[orchestrator]\nenabled = true\nport = 8080\nbind = "0.0.0.0"\nworkers = 4\nmax_connections = 100\n\n[performance]\nparallel_operations = true\nbatch_operations = true\nconnection_pooling = true\n```\n\n## User Overrides and Customization\n\n### Personal Development Setup\n\n**Creating User Configuration**:\n\n```\n# Create user config directory\nmkdir -p ~/.config/provisioning\n\n# Copy template\ncp src/provisioning/config-examples/config.user.toml ~/.config/provisioning/config.toml\n\n# Customize for your environment\n$EDITOR ~/.config/provisioning/config.toml\n```\n\n**Common User Customizations**:\n\n```\n# Personal configuration customizations\n\n[paths]\nbase = "{{env.HOME}}/dev/provisioning"\n\n[development]\neditor = "code"\nauto_backup = true\nbackup_interval = "1h"\n\n[git]\nauto_commit = false\ncommit_template = "[{{env.USER}}] {{change.type}}: {{change.description}}"\n\n[providers.upcloud]\napi_key = "{{env.UPCLOUD_API_KEY}}"\napi_secret = "{{env.UPCLOUD_API_SECRET}}"\ndefault_zone = "de-fra1"\n\n[shortcuts]\n# Custom command aliases\nquick_server = "server create {{name}} 2xCPU-4 GB --zone us-nyc1"\ndev_cluster = "cluster create development --infra {{env.USER}}-dev"\n\n[notifications]\ndesktop_notifications = true\nsound_notifications = false\nslack_webhook = "{{env.SLACK_WEBHOOK_URL}}"\n```\n\n### 
Workspace-Specific Configuration\n\n**Workspace Integration**:\n\n```\n# Workspace-aware configuration\n# workspace/config/developer.toml\n\n[workspace]\nuser = "developer"\ntype = "development"\n\n[paths]\nbase = "{{workspace.root}}"\nextensions = "{{workspace.root}}/extensions"\nruntime = "{{workspace.root}}/runtime/{{workspace.user}}"\n\n[development]\nworkspace_isolation = true\nper_user_cache = true\nshared_extensions = false\n\n[infra]\ncurrent = "{{workspace.user}}-development"\nauto_create = true\n```\n\n## Validation and Error Handling\n\n### Configuration Validation\n\n**Built-in Validation**:\n\n```\n# Validate current configuration\nprovisioning validate config\n\n# Validate specific configuration file\nprovisioning validate config --file config.dev.toml\n\n# Show configuration with validation\nprovisioning config show --validate\n\n# Debug configuration loading\nprovisioning config debug\n```\n\n**Validation Rules**:\n\n```\n# Configuration validation in Nushell\ndef validate_configuration [config: record] -> record {\n let errors = []\n\n # Validate required fields\n if not ("paths" in $config and "base" in $config.paths) {\n $errors = ($errors | append "paths.base is required")\n }\n\n # Validate provider configuration\n if "providers" in $config {\n for provider in ($config.providers | columns) {\n if $provider == "upcloud" {\n if not ("api_key" in $config.providers.upcloud) {\n $errors = ($errors | append "providers.upcloud.api_key is required")\n }\n }\n }\n }\n\n # Validate numeric values\n if "http" in $config and "timeout" in $config.http {\n if $config.http.timeout <= 0 {\n $errors = ($errors | append "http.timeout must be positive")\n }\n }\n\n {\n valid: ($errors | length) == 0,\n errors: $errors\n }\n}\n```\n\n### Error Handling\n\n**Configuration-Driven Error Handling**:\n\n```\n# Never patch with hardcoded fallbacks - use configuration\ndef get_api_endpoint [provider: string] -> string {\n # Good: Configuration-driven with clear error\n let config_key = $"providers.($provider).api_url"\n let endpoint = try {\n get-config-required $config_key\n } catch {\n error make {\n msg: $"API endpoint not configured for provider ($provider)",\n help: $"Add '($config_key)' to your configuration file"\n }\n }\n\n $endpoint\n}\n\n# Bad: Hardcoded fallback defeats IaC purpose\ndef get_api_endpoint_bad [provider: string] -> string {\n try {\n get-config-required $"providers.($provider).api_url"\n } catch {\n # DON'T DO THIS - defeats configuration-driven architecture\n "https://default-api.com"\n }\n}\n```\n\n**Comprehensive Error Context**:\n\n```\ndef load_provider_config [provider: string] -> record {\n let config_section = $"providers.($provider)"\n\n try {\n get-config-section $config_section\n } catch { |e|\n error make {\n msg: $"Failed to load configuration for provider ($provider): ($e.msg)",\n label: {\n text: "configuration missing",\n span: (metadata $provider).span\n },\n help: [\n $"Add [$config_section] section to your configuration",\n "Example configuration files available in config-examples/",\n "Run 'provisioning config show' to see current configuration"\n ]\n }\n }\n}\n```\n\n## Interpolation and Dynamic Values\n\n### Interpolation Syntax\n\n**Supported Interpolation Variables**:\n\n```\n# Environment variables\nbase_path = "{{env.HOME}}/provisioning"\nuser_name = "{{env.USER}}"\n\n# Configuration references\ndata_path = "{{paths.base}}/data"\nlog_file = "{{paths.logs}}/{{core.name}}.log"\n\n# Date/time values\nbackup_name = 
"backup-{{now.date}}-{{now.time}}"\nversion = "{{core.version}}-{{now.timestamp}}"\n\n# Git information\nbranch_name = "{{git.branch}}"\ncommit_hash = "{{git.commit}}"\nversion_with_git = "{{core.version}}-{{git.commit}}"\n\n# System information\nhostname = "{{system.hostname}}"\nplatform = "{{system.platform}}"\narchitecture = "{{system.arch}}"\n```\n\n### Complex Interpolation Examples\n\n**Dynamic Path Resolution**:\n\n```\n[paths]\nbase = "{{env.HOME}}/.local/share/provisioning"\nconfig = "{{paths.base}}/config"\ndata = "{{paths.base}}/data/{{system.hostname}}"\nlogs = "{{paths.base}}/logs/{{env.USER}}/{{now.date}}"\nruntime = "{{paths.base}}/runtime/{{git.branch}}"\n\n[providers.upcloud]\ncache_path = "{{paths.cache}}/providers/upcloud/{{env.USER}}"\nlog_file = "{{paths.logs}}/upcloud-{{now.date}}.log"\n```\n\n**Environment-Aware Configuration**:\n\n```\n[core]\nname = "provisioning-{{system.hostname}}-{{env.USER}}"\nversion = "{{release.version}}+{{git.commit}}.{{now.timestamp}}"\n\n[database]\nname = "provisioning_{{env.USER}}_{{git.branch}}"\nbackup_prefix = "{{core.name}}-backup-{{now.date}}"\n\n[monitoring]\ninstance_id = "{{system.hostname}}-{{core.version}}"\ntags = {\n environment = "{{infra.environment}}",\n user = "{{env.USER}}",\n version = "{{core.version}}",\n deployment_time = "{{now.iso8601}}"\n}\n```\n\n### Interpolation Functions\n\n**Custom Interpolation Logic**:\n\n```\n# Interpolation resolver\ndef resolve_interpolation [template: string, context: record] -> string {\n let interpolations = ($template | parse --regex '\{\{([^}]+)\}\}')\n\n mut result = $template\n\n for interpolation in $interpolations {\n let key_path = ($interpolation.capture0 | str trim)\n let value = resolve_interpolation_key $key_path $context\n\n $result = ($result | str replace $"{{($interpolation.capture0)}}" $value)\n }\n\n $result\n}\n\ndef resolve_interpolation_key [key_path: string, context: record] -> string {\n match ($key_path | split row ".") {\n ["env", $var] => ($env | get $var | default ""),\n ["paths", $path] => (resolve_path_key $path $context),\n ["now", $format] => (resolve_time_format $format),\n ["git", $info] => (resolve_git_info $info),\n ["system", $info] => (resolve_system_info $info),\n $path => (get_nested_config_value $path $context)\n }\n}\n```\n\n## Migration Strategies\n\n### ENV to Config Migration\n\n**Migration Status**: The system has successfully migrated from ENV-based to config-driven architecture:\n\n**Migration Statistics**:\n\n- **Files Migrated**: 65+ files across entire codebase\n- **Variables Replaced**: 200+ ENV variables → 476 config accessors\n- **Agent-Based Development**: 16 token-efficient agents used\n- **Efficiency Gained**: 92% token efficiency vs monolithic approach\n\n### Legacy Support\n\n**Backward Compatibility**:\n\n```\n# Configuration accessor with ENV fallback\ndef get-config-with-env-fallback [\n config_key: string,\n env_var: string,\n default: string = ""\n] -> string {\n # Try configuration first\n let config_value = try {\n get-config-value $config_key\n } catch { null }\n\n if $config_value != null {\n return $config_value\n }\n\n # Fall back to environment variable\n let env_value = ($env | get $env_var | default null)\n if $env_value != null {\n return $env_value\n }\n\n # Use default if provided\n if $default != "" {\n return $default\n }\n\n # Error if no value found\n error make {\n msg: $"Configuration value not found: ($config_key)",\n help: $"Set ($config_key) in configuration or ($env_var) environment variable"\n 
}\n}\n```\n\n### Migration Tools\n\n**Available Migration Scripts**:\n\n```\n# Migrate existing ENV-based setup to configuration\nnu src/tools/migration/env-to-config.nu --scan-environment --create-config\n\n# Validate migration completeness\nnu src/tools/migration/validate-migration.nu --check-env-usage\n\n# Generate configuration from current environment\nnu src/tools/migration/generate-config.nu --output-file config.migrated.toml\n```\n\n## Troubleshooting\n\n### Common Configuration Issues\n\n#### Configuration Not Found\n\n**Error**: `Configuration file not found`\n\n```\n# Solution: Check configuration file paths\nprovisioning config paths\n\n# Create default configuration\nprovisioning config init --template user\n\n# Verify configuration loading order\nprovisioning config debug\n```\n\n#### Invalid Configuration Syntax\n\n**Error**: `Invalid TOML syntax in configuration file`\n\n```\n# Solution: Validate TOML syntax\nnu -c "open config.user.toml | from toml"\n\n# Use configuration validation\nprovisioning validate config --file config.user.toml\n\n# Show parsing errors\nprovisioning config check --verbose\n```\n\n#### Interpolation Errors\n\n**Error**: `Failed to resolve interpolation: {{env.MISSING_VAR}}`\n\n```\n# Solution: Check available interpolation variables\nprovisioning config interpolation --list-variables\n\n# Debug specific interpolation\nprovisioning config interpolation --test "{{env.USER}}"\n\n# Show interpolation context\nprovisioning config debug --show-interpolation\n```\n\n#### Provider Configuration Issues\n\n**Error**: `Provider 'upcloud' configuration invalid`\n\n```\n# Solution: Validate provider configuration\nprovisioning validate config --section providers.upcloud\n\n# Show required provider fields\nprovisioning providers upcloud config --show-schema\n\n# Test provider configuration\nprovisioning providers upcloud test --dry-run\n```\n\n### Debug Commands\n\n**Configuration Debugging**:\n\n```\n# Show complete resolved configuration\nprovisioning config show --resolved\n\n# Show configuration loading order\nprovisioning config debug --show-hierarchy\n\n# Show configuration sources\nprovisioning config sources\n\n# Test specific configuration keys\nprovisioning config get paths.base --trace\n\n# Show interpolation resolution\nprovisioning config interpolation --debug "{{paths.data}}/{{env.USER}}"\n```\n\n### Performance Optimization\n\n**Configuration Caching**:\n\n```\n# Enable configuration caching\nexport PROVISIONING_CONFIG_CACHE=true\n\n# Clear configuration cache\nprovisioning config cache --clear\n\n# Show cache statistics\nprovisioning config cache --stats\n```\n\n**Startup Optimization**:\n\n```\n# Optimize configuration loading\n[performance]\nlazy_loading = true\ncache_compiled_config = true\nskip_unused_sections = true\n\n[cache]\nconfig_cache_ttl = 3600\ninterpolation_cache = true\n```\n\nThis configuration management system provides a robust, flexible foundation that supports development workflows while maintaining production\nreliability and security requirements. +# Configuration Management + +This document provides comprehensive guidance on provisioning's configuration architecture, environment-specific configurations, validation, error +handling, and migration strategies. + +## Table of Contents + +1. [Overview](#overview) +2. [Configuration Architecture](#configuration-architecture) +3. [Configuration Files](#configuration-files) +4. [Environment-Specific Configuration](#environment-specific-configuration) +5. 
[User Overrides and Customization](#user-overrides-and-customization) +6. [Validation and Error Handling](#validation-and-error-handling) +7. [Interpolation and Dynamic Values](#interpolation-and-dynamic-values) +8. [Migration Strategies](#migration-strategies) +9. [Troubleshooting](#troubleshooting) + +## Overview + +Provisioning implements a sophisticated configuration management system that has migrated from environment variable-based configuration to a +hierarchical TOML configuration system with comprehensive validation and interpolation support. + +**Key Features**: + +- **Hierarchical Configuration**: Multi-layer configuration with clear precedence +- **Environment-Specific**: Dedicated configurations for dev, test, and production +- **Dynamic Interpolation**: Template-based value resolution +- **Type Safety**: Comprehensive validation and error handling +- **Migration Support**: Backward compatibility with existing ENV variables +- **Workspace Integration**: Seamless integration with development workspaces + +**Migration Status**: ✅ **Complete** (2025-09-23) + +- **65+ files migrated** across entire codebase +- **200+ ENV variables replaced** with 476 config accessors +- **16 token-efficient agents** used for systematic migration +- **92% token efficiency** achieved vs monolithic approach + +## Configuration Architecture + +### Hierarchical Loading Order + +The configuration system implements a clear precedence hierarchy (lowest to highest precedence): + +```text +Configuration Hierarchy (Low → High Precedence) +┌─────────────────────────────────────────────────┐ +│ 1. config.defaults.toml │ ← System defaults +│ (System-wide default values) │ +├─────────────────────────────────────────────────┤ +│ 2. ~/.config/provisioning/config.toml │ ← User configuration +│ (User-specific preferences) │ +├─────────────────────────────────────────────────┤ +│ 3. ./provisioning.toml │ ← Project configuration +│ (Project-specific settings) │ +├─────────────────────────────────────────────────┤ +│ 4. ./.provisioning.toml │ ← Infrastructure config +│ (Infrastructure-specific settings) │ +├─────────────────────────────────────────────────┤ +│ 5. Environment-specific configs │ ← Environment overrides +│ (config.{dev,test,prod}.toml) │ +├─────────────────────────────────────────────────┤ +│ 6. 
Runtime environment variables │ ← Runtime overrides +│ (PROVISIONING_* variables) │ +└─────────────────────────────────────────────────┘ +``` + +### Configuration Access Patterns + +**Configuration Accessor Functions**: + +```text +# Core configuration access +use core/nulib/lib_provisioning/config/accessor.nu + +# Get configuration value with fallback +let api_url = (get-config-value "providers.upcloud.api_url" "https://api.upcloud.com") + +# Get required configuration (errors if missing) +let api_key = (get-config-required "providers.upcloud.api_key") + +# Get nested configuration +let server_defaults = (get-config-section "defaults.servers") + +# Environment-aware configuration +let log_level = (get-config-env "logging.level" "info") + +# Interpolated configuration +let data_path = (get-config-interpolated "paths.data") # Resolves {{paths.base}}/data +``` + +### Migration from ENV Variables + +**Before (ENV-based)**: + +```text +export PROVISIONING_UPCLOUD_API_KEY="your-key" +export PROVISIONING_UPCLOUD_API_URL="https://api.upcloud.com" +export PROVISIONING_LOG_LEVEL="debug" +export PROVISIONING_BASE_PATH="/usr/local/provisioning" +``` + +**After (Config-based)**: + +```text +# config.user.toml +[providers.upcloud] +api_key = "your-key" +api_url = "https://api.upcloud.com" + +[logging] +level = "debug" + +[paths] +base = "/usr/local/provisioning" +``` + +## Configuration Files + +### System Defaults (`config.defaults.toml`) + +**Purpose**: Provides sensible defaults for all system components +**Location**: Root of the repository +**Modification**: Should only be modified by system maintainers + +```text +# System-wide defaults - DO NOT MODIFY in production +# Copy values to config.user.toml for customization + +[core] +version = "1.0.0" +name = "provisioning-system" + +[paths] +# Base path - all other paths derived from this +base = "/usr/local/provisioning" +config = "{{paths.base}}/config" +data = "{{paths.base}}/data" +logs = "{{paths.base}}/logs" +cache = "{{paths.base}}/cache" +runtime = "{{paths.base}}/runtime" + +[logging] +level = "info" +file = "{{paths.logs}}/provisioning.log" +rotation = true +max_size = "100 MB" +max_files = 5 + +[http] +timeout = 30 +retries = 3 +user_agent = "provisioning-system/{{core.version}}" +use_curl = false + +[providers] +default = "local" + +[providers.upcloud] +api_url = "https://api.upcloud.com/1.3" +timeout = 30 +max_retries = 3 + +[providers.aws] +region = "us-east-1" +timeout = 30 + +[providers.local] +enabled = true +base_path = "{{paths.data}}/local" + +[defaults] +[defaults.servers] +plan = "1xCPU-2 GB" +zone = "auto" +template = "ubuntu-22.04" + +[cache] +enabled = true +ttl = 3600 +path = "{{paths.cache}}" + +[orchestrator] +enabled = false +port = 8080 +bind = "127.0.0.1" +data_path = "{{paths.data}}/orchestrator" + +[workflow] +storage_backend = "filesystem" +parallel_limit = 5 +rollback_enabled = true + +[telemetry] +enabled = false +endpoint = "" +sample_rate = 0.1 +``` + +### User Configuration (`~/.config/provisioning/config.toml`) + +**Purpose**: User-specific customizations and preferences +**Location**: User's configuration directory +**Modification**: Users should customize this file for their needs + +```text +# User configuration - customizations and personal preferences +# This file overrides system defaults + +[core] +name = "provisioning-{{env.USER}}" + +[paths] +# Personal installation path +base = "{{env.HOME}}/.local/share/provisioning" + +[logging] +level = "debug" +file = "{{paths.logs}}/provisioning-{{env.USER}}.log" 
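+# Both values above use interpolation: {{paths.logs}} and {{env.USER}} are resolved by the interpolation system (see Interpolation and Dynamic Values below)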
+ +[providers] +default = "upcloud" + +[providers.upcloud] +api_key = "your-personal-api-key" +api_secret = "your-personal-api-secret" + +[defaults.servers] +plan = "2xCPU-4 GB" +zone = "us-nyc1" + +[development] +auto_reload = true +hot_reload_templates = true +verbose_errors = true + +[notifications] +slack_webhook = "https://hooks.slack.com/your-webhook" +email = "your-email@domain.com" + +[git] +auto_commit = true +commit_prefix = "[{{env.USER}}]" +``` + +### Project Configuration (`./provisioning.toml`) + +**Purpose**: Project-specific settings shared across team +**Location**: Project root directory +**Version Control**: Should be committed to version control + +```text +# Project-specific configuration +# Shared settings for this project/repository + +[core] +name = "my-project-provisioning" +version = "1.2.0" + +[infra] +default = "staging" +environments = ["dev", "staging", "production"] + +[providers] +default = "upcloud" +allowed = ["upcloud", "aws", "local"] + +[providers.upcloud] +# Project-specific UpCloud settings +default_zone = "us-nyc1" +template = "ubuntu-22.04-lts" + +[defaults.servers] +plan = "2xCPU-4 GB" +storage = 50 +firewall_enabled = true + +[security] +enforce_https = true +require_mfa = true +allowed_cidr = ["10.0.0.0/8", "172.16.0.0/12"] + +[compliance] +data_region = "us-east" +encryption_at_rest = true +audit_logging = true + +[team] +admins = ["alice@company.com", "bob@company.com"] +developers = ["dev-team@company.com"] +``` + +### Infrastructure Configuration (`./.provisioning.toml`) + +**Purpose**: Infrastructure-specific overrides +**Location**: Infrastructure directory +**Usage**: Overrides for specific infrastructure deployments + +```text +# Infrastructure-specific configuration +# Overrides for this specific infrastructure deployment + +[core] +name = "production-east-provisioning" + +[infra] +name = "production-east" +environment = "production" +region = "us-east-1" + +[providers.upcloud] +zone = "us-nyc1" +private_network = true + +[providers.aws] +region = "us-east-1" +availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"] + +[defaults.servers] +plan = "4xCPU-8 GB" +storage = 100 +backup_enabled = true +monitoring_enabled = true + +[security] +firewall_strict_mode = true +encryption_required = true +audit_all_actions = true + +[monitoring] +prometheus_enabled = true +grafana_enabled = true +alertmanager_enabled = true + +[backup] +enabled = true +schedule = "0 2 * * *" # Daily at 2 AM +retention_days = 30 +``` + +## Environment-Specific Configuration + +### Development Environment (`config.dev.toml`) + +**Purpose**: Development-optimized settings +**Features**: Enhanced debugging, local providers, relaxed validation + +```text +# Development environment configuration +# Optimized for local development and testing + +[core] +name = "provisioning-dev" +version = "dev-{{git.branch}}" + +[paths] +base = "{{env.PWD}}/dev-environment" + +[logging] +level = "debug" +console_output = true +structured_logging = true +debug_http = true + +[providers] +default = "local" + +[providers.local] +enabled = true +fast_mode = true +mock_delays = false + +[http] +timeout = 10 +retries = 1 +debug_requests = true + +[cache] +enabled = true +ttl = 60 # Short TTL for development +debug_cache = true + +[development] +auto_reload = true +hot_reload_templates = true +validate_strict = false +experimental_features = true +debug_mode = true + +[orchestrator] +enabled = true +port = 8080 +debug = true +file_watcher = true + +[testing] +parallel_tests = true 
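+# Suites run in parallel; the cleanup and API-mocking flags below keep dev test runs isolated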
+cleanup_after_tests = true +mock_external_apis = true +``` + +### Testing Environment (`config.test.toml`) + +**Purpose**: Testing-specific configuration +**Features**: Mock services, isolated environments, comprehensive logging + +```text +# Testing environment configuration +# Optimized for automated testing and CI/CD + +[core] +name = "provisioning-test" +version = "test-{{build.timestamp}}" + +[logging] +level = "info" +test_output = true +capture_stderr = true + +[providers] +default = "local" + +[providers.local] +enabled = true +mock_mode = true +deterministic = true + +[http] +timeout = 5 +retries = 0 +mock_responses = true + +[cache] +enabled = false + +[testing] +isolated_environments = true +cleanup_after_each_test = true +parallel_execution = true +mock_all_external_calls = true +deterministic_ids = true + +[orchestrator] +enabled = false + +[validation] +strict_mode = true +fail_fast = true +``` + +### Production Environment (`config.prod.toml`) + +**Purpose**: Production-optimized settings +**Features**: Performance optimization, security hardening, comprehensive monitoring + +```text +# Production environment configuration +# Optimized for performance, reliability, and security + +[core] +name = "provisioning-production" +version = "{{release.version}}" + +[logging] +level = "warn" +structured_logging = true +sensitive_data_filtering = true +audit_logging = true + +[providers] +default = "upcloud" + +[http] +timeout = 60 +retries = 5 +connection_pool = 20 +keep_alive = true + +[cache] +enabled = true +ttl = 3600 +size_limit = "500 MB" +persistence = true + +[security] +strict_mode = true +encrypt_at_rest = true +encrypt_in_transit = true +audit_all_actions = true + +[monitoring] +metrics_enabled = true +tracing_enabled = true +health_checks = true +alerting = true + +[orchestrator] +enabled = true +port = 8080 +bind = "0.0.0.0" +workers = 4 +max_connections = 100 + +[performance] +parallel_operations = true +batch_operations = true +connection_pooling = true +``` + +## User Overrides and Customization + +### Personal Development Setup + +**Creating User Configuration**: + +```text +# Create user config directory +mkdir -p ~/.config/provisioning + +# Copy template +cp src/provisioning/config-examples/config.user.toml ~/.config/provisioning/config.toml + +# Customize for your environment +$EDITOR ~/.config/provisioning/config.toml +``` + +**Common User Customizations**: + +```text +# Personal configuration customizations + +[paths] +base = "{{env.HOME}}/dev/provisioning" + +[development] +editor = "code" +auto_backup = true +backup_interval = "1h" + +[git] +auto_commit = false +commit_template = "[{{env.USER}}] {{change.type}}: {{change.description}}" + +[providers.upcloud] +api_key = "{{env.UPCLOUD_API_KEY}}" +api_secret = "{{env.UPCLOUD_API_SECRET}}" +default_zone = "de-fra1" + +[shortcuts] +# Custom command aliases +quick_server = "server create {{name}} 2xCPU-4 GB --zone us-nyc1" +dev_cluster = "cluster create development --infra {{env.USER}}-dev" + +[notifications] +desktop_notifications = true +sound_notifications = false +slack_webhook = "{{env.SLACK_WEBHOOK_URL}}" +``` + +### Workspace-Specific Configuration + +**Workspace Integration**: + +```text +# Workspace-aware configuration +# workspace/config/developer.toml + +[workspace] +user = "developer" +type = "development" + +[paths] +base = "{{workspace.root}}" +extensions = "{{workspace.root}}/extensions" +runtime = "{{workspace.root}}/runtime/{{workspace.user}}" + +[development] +workspace_isolation = true 
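+# Runtime data (cache, state, logs) stays isolated per workspace user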
+per_user_cache = true +shared_extensions = false + +[infra] +current = "{{workspace.user}}-development" +auto_create = true +``` + +## Validation and Error Handling + +### Configuration Validation + +**Built-in Validation**: + +```text +# Validate current configuration +provisioning validate config + +# Validate specific configuration file +provisioning validate config --file config.dev.toml + +# Show configuration with validation +provisioning config show --validate + +# Debug configuration loading +provisioning config debug +``` + +**Validation Rules**: + +```text +# Configuration validation in Nushell +def validate_configuration [config: record]: nothing -> record { + # Collected validation failures; mutable so the checks below can append + mut errors = [] + + # Validate required fields + if not ("paths" in $config and "base" in $config.paths) { + $errors = ($errors | append "paths.base is required") + } + + # Validate provider configuration + if "providers" in $config { + for provider in ($config.providers | columns) { + if $provider == "upcloud" { + if not ("api_key" in $config.providers.upcloud) { + $errors = ($errors | append "providers.upcloud.api_key is required") + } + } + } + } + + # Validate numeric values + if "http" in $config and "timeout" in $config.http { + if $config.http.timeout <= 0 { + $errors = ($errors | append "http.timeout must be positive") + } + } + + { + valid: (($errors | length) == 0), + errors: $errors + } +} +``` + +### Error Handling + +**Configuration-Driven Error Handling**: + +```text +# Never patch with hardcoded fallbacks - use configuration +def get_api_endpoint [provider: string]: nothing -> string { + # Good: Configuration-driven with clear error + let config_key = $"providers.($provider).api_url" + let endpoint = try { + get-config-required $config_key + } catch { + error make { + msg: $"API endpoint not configured for provider ($provider)", + help: $"Add '($config_key)' to your configuration file" + } + } + + $endpoint +} + +# Bad: Hardcoded fallback defeats IaC purpose +def get_api_endpoint_bad [provider: string]: nothing -> string { + try { + get-config-required $"providers.($provider).api_url" + } catch { + # DON'T DO THIS - defeats configuration-driven architecture + "https://default-api.com" + } +} +``` + +**Comprehensive Error Context**: + +```text +def load_provider_config [provider: string]: nothing -> record { + let config_section = $"providers.($provider)" + + try { + get-config-section $config_section + } catch { |e| + error make { + msg: $"Failed to load configuration for provider ($provider): ($e.msg)", + label: { + text: "configuration missing", + span: (metadata $provider).span + }, + # error make expects help as a single string, so join the hints + help: ([ + $"Add [($config_section)] section to your configuration", + "Example configuration files available in config-examples/", + "Run 'provisioning config show' to see current configuration" + ] | str join "\n") + } + } +} +``` + +## Interpolation and Dynamic Values + +### Interpolation Syntax + +**Supported Interpolation Variables**: + +```text +# Environment variables +base_path = "{{env.HOME}}/provisioning" +user_name = "{{env.USER}}" + +# Configuration references +data_path = "{{paths.base}}/data" +log_file = "{{paths.logs}}/{{core.name}}.log" + +# Date/time values +backup_name = "backup-{{now.date}}-{{now.time}}" +version = "{{core.version}}-{{now.timestamp}}" + +# Git information +branch_name = "{{git.branch}}" +commit_hash = "{{git.commit}}" +version_with_git = "{{core.version}}-{{git.commit}}" + +# System information +hostname = "{{system.hostname}}" +platform = "{{system.platform}}" +architecture = "{{system.arch}}" +``` + +### Complex Interpolation Examples +
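+Interpolation resolves recursively: a referenced key may itself contain a template, so each placeholder expands until none remain. The following trace is a hypothetical walk-through assuming the stock defaults (paths.base = "/usr/local/provisioning"), USER=dev, and a run date of 2025-09-25: + +```text +{{paths.logs}}/{{env.USER}}/{{now.date}} +→ {{paths.base}}/logs/dev/2025-09-25 +→ /usr/local/provisioning/logs/dev/2025-09-25 +``` +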
+**Dynamic Path Resolution**: + +```text +[paths] +base = "{{env.HOME}}/.local/share/provisioning" +config = "{{paths.base}}/config" +data = "{{paths.base}}/data/{{system.hostname}}" +logs = "{{paths.base}}/logs/{{env.USER}}/{{now.date}}" +runtime = "{{paths.base}}/runtime/{{git.branch}}" + +[providers.upcloud] +cache_path = "{{paths.cache}}/providers/upcloud/{{env.USER}}" +log_file = "{{paths.logs}}/upcloud-{{now.date}}.log" +``` + +**Environment-Aware Configuration**: + +```text +[core] +name = "provisioning-{{system.hostname}}-{{env.USER}}" +version = "{{release.version}}+{{git.commit}}.{{now.timestamp}}" + +[database] +name = "provisioning_{{env.USER}}_{{git.branch}}" +backup_prefix = "{{core.name}}-backup-{{now.date}}" + +[monitoring] +instance_id = "{{system.hostname}}-{{core.version}}" + +# TOML inline tables must stay on one line, so tags use a sub-table +[monitoring.tags] +environment = "{{infra.environment}}" +user = "{{env.USER}}" +version = "{{core.version}}" +deployment_time = "{{now.iso8601}}" +``` + +### Interpolation Functions + +**Custom Interpolation Logic**: + +```text +# Interpolation resolver +def resolve_interpolation [template: string, context: record]: nothing -> string { + let interpolations = ($template | parse --regex '\{\{([^}]+)\}\}') + + mut result = $template + + for interpolation in $interpolations { + let key_path = ($interpolation.capture0 | str trim) + let value = resolve_interpolation_key $key_path $context + + $result = ($result | str replace $"{{($interpolation.capture0)}}" $value) + } + + $result +} + +def resolve_interpolation_key [key_path: string, context: record]: nothing -> string { + match ($key_path | split row ".") { + ["env", $var] => ($env | get -i $var | default ""), + ["paths", $path] => (resolve_path_key $path $context), + ["now", $format] => (resolve_time_format $format), + ["git", $info] => (resolve_git_info $info), + ["system", $info] => (resolve_system_info $info), + $path => (get_nested_config_value $path $context) + } +} +``` + +## Migration Strategies + +### ENV to Config Migration + +**Migration Status**: The system has successfully migrated from ENV-based to config-driven architecture: + +**Migration Statistics**: + +- **Files Migrated**: 65+ files across entire codebase +- **Variables Replaced**: 200+ ENV variables → 476 config accessors +- **Agent-Based Development**: 16 token-efficient agents used +- **Efficiency Gained**: 92% token efficiency vs monolithic approach + +### Legacy Support + +**Backward Compatibility**: + +```text +# Configuration accessor with ENV fallback +def get-config-with-env-fallback [ + config_key: string, + env_var: string, + default: string = "" +]: nothing -> string { + # Try configuration first + let config_value = try { + get-config-value $config_key + } catch { null } + + if $config_value != null { + return $config_value + } + + # Fall back to environment variable (get -i yields null when unset) + let env_value = ($env | get -i $env_var) + if $env_value != null { + return $env_value + } + + # Use default if provided + if $default != "" { + return $default + } + + # Error if no value found + error make { + msg: $"Configuration value not found: ($config_key)", + help: $"Set ($config_key) in configuration or ($env_var) environment variable" + } +} +``` + +### Migration Tools + +**Available Migration Scripts**: + +```text +# Migrate existing ENV-based setup to configuration +nu src/tools/migration/env-to-config.nu --scan-environment --create-config + +# Validate migration completeness +nu src/tools/migration/validate-migration.nu --check-env-usage + +# Generate configuration from current environment +nu
src/tools/migration/generate-config.nu --output-file config.migrated.toml +``` + +## Troubleshooting + +### Common Configuration Issues + +#### Configuration Not Found + +**Error**: `Configuration file not found` + +```text +# Solution: Check configuration file paths +provisioning config paths + +# Create default configuration +provisioning config init --template user + +# Verify configuration loading order +provisioning config debug +``` + +#### Invalid Configuration Syntax + +**Error**: `Invalid TOML syntax in configuration file` + +```text +# Solution: Validate TOML syntax +nu -c "open config.user.toml | from toml" + +# Use configuration validation +provisioning validate config --file config.user.toml + +# Show parsing errors +provisioning config check --verbose +``` + +#### Interpolation Errors + +**Error**: `Failed to resolve interpolation: {{env.MISSING_VAR}}` + +```text +# Solution: Check available interpolation variables +provisioning config interpolation --list-variables + +# Debug specific interpolation +provisioning config interpolation --test "{{env.USER}}" + +# Show interpolation context +provisioning config debug --show-interpolation +``` + +#### Provider Configuration Issues + +**Error**: `Provider 'upcloud' configuration invalid` + +```text +# Solution: Validate provider configuration +provisioning validate config --section providers.upcloud + +# Show required provider fields +provisioning providers upcloud config --show-schema + +# Test provider configuration +provisioning providers upcloud test --dry-run +``` + +### Debug Commands + +**Configuration Debugging**: + +```text +# Show complete resolved configuration +provisioning config show --resolved + +# Show configuration loading order +provisioning config debug --show-hierarchy + +# Show configuration sources +provisioning config sources + +# Test specific configuration keys +provisioning config get paths.base --trace + +# Show interpolation resolution +provisioning config interpolation --debug "{{paths.data}}/{{env.USER}}" +``` + +### Performance Optimization + +**Configuration Caching**: + +```text +# Enable configuration caching +export PROVISIONING_CONFIG_CACHE=true + +# Clear configuration cache +provisioning config cache --clear + +# Show cache statistics +provisioning config cache --stats +``` + +**Startup Optimization**: + +```text +# Optimize configuration loading +[performance] +lazy_loading = true +cache_compiled_config = true +skip_unused_sections = true + +[cache] +config_cache_ttl = 3600 +interpolation_cache = true +``` + +This configuration management system provides a robust, flexible foundation that supports development workflows while maintaining production +reliability and security requirements. \ No newline at end of file diff --git a/docs/src/development/dev-workspace-management.md b/docs/src/development/dev-workspace-management.md index 4cb1b19..56222cd 100644 --- a/docs/src/development/dev-workspace-management.md +++ b/docs/src/development/dev-workspace-management.md @@ -1 +1,915 @@ -# Workspace Management Guide\n\nThis document provides comprehensive guidance on setting up and using development workspaces, including the path resolution system, testing\ninfrastructure, and workspace tools usage.\n\n## Table of Contents\n\n1. [Overview](#overview)\n2. [Workspace Architecture](#workspace-architecture)\n3. [Setup and Initialization](#setup-and-initialization)\n4. [Path Resolution System](#path-resolution-system)\n5. [Configuration Management](#configuration-management)\n6. 
[Extension Development](#extension-development)\n7. [Runtime Management](#runtime-management)\n8. [Health Monitoring](#health-monitoring)\n9. [Backup and Restore](#backup-and-restore)\n10. [Troubleshooting](#troubleshooting)\n\n## Overview\n\nThe workspace system provides isolated development environments for the provisioning project, enabling:\n\n- **User Isolation**: Each developer has their own workspace with isolated runtime data\n- **Configuration Cascading**: Hierarchical configuration from workspace to core system\n- **Extension Development**: Template-based extension development with testing\n- **Path Resolution**: Smart path resolution with workspace-aware fallbacks\n- **Health Monitoring**: Comprehensive health checks with automatic repairs\n- **Backup/Restore**: Complete workspace backup and restore capabilities\n\n**Location**: `/workspace/`\n**Main Tool**: `workspace/tools/workspace.nu`\n\n## Workspace Architecture\n\n### Directory Structure\n\n```\nworkspace/\n├── config/ # Development configuration\n│ ├── dev-defaults.toml # Development environment defaults\n│ ├── test-defaults.toml # Testing environment configuration\n│ ├── local-overrides.toml.example # User customization template\n│ └── {user}.toml # User-specific configurations\n├── extensions/ # Extension development\n│ ├── providers/ # Custom provider extensions\n│ │ ├── template/ # Provider development template\n│ │ └── {user}/ # User-specific providers\n│ ├── taskservs/ # Custom task service extensions\n│ │ ├── template/ # Task service template\n│ │ └── {user}/ # User-specific task services\n│ └── clusters/ # Custom cluster extensions\n│ ├── template/ # Cluster template\n│ └── {user}/ # User-specific clusters\n├── infra/ # Development infrastructure\n│ ├── examples/ # Example infrastructures\n│ │ ├── minimal/ # Minimal learning setup\n│ │ ├── development/ # Full development environment\n│ │ └── testing/ # Testing infrastructure\n│ ├── local/ # Local development setups\n│ └── {user}/ # User-specific infrastructures\n├── lib/ # Workspace libraries\n│ └── path-resolver.nu # Path resolution system\n├── runtime/ # Runtime data (per-user isolation)\n│ ├── workspaces/{user}/ # User workspace data\n│ ├── cache/{user}/ # User-specific cache\n│ ├── state/{user}/ # User state management\n│ ├── logs/{user}/ # User application logs\n│ └── data/{user}/ # User database files\n└── tools/ # Workspace management tools\n ├── workspace.nu # Main workspace interface\n ├── init-workspace.nu # Workspace initialization\n ├── workspace-health.nu # Health monitoring\n ├── backup-workspace.nu # Backup management\n ├── restore-workspace.nu # Restore functionality\n ├── reset-workspace.nu # Workspace reset\n └── runtime-manager.nu # Runtime data management\n```\n\n### Component Integration\n\n**Workspace → Core Integration**:\n\n- Workspace paths take priority over core paths\n- Extensions discovered automatically from workspace\n- Configuration cascades from workspace to core defaults\n- Runtime data completely isolated per user\n\n**Development Workflow**:\n\n1. **Initialize** personal workspace\n2. **Configure** development environment\n3. **Develop** extensions and infrastructure\n4. **Test** locally with isolated environment\n5. 
**Deploy** to shared infrastructure\n\n## Setup and Initialization\n\n### Quick Start\n\n```\n# Navigate to workspace\ncd workspace/tools\n\n# Initialize workspace with defaults\nnu workspace.nu init\n\n# Initialize with specific options\nnu workspace.nu init --user-name developer --infra-name my-dev-infra\n```\n\n### Complete Initialization\n\n```\n# Full initialization with all options\nnu workspace.nu init \\n --user-name developer \\n --infra-name development-env \\n --workspace-type development \\n --template full \\n --overwrite \\n --create-examples\n```\n\n**Initialization Parameters**:\n\n- `--user-name`: User identifier (defaults to `$env.USER`)\n- `--infra-name`: Infrastructure name for this workspace\n- `--workspace-type`: Type (`development`, `testing`, `production`)\n- `--template`: Template to use (`minimal`, `full`, `custom`)\n- `--overwrite`: Overwrite existing workspace\n- `--create-examples`: Create example configurations and infrastructure\n\n### Post-Initialization Setup\n\n**Verify Installation**:\n\n```\n# Check workspace health\nnu workspace.nu health --detailed\n\n# Show workspace status\nnu workspace.nu status --detailed\n\n# List workspace contents\nnu workspace.nu list\n```\n\n**Configure Development Environment**:\n\n```\n# Create user-specific configuration\ncp workspace/config/local-overrides.toml.example workspace/config/$USER.toml\n\n# Edit configuration\n$EDITOR workspace/config/$USER.toml\n```\n\n## Path Resolution System\n\nThe workspace implements a sophisticated path resolution system that prioritizes workspace paths while providing fallbacks to core system paths.\n\n### Resolution Hierarchy\n\n**Resolution Order**:\n\n1. **Workspace User Paths**: `workspace/{type}/{user}/{name}`\n2. **Workspace Shared Paths**: `workspace/{type}/{name}`\n3. **Workspace Templates**: `workspace/{type}/template/{name}`\n4. 
**Core System Paths**: `core/{type}/{name}` (fallback)\n\n### Using Path Resolution\n\n```\n# Import path resolver\nuse workspace/lib/path-resolver.nu\n\n# Resolve configuration with workspace awareness\nlet config_path = (path-resolver resolve_path "config" "user" --workspace-user "developer")\n\n# Resolve with automatic fallback to core\nlet extension_path = (path-resolver resolve_path "extensions" "custom-provider" --fallback-to-core)\n\n# Create missing directories during resolution\nlet new_path = (path-resolver resolve_path "infra" "my-infra" --create-missing)\n```\n\n### Configuration Resolution\n\n**Hierarchical Configuration Loading**:\n\n```\n# Resolve configuration with full hierarchy\nlet config = (path-resolver resolve_config "user" --workspace-user "developer")\n\n# Load environment-specific configuration\nlet dev_config = (path-resolver resolve_config "development" --workspace-user "developer")\n\n# Get merged configuration with all overrides\nlet merged = (path-resolver resolve_config "merged" --workspace-user "developer" --include-overrides)\n```\n\n### Extension Discovery\n\n**Automatic Extension Discovery**:\n\n```\n# Find custom provider extension\nlet provider = (path-resolver resolve_extension "providers" "my-aws-provider")\n\n# Discover all available task services\nlet taskservs = (path-resolver list_extensions "taskservs" --include-core)\n\n# Find cluster definition\nlet cluster = (path-resolver resolve_extension "clusters" "development-cluster")\n```\n\n### Health Checking\n\n**Workspace Health Validation**:\n\n```\n# Check workspace health with automatic fixes\nlet health = (path-resolver check_workspace_health --workspace-user "developer" --fix-issues)\n\n# Validate path resolution chain\nlet validation = (path-resolver validate_paths --workspace-user "developer" --repair-broken)\n\n# Check runtime directories\nlet runtime_status = (path-resolver check_runtime_health --workspace-user "developer")\n```\n\n## Configuration Management\n\n### Configuration Hierarchy\n\n**Configuration Cascade**:\n\n1. **User Configuration**: `workspace/config/{user}.toml`\n2. **Environment Defaults**: `workspace/config/{env}-defaults.toml`\n3. **Workspace Defaults**: `workspace/config/dev-defaults.toml`\n4. 
**Core System Defaults**: `config.defaults.toml`\n\n### Environment-Specific Configuration\n\n**Development Environment** (`workspace/config/dev-defaults.toml`):\n\n```\n[core]\nname = "provisioning-dev"\nversion = "dev-${git.branch}"\n\n[development]\nauto_reload = true\nverbose_logging = true\nexperimental_features = true\nhot_reload_templates = true\n\n[http]\nuse_curl = false\ntimeout = 30\nretry_count = 3\n\n[cache]\nenabled = true\nttl = 300\nrefresh_interval = 60\n\n[logging]\nlevel = "debug"\nfile_rotation = true\nmax_size = "10 MB"\n```\n\n**Testing Environment** (`workspace/config/test-defaults.toml`):\n\n```\n[core]\nname = "provisioning-test"\nversion = "test-${build.timestamp}"\n\n[testing]\nmock_providers = true\nephemeral_resources = true\nparallel_tests = true\ncleanup_after_test = true\n\n[http]\nuse_curl = true\ntimeout = 10\nretry_count = 1\n\n[cache]\nenabled = false\nmock_responses = true\n\n[logging]\nlevel = "info"\ntest_output = true\n```\n\n### User Configuration Example\n\n**User-Specific Configuration** (`workspace/config/{user}.toml`):\n\n```\n[core]\nname = "provisioning-${workspace.user}"\nversion = "1.0.0-dev"\n\n[infra]\ncurrent = "${workspace.user}-development"\ndefault_provider = "upcloud"\n\n[workspace]\nuser = "developer"\ntype = "development"\ninfra_name = "developer-dev"\n\n[development]\npreferred_editor = "code"\nauto_backup = true\nbackup_interval = "1h"\n\n[paths]\n# Custom paths for this user\ntemplates = "~/custom-templates"\nextensions = "~/my-extensions"\n\n[git]\nauto_commit = false\ncommit_message_template = "[${workspace.user}] ${change.type}: ${change.description}"\n\n[notifications]\nslack_webhook = "https://hooks.slack.com/..."\nemail = "developer@company.com"\n```\n\n### Configuration Commands\n\n**Workspace Configuration Management**:\n\n```\n# Show current configuration\nnu workspace.nu config show\n\n# Validate configuration\nnu workspace.nu config validate --user-name developer\n\n# Edit user configuration\nnu workspace.nu config edit --user-name developer\n\n# Show configuration hierarchy\nnu workspace.nu config hierarchy --user-name developer\n\n# Merge configurations for debugging\nnu workspace.nu config merge --user-name developer --output merged-config.toml\n```\n\n## Extension Development\n\n### Extension Types\n\nThe workspace provides templates and tools for developing three types of extensions:\n\n1. **Providers**: Cloud provider implementations\n2. **Task Services**: Infrastructure service components\n3. 
**Clusters**: Complete deployment solutions\n\n### Provider Extension Development\n\n**Create New Provider**:\n\n```\n# Copy template\ncp -r workspace/extensions/providers/template workspace/extensions/providers/my-provider\n\n# Initialize provider\ncd workspace/extensions/providers/my-provider\nnu init.nu --provider-name my-provider --author developer\n```\n\n**Provider Structure**:\n\n```\nworkspace/extensions/providers/my-provider/\n├── kcl/\n│ ├── provider.ncl # Provider configuration schema\n│ ├── server.ncl # Server configuration\n│ └── version.ncl # Version management\n├── nulib/\n│ ├── provider.nu # Main provider implementation\n│ ├── servers.nu # Server management\n│ └── auth.nu # Authentication handling\n├── templates/\n│ ├── server.j2 # Server configuration template\n│ └── network.j2 # Network configuration template\n├── tests/\n│ ├── unit/ # Unit tests\n│ └── integration/ # Integration tests\n└── README.md\n```\n\n**Test Provider**:\n\n```\n# Run provider tests\nnu workspace/extensions/providers/my-provider/nulib/provider.nu test\n\n# Test with dry-run\nnu workspace/extensions/providers/my-provider/nulib/provider.nu create-server --dry-run\n\n# Integration test\nnu workspace/extensions/providers/my-provider/tests/integration/basic-test.nu\n```\n\n### Task Service Extension Development\n\n**Create New Task Service**:\n\n```\n# Copy template\ncp -r workspace/extensions/taskservs/template workspace/extensions/taskservs/my-service\n\n# Initialize service\ncd workspace/extensions/taskservs/my-service\nnu init.nu --service-name my-service --service-type database\n```\n\n**Task Service Structure**:\n\n```\nworkspace/extensions/taskservs/my-service/\n├── kcl/\n│ ├── taskserv.ncl # Service configuration schema\n│ ├── version.ncl # Version configuration with GitHub integration\n│ └── kcl.mod # KCL module dependencies\n├── nushell/\n│ ├── taskserv.nu # Main service implementation\n│ ├── install.nu # Installation logic\n│ ├── uninstall.nu # Removal logic\n│ └── check-updates.nu # Version checking\n├── templates/\n│ ├── config.j2 # Service configuration template\n│ ├── systemd.j2 # Systemd service template\n│ └── compose.j2 # Docker Compose template\n└── manifests/\n ├── deployment.yaml # Kubernetes deployment\n └── service.yaml # Kubernetes service\n```\n\n### Cluster Extension Development\n\n**Create New Cluster**:\n\n```\n# Copy template\ncp -r workspace/extensions/clusters/template workspace/extensions/clusters/my-cluster\n\n# Initialize cluster\ncd workspace/extensions/clusters/my-cluster\nnu init.nu --cluster-name my-cluster --cluster-type web-stack\n```\n\n**Testing Extensions**:\n\n```\n# Test extension syntax\nnu workspace.nu tools validate-extension providers/my-provider\n\n# Run extension tests\nnu workspace.nu tools test-extension taskservs/my-service\n\n# Integration test with infrastructure\nnu workspace.nu tools deploy-test clusters/my-cluster --infra test-env\n```\n\n## Runtime Management\n\n### Runtime Data Organization\n\n**Per-User Isolation**:\n\n```\nruntime/\n├── workspaces/\n│ ├── developer/ # Developer's workspace data\n│ │ ├── current-infra # Current infrastructure context\n│ │ ├── settings.toml # Runtime settings\n│ │ └── extensions/ # Extension runtime data\n│ └── tester/ # Tester's workspace data\n├── cache/\n│ ├── developer/ # Developer's cache\n│ │ ├── providers/ # Provider API cache\n│ │ ├── images/ # Container image cache\n│ │ └── downloads/ # Downloaded artifacts\n│ └── tester/ # Tester's cache\n├── state/\n│ ├── developer/ # Developer's state\n│ │ ├── 
deployments/ # Deployment state\n│ │ └── workflows/ # Workflow state\n│ └── tester/ # Tester's state\n├── logs/\n│ ├── developer/ # Developer's logs\n│ │ ├── provisioning.log\n│ │ ├── orchestrator.log\n│ │ └── extensions/\n│ └── tester/ # Tester's logs\n└── data/\n ├── developer/ # Developer's data\n │ ├── database.db # Local database\n │ └── backups/ # Local backups\n └── tester/ # Tester's data\n```\n\n### Runtime Management Commands\n\n**Initialize Runtime Environment**:\n\n```\n# Initialize for current user\nnu workspace/tools/runtime-manager.nu init\n\n# Initialize for specific user\nnu workspace/tools/runtime-manager.nu init --user-name developer\n```\n\n**Runtime Cleanup**:\n\n```\n# Clean cache older than 30 days\nnu workspace/tools/runtime-manager.nu cleanup --type cache --age 30d\n\n# Clean logs with rotation\nnu workspace/tools/runtime-manager.nu cleanup --type logs --rotate\n\n# Clean temporary files\nnu workspace/tools/runtime-manager.nu cleanup --type temp --force\n```\n\n**Log Management**:\n\n```\n# View recent logs\nnu workspace/tools/runtime-manager.nu logs --action tail --lines 100\n\n# Follow logs in real-time\nnu workspace/tools/runtime-manager.nu logs --action tail --follow\n\n# Rotate large log files\nnu workspace/tools/runtime-manager.nu logs --action rotate\n\n# Archive old logs\nnu workspace/tools/runtime-manager.nu logs --action archive --older-than 7d\n```\n\n**Cache Management**:\n\n```\n# Show cache statistics\nnu workspace/tools/runtime-manager.nu cache --action stats\n\n# Optimize cache\nnu workspace/tools/runtime-manager.nu cache --action optimize\n\n# Clear specific cache\nnu workspace/tools/runtime-manager.nu cache --action clear --type providers\n\n# Refresh cache\nnu workspace/tools/runtime-manager.nu cache --action refresh --selective\n```\n\n**Monitoring**:\n\n```\n# Monitor runtime usage\nnu workspace/tools/runtime-manager.nu monitor --duration 5m --interval 30s\n\n# Check disk usage\nnu workspace/tools/runtime-manager.nu monitor --type disk\n\n# Monitor active processes\nnu workspace/tools/runtime-manager.nu monitor --type processes --workspace-user developer\n```\n\n## Health Monitoring\n\n### Health Check System\n\nThe workspace provides comprehensive health monitoring with automatic repair capabilities.\n\n**Health Check Components**:\n\n- **Directory Structure**: Validates workspace directory integrity\n- **Configuration Files**: Checks configuration syntax and completeness\n- **Runtime Environment**: Validates runtime data and permissions\n- **Extension Status**: Checks extension functionality\n- **Resource Usage**: Monitors disk space and memory usage\n- **Integration Status**: Tests integration with core system\n\n### Health Commands\n\n**Basic Health Check**:\n\n```\n# Quick health check\nnu workspace.nu health\n\n# Detailed health check with all components\nnu workspace.nu health --detailed\n\n# Health check with automatic fixes\nnu workspace.nu health --fix-issues\n\n# Export health report\nnu workspace.nu health --report-format json > health-report.json\n```\n\n**Component-Specific Health Checks**:\n\n```\n# Check directory structure\nnu workspace/tools/workspace-health.nu check-directories --workspace-user developer\n\n# Validate configuration files\nnu workspace/tools/workspace-health.nu check-config --workspace-user developer\n\n# Check runtime environment\nnu workspace/tools/workspace-health.nu check-runtime --workspace-user developer\n\n# Test extension functionality\nnu workspace/tools/workspace-health.nu check-extensions 
--workspace-user developer\n```\n\n### Health Monitoring Output\n\n**Example Health Report**:\n\n```\n{\n "workspace_health": {\n "user": "developer",\n "timestamp": "2025-09-25T14:30:22Z",\n "overall_status": "healthy",\n "checks": {\n "directories": {\n "status": "healthy",\n "issues": [],\n "auto_fixed": []\n },\n "configuration": {\n "status": "warning",\n "issues": [\n "User configuration missing default provider"\n ],\n "auto_fixed": [\n "Created missing user configuration file"\n ]\n },\n "runtime": {\n "status": "healthy",\n "disk_usage": "1.2 GB",\n "cache_size": "450 MB",\n "log_size": "120 MB"\n },\n "extensions": {\n "status": "healthy",\n "providers": 2,\n "taskservs": 5,\n "clusters": 1\n }\n },\n "recommendations": [\n "Consider cleaning cache (>400 MB)",\n "Rotate logs (>100 MB)"\n ]\n }\n}\n```\n\n### Automatic Fixes\n\n**Auto-Fix Capabilities**:\n\n- **Missing Directories**: Creates missing workspace directories\n- **Broken Symlinks**: Repairs or removes broken symbolic links\n- **Configuration Issues**: Creates missing configuration files with defaults\n- **Permission Problems**: Fixes file and directory permissions\n- **Corrupted Cache**: Clears and rebuilds corrupted cache entries\n- **Log Rotation**: Rotates large log files automatically\n\n## Backup and Restore\n\n### Backup System\n\n**Backup Components**:\n\n- **Configuration**: All workspace configuration files\n- **Extensions**: Custom extensions and templates\n- **Runtime Data**: User-specific runtime data (optional)\n- **Logs**: Application logs (optional)\n- **Cache**: Cache data (optional)\n\n### Backup Commands\n\n**Create Backup**:\n\n```\n# Basic backup\nnu workspace.nu backup\n\n# Backup with auto-generated name\nnu workspace.nu backup --auto-name\n\n# Comprehensive backup including logs and cache\nnu workspace.nu backup --auto-name --include-logs --include-cache\n\n# Backup specific components\nnu workspace.nu backup --components config,extensions --name my-backup\n```\n\n**Backup Options**:\n\n- `--auto-name`: Generate timestamp-based backup name\n- `--include-logs`: Include application logs\n- `--include-cache`: Include cache data\n- `--components`: Specify components to backup\n- `--compress`: Create compressed backup archive\n- `--encrypt`: Encrypt backup with age/sops\n- `--remote`: Upload to remote storage (S3, etc.)\n\n### Restore System\n\n**List Available Backups**:\n\n```\n# List all backups\nnu workspace.nu restore --list-backups\n\n# List backups with details\nnu workspace.nu restore --list-backups --detailed\n\n# Show backup contents\nnu workspace.nu restore --show-contents --backup-name workspace-developer-20250925_143022\n```\n\n**Restore Operations**:\n\n```\n# Restore latest backup\nnu workspace.nu restore --latest\n\n# Restore specific backup\nnu workspace.nu restore --backup-name workspace-developer-20250925_143022\n\n# Selective restore\nnu workspace.nu restore --selective --backup-name my-backup\n\n# Restore to different user\nnu workspace.nu restore --backup-name my-backup --restore-to different-user\n```\n\n**Advanced Restore Options**:\n\n- `--selective`: Choose components to restore interactively\n- `--restore-to`: Restore to different user workspace\n- `--merge`: Merge with existing workspace (don't overwrite)\n- `--dry-run`: Show what would be restored without doing it\n- `--verify`: Verify backup integrity before restore\n\n### Reset and Cleanup\n\n**Workspace Reset**:\n\n```\n# Reset with backup\nnu workspace.nu reset --backup-first\n\n# Reset keeping configuration\nnu 
workspace.nu reset --backup-first --keep-config\n\n# Complete reset (dangerous)\nnu workspace.nu reset --force --no-backup\n```\n\n**Cleanup Operations**:\n\n```\n# Clean old data with dry-run\nnu workspace.nu cleanup --type old --age 14d --dry-run\n\n# Clean cache forcefully\nnu workspace.nu cleanup --type cache --force\n\n# Clean specific user data\nnu workspace.nu cleanup --user-name old-user --type all\n```\n\n## Troubleshooting\n\n### Common Issues\n\n#### Workspace Not Found\n\n**Error**: `Workspace for user 'developer' not found`\n\n```\n# Solution: Initialize workspace\nnu workspace.nu init --user-name developer\n```\n\n#### Path Resolution Errors\n\n**Error**: `Path resolution failed for config/user`\n\n```\n# Solution: Fix with health check\nnu workspace.nu health --fix-issues\n\n# Manual fix\nnu workspace/lib/path-resolver.nu resolve_path "config" "user" --create-missing\n```\n\n#### Configuration Errors\n\n**Error**: `Invalid configuration syntax in user.toml`\n\n```\n# Solution: Validate and fix configuration\nnu workspace.nu config validate --user-name developer\n\n# Reset to defaults\ncp workspace/config/local-overrides.toml.example workspace/config/developer.toml\n```\n\n#### Runtime Issues\n\n**Error**: `Runtime directory permissions error`\n\n```\n# Solution: Reinitialize runtime\nnu workspace/tools/runtime-manager.nu init --user-name developer --force\n\n# Fix permissions manually\nchmod -R 755 workspace/runtime/workspaces/developer\n```\n\n#### Extension Issues\n\n**Error**: `Extension 'my-provider' not found or invalid`\n\n```\n# Solution: Validate extension\nnu workspace.nu tools validate-extension providers/my-provider\n\n# Reinitialize extension from template\ncp -r workspace/extensions/providers/template workspace/extensions/providers/my-provider\n```\n\n### Debug Mode\n\n**Enable Debug Logging**:\n\n```\n# Set debug environment\nexport PROVISIONING_DEBUG=true\nexport PROVISIONING_LOG_LEVEL=debug\nexport PROVISIONING_WORKSPACE_USER=developer\n\n# Run with debug\nnu workspace.nu health --detailed\n```\n\n### Performance Issues\n\n**Slow Operations**:\n\n```\n# Check disk space\ndf -h workspace/\n\n# Check runtime data size\ndu -h workspace/runtime/workspaces/developer/\n\n# Optimize workspace\nnu workspace.nu cleanup --type cache\nnu workspace/tools/runtime-manager.nu cache --action optimize\n```\n\n### Recovery Procedures\n\n**Corrupted Workspace**:\n\n```\n# 1. Backup current state\nnu workspace.nu backup --name corrupted-backup --force\n\n# 2. Reset workspace\nnu workspace.nu reset --backup-first\n\n# 3. Restore from known good backup\nnu workspace.nu restore --latest-known-good\n\n# 4. Validate health\nnu workspace.nu health --detailed --fix-issues\n```\n\n**Data Loss Prevention**:\n\n- Enable automatic backups: `backup_interval = "1h"` in user config\n- Use version control for custom extensions\n- Regular health checks: `nu workspace.nu health`\n- Monitor disk space and set up alerts\n\nThis workspace management system provides a robust foundation for development while maintaining isolation and providing comprehensive tools for\nmaintenance and troubleshooting. +# Workspace Management Guide + +This document provides comprehensive guidance on setting up and using development workspaces, including the path resolution system, testing +infrastructure, and workspace tools usage. + +## Table of Contents + +1. [Overview](#overview) +2. [Workspace Architecture](#workspace-architecture) +3. [Setup and Initialization](#setup-and-initialization) +4. 
[Path Resolution System](#path-resolution-system) +5. [Configuration Management](#configuration-management) +6. [Extension Development](#extension-development) +7. [Runtime Management](#runtime-management) +8. [Health Monitoring](#health-monitoring) +9. [Backup and Restore](#backup-and-restore) +10. [Troubleshooting](#troubleshooting) + +## Overview + +The workspace system provides isolated development environments for the provisioning project, enabling: + +- **User Isolation**: Each developer has their own workspace with isolated runtime data +- **Configuration Cascading**: Hierarchical configuration from workspace to core system +- **Extension Development**: Template-based extension development with testing +- **Path Resolution**: Smart path resolution with workspace-aware fallbacks +- **Health Monitoring**: Comprehensive health checks with automatic repairs +- **Backup/Restore**: Complete workspace backup and restore capabilities + +**Location**: `/workspace/` +**Main Tool**: `workspace/tools/workspace.nu` + +## Workspace Architecture + +### Directory Structure + +```text +workspace/ +├── config/ # Development configuration +│ ├── dev-defaults.toml # Development environment defaults +│ ├── test-defaults.toml # Testing environment configuration +│ ├── local-overrides.toml.example # User customization template +│ └── {user}.toml # User-specific configurations +├── extensions/ # Extension development +│ ├── providers/ # Custom provider extensions +│ │ ├── template/ # Provider development template +│ │ └── {user}/ # User-specific providers +│ ├── taskservs/ # Custom task service extensions +│ │ ├── template/ # Task service template +│ │ └── {user}/ # User-specific task services +│ └── clusters/ # Custom cluster extensions +│ ├── template/ # Cluster template +│ └── {user}/ # User-specific clusters +├── infra/ # Development infrastructure +│ ├── examples/ # Example infrastructures +│ │ ├── minimal/ # Minimal learning setup +│ │ ├── development/ # Full development environment +│ │ └── testing/ # Testing infrastructure +│ ├── local/ # Local development setups +│ └── {user}/ # User-specific infrastructures +├── lib/ # Workspace libraries +│ └── path-resolver.nu # Path resolution system +├── runtime/ # Runtime data (per-user isolation) +│ ├── workspaces/{user}/ # User workspace data +│ ├── cache/{user}/ # User-specific cache +│ ├── state/{user}/ # User state management +│ ├── logs/{user}/ # User application logs +│ └── data/{user}/ # User database files +└── tools/ # Workspace management tools + ├── workspace.nu # Main workspace interface + ├── init-workspace.nu # Workspace initialization + ├── workspace-health.nu # Health monitoring + ├── backup-workspace.nu # Backup management + ├── restore-workspace.nu # Restore functionality + ├── reset-workspace.nu # Workspace reset + └── runtime-manager.nu # Runtime data management +``` + +### Component Integration + +**Workspace → Core Integration**: + +- Workspace paths take priority over core paths +- Extensions discovered automatically from workspace +- Configuration cascades from workspace to core defaults +- Runtime data completely isolated per user + +**Development Workflow**: + +1. **Initialize** personal workspace +2. **Configure** development environment +3. **Develop** extensions and infrastructure +4. **Test** locally with isolated environment +5. 
**Deploy** to shared infrastructure + +## Setup and Initialization + +### Quick Start + +```text +# Navigate to workspace +cd workspace/tools + +# Initialize workspace with defaults +nu workspace.nu init + +# Initialize with specific options +nu workspace.nu init --user-name developer --infra-name my-dev-infra +``` + +### Complete Initialization + +```text +# Full initialization with all options +nu workspace.nu init \ + --user-name developer \ + --infra-name development-env \ + --workspace-type development \ + --template full \ + --overwrite \ + --create-examples +``` + +**Initialization Parameters**: + +- `--user-name`: User identifier (defaults to `$env.USER`) +- `--infra-name`: Infrastructure name for this workspace +- `--workspace-type`: Type (`development`, `testing`, `production`) +- `--template`: Template to use (`minimal`, `full`, `custom`) +- `--overwrite`: Overwrite existing workspace +- `--create-examples`: Create example configurations and infrastructure + +### Post-Initialization Setup + +**Verify Installation**: + +```text +# Check workspace health +nu workspace.nu health --detailed + +# Show workspace status +nu workspace.nu status --detailed + +# List workspace contents +nu workspace.nu list +``` + +**Configure Development Environment**: + +```text +# Create user-specific configuration +cp workspace/config/local-overrides.toml.example workspace/config/$USER.toml + +# Edit configuration +$EDITOR workspace/config/$USER.toml +``` + +## Path Resolution System + +The workspace implements a sophisticated path resolution system that prioritizes workspace paths while providing fallbacks to core system paths. + +### Resolution Hierarchy + +**Resolution Order**: + +1. **Workspace User Paths**: `workspace/{type}/{user}/{name}` +2. **Workspace Shared Paths**: `workspace/{type}/{name}` +3. **Workspace Templates**: `workspace/{type}/template/{name}` +4.
**Core System Paths**: `core/{type}/{name}` (fallback) + +### Using Path Resolution + +```text +# Import path resolver +use workspace/lib/path-resolver.nu + +# Resolve configuration with workspace awareness +let config_path = (path-resolver resolve_path "config" "user" --workspace-user "developer") + +# Resolve with automatic fallback to core +let extension_path = (path-resolver resolve_path "extensions" "custom-provider" --fallback-to-core) + +# Create missing directories during resolution +let new_path = (path-resolver resolve_path "infra" "my-infra" --create-missing) +``` + +### Configuration Resolution + +**Hierarchical Configuration Loading**: + +```text +# Resolve configuration with full hierarchy +let config = (path-resolver resolve_config "user" --workspace-user "developer") + +# Load environment-specific configuration +let dev_config = (path-resolver resolve_config "development" --workspace-user "developer") + +# Get merged configuration with all overrides +let merged = (path-resolver resolve_config "merged" --workspace-user "developer" --include-overrides) +``` + +### Extension Discovery + +**Automatic Extension Discovery**: + +```text +# Find custom provider extension +let provider = (path-resolver resolve_extension "providers" "my-aws-provider") + +# Discover all available task services +let taskservs = (path-resolver list_extensions "taskservs" --include-core) + +# Find cluster definition +let cluster = (path-resolver resolve_extension "clusters" "development-cluster") +``` + +### Health Checking + +**Workspace Health Validation**: + +```text +# Check workspace health with automatic fixes +let health = (path-resolver check_workspace_health --workspace-user "developer" --fix-issues) + +# Validate path resolution chain +let validation = (path-resolver validate_paths --workspace-user "developer" --repair-broken) + +# Check runtime directories +let runtime_status = (path-resolver check_runtime_health --workspace-user "developer") +``` + +## Configuration Management + +### Configuration Hierarchy + +**Configuration Cascade**: + +1. **User Configuration**: `workspace/config/{user}.toml` +2. **Environment Defaults**: `workspace/config/{env}-defaults.toml` +3. **Workspace Defaults**: `workspace/config/dev-defaults.toml` +4. 
**Core System Defaults**: `config.defaults.toml` + +### Environment-Specific Configuration + +**Development Environment** (`workspace/config/dev-defaults.toml`): + +```text +[core] +name = "provisioning-dev" +version = "dev-${git.branch}" + +[development] +auto_reload = true +verbose_logging = true +experimental_features = true +hot_reload_templates = true + +[http] +use_curl = false +timeout = 30 +retry_count = 3 + +[cache] +enabled = true +ttl = 300 +refresh_interval = 60 + +[logging] +level = "debug" +file_rotation = true +max_size = "10 MB" +``` + +**Testing Environment** (`workspace/config/test-defaults.toml`): + +```text +[core] +name = "provisioning-test" +version = "test-${build.timestamp}" + +[testing] +mock_providers = true +ephemeral_resources = true +parallel_tests = true +cleanup_after_test = true + +[http] +use_curl = true +timeout = 10 +retry_count = 1 + +[cache] +enabled = false +mock_responses = true + +[logging] +level = "info" +test_output = true +``` + +### User Configuration Example + +**User-Specific Configuration** (`workspace/config/{user}.toml`): + +```text +[core] +name = "provisioning-${workspace.user}" +version = "1.0.0-dev" + +[infra] +current = "${workspace.user}-development" +default_provider = "upcloud" + +[workspace] +user = "developer" +type = "development" +infra_name = "developer-dev" + +[development] +preferred_editor = "code" +auto_backup = true +backup_interval = "1h" + +[paths] +# Custom paths for this user +templates = "~/custom-templates" +extensions = "~/my-extensions" + +[git] +auto_commit = false +commit_message_template = "[${workspace.user}] ${change.type}: ${change.description}" + +[notifications] +slack_webhook = "https://hooks.slack.com/..." +email = "developer@company.com" +``` + +### Configuration Commands + +**Workspace Configuration Management**: + +```text +# Show current configuration +nu workspace.nu config show + +# Validate configuration +nu workspace.nu config validate --user-name developer + +# Edit user configuration +nu workspace.nu config edit --user-name developer + +# Show configuration hierarchy +nu workspace.nu config hierarchy --user-name developer + +# Merge configurations for debugging +nu workspace.nu config merge --user-name developer --output merged-config.toml +``` + +## Extension Development + +### Extension Types + +The workspace provides templates and tools for developing three types of extensions: + +1. **Providers**: Cloud provider implementations +2. **Task Services**: Infrastructure service components +3. 
**Clusters**: Complete deployment solutions + +### Provider Extension Development + +**Create New Provider**: + +```text +# Copy template +cp -r workspace/extensions/providers/template workspace/extensions/providers/my-provider + +# Initialize provider +cd workspace/extensions/providers/my-provider +nu init.nu --provider-name my-provider --author developer +``` + +**Provider Structure**: + +```text +workspace/extensions/providers/my-provider/ +├── kcl/ +│ ├── provider.ncl # Provider configuration schema +│ ├── server.ncl # Server configuration +│ └── version.ncl # Version management +├── nulib/ +│ ├── provider.nu # Main provider implementation +│ ├── servers.nu # Server management +│ └── auth.nu # Authentication handling +├── templates/ +│ ├── server.j2 # Server configuration template +│ └── network.j2 # Network configuration template +├── tests/ +│ ├── unit/ # Unit tests +│ └── integration/ # Integration tests +└── README.md +``` + +**Test Provider**: + +```text +# Run provider tests +nu workspace/extensions/providers/my-provider/nulib/provider.nu test + +# Test with dry-run +nu workspace/extensions/providers/my-provider/nulib/provider.nu create-server --dry-run + +# Integration test +nu workspace/extensions/providers/my-provider/tests/integration/basic-test.nu +``` + +### Task Service Extension Development + +**Create New Task Service**: + +```text +# Copy template +cp -r workspace/extensions/taskservs/template workspace/extensions/taskservs/my-service + +# Initialize service +cd workspace/extensions/taskservs/my-service +nu init.nu --service-name my-service --service-type database +``` + +**Task Service Structure**: + +```text +workspace/extensions/taskservs/my-service/ +├── kcl/ +│ ├── taskserv.ncl # Service configuration schema +│ ├── version.ncl # Version configuration with GitHub integration +│ └── kcl.mod # KCL module dependencies +├── nushell/ +│ ├── taskserv.nu # Main service implementation +│ ├── install.nu # Installation logic +│ ├── uninstall.nu # Removal logic +│ └── check-updates.nu # Version checking +├── templates/ +│ ├── config.j2 # Service configuration template +│ ├── systemd.j2 # Systemd service template +│ └── compose.j2 # Docker Compose template +└── manifests/ + ├── deployment.yaml # Kubernetes deployment + └── service.yaml # Kubernetes service +``` + +### Cluster Extension Development + +**Create New Cluster**: + +```text +# Copy template +cp -r workspace/extensions/clusters/template workspace/extensions/clusters/my-cluster + +# Initialize cluster +cd workspace/extensions/clusters/my-cluster +nu init.nu --cluster-name my-cluster --cluster-type web-stack +``` + +**Testing Extensions**: + +```text +# Test extension syntax +nu workspace.nu tools validate-extension providers/my-provider + +# Run extension tests +nu workspace.nu tools test-extension taskservs/my-service + +# Integration test with infrastructure +nu workspace.nu tools deploy-test clusters/my-cluster --infra test-env +``` + +## Runtime Management + +### Runtime Data Organization + +**Per-User Isolation**: + +```text +runtime/ +├── workspaces/ +│ ├── developer/ # Developer's workspace data +│ │ ├── current-infra # Current infrastructure context +│ │ ├── settings.toml # Runtime settings +│ │ └── extensions/ # Extension runtime data +│ └── tester/ # Tester's workspace data +├── cache/ +│ ├── developer/ # Developer's cache +│ │ ├── providers/ # Provider API cache +│ │ ├── images/ # Container image cache +│ │ └── downloads/ # Downloaded artifacts +│ └── tester/ # Tester's cache +├── state/ +│ ├── developer/ # 
Developer's state +│ │ ├── deployments/ # Deployment state +│ │ └── workflows/ # Workflow state +│ └── tester/ # Tester's state +├── logs/ +│ ├── developer/ # Developer's logs +│ │ ├── provisioning.log +│ │ ├── orchestrator.log +│ │ └── extensions/ +│ └── tester/ # Tester's logs +└── data/ + ├── developer/ # Developer's data + │ ├── database.db # Local database + │ └── backups/ # Local backups + └── tester/ # Tester's data +``` + +### Runtime Management Commands + +**Initialize Runtime Environment**: + +```text +# Initialize for current user +nu workspace/tools/runtime-manager.nu init + +# Initialize for specific user +nu workspace/tools/runtime-manager.nu init --user-name developer +``` + +**Runtime Cleanup**: + +```text +# Clean cache older than 30 days +nu workspace/tools/runtime-manager.nu cleanup --type cache --age 30d + +# Clean logs with rotation +nu workspace/tools/runtime-manager.nu cleanup --type logs --rotate + +# Clean temporary files +nu workspace/tools/runtime-manager.nu cleanup --type temp --force +``` + +**Log Management**: + +```text +# View recent logs +nu workspace/tools/runtime-manager.nu logs --action tail --lines 100 + +# Follow logs in real-time +nu workspace/tools/runtime-manager.nu logs --action tail --follow + +# Rotate large log files +nu workspace/tools/runtime-manager.nu logs --action rotate + +# Archive old logs +nu workspace/tools/runtime-manager.nu logs --action archive --older-than 7d +``` + +**Cache Management**: + +```text +# Show cache statistics +nu workspace/tools/runtime-manager.nu cache --action stats + +# Optimize cache +nu workspace/tools/runtime-manager.nu cache --action optimize + +# Clear specific cache +nu workspace/tools/runtime-manager.nu cache --action clear --type providers + +# Refresh cache +nu workspace/tools/runtime-manager.nu cache --action refresh --selective +``` + +**Monitoring**: + +```text +# Monitor runtime usage +nu workspace/tools/runtime-manager.nu monitor --duration 5m --interval 30s + +# Check disk usage +nu workspace/tools/runtime-manager.nu monitor --type disk + +# Monitor active processes +nu workspace/tools/runtime-manager.nu monitor --type processes --workspace-user developer +``` + +## Health Monitoring + +### Health Check System + +The workspace provides comprehensive health monitoring with automatic repair capabilities. 
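+
+For example, an automation step can gate on the health report before continuing. A minimal Nushell sketch, assuming the JSON report shape shown later in this section:
+
+```text
+# Gate on workspace health and attempt repairs if needed (illustrative)
+let report = (nu workspace.nu health --report-format json | from json)
+if $report.workspace_health.overall_status != "healthy" {
+    nu workspace.nu health --fix-issues
+}
+```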
+ +**Health Check Components**: + +- **Directory Structure**: Validates workspace directory integrity +- **Configuration Files**: Checks configuration syntax and completeness +- **Runtime Environment**: Validates runtime data and permissions +- **Extension Status**: Checks extension functionality +- **Resource Usage**: Monitors disk space and memory usage +- **Integration Status**: Tests integration with core system + +### Health Commands + +**Basic Health Check**: + +```text +# Quick health check +nu workspace.nu health + +# Detailed health check with all components +nu workspace.nu health --detailed + +# Health check with automatic fixes +nu workspace.nu health --fix-issues + +# Export health report +nu workspace.nu health --report-format json > health-report.json +``` + +**Component-Specific Health Checks**: + +```text +# Check directory structure +nu workspace/tools/workspace-health.nu check-directories --workspace-user developer + +# Validate configuration files +nu workspace/tools/workspace-health.nu check-config --workspace-user developer + +# Check runtime environment +nu workspace/tools/workspace-health.nu check-runtime --workspace-user developer + +# Test extension functionality +nu workspace/tools/workspace-health.nu check-extensions --workspace-user developer +``` + +### Health Monitoring Output + +**Example Health Report**: + +```text +{ + "workspace_health": { + "user": "developer", + "timestamp": "2025-09-25T14:30:22Z", + "overall_status": "healthy", + "checks": { + "directories": { + "status": "healthy", + "issues": [], + "auto_fixed": [] + }, + "configuration": { + "status": "warning", + "issues": [ + "User configuration missing default provider" + ], + "auto_fixed": [ + "Created missing user configuration file" + ] + }, + "runtime": { + "status": "healthy", + "disk_usage": "1.2 GB", + "cache_size": "450 MB", + "log_size": "120 MB" + }, + "extensions": { + "status": "healthy", + "providers": 2, + "taskservs": 5, + "clusters": 1 + } + }, + "recommendations": [ + "Consider cleaning cache (>400 MB)", + "Rotate logs (>100 MB)" + ] + } +} +``` + +### Automatic Fixes + +**Auto-Fix Capabilities**: + +- **Missing Directories**: Creates missing workspace directories +- **Broken Symlinks**: Repairs or removes broken symbolic links +- **Configuration Issues**: Creates missing configuration files with defaults +- **Permission Problems**: Fixes file and directory permissions +- **Corrupted Cache**: Clears and rebuilds corrupted cache entries +- **Log Rotation**: Rotates large log files automatically + +## Backup and Restore + +### Backup System + +**Backup Components**: + +- **Configuration**: All workspace configuration files +- **Extensions**: Custom extensions and templates +- **Runtime Data**: User-specific runtime data (optional) +- **Logs**: Application logs (optional) +- **Cache**: Cache data (optional) + +### Backup Commands + +**Create Backup**: + +```text +# Basic backup +nu workspace.nu backup + +# Backup with auto-generated name +nu workspace.nu backup --auto-name + +# Comprehensive backup including logs and cache +nu workspace.nu backup --auto-name --include-logs --include-cache + +# Backup specific components +nu workspace.nu backup --components config,extensions --name my-backup +``` + +**Backup Options**: + +- `--auto-name`: Generate timestamp-based backup name +- `--include-logs`: Include application logs +- `--include-cache`: Include cache data +- `--components`: Specify components to backup +- `--compress`: Create compressed backup archive +- `--encrypt`: Encrypt 
backup with age/sops +- `--remote`: Upload to remote storage (S3, etc.) + +### Restore System + +**List Available Backups**: + +```text +# List all backups +nu workspace.nu restore --list-backups + +# List backups with details +nu workspace.nu restore --list-backups --detailed + +# Show backup contents +nu workspace.nu restore --show-contents --backup-name workspace-developer-20250925_143022 +``` + +**Restore Operations**: + +```text +# Restore latest backup +nu workspace.nu restore --latest + +# Restore specific backup +nu workspace.nu restore --backup-name workspace-developer-20250925_143022 + +# Selective restore +nu workspace.nu restore --selective --backup-name my-backup + +# Restore to different user +nu workspace.nu restore --backup-name my-backup --restore-to different-user +``` + +**Advanced Restore Options**: + +- `--selective`: Choose components to restore interactively +- `--restore-to`: Restore to different user workspace +- `--merge`: Merge with existing workspace (don't overwrite) +- `--dry-run`: Show what would be restored without doing it +- `--verify`: Verify backup integrity before restore + +### Reset and Cleanup + +**Workspace Reset**: + +```text +# Reset with backup +nu workspace.nu reset --backup-first + +# Reset keeping configuration +nu workspace.nu reset --backup-first --keep-config + +# Complete reset (dangerous) +nu workspace.nu reset --force --no-backup +``` + +**Cleanup Operations**: + +```text +# Clean old data with dry-run +nu workspace.nu cleanup --type old --age 14d --dry-run + +# Clean cache forcefully +nu workspace.nu cleanup --type cache --force + +# Clean specific user data +nu workspace.nu cleanup --user-name old-user --type all +``` + +## Troubleshooting + +### Common Issues + +#### Workspace Not Found + +**Error**: `Workspace for user 'developer' not found` + +```text +# Solution: Initialize workspace +nu workspace.nu init --user-name developer +``` + +#### Path Resolution Errors + +**Error**: `Path resolution failed for config/user` + +```text +# Solution: Fix with health check +nu workspace.nu health --fix-issues + +# Manual fix +nu workspace/lib/path-resolver.nu resolve_path "config" "user" --create-missing +``` + +#### Configuration Errors + +**Error**: `Invalid configuration syntax in user.toml` + +```text +# Solution: Validate and fix configuration +nu workspace.nu config validate --user-name developer + +# Reset to defaults +cp workspace/config/local-overrides.toml.example workspace/config/developer.toml +``` + +#### Runtime Issues + +**Error**: `Runtime directory permissions error` + +```text +# Solution: Reinitialize runtime +nu workspace/tools/runtime-manager.nu init --user-name developer --force + +# Fix permissions manually +chmod -R 755 workspace/runtime/workspaces/developer +``` + +#### Extension Issues + +**Error**: `Extension 'my-provider' not found or invalid` + +```text +# Solution: Validate extension +nu workspace.nu tools validate-extension providers/my-provider + +# Reinitialize extension from template +cp -r workspace/extensions/providers/template workspace/extensions/providers/my-provider +``` + +### Debug Mode + +**Enable Debug Logging**: + +```text +# Set debug environment +export PROVISIONING_DEBUG=true +export PROVISIONING_LOG_LEVEL=debug +export PROVISIONING_WORKSPACE_USER=developer + +# Run with debug +nu workspace.nu health --detailed +``` + +### Performance Issues + +**Slow Operations**: + +```text +# Check disk space +df -h workspace/ + +# Check runtime data size +du -h workspace/runtime/workspaces/developer/ + +# 
Optimize workspace +nu workspace.nu cleanup --type cache +nu workspace/tools/runtime-manager.nu cache --action optimize +``` + +### Recovery Procedures + +**Corrupted Workspace**: + +```text +# 1. Backup current state +nu workspace.nu backup --name corrupted-backup --force + +# 2. Reset workspace +nu workspace.nu reset --backup-first + +# 3. Restore from known good backup +nu workspace.nu restore --latest-known-good + +# 4. Validate health +nu workspace.nu health --detailed --fix-issues +``` + +**Data Loss Prevention**: + +- Enable automatic backups: `backup_interval = "1h"` in user config +- Use version control for custom extensions +- Regular health checks: `nu workspace.nu health` +- Monitor disk space and set up alerts + +This workspace management system provides a robust foundation for development while maintaining isolation and providing comprehensive tools for +maintenance and troubleshooting. \ No newline at end of file diff --git a/docs/src/development/distribution-process.md b/docs/src/development/distribution-process.md index 60c3838..dc64418 100644 --- a/docs/src/development/distribution-process.md +++ b/docs/src/development/distribution-process.md @@ -1 +1,1005 @@ -# Distribution Process Documentation\n\nThis document provides comprehensive documentation for the provisioning project's distribution process, covering release workflows, package\ngeneration, multi-platform distribution, and rollback procedures.\n\n## Table of Contents\n\n1. [Overview](#overview)\n2. [Distribution Architecture](#distribution-architecture)\n3. [Release Process](#release-process)\n4. [Package Generation](#package-generation)\n5. [Multi-Platform Distribution](#multi-platform-distribution)\n6. [Validation and Testing](#validation-and-testing)\n7. [Release Management](#release-management)\n8. [Rollback Procedures](#rollback-procedures)\n9. [CI/CD Integration](#cicd-integration)\n10. 
[Troubleshooting](#troubleshooting)\n\n## Overview\n\nThe distribution system provides a comprehensive solution for creating, packaging, and distributing provisioning across multiple platforms with\nautomated release management.\n\n**Key Features**:\n\n- **Multi-Platform Support**: Linux, macOS, Windows with multiple architectures\n- **Multiple Distribution Variants**: Complete and minimal distributions\n- **Automated Release Pipeline**: From development to production deployment\n- **Package Management**: Binary packages, container images, and installers\n- **Validation Framework**: Comprehensive testing and validation\n- **Rollback Capabilities**: Safe rollback and recovery procedures\n\n**Location**: `/src/tools/`\n**Main Tool**: `/src/tools/Makefile` and associated Nushell scripts\n\n## Distribution Architecture\n\n### Distribution Components\n\n```{$detected_lang}\nDistribution Ecosystem\n├── Core Components\n│ ├── Platform Binaries # Rust-compiled binaries\n│ ├── Core Libraries # Nushell libraries and CLI\n│ ├── Configuration System # TOML configuration files\n│ └── Documentation # User and API documentation\n├── Platform Packages\n│ ├── Archives # TAR.GZ and ZIP files\n│ ├── Installers # Platform-specific installers\n│ └── Container Images # Docker/OCI images\n├── Distribution Variants\n│ ├── Complete # Full-featured distribution\n│ └── Minimal # Lightweight distribution\n└── Release Artifacts\n ├── Checksums # SHA256/MD5 verification\n ├── Signatures # Digital signatures\n └── Metadata # Release information\n```\n\n### Build Pipeline\n\n```{$detected_lang}\nBuild Pipeline Flow\n┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐\n│ Source Code │ -> │ Build Stage │ -> │ Package Stage │\n│ │ │ │ │ │\n│ - Rust code │ │ - compile- │ │ - create- │\n│ - Nushell libs │ │ platform │ │ archives │\n│ - Nickel schemas│ │ - bundle-core │ │ - build- │\n│ - Config files │ │ - validate-nickel│ │ containers │\n└─────────────────┘ └─────────────────┘ └─────────────────┘\n |\n v\n┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐\n│ Release Stage │ <- │ Validate Stage │ <- │ Distribute Stage│\n│ │ │ │ │ │\n│ - create- │ │ - test-dist │ │ - generate- │\n│ release │ │ - validate- │ │ distribution │\n│ - upload- │ │ package │ │ - create- │\n│ artifacts │ │ - integration │ │ installers │\n└─────────────────┘ └─────────────────┘ └─────────────────┘\n```\n\n### Distribution Variants\n\n**Complete Distribution**:\n\n- All Rust binaries (orchestrator, control-center, MCP server)\n- Full Nushell library suite\n- All providers, taskservs, and clusters\n- Complete documentation and examples\n- Development tools and templates\n\n**Minimal Distribution**:\n\n- Essential binaries only\n- Core Nushell libraries\n- Basic provider support\n- Essential task services\n- Minimal documentation\n\n## Release Process\n\n### Release Types\n\n**Release Classifications**:\n\n- **Major Release** (x.0.0): Breaking changes, new major features\n- **Minor Release** (x.y.0): New features, backward compatible\n- **Patch Release** (x.y.z): Bug fixes, security updates\n- **Pre-Release** (x.y.z-alpha/beta/rc): Development/testing releases\n\n### Step-by-Step Release Process\n\n#### 1. 
Preparation Phase\n\n**Pre-Release Checklist**:\n\n```{$detected_lang}\n# Update dependencies and security\ncargo update\ncargo audit\n\n# Run comprehensive tests\nmake ci-test\n\n# Update documentation\nmake docs\n\n# Validate all configurations\nmake validate-all\n```\n\n**Version Planning**:\n\n```{$detected_lang}\n# Check current version\ngit describe --tags --always\n\n# Plan next version\nmake status | grep Version\n\n# Validate version bump\nnu src/tools/release/create-release.nu --dry-run --version 2.1.0\n```\n\n#### 2. Build Phase\n\n**Complete Build**:\n\n```{$detected_lang}\n# Clean build environment\nmake clean\n\n# Build all platforms and variants\nmake all\n\n# Validate build output\nmake test-dist\n```\n\n**Build with Specific Parameters**:\n\n```{$detected_lang}\n# Build for specific platforms\nmake all PLATFORMS=linux-amd64,macos-amd64 VARIANTS=complete\n\n# Build with custom version\nmake all VERSION=2.1.0-rc1\n\n# Parallel build for speed\nmake all PARALLEL=true\n```\n\n#### 3. Package Generation\n\n**Create Distribution Packages**:\n\n```{$detected_lang}\n# Generate complete distributions\nmake dist-generate\n\n# Create binary packages\nmake package-binaries\n\n# Build container images\nmake package-containers\n\n# Create installers\nmake create-installers\n```\n\n**Package Validation**:\n\n```{$detected_lang}\n# Validate packages\nmake test-dist\n\n# Check package contents\nnu src/tools/package/validate-package.nu packages/\n\n# Test installation\nmake install\nmake uninstall\n```\n\n#### 4. Release Creation\n\n**Automated Release**:\n\n```{$detected_lang}\n# Create complete release\nmake release VERSION=2.1.0\n\n# Create draft release for review\nmake release-draft VERSION=2.1.0\n\n# Manual release creation\nnu src/tools/release/create-release.nu \\n --version 2.1.0 \\n --generate-changelog \\n --push-tag \\n --auto-upload\n```\n\n**Release Options**:\n\n- `--pre-release`: Mark as pre-release\n- `--draft`: Create draft release\n- `--generate-changelog`: Auto-generate changelog from commits\n- `--push-tag`: Push git tag to remote\n- `--auto-upload`: Upload assets automatically\n\n#### 5. 
Distribution and Notification\n\n**Upload Artifacts**:\n\n```{$detected_lang}\n# Upload to GitHub Releases\nmake upload-artifacts\n\n# Update package registries\nmake update-registry\n\n# Send notifications\nmake notify-release\n```\n\n**Registry Updates**:\n\n```{$detected_lang}\n# Update Homebrew formula\nnu src/tools/release/update-registry.nu \\n --registries homebrew \\n --version 2.1.0 \\n --auto-commit\n\n# Custom registry updates\nnu src/tools/release/update-registry.nu \\n --registries custom \\n --registry-url https://packages.company.com \\n --credentials-file ~/.registry-creds\n```\n\n### Release Automation\n\n**Complete Automated Release**:\n\n```{$detected_lang}\n# Full release pipeline\nmake cd-deploy VERSION=2.1.0\n\n# Equivalent manual steps:\nmake clean\nmake all VERSION=2.1.0\nmake create-archives\nmake create-installers\nmake release VERSION=2.1.0\nmake upload-artifacts\nmake update-registry\nmake notify-release\n```\n\n## Package Generation\n\n### Binary Packages\n\n**Package Types**:\n\n- **Standalone Archives**: TAR.GZ and ZIP with all dependencies\n- **Platform Packages**: DEB, RPM, MSI, PKG with system integration\n- **Portable Packages**: Single-directory distributions\n- **Source Packages**: Source code with build instructions\n\n**Create Binary Packages**:\n\n```{$detected_lang}\n# Standard binary packages\nmake package-binaries\n\n# Custom package creation\nnu src/tools/package/package-binaries.nu \\n --source-dir dist/platform \\n --output-dir packages/binaries \\n --platforms linux-amd64,macos-amd64 \\n --format archive \\n --compress \\n --strip \\n --checksum\n```\n\n**Package Features**:\n\n- **Binary Stripping**: Removes debug symbols for smaller size\n- **Compression**: GZIP, LZMA, and Brotli compression\n- **Checksums**: SHA256 and MD5 verification\n- **Signatures**: GPG and code signing support\n\n### Container Images\n\n**Container Build Process**:\n\n```{$detected_lang}\n# Build container images\nmake package-containers\n\n# Advanced container build\nnu src/tools/package/build-containers.nu \\n --dist-dir dist \\n --tag-prefix provisioning \\n --version 2.1.0 \\n --platforms "linux/amd64,linux/arm64" \\n --optimize-size \\n --security-scan \\n --multi-stage\n```\n\n**Container Features**:\n\n- **Multi-Stage Builds**: Minimal runtime images\n- **Security Scanning**: Vulnerability detection\n- **Multi-Platform**: AMD64, ARM64 support\n- **Layer Optimization**: Efficient layer caching\n- **Runtime Configuration**: Environment-based configuration\n\n**Container Registry Support**:\n\n- Docker Hub\n- GitHub Container Registry\n- Amazon ECR\n- Google Container Registry\n- Azure Container Registry\n- Private registries\n\n### Installers\n\n**Installer Types**:\n\n- **Shell Script Installer**: Universal Unix/Linux installer\n- **Package Installers**: DEB, RPM, MSI, PKG\n- **Container Installer**: Docker/Podman setup\n- **Source Installer**: Build-from-source installer\n\n**Create Installers**:\n\n```{$detected_lang}\n# Generate all installer types\nmake create-installers\n\n# Custom installer creation\nnu src/tools/distribution/create-installer.nu \\n dist/provisioning-2.1.0-linux-amd64-complete \\n --output-dir packages/installers \\n --installer-types shell,package \\n --platforms linux,macos \\n --include-services \\n --create-uninstaller \\n --validate-installer\n```\n\n**Installer Features**:\n\n- **System Integration**: Systemd/Launchd service files\n- **Path Configuration**: Automatic PATH updates\n- **User/System Install**: Support for both user and 
system-wide installation\n- **Uninstaller**: Clean removal capability\n- **Dependency Management**: Automatic dependency resolution\n- **Configuration Setup**: Initial configuration creation\n\n## Multi-Platform Distribution\n\n### Supported Platforms\n\n**Primary Platforms**:\n\n- **Linux AMD64** (x86_64-unknown-linux-gnu)\n- **Linux ARM64** (aarch64-unknown-linux-gnu)\n- **macOS AMD64** (x86_64-apple-darwin)\n- **macOS ARM64** (aarch64-apple-darwin)\n- **Windows AMD64** (x86_64-pc-windows-gnu)\n- **FreeBSD AMD64** (x86_64-unknown-freebsd)\n\n**Platform-Specific Features**:\n\n- **Linux**: SystemD integration, package manager support\n- **macOS**: LaunchAgent services, Homebrew packages\n- **Windows**: Windows Service support, MSI installers\n- **FreeBSD**: RC scripts, pkg packages\n\n### Cross-Platform Build\n\n**Cross-Compilation Setup**:\n\n```{$detected_lang}\n# Install cross-compilation targets\nrustup target add aarch64-unknown-linux-gnu\nrustup target add x86_64-apple-darwin\nrustup target add aarch64-apple-darwin\nrustup target add x86_64-pc-windows-gnu\n\n# Install cross-compilation tools\ncargo install cross\n```\n\n**Platform-Specific Builds**:\n\n```{$detected_lang}\n# Build for specific platform\nmake build-platform RUST_TARGET=aarch64-apple-darwin\n\n# Build for multiple platforms\nmake build-cross PLATFORMS=linux-amd64,macos-arm64,windows-amd64\n\n# Platform-specific distributions\nmake linux\nmake macos\nmake windows\n```\n\n### Distribution Matrix\n\n**Generated Distributions**:\n\n```{$detected_lang}\nDistribution Matrix:\nprovisioning-{version}-{platform}-{variant}.{format}\n\nExamples:\n- provisioning-2.1.0-linux-amd64-complete.tar.gz\n- provisioning-2.1.0-macos-arm64-minimal.tar.gz\n- provisioning-2.1.0-windows-amd64-complete.zip\n- provisioning-2.1.0-freebsd-amd64-minimal.tar.xz\n```\n\n**Platform Considerations**:\n\n- **File Permissions**: Executable permissions on Unix systems\n- **Path Separators**: Platform-specific path handling\n- **Service Integration**: Platform-specific service management\n- **Package Formats**: TAR.GZ for Unix, ZIP for Windows\n- **Line Endings**: CRLF for Windows, LF for Unix\n\n## Validation and Testing\n\n### Distribution Validation\n\n**Validation Pipeline**:\n\n```{$detected_lang}\n# Complete validation\nmake test-dist\n\n# Custom validation\nnu src/tools/build/test-distribution.nu \\n --dist-dir dist \\n --test-types basic,integration,complete \\n --platform linux \\n --cleanup \\n --verbose\n```\n\n**Validation Types**:\n\n- **Basic**: Installation test, CLI help, version check\n- **Integration**: Server creation, configuration validation\n- **Complete**: Full workflow testing including cluster operations\n\n### Testing Framework\n\n**Test Categories**:\n\n- **Unit Tests**: Component-specific testing\n- **Integration Tests**: Cross-component testing\n- **End-to-End Tests**: Complete workflow testing\n- **Performance Tests**: Load and performance validation\n- **Security Tests**: Security scanning and validation\n\n**Test Execution**:\n\n```{$detected_lang}\n# Run all tests\nmake ci-test\n\n# Specific test types\nnu src/tools/build/test-distribution.nu --test-types basic\nnu src/tools/build/test-distribution.nu --test-types integration\nnu src/tools/build/test-distribution.nu --test-types complete\n```\n\n### Package Validation\n\n**Package Integrity**:\n\n```{$detected_lang}\n# Validate package structure\nnu src/tools/package/validate-package.nu dist/\n\n# Check checksums\nsha256sum -c packages/checksums.sha256\n\n# Verify 
signatures\ngpg --verify packages/provisioning-2.1.0.tar.gz.sig\n```\n\n**Installation Testing**:\n\n```{$detected_lang}\n# Test installation process\n./packages/installers/install-provisioning-2.1.0.sh --dry-run\n\n# Test uninstallation\n./packages/installers/uninstall-provisioning.sh --dry-run\n\n# Container testing\ndocker run --rm provisioning:2.1.0 provisioning --version\n```\n\n## Release Management\n\n### Release Workflow\n\n**GitHub Release Integration**:\n\n```{$detected_lang}\n# Create GitHub release\nnu src/tools/release/create-release.nu \\n --version 2.1.0 \\n --asset-dir packages \\n --generate-changelog \\n --push-tag \\n --auto-upload\n```\n\n**Release Features**:\n\n- **Automated Changelog**: Generated from git commit history\n- **Asset Management**: Automatic upload of all distribution artifacts\n- **Tag Management**: Semantic version tagging\n- **Release Notes**: Formatted release notes with change summaries\n\n### Versioning Strategy\n\n**Semantic Versioning**:\n\n- **MAJOR.MINOR.PATCH** format (for example, 2.1.0)\n- **Pre-release** suffixes (for example, 2.1.0-alpha.1, 2.1.0-rc.2)\n- **Build metadata** (for example, 2.1.0+20250925.abcdef)\n\n**Version Detection**:\n\n```{$detected_lang}\n# Auto-detect next version\nnu src/tools/release/create-release.nu --release-type minor\n\n# Manual version specification\nnu src/tools/release/create-release.nu --version 2.1.0\n\n# Pre-release versioning\nnu src/tools/release/create-release.nu --version 2.1.0-rc.1 --pre-release\n```\n\n### Artifact Management\n\n**Artifact Types**:\n\n- **Source Archives**: Complete source code distributions\n- **Binary Archives**: Compiled binary distributions\n- **Container Images**: OCI-compliant container images\n- **Installers**: Platform-specific installation packages\n- **Documentation**: Generated documentation packages\n\n**Upload and Distribution**:\n\n```{$detected_lang}\n# Upload to GitHub Releases\nmake upload-artifacts\n\n# Upload to container registries\ndocker push provisioning:2.1.0\n\n# Update package repositories\nmake update-registry\n```\n\n## Rollback Procedures\n\n### Rollback Scenarios\n\n**Common Rollback Triggers**:\n\n- Critical bugs discovered post-release\n- Security vulnerabilities identified\n- Performance regression\n- Compatibility issues\n- Infrastructure failures\n\n### Rollback Process\n\n**Automated Rollback**:\n\n```{$detected_lang}\n# Rollback latest release\nnu src/tools/release/rollback-release.nu --version 2.1.0\n\n# Rollback with specific target\nnu src/tools/release/rollback-release.nu \\n --from-version 2.1.0 \\n --to-version 2.0.5 \\n --update-registries \\n --notify-users\n```\n\n**Manual Rollback Steps**:\n\n```{$detected_lang}\n# 1. Identify target version\ngit tag -l | grep -v 2.1.0 | tail -5\n\n# 2. Create rollback release\nnu src/tools/release/create-release.nu \\n --version 2.0.6 \\n --rollback-from 2.1.0 \\n --urgent\n\n# 3. Update package managers\nnu src/tools/release/update-registry.nu \\n --version 2.0.6 \\n --rollback-notice "Critical fix for 2.1.0 issues"\n\n# 4. 
Notify users\nnu src/tools/release/notify-users.nu \\n --channels slack,discord,email \\n --message-type rollback \\n --urgent\n```\n\n### Rollback Safety\n\n**Pre-Rollback Validation**:\n\n- Validate target version integrity\n- Check compatibility matrix\n- Verify rollback procedure testing\n- Confirm communication plan\n\n**Rollback Testing**:\n\n```{$detected_lang}\n# Test rollback in staging\nnu src/tools/release/rollback-release.nu \\n --version 2.1.0 \\n --target-version 2.0.5 \\n --dry-run \\n --staging-environment\n\n# Validate rollback success\nmake test-dist DIST_VERSION=2.0.5\n```\n\n### Emergency Procedures\n\n**Critical Security Rollback**:\n\n```{$detected_lang}\n# Emergency rollback (bypasses normal procedures)\nnu src/tools/release/rollback-release.nu \\n --version 2.1.0 \\n --emergency \\n --security-issue \\n --immediate-notify\n```\n\n**Infrastructure Failure Recovery**:\n\n```{$detected_lang}\n# Failover to backup infrastructure\nnu src/tools/release/rollback-release.nu \\n --infrastructure-failover \\n --backup-registry \\n --mirror-sync\n```\n\n## CI/CD Integration\n\n### GitHub Actions Integration\n\n**Build Workflow** (`.github/workflows/build.yml`):\n\n```{$detected_lang}\nname: Build and Distribute\non:\n push:\n branches: [main]\n pull_request:\n branches: [main]\n\njobs:\n build:\n runs-on: ubuntu-latest\n strategy:\n matrix:\n platform: [linux, macos, windows]\n steps:\n - uses: actions/checkout@v4\n\n - name: Setup Nushell\n uses: hustcer/setup-nu@v3.5\n\n - name: Setup Rust\n uses: actions-rs/toolchain@v1\n with:\n toolchain: stable\n\n - name: CI Build\n run: |\n cd src/tools\n make ci-build\n\n - name: Upload Build Artifacts\n uses: actions/upload-artifact@v4\n with:\n name: build-${{ matrix.platform }}\n path: src/dist/\n```\n\n**Release Workflow** (`.github/workflows/release.yml`):\n\n```{$detected_lang}\nname: Release\non:\n push:\n tags: ['v*']\n\njobs:\n release:\n runs-on: ubuntu-latest\n steps:\n - uses: actions/checkout@v4\n\n - name: Build Release\n run: |\n cd src/tools\n make ci-release VERSION=${{ github.ref_name }}\n\n - name: Create Release\n run: |\n cd src/tools\n make release VERSION=${{ github.ref_name }}\n\n - name: Update Registries\n run: |\n cd src/tools\n make update-registry VERSION=${{ github.ref_name }}\n```\n\n### GitLab CI Integration\n\n**GitLab CI Configuration** (`.gitlab-ci.yml`):\n\n```{$detected_lang}\nstages:\n - build\n - package\n - test\n - release\n\nbuild:\n stage: build\n script:\n - cd src/tools\n - make ci-build\n artifacts:\n paths:\n - src/dist/\n expire_in: 1 hour\n\npackage:\n stage: package\n script:\n - cd src/tools\n - make package-all\n artifacts:\n paths:\n - src/packages/\n expire_in: 1 day\n\nrelease:\n stage: release\n script:\n - cd src/tools\n - make cd-deploy VERSION=${CI_COMMIT_TAG}\n only:\n - tags\n```\n\n### Jenkins Integration\n\n**Jenkinsfile**:\n\n```{$detected_lang}\npipeline {\n agent any\n\n stages {\n stage('Build') {\n steps {\n dir('src/tools') {\n sh 'make ci-build'\n }\n }\n }\n\n stage('Package') {\n steps {\n dir('src/tools') {\n sh 'make package-all'\n }\n }\n }\n\n stage('Release') {\n when {\n tag '*'\n }\n steps {\n dir('src/tools') {\n sh "make cd-deploy VERSION=${env.TAG_NAME}"\n }\n }\n }\n }\n}\n```\n\n## Troubleshooting\n\n### Common Issues\n\n#### Build Failures\n\n**Rust Compilation Errors**:\n\n```{$detected_lang}\n# Solution: Clean and rebuild\nmake clean\ncargo clean\nmake build-platform\n\n# Check Rust toolchain\nrustup show\nrustup update\n```\n\n**Cross-Compilation 
Issues**:\n\n```{$detected_lang}\n# Solution: Install missing targets\nrustup target list --installed\nrustup target add x86_64-apple-darwin\n\n# Use cross for problematic targets\ncargo install cross\nmake build-platform CROSS=true\n```\n\n#### Package Generation Issues\n\n**Missing Dependencies**:\n\n```{$detected_lang}\n# Solution: Install build tools\nsudo apt-get install build-essential\nbrew install gnu-tar\n\n# Check tool availability\nmake info\n```\n\n**Permission Errors**:\n\n```{$detected_lang}\n# Solution: Fix permissions\nchmod +x src/tools/build/*.nu\nchmod +x src/tools/distribution/*.nu\nchmod +x src/tools/package/*.nu\n```\n\n#### Distribution Validation Failures\n\n**Package Integrity Issues**:\n\n```{$detected_lang}\n# Solution: Regenerate packages\nmake clean-dist\nmake package-all\n\n# Verify manually\nsha256sum packages/*.tar.gz\n```\n\n**Installation Test Failures**:\n\n```{$detected_lang}\n# Solution: Test in clean environment\ndocker run --rm -v $(pwd):/work ubuntu:latest /work/packages/installers/install.sh\n\n# Debug installation\n./packages/installers/install.sh --dry-run --verbose\n```\n\n### Release Issues\n\n#### Upload Failures\n\n**Network Issues**:\n\n```{$detected_lang}\n# Solution: Retry with backoff\nnu src/tools/release/upload-artifacts.nu \\n --retry-count 5 \\n --backoff-delay 30\n\n# Manual upload\ngh release upload v2.1.0 packages/*.tar.gz\n```\n\n**Authentication Failures**:\n\n```{$detected_lang}\n# Solution: Refresh tokens\ngh auth refresh\ndocker login ghcr.io\n\n# Check credentials\ngh auth status\ndocker system info\n```\n\n#### Registry Update Issues\n\n**Homebrew Formula Issues**:\n\n```{$detected_lang}\n# Solution: Manual PR creation\ngit clone https://github.com/Homebrew/homebrew-core\ncd homebrew-core\n# Edit formula\ngit add Formula/provisioning.rb\ngit commit -m "provisioning 2.1.0"\n```\n\n### Debug and Monitoring\n\n**Debug Mode**:\n\n```{$detected_lang}\n# Enable debug logging\nexport PROVISIONING_DEBUG=true\nexport RUST_LOG=debug\n\n# Run with verbose output\nmake all VERBOSE=true\n\n# Debug specific components\nnu src/tools/distribution/generate-distribution.nu \\n --verbose \\n --dry-run\n```\n\n**Monitoring Build Progress**:\n\n```{$detected_lang}\n# Monitor build logs\ntail -f src/tools/build.log\n\n# Check build status\nmake status\n\n# Resource monitoring\ntop\ndf -h\n```\n\nThis distribution process provides a robust, automated pipeline for creating, validating, and distributing provisioning across multiple platforms\nwhile maintaining high quality and reliability standards. +# Distribution Process Documentation + +This document provides comprehensive documentation for the provisioning project's distribution process, covering release workflows, package +generation, multi-platform distribution, and rollback procedures. + +## Table of Contents + +1. [Overview](#overview) +2. [Distribution Architecture](#distribution-architecture) +3. [Release Process](#release-process) +4. [Package Generation](#package-generation) +5. [Multi-Platform Distribution](#multi-platform-distribution) +6. [Validation and Testing](#validation-and-testing) +7. [Release Management](#release-management) +8. [Rollback Procedures](#rollback-procedures) +9. [CI/CD Integration](#cicd-integration) +10. [Troubleshooting](#troubleshooting) + +## Overview + +The distribution system provides a comprehensive solution for creating, packaging, and distributing provisioning across multiple platforms with +automated release management. 
+ +**Key Features**: + +- **Multi-Platform Support**: Linux, macOS, Windows with multiple architectures +- **Multiple Distribution Variants**: Complete and minimal distributions +- **Automated Release Pipeline**: From development to production deployment +- **Package Management**: Binary packages, container images, and installers +- **Validation Framework**: Comprehensive testing and validation +- **Rollback Capabilities**: Safe rollback and recovery procedures + +**Location**: `/src/tools/` +**Main Tool**: `/src/tools/Makefile` and associated Nushell scripts + +## Distribution Architecture + +### Distribution Components + +```text +Distribution Ecosystem +├── Core Components +│ ├── Platform Binaries # Rust-compiled binaries +│ ├── Core Libraries # Nushell libraries and CLI +│ ├── Configuration System # TOML configuration files +│ └── Documentation # User and API documentation +├── Platform Packages +│ ├── Archives # TAR.GZ and ZIP files +│ ├── Installers # Platform-specific installers +│ └── Container Images # Docker/OCI images +├── Distribution Variants +│ ├── Complete # Full-featured distribution +│ └── Minimal # Lightweight distribution +└── Release Artifacts + ├── Checksums # SHA256/MD5 verification + ├── Signatures # Digital signatures + └── Metadata # Release information +``` + +### Build Pipeline + +```text +Build Pipeline Flow +┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ +│ Source Code │ -> │ Build Stage │ -> │ Package Stage │ +│ │ │ │ │ │ +│ - Rust code │ │ - compile- │ │ - create- │ +│ - Nushell libs │ │ platform │ │ archives │ +│ - Nickel schemas│ │ - bundle-core │ │ - build- │ +│ - Config files │ │ - validate-nickel│ │ containers │ +└─────────────────┘ └─────────────────┘ └─────────────────┘ + | + v +┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ +│ Release Stage │ <- │ Validate Stage │ <- │ Distribute Stage│ +│ │ │ │ │ │ +│ - create- │ │ - test-dist │ │ - generate- │ +│ release │ │ - validate- │ │ distribution │ +│ - upload- │ │ package │ │ - create- │ +│ artifacts │ │ - integration │ │ installers │ +└─────────────────┘ └─────────────────┘ └─────────────────┘ +``` + +### Distribution Variants + +**Complete Distribution**: + +- All Rust binaries (orchestrator, control-center, MCP server) +- Full Nushell library suite +- All providers, taskservs, and clusters +- Complete documentation and examples +- Development tools and templates + +**Minimal Distribution**: + +- Essential binaries only +- Core Nushell libraries +- Basic provider support +- Essential task services +- Minimal documentation + +## Release Process + +### Release Types + +**Release Classifications**: + +- **Major Release** (x.0.0): Breaking changes, new major features +- **Minor Release** (x.y.0): New features, backward compatible +- **Patch Release** (x.y.z): Bug fixes, security updates +- **Pre-Release** (x.y.z-alpha/beta/rc): Development/testing releases + +### Step-by-Step Release Process + +#### 1. Preparation Phase + +**Pre-Release Checklist**: + +```text +# Update dependencies and security +cargo update +cargo audit + +# Run comprehensive tests +make ci-test + +# Update documentation +make docs + +# Validate all configurations +make validate-all +``` + +**Version Planning**: + +```text +# Check current version +git describe --tags --always + +# Plan next version +make status | grep Version + +# Validate version bump +nu src/tools/release/create-release.nu --dry-run --version 2.1.0 +``` + +#### 2. 
Build Phase
+
+**Complete Build**:
+
+```text
+# Clean build environment
+make clean
+
+# Build all platforms and variants
+make all
+
+# Validate build output
+make test-dist
+```
+
+**Build with Specific Parameters**:
+
+```text
+# Build for specific platforms
+make all PLATFORMS=linux-amd64,macos-amd64 VARIANTS=complete
+
+# Build with custom version
+make all VERSION=2.1.0-rc1
+
+# Parallel build for speed
+make all PARALLEL=true
+```
+
+#### 3. Package Generation
+
+**Create Distribution Packages**:
+
+```text
+# Generate complete distributions
+make dist-generate
+
+# Create binary packages
+make package-binaries
+
+# Build container images
+make package-containers
+
+# Create installers
+make create-installers
+```
+
+**Package Validation**:
+
+```text
+# Validate packages
+make test-dist
+
+# Check package contents
+nu src/tools/package/validate-package.nu packages/
+
+# Test installation
+make install
+make uninstall
+```
+
+#### 4. Release Creation
+
+**Automated Release**:
+
+```text
+# Create complete release
+make release VERSION=2.1.0
+
+# Create draft release for review
+make release-draft VERSION=2.1.0
+
+# Manual release creation
+nu src/tools/release/create-release.nu \
+  --version 2.1.0 \
+  --generate-changelog \
+  --push-tag \
+  --auto-upload
+```
+
+**Release Options**:
+
+- `--pre-release`: Mark as pre-release
+- `--draft`: Create draft release
+- `--generate-changelog`: Auto-generate changelog from commits
+- `--push-tag`: Push git tag to remote
+- `--auto-upload`: Upload assets automatically
+
+#### 5. Distribution and Notification
+
+**Upload Artifacts**:
+
+```text
+# Upload to GitHub Releases
+make upload-artifacts
+
+# Update package registries
+make update-registry
+
+# Send notifications
+make notify-release
+```
+
+**Registry Updates**:
+
+```text
+# Update Homebrew formula
+nu src/tools/release/update-registry.nu \
+  --registries homebrew \
+  --version 2.1.0 \
+  --auto-commit
+
+# Custom registry updates
+nu src/tools/release/update-registry.nu \
+  --registries custom \
+  --registry-url https://packages.company.com \
+  --credentials-file ~/.registry-creds
+```
+
+### Release Automation
+
+**Complete Automated Release**:
+
+```text
+# Full release pipeline
+make cd-deploy VERSION=2.1.0
+
+# Equivalent manual steps:
+make clean
+make all VERSION=2.1.0
+make create-archives
+make create-installers
+make release VERSION=2.1.0
+make upload-artifacts
+make update-registry
+make notify-release
+```
+
+## Package Generation
+
+### Binary Packages
+
+**Package Types**:
+
+- **Standalone Archives**: TAR.GZ and ZIP with all dependencies
+- **Platform Packages**: DEB, RPM, MSI, PKG with system integration
+- **Portable Packages**: Single-directory distributions
+- **Source Packages**: Source code with build instructions
+
+**Create Binary Packages**:
+
+```text
+# Standard binary packages
+make package-binaries
+
+# Custom package creation
+nu src/tools/package/package-binaries.nu \
+  --source-dir dist/platform \
+  --output-dir packages/binaries \
+  --platforms linux-amd64,macos-amd64 \
+  --format archive \
+  --compress \
+  --strip \
+  --checksum
+```
+
+**Package Features**:
+
+- **Binary Stripping**: Removes debug symbols for smaller size
+- **Compression**: GZIP, LZMA, and Brotli compression
+- **Checksums**: SHA256 and MD5 verification
+- **Signatures**: GPG and code signing support
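+
+Checksum manifests can also be produced and verified by hand with standard tools; a minimal sketch (file locations are illustrative):
+
+```text
+# Generate a SHA256 manifest for the built archives (illustrative paths)
+cd packages/binaries
+sha256sum *.tar.gz > checksums.sha256
+
+# Verify the manifest later, for example after download
+sha256sum -c checksums.sha256
+```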
+
+### Container Images
+
+**Container Build Process**:
+
+```text
+# Build container images
+make package-containers
+
+# Advanced container build
+nu src/tools/package/build-containers.nu \
+  --dist-dir dist \
+  --tag-prefix provisioning \
+  --version 2.1.0 \
+  --platforms "linux/amd64,linux/arm64" \
+  --optimize-size \
+  --security-scan \
+  --multi-stage
+```
+
+**Container Features**:
+
+- **Multi-Stage Builds**: Minimal runtime images
+- **Security Scanning**: Vulnerability detection
+- **Multi-Platform**: AMD64, ARM64 support
+- **Layer Optimization**: Efficient layer caching
+- **Runtime Configuration**: Environment-based configuration
+
+**Container Registry Support**:
+
+- Docker Hub
+- GitHub Container Registry
+- Amazon ECR
+- Google Container Registry
+- Azure Container Registry
+- Private registries
+
+### Installers
+
+**Installer Types**:
+
+- **Shell Script Installer**: Universal Unix/Linux installer
+- **Package Installers**: DEB, RPM, MSI, PKG
+- **Container Installer**: Docker/Podman setup
+- **Source Installer**: Build-from-source installer
+
+**Create Installers**:
+
+```text
+# Generate all installer types
+make create-installers
+
+# Custom installer creation
+nu src/tools/distribution/create-installer.nu \
+  dist/provisioning-2.1.0-linux-amd64-complete \
+  --output-dir packages/installers \
+  --installer-types shell,package \
+  --platforms linux,macos \
+  --include-services \
+  --create-uninstaller \
+  --validate-installer
+```
+
+**Installer Features**:
+
+- **System Integration**: Systemd/Launchd service files
+- **Path Configuration**: Automatic PATH updates
+- **User/System Install**: Support for both user and system-wide installation
+- **Uninstaller**: Clean removal capability
+- **Dependency Management**: Automatic dependency resolution
+- **Configuration Setup**: Initial configuration creation
+
+## Multi-Platform Distribution
+
+### Supported Platforms
+
+**Primary Platforms**:
+
+- **Linux AMD64** (x86_64-unknown-linux-gnu)
+- **Linux ARM64** (aarch64-unknown-linux-gnu)
+- **macOS AMD64** (x86_64-apple-darwin)
+- **macOS ARM64** (aarch64-apple-darwin)
+- **Windows AMD64** (x86_64-pc-windows-gnu)
+- **FreeBSD AMD64** (x86_64-unknown-freebsd)
+
+**Platform-Specific Features**:
+
+- **Linux**: systemd integration, package manager support
+- **macOS**: LaunchAgent services, Homebrew packages
+- **Windows**: Windows Service support, MSI installers
+- **FreeBSD**: RC scripts, pkg packages
+
+### Cross-Platform Build
+
+**Cross-Compilation Setup**:
+
+```text
+# Install cross-compilation targets
+rustup target add aarch64-unknown-linux-gnu
+rustup target add x86_64-apple-darwin
+rustup target add aarch64-apple-darwin
+rustup target add x86_64-pc-windows-gnu
+
+# Install cross-compilation tools
+cargo install cross
+```
+
+**Platform-Specific Builds**:
+
+```text
+# Build for specific platform
+make build-platform RUST_TARGET=aarch64-apple-darwin
+
+# Build for multiple platforms
+make build-cross PLATFORMS=linux-amd64,macos-arm64,windows-amd64
+
+# Platform-specific distributions
+make linux
+make macos
+make windows
+```
+
+### Distribution Matrix
+
+**Generated Distributions**:
+
+```text
+Distribution Matrix:
+provisioning-{version}-{platform}-{variant}.{format}
+
+Examples:
+- provisioning-2.1.0-linux-amd64-complete.tar.gz
+- provisioning-2.1.0-macos-arm64-minimal.tar.gz
+- provisioning-2.1.0-windows-amd64-complete.zip
+- provisioning-2.1.0-freebsd-amd64-minimal.tar.xz
+```
+
+**Platform Considerations**:
+
+- **File Permissions**: Executable permissions on Unix systems
+- **Path Separators**: Platform-specific path handling
+- **Service Integration**: Platform-specific service management
+- **Package Formats**: TAR.GZ for Unix, ZIP for Windows
+- **Line Endings**: CRLF for Windows, LF for Unix
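+
+Because artifact names follow this fixed pattern, downstream tooling can recover the matrix fields from a file name. A minimal Nushell sketch (the parsing rule is inferred from the naming scheme above; the `{platform}` field is split into OS and architecture, and hyphenated pre-release versions would need extra handling):
+
+```text
+# Split a distribution file name into its matrix fields (illustrative)
+def parse-artifact [name: string] {
+    $name | parse "provisioning-{version}-{os}-{arch}-{variant}.{format}" | first
+}
+
+parse-artifact "provisioning-2.1.0-linux-amd64-complete.tar.gz"
+# => version: 2.1.0, os: linux, arch: amd64, variant: complete, format: tar.gz
+```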
+
+## Validation and Testing
+
+### Distribution Validation
+
+**Validation Pipeline**:
+
+```text
+# Complete validation
+make test-dist
+
+# Custom validation
+nu src/tools/build/test-distribution.nu \
+  --dist-dir dist \
+  --test-types basic,integration,complete \
+  --platform linux \
+  --cleanup \
+  --verbose
+```
+
+**Validation Types**:
+
+- **Basic**: Installation test, CLI help, version check
+- **Integration**: Server creation, configuration validation
+- **Complete**: Full workflow testing including cluster operations
+
+### Testing Framework
+
+**Test Categories**:
+
+- **Unit Tests**: Component-specific testing
+- **Integration Tests**: Cross-component testing
+- **End-to-End Tests**: Complete workflow testing
+- **Performance Tests**: Load and performance validation
+- **Security Tests**: Security scanning and validation
+
+**Test Execution**:
+
+```text
+# Run all tests
+make ci-test
+
+# Specific test types
+nu src/tools/build/test-distribution.nu --test-types basic
+nu src/tools/build/test-distribution.nu --test-types integration
+nu src/tools/build/test-distribution.nu --test-types complete
+```
+
+### Package Validation
+
+**Package Integrity**:
+
+```text
+# Validate package structure
+nu src/tools/package/validate-package.nu dist/
+
+# Check checksums
+sha256sum -c packages/checksums.sha256
+
+# Verify signatures
+gpg --verify packages/provisioning-2.1.0.tar.gz.sig
+```
+
+**Installation Testing**:
+
+```text
+# Test installation process
+./packages/installers/install-provisioning-2.1.0.sh --dry-run
+
+# Test uninstallation
+./packages/installers/uninstall-provisioning.sh --dry-run
+
+# Container testing
+docker run --rm provisioning:2.1.0 provisioning --version
+```
+
+## Release Management
+
+### Release Workflow
+
+**GitHub Release Integration**:
+
+```text
+# Create GitHub release
+nu src/tools/release/create-release.nu \
+  --version 2.1.0 \
+  --asset-dir packages \
+  --generate-changelog \
+  --push-tag \
+  --auto-upload
+```
+
+**Release Features**:
+
+- **Automated Changelog**: Generated from git commit history
+- **Asset Management**: Automatic upload of all distribution artifacts
+- **Tag Management**: Semantic version tagging
+- **Release Notes**: Formatted release notes with change summaries
+
+### Versioning Strategy
+
+**Semantic Versioning**:
+
+- **MAJOR.MINOR.PATCH** format (for example, 2.1.0)
+- **Pre-release** suffixes (for example, 2.1.0-alpha.1, 2.1.0-rc.2)
+- **Build metadata** (for example, 2.1.0+20250925.abcdef)
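+
+The bump rules follow mechanically from this scheme. A small Nushell sketch of the version arithmetic (illustrative only, not the release tool's implementation; pre-release suffixes and build metadata would need extra handling):
+
+```text
+# Compute the next version for a release type (illustrative)
+def next-version [current: string, release_type: string] {
+    let p = ($current | split row "." | each {|it| $it | into int })
+    match $release_type {
+        "major" => $"($p.0 + 1).0.0"
+        "minor" => $"($p.0).($p.1 + 1).0"
+        "patch" => $"($p.0).($p.1).($p.2 + 1)"
+    }
+}
+
+next-version "2.0.5" "minor"   # => 2.1.0
+```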
+
+**Version Detection**:
+
+```text
+# Auto-detect next version
+nu src/tools/release/create-release.nu --release-type minor
+
+# Manual version specification
+nu src/tools/release/create-release.nu --version 2.1.0
+
+# Pre-release versioning
+nu src/tools/release/create-release.nu --version 2.1.0-rc.1 --pre-release
+```
+
+### Artifact Management
+
+**Artifact Types**:
+
+- **Source Archives**: Complete source code distributions
+- **Binary Archives**: Compiled binary distributions
+- **Container Images**: OCI-compliant container images
+- **Installers**: Platform-specific installation packages
+- **Documentation**: Generated documentation packages
+
+**Upload and Distribution**:
+
+```text
+# Upload to GitHub Releases
+make upload-artifacts
+
+# Upload to container registries
+docker push provisioning:2.1.0
+
+# Update package repositories
+make update-registry
+```
+
+## Rollback Procedures
+
+### Rollback Scenarios
+
+**Common Rollback Triggers**:
+
+- Critical bugs discovered post-release
+- Security vulnerabilities identified
+- Performance regression
+- Compatibility issues
+- Infrastructure failures
+
+### Rollback Process
+
+**Automated Rollback**:
+
+```text
+# Rollback latest release
+nu src/tools/release/rollback-release.nu --version 2.1.0
+
+# Rollback with specific target
+nu src/tools/release/rollback-release.nu \
+  --from-version 2.1.0 \
+  --to-version 2.0.5 \
+  --update-registries \
+  --notify-users
+```
+
+**Manual Rollback Steps**:
+
+```text
+# 1. Identify target version
+git tag -l | grep -v 2.1.0 | tail -5
+
+# 2. Create rollback release
+nu src/tools/release/create-release.nu \
+  --version 2.0.6 \
+  --rollback-from 2.1.0 \
+  --urgent
+
+# 3. Update package managers
+nu src/tools/release/update-registry.nu \
+  --version 2.0.6 \
+  --rollback-notice "Critical fix for 2.1.0 issues"
+
+# 4. Notify users
+nu src/tools/release/notify-users.nu \
+  --channels slack,discord,email \
+  --message-type rollback \
+  --urgent
+```
+
+### Rollback Safety
+
+**Pre-Rollback Validation**:
+
+- Validate target version integrity
+- Check compatibility matrix
+- Verify rollback procedure testing
+- Confirm communication plan
+
+**Rollback Testing**:
+
+```text
+# Test rollback in staging
+nu src/tools/release/rollback-release.nu \
+  --version 2.1.0 \
+  --target-version 2.0.5 \
+  --dry-run \
+  --staging-environment
+
+# Validate rollback success
+make test-dist DIST_VERSION=2.0.5
+```
+
+### Emergency Procedures
+
+**Critical Security Rollback**:
+
+```text
+# Emergency rollback (bypasses normal procedures)
+nu src/tools/release/rollback-release.nu \
+  --version 2.1.0 \
+  --emergency \
+  --security-issue \
+  --immediate-notify
+```
+
+**Infrastructure Failure Recovery**:
+
+```text
+# Failover to backup infrastructure
+nu src/tools/release/rollback-release.nu \
+  --infrastructure-failover \
+  --backup-registry \
+  --mirror-sync
+```
+
+## CI/CD Integration
+
+### GitHub Actions Integration
+
+**Build Workflow** (`.github/workflows/build.yml`):
+
+```text
+name: Build and Distribute
+on:
+  push:
+    branches: [main]
+  pull_request:
+    branches: [main]
+
+jobs:
+  build:
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        platform: [linux, macos, windows]
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Setup Nushell
+        uses: hustcer/setup-nu@v3.5
+
+      - name: Setup Rust
+        uses: actions-rs/toolchain@v1
+        with:
+          toolchain: stable
+
+      - name: CI Build
+        run: |
+          cd src/tools
+          make ci-build
+
+      - name: Upload Build Artifacts
+        uses: actions/upload-artifact@v4
+        with:
+          name: build-${{ matrix.platform }}
+          path: src/dist/
+```
+
+**Release Workflow** (`.github/workflows/release.yml`):
+
+```text
+name: Release
+on:
+  push:
+    tags: ['v*']
+
+jobs:
+  release:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+
+      - name: Build Release
+        run: |
+          cd src/tools
+          make ci-release VERSION=${{ github.ref_name }}
+
+      - name: Create Release
+        run: |
+          cd src/tools
+          make release VERSION=${{ github.ref_name }}
+
+      - name: Update Registries
+        run: |
+          cd src/tools
+          make update-registry VERSION=${{ github.ref_name }}
+```
+
+### GitLab CI Integration
+
+**GitLab CI Configuration** (`.gitlab-ci.yml`):
+
+```text
+stages:
+  - build
+  - package
+  - test
+  - release
+
+build:
+  stage: build
+  script:
+    - cd src/tools
+    - make ci-build
+  artifacts:
+    paths:
+      - src/dist/
+    expire_in: 1 hour
+
+package:
+  stage: package
+  script:
+    - cd src/tools
+    - make package-all
+  artifacts:
+    paths:
+      - src/packages/
+    expire_in: 1 day
+
+release:
+  stage: release
+  script:
+    - cd src/tools
+    - make cd-deploy VERSION=${CI_COMMIT_TAG}
+  only:
+    - tags
+```
+
+### Jenkins Integration
+
+**Jenkinsfile**:
+
+```text
+pipeline {
+    agent any
+
+    stages {
+        stage('Build') {
+            steps {
+                dir('src/tools') {
+                    sh 'make ci-build'
+                }
+            }
+        }
+
+        stage('Package') {
+            steps {
+                dir('src/tools') {
+                    sh 'make package-all'
+                }
+            }
+        }
+
+        stage('Release') {
+            when {
+                tag '*'
+            }
+            steps {
+                dir('src/tools') {
+                    sh "make cd-deploy VERSION=${env.TAG_NAME}"
+                }
+            }
+        }
+    }
+}
+```
+
+## Troubleshooting
+
+### Common Issues
+
+#### Build Failures
+
+**Rust Compilation Errors**:
+
+```text
+# Solution: Clean and rebuild
+make clean
+cargo clean
+make build-platform
+
+# Check Rust toolchain
+rustup show
+rustup update
+```
+
+**Cross-Compilation Issues**:
+
+```text
+# Solution: Install missing targets
+rustup target list --installed
+rustup target add x86_64-apple-darwin
+
+# Use cross for problematic targets
+cargo install cross
+make build-platform CROSS=true
+```
+
+#### Package Generation Issues
+
+**Missing Dependencies**:
+
+```text
+# Solution: Install build tools
+sudo apt-get install build-essential
+brew install gnu-tar
+
+# Check tool availability
+make info
+```
+
+**Permission Errors**:
+
+```text
+# Solution: Fix permissions
+chmod +x src/tools/build/*.nu
+chmod +x src/tools/distribution/*.nu
+chmod +x src/tools/package/*.nu
+```
+
+#### Distribution Validation Failures
+
+**Package Integrity Issues**:
+
+```text
+# Solution: Regenerate packages
+make clean-dist
+make package-all
+
+# Verify manually
+sha256sum packages/*.tar.gz
+```
+
+**Installation Test Failures**:
+
+```text
+# Solution: Test in clean environment
+docker run --rm -v $(pwd):/work ubuntu:latest /work/packages/installers/install.sh
+
+# Debug installation
+./packages/installers/install.sh --dry-run --verbose
+```
+
+### Release Issues
+
+#### Upload Failures
+
+**Network Issues**:
+
+```text
+# Solution: Retry with backoff
+nu src/tools/release/upload-artifacts.nu \
+  --retry-count 5 \
+  --backoff-delay 30
+
+# Manual upload
+gh release upload v2.1.0 packages/*.tar.gz
+```
+
+**Authentication Failures**:
+
+```text
+# Solution: Refresh tokens
+gh auth refresh
+docker login ghcr.io
+
+# Check credentials
+gh auth status
+docker system info
+```
+
+#### Registry Update Issues
+
+**Homebrew Formula Issues**:
+
+```text
+# Solution: Manual PR creation
+git clone https://github.com/Homebrew/homebrew-core
+cd homebrew-core
+# Edit formula
+git add Formula/provisioning.rb
+git commit -m "provisioning 2.1.0"
+```
+
+### Debug and Monitoring
+
+**Debug Mode**:
+
+```text
+# Enable debug logging
+export PROVISIONING_DEBUG=true
+export RUST_LOG=debug
+
+# Run with verbose output
+make all VERBOSE=true
+
+# Debug specific components
+nu src/tools/distribution/generate-distribution.nu \
+  --verbose \
+  --dry-run
+```
+
+**Monitoring Build Progress**:
+
+```text
+# Monitor build logs
+tail -f src/tools/build.log
+
+# Check build status
+make status
+
+# Resource monitoring
+top
+df -h
+```
+
+This distribution process provides a robust, automated pipeline for creating, validating, and distributing provisioning across multiple platforms
+while maintaining high quality and reliability standards.
diff --git a/docs/src/development/glossary.md b/docs/src/development/glossary.md
index 75a2160..6639d7f 100644
--- a/docs/src/development/glossary.md
+++ b/docs/src/development/glossary.md
@@ -1 +1,1760 @@
-# Provisioning Platform Glossary\n\n**Last Updated**: 2025-10-10\n**Version**: 1.0.0\n\nThis glossary defines key terminology used throughout the Provisioning Platform documentation. 
Terms are listed alphabetically with definitions, usage\ncontext, and cross-references to related documentation.\n\n---\n\n## A\n\n### ADR (Architecture Decision Record)\n\n**Definition**: Documentation of significant architectural decisions, including context, decision, and consequences.\n\n**Where Used**:\n\n- Architecture planning and review\n- Technical decision-making process\n- System design documentation\n\n**Related Concepts**: Architecture, Design Patterns, Technical Debt\n\n**Examples**:\n\n- ADR-001: Project Structure\n- ADR-006: CLI Refactoring\n- ADR-009: Complete Security System\n\n**See Also**: Architecture Documentation\n\n---\n\n### Agent\n\n**Definition**: A specialized component that performs a specific task in the system orchestration (for example, autonomous execution units in the\norchestrator).\n\n**Where Used**:\n\n- Task orchestration\n- Workflow management\n- Parallel execution patterns\n\n**Related Concepts**: Orchestrator, Workflow, Task\n\n**See Also**: [Orchestrator Architecture](../architecture/orchestrator-integration-model.md)\n\n---\n\n### Anchor Link\n\n**Definition**: An internal document link to a specific section within the same or different markdown file using the `#` symbol.\n\n**Where Used**:\n\n- Cross-referencing documentation sections\n- Table of contents generation\n- Navigation within long documents\n\n**Related Concepts**: Internal Link, Cross-Reference, Documentation\n\n**Examples**:\n\n- `[See Installation](#installation)` - Same document\n- `[Configuration Guide](config.md#setup)` - Different document\n\n---\n\n### API Gateway\n\n**Definition**: Platform service that provides unified REST API access to provisioning operations.\n\n**Where Used**:\n\n- External system integration\n- Web Control Center backend\n- MCP server communication\n\n**Related Concepts**: REST API, Platform Service, Orchestrator\n\n**Location**: `provisioning/platform/api-gateway/`\n\n**See Also**: REST API Documentation\n\n---\n\n### Auth (Authentication)\n\n**Definition**: The process of verifying user identity using JWT tokens, MFA, and secure session management.\n\n**Where Used**:\n\n- User login flows\n- API access control\n- CLI session management\n\n**Related Concepts**: Authorization, JWT, MFA, Security\n\n**See Also**:\n\n- Authentication Layer Guide\n- Auth Quick Reference\n\n---\n\n### Authorization\n\n**Definition**: The process of determining user permissions using Cedar policy language.\n\n**Where Used**:\n\n- Access control decisions\n- Resource permission checks\n- Multi-tenant security\n\n**Related Concepts**: Auth, Cedar, Policies, RBAC\n\n**See Also**: Cedar Authorization Implementation\n\n---\n\n## B\n\n### Batch Operation\n\n**Definition**: A collection of related infrastructure operations executed as a single workflow unit.\n\n**Where Used**:\n\n- Multi-server deployments\n- Cluster creation\n- Bulk taskserv installation\n\n**Related Concepts**: Workflow, Operation, Orchestrator\n\n**Commands**:\n\n```\nprovisioning batch submit workflow.ncl\nprovisioning batch list\nprovisioning batch status \n```\n\n**See Also**: [Batch Workflow System](../guides/from-scratch.md)\n\n---\n\n### Break-Glass\n\n**Definition**: Emergency access mechanism requiring multi-party approval for critical operations.\n\n**Where Used**:\n\n- Emergency system access\n- Incident response\n- Security override scenarios\n\n**Related Concepts**: Security, Compliance, Audit\n\n**Commands**:\n\n```\nprovisioning break-glass request "reason"\nprovisioning break-glass approve 
\n```\n\n**See Also**: Break-Glass Training Guide\n\n---\n\n## C\n\n### Cedar\n\n**Definition**: Amazon's policy language used for fine-grained authorization decisions.\n\n**Where Used**:\n\n- Authorization policies\n- Access control rules\n- Resource permissions\n\n**Related Concepts**: Authorization, Policies, Security\n\n**See Also**: Cedar Authorization Implementation\n\n---\n\n### Checkpoint\n\n**Definition**: A saved state of a workflow allowing resume from point of failure.\n\n**Where Used**:\n\n- Workflow recovery\n- Long-running operations\n- Batch processing\n\n**Related Concepts**: Workflow, State Management, Recovery\n\n**See Also**: [Batch Workflow System](../guides/from-scratch.md)\n\n---\n\n### CLI (Command-Line Interface)\n\n**Definition**: The `provisioning` command-line tool providing access to all platform operations.\n\n**Where Used**:\n\n- Daily operations\n- Script automation\n- CI/CD pipelines\n\n**Related Concepts**: Command, Shortcut, Module\n\n**Location**: `provisioning/core/cli/provisioning`\n\n**Examples**:\n\n```\nprovisioning server create\nprovisioning taskserv install kubernetes\nprovisioning workspace switch prod\n```\n\n**See Also**:\n\n- [CLI Reference](../infrastructure/cli-reference.md)\n- CLI Reference\n\n---\n\n### Cluster\n\n**Definition**: A complete, pre-configured deployment of multiple servers and taskservs working together.\n\n**Where Used**:\n\n- Kubernetes deployments\n- Database clusters\n- Complete infrastructure stacks\n\n**Related Concepts**: Infrastructure, Server, Taskserv\n\n**Location**: `provisioning/extensions/clusters/{name}/`\n\n**Commands**:\n\n```\nprovisioning cluster create \nprovisioning cluster list\nprovisioning cluster delete \n```\n\n**See Also**: Infrastructure Management\n\n---\n\n### Compliance\n\n**Definition**: System capabilities ensuring adherence to regulatory requirements (GDPR, SOC2, ISO 27001).\n\n**Where Used**:\n\n- Audit logging\n- Data retention policies\n- Incident response\n\n**Related Concepts**: Audit, Security, GDPR\n\n**See Also**: Compliance Implementation Summary\n\n---\n\n### Config (Configuration)\n\n**Definition**: System settings stored in TOML files with hierarchical loading and variable interpolation.\n\n**Where Used**:\n\n- System initialization\n- User preferences\n- Environment-specific settings\n\n**Related Concepts**: Settings, Environment, Workspace\n\n**Files**:\n\n- `provisioning/config/config.defaults.toml` - System defaults\n- `workspace/config/local-overrides.toml` - User settings\n\n**See Also**: [Configuration Guide](../infrastructure/configuration-guide.md)\n\n---\n\n### Control Center\n\n**Definition**: Web-based UI for managing provisioning operations built with Ratatui/Crossterm.\n\n**Where Used**:\n\n- Visual infrastructure management\n- Real-time monitoring\n- Guided workflows\n\n**Related Concepts**: UI, Platform Service, Orchestrator\n\n**Location**: `provisioning/platform/control-center/`\n\n**See Also**: Platform Services\n\n---\n\n### CoreDNS\n\n**Definition**: DNS server taskserv providing service discovery and DNS management.\n\n**Where Used**:\n\n- Kubernetes DNS\n- Service discovery\n- Internal DNS resolution\n\n**Related Concepts**: Taskserv, Kubernetes, Networking\n\n**See Also**:\n\n- CoreDNS Guide\n- CoreDNS Quick Reference\n\n---\n\n### Cross-Reference\n\n**Definition**: Links between related documentation sections or concepts.\n\n**Where Used**:\n\n- Documentation navigation\n- Related topic discovery\n- Learning path guidance\n\n**Related Concepts**: 
Documentation, Navigation, See Also\n\n**Examples**: "See Also" sections at the end of documentation pages\n\n---\n\n## D\n\n### Dependency\n\n**Definition**: A requirement that must be satisfied before installing or running a component.\n\n**Where Used**:\n\n- Taskserv installation order\n- Version compatibility checks\n- Cluster deployment sequencing\n\n**Related Concepts**: Version, Taskserv, Workflow\n\n**Schema**: `provisioning/schemas/dependencies.ncl`\n\n**See Also**: Nickel Dependency Patterns\n\n---\n\n### Diagnostics\n\n**Definition**: System health checking and troubleshooting assistance.\n\n**Where Used**:\n\n- System status verification\n- Problem identification\n- Guided troubleshooting\n\n**Related Concepts**: Health Check, Monitoring, Troubleshooting\n\n**Commands**:\n\n```\nprovisioning status\nprovisioning diagnostics run\n```\n\n---\n\n### Dynamic Secrets\n\n**Definition**: Temporary credentials generated on-demand with automatic expiration.\n\n**Where Used**:\n\n- AWS STS tokens\n- SSH temporary keys\n- Database credentials\n\n**Related Concepts**: Security, KMS, Secrets Management\n\n**See Also**:\n\n- Dynamic Secrets Implementation\n- Dynamic Secrets Quick Reference\n\n---\n\n## E\n\n### Environment\n\n**Definition**: A deployment context (dev, test, prod) with specific configuration overrides.\n\n**Where Used**:\n\n- Configuration loading\n- Resource isolation\n- Deployment targeting\n\n**Related Concepts**: Config, Workspace, Infrastructure\n\n**Config Files**: `config.{dev,test,prod}.toml`\n\n**Usage**:\n\n```\nPROVISIONING_ENV=prod provisioning server list\n```\n\n---\n\n### Extension\n\n**Definition**: A pluggable component adding functionality (provider, taskserv, cluster, or workflow).\n\n**Where Used**:\n\n- Custom cloud providers\n- Third-party taskservs\n- Custom deployment patterns\n\n**Related Concepts**: Provider, Taskserv, Cluster, Workflow\n\n**Location**: `provisioning/extensions/{type}/{name}/`\n\n**See Also**: Extension Development\n\n---\n\n## F\n\n### Feature\n\n**Definition**: A major system capability providing key platform functionality.\n\n**Where Used**:\n\n- Architecture documentation\n- Feature planning\n- System capabilities\n\n**Related Concepts**: ADR, Architecture, System\n\n**Examples**:\n\n- Batch Workflow System\n- Orchestrator Architecture\n- CLI Architecture\n- Configuration System\n\n**See Also**: [Architecture Overview](../architecture/system-overview.md)\n\n---\n\n## G\n\n### GDPR (General Data Protection Regulation)\n\n**Definition**: EU data protection regulation compliance features in the platform.\n\n**Where Used**:\n\n- Data export requests\n- Right to erasure\n- Audit compliance\n\n**Related Concepts**: Compliance, Audit, Security\n\n**Commands**:\n\n```\nprovisioning compliance gdpr export \nprovisioning compliance gdpr delete \n```\n\n**See Also**: Compliance Implementation\n\n---\n\n### Glossary\n\n**Definition**: This document - a comprehensive terminology reference for the platform.\n\n**Where Used**:\n\n- Learning the platform\n- Understanding documentation\n- Resolving terminology questions\n\n**Related Concepts**: Documentation, Reference, Cross-Reference\n\n---\n\n### Guide\n\n**Definition**: Step-by-step walkthrough documentation for common workflows.\n\n**Where Used**:\n\n- Onboarding new users\n- Learning workflows\n- Reference implementation\n\n**Related Concepts**: Documentation, Workflow, Tutorial\n\n**Commands**:\n\n```\nprovisioning guide from-scratch\nprovisioning guide update\nprovisioning guide 
customize\n```\n\n**See Also**: [Guides](../guides/README.md)\n\n---\n\n## H\n\n### Health Check\n\n**Definition**: Automated verification that a component is running correctly.\n\n**Where Used**:\n\n- Taskserv validation\n- System monitoring\n- Dependency verification\n\n**Related Concepts**: Diagnostics, Monitoring, Status\n\n**Example**:\n\n```\nhealth_check = {\n endpoint = "http://localhost:6443/healthz"\n timeout = 30\n interval = 10\n}\n```\n\n---\n\n### Hybrid Architecture\n\n**Definition**: System design combining Rust orchestrator with Nushell business logic.\n\n**Where Used**:\n\n- Core platform architecture\n- Performance optimization\n- Call stack management\n\n**Related Concepts**: Orchestrator, Architecture, Design\n\n**See Also**:\n\n- [Orchestrator Architecture](../architecture/orchestrator-integration-model.md)\n- [ADR-004: Hybrid Architecture](../architecture/adr/adr-004-hybrid-architecture.md)\n\n---\n\n## I\n\n### Infrastructure\n\n**Definition**: A named collection of servers, configurations, and deployments managed as a unit.\n\n**Where Used**:\n\n- Environment isolation\n- Resource organization\n- Deployment targeting\n\n**Related Concepts**: Workspace, Server, Environment\n\n**Location**: `workspace/infra/{name}/`\n\n**Commands**:\n\n```\nprovisioning infra list\nprovisioning generate infra --new \n```\n\n**See Also**: Infrastructure Management\n\n---\n\n### Integration\n\n**Definition**: Connection between platform components or external systems.\n\n**Where Used**:\n\n- API integration\n- CI/CD pipelines\n- External tool connectivity\n\n**Related Concepts**: API, Extension, Platform\n\n**See Also**:\n\n- Integration Patterns\n- Integration Examples\n\n---\n\n### Internal Link\n\n**Definition**: A markdown link to another documentation file or section within the platform docs.\n\n**Where Used**:\n\n- Cross-referencing documentation\n- Navigation between topics\n- Related content discovery\n\n**Related Concepts**: Anchor Link, Cross-Reference, Documentation\n\n**Examples**:\n\n- `[See Configuration](configuration.md)`\n- `[Architecture Overview](../architecture/README.md)`\n\n---\n\n## J\n\n### JWT (JSON Web Token)\n\n**Definition**: Token-based authentication mechanism using RS256 signatures.\n\n**Where Used**:\n\n- User authentication\n- API authorization\n- Session management\n\n**Related Concepts**: Auth, Security, Token\n\n**See Also**: JWT Auth Implementation\n\n---\n\n## K\n\n### Nickel (Nickel Configuration Language)\n\n**Definition**: Declarative configuration language with type safety and lazy evaluation for infrastructure definitions.\n\n**Where Used**:\n\n- Infrastructure schemas\n- Workflow definitions\n- Configuration validation\n\n**Related Concepts**: Schema, Configuration, Validation\n\n**Version**: 1.15.0+\n\n**Location**: `provisioning/schemas/*.ncl`\n\n**See Also**: Nickel Quick Reference\n\n---\n\n### KMS (Key Management Service)\n\n**Definition**: Encryption key management system supporting multiple backends (RustyVault, Age, AWS, Vault).\n\n**Where Used**:\n\n- Configuration encryption\n- Secret management\n- Data protection\n\n**Related Concepts**: Security, Encryption, Secrets\n\n**See Also**: RustyVault KMS Guide\n\n---\n\n### Kubernetes\n\n**Definition**: Container orchestration platform available as a taskserv.\n\n**Where Used**:\n\n- Container deployments\n- Cluster management\n- Production workloads\n\n**Related Concepts**: Taskserv, Cluster, Container\n\n**Commands**:\n\n```\nprovisioning taskserv create kubernetes\nprovisioning test 
quick kubernetes\n```\n\n---\n\n## L\n\n### Layer\n\n**Definition**: A level in the configuration hierarchy (Core → Workspace → Infrastructure).\n\n**Where Used**:\n\n- Configuration inheritance\n- Customization patterns\n- Settings override\n\n**Related Concepts**: Config, Workspace, Infrastructure\n\n**See Also**: [Configuration Guide](../infrastructure/configuration-guide.md)\n\n---\n\n## M\n\n### MCP (Model Context Protocol)\n\n**Definition**: AI-powered server providing intelligent configuration assistance.\n\n**Where Used**:\n\n- Configuration validation\n- Troubleshooting guidance\n- Documentation search\n\n**Related Concepts**: Platform Service, AI, Guidance\n\n**Location**: `provisioning/platform/mcp-server/`\n\n**See Also**: Platform Services\n\n---\n\n### MFA (Multi-Factor Authentication)\n\n**Definition**: Additional authentication layer using TOTP or WebAuthn/FIDO2.\n\n**Where Used**:\n\n- Enhanced security\n- Compliance requirements\n- Production access\n\n**Related Concepts**: Auth, Security, TOTP, WebAuthn\n\n**Commands**:\n\n```\nprovisioning mfa totp enroll\nprovisioning mfa webauthn enroll\nprovisioning mfa verify \n```\n\n**See Also**: MFA Implementation Summary\n\n---\n\n### Migration\n\n**Definition**: Process of updating existing infrastructure or moving between system versions.\n\n**Where Used**:\n\n- System upgrades\n- Configuration changes\n- Infrastructure evolution\n\n**Related Concepts**: Update, Upgrade, Version\n\n**See Also**: Migration Guide\n\n---\n\n### Module\n\n**Definition**: A reusable component (provider, taskserv, cluster) loaded into a workspace.\n\n**Where Used**:\n\n- Extension management\n- Workspace customization\n- Component distribution\n\n**Related Concepts**: Extension, Workspace, Package\n\n**Commands**:\n\n```\nprovisioning module discover provider\nprovisioning module load provider \nprovisioning module list taskserv\n```\n\n**See Also**: [Module System](../development/extension-development.md)\n\n---\n\n## N\n\n### Nushell\n\n**Definition**: Primary shell and scripting language (v0.107.1) used throughout the platform.\n\n**Where Used**:\n\n- CLI implementation\n- Automation scripts\n- Business logic\n\n**Related Concepts**: CLI, Script, Automation\n\n**Version**: 0.107.1\n\n**See Also**: [Nushell Guidelines](../development/README.md)\n\n---\n\n## O\n\n### OCI (Open Container Initiative)\n\n**Definition**: Standard format for packaging and distributing extensions.\n\n**Where Used**:\n\n- Extension distribution\n- Package registry\n- Version management\n\n**Related Concepts**: Registry, Package, Distribution\n\n**See Also**: OCI Registry Guide\n\n---\n\n### Operation\n\n**Definition**: A single infrastructure action (create server, install taskserv, etc.).\n\n**Where Used**:\n\n- Workflow steps\n- Batch processing\n- Orchestrator tasks\n\n**Related Concepts**: Workflow, Task, Action\n\n---\n\n### Orchestrator\n\n**Definition**: Hybrid Rust/Nushell service coordinating complex infrastructure operations.\n\n**Where Used**:\n\n- Workflow execution\n- Task coordination\n- State management\n\n**Related Concepts**: Hybrid Architecture, Workflow, Platform Service\n\n**Location**: `provisioning/platform/orchestrator/`\n\n**Commands**:\n\n```\ncd provisioning/platform/orchestrator\n./scripts/start-orchestrator.nu --background\n```\n\n**See Also**: [Orchestrator Architecture](../architecture/orchestrator-integration-model.md)\n\n---\n\n## P\n\n### PAP (Project Architecture Principles)\n\n**Definition**: Core architectural rules and patterns that 
must be followed.\n\n**Where Used**:\n\n- Code review\n- Architecture decisions\n- Design validation\n\n**Related Concepts**: Architecture, ADR, Best Practices\n\n**See Also**: Architecture Overview\n\n---\n\n### Platform Service\n\n**Definition**: A core service providing platform-level functionality (Orchestrator, Control Center, MCP, API Gateway).\n\n**Where Used**:\n\n- System infrastructure\n- Core capabilities\n- Service integration\n\n**Related Concepts**: Service, Architecture, Infrastructure\n\n**Location**: `provisioning/platform/{service}/`\n\n---\n\n### Plugin\n\n**Definition**: Native Nushell plugin providing performance-optimized operations.\n\n**Where Used**:\n\n- Auth operations (10-50x faster)\n- KMS encryption\n- Orchestrator queries\n\n**Related Concepts**: Nushell, Performance, Native\n\n**Commands**:\n\n```\nprovisioning plugin list\nprovisioning plugin install\n```\n\n**See Also**: Nushell Plugins Guide\n\n---\n\n### Provider\n\n**Definition**: Cloud platform integration (AWS, UpCloud, local) handling infrastructure provisioning.\n\n**Where Used**:\n\n- Server creation\n- Resource management\n- Cloud operations\n\n**Related Concepts**: Extension, Infrastructure, Cloud\n\n**Location**: `provisioning/extensions/providers/{name}/`\n\n**Examples**: aws, upcloud, local\n\n**Commands**:\n\n```\nprovisioning module discover provider\nprovisioning providers list\n```\n\n**See Also**: Quick Provider Guide\n\n---\n\n## Q\n\n### Quick Reference\n\n**Definition**: Condensed command and configuration reference for rapid lookup.\n\n**Where Used**:\n\n- Daily operations\n- Quick reminders\n- Command syntax\n\n**Related Concepts**: Guide, Documentation, Cheatsheet\n\n**Commands**:\n\n```\nprovisioning sc # Fastest\nprovisioning guide quickstart\n```\n\n**See Also**: Quickstart Cheatsheet\n\n---\n\n## R\n\n### RBAC (Role-Based Access Control)\n\n**Definition**: Permission system with 5 roles (admin, operator, developer, viewer, auditor).\n\n**Where Used**:\n\n- User permissions\n- Access control\n- Security policies\n\n**Related Concepts**: Authorization, Cedar, Security\n\n**Roles**: Admin, Operator, Developer, Viewer, Auditor\n\n---\n\n### Registry\n\n**Definition**: OCI-compliant repository for storing and distributing extensions.\n\n**Where Used**:\n\n- Extension publishing\n- Version management\n- Package distribution\n\n**Related Concepts**: OCI, Package, Distribution\n\n**See Also**: OCI Registry Guide\n\n---\n\n### REST API\n\n**Definition**: HTTP endpoints exposing platform operations to external systems.\n\n**Where Used**:\n\n- External integration\n- Web UI backend\n- Programmatic access\n\n**Related Concepts**: API, Integration, HTTP\n\n**Endpoint**: `http://localhost:9090`\n\n**See Also**: REST API Documentation\n\n---\n\n### Rollback\n\n**Definition**: Reverting a failed workflow or operation to previous stable state.\n\n**Where Used**:\n\n- Failure recovery\n- Deployment safety\n- State restoration\n\n**Related Concepts**: Workflow, Checkpoint, Recovery\n\n**Commands**:\n\n```\nprovisioning batch rollback \n```\n\n---\n\n### RustyVault\n\n**Definition**: Rust-based secrets management backend for KMS.\n\n**Where Used**:\n\n- Key storage\n- Secret encryption\n- Configuration protection\n\n**Related Concepts**: KMS, Security, Encryption\n\n**See Also**: RustyVault KMS Guide\n\n---\n\n## S\n\n### Schema\n\n**Definition**: Nickel type definition specifying structure and validation rules.\n\n**Where Used**:\n\n- Configuration validation\n- Type safety\n- 
Documentation\n\n**Related Concepts**: Nickel, Validation, Type\n\n**Example**:\n\n```\nlet ServerConfig = {\n hostname | string,\n cores | number,\n memory | number,\n} in\nServerConfig\n```\n\n**See Also**: Nickel Development\n\n---\n\n### Secrets Management\n\n**Definition**: System for secure storage and retrieval of sensitive data.\n\n**Where Used**:\n\n- Password storage\n- API keys\n- Certificates\n\n**Related Concepts**: KMS, Security, Encryption\n\n**See Also**: Dynamic Secrets Implementation\n\n---\n\n### Security System\n\n**Definition**: Comprehensive enterprise-grade security with 12 components (Auth, Cedar, MFA, KMS, Secrets, Compliance, etc.).\n\n**Where Used**:\n\n- User authentication\n- Access control\n- Data protection\n\n**Related Concepts**: Auth, Authorization, MFA, KMS, Audit\n\n**See Also**: Security System Implementation\n\n---\n\n### Server\n\n**Definition**: Virtual machine or physical host managed by the platform.\n\n**Where Used**:\n\n- Infrastructure provisioning\n- Compute resources\n- Deployment targets\n\n**Related Concepts**: Infrastructure, Provider, Taskserv\n\n**Commands**:\n\n```\nprovisioning server create\nprovisioning server list\nprovisioning server ssh \n```\n\n**See Also**: Infrastructure Management\n\n---\n\n### Service\n\n**Definition**: A running application or daemon (interchangeable with Taskserv in many contexts).\n\n**Where Used**:\n\n- Service management\n- Application deployment\n- System administration\n\n**Related Concepts**: Taskserv, Daemon, Application\n\n**See Also**: Service Management Guide\n\n---\n\n### Shortcut\n\n**Definition**: Abbreviated command alias for faster CLI operations.\n\n**Where Used**:\n\n- Daily operations\n- Quick commands\n- Productivity enhancement\n\n**Related Concepts**: CLI, Command, Alias\n\n**Examples**:\n\n- `provisioning s create` → `provisioning server create`\n- `provisioning ws list` → `provisioning workspace list`\n- `provisioning sc` → Quick reference\n\n**See Also**: [CLI Reference](../infrastructure/cli-reference.md)\n\n---\n\n### SOPS (Secrets OPerationS)\n\n**Definition**: Encryption tool for managing secrets in version control.\n\n**Where Used**:\n\n- Configuration encryption\n- Secret management\n- Secure storage\n\n**Related Concepts**: Encryption, Security, Age\n\n**Version**: 3.10.2\n\n**Commands**:\n\n```\nprovisioning sops edit \n```\n\n---\n\n### SSH (Secure Shell)\n\n**Definition**: Encrypted remote access protocol with temporal key support.\n\n**Where Used**:\n\n- Server administration\n- Remote commands\n- Secure file transfer\n\n**Related Concepts**: Security, Server, Remote Access\n\n**Commands**:\n\n```\nprovisioning server ssh \nprovisioning ssh connect \n```\n\n**See Also**: SSH Temporal Keys User Guide\n\n---\n\n### State Management\n\n**Definition**: Tracking and persisting workflow execution state.\n\n**Where Used**:\n\n- Workflow recovery\n- Progress tracking\n- Failure handling\n\n**Related Concepts**: Workflow, Checkpoint, Orchestrator\n\n---\n\n## T\n\n### Task\n\n**Definition**: A unit of work submitted to the orchestrator for execution.\n\n**Where Used**:\n\n- Workflow execution\n- Job processing\n- Operation tracking\n\n**Related Concepts**: Operation, Workflow, Orchestrator\n\n---\n\n### Taskserv\n\n**Definition**: An installable infrastructure service (Kubernetes, PostgreSQL, Redis, etc.).\n\n**Where Used**:\n\n- Service installation\n- Application deployment\n- Infrastructure components\n\n**Related Concepts**: Service, Extension, Package\n\n**Location**: 
`provisioning/extensions/taskservs/{category}/{name}/`\n\n**Commands**:\n\n```\nprovisioning taskserv create \nprovisioning taskserv list\nprovisioning test quick \n```\n\n**See Also**: Taskserv Developer Guide\n\n---\n\n### Template\n\n**Definition**: Parameterized configuration file supporting variable substitution.\n\n**Where Used**:\n\n- Configuration generation\n- Infrastructure customization\n- Deployment automation\n\n**Related Concepts**: Config, Generation, Customization\n\n**Location**: `provisioning/templates/`\n\n---\n\n### Test Environment\n\n**Definition**: Containerized isolated environment for testing taskservs and clusters.\n\n**Where Used**:\n\n- Development testing\n- CI/CD integration\n- Pre-deployment validation\n\n**Related Concepts**: Container, Testing, Validation\n\n**Commands**:\n\n```\nprovisioning test quick \nprovisioning test env single \nprovisioning test env cluster \n```\n\n**See Also**: [Test Environment Guide](../testing/test-environment-guide.md)\n\n---\n\n### Topology\n\n**Definition**: Multi-node cluster configuration template (Kubernetes HA, etcd cluster, etc.).\n\n**Where Used**:\n\n- Cluster testing\n- Multi-node deployments\n- Production simulation\n\n**Related Concepts**: Test Environment, Cluster, Configuration\n\n**Examples**: kubernetes_3node, etcd_cluster, kubernetes_single\n\n---\n\n### TOTP (Time-based One-Time Password)\n\n**Definition**: MFA method generating time-sensitive codes.\n\n**Where Used**:\n\n- Two-factor authentication\n- MFA enrollment\n- Security enhancement\n\n**Related Concepts**: MFA, Security, Auth\n\n**Commands**:\n\n```\nprovisioning mfa totp enroll\nprovisioning mfa totp verify \n```\n\n---\n\n### Troubleshooting\n\n**Definition**: System problem diagnosis and resolution guidance.\n\n**Where Used**:\n\n- Problem solving\n- Error resolution\n- System debugging\n\n**Related Concepts**: Diagnostics, Guide, Support\n\n**See Also**: Troubleshooting Guide\n\n---\n\n## U\n\n### UI (User Interface)\n\n**Definition**: Visual interface for platform operations (Control Center, Web UI).\n\n**Where Used**:\n\n- Visual management\n- Guided workflows\n- Monitoring dashboards\n\n**Related Concepts**: Control Center, Platform Service, GUI\n\n---\n\n### Update\n\n**Definition**: Process of upgrading infrastructure components to newer versions.\n\n**Where Used**:\n\n- Version management\n- Security patches\n- Feature updates\n\n**Related Concepts**: Version, Migration, Upgrade\n\n**Commands**:\n\n```\nprovisioning version check\nprovisioning version apply\n```\n\n**See Also**: Update Infrastructure Guide\n\n---\n\n## V\n\n### Validation\n\n**Definition**: Verification that configuration or infrastructure meets requirements.\n\n**Where Used**:\n\n- Configuration checks\n- Schema validation\n- Pre-deployment verification\n\n**Related Concepts**: Schema, Nickel, Check\n\n**Commands**:\n\n```\nprovisioning validate config\nprovisioning validate infrastructure\n```\n\n**See Also**: [Config Validation](../provisioning/docs/CONFIG_VALIDATION.md)\n\n---\n\n### Version\n\n**Definition**: Semantic version identifier for components and compatibility.\n\n**Where Used**:\n\n- Component versioning\n- Compatibility checking\n- Update management\n\n**Related Concepts**: Update, Dependency, Compatibility\n\n**Commands**:\n\n```\nprovisioning version\nprovisioning version check\nprovisioning taskserv check-updates\n```\n\n---\n\n## W\n\n### WebAuthn\n\n**Definition**: FIDO2-based passwordless authentication standard.\n\n**Where Used**:\n\n- Hardware key 
authentication\n- Passwordless login\n- Enhanced MFA\n\n**Related Concepts**: MFA, Security, FIDO2\n\n**Commands**:\n\n```\nprovisioning mfa webauthn enroll\nprovisioning mfa webauthn verify\n```\n\n---\n\n### Workflow\n\n**Definition**: A sequence of related operations with dependency management and state tracking.\n\n**Where Used**:\n\n- Complex deployments\n- Multi-step operations\n- Automated processes\n\n**Related Concepts**: Batch Operation, Orchestrator, Task\n\n**Commands**:\n\n```\nprovisioning workflow list\nprovisioning workflow status \nprovisioning workflow monitor \n```\n\n**See Also**: [Batch Workflow System](../guides/from-scratch.md)\n\n---\n\n### Workspace\n\n**Definition**: An isolated environment containing infrastructure definitions and configuration.\n\n**Where Used**:\n\n- Project isolation\n- Environment separation\n- Team workspaces\n\n**Related Concepts**: Infrastructure, Config, Environment\n\n**Location**: `workspace/{name}/`\n\n**Commands**:\n\n```\nprovisioning workspace list\nprovisioning workspace switch \nprovisioning workspace create \n```\n\n**See Also**: Workspace Switching Guide\n\n---\n\n## X-Z\n\n### YAML\n\n**Definition**: Data serialization format used for Kubernetes manifests and configuration.\n\n**Where Used**:\n\n- Kubernetes deployments\n- Configuration files\n- Data interchange\n\n**Related Concepts**: Config, Kubernetes, Data Format\n\n---\n\n## Symbol and Acronym Index\n\n| Symbol/Acronym | Full Term | Category |\n| ---------------- | ----------- | ---------- |\n| ADR | Architecture Decision Record | Architecture |\n| API | Application Programming Interface | Integration |\n| CLI | Command-Line Interface | User Interface |\n| GDPR | General Data Protection Regulation | Compliance |\n| JWT | JSON Web Token | Security |\n| Nickel | Nickel Configuration Language | Configuration |\n| KMS | Key Management Service | Security |\n| MCP | Model Context Protocol | Platform |\n| MFA | Multi-Factor Authentication | Security |\n| OCI | Open Container Initiative | Packaging |\n| PAP | Project Architecture Principles | Architecture |\n| RBAC | Role-Based Access Control | Security |\n| REST | Representational State Transfer | API |\n| SOC2 | Service Organization Control 2 | Compliance |\n| SOPS | Secrets OPerationS | Security |\n| SSH | Secure Shell | Remote Access |\n| TOTP | Time-based One-Time Password | Security |\n| UI | User Interface | User Interface |\n\n---\n\n## Cross-Reference Map\n\n### By Topic Area\n\n**Infrastructure**:\n\n- Infrastructure, Server, Cluster, Provider, Taskserv, Module\n\n**Security**:\n\n- Auth, Authorization, JWT, MFA, TOTP, WebAuthn, Cedar, KMS, Secrets Management, RBAC, Break-Glass\n\n**Configuration**:\n\n- Config, Nickel, Schema, Validation, Environment, Layer, Workspace\n\n**Workflow & Operations**:\n\n- Workflow, Batch Operation, Operation, Task, Orchestrator, Checkpoint, Rollback\n\n**Platform Services**:\n\n- Orchestrator, Control Center, MCP, API Gateway, Platform Service\n\n**Documentation**:\n\n- Glossary, Guide, ADR, Cross-Reference, Internal Link, Anchor Link\n\n**Development**:\n\n- Extension, Plugin, Template, Module, Integration\n\n**Testing**:\n\n- Test Environment, Topology, Validation, Health Check\n\n**Compliance**:\n\n- Compliance, GDPR, Audit, Security System\n\n### By User Journey\n\n**New User**:\n\n1. Glossary (this document)\n2. Guide\n3. Quick Reference\n4. Workspace\n5. Infrastructure\n6. Server\n7. Taskserv\n\n**Developer**:\n\n1. Extension\n2. Provider\n3. Taskserv\n4. Nickel\n5. Schema\n6. 
Template\n7. Plugin\n\n**Operations**:\n\n1. Workflow\n2. Orchestrator\n3. Monitoring\n4. Troubleshooting\n5. Security\n6. Compliance\n\n---\n\n## Terminology Guidelines\n\n### Writing Style\n\n**Consistency**: Use the same term throughout documentation (for example, "Taskserv" not "task service" or "task-serv")\n\n**Capitalization**:\n\n- Proper nouns and acronyms: CAPITALIZE (Nickel, JWT, MFA)\n- Generic terms: lowercase (server, cluster, workflow)\n- Platform-specific terms: Title Case (Taskserv, Workspace, Orchestrator)\n\n**Pluralization**:\n\n- Taskservs (not taskservices)\n- Workspaces (standard plural)\n- Topologies (not topologys)\n\n### Avoiding Confusion\n\n| Don't Say | Say Instead | Reason |\n| ----------- | ------------- | -------- |\n| "Task service" | "Taskserv" | Standard platform term |\n| "Configuration file" | "Config" or "Settings" | Context-dependent |\n| "Worker" | "Agent" or "Task" | Clarify context |\n| "Kubernetes service" | "K8s taskserv" or "K8s Service resource" | Disambiguate |\n\n---\n\n## Contributing to the Glossary\n\n### Adding New Terms\n\n1. Alphabetical placement in appropriate section\n2. Include all standard sections:\n - Definition\n - Where Used\n - Related Concepts\n - Examples (if applicable)\n - Commands (if applicable)\n - See Also (links to docs)\n\n3. Cross-reference in related terms\n4. Update Symbol and Acronym Index if applicable\n5. Update Cross-Reference Map\n\n### Updating Existing Terms\n\n1. Verify changes don't break cross-references\n2. Update "Last Updated" date at top\n3. Increment version if major changes\n4. Review related terms for consistency\n\n---\n\n## Version History\n\n| Version | Date | Changes |\n| --------- | ------ | --------- |\n| 1.0.0 | 2025-10-10 | Initial comprehensive glossary |\n\n---\n\n**Maintained By**: Documentation Team\n**Review Cycle**: Quarterly or when major features are added\n**Feedback**: Please report missing or unclear terms via issues +# Provisioning Platform Glossary + +**Last Updated**: 2025-10-10 +**Version**: 1.0.0 + +This glossary defines key terminology used throughout the Provisioning Platform documentation. Terms are listed alphabetically with definitions, usage +context, and cross-references to related documentation. + +--- + +## A + +### ADR (Architecture Decision Record) + +**Definition**: Documentation of significant architectural decisions, including context, decision, and consequences. + +**Where Used**: + +- Architecture planning and review +- Technical decision-making process +- System design documentation + +**Related Concepts**: Architecture, Design Patterns, Technical Debt + +**Examples**: + +- ADR-001: Project Structure +- ADR-006: CLI Refactoring +- ADR-009: Complete Security System + +**See Also**: Architecture Documentation + +--- + +### Agent + +**Definition**: A specialized component that performs a specific task in the system orchestration (for example, autonomous execution units in the +orchestrator). + +**Where Used**: + +- Task orchestration +- Workflow management +- Parallel execution patterns + +**Related Concepts**: Orchestrator, Workflow, Task + +**See Also**: [Orchestrator Architecture](../architecture/orchestrator-integration-model.md) + +--- + +### Anchor Link + +**Definition**: An internal document link to a specific section within the same or different markdown file using the `#` symbol. 
+ +**Where Used**: + +- Cross-referencing documentation sections +- Table of contents generation +- Navigation within long documents + +**Related Concepts**: Internal Link, Cross-Reference, Documentation + +**Examples**: + +- `[See Installation](#installation)` - Same document +- `[Configuration Guide](config.md#setup)` - Different document + +--- + +### API Gateway + +**Definition**: Platform service that provides unified REST API access to provisioning operations. + +**Where Used**: + +- External system integration +- Web Control Center backend +- MCP server communication + +**Related Concepts**: REST API, Platform Service, Orchestrator + +**Location**: `provisioning/platform/api-gateway/` + +**See Also**: REST API Documentation + +--- + +### Auth (Authentication) + +**Definition**: The process of verifying user identity using JWT tokens, MFA, and secure session management. + +**Where Used**: + +- User login flows +- API access control +- CLI session management + +**Related Concepts**: Authorization, JWT, MFA, Security + +**See Also**: + +- Authentication Layer Guide +- Auth Quick Reference + +--- + +### Authorization + +**Definition**: The process of determining user permissions using Cedar policy language. + +**Where Used**: + +- Access control decisions +- Resource permission checks +- Multi-tenant security + +**Related Concepts**: Auth, Cedar, Policies, RBAC + +**See Also**: Cedar Authorization Implementation + +--- + +## B + +### Batch Operation + +**Definition**: A collection of related infrastructure operations executed as a single workflow unit. + +**Where Used**: + +- Multi-server deployments +- Cluster creation +- Bulk taskserv installation + +**Related Concepts**: Workflow, Operation, Orchestrator + +**Commands**: + +```text +provisioning batch submit workflow.ncl +provisioning batch list +provisioning batch status +``` + +**See Also**: [Batch Workflow System](../guides/from-scratch.md) + +--- + +### Break-Glass + +**Definition**: Emergency access mechanism requiring multi-party approval for critical operations. + +**Where Used**: + +- Emergency system access +- Incident response +- Security override scenarios + +**Related Concepts**: Security, Compliance, Audit + +**Commands**: + +```text +provisioning break-glass request "reason" +provisioning break-glass approve +``` + +**See Also**: Break-Glass Training Guide + +--- + +## C + +### Cedar + +**Definition**: Amazon's policy language used for fine-grained authorization decisions. + +**Where Used**: + +- Authorization policies +- Access control rules +- Resource permissions + +**Related Concepts**: Authorization, Policies, Security + +**See Also**: Cedar Authorization Implementation + +--- + +### Checkpoint + +**Definition**: A saved state of a workflow allowing resume from point of failure. + +**Where Used**: + +- Workflow recovery +- Long-running operations +- Batch processing + +**Related Concepts**: Workflow, State Management, Recovery + +**See Also**: [Batch Workflow System](../guides/from-scratch.md) + +--- + +### CLI (Command-Line Interface) + +**Definition**: The `provisioning` command-line tool providing access to all platform operations. 
+ +**Where Used**: + +- Daily operations +- Script automation +- CI/CD pipelines + +**Related Concepts**: Command, Shortcut, Module + +**Location**: `provisioning/core/cli/provisioning` + +**Examples**: + +```text +provisioning server create +provisioning taskserv install kubernetes +provisioning workspace switch prod +``` + +**See Also**: [CLI Reference](../infrastructure/cli-reference.md) + +--- + +### Cluster + +**Definition**: A complete, pre-configured deployment of multiple servers and taskservs working together. + +**Where Used**: + +- Kubernetes deployments +- Database clusters +- Complete infrastructure stacks + +**Related Concepts**: Infrastructure, Server, Taskserv + +**Location**: `provisioning/extensions/clusters/{name}/` + +**Commands**: + +```text +provisioning cluster create +provisioning cluster list +provisioning cluster delete +``` + +**See Also**: Infrastructure Management + +--- + +### Compliance + +**Definition**: System capabilities ensuring adherence to regulatory requirements (GDPR, SOC2, ISO 27001). + +**Where Used**: + +- Audit logging +- Data retention policies +- Incident response + +**Related Concepts**: Audit, Security, GDPR + +**See Also**: Compliance Implementation Summary + +--- + +### Config (Configuration) + +**Definition**: System settings stored in TOML files with hierarchical loading and variable interpolation. + +**Where Used**: + +- System initialization +- User preferences +- Environment-specific settings + +**Related Concepts**: Settings, Environment, Workspace + +**Files**: + +- `provisioning/config/config.defaults.toml` - System defaults +- `workspace/config/local-overrides.toml` - User settings + +**See Also**: [Configuration Guide](../infrastructure/configuration-guide.md) + +--- + +### Control Center + +**Definition**: Web-based UI for managing provisioning operations built with Ratatui/Crossterm. + +**Where Used**: + +- Visual infrastructure management +- Real-time monitoring +- Guided workflows + +**Related Concepts**: UI, Platform Service, Orchestrator + +**Location**: `provisioning/platform/control-center/` + +**See Also**: Platform Services + +--- + +### CoreDNS + +**Definition**: DNS server taskserv providing service discovery and DNS management. + +**Where Used**: + +- Kubernetes DNS +- Service discovery +- Internal DNS resolution + +**Related Concepts**: Taskserv, Kubernetes, Networking + +**See Also**: + +- CoreDNS Guide +- CoreDNS Quick Reference + +--- + +### Cross-Reference + +**Definition**: Links between related documentation sections or concepts. + +**Where Used**: + +- Documentation navigation +- Related topic discovery +- Learning path guidance + +**Related Concepts**: Documentation, Navigation, See Also + +**Examples**: "See Also" sections at the end of documentation pages + +--- + +## D + +### Dependency + +**Definition**: A requirement that must be satisfied before installing or running a component. + +**Where Used**: + +- Taskserv installation order +- Version compatibility checks +- Cluster deployment sequencing + +**Related Concepts**: Version, Taskserv, Workflow + +**Schema**: `provisioning/schemas/dependencies.ncl` + +**See Also**: Nickel Dependency Patterns + +--- + +### Diagnostics + +**Definition**: System health checking and troubleshooting assistance.
+ +**Where Used**: + +- System status verification +- Problem identification +- Guided troubleshooting + +**Related Concepts**: Health Check, Monitoring, Troubleshooting + +**Commands**: + +```text +provisioning status +provisioning diagnostics run +``` + +--- + +### Dynamic Secrets + +**Definition**: Temporary credentials generated on-demand with automatic expiration. + +**Where Used**: + +- AWS STS tokens +- SSH temporary keys +- Database credentials + +**Related Concepts**: Security, KMS, Secrets Management + +**See Also**: + +- Dynamic Secrets Implementation +- Dynamic Secrets Quick Reference + +--- + +## E + +### Environment + +**Definition**: A deployment context (dev, test, prod) with specific configuration overrides. + +**Where Used**: + +- Configuration loading +- Resource isolation +- Deployment targeting + +**Related Concepts**: Config, Workspace, Infrastructure + +**Config Files**: `config.{dev,test,prod}.toml` + +**Usage**: + +```text +PROVISIONING_ENV=prod provisioning server list +``` + +--- + +### Extension + +**Definition**: A pluggable component adding functionality (provider, taskserv, cluster, or workflow). + +**Where Used**: + +- Custom cloud providers +- Third-party taskservs +- Custom deployment patterns + +**Related Concepts**: Provider, Taskserv, Cluster, Workflow + +**Location**: `provisioning/extensions/{type}/{name}/` + +**See Also**: Extension Development + +--- + +## F + +### Feature + +**Definition**: A major system capability providing key platform functionality. + +**Where Used**: + +- Architecture documentation +- Feature planning +- System capabilities + +**Related Concepts**: ADR, Architecture, System + +**Examples**: + +- Batch Workflow System +- Orchestrator Architecture +- CLI Architecture +- Configuration System + +**See Also**: [Architecture Overview](../architecture/system-overview.md) + +--- + +## G + +### GDPR (General Data Protection Regulation) + +**Definition**: EU data protection regulation compliance features in the platform. + +**Where Used**: + +- Data export requests +- Right to erasure +- Audit compliance + +**Related Concepts**: Compliance, Audit, Security + +**Commands**: + +```text +provisioning compliance gdpr export +provisioning compliance gdpr delete +``` + +**See Also**: Compliance Implementation + +--- + +### Glossary + +**Definition**: This document - a comprehensive terminology reference for the platform. + +**Where Used**: + +- Learning the platform +- Understanding documentation +- Resolving terminology questions + +**Related Concepts**: Documentation, Reference, Cross-Reference + +--- + +### Guide + +**Definition**: Step-by-step walkthrough documentation for common workflows. + +**Where Used**: + +- Onboarding new users +- Learning workflows +- Reference implementation + +**Related Concepts**: Documentation, Workflow, Tutorial + +**Commands**: + +```text +provisioning guide from-scratch +provisioning guide update +provisioning guide customize +``` + +**See Also**: [Guides](../guides/README.md) + +--- + +## H + +### Health Check + +**Definition**: Automated verification that a component is running correctly. + +**Where Used**: + +- Taskserv validation +- System monitoring +- Dependency verification + +**Related Concepts**: Diagnostics, Monitoring, Status + +**Example**: + +```text +health_check = { + endpoint = "http://localhost:6443/healthz" + timeout = 30 + interval = 10 +} +``` + +--- + +### Hybrid Architecture + +**Definition**: System design combining Rust orchestrator with Nushell business logic. 
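+ +**Example**: + +A minimal, illustrative sketch of the split (the `install-taskserv` function below is hypothetical, not part of the platform): the Rust side owns queuing, checkpoints, and workflow state, and delegates each provisioning step to a Nushell function. + +```text +# Hypothetical Nushell business-logic function invoked by the Rust orchestrator +def install-taskserv [name: string, server: string] { +  print $"Installing ($name) on ($server)" +  # templating, validation, and remote execution happen here in Nushell; +  # retries, checkpoints, and state tracking stay on the Rust side +} +```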
+ +**Where Used**: + +- Core platform architecture +- Performance optimization +- Call stack management + +**Related Concepts**: Orchestrator, Architecture, Design + +**See Also**: + +- [Orchestrator Architecture](../architecture/orchestrator-integration-model.md) +- [ADR-004: Hybrid Architecture](../architecture/adr/adr-004-hybrid-architecture.md) + +--- + +## I + +### Infrastructure + +**Definition**: A named collection of servers, configurations, and deployments managed as a unit. + +**Where Used**: + +- Environment isolation +- Resource organization +- Deployment targeting + +**Related Concepts**: Workspace, Server, Environment + +**Location**: `workspace/infra/{name}/` + +**Commands**: + +```text +provisioning infra list +provisioning generate infra --new +``` + +**See Also**: Infrastructure Management + +--- + +### Integration + +**Definition**: Connection between platform components or external systems. + +**Where Used**: + +- API integration +- CI/CD pipelines +- External tool connectivity + +**Related Concepts**: API, Extension, Platform + +**See Also**: + +- Integration Patterns +- Integration Examples + +--- + +### Internal Link + +**Definition**: A markdown link to another documentation file or section within the platform docs. + +**Where Used**: + +- Cross-referencing documentation +- Navigation between topics +- Related content discovery + +**Related Concepts**: Anchor Link, Cross-Reference, Documentation + +**Examples**: + +- `[See Configuration](configuration.md)` +- `[Architecture Overview](../architecture/README.md)` + +--- + +## J + +### JWT (JSON Web Token) + +**Definition**: Token-based authentication mechanism using RS256 signatures. + +**Where Used**: + +- User authentication +- API authorization +- Session management + +**Related Concepts**: Auth, Security, Token + +**See Also**: JWT Auth Implementation + +--- + +## K + +### KMS (Key Management Service) + +**Definition**: Encryption key management system supporting multiple backends (RustyVault, Age, AWS, Vault). + +**Where Used**: + +- Configuration encryption +- Secret management +- Data protection + +**Related Concepts**: Security, Encryption, Secrets + +**See Also**: RustyVault KMS Guide + +--- + +### Kubernetes + +**Definition**: Container orchestration platform available as a taskserv. + +**Where Used**: + +- Container deployments +- Cluster management +- Production workloads + +**Related Concepts**: Taskserv, Cluster, Container + +**Commands**: + +```text +provisioning taskserv create kubernetes +provisioning test quick kubernetes +``` + +--- + +## L + +### Layer + +**Definition**: A level in the configuration hierarchy (Core → Workspace → Infrastructure). + +**Where Used**: + +- Configuration inheritance +- Customization patterns +- Settings override + +**Related Concepts**: Config, Workspace, Infrastructure + +**See Also**: [Configuration Guide](../infrastructure/configuration-guide.md) + +--- + +## M + +### MCP (Model Context Protocol) + +**Definition**: AI-powered server providing intelligent configuration assistance.
+ +**Where Used**: + +- Configuration validation +- Troubleshooting guidance +- Documentation search + +**Related Concepts**: Platform Service, AI, Guidance + +**Location**: `provisioning/platform/mcp-server/` + +**See Also**: Platform Services + +--- + +### MFA (Multi-Factor Authentication) + +**Definition**: Additional authentication layer using TOTP or WebAuthn/FIDO2. + +**Where Used**: + +- Enhanced security +- Compliance requirements +- Production access + +**Related Concepts**: Auth, Security, TOTP, WebAuthn + +**Commands**: + +```text +provisioning mfa totp enroll +provisioning mfa webauthn enroll +provisioning mfa verify +``` + +**See Also**: MFA Implementation Summary + +--- + +### Migration + +**Definition**: Process of updating existing infrastructure or moving between system versions. + +**Where Used**: + +- System upgrades +- Configuration changes +- Infrastructure evolution + +**Related Concepts**: Update, Upgrade, Version + +**See Also**: Migration Guide + +--- + +### Module + +**Definition**: A reusable component (provider, taskserv, cluster) loaded into a workspace. + +**Where Used**: + +- Extension management +- Workspace customization +- Component distribution + +**Related Concepts**: Extension, Workspace, Package + +**Commands**: + +```text +provisioning module discover provider +provisioning module load provider +provisioning module list taskserv +``` + +**See Also**: [Module System](../development/extension-development.md) + +--- + +## N + +### Nickel (Nickel Configuration Language) + +**Definition**: Declarative configuration language with type safety and lazy evaluation for infrastructure definitions. + +**Where Used**: + +- Infrastructure schemas +- Workflow definitions +- Configuration validation + +**Related Concepts**: Schema, Configuration, Validation + +**Version**: 1.15.0+ + +**Location**: `provisioning/schemas/*.ncl` + +**See Also**: Nickel Quick Reference + +--- + +### Nushell + +**Definition**: Primary shell and scripting language (v0.107.1) used throughout the platform. + +**Where Used**: + +- CLI implementation +- Automation scripts +- Business logic + +**Related Concepts**: CLI, Script, Automation + +**Version**: 0.107.1 + +**See Also**: [Nushell Guidelines](../development/README.md) + +--- + +## O + +### OCI (Open Container Initiative) + +**Definition**: Standard format for packaging and distributing extensions. + +**Where Used**: + +- Extension distribution +- Package registry +- Version management + +**Related Concepts**: Registry, Package, Distribution + +**See Also**: OCI Registry Guide + +--- + +### Operation + +**Definition**: A single infrastructure action (create server, install taskserv, etc.). + +**Where Used**: + +- Workflow steps +- Batch processing +- Orchestrator tasks + +**Related Concepts**: Workflow, Task, Action + +--- + +### Orchestrator + +**Definition**: Hybrid Rust/Nushell service coordinating complex infrastructure operations. + +**Where Used**: + +- Workflow execution +- Task coordination +- State management + +**Related Concepts**: Hybrid Architecture, Workflow, Platform Service + +**Location**: `provisioning/platform/orchestrator/` + +**Commands**: + +```text +cd provisioning/platform/orchestrator +./scripts/start-orchestrator.nu --background +``` + +**See Also**: [Orchestrator Architecture](../architecture/orchestrator-integration-model.md) + +--- + +## P + +### PAP (Project Architecture Principles) + +**Definition**: Core architectural rules and patterns that must be followed. + +**Where Used**: + +- Code review +- Architecture decisions +- Design validation + +**Related Concepts**: Architecture, ADR, Best Practices + +**See Also**: Architecture Overview + +--- + +### Platform Service + +**Definition**: A core service providing platform-level functionality (Orchestrator, Control Center, MCP, API Gateway).
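+ +**Example**: + +For orientation, the four services named above map to these directories (each path is taken from the corresponding glossary entry): + +```text +provisioning/platform/orchestrator/    # workflow coordination and state +provisioning/platform/control-center/  # management UI +provisioning/platform/mcp-server/      # AI-assisted configuration +provisioning/platform/api-gateway/     # unified REST access +```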
+ +**Where Used**: + +- System infrastructure +- Core capabilities +- Service integration + +**Related Concepts**: Service, Architecture, Infrastructure + +**Location**: `provisioning/platform/{service}/` + +--- + +### Plugin + +**Definition**: Native Nushell plugin providing performance-optimized operations. + +**Where Used**: + +- Auth operations (10-50x faster) +- KMS encryption +- Orchestrator queries + +**Related Concepts**: Nushell, Performance, Native + +**Commands**: + +```text +provisioning plugin list +provisioning plugin install +``` + +**See Also**: Nushell Plugins Guide + +--- + +### Provider + +**Definition**: Cloud platform integration (AWS, UpCloud, local) handling infrastructure provisioning. + +**Where Used**: + +- Server creation +- Resource management +- Cloud operations + +**Related Concepts**: Extension, Infrastructure, Cloud + +**Location**: `provisioning/extensions/providers/{name}/` + +**Examples**: aws, upcloud, local + +**Commands**: + +```text +provisioning module discover provider +provisioning providers list +``` + +**See Also**: Quick Provider Guide + +--- + +## Q + +### Quick Reference + +**Definition**: Condensed command and configuration reference for rapid lookup. + +**Where Used**: + +- Daily operations +- Quick reminders +- Command syntax + +**Related Concepts**: Guide, Documentation, Cheatsheet + +**Commands**: + +```text +provisioning sc # Fastest +provisioning guide quickstart +``` + +**See Also**: Quickstart Cheatsheet + +--- + +## R + +### RBAC (Role-Based Access Control) + +**Definition**: Permission system with 5 roles (admin, operator, developer, viewer, auditor). + +**Where Used**: + +- User permissions +- Access control +- Security policies + +**Related Concepts**: Authorization, Cedar, Security + +**Roles**: Admin, Operator, Developer, Viewer, Auditor + +--- + +### Registry + +**Definition**: OCI-compliant repository for storing and distributing extensions. + +**Where Used**: + +- Extension publishing +- Version management +- Package distribution + +**Related Concepts**: OCI, Package, Distribution + +**See Also**: OCI Registry Guide + +--- + +### REST API + +**Definition**: HTTP endpoints exposing platform operations to external systems. + +**Where Used**: + +- External integration +- Web UI backend +- Programmatic access + +**Related Concepts**: API, Integration, HTTP + +**Endpoint**: `http://localhost:9090` + +**See Also**: REST API Documentation + +--- + +### Rollback + +**Definition**: Reverting a failed workflow or operation to a previous stable state. + +**Where Used**: + +- Failure recovery +- Deployment safety +- State restoration + +**Related Concepts**: Workflow, Checkpoint, Recovery + +**Commands**: + +```text +provisioning batch rollback +``` + +--- + +### RustyVault + +**Definition**: Rust-based secrets management backend for KMS. + +**Where Used**: + +- Key storage +- Secret encryption +- Configuration protection + +**Related Concepts**: KMS, Security, Encryption + +**See Also**: RustyVault KMS Guide + +--- + +## S + +### Schema + +**Definition**: Nickel type definition specifying structure and validation rules. + +**Where Used**: + +- Configuration validation +- Type safety +- Documentation + +**Related Concepts**: Nickel, Validation, Type + +**Example**: + +```text +let ServerConfig = { + hostname | String, + cores | Number, + memory | Number, +} in +ServerConfig +``` + +**See Also**: Nickel Development + +--- + +### Secrets Management + +**Definition**: System for secure storage and retrieval of sensitive data.
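+ +**Example**: + +A minimal sketch using the SOPS integration described under the SOPS entry (the file path here is hypothetical): + +```text +# Edit an encrypted settings file; it is decrypted only for the editing session +provisioning sops edit workspace/config/secrets.yaml +```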
+ +**Where Used**: + +- Password storage +- API keys +- Certificates + +**Related Concepts**: KMS, Security, Encryption + +**See Also**: Dynamic Secrets Implementation + +--- + +### Security System + +**Definition**: Comprehensive enterprise-grade security with 12 components (Auth, Cedar, MFA, KMS, Secrets, Compliance, etc.). + +**Where Used**: + +- User authentication +- Access control +- Data protection + +**Related Concepts**: Auth, Authorization, MFA, KMS, Audit + +**See Also**: Security System Implementation + +--- + +### Server + +**Definition**: Virtual machine or physical host managed by the platform. + +**Where Used**: + +- Infrastructure provisioning +- Compute resources +- Deployment targets + +**Related Concepts**: Infrastructure, Provider, Taskserv + +**Commands**: + +```text +provisioning server create +provisioning server list +provisioning server ssh +``` + +**See Also**: Infrastructure Management + +--- + +### Service + +**Definition**: A running application or daemon (interchangeable with Taskserv in many contexts). + +**Where Used**: + +- Service management +- Application deployment +- System administration + +**Related Concepts**: Taskserv, Daemon, Application + +**See Also**: Service Management Guide + +--- + +### Shortcut + +**Definition**: Abbreviated command alias for faster CLI operations. + +**Where Used**: + +- Daily operations +- Quick commands +- Productivity enhancement + +**Related Concepts**: CLI, Command, Alias + +**Examples**: + +- `provisioning s create` → `provisioning server create` +- `provisioning ws list` → `provisioning workspace list` +- `provisioning sc` → Quick reference + +**See Also**: [CLI Reference](../infrastructure/cli-reference.md) + +--- + +### SOPS (Secrets OPerationS) + +**Definition**: Encryption tool for managing secrets in version control. + +**Where Used**: + +- Configuration encryption +- Secret management +- Secure storage + +**Related Concepts**: Encryption, Security, Age + +**Version**: 3.10.2 + +**Commands**: + +```text +provisioning sops edit +``` + +--- + +### SSH (Secure Shell) + +**Definition**: Encrypted remote access protocol with temporal key support. + +**Where Used**: + +- Server administration +- Remote commands +- Secure file transfer + +**Related Concepts**: Security, Server, Remote Access + +**Commands**: + +```text +provisioning server ssh +provisioning ssh connect +``` + +**See Also**: SSH Temporal Keys User Guide + +--- + +### State Management + +**Definition**: Tracking and persisting workflow execution state. + +**Where Used**: + +- Workflow recovery +- Progress tracking +- Failure handling + +**Related Concepts**: Workflow, Checkpoint, Orchestrator + +--- + +## T + +### Task + +**Definition**: A unit of work submitted to the orchestrator for execution. + +**Where Used**: + +- Workflow execution +- Job processing +- Operation tracking + +**Related Concepts**: Operation, Workflow, Orchestrator + +--- + +### Taskserv + +**Definition**: An installable infrastructure service (Kubernetes, PostgreSQL, Redis, etc.). + +**Where Used**: + +- Service installation +- Application deployment +- Infrastructure components + +**Related Concepts**: Service, Extension, Package + +**Location**: `provisioning/extensions/taskservs/{category}/{name}/` + +**Commands**: + +```text +provisioning taskserv create +provisioning taskserv list +provisioning test quick +``` + +**See Also**: Taskserv Developer Guide + +--- + +### Template + +**Definition**: Parameterized configuration file supporting variable substitution. 
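+ +**Example**: + +An illustrative template fragment; the `{{...}}` placeholder syntax and variable names are assumptions for illustration, not confirmed platform syntax: + +```text +# Hypothetical server template: placeholders are substituted at generation time +hostname = "{{infra.name}}-web-01" +provider = "{{provider.name}}" +```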
+
+**Where Used**:
+
+- Configuration generation
+- Infrastructure customization
+- Deployment automation
+
+**Related Concepts**: Config, Generation, Customization
+
+**Location**: `provisioning/templates/`
+
+---
+
+### Test Environment
+
+**Definition**: Containerized, isolated environment for testing taskservs and clusters.
+
+**Where Used**:
+
+- Development testing
+- CI/CD integration
+- Pre-deployment validation
+
+**Related Concepts**: Container, Testing, Validation
+
+**Commands**:
+
+```text
+provisioning test quick
+provisioning test env single
+provisioning test env cluster
+```
+
+**See Also**: [Test Environment Guide](../testing/test-environment-guide.md)
+
+---
+
+### Topology
+
+**Definition**: Multi-node cluster configuration template (Kubernetes HA, etcd cluster, etc.).
+
+**Where Used**:
+
+- Cluster testing
+- Multi-node deployments
+- Production simulation
+
+**Related Concepts**: Test Environment, Cluster, Configuration
+
+**Examples**: kubernetes_3node, etcd_cluster, kubernetes_single
+
+---
+
+### TOTP (Time-based One-Time Password)
+
+**Definition**: MFA method generating time-sensitive codes.
+
+**Where Used**:
+
+- Two-factor authentication
+- MFA enrollment
+- Security enhancement
+
+**Related Concepts**: MFA, Security, Auth
+
+**Commands**:
+
+```text
+provisioning mfa totp enroll
+provisioning mfa totp verify
+```
+
+---
+
+### Troubleshooting
+
+**Definition**: System problem diagnosis and resolution guidance.
+
+**Where Used**:
+
+- Problem solving
+- Error resolution
+- System debugging
+
+**Related Concepts**: Diagnostics, Guide, Support
+
+**See Also**: Troubleshooting Guide
+
+---
+
+## U
+
+### UI (User Interface)
+
+**Definition**: Visual interface for platform operations (Control Center, Web UI).
+
+**Where Used**:
+
+- Visual management
+- Guided workflows
+- Monitoring dashboards
+
+**Related Concepts**: Control Center, Platform Service, GUI
+
+---
+
+### Update
+
+**Definition**: Process of upgrading infrastructure components to newer versions.
+
+**Where Used**:
+
+- Version management
+- Security patches
+- Feature updates
+
+**Related Concepts**: Version, Migration, Upgrade
+
+**Commands**:
+
+```text
+provisioning version check
+provisioning version apply
+```
+
+**See Also**: Update Infrastructure Guide
+
+---
+
+## V
+
+### Validation
+
+**Definition**: Verification that configuration or infrastructure meets requirements.
+
+**Where Used**:
+
+- Configuration checks
+- Schema validation
+- Pre-deployment verification
+
+**Related Concepts**: Schema, Nickel, Check
+
+**Commands**:
+
+```text
+provisioning validate config
+provisioning validate infrastructure
+```
+
+**See Also**: [Config Validation](../configuration/config-validation.md)
+
+---
+
+### Version
+
+**Definition**: Semantic version identifier for components and compatibility.
+
+**Where Used**:
+
+- Component versioning
+- Compatibility checking
+- Update management
+
+**Related Concepts**: Update, Dependency, Compatibility
+
+**Commands**:
+
+```text
+provisioning version
+provisioning version check
+provisioning taskserv check-updates
+```
+
+---
+
+## W
+
+### WebAuthn
+
+**Definition**: FIDO2-based passwordless authentication standard.
+
+**Where Used**:
+
+- Hardware key authentication
+- Passwordless login
+- Enhanced MFA
+
+**Related Concepts**: MFA, Security, FIDO2
+
+**Commands**:
+
+```text
+provisioning mfa webauthn enroll
+provisioning mfa webauthn verify
+```
+
+---
+
+### Workflow
+
+**Definition**: A sequence of related operations with dependency management and state tracking.
+
+**Where Used**:
+
+- Complex deployments
+- Multi-step operations
+- Automated processes
+
+**Related Concepts**: Batch Operation, Orchestrator, Task
+
+**Commands**:
+
+```text
+provisioning workflow list
+provisioning workflow status
+provisioning workflow monitor
+```
+
+**See Also**: [Batch Workflow System](../infrastructure/batch-workflow-system.md)
+
+---
+
+### Workspace
+
+**Definition**: An isolated environment containing infrastructure definitions and configuration.
+
+**Where Used**:
+
+- Project isolation
+- Environment separation
+- Team workspaces
+
+**Related Concepts**: Infrastructure, Config, Environment
+
+**Location**: `workspace/{name}/`
+
+**Commands**:
+
+```text
+provisioning workspace list
+provisioning workspace switch
+provisioning workspace create
+```
+
+**See Also**: Workspace Switching Guide
+
+---
+
+## X-Z
+
+### YAML
+
+**Definition**: Data serialization format used for Kubernetes manifests and configuration.
+
+**Where Used**:
+
+- Kubernetes deployments
+- Configuration files
+- Data interchange
+
+**Related Concepts**: Config, Kubernetes, Data Format
+
+---
+
+## Symbol and Acronym Index
+
+| Symbol/Acronym | Full Term | Category |
+| ---------------- | ----------- | ---------- |
+| ADR | Architecture Decision Record | Architecture |
+| API | Application Programming Interface | Integration |
+| CLI | Command-Line Interface | User Interface |
+| GDPR | General Data Protection Regulation | Compliance |
+| JWT | JSON Web Token | Security |
+| KMS | Key Management Service | Security |
+| MCP | Model Context Protocol | Platform |
+| MFA | Multi-Factor Authentication | Security |
+| Nickel | Nickel Configuration Language | Configuration |
+| OCI | Open Container Initiative | Packaging |
+| PAP | Project Architecture Principles | Architecture |
+| RBAC | Role-Based Access Control | Security |
+| REST | Representational State Transfer | API |
+| SOC2 | Service Organization Control 2 | Compliance |
+| SOPS | Secrets OPerationS | Security |
+| SSH | Secure Shell | Remote Access |
+| TOTP | Time-based One-Time Password | Security |
+| UI | User Interface | User Interface |
+
+---
+
+## Cross-Reference Map
+
+### By Topic Area
+
+**Infrastructure**:
+
+- Infrastructure, Server, Cluster, Provider, Taskserv, Module
+
+**Security**:
+
+- Auth, Authorization, JWT, MFA, TOTP, WebAuthn, Cedar, KMS, Secrets Management, RBAC, Break-Glass
+
+**Configuration**:
+
+- Config, Nickel, Schema, Validation, Environment, Layer, Workspace
+
+**Workflow & Operations**:
+
+- Workflow, Batch Operation, Operation, Task, Orchestrator, Checkpoint, Rollback
+
+**Platform Services**:
+
+- Orchestrator, Control Center, MCP, API Gateway, Platform Service
+
+**Documentation**:
+
+- Glossary, Guide, ADR, Cross-Reference, Internal Link, Anchor Link
+
+**Development**:
+
+- Extension, Plugin, Template, Module, Integration
+
+**Testing**:
+
+- Test Environment, Topology, Validation, Health Check
+
+**Compliance**:
+
+- Compliance, GDPR, Audit, Security System
+
+### By User Journey
+
+**New User**:
+
+1. Glossary (this document)
+2. Guide
+3. Quick Reference
+4. Workspace
+5. Infrastructure
+6. Server
+7. Taskserv
+
+**Developer**:
+
+1. Extension
+2.
Provider +3. Taskserv +4. Nickel +5. Schema +6. Template +7. Plugin + +**Operations**: + +1. Workflow +2. Orchestrator +3. Monitoring +4. Troubleshooting +5. Security +6. Compliance + +--- + +## Terminology Guidelines + +### Writing Style + +**Consistency**: Use the same term throughout documentation (for example, "Taskserv" not "task service" or "task-serv") + +**Capitalization**: + +- Proper nouns and acronyms: CAPITALIZE (Nickel, JWT, MFA) +- Generic terms: lowercase (server, cluster, workflow) +- Platform-specific terms: Title Case (Taskserv, Workspace, Orchestrator) + +**Pluralization**: + +- Taskservs (not taskservices) +- Workspaces (standard plural) +- Topologies (not topologys) + +### Avoiding Confusion + +| Don't Say | Say Instead | Reason | +| ----------- | ------------- | -------- | +| "Task service" | "Taskserv" | Standard platform term | +| "Configuration file" | "Config" or "Settings" | Context-dependent | +| "Worker" | "Agent" or "Task" | Clarify context | +| "Kubernetes service" | "K8s taskserv" or "K8s Service resource" | Disambiguate | + +--- + +## Contributing to the Glossary + +### Adding New Terms + +1. Alphabetical placement in appropriate section +2. Include all standard sections: + - Definition + - Where Used + - Related Concepts + - Examples (if applicable) + - Commands (if applicable) + - See Also (links to docs) + +3. Cross-reference in related terms +4. Update Symbol and Acronym Index if applicable +5. Update Cross-Reference Map + +### Updating Existing Terms + +1. Verify changes don't break cross-references +2. Update "Last Updated" date at top +3. Increment version if major changes +4. Review related terms for consistency + +--- + +## Version History + +| Version | Date | Changes | +| --------- | ------ | --------- | +| 1.0.0 | 2025-10-10 | Initial comprehensive glossary | + +--- + +**Maintained By**: Documentation Team +**Review Cycle**: Quarterly or when major features are added +**Feedback**: Please report missing or unclear terms via issues \ No newline at end of file diff --git a/docs/src/development/implementation-guide.md b/docs/src/development/implementation-guide.md index 1900100..f22c947 100644 --- a/docs/src/development/implementation-guide.md +++ b/docs/src/development/implementation-guide.md @@ -1 +1,897 @@ -# Repository Restructuring - Implementation Guide\n\n**Status:** Ready for Implementation\n**Estimated Time:** 12-16 days\n**Priority:** High\n**Related:** [Architecture Analysis](../architecture/repo-dist-analysis.md)\n\n## Overview\n\nThis guide provides step-by-step instructions for implementing the repository restructuring and distribution system improvements. Each phase includes\nspecific commands, validation steps, and rollback procedures.\n\n---\n\n## Prerequisites\n\n### Required Tools\n\n- Nushell 0.107.1+\n- Rust toolchain (for platform builds)\n- Git\n- tar/gzip\n- curl or wget\n\n### Recommended Tools\n\n- Just (task runner)\n- ripgrep (for code searches)\n- fd (for file finding)\n\n### Before Starting\n\n1. **Create full backup**\n2. **Notify team members**\n3. **Create implementation branch**\n4. 
**Set aside dedicated time**\n\n---\n\n## Phase 1: Repository Restructuring (Days 1-4)\n\n### Day 1: Backup and Analysis\n\n#### Step 1.1: Create Complete Backup\n\n```\n# Create timestamped backup\nBACKUP_DIR="/Users/Akasha/project-provisioning-backup-$(date +%Y%m%d)"\ncp -r /Users/Akasha/project-provisioning "$BACKUP_DIR"\n\n# Verify backup\nls -lh "$BACKUP_DIR"\ndu -sh "$BACKUP_DIR"\n\n# Create backup manifest\nfind "$BACKUP_DIR" -type f > "$BACKUP_DIR/manifest.txt"\necho "✅ Backup created: $BACKUP_DIR"\n```\n\n#### Step 1.2: Analyze Current State\n\n```\ncd /Users/Akasha/project-provisioning\n\n# Count workspace directories\necho "=== Workspace Directories ==="\nfd workspace -t d\n\n# Analyze workspace contents\necho "=== Active Workspace ==="\ndu -sh workspace/\n\necho "=== Backup Workspaces ==="\ndu -sh _workspace/ backup-workspace/ workspace-librecloud/\n\n# Find obsolete directories\necho "=== Build Artifacts ==="\ndu -sh target/ wrks/ NO/\n\n# Save analysis\n{\n echo "# Current State Analysis - $(date)"\n echo ""\n echo "## Workspace Directories"\n fd workspace -t d\n echo ""\n echo "## Directory Sizes"\n du -sh workspace/ _workspace/ backup-workspace/ workspace-librecloud/ 2>/dev/null\n echo ""\n echo "## Build Artifacts"\n du -sh target/ wrks/ NO/ 2>/dev/null\n} > docs/development/current-state-analysis.txt\n\necho "✅ Analysis complete: docs/development/current-state-analysis.txt"\n```\n\n#### Step 1.3: Identify Dependencies\n\n```\n# Find all hardcoded paths\necho "=== Hardcoded Paths in Nushell Scripts ==="\nrg -t nu "workspace/|_workspace/|backup-workspace/" provisioning/core/nulib/ | tee hardcoded-paths.txt\n\n# Find ENV references (legacy)\necho "=== ENV References ==="\nrg "PROVISIONING_" provisioning/core/nulib/ | wc -l\n\n# Find workspace references in configs\necho "=== Config References ==="\nrg "workspace" provisioning/config/\n\necho "✅ Dependencies mapped"\n```\n\n#### Step 1.4: Create Implementation Branch\n\n```\n# Create and switch to implementation branch\ngit checkout -b feat/repo-restructure\n\n# Commit analysis\ngit add docs/development/current-state-analysis.txt\ngit commit -m "docs: add current state analysis for restructuring"\n\necho "✅ Implementation branch created: feat/repo-restructure"\n```\n\n**Validation:**\n\n- ✅ Backup exists and is complete\n- ✅ Analysis document created\n- ✅ Dependencies mapped\n- ✅ Implementation branch ready\n\n---\n\n### Day 2: Directory Restructuring\n\n#### Step 2.1: Create New Directory Structure\n\n```\ncd /Users/Akasha/project-provisioning\n\n# Create distribution directory structure\nmkdir -p distribution/{packages,installers,registry}\necho "✅ Created distribution/"\n\n# Create workspace structure (keep tracked templates)\nmkdir -p workspace/{infra,config,extensions,runtime}/{.gitkeep}\nmkdir -p workspace/templates/{minimal,kubernetes,multi-cloud}\necho "✅ Created workspace/"\n\n# Verify\ntree -L 2 distribution/ workspace/\n```\n\n#### Step 2.2: Move Build Artifacts\n\n```\n# Move Rust build artifacts\nif [ -d "target" ]; then\n mv target distribution/target\n echo "✅ Moved target/ to distribution/"\nfi\n\n# Move KCL packages\nif [ -d "provisioning/tools/dist" ]; then\n mv provisioning/tools/dist/* distribution/packages/ 2>/dev/null || true\n echo "✅ Moved packages to distribution/"\nfi\n\n# Move any existing packages\nfind . 
-name "*.tar.gz" -o -name "*.zip" | grep -v node_modules | while read pkg; do\n mv "$pkg" distribution/packages/\n echo " Moved: $pkg"\ndone\n```\n\n#### Step 2.3: Consolidate Workspaces\n\n```\n# Identify active workspace\necho "=== Current Workspace Status ==="\nls -la workspace/ _workspace/ backup-workspace/ 2>/dev/null\n\n# Interactive workspace consolidation\nread -p "Which workspace is currently active? (workspace/_workspace/backup-workspace): " ACTIVE_WS\n\nif [ "$ACTIVE_WS" != "workspace" ]; then\n echo "Consolidating $ACTIVE_WS to workspace/"\n\n # Merge infra configs\n if [ -d "$ACTIVE_WS/infra" ]; then\n cp -r "$ACTIVE_WS/infra/"* workspace/infra/\n fi\n\n # Merge configs\n if [ -d "$ACTIVE_WS/config" ]; then\n cp -r "$ACTIVE_WS/config/"* workspace/config/\n fi\n\n # Merge extensions\n if [ -d "$ACTIVE_WS/extensions" ]; then\n cp -r "$ACTIVE_WS/extensions/"* workspace/extensions/\n fi\n\n echo "✅ Consolidated workspace"\nfi\n\n# Archive old workspace directories\nmkdir -p .archived-workspaces\nfor ws in _workspace backup-workspace workspace-librecloud; do\n if [ -d "$ws" ] && [ "$ws" != "$ACTIVE_WS" ]; then\n mv "$ws" ".archived-workspaces/$(basename $ws)-$(date +%Y%m%d)"\n echo " Archived: $ws"\n fi\ndone\n\necho "✅ Workspaces consolidated"\n```\n\n#### Step 2.4: Remove Obsolete Directories\n\n```\n# Remove build artifacts (already moved)\nrm -rf wrks/\necho "✅ Removed wrks/"\n\n# Remove test/scratch directories\nrm -rf NO/\necho "✅ Removed NO/"\n\n# Archive presentations (optional)\nif [ -d "presentations" ]; then\n read -p "Archive presentations directory? (y/N): " ARCHIVE_PRES\n if [ "$ARCHIVE_PRES" = "y" ]; then\n tar czf presentations-archive-$(date +%Y%m%d).tar.gz presentations/\n rm -rf presentations/\n echo "✅ Archived and removed presentations/"\n fi\nfi\n\n# Remove empty directories\nfind . 
-type d -empty -delete 2>/dev/null || true\n\necho "✅ Cleanup complete"\n```\n\n#### Step 2.5: Update .gitignore\n\n```\n# Backup existing .gitignore\ncp .gitignore .gitignore.backup\n\n# Update .gitignore\ncat >> .gitignore << 'EOF'\n\n# ============================================================================\n# Repository Restructure (2025-10-01)\n# ============================================================================\n\n# Workspace runtime data (user-specific)\n/workspace/infra/\n/workspace/config/\n/workspace/extensions/\n/workspace/runtime/\n\n# Distribution artifacts\n/distribution/packages/\n/distribution/target/\n\n# Build artifacts\n/target/\n/provisioning/platform/target/\n/provisioning/platform/*/target/\n\n# Rust artifacts\n**/*.rs.bk\nCargo.lock\n\n# Archived directories\n/.archived-workspaces/\n\n# Temporary files\n*.tmp\n*.temp\n/tmp/\n/wrks/\n/NO/\n\n# Logs\n*.log\n/workspace/runtime/logs/\n\n# Cache\n.cache/\n/workspace/runtime/cache/\n\n# IDE\n.vscode/\n.idea/\n*.swp\n*.swo\n*~\n\n# OS\n.DS_Store\nThumbs.db\n\n# Backup files\n*.backup\n*.bak\n\nEOF\n\necho "✅ Updated .gitignore"\n```\n\n#### Step 2.6: Commit Restructuring\n\n```\n# Stage changes\ngit add -A\n\n# Show what's being committed\ngit status\n\n# Commit\ngit commit -m "refactor: restructure repository for clean distribution\n\n- Consolidate workspace directories to single workspace/\n- Move build artifacts to distribution/\n- Remove obsolete directories (wrks/, NO/)\n- Update .gitignore for new structure\n- Archive old workspace variants\n\nThis is part of Phase 1 of the repository restructuring plan.\n\nRelated: docs/architecture/repo-dist-analysis.md"\n\necho "✅ Restructuring committed"\n```\n\n**Validation:**\n\n- ✅ Single `workspace/` directory exists\n- ✅ Build artifacts in `distribution/`\n- ✅ No `wrks/`, `NO/` directories\n- ✅ `.gitignore` updated\n- ✅ Changes committed\n\n---\n\n### Day 3: Update Path References\n\n#### Step 3.1: Create Path Update Script\n\n```\n# Create migration script\ncat > provisioning/tools/migration/update-paths.nu << 'EOF'\n#!/usr/bin/env nu\n# Path update script for repository restructuring\n\n# Find and replace path references\nexport def main [] {\n print "🔧 Updating path references..."\n\n let replacements = [\n ["_workspace/" "workspace/"]\n ["backup-workspace/" "workspace/"]\n ["workspace-librecloud/" "workspace/"]\n ["wrks/" "distribution/"]\n ["NO/" "distribution/"]\n ]\n\n let files = (fd -e nu -e toml -e md . 
provisioning/)\n\n mut updated_count = 0\n\n for file in $files {\n mut content = (open $file)\n mut modified = false\n\n for replacement in $replacements {\n let old = $replacement.0\n let new = $replacement.1\n\n if ($content | str contains $old) {\n $content = ($content | str replace -a $old $new)\n $modified = true\n }\n }\n\n if $modified {\n $content | save -f $file\n $updated_count = $updated_count + 1\n print $" ✓ Updated: ($file)"\n }\n }\n\n print $"✅ Updated ($updated_count) files"\n}\nEOF\n\nchmod +x provisioning/tools/migration/update-paths.nu\n```\n\n#### Step 3.2: Run Path Updates\n\n```\n# Create backup before updates\ngit stash\ngit checkout -b feat/path-updates\n\n# Run update script\nnu provisioning/tools/migration/update-paths.nu\n\n# Review changes\ngit diff\n\n# Test a sample file\nnu -c "use provisioning/core/nulib/servers/create.nu; print 'OK'"\n```\n\n#### Step 3.3: Update CLAUDE.md\n\n```\n# Update CLAUDE.md with new paths\ncat > CLAUDE.md.new << 'EOF'\n# CLAUDE.md\n\n[Keep existing content, update paths section...]\n\n## Updated Path Structure (2025-10-01)\n\n### Core System\n- **Main CLI**: `provisioning/core/cli/provisioning`\n- **Libraries**: `provisioning/core/nulib/`\n- **Extensions**: `provisioning/extensions/`\n- **Platform**: `provisioning/platform/`\n\n### User Workspace\n- **Active Workspace**: `workspace/` (gitignored runtime data)\n- **Templates**: `workspace/templates/` (tracked)\n- **Infrastructure**: `workspace/infra/` (user configs, gitignored)\n\n### Build System\n- **Distribution**: `distribution/` (gitignored artifacts)\n- **Packages**: `distribution/packages/`\n- **Installers**: `distribution/installers/`\n\n[Continue with rest of content...]\nEOF\n\n# Review changes\ndiff CLAUDE.md CLAUDE.md.new\n\n# Apply if satisfied\nmv CLAUDE.md.new CLAUDE.md\n```\n\n#### Step 3.4: Update Documentation\n\n```\n# Find all documentation files\nfd -e md . docs/\n\n# Update each doc with new paths\n# This is semi-automated - review each file\n\n# Create list of docs to update\nfd -e md . 
docs/ > docs-to-update.txt\n\n# Manual review and update\necho "Review and update each documentation file with new paths"\necho "Files listed in: docs-to-update.txt"\n```\n\n#### Step 3.5: Commit Path Updates\n\n```\ngit add -A\ngit commit -m "refactor: update all path references for new structure\n\n- Update Nushell scripts to use workspace/ instead of variants\n- Update CLAUDE.md with new path structure\n- Update documentation references\n- Add migration script for future path changes\n\nPhase 1.3 of repository restructuring."\n\necho "✅ Path updates committed"\n```\n\n**Validation:**\n\n- ✅ All Nushell scripts reference correct paths\n- ✅ CLAUDE.md updated\n- ✅ Documentation updated\n- ✅ No references to old paths remain\n\n---\n\n### Day 4: Validation and Testing\n\n#### Step 4.1: Automated Validation\n\n```\n# Create validation script\ncat > provisioning/tools/validation/validate-structure.nu << 'EOF'\n#!/usr/bin/env nu\n# Repository structure validation\n\nexport def main [] {\n print "🔍 Validating repository structure..."\n\n mut passed = 0\n mut failed = 0\n\n # Check required directories exist\n let required_dirs = [\n "provisioning/core"\n "provisioning/extensions"\n "provisioning/platform"\n "provisioning/schemas"\n "workspace"\n "workspace/templates"\n "distribution"\n "docs"\n "tests"\n ]\n\n for dir in $required_dirs {\n if ($dir | path exists) {\n print $" ✓ ($dir)"\n $passed = $passed + 1\n } else {\n print $" ✗ ($dir) MISSING"\n $failed = $failed + 1\n }\n }\n\n # Check obsolete directories don't exist\n let obsolete_dirs = [\n "_workspace"\n "backup-workspace"\n "workspace-librecloud"\n "wrks"\n "NO"\n ]\n\n for dir in $obsolete_dirs {\n if not ($dir | path exists) {\n print $" ✓ ($dir) removed"\n $passed = $passed + 1\n } else {\n print $" ✗ ($dir) still exists"\n $failed = $failed + 1\n }\n }\n\n # Check no old path references\n let old_paths = ["_workspace/" "backup-workspace/" "wrks/"]\n for path in $old_paths {\n let results = (rg -l $path provisioning/ --iglob "!*.md" 2>/dev/null | lines)\n if ($results | is-empty) {\n print $" ✓ No references to ($path)"\n $passed = $passed + 1\n } else {\n print $" ✗ Found references to ($path):"\n $results | each { |f| print $" - ($f)" }\n $failed = $failed + 1\n }\n }\n\n print ""\n print $"Results: ($passed) passed, ($failed) failed"\n\n if $failed > 0 {\n error make { msg: "Validation failed" }\n }\n\n print "✅ Validation passed"\n}\nEOF\n\nchmod +x provisioning/tools/validation/validate-structure.nu\n\n# Run validation\nnu provisioning/tools/validation/validate-structure.nu\n```\n\n#### Step 4.2: Functional Testing\n\n```\n# Test core commands\necho "=== Testing Core Commands ==="\n\n# Version\nprovisioning/core/cli/provisioning version\necho "✓ version command"\n\n# Help\nprovisioning/core/cli/provisioning help\necho "✓ help command"\n\n# List\nprovisioning/core/cli/provisioning list servers\necho "✓ list command"\n\n# Environment\nprovisioning/core/cli/provisioning env\necho "✓ env command"\n\n# Validate config\nprovisioning/core/cli/provisioning validate config\necho "✓ validate command"\n\necho "✅ Functional tests passed"\n```\n\n#### Step 4.3: Integration Testing\n\n```\n# Test workflow system\necho "=== Testing Workflow System ==="\n\n# List workflows\nnu -c "use provisioning/core/nulib/workflows/management.nu *; workflow list"\necho "✓ workflow list"\n\n# Test workspace commands\necho "=== Testing Workspace Commands ==="\n\n# Workspace info\nprovisioning/core/cli/provisioning workspace info\necho "✓ workspace 
info"\n\necho "✅ Integration tests passed"\n```\n\n#### Step 4.4: Create Test Report\n\n```\n{\n echo "# Repository Restructuring - Validation Report"\n echo "Date: $(date)"\n echo ""\n echo "## Structure Validation"\n nu provisioning/tools/validation/validate-structure.nu 2>&1\n echo ""\n echo "## Functional Tests"\n echo "✓ version command"\n echo "✓ help command"\n echo "✓ list command"\n echo "✓ env command"\n echo "✓ validate command"\n echo ""\n echo "## Integration Tests"\n echo "✓ workflow list"\n echo "✓ workspace info"\n echo ""\n echo "## Conclusion"\n echo "✅ Phase 1 validation complete"\n} > docs/development/phase1-validation-report.md\n\necho "✅ Test report created: docs/development/phase1-validation-report.md"\n```\n\n#### Step 4.5: Update README\n\n```\n# Update main README with new structure\n# This is manual - review and update README.md\n\necho "📝 Please review and update README.md with new structure"\necho " - Update directory structure diagram"\necho " - Update installation instructions"\necho " - Update quick start guide"\n```\n\n#### Step 4.6: Finalize Phase 1\n\n```\n# Commit validation and reports\ngit add -A\ngit commit -m "test: add validation for repository restructuring\n\n- Add structure validation script\n- Add functional tests\n- Add integration tests\n- Create validation report\n- Document Phase 1 completion\n\nPhase 1 complete: Repository restructuring validated."\n\n# Merge to implementation branch\ngit checkout feat/repo-restructure\ngit merge feat/path-updates\n\necho "✅ Phase 1 complete and merged"\n```\n\n**Validation:**\n\n- ✅ All validation tests pass\n- ✅ Functional tests pass\n- ✅ Integration tests pass\n- ✅ Validation report created\n- ✅ README updated\n- ✅ Phase 1 changes merged\n\n---\n\n## Phase 2: Build System Implementation (Days 5-8)\n\n### Day 5: Build System Core\n\n#### Step 5.1: Create Build Tools Directory\n\n```\nmkdir -p provisioning/tools/build\ncd provisioning/tools/build\n\n# Create directory structure\nmkdir -p {core,platform,extensions,validation,distribution}\n\necho "✅ Build tools directory created"\n```\n\n#### Step 5.2: Implement Core Build System\n\n```\n# Create main build orchestrator\n# See full implementation in repo-dist-analysis.md\n# Copy build-system.nu from the analysis document\n\n# Test build system\nnu build-system.nu status\n```\n\n#### Step 5.3: Implement Core Packaging\n\n```\n# Create package-core.nu\n# This packages Nushell libraries, KCL schemas, templates\n\n# Test core packaging\nnu build-system.nu build-core --version dev\n```\n\n#### Step 5.4: Create Justfile\n\n```\n# Create Justfile in project root\n# See full Justfile in repo-dist-analysis.md\n\n# Test Justfile\njust --list\njust status\n```\n\n**Validation:**\n\n- ✅ Build system structure exists\n- ✅ Core build orchestrator works\n- ✅ Core packaging works\n- ✅ Justfile functional\n\n### Day 6-8: Continue with Platform, Extensions, and Validation\n\n[Follow similar pattern for remaining build system components]\n\n---\n\n## Phase 3: Installation System (Days 9-11)\n\n### Day 9: Nushell Installer\n\n#### Step 9.1: Create install.nu\n\n```\nmkdir -p distribution/installers\n\n# Create install.nu\n# See full implementation in repo-dist-analysis.md\n```\n\n#### Step 9.2: Test Installation\n\n```\n# Test installation to /tmp\nnu distribution/installers/install.nu --prefix /tmp/provisioning-test\n\n# Verify\nls -lh /tmp/provisioning-test/\n\n# Test uninstallation\nnu distribution/installers/install.nu uninstall --prefix 
/tmp/provisioning-test\n```\n\n**Validation:**\n\n- ✅ Installer works\n- ✅ Files installed to correct locations\n- ✅ Uninstaller works\n- ✅ No files left after uninstall\n\n---\n\n## Rollback Procedures\n\n### If Phase 1 Fails\n\n```\n# Restore from backup\nrm -rf /Users/Akasha/project-provisioning\ncp -r "$BACKUP_DIR" /Users/Akasha/project-provisioning\n\n# Return to main branch\ncd /Users/Akasha/project-provisioning\ngit checkout main\ngit branch -D feat/repo-restructure\n```\n\n### If Build System Fails\n\n```\n# Revert build system commits\ngit checkout feat/repo-restructure\ngit revert \n```\n\n### If Installation Fails\n\n```\n# Clean up test installation\nrm -rf /tmp/provisioning-test\nsudo rm -rf /usr/local/lib/provisioning\nsudo rm -rf /usr/local/share/provisioning\n```\n\n---\n\n## Checklist\n\n### Phase 1: Repository Restructuring\n\n- [ ] Day 1: Backup and analysis complete\n- [ ] Day 2: Directory restructuring complete\n- [ ] Day 3: Path references updated\n- [ ] Day 4: Validation passed\n\n### Phase 2: Build System\n\n- [ ] Day 5: Core build system implemented\n- [ ] Day 6: Platform/extensions packaging\n- [ ] Day 7: Package validation\n- [ ] Day 8: Build system tested\n\n### Phase 3: Installation\n\n- [ ] Day 9: Nushell installer created\n- [ ] Day 10: Bash installer and CLI\n- [ ] Day 11: Multi-OS testing\n\n### Phase 4: Registry (Optional)\n\n- [ ] Day 12: Registry system\n- [ ] Day 13: Registry commands\n- [ ] Day 14: Registry hosting\n\n### Phase 5: Documentation\n\n- [ ] Day 15: Documentation updated\n- [ ] Day 16: Release prepared\n\n---\n\n## Notes\n\n- **Take breaks between phases** - Don't rush\n- **Test thoroughly** - Each phase builds on previous\n- **Commit frequently** - Small, atomic commits\n- **Document issues** - Track any problems encountered\n- **Ask for review** - Get feedback at phase boundaries\n\n---\n\n## Support\n\nIf you encounter issues:\n\n1. Check the validation reports\n2. Review the rollback procedures\n3. Consult the architecture analysis\n4. Create an issue in the tracker +# Repository Restructuring - Implementation Guide + +**Status:** Ready for Implementation +**Estimated Time:** 12-16 days +**Priority:** High +**Related:** [Architecture Analysis](../architecture/repo-dist-analysis.md) + +## Overview + +This guide provides step-by-step instructions for implementing the repository restructuring and distribution system improvements. Each phase includes +specific commands, validation steps, and rollback procedures. + +--- + +## Prerequisites + +### Required Tools + +- Nushell 0.107.1+ +- Rust toolchain (for platform builds) +- Git +- tar/gzip +- curl or wget + +### Recommended Tools + +- Just (task runner) +- ripgrep (for code searches) +- fd (for file finding) + +### Before Starting + +1. **Create full backup** +2. **Notify team members** +3. **Create implementation branch** +4. 
**Set aside dedicated time**
+
+---
+
+## Phase 1: Repository Restructuring (Days 1-4)
+
+### Day 1: Backup and Analysis
+
+#### Step 1.1: Create Complete Backup
+
+```text
+# Create timestamped backup
+BACKUP_DIR="/Users/Akasha/project-provisioning-backup-$(date +%Y%m%d)"
+cp -r /Users/Akasha/project-provisioning "$BACKUP_DIR"
+
+# Verify backup
+ls -lh "$BACKUP_DIR"
+du -sh "$BACKUP_DIR"
+
+# Create backup manifest
+find "$BACKUP_DIR" -type f > "$BACKUP_DIR/manifest.txt"
+echo "✅ Backup created: $BACKUP_DIR"
+```
+
+#### Step 1.2: Analyze Current State
+
+```text
+cd /Users/Akasha/project-provisioning
+
+# Count workspace directories
+echo "=== Workspace Directories ==="
+fd workspace -t d
+
+# Analyze workspace contents
+echo "=== Active Workspace ==="
+du -sh workspace/
+
+echo "=== Backup Workspaces ==="
+du -sh _workspace/ backup-workspace/ workspace-librecloud/
+
+# Find obsolete directories
+echo "=== Build Artifacts ==="
+du -sh target/ wrks/ NO/
+
+# Save analysis
+{
+  echo "# Current State Analysis - $(date)"
+  echo ""
+  echo "## Workspace Directories"
+  fd workspace -t d
+  echo ""
+  echo "## Directory Sizes"
+  du -sh workspace/ _workspace/ backup-workspace/ workspace-librecloud/ 2>/dev/null
+  echo ""
+  echo "## Build Artifacts"
+  du -sh target/ wrks/ NO/ 2>/dev/null
+} > docs/development/current-state-analysis.txt
+
+echo "✅ Analysis complete: docs/development/current-state-analysis.txt"
+```
+
+#### Step 1.3: Identify Dependencies
+
+```text
+# Find all hardcoded paths
+echo "=== Hardcoded Paths in Nushell Scripts ==="
+rg -t nu "workspace/|_workspace/|backup-workspace/" provisioning/core/nulib/ | tee hardcoded-paths.txt
+
+# Find ENV references (legacy)
+echo "=== ENV References ==="
+rg "PROVISIONING_" provisioning/core/nulib/ | wc -l
+
+# Find workspace references in configs
+echo "=== Config References ==="
+rg "workspace" provisioning/config/
+
+echo "✅ Dependencies mapped"
+```
+
+#### Step 1.4: Create Implementation Branch
+
+```text
+# Create and switch to implementation branch
+git checkout -b feat/repo-restructure
+
+# Commit analysis
+git add docs/development/current-state-analysis.txt
+git commit -m "docs: add current state analysis for restructuring"
+
+echo "✅ Implementation branch created: feat/repo-restructure"
+```
+
+**Validation:**
+
+- ✅ Backup exists and is complete
+- ✅ Analysis document created
+- ✅ Dependencies mapped
+- ✅ Implementation branch ready
+
+---
+
+### Day 2: Directory Restructuring
+
+#### Step 2.1: Create New Directory Structure
+
+```text
+cd /Users/Akasha/project-provisioning
+
+# Create distribution directory structure
+mkdir -p distribution/{packages,installers,registry}
+echo "✅ Created distribution/"
+
+# Create workspace structure (keep tracked templates)
+mkdir -p workspace/{infra,config,extensions,runtime}
+touch workspace/{infra,config,extensions,runtime}/.gitkeep
+mkdir -p workspace/templates/{minimal,kubernetes,multi-cloud}
+echo "✅ Created workspace/"
+
+# Verify
+tree -L 2 distribution/ workspace/
+```
+
+#### Step 2.2: Move Build Artifacts
+
+```text
+# Move Rust build artifacts
+if [ -d "target" ]; then
+  mv target distribution/target
+  echo "✅ Moved target/ to distribution/"
+fi
+
+# Move KCL packages
+if [ -d "provisioning/tools/dist" ]; then
+  mv provisioning/tools/dist/* distribution/packages/ 2>/dev/null || true
+  echo "✅ Moved packages to distribution/"
+fi
+
+# Move any existing packages
+find .
-name "*.tar.gz" -o -name "*.zip" | grep -v node_modules | while read pkg; do + mv "$pkg" distribution/packages/ + echo " Moved: $pkg" +done +``` + +#### Step 2.3: Consolidate Workspaces + +```text +# Identify active workspace +echo "=== Current Workspace Status ===" +ls -la workspace/ _workspace/ backup-workspace/ 2>/dev/null + +# Interactive workspace consolidation +read -p "Which workspace is currently active? (workspace/_workspace/backup-workspace): " ACTIVE_WS + +if [ "$ACTIVE_WS" != "workspace" ]; then + echo "Consolidating $ACTIVE_WS to workspace/" + + # Merge infra configs + if [ -d "$ACTIVE_WS/infra" ]; then + cp -r "$ACTIVE_WS/infra/"* workspace/infra/ + fi + + # Merge configs + if [ -d "$ACTIVE_WS/config" ]; then + cp -r "$ACTIVE_WS/config/"* workspace/config/ + fi + + # Merge extensions + if [ -d "$ACTIVE_WS/extensions" ]; then + cp -r "$ACTIVE_WS/extensions/"* workspace/extensions/ + fi + + echo "✅ Consolidated workspace" +fi + +# Archive old workspace directories +mkdir -p .archived-workspaces +for ws in _workspace backup-workspace workspace-librecloud; do + if [ -d "$ws" ] && [ "$ws" != "$ACTIVE_WS" ]; then + mv "$ws" ".archived-workspaces/$(basename $ws)-$(date +%Y%m%d)" + echo " Archived: $ws" + fi +done + +echo "✅ Workspaces consolidated" +``` + +#### Step 2.4: Remove Obsolete Directories + +```text +# Remove build artifacts (already moved) +rm -rf wrks/ +echo "✅ Removed wrks/" + +# Remove test/scratch directories +rm -rf NO/ +echo "✅ Removed NO/" + +# Archive presentations (optional) +if [ -d "presentations" ]; then + read -p "Archive presentations directory? (y/N): " ARCHIVE_PRES + if [ "$ARCHIVE_PRES" = "y" ]; then + tar czf presentations-archive-$(date +%Y%m%d).tar.gz presentations/ + rm -rf presentations/ + echo "✅ Archived and removed presentations/" + fi +fi + +# Remove empty directories +find . -type d -empty -delete 2>/dev/null || true + +echo "✅ Cleanup complete" +``` + +#### Step 2.5: Update .gitignore + +```text +# Backup existing .gitignore +cp .gitignore .gitignore.backup + +# Update .gitignore +cat >> .gitignore << 'EOF' + +# ============================================================================ +# Repository Restructure (2025-10-01) +# ============================================================================ + +# Workspace runtime data (user-specific) +/workspace/infra/ +/workspace/config/ +/workspace/extensions/ +/workspace/runtime/ + +# Distribution artifacts +/distribution/packages/ +/distribution/target/ + +# Build artifacts +/target/ +/provisioning/platform/target/ +/provisioning/platform/*/target/ + +# Rust artifacts +**/*.rs.bk +Cargo.lock + +# Archived directories +/.archived-workspaces/ + +# Temporary files +*.tmp +*.temp +/tmp/ +/wrks/ +/NO/ + +# Logs +*.log +/workspace/runtime/logs/ + +# Cache +.cache/ +/workspace/runtime/cache/ + +# IDE +.vscode/ +.idea/ +*.swp +*.swo +*~ + +# OS +.DS_Store +Thumbs.db + +# Backup files +*.backup +*.bak + +EOF + +echo "✅ Updated .gitignore" +``` + +#### Step 2.6: Commit Restructuring + +```text +# Stage changes +git add -A + +# Show what's being committed +git status + +# Commit +git commit -m "refactor: restructure repository for clean distribution + +- Consolidate workspace directories to single workspace/ +- Move build artifacts to distribution/ +- Remove obsolete directories (wrks/, NO/) +- Update .gitignore for new structure +- Archive old workspace variants + +This is part of Phase 1 of the repository restructuring plan. 
+
+Related: docs/architecture/repo-dist-analysis.md"
+
+echo "✅ Restructuring committed"
+```
+
+**Validation:**
+
+- ✅ Single `workspace/` directory exists
+- ✅ Build artifacts in `distribution/`
+- ✅ No `wrks/`, `NO/` directories
+- ✅ `.gitignore` updated
+- ✅ Changes committed
+
+---
+
+### Day 3: Update Path References
+
+#### Step 3.1: Create Path Update Script
+
+```text
+# Create migration script
+cat > provisioning/tools/migration/update-paths.nu << 'EOF'
+#!/usr/bin/env nu
+# Path update script for repository restructuring
+
+# Find and replace path references
+export def main [] {
+    print "🔧 Updating path references..."
+
+    let replacements = [
+        ["_workspace/" "workspace/"]
+        ["backup-workspace/" "workspace/"]
+        ["workspace-librecloud/" "workspace/"]
+        ["wrks/" "distribution/"]
+        ["NO/" "distribution/"]
+    ]
+
+    # fd emits one path per line; split into a list for iteration
+    let files = (fd -e nu -e toml -e md . provisioning/ | lines)
+
+    mut updated_count = 0
+
+    for file in $files {
+        # --raw keeps the file as plain text instead of parsing it
+        mut content = (open --raw $file)
+        mut modified = false
+
+        for replacement in $replacements {
+            let old = $replacement.0
+            let new = $replacement.1
+
+            if ($content | str contains $old) {
+                $content = ($content | str replace -a $old $new)
+                $modified = true
+            }
+        }
+
+        if $modified {
+            $content | save -f $file
+            $updated_count = $updated_count + 1
+            print $"  ✓ Updated: ($file)"
+        }
+    }
+
+    print $"✅ Updated ($updated_count) files"
+}
+EOF
+
+chmod +x provisioning/tools/migration/update-paths.nu
+```
+
+#### Step 3.2: Run Path Updates
+
+```text
+# Create backup before updates
+git stash
+git checkout -b feat/path-updates
+
+# Run update script
+nu provisioning/tools/migration/update-paths.nu
+
+# Review changes
+git diff
+
+# Test a sample file
+nu -c "use provisioning/core/nulib/servers/create.nu; print 'OK'"
+```
+
+#### Step 3.3: Update CLAUDE.md
+
+```text
+# Update CLAUDE.md with new paths
+cat > CLAUDE.md.new << 'EOF'
+# CLAUDE.md
+
+[Keep existing content, update paths section...]
+
+## Updated Path Structure (2025-10-01)
+
+### Core System
+- **Main CLI**: `provisioning/core/cli/provisioning`
+- **Libraries**: `provisioning/core/nulib/`
+- **Extensions**: `provisioning/extensions/`
+- **Platform**: `provisioning/platform/`
+
+### User Workspace
+- **Active Workspace**: `workspace/` (gitignored runtime data)
+- **Templates**: `workspace/templates/` (tracked)
+- **Infrastructure**: `workspace/infra/` (user configs, gitignored)
+
+### Build System
+- **Distribution**: `distribution/` (gitignored artifacts)
+- **Packages**: `distribution/packages/`
+- **Installers**: `distribution/installers/`
+
+[Continue with rest of content...]
+EOF
+
+# Review changes
+diff CLAUDE.md CLAUDE.md.new
+
+# Apply if satisfied
+mv CLAUDE.md.new CLAUDE.md
+```
+
+#### Step 3.4: Update Documentation
+
+```text
+# Find all documentation files
+fd -e md . docs/
+
+# Update each doc with new paths
+# This is semi-automated - review each file
+
+# Create list of docs to update
+fd -e md . docs/ > docs-to-update.txt
+
+# Manual review and update
+echo "Review and update each documentation file with new paths"
+echo "Files listed in: docs-to-update.txt"
+```
+
+#### Step 3.5: Commit Path Updates
+
+```text
+git add -A
+git commit -m "refactor: update all path references for new structure
+
+- Update Nushell scripts to use workspace/ instead of variants
+- Update CLAUDE.md with new path structure
+- Update documentation references
+- Add migration script for future path changes
+
+Phase 1.3 of repository restructuring."
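+
+# Optional sanity check (ripgrep, as in Step 1.3): confirm no stale
+# path references remain before calling Day 3 done
+rg -n "_workspace/|backup-workspace/|wrks/" provisioning/ || echo "✓ no stale paths"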
+ +echo "✅ Path updates committed" +``` + +**Validation:** + +- ✅ All Nushell scripts reference correct paths +- ✅ CLAUDE.md updated +- ✅ Documentation updated +- ✅ No references to old paths remain + +--- + +### Day 4: Validation and Testing + +#### Step 4.1: Automated Validation + +```text +# Create validation script +cat > provisioning/tools/validation/validate-structure.nu << 'EOF' +#!/usr/bin/env nu +# Repository structure validation + +export def main [] { + print "🔍 Validating repository structure..." + + mut passed = 0 + mut failed = 0 + + # Check required directories exist + let required_dirs = [ + "provisioning/core" + "provisioning/extensions" + "provisioning/platform" + "provisioning/schemas" + "workspace" + "workspace/templates" + "distribution" + "docs" + "tests" + ] + + for dir in $required_dirs { + if ($dir | path exists) { + print $" ✓ ($dir)" + $passed = $passed + 1 + } else { + print $" ✗ ($dir) MISSING" + $failed = $failed + 1 + } + } + + # Check obsolete directories don't exist + let obsolete_dirs = [ + "_workspace" + "backup-workspace" + "workspace-librecloud" + "wrks" + "NO" + ] + + for dir in $obsolete_dirs { + if not ($dir | path exists) { + print $" ✓ ($dir) removed" + $passed = $passed + 1 + } else { + print $" ✗ ($dir) still exists" + $failed = $failed + 1 + } + } + + # Check no old path references + let old_paths = ["_workspace/" "backup-workspace/" "wrks/"] + for path in $old_paths { + let results = (rg -l $path provisioning/ --iglob "!*.md" 2>/dev/null | lines) + if ($results | is-empty) { + print $" ✓ No references to ($path)" + $passed = $passed + 1 + } else { + print $" ✗ Found references to ($path):" + $results | each { |f| print $" - ($f)" } + $failed = $failed + 1 + } + } + + print "" + print $"Results: ($passed) passed, ($failed) failed" + + if $failed > 0 { + error make { msg: "Validation failed" } + } + + print "✅ Validation passed" +} +EOF + +chmod +x provisioning/tools/validation/validate-structure.nu + +# Run validation +nu provisioning/tools/validation/validate-structure.nu +``` + +#### Step 4.2: Functional Testing + +```text +# Test core commands +echo "=== Testing Core Commands ===" + +# Version +provisioning/core/cli/provisioning version +echo "✓ version command" + +# Help +provisioning/core/cli/provisioning help +echo "✓ help command" + +# List +provisioning/core/cli/provisioning list servers +echo "✓ list command" + +# Environment +provisioning/core/cli/provisioning env +echo "✓ env command" + +# Validate config +provisioning/core/cli/provisioning validate config +echo "✓ validate command" + +echo "✅ Functional tests passed" +``` + +#### Step 4.3: Integration Testing + +```text +# Test workflow system +echo "=== Testing Workflow System ===" + +# List workflows +nu -c "use provisioning/core/nulib/workflows/management.nu *; workflow list" +echo "✓ workflow list" + +# Test workspace commands +echo "=== Testing Workspace Commands ===" + +# Workspace info +provisioning/core/cli/provisioning workspace info +echo "✓ workspace info" + +echo "✅ Integration tests passed" +``` + +#### Step 4.4: Create Test Report + +```text +{ + echo "# Repository Restructuring - Validation Report" + echo "Date: $(date)" + echo "" + echo "## Structure Validation" + nu provisioning/tools/validation/validate-structure.nu 2>&1 + echo "" + echo "## Functional Tests" + echo "✓ version command" + echo "✓ help command" + echo "✓ list command" + echo "✓ env command" + echo "✓ validate command" + echo "" + echo "## Integration Tests" + echo "✓ workflow list" + echo "✓ workspace 
info" + echo "" + echo "## Conclusion" + echo "✅ Phase 1 validation complete" +} > docs/development/phase1-validation-report.md + +echo "✅ Test report created: docs/development/phase1-validation-report.md" +``` + +#### Step 4.5: Update README + +```text +# Update main README with new structure +# This is manual - review and update README.md + +echo "📝 Please review and update README.md with new structure" +echo " - Update directory structure diagram" +echo " - Update installation instructions" +echo " - Update quick start guide" +``` + +#### Step 4.6: Finalize Phase 1 + +```text +# Commit validation and reports +git add -A +git commit -m "test: add validation for repository restructuring + +- Add structure validation script +- Add functional tests +- Add integration tests +- Create validation report +- Document Phase 1 completion + +Phase 1 complete: Repository restructuring validated." + +# Merge to implementation branch +git checkout feat/repo-restructure +git merge feat/path-updates + +echo "✅ Phase 1 complete and merged" +``` + +**Validation:** + +- ✅ All validation tests pass +- ✅ Functional tests pass +- ✅ Integration tests pass +- ✅ Validation report created +- ✅ README updated +- ✅ Phase 1 changes merged + +--- + +## Phase 2: Build System Implementation (Days 5-8) + +### Day 5: Build System Core + +#### Step 5.1: Create Build Tools Directory + +```text +mkdir -p provisioning/tools/build +cd provisioning/tools/build + +# Create directory structure +mkdir -p {core,platform,extensions,validation,distribution} + +echo "✅ Build tools directory created" +``` + +#### Step 5.2: Implement Core Build System + +```text +# Create main build orchestrator +# See full implementation in repo-dist-analysis.md +# Copy build-system.nu from the analysis document + +# Test build system +nu build-system.nu status +``` + +#### Step 5.3: Implement Core Packaging + +```text +# Create package-core.nu +# This packages Nushell libraries, KCL schemas, templates + +# Test core packaging +nu build-system.nu build-core --version dev +``` + +#### Step 5.4: Create Justfile + +```text +# Create Justfile in project root +# See full Justfile in repo-dist-analysis.md + +# Test Justfile +just --list +just status +``` + +**Validation:** + +- ✅ Build system structure exists +- ✅ Core build orchestrator works +- ✅ Core packaging works +- ✅ Justfile functional + +### Day 6-8: Continue with Platform, Extensions, and Validation + +[Follow similar pattern for remaining build system components] + +--- + +## Phase 3: Installation System (Days 9-11) + +### Day 9: Nushell Installer + +#### Step 9.1: Create install.nu + +```text +mkdir -p distribution/installers + +# Create install.nu +# See full implementation in repo-dist-analysis.md +``` + +#### Step 9.2: Test Installation + +```text +# Test installation to /tmp +nu distribution/installers/install.nu --prefix /tmp/provisioning-test + +# Verify +ls -lh /tmp/provisioning-test/ + +# Test uninstallation +nu distribution/installers/install.nu uninstall --prefix /tmp/provisioning-test +``` + +**Validation:** + +- ✅ Installer works +- ✅ Files installed to correct locations +- ✅ Uninstaller works +- ✅ No files left after uninstall + +--- + +## Rollback Procedures + +### If Phase 1 Fails + +```text +# Restore from backup +rm -rf /Users/Akasha/project-provisioning +cp -r "$BACKUP_DIR" /Users/Akasha/project-provisioning + +# Return to main branch +cd /Users/Akasha/project-provisioning +git checkout main +git branch -D feat/repo-restructure +``` + +### If Build System Fails + +```text +# 
Revert build system commits +git checkout feat/repo-restructure +git revert +``` + +### If Installation Fails + +```text +# Clean up test installation +rm -rf /tmp/provisioning-test +sudo rm -rf /usr/local/lib/provisioning +sudo rm -rf /usr/local/share/provisioning +``` + +--- + +## Checklist + +### Phase 1: Repository Restructuring + +- [ ] Day 1: Backup and analysis complete +- [ ] Day 2: Directory restructuring complete +- [ ] Day 3: Path references updated +- [ ] Day 4: Validation passed + +### Phase 2: Build System + +- [ ] Day 5: Core build system implemented +- [ ] Day 6: Platform/extensions packaging +- [ ] Day 7: Package validation +- [ ] Day 8: Build system tested + +### Phase 3: Installation + +- [ ] Day 9: Nushell installer created +- [ ] Day 10: Bash installer and CLI +- [ ] Day 11: Multi-OS testing + +### Phase 4: Registry (Optional) + +- [ ] Day 12: Registry system +- [ ] Day 13: Registry commands +- [ ] Day 14: Registry hosting + +### Phase 5: Documentation + +- [ ] Day 15: Documentation updated +- [ ] Day 16: Release prepared + +--- + +## Notes + +- **Take breaks between phases** - Don't rush +- **Test thoroughly** - Each phase builds on previous +- **Commit frequently** - Small, atomic commits +- **Document issues** - Track any problems encountered +- **Ask for review** - Get feedback at phase boundaries + +--- + +## Support + +If you encounter issues: + +1. Check the validation reports +2. Review the rollback procedures +3. Consult the architecture analysis +4. Create an issue in the tracker \ No newline at end of file diff --git a/docs/src/development/infrastructure-specific-extensions.md b/docs/src/development/infrastructure-specific-extensions.md index 739e37f..56727a0 100644 --- a/docs/src/development/infrastructure-specific-extensions.md +++ b/docs/src/development/infrastructure-specific-extensions.md @@ -1 +1,1230 @@ -# Infrastructure-Specific Extension Development\n\nThis guide focuses on creating extensions tailored to specific infrastructure requirements, business needs, and organizational constraints.\n\n## Table of Contents\n\n1. [Overview](#overview)\n2. [Infrastructure Assessment](#infrastructure-assessment)\n3. [Custom Taskserv Development](#custom-taskserv-development)\n4. [Provider-Specific Extensions](#provider-specific-extensions)\n5. [Multi-Environment Management](#multi-environment-management)\n6. [Integration Patterns](#integration-patterns)\n7. [Real-World Examples](#real-world-examples)\n\n## Overview\n\nInfrastructure-specific extensions address unique requirements that generic modules cannot cover:\n\n- **Company-specific applications and services**\n- **Compliance and security requirements**\n- **Legacy system integrations**\n- **Custom networking configurations**\n- **Specialized monitoring and alerting**\n- **Multi-cloud and hybrid deployments**\n\n## Infrastructure Assessment\n\n### Identifying Extension Needs\n\nBefore creating custom extensions, assess your infrastructure requirements:\n\n#### 1. 
Application Inventory\n\n```\n# Document existing applications\ncat > infrastructure-assessment.yaml << EOF\napplications:\n - name: "legacy-billing-system"\n type: "monolith"\n runtime: "java-8"\n database: "oracle-11g"\n integrations: ["ldap", "file-storage", "email"]\n compliance: ["pci-dss", "sox"]\n\n - name: "customer-portal"\n type: "microservices"\n runtime: "nodejs-16"\n database: "postgresql-13"\n integrations: ["redis", "elasticsearch", "s3"]\n compliance: ["gdpr", "hipaa"]\n\ninfrastructure:\n - type: "on-premise"\n location: "datacenter-primary"\n capabilities: ["kubernetes", "vmware", "storage-array"]\n\n - type: "cloud"\n provider: "aws"\n regions: ["us-east-1", "eu-west-1"]\n services: ["eks", "rds", "s3", "cloudfront"]\n\ncompliance_requirements:\n - "PCI DSS Level 1"\n - "SOX compliance"\n - "GDPR data protection"\n - "HIPAA safeguards"\n\nnetwork_requirements:\n - "air-gapped environments"\n - "private subnet isolation"\n - "vpn connectivity"\n - "load balancer integration"\nEOF\n```\n\n#### 2. Gap Analysis\n\n```\n# Analyze what standard modules don't cover\n./provisioning/core/cli/module-loader discover taskservs > available-modules.txt\n\n# Create gap analysis\ncat > gap-analysis.md << EOF\n# Infrastructure Gap Analysis\n\n## Standard Modules Available\n$(cat available-modules.txt)\n\n## Missing Capabilities\n- [ ] Legacy Oracle database integration\n- [ ] Company-specific LDAP authentication\n- [ ] Custom monitoring for legacy systems\n- [ ] Compliance reporting automation\n- [ ] Air-gapped deployment workflows\n- [ ] Multi-datacenter replication\n\n## Custom Extensions Needed\n1. **oracle-db-taskserv**: Oracle database with company settings\n2. **company-ldap-taskserv**: LDAP integration with custom schema\n3. **compliance-monitor-taskserv**: Automated compliance checking\n4. **airgap-deployment-cluster**: Air-gapped deployment patterns\n5. 
**company-monitoring-taskserv**: Custom monitoring dashboard\nEOF\n```\n\n### Requirements Gathering\n\n#### Business Requirements Template\n\n```\n"""\nBusiness Requirements Schema for Custom Extensions\nUse this template to document requirements before development\n"""\n\nschema BusinessRequirements:\n """Document business requirements for custom extensions"""\n\n # Project information\n project_name: str\n stakeholders: [str]\n timeline: str\n budget_constraints?: str\n\n # Functional requirements\n functional_requirements: [FunctionalRequirement]\n\n # Non-functional requirements\n performance_requirements: PerformanceRequirements\n security_requirements: SecurityRequirements\n compliance_requirements: [str]\n\n # Integration requirements\n existing_systems: [ExistingSystem]\n required_integrations: [Integration]\n\n # Operational requirements\n monitoring_requirements: [str]\n backup_requirements: [str]\n disaster_recovery_requirements: [str]\n\nschema FunctionalRequirement:\n id: str\n description: str\n priority: "high" | "medium" | "low"\n acceptance_criteria: [str]\n\nschema PerformanceRequirements:\n max_response_time: str\n throughput_requirements: str\n availability_target: str\n scalability_requirements: str\n\nschema SecurityRequirements:\n authentication_method: str\n authorization_model: str\n encryption_requirements: [str]\n audit_requirements: [str]\n network_security: [str]\n\nschema ExistingSystem:\n name: str\n type: str\n version: str\n api_available: bool\n integration_method: str\n\nschema Integration:\n target_system: str\n integration_type: "api" | "database" | "file" | "message_queue"\n data_format: str\n frequency: str\n direction: "inbound" | "outbound" | "bidirectional"\n```\n\n## Custom Taskserv Development\n\n### Company-Specific Application Taskserv\n\n#### Example: Legacy ERP System Integration\n\n```\n# Create company-specific taskserv\nmkdir -p extensions/taskservs/company-specific/legacy-erp/nickel\ncd extensions/taskservs/company-specific/legacy-erp/nickel\n```\n\nCreate `legacy-erp.ncl`:\n\n```\n"""\nLegacy ERP System Taskserv\nHandles deployment and management of company's legacy ERP system\n"""\n\nimport provisioning.lib as lib\nimport provisioning.dependencies as deps\nimport provisioning.defaults as defaults\n\n# ERP system configuration\nschema LegacyERPConfig:\n """Configuration for legacy ERP system"""\n\n # Application settings\n erp_version: str = "12.2.0"\n installation_mode: "standalone" | "cluster" | "ha" = "ha"\n\n # Database configuration\n database_type: "oracle" | "sqlserver" = "oracle"\n database_version: str = "19c"\n database_size: str = "500Gi"\n database_backup_retention: int = 30\n\n # Network configuration\n erp_port: int = 8080\n database_port: int = 1521\n ssl_enabled: bool = True\n internal_network_only: bool = True\n\n # Integration settings\n ldap_server: str\n file_share_path: str\n email_server: str\n\n # Compliance settings\n audit_logging: bool = True\n encryption_at_rest: bool = True\n encryption_in_transit: bool = True\n data_retention_years: int = 7\n\n # Resource allocation\n app_server_resources: ERPResourceConfig\n database_resources: ERPResourceConfig\n\n # Backup configuration\n backup_schedule: str = "0 2 * * *" # Daily at 2 AM\n backup_retention_policy: BackupRetentionPolicy\n\n check:\n erp_port > 0 and erp_port < 65536, "ERP port must be valid"\n database_port > 0 and database_port < 65536, "Database port must be valid"\n data_retention_years > 0, "Data retention must be positive"\n len(ldap_server) > 0, 
"LDAP server required"\n\nschema ERPResourceConfig:\n """Resource configuration for ERP components"""\n cpu_request: str\n memory_request: str\n cpu_limit: str\n memory_limit: str\n storage_size: str\n storage_class: str = "fast-ssd"\n\nschema BackupRetentionPolicy:\n """Backup retention policy for ERP system"""\n daily_backups: int = 7\n weekly_backups: int = 4\n monthly_backups: int = 12\n yearly_backups: int = 7\n\n# Environment-specific resource configurations\nerp_resource_profiles = {\n "development": {\n app_server_resources = {\n cpu_request = "1"\n memory_request = "4Gi"\n cpu_limit = "2"\n memory_limit = "8Gi"\n storage_size = "50Gi"\n storage_class = "standard"\n }\n database_resources = {\n cpu_request = "2"\n memory_request = "8Gi"\n cpu_limit = "4"\n memory_limit = "16Gi"\n storage_size = "100Gi"\n storage_class = "standard"\n }\n },\n "production": {\n app_server_resources = {\n cpu_request = "4"\n memory_request = "16Gi"\n cpu_limit = "8"\n memory_limit = "32Gi"\n storage_size = "200Gi"\n storage_class = "fast-ssd"\n }\n database_resources = {\n cpu_request = "8"\n memory_request = "32Gi"\n cpu_limit = "16"\n memory_limit = "64Gi"\n storage_size = "2Ti"\n storage_class = "fast-ssd"\n }\n }\n}\n\n# Taskserv definition\nschema LegacyERPTaskserv(lib.TaskServDef):\n """Legacy ERP Taskserv Definition"""\n name: str = "legacy-erp"\n config: LegacyERPConfig\n environment: "development" | "staging" | "production"\n\n# Dependencies for legacy ERP\nlegacy_erp_dependencies: deps.TaskservDependencies = {\n name = "legacy-erp"\n\n # Infrastructure dependencies\n requires = ["kubernetes", "storage-class"]\n optional = ["monitoring", "backup-agent", "log-aggregator"]\n conflicts = ["modern-erp"]\n\n # Services provided\n provides = ["erp-api", "erp-ui", "erp-reports", "erp-integration"]\n\n # Resource requirements\n resources = {\n cpu = "8"\n memory = "32Gi"\n disk = "2Ti"\n network = True\n privileged = True # Legacy systems often need privileged access\n }\n\n # Health checks\n health_checks = [\n {\n command = "curl -k https://localhost:9090/health"\n interval = 60\n timeout = 30\n retries = 3\n },\n {\n command = "sqlplus system/password@localhost:1521/XE <<< 'SELECT 1 FROM DUAL;'"\n interval = 300\n timeout = 60\n retries = 2\n }\n ]\n\n # Installation phases\n phases = [\n {\n name = "pre-install"\n order = 1\n parallel = False\n required = True\n },\n {\n name = "database-setup"\n order = 2\n parallel = False\n required = True\n },\n {\n name = "application-install"\n order = 3\n parallel = False\n required = True\n },\n {\n name = "integration-setup"\n order = 4\n parallel = True\n required = False\n },\n {\n name = "compliance-validation"\n order = 5\n parallel = False\n required = True\n }\n ]\n\n # Compatibility\n os_support = ["linux"]\n arch_support = ["amd64"]\n timeout = 3600 # 1 hour for legacy system deployment\n}\n\n# Default configuration\nlegacy_erp_default: LegacyERPTaskserv = {\n name = "legacy-erp"\n environment = "production"\n config = {\n erp_version = "12.2.0"\n installation_mode = "ha"\n\n database_type = "oracle"\n database_version = "19c"\n database_size = "1Ti"\n database_backup_retention = 30\n\n erp_port = 8080\n database_port = 1521\n ssl_enabled = True\n internal_network_only = True\n\n # Company-specific settings\n ldap_server = "ldap.company.com"\n file_share_path = "/mnt/company-files"\n email_server = "smtp.company.com"\n\n # Compliance settings\n audit_logging = True\n encryption_at_rest = True\n encryption_in_transit = True\n data_retention_years 
= 7\n\n # Production resources\n app_server_resources = erp_resource_profiles.production.app_server_resources\n database_resources = erp_resource_profiles.production.database_resources\n\n backup_schedule = "0 2 * * *"\n backup_retention_policy = {\n daily_backups = 7\n weekly_backups = 4\n monthly_backups = 12\n yearly_backups = 7\n }\n }\n}\n\n# Export for provisioning system\n{\n config: legacy_erp_default,\n dependencies: legacy_erp_dependencies,\n profiles: erp_resource_profiles\n}\n```\n\n### Compliance-Focused Taskserv\n\nCreate `compliance-monitor.ncl`:\n\n```\n"""\nCompliance Monitoring Taskserv\nAutomated compliance checking and reporting for regulated environments\n"""\n\nimport provisioning.lib as lib\nimport provisioning.dependencies as deps\n\nschema ComplianceMonitorConfig:\n """Configuration for compliance monitoring system"""\n\n # Compliance frameworks\n enabled_frameworks: [ComplianceFramework]\n\n # Monitoring settings\n scan_frequency: str = "0 0 * * *" # Daily\n real_time_monitoring: bool = True\n\n # Reporting settings\n report_frequency: str = "0 0 * * 0" # Weekly\n report_recipients: [str]\n report_format: "pdf" | "html" | "json" = "pdf"\n\n # Alerting configuration\n alert_severity_threshold: "low" | "medium" | "high" = "medium"\n alert_channels: [AlertChannel]\n\n # Data retention\n audit_log_retention_days: int = 2555 # 7 years\n report_retention_days: int = 365\n\n # Integration settings\n siem_integration: bool = True\n siem_endpoint?: str\n\n check:\n audit_log_retention_days >= 2555, "Audit logs must be retained for at least 7 years"\n len(report_recipients) > 0, "At least one report recipient required"\n\nschema ComplianceFramework:\n """Compliance framework configuration"""\n name: "pci-dss" | "sox" | "gdpr" | "hipaa" | "iso27001" | "nist"\n version: str\n enabled: bool = True\n custom_controls?: [ComplianceControl]\n\nschema ComplianceControl:\n """Custom compliance control"""\n id: str\n description: str\n check_command: str\n severity: "low" | "medium" | "high" | "critical"\n remediation_guidance: str\n\nschema AlertChannel:\n """Alert channel configuration"""\n type: "email" | "slack" | "teams" | "webhook" | "sms"\n endpoint: str\n severity_filter: ["low", "medium", "high", "critical"]\n\n# Taskserv definition\nschema ComplianceMonitorTaskserv(lib.TaskServDef):\n """Compliance Monitor Taskserv Definition"""\n name: str = "compliance-monitor"\n config: ComplianceMonitorConfig\n\n# Dependencies\ncompliance_monitor_dependencies: deps.TaskservDependencies = {\n name = "compliance-monitor"\n\n # Dependencies\n requires = ["kubernetes"]\n optional = ["monitoring", "logging", "backup"]\n provides = ["compliance-reports", "audit-logs", "compliance-api"]\n\n # Resource requirements\n resources = {\n cpu = "500m"\n memory = "1Gi"\n disk = "50Gi"\n network = True\n privileged = False\n }\n\n # Health checks\n health_checks = [\n {\n command = "curl -f http://localhost:9090/health"\n interval = 30\n timeout = 10\n retries = 3\n },\n {\n command = "compliance-check --dry-run"\n interval = 300\n timeout = 60\n retries = 1\n }\n ]\n\n # Compatibility\n os_support = ["linux"]\n arch_support = ["amd64", "arm64"]\n}\n\n# Default configuration with common compliance frameworks\ncompliance_monitor_default: ComplianceMonitorTaskserv = {\n name = "compliance-monitor"\n config = {\n enabled_frameworks = [\n {\n name = "pci-dss"\n version = "3.2.1"\n enabled = True\n },\n {\n name = "sox"\n version = "2002"\n enabled = True\n },\n {\n name = "gdpr"\n version = "2018"\n 
enabled = True\n }\n ]\n\n scan_frequency = "0 */6 * * *" # Every 6 hours\n real_time_monitoring = True\n\n report_frequency = "0 0 * * 1" # Weekly on Monday\n report_recipients = ["compliance@company.com", "security@company.com"]\n report_format = "pdf"\n\n alert_severity_threshold = "medium"\n alert_channels = [\n {\n type = "email"\n endpoint = "security-alerts@company.com"\n severity_filter = ["medium", "high", "critical"]\n },\n {\n type = "slack"\n endpoint = "https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX"\n severity_filter = ["high", "critical"]\n }\n ]\n\n audit_log_retention_days = 2555\n report_retention_days = 365\n\n siem_integration = True\n siem_endpoint = "https://siem.company.com/api/events"\n }\n}\n\n# Export configuration\n{\n config: compliance_monitor_default,\n dependencies: compliance_monitor_dependencies\n}\n```\n\n## Provider-Specific Extensions\n\n### Custom Cloud Provider Integration\n\nWhen working with specialized or private cloud providers:\n\n```\n# Create custom provider extension\nmkdir -p extensions/providers/company-private-cloud/nickel\ncd extensions/providers/company-private-cloud/nickel\n```\n\nCreate `provision_company-private-cloud.ncl`:\n\n```\n"""\nCompany Private Cloud Provider\nIntegration with company's private cloud infrastructure\n"""\n\nimport provisioning.defaults as defaults\nimport provisioning.server as server\n\nschema CompanyPrivateCloudConfig:\n """Company private cloud configuration"""\n\n # API configuration\n api_endpoint: str = "https://cloud-api.company.com"\n api_version: str = "v2"\n auth_token: str\n\n # Network configuration\n management_network: str = "10.0.0.0/24"\n production_network: str = "10.1.0.0/16"\n dmz_network: str = "10.2.0.0/24"\n\n # Resource pools\n compute_cluster: str = "production-cluster"\n storage_cluster: str = "storage-cluster"\n\n # Compliance settings\n encryption_required: bool = True\n audit_all_operations: bool = True\n\n # Company-specific settings\n cost_center: str\n department: str\n project_code: str\n\n check:\n len(api_endpoint) > 0, "API endpoint required"\n len(auth_token) > 0, "Authentication token required"\n len(cost_center) > 0, "Cost center required for billing"\n\nschema CompanyPrivateCloudServer(server.Server):\n """Server configuration for company private cloud"""\n\n # Instance configuration\n instance_class: "standard" | "compute-optimized" | "memory-optimized" | "storage-optimized" = "standard"\n instance_size: "small" | "medium" | "large" | "xlarge" | "2xlarge" = "medium"\n\n # Storage configuration\n root_disk_type: "ssd" | "nvme" | "spinning" = "ssd"\n root_disk_size: int = 50\n additional_storage?: [CompanyCloudStorage]\n\n # Network configuration\n network_segment: "management" | "production" | "dmz" = "production"\n security_groups: [str] = ["default"]\n\n # Compliance settings\n encrypted_storage: bool = True\n backup_enabled: bool = True\n monitoring_enabled: bool = True\n\n # Company metadata\n cost_center: str\n department: str\n project_code: str\n environment: "dev" | "test" | "staging" | "prod" = "prod"\n\n check:\n root_disk_size >= 20, "Root disk must be at least 20 GB"\n len(cost_center) > 0, "Cost center required"\n len(department) > 0, "Department required"\n\nschema CompanyCloudStorage:\n """Additional storage configuration"""\n size: int\n type: "ssd" | "nvme" | "spinning" | "archive" = "ssd"\n mount_point: str\n encrypted: bool = True\n backup_enabled: bool = True\n\n# Instance size configurations\ninstance_specs = {\n "small": 
{\n vcpus = 2\n memory_gb = 4\n network_performance = "moderate"\n },\n "medium": {\n vcpus = 4\n memory_gb = 8\n network_performance = "good"\n },\n "large": {\n vcpus = 8\n memory_gb = 16\n network_performance = "high"\n },\n "xlarge": {\n vcpus = 16\n memory_gb = 32\n network_performance = "high"\n },\n "2xlarge": {\n vcpus = 32\n memory_gb = 64\n network_performance = "very-high"\n }\n}\n\n# Provider defaults\ncompany_private_cloud_defaults: defaults.ServerDefaults = {\n lock = False\n time_zone = "UTC"\n running_wait = 20\n running_timeout = 600 # Private cloud may be slower\n\n # Company-specific OS image\n storage_os_find = "name: company-ubuntu-20.04-hardened | arch: x86_64"\n\n # Network settings\n network_utility_ipv4 = True\n network_public_ipv4 = False # Private cloud, no public IPs\n\n # Security settings\n user = "company-admin"\n user_ssh_port = 22\n fix_local_hosts = True\n\n # Company metadata\n labels = "provider: company-private-cloud, compliance: required"\n}\n\n# Export provider configuration\n{\n config: CompanyPrivateCloudConfig,\n server: CompanyPrivateCloudServer,\n defaults: company_private_cloud_defaults,\n instance_specs: instance_specs\n}\n```\n\n## Multi-Environment Management\n\n### Environment-Specific Configuration Management\n\nCreate environment-specific extensions that handle different deployment patterns:\n\n```\n# Create environment management extension\nmkdir -p extensions/clusters/company-environments/nickel\ncd extensions/clusters/company-environments/nickel\n```\n\nCreate `company-environments.ncl`:\n\n```\n"""\nCompany Environment Management\nStandardized environment configurations for different deployment stages\n"""\n\nimport provisioning.cluster as cluster\nimport provisioning.server as server\n\nschema CompanyEnvironment:\n """Standard company environment configuration"""\n\n # Environment metadata\n name: str\n type: "development" | "testing" | "staging" | "production" | "disaster-recovery"\n region: str\n availability_zones: [str]\n\n # Network configuration\n vpc_cidr: str\n subnet_configuration: SubnetConfiguration\n\n # Security configuration\n security_profile: SecurityProfile\n\n # Compliance requirements\n compliance_level: "basic" | "standard" | "high" | "critical"\n data_classification: "public" | "internal" | "confidential" | "restricted"\n\n # Resource constraints\n resource_limits: ResourceLimits\n\n # Backup and DR configuration\n backup_configuration: BackupConfiguration\n disaster_recovery_configuration?: DRConfiguration\n\n # Monitoring and alerting\n monitoring_level: "basic" | "standard" | "enhanced"\n alert_routing: AlertRouting\n\nschema SubnetConfiguration:\n """Network subnet configuration"""\n public_subnets: [str]\n private_subnets: [str]\n database_subnets: [str]\n management_subnets: [str]\n\nschema SecurityProfile:\n """Security configuration profile"""\n encryption_at_rest: bool\n encryption_in_transit: bool\n network_isolation: bool\n access_logging: bool\n vulnerability_scanning: bool\n\n # Access control\n multi_factor_auth: bool\n privileged_access_management: bool\n network_segmentation: bool\n\n # Compliance controls\n audit_logging: bool\n data_loss_prevention: bool\n endpoint_protection: bool\n\nschema ResourceLimits:\n """Resource allocation limits for environment"""\n max_cpu_cores: int\n max_memory_gb: int\n max_storage_tb: int\n max_instances: int\n\n # Cost controls\n max_monthly_cost: int\n cost_alerts_enabled: bool\n\nschema BackupConfiguration:\n """Backup configuration for environment"""\n 
backup_frequency: str\n retention_policy: {str: int}\n cross_region_backup: bool\n encryption_enabled: bool\n\nschema DRConfiguration:\n """Disaster recovery configuration"""\n dr_region: str\n rto_minutes: int # Recovery Time Objective\n rpo_minutes: int # Recovery Point Objective\n automated_failover: bool\n\nschema AlertRouting:\n """Alert routing configuration"""\n business_hours_contacts: [str]\n after_hours_contacts: [str]\n escalation_policy: [EscalationLevel]\n\nschema EscalationLevel:\n """Alert escalation level"""\n level: int\n delay_minutes: int\n contacts: [str]\n\n# Environment templates\nenvironment_templates = {\n "development": {\n type = "development"\n compliance_level = "basic"\n data_classification = "internal"\n security_profile = {\n encryption_at_rest = False\n encryption_in_transit = False\n network_isolation = False\n access_logging = True\n vulnerability_scanning = False\n multi_factor_auth = False\n privileged_access_management = False\n network_segmentation = False\n audit_logging = False\n data_loss_prevention = False\n endpoint_protection = False\n }\n resource_limits = {\n max_cpu_cores = 50\n max_memory_gb = 200\n max_storage_tb = 10\n max_instances = 20\n max_monthly_cost = 5000\n cost_alerts_enabled = True\n }\n monitoring_level = "basic"\n },\n\n "production": {\n type = "production"\n compliance_level = "critical"\n data_classification = "confidential"\n security_profile = {\n encryption_at_rest = True\n encryption_in_transit = True\n network_isolation = True\n access_logging = True\n vulnerability_scanning = True\n multi_factor_auth = True\n privileged_access_management = True\n network_segmentation = True\n audit_logging = True\n data_loss_prevention = True\n endpoint_protection = True\n }\n resource_limits = {\n max_cpu_cores = 1000\n max_memory_gb = 4000\n max_storage_tb = 500\n max_instances = 200\n max_monthly_cost = 100000\n cost_alerts_enabled = True\n }\n monitoring_level = "enhanced"\n disaster_recovery_configuration = {\n dr_region = "us-west-2"\n rto_minutes = 60\n rpo_minutes = 15\n automated_failover = True\n }\n }\n}\n\n# Export environment templates\n{\n templates: environment_templates,\n schema: CompanyEnvironment\n}\n```\n\n## Integration Patterns\n\n### Legacy System Integration\n\nCreate integration patterns for common legacy system scenarios:\n\n```\n# Create integration patterns\nmkdir -p extensions/taskservs/integrations/legacy-bridge/nickel\ncd extensions/taskservs/integrations/legacy-bridge/nickel\n```\n\nCreate `legacy-bridge.ncl`:\n\n```\n"""\nLegacy System Integration Bridge\nProvides standardized integration patterns for legacy systems\n"""\n\nimport provisioning.lib as lib\nimport provisioning.dependencies as deps\n\nschema LegacyBridgeConfig:\n """Configuration for legacy system integration bridge"""\n\n # Bridge configuration\n bridge_name: str\n integration_type: "api" | "database" | "file" | "message-queue" | "etl"\n\n # Legacy system details\n legacy_system: LegacySystemInfo\n\n # Modern system details\n modern_system: ModernSystemInfo\n\n # Data transformation configuration\n data_transformation: DataTransformationConfig\n\n # Security configuration\n security_config: IntegrationSecurityConfig\n\n # Monitoring and alerting\n monitoring_config: IntegrationMonitoringConfig\n\nschema LegacySystemInfo:\n """Legacy system information"""\n name: str\n type: "mainframe" | "as400" | "unix" | "windows" | "database" | "file-system"\n version: str\n\n # Connection details\n connection_method: "direct" | "vpn" | "dedicated-line" 
| "api-gateway"\n endpoint: str\n port?: int\n\n # Authentication\n auth_method: "password" | "certificate" | "kerberos" | "ldap" | "token"\n credentials_source: "vault" | "config" | "environment"\n\n # Data characteristics\n data_format: "fixed-width" | "csv" | "xml" | "json" | "binary" | "proprietary"\n character_encoding: str = "utf-8"\n\n # Operational characteristics\n availability_hours: str = "24/7"\n maintenance_windows: [MaintenanceWindow]\n\nschema ModernSystemInfo:\n """Modern system information"""\n name: str\n type: "microservice" | "api" | "database" | "event-stream" | "file-store"\n\n # Connection details\n endpoint: str\n api_version?: str\n\n # Data format\n data_format: "json" | "xml" | "avro" | "protobuf"\n\n # Authentication\n auth_method: "oauth2" | "jwt" | "api-key" | "mutual-tls"\n\nschema DataTransformationConfig:\n """Data transformation configuration"""\n transformation_rules: [TransformationRule]\n error_handling: ErrorHandlingConfig\n data_validation: DataValidationConfig\n\nschema TransformationRule:\n """Individual data transformation rule"""\n source_field: str\n target_field: str\n transformation_type: "direct" | "calculated" | "lookup" | "conditional"\n transformation_expression?: str\n\nschema ErrorHandlingConfig:\n """Error handling configuration"""\n retry_policy: RetryPolicy\n dead_letter_queue: bool = True\n error_notification: bool = True\n\nschema RetryPolicy:\n """Retry policy configuration"""\n max_attempts: int = 3\n initial_delay_seconds: int = 5\n backoff_multiplier: float = 2.0\n max_delay_seconds: int = 300\n\nschema DataValidationConfig:\n """Data validation configuration"""\n schema_validation: bool = True\n business_rules_validation: bool = True\n data_quality_checks: [DataQualityCheck]\n\nschema DataQualityCheck:\n """Data quality check definition"""\n name: str\n check_type: "completeness" | "uniqueness" | "validity" | "consistency"\n threshold: float = 0.95\n action_on_failure: "warn" | "stop" | "quarantine"\n\nschema IntegrationSecurityConfig:\n """Security configuration for integration"""\n encryption_in_transit: bool = True\n encryption_at_rest: bool = True\n\n # Access control\n source_ip_whitelist?: [str]\n api_rate_limiting: bool = True\n\n # Audit and compliance\n audit_all_transactions: bool = True\n pii_data_handling: PIIHandlingConfig\n\nschema PIIHandlingConfig:\n """PII data handling configuration"""\n pii_fields: [str]\n anonymization_enabled: bool = True\n retention_policy_days: int = 365\n\nschema IntegrationMonitoringConfig:\n """Monitoring configuration for integration"""\n metrics_collection: bool = True\n performance_monitoring: bool = True\n\n # SLA monitoring\n sla_targets: SLATargets\n\n # Alerting\n alert_on_failures: bool = True\n alert_on_performance_degradation: bool = True\n\nschema SLATargets:\n """SLA targets for integration"""\n max_latency_ms: int = 5000\n min_availability_percent: float = 99.9\n max_error_rate_percent: float = 0.1\n\nschema MaintenanceWindow:\n """Maintenance window definition"""\n day_of_week: int # 0=Sunday, 6=Saturday\n start_time: str # HH:MM format\n duration_hours: int\n\n# Taskserv definition\nschema LegacyBridgeTaskserv(lib.TaskServDef):\n """Legacy Bridge Taskserv Definition"""\n name: str = "legacy-bridge"\n config: LegacyBridgeConfig\n\n# Dependencies\nlegacy_bridge_dependencies: deps.TaskservDependencies = {\n name = "legacy-bridge"\n\n requires = ["kubernetes"]\n optional = ["monitoring", "logging", "vault"]\n provides = ["legacy-integration", "data-bridge"]\n\n resources = 
{\n cpu = "500m"\n memory = "1Gi"\n disk = "10Gi"\n network = True\n privileged = False\n }\n\n health_checks = [\n {\n command = "curl -f http://localhost:9090/health"\n interval = 30\n timeout = 10\n retries = 3\n },\n {\n command = "integration-test --quick"\n interval = 300\n timeout = 120\n retries = 1\n }\n ]\n\n os_support = ["linux"]\n arch_support = ["amd64", "arm64"]\n}\n\n# Export configuration\n{\n config: LegacyBridgeTaskserv,\n dependencies: legacy_bridge_dependencies\n}\n```\n\n## Real-World Examples\n\n### Example 1: Financial Services Company\n\n```\n# Financial services specific extensions\nmkdir -p extensions/taskservs/financial-services/{trading-system,risk-engine,compliance-reporter}/nickel\n```\n\n### Example 2: Healthcare Organization\n\n```\n# Healthcare specific extensions\nmkdir -p extensions/taskservs/healthcare/{hl7-processor,dicom-storage,hipaa-audit}/nickel\n```\n\n### Example 3: Manufacturing Company\n\n```\n# Manufacturing specific extensions\nmkdir -p extensions/taskservs/manufacturing/{iot-gateway,scada-bridge,quality-system}/nickel\n```\n\n### Usage Examples\n\n#### Loading Infrastructure-Specific Extensions\n\n```\n# Load company-specific extensions\ncd workspace/infra/production\nmodule-loader load taskservs . [legacy-erp, compliance-monitor, legacy-bridge]\nmodule-loader load providers . [company-private-cloud]\nmodule-loader load clusters . [company-environments]\n\n# Verify loading\nmodule-loader list taskservs .\nmodule-loader validate .\n```\n\n#### Using in Server Configuration\n\n```\n# Import loaded extensions\nimport .taskservs.legacy-erp.legacy-erp as erp\nimport .taskservs.compliance-monitor.compliance-monitor as compliance\nimport .providers.company-private-cloud as private_cloud\n\n# Configure servers with company-specific extensions\ncompany_servers: [server.Server] = [\n {\n hostname = "erp-prod-01"\n title = "Production ERP Server"\n\n # Use company private cloud\n # Provider-specific configuration goes here\n\n taskservs = [\n {\n name = "legacy-erp"\n profile = "production"\n },\n {\n name = "compliance-monitor"\n profile = "default"\n }\n ]\n }\n]\n```\n\nThis comprehensive guide covers all aspects of creating infrastructure-specific extensions, from assessment and planning to implementation and deployment. +# Infrastructure-Specific Extension Development + +This guide focuses on creating extensions tailored to specific infrastructure requirements, business needs, and organizational constraints. + +## Table of Contents + +1. [Overview](#overview) +2. [Infrastructure Assessment](#infrastructure-assessment) +3. [Custom Taskserv Development](#custom-taskserv-development) +4. [Provider-Specific Extensions](#provider-specific-extensions) +5. [Multi-Environment Management](#multi-environment-management) +6. [Integration Patterns](#integration-patterns) +7. [Real-World Examples](#real-world-examples) + +## Overview + +Infrastructure-specific extensions address unique requirements that generic modules cannot cover: + +- **Company-specific applications and services** +- **Compliance and security requirements** +- **Legacy system integrations** +- **Custom networking configurations** +- **Specialized monitoring and alerting** +- **Multi-cloud and hybrid deployments** + +## Infrastructure Assessment + +### Identifying Extension Needs + +Before creating custom extensions, assess your infrastructure requirements: + +#### 1. 
Application Inventory + +```text +# Document existing applications +cat > infrastructure-assessment.yaml << EOF +applications: + - name: "legacy-billing-system" + type: "monolith" + runtime: "java-8" + database: "oracle-11g" + integrations: ["ldap", "file-storage", "email"] + compliance: ["pci-dss", "sox"] + + - name: "customer-portal" + type: "microservices" + runtime: "nodejs-16" + database: "postgresql-13" + integrations: ["redis", "elasticsearch", "s3"] + compliance: ["gdpr", "hipaa"] + +infrastructure: + - type: "on-premise" + location: "datacenter-primary" + capabilities: ["kubernetes", "vmware", "storage-array"] + + - type: "cloud" + provider: "aws" + regions: ["us-east-1", "eu-west-1"] + services: ["eks", "rds", "s3", "cloudfront"] + +compliance_requirements: + - "PCI DSS Level 1" + - "SOX compliance" + - "GDPR data protection" + - "HIPAA safeguards" + +network_requirements: + - "air-gapped environments" + - "private subnet isolation" + - "vpn connectivity" + - "load balancer integration" +EOF +``` + +#### 2. Gap Analysis + +```text +# Analyze what standard modules don't cover +./provisioning/core/cli/module-loader discover taskservs > available-modules.txt + +# Create gap analysis +cat > gap-analysis.md << EOF +# Infrastructure Gap Analysis + +## Standard Modules Available +$(cat available-modules.txt) + +## Missing Capabilities +- [ ] Legacy Oracle database integration +- [ ] Company-specific LDAP authentication +- [ ] Custom monitoring for legacy systems +- [ ] Compliance reporting automation +- [ ] Air-gapped deployment workflows +- [ ] Multi-datacenter replication + +## Custom Extensions Needed +1. **oracle-db-taskserv**: Oracle database with company settings +2. **company-ldap-taskserv**: LDAP integration with custom schema +3. **compliance-monitor-taskserv**: Automated compliance checking +4. **airgap-deployment-cluster**: Air-gapped deployment patterns +5. 
**company-monitoring-taskserv**: Custom monitoring dashboard +EOF +``` + +### Requirements Gathering + +#### Business Requirements Template + +```text +""" +Business Requirements Schema for Custom Extensions +Use this template to document requirements before development +""" + +schema BusinessRequirements: + """Document business requirements for custom extensions""" + + # Project information + project_name: str + stakeholders: [str] + timeline: str + budget_constraints?: str + + # Functional requirements + functional_requirements: [FunctionalRequirement] + + # Non-functional requirements + performance_requirements: PerformanceRequirements + security_requirements: SecurityRequirements + compliance_requirements: [str] + + # Integration requirements + existing_systems: [ExistingSystem] + required_integrations: [Integration] + + # Operational requirements + monitoring_requirements: [str] + backup_requirements: [str] + disaster_recovery_requirements: [str] + +schema FunctionalRequirement: + id: str + description: str + priority: "high" | "medium" | "low" + acceptance_criteria: [str] + +schema PerformanceRequirements: + max_response_time: str + throughput_requirements: str + availability_target: str + scalability_requirements: str + +schema SecurityRequirements: + authentication_method: str + authorization_model: str + encryption_requirements: [str] + audit_requirements: [str] + network_security: [str] + +schema ExistingSystem: + name: str + type: str + version: str + api_available: bool + integration_method: str + +schema Integration: + target_system: str + integration_type: "api" | "database" | "file" | "message_queue" + data_format: str + frequency: str + direction: "inbound" | "outbound" | "bidirectional" +``` + +## Custom Taskserv Development + +### Company-Specific Application Taskserv + +#### Example: Legacy ERP System Integration + +```text +# Create company-specific taskserv +mkdir -p extensions/taskservs/company-specific/legacy-erp/nickel +cd extensions/taskservs/company-specific/legacy-erp/nickel +``` + +Create `legacy-erp.ncl`: + +```text +""" +Legacy ERP System Taskserv +Handles deployment and management of company's legacy ERP system +""" + +import provisioning.lib as lib +import provisioning.dependencies as deps +import provisioning.defaults as defaults + +# ERP system configuration +schema LegacyERPConfig: + """Configuration for legacy ERP system""" + + # Application settings + erp_version: str = "12.2.0" + installation_mode: "standalone" | "cluster" | "ha" = "ha" + + # Database configuration + database_type: "oracle" | "sqlserver" = "oracle" + database_version: str = "19c" + database_size: str = "500Gi" + database_backup_retention: int = 30 + + # Network configuration + erp_port: int = 8080 + database_port: int = 1521 + ssl_enabled: bool = True + internal_network_only: bool = True + + # Integration settings + ldap_server: str + file_share_path: str + email_server: str + + # Compliance settings + audit_logging: bool = True + encryption_at_rest: bool = True + encryption_in_transit: bool = True + data_retention_years: int = 7 + + # Resource allocation + app_server_resources: ERPResourceConfig + database_resources: ERPResourceConfig + + # Backup configuration + backup_schedule: str = "0 2 * * *" # Daily at 2 AM + backup_retention_policy: BackupRetentionPolicy + + check: + erp_port > 0 and erp_port < 65536, "ERP port must be valid" + database_port > 0 and database_port < 65536, "Database port must be valid" + data_retention_years > 0, "Data retention must be positive" + 
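+        # Check clauses are conjunctive: every clause must hold, and each
+        # failure reports its own message. An additional guard in the same
+        # style might be (illustrative only, not part of the original schema):
+        #   database_backup_retention > 0, "Backup retention must be positive"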
len(ldap_server) > 0, "LDAP server required" + +schema ERPResourceConfig: + """Resource configuration for ERP components""" + cpu_request: str + memory_request: str + cpu_limit: str + memory_limit: str + storage_size: str + storage_class: str = "fast-ssd" + +schema BackupRetentionPolicy: + """Backup retention policy for ERP system""" + daily_backups: int = 7 + weekly_backups: int = 4 + monthly_backups: int = 12 + yearly_backups: int = 7 + +# Environment-specific resource configurations +erp_resource_profiles = { + "development": { + app_server_resources = { + cpu_request = "1" + memory_request = "4Gi" + cpu_limit = "2" + memory_limit = "8Gi" + storage_size = "50Gi" + storage_class = "standard" + } + database_resources = { + cpu_request = "2" + memory_request = "8Gi" + cpu_limit = "4" + memory_limit = "16Gi" + storage_size = "100Gi" + storage_class = "standard" + } + }, + "production": { + app_server_resources = { + cpu_request = "4" + memory_request = "16Gi" + cpu_limit = "8" + memory_limit = "32Gi" + storage_size = "200Gi" + storage_class = "fast-ssd" + } + database_resources = { + cpu_request = "8" + memory_request = "32Gi" + cpu_limit = "16" + memory_limit = "64Gi" + storage_size = "2Ti" + storage_class = "fast-ssd" + } + } +} + +# Taskserv definition +schema LegacyERPTaskserv(lib.TaskServDef): + """Legacy ERP Taskserv Definition""" + name: str = "legacy-erp" + config: LegacyERPConfig + environment: "development" | "staging" | "production" + +# Dependencies for legacy ERP +legacy_erp_dependencies: deps.TaskservDependencies = { + name = "legacy-erp" + + # Infrastructure dependencies + requires = ["kubernetes", "storage-class"] + optional = ["monitoring", "backup-agent", "log-aggregator"] + conflicts = ["modern-erp"] + + # Services provided + provides = ["erp-api", "erp-ui", "erp-reports", "erp-integration"] + + # Resource requirements + resources = { + cpu = "8" + memory = "32Gi" + disk = "2Ti" + network = True + privileged = True # Legacy systems often need privileged access + } + + # Health checks + health_checks = [ + { + command = "curl -k https://localhost:9090/health" + interval = 60 + timeout = 30 + retries = 3 + }, + { + command = "sqlplus system/password@localhost:1521/XE <<< 'SELECT 1 FROM DUAL;'" + interval = 300 + timeout = 60 + retries = 2 + } + ] + + # Installation phases + phases = [ + { + name = "pre-install" + order = 1 + parallel = False + required = True + }, + { + name = "database-setup" + order = 2 + parallel = False + required = True + }, + { + name = "application-install" + order = 3 + parallel = False + required = True + }, + { + name = "integration-setup" + order = 4 + parallel = True + required = False + }, + { + name = "compliance-validation" + order = 5 + parallel = False + required = True + } + ] + + # Compatibility + os_support = ["linux"] + arch_support = ["amd64"] + timeout = 3600 # 1 hour for legacy system deployment +} + +# Default configuration +legacy_erp_default: LegacyERPTaskserv = { + name = "legacy-erp" + environment = "production" + config = { + erp_version = "12.2.0" + installation_mode = "ha" + + database_type = "oracle" + database_version = "19c" + database_size = "1Ti" + database_backup_retention = 30 + + erp_port = 8080 + database_port = 1521 + ssl_enabled = True + internal_network_only = True + + # Company-specific settings + ldap_server = "ldap.company.com" + file_share_path = "/mnt/company-files" + email_server = "smtp.company.com" + + # Compliance settings + audit_logging = True + encryption_at_rest = True + encryption_in_transit = True + 
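+        # data_retention_years below must satisfy the schema check above
+        # (data_retention_years > 0); 7 years mirrors the 7-year audit
+        # retention used by the compliance-monitor taskserv later in this guide.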
data_retention_years = 7 + + # Production resources + app_server_resources = erp_resource_profiles.production.app_server_resources + database_resources = erp_resource_profiles.production.database_resources + + backup_schedule = "0 2 * * *" + backup_retention_policy = { + daily_backups = 7 + weekly_backups = 4 + monthly_backups = 12 + yearly_backups = 7 + } + } +} + +# Export for provisioning system +{ + config: legacy_erp_default, + dependencies: legacy_erp_dependencies, + profiles: erp_resource_profiles +} +``` + +### Compliance-Focused Taskserv + +Create `compliance-monitor.ncl`: + +```text +""" +Compliance Monitoring Taskserv +Automated compliance checking and reporting for regulated environments +""" + +import provisioning.lib as lib +import provisioning.dependencies as deps + +schema ComplianceMonitorConfig: + """Configuration for compliance monitoring system""" + + # Compliance frameworks + enabled_frameworks: [ComplianceFramework] + + # Monitoring settings + scan_frequency: str = "0 0 * * *" # Daily + real_time_monitoring: bool = True + + # Reporting settings + report_frequency: str = "0 0 * * 0" # Weekly + report_recipients: [str] + report_format: "pdf" | "html" | "json" = "pdf" + + # Alerting configuration + alert_severity_threshold: "low" | "medium" | "high" = "medium" + alert_channels: [AlertChannel] + + # Data retention + audit_log_retention_days: int = 2555 # 7 years + report_retention_days: int = 365 + + # Integration settings + siem_integration: bool = True + siem_endpoint?: str + + check: + audit_log_retention_days >= 2555, "Audit logs must be retained for at least 7 years" + len(report_recipients) > 0, "At least one report recipient required" + +schema ComplianceFramework: + """Compliance framework configuration""" + name: "pci-dss" | "sox" | "gdpr" | "hipaa" | "iso27001" | "nist" + version: str + enabled: bool = True + custom_controls?: [ComplianceControl] + +schema ComplianceControl: + """Custom compliance control""" + id: str + description: str + check_command: str + severity: "low" | "medium" | "high" | "critical" + remediation_guidance: str + +schema AlertChannel: + """Alert channel configuration""" + type: "email" | "slack" | "teams" | "webhook" | "sms" + endpoint: str + severity_filter: ["low", "medium", "high", "critical"] + +# Taskserv definition +schema ComplianceMonitorTaskserv(lib.TaskServDef): + """Compliance Monitor Taskserv Definition""" + name: str = "compliance-monitor" + config: ComplianceMonitorConfig + +# Dependencies +compliance_monitor_dependencies: deps.TaskservDependencies = { + name = "compliance-monitor" + + # Dependencies + requires = ["kubernetes"] + optional = ["monitoring", "logging", "backup"] + provides = ["compliance-reports", "audit-logs", "compliance-api"] + + # Resource requirements + resources = { + cpu = "500m" + memory = "1Gi" + disk = "50Gi" + network = True + privileged = False + } + + # Health checks + health_checks = [ + { + command = "curl -f http://localhost:9090/health" + interval = 30 + timeout = 10 + retries = 3 + }, + { + command = "compliance-check --dry-run" + interval = 300 + timeout = 60 + retries = 1 + } + ] + + # Compatibility + os_support = ["linux"] + arch_support = ["amd64", "arm64"] +} + +# Default configuration with common compliance frameworks +compliance_monitor_default: ComplianceMonitorTaskserv = { + name = "compliance-monitor" + config = { + enabled_frameworks = [ + { + name = "pci-dss" + version = "3.2.1" + enabled = True + }, + { + name = "sox" + version = "2002" + enabled = True + }, + { + name = "gdpr" + 
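+            # name must be one of the ComplianceFramework literals defined above
+            # ("pci-dss" | "sox" | "gdpr" | "hipaa" | "iso27001" | "nist")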
version = "2018" + enabled = True + } + ] + + scan_frequency = "0 */6 * * *" # Every 6 hours + real_time_monitoring = True + + report_frequency = "0 0 * * 1" # Weekly on Monday + report_recipients = ["compliance@company.com", "security@company.com"] + report_format = "pdf" + + alert_severity_threshold = "medium" + alert_channels = [ + { + type = "email" + endpoint = "security-alerts@company.com" + severity_filter = ["medium", "high", "critical"] + }, + { + type = "slack" + endpoint = "https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX" + severity_filter = ["high", "critical"] + } + ] + + audit_log_retention_days = 2555 + report_retention_days = 365 + + siem_integration = True + siem_endpoint = "https://siem.company.com/api/events" + } +} + +# Export configuration +{ + config: compliance_monitor_default, + dependencies: compliance_monitor_dependencies +} +``` + +## Provider-Specific Extensions + +### Custom Cloud Provider Integration + +When working with specialized or private cloud providers: + +```text +# Create custom provider extension +mkdir -p extensions/providers/company-private-cloud/nickel +cd extensions/providers/company-private-cloud/nickel +``` + +Create `provision_company-private-cloud.ncl`: + +```text +""" +Company Private Cloud Provider +Integration with company's private cloud infrastructure +""" + +import provisioning.defaults as defaults +import provisioning.server as server + +schema CompanyPrivateCloudConfig: + """Company private cloud configuration""" + + # API configuration + api_endpoint: str = "https://cloud-api.company.com" + api_version: str = "v2" + auth_token: str + + # Network configuration + management_network: str = "10.0.0.0/24" + production_network: str = "10.1.0.0/16" + dmz_network: str = "10.2.0.0/24" + + # Resource pools + compute_cluster: str = "production-cluster" + storage_cluster: str = "storage-cluster" + + # Compliance settings + encryption_required: bool = True + audit_all_operations: bool = True + + # Company-specific settings + cost_center: str + department: str + project_code: str + + check: + len(api_endpoint) > 0, "API endpoint required" + len(auth_token) > 0, "Authentication token required" + len(cost_center) > 0, "Cost center required for billing" + +schema CompanyPrivateCloudServer(server.Server): + """Server configuration for company private cloud""" + + # Instance configuration + instance_class: "standard" | "compute-optimized" | "memory-optimized" | "storage-optimized" = "standard" + instance_size: "small" | "medium" | "large" | "xlarge" | "2xlarge" = "medium" + + # Storage configuration + root_disk_type: "ssd" | "nvme" | "spinning" = "ssd" + root_disk_size: int = 50 + additional_storage?: [CompanyCloudStorage] + + # Network configuration + network_segment: "management" | "production" | "dmz" = "production" + security_groups: [str] = ["default"] + + # Compliance settings + encrypted_storage: bool = True + backup_enabled: bool = True + monitoring_enabled: bool = True + + # Company metadata + cost_center: str + department: str + project_code: str + environment: "dev" | "test" | "staging" | "prod" = "prod" + + check: + root_disk_size >= 20, "Root disk must be at least 20 GB" + len(cost_center) > 0, "Cost center required" + len(department) > 0, "Department required" + +schema CompanyCloudStorage: + """Additional storage configuration""" + size: int + type: "ssd" | "nvme" | "spinning" | "archive" = "ssd" + mount_point: str + encrypted: bool = True + backup_enabled: bool = True + +# Instance size configurations 
+instance_specs = { + "small": { + vcpus = 2 + memory_gb = 4 + network_performance = "moderate" + }, + "medium": { + vcpus = 4 + memory_gb = 8 + network_performance = "good" + }, + "large": { + vcpus = 8 + memory_gb = 16 + network_performance = "high" + }, + "xlarge": { + vcpus = 16 + memory_gb = 32 + network_performance = "high" + }, + "2xlarge": { + vcpus = 32 + memory_gb = 64 + network_performance = "very-high" + } +} + +# Provider defaults +company_private_cloud_defaults: defaults.ServerDefaults = { + lock = False + time_zone = "UTC" + running_wait = 20 + running_timeout = 600 # Private cloud may be slower + + # Company-specific OS image + storage_os_find = "name: company-ubuntu-20.04-hardened | arch: x86_64" + + # Network settings + network_utility_ipv4 = True + network_public_ipv4 = False # Private cloud, no public IPs + + # Security settings + user = "company-admin" + user_ssh_port = 22 + fix_local_hosts = True + + # Company metadata + labels = "provider: company-private-cloud, compliance: required" +} + +# Export provider configuration +{ + config: CompanyPrivateCloudConfig, + server: CompanyPrivateCloudServer, + defaults: company_private_cloud_defaults, + instance_specs: instance_specs +} +``` + +## Multi-Environment Management + +### Environment-Specific Configuration Management + +Create environment-specific extensions that handle different deployment patterns: + +```text +# Create environment management extension +mkdir -p extensions/clusters/company-environments/nickel +cd extensions/clusters/company-environments/nickel +``` + +Create `company-environments.ncl`: + +```text +""" +Company Environment Management +Standardized environment configurations for different deployment stages +""" + +import provisioning.cluster as cluster +import provisioning.server as server + +schema CompanyEnvironment: + """Standard company environment configuration""" + + # Environment metadata + name: str + type: "development" | "testing" | "staging" | "production" | "disaster-recovery" + region: str + availability_zones: [str] + + # Network configuration + vpc_cidr: str + subnet_configuration: SubnetConfiguration + + # Security configuration + security_profile: SecurityProfile + + # Compliance requirements + compliance_level: "basic" | "standard" | "high" | "critical" + data_classification: "public" | "internal" | "confidential" | "restricted" + + # Resource constraints + resource_limits: ResourceLimits + + # Backup and DR configuration + backup_configuration: BackupConfiguration + disaster_recovery_configuration?: DRConfiguration + + # Monitoring and alerting + monitoring_level: "basic" | "standard" | "enhanced" + alert_routing: AlertRouting + +schema SubnetConfiguration: + """Network subnet configuration""" + public_subnets: [str] + private_subnets: [str] + database_subnets: [str] + management_subnets: [str] + +schema SecurityProfile: + """Security configuration profile""" + encryption_at_rest: bool + encryption_in_transit: bool + network_isolation: bool + access_logging: bool + vulnerability_scanning: bool + + # Access control + multi_factor_auth: bool + privileged_access_management: bool + network_segmentation: bool + + # Compliance controls + audit_logging: bool + data_loss_prevention: bool + endpoint_protection: bool + +schema ResourceLimits: + """Resource allocation limits for environment""" + max_cpu_cores: int + max_memory_gb: int + max_storage_tb: int + max_instances: int + + # Cost controls + max_monthly_cost: int + cost_alerts_enabled: bool + +schema BackupConfiguration: + """Backup 
configuration for environment""" + backup_frequency: str + retention_policy: {str: int} + cross_region_backup: bool + encryption_enabled: bool + +schema DRConfiguration: + """Disaster recovery configuration""" + dr_region: str + rto_minutes: int # Recovery Time Objective + rpo_minutes: int # Recovery Point Objective + automated_failover: bool + +schema AlertRouting: + """Alert routing configuration""" + business_hours_contacts: [str] + after_hours_contacts: [str] + escalation_policy: [EscalationLevel] + +schema EscalationLevel: + """Alert escalation level""" + level: int + delay_minutes: int + contacts: [str] + +# Environment templates +environment_templates = { + "development": { + type = "development" + compliance_level = "basic" + data_classification = "internal" + security_profile = { + encryption_at_rest = False + encryption_in_transit = False + network_isolation = False + access_logging = True + vulnerability_scanning = False + multi_factor_auth = False + privileged_access_management = False + network_segmentation = False + audit_logging = False + data_loss_prevention = False + endpoint_protection = False + } + resource_limits = { + max_cpu_cores = 50 + max_memory_gb = 200 + max_storage_tb = 10 + max_instances = 20 + max_monthly_cost = 5000 + cost_alerts_enabled = True + } + monitoring_level = "basic" + }, + + "production": { + type = "production" + compliance_level = "critical" + data_classification = "confidential" + security_profile = { + encryption_at_rest = True + encryption_in_transit = True + network_isolation = True + access_logging = True + vulnerability_scanning = True + multi_factor_auth = True + privileged_access_management = True + network_segmentation = True + audit_logging = True + data_loss_prevention = True + endpoint_protection = True + } + resource_limits = { + max_cpu_cores = 1000 + max_memory_gb = 4000 + max_storage_tb = 500 + max_instances = 200 + max_monthly_cost = 100000 + cost_alerts_enabled = True + } + monitoring_level = "enhanced" + disaster_recovery_configuration = { + dr_region = "us-west-2" + rto_minutes = 60 + rpo_minutes = 15 + automated_failover = True + } + } +} + +# Export environment templates +{ + templates: environment_templates, + schema: CompanyEnvironment +} +``` + +## Integration Patterns + +### Legacy System Integration + +Create integration patterns for common legacy system scenarios: + +```text +# Create integration patterns +mkdir -p extensions/taskservs/integrations/legacy-bridge/nickel +cd extensions/taskservs/integrations/legacy-bridge/nickel +``` + +Create `legacy-bridge.ncl`: + +```text +""" +Legacy System Integration Bridge +Provides standardized integration patterns for legacy systems +""" + +import provisioning.lib as lib +import provisioning.dependencies as deps + +schema LegacyBridgeConfig: + """Configuration for legacy system integration bridge""" + + # Bridge configuration + bridge_name: str + integration_type: "api" | "database" | "file" | "message-queue" | "etl" + + # Legacy system details + legacy_system: LegacySystemInfo + + # Modern system details + modern_system: ModernSystemInfo + + # Data transformation configuration + data_transformation: DataTransformationConfig + + # Security configuration + security_config: IntegrationSecurityConfig + + # Monitoring and alerting + monitoring_config: IntegrationMonitoringConfig + +schema LegacySystemInfo: + """Legacy system information""" + name: str + type: "mainframe" | "as400" | "unix" | "windows" | "database" | "file-system" + version: str + + # Connection details + 
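+    # connection_method, endpoint and the optional port describe how the
+    # bridge reaches the legacy system; auth_method and credentials_source
+    # below determine where credentials are resolved from (for example "vault").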
connection_method: "direct" | "vpn" | "dedicated-line" | "api-gateway" + endpoint: str + port?: int + + # Authentication + auth_method: "password" | "certificate" | "kerberos" | "ldap" | "token" + credentials_source: "vault" | "config" | "environment" + + # Data characteristics + data_format: "fixed-width" | "csv" | "xml" | "json" | "binary" | "proprietary" + character_encoding: str = "utf-8" + + # Operational characteristics + availability_hours: str = "24/7" + maintenance_windows: [MaintenanceWindow] + +schema ModernSystemInfo: + """Modern system information""" + name: str + type: "microservice" | "api" | "database" | "event-stream" | "file-store" + + # Connection details + endpoint: str + api_version?: str + + # Data format + data_format: "json" | "xml" | "avro" | "protobuf" + + # Authentication + auth_method: "oauth2" | "jwt" | "api-key" | "mutual-tls" + +schema DataTransformationConfig: + """Data transformation configuration""" + transformation_rules: [TransformationRule] + error_handling: ErrorHandlingConfig + data_validation: DataValidationConfig + +schema TransformationRule: + """Individual data transformation rule""" + source_field: str + target_field: str + transformation_type: "direct" | "calculated" | "lookup" | "conditional" + transformation_expression?: str + +schema ErrorHandlingConfig: + """Error handling configuration""" + retry_policy: RetryPolicy + dead_letter_queue: bool = True + error_notification: bool = True + +schema RetryPolicy: + """Retry policy configuration""" + max_attempts: int = 3 + initial_delay_seconds: int = 5 + backoff_multiplier: float = 2.0 + max_delay_seconds: int = 300 + +schema DataValidationConfig: + """Data validation configuration""" + schema_validation: bool = True + business_rules_validation: bool = True + data_quality_checks: [DataQualityCheck] + +schema DataQualityCheck: + """Data quality check definition""" + name: str + check_type: "completeness" | "uniqueness" | "validity" | "consistency" + threshold: float = 0.95 + action_on_failure: "warn" | "stop" | "quarantine" + +schema IntegrationSecurityConfig: + """Security configuration for integration""" + encryption_in_transit: bool = True + encryption_at_rest: bool = True + + # Access control + source_ip_whitelist?: [str] + api_rate_limiting: bool = True + + # Audit and compliance + audit_all_transactions: bool = True + pii_data_handling: PIIHandlingConfig + +schema PIIHandlingConfig: + """PII data handling configuration""" + pii_fields: [str] + anonymization_enabled: bool = True + retention_policy_days: int = 365 + +schema IntegrationMonitoringConfig: + """Monitoring configuration for integration""" + metrics_collection: bool = True + performance_monitoring: bool = True + + # SLA monitoring + sla_targets: SLATargets + + # Alerting + alert_on_failures: bool = True + alert_on_performance_degradation: bool = True + +schema SLATargets: + """SLA targets for integration""" + max_latency_ms: int = 5000 + min_availability_percent: float = 99.9 + max_error_rate_percent: float = 0.1 + +schema MaintenanceWindow: + """Maintenance window definition""" + day_of_week: int # 0=Sunday, 6=Saturday + start_time: str # HH:MM format + duration_hours: int + +# Taskserv definition +schema LegacyBridgeTaskserv(lib.TaskServDef): + """Legacy Bridge Taskserv Definition""" + name: str = "legacy-bridge" + config: LegacyBridgeConfig + +# Dependencies +legacy_bridge_dependencies: deps.TaskservDependencies = { + name = "legacy-bridge" + + requires = ["kubernetes"] + optional = ["monitoring", "logging", "vault"] + provides = 
["legacy-integration", "data-bridge"] + + resources = { + cpu = "500m" + memory = "1Gi" + disk = "10Gi" + network = True + privileged = False + } + + health_checks = [ + { + command = "curl -f http://localhost:9090/health" + interval = 30 + timeout = 10 + retries = 3 + }, + { + command = "integration-test --quick" + interval = 300 + timeout = 120 + retries = 1 + } + ] + + os_support = ["linux"] + arch_support = ["amd64", "arm64"] +} + +# Export configuration +{ + config: LegacyBridgeTaskserv, + dependencies: legacy_bridge_dependencies +} +``` + +## Real-World Examples + +### Example 1: Financial Services Company + +```text +# Financial services specific extensions +mkdir -p extensions/taskservs/financial-services/{trading-system,risk-engine,compliance-reporter}/nickel +``` + +### Example 2: Healthcare Organization + +```text +# Healthcare specific extensions +mkdir -p extensions/taskservs/healthcare/{hl7-processor,dicom-storage,hipaa-audit}/nickel +``` + +### Example 3: Manufacturing Company + +```text +# Manufacturing specific extensions +mkdir -p extensions/taskservs/manufacturing/{iot-gateway,scada-bridge,quality-system}/nickel +``` + +### Usage Examples + +#### Loading Infrastructure-Specific Extensions + +```text +# Load company-specific extensions +cd workspace/infra/production +module-loader load taskservs . [legacy-erp, compliance-monitor, legacy-bridge] +module-loader load providers . [company-private-cloud] +module-loader load clusters . [company-environments] + +# Verify loading +module-loader list taskservs . +module-loader validate . +``` + +#### Using in Server Configuration + +```text +# Import loaded extensions +import .taskservs.legacy-erp.legacy-erp as erp +import .taskservs.compliance-monitor.compliance-monitor as compliance +import .providers.company-private-cloud as private_cloud + +# Configure servers with company-specific extensions +company_servers: [server.Server] = [ + { + hostname = "erp-prod-01" + title = "Production ERP Server" + + # Use company private cloud + # Provider-specific configuration goes here + + taskservs = [ + { + name = "legacy-erp" + profile = "production" + }, + { + name = "compliance-monitor" + profile = "default" + } + ] + } +] +``` + +This comprehensive guide covers all aspects of creating infrastructure-specific extensions, from assessment and planning to implementation and deployment. \ No newline at end of file diff --git a/docs/src/development/integration.md b/docs/src/development/integration.md index 7050c56..edab5b8 100644 --- a/docs/src/development/integration.md +++ b/docs/src/development/integration.md @@ -1 +1,1219 @@ -# Integration Guide\n\nThis document explains how the new project structure integrates with existing systems, API compatibility and versioning, database migration\nstrategies, deployment considerations, and monitoring and observability.\n\n## Table of Contents\n\n1. [Overview](#overview)\n2. [Existing System Integration](#existing-system-integration)\n3. [API Compatibility and Versioning](#api-compatibility-and-versioning)\n4. [Database Migration Strategies](#database-migration-strategies)\n5. [Deployment Considerations](#deployment-considerations)\n6. [Monitoring and Observability](#monitoring-and-observability)\n7. [Legacy System Bridge](#legacy-system-bridge)\n8. [Migration Pathways](#migration-pathways)\n9. 
[Troubleshooting Integration Issues](#troubleshooting-integration-issues)\n\n## Overview\n\nProvisioning has been designed with integration as a core principle, ensuring seamless compatibility between new development-focused components and\nexisting production systems while providing clear migration pathways.\n\n**Integration Principles**:\n\n- **Backward Compatibility**: All existing APIs and interfaces remain functional\n- **Gradual Migration**: Systems can be migrated incrementally without disruption\n- **Dual Operation**: New and legacy systems operate side-by-side during transition\n- **Zero Downtime**: Migrations occur without service interruption\n- **Data Integrity**: All data migrations are atomic and reversible\n\n**Integration Architecture**:\n\n```\nIntegration Ecosystem\n┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐\n│ Legacy Core │ ←→ │ Bridge Layer │ ←→ │ New Systems │\n│ │ │ │ │ │\n│ - ENV config │ │ - Compatibility │ │ - TOML config │\n│ - Direct calls │ │ - Translation │ │ - Orchestrator │\n│ - File-based │ │ - Monitoring │ │ - Workflows │\n│ - Simple logging│ │ - Validation │ │ - REST APIs │\n└─────────────────┘ └─────────────────┘ └─────────────────┘\n```\n\n## Existing System Integration\n\n### Command-Line Interface Integration\n\n**Seamless CLI Compatibility**:\n\n```\n# All existing commands continue to work unchanged\n./core/nulib/provisioning server create web-01 2xCPU-4 GB\n./core/nulib/provisioning taskserv install kubernetes\n./core/nulib/provisioning cluster create buildkit\n\n# New commands available alongside existing ones\n./src/core/nulib/provisioning server create web-01 2xCPU-4 GB --orchestrated\nnu workspace/tools/workspace.nu health --detailed\n```\n\n**Path Resolution Integration**:\n\n```\n# Automatic path resolution between systems\nuse workspace/lib/path-resolver.nu\n\n# Resolves to workspace path if available, falls back to core\nlet config_path = (path-resolver resolve_path "config" "user" --fallback-to-core)\n\n# Seamless extension discovery\nlet provider_path = (path-resolver resolve_extension "providers" "upcloud")\n```\n\n### Configuration System Bridge\n\n**Dual Configuration Support**:\n\n```\n# Configuration bridge supports both ENV and TOML\ndef get-config-value-bridge [key: string, default: string = ""] -> string {\n # Try new TOML configuration first\n let toml_value = try {\n get-config-value $key\n } catch { null }\n\n if $toml_value != null {\n return $toml_value\n }\n\n # Fall back to ENV variable (legacy support)\n let env_key = ($key | str replace "." 
"_" | str upcase | $"PROVISIONING_($in)")\n let env_value = ($env | get $env_key | default null)\n\n if $env_value != null {\n return $env_value\n }\n\n # Use default if provided\n if $default != "" {\n return $default\n }\n\n # Error with helpful migration message\n error make {\n msg: $"Configuration not found: ($key)",\n help: $"Migrate from ($env_key) environment variable to ($key) in config file"\n }\n}\n```\n\n### Data Integration\n\n**Shared Data Access**:\n\n```\n# Unified data access across old and new systems\ndef get-server-info [server_name: string] -> record {\n # Try new orchestrator data store first\n let orchestrator_data = try {\n get-orchestrator-server-data $server_name\n } catch { null }\n\n if $orchestrator_data != null {\n return $orchestrator_data\n }\n\n # Fall back to legacy file-based storage\n let legacy_data = try {\n get-legacy-server-data $server_name\n } catch { null }\n\n if $legacy_data != null {\n return ($legacy_data | migrate-to-new-format)\n }\n\n error make {msg: $"Server not found: ($server_name)"}\n}\n```\n\n### Process Integration\n\n**Hybrid Process Management**:\n\n```\n# Orchestrator-aware process management\ndef create-server-integrated [\n name: string,\n plan: string,\n --orchestrated: bool = false\n] -> record {\n if $orchestrated and (check-orchestrator-available) {\n # Use new orchestrator workflow\n return (create-server-workflow $name $plan)\n } else {\n # Use legacy direct creation\n return (create-server-direct $name $plan)\n }\n}\n\ndef check-orchestrator-available [] -> bool {\n try {\n http get "http://localhost:9090/health" | get status == "ok"\n } catch {\n false\n }\n}\n```\n\n## API Compatibility and Versioning\n\n### REST API Versioning\n\n**API Version Strategy**:\n\n- **v1**: Legacy compatibility API (existing functionality)\n- **v2**: Enhanced API with orchestrator features\n- **v3**: Full workflow and batch operation support\n\n**Version Header Support**:\n\n```\n# API calls with version specification\ncurl -H "API-Version: v1" http://localhost:9090/servers\ncurl -H "API-Version: v2" http://localhost:9090/workflows/servers/create\ncurl -H "API-Version: v3" http://localhost:9090/workflows/batch/submit\n```\n\n### API Compatibility Layer\n\n**Backward Compatible Endpoints**:\n\n```\n// Rust API compatibility layer\n#[derive(Debug, Serialize, Deserialize)]\nstruct ApiRequest {\n version: Option,\n #[serde(flatten)]\n payload: serde_json::Value,\n}\n\nasync fn handle_versioned_request(\n headers: HeaderMap,\n req: ApiRequest,\n) -> Result {\n let api_version = headers\n .get("API-Version")\n .and_then(|v| v.to_str().ok())\n .unwrap_or("v1");\n\n match api_version {\n "v1" => handle_v1_request(req.payload).await,\n "v2" => handle_v2_request(req.payload).await,\n "v3" => handle_v3_request(req.payload).await,\n _ => Err(ApiError::UnsupportedVersion(api_version.to_string())),\n }\n}\n\n// V1 compatibility endpoint\nasync fn handle_v1_request(payload: serde_json::Value) -> Result {\n // Transform request to legacy format\n let legacy_request = transform_to_legacy_format(payload)?;\n\n // Execute using legacy system\n let result = execute_legacy_operation(legacy_request).await?;\n\n // Transform response to v1 format\n Ok(transform_to_v1_response(result))\n}\n```\n\n### Schema Evolution\n\n**Backward Compatible Schema Changes**:\n\n```\n# API schema with version support\nlet ServerCreateRequest = {\n # V1 fields (always supported)\n name | string,\n plan | string,\n zone | string | default = "auto",\n\n # V2 additions (optional for 
backward compatibility)\n orchestrated | bool | default = false,\n workflow_options | { } | optional,\n\n # V3 additions\n batch_options | { } | optional,\n dependencies | array | default = [],\n\n # Version constraints\n api_version | string | default = "v1",\n} in\nServerCreateRequest\n\n# Conditional validation based on API version\nlet WorkflowOptions = {\n wait_for_completion | bool | default = true,\n timeout_seconds | number | default = 300,\n retry_count | number | default = 3,\n} in\nWorkflowOptions\n```\n\n### Client SDK Compatibility\n\n**Multi-Version Client Support**:\n\n```\n# Nushell client with version support\ndef "client create-server" [\n name: string,\n plan: string,\n --api-version: string = "v1",\n --orchestrated: bool = false\n] -> record {\n let endpoint = match $api_version {\n "v1" => "/servers",\n "v2" => "/workflows/servers/create",\n "v3" => "/workflows/batch/submit",\n _ => (error make {msg: $"Unsupported API version: ($api_version)"})\n }\n\n let request_body = match $api_version {\n "v1" => {name: $name, plan: $plan},\n "v2" => {name: $name, plan: $plan, orchestrated: $orchestrated},\n "v3" => {\n operations: [{\n id: "create_server",\n type: "server_create",\n config: {name: $name, plan: $plan}\n }]\n },\n _ => (error make {msg: $"Unsupported API version: ($api_version)"})\n }\n\n http post $"http://localhost:9090($endpoint)" $request_body\n --headers {\n "Content-Type": "application/json",\n "API-Version": $api_version\n }\n}\n```\n\n## Database Migration Strategies\n\n### Database Architecture Evolution\n\n**Migration Strategy**:\n\n```\nDatabase Evolution Path\n┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐\n│ File-based │ → │ SQLite │ → │ SurrealDB │\n│ Storage │ │ Migration │ │ Full Schema │\n│ │ │ │ │ │\n│ - JSON files │ │ - Structured │ │ - Graph DB │\n│ - Text logs │ │ - Transactions │ │ - Real-time │\n│ - Simple state │ │ - Backup/restore│ │ - Clustering │\n└─────────────────┘ └─────────────────┘ └─────────────────┘\n```\n\n### Migration Scripts\n\n**Automated Database Migration**:\n\n```\n# Database migration orchestration\ndef migrate-database [\n --from: string = "filesystem",\n --to: string = "surrealdb",\n --backup-first: bool = true,\n --verify: bool = true\n] -> record {\n if $backup_first {\n print "Creating backup before migration..."\n let backup_result = (create-database-backup $from)\n print $"Backup created: ($backup_result.path)"\n }\n\n print $"Migrating from ($from) to ($to)..."\n\n match [$from, $to] {\n ["filesystem", "sqlite"] => migrate_filesystem_to_sqlite,\n ["filesystem", "surrealdb"] => migrate_filesystem_to_surrealdb,\n ["sqlite", "surrealdb"] => migrate_sqlite_to_surrealdb,\n _ => (error make {msg: $"Unsupported migration path: ($from) → ($to)"})\n }\n\n if $verify {\n print "Verifying migration integrity..."\n let verification = (verify-migration $from $to)\n if not $verification.success {\n error make {\n msg: $"Migration verification failed: ($verification.errors)",\n help: "Restore from backup and retry migration"\n }\n }\n }\n\n print $"Migration from ($from) to ($to) completed successfully"\n {from: $from, to: $to, status: "completed", migrated_at: (date now)}\n}\n```\n\n**File System to SurrealDB Migration**:\n\n```\ndef migrate_filesystem_to_surrealdb [] -> record {\n # Initialize SurrealDB connection\n let db = (connect-surrealdb)\n\n # Migrate server data\n let server_files = (ls data/servers/*.json)\n let migrated_servers = []\n\n for server_file in $server_files {\n let server_data = (open 
$server_file.name | from json)\n\n # Transform to new schema\n let server_record = {\n id: $server_data.id,\n name: $server_data.name,\n plan: $server_data.plan,\n zone: ($server_data.zone? | default "unknown"),\n status: $server_data.status,\n ip_address: $server_data.ip_address?,\n created_at: $server_data.created_at,\n updated_at: (date now),\n metadata: ($server_data.metadata? | default {}),\n tags: ($server_data.tags? | default [])\n }\n\n # Insert into SurrealDB\n let insert_result = try {\n query-surrealdb $"CREATE servers:($server_record.id) CONTENT ($server_record | to json)"\n } catch { |e|\n print $"Warning: Failed to migrate server ($server_data.name): ($e.msg)"\n }\n\n $migrated_servers = ($migrated_servers | append $server_record.id)\n }\n\n # Migrate workflow data\n migrate_workflows_to_surrealdb $db\n\n # Migrate state data\n migrate_state_to_surrealdb $db\n\n {\n migrated_servers: ($migrated_servers | length),\n migrated_workflows: (migrate_workflows_to_surrealdb $db).count,\n status: "completed"\n }\n}\n```\n\n### Data Integrity Verification\n\n**Migration Verification**:\n\n```\ndef verify-migration [from: string, to: string] -> record {\n print "Verifying data integrity..."\n\n let source_data = (read-source-data $from)\n let target_data = (read-target-data $to)\n\n let errors = []\n\n # Verify record counts\n if $source_data.servers.count != $target_data.servers.count {\n $errors = ($errors | append "Server count mismatch")\n }\n\n # Verify key records\n for server in $source_data.servers {\n let target_server = ($target_data.servers | where id == $server.id | first)\n\n if ($target_server | is-empty) {\n $errors = ($errors | append $"Missing server: ($server.id)")\n } else {\n # Verify critical fields\n if $target_server.name != $server.name {\n $errors = ($errors | append $"Name mismatch for server ($server.id)")\n }\n\n if $target_server.status != $server.status {\n $errors = ($errors | append $"Status mismatch for server ($server.id)")\n }\n }\n }\n\n {\n success: ($errors | length) == 0,\n errors: $errors,\n verified_at: (date now)\n }\n}\n```\n\n## Deployment Considerations\n\n### Deployment Architecture\n\n**Hybrid Deployment Model**:\n\n```\nDeployment Architecture\n┌─────────────────────────────────────────────────────────────────┐\n│ Load Balancer / Reverse Proxy │\n└─────────────────────┬───────────────────────────────────────────┘\n │\n ┌─────────────────┼─────────────────┐\n │ │ │\n┌───▼────┐ ┌─────▼─────┐ ┌───▼────┐\n│Legacy │ │Orchestrator│ │New │\n│System │ ←→ │Bridge │ ←→ │Systems │\n│ │ │ │ │ │\n│- CLI │ │- API Gate │ │- REST │\n│- Files │ │- Compat │ │- DB │\n│- Logs │ │- Monitor │ │- Queue │\n└────────┘ └────────────┘ └────────┘\n```\n\n### Deployment Strategies\n\n**Blue-Green Deployment**:\n\n```\n# Blue-Green deployment with integration bridge\n# Phase 1: Deploy new system alongside existing (Green environment)\ncd src/tools\nmake all\nmake create-installers\n\n# Install new system without disrupting existing\n./packages/installers/install-provisioning-2.0.0.sh \\n --install-path /opt/provisioning-v2 \\n --no-replace-existing \\n --enable-bridge-mode\n\n# Phase 2: Start orchestrator and validate integration\n/opt/provisioning-v2/bin/orchestrator start --bridge-mode --legacy-path /opt/provisioning-v1\n\n# Phase 3: Gradual traffic shift\n# Route 10% traffic to new system\nnginx-traffic-split --new-backend 10%\n\n# Validate metrics and gradually increase\nnginx-traffic-split --new-backend 50%\nnginx-traffic-split --new-backend 90%\n\n# Phase 4: 
Complete cutover\nnginx-traffic-split --new-backend 100%\n/opt/provisioning-v1/bin/orchestrator stop\n```\n\n**Rolling Update**:\n\n```\ndef rolling-deployment [\n --target-version: string,\n --batch-size: int = 3,\n --health-check-interval: duration = 30sec\n] -> record {\n let nodes = (get-deployment-nodes)\n let batches = ($nodes | group_by --chunk-size $batch_size)\n\n let deployment_results = []\n\n for batch in $batches {\n print $"Deploying to batch: ($batch | get name | str join ', ')"\n\n # Deploy to batch\n for node in $batch {\n deploy-to-node $node $target_version\n }\n\n # Wait for health checks\n sleep $health_check_interval\n\n # Verify batch health\n let batch_health = ($batch | each { |node| check-node-health $node })\n let healthy_nodes = ($batch_health | where healthy == true | length)\n\n if $healthy_nodes != ($batch | length) {\n # Rollback batch on failure\n print $"Health check failed, rolling back batch"\n for node in $batch {\n rollback-node $node\n }\n error make {msg: "Rolling deployment failed at batch"}\n }\n\n print $"Batch deployed successfully"\n $deployment_results = ($deployment_results | append {\n batch: $batch,\n status: "success",\n deployed_at: (date now)\n })\n }\n\n {\n strategy: "rolling",\n target_version: $target_version,\n batches: ($deployment_results | length),\n status: "completed",\n completed_at: (date now)\n }\n}\n```\n\n### Configuration Deployment\n\n**Environment-Specific Deployment**:\n\n```\n# Development deployment\nPROVISIONING_ENV=dev ./deploy.sh \\n --config-source config.dev.toml \\n --enable-debug \\n --enable-hot-reload\n\n# Staging deployment\nPROVISIONING_ENV=staging ./deploy.sh \\n --config-source config.staging.toml \\n --enable-monitoring \\n --backup-before-deploy\n\n# Production deployment\nPROVISIONING_ENV=prod ./deploy.sh \\n --config-source config.prod.toml \\n --zero-downtime \\n --enable-all-monitoring \\n --backup-before-deploy \\n --health-check-timeout 5m\n```\n\n### Container Integration\n\n**Docker Deployment with Bridge**:\n\n```\n# Multi-stage Docker build supporting both systems\nFROM rust:1.70 as builder\nWORKDIR /app\nCOPY . 
.\nRUN cargo build --release\n\nFROM ubuntu:22.04 as runtime\nWORKDIR /app\n\n# Install both legacy and new systems\nCOPY --from=builder /app/target/release/orchestrator /app/bin/\nCOPY legacy-provisioning/ /app/legacy/\nCOPY config/ /app/config/\n\n# Bridge script for dual operation\nCOPY bridge-start.sh /app/bin/\n\nENV PROVISIONING_BRIDGE_MODE=true\nENV PROVISIONING_LEGACY_PATH=/app/legacy\nENV PROVISIONING_NEW_PATH=/app/bin\n\nEXPOSE 8080\nCMD ["/app/bin/bridge-start.sh"]\n```\n\n**Kubernetes Integration**:\n\n```\n# Kubernetes deployment with bridge sidecar\napiVersion: apps/v1\nkind: Deployment\nmetadata:\n name: provisioning-system\nspec:\n replicas: 3\n template:\n spec:\n containers:\n - name: orchestrator\n image: provisioning-system:2.0.0\n ports:\n - containerPort: 8080\n env:\n - name: PROVISIONING_BRIDGE_MODE\n value: "true"\n volumeMounts:\n - name: config\n mountPath: /app/config\n - name: legacy-data\n mountPath: /app/legacy/data\n\n - name: legacy-bridge\n image: provisioning-legacy:1.0.0\n env:\n - name: BRIDGE_ORCHESTRATOR_URL\n value: "http://localhost:9090"\n volumeMounts:\n - name: legacy-data\n mountPath: /data\n\n volumes:\n - name: config\n configMap:\n name: provisioning-config\n - name: legacy-data\n persistentVolumeClaim:\n claimName: provisioning-data\n```\n\n## Monitoring and Observability\n\n### Integrated Monitoring Architecture\n\n**Monitoring Stack Integration**:\n\n```\nObservability Architecture\n┌─────────────────────────────────────────────────────────────────┐\n│ Monitoring Dashboard │\n│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │\n│ │ Grafana │ │ Jaeger │ │ AlertMgr │ │\n│ └─────────────┘ └─────────────┘ └─────────────┘ │\n└─────────────┬───────────────┬───────────────┬─────────────────┘\n │ │ │\n ┌──────────▼──────────┐ │ ┌───────────▼───────────┐\n │ Prometheus │ │ │ Jaeger │\n │ (Metrics) │ │ │ (Tracing) │\n └──────────┬──────────┘ │ └───────────┬───────────┘\n │ │ │\n┌─────────────▼─────────────┐ │ ┌─────────────▼─────────────┐\n│ Legacy │ │ │ New System │\n│ Monitoring │ │ │ Monitoring │\n│ │ │ │ │\n│ - File-based logs │ │ │ - Structured logs │\n│ - Simple metrics │ │ │ - Prometheus metrics │\n│ - Basic health checks │ │ │ - Distributed tracing │\n└───────────────────────────┘ │ └───────────────────────────┘\n │\n ┌─────────▼─────────┐\n │ Bridge Monitor │\n │ │\n │ - Integration │\n │ - Compatibility │\n │ - Migration │\n └───────────────────┘\n```\n\n### Metrics Integration\n\n**Unified Metrics Collection**:\n\n```\n# Metrics bridge for legacy and new systems\ndef collect-system-metrics [] -> record {\n let legacy_metrics = collect-legacy-metrics\n let new_metrics = collect-new-metrics\n let bridge_metrics = collect-bridge-metrics\n\n {\n timestamp: (date now),\n legacy: $legacy_metrics,\n new: $new_metrics,\n bridge: $bridge_metrics,\n integration: {\n compatibility_rate: (calculate-compatibility-rate $bridge_metrics),\n migration_progress: (calculate-migration-progress),\n system_health: (assess-overall-health $legacy_metrics $new_metrics)\n }\n }\n}\n\ndef collect-legacy-metrics [] -> record {\n let log_files = (ls logs/*.log)\n let process_stats = (get-process-stats "legacy-provisioning")\n\n {\n active_processes: $process_stats.count,\n log_file_sizes: ($log_files | get size | math sum),\n last_activity: (get-last-log-timestamp),\n error_count: (count-log-errors "last 1h"),\n performance: {\n avg_response_time: (calculate-avg-response-time),\n throughput: (calculate-throughput)\n }\n }\n}\n\ndef collect-new-metrics [] -> record 
{\n let orchestrator_stats = try {\n http get "http://localhost:9090/metrics"\n } catch {\n {status: "unavailable"}\n }\n\n {\n orchestrator: $orchestrator_stats,\n workflow_stats: (get-workflow-metrics),\n api_stats: (get-api-metrics),\n database_stats: (get-database-metrics)\n }\n}\n```\n\n### Logging Integration\n\n**Unified Logging Strategy**:\n\n```\n# Structured logging bridge\ndef log-integrated [\n level: string,\n message: string,\n --component: string = "bridge",\n --legacy-compat: bool = true\n] {\n let log_entry = {\n timestamp: (date now | format date "%Y-%m-%d %H:%M:%S%.3f"),\n level: $level,\n component: $component,\n message: $message,\n system: "integrated",\n correlation_id: (generate-correlation-id)\n }\n\n # Write to structured log (new system)\n $log_entry | to json | save --append logs/integrated.jsonl\n\n if $legacy_compat {\n # Write to legacy log format\n let legacy_entry = $"[($log_entry.timestamp)] [($level)] ($component): ($message)"\n $legacy_entry | save --append logs/legacy.log\n }\n\n # Send to monitoring system\n send-to-monitoring $log_entry\n}\n```\n\n### Health Check Integration\n\n**Comprehensive Health Monitoring**:\n\n```\ndef health-check-integrated [] -> record {\n let health_checks = [\n {name: "legacy-system", check: (check-legacy-health)},\n {name: "orchestrator", check: (check-orchestrator-health)},\n {name: "database", check: (check-database-health)},\n {name: "bridge-compatibility", check: (check-bridge-health)},\n {name: "configuration", check: (check-config-health)}\n ]\n\n let results = ($health_checks | each { |check|\n let result = try {\n do $check.check\n } catch { |e|\n {status: "unhealthy", error: $e.msg}\n }\n\n {name: $check.name, result: $result}\n })\n\n let healthy_count = ($results | where result.status == "healthy" | length)\n let total_count = ($results | length)\n\n {\n overall_status: (if $healthy_count == $total_count { "healthy" } else { "degraded" }),\n healthy_services: $healthy_count,\n total_services: $total_count,\n services: $results,\n checked_at: (date now)\n }\n}\n```\n\n## Legacy System Bridge\n\n### Bridge Architecture\n\n**Bridge Component Design**:\n\n```\n# Legacy system bridge module\nexport module bridge {\n # Bridge state management\n export def init-bridge [] -> record {\n let bridge_config = get-config-section "bridge"\n\n {\n legacy_path: ($bridge_config.legacy_path? | default "/opt/provisioning-v1"),\n new_path: ($bridge_config.new_path? | default "/opt/provisioning-v2"),\n mode: ($bridge_config.mode? | default "compatibility"),\n monitoring_enabled: ($bridge_config.monitoring? | default true),\n initialized_at: (date now)\n }\n }\n\n # Command translation layer\n export def translate-command [\n legacy_command: list\n ] -> list {\n match $legacy_command {\n ["provisioning", "server", "create", $name, $plan, ...$args] => {\n let new_args = ($args | each { |arg|\n match $arg {\n "--dry-run" => "--dry-run",\n "--wait" => "--wait",\n $zone if ($zone | str starts-with "--zone=") => $zone,\n _ => $arg\n }\n })\n\n ["provisioning", "server", "create", $name, $plan] ++ $new_args ++ ["--orchestrated"]\n },\n _ => $legacy_command # Pass through unchanged\n }\n }\n\n # Data format translation\n export def translate-response [\n legacy_response: record,\n target_format: string = "v2"\n ] -> record {\n match $target_format {\n "v2" => {\n id: ($legacy_response.id? | default (generate-uuid)),\n name: $legacy_response.name,\n status: $legacy_response.status,\n created_at: ($legacy_response.created_at? 
| default (date now)),\n metadata: ($legacy_response | reject name status created_at),\n version: "v2-compat"\n },\n _ => $legacy_response\n }\n }\n}\n```\n\n### Bridge Operation Modes\n\n**Compatibility Mode**:\n\n```\n# Full compatibility with legacy system\ndef run-compatibility-mode [] {\n print "Starting bridge in compatibility mode..."\n\n # Intercept legacy commands\n let legacy_commands = monitor-legacy-commands\n\n for command in $legacy_commands {\n let translated = (bridge translate-command $command)\n\n try {\n let result = (execute-new-system $translated)\n let legacy_result = (bridge translate-response $result "v1")\n respond-to-legacy $legacy_result\n } catch { |e|\n # Fall back to legacy system on error\n let fallback_result = (execute-legacy-system $command)\n respond-to-legacy $fallback_result\n }\n }\n}\n```\n\n**Migration Mode**:\n\n```\n# Gradual migration with traffic splitting\ndef run-migration-mode [\n --new-system-percentage: int = 50\n] {\n print $"Starting bridge in migration mode (($new_system_percentage)% new system)"\n\n let commands = monitor-all-commands\n\n for command in $commands {\n let route_to_new = ((random integer 1..100) <= $new_system_percentage)\n\n if $route_to_new {\n try {\n execute-new-system $command\n } catch {\n # Fall back to legacy on failure\n execute-legacy-system $command\n }\n } else {\n execute-legacy-system $command\n }\n }\n}\n```\n\n## Migration Pathways\n\n### Migration Phases\n\n**Phase 1: Parallel Deployment**\n\n- Deploy new system alongside existing\n- Enable bridge for compatibility\n- Begin data synchronization\n- Monitor integration health\n\n**Phase 2: Gradual Migration**\n\n- Route increasing traffic to new system\n- Migrate data in background\n- Validate consistency\n- Address integration issues\n\n**Phase 3: Full Migration**\n\n- Complete traffic cutover\n- Decommission legacy system\n- Clean up bridge components\n- Finalize data migration\n\n### Migration Automation\n\n**Automated Migration Orchestration**:\n\n```\ndef execute-migration-plan [\n migration_plan: string,\n --dry-run: bool = false,\n --skip-backup: bool = false\n] -> record {\n let plan = (open $migration_plan | from yaml)\n\n if not $skip_backup {\n create-pre-migration-backup\n }\n\n let migration_results = []\n\n for phase in $plan.phases {\n print $"Executing migration phase: ($phase.name)"\n\n if $dry_run {\n print $"[DRY RUN] Would execute phase: ($phase)"\n continue\n }\n\n let phase_result = try {\n execute-migration-phase $phase\n } catch { |e|\n print $"Migration phase failed: ($e.msg)"\n\n if $phase.rollback_on_failure? 
| default false {\n print "Rolling back migration phase..."\n rollback-migration-phase $phase\n }\n\n error make {msg: $"Migration failed at phase ($phase.name): ($e.msg)"}\n }\n\n $migration_results = ($migration_results | append $phase_result)\n\n # Wait between phases if specified\n if "wait_seconds" in $phase {\n sleep ($phase.wait_seconds * 1sec)\n }\n }\n\n {\n migration_plan: $migration_plan,\n phases_completed: ($migration_results | length),\n status: "completed",\n completed_at: (date now),\n results: $migration_results\n }\n}\n```\n\n**Migration Validation**:\n\n```\ndef validate-migration-readiness [] -> record {\n let checks = [\n {name: "backup-available", check: (check-backup-exists)},\n {name: "new-system-healthy", check: (check-new-system-health)},\n {name: "database-accessible", check: (check-database-connectivity)},\n {name: "configuration-valid", check: (validate-migration-config)},\n {name: "resources-available", check: (check-system-resources)},\n {name: "network-connectivity", check: (check-network-health)}\n ]\n\n let results = ($checks | each { |check|\n {\n name: $check.name,\n result: (do $check.check),\n timestamp: (date now)\n }\n })\n\n let failed_checks = ($results | where result.status != "ready")\n\n {\n ready_for_migration: ($failed_checks | length) == 0,\n checks: $results,\n failed_checks: $failed_checks,\n validated_at: (date now)\n }\n}\n```\n\n## Troubleshooting Integration Issues\n\n### Common Integration Problems\n\n#### API Compatibility Issues\n\n**Problem**: Version mismatch between client and server\n\n```\n# Diagnosis\ncurl -H "API-Version: v1" http://localhost:9090/health\ncurl -H "API-Version: v2" http://localhost:9090/health\n\n# Solution: Check supported versions\ncurl http://localhost:9090/api/versions\n\n# Update client API version\nexport PROVISIONING_API_VERSION=v2\n```\n\n#### Configuration Bridge Issues\n\n**Problem**: Configuration not found in either system\n\n```\n# Diagnosis\ndef diagnose-config-issue [key: string] -> record {\n let toml_result = try {\n get-config-value $key\n } catch { |e| {status: "failed", error: $e.msg} }\n\n let env_key = ($key | str replace "." 
"_" | str upcase | $"PROVISIONING_($in)")\n let env_result = try {\n $env | get $env_key\n } catch { |e| {status: "failed", error: $e.msg} }\n\n {\n key: $key,\n toml_config: $toml_result,\n env_config: $env_result,\n migration_needed: ($toml_result.status == "failed" and $env_result.status != "failed")\n }\n}\n\n# Solution: Migrate configuration\ndef migrate-single-config [key: string] {\n let diagnosis = (diagnose-config-issue $key)\n\n if $diagnosis.migration_needed {\n let env_value = $diagnosis.env_config\n set-config-value $key $env_value\n print $"Migrated ($key) from environment variable"\n }\n}\n```\n\n#### Database Integration Issues\n\n**Problem**: Data inconsistency between systems\n\n```\n# Diagnosis and repair\ndef repair-data-consistency [] -> record {\n let legacy_data = (read-legacy-data)\n let new_data = (read-new-data)\n\n let inconsistencies = []\n\n # Check server records\n for server in $legacy_data.servers {\n let new_server = ($new_data.servers | where id == $server.id | first)\n\n if ($new_server | is-empty) {\n print $"Missing server in new system: ($server.id)"\n create-server-record $server\n $inconsistencies = ($inconsistencies | append {type: "missing", id: $server.id})\n } else if $new_server != $server {\n print $"Inconsistent server data: ($server.id)"\n update-server-record $server\n $inconsistencies = ($inconsistencies | append {type: "inconsistent", id: $server.id})\n }\n }\n\n {\n inconsistencies_found: ($inconsistencies | length),\n repairs_applied: ($inconsistencies | length),\n repaired_at: (date now)\n }\n}\n```\n\n### Debug Tools\n\n**Integration Debug Mode**:\n\n```\n# Enable comprehensive debugging\nexport PROVISIONING_DEBUG=true\nexport PROVISIONING_LOG_LEVEL=debug\nexport PROVISIONING_BRIDGE_DEBUG=true\nexport PROVISIONING_INTEGRATION_TRACE=true\n\n# Run with integration debugging\nprovisioning server create test-server 2xCPU-4 GB --debug-integration\n```\n\n**Health Check Debugging**:\n\n```\ndef debug-integration-health [] -> record {\n print "=== Integration Health Debug ==="\n\n # Check all integration points\n let legacy_health = try {\n check-legacy-system\n } catch { |e| {status: "error", error: $e.msg} }\n\n let orchestrator_health = try {\n http get "http://localhost:9090/health"\n } catch { |e| {status: "error", error: $e.msg} }\n\n let bridge_health = try {\n check-bridge-status\n } catch { |e| {status: "error", error: $e.msg} }\n\n let config_health = try {\n validate-config-integration\n } catch { |e| {status: "error", error: $e.msg} }\n\n print $"Legacy System: ($legacy_health.status)"\n print $"Orchestrator: ($orchestrator_health.status)"\n print $"Bridge: ($bridge_health.status)"\n print $"Configuration: ($config_health.status)"\n\n {\n legacy: $legacy_health,\n orchestrator: $orchestrator_health,\n bridge: $bridge_health,\n configuration: $config_health,\n debug_timestamp: (date now)\n }\n}\n```\n\nThis integration guide provides a comprehensive framework for seamlessly integrating new development components with existing production systems while\nmaintaining reliability, compatibility, and clear migration pathways. +# Integration Guide + +This document explains how the new project structure integrates with existing systems, API compatibility and versioning, database migration +strategies, deployment considerations, and monitoring and observability. + +## Table of Contents + +1. [Overview](#overview) +2. [Existing System Integration](#existing-system-integration) +3. 
[API Compatibility and Versioning](#api-compatibility-and-versioning) +4. [Database Migration Strategies](#database-migration-strategies) +5. [Deployment Considerations](#deployment-considerations) +6. [Monitoring and Observability](#monitoring-and-observability) +7. [Legacy System Bridge](#legacy-system-bridge) +8. [Migration Pathways](#migration-pathways) +9. [Troubleshooting Integration Issues](#troubleshooting-integration-issues) + +## Overview + +Provisioning has been designed with integration as a core principle, ensuring seamless compatibility between new development-focused components and +existing production systems while providing clear migration pathways. + +**Integration Principles**: + +- **Backward Compatibility**: All existing APIs and interfaces remain functional +- **Gradual Migration**: Systems can be migrated incrementally without disruption +- **Dual Operation**: New and legacy systems operate side-by-side during transition +- **Zero Downtime**: Migrations occur without service interruption +- **Data Integrity**: All data migrations are atomic and reversible + +**Integration Architecture**: + +```text +Integration Ecosystem +┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ +│ Legacy Core │ ←→ │ Bridge Layer │ ←→ │ New Systems │ +│ │ │ │ │ │ +│ - ENV config │ │ - Compatibility │ │ - TOML config │ +│ - Direct calls │ │ - Translation │ │ - Orchestrator │ +│ - File-based │ │ - Monitoring │ │ - Workflows │ +│ - Simple logging│ │ - Validation │ │ - REST APIs │ +└─────────────────┘ └─────────────────┘ └─────────────────┘ +``` + +## Existing System Integration + +### Command-Line Interface Integration + +**Seamless CLI Compatibility**: + +```text +# All existing commands continue to work unchanged +./core/nulib/provisioning server create web-01 2xCPU-4 GB +./core/nulib/provisioning taskserv install kubernetes +./core/nulib/provisioning cluster create buildkit + +# New commands available alongside existing ones +./src/core/nulib/provisioning server create web-01 2xCPU-4 GB --orchestrated +nu workspace/tools/workspace.nu health --detailed +``` + +**Path Resolution Integration**: + +```text +# Automatic path resolution between systems +use workspace/lib/path-resolver.nu + +# Resolves to workspace path if available, falls back to core +let config_path = (path-resolver resolve_path "config" "user" --fallback-to-core) + +# Seamless extension discovery +let provider_path = (path-resolver resolve_extension "providers" "upcloud") +``` + +### Configuration System Bridge + +**Dual Configuration Support**: + +```text +# Configuration bridge supports both ENV and TOML +def get-config-value-bridge [key: string, default: string = ""] -> string { + # Try new TOML configuration first + let toml_value = try { + get-config-value $key + } catch { null } + + if $toml_value != null { + return $toml_value + } + + # Fall back to ENV variable (legacy support) + let env_key = ($key | str replace "." 
"_" | str upcase | $"PROVISIONING_($in)") + let env_value = ($env | get $env_key | default null) + + if $env_value != null { + return $env_value + } + + # Use default if provided + if $default != "" { + return $default + } + + # Error with helpful migration message + error make { + msg: $"Configuration not found: ($key)", + help: $"Migrate from ($env_key) environment variable to ($key) in config file" + } +} +``` + +### Data Integration + +**Shared Data Access**: + +```text +# Unified data access across old and new systems +def get-server-info [server_name: string] -> record { + # Try new orchestrator data store first + let orchestrator_data = try { + get-orchestrator-server-data $server_name + } catch { null } + + if $orchestrator_data != null { + return $orchestrator_data + } + + # Fall back to legacy file-based storage + let legacy_data = try { + get-legacy-server-data $server_name + } catch { null } + + if $legacy_data != null { + return ($legacy_data | migrate-to-new-format) + } + + error make {msg: $"Server not found: ($server_name)"} +} +``` + +### Process Integration + +**Hybrid Process Management**: + +```text +# Orchestrator-aware process management +def create-server-integrated [ + name: string, + plan: string, + --orchestrated: bool = false +] -> record { + if $orchestrated and (check-orchestrator-available) { + # Use new orchestrator workflow + return (create-server-workflow $name $plan) + } else { + # Use legacy direct creation + return (create-server-direct $name $plan) + } +} + +def check-orchestrator-available [] -> bool { + try { + http get "http://localhost:9090/health" | get status == "ok" + } catch { + false + } +} +``` + +## API Compatibility and Versioning + +### REST API Versioning + +**API Version Strategy**: + +- **v1**: Legacy compatibility API (existing functionality) +- **v2**: Enhanced API with orchestrator features +- **v3**: Full workflow and batch operation support + +**Version Header Support**: + +```text +# API calls with version specification +curl -H "API-Version: v1" http://localhost:9090/servers +curl -H "API-Version: v2" http://localhost:9090/workflows/servers/create +curl -H "API-Version: v3" http://localhost:9090/workflows/batch/submit +``` + +### API Compatibility Layer + +**Backward Compatible Endpoints**: + +```text +// Rust API compatibility layer +#[derive(Debug, Serialize, Deserialize)] +struct ApiRequest { + version: Option, + #[serde(flatten)] + payload: serde_json::Value, +} + +async fn handle_versioned_request( + headers: HeaderMap, + req: ApiRequest, +) -> Result { + let api_version = headers + .get("API-Version") + .and_then(|v| v.to_str().ok()) + .unwrap_or("v1"); + + match api_version { + "v1" => handle_v1_request(req.payload).await, + "v2" => handle_v2_request(req.payload).await, + "v3" => handle_v3_request(req.payload).await, + _ => Err(ApiError::UnsupportedVersion(api_version.to_string())), + } +} + +// V1 compatibility endpoint +async fn handle_v1_request(payload: serde_json::Value) -> Result { + // Transform request to legacy format + let legacy_request = transform_to_legacy_format(payload)?; + + // Execute using legacy system + let result = execute_legacy_operation(legacy_request).await?; + + // Transform response to v1 format + Ok(transform_to_v1_response(result)) +} +``` + +### Schema Evolution + +**Backward Compatible Schema Changes**: + +```text +# API schema with version support +let ServerCreateRequest = { + # V1 fields (always supported) + name | string, + plan | string, + zone | string | default = "auto", + + # V2 
additions (optional for backward compatibility) + orchestrated | bool | default = false, + workflow_options | { } | optional, + + # V3 additions + batch_options | { } | optional, + dependencies | array | default = [], + + # Version constraints + api_version | string | default = "v1", +} in +ServerCreateRequest + +# Conditional validation based on API version +let WorkflowOptions = { + wait_for_completion | bool | default = true, + timeout_seconds | number | default = 300, + retry_count | number | default = 3, +} in +WorkflowOptions +``` + +### Client SDK Compatibility + +**Multi-Version Client Support**: + +```text +# Nushell client with version support +def "client create-server" [ + name: string, + plan: string, + --api-version: string = "v1", + --orchestrated: bool = false +] -> record { + let endpoint = match $api_version { + "v1" => "/servers", + "v2" => "/workflows/servers/create", + "v3" => "/workflows/batch/submit", + _ => (error make {msg: $"Unsupported API version: ($api_version)"}) + } + + let request_body = match $api_version { + "v1" => {name: $name, plan: $plan}, + "v2" => {name: $name, plan: $plan, orchestrated: $orchestrated}, + "v3" => { + operations: [{ + id: "create_server", + type: "server_create", + config: {name: $name, plan: $plan} + }] + }, + _ => (error make {msg: $"Unsupported API version: ($api_version)"}) + } + + http post $"http://localhost:9090($endpoint)" $request_body + --headers { + "Content-Type": "application/json", + "API-Version": $api_version + } +} +``` + +## Database Migration Strategies + +### Database Architecture Evolution + +**Migration Strategy**: + +```text +Database Evolution Path +┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ +│ File-based │ → │ SQLite │ → │ SurrealDB │ +│ Storage │ │ Migration │ │ Full Schema │ +│ │ │ │ │ │ +│ - JSON files │ │ - Structured │ │ - Graph DB │ +│ - Text logs │ │ - Transactions │ │ - Real-time │ +│ - Simple state │ │ - Backup/restore│ │ - Clustering │ +└─────────────────┘ └─────────────────┘ └─────────────────┘ +``` + +### Migration Scripts + +**Automated Database Migration**: + +```text +# Database migration orchestration +def migrate-database [ + --from: string = "filesystem", + --to: string = "surrealdb", + --backup-first: bool = true, + --verify: bool = true +] -> record { + if $backup_first { + print "Creating backup before migration..." + let backup_result = (create-database-backup $from) + print $"Backup created: ($backup_result.path)" + } + + print $"Migrating from ($from) to ($to)..." + + match [$from, $to] { + ["filesystem", "sqlite"] => migrate_filesystem_to_sqlite, + ["filesystem", "surrealdb"] => migrate_filesystem_to_surrealdb, + ["sqlite", "surrealdb"] => migrate_sqlite_to_surrealdb, + _ => (error make {msg: $"Unsupported migration path: ($from) → ($to)"}) + } + + if $verify { + print "Verifying migration integrity..." 
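+        # Integrity check re-reads both stores and compares them record by
+        # record (see verify-migration below); if it fails, the backup taken
+        # above is the recovery path, so abort instead of continuing.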
+ let verification = (verify-migration $from $to) + if not $verification.success { + error make { + msg: $"Migration verification failed: ($verification.errors)", + help: "Restore from backup and retry migration" + } + } + } + + print $"Migration from ($from) to ($to) completed successfully" + {from: $from, to: $to, status: "completed", migrated_at: (date now)} +} +``` + +**File System to SurrealDB Migration**: + +```text +def migrate_filesystem_to_surrealdb [] -> record { + # Initialize SurrealDB connection + let db = (connect-surrealdb) + + # Migrate server data + let server_files = (ls data/servers/*.json) + let migrated_servers = [] + + for server_file in $server_files { + let server_data = (open $server_file.name | from json) + + # Transform to new schema + let server_record = { + id: $server_data.id, + name: $server_data.name, + plan: $server_data.plan, + zone: ($server_data.zone? | default "unknown"), + status: $server_data.status, + ip_address: $server_data.ip_address?, + created_at: $server_data.created_at, + updated_at: (date now), + metadata: ($server_data.metadata? | default {}), + tags: ($server_data.tags? | default []) + } + + # Insert into SurrealDB + let insert_result = try { + query-surrealdb $"CREATE servers:($server_record.id) CONTENT ($server_record | to json)" + } catch { |e| + print $"Warning: Failed to migrate server ($server_data.name): ($e.msg)" + } + + $migrated_servers = ($migrated_servers | append $server_record.id) + } + + # Migrate workflow data + migrate_workflows_to_surrealdb $db + + # Migrate state data + migrate_state_to_surrealdb $db + + { + migrated_servers: ($migrated_servers | length), + migrated_workflows: (migrate_workflows_to_surrealdb $db).count, + status: "completed" + } +} +``` + +### Data Integrity Verification + +**Migration Verification**: + +```text +def verify-migration [from: string, to: string] -> record { + print "Verifying data integrity..." 
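+    # Comparison strategy: cheap aggregate count check first, then field-level
+    # checks on each server record, collecting every mismatch so a single pass
+    # reports all inconsistencies instead of stopping at the first.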
+
+    let source_data = (read-source-data $from)
+    let target_data = (read-target-data $to)
+
+    mut errors = []
+
+    # Verify record counts
+    if ($source_data.servers | length) != ($target_data.servers | length) {
+        $errors = ($errors | append "Server count mismatch")
+    }
+
+    # Verify key records
+    for server in $source_data.servers {
+        let matches = ($target_data.servers | where id == $server.id)
+
+        if ($matches | is-empty) {
+            $errors = ($errors | append $"Missing server: ($server.id)")
+        } else {
+            # Verify critical fields
+            let target_server = ($matches | first)
+
+            if $target_server.name != $server.name {
+                $errors = ($errors | append $"Name mismatch for server ($server.id)")
+            }
+
+            if $target_server.status != $server.status {
+                $errors = ($errors | append $"Status mismatch for server ($server.id)")
+            }
+        }
+    }
+
+    {
+        success: ($errors | length) == 0,
+        errors: $errors,
+        verified_at: (date now)
+    }
+}
+```
+
+## Deployment Considerations
+
+### Deployment Architecture
+
+**Hybrid Deployment Model**:
+
+```text
+Deployment Architecture
+┌─────────────────────────────────────────────────────────────────┐
+│                  Load Balancer / Reverse Proxy                  │
+└─────────────────────┬───────────────────────────────────────────┘
+                      │
+     ┌────────────────┼────────────────┐
+     │                │                │
+┌────▼───┐      ┌─────▼──────┐    ┌───▼────┐
+│Legacy  │      │Orchestrator│    │New     │
+│System  │ ←→   │Bridge      │ ←→ │Systems │
+│        │      │            │    │        │
+│- CLI   │      │- API Gate  │    │- REST  │
+│- Files │      │- Compat    │    │- DB    │
+│- Logs  │      │- Monitor   │    │- Queue │
+└────────┘      └────────────┘    └────────┘
+```
+
+### Deployment Strategies
+
+**Blue-Green Deployment**:
+
+```text
+# Blue-Green deployment with integration bridge
+# Phase 1: Deploy new system alongside existing (Green environment)
+cd src/tools
+make all
+make create-installers
+
+# Install new system without disrupting existing
+./packages/installers/install-provisioning-2.0.0.sh \
+  --install-path /opt/provisioning-v2 \
+  --no-replace-existing \
+  --enable-bridge-mode
+
+# Phase 2: Start orchestrator and validate integration
+/opt/provisioning-v2/bin/orchestrator start --bridge-mode --legacy-path /opt/provisioning-v1
+
+# Phase 3: Gradual traffic shift
+# Route 10% traffic to new system
+nginx-traffic-split --new-backend 10%
+
+# Validate metrics and gradually increase
+nginx-traffic-split --new-backend 50%
+nginx-traffic-split --new-backend 90%
+
+# Phase 4: Complete cutover
+nginx-traffic-split --new-backend 100%
+/opt/provisioning-v1/bin/orchestrator stop
+```
+
+**Rolling Update**:
+
+```text
+def rolling-deployment [
+    --target-version: string,
+    --batch-size: int = 3,
+    --health-check-interval: duration = 30sec
+] -> record {
+    let nodes = (get-deployment-nodes)
+    let batches = ($nodes | chunks $batch_size)  # split into fixed-size batches
+
+    mut deployment_results = []
+
+    for batch in $batches {
+        print $"Deploying to batch: ($batch | get name | str join ', ')"
+
+        # Deploy to batch
+        for node in $batch {
+            deploy-to-node $node $target_version
+        }
+
+        # Wait for health checks
+        sleep $health_check_interval
+
+        # Verify batch health
+        let batch_health = ($batch | each { |node| check-node-health $node })
+        let healthy_nodes = ($batch_health | where healthy == true | length)
+
+        if $healthy_nodes != ($batch | length) {
+            # Rollback batch on failure
+            print "Health check failed, rolling back batch"
+            for node in $batch {
+                rollback-node $node
+            }
+            error make {msg: "Rolling deployment failed at batch"}
+        }
+
+        print "Batch deployed successfully"
+        $deployment_results = ($deployment_results | append {
+            batch: $batch,
+            status: "success",
+            deployed_at: (date
now) + }) + } + + { + strategy: "rolling", + target_version: $target_version, + batches: ($deployment_results | length), + status: "completed", + completed_at: (date now) + } +} +``` + +### Configuration Deployment + +**Environment-Specific Deployment**: + +```text +# Development deployment +PROVISIONING_ENV=dev ./deploy.sh + --config-source config.dev.toml + --enable-debug + --enable-hot-reload + +# Staging deployment +PROVISIONING_ENV=staging ./deploy.sh + --config-source config.staging.toml + --enable-monitoring + --backup-before-deploy + +# Production deployment +PROVISIONING_ENV=prod ./deploy.sh + --config-source config.prod.toml + --zero-downtime + --enable-all-monitoring + --backup-before-deploy + --health-check-timeout 5m +``` + +### Container Integration + +**Docker Deployment with Bridge**: + +```text +# Multi-stage Docker build supporting both systems +FROM rust:1.70 as builder +WORKDIR /app +COPY . . +RUN cargo build --release + +FROM ubuntu:22.04 as runtime +WORKDIR /app + +# Install both legacy and new systems +COPY --from=builder /app/target/release/orchestrator /app/bin/ +COPY legacy-provisioning/ /app/legacy/ +COPY config/ /app/config/ + +# Bridge script for dual operation +COPY bridge-start.sh /app/bin/ + +ENV PROVISIONING_BRIDGE_MODE=true +ENV PROVISIONING_LEGACY_PATH=/app/legacy +ENV PROVISIONING_NEW_PATH=/app/bin + +EXPOSE 8080 +CMD ["/app/bin/bridge-start.sh"] +``` + +**Kubernetes Integration**: + +```text +# Kubernetes deployment with bridge sidecar +apiVersion: apps/v1 +kind: Deployment +metadata: + name: provisioning-system +spec: + replicas: 3 + template: + spec: + containers: + - name: orchestrator + image: provisioning-system:2.0.0 + ports: + - containerPort: 8080 + env: + - name: PROVISIONING_BRIDGE_MODE + value: "true" + volumeMounts: + - name: config + mountPath: /app/config + - name: legacy-data + mountPath: /app/legacy/data + + - name: legacy-bridge + image: provisioning-legacy:1.0.0 + env: + - name: BRIDGE_ORCHESTRATOR_URL + value: "http://localhost:9090" + volumeMounts: + - name: legacy-data + mountPath: /data + + volumes: + - name: config + configMap: + name: provisioning-config + - name: legacy-data + persistentVolumeClaim: + claimName: provisioning-data +``` + +## Monitoring and Observability + +### Integrated Monitoring Architecture + +**Monitoring Stack Integration**: + +```text +Observability Architecture +┌─────────────────────────────────────────────────────────────────┐ +│ Monitoring Dashboard │ +│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ +│ │ Grafana │ │ Jaeger │ │ AlertMgr │ │ +│ └─────────────┘ └─────────────┘ └─────────────┘ │ +└─────────────┬───────────────┬───────────────┬─────────────────┘ + │ │ │ + ┌──────────▼──────────┐ │ ┌───────────▼───────────┐ + │ Prometheus │ │ │ Jaeger │ + │ (Metrics) │ │ │ (Tracing) │ + └──────────┬──────────┘ │ └───────────┬───────────┘ + │ │ │ +┌─────────────▼─────────────┐ │ ┌─────────────▼─────────────┐ +│ Legacy │ │ │ New System │ +│ Monitoring │ │ │ Monitoring │ +│ │ │ │ │ +│ - File-based logs │ │ │ - Structured logs │ +│ - Simple metrics │ │ │ - Prometheus metrics │ +│ - Basic health checks │ │ │ - Distributed tracing │ +└───────────────────────────┘ │ └───────────────────────────┘ + │ + ┌─────────▼─────────┐ + │ Bridge Monitor │ + │ │ + │ - Integration │ + │ - Compatibility │ + │ - Migration │ + └───────────────────┘ +``` + +### Metrics Integration + +**Unified Metrics Collection**: + +```text +# Metrics bridge for legacy and new systems +def collect-system-metrics [] -> record { + let 
legacy_metrics = collect-legacy-metrics + let new_metrics = collect-new-metrics + let bridge_metrics = collect-bridge-metrics + + { + timestamp: (date now), + legacy: $legacy_metrics, + new: $new_metrics, + bridge: $bridge_metrics, + integration: { + compatibility_rate: (calculate-compatibility-rate $bridge_metrics), + migration_progress: (calculate-migration-progress), + system_health: (assess-overall-health $legacy_metrics $new_metrics) + } + } +} + +def collect-legacy-metrics [] -> record { + let log_files = (ls logs/*.log) + let process_stats = (get-process-stats "legacy-provisioning") + + { + active_processes: $process_stats.count, + log_file_sizes: ($log_files | get size | math sum), + last_activity: (get-last-log-timestamp), + error_count: (count-log-errors "last 1h"), + performance: { + avg_response_time: (calculate-avg-response-time), + throughput: (calculate-throughput) + } + } +} + +def collect-new-metrics [] -> record { + let orchestrator_stats = try { + http get "http://localhost:9090/metrics" + } catch { + {status: "unavailable"} + } + + { + orchestrator: $orchestrator_stats, + workflow_stats: (get-workflow-metrics), + api_stats: (get-api-metrics), + database_stats: (get-database-metrics) + } +} +``` + +### Logging Integration + +**Unified Logging Strategy**: + +```text +# Structured logging bridge +def log-integrated [ + level: string, + message: string, + --component: string = "bridge", + --legacy-compat: bool = true +] { + let log_entry = { + timestamp: (date now | format date "%Y-%m-%d %H:%M:%S%.3f"), + level: $level, + component: $component, + message: $message, + system: "integrated", + correlation_id: (generate-correlation-id) + } + + # Write to structured log (new system) + $log_entry | to json | save --append logs/integrated.jsonl + + if $legacy_compat { + # Write to legacy log format + let legacy_entry = $"[($log_entry.timestamp)] [($level)] ($component): ($message)" + $legacy_entry | save --append logs/legacy.log + } + + # Send to monitoring system + send-to-monitoring $log_entry +} +``` + +### Health Check Integration + +**Comprehensive Health Monitoring**: + +```text +def health-check-integrated [] -> record { + let health_checks = [ + {name: "legacy-system", check: (check-legacy-health)}, + {name: "orchestrator", check: (check-orchestrator-health)}, + {name: "database", check: (check-database-health)}, + {name: "bridge-compatibility", check: (check-bridge-health)}, + {name: "configuration", check: (check-config-health)} + ] + + let results = ($health_checks | each { |check| + let result = try { + do $check.check + } catch { |e| + {status: "unhealthy", error: $e.msg} + } + + {name: $check.name, result: $result} + }) + + let healthy_count = ($results | where result.status == "healthy" | length) + let total_count = ($results | length) + + { + overall_status: (if $healthy_count == $total_count { "healthy" } else { "degraded" }), + healthy_services: $healthy_count, + total_services: $total_count, + services: $results, + checked_at: (date now) + } +} +``` + +## Legacy System Bridge + +### Bridge Architecture + +**Bridge Component Design**: + +```text +# Legacy system bridge module +export module bridge { + # Bridge state management + export def init-bridge [] -> record { + let bridge_config = get-config-section "bridge" + + { + legacy_path: ($bridge_config.legacy_path? | default "/opt/provisioning-v1"), + new_path: ($bridge_config.new_path? | default "/opt/provisioning-v2"), + mode: ($bridge_config.mode? 
| default "compatibility"), + monitoring_enabled: ($bridge_config.monitoring? | default true), + initialized_at: (date now) + } + } + + # Command translation layer + export def translate-command [ + legacy_command: list + ] -> list { + match $legacy_command { + ["provisioning", "server", "create", $name, $plan, ...$args] => { + let new_args = ($args | each { |arg| + match $arg { + "--dry-run" => "--dry-run", + "--wait" => "--wait", + $zone if ($zone | str starts-with "--zone=") => $zone, + _ => $arg + } + }) + + ["provisioning", "server", "create", $name, $plan] ++ $new_args ++ ["--orchestrated"] + }, + _ => $legacy_command # Pass through unchanged + } + } + + # Data format translation + export def translate-response [ + legacy_response: record, + target_format: string = "v2" + ] -> record { + match $target_format { + "v2" => { + id: ($legacy_response.id? | default (generate-uuid)), + name: $legacy_response.name, + status: $legacy_response.status, + created_at: ($legacy_response.created_at? | default (date now)), + metadata: ($legacy_response | reject name status created_at), + version: "v2-compat" + }, + _ => $legacy_response + } + } +} +``` + +### Bridge Operation Modes + +**Compatibility Mode**: + +```text +# Full compatibility with legacy system +def run-compatibility-mode [] { + print "Starting bridge in compatibility mode..." + + # Intercept legacy commands + let legacy_commands = monitor-legacy-commands + + for command in $legacy_commands { + let translated = (bridge translate-command $command) + + try { + let result = (execute-new-system $translated) + let legacy_result = (bridge translate-response $result "v1") + respond-to-legacy $legacy_result + } catch { |e| + # Fall back to legacy system on error + let fallback_result = (execute-legacy-system $command) + respond-to-legacy $fallback_result + } + } +} +``` + +**Migration Mode**: + +```text +# Gradual migration with traffic splitting +def run-migration-mode [ + --new-system-percentage: int = 50 +] { + print $"Starting bridge in migration mode (($new_system_percentage)% new system)" + + let commands = monitor-all-commands + + for command in $commands { + let route_to_new = ((random integer 1..100) <= $new_system_percentage) + + if $route_to_new { + try { + execute-new-system $command + } catch { + # Fall back to legacy on failure + execute-legacy-system $command + } + } else { + execute-legacy-system $command + } + } +} +``` + +## Migration Pathways + +### Migration Phases + +**Phase 1: Parallel Deployment** + +- Deploy new system alongside existing +- Enable bridge for compatibility +- Begin data synchronization +- Monitor integration health + +**Phase 2: Gradual Migration** + +- Route increasing traffic to new system +- Migrate data in background +- Validate consistency +- Address integration issues + +**Phase 3: Full Migration** + +- Complete traffic cutover +- Decommission legacy system +- Clean up bridge components +- Finalize data migration + +### Migration Automation + +**Automated Migration Orchestration**: + +```text +def execute-migration-plan [ + migration_plan: string, + --dry-run: bool = false, + --skip-backup: bool = false +] -> record { + let plan = (open $migration_plan | from yaml) + + if not $skip_backup { + create-pre-migration-backup + } + + let migration_results = [] + + for phase in $plan.phases { + print $"Executing migration phase: ($phase.name)" + + if $dry_run { + print $"[DRY RUN] Would execute phase: ($phase)" + continue + } + + let phase_result = try { + execute-migration-phase $phase + } catch { |e| + 
print $"Migration phase failed: ($e.msg)" + + if $phase.rollback_on_failure? | default false { + print "Rolling back migration phase..." + rollback-migration-phase $phase + } + + error make {msg: $"Migration failed at phase ($phase.name): ($e.msg)"} + } + + $migration_results = ($migration_results | append $phase_result) + + # Wait between phases if specified + if "wait_seconds" in $phase { + sleep ($phase.wait_seconds * 1sec) + } + } + + { + migration_plan: $migration_plan, + phases_completed: ($migration_results | length), + status: "completed", + completed_at: (date now), + results: $migration_results + } +} +``` + +**Migration Validation**: + +```text +def validate-migration-readiness [] -> record { + let checks = [ + {name: "backup-available", check: (check-backup-exists)}, + {name: "new-system-healthy", check: (check-new-system-health)}, + {name: "database-accessible", check: (check-database-connectivity)}, + {name: "configuration-valid", check: (validate-migration-config)}, + {name: "resources-available", check: (check-system-resources)}, + {name: "network-connectivity", check: (check-network-health)} + ] + + let results = ($checks | each { |check| + { + name: $check.name, + result: (do $check.check), + timestamp: (date now) + } + }) + + let failed_checks = ($results | where result.status != "ready") + + { + ready_for_migration: ($failed_checks | length) == 0, + checks: $results, + failed_checks: $failed_checks, + validated_at: (date now) + } +} +``` + +## Troubleshooting Integration Issues + +### Common Integration Problems + +#### API Compatibility Issues + +**Problem**: Version mismatch between client and server + +```text +# Diagnosis +curl -H "API-Version: v1" http://localhost:9090/health +curl -H "API-Version: v2" http://localhost:9090/health + +# Solution: Check supported versions +curl http://localhost:9090/api/versions + +# Update client API version +export PROVISIONING_API_VERSION=v2 +``` + +#### Configuration Bridge Issues + +**Problem**: Configuration not found in either system + +```text +# Diagnosis +def diagnose-config-issue [key: string] -> record { + let toml_result = try { + get-config-value $key + } catch { |e| {status: "failed", error: $e.msg} } + + let env_key = ($key | str replace "." 
"_" | str upcase | $"PROVISIONING_($in)") + let env_result = try { + $env | get $env_key + } catch { |e| {status: "failed", error: $e.msg} } + + { + key: $key, + toml_config: $toml_result, + env_config: $env_result, + migration_needed: ($toml_result.status == "failed" and $env_result.status != "failed") + } +} + +# Solution: Migrate configuration +def migrate-single-config [key: string] { + let diagnosis = (diagnose-config-issue $key) + + if $diagnosis.migration_needed { + let env_value = $diagnosis.env_config + set-config-value $key $env_value + print $"Migrated ($key) from environment variable" + } +} +``` + +#### Database Integration Issues + +**Problem**: Data inconsistency between systems + +```text +# Diagnosis and repair +def repair-data-consistency [] -> record { + let legacy_data = (read-legacy-data) + let new_data = (read-new-data) + + let inconsistencies = [] + + # Check server records + for server in $legacy_data.servers { + let new_server = ($new_data.servers | where id == $server.id | first) + + if ($new_server | is-empty) { + print $"Missing server in new system: ($server.id)" + create-server-record $server + $inconsistencies = ($inconsistencies | append {type: "missing", id: $server.id}) + } else if $new_server != $server { + print $"Inconsistent server data: ($server.id)" + update-server-record $server + $inconsistencies = ($inconsistencies | append {type: "inconsistent", id: $server.id}) + } + } + + { + inconsistencies_found: ($inconsistencies | length), + repairs_applied: ($inconsistencies | length), + repaired_at: (date now) + } +} +``` + +### Debug Tools + +**Integration Debug Mode**: + +```text +# Enable comprehensive debugging +export PROVISIONING_DEBUG=true +export PROVISIONING_LOG_LEVEL=debug +export PROVISIONING_BRIDGE_DEBUG=true +export PROVISIONING_INTEGRATION_TRACE=true + +# Run with integration debugging +provisioning server create test-server 2xCPU-4 GB --debug-integration +``` + +**Health Check Debugging**: + +```text +def debug-integration-health [] -> record { + print "=== Integration Health Debug ===" + + # Check all integration points + let legacy_health = try { + check-legacy-system + } catch { |e| {status: "error", error: $e.msg} } + + let orchestrator_health = try { + http get "http://localhost:9090/health" + } catch { |e| {status: "error", error: $e.msg} } + + let bridge_health = try { + check-bridge-status + } catch { |e| {status: "error", error: $e.msg} } + + let config_health = try { + validate-config-integration + } catch { |e| {status: "error", error: $e.msg} } + + print $"Legacy System: ($legacy_health.status)" + print $"Orchestrator: ($orchestrator_health.status)" + print $"Bridge: ($bridge_health.status)" + print $"Configuration: ($config_health.status)" + + { + legacy: $legacy_health, + orchestrator: $orchestrator_health, + bridge: $bridge_health, + configuration: $config_health, + debug_timestamp: (date now) + } +} +``` + +This integration guide provides a comprehensive framework for seamlessly integrating new development components with existing production systems while +maintaining reliability, compatibility, and clear migration pathways. 
\ No newline at end of file diff --git a/docs/src/development/kms-simplification.md b/docs/src/development/kms-simplification.md index 60d59ec..0290050 100644 --- a/docs/src/development/kms-simplification.md +++ b/docs/src/development/kms-simplification.md @@ -1 +1,570 @@ -# KMS Simplification Migration Guide\n\n**Version**: 0.2.0\n**Date**: 2025-10-08\n**Status**: Active\n\n## Overview\n\nThe KMS service has been simplified from supporting 4 backends (Vault, AWS KMS, Age, Cosmian) to supporting only 2 backends:\n\n- **Age**: Development and local testing\n- **Cosmian KMS**: Production deployments\n\nThis simplification reduces complexity, removes unnecessary cloud provider dependencies, and provides a clearer separation between development and\nproduction use cases.\n\n## What Changed\n\n### Removed\n\n- ❌ HashiCorp Vault backend (`src/vault/`)\n- ❌ AWS KMS backend (`src/aws/`)\n- ❌ AWS SDK dependencies (`aws-sdk-kms`, `aws-config`, `aws-credential-types`)\n- ❌ Envelope encryption helpers (AWS-specific)\n- ❌ Complex multi-backend configuration\n\n### Added\n\n- ✅ Age backend for development (`src/age/`)\n- ✅ Cosmian KMS backend for production (`src/cosmian/`)\n- ✅ Simplified configuration (`provisioning/config/kms.toml`)\n- ✅ Clear dev/prod separation\n- ✅ Better error messages\n\n### Modified\n\n- 🔄 `KmsBackendConfig` enum (now only Age and Cosmian)\n- 🔄 `KmsError` enum (removed Vault/AWS-specific errors)\n- 🔄 Service initialization logic\n- 🔄 README and documentation\n- 🔄 Cargo.toml dependencies\n\n## Why This Change\n\n### Problems with Previous Approach\n\n1. **Unnecessary Complexity**: 4 backends for simple use cases\n2. **Cloud Lock-in**: AWS KMS dependency limited flexibility\n3. **Operational Overhead**: Vault requires server setup even for dev\n4. **Dependency Bloat**: AWS SDK adds significant compile time\n5. **Unclear Use Cases**: When to use which backend?\n\n### Benefits of Simplified Approach\n\n1. **Clear Separation**: Age = dev, Cosmian = prod\n2. **Faster Compilation**: Removed AWS SDK (saves ~30 s)\n3. **Offline Development**: Age works without network\n4. **Enterprise Security**: Cosmian provides confidential computing\n5. 
**Easier Maintenance**: 2 backends instead of 4\n\n## Migration Steps\n\n### For Development Environments\n\nIf you were using **Vault** or **AWS KMS** for development:\n\n#### Step 1: Install Age\n\n```\n# macOS\nbrew install age\n\n# Ubuntu/Debian\napt install age\n\n# From source\ngo install filippo.io/age/cmd/...@latest\n```\n\n#### Step 2: Generate Age Keys\n\n```\nmkdir -p ~/.config/provisioning/age\nage-keygen -o ~/.config/provisioning/age/private_key.txt\nage-keygen -y ~/.config/provisioning/age/private_key.txt > ~/.config/provisioning/age/public_key.txt\n```\n\n#### Step 3: Update Configuration\n\nReplace your old Vault/AWS config:\n\n**Old (Vault)**:\n\n```\n[kms]\ntype = "vault"\naddress = "http://localhost:8200"\ntoken = "${VAULT_TOKEN}"\nmount_point = "transit"\n```\n\n**New (Age)**:\n\n```\n[kms]\nenvironment = "dev"\n\n[kms.age]\npublic_key_path = "~/.config/provisioning/age/public_key.txt"\nprivate_key_path = "~/.config/provisioning/age/private_key.txt"\n```\n\n#### Step 4: Re-encrypt Development Secrets\n\n```\n# Export old secrets (if using Vault)\nvault kv get -format=json secret/dev > dev-secrets.json\n\n# Encrypt with Age\ncat dev-secrets.json | age -r $(cat ~/.config/provisioning/age/public_key.txt) > dev-secrets.age\n\n# Test decryption\nage -d -i ~/.config/provisioning/age/private_key.txt dev-secrets.age\n```\n\n### For Production Environments\n\nIf you were using **Vault** or **AWS KMS** for production:\n\n#### Step 1: Set Up Cosmian KMS\n\nChoose one of these options:\n\n**Option A: Cosmian Cloud (Managed)**\n\n```\n# Sign up at https://cosmian.com\n# Get API credentials\nexport COSMIAN_KMS_URL=https://kms.cosmian.cloud\nexport COSMIAN_API_KEY=your-api-key\n```\n\n**Option B: Self-Hosted Cosmian KMS**\n\n```\n# Deploy Cosmian KMS server\n# See: https://docs.cosmian.com/kms/deployment/\n\n# Configure endpoint\nexport COSMIAN_KMS_URL=https://kms.example.com\nexport COSMIAN_API_KEY=your-api-key\n```\n\n#### Step 2: Create Master Key in Cosmian\n\n```\n# Using Cosmian CLI\ncosmian-kms create-key \\n --algorithm AES \\n --key-length 256 \\n --key-id provisioning-master-key\n\n# Or via API\ncurl -X POST $COSMIAN_KMS_URL/api/v1/keys \\n -H "X-API-Key: $COSMIAN_API_KEY" \\n -H "Content-Type: application/json" \\n -d '{\n "algorithm": "AES",\n "keyLength": 256,\n "keyId": "provisioning-master-key"\n }'\n```\n\n#### Step 3: Migrate Production Secrets\n\n**From Vault to Cosmian**:\n\n```\n# Export secrets from Vault\nvault kv get -format=json secret/prod > prod-secrets.json\n\n# Import to Cosmian\n# (Use temporary Age encryption for transfer)\ncat prod-secrets.json | \\n age -r $(cat ~/.config/provisioning/age/public_key.txt) | \\n base64 > prod-secrets.enc\n\n# On production server with Cosmian\ncat prod-secrets.enc | \\n base64 -d | \\n age -d -i ~/.config/provisioning/age/private_key.txt | \\n # Re-encrypt with Cosmian\n curl -X POST $COSMIAN_KMS_URL/api/v1/encrypt \\n -H "X-API-Key: $COSMIAN_API_KEY" \\n -d @-\n```\n\n**From AWS KMS to Cosmian**:\n\n```\n# Decrypt with AWS KMS\naws kms decrypt \\n --ciphertext-blob fileb://encrypted-data \\n --output text \\n --query Plaintext | \\n base64 -d > plaintext-data\n\n# Encrypt with Cosmian\ncurl -X POST $COSMIAN_KMS_URL/api/v1/encrypt \\n -H "X-API-Key: $COSMIAN_API_KEY" \\n -H "Content-Type: application/json" \\n -d "{\"keyId\":\"provisioning-master-key\",\"data\":\"$(base64 plaintext-data)\"}"\n```\n\n#### Step 4: Update Production Configuration\n\n**Old (AWS KMS)**:\n\n```\n[kms]\ntype = "aws-kms"\nregion = 
"us-east-1"\nkey_id = "arn:aws:kms:us-east-1:123456789012:key/..."\n```\n\n**New (Cosmian)**:\n\n```\n[kms]\nenvironment = "prod"\n\n[kms.cosmian]\nserver_url = "${COSMIAN_KMS_URL}"\napi_key = "${COSMIAN_API_KEY}"\ndefault_key_id = "provisioning-master-key"\ntls_verify = true\nuse_confidential_computing = false # Enable if using SGX/SEV\n```\n\n#### Step 5: Test Production Setup\n\n```\n# Set environment\nexport PROVISIONING_ENV=prod\nexport COSMIAN_KMS_URL=https://kms.example.com\nexport COSMIAN_API_KEY=your-api-key\n\n# Start KMS service\ncargo run --bin kms-service\n\n# Test encryption\ncurl -X POST http://localhost:8082/api/v1/kms/encrypt \\n -H "Content-Type: application/json" \\n -d '{"plaintext":"SGVsbG8=","context":"env=prod"}'\n\n# Test decryption\ncurl -X POST http://localhost:8082/api/v1/kms/decrypt \\n -H "Content-Type: application/json" \\n -d '{"ciphertext":"...","context":"env=prod"}'\n```\n\n## Configuration Comparison\n\n### Before (4 Backends)\n\n```\n# Development could use any backend\n[kms]\ntype = "vault" # or "aws-kms"\naddress = "http://localhost:8200"\ntoken = "${VAULT_TOKEN}"\n\n# Production used Vault or AWS\n[kms]\ntype = "aws-kms"\nregion = "us-east-1"\nkey_id = "arn:aws:kms:..."\n```\n\n### After (2 Backends)\n\n```\n# Clear environment-based selection\n[kms]\ndev_backend = "age"\nprod_backend = "cosmian"\nenvironment = "${PROVISIONING_ENV:-dev}"\n\n# Age for development\n[kms.age]\npublic_key_path = "~/.config/provisioning/age/public_key.txt"\nprivate_key_path = "~/.config/provisioning/age/private_key.txt"\n\n# Cosmian for production\n[kms.cosmian]\nserver_url = "${COSMIAN_KMS_URL}"\napi_key = "${COSMIAN_API_KEY}"\ndefault_key_id = "provisioning-master-key"\ntls_verify = true\n```\n\n## Breaking Changes\n\n### API Changes\n\n#### Removed Functions\n\n- `generate_data_key()` - Now only available with Cosmian backend\n- `envelope_encrypt()` - AWS-specific, removed\n- `envelope_decrypt()` - AWS-specific, removed\n- `rotate_key()` - Now handled server-side by Cosmian\n\n#### Changed Error Types\n\n**Before**:\n\n```\nKmsError::VaultError(String)\nKmsError::AwsKmsError(String)\n```\n\n**After**:\n\n```\nKmsError::AgeError(String)\nKmsError::CosmianError(String)\n```\n\n#### Updated Configuration Enum\n\n**Before**:\n\n```\nenum KmsBackendConfig {\n Vault { address, token, mount_point, ... 
},\n AwsKms { region, key_id, assume_role },\n}\n```\n\n**After**:\n\n```\nenum KmsBackendConfig {\n Age { public_key_path, private_key_path },\n Cosmian { server_url, api_key, default_key_id, tls_verify },\n}\n```\n\n## Code Migration\n\n### Rust Code\n\n**Before (AWS KMS)**:\n\n```\nuse kms_service::{KmsService, KmsBackendConfig};\n\nlet config = KmsBackendConfig::AwsKms {\n region: "us-east-1".to_string(),\n key_id: "arn:aws:kms:...".to_string(),\n assume_role: None,\n};\n\nlet kms = KmsService::new(config).await?;\n```\n\n**After (Cosmian)**:\n\n```\nuse kms_service::{KmsService, KmsBackendConfig};\n\nlet config = KmsBackendConfig::Cosmian {\n server_url: env::var("COSMIAN_KMS_URL")?,\n api_key: env::var("COSMIAN_API_KEY")?,\n default_key_id: "provisioning-master-key".to_string(),\n tls_verify: true,\n};\n\nlet kms = KmsService::new(config).await?;\n```\n\n### Nushell Code\n\n**Before (Vault)**:\n\n```\n# Set Vault environment\n$env.VAULT_ADDR = "http://localhost:8200"\n$env.VAULT_TOKEN = "root"\n\n# Use KMS\nkms encrypt "secret-data"\n```\n\n**After (Age for dev)**:\n\n```\n# Set environment\n$env.PROVISIONING_ENV = "dev"\n\n# Age keys automatically loaded from config\nkms encrypt "secret-data"\n```\n\n## Rollback Plan\n\nIf you need to rollback to Vault/AWS KMS:\n\n```\n# Checkout previous version\ngit checkout tags/v0.1.0\n\n# Rebuild with old dependencies\ncd provisioning/platform/kms-service\ncargo clean\ncargo build --release\n\n# Restore old configuration\ncp provisioning/config/kms.toml.backup provisioning/config/kms.toml\n```\n\n## Testing the Migration\n\n### Development Testing\n\n```\n# 1. Generate Age keys\nage-keygen -o /tmp/test_private.txt\nage-keygen -y /tmp/test_private.txt > /tmp/test_public.txt\n\n# 2. Test encryption\necho "test-data" | age -r $(cat /tmp/test_public.txt) > /tmp/encrypted\n\n# 3. Test decryption\nage -d -i /tmp/test_private.txt /tmp/encrypted\n\n# 4. Start KMS service with test keys\nexport PROVISIONING_ENV=dev\n# Update config to point to /tmp keys\ncargo run --bin kms-service\n```\n\n### Production Testing\n\n```\n# 1. Set up test Cosmian instance\nexport COSMIAN_KMS_URL=https://kms-staging.example.com\nexport COSMIAN_API_KEY=test-api-key\n\n# 2. Create test key\ncosmian-kms create-key --key-id test-key --algorithm AES --key-length 256\n\n# 3. Test encryption\ncurl -X POST $COSMIAN_KMS_URL/api/v1/encrypt \\n -H "X-API-Key: $COSMIAN_API_KEY" \\n -d '{"keyId":"test-key","data":"dGVzdA=="}'\n\n# 4. 
Start KMS service\nexport PROVISIONING_ENV=prod\ncargo run --bin kms-service\n```\n\n## Troubleshooting\n\n### Age Keys Not Found\n\n```\n# Check keys exist\nls -la ~/.config/provisioning/age/\n\n# Regenerate if missing\nage-keygen -o ~/.config/provisioning/age/private_key.txt\nage-keygen -y ~/.config/provisioning/age/private_key.txt > ~/.config/provisioning/age/public_key.txt\n```\n\n### Cosmian Connection Failed\n\n```\n# Check network connectivity\ncurl -v $COSMIAN_KMS_URL/api/v1/health\n\n# Verify API key\ncurl $COSMIAN_KMS_URL/api/v1/version \\n -H "X-API-Key: $COSMIAN_API_KEY"\n\n# Check TLS certificate\nopenssl s_client -connect kms.example.com:443\n```\n\n### Compilation Errors\n\n```\n# Clean and rebuild\ncd provisioning/platform/kms-service\ncargo clean\ncargo update\ncargo build --release\n```\n\n## Support\n\n- **Documentation**: See README.md\n- **Issues**: Report on project issue tracker\n- **Cosmian Support**: \n\n## Timeline\n\n- **2025-10-08**: Migration guide published\n- **2025-10-15**: Deprecation notices for Vault/AWS\n- **2025-11-01**: Old backends removed from codebase\n- **2025-11-15**: Migration complete, old configs unsupported\n\n## FAQs\n\n**Q: Can I still use Vault if I really need to?**\nA: No, Vault support has been removed. Use Age for dev or Cosmian for prod.\n\n**Q: What about AWS KMS for existing deployments?**\nA: Migrate to Cosmian KMS. The API is similar, and migration tools are provided.\n\n**Q: Is Age secure enough for production?**\nA: No. Age is designed for development only. Use Cosmian KMS for production.\n\n**Q: Does Cosmian support confidential computing?**\nA: Yes, Cosmian KMS supports SGX and SEV for confidential computing workloads.\n\n**Q: How much does Cosmian cost?**\nA: Cosmian offers both cloud and self-hosted options. Contact Cosmian for pricing.\n\n**Q: Can I use my own KMS backend?**\nA: Not currently supported. Only Age and Cosmian are available.\n\n## Checklist\n\nUse this checklist to track your migration:\n\n### Development Migration\n\n- [ ] Install Age (`brew install age` or equivalent)\n- [ ] Generate Age keys (`age-keygen`)\n- [ ] Update `provisioning/config/kms.toml` to use Age backend\n- [ ] Export secrets from Vault/AWS (if applicable)\n- [ ] Re-encrypt secrets with Age\n- [ ] Test KMS service startup\n- [ ] Test encrypt/decrypt operations\n- [ ] Update CI/CD pipelines (if applicable)\n- [ ] Update documentation\n\n### Production Migration\n\n- [ ] Set up Cosmian KMS server (cloud or self-hosted)\n- [ ] Create master key in Cosmian\n- [ ] Export production secrets from Vault/AWS\n- [ ] Re-encrypt secrets with Cosmian\n- [ ] Update `provisioning/config/kms.toml` to use Cosmian backend\n- [ ] Set environment variables (`COSMIAN_KMS_URL`, `COSMIAN_API_KEY`)\n- [ ] Test KMS service startup in staging\n- [ ] Test encrypt/decrypt operations in staging\n- [ ] Load test Cosmian integration\n- [ ] Update production deployment configs\n- [ ] Deploy to production\n- [ ] Verify all secrets accessible\n- [ ] Decommission old KMS infrastructure\n\n## Conclusion\n\nThe KMS simplification reduces complexity while providing better separation between development and production use cases. Age offers a fast, offline\nsolution for development, while Cosmian KMS provides enterprise-grade security for production deployments.\n\nFor questions or issues, please refer to the documentation or open an issue. 
+# KMS Simplification Migration Guide + +**Version**: 0.2.0 +**Date**: 2025-10-08 +**Status**: Active + +## Overview + +The KMS service has been simplified from supporting 4 backends (Vault, AWS KMS, Age, Cosmian) to supporting only 2 backends: + +- **Age**: Development and local testing +- **Cosmian KMS**: Production deployments + +This simplification reduces complexity, removes unnecessary cloud provider dependencies, and provides a clearer separation between development and +production use cases. + +## What Changed + +### Removed + +- ❌ HashiCorp Vault backend (`src/vault/`) +- ❌ AWS KMS backend (`src/aws/`) +- ❌ AWS SDK dependencies (`aws-sdk-kms`, `aws-config`, `aws-credential-types`) +- ❌ Envelope encryption helpers (AWS-specific) +- ❌ Complex multi-backend configuration + +### Added + +- ✅ Age backend for development (`src/age/`) +- ✅ Cosmian KMS backend for production (`src/cosmian/`) +- ✅ Simplified configuration (`provisioning/config/kms.toml`) +- ✅ Clear dev/prod separation +- ✅ Better error messages + +### Modified + +- 🔄 `KmsBackendConfig` enum (now only Age and Cosmian) +- 🔄 `KmsError` enum (removed Vault/AWS-specific errors) +- 🔄 Service initialization logic +- 🔄 README and documentation +- 🔄 Cargo.toml dependencies + +## Why This Change + +### Problems with Previous Approach + +1. **Unnecessary Complexity**: 4 backends for simple use cases +2. **Cloud Lock-in**: AWS KMS dependency limited flexibility +3. **Operational Overhead**: Vault requires server setup even for dev +4. **Dependency Bloat**: AWS SDK adds significant compile time +5. **Unclear Use Cases**: When to use which backend? + +### Benefits of Simplified Approach + +1. **Clear Separation**: Age = dev, Cosmian = prod +2. **Faster Compilation**: Removed AWS SDK (saves ~30 s) +3. **Offline Development**: Age works without network +4. **Enterprise Security**: Cosmian provides confidential computing +5. 
**Easier Maintenance**: 2 backends instead of 4
+
+## Migration Steps
+
+### For Development Environments
+
+If you were using **Vault** or **AWS KMS** for development:
+
+#### Step 1: Install Age
+
+```text
+# macOS
+brew install age
+
+# Ubuntu/Debian
+apt install age
+
+# From source
+go install filippo.io/age/cmd/...@latest
+```
+
+#### Step 2: Generate Age Keys
+
+```text
+mkdir -p ~/.config/provisioning/age
+age-keygen -o ~/.config/provisioning/age/private_key.txt
+age-keygen -y ~/.config/provisioning/age/private_key.txt > ~/.config/provisioning/age/public_key.txt
+```
+
+#### Step 3: Update Configuration
+
+Replace your old Vault/AWS config:
+
+**Old (Vault)**:
+
+```text
+[kms]
+type = "vault"
+address = "http://localhost:8200"
+token = "${VAULT_TOKEN}"
+mount_point = "transit"
+```
+
+**New (Age)**:
+
+```text
+[kms]
+environment = "dev"
+
+[kms.age]
+public_key_path = "~/.config/provisioning/age/public_key.txt"
+private_key_path = "~/.config/provisioning/age/private_key.txt"
+```
+
+#### Step 4: Re-encrypt Development Secrets
+
+```text
+# Export old secrets (if using Vault)
+vault kv get -format=json secret/dev > dev-secrets.json
+
+# Encrypt with Age
+cat dev-secrets.json | age -r $(cat ~/.config/provisioning/age/public_key.txt) > dev-secrets.age
+
+# Test decryption
+age -d -i ~/.config/provisioning/age/private_key.txt dev-secrets.age
+```
+
+### For Production Environments
+
+If you were using **Vault** or **AWS KMS** for production:
+
+#### Step 1: Set Up Cosmian KMS
+
+Choose one of these options:
+
+**Option A: Cosmian Cloud (Managed)**
+
+```text
+# Sign up at https://cosmian.com
+# Get API credentials
+export COSMIAN_KMS_URL=https://kms.cosmian.cloud
+export COSMIAN_API_KEY=your-api-key
+```
+
+**Option B: Self-Hosted Cosmian KMS**
+
+```text
+# Deploy Cosmian KMS server
+# See: https://docs.cosmian.com/kms/deployment/
+
+# Configure endpoint
+export COSMIAN_KMS_URL=https://kms.example.com
+export COSMIAN_API_KEY=your-api-key
+```
+
+#### Step 2: Create Master Key in Cosmian
+
+```text
+# Using Cosmian CLI
+cosmian-kms create-key \
+  --algorithm AES \
+  --key-length 256 \
+  --key-id provisioning-master-key
+
+# Or via API
+curl -X POST $COSMIAN_KMS_URL/api/v1/keys \
+  -H "X-API-Key: $COSMIAN_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "algorithm": "AES",
+    "keyLength": 256,
+    "keyId": "provisioning-master-key"
+  }'
+```
+
+#### Step 3: Migrate Production Secrets
+
+**From Vault to Cosmian**:
+
+```text
+# Export secrets from Vault
+vault kv get -format=json secret/prod > prod-secrets.json
+
+# Import to Cosmian
+# (Use temporary Age encryption for transfer)
+cat prod-secrets.json |
+  age -r $(cat ~/.config/provisioning/age/public_key.txt) |
+  base64 > prod-secrets.enc
+
+# On production server with Cosmian
+cat prod-secrets.enc |
+  base64 -d |
+  age -d -i ~/.config/provisioning/age/private_key.txt |
+  # Re-encrypt with Cosmian
+  curl -X POST $COSMIAN_KMS_URL/api/v1/encrypt \
+    -H "X-API-Key: $COSMIAN_API_KEY" \
+    -d @-
+```
+
+**From AWS KMS to Cosmian**:
+
+```text
+# Decrypt with AWS KMS
+aws kms decrypt \
+  --ciphertext-blob fileb://encrypted-data \
+  --output text \
+  --query Plaintext |
+  base64 -d > plaintext-data
+
+# Encrypt with Cosmian
+curl -X POST $COSMIAN_KMS_URL/api/v1/encrypt \
+  -H "X-API-Key: $COSMIAN_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d "{\"keyId\":\"provisioning-master-key\",\"data\":\"$(base64 plaintext-data)\"}"
+```
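+
+The pipelines above move one secret at a time. The following minimal Nushell sketch shows how the same transfer could be scripted per file, assuming the `/api/v1/encrypt` endpoint and `keyId`/`data` payload from the examples above; the helper name is hypothetical and error handling is omitted:
+
+```text
+# Hypothetical helper: Age-decrypt one exported secret, then re-encrypt it
+# through the Cosmian KMS HTTP API used in the pipelines above
+def migrate-secret-to-cosmian [file: string] {
+  let plaintext = (^age -d -i ~/.config/provisioning/age/private_key.txt $file)
+  let body = {keyId: "provisioning-master-key", data: ($plaintext | encode base64)}
+  http post --content-type application/json --headers [X-API-Key $env.COSMIAN_API_KEY] $"($env.COSMIAN_KMS_URL)/api/v1/encrypt" $body
+}
+```
+
+Run it once per exported file, for example `migrate-secret-to-cosmian prod-secrets.age`.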
+key_id = "arn:aws:kms:us-east-1:123456789012:key/..." +``` + +**New (Cosmian)**: + +```text +[kms] +environment = "prod" + +[kms.cosmian] +server_url = "${COSMIAN_KMS_URL}" +api_key = "${COSMIAN_API_KEY}" +default_key_id = "provisioning-master-key" +tls_verify = true +use_confidential_computing = false # Enable if using SGX/SEV +``` + +#### Step 5: Test Production Setup + +```text +# Set environment +export PROVISIONING_ENV=prod +export COSMIAN_KMS_URL=https://kms.example.com +export COSMIAN_API_KEY=your-api-key + +# Start KMS service +cargo run --bin kms-service + +# Test encryption +curl -X POST http://localhost:8082/api/v1/kms/encrypt + -H "Content-Type: application/json" + -d '{"plaintext":"SGVsbG8=","context":"env=prod"}' + +# Test decryption +curl -X POST http://localhost:8082/api/v1/kms/decrypt + -H "Content-Type: application/json" + -d '{"ciphertext":"...","context":"env=prod"}' +``` + +## Configuration Comparison + +### Before (4 Backends) + +```text +# Development could use any backend +[kms] +type = "vault" # or "aws-kms" +address = "http://localhost:8200" +token = "${VAULT_TOKEN}" + +# Production used Vault or AWS +[kms] +type = "aws-kms" +region = "us-east-1" +key_id = "arn:aws:kms:..." +``` + +### After (2 Backends) + +```text +# Clear environment-based selection +[kms] +dev_backend = "age" +prod_backend = "cosmian" +environment = "${PROVISIONING_ENV:-dev}" + +# Age for development +[kms.age] +public_key_path = "~/.config/provisioning/age/public_key.txt" +private_key_path = "~/.config/provisioning/age/private_key.txt" + +# Cosmian for production +[kms.cosmian] +server_url = "${COSMIAN_KMS_URL}" +api_key = "${COSMIAN_API_KEY}" +default_key_id = "provisioning-master-key" +tls_verify = true +``` + +## Breaking Changes + +### API Changes + +#### Removed Functions + +- `generate_data_key()` - Now only available with Cosmian backend +- `envelope_encrypt()` - AWS-specific, removed +- `envelope_decrypt()` - AWS-specific, removed +- `rotate_key()` - Now handled server-side by Cosmian + +#### Changed Error Types + +**Before**: + +```text +KmsError::VaultError(String) +KmsError::AwsKmsError(String) +``` + +**After**: + +```text +KmsError::AgeError(String) +KmsError::CosmianError(String) +``` + +#### Updated Configuration Enum + +**Before**: + +```text +enum KmsBackendConfig { + Vault { address, token, mount_point, ... 
+
+## Breaking Changes
+
+### API Changes
+
+#### Removed Functions
+
+- `generate_data_key()` - Now only available with Cosmian backend
+- `envelope_encrypt()` - AWS-specific, removed
+- `envelope_decrypt()` - AWS-specific, removed
+- `rotate_key()` - Now handled server-side by Cosmian
+
+#### Changed Error Types
+
+**Before**:
+
+```text
+KmsError::VaultError(String)
+KmsError::AwsKmsError(String)
+```
+
+**After**:
+
+```text
+KmsError::AgeError(String)
+KmsError::CosmianError(String)
+```
+
+#### Updated Configuration Enum
+
+**Before**:
+
+```text
+enum KmsBackendConfig {
+  Vault { address, token, mount_point, ... },
+  AwsKms { region, key_id, assume_role },
+}
+```
+
+**After**:
+
+```text
+enum KmsBackendConfig {
+  Age { public_key_path, private_key_path },
+  Cosmian { server_url, api_key, default_key_id, tls_verify },
+}
+```
+
+## Code Migration
+
+### Rust Code
+
+**Before (AWS KMS)**:
+
+```text
+use kms_service::{KmsService, KmsBackendConfig};
+
+let config = KmsBackendConfig::AwsKms {
+    region: "us-east-1".to_string(),
+    key_id: "arn:aws:kms:...".to_string(),
+    assume_role: None,
+};
+
+let kms = KmsService::new(config).await?;
+```
+
+**After (Cosmian)**:
+
+```text
+use std::env;
+
+use kms_service::{KmsService, KmsBackendConfig};
+
+let config = KmsBackendConfig::Cosmian {
+    server_url: env::var("COSMIAN_KMS_URL")?,
+    api_key: env::var("COSMIAN_API_KEY")?,
+    default_key_id: "provisioning-master-key".to_string(),
+    tls_verify: true,
+};
+
+let kms = KmsService::new(config).await?;
+```
+
+### Nushell Code
+
+**Before (Vault)**:
+
+```text
+# Set Vault environment
+$env.VAULT_ADDR = "http://localhost:8200"
+$env.VAULT_TOKEN = "root"
+
+# Use KMS
+kms encrypt "secret-data"
+```
+
+**After (Age for dev)**:
+
+```text
+# Set environment
+$env.PROVISIONING_ENV = "dev"
+
+# Age keys automatically loaded from config
+kms encrypt "secret-data"
+```
+
+## Rollback Plan
+
+If you need to roll back to Vault/AWS KMS:
+
+```text
+# Checkout previous version
+git checkout tags/v0.1.0
+
+# Rebuild with old dependencies
+cd provisioning/platform/kms-service
+cargo clean
+cargo build --release
+
+# Restore old configuration
+cp provisioning/config/kms.toml.backup provisioning/config/kms.toml
+```
+
+## Testing the Migration
+
+### Development Testing
+
+```text
+# 1. Generate Age keys
+age-keygen -o /tmp/test_private.txt
+age-keygen -y /tmp/test_private.txt > /tmp/test_public.txt
+
+# 2. Test encryption
+echo "test-data" | age -r $(cat /tmp/test_public.txt) > /tmp/encrypted
+
+# 3. Test decryption
+age -d -i /tmp/test_private.txt /tmp/encrypted
+
+# 4. Start KMS service with test keys
+export PROVISIONING_ENV=dev
+# Update config to point to /tmp keys
+cargo run --bin kms-service
+```
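+
+Once the dev service is up, a quick round trip against the HTTP endpoints from Step 5 confirms both operations end to end. A minimal Nushell sketch, assuming the encrypt response carries a `ciphertext` field matching the decrypt payload shown above:
+
+```text
+# Sketch: encrypt/decrypt round trip against the local KMS service
+def kms-roundtrip [] {
+  let payload = {plaintext: ("test-data" | encode base64), context: "env=dev"}
+  let encrypted = (http post --content-type application/json http://localhost:8082/api/v1/kms/encrypt $payload)
+  http post --content-type application/json http://localhost:8082/api/v1/kms/decrypt {ciphertext: $encrypted.ciphertext, context: "env=dev"}
+}
+```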
+
+### Production Testing
+
+```text
+# 1. Set up test Cosmian instance
+export COSMIAN_KMS_URL=https://kms-staging.example.com
+export COSMIAN_API_KEY=test-api-key
+
+# 2. Create test key
+cosmian-kms create-key --key-id test-key --algorithm AES --key-length 256
+
+# 3. Test encryption
+curl -X POST $COSMIAN_KMS_URL/api/v1/encrypt \
+  -H "X-API-Key: $COSMIAN_API_KEY" \
+  -d '{"keyId":"test-key","data":"dGVzdA=="}'
+
+# 4. Start KMS service
+export PROVISIONING_ENV=prod
+cargo run --bin kms-service
+```
+
+## Troubleshooting
+
+### Age Keys Not Found
+
+```text
+# Check keys exist
+ls -la ~/.config/provisioning/age/
+
+# Regenerate if missing
+age-keygen -o ~/.config/provisioning/age/private_key.txt
+age-keygen -y ~/.config/provisioning/age/private_key.txt > ~/.config/provisioning/age/public_key.txt
+```
+
+### Cosmian Connection Failed
+
+```text
+# Check network connectivity
+curl -v $COSMIAN_KMS_URL/api/v1/health
+
+# Verify API key
+curl $COSMIAN_KMS_URL/api/v1/version \
+  -H "X-API-Key: $COSMIAN_API_KEY"
+
+# Check TLS certificate
+openssl s_client -connect kms.example.com:443
+```
+
+### Compilation Errors
+
+```text
+# Clean and rebuild
+cd provisioning/platform/kms-service
+cargo clean
+cargo update
+cargo build --release
+```
+
+## Support
+
+- **Documentation**: See README.md
+- **Issues**: Report on project issue tracker
+- **Cosmian Support**:
+
+## Timeline
+
+- **2025-10-08**: Migration guide published
+- **2025-10-15**: Deprecation notices for Vault/AWS
+- **2025-11-01**: Old backends removed from codebase
+- **2025-11-15**: Migration complete, old configs unsupported
+
+## FAQs
+
+**Q: Can I still use Vault if I really need to?**
+A: No, Vault support has been removed. Use Age for dev or Cosmian for prod.
+
+**Q: What about AWS KMS for existing deployments?**
+A: Migrate to Cosmian KMS. The API is similar, and migration tools are provided.
+
+**Q: Is Age secure enough for production?**
+A: No. Age is designed for development only. Use Cosmian KMS for production.
+
+**Q: Does Cosmian support confidential computing?**
+A: Yes, Cosmian KMS supports SGX and SEV for confidential computing workloads.
+
+**Q: How much does Cosmian cost?**
+A: Cosmian offers both cloud and self-hosted options. Contact Cosmian for pricing.
+
+**Q: Can I use my own KMS backend?**
+A: Not currently supported. Only Age and Cosmian are available.
+
+## Checklist
+
+Use this checklist to track your migration:
+
+### Development Migration
+
+- [ ] Install Age (`brew install age` or equivalent)
+- [ ] Generate Age keys (`age-keygen`)
+- [ ] Update `provisioning/config/kms.toml` to use Age backend
+- [ ] Export secrets from Vault/AWS (if applicable)
+- [ ] Re-encrypt secrets with Age
+- [ ] Test KMS service startup
+- [ ] Test encrypt/decrypt operations
+- [ ] Update CI/CD pipelines (if applicable)
+- [ ] Update documentation
+
+### Production Migration
+
+- [ ] Set up Cosmian KMS server (cloud or self-hosted)
+- [ ] Create master key in Cosmian
+- [ ] Export production secrets from Vault/AWS
+- [ ] Re-encrypt secrets with Cosmian
+- [ ] Update `provisioning/config/kms.toml` to use Cosmian backend
+- [ ] Set environment variables (`COSMIAN_KMS_URL`, `COSMIAN_API_KEY`)
+- [ ] Test KMS service startup in staging
+- [ ] Test encrypt/decrypt operations in staging
+- [ ] Load test Cosmian integration
+- [ ] Update production deployment configs
+- [ ] Deploy to production
+- [ ] Verify all secrets accessible
+- [ ] Decommission old KMS infrastructure
+
+## Conclusion
+
+The KMS simplification reduces complexity while providing better separation between development and production use cases. Age offers a fast, offline
+solution for development, while Cosmian KMS provides enterprise-grade security for production deployments.
+
+For questions or issues, please refer to the documentation or open an issue.
\ No newline at end of file diff --git a/docs/src/development/mcp-server.md b/docs/src/development/mcp-server.md index dd7fc36..aa2b07f 100644 --- a/docs/src/development/mcp-server.md +++ b/docs/src/development/mcp-server.md @@ -1 +1,114 @@ -# MCP Server - Model Context Protocol\n\nA Rust-native Model Context Protocol (MCP) server for infrastructure automation and AI-assisted DevOps operations.\n\n> **Source**: `provisioning/platform/mcp-server/`\n> **Status**: Proof of Concept Complete\n\n## Overview\n\nReplaces the Python implementation with significant performance improvements while maintaining philosophical consistency with the Rust ecosystem approach.\n\n## Performance Results\n\n```\n🚀 Rust MCP Server Performance Analysis\n==================================================\n\n📋 Server Parsing Performance:\n • Sub-millisecond latency across all operations\n • 0μs average for configuration access\n\n🤖 AI Status Performance:\n • AI Status: 0μs avg (10000 iterations)\n\n💾 Memory Footprint:\n • ServerConfig size: 80 bytes\n • Config size: 272 bytes\n\n✅ Performance Summary:\n • Server parsing: Sub-millisecond latency\n • Configuration access: Microsecond latency\n • Memory efficient: Small struct footprint\n • Zero-copy string operations where possible\n```\n\n## Architecture\n\n```\nsrc/\n├── simple_main.rs # Lightweight MCP server entry point\n├── main.rs # Full MCP server (with SDK integration)\n├── lib.rs # Library interface\n├── config.rs # Configuration management\n├── provisioning.rs # Core provisioning engine\n├── tools.rs # AI-powered parsing tools\n├── errors.rs # Error handling\n└── performance_test.rs # Performance benchmarking\n```\n\n## Key Features\n\n1. **AI-Powered Server Parsing**: Natural language to infrastructure config\n2. **Multi-Provider Support**: AWS, UpCloud, Local\n3. **Configuration Management**: TOML-based with environment overrides\n4. **Error Handling**: Comprehensive error types with recovery hints\n5. **Performance Monitoring**: Built-in benchmarking capabilities\n\n## Rust vs Python Comparison\n\n| Metric | Python MCP Server | Rust MCP Server | Improvement |\n| -------- | ------------------ | ----------------- | ------------- |\n| **Startup Time** | ~500 ms | ~50 ms | **10x faster** |\n| **Memory Usage** | ~50 MB | ~5 MB | **10x less** |\n| **Parsing Latency** | ~1 ms | ~0.001 ms | **1000x faster** |\n| **Binary Size** | Python + deps | ~15 MB static | **Portable** |\n| **Type Safety** | Runtime errors | Compile-time | **Zero runtime errors** |\n\n## Usage\n\n```\n# Build and run\ncargo run --bin provisioning-mcp-server --release\n\n# Run with custom config\nPROVISIONING_PATH=/path/to/provisioning cargo run --bin provisioning-mcp-server -- --debug\n\n# Run tests\ncargo test\n\n# Run benchmarks\ncargo run --bin provisioning-mcp-server --release\n```\n\n## Configuration\n\nSet via environment variables:\n\n```\nexport PROVISIONING_PATH=/path/to/provisioning\nexport PROVISIONING_AI_PROVIDER=openai\nexport OPENAI_API_KEY=your-key\nexport PROVISIONING_DEBUG=true\n```\n\n## Integration Benefits\n\n1. **Philosophical Consistency**: Rust throughout the stack\n2. **Performance**: Sub-millisecond response times\n3. **Memory Safety**: No segfaults, no memory leaks\n4. **Concurrency**: Native async/await support\n5. **Distribution**: Single static binary\n6. **Cross-compilation**: ARM64/x86_64 support\n\n## Next Steps\n\n1. Full MCP SDK integration (schema definitions)\n2. WebSocket/TCP transport layer\n3. Plugin system for extensibility\n4. 
Metrics collection and monitoring\n5. Documentation and examples\n\n## Related Documentation\n\n- **Architecture**: [MCP Integration](../architecture/orchestrator-integration-model.md) +# MCP Server - Model Context Protocol + +A Rust-native Model Context Protocol (MCP) server for infrastructure automation and AI-assisted DevOps operations. + +> **Source**: `provisioning/platform/mcp-server/` +> **Status**: Proof of Concept Complete + +## Overview + +Replaces the Python implementation with significant performance improvements while maintaining philosophical consistency with the Rust ecosystem approach. + +## Performance Results + +```text +🚀 Rust MCP Server Performance Analysis +================================================== + +📋 Server Parsing Performance: + • Sub-millisecond latency across all operations + • 0μs average for configuration access + +🤖 AI Status Performance: + • AI Status: 0μs avg (10000 iterations) + +💾 Memory Footprint: + • ServerConfig size: 80 bytes + • Config size: 272 bytes + +✅ Performance Summary: + • Server parsing: Sub-millisecond latency + • Configuration access: Microsecond latency + • Memory efficient: Small struct footprint + • Zero-copy string operations where possible +``` + +## Architecture + +```text +src/ +├── simple_main.rs # Lightweight MCP server entry point +├── main.rs # Full MCP server (with SDK integration) +├── lib.rs # Library interface +├── config.rs # Configuration management +├── provisioning.rs # Core provisioning engine +├── tools.rs # AI-powered parsing tools +├── errors.rs # Error handling +└── performance_test.rs # Performance benchmarking +``` + +## Key Features + +1. **AI-Powered Server Parsing**: Natural language to infrastructure config +2. **Multi-Provider Support**: AWS, UpCloud, Local +3. **Configuration Management**: TOML-based with environment overrides +4. **Error Handling**: Comprehensive error types with recovery hints +5. **Performance Monitoring**: Built-in benchmarking capabilities + +## Rust vs Python Comparison + +| Metric | Python MCP Server | Rust MCP Server | Improvement | +| -------- | ------------------ | ----------------- | ------------- | +| **Startup Time** | ~500 ms | ~50 ms | **10x faster** | +| **Memory Usage** | ~50 MB | ~5 MB | **10x less** | +| **Parsing Latency** | ~1 ms | ~0.001 ms | **1000x faster** | +| **Binary Size** | Python + deps | ~15 MB static | **Portable** | +| **Type Safety** | Runtime errors | Compile-time | **Zero runtime errors** | + +## Usage + +```text +# Build and run +cargo run --bin provisioning-mcp-server --release + +# Run with custom config +PROVISIONING_PATH=/path/to/provisioning cargo run --bin provisioning-mcp-server -- --debug + +# Run tests +cargo test + +# Run benchmarks +cargo run --bin provisioning-mcp-server --release +``` + +## Configuration + +Set via environment variables: + +```text +export PROVISIONING_PATH=/path/to/provisioning +export PROVISIONING_AI_PROVIDER=openai +export OPENAI_API_KEY=your-key +export PROVISIONING_DEBUG=true +``` + +## Integration Benefits + +1. **Philosophical Consistency**: Rust throughout the stack +2. **Performance**: Sub-millisecond response times +3. **Memory Safety**: No segfaults, no memory leaks +4. **Concurrency**: Native async/await support +5. **Distribution**: Single static binary +6. **Cross-compilation**: ARM64/x86_64 support + +## Next Steps + +1. Full MCP SDK integration (schema definitions) +2. WebSocket/TCP transport layer +3. Plugin system for extensibility +4. Metrics collection and monitoring +5. 
Documentation and examples + +## Related Documentation + +- **Architecture**: [MCP Integration](../architecture/orchestrator-integration-model.md) \ No newline at end of file diff --git a/docs/src/development/project-structure.md b/docs/src/development/project-structure.md index 18643ab..bce52a8 100644 --- a/docs/src/development/project-structure.md +++ b/docs/src/development/project-structure.md @@ -1 +1,411 @@ -# Project Structure Guide\n\nThis document provides a comprehensive overview of the provisioning project's structure after the major reorganization, explaining both the new\ndevelopment-focused organization and the preserved existing functionality.\n\n## Table of Contents\n\n1. [Overview](#overview)\n2. [New Structure vs Legacy](#new-structure-vs-legacy)\n3. [Core Directories](#core-directories)\n4. [Development Workspace](#development-workspace)\n5. [File Naming Conventions](#file-naming-conventions)\n6. [Navigation Guide](#navigation-guide)\n7. [Migration Path](#migration-path)\n\n## Overview\n\nThe provisioning project has been restructured to support a dual-organization approach:\n\n- **`src/`**: Development-focused structure with build tools, distribution system, and core components\n- **Legacy directories**: Preserved in their original locations for backward compatibility\n- **`workspace/`**: Development workspace with tools and runtime management\n\nThis reorganization enables efficient development workflows while maintaining full backward compatibility with existing deployments.\n\n## New Structure vs Legacy\n\n### New Development Structure (`/src/`)\n\n```\nsrc/\n├── config/ # System configuration\n├── control-center/ # Control center application\n├── control-center-ui/ # Web UI for control center\n├── core/ # Core system libraries\n├── docs/ # Documentation (new)\n├── extensions/ # Extension framework\n├── generators/ # Code generation tools\n├── schemas/ # Nickel configuration schemas (migrated from kcl/)\n├── orchestrator/ # Hybrid Rust/Nushell orchestrator\n├── platform/ # Platform-specific code\n├── provisioning/ # Main provisioning\n├── templates/ # Template files\n├── tools/ # Build and development tools\n└── utils/ # Utility scripts\n```\n\n### Legacy Structure (Preserved)\n\n```\nrepo-cnz/\n├── cluster/ # Cluster configurations (preserved)\n├── core/ # Core system (preserved)\n├── generate/ # Generation scripts (preserved)\n├── schemas/ # Nickel schemas (migrated from kcl/)\n├── klab/ # Development lab (preserved)\n├── nushell-plugins/ # Plugin development (preserved)\n├── providers/ # Cloud providers (preserved)\n├── taskservs/ # Task services (preserved)\n└── templates/ # Template files (preserved)\n```\n\n### Development Workspace (`/workspace/`)\n\n```\nworkspace/\n├── config/ # Development configuration\n├── extensions/ # Extension development\n├── infra/ # Development infrastructure\n├── lib/ # Workspace libraries\n├── runtime/ # Runtime data\n└── tools/ # Workspace management tools\n```\n\n## Core Directories\n\n### `/src/core/` - Core Development Libraries\n\n**Purpose**: Development-focused core libraries and entry points\n\n**Key Files**:\n\n- `nulib/provisioning` - Main CLI entry point (symlinks to legacy location)\n- `nulib/lib_provisioning/` - Core provisioning libraries\n- `nulib/workflows/` - Workflow management (orchestrator integration)\n\n**Relationship to Legacy**: Preserves original `core/` functionality while adding development enhancements\n\n### `/src/tools/` - Build and Development Tools\n\n**Purpose**: Complete build system for the 
provisioning project\n\n**Key Components**:\n\n```\ntools/\n├── build/ # Build tools\n│ ├── compile-platform.nu # Platform-specific compilation\n│ ├── bundle-core.nu # Core library bundling\n│ ├── validate-nickel.nu # Nickel schema validation\n│ ├── clean-build.nu # Build cleanup\n│ └── test-distribution.nu # Distribution testing\n├── distribution/ # Distribution tools\n│ ├── generate-distribution.nu # Main distribution generator\n│ ├── prepare-platform-dist.nu # Platform-specific distribution\n│ ├── prepare-core-dist.nu # Core distribution\n│ ├── create-installer.nu # Installer creation\n│ └── generate-docs.nu # Documentation generation\n├── package/ # Packaging tools\n│ ├── package-binaries.nu # Binary packaging\n│ ├── build-containers.nu # Container image building\n│ ├── create-tarball.nu # Archive creation\n│ └── validate-package.nu # Package validation\n├── release/ # Release management\n│ ├── create-release.nu # Release creation\n│ ├── upload-artifacts.nu # Artifact upload\n│ ├── rollback-release.nu # Release rollback\n│ ├── notify-users.nu # Release notifications\n│ └── update-registry.nu # Package registry updates\n└── Makefile # Main build system (40+ targets)\n```\n\n### `/src/orchestrator/` - Hybrid Orchestrator\n\n**Purpose**: Rust/Nushell hybrid orchestrator for solving deep call stack limitations\n\n**Key Components**:\n\n- `src/` - Rust orchestrator implementation\n- `scripts/` - Orchestrator management scripts\n- `data/` - File-based task queue and persistence\n\n**Integration**: Provides REST API and workflow management while preserving all Nushell business logic\n\n### `/src/provisioning/` - Enhanced Provisioning\n\n**Purpose**: Enhanced version of the main provisioning with additional features\n\n**Key Features**:\n\n- Batch workflow system (v3.1.0)\n- Provider-agnostic design\n- Configuration-driven architecture (v2.0.0)\n\n### `/workspace/` - Development Workspace\n\n**Purpose**: Complete development environment with tools and runtime management\n\n**Key Components**:\n\n- `tools/workspace.nu` - Unified workspace management interface\n- `lib/path-resolver.nu` - Smart path resolution system\n- `config/` - Environment-specific development configurations\n- `extensions/` - Extension development templates and examples\n- `infra/` - Development infrastructure examples\n- `runtime/` - Isolated runtime data per user\n\n## Development Workspace\n\n### Workspace Management\n\nThe workspace provides a sophisticated development environment:\n\n**Initialization**:\n\n```\ncd workspace/tools\nnu workspace.nu init --user-name developer --infra-name my-infra\n```\n\n**Health Monitoring**:\n\n```\nnu workspace.nu health --detailed --fix-issues\n```\n\n**Path Resolution**:\n\n```\nuse lib/path-resolver.nu\nlet config = (path-resolver resolve_config "user" --workspace-user "john")\n```\n\n### Extension Development\n\nThe workspace provides templates for developing:\n\n- **Providers**: Custom cloud provider implementations\n- **Task Services**: Infrastructure service components\n- **Clusters**: Complete deployment solutions\n\nTemplates are available in `workspace/extensions/{type}/template/`\n\n### Configuration Hierarchy\n\nThe workspace implements a sophisticated configuration cascade:\n\n1. Workspace user configuration (`workspace/config/{user}.toml`)\n2. Environment-specific defaults (`workspace/config/{env}-defaults.toml`)\n3. Workspace defaults (`workspace/config/dev-defaults.toml`)\n4. 
Core system defaults (`config.defaults.toml`)\n\n## File Naming Conventions\n\n### Nushell Files (`.nu`)\n\n- **Commands**: `kebab-case` - `create-server.nu`, `validate-config.nu`\n- **Modules**: `snake_case` - `lib_provisioning`, `path_resolver`\n- **Scripts**: `kebab-case` - `workspace-health.nu`, `runtime-manager.nu`\n\n### Configuration Files\n\n- **TOML**: `kebab-case.toml` - `config-defaults.toml`, `user-settings.toml`\n- **Environment**: `{env}-defaults.toml` - `dev-defaults.toml`, `prod-defaults.toml`\n- **Examples**: `*.toml.example` - `local-overrides.toml.example`\n\n### Nickel Files (`.ncl`)\n\n- **Schemas**: `kebab-case.ncl` - `server-config.ncl`, `workflow-schema.ncl`\n- **Configuration**: `manifest.toml` - Package metadata\n- **Structure**: Organized in `schemas/` directories per extension\n\n### Build and Distribution\n\n- **Scripts**: `kebab-case.nu` - `compile-platform.nu`, `generate-distribution.nu`\n- **Makefiles**: `Makefile` - Standard naming\n- **Archives**: `{project}-{version}-{platform}-{variant}.{ext}`\n\n## Navigation Guide\n\n### Finding Components\n\n**Core System Entry Points**:\n\n```\n# Main CLI (development version)\n/src/core/nulib/provisioning\n\n# Legacy CLI (production version)\n/core/nulib/provisioning\n\n# Workspace management\n/workspace/tools/workspace.nu\n```\n\n**Build System**:\n\n```\n# Main build system\ncd /src/tools && make help\n\n# Quick development build\nmake dev-build\n\n# Complete distribution\nmake all\n```\n\n**Configuration Files**:\n\n```\n# System defaults\n/config.defaults.toml\n\n# User configuration (workspace)\n/workspace/config/{user}.toml\n\n# Environment-specific\n/workspace/config/{env}-defaults.toml\n```\n\n**Extension Development**:\n\n```\n# Provider template\n/workspace/extensions/providers/template/\n\n# Task service template\n/workspace/extensions/taskservs/template/\n\n# Cluster template\n/workspace/extensions/clusters/template/\n```\n\n### Common Workflows\n\n**1. Development Setup**:\n\n```\n# Initialize workspace\ncd workspace/tools\nnu workspace.nu init --user-name $USER\n\n# Check health\nnu workspace.nu health --detailed\n```\n\n**2. Building Distribution**:\n\n```\n# Complete build\ncd src/tools\nmake all\n\n# Platform-specific build\nmake linux\nmake macos\nmake windows\n```\n\n**3. Extension Development**:\n\n```\n# Create new provider\ncp -r workspace/extensions/providers/template workspace/extensions/providers/my-provider\n\n# Test extension\nnu workspace/extensions/providers/my-provider/nulib/provider.nu test\n```\n\n### Legacy Compatibility\n\n**Existing Commands Still Work**:\n\n```\n# All existing commands preserved\n./core/nulib/provisioning server create\n./core/nulib/provisioning taskserv install kubernetes\n./core/nulib/provisioning cluster create buildkit\n```\n\n**Configuration Migration**:\n\n- ENV variables still supported as fallbacks\n- New configuration system provides better defaults\n- Migration tools available in `src/tools/migration/`\n\n## Migration Path\n\n### For Users\n\n**No Changes Required**:\n\n- All existing commands continue to work\n- Configuration files remain compatible\n- Existing infrastructure deployments unaffected\n\n**Optional Enhancements**:\n\n- Migrate to new configuration system for better defaults\n- Use workspace for development environments\n- Leverage new build system for custom distributions\n\n### For Developers\n\n**Development Environment**:\n\n1. Initialize development workspace: `nu workspace/tools/workspace.nu init`\n2. 
Use new build system: `cd src/tools && make dev-build`\n3. Leverage extension templates for custom development\n\n**Build System**:\n\n1. Use new Makefile for comprehensive build management\n2. Leverage distribution tools for packaging\n3. Use release management for version control\n\n**Orchestrator Integration**:\n\n1. Start orchestrator for workflow management: `cd src/orchestrator && ./scripts/start-orchestrator.nu`\n2. Use workflow APIs for complex operations\n3. Leverage batch operations for efficiency\n\n### Migration Tools\n\n**Available Migration Scripts**:\n\n- `src/tools/migration/config-migration.nu` - Configuration migration\n- `src/tools/migration/workspace-setup.nu` - Workspace initialization\n- `src/tools/migration/path-resolver.nu` - Path resolution migration\n\n**Validation Tools**:\n\n- `src/tools/validation/system-health.nu` - System health validation\n- `src/tools/validation/compatibility-check.nu` - Compatibility verification\n- `src/tools/validation/migration-status.nu` - Migration status tracking\n\n## Architecture Benefits\n\n### Development Efficiency\n\n- **Build System**: Comprehensive 40+ target Makefile system\n- **Workspace Isolation**: Per-user development environments\n- **Extension Framework**: Template-based extension development\n\n### Production Reliability\n\n- **Backward Compatibility**: All existing functionality preserved\n- **Configuration Migration**: Gradual migration from ENV to config-driven\n- **Orchestrator Architecture**: Hybrid Rust/Nushell for performance and flexibility\n- **Workflow Management**: Batch operations with rollback capabilities\n\n### Maintenance Benefits\n\n- **Clean Separation**: Development tools separate from production code\n- **Organized Structure**: Logical grouping of related functionality\n- **Documentation**: Comprehensive documentation and examples\n- **Testing Framework**: Built-in testing and validation tools\n\nThis structure represents a significant evolution in the project's organization while maintaining complete backward compatibility and providing\npowerful new development capabilities. +# Project Structure Guide + +This document provides a comprehensive overview of the provisioning project's structure after the major reorganization, explaining both the new +development-focused organization and the preserved existing functionality. + +## Table of Contents + +1. [Overview](#overview) +2. [New Structure vs Legacy](#new-structure-vs-legacy) +3. [Core Directories](#core-directories) +4. [Development Workspace](#development-workspace) +5. [File Naming Conventions](#file-naming-conventions) +6. [Navigation Guide](#navigation-guide) +7. [Migration Path](#migration-path) + +## Overview + +The provisioning project has been restructured to support a dual-organization approach: + +- **`src/`**: Development-focused structure with build tools, distribution system, and core components +- **Legacy directories**: Preserved in their original locations for backward compatibility +- **`workspace/`**: Development workspace with tools and runtime management + +This reorganization enables efficient development workflows while maintaining full backward compatibility with existing deployments. 
+ +## New Structure vs Legacy + +### New Development Structure (`/src/`) + +```text +src/ +├── config/ # System configuration +├── control-center/ # Control center application +├── control-center-ui/ # Web UI for control center +├── core/ # Core system libraries +├── docs/ # Documentation (new) +├── extensions/ # Extension framework +├── generators/ # Code generation tools +├── schemas/ # Nickel configuration schemas (migrated from kcl/) +├── orchestrator/ # Hybrid Rust/Nushell orchestrator +├── platform/ # Platform-specific code +├── provisioning/ # Main provisioning +├── templates/ # Template files +├── tools/ # Build and development tools +└── utils/ # Utility scripts +``` + +### Legacy Structure (Preserved) + +```text +repo-cnz/ +├── cluster/ # Cluster configurations (preserved) +├── core/ # Core system (preserved) +├── generate/ # Generation scripts (preserved) +├── schemas/ # Nickel schemas (migrated from kcl/) +├── klab/ # Development lab (preserved) +├── nushell-plugins/ # Plugin development (preserved) +├── providers/ # Cloud providers (preserved) +├── taskservs/ # Task services (preserved) +└── templates/ # Template files (preserved) +``` + +### Development Workspace (`/workspace/`) + +```text +workspace/ +├── config/ # Development configuration +├── extensions/ # Extension development +├── infra/ # Development infrastructure +├── lib/ # Workspace libraries +├── runtime/ # Runtime data +└── tools/ # Workspace management tools +``` + +## Core Directories + +### `/src/core/` - Core Development Libraries + +**Purpose**: Development-focused core libraries and entry points + +**Key Files**: + +- `nulib/provisioning` - Main CLI entry point (symlinks to legacy location) +- `nulib/lib_provisioning/` - Core provisioning libraries +- `nulib/workflows/` - Workflow management (orchestrator integration) + +**Relationship to Legacy**: Preserves original `core/` functionality while adding development enhancements + +### `/src/tools/` - Build and Development Tools + +**Purpose**: Complete build system for the provisioning project + +**Key Components**: + +```text +tools/ +├── build/ # Build tools +│ ├── compile-platform.nu # Platform-specific compilation +│ ├── bundle-core.nu # Core library bundling +│ ├── validate-nickel.nu # Nickel schema validation +│ ├── clean-build.nu # Build cleanup +│ └── test-distribution.nu # Distribution testing +├── distribution/ # Distribution tools +│ ├── generate-distribution.nu # Main distribution generator +│ ├── prepare-platform-dist.nu # Platform-specific distribution +│ ├── prepare-core-dist.nu # Core distribution +│ ├── create-installer.nu # Installer creation +│ └── generate-docs.nu # Documentation generation +├── package/ # Packaging tools +│ ├── package-binaries.nu # Binary packaging +│ ├── build-containers.nu # Container image building +│ ├── create-tarball.nu # Archive creation +│ └── validate-package.nu # Package validation +├── release/ # Release management +│ ├── create-release.nu # Release creation +│ ├── upload-artifacts.nu # Artifact upload +│ ├── rollback-release.nu # Release rollback +│ ├── notify-users.nu # Release notifications +│ └── update-registry.nu # Package registry updates +└── Makefile # Main build system (40+ targets) +``` + +### `/src/orchestrator/` - Hybrid Orchestrator + +**Purpose**: Rust/Nushell hybrid orchestrator for solving deep call stack limitations + +**Key Components**: + +- `src/` - Rust orchestrator implementation +- `scripts/` - Orchestrator management scripts +- `data/` - File-based task queue and persistence + 
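+Because the task queue is file-based, pending work can be inspected without going through the REST API. A minimal Nushell sketch (the file layout under `data/` is assumed here, not documented):
+
+```text
+# Hypothetical queue inspection: list task files oldest-first
+ls src/orchestrator/data
+| where type == file
+| select name size modified
+| sort-by modified
+```
+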
+**Integration**: Provides REST API and workflow management while preserving all Nushell business logic + +### `/src/provisioning/` - Enhanced Provisioning + +**Purpose**: Enhanced version of the main provisioning with additional features + +**Key Features**: + +- Batch workflow system (v3.1.0) +- Provider-agnostic design +- Configuration-driven architecture (v2.0.0) + +### `/workspace/` - Development Workspace + +**Purpose**: Complete development environment with tools and runtime management + +**Key Components**: + +- `tools/workspace.nu` - Unified workspace management interface +- `lib/path-resolver.nu` - Smart path resolution system +- `config/` - Environment-specific development configurations +- `extensions/` - Extension development templates and examples +- `infra/` - Development infrastructure examples +- `runtime/` - Isolated runtime data per user + +## Development Workspace + +### Workspace Management + +The workspace provides a sophisticated development environment: + +**Initialization**: + +```text +cd workspace/tools +nu workspace.nu init --user-name developer --infra-name my-infra +``` + +**Health Monitoring**: + +```text +nu workspace.nu health --detailed --fix-issues +``` + +**Path Resolution**: + +```text +use lib/path-resolver.nu +let config = (path-resolver resolve_config "user" --workspace-user "john") +``` + +### Extension Development + +The workspace provides templates for developing: + +- **Providers**: Custom cloud provider implementations +- **Task Services**: Infrastructure service components +- **Clusters**: Complete deployment solutions + +Templates are available in `workspace/extensions/{type}/template/` + +### Configuration Hierarchy + +The workspace implements a sophisticated configuration cascade: + +1. Workspace user configuration (`workspace/config/{user}.toml`) +2. Environment-specific defaults (`workspace/config/{env}-defaults.toml`) +3. Workspace defaults (`workspace/config/dev-defaults.toml`) +4. 
Core system defaults (`config.defaults.toml`) + +## File Naming Conventions + +### Nushell Files (`.nu`) + +- **Commands**: `kebab-case` - `create-server.nu`, `validate-config.nu` +- **Modules**: `snake_case` - `lib_provisioning`, `path_resolver` +- **Scripts**: `kebab-case` - `workspace-health.nu`, `runtime-manager.nu` + +### Configuration Files + +- **TOML**: `kebab-case.toml` - `config-defaults.toml`, `user-settings.toml` +- **Environment**: `{env}-defaults.toml` - `dev-defaults.toml`, `prod-defaults.toml` +- **Examples**: `*.toml.example` - `local-overrides.toml.example` + +### Nickel Files (`.ncl`) + +- **Schemas**: `kebab-case.ncl` - `server-config.ncl`, `workflow-schema.ncl` +- **Configuration**: `manifest.toml` - Package metadata +- **Structure**: Organized in `schemas/` directories per extension + +### Build and Distribution + +- **Scripts**: `kebab-case.nu` - `compile-platform.nu`, `generate-distribution.nu` +- **Makefiles**: `Makefile` - Standard naming +- **Archives**: `{project}-{version}-{platform}-{variant}.{ext}` + +## Navigation Guide + +### Finding Components + +**Core System Entry Points**: + +```text +# Main CLI (development version) +/src/core/nulib/provisioning + +# Legacy CLI (production version) +/core/nulib/provisioning + +# Workspace management +/workspace/tools/workspace.nu +``` + +**Build System**: + +```text +# Main build system +cd /src/tools && make help + +# Quick development build +make dev-build + +# Complete distribution +make all +``` + +**Configuration Files**: + +```text +# System defaults +/config.defaults.toml + +# User configuration (workspace) +/workspace/config/{user}.toml + +# Environment-specific +/workspace/config/{env}-defaults.toml +``` + +**Extension Development**: + +```text +# Provider template +/workspace/extensions/providers/template/ + +# Task service template +/workspace/extensions/taskservs/template/ + +# Cluster template +/workspace/extensions/clusters/template/ +``` + +### Common Workflows + +**1. Development Setup**: + +```text +# Initialize workspace +cd workspace/tools +nu workspace.nu init --user-name $USER + +# Check health +nu workspace.nu health --detailed +``` + +**2. Building Distribution**: + +```text +# Complete build +cd src/tools +make all + +# Platform-specific build +make linux +make macos +make windows +``` + +**3. Extension Development**: + +```text +# Create new provider +cp -r workspace/extensions/providers/template workspace/extensions/providers/my-provider + +# Test extension +nu workspace/extensions/providers/my-provider/nulib/provider.nu test +``` + +### Legacy Compatibility + +**Existing Commands Still Work**: + +```text +# All existing commands preserved +./core/nulib/provisioning server create +./core/nulib/provisioning taskserv install kubernetes +./core/nulib/provisioning cluster create buildkit +``` + +**Configuration Migration**: + +- ENV variables still supported as fallbacks +- New configuration system provides better defaults +- Migration tools available in `src/tools/migration/` + +## Migration Path + +### For Users + +**No Changes Required**: + +- All existing commands continue to work +- Configuration files remain compatible +- Existing infrastructure deployments unaffected + +**Optional Enhancements**: + +- Migrate to new configuration system for better defaults +- Use workspace for development environments +- Leverage new build system for custom distributions + +### For Developers + +**Development Environment**: + +1. Initialize development workspace: `nu workspace/tools/workspace.nu init` +2. 
Use new build system: `cd src/tools && make dev-build` +3. Leverage extension templates for custom development + +**Build System**: + +1. Use new Makefile for comprehensive build management +2. Leverage distribution tools for packaging +3. Use release management for version control + +**Orchestrator Integration**: + +1. Start orchestrator for workflow management: `cd src/orchestrator && ./scripts/start-orchestrator.nu` +2. Use workflow APIs for complex operations +3. Leverage batch operations for efficiency + +### Migration Tools + +**Available Migration Scripts**: + +- `src/tools/migration/config-migration.nu` - Configuration migration +- `src/tools/migration/workspace-setup.nu` - Workspace initialization +- `src/tools/migration/path-resolver.nu` - Path resolution migration + +**Validation Tools**: + +- `src/tools/validation/system-health.nu` - System health validation +- `src/tools/validation/compatibility-check.nu` - Compatibility verification +- `src/tools/validation/migration-status.nu` - Migration status tracking + +## Architecture Benefits + +### Development Efficiency + +- **Build System**: Comprehensive 40+ target Makefile system +- **Workspace Isolation**: Per-user development environments +- **Extension Framework**: Template-based extension development + +### Production Reliability + +- **Backward Compatibility**: All existing functionality preserved +- **Configuration Migration**: Gradual migration from ENV to config-driven +- **Orchestrator Architecture**: Hybrid Rust/Nushell for performance and flexibility +- **Workflow Management**: Batch operations with rollback capabilities + +### Maintenance Benefits + +- **Clean Separation**: Development tools separate from production code +- **Organized Structure**: Logical grouping of related functionality +- **Documentation**: Comprehensive documentation and examples +- **Testing Framework**: Built-in testing and validation tools + +This structure represents a significant evolution in the project's organization while maintaining complete backward compatibility and providing +powerful new development capabilities. \ No newline at end of file diff --git a/docs/src/development/providers/provider-agnostic-architecture.md b/docs/src/development/providers/provider-agnostic-architecture.md index e745b6e..c5b6273 100644 --- a/docs/src/development/providers/provider-agnostic-architecture.md +++ b/docs/src/development/providers/provider-agnostic-architecture.md @@ -1 +1,348 @@ -# Provider-Agnostic Architecture Documentation\n\n## Overview\n\nThe new provider-agnostic architecture eliminates hardcoded provider dependencies and enables true multi-provider infrastructure deployments. This\naddresses two critical limitations of the previous middleware:\n\n1. **Hardcoded provider dependencies** - No longer requires importing specific provider modules\n2. **Single-provider limitation** - Now supports mixing multiple providers in the same deployment (for example, AWS compute + Cloudflare DNS + UpCloud\nbackup)\n\n## Architecture Components\n\n### 1. Provider Interface (`interface.nu`)\n\nDefines the contract that all providers must implement:\n\n```\n# Standard interface functions\n- query_servers\n- server_info\n- server_exists\n- create_server\n- delete_server\n- server_state\n- get_ip\n# ... and 20+ other functions\n```\n\n**Key Features:**\n\n- Type-safe function signatures\n- Comprehensive validation\n- Provider capability flags\n- Interface versioning\n\n### 2. 
Provider Registry (`registry.nu`)\n\nManages provider discovery and registration:\n\n```\n# Initialize registry\ninit-provider-registry\n\n# List available providers\nlist-providers --available-only\n\n# Check provider availability\nis-provider-available "aws"\n```\n\n**Features:**\n\n- Automatic provider discovery\n- Core and extension provider support\n- Caching for performance\n- Provider capability tracking\n\n### 3. Provider Loader (`loader.nu`)\n\nHandles dynamic provider loading and validation:\n\n```\n# Load provider dynamically\nload-provider "aws"\n\n# Get provider with auto-loading\nget-provider "upcloud"\n\n# Call provider function\ncall-provider-function "aws" "query_servers" $find $cols\n```\n\n**Features:**\n\n- Lazy loading (load only when needed)\n- Interface compliance validation\n- Error handling and recovery\n- Provider health checking\n\n### 4. Provider Adapters\n\nEach provider implements a standard adapter:\n\n```\nprovisioning/extensions/providers/\n├── aws/provider.nu # AWS adapter\n├── upcloud/provider.nu # UpCloud adapter\n├── local/provider.nu # Local adapter\n└── {custom}/provider.nu # Custom providers\n```\n\n**Adapter Structure:**\n\n```\n# AWS Provider Adapter\nexport def query_servers [find?: string, cols?: string] {\n aws_query_servers $find $cols\n}\n\nexport def create_server [settings: record, server: record, check: bool, wait: bool] {\n # AWS-specific implementation\n}\n```\n\n### 5. Provider-Agnostic Middleware (`middleware_provider_agnostic.nu`)\n\nThe new middleware that uses dynamic dispatch:\n\n```\n# No hardcoded imports!\nexport def mw_query_servers [settings: record, find?: string, cols?: string] {\n $settings.data.servers | each { |server|\n # Dynamic provider loading and dispatch\n dispatch_provider_function $server.provider "query_servers" $find $cols\n }\n}\n```\n\n## Multi-Provider Support\n\n### Example: Mixed Provider Infrastructure\n\n```\nlet servers = [\n {\n hostname = "compute-01",\n provider = "aws",\n # AWS-specific config\n },\n {\n hostname = "backup-01",\n provider = "upcloud",\n # UpCloud-specific config\n },\n {\n hostname = "api.example.com",\n provider = "cloudflare",\n # DNS-specific config\n },\n] in\nservers\n```\n\n### Multi-Provider Deployment\n\n```\n# Deploy across multiple providers automatically\nmw_deploy_multi_provider_infra $settings $deployment_plan\n\n# Get deployment strategy recommendations\nmw_suggest_deployment_strategy {\n regions: ["us-east-1", "eu-west-1"]\n high_availability: true\n cost_optimization: true\n}\n```\n\n## Provider Capabilities\n\nProviders declare their capabilities:\n\n```\ncapabilities: {\n server_management: true\n network_management: true\n auto_scaling: true # AWS: yes, Local: no\n multi_region: true # AWS: yes, Local: no\n serverless: true # AWS: yes, UpCloud: no\n compliance_certifications: ["SOC2", "HIPAA"]\n}\n```\n\n## Migration Guide\n\n### From Old Middleware\n\n**Before (hardcoded):**\n\n```\n# middleware.nu\nuse ../aws/nulib/aws/servers.nu *\nuse ../upcloud/nulib/upcloud/servers.nu *\n\nmatch $server.provider {\n "aws" => { aws_query_servers $find $cols }\n "upcloud" => { upcloud_query_servers $find $cols }\n}\n```\n\n**After (provider-agnostic):**\n\n```\n# middleware_provider_agnostic.nu\n# No hardcoded imports!\n\n# Dynamic dispatch\ndispatch_provider_function $server.provider "query_servers" $find $cols\n```\n\n### Migration Steps\n\n1. 
**Replace middleware file:**\n\n ```bash\n cp provisioning/extensions/providers/prov_lib/middleware.nu \\n provisioning/extensions/providers/prov_lib/middleware_legacy.backup\n\n cp provisioning/extensions/providers/prov_lib/middleware_provider_agnostic.nu \\n provisioning/extensions/providers/prov_lib/middleware.nu\n ```\n\n1. **Test with existing infrastructure:**\n\n ```nushell\n ./provisioning/tools/test-provider-agnostic.nu run-all-tests\n ```\n\n2. **Update any custom code** that directly imported provider modules\n\n## Adding New Providers\n\n### 1. Create Provider Adapter\n\nCreate `provisioning/extensions/providers/{name}/provider.nu`:\n\n```\n# Digital Ocean Provider Example\nexport def get-provider-metadata [] {\n {\n name: "digitalocean"\n version: "1.0.0"\n capabilities: {\n server_management: true\n # ... other capabilities\n }\n }\n}\n\n# Implement required interface functions\nexport def query_servers [find?: string, cols?: string] {\n # DigitalOcean-specific implementation\n}\n\nexport def create_server [settings: record, server: record, check: bool, wait: bool] {\n # DigitalOcean-specific implementation\n}\n\n# ... implement all required functions\n```\n\n### 2. Provider Discovery\n\nThe registry will automatically discover the new provider on next initialization.\n\n### 3. Test New Provider\n\n```\n# Check if discovered\nis-provider-available "digitalocean"\n\n# Load and test\nload-provider "digitalocean"\ncheck-provider-health "digitalocean"\n```\n\n## Best Practices\n\n### Provider Development\n\n1. **Implement full interface** - All functions must be implemented\n2. **Handle errors gracefully** - Return appropriate error values\n3. **Follow naming conventions** - Use consistent function naming\n4. **Document capabilities** - Accurately declare what your provider supports\n5. **Test thoroughly** - Validate against the interface specification\n\n### Multi-Provider Deployments\n\n1. **Use capability-based selection** - Choose providers based on required features\n2. **Handle provider failures** - Design for provider unavailability\n3. **Optimize for cost/performance** - Mix providers strategically\n4. **Monitor cross-provider dependencies** - Understand inter-provider communication\n\n### Profile-Based Security\n\n```\n# Environment profiles can restrict providers\nPROVISIONING_PROFILE=production # Only allows certified providers\nPROVISIONING_PROFILE=development # Allows all providers including local\n```\n\n## Troubleshooting\n\n### Common Issues\n\n1. **Provider not found**\n - Check provider is in correct directory\n - Verify provider.nu exists and implements interface\n - Run `init-provider-registry` to refresh\n\n2. **Interface validation failed**\n - Use `validate-provider-interface` to check compliance\n - Ensure all required functions are implemented\n - Check function signatures match interface\n\n3. **Provider loading errors**\n - Check Nushell module syntax\n - Verify import paths are correct\n - Use `check-provider-health` for diagnostics\n\n### Debug Commands\n\n```\n# Registry diagnostics\nget-provider-stats\nlist-providers --verbose\n\n# Provider diagnostics\ncheck-provider-health "aws"\ncheck-all-providers-health\n\n# Loader diagnostics\nget-loader-stats\n```\n\n## Performance Benefits\n\n1. **Lazy Loading** - Providers loaded only when needed\n2. **Caching** - Provider registry cached to disk\n3. **Reduced Memory** - No hardcoded imports reducing memory usage\n4. 
**Parallel Operations** - Multi-provider operations can run in parallel\n\n## Future Enhancements\n\n1. **Provider Plugins** - Support for external provider plugins\n2. **Provider Versioning** - Multiple versions of same provider\n3. **Provider Composition** - Compose providers for complex scenarios\n4. **Provider Marketplace** - Community provider sharing\n\n## API Reference\n\nSee the interface specification for complete function documentation:\n\n```\nget-provider-interface-docs | table\n```\n\nThis returns the complete API with signatures and descriptions for all provider interface functions. +# Provider-Agnostic Architecture Documentation + +## Overview + +The new provider-agnostic architecture eliminates hardcoded provider dependencies and enables true multi-provider infrastructure deployments. This +addresses two critical limitations of the previous middleware: + +1. **Hardcoded provider dependencies** - No longer requires importing specific provider modules +2. **Single-provider limitation** - Now supports mixing multiple providers in the same deployment (for example, AWS compute + Cloudflare DNS + UpCloud +backup) + +## Architecture Components + +### 1. Provider Interface (`interface.nu`) + +Defines the contract that all providers must implement: + +```text +# Standard interface functions +- query_servers +- server_info +- server_exists +- create_server +- delete_server +- server_state +- get_ip +# ... and 20+ other functions +``` + +**Key Features:** + +- Type-safe function signatures +- Comprehensive validation +- Provider capability flags +- Interface versioning + +### 2. Provider Registry (`registry.nu`) + +Manages provider discovery and registration: + +```text +# Initialize registry +init-provider-registry + +# List available providers +list-providers --available-only + +# Check provider availability +is-provider-available "aws" +``` + +**Features:** + +- Automatic provider discovery +- Core and extension provider support +- Caching for performance +- Provider capability tracking + +### 3. Provider Loader (`loader.nu`) + +Handles dynamic provider loading and validation: + +```text +# Load provider dynamically +load-provider "aws" + +# Get provider with auto-loading +get-provider "upcloud" + +# Call provider function +call-provider-function "aws" "query_servers" $find $cols +``` + +**Features:** + +- Lazy loading (load only when needed) +- Interface compliance validation +- Error handling and recovery +- Provider health checking + +### 4. Provider Adapters + +Each provider implements a standard adapter: + +```text +provisioning/extensions/providers/ +├── aws/provider.nu # AWS adapter +├── upcloud/provider.nu # UpCloud adapter +├── local/provider.nu # Local adapter +└── {custom}/provider.nu # Custom providers +``` + +**Adapter Structure:** + +```text +# AWS Provider Adapter +export def query_servers [find?: string, cols?: string] { + aws_query_servers $find $cols +} + +export def create_server [settings: record, server: record, check: bool, wait: bool] { + # AWS-specific implementation +} +``` + +### 5. Provider-Agnostic Middleware (`middleware_provider_agnostic.nu`) + +The new middleware that uses dynamic dispatch: + +```text +# No hardcoded imports! 
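+# Each server record names its own provider; the middleware resolves and
+# loads the matching adapter at call time, so mixed-provider inventories
+# need no middleware changes.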
+export def mw_query_servers [settings: record, find?: string, cols?: string] {
+    $settings.data.servers | each { |server|
+        # Dynamic provider loading and dispatch
+        dispatch_provider_function $server.provider "query_servers" $find $cols
+    }
+}
+```
+
+## Multi-Provider Support
+
+### Example: Mixed Provider Infrastructure
+
+```text
+let servers = [
+    {
+        hostname = "compute-01",
+        provider = "aws",
+        # AWS-specific config
+    },
+    {
+        hostname = "backup-01",
+        provider = "upcloud",
+        # UpCloud-specific config
+    },
+    {
+        hostname = "api.example.com",
+        provider = "cloudflare",
+        # DNS-specific config
+    },
+] in
+servers
+```
+
+### Multi-Provider Deployment
+
+```text
+# Deploy across multiple providers automatically
+mw_deploy_multi_provider_infra $settings $deployment_plan
+
+# Get deployment strategy recommendations
+mw_suggest_deployment_strategy {
+    regions: ["us-east-1", "eu-west-1"]
+    high_availability: true
+    cost_optimization: true
+}
+```
+
+## Provider Capabilities
+
+Providers declare their capabilities:
+
+```text
+capabilities: {
+    server_management: true
+    network_management: true
+    auto_scaling: true    # AWS: yes, Local: no
+    multi_region: true    # AWS: yes, Local: no
+    serverless: true      # AWS: yes, UpCloud: no
+    compliance_certifications: ["SOC2", "HIPAA"]
+}
+```
+
+## Migration Guide
+
+### From Old Middleware
+
+**Before (hardcoded):**
+
+```text
+# middleware.nu
+use ../aws/nulib/aws/servers.nu *
+use ../upcloud/nulib/upcloud/servers.nu *
+
+match $server.provider {
+    "aws" => { aws_query_servers $find $cols }
+    "upcloud" => { upcloud_query_servers $find $cols }
+}
+```
+
+**After (provider-agnostic):**
+
+```text
+# middleware_provider_agnostic.nu
+# No hardcoded imports!
+
+# Dynamic dispatch
+dispatch_provider_function $server.provider "query_servers" $find $cols
+```
+
+### Migration Steps
+
+1. **Replace middleware file:**
+
+   ```bash
+   cp provisioning/extensions/providers/prov_lib/middleware.nu \
+      provisioning/extensions/providers/prov_lib/middleware_legacy.backup
+
+   cp provisioning/extensions/providers/prov_lib/middleware_provider_agnostic.nu \
+      provisioning/extensions/providers/prov_lib/middleware.nu
+   ```
+
+2. **Test with existing infrastructure:**
+
+   ```nushell
+   ./provisioning/tools/test-provider-agnostic.nu run-all-tests
+   ```
+
+3. **Update any custom code** that directly imported provider modules
+
+## Adding New Providers
+
+### 1. Create Provider Adapter
+
+Create `provisioning/extensions/providers/{name}/provider.nu`:
+
+```text
+# DigitalOcean Provider Example
+export def get-provider-metadata [] {
+    {
+        name: "digitalocean"
+        version: "1.0.0"
+        capabilities: {
+            server_management: true
+            # ... other capabilities
+        }
+    }
+}
+
+# Implement required interface functions
+export def query_servers [find?: string, cols?: string] {
+    # DigitalOcean-specific implementation
+}
+
+export def create_server [settings: record, server: record, check: bool, wait: bool] {
+    # DigitalOcean-specific implementation
+}
+
+# ... implement all required functions
+```
+
+### 2. Provider Discovery
+
+The registry will automatically discover the new provider on next initialization.
+
+### 3. Test New Provider
+
+```text
+# Check if discovered
+is-provider-available "digitalocean"
+
+# Load and test
+load-provider "digitalocean"
+check-provider-health "digitalocean"
+```
+
+## Best Practices
+
+### Provider Development
+
+1. **Implement full interface** - All functions must be implemented
+2. **Handle errors gracefully** - Return appropriate error values
+3. 
**Follow naming conventions** - Use consistent function naming +4. **Document capabilities** - Accurately declare what your provider supports +5. **Test thoroughly** - Validate against the interface specification + +### Multi-Provider Deployments + +1. **Use capability-based selection** - Choose providers based on required features +2. **Handle provider failures** - Design for provider unavailability +3. **Optimize for cost/performance** - Mix providers strategically +4. **Monitor cross-provider dependencies** - Understand inter-provider communication + +### Profile-Based Security + +```text +# Environment profiles can restrict providers +PROVISIONING_PROFILE=production # Only allows certified providers +PROVISIONING_PROFILE=development # Allows all providers including local +``` + +## Troubleshooting + +### Common Issues + +1. **Provider not found** + - Check provider is in correct directory + - Verify provider.nu exists and implements interface + - Run `init-provider-registry` to refresh + +2. **Interface validation failed** + - Use `validate-provider-interface` to check compliance + - Ensure all required functions are implemented + - Check function signatures match interface + +3. **Provider loading errors** + - Check Nushell module syntax + - Verify import paths are correct + - Use `check-provider-health` for diagnostics + +### Debug Commands + +```text +# Registry diagnostics +get-provider-stats +list-providers --verbose + +# Provider diagnostics +check-provider-health "aws" +check-all-providers-health + +# Loader diagnostics +get-loader-stats +``` + +## Performance Benefits + +1. **Lazy Loading** - Providers loaded only when needed +2. **Caching** - Provider registry cached to disk +3. **Reduced Memory** - No hardcoded imports reducing memory usage +4. **Parallel Operations** - Multi-provider operations can run in parallel + +## Future Enhancements + +1. **Provider Plugins** - Support for external provider plugins +2. **Provider Versioning** - Multiple versions of same provider +3. **Provider Composition** - Compose providers for complex scenarios +4. **Provider Marketplace** - Community provider sharing + +## API Reference + +See the interface specification for complete function documentation: + +```text +get-provider-interface-docs | table +``` + +This returns the complete API with signatures and descriptions for all provider interface functions. \ No newline at end of file diff --git a/docs/src/development/providers/provider-comparison.md b/docs/src/development/providers/provider-comparison.md index 1c7aa0b..99d02de 100644 --- a/docs/src/development/providers/provider-comparison.md +++ b/docs/src/development/providers/provider-comparison.md @@ -1 +1,400 @@ -# Provider Comparison Matrix\n\nThis document provides a comprehensive comparison of supported cloud providers: Hetzner, UpCloud, AWS, and DigitalOcean. 
Use this matrix to make\ninformed decisions about which provider is best suited for your workloads.\n\n## Feature Comparison\n\n### Compute\n\n| Feature | Hetzner | UpCloud | AWS | DigitalOcean |\n| --------- | --------- | --------- | ----- | -------------- |\n| Product Name | Cloud Servers | Servers | EC2 | Droplets |\n| Instance Sizing | Standard, dedicated cores | 2-32 vCPUs | Extensive (t2, t3, m5, c5, etc) | 1-48 vCPUs |\n| Custom CPU/RAM | ✓ | ✓ | Limited | ✗ |\n| Hourly Billing | ✓ | ✓ | ✓ | ✓ |\n| Monthly Discount | 30% | 25% | ~30% (RI) | ~25% |\n| GPU Instances | ✓ | ✗ | ✓ | ✗ |\n| Auto-scaling | Via API | Via API | Native (ASG) | Via API |\n| Bare Metal | ✓ | ✗ | ✓ (EC2) | ✗ |\n\n### Block Storage\n\n| Feature | Hetzner | UpCloud | AWS | DigitalOcean |\n| --------- | --------- | --------- | ----- | -------------- |\n| Product Name | Volumes | Storage | EBS | Volumes |\n| SSD Volumes | ✓ | ✓ | ✓ (gp3, io1) | ✓ |\n| HDD Volumes | ✗ | ✓ | ✓ (st1, sc1) | ✗ |\n| Max Volume Size | 10 TB | Unlimited | 16 TB | 100 TB |\n| IOPS Provisioning | Limited | ✓ | ✓ | ✗ |\n| Snapshots | ✓ | ✓ | ✓ | ✓ |\n| Encryption | ✓ | ✓ | ✓ | ✓ |\n| Backup Service | ✗ | ✗ | ✓ (AWS Backup) | ✓ |\n\n### Object Storage\n\n| Feature | Hetzner | UpCloud | AWS | DigitalOcean |\n| --------- | --------- | --------- | ----- | -------------- |\n| Product Name | Object Storage | — | S3 | Spaces |\n| API Compatibility | S3-compatible | — | S3 (native) | S3-compatible |\n| Pricing (per GB) | €0.025 | N/A | $0.023 | $0.015 |\n| Regions | 2 | N/A | 30+ | 4 |\n| Versioning | ✓ | N/A | ✓ | ✓ |\n| Lifecycle Rules | ✓ | N/A | ✓ | ✓ |\n| CDN Integration | ✗ | N/A | ✓ (CloudFront) | ✓ (CDN add-on) |\n| Access Control | Bucket policies | N/A | IAM + bucket policies | Token-based |\n\n### Load Balancing\n\n| Feature | Hetzner | UpCloud | AWS | DigitalOcean |\n| --------- | --------- | --------- | ----- | -------------- |\n| Product Name | Load Balancer | Load Balancer | ELB/ALB/NLB | Load Balancer |\n| Type | Layer 4/7 | Layer 4 | Layer 4/7 | Layer 4/7 |\n| Health Checks | ✓ | ✓ | ✓ | ✓ |\n| SSL/TLS Termination | ✓ | Limited | ✓ | ✓ |\n| Path-based Routing | ✓ | ✗ | ✓ (ALB) | ✗ |\n| Host-based Routing | ✓ | ✗ | ✓ (ALB) | ✗ |\n| Sticky Sessions | ✓ | ✓ | ✓ | ✓ |\n| Geographic Distribution | ✗ | ✗ | ✓ (multi-region) | ✗ |\n| DDoS Protection | Basic | ✓ | ✓ (Shield) | ✓ |\n\n### Managed Databases\n\n| Feature | Hetzner | UpCloud | AWS | DigitalOcean |\n| --------- | --------- | --------- | ----- | -------------- |\n| PostgreSQL | ✗ | ✗ | ✓ (RDS) | ✓ |\n| MySQL | ✗ | ✗ | ✓ (RDS) | ✓ |\n| Redis | ✗ | ✗ | ✓ (ElastiCache) | ✓ |\n| MongoDB | ✗ | ✗ | ✓ (DocumentDB) | ✗ |\n| Multi-AZ | N/A | N/A | ✓ | ✓ |\n| Automatic Backups | N/A | N/A | ✓ | ✓ |\n| Read Replicas | N/A | N/A | ✓ | ✓ |\n| Param Groups | N/A | N/A | ✓ | ✗ |\n\n### Kubernetes\n\n| Feature | Hetzner | UpCloud | AWS | DigitalOcean |\n| --------- | --------- | --------- | ----- | -------------- |\n| Service | Manual K8s | Manual K8s | EKS | DOKS |\n| Managed Service | ✗ | ✗ | ✓ | ✓ |\n| Control Plane Managed | ✗ | ✗ | ✓ | ✓ |\n| Node Management | ✗ | ✗ | ✓ (node groups) | ✓ (node pools) |\n| Multi-AZ | ✗ | ✗ | ✓ | ✓ |\n| Ingress Support | Via add-on | Via add-on | ✓ (ALB) | ✓ |\n| Storage Classes | Via add-on | Via add-on | ✓ (EBS) | ✓ |\n\n### CDN/Edge\n\n| Feature | Hetzner | UpCloud | AWS | DigitalOcean |\n| --------- | --------- | --------- | ----- | -------------- |\n| CDN Service | ✗ | ✗ | ✓ (CloudFront) | ✓ |\n| Edge Locations | — | — | 600+ | 12+ |\n| Geographic Routing 
| — | — | ✓ | ✗ |\n| Cache Invalidation | — | — | ✓ | ✓ |\n| Origins | — | — | Any | HTTP/S, Object Storage |\n| SSL/TLS | — | — | ✓ | ✓ |\n| DDoS Protection | — | — | ✓ (Shield) | ✓ |\n\n### DNS\n\n| Feature | Hetzner | UpCloud | AWS | DigitalOcean |\n| --------- | --------- | --------- | ----- | -------------- |\n| DNS Service | ✓ (Basic) | ✗ | ✓ (Route53) | ✓ |\n| Zones | ✓ | N/A | ✓ | ✓ |\n| Failover | Manual | N/A | ✓ (health checks) | ✓ (health checks) |\n| Geolocation | ✗ | N/A | ✓ | ✗ |\n| DNSSEC | ✓ | N/A | ✓ | ✗ |\n| API Management | Limited | N/A | Full | Full |\n\n## Pricing Comparison\n\n### Compute Pricing (Monthly)\n\nComparison for 1-year term where applicable:\n\n| Configuration | Hetzner | UpCloud | AWS* | DigitalOcean |\n| --------------- | --------- | --------- | ------ | -------------- |\n| 1 vCPU, 1 GB RAM | €3.29 | $5 | $18 (t3.micro) | $6 |\n| 2 vCPU, 4 GB RAM | €6.90 | $15 | $36 (t3.small) | $24 |\n| 4 vCPU, 8 GB RAM | €13.80 | $30 | $73 (t3.medium) | $48 |\n| 8 vCPU, 16 GB RAM | €27.60 | $60 | $146 (t3.large) | $96 |\n| 16 vCPU, 32 GB RAM | €55.20 | $120 | $291 (t3.xlarge) | $192 |\n\n*AWS pricing: on-demand; reserved instances 25-30% discount\n\n### Storage Pricing (Monthly)\n\nPer GB for block storage:\n\n| Provider | Price/GB | Monthly Cost (100 GB) |\n| ---------- | ---------- | ---------------------- |\n| Hetzner | €0.026 | €2.60 |\n| UpCloud | $0.025 | $2.50 |\n| AWS EBS | $0.10 | $10.00 |\n| DigitalOcean | $0.10 | $10.00 |\n\n### Data Transfer Pricing\n\nOutbound data transfer (per GB):\n\n| Provider | First 1 TB | Beyond 1 TB |\n| ---------- | ----------- | ----------- |\n| Hetzner | Included | €0.12/GB |\n| UpCloud | $0.02/GB | $0.01/GB |\n| AWS | $0.09/GB | $0.085/GB |\n| DigitalOcean | $0.01/GB | $0.01/GB |\n\n### Total Cost of Ownership (TCO) Examples\n\n#### Small Application (2 servers, 100 GB storage)\n\n| Provider | Compute | Storage | Data Transfer | Monthly |\n| ---------- | --------- | --------- | ---------------- | --------- |\n| Hetzner | €13.80 | €2.60 | Included | **€16.40** |\n| UpCloud | $30 | $2.50 | $20 | **$52.50** |\n| AWS | $72 | $10 | $45 | **$127** |\n| DigitalOcean | $48 | $10 | Included | **$58** |\n\n#### Medium Application (5 servers, 500 GB storage, 10 TB data transfer)\n\n| Provider | Compute | Storage | Data Transfer | Monthly |\n| ---------- | --------- | --------- | ---------------- | --------- |\n| Hetzner | €69 | €13 | €1,200 | **€1,282** |\n| UpCloud | $150 | $12.50 | $200 | **$362.50** |\n| AWS | $360 | $50 | $900 | **$1,310** |\n| DigitalOcean | $240 | $50 | Included | **$290** |\n\n## Regional Availability\n\n### Hetzner Regions\n\n| Region | Location | Data Center | Highlights |\n| -------- | ---------- | ------------- | ------------ |\n| nbg1 | Nuremberg, Germany | 3 | EU hub, good performance |\n| fsn1 | Falkenstein, Germany | 1 | Lower latency, German regulations |\n| hel1 | Helsinki, Finland | 1 | Nordic region option |\n| ash | Ashburn, USA | 1 | North American presence |\n\n### UpCloud Regions\n\n| Region | Location | Highlights |\n| -------- | ---------- | ------------ |\n| fi-hel1 | Helsinki, Finland | Primary EU location |\n| de-fra1 | Frankfurt, Germany | EU alternative |\n| gb-lon1 | London, UK | European coverage |\n| us-nyc1 | New York, USA | North America |\n| sg-sin1 | Singapore | Asia Pacific |\n| jp-tok1 | Tokyo, Japan | APAC alternative |\n\n### AWS Regions (Selection)\n\n| Region | Location | Availability Zones | Highlights |\n| -------- | ---------- | ------------------- | ------------ |\n| 
us-east-1 | N. Virginia, USA | 6 | Largest, most services |\n| eu-west-1 | Ireland | 3 | EU primary, GDPR compliant |\n| eu-central-1 | Frankfurt, Germany | 3 | German data residency |\n| ap-southeast-1 | Singapore | 3 | APAC primary |\n| ap-northeast-1 | Tokyo, Japan | 4 | Asia alternative |\n\n### DigitalOcean Regions\n\n| Region | Location | Highlights |\n| -------- | ---------- | ------------ |\n| nyc3 | New York, USA | Primary US location |\n| sfo3 | San Francisco, USA | US West Coast |\n| lon1 | London, UK | European hub |\n| fra1 | Frankfurt, Germany | German regulations |\n| sgp1 | Singapore | APAC coverage |\n| blr1 | Bangalore, India | India region |\n\n### Regional Coverage Summary\n\n**Best Global Coverage**: AWS (30+ regions, most services)\n**Best EU Coverage**: All providers have good EU options\n**Best APAC Coverage**: AWS (most regions), DigitalOcean (Singapore)\n**Best North America**: All providers have coverage\n**Emerging Markets**: DigitalOcean (India via Bangalore)\n\n## Compliance and Certifications\n\n### Security Standards\n\n| Standard | Hetzner | UpCloud | AWS | DigitalOcean |\n| ---------- | --------- | --------- | ----- | -------------- |\n| GDPR | ✓ | ✓ | ✓ | ✓ |\n| CCPA | ✓ | ✓ | ✓ | ✓ |\n| SOC 2 Type II | ✓ | ✓ | ✓ | ✓ |\n| ISO 27001 | ✓ | ✓ | ✓ | ✓ |\n| ISO 9001 | ✗ | ✗ | ✓ | ✓ |\n| FedRAMP | ✗ | ✗ | ✓ | ✗ |\n\n### Industry-Specific Compliance\n\n| Standard | Hetzner | UpCloud | AWS | DigitalOcean |\n| ---------- | --------- | --------- | ----- | -------------- |\n| HIPAA | ✗ | ✗ | ✓ | ✓** |\n| PCI-DSS | ✓ | ✓ | ✓ | ✓ |\n| HITRUST | ✗ | ✗ | ✓ | ✗ |\n| FIPS 140-2 | ✗ | ✗ | ✓ | ✗ |\n| SOX (Sarbanes-Oxley) | Limited | Limited | ✓ | Limited |\n\n**DigitalOcean: Requires BAA for HIPAA compliance\n\n### Data Residency Support\n\n| Region | Hetzner | UpCloud | AWS | DigitalOcean |\n| -------- | --------- | --------- | ----- | -------------- |\n| EU (GDPR) | ✓ DE,FI | ✓ FI,DE,GB | ✓ (multiple) | ✓ (multiple) |\n| Germany (NIS2) | ✓ | ✓ | ✓ | ✓ |\n| UK (Post-Brexit) | ✗ | ✓ GB | ✓ | ✓ |\n| USA (CCPA) | ✗ | ✓ | ✓ | ✓ |\n| Canada | ✗ | ✗ | ✓ | ✗ |\n| Australia | ✗ | ✗ | ✓ | ✗ |\n| India | ✗ | ✗ | ✓ | ✓ |\n\n## Use Case Recommendations\n\n### 1. Cost-Sensitive Startups\n\n**Recommended**: Hetzner primary + DigitalOcean backup\n\n**Rationale**:\n- Hetzner has best price/performance ratio\n- DigitalOcean for geographic diversification\n- Both have simple interfaces and good documentation\n- Monthly cost: $30-80 for basic HA setup\n\n**Example Setup**:\n- Primary: Hetzner cx31 (2 vCPU, 4 GB)\n- Backup: DigitalOcean $24/month droplet\n- Database: Self-managed PostgreSQL or Hetzner volume\n- Total: ~$35/month\n\n### 2. Enterprise Production\n\n**Recommended**: AWS primary + UpCloud backup\n\n**Rationale**:\n- AWS for managed services and compliance\n- UpCloud for cost-effective disaster recovery\n- AWS compliance certifications (HIPAA, FIPS, SOC2)\n- Multiple regions within AWS\n- Mature enterprise support\n\n**Example Setup**:\n- Primary: AWS RDS (managed DB)\n- Secondary: UpCloud for compute burst\n- Compliance: Full audit trail and encryption\n\n### 3. High-Performance Computing\n\n**Recommended**: Hetzner + AWS spot instances\n\n**Rationale**:\n- Hetzner for sustained compute (good price)\n- AWS spot for burst workloads (70-90% discount)\n- Hetzner bare metal for specialized workloads\n- Cost-effective scaling\n\n### 4. 
Multi-Region Global Application\n\n**Recommended**: AWS + DigitalOcean + Hetzner\n\n**Rationale**:\n- AWS for primary regions and managed services\n- DigitalOcean for edge locations and simpler regions\n- Hetzner for EU cost optimization\n- Geographic redundancy across 3 providers\n\n**Example Setup**:\n- US: AWS (primary region)\n- EU: Hetzner (cost-optimized)\n- APAC: DigitalOcean (Singapore)\n- Global: CloudFront CDN\n\n### 5. Database-Heavy Applications\n\n**Recommended**: AWS RDS/ElastiCache + DigitalOcean Spaces\n\n**Rationale**:\n- AWS managed databases are feature-rich\n- DigitalOcean managed DB for simpler needs\n- Both support replicas and backups\n- Cost: $60-200/month for medium database\n\n### 6. Web Applications\n\n**Recommended**: DigitalOcean + AWS\n\n**Rationale**:\n- DigitalOcean for simplicity and speed\n- Droplets easy to manage and scale\n- AWS for advanced features and multi-region\n- Good community and documentation\n\n## Provider Strength Matrix\n\n### Performance ⚡\n\n| Category | Winner | Notes |\n| ---------- | -------- | ------- |\n| CPU Performance | Hetzner | Dedicated cores, good specs per price |\n| Network Bandwidth | AWS | 1Gbps+ guaranteed in multiple regions |\n| Storage IOPS | AWS | gp3 with 16K IOPS provisioning |\n| Latency (Global) | AWS | Most regions, best infrastructure |\n\n### Cost 💰\n\n| Category | Winner | Notes |\n| ---------- | -------- | ------- |\n| Compute | Hetzner | 50% cheaper than AWS on-demand |\n| Managed Services | AWS | Only provider with full managed stack |\n| Data Transfer | DigitalOcean | Included with many services |\n| Storage | Hetzner Object Storage | €0.025/GB vs AWS S3 $0.023/GB |\n\n### Ease of Use 🎯\n\n| Category | Winner | Notes |\n| ---------- | -------- | ------- |\n| UI/Dashboard | DigitalOcean | Simple, intuitive, clear pricing |\n| CLI Tools | AWS | Comprehensive aws-cli (but steep) |\n| API Documentation | DigitalOcean | Clear examples, community-driven |\n| Getting Started | DigitalOcean | Fastest path to first deployment |\n\n### Enterprise Features 🏢\n\n| Category | Winner | Notes |\n| ---------- | -------- | ------- |\n| Managed Services | AWS | RDS, ElastiCache, SQS, SNS, etc |\n| Compliance | AWS | Most certifications (HIPAA, FIPS, etc) |\n| Support | AWS | 24/7 support with paid plans |\n| Scale | AWS | Best for 1000+ servers |\n\n## Decision Matrix\n\nUse this matrix to quickly select a provider:\n\n```\nIf you need: Then use:\n─────────────────────────────────────────────────────────────\nLowest cost compute Hetzner\nSimplest interface DigitalOcean\nManaged databases AWS or DigitalOcean\nGlobal multi-region AWS\nCompliance (HIPAA/FIPS) AWS\nEuropean data residency Hetzner or DigitalOcean\nHigh performance compute Hetzner or AWS (bare metal)\nDisaster recovery setup UpCloud or Hetzner\nQuick startup DigitalOcean\nEnterprise SLA AWS or UpCloud\n```\n\n## Conclusion\n\n- **Hetzner**: Best for cost-conscious teams, European focus, good performance\n- **UpCloud**: Mid-market option, Nordic/EU focus, reliable alternative\n- **AWS**: Enterprise standard, global coverage, most services, highest cost\n- **DigitalOcean**: Developer-friendly, simplicity-focused, good value\n\nFor most organizations, a **multi-provider strategy** combining Hetzner (compute), AWS (managed services), and DigitalOcean (edge) provides the best\nbalance of cost, capability, and resilience. 
+# Provider Comparison Matrix + +This document provides a comprehensive comparison of supported cloud providers: Hetzner, UpCloud, AWS, and DigitalOcean. Use this matrix to make +informed decisions about which provider is best suited for your workloads. + +## Feature Comparison + +### Compute + +| Feature | Hetzner | UpCloud | AWS | DigitalOcean | +| --------- | --------- | --------- | ----- | -------------- | +| Product Name | Cloud Servers | Servers | EC2 | Droplets | +| Instance Sizing | Standard, dedicated cores | 2-32 vCPUs | Extensive (t2, t3, m5, c5, etc) | 1-48 vCPUs | +| Custom CPU/RAM | ✓ | ✓ | Limited | ✗ | +| Hourly Billing | ✓ | ✓ | ✓ | ✓ | +| Monthly Discount | 30% | 25% | ~30% (RI) | ~25% | +| GPU Instances | ✓ | ✗ | ✓ | ✗ | +| Auto-scaling | Via API | Via API | Native (ASG) | Via API | +| Bare Metal | ✓ | ✗ | ✓ (EC2) | ✗ | + +### Block Storage + +| Feature | Hetzner | UpCloud | AWS | DigitalOcean | +| --------- | --------- | --------- | ----- | -------------- | +| Product Name | Volumes | Storage | EBS | Volumes | +| SSD Volumes | ✓ | ✓ | ✓ (gp3, io1) | ✓ | +| HDD Volumes | ✗ | ✓ | ✓ (st1, sc1) | ✗ | +| Max Volume Size | 10 TB | Unlimited | 16 TB | 100 TB | +| IOPS Provisioning | Limited | ✓ | ✓ | ✗ | +| Snapshots | ✓ | ✓ | ✓ | ✓ | +| Encryption | ✓ | ✓ | ✓ | ✓ | +| Backup Service | ✗ | ✗ | ✓ (AWS Backup) | ✓ | + +### Object Storage + +| Feature | Hetzner | UpCloud | AWS | DigitalOcean | +| --------- | --------- | --------- | ----- | -------------- | +| Product Name | Object Storage | — | S3 | Spaces | +| API Compatibility | S3-compatible | — | S3 (native) | S3-compatible | +| Pricing (per GB) | €0.025 | N/A | $0.023 | $0.015 | +| Regions | 2 | N/A | 30+ | 4 | +| Versioning | ✓ | N/A | ✓ | ✓ | +| Lifecycle Rules | ✓ | N/A | ✓ | ✓ | +| CDN Integration | ✗ | N/A | ✓ (CloudFront) | ✓ (CDN add-on) | +| Access Control | Bucket policies | N/A | IAM + bucket policies | Token-based | + +### Load Balancing + +| Feature | Hetzner | UpCloud | AWS | DigitalOcean | +| --------- | --------- | --------- | ----- | -------------- | +| Product Name | Load Balancer | Load Balancer | ELB/ALB/NLB | Load Balancer | +| Type | Layer 4/7 | Layer 4 | Layer 4/7 | Layer 4/7 | +| Health Checks | ✓ | ✓ | ✓ | ✓ | +| SSL/TLS Termination | ✓ | Limited | ✓ | ✓ | +| Path-based Routing | ✓ | ✗ | ✓ (ALB) | ✗ | +| Host-based Routing | ✓ | ✗ | ✓ (ALB) | ✗ | +| Sticky Sessions | ✓ | ✓ | ✓ | ✓ | +| Geographic Distribution | ✗ | ✗ | ✓ (multi-region) | ✗ | +| DDoS Protection | Basic | ✓ | ✓ (Shield) | ✓ | + +### Managed Databases + +| Feature | Hetzner | UpCloud | AWS | DigitalOcean | +| --------- | --------- | --------- | ----- | -------------- | +| PostgreSQL | ✗ | ✗ | ✓ (RDS) | ✓ | +| MySQL | ✗ | ✗ | ✓ (RDS) | ✓ | +| Redis | ✗ | ✗ | ✓ (ElastiCache) | ✓ | +| MongoDB | ✗ | ✗ | ✓ (DocumentDB) | ✗ | +| Multi-AZ | N/A | N/A | ✓ | ✓ | +| Automatic Backups | N/A | N/A | ✓ | ✓ | +| Read Replicas | N/A | N/A | ✓ | ✓ | +| Param Groups | N/A | N/A | ✓ | ✗ | + +### Kubernetes + +| Feature | Hetzner | UpCloud | AWS | DigitalOcean | +| --------- | --------- | --------- | ----- | -------------- | +| Service | Manual K8s | Manual K8s | EKS | DOKS | +| Managed Service | ✗ | ✗ | ✓ | ✓ | +| Control Plane Managed | ✗ | ✗ | ✓ | ✓ | +| Node Management | ✗ | ✗ | ✓ (node groups) | ✓ (node pools) | +| Multi-AZ | ✗ | ✗ | ✓ | ✓ | +| Ingress Support | Via add-on | Via add-on | ✓ (ALB) | ✓ | +| Storage Classes | Via add-on | Via add-on | ✓ (EBS) | ✓ | + +### CDN/Edge + +| Feature | Hetzner | UpCloud | AWS | DigitalOcean | +| --------- | 
--------- | --------- | ----- | -------------- | +| CDN Service | ✗ | ✗ | ✓ (CloudFront) | ✓ | +| Edge Locations | — | — | 600+ | 12+ | +| Geographic Routing | — | — | ✓ | ✗ | +| Cache Invalidation | — | — | ✓ | ✓ | +| Origins | — | — | Any | HTTP/S, Object Storage | +| SSL/TLS | — | — | ✓ | ✓ | +| DDoS Protection | — | — | ✓ (Shield) | ✓ | + +### DNS + +| Feature | Hetzner | UpCloud | AWS | DigitalOcean | +| --------- | --------- | --------- | ----- | -------------- | +| DNS Service | ✓ (Basic) | ✗ | ✓ (Route53) | ✓ | +| Zones | ✓ | N/A | ✓ | ✓ | +| Failover | Manual | N/A | ✓ (health checks) | ✓ (health checks) | +| Geolocation | ✗ | N/A | ✓ | ✗ | +| DNSSEC | ✓ | N/A | ✓ | ✗ | +| API Management | Limited | N/A | Full | Full | + +## Pricing Comparison + +### Compute Pricing (Monthly) + +Comparison for 1-year term where applicable: + +| Configuration | Hetzner | UpCloud | AWS* | DigitalOcean | +| --------------- | --------- | --------- | ------ | -------------- | +| 1 vCPU, 1 GB RAM | €3.29 | $5 | $18 (t3.micro) | $6 | +| 2 vCPU, 4 GB RAM | €6.90 | $15 | $36 (t3.small) | $24 | +| 4 vCPU, 8 GB RAM | €13.80 | $30 | $73 (t3.medium) | $48 | +| 8 vCPU, 16 GB RAM | €27.60 | $60 | $146 (t3.large) | $96 | +| 16 vCPU, 32 GB RAM | €55.20 | $120 | $291 (t3.xlarge) | $192 | + +*AWS pricing: on-demand; reserved instances 25-30% discount + +### Storage Pricing (Monthly) + +Per GB for block storage: + +| Provider | Price/GB | Monthly Cost (100 GB) | +| ---------- | ---------- | ---------------------- | +| Hetzner | €0.026 | €2.60 | +| UpCloud | $0.025 | $2.50 | +| AWS EBS | $0.10 | $10.00 | +| DigitalOcean | $0.10 | $10.00 | + +### Data Transfer Pricing + +Outbound data transfer (per GB): + +| Provider | First 1 TB | Beyond 1 TB | +| ---------- | ----------- | ----------- | +| Hetzner | Included | €0.12/GB | +| UpCloud | $0.02/GB | $0.01/GB | +| AWS | $0.09/GB | $0.085/GB | +| DigitalOcean | $0.01/GB | $0.01/GB | + +### Total Cost of Ownership (TCO) Examples + +#### Small Application (2 servers, 100 GB storage) + +| Provider | Compute | Storage | Data Transfer | Monthly | +| ---------- | --------- | --------- | ---------------- | --------- | +| Hetzner | €13.80 | €2.60 | Included | **€16.40** | +| UpCloud | $30 | $2.50 | $20 | **$52.50** | +| AWS | $72 | $10 | $45 | **$127** | +| DigitalOcean | $48 | $10 | Included | **$58** | + +#### Medium Application (5 servers, 500 GB storage, 10 TB data transfer) + +| Provider | Compute | Storage | Data Transfer | Monthly | +| ---------- | --------- | --------- | ---------------- | --------- | +| Hetzner | €69 | €13 | €1,200 | **€1,282** | +| UpCloud | $150 | $12.50 | $200 | **$362.50** | +| AWS | $360 | $50 | $900 | **$1,310** | +| DigitalOcean | $240 | $50 | Included | **$290** | + +## Regional Availability + +### Hetzner Regions + +| Region | Location | Data Center | Highlights | +| -------- | ---------- | ------------- | ------------ | +| nbg1 | Nuremberg, Germany | 3 | EU hub, good performance | +| fsn1 | Falkenstein, Germany | 1 | Lower latency, German regulations | +| hel1 | Helsinki, Finland | 1 | Nordic region option | +| ash | Ashburn, USA | 1 | North American presence | + +### UpCloud Regions + +| Region | Location | Highlights | +| -------- | ---------- | ------------ | +| fi-hel1 | Helsinki, Finland | Primary EU location | +| de-fra1 | Frankfurt, Germany | EU alternative | +| gb-lon1 | London, UK | European coverage | +| us-nyc1 | New York, USA | North America | +| sg-sin1 | Singapore | Asia Pacific | +| jp-tok1 | Tokyo, Japan | APAC alternative | + 
+### AWS Regions (Selection) + +| Region | Location | Availability Zones | Highlights | +| -------- | ---------- | ------------------- | ------------ | +| us-east-1 | N. Virginia, USA | 6 | Largest, most services | +| eu-west-1 | Ireland | 3 | EU primary, GDPR compliant | +| eu-central-1 | Frankfurt, Germany | 3 | German data residency | +| ap-southeast-1 | Singapore | 3 | APAC primary | +| ap-northeast-1 | Tokyo, Japan | 4 | Asia alternative | + +### DigitalOcean Regions + +| Region | Location | Highlights | +| -------- | ---------- | ------------ | +| nyc3 | New York, USA | Primary US location | +| sfo3 | San Francisco, USA | US West Coast | +| lon1 | London, UK | European hub | +| fra1 | Frankfurt, Germany | German regulations | +| sgp1 | Singapore | APAC coverage | +| blr1 | Bangalore, India | India region | + +### Regional Coverage Summary + +**Best Global Coverage**: AWS (30+ regions, most services) +**Best EU Coverage**: All providers have good EU options +**Best APAC Coverage**: AWS (most regions), DigitalOcean (Singapore) +**Best North America**: All providers have coverage +**Emerging Markets**: DigitalOcean (India via Bangalore) + +## Compliance and Certifications + +### Security Standards + +| Standard | Hetzner | UpCloud | AWS | DigitalOcean | +| ---------- | --------- | --------- | ----- | -------------- | +| GDPR | ✓ | ✓ | ✓ | ✓ | +| CCPA | ✓ | ✓ | ✓ | ✓ | +| SOC 2 Type II | ✓ | ✓ | ✓ | ✓ | +| ISO 27001 | ✓ | ✓ | ✓ | ✓ | +| ISO 9001 | ✗ | ✗ | ✓ | ✓ | +| FedRAMP | ✗ | ✗ | ✓ | ✗ | + +### Industry-Specific Compliance + +| Standard | Hetzner | UpCloud | AWS | DigitalOcean | +| ---------- | --------- | --------- | ----- | -------------- | +| HIPAA | ✗ | ✗ | ✓ | ✓** | +| PCI-DSS | ✓ | ✓ | ✓ | ✓ | +| HITRUST | ✗ | ✗ | ✓ | ✗ | +| FIPS 140-2 | ✗ | ✗ | ✓ | ✗ | +| SOX (Sarbanes-Oxley) | Limited | Limited | ✓ | Limited | + +**DigitalOcean: Requires BAA for HIPAA compliance + +### Data Residency Support + +| Region | Hetzner | UpCloud | AWS | DigitalOcean | +| -------- | --------- | --------- | ----- | -------------- | +| EU (GDPR) | ✓ DE,FI | ✓ FI,DE,GB | ✓ (multiple) | ✓ (multiple) | +| Germany (NIS2) | ✓ | ✓ | ✓ | ✓ | +| UK (Post-Brexit) | ✗ | ✓ GB | ✓ | ✓ | +| USA (CCPA) | ✗ | ✓ | ✓ | ✓ | +| Canada | ✗ | ✗ | ✓ | ✗ | +| Australia | ✗ | ✗ | ✓ | ✗ | +| India | ✗ | ✗ | ✓ | ✓ | + +## Use Case Recommendations + +### 1. Cost-Sensitive Startups + +**Recommended**: Hetzner primary + DigitalOcean backup + +**Rationale**: +- Hetzner has best price/performance ratio +- DigitalOcean for geographic diversification +- Both have simple interfaces and good documentation +- Monthly cost: $30-80 for basic HA setup + +**Example Setup**: +- Primary: Hetzner cx31 (2 vCPU, 4 GB) +- Backup: DigitalOcean $24/month droplet +- Database: Self-managed PostgreSQL or Hetzner volume +- Total: ~$35/month + +### 2. Enterprise Production + +**Recommended**: AWS primary + UpCloud backup + +**Rationale**: +- AWS for managed services and compliance +- UpCloud for cost-effective disaster recovery +- AWS compliance certifications (HIPAA, FIPS, SOC2) +- Multiple regions within AWS +- Mature enterprise support + +**Example Setup**: +- Primary: AWS RDS (managed DB) +- Secondary: UpCloud for compute burst +- Compliance: Full audit trail and encryption + +### 3. 
High-Performance Computing + +**Recommended**: Hetzner + AWS spot instances + +**Rationale**: +- Hetzner for sustained compute (good price) +- AWS spot for burst workloads (70-90% discount) +- Hetzner bare metal for specialized workloads +- Cost-effective scaling + +### 4. Multi-Region Global Application + +**Recommended**: AWS + DigitalOcean + Hetzner + +**Rationale**: +- AWS for primary regions and managed services +- DigitalOcean for edge locations and simpler regions +- Hetzner for EU cost optimization +- Geographic redundancy across 3 providers + +**Example Setup**: +- US: AWS (primary region) +- EU: Hetzner (cost-optimized) +- APAC: DigitalOcean (Singapore) +- Global: CloudFront CDN + +### 5. Database-Heavy Applications + +**Recommended**: AWS RDS/ElastiCache + DigitalOcean Spaces + +**Rationale**: +- AWS managed databases are feature-rich +- DigitalOcean managed DB for simpler needs +- Both support replicas and backups +- Cost: $60-200/month for medium database + +### 6. Web Applications + +**Recommended**: DigitalOcean + AWS + +**Rationale**: +- DigitalOcean for simplicity and speed +- Droplets easy to manage and scale +- AWS for advanced features and multi-region +- Good community and documentation + +## Provider Strength Matrix + +### Performance ⚡ + +| Category | Winner | Notes | +| ---------- | -------- | ------- | +| CPU Performance | Hetzner | Dedicated cores, good specs per price | +| Network Bandwidth | AWS | 1Gbps+ guaranteed in multiple regions | +| Storage IOPS | AWS | gp3 with 16K IOPS provisioning | +| Latency (Global) | AWS | Most regions, best infrastructure | + +### Cost 💰 + +| Category | Winner | Notes | +| ---------- | -------- | ------- | +| Compute | Hetzner | 50% cheaper than AWS on-demand | +| Managed Services | AWS | Only provider with full managed stack | +| Data Transfer | DigitalOcean | Included with many services | +| Storage | Hetzner Object Storage | €0.025/GB vs AWS S3 $0.023/GB | + +### Ease of Use 🎯 + +| Category | Winner | Notes | +| ---------- | -------- | ------- | +| UI/Dashboard | DigitalOcean | Simple, intuitive, clear pricing | +| CLI Tools | AWS | Comprehensive aws-cli (but steep) | +| API Documentation | DigitalOcean | Clear examples, community-driven | +| Getting Started | DigitalOcean | Fastest path to first deployment | + +### Enterprise Features 🏢 + +| Category | Winner | Notes | +| ---------- | -------- | ------- | +| Managed Services | AWS | RDS, ElastiCache, SQS, SNS, etc | +| Compliance | AWS | Most certifications (HIPAA, FIPS, etc) | +| Support | AWS | 24/7 support with paid plans | +| Scale | AWS | Best for 1000+ servers | + +## Decision Matrix + +Use this matrix to quickly select a provider: + +```text +If you need: Then use: +───────────────────────────────────────────────────────────── +Lowest cost compute Hetzner +Simplest interface DigitalOcean +Managed databases AWS or DigitalOcean +Global multi-region AWS +Compliance (HIPAA/FIPS) AWS +European data residency Hetzner or DigitalOcean +High performance compute Hetzner or AWS (bare metal) +Disaster recovery setup UpCloud or Hetzner +Quick startup DigitalOcean +Enterprise SLA AWS or UpCloud +``` + +## Conclusion + +- **Hetzner**: Best for cost-conscious teams, European focus, good performance +- **UpCloud**: Mid-market option, Nordic/EU focus, reliable alternative +- **AWS**: Enterprise standard, global coverage, most services, highest cost +- **DigitalOcean**: Developer-friendly, simplicity-focused, good value + +For most organizations, a **multi-provider strategy** combining 
Hetzner (compute), AWS (managed services), and DigitalOcean (edge) provides the best +balance of cost, capability, and resilience. \ No newline at end of file diff --git a/docs/src/development/providers/provider-development-guide.md b/docs/src/development/providers/provider-development-guide.md index 610404a..28febc0 100644 --- a/docs/src/development/providers/provider-development-guide.md +++ b/docs/src/development/providers/provider-development-guide.md @@ -1 +1,717 @@ -# Cloud Provider Development Guide\n\n**Version**: 2.0\n**Status**: Production Ready\n**Based On**: Hetzner, UpCloud, AWS (3 completed providers)\n\n---\n\n## Overview: 4-Task Completion Framework\n\nA cloud provider is **production-ready** when it completes all 4 tasks:\n\n| Task | Requirements | Reference |\n| ------ | --- | --- |\n| **1. Nushell Compliance** | 0 deprecated patterns, full implementations | `provisioning/extensions/providers/hetzner/` |\n| **2. Test Infrastructure** | 51 tests (14 unit + 37 integration, mock-based) | `provisioning/extensions/providers/upcloud/tests/` |\n| **3. Runtime Templates** | 3+ Jinja2/Bash templates for core resources | `provisioning/extensions/providers/aws/templates/` |\n| **4. Nickel Validation** | Schemas pass `nickel typecheck` | `provisioning/extensions/providers/hetzner/nickel/` |\n\n### Execution Sequence\n\n```\nTarea 4 (5 min) ──────┐\nTarea 1 (main) ───┐ ├──> Tarea 2 (tests)\nTarea 3 (parallel)┘ │\n └──> Production Ready ✅\n```\n\n---\n\n## Nushell 0.109.0+ Core Rules\n\nThese rules are **mandatory** for all provider Nushell code:\n\n### Rule 1: Module System & Imports\n```\nuse mod.nu\nuse api.nu\nuse servers.nu\n```\n\n### Rule 2: Function Signatures\n```\ndef function_name [param: type, optional: type = default] { }\n```\n\n### Rule 3: Return Early, Fail Fast\n```\ndef operation [resource: record] {\n if ($resource | get -o id | is-empty) {\n error make {msg: "Resource ID required"}\n }\n}\n```\n\n### Rule 4: Modern Error Handling (CRITICAL)\n\n**❌ FORBIDDEN** - Deprecated try-catch:\n```\ntry {\n ^external_command\n} catch {|err|\n print $"Error: ($err.msg)"\n}\n```\n\n**✅ REQUIRED** - Modern do/complete pattern:\n```\nlet result = (do { ^external_command } | complete)\n\nif $result.exit_code != 0 {\n error make {msg: $"Command failed: ($result.stderr)"}\n}\n\n$result.stdout\n```\n\n### Rule 5: Atomic Operations\nAll operations must fully succeed or fully fail. 
No partial state changes.\n\n### Rule 12: Structured Error Returns\n```\nerror make {\n msg: "Human-readable message",\n label: {text: "Error context", span: (metadata error).span}\n}\n```\n\n### Critical Violations (INSTANT FAIL)\n\n❌ **FORBIDDEN**:\n- `try { } catch { }` blocks\n- `let mut variable = value` (mutable state)\n- `error make {msg: "Not implemented"}` (stubs)\n- Empty function bodies returning ok\n- Deprecated error patterns\n\n---\n\n## Nickel IaC: Three-File Pattern\n\nAll Nickel schemas follow this pattern:\n\n### contracts.ncl: Type Definitions\n\n```\n{\n Server = {\n id | String,\n name | String,\n instance_type | String,\n zone | String,\n },\n\n Volume = {\n id | String,\n name | String,\n size | Number,\n type | String,\n }\n}\n```\n\n### defaults.ncl: Default Values\n\n```\n{\n Server = {\n instance_type = "t3.micro",\n zone = "us-east-1a",\n },\n\n Volume = {\n size = 20,\n type = "gp3",\n }\n}\n```\n\n### main.ncl: Public API\n\n```\nlet contracts = import "contracts.ncl" in\nlet defaults = import "defaults.ncl" in\n\n{\n make_server = fun config => defaults.Server & config,\n make_volume = fun config => defaults.Volume & config,\n}\n```\n\n### version.ncl: Version Tracking\n\n```\n{\n provider_version = "1.0.0",\n cli_tools = {\n hcloud = "1.47.0+",\n },\n nickel_version = "1.7.0+",\n}\n```\n\n**Validation**:\n```\nnickel typecheck nickel/contracts.ncl\nnickel typecheck nickel/defaults.ncl\nnickel typecheck nickel/main.ncl\nnickel typecheck nickel/version.ncl\nnickel export nickel/main.ncl\n```\n\n---\n\n## Tarea 1: Nushell Compliance\n\n### Identify Violations\n\n```\ncd provisioning/extensions/providers/{PROVIDER}\n\ngrep -r "try {" nulib/ --include="*.nu" | wc -l\ngrep -r "let mut " nulib/ --include="*.nu" | wc -l\ngrep -r "not implemented" nulib/ --include="*.nu" | wc -l\n```\n\nAll three commands should return `0`.\n\n### Fix Mutable Loops: Accumulation Pattern\n\n```\ndef retry_with_backoff [\n closure: closure,\n max_attempts: int\n]: nothing -> any {\n let result = (\n 0..$max_attempts | reduce --fold {\n success: false,\n value: null,\n delay: 100 ms\n } {|attempt, acc|\n if $acc.success {\n $acc\n } else {\n let op_result = (do { $closure | call } | complete)\n\n if $op_result.exit_code == 0 {\n {success: true, value: $op_result.stdout, delay: $acc.delay}\n } else if $attempt >= ($max_attempts - 1) {\n $acc\n } else {\n sleep $acc.delay\n {success: false, value: null, delay: ($acc.delay * 2)}\n }\n }\n }\n )\n\n if $result.success {\n $result.value\n } else {\n error make {msg: $"Failed after ($max_attempts) attempts"}\n }\n}\n```\n\n### Fix Mutable Loops: Recursive Pattern\n\n```\ndef _wait_for_state [\n resource_id: string,\n target_state: string,\n timeout_sec: int,\n elapsed: int = 0,\n interval: int = 2\n]: nothing -> bool {\n let current = (^aws ec2 describe-volumes \\n --volume-ids $resource_id \\n --query "Volumes[0].State" \\n --output text)\n\n if ($current | str contains $target_state) {\n true\n } else if $elapsed > $timeout_sec {\n false\n } else {\n sleep ($"($interval)sec" | into duration)\n _wait_for_state $resource_id $target_state $timeout_sec ($elapsed + $interval) $interval\n }\n}\n```\n\n### Fix Error Handling\n\n```\ndef create_server [config: record] {\n if ($config | get -o name | is-empty) {\n error make {msg: "Server name required"}\n }\n\n let api_result = (do {\n ^hcloud server create \\n --name $config.name \\n --type $config.instance_type \\n --format json\n } | complete)\n\n if $api_result.exit_code != 0 {\n error make 
{msg: $"Server creation failed: ($api_result.stderr)"}\n }\n\n let response = ($api_result.stdout | from json)\n {\n id: $response.server.id,\n name: $response.server.name,\n status: "created"\n }\n}\n```\n\n### Validation\n\n```\ncd provisioning/extensions/providers/{PROVIDER}\n\nfor file in nulib/*/\*.nu; do\n nu --ide-check 100 "$file" 2>&1 | grep -i error && exit 1\ndone\n\nnu -c "use nulib/{provider}/mod.nu; print 'OK'"\n\necho "✅ Nushell compliance complete"\n```\n\n---\n\n## Tarea 2: Test Infrastructure\n\n### Directory Structure\n\n```\ntests/\n├── mocks/\n│ └── mock_api_responses.json\n├── unit/\n│ └── test_utils.nu\n├── integration/\n│ ├── test_api_client.nu\n│ ├── test_server_lifecycle.nu\n│ └── test_pricing_cache.nu\n└── run_{provider}_tests.nu\n```\n\n### Mock API Responses\n\n```\n{\n "list_servers": {\n "servers": [\n {\n "id": "srv-123",\n "name": "test-server",\n "status": "running"\n }\n ]\n },\n "error_401": {\n "error": {"message": "Unauthorized", "code": 401}\n },\n "error_429": {\n "error": {"message": "Rate limited", "code": 429}\n }\n}\n```\n\n### Unit Tests: 14 Tests\n\n```\ndef test-result [name: string, result: bool] {\n if $result {\n print $"✓ ($name)"\n } else {\n print $"✗ ($name)"\n }\n $result\n}\n\ndef test-validate-instance-id [] {\n let valid = "i-1234567890abcdef0"\n let invalid = "invalid-id"\n\n let test1 = (test-result "Instance ID valid" ($valid | str contains "i-"))\n let test2 = (test-result "Instance ID invalid" (($invalid | str contains "i-") == false))\n\n $test1 and $test2\n}\n\ndef test-validate-ipv4 [] {\n let valid = "10.0.1.100"\n let parts = ($valid | split row ".")\n test-result "IPv4 four octets" (($parts | length) == 4)\n}\n\ndef test-validate-instance-type [] {\n let valid_types = ["t3.micro" "t3.small" "m5.large"]\n let invalid = "invalid_type"\n\n let test1 = (test-result "Instance type valid" (($valid_types | contains ["t3.micro"])))\n let test2 = (test-result "Instance type invalid" (($valid_types | contains [$invalid]) == false))\n\n $test1 and $test2\n}\n\ndef test-validate-zone [] {\n let valid_zones = ["us-east-1a" "us-east-1b" "eu-west-1a"]\n let invalid = "invalid-zone"\n\n let test1 = (test-result "Zone valid" (($valid_zones | contains ["us-east-1a"])))\n let test2 = (test-result "Zone invalid" (($valid_zones | contains [$invalid]) == false))\n\n $test1 and $test2\n}\n\ndef test-validate-volume-id [] {\n let valid = "vol-12345678"\n let invalid = "invalid-vol"\n\n let test1 = (test-result "Volume ID valid" ($valid | str contains "vol-"))\n let test2 = (test-result "Volume ID invalid" (($invalid | str contains "vol-") == false))\n\n $test1 and $test2\n}\n\ndef test-validate-volume-state [] {\n let valid_states = ["available" "in-use" "creating"]\n let invalid = "pending"\n\n let test1 = (test-result "Volume state valid" (($valid_states | contains ["available"])))\n let test2 = (test-result "Volume state invalid" (($valid_states | contains [$invalid]) == false))\n\n $test1 and $test2\n}\n\ndef test-validate-cidr [] {\n let valid = "10.0.0.0/16"\n let invalid = "10.0.0.1"\n\n let test1 = (test-result "CIDR valid" ($valid | str contains "/"))\n let test2 = (test-result "CIDR invalid" (($invalid | str contains "/") == false))\n\n $test1 and $test2\n}\n\ndef test-validate-volume-type [] {\n let valid_types = ["gp2" "gp3" "io1" "io2"]\n let invalid = "invalid-type"\n\n let test1 = (test-result "Volume type valid" (($valid_types | contains ["gp3"])))\n let test2 = (test-result "Volume type invalid" (($valid_types | contains 
[$invalid]) == false))\n\n $test1 and $test2\n}\n\ndef test-validate-timestamp [] {\n let valid = "2025-01-07T10:00:00.000Z"\n let invalid = "not-a-timestamp"\n\n let test1 = (test-result "Timestamp valid" ($valid | str contains "T" and $valid | str contains "Z"))\n let test2 = (test-result "Timestamp invalid" (($invalid | str contains "T") == false))\n\n $test1 and $test2\n}\n\ndef test-validate-server-state [] {\n let valid_states = ["running" "stopped" "pending"]\n let invalid = "hibernating"\n\n let test1 = (test-result "Server state valid" (($valid_states | contains ["running"])))\n let test2 = (test-result "Server state invalid" (($valid_states | contains [$invalid]) == false))\n\n $test1 and $test2\n}\n\ndef test-validate-security-group [] {\n let valid = "sg-12345678"\n let invalid = "invalid-sg"\n\n let test1 = (test-result "Security group valid" ($valid | str contains "sg-"))\n let test2 = (test-result "Security group invalid" (($invalid | str contains "sg-") == false))\n\n $test1 and $test2\n}\n\ndef test-validate-memory [] {\n let valid_mems = ["512 MB" "1 GB" "2 GB" "4 GB"]\n let invalid = "0 GB"\n\n let test1 = (test-result "Memory valid" (($valid_mems | contains ["1 GB"])))\n let test2 = (test-result "Memory invalid" (($valid_mems | contains [$invalid]) == false))\n\n $test1 and $test2\n}\n\ndef test-validate-vcpu [] {\n let valid_cpus = [1, 2, 4, 8, 16]\n let invalid = 0\n\n let test1 = (test-result "vCPU valid" (($valid_cpus | contains [1])))\n let test2 = (test-result "vCPU invalid" (($valid_cpus | contains [$invalid]) == false))\n\n $test1 and $test2\n}\n\ndef main [] {\n print "=== Unit Tests ==="\n print ""\n\n let results = [\n (test-validate-instance-id),\n (test-validate-ipv4),\n (test-validate-instance-type),\n (test-validate-zone),\n (test-validate-volume-id),\n (test-validate-volume-state),\n (test-validate-cidr),\n (test-validate-volume-type),\n (test-validate-timestamp),\n (test-validate-server-state),\n (test-validate-security-group),\n (test-validate-memory),\n (test-validate-vcpu)\n ]\n\n let passed = ($results | where {|it| $it == true} | length)\n let failed = ($results | where {|it| $it == false} | length)\n\n print ""\n print $"Results: ($passed) passed, ($failed) failed"\n\n {\n passed: $passed,\n failed: $failed,\n total: ($passed + $failed)\n }\n}\n\nmain\n```\n\n### Integration Tests: 37 Tests across 3 Modules\n\n**Module 1: test_api_client.nu** (13 tests)\n- Response structure validation\n- Error handling for 401, 404, 429\n- Resource listing operations\n- Pricing data validation\n\n**Module 2: test_server_lifecycle.nu** (12 tests)\n- Server creation, listing, state\n- Instance type and zone info\n- Storage and security attachment\n- Server state transitions\n\n**Module 3: test_pricing_cache.nu** (12 tests)\n- Pricing data structure validation\n- On-demand vs reserved pricing\n- Cost calculations\n- Volume pricing operations\n\n### Test Orchestrator\n\n```\ndef main [] {\n print "=== Provider Test Suite ==="\n\n let unit_result = (nu tests/unit/test_utils.nu)\n let api_result = (nu tests/integration/test_api_client.nu)\n let lifecycle_result = (nu tests/integration/test_server_lifecycle.nu)\n let pricing_result = (nu tests/integration/test_pricing_cache.nu)\n\n let total_passed = (\n $unit_result.passed +\n $api_result.passed +\n $lifecycle_result.passed +\n $pricing_result.passed\n )\n\n let total_failed = (\n $unit_result.failed +\n $api_result.failed +\n $lifecycle_result.failed +\n $pricing_result.failed\n )\n\n print $"Results: ($total_passed) 
passed, ($total_failed) failed"\n\n {\n passed: $total_passed,\n failed: $total_failed,\n success: ($total_failed == 0)\n }\n}\n\nlet result = (main)\nexit (if $result.success {0} else {1})\n```\n\n### Validation\n\n```\ncd provisioning/extensions/providers/{PROVIDER}\nnu tests/run_{provider}_tests.nu\n```\n\nExpected: 51 tests passing, exit code 0\n\n---\n\n## Tarea 3: Runtime Templates\n\n### Directory Structure\n\n```\ntemplates/\n├── {provider}_servers.j2\n├── {provider}_networks.j2\n└── {provider}_volumes.j2\n```\n\n### Template Example\n\n```jinja2\n#!/bin/bash\n# {{ provider_name }} Server Provisioning\nset -e\n{% if debug %}set -x{% endif %}\n\n{%- for server in servers %}\n {%- if server.name %}\n\necho "Creating server: {{ server.name }}"\n\n{%- if server.instance_type %}\nINSTANCE_TYPE="{{ server.instance_type }}"\n{%- else %}\nINSTANCE_TYPE="t3.micro"\n{%- endif %}\n\nSERVER_ID=$(^hcloud server create \\n --name "{{ server.name }}" \\n --type $INSTANCE_TYPE \\n --query 'id' \\n --output text 2>/dev/null)\n\nif [ -z "$SERVER_ID" ]; then\n echo "Failed to create server {{ server.name }}"\n exit 1\nfi\n\necho "✓ Server {{ server.name }} created: $SERVER_ID"\n\n {%- endif %}\n{%- endfor %}\n\necho "Server provisioning complete"\n```\n\n### Validation\n\n```\ncd provisioning/extensions/providers/{PROVIDER}\n\nfor template in templates/*.j2; do\n bash -n <(sed 's/{%.*%}//' "$template" | sed 's/{{.*}}/x/g')\ndone\n\necho "✅ Templates valid"\n```\n\n---\n\n## Tarea 4: Nickel Schema Validation\n\n```\ncd provisioning/extensions/providers/{PROVIDER}\n\nnickel typecheck nickel/contracts.ncl || exit 1\nnickel typecheck nickel/defaults.ncl || exit 1\nnickel typecheck nickel/main.ncl || exit 1\nnickel typecheck nickel/version.ncl || exit 1\n\nnickel export nickel/main.ncl || exit 1\n\necho "✅ Nickel schemas validated"\n```\n\n---\n\n## Complete Validation Script\n\n```\n#!/bin/bash\nset -e\n\nPROVIDER="hetzner"\nPROV="provisioning/extensions/providers/$PROVIDER"\n\necho "=== Provider Completeness Check: $PROVIDER ==="\n\necho ""\necho "✓ Tarea 4: Validating Nickel..."\nnickel typecheck "$PROV/nickel/main.ncl"\n\necho "✓ Tarea 1: Checking Nushell..."\n[ $(grep -r "try {" "$PROV/nulib" 2>/dev/null | wc -l) -eq 0 ]\n[ $(grep -r "let mut " "$PROV/nulib" 2>/dev/null | wc -l) -eq 0 ]\necho " - No deprecated patterns ✓"\n\necho "✓ Tarea 3: Validating templates..."\nfor f in "$PROV"/templates/*.j2; do\n bash -n <(sed 's/{%.*%}//' "$f" | sed 's/{{.*}}/x/g')\ndone\n\necho "✓ Tarea 2: Running tests..."\nnu "$PROV/tests/run_${PROVIDER}_tests.nu"\n\necho ""\necho "╔════════════════════════════════════════╗"\necho "║ ✅ ALL TASKS COMPLETE ║"\necho "║ PRODUCTION READY ║"\necho "╚════════════════════════════════════════╝"\n```\n\n---\n\n## Reference Implementations\n\n- **Hetzner**: `provisioning/extensions/providers/hetzner/`\n- **UpCloud**: `provisioning/extensions/providers/upcloud/`\n- **AWS**: `provisioning/extensions/providers/aws/`\n\nUse these as templates for new providers.\n\n---\n\n## Quick Start\n\n```\ncd provisioning/extensions/providers/{PROVIDER}\n\n# Validate completeness\nnickel typecheck nickel/main.ncl && \\n[ $(grep -r "try {" nulib/ 2>/dev/null | wc -l) -eq 0 ] && \\nnu tests/run_{provider}_tests.nu && \\nfor f in templates/*.j2; do bash -n <(sed 's/{%.*%}//' "$f"); done && \\necho "✅ PRODUCTION READY"\n``` +# Cloud Provider Development Guide + +**Version**: 2.0 +**Status**: Production Ready +**Based On**: Hetzner, UpCloud, AWS (3 completed providers) + +--- + +## Overview: 4-Task 
Completion Framework
+
+A cloud provider is **production-ready** when it completes all 4 tasks:
+
+| Task | Requirements | Reference |
+| ------ | --- | --- |
+| **1. Nushell Compliance** | 0 deprecated patterns, full implementations | `provisioning/extensions/providers/hetzner/` |
+| **2. Test Infrastructure** | 51 tests (14 unit + 37 integration, mock-based) | `provisioning/extensions/providers/upcloud/tests/` |
+| **3. Runtime Templates** | 3+ Jinja2/Bash templates for core resources | `provisioning/extensions/providers/aws/templates/` |
+| **4. Nickel Validation** | Schemas pass `nickel typecheck` | `provisioning/extensions/providers/hetzner/nickel/` |
+
+### Execution Sequence
+
+```text
+Task 4 (5 min) ──────┐
+Task 1 (main) ───┐   ├──> Task 2 (tests)
+Task 3 (parallel)┘   │
+                     └──> Production Ready ✅
+```
+
+---
+
+## Nushell 0.109.0+ Core Rules
+
+These rules are **mandatory** for all provider Nushell code:
+
+### Rule 1: Module System & Imports
+```text
+use mod.nu
+use api.nu
+use servers.nu
+```
+
+### Rule 2: Function Signatures
+```text
+def function_name [param: type, optional: type = default] { }
+```
+
+### Rule 3: Return Early, Fail Fast
+```text
+def operation [resource: record] {
+    if ($resource | get -o id | is-empty) {
+        error make {msg: "Resource ID required"}
+    }
+}
+```
+
+### Rule 4: Modern Error Handling (CRITICAL)
+
+**❌ FORBIDDEN** - Deprecated try-catch:
+```text
+try {
+    ^external_command
+} catch {|err|
+    print $"Error: ($err.msg)"
+}
+```
+
+**✅ REQUIRED** - Modern do/complete pattern:
+```text
+let result = (do { ^external_command } | complete)
+
+if $result.exit_code != 0 {
+    error make {msg: $"Command failed: ($result.stderr)"}
+}
+
+$result.stdout
+```
+
+### Rule 5: Atomic Operations
+All operations must fully succeed or fully fail. No partial state changes.
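+
+To make Rule 5 concrete, here is a minimal sketch of an all-or-nothing operation. The resource names and `hcloud` invocations are illustrative assumptions, not the API of any shipped provider:
+
+```text
+# Illustrative all-or-nothing wrapper (not a shipped provider API)
+def create_server_with_volume [config: record] {
+    # Validate everything up front so we fail before touching the API
+    if ($config | get -o name | is-empty) {
+        error make {msg: "Server name required"}
+    }
+
+    let server = (do { ^hcloud server create --name $config.name } | complete)
+    if $server.exit_code != 0 {
+        error make {msg: $"Server creation failed: ($server.stderr)"}
+    }
+
+    let volume = (do { ^hcloud volume create --name $"($config.name)-data" --size 10 } | complete)
+    if $volume.exit_code != 0 {
+        # Roll back the first step so no partial state survives
+        do { ^hcloud server delete $config.name } | complete | ignore
+        error make {msg: $"Volume creation failed, server rolled back: ($volume.stderr)"}
+    }
+
+    {server: $config.name, volume: $"($config.name)-data"}
+}
+```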
+
+### Rule 12: Structured Error Returns
+```text
+def check [value: any] {
+    error make {
+        msg: "Human-readable message",
+        label: {text: "Error context", span: (metadata $value).span}
+    }
+}
+```
+
+### Critical Violations (INSTANT FAIL)
+
+❌ **FORBIDDEN**:
+- `try { } catch { }` blocks
+- `let mut variable = value` (mutable state)
+- `error make {msg: "Not implemented"}` (stubs)
+- Empty function bodies returning ok
+- Deprecated error patterns
+
+---
+
+## Nickel IaC: Four-File Pattern
+
+All Nickel schemas follow this pattern:
+
+### contracts.ncl: Type Definitions
+
+```text
+{
+  Server = {
+    id | String,
+    name | String,
+    instance_type | String,
+    zone | String,
+  },
+
+  Volume = {
+    id | String,
+    name | String,
+    size | Number,
+    type | String,
+  }
+}
+```
+
+### defaults.ncl: Default Values
+
+```text
+{
+  Server = {
+    instance_type = "t3.micro",
+    zone = "us-east-1a",
+  },
+
+  Volume = {
+    size = 20,
+    type = "gp3",
+  }
+}
+```
+
+### main.ncl: Public API
+
+```text
+let contracts = import "contracts.ncl" in
+let defaults = import "defaults.ncl" in
+
+{
+  make_server = fun config => defaults.Server & config,
+  make_volume = fun config => defaults.Volume & config,
+}
+```
+
+### version.ncl: Version Tracking
+
+```text
+{
+  provider_version = "1.0.0",
+  cli_tools = {
+    hcloud = "1.47.0+",
+  },
+  nickel_version = "1.7.0+",
+}
+```
+
+**Validation**:
+```text
+nickel typecheck nickel/contracts.ncl
+nickel typecheck nickel/defaults.ncl
+nickel typecheck nickel/main.ncl
+nickel typecheck nickel/version.ncl
+nickel export nickel/main.ncl
+```
+
+---
+
+## Task 1: Nushell Compliance
+
+### Identify Violations
+
+```text
+cd provisioning/extensions/providers/{PROVIDER}
+
+grep -r "try {" nulib/ --include="*.nu" | wc -l
+grep -r "let mut " nulib/ --include="*.nu" | wc -l
+grep -r "not implemented" nulib/ --include="*.nu" | wc -l
+```
+
+All three commands should return `0`.
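+
+If you want a single pass/fail gate instead of three manual checks, one possible aggregation is sketched below (paths as above; the pattern list and error message are assumptions of this guide, not a shipped tool):
+
+```text
+# Assumed pattern list, taken from the rules above
+let violations = (
+    ["try {" "let mut " "not implemented"]
+    | each {|pattern|
+        let result = (do { ^grep -r $pattern nulib/ "--include=*.nu" } | complete)
+        $result.stdout | lines | length
+    }
+    | math sum
+)
+
+if $violations > 0 {
+    error make {msg: $"($violations) deprecated patterns remain"}
+}
+```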
+
+### Fix Mutable Loops: Accumulation Pattern
+
+```text
+def retry_with_backoff [
+    closure: closure,
+    max_attempts: int
+]: nothing -> any {
+    let result = (
+        0..$max_attempts | reduce --fold {
+            success: false,
+            value: null,
+            delay: 100ms
+        } {|attempt, acc|
+            if $acc.success {
+                $acc
+            } else {
+                let op_result = (do $closure | complete)
+
+                if $op_result.exit_code == 0 {
+                    {success: true, value: $op_result.stdout, delay: $acc.delay}
+                } else if $attempt >= ($max_attempts - 1) {
+                    $acc
+                } else {
+                    sleep $acc.delay
+                    {success: false, value: null, delay: ($acc.delay * 2)}
+                }
+            }
+        }
+    )
+
+    if $result.success {
+        $result.value
+    } else {
+        error make {msg: $"Failed after ($max_attempts) attempts"}
+    }
+}
+```
+
+### Fix Mutable Loops: Recursive Pattern
+
+```text
+def _wait_for_state [
+    resource_id: string,
+    target_state: string,
+    timeout_sec: int,
+    elapsed: int = 0,
+    interval: int = 2
+]: nothing -> bool {
+    let current = (^aws ec2 describe-volumes
+        --volume-ids $resource_id
+        --query "Volumes[0].State"
+        --output text)
+
+    if ($current | str contains $target_state) {
+        true
+    } else if $elapsed > $timeout_sec {
+        false
+    } else {
+        sleep ($"($interval)sec" | into duration)
+        _wait_for_state $resource_id $target_state $timeout_sec ($elapsed + $interval) $interval
+    }
+}
+```
+
+### Fix Error Handling
+
+```text
+def create_server [config: record] {
+    if ($config | get -o name | is-empty) {
+        error make {msg: "Server name required"}
+    }
+
+    let api_result = (do {
+        (^hcloud server create
+            --name $config.name
+            --type $config.instance_type
+            --format json)
+    } | complete)
+
+    if $api_result.exit_code != 0 {
+        error make {msg: $"Server creation failed: ($api_result.stderr)"}
+    }
+
+    let response = ($api_result.stdout | from json)
+    {
+        id: $response.server.id,
+        name: $response.server.name,
+        status: "created"
+    }
+}
+```
+
+### Validation
+
+```text
+cd provisioning/extensions/providers/{PROVIDER}
+
+for file in nulib/*/*.nu; do
+    nu --ide-check 100 "$file" 2>&1 | grep -i error && exit 1
+done
+
+nu -c "use nulib/{provider}/mod.nu; print 'OK'"
+
+echo "✅ Nushell compliance complete"
+```
+
+---
+
+## Task 2: Test Infrastructure
+
+### Directory Structure
+
+```text
+tests/
+├── mocks/
+│   └── mock_api_responses.json
+├── unit/
+│   └── test_utils.nu
+├── integration/
+│   ├── test_api_client.nu
+│   ├── test_server_lifecycle.nu
+│   └── test_pricing_cache.nu
+└── run_{provider}_tests.nu
+```
+
+### Mock API Responses
+
+```text
+{
+  "list_servers": {
+    "servers": [
+      {
+        "id": "srv-123",
+        "name": "test-server",
+        "status": "running"
+      }
+    ]
+  },
+  "error_401": {
+    "error": {"message": "Unauthorized", "code": 401}
+  },
+  "error_429": {
+    "error": {"message": "Rate limited", "code": 429}
+  }
+}
+```
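+
+Tests can read these fixtures instead of calling the provider API. A minimal sketch of that pattern (the helper name and fixture path are assumptions matching the layout above):
+
+```text
+# Illustrative helper: load one fixture from the mock file
+def load_mock [key: string] {
+    open tests/mocks/mock_api_responses.json | get $key
+}
+
+# Example: a mock-based check of the list_servers response shape
+let servers = (load_mock "list_servers" | get servers)
+if ($servers | length) == 0 {
+    error make {msg: "Expected at least one mock server"}
+}
+```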
"Instance type valid" (($valid_types | contains ["t3.micro"]))) + let test2 = (test-result "Instance type invalid" (($valid_types | contains [$invalid]) == false)) + + $test1 and $test2 +} + +def test-validate-zone [] { + let valid_zones = ["us-east-1a" "us-east-1b" "eu-west-1a"] + let invalid = "invalid-zone" + + let test1 = (test-result "Zone valid" (($valid_zones | contains ["us-east-1a"]))) + let test2 = (test-result "Zone invalid" (($valid_zones | contains [$invalid]) == false)) + + $test1 and $test2 +} + +def test-validate-volume-id [] { + let valid = "vol-12345678" + let invalid = "invalid-vol" + + let test1 = (test-result "Volume ID valid" ($valid | str contains "vol-")) + let test2 = (test-result "Volume ID invalid" (($invalid | str contains "vol-") == false)) + + $test1 and $test2 +} + +def test-validate-volume-state [] { + let valid_states = ["available" "in-use" "creating"] + let invalid = "pending" + + let test1 = (test-result "Volume state valid" (($valid_states | contains ["available"]))) + let test2 = (test-result "Volume state invalid" (($valid_states | contains [$invalid]) == false)) + + $test1 and $test2 +} + +def test-validate-cidr [] { + let valid = "10.0.0.0/16" + let invalid = "10.0.0.1" + + let test1 = (test-result "CIDR valid" ($valid | str contains "/")) + let test2 = (test-result "CIDR invalid" (($invalid | str contains "/") == false)) + + $test1 and $test2 +} + +def test-validate-volume-type [] { + let valid_types = ["gp2" "gp3" "io1" "io2"] + let invalid = "invalid-type" + + let test1 = (test-result "Volume type valid" (($valid_types | contains ["gp3"]))) + let test2 = (test-result "Volume type invalid" (($valid_types | contains [$invalid]) == false)) + + $test1 and $test2 +} + +def test-validate-timestamp [] { + let valid = "2025-01-07T10:00:00.000Z" + let invalid = "not-a-timestamp" + + let test1 = (test-result "Timestamp valid" ($valid | str contains "T" and $valid | str contains "Z")) + let test2 = (test-result "Timestamp invalid" (($invalid | str contains "T") == false)) + + $test1 and $test2 +} + +def test-validate-server-state [] { + let valid_states = ["running" "stopped" "pending"] + let invalid = "hibernating" + + let test1 = (test-result "Server state valid" (($valid_states | contains ["running"]))) + let test2 = (test-result "Server state invalid" (($valid_states | contains [$invalid]) == false)) + + $test1 and $test2 +} + +def test-validate-security-group [] { + let valid = "sg-12345678" + let invalid = "invalid-sg" + + let test1 = (test-result "Security group valid" ($valid | str contains "sg-")) + let test2 = (test-result "Security group invalid" (($invalid | str contains "sg-") == false)) + + $test1 and $test2 +} + +def test-validate-memory [] { + let valid_mems = ["512 MB" "1 GB" "2 GB" "4 GB"] + let invalid = "0 GB" + + let test1 = (test-result "Memory valid" (($valid_mems | contains ["1 GB"]))) + let test2 = (test-result "Memory invalid" (($valid_mems | contains [$invalid]) == false)) + + $test1 and $test2 +} + +def test-validate-vcpu [] { + let valid_cpus = [1, 2, 4, 8, 16] + let invalid = 0 + + let test1 = (test-result "vCPU valid" (($valid_cpus | contains [1]))) + let test2 = (test-result "vCPU invalid" (($valid_cpus | contains [$invalid]) == false)) + + $test1 and $test2 +} + +def main [] { + print "=== Unit Tests ===" + print "" + + let results = [ + (test-validate-instance-id), + (test-validate-ipv4), + (test-validate-instance-type), + (test-validate-zone), + (test-validate-volume-id), + (test-validate-volume-state), + 
+
+### Integration Tests: 37 Tests across 3 Modules
+
+**Module 1: test_api_client.nu** (13 tests)
+- Response structure validation
+- Error handling for 401, 404, 429
+- Resource listing operations
+- Pricing data validation
+
+**Module 2: test_server_lifecycle.nu** (12 tests)
+- Server creation, listing, state
+- Instance type and zone info
+- Storage and security attachment
+- Server state transitions
+
+**Module 3: test_pricing_cache.nu** (12 tests)
+- Pricing data structure validation
+- On-demand vs reserved pricing
+- Cost calculations
+- Volume pricing operations
+
+### Test Orchestrator
+
+```text
+def main [] {
+    print "=== Provider Test Suite ==="
+
+    let unit_result = (nu tests/unit/test_utils.nu)
+    let api_result = (nu tests/integration/test_api_client.nu)
+    let lifecycle_result = (nu tests/integration/test_server_lifecycle.nu)
+    let pricing_result = (nu tests/integration/test_pricing_cache.nu)
+
+    let total_passed = (
+        $unit_result.passed +
+        $api_result.passed +
+        $lifecycle_result.passed +
+        $pricing_result.passed
+    )
+
+    let total_failed = (
+        $unit_result.failed +
+        $api_result.failed +
+        $lifecycle_result.failed +
+        $pricing_result.failed
+    )
+
+    print $"Results: ($total_passed) passed, ($total_failed) failed"
+
+    {
+        passed: $total_passed,
+        failed: $total_failed,
+        success: ($total_failed == 0)
+    }
+}
+
+let result = (main)
+exit (if $result.success {0} else {1})
+```
+
+### Validation
+
+```text
+cd provisioning/extensions/providers/{PROVIDER}
+nu tests/run_{provider}_tests.nu
+```
+
+Expected: 51 tests passing, exit code 0
+
+---
+
+## Task 3: Runtime Templates
+
+### Directory Structure
+
+```text
+templates/
+├── {provider}_servers.j2
+├── {provider}_networks.j2
+└── {provider}_volumes.j2
+```
+
+### Template Example
+
+```jinja2
+#!/bin/bash
+# {{ provider_name }} Server Provisioning
+set -e
+{% if debug %}set -x{% endif %}
+
+{%- for server in servers %}
+  {%- if server.name %}
+
+echo "Creating server: {{ server.name }}"
+
+{%- if server.instance_type %}
+INSTANCE_TYPE="{{ server.instance_type }}"
+{%- else %}
+INSTANCE_TYPE="t3.micro"
+{%- endif %}
+
+SERVER_ID=$(hcloud server create \
+    --name "{{ server.name }}" \
+    --type $INSTANCE_TYPE \
+    --query 'id' \
+    --output text 2>/dev/null)
+
+if [ -z "$SERVER_ID" ]; then
+    echo "Failed to create server {{ server.name }}"
+    exit 1
+fi
+
+echo "✓ Server {{ server.name }} created: $SERVER_ID"
+
+  {%- endif %}
+{%- endfor %}
+
+echo "Server provisioning complete"
+```
+
+### Validation
+
+```text
+cd provisioning/extensions/providers/{PROVIDER}
+
+for template in templates/*.j2; do
+    bash -n <(sed 's/{%.*%}//' "$template" | sed 's/{{.*}}/x/g')
+done
+
+echo "✅ Templates valid"
+```
+
+---
+
+## Task 4: Nickel Schema Validation
+
+```text
+cd provisioning/extensions/providers/{PROVIDER}
+
+nickel typecheck nickel/contracts.ncl || exit 1
+nickel typecheck nickel/defaults.ncl || exit 1
+nickel typecheck nickel/main.ncl || exit 1
+nickel typecheck nickel/version.ncl || exit 1
+
+nickel export nickel/main.ncl || exit 1
+
+echo "✅ Nickel schemas validated"
+```
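+
+Beyond typechecking, a quick smoke evaluation of the public API can catch merge errors early. A minimal sketch (the file name and field values are illustrative only):
+
+```text
+# smoke.ncl — evaluate one record through the public API (illustrative)
+let api = import "nickel/main.ncl" in
+api.make_server { id = "srv-000", name = "smoke-test" }
+```
+
+```text
+nickel export smoke.ncl
+```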
validated" +``` + +--- + +## Complete Validation Script + +```text +#!/bin/bash +set -e + +PROVIDER="hetzner" +PROV="provisioning/extensions/providers/$PROVIDER" + +echo "=== Provider Completeness Check: $PROVIDER ===" + +echo "" +echo "✓ Tarea 4: Validating Nickel..." +nickel typecheck "$PROV/nickel/main.ncl" + +echo "✓ Tarea 1: Checking Nushell..." +[ $(grep -r "try {" "$PROV/nulib" 2>/dev/null | wc -l) -eq 0 ] +[ $(grep -r "let mut " "$PROV/nulib" 2>/dev/null | wc -l) -eq 0 ] +echo " - No deprecated patterns ✓" + +echo "✓ Tarea 3: Validating templates..." +for f in "$PROV"/templates/*.j2; do + bash -n <(sed 's/{%.*%}//' "$f" | sed 's/{{.*}}/x/g') +done + +echo "✓ Tarea 2: Running tests..." +nu "$PROV/tests/run_${PROVIDER}_tests.nu" + +echo "" +echo "╔════════════════════════════════════════╗" +echo "║ ✅ ALL TASKS COMPLETE ║" +echo "║ PRODUCTION READY ║" +echo "╚════════════════════════════════════════╝" +``` + +--- + +## Reference Implementations + +- **Hetzner**: `provisioning/extensions/providers/hetzner/` +- **UpCloud**: `provisioning/extensions/providers/upcloud/` +- **AWS**: `provisioning/extensions/providers/aws/` + +Use these as templates for new providers. + +--- + +## Quick Start + +```text +cd provisioning/extensions/providers/{PROVIDER} + +# Validate completeness +nickel typecheck nickel/main.ncl && +[ $(grep -r "try {" nulib/ 2>/dev/null | wc -l) -eq 0 ] && +nu tests/run_{provider}_tests.nu && +for f in templates/*.j2; do bash -n <(sed 's/{%.*%}//' "$f"); done && +echo "✅ PRODUCTION READY" +``` \ No newline at end of file diff --git a/docs/src/development/providers/provider-distribution-guide.md b/docs/src/development/providers/provider-distribution-guide.md index 507f7ae..452e416 100644 --- a/docs/src/development/providers/provider-distribution-guide.md +++ b/docs/src/development/providers/provider-distribution-guide.md @@ -1 +1,681 @@ -# Provider Distribution Guide\n\n**Strategic Guide for Provider Management and Distribution**\n\nThis guide explains the two complementary approaches for managing providers in the provisioning system and when to use each.\n\n---\n\n## Table of Contents\n\n- [Overview](#overview)\n- [Module-Loader Approach](#module-loader-approach)\n- [Provider Packs Approach](#provider-packs-approach)\n- [Comparison Matrix](#comparison-matrix)\n- [Recommended Hybrid Workflow](#recommended-hybrid-workflow)\n- [Command Reference](#command-reference)\n- [Real-World Scenarios](#real-world-scenarios)\n- [Best Practices](#best-practices)\n\n---\n\n## Overview\n\nThe provisioning system supports **two complementary approaches** for provider management:\n\n1. **Module-Loader**: Symlink-based local development with dynamic discovery\n2. **Provider Packs**: Versioned, distributable artifacts for production\n\nBoth approaches work seamlessly together and serve different phases of the development lifecycle.\n\n---\n\n## Module-Loader Approach\n\n### Purpose\n\nFast, local development with direct access to provider source code.\n\n### How It Works\n\n```{$detected_lang}\n# Install provider for infrastructure (creates symlinks)\nprovisioning providers install upcloud wuji\n\n# Internal Process:\n# 1. Discovers provider in extensions/providers/upcloud/\n# 2. Creates symlink: workspace/infra/wuji/.nickel-modules/upcloud_prov -> extensions/providers/upcloud/nickel/\n# 3. Updates workspace/infra/wuji/manifest.toml with local path dependency\n# 4. 
Updates workspace/infra/wuji/providers.manifest.yaml\n```\n\n### Key Features\n\n✅ **Instant Changes**: Edit code in `extensions/providers/`, immediately available in infrastructure\n✅ **Auto-Discovery**: Automatically finds all providers in extensions/\n✅ **Simple Commands**: `providers install/remove/list/validate`\n✅ **Easy Debugging**: Direct access to source code\n✅ **No Packaging**: Skip build/package step during development\n\n### Best Use Cases\n\n- 🔧 **Active Development**: Writing new provider features\n- 🧪 **Testing**: Rapid iteration and testing cycles\n- 🏠 **Local Infrastructure**: Single machine or small team\n- 📝 **Debugging**: Need to modify and test provider code\n- 🎓 **Learning**: Understanding how providers work\n\n### Example Workflow\n\n```{$detected_lang}\n# 1. List available providers\nprovisioning providers list\n\n# 2. Install provider for infrastructure\nprovisioning providers install upcloud wuji\n\n# 3. Verify installation\nprovisioning providers validate wuji\n\n# 4. Edit provider code\nvim extensions/providers/upcloud/nickel/server_upcloud.ncl\n\n# 5. Test changes immediately (no repackaging!)\ncd workspace/infra/wuji\nnickel export main.ncl\n\n# 6. Remove when done\nprovisioning providers remove upcloud wuji\n```\n\n### File Structure\n\n```{$detected_lang}\nextensions/providers/upcloud/\n├── nickel/\n│ ├── manifest.toml\n│ ├── server_upcloud.ncl\n│ └── network_upcloud.ncl\n└── README.md\n\nworkspace/infra/wuji/\n├── .nickel-modules/\n│ └── upcloud_prov -> ../../../../extensions/providers/upcloud/nickel/ # Symlink\n├── manifest.toml # Updated with local path dependency\n├── providers.manifest.yaml # Tracks installed providers\n└── schemas/\n └── servers.ncl\n```\n\n---\n\n## Provider Packs Approach\n\n### Purpose\n\nCreate versioned, distributable artifacts for production deployments and team collaboration.\n\n### How It Works\n\n```{$detected_lang}\n# Package providers into distributable artifacts\nexport PROVISIONING=/Users/Akasha/project-provisioning/provisioning\n./provisioning/core/cli/pack providers\n\n# Internal Process:\n# 1. Enters each provider's nickel/ directory\n# 2. Runs: nickel export . --format json (generates JSON for distribution)\n# 3. Creates: upcloud_prov_0.0.1.tar\n# 4. Generates metadata: distribution/registry/upcloud_prov.json\n```\n\n### Key Features\n\n✅ **Versioned Artifacts**: Immutable, reproducible packages\n✅ **Portable**: Share across teams and environments\n✅ **Registry Publishing**: Push to artifact registries\n✅ **Metadata**: Version, maintainer, license information\n✅ **Production-Ready**: What you package is what you deploy\n\n### Best Use Cases\n\n- 🚀 **Production Deployments**: Stable, tested provider versions\n- 📦 **Distribution**: Share across teams or organizations\n- 🔄 **CI/CD Pipelines**: Automated build and deploy\n- 📊 **Version Control**: Track provider versions explicitly\n- 🌐 **Registry Publishing**: Publish to artifact registries\n- 🔒 **Compliance**: Immutable artifacts for auditing\n\n### Example Workflow\n\n```{$detected_lang}\n# Set environment variable\nexport PROVISIONING=/Users/Akasha/project-provisioning/provisioning\n\n# 1. Package all providers\n./provisioning/core/cli/pack providers\n\n# Output:\n# ✅ Creates: distribution/packages/upcloud_prov_0.0.1.tar\n# ✅ Creates: distribution/packages/aws_prov_0.0.1.tar\n# ✅ Creates: distribution/packages/local_prov_0.0.1.tar\n# ✅ Metadata: distribution/registry/*.json\n\n# 2. List packaged modules\n./provisioning/core/cli/pack list\n\n# 3. 
Package only core schemas\n./provisioning/core/cli/pack core\n\n# 4. Clean old packages (keep latest 3 versions)\n./provisioning/core/cli/pack clean --keep-latest 3\n\n# 5. Upload to registry (your implementation)\n# rsync distribution/packages/*.tar repo.jesusperez.pro:/registry/\n```\n\n### File Structure\n\n```{$detected_lang}\nprovisioning/\n├── distribution/\n│ ├── packages/\n│ │ ├── provisioning_0.0.1.tar # Core schemas\n│ │ ├── upcloud_prov_0.0.1.tar # Provider packages\n│ │ ├── aws_prov_0.0.1.tar\n│ │ └── local_prov_0.0.1.tar\n│ └── registry/\n│ ├── provisioning_core.json # Metadata\n│ ├── upcloud_prov.json\n│ ├── aws_prov.json\n│ └── local_prov.json\n└── extensions/providers/ # Source code\n```\n\n### Package Metadata Example\n\n```{$detected_lang}\n{\n "name": "upcloud_prov",\n "version": "0.0.1",\n "package_file": "/path/to/upcloud_prov_0.0.1.tar",\n "created": "2025-09-29 20:47:21",\n "maintainer": "JesusPerezLorenzo",\n "repository": "https://repo.jesusperez.pro/provisioning",\n "license": "MIT",\n "homepage": "https://github.com/jesusperezlorenzo/provisioning"\n}\n```\n\n---\n\n## Comparison Matrix\n\n| Feature | Module-Loader | Provider Packs |\n| --------- | -------------- | ---------------- |\n| **Speed** | ⚡ Instant (symlinks) | 📦 Requires packaging |\n| **Versioning** | ❌ No explicit versions | ✅ Semantic versioning |\n| **Portability** | ❌ Local filesystem only | ✅ Distributable archives |\n| **Development** | ✅ Excellent (live reload) | ⚠️ Need repackage cycle |\n| **Production** | ⚠️ Mutable source | ✅ Immutable artifacts |\n| **Discovery** | ✅ Auto-discovery | ⚠️ Manual tracking |\n| **Team Sharing** | ⚠️ Git repository only | ✅ Registry + Git |\n| **Debugging** | ✅ Direct source access | ❌ Need to unpack |\n| **Rollback** | ⚠️ Git revert | ✅ Version pinning |\n| **Compliance** | ❌ Hard to audit | ✅ Signed artifacts |\n| **Setup Time** | ⚡ Seconds | ⏱️ Minutes |\n| **CI/CD** | ⚠️ Not ideal | ✅ Perfect |\n\n---\n\n## Recommended Hybrid Workflow\n\n### Development Phase\n\n```{$detected_lang}\n# 1. Start with module-loader for development\nprovisioning providers list\nprovisioning providers install upcloud wuji\n\n# 2. Develop and iterate quickly\nvim extensions/providers/upcloud/nickel/server_upcloud.ncl\n# Test immediately - no packaging needed\n\n# 3. Validate before release\nprovisioning providers validate wuji\nnickel export workspace/infra/wuji/main.ncl\n```\n\n### Release Phase\n\n```{$detected_lang}\n# 4. Create release packages\nexport PROVISIONING=/Users/Akasha/project-provisioning/provisioning\n./provisioning/core/cli/pack providers\n\n# 5. Verify packages\n./provisioning/core/cli/pack list\n\n# 6. Tag release\ngit tag v0.0.2\ngit push origin v0.0.2\n\n# 7. Publish to registry (your workflow)\nrsync distribution/packages/*.tar user@repo.jesusperez.pro:/registry/v0.0.2/\n```\n\n### Production Deployment\n\n```{$detected_lang}\n# 8. Download specific version from registry\nwget https://repo.jesusperez.pro/registry/v0.0.2/upcloud_prov_0.0.2.tar\n\n# 9. Extract and install\ntar -xf upcloud_prov_0.0.2.tar -C infrastructure/providers/\n\n# 10. 
Use in production infrastructure\n# (Configure manifest.toml to point to extracted package)\n```\n\n---\n\n## Command Reference\n\n### Module-Loader Commands\n\n```{$detected_lang}\n# List all available providers\nprovisioning providers list [--kcl] [--format table|json|yaml]\n\n# Show provider information\nprovisioning providers info [--kcl]\n\n# Install provider for infrastructure\nprovisioning providers install [--version 0.0.1]\n\n# Remove provider from infrastructure\nprovisioning providers remove [--force]\n\n# List installed providers\nprovisioning providers installed [--format table|json|yaml]\n\n# Validate provider installation\nprovisioning providers validate \n\n# Sync KCL dependencies\n./provisioning/core/cli/module-loader sync-kcl \n```\n\n### Provider Pack Commands\n\n```{$detected_lang}\n# Set environment variable (required)\nexport PROVISIONING=/path/to/provisioning\n\n# Package core provisioning schemas\n./provisioning/core/cli/pack core [--output dir] [--version 0.0.1]\n\n# Package single provider\n./provisioning/core/cli/pack provider [--output dir] [--version 0.0.1]\n\n# Package all providers\n./provisioning/core/cli/pack providers [--output dir]\n\n# List all packages\n./provisioning/core/cli/pack list [--format table|json|yaml]\n\n# Clean old packages\n./provisioning/core/cli/pack clean [--keep-latest 3] [--dry-run]\n```\n\n---\n\n## Real-World Scenarios\n\n### Scenario 1: Solo Developer - Local Infrastructure\n\n**Situation**: Working alone on local infrastructure projects\n\n**Recommendation**: Module-Loader only\n\n```{$detected_lang}\n# Simple and fast\nproviders install upcloud homelab\nproviders install aws cloud-backup\n# Edit and test freely\n```\n\n**Why**: No need for versioning, packaging overhead unnecessary.\n\n---\n\n### Scenario 2: Small Team - Shared Development\n\n**Situation**: 2-5 developers sharing code via Git\n\n**Recommendation**: Module-Loader + Git\n\n```{$detected_lang}\n# Each developer\ngit clone repo\nproviders install upcloud project-x\n# Make changes, commit to Git\ngit commit -m "Add upcloud GPU support"\ngit push\n# Others pull changes\ngit pull\n# Changes immediately available via symlinks\n```\n\n**Why**: Git provides version control, symlinks provide instant updates.\n\n---\n\n### Scenario 3: Medium Team - Multiple Projects\n\n**Situation**: 10+ developers, multiple infrastructure projects\n\n**Recommendation**: Hybrid (Module-Loader dev + Provider Packs releases)\n\n```{$detected_lang}\n# Development (team member)\nproviders install upcloud staging-env\n# Make changes...\n\n# Release (release engineer)\npack providers # Create v0.2.0\ngit tag v0.2.0\n# Upload to internal registry\n\n# Other projects\n# Download upcloud_prov_0.2.0.tar\n# Use stable, tested version\n```\n\n**Why**: Developers iterate fast, other teams use stable versions.\n\n---\n\n### Scenario 4: Enterprise - Production Infrastructure\n\n**Situation**: Critical production systems, compliance requirements\n\n**Recommendation**: Provider Packs only\n\n```{$detected_lang}\n# CI/CD Pipeline\npack providers # Build artifacts\n# Run tests on packages\n# Sign packages\n# Publish to artifact registry\n\n# Production Deployment\n# Download signed upcloud_prov_1.0.0.tar\n# Verify signature\n# Deploy immutable artifact\n# Document exact versions for compliance\n```\n\n**Why**: Immutability, auditability, and rollback capabilities required.\n\n---\n\n### Scenario 5: Open Source - Public Distribution\n\n**Situation**: Sharing providers with community\n\n**Recommendation**: 
Provider Packs + Registry\n\n```{$detected_lang}\n# Maintainer\npack providers\n# Create release on GitHub\ngh release create v1.0.0 distribution/packages/*.tar\n\n# Community User\n# Download from GitHub releases\nwget https://github.com/project/releases/v1.0.0/upcloud_prov_1.0.0.tar\n# Extract and use\n```\n\n**Why**: Easy distribution, versioning, and downloading for users.\n\n---\n\n## Best Practices\n\n### For Development\n\n1. **Use Module-Loader by default**\n - Fast iteration is crucial during development\n - Symlinks allow immediate testing\n\n2. **Keep providers.manifest.yaml in Git**\n - Documents which providers are used\n - Team members can sync easily\n\n3. **Validate before committing**\n\n ```bash\n providers validate wuji\n nickel eval defs/servers.ncl\n ```\n\n### For Releases\n\n1. **Version Everything**\n - Use semantic versioning (0.1.0, 0.2.0, 1.0.0)\n - Update version in kcl.mod before packing\n\n2. **Create Packs for Releases**\n\n ```bash\n pack providers --version 0.2.0\n git tag v0.2.0\n ```\n\n3. **Test Packs Before Publishing**\n - Extract and test packages\n - Verify metadata is correct\n\n### For Production\n\n1. **Pin Versions**\n - Use exact versions in production kcl.mod\n - Never use "latest" or symlinks\n\n2. **Maintain Artifact Registry**\n - Store all production versions\n - Keep old versions for rollback\n\n3. **Document Deployments**\n - Record which versions deployed when\n - Maintain change log\n\n### For CI/CD\n\n1. **Automate Pack Creation**\n\n ```yaml\n # .github/workflows/release.yml\n - name: Pack Providers\n run: |\n export PROVISIONING=$GITHUB_WORKSPACE/provisioning\n ./provisioning/core/cli/pack providers\n ```\n\n2. **Run Tests on Packs**\n - Extract packages\n - Run validation tests\n - Ensure they work in isolation\n\n3. **Publish Automatically**\n - Upload to artifact registry on tag\n - Update package index\n\n---\n\n## Migration Path\n\n### From Module-Loader to Packs\n\nWhen you're ready to move to production:\n\n```{$detected_lang}\n# 1. Clean up development setup\nproviders remove upcloud wuji\n\n# 2. Create release pack\npack providers --version 1.0.0\n\n# 3. Extract pack in infrastructure\ncd workspace/infra/wuji\ntar -xf ../../../distribution/packages/upcloud_prov_1.0.0.tar vendor/\n\n# 4. Update kcl.mod to use vendored path\n# Change from: upcloud_prov = { path = "./.kcl-modules/upcloud_prov" }\n# To: upcloud_prov = { path = "./vendor/upcloud_prov", version = "1.0.0" }\n\n# 5. Test\nnickel eval defs/servers.ncl\n```\n\n### From Packs Back to Module-Loader\n\nWhen you need to debug or develop:\n\n```{$detected_lang}\n# 1. Remove vendored version\nrm -rf workspace/infra/wuji/vendor/upcloud_prov\n\n# 2. Install via module-loader\nproviders install upcloud wuji\n\n# 3. Make changes in extensions/providers/upcloud/kcl/\n\n# 4. 
Test immediately\ncd workspace/infra/wuji\nnickel eval defs/servers.ncl\n```\n\n---\n\n## Configuration\n\n### Environment Variables\n\n```{$detected_lang}\n# Required for pack commands\nexport PROVISIONING=/path/to/provisioning\n\n# Alternative\nexport PROVISIONING_CONFIG=/path/to/provisioning\n```\n\n### Config Files\n\nDistribution settings in `provisioning/config/config.defaults.toml`:\n\n```{$detected_lang}\n[distribution]\npack_path = "{{paths.base}}/distribution/packages"\nregistry_path = "{{paths.base}}/distribution/registry"\ncache_path = "{{paths.base}}/distribution/cache"\nregistry_type = "local"\n\n[distribution.metadata]\nmaintainer = "JesusPerezLorenzo"\nrepository = "https://repo.jesusperez.pro/provisioning"\nlicense = "MIT"\nhomepage = "https://github.com/jesusperezlorenzo/provisioning"\n\n[kcl]\ncore_module = "{{paths.base}}/kcl"\ncore_version = "0.0.1"\ncore_package_name = "provisioning_core"\nuse_module_loader = true\nmodules_dir = ".kcl-modules"\n```\n\n---\n\n## Troubleshooting\n\n### Module-Loader Issues\n\n**Problem**: Provider not found after install\n\n```{$detected_lang}\n# Check provider exists\nproviders list | grep upcloud\n\n# Validate installation\nproviders validate wuji\n\n# Check symlink\nls -la workspace/infra/wuji/.kcl-modules/\n```\n\n**Problem**: Changes not reflected\n\n```{$detected_lang}\n# Verify symlink is correct\nreadlink workspace/infra/wuji/.kcl-modules/upcloud_prov\n\n# Should point to extensions/providers/upcloud/kcl/\n```\n\n### Provider Pack Issues\n\n**Problem**: No .tar file created\n\n```{$detected_lang}\n# Check KCL version (need 0.11.3+)\nkcl version\n\n# Check kcl.mod exists\nls extensions/providers/upcloud/kcl/kcl.mod\n```\n\n**Problem**: PROVISIONING environment variable not set\n\n```{$detected_lang}\n# Set it\nexport PROVISIONING=/Users/Akasha/project-provisioning/provisioning\n\n# Or add to shell profile\necho 'export PROVISIONING=/path/to/provisioning' >> ~/.zshrc\n```\n\n---\n\n## Conclusion\n\n**Both approaches are valuable and complementary:**\n\n- **Module-Loader**: Development velocity, rapid iteration\n- **Provider Packs**: Production stability, version control\n\n**Default Strategy:**\n\n- Use **Module-Loader** for day-to-day development\n- Create **Provider Packs** for releases and production\n- Both systems work seamlessly together\n\n**The system is designed for flexibility** - choose the right tool for your current phase of work!\n\n---\n\n## Additional Resources\n\n- [Module-Loader Implementation](../provisioning/core/nulib/lib_provisioning/kcl_module_loader.nu)\n- [KCL Packaging Implementation](../provisioning/core/nulib/lib_provisioning/kcl_packaging.nu)\n- [Providers CLI](.provisioning providers)\n- [Pack CLI](../provisioning/core/cli/pack)\n- [KCL Documentation](https://kcl-lang.io/)\n\n---\n\n**Document Version**: 1.0.0\n**Last Updated**: 2025-09-29\n**Maintained by**: JesusPerezLorenzo +# Provider Distribution Guide + +**Strategic Guide for Provider Management and Distribution** + +This guide explains the two complementary approaches for managing providers in the provisioning system and when to use each. 
+ +--- + +## Table of Contents + +- [Overview](#overview) +- [Module-Loader Approach](#module-loader-approach) +- [Provider Packs Approach](#provider-packs-approach) +- [Comparison Matrix](#comparison-matrix) +- [Recommended Hybrid Workflow](#recommended-hybrid-workflow) +- [Command Reference](#command-reference) +- [Real-World Scenarios](#real-world-scenarios) +- [Best Practices](#best-practices) + +--- + +## Overview + +The provisioning system supports **two complementary approaches** for provider management: + +1. **Module-Loader**: Symlink-based local development with dynamic discovery +2. **Provider Packs**: Versioned, distributable artifacts for production + +Both approaches work seamlessly together and serve different phases of the development lifecycle. + +--- + +## Module-Loader Approach + +### Purpose + +Fast, local development with direct access to provider source code. + +### How It Works + +```text +# Install provider for infrastructure (creates symlinks) +provisioning providers install upcloud wuji + +# Internal Process: +# 1. Discovers provider in extensions/providers/upcloud/ +# 2. Creates symlink: workspace/infra/wuji/.nickel-modules/upcloud_prov -> extensions/providers/upcloud/nickel/ +# 3. Updates workspace/infra/wuji/manifest.toml with local path dependency +# 4. Updates workspace/infra/wuji/providers.manifest.yaml +``` + +### Key Features + +✅ **Instant Changes**: Edit code in `extensions/providers/`, immediately available in infrastructure +✅ **Auto-Discovery**: Automatically finds all providers in extensions/ +✅ **Simple Commands**: `providers install/remove/list/validate` +✅ **Easy Debugging**: Direct access to source code +✅ **No Packaging**: Skip build/package step during development + +### Best Use Cases + +- 🔧 **Active Development**: Writing new provider features +- 🧪 **Testing**: Rapid iteration and testing cycles +- 🏠 **Local Infrastructure**: Single machine or small team +- 📝 **Debugging**: Need to modify and test provider code +- 🎓 **Learning**: Understanding how providers work + +### Example Workflow + +```text +# 1. List available providers +provisioning providers list + +# 2. Install provider for infrastructure +provisioning providers install upcloud wuji + +# 3. Verify installation +provisioning providers validate wuji + +# 4. Edit provider code +vim extensions/providers/upcloud/nickel/server_upcloud.ncl + +# 5. Test changes immediately (no repackaging!) +cd workspace/infra/wuji +nickel export main.ncl + +# 6. Remove when done +provisioning providers remove upcloud wuji +``` + +### File Structure + +```text +extensions/providers/upcloud/ +├── nickel/ +│ ├── manifest.toml +│ ├── server_upcloud.ncl +│ └── network_upcloud.ncl +└── README.md + +workspace/infra/wuji/ +├── .nickel-modules/ +│ └── upcloud_prov -> ../../../../extensions/providers/upcloud/nickel/ # Symlink +├── manifest.toml # Updated with local path dependency +├── providers.manifest.yaml # Tracks installed providers +└── schemas/ + └── servers.ncl +``` + +--- + +## Provider Packs Approach + +### Purpose + +Create versioned, distributable artifacts for production deployments and team collaboration. + +### How It Works + +```text +# Package providers into distributable artifacts +export PROVISIONING=/Users/Akasha/project-provisioning/provisioning +./provisioning/core/cli/pack providers + +# Internal Process: +# 1. Enters each provider's nickel/ directory +# 2. Runs: nickel export . --format json (generates JSON for distribution) +# 3. Creates: upcloud_prov_0.0.1.tar +# 4. 
Generates metadata: distribution/registry/upcloud_prov.json +``` + +### Key Features + +✅ **Versioned Artifacts**: Immutable, reproducible packages +✅ **Portable**: Share across teams and environments +✅ **Registry Publishing**: Push to artifact registries +✅ **Metadata**: Version, maintainer, license information +✅ **Production-Ready**: What you package is what you deploy + +### Best Use Cases + +- 🚀 **Production Deployments**: Stable, tested provider versions +- 📦 **Distribution**: Share across teams or organizations +- 🔄 **CI/CD Pipelines**: Automated build and deploy +- 📊 **Version Control**: Track provider versions explicitly +- 🌐 **Registry Publishing**: Publish to artifact registries +- 🔒 **Compliance**: Immutable artifacts for auditing + +### Example Workflow + +```text +# Set environment variable +export PROVISIONING=/Users/Akasha/project-provisioning/provisioning + +# 1. Package all providers +./provisioning/core/cli/pack providers + +# Output: +# ✅ Creates: distribution/packages/upcloud_prov_0.0.1.tar +# ✅ Creates: distribution/packages/aws_prov_0.0.1.tar +# ✅ Creates: distribution/packages/local_prov_0.0.1.tar +# ✅ Metadata: distribution/registry/*.json + +# 2. List packaged modules +./provisioning/core/cli/pack list + +# 3. Package only core schemas +./provisioning/core/cli/pack core + +# 4. Clean old packages (keep latest 3 versions) +./provisioning/core/cli/pack clean --keep-latest 3 + +# 5. Upload to registry (your implementation) +# rsync distribution/packages/*.tar repo.jesusperez.pro:/registry/ +``` + +### File Structure + +```text +provisioning/ +├── distribution/ +│ ├── packages/ +│ │ ├── provisioning_0.0.1.tar # Core schemas +│ │ ├── upcloud_prov_0.0.1.tar # Provider packages +│ │ ├── aws_prov_0.0.1.tar +│ │ └── local_prov_0.0.1.tar +│ └── registry/ +│ ├── provisioning_core.json # Metadata +│ ├── upcloud_prov.json +│ ├── aws_prov.json +│ └── local_prov.json +└── extensions/providers/ # Source code +``` + +### Package Metadata Example + +```text +{ + "name": "upcloud_prov", + "version": "0.0.1", + "package_file": "/path/to/upcloud_prov_0.0.1.tar", + "created": "2025-09-29 20:47:21", + "maintainer": "JesusPerezLorenzo", + "repository": "https://repo.jesusperez.pro/provisioning", + "license": "MIT", + "homepage": "https://github.com/jesusperezlorenzo/provisioning" +} +``` + +--- + +## Comparison Matrix + +| Feature | Module-Loader | Provider Packs | +| --------- | -------------- | ---------------- | +| **Speed** | ⚡ Instant (symlinks) | 📦 Requires packaging | +| **Versioning** | ❌ No explicit versions | ✅ Semantic versioning | +| **Portability** | ❌ Local filesystem only | ✅ Distributable archives | +| **Development** | ✅ Excellent (live reload) | ⚠️ Need repackage cycle | +| **Production** | ⚠️ Mutable source | ✅ Immutable artifacts | +| **Discovery** | ✅ Auto-discovery | ⚠️ Manual tracking | +| **Team Sharing** | ⚠️ Git repository only | ✅ Registry + Git | +| **Debugging** | ✅ Direct source access | ❌ Need to unpack | +| **Rollback** | ⚠️ Git revert | ✅ Version pinning | +| **Compliance** | ❌ Hard to audit | ✅ Signed artifacts | +| **Setup Time** | ⚡ Seconds | ⏱️ Minutes | +| **CI/CD** | ⚠️ Not ideal | ✅ Perfect | + +--- + +## Recommended Hybrid Workflow + +### Development Phase + +```text +# 1. Start with module-loader for development +provisioning providers list +provisioning providers install upcloud wuji + +# 2. Develop and iterate quickly +vim extensions/providers/upcloud/nickel/server_upcloud.ncl +# Test immediately - no packaging needed + +# 3. 
Validate before release
+provisioning providers validate wuji
+nickel export workspace/infra/wuji/main.ncl
+```
+
+### Release Phase
+
+```text
+# 4. Create release packages
+export PROVISIONING=/Users/Akasha/project-provisioning/provisioning
+./provisioning/core/cli/pack providers
+
+# 5. Verify packages
+./provisioning/core/cli/pack list
+
+# 6. Tag release
+git tag v0.0.2
+git push origin v0.0.2
+
+# 7. Publish to registry (your workflow)
+rsync distribution/packages/*.tar user@repo.jesusperez.pro:/registry/v0.0.2/
+```
+
+### Production Deployment
+
+```text
+# 8. Download specific version from registry
+wget https://repo.jesusperez.pro/registry/v0.0.2/upcloud_prov_0.0.2.tar
+
+# 9. Extract and install
+tar -xf upcloud_prov_0.0.2.tar -C infrastructure/providers/
+
+# 10. Use in production infrastructure
+# (Configure manifest.toml to point to extracted package)
+```
+
+---
+
+## Command Reference
+
+### Module-Loader Commands
+
+```text
+# List all available providers
+provisioning providers list [--kcl] [--format table|json|yaml]
+
+# Show provider information
+provisioning providers info <provider> [--kcl]
+
+# Install provider for infrastructure
+provisioning providers install <provider> <infra> [--version 0.0.1]
+
+# Remove provider from infrastructure
+provisioning providers remove <provider> <infra> [--force]
+
+# List installed providers
+provisioning providers installed [--format table|json|yaml]
+
+# Validate provider installation
+provisioning providers validate <infra>
+
+# Sync KCL dependencies
+./provisioning/core/cli/module-loader sync-kcl <infra>
+```
+
+### Provider Pack Commands
+
+```text
+# Set environment variable (required)
+export PROVISIONING=/path/to/provisioning
+
+# Package core provisioning schemas
+./provisioning/core/cli/pack core [--output dir] [--version 0.0.1]
+
+# Package single provider
+./provisioning/core/cli/pack provider <provider> [--output dir] [--version 0.0.1]
+
+# Package all providers
+./provisioning/core/cli/pack providers [--output dir]
+
+# List all packages
+./provisioning/core/cli/pack list [--format table|json|yaml]
+
+# Clean old packages
+./provisioning/core/cli/pack clean [--keep-latest 3] [--dry-run]
+```
+
+---
+
+## Real-World Scenarios
+
+### Scenario 1: Solo Developer - Local Infrastructure
+
+**Situation**: Working alone on local infrastructure projects
+
+**Recommendation**: Module-Loader only
+
+```text
+# Simple and fast
+providers install upcloud homelab
+providers install aws cloud-backup
+# Edit and test freely
+```
+
+**Why**: No need for versioning, packaging overhead unnecessary.
+
+---
+
+### Scenario 2: Small Team - Shared Development
+
+**Situation**: 2-5 developers sharing code via Git
+
+**Recommendation**: Module-Loader + Git
+
+```text
+# Each developer
+git clone repo
+providers install upcloud project-x
+# Make changes, commit to Git
+git commit -m "Add upcloud GPU support"
+git push
+# Others pull changes
+git pull
+# Changes immediately available via symlinks
+```
+
+**Why**: Git provides version control, symlinks provide instant updates.
+
+---
+
+### Scenario 3: Medium Team - Multiple Projects
+
+**Situation**: 10+ developers, multiple infrastructure projects
+
+**Recommendation**: Hybrid (Module-Loader dev + Provider Packs releases)
+
+```text
+# Development (team member)
+providers install upcloud staging-env
+# Make changes...
+ +# Release (release engineer) +pack providers # Create v0.2.0 +git tag v0.2.0 +# Upload to internal registry + +# Other projects +# Download upcloud_prov_0.2.0.tar +# Use stable, tested version +``` + +**Why**: Developers iterate fast, other teams use stable versions. + +--- + +### Scenario 4: Enterprise - Production Infrastructure + +**Situation**: Critical production systems, compliance requirements + +**Recommendation**: Provider Packs only + +```text +# CI/CD Pipeline +pack providers # Build artifacts +# Run tests on packages +# Sign packages +# Publish to artifact registry + +# Production Deployment +# Download signed upcloud_prov_1.0.0.tar +# Verify signature +# Deploy immutable artifact +# Document exact versions for compliance +``` + +**Why**: Immutability, auditability, and rollback capabilities required. + +--- + +### Scenario 5: Open Source - Public Distribution + +**Situation**: Sharing providers with community + +**Recommendation**: Provider Packs + Registry + +```text +# Maintainer +pack providers +# Create release on GitHub +gh release create v1.0.0 distribution/packages/*.tar + +# Community User +# Download from GitHub releases +wget https://github.com/project/releases/v1.0.0/upcloud_prov_1.0.0.tar +# Extract and use +``` + +**Why**: Easy distribution, versioning, and downloading for users. + +--- + +## Best Practices + +### For Development + +1. **Use Module-Loader by default** + - Fast iteration is crucial during development + - Symlinks allow immediate testing + +2. **Keep providers.manifest.yaml in Git** + - Documents which providers are used + - Team members can sync easily + +3. **Validate before committing** + + ```bash + providers validate wuji + nickel eval defs/servers.ncl + ``` + +### For Releases + +1. **Version Everything** + - Use semantic versioning (0.1.0, 0.2.0, 1.0.0) + - Update version in kcl.mod before packing + +2. **Create Packs for Releases** + + ```bash + pack providers --version 0.2.0 + git tag v0.2.0 + ``` + +3. **Test Packs Before Publishing** + - Extract and test packages + - Verify metadata is correct + +### For Production + +1. **Pin Versions** + - Use exact versions in production kcl.mod + - Never use "latest" or symlinks + +2. **Maintain Artifact Registry** + - Store all production versions + - Keep old versions for rollback + +3. **Document Deployments** + - Record which versions deployed when + - Maintain change log + +### For CI/CD + +1. **Automate Pack Creation** + + ```yaml + # .github/workflows/release.yml + - name: Pack Providers + run: | + export PROVISIONING=$GITHUB_WORKSPACE/provisioning + ./provisioning/core/cli/pack providers + ``` + +2. **Run Tests on Packs** + - Extract packages + - Run validation tests + - Ensure they work in isolation + +3. **Publish Automatically** + - Upload to artifact registry on tag + - Update package index + +--- + +## Migration Path + +### From Module-Loader to Packs + +When you're ready to move to production: + +```text +# 1. Clean up development setup +providers remove upcloud wuji + +# 2. Create release pack +pack providers --version 1.0.0 + +# 3. Extract pack in infrastructure +cd workspace/infra/wuji +tar -xf ../../../distribution/packages/upcloud_prov_1.0.0.tar vendor/ + +# 4. Update kcl.mod to use vendored path +# Change from: upcloud_prov = { path = "./.kcl-modules/upcloud_prov" } +# To: upcloud_prov = { path = "./vendor/upcloud_prov", version = "1.0.0" } + +# 5. 
Test +nickel eval defs/servers.ncl +``` + +### From Packs Back to Module-Loader + +When you need to debug or develop: + +```text +# 1. Remove vendored version +rm -rf workspace/infra/wuji/vendor/upcloud_prov + +# 2. Install via module-loader +providers install upcloud wuji + +# 3. Make changes in extensions/providers/upcloud/kcl/ + +# 4. Test immediately +cd workspace/infra/wuji +nickel eval defs/servers.ncl +``` + +--- + +## Configuration + +### Environment Variables + +```text +# Required for pack commands +export PROVISIONING=/path/to/provisioning + +# Alternative +export PROVISIONING_CONFIG=/path/to/provisioning +``` + +### Config Files + +Distribution settings in `provisioning/config/config.defaults.toml`: + +```text +[distribution] +pack_path = "{{paths.base}}/distribution/packages" +registry_path = "{{paths.base}}/distribution/registry" +cache_path = "{{paths.base}}/distribution/cache" +registry_type = "local" + +[distribution.metadata] +maintainer = "JesusPerezLorenzo" +repository = "https://repo.jesusperez.pro/provisioning" +license = "MIT" +homepage = "https://github.com/jesusperezlorenzo/provisioning" + +[kcl] +core_module = "{{paths.base}}/kcl" +core_version = "0.0.1" +core_package_name = "provisioning_core" +use_module_loader = true +modules_dir = ".kcl-modules" +``` + +--- + +## Troubleshooting + +### Module-Loader Issues + +**Problem**: Provider not found after install + +```text +# Check provider exists +providers list | grep upcloud + +# Validate installation +providers validate wuji + +# Check symlink +ls -la workspace/infra/wuji/.kcl-modules/ +``` + +**Problem**: Changes not reflected + +```text +# Verify symlink is correct +readlink workspace/infra/wuji/.kcl-modules/upcloud_prov + +# Should point to extensions/providers/upcloud/kcl/ +``` + +### Provider Pack Issues + +**Problem**: No .tar file created + +```text +# Check KCL version (need 0.11.3+) +kcl version + +# Check kcl.mod exists +ls extensions/providers/upcloud/kcl/kcl.mod +``` + +**Problem**: PROVISIONING environment variable not set + +```text +# Set it +export PROVISIONING=/Users/Akasha/project-provisioning/provisioning + +# Or add to shell profile +echo 'export PROVISIONING=/path/to/provisioning' >> ~/.zshrc +``` + +--- + +## Conclusion + +**Both approaches are valuable and complementary:** + +- **Module-Loader**: Development velocity, rapid iteration +- **Provider Packs**: Production stability, version control + +**Default Strategy:** + +- Use **Module-Loader** for day-to-day development +- Create **Provider Packs** for releases and production +- Both systems work seamlessly together + +**The system is designed for flexibility** - choose the right tool for your current phase of work! 
+
+---
+
+## Additional Resources
+
+- [Module-Loader Implementation](../provisioning/core/nulib/lib_provisioning/kcl_module_loader.nu)
+- [KCL Packaging Implementation](../provisioning/core/nulib/lib_provisioning/kcl_packaging.nu)
+- Providers CLI: `provisioning providers`
+- [Pack CLI](../provisioning/core/cli/pack)
+- [KCL Documentation](https://kcl-lang.io/)
+
+---
+
+**Document Version**: 1.0.0
+**Last Updated**: 2025-09-29
+**Maintained by**: JesusPerezLorenzo
diff --git a/docs/src/development/providers/quick-provider-guide.md b/docs/src/development/providers/quick-provider-guide.md
index df45e85..ffe3192 100644
--- a/docs/src/development/providers/quick-provider-guide.md
+++ b/docs/src/development/providers/quick-provider-guide.md
@@ -1 +1,322 @@
-# Quick Developer Guide: Adding New Providers\n\nThis guide shows how to quickly add a new provider to the provider-agnostic infrastructure system.\n\n## Prerequisites\n\n- Understand the [Provider-Agnostic Architecture](PROVIDER_AGNOSTIC_ARCHITECTURE.md)\n- Have the provider's SDK or API available\n- Know the provider's authentication requirements\n\n## 5-Minute Provider Addition\n\n### Step 1: Create Provider Directory\n\n```\nmkdir -p provisioning/extensions/providers/{provider_name}\nmkdir -p provisioning/extensions/providers/{provider_name}/nulib/{provider_name}\n```\n\n### Step 2: Copy Template and Customize\n\n```\n# Copy the local provider as a template\ncp provisioning/extensions/providers/local/provider.nu \\n provisioning/extensions/providers/{provider_name}/provider.nu\n```\n\n### Step 3: Update Provider Metadata\n\nEdit `provisioning/extensions/providers/{provider_name}/provider.nu`:\n\n```\nexport def get-provider-metadata []: nothing -> record {\n {\n name: "your_provider_name"\n version: "1.0.0"\n description: "Your Provider Description"\n capabilities: {\n server_management: true\n network_management: true # Set based on provider features\n auto_scaling: false # Set based on provider features\n multi_region: true # Set based on provider features\n serverless: false # Set based on provider features\n # ... 
customize other capabilities\n }\n }\n}\n```\n\n### Step 4: Implement Core Functions\n\nThe provider interface requires these essential functions:\n\n```\n# Required: Server operations\nexport def query_servers [find?: string, cols?: string]: nothing -> list {\n # Call your provider's server listing API\n your_provider_query_servers $find $cols\n}\n\nexport def create_server [settings: record, server: record, check: bool, wait: bool]: nothing -> bool {\n # Call your provider's server creation API\n your_provider_create_server $settings $server $check $wait\n}\n\nexport def server_exists [server: record, error_exit: bool]: nothing -> bool {\n # Check if server exists in your provider\n your_provider_server_exists $server $error_exit\n}\n\nexport def get_ip [settings: record, server: record, ip_type: string, error_exit: bool]: nothing -> string {\n # Get server IP from your provider\n your_provider_get_ip $settings $server $ip_type $error_exit\n}\n\n# Required: Infrastructure operations\nexport def delete_server [settings: record, server: record, keep_storage: bool, error_exit: bool]: nothing -> bool {\n your_provider_delete_server $settings $server $keep_storage $error_exit\n}\n\nexport def server_state [server: record, new_state: string, error_exit: bool, wait: bool, settings: record]: nothing -> bool {\n your_provider_server_state $server $new_state $error_exit $wait $settings\n}\n```\n\n### Step 5: Create Provider-Specific Functions\n\nCreate `provisioning/extensions/providers/{provider_name}/nulib/{provider_name}/servers.nu`:\n\n```\n# Example: DigitalOcean provider functions\nexport def digitalocean_query_servers [find?: string, cols?: string]: nothing -> list {\n # Use DigitalOcean API to list droplets\n let droplets = (http get "https://api.digitalocean.com/v2/droplets"\n --headers { Authorization: $"Bearer ($env.DO_TOKEN)" })\n\n $droplets.droplets | select name status memory disk region.name networks.v4\n}\n\nexport def digitalocean_create_server [settings: record, server: record, check: bool, wait: bool]: nothing -> bool {\n # Use DigitalOcean API to create droplet\n let payload = {\n name: $server.hostname\n region: $server.zone\n size: $server.plan\n image: ($server.image? 
| default "ubuntu-20-04-x64")\n }\n\n if $check {\n print $"Would create DigitalOcean droplet: ($payload)"\n return true\n }\n\n let result = (http post "https://api.digitalocean.com/v2/droplets"\n --headers { Authorization: $"Bearer ($env.DO_TOKEN)" }\n --content-type application/json\n $payload)\n\n $result.droplet.id != null\n}\n```\n\n### Step 6: Test Your Provider\n\n```\n# Test provider discovery\nnu -c "use provisioning/core/nulib/lib_provisioning/providers/registry.nu *; init-provider-registry; list-providers"\n\n# Test provider loading\nnu -c "use provisioning/core/nulib/lib_provisioning/providers/loader.nu *; load-provider 'your_provider_name'"\n\n# Test provider functions\nnu -c "use provisioning/extensions/providers/your_provider_name/provider.nu *; query_servers"\n```\n\n### Step 7: Add Provider to Infrastructure\n\nAdd to your Nickel configuration:\n\n```\n# workspace/infra/example/servers.ncl\nlet servers = [\n {\n hostname = "test-server",\n provider = "your_provider_name",\n zone = "your-region-1",\n plan = "your-instance-type",\n }\n] in\nservers\n```\n\n## Provider Templates\n\n### Cloud Provider Template\n\nFor cloud providers (AWS, GCP, Azure, etc.):\n\n```\n# Use HTTP calls to cloud APIs\nexport def cloud_query_servers [find?: string, cols?: string]: nothing -> list {\n let auth_header = { Authorization: $"Bearer ($env.PROVIDER_TOKEN)" }\n let servers = (http get $"($env.PROVIDER_API_URL)/servers" --headers $auth_header)\n\n $servers | select name status region instance_type public_ip\n}\n```\n\n### Container Platform Template\n\nFor container platforms (Docker, Podman, etc.):\n\n```\n# Use CLI commands for container platforms\nexport def container_query_servers [find?: string, cols?: string]: nothing -> list {\n let containers = (docker ps --format json | from json)\n\n $containers | select Names State Status Image\n}\n```\n\n### Bare Metal Provider Template\n\nFor bare metal or existing servers:\n\n```\n# Use SSH or local commands\nexport def baremetal_query_servers [find?: string, cols?: string]: nothing -> list {\n # Read from inventory file or ping servers\n let inventory = (open inventory.yaml | from yaml)\n\n $inventory.servers | select hostname ip_address status\n}\n```\n\n## Best Practices\n\n### 1. Error Handling\n\n```\nexport def provider_operation []: nothing -> any {\n try {\n # Your provider operation\n provider_api_call\n } catch {|err|\n log-error $"Provider operation failed: ($err.msg)" "provider"\n if $error_exit { exit 1 }\n null\n }\n}\n```\n\n### 2. Authentication\n\n```\n# Check for required environment variables\ndef check_auth []: nothing -> bool {\n if ($env | get -o PROVIDER_TOKEN) == null {\n log-error "PROVIDER_TOKEN environment variable required" "auth"\n return false\n }\n true\n}\n```\n\n### 3. Rate Limiting\n\n```\n# Add delays for API rate limits\ndef api_call_with_retry [url: string]: nothing -> any {\n mut attempts = 0\n mut max_attempts = 3\n\n while $attempts < $max_attempts {\n try {\n return (http get $url)\n } catch {\n $attempts += 1\n sleep 1sec\n }\n }\n\n error make { msg: "API call failed after retries" }\n}\n```\n\n### 4. 
Provider Capabilities\n\nSet capabilities accurately:\n\n```\ncapabilities: {\n server_management: true # Can create/delete servers\n network_management: true # Can manage networks/VPCs\n storage_management: true # Can manage block storage\n load_balancer: false # No load balancer support\n dns_management: false # No DNS support\n auto_scaling: true # Supports auto-scaling\n spot_instances: false # No spot instance support\n multi_region: true # Supports multiple regions\n containers: false # No container support\n serverless: false # No serverless support\n encryption_at_rest: true # Supports encryption\n compliance_certifications: ["SOC2"] # Available certifications\n}\n```\n\n## Testing Checklist\n\n- [ ] Provider discovered by registry\n- [ ] Provider loads without errors\n- [ ] All required interface functions implemented\n- [ ] Provider metadata correct\n- [ ] Authentication working\n- [ ] Can query existing resources\n- [ ] Can create new resources (in test mode)\n- [ ] Error handling working\n- [ ] Compatible with existing infrastructure configs\n\n## Common Issues\n\n### Provider Not Found\n\n```\n# Check provider directory structure\nls -la provisioning/extensions/providers/your_provider_name/\n\n# Ensure provider.nu exists and has get-provider-metadata function\ngrep "get-provider-metadata" provisioning/extensions/providers/your_provider_name/provider.nu\n```\n\n### Interface Validation Failed\n\n```\n# Check which functions are missing\nnu -c "use provisioning/core/nulib/lib_provisioning/providers/interface.nu *; validate-provider-interface 'your_provider_name'"\n```\n\n### Authentication Errors\n\n```\n# Check environment variables\nenv | grep PROVIDER\n\n# Test API access manually\ncurl -H "Authorization: Bearer $PROVIDER_TOKEN" https://api.provider.com/test\n```\n\n## Next Steps\n\n1. **Documentation**: Add provider-specific documentation to `docs/providers/`\n2. **Examples**: Create example infrastructure using your provider\n3. **Testing**: Add integration tests for your provider\n4. **Optimization**: Implement caching and performance optimizations\n5. **Features**: Add provider-specific advanced features\n\n## Getting Help\n\n- Check existing providers for implementation patterns\n- Review the [Provider Interface Documentation](PROVIDER_AGNOSTIC_ARCHITECTURE.md#provider-interface)\n- Test with the provider test suite: `./provisioning/tools/test-provider-agnostic.nu`\n- Run migration checks: `./provisioning/tools/migrate-to-provider-agnostic.nu status` +# Quick Developer Guide: Adding New Providers + +This guide shows how to quickly add a new provider to the provider-agnostic infrastructure system. 
+
+## Prerequisites
+
+- Understand the [Provider-Agnostic Architecture](../provider-agnostic-architecture.md)
+- Have the provider's SDK or API available
+- Know the provider's authentication requirements
+
+## 5-Minute Provider Addition
+
+### Step 1: Create Provider Directory
+
+```text
+mkdir -p provisioning/extensions/providers/{provider_name}
+mkdir -p provisioning/extensions/providers/{provider_name}/nulib/{provider_name}
+```
+
+### Step 2: Copy Template and Customize
+
+```text
+# Copy the local provider as a template
+cp provisioning/extensions/providers/local/provider.nu \
+   provisioning/extensions/providers/{provider_name}/provider.nu
+```
+
+### Step 3: Update Provider Metadata
+
+Edit `provisioning/extensions/providers/{provider_name}/provider.nu`:
+
+```text
+export def get-provider-metadata []: nothing -> record {
+    {
+        name: "your_provider_name"
+        version: "1.0.0"
+        description: "Your Provider Description"
+        capabilities: {
+            server_management: true
+            network_management: true   # Set based on provider features
+            auto_scaling: false        # Set based on provider features
+            multi_region: true         # Set based on provider features
+            serverless: false          # Set based on provider features
+            # ... customize other capabilities
+        }
+    }
+}
+```
+
+### Step 4: Implement Core Functions
+
+The provider interface requires these essential functions:
+
+```text
+# Required: Server operations
+export def query_servers [find?: string, cols?: string]: nothing -> list {
+    # Call your provider's server listing API
+    your_provider_query_servers $find $cols
+}
+
+export def create_server [settings: record, server: record, check: bool, wait: bool]: nothing -> bool {
+    # Call your provider's server creation API
+    your_provider_create_server $settings $server $check $wait
+}
+
+export def server_exists [server: record, error_exit: bool]: nothing -> bool {
+    # Check if server exists in your provider
+    your_provider_server_exists $server $error_exit
+}
+
+export def get_ip [settings: record, server: record, ip_type: string, error_exit: bool]: nothing -> string {
+    # Get server IP from your provider
+    your_provider_get_ip $settings $server $ip_type $error_exit
+}
+
+# Required: Infrastructure operations
+export def delete_server [settings: record, server: record, keep_storage: bool, error_exit: bool]: nothing -> bool {
+    your_provider_delete_server $settings $server $keep_storage $error_exit
+}
+
+export def server_state [server: record, new_state: string, error_exit: bool, wait: bool, settings: record]: nothing -> bool {
+    your_provider_server_state $server $new_state $error_exit $wait $settings
+}
+```
+
+### Step 5: Create Provider-Specific Functions
+
+Create `provisioning/extensions/providers/{provider_name}/nulib/{provider_name}/servers.nu`:
+
+```text
+# Example: DigitalOcean provider functions
+export def digitalocean_query_servers [find?: string, cols?: string]: nothing -> list {
+    # Use DigitalOcean API to list droplets
+    let droplets = (http get "https://api.digitalocean.com/v2/droplets"
+        --headers { Authorization: $"Bearer ($env.DO_TOKEN)" })
+
+    $droplets.droplets | select name status memory disk region.name networks.v4
+}
+
+export def digitalocean_create_server [settings: record, server: record, check: bool, wait: bool]: nothing -> bool {
+    # Use DigitalOcean API to create droplet
+    let payload = {
+        name: $server.hostname
+        region: $server.zone
+        size: $server.plan
+        image: ($server.image? | default "ubuntu-20-04-x64")
+    }
+
+    if $check {
+        print $"Would create DigitalOcean droplet: ($payload)"
+        return true
+    }
+
+    let result = (http post "https://api.digitalocean.com/v2/droplets"
+        --headers { Authorization: $"Bearer ($env.DO_TOKEN)" }
+        --content-type application/json
+        $payload)
+
+    $result.droplet.id != null
+}
+```
+
+### Step 6: Test Your Provider
+
+```text
+# Test provider discovery
+nu -c "use provisioning/core/nulib/lib_provisioning/providers/registry.nu *; init-provider-registry; list-providers"
+
+# Test provider loading
+nu -c "use provisioning/core/nulib/lib_provisioning/providers/loader.nu *; load-provider 'your_provider_name'"
+
+# Test provider functions
+nu -c "use provisioning/extensions/providers/your_provider_name/provider.nu *; query_servers"
+```
+
+### Step 7: Add Provider to Infrastructure
+
+Add to your Nickel configuration:
+
+```text
+# workspace/infra/example/servers.ncl
+let servers = [
+    {
+        hostname = "test-server",
+        provider = "your_provider_name",
+        zone = "your-region-1",
+        plan = "your-instance-type",
+    }
+] in
+servers
+```
+
+## Provider Templates
+
+### Cloud Provider Template
+
+For cloud providers (AWS, GCP, Azure, etc.):
+
+```text
+# Use HTTP calls to cloud APIs
+export def cloud_query_servers [find?: string, cols?: string]: nothing -> list {
+    let auth_header = { Authorization: $"Bearer ($env.PROVIDER_TOKEN)" }
+    let servers = (http get $"($env.PROVIDER_API_URL)/servers" --headers $auth_header)
+
+    $servers | select name status region instance_type public_ip
+}
+```
+
+### Container Platform Template
+
+For container platforms (Docker, Podman, etc.):
+
+```text
+# Use CLI commands for container platforms
+export def container_query_servers [find?: string, cols?: string]: nothing -> list {
+    # docker emits one JSON object per line, so parse line by line
+    let containers = (docker ps --format json | lines | each {|line| $line | from json })
+
+    $containers | select Names State Status Image
+}
+```
+
+### Bare Metal Provider Template
+
+For bare metal or existing servers:
+
+```text
+# Use SSH or local commands
+export def baremetal_query_servers [find?: string, cols?: string]: nothing -> list {
+    # Read from inventory file or ping servers (open parses YAML by extension)
+    let inventory = (open inventory.yaml)
+
+    $inventory.servers | select hostname ip_address status
+}
+```
+
+## Best Practices
+
+### 1. Error Handling
+
+```text
+export def provider_operation [error_exit: bool]: nothing -> any {
+    try {
+        # Your provider operation
+        provider_api_call
+    } catch {|err|
+        log-error $"Provider operation failed: ($err.msg)" "provider"
+        if $error_exit { exit 1 }
+        null
+    }
+}
+```
+
+### 2. Authentication
+
+```text
+# Check for required environment variables
+def check_auth []: nothing -> bool {
+    if ($env | get -o PROVIDER_TOKEN) == null {
+        log-error "PROVIDER_TOKEN environment variable required" "auth"
+        return false
+    }
+    true
+}
+```
+
+### 3. Rate Limiting
+
+```text
+# Add delays for API rate limits
+def api_call_with_retry [url: string]: nothing -> any {
+    mut attempts = 0
+    let max_attempts = 3
+
+    while $attempts < $max_attempts {
+        try {
+            return (http get $url)
+        } catch {
+            $attempts += 1
+            sleep 1sec
+        }
+    }
+
+    error make { msg: "API call failed after retries" }
+}
+```
+
Provider Capabilities + +Set capabilities accurately: + +```text +capabilities: { + server_management: true # Can create/delete servers + network_management: true # Can manage networks/VPCs + storage_management: true # Can manage block storage + load_balancer: false # No load balancer support + dns_management: false # No DNS support + auto_scaling: true # Supports auto-scaling + spot_instances: false # No spot instance support + multi_region: true # Supports multiple regions + containers: false # No container support + serverless: false # No serverless support + encryption_at_rest: true # Supports encryption + compliance_certifications: ["SOC2"] # Available certifications +} +``` + +## Testing Checklist + +- [ ] Provider discovered by registry +- [ ] Provider loads without errors +- [ ] All required interface functions implemented +- [ ] Provider metadata correct +- [ ] Authentication working +- [ ] Can query existing resources +- [ ] Can create new resources (in test mode) +- [ ] Error handling working +- [ ] Compatible with existing infrastructure configs + +## Common Issues + +### Provider Not Found + +```text +# Check provider directory structure +ls -la provisioning/extensions/providers/your_provider_name/ + +# Ensure provider.nu exists and has get-provider-metadata function +grep "get-provider-metadata" provisioning/extensions/providers/your_provider_name/provider.nu +``` + +### Interface Validation Failed + +```text +# Check which functions are missing +nu -c "use provisioning/core/nulib/lib_provisioning/providers/interface.nu *; validate-provider-interface 'your_provider_name'" +``` + +### Authentication Errors + +```text +# Check environment variables +env | grep PROVIDER + +# Test API access manually +curl -H "Authorization: Bearer $PROVIDER_TOKEN" https://api.provider.com/test +``` + +## Next Steps + +1. **Documentation**: Add provider-specific documentation to `docs/providers/` +2. **Examples**: Create example infrastructure using your provider +3. **Testing**: Add integration tests for your provider +4. **Optimization**: Implement caching and performance optimizations +5. 
**Features**: Add provider-specific advanced features + +## Getting Help + +- Check existing providers for implementation patterns +- Review the [Provider Interface Documentation](PROVIDER_AGNOSTIC_ARCHITECTURE.md#provider-interface) +- Test with the provider test suite: `./provisioning/tools/test-provider-agnostic.nu` +- Run migration checks: `./provisioning/tools/migrate-to-provider-agnostic.nu status` \ No newline at end of file diff --git a/docs/src/development/taskservs/taskserv-categorization.md b/docs/src/development/taskservs/taskserv-categorization.md index 4e9561a..2381e1d 100644 --- a/docs/src/development/taskservs/taskserv-categorization.md +++ b/docs/src/development/taskservs/taskserv-categorization.md @@ -1 +1,70 @@ -# Taskserv Categorization Plan\n\n## Categories and Taskservs (38 total)\n\n### **kubernetes/** (1)\n\n- kubernetes\n\n### **networking/** (6)\n\n- cilium\n- coredns\n- etcd\n- ip-aliases\n- proxy\n- resolv\n\n### **container-runtime/** (6)\n\n- containerd\n- crio\n- crun\n- podman\n- runc\n- youki\n\n### **storage/** (4)\n\n- external-nfs\n- mayastor\n- oci-reg\n- rook-ceph\n\n### **databases/** (2)\n\n- postgres\n- redis\n\n### **development/** (6)\n\n- coder\n- desktop\n- gitea\n- nushell\n- oras\n- radicle\n\n### **infrastructure/** (6)\n\n- kms\n- os\n- provisioning\n- polkadot\n- webhook\n- kubectl\n\n### **misc/** (1)\n\n- generate\n\n### **Keep in root/** (6)\n\n- info.md\n- manifest.toml\n- manifest.lock\n- README.md\n- REFERENCE.md\n- version.ncl\n\nTotal categorized: 32 taskservs + 6 root files = 38 items ✓ +# Taskserv Categorization Plan + +## Categories and Taskservs (38 total) + +### **kubernetes/** (1) + +- kubernetes + +### **networking/** (6) + +- cilium +- coredns +- etcd +- ip-aliases +- proxy +- resolv + +### **container-runtime/** (6) + +- containerd +- crio +- crun +- podman +- runc +- youki + +### **storage/** (4) + +- external-nfs +- mayastor +- oci-reg +- rook-ceph + +### **databases/** (2) + +- postgres +- redis + +### **development/** (6) + +- coder +- desktop +- gitea +- nushell +- oras +- radicle + +### **infrastructure/** (6) + +- kms +- os +- provisioning +- polkadot +- webhook +- kubectl + +### **misc/** (1) + +- generate + +### **Keep in root/** (6) + +- info.md +- manifest.toml +- manifest.lock +- README.md +- REFERENCE.md +- version.ncl + +Total categorized: 32 taskservs + 6 root files = 38 items ✓ diff --git a/docs/src/development/taskservs/taskserv-quick-guide.md b/docs/src/development/taskservs/taskserv-quick-guide.md index 064cd5c..14977a2 100644 --- a/docs/src/development/taskservs/taskserv-quick-guide.md +++ b/docs/src/development/taskservs/taskserv-quick-guide.md @@ -1 +1,249 @@ -# Taskserv Quick Guide\n\n## 🚀 Quick Start\n\n### Create a New Taskserv (Interactive)\n\n```\nnu provisioning/tools/create-taskserv-helper.nu interactive\n```\n\n### Create a New Taskserv (Direct)\n\n```\nnu provisioning/tools/create-taskserv-helper.nu create my-api \\n --category development \\n --port 8080 \\n --description "My REST API service"\n```\n\n## 📋 5-Minute Setup\n\n### 1. Choose Your Method\n\n- **Interactive**: `nu provisioning/tools/create-taskserv-helper.nu interactive`\n- **Command Line**: Use the direct command above\n- **Manual**: Follow the structure guide below\n\n### 2. 
Basic Structure\n\n```\nmy-service/\n├── nickel/\n│ ├── manifest.toml # Package definition\n│ ├── my-service.ncl # Main schema\n│ └── version.ncl # Version info\n├── default/\n│ ├── defs.toml # Default config\n│ └── install-*.sh # Install script\n└── README.md # Documentation\n```\n\n### 3. Essential Files\n\n**manifest.toml** (package definition):\n\n```\n[package]\nname = "my-service"\nversion = "1.0.0"\ndescription = "My service"\n\n[dependencies]\nk8s = { oci = "oci://ghcr.io/kcl-lang/k8s", tag = "1.30" }\n```\n\n**my-service.ncl** (main schema):\n\n```\nlet MyService = {\n name | String,\n version | String,\n port | Number,\n replicas | Number,\n} in\n\n{\n my_service_config = {\n name = "my-service",\n version = "latest",\n port = 8080,\n replicas = 1,\n }\n}\n```\n\n### 4. Test Your Taskserv\n\n```\n# Discover your taskserv\nnu -c "use provisioning/core/nulib/taskservs/discover.nu *; get-taskserv-info my-service"\n\n# Test layer resolution\nnu -c "use provisioning/workspace/tools/layer-utils.nu *; test_layer_resolution my-service wuji upcloud"\n\n# Deploy with check\nprovisioning/core/cli/provisioning taskserv create my-service --infra wuji --check\n```\n\n## 🎯 Common Patterns\n\n### Web Service\n\n```\nlet WebService = {\n name | String,\n version | String | default = "latest",\n port | Number | default = 8080,\n replicas | Number | default = 1,\n ingress | {\n enabled | Bool | default = true,\n hostname | String,\n tls | Bool | default = false,\n },\n resources | {\n cpu | String | default = "100m",\n memory | String | default = "128Mi",\n },\n} in\nWebService\n```\n\n### Database Service\n\n```\nlet DatabaseService = {\n name | String,\n version | String | default = "latest",\n port | Number | default = 5432,\n persistence | {\n enabled | Bool | default = true,\n size | String | default = "10Gi",\n storage_class | String | default = "ssd",\n },\n auth | {\n database | String | default = "app",\n username | String | default = "user",\n password_secret | String,\n },\n} in\nDatabaseService\n```\n\n### Background Worker\n\n```\nlet BackgroundWorker = {\n name | String,\n version | String | default = "latest",\n replicas | Number | default = 1,\n job | {\n schedule | String | optional, # Cron format for scheduled jobs\n parallelism | Number | default = 1,\n completions | Number | default = 1,\n },\n resources | {\n cpu | String | default = "500m",\n memory | String | default = "512Mi",\n },\n} in\nBackgroundWorker\n```\n\n## 🛠️ CLI Shortcuts\n\n### Discovery\n\n```\n# List all taskservs\nnu -c "use provisioning/core/nulib/taskservs/discover.nu *; discover-taskservs | select name group"\n\n# Search taskservs\nnu -c "use provisioning/core/nulib/taskservs/discover.nu *; search-taskservs redis"\n\n# Show stats\nnu -c "use provisioning/workspace/tools/layer-utils.nu *; show_layer_stats"\n```\n\n### Development\n\n```\n# Check Nickel syntax\nnickel typecheck provisioning/extensions/taskservs/{category}/{name}/schemas/{name}.ncl\n\n# Generate configuration\nprovisioning/core/cli/provisioning taskserv generate {name} --infra {infra}\n\n# Version management\nprovisioning/core/cli/provisioning taskserv versions {name}\nprovisioning/core/cli/provisioning taskserv check-updates\n```\n\n### Testing\n\n```\n# Dry run deployment\nprovisioning/core/cli/provisioning taskserv create {name} --infra {infra} --check\n\n# Layer resolution debug\nnu -c "use provisioning/workspace/tools/layer-utils.nu *; test_layer_resolution {name} {infra} {provider}"\n```\n\n## 📚 Categories Reference\n\n| Category | 
Examples | Use Case |\n| ---------- | ---------- | ---------- |\n| **container-runtime** | containerd, crio, podman | Container runtime engines |\n| **databases** | postgres, redis | Database services |\n| **development** | coder, gitea, desktop | Development tools |\n| **infrastructure** | kms, webhook, os | System infrastructure |\n| **kubernetes** | kubernetes | Kubernetes orchestration |\n| **networking** | cilium, coredns, etcd | Network services |\n| **storage** | rook-ceph, external-nfs | Storage solutions |\n\n## 🔧 Troubleshooting\n\n### Taskserv Not Found\n\n```\n# Check if discovered\nnu -c "use provisioning/core/nulib/taskservs/discover.nu *; discover-taskservs | where name == my-service"\n\n# Verify kcl.mod exists\nls provisioning/extensions/taskservs/{category}/my-service/kcl/kcl.mod\n```\n\n### Layer Resolution Issues\n\n```\n# Debug resolution\nnu -c "use provisioning/workspace/tools/layer-utils.nu *; test_layer_resolution my-service wuji upcloud"\n\n# Check template exists\nls provisioning/workspace/templates/taskservs/{category}/my-service.ncl\n```\n\n### Nickel Syntax Errors\n\n```\n# Check syntax\nnickel typecheck provisioning/extensions/taskservs/{category}/my-service/schemas/my-service.ncl\n\n# Format code\nnickel format provisioning/extensions/taskservs/{category}/my-service/schemas/\n```\n\n## 💡 Pro Tips\n\n1. **Use existing taskservs as templates** - Copy and modify similar services\n2. **Test with --check first** - Always use dry run before actual deployment\n3. **Follow naming conventions** - Use kebab-case for consistency\n4. **Document thoroughly** - Good docs save time later\n5. **Version your schemas** - Include version.ncl for compatibility tracking\n\n## 🔗 Next Steps\n\n1. Read the full [Taskserv Developer Guide](TASKSERV_DEVELOPER_GUIDE.md)\n2. Explore existing taskservs in `provisioning/extensions/taskservs/`\n3. Check out templates in `provisioning/workspace/templates/taskservs/`\n4. Join the development community for support
+# Taskserv Quick Guide
+
+## 🚀 Quick Start
+
+### Create a New Taskserv (Interactive)
+
+```text
+nu provisioning/tools/create-taskserv-helper.nu interactive
+```
+
+### Create a New Taskserv (Direct)
+
+```text
+nu provisioning/tools/create-taskserv-helper.nu create my-api \
+    --category development \
+    --port 8080 \
+    --description "My REST API service"
+```
+
+## 📋 5-Minute Setup
+
+### 1. Choose Your Method
+
+- **Interactive**: `nu provisioning/tools/create-taskserv-helper.nu interactive`
+- **Command Line**: Use the direct command above
+- **Manual**: Follow the structure guide below
+
+### 2. Basic Structure
+
+```text
+my-service/
+├── nickel/
+│   ├── manifest.toml      # Package definition
+│   ├── my-service.ncl     # Main schema
+│   └── version.ncl        # Version info
+├── default/
+│   ├── defs.toml          # Default config
+│   └── install-*.sh       # Install script
+└── README.md              # Documentation
+```
+
+### 3. Essential Files
+
+**manifest.toml** (package definition):
+
+```text
+[package]
+name = "my-service"
+version = "1.0.0"
+description = "My service"
+
+[dependencies]
+k8s = { oci = "oci://ghcr.io/kcl-lang/k8s", tag = "1.30" }
+```
+
+**my-service.ncl** (main schema):
+
+```text
+let MyService = {
+    name | String,
+    version | String,
+    port | Number,
+    replicas | Number,
+} in
+
+{
+    my_service_config = {
+        name = "my-service",
+        version = "latest",
+        port = 8080,
+        replicas = 1,
+    }
+}
+```
+
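+**version.ncl** (version info; a minimal hedged sketch, since the exact fields are taskserv-specific and these names are illustrative):
+
+```text
+# Used for compatibility tracking; keep in sync with manifest.toml
+{
+    version = "1.0.0",
+}
+```
+
+### 4.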
Test Your Taskserv + +```text +# Discover your taskserv +nu -c "use provisioning/core/nulib/taskservs/discover.nu *; get-taskserv-info my-service" + +# Test layer resolution +nu -c "use provisioning/workspace/tools/layer-utils.nu *; test_layer_resolution my-service wuji upcloud" + +# Deploy with check +provisioning/core/cli/provisioning taskserv create my-service --infra wuji --check +``` + +## 🎯 Common Patterns + +### Web Service + +```text +let WebService = { + name | String, + version | String | default = "latest", + port | Number | default = 8080, + replicas | Number | default = 1, + ingress | { + enabled | Bool | default = true, + hostname | String, + tls | Bool | default = false, + }, + resources | { + cpu | String | default = "100m", + memory | String | default = "128Mi", + }, +} in +WebService +``` + +### Database Service + +```text +let DatabaseService = { + name | String, + version | String | default = "latest", + port | Number | default = 5432, + persistence | { + enabled | Bool | default = true, + size | String | default = "10Gi", + storage_class | String | default = "ssd", + }, + auth | { + database | String | default = "app", + username | String | default = "user", + password_secret | String, + }, +} in +DatabaseService +``` + +### Background Worker + +```text +let BackgroundWorker = { + name | String, + version | String | default = "latest", + replicas | Number | default = 1, + job | { + schedule | String | optional, # Cron format for scheduled jobs + parallelism | Number | default = 1, + completions | Number | default = 1, + }, + resources | { + cpu | String | default = "500m", + memory | String | default = "512Mi", + }, +} in +BackgroundWorker +``` + +## 🛠️ CLI Shortcuts + +### Discovery + +```text +# List all taskservs +nu -c "use provisioning/core/nulib/taskservs/discover.nu *; discover-taskservs | select name group" + +# Search taskservs +nu -c "use provisioning/core/nulib/taskservs/discover.nu *; search-taskservs redis" + +# Show stats +nu -c "use provisioning/workspace/tools/layer-utils.nu *; show_layer_stats" +``` + +### Development + +```text +# Check Nickel syntax +nickel typecheck provisioning/extensions/taskservs/{category}/{name}/schemas/{name}.ncl + +# Generate configuration +provisioning/core/cli/provisioning taskserv generate {name} --infra {infra} + +# Version management +provisioning/core/cli/provisioning taskserv versions {name} +provisioning/core/cli/provisioning taskserv check-updates +``` + +### Testing + +```text +# Dry run deployment +provisioning/core/cli/provisioning taskserv create {name} --infra {infra} --check + +# Layer resolution debug +nu -c "use provisioning/workspace/tools/layer-utils.nu *; test_layer_resolution {name} {infra} {provider}" +``` + +## 📚 Categories Reference + +| Category | Examples | Use Case | +| ---------- | ---------- | ---------- | +| **container-runtime** | containerd, crio, podman | Container runtime engines | +| **databases** | postgres, redis | Database services | +| **development** | coder, gitea, desktop | Development tools | +| **infrastructure** | kms, webhook, os | System infrastructure | +| **kubernetes** | kubernetes | Kubernetes orchestration | +| **networking** | cilium, coredns, etcd | Network services | +| **storage** | rook-ceph, external-nfs | Storage solutions | + +## 🔧 Troubleshooting + +### Taskserv Not Found + +```text +# Check if discovered +nu -c "use provisioning/core/nulib/taskservs/discover.nu *; discover-taskservs | where name == my-service" + +# Verify kcl.mod exists +ls 
provisioning/extensions/taskservs/{category}/my-service/kcl/kcl.mod +``` + +### Layer Resolution Issues + +```text +# Debug resolution +nu -c "use provisioning/workspace/tools/layer-utils.nu *; test_layer_resolution my-service wuji upcloud" + +# Check template exists +ls provisioning/workspace/templates/taskservs/{category}/my-service.ncl +``` + +### Nickel Syntax Errors + +```text +# Check syntax +nickel typecheck provisioning/extensions/taskservs/{category}/my-service/schemas/my-service.ncl + +# Format code +nickel format provisioning/extensions/taskservs/{category}/my-service/schemas/ +``` + +## 💡 Pro Tips + +1. **Use existing taskservs as templates** - Copy and modify similar services +2. **Test with --check first** - Always use dry run before actual deployment +3. **Follow naming conventions** - Use kebab-case for consistency +4. **Document thoroughly** - Good docs save time later +5. **Version your schemas** - Include version.ncl for compatibility tracking + +## 🔗 Next Steps + +1. Read the full [Taskserv Developer Guide](TASKSERV_DEVELOPER_GUIDE.md) +2. Explore existing taskservs in `provisioning/extensions/taskservs/` +3. Check out templates in `provisioning/workspace/templates/taskservs/` +4. Join the development community for support \ No newline at end of file diff --git a/docs/src/development/typedialog-platform-config-guide.md b/docs/src/development/typedialog-platform-config-guide.md index da75ac4..e30f993 100644 --- a/docs/src/development/typedialog-platform-config-guide.md +++ b/docs/src/development/typedialog-platform-config-guide.md @@ -1 +1,1006 @@ -# TypeDialog Platform Configuration Guide\n\n**Version**: 2.0.0\n**Last Updated**: 2026-01-05\n**Status**: Production Ready\n**Target Audience**: DevOps Engineers, Infrastructure Administrators\n\n**Services Covered**: 8 platform services (orchestrator, control-center, mcp-server, vault-service, extension-registry, rag, ai-service,\nprovisioning-daemon)\n\nInteractive configuration for cloud-native infrastructure platform services using TypeDialog forms and Nickel.\n\n## Overview\n\n**TypeDialog** is an interactive form system that generates Nickel configurations for platform services. Instead of manually editing TOML or KCL\nfiles, you answer questions in an interactive form, and TypeDialog generates validated Nickel configuration.\n\n**Benefits**:\n\n- ✅ No manual TOML editing required\n- ✅ Interactive guidance for each setting\n- ✅ Automatic validation of inputs\n- ✅ Type-safe configuration (Nickel contracts)\n- ✅ Generated configurations ready for deployment\n\n## Quick Start\n\n### 1. Configure a Platform Service (5 minutes)\n\n```\n# Launch interactive form for orchestrator\nprovisioning config platform orchestrator\n\n# Or use TypeDialog directly\ntypedialog form .typedialog/provisioning/platform/orchestrator/form.toml\n```\n\nThis opens an interactive form with sections for:\n\n- Workspace configuration\n- Server settings (host, port, workers)\n- Storage backend (filesystem or SurrealDB)\n- Task queue and batch settings\n- Monitoring and health checks\n- Rollback and recovery\n- Logging configuration\n- Extensions and integrations\n- Advanced settings\n\n### 2. Review Generated Configuration\n\nAfter completing the form, TypeDialog generates `config.ncl`:\n\n```\n# View what was generated\ncat workspace_librecloud/config/config.ncl\n```\n\n### 3. 
Validate Configuration\n\n```\n# Check Nickel syntax is valid\nnickel typecheck workspace_librecloud/config/config.ncl\n\n# Export to TOML for services\nprovisioning config export\n```\n\n### 4. Services Use Generated Config\n\nPlatform services automatically load the exported TOML:\n\n```\n# Orchestrator reads config/generated/platform/orchestrator.toml\nprovisioning start orchestrator\n\n# Check it's using the right config\ncat workspace_librecloud/config/generated/platform/orchestrator.toml\n```\n\n## Interactive Configuration Workflow\n\n### Recommended Approach: Use TypeDialog Forms\n\n**Best for**: Most users, no Nickel knowledge needed\n\n**Workflow**:\n\n1. Launch form for a service: `provisioning config platform orchestrator`\n2. Answer questions in interactive prompts about workspace, server, storage, queue\n3. Review what was generated: `cat workspace_librecloud/config/config.ncl`\n4. Update running services: `provisioning config export && provisioning restart orchestrator`\n\n### Advanced Approach: Manual Nickel Editing\n\n**Best for**: Users comfortable with Nickel, want full control\n\n**Workflow**:\n\n1. Create file: `touch workspace_librecloud/config/config.ncl`\n2. Edit directly: `vim workspace_librecloud/config/config.ncl`\n3. Validate syntax: `nickel typecheck workspace_librecloud/config/config.ncl`\n4. Export and deploy: `provisioning config export && provisioning restart orchestrator`\n\n## Configuration Structure\n\n### Single File, Three Sections\n\nAll configuration lives in one Nickel file with three sections:\n\n```\n# workspace_librecloud/config/config.ncl\n{\n # SECTION 1: Workspace metadata\n workspace = {\n name = "librecloud",\n path = "/Users/Akasha/project-provisioning/workspace_librecloud",\n description = "Production workspace"\n },\n\n # SECTION 2: Cloud providers\n providers = {\n upcloud = {\n enabled = true,\n api_user = "{{env.UPCLOUD_USER}}",\n api_password = "{{kms.decrypt('upcloud_pass')}}"\n },\n aws = { enabled = false },\n local = { enabled = true }\n },\n\n # SECTION 3: Platform services\n platform = {\n orchestrator = {\n enabled = true,\n server = { host = "127.0.0.1", port = 9090 },\n storage = { type = "filesystem" }\n },\n kms = {\n enabled = true,\n backend = "rustyvault",\n url = "http://localhost:8200"\n }\n }\n}\n```\n\n### Available Configuration Sections\n\n| Section | Purpose | Used By |\n| --------- | --------- | --------- |\n| `workspace` | Workspace metadata and paths | Config loader, providers |\n| `providers.upcloud` | UpCloud provider settings | UpCloud provisioning |\n| `providers.aws` | AWS provider settings | AWS provisioning |\n| `providers.local` | Local VM provider settings | Local VM provisioning |\n| **Core Platform Services** | | |\n| `platform.orchestrator` | Orchestrator service config | Orchestrator REST API |\n| `platform.control_center` | Control center service config | Control center REST API |\n| `platform.mcp_server` | MCP server service config | Model Context Protocol integration |\n| `platform.installer` | Installer service config | Infrastructure provisioning |\n| **Security & Secrets** | | |\n| `platform.vault_service` | Vault service config | Secrets management and encryption |\n| **Extensions & Registry** | | |\n| `platform.extension_registry` | Extension registry config | Extension distribution via Gitea/OCI |\n| **AI & Intelligence** | | |\n| `platform.rag` | RAG system config | Retrieval-Augmented Generation |\n| `platform.ai_service` | AI service config | AI model integration and DAG workflows |\n| 
**Operations & Daemon** | | |\n| `platform.provisioning_daemon` | Provisioning daemon config | Background provisioning operations |\n\n## Service-Specific Configuration\n\n### Orchestrator Service\n\n**Purpose**: Coordinate infrastructure operations, manage workflows, handle batch operations\n\n**Key Settings**:\n\n- **server**: HTTP server configuration (host, port, workers)\n- **storage**: Task queue storage (filesystem or SurrealDB)\n- **queue**: Task processing (concurrency, retries, timeouts)\n- **batch**: Batch operation settings (parallelism, timeouts)\n- **monitoring**: Health checks and metrics collection\n- **rollback**: Checkpoint and recovery strategy\n- **logging**: Log level and format\n\n**Example**:\n\n```\nplatform = {\n orchestrator = {\n enabled = true,\n server = {\n host = "127.0.0.1",\n port = 9090,\n workers = 4,\n keep_alive = 75,\n max_connections = 1000\n },\n storage = {\n type = "filesystem",\n backend_path = "{{workspace.path}}/.orchestrator/data/queue.rkvs"\n },\n queue = {\n max_concurrent_tasks = 5,\n retry_attempts = 3,\n retry_delay_seconds = 5,\n task_timeout_minutes = 60\n }\n }\n}\n```\n\n### KMS Service\n\n**Purpose**: Cryptographic key management, secret encryption/decryption\n\n**Key Settings**:\n\n- **backend**: KMS backend (rustyvault, age, aws, vault, cosmian)\n- **url**: Backend URL or connection string\n- **credentials**: Authentication if required\n\n**Example**:\n\n```\nplatform = {\n kms = {\n enabled = true,\n backend = "rustyvault",\n url = "http://localhost:8200"\n }\n}\n```\n\n### Control Center Service\n\n**Purpose**: Centralized monitoring and control interface\n\n**Key Settings**:\n\n- **server**: HTTP server configuration\n- **database**: Backend database connection\n- **jwt**: JWT authentication settings\n- **security**: CORS and security policies\n\n**Example**:\n\n```\nplatform = {\n control_center = {\n enabled = true,\n server = {\n host = "127.0.0.1",\n port = 8080\n }\n }\n}\n```\n\n## Deployment Modes\n\nAll platform services support four deployment modes, each with different resource allocation and feature sets:\n\n| Mode | Resources | Use Case | Storage | TLS |\n| ------ | ----------- | ---------- | --------- | ----- |\n| **solo** | Minimal (2 workers) | Development, testing | Embedded/filesystem | No |\n| **multiuser** | Moderate (4 workers) | Team environments | Shared databases | Optional |\n| **cicd** | High throughput (8+ workers) | CI/CD pipelines | Ephemeral/memory | No |\n| **enterprise** | High availability (16+ workers) | Production | Clustered/distributed | Yes |\n\n**Mode-based Configuration Loading**:\n\n```\n# Load a specific mode's configuration\nexport VAULT_MODE=enterprise\nexport REGISTRY_MODE=multiuser\nexport RAG_MODE=cicd\n\n# Services automatically resolve to correct TOML files:\n# Generated from: provisioning/schemas/platform/\n# - vault-service.enterprise.toml (generated from vault-service.ncl)\n# - extension-registry.multiuser.toml (generated from extension-registry.ncl)\n# - rag.cicd.toml (generated from rag.ncl)\n```\n\n## New Platform Services (Phase 13-19)\n\n### Vault Service\n\n**Purpose**: Secrets management, encryption, and cryptographic key storage\n\n**Key Settings**:\n\n- **server**: HTTP server configuration (host, port, workers)\n- **storage**: Backend storage (filesystem, memory, surrealdb, etcd, postgresql)\n- **vault**: Vault mounting and key management\n- **ha**: High availability clustering\n- **security**: TLS, certificate validation\n- **logging**: Log level and audit 
trails\n\n**Mode Characteristics**:\n\n- **solo**: Filesystem storage, no TLS, embedded mode\n- **multiuser**: SurrealDB backend, shared storage, TLS optional\n- **cicd**: In-memory ephemeral storage, no persistence\n- **enterprise**: Etcd HA, TLS required, audit logging enabled\n\n**Environment Variable Overrides**:\n\n```\nVAULT_CONFIG=/path/to/vault.toml # Explicit config path\nVAULT_MODE=enterprise # Mode-specific config\nVAULT_SERVER_URL=http://localhost:8200 # Server URL\nVAULT_STORAGE_BACKEND=etcd # Storage backend\nVAULT_AUTH_TOKEN=s.xxxxxxxx # Authentication token\nVAULT_TLS_VERIFY=true # TLS verification\n```\n\n**Example Configuration**:\n\n```\nplatform = {\n vault_service = {\n enabled = true,\n server = {\n host = "0.0.0.0",\n port = 8200,\n workers = 8\n },\n storage = {\n backend = "surrealdb",\n url = "http://surrealdb:8000",\n namespace = "vault",\n database = "secrets"\n },\n vault = {\n mount_point = "transit",\n key_name = "provisioning-master"\n },\n ha = {\n enabled = true\n }\n }\n}\n```\n\n### Extension Registry Service\n\n**Purpose**: Extension distribution and management via Gitea and OCI registries\n\n**Key Settings**:\n\n- **server**: HTTP server configuration (host, port, workers)\n- **gitea**: Gitea integration for extension source repository\n- **oci**: OCI registry for artifact distribution\n- **cache**: Metadata and list caching\n- **auth**: Registry authentication\n\n**Mode Characteristics**:\n\n- **solo**: Gitea only, minimal cache, CORS disabled\n- **multiuser**: Gitea + OCI, both enabled, CORS enabled\n- **cicd**: OCI only (high-throughput mode), ephemeral cache\n- **enterprise**: Both Gitea + OCI, TLS verification, large cache\n\n**Environment Variable Overrides**:\n\n```\nREGISTRY_CONFIG=/path/to/registry.toml # Explicit config path\nREGISTRY_MODE=multiuser # Mode-specific config\nREGISTRY_SERVER_HOST=0.0.0.0 # Server host\nREGISTRY_SERVER_PORT=8081 # Server port\nREGISTRY_SERVER_WORKERS=4 # Worker count\nREGISTRY_GITEA_URL=http://gitea:3000 # Gitea URL\nREGISTRY_GITEA_ORG=provisioning # Gitea organization\nREGISTRY_OCI_REGISTRY=registry.local:5000 # OCI registry\nREGISTRY_OCI_NAMESPACE=provisioning # OCI namespace\n```\n\n**Example Configuration**:\n\n```\nplatform = {\n extension_registry = {\n enabled = true,\n server = {\n host = "0.0.0.0",\n port = 8081,\n workers = 4\n },\n gitea = {\n enabled = true,\n url = "http://gitea:3000",\n org = "provisioning"\n },\n oci = {\n enabled = true,\n registry = "registry.local:5000",\n namespace = "provisioning"\n },\n cache = {\n capacity = 1000,\n ttl = 300\n }\n }\n}\n```\n\n### RAG (Retrieval-Augmented Generation) Service\n\n**Purpose**: Document retrieval, semantic search, and AI-augmented responses\n\n**Key Settings**:\n\n- **embeddings**: Embedding model provider (openai, local, anthropic)\n- **vector_db**: Vector database backend (memory, surrealdb, qdrant, milvus)\n- **llm**: Language model provider (anthropic, openai, ollama)\n- **retrieval**: Search strategy and parameters\n- **ingestion**: Document processing and indexing\n\n**Mode Characteristics**:\n\n- **solo**: Local embeddings, in-memory vector DB, Ollama LLM\n- **multiuser**: OpenAI embeddings, SurrealDB vector DB, Anthropic LLM\n- **cicd**: **RAG completely disabled** (not applicable for ephemeral pipelines)\n- **enterprise**: Large embeddings (3072-dim), distributed vector DB, Claude Opus\n\n**Environment Variable Overrides**:\n\n```\nRAG_CONFIG=/path/to/rag.toml # Explicit config path\nRAG_MODE=multiuser # Mode-specific 
config\nRAG_ENABLED=true # Enable/disable RAG\nRAG_EMBEDDINGS_PROVIDER=openai # Embedding provider\nRAG_EMBEDDINGS_API_KEY=sk-xxx # Embedding API key\nRAG_VECTOR_DB_URL=http://surrealdb:8000 # Vector DB URL\nRAG_LLM_PROVIDER=anthropic # LLM provider\nRAG_LLM_API_KEY=sk-ant-xxx # LLM API key\nRAG_VECTOR_DB_TYPE=surrealdb # Vector DB type\n```\n\n**Example Configuration**:\n\n```\nplatform = {\n rag = {\n enabled = true,\n embeddings = {\n provider = "openai",\n model = "text-embedding-3-small",\n api_key = "{{env.OPENAI_API_KEY}}"\n },\n vector_db = {\n db_type = "surrealdb",\n url = "http://surrealdb:8000",\n namespace = "rag_prod"\n },\n llm = {\n provider = "anthropic",\n model = "claude-opus-4-5-20251101",\n api_key = "{{env.ANTHROPIC_API_KEY}}"\n },\n retrieval = {\n top_k = 10,\n similarity_threshold = 0.75\n }\n }\n}\n```\n\n### AI Service\n\n**Purpose**: AI model integration with RAG and MCP support for multi-step workflows\n\n**Key Settings**:\n\n- **server**: HTTP server configuration\n- **rag**: RAG system integration\n- **mcp**: Model Context Protocol integration\n- **dag**: Directed acyclic graph task orchestration\n\n**Mode Characteristics**:\n\n- **solo**: RAG enabled, no MCP, minimal concurrency (3 tasks)\n- **multiuser**: Both RAG and MCP enabled, moderate concurrency (10 tasks)\n- **cicd**: RAG disabled, MCP enabled, high concurrency (20 tasks)\n- **enterprise**: Both enabled, max concurrency (50 tasks), full monitoring\n\n**Environment Variable Overrides**:\n\n```\nAI_SERVICE_CONFIG=/path/to/ai.toml # Explicit config path\nAI_SERVICE_MODE=enterprise # Mode-specific config\nAI_SERVICE_SERVER_PORT=8082 # Server port\nAI_SERVICE_SERVER_WORKERS=16 # Worker count\nAI_SERVICE_RAG_ENABLED=true # Enable RAG integration\nAI_SERVICE_MCP_ENABLED=true # Enable MCP integration\nAI_SERVICE_DAG_MAX_CONCURRENT_TASKS=50 # Max concurrent tasks\n```\n\n**Example Configuration**:\n\n```\nplatform = {\n ai_service = {\n enabled = true,\n server = {\n host = "0.0.0.0",\n port = 8082,\n workers = 8\n },\n rag = {\n enabled = true,\n rag_service_url = "http://rag:8083",\n timeout = 60000\n },\n mcp = {\n enabled = true,\n mcp_service_url = "http://mcp-server:8084",\n timeout = 60000\n },\n dag = {\n max_concurrent_tasks = 20,\n task_timeout = 600000,\n retry_attempts = 5\n }\n }\n}\n```\n\n### Provisioning Daemon\n\n**Purpose**: Background service for provisioning operations, workspace management, and health monitoring\n\n**Key Settings**:\n\n- **daemon**: Daemon control (poll interval, max workers)\n- **logging**: Log level and output configuration\n- **actions**: Automated actions (cleanup, updates, sync)\n- **workers**: Worker pool configuration\n- **health**: Health check settings\n\n**Mode Characteristics**:\n\n- **solo**: Minimal polling, no auto-cleanup, debug logging\n- **multiuser**: Standard polling, workspace sync enabled, info logging\n- **cicd**: Frequent polling, ephemeral cleanup, warning logging\n- **enterprise**: Standard polling, full automation, all features enabled\n\n**Environment Variable Overrides**:\n\n```\nDAEMON_CONFIG=/path/to/daemon.toml # Explicit config path\nDAEMON_MODE=enterprise # Mode-specific config\nDAEMON_POLL_INTERVAL=30 # Polling interval (seconds)\nDAEMON_MAX_WORKERS=16 # Maximum worker threads\nDAEMON_LOGGING_LEVEL=info # Log level (debug/info/warn/error)\nDAEMON_AUTO_CLEANUP=true # Enable auto cleanup\nDAEMON_AUTO_UPDATE=true # Enable auto updates\n```\n\n**Example Configuration**:\n\n```\nplatform = {\n provisioning_daemon = {\n enabled = true,\n daemon 
= {\n poll_interval = 30,\n max_workers = 8\n },\n logging = {\n level = "info",\n file = "/var/log/provisioning/daemon.log"\n },\n actions = {\n auto_cleanup = true,\n auto_update = false,\n workspace_sync = true\n }\n }\n}\n```\n\n## Using TypeDialog Forms\n\n### Form Navigation\n\n1. **Interactive Prompts**: Answer questions one at a time\n2. **Validation**: Inputs are validated as you type\n3. **Defaults**: Each field shows a sensible default\n4. **Skip Optional**: Press Enter to use default or skip optional fields\n5. **Review**: Preview generated Nickel before saving\n\n### Field Types\n\n| Type | Example | Notes |\n| ------ | --------- | ------- |\n| `text` | "127.0.0.1" | Free-form text input |\n| `confirm` | true/false | Yes/no answer |\n| `select` | "filesystem" | Choose from list |\n| `custom(u16)` | 9090 | Number input |\n| `custom(u32)` | 1000 | Larger number |\n\n### Special Values\n\n**Environment Variables**:\n\n```\napi_user = "{{env.UPCLOUD_USER}}"\napi_password = "{{env.UPCLOUD_PASSWORD}}"\n```\n\n**Workspace Paths**:\n\n```\ndata_dir = "{{workspace.path}}/.orchestrator/data"\nlogs_dir = "{{workspace.path}}/.orchestrator/logs"\n```\n\n**KMS Decryption**:\n\n```\napi_password = "{{kms.decrypt('upcloud_pass')}}"\n```\n\n## Validation & Export\n\n### Validating Configuration\n\n```\n# Check Nickel syntax\nnickel typecheck workspace_librecloud/config/config.ncl\n\n# Detailed validation with error messages\nnickel typecheck workspace_librecloud/config/config.ncl 2>&1\n\n# Schema validation happens during export\nprovisioning config export\n```\n\n### Exporting to Service Formats\n\n```\n# One-time export\nprovisioning config export\n\n# Export creates (pre-configured TOML for all services):\nworkspace_librecloud/config/generated/\n├── workspace.toml # Workspace metadata\n├── providers/\n│ ├── upcloud.toml # UpCloud provider\n│ └── local.toml # Local provider\n└── platform/\n ├── orchestrator.toml # Orchestrator service\n ├── control_center.toml # Control center service\n ├── mcp_server.toml # MCP server service\n ├── installer.toml # Installer service\n ├── kms.toml # KMS service\n ├── vault_service.toml # Vault service (new)\n ├── extension_registry.toml # Extension registry (new)\n ├── rag.toml # RAG service (new)\n ├── ai_service.toml # AI service (new)\n └── provisioning_daemon.toml # Daemon service (new)\n\n# Public Nickel Schemas (20 total for 5 new services):\nprovisioning/schemas/platform/\n├── schemas/\n│ ├── vault-service.ncl\n│ ├── extension-registry.ncl\n│ ├── rag.ncl\n│ ├── ai-service.ncl\n│ └── provisioning-daemon.ncl\n├── defaults/\n│ ├── vault-service-defaults.ncl\n│ ├── extension-registry-defaults.ncl\n│ ├── rag-defaults.ncl\n│ ├── ai-service-defaults.ncl\n│ ├── provisioning-daemon-defaults.ncl\n│ └── deployment/\n│ ├── solo-defaults.ncl\n│ ├── multiuser-defaults.ncl\n│ ├── cicd-defaults.ncl\n│ └── enterprise-defaults.ncl\n├── validators/\n├── templates/\n├── constraints/\n└── values/\n```\n\n**Using Pre-Generated Configurations**:\n\nAll 5 new services come with pre-built TOML configs for each deployment mode:\n\n```\n# View available schemas for vault service\nls -la provisioning/schemas/platform/schemas/vault-service.ncl\nls -la provisioning/schemas/platform/defaults/vault-service-defaults.ncl\n\n# Load enterprise mode\nexport VAULT_MODE=enterprise\ncargo run -p vault-service\n\n# Or load multiuser mode\nexport REGISTRY_MODE=multiuser\ncargo run -p extension-registry\n\n# All 5 services support mode-based loading\nexport RAG_MODE=cicd\nexport 
AI_SERVICE_MODE=enterprise\nexport DAEMON_MODE=multiuser\n```\n\n## Updating Configuration\n\n### Change a Setting\n\n1. **Edit source config**: `vim workspace_librecloud/config/config.ncl`\n2. **Validate changes**: `nickel typecheck workspace_librecloud/config/config.ncl`\n3. **Re-export to TOML**: `provisioning config export`\n4. **Restart affected service** (if needed): `provisioning restart orchestrator`\n\n### Using TypeDialog to Update\n\nIf you prefer interactive updating:\n\n```\n# Re-run TypeDialog form (overwrites config.ncl)\nprovisioning config platform orchestrator\n\n# Or edit via TypeDialog with existing values\ntypedialog form .typedialog/provisioning/platform/orchestrator/form.toml\n```\n\n## Troubleshooting\n\n### Form Won't Load\n\n**Problem**: `Failed to parse config file`\n\n**Solution**: Check form.toml syntax and verify required fields are present (name, description, locales_path, templates_path)\n\n```\nhead -10 .typedialog/provisioning/platform/orchestrator/form.toml\n```\n\n### Validation Fails\n\n**Problem**: `Nickel configuration validation failed`\n\n**Solution**: Check for syntax errors and correct field names\n\n```\nnickel typecheck workspace_librecloud/config/config.ncl 2>&1 | less\n```\n\nCommon issues: Missing closing braces, incorrect field names, wrong data types\n\n### Export Creates Empty Files\n\n**Problem**: Generated TOML files are empty\n\n**Solution**: Verify config.ncl exports to JSON and check all required sections exist\n\n```\nnickel export --format json workspace_librecloud/config/config.ncl | head -20\n```\n\n### Services Don't Use New Config\n\n**Problem**: Changes don't take effect\n\n**Solution**:\n\n1. Verify export succeeded: `ls -lah workspace_librecloud/config/generated/platform/`\n2. Check service path: `provisioning start orchestrator --check`\n3. 
Restart service: `provisioning restart orchestrator`\n\n## Configuration Examples\n\n### Development Setup\n\n```\n{\n workspace = {\n name = "dev",\n path = "/Users/dev/workspace",\n description = "Development workspace"\n },\n\n providers = {\n local = {\n enabled = true,\n base_path = "/opt/vms"\n },\n upcloud = { enabled = false },\n aws = { enabled = false }\n },\n\n platform = {\n orchestrator = {\n enabled = true,\n server = { host = "127.0.0.1", port = 9090 },\n storage = { type = "filesystem" },\n logging = { level = "debug", format = "json" }\n },\n kms = {\n enabled = true,\n backend = "age"\n }\n }\n}\n```\n\n### Production Setup\n\n```\n{\n workspace = {\n name = "prod",\n path = "/opt/provisioning/prod",\n description = "Production workspace"\n },\n\n providers = {\n upcloud = {\n enabled = true,\n api_user = "{{env.UPCLOUD_USER}}",\n api_password = "{{kms.decrypt('upcloud_prod')}}",\n default_zone = "de-fra1"\n },\n aws = { enabled = false },\n local = { enabled = false }\n },\n\n platform = {\n orchestrator = {\n enabled = true,\n server = { host = "0.0.0.0", port = 9090, workers = 8 },\n storage = {\n type = "surrealdb-server",\n url = "ws://surreal.internal:8000"\n },\n monitoring = {\n enabled = true,\n metrics_interval_seconds = 30\n },\n logging = { level = "info", format = "json" }\n },\n kms = {\n enabled = true,\n backend = "vault",\n url = "https://vault.internal:8200"\n }\n }\n}\n```\n\n### Multi-Provider Setup\n\n```\n{\n workspace = {\n name = "multi",\n path = "/opt/multi",\n description = "Multi-cloud workspace"\n },\n\n providers = {\n upcloud = {\n enabled = true,\n api_user = "{{env.UPCLOUD_USER}}",\n default_zone = "de-fra1",\n zones = ["de-fra1", "us-nyc1", "nl-ams1"]\n },\n aws = {\n enabled = true,\n access_key = "{{env.AWS_ACCESS_KEY_ID}}"\n },\n local = {\n enabled = true,\n base_path = "/opt/local-vms"\n }\n },\n\n platform = {\n orchestrator = {\n enabled = true,\n multi_workspace = false,\n storage = { type = "filesystem" }\n },\n kms = {\n enabled = true,\n backend = "rustyvault"\n }\n }\n}\n```\n\n## Best Practices\n\n### 1. Use TypeDialog for Initial Setup\n\nStart with TypeDialog forms for the best experience:\n\n```\nprovisioning config platform orchestrator\n```\n\n### 2. Never Edit Generated Files\n\nOnly edit the source `.ncl` file, not the generated TOML files.\n\n**Correct**: `vim workspace_librecloud/config/config.ncl`\n\n**Wrong**: `vim workspace_librecloud/config/generated/platform/orchestrator.toml`\n\n### 3. Validate Before Deploy\n\nAlways validate before deploying changes:\n\n```\nnickel typecheck workspace_librecloud/config/config.ncl\nprovisioning config export\n```\n\n### 4. Use Environment Variables for Secrets\n\nNever hardcode credentials in config. Reference environment variables or KMS:\n\n**Wrong**: `api_password = "my-password"`\n\n**Correct**: `api_password = "{{env.UPCLOUD_PASSWORD}}"`\n\n**Better**: `api_password = "{{kms.decrypt('upcloud_key')}}"`\n\n### 5. 
Document Changes\n\nAdd comments explaining custom settings in the Nickel file.\n\n## Related Documentation\n\n### Core Resources\n- **Configuration System**: See `CLAUDE.md#configuration-file-format-selection`\n- **Migration Guide**: See `provisioning/config/README.md#migration-strategy`\n- **Schema Reference**: See `provisioning/schemas/`\n- **Nickel Language**: See ADR-011 in `docs/architecture/adr/`\n\n### Platform Services\n- **Platform Services Overview**: See `provisioning/platform/*/README.md`\n- **Core Services** (Phases 8-12): orchestrator, control-center, mcp-server\n- **New Services** (Phases 13-19):\n - vault-service: Secrets management and encryption\n - extension-registry: Extension distribution via Gitea/OCI\n - rag: Retrieval-Augmented Generation system\n - ai-service: AI model integration with DAG workflows\n - provisioning-daemon: Background provisioning operations\n\n**Note**: Installer is a distribution tool (provisioning/tools/distribution/create-installer.nu), not a platform service configurable via TypeDialog.\n\n### Public Definition Locations\n- **TypeDialog Forms** (Interactive UI): `provisioning/.typedialog/platform/forms/`\n- **Nickel Schemas** (Type Definitions): `provisioning/schemas/platform/schemas/`\n- **Default Values** (Base Configuration): `provisioning/schemas/platform/defaults/`\n- **Validators** (Business Logic): `provisioning/schemas/platform/validators/`\n- **Deployment Modes** (Presets): `provisioning/schemas/platform/defaults/deployment/`\n- **Rust Integration**: `provisioning/platform/crates/*/src/config.rs`\n\n## Getting Help\n\n### Validation Errors\n\nGet detailed error messages and check available fields:\n\n```\nnickel typecheck workspace_librecloud/config/config.ncl 2>&1 | less\ngrep "prompt =" .typedialog/provisioning/platform/orchestrator/form.toml\n```\n\n### Configuration Questions\n\n```\n# Show all available config commands\nprovisioning config --help\n\n# Show help for specific service\nprovisioning config platform --help\n\n# List providers and services\nprovisioning config providers list\nprovisioning config services list\n```\n\n### Test Configuration\n\n```\n# Validate without deploying\nnickel typecheck workspace_librecloud/config/config.ncl\n\n# Export to see generated config\nprovisioning config export\n\n# Check generated files\nls -la workspace_librecloud/config/generated/\n``` +# TypeDialog Platform Configuration Guide + +**Version**: 2.0.0 +**Last Updated**: 2026-01-05 +**Status**: Production Ready +**Target Audience**: DevOps Engineers, Infrastructure Administrators + +**Services Covered**: 8 platform services (orchestrator, control-center, mcp-server, vault-service, extension-registry, rag, ai-service, +provisioning-daemon) + +Interactive configuration for cloud-native infrastructure platform services using TypeDialog forms and Nickel. + +## Overview + +**TypeDialog** is an interactive form system that generates Nickel configurations for platform services. Instead of manually editing TOML or KCL +files, you answer questions in an interactive form, and TypeDialog generates validated Nickel configuration. + +**Benefits**: + +- ✅ No manual TOML editing required +- ✅ Interactive guidance for each setting +- ✅ Automatic validation of inputs +- ✅ Type-safe configuration (Nickel contracts) +- ✅ Generated configurations ready for deployment + +## Quick Start + +### 1. 
Configure a Platform Service (5 minutes) + +```text +# Launch interactive form for orchestrator +provisioning config platform orchestrator + +# Or use TypeDialog directly +typedialog form .typedialog/provisioning/platform/orchestrator/form.toml +``` + +This opens an interactive form with sections for: + +- Workspace configuration +- Server settings (host, port, workers) +- Storage backend (filesystem or SurrealDB) +- Task queue and batch settings +- Monitoring and health checks +- Rollback and recovery +- Logging configuration +- Extensions and integrations +- Advanced settings + +### 2. Review Generated Configuration + +After completing the form, TypeDialog generates `config.ncl`: + +```text +# View what was generated +cat workspace_librecloud/config/config.ncl +``` + +### 3. Validate Configuration + +```text +# Check Nickel syntax is valid +nickel typecheck workspace_librecloud/config/config.ncl + +# Export to TOML for services +provisioning config export +``` + +### 4. Services Use Generated Config + +Platform services automatically load the exported TOML: + +```text +# Orchestrator reads config/generated/platform/orchestrator.toml +provisioning start orchestrator + +# Check it's using the right config +cat workspace_librecloud/config/generated/platform/orchestrator.toml +``` + +## Interactive Configuration Workflow + +### Recommended Approach: Use TypeDialog Forms + +**Best for**: Most users, no Nickel knowledge needed + +**Workflow**: + +1. Launch form for a service: `provisioning config platform orchestrator` +2. Answer questions in interactive prompts about workspace, server, storage, queue +3. Review what was generated: `cat workspace_librecloud/config/config.ncl` +4. Update running services: `provisioning config export && provisioning restart orchestrator` + +### Advanced Approach: Manual Nickel Editing + +**Best for**: Users comfortable with Nickel, want full control + +**Workflow**: + +1. Create file: `touch workspace_librecloud/config/config.ncl` +2. Edit directly: `vim workspace_librecloud/config/config.ncl` +3. Validate syntax: `nickel typecheck workspace_librecloud/config/config.ncl` +4. 
Export and deploy: `provisioning config export && provisioning restart orchestrator` + +## Configuration Structure + +### Single File, Three Sections + +All configuration lives in one Nickel file with three sections: + +```text +# workspace_librecloud/config/config.ncl +{ + # SECTION 1: Workspace metadata + workspace = { + name = "librecloud", + path = "/Users/Akasha/project-provisioning/workspace_librecloud", + description = "Production workspace" + }, + + # SECTION 2: Cloud providers + providers = { + upcloud = { + enabled = true, + api_user = "{{env.UPCLOUD_USER}}", + api_password = "{{kms.decrypt('upcloud_pass')}}" + }, + aws = { enabled = false }, + local = { enabled = true } + }, + + # SECTION 3: Platform services + platform = { + orchestrator = { + enabled = true, + server = { host = "127.0.0.1", port = 9090 }, + storage = { type = "filesystem" } + }, + kms = { + enabled = true, + backend = "rustyvault", + url = "http://localhost:8200" + } + } +} +``` + +### Available Configuration Sections + +| Section | Purpose | Used By | +| --------- | --------- | --------- | +| `workspace` | Workspace metadata and paths | Config loader, providers | +| `providers.upcloud` | UpCloud provider settings | UpCloud provisioning | +| `providers.aws` | AWS provider settings | AWS provisioning | +| `providers.local` | Local VM provider settings | Local VM provisioning | +| **Core Platform Services** | | | +| `platform.orchestrator` | Orchestrator service config | Orchestrator REST API | +| `platform.control_center` | Control center service config | Control center REST API | +| `platform.mcp_server` | MCP server service config | Model Context Protocol integration | +| `platform.installer` | Installer service config | Infrastructure provisioning | +| **Security & Secrets** | | | +| `platform.vault_service` | Vault service config | Secrets management and encryption | +| **Extensions & Registry** | | | +| `platform.extension_registry` | Extension registry config | Extension distribution via Gitea/OCI | +| **AI & Intelligence** | | | +| `platform.rag` | RAG system config | Retrieval-Augmented Generation | +| `platform.ai_service` | AI service config | AI model integration and DAG workflows | +| **Operations & Daemon** | | | +| `platform.provisioning_daemon` | Provisioning daemon config | Background provisioning operations | + +## Service-Specific Configuration + +### Orchestrator Service + +**Purpose**: Coordinate infrastructure operations, manage workflows, handle batch operations + +**Key Settings**: + +- **server**: HTTP server configuration (host, port, workers) +- **storage**: Task queue storage (filesystem or SurrealDB) +- **queue**: Task processing (concurrency, retries, timeouts) +- **batch**: Batch operation settings (parallelism, timeouts) +- **monitoring**: Health checks and metrics collection +- **rollback**: Checkpoint and recovery strategy +- **logging**: Log level and format + +**Example**: + +```text +platform = { + orchestrator = { + enabled = true, + server = { + host = "127.0.0.1", + port = 9090, + workers = 4, + keep_alive = 75, + max_connections = 1000 + }, + storage = { + type = "filesystem", + backend_path = "{{workspace.path}}/.orchestrator/data/queue.rkvs" + }, + queue = { + max_concurrent_tasks = 5, + retry_attempts = 3, + retry_delay_seconds = 5, + task_timeout_minutes = 60 + } + } +} +``` + +### KMS Service + +**Purpose**: Cryptographic key management, secret encryption/decryption + +**Key Settings**: + +- **backend**: KMS backend (rustyvault, age, aws, vault, cosmian) +- **url**: 
Backend URL or connection string +- **credentials**: Authentication if required + +**Example**: + +```text +platform = { + kms = { + enabled = true, + backend = "rustyvault", + url = "http://localhost:8200" + } +} +``` + +### Control Center Service + +**Purpose**: Centralized monitoring and control interface + +**Key Settings**: + +- **server**: HTTP server configuration +- **database**: Backend database connection +- **jwt**: JWT authentication settings +- **security**: CORS and security policies + +**Example**: + +```text +platform = { + control_center = { + enabled = true, + server = { + host = "127.0.0.1", + port = 8080 + } + } +} +``` + +## Deployment Modes + +All platform services support four deployment modes, each with different resource allocation and feature sets: + +| Mode | Resources | Use Case | Storage | TLS | +| ------ | ----------- | ---------- | --------- | ----- | +| **solo** | Minimal (2 workers) | Development, testing | Embedded/filesystem | No | +| **multiuser** | Moderate (4 workers) | Team environments | Shared databases | Optional | +| **cicd** | High throughput (8+ workers) | CI/CD pipelines | Ephemeral/memory | No | +| **enterprise** | High availability (16+ workers) | Production | Clustered/distributed | Yes | + +**Mode-based Configuration Loading**: + +```text +# Load a specific mode's configuration +export VAULT_MODE=enterprise +export REGISTRY_MODE=multiuser +export RAG_MODE=cicd + +# Services automatically resolve to correct TOML files: +# Generated from: provisioning/schemas/platform/ +# - vault-service.enterprise.toml (generated from vault-service.ncl) +# - extension-registry.multiuser.toml (generated from extension-registry.ncl) +# - rag.cicd.toml (generated from rag.ncl) +``` + +## New Platform Services (Phase 13-19) + +### Vault Service + +**Purpose**: Secrets management, encryption, and cryptographic key storage + +**Key Settings**: + +- **server**: HTTP server configuration (host, port, workers) +- **storage**: Backend storage (filesystem, memory, surrealdb, etcd, postgresql) +- **vault**: Vault mounting and key management +- **ha**: High availability clustering +- **security**: TLS, certificate validation +- **logging**: Log level and audit trails + +**Mode Characteristics**: + +- **solo**: Filesystem storage, no TLS, embedded mode +- **multiuser**: SurrealDB backend, shared storage, TLS optional +- **cicd**: In-memory ephemeral storage, no persistence +- **enterprise**: Etcd HA, TLS required, audit logging enabled + +**Environment Variable Overrides**: + +```text +VAULT_CONFIG=/path/to/vault.toml # Explicit config path +VAULT_MODE=enterprise # Mode-specific config +VAULT_SERVER_URL=http://localhost:8200 # Server URL +VAULT_STORAGE_BACKEND=etcd # Storage backend +VAULT_AUTH_TOKEN=s.xxxxxxxx # Authentication token +VAULT_TLS_VERIFY=true # TLS verification +``` + +**Example Configuration**: + +```text +platform = { + vault_service = { + enabled = true, + server = { + host = "0.0.0.0", + port = 8200, + workers = 8 + }, + storage = { + backend = "surrealdb", + url = "http://surrealdb:8000", + namespace = "vault", + database = "secrets" + }, + vault = { + mount_point = "transit", + key_name = "provisioning-master" + }, + ha = { + enabled = true + } + } +} +``` + +### Extension Registry Service + +**Purpose**: Extension distribution and management via Gitea and OCI registries + +**Key Settings**: + +- **server**: HTTP server configuration (host, port, workers) +- **gitea**: Gitea integration for extension source repository +- **oci**: OCI registry for 
artifact distribution +- **cache**: Metadata and list caching +- **auth**: Registry authentication + +**Mode Characteristics**: + +- **solo**: Gitea only, minimal cache, CORS disabled +- **multiuser**: Gitea + OCI, both enabled, CORS enabled +- **cicd**: OCI only (high-throughput mode), ephemeral cache +- **enterprise**: Both Gitea + OCI, TLS verification, large cache + +**Environment Variable Overrides**: + +```text +REGISTRY_CONFIG=/path/to/registry.toml # Explicit config path +REGISTRY_MODE=multiuser # Mode-specific config +REGISTRY_SERVER_HOST=0.0.0.0 # Server host +REGISTRY_SERVER_PORT=8081 # Server port +REGISTRY_SERVER_WORKERS=4 # Worker count +REGISTRY_GITEA_URL=http://gitea:3000 # Gitea URL +REGISTRY_GITEA_ORG=provisioning # Gitea organization +REGISTRY_OCI_REGISTRY=registry.local:5000 # OCI registry +REGISTRY_OCI_NAMESPACE=provisioning # OCI namespace +``` + +**Example Configuration**: + +```text +platform = { + extension_registry = { + enabled = true, + server = { + host = "0.0.0.0", + port = 8081, + workers = 4 + }, + gitea = { + enabled = true, + url = "http://gitea:3000", + org = "provisioning" + }, + oci = { + enabled = true, + registry = "registry.local:5000", + namespace = "provisioning" + }, + cache = { + capacity = 1000, + ttl = 300 + } + } +} +``` + +### RAG (Retrieval-Augmented Generation) Service + +**Purpose**: Document retrieval, semantic search, and AI-augmented responses + +**Key Settings**: + +- **embeddings**: Embedding model provider (openai, local, anthropic) +- **vector_db**: Vector database backend (memory, surrealdb, qdrant, milvus) +- **llm**: Language model provider (anthropic, openai, ollama) +- **retrieval**: Search strategy and parameters +- **ingestion**: Document processing and indexing + +**Mode Characteristics**: + +- **solo**: Local embeddings, in-memory vector DB, Ollama LLM +- **multiuser**: OpenAI embeddings, SurrealDB vector DB, Anthropic LLM +- **cicd**: **RAG completely disabled** (not applicable for ephemeral pipelines) +- **enterprise**: Large embeddings (3072-dim), distributed vector DB, Claude Opus + +**Environment Variable Overrides**: + +```text +RAG_CONFIG=/path/to/rag.toml # Explicit config path +RAG_MODE=multiuser # Mode-specific config +RAG_ENABLED=true # Enable/disable RAG +RAG_EMBEDDINGS_PROVIDER=openai # Embedding provider +RAG_EMBEDDINGS_API_KEY=sk-xxx # Embedding API key +RAG_VECTOR_DB_URL=http://surrealdb:8000 # Vector DB URL +RAG_LLM_PROVIDER=anthropic # LLM provider +RAG_LLM_API_KEY=sk-ant-xxx # LLM API key +RAG_VECTOR_DB_TYPE=surrealdb # Vector DB type +``` + +**Example Configuration**: + +```text +platform = { + rag = { + enabled = true, + embeddings = { + provider = "openai", + model = "text-embedding-3-small", + api_key = "{{env.OPENAI_API_KEY}}" + }, + vector_db = { + db_type = "surrealdb", + url = "http://surrealdb:8000", + namespace = "rag_prod" + }, + llm = { + provider = "anthropic", + model = "claude-opus-4-5-20251101", + api_key = "{{env.ANTHROPIC_API_KEY}}" + }, + retrieval = { + top_k = 10, + similarity_threshold = 0.75 + } + } +} +``` + +### AI Service + +**Purpose**: AI model integration with RAG and MCP support for multi-step workflows + +**Key Settings**: + +- **server**: HTTP server configuration +- **rag**: RAG system integration +- **mcp**: Model Context Protocol integration +- **dag**: Directed acyclic graph task orchestration + +**Mode Characteristics**: + +- **solo**: RAG enabled, no MCP, minimal concurrency (3 tasks) +- **multiuser**: Both RAG and MCP enabled, moderate concurrency (10 tasks) +- 
**cicd**: RAG disabled, MCP enabled, high concurrency (20 tasks) +- **enterprise**: Both enabled, max concurrency (50 tasks), full monitoring + +**Environment Variable Overrides**: + +```text +AI_SERVICE_CONFIG=/path/to/ai.toml # Explicit config path +AI_SERVICE_MODE=enterprise # Mode-specific config +AI_SERVICE_SERVER_PORT=8082 # Server port +AI_SERVICE_SERVER_WORKERS=16 # Worker count +AI_SERVICE_RAG_ENABLED=true # Enable RAG integration +AI_SERVICE_MCP_ENABLED=true # Enable MCP integration +AI_SERVICE_DAG_MAX_CONCURRENT_TASKS=50 # Max concurrent tasks +``` + +**Example Configuration**: + +```text +platform = { + ai_service = { + enabled = true, + server = { + host = "0.0.0.0", + port = 8082, + workers = 8 + }, + rag = { + enabled = true, + rag_service_url = "http://rag:8083", + timeout = 60000 + }, + mcp = { + enabled = true, + mcp_service_url = "http://mcp-server:8084", + timeout = 60000 + }, + dag = { + max_concurrent_tasks = 20, + task_timeout = 600000, + retry_attempts = 5 + } + } +} +``` + +### Provisioning Daemon + +**Purpose**: Background service for provisioning operations, workspace management, and health monitoring + +**Key Settings**: + +- **daemon**: Daemon control (poll interval, max workers) +- **logging**: Log level and output configuration +- **actions**: Automated actions (cleanup, updates, sync) +- **workers**: Worker pool configuration +- **health**: Health check settings + +**Mode Characteristics**: + +- **solo**: Minimal polling, no auto-cleanup, debug logging +- **multiuser**: Standard polling, workspace sync enabled, info logging +- **cicd**: Frequent polling, ephemeral cleanup, warning logging +- **enterprise**: Standard polling, full automation, all features enabled + +**Environment Variable Overrides**: + +```text +DAEMON_CONFIG=/path/to/daemon.toml # Explicit config path +DAEMON_MODE=enterprise # Mode-specific config +DAEMON_POLL_INTERVAL=30 # Polling interval (seconds) +DAEMON_MAX_WORKERS=16 # Maximum worker threads +DAEMON_LOGGING_LEVEL=info # Log level (debug/info/warn/error) +DAEMON_AUTO_CLEANUP=true # Enable auto cleanup +DAEMON_AUTO_UPDATE=true # Enable auto updates +``` + +**Example Configuration**: + +```text +platform = { + provisioning_daemon = { + enabled = true, + daemon = { + poll_interval = 30, + max_workers = 8 + }, + logging = { + level = "info", + file = "/var/log/provisioning/daemon.log" + }, + actions = { + auto_cleanup = true, + auto_update = false, + workspace_sync = true + } + } +} +``` + +## Using TypeDialog Forms + +### Form Navigation + +1. **Interactive Prompts**: Answer questions one at a time +2. **Validation**: Inputs are validated as you type +3. **Defaults**: Each field shows a sensible default +4. **Skip Optional**: Press Enter to use default or skip optional fields +5. 
**Review**: Preview generated Nickel before saving + +### Field Types + +| Type | Example | Notes | +| ------ | --------- | ------- | +| `text` | "127.0.0.1" | Free-form text input | +| `confirm` | true/false | Yes/no answer | +| `select` | "filesystem" | Choose from list | +| `custom(u16)` | 9090 | Number input | +| `custom(u32)` | 1000 | Larger number | + +### Special Values + +**Environment Variables**: + +```text +api_user = "{{env.UPCLOUD_USER}}" +api_password = "{{env.UPCLOUD_PASSWORD}}" +``` + +**Workspace Paths**: + +```text +data_dir = "{{workspace.path}}/.orchestrator/data" +logs_dir = "{{workspace.path}}/.orchestrator/logs" +``` + +**KMS Decryption**: + +```text +api_password = "{{kms.decrypt('upcloud_pass')}}" +``` + +## Validation & Export + +### Validating Configuration + +```text +# Check Nickel syntax +nickel typecheck workspace_librecloud/config/config.ncl + +# Detailed validation with error messages +nickel typecheck workspace_librecloud/config/config.ncl 2>&1 + +# Schema validation happens during export +provisioning config export +``` + +### Exporting to Service Formats + +```text +# One-time export +provisioning config export + +# Export creates (pre-configured TOML for all services): +workspace_librecloud/config/generated/ +├── workspace.toml # Workspace metadata +├── providers/ +│ ├── upcloud.toml # UpCloud provider +│ └── local.toml # Local provider +└── platform/ + ├── orchestrator.toml # Orchestrator service + ├── control_center.toml # Control center service + ├── mcp_server.toml # MCP server service + ├── installer.toml # Installer service + ├── kms.toml # KMS service + ├── vault_service.toml # Vault service (new) + ├── extension_registry.toml # Extension registry (new) + ├── rag.toml # RAG service (new) + ├── ai_service.toml # AI service (new) + └── provisioning_daemon.toml # Daemon service (new) + +# Public Nickel Schemas (20 total for 5 new services): +provisioning/schemas/platform/ +├── schemas/ +│ ├── vault-service.ncl +│ ├── extension-registry.ncl +│ ├── rag.ncl +│ ├── ai-service.ncl +│ └── provisioning-daemon.ncl +├── defaults/ +│ ├── vault-service-defaults.ncl +│ ├── extension-registry-defaults.ncl +│ ├── rag-defaults.ncl +│ ├── ai-service-defaults.ncl +│ ├── provisioning-daemon-defaults.ncl +│ └── deployment/ +│ ├── solo-defaults.ncl +│ ├── multiuser-defaults.ncl +│ ├── cicd-defaults.ncl +│ └── enterprise-defaults.ncl +├── validators/ +├── templates/ +├── constraints/ +└── values/ +``` + +**Using Pre-Generated Configurations**: + +All 5 new services come with pre-built TOML configs for each deployment mode: + +```text +# View available schemas for vault service +ls -la provisioning/schemas/platform/schemas/vault-service.ncl +ls -la provisioning/schemas/platform/defaults/vault-service-defaults.ncl + +# Load enterprise mode +export VAULT_MODE=enterprise +cargo run -p vault-service + +# Or load multiuser mode +export REGISTRY_MODE=multiuser +cargo run -p extension-registry + +# All 5 services support mode-based loading +export RAG_MODE=cicd +export AI_SERVICE_MODE=enterprise +export DAEMON_MODE=multiuser +``` + +## Updating Configuration + +### Change a Setting + +1. **Edit source config**: `vim workspace_librecloud/config/config.ncl` +2. **Validate changes**: `nickel typecheck workspace_librecloud/config/config.ncl` +3. **Re-export to TOML**: `provisioning config export` +4. 
**Restart affected service** (if needed): `provisioning restart orchestrator` + +### Using TypeDialog to Update + +If you prefer interactive updating: + +```text +# Re-run TypeDialog form (overwrites config.ncl) +provisioning config platform orchestrator + +# Or edit via TypeDialog with existing values +typedialog form .typedialog/provisioning/platform/orchestrator/form.toml +``` + +## Troubleshooting + +### Form Won't Load + +**Problem**: `Failed to parse config file` + +**Solution**: Check form.toml syntax and verify required fields are present (name, description, locales_path, templates_path) + +```text +head -10 .typedialog/provisioning/platform/orchestrator/form.toml +``` + +### Validation Fails + +**Problem**: `Nickel configuration validation failed` + +**Solution**: Check for syntax errors and correct field names + +```text +nickel typecheck workspace_librecloud/config/config.ncl 2>&1 | less +``` + +Common issues: Missing closing braces, incorrect field names, wrong data types + +### Export Creates Empty Files + +**Problem**: Generated TOML files are empty + +**Solution**: Verify config.ncl exports to JSON and check all required sections exist + +```text +nickel export --format json workspace_librecloud/config/config.ncl | head -20 +``` + +### Services Don't Use New Config + +**Problem**: Changes don't take effect + +**Solution**: + +1. Verify export succeeded: `ls -lah workspace_librecloud/config/generated/platform/` +2. Check service path: `provisioning start orchestrator --check` +3. Restart service: `provisioning restart orchestrator` + +## Configuration Examples + +### Development Setup + +```text +{ + workspace = { + name = "dev", + path = "/Users/dev/workspace", + description = "Development workspace" + }, + + providers = { + local = { + enabled = true, + base_path = "/opt/vms" + }, + upcloud = { enabled = false }, + aws = { enabled = false } + }, + + platform = { + orchestrator = { + enabled = true, + server = { host = "127.0.0.1", port = 9090 }, + storage = { type = "filesystem" }, + logging = { level = "debug", format = "json" } + }, + kms = { + enabled = true, + backend = "age" + } + } +} +``` + +### Production Setup + +```text +{ + workspace = { + name = "prod", + path = "/opt/provisioning/prod", + description = "Production workspace" + }, + + providers = { + upcloud = { + enabled = true, + api_user = "{{env.UPCLOUD_USER}}", + api_password = "{{kms.decrypt('upcloud_prod')}}", + default_zone = "de-fra1" + }, + aws = { enabled = false }, + local = { enabled = false } + }, + + platform = { + orchestrator = { + enabled = true, + server = { host = "0.0.0.0", port = 9090, workers = 8 }, + storage = { + type = "surrealdb-server", + url = "ws://surreal.internal:8000" + }, + monitoring = { + enabled = true, + metrics_interval_seconds = 30 + }, + logging = { level = "info", format = "json" } + }, + kms = { + enabled = true, + backend = "vault", + url = "https://vault.internal:8200" + } + } +} +``` + +### Multi-Provider Setup + +```text +{ + workspace = { + name = "multi", + path = "/opt/multi", + description = "Multi-cloud workspace" + }, + + providers = { + upcloud = { + enabled = true, + api_user = "{{env.UPCLOUD_USER}}", + default_zone = "de-fra1", + zones = ["de-fra1", "us-nyc1", "nl-ams1"] + }, + aws = { + enabled = true, + access_key = "{{env.AWS_ACCESS_KEY_ID}}" + }, + local = { + enabled = true, + base_path = "/opt/local-vms" + } + }, + + platform = { + orchestrator = { + enabled = true, + multi_workspace = false, + storage = { type = "filesystem" } + }, + kms = { + 
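+    # Note: no url is set in this example; the rustyvault backend is assumed
+    # here to use its default local endpoint (see the KMS Service section
+    # above, which shows an explicit url). Illustrative comment only.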
enabled = true, + backend = "rustyvault" + } + } +} +``` + +## Best Practices + +### 1. Use TypeDialog for Initial Setup + +Start with TypeDialog forms for the best experience: + +```text +provisioning config platform orchestrator +``` + +### 2. Never Edit Generated Files + +Only edit the source `.ncl` file, not the generated TOML files. + +**Correct**: `vim workspace_librecloud/config/config.ncl` + +**Wrong**: `vim workspace_librecloud/config/generated/platform/orchestrator.toml` + +### 3. Validate Before Deploy + +Always validate before deploying changes: + +```text +nickel typecheck workspace_librecloud/config/config.ncl +provisioning config export +``` + +### 4. Use Environment Variables for Secrets + +Never hardcode credentials in config. Reference environment variables or KMS: + +**Wrong**: `api_password = "my-password"` + +**Correct**: `api_password = "{{env.UPCLOUD_PASSWORD}}"` + +**Better**: `api_password = "{{kms.decrypt('upcloud_key')}}"` + +### 5. Document Changes + +Add comments explaining custom settings in the Nickel file. + +## Related Documentation + +### Core Resources +- **Configuration System**: See `CLAUDE.md#configuration-file-format-selection` +- **Migration Guide**: See `provisioning/config/README.md#migration-strategy` +- **Schema Reference**: See `provisioning/schemas/` +- **Nickel Language**: See ADR-011 in `docs/architecture/adr/` + +### Platform Services +- **Platform Services Overview**: See `provisioning/platform/*/README.md` +- **Core Services** (Phases 8-12): orchestrator, control-center, mcp-server +- **New Services** (Phases 13-19): + - vault-service: Secrets management and encryption + - extension-registry: Extension distribution via Gitea/OCI + - rag: Retrieval-Augmented Generation system + - ai-service: AI model integration with DAG workflows + - provisioning-daemon: Background provisioning operations + +**Note**: Installer is a distribution tool (provisioning/tools/distribution/create-installer.nu), not a platform service configurable via TypeDialog. 
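+
+For a feel of how the mode presets under `defaults/deployment/` can layer over
+service defaults, here is a minimal, illustrative Nickel sketch. It assumes only
+standard Nickel record-merge semantics (`&` with `default`-priority fields); the
+field names are hypothetical and do not mirror the real schema files.
+
+```text
+# Service defaults: low-priority values that a mode preset may override.
+let service_defaults = {
+  server = {
+    host | default = "127.0.0.1",
+    port | default = 8200,
+    workers | default = 2,
+  },
+} in
+
+# An "enterprise"-style preset raising the worker count.
+let enterprise_preset = {
+  server = { workers = 16 },
+} in
+
+# Record merge: default-priority values yield to the preset.
+service_defaults & enterprise_preset
+# => { server = { host = "127.0.0.1", port = 8200, workers = 16 } }
+```
+
+Evaluating the sketch with `nickel eval` shows `workers` taken from the preset
+while `host` and `port` keep their defaults, which is the layering behavior the
+mode table above describes.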
+ +### Public Definition Locations +- **TypeDialog Forms** (Interactive UI): `provisioning/.typedialog/platform/forms/` +- **Nickel Schemas** (Type Definitions): `provisioning/schemas/platform/schemas/` +- **Default Values** (Base Configuration): `provisioning/schemas/platform/defaults/` +- **Validators** (Business Logic): `provisioning/schemas/platform/validators/` +- **Deployment Modes** (Presets): `provisioning/schemas/platform/defaults/deployment/` +- **Rust Integration**: `provisioning/platform/crates/*/src/config.rs` + +## Getting Help + +### Validation Errors + +Get detailed error messages and check available fields: + +```text +nickel typecheck workspace_librecloud/config/config.ncl 2>&1 | less +grep "prompt =" .typedialog/provisioning/platform/orchestrator/form.toml +``` + +### Configuration Questions + +```text +# Show all available config commands +provisioning config --help + +# Show help for specific service +provisioning config platform --help + +# List providers and services +provisioning config providers list +provisioning config services list +``` + +### Test Configuration + +```text +# Validate without deploying +nickel typecheck workspace_librecloud/config/config.ncl + +# Export to see generated config +provisioning config export + +# Check generated files +ls -la workspace_librecloud/config/generated/ +``` \ No newline at end of file diff --git a/docs/src/development/workflow.md b/docs/src/development/workflow.md index ec5007f..831c40f 100644 --- a/docs/src/development/workflow.md +++ b/docs/src/development/workflow.md @@ -1 +1,1065 @@ -# Development Workflow Guide\n\nThis document outlines the recommended development workflows, coding practices, testing strategies, and debugging techniques for the provisioning\nproject.\n\n## Table of Contents\n\n1. [Overview](#overview)\n2. [Development Setup](#development-setup)\n3. [Daily Development Workflow](#daily-development-workflow)\n4. [Code Organization](#code-organization)\n5. [Testing Strategies](#testing-strategies)\n6. [Debugging Techniques](#debugging-techniques)\n7. [Integration Workflows](#integration-workflows)\n8. [Collaboration Guidelines](#collaboration-guidelines)\n9. [Quality Assurance](#quality-assurance)\n10. [Best Practices](#best-practices)\n\n## Overview\n\nThe provisioning project employs a multi-language, multi-component architecture requiring specific development workflows to maintain consistency,\nquality, and efficiency.\n\n**Key Technologies**:\n\n- **Nushell**: Primary scripting and automation language\n- **Rust**: High-performance system components\n- **KCL**: Configuration language and schemas\n- **TOML**: Configuration files\n- **Jinja2**: Template engine\n\n**Development Principles**:\n\n- **Configuration-Driven**: Never hardcode, always configure\n- **Hybrid Architecture**: Rust for performance, Nushell for flexibility\n- **Test-First**: Comprehensive testing at all levels\n- **Documentation-Driven**: Code and APIs are self-documenting\n\n## Development Setup\n\n### Initial Environment Setup\n\n**1. Clone and Navigate**:\n\n```\n# Clone repository\ngit clone https://github.com/company/provisioning-system.git\ncd provisioning-system\n\n# Navigate to workspace\ncd workspace/tools\n```\n\n**2. Initialize Workspace**:\n\n```\n# Initialize development workspace\nnu workspace.nu init --user-name $USER --infra-name dev-env\n\n# Check workspace health\nnu workspace.nu health --detailed --fix-issues\n```\n\n**3. 
Configure Development Environment**:\n\n```\n# Create user configuration\ncp workspace/config/local-overrides.toml.example workspace/config/$USER.toml\n\n# Edit configuration for development\n$EDITOR workspace/config/$USER.toml\n```\n\n**4. Set Up Build System**:\n\n```\n# Navigate to build tools\ncd src/tools\n\n# Check build prerequisites\nmake info\n\n# Perform initial build\nmake dev-build\n```\n\n### Tool Installation\n\n**Required Tools**:\n\n```\n# Install Nushell\ncargo install nu\n\n# Install Nickel\ncargo install nickel\n\n# Install additional tools\ncargo install cross # Cross-compilation\ncargo install cargo-audit # Security auditing\ncargo install cargo-watch # File watching\n```\n\n**Optional Development Tools**:\n\n```\n# Install development enhancers\ncargo install nu_plugin_tera # Template plugin\ncargo install sops # Secrets management\nbrew install k9s # Kubernetes management\n```\n\n### IDE Configuration\n\n**VS Code Setup** (`.vscode/settings.json`):\n\n```\n{\n "files.associations": {\n "*.nu": "shellscript",\n "*.ncl": "nickel",\n "*.toml": "toml"\n },\n "nushell.shellPath": "/usr/local/bin/nu",\n "rust-analyzer.cargo.features": "all",\n "editor.formatOnSave": true,\n "editor.rulers": [100],\n "files.trimTrailingWhitespace": true\n}\n```\n\n**Recommended Extensions**:\n\n- Nushell Language Support\n- Rust Analyzer\n- Nickel Language Support\n- TOML Language Support\n- Better TOML\n\n## Daily Development Workflow\n\n### Morning Routine\n\n**1. Sync and Update**:\n\n```\n# Sync with upstream\ngit pull origin main\n\n# Update workspace\ncd workspace/tools\nnu workspace.nu health --fix-issues\n\n# Check for updates\nnu workspace.nu status --detailed\n```\n\n**2. Review Current State**:\n\n```\n# Check current infrastructure\nprovisioning show servers\nprovisioning show settings\n\n# Review workspace status\nnu workspace.nu status\n```\n\n### Development Cycle\n\n**1. Feature Development**:\n\n```\n# Create feature branch\ngit checkout -b feature/new-provider-support\n\n# Start development environment\ncd workspace/tools\nnu workspace.nu init --workspace-type development\n\n# Begin development\n$EDITOR workspace/extensions/providers/new-provider/nulib/provider.nu\n```\n\n**2. Incremental Testing**:\n\n```\n# Test syntax during development\nnu --check workspace/extensions/providers/new-provider/nulib/provider.nu\n\n# Run unit tests\nnu workspace/extensions/providers/new-provider/tests/unit/basic-test.nu\n\n# Integration testing\nnu workspace.nu tools test-extension providers/new-provider\n```\n\n**3. Build and Validate**:\n\n```\n# Quick development build\ncd src/tools\nmake dev-build\n\n# Validate changes\nmake validate-all\n\n# Test distribution\nmake test-dist\n```\n\n### Testing During Development\n\n**Unit Testing**:\n\n```\n# Add test examples to functions\ndef create-server [name: string] -> record {\n # @test: "test-server" -> {name: "test-server", status: "created"}\n # Implementation here\n}\n```\n\n**Integration Testing**:\n\n```\n# Test with real infrastructure\nnu workspace/extensions/providers/new-provider/nulib/provider.nu \\n create-server test-server --dry-run\n\n# Test with workspace isolation\nPROVISIONING_WORKSPACE_USER=$USER provisioning server create test-server --check\n```\n\n### End-of-Day Routine\n\n**1. 
Commit Progress**:\n\n```\n# Stage changes\ngit add .\n\n# Commit with descriptive message\ngit commit -m "feat(provider): add new cloud provider support\n\n- Implement basic server creation\n- Add configuration schema\n- Include unit tests\n- Update documentation"\n\n# Push to feature branch\ngit push origin feature/new-provider-support\n```\n\n**2. Workspace Maintenance**:\n\n```\n# Clean up development data\nnu workspace.nu cleanup --type cache --age 1d\n\n# Backup current state\nnu workspace.nu backup --auto-name --components config,extensions\n\n# Check workspace health\nnu workspace.nu health\n```\n\n## Code Organization\n\n### Nushell Code Structure\n\n**File Organization**:\n\n```\nExtension Structure:\n├── nulib/\n│ ├── main.nu # Main entry point\n│ ├── core/ # Core functionality\n│ │ ├── api.nu # API interactions\n│ │ ├── config.nu # Configuration handling\n│ │ └── utils.nu # Utility functions\n│ ├── commands/ # User commands\n│ │ ├── create.nu # Create operations\n│ │ ├── delete.nu # Delete operations\n│ │ └── list.nu # List operations\n│ └── tests/ # Test files\n│ ├── unit/ # Unit tests\n│ └── integration/ # Integration tests\n└── templates/ # Template files\n ├── config.j2 # Configuration templates\n └── manifest.j2 # Manifest templates\n```\n\n**Function Naming Conventions**:\n\n```\n# Use kebab-case for commands\ndef create-server [name: string] -> record { ... }\ndef validate-config [config: record] -> bool { ... }\n\n# Use snake_case for internal functions\ndef get_api_client [] -> record { ... }\ndef parse_config_file [path: string] -> record { ... }\n\n# Use descriptive prefixes\ndef check-server-status [server: string] -> string { ... }\ndef get-server-info [server: string] -> record { ... }\ndef list-available-zones [] -> list { ... }\n```\n\n**Error Handling Pattern**:\n\n```\ndef create-server [\n name: string\n --dry-run: bool = false\n] -> record {\n # 1. Validate inputs\n if ($name | str length) == 0 {\n error make {\n msg: "Server name cannot be empty"\n label: {\n text: "empty name provided"\n span: (metadata $name).span\n }\n }\n }\n\n # 2. Check prerequisites\n let config = try {\n get-provider-config\n } catch {\n error make {msg: "Failed to load provider configuration"}\n }\n\n # 3. Perform operation\n if $dry_run {\n return {action: "create", server: $name, status: "dry-run"}\n }\n\n # 4. 
Return result\n {server: $name, status: "created", id: (generate-id)}\n}\n```\n\n### Rust Code Structure\n\n**Project Organization**:\n\n```\nsrc/\n├── lib.rs # Library root\n├── main.rs # Binary entry point\n├── config/ # Configuration handling\n│ ├── mod.rs\n│ ├── loader.rs # Config loading\n│ └── validation.rs # Config validation\n├── api/ # HTTP API\n│ ├── mod.rs\n│ ├── handlers.rs # Request handlers\n│ └── middleware.rs # Middleware components\n└── orchestrator/ # Orchestration logic\n ├── mod.rs\n ├── workflow.rs # Workflow management\n └── task_queue.rs # Task queue management\n```\n\n**Error Handling**:\n\n```\nuse anyhow::{Context, Result};\nuse thiserror::Error;\n\n#[derive(Error, Debug)]\npub enum ProvisioningError {\n #[error("Configuration error: {message}")]\n Config { message: String },\n\n #[error("Network error: {source}")]\n Network {\n #[from]\n source: reqwest::Error,\n },\n\n #[error("Validation failed: {field}")]\n Validation { field: String },\n}\n\npub fn create_server(name: &str) -> Result {\n let config = load_config()\n .context("Failed to load configuration")?;\n\n validate_server_name(name)\n .context("Server name validation failed")?;\n\n let server = provision_server(name, &config)\n .context("Failed to provision server")?;\n\n Ok(server)\n}\n```\n\n### Nickel Schema Organization\n\n**Schema Structure**:\n\n```\n# Base schema definitions\nlet ServerConfig = {\n name | string,\n plan | string,\n zone | string,\n tags | { } | default = {},\n} in\nServerConfig\n\n# Provider-specific extensions\nlet UpCloudServerConfig = {\n template | string | default = "Ubuntu Server 22.04 LTS (Jammy Jellyfish)",\n storage | number | default = 25,\n} in\nUpCloudServerConfig\n\n# Composition schemas\nlet InfrastructureConfig = {\n servers | array,\n networks | array | default = [],\n load_balancers | array | default = [],\n} in\nInfrastructureConfig\n```\n\n## Testing Strategies\n\n### Test-Driven Development\n\n**TDD Workflow**:\n\n1. **Write Test First**: Define expected behavior\n2. **Run Test (Fail)**: Confirm test fails as expected\n3. **Write Code**: Implement minimal code to pass\n4. **Run Test (Pass)**: Confirm test now passes\n5. 
**Refactor**: Improve code while keeping tests green\n\n### Nushell Testing\n\n**Unit Test Pattern**:\n\n```\n# Function with embedded test\ndef validate-server-name [name: string] -> bool {\n # @test: "valid-name" -> true\n # @test: "" -> false\n # @test: "name-with-spaces" -> false\n\n if ($name | str length) == 0 {\n return false\n }\n\n if ($name | str contains " ") {\n return false\n }\n\n true\n}\n\n# Separate test file\n# tests/unit/server-validation-test.nu\ndef test_validate_server_name [] {\n # Valid cases\n assert (validate-server-name "valid-name")\n assert (validate-server-name "server123")\n\n # Invalid cases\n assert not (validate-server-name "")\n assert not (validate-server-name "name with spaces")\n assert not (validate-server-name "name@with!special")\n\n print "✅ validate-server-name tests passed"\n}\n```\n\n**Integration Test Pattern**:\n\n```\n# tests/integration/server-lifecycle-test.nu\ndef test_complete_server_lifecycle [] {\n # Setup\n let test_server = "test-server-" + (date now | format date "%Y%m%d%H%M%S")\n\n try {\n # Test creation\n let create_result = (create-server $test_server --dry-run)\n assert ($create_result.status == "dry-run")\n\n # Test validation\n let validate_result = (validate-server-config $test_server)\n assert $validate_result\n\n print $"✅ Server lifecycle test passed for ($test_server)"\n } catch { |e|\n print $"❌ Server lifecycle test failed: ($e.msg)"\n exit 1\n }\n}\n```\n\n### Rust Testing\n\n**Unit Testing**:\n\n```\n#[cfg(test)]\nmod tests {\n use super::*;\n use tokio_test;\n\n #[test]\n fn test_validate_server_name() {\n assert!(validate_server_name("valid-name"));\n assert!(validate_server_name("server123"));\n\n assert!(!validate_server_name(""));\n assert!(!validate_server_name("name with spaces"));\n assert!(!validate_server_name("name@special"));\n }\n\n #[tokio::test]\n async fn test_server_creation() {\n let config = test_config();\n let result = create_server("test-server", &config).await;\n\n assert!(result.is_ok());\n let server = result.unwrap();\n assert_eq!(server.name, "test-server");\n assert_eq!(server.status, "created");\n }\n}\n```\n\n**Integration Testing**:\n\n```\n#[cfg(test)]\nmod integration_tests {\n use super::*;\n use testcontainers::*;\n\n #[tokio::test]\n async fn test_full_workflow() {\n // Setup test environment\n let docker = clients::Cli::default();\n let postgres = docker.run(images::postgres::Postgres::default());\n\n let config = TestConfig {\n database_url: format!("postgresql://localhost:{}/test",\n postgres.get_host_port_ipv4(5432))\n };\n\n // Test complete workflow\n let workflow = create_workflow(&config).await.unwrap();\n let result = execute_workflow(workflow).await.unwrap();\n\n assert_eq!(result.status, WorkflowStatus::Completed);\n }\n}\n```\n\n### Nickel Testing\n\n**Schema Validation Testing**:\n\n```\n# Test Nickel schemas\nnickel check schemas/\n\n# Validate specific schemas\nnickel typecheck schemas/server.ncl\n\n# Test with examples\nnickel eval schemas/server.ncl\n```\n\n### Test Automation\n\n**Continuous Testing**:\n\n```\n# Watch for changes and run tests\ncargo watch -x test -x check\n\n# Watch Nushell files\nfind . 
-name "*.nu" | entr -r nu tests/run-all-tests.nu\n\n# Automated testing in workspace\nnu workspace.nu tools test-all --watch\n```\n\n## Debugging Techniques\n\n### Debug Configuration\n\n**Enable Debug Mode**:\n\n```\n# Environment variables\nexport PROVISIONING_DEBUG=true\nexport PROVISIONING_LOG_LEVEL=debug\nexport RUST_LOG=debug\nexport RUST_BACKTRACE=1\n\n# Workspace debug\nexport PROVISIONING_WORKSPACE_USER=$USER\n```\n\n### Nushell Debugging\n\n**Debug Techniques**:\n\n```\n# Debug prints\ndef debug-server-creation [name: string] {\n print $"🐛 Creating server: ($name)"\n\n let config = get-provider-config\n print $"🐛 Config loaded: ($config | to json)"\n\n let result = try {\n create-server-api $name $config\n } catch { |e|\n print $"🐛 API call failed: ($e.msg)"\n $e\n }\n\n print $"🐛 Result: ($result | to json)"\n $result\n}\n\n# Conditional debugging\ndef create-server [name: string] {\n if $env.PROVISIONING_DEBUG? == "true" {\n print $"Debug: Creating server ($name)"\n }\n\n # Implementation\n}\n\n# Interactive debugging\ndef debug-interactive [] {\n print "🐛 Entering debug mode..."\n print "Available commands: $env.PATH"\n print "Current config: " (get-config | to json)\n\n # Drop into interactive shell\n nu --interactive\n}\n```\n\n**Error Investigation**:\n\n```\n# Comprehensive error handling\ndef safe-server-creation [name: string] {\n try {\n create-server $name\n } catch { |e|\n # Log error details\n {\n timestamp: (date now | format date "%Y-%m-%d %H:%M:%S"),\n operation: "create-server",\n input: $name,\n error: $e.msg,\n debug: $e.debug?,\n env: {\n user: $env.USER,\n workspace: $env.PROVISIONING_WORKSPACE_USER?,\n debug: $env.PROVISIONING_DEBUG?\n }\n } | save --append logs/error-debug.json\n\n # Re-throw with context\n error make {\n msg: $"Server creation failed: ($e.msg)",\n label: {text: "failed here", span: $e.span?}\n }\n }\n}\n```\n\n### Rust Debugging\n\n**Debug Logging**:\n\n```\nuse tracing::{debug, info, warn, error, instrument};\n\n#[instrument]\npub async fn create_server(name: &str) -> Result {\n debug!("Starting server creation for: {}", name);\n\n let config = load_config()\n .map_err(|e| {\n error!("Failed to load config: {:?}", e);\n e\n })?;\n\n info!("Configuration loaded successfully");\n debug!("Config details: {:?}", config);\n\n let server = provision_server(name, &config).await\n .map_err(|e| {\n error!("Provisioning failed for {}: {:?}", name, e);\n e\n })?;\n\n info!("Server {} created successfully", name);\n Ok(server)\n}\n```\n\n**Interactive Debugging**:\n\n```\n// Use debugger breakpoints\n#[cfg(debug_assertions)]\n{\n println!("Debug: server creation starting");\n dbg!(&config);\n // Add breakpoint here in IDE\n}\n```\n\n### Log Analysis\n\n**Log Monitoring**:\n\n```\n# Follow all logs\ntail -f workspace/runtime/logs/$USER/*.log\n\n# Filter for errors\ngrep -i error workspace/runtime/logs/$USER/*.log\n\n# Monitor specific component\ntail -f workspace/runtime/logs/$USER/orchestrator.log | grep -i workflow\n\n# Structured log analysis\njq '.level == "ERROR"' workspace/runtime/logs/$USER/structured.jsonl\n```\n\n**Debug Log Levels**:\n\n```\n# Different verbosity levels\nPROVISIONING_LOG_LEVEL=trace provisioning server create test\nPROVISIONING_LOG_LEVEL=debug provisioning server create test\nPROVISIONING_LOG_LEVEL=info provisioning server create test\n```\n\n## Integration Workflows\n\n### Existing System Integration\n\n**Working with Legacy Components**:\n\n```\n# Test integration with existing system\nprovisioning --version # Legacy 
system\nsrc/core/nulib/provisioning --version # New system\n\n# Test workspace integration\nPROVISIONING_WORKSPACE_USER=$USER provisioning server list\n\n# Validate configuration compatibility\nprovisioning validate config\nnu workspace.nu config validate\n```\n\n### API Integration Testing\n\n**REST API Testing**:\n\n```\n# Test orchestrator API\ncurl -X GET http://localhost:9090/health\ncurl -X GET http://localhost:9090/tasks\n\n# Test workflow creation\ncurl -X POST http://localhost:9090/workflows/servers/create \\n -H "Content-Type: application/json" \\n -d '{"name": "test-server", "plan": "2xCPU-4 GB"}'\n\n# Monitor workflow\ncurl -X GET http://localhost:9090/workflows/batch/status/workflow-id\n```\n\n### Database Integration\n\n**SurrealDB Integration**:\n\n```\n# Test database connectivity\nuse core/nulib/lib_provisioning/database/surreal.nu\nlet db = (connect-database)\n(test-connection $db)\n\n# Workflow state testing\nlet workflow_id = (create-workflow-record "test-workflow")\nlet status = (get-workflow-status $workflow_id)\nassert ($status.status == "pending")\n```\n\n### External Tool Integration\n\n**Container Integration**:\n\n```\n# Test with Docker\ndocker run --rm -v $(pwd):/work provisioning:dev provisioning --version\n\n# Test with Kubernetes\nkubectl apply -f manifests/test-pod.yaml\nkubectl logs test-pod\n\n# Validate in different environments\nmake test-dist PLATFORM=docker\nmake test-dist PLATFORM=kubernetes\n```\n\n## Collaboration Guidelines\n\n### Branch Strategy\n\n**Branch Naming**:\n\n- `feature/description` - New features\n- `fix/description` - Bug fixes\n- `docs/description` - Documentation updates\n- `refactor/description` - Code refactoring\n- `test/description` - Test improvements\n\n**Workflow**:\n\n```\n# Start new feature\ngit checkout main\ngit pull origin main\ngit checkout -b feature/new-provider-support\n\n# Regular commits\ngit add .\ngit commit -m "feat(provider): implement server creation API"\n\n# Push and create PR\ngit push origin feature/new-provider-support\ngh pr create --title "Add new provider support" --body "..."\n```\n\n### Code Review Process\n\n**Review Checklist**:\n\n- [ ] Code follows project conventions\n- [ ] Tests are included and passing\n- [ ] Documentation is updated\n- [ ] No hardcoded values\n- [ ] Error handling is comprehensive\n- [ ] Performance considerations addressed\n\n**Review Commands**:\n\n```\n# Test PR locally\ngh pr checkout 123\ncd src/tools && make ci-test\n\n# Run specific tests\nnu workspace/extensions/providers/new-provider/tests/run-all.nu\n\n# Check code quality\ncargo clippy -- -D warnings\nnu --check $(find . 
-name "*.nu")\n```\n\n### Documentation Requirements\n\n**Code Documentation**:\n\n```\n# Function documentation\ndef create-server [\n name: string # Server name (must be unique)\n plan: string # Server plan (for example, "2xCPU-4 GB")\n --dry-run: bool # Show what would be created without doing it\n] -> record { # Returns server creation result\n # Creates a new server with the specified configuration\n #\n # Examples:\n # create-server "web-01" "2xCPU-4 GB"\n # create-server "test" "1xCPU-2 GB" --dry-run\n\n # Implementation\n}\n```\n\n### Communication\n\n**Progress Updates**:\n\n- Daily standup participation\n- Weekly architecture reviews\n- PR descriptions with context\n- Issue tracking with details\n\n**Knowledge Sharing**:\n\n- Technical blog posts\n- Architecture decision records\n- Code review discussions\n- Team documentation updates\n\n## Quality Assurance\n\n### Code Quality Checks\n\n**Automated Quality Gates**:\n\n```\n# Pre-commit hooks\npre-commit install\n\n# Manual quality check\ncd src/tools\nmake validate-all\n\n# Security audit\ncargo audit\n```\n\n**Quality Metrics**:\n\n- Code coverage > 80%\n- No critical security vulnerabilities\n- All tests passing\n- Documentation coverage complete\n- Performance benchmarks met\n\n### Performance Monitoring\n\n**Performance Testing**:\n\n```\n# Benchmark builds\nmake benchmark\n\n# Performance profiling\ncargo flamegraph --bin provisioning-orchestrator\n\n# Load testing\nab -n 1000 -c 10 http://localhost:9090/health\n```\n\n**Resource Monitoring**:\n\n```\n# Monitor during development\nnu workspace/tools/runtime-manager.nu monitor --duration 5m\n\n# Check resource usage\ndu -sh workspace/runtime/\ndf -h\n```\n\n## Best Practices\n\n### Configuration Management\n\n**Never Hardcode**:\n\n```\n# Bad\ndef get-api-url [] { "https://api.upcloud.com" }\n\n# Good\ndef get-api-url [] {\n get-config-value "providers.upcloud.api_url" "https://api.upcloud.com"\n}\n```\n\n### Error Handling\n\n**Comprehensive Error Context**:\n\n```\ndef create-server [name: string] {\n try {\n validate-server-name $name\n } catch { |e|\n error make {\n msg: $"Invalid server name '($name)': ($e.msg)",\n label: {text: "server name validation failed", span: $e.span?}\n }\n }\n\n try {\n provision-server $name\n } catch { |e|\n error make {\n msg: $"Server provisioning failed for '($name)': ($e.msg)",\n help: "Check provider credentials and quota limits"\n }\n }\n}\n```\n\n### Resource Management\n\n**Clean Up Resources**:\n\n```\ndef with-temporary-server [name: string, action: closure] {\n let server = (create-server $name)\n\n try {\n do $action $server\n } catch { |e|\n # Clean up on error\n delete-server $name\n $e\n }\n\n # Clean up on success\n delete-server $name\n}\n```\n\n### Testing Best Practices\n\n**Test Isolation**:\n\n```\ndef test-with-isolation [test_name: string, test_action: closure] {\n let test_workspace = $"test-($test_name)-(date now | format date '%Y%m%d%H%M%S')"\n\n try {\n # Set up isolated environment\n $env.PROVISIONING_WORKSPACE_USER = $test_workspace\n nu workspace.nu init --user-name $test_workspace\n\n # Run test\n do $test_action\n\n print $"✅ Test ($test_name) passed"\n } catch { |e|\n print $"❌ Test ($test_name) failed: ($e.msg)"\n exit 1\n } finally {\n # Clean up test environment\n nu workspace.nu cleanup --user-name $test_workspace --type all --force\n }\n}\n```\n\nThis development workflow provides a comprehensive framework for efficient, quality-focused development while maintaining the project's 
architectural\nprinciples and ensuring smooth collaboration across the team. +# Development Workflow Guide + +This document outlines the recommended development workflows, coding practices, testing strategies, and debugging techniques for the provisioning +project. + +## Table of Contents + +1. [Overview](#overview) +2. [Development Setup](#development-setup) +3. [Daily Development Workflow](#daily-development-workflow) +4. [Code Organization](#code-organization) +5. [Testing Strategies](#testing-strategies) +6. [Debugging Techniques](#debugging-techniques) +7. [Integration Workflows](#integration-workflows) +8. [Collaboration Guidelines](#collaboration-guidelines) +9. [Quality Assurance](#quality-assurance) +10. [Best Practices](#best-practices) + +## Overview + +The provisioning project employs a multi-language, multi-component architecture requiring specific development workflows to maintain consistency, +quality, and efficiency. + +**Key Technologies**: + +- **Nushell**: Primary scripting and automation language +- **Rust**: High-performance system components +- **KCL**: Configuration language and schemas +- **TOML**: Configuration files +- **Jinja2**: Template engine + +**Development Principles**: + +- **Configuration-Driven**: Never hardcode, always configure +- **Hybrid Architecture**: Rust for performance, Nushell for flexibility +- **Test-First**: Comprehensive testing at all levels +- **Documentation-Driven**: Code and APIs are self-documenting + +## Development Setup + +### Initial Environment Setup + +**1. Clone and Navigate**: + +```text +# Clone repository +git clone https://github.com/company/provisioning-system.git +cd provisioning-system + +# Navigate to workspace +cd workspace/tools +``` + +**2. Initialize Workspace**: + +```text +# Initialize development workspace +nu workspace.nu init --user-name $USER --infra-name dev-env + +# Check workspace health +nu workspace.nu health --detailed --fix-issues +``` + +**3. Configure Development Environment**: + +```text +# Create user configuration +cp workspace/config/local-overrides.toml.example workspace/config/$USER.toml + +# Edit configuration for development +$EDITOR workspace/config/$USER.toml +``` + +**4. Set Up Build System**: + +```text +# Navigate to build tools +cd src/tools + +# Check build prerequisites +make info + +# Perform initial build +make dev-build +``` + +### Tool Installation + +**Required Tools**: + +```text +# Install Nushell +cargo install nu + +# Install Nickel +cargo install nickel + +# Install additional tools +cargo install cross # Cross-compilation +cargo install cargo-audit # Security auditing +cargo install cargo-watch # File watching +``` + +**Optional Development Tools**: + +```text +# Install development enhancers +cargo install nu_plugin_tera # Template plugin +cargo install sops # Secrets management +brew install k9s # Kubernetes management +``` + +### IDE Configuration + +**VS Code Setup** (`.vscode/settings.json`): + +```text +{ + "files.associations": { + "*.nu": "shellscript", + "*.ncl": "nickel", + "*.toml": "toml" + }, + "nushell.shellPath": "/usr/local/bin/nu", + "rust-analyzer.cargo.features": "all", + "editor.formatOnSave": true, + "editor.rulers": [100], + "files.trimTrailingWhitespace": true +} +``` + +**Recommended Extensions**: + +- Nushell Language Support +- Rust Analyzer +- Nickel Language Support +- TOML Language Support +- Better TOML + +## Daily Development Workflow + +### Morning Routine + +**1. 
Sync and Update**: + +```text +# Sync with upstream +git pull origin main + +# Update workspace +cd workspace/tools +nu workspace.nu health --fix-issues + +# Check for updates +nu workspace.nu status --detailed +``` + +**2. Review Current State**: + +```text +# Check current infrastructure +provisioning show servers +provisioning show settings + +# Review workspace status +nu workspace.nu status +``` + +### Development Cycle + +**1. Feature Development**: + +```text +# Create feature branch +git checkout -b feature/new-provider-support + +# Start development environment +cd workspace/tools +nu workspace.nu init --workspace-type development + +# Begin development +$EDITOR workspace/extensions/providers/new-provider/nulib/provider.nu +``` + +**2. Incremental Testing**: + +```text +# Test syntax during development +nu --check workspace/extensions/providers/new-provider/nulib/provider.nu + +# Run unit tests +nu workspace/extensions/providers/new-provider/tests/unit/basic-test.nu + +# Integration testing +nu workspace.nu tools test-extension providers/new-provider +``` + +**3. Build and Validate**: + +```text +# Quick development build +cd src/tools +make dev-build + +# Validate changes +make validate-all + +# Test distribution +make test-dist +``` + +### Testing During Development + +**Unit Testing**: + +```text +# Add test examples to functions +def create-server [name: string] -> record { + # @test: "test-server" -> {name: "test-server", status: "created"} + # Implementation here +} +``` + +**Integration Testing**: + +```text +# Test with real infrastructure +nu workspace/extensions/providers/new-provider/nulib/provider.nu + create-server test-server --dry-run + +# Test with workspace isolation +PROVISIONING_WORKSPACE_USER=$USER provisioning server create test-server --check +``` + +### End-of-Day Routine + +**1. Commit Progress**: + +```text +# Stage changes +git add . + +# Commit with descriptive message +git commit -m "feat(provider): add new cloud provider support + +- Implement basic server creation +- Add configuration schema +- Include unit tests +- Update documentation" + +# Push to feature branch +git push origin feature/new-provider-support +``` + +**2. Workspace Maintenance**: + +```text +# Clean up development data +nu workspace.nu cleanup --type cache --age 1d + +# Backup current state +nu workspace.nu backup --auto-name --components config,extensions + +# Check workspace health +nu workspace.nu health +``` + +## Code Organization + +### Nushell Code Structure + +**File Organization**: + +```text +Extension Structure: +├── nulib/ +│ ├── main.nu # Main entry point +│ ├── core/ # Core functionality +│ │ ├── api.nu # API interactions +│ │ ├── config.nu # Configuration handling +│ │ └── utils.nu # Utility functions +│ ├── commands/ # User commands +│ │ ├── create.nu # Create operations +│ │ ├── delete.nu # Delete operations +│ │ └── list.nu # List operations +│ └── tests/ # Test files +│ ├── unit/ # Unit tests +│ └── integration/ # Integration tests +└── templates/ # Template files + ├── config.j2 # Configuration templates + └── manifest.j2 # Manifest templates +``` + +**Function Naming Conventions**: + +```text +# Use kebab-case for commands +def create-server [name: string] -> record { ... } +def validate-config [config: record] -> bool { ... } + +# Use snake_case for internal functions +def get_api_client [] -> record { ... } +def parse_config_file [path: string] -> record { ... } + +# Use descriptive prefixes +def check-server-status [server: string] -> string { ... 
} +def get-server-info [server: string] -> record { ... } +def list-available-zones [] -> list { ... } +``` + +**Error Handling Pattern**: + +```text +def create-server [ + name: string + --dry-run: bool = false +] -> record { + # 1. Validate inputs + if ($name | str length) == 0 { + error make { + msg: "Server name cannot be empty" + label: { + text: "empty name provided" + span: (metadata $name).span + } + } + } + + # 2. Check prerequisites + let config = try { + get-provider-config + } catch { + error make {msg: "Failed to load provider configuration"} + } + + # 3. Perform operation + if $dry_run { + return {action: "create", server: $name, status: "dry-run"} + } + + # 4. Return result + {server: $name, status: "created", id: (generate-id)} +} +``` + +### Rust Code Structure + +**Project Organization**: + +```text +src/ +├── lib.rs # Library root +├── main.rs # Binary entry point +├── config/ # Configuration handling +│ ├── mod.rs +│ ├── loader.rs # Config loading +│ └── validation.rs # Config validation +├── api/ # HTTP API +│ ├── mod.rs +│ ├── handlers.rs # Request handlers +│ └── middleware.rs # Middleware components +└── orchestrator/ # Orchestration logic + ├── mod.rs + ├── workflow.rs # Workflow management + └── task_queue.rs # Task queue management +``` + +**Error Handling**: + +```text +use anyhow::{Context, Result}; +use thiserror::Error; + +#[derive(Error, Debug)] +pub enum ProvisioningError { + #[error("Configuration error: {message}")] + Config { message: String }, + + #[error("Network error: {source}")] + Network { + #[from] + source: reqwest::Error, + }, + + #[error("Validation failed: {field}")] + Validation { field: String }, +} + +pub fn create_server(name: &str) -> Result { + let config = load_config() + .context("Failed to load configuration")?; + + validate_server_name(name) + .context("Server name validation failed")?; + + let server = provision_server(name, &config) + .context("Failed to provision server")?; + + Ok(server) +} +``` + +### Nickel Schema Organization + +**Schema Structure**: + +```text +# Base schema definitions +let ServerConfig = { + name | string, + plan | string, + zone | string, + tags | { } | default = {}, +} in +ServerConfig + +# Provider-specific extensions +let UpCloudServerConfig = { + template | string | default = "Ubuntu Server 22.04 LTS (Jammy Jellyfish)", + storage | number | default = 25, +} in +UpCloudServerConfig + +# Composition schemas +let InfrastructureConfig = { + servers | array, + networks | array | default = [], + load_balancers | array | default = [], +} in +InfrastructureConfig +``` + +## Testing Strategies + +### Test-Driven Development + +**TDD Workflow**: + +1. **Write Test First**: Define expected behavior +2. **Run Test (Fail)**: Confirm test fails as expected +3. **Write Code**: Implement minimal code to pass +4. **Run Test (Pass)**: Confirm test now passes +5. 
**Refactor**: Improve code while keeping tests green + +### Nushell Testing + +**Unit Test Pattern**: + +```text +# Function with embedded test +def validate-server-name [name: string] -> bool { + # @test: "valid-name" -> true + # @test: "" -> false + # @test: "name-with-spaces" -> false + + if ($name | str length) == 0 { + return false + } + + if ($name | str contains " ") { + return false + } + + true +} + +# Separate test file +# tests/unit/server-validation-test.nu +def test_validate_server_name [] { + # Valid cases + assert (validate-server-name "valid-name") + assert (validate-server-name "server123") + + # Invalid cases + assert not (validate-server-name "") + assert not (validate-server-name "name with spaces") + assert not (validate-server-name "name@with!special") + + print "✅ validate-server-name tests passed" +} +``` + +**Integration Test Pattern**: + +```text +# tests/integration/server-lifecycle-test.nu +def test_complete_server_lifecycle [] { + # Setup + let test_server = "test-server-" + (date now | format date "%Y%m%d%H%M%S") + + try { + # Test creation + let create_result = (create-server $test_server --dry-run) + assert ($create_result.status == "dry-run") + + # Test validation + let validate_result = (validate-server-config $test_server) + assert $validate_result + + print $"✅ Server lifecycle test passed for ($test_server)" + } catch { |e| + print $"❌ Server lifecycle test failed: ($e.msg)" + exit 1 + } +} +``` + +### Rust Testing + +**Unit Testing**: + +```text +#[cfg(test)] +mod tests { + use super::*; + use tokio_test; + + #[test] + fn test_validate_server_name() { + assert!(validate_server_name("valid-name")); + assert!(validate_server_name("server123")); + + assert!(!validate_server_name("")); + assert!(!validate_server_name("name with spaces")); + assert!(!validate_server_name("name@special")); + } + + #[tokio::test] + async fn test_server_creation() { + let config = test_config(); + let result = create_server("test-server", &config).await; + + assert!(result.is_ok()); + let server = result.unwrap(); + assert_eq!(server.name, "test-server"); + assert_eq!(server.status, "created"); + } +} +``` + +**Integration Testing**: + +```text +#[cfg(test)] +mod integration_tests { + use super::*; + use testcontainers::*; + + #[tokio::test] + async fn test_full_workflow() { + // Setup test environment + let docker = clients::Cli::default(); + let postgres = docker.run(images::postgres::Postgres::default()); + + let config = TestConfig { + database_url: format!("postgresql://localhost:{}/test", + postgres.get_host_port_ipv4(5432)) + }; + + // Test complete workflow + let workflow = create_workflow(&config).await.unwrap(); + let result = execute_workflow(workflow).await.unwrap(); + + assert_eq!(result.status, WorkflowStatus::Completed); + } +} +``` + +### Nickel Testing + +**Schema Validation Testing**: + +```text +# Test Nickel schemas +nickel check schemas/ + +# Validate specific schemas +nickel typecheck schemas/server.ncl + +# Test with examples +nickel eval schemas/server.ncl +``` + +### Test Automation + +**Continuous Testing**: + +```text +# Watch for changes and run tests +cargo watch -x test -x check + +# Watch Nushell files +find . 
-name "*.nu" | entr -r nu tests/run-all-tests.nu + +# Automated testing in workspace +nu workspace.nu tools test-all --watch +``` + +## Debugging Techniques + +### Debug Configuration + +**Enable Debug Mode**: + +```text +# Environment variables +export PROVISIONING_DEBUG=true +export PROVISIONING_LOG_LEVEL=debug +export RUST_LOG=debug +export RUST_BACKTRACE=1 + +# Workspace debug +export PROVISIONING_WORKSPACE_USER=$USER +``` + +### Nushell Debugging + +**Debug Techniques**: + +```text +# Debug prints +def debug-server-creation [name: string] { + print $"🐛 Creating server: ($name)" + + let config = get-provider-config + print $"🐛 Config loaded: ($config | to json)" + + let result = try { + create-server-api $name $config + } catch { |e| + print $"🐛 API call failed: ($e.msg)" + $e + } + + print $"🐛 Result: ($result | to json)" + $result +} + +# Conditional debugging +def create-server [name: string] { + if $env.PROVISIONING_DEBUG? == "true" { + print $"Debug: Creating server ($name)" + } + + # Implementation +} + +# Interactive debugging +def debug-interactive [] { + print "🐛 Entering debug mode..." + print "Available commands: $env.PATH" + print "Current config: " (get-config | to json) + + # Drop into interactive shell + nu --interactive +} +``` + +**Error Investigation**: + +```text +# Comprehensive error handling +def safe-server-creation [name: string] { + try { + create-server $name + } catch { |e| + # Log error details + { + timestamp: (date now | format date "%Y-%m-%d %H:%M:%S"), + operation: "create-server", + input: $name, + error: $e.msg, + debug: $e.debug?, + env: { + user: $env.USER, + workspace: $env.PROVISIONING_WORKSPACE_USER?, + debug: $env.PROVISIONING_DEBUG? + } + } | save --append logs/error-debug.json + + # Re-throw with context + error make { + msg: $"Server creation failed: ($e.msg)", + label: {text: "failed here", span: $e.span?} + } + } +} +``` + +### Rust Debugging + +**Debug Logging**: + +```text +use tracing::{debug, info, warn, error, instrument}; + +#[instrument] +pub async fn create_server(name: &str) -> Result { + debug!("Starting server creation for: {}", name); + + let config = load_config() + .map_err(|e| { + error!("Failed to load config: {:?}", e); + e + })?; + + info!("Configuration loaded successfully"); + debug!("Config details: {:?}", config); + + let server = provision_server(name, &config).await + .map_err(|e| { + error!("Provisioning failed for {}: {:?}", name, e); + e + })?; + + info!("Server {} created successfully", name); + Ok(server) +} +``` + +**Interactive Debugging**: + +```text +// Use debugger breakpoints +#[cfg(debug_assertions)] +{ + println!("Debug: server creation starting"); + dbg!(&config); + // Add breakpoint here in IDE +} +``` + +### Log Analysis + +**Log Monitoring**: + +```text +# Follow all logs +tail -f workspace/runtime/logs/$USER/*.log + +# Filter for errors +grep -i error workspace/runtime/logs/$USER/*.log + +# Monitor specific component +tail -f workspace/runtime/logs/$USER/orchestrator.log | grep -i workflow + +# Structured log analysis +jq '.level == "ERROR"' workspace/runtime/logs/$USER/structured.jsonl +``` + +**Debug Log Levels**: + +```text +# Different verbosity levels +PROVISIONING_LOG_LEVEL=trace provisioning server create test +PROVISIONING_LOG_LEVEL=debug provisioning server create test +PROVISIONING_LOG_LEVEL=info provisioning server create test +``` + +## Integration Workflows + +### Existing System Integration + +**Working with Legacy Components**: + +```text +# Test integration with existing system 
+provisioning --version # Legacy system
+src/core/nulib/provisioning --version # New system
+
+# Test workspace integration
+PROVISIONING_WORKSPACE_USER=$USER provisioning server list
+
+# Validate configuration compatibility
+provisioning validate config
+nu workspace.nu config validate
+```
+
+### API Integration Testing
+
+**REST API Testing**:
+
+```text
+# Test orchestrator API
+curl -X GET http://localhost:9090/health
+curl -X GET http://localhost:9090/tasks
+
+# Test workflow creation
+curl -X POST http://localhost:9090/workflows/servers/create \
+  -H "Content-Type: application/json" \
+  -d '{"name": "test-server", "plan": "2xCPU-4 GB"}'
+
+# Monitor workflow
+curl -X GET http://localhost:9090/workflows/batch/status/workflow-id
+```
+
+### Database Integration
+
+**SurrealDB Integration**:
+
+```text
+# Test database connectivity
+use core/nulib/lib_provisioning/database/surreal.nu
+let db = (connect-database)
+(test-connection $db)
+
+# Workflow state testing
+let workflow_id = (create-workflow-record "test-workflow")
+let status = (get-workflow-status $workflow_id)
+assert ($status.status == "pending")
+```
+
+### External Tool Integration
+
+**Container Integration**:
+
+```text
+# Test with Docker
+docker run --rm -v $(pwd):/work provisioning:dev provisioning --version
+
+# Test with Kubernetes
+kubectl apply -f manifests/test-pod.yaml
+kubectl logs test-pod
+
+# Validate in different environments
+make test-dist PLATFORM=docker
+make test-dist PLATFORM=kubernetes
+```
+
+## Collaboration Guidelines
+
+### Branch Strategy
+
+**Branch Naming**:
+
+- `feature/description` - New features
+- `fix/description` - Bug fixes
+- `docs/description` - Documentation updates
+- `refactor/description` - Code refactoring
+- `test/description` - Test improvements
+
+**Workflow**:
+
+```text
+# Start new feature
+git checkout main
+git pull origin main
+git checkout -b feature/new-provider-support
+
+# Regular commits
+git add .
+git commit -m "feat(provider): implement server creation API"
+
+# Push and create PR
+git push origin feature/new-provider-support
+gh pr create --title "Add new provider support" --body "..."
+```
+
+### Code Review Process
+
+**Review Checklist**:
+
+- [ ] Code follows project conventions
+- [ ] Tests are included and passing
+- [ ] Documentation is updated
+- [ ] No hardcoded values
+- [ ] Error handling is comprehensive
+- [ ] Performance considerations addressed
+
+**Review Commands**:
+
+```text
+# Test PR locally
+gh pr checkout 123
+cd src/tools && make ci-test
+
+# Run specific tests
+nu workspace/extensions/providers/new-provider/tests/run-all.nu
+
+# Check code quality
+cargo clippy -- -D warnings
+nu --check $(find . 
-name "*.nu") +``` + +### Documentation Requirements + +**Code Documentation**: + +```text +# Function documentation +def create-server [ + name: string # Server name (must be unique) + plan: string # Server plan (for example, "2xCPU-4 GB") + --dry-run: bool # Show what would be created without doing it +] -> record { # Returns server creation result + # Creates a new server with the specified configuration + # + # Examples: + # create-server "web-01" "2xCPU-4 GB" + # create-server "test" "1xCPU-2 GB" --dry-run + + # Implementation +} +``` + +### Communication + +**Progress Updates**: + +- Daily standup participation +- Weekly architecture reviews +- PR descriptions with context +- Issue tracking with details + +**Knowledge Sharing**: + +- Technical blog posts +- Architecture decision records +- Code review discussions +- Team documentation updates + +## Quality Assurance + +### Code Quality Checks + +**Automated Quality Gates**: + +```text +# Pre-commit hooks +pre-commit install + +# Manual quality check +cd src/tools +make validate-all + +# Security audit +cargo audit +``` + +**Quality Metrics**: + +- Code coverage > 80% +- No critical security vulnerabilities +- All tests passing +- Documentation coverage complete +- Performance benchmarks met + +### Performance Monitoring + +**Performance Testing**: + +```text +# Benchmark builds +make benchmark + +# Performance profiling +cargo flamegraph --bin provisioning-orchestrator + +# Load testing +ab -n 1000 -c 10 http://localhost:9090/health +``` + +**Resource Monitoring**: + +```text +# Monitor during development +nu workspace/tools/runtime-manager.nu monitor --duration 5m + +# Check resource usage +du -sh workspace/runtime/ +df -h +``` + +## Best Practices + +### Configuration Management + +**Never Hardcode**: + +```text +# Bad +def get-api-url [] { "https://api.upcloud.com" } + +# Good +def get-api-url [] { + get-config-value "providers.upcloud.api_url" "https://api.upcloud.com" +} +``` + +### Error Handling + +**Comprehensive Error Context**: + +```text +def create-server [name: string] { + try { + validate-server-name $name + } catch { |e| + error make { + msg: $"Invalid server name '($name)': ($e.msg)", + label: {text: "server name validation failed", span: $e.span?} + } + } + + try { + provision-server $name + } catch { |e| + error make { + msg: $"Server provisioning failed for '($name)': ($e.msg)", + help: "Check provider credentials and quota limits" + } + } +} +``` + +### Resource Management + +**Clean Up Resources**: + +```text +def with-temporary-server [name: string, action: closure] { + let server = (create-server $name) + + try { + do $action $server + } catch { |e| + # Clean up on error + delete-server $name + $e + } + + # Clean up on success + delete-server $name +} +``` + +### Testing Best Practices + +**Test Isolation**: + +```text +def test-with-isolation [test_name: string, test_action: closure] { + let test_workspace = $"test-($test_name)-(date now | format date '%Y%m%d%H%M%S')" + + try { + # Set up isolated environment + $env.PROVISIONING_WORKSPACE_USER = $test_workspace + nu workspace.nu init --user-name $test_workspace + + # Run test + do $test_action + + print $"✅ Test ($test_name) passed" + } catch { |e| + print $"❌ Test ($test_name) failed: ($e.msg)" + exit 1 + } finally { + # Clean up test environment + nu workspace.nu cleanup --user-name $test_workspace --type all --force + } +} +``` + +This development workflow provides a comprehensive framework for efficient, quality-focused development while maintaining the 
project's architectural +principles and ensuring smooth collaboration across the team. \ No newline at end of file diff --git a/docs/src/getting-started/01-prerequisites.md b/docs/src/getting-started/01-prerequisites.md index 247c849..52c5edd 100644 --- a/docs/src/getting-started/01-prerequisites.md +++ b/docs/src/getting-started/01-prerequisites.md @@ -1 +1,251 @@ -# Prerequisites\n\nBefore installing the Provisioning Platform, ensure your system meets the following requirements.\n\n## Hardware Requirements\n\n### Minimum Requirements (Solo Mode)\n\n- **CPU**: 2 cores\n- **RAM**: 4 GB\n- **Disk**: 20 GB available space\n- **Network**: Internet connection for downloading dependencies\n\n### Recommended Requirements (Multi-User Mode)\n\n- **CPU**: 4 cores\n- **RAM**: 8 GB\n- **Disk**: 50 GB available space\n- **Network**: Reliable internet connection\n\n### Production Requirements (Enterprise Mode)\n\n- **CPU**: 16 cores\n- **RAM**: 32 GB\n- **Disk**: 500 GB available space (SSD recommended)\n- **Network**: High-bandwidth connection with static IP\n\n## Operating System\n\n### Supported Platforms\n\n- **macOS**: 12.0 (Monterey) or later\n- **Linux**:\n - Ubuntu 22.04 LTS or later\n - Fedora 38 or later\n - Debian 12 (Bookworm) or later\n - RHEL 9 or later\n\n### Platform-Specific Notes\n\n**macOS**:\n\n- Xcode Command Line Tools required\n- Homebrew recommended for package management\n\n**Linux**:\n\n- systemd-based distribution recommended\n- sudo access required for some operations\n\n## Required Software\n\n### Core Dependencies\n\n| Software | Version | Purpose |\n| ---------- | --------- | --------- |\n| **Nushell** | 0.107.1+ | Shell and scripting language |\n| **Nickel** | 1.15.0+ | Configuration language |\n| **Docker** | 20.10+ | Container runtime (for platform services) |\n| **SOPS** | 3.10.2+ | Secrets management |\n| **Age** | 1.2.1+ | Encryption tool |\n\n### Optional Dependencies\n\n| Software | Version | Purpose |\n| ---------- | --------- | --------- |\n| **Podman** | 4.0+ | Alternative container runtime |\n| **OrbStack** | Latest | macOS-optimized container runtime |\n| **K9s** | 0.50.6+ | Kubernetes management interface |\n| **glow** | Latest | Markdown renderer for guides |\n| **bat** | Latest | Syntax highlighting for file viewing |\n\n## Installation Verification\n\nBefore proceeding, verify your system has the core dependencies installed:\n\n### Nushell\n\n```\n# Check Nushell version\nnu --version\n\n# Expected output: 0.107.1 or higher\n```\n\n### Nickel\n\n```\n# Check Nickel version\nnickel --version\n\n# Expected output: 1.15.0 or higher\n```\n\n### Docker\n\n```\n# Check Docker version\ndocker --version\n\n# Check Docker is running\ndocker ps\n\n# Expected: Docker version 20.10+ and connection successful\n```\n\n### SOPS\n\n```\n# Check SOPS version\nsops --version\n\n# Expected output: 3.10.2 or higher\n```\n\n### Age\n\n```\n# Check Age version\nage --version\n\n# Expected output: 1.2.1 or higher\n```\n\n## Installing Missing Dependencies\n\n### macOS (using Homebrew)\n\n```\n# Install Homebrew if not already installed\n/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"\n\n# Install Nushell\nbrew install nushell\n\n# Install Nickel\nbrew install nickel\n\n# Install Docker Desktop\nbrew install --cask docker\n\n# Install SOPS\nbrew install sops\n\n# Install Age\nbrew install age\n\n# Optional: Install extras\nbrew install k9s glow bat\n```\n\n### Ubuntu/Debian\n\n```\n# Update package list\nsudo apt update\n\n# 
Install prerequisites\nsudo apt install -y curl git build-essential\n\n# Install Nushell (from GitHub releases)\ncurl -LO https://github.com/nushell/nushell/releases/download/0.107.1/nu-0.107.1-x86_64-linux-musl.tar.gz\ntar xzf nu-0.107.1-x86_64-linux-musl.tar.gz\nsudo mv nu /usr/local/bin/\n\n# Install Nickel (using Rust cargo)\ncurl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh\nsource $HOME/.cargo/env\ncargo install nickel\n\n# Install Docker\nsudo apt install -y docker.io\nsudo systemctl enable --now docker\nsudo usermod -aG docker $USER\n\n# Install SOPS\ncurl -LO https://github.com/getsops/sops/releases/download/v3.10.2/sops-v3.10.2.linux.amd64\nchmod +x sops-v3.10.2.linux.amd64\nsudo mv sops-v3.10.2.linux.amd64 /usr/local/bin/sops\n\n# Install Age\nsudo apt install -y age\n```\n\n### Fedora/RHEL\n\n```\n# Install Nushell\nsudo dnf install -y nushell\n\n# Install Nickel (using Rust cargo)\ncurl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh\nsource $HOME/.cargo/env\ncargo install nickel\n\n# Install Docker\nsudo dnf install -y docker\nsudo systemctl enable --now docker\nsudo usermod -aG docker $USER\n\n# Install SOPS\nsudo dnf install -y sops\n\n# Install Age\nsudo dnf install -y age\n```\n\n## Network Requirements\n\n### Firewall Ports\n\nIf running platform services, ensure these ports are available:\n\n| Service | Port | Protocol | Purpose |\n| --------- | ------ | ---------- | --------- |\n| Orchestrator | 8080 | HTTP | Workflow API |\n| Control Center | 9090 | HTTP | Policy engine |\n| KMS Service | 8082 | HTTP | Key management |\n| API Server | 8083 | HTTP | REST API |\n| Extension Registry | 8084 | HTTP | Extension discovery |\n| OCI Registry | 5000 | HTTP | Artifact storage |\n\n### External Connectivity\n\nThe platform requires outbound internet access to:\n\n- Download dependencies and updates\n- Pull container images\n- Access cloud provider APIs (AWS, UpCloud)\n- Fetch extension packages\n\n## Cloud Provider Credentials (Optional)\n\nIf you plan to use cloud providers, prepare credentials:\n\n### AWS\n\n- AWS Access Key ID\n- AWS Secret Access Key\n- Configured via `~/.aws/credentials` or environment variables\n\n### UpCloud\n\n- UpCloud username\n- UpCloud password\n- Configured via environment variables or config files\n\n## Next Steps\n\nOnce all prerequisites are met, proceed to:\n→ **[Installation](02-installation.md)** +# Prerequisites + +Before installing the Provisioning Platform, ensure your system meets the following requirements. 
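+
+For a quick, optional snapshot of the machine you are installing on, the following standard system tools print core count, memory, and free disk space to compare against the requirements below:
+
+```text
+# CPU cores and memory (Linux)
+nproc
+free -h
+
+# CPU cores and memory in bytes (macOS)
+sysctl -n hw.ncpu
+sysctl -n hw.memsize
+
+# Available disk space (both)
+df -h .
+```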
+ +## Hardware Requirements + +### Minimum Requirements (Solo Mode) + +- **CPU**: 2 cores +- **RAM**: 4 GB +- **Disk**: 20 GB available space +- **Network**: Internet connection for downloading dependencies + +### Recommended Requirements (Multi-User Mode) + +- **CPU**: 4 cores +- **RAM**: 8 GB +- **Disk**: 50 GB available space +- **Network**: Reliable internet connection + +### Production Requirements (Enterprise Mode) + +- **CPU**: 16 cores +- **RAM**: 32 GB +- **Disk**: 500 GB available space (SSD recommended) +- **Network**: High-bandwidth connection with static IP + +## Operating System + +### Supported Platforms + +- **macOS**: 12.0 (Monterey) or later +- **Linux**: + - Ubuntu 22.04 LTS or later + - Fedora 38 or later + - Debian 12 (Bookworm) or later + - RHEL 9 or later + +### Platform-Specific Notes + +**macOS**: + +- Xcode Command Line Tools required +- Homebrew recommended for package management + +**Linux**: + +- systemd-based distribution recommended +- sudo access required for some operations + +## Required Software + +### Core Dependencies + +| Software | Version | Purpose | +| ---------- | --------- | --------- | +| **Nushell** | 0.107.1+ | Shell and scripting language | +| **Nickel** | 1.15.0+ | Configuration language | +| **Docker** | 20.10+ | Container runtime (for platform services) | +| **SOPS** | 3.10.2+ | Secrets management | +| **Age** | 1.2.1+ | Encryption tool | + +### Optional Dependencies + +| Software | Version | Purpose | +| ---------- | --------- | --------- | +| **Podman** | 4.0+ | Alternative container runtime | +| **OrbStack** | Latest | macOS-optimized container runtime | +| **K9s** | 0.50.6+ | Kubernetes management interface | +| **glow** | Latest | Markdown renderer for guides | +| **bat** | Latest | Syntax highlighting for file viewing | + +## Installation Verification + +Before proceeding, verify your system has the core dependencies installed: + +### Nushell + +```text +# Check Nushell version +nu --version + +# Expected output: 0.107.1 or higher +``` + +### Nickel + +```text +# Check Nickel version +nickel --version + +# Expected output: 1.15.0 or higher +``` + +### Docker + +```text +# Check Docker version +docker --version + +# Check Docker is running +docker ps + +# Expected: Docker version 20.10+ and connection successful +``` + +### SOPS + +```text +# Check SOPS version +sops --version + +# Expected output: 3.10.2 or higher +``` + +### Age + +```text +# Check Age version +age --version + +# Expected output: 1.2.1 or higher +``` + +## Installing Missing Dependencies + +### macOS (using Homebrew) + +```text +# Install Homebrew if not already installed +/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)" + +# Install Nushell +brew install nushell + +# Install Nickel +brew install nickel + +# Install Docker Desktop +brew install --cask docker + +# Install SOPS +brew install sops + +# Install Age +brew install age + +# Optional: Install extras +brew install k9s glow bat +``` + +### Ubuntu/Debian + +```text +# Update package list +sudo apt update + +# Install prerequisites +sudo apt install -y curl git build-essential + +# Install Nushell (from GitHub releases) +curl -LO https://github.com/nushell/nushell/releases/download/0.107.1/nu-0.107.1-x86_64-linux-musl.tar.gz +tar xzf nu-0.107.1-x86_64-linux-musl.tar.gz +sudo mv nu /usr/local/bin/ + +# Install Nickel (using Rust cargo) +curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh +source $HOME/.cargo/env +cargo install nickel + +# Install Docker 
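+# (docker.io is the Debian-packaged engine; Docker's official apt repository also works)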
+sudo apt install -y docker.io +sudo systemctl enable --now docker +sudo usermod -aG docker $USER + +# Install SOPS +curl -LO https://github.com/getsops/sops/releases/download/v3.10.2/sops-v3.10.2.linux.amd64 +chmod +x sops-v3.10.2.linux.amd64 +sudo mv sops-v3.10.2.linux.amd64 /usr/local/bin/sops + +# Install Age +sudo apt install -y age +``` + +### Fedora/RHEL + +```text +# Install Nushell +sudo dnf install -y nushell + +# Install Nickel (using Rust cargo) +curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh +source $HOME/.cargo/env +cargo install nickel + +# Install Docker +sudo dnf install -y docker +sudo systemctl enable --now docker +sudo usermod -aG docker $USER + +# Install SOPS +sudo dnf install -y sops + +# Install Age +sudo dnf install -y age +``` + +## Network Requirements + +### Firewall Ports + +If running platform services, ensure these ports are available: + +| Service | Port | Protocol | Purpose | +| --------- | ------ | ---------- | --------- | +| Orchestrator | 8080 | HTTP | Workflow API | +| Control Center | 9090 | HTTP | Policy engine | +| KMS Service | 8082 | HTTP | Key management | +| API Server | 8083 | HTTP | REST API | +| Extension Registry | 8084 | HTTP | Extension discovery | +| OCI Registry | 5000 | HTTP | Artifact storage | + +### External Connectivity + +The platform requires outbound internet access to: + +- Download dependencies and updates +- Pull container images +- Access cloud provider APIs (AWS, UpCloud) +- Fetch extension packages + +## Cloud Provider Credentials (Optional) + +If you plan to use cloud providers, prepare credentials: + +### AWS + +- AWS Access Key ID +- AWS Secret Access Key +- Configured via `~/.aws/credentials` or environment variables + +### UpCloud + +- UpCloud username +- UpCloud password +- Configured via environment variables or config files + +## Next Steps + +Once all prerequisites are met, proceed to: +→ **[Installation](02-installation.md)** \ No newline at end of file diff --git a/docs/src/getting-started/02-installation.md b/docs/src/getting-started/02-installation.md index 2e30354..bab7ad0 100644 --- a/docs/src/getting-started/02-installation.md +++ b/docs/src/getting-started/02-installation.md @@ -1 +1,235 @@ -# Installation\n\nThis guide walks you through installing the Provisioning Platform on your system.\n\n## Overview\n\nThe installation process involves:\n\n1. Cloning the repository\n2. Installing Nushell plugins\n3. Setting up configuration\n4. 
Initializing your first workspace\n\nEstimated time: 15-20 minutes\n\n## Step 1: Clone the Repository\n\n```\n# Clone the repository\ngit clone https://github.com/provisioning/provisioning-platform.git\ncd provisioning-platform\n\n# Checkout the latest stable release (optional)\ngit checkout tags/v3.5.0\n```\n\n## Step 2: Install Nushell Plugins\n\nThe platform uses multiple Nushell plugins for enhanced functionality.\n\n### Install nu_plugin_tera (Template Rendering)\n\n```\n# Install from crates.io\ncargo install nu_plugin_tera\n\n# Register with Nushell\nnu -c "plugin add ~/.cargo/bin/nu_plugin_tera; plugin use tera"\n```\n\n### Verify Plugin Installation\n\n```\n# Start Nushell\nnu\n\n# List installed plugins\nplugin list\n\n# Expected output should include:\n# - tera\n```\n\n## Step 3: Add CLI to PATH\n\nMake the `provisioning` command available globally:\n\n```\n# Option 1: Symlink to /usr/local/bin (recommended)\nsudo ln -s "$(pwd)/provisioning/core/cli/provisioning" /usr/local/bin/provisioning\n\n# Option 2: Add to PATH in your shell profile\necho 'export PATH="$PATH:'"$(pwd)"'/provisioning/core/cli"' >> ~/.bashrc # or ~/.zshrc\nsource ~/.bashrc # or ~/.zshrc\n\n# Verify installation\nprovisioning --version\n```\n\n## Step 4: Generate Age Encryption Keys\n\nGenerate keys for encrypting sensitive configuration:\n\n```\n# Create Age key directory\nmkdir -p ~/.config/provisioning/age\n\n# Generate private key\nage-keygen -o ~/.config/provisioning/age/private_key.txt\n\n# Extract public key\nage-keygen -y ~/.config/provisioning/age/private_key.txt > ~/.config/provisioning/age/public_key.txt\n\n# Secure the keys\nchmod 600 ~/.config/provisioning/age/private_key.txt\nchmod 644 ~/.config/provisioning/age/public_key.txt\n```\n\n## Step 5: Configure Environment\n\nSet up basic environment variables:\n\n```\n# Create environment file\ncat > ~/.provisioning/env << 'ENVEOF'\n# Provisioning Environment Configuration\nexport PROVISIONING_ENV=dev\nexport PROVISIONING_PATH=$(pwd)\nexport PROVISIONING_KAGE=~/.config/provisioning/age\nENVEOF\n\n# Source the environment\nsource ~/.provisioning/env\n\n# Add to shell profile for persistence\necho 'source ~/.provisioning/env' >> ~/.bashrc # or ~/.zshrc\n```\n\n## Step 6: Initialize Workspace\n\nCreate your first workspace:\n\n```\n# Initialize a new workspace\nprovisioning workspace init my-first-workspace\n\n# Expected output:\n# ✓ Workspace 'my-first-workspace' created successfully\n# ✓ Configuration template generated\n# ✓ Workspace activated\n\n# Verify workspace\nprovisioning workspace list\n```\n\n## Step 7: Validate Installation\n\nRun the installation verification:\n\n```\n# Check system configuration\nprovisioning validate config\n\n# Check all dependencies\nprovisioning env\n\n# View detailed environment\nprovisioning allenv\n```\n\nExpected output should show:\n\n- ✅ All core dependencies installed\n- ✅ Age keys configured\n- ✅ Workspace initialized\n- ✅ Configuration valid\n\n## Optional: Install Platform Services\n\nIf you plan to use platform services (orchestrator, control center, etc.):\n\n```\n# Build platform services\ncd provisioning/platform\n\n# Build orchestrator\ncd orchestrator\ncargo build --release\ncd ..\n\n# Build control center\ncd control-center\ncargo build --release\ncd ..\n\n# Build KMS service\ncd kms-service\ncargo build --release\ncd ..\n\n# Verify builds\nls */target/release/\n```\n\n## Optional: Install Platform with Installer\n\nUse the interactive installer for a guided setup:\n\n```\n# Build the installer\ncd 
provisioning/platform/installer\ncargo build --release\n\n# Run interactive installer\n./target/release/provisioning-installer\n\n# Or headless installation\n./target/release/provisioning-installer --headless --mode solo --yes\n```\n\n## Troubleshooting\n\n### Nushell Plugin Not Found\n\nIf plugins aren't recognized:\n\n```\n# Rebuild plugin registry\nnu -c "plugin list; plugin use tera"\n```\n\n### Permission Denied\n\nIf you encounter permission errors:\n\n```\n# Ensure proper ownership\nsudo chown -R $USER:$USER ~/.config/provisioning\n\n# Check PATH\necho $PATH | grep provisioning\n```\n\n### Age Keys Not Found\n\nIf encryption fails:\n\n```\n# Verify keys exist\nls -la ~/.config/provisioning/age/\n\n# Regenerate if needed\nage-keygen -o ~/.config/provisioning/age/private_key.txt\n```\n\n## Next Steps\n\nOnce installation is complete, proceed to:\n→ **[First Deployment](03-first-deployment.md)**\n\n## Additional Resources\n\n- [Detailed Installation Guide](../user/installation-guide.md)\n- [Workspace Management](../user/workspace-setup.md)\n- [Troubleshooting Guide](../user/troubleshooting-guide.md) +# Installation + +This guide walks you through installing the Provisioning Platform on your system. + +## Overview + +The installation process involves: + +1. Cloning the repository +2. Installing Nushell plugins +3. Setting up configuration +4. Initializing your first workspace + +Estimated time: 15-20 minutes + +## Step 1: Clone the Repository + +```text +# Clone the repository +git clone https://github.com/provisioning/provisioning-platform.git +cd provisioning-platform + +# Checkout the latest stable release (optional) +git checkout tags/v3.5.0 +``` + +## Step 2: Install Nushell Plugins + +The platform uses multiple Nushell plugins for enhanced functionality. 
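+
+Plugin binaries are installed once (typically with cargo) and then registered with Nushell's plugin registry. If you accumulate several plugins, registration can be scripted; a minimal sketch, assuming the plugin binaries live under `~/.cargo/bin` (paths are illustrative):
+
+```text
+# Register every installed nu_plugin_* binary with Nushell
+ls ~/.cargo/bin/nu_plugin_* | each { |p| nu -c $"plugin add ($p.name)" }
+```
+
+The required tera plugin is installed and registered step by step below.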
+ +### Install nu_plugin_tera (Template Rendering) + +```text +# Install from crates.io +cargo install nu_plugin_tera + +# Register with Nushell +nu -c "plugin add ~/.cargo/bin/nu_plugin_tera; plugin use tera" +``` + +### Verify Plugin Installation + +```text +# Start Nushell +nu + +# List installed plugins +plugin list + +# Expected output should include: +# - tera +``` + +## Step 3: Add CLI to PATH + +Make the `provisioning` command available globally: + +```text +# Option 1: Symlink to /usr/local/bin (recommended) +sudo ln -s "$(pwd)/provisioning/core/cli/provisioning" /usr/local/bin/provisioning + +# Option 2: Add to PATH in your shell profile +echo 'export PATH="$PATH:'"$(pwd)"'/provisioning/core/cli"' >> ~/.bashrc # or ~/.zshrc +source ~/.bashrc # or ~/.zshrc + +# Verify installation +provisioning --version +``` + +## Step 4: Generate Age Encryption Keys + +Generate keys for encrypting sensitive configuration: + +```text +# Create Age key directory +mkdir -p ~/.config/provisioning/age + +# Generate private key +age-keygen -o ~/.config/provisioning/age/private_key.txt + +# Extract public key +age-keygen -y ~/.config/provisioning/age/private_key.txt > ~/.config/provisioning/age/public_key.txt + +# Secure the keys +chmod 600 ~/.config/provisioning/age/private_key.txt +chmod 644 ~/.config/provisioning/age/public_key.txt +``` + +## Step 5: Configure Environment + +Set up basic environment variables: + +```text +# Create environment file +cat > ~/.provisioning/env << 'ENVEOF' +# Provisioning Environment Configuration +export PROVISIONING_ENV=dev +export PROVISIONING_PATH=$(pwd) +export PROVISIONING_KAGE=~/.config/provisioning/age +ENVEOF + +# Source the environment +source ~/.provisioning/env + +# Add to shell profile for persistence +echo 'source ~/.provisioning/env' >> ~/.bashrc # or ~/.zshrc +``` + +## Step 6: Initialize Workspace + +Create your first workspace: + +```text +# Initialize a new workspace +provisioning workspace init my-first-workspace + +# Expected output: +# ✓ Workspace 'my-first-workspace' created successfully +# ✓ Configuration template generated +# ✓ Workspace activated + +# Verify workspace +provisioning workspace list +``` + +## Step 7: Validate Installation + +Run the installation verification: + +```text +# Check system configuration +provisioning validate config + +# Check all dependencies +provisioning env + +# View detailed environment +provisioning allenv +``` + +Expected output should show: + +- ✅ All core dependencies installed +- ✅ Age keys configured +- ✅ Workspace initialized +- ✅ Configuration valid + +## Optional: Install Platform Services + +If you plan to use platform services (orchestrator, control center, etc.): + +```text +# Build platform services +cd provisioning/platform + +# Build orchestrator +cd orchestrator +cargo build --release +cd .. + +# Build control center +cd control-center +cargo build --release +cd .. + +# Build KMS service +cd kms-service +cargo build --release +cd .. 
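+# (other optional platform services, if present, build the same way)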
+ +# Verify builds +ls */target/release/ +``` + +## Optional: Install Platform with Installer + +Use the interactive installer for a guided setup: + +```text +# Build the installer +cd provisioning/platform/installer +cargo build --release + +# Run interactive installer +./target/release/provisioning-installer + +# Or headless installation +./target/release/provisioning-installer --headless --mode solo --yes +``` + +## Troubleshooting + +### Nushell Plugin Not Found + +If plugins aren't recognized: + +```text +# Rebuild plugin registry +nu -c "plugin list; plugin use tera" +``` + +### Permission Denied + +If you encounter permission errors: + +```text +# Ensure proper ownership +sudo chown -R $USER:$USER ~/.config/provisioning + +# Check PATH +echo $PATH | grep provisioning +``` + +### Age Keys Not Found + +If encryption fails: + +```text +# Verify keys exist +ls -la ~/.config/provisioning/age/ + +# Regenerate if needed +age-keygen -o ~/.config/provisioning/age/private_key.txt +``` + +## Next Steps + +Once installation is complete, proceed to: +→ **[First Deployment](03-first-deployment.md)** + +## Additional Resources + +- [Detailed Installation Guide](../user/installation-guide.md) +- [Workspace Management](../user/workspace-setup.md) +- [Troubleshooting Guide](../user/troubleshooting-guide.md) \ No newline at end of file diff --git a/docs/src/getting-started/03-first-deployment.md b/docs/src/getting-started/03-first-deployment.md index 6333c72..5c9278d 100644 --- a/docs/src/getting-started/03-first-deployment.md +++ b/docs/src/getting-started/03-first-deployment.md @@ -1 +1,273 @@ -# First Deployment\n\nThis guide walks you through deploying your first infrastructure using the Provisioning Platform.\n\n## Overview\n\nIn this chapter, you'll:\n\n1. Configure a simple infrastructure\n2. Create your first server\n3. Install a task service (Kubernetes)\n4. 
Verify the deployment\n\nEstimated time: 10-15 minutes\n\n## Step 1: Configure Infrastructure\n\nCreate a basic infrastructure configuration:\n\n```\n# Generate infrastructure template\nprovisioning generate infra --new my-infra\n\n# This creates: workspace/infra/my-infra/\n# - config.toml (infrastructure settings)\n# - settings.ncl (Nickel configuration)\n```\n\n## Step 2: Edit Configuration\n\nEdit the generated configuration:\n\n```\n# Edit with your preferred editor\n$EDITOR workspace/infra/my-infra/settings.ncl\n```\n\nExample configuration:\n\n```\nimport provisioning.settings as cfg\n\n# Infrastructure settings\ninfra_settings = cfg.InfraSettings {\n name = "my-infra"\n provider = "local" # Start with local provider\n environment = "development"\n}\n\n# Server configuration\nservers = [\n {\n hostname = "dev-server-01"\n cores = 2\n memory = 4096 # MB\n disk = 50 # GB\n }\n]\n```\n\n## Step 3: Create Server (Check Mode)\n\nFirst, run in check mode to see what would happen:\n\n```\n# Check mode - no actual changes\nprovisioning server create --infra my-infra --check\n\n# Expected output:\n# ✓ Validation passed\n# ⚠ Check mode: No changes will be made\n# \n# Would create:\n# - Server: dev-server-01 (2 cores, 4 GB RAM, 50 GB disk)\n```\n\n## Step 4: Create Server (Real)\n\nIf check mode looks good, create the server:\n\n```\n# Create server\nprovisioning server create --infra my-infra\n\n# Expected output:\n# ✓ Creating server: dev-server-01\n# ✓ Server created successfully\n# ✓ IP Address: 192.168.1.100\n# ✓ SSH access: ssh user@192.168.1.100\n```\n\n## Step 5: Verify Server\n\nCheck server status:\n\n```\n# List all servers\nprovisioning server list\n\n# Get detailed server info\nprovisioning server info dev-server-01\n\n# SSH to server (optional)\nprovisioning server ssh dev-server-01\n```\n\n## Step 6: Install Kubernetes (Check Mode)\n\nInstall a task service on the server:\n\n```\n# Check mode first\nprovisioning taskserv create kubernetes --infra my-infra --check\n\n# Expected output:\n# ✓ Validation passed\n# ⚠ Check mode: No changes will be made\n#\n# Would install:\n# - Kubernetes v1.28.0\n# - Required dependencies: containerd, etcd\n# - On servers: dev-server-01\n```\n\n## Step 7: Install Kubernetes (Real)\n\nProceed with installation:\n\n```\n# Install Kubernetes\nprovisioning taskserv create kubernetes --infra my-infra --wait\n\n# This will:\n# 1. Check dependencies\n# 2. Install containerd\n# 3. Install etcd\n# 4. Install Kubernetes\n# 5. 
Configure and start services\n\n# Monitor progress\nprovisioning workflow monitor \n```\n\n## Step 8: Verify Installation\n\nCheck that Kubernetes is running:\n\n```\n# List installed task services\nprovisioning taskserv list --infra my-infra\n\n# Check Kubernetes status\nprovisioning server ssh dev-server-01\nkubectl get nodes # On the server\nexit\n\n# Or remotely\nprovisioning server exec dev-server-01 -- kubectl get nodes\n```\n\n## Common Deployment Patterns\n\n### Pattern 1: Multiple Servers\n\nCreate multiple servers at once:\n\n```\nservers = [\n {hostname = "web-01", cores = 2, memory = 4096},\n {hostname = "web-02", cores = 2, memory = 4096},\n {hostname = "db-01", cores = 4, memory = 8192}\n]\n```\n\n```\nprovisioning server create --infra my-infra --servers web-01,web-02,db-01\n```\n\n### Pattern 2: Server with Multiple Task Services\n\nInstall multiple services on one server:\n\n```\nprovisioning taskserv create kubernetes,cilium,postgres --infra my-infra --servers web-01\n```\n\n### Pattern 3: Complete Cluster\n\nDeploy a complete cluster configuration:\n\n```\nprovisioning cluster create buildkit --infra my-infra\n```\n\n## Deployment Workflow\n\nThe typical deployment workflow:\n\n```\n# 1. Initialize workspace\nprovisioning workspace init production\n\n# 2. Generate infrastructure\nprovisioning generate infra --new prod-infra\n\n# 3. Configure (edit settings.ncl)\n$EDITOR workspace/infra/prod-infra/settings.ncl\n\n# 4. Validate configuration\nprovisioning validate config --infra prod-infra\n\n# 5. Create servers (check mode)\nprovisioning server create --infra prod-infra --check\n\n# 6. Create servers (real)\nprovisioning server create --infra prod-infra\n\n# 7. Install task services\nprovisioning taskserv create kubernetes --infra prod-infra --wait\n\n# 8. Deploy cluster (if needed)\nprovisioning cluster create my-cluster --infra prod-infra\n\n# 9. Verify\nprovisioning server list\nprovisioning taskserv list\n```\n\n## Troubleshooting\n\n### Server Creation Fails\n\n```\n# Check logs\nprovisioning server logs dev-server-01\n\n# Try with debug mode\nprovisioning --debug server create --infra my-infra\n```\n\n### Task Service Installation Fails\n\n```\n# Check task service logs\nprovisioning taskserv logs kubernetes\n\n# Retry installation\nprovisioning taskserv create kubernetes --infra my-infra --force\n```\n\n### SSH Connection Issues\n\n```\n# Verify SSH key\nls -la ~/.ssh/\n\n# Test SSH manually\nssh -v user@\n\n# Use provisioning SSH helper\nprovisioning server ssh dev-server-01 --debug\n```\n\n## Next Steps\n\nNow that you've completed your first deployment:\n→ **[Verification](04-verification.md)** - Verify your deployment is working correctly\n\n## Additional Resources\n\n- [Complete Deployment Guide](../guides/from-scratch.md)\n- [Infrastructure Management](../user/infrastructure-management.md)\n- [Troubleshooting Guide](../user/troubleshooting-guide.md) +# First Deployment + +This guide walks you through deploying your first infrastructure using the Provisioning Platform. + +## Overview + +In this chapter, you'll: + +1. Configure a simple infrastructure +2. Create your first server +3. Install a task service (Kubernetes) +4. 
Verify the deployment + +Estimated time: 10-15 minutes + +## Step 1: Configure Infrastructure + +Create a basic infrastructure configuration: + +```text +# Generate infrastructure template +provisioning generate infra --new my-infra + +# This creates: workspace/infra/my-infra/ +# - config.toml (infrastructure settings) +# - settings.ncl (Nickel configuration) +``` + +## Step 2: Edit Configuration + +Edit the generated configuration: + +```text +# Edit with your preferred editor +$EDITOR workspace/infra/my-infra/settings.ncl +``` + +Example configuration: + +```text +import provisioning.settings as cfg + +# Infrastructure settings +infra_settings = cfg.InfraSettings { + name = "my-infra" + provider = "local" # Start with local provider + environment = "development" +} + +# Server configuration +servers = [ + { + hostname = "dev-server-01" + cores = 2 + memory = 4096 # MB + disk = 50 # GB + } +] +``` + +## Step 3: Create Server (Check Mode) + +First, run in check mode to see what would happen: + +```text +# Check mode - no actual changes +provisioning server create --infra my-infra --check + +# Expected output: +# ✓ Validation passed +# ⚠ Check mode: No changes will be made +# +# Would create: +# - Server: dev-server-01 (2 cores, 4 GB RAM, 50 GB disk) +``` + +## Step 4: Create Server (Real) + +If check mode looks good, create the server: + +```text +# Create server +provisioning server create --infra my-infra + +# Expected output: +# ✓ Creating server: dev-server-01 +# ✓ Server created successfully +# ✓ IP Address: 192.168.1.100 +# ✓ SSH access: ssh user@192.168.1.100 +``` + +## Step 5: Verify Server + +Check server status: + +```text +# List all servers +provisioning server list + +# Get detailed server info +provisioning server info dev-server-01 + +# SSH to server (optional) +provisioning server ssh dev-server-01 +``` + +## Step 6: Install Kubernetes (Check Mode) + +Install a task service on the server: + +```text +# Check mode first +provisioning taskserv create kubernetes --infra my-infra --check + +# Expected output: +# ✓ Validation passed +# ⚠ Check mode: No changes will be made +# +# Would install: +# - Kubernetes v1.28.0 +# - Required dependencies: containerd, etcd +# - On servers: dev-server-01 +``` + +## Step 7: Install Kubernetes (Real) + +Proceed with installation: + +```text +# Install Kubernetes +provisioning taskserv create kubernetes --infra my-infra --wait + +# This will: +# 1. Check dependencies +# 2. Install containerd +# 3. Install etcd +# 4. Install Kubernetes +# 5. 
Configure and start services + +# Monitor progress +provisioning workflow monitor +``` + +## Step 8: Verify Installation + +Check that Kubernetes is running: + +```text +# List installed task services +provisioning taskserv list --infra my-infra + +# Check Kubernetes status +provisioning server ssh dev-server-01 +kubectl get nodes # On the server +exit + +# Or remotely +provisioning server exec dev-server-01 -- kubectl get nodes +``` + +## Common Deployment Patterns + +### Pattern 1: Multiple Servers + +Create multiple servers at once: + +```text +servers = [ + {hostname = "web-01", cores = 2, memory = 4096}, + {hostname = "web-02", cores = 2, memory = 4096}, + {hostname = "db-01", cores = 4, memory = 8192} +] +``` + +```text +provisioning server create --infra my-infra --servers web-01,web-02,db-01 +``` + +### Pattern 2: Server with Multiple Task Services + +Install multiple services on one server: + +```text +provisioning taskserv create kubernetes,cilium,postgres --infra my-infra --servers web-01 +``` + +### Pattern 3: Complete Cluster + +Deploy a complete cluster configuration: + +```text +provisioning cluster create buildkit --infra my-infra +``` + +## Deployment Workflow + +The typical deployment workflow: + +```text +# 1. Initialize workspace +provisioning workspace init production + +# 2. Generate infrastructure +provisioning generate infra --new prod-infra + +# 3. Configure (edit settings.ncl) +$EDITOR workspace/infra/prod-infra/settings.ncl + +# 4. Validate configuration +provisioning validate config --infra prod-infra + +# 5. Create servers (check mode) +provisioning server create --infra prod-infra --check + +# 6. Create servers (real) +provisioning server create --infra prod-infra + +# 7. Install task services +provisioning taskserv create kubernetes --infra prod-infra --wait + +# 8. Deploy cluster (if needed) +provisioning cluster create my-cluster --infra prod-infra + +# 9. Verify +provisioning server list +provisioning taskserv list +``` + +## Troubleshooting + +### Server Creation Fails + +```text +# Check logs +provisioning server logs dev-server-01 + +# Try with debug mode +provisioning --debug server create --infra my-infra +``` + +### Task Service Installation Fails + +```text +# Check task service logs +provisioning taskserv logs kubernetes + +# Retry installation +provisioning taskserv create kubernetes --infra my-infra --force +``` + +### SSH Connection Issues + +```text +# Verify SSH key +ls -la ~/.ssh/ + +# Test SSH manually +ssh -v user@ + +# Use provisioning SSH helper +provisioning server ssh dev-server-01 --debug +``` + +## Next Steps + +Now that you've completed your first deployment: +→ **[Verification](04-verification.md)** - Verify your deployment is working correctly + +## Additional Resources + +- [Complete Deployment Guide](../guides/from-scratch.md) +- [Infrastructure Management](../user/infrastructure-management.md) +- [Troubleshooting Guide](../user/troubleshooting-guide.md) \ No newline at end of file diff --git a/docs/src/getting-started/04-verification.md b/docs/src/getting-started/04-verification.md index 070c163..38ef6f3 100644 --- a/docs/src/getting-started/04-verification.md +++ b/docs/src/getting-started/04-verification.md @@ -1 +1,342 @@ -# Verification\n\nThis guide helps you verify that your Provisioning Platform deployment is working correctly.\n\n## Overview\n\nAfter completing your first deployment, verify:\n\n1. System configuration\n2. Server accessibility\n3. Task service health\n4. 
Platform services (if installed)\n\n## Step 1: Verify Configuration\n\nCheck that all configuration is valid:\n\n```\n# Validate all configuration\nprovisioning validate config\n\n# Expected output:\n# ✓ Configuration valid\n# ✓ No errors found\n# ✓ All required fields present\n```\n\n```\n# Check environment variables\nprovisioning env\n\n# View complete configuration\nprovisioning allenv\n```\n\n## Step 2: Verify Servers\n\nCheck that servers are accessible and healthy:\n\n```\n# List all servers\nprovisioning server list\n\n# Expected output:\n# ┌───────────────┬──────────┬───────┬────────┬──────────────┬──────────┐\n# │ Hostname │ Provider │ Cores │ Memory │ IP Address │ Status │\n# ├───────────────┼──────────┼───────┼────────┼──────────────┼──────────┤\n# │ dev-server-01 │ local │ 2 │ 4096 │ 192.168.1.100│ running │\n# └───────────────┴──────────┴───────┴────────┴──────────────┴──────────┘\n```\n\n```\n# Check server details\nprovisioning server info dev-server-01\n\n# Test SSH connectivity\nprovisioning server ssh dev-server-01 -- echo "SSH working"\n```\n\n## Step 3: Verify Task Services\n\nCheck installed task services:\n\n```\n# List task services\nprovisioning taskserv list\n\n# Expected output:\n# ┌────────────┬─────────┬────────────────┬──────────┐\n# │ Name │ Version │ Server │ Status │\n# ├────────────┼─────────┼────────────────┼──────────┤\n# │ containerd │ 1.7.0 │ dev-server-01 │ running │\n# │ etcd │ 3.5.0 │ dev-server-01 │ running │\n# │ kubernetes │ 1.28.0 │ dev-server-01 │ running │\n# └────────────┴─────────┴────────────────┴──────────┘\n```\n\n```\n# Check specific task service\nprovisioning taskserv status kubernetes\n\n# View task service logs\nprovisioning taskserv logs kubernetes --tail 50\n```\n\n## Step 4: Verify Kubernetes (If Installed)\n\nIf you installed Kubernetes, verify it's working:\n\n```\n# Check Kubernetes nodes\nprovisioning server ssh dev-server-01 -- kubectl get nodes\n\n# Expected output:\n# NAME STATUS ROLES AGE VERSION\n# dev-server-01 Ready control-plane 10m v1.28.0\n```\n\n```\n# Check Kubernetes pods\nprovisioning server ssh dev-server-01 -- kubectl get pods -A\n\n# All pods should be Running or Completed\n```\n\n## Step 5: Verify Platform Services (Optional)\n\nIf you installed platform services:\n\n### Orchestrator\n\n```\n# Check orchestrator health\ncurl http://localhost:8080/health\n\n# Expected:\n# {"status":"healthy","version":"0.1.0"}\n```\n\n```\n# List tasks\ncurl http://localhost:8080/tasks\n```\n\n### Control Center\n\n```\n# Check control center health\ncurl http://localhost:9090/health\n\n# Test policy evaluation\ncurl -X POST http://localhost:9090/policies/evaluate \\n -H "Content-Type: application/json" \\n -d '{"principal":{"id":"test"},"action":{"id":"read"},"resource":{"id":"test"}}'\n```\n\n### KMS Service\n\n```\n# Check KMS health\ncurl http://localhost:8082/api/v1/kms/health\n\n# Test encryption\necho "test" | provisioning kms encrypt\n```\n\n## Step 6: Run Health Checks\n\nRun comprehensive health checks:\n\n```\n# Check all components\nprovisioning health check\n\n# Expected output:\n# ✓ Configuration: OK\n# ✓ Servers: 1/1 healthy\n# ✓ Task Services: 3/3 running\n# ✓ Platform Services: 3/3 healthy\n# ✓ Network Connectivity: OK\n# ✓ Encryption Keys: OK\n```\n\n## Step 7: Verify Workflows\n\nIf you used workflows:\n\n```\n# List all workflows\nprovisioning workflow list\n\n# Check specific workflow\nprovisioning workflow status \n\n# View workflow stats\nprovisioning workflow stats\n```\n\n## Common Verification 
Checks\n\n### DNS Resolution (If CoreDNS Installed)\n\n```\n# Test DNS resolution\ndig @localhost test.provisioning.local\n\n# Check CoreDNS status\nprovisioning server ssh dev-server-01 -- systemctl status coredns\n```\n\n### Network Connectivity\n\n```\n# Test server-to-server connectivity\nprovisioning server ssh dev-server-01 -- ping -c 3 dev-server-02\n\n# Check firewall rules\nprovisioning server ssh dev-server-01 -- sudo iptables -L\n```\n\n### Storage and Resources\n\n```\n# Check disk usage\nprovisioning server ssh dev-server-01 -- df -h\n\n# Check memory usage\nprovisioning server ssh dev-server-01 -- free -h\n\n# Check CPU usage\nprovisioning server ssh dev-server-01 -- top -bn1 | head -20\n```\n\n## Troubleshooting Failed Verifications\n\n### Configuration Validation Failed\n\n```\n# View detailed error\nprovisioning validate config --verbose\n\n# Check specific infrastructure\nprovisioning validate config --infra my-infra\n```\n\n### Server Unreachable\n\n```\n# Check server logs\nprovisioning server logs dev-server-01\n\n# Try debug mode\nprovisioning --debug server ssh dev-server-01\n```\n\n### Task Service Not Running\n\n```\n# Check service logs\nprovisioning taskserv logs kubernetes\n\n# Restart service\nprovisioning taskserv restart kubernetes --infra my-infra\n```\n\n### Platform Service Down\n\n```\n# Check service status\nprovisioning platform status orchestrator\n\n# View service logs\nprovisioning platform logs orchestrator --tail 100\n\n# Restart service\nprovisioning platform restart orchestrator\n```\n\n## Performance Verification\n\n### Response Time Tests\n\n```\n# Measure server response time\ntime provisioning server info dev-server-01\n\n# Measure task service response time\ntime provisioning taskserv list\n\n# Measure workflow submission time\ntime provisioning workflow submit test-workflow.ncl\n```\n\n### Resource Usage\n\n```\n# Check platform resource usage\ndocker stats # If using Docker\n\n# Check system resources\nprovisioning system resources\n```\n\n## Security Verification\n\n### Encryption\n\n```\n# Verify encryption keys\nls -la ~/.config/provisioning/age/\n\n# Test encryption/decryption\necho "test" | provisioning kms encrypt | provisioning kms decrypt\n```\n\n### Authentication (If Enabled)\n\n```\n# Test login\nprovisioning login --username admin\n\n# Verify token\nprovisioning whoami\n\n# Test MFA (if enabled)\nprovisioning mfa verify \n```\n\n## Verification Checklist\n\nUse this checklist to ensure everything is working:\n\n- [ ] Configuration validation passes\n- [ ] All servers are accessible via SSH\n- [ ] All servers show "running" status\n- [ ] All task services show "running" status\n- [ ] Kubernetes nodes are "Ready" (if installed)\n- [ ] Kubernetes pods are "Running" (if installed)\n- [ ] Platform services respond to health checks\n- [ ] Encryption/decryption works\n- [ ] Workflows can be submitted and complete\n- [ ] No errors in logs\n- [ ] Resource usage is within expected limits\n\n## Next Steps\n\nOnce verification is complete:\n\n- **[User Guide](../user/README.md)** - Learn advanced features\n- **[Quick Reference](../guides/quickstart-cheatsheet.md)** - Command shortcuts\n- **[Infrastructure Management](../user/infrastructure-management.md)** - Day-to-day operations\n- **[Troubleshooting](../user/troubleshooting-guide.md)** - Common issues and solutions\n\n## Additional Resources\n\n- [Complete From-Scratch Guide](../guides/from-scratch.md)\n- [Service Management Guide](../user/SERVICE_MANAGEMENT_GUIDE.md)\n- [Test 
Environment Guide](../user/test-environment-guide.md)\n\n---\n\n**Congratulations!** You've successfully deployed and verified your first Provisioning Platform infrastructure!
+# Verification
+
+This guide helps you verify that your Provisioning Platform deployment is working correctly.
+
+## Overview
+
+After completing your first deployment, verify:
+
+1. System configuration
+2. Server accessibility
+3. Task service health
+4. Platform services (if installed)
+
+## Step 1: Verify Configuration
+
+Check that all configuration is valid:
+
+```text
+# Validate all configuration
+provisioning validate config
+
+# Expected output:
+# ✓ Configuration valid
+# ✓ No errors found
+# ✓ All required fields present
+```
+
+```text
+# Check environment variables
+provisioning env
+
+# View complete configuration
+provisioning allenv
+```
+
+## Step 2: Verify Servers
+
+Check that servers are accessible and healthy:
+
+```text
+# List all servers
+provisioning server list
+
+# Expected output:
+# ┌───────────────┬──────────┬───────┬────────┬──────────────┬──────────┐
+# │ Hostname │ Provider │ Cores │ Memory │ IP Address │ Status │
+# ├───────────────┼──────────┼───────┼────────┼──────────────┼──────────┤
+# │ dev-server-01 │ local │ 2 │ 4096 │ 192.168.1.100│ running │
+# └───────────────┴──────────┴───────┴────────┴──────────────┴──────────┘
+```
+
+```text
+# Check server details
+provisioning server info dev-server-01
+
+# Test SSH connectivity
+provisioning server ssh dev-server-01 -- echo "SSH working"
+```
+
+## Step 3: Verify Task Services
+
+Check installed task services:
+
+```text
+# List task services
+provisioning taskserv list
+
+# Expected output:
+# ┌────────────┬─────────┬────────────────┬──────────┐
+# │ Name │ Version │ Server │ Status │
+# ├────────────┼─────────┼────────────────┼──────────┤
+# │ containerd │ 1.7.0 │ dev-server-01 │ running │
+# │ etcd │ 3.5.0 │ dev-server-01 │ running │
+# │ kubernetes │ 1.28.0 │ dev-server-01 │ running │
+# └────────────┴─────────┴────────────────┴──────────┘
+```
+
+```text
+# Check specific task service
+provisioning taskserv status kubernetes
+
+# View task service logs
+provisioning taskserv logs kubernetes --tail 50
+```
+
+## Step 4: Verify Kubernetes (If Installed)
+
+If you installed Kubernetes, verify it's working:
+
+```text
+# Check Kubernetes nodes
+provisioning server ssh dev-server-01 -- kubectl get nodes
+
+# Expected output:
+# NAME STATUS ROLES AGE VERSION
+# dev-server-01 Ready control-plane 10m v1.28.0
+```
+
+```text
+# Check Kubernetes pods
+provisioning server ssh dev-server-01 -- kubectl get pods -A
+
+# All pods should be Running or Completed
+```
+
+## Step 5: Verify Platform Services (Optional)
+
+If you installed platform services:
+
+### Orchestrator
+
+```text
+# Check orchestrator health
+curl http://localhost:8080/health
+
+# Expected:
+# {"status":"healthy","version":"0.1.0"}
+```
+
+```text
+# List tasks
+curl http://localhost:8080/tasks
+```
+
+### Control Center
+
+```text
+# Check control center health
+curl http://localhost:9090/health
+
+# Test policy evaluation
+curl -X POST http://localhost:9090/policies/evaluate \
+  -H "Content-Type: application/json" \
+  -d '{"principal":{"id":"test"},"action":{"id":"read"},"resource":{"id":"test"}}'
+```
+
+### KMS Service
+
+```text
+# Check KMS health
+curl http://localhost:8082/api/v1/kms/health
+
+# Test encryption
+echo "test" | provisioning kms encrypt
+```
+
+## Step 6: Run Health Checks
+
+Run comprehensive health checks:
+
+```text
+# Check all components
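+# (platform service checks assume those services are installed and running)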
+provisioning health check + +# Expected output: +# ✓ Configuration: OK +# ✓ Servers: 1/1 healthy +# ✓ Task Services: 3/3 running +# ✓ Platform Services: 3/3 healthy +# ✓ Network Connectivity: OK +# ✓ Encryption Keys: OK +``` + +## Step 7: Verify Workflows + +If you used workflows: + +```text +# List all workflows +provisioning workflow list + +# Check specific workflow +provisioning workflow status + +# View workflow stats +provisioning workflow stats +``` + +## Common Verification Checks + +### DNS Resolution (If CoreDNS Installed) + +```text +# Test DNS resolution +dig @localhost test.provisioning.local + +# Check CoreDNS status +provisioning server ssh dev-server-01 -- systemctl status coredns +``` + +### Network Connectivity + +```text +# Test server-to-server connectivity +provisioning server ssh dev-server-01 -- ping -c 3 dev-server-02 + +# Check firewall rules +provisioning server ssh dev-server-01 -- sudo iptables -L +``` + +### Storage and Resources + +```text +# Check disk usage +provisioning server ssh dev-server-01 -- df -h + +# Check memory usage +provisioning server ssh dev-server-01 -- free -h + +# Check CPU usage +provisioning server ssh dev-server-01 -- top -bn1 | head -20 +``` + +## Troubleshooting Failed Verifications + +### Configuration Validation Failed + +```text +# View detailed error +provisioning validate config --verbose + +# Check specific infrastructure +provisioning validate config --infra my-infra +``` + +### Server Unreachable + +```text +# Check server logs +provisioning server logs dev-server-01 + +# Try debug mode +provisioning --debug server ssh dev-server-01 +``` + +### Task Service Not Running + +```text +# Check service logs +provisioning taskserv logs kubernetes + +# Restart service +provisioning taskserv restart kubernetes --infra my-infra +``` + +### Platform Service Down + +```text +# Check service status +provisioning platform status orchestrator + +# View service logs +provisioning platform logs orchestrator --tail 100 + +# Restart service +provisioning platform restart orchestrator +``` + +## Performance Verification + +### Response Time Tests + +```text +# Measure server response time +time provisioning server info dev-server-01 + +# Measure task service response time +time provisioning taskserv list + +# Measure workflow submission time +time provisioning workflow submit test-workflow.ncl +``` + +### Resource Usage + +```text +# Check platform resource usage +docker stats # If using Docker + +# Check system resources +provisioning system resources +``` + +## Security Verification + +### Encryption + +```text +# Verify encryption keys +ls -la ~/.config/provisioning/age/ + +# Test encryption/decryption +echo "test" | provisioning kms encrypt | provisioning kms decrypt +``` + +### Authentication (If Enabled) + +```text +# Test login +provisioning login --username admin + +# Verify token +provisioning whoami + +# Test MFA (if enabled) +provisioning mfa verify +``` + +## Verification Checklist + +Use this checklist to ensure everything is working: + +- [ ] Configuration validation passes +- [ ] All servers are accessible via SSH +- [ ] All servers show "running" status +- [ ] All task services show "running" status +- [ ] Kubernetes nodes are "Ready" (if installed) +- [ ] Kubernetes pods are "Running" (if installed) +- [ ] Platform services respond to health checks +- [ ] Encryption/decryption works +- [ ] Workflows can be submitted and complete +- [ ] No errors in logs +- [ ] Resource usage is within expected limits + +## Next Steps + +Once 
verification is complete: + +- **[User Guide](../user/README.md)** - Learn advanced features +- **[Quick Reference](../guides/quickstart-cheatsheet.md)** - Command shortcuts +- **[Infrastructure Management](../user/infrastructure-management.md)** - Day-to-day operations +- **[Troubleshooting](../user/troubleshooting-guide.md)** - Common issues and solutions + +## Additional Resources + +- [Complete From-Scratch Guide](../guides/from-scratch.md) +- [Service Management Guide](../user/SERVICE_MANAGEMENT_GUIDE.md) +- [Test Environment Guide](../user/test-environment-guide.md) + +--- + +**Congratulations!** You've successfully deployed and verified your first Provisioning Platform infrastructure! \ No newline at end of file diff --git a/docs/src/getting-started/05-platform-configuration.md b/docs/src/getting-started/05-platform-configuration.md index 9d09044..e9f7035 100644 --- a/docs/src/getting-started/05-platform-configuration.md +++ b/docs/src/getting-started/05-platform-configuration.md @@ -1 +1,499 @@ -# Platform Service Configuration\n\nAfter verifying your installation, the next step is to configure the platform services. This guide walks you through setting up your provisioning\nplatform for deployment.\n\n## What You'll Learn\n\n- Understanding platform services and configuration modes\n- Setting up platform configurations with `setup-platform-config.sh`\n- Choosing the right deployment mode for your use case\n- Configuring services interactively or with quick mode\n- Running platform services with your configuration\n\n## Prerequisites\n\nBefore configuring platform services, ensure you have:\n\n- ✅ Completed [Installation Steps](02-installation.md)\n- ✅ Verified installation with [Verification](04-verification.md)\n- ✅ **Nickel** 0.10+ (for configuration language)\n- ✅ **Nushell** 0.109+ (for scripts)\n- ✅ **TypeDialog** (optional, for interactive configuration)\n\n## Platform Services Overview\n\nThe provisioning platform consists of 8 core services:\n\n| Service | Purpose | Default Mode |\n| --------- | --------- | -------------- |\n| **orchestrator** | Main orchestration engine | Required |\n| **control-center** | Web UI and management console | Required |\n| **mcp-server** | Model Context Protocol integration | Optional |\n| **vault-service** | Secrets management and encryption | Required |\n| **extension-registry** | Extension distribution system | Required |\n| **rag** | Retrieval-Augmented Generation | Optional |\n| **ai-service** | AI model integration | Optional |\n| **provisioning-daemon** | Background operations | Required |\n\n## Deployment Modes\n\nChoose a deployment mode based on your needs:\n\n| Mode | Resources | Use Case |\n| ------ | ----------- | ---------- |\n| **solo** | 2 CPU, 4 GB RAM | Development, testing, local machines |\n| **multiuser** | 4 CPU, 8 GB RAM | Team staging, team development |\n| **cicd** | 8 CPU, 16 GB RAM | CI/CD pipelines, automated testing |\n| **enterprise** | 16+ CPU, 32+ GB | Production, high-availability |\n\n## Step 1: Initialize Configuration Script\n\nThe configuration system is managed by a standalone script that doesn't require the main installer:\n\n```\n# Navigate to the provisioning directory\ncd /path/to/project-provisioning\n\n# Verify the setup script exists\nls -la provisioning/scripts/setup-platform-config.sh\n\n# Make script executable\nchmod +x provisioning/scripts/setup-platform-config.sh\n```\n\n## Step 2: Choose Configuration Method\n\n### Method A: Interactive TypeDialog Configuration (Recommended)\n\nTypeDialog 
provides an interactive form-based configuration interface available in multiple backends (web, TUI, CLI).\n\n#### Quick Interactive Setup (All Services at Once)\n\n```\n# Run interactive setup - prompts for choices\n./provisioning/scripts/setup-platform-config.sh\n\n# Follow the prompts to:\n# 1. Choose action (TypeDialog, Quick Mode, Clean, List)\n# 2. Select service (or all services)\n# 3. Choose deployment mode\n# 4. Select backend (web, tui, cli)\n```\n\n#### Configure Specific Service with TypeDialog\n\n```\n# Configure orchestrator in solo mode with web UI\n./provisioning/scripts/setup-platform-config.sh \\n --service orchestrator \\n --mode solo \\n --backend web\n\n# TypeDialog opens browser → User fills form → Config generated\n```\n\n**When to use TypeDialog:**\n- First-time setup with visual form guidance\n- Updating configuration with validation\n- Multiple services needing coordinated changes\n- Team environments where UI is preferred\n\n### Method B: Quick Mode Configuration (Fastest)\n\nQuick mode automatically creates all service configurations from defaults overlaid with mode-specific tuning.\n\n```\n# Quick setup for solo development mode\n./provisioning/scripts/setup-platform-config.sh --quick-mode --mode solo\n\n# Quick setup for enterprise production\n./provisioning/scripts/setup-platform-config.sh --quick-mode --mode enterprise\n\n# Result: All 8 services configured immediately with appropriate resource limits\n```\n\n**When to use Quick Mode:**\n- Initial setup with standard defaults\n- Switching deployment modes\n- CI/CD automated setup\n- Scripted/programmatic configuration\n\n### Method C: Manual Nickel Configuration\n\nFor advanced users who prefer editing configuration files directly:\n\n```\n# View schema definition\ncat provisioning/schemas/platform/schemas/orchestrator.ncl\n\n# View default values\ncat provisioning/schemas/platform/defaults/orchestrator-defaults.ncl\n\n# View mode overlay\ncat provisioning/schemas/platform/defaults/deployment/solo-defaults.ncl\n\n# Edit configuration directly\nvim provisioning/config/runtime/orchestrator.solo.ncl\n\n# Validate Nickel syntax\nnickel typecheck provisioning/config/runtime/orchestrator.solo.ncl\n\n# Regenerate TOML from edited config (CRITICAL STEP)\n./provisioning/scripts/setup-platform-config.sh --generate-toml\n```\n\n**When to use Manual Edit:**\n- Advanced customization beyond form options\n- Programmatic configuration generation\n- Integration with CI/CD systems\n- Custom workspace-specific overrides\n\n## Step 3: Understand Configuration Layers\n\nThe configuration system uses layered composition:\n\n```\n1. Schema (Type contract)\n ↓ Defines valid fields and constraints\n\n2. Service Defaults (Base values)\n ↓ Default configuration for each service\n\n3. Mode Overlay (Mode-specific tuning)\n ↓ solo, multiuser, cicd, or enterprise settings\n\n4. User Customization (Overrides)\n ↓ User-specific or workspace-specific changes\n\n5. Runtime Config (Final result)\n ↓ provisioning/config/runtime/orchestrator.solo.ncl\n\n6. 
TOML Export (Service consumption)\n ↓ provisioning/config/runtime/generated/orchestrator.solo.toml\n```\n\nAll layers are automatically composed and validated.\n\n## Step 4: Verify Generated Configuration\n\nAfter running the setup script, verify the configuration was created:\n\n```\n# List generated runtime configurations\nls -la provisioning/config/runtime/\n\n# Check generated TOML files\nls -la provisioning/config/runtime/generated/\n\n# Verify TOML is valid\ncat provisioning/config/runtime/generated/orchestrator.solo.toml | head -20\n```\n\nYou should see files for all 8 services in both the runtime directory (Nickel format) and the generated directory (TOML format).\n\n## Step 5: Run Platform Services\n\nAfter successful configuration, services can be started:\n\n### Running a Single Service\n\n```\n# Set deployment mode\nexport ORCHESTRATOR_MODE=solo\n\n# Run the orchestrator service\ncd provisioning/platform\ncargo run -p orchestrator\n```\n\n### Running Multiple Services\n\n```\n# Terminal 1: Vault Service (secrets management)\nexport VAULT_MODE=solo\ncargo run -p vault-service\n\n# Terminal 2: Orchestrator (main service)\nexport ORCHESTRATOR_MODE=solo\ncargo run -p orchestrator\n\n# Terminal 3: Control Center (web UI)\nexport CONTROL_CENTER_MODE=solo\ncargo run -p control-center\n\n# Access web UI at http://localhost:8080 (default)\n```\n\n### Docker-Based Deployment\n\n```\n# Start all services in Docker (requires docker-compose.yml)\ncd provisioning/platform/infrastructure/docker\ndocker-compose -f docker-compose.solo.yml up\n\n# Or for enterprise mode\ndocker-compose -f docker-compose.enterprise.yml up\n```\n\n## Step 6: Verify Services Are Running\n\n```\n# Check orchestrator status\ncurl http://localhost:9000/health\n\n# Check control center web UI\nopen http://localhost:8080\n\n# View service logs\nexport ORCHESTRATOR_MODE=solo\ncargo run -p orchestrator -- --log-level debug\n```\n\n## Customizing Configuration\n\n### Scenario: Change Deployment Mode\n\nIf you need to switch from solo to multiuser mode:\n\n```\n# Option 1: Re-run setup with new mode\n./provisioning/scripts/setup-platform-config.sh --quick-mode --mode multiuser\n\n# Option 2: Interactive update via TypeDialog\n./provisioning/scripts/setup-platform-config.sh --service orchestrator --mode multiuser --backend web\n\n# Result: All configurations updated for multiuser mode\n# Services read from provisioning/config/runtime/generated/orchestrator.multiuser.toml\n```\n\n### Scenario: Manual Configuration Edit\n\nIf you need fine-grained control:\n\n```\n# 1. Edit the Nickel configuration directly\nvim provisioning/config/runtime/orchestrator.solo.ncl\n\n# 2. Make your changes (for example, change port, add environment variables)\n\n# 3. Validate syntax\nnickel typecheck provisioning/config/runtime/orchestrator.solo.ncl\n\n# 4. CRITICAL: Regenerate TOML (services won't see changes without this)\n./provisioning/scripts/setup-platform-config.sh --generate-toml\n\n# 5. Verify TOML was updated\nstat provisioning/config/runtime/generated/orchestrator.solo.toml\n\n# 6. 
Restart service with new configuration\npkill orchestrator\nexport ORCHESTRATOR_MODE=solo\ncargo run -p orchestrator\n```\n\n### Scenario: Workspace-Specific Overrides\n\nFor workspace-specific customization:\n\n```\n# Create workspace override file\nmkdir -p workspace_myworkspace/config\ncat > workspace_myworkspace/config/platform-overrides.ncl <<'EOF'\n# Workspace-specific settings\n{\n orchestrator = {\n server.port = 9999, # Custom port\n workspace.name = "myworkspace"\n },\n\n control_center = {\n workspace.name = "myworkspace"\n }\n}\nEOF\n\n# Generate config with workspace overrides\n./provisioning/scripts/setup-platform-config.sh --workspace workspace_myworkspace\n\n# Configuration system merges: defaults + mode overlay + workspace overrides\n```\n\n## Available Configuration Commands\n\n```\n# List all available modes\n./provisioning/scripts/setup-platform-config.sh --list-modes\n# Output: solo, multiuser, cicd, enterprise\n\n# List all configurable services\n./provisioning/scripts/setup-platform-config.sh --list-services\n# Output: orchestrator, control-center, mcp-server, vault-service, extension-registry, rag, ai-service, provisioning-daemon\n\n# List current configurations\n./provisioning/scripts/setup-platform-config.sh --list-configs\n# Output: Shows current runtime configurations and their status\n\n# Clean all runtime configurations (use with caution)\n./provisioning/scripts/setup-platform-config.sh --clean\n# Removes: provisioning/config/runtime/*.ncl\n# provisioning/config/runtime/generated/*.toml\n```\n\n## Configuration File Locations\n\n### Public Definitions (Part of repository)\n\n```\nprovisioning/schemas/platform/\n├── schemas/ # Type contracts (Nickel)\n├── defaults/ # Base configuration values\n│ └── deployment/ # Mode-specific: solo, multiuser, cicd, enterprise\n├── validators/ # Business logic validation\n├── templates/ # Configuration generation templates\n└── constraints/ # Validation limits\n```\n\n### Private Runtime Configs (Gitignored)\n\n```\nprovisioning/config/runtime/ # User-specific deployments\n├── orchestrator.solo.ncl # Editable config\n├── orchestrator.multiuser.ncl\n└── generated/ # Auto-generated, don't edit\n ├── orchestrator.solo.toml # For Rust services\n └── orchestrator.multiuser.toml\n```\n\n### Examples (Reference)\n\n```\nprovisioning/config/examples/\n├── orchestrator.solo.example.ncl # Solo mode reference\n└── orchestrator.enterprise.example.ncl # Enterprise mode reference\n```\n\n## Troubleshooting Configuration\n\n### Issue: Script Fails with "Nickel not found"\n\n```\n# Install Nickel\n# macOS\nbrew install nickel\n\n# Linux\ncargo install nickel --version 0.10\n\n# Verify installation\nnickel --version\n# Expected: 0.10.0 or higher\n```\n\n### Issue: Configuration Won't Generate TOML\n\n```\n# Check Nickel syntax\nnickel typecheck provisioning/config/runtime/orchestrator.solo.ncl\n\n# If errors found, view detailed message\nnickel typecheck -i provisioning/config/runtime/orchestrator.solo.ncl\n\n# Try manual export\nnickel export --format toml provisioning/config/runtime/orchestrator.solo.ncl\n```\n\n### Issue: Service Can't Read Configuration\n\n```\n# Verify TOML file exists\nls -la provisioning/config/runtime/generated/orchestrator.solo.toml\n\n# Verify file is valid TOML\nhead -20 provisioning/config/runtime/generated/orchestrator.solo.toml\n\n# Check service is looking in right location\necho $ORCHESTRATOR_MODE # Should be set to 'solo', 'multiuser', etc.\n\n# Verify environment variable is correct\nexport 
ORCHESTRATOR_MODE=solo\ncargo run -p orchestrator --verbose\n```\n\n### Issue: Services Won't Start After Config Change\n\n```\n# If you edited .ncl file manually, TOML must be regenerated\n./provisioning/scripts/setup-platform-config.sh --generate-toml\n\n# Verify new TOML was created\nstat provisioning/config/runtime/generated/orchestrator.solo.toml\n\n# Check modification time (should be recent)\nls -lah provisioning/config/runtime/generated/orchestrator.solo.toml\n```\n\n## Important Notes\n\n### 🔒 Runtime Configurations Are Private\n\nFiles in `provisioning/config/runtime/` are **gitignored** because:\n- May contain encrypted secrets or credentials\n- Deployment-specific (different per environment)\n- User-customized (each developer/machine has different needs)\n\n### 📘 Schemas Are Public\n\nFiles in `provisioning/schemas/platform/` are **version-controlled** because:\n- Define product structure and constraints\n- Part of official releases\n- Source of truth for configuration format\n- Shared across the team\n\n### 🔄 Configuration Is Idempotent\n\nThe setup script is safe to run multiple times:\n\n```\n# Safe: Updates only what's needed\n./provisioning/scripts/setup-platform-config.sh --quick-mode --mode enterprise\n\n# Safe: Doesn't overwrite without --clean\n./provisioning/scripts/setup-platform-config.sh --generate-toml\n\n# Only deletes on explicit request\n./provisioning/scripts/setup-platform-config.sh --clean\n```\n\n### ⚠️ Installer Status\n\nThe full provisioning installer (`provisioning/scripts/install.sh`) is **not yet implemented**. Currently:\n\n- ✅ Configuration setup script is standalone and ready to use\n- ⏳ Full installer integration is planned for future release\n- ✅ Manual workflow works perfectly without installer\n- ✅ CI/CD integration available now\n\n## Next Steps\n\nAfter completing platform configuration:\n\n1. **Run Services**: Start your platform services with configured settings\n2. **Access Web UI**: Open Control Center at [http://localhost:8080](http://localhost:8080) (default)\n3. **Create First Infrastructure**: Deploy your first servers and clusters\n4. **Set Up Extensions**: Configure providers and task services for your needs\n5. **Backup Configuration**: Back up runtime configs to private repository\n\n## Additional Resources\n\n- [Setup Status & Current System Status](../../provisioning/config/SETUP_STATUS.md) - Quick reference for system readiness\n- [Configuration README](../../provisioning/config/README.md) - Detailed configuration management guide\n- [Setup Script Documentation](../../provisioning/scripts/setup-platform-config.sh.md) - Complete script reference\n- [TypeDialog Platform Config Guide](../development/typedialog-platform-config-guide.md) - Advanced configuration topics\n- [Deployment Guide](../operations/deployment-guide.md) - Production deployment procedures\n\n---\n\n**Version**: 1.0.0\n**Last Updated**: 2026-01-05\n**Difficulty**: Beginner to Intermediate +# Platform Service Configuration + +After verifying your installation, the next step is to configure the platform services. This guide walks you through setting up your provisioning +platform for deployment. 
+
+## What You'll Learn
+
+- Understanding platform services and configuration modes
+- Setting up platform configurations with `setup-platform-config.sh`
+- Choosing the right deployment mode for your use case
+- Configuring services interactively or with quick mode
+- Running platform services with your configuration
+
+## Prerequisites
+
+Before configuring platform services, ensure you have:
+
+- ✅ Completed [Installation Steps](02-installation.md)
+- ✅ Verified installation with [Verification](04-verification.md)
+- ✅ **Nickel** 0.10+ (for configuration language)
+- ✅ **Nushell** 0.109+ (for scripts)
+- ✅ **TypeDialog** (optional, for interactive configuration)
+
+## Platform Services Overview
+
+The provisioning platform consists of 8 core services:
+
+| Service | Purpose | Default Mode |
+| --------- | --------- | -------------- |
+| **orchestrator** | Main orchestration engine | Required |
+| **control-center** | Web UI and management console | Required |
+| **mcp-server** | Model Context Protocol integration | Optional |
+| **vault-service** | Secrets management and encryption | Required |
+| **extension-registry** | Extension distribution system | Required |
+| **rag** | Retrieval-Augmented Generation | Optional |
+| **ai-service** | AI model integration | Optional |
+| **provisioning-daemon** | Background operations | Required |
+
+## Deployment Modes
+
+Choose a deployment mode based on your needs:
+
+| Mode | Resources | Use Case |
+| ------ | ----------- | ---------- |
+| **solo** | 2 CPU, 4 GB RAM | Development, testing, local machines |
+| **multiuser** | 4 CPU, 8 GB RAM | Team staging, team development |
+| **cicd** | 8 CPU, 16 GB RAM | CI/CD pipelines, automated testing |
+| **enterprise** | 16+ CPU, 32+ GB | Production, high-availability |
+
+## Step 1: Initialize Configuration Script
+
+The configuration system is managed by a standalone script that doesn't require the main installer:
+
+```text
+# Navigate to the provisioning directory
+cd /path/to/project-provisioning
+
+# Verify the setup script exists
+ls -la provisioning/scripts/setup-platform-config.sh
+
+# Make script executable
+chmod +x provisioning/scripts/setup-platform-config.sh
+```
+
+## Step 2: Choose Configuration Method
+
+### Method A: Interactive TypeDialog Configuration (Recommended)
+
+TypeDialog provides an interactive form-based configuration interface available in multiple backends (web, TUI, CLI).
+
+#### Quick Interactive Setup (All Services at Once)
+
+```text
+# Run interactive setup - prompts for choices
+./provisioning/scripts/setup-platform-config.sh
+
+# Follow the prompts to:
+# 1. Choose action (TypeDialog, Quick Mode, Clean, List)
+# 2. Select service (or all services)
+# 3. Choose deployment mode
+# 4. Select backend (web, tui, cli)
+```
+
+#### Configure Specific Service with TypeDialog
+
+```text
+# Configure orchestrator in solo mode with web UI
+./provisioning/scripts/setup-platform-config.sh \
+ --service orchestrator \
+ --mode solo \
+ --backend web
+
+# TypeDialog opens browser → User fills form → Config generated
+```
+
+**When to use TypeDialog:**
+- First-time setup with visual form guidance
+- Updating configuration with validation
+- Multiple services needing coordinated changes
+- Team environments where UI is preferred
+
+### Method B: Quick Mode Configuration (Fastest)
+
+Quick mode automatically creates all service configurations from defaults overlaid with mode-specific tuning.
+ +```text +# Quick setup for solo development mode +./provisioning/scripts/setup-platform-config.sh --quick-mode --mode solo + +# Quick setup for enterprise production +./provisioning/scripts/setup-platform-config.sh --quick-mode --mode enterprise + +# Result: All 8 services configured immediately with appropriate resource limits +``` + +**When to use Quick Mode:** +- Initial setup with standard defaults +- Switching deployment modes +- CI/CD automated setup +- Scripted/programmatic configuration + +### Method C: Manual Nickel Configuration + +For advanced users who prefer editing configuration files directly: + +```text +# View schema definition +cat provisioning/schemas/platform/schemas/orchestrator.ncl + +# View default values +cat provisioning/schemas/platform/defaults/orchestrator-defaults.ncl + +# View mode overlay +cat provisioning/schemas/platform/defaults/deployment/solo-defaults.ncl + +# Edit configuration directly +vim provisioning/config/runtime/orchestrator.solo.ncl + +# Validate Nickel syntax +nickel typecheck provisioning/config/runtime/orchestrator.solo.ncl + +# Regenerate TOML from edited config (CRITICAL STEP) +./provisioning/scripts/setup-platform-config.sh --generate-toml +``` + +**When to use Manual Edit:** +- Advanced customization beyond form options +- Programmatic configuration generation +- Integration with CI/CD systems +- Custom workspace-specific overrides + +## Step 3: Understand Configuration Layers + +The configuration system uses layered composition: + +```text +1. Schema (Type contract) + ↓ Defines valid fields and constraints + +2. Service Defaults (Base values) + ↓ Default configuration for each service + +3. Mode Overlay (Mode-specific tuning) + ↓ solo, multiuser, cicd, or enterprise settings + +4. User Customization (Overrides) + ↓ User-specific or workspace-specific changes + +5. Runtime Config (Final result) + ↓ provisioning/config/runtime/orchestrator.solo.ncl + +6. TOML Export (Service consumption) + ↓ provisioning/config/runtime/generated/orchestrator.solo.toml +``` + +All layers are automatically composed and validated. + +## Step 4: Verify Generated Configuration + +After running the setup script, verify the configuration was created: + +```text +# List generated runtime configurations +ls -la provisioning/config/runtime/ + +# Check generated TOML files +ls -la provisioning/config/runtime/generated/ + +# Verify TOML is valid +cat provisioning/config/runtime/generated/orchestrator.solo.toml | head -20 +``` + +You should see files for all 8 services in both the runtime directory (Nickel format) and the generated directory (TOML format). 
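+
+For orientation, here is a sketch of the rough shape one of these generated files can take. The section and field names below are illustrative assumptions drawn from other examples in this guide (the Nickel schemas are the authoritative source), so treat your own `generated/` output as the reference:
+
+```text
+# provisioning/config/runtime/generated/orchestrator.solo.toml (illustrative sketch)
+# Auto-generated from service defaults + solo overlay - do not edit by hand
+
+[server]
+port = 9000        # assumed solo-mode default, matching the health check URL below
+
+[workspace]
+name = "default"   # field name assumed from the workspace override example
+```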
+ +## Step 5: Run Platform Services + +After successful configuration, services can be started: + +### Running a Single Service + +```text +# Set deployment mode +export ORCHESTRATOR_MODE=solo + +# Run the orchestrator service +cd provisioning/platform +cargo run -p orchestrator +``` + +### Running Multiple Services + +```text +# Terminal 1: Vault Service (secrets management) +export VAULT_MODE=solo +cargo run -p vault-service + +# Terminal 2: Orchestrator (main service) +export ORCHESTRATOR_MODE=solo +cargo run -p orchestrator + +# Terminal 3: Control Center (web UI) +export CONTROL_CENTER_MODE=solo +cargo run -p control-center + +# Access web UI at http://localhost:8080 (default) +``` + +### Docker-Based Deployment + +```text +# Start all services in Docker (requires docker-compose.yml) +cd provisioning/platform/infrastructure/docker +docker-compose -f docker-compose.solo.yml up + +# Or for enterprise mode +docker-compose -f docker-compose.enterprise.yml up +``` + +## Step 6: Verify Services Are Running + +```text +# Check orchestrator status +curl http://localhost:9000/health + +# Check control center web UI +open http://localhost:8080 + +# View service logs +export ORCHESTRATOR_MODE=solo +cargo run -p orchestrator -- --log-level debug +``` + +## Customizing Configuration + +### Scenario: Change Deployment Mode + +If you need to switch from solo to multiuser mode: + +```text +# Option 1: Re-run setup with new mode +./provisioning/scripts/setup-platform-config.sh --quick-mode --mode multiuser + +# Option 2: Interactive update via TypeDialog +./provisioning/scripts/setup-platform-config.sh --service orchestrator --mode multiuser --backend web + +# Result: All configurations updated for multiuser mode +# Services read from provisioning/config/runtime/generated/orchestrator.multiuser.toml +``` + +### Scenario: Manual Configuration Edit + +If you need fine-grained control: + +```text +# 1. Edit the Nickel configuration directly +vim provisioning/config/runtime/orchestrator.solo.ncl + +# 2. Make your changes (for example, change port, add environment variables) + +# 3. Validate syntax +nickel typecheck provisioning/config/runtime/orchestrator.solo.ncl + +# 4. CRITICAL: Regenerate TOML (services won't see changes without this) +./provisioning/scripts/setup-platform-config.sh --generate-toml + +# 5. Verify TOML was updated +stat provisioning/config/runtime/generated/orchestrator.solo.toml + +# 6. 
Restart service with new configuration +pkill orchestrator +export ORCHESTRATOR_MODE=solo +cargo run -p orchestrator +``` + +### Scenario: Workspace-Specific Overrides + +For workspace-specific customization: + +```text +# Create workspace override file +mkdir -p workspace_myworkspace/config +cat > workspace_myworkspace/config/platform-overrides.ncl <<'EOF' +# Workspace-specific settings +{ + orchestrator = { + server.port = 9999, # Custom port + workspace.name = "myworkspace" + }, + + control_center = { + workspace.name = "myworkspace" + } +} +EOF + +# Generate config with workspace overrides +./provisioning/scripts/setup-platform-config.sh --workspace workspace_myworkspace + +# Configuration system merges: defaults + mode overlay + workspace overrides +``` + +## Available Configuration Commands + +```text +# List all available modes +./provisioning/scripts/setup-platform-config.sh --list-modes +# Output: solo, multiuser, cicd, enterprise + +# List all configurable services +./provisioning/scripts/setup-platform-config.sh --list-services +# Output: orchestrator, control-center, mcp-server, vault-service, extension-registry, rag, ai-service, provisioning-daemon + +# List current configurations +./provisioning/scripts/setup-platform-config.sh --list-configs +# Output: Shows current runtime configurations and their status + +# Clean all runtime configurations (use with caution) +./provisioning/scripts/setup-platform-config.sh --clean +# Removes: provisioning/config/runtime/*.ncl +# provisioning/config/runtime/generated/*.toml +``` + +## Configuration File Locations + +### Public Definitions (Part of repository) + +```text +provisioning/schemas/platform/ +├── schemas/ # Type contracts (Nickel) +├── defaults/ # Base configuration values +│ └── deployment/ # Mode-specific: solo, multiuser, cicd, enterprise +├── validators/ # Business logic validation +├── templates/ # Configuration generation templates +└── constraints/ # Validation limits +``` + +### Private Runtime Configs (Gitignored) + +```text +provisioning/config/runtime/ # User-specific deployments +├── orchestrator.solo.ncl # Editable config +├── orchestrator.multiuser.ncl +└── generated/ # Auto-generated, don't edit + ├── orchestrator.solo.toml # For Rust services + └── orchestrator.multiuser.toml +``` + +### Examples (Reference) + +```text +provisioning/config/examples/ +├── orchestrator.solo.example.ncl # Solo mode reference +└── orchestrator.enterprise.example.ncl # Enterprise mode reference +``` + +## Troubleshooting Configuration + +### Issue: Script Fails with "Nickel not found" + +```text +# Install Nickel +# macOS +brew install nickel + +# Linux +cargo install nickel --version 0.10 + +# Verify installation +nickel --version +# Expected: 0.10.0 or higher +``` + +### Issue: Configuration Won't Generate TOML + +```text +# Check Nickel syntax +nickel typecheck provisioning/config/runtime/orchestrator.solo.ncl + +# If errors found, view detailed message +nickel typecheck -i provisioning/config/runtime/orchestrator.solo.ncl + +# Try manual export +nickel export --format toml provisioning/config/runtime/orchestrator.solo.ncl +``` + +### Issue: Service Can't Read Configuration + +```text +# Verify TOML file exists +ls -la provisioning/config/runtime/generated/orchestrator.solo.toml + +# Verify file is valid TOML +head -20 provisioning/config/runtime/generated/orchestrator.solo.toml + +# Check service is looking in right location +echo $ORCHESTRATOR_MODE # Should be set to 'solo', 'multiuser', etc. 
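+# Optional guard (assumes a POSIX shell): warn early when the mode is unset
+[ -z "$ORCHESTRATOR_MODE" ] && echo "ORCHESTRATOR_MODE is not set"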
+
+# Verify environment variable is correct
+export ORCHESTRATOR_MODE=solo
+cargo run -p orchestrator -- --log-level debug
+```
+
+### Issue: Services Won't Start After Config Change
+
+```text
+# If you edited .ncl file manually, TOML must be regenerated
+./provisioning/scripts/setup-platform-config.sh --generate-toml
+
+# Verify new TOML was created
+stat provisioning/config/runtime/generated/orchestrator.solo.toml
+
+# Check modification time (should be recent)
+ls -lah provisioning/config/runtime/generated/orchestrator.solo.toml
+```
+
+## Important Notes
+
+### 🔒 Runtime Configurations Are Private
+
+Files in `provisioning/config/runtime/` are **gitignored** because:
+- May contain encrypted secrets or credentials
+- Deployment-specific (different per environment)
+- User-customized (each developer/machine has different needs)
+
+### 📘 Schemas Are Public
+
+Files in `provisioning/schemas/platform/` are **version-controlled** because:
+- Define product structure and constraints
+- Part of official releases
+- Source of truth for configuration format
+- Shared across the team
+
+### 🔄 Configuration Is Idempotent
+
+The setup script is safe to run multiple times:
+
+```text
+# Safe: Updates only what's needed
+./provisioning/scripts/setup-platform-config.sh --quick-mode --mode enterprise
+
+# Safe: Doesn't overwrite without --clean
+./provisioning/scripts/setup-platform-config.sh --generate-toml
+
+# Only deletes on explicit request
+./provisioning/scripts/setup-platform-config.sh --clean
+```
+
+### ⚠️ Installer Status
+
+The full provisioning installer (`provisioning/scripts/install.sh`) is **not yet implemented**. Currently:
+
+- ✅ Configuration setup script is standalone and ready to use
+- ⏳ Full installer integration is planned for future release
+- ✅ Manual workflow works perfectly without installer
+- ✅ CI/CD integration available now
+
+## Next Steps
+
+After completing platform configuration:
+
+1. **Run Services**: Start your platform services with configured settings
+2. **Access Web UI**: Open Control Center at [http://localhost:8080](http://localhost:8080) (default)
+3. **Create First Infrastructure**: Deploy your first servers and clusters
+4. **Set Up Extensions**: Configure providers and task services for your needs
+5. **Backup Configuration**: Back up runtime configs to private repository
+
+## Additional Resources
+
+- [Setup Status & Current System Status](../../provisioning/config/SETUP_STATUS.md) - Quick reference for system readiness
+- [Configuration README](../../provisioning/config/README.md) - Detailed configuration management guide
+- [Setup Script Documentation](../../provisioning/scripts/setup-platform-config.sh.md) - Complete script reference
+- [TypeDialog Platform Config Guide](../development/typedialog-platform-config-guide.md) - Advanced configuration topics
+- [Deployment Guide](../operations/deployment-guide.md) - Production deployment procedures
+
+---
+
+**Version**: 1.0.0
+**Last Updated**: 2026-01-05
+**Difficulty**: Beginner to Intermediate
\ No newline at end of file
diff --git a/docs/src/getting-started/getting-started.md b/docs/src/getting-started/getting-started.md
index 792be82..ceb3779 100644
--- a/docs/src/getting-started/getting-started.md
+++ b/docs/src/getting-started/getting-started.md
@@ -1 +1,551 @@
-# Getting Started Guide\n\nWelcome to Infrastructure Automation. 
This guide will walk you through your first steps with infrastructure automation, from basic setup to deploying\nyour first infrastructure.\n\n## What You'll Learn\n\n- Essential concepts and terminology\n- How to configure your first environment\n- Creating and managing infrastructure\n- Basic server and service management\n- Common workflows and best practices\n\n## Prerequisites\n\nBefore starting this guide, ensure you have:\n\n- ✅ Completed the [Installation Guide](installation-guide.md)\n- ✅ Verified your installation with `provisioning --version`\n- ✅ Basic familiarity with command-line interfaces\n\n## Essential Concepts\n\n### Infrastructure as Code (IaC)\n\nProvisioning uses **declarative configuration** to manage infrastructure. Instead of manually creating resources, you define what you want in\nconfiguration files, and the system makes it happen.\n\n```\nYou describe → System creates → Infrastructure exists\n```\n\n### Key Components\n\n| Component | Purpose | Example |\n| ----------- | --------- | --------- |\n| **Providers** | Cloud platforms | AWS, UpCloud, Local |\n| **Servers** | Virtual machines | Web servers, databases |\n| **Task Services** | Infrastructure software | Kubernetes, Docker, databases |\n| **Clusters** | Grouped services | Web cluster, database cluster |\n\n### Configuration Languages\n\n- **Nickel**: Primary configuration language for infrastructure definitions (type-safe, validated)\n- **TOML**: User preferences and system settings\n- **YAML**: Kubernetes manifests and service definitions\n\n## First-Time Setup\n\n### Step 1: Initialize Your Configuration\n\nCreate your personal configuration:\n\n```\n# Initialize user configuration\nprovisioning init config\n\n# This creates ~/.provisioning/config.user.toml\n```\n\n### Step 2: Verify Your Environment\n\n```\n# Check your environment setup\nprovisioning env\n\n# View comprehensive configuration\nprovisioning allenv\n```\n\nYou should see output like:\n\n```\n✅ Configuration loaded successfully\n✅ All required tools available\n📁 Base path: /usr/local/provisioning\n🏠 User config: ~/.provisioning/config.user.toml\n```\n\n### Step 3: Explore Available Resources\n\n```\n# List available providers\nprovisioning list providers\n\n# List available task services\nprovisioning list taskservs\n\n# List available clusters\nprovisioning list clusters\n```\n\n## Your First Infrastructure\n\nLet's create a simple local infrastructure to learn the basics.\n\n### Step 1: Create a Workspace\n\n```\n# Create a new workspace directory\nmkdir ~/my-first-infrastructure\ncd ~/my-first-infrastructure\n\n# Initialize workspace\nprovisioning generate infra --new local-demo\n```\n\nThis creates:\n\n```\nlocal-demo/\n├── config/\n│ └── config.ncl # Master Nickel configuration\n├── infra/\n│ └── default/\n│ ├── main.ncl # Infrastructure definition\n│ └── servers.ncl # Server configurations\n└── docs/ # Auto-generated guides\n```\n\n### Step 2: Examine the Configuration\n\n```\n# View the generated configuration\nprovisioning show settings --infra local-demo\n```\n\n### Step 3: Validate the Configuration\n\n```\n# Validate syntax and structure\nprovisioning validate config --infra local-demo\n\n# Should show: ✅ Configuration validation passed!\n```\n\n### Step 4: Deploy Infrastructure (Check Mode)\n\n```\n# Dry run - see what would be created\nprovisioning server create --infra local-demo --check\n\n# This shows planned changes without making them\n```\n\n### Step 5: Create Your Infrastructure\n\n```\n# Create the actual 
infrastructure\nprovisioning server create --infra local-demo\n\n# Wait for completion\nprovisioning server list --infra local-demo\n```\n\n## Working with Services\n\n### Installing Your First Service\n\nLet's install a containerized service:\n\n```\n# Install Docker/containerd\nprovisioning taskserv create containerd --infra local-demo\n\n# Verify installation\nprovisioning taskserv list --infra local-demo\n```\n\n### Installing Kubernetes\n\nFor container orchestration:\n\n```\n# Install Kubernetes\nprovisioning taskserv create kubernetes --infra local-demo\n\n# This may take several minutes...\n```\n\n### Checking Service Status\n\n```\n# Show all services on your infrastructure\nprovisioning show servers --infra local-demo\n\n# Show specific service details\nprovisioning show servers web-01 taskserv kubernetes --infra local-demo\n```\n\n## Understanding Commands\n\n### Command Structure\n\nAll commands follow this pattern:\n\n```\nprovisioning [global-options] [command-options] [arguments]\n```\n\n### Global Options\n\n| Option | Short | Description |\n| -------- | ------- | ------------- |\n| `--infra` | `-i` | Specify infrastructure |\n| `--check` | `-c` | Dry run mode |\n| `--debug` | `-x` | Enable debug output |\n| `--yes` | `-y` | Auto-confirm actions |\n\n### Essential Commands\n\n| Command | Purpose | Example |\n| --------- | --------- | --------- |\n| `help` | Show help | `provisioning help` |\n| `env` | Show environment | `provisioning env` |\n| `list` | List resources | `provisioning list servers` |\n| `show` | Show details | `provisioning show settings` |\n| `validate` | Validate config | `provisioning validate config` |\n\n## Working with Multiple Environments\n\n### Environment Concepts\n\nThe system supports multiple environments:\n\n- **dev** - Development and testing\n- **test** - Integration testing\n- **prod** - Production deployment\n\n### Switching Environments\n\n```\n# Set environment for this session\nexport PROVISIONING_ENV=dev\nprovisioning env\n\n# Or specify per command\nprovisioning --environment dev server create\n```\n\n### Environment-Specific Configuration\n\nCreate environment configs:\n\n```\n# Development environment\nprovisioning init config dev\n\n# Production environment\nprovisioning init config prod\n```\n\n## Common Workflows\n\n### Workflow 1: Development Environment\n\n```\n# 1. Create development workspace\nmkdir ~/dev-environment\ncd ~/dev-environment\n\n# 2. Generate infrastructure\nprovisioning generate infra --new dev-setup\n\n# 3. Customize for development\n# Edit settings.ncl to add development tools\n\n# 4. Deploy\nprovisioning server create --infra dev-setup --check\nprovisioning server create --infra dev-setup\n\n# 5. 
Install development services\nprovisioning taskserv create kubernetes --infra dev-setup\nprovisioning taskserv create containerd --infra dev-setup\n```\n\n### Workflow 2: Service Updates\n\n```\n# Check for service updates\nprovisioning taskserv check-updates\n\n# Update specific service\nprovisioning taskserv update kubernetes --infra dev-setup\n\n# Verify update\nprovisioning taskserv versions kubernetes\n```\n\n### Workflow 3: Infrastructure Scaling\n\n```\n# Add servers to existing infrastructure\n# Edit settings.ncl to add more servers\n\n# Apply changes\nprovisioning server create --infra dev-setup\n\n# Install services on new servers\nprovisioning taskserv create containerd --infra dev-setup\n```\n\n## Interactive Mode\n\n### Starting Interactive Shell\n\n```\n# Start Nushell with provisioning loaded\nprovisioning nu\n```\n\nIn the interactive shell, you have access to all provisioning functions:\n\n```\n# Inside Nushell session\nuse lib_provisioning *\n\n# Check environment\nshow_env\n\n# List available functions\nhelp commands | where name =~ "provision"\n```\n\n### Useful Interactive Commands\n\n```\n# Show detailed server information\nfind_servers "web-*" | table\n\n# Get cost estimates\nservers_walk_by_costs $settings "" false false "stdout"\n\n# Check task service status\ntaskservs_list | where status == "running"\n```\n\n## Configuration Management\n\n### Understanding Configuration Files\n\n1. **System Defaults**: `config.defaults.toml` - System-wide defaults\n2. **User Config**: `~/.provisioning/config.user.toml` - Your preferences\n3. **Environment Config**: `config.{env}.toml` - Environment-specific settings\n4. **Infrastructure Config**: `settings.ncl` - Infrastructure definitions\n\n### Configuration Hierarchy\n\n```\nInfrastructure settings.ncl\n ↓ (overrides)\nEnvironment config.{env}.toml\n ↓ (overrides)\nUser config.user.toml\n ↓ (overrides)\nSystem config.defaults.toml\n```\n\n### Customizing Your Configuration\n\n```\n# Edit user configuration\nprovisioning sops ~/.provisioning/config.user.toml\n\n# Or using your preferred editor\nnano ~/.provisioning/config.user.toml\n```\n\nExample customizations:\n\n```\n[debug]\nenabled = true # Enable debug mode by default\nlog_level = "debug" # Verbose logging\n\n[providers]\ndefault = "aws" # Use AWS as default provider\n\n[output]\nformat = "json" # Prefer JSON output\n```\n\n## Monitoring and Observability\n\n### Checking System Status\n\n```\n# Overall system health\nprovisioning env\n\n# Infrastructure status\nprovisioning show servers --infra dev-setup\n\n# Service status\nprovisioning taskserv list --infra dev-setup\n```\n\n### Logging and Debugging\n\n```\n# Enable debug mode for troubleshooting\nprovisioning --debug server create --infra dev-setup --check\n\n# View logs for specific operations\nprovisioning show logs --infra dev-setup\n```\n\n### Cost Monitoring\n\n```\n# Show cost estimates\nprovisioning show cost --infra dev-setup\n\n# Detailed cost breakdown\nprovisioning server price --infra dev-setup\n```\n\n## Best Practices\n\n### 1. Configuration Management\n\n- ✅ Use version control for infrastructure definitions\n- ✅ Test changes in development before production\n- ✅ Use `--check` mode to preview changes\n- ✅ Keep user configuration separate from infrastructure\n\n### 2. Security\n\n- ✅ Use SOPS for encrypting sensitive data\n- ✅ Regular key rotation for cloud providers\n- ✅ Principle of least privilege for access\n- ✅ Audit infrastructure changes\n\n### 3. 
Operational Excellence\n\n- ✅ Monitor infrastructure costs regularly\n- ✅ Keep services updated\n- ✅ Document custom configurations\n- ✅ Plan for disaster recovery\n\n### 4. Development Workflow\n\n```\n# 1. Always validate before applying\nprovisioning validate config --infra my-infra\n\n# 2. Use check mode first\nprovisioning server create --infra my-infra --check\n\n# 3. Apply changes incrementally\nprovisioning server create --infra my-infra\n\n# 4. Verify results\nprovisioning show servers --infra my-infra\n```\n\n## Getting Help\n\n### Built-in Help System\n\n```\n# General help\nprovisioning help\n\n# Command-specific help\nprovisioning server help\nprovisioning taskserv help\nprovisioning cluster help\n\n# Show available options\nprovisioning generate help\n```\n\n### Command Reference\n\nFor complete command documentation, see: [CLI Reference](cli-reference.md)\n\n### Troubleshooting\n\nIf you encounter issues, see: [Troubleshooting Guide](troubleshooting-guide.md)\n\n## Real-World Example\n\nLet's walk through a complete example of setting up a web application infrastructure:\n\n### Step 1: Plan Your Infrastructure\n\n```\n# Create project workspace\nmkdir ~/webapp-infrastructure\ncd ~/webapp-infrastructure\n\n# Generate base infrastructure\nprovisioning generate infra --new webapp\n```\n\n### Step 2: Customize Configuration\n\nEdit `webapp/settings.ncl` to define:\n\n- 2 web servers for load balancing\n- 1 database server\n- Load balancer configuration\n\n### Step 3: Deploy Base Infrastructure\n\n```\n# Validate configuration\nprovisioning validate config --infra webapp\n\n# Preview deployment\nprovisioning server create --infra webapp --check\n\n# Deploy servers\nprovisioning server create --infra webapp\n```\n\n### Step 4: Install Services\n\n```\n# Install container runtime on all servers\nprovisioning taskserv create containerd --infra webapp\n\n# Install load balancer on web servers\nprovisioning taskserv create haproxy --infra webapp\n\n# Install database on database server\nprovisioning taskserv create postgresql --infra webapp\n```\n\n### Step 5: Deploy Application\n\n```\n# Create application cluster\nprovisioning cluster create webapp --infra webapp\n\n# Verify deployment\nprovisioning show servers --infra webapp\nprovisioning cluster list --infra webapp\n```\n\n## Next Steps\n\nNow that you understand the basics:\n\n1. **Set up your workspace**: [Workspace Setup Guide](workspace-setup.md)\n2. **Learn about infrastructure management**: [Infrastructure Management Guide](infrastructure-management.md)\n3. **Understand configuration**: [Configuration Guide](configuration.md)\n4. **Explore examples**: [Examples and Tutorials](examples/)\n\nYou're ready to start building and managing cloud infrastructure with confidence! +# Getting Started Guide + +Welcome to Infrastructure Automation. This guide will walk you through your first steps with infrastructure automation, from basic setup to deploying +your first infrastructure. 
+ +## What You'll Learn + +- Essential concepts and terminology +- How to configure your first environment +- Creating and managing infrastructure +- Basic server and service management +- Common workflows and best practices + +## Prerequisites + +Before starting this guide, ensure you have: + +- ✅ Completed the [Installation Guide](installation-guide.md) +- ✅ Verified your installation with `provisioning --version` +- ✅ Basic familiarity with command-line interfaces + +## Essential Concepts + +### Infrastructure as Code (IaC) + +Provisioning uses **declarative configuration** to manage infrastructure. Instead of manually creating resources, you define what you want in +configuration files, and the system makes it happen. + +```text +You describe → System creates → Infrastructure exists +``` + +### Key Components + +| Component | Purpose | Example | +| ----------- | --------- | --------- | +| **Providers** | Cloud platforms | AWS, UpCloud, Local | +| **Servers** | Virtual machines | Web servers, databases | +| **Task Services** | Infrastructure software | Kubernetes, Docker, databases | +| **Clusters** | Grouped services | Web cluster, database cluster | + +### Configuration Languages + +- **Nickel**: Primary configuration language for infrastructure definitions (type-safe, validated) +- **TOML**: User preferences and system settings +- **YAML**: Kubernetes manifests and service definitions + +## First-Time Setup + +### Step 1: Initialize Your Configuration + +Create your personal configuration: + +```text +# Initialize user configuration +provisioning init config + +# This creates ~/.provisioning/config.user.toml +``` + +### Step 2: Verify Your Environment + +```text +# Check your environment setup +provisioning env + +# View comprehensive configuration +provisioning allenv +``` + +You should see output like: + +```text +✅ Configuration loaded successfully +✅ All required tools available +📁 Base path: /usr/local/provisioning +🏠 User config: ~/.provisioning/config.user.toml +``` + +### Step 3: Explore Available Resources + +```text +# List available providers +provisioning list providers + +# List available task services +provisioning list taskservs + +# List available clusters +provisioning list clusters +``` + +## Your First Infrastructure + +Let's create a simple local infrastructure to learn the basics. + +### Step 1: Create a Workspace + +```text +# Create a new workspace directory +mkdir ~/my-first-infrastructure +cd ~/my-first-infrastructure + +# Initialize workspace +provisioning generate infra --new local-demo +``` + +This creates: + +```text +local-demo/ +├── config/ +│ └── config.ncl # Master Nickel configuration +├── infra/ +│ └── default/ +│ ├── main.ncl # Infrastructure definition +│ └── servers.ncl # Server configurations +└── docs/ # Auto-generated guides +``` + +### Step 2: Examine the Configuration + +```text +# View the generated configuration +provisioning show settings --infra local-demo +``` + +### Step 3: Validate the Configuration + +```text +# Validate syntax and structure +provisioning validate config --infra local-demo + +# Should show: ✅ Configuration validation passed! 
+``` + +### Step 4: Deploy Infrastructure (Check Mode) + +```text +# Dry run - see what would be created +provisioning server create --infra local-demo --check + +# This shows planned changes without making them +``` + +### Step 5: Create Your Infrastructure + +```text +# Create the actual infrastructure +provisioning server create --infra local-demo + +# Wait for completion +provisioning server list --infra local-demo +``` + +## Working with Services + +### Installing Your First Service + +Let's install a containerized service: + +```text +# Install Docker/containerd +provisioning taskserv create containerd --infra local-demo + +# Verify installation +provisioning taskserv list --infra local-demo +``` + +### Installing Kubernetes + +For container orchestration: + +```text +# Install Kubernetes +provisioning taskserv create kubernetes --infra local-demo + +# This may take several minutes... +``` + +### Checking Service Status + +```text +# Show all services on your infrastructure +provisioning show servers --infra local-demo + +# Show specific service details +provisioning show servers web-01 taskserv kubernetes --infra local-demo +``` + +## Understanding Commands + +### Command Structure + +All commands follow this pattern: + +```text +provisioning [global-options] [command-options] [arguments] +``` + +### Global Options + +| Option | Short | Description | +| -------- | ------- | ------------- | +| `--infra` | `-i` | Specify infrastructure | +| `--check` | `-c` | Dry run mode | +| `--debug` | `-x` | Enable debug output | +| `--yes` | `-y` | Auto-confirm actions | + +### Essential Commands + +| Command | Purpose | Example | +| --------- | --------- | --------- | +| `help` | Show help | `provisioning help` | +| `env` | Show environment | `provisioning env` | +| `list` | List resources | `provisioning list servers` | +| `show` | Show details | `provisioning show settings` | +| `validate` | Validate config | `provisioning validate config` | + +## Working with Multiple Environments + +### Environment Concepts + +The system supports multiple environments: + +- **dev** - Development and testing +- **test** - Integration testing +- **prod** - Production deployment + +### Switching Environments + +```text +# Set environment for this session +export PROVISIONING_ENV=dev +provisioning env + +# Or specify per command +provisioning --environment dev server create +``` + +### Environment-Specific Configuration + +Create environment configs: + +```text +# Development environment +provisioning init config dev + +# Production environment +provisioning init config prod +``` + +## Common Workflows + +### Workflow 1: Development Environment + +```text +# 1. Create development workspace +mkdir ~/dev-environment +cd ~/dev-environment + +# 2. Generate infrastructure +provisioning generate infra --new dev-setup + +# 3. Customize for development +# Edit settings.ncl to add development tools + +# 4. Deploy +provisioning server create --infra dev-setup --check +provisioning server create --infra dev-setup + +# 5. 
Install development services +provisioning taskserv create kubernetes --infra dev-setup +provisioning taskserv create containerd --infra dev-setup +``` + +### Workflow 2: Service Updates + +```text +# Check for service updates +provisioning taskserv check-updates + +# Update specific service +provisioning taskserv update kubernetes --infra dev-setup + +# Verify update +provisioning taskserv versions kubernetes +``` + +### Workflow 3: Infrastructure Scaling + +```text +# Add servers to existing infrastructure +# Edit settings.ncl to add more servers + +# Apply changes +provisioning server create --infra dev-setup + +# Install services on new servers +provisioning taskserv create containerd --infra dev-setup +``` + +## Interactive Mode + +### Starting Interactive Shell + +```text +# Start Nushell with provisioning loaded +provisioning nu +``` + +In the interactive shell, you have access to all provisioning functions: + +```text +# Inside Nushell session +use lib_provisioning * + +# Check environment +show_env + +# List available functions +help commands | where name =~ "provision" +``` + +### Useful Interactive Commands + +```text +# Show detailed server information +find_servers "web-*" | table + +# Get cost estimates +servers_walk_by_costs $settings "" false false "stdout" + +# Check task service status +taskservs_list | where status == "running" +``` + +## Configuration Management + +### Understanding Configuration Files + +1. **System Defaults**: `config.defaults.toml` - System-wide defaults +2. **User Config**: `~/.provisioning/config.user.toml` - Your preferences +3. **Environment Config**: `config.{env}.toml` - Environment-specific settings +4. **Infrastructure Config**: `settings.ncl` - Infrastructure definitions + +### Configuration Hierarchy + +```text +Infrastructure settings.ncl + ↓ (overrides) +Environment config.{env}.toml + ↓ (overrides) +User config.user.toml + ↓ (overrides) +System config.defaults.toml +``` + +### Customizing Your Configuration + +```text +# Edit user configuration +provisioning sops ~/.provisioning/config.user.toml + +# Or using your preferred editor +nano ~/.provisioning/config.user.toml +``` + +Example customizations: + +```text +[debug] +enabled = true # Enable debug mode by default +log_level = "debug" # Verbose logging + +[providers] +default = "aws" # Use AWS as default provider + +[output] +format = "json" # Prefer JSON output +``` + +## Monitoring and Observability + +### Checking System Status + +```text +# Overall system health +provisioning env + +# Infrastructure status +provisioning show servers --infra dev-setup + +# Service status +provisioning taskserv list --infra dev-setup +``` + +### Logging and Debugging + +```text +# Enable debug mode for troubleshooting +provisioning --debug server create --infra dev-setup --check + +# View logs for specific operations +provisioning show logs --infra dev-setup +``` + +### Cost Monitoring + +```text +# Show cost estimates +provisioning show cost --infra dev-setup + +# Detailed cost breakdown +provisioning server price --infra dev-setup +``` + +## Best Practices + +### 1. Configuration Management + +- ✅ Use version control for infrastructure definitions +- ✅ Test changes in development before production +- ✅ Use `--check` mode to preview changes +- ✅ Keep user configuration separate from infrastructure + +### 2. Security + +- ✅ Use SOPS for encrypting sensitive data +- ✅ Regular key rotation for cloud providers +- ✅ Principle of least privilege for access +- ✅ Audit infrastructure changes + +### 3. 
Operational Excellence + +- ✅ Monitor infrastructure costs regularly +- ✅ Keep services updated +- ✅ Document custom configurations +- ✅ Plan for disaster recovery + +### 4. Development Workflow + +```text +# 1. Always validate before applying +provisioning validate config --infra my-infra + +# 2. Use check mode first +provisioning server create --infra my-infra --check + +# 3. Apply changes incrementally +provisioning server create --infra my-infra + +# 4. Verify results +provisioning show servers --infra my-infra +``` + +## Getting Help + +### Built-in Help System + +```text +# General help +provisioning help + +# Command-specific help +provisioning server help +provisioning taskserv help +provisioning cluster help + +# Show available options +provisioning generate help +``` + +### Command Reference + +For complete command documentation, see: [CLI Reference](cli-reference.md) + +### Troubleshooting + +If you encounter issues, see: [Troubleshooting Guide](troubleshooting-guide.md) + +## Real-World Example + +Let's walk through a complete example of setting up a web application infrastructure: + +### Step 1: Plan Your Infrastructure + +```text +# Create project workspace +mkdir ~/webapp-infrastructure +cd ~/webapp-infrastructure + +# Generate base infrastructure +provisioning generate infra --new webapp +``` + +### Step 2: Customize Configuration + +Edit `webapp/settings.ncl` to define: + +- 2 web servers for load balancing +- 1 database server +- Load balancer configuration + +### Step 3: Deploy Base Infrastructure + +```text +# Validate configuration +provisioning validate config --infra webapp + +# Preview deployment +provisioning server create --infra webapp --check + +# Deploy servers +provisioning server create --infra webapp +``` + +### Step 4: Install Services + +```text +# Install container runtime on all servers +provisioning taskserv create containerd --infra webapp + +# Install load balancer on web servers +provisioning taskserv create haproxy --infra webapp + +# Install database on database server +provisioning taskserv create postgresql --infra webapp +``` + +### Step 5: Deploy Application + +```text +# Create application cluster +provisioning cluster create webapp --infra webapp + +# Verify deployment +provisioning show servers --infra webapp +provisioning cluster list --infra webapp +``` + +## Next Steps + +Now that you understand the basics: + +1. **Set up your workspace**: [Workspace Setup Guide](workspace-setup.md) +2. **Learn about infrastructure management**: [Infrastructure Management Guide](infrastructure-management.md) +3. **Understand configuration**: [Configuration Guide](configuration.md) +4. **Explore examples**: [Examples and Tutorials](examples/) + +You're ready to start building and managing cloud infrastructure with confidence! 
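+
+### Appendix: Scripted Walkthrough
+
+As a recap, the webapp example above can be run end to end as a single script. This is a minimal sketch using only commands already shown in this guide; the `webapp` infra name and the three taskservs come from the example:
+
+```bash
+#!/usr/bin/env bash
+# Sketch: the Real-World Example as one script (validate -> preview -> apply -> services)
+set -euo pipefail
+
+provisioning validate config --infra webapp
+provisioning server create --infra webapp --check   # preview planned changes
+provisioning server create --infra webapp
+
+# Install services on the new servers
+for svc in containerd haproxy postgresql; do
+  provisioning taskserv create "$svc" --infra webapp
+done
+
+provisioning cluster create webapp --infra webapp
+provisioning show servers --infra webapp
+```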
\ No newline at end of file diff --git a/docs/src/getting-started/installation-guide.md b/docs/src/getting-started/installation-guide.md index eed574c..b992736 100644 --- a/docs/src/getting-started/installation-guide.md +++ b/docs/src/getting-started/installation-guide.md @@ -1 +1,536 @@ -# Installation Guide\n\nThis guide will help you install Infrastructure Automation on your machine and get it ready for use.\n\n## What You'll Learn\n\n- System requirements and prerequisites\n- Different installation methods\n- How to verify your installation\n- Setting up your environment\n- Troubleshooting common installation issues\n\n## System Requirements\n\n### Operating System Support\n\n- **Linux**: Any modern distribution (Ubuntu 20.04+, CentOS 8+, Debian 11+)\n- **macOS**: 11.0+ (Big Sur and newer)\n- **Windows**: Windows 10/11 with WSL2\n\n### Hardware Requirements\n\n| Component | Minimum | Recommended |\n| ----------- | --------- | ------------- |\n| CPU | 2 cores | 4+ cores |\n| RAM | 4 GB | 8+ GB |\n| Storage | 2 GB free | 10+ GB free |\n| Network | Internet connection | Broadband connection |\n\n### Architecture Support\n\n- **x86_64** (Intel/AMD 64-bit) - Full support\n- **ARM64** (Apple Silicon, ARM servers) - Full support\n\n## Prerequisites\n\nBefore installation, ensure you have:\n\n1. **Administrative privileges** - Required for system-wide installation\n2. **Internet connection** - For downloading dependencies\n3. **Terminal/Command line access** - Basic command line knowledge helpful\n\n### Pre-installation Checklist\n\n```\n# Check your system\nuname -a # View system information\ndf -h # Check available disk space\ncurl --version # Verify internet connectivity\n```\n\n## Installation Methods\n\n### Method 1: Package Installation (Recommended)\n\nThis is the easiest method for most users.\n\n#### Step 1: Download the Package\n\n```\n# Download the latest release package\nwget https://releases.example.com/provisioning-latest.tar.gz\n\n# Or using curl\ncurl -LO https://releases.example.com/provisioning-latest.tar.gz\n```\n\n#### Step 2: Extract and Install\n\n```\n# Extract the package\ntar xzf provisioning-latest.tar.gz\n\n# Navigate to extracted directory\ncd provisioning-*\n\n# Run the installation script\nsudo ./install-provisioning\n```\n\nThe installer will:\n\n- Install to `/usr/local/provisioning`\n- Create a global command at `/usr/local/bin/provisioning`\n- Install all required dependencies\n- Set up configuration templates\n\n### Method 2: Container Installation\n\nFor containerized environments or testing.\n\n#### Using Docker\n\n```\n# Pull the provisioning container\ndocker pull provisioning:latest\n\n# Create a container with persistent storage\ndocker run -it --name provisioning-setup \\n -v ~/provisioning-data:/data \\n provisioning:latest\n\n# Install to host system (optional)\ndocker cp provisioning-setup:/usr/local/provisioning ./\nsudo cp -r ./provisioning /usr/local/\nsudo ln -sf /usr/local/provisioning/bin/provisioning /usr/local/bin/provisioning\n```\n\n#### Using Podman\n\n```\n# Similar to Docker but with Podman\npodman pull provisioning:latest\npodman run -it --name provisioning-setup \\n -v ~/provisioning-data:/data \\n provisioning:latest\n```\n\n### Method 3: Source Installation\n\nFor developers or custom installations.\n\n#### Prerequisites for Source Installation\n\n- **Git** - For cloning the repository\n- **Build tools** - Compiler toolchain for your platform\n\n#### Installation Steps\n\n```\n# Clone the repository\ngit clone 
https://github.com/your-org/provisioning.git\ncd provisioning\n\n# Run installation from source\n./distro/from-repo.sh\n\n# Or if you have development environment\n./distro/pack-install.sh\n```\n\n### Method 4: Manual Installation\n\nFor advanced users who want complete control.\n\n```\n# Create installation directory\nsudo mkdir -p /usr/local/provisioning\n\n# Copy files (assumes you have the source)\nsudo cp -r ./* /usr/local/provisioning/\n\n# Create global command\nsudo ln -sf /usr/local/provisioning/core/nulib/provisioning /usr/local/bin/provisioning\n\n# Install dependencies manually\n./install-dependencies.sh\n```\n\n## Installation Process Details\n\n### What Gets Installed\n\nThe installation process sets up:\n\n#### 1. Core System Files\n\n```\n/usr/local/provisioning/\n├── core/ # Core provisioning logic\n├── providers/ # Cloud provider integrations\n├── taskservs/ # Infrastructure services\n├── cluster/ # Cluster configurations\n├── schemas/ # Configuration schemas (Nickel)\n├── templates/ # Template files\n└── resources/ # Project resources\n```\n\n#### 2. Required Tools\n\n| Tool | Version | Purpose |\n| ------ | --------- | --------- |\n| Nushell | 0.107.1 | Primary shell and scripting |\n| Nickel | 1.15.0+ | Configuration language |\n| SOPS | 3.10.2 | Secret management |\n| Age | 1.2.1 | Encryption |\n| K9s | 0.50.6 | Kubernetes management |\n\n#### 3. Nushell Plugins\n\n- **nu_plugin_tera** - Template rendering\n\n#### 4. Configuration Files\n\n- User configuration templates\n- Environment-specific configs\n- Default settings and schemas\n\n## Post-Installation Verification\n\n### Basic Verification\n\n```\n# Check if provisioning command is available\nprovisioning --version\n\n# Verify installation\nprovisioning env\n\n# Show comprehensive environment info\nprovisioning allenv\n```\n\nExpected output should show:\n\n```\n✅ Provisioning v1.0.0 installed\n✅ All dependencies available\n✅ Configuration loaded successfully\n```\n\n### Tool Verification\n\n```\n# Check individual tools\nnu --version # Should show Nushell 0.109.0+\nnickel version # Should show Nickel 1.5+\nsops --version # Should show SOPS 3.10.2\nage --version # Should show Age 1.2.1\nk9s version # Should show K9s 0.50.6\n```\n\n### Plugin Verification\n\n```\n# Start Nushell and check plugins\nnu -c "version | get installed_plugins"\n\n# Should include:\n# - nu_plugin_tera (template rendering)\n```\n\n### Configuration Verification\n\n```\n# Validate configuration\nprovisioning validate config\n\n# Should show:\n# ✅ Configuration validation passed!\n```\n\n## Environment Setup\n\n### Shell Configuration\n\nAdd to your shell profile (`~/.bashrc`, `~/.zshrc`, or `~/.profile`):\n\n```\n# Add provisioning to PATH\nexport PATH="/usr/local/bin:$PATH"\n\n# Optional: Set default provisioning directory\nexport PROVISIONING="/usr/local/provisioning"\n```\n\n### Configuration Initialization\n\n```\n# Initialize user configuration\nprovisioning init config\n\n# This creates ~/.provisioning/config.user.toml\n```\n\n### First-Time Setup\n\n```\n# Set up your first workspace\nmkdir -p ~/provisioning-workspace\ncd ~/provisioning-workspace\n\n# Initialize workspace\nprovisioning init config dev\n\n# Verify setup\nprovisioning env\n```\n\n## Platform-Specific Instructions\n\n### Linux (Ubuntu/Debian)\n\n```\n# Install system dependencies\nsudo apt update\nsudo apt install -y curl wget tar\n\n# Proceed with standard installation\nwget https://releases.example.com/provisioning-latest.tar.gz\ntar xzf provisioning-latest.tar.gz\ncd 
provisioning-*\nsudo ./install-provisioning\n```\n\n### Linux (RHEL/CentOS/Fedora)\n\n```\n# Install system dependencies\nsudo dnf install -y curl wget tar\n# or for older versions: sudo yum install -y curl wget tar\n\n# Proceed with standard installation\n```\n\n### macOS\n\n```\n# Using Homebrew (if available)\nbrew install curl wget\n\n# Or download directly\ncurl -LO https://releases.example.com/provisioning-latest.tar.gz\ntar xzf provisioning-latest.tar.gz\ncd provisioning-*\nsudo ./install-provisioning\n```\n\n### Windows (WSL2)\n\n```\n# In WSL2 terminal\nsudo apt update\nsudo apt install -y curl wget tar\n\n# Proceed with Linux installation steps\nwget https://releases.example.com/provisioning-latest.tar.gz\n# ... continue as Linux\n```\n\n## Configuration Examples\n\n### Basic Configuration\n\nCreate `~/.provisioning/config.user.toml`:\n\n```\n[core]\nname = "my-provisioning"\n\n[paths]\nbase = "/usr/local/provisioning"\ninfra = "~/provisioning-workspace"\n\n[debug]\nenabled = false\nlog_level = "info"\n\n[providers]\ndefault = "local"\n\n[output]\nformat = "yaml"\n```\n\n### Development Configuration\n\nFor developers, use enhanced debugging:\n\n```\n[debug]\nenabled = true\nlog_level = "debug"\ncheck = true\n\n[cache]\nenabled = false # Disable caching during development\n```\n\n## Upgrade and Migration\n\n### Upgrading from Previous Version\n\n```\n# Backup current installation\nsudo cp -r /usr/local/provisioning /usr/local/provisioning.backup\n\n# Download new version\nwget https://releases.example.com/provisioning-latest.tar.gz\n\n# Extract and install\ntar xzf provisioning-latest.tar.gz\ncd provisioning-*\nsudo ./install-provisioning\n\n# Verify upgrade\nprovisioning --version\n```\n\n### Migrating Configuration\n\n```\n# Backup your configuration\ncp -r ~/.provisioning ~/.provisioning.backup\n\n# Initialize new configuration\nprovisioning init config\n\n# Manually merge important settings from backup\n```\n\n## Troubleshooting Installation Issues\n\n### Common Installation Problems\n\n#### Permission Denied Errors\n\n```\n# Problem: Cannot write to /usr/local\n# Solution: Use sudo\nsudo ./install-provisioning\n\n# Or install to user directory\n./install-provisioning --prefix=$HOME/provisioning\nexport PATH="$HOME/provisioning/bin:$PATH"\n```\n\n#### Missing Dependencies\n\n```\n# Problem: curl/wget not found\n# Ubuntu/Debian solution:\nsudo apt install -y curl wget tar\n\n# RHEL/CentOS solution:\nsudo dnf install -y curl wget tar\n```\n\n#### Download Failures\n\n```\n# Problem: Cannot download package\n# Solution: Check internet connection and try alternative\nping google.com\n\n# Try alternative download method\ncurl -LO --retry 3 https://releases.example.com/provisioning-latest.tar.gz\n\n# Or use wget with retries\nwget --tries=3 https://releases.example.com/provisioning-latest.tar.gz\n```\n\n#### Extraction Failures\n\n```\n# Problem: Archive corrupted\n# Solution: Verify and re-download\nsha256sum provisioning-latest.tar.gz # Check against published hash\n\n# Re-download if hash doesn't match\nrm provisioning-latest.tar.gz\nwget https://releases.example.com/provisioning-latest.tar.gz\n```\n\n#### Tool Installation Failures\n\n```\n# Problem: Nushell installation fails\n# Solution: Check architecture and OS compatibility\nuname -m # Should show x86_64 or arm64\nuname -s # Should show Linux, Darwin, etc.\n\n# Try manual tool installation\n./install-dependencies.sh --verbose\n```\n\n### Verification Failures\n\n#### Command Not Found\n\n```\n# Problem: 'provisioning' 
command not found\n# Check installation path\nls -la /usr/local/bin/provisioning\n\n# If missing, create symlink\nsudo ln -sf /usr/local/provisioning/core/nulib/provisioning /usr/local/bin/provisioning\n\n# Add to PATH if needed\nexport PATH="/usr/local/bin:$PATH"\necho 'export PATH="/usr/local/bin:$PATH"' >> ~/.bashrc\n```\n\n#### Plugin Errors\n\n```\n# Problem: Plugin command not found\n# Solution: Ensure plugin is properly registered\n\n# Check available plugins\nnu -c "version | get installed_plugins"\n\n# If plugin missing, reload Nushell:\nexec nu\n```\n\n#### Configuration Errors\n\n```\n# Problem: Configuration validation fails\n# Solution: Initialize with template\nprovisioning init config\n\n# Or validate and show errors\nprovisioning validate config --detailed\n```\n\n### Getting Help\n\nIf you encounter issues not covered here:\n\n1. **Check logs**: `provisioning --debug env`\n2. **Validate configuration**: `provisioning validate config`\n3. **Check system compatibility**: `provisioning version --verbose`\n4. **Consult troubleshooting guide**: `docs/user/troubleshooting-guide.md`\n\n## Next Steps\n\nAfter successful installation:\n\n1. **Complete the Getting Started Guide**: `docs/user/getting-started.md`\n2. **Set up your first workspace**: `docs/user/workspace-setup.md`\n3. **Learn about configuration**: `docs/user/configuration.md`\n4. **Try example tutorials**: `docs/user/examples/`\n\nYour provisioning is now ready to manage cloud infrastructure! +# Installation Guide + +This guide will help you install Infrastructure Automation on your machine and get it ready for use. + +## What You'll Learn + +- System requirements and prerequisites +- Different installation methods +- How to verify your installation +- Setting up your environment +- Troubleshooting common installation issues + +## System Requirements + +### Operating System Support + +- **Linux**: Any modern distribution (Ubuntu 20.04+, CentOS 8+, Debian 11+) +- **macOS**: 11.0+ (Big Sur and newer) +- **Windows**: Windows 10/11 with WSL2 + +### Hardware Requirements + +| Component | Minimum | Recommended | +| ----------- | --------- | ------------- | +| CPU | 2 cores | 4+ cores | +| RAM | 4 GB | 8+ GB | +| Storage | 2 GB free | 10+ GB free | +| Network | Internet connection | Broadband connection | + +### Architecture Support + +- **x86_64** (Intel/AMD 64-bit) - Full support +- **ARM64** (Apple Silicon, ARM servers) - Full support + +## Prerequisites + +Before installation, ensure you have: + +1. **Administrative privileges** - Required for system-wide installation +2. **Internet connection** - For downloading dependencies +3. **Terminal/Command line access** - Basic command line knowledge helpful + +### Pre-installation Checklist + +```text +# Check your system +uname -a # View system information +df -h # Check available disk space +curl --version # Verify internet connectivity +``` + +## Installation Methods + +### Method 1: Package Installation (Recommended) + +This is the easiest method for most users. 
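+
+If you prefer a single pass, the two steps below can also be scripted. A minimal sketch, assuming the example release URL used throughout this guide and a published `.sha256` checksum file (hypothetical; adjust to your actual release source):
+
+```bash
+#!/usr/bin/env bash
+# Sketch: download, verify, extract, and install in one pass
+set -euo pipefail
+
+URL="https://releases.example.com/provisioning-latest.tar.gz"
+curl -LO "$URL"
+curl -LO "$URL.sha256"                           # hypothetical companion checksum file
+sha256sum -c provisioning-latest.tar.gz.sha256   # use 'shasum -a 256 -c' on macOS
+
+tar xzf provisioning-latest.tar.gz
+cd provisioning-*
+sudo ./install-provisioning
+```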
+
+#### Step 1: Download the Package
+
+```text
+# Download the latest release package
+wget https://releases.example.com/provisioning-latest.tar.gz
+
+# Or using curl
+curl -LO https://releases.example.com/provisioning-latest.tar.gz
+```
+
+#### Step 2: Extract and Install
+
+```text
+# Extract the package
+tar xzf provisioning-latest.tar.gz
+
+# Navigate to extracted directory
+cd provisioning-*
+
+# Run the installation script
+sudo ./install-provisioning
+```
+
+The installer will:
+
+- Install to `/usr/local/provisioning`
+- Create a global command at `/usr/local/bin/provisioning`
+- Install all required dependencies
+- Set up configuration templates
+
+### Method 2: Container Installation
+
+For containerized environments or testing.
+
+#### Using Docker
+
+```text
+# Pull the provisioning container
+docker pull provisioning:latest
+
+# Create a container with persistent storage
+docker run -it --name provisioning-setup \
+  -v ~/provisioning-data:/data \
+  provisioning:latest
+
+# Install to host system (optional)
+docker cp provisioning-setup:/usr/local/provisioning ./
+sudo cp -r ./provisioning /usr/local/
+sudo ln -sf /usr/local/provisioning/bin/provisioning /usr/local/bin/provisioning
+```
+
+#### Using Podman
+
+```text
+# Similar to Docker but with Podman
+podman pull provisioning:latest
+podman run -it --name provisioning-setup \
+  -v ~/provisioning-data:/data \
+  provisioning:latest
+```
+
+### Method 3: Source Installation
+
+For developers or custom installations.
+
+#### Prerequisites for Source Installation
+
+- **Git** - For cloning the repository
+- **Build tools** - Compiler toolchain for your platform
+
+#### Installation Steps
+
+```text
+# Clone the repository
+git clone https://github.com/your-org/provisioning.git
+cd provisioning
+
+# Run installation from source
+./distro/from-repo.sh
+
+# Or, if you have a development environment set up
+./distro/pack-install.sh
+```
+
+### Method 4: Manual Installation
+
+For advanced users who want complete control.
+
+```text
+# Create installation directory
+sudo mkdir -p /usr/local/provisioning
+
+# Copy files (assumes you have the source)
+sudo cp -r ./* /usr/local/provisioning/
+
+# Create global command
+sudo ln -sf /usr/local/provisioning/core/nulib/provisioning /usr/local/bin/provisioning
+
+# Install dependencies manually
+./install-dependencies.sh
+```
+
+## Installation Process Details
+
+### What Gets Installed
+
+The installation process sets up:
+
+#### 1. Core System Files
+
+```text
+/usr/local/provisioning/
+├── core/       # Core provisioning logic
+├── providers/  # Cloud provider integrations
+├── taskservs/  # Infrastructure services
+├── cluster/    # Cluster configurations
+├── schemas/    # Configuration schemas (Nickel)
+├── templates/  # Template files
+└── resources/  # Project resources
+```
+
+#### 2. Required Tools
+
+| Tool | Version | Purpose |
+| ------ | --------- | --------- |
+| Nushell | 0.109.0+ | Primary shell and scripting |
+| Nickel | 1.15.0+ | Configuration language |
+| SOPS | 3.10.2 | Secret management |
+| Age | 1.2.1 | Encryption |
+| K9s | 0.50.6 | Kubernetes management |
+
+#### 3. Nushell Plugins
+
+- **nu_plugin_tera** - Template rendering
+
+#### 4. Configuration Files
+
+- User configuration templates
+- Environment-specific configs
+- Default settings and schemas
+
+## Post-Installation Verification
+
+### Basic Verification
+
+```text
+# Check if provisioning command is available
+provisioning --version
+
+# Verify installation
+provisioning env
+
+# Show comprehensive environment info
+provisioning allenv
+```
+
+Expected output should show:
+
+```text
+✅ Provisioning v1.0.0 installed
+✅ All dependencies available
+✅ Configuration loaded successfully
+```
+
+### Tool Verification
+
+```text
+# Check individual tools
+nu --version       # Should show Nushell 0.109.0+
+nickel --version   # Should show Nickel 1.15.0+
+sops --version     # Should show SOPS 3.10.2
+age --version      # Should show Age 1.2.1
+k9s version        # Should show K9s 0.50.6
+```
+
+### Plugin Verification
+
+```text
+# Start Nushell and check plugins
+nu -c "version | get installed_plugins"
+
+# Should include:
+# - nu_plugin_tera (template rendering)
+```
+
+### Configuration Verification
+
+```text
+# Validate configuration
+provisioning validate config
+
+# Should show:
+# ✅ Configuration validation passed!
+```
+
+## Environment Setup
+
+### Shell Configuration
+
+Add to your shell profile (`~/.bashrc`, `~/.zshrc`, or `~/.profile`):
+
+```text
+# Add provisioning to PATH
+export PATH="/usr/local/bin:$PATH"
+
+# Optional: Set default provisioning directory
+export PROVISIONING="/usr/local/provisioning"
+```
+
+### Configuration Initialization
+
+```text
+# Initialize user configuration
+provisioning init config
+
+# This creates ~/.provisioning/config.user.toml
+```
+
+### First-Time Setup
+
+```text
+# Set up your first workspace
+mkdir -p ~/provisioning-workspace
+cd ~/provisioning-workspace
+
+# Initialize workspace
+provisioning init config dev
+
+# Verify setup
+provisioning env
+```
+
+## Platform-Specific Instructions
+
+### Linux (Ubuntu/Debian)
+
+```text
+# Install system dependencies
+sudo apt update
+sudo apt install -y curl wget tar
+
+# Proceed with standard installation
+wget https://releases.example.com/provisioning-latest.tar.gz
+tar xzf provisioning-latest.tar.gz
+cd provisioning-*
+sudo ./install-provisioning
+```
+
+### Linux (RHEL/CentOS/Fedora)
+
+```text
+# Install system dependencies
+sudo dnf install -y curl wget tar
+# or for older versions: sudo yum install -y curl wget tar
+
+# Proceed with standard installation
+```
+
+### macOS
+
+```text
+# Using Homebrew (if available)
+brew install curl wget
+
+# Or download directly
+curl -LO https://releases.example.com/provisioning-latest.tar.gz
+tar xzf provisioning-latest.tar.gz
+cd provisioning-*
+sudo ./install-provisioning
+```
+
+### Windows (WSL2)
+
+```text
+# In WSL2 terminal
+sudo apt update
+sudo apt install -y curl wget tar
+
+# Proceed with Linux installation steps
+wget https://releases.example.com/provisioning-latest.tar.gz
+# ... 
continue as Linux +``` + +## Configuration Examples + +### Basic Configuration + +Create `~/.provisioning/config.user.toml`: + +```text +[core] +name = "my-provisioning" + +[paths] +base = "/usr/local/provisioning" +infra = "~/provisioning-workspace" + +[debug] +enabled = false +log_level = "info" + +[providers] +default = "local" + +[output] +format = "yaml" +``` + +### Development Configuration + +For developers, use enhanced debugging: + +```text +[debug] +enabled = true +log_level = "debug" +check = true + +[cache] +enabled = false # Disable caching during development +``` + +## Upgrade and Migration + +### Upgrading from Previous Version + +```text +# Backup current installation +sudo cp -r /usr/local/provisioning /usr/local/provisioning.backup + +# Download new version +wget https://releases.example.com/provisioning-latest.tar.gz + +# Extract and install +tar xzf provisioning-latest.tar.gz +cd provisioning-* +sudo ./install-provisioning + +# Verify upgrade +provisioning --version +``` + +### Migrating Configuration + +```text +# Backup your configuration +cp -r ~/.provisioning ~/.provisioning.backup + +# Initialize new configuration +provisioning init config + +# Manually merge important settings from backup +``` + +## Troubleshooting Installation Issues + +### Common Installation Problems + +#### Permission Denied Errors + +```text +# Problem: Cannot write to /usr/local +# Solution: Use sudo +sudo ./install-provisioning + +# Or install to user directory +./install-provisioning --prefix=$HOME/provisioning +export PATH="$HOME/provisioning/bin:$PATH" +``` + +#### Missing Dependencies + +```text +# Problem: curl/wget not found +# Ubuntu/Debian solution: +sudo apt install -y curl wget tar + +# RHEL/CentOS solution: +sudo dnf install -y curl wget tar +``` + +#### Download Failures + +```text +# Problem: Cannot download package +# Solution: Check internet connection and try alternative +ping google.com + +# Try alternative download method +curl -LO --retry 3 https://releases.example.com/provisioning-latest.tar.gz + +# Or use wget with retries +wget --tries=3 https://releases.example.com/provisioning-latest.tar.gz +``` + +#### Extraction Failures + +```text +# Problem: Archive corrupted +# Solution: Verify and re-download +sha256sum provisioning-latest.tar.gz # Check against published hash + +# Re-download if hash doesn't match +rm provisioning-latest.tar.gz +wget https://releases.example.com/provisioning-latest.tar.gz +``` + +#### Tool Installation Failures + +```text +# Problem: Nushell installation fails +# Solution: Check architecture and OS compatibility +uname -m # Should show x86_64 or arm64 +uname -s # Should show Linux, Darwin, etc. 
+ +# Try manual tool installation +./install-dependencies.sh --verbose +``` + +### Verification Failures + +#### Command Not Found + +```text +# Problem: 'provisioning' command not found +# Check installation path +ls -la /usr/local/bin/provisioning + +# If missing, create symlink +sudo ln -sf /usr/local/provisioning/core/nulib/provisioning /usr/local/bin/provisioning + +# Add to PATH if needed +export PATH="/usr/local/bin:$PATH" +echo 'export PATH="/usr/local/bin:$PATH"' >> ~/.bashrc +``` + +#### Plugin Errors + +```text +# Problem: Plugin command not found +# Solution: Ensure plugin is properly registered + +# Check available plugins +nu -c "version | get installed_plugins" + +# If plugin missing, reload Nushell: +exec nu +``` + +#### Configuration Errors + +```text +# Problem: Configuration validation fails +# Solution: Initialize with template +provisioning init config + +# Or validate and show errors +provisioning validate config --detailed +``` + +### Getting Help + +If you encounter issues not covered here: + +1. **Check logs**: `provisioning --debug env` +2. **Validate configuration**: `provisioning validate config` +3. **Check system compatibility**: `provisioning version --verbose` +4. **Consult troubleshooting guide**: `docs/user/troubleshooting-guide.md` + +## Next Steps + +After successful installation: + +1. **Complete the Getting Started Guide**: `docs/user/getting-started.md` +2. **Set up your first workspace**: `docs/user/workspace-setup.md` +3. **Learn about configuration**: `docs/user/configuration.md` +4. **Try example tutorials**: `docs/user/examples/` + +Your provisioning is now ready to manage cloud infrastructure! \ No newline at end of file diff --git a/docs/src/getting-started/installation-validation-guide.md b/docs/src/getting-started/installation-validation-guide.md index a9cca1c..97dbaf1 100644 --- a/docs/src/getting-started/installation-validation-guide.md +++ b/docs/src/getting-started/installation-validation-guide.md @@ -1 +1,622 @@ -# Installation Validation & Bootstrap Guide\n\n**Objective**: Validate your provisioning installation, run bootstrap to initialize the workspace, and verify all components are working correctly.\n\n**Expected Duration**: 30-45 minutes\n\n**Prerequisites**: Fresh clone of provisioning repository at `/Users/Akasha/project-provisioning`\n\n---\n\n## Section 1: Prerequisites Verification\n\nBefore running the bootstrap script, verify that your system has all required dependencies.\n\n### Step 1.1: Check System Requirements\n\nRun these commands to verify your system meets minimum requirements:\n\n```\n# Check OS\nuname -s\n# Expected: Darwin (macOS), Linux, or WSL2\n\n# Check CPU cores\nsysctl -n hw.physicalcpu # macOS\n# OR\nnproc # Linux\n# Expected: 2 or more cores\n\n# Check RAM\nsysctl -n hw.memsize | awk '{print int($1 / 1024 / 1024 / 1024) " GB"}' # macOS\n# OR\ngrep MemTotal /proc/meminfo | awk '{print int($2 / 1024 / 1024) " GB"}' # Linux\n# Expected: 2 GB or more (4 GB+ recommended)\n\n# Check free disk space\ndf -h | grep -E '^/dev|^Filesystem'\n# Expected: At least 2 GB free (10 GB+ recommended)\n```\n\n**Success Criteria**:\n- OS is macOS, Linux, or WSL2\n- CPU: 2+ cores available\n- RAM: 2 GB minimum, 4+ GB recommended\n- Disk: 2 GB free minimum\n\n### Step 1.2: Verify Nushell Installation\n\nNushell is required for bootstrap and CLI operations:\n\n```\ncommand -v nu\n# Expected output: /path/to/nu\n\nnu --version\n# Expected output: 0.109.0 or higher\n```\n\n**If Nushell is not installed:**\n\n```\n# macOS (using 
Homebrew)\nbrew install nushell\n\n# Linux (Debian/Ubuntu)\nsudo apt-get update && sudo apt-get install nushell\n\n# Linux (RHEL/CentOS)\nsudo yum install nushell\n\n# Or install from source: https://nushell.sh/book/installation.html\n```\n\n### Step 1.3: Verify Nickel Installation\n\nNickel is required for configuration validation:\n\n```\ncommand -v nickel\n# Expected output: /path/to/nickel\n\nnickel --version\n# Expected output: nickel 1.x.x or higher\n```\n\n**If Nickel is not installed:**\n\n```\n# Install via Cargo (requires Rust)\ncargo install nickel-lang-cli\n\n# Or: https://nickel-lang.org/\n```\n\n### Step 1.4: Verify Docker Installation\n\nDocker is required for running containerized services:\n\n```\ncommand -v docker\n# Expected output: /path/to/docker\n\ndocker --version\n# Expected output: Docker version 20.10 or higher\n```\n\n**If Docker is not installed:**\n\nVisit [Docker installation guide](https://docs.docker.com/get-docker/) and install for your OS.\n\n### Step 1.5: Check Provisioning Binary\n\nVerify the provisioning CLI binary exists:\n\n```\nls -la /Users/Akasha/project-provisioning/provisioning/core/cli/provisioning\n# Expected: -rwxr-xr-x (executable)\n\nfile /Users/Akasha/project-provisioning/provisioning/core/cli/provisioning\n# Expected: ELF 64-bit or similar binary format\n```\n\n**If binary is not executable:**\n\n```\nchmod +x /Users/Akasha/project-provisioning/provisioning/core/cli/provisioning\n```\n\n### Prerequisites Checklist\n\n```\n[ ] OS is macOS, Linux, or WSL2\n[ ] CPU: 2+ cores available\n[ ] RAM: 2 GB minimum installed\n[ ] Disk: 2+ GB free space\n[ ] Nushell 0.109.0+ installed\n[ ] Nickel 1.x.x installed\n[ ] Docker 20.10+ installed\n[ ] Provisioning binary exists and is executable\n```\n\n---\n\n## Section 2: Bootstrap Installation\n\nThe bootstrap script automates 7 stages of installation and initialization. 
Run it from the project root directory.\n\n### Step 2.1: Navigate to Project Root\n\n```\ncd /Users/Akasha/project-provisioning\n```\n\n### Step 2.2: Run Bootstrap Script\n\n```\n./provisioning/bootstrap/install.sh\n```\n\n### Bootstrap Output\n\nYou should see output similar to this:\n\n```\n╔════════════════════════════════════════════════════════════════╗\n║ PROVISIONING BOOTSTRAP (Bash) ║\n╚════════════════════════════════════════════════════════════════╝\n\n📊 Stage 1: System Detection\n─────────────────────────────────────────────────────────────────\n OS: Darwin\n Architecture: arm64 (or x86_64)\n CPU Cores: 8\n Memory: 16 GB\n ✅ System requirements met\n\n📦 Stage 2: Checking Dependencies\n─────────────────────────────────────────────────────────────────\n Versions:\n Docker: Docker version 28.5.2\n Rust: rustc 1.75.0\n Nushell: 0.109.1\n ✅ All dependencies found\n\n📁 Stage 3: Creating Directory Structure\n─────────────────────────────────────────────────────────────────\n ✅ Directory structure created\n\n⚙️ Stage 4: Validating Configuration\n─────────────────────────────────────────────────────────────────\n ✅ Configuration syntax valid\n\n📤 Stage 5: Exporting Configuration to TOML\n─────────────────────────────────────────────────────────────────\n ✅ Configuration exported\n\n🚀 Stage 6: Initializing Orchestrator Service\n─────────────────────────────────────────────────────────────────\n ✅ Orchestrator started\n\n✅ Stage 7: Verification\n─────────────────────────────────────────────────────────────────\n ✅ All configuration files generated\n ✅ All required directories created\n\n╔════════════════════════════════════════════════════════════════╗\n║ BOOTSTRAP COMPLETE ✅ ║\n╚════════════════════════════════════════════════════════════════╝\n\n📍 Next Steps:\n\n1. Verify configuration:\n cat /Users/Akasha/project-provisioning/workspaces/workspace_librecloud/config/config.ncl\n\n2. Check orchestrator is running:\n curl http://localhost:9090/health\n\n3. Start provisioning:\n provisioning server create --infra sgoyol --name web-01\n```\n\n### What Bootstrap Does\n\nThe bootstrap script automatically:\n\n1. **Detects your system** (OS, CPU, RAM, architecture)\n2. **Verifies dependencies** (Docker, Rust, Nushell)\n3. **Creates workspace directories** (config, state, cache)\n4. **Validates Nickel configuration** (syntax checking)\n5. **Exports configuration** (Nickel → TOML files)\n6. **Initializes orchestrator** (starts service in background)\n7. **Verifies installation** (checks all files created)\n\n---\n\n## Section 3: Installation Validation\n\nAfter bootstrap completes, verify that all components are working correctly.\n\n### Step 3.1: Verify Workspace Directories\n\nBootstrap should have created workspace directories. 
Verify they exist:\n\n```\ncd /Users/Akasha/project-provisioning\n\n# Check all required directories\nls -la workspaces/workspace_librecloud/.orchestrator/data/queue/\nls -la workspaces/workspace_librecloud/.kms/\nls -la workspaces/workspace_librecloud/.providers/\nls -la workspaces/workspace_librecloud/.taskservs/\nls -la workspaces/workspace_librecloud/.clusters/\n```\n\n**Expected Output**:\n```\ntotal 0\ndrwxr-xr-x 2 user group 64 Jan 7 10:30 .\n\n(directories exist and are accessible)\n```\n\n### Step 3.2: Verify Generated Configuration Files\n\nBootstrap should have exported Nickel configuration to TOML format:\n\n```\n# Check generated files exist\nls -la workspaces/workspace_librecloud/config/generated/\n\n# View workspace configuration\ncat workspaces/workspace_librecloud/config/generated/workspace.toml\n\n# View provider configuration\ncat workspaces/workspace_librecloud/config/generated/providers/upcloud.toml\n\n# View orchestrator configuration\ncat workspaces/workspace_librecloud/config/generated/platform/orchestrator.toml\n```\n\n**Expected Output**:\n```\nconfig/\n├── generated/\n│ ├── workspace.toml\n│ ├── providers/\n│ │ └── upcloud.toml\n│ └── platform/\n│ └── orchestrator.toml\n```\n\n### Step 3.3: Type-Check Nickel Configuration\n\nVerify Nickel configuration files have valid syntax:\n\n```\ncd /Users/Akasha/project-provisioning/workspaces/workspace_librecloud\n\n# Type-check main workspace config\nnickel typecheck config/config.ncl\n# Expected: No output (success) or clear error messages\n\n# Type-check infrastructure configs\nnickel typecheck infra/wuji/main.ncl\nnickel typecheck infra/sgoyol/main.ncl\n\n# Use workspace utility for comprehensive validation\nnu workspace.nu validate\n# Expected: ✓ All files validated successfully\n\n# Type-check all Nickel files\nnu workspace.nu typecheck\n```\n\n**Expected Output**:\n```\n✓ All files validated successfully\n✓ infra/wuji/main.ncl\n✓ infra/sgoyol/main.ncl\n```\n\n### Step 3.4: Verify Orchestrator Service\n\nThe orchestrator service manages workflows and deployments:\n\n```\n# Check if orchestrator is running (health check)\ncurl http://localhost:9090/health\n# Expected: {"status": "healthy"} or similar response\n\n# If health check fails, check orchestrator logs\ntail -f /Users/Akasha/project-provisioning/provisioning/platform/orchestrator/data/orchestrator.log\n\n# Alternative: Check if orchestrator process is running\nps aux | grep orchestrator\n# Expected: Running orchestrator process visible\n```\n\n**Expected Output**:\n```\n{\n "status": "healthy",\n "uptime": "0:05:23"\n}\n```\n\n**If Orchestrator Failed to Start:**\n\nCheck logs and restart manually:\n\n```\ncd /Users/Akasha/project-provisioning/provisioning/platform/orchestrator\n\n# Check log file\ncat data/orchestrator.log\n\n# Or start orchestrator manually\n./scripts/start-orchestrator.nu --background\n\n# Verify it's running\ncurl http://localhost:9090/health\n```\n\n### Step 3.5: Install Provisioning CLI (Optional)\n\nYou can install the provisioning CLI globally for easier access:\n\n```\n# Option A: System-wide installation (requires sudo)\ncd /Users/Akasha/project-provisioning\nsudo ./scripts/install-provisioning.sh\n\n# Verify installation\nprovisioning --version\nprovisioning help\n\n# Option B: Add to PATH temporarily (current session only)\nexport PATH="$PATH:/Users/Akasha/project-provisioning/provisioning/core/cli"\n\n# Verify\nprovisioning --version\n```\n\n**Expected Output**:\n```\nprovisioning version 1.0.0\n\nUsage: provisioning [OPTIONS] 
COMMAND\n\nCommands:\n server - Server management\n workspace - Workspace management\n config - Configuration management\n help - Show help information\n```\n\n### Installation Validation Checklist\n\n```\n[ ] Workspace directories created (.orchestrator, .kms, .providers, .taskservs, .clusters)\n[ ] Generated TOML files exist in config/generated/\n[ ] Nickel type-checking passes (no errors)\n[ ] Workspace utility validation passes\n[ ] Orchestrator responding to health check\n[ ] Orchestrator process running\n[ ] Provisioning CLI accessible and working\n```\n\n---\n\n## Section 4: Troubleshooting\n\nThis section covers common issues and solutions.\n\n### Issue: "Nushell not found"\n\n**Symptoms**:\n```\n./provisioning/bootstrap/install.sh: line X: nu: command not found\n```\n\n**Solution**:\n1. Install Nushell (see Step 1.2)\n2. Verify installation: `nu --version`\n3. Retry bootstrap script\n\n### Issue: "Nickel configuration validation failed"\n\n**Symptoms**:\n```\n⚙️ Stage 4: Validating Configuration\nError: Nickel configuration validation failed\n```\n\n**Solution**:\n1. Check Nickel syntax: `nickel typecheck config/config.ncl`\n2. Review error message for specific issue\n3. Edit config file: `vim config/config.ncl`\n4. Run bootstrap again\n\n### Issue: "Docker not installed"\n\n**Symptoms**:\n```\n❌ Docker is required but not installed\n```\n\n**Solution**:\n1. Install Docker: [Docker installation guide](https://docs.docker.com/get-docker/)\n2. Verify: `docker --version`\n3. Retry bootstrap script\n\n### Issue: "Configuration export failed"\n\n**Symptoms**:\n```\n⚠️ Configuration export encountered issues (may continue)\n```\n\n**Solution**:\n1. Check Nushell library paths: `nu -c "use provisioning/core/nulib/lib_provisioning/config/export.nu *"`\n2. Verify export library exists: `ls provisioning/core/nulib/lib_provisioning/config/export.nu`\n3. Re-export manually:\n ```bash\n cd /Users/Akasha/project-provisioning\n nu -c "\n use provisioning/core/nulib/lib_provisioning/config/export.nu *\n export-all-configs 'workspaces/workspace_librecloud'\n "\n ```\n\n### Issue: "Orchestrator didn't start"\n\n**Symptoms**:\n```\n🚀 Stage 6: Initializing Orchestrator Service\n⚠️ Orchestrator may not have started (check logs)\n\ncurl http://localhost:9090/health\n# Connection refused\n```\n\n**Solution**:\n1. Check for port conflicts: `lsof -i :9090`\n2. If port 9090 is in use, either:\n - Stop the conflicting service\n - Change orchestrator port in configuration\n3. Check logs: `tail -f provisioning/platform/orchestrator/data/orchestrator.log`\n4. Start manually: `cd provisioning/platform/orchestrator && ./scripts/start-orchestrator.nu --background`\n5. 
Verify: `curl http://localhost:9090/health`\n\n### Issue: "Sudo password prompt during bootstrap"\n\n**Symptoms**:\n```\nStage 3: Creating Directory Structure\n[sudo] password for user:\n```\n\n**Solution**:\n- This is normal if creating directories in system locations\n- Enter your sudo password when prompted\n- Or: Run bootstrap from home directory instead\n\n### Issue: "Permission denied" on binary\n\n**Symptoms**:\n```\nbash: ./provisioning/bootstrap/install.sh: Permission denied\n```\n\n**Solution**:\n```\n# Make script executable\nchmod +x /Users/Akasha/project-provisioning/provisioning/bootstrap/install.sh\n\n# Retry\n./provisioning/bootstrap/install.sh\n```\n\n---\n\n## Section 5: Next Steps\n\nAfter successful installation validation, you can:\n\n### Option 1: Deploy workspace_librecloud\n\nTo deploy infrastructure to UpCloud:\n\n```\n# Read workspace deployment guide\ncat workspaces/workspace_librecloud/docs/deployment-guide.md\n\n# Or: From workspace directory\ncd workspaces/workspace_librecloud\ncat docs/deployment-guide.md\n```\n\n### Option 2: Create a New Workspace\n\nTo create a new workspace for different infrastructure:\n\n```\nprovisioning workspace init my_workspace --template minimal\n```\n\n### Option 3: Explore Available Modules\n\nDiscover what's available to deploy:\n\n```\n# List available task services\nprovisioning mod discover taskservs\n\n# List available providers\nprovisioning mod discover providers\n\n# List available clusters\nprovisioning mod discover clusters\n```\n\n---\n\n## Section 6: Verification Checklist\n\nAfter completing all steps, verify with this final checklist:\n\n```\nPrerequisites Verified:\n [ ] OS is macOS, Linux, or WSL2\n [ ] CPU: 2+ cores\n [ ] RAM: 2+ GB available\n [ ] Disk: 2+ GB free\n [ ] Nushell 0.109.0+ installed\n [ ] Nickel 1.x.x installed\n [ ] Docker 20.10+ installed\n [ ] Provisioning binary executable\n\nBootstrap Completed:\n [ ] All 7 stages completed successfully\n [ ] No error messages in output\n [ ] Installation log shows success\n\nInstallation Validated:\n [ ] Workspace directories exist\n [ ] Generated TOML files exist\n [ ] Nickel type-checking passes\n [ ] Workspace validation passes\n [ ] Orchestrator health check passes\n [ ] Provisioning CLI works (if installed)\n\nReady to Deploy:\n [ ] No errors in validation steps\n [ ] All services responding correctly\n [ ] Configuration properly exported\n```\n\n---\n\n## Getting Help\n\nIf you encounter issues not covered here:\n\n1. **Check logs**: `tail -f provisioning/platform/orchestrator/data/orchestrator.log`\n2. **Enable debug mode**: `provisioning --debug `\n3. **Review bootstrap output**: Scroll up to see detailed error messages\n4. **Check documentation**: `provisioning help` or `provisioning guide `\n5. **Workspace guide**: `cat workspaces/workspace_librecloud/docs/deployment-guide.md`\n\n---\n\n## Summary\n\nThis guide covers:\n- ✅ Prerequisites verification (Nushell, Nickel, Docker)\n- ✅ Bootstrap installation (7-stage automated process)\n- ✅ Installation validation (directories, configs, services)\n- ✅ Troubleshooting common issues\n- ✅ Next steps for deployment\n\nYou now have a fully installed and validated provisioning system ready for workspace deployment. +# Installation Validation & Bootstrap Guide + +**Objective**: Validate your provisioning installation, run bootstrap to initialize the workspace, and verify all components are working correctly. 
+ +**Expected Duration**: 30-45 minutes + +**Prerequisites**: Fresh clone of provisioning repository at `/Users/Akasha/project-provisioning` + +--- + +## Section 1: Prerequisites Verification + +Before running the bootstrap script, verify that your system has all required dependencies. + +### Step 1.1: Check System Requirements + +Run these commands to verify your system meets minimum requirements: + +```text +# Check OS +uname -s +# Expected: Darwin (macOS), Linux, or WSL2 + +# Check CPU cores +sysctl -n hw.physicalcpu # macOS +# OR +nproc # Linux +# Expected: 2 or more cores + +# Check RAM +sysctl -n hw.memsize | awk '{print int($1 / 1024 / 1024 / 1024) " GB"}' # macOS +# OR +grep MemTotal /proc/meminfo | awk '{print int($2 / 1024 / 1024) " GB"}' # Linux +# Expected: 2 GB or more (4 GB+ recommended) + +# Check free disk space +df -h | grep -E '^/dev|^Filesystem' +# Expected: At least 2 GB free (10 GB+ recommended) +``` + +**Success Criteria**: +- OS is macOS, Linux, or WSL2 +- CPU: 2+ cores available +- RAM: 2 GB minimum, 4+ GB recommended +- Disk: 2 GB free minimum + +### Step 1.2: Verify Nushell Installation + +Nushell is required for bootstrap and CLI operations: + +```text +command -v nu +# Expected output: /path/to/nu + +nu --version +# Expected output: 0.109.0 or higher +``` + +**If Nushell is not installed:** + +```text +# macOS (using Homebrew) +brew install nushell + +# Linux (Debian/Ubuntu) +sudo apt-get update && sudo apt-get install nushell + +# Linux (RHEL/CentOS) +sudo yum install nushell + +# Or install from source: https://nushell.sh/book/installation.html +``` + +### Step 1.3: Verify Nickel Installation + +Nickel is required for configuration validation: + +```text +command -v nickel +# Expected output: /path/to/nickel + +nickel --version +# Expected output: nickel 1.x.x or higher +``` + +**If Nickel is not installed:** + +```text +# Install via Cargo (requires Rust) +cargo install nickel-lang-cli + +# Or: https://nickel-lang.org/ +``` + +### Step 1.4: Verify Docker Installation + +Docker is required for running containerized services: + +```text +command -v docker +# Expected output: /path/to/docker + +docker --version +# Expected output: Docker version 20.10 or higher +``` + +**If Docker is not installed:** + +Visit [Docker installation guide](https://docs.docker.com/get-docker/) and install for your OS. + +### Step 1.5: Check Provisioning Binary + +Verify the provisioning CLI binary exists: + +```text +ls -la /Users/Akasha/project-provisioning/provisioning/core/cli/provisioning +# Expected: -rwxr-xr-x (executable) + +file /Users/Akasha/project-provisioning/provisioning/core/cli/provisioning +# Expected: ELF 64-bit or similar binary format +``` + +**If binary is not executable:** + +```text +chmod +x /Users/Akasha/project-provisioning/provisioning/core/cli/provisioning +``` + +### Prerequisites Checklist + +```text +[ ] OS is macOS, Linux, or WSL2 +[ ] CPU: 2+ cores available +[ ] RAM: 2 GB minimum installed +[ ] Disk: 2+ GB free space +[ ] Nushell 0.109.0+ installed +[ ] Nickel 1.x.x installed +[ ] Docker 20.10+ installed +[ ] Provisioning binary exists and is executable +``` + +--- + +## Section 2: Bootstrap Installation + +The bootstrap script automates 7 stages of installation and initialization. Run it from the project root directory. 
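+
+Before invoking it, you can re-run the Section 1 checks in one pass. A minimal sketch, assuming the tools are on your `PATH` and the repository lives at the path used throughout this guide:
+
+```bash
+#!/usr/bin/env bash
+# Sketch: fail fast if any bootstrap dependency is missing
+set -euo pipefail
+
+for tool in nu nickel docker; do
+  if ! command -v "$tool" >/dev/null 2>&1; then
+    echo "missing dependency: $tool" >&2
+    exit 1
+  fi
+done
+
+nu --version       # expect 0.109.0 or higher
+nickel --version   # expect 1.x.x
+docker --version   # expect 20.10 or higher
+echo "all bootstrap dependencies found"
+```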
+ +### Step 2.1: Navigate to Project Root + +```text +cd /Users/Akasha/project-provisioning +``` + +### Step 2.2: Run Bootstrap Script + +```text +./provisioning/bootstrap/install.sh +``` + +### Bootstrap Output + +You should see output similar to this: + +```text +╔════════════════════════════════════════════════════════════════╗ +║ PROVISIONING BOOTSTRAP (Bash) ║ +╚════════════════════════════════════════════════════════════════╝ + +📊 Stage 1: System Detection +───────────────────────────────────────────────────────────────── + OS: Darwin + Architecture: arm64 (or x86_64) + CPU Cores: 8 + Memory: 16 GB + ✅ System requirements met + +📦 Stage 2: Checking Dependencies +───────────────────────────────────────────────────────────────── + Versions: + Docker: Docker version 28.5.2 + Rust: rustc 1.75.0 + Nushell: 0.109.1 + ✅ All dependencies found + +📁 Stage 3: Creating Directory Structure +───────────────────────────────────────────────────────────────── + ✅ Directory structure created + +⚙️ Stage 4: Validating Configuration +───────────────────────────────────────────────────────────────── + ✅ Configuration syntax valid + +📤 Stage 5: Exporting Configuration to TOML +───────────────────────────────────────────────────────────────── + ✅ Configuration exported + +🚀 Stage 6: Initializing Orchestrator Service +───────────────────────────────────────────────────────────────── + ✅ Orchestrator started + +✅ Stage 7: Verification +───────────────────────────────────────────────────────────────── + ✅ All configuration files generated + ✅ All required directories created + +╔════════════════════════════════════════════════════════════════╗ +║ BOOTSTRAP COMPLETE ✅ ║ +╚════════════════════════════════════════════════════════════════╝ + +📍 Next Steps: + +1. Verify configuration: + cat /Users/Akasha/project-provisioning/workspaces/workspace_librecloud/config/config.ncl + +2. Check orchestrator is running: + curl http://localhost:9090/health + +3. Start provisioning: + provisioning server create --infra sgoyol --name web-01 +``` + +### What Bootstrap Does + +The bootstrap script automatically: + +1. **Detects your system** (OS, CPU, RAM, architecture) +2. **Verifies dependencies** (Docker, Rust, Nushell) +3. **Creates workspace directories** (config, state, cache) +4. **Validates Nickel configuration** (syntax checking) +5. **Exports configuration** (Nickel → TOML files) +6. **Initializes orchestrator** (starts service in background) +7. **Verifies installation** (checks all files created) + +--- + +## Section 3: Installation Validation + +After bootstrap completes, verify that all components are working correctly. + +### Step 3.1: Verify Workspace Directories + +Bootstrap should have created workspace directories. Verify they exist: + +```text +cd /Users/Akasha/project-provisioning + +# Check all required directories +ls -la workspaces/workspace_librecloud/.orchestrator/data/queue/ +ls -la workspaces/workspace_librecloud/.kms/ +ls -la workspaces/workspace_librecloud/.providers/ +ls -la workspaces/workspace_librecloud/.taskservs/ +ls -la workspaces/workspace_librecloud/.clusters/ +``` + +**Expected Output**: +```text +total 0 +drwxr-xr-x 2 user group 64 Jan 7 10:30 . 
+ +(directories exist and are accessible) +``` + +### Step 3.2: Verify Generated Configuration Files + +Bootstrap should have exported Nickel configuration to TOML format: + +```text +# Check generated files exist +ls -la workspaces/workspace_librecloud/config/generated/ + +# View workspace configuration +cat workspaces/workspace_librecloud/config/generated/workspace.toml + +# View provider configuration +cat workspaces/workspace_librecloud/config/generated/providers/upcloud.toml + +# View orchestrator configuration +cat workspaces/workspace_librecloud/config/generated/platform/orchestrator.toml +``` + +**Expected Output**: +```text +config/ +├── generated/ +│ ├── workspace.toml +│ ├── providers/ +│ │ └── upcloud.toml +│ └── platform/ +│ └── orchestrator.toml +``` + +### Step 3.3: Type-Check Nickel Configuration + +Verify Nickel configuration files have valid syntax: + +```text +cd /Users/Akasha/project-provisioning/workspaces/workspace_librecloud + +# Type-check main workspace config +nickel typecheck config/config.ncl +# Expected: No output (success) or clear error messages + +# Type-check infrastructure configs +nickel typecheck infra/wuji/main.ncl +nickel typecheck infra/sgoyol/main.ncl + +# Use workspace utility for comprehensive validation +nu workspace.nu validate +# Expected: ✓ All files validated successfully + +# Type-check all Nickel files +nu workspace.nu typecheck +``` + +**Expected Output**: +```text +✓ All files validated successfully +✓ infra/wuji/main.ncl +✓ infra/sgoyol/main.ncl +``` + +### Step 3.4: Verify Orchestrator Service + +The orchestrator service manages workflows and deployments: + +```text +# Check if orchestrator is running (health check) +curl http://localhost:9090/health +# Expected: {"status": "healthy"} or similar response + +# If health check fails, check orchestrator logs +tail -f /Users/Akasha/project-provisioning/provisioning/platform/orchestrator/data/orchestrator.log + +# Alternative: Check if orchestrator process is running +ps aux | grep orchestrator +# Expected: Running orchestrator process visible +``` + +**Expected Output**: +```text +{ + "status": "healthy", + "uptime": "0:05:23" +} +``` + +**If Orchestrator Failed to Start:** + +Check logs and restart manually: + +```text +cd /Users/Akasha/project-provisioning/provisioning/platform/orchestrator + +# Check log file +cat data/orchestrator.log + +# Or start orchestrator manually +./scripts/start-orchestrator.nu --background + +# Verify it's running +curl http://localhost:9090/health +``` + +### Step 3.5: Install Provisioning CLI (Optional) + +You can install the provisioning CLI globally for easier access: + +```text +# Option A: System-wide installation (requires sudo) +cd /Users/Akasha/project-provisioning +sudo ./scripts/install-provisioning.sh + +# Verify installation +provisioning --version +provisioning help + +# Option B: Add to PATH temporarily (current session only) +export PATH="$PATH:/Users/Akasha/project-provisioning/provisioning/core/cli" + +# Verify +provisioning --version +``` + +**Expected Output**: +```text +provisioning version 1.0.0 + +Usage: provisioning [OPTIONS] COMMAND + +Commands: + server - Server management + workspace - Workspace management + config - Configuration management + help - Show help information +``` + +### Installation Validation Checklist + +```text +[ ] Workspace directories created (.orchestrator, .kms, .providers, .taskservs, .clusters) +[ ] Generated TOML files exist in config/generated/ +[ ] Nickel type-checking passes (no errors) +[ ] Workspace 
utility validation passes +[ ] Orchestrator responding to health check +[ ] Orchestrator process running +[ ] Provisioning CLI accessible and working +``` + +--- + +## Section 4: Troubleshooting + +This section covers common issues and solutions. + +### Issue: "Nushell not found" + +**Symptoms**: +```text +./provisioning/bootstrap/install.sh: line X: nu: command not found +``` + +**Solution**: +1. Install Nushell (see Step 1.2) +2. Verify installation: `nu --version` +3. Retry bootstrap script + +### Issue: "Nickel configuration validation failed" + +**Symptoms**: +```text +⚙️ Stage 4: Validating Configuration +Error: Nickel configuration validation failed +``` + +**Solution**: +1. Check Nickel syntax: `nickel typecheck config/config.ncl` +2. Review error message for specific issue +3. Edit config file: `vim config/config.ncl` +4. Run bootstrap again + +### Issue: "Docker not installed" + +**Symptoms**: +```text +❌ Docker is required but not installed +``` + +**Solution**: +1. Install Docker: [Docker installation guide](https://docs.docker.com/get-docker/) +2. Verify: `docker --version` +3. Retry bootstrap script + +### Issue: "Configuration export failed" + +**Symptoms**: +```text +⚠️ Configuration export encountered issues (may continue) +``` + +**Solution**: +1. Check Nushell library paths: `nu -c "use provisioning/core/nulib/lib_provisioning/config/export.nu *"` +2. Verify export library exists: `ls provisioning/core/nulib/lib_provisioning/config/export.nu` +3. Re-export manually: + ```bash + cd /Users/Akasha/project-provisioning + nu -c " + use provisioning/core/nulib/lib_provisioning/config/export.nu * + export-all-configs 'workspaces/workspace_librecloud' + " + ``` + +### Issue: "Orchestrator didn't start" + +**Symptoms**: +```text +🚀 Stage 6: Initializing Orchestrator Service +⚠️ Orchestrator may not have started (check logs) + +curl http://localhost:9090/health +# Connection refused +``` + +**Solution**: +1. Check for port conflicts: `lsof -i :9090` +2. If port 9090 is in use, either: + - Stop the conflicting service + - Change orchestrator port in configuration +3. Check logs: `tail -f provisioning/platform/orchestrator/data/orchestrator.log` +4. Start manually: `cd provisioning/platform/orchestrator && ./scripts/start-orchestrator.nu --background` +5. 
Verify: `curl http://localhost:9090/health`
+
+### Issue: "Sudo password prompt during bootstrap"
+
+**Symptoms**:
+```text
+Stage 3: Creating Directory Structure
+[sudo] password for user:
+```
+
+**Solution**:
+- This is normal if creating directories in system locations
+- Enter your sudo password when prompted
+- Or: Run bootstrap from home directory instead
+
+### Issue: "Permission denied" on binary
+
+**Symptoms**:
+```text
+bash: ./provisioning/bootstrap/install.sh: Permission denied
+```
+
+**Solution**:
+```text
+# Make script executable
+chmod +x /Users/Akasha/project-provisioning/provisioning/bootstrap/install.sh
+
+# Retry
+./provisioning/bootstrap/install.sh
+```
+
+---
+
+## Section 5: Next Steps
+
+After successful installation validation, you can:
+
+### Option 1: Deploy workspace_librecloud
+
+To deploy infrastructure to UpCloud:
+
+```text
+# Read workspace deployment guide
+cat workspaces/workspace_librecloud/docs/deployment-guide.md
+
+# Or: From workspace directory
+cd workspaces/workspace_librecloud
+cat docs/deployment-guide.md
+```
+
+### Option 2: Create a New Workspace
+
+To create a new workspace for different infrastructure:
+
+```text
+provisioning workspace init my_workspace --template minimal
+```
+
+### Option 3: Explore Available Modules
+
+Discover what's available to deploy:
+
+```text
+# List available task services
+provisioning mod discover taskservs
+
+# List available providers
+provisioning mod discover providers
+
+# List available clusters
+provisioning mod discover clusters
+```
+
+---
+
+## Section 6: Verification Checklist
+
+After completing all steps, verify with this final checklist:
+
+```text
+Prerequisites Verified:
+  [ ] OS is macOS, Linux, or WSL2
+  [ ] CPU: 2+ cores
+  [ ] RAM: 2+ GB available
+  [ ] Disk: 2+ GB free
+  [ ] Nushell 0.109.0+ installed
+  [ ] Nickel 1.x.x installed
+  [ ] Docker 20.10+ installed
+  [ ] Provisioning binary executable
+
+Bootstrap Completed:
+  [ ] All 7 stages completed successfully
+  [ ] No error messages in output
+  [ ] Installation log shows success
+
+Installation Validated:
+  [ ] Workspace directories exist
+  [ ] Generated TOML files exist
+  [ ] Nickel type-checking passes
+  [ ] Workspace validation passes
+  [ ] Orchestrator health check passes
+  [ ] Provisioning CLI works (if installed)
+
+Ready to Deploy:
+  [ ] No errors in validation steps
+  [ ] All services responding correctly
+  [ ] Configuration properly exported
+```
+
+---
+
+## Getting Help
+
+If you encounter issues not covered here:
+
+1. **Check logs**: `tail -f provisioning/platform/orchestrator/data/orchestrator.log`
+2. **Enable debug mode**: `provisioning --debug <command>`
+3. **Review bootstrap output**: Scroll up to see detailed error messages
+4. **Check documentation**: `provisioning help` or `provisioning guide <topic>`
+5. **Workspace guide**: `cat workspaces/workspace_librecloud/docs/deployment-guide.md`
+
+---
+
+## Summary
+
+This guide covers:
+
+- ✅ Prerequisites verification (Nushell, Nickel, Docker)
+- ✅ Bootstrap installation (7-stage automated process)
+- ✅ Installation validation (directories, configs, services)
+- ✅ Troubleshooting common issues
+- ✅ Next steps for deployment
+
+You now have a fully installed and validated provisioning system ready for workspace deployment.
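+
+For future re-checks (for example, after an upgrade), the core Section 3 validations can be bundled into one script. A minimal sketch, assuming the workspace layout and default orchestrator port used above:
+
+```bash
+#!/usr/bin/env bash
+# Sketch: one-shot re-validation of directories, config, and orchestrator health
+set -euo pipefail
+cd /Users/Akasha/project-provisioning
+
+# Generated configuration present?
+ls workspaces/workspace_librecloud/config/generated/workspace.toml
+
+# Nickel configuration still type-checks?
+nickel typecheck workspaces/workspace_librecloud/config/config.ncl
+
+# Orchestrator healthy? (-f makes curl fail on HTTP errors)
+curl -fsS http://localhost:9090/health && echo "orchestrator healthy"
+```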
\ No newline at end of file diff --git a/docs/src/getting-started/quickstart-cheatsheet.md b/docs/src/getting-started/quickstart-cheatsheet.md index c4a2fe5..cd578d9 100644 --- a/docs/src/getting-started/quickstart-cheatsheet.md +++ b/docs/src/getting-started/quickstart-cheatsheet.md @@ -1 +1,1107 @@ -# Provisioning Platform Quick Reference\n\n**Version**: 3.5.0\n**Last Updated**: 2025-10-09\n\n---\n\n## Quick Navigation\n\n- [Plugin Commands](#plugin-commands) - Native Nushell plugins (10-50x faster)\n- [CLI Shortcuts](#cli-shortcuts) - 80+ command shortcuts\n- [Infrastructure Commands](#infrastructure-commands) - Servers, taskservs, clusters\n- [Orchestration Commands](#orchestration-commands) - Workflows, batch operations\n- [Configuration Commands](#configuration-commands) - Config, validation, environment\n- [Workspace Commands](#workspace-commands) - Multi-workspace management\n- [Security Commands](#security-commands) - Auth, MFA, secrets, compliance\n- [Common Workflows](#common-workflows) - Complete deployment examples\n- [Debug and Check Mode](#debug-and-check-mode) - Testing and troubleshooting\n- [Output Formats](#output-formats) - JSON, YAML, table formatting\n\n---\n\n## Plugin Commands\n\nNative Nushell plugins for high-performance operations. **10-50x faster than HTTP API**.\n\n### Authentication Plugin (nu_plugin_auth)\n\n```\n# Login (password prompted securely)\nauth login admin\n\n# Login with custom URL\nauth login admin --url https://control-center.example.com\n\n# Verify current session\nauth verify\n# Returns: { active: true, user: "admin", role: "Admin", expires_at: "...", mfa_verified: true }\n\n# List active sessions\nauth sessions\n\n# Logout\nauth logout\n\n# MFA enrollment\nauth mfa enroll totp # TOTP (Google Authenticator, Authy)\nauth mfa enroll webauthn # WebAuthn (YubiKey, Touch ID, Windows Hello)\n\n# MFA verification\nauth mfa verify --code 123456\nauth mfa verify --code ABCD-EFGH-IJKL # Backup code\n```\n\n**Installation:**\n\n```\ncd provisioning/core/plugins/nushell-plugins\ncargo build --release -p nu_plugin_auth\nplugin add target/release/nu_plugin_auth\n```\n\n### KMS Plugin (nu_plugin_kms)\n\n**Performance**: 10x faster encryption (~5 ms vs ~50 ms HTTP)\n\n```\n# Encrypt with auto-detected backend\nkms encrypt "secret data"\n# vault:v1:abc123...\n\n# Encrypt with specific backend\nkms encrypt "data" --backend rustyvault --key provisioning-main\nkms encrypt "data" --backend age --key age1xxxxxxxxx\nkms encrypt "data" --backend aws --key alias/provisioning\n\n# Encrypt with context (AAD for additional security)\nkms encrypt "data" --context "user=admin,env=production"\n\n# Decrypt (auto-detects backend from format)\nkms decrypt "vault:v1:abc123..."\nkms decrypt "-----BEGIN AGE ENCRYPTED FILE-----..."\n\n# Decrypt with context (must match encryption context)\nkms decrypt "vault:v1:abc123..." 
--context "user=admin,env=production"\n\n# Generate data encryption key\nkms generate-key\nkms generate-key --spec AES256\n\n# Check backend status\nkms status\n```\n\n**Supported Backends:**\n\n- **rustyvault**: High-performance (~5 ms) - Production\n- **age**: Local encryption (~3 ms) - Development\n- **cosmian**: Cloud KMS (~30 ms)\n- **aws**: AWS KMS (~50 ms)\n- **vault**: HashiCorp Vault (~40 ms)\n\n**Installation:**\n\n```\ncargo build --release -p nu_plugin_kms\nplugin add target/release/nu_plugin_kms\n\n# Set backend environment\nexport RUSTYVAULT_ADDR="http://localhost:8200"\nexport RUSTYVAULT_TOKEN="hvs.xxxxx"\n```\n\n### Orchestrator Plugin (nu_plugin_orchestrator)\n\n**Performance**: 30-50x faster queries (~1 ms vs ~30-50 ms HTTP)\n\n```\n# Get orchestrator status (direct file access, ~1 ms)\norch status\n# { active_tasks: 5, completed_tasks: 120, health: "healthy" }\n\n# Validate workflow Nickel file (~10 ms vs ~100 ms HTTP)\norch validate workflows/deploy.ncl\norch validate workflows/deploy.ncl --strict\n\n# List tasks (direct file read, ~5 ms)\norch tasks\norch tasks --status running\norch tasks --status failed --limit 10\n```\n\n**Installation:**\n\n```\ncargo build --release -p nu_plugin_orchestrator\nplugin add target/release/nu_plugin_orchestrator\n```\n\n### Plugin Performance Comparison\n\n| Operation | HTTP API | Plugin | Speedup |\n| ----------- | ---------- | -------- | --------- |\n| KMS Encrypt | ~50 ms | ~5 ms | **10x** |\n| KMS Decrypt | ~50 ms | ~5 ms | **10x** |\n| Orch Status | ~30 ms | ~1 ms | **30x** |\n| Orch Validate | ~100 ms | ~10 ms | **10x** |\n| Orch Tasks | ~50 ms | ~5 ms | **10x** |\n| Auth Verify | ~50 ms | ~10 ms | **5x** |\n\n---\n\n## CLI Shortcuts\n\n### Infrastructure Shortcuts\n\n```\n# Server shortcuts\nprovisioning s # server (same as 'provisioning server')\nprovisioning s create # Create servers\nprovisioning s delete # Delete servers\nprovisioning s list # List servers\nprovisioning s ssh web-01 # SSH into server\n\n# Taskserv shortcuts\nprovisioning t # taskserv (same as 'provisioning taskserv')\nprovisioning task # taskserv (alias)\nprovisioning t create kubernetes\nprovisioning t delete kubernetes\nprovisioning t list\nprovisioning t generate kubernetes\nprovisioning t check-updates\n\n# Cluster shortcuts\nprovisioning cl # cluster (same as 'provisioning cluster')\nprovisioning cl create buildkit\nprovisioning cl delete buildkit\nprovisioning cl list\n\n# Infrastructure shortcuts\nprovisioning i # infra (same as 'provisioning infra')\nprovisioning infras # infra (alias)\nprovisioning i list\nprovisioning i validate\n```\n\n### Orchestration Shortcuts\n\n```\n# Workflow shortcuts\nprovisioning wf # workflow (same as 'provisioning workflow')\nprovisioning flow # workflow (alias)\nprovisioning wf list\nprovisioning wf status \nprovisioning wf monitor \nprovisioning wf stats\nprovisioning wf cleanup\n\n# Batch shortcuts\nprovisioning bat # batch (same as 'provisioning batch')\nprovisioning batch submit workflows/example.ncl\nprovisioning bat list\nprovisioning bat status \nprovisioning bat monitor \nprovisioning bat rollback \nprovisioning bat cancel \nprovisioning bat stats\n\n# Orchestrator shortcuts\nprovisioning orch # orchestrator (same as 'provisioning orchestrator')\nprovisioning orch start\nprovisioning orch stop\nprovisioning orch status\nprovisioning orch health\nprovisioning orch logs\n```\n\n### Development Shortcuts\n\n```\n# Module shortcuts\nprovisioning mod # module (same as 'provisioning module')\nprovisioning mod 
discover taskserv\nprovisioning mod discover provider\nprovisioning mod discover cluster\nprovisioning mod load taskserv workspace kubernetes\nprovisioning mod list taskserv workspace\nprovisioning mod unload taskserv workspace kubernetes\nprovisioning mod sync-kcl\n\n# Layer shortcuts\nprovisioning lyr # layer (same as 'provisioning layer')\nprovisioning lyr explain\nprovisioning lyr show\nprovisioning lyr test\nprovisioning lyr stats\n\n# Version shortcuts\nprovisioning version check\nprovisioning version show\nprovisioning version updates\nprovisioning version apply \nprovisioning version taskserv \n\n# Package shortcuts\nprovisioning pack core\nprovisioning pack provider upcloud\nprovisioning pack list\nprovisioning pack clean\n```\n\n### Workspace Shortcuts\n\n```\n# Workspace shortcuts\nprovisioning ws # workspace (same as 'provisioning workspace')\nprovisioning ws init\nprovisioning ws create \nprovisioning ws validate\nprovisioning ws info\nprovisioning ws list\nprovisioning ws migrate\nprovisioning ws switch # Switch active workspace\nprovisioning ws active # Show active workspace\n\n# Template shortcuts\nprovisioning tpl # template (same as 'provisioning template')\nprovisioning tmpl # template (alias)\nprovisioning tpl list\nprovisioning tpl types\nprovisioning tpl show \nprovisioning tpl apply \nprovisioning tpl validate \n```\n\n### Configuration Shortcuts\n\n```\n# Environment shortcuts\nprovisioning e # env (same as 'provisioning env')\nprovisioning val # validate (same as 'provisioning validate')\nprovisioning st # setup (same as 'provisioning setup')\nprovisioning config # setup (alias)\n\n# Show shortcuts\nprovisioning show settings\nprovisioning show servers\nprovisioning show config\n\n# Initialization\nprovisioning init \n\n# All environment\nprovisioning allenv # Show all config and environment\n```\n\n### Utility Shortcuts\n\n```\n# List shortcuts\nprovisioning l # list (same as 'provisioning list')\nprovisioning ls # list (alias)\nprovisioning list # list (full)\n\n# SSH operations\nprovisioning ssh \n\n# SOPS operations\nprovisioning sops # Edit encrypted file\n\n# Cache management\nprovisioning cache clear\nprovisioning cache stats\n\n# Provider operations\nprovisioning providers list\nprovisioning providers info \n\n# Nushell session\nprovisioning nu # Start Nushell with provisioning library loaded\n\n# QR code generation\nprovisioning qr \n\n# Nushell information\nprovisioning nuinfo\n\n# Plugin management\nprovisioning plugin # plugin (same as 'provisioning plugin')\nprovisioning plugins # plugin (alias)\nprovisioning plugin list\nprovisioning plugin test nu_plugin_kms\n```\n\n### Generation Shortcuts\n\n```\n# Generate shortcuts\nprovisioning g # generate (same as 'provisioning generate')\nprovisioning gen # generate (alias)\nprovisioning g server\nprovisioning g taskserv \nprovisioning g cluster \nprovisioning g infra --new \nprovisioning g new \n```\n\n### Action Shortcuts\n\n```\n# Common actions\nprovisioning c # create (same as 'provisioning create')\nprovisioning d # delete (same as 'provisioning delete')\nprovisioning u # update (same as 'provisioning update')\n\n# Pricing shortcuts\nprovisioning price # Show server pricing\nprovisioning cost # price (alias)\nprovisioning costs # price (alias)\n\n# Create server + taskservs (combo command)\nprovisioning cst # create-server-task\nprovisioning csts # create-server-task (alias)\n```\n\n---\n\n## Infrastructure Commands\n\n### Server Management\n\n```\n# Create servers\nprovisioning server 
create\nprovisioning server create --check # Dry-run mode\nprovisioning server create --yes # Skip confirmation\n\n# Delete servers\nprovisioning server delete\nprovisioning server delete --check\nprovisioning server delete --yes\n\n# List servers\nprovisioning server list\nprovisioning server list --infra wuji\nprovisioning server list --out json\n\n# SSH into server\nprovisioning server ssh web-01\nprovisioning server ssh db-01\n\n# Show pricing\nprovisioning server price\nprovisioning server price --provider upcloud\n```\n\n### Taskserv Management\n\n```\n# Create taskserv\nprovisioning taskserv create kubernetes\nprovisioning taskserv create kubernetes --check\nprovisioning taskserv create kubernetes --infra wuji\n\n# Delete taskserv\nprovisioning taskserv delete kubernetes\nprovisioning taskserv delete kubernetes --check\n\n# List taskservs\nprovisioning taskserv list\nprovisioning taskserv list --infra wuji\n\n# Generate taskserv configuration\nprovisioning taskserv generate kubernetes\nprovisioning taskserv generate kubernetes --out yaml\n\n# Check for updates\nprovisioning taskserv check-updates\nprovisioning taskserv check-updates --taskserv kubernetes\n```\n\n### Cluster Management\n\n```\n# Create cluster\nprovisioning cluster create buildkit\nprovisioning cluster create buildkit --check\nprovisioning cluster create buildkit --infra wuji\n\n# Delete cluster\nprovisioning cluster delete buildkit\nprovisioning cluster delete buildkit --check\n\n# List clusters\nprovisioning cluster list\nprovisioning cluster list --infra wuji\n```\n\n---\n\n## Orchestration Commands\n\n### Workflow Management\n\n```\n# Submit server creation workflow\nnu -c "use core/nulib/workflows/server_create.nu *; server_create_workflow 'wuji' '' [] --check"\n\n# Submit taskserv workflow\nnu -c "use core/nulib/workflows/taskserv.nu *; taskserv create 'kubernetes' 'wuji' --check"\n\n# Submit cluster workflow\nnu -c "use core/nulib/workflows/cluster.nu *; cluster create 'buildkit' 'wuji' --check"\n\n# List all workflows\nprovisioning workflow list\nnu -c "use core/nulib/workflows/management.nu *; workflow list"\n\n# Get workflow statistics\nprovisioning workflow stats\nnu -c "use core/nulib/workflows/management.nu *; workflow stats"\n\n# Monitor workflow in real-time\nprovisioning workflow monitor \nnu -c "use core/nulib/workflows/management.nu *; workflow monitor "\n\n# Check orchestrator health\nprovisioning workflow orchestrator\nnu -c "use core/nulib/workflows/management.nu *; workflow orchestrator"\n\n# Get specific workflow status\nprovisioning workflow status \nnu -c "use core/nulib/workflows/management.nu *; workflow status "\n```\n\n### Batch Operations\n\n```\n# Submit batch workflow from Nickel\nprovisioning batch submit workflows/example_batch.ncl\nnu -c "use core/nulib/workflows/batch.nu *; batch submit workflows/example_batch.ncl"\n\n# Monitor batch workflow progress\nprovisioning batch monitor \nnu -c "use core/nulib/workflows/batch.nu *; batch monitor "\n\n# List batch workflows with filtering\nprovisioning batch list\nprovisioning batch list --status Running\nnu -c "use core/nulib/workflows/batch.nu *; batch list --status Running"\n\n# Get detailed batch status\nprovisioning batch status \nnu -c "use core/nulib/workflows/batch.nu *; batch status "\n\n# Initiate rollback for failed workflow\nprovisioning batch rollback \nnu -c "use core/nulib/workflows/batch.nu *; batch rollback "\n\n# Cancel running batch\nprovisioning batch cancel \n\n# Show batch workflow statistics\nprovisioning batch 
stats\nnu -c "use core/nulib/workflows/batch.nu *; batch stats"\n```\n\n### Orchestrator Management\n\n```\n# Start orchestrator in background\ncd provisioning/platform/orchestrator\n./scripts/start-orchestrator.nu --background\n\n# Check orchestrator status\n./scripts/start-orchestrator.nu --check\nprovisioning orchestrator status\n\n# Stop orchestrator\n./scripts/start-orchestrator.nu --stop\nprovisioning orchestrator stop\n\n# View logs\ntail -f provisioning/platform/orchestrator/data/orchestrator.log\nprovisioning orchestrator logs\n```\n\n---\n\n## Configuration Commands\n\n### Environment and Validation\n\n```\n# Show environment variables\nprovisioning env\n\n# Show all environment and configuration\nprovisioning allenv\n\n# Validate configuration\nprovisioning validate config\nprovisioning validate infra\n\n# Setup wizard\nprovisioning setup\n```\n\n### Configuration Files\n\n```\n# System defaults\nless provisioning/config/config.defaults.toml\n\n# User configuration\nvim workspace/config/local-overrides.toml\n\n# Environment-specific configs\nvim workspace/config/dev-defaults.toml\nvim workspace/config/test-defaults.toml\nvim workspace/config/prod-defaults.toml\n\n# Infrastructure-specific config\nvim workspace/infra//config.toml\n```\n\n### HTTP Configuration\n\n```\n# Configure HTTP client behavior\n# In workspace/config/local-overrides.toml:\n[http]\nuse_curl = true # Use curl instead of ureq\n```\n\n---\n\n## Workspace Commands\n\n### Workspace Management\n\n```\n# List all workspaces\nprovisioning workspace list\n\n# Show active workspace\nprovisioning workspace active\n\n# Switch to another workspace\nprovisioning workspace switch \nprovisioning workspace activate # alias\n\n# Register new workspace\nprovisioning workspace register \nprovisioning workspace register --activate\n\n# Remove workspace from registry\nprovisioning workspace remove \nprovisioning workspace remove --force\n\n# Initialize new workspace\nprovisioning workspace init\nprovisioning workspace init --name production\n\n# Create new workspace\nprovisioning workspace create \n\n# Validate workspace\nprovisioning workspace validate\n\n# Show workspace info\nprovisioning workspace info\n\n# Migrate workspace\nprovisioning workspace migrate\n```\n\n### User Preferences\n\n```\n# View user preferences\nprovisioning workspace preferences\n\n# Set user preference\nprovisioning workspace set-preference editor vim\nprovisioning workspace set-preference output_format yaml\nprovisioning workspace set-preference confirm_delete true\n\n# Get user preference\nprovisioning workspace get-preference editor\n```\n\n**User Config Location:**\n\n- macOS: `~/Library/Application Support/provisioning/user_config.yaml`\n- Linux: `~/.config/provisioning/user_config.yaml`\n- Windows: `%APPDATA%\provisioning\user_config.yaml`\n\n---\n\n## Security Commands\n\n### Authentication (via CLI)\n\n```\n# Login\nprovisioning login admin\n\n# Logout\nprovisioning logout\n\n# Show session status\nprovisioning auth status\n\n# List active sessions\nprovisioning auth sessions\n```\n\n### Multi-Factor Authentication (MFA)\n\n```\n# Enroll in TOTP (Google Authenticator, Authy)\nprovisioning mfa totp enroll\n\n# Enroll in WebAuthn (YubiKey, Touch ID, Windows Hello)\nprovisioning mfa webauthn enroll\n\n# Verify MFA code\nprovisioning mfa totp verify --code 123456\nprovisioning mfa webauthn verify\n\n# List registered devices\nprovisioning mfa devices\n```\n\n### Secrets Management\n\n```\n# Generate AWS STS credentials (15 min-12h 
TTL)\nprovisioning secrets generate aws --ttl 1hr\n\n# Generate SSH key pair (Ed25519)\nprovisioning secrets generate ssh --ttl 4hr\n\n# List active secrets\nprovisioning secrets list\n\n# Revoke secret\nprovisioning secrets revoke \n\n# Cleanup expired secrets\nprovisioning secrets cleanup\n```\n\n### SSH Temporal Keys\n\n```\n# Connect to server with temporal key\nprovisioning ssh connect server01 --ttl 1hr\n\n# Generate SSH key pair only\nprovisioning ssh generate --ttl 4hr\n\n# List active SSH keys\nprovisioning ssh list\n\n# Revoke SSH key\nprovisioning ssh revoke \n```\n\n### KMS Operations (via CLI)\n\n```\n# Encrypt configuration file\nprovisioning kms encrypt secure.yaml\n\n# Decrypt configuration file\nprovisioning kms decrypt secure.yaml.enc\n\n# Encrypt entire config directory\nprovisioning config encrypt workspace/infra/production/\n\n# Decrypt config directory\nprovisioning config decrypt workspace/infra/production/\n```\n\n### Break-Glass Emergency Access\n\n```\n# Request emergency access\nprovisioning break-glass request "Production database outage"\n\n# Approve emergency request (requires admin)\nprovisioning break-glass approve --reason "Approved by CTO"\n\n# List break-glass sessions\nprovisioning break-glass list\n\n# Revoke break-glass session\nprovisioning break-glass revoke \n```\n\n### Compliance and Audit\n\n```\n# Generate compliance report\nprovisioning compliance report\nprovisioning compliance report --standard gdpr\nprovisioning compliance report --standard soc2\nprovisioning compliance report --standard iso27001\n\n# GDPR operations\nprovisioning compliance gdpr export \nprovisioning compliance gdpr delete \nprovisioning compliance gdpr rectify \n\n# Incident management\nprovisioning compliance incident create "Security breach detected"\nprovisioning compliance incident list\nprovisioning compliance incident update --status investigating\n\n# Audit log queries\nprovisioning audit query --user alice --action deploy --from 24h\nprovisioning audit export --format json --output audit-logs.json\n```\n\n---\n\n## Common Workflows\n\n### Complete Deployment from Scratch\n\n```\n# 1. Initialize workspace\nprovisioning workspace init --name production\n\n# 2. Validate configuration\nprovisioning validate config\n\n# 3. Create infrastructure definition\nprovisioning generate infra --new production\n\n# 4. Create servers (check mode first)\nprovisioning server create --infra production --check\n\n# 5. Create servers (actual deployment)\nprovisioning server create --infra production --yes\n\n# 6. Install Kubernetes\nprovisioning taskserv create kubernetes --infra production --check\nprovisioning taskserv create kubernetes --infra production\n\n# 7. Deploy cluster services\nprovisioning cluster create production --check\nprovisioning cluster create production\n\n# 8. Verify deployment\nprovisioning server list --infra production\nprovisioning taskserv list --infra production\n\n# 9. 
SSH to servers\nprovisioning server ssh k8s-master-01\n```\n\n### Multi-Environment Deployment\n\n```\n# Deploy to dev\nprovisioning server create --infra dev --check\nprovisioning server create --infra dev\nprovisioning taskserv create kubernetes --infra dev\n\n# Deploy to staging\nprovisioning server create --infra staging --check\nprovisioning server create --infra staging\nprovisioning taskserv create kubernetes --infra staging\n\n# Deploy to production (with confirmation)\nprovisioning server create --infra production --check\nprovisioning server create --infra production\nprovisioning taskserv create kubernetes --infra production\n```\n\n### Update Infrastructure\n\n```\n# 1. Check for updates\nprovisioning taskserv check-updates\n\n# 2. Update specific taskserv (check mode)\nprovisioning taskserv update kubernetes --check\n\n# 3. Apply update\nprovisioning taskserv update kubernetes\n\n# 4. Verify update\nprovisioning taskserv list --infra production | where name == kubernetes\n```\n\n### Encrypted Secrets Deployment\n\n```\n# 1. Authenticate\nauth login admin\nauth mfa verify --code 123456\n\n# 2. Encrypt secrets\nkms encrypt (open secrets/production.yaml) --backend rustyvault | save secrets/production.enc\n\n# 3. Deploy with encrypted secrets\nprovisioning cluster create production --secrets secrets/production.enc\n\n# 4. Verify deployment\norch tasks --status completed\n```\n\n---\n\n## Debug and Check Mode\n\n### Debug Mode\n\nEnable verbose logging with `--debug` or `-x` flag:\n\n```\n# Server creation with debug output\nprovisioning server create --debug\nprovisioning server create -x\n\n# Taskserv creation with debug\nprovisioning taskserv create kubernetes --debug\n\n# Show detailed error traces\nprovisioning --debug taskserv create kubernetes\n```\n\n### Check Mode (Dry Run)\n\nPreview changes without applying them with `--check` or `-c` flag:\n\n```\n# Check what servers would be created\nprovisioning server create --check\nprovisioning server create -c\n\n# Check taskserv installation\nprovisioning taskserv create kubernetes --check\n\n# Check cluster creation\nprovisioning cluster create buildkit --check\n\n# Combine with debug for detailed preview\nprovisioning server create --check --debug\n```\n\n### Auto-Confirm Mode\n\nSkip confirmation prompts with `--yes` or `-y` flag:\n\n```\n# Auto-confirm server creation\nprovisioning server create --yes\nprovisioning server create -y\n\n# Auto-confirm deletion\nprovisioning server delete --yes\n```\n\n### Wait Mode\n\nWait for operations to complete with `--wait` or `-w` flag:\n\n```\n# Wait for server creation to complete\nprovisioning server create --wait\n\n# Wait for taskserv installation\nprovisioning taskserv create kubernetes --wait\n```\n\n### Infrastructure Selection\n\nSpecify target infrastructure with `--infra` or `-i` flag:\n\n```\n# Create servers in specific infrastructure\nprovisioning server create --infra production\nprovisioning server create -i production\n\n# List servers in specific infrastructure\nprovisioning server list --infra production\n```\n\n---\n\n## Output Formats\n\n### JSON Output\n\n```\n# Output as JSON\nprovisioning server list --out json\nprovisioning taskserv list --out json\n\n# Pipeline JSON output\nprovisioning server list --out json | jq '.[] | select(.status == "running")'\n```\n\n### YAML Output\n\n```\n# Output as YAML\nprovisioning server list --out yaml\nprovisioning taskserv list --out yaml\n\n# Pipeline YAML output\nprovisioning server list --out yaml | yq '.[] | select(.status 
== "running")'\n```\n\n### Table Output (Default)\n\n```\n# Output as table (default)\nprovisioning server list\nprovisioning server list --out table\n\n# Pretty-printed table\nprovisioning server list | table\n```\n\n### Text Output\n\n```\n# Output as plain text\nprovisioning server list --out text\n```\n\n---\n\n## Performance Tips\n\n### Use Plugins for Frequent Operations\n\n```\n# ❌ Slow: HTTP API (50 ms per call)\nfor i in 1..100 { http post http://localhost:9998/encrypt { data: "secret" } }\n\n# ✅ Fast: Plugin (5 ms per call, 10x faster)\nfor i in 1..100 { kms encrypt "secret" }\n```\n\n### Batch Operations\n\n```\n# Use batch workflows for multiple operations\nprovisioning batch submit workflows/multi-cloud-deploy.ncl\n```\n\n### Check Mode for Testing\n\n```\n# Always test with --check first\nprovisioning server create --check\nprovisioning server create # Only after verification\n```\n\n---\n\n## Help System\n\n### Command-Specific Help\n\n```\n# Show help for specific command\nprovisioning help server\nprovisioning help taskserv\nprovisioning help cluster\nprovisioning help workflow\nprovisioning help batch\n\n# Show help for command category\nprovisioning help infra\nprovisioning help orch\nprovisioning help dev\nprovisioning help ws\nprovisioning help config\n```\n\n### Bi-Directional Help\n\n```\n# All these work identically:\nprovisioning help workspace\nprovisioning workspace help\nprovisioning ws help\nprovisioning help ws\n```\n\n### General Help\n\n```\n# Show all commands\nprovisioning help\nprovisioning --help\n\n# Show version\nprovisioning version\nprovisioning --version\n```\n\n---\n\n## Quick Reference: Common Flags\n\n| Flag | Short | Description | Example |\n| ------ | ------- | ------------- | --------- |\n| `--debug` | `-x` | Enable debug mode | `provisioning server create --debug` |\n| `--check` | `-c` | Check mode (dry run) | `provisioning server create --check` |\n| `--yes` | `-y` | Auto-confirm | `provisioning server delete --yes` |\n| `--wait` | `-w` | Wait for completion | `provisioning server create --wait` |\n| `--infra` | `-i` | Specify infrastructure | `provisioning server list --infra prod` |\n| `--out` | - | Output format | `provisioning server list --out json` |\n\n---\n\n## Plugin Installation Quick Reference\n\n```\n# Build all plugins (one-time setup)\ncd provisioning/core/plugins/nushell-plugins\ncargo build --release --all\n\n# Register plugins\nplugin add target/release/nu_plugin_auth\nplugin add target/release/nu_plugin_kms\nplugin add target/release/nu_plugin_orchestrator\n\n# Verify installation\nplugin list | where name =~ "auth|kms|orch"\nauth --help\nkms --help\norch --help\n\n# Set environment\nexport RUSTYVAULT_ADDR="http://localhost:8200"\nexport RUSTYVAULT_TOKEN="hvs.xxxxx"\nexport CONTROL_CENTER_URL="http://localhost:3000"\n```\n\n---\n\n## Related Documentation\n\n- **Complete Plugin Guide**: `docs/user/PLUGIN_INTEGRATION_GUIDE.md`\n- **Plugin Reference**: `docs/user/NUSHELL_PLUGINS_GUIDE.md`\n- **From Scratch Guide**: `docs/guides/from-scratch.md`\n- **Update Infrastructure**: [Update Guide](../guides/update-infrastructure.md)\n- **Customize Infrastructure**: [Customize Guide](../guides/customize-infrastructure.md)\n- **CLI Architecture**: [CLI Reference](../infrastructure/cli-reference.md)\n- **Security System**: [Security Architecture](../security/security-system.md)\n\n---\n\n**For fastest access to this guide**: `provisioning sc`\n\n**Last Updated**: 2025-10-09\n**Maintained By**: Platform Team +# Provisioning Platform Quick 
Reference + +**Version**: 3.5.0 +**Last Updated**: 2025-10-09 + +--- + +## Quick Navigation + +- [Plugin Commands](#plugin-commands) - Native Nushell plugins (10-50x faster) +- [CLI Shortcuts](#cli-shortcuts) - 80+ command shortcuts +- [Infrastructure Commands](#infrastructure-commands) - Servers, taskservs, clusters +- [Orchestration Commands](#orchestration-commands) - Workflows, batch operations +- [Configuration Commands](#configuration-commands) - Config, validation, environment +- [Workspace Commands](#workspace-commands) - Multi-workspace management +- [Security Commands](#security-commands) - Auth, MFA, secrets, compliance +- [Common Workflows](#common-workflows) - Complete deployment examples +- [Debug and Check Mode](#debug-and-check-mode) - Testing and troubleshooting +- [Output Formats](#output-formats) - JSON, YAML, table formatting + +--- + +## Plugin Commands + +Native Nushell plugins for high-performance operations. **10-50x faster than HTTP API**. + +### Authentication Plugin (nu_plugin_auth) + +```text +# Login (password prompted securely) +auth login admin + +# Login with custom URL +auth login admin --url https://control-center.example.com + +# Verify current session +auth verify +# Returns: { active: true, user: "admin", role: "Admin", expires_at: "...", mfa_verified: true } + +# List active sessions +auth sessions + +# Logout +auth logout + +# MFA enrollment +auth mfa enroll totp # TOTP (Google Authenticator, Authy) +auth mfa enroll webauthn # WebAuthn (YubiKey, Touch ID, Windows Hello) + +# MFA verification +auth mfa verify --code 123456 +auth mfa verify --code ABCD-EFGH-IJKL # Backup code +``` + +**Installation:** + +```text +cd provisioning/core/plugins/nushell-plugins +cargo build --release -p nu_plugin_auth +plugin add target/release/nu_plugin_auth +``` + +### KMS Plugin (nu_plugin_kms) + +**Performance**: 10x faster encryption (~5 ms vs ~50 ms HTTP) + +```text +# Encrypt with auto-detected backend +kms encrypt "secret data" +# vault:v1:abc123... + +# Encrypt with specific backend +kms encrypt "data" --backend rustyvault --key provisioning-main +kms encrypt "data" --backend age --key age1xxxxxxxxx +kms encrypt "data" --backend aws --key alias/provisioning + +# Encrypt with context (AAD for additional security) +kms encrypt "data" --context "user=admin,env=production" + +# Decrypt (auto-detects backend from format) +kms decrypt "vault:v1:abc123..." +kms decrypt "-----BEGIN AGE ENCRYPTED FILE-----..." + +# Decrypt with context (must match encryption context) +kms decrypt "vault:v1:abc123..." 
--context "user=admin,env=production" + +# Generate data encryption key +kms generate-key +kms generate-key --spec AES256 + +# Check backend status +kms status +``` + +**Supported Backends:** + +- **rustyvault**: High-performance (~5 ms) - Production +- **age**: Local encryption (~3 ms) - Development +- **cosmian**: Cloud KMS (~30 ms) +- **aws**: AWS KMS (~50 ms) +- **vault**: HashiCorp Vault (~40 ms) + +**Installation:** + +```text +cargo build --release -p nu_plugin_kms +plugin add target/release/nu_plugin_kms + +# Set backend environment +export RUSTYVAULT_ADDR="http://localhost:8200" +export RUSTYVAULT_TOKEN="hvs.xxxxx" +``` + +### Orchestrator Plugin (nu_plugin_orchestrator) + +**Performance**: 30-50x faster queries (~1 ms vs ~30-50 ms HTTP) + +```text +# Get orchestrator status (direct file access, ~1 ms) +orch status +# { active_tasks: 5, completed_tasks: 120, health: "healthy" } + +# Validate workflow Nickel file (~10 ms vs ~100 ms HTTP) +orch validate workflows/deploy.ncl +orch validate workflows/deploy.ncl --strict + +# List tasks (direct file read, ~5 ms) +orch tasks +orch tasks --status running +orch tasks --status failed --limit 10 +``` + +**Installation:** + +```text +cargo build --release -p nu_plugin_orchestrator +plugin add target/release/nu_plugin_orchestrator +``` + +### Plugin Performance Comparison + +| Operation | HTTP API | Plugin | Speedup | +| ----------- | ---------- | -------- | --------- | +| KMS Encrypt | ~50 ms | ~5 ms | **10x** | +| KMS Decrypt | ~50 ms | ~5 ms | **10x** | +| Orch Status | ~30 ms | ~1 ms | **30x** | +| Orch Validate | ~100 ms | ~10 ms | **10x** | +| Orch Tasks | ~50 ms | ~5 ms | **10x** | +| Auth Verify | ~50 ms | ~10 ms | **5x** | + +--- + +## CLI Shortcuts + +### Infrastructure Shortcuts + +```text +# Server shortcuts +provisioning s # server (same as 'provisioning server') +provisioning s create # Create servers +provisioning s delete # Delete servers +provisioning s list # List servers +provisioning s ssh web-01 # SSH into server + +# Taskserv shortcuts +provisioning t # taskserv (same as 'provisioning taskserv') +provisioning task # taskserv (alias) +provisioning t create kubernetes +provisioning t delete kubernetes +provisioning t list +provisioning t generate kubernetes +provisioning t check-updates + +# Cluster shortcuts +provisioning cl # cluster (same as 'provisioning cluster') +provisioning cl create buildkit +provisioning cl delete buildkit +provisioning cl list + +# Infrastructure shortcuts +provisioning i # infra (same as 'provisioning infra') +provisioning infras # infra (alias) +provisioning i list +provisioning i validate +``` + +### Orchestration Shortcuts + +```text +# Workflow shortcuts +provisioning wf # workflow (same as 'provisioning workflow') +provisioning flow # workflow (alias) +provisioning wf list +provisioning wf status +provisioning wf monitor +provisioning wf stats +provisioning wf cleanup + +# Batch shortcuts +provisioning bat # batch (same as 'provisioning batch') +provisioning batch submit workflows/example.ncl +provisioning bat list +provisioning bat status +provisioning bat monitor +provisioning bat rollback +provisioning bat cancel +provisioning bat stats + +# Orchestrator shortcuts +provisioning orch # orchestrator (same as 'provisioning orchestrator') +provisioning orch start +provisioning orch stop +provisioning orch status +provisioning orch health +provisioning orch logs +``` + +### Development Shortcuts + +```text +# Module shortcuts +provisioning mod # module (same as 'provisioning module') 
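+
+# Discover, load, list, and unload modules (taskserv/provider/cluster)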
+provisioning mod discover taskserv +provisioning mod discover provider +provisioning mod discover cluster +provisioning mod load taskserv workspace kubernetes +provisioning mod list taskserv workspace +provisioning mod unload taskserv workspace kubernetes +provisioning mod sync-kcl + +# Layer shortcuts +provisioning lyr # layer (same as 'provisioning layer') +provisioning lyr explain +provisioning lyr show +provisioning lyr test +provisioning lyr stats + +# Version shortcuts +provisioning version check +provisioning version show +provisioning version updates +provisioning version apply +provisioning version taskserv + +# Package shortcuts +provisioning pack core +provisioning pack provider upcloud +provisioning pack list +provisioning pack clean +``` + +### Workspace Shortcuts + +```text +# Workspace shortcuts +provisioning ws # workspace (same as 'provisioning workspace') +provisioning ws init +provisioning ws create +provisioning ws validate +provisioning ws info +provisioning ws list +provisioning ws migrate +provisioning ws switch # Switch active workspace +provisioning ws active # Show active workspace + +# Template shortcuts +provisioning tpl # template (same as 'provisioning template') +provisioning tmpl # template (alias) +provisioning tpl list +provisioning tpl types +provisioning tpl show +provisioning tpl apply +provisioning tpl validate +``` + +### Configuration Shortcuts + +```text +# Environment shortcuts +provisioning e # env (same as 'provisioning env') +provisioning val # validate (same as 'provisioning validate') +provisioning st # setup (same as 'provisioning setup') +provisioning config # setup (alias) + +# Show shortcuts +provisioning show settings +provisioning show servers +provisioning show config + +# Initialization +provisioning init + +# All environment +provisioning allenv # Show all config and environment +``` + +### Utility Shortcuts + +```text +# List shortcuts +provisioning l # list (same as 'provisioning list') +provisioning ls # list (alias) +provisioning list # list (full) + +# SSH operations +provisioning ssh + +# SOPS operations +provisioning sops # Edit encrypted file + +# Cache management +provisioning cache clear +provisioning cache stats + +# Provider operations +provisioning providers list +provisioning providers info + +# Nushell session +provisioning nu # Start Nushell with provisioning library loaded + +# QR code generation +provisioning qr + +# Nushell information +provisioning nuinfo + +# Plugin management +provisioning plugin # plugin (same as 'provisioning plugin') +provisioning plugins # plugin (alias) +provisioning plugin list +provisioning plugin test nu_plugin_kms +``` + +### Generation Shortcuts + +```text +# Generate shortcuts +provisioning g # generate (same as 'provisioning generate') +provisioning gen # generate (alias) +provisioning g server +provisioning g taskserv +provisioning g cluster +provisioning g infra --new +provisioning g new +``` + +### Action Shortcuts + +```text +# Common actions +provisioning c # create (same as 'provisioning create') +provisioning d # delete (same as 'provisioning delete') +provisioning u # update (same as 'provisioning update') + +# Pricing shortcuts +provisioning price # Show server pricing +provisioning cost # price (alias) +provisioning costs # price (alias) + +# Create server + taskservs (combo command) +provisioning cst # create-server-task +provisioning csts # create-server-task (alias) +``` + +--- + +## Infrastructure Commands + +### Server Management + +```text +# Create servers 
+provisioning server create +provisioning server create --check # Dry-run mode +provisioning server create --yes # Skip confirmation + +# Delete servers +provisioning server delete +provisioning server delete --check +provisioning server delete --yes + +# List servers +provisioning server list +provisioning server list --infra wuji +provisioning server list --out json + +# SSH into server +provisioning server ssh web-01 +provisioning server ssh db-01 + +# Show pricing +provisioning server price +provisioning server price --provider upcloud +``` + +### Taskserv Management + +```text +# Create taskserv +provisioning taskserv create kubernetes +provisioning taskserv create kubernetes --check +provisioning taskserv create kubernetes --infra wuji + +# Delete taskserv +provisioning taskserv delete kubernetes +provisioning taskserv delete kubernetes --check + +# List taskservs +provisioning taskserv list +provisioning taskserv list --infra wuji + +# Generate taskserv configuration +provisioning taskserv generate kubernetes +provisioning taskserv generate kubernetes --out yaml + +# Check for updates +provisioning taskserv check-updates +provisioning taskserv check-updates --taskserv kubernetes +``` + +### Cluster Management + +```text +# Create cluster +provisioning cluster create buildkit +provisioning cluster create buildkit --check +provisioning cluster create buildkit --infra wuji + +# Delete cluster +provisioning cluster delete buildkit +provisioning cluster delete buildkit --check + +# List clusters +provisioning cluster list +provisioning cluster list --infra wuji +``` + +--- + +## Orchestration Commands + +### Workflow Management + +```text +# Submit server creation workflow +nu -c "use core/nulib/workflows/server_create.nu *; server_create_workflow 'wuji' '' [] --check" + +# Submit taskserv workflow +nu -c "use core/nulib/workflows/taskserv.nu *; taskserv create 'kubernetes' 'wuji' --check" + +# Submit cluster workflow +nu -c "use core/nulib/workflows/cluster.nu *; cluster create 'buildkit' 'wuji' --check" + +# List all workflows +provisioning workflow list +nu -c "use core/nulib/workflows/management.nu *; workflow list" + +# Get workflow statistics +provisioning workflow stats +nu -c "use core/nulib/workflows/management.nu *; workflow stats" + +# Monitor workflow in real-time +provisioning workflow monitor +nu -c "use core/nulib/workflows/management.nu *; workflow monitor " + +# Check orchestrator health +provisioning workflow orchestrator +nu -c "use core/nulib/workflows/management.nu *; workflow orchestrator" + +# Get specific workflow status +provisioning workflow status +nu -c "use core/nulib/workflows/management.nu *; workflow status " +``` + +### Batch Operations + +```text +# Submit batch workflow from Nickel +provisioning batch submit workflows/example_batch.ncl +nu -c "use core/nulib/workflows/batch.nu *; batch submit workflows/example_batch.ncl" + +# Monitor batch workflow progress +provisioning batch monitor +nu -c "use core/nulib/workflows/batch.nu *; batch monitor " + +# List batch workflows with filtering +provisioning batch list +provisioning batch list --status Running +nu -c "use core/nulib/workflows/batch.nu *; batch list --status Running" + +# Get detailed batch status +provisioning batch status +nu -c "use core/nulib/workflows/batch.nu *; batch status " + +# Initiate rollback for failed workflow +provisioning batch rollback +nu -c "use core/nulib/workflows/batch.nu *; batch rollback " + +# Cancel running batch +provisioning batch cancel + +# Show batch workflow 
statistics +provisioning batch stats +nu -c "use core/nulib/workflows/batch.nu *; batch stats" +``` + +### Orchestrator Management + +```text +# Start orchestrator in background +cd provisioning/platform/orchestrator +./scripts/start-orchestrator.nu --background + +# Check orchestrator status +./scripts/start-orchestrator.nu --check +provisioning orchestrator status + +# Stop orchestrator +./scripts/start-orchestrator.nu --stop +provisioning orchestrator stop + +# View logs +tail -f provisioning/platform/orchestrator/data/orchestrator.log +provisioning orchestrator logs +``` + +--- + +## Configuration Commands + +### Environment and Validation + +```text +# Show environment variables +provisioning env + +# Show all environment and configuration +provisioning allenv + +# Validate configuration +provisioning validate config +provisioning validate infra + +# Setup wizard +provisioning setup +``` + +### Configuration Files + +```text +# System defaults +less provisioning/config/config.defaults.toml + +# User configuration +vim workspace/config/local-overrides.toml + +# Environment-specific configs +vim workspace/config/dev-defaults.toml +vim workspace/config/test-defaults.toml +vim workspace/config/prod-defaults.toml + +# Infrastructure-specific config +vim workspace/infra//config.toml +``` + +### HTTP Configuration + +```text +# Configure HTTP client behavior +# In workspace/config/local-overrides.toml: +[http] +use_curl = true # Use curl instead of ureq +``` + +--- + +## Workspace Commands + +### Workspace Management + +```text +# List all workspaces +provisioning workspace list + +# Show active workspace +provisioning workspace active + +# Switch to another workspace +provisioning workspace switch +provisioning workspace activate # alias + +# Register new workspace +provisioning workspace register +provisioning workspace register --activate + +# Remove workspace from registry +provisioning workspace remove +provisioning workspace remove --force + +# Initialize new workspace +provisioning workspace init +provisioning workspace init --name production + +# Create new workspace +provisioning workspace create + +# Validate workspace +provisioning workspace validate + +# Show workspace info +provisioning workspace info + +# Migrate workspace +provisioning workspace migrate +``` + +### User Preferences + +```text +# View user preferences +provisioning workspace preferences + +# Set user preference +provisioning workspace set-preference editor vim +provisioning workspace set-preference output_format yaml +provisioning workspace set-preference confirm_delete true + +# Get user preference +provisioning workspace get-preference editor +``` + +**User Config Location:** + +- macOS: `~/Library/Application Support/provisioning/user_config.yaml` +- Linux: `~/.config/provisioning/user_config.yaml` +- Windows: `%APPDATA%\provisioning\user_config.yaml` + +--- + +## Security Commands + +### Authentication (via CLI) + +```text +# Login +provisioning login admin + +# Logout +provisioning logout + +# Show session status +provisioning auth status + +# List active sessions +provisioning auth sessions +``` + +### Multi-Factor Authentication (MFA) + +```text +# Enroll in TOTP (Google Authenticator, Authy) +provisioning mfa totp enroll + +# Enroll in WebAuthn (YubiKey, Touch ID, Windows Hello) +provisioning mfa webauthn enroll + +# Verify MFA code +provisioning mfa totp verify --code 123456 +provisioning mfa webauthn verify + +# List registered devices +provisioning mfa devices +``` + +### Secrets Management + 
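+Dynamic secrets are issued with a TTL; list, revoke, or clean up expired entries with the commands below.
+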
+```text +# Generate AWS STS credentials (15 min-12h TTL) +provisioning secrets generate aws --ttl 1hr + +# Generate SSH key pair (Ed25519) +provisioning secrets generate ssh --ttl 4hr + +# List active secrets +provisioning secrets list + +# Revoke secret +provisioning secrets revoke + +# Cleanup expired secrets +provisioning secrets cleanup +``` + +### SSH Temporal Keys + +```text +# Connect to server with temporal key +provisioning ssh connect server01 --ttl 1hr + +# Generate SSH key pair only +provisioning ssh generate --ttl 4hr + +# List active SSH keys +provisioning ssh list + +# Revoke SSH key +provisioning ssh revoke +``` + +### KMS Operations (via CLI) + +```text +# Encrypt configuration file +provisioning kms encrypt secure.yaml + +# Decrypt configuration file +provisioning kms decrypt secure.yaml.enc + +# Encrypt entire config directory +provisioning config encrypt workspace/infra/production/ + +# Decrypt config directory +provisioning config decrypt workspace/infra/production/ +``` + +### Break-Glass Emergency Access + +```text +# Request emergency access +provisioning break-glass request "Production database outage" + +# Approve emergency request (requires admin) +provisioning break-glass approve --reason "Approved by CTO" + +# List break-glass sessions +provisioning break-glass list + +# Revoke break-glass session +provisioning break-glass revoke +``` + +### Compliance and Audit + +```text +# Generate compliance report +provisioning compliance report +provisioning compliance report --standard gdpr +provisioning compliance report --standard soc2 +provisioning compliance report --standard iso27001 + +# GDPR operations +provisioning compliance gdpr export +provisioning compliance gdpr delete +provisioning compliance gdpr rectify + +# Incident management +provisioning compliance incident create "Security breach detected" +provisioning compliance incident list +provisioning compliance incident update --status investigating + +# Audit log queries +provisioning audit query --user alice --action deploy --from 24h +provisioning audit export --format json --output audit-logs.json +``` + +--- + +## Common Workflows + +### Complete Deployment from Scratch + +```text +# 1. Initialize workspace +provisioning workspace init --name production + +# 2. Validate configuration +provisioning validate config + +# 3. Create infrastructure definition +provisioning generate infra --new production + +# 4. Create servers (check mode first) +provisioning server create --infra production --check + +# 5. Create servers (actual deployment) +provisioning server create --infra production --yes + +# 6. Install Kubernetes +provisioning taskserv create kubernetes --infra production --check +provisioning taskserv create kubernetes --infra production + +# 7. Deploy cluster services +provisioning cluster create production --check +provisioning cluster create production + +# 8. Verify deployment +provisioning server list --infra production +provisioning taskserv list --infra production + +# 9. 
SSH to servers +provisioning server ssh k8s-master-01 +``` + +### Multi-Environment Deployment + +```text +# Deploy to dev +provisioning server create --infra dev --check +provisioning server create --infra dev +provisioning taskserv create kubernetes --infra dev + +# Deploy to staging +provisioning server create --infra staging --check +provisioning server create --infra staging +provisioning taskserv create kubernetes --infra staging + +# Deploy to production (with confirmation) +provisioning server create --infra production --check +provisioning server create --infra production +provisioning taskserv create kubernetes --infra production +``` + +### Update Infrastructure + +```text +# 1. Check for updates +provisioning taskserv check-updates + +# 2. Update specific taskserv (check mode) +provisioning taskserv update kubernetes --check + +# 3. Apply update +provisioning taskserv update kubernetes + +# 4. Verify update +provisioning taskserv list --infra production | where name == kubernetes +``` + +### Encrypted Secrets Deployment + +```text +# 1. Authenticate +auth login admin +auth mfa verify --code 123456 + +# 2. Encrypt secrets +kms encrypt (open secrets/production.yaml) --backend rustyvault | save secrets/production.enc + +# 3. Deploy with encrypted secrets +provisioning cluster create production --secrets secrets/production.enc + +# 4. Verify deployment +orch tasks --status completed +``` + +--- + +## Debug and Check Mode + +### Debug Mode + +Enable verbose logging with `--debug` or `-x` flag: + +```text +# Server creation with debug output +provisioning server create --debug +provisioning server create -x + +# Taskserv creation with debug +provisioning taskserv create kubernetes --debug + +# Show detailed error traces +provisioning --debug taskserv create kubernetes +``` + +### Check Mode (Dry Run) + +Preview changes without applying them with `--check` or `-c` flag: + +```text +# Check what servers would be created +provisioning server create --check +provisioning server create -c + +# Check taskserv installation +provisioning taskserv create kubernetes --check + +# Check cluster creation +provisioning cluster create buildkit --check + +# Combine with debug for detailed preview +provisioning server create --check --debug +``` + +### Auto-Confirm Mode + +Skip confirmation prompts with `--yes` or `-y` flag: + +```text +# Auto-confirm server creation +provisioning server create --yes +provisioning server create -y + +# Auto-confirm deletion +provisioning server delete --yes +``` + +### Wait Mode + +Wait for operations to complete with `--wait` or `-w` flag: + +```text +# Wait for server creation to complete +provisioning server create --wait + +# Wait for taskserv installation +provisioning taskserv create kubernetes --wait +``` + +### Infrastructure Selection + +Specify target infrastructure with `--infra` or `-i` flag: + +```text +# Create servers in specific infrastructure +provisioning server create --infra production +provisioning server create -i production + +# List servers in specific infrastructure +provisioning server list --infra production +``` + +--- + +## Output Formats + +### JSON Output + +```text +# Output as JSON +provisioning server list --out json +provisioning taskserv list --out json + +# Pipeline JSON output +provisioning server list --out json | jq '.[] | select(.status == "running")' +``` + +### YAML Output + +```text +# Output as YAML +provisioning server list --out yaml +provisioning taskserv list --out yaml + +# Pipeline YAML output +provisioning server 
list --out yaml | yq '.[] | select(.status == "running")' +``` + +### Table Output (Default) + +```text +# Output as table (default) +provisioning server list +provisioning server list --out table + +# Pretty-printed table +provisioning server list | table +``` + +### Text Output + +```text +# Output as plain text +provisioning server list --out text +``` + +--- + +## Performance Tips + +### Use Plugins for Frequent Operations + +```text +# ❌ Slow: HTTP API (50 ms per call) +for i in 1..100 { http post http://localhost:9998/encrypt { data: "secret" } } + +# ✅ Fast: Plugin (5 ms per call, 10x faster) +for i in 1..100 { kms encrypt "secret" } +``` + +### Batch Operations + +```text +# Use batch workflows for multiple operations +provisioning batch submit workflows/multi-cloud-deploy.ncl +``` + +### Check Mode for Testing + +```text +# Always test with --check first +provisioning server create --check +provisioning server create # Only after verification +``` + +--- + +## Help System + +### Command-Specific Help + +```text +# Show help for specific command +provisioning help server +provisioning help taskserv +provisioning help cluster +provisioning help workflow +provisioning help batch + +# Show help for command category +provisioning help infra +provisioning help orch +provisioning help dev +provisioning help ws +provisioning help config +``` + +### Bi-Directional Help + +```text +# All these work identically: +provisioning help workspace +provisioning workspace help +provisioning ws help +provisioning help ws +``` + +### General Help + +```text +# Show all commands +provisioning help +provisioning --help + +# Show version +provisioning version +provisioning --version +``` + +--- + +## Quick Reference: Common Flags + +| Flag | Short | Description | Example | +| ------ | ------- | ------------- | --------- | +| `--debug` | `-x` | Enable debug mode | `provisioning server create --debug` | +| `--check` | `-c` | Check mode (dry run) | `provisioning server create --check` | +| `--yes` | `-y` | Auto-confirm | `provisioning server delete --yes` | +| `--wait` | `-w` | Wait for completion | `provisioning server create --wait` | +| `--infra` | `-i` | Specify infrastructure | `provisioning server list --infra prod` | +| `--out` | - | Output format | `provisioning server list --out json` | + +--- + +## Plugin Installation Quick Reference + +```text +# Build all plugins (one-time setup) +cd provisioning/core/plugins/nushell-plugins +cargo build --release --all + +# Register plugins +plugin add target/release/nu_plugin_auth +plugin add target/release/nu_plugin_kms +plugin add target/release/nu_plugin_orchestrator + +# Verify installation +plugin list | where name =~ "auth|kms|orch" +auth --help +kms --help +orch --help + +# Set environment +export RUSTYVAULT_ADDR="http://localhost:8200" +export RUSTYVAULT_TOKEN="hvs.xxxxx" +export CONTROL_CENTER_URL="http://localhost:3000" +``` + +--- + +## Related Documentation + +- **Complete Plugin Guide**: `docs/user/PLUGIN_INTEGRATION_GUIDE.md` +- **Plugin Reference**: `docs/user/NUSHELL_PLUGINS_GUIDE.md` +- **From Scratch Guide**: `docs/guides/from-scratch.md` +- **Update Infrastructure**: [Update Guide](../guides/update-infrastructure.md) +- **Customize Infrastructure**: [Customize Guide](../guides/customize-infrastructure.md) +- **CLI Architecture**: [CLI Reference](../infrastructure/cli-reference.md) +- **Security System**: [Security Architecture](../security/security-system.md) + +--- + +**For fastest access to this guide**: `provisioning sc` + +**Last 
Updated**: 2025-10-09 +**Maintained By**: Platform Team \ No newline at end of file diff --git a/docs/src/getting-started/quickstart.md b/docs/src/getting-started/quickstart.md index 0774a86..d1f4164 100644 --- a/docs/src/getting-started/quickstart.md +++ b/docs/src/getting-started/quickstart.md @@ -1 +1,29 @@ -# Quick Start\n\nThis guide has moved to a multi-chapter format for better readability.\n\n## 📖 Navigate to Quick Start Guide\n\nPlease see the complete quick start guide here:\n\n- **Prerequisites** - System requirements and setup\n- **Installation** - Install provisioning platform\n- **First Deployment** - Deploy your first infrastructure\n- **Verification** - Verify your deployment\n\n## Quick Commands\n\n```\n# Check system status\nprovisioning status\n\n# Get next step suggestions\nprovisioning next\n\n# View interactive guide\nprovisioning guide from-scratch\n```\n\n---\n\nFor the complete step-by-step walkthrough, start with Prerequisites. +# Quick Start + +This guide has moved to a multi-chapter format for better readability. + +## 📖 Navigate to Quick Start Guide + +Please see the complete quick start guide here: + +- **Prerequisites** - System requirements and setup +- **Installation** - Install provisioning platform +- **First Deployment** - Deploy your first infrastructure +- **Verification** - Verify your deployment + +## Quick Commands + +```text +# Check system status +provisioning status + +# Get next step suggestions +provisioning next + +# View interactive guide +provisioning guide from-scratch +``` + +--- + +For the complete step-by-step walkthrough, start with Prerequisites. \ No newline at end of file diff --git a/docs/src/getting-started/setup-profiles.md b/docs/src/getting-started/setup-profiles.md index 15b2979..0182c0d 100644 --- a/docs/src/getting-started/setup-profiles.md +++ b/docs/src/getting-started/setup-profiles.md @@ -1 +1,832 @@ -# Setup Profiles Guide - Detailed Reference\n\nThis guide provides detailed information about each setup profile and when to use them.\n\n---\n\n## Profile Comparison Matrix\n\n| | Aspect | Developer | Production | CI/CD | |\n| | -------- | ----------- | ----------- | ------- | |\n| | **Duration** | 3-4 min | 10-15 min | <2 min | |\n| | **User Input** | Minimal (1 question) | Extensive (10+ questions) | None (env vars) | |\n| | **Config Type** | Nickel (auto-composed) | Nickel (interactive) | Nickel (auto-minimal) | |\n| | **Validation** | Nickel typecheck | Nickel typecheck | Nickel typecheck | |\n| | **Deployment** | Docker Compose | Kubernetes/SSH/Docker | Docker Compose | |\n| | **Services Started** | Auto-start locally | Manual (you deploy) | Auto-start ephemeral | |\n| | **Storage** | Home dir (persistent) | Home dir (persistent) | /tmp (ephemeral) | |\n| | **Security** | Local defaults | MFA+Audit+Policies | Env vars + CI secrets | |\n| | **Intended User** | Developer, learner | Production operator | CI/CD automation | |\n| | **Best For** | Local testing, prototyping | Team deployments, HA | Automated testing | |\n\n---\n\n## Developer Profile: Fast Local Setup\n\n### When to Use\n\n- **First-time users**: Get provisioning working quickly\n- **Local development**: Test infrastructure on your machine\n- **Learning**: Understand provisioning concepts\n- **Prototyping**: Rapid iteration on configurations\n- **Single-user setup**: Personal workstation only\n\n### What Gets Created\n\n**Config Files** (all Nickel, type-safe):\n- `system.ncl` - System detection (auto-detected, read-only)\n- `user_preferences.ncl` - User 
settings (recommended defaults)\n- `platform/deployment.ncl` - Local Docker Compose setup\n- `providers/local.ncl` - Local provider (no credentials)\n\n**Services** (Docker Compose):\n- Orchestrator (port 9090)\n- Control Center (port 3000)\n- KMS service (port 3001)\n\n**Storage Location**:\n- macOS: `~/Library/Application Support/provisioning/`\n- Linux: `~/.config/provisioning/`\n\n### System Requirements\n\n**Minimum**:\n- OS: macOS (10.14+) or Linux\n- CPU: 2 cores\n- Memory: 4 GB RAM\n- Disk: 2 GB free\n\n**Recommended**:\n- CPU: 4+ cores\n- Memory: 8+ GB RAM\n- Disk: 10 GB free\n\n**Dependencies**:\n- Nushell (0.109.0+)\n- Nickel (1.5.0+)\n- Docker (latest)\n\n### Step-by-Step Walkthrough\n\n#### Step 1: Run Setup\n\n```\nprovisioning setup profile --profile developer\n```\n\nOutput:\n```\n╔═══════════════════════════════════════════════════════╗\n║ PROVISIONING SYSTEM SETUP - DEVELOPER PROFILE ║\n╚═══════════════════════════════════════════════════════╝\n\nEnvironment Detection\n OS: macOS (15.2.0)\n Architecture: aarch64\n CPU Count: 8\n Memory: 16 GB\n Disk: 500 GB\n\n✓ Detected capabilities: Docker\n✓ Configuration location: ~/Library/Application Support/provisioning/\n\nSetup Profile: DEVELOPER\n```\n\n#### Step 2: Auto-Detection\n\nSystem automatically detects:\n- Operating system (macOS/Linux)\n- Architecture (aarch64/x86_64)\n- CPU and memory\n- Available deployment tools (Docker, Kubernetes, etc.)\n\n**You see**: Detection summary, no prompts\n\n#### Step 3: Configuration Generation\n\nCreates three Nickel configs:\n\n**system.ncl** - System info (read-only):\n```\n{\n version = "1.0.0",\n config_base_path = "/Users/user/Library/Application Support/provisioning",\n os_name = 'macos,\n os_version = "15.2.0",\n system_architecture = 'aarch64,\n cpu_count = 8,\n memory_total_gb = 16,\n disk_total_gb = 500,\n setup_date = "2026-01-13T12:34:56Z"\n}\n| SystemConfig\n```\n\n**platform/deployment.ncl** - Deployment config (can edit):\n```\n{\n deployment = {\n mode = 'docker_compose,\n location_type = 'local,\n },\n services = {\n orchestrator = {\n endpoint = "http://localhost:9090/health",\n timeout_seconds = 30,\n },\n control_center = {\n endpoint = "http://localhost:3000/health",\n timeout_seconds = 30,\n },\n kms_service = {\n endpoint = "http://localhost:3001/health",\n timeout_seconds = 30,\n },\n },\n}\n| DeploymentConfig\n```\n\n**user_preferences.ncl** - User settings (can edit):\n```\n{\n output_format = 'yaml,\n use_colors = true,\n confirm_delete = true,\n default_log_level = 'info,\n http_timeout_seconds = 30,\n}\n| UserPreferencesConfig\n```\n\n#### Step 4: Validation\n\nEach config is validated:\n```\n✓ Validating system.ncl\n✓ Validating platform/deployment.ncl\n✓ Validating user_preferences.ncl\n✓ All configurations validated: PASSED\n```\n\n#### Step 5: Service Startup\n\nDocker Compose starts:\n```\n✓ Starting Docker Compose services...\n✓ Starting orchestrator... [port 9090]\n✓ Starting control-center... [port 3000]\n✓ Starting kms... 
[port 3001]\n```\n\n#### Step 6: Verification\n\nHealth checks verify services:\n```\n✓ Orchestrator health: HEALTHY\n✓ Control Center health: HEALTHY\n✓ KMS health: HEALTHY\n\nSetup complete in 3 minutes 47 seconds!\n```\n\n### After Setup: Common Tasks\n\n**Verify everything works**:\n```\ncurl http://localhost:9090/health\ncurl http://localhost:3000/health\ncurl http://localhost:3001/health\n```\n\n**View your configuration**:\n```\ncat ~/Library/Application\ Support/provisioning/system.ncl\ncat ~/Library/Application\ Support/provisioning/platform/deployment.ncl\n```\n\n**Create a workspace**:\n```\nprovisioning workspace create myapp\n```\n\n**View logs**:\n```\ndocker-compose logs orchestrator\ndocker-compose logs control-center\ndocker-compose logs kms\n```\n\n**Stop services**:\n```\ndocker-compose down\n```\n\n---\n\n## Production Profile: Enterprise-Ready Deployment\n\n### When to Use\n\n- **Production deployments**: Going live\n- **Team environments**: Multiple users, shared infrastructure\n- **High availability**: Kubernetes clusters\n- **Security requirements**: MFA, audit logging, policies\n- **Multi-cloud**: UpCloud, AWS, Hetzner\n- **Compliance**: Audit trails, authorization policies\n\n### What Gets Created\n\n**Config Files** (all Nickel, type-safe):\n- `system.ncl` - System detection (auto-detected)\n- `user_preferences.ncl` - Security-focused defaults (MFA, audit enabled)\n- `platform/deployment.ncl` - Kubernetes/SSH configuration\n- `providers/upcloud.ncl` (or aws/hetzner) - Cloud provider credentials\n- `cedar-policies/default.cedar` - Authorization policies (Cedar format)\n- `workspace-*/infrastructure.ncl` - Infrastructure-as-Code definitions\n\n**Services**: You deploy to Kubernetes or SSH manually\n\n**Storage Location**:\n- macOS: `~/Library/Application Support/provisioning/`\n- Linux: `~/.config/provisioning/`\n\n### System Requirements\n\n**Minimum**:\n- OS: macOS (10.14+) or Linux\n- CPU: 4 cores\n- Memory: 8 GB RAM\n- Disk: 10 GB free\n\n**Recommended**:\n- CPU: 8+ cores\n- Memory: 16+ GB RAM\n- Disk: 50 GB free\n- Cloud account (UpCloud, AWS, or Hetzner)\n\n**Dependencies**:\n- Nushell (0.109.0+)\n- Nickel (1.5.0+)\n- Docker (for building)\n- kubectl (for Kubernetes deployment)\n- Cloud CLI (upcloud-cli, aws-cli, etc.)\n\n### Step-by-Step Walkthrough\n\n#### Step 1: Run Setup\n\n```\nprovisioning setup profile --profile production --interactive\n```\n\n#### Step 2: System Detection\n\nSame as Developer profile - auto-detects OS, CPU, memory, etc.\n\n#### Step 3: Interactive Configuration\n\nThe wizard asks 10-15 questions:\n\n```\n1. Deployment Mode?\n a) Kubernetes (recommended for HA)\n b) SSH (manual server management)\n c) Docker Compose (hybrid local/remote)\n → Your choice: a) Kubernetes\n\n2. Cloud Provider?\n a) UpCloud\n b) AWS\n c) Hetzner\n d) Local (self-managed servers)\n → Your choice: a) UpCloud\n\n3. Workspace Name?\n (names your infrastructure project)\n → Your input: production-infrastructure\n\n4. Kubernetes Cluster?\n a) Create new cluster\n b) Use existing cluster\n → Your choice: a) Create new\n\n5. Master Nodes Count? (1-5, default 3)\n (for HA, recommend 3 or 5)\n → Your input: 3\n\n6. Worker Nodes Count? (2-10, default 5)\n (for scalability)\n → Your input: 5\n\n7. Enable MFA?\n (Multi-factor authentication for access)\n → Your choice: y\n\n8. Enable Audit Logging?\n (Log all operations for compliance)\n → Your choice: y\n\n9. 
Storage Backend?\n a) etcd (Kubernetes default)\n b) PostgreSQL (external)\n c) S3-compatible (cloud)\n → Your choice: a) etcd\n\n10. Certificate Management?\n a) Let's Encrypt (auto-renew)\n b) Self-signed (for testing)\n c) Bring your own\n → Your choice: a) Let's Encrypt\n\n11. Monitoring?\n a) Prometheus + Grafana\n b) Datadog\n c) CloudWatch\n d) None (not recommended)\n → Your choice: a) Prometheus + Grafana\n\n12. Logging?\n a) ELK Stack\n b) Splunk\n c) CloudWatch Logs\n d) None\n → Your choice: a) ELK Stack\n\n13. Authorization?\n a) Cedar policies (fine-grained)\n b) RBAC (basic roles)\n c) ABAC (attribute-based)\n → Your choice: a) Cedar policies\n```\n\n#### Step 4: Configuration Generation\n\nCreates extensive Nickel configs:\n\n**platform/deployment.ncl**:\n```\n{\n deployment = {\n mode = 'kubernetes,\n cluster_type = 'multi_master,\n master_count = 3,\n worker_count = 5,\n ha_enabled = true,\n },\n security = {\n mfa_enabled = true,\n audit_logging = true,\n tls_enabled = true,\n certificate_provider = 'letsencrypt,\n },\n monitoring = {\n prometheus_enabled = true,\n grafana_enabled = true,\n },\n logging = {\n elk_enabled = true,\n },\n}\n| ProductionDeploymentConfig\n```\n\n**providers/upcloud.ncl**:\n```\n{\n provider = 'upcloud,\n api_key_ref = "rustyvault://secrets/upcloud/api-key",\n api_secret_ref = "rustyvault://secrets/upcloud/api-secret",\n region = "us-east-1",\n server_template = "ubuntu-22.04",\n}\n| UpCloudProviderConfig\n```\n\n**cedar-policies/default.cedar**:\n```\npermit(\n principal == User::"john@company.com",\n action == Action::"Deploy",\n resource == Workspace::"prod-infra"\n)\nwhen { principal.mfa_verified == true };\n\npermit(\n principal in Group::"DevOps",\n action == Action::"ReadMetrics",\n resource in Team::"*"\n);\n\nforbid(\n principal in Group::"Contractors",\n action == Action::"DeleteWorkspace",\n resource in Team::"*"\n);\n```\n\n#### Step 5: Validation\n\nAll configs validated:\n```\n✓ Validating system.ncl\n✓ Validating platform/deployment.ncl\n✓ Validating providers/upcloud.ncl\n✓ Validating cedar-policies/default.cedar\n✓ All configurations validated: PASSED\n```\n\n#### Step 6: Summary & Confirmation\n\n```\nSetup Summary\n─────────────────────────────────────────\nProfile: Production\nDeployment Mode: Kubernetes\nCloud Provider: UpCloud\nMaster Nodes: 3\nWorker Nodes: 5\nMFA Enabled: Yes\nAudit Logging: Yes\nMonitoring: Prometheus + Grafana\nLogging: ELK Stack\n\nDo you want to proceed? (y/n): y\n```\n\n#### Step 7: Infrastructure Creation (Optional)\n\n```\nCreating UpCloud infrastructure...\n Creating 3 master nodes... [networking configured]\n Creating 5 worker nodes... [networking configured]\n Deploying Kubernetes... [cluster bootstrap]\n Installing monitoring... [Prometheus configured]\n Installing logging... 
[ELK deployed]\n\nInfrastructure ready in ~12 minutes!\n\nKubernetes cluster access:\n kubectl config use-context provisioning-prod-infra\n kubectl cluster-info\n\nDeploy services:\n kubectl apply -f infrastructure.ncl\n```\n\n### After Setup: Common Tasks\n\n**View Kubernetes cluster**:\n```\nkubectl get nodes\nkubectl get pods --all-namespaces\n```\n\n**Check Cedar authorization**:\n```\ncat ~/.config/provisioning/cedar-policies/default.cedar\n```\n\n**View infrastructure definition**:\n```\ncat workspace-production-infrastructure/infrastructure.ncl\n```\n\n**Deploy an application**:\n```\nprovisioning app deploy myapp --workspace production-infrastructure\n```\n\n**Monitor cluster**:\n```\n# Access Grafana\nopen http://localhost:3000\n\n# View Prometheus metrics\nopen http://localhost:9090\n```\n\n---\n\n## CI/CD Profile: Ephemeral Automated Setup\n\n### When to Use\n\n- **GitHub Actions workflows**: Test infrastructure changes\n- **GitLab CI pipelines**: Automated testing\n- **Jenkins jobs**: Integration testing\n- **Automated testing**: Spin up, test, cleanup\n- **Ephemeral environments**: No persistent state\n\n### What Gets Created\n\n**Config Files** (minimal Nickel):\n- `system.ncl` - CI environment info\n- `platform/deployment.ncl` - Minimal Docker Compose\n- `providers/local.ncl` - No credentials\n\n**Services**: Docker Compose (temporary)\n\n**Storage Location**: `/tmp/provisioning-ci-/`\n\n### System Requirements\n\n**Minimal** (CI container):\n- OS: Any Linux\n- CPU: 1+ core\n- Memory: 2 GB RAM\n- Disk: 1 GB free\n\n**Dependencies**:\n- Nushell (0.109.0+)\n- Nickel (1.5.0+)\n- Docker or Podman\n\n### Step-by-Step Walkthrough\n\n#### Example: GitHub Actions\n\n```\nname: Integration Tests\n\non: [push, pull_request]\n\njobs:\n test:\n runs-on: ubuntu-latest\n steps:\n - uses: actions/checkout@v3\n\n - name: Install Nushell\n run: |\n sudo apt-get update\n sudo apt-get install -y nushell\n\n - name: Install Nickel\n run: |\n sudo apt-get install -y nickel\n\n - name: Install Provisioning\n run: |\n git clone https://github.com/project-provisioning/provisioning\n cd provisioning\n ./scripts/install.sh\n\n - name: Setup Provisioning (CI/CD Profile)\n run: |\n export PROVISIONING_PROVIDER=local\n export PROVISIONING_WORKSPACE=ci-test-${{ github.run_id }}\n provisioning setup profile --profile cicd\n\n - name: Run Integration Tests\n run: |\n # Services are now running\n curl http://localhost:9090/health\n curl http://localhost:3000/health\n\n # Run your tests\n ./tests/integration-test.sh\n\n - name: Cleanup\n if: always()\n run: |\n docker-compose down\n # Automatic cleanup on job exit\n```\n\n#### What Happens\n\n**Step 1: Minimal Detection**\n```\n✓ Detected: CI environment\n✓ Profile: CICD\n```\n\n**Step 2: Ephemeral Config Creation**\n```\n✓ Created: /tmp/provisioning-ci-abc123def456/\n✓ Created: /tmp/provisioning-ci-abc123def456/system.ncl\n✓ Created: /tmp/provisioning-ci-abc123def456/platform/deployment.ncl\n```\n\n**Step 3: Validation**\n```\n✓ Validating system.ncl\n✓ Validating platform/deployment.ncl\n✓ All configurations validated: PASSED\n```\n\n**Step 4: Services Start**\n```\n✓ Starting Docker Compose services\n✓ Orchestrator running [port 9090]\n✓ Control Center running [port 3000]\n✓ KMS running [port 3001]\n✓ Services ready for tests\n```\n\n**Step 5: Tests Execute**\n```\n$ curl http://localhost:9090/health\n{"status": "healthy", "uptime": "2s"}\n\n$ ./tests/integration-test.sh\nTest: API endpoint... PASSED\nTest: Database schema... 
PASSED\nTest: Service discovery... PASSED\nAll tests passed!\n```\n\n**Step 6: Automatic Cleanup**\n```\n✓ Cleanup triggered (job exit)\n✓ Stopping Docker Compose\n✓ Removing temporary directory: /tmp/provisioning-ci-abc123def456/\n✓ Cleanup complete\n```\n\n### CI/CD Environment Variables\n\nUse environment variables to customize:\n\n```\n# Provider (local or cloud)\nexport PROVISIONING_PROVIDER=local|upcloud|aws|hetzner\n\n# Workspace name\nexport PROVISIONING_WORKSPACE=ci-test-${BUILD_ID}\n\n# Skip confirmations\nexport PROVISIONING_YES=true\n\n# Enable verbose output\nexport PROVISIONING_VERBOSE=true\n\n# Custom config location (if needed)\nexport PROVISIONING_CONFIG=/tmp/custom-config.ncl\n```\n\n### CI/CD Best Practices\n\n**1. Use matrix builds for testing**:\n```\nstrategy:\n matrix:\n profile: [developer, production]\n provider: [local, aws]\n```\n\n**2. Cache Nickel compilation**:\n```\n- uses: actions/cache@v3\n with:\n path: ~/.cache/nickel\n key: nickel-${{ hashFiles('*.ncl') }}\n```\n\n**3. Separate test stages**:\n```\n- name: Setup (CI/CD Profile)\n- name: Test Unit\n- name: Test Integration\n- name: Test E2E\n```\n\n**4. Publish test results**:\n```\n- name: Publish Test Results\n if: always()\n uses: actions/upload-artifact@v3\n with:\n name: test-results\n path: test-results/\n```\n\n---\n\n## Profile Selection Guide\n\n### "Which profile should I choose?"\n\n**Start with Developer if**:\n- You're new to provisioning\n- You're testing locally\n- You want to understand how it works\n- You need quick feedback loops\n\n**Move to Production if**:\n- You're deploying to production\n- You need high availability\n- You have security requirements\n- You're managing a team\n- You need audit logging\n\n**Use CI/CD if**:\n- You're running automated tests\n- You're in a CI/CD pipeline\n- You want ephemeral environments\n- You don't need persistent state\n\n### Migration Path\n\n```\nDeveloper → Production\n (ready for team)\n ↓\n └→ CI/CD (for testing)\n```\n\nYou can run Developer locally and CI/CD in your pipeline simultaneously.\n\n---\n\n## Modifying Profiles After Setup\n\n### Developer → Production Migration\n\nIf you started with Developer and want to move to Production:\n\n```\n# Backup your current setup\ntar czf provisioning-backup.tar.gz ~/.config/provisioning/\n\n# Run production setup\nprovisioning setup profile --profile production --interactive\n\n# Migrate any customizations from backup\ntar xzf provisioning-backup.tar.gz\n# Merge configs manually\n```\n\n### Customizing Profile Configs\n\nAll profiles' Nickel configs can be edited after setup:\n\n```\n# Edit deployment config\nvim ~/.config/provisioning/platform/deployment.ncl\n\n# Validate changes\nnickel typecheck ~/.config/provisioning/platform/deployment.ncl\n\n# Apply changes\ndocker-compose restart # or kubectl apply -f\n```\n\n---\n\n## Troubleshooting Profile-Specific Issues\n\n### Developer Profile\n\n**Problem**: Docker not running\n```\n# Solution: Start Docker\ndocker daemon &\n# or\nsudo systemctl start docker\n```\n\n**Problem**: Ports 9090/3000/3001 already in use\n```\n# Solution: Kill conflicting process\nlsof -i :9090 | grep LISTEN | awk '{print $2}' | xargs kill -9\n```\n\n### Production Profile\n\n**Problem**: Kubernetes not installed\n```\n# Solution: Install kubectl\nbrew install kubectl # macOS\nsudo apt-get install kubectl # Linux\n```\n\n**Problem**: Cloud credentials rejected\n```\n# Solution: Verify credentials\nupcloud auth status # or aws sts get-caller-identity\n# Re-run setup with 
correct credentials\n```\n\n### CI/CD Profile\n\n**Problem**: Services not accessible from test\n```\n# Solution: Use service DNS\ncurl http://orchestrator:9090/health # instead of localhost\n```\n\n**Problem**: Cleanup not working\n```\n# Solution: Manual cleanup\ndocker system prune -f\nrm -rf /tmp/provisioning-ci-*/\n```\n\n---\n\n| **Next Step**: Choose your profile and run `provisioning setup profile --profile ` |\n\n**Need more help?** See [Setup Guide](setup.md) or [Troubleshooting](../troubleshooting/troubleshooting.md)
+# Setup Profiles Guide - Detailed Reference
+
+This guide provides detailed information about each setup profile and when to use them.
+
+---
+
+## Profile Comparison Matrix
+
+| Aspect | Developer | Production | CI/CD |
+| -------- | ----------- | ----------- | ------- |
+| **Duration** | 3-4 min | 10-15 min | <2 min |
+| **User Input** | Minimal (1 question) | Extensive (10+ questions) | None (env vars) |
+| **Config Type** | Nickel (auto-composed) | Nickel (interactive) | Nickel (auto-minimal) |
+| **Validation** | Nickel typecheck | Nickel typecheck | Nickel typecheck |
+| **Deployment** | Docker Compose | Kubernetes/SSH/Docker | Docker Compose |
+| **Services Started** | Auto-start locally | Manual (you deploy) | Auto-start ephemeral |
+| **Storage** | Home dir (persistent) | Home dir (persistent) | /tmp (ephemeral) |
+| **Security** | Local defaults | MFA+Audit+Policies | Env vars + CI secrets |
+| **Intended User** | Developer, learner | Production operator | CI/CD automation |
+| **Best For** | Local testing, prototyping | Team deployments, HA | Automated testing |
+
+---
+
+## Developer Profile: Fast Local Setup
+
+### When to Use
+
+- **First-time users**: Get provisioning working quickly
+- **Local development**: Test infrastructure on your machine
+- **Learning**: Understand provisioning concepts
+- **Prototyping**: Rapid iteration on configurations
+- **Single-user setup**: Personal workstation only
+
+### What Gets Created
+
+**Config Files** (all Nickel, type-safe):
+- `system.ncl` - System detection (auto-detected, read-only)
+- `user_preferences.ncl` - User settings (recommended defaults)
+- `platform/deployment.ncl` - Local Docker Compose setup
+- `providers/local.ncl` - Local provider (no credentials)
+
+**Services** (Docker Compose):
+- Orchestrator (port 9090)
+- Control Center (port 3000)
+- KMS service (port 3001)
+
+**Storage Location**:
+- macOS: `~/Library/Application Support/provisioning/`
+- Linux: `~/.config/provisioning/`
+
+### System Requirements
+
+**Minimum**:
+- OS: macOS (10.14+) or Linux
+- CPU: 2 cores
+- Memory: 4 GB RAM
+- Disk: 2 GB free
+
+**Recommended**:
+- CPU: 4+ cores
+- Memory: 8+ GB RAM
+- Disk: 10 GB free
+
+**Dependencies**:
+- Nushell (0.109.0+)
+- Nickel (1.5.0+)
+- Docker (latest)
+
+### Step-by-Step Walkthrough
+
+#### Step 1: Run Setup
+
+```text
+provisioning setup profile --profile developer
+```
+
+Output:
+```text
+╔═══════════════════════════════════════════════════════╗
+║ PROVISIONING SYSTEM SETUP - DEVELOPER PROFILE ║
+╚═══════════════════════════════════════════════════════╝
+
+Environment Detection
+  OS: macOS (15.2.0)
+  Architecture: aarch64
+  CPU Count: 8
+  Memory: 16 GB
+  Disk: 500 GB
+
+✓ Detected capabilities: Docker
+✓ Configuration location: ~/Library/Application Support/provisioning/
+
+Setup Profile: DEVELOPER
+```
+
+#### Step 2: Auto-Detection
+
+System automatically detects:
+- Operating system (macOS/Linux)
+- Architecture (aarch64/x86_64)
+- CPU and memory +- Available deployment tools (Docker, Kubernetes, etc.) + +**You see**: Detection summary, no prompts + +#### Step 3: Configuration Generation + +Creates three Nickel configs: + +**system.ncl** - System info (read-only): +```text +{ + version = "1.0.0", + config_base_path = "/Users/user/Library/Application Support/provisioning", + os_name = 'macos, + os_version = "15.2.0", + system_architecture = 'aarch64, + cpu_count = 8, + memory_total_gb = 16, + disk_total_gb = 500, + setup_date = "2026-01-13T12:34:56Z" +} +| SystemConfig +``` + +**platform/deployment.ncl** - Deployment config (can edit): +```text +{ + deployment = { + mode = 'docker_compose, + location_type = 'local, + }, + services = { + orchestrator = { + endpoint = "http://localhost:9090/health", + timeout_seconds = 30, + }, + control_center = { + endpoint = "http://localhost:3000/health", + timeout_seconds = 30, + }, + kms_service = { + endpoint = "http://localhost:3001/health", + timeout_seconds = 30, + }, + }, +} +| DeploymentConfig +``` + +**user_preferences.ncl** - User settings (can edit): +```text +{ + output_format = 'yaml, + use_colors = true, + confirm_delete = true, + default_log_level = 'info, + http_timeout_seconds = 30, +} +| UserPreferencesConfig +``` + +#### Step 4: Validation + +Each config is validated: +```text +✓ Validating system.ncl +✓ Validating platform/deployment.ncl +✓ Validating user_preferences.ncl +✓ All configurations validated: PASSED +``` + +#### Step 5: Service Startup + +Docker Compose starts: +```text +✓ Starting Docker Compose services... +✓ Starting orchestrator... [port 9090] +✓ Starting control-center... [port 3000] +✓ Starting kms... [port 3001] +``` + +#### Step 6: Verification + +Health checks verify services: +```text +✓ Orchestrator health: HEALTHY +✓ Control Center health: HEALTHY +✓ KMS health: HEALTHY + +Setup complete in 3 minutes 47 seconds! 
+``` + +### After Setup: Common Tasks + +**Verify everything works**: +```text +curl http://localhost:9090/health +curl http://localhost:3000/health +curl http://localhost:3001/health +``` + +**View your configuration**: +```text +cat ~/Library/Application\ Support/provisioning/system.ncl +cat ~/Library/Application\ Support/provisioning/platform/deployment.ncl +``` + +**Create a workspace**: +```text +provisioning workspace create myapp +``` + +**View logs**: +```text +docker-compose logs orchestrator +docker-compose logs control-center +docker-compose logs kms +``` + +**Stop services**: +```text +docker-compose down +``` + +--- + +## Production Profile: Enterprise-Ready Deployment + +### When to Use + +- **Production deployments**: Going live +- **Team environments**: Multiple users, shared infrastructure +- **High availability**: Kubernetes clusters +- **Security requirements**: MFA, audit logging, policies +- **Multi-cloud**: UpCloud, AWS, Hetzner +- **Compliance**: Audit trails, authorization policies + +### What Gets Created + +**Config Files** (all Nickel, type-safe): +- `system.ncl` - System detection (auto-detected) +- `user_preferences.ncl` - Security-focused defaults (MFA, audit enabled) +- `platform/deployment.ncl` - Kubernetes/SSH configuration +- `providers/upcloud.ncl` (or aws/hetzner) - Cloud provider credentials +- `cedar-policies/default.cedar` - Authorization policies (Cedar format) +- `workspace-*/infrastructure.ncl` - Infrastructure-as-Code definitions + +**Services**: You deploy to Kubernetes or SSH manually + +**Storage Location**: +- macOS: `~/Library/Application Support/provisioning/` +- Linux: `~/.config/provisioning/` + +### System Requirements + +**Minimum**: +- OS: macOS (10.14+) or Linux +- CPU: 4 cores +- Memory: 8 GB RAM +- Disk: 10 GB free + +**Recommended**: +- CPU: 8+ cores +- Memory: 16+ GB RAM +- Disk: 50 GB free +- Cloud account (UpCloud, AWS, or Hetzner) + +**Dependencies**: +- Nushell (0.109.0+) +- Nickel (1.5.0+) +- Docker (for building) +- kubectl (for Kubernetes deployment) +- Cloud CLI (upcloud-cli, aws-cli, etc.) + +### Step-by-Step Walkthrough + +#### Step 1: Run Setup + +```text +provisioning setup profile --profile production --interactive +``` + +#### Step 2: System Detection + +Same as Developer profile - auto-detects OS, CPU, memory, etc. + +#### Step 3: Interactive Configuration + +The wizard asks 10-15 questions: + +```text +1. Deployment Mode? + a) Kubernetes (recommended for HA) + b) SSH (manual server management) + c) Docker Compose (hybrid local/remote) + → Your choice: a) Kubernetes + +2. Cloud Provider? + a) UpCloud + b) AWS + c) Hetzner + d) Local (self-managed servers) + → Your choice: a) UpCloud + +3. Workspace Name? + (names your infrastructure project) + → Your input: production-infrastructure + +4. Kubernetes Cluster? + a) Create new cluster + b) Use existing cluster + → Your choice: a) Create new + +5. Master Nodes Count? (1-5, default 3) + (for HA, recommend 3 or 5) + → Your input: 3 + +6. Worker Nodes Count? (2-10, default 5) + (for scalability) + → Your input: 5 + +7. Enable MFA? + (Multi-factor authentication for access) + → Your choice: y + +8. Enable Audit Logging? + (Log all operations for compliance) + → Your choice: y + +9. Storage Backend? + a) etcd (Kubernetes default) + b) PostgreSQL (external) + c) S3-compatible (cloud) + → Your choice: a) etcd + +10. Certificate Management? + a) Let's Encrypt (auto-renew) + b) Self-signed (for testing) + c) Bring your own + → Your choice: a) Let's Encrypt + +11. Monitoring? 
+ a) Prometheus + Grafana + b) Datadog + c) CloudWatch + d) None (not recommended) + → Your choice: a) Prometheus + Grafana + +12. Logging? + a) ELK Stack + b) Splunk + c) CloudWatch Logs + d) None + → Your choice: a) ELK Stack + +13. Authorization? + a) Cedar policies (fine-grained) + b) RBAC (basic roles) + c) ABAC (attribute-based) + → Your choice: a) Cedar policies +``` + +#### Step 4: Configuration Generation + +Creates extensive Nickel configs: + +**platform/deployment.ncl**: +```text +{ + deployment = { + mode = 'kubernetes, + cluster_type = 'multi_master, + master_count = 3, + worker_count = 5, + ha_enabled = true, + }, + security = { + mfa_enabled = true, + audit_logging = true, + tls_enabled = true, + certificate_provider = 'letsencrypt, + }, + monitoring = { + prometheus_enabled = true, + grafana_enabled = true, + }, + logging = { + elk_enabled = true, + }, +} +| ProductionDeploymentConfig +``` + +**providers/upcloud.ncl**: +```text +{ + provider = 'upcloud, + api_key_ref = "rustyvault://secrets/upcloud/api-key", + api_secret_ref = "rustyvault://secrets/upcloud/api-secret", + region = "us-east-1", + server_template = "ubuntu-22.04", +} +| UpCloudProviderConfig +``` + +**cedar-policies/default.cedar**: +```text +permit( + principal == User::"john@company.com", + action == Action::"Deploy", + resource == Workspace::"prod-infra" +) +when { principal.mfa_verified == true }; + +permit( + principal in Group::"DevOps", + action == Action::"ReadMetrics", + resource in Team::"*" +); + +forbid( + principal in Group::"Contractors", + action == Action::"DeleteWorkspace", + resource in Team::"*" +); +``` + +#### Step 5: Validation + +All configs validated: +```text +✓ Validating system.ncl +✓ Validating platform/deployment.ncl +✓ Validating providers/upcloud.ncl +✓ Validating cedar-policies/default.cedar +✓ All configurations validated: PASSED +``` + +#### Step 6: Summary & Confirmation + +```text +Setup Summary +───────────────────────────────────────── +Profile: Production +Deployment Mode: Kubernetes +Cloud Provider: UpCloud +Master Nodes: 3 +Worker Nodes: 5 +MFA Enabled: Yes +Audit Logging: Yes +Monitoring: Prometheus + Grafana +Logging: ELK Stack + +Do you want to proceed? (y/n): y +``` + +#### Step 7: Infrastructure Creation (Optional) + +```text +Creating UpCloud infrastructure... + Creating 3 master nodes... [networking configured] + Creating 5 worker nodes... [networking configured] + Deploying Kubernetes... [cluster bootstrap] + Installing monitoring... [Prometheus configured] + Installing logging... [ELK deployed] + +Infrastructure ready in ~12 minutes! 
+ +Kubernetes cluster access: + kubectl config use-context provisioning-prod-infra + kubectl cluster-info + +Deploy services: + kubectl apply -f infrastructure.ncl +``` + +### After Setup: Common Tasks + +**View Kubernetes cluster**: +```text +kubectl get nodes +kubectl get pods --all-namespaces +``` + +**Check Cedar authorization**: +```text +cat ~/.config/provisioning/cedar-policies/default.cedar +``` + +**View infrastructure definition**: +```text +cat workspace-production-infrastructure/infrastructure.ncl +``` + +**Deploy an application**: +```text +provisioning app deploy myapp --workspace production-infrastructure +``` + +**Monitor cluster**: +```text +# Access Grafana +open http://localhost:3000 + +# View Prometheus metrics +open http://localhost:9090 +``` + +--- + +## CI/CD Profile: Ephemeral Automated Setup + +### When to Use + +- **GitHub Actions workflows**: Test infrastructure changes +- **GitLab CI pipelines**: Automated testing +- **Jenkins jobs**: Integration testing +- **Automated testing**: Spin up, test, cleanup +- **Ephemeral environments**: No persistent state + +### What Gets Created + +**Config Files** (minimal Nickel): +- `system.ncl` - CI environment info +- `platform/deployment.ncl` - Minimal Docker Compose +- `providers/local.ncl` - No credentials + +**Services**: Docker Compose (temporary) + +**Storage Location**: `/tmp/provisioning-ci-/` + +### System Requirements + +**Minimal** (CI container): +- OS: Any Linux +- CPU: 1+ core +- Memory: 2 GB RAM +- Disk: 1 GB free + +**Dependencies**: +- Nushell (0.109.0+) +- Nickel (1.5.0+) +- Docker or Podman + +### Step-by-Step Walkthrough + +#### Example: GitHub Actions + +```text +name: Integration Tests + +on: [push, pull_request] + +jobs: + test: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v3 + + - name: Install Nushell + run: | + sudo apt-get update + sudo apt-get install -y nushell + + - name: Install Nickel + run: | + sudo apt-get install -y nickel + + - name: Install Provisioning + run: | + git clone https://github.com/project-provisioning/provisioning + cd provisioning + ./scripts/install.sh + + - name: Setup Provisioning (CI/CD Profile) + run: | + export PROVISIONING_PROVIDER=local + export PROVISIONING_WORKSPACE=ci-test-${{ github.run_id }} + provisioning setup profile --profile cicd + + - name: Run Integration Tests + run: | + # Services are now running + curl http://localhost:9090/health + curl http://localhost:3000/health + + # Run your tests + ./tests/integration-test.sh + + - name: Cleanup + if: always() + run: | + docker-compose down + # Automatic cleanup on job exit +``` + +#### What Happens + +**Step 1: Minimal Detection** +```text +✓ Detected: CI environment +✓ Profile: CICD +``` + +**Step 2: Ephemeral Config Creation** +```text +✓ Created: /tmp/provisioning-ci-abc123def456/ +✓ Created: /tmp/provisioning-ci-abc123def456/system.ncl +✓ Created: /tmp/provisioning-ci-abc123def456/platform/deployment.ncl +``` + +**Step 3: Validation** +```text +✓ Validating system.ncl +✓ Validating platform/deployment.ncl +✓ All configurations validated: PASSED +``` + +**Step 4: Services Start** +```text +✓ Starting Docker Compose services +✓ Orchestrator running [port 9090] +✓ Control Center running [port 3000] +✓ KMS running [port 3001] +✓ Services ready for tests +``` + +**Step 5: Tests Execute** +```text +$ curl http://localhost:9090/health +{"status": "healthy", "uptime": "2s"} + +$ ./tests/integration-test.sh +Test: API endpoint... PASSED +Test: Database schema... PASSED +Test: Service discovery... 
PASSED +All tests passed! +``` + +**Step 6: Automatic Cleanup** +```text +✓ Cleanup triggered (job exit) +✓ Stopping Docker Compose +✓ Removing temporary directory: /tmp/provisioning-ci-abc123def456/ +✓ Cleanup complete +``` + +### CI/CD Environment Variables + +Use environment variables to customize: + +```text +# Provider (local or cloud) +export PROVISIONING_PROVIDER=local|upcloud|aws|hetzner + +# Workspace name +export PROVISIONING_WORKSPACE=ci-test-${BUILD_ID} + +# Skip confirmations +export PROVISIONING_YES=true + +# Enable verbose output +export PROVISIONING_VERBOSE=true + +# Custom config location (if needed) +export PROVISIONING_CONFIG=/tmp/custom-config.ncl +``` + +### CI/CD Best Practices + +**1. Use matrix builds for testing**: +```text +strategy: + matrix: + profile: [developer, production] + provider: [local, aws] +``` + +**2. Cache Nickel compilation**: +```text +- uses: actions/cache@v3 + with: + path: ~/.cache/nickel + key: nickel-${{ hashFiles('*.ncl') }} +``` + +**3. Separate test stages**: +```text +- name: Setup (CI/CD Profile) +- name: Test Unit +- name: Test Integration +- name: Test E2E +``` + +**4. Publish test results**: +```text +- name: Publish Test Results + if: always() + uses: actions/upload-artifact@v3 + with: + name: test-results + path: test-results/ +``` + +--- + +## Profile Selection Guide + +### "Which profile should I choose?" + +**Start with Developer if**: +- You're new to provisioning +- You're testing locally +- You want to understand how it works +- You need quick feedback loops + +**Move to Production if**: +- You're deploying to production +- You need high availability +- You have security requirements +- You're managing a team +- You need audit logging + +**Use CI/CD if**: +- You're running automated tests +- You're in a CI/CD pipeline +- You want ephemeral environments +- You don't need persistent state + +### Migration Path + +```text +Developer → Production + (ready for team) + ↓ + └→ CI/CD (for testing) +``` + +You can run Developer locally and CI/CD in your pipeline simultaneously. 
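+
+### Quick Health Probe (Any Profile)
+
+Whichever combination you run, all three platform services expose the same `/health` endpoints, so one probe loop confirms that the environment you are pointed at is alive. This is a minimal sketch, assuming the default local ports used throughout this guide (9090, 3000, 3001) and plain `curl`; in CI, swap `localhost` for the service DNS names shown in the troubleshooting section below:
+
+```text
+# Probe orchestrator, control center, and KMS in one pass
+for port in 9090 3000 3001; do
+  if curl -fsS "http://localhost:${port}/health" >/dev/null; then
+    echo "port ${port}: healthy"
+  else
+    echo "port ${port}: unreachable"
+  fi
+done
+```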
+
+---
+
+## Modifying Profiles After Setup
+
+### Developer → Production Migration
+
+If you started with Developer and want to move to Production:
+
+```text
+# Backup your current setup
+tar czf provisioning-backup.tar.gz ~/.config/provisioning/
+
+# Run production setup
+provisioning setup profile --profile production --interactive
+
+# Migrate any customizations from backup
+tar xzf provisioning-backup.tar.gz
+# Merge configs manually
+```
+
+### Customizing Profile Configs
+
+All profiles' Nickel configs can be edited after setup:
+
+```text
+# Edit deployment config
+vim ~/.config/provisioning/platform/deployment.ncl
+
+# Validate changes
+nickel typecheck ~/.config/provisioning/platform/deployment.ncl
+
+# Apply changes
+docker-compose restart   # or kubectl apply -f
+```
+
+---
+
+## Troubleshooting Profile-Specific Issues
+
+### Developer Profile
+
+**Problem**: Docker not running
+```text
+# Solution: Start Docker
+open -a Docker                # macOS (Docker Desktop)
+# or
+sudo systemctl start docker   # Linux
+```
+
+**Problem**: Ports 9090/3000/3001 already in use
+```text
+# Solution: Kill conflicting process
+lsof -i :9090 | grep LISTEN | awk '{print $2}' | xargs kill -9
+```
+
+### Production Profile
+
+**Problem**: Kubernetes not installed
+```text
+# Solution: Install kubectl
+brew install kubectl           # macOS
+sudo apt-get install kubectl   # Linux
+```
+
+**Problem**: Cloud credentials rejected
+```text
+# Solution: Verify credentials
+upcloud auth status   # or aws sts get-caller-identity
+# Re-run setup with correct credentials
+```
+
+### CI/CD Profile
+
+**Problem**: Services not accessible from test
+```text
+# Solution: Use service DNS
+curl http://orchestrator:9090/health   # instead of localhost
+```
+
+**Problem**: Cleanup not working
+```text
+# Solution: Manual cleanup
+docker system prune -f
+rm -rf /tmp/provisioning-ci-*/
+```
+
+---
+
+**Next Step**: Choose your profile and run `provisioning setup profile --profile <developer|production|cicd>`
+
+**Need more help?** See [Setup Guide](setup.md) or [Troubleshooting](../troubleshooting/troubleshooting.md)
\ No newline at end of file
diff --git a/docs/src/getting-started/setup-quickstart.md b/docs/src/getting-started/setup-quickstart.md
index 9e1a88b..e1ee914 100644
--- a/docs/src/getting-started/setup-quickstart.md
+++ b/docs/src/getting-started/setup-quickstart.md
@@ -1 +1,178 @@
-# Setup Quick Start - 5 Minutes to Deployment\n\n**Goal**: Get provisioning running in 5 minutes with a working example\n\n## Step 1: Check Prerequisites (30 seconds)\n\n```\n# Check Nushell\nnu --version # Should be 0.109.0+\n\n# Check deployment tool\ndocker --version # OR\nkubectl version # OR\nssh -V # OR\nsystemctl --version\n```\n\n## Step 2: Install Provisioning (1 minute)\n\n```\n# Option A: Using installer script\ncurl -sSL https://install.provisioning.dev | bash\n\n# Option B: From source\ngit clone https://github.com/project-provisioning/provisioning\ncd provisioning\n./scripts/install.sh\n```\n\n## Step 3: Initialize System (2 minutes)\n\n```\n# Run interactive setup\nprovisioning setup system --interactive\n\n# Follow the prompts:\n# - Press Enter for defaults\n# - Select your deployment tool\n# - Enter provider credentials (if using cloud)\n```\n\n## Step 4: Create Your First Workspace (1 minute)\n\n```\n# Create workspace\nprovisioning setup workspace myapp\n\n# Verify it was created\nprovisioning workspace list\n```\n\n## Step 5: Deploy Your First Server (1 minute)\n\n```\n# Activate workspace\nprovisioning workspace activate myapp\n\n# Check configuration\nprovisioning setup validate\n\n# Deploy
server (dry-run first)\nprovisioning server create --check\n\n# Deploy for real\nprovisioning server create --yes\n```\n\n## Verify Everything Works\n\n```\n# Check health\nprovisioning platform health\n\n# Check servers\nprovisioning server list\n\n# SSH into server (if applicable)\nprovisioning server ssh \n```\n\n## Common Commands Cheat Sheet\n\n```\n# Workspace management\nprovisioning workspace list # List all workspaces\nprovisioning workspace activate prod # Switch workspace\nprovisioning workspace create dev # Create new workspace\n\n# Server management\nprovisioning server list # List servers\nprovisioning server create # Create server\nprovisioning server delete # Delete server\nprovisioning server ssh # SSH into server\n\n# Configuration\nprovisioning setup validate # Validate configuration\nprovisioning setup update platform # Update platform settings\n\n# System info\nprovisioning info # System information\nprovisioning capability check # Check capabilities\nprovisioning platform health # Check platform health\n```\n\n## Troubleshooting Quick Fixes\n\n**Setup wizard won't start**\n\n```\n# Check Nushell\nnu --version\n\n# Check permissions\nchmod +x $(which provisioning)\n```\n\n**Configuration error**\n\n```\n# Validate configuration\nprovisioning setup validate --verbose\n\n# Check paths\nprovisioning info paths\n```\n\n**Deployment fails**\n\n```\n# Dry-run to see what would happen\nprovisioning server create --check\n\n# Check platform status\nprovisioning platform status\n```\n\n## What's Next\n\nAfter basic setup:\n\n1. **Configure Provider**: Add cloud provider credentials\n2. **Create More Workspaces**: Dev, staging, production\n3. **Deploy Services**: Web servers, databases, etc.\n4. **Set Up Monitoring**: Health checks, logging\n5. 
**Automate Deployments**: CI/CD integration\n\n## Need Help\n\n```\n# Get help\nprovisioning help\n\n# Setup help\nprovisioning help setup\n\n# Specific command help\nprovisioning --help\n\n# View documentation\nprovisioning guide system-setup\n```\n\n## Key Files\n\nYour configuration is in:\n\n**macOS**: `~/Library/Application Support/provisioning/`\n**Linux**: `~/.config/provisioning/`\n\nImportant files:\n\n- `system.toml` - System configuration\n- `user_preferences.toml` - User settings\n- `workspaces/*/` - Workspace definitions\n\n---\n\n**Ready to dive deeper?** Check out the [Full Setup Guide](SETUP_SYSTEM_GUIDE.md) +# Setup Quick Start - 5 Minutes to Deployment + +**Goal**: Get provisioning running in 5 minutes with a working example + +## Step 1: Check Prerequisites (30 seconds) + +```text +# Check Nushell +nu --version # Should be 0.109.0+ + +# Check deployment tool +docker --version # OR +kubectl version # OR +ssh -V # OR +systemctl --version +``` + +## Step 2: Install Provisioning (1 minute) + +```text +# Option A: Using installer script +curl -sSL https://install.provisioning.dev | bash + +# Option B: From source +git clone https://github.com/project-provisioning/provisioning +cd provisioning +./scripts/install.sh +``` + +## Step 3: Initialize System (2 minutes) + +```text +# Run interactive setup +provisioning setup system --interactive + +# Follow the prompts: +# - Press Enter for defaults +# - Select your deployment tool +# - Enter provider credentials (if using cloud) +``` + +## Step 4: Create Your First Workspace (1 minute) + +```text +# Create workspace +provisioning setup workspace myapp + +# Verify it was created +provisioning workspace list +``` + +## Step 5: Deploy Your First Server (1 minute) + +```text +# Activate workspace +provisioning workspace activate myapp + +# Check configuration +provisioning setup validate + +# Deploy server (dry-run first) +provisioning server create --check + +# Deploy for real +provisioning server create --yes +``` + +## Verify Everything Works + +```text +# Check health +provisioning platform health + +# Check servers +provisioning server list + +# SSH into server (if applicable) +provisioning server ssh +``` + +## Common Commands Cheat Sheet + +```text +# Workspace management +provisioning workspace list # List all workspaces +provisioning workspace activate prod # Switch workspace +provisioning workspace create dev # Create new workspace + +# Server management +provisioning server list # List servers +provisioning server create # Create server +provisioning server delete # Delete server +provisioning server ssh # SSH into server + +# Configuration +provisioning setup validate # Validate configuration +provisioning setup update platform # Update platform settings + +# System info +provisioning info # System information +provisioning capability check # Check capabilities +provisioning platform health # Check platform health +``` + +## Troubleshooting Quick Fixes + +**Setup wizard won't start** + +```text +# Check Nushell +nu --version + +# Check permissions +chmod +x $(which provisioning) +``` + +**Configuration error** + +```text +# Validate configuration +provisioning setup validate --verbose + +# Check paths +provisioning info paths +``` + +**Deployment fails** + +```text +# Dry-run to see what would happen +provisioning server create --check + +# Check platform status +provisioning platform status +``` + +## What's Next + +After basic setup: + +1. **Configure Provider**: Add cloud provider credentials +2. 
**Create More Workspaces**: Dev, staging, production +3. **Deploy Services**: Web servers, databases, etc. +4. **Set Up Monitoring**: Health checks, logging +5. **Automate Deployments**: CI/CD integration + +## Need Help + +```text +# Get help +provisioning help + +# Setup help +provisioning help setup + +# Specific command help +provisioning --help + +# View documentation +provisioning guide system-setup +``` + +## Key Files + +Your configuration is in: + +**macOS**: `~/Library/Application Support/provisioning/` +**Linux**: `~/.config/provisioning/` + +Important files: + +- `system.toml` - System configuration +- `user_preferences.toml` - User settings +- `workspaces/*/` - Workspace definitions + +--- + +**Ready to dive deeper?** Check out the [Full Setup Guide](SETUP_SYSTEM_GUIDE.md) \ No newline at end of file diff --git a/docs/src/getting-started/setup-system-guide.md b/docs/src/getting-started/setup-system-guide.md index 3c51720..6d271f0 100644 --- a/docs/src/getting-started/setup-system-guide.md +++ b/docs/src/getting-started/setup-system-guide.md @@ -1 +1,206 @@ -# Provisioning Setup System Guide\n\n**Version**: 1.0.0\n**Last Updated**: 2025-12-09\n**Status**: Production Ready\n\n## Quick Start\n\n### Prerequisites\n\n- Nushell 0.109.0+\n- bash\n- One deployment tool: Docker, Kubernetes, SSH, or systemd\n- Optional: KCL, SOPS, Age\n\n### 30-Second Setup\n\n```\n# Install provisioning\ncurl -sSL https://install.provisioning.dev | bash\n\n# Run setup wizard\nprovisioning setup system --interactive\n\n# Create workspace\nprovisioning setup workspace myproject\n\n# Start deploying\nprovisioning server create\n```\n\n## Configuration Paths\n\n**macOS**: `~/Library/Application Support/provisioning/`\n**Linux**: `~/.config/provisioning/`\n**Windows**: `%APPDATA%/provisioning/`\n\n## Directory Structure\n\n```\nprovisioning/\n├── system.toml # System info (immutable)\n├── user_preferences.toml # User settings (editable)\n├── platform/ # Platform services\n├── providers/ # Provider configs\n└── workspaces/ # Workspace definitions\n └── myproject/\n ├── config/\n ├── infra/\n └── auth.token\n```\n\n## Setup Wizard\n\nRun the interactive setup wizard:\n\n```\nprovisioning setup system --interactive\n```\n\nThe wizard guides you through:\n\n1. Welcome & Prerequisites Check\n2. Operating System Detection\n3. Configuration Path Selection\n4. Platform Services Setup\n5. Provider Selection\n6. Security Configuration\n7. Review & Confirmation\n\n## Configuration Management\n\n### Hierarchy (highest to lowest priority)\n\n1. Runtime Arguments (`--flag value`)\n2. Environment Variables (`PROVISIONING_*`)\n3. Workspace Configuration\n4. Workspace Authentication Token\n5. User Preferences (`user_preferences.toml`)\n6. Platform Configurations (`platform/*.toml`)\n7. Provider Configurations (`providers/*.toml`)\n8. System Configuration (`system.toml`)\n9. 
Built-in Defaults\n\n### Configuration Files\n\n- `system.toml` - System information (OS, architecture, paths)\n- `user_preferences.toml` - User preferences (editor, format, etc.)\n- `platform/*.toml` - Service endpoints and configuration\n- `providers/*.toml` - Cloud provider settings\n\n## Multiple Workspaces\n\nCreate and manage multiple isolated environments:\n\n```\n# Create workspace\nprovisioning setup workspace dev\nprovisioning setup workspace prod\n\n# List workspaces\nprovisioning workspace list\n\n# Activate workspace\nprovisioning workspace activate prod\n```\n\n## Configuration Updates\n\nUpdate any setting:\n\n```\n# Update platform configuration\nprovisioning setup platform --config new-config.toml\n\n# Update provider settings\nprovisioning setup provider upcloud --config upcloud-config.toml\n\n# Validate changes\nprovisioning setup validate\n```\n\n## Backup & Restore\n\n```\n# Backup current configuration\nprovisioning setup backup --path ./backup.tar.gz\n\n# Restore from backup\nprovisioning setup restore --path ./backup.tar.gz\n\n# Migrate from old setup\nprovisioning setup migrate --from-existing\n```\n\n## Troubleshooting\n\n### "Command not found: provisioning"\n\n```\nexport PATH="/usr/local/bin:$PATH"\n```\n\n### "Nushell not found"\n\n```\ncurl -sSL https://raw.githubusercontent.com/nushell/nushell/main/install.sh | bash\n```\n\n### "Cannot write to directory"\n\n```\nchmod 755 ~/Library/Application\ Support/provisioning/\n```\n\n### Check required tools\n\n```\nprovisioning setup validate --check-tools\n```\n\n## FAQ\n\n**Q: Do I need all optional tools?**\nA: No. You need at least one deployment tool (Docker, Kubernetes, SSH, or systemd).\n\n**Q: Can I use provisioning without Docker?**\nA: Yes. Provisioning supports Docker, Kubernetes, SSH, systemd, or combinations.\n\n**Q: How do I update configuration?**\nA: `provisioning setup update `\n\n**Q: Can I have multiple workspaces?**\nA: Yes, unlimited workspaces.\n\n**Q: Is my configuration secure?**\nA: Yes. Credentials stored securely, never in config files.\n\n**Q: Can I share workspaces with my team?**\nA: Yes, via GitOps - configurations in Git, secrets in secure storage.\n\n## Getting Help\n\n```\n# General help\nprovisioning help\n\n# Setup help\nprovisioning help setup\n\n# Specific command help\nprovisioning setup system --help\n```\n\n## Next Steps\n\n1. [Installation Guide](installation-guide.md)\n2. [Workspace Setup](workspace-setup.md)\n3. [Provider Configuration](provider-setup.md)\n4. 
[From Scratch Guide](../guides/from-scratch.md)\n\n---\n\n**Status**: Production Ready ✅\n**Version**: 1.0.0\n**Last Updated**: 2025-12-09 +# Provisioning Setup System Guide + +**Version**: 1.0.0 +**Last Updated**: 2025-12-09 +**Status**: Production Ready + +## Quick Start + +### Prerequisites + +- Nushell 0.109.0+ +- bash +- One deployment tool: Docker, Kubernetes, SSH, or systemd +- Optional: KCL, SOPS, Age + +### 30-Second Setup + +```text +# Install provisioning +curl -sSL https://install.provisioning.dev | bash + +# Run setup wizard +provisioning setup system --interactive + +# Create workspace +provisioning setup workspace myproject + +# Start deploying +provisioning server create +``` + +## Configuration Paths + +**macOS**: `~/Library/Application Support/provisioning/` +**Linux**: `~/.config/provisioning/` +**Windows**: `%APPDATA%/provisioning/` + +## Directory Structure + +```text +provisioning/ +├── system.toml # System info (immutable) +├── user_preferences.toml # User settings (editable) +├── platform/ # Platform services +├── providers/ # Provider configs +└── workspaces/ # Workspace definitions + └── myproject/ + ├── config/ + ├── infra/ + └── auth.token +``` + +## Setup Wizard + +Run the interactive setup wizard: + +```text +provisioning setup system --interactive +``` + +The wizard guides you through: + +1. Welcome & Prerequisites Check +2. Operating System Detection +3. Configuration Path Selection +4. Platform Services Setup +5. Provider Selection +6. Security Configuration +7. Review & Confirmation + +## Configuration Management + +### Hierarchy (highest to lowest priority) + +1. Runtime Arguments (`--flag value`) +2. Environment Variables (`PROVISIONING_*`) +3. Workspace Configuration +4. Workspace Authentication Token +5. User Preferences (`user_preferences.toml`) +6. Platform Configurations (`platform/*.toml`) +7. Provider Configurations (`providers/*.toml`) +8. System Configuration (`system.toml`) +9. Built-in Defaults + +### Configuration Files + +- `system.toml` - System information (OS, architecture, paths) +- `user_preferences.toml` - User preferences (editor, format, etc.) 
+- `platform/*.toml` - Service endpoints and configuration +- `providers/*.toml` - Cloud provider settings + +## Multiple Workspaces + +Create and manage multiple isolated environments: + +```text +# Create workspace +provisioning setup workspace dev +provisioning setup workspace prod + +# List workspaces +provisioning workspace list + +# Activate workspace +provisioning workspace activate prod +``` + +## Configuration Updates + +Update any setting: + +```text +# Update platform configuration +provisioning setup platform --config new-config.toml + +# Update provider settings +provisioning setup provider upcloud --config upcloud-config.toml + +# Validate changes +provisioning setup validate +``` + +## Backup & Restore + +```text +# Backup current configuration +provisioning setup backup --path ./backup.tar.gz + +# Restore from backup +provisioning setup restore --path ./backup.tar.gz + +# Migrate from old setup +provisioning setup migrate --from-existing +``` + +## Troubleshooting + +### "Command not found: provisioning" + +```text +export PATH="/usr/local/bin:$PATH" +``` + +### "Nushell not found" + +```text +curl -sSL https://raw.githubusercontent.com/nushell/nushell/main/install.sh | bash +``` + +### "Cannot write to directory" + +```text +chmod 755 ~/Library/Application\ Support/provisioning/ +``` + +### Check required tools + +```text +provisioning setup validate --check-tools +``` + +## FAQ + +**Q: Do I need all optional tools?** +A: No. You need at least one deployment tool (Docker, Kubernetes, SSH, or systemd). + +**Q: Can I use provisioning without Docker?** +A: Yes. Provisioning supports Docker, Kubernetes, SSH, systemd, or combinations. + +**Q: How do I update configuration?** +A: `provisioning setup update ` + +**Q: Can I have multiple workspaces?** +A: Yes, unlimited workspaces. + +**Q: Is my configuration secure?** +A: Yes. Credentials stored securely, never in config files. + +**Q: Can I share workspaces with my team?** +A: Yes, via GitOps - configurations in Git, secrets in secure storage. + +## Getting Help + +```text +# General help +provisioning help + +# Setup help +provisioning help setup + +# Specific command help +provisioning setup system --help +``` + +## Next Steps + +1. [Installation Guide](installation-guide.md) +2. [Workspace Setup](workspace-setup.md) +3. [Provider Configuration](provider-setup.md) +4. [From Scratch Guide](../guides/from-scratch.md) + +--- + +**Status**: Production Ready ✅ +**Version**: 1.0.0 +**Last Updated**: 2025-12-09 \ No newline at end of file diff --git a/docs/src/getting-started/setup.md b/docs/src/getting-started/setup.md index afce053..21338da 100644 --- a/docs/src/getting-started/setup.md +++ b/docs/src/getting-started/setup.md @@ -1 +1,663 @@ -# Unified Setup Guide\n\n**Quick Answer**: Run `provisioning setup profile` and choose your profile.\n\n---\n\n## Overview\n\nThe provisioning system uses a **unified profile-based setup** that creates type-safe configurations in your platform-specific home directory. 
No\nmatter which profile you choose, all configurations are validated with Nickel before use.\n\n### Three Setup Profiles\n\n| | Profile | Duration | Use Case | Deployment | Security | |\n| | --------- | ---------- | ---------- | ----------- | ---------- | |\n| | **Developer** | <5 min | Local development, testing, learning | Docker Compose (local) | Minimal (local defaults) | |\n| | **Production** | ~12 min | Production-ready, HA, team deployments | Kubernetes or SSH | Full (MFA, audit, policies) | |\n| | **CI/CD** | <2 min | Automated pipelines, ephemeral setup | Docker Compose (temp) | CI secrets | |\n\nAll profiles use **Nickel-first architecture**: configuration source of truth is type-safe Nickel, validated before use.\n\n---\n\n## Quick Start (Choose Your Profile)\n\n### Developer Profile (Recommended for First Time)\n\n```\n# Run unified setup\nprovisioning setup profile --profile developer\n\n# What happens:\n# 1. Detects your OS and system capabilities\n# 2. Creates Nickel configs in platform-specific location:\n# • macOS: ~/Library/Application Support/provisioning/\n# • Linux: ~/.config/provisioning/\n# 3. Validates all configs with Nickel typecheck\n# 4. Starts platform services (orchestrator, control-center, KMS)\n# 5. Verifies health checks\n\n# Verify it worked\ncurl http://localhost:9090/health\ncurl http://localhost:3000/health\ncurl http://localhost:3001/health\n```\n\nExpected output:\n```\n╔═════════════════════════════════════════════════════╗\n║ PROVISIONING SETUP - DEVELOPER PROFILE ║\n╚═════════════════════════════════════════════════════╝\n\n✓ System detected: macOS (aarch64)\n✓ Docker available: Yes\n✓ Configuration location: ~/Library/Application Support/provisioning/\n✓ Config validation: PASSED (Nickel typecheck)\n✓ Services started: Orchestrator, Control Center, KMS\n✓ Health checks: All green\n\nSetup complete in ~4 minutes!\n```\n\n### Production Profile (HA, Security, Team Ready)\n\n```\n# Interactive setup for production\nprovisioning setup profile --profile production --interactive\n\n# What happens:\n# 1. Detects system: OS, CPU (≥4 required), memory (≥8GB recommended)\n# 2. Asks for deployment mode: Kubernetes (preferred) or SSH\n# 3. Asks for cloud provider: UpCloud, AWS, Hetzner, or local\n# 4. Asks for security settings: MFA, audit logging, Cedar policies\n# 5. Creates workspace infrastructure\n# 6. Creates Nickel configs with production overlays\n# 7. Validates all configs (Nickel typecheck)\n# 8. Optionally starts services\n\n# Setup with specific provider\nprovisioning setup profile --profile production --provider upcloud --interactive\n\n# Verify Nickel configs validated\nnickel typecheck ~/.config/provisioning/platform/deployment.ncl\n```\n\nExpected config structure:\n```\n~/.config/provisioning/\n├── system.ncl # System detection + capabilities\n├── user_preferences.ncl # User settings (MFA, audit, etc.)\n├── platform/\n│ ├── deployment.ncl # Deployment mode (kubernetes, ssh)\n│ └── services.ncl # Service endpoints and timeouts\n├── providers/\n│ ├── upcloud.ncl # UpCloud config (RustyVault refs)\n│ └── aws.ncl # AWS config (RustyVault refs)\n├── workspaces/\n│ └── infrastructure.ncl # Infrastructure definitions\n└── cedar-policies/\n └── default.cedar # Authorization policies\n```\n\n### CI/CD Profile (Automated, Ephemeral)\n\n```\n# Fully automated setup for pipelines\nexport PROVISIONING_PROVIDER=local\nexport PROVISIONING_WORKSPACE=ci-test-${CI_JOB_ID}\n\nprovisioning setup profile --profile cicd\n\n# What happens:\n# 1. 
No interaction (reads from env vars)\n# 2. Creates ephemeral configs in /tmp/provisioning-ci-${CI_JOB_ID}/\n# 3. Validates with Nickel typecheck\n# 4. Starts Docker Compose services\n# 5. Registers cleanup hook (auto-cleanup on exit)\n\n# In your CI pipeline:\n# Services run, tests execute, cleanup automatic\n```\n\n---\n\n## Configuration Locations (Platform-Aware)\n\n### Linux (XDG Base Directory)\n\n```\n# Primary location\n~/.config/provisioning/\n\n# Or with XDG_CONFIG_HOME override\n$XDG_CONFIG_HOME/provisioning/\n\n# Files created during setup\n~/.config/provisioning/\n├── system.ncl # Source of truth (Nickel)\n├── user_preferences.ncl # Source of truth (Nickel)\n├── platform/\n│ └── deployment.ncl # Source of truth (Nickel)\n└── generated/ # Optional: For services needing TOML\n └── deployment.toml # Auto-exported from deployment.ncl\n```\n\n### macOS (Application Support)\n\n```\n# Platform-specific location\n~/Library/Application Support/provisioning/\n\n# Same structure as Linux\n~/Library/Application Support/provisioning/\n├── system.ncl # Source of truth (Nickel)\n├── user_preferences.ncl # Source of truth (Nickel)\n├── platform/\n│ └── deployment.ncl # Source of truth (Nickel)\n└── generated/ # Optional\n └── deployment.toml\n```\n\n### Key Principle\n\n**Nickel is source of truth** - All `.ncl` files are authoritative, type-safe configurations validated by `nickel typecheck`. TOML files (if\ngenerated) are optional output only, never edited directly.\n\n---\n\n## What Happens During Setup\n\n### Step 1: System Detection\n\nProvisioning detects:\n- **OS**: macOS or Linux (Darwin detection)\n- **Architecture**: aarch64 or x86_64\n- **CPU Count**: Number of processors\n- **Memory**: Total system RAM in GB\n- **Disk Space**: Total available disk\n\n```\n# View detected system\nprovisioning setup detect --verbose\n```\n\n### Step 2: Profile Selection\n\nYou choose between:\n- **Developer**: Fast local setup, Docker Compose\n- **Production**: Full validation, Kubernetes/SSH, HA ready\n- **CI/CD**: Ephemeral, automated, no interaction\n\n### Step 3: Config Generation (Nickel-Based)\n\nSetup creates Nickel configs using composition:\n\n```\n# Example: system.ncl is composed from:\nlet helpers = import "../../schemas/platform/common/helpers.ncl"\nlet defaults = import "../../schemas/platform/defaults/system-defaults.ncl"\n\nhelpers.compose_config defaults {} {\n os_name = 'macos,\n cpu_count = 8,\n memory_total_gb = 16,\n setup_date = "2026-01-13T12:00:00Z"\n}\n| system_schema.SystemConfig # Type contract validation\n```\n\nResult: **Type-safe config**, guaranteed valid structure and values.\n\n### Step 4: Validation (Mandatory)\n\nAll configs are validated:\n\n```\n# Done automatically during setup\nnickel typecheck ~/.config/provisioning/system.ncl\nnickel typecheck ~/.config/provisioning/platform/deployment.ncl\n\n# You can verify anytime\nnickel typecheck ~/.config/provisioning/**/*.ncl\n```\n\n### Step 5: Service Bootstrap (Profile-Dependent)\n\n**Developer**: Starts Docker Compose services locally\n```\ndocker-compose up -d orchestrator control-center kms\n```\n\n**Production**: Outputs Kubernetes manifests (doesn't auto-start, you review first)\n```\ncat ~/.config/provisioning/platform/deployment.ncl\n# Review, then deploy to your cluster\nkubectl apply -f generated-from-deployment.ncl\n```\n\n**CI/CD**: Starts ephemeral Docker Compose in `/tmp`\n```\n# Automatic cleanup on job exit\ndocker-compose -f /tmp/provisioning-ci-${JOB_ID}/compose.yml up\n# Tests run, cleanup 
automatic on script exit\n```\n\n---\n\n## Profile Comparison Details\n\n### Developer Profile\n\n**Goal**: Working provisioning system in less than 5 minutes, minimal configuration\n\n**What gets created**:\n- System config (auto-detected, no prompts)\n- User preferences (recommended defaults)\n- Docker Compose deployment (local mode)\n- Local provider (no credentials needed)\n\n**Security**:\n- All configs validated (Nickel typecheck)\n- Services use secure defaults\n- No external API keys needed\n- Passwords auto-generated and stored locally\n\n**Time**: 3-4 minutes\n\n**Example**:\n```\nprovisioning setup profile --profile developer\n\n# Output:\n# ✓ Detected: macOS, aarch64, 8 CPU, 16GB RAM\n# ✓ Created: ~/.config/provisioning/system.ncl\n# ✓ Created: ~/.config/provisioning/platform/deployment.ncl\n# ✓ Validated: All configs passed typecheck\n# ✓ Started: orchestrator (port 9090)\n# ✓ Started: control-center (port 3000)\n# ✓ Started: kms (port 3001)\n# ✓ Ready in 3 minutes 45 seconds\n```\n\n### Production Profile\n\n**Goal**: HA-ready, validated, secure deployment with full control\n\n**What gets created**:\n- System config (auto-detected)\n- User preferences (security-focused: MFA enabled, audit on)\n- Kubernetes or SSH deployment (your choice)\n- Cloud provider config (UpCloud, AWS, Hetzner, or local)\n- Workspace infrastructure (full IaC definitions)\n- Cedar authorization policies (fine-grained RBAC)\n\n**Security**:\n- All configs validated (Nickel typecheck)\n- Requires system minimums: 4+ CPU, 8+ GB RAM\n- MFA enabled by default (can configure)\n- Audit logging enabled (captures all operations)\n- Cedar policies for authorization\n- Credentials stored encrypted (RustyVault)\n\n**Time**: 10-15 minutes (interactive, many questions)\n\n**Example**:\n```\nprovisioning setup profile --profile production --interactive\n\n# Prompts:\n# Profile: Production ✓\n# Deployment mode? (kubernetes/ssh): kubernetes\n# Cloud provider? (upcloud/aws/hetzner/local): upcloud\n# Workspace name? my-prod-infra\n# Enable MFA? (y/n): y\n# Enable audit logging? (y/n): y\n# Number of master nodes? (1-5): 3\n# Worker node count? 
(2-10): 5\n\n# Output (15 minutes later):\n# ✓ Created: ~/.config/provisioning/system.ncl\n# ✓ Created: ~/.config/provisioning/platform/deployment.ncl\n# ✓ Created: ~/.config/provisioning/providers/upcloud.ncl\n# ✓ Created: workspace-prod-infra/infrastructure.ncl\n# ✓ Created: cedar-policies/default.cedar\n# ✓ Validated: All configs passed typecheck\n# ✓ Services NOT started (you'll deploy to cluster)\n# ✓ Ready for Kubernetes deployment\n```\n\n### CI/CD Profile\n\n**Goal**: Minimal setup, no interaction, auto-cleanup for pipelines\n\n**What gets created**:\n- System config (minimal, CI environment)\n- Deployment config (ephemeral, auto-cleanup)\n- Docker Compose (no Kubernetes overhead)\n- Runs in /tmp (temporary directory)\n\n**Security**:\n- All configs validated (Nickel typecheck)\n- No persistent state (by design)\n- Uses CI environment variables for secrets\n- Auto-cleanup on job completion\n- No credentials stored locally\n\n**Time**: Less than 2 minutes\n\n**Example**:\n```\n# In GitHub Actions:\n- name: Setup Provisioning\n run: |\n export PROVISIONING_PROVIDER=local\n provisioning setup profile --profile cicd\n\n# Output:\n# ✓ Created: /tmp/provisioning-ci-abc123/\n# ✓ Validated: All configs passed typecheck\n# ✓ Started: Docker Compose services\n# ✓ Services ready for tests\n# Services will auto-cleanup on job exit\n```\n\n---\n\n## Verification\n\n### After Setup, Verify Everything Works\n\n**Developer Profile**:\n```\n# Check configs exist\nls -la ~/.config/provisioning/\nls -la ~/.config/provisioning/platform/\n\n# Verify Nickel validation\nnickel typecheck ~/.config/provisioning/system.ncl\nnickel typecheck ~/.config/provisioning/platform/deployment.ncl\n\n# Test services\ncurl http://localhost:9090/health\ncurl http://localhost:3000/health\ncurl http://localhost:3001/health\n\n# Expected: HTTP 200 with {"status": "healthy"}\n```\n\n**Production Profile**:\n```\n# Check Nickel configs\nnickel typecheck ~/.config/provisioning/system.ncl\nnickel typecheck ~/.config/provisioning/platform/deployment.ncl\nnickel typecheck ~/.config/provisioning/providers/upcloud.ncl\n\n# View deployment config\ncat ~/.config/provisioning/platform/deployment.ncl\n\n# View infrastructure definition\ncat workspace-my-prod-infra/infrastructure.ncl\n\n# View authorization policies\ncat ~/.config/provisioning/cedar-policies/default.cedar\n```\n\n**CI/CD Profile**:\n```\n# Check temp configs exist\nls -la /tmp/provisioning-ci-*/\n\n# Verify Nickel validation passed\nnickel typecheck /tmp/provisioning-ci-*/platform/deployment.ncl\n\n# Services should be running\ndocker ps | grep provisioning\n```\n\n---\n\n## Troubleshooting\n\n### Issue: "Nickel not found"\n\n**Cause**: Nickel binary not installed\n\n**Solution**:\n```\n# macOS\nbrew install nickel\n\n# Linux (Arch)\npacman -S nickel\n\n# From source\ngit clone https://github.com/nickel-lang/nickel\ncd nickel && cargo install --path .\n\n# Verify\nnickel --version # Should be 1.5.0+\n```\n\n### Issue: "Configuration validation failed"\n\n**Cause**: Nickel typecheck error in generated config\n\n**Solution**:\n```\n# See detailed error\nnickel typecheck ~/.config/provisioning/platform/deployment.ncl --color always\n\n# Common issues:\n# - Missing required field (check schema)\n# - Wrong type (string vs number)\n# - Enum value not in allowed list\n\n# Delete and retry setup\nrm -rf ~/.config/provisioning/\nprovisioning setup profile --profile developer --verbose\n```\n\n### Issue: "Docker not available" (Developer Profile)\n\n**Cause**: Docker not 
installed or not running\n\n**Solution**:\n```\n# Check Docker\ndocker --version\ndocker ps\n\n# macOS: Install Docker Desktop\nbrew install --cask docker\n\n# Linux: Install Docker\nsudo apt-get install docker.io # Ubuntu/Debian\nsudo pacman -S docker # Arch\n\n# Start Docker\nsudo systemctl start docker\n\n# Retry setup\nprovisioning setup profile --profile developer\n```\n\n### Issue: "Services won't start"\n\n**Cause**: Port already in use, Docker not running, or resource constraints\n\n**Solution**:\n```\n# Check what's using ports 9090, 3000, 3001\nlsof -i :9090\nlsof -i :3000\nlsof -i :3001\n\n# Stop conflicting service or wait for it to release port\n\n# Stop and restart provisioning services\ndocker-compose down\ndocker-compose up -d\n\n# Check Docker resources\ndocker stats\ndocker system prune # Free up space if needed\n```\n\n### Issue: "Permission denied" on config directory\n\n**Cause**: Directory created with wrong permissions\n\n**Solution**:\n```\n# Fix permissions (macOS)\nchmod 700 ~/Library/Application\ Support/provisioning/\n\n# Fix permissions (Linux)\nchmod 700 ~/.config/provisioning/\n\n# Fix nested directories\nchmod 700 ~/.config/provisioning/*\n\n# Retry setup\nprovisioning setup profile --profile developer\n```\n\n### Issue: "Wrong configuration being used"\n\n**Cause**: Services reading from old location or wrong environment variable\n\n**Solution**:\n```\n# Verify service sees new location\necho $PROVISIONING_CONFIG\n# Should be: ~/.config/provisioning/platform/deployment.ncl\n\n# Set explicitly if needed\nexport PROVISIONING_CONFIG=~/.config/provisioning/platform/deployment.ncl\nprovisioning service restart\n\n# Check what service is actually loading\nprovisioning service status --verbose\n```\n\n---\n\n## Using Workspace-Specific Overrides\n\nAfter initial setup, you can customize configs per workspace:\n\n```\n# Create workspace-specific override\nmkdir -p workspace-myproject/config\ncat > workspace-myproject/config/platform-overrides.ncl <<'EOF'\n{\n orchestrator.server.port = 9999,\n orchestrator.workspace.name = "myproject",\n vault-service.storage.path = "./workspace-myproject/data/vault"\n}\nEOF\n\n# Services will merge this with the base config\nprovisioning workspace activate myproject\nprovisioning platform deploy # Uses merged config\n```\n\n---\n\n## Next Steps\n\nAfter setup:\n\n1. **Create a Workspace**\n ```bash\n provisioning workspace create myapp\n ```\n\n2. **Deploy Your First Service**\n ```bash\n provisioning service deploy nginx\n ```\n\n3. **Configure Monitoring**\n ```bash\n provisioning monitor setup prometheus\n ```\n\n4. **Set Up CI/CD Integration**\n ```bash\n provisioning ci configure github\n ```\n\n5. **Learn Advanced Configuration**\n - See: [Setup Profiles Guide](setup-profiles.md)\n - See: [Platform Configuration](05-platform-configuration.md)\n - See: [Nickel Configuration](../configuration/nickel-configuration.md)\n\n---\n\n## Key Concepts\n\n### Type-Safe Configuration (Nickel)\n\nAll configs use Nickel type contracts:\n- Field names and types enforced\n- Enum values validated\n- Invalid configs caught at nickel typecheck time\n- No runtime surprises\n\n### Platform-Specific Paths\n\nConfigs stored in platform-standard locations:\n- **Linux**: `~/.config/provisioning/` (XDG Base Directory)\n- **macOS**: `~/Library/Application Support/provisioning/`\n- Respects `$XDG_CONFIG_HOME` override on Linux\n\n### Composition Pattern\n\nConfigs built from:\n1. **Base defaults** (provisioning/schemas/platform/defaults/)\n2. 
**Profile overlay** (developer/production/cicd specific)\n3. **User customization** (optional, via Nickel import)\n\nResult: Minimal, validated, reproducible config.\n\n### Ephemeral vs. Persistent\n\n- **Developer/Production**: Persistent in home directory\n- **CI/CD**: Ephemeral in /tmp, auto-cleanup\n\n---\n\n## Getting Help\n\n```\n# Help for setup\nprovisioning setup --help\n\n# Help for profiles\nprovisioning setup profile --help\n\n# Interactive debugging\nprovisioning setup profile --profile developer --verbose\n\n# Validate configuration\nprovisioning setup validate\n\n# View detected capabilities\nprovisioning setup detect --verbose\n\n# Check platform status\nprovisioning platform status\n\n# View logs\nprovisioning service logs orchestrator\nprovisioning service logs control-center\nprovisioning service logs kms\n```\n\n---\n\n**Ready?** Run: `provisioning setup profile` and choose your profile!\n\n**Questions?** Check [Troubleshooting](../troubleshooting/troubleshooting.md) or [Setup Profiles Guide](setup-profiles.md)
+# Unified Setup Guide
+
+**Quick Answer**: Run `provisioning setup profile` and choose your profile.
+
+---
+
+## Overview
+
+The provisioning system uses a **unified profile-based setup** that creates type-safe configurations in your platform-specific home directory. No matter which profile you choose, all configurations are validated with Nickel before use.
+
+### Three Setup Profiles
+
+| Profile | Duration | Use Case | Deployment | Security |
+| --------- | ---------- | ---------- | ----------- | ---------- |
+| **Developer** | <5 min | Local development, testing, learning | Docker Compose (local) | Minimal (local defaults) |
+| **Production** | ~12 min | Production-ready, HA, team deployments | Kubernetes or SSH | Full (MFA, audit, policies) |
+| **CI/CD** | <2 min | Automated pipelines, ephemeral setup | Docker Compose (temp) | CI secrets |
+
+All profiles use **Nickel-first architecture**: configuration source of truth is type-safe Nickel, validated before use.
+
+---
+
+## Quick Start (Choose Your Profile)
+
+### Developer Profile (Recommended for First Time)
+
+```text
+# Run unified setup
+provisioning setup profile --profile developer
+
+# What happens:
+# 1. Detects your OS and system capabilities
+# 2. Creates Nickel configs in platform-specific location:
+#    • macOS: ~/Library/Application Support/provisioning/
+#    • Linux: ~/.config/provisioning/
+# 3. Validates all configs with Nickel typecheck
+# 4. Starts platform services (orchestrator, control-center, KMS)
+# 5. Verifies health checks
+
+# Verify it worked
+curl http://localhost:9090/health
+curl http://localhost:3000/health
+curl http://localhost:3001/health
+```
+
+Expected output:
+```text
+╔═════════════════════════════════════════════════════╗
+║ PROVISIONING SETUP - DEVELOPER PROFILE ║
+╚═════════════════════════════════════════════════════╝
+
+✓ System detected: macOS (aarch64)
+✓ Docker available: Yes
+✓ Configuration location: ~/Library/Application Support/provisioning/
+✓ Config validation: PASSED (Nickel typecheck)
+✓ Services started: Orchestrator, Control Center, KMS
+✓ Health checks: All green
+
+Setup complete in ~4 minutes!
+```
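+
+The same three health probes can be scripted in one pass. A minimal sketch (ports as listed above; adjust if your profile maps them differently):
+
+```text
+# Probe all three platform services; -f makes curl fail on HTTP errors
+for port in 9090 3000 3001; do
+  curl -fsS "http://localhost:${port}/health" || echo "service on port ${port} is not healthy"
+done
+```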
+
+### Production Profile (HA, Security, Team Ready)
+
+```text
+# Interactive setup for production
+provisioning setup profile --profile production --interactive
+
+# What happens:
+# 1. Detects system: OS, CPU (≥4 required), memory (≥8GB recommended)
+# 2. Asks for deployment mode: Kubernetes (preferred) or SSH
+# 3. Asks for cloud provider: UpCloud, AWS, Hetzner, or local
+# 4. Asks for security settings: MFA, audit logging, Cedar policies
+# 5. Creates workspace infrastructure
+# 6. Creates Nickel configs with production overlays
+# 7. Validates all configs (Nickel typecheck)
+# 8. Optionally starts services
+
+# Setup with specific provider
+provisioning setup profile --profile production --provider upcloud --interactive
+
+# Verify Nickel configs validated
+nickel typecheck ~/.config/provisioning/platform/deployment.ncl
+```
+
+Expected config structure:
+```text
+~/.config/provisioning/
+├── system.ncl             # System detection + capabilities
+├── user_preferences.ncl   # User settings (MFA, audit, etc.)
+├── platform/
+│   ├── deployment.ncl     # Deployment mode (kubernetes, ssh)
+│   └── services.ncl       # Service endpoints and timeouts
+├── providers/
+│   ├── upcloud.ncl        # UpCloud config (RustyVault refs)
+│   └── aws.ncl            # AWS config (RustyVault refs)
+├── workspaces/
+│   └── infrastructure.ncl # Infrastructure definitions
+└── cedar-policies/
+    └── default.cedar      # Authorization policies
+```
+
+### CI/CD Profile (Automated, Ephemeral)
+
+```text
+# Fully automated setup for pipelines
+export PROVISIONING_PROVIDER=local
+export PROVISIONING_WORKSPACE=ci-test-${CI_JOB_ID}
+
+provisioning setup profile --profile cicd
+
+# What happens:
+# 1. No interaction (reads from env vars)
+# 2. Creates ephemeral configs in /tmp/provisioning-ci-${CI_JOB_ID}/
+# 3. Validates with Nickel typecheck
+# 4. Starts Docker Compose services
+# 5. Registers cleanup hook (auto-cleanup on exit)
+
+# In your CI pipeline:
+# Services run, tests execute, cleanup automatic
+```
+
+---
+
+## Configuration Locations (Platform-Aware)
+
+### Linux (XDG Base Directory)
+
+```text
+# Primary location
+~/.config/provisioning/
+
+# Or with XDG_CONFIG_HOME override
+$XDG_CONFIG_HOME/provisioning/
+
+# Files created during setup
+~/.config/provisioning/
+├── system.ncl             # Source of truth (Nickel)
+├── user_preferences.ncl   # Source of truth (Nickel)
+├── platform/
+│   └── deployment.ncl     # Source of truth (Nickel)
+└── generated/             # Optional: For services needing TOML
+    └── deployment.toml    # Auto-exported from deployment.ncl
+```
+
+### macOS (Application Support)
+
+```text
+# Platform-specific location
+~/Library/Application Support/provisioning/
+
+# Same structure as Linux
+~/Library/Application Support/provisioning/
+├── system.ncl             # Source of truth (Nickel)
+├── user_preferences.ncl   # Source of truth (Nickel)
+├── platform/
+│   └── deployment.ncl     # Source of truth (Nickel)
+└── generated/             # Optional
+    └── deployment.toml
+```
+
+### Key Principle
+
+**Nickel is source of truth** - All `.ncl` files are authoritative, type-safe configurations validated by `nickel typecheck`. TOML files (if generated) are optional output only, never edited directly.
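+
+If a service does need TOML, the exported copy can be regenerated from its Nickel source rather than edited by hand. A sketch, assuming a Nickel 1.x CLI with TOML export support (the `generated/` path mirrors the layout above):
+
+```text
+# Re-derive the optional TOML output from the Nickel source of truth
+nickel export ~/.config/provisioning/platform/deployment.ncl --format toml \
+  > ~/.config/provisioning/generated/deployment.toml
+```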
+
+---
+
+## What Happens During Setup
+
+### Step 1: System Detection
+
+Provisioning detects:
+- **OS**: macOS or Linux (Darwin detection)
+- **Architecture**: aarch64 or x86_64
+- **CPU Count**: Number of processors
+- **Memory**: Total system RAM in GB
+- **Disk Space**: Total available disk
+
+```text
+# View detected system
+provisioning setup detect --verbose
+```
+
+### Step 2: Profile Selection
+
+You choose between:
+- **Developer**: Fast local setup, Docker Compose
+- **Production**: Full validation, Kubernetes/SSH, HA ready
+- **CI/CD**: Ephemeral, automated, no interaction
+
+### Step 3: Config Generation (Nickel-Based)
+
+Setup creates Nickel configs using composition:
+
+```text
+# Example: system.ncl is composed from:
+let helpers = import "../../schemas/platform/common/helpers.ncl"
+let defaults = import "../../schemas/platform/defaults/system-defaults.ncl"
+
+helpers.compose_config defaults {} {
+  os_name = 'macos,
+  cpu_count = 8,
+  memory_total_gb = 16,
+  setup_date = "2026-01-13T12:00:00Z"
+}
+| system_schema.SystemConfig # Type contract validation
+```
+
+Result: **Type-safe config**, guaranteed valid structure and values.
+
+### Step 4: Validation (Mandatory)
+
+All configs are validated:
+
+```text
+# Done automatically during setup
+nickel typecheck ~/.config/provisioning/system.ncl
+nickel typecheck ~/.config/provisioning/platform/deployment.ncl
+
+# You can verify anytime
+nickel typecheck ~/.config/provisioning/**/*.ncl
+```
+
+### Step 5: Service Bootstrap (Profile-Dependent)
+
+**Developer**: Starts Docker Compose services locally
+```text
+docker-compose up -d orchestrator control-center kms
+```
+
+**Production**: Outputs Kubernetes manifests (doesn't auto-start, you review first)
+```text
+cat ~/.config/provisioning/platform/deployment.ncl
+# Review, then deploy to your cluster
+kubectl apply -f generated-from-deployment.ncl
+```
+
+**CI/CD**: Starts ephemeral Docker Compose in `/tmp`
+```text
+# Automatic cleanup on job exit
+docker-compose -f /tmp/provisioning-ci-${JOB_ID}/compose.yml up
+# Tests run, cleanup automatic on script exit
+```
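+
+For pipelines outside the managed setup, the same teardown guarantee can be approximated with a shell trap. This is a sketch only; the actual cleanup hook registered by setup may differ (`JOB_ID` and the compose path mirror the example above):
+
+```text
+# Ephemeral services with teardown on any exit path
+COMPOSE="/tmp/provisioning-ci-${JOB_ID}/compose.yml"
+trap 'docker-compose -f "$COMPOSE" down -v' EXIT
+docker-compose -f "$COMPOSE" up -d
+# ... run tests here; the trap tears services down on exit ...
+```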
+
+---
+
+## Profile Comparison Details
+
+### Developer Profile
+
+**Goal**: Working provisioning system in less than 5 minutes, minimal configuration
+
+**What gets created**:
+- System config (auto-detected, no prompts)
+- User preferences (recommended defaults)
+- Docker Compose deployment (local mode)
+- Local provider (no credentials needed)
+
+**Security**:
+- All configs validated (Nickel typecheck)
+- Services use secure defaults
+- No external API keys needed
+- Passwords auto-generated and stored locally
+
+**Time**: 3-4 minutes
+
+**Example**:
+```text
+provisioning setup profile --profile developer
+
+# Output:
+# ✓ Detected: macOS, aarch64, 8 CPU, 16GB RAM
+# ✓ Created: ~/.config/provisioning/system.ncl
+# ✓ Created: ~/.config/provisioning/platform/deployment.ncl
+# ✓ Validated: All configs passed typecheck
+# ✓ Started: orchestrator (port 9090)
+# ✓ Started: control-center (port 3000)
+# ✓ Started: kms (port 3001)
+# ✓ Ready in 3 minutes 45 seconds
+```
+
+### Production Profile
+
+**Goal**: HA-ready, validated, secure deployment with full control
+
+**What gets created**:
+- System config (auto-detected)
+- User preferences (security-focused: MFA enabled, audit on)
+- Kubernetes or SSH deployment (your choice)
+- Cloud provider config (UpCloud, AWS, Hetzner, or local)
+- Workspace infrastructure (full IaC definitions)
+- Cedar authorization policies (fine-grained RBAC)
+
+**Security**:
+- All configs validated (Nickel typecheck)
+- Requires system minimums: 4+ CPU, 8+ GB RAM
+- MFA enabled by default (can configure)
+- Audit logging enabled (captures all operations)
+- Cedar policies for authorization
+- Credentials stored encrypted (RustyVault)
+
+**Time**: 10-15 minutes (interactive, many questions)
+
+**Example**:
+```text
+provisioning setup profile --profile production --interactive
+
+# Prompts:
+# Profile: Production ✓
+# Deployment mode? (kubernetes/ssh): kubernetes
+# Cloud provider? (upcloud/aws/hetzner/local): upcloud
+# Workspace name? my-prod-infra
+# Enable MFA? (y/n): y
+# Enable audit logging? (y/n): y
+# Number of master nodes? (1-5): 3
+# Worker node count? (2-10): 5
+
+# Output (15 minutes later):
+# ✓ Created: ~/.config/provisioning/system.ncl
+# ✓ Created: ~/.config/provisioning/platform/deployment.ncl
+# ✓ Created: ~/.config/provisioning/providers/upcloud.ncl
+# ✓ Created: workspace-my-prod-infra/infrastructure.ncl
+# ✓ Created: cedar-policies/default.cedar
+# ✓ Validated: All configs passed typecheck
+# ✓ Services NOT started (you'll deploy to cluster)
+# ✓ Ready for Kubernetes deployment
+```
+
+### CI/CD Profile
+
+**Goal**: Minimal setup, no interaction, auto-cleanup for pipelines
+
+**What gets created**:
+- System config (minimal, CI environment)
+- Deployment config (ephemeral, auto-cleanup)
+- Docker Compose (no Kubernetes overhead)
+- Runs in /tmp (temporary directory)
+
+**Security**:
+- All configs validated (Nickel typecheck)
+- No persistent state (by design)
+- Uses CI environment variables for secrets
+- Auto-cleanup on job completion
+- No credentials stored locally
+
+**Time**: Less than 2 minutes
+
+**Example**:
+```text
+# In GitHub Actions:
+- name: Setup Provisioning
+  run: |
+    export PROVISIONING_PROVIDER=local
+    provisioning setup profile --profile cicd
+
+# Output:
+# ✓ Created: /tmp/provisioning-ci-abc123/
+# ✓ Validated: All configs passed typecheck
+# ✓ Started: Docker Compose services
+# ✓ Services ready for tests
+# Services will auto-cleanup on job exit
+```
+
+---
+
+## Verification
+
+### After Setup, Verify Everything Works
+
+**Developer Profile**:
+```text
+# Check configs exist
+ls -la ~/.config/provisioning/
+ls -la ~/.config/provisioning/platform/
+
+# Verify Nickel validation
+nickel typecheck ~/.config/provisioning/system.ncl
+nickel typecheck ~/.config/provisioning/platform/deployment.ncl
+
+# Test services
+curl http://localhost:9090/health
+curl http://localhost:3000/health
+curl http://localhost:3001/health
+
+# Expected: HTTP 200 with {"status": "healthy"}
+```
+
+**Production Profile**:
+```text
+# Check Nickel configs
+nickel typecheck ~/.config/provisioning/system.ncl
+nickel typecheck ~/.config/provisioning/platform/deployment.ncl
+nickel typecheck ~/.config/provisioning/providers/upcloud.ncl
+
+# View deployment config
+cat ~/.config/provisioning/platform/deployment.ncl
+
+# View infrastructure definition
+cat workspace-my-prod-infra/infrastructure.ncl
+
+# View authorization policies
+cat ~/.config/provisioning/cedar-policies/default.cedar
+```
+
+**CI/CD Profile**:
+```text
+# Check temp configs exist
+ls -la /tmp/provisioning-ci-*/
+
+# Verify Nickel validation passed
+nickel typecheck /tmp/provisioning-ci-*/platform/deployment.ncl
+
+# Services should be running
+docker ps | grep provisioning
+```
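+
+To sweep every Nickel file in one command, whatever the profile, a `find`-based loop avoids relying on shell globstar support (a sketch):
+
+```text
+# Typecheck every .ncl file under the config root
+find ~/.config/provisioning -name '*.ncl' -exec nickel typecheck {} \;
+```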
+
+---
+
+## Troubleshooting
+
+### Issue: "Nickel not found"
+
+**Cause**: Nickel binary not installed
+
+**Solution**:
+```text
+# macOS
+brew install nickel
+
+# Linux (Arch)
+pacman -S nickel
+
+# From source
+git clone https://github.com/nickel-lang/nickel
+cd nickel && cargo install --path .
+
+# Verify
+nickel --version # Should be 1.5.0+
+```
+
+### Issue: "Configuration validation failed"
+
+**Cause**: Nickel typecheck error in generated config
+
+**Solution**:
+```text
+# See detailed error
+nickel typecheck ~/.config/provisioning/platform/deployment.ncl --color always
+
+# Common issues:
+# - Missing required field (check schema)
+# - Wrong type (string vs number)
+# - Enum value not in allowed list
+
+# Delete and retry setup
+rm -rf ~/.config/provisioning/
+provisioning setup profile --profile developer --verbose
+```
+
+### Issue: "Docker not available" (Developer Profile)
+
+**Cause**: Docker not installed or not running
+
+**Solution**:
+```text
+# Check Docker
+docker --version
+docker ps
+
+# macOS: Install Docker Desktop
+brew install --cask docker
+
+# Linux: Install Docker
+sudo apt-get install docker.io # Ubuntu/Debian
+sudo pacman -S docker # Arch
+
+# Start Docker
+sudo systemctl start docker
+
+# Retry setup
+provisioning setup profile --profile developer
+```
+
+### Issue: "Services won't start"
+
+**Cause**: Port already in use, Docker not running, or resource constraints
+
+**Solution**:
+```text
+# Check what's using ports 9090, 3000, 3001
+lsof -i :9090
+lsof -i :3000
+lsof -i :3001
+
+# Stop conflicting service or wait for it to release port
+
+# Stop and restart provisioning services
+docker-compose down
+docker-compose up -d
+
+# Check Docker resources
+docker stats
+docker system prune # Free up space if needed
+```
+
+### Issue: "Permission denied" on config directory
+
+**Cause**: Directory created with wrong permissions
+
+**Solution**:
+```text
+# Fix permissions (macOS)
+chmod 700 ~/Library/Application\ Support/provisioning/
+
+# Fix permissions (Linux)
+chmod 700 ~/.config/provisioning/
+
+# Fix nested directories
+chmod 700 ~/.config/provisioning/*
+
+# Retry setup
+provisioning setup profile --profile developer
+```
+
+### Issue: "Wrong configuration being used"
+
+**Cause**: Services reading from old location or wrong environment variable
+
+**Solution**:
+```text
+# Verify service sees new location
+echo $PROVISIONING_CONFIG
+# Should be: ~/.config/provisioning/platform/deployment.ncl
+
+# Set explicitly if needed
+export PROVISIONING_CONFIG=~/.config/provisioning/platform/deployment.ncl
+provisioning service restart
+
+# Check what service is actually loading
+provisioning service status --verbose
+```
+
+---
+
+## Using Workspace-Specific Overrides
+
+After initial setup, you can customize configs per workspace:
+
+```text
+# Create workspace-specific override
+mkdir -p workspace-myproject/config
+cat > workspace-myproject/config/platform-overrides.ncl <<'EOF'
+{
+  orchestrator.server.port = 9999,
+  orchestrator.workspace.name = "myproject",
+  vault-service.storage.path = "./workspace-myproject/data/vault"
+}
+EOF
+
+# Services will merge this with the base config
+provisioning workspace activate myproject
+provisioning platform deploy # Uses merged config
+```
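+
+To preview the merged result before deploying, the override can be combined with the base config using Nickel's record merge operator. A sketch with illustrative paths; the platform itself may apply different merge precedence:
+
+```text
+# Build a throwaway file that merges base config and override, then render it
+cat > /tmp/merged-preview.ncl <<'EOF'
+(import "/home/me/.config/provisioning/platform/deployment.ncl")
+& (import "/home/me/workspace-myproject/config/platform-overrides.ncl")
+EOF
+nickel export /tmp/merged-preview.ncl
+```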
+
+---
+
+## Next Steps
+
+After setup:
+
+1. **Create a Workspace**
+   ```bash
+   provisioning workspace create myapp
+   ```
+
+2. **Deploy Your First Service**
+   ```bash
+   provisioning service deploy nginx
+   ```
+
+3. **Configure Monitoring**
+   ```bash
+   provisioning monitor setup prometheus
+   ```
+
+4. **Set Up CI/CD Integration**
+   ```bash
+   provisioning ci configure github
+   ```
+
+5. **Learn Advanced Configuration**
+   - See: [Setup Profiles Guide](setup-profiles.md)
+   - See: [Platform Configuration](05-platform-configuration.md)
+   - See: [Nickel Configuration](../configuration/nickel-configuration.md)
+
+---
+
+## Key Concepts
+
+### Type-Safe Configuration (Nickel)
+
+All configs use Nickel type contracts:
+- Field names and types enforced
+- Enum values validated
+- Invalid configs caught at nickel typecheck time
+- No runtime surprises
+
+### Platform-Specific Paths
+
+Configs stored in platform-standard locations:
+- **Linux**: `~/.config/provisioning/` (XDG Base Directory)
+- **macOS**: `~/Library/Application Support/provisioning/`
+- Respects `$XDG_CONFIG_HOME` override on Linux
+
+### Composition Pattern
+
+Configs built from:
+1. **Base defaults** (provisioning/schemas/platform/defaults/)
+2. **Profile overlay** (developer/production/cicd specific)
+3. **User customization** (optional, via Nickel import)
+
+Result: Minimal, validated, reproducible config.
+
+### Ephemeral vs. Persistent
+
+- **Developer/Production**: Persistent in home directory
+- **CI/CD**: Ephemeral in /tmp, auto-cleanup
+
+---
+
+## Getting Help
+
+```text
+# Help for setup
+provisioning setup --help
+
+# Help for profiles
+provisioning setup profile --help
+
+# Interactive debugging
+provisioning setup profile --profile developer --verbose
+
+# Validate configuration
+provisioning setup validate
+
+# View detected capabilities
+provisioning setup detect --verbose
+
+# Check platform status
+provisioning platform status
+
+# View logs
+provisioning service logs orchestrator
+provisioning service logs control-center
+provisioning service logs kms
+```
+
+---
+
+**Ready?** Run: `provisioning setup profile` and choose your profile!
+
+**Questions?** Check [Troubleshooting](../troubleshooting/troubleshooting.md) or [Setup Profiles Guide](setup-profiles.md)
\ No newline at end of file
diff --git a/docs/src/guides/README.md b/docs/src/guides/README.md
index 0fea5c4..50cca39 100644
--- a/docs/src/guides/README.md
+++ b/docs/src/guides/README.md
@@ -1 +1,18 @@
-# How-To Guides\n\nStep-by-step guides for common tasks with the Provisioning Platform.\n\n## Available Guides\n\n- [From Scratch](from-scratch.md) - Complete deployment from zero to production\n- [Update Infrastructure](update-infrastructure.md) - Safe update procedures\n- [Customize Infrastructure](customize-infrastructure.md) - Layer and template customization\n- [Quickstart Cheatsheet](../getting-started/quickstart-cheatsheet.md) - Command shortcuts and quick reference\n\n## Quick Start\n\nFor the fastest path to a working deployment:\n\n1. Run `provisioning sc` for quick command reference\n2. Follow [From Scratch](from-scratch.md) guide\n3. Use [Quickstart Cheatsheet](../getting-started/quickstart-cheatsheet.md) for daily operations
+# How-To Guides
+
+Step-by-step guides for common tasks with the Provisioning Platform.
+
+## Available Guides
+
+- [From Scratch](from-scratch.md) - Complete deployment from zero to production
+- [Update Infrastructure](update-infrastructure.md) - Safe update procedures
+- [Customize Infrastructure](customize-infrastructure.md) - Layer and template customization
+- [Quickstart Cheatsheet](../getting-started/quickstart-cheatsheet.md) - Command shortcuts and quick reference
+
+## Quick Start
+
+For the fastest path to a working deployment:
+
+1. Run `provisioning sc` for quick command reference
+2. Follow [From Scratch](from-scratch.md) guide
+3. 
Use [Quickstart Cheatsheet](../getting-started/quickstart-cheatsheet.md) for daily operations diff --git a/docs/src/guides/customize-infrastructure.md b/docs/src/guides/customize-infrastructure.md index c21de82..f7d3b92 100644 --- a/docs/src/guides/customize-infrastructure.md +++ b/docs/src/guides/customize-infrastructure.md @@ -1 +1,846 @@ -# Customize Infrastructure\n\n**Goal**: Customize infrastructure using layers, templates, and configuration patterns\n**Time**: 20-40 minutes\n**Difficulty**: Intermediate to Advanced\n\n## Overview\n\nThis guide covers:\n\n1. Understanding the layer system\n2. Using templates\n3. Creating custom modules\n4. Configuration inheritance\n5. Advanced customization patterns\n\n## The Layer System\n\n### Understanding Layers\n\nThe provisioning system uses a **3-layer architecture** for configuration inheritance:\n\n```\n┌─────────────────────────────────────┐\n│ Infrastructure Layer (Priority 300)│ ← Highest priority\n│ workspace/infra/{name}/ │\n│ • Project-specific configs │\n│ • Environment customizations │\n│ • Local overrides │\n└─────────────────────────────────────┘\n ↓ overrides\n┌─────────────────────────────────────┐\n│ Workspace Layer (Priority 200) │\n│ provisioning/workspace/templates/ │\n│ • Reusable patterns │\n│ • Organization standards │\n│ • Team conventions │\n└─────────────────────────────────────┘\n ↓ overrides\n┌─────────────────────────────────────┐\n│ Core Layer (Priority 100) │ ← Lowest priority\n│ provisioning/extensions/ │\n│ • System defaults │\n│ • Provider implementations │\n│ • Default taskserv configs │\n└─────────────────────────────────────┘\n```\n\n**Resolution Order**: Infrastructure (300) → Workspace (200) → Core (100)\n\nHigher numbers override lower numbers.\n\n### View Layer Resolution\n\n```\n# Explain layer concept\nprovisioning lyr explain\n```\n\n**Expected Output:**\n\n```\n📚 LAYER SYSTEM EXPLAINED\n\nThe layer system provides configuration inheritance across 3 levels:\n\n🔵 CORE LAYER (100) - System Defaults\n Location: provisioning/extensions/\n • Base taskserv configurations\n • Default provider settings\n • Standard cluster templates\n • Built-in extensions\n\n🟢 WORKSPACE LAYER (200) - Shared Templates\n Location: provisioning/workspace/templates/\n • Organization-wide patterns\n • Reusable configurations\n • Team standards\n • Custom extensions\n\n🔴 INFRASTRUCTURE LAYER (300) - Project Specific\n Location: workspace/infra/{project}/\n • Project-specific overrides\n • Environment customizations\n • Local modifications\n • Runtime settings\n\nResolution: Infrastructure → Workspace → Core\nHigher priority layers override lower ones.\n```\n\n```\n# Show layer resolution for your project\nprovisioning lyr show my-production\n```\n\n**Expected Output:**\n\n```\n📊 Layer Resolution for my-production:\n\nLAYER PRIORITY SOURCE FILES\nInfrastructure 300 workspace/infra/my-production/ 4 files\n • servers.ncl (overrides)\n • taskservs.ncl (overrides)\n • clusters.ncl (custom)\n • providers.ncl (overrides)\n\nWorkspace 200 provisioning/workspace/templates/ 2 files\n • production.ncl (used)\n • kubernetes.ncl (used)\n\nCore 100 provisioning/extensions/ 15 files\n • taskservs/* (base configs)\n • providers/* (default settings)\n • clusters/* (templates)\n\nResolution Order: Infrastructure → Workspace → Core\nStatus: ✅ All layers resolved successfully\n```\n\n### Test Layer Resolution\n\n```\n# Test how a specific module resolves\nprovisioning lyr test kubernetes my-production\n```\n\n**Expected Output:**\n\n```\n🔍 Layer 
Resolution Test: kubernetes → my-production\n\nResolving kubernetes configuration...\n\n🔴 Infrastructure Layer (300):\n ✅ Found: workspace/infra/my-production/taskservs/kubernetes.ncl\n Provides:\n • version = "1.30.0" (overrides)\n • control_plane_servers = ["web-01"] (overrides)\n • worker_servers = ["web-02"] (overrides)\n\n🟢 Workspace Layer (200):\n ✅ Found: provisioning/workspace/templates/production-kubernetes.ncl\n Provides:\n • security_policies (inherited)\n • network_policies (inherited)\n • resource_quotas (inherited)\n\n🔵 Core Layer (100):\n ✅ Found: provisioning/extensions/taskservs/kubernetes/main.ncl\n Provides:\n • default_version = "1.29.0" (base)\n • default_features (base)\n • default_plugins (base)\n\nFinal Configuration (after merging all layers):\n version: "1.30.0" (from Infrastructure)\n control_plane_servers: ["web-01"] (from Infrastructure)\n worker_servers: ["web-02"] (from Infrastructure)\n security_policies: {...} (from Workspace)\n network_policies: {...} (from Workspace)\n resource_quotas: {...} (from Workspace)\n default_features: {...} (from Core)\n default_plugins: {...} (from Core)\n\nResolution: ✅ Success\n```\n\n## Using Templates\n\n### List Available Templates\n\n```\n# List all templates\nprovisioning tpl list\n```\n\n**Expected Output:**\n\n```\n📋 Available Templates:\n\nTASKSERVS:\n • production-kubernetes - Production-ready Kubernetes setup\n • production-postgres - Production PostgreSQL with replication\n • production-redis - Redis cluster with sentinel\n • development-kubernetes - Development Kubernetes (minimal)\n • ci-cd-pipeline - Complete CI/CD pipeline\n\nPROVIDERS:\n • upcloud-production - UpCloud production settings\n • upcloud-development - UpCloud development settings\n • aws-production - AWS production VPC setup\n • aws-development - AWS development environment\n • local-docker - Local Docker-based setup\n\nCLUSTERS:\n • buildkit-cluster - BuildKit for container builds\n • monitoring-stack - Prometheus + Grafana + Loki\n • security-stack - Security monitoring tools\n\nTotal: 13 templates\n```\n\n```\n# List templates by type\nprovisioning tpl list --type taskservs\nprovisioning tpl list --type providers\nprovisioning tpl list --type clusters\n```\n\n### View Template Details\n\n```\n# Show template details\nprovisioning tpl show production-kubernetes\n```\n\n**Expected Output:**\n\n```\n📄 Template: production-kubernetes\n\nDescription: Production-ready Kubernetes configuration with\n security hardening, network policies, and monitoring\n\nCategory: taskservs\nVersion: 1.0.0\n\nConfiguration Provided:\n • Kubernetes version: 1.30.0\n • Security policies: Pod Security Standards (restricted)\n • Network policies: Default deny + allow rules\n • Resource quotas: Per-namespace limits\n • Monitoring: Prometheus integration\n • Logging: Loki integration\n • Backup: Velero configuration\n\nRequirements:\n • Minimum 2 servers\n • 4 GB RAM per server\n • Network plugin (Cilium recommended)\n\nLocation: provisioning/workspace/templates/production-kubernetes.ncl\n\nExample Usage:\n provisioning tpl apply production-kubernetes my-production\n```\n\n### Apply Template\n\n```\n# Apply template to your infrastructure\nprovisioning tpl apply production-kubernetes my-production\n```\n\n**Expected Output:**\n\n```\n🚀 Applying template: production-kubernetes → my-production\n\nChecking compatibility... ⏳\n✅ Infrastructure compatible with template\n\nMerging configuration... 
⏳\n✅ Configuration merged\n\nFiles created/updated:\n • workspace/infra/my-production/taskservs/kubernetes.ncl (updated)\n • workspace/infra/my-production/policies/security.ncl (created)\n • workspace/infra/my-production/policies/network.ncl (created)\n • workspace/infra/my-production/monitoring/prometheus.ncl (created)\n\n🎉 Template applied successfully!\n\nNext steps:\n 1. Review generated configuration\n 2. Adjust as needed\n 3. Deploy: provisioning t create kubernetes --infra my-production\n```\n\n### Validate Template Usage\n\n```\n# Validate template was applied correctly\nprovisioning tpl validate my-production\n```\n\n**Expected Output:**\n\n```\n✅ Template Validation: my-production\n\nTemplates Applied:\n ✅ production-kubernetes (v1.0.0)\n ✅ production-postgres (v1.0.0)\n\nConfiguration Status:\n ✅ All required fields present\n ✅ No conflicting settings\n ✅ Dependencies satisfied\n\nCompliance:\n ✅ Security policies configured\n ✅ Network policies configured\n ✅ Resource quotas set\n ✅ Monitoring enabled\n\nStatus: ✅ Valid\n```\n\n## Creating Custom Templates\n\n### Step 1: Create Template Structure\n\n```\n# Create custom template directory\nmkdir -p provisioning/workspace/templates/my-custom-template\n```\n\n### Step 2: Write Template Configuration\n\n**File: `provisioning/workspace/templates/my-custom-template/main.ncl`**\n\n```\n# Custom Kubernetes template with specific settings\nlet kubernetes_config = {\n # Version\n version = "1.30.0",\n\n # Custom feature gates\n feature_gates = {\n "GracefulNodeShutdown" = true,\n "SeccompDefault" = true,\n "StatefulSetAutoDeletePVC" = true,\n },\n\n # Custom kubelet configuration\n kubelet_config = {\n max_pods = 110,\n pod_pids_limit = 4096,\n container_log_max_size = "10Mi",\n container_log_max_files = 5,\n },\n\n # Custom API server flags\n apiserver_extra_args = {\n "enable-admission-plugins" = "NodeRestriction,PodSecurity,LimitRanger",\n "audit-log-maxage" = "30",\n "audit-log-maxbackup" = "10",\n },\n\n # Custom scheduler configuration\n scheduler_config = {\n profiles = [\n {\n name = "high-availability",\n plugins = {\n score = {\n enabled = [\n {name = "NodeResourcesBalancedAllocation", weight = 2},\n {name = "NodeResourcesLeastAllocated", weight = 1},\n ],\n },\n },\n },\n ],\n },\n\n # Network configuration\n network = {\n service_cidr = "10.96.0.0/12",\n pod_cidr = "10.244.0.0/16",\n dns_domain = "cluster.local",\n },\n\n # Security configuration\n security = {\n pod_security_standard = "restricted",\n encrypt_etcd = true,\n rotate_certificates = true,\n },\n} in\nkubernetes_config\n```\n\n### Step 3: Create Template Metadata\n\n**File: `provisioning/workspace/templates/my-custom-template/metadata.toml`**\n\n```\n[template]\nname = "my-custom-template"\nversion = "1.0.0"\ndescription = "Custom Kubernetes template with enhanced security"\ncategory = "taskservs"\nauthor = "Your Name"\n\n[requirements]\nmin_servers = 2\nmin_memory_gb = 4\nrequired_taskservs = ["containerd", "cilium"]\n\n[tags]\nenvironment = ["production", "staging"]\nfeatures = ["security", "monitoring", "high-availability"]\n```\n\n### Step 4: Test Custom Template\n\n```\n# List templates (should include your custom template)\nprovisioning tpl list\n\n# Show your template\nprovisioning tpl show my-custom-template\n\n# Apply to test infrastructure\nprovisioning tpl apply my-custom-template my-test\n```\n\n## Configuration Inheritance Examples\n\n### Example 1: Override Single Value\n\n**Core Layer** 
(`provisioning/extensions/taskservs/postgres/main.ncl`):\n\n```\nlet postgres_config = {\n version = "15.5",\n port = 5432,\n max_connections = 100,\n} in\npostgres_config\n```\n\n**Infrastructure Layer** (`workspace/infra/my-production/taskservs/postgres.ncl`):\n\n```\nlet postgres_config = {\n max_connections = 500, # Override only max_connections\n} in\npostgres_config\n```\n\n**Result** (after layer resolution):\n\n```\nlet postgres_config = {\n version = "15.5", # From Core\n port = 5432, # From Core\n max_connections = 500, # From Infrastructure (overridden)\n} in\npostgres_config\n```\n\n### Example 2: Add Custom Configuration\n\n**Workspace Layer** (`provisioning/workspace/templates/production-postgres.ncl`):\n\n```\nlet postgres_config = {\n replication = {\n enabled = true,\n replicas = 2,\n sync_mode = "async",\n },\n} in\npostgres_config\n```\n\n**Infrastructure Layer** (`workspace/infra/my-production/taskservs/postgres.ncl`):\n\n```\nlet postgres_config = {\n replication = {\n sync_mode = "sync", # Override sync mode\n },\n custom_extensions = ["pgvector", "timescaledb"], # Add custom config\n} in\npostgres_config\n```\n\n**Result**:\n\n```\nlet postgres_config = {\n version = "15.5", # From Core\n port = 5432, # From Core\n max_connections = 100, # From Core\n replication = {\n enabled = true, # From Workspace\n replicas = 2, # From Workspace\n sync_mode = "sync", # From Infrastructure (overridden)\n },\n custom_extensions = ["pgvector", "timescaledb"], # From Infrastructure (added)\n} in\npostgres_config\n```\n\n### Example 3: Environment-Specific Configuration\n\n**Workspace Layer** (`provisioning/workspace/templates/base-kubernetes.ncl`):\n\n```\nlet kubernetes_config = {\n version = "1.30.0",\n control_plane_count = 3,\n worker_count = 5,\n resources = {\n control_plane = {cpu = "4", memory = "8Gi"},\n worker = {cpu = "8", memory = "16Gi"},\n },\n} in\nkubernetes_config\n```\n\n**Development Infrastructure** (`workspace/infra/my-dev/taskservs/kubernetes.ncl`):\n\n```\nlet kubernetes_config = {\n control_plane_count = 1, # Smaller for dev\n worker_count = 2,\n resources = {\n control_plane = {cpu = "2", memory = "4Gi"},\n worker = {cpu = "2", memory = "4Gi"},\n },\n} in\nkubernetes_config\n```\n\n**Production Infrastructure** (`workspace/infra/my-prod/taskservs/kubernetes.ncl`):\n\n```\nlet kubernetes_config = {\n control_plane_count = 5, # Larger for prod\n worker_count = 10,\n resources = {\n control_plane = {cpu = "8", memory = "16Gi"},\n worker = {cpu = "16", memory = "32Gi"},\n },\n} in\nkubernetes_config\n```\n\n## Advanced Customization Patterns\n\n### Pattern 1: Multi-Environment Setup\n\nCreate different configurations for each environment:\n\n```\n# Create environments\nprovisioning ws init my-app-dev\nprovisioning ws init my-app-staging\nprovisioning ws init my-app-prod\n\n# Apply environment-specific templates\nprovisioning tpl apply development-kubernetes my-app-dev\nprovisioning tpl apply staging-kubernetes my-app-staging\nprovisioning tpl apply production-kubernetes my-app-prod\n\n# Customize each environment\n# Edit: workspace/infra/my-app-dev/...\n# Edit: workspace/infra/my-app-staging/...\n# Edit: workspace/infra/my-app-prod/...\n```\n\n### Pattern 2: Shared Configuration Library\n\nCreate reusable configuration fragments:\n\n**File: `provisioning/workspace/templates/shared/security-policies.ncl`**\n\n```\nlet security_policies = {\n pod_security = {\n enforce = "restricted",\n audit = "restricted",\n warn = "restricted",\n },\n network_policies = [\n {\n 
name = "deny-all",\n pod_selector = {},\n policy_types = ["Ingress", "Egress"],\n },\n {\n name = "allow-dns",\n pod_selector = {},\n egress = [\n {\n to = [{namespace_selector = {name = "kube-system"}}],\n ports = [{protocol = "UDP", port = 53}],\n },\n ],\n },\n ],\n} in\nsecurity_policies\n```\n\nImport in your infrastructure:\n\n```\nlet security_policies = (import "../../../provisioning/workspace/templates/shared/security-policies.ncl") in\n\nlet kubernetes_config = {\n version = "1.30.0",\n image_repo = "k8s.gcr.io",\n security = security_policies, # Import shared policies\n} in\nkubernetes_config\n```\n\n### Pattern 3: Dynamic Configuration\n\nUse Nickel features for dynamic configuration:\n\n```\n# Calculate resources based on server count\nlet server_count = 5 in\nlet replicas_per_server = 2 in\nlet total_replicas = server_count * replicas_per_server in\n\nlet postgres_config = {\n version = "16.1",\n max_connections = total_replicas * 50, # Dynamic calculation\n shared_buffers = "1024 MB",\n} in\npostgres_config\n```\n\n### Pattern 4: Conditional Configuration\n\n```\nlet environment = "production" in # or "development"\n\nlet kubernetes_config = {\n version = "1.30.0",\n control_plane_count = if environment == "production" then 3 else 1,\n worker_count = if environment == "production" then 5 else 2,\n monitoring = {\n enabled = environment == "production",\n retention = if environment == "production" then "30d" else "7d",\n },\n} in\nkubernetes_config\n```\n\n## Layer Statistics\n\n```\n# Show layer system statistics\nprovisioning lyr stats\n```\n\n**Expected Output:**\n\n```\n📊 Layer System Statistics:\n\nInfrastructure Layer:\n • Projects: 3\n • Total files: 15\n • Average overrides per project: 5\n\nWorkspace Layer:\n • Templates: 13\n • Most used: production-kubernetes (5 projects)\n • Custom templates: 2\n\nCore Layer:\n • Taskservs: 15\n • Providers: 3\n • Clusters: 3\n\nResolution Performance:\n • Average resolution time: 45 ms\n • Cache hit rate: 87%\n • Total resolutions: 1,250\n```\n\n## Customization Workflow\n\n### Complete Customization Example\n\n```\n# 1. Create new infrastructure\nprovisioning ws init my-custom-app\n\n# 2. Understand layer system\nprovisioning lyr explain\n\n# 3. Discover templates\nprovisioning tpl list --type taskservs\n\n# 4. Apply base template\nprovisioning tpl apply production-kubernetes my-custom-app\n\n# 5. View applied configuration\nprovisioning lyr show my-custom-app\n\n# 6. Customize (edit files)\nprovisioning sops workspace/infra/my-custom-app/taskservs/kubernetes.ncl\n\n# 7. Test layer resolution\nprovisioning lyr test kubernetes my-custom-app\n\n# 8. Validate configuration\nprovisioning tpl validate my-custom-app\nprovisioning val config --infra my-custom-app\n\n# 9. Deploy customized infrastructure\nprovisioning s create --infra my-custom-app --check\nprovisioning s create --infra my-custom-app\nprovisioning t create kubernetes --infra my-custom-app\n```\n\n## Best Practices\n\n### 1. Use Layers Correctly\n\n- **Core Layer**: Only modify for system-wide changes\n- **Workspace Layer**: Use for organization-wide templates\n- **Infrastructure Layer**: Use for project-specific customizations\n\n### 2. 
Template Organization\n\n```\nprovisioning/workspace/templates/\n├── shared/ # Shared configuration fragments\n│ ├── security-policies.ncl\n│ ├── network-policies.ncl\n│ └── monitoring.ncl\n├── production/ # Production templates\n│ ├── kubernetes.ncl\n│ ├── postgres.ncl\n│ └── redis.ncl\n└── development/ # Development templates\n ├── kubernetes.ncl\n └── postgres.ncl\n```\n\n### 3. Documentation\n\nDocument your customizations:\n\n**File: `workspace/infra/my-production/README.md`**\n\n```\n# My Production Infrastructure\n\n## Customizations\n\n- Kubernetes: Using production template with 5 control plane nodes\n- PostgreSQL: Configured with streaming replication\n- Cilium: Native routing mode enabled\n\n## Layer Overrides\n\n- `taskservs/kubernetes.ncl`: Control plane count (3 → 5)\n- `taskservs/postgres.ncl`: Replication mode (async → sync)\n- `network/cilium.ncl`: Routing mode (tunnel → native)\n```\n\n### 4. Version Control\n\nKeep templates and configurations in version control:\n\n```\ncd provisioning/workspace/templates/\ngit add .\ngit commit -m "Add production Kubernetes template with enhanced security"\n\ncd workspace/infra/my-production/\ngit add .\ngit commit -m "Configure production environment for my-production"\n```\n\n## Troubleshooting Customizations\n\n### Issue: Configuration not applied\n\n```\n# Check layer resolution\nprovisioning lyr show my-production\n\n# Verify file exists\nls -la workspace/infra/my-production/taskservs/\n\n# Test specific resolution\nprovisioning lyr test kubernetes my-production\n```\n\n### Issue: Conflicting configurations\n\n```\n# Validate configuration\nprovisioning val config --infra my-production\n\n# Show configuration merge result\nprovisioning show config kubernetes --infra my-production\n```\n\n### Issue: Template not found\n\n```\n# List available templates\nprovisioning tpl list\n\n# Check template path\nls -la provisioning/workspace/templates/\n\n# Refresh template cache\nprovisioning tpl refresh\n```\n\n## Next Steps\n\n- **[From Scratch Guide](from-scratch.md)** - Deploy new infrastructure\n- **[Update Guide](update-infrastructure.md)** - Update existing infrastructure\n- **[Workflow Guide](../development/workflow.md)** - Automate with workflows\n- **[Nickel Guide](../development/nickel-module-guide.md)** - Learn Nickel configuration language\n\n## Quick Reference\n\n```\n# Layer system\nprovisioning lyr explain # Explain layers\nprovisioning lyr show # Show layer resolution\nprovisioning lyr test # Test resolution\nprovisioning lyr stats # Layer statistics\n\n# Templates\nprovisioning tpl list # List all templates\nprovisioning tpl list --type # Filter by type\nprovisioning tpl show