provisioning/docs/src/architecture/design-principles.md

369 lines
8.3 KiB
Markdown
Raw Normal View History

2026-01-14 04:53:21 +00:00
# Design Principles
2026-01-17 03:58:28 +00:00
Core principles guiding Provisioning architecture and development.
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
## 1. Workspace-First Design
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Principle**: Workspaces are the default organizational unit for ALL infrastructure work.
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Why**:
- Explicit project isolation
- Prevent accidental cross-project modifications
- Independent credential management
- Clear configuration boundaries
- Team collaboration enablement
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Application**:
- Every workspace has independent state
- Workspace switching is atomic
- Configuration per workspace
- Extensions inherited from platform
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Code Example**:
2026-01-14 04:53:58 +00:00
```bash
2026-01-17 03:58:28 +00:00
# Workspace-enforced workflow
provisioning workspace init my-project
provisioning workspace switch my-project
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
# This command requires active workspace
provisioning server create --name web-01
2026-01-14 04:53:21 +00:00
```
2026-01-17 03:58:28 +00:00
**Impact**: All commands validate active workspace before execution.
---
## 2. Type-Safety Mandatory
**Principle**: ALL configurations MUST be type-safe. Validation is NEVER optional.
**Why**:
- Catch errors at configuration time
- Prevent runtime failures
- Enable IDE support (LSP)
- Enforce consistency
- Reduce deployment risk
**Application**:
- **Nickel is source of truth** (NOT TOML)
- Type contracts on ALL schemas
- Gradual typing not allowed
- Validation in ALL profiles (dev, prod, cicd)
- Static analysis before deployment
**Code Example**:
```nickel
# Type-safe infrastructure definition
{
name : String = "server-01"
plan : | [ 'small, 'medium, 'large | ] = 'medium
zone : String = "de-fra1"
backup_enabled : Bool = false
} | ServerContract
```
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Impact**: Type errors caught before infrastructure changes.
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
---
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
## 3. Configuration-Driven, Never Hardcoded
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Principle**: Configuration is the source of truth. Hardcoded values are forbidden.
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Why**:
- Enable environment-specific behavior
- Support multiple deployment modes
- Allow runtime reconfiguration
- Audit configuration changes
- Team collaboration
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Application**:
- 5-layer configuration hierarchy
- 476+ configuration accessors
- Variable interpolation
- Environment-specific overrides
- Schema validation
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Code Example**:
2026-01-14 04:53:58 +00:00
```bash
2026-01-17 03:58:28 +00:00
# Configuration drives behavior
provisioning server create --plan $(config.server.default_plan)
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
# Environment-specific configs
PROVISIONING_ENV=prod provisioning server create
```
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Forbidden**:
```nushell
# ❌ WRONG - Hardcoded values
let server_plan = "medium"
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
# ✅ RIGHT - Configuration-driven
let server_plan = (config.server.plan)
```
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Impact**: Single codebase supports all environments.
---
## 4. Multi-Cloud Abstraction
**Principle**: Provider-agnostic interfaces enable multi-cloud deployments.
**Why**:
- Avoid vendor lock-in
- Reuse infrastructure code
- Support multiple cloud strategies
- Easy provider switching
**Application**:
- Unified provider interface
- Abstract resource definitions
- Provider-specific implementation
- Automatic provider selection
**Code Example**:
```nickel
# Provider-agnostic configuration
{
servers = [
{
name = "web-01"
plan = "medium" # Abstract plan size
provider = "upcloud" # Swappable provider
}
]
}
```
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Impact**: Same Nickel schema deploys to UpCloud, AWS, or Hetzner.
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
---
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
## 5. Modular, Extensible Architecture
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Principle**: Components are loosely coupled, independently deployable.
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Why**:
- Easy to add features
- Support custom extensions
- Avoid monolithic growth
- Enable community contributions
- Flexible deployment options
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Application**:
- 54 core Nushell libraries
- 111+ CLI commands in 7 domains
- 50+ task services
- 5 cloud providers
- 9 cluster templates
- Pluggable provider interface
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Impact**: Add features without modifying core system.
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
---
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
## 6. Hybrid Rust + Nushell
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Principle**: Rust for performance-critical components, Nushell for orchestration.
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Why**:
- **Rust**: Type safety, zero-cost abstractions, performance
- **Nushell**: Structured data, productivity, easy automation
- **Hybrid**: Best of both worlds
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Application**:
- Core CLI: Bash wrapper → Nushell dispatcher
- Orchestrator: Rust scheduler + Nushell task execution
- Libraries: Nushell for business logic
- Performance: Rust plugins for 10-50x speedup
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Impact**: Fast, type-safe, productive infrastructure automation.
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
---
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
## 7. State Management via Graph Database
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Principle**: Infrastructure relationships tracked via SurrealDB graph.
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Why**:
- Model complex infrastructure relationships
- Query relationships efficiently
- Track dependencies
- Support rollback via state history
- Audit trail
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Application**:
- SurrealDB for relationship queries
- File-based persistence for queue
- Event-driven state updates
- Checkpoint-based recovery
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Example Relationships**:
```text
Server → Network (connected to)
Server → Storage (mounts)
Cluster → Service (runs)
Workflow → Dependency (depends on)
2026-01-14 04:53:21 +00:00
```
2026-01-17 03:58:28 +00:00
**Impact**: Complex infrastructure relationships handled gracefully.
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
---
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
## 8. Security-First Design
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Principle**: Security is built-in, not bolted-on.
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Why**:
- Enterprise compliance
- Data protection
- Access control
- Audit trails
- Threat detection
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Application**:
- 4-layer security model (auth, authz, encryption, audit)
- JWT authentication
- Cedar policy enforcement
- AES-256-GCM encryption
- 7-year audit retention
- MFA support (TOTP, WebAuthn)
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Impact**: Enterprise-grade security by default.
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
---
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
## 9. Progressive Disclosure
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Principle**: Simple for common cases, powerful for advanced use cases.
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Why**:
- Low barrier to entry
- Professional productivity
- Advanced features available
- Avoid overwhelming users
- Gradual learning curve
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Application**:
- **Simple**: Interactive TUI installer
- **Productive**: CLI with 80+ shortcuts
- **Powerful**: Batch workflows, policies
- **Advanced**: Custom extensions, hooks
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Impact**: All skill levels supported.
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
---
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
## 10. Fail-Fast, Recover Gracefully
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Principle**: Detect issues early, provide recovery mechanisms.
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Why**:
- Prevent invalid deployments
- Enable safe recovery
- Minimize blast radius
- Audit failures for learning
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Application**:
- Validation before execution
- Checkpoint-based recovery
- Automatic rollback on failure
- Detailed error messages
- Retry with exponential backoff
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Code Example**:
2026-01-14 04:53:58 +00:00
```bash
2026-01-17 03:58:28 +00:00
# Validate before deployment
provisioning validate config --strict
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
# Dry-run to check impact
provisioning --check server create
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
# Safe rollback on failure
provisioning workflow rollback --to-checkpoint
```
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Impact**: Safe infrastructure changes with confidence.
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
---
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
## 11. Observable & Auditable
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Principle**: All operations traceable, all changes auditable.
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Why**:
- Compliance & regulation
- Troubleshooting
- Security investigation
- Team accountability
- Historical analysis
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Application**:
- Comprehensive audit logging
- 5 export formats (JSON, YAML, CSV, syslog, CloudWatch)
- Structured log entries
- Operation tracing
- Resource change tracking
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Impact**: Complete visibility into infrastructure changes.
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
---
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
## 12. No Shortcuts on Reliability
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Principle**: Reliability features are standard, not optional.
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Why**:
- Production requirements
- Minimize downtime
- Data protection
- Business continuity
- Trust & confidence
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Application**:
- Checkpoint recovery
- Automatic rollback
- Health monitoring
- Backup & restore
- Multi-node deployment
- Service redundancy
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
**Impact**: Enterprise-grade reliability standard.
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
---
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
## Architectural Decision Records (ADRs)
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
Key decisions documenting rationale:
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
| ADR | Decision | Rationale |
| --- | ---------| - --- |
| ADR-011 | Nickel Migration | Type-safety over KCL flexibility |
| ADR-010 | Config Strategy | 5-layer hierarchy over flat config |
| ADR-009 | SurrealDB | Graph relationships over relational |
| ADR-008 | Modular CLI | 80+ shortcuts over verbose commands |
| ADR-007 | Workspace-First | Isolation over global state |
| ADR-006 | Hybrid Architecture | Rust + Nushell for best of both |
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
---
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
## Design Trade-offs
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
| Decision | Gain | Cost |
| --- | -----| - --- |
| **Type-Safety** | Fewer errors | Learning curve |
| **Config Hierarchy** | Flexibility | Complexity |
| **Workspace Isolation** | Safety | Duplication |
| **Modular CLI** | Discoverability | No single command |
| **SurrealDB** | Relationships | Resource overhead |
| **Validation Strict** | Safety | Fast iteration friction |
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
---
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
## Related Documentation
2026-01-14 04:53:21 +00:00
2026-01-17 03:58:28 +00:00
- [System Overview](system-overview.md)
- [Component Architecture](component-architecture.md)
- [Integration Patterns](integration-patterns.md)