2026-01-22 22:15:19 +00:00
# Ops/DevOps Portfolio: Modern Infrastructure End-to-End
## The Problem
DevOps and platform teams face critical challenges managing modern infrastructure:
- **Fragmented tools**: Terraform for IaC, Ansible for configuration, Vault for secrets, all disconnected
- **Untyped YAML**: Configuration errors that explode at runtime, not at compile time
- **Static cryptography**: No preparation for future quantum threats
- **Manual orchestration**: Fragile imperative scripts without rollback or recovery
- **Hidden costs**: No visibility into LLM spending for infrastructure generation
- **Complex multi-cloud**: Different APIs, configurations and tools per provider
## The Solution: An Integrated Ecosystem
Five projects designed to work together, covering the complete operations cycle.
---
## Provisioning: Declarative Infrastructure as Code
### Typed IaC with AI-Assisted Generation
Provisioning combines the precision of typed configuration (Nickel) with AI-assisted generation, eliminating fragile YAML and imperative scripts.
**Unique capabilities**:
- **Nickel IaC**: Typed configuration with lazy evaluation, pre-runtime validation
- **MCP Server**: Natural language queries about infrastructure
- **Integrated RAG**: 1,200+ domain documents for contextual responses
- **Multi-cloud**: AWS, UpCloud, local (LXD) from the same definition
**Hybrid orchestration**:
- Rust orchestrator for critical workflows (10-50x performance vs Python)
- Nushell scripts for flexibility and rapid prototyping
- Automatic dependency resolution (topological sorting)
- Checkpoints and automatic rollback on failures
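To make the dependency-resolution bullet concrete, here is a minimal sketch of topological sorting (Kahn's algorithm) over the four task services used in this document. It is not the orchestrator's actual code: `k8s_deps`, `install_order`, and the dependency table itself are hypothetical names and data chosen for illustration.

```rust
use std::collections::{BTreeMap, VecDeque};

/// Hypothetical dependency table: service -> services it requires.
fn k8s_deps() -> BTreeMap<&'static str, Vec<&'static str>> {
    BTreeMap::from([
        ("containerd", vec![]),
        ("etcd", vec![]),
        ("kubernetes", vec!["containerd", "etcd"]),
        ("cilium", vec!["kubernetes"]),
    ])
}

/// Kahn's algorithm: returns an order in which every service comes after
/// its dependencies, or `None` if the graph contains a cycle.
fn install_order(
    deps: &BTreeMap<&'static str, Vec<&'static str>>,
) -> Option<Vec<&'static str>> {
    // Number of not-yet-installed dependencies per service.
    let mut unmet: BTreeMap<&'static str, usize> = BTreeMap::new();
    // Reverse edges: dependency -> services waiting on it.
    let mut dependents: BTreeMap<&'static str, Vec<&'static str>> = BTreeMap::new();
    for (svc, reqs) in deps {
        *unmet.entry(*svc).or_insert(0) += reqs.len();
        for req in reqs {
            unmet.entry(*req).or_insert(0);
            dependents.entry(*req).or_default().push(*svc);
        }
    }

    // Start with services that have no dependencies at all.
    let mut ready: VecDeque<&'static str> = unmet
        .iter()
        .filter(|(_, n)| **n == 0)
        .map(|(svc, _)| *svc)
        .collect();

    let mut order = Vec::new();
    while let Some(svc) = ready.pop_front() {
        order.push(svc);
        for child in dependents.get(svc).into_iter().flatten() {
            let n = unmet.get_mut(child).expect("every service is registered");
            *n -= 1;
            if *n == 0 {
                ready.push_back(*child);
            }
        }
    }

    // If some service never became ready, there is a dependency cycle.
    if order.len() == unmet.len() { Some(order) } else { None }
}
```

Calling `install_order(&k8s_deps())` yields containerd, etcd, kubernetes, cilium, which matches the deployment order shown in the workflow. A cyclic table returns `None`, so the problem can be reported before anything is deployed.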
**The workflow**:
```text
"I need a K8s cluster on AWS with 3 nodes and Cilium"
  → MCP Server (NLP)
  → RAG searches similar configurations
  → Generates Nickel + validates types
  → Orchestrator deploys:
      1. containerd (dependency)
      2. etcd (dependency)
      3. kubernetes (core)
      4. cilium (CNI)
    With checkpoints and automatic rollback
```
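The "checkpoints and automatic rollback" step can be pictured as follows. This is a simplified sketch under stated assumptions, not the Provisioning orchestrator's implementation: `Step`, its `succeeds` flag (a stand-in for real work), and `deploy` are hypothetical, and the rollback policy shown (undo completed steps in reverse order) is one common choice.

```rust
/// One deployment step. `succeeds` stands in for the real installation work.
struct Step {
    name: &'static str,
    succeeds: bool,
}

/// Applies steps in order, recording a checkpoint after each success.
/// On the first failure, rolls back completed steps in reverse order.
/// Every action is appended to `log` so the sequence can be inspected.
fn deploy(steps: &[Step], log: &mut Vec<String>) -> Result<(), String> {
    let mut checkpoints: Vec<&Step> = Vec::new();
    for step in steps {
        if step.succeeds {
            log.push(format!("apply {}", step.name));
            checkpoints.push(step);
        } else {
            log.push(format!("fail {}", step.name));
            // Undo in reverse order so dependents are removed first.
            for done in checkpoints.iter().rev() {
                log.push(format!("rollback {}", done.name));
            }
            return Err(format!("step '{}' failed", step.name));
        }
    }
    Ok(())
}
```

If containerd and etcd succeed but kubernetes fails, the log reads: apply containerd, apply etcd, fail kubernetes, rollback etcd, rollback containerd. Cilium is never attempted.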
**Enterprise security**:
- JWT + MFA (TOTP + WebAuthn)
- Cedar policy engine for RBAC/ABAC
- 7-year audit log retention
- 5 KMS backends (RustyVault, Age, AWS KMS, Vault, Cosmian)
- SOPS/Age for configuration encryption at rest
**For whom**:
- DevOps teams wanting typed IaC, not fragile YAML
- Multi-cloud organizations (AWS + UpCloud + on-premise)
- Teams needing audit, compliance and enterprise security
**Expected results**:
- Configuration errors detected at compile time, not at runtime
- Infrastructure generated from natural language (MCP + RAG)
- Automatic rollback on failures with state management
---
## SecretumVault: Secrets Management with Post-Quantum Crypto
### Rust Vault with PQC in Production
SecretumVault is a secrets management system that implements **production-ready post-quantum cryptography** (ML-KEM-768, ML-DSA-65), giving organizations cryptographic agility they can deploy today.
**Crypto-agnostic**:
- **OpenSSL**: RSA, ECDSA, AES-256-GCM (classical compatibility)
- **OQS (Post-Quantum)**: ML-KEM-768, ML-DSA-65 (NIST FIPS 203/204)
- **AWS-LC**: Experimental PQC (testing)
- **RustCrypto**: Pure-Rust implementations (testing)
- **Pluggable backends**: Change algorithms without modifying code
**Secrets engines**:
| Engine | Capability | Use cases |
| -------- | ------------ | ----------- |
| **KV** | Versioned secret storage | Credentials, API keys, sensitive configurations |
| **Transit** | Encryption-as-a-service with key rotation | Application data encryption, key rotation |
| **PKI** | X.509 certificate generation | mTLS, service mesh, internal infrastructure |
| **Database** | Dynamic credentials with TTL | PostgreSQL, MySQL, MongoDB credentials on-demand |
**Multi-backend storage**:
- **Filesystem**: Development, single-node, rapid prototyping
- **etcd**: Kubernetes, high availability, strong consistency
- **SurrealDB**: Complex queries, time-series, multi-tenant scopes
- **PostgreSQL**: Enterprise, ACID, complete auditing
**Enterprise security**:
- Shamir Secret Sharing for unsealing (configurable threshold)
- Cedar policy engine (ABAC, AWS-compatible)
- Native TLS/mTLS with X.509 certificates
- Complete audit logging with configurable retention
- Token management with TTL and renewal
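For readers unfamiliar with Shamir Secret Sharing, the sketch below shows the underlying idea: the secret is the constant term of a polynomial over a prime field, each share is one point on that polynomial, and any `threshold` points recover the secret by Lagrange interpolation. It is an illustration only, not SecretumVault's code. The field prime, function names, and the fixed coefficients are assumptions; a real implementation draws coefficients from a cryptographically secure random source.

```rust
/// Field prime for the illustration: 2^61 - 1 (a Mersenne prime).
const P: u128 = 2_305_843_009_213_693_951;

/// Modular exponentiation by squaring.
fn mod_pow(mut base: u128, mut exp: u128) -> u128 {
    let mut acc: u128 = 1;
    base %= P;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % P;
        }
        base = base * base % P;
        exp >>= 1;
    }
    acc
}

/// Modular inverse via Fermat's little theorem (valid because P is prime).
fn mod_inv(x: u128) -> u128 {
    mod_pow(x, P - 2)
}

/// Evaluates f(x) = secret + c1*x + c2*x^2 + ... at x = 1..=shares.
/// The threshold is `coeffs.len() + 1`. Inputs are assumed to be below P.
fn split(secret: u128, coeffs: &[u128], shares: u128) -> Vec<(u128, u128)> {
    (1..=shares)
        .map(|x| {
            // Horner's rule over the non-constant coefficients.
            let high = coeffs.iter().rev().fold(0u128, |acc, c| (acc * x + *c) % P);
            (x, (high * x + secret) % P)
        })
        .collect()
}

/// Recovers f(0) by Lagrange interpolation over the given shares.
fn combine(shares: &[(u128, u128)]) -> u128 {
    let mut secret: u128 = 0;
    for (i, &(xi, yi)) in shares.iter().enumerate() {
        let mut num: u128 = 1;
        let mut den: u128 = 1;
        for (j, &(xj, _)) in shares.iter().enumerate() {
            if i != j {
                num = num * (P - xj) % P; // (0 - xj) mod P
                den = den * ((xi + P - xj) % P) % P; // (xi - xj) mod P
            }
        }
        secret = (secret + yi * num % P * mod_inv(den)) % P;
    }
    secret
}
```

With two coefficients the threshold is 3, mirroring the "5 shares, threshold 3" setup used in this document: `combine` on any three of the five shares from `split` returns the original secret, while two shares interpolate a different value.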
**Ops/DevOps workflow**:
```bash
# Initialize vault with Shamir (5 shares, threshold 3)
svault operator init --shares 5 --threshold 3
# Unseal with 3 shares
svault operator unseal --share <share-1>
svault operator unseal --share <share-2>
svault operator unseal --share <share-3>
# Enable Database engine for PostgreSQL
svault secret engine enable database
svault secret database config postgres-prod \
  plugin_name=postgresql-database-plugin \
  connection_url="postgresql://{{username}}:{{password}}@postgres:5432/mydb" \
  username="vault" password="vaultpass"
# Create role for dynamic credentials
# (PostgreSQL role names are identifiers, so they take double quotes, not single quotes)
svault secret database role create myapp-role \
  db_name=postgres-prod \
  creation_statements="CREATE USER \"{{name}}\" WITH PASSWORD '{{password}}' VALID UNTIL '{{expiration}}'; GRANT SELECT ON ALL TABLES IN SCHEMA public TO \"{{name}}\";" \
  default_ttl=1h max_ttl=24h
# Get dynamic credentials (generated on-demand)
svault secret read database/creds/myapp-role
# Key Value
# --- -----
# lease_id database/creds/myapp-role/abc123
# lease_duration 3600
# username v-myapp-role-xyz789
# password A1b2C3d4E5f6G7h8
# Credentials are automatically revoked after 1h TTL
```
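The lease lifecycle behind dynamic credentials can be sketched as below. This is a hedged illustration rather than SecretumVault's implementation: the `Lease` struct, `effective_ttl`, and `revoke_expired` are hypothetical, timestamps are plain seconds for simplicity, and capping a requested TTL at `max_ttl` is an assumption about how `default_ttl` and `max_ttl` interact.

```rust
/// A credential lease. Times are seconds since an arbitrary epoch.
#[derive(Debug, Clone, PartialEq)]
struct Lease {
    username: String,
    issued_at: u64,
    ttl_secs: u64,
}

/// Uses the requested TTL if given, otherwise the role default,
/// and never exceeds the role's maximum.
fn effective_ttl(requested: Option<u64>, default_ttl: u64, max_ttl: u64) -> u64 {
    requested.unwrap_or(default_ttl).min(max_ttl)
}

/// Removes expired leases and returns the usernames that should be
/// dropped from the database.
fn revoke_expired(leases: &mut Vec<Lease>, now: u64) -> Vec<String> {
    let mut revoked = Vec::new();
    leases.retain(|lease| {
        let expired = now >= lease.issued_at + lease.ttl_secs;
        if expired {
            revoked.push(lease.username.clone());
        }
        !expired
    });
    revoked
}
```

With `default_ttl=1h` and `max_ttl=24h`, a request without an explicit TTL gets 3600 seconds, and a request for 48 hours is capped at 86400 seconds. A periodic sweep with `revoke_expired` then returns the users whose credentials are due for revocation.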
**For whom**:
- Teams deploying post-quantum cryptography today
- Organizations with cryptographic agility requirements
- Multi-cloud platforms needing Rust-native secrets management
- Security teams evaluating future quantum threats
**Expected results**:
- Preparation for quantum threats without changing architecture
- Secrets management with Rust memory guarantees
- Native integration with Provisioning (KMS) and Vapora (agent credentials)
---
## Vapora: Agent Orchestration with Cost Control
### Intelligent Agents for Operations
Vapora is not just for feature development. It's an orchestration platform that can coordinate specialized agents for DevOps operations.
**Available agents for Ops**:
- **DevOps**: CI/CD, pipelines, deployment automation
- **Monitor**: Health checks, alerting, real-time metrics
- **Security**: Auditing, compliance, vulnerability scanning
- **ProjectManager**: Roadmap, tracking, task coordination
**Real cost control for LLMs**:
- Budgets per role (monthly/weekly)
- Three levels: normal → near limit → exceeded
- Automatic fallback to cheaper providers without manual intervention
- Prometheus metrics: `vapora_budget_utilization`, `vapora_fallback_triggers`
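The three budget levels and the fallback rule can be sketched as a small state function. The names here (`BudgetState`, `budget_state`, `pick_provider`) and the 80% "near limit" threshold are assumptions for illustration. Whether Vapora switches providers at "near limit" or only at "exceeded" is not stated in this document, so the sketch falls back in both cases.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum BudgetState {
    Normal,
    NearLimit,
    Exceeded,
}

/// Classifies spending against a budget. `near_limit_pct` is the
/// percentage of the budget at which the "near limit" level starts.
fn budget_state(spent_cents: u64, budget_cents: u64, near_limit_pct: u64) -> BudgetState {
    if spent_cents >= budget_cents {
        BudgetState::Exceeded
    } else if spent_cents * 100 >= budget_cents * near_limit_pct {
        BudgetState::NearLimit
    } else {
        BudgetState::Normal
    }
}

/// Assumed policy: keep the preferred provider while spending is normal,
/// otherwise route to the cheaper fallback.
fn pick_provider<'a>(state: BudgetState, preferred: &'a str, fallback: &'a str) -> &'a str {
    match state {
        BudgetState::Normal => preferred,
        BudgetState::NearLimit | BudgetState::Exceeded => fallback,
    }
}
```

For a $100.00 budget with an 80% threshold, $50.00 spent is `Normal`, $85.00 is `NearLimit`, and $100.00 or more is `Exceeded`. Integer cents avoid floating-point rounding in the comparison.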
**NATS JetStream coordination**:
```text
┌──────────────────────────────────────────────────────┐
│ NATS JetStream Messaging │
├──────────────────────────────────────────────────────┤
│ │
│ vapora.tasks.assign → Task assignment │
│ vapora.tasks.results → Execution results │
│ vapora.agents.heartbeat → Agent health check │
│ │
│ Persistence: JetStream streams │
│ Delivery: At-least-once with acknowledgment │
│ Ordering: Per-subject message ordering │
└──────────────────────────────────────────────────────┘
```
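At-least-once delivery means a task message may arrive more than once (for example when an acknowledgment is lost), so consumers are usually written to be idempotent. The sketch below shows that idea with an in-memory deduplication set. It does not use a NATS client: `TaskConsumer`, `handle`, and the numeric `msg_id` are hypothetical stand-ins.

```rust
use std::collections::HashSet;

/// An idempotent consumer: remembers which message IDs it has handled.
struct TaskConsumer {
    seen: HashSet<u64>,
    processed: Vec<String>,
}

impl TaskConsumer {
    fn new() -> Self {
        TaskConsumer { seen: HashSet::new(), processed: Vec::new() }
    }

    /// Processes a message at most once and returns `true` to signal
    /// that it should be acknowledged. Redeliveries are acknowledged
    /// too, so the broker stops resending them.
    fn handle(&mut self, msg_id: u64, payload: &str) -> bool {
        if self.seen.insert(msg_id) {
            // First time this ID is seen: do the actual work.
            self.processed.push(payload.to_string());
        }
        true
    }
}
```

Delivering message 1 twice and message 2 once leaves exactly two entries in `processed`. A production consumer would persist the seen IDs or make the task itself idempotent, since an in-memory set is lost on restart.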
**Ops pipeline orchestration**:
```text
Pipeline: "Deploy microservice to K8s"
1. Security Agent: Docker image vulnerability scan
2. DevOps Agent: Validate K8s manifests + Helm charts
3. Monitor Agent: Setup Prometheus metrics + alerts
4. DevOps Agent: Deploy with kubectl apply + health check
5. Monitor Agent: Validate health endpoints + smoke tests
If any step fails: coordinated automatic rollback
```
**Metrics and observability**:
- Prometheus metrics endpoint (`/metrics`)
- OpenTelemetry integration (traces, spans)
- SurrealDB for execution storage
- Grafana dashboards for visualization
**For whom**:
- DevOps teams coordinating multiple LLM agents for operations
- Organizations needing to control LLM spending in automation
- Platforms with complex pipelines (CI/CD, deployment, monitoring)
**Expected results**:
- LLM cost reduction through intelligent routing
- Automatic orchestration of complex operational tasks
- Complete visibility of spending and performance per agent
---
## TypeDialog: Multi-Backend Forms for Configuration
### One Definition, Six Interfaces (Includes prov-gen)
TypeDialog unifies configuration capture across CLI, TUI, and Web interfaces, and adds a specialized backend (`prov-gen`) for multi-cloud IaC generation.
**Operational backends**:
| Backend | Typical Ops/DevOps use |
| --------- | ------------------------ |
| **CLI** | Automation scripts, CI/CD pipelines |
| **TUI** | Admin tools, terminal dashboards |
| **Web** | Self-service portals, team forms |
| **Prov-gen** | **Multi-cloud infrastructure generation** |
**Prov-gen Backend: IaC Generation**
The `prov-gen` backend generates Nickel infrastructure configurations for multiple clouds from typed forms:
```toml
# cluster-setup.toml
[form]
id = "k8s_cluster"
title = "Kubernetes Cluster Setup"
[[sections]]
id = "cloud"
title = "Cloud Provider"
[[sections.fields]]
id = "provider"
type = "select"
label = "Provider"
required = true
options = [
{ value = "aws", label = "AWS" },
{ value = "upcloud", label = "UpCloud" },
{ value = "local", label = "Local LXD" },
]
[[sections.fields]]
id = "region"
type = "text"
label = "Region"
required = true
[[sections]]
id = "cluster"
title = "Cluster Configuration"
[[sections.fields]]
id = "node_count"
type = "number"
label = "Node Count"
default = 3
validation.min = 1
validation.max = 20
[[sections.fields]]
id = "node_size"
type = "select"
label = "Node Size"
options = [
{ value = "small", label = "Small (2 CPU, 4GB RAM)" },
{ value = "medium", label = "Medium (4 CPU, 8GB RAM)" },
{ value = "large", label = "Large (8 CPU, 16GB RAM)" },
]
[output]
backend = "prov-gen"
format = "nickel"
validation = "nickel://schemas/kubernetes_cluster.ncl"
```
Execute with prov-gen:
```bash
typedialog execute cluster-setup.toml --backend prov-gen --output k8s-cluster.ncl
```
Generates Nickel IaC:
```nickel
# k8s-cluster.ncl (automatically generated)
{
provider = "aws",
region = "us-east-1",
servers = [
{
name = "k8s-control-plane-01",
plan = "medium",
role = "control-plane",
provider = "aws",
},
{
name = "k8s-worker-01",
plan = "medium",
role = "worker",
provider = "aws",
},
{
name = "k8s-worker-02",
plan = "medium",
role = "worker",
provider = "aws",
},
],
taskservs = [
"containerd",
"etcd",
"kubernetes",
"cilium",
],
networking = {
vpc_cidr = "10.0.0.0/16",
pod_cidr = "10.244.0.0/16",
service_cidr = "10.96.0.0/12",
},
}
```
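How a form answer such as `node_count = 3` turns into the `servers` list above can be sketched as follows. This is an inference from the generated example, not prov-gen's actual logic: the `Server` struct, `expand_servers`, and the rule "one control plane, the rest workers" are assumptions. The 1 to 20 range comes from the form's `validation.min` and `validation.max`.

```rust
#[derive(Debug, PartialEq)]
struct Server {
    name: String,
    plan: String,
    role: &'static str,
}

/// Expands a validated node count into one control-plane node plus workers.
/// Rejects counts outside the form's 1..=20 range.
fn expand_servers(node_count: u32, plan: &str) -> Result<Vec<Server>, String> {
    if !(1..=20).contains(&node_count) {
        return Err(format!("node_count must be between 1 and 20, got {node_count}"));
    }
    let mut servers = vec![Server {
        name: "k8s-control-plane-01".to_string(),
        plan: plan.to_string(),
        role: "control-plane",
    }];
    // Remaining nodes become workers, numbered from 01.
    for i in 1..node_count {
        servers.push(Server {
            name: format!("k8s-worker-{i:02}"),
            plan: plan.to_string(),
            role: "worker",
        });
    }
    Ok(servers)
}
```

`expand_servers(3, "medium")` produces `k8s-control-plane-01`, `k8s-worker-01`, and `k8s-worker-02`, the same three entries as the generated Nickel file. A count of 0 or 21 is rejected before any configuration is emitted.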
**Nickel contracts validation**:
```rust
// Automatic validation with Nickel schemas
let validator = NickelValidator::new();
let result = validator.validate(&generated_iac, "schemas/kubernetes_cluster.ncl")?;
if result.errors.is_empty() {
// Valid IaC, ready for Provisioning
provisioning_client.apply(&generated_iac).await?;
} else {
// Validation errors, show to user
eprintln!("Validation errors: {:?}", result.errors);
}
```
**For whom**:
- DevOps teams maintaining configuration wizards in CLI and Web
- Organizations with self-service infrastructure portals
- Teams needing IaC generation from forms
**Expected results**:
- One TOML definition for CLI, TUI, Web and IaC generation
- Typed validation before runtime with Nickel contracts
- Reduction of manual configuration errors
---
## Kogral: Knowledge Base for Platform Teams
### Your Ops Knowledge Base, Queryable
Kogral captures architectural decisions, runbooks, postmortems and operational procedures in a format that both humans and AI agents can query.
**6 specialized node types for Ops**:
| Type | Ops/DevOps use |
| ------ | ---------------- |
| **Note** | Runbooks, procedures, troubleshooting guides |
| **Decision** | Infrastructure ADRs (why AWS vs UpCloud, etcd vs Consul) |
| **Guideline** | Deployment standards, security policies |
| **Pattern** | Reusable infrastructure patterns (multi-AZ, HA) |
| **Journal** | Change logs, daily stand-up notes |
| **Execution** | Deployment history, rollbacks, incidents |
**Git-native + MCP for Claude Code**:
- Everything in versioned markdown (`.kogral/` directory)
- MCP server for Claude Code: agents query runbooks before executing
- Semantic search with fastembed (local) or cloud embeddings
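Semantic search generally works by comparing embedding vectors, and cosine similarity is a common metric for that comparison. The sketch below shows the ranking step only. It does not call fastembed or Kogral: `cosine`, `top_match`, and the tiny three-dimensional vectors in the usage note are made up for illustration, and Kogral's actual metric is not stated in this document.

```rust
/// Cosine similarity between two vectors; 0.0 if either has zero length.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Returns the title of the document whose embedding is closest to the query.
fn top_match<'a>(query: &[f32], docs: &'a [(&'a str, Vec<f32>)]) -> Option<&'a str> {
    docs.iter()
        .max_by(|a, b| cosine(query, &a.1).total_cmp(&cosine(query, &b.1)))
        .map(|(title, _)| *title)
}
```

Given a runbook embedded as `[1.0, 0.0, 0.0]` and a postmortem embedded as `[0.0, 1.0, 0.2]`, the query vector `[0.1, 0.9, 0.1]` ranks the postmortem first. Real embeddings have hundreds of dimensions, but the ranking logic is the same.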
**The Ops flow**:
```text
Production incident → Capture postmortem in Kogral as Execution
Claude Code queries via MCP → "How did we resolve this error before?"
Kogral responds with similar postmortems + runbooks
Agent applies documented solution instead of guessing
```
**MCP Tools for Ops**:
```bash
# Search troubleshooting runbooks
kogral-mcp search "nginx 502 error troubleshooting"
# Add incident postmortem
kogral-mcp add-execution \
--title "2026-01-22 PostgreSQL Connection Pool Exhaustion" \
--context "Production database connections maxed out" \
--resolution "Increased max_connections from 100 to 200, added PgBouncer" \
--tags "database,incident,postgresql"
# Get deployment guidelines
kogral-mcp get-guidelines "kubernetes deployment" --include-shared true
```
**For whom**:
- Platform teams needing to preserve operational knowledge
- SRE teams whose on-call rotations lose context from previous incidents
- DevOps engineers using Claude Code who want contextualized runbooks
**Expected results**:
- New SRE onboarding in days, not weeks
- Incident resolution informed by previous postmortems
- Infrastructure decisions preserved and searchable
---
## The Ecosystem in Action: Ops Scenarios
### Scenario 1: New Multi-Cloud Kubernetes Cluster
```text
1. TypeDialog (prov-gen): Configuration wizard for cluster
- Cloud provider, region, node count, node size
- Generates validated Nickel IaC
2. Provisioning: Deploys infrastructure
- Creates servers on AWS/UpCloud
- Installs containerd, etcd, kubernetes, cilium
- Checkpoints per step, automatic rollback on failure
3. SecretumVault: Generates PKI certificates
- Certificates for etcd, kube-apiserver, kubelet
- Automatic rotation every 90 days
4. Kogral: Documents architecture decision
- ADR: "Why Cilium over Calico"
- Runbook: "How to scale cluster from 3 to 10 nodes"
5. Vapora: Orchestrates post-deployment
- Monitor Agent: Setup Prometheus + Grafana
- Security Agent: Vulnerability scanning
- DevOps Agent: Deploy test applications
```
### Scenario 2: Production Incident (Database Outage)
```text
1. Vapora Monitor Agent: Detects PostgreSQL down
- Alert via NATS JetStream
- Trigger incident response pipeline
2. Kogral: Claude Code queries via MCP
- "PostgreSQL outage postmortems?"
- Returns 3 similar incidents with resolutions
3. Vapora DevOps Agent: Executes runbook
- Restarts PostgreSQL with adjusted parameters
- Verifies health checks
4. SecretumVault: Rotates DB credentials
- Generates new dynamic credentials
- Updates applications via Database engine
5. Kogral: Documents postmortem
- Execution node with root cause, resolution, action items
- Linked to PostgreSQL configuration ADRs
```
### Scenario 3: Post-Quantum Cryptography Migration
```text
1. Kogral: Documents migration decision
- ADR: "Migration to ML-KEM-768 for quantum threat preparation"
- Timeline, risks, mitigation strategies
2. SecretumVault: Migrates secrets
- Backend change: openssl → oqs
- Re-encrypts secrets with ML-KEM-768
- Maintains compatibility with classical clients
3. Provisioning: Updates infrastructure
- Generates new PKI certificates with ML-DSA-65
- Deploys certificates to services (etcd, K8s API)
- Automatic rollback if health checks fail
4. Vapora: Orchestrates validation
- Security Agent: Verifies correct cryptography
- Monitor Agent: Validates that latency has not degraded
- DevOps Agent: Executes integration tests
5. TypeDialog: Self-service portal for teams
- Form: "Migrate service to PQC"
- prov-gen backend generates updated configuration
```
### Scenario 4: CI/CD with AI Validation
```text
1. Developer: Push to Git repository (Gitea)
2. Vapora DevOps Agent (trigger via webhook):
- Executes linting, unit tests
- Build Docker image
- Vulnerability scan with Security Agent
3. TypeDialog: Deployment form
- Environment (staging/production)
- Canary rollout percentage
- Generates validated K8s configuration
4. Provisioning: Deploys with Tekton
- Apply K8s manifests with kubectl
- Automatic health checks
- Rollback if health check fails
5. SecretumVault: Injects secrets
- Dynamic DB credentials (TTL 1h)
- API keys from KV engine
- TLS certificates from PKI engine
6. Kogral: Records deployment
- Execution node with version, timestamp, author
- Link to commit SHA, PR, changes
```
---
## Why Choose This Ecosystem (Ops Perspective)
### Versus Alternatives
| This ecosystem | Terraform + Ansible + Vault |
| ---- | ----------------------------- |
| **Typed configuration**: Nickel with pre-runtime validation | YAML/HCL without types, errors at runtime |
| **Integrated orchestration**: Provisioning orchestrator with rollback | Imperative scripts, no automatic recovery |
| **Post-Quantum crypto**: SecretumVault with ML-KEM/ML-DSA today | Vault without PQC roadmap |
| **Unified multi-cloud**: One Nickel configuration for AWS/UpCloud/Local | Separate configurations per cloud |
| **AI-native**: MCP + RAG for assisted generation | No AI assistance, manual configuration |
| **Full Rust stack**: Performance, memory-safety | Mixed Python/Go/Shell with overhead |
### Technical Investment (Ops Focus)
| Component | Size / detail |
| -------- | ------- |
| **Provisioning**: Nickel IaC, 80+ CLI shortcuts | ~40K LOC |
| **SecretumVault**: 4 crypto backends, 4 storage backends | ~11K LOC |
| **Vapora**: NATS JetStream, 12 agent roles | ~50K LOC |
| **TypeDialog**: 6 backends including prov-gen | ~90K LOC |
| **Kogral**: 6 node types, MCP server | ~15K LOC |
| **Total tests** | 4,360+ |
| **Crypto backends** | OpenSSL, OQS (PQC), AWS-LC, RustCrypto |
| **Storage backends** | FS, etcd, SurrealDB, PostgreSQL |
---
## Getting Started (Adoption for Ops Teams)
### Recommended Progressive Adoption
1. **SecretumVault**: Secrets management with cryptographic agility (standalone)
2. **Kogral**: Establish operational knowledge base (runbooks, ADRs, postmortems)
3. **TypeDialog**: Configuration wizards for teams (CLI + Web + prov-gen)
4. **Provisioning**: Multi-cloud declarative IaC with orchestrator
5. **Vapora**: Orchestrate Ops agents with budget control (DevOps, Monitor, Security)
Each project works independently. Synergies emerge when combining them.
### Quick Start per Project
**SecretumVault**:
```bash
# Docker Compose with etcd
docker-compose -f deploy/docker/docker-compose.yml up -d
# Initialize vault
curl -X POST http://localhost:8200/v1/sys/init -d '{"shares": 5, "threshold": 3}'
# Unseal with 3 shares
curl -X POST http://localhost:8200/v1/sys/unseal -d '{"key": "<share-1>"}'
curl -X POST http://localhost:8200/v1/sys/unseal -d '{"key": "<share-2>"}'
curl -X POST http://localhost:8200/v1/sys/unseal -d '{"key": "<share-3>"}'
# Enable PKI engine for certificates
svault secret engine enable pki
```
**Kogral**:
```bash
# Initialize knowledge repository
kogral init
# Add runbook
kogral add note "PostgreSQL Connection Pool Tuning" \
--tags "database,postgresql,performance"
# Add ADR
kogral add decision "Choose Cilium over Calico" \
--context "Need CNI for K8s with eBPF" \
--decision "Cilium for performance and observability" \
--consequences "Higher initial complexity, better long-term performance"
# Serve MCP server for Claude Code
kogral serve --port 3100
```
**Provisioning**:
```bash
# Clone repository
git clone https://repo.jesusperez.pro/jesus/provisioning
cd provisioning
# Configure provider (UpCloud in this example)
cp config/providers/upcloud.example.toml config/providers/upcloud.toml
# Edit with UpCloud credentials
# Create K8s cluster (Nickel definition)
cat > cluster.ncl <<EOF
{
provider = "upcloud",
region = "de-fra1",
servers = [
{ name = "k8s-cp-01", plan = "medium", role = "control-plane" },
{ name = "k8s-worker-01", plan = "medium", role = "worker" },
{ name = "k8s-worker-02", plan = "medium", role = "worker" },
],
taskservs = ["containerd", "etcd", "kubernetes", "cilium"],
}
EOF
# Validate configuration
nickel typecheck cluster.ncl
# Apply (orchestrator with checkpoints)
prov apply cluster.ncl --with-rollback
```
**TypeDialog (prov-gen)**:
```bash
# Execute cluster configuration wizard
typedialog execute examples/ops/cluster-setup.toml \
--backend prov-gen \
--output my-cluster.ncl
# Generated configuration ready for Provisioning
nickel typecheck my-cluster.ncl
prov apply my-cluster.ncl
```
**Vapora**:
```bash
# Deploy with Docker Compose (backend + NATS + SurrealDB)
docker-compose up -d
# Create project
curl -X POST http://localhost:8001/projects \
-H "Content-Type: application/json" \
-d '{"name": "Infrastructure Automation", "description": "DevOps pipelines"}'
# Create task for DevOps Agent
curl -X POST http://localhost:8001/tasks \
-H "Content-Type: application/json" \
-d '{
"title": "Deploy Prometheus to K8s",
"task_type": "deployment",
"context": {"cluster": "prod-us-east-1", "namespace": "monitoring"}
}'
# Assign to DevOps Agent
curl -X POST http://localhost:8001/tasks/<task-id>/assign \
-H "Content-Type: application/json" \
-d '{"agent_role": "DevOps"}'
```
---
## Contact
- **Repositories**: GitHub (private projects)
- **Stack**: Rust, Nickel, Nushell, SurrealDB, Axum
- **License**: Proprietary / To be defined
---
*Modern infrastructure shouldn't require 10 disconnected tools.*
*One ecosystem. Five projects. Real integration for Ops/DevOps.*