stratumiops/docs/en/ops/ops-stratumiops-projects.md
Jesús Pérez 1680d80a3d
2026-01-22 22:15:19 +00:00

Ops/DevOps Portfolio: Modern Infrastructure End-to-End

The Problem

DevOps and platform teams face critical challenges managing modern infrastructure:

  • Fragmented tools: Terraform for IaC, Ansible for configuration, Vault for secrets, all disconnected
  • Untyped YAML: Configuration errors that explode at runtime, not at compile time
  • Static cryptography: No preparation for future quantum threats
  • Manual orchestration: Fragile imperative scripts without rollback or recovery
  • Hidden costs: No visibility into LLM spending for infrastructure generation
  • Complex multi-cloud: Different APIs, configurations and tools per provider

The Solution: An Integrated Ecosystem

Five projects designed to work together, covering the complete operations cycle.


Provisioning: Declarative Infrastructure as Code

Typed IaC with AI-Assisted Generation

Provisioning combines the precision of typed configuration (Nickel) with AI-assisted generation, eliminating fragile YAML and imperative scripts.

Unique capabilities:

  • Nickel IaC: Typed configuration with lazy evaluation, pre-runtime validation
  • MCP Server: Natural language queries about infrastructure
  • Integrated RAG: 1,200+ domain documents for contextual responses
  • Multi-cloud: AWS, UpCloud, local (LXD) from the same definition

Hybrid orchestration:

  • Rust orchestrator for critical workflows (10-50x faster than Python)
  • Nushell scripts for flexibility and rapid prototyping
  • Automatic dependency resolution (topological sorting)
  • Checkpoints and automatic rollback on failures
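
The dependency resolution mentioned above can be sketched with Kahn's algorithm. This is a minimal illustration, not the orchestrator's actual code: the function name deploy_order and the map-of-dependencies input shape are assumptions made for the example.

```rust
use std::collections::{BTreeMap, BTreeSet, VecDeque};

/// Returns an order in which every service comes after its dependencies,
/// or `None` when the graph has a cycle (Kahn's algorithm).
/// `deps` maps a service to the services it needs first (hypothetical shape).
pub fn deploy_order<'a>(deps: &BTreeMap<&'a str, Vec<&'a str>>) -> Option<Vec<&'a str>> {
    // Collect every node, including ones that only appear as a dependency.
    let mut nodes: BTreeSet<&'a str> = deps.keys().copied().collect();
    for needed in deps.values() {
        nodes.extend(needed.iter().copied());
    }

    // unmet[s] = number of dependencies of s that are not deployed yet.
    let mut unmet: BTreeMap<&'a str, usize> = nodes.iter().map(|n| (*n, 0)).collect();
    // waiting[d] = services that are waiting for d.
    let mut waiting: BTreeMap<&'a str, Vec<&'a str>> = BTreeMap::new();
    for (service, needed) in deps {
        for dep in needed {
            *unmet.get_mut(service).unwrap() += 1;
            waiting.entry(*dep).or_default().push(*service);
        }
    }

    // Start with the services that have no dependencies at all.
    let mut ready: VecDeque<&'a str> = unmet
        .iter()
        .filter(|(_, count)| **count == 0)
        .map(|(name, _)| *name)
        .collect();

    let mut order = Vec::with_capacity(nodes.len());
    while let Some(current) = ready.pop_front() {
        order.push(current);
        for next in waiting.get(current).into_iter().flatten() {
            let count = unmet.get_mut(next).unwrap();
            *count -= 1;
            if *count == 0 {
                ready.push_back(*next);
            }
        }
    }

    // Any node left out of the order means there is a dependency cycle.
    (order.len() == nodes.len()).then_some(order)
}
```

Declaring kubernetes as depending on containerd and etcd, and cilium as depending on kubernetes, yields the order containerd, etcd, kubernetes, cilium. A cyclic graph returns None instead of a partial order.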

The workflow:

"I need a K8s cluster on AWS with 3 nodes and Cilium"
                    ↓
              MCP Server (NLP)
                    ↓
          RAG searches similar configurations
                    ↓
          Generates Nickel + validates types
                    ↓
          Orchestrator deploys:
            1. containerd (dependency)
            2. etcd (dependency)
            3. kubernetes (core)
            4. cilium (CNI)
          With checkpoints and automatic rollback
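
The last stage of this workflow, deploying steps in order and undoing them on failure, might look roughly like the sketch below. It simulates each step with a boolean instead of running real commands. The Step type and run_with_rollback function are hypothetical names, not part of Provisioning's API.

```rust
/// One deployment step. `succeeds` simulates the outcome; a real step
/// would run an installer and report its result (assumption for the sketch).
pub struct Step<'a> {
    pub name: &'a str,
    pub succeeds: bool,
}

/// Applies steps in order. On the first failure, undoes every step that was
/// already applied, newest first. Returns (success, event log).
pub fn run_with_rollback(steps: &[Step<'_>]) -> (bool, Vec<String>) {
    let mut log = Vec::new();
    // Checkpoints: names of the steps known to be applied so far.
    let mut applied: Vec<&str> = Vec::new();

    for step in steps {
        if step.succeeds {
            log.push(format!("apply {}", step.name));
            applied.push(step.name);
        } else {
            log.push(format!("fail {}", step.name));
            // Roll back in reverse order so dependents are removed first.
            for name in applied.iter().rev() {
                log.push(format!("undo {name}"));
            }
            return (false, log);
        }
    }
    (true, log)
}
```

If kubernetes fails after containerd and etcd were applied, the log ends with "undo etcd" followed by "undo containerd", and cilium is never attempted.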

Enterprise security:

  • JWT + MFA (TOTP + WebAuthn)
  • Cedar policy engine for RBAC/ABAC
  • 7-year audit log retention
  • 5 KMS backends (RustyVault, Age, AWS KMS, Vault, Cosmian)
  • SOPS/Age for configuration encryption at rest

For whom:

  • DevOps teams wanting typed IaC, not fragile YAML
  • Multi-cloud organizations (AWS + UpCloud + on-premise)
  • Teams needing audit, compliance and enterprise security

Expected results:

  • Configuration errors detected at validation time (before deployment), not at runtime
  • Infrastructure generated from natural language (MCP + RAG)
  • Automatic rollback on failures with state management

SecretumVault: Secrets Management with Post-Quantum Crypto

Rust Vault with PQC in Production

SecretumVault is a secrets management system that implements production-ready post-quantum cryptography (ML-KEM-768, ML-DSA-65), giving organizations cryptographic agility they can deploy today.

Crypto-agnostic:

  • OpenSSL: RSA, ECDSA, AES-256-GCM (classical compatibility)
  • OQS (Post-Quantum): ML-KEM-768, ML-DSA-65 (NIST FIPS 203/204)
  • AWS-LC: Experimental PQC (testing)
  • RustCrypto: Pure-Rust implementations (testing)
  • Pluggable backends: Change algorithms without modifying code

Secrets engines:

| Engine | Capability | Use cases |
|--------|------------|-----------|
| KV | Versioned secret storage | Credentials, API keys, sensitive configurations |
| Transit | Encryption-as-a-service with key rotation | Application data encryption, key rotation |
| PKI | X.509 certificate generation | mTLS, service mesh, internal infrastructure |
| Database | Dynamic credentials with TTL | PostgreSQL, MySQL, MongoDB credentials on-demand |

Multi-backend storage:

  • Filesystem: Development, single-node, rapid prototyping
  • etcd: Kubernetes, high availability, strong consistency
  • SurrealDB: Complex queries, time-series, multi-tenant scopes
  • PostgreSQL: Enterprise, ACID, complete auditing

Enterprise security:

  • Shamir Secret Sharing for unsealing (configurable threshold)
  • Cedar policy engine (ABAC, AWS-compatible)
  • Native TLS/mTLS with X.509 certificates
  • Complete audit logging with configurable retention
  • Token management with TTL and renewal
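
To make the threshold idea behind Shamir unsealing concrete, here is a small sketch of split and combine over a prime field. It is illustrative only: the field (2^61 - 1), the function names, and the caller-supplied coefficients are assumptions, and SecretumVault's real implementation may differ. In real use the coefficients must come from a cryptographically secure random source.

```rust
/// Field modulus: the Mersenne prime 2^61 - 1 (chosen for the sketch).
const P: u128 = (1u128 << 61) - 1;

/// Modular exponentiation by squaring.
fn mod_pow(mut base: u128, mut exp: u128) -> u128 {
    let mut acc = 1;
    base %= P;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % P;
        }
        base = base * base % P;
        exp >>= 1;
    }
    acc
}

/// Modular inverse via Fermat's little theorem (P is prime).
fn mod_inv(a: u128) -> u128 {
    mod_pow(a, P - 2)
}

/// Evaluates secret + c1*x + c2*x^2 + ... at x = 1..=n.
/// `coeffs.len()` must equal threshold - 1.
pub fn split(secret: u128, coeffs: &[u128], n: u128) -> Vec<(u128, u128)> {
    (1..=n)
        .map(|x| {
            // Horner's rule, highest-degree coefficient first.
            let high = coeffs.iter().rev().fold(0, |acc, c| (acc * x + *c) % P);
            (x, (high * x + secret) % P)
        })
        .collect()
}

/// Recovers the secret by Lagrange interpolation at x = 0.
pub fn combine(shares: &[(u128, u128)]) -> u128 {
    let mut secret = 0;
    for (i, &(xi, yi)) in shares.iter().enumerate() {
        let (mut num, mut den) = (1u128, 1u128);
        for (j, &(xj, _)) in shares.iter().enumerate() {
            if i != j {
                num = num * (P - xj) % P; // (0 - xj) mod P
                den = den * ((xi + P - xj) % P) % P; // (xi - xj) mod P
            }
        }
        secret = (secret + yi * num % P * mod_inv(den)) % P;
    }
    secret
}
```

With two coefficients the threshold is 3: any three of the five shares reconstruct the secret, while two shares interpolate to a different value.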

Ops/DevOps workflow:

# Initialize vault with Shamir (5 shares, threshold 3)
svault operator init --shares 5 --threshold 3

# Unseal with 3 shares
svault operator unseal --share <share-1>
svault operator unseal --share <share-2>
svault operator unseal --share <share-3>

# Enable Database engine for PostgreSQL
svault secret engine enable database
svault secret database config postgres-prod \
  plugin_name=postgresql-database-plugin \
  connection_url="postgresql://{{username}}:{{password}}@postgres:5432/mydb" \
  username="vault" password="vaultpass"

# Create role for dynamic credentials
svault secret database role create myapp-role \
  db_name=postgres-prod \
  creation_statements="CREATE USER \"{{name}}\" WITH PASSWORD '{{password}}' VALID UNTIL '{{expiration}}'; GRANT SELECT ON ALL TABLES IN SCHEMA public TO \"{{name}}\";" \
  default_ttl=1h max_ttl=24h

# Get dynamic credentials (generated on-demand)
svault secret read database/creds/myapp-role
# Key                Value
# ---                -----
# lease_id           database/creds/myapp-role/abc123
# lease_duration     3600
# username           v-myapp-role-xyz789
# password           A1b2C3d4E5f6G7h8

# Credentials are automatically revoked after 1h TTL
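
The interaction between default_ttl and max_ttl in the role above can be sketched as a small lease model. It assumes Vault-style semantics (a renewal extends the lease by default_ttl but never past issue time plus max_ttl), which this document does not confirm for SecretumVault. The Lease type is hypothetical.

```rust
/// A credential lease. All times are seconds since an arbitrary epoch.
/// The renewal rule is an assumption modelled on Vault-style TTLs.
#[derive(Debug, Clone, PartialEq)]
pub struct Lease {
    pub issued_at: u64,
    pub expires_at: u64,
    pub default_ttl: u64,
    pub max_ttl: u64,
}

impl Lease {
    /// Issues a lease that lasts for `default_ttl`, capped by `max_ttl`.
    pub fn issue(now: u64, default_ttl: u64, max_ttl: u64) -> Self {
        let ttl = default_ttl.min(max_ttl);
        Lease { issued_at: now, expires_at: now + ttl, default_ttl, max_ttl }
    }

    /// A lease is valid strictly before its expiry time.
    pub fn is_valid(&self, now: u64) -> bool {
        now < self.expires_at
    }

    /// Extends a still-valid lease; returns false if it already expired.
    pub fn renew(&mut self, now: u64) -> bool {
        if !self.is_valid(now) {
            return false;
        }
        // Never extend beyond the hard limit set at issue time.
        let hard_limit = self.issued_at + self.max_ttl;
        self.expires_at = (now + self.default_ttl).min(hard_limit);
        true
    }
}
```

With default_ttl=1h and max_ttl=24h, a lease issued at t=0 expires at 3600 unless renewed, and no sequence of renewals can push expiry past 86400.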

For whom:

  • Teams deploying post-quantum cryptography today
  • Organizations with cryptographic agility requirements
  • Multi-cloud platforms needing Rust-native secrets management
  • Security teams evaluating future quantum threats

Expected results:

  • Preparation for quantum threats without changing architecture
  • Secrets management with Rust memory guarantees
  • Native integration with Provisioning (KMS) and Vapora (agent credentials)

Vapora: Agent Orchestration with Cost Control

Intelligent Agents for Operations

Vapora is not just for feature development. It's an orchestration platform that can coordinate specialized agents for DevOps operations.

Available agents for Ops:

  • DevOps: CI/CD, pipelines, deployment automation
  • Monitor: Health checks, alerting, real-time metrics
  • Security: Auditing, compliance, vulnerability scanning
  • ProjectManager: Roadmap, tracking, task coordination

Real cost control for LLMs:

  • Budgets per role (monthly/weekly)
  • Three levels: normal → near limit → exceeded
  • Automatic fallback to cheaper providers without manual intervention
  • Prometheus metrics: vapora_budget_utilization, vapora_fallback_triggers
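
The three budget levels and the fallback rule can be sketched as follows. The 80% "near limit" threshold and the rule of switching providers only once the budget is exceeded are assumptions made for illustration; Vapora's actual thresholds and routing logic are not stated here.

```rust
/// The three budget levels described above.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum BudgetStatus {
    Normal,
    NearLimit,
    Exceeded,
}

/// Classifies spending against a limit, both in cents.
/// The 80% warning threshold is an assumed value.
pub fn budget_status(spent_cents: u64, limit_cents: u64) -> BudgetStatus {
    if spent_cents >= limit_cents {
        BudgetStatus::Exceeded
    } else if spent_cents.saturating_mul(100) >= limit_cents.saturating_mul(80) {
        BudgetStatus::NearLimit
    } else {
        BudgetStatus::Normal
    }
}

/// Chooses a provider: the preferred one while within budget,
/// the cheaper fallback once the budget is exceeded.
pub fn choose_provider<'a>(
    status: BudgetStatus,
    preferred: &'a str,
    fallback: &'a str,
) -> &'a str {
    match status {
        BudgetStatus::Exceeded => fallback,
        BudgetStatus::Normal | BudgetStatus::NearLimit => preferred,
    }
}
```

Keeping the classification separate from the routing decision means the same status value can drive both the fallback and a utilization metric.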

NATS JetStream coordination:

┌──────────────────────────────────────────────────────┐
│             NATS JetStream Messaging                 │
├──────────────────────────────────────────────────────┤
│                                                      │
│  vapora.tasks.assign    → Task assignment            │
│  vapora.tasks.results   → Execution results          │
│  vapora.agents.heartbeat → Agent health check        │
│                                                      │
│  Persistence: JetStream streams                      │
│  Delivery: At-least-once with acknowledgment         │
│  Ordering: Per-subject message ordering              │
└──────────────────────────────────────────────────────┘
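
At-least-once delivery means a consumer can receive the same message more than once, for example when an acknowledgment is lost. A common way to cope is to make handling idempotent. The sketch below deduplicates by task id in memory. The ResultConsumer type is hypothetical and no NATS client is involved.

```rust
use std::collections::HashSet;

/// Consumes task results while ignoring redelivered duplicates.
/// Deduplication by task id in memory is an assumption for the sketch;
/// a real consumer would likely persist the seen ids.
pub struct ResultConsumer {
    seen: HashSet<String>,
    pub processed: Vec<String>,
}

impl ResultConsumer {
    pub fn new() -> Self {
        ResultConsumer { seen: HashSet::new(), processed: Vec::new() }
    }

    /// Returns true when the result is handled for the first time.
    /// A duplicate returns false: it can be acknowledged without reprocessing.
    pub fn handle(&mut self, task_id: &str, payload: &str) -> bool {
        if !self.seen.insert(task_id.to_string()) {
            return false;
        }
        self.processed.push(payload.to_string());
        true
    }
}
```

Calling handle twice with the same task id processes the payload once, so a redelivery on vapora.tasks.results does not repeat side effects.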

Ops pipeline orchestration:

Pipeline: "Deploy microservice to K8s"

1. Security Agent:  Docker image vulnerability scan
2. DevOps Agent:    Validate K8s manifests + Helm charts
3. Monitor Agent:   Set up Prometheus metrics + alerts
4. DevOps Agent:    Deploy with kubectl apply + health check
5. Monitor Agent:   Validate health endpoints + smoke tests

If any step fails: coordinated automatic rollback

Metrics and observability:

  • Prometheus metrics endpoint (/metrics)
  • OpenTelemetry integration (traces, spans)
  • SurrealDB for execution storage
  • Grafana dashboards for visualization

For whom:

  • DevOps teams coordinating multiple LLM agents for operations
  • Organizations needing to control LLM spending in automation
  • Platforms with complex pipelines (CI/CD, deployment, monitoring)

Expected results:

  • LLM cost reduction through intelligent routing
  • Automatic orchestration of complex operational tasks
  • Complete visibility of spending and performance per agent

TypeDialog: Multi-Backend Forms for Configuration

One Definition, Six Interfaces (Includes prov-gen)

TypeDialog unifies configuration capture across CLI, TUI, and Web, and adds a specialized backend for multi-cloud IaC generation.

Operational backends:

| Backend | Typical Ops/DevOps use |
|---------|------------------------|
| CLI | Automation scripts, CI/CD pipelines |
| TUI | Admin tools, terminal dashboards |
| Web | Self-service portals, team forms |
| Prov-gen | Multi-cloud infrastructure generation |

Prov-gen Backend: IaC Generation

The prov-gen backend generates Nickel infrastructure configurations for multiple clouds from typed forms:

# cluster-setup.toml
[form]
id = "k8s_cluster"
title = "Kubernetes Cluster Setup"

[[sections]]
id = "cloud"
title = "Cloud Provider"

[[sections.fields]]
id = "provider"
type = "select"
label = "Provider"
required = true
options = [
    { value = "aws", label = "AWS" },
    { value = "upcloud", label = "UpCloud" },
    { value = "local", label = "Local LXD" },
]

[[sections.fields]]
id = "region"
type = "text"
label = "Region"
required = true

[[sections]]
id = "cluster"
title = "Cluster Configuration"

[[sections.fields]]
id = "node_count"
type = "number"
label = "Node Count"
default = 3
validation.min = 1
validation.max = 20

[[sections.fields]]
id = "node_size"
type = "select"
label = "Node Size"
options = [
    { value = "small", label = "Small (2 CPU, 4GB RAM)" },
    { value = "medium", label = "Medium (4 CPU, 8GB RAM)" },
    { value = "large", label = "Large (8 CPU, 16GB RAM)" },
]

[output]
backend = "prov-gen"
format = "nickel"
validation = "nickel://schemas/kubernetes_cluster.ncl"
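
The node_count field above declares a default and a min/max range. A backend could enforce those rules along the lines of the sketch below. NumberField and FieldError are hypothetical names chosen for the example, not TypeDialog's actual types.

```rust
/// Reasons a numeric answer can be rejected.
#[derive(Debug, PartialEq)]
pub enum FieldError {
    Missing,
    NotANumber(String),
    OutOfRange { value: i64, min: i64, max: i64 },
}

/// A numeric form field with an optional default and an inclusive range.
pub struct NumberField {
    pub default: Option<i64>,
    pub min: i64,
    pub max: i64,
}

impl NumberField {
    /// Parses raw user input; empty input falls back to the default.
    pub fn parse(&self, input: &str) -> Result<i64, FieldError> {
        let trimmed = input.trim();
        let value = if trimmed.is_empty() {
            self.default.ok_or(FieldError::Missing)?
        } else {
            trimmed
                .parse::<i64>()
                .map_err(|_| FieldError::NotANumber(trimmed.to_string()))?
        };
        // Enforce the inclusive range declared on the field.
        if value < self.min || value > self.max {
            return Err(FieldError::OutOfRange { value, min: self.min, max: self.max });
        }
        Ok(value)
    }
}
```

For a field with default 3, min 1, and max 20, an empty answer yields 3, "5" yields 5, and "21" is rejected as out of range before any configuration is generated.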

Execute with prov-gen:

typedialog execute cluster-setup.toml --backend prov-gen --output k8s-cluster.ncl

Generates Nickel IaC:

# k8s-cluster.ncl (automatically generated)
{
  provider = "aws",
  region = "us-east-1",

  servers = [
    {
      name = "k8s-control-plane-01",
      plan = "medium",
      role = "control-plane",
      provider = "aws",
    },
    {
      name = "k8s-worker-01",
      plan = "medium",
      role = "worker",
      provider = "aws",
    },
    {
      name = "k8s-worker-02",
      plan = "medium",
      role = "worker",
      provider = "aws",
    },
  ],

  taskservs = [
    "containerd",
    "etcd",
    "kubernetes",
    "cilium",
  ],

  networking = {
    vpc_cidr = "10.0.0.0/16",
    pod_cidr = "10.244.0.0/16",
    service_cidr = "10.96.0.0/12",
  },
}
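
One plausible way to get from a node_count answer to the servers list above is to reserve the first node as the control plane and name the rest as workers. The sketch below assumes exactly that layout and naming scheme. Both are inferred from the sample output rather than from prov-gen's source, and expand_servers is a hypothetical helper.

```rust
/// One entry of the generated `servers` list.
#[derive(Debug, PartialEq)]
pub struct Server {
    pub name: String,
    pub plan: String,
    pub role: &'static str,
}

/// Expands a node count into one control-plane node plus workers.
/// The single control plane and the naming scheme are assumptions.
pub fn expand_servers(node_count: usize, plan: &str) -> Vec<Server> {
    (0..node_count)
        .map(|index| {
            if index == 0 {
                Server {
                    name: "k8s-control-plane-01".to_string(),
                    plan: plan.to_string(),
                    role: "control-plane",
                }
            } else {
                Server {
                    // Workers are numbered from 01 with zero padding.
                    name: format!("k8s-worker-{:02}", index),
                    plan: plan.to_string(),
                    role: "worker",
                }
            }
        })
        .collect()
}
```

With node_count = 3 and the "medium" plan this produces the same three names as the generated file: k8s-control-plane-01, k8s-worker-01, and k8s-worker-02.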

Nickel contracts validation:

// Automatic validation with Nickel schemas
let validator = NickelValidator::new();
let result = validator.validate(&generated_iac, "schemas/kubernetes_cluster.ncl")?;

if result.errors.is_empty() {
    // Valid IaC, ready for Provisioning
    provisioning_client.apply(&generated_iac).await?;
} else {
    // Validation errors, show to user
    eprintln!("Validation errors: {:?}", result.errors);
}

For whom:

  • DevOps teams maintaining configuration wizards in CLI and Web
  • Organizations with self-service infrastructure portals
  • Teams needing IaC generation from forms

Expected results:

  • One TOML definition for CLI, TUI, Web and IaC generation
  • Typed validation before runtime with Nickel contracts
  • Reduction of manual configuration errors

Kogral: Knowledge Base for Platform Teams

Your Ops Knowledge Base, Queryable

Kogral captures architectural decisions, runbooks, postmortems and operational procedures in a format that both humans and AI agents can query.

6 specialized node types for Ops:

| Type | Ops/DevOps use |
|------|----------------|
| Note | Runbooks, procedures, troubleshooting guides |
| Decision | Infrastructure ADRs (why AWS vs UpCloud, etcd vs Consul) |
| Guideline | Deployment standards, security policies |
| Pattern | Reusable infrastructure patterns (multi-AZ, HA) |
| Journal | Change logs, daily stand-up notes |
| Execution | Deployment history, rollbacks, incidents |

Git-native + MCP for Claude Code:

  • Everything in versioned markdown (.kogral/ directory)
  • MCP server for Claude Code: agents query runbooks before executing
  • Semantic search with fastembed (local) or cloud embeddings
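
Semantic search typically ranks documents by the similarity between a query embedding and each document embedding. The sketch below shows only that ranking step, using cosine similarity on hand-written vectors. It does not call fastembed, and top_match is a hypothetical helper rather than Kogral's API.

```rust
/// Cosine similarity between two vectors of equal length.
/// Returns 0.0 when either vector has zero magnitude.
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|y| y * y).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Returns the title whose embedding is most similar to the query.
/// Embeddings are assumed to be precomputed by some embedding model.
pub fn top_match<'a>(query: &[f32], docs: &'a [(&'a str, Vec<f32>)]) -> Option<&'a str> {
    docs.iter()
        .max_by(|a, b| cosine(query, &a.1).total_cmp(&cosine(query, &b.1)))
        .map(|(title, _)| *title)
}
```

A query vector close to a runbook's embedding ranks that runbook first, which is what lets an agent retrieve a relevant postmortem without matching exact keywords.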

The Ops flow:

Production incident → Capture postmortem in Kogral as Execution
                                    ↓
            Claude Code queries via MCP → "How did we resolve this error before?"
                                    ↓
                   Kogral responds with similar postmortems + runbooks
                                    ↓
              Agent applies documented solution instead of guessing

MCP Tools for Ops:

# Search troubleshooting runbooks
kogral-mcp search "nginx 502 error troubleshooting"

# Add incident postmortem
kogral-mcp add-execution \
  --title "2026-01-22 PostgreSQL Connection Pool Exhaustion" \
  --context "Production database connections maxed out" \
  --resolution "Increased max_connections from 100 to 200, added PgBouncer" \
  --tags "database,incident,postgresql"

# Get deployment guidelines
kogral-mcp get-guidelines "kubernetes deployment" --include-shared true

For whom:

  • Platform teams needing to preserve operational knowledge
  • SRE teams whose on-call rotation loses the context of previous incidents
  • DevOps using Claude Code wanting contextualized runbooks

Expected results:

  • New SRE onboarding in days, not weeks
  • Incident resolution informed by previous postmortems
  • Infrastructure decisions preserved and searchable

The Ecosystem in Action: Ops Scenarios

Scenario 1: New Multi-Cloud Kubernetes Cluster

1. TypeDialog (prov-gen): Configuration wizard for cluster
   - Cloud provider, region, node count, node size
   - Generates validated Nickel IaC

2. Provisioning: Deploys infrastructure
   - Creates servers on AWS/UpCloud
   - Installs containerd, etcd, kubernetes, cilium
   - Checkpoints per step, automatic rollback on failure

3. SecretumVault: Generates PKI certificates
   - Certificates for etcd, kube-apiserver, kubelet
   - Automatic rotation every 90 days

4. Kogral: Documents architecture decision
   - ADR: "Why Cilium over Calico"
   - Runbook: "How to scale cluster from 3 to 10 nodes"

5. Vapora: Orchestrates post-deployment
   - Monitor Agent: Sets up Prometheus + Grafana
   - Security Agent: Vulnerability scanning
   - DevOps Agent: Deploy test applications

Scenario 2: Production Incident (Database Outage)

1. Vapora Monitor Agent: Detects PostgreSQL down
   - Alert via NATS JetStream
   - Triggers incident response pipeline

2. Kogral: Claude Code queries via MCP
   - "PostgreSQL outage postmortems?"
   - Returns 3 similar incidents with resolutions

3. Vapora DevOps Agent: Executes runbook
   - Restarts PostgreSQL with adjusted parameters
   - Verifies health checks

4. SecretumVault: Rotates DB credentials
   - Generates new dynamic credentials
   - Updates applications via Database engine

5. Kogral: Documents postmortem
   - Execution node with root cause, resolution, action items
   - Linked to PostgreSQL configuration ADRs

Scenario 3: Post-Quantum Cryptography Migration

1. Kogral: Documents migration decision
   - ADR: "Migration to ML-KEM-768 for quantum threat preparation"
   - Timeline, risks, mitigation strategies

2. SecretumVault: Migrates secrets
   - Backend change: openssl → oqs
   - Re-encrypts secrets with ML-KEM-768
   - Maintains compatibility with classical clients

3. Provisioning: Updates infrastructure
   - Generates new PKI certificates with ML-DSA-65
   - Deploys certificates to services (etcd, K8s API)
   - Automatic rollback if health checks fail

4. Vapora: Orchestrates validation
   - Security Agent: Verifies correct cryptography
   - Monitor Agent: Validates that latency has not degraded
   - DevOps Agent: Executes integration tests

5. TypeDialog: Self-service portal for teams
   - Form: "Migrate service to PQC"
   - prov-gen backend generates updated configuration

Scenario 4: CI/CD with AI Validation

1. Developer: Push to Git repository (Gitea)

2. Vapora DevOps Agent (trigger via webhook):
   - Executes linting, unit tests
   - Builds Docker image
   - Vulnerability scan with Security Agent

3. TypeDialog: Deployment form
   - Environment (staging/production)
   - Canary rollout percentage
   - Generates validated K8s configuration

4. Provisioning: Deploys with Tekton
   - Applies K8s manifests with kubectl
   - Automatic health checks
   - Rollback if health check fails

5. SecretumVault: Injects secrets
   - Dynamic DB credentials (TTL 1h)
   - API keys from KV engine
   - TLS certificates from PKI engine

6. Kogral: Records deployment
   - Execution node with version, timestamp, author
   - Link to commit SHA, PR, changes

Why Choose This Ecosystem (Ops Perspective)

Versus Alternatives

| Us | Terraform + Ansible + Vault |
|----|-----------------------------|
| Typed configuration: Nickel with pre-runtime validation | YAML/HCL without types, errors at runtime |
| Integrated orchestration: Provisioning orchestrator with rollback | Imperative scripts, no automatic recovery |
| Post-Quantum crypto: SecretumVault with ML-KEM/ML-DSA today | Vault without PQC roadmap |
| Unified multi-cloud: One Nickel configuration for AWS/UpCloud/Local | Separate configurations per cloud |
| AI-native: MCP + RAG for assisted generation | No AI assistance, manual configuration |
| Full Rust stack: Performance, memory-safety | Mixed Python/Go/Shell with overhead |

Technical Investment (Ops Focus)

| Metric | Value |
|--------|-------|
| Provisioning: Nickel IaC, 80+ CLI shortcuts | ~40K LOC |
| SecretumVault: 4 crypto backends, 4 storage backends | ~11K LOC |
| Vapora: NATS JetStream, 12 agent roles | ~50K LOC |
| TypeDialog: 6 backends including prov-gen | ~90K LOC |
| Kogral: 6 node types, MCP server | ~15K LOC |
| Total tests | 4,360+ |
| Crypto backends | OpenSSL, OQS (PQC), AWS-LC, RustCrypto |
| Storage backends | FS, etcd, SurrealDB, PostgreSQL |

Getting Started (Adoption for Ops Teams)

  1. SecretumVault: Secrets management with cryptographic agility (standalone)
  2. Kogral: Establish operational knowledge base (runbooks, ADRs, postmortems)
  3. TypeDialog: Configuration wizards for teams (CLI + Web + prov-gen)
  4. Provisioning: Multi-cloud declarative IaC with orchestrator
  5. Vapora: Orchestrate Ops agents with budget control (DevOps, Monitor, Security)

Each project works independently. Synergies emerge when combining them.

Quick Start per Project

SecretumVault:

# Docker Compose with etcd
docker-compose -f deploy/docker/docker-compose.yml up -d

# Initialize vault
curl -X POST http://localhost:8200/v1/sys/init -d '{"shares": 5, "threshold": 3}'

# Unseal with 3 shares
curl -X POST http://localhost:8200/v1/sys/unseal -d '{"key": "<share-1>"}'
curl -X POST http://localhost:8200/v1/sys/unseal -d '{"key": "<share-2>"}'
curl -X POST http://localhost:8200/v1/sys/unseal -d '{"key": "<share-3>"}'

# Enable PKI engine for certificates
svault secret engine enable pki

Kogral:

# Initialize knowledge repository
kogral init

# Add runbook
kogral add note "PostgreSQL Connection Pool Tuning" \
  --tags "database,postgresql,performance"

# Add ADR
kogral add decision "Choose Cilium over Calico" \
  --context "Need CNI for K8s with eBPF" \
  --decision "Cilium for performance and observability" \
  --consequences "Higher initial complexity, better long-term performance"

# Serve MCP server for Claude Code
kogral serve --port 3100

Provisioning:

# Clone repository
git clone https://repo.jesusperez.pro/jesus/provisioning
cd provisioning

# Configure provider (UpCloud in this example)
cp config/providers/upcloud.example.toml config/providers/upcloud.toml
# Edit with UpCloud credentials

# Create K8s cluster (Nickel definition)
cat > cluster.ncl <<EOF
{
  provider = "upcloud",
  region = "de-fra1",
  servers = [
    { name = "k8s-cp-01", plan = "medium", role = "control-plane" },
    { name = "k8s-worker-01", plan = "medium", role = "worker" },
    { name = "k8s-worker-02", plan = "medium", role = "worker" },
  ],
  taskservs = ["containerd", "etcd", "kubernetes", "cilium"],
}
EOF

# Validate configuration
nickel typecheck cluster.ncl

# Apply (orchestrator with checkpoints)
prov apply cluster.ncl --with-rollback

TypeDialog (prov-gen):

# Execute cluster configuration wizard
typedialog execute examples/ops/cluster-setup.toml \
  --backend prov-gen \
  --output my-cluster.ncl

# Generated configuration ready for Provisioning
nickel typecheck my-cluster.ncl
prov apply my-cluster.ncl

Vapora:

# Deploy with Docker Compose (backend + NATS + SurrealDB)
docker-compose up -d

# Create project
curl -X POST http://localhost:8001/projects \
  -H "Content-Type: application/json" \
  -d '{"name": "Infrastructure Automation", "description": "DevOps pipelines"}'

# Create task for DevOps Agent
curl -X POST http://localhost:8001/tasks \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Deploy Prometheus to K8s",
    "task_type": "deployment",
    "context": {"cluster": "prod-us-east-1", "namespace": "monitoring"}
  }'

# Assign to DevOps Agent
curl -X POST http://localhost:8001/tasks/<task-id>/assign \
  -H "Content-Type: application/json" \
  -d '{"agent_role": "DevOps"}'

Contact

  • Repositories: GitHub (private projects)
  • Stack: Rust, Nickel, Nushell, SurrealDB, Axum
  • License: Proprietary / To be defined

Modern infrastructure shouldn't require 10 disconnected tools. One ecosystem. Five projects. Real integration for Ops/DevOps.