jesus/stratumiops

Jesús Pérez 9095ea6d8e

feat: add stratum-orchestrator with graph, state, NATS, and Nickel action nodes

New crates: stratum-orchestrator (Cedar authz, Vault secrets, Nu/agent executors,
  saga runner), stratum-graph (petgraph DAG + SurrealDB repo), stratum-state
  (SurrealDB tracker), platform-nats (NKey auth client), ncl-import-resolver.

  Updates: stratum-embeddings (SurrealDB store + persistent cache), stratum-llm
  circuit breaker. Adds Nickel action-nodes, schemas, config, Nushell scripts,
  docker-compose dev stack, and ADR-003.

2026-02-22 21:33:26 +00:00

ADR-003: Stratum-Orchestrator — Graph-Driven Workflow Orchestrator

Status: Accepted

Date: 2026-02-20

Context

The StratumIOps ecosystem spans multiple active projects (provisioning, kogral, syntaxis, typedialog) that evolve concurrently with cross-project dependencies. Each project triggers build, validation, publish, and notification workflows in response to code changes. Before this decision, these workflows were:

- Hardcoded per-project scripts with no shared execution model
- Not auditable: no durable record of which step ran, in what order, with what outcome
- Not composable: a workflow in provisioning could not react to an event from kogral
- Not safe: no atomicity, no rollback on partial failure, no credential scoping
- Not scalable: adding a new project meant copying and adapting scripts, accumulating drift

The core requirement is a cross-project, event-driven, agnostic workflow orchestrator that can coordinate build pipelines, AI agent tasks, and infrastructure operations without being modified for each new project or provider.

Two design directions were evaluated:

| Approach | Description | Consequence |
|---|---|---|
| Static rule router | Map NATS subject patterns to scripts in a config file | Rules multiply with projects; changing a workflow requires editing the router |
| Graph-driven engine | Events traverse a DAG of action nodes declared in Nickel; orchestrator is agnostic | Orchestrator never changes for new workflows |

The graph-driven approach was chosen. The orchestrator loads node definitions from Nickel files, builds an in-memory ActionGraph, and executes pipelines by traversing the graph — it does not know what the pipeline does, only how to coordinate it.

Fundamental Design Characteristics

1. Graph-Driven Execution, Not Static Routing

The orchestrator does not contain routing tables. Instead, NATS subject patterns are matched against an ActionGraph built from Nickel node definitions. Each ActionNode declares:

- trigger: NATS subject patterns that can activate this node as an entry point
- input_schemas: capabilities required before execution
- output_schemas: capabilities produced after execution
- compensate: optional rollback script for Saga atomicity

The graph is built by traversal of producer/consumer capability indexes. Topological sort produces a staged execution plan. Adding a new workflow means adding .ncl files — the orchestrator binary never changes.
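As an illustration of how a trigger could select entry points, the sketch below matches an incoming subject against each node's trigger patterns using the standard NATS wildcard rules (`*` matches exactly one token, `>` matches one or more trailing tokens). The struct mirrors the four fields listed above; the `id` field, the function names `subject_matches` and `entry_nodes`, and the subject `dev.crate.pushed` are assumptions made for this example, not names taken from the crate.

```rust
/// Hypothetical in-memory shape of an action node; field names follow the ADR,
/// `id` is an assumption added so nodes can be referred to.
#[derive(Debug, Clone)]
pub struct ActionNode {
    pub id: String,
    pub trigger: Vec<String>,
    pub input_schemas: Vec<String>,
    pub output_schemas: Vec<String>,
    pub compensate: Option<String>,
}

/// Returns true when `subject` matches `pattern` under NATS wildcard rules.
pub fn subject_matches(pattern: &str, subject: &str) -> bool {
    let mut pat = pattern.split('.');
    let mut sub = subject.split('.');
    loop {
        match (pat.next(), sub.next()) {
            // Both exhausted at the same time: full match.
            (None, None) => return true,
            // `>` needs at least one remaining token and must be the last pattern token.
            (Some(">"), Some(_)) => return pat.next().is_none(),
            // `*` accepts any single token.
            (Some("*"), Some(_)) => {}
            // Literal tokens must be equal.
            (Some(p), Some(s)) if p == s => {}
            _ => return false,
        }
    }
}

/// Ids of all nodes that declare a trigger pattern matching `subject`.
pub fn entry_nodes<'a>(nodes: &'a [ActionNode], subject: &str) -> Vec<&'a str> {
    nodes
        .iter()
        .filter(|n| n.trigger.iter().any(|p| subject_matches(p, subject)))
        .map(|n| n.id.as_str())
        .collect()
}
```

With a node whose trigger is `dev.>`, a message on `dev.crate.pushed` selects it as an entry point, while a message on the bare subject `dev` does not.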

2. Stateless Orchestrator — DB-First State

The orchestrator process holds no durable state. Every PipelineContext write goes to SurrealDB first, then updates an in-memory DashMap cache. On crash, a restarted instance reconstructs the cache from the DB. This enables:

- Horizontal scaling: multiple orchestrator instances share one SurrealDB, no split-brain
- Crash recovery: pipelines resume from last persisted capability, not from the beginning
- Observability: full pipeline state is queryable at any point without instrumenting the process
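The write ordering described above (store first, cache second, cache rebuilt from the store on restart) can be sketched with a small trait. This is a minimal illustration only: `StateStore`, `MemoryStore`, `deposit`, and `recover` are hypothetical names, a `HashMap` stands in for the DashMap cache, and an in-memory vector stands in for SurrealDB.

```rust
use std::collections::HashMap;

/// Hypothetical durable store; the real system would back this with SurrealDB.
pub trait StateStore {
    fn persist(&mut self, run_id: &str, capability: &str, value: &str) -> Result<(), String>;
    fn load_all(&self) -> Vec<(String, String, String)>;
}

/// Pipeline state with a write-through cache (HashMap stands in for DashMap).
pub struct PipelineContext<S: StateStore> {
    store: S,
    cache: HashMap<(String, String), String>,
}

impl<S: StateStore> PipelineContext<S> {
    /// Rebuild the cache from the durable store, as a restarted instance would.
    pub fn recover(store: S) -> Self {
        let cache = store
            .load_all()
            .into_iter()
            .map(|(run, cap, value)| ((run, cap), value))
            .collect();
        Self { store, cache }
    }

    /// DB-first write: the cache is only updated after the store accepted the write.
    pub fn deposit(&mut self, run_id: &str, capability: &str, value: &str) -> Result<(), String> {
        self.store.persist(run_id, capability, value)?;
        self.cache
            .insert((run_id.to_string(), capability.to_string()), value.to_string());
        Ok(())
    }

    pub fn get(&self, run_id: &str, capability: &str) -> Option<&str> {
        self.cache
            .get(&(run_id.to_string(), capability.to_string()))
            .map(|v| v.as_str())
    }

    /// Hand the store back, simulating the process going away.
    pub fn into_store(self) -> S {
        self.store
    }
}

/// In-memory stand-in used only to exercise the ordering.
#[derive(Default)]
pub struct MemoryStore {
    pub rows: Vec<(String, String, String)>,
    pub fail_writes: bool,
}

impl StateStore for MemoryStore {
    fn persist(&mut self, run_id: &str, capability: &str, value: &str) -> Result<(), String> {
        if self.fail_writes {
            return Err("store unavailable".to_string());
        }
        self.rows
            .push((run_id.to_string(), capability.to_string(), value.to_string()));
        Ok(())
    }

    fn load_all(&self) -> Vec<(String, String, String)> {
        self.rows.clone()
    }
}
```

Because the cache is written only after `persist` succeeds, a failed store write never leaves the cache ahead of the database, which is the property crash recovery relies on.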

3. Nickel as Single Source of Truth — No Dual Truth

Action nodes, capability schemas, and the orchestrator startup config are all defined in Nickel. There is no separate indexing process, no database copy of node definitions used at runtime. The ActionGraph is built in-memory from .ncl files at startup via nickel export --format json, then kept live via a notify file watcher for hot-reload.

This eliminates the dual-truth problem: the Nickel file IS the definition. No risk of a database record diverging from the file on disk.

4. Capability Model — Dependency Inversion at Execution Level

Nodes do not depend on each other directly. They declare capabilities they produce and consume. The graph engine resolves dependencies:

lint-crate  → produces: linted-code
fmt-crate   → produces: formatted-code
build-crate → consumes: linted-code, formatted-code
            → produces: built-artifact
install-crate → consumes: built-artifact

This is Dependency Inversion applied to the execution domain: build-crate does not know about lint-crate. It only knows it needs linted-code. Any node that produces linted-code satisfies the dependency. Nodes can be swapped, replaced, or parallelized without changing their consumers.
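To make the resolution step concrete, here is a minimal sketch of turning produce/consume declarations into a staged plan. It uses a plain layered topological sort over `std` collections rather than petgraph, and the names `Node`, `node`, and `plan_stages` are assumptions for this example. A node becomes ready once every capability it consumes has been produced by an earlier stage.

```rust
use std::collections::BTreeSet;

/// Hypothetical reduced node: only the capability declarations matter here.
#[derive(Debug, Clone)]
pub struct Node {
    pub id: String,
    pub consumes: Vec<String>,
    pub produces: Vec<String>,
}

/// Convenience constructor for the example.
pub fn node(id: &str, consumes: &[&str], produces: &[&str]) -> Node {
    Node {
        id: id.to_string(),
        consumes: consumes.iter().map(|c| c.to_string()).collect(),
        produces: produces.iter().map(|p| p.to_string()).collect(),
    }
}

/// Groups nodes into stages; every node in a stage has all inputs satisfied
/// by earlier stages. Fails on a cycle or a capability nobody produces.
pub fn plan_stages(nodes: &[Node]) -> Result<Vec<Vec<String>>, String> {
    let mut available: BTreeSet<&str> = BTreeSet::new();
    let mut pending: Vec<&Node> = nodes.iter().collect();
    let mut stages: Vec<Vec<String>> = Vec::new();

    while !pending.is_empty() {
        // Split pending nodes into those runnable now and those still blocked.
        let (ready, blocked): (Vec<&Node>, Vec<&Node>) = pending
            .into_iter()
            .partition(|n| n.consumes.iter().all(|c| available.contains(c.as_str())));
        if ready.is_empty() {
            let ids: Vec<&str> = blocked.iter().map(|n| n.id.as_str()).collect();
            return Err(format!("unsatisfied or cyclic capabilities: {}", ids.join(", ")));
        }
        // Outputs of this stage become available to later stages only.
        for &n in &ready {
            for cap in &n.produces {
                available.insert(cap.as_str());
            }
        }
        let mut ids: Vec<String> = ready.iter().map(|n| n.id.clone()).collect();
        ids.sort(); // deterministic order inside a stage
        stages.push(ids);
        pending = blocked;
    }
    Ok(stages)
}
```

Fed the four nodes from the listing above, the plan comes out as three stages: lint-crate and fmt-crate together, then build-crate, then install-crate. Swapping lint-crate for any other node that produces linted-code leaves the plan shape unchanged.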

5. Three Independent Auth Planes

Authentication and authorization are split across three independent, non-substitutable planes:

| Plane | Technology | Scope | What it controls |
|---|---|---|---|
| Publisher auth | NATS NKeys (ed25519) | Transport | Who can publish events to `dev.>` subjects |
| Workflow authz | Cedar policies | Orchestrator | Which pipelines/nodes a principal can trigger |
| Execution credentials | Vault (SecretumVault) | Per-node, per-step | Scoped secrets with TTL = node timeout |

Credentials from Vault are injected as environment variables into the Nushell subprocess and revoked on node failure. They never appear in NATS messages, logs, or PipelineContext (redacted before storage).
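A rough sketch of the two mechanics named in this paragraph, environment injection and redaction, using only `std::process::Command`. Everything here is illustrative: `scoped_command`, `redact`, the `[REDACTED]` marker, and the variable name `REGISTRY_TOKEN` are assumptions, and fetching or revoking the lease in Vault is out of scope.

```rust
use std::collections::BTreeMap;
use std::process::Command;

/// Builds a subprocess command that carries the scoped secrets as environment
/// variables. The secrets never become arguments or part of the input payload.
pub fn scoped_command(program: &str, secrets: &BTreeMap<String, String>) -> Command {
    let mut cmd = Command::new(program);
    cmd.envs(secrets);
    cmd
}

/// Replaces every occurrence of a secret value in `text` before it is logged
/// or stored. The marker string is an assumption for this example.
pub fn redact(text: &str, secrets: &BTreeMap<String, String>) -> String {
    secrets
        .values()
        .filter(|s| !s.is_empty())
        .fold(text.to_string(), |acc, secret| {
            acc.replace(secret.as_str(), "[REDACTED]")
        })
}
```

Keeping the secret map separate from the data that flows into PipelineContext means redaction can be applied at the single point where node output is persisted.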

6. Saga Atomicity — Compensation, Not Transactions

Pipelines execute forward through stages. If a stage fails, the orchestrator does not roll back a database transaction — it runs compensate.nu scripts in reverse order through all previously successful stages. This is the Saga pattern:

Stage 0: lint (ok) + fmt (ok)   → executed
Stage 1: build (FAIL)           → trigger compensation
Stage 0 compensation:           → undo lint, undo fmt (in parallel, reverse)

Compensation is best-effort: compensation failures are logged but do not block the pipeline from reaching Compensated status. The DB record captures the full compensation trace.
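The forward/compensate control flow can be sketched as follows. This is a simplified illustration under stated assumptions: steps are plain function pointers instead of Nushell scripts, steps inside a stage run sequentially rather than in parallel, and every step that succeeded before the failure is compensated. `Step`, `Action`, `run_saga`, and the trace strings are hypothetical names.

```rust
/// Final status of a pipeline run in this sketch.
#[derive(Debug, PartialEq)]
pub enum PipelineStatus {
    Completed,
    Compensated,
}

/// Stand-in for invoking a script: Ok on success, Err with a message on failure.
pub type Action = fn() -> Result<(), String>;

pub struct Step {
    pub name: &'static str,
    pub run: Action,
    pub compensate: Option<Action>,
}

/// Runs stages in order. On the first failure, runs the compensation of every
/// previously successful step in reverse order. Compensation errors are only
/// recorded in the trace; they never prevent reaching `Compensated`.
pub fn run_saga(stages: &[Vec<Step>], trace: &mut Vec<String>) -> PipelineStatus {
    let mut completed: Vec<&Step> = Vec::new();
    for stage in stages {
        for step in stage {
            match (step.run)() {
                Ok(()) => {
                    trace.push(format!("ok {}", step.name));
                    completed.push(step);
                }
                Err(e) => {
                    trace.push(format!("fail {}: {}", step.name, e));
                    for done in completed.iter().rev() {
                        if let Some(undo) = done.compensate {
                            match undo() {
                                Ok(()) => trace.push(format!("compensated {}", done.name)),
                                Err(e) => trace
                                    .push(format!("compensation failed {}: {}", done.name, e)),
                            }
                        }
                    }
                    return PipelineStatus::Compensated;
                }
            }
        }
    }
    PipelineStatus::Completed
}

/// Example actions used to drive the sketch.
pub fn succeed() -> Result<(), String> {
    Ok(())
}

pub fn fail() -> Result<(), String> {
    Err("boom".to_string())
}
```

The trace vector plays the role of the compensation trace stored in the DB record: it keeps both successful and failed compensations, so a failed undo stays auditable even though the run still ends as Compensated.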

7. Parallel Stages via JoinSet + CancellationToken

Within each stage, nodes with no capability dependencies on each other execute in parallel using tokio::task::JoinSet. Fail-fast is implemented via CancellationToken: the first node failure cancels the token, aborting all sibling tasks in the stage.

Stage 0: [lint-crate ‖ fmt-crate]  — parallel (no inter-dependency)
Stage 1: [build-crate]             — sequential (needs both capabilities)
Stage 2: [install-crate]           — sequential (needs built-artifact)
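The fail-fast behaviour within one stage can be approximated with the standard library alone. The sketch below deliberately swaps tokio's JoinSet and CancellationToken for `std::thread::scope` and a shared `AtomicBool`, so it shows the coordination pattern, not the orchestrator's actual async code. Cancellation here is cooperative: siblings observe the flag and stop, whereas aborting a tokio task is preemptive at await points. `Outcome`, `NodeFn`, `run_stage`, and the two example nodes are hypothetical.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
use std::time::Duration;

#[derive(Debug, Clone, PartialEq)]
pub enum Outcome {
    Ok,
    Failed(String),
    Cancelled,
}

/// A node receives the shared cancellation flag and reports its outcome.
pub type NodeFn = fn(&AtomicBool) -> Outcome;

/// Runs all nodes of one stage in parallel. The first failure raises the flag
/// so that siblings can stop early. Results keep the input order.
pub fn run_stage(nodes: &[(&str, NodeFn)]) -> Vec<(String, Outcome)> {
    let cancel = AtomicBool::new(false);
    thread::scope(|scope| {
        // Spawn every node before joining any of them.
        let handles: Vec<_> = nodes
            .iter()
            .map(|&(name, node)| {
                let cancel = &cancel;
                let handle = scope.spawn(move || {
                    let outcome = node(cancel);
                    if matches!(outcome, Outcome::Failed(_)) {
                        cancel.store(true, Ordering::SeqCst);
                    }
                    outcome
                });
                (name.to_string(), handle)
            })
            .collect();
        handles
            .into_iter()
            .map(|(name, handle)| {
                let outcome = handle
                    .join()
                    .unwrap_or(Outcome::Failed("node panicked".to_string()));
                (name, outcome)
            })
            .collect()
    })
}

/// Example node that fails right away.
pub fn fails_immediately(_cancel: &AtomicBool) -> Outcome {
    Outcome::Failed("lint error".to_string())
}

/// Example node that does slow work and checks the flag between steps.
pub fn waits_for_cancel(cancel: &AtomicBool) -> Outcome {
    for _ in 0..2000 {
        if cancel.load(Ordering::SeqCst) {
            return Outcome::Cancelled;
        }
        thread::sleep(Duration::from_millis(1));
    }
    Outcome::Ok
}
```

Running a stage made of `fails_immediately` and `waits_for_cancel` ends with the first node reported as failed and the second as cancelled, which is the signal a stage runner would use to start compensation.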

8. OCI for Everything — Content-Addressed Artifacts

Both node definitions and the Nickel base library are published as OCI artifacts to a Zot registry. The publish pipeline for each is: nickel typecheck → gitleaks detect → nickel export → sha256sum → oras push with content-hash annotations.

The ncl-import-resolver binary bridges OCI → local filesystem: it pulls each referenced OCI layer at orchestrator startup, verifies the digest against the annotated hash, then exposes a local path for Nickel imports. This prevents loading unverified or tampered node definitions.

This follows the same model as container images: build → scan → publish → consume by digest.

9. Nushell as Execution Unit — Agnostic by Design

Each action node's handler is a Nushell script. The executor spawns nu --no-config-file <script.nu>, passes PipelineContext inputs as JSON on stdin, and reads the output JSON from stdout. This makes execution:

- Domain-agnostic: the orchestrator has no knowledge of what the script does
- Hot-replaceable: updating a workflow means replacing a .nu file, not recompiling the binary
- Sandboxable: each node runs in its own process with scoped Vault credentials
- Testable independently: scripts can be invoked directly with echo '{}' | nu script.nu
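The stdin/stdout contract described above maps directly onto `std::process::Command`. The following is a minimal synchronous sketch, assuming `nu` is on the PATH; the function names `nu_command` and `run_node` and the script name `build.nu` are hypothetical, and the real executor may well be async and enforce timeouts.

```rust
use std::io::Write;
use std::process::{Command, Stdio};

/// Builds the `nu --no-config-file <script>` invocation with piped stdin/stdout.
pub fn nu_command(script: &str) -> Command {
    let mut cmd = Command::new("nu");
    cmd.arg("--no-config-file")
        .arg(script)
        .stdin(Stdio::piped())
        .stdout(Stdio::piped());
    cmd
}

/// Spawns the command, writes the JSON input to stdin, and returns stdout.
/// A non-zero exit status is reported as an error.
pub fn run_node(mut cmd: Command, input_json: &str) -> std::io::Result<String> {
    let mut child = cmd.spawn()?;
    if let Some(mut stdin) = child.stdin.take() {
        stdin.write_all(input_json.as_bytes())?;
        // `stdin` is dropped at the end of this block, which signals EOF to the script.
    }
    let output = child.wait_with_output()?;
    if !output.status.success() {
        return Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            format!("node exited with {}", output.status),
        ));
    }
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}
```

A call such as `run_node(nu_command("build.nu"), "{}")` is then the programmatic equivalent of the `echo '{}' | nu script.nu` check mentioned in the list. For large payloads, stdin should be written from a separate thread so that a script producing a lot of output cannot block on a full pipe.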

10. TypeDialog Scope — Startup Config Only

TypeDialog is used exclusively for orchestrator startup configuration (SurrealDB URL, NATS URL, Zot URL, Vault URL, log level, feature flags). It is not used for project declaratives, workflow definitions, or node configurations. Those live in Nickel files managed per project. This prevents TypeDialog from becoming a catch-all config tool and keeps its scope bounded.

Decision

stratum-orchestrator is implemented as a new crate family in the StratumIOps monorepo:

| Crate | Domain | Responsibility |
|---|---|---|
| `stratum-graph` | Knowledge | `ActionNode`, `Capability`, `GraphRepository` trait |
| `stratum-state` | Operational | `PipelineRun`, `StepRecord`, `StateTracker` trait |
| `platform-nats` | Transport | JetStream consumer with NKey auth |
| `stratum-orchestrator` | Coordination | `ActionGraph`, `PipelineContext`, `StageRunner`, auth, executor |

Domain isolation is structural: stratum-graph and stratum-state are separate crates with separate SurrealDB table namespaces. stratum-orchestrator depends on their traits, not their implementations — compile-time enforcement.

The orchestrator binary startup sequence: load TypeDialog config → connect SurrealDB → connect NATS → resolve OCI Nickel imports → build ActionGraph → start notify watcher → initialize Cedar policies → start HTTP server (health + agent callback) → enter JetStream pull loop.

Rationale

Why Not a General-Purpose Workflow Engine (Temporal, Argo, etc.)?

| Concern | External engine | stratum-orchestrator |
|---|---|---|
| Cross-project event model | Requires adapter per project | Native NATS subject matching |
| Nickel integration | Not possible | First-class: nodes are `.ncl` files |
| Nushell execution | Not supported | Native subprocess executor |
| Operational footprint | Heavy (Temporal cluster, Argo K8s) | Single binary + SurrealDB + NATS |
| Custom auth model | Difficult to extend | Three planes designed in |

Why Saga over 2PC?

Two-phase commit across distributed Nushell scripts is not feasible — scripts are external processes with no transaction coordinator. Saga compensation scripts (compensate.nu) are the only realistic atomicity model for cross-process workflows. The trade-off is accepted: compensation is best-effort, not guaranteed-atomic, but the failure cases are logged and auditable.

Why In-Memory ActionGraph vs DB-Persisted Nodes?

Storing node definitions in SurrealDB creates dual truth. The file on disk and the DB record can diverge. Hot-reload via notify on the filesystem is simpler, faster, and eliminates the sync problem. SurrealDB is used only for operational state (pipeline runs, capability stores) — knowledge (node definitions) stays in the filesystem.

Consequences

Accepted trade-offs:

- nickel export runs as one subprocess call per file at startup, which adds ~50ms per node file to startup time. Mitigated by loading files in parallel with JoinSet during startup.
- Saga compensation is best-effort: a compensation script that itself fails is logged but does not block status progression. This is a known Saga trade-off.
- Nushell subprocess overhead per node: each node execution spawns a process. For sub-second scripts the spawn cost is a noticeable share of total latency. Acceptable for CI/CD and infrastructure workflows.
- OCI layer pull at startup: cold starts require pulling Nickel lib layers. Mitigated by a local digest cache in ~/.cache/stratum/ncl/.

Benefits gained:

- New workflows require only new .ncl files, with zero orchestrator binary changes
- Full pipeline audit trail in SurrealDB: every step, every capability deposit, every compensation
- Crash recovery needs no extra machinery: restart the orchestrator and the pipeline resumes from the last persisted state
- Auth is non-negotiable: publisher identity (NKeys), workflow authorization (Cedar), and execution credentials (Vault) are enforced at every pipeline invocation
- Horizontal scaling: stateless orchestrator + shared SurrealDB enables multiple instances on the same event stream

References

- Implementation plan: .coder/2026-02-20-stratum-orchestrator-plan.plan.md
- Architecture diagram: assets/diagrams/arch-stratum-orchestrator.svg
- Build pipeline flow: assets/diagrams/flow-stratum-build-pipeline.svg
- Nickel base library: nickel/stratum-base/stratum-base.ncl
- Crates: crates/stratum-graph/, crates/stratum-state/, crates/platform-nats/, crates/stratum-orchestrator/
- Related: ADR-001 (stratum-embeddings), ADR-002 (stratum-llm)
