stratumiops/docs/en/architecture/adrs/003-stratum-orchestrator.md
Jesús Pérez 9095ea6d8e
feat: add stratum-orchestrator with graph, state, NATS, and Nickel action nodes
New crates: stratum-orchestrator (Cedar authz, Vault secrets, Nu/agent executors,
  saga runner), stratum-graph (petgraph DAG + SurrealDB repo), stratum-state
  (SurrealDB tracker), platform-nats (NKey auth client), ncl-import-resolver.

  Updates: stratum-embeddings (SurrealDB store + persistent cache), stratum-llm
  circuit breaker. Adds Nickel action-nodes, schemas, config, Nushell scripts,
  docker-compose dev stack, and ADR-003.
2026-02-22 21:33:26 +00:00

ADR-003: Stratum-Orchestrator — Graph-Driven Workflow Orchestrator

Status: Accepted
Date: 2026-02-20

Context

The StratumIOps ecosystem spans multiple active projects (provisioning, kogral, syntaxis, typedialog) that evolve concurrently with cross-project dependencies. Each project triggers build, validation, publish, and notification workflows in response to code changes. Before this decision, these workflows were:

  • Hardcoded per-project scripts with no shared execution model
  • Not auditable — no durable record of which step ran, in what order, with what outcome
  • Not composable — a workflow in provisioning could not react to an event from kogral
  • Not safe — no atomicity, no rollback on partial failure, no credential scoping
  • Not scalable — adding a new project meant copying and adapting scripts, accumulating drift

The core requirement is a cross-project, event-driven workflow orchestrator that is agnostic to project and provider: it must coordinate build pipelines, AI agent tasks, and infrastructure operations without being modified for each new project or provider.

Two design directions were evaluated:

| Approach | Description | Consequence |
|---|---|---|
| Static rule router | Map NATS subject patterns to scripts in a config file | Rules multiply with projects; changing a workflow requires editing the router |
| Graph-driven engine | Events traverse a DAG of action nodes declared in Nickel; orchestrator is agnostic | Orchestrator never changes for new workflows |

The graph-driven approach was chosen. The orchestrator loads node definitions from Nickel files, builds an in-memory ActionGraph, and executes pipelines by traversing the graph — it does not know what the pipeline does, only how to coordinate it.

Fundamental Design Characteristics

1. Graph-Driven Execution, Not Static Routing

The orchestrator does not contain routing tables. Instead, NATS subject patterns are matched against an ActionGraph built from Nickel node definitions. Each ActionNode declares:

  • trigger: NATS subject patterns that can activate this node as an entry point
  • input_schemas: capabilities required before execution
  • output_schemas: capabilities produced after execution
  • compensate: optional rollback script for Saga atomicity

The graph is built by traversal of producer/consumer capability indexes. Topological sort produces a staged execution plan. Adding a new workflow means adding .ncl files — the orchestrator binary never changes.
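To make the entry-point lookup concrete, here is a minimal sketch of how a trigger pattern could be matched against an incoming subject. The `ActionNode` struct only mirrors the four fields listed above; its exact shape, and the helper names `subject_matches` and `entry_points`, are assumptions for illustration rather than the crate's actual API. The wildcard rules are the standard NATS ones: `*` matches exactly one token and `>` matches one or more trailing tokens.

```rust
/// Hypothetical node shape, mirroring the four declared fields.
#[derive(Debug, Clone)]
pub struct ActionNode {
    pub id: String,
    pub trigger: Vec<String>,        // NATS subject patterns
    pub input_schemas: Vec<String>,  // capabilities required
    pub output_schemas: Vec<String>, // capabilities produced
    pub compensate: Option<String>,  // optional rollback script path
}

/// NATS-style matching: `*` = exactly one token, `>` = one or more
/// trailing tokens (only valid as the last token of the pattern).
pub fn subject_matches(pattern: &str, subject: &str) -> bool {
    let mut p = pattern.split('.');
    let mut s = subject.split('.');
    loop {
        match (p.next(), s.next()) {
            (Some(">"), Some(_)) => return p.next().is_none(),
            (Some("*"), Some(_)) => {}
            (Some(a), Some(b)) if a == b => {}
            (None, None) => return true,
            _ => return false,
        }
    }
}

/// Ids of all nodes that declare a trigger matching `subject`.
pub fn entry_points<'a>(nodes: &'a [ActionNode], subject: &str) -> Vec<&'a str> {
    nodes
        .iter()
        .filter(|n| n.trigger.iter().any(|t| subject_matches(t, subject)))
        .map(|n| n.id.as_str())
        .collect()
}
```

With a node whose trigger is `dev.*.pushed`, calling `entry_points(&nodes, "dev.kogral.pushed")` would return that node's id, while a node with an empty `trigger` list can only be reached through the capability graph.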

2. Stateless Orchestrator — DB-First State

The orchestrator process holds no durable state. Every PipelineContext write goes to SurrealDB first, then updates an in-memory DashMap cache. On crash, a restarted instance reconstructs the cache from the DB. This enables:

  • Horizontal scaling: multiple orchestrator instances share one SurrealDB, no split-brain
  • Crash recovery: pipelines resume from last persisted capability, not from the beginning
  • Observability: full pipeline state is queryable at any point without instrumenting the process
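The write ordering described above (durable store first, cache second) can be sketched as follows. This is an illustration under stated assumptions: `DurableStore`, `MemStore`, and `PipelineCache` are hypothetical names, a `Mutex<HashMap>` stands in for the DashMap cache, and an in-memory vector stands in for SurrealDB.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical persistence boundary; SurrealDB plays this role in the ADR.
pub trait DurableStore {
    fn persist(&self, pipeline: &str, capability: &str, value: &str) -> Result<(), String>;
    fn load_all(&self) -> Vec<(String, String, String)>;
}

/// In-memory stand-in for the database, used only for illustration.
#[derive(Default)]
pub struct MemStore {
    pub rows: Mutex<Vec<(String, String, String)>>,
    pub reject_writes: bool,
}

impl DurableStore for MemStore {
    fn persist(&self, pipeline: &str, capability: &str, value: &str) -> Result<(), String> {
        if self.reject_writes {
            return Err("store unavailable".to_string());
        }
        self.rows.lock().unwrap().push((pipeline.into(), capability.into(), value.into()));
        Ok(())
    }
    fn load_all(&self) -> Vec<(String, String, String)> {
        self.rows.lock().unwrap().clone()
    }
}

pub struct PipelineCache<'a, S: DurableStore> {
    store: &'a S,
    cache: Mutex<HashMap<(String, String), String>>, // stand-in for DashMap
}

impl<'a, S: DurableStore> PipelineCache<'a, S> {
    /// Crash-recovery path: rebuild the cache from what the store holds.
    pub fn recover(store: &'a S) -> Self {
        let cache: HashMap<_, _> =
            store.load_all().into_iter().map(|(p, c, v)| ((p, c), v)).collect();
        Self { store, cache: Mutex::new(cache) }
    }

    /// Durable write first; the cache is touched only if that succeeded.
    pub fn deposit(&self, pipeline: &str, capability: &str, value: &str) -> Result<(), String> {
        self.store.persist(pipeline, capability, value)?;
        self.cache
            .lock()
            .unwrap()
            .insert((pipeline.to_string(), capability.to_string()), value.to_string());
        Ok(())
    }

    pub fn get(&self, pipeline: &str, capability: &str) -> Option<String> {
        let key = (pipeline.to_string(), capability.to_string());
        self.cache.lock().unwrap().get(&key).cloned()
    }
}
```

Because the cache never holds a value the store rejected, dropping a `PipelineCache` and calling `recover` on the same store yields the same view, which is the property the crash-recovery bullet relies on.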

3. Nickel as Single Source of Truth — No Dual Truth

Action nodes, capability schemas, and the orchestrator startup config are all defined in Nickel. There is no separate indexing process, no database copy of node definitions used at runtime. The ActionGraph is built in-memory from .ncl files at startup via nickel export --format json, then kept live via a notify file watcher for hot-reload.

This eliminates the dual-truth problem: the Nickel file IS the definition. No risk of a database record diverging from the file on disk.
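As a rough sketch of the startup load, the export step could be wrapped like this using only the standard library. The function names are hypothetical, and passing the file as a positional argument is an assumption about the installed Nickel CLI version; the ADR itself only specifies `nickel export --format json`.

```rust
use std::path::Path;
use std::process::Command;

/// Build the `nickel export --format json <file>` invocation.
/// Assumption: the node file is accepted as a positional argument.
pub fn nickel_export_command(ncl_file: &Path) -> Command {
    let mut cmd = Command::new("nickel");
    cmd.arg("export").arg("--format").arg("json").arg(ncl_file);
    cmd
}

/// Run the export and return the raw JSON text, or stderr on failure.
pub fn export_node_json(ncl_file: &Path) -> Result<String, String> {
    let output = nickel_export_command(ncl_file)
        .output()
        .map_err(|e| format!("failed to spawn nickel: {e}"))?;
    if !output.status.success() {
        return Err(String::from_utf8_lossy(&output.stderr).into_owned());
    }
    String::from_utf8(output.stdout).map_err(|e| e.to_string())
}
```

The returned JSON would then be deserialized into node definitions; a hot-reload handler could call the same function for the single file that changed.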

4. Capability Model — Dependency Inversion at Execution Level

Nodes do not depend on each other directly. They declare capabilities they produce and consume. The graph engine resolves dependencies:

lint-crate  → produces: linted-code
fmt-crate   → produces: formatted-code
build-crate → consumes: linted-code, formatted-code
            → produces: built-artifact
install-crate → consumes: built-artifact

This is Dependency Inversion applied to the execution domain: build-crate does not know about lint-crate. It only knows it needs linted-code. Any node that produces linted-code satisfies the dependency. Nodes can be swapped, replaced, or parallelized without changing their consumers.
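The resolution above can be sketched as a layered topological sort: a node becomes ready once every capability it consumes has been produced by an earlier stage. This is one way to derive a staged plan, not necessarily how stratum-graph does it; `Node`, `plan_stages`, and `crate_pipeline` are hypothetical names.

```rust
use std::collections::HashSet;

pub struct Node {
    pub id: &'static str,
    pub consumes: Vec<&'static str>,
    pub produces: Vec<&'static str>,
}

/// Group nodes into stages by capability availability.
/// Returns an error if some node can never be satisfied (missing producer or cycle).
pub fn plan_stages(nodes: &[Node]) -> Result<Vec<Vec<&'static str>>, String> {
    let mut available: HashSet<&'static str> = HashSet::new();
    let mut remaining: Vec<&Node> = nodes.iter().collect();
    let mut stages = Vec::new();
    while !remaining.is_empty() {
        let (ready, blocked): (Vec<&Node>, Vec<&Node>) = remaining
            .into_iter()
            .partition(|n| n.consumes.iter().all(|c| available.contains(c)));
        if ready.is_empty() {
            let ids: Vec<_> = blocked.iter().map(|n| n.id).collect();
            return Err(format!("unsatisfied or cyclic dependencies: {ids:?}"));
        }
        // Outputs become visible only to later stages.
        for n in &ready {
            available.extend(n.produces.iter().copied());
        }
        stages.push(ready.iter().map(|n| n.id).collect());
        remaining = blocked;
    }
    Ok(stages)
}

/// The four nodes from the example above.
pub fn crate_pipeline() -> Vec<Node> {
    vec![
        Node { id: "lint-crate", consumes: vec![], produces: vec!["linted-code"] },
        Node { id: "fmt-crate", consumes: vec![], produces: vec!["formatted-code"] },
        Node { id: "build-crate", consumes: vec!["linted-code", "formatted-code"], produces: vec!["built-artifact"] },
        Node { id: "install-crate", consumes: vec!["built-artifact"], produces: vec![] },
    ]
}
```

Calling `plan_stages(&crate_pipeline())` groups `lint-crate` and `fmt-crate` into the first stage, followed by `build-crate` and then `install-crate`. Swapping `lint-crate` for any other node that produces `linted-code` leaves the plan's shape unchanged.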

5. Three Independent Auth Planes

Authentication and authorization are split across three independent, non-substitutable planes:

| Plane | Technology | Scope | What it controls |
|---|---|---|---|
| Publisher auth | NATS NKeys (ed25519) | Transport | Who can publish events to dev.> subjects |
| Workflow authz | Cedar policies | Orchestrator | Which pipelines/nodes a principal can trigger |
| Execution credentials | Vault (SecretumVault) | Per-node, per-step | Scoped secrets with TTL = node timeout |

Credentials from Vault are injected as environment variables into the Nushell subprocess and revoked on node failure. They never appear in NATS messages, logs, or PipelineContext (redacted before storage).
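The ADR does not say how redaction is performed. One simple approach, shown here purely as an assumption, is to scrub every known secret value out of the context before it is persisted. The map-of-strings context shape, the `redact` helper, and the `[REDACTED]` marker are all hypothetical.

```rust
use std::collections::BTreeMap;

pub const REDACTED: &str = "[REDACTED]";

/// Return a copy of `context` with every occurrence of a secret value replaced.
/// Assumption: the context is a flat string map; the real PipelineContext may differ.
pub fn redact(context: &BTreeMap<String, String>, secrets: &[&str]) -> BTreeMap<String, String> {
    context
        .iter()
        .map(|(key, value)| {
            let mut clean = value.clone();
            // Skip empty secrets: replacing "" would corrupt every value.
            for secret in secrets.iter().filter(|s| !s.is_empty()) {
                clean = clean.replace(*secret, REDACTED);
            }
            (key.clone(), clean)
        })
        .collect()
}
```

Value-based scrubbing catches a secret that a script echoes into its output, which key-based filtering alone would miss.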

6. Saga Atomicity — Compensation, Not Transactions

Pipelines execute forward through stages. If a stage fails, the orchestrator does not roll back a database transaction — it runs compensate.nu scripts in reverse order through all previously successful stages. This is the Saga pattern:

Stage 0: lint (ok) + fmt (ok)   → executed
Stage 1: build (FAIL)           → trigger compensation
Stage 0 compensation:           → undo lint, undo fmt (in parallel, reverse)

Compensation is best-effort: compensation failures are logged but do not block the pipeline from reaching Compensated status. The DB record captures the full compensation trace.
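The forward-then-compensate flow can be sketched as below. This is a simplified, sequential model: plain function pointers stand in for the Nushell handler and compensate.nu scripts, compensation runs one step at a time rather than in parallel, and steps that already succeeded in the failing stage are compensated too. All of these are assumptions of the sketch, and every name in it is hypothetical.

```rust
pub type Action = fn() -> Result<(), String>;

pub struct Step {
    pub name: &'static str,
    pub run: Action,
    pub compensate: Option<Action>,
}

#[derive(Debug, PartialEq)]
pub enum Outcome {
    Completed,
    Compensated { failed_step: &'static str, trace: Vec<String> },
}

pub fn run_saga(stages: &[Vec<Step>]) -> Outcome {
    let mut succeeded: Vec<&Step> = Vec::new();
    for stage in stages {
        for step in stage {
            if (step.run)().is_ok() {
                succeeded.push(step);
                continue;
            }
            // Failure: undo completed steps in reverse order, best-effort.
            let mut trace = Vec::new();
            for done in succeeded.iter().rev() {
                match done.compensate.map(|undo| undo()) {
                    Some(Ok(())) => trace.push(format!("{}: compensated", done.name)),
                    // A failed compensation is recorded but does not stop the unwind.
                    Some(Err(e)) => trace.push(format!("{}: compensation failed: {e}", done.name)),
                    None => trace.push(format!("{}: nothing to compensate", done.name)),
                }
            }
            return Outcome::Compensated { failed_step: step.name, trace };
        }
    }
    Outcome::Completed
}

pub fn ok() -> Result<(), String> { Ok(()) }
pub fn build_fails() -> Result<(), String> { Err("compile error".to_string()) }
pub fn undo_fails() -> Result<(), String> { Err("disk full".to_string()) }

/// lint and fmt succeed, build fails; fmt's compensation itself fails.
pub fn example_pipeline() -> Vec<Vec<Step>> {
    vec![
        vec![
            Step { name: "lint", run: ok, compensate: Some(ok as Action) },
            Step { name: "fmt", run: ok, compensate: Some(undo_fails as Action) },
        ],
        vec![Step { name: "build", run: build_fails, compensate: None }],
    ]
}
```

Running `run_saga(&example_pipeline())` ends in `Compensated` even though the `fmt` compensation fails; the failure shows up only in the trace, matching the best-effort rule.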

7. Parallel Stages via JoinSet + CancellationToken

Within each stage, nodes with no capability dependencies on each other execute in parallel using tokio::task::JoinSet. Fail-fast is implemented via CancellationToken: the first node failure cancels the token, aborting all sibling tasks in the stage.

Stage 0: [lint-crate ‖ fmt-crate]  — parallel (no inter-dependency)
Stage 1: [build-crate]             — sequential (needs both capabilities)
Stage 2: [install-crate]           — sequential (needs built-artifact)
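The fail-fast behaviour can be illustrated without an async runtime. The sketch below deliberately swaps in `std::thread::scope` and an `AtomicBool` for `JoinSet` and `CancellationToken`, so it is an analogy rather than the orchestrator's code: threads cannot be aborted, so cancellation here is cooperative and each node polls the flag. All names are hypothetical.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
use std::time::Duration;

#[derive(Debug, PartialEq)]
pub enum NodeError {
    Failed(String),
    Cancelled,
}

/// A node body receives the shared cancel flag and polls it at safe points.
pub type NodeFn = fn(&AtomicBool) -> Result<(), NodeError>;

pub fn run_stage(nodes: &[(&'static str, NodeFn)]) -> Result<(), String> {
    let cancelled = AtomicBool::new(false);
    let flag = &cancelled;
    let results: Vec<(&'static str, Result<(), NodeError>)> = thread::scope(|scope| {
        let handles: Vec<_> = nodes
            .iter()
            .map(|&(name, node)| {
                scope.spawn(move || {
                    let result = node(flag);
                    // The first real failure cancels every sibling.
                    if let Err(NodeError::Failed(_)) = &result {
                        flag.store(true, Ordering::SeqCst);
                    }
                    (name, result)
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("node thread panicked"))
            .collect::<Vec<_>>()
    });

    let mut cancelled_seen = false;
    for (name, result) in results {
        match result {
            Err(NodeError::Failed(reason)) => return Err(format!("{name}: {reason}")),
            Err(NodeError::Cancelled) => cancelled_seen = true,
            Ok(()) => {}
        }
    }
    if cancelled_seen { Err("stage cancelled".to_string()) } else { Ok(()) }
}

pub fn ok_node(_cancel: &AtomicBool) -> Result<(), NodeError> { Ok(()) }

pub fn failing_node(_cancel: &AtomicBool) -> Result<(), NodeError> {
    Err(NodeError::Failed("compile error".to_string()))
}

/// Simulates long work (up to about 5 s) that stops as soon as it sees the flag.
pub fn polling_node(cancel: &AtomicBool) -> Result<(), NodeError> {
    for _ in 0..5_000 {
        if cancel.load(Ordering::SeqCst) {
            return Err(NodeError::Cancelled);
        }
        thread::sleep(Duration::from_millis(1));
    }
    Ok(())
}
```

A stage containing `polling_node` and `failing_node` returns the failing node's error, and the polling node exits early instead of running to completion. In the tokio version the same effect comes from racing each task against the token's cancellation future.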

8. OCI for Everything — Content-Addressed Artifacts

Both node definitions and the Nickel base library are published as OCI artifacts to a Zot registry. The publish pipeline for each is: nickel typecheck → gitleaks detect → nickel export → sha256sum → oras push with content-hash annotations.

The ncl-import-resolver binary bridges OCI → local filesystem: it pulls each referenced OCI layer at orchestrator startup, verifies the digest against the annotated hash, then exposes a local path for Nickel imports. This prevents loading unverified or tampered node definitions.

This follows the same model as container images: build → scan → publish → consume by digest.

9. Nushell as Execution Unit — Agnostic by Design

Each action node's handler is a Nushell script. The executor spawns nu --no-config-file <script.nu>, passes PipelineContext inputs as JSON on stdin, and reads the output JSON from stdout. This makes execution:

  • Domain-agnostic: the orchestrator has no knowledge of what the script does
  • Hot-replaceable: updating a workflow means replacing a .nu file, not recompiling the binary
  • Sandboxable: each node runs in its own process with scoped Vault credentials
  • Testable independently: scripts can be invoked directly with echo '{}' | nu script.nu
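The executor contract described above (JSON in on stdin, JSON out on stdout, scoped credentials in the environment) can be sketched with `std::process`. The function names and the environment variable used in the usage note are hypothetical; only the `nu --no-config-file <script.nu>` invocation comes from the text.

```rust
use std::io::Write;
use std::path::Path;
use std::process::{Command, Stdio};

/// Build the `nu --no-config-file <script>` invocation with scoped
/// credentials injected as environment variables.
pub fn nu_command(script: &Path, scoped_env: &[(String, String)]) -> Command {
    let mut cmd = Command::new("nu");
    cmd.arg("--no-config-file")
        .arg(script)
        .envs(scoped_env.iter().map(|(k, v)| (k.as_str(), v.as_str())))
        .stdin(Stdio::piped())
        .stdout(Stdio::piped());
    cmd
}

/// Spawn the process, feed `input_json` on stdin, and return its stdout.
/// Expects a command whose stdin and stdout are piped, as `nu_command` sets up.
pub fn run_node(mut cmd: Command, input_json: &str) -> std::io::Result<String> {
    let mut child = cmd.spawn()?;
    // The stdin handle is dropped at the end of this statement,
    // so the script sees EOF after the JSON payload.
    child
        .stdin
        .take()
        .expect("stdin must be piped")
        .write_all(input_json.as_bytes())?;
    let output = child.wait_with_output()?;
    if !output.status.success() {
        return Err(std::io::Error::new(
            std::io::ErrorKind::Other,
            format!("node exited with {}", output.status),
        ));
    }
    Ok(String::from_utf8_lossy(&output.stdout).into_owned())
}
```

A call such as `run_node(nu_command(Path::new("handlers/build.nu"), &env), "{}")` corresponds to the manual `echo '{}' | nu script.nu` test. For large payloads, writing stdin from a separate thread would avoid a pipe deadlock; this sketch assumes small inputs.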

10. TypeDialog Scope — Startup Config Only

TypeDialog is used exclusively for orchestrator startup configuration (SurrealDB URL, NATS URL, Zot URL, Vault URL, log level, feature flags). It is not used for project declaratives, workflow definitions, or node configurations. Those live in Nickel files managed per project. This prevents TypeDialog from becoming a catch-all config tool and keeps its scope bounded.

Decision

stratum-orchestrator is implemented as a new crate family in the StratumIOps monorepo:

| Crate | Domain | Responsibility |
|---|---|---|
| stratum-graph | Knowledge | ActionNode, Capability, GraphRepository trait |
| stratum-state | Operational | PipelineRun, StepRecord, StateTracker trait |
| platform-nats | Transport | JetStream consumer with NKey auth |
| stratum-orchestrator | Coordination | ActionGraph, PipelineContext, StageRunner, auth, executor |

Domain isolation is structural: stratum-graph and stratum-state are separate crates with separate SurrealDB table namespaces. stratum-orchestrator depends on their traits, not their implementations — compile-time enforcement.

The orchestrator binary startup sequence: load TypeDialog config → connect SurrealDB → connect NATS → resolve OCI Nickel imports → build ActionGraph → start notify watcher → initialize Cedar policies → start HTTP server (health + agent callback) → enter JetStream pull loop.

Rationale

Why Not a General-Purpose Workflow Engine (Temporal, Argo, etc.)?

| Concern | External engine | stratum-orchestrator |
|---|---|---|
| Cross-project event model | Requires adapter per project | Native NATS subject matching |
| Nickel integration | Not possible | First-class: nodes are .ncl files |
| Nushell execution | Not supported | Native subprocess executor |
| Operational footprint | Heavy (Temporal cluster, Argo K8s) | Single binary + SurrealDB + NATS |
| Custom auth model | Difficult to extend | Three planes designed in |

Why Saga over 2PC?

Two-phase commit across distributed Nushell scripts is not feasible — scripts are external processes with no transaction coordinator. Saga compensation scripts (compensate.nu) are the only realistic atomicity model for cross-process workflows. The trade-off is accepted: compensation is best-effort, not guaranteed-atomic, but the failure cases are logged and auditable.

Why In-Memory ActionGraph vs DB-Persisted Nodes?

Storing node definitions in SurrealDB creates dual truth. The file on disk and the DB record can diverge. Hot-reload via notify on the filesystem is simpler, faster, and eliminates the sync problem. SurrealDB is used only for operational state (pipeline runs, capability stores) — knowledge (node definitions) stays in the filesystem.

Consequences

Accepted trade-offs:

  • nickel export is a subprocess call per file at startup — adds ~50ms per node file to startup time. Mitigated by parallel load with JoinSet during startup.
  • Saga compensation is best-effort — a compensation script that itself fails is logged but does not block status progression. This is a known Saga trade-off.
  • Nushell subprocess overhead per node — each node execution spawns a process. For sub-second scripts this is observable latency. Acceptable for CI/CD and infrastructure workflows.
  • OCI layer pull at startup — cold starts require pulling Nickel lib layers. Mitigated by local digest cache in ~/.cache/stratum/ncl/.

Benefits gained:

  • New workflows require only new .ncl files — zero orchestrator binary changes
  • Full pipeline audit trail in SurrealDB: every step, every capability deposit, every compensation
  • Crash recovery is free: restart the orchestrator, pipeline resumes from last persisted state
  • Auth is non-negotiable: publisher identity (NKeys), workflow authorization (Cedar), and execution credentials (Vault) are enforced at every pipeline invocation
  • Horizontal scaling: stateless orchestrator + shared SurrealDB enables multiple instances on the same event stream

References

  • Implementation plan: .coder/2026-02-20-stratum-orchestrator-plan.plan.md
  • Architecture diagram: assets/diagrams/arch-stratum-orchestrator.svg
  • Build pipeline flow: assets/diagrams/flow-stratum-build-pipeline.svg
  • Nickel base library: nickel/stratum-base/stratum-base.ncl
  • Crates: crates/stratum-graph/, crates/stratum-state/, crates/platform-nats/, crates/stratum-orchestrator/
  • Related: ADR-001 (stratum-embeddings), ADR-002 (stratum-llm)