# ADR-003: Stratum-Orchestrator — Graph-Driven Workflow Orchestrator

**Status**: Accepted

**Date**: 2026-02-20

## Context

The StratumIOps ecosystem spans multiple active projects (provisioning, kogral, syntaxis, typedialog) that evolve concurrently with cross-project dependencies. Each project triggers build, validation, publish, and notification workflows in response to code changes. Before this decision, these workflows were:

- Hardcoded per-project scripts with no shared execution model
- Not auditable — no durable record of which step ran, in what order, with what outcome
- Not composable — a workflow in `provisioning` could not react to an event from `kogral`
- Not safe — no atomicity, no rollback on partial failure, no credential scoping
- Not scalable — adding a new project meant copying and adapting scripts, accumulating drift

The core requirement is a **cross-project, event-driven, agnostic workflow orchestrator** that can coordinate build pipelines, AI agent tasks, and infrastructure operations without being modified for each new project or provider.

Two design directions were evaluated:

| Approach | Description | Assessment |
|----------|-------------|------------|
| **Static rule router** | Map NATS subject patterns to scripts in a config file | Rules multiply with projects; changing a workflow requires editing the router |
| **Graph-driven engine** | Events traverse a DAG of action nodes declared in Nickel; orchestrator is agnostic | Orchestrator never changes for new workflows |

The graph-driven approach was chosen. The orchestrator loads node definitions from Nickel files, builds an in-memory `ActionGraph`, and executes pipelines by traversing the graph — it does not know what the pipeline does, only how to coordinate it.

## Fundamental Design Characteristics

### 1. Graph-Driven Execution, Not Static Routing

The orchestrator does not contain routing tables. Instead, NATS subject patterns are matched against an `ActionGraph` built from Nickel node definitions. Each `ActionNode` declares:

- `trigger`: NATS subject patterns that can activate this node as an entry point
- `input_schemas`: capabilities required before execution
- `output_schemas`: capabilities produced after execution
- `compensate`: optional rollback script for Saga atomicity

The graph is built by traversal of producer/consumer capability indexes. Topological sort produces a staged execution plan. **Adding a new workflow means adding `.ncl` files — the orchestrator binary never changes.**
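A minimal Rust sketch of this node shape. The field types and the toy subject matcher are illustrative assumptions, not the actual `stratum-graph` definitions (the matcher handles only the single-token `*` wildcard, omitting NATS's multi-token `>`):

```rust
// Hypothetical sketch of an ActionNode as loaded from a .ncl file.
// Field names mirror the Nickel keys; types are illustrative only.
#[derive(Debug, Clone)]
struct ActionNode {
    id: String,
    trigger: Vec<String>,        // NATS subject patterns, e.g. "dev.*.build"
    input_schemas: Vec<String>,  // capabilities required before execution
    output_schemas: Vec<String>, // capabilities produced after execution
    compensate: Option<String>,  // optional rollback script path
}

/// Tiny subject matcher: '*' matches exactly one dot-separated token.
/// (NATS also supports '>' for multi-token tails; omitted here.)
fn matches_subject(pattern: &str, subject: &str) -> bool {
    let p: Vec<&str> = pattern.split('.').collect();
    let s: Vec<&str> = subject.split('.').collect();
    p.len() == s.len() && p.iter().zip(s.iter()).all(|(pt, st)| *pt == "*" || pt == st)
}

fn main() {
    let node = ActionNode {
        id: "build-crate".into(),
        trigger: vec!["dev.*.build".into()],
        input_schemas: vec!["linted-code".into(), "formatted-code".into()],
        output_schemas: vec!["built-artifact".into()],
        compensate: Some("compensate.nu".into()),
    };
    let fired = node.trigger.iter().any(|t| matches_subject(t, "dev.kogral.build"));
    println!("{} fires: {}", node.id, fired); // prints "build-crate fires: true"
}
```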

### 2. Stateless Orchestrator — DB-First State

The orchestrator process holds no durable state. Every `PipelineContext` write goes to SurrealDB first, then updates an in-memory `DashMap` cache. On crash, a restarted instance reconstructs the cache from the DB. This enables:

- **Horizontal scaling**: multiple orchestrator instances share one SurrealDB, no split-brain
- **Crash recovery**: pipelines resume from last persisted capability, not from the beginning
- **Observability**: full pipeline state is queryable at any point without instrumenting the process
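The write ordering can be sketched as a write-through pattern: persist first, then cache, and recover by rebuilding the cache from the durable store. Everything below is a stand-in (plain `HashMap`s play SurrealDB and the `DashMap` cache) to show the shape, not the real `StateTracker` API:

```rust
use std::collections::HashMap;

// Hypothetical stand-ins: `db` plays SurrealDB (durable), `cache` plays DashMap.
struct Tracker {
    db: HashMap<String, String>,
    cache: HashMap<String, String>,
}

impl Tracker {
    fn new() -> Self {
        Tracker { db: HashMap::new(), cache: HashMap::new() }
    }

    // DB-first: the write is durable before the cache ever sees it.
    fn persist(&mut self, pipeline_id: &str, state: &str) {
        self.db.insert(pipeline_id.to_string(), state.to_string());
        self.cache.insert(pipeline_id.to_string(), state.to_string());
    }

    // Crash recovery: rebuild the in-memory cache from the durable store.
    fn recover(&mut self) {
        self.cache = self.db.clone();
    }
}

fn main() {
    let mut t = Tracker::new();
    t.persist("run-42", "Stage1:build");
    t.cache.clear(); // simulate a crash wiping in-memory state
    t.recover();
    println!("{}", t.cache["run-42"]); // prints "Stage1:build"
}
```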

### 3. Nickel as Single Source of Truth — No Dual Truth

Action nodes, capability schemas, and the orchestrator startup config are all defined in Nickel. There is no separate indexing process, no database copy of node definitions used at runtime. The `ActionGraph` is built in-memory from `.ncl` files at startup via `nickel export --format json`, then kept live via a `notify` file watcher for hot-reload.

This eliminates the dual-truth problem: the Nickel file IS the definition. No risk of a database record diverging from the file on disk.

### 4. Capability Model — Dependency Inversion at Execution Level

Nodes do not depend on each other directly. They declare capabilities they produce and consume. The graph engine resolves dependencies:

```
lint-crate    → produces: linted-code
fmt-crate     → produces: formatted-code
build-crate   → consumes: linted-code, formatted-code
              → produces: built-artifact
install-crate → consumes: built-artifact
```

This is Dependency Inversion applied to the execution domain: `build-crate` does not know about `lint-crate`. It only knows it needs `linted-code`. Any node that produces `linted-code` satisfies the dependency. Nodes can be swapped, replaced, or parallelized without changing their consumers.
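Under this model, building the staged plan reduces to a small Kahn-style leveling over capability edges: a node lands in the earliest stage where everything it consumes was produced strictly earlier. This is a std-only sketch with hypothetical names (`stage_plan`, the tuple `Node` type), not the actual `ActionGraph` code:

```rust
use std::collections::HashMap;

// (id, consumes, produces) — a toy node table mirroring the example above.
type Node = (&'static str, Vec<&'static str>, Vec<&'static str>);

/// Kahn-style leveling: each node is placed in the earliest stage where every
/// capability it consumes was produced by a strictly earlier stage.
fn stage_plan(nodes: &[Node]) -> Vec<Vec<&'static str>> {
    let mut available: HashMap<&'static str, usize> = HashMap::new(); // capability -> producing stage
    let mut stages: Vec<Vec<&'static str>> = Vec::new();
    let mut placed = vec![false; nodes.len()];
    loop {
        let current = stages.len();
        let mut ready_now: Vec<usize> = Vec::new();
        for (i, (_, consumes, _)) in nodes.iter().enumerate() {
            let ready = consumes
                .iter()
                .all(|c| available.get(c).map_or(false, |&s| s < current));
            if !placed[i] && ready {
                ready_now.push(i);
            }
        }
        if ready_now.is_empty() {
            break; // all placed, or a cycle / missing producer
        }
        for &i in &ready_now {
            placed[i] = true;
            for &cap in &nodes[i].2 {
                available.entry(cap).or_insert(current);
            }
        }
        stages.push(ready_now.iter().map(|&i| nodes[i].0).collect());
    }
    stages
}

fn main() {
    let nodes: Vec<Node> = vec![
        ("lint-crate", vec![], vec!["linted-code"]),
        ("fmt-crate", vec![], vec!["formatted-code"]),
        ("build-crate", vec!["linted-code", "formatted-code"], vec!["built-artifact"]),
        ("install-crate", vec!["built-artifact"], vec![]),
    ];
    println!("{:?}", stage_plan(&nodes));
    // → [["lint-crate", "fmt-crate"], ["build-crate"], ["install-crate"]]
}
```

Note how swapping in a different producer of `linted-code` changes nothing for `build-crate`: the plan is derived from capabilities, never from node identities.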

### 5. Three Independent Auth Planes

Authentication and authorization are split across three independent, non-substitutable planes:

| Plane | Technology | Scope | What it controls |
|-------|-----------|-------|------------------|
| **Publisher auth** | NATS NKeys (ed25519) | Transport | Who can publish events to `dev.>` subjects |
| **Workflow authz** | Cedar policies | Orchestrator | Which pipelines/nodes a principal can trigger |
| **Execution credentials** | Vault (SecretumVault) | Per-node, per-step | Scoped secrets with TTL = node timeout |

Credentials from Vault are injected as environment variables into the Nushell subprocess and revoked on node failure. They never appear in NATS messages, logs, or `PipelineContext` (redacted before storage).
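The redaction pass can be sketched as follows; the `redact` helper and the `VAULT_` key-prefix convention are illustrative assumptions, not the crate's actual mechanism:

```rust
use std::collections::HashMap;

// Hypothetical redaction pass: overwrite any env value that came from Vault
// before the PipelineContext is written to SurrealDB. The VAULT_ prefix is
// an illustrative convention, not the real crate's marker.
fn redact(env: &HashMap<String, String>) -> HashMap<String, String> {
    env.iter()
        .map(|(k, v)| {
            let v = if k.starts_with("VAULT_") { "[REDACTED]".to_string() } else { v.clone() };
            (k.clone(), v)
        })
        .collect()
}

fn main() {
    let mut env = HashMap::new();
    env.insert("VAULT_ZOT_TOKEN".to_string(), "s3cr3t".to_string());
    env.insert("RUST_LOG".to_string(), "info".to_string());
    let stored = redact(&env);
    println!("{}", stored["VAULT_ZOT_TOKEN"]); // prints "[REDACTED]"
}
```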

### 6. Saga Atomicity — Compensation, Not Transactions

Pipelines execute forward through stages. If a stage fails, the orchestrator does not roll back a database transaction — it runs `compensate.nu` scripts in reverse order through all previously successful stages. This is the Saga pattern:

```
Stage 0: lint (ok) + fmt (ok) → executed
Stage 1: build (FAIL)         → trigger compensation
Stage 0 compensation:         → undo lint, undo fmt (in parallel, reverse)
```

Compensation is best-effort: compensation failures are logged but do not block the pipeline from reaching `Compensated` status. The DB record captures the full compensation trace.
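The reverse walk can be sketched with plain lists. Node names come from the example above; the `fails` parameter simulates compensation scripts that themselves error, showing that a failed compensation is recorded but never stops the walk (function names hypothetical):

```rust
/// Compensate completed stages in reverse order. Each entry is one stage's
/// node ids; `fails` simulates compensation scripts that error. Failures
/// are recorded in the trace but never abort the walk (best-effort Saga).
fn compensate(completed: &[Vec<&str>], fails: &[&str]) -> Vec<String> {
    let mut trace = Vec::new();
    for stage in completed.iter().rev() {
        for node in stage {
            if fails.contains(node) {
                trace.push(format!("{node}: compensation FAILED (logged)"));
            } else {
                trace.push(format!("{node}: compensated"));
            }
        }
    }
    trace
}

fn main() {
    // Stage 0 succeeded; stage 1 (build) failed mid-pipeline, so only
    // stage 0 needs compensation, and even a failing script is tolerated.
    let completed = vec![vec!["lint-crate", "fmt-crate"]];
    for line in compensate(&completed, &["fmt-crate"]) {
        println!("{line}");
    }
    // → lint-crate: compensated
    // → fmt-crate: compensation FAILED (logged)
}
```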

### 7. Parallel Stages via JoinSet + CancellationToken

Within each stage, nodes with no capability dependencies on each other execute in parallel using `tokio::task::JoinSet`. Fail-fast is implemented via `CancellationToken`: the first node failure cancels the token, aborting all sibling tasks in the stage.

```
Stage 0: [lint-crate ‖ fmt-crate] — parallel (no inter-dependency)
Stage 1: [build-crate]            — sequential (needs both capabilities)
Stage 2: [install-crate]          — sequential (needs built-artifact)
```
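The fail-fast shape can be sketched with std threads and an `AtomicBool` standing in for `JoinSet` and `CancellationToken` (which require the tokio and tokio-util crates); the cooperative-checkpoint idea is the same, the async machinery is not:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Std-only stand-in for JoinSet + CancellationToken: run one stage's
/// nodes in parallel; the first failure flips a shared flag and slower
/// siblings observe it at their next cancellation checkpoint.
fn run_stage(nodes: Vec<(&'static str, bool, u64)>) -> Vec<String> {
    let cancel = Arc::new(AtomicBool::new(false));
    let handles: Vec<_> = nodes
        .into_iter()
        .map(|(id, will_fail, millis)| {
            let cancel = Arc::clone(&cancel);
            thread::spawn(move || {
                thread::sleep(Duration::from_millis(millis)); // the node's "work"
                if will_fail {
                    cancel.store(true, Ordering::SeqCst); // fail-fast: cancel siblings
                    return format!("{id}: failed");
                }
                if cancel.load(Ordering::SeqCst) {
                    return format!("{id}: cancelled"); // cooperative checkpoint
                }
                format!("{id}: ok")
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    // lint fails quickly; fmt is still "working" and sees the cancel flag.
    let results = run_stage(vec![("lint-crate", true, 10), ("fmt-crate", false, 300)]);
    println!("{results:?}"); // → ["lint-crate: failed", "fmt-crate: cancelled"]
}
```

Unlike `JoinSet::abort_all`, OS threads cannot be forcibly aborted, which is why the sketch polls the flag at a checkpoint; the real token-based design cancels at every `await` point.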

### 8. OCI for Everything — Content-Addressed Artifacts

Both node definitions and the Nickel base library are published as OCI artifacts to a Zot registry. The publish pipeline for each is: `nickel typecheck` → `gitleaks detect` → `nickel export` → `sha256sum` → `oras push` with content-hash annotations.

The `ncl-import-resolver` binary bridges OCI → local filesystem: it pulls each referenced OCI layer at orchestrator startup, verifies the digest against the annotated hash, then exposes a local path for Nickel imports. This prevents loading unverified or tampered node definitions.

This follows the same model as container images: build → scan → publish → consume by digest.

### 9. Nushell as Execution Unit — Agnostic by Design

Each action node's `handler` is a Nushell script. The executor spawns `nu --no-config-file <script.nu>`, passes `PipelineContext` inputs as JSON on stdin, and reads the output JSON from stdout. This makes execution:

- **Domain-agnostic**: the orchestrator has no knowledge of what the script does
- **Hot-replaceable**: updating a workflow means replacing a `.nu` file, not recompiling the binary
- **Sandboxable**: each node runs in its own process with scoped Vault credentials
- **Testable independently**: scripts can be invoked directly with `echo '{}' | nu script.nu`
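The stdin/stdout contract can be exercised with `std::process::Command`. In this sketch `cat` stands in for `nu --no-config-file <script.nu>` so it runs without Nushell installed; the piping shape (JSON in on stdin, JSON out on stdout) is identical:

```rust
use std::io::Write;
use std::process::{Command, Stdio};

/// Spawn an executor subprocess, feed the PipelineContext inputs as JSON
/// on stdin, and capture the output JSON from stdout. `cat` stands in for
/// `nu --no-config-file script.nu` here so the sketch runs anywhere.
fn run_node(program: &str, args: &[&str], input_json: &str) -> std::io::Result<String> {
    let mut child = Command::new(program)
        .args(args)
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;
    child
        .stdin
        .take()
        .expect("stdin was piped")
        .write_all(input_json.as_bytes())?; // handle drops here, closing the pipe
    let out = child.wait_with_output()?;
    Ok(String::from_utf8_lossy(&out.stdout).into_owned())
}

fn main() -> std::io::Result<()> {
    let output = run_node("cat", &[], r#"{"capability":"linted-code"}"#)?;
    println!("{output}"); // cat echoes the JSON back unchanged
    Ok(())
}
```

Because the protocol is just pipes and JSON, a script is testable in isolation exactly as the bullet above describes: pipe a JSON document in, inspect the JSON that comes out.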

### 10. TypeDialog Scope — Startup Config Only

TypeDialog is used exclusively for orchestrator startup configuration (SurrealDB URL, NATS URL, Zot URL, Vault URL, log level, feature flags). It is **not** used for project declaratives, workflow definitions, or node configurations. Those live in Nickel files managed per project. This prevents TypeDialog from becoming a catch-all config tool and keeps its scope bounded.

## Decision

`stratum-orchestrator` is implemented as a new crate family in the StratumIOps monorepo:

| Crate | Domain | Responsibility |
|-------|--------|----------------|
| `stratum-graph` | Knowledge | `ActionNode`, `Capability`, `GraphRepository` trait |
| `stratum-state` | Operational | `PipelineRun`, `StepRecord`, `StateTracker` trait |
| `platform-nats` | Transport | JetStream consumer with NKey auth |
| `stratum-orchestrator` | Coordination | `ActionGraph`, `PipelineContext`, `StageRunner`, auth, executor |

Domain isolation is structural: `stratum-graph` and `stratum-state` are separate crates with separate SurrealDB table namespaces. `stratum-orchestrator` depends on their traits, not their implementations — compile-time enforcement.

The orchestrator binary startup sequence: load TypeDialog config → connect SurrealDB → connect NATS → resolve OCI Nickel imports → build ActionGraph → start notify watcher → initialize Cedar policies → start HTTP server (health + agent callback) → enter JetStream pull loop.

## Rationale

### Why Not a General-Purpose Workflow Engine (Temporal, Argo, etc.)?

| Concern | External engine | stratum-orchestrator |
|---------|-----------------|----------------------|
| Cross-project event model | Requires adapter per project | Native NATS subject matching |
| Nickel integration | Not possible | First-class: nodes are `.ncl` files |
| Nushell execution | Not supported | Native subprocess executor |
| Operational footprint | Heavy (Temporal cluster, Argo K8s) | Single binary + SurrealDB + NATS |
| Custom auth model | Difficult to extend | Three planes designed in |

### Why Saga over 2PC?

Two-phase commit across distributed Nushell scripts is not feasible — scripts are external processes with no transaction coordinator. Saga compensation scripts (`compensate.nu`) are the only realistic atomicity model for cross-process workflows. The trade-off is accepted: compensation is best-effort, not guaranteed-atomic, but the failure cases are logged and auditable.

### Why In-Memory ActionGraph vs DB-Persisted Nodes?

Storing node definitions in SurrealDB creates dual truth. The file on disk and the DB record can diverge. Hot-reload via `notify` on the filesystem is simpler, faster, and eliminates the sync problem. SurrealDB is used only for operational state (pipeline runs, capability stores) — knowledge (node definitions) stays in the filesystem.

## Consequences

**Accepted trade-offs**:

- `nickel export` is a subprocess call per file — adds ~50ms per node file to startup time. Mitigated by loading files in parallel with `JoinSet`.
- Saga compensation is best-effort — a compensation script that itself fails is logged but does not block status progression. This is a known Saga trade-off.
- Nushell subprocess overhead per node — each node execution spawns a process. For sub-second scripts this is observable latency. Acceptable for CI/CD and infrastructure workflows.
- OCI layer pull at startup — cold starts require pulling Nickel lib layers. Mitigated by a local digest cache in `~/.cache/stratum/ncl/`.

**Benefits gained**:

- New workflows require only new `.ncl` files — zero orchestrator binary changes
- Full pipeline audit trail in SurrealDB: every step, every capability deposit, every compensation
- Crash recovery is free: restart the orchestrator, pipeline resumes from last persisted state
- Auth is non-negotiable: publisher identity (NKeys), workflow authorization (Cedar), and execution credentials (Vault) are enforced at every pipeline invocation
- Horizontal scaling: stateless orchestrator + shared SurrealDB enables multiple instances on the same event stream

## References

- Implementation plan: `.coder/2026-02-20-stratum-orchestrator-plan.plan.md`
- Architecture diagram: `assets/diagrams/arch-stratum-orchestrator.svg`
- Build pipeline flow: `assets/diagrams/flow-stratum-build-pipeline.svg`
- Nickel base library: `nickel/stratum-base/stratum-base.ncl`
- Crates: `crates/stratum-graph/`, `crates/stratum-state/`, `crates/platform-nats/`, `crates/stratum-orchestrator/`
- Related: ADR-001 (stratum-embeddings), ADR-002 (stratum-llm)