# ADR-003: Stratum-Orchestrator — Graph-Driven Workflow Orchestrator

**Status**: Accepted

**Date**: 2026-02-20

## Context

The StratumIOps ecosystem spans multiple active projects (provisioning, kogral, syntaxis, typedialog) that evolve concurrently with cross-project dependencies. Each project triggers build, validation, publish, and notification workflows in response to code changes. Before this decision, these workflows were:

- Hardcoded per-project scripts with no shared execution model
- Not auditable — no durable record of which step ran, in what order, with what outcome
- Not composable — a workflow in `provisioning` could not react to an event from `kogral`
- Not safe — no atomicity, no rollback on partial failure, no credential scoping
- Not scalable — adding a new project meant copying and adapting scripts, accumulating drift

The core requirement is a **cross-project, event-driven, agnostic workflow orchestrator** that can coordinate build pipelines, AI agent tasks, and infrastructure operations without being modified for each new project or provider.

Two design directions were evaluated:

| Approach | Description | Assessment |
|----------|-------------|------------|
| **Static rule router** | Map NATS subject patterns to scripts in a config file | Rules multiply with projects; changing a workflow requires editing the router |
| **Graph-driven engine** | Events traverse a DAG of action nodes declared in Nickel; orchestrator is agnostic | Orchestrator never changes for new workflows |

The graph-driven approach was chosen. The orchestrator loads node definitions from Nickel files, builds an in-memory `ActionGraph`, and executes pipelines by traversing the graph — it does not know what the pipeline does, only how to coordinate it.

## Fundamental Design Characteristics

### 1. Graph-Driven Execution, Not Static Routing

The orchestrator does not contain routing tables. Instead, NATS subject patterns are matched against an `ActionGraph` built from Nickel node definitions. Each `ActionNode` declares:

- `trigger`: NATS subject patterns that can activate this node as an entry point
- `input_schemas`: capabilities required before execution
- `output_schemas`: capabilities produced after execution
- `compensate`: optional rollback script for Saga atomicity

The graph is built by traversal of producer/consumer capability indexes. Topological sort produces a staged execution plan. **Adding a new workflow means adding `.ncl` files — the orchestrator binary never changes.**
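A minimal Rust sketch of this node shape. The field types and the toy subject matcher are illustrative assumptions, not the actual `stratum-graph` definitions (the matcher handles only the single-token `*` wildcard, omitting NATS's multi-token `>`):

```rust
// Hypothetical sketch of an ActionNode as loaded from a .ncl file.
// Field names mirror the Nickel keys; types are illustrative only.
#[derive(Debug, Clone)]
struct ActionNode {
    id: String,
    trigger: Vec<String>,        // NATS subject patterns, e.g. "dev.*.build"
    input_schemas: Vec<String>,  // capabilities required before execution
    output_schemas: Vec<String>, // capabilities produced after execution
    compensate: Option<String>,  // optional rollback script path
}

/// Tiny subject matcher: '*' matches exactly one dot-separated token.
/// (NATS also supports '>' for multi-token tails; omitted here.)
fn matches_subject(pattern: &str, subject: &str) -> bool {
    let p: Vec<&str> = pattern.split('.').collect();
    let s: Vec<&str> = subject.split('.').collect();
    p.len() == s.len() && p.iter().zip(s.iter()).all(|(pt, st)| *pt == "*" || pt == st)
}

fn main() {
    let node = ActionNode {
        id: "build-crate".into(),
        trigger: vec!["dev.*.build".into()],
        input_schemas: vec!["linted-code".into(), "formatted-code".into()],
        output_schemas: vec!["built-artifact".into()],
        compensate: Some("compensate.nu".into()),
    };
    let fired = node.trigger.iter().any(|t| matches_subject(t, "dev.kogral.build"));
    println!("{} fires: {}", node.id, fired); // prints "build-crate fires: true"
}
```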

### 2. Stateless Orchestrator — DB-First State

The orchestrator process holds no durable state. Every `PipelineContext` write goes to SurrealDB first, then updates an in-memory `DashMap` cache. On crash, a restarted instance reconstructs the cache from the DB. This enables:

- **Horizontal scaling**: multiple orchestrator instances share one SurrealDB, no split-brain
- **Crash recovery**: pipelines resume from last persisted capability, not from the beginning
- **Observability**: full pipeline state is queryable at any point without instrumenting the process
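The write ordering can be sketched as a write-through pattern: persist first, then cache, and recover by rebuilding the cache from the durable store. Everything below is a stand-in (plain `HashMap`s play SurrealDB and the `DashMap` cache) to show the shape, not the real `StateTracker` API:

```rust
use std::collections::HashMap;

// Hypothetical stand-ins: `db` plays SurrealDB (durable), `cache` plays DashMap.
struct Tracker {
    db: HashMap<String, String>,
    cache: HashMap<String, String>,
}

impl Tracker {
    fn new() -> Self {
        Tracker { db: HashMap::new(), cache: HashMap::new() }
    }

    // DB-first: the write is durable before the cache ever sees it.
    fn persist(&mut self, pipeline_id: &str, state: &str) {
        self.db.insert(pipeline_id.to_string(), state.to_string());
        self.cache.insert(pipeline_id.to_string(), state.to_string());
    }

    // Crash recovery: rebuild the in-memory cache from the durable store.
    fn recover(&mut self) {
        self.cache = self.db.clone();
    }
}

fn main() {
    let mut t = Tracker::new();
    t.persist("run-42", "Stage1:build");
    t.cache.clear(); // simulate a crash wiping in-memory state
    t.recover();
    println!("{}", t.cache["run-42"]); // prints "Stage1:build"
}
```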

### 3. Nickel as Single Source of Truth — No Dual Truth

Action nodes, capability schemas, and the orchestrator startup config are all defined in Nickel. There is no separate indexing process, no database copy of node definitions used at runtime. The `ActionGraph` is built in-memory from `.ncl` files at startup via `nickel export --format json`, then kept live via a `notify` file watcher for hot-reload.

This eliminates the dual-truth problem: the Nickel file IS the definition. No risk of a database record diverging from the file on disk.

### 4. Capability Model — Dependency Inversion at Execution Level

Nodes do not depend on each other directly. They declare capabilities they produce and consume. The graph engine resolves dependencies:

```
lint-crate    → produces: linted-code
fmt-crate     → produces: formatted-code
build-crate   → consumes: linted-code, formatted-code
              → produces: built-artifact
install-crate → consumes: built-artifact
```

This is Dependency Inversion applied to the execution domain: `build-crate` does not know about `lint-crate`. It only knows it needs `linted-code`. Any node that produces `linted-code` satisfies the dependency. Nodes can be swapped, replaced, or parallelized without changing their consumers.
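Under this model, building the staged plan reduces to a small Kahn-style leveling over capability edges: a node lands in the earliest stage where everything it consumes was produced strictly earlier. This is a std-only sketch with hypothetical names (`stage_plan`, the tuple `Node` type), not the actual `ActionGraph` code:

```rust
use std::collections::HashMap;

// (id, consumes, produces) — a toy node table mirroring the example above.
type Node = (&'static str, Vec<&'static str>, Vec<&'static str>);

/// Kahn-style leveling: each node is placed in the earliest stage where every
/// capability it consumes was produced by a strictly earlier stage.
fn stage_plan(nodes: &[Node]) -> Vec<Vec<&'static str>> {
    let mut available: HashMap<&'static str, usize> = HashMap::new(); // capability -> producing stage
    let mut stages: Vec<Vec<&'static str>> = Vec::new();
    let mut placed = vec![false; nodes.len()];
    loop {
        let current = stages.len();
        let mut ready_now: Vec<usize> = Vec::new();
        for (i, (_, consumes, _)) in nodes.iter().enumerate() {
            let ready = consumes
                .iter()
                .all(|c| available.get(c).map_or(false, |&s| s < current));
            if !placed[i] && ready {
                ready_now.push(i);
            }
        }
        if ready_now.is_empty() {
            break; // all placed, or a cycle / missing producer
        }
        for &i in &ready_now {
            placed[i] = true;
            for &cap in &nodes[i].2 {
                available.entry(cap).or_insert(current);
            }
        }
        stages.push(ready_now.iter().map(|&i| nodes[i].0).collect());
    }
    stages
}

fn main() {
    let nodes: Vec<Node> = vec![
        ("lint-crate", vec![], vec!["linted-code"]),
        ("fmt-crate", vec![], vec!["formatted-code"]),
        ("build-crate", vec!["linted-code", "formatted-code"], vec!["built-artifact"]),
        ("install-crate", vec!["built-artifact"], vec![]),
    ];
    println!("{:?}", stage_plan(&nodes));
    // → [["lint-crate", "fmt-crate"], ["build-crate"], ["install-crate"]]
}
```

Note how swapping in a different producer of `linted-code` changes nothing for `build-crate`: the plan is derived from capabilities, never from node identities.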

### 5. Three Independent Auth Planes

Authentication and authorization are split across three independent, non-substitutable planes:

| Plane | Technology | Scope | What it controls |
|-------|-----------|-------|------------------|
| **Publisher auth** | NATS NKeys (ed25519) | Transport | Who can publish events to `dev.>` subjects |
| **Workflow authz** | Cedar policies | Orchestrator | Which pipelines/nodes a principal can trigger |
| **Execution credentials** | Vault (SecretumVault) | Per-node, per-step | Scoped secrets with TTL = node timeout |

Credentials from Vault are injected as environment variables into the Nushell subprocess and revoked on node failure. They never appear in NATS messages, logs, or `PipelineContext` (redacted before storage).
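The redaction pass can be sketched as follows; the `redact` helper and the `VAULT_` key-prefix convention are illustrative assumptions, not the crate's actual mechanism:

```rust
use std::collections::HashMap;

// Hypothetical redaction pass: overwrite any env value that came from Vault
// before the PipelineContext is written to SurrealDB. The VAULT_ prefix is
// an illustrative convention, not the real crate's marker.
fn redact(env: &HashMap<String, String>) -> HashMap<String, String> {
    env.iter()
        .map(|(k, v)| {
            let v = if k.starts_with("VAULT_") { "[REDACTED]".to_string() } else { v.clone() };
            (k.clone(), v)
        })
        .collect()
}

fn main() {
    let mut env = HashMap::new();
    env.insert("VAULT_ZOT_TOKEN".to_string(), "s3cr3t".to_string());
    env.insert("RUST_LOG".to_string(), "info".to_string());
    let stored = redact(&env);
    println!("{}", stored["VAULT_ZOT_TOKEN"]); // prints "[REDACTED]"
}
```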

### 6. Saga Atomicity — Compensation, Not Transactions

Pipelines execute forward through stages. If a stage fails, the orchestrator does not roll back a database transaction — it runs `compensate.nu` scripts in reverse order through all previously successful stages. This is the Saga pattern:

```
Stage 0: lint (ok) + fmt (ok) → executed
Stage 1: build (FAIL)         → trigger compensation
Stage 0 compensation:         → undo lint, undo fmt (in parallel, reverse)
```

Compensation is best-effort: compensation failures are logged but do not block the pipeline from reaching `Compensated` status. The DB record captures the full compensation trace.
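The reverse walk can be sketched with plain lists. Node names come from the example above; the `fails` parameter simulates compensation scripts that themselves error, showing that a failed compensation is recorded but never stops the walk (function names hypothetical):

```rust
/// Compensate completed stages in reverse order. Each entry is one stage's
/// node ids; `fails` simulates compensation scripts that error. Failures
/// are recorded in the trace but never abort the walk (best-effort Saga).
fn compensate(completed: &[Vec<&str>], fails: &[&str]) -> Vec<String> {
    let mut trace = Vec::new();
    for stage in completed.iter().rev() {
        for node in stage {
            if fails.contains(node) {
                trace.push(format!("{node}: compensation FAILED (logged)"));
            } else {
                trace.push(format!("{node}: compensated"));
            }
        }
    }
    trace
}

fn main() {
    // Stage 0 succeeded; stage 1 (build) failed mid-pipeline, so only
    // stage 0 needs compensation, and even a failing script is tolerated.
    let completed = vec![vec!["lint-crate", "fmt-crate"]];
    for line in compensate(&completed, &["fmt-crate"]) {
        println!("{line}");
    }
    // → lint-crate: compensated
    // → fmt-crate: compensation FAILED (logged)
}
```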

### 7. Parallel Stages via JoinSet + CancellationToken

Within each stage, nodes with no capability dependencies on each other execute in parallel using `tokio::task::JoinSet`. Fail-fast is implemented via `CancellationToken`: the first node failure cancels the token, aborting all sibling tasks in the stage.

```
Stage 0: [lint-crate ‖ fmt-crate] — parallel (no inter-dependency)
Stage 1: [build-crate]            — sequential (needs both capabilities)
Stage 2: [install-crate]          — sequential (needs built-artifact)
```
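The fail-fast shape can be sketched with std threads and an `AtomicBool` standing in for `JoinSet` and `CancellationToken` (which require the tokio and tokio-util crates); the cooperative-checkpoint idea is the same, the async machinery is not:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Std-only stand-in for JoinSet + CancellationToken: run one stage's
/// nodes in parallel; the first failure flips a shared flag and slower
/// siblings observe it at their next cancellation checkpoint.
fn run_stage(nodes: Vec<(&'static str, bool, u64)>) -> Vec<String> {
    let cancel = Arc::new(AtomicBool::new(false));
    let handles: Vec<_> = nodes
        .into_iter()
        .map(|(id, will_fail, millis)| {
            let cancel = Arc::clone(&cancel);
            thread::spawn(move || {
                thread::sleep(Duration::from_millis(millis)); // the node's "work"
                if will_fail {
                    cancel.store(true, Ordering::SeqCst); // fail-fast: cancel siblings
                    return format!("{id}: failed");
                }
                if cancel.load(Ordering::SeqCst) {
                    return format!("{id}: cancelled"); // cooperative checkpoint
                }
                format!("{id}: ok")
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    // lint fails quickly; fmt is still "working" and sees the cancel flag.
    let results = run_stage(vec![("lint-crate", true, 10), ("fmt-crate", false, 300)]);
    println!("{results:?}"); // → ["lint-crate: failed", "fmt-crate: cancelled"]
}
```

Unlike `JoinSet::abort_all`, OS threads cannot be forcibly aborted, which is why the sketch polls the flag at a checkpoint; the real token-based design cancels at every `await` point.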

### 8. OCI for Everything — Content-Addressed Artifacts

Both node definitions and the Nickel base library are published as OCI artifacts to a Zot registry. The publish pipeline for each is: `nickel typecheck` → `gitleaks detect` → `nickel export` → `sha256sum` → `oras push` with content-hash annotations.

The `ncl-import-resolver` binary bridges OCI → local filesystem: it pulls each referenced OCI layer at orchestrator startup, verifies the digest against the annotated hash, then exposes a local path for Nickel imports. This prevents loading unverified or tampered node definitions.

This follows the same model as container images: build → scan → publish → consume by digest.

### 9. Nushell as Execution Unit — Agnostic by Design

Each action node's `handler` is a Nushell script. The executor spawns `nu --no-config-file <script.nu>`, passes `PipelineContext` inputs as JSON on stdin, and reads the output JSON from stdout. This makes execution:

- **Domain-agnostic**: the orchestrator has no knowledge of what the script does
- **Hot-replaceable**: updating a workflow means replacing a `.nu` file, not recompiling the binary
- **Sandboxable**: each node runs in its own process with scoped Vault credentials
- **Testable independently**: scripts can be invoked directly with `echo '{}' | nu script.nu`
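The stdin/stdout contract can be exercised with `std::process::Command`. In this sketch `cat` stands in for `nu --no-config-file <script.nu>` so it runs without Nushell installed; the piping shape (JSON in on stdin, JSON out on stdout) is identical:

```rust
use std::io::Write;
use std::process::{Command, Stdio};

/// Spawn an executor subprocess, feed the PipelineContext inputs as JSON
/// on stdin, and capture the output JSON from stdout. `cat` stands in for
/// `nu --no-config-file script.nu` here so the sketch runs anywhere.
fn run_node(program: &str, args: &[&str], input_json: &str) -> std::io::Result<String> {
    let mut child = Command::new(program)
        .args(args)
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;
    child
        .stdin
        .take()
        .expect("stdin was piped")
        .write_all(input_json.as_bytes())?; // handle drops here, closing the pipe
    let out = child.wait_with_output()?;
    Ok(String::from_utf8_lossy(&out.stdout).into_owned())
}

fn main() -> std::io::Result<()> {
    let output = run_node("cat", &[], r#"{"capability":"linted-code"}"#)?;
    println!("{output}"); // cat echoes the JSON back unchanged
    Ok(())
}
```

Because the protocol is just pipes and JSON, a script is testable in isolation exactly as the bullet above describes: pipe a JSON document in, inspect the JSON that comes out.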

### 10. TypeDialog Scope — Startup Config Only

TypeDialog is used exclusively for orchestrator startup configuration (SurrealDB URL, NATS URL, Zot URL, Vault URL, log level, feature flags). It is **not** used for project declaratives, workflow definitions, or node configurations. Those live in Nickel files managed per project. This prevents TypeDialog from becoming a catch-all config tool and keeps its scope bounded.

## Decision

`stratum-orchestrator` is implemented as a new crate family in the StratumIOps monorepo:

| Crate | Domain | Responsibility |
|-------|--------|----------------|
| `stratum-graph` | Knowledge | `ActionNode`, `Capability`, `GraphRepository` trait |
| `stratum-state` | Operational | `PipelineRun`, `StepRecord`, `StateTracker` trait |
| `platform-nats` | Transport | JetStream consumer with NKey auth |
| `stratum-orchestrator` | Coordination | `ActionGraph`, `PipelineContext`, `StageRunner`, auth, executor |

Domain isolation is structural: `stratum-graph` and `stratum-state` are separate crates with separate SurrealDB table namespaces. `stratum-orchestrator` depends on their traits, not their implementations — compile-time enforcement.

The orchestrator binary startup sequence: load TypeDialog config → connect SurrealDB → connect NATS → resolve OCI Nickel imports → build ActionGraph → start notify watcher → initialize Cedar policies → start HTTP server (health + agent callback) → enter JetStream pull loop.

## Rationale

### Why Not a General-Purpose Workflow Engine (Temporal, Argo, etc.)?

| Concern | External engine | stratum-orchestrator |
|---------|-----------------|----------------------|
| Cross-project event model | Requires adapter per project | Native NATS subject matching |
| Nickel integration | Not possible | First-class: nodes are `.ncl` files |
| Nushell execution | Not supported | Native subprocess executor |
| Operational footprint | Heavy (Temporal cluster, Argo K8s) | Single binary + SurrealDB + NATS |
| Custom auth model | Difficult to extend | Three planes designed in |

### Why Saga over 2PC?

Two-phase commit across distributed Nushell scripts is not feasible — scripts are external processes with no transaction coordinator. Saga compensation scripts (`compensate.nu`) are the only realistic atomicity model for cross-process workflows. The trade-off is accepted: compensation is best-effort, not guaranteed-atomic, but the failure cases are logged and auditable.

### Why In-Memory ActionGraph vs DB-Persisted Nodes?

Storing node definitions in SurrealDB creates dual truth. The file on disk and the DB record can diverge. Hot-reload via `notify` on the filesystem is simpler, faster, and eliminates the sync problem. SurrealDB is used only for operational state (pipeline runs, capability stores) — knowledge (node definitions) stays in the filesystem.

## Consequences

**Accepted trade-offs**:

- `nickel export` is a subprocess call per file — adds ~50ms per node file to startup time. Mitigated by loading files in parallel with `JoinSet`.
- Saga compensation is best-effort — a compensation script that itself fails is logged but does not block status progression. This is a known Saga trade-off.
- Nushell subprocess overhead per node — each node execution spawns a process. For sub-second scripts this is observable latency. Acceptable for CI/CD and infrastructure workflows.
- OCI layer pull at startup — cold starts require pulling Nickel lib layers. Mitigated by a local digest cache in `~/.cache/stratum/ncl/`.

**Benefits gained**:

- New workflows require only new `.ncl` files — zero orchestrator binary changes
- Full pipeline audit trail in SurrealDB: every step, every capability deposit, every compensation
- Crash recovery is free: restart the orchestrator, pipeline resumes from last persisted state
- Auth is non-negotiable: publisher identity (NKeys), workflow authorization (Cedar), and execution credentials (Vault) are enforced at every pipeline invocation
- Horizontal scaling: stateless orchestrator + shared SurrealDB enables multiple instances on the same event stream

## References

- Implementation plan: `.coder/2026-02-20-stratum-orchestrator-plan.plan.md`
- Architecture diagram: `assets/diagrams/arch-stratum-orchestrator.svg`
- Build pipeline flow: `assets/diagrams/flow-stratum-build-pipeline.svg`
- Nickel base library: `nickel/stratum-base/stratum-base.ncl`
- Crates: `crates/stratum-graph/`, `crates/stratum-state/`, `crates/platform-nats/`, `crates/stratum-orchestrator/`
- Related: ADR-001 (stratum-embeddings), ADR-002 (stratum-llm)