Vapora/docs/adrs/0033-stratum-orchestrator-workflow-hardening.md
Jesús Pérez b9e2cee9f7
feat(workflow-engine): add saga, persistence, auth, and NATS-integrated orchestrator hardening
Key changes driving this: new saga.rs, persistence.rs, auth.rs in workflow-engine; SurrealDB migration 009_workflow_state.surql; backend
  services refactored; frontend dist built; ADR-0033 documenting the hardening decision.
2026-02-22 21:44:42 +00:00


ADR-0033: Workflow Engine Hardening — Persistence, Saga Compensation, Cedar Authorization

Status: Implemented
Date: 2026-02-21
Deciders: VAPORA Team
Technical Story: vapora-workflow-engine lost all state on restart, had no rollback mechanism on failure, and applied no per-stage access control.


Decision

Harden vapora-workflow-engine with three independent layers inspired by the stratum-orchestrator project:

  1. SurrealDB persistence (SurrealWorkflowStore) — crash-recoverable WorkflowInstance state
  2. Saga compensation (SagaCompensator) — reverse-order rollback dispatch via SwarmCoordinator
  3. Cedar authorization (CedarAuthorizer) — per-stage policy enforcement before task dispatch

All three are implemented natively inside vapora-workflow-engine — stratum-orchestrator is not a direct dependency.


Context

Gaps Before This ADR

| Gap | Consequence |
|-----|-------------|
| In-memory DashMap only | All running workflows lost on server restart |
| No compensation on failure | Stage 3 failure left Stage 1 and 2 side effects live |
| No authorization check per stage | Any caller could trigger any stage in any workflow |

Why Not Import stratum-orchestrator Directly

The plan initially included:

stratum-orchestrator = { path = "../../../stratumiops/crates/stratum-orchestrator" }

This fails because stratum-orchestrator → platform-nats → nkeys = { workspace = true }. The nkeys dependency is resolved only inside the stratumiops workspace; it is not published to crates.io and has no path resolvable from vapora's workspace root. Cargo errors with failed to select a version for nkeys.

The CedarAuthorizer inside stratum-orchestrator is 88 self-contained lines that depend only on cedar-policy. Reimplementing it locally carries minimal duplication risk and avoids pulling in the stratumiops workspace as a dependency.


Implementation

New Modules

crates/vapora-workflow-engine/src/
├── auth.rs        — CedarAuthorizer: loads .cedar policy files, authorize()
├── persistence.rs — SurrealWorkflowStore: save/load/load_active/delete
└── saga.rs        — SagaCompensator: compensate(workflow_id, stages, ctx)

New Migration

migrations/009_workflow_state.surql — SCHEMAFULL workflow_instances table

Config Changes

[engine]
cedar_policy_dir = "/etc/vapora/cedar"   # optional; Cedar disabled if absent

[[workflows.stages]]
name = "deploy"
agents = ["devops"]
compensation_agents = ["devops"]         # receives rollback task if Saga fires

Dependency Addition

# crates/vapora-workflow-engine/Cargo.toml
surrealdb    = { workspace = true }
cedar-policy = "4.9"

cedar-policy becomes a direct dependency; it was previously only transitive via secretumvault (4.8). Cargo resolves the workspace to 4.9 (semver-compatible, same major version).

WorkflowOrchestrator Constructor Change

// Before
WorkflowOrchestrator::new(config_path, swarm, kg, nats)

// After: the extra argument `db` has type Surreal<Client>
WorkflowOrchestrator::new(config_path, swarm, kg, nats, db)

db is the existing backend connection — the store does not open its own connection.


Data Flow

start_workflow()
  → WorkflowInstance::new()
  → store.save()                    ← persistence
  → execute_current_stage()
      → cedar.authorize()           ← auth (if configured)
      → swarm.assign_task()

on_task_completed()
  → task.mark_completed()
  → store.save()                    ← persistence

on_task_failed(can_retry=false)
  → mark_current_task_failed()      ← stage transition
  → saga.compensate(stages, ctx)    ← saga (reverse-order dispatch)
  → instance.fail()
  → store.save()                    ← persistence

startup crash recovery
  → store.load_active()             ← restores active_workflows DashMap
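
The persistence points in the flow above can be illustrated with a minimal, std-only sketch. It is not the actual SurrealWorkflowStore: `MemoryStore`, `Instance`, `Status`, and `recover` are hypothetical stand-ins, the real store is backed by SurrealDB (and is presumably async), and the assumption that `load_active()` returns only running instances is ours.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a workflow's lifecycle state (hypothetical).
#[derive(Clone, Debug, PartialEq)]
pub enum Status {
    Running,
    Completed,
    Failed,
}

/// Simplified stand-in for WorkflowInstance (hypothetical fields).
#[derive(Clone, Debug, PartialEq)]
pub struct Instance {
    pub id: String,
    pub current_stage: usize,
    pub status: Status,
}

/// In-memory stand-in for the persistent store. The real store would
/// write to the workflow_instances table instead of a HashMap.
#[derive(Default)]
pub struct MemoryStore {
    rows: HashMap<String, Instance>,
}

impl MemoryStore {
    /// Upsert the full instance state; called after every transition.
    pub fn save(&mut self, instance: &Instance) {
        self.rows.insert(instance.id.clone(), instance.clone());
    }

    /// Return only the instances that were still running when last saved
    /// (assumed meaning of "active").
    pub fn load_active(&self) -> Vec<Instance> {
        self.rows
            .values()
            .filter(|i| i.status == Status::Running)
            .cloned()
            .collect()
    }
}

/// Rebuild the in-process map of active workflows at startup.
pub fn recover(store: &MemoryStore) -> HashMap<String, Instance> {
    store
        .load_active()
        .into_iter()
        .map(|i| (i.id.clone(), i))
        .collect()
}
```

Because every transition writes the whole instance, the last saved row is enough to resume: `recover` needs no replay log, only the rows still marked as running.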

Saga Compensation Protocol

Compensation is best-effort: errors are logged, never propagated. Stage order is reversed: the last executed stage receives a rollback task first.

Only stages with compensation_agents defined in their StageConfig receive a compensation task. Stages without the field are silently skipped.
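
The rules above (reverse order, skip stages without compensation_agents, never propagate errors) can be sketched as follows. This is an illustrative, std-only approximation rather than the actual SagaCompensator: `Stage` is a cut-down stand-in for StageConfig, the `dispatch` callback is a hypothetical placeholder for the SwarmCoordinator call, and treating an empty agent list the same as an absent one is an assumption.

```rust
/// Simplified stand-in for StageConfig; only the fields the sketch needs.
#[derive(Clone, Debug)]
pub struct Stage {
    pub name: String,
    /// None (or, by assumption, an empty list) means no rollback handler.
    pub compensation_agents: Option<Vec<String>>,
}

/// Walk the executed stages in reverse order and dispatch one rollback task
/// per stage that declares compensation agents. `dispatch` is a hypothetical
/// callback standing in for the SwarmCoordinator call.
///
/// Returns the names of the stages whose rollback task was dispatched
/// without error, in dispatch order.
pub fn compensate<F>(executed: &[Stage], mut dispatch: F) -> Vec<String>
where
    F: FnMut(&Stage, &[String]) -> Result<(), String>,
{
    let mut dispatched = Vec::new();
    for stage in executed.iter().rev() {
        let agents = match &stage.compensation_agents {
            Some(a) if !a.is_empty() => a.as_slice(),
            _ => continue, // no compensation configured: skip silently
        };
        match dispatch(stage, agents) {
            Ok(()) => dispatched.push(stage.name.clone()),
            // Best-effort: log the failure and keep going, never propagate.
            Err(e) => eprintln!("compensation for stage '{}' failed: {}", stage.name, e),
        }
    }
    dispatched
}
```

Returning the list of successfully dispatched stages is a choice made for this sketch; it gives the caller something to record in metrics without turning a failed rollback into a workflow error.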

Compensation task payload sent to SwarmCoordinator:

{
  "type": "compensation",
  "stage_name": "deploy",
  "workflow_id": "abc-123",
  "original_context": { "…" : "…" },
  "artifacts_to_undo": ["artifact-id-1"]
}

Cedar Authorization

CedarAuthorizer::load_from_dir(path) reads all *.cedar files from the directory and compiles them into a single PolicySet. Before each stage dispatch:

cedar.authorize(
    "vapora-orchestrator",      // principal
    "execute-stage",            // action
    "Stage::\"architecture\"",  // resource
)?;

A Deny decision returns WorkflowError::Unauthorized, halting the workflow without dispatching the stage. If cedar_policy_dir is not set in EngineConfig, Cedar is disabled and all stages proceed without policy checks.
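
The gating behavior described above can be approximated without the cedar-policy crate. In this sketch the `Authorizer` trait and `AllowList` type are hypothetical stand-ins for CedarAuthorizer, the `stage` field on `WorkflowError::Unauthorized` is an assumed shape, and `None` models an unset cedar_policy_dir.

```rust
/// Outcome of a policy evaluation (mirrors an allow/deny decision).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Decision {
    Allow,
    Deny,
}

/// Cut-down error type; the field layout is an assumption of this sketch.
#[derive(Debug, PartialEq)]
pub enum WorkflowError {
    Unauthorized { stage: String },
}

/// Hypothetical stand-in for CedarAuthorizer.
pub trait Authorizer {
    fn decide(&self, principal: &str, action: &str, resource: &str) -> Decision;
}

/// Gate that runs before a stage is dispatched.
/// `None` means no policy directory was configured, so every stage proceeds.
pub fn check_stage(auth: Option<&dyn Authorizer>, stage: &str) -> Result<(), WorkflowError> {
    let auth = match auth {
        Some(a) => a,
        None => return Ok(()), // authorization disabled
    };
    let resource = format!("Stage::\"{}\"", stage);
    match auth.decide("vapora-orchestrator", "execute-stage", &resource) {
        Decision::Allow => Ok(()),
        Decision::Deny => Err(WorkflowError::Unauthorized {
            stage: stage.to_string(),
        }),
    }
}

/// Toy authorizer that allows a fixed list of stage resources and denies
/// everything else; it ignores principal and action for brevity.
pub struct AllowList(pub Vec<String>);

impl Authorizer for AllowList {
    fn decide(&self, _principal: &str, _action: &str, resource: &str) -> Decision {
        if self.0.iter().any(|r| r.as_str() == resource) {
            Decision::Allow
        } else {
            Decision::Deny
        }
    }
}
```

Modeling the authorizer as an `Option` makes the no-auth mode explicit at the call site, which is also where a warning log could be emitted when the check is skipped.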


Rationale

Why SurrealDB (not Redis / SQLite)

SurrealDB is already the persistence layer for every other stateful component in vapora. Adding workflow_instances as one more table keeps the operational footprint at zero (no new service, no new connection pool). WorkflowInstance already implements Serialize/Deserialize; the store serializes via serde_json::Value to satisfy the SurrealValue trait requirement introduced in surrealdb v3.

Why Saga Over Two-Phase Commit

Workflows already span multiple async agent executions over NATS. Two-phase commit across these boundaries would require protocol changes in every agent. Saga achieves eventual consistency via compensating transactions that each agent already understands (a task with type: "compensation").

Why Cedar Over RBAC / Custom Middleware

Cedar policies are already used by the rest of the VAPORA platform (see ADR-0010). Per-stage rules expressed in .cedar files are reviewable outside the codebase and can be changed without rebuilding or redeploying the binary; by current design, a restart is required to reload them. A custom middleware table would require schema migrations for every policy change.


Consequences

Positive

  • Workflows survive server restarts (crash recovery via load_active())
  • Non-retryable stage failure triggers best-effort rollback of completed stages
  • Per-stage access control via auditable policy files
  • Zero new infrastructure (uses existing SurrealDB connection)
  • 31/31 existing tests continue to pass; 5 new tests added (auth × 3, saga × 2)

Negative

  • WorkflowOrchestrator::new() signature change requires callers to pass Surreal<Client>
  • Cedar requires .cedar files on disk; missing cedar_policy_dir disables auth silently
  • Compensation is best-effort — no guarantee of full rollback if compensation agent also fails

Mitigations

| Risk | Mitigation |
|------|------------|
| Saga partial rollback | Metrics track compensation dispatch; dead-letter queue via NATS for retry |
| Cedar files missing | cedar_policy_dir = None → no-auth mode; documented explicitly |
| Signature change | Backend already owns db: Arc<Surreal<Client>>; passed at construction |

Verification

cargo test -p vapora-workflow-engine        # 31/31 pass
cargo clippy -p vapora-workflow-engine -- -D warnings  # 0 warnings

New tests:

  • auth::tests::test_permit_allows
  • auth::tests::test_deny_returns_unauthorized
  • auth::tests::test_empty_dir_fails
  • saga::tests::test_stages_with_compensation_agents_are_included
  • saga::tests::test_stages_with_no_compensation_agents_are_skipped

Related ADRs

  • ADR-0028 — Workflow Orchestrator (original implementation)
  • ADR-0010 — Cedar Authorization
  • ADR-0004 — SurrealDB as single persistence layer
  • ADR-0018 — SwarmCoordinator (Saga dispatch target)