# ADR-0033: Workflow Engine Hardening — Persistence, Saga Compensation, Cedar Authorization

**Status**: Implemented

**Date**: 2026-02-21

**Deciders**: VAPORA Team

**Technical Story**: `vapora-workflow-engine` lost all state on restart, had no rollback mechanism on failure, and applied no per-stage access control.

---

## Decision

Harden `vapora-workflow-engine` with three independent layers inspired by the stratum-orchestrator project:

1. **SurrealDB persistence** (`SurrealWorkflowStore`) — crash-recoverable `WorkflowInstance` state
2. **Saga compensation** (`SagaCompensator`) — reverse-order rollback dispatch via `SwarmCoordinator`
3. **Cedar authorization** (`CedarAuthorizer`) — per-stage policy enforcement before task dispatch

All three are implemented natively inside `vapora-workflow-engine` — stratum-orchestrator is **not** a direct dependency.

---

## Context

### Gaps Before This ADR

| Gap | Consequence |
|-----|-------------|
| In-memory `DashMap` only | All running workflows lost on server restart |
| No compensation on failure | Stage 3 failure left Stage 1 and 2 side effects live |
| No authorization check per stage | Any caller could trigger any stage in any workflow |

### Why Not Import stratum-orchestrator Directly

The plan initially included:

```toml
stratum-orchestrator = { path = "../../../stratumiops/crates/stratum-orchestrator" }
```

This fails because `stratum-orchestrator → platform-nats → nkeys = { workspace = true }`. The `nkeys` dependency is resolved only inside the `stratumiops` workspace; it is not published to crates.io and has no path resolvable from vapora's workspace root. Cargo errors with `failed to select a version for nkeys`.

The `CedarAuthorizer` inside stratum-orchestrator is 88 self-contained lines using only `cedar-policy`. Implementing it locally carries negligible duplication risk and avoids the cross-workspace dependency.

---

## Implementation

### New Modules

```text
crates/vapora-workflow-engine/src/
├── auth.rs        — CedarAuthorizer: loads .cedar policy files, authorize()
├── persistence.rs — SurrealWorkflowStore: save/load/load_active/delete
└── saga.rs        — SagaCompensator: compensate(workflow_id, stages, ctx)
```

### New Migration

```text
migrations/009_workflow_state.surql — SCHEMAFULL workflow_instances table
```

### Config Changes

```toml
[engine]
cedar_policy_dir = "/etc/vapora/cedar"  # optional; Cedar disabled if absent

[[workflows.stages]]
name = "deploy"
agents = ["devops"]
compensation_agents = ["devops"]  # receives rollback task if Saga fires
```

### Dependency Addition

```toml
# crates/vapora-workflow-engine/Cargo.toml
surrealdb = { workspace = true }
cedar-policy = "4.9"
```

`cedar-policy` enters directly; it was previously only transitive via `secretumvault` (4.8). Cargo resolves the workspace to 4.9 (semver compatible, same major).

### WorkflowOrchestrator Constructor Change

```rust
// Before
WorkflowOrchestrator::new(config_path, swarm, kg, nats)

// After
WorkflowOrchestrator::new(config_path, swarm, kg, nats, db: Surreal)
```

`db` is the existing backend connection — the store does not open its own connection.

---

## Data Flow

```text
start_workflow()
  → WorkflowInstance::new()
  → store.save()                    ← persistence
  → execute_current_stage()
  → cedar.authorize()               ← auth (if configured)
  → swarm.assign_task()

on_task_completed()
  → task.mark_completed()
  → store.save()                    ← persistence

on_task_failed(can_retry=false)
  → mark_current_task_failed()      ← stage transition
  → saga.compensate(stages, ctx)    ← saga (reverse-order dispatch)
  → instance.fail()
  → store.save()                    ← persistence

startup crash recovery
  → store.load_active()             ← restores active_workflows DashMap
```

---

## Saga Compensation Protocol

Compensation is **best-effort**: errors are logged, never propagated.

Stage order is reversed: the last executed stage receives a rollback task first.

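The reverse-order, best-effort behavior can be illustrated with a minimal, standard-library-only sketch. It is not the actual `SagaCompensator`: the `StageConfig` and `CompensationTask` shapes, the `plan_compensation` and `compensate` functions, and the closure standing in for `SwarmCoordinator` are all hypothetical simplifications. Treating an empty `compensation_agents` list the same as an absent one is also an assumption.

```rust
/// Hypothetical, simplified stand-in for the engine's `StageConfig`.
#[derive(Debug, Clone)]
pub struct StageConfig {
    pub name: String,
    /// `None` means the stage declares no rollback handler.
    pub compensation_agents: Option<Vec<String>>,
}

/// Hypothetical rollback task handed to the dispatcher.
#[derive(Debug, Clone, PartialEq)]
pub struct CompensationTask {
    pub stage_name: String,
    pub workflow_id: String,
    pub agents: Vec<String>,
}

/// Builds rollback tasks for the executed stages, last executed first.
/// Stages without compensation agents are skipped (assumption: an empty
/// list is treated like a missing field).
pub fn plan_compensation(workflow_id: &str, executed: &[StageConfig]) -> Vec<CompensationTask> {
    executed
        .iter()
        .rev()
        .filter_map(|stage| {
            let agents = stage
                .compensation_agents
                .as_ref()
                .filter(|agents| !agents.is_empty())?;
            Some(CompensationTask {
                stage_name: stage.name.clone(),
                workflow_id: workflow_id.to_string(),
                agents: agents.clone(),
            })
        })
        .collect()
}

/// Best-effort dispatch: a failed dispatch is logged to stderr and the loop
/// continues with the next task. Returns how many tasks were dispatched.
pub fn compensate<F>(tasks: &[CompensationTask], mut dispatch: F) -> usize
where
    F: FnMut(&CompensationTask) -> Result<(), String>,
{
    let mut dispatched = 0;
    for task in tasks {
        match dispatch(task) {
            Ok(()) => dispatched += 1,
            Err(err) => eprintln!(
                "compensation for stage '{}' failed: {}",
                task.stage_name, err
            ),
        }
    }
    dispatched
}
```

Separating planning from dispatch keeps the ordering and filtering rules testable without a running coordinator; in the real engine the closure's role would be played by the `SwarmCoordinator` call.
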
Only stages with `compensation_agents` defined in their `StageConfig` receive a compensation task. Stages without the field are silently skipped.

Compensation task payload sent to `SwarmCoordinator`:

```json
{
  "type": "compensation",
  "stage_name": "deploy",
  "workflow_id": "abc-123",
  "original_context": { "…": "…" },
  "artifacts_to_undo": ["artifact-id-1"]
}
```

---

## Cedar Authorization

`CedarAuthorizer::load_from_dir(path)` reads all `*.cedar` files from the directory and compiles them into a single `PolicySet`. Before each stage dispatch:

```rust
cedar.authorize(
    "vapora-orchestrator",        // principal
    "execute-stage",              // action
    "Stage::\"architecture\"",    // resource
)?;
```

A `Deny` decision returns `WorkflowError::Unauthorized`, halting the workflow without dispatching the stage.

If `cedar_policy_dir` is not set in `EngineConfig`, Cedar is disabled and all stages proceed without policy checks.

---

## Rationale

### Why SurrealDB (not Redis / SQLite)

SurrealDB is already the persistence layer for every other stateful component in vapora. Adding `workflow_instances` as one more table adds no operational footprint (no new service, no new connection pool).

`WorkflowInstance` already implements `Serialize/Deserialize`; the store serializes via `serde_json::Value` to satisfy the `SurrealValue` trait requirement introduced in surrealdb v3.

### Why Saga Over Two-Phase Commit

Workflows already span multiple async agent executions over NATS. Two-phase commit across these boundaries would require protocol changes in every agent. Saga achieves eventual consistency via compensating transactions that each agent already understands (a task with `type: "compensation"`).

### Why Cedar Over RBAC / Custom Middleware

Cedar policies are already used by the rest of the VAPORA platform (see ADR-0010). Per-stage rules expressed in `.cedar` files are reviewable outside the codebase and can be changed without rebuilding or redeploying the engine; by current design, a restart is required to reload them.
A custom middleware table would require schema migrations for every policy change.

---

## Consequences

### Positive

- Workflows survive server restarts (crash recovery via `load_active()`)
- Non-retryable stage failure triggers best-effort rollback of completed stages
- Per-stage access control via auditable policy files
- Zero new infrastructure (uses existing SurrealDB connection)
- 31/31 existing tests continue to pass; 5 new tests added (auth × 3, saga × 2)

### Negative

- `WorkflowOrchestrator::new()` signature change requires callers to pass `Surreal`
- Cedar requires `.cedar` files on disk; missing `cedar_policy_dir` disables auth silently
- Compensation is best-effort — no guarantee of full rollback if compensation agent also fails

### Mitigations

| Risk | Mitigation |
|------|------------|
| Saga partial rollback | Metrics track compensation dispatch; dead-letter queue via NATS for retry |
| Cedar files missing | `cedar_policy_dir = None` → no-auth mode; documented explicitly |
| Signature change | Backend already owns `db: Arc<Surreal<…>>`; passed at construction |

---

## Verification

```bash
cargo test -p vapora-workflow-engine                   # 31/31 pass
cargo clippy -p vapora-workflow-engine -- -D warnings  # 0 warnings
```

New tests:

- `auth::tests::test_permit_allows`
- `auth::tests::test_deny_returns_unauthorized`
- `auth::tests::test_empty_dir_fails`
- `saga::tests::test_stages_with_compensation_agents_are_included`
- `saga::tests::test_stages_with_no_compensation_agents_are_skipped`

---

## Related ADRs

- [ADR-0028](./0028-workflow-orchestrator.md) — Workflow Orchestrator (original implementation)
- [ADR-0010](./0010-cedar-authorization.md) — Cedar Authorization
- [ADR-0004](./0004-surrealdb-database.md) — SurrealDB as single persistence layer
- [ADR-0018](./0018-swarm-load-balancing.md) — SwarmCoordinator (Saga dispatch target)