ADR-0033: Workflow Engine Hardening — Persistence, Saga Compensation, Cedar Authorization
Status: Implemented
Date: 2026-02-21
Deciders: VAPORA Team
Technical Story: vapora-workflow-engine lost all state on restart, had no rollback mechanism on failure, and applied no per-stage access control.
Decision
Harden vapora-workflow-engine with three independent layers inspired by the stratum-orchestrator project:
- SurrealDB persistence (SurrealWorkflowStore): crash-recoverable WorkflowInstance state
- Saga compensation (SagaCompensator): reverse-order rollback dispatch via SwarmCoordinator
- Cedar authorization (CedarAuthorizer): per-stage policy enforcement before task dispatch
All three are implemented natively inside vapora-workflow-engine — stratum-orchestrator is not a direct dependency.
Context
Gaps Before This ADR
| Gap | Consequence |
|---|---|
| In-memory DashMap only | All running workflows lost on server restart |
| No compensation on failure | Stage 3 failure left Stage 1 and 2 side effects live |
| No authorization check per stage | Any caller could trigger any stage in any workflow |
Why Not Import stratum-orchestrator Directly
The plan initially included:
stratum-orchestrator = { path = "../../../stratumiops/crates/stratum-orchestrator" }
This fails because stratum-orchestrator → platform-nats → nkeys = { workspace = true }. The nkeys dependency is resolved only inside the stratumiops workspace; it is not published to crates.io and has no path resolvable from vapora's workspace root. Cargo errors with failed to select a version for nkeys.
The CedarAuthorizer inside stratum-orchestrator is 88 self-contained lines using only cedar-policy. Implementing it locally is zero duplication risk and avoids a circular workspace dependency.
Implementation
New Modules
crates/vapora-workflow-engine/src/
├── auth.rs — CedarAuthorizer: loads .cedar policy files, authorize()
├── persistence.rs — SurrealWorkflowStore: save/load/load_active/delete
└── saga.rs — SagaCompensator: compensate(workflow_id, stages, ctx)
New Migration
migrations/009_workflow_state.surql — SCHEMAFULL workflow_instances table
Config Changes
[engine]
cedar_policy_dir = "/etc/vapora/cedar" # optional; Cedar disabled if absent
[[workflows.stages]]
name = "deploy"
agents = ["devops"]
compensation_agents = ["devops"] # receives rollback task if Saga fires
Dependency Addition
# crates/vapora-workflow-engine/Cargo.toml
surrealdb = { workspace = true }
cedar-policy = "4.9"
cedar-policy enters directly; it was previously only transitive via secretumvault (4.8). Cargo resolves the workspace to 4.9 (semver compatible, same major).
WorkflowOrchestrator Constructor Change
// Before
WorkflowOrchestrator::new(config_path, swarm, kg, nats)
// After
WorkflowOrchestrator::new(config_path, swarm, kg, nats, db: Surreal<Client>)
db is the existing backend connection — the store does not open its own connection.
Data Flow
start_workflow()
  → WorkflowInstance::new()
  → store.save()                      ← persistence
  → execute_current_stage()
      → cedar.authorize()             ← auth (if configured)
      → swarm.assign_task()

on_task_completed()
  → task.mark_completed()
  → store.save()                      ← persistence

on_task_failed(can_retry=false)
  → mark_current_task_failed()        ← stage transition
  → saga.compensate(stages, ctx)      ← saga (reverse-order dispatch)
  → instance.fail()
  → store.save()                      ← persistence

startup crash recovery
  → store.load_active()               ← restores active_workflows DashMap
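The save-then-recover path in this flow can be sketched with an in-memory stand-in. This is a minimal sketch under stated assumptions, not the engine's code: WorkflowStore, MemoryStore, Status, and recover are hypothetical names, a plain HashMap replaces both SurrealDB and the DashMap, and treating only Running instances as "active" is an assumption.

```rust
use std::collections::HashMap;

// Hypothetical, simplified stand-ins for the engine's real types.
#[derive(Debug, Clone, PartialEq)]
enum Status {
    Running,
    Completed,
    Failed,
}

#[derive(Debug, Clone, PartialEq)]
struct WorkflowInstance {
    id: String,
    current_stage: usize,
    status: Status,
}

// Stand-in for SurrealWorkflowStore; only the two calls used here.
trait WorkflowStore {
    fn save(&mut self, instance: &WorkflowInstance);
    fn load_active(&self) -> Vec<WorkflowInstance>;
}

// In-memory store used purely for illustration (no database involved).
#[derive(Default)]
struct MemoryStore {
    rows: HashMap<String, WorkflowInstance>,
}

impl WorkflowStore for MemoryStore {
    // Upsert keyed by workflow id, called after every state transition.
    fn save(&mut self, instance: &WorkflowInstance) {
        self.rows.insert(instance.id.clone(), instance.clone());
    }

    // Assumption: "active" means the instance is still Running.
    fn load_active(&self) -> Vec<WorkflowInstance> {
        self.rows
            .values()
            .filter(|w| w.status == Status::Running)
            .cloned()
            .collect()
    }
}

/// Rebuilds the in-memory map of active workflows at startup.
fn recover(store: &dyn WorkflowStore) -> HashMap<String, WorkflowInstance> {
    store
        .load_active()
        .into_iter()
        .map(|w| (w.id.clone(), w))
        .collect()
}
```

Calling recover once at startup, before any new workflow is accepted, leaves the map holding only the unfinished workflows, each at the stage recorded by its most recent save.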
Saga Compensation Protocol
Compensation is best-effort: errors are logged, never propagated. Stage order is reversed: the last executed stage receives a rollback task first.
Only stages with compensation_agents defined in their StageConfig receive a compensation task. Stages without the field are silently skipped.
Compensation task payload sent to SwarmCoordinator:
{
  "type": "compensation",
  "stage_name": "deploy",
  "workflow_id": "abc-123",
  "original_context": { "…" : "…" },
  "artifacts_to_undo": ["artifact-id-1"]
}
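The ordering and skipping rules above can be illustrated with a small, dependency-free sketch. The types and function names here (StageConfig as shown, CompensationTask, plan_compensation, and the dispatch closure) are hypothetical simplifications, not the real SagaCompensator or SwarmCoordinator API; the closure merely stands in for task dispatch.

```rust
// Hypothetical, simplified stand-ins for the engine's real types.
#[derive(Debug, Clone)]
struct StageConfig {
    name: String,
    // Mirrors the `compensation_agents` config key; None means the field is absent.
    compensation_agents: Option<Vec<String>>,
}

#[derive(Debug, PartialEq)]
struct CompensationTask {
    stage_name: String,
    workflow_id: String,
    agents: Vec<String>,
}

/// Builds rollback tasks for the executed stages, last stage first.
/// Stages without `compensation_agents` are skipped.
fn plan_compensation(workflow_id: &str, executed: &[StageConfig]) -> Vec<CompensationTask> {
    executed
        .iter()
        .rev()
        .filter_map(|stage| {
            let agents = stage.compensation_agents.as_ref()?;
            Some(CompensationTask {
                stage_name: stage.name.clone(),
                workflow_id: workflow_id.to_string(),
                agents: agents.clone(),
            })
        })
        .collect()
}

/// Best-effort dispatch: a failure is logged and the loop continues.
/// Returns how many tasks were dispatched successfully.
fn compensate<F>(tasks: &[CompensationTask], mut dispatch: F) -> usize
where
    F: FnMut(&CompensationTask) -> Result<(), String>,
{
    let mut dispatched = 0;
    for task in tasks {
        match dispatch(task) {
            Ok(()) => dispatched += 1,
            Err(e) => eprintln!("compensation for stage '{}' failed: {}", task.stage_name, e),
        }
    }
    dispatched
}
```

Returning a count instead of a Result keeps the caller's failure path free of compensation errors, which matches the best-effort rule; the count could feed a metric.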
Cedar Authorization
CedarAuthorizer::load_from_dir(path) reads all *.cedar files from the directory and compiles them into a single PolicySet. Before each stage dispatch:
cedar.authorize(
    "vapora-orchestrator",       // principal
    "execute-stage",             // action
    "Stage::\"architecture\"",   // resource
)?;
A Deny decision returns WorkflowError::Unauthorized, halting the workflow without dispatching the stage. If cedar_policy_dir is not set in EngineConfig, Cedar is disabled and all stages proceed without policy checks.
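The gate behaviour described above (Deny halts the workflow, a missing authorizer lets every stage through) can be sketched without the cedar-policy crate. Everything in this sketch is an assumption made for illustration: the Authorizer trait, the AllowList toy policy, and authorize_stage are hypothetical names, and a real CedarAuthorizer would evaluate a compiled PolicySet instead of a string list.

```rust
#[derive(Debug, PartialEq)]
enum Decision {
    Allow,
    Deny,
}

#[derive(Debug, PartialEq)]
enum WorkflowError {
    Unauthorized(String),
}

// Stand-in for CedarAuthorizer; not the cedar-policy API.
trait Authorizer {
    fn decide(&self, principal: &str, action: &str, resource: &str) -> Decision;
}

/// Toy policy: permits "execute-stage" only on the listed resources.
struct AllowList {
    allowed_resources: Vec<String>,
}

impl Authorizer for AllowList {
    fn decide(&self, _principal: &str, action: &str, resource: &str) -> Decision {
        let permitted = action == "execute-stage"
            && self.allowed_resources.iter().any(|r| r == resource);
        if permitted {
            Decision::Allow
        } else {
            Decision::Deny
        }
    }
}

/// Gate run before a stage is dispatched.
/// `None` models an unset cedar_policy_dir: no policy checks at all.
fn authorize_stage(auth: Option<&dyn Authorizer>, stage: &str) -> Result<(), WorkflowError> {
    let auth = match auth {
        Some(a) => a,
        None => return Ok(()),
    };
    // Same resource shape as the call shown above, e.g. Stage::"architecture".
    let resource = format!("Stage::\"{}\"", stage);
    match auth.decide("vapora-orchestrator", "execute-stage", &resource) {
        Decision::Allow => Ok(()),
        Decision::Deny => Err(WorkflowError::Unauthorized(resource)),
    }
}
```

Because the gate returns a Result, the caller can use `?` to stop before dispatch. The `None` branch also makes the no-auth mode explicit in code, which is where a startup warning could be logged.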
Rationale
Why SurrealDB (not Redis / SQLite)
SurrealDB is already the persistence layer for every other stateful component in vapora. Adding workflow_instances as one more table keeps the operational footprint at zero (no new service, no new connection pool). WorkflowInstance already implements Serialize/Deserialize; the store serializes via serde_json::Value to satisfy the SurrealValue trait requirement introduced in surrealdb v3.
Why Saga Over Two-Phase Commit
Workflows already span multiple async agent executions over NATS. Two-phase commit across these boundaries would require protocol changes in every agent. Saga achieves eventual consistency via compensating transactions that each agent already understands (a task with type: "compensation").
Why Cedar Over RBAC / Custom Middleware
Cedar policies are already used by the rest of the VAPORA platform (see ADR-0010). Per-stage rules expressed in .cedar files are reviewable outside the codebase and can be changed without redeployment, although by current design the engine must be restarted to reload them (there is no hot reload). A custom middleware table would require schema migrations for every policy change.
Consequences
Positive
- Workflows survive server restarts (crash recovery via load_active())
- Non-retryable stage failure triggers best-effort rollback of completed stages
- Per-stage access control via auditable policy files
- Zero new infrastructure (uses existing SurrealDB connection)
- 31/31 existing tests continue to pass; 5 new tests added (auth × 3, saga × 2)
Negative
- WorkflowOrchestrator::new() signature change requires callers to pass Surreal<Client>
- Cedar requires .cedar files on disk; missing cedar_policy_dir disables auth silently
- Compensation is best-effort: no guarantee of full rollback if the compensation agent also fails
Mitigations
| Risk | Mitigation |
|---|---|
| Saga partial rollback | Metrics track compensation dispatch; dead-letter queue via NATS for retry |
| Cedar files missing | cedar_policy_dir = None → no-auth mode; documented explicitly |
| Signature change | Backend already owns db: Arc<Surreal<Client>>; passed at construction |
Verification
cargo test -p vapora-workflow-engine # 31/31 pass
cargo clippy -p vapora-workflow-engine -- -D warnings # 0 warnings
New tests:
- auth::tests::test_permit_allows
- auth::tests::test_deny_returns_unauthorized
- auth::tests::test_empty_dir_fails
- saga::tests::test_stages_with_compensation_agents_are_included
- saga::tests::test_stages_with_no_compensation_agents_are_skipped