# ADR-0033: Workflow Engine Hardening — Persistence, Saga Compensation, Cedar Authorization

**Status**: Implemented
**Date**: 2026-02-21
**Deciders**: VAPORA Team
**Technical Story**: `vapora-workflow-engine` lost all state on restart, had no rollback mechanism on failure, and applied no per-stage access control.

---

## Decision

Harden `vapora-workflow-engine` with three independent layers inspired by the stratum-orchestrator project:

1. **SurrealDB persistence** (`SurrealWorkflowStore`) — crash-recoverable `WorkflowInstance` state
2. **Saga compensation** (`SagaCompensator`) — reverse-order rollback dispatch via `SwarmCoordinator`
3. **Cedar authorization** (`CedarAuthorizer`) — per-stage policy enforcement before task dispatch

All three are implemented natively inside `vapora-workflow-engine` — stratum-orchestrator is **not** a direct dependency.

---

## Context

### Gaps Before This ADR

| Gap | Consequence |
|-----|-------------|
| In-memory `DashMap` only | All running workflows lost on server restart |
| No compensation on failure | Stage 3 failure left Stage 1 and 2 side effects live |
| No authorization check per stage | Any caller could trigger any stage in any workflow |

### Why Not Import stratum-orchestrator Directly

The plan initially included:

```toml
stratum-orchestrator = { path = "../../../stratumiops/crates/stratum-orchestrator" }
```

This fails because of the dependency chain `stratum-orchestrator → platform-nats → nkeys = { workspace = true }`. The `nkeys` dependency is resolved only inside the `stratumiops` workspace; it is not published to crates.io and has no path resolvable from vapora's workspace root. Cargo errors with `failed to select a version for nkeys`.

The `CedarAuthorizer` inside stratum-orchestrator is 88 self-contained lines that depend only on `cedar-policy`. Reimplementing it locally carries no real duplication risk and avoids a circular workspace dependency.

---

## Implementation

### New Modules

```text
crates/vapora-workflow-engine/src/
├── auth.rs         — CedarAuthorizer: loads .cedar policy files, authorize()
├── persistence.rs  — SurrealWorkflowStore: save/load/load_active/delete
└── saga.rs         — SagaCompensator: compensate(workflow_id, stages, ctx)
```

### New Migration

```text
migrations/009_workflow_state.surql — SCHEMAFULL workflow_instances table
```

### Config Changes

```toml
[engine]
cedar_policy_dir = "/etc/vapora/cedar"  # optional; Cedar disabled if absent

[[workflows.stages]]
name = "deploy"
agents = ["devops"]
compensation_agents = ["devops"]  # receives rollback task if Saga fires
```

### Dependency Addition

```toml
# crates/vapora-workflow-engine/Cargo.toml
surrealdb = { workspace = true }
cedar-policy = "4.9"
```

`cedar-policy` is now a direct dependency; previously it was only transitive via `secretumvault` (4.8). Cargo resolves the workspace to 4.9, which is semver-compatible with 4.8 (same major version).
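
The version reasoning above can be checked with a small sketch. It assumes Cargo's default caret semantics for versions with a major component of at least 1, where a requirement such as `4.8` accepts any `4.x.y` at or above `4.8.0`, and it assumes `secretumvault` declares its requirement in that form. The names `Version`, `caret_matches`, and `satisfies_all` are hypothetical and not part of Cargo or the vapora codebase.

```rust
/// A version as (major, minor, patch). Hypothetical helper type.
pub type Version = (u64, u64, u64);

/// Caret-style match, restricted to requirements with major >= 1:
/// the candidate must share the major version and must not be older
/// than the requirement. Tuples compare lexicographically.
pub fn caret_matches(req: Version, candidate: Version) -> bool {
    req.0 >= 1 && candidate.0 == req.0 && candidate >= req
}

/// A single resolved version must satisfy every requirement in the graph.
pub fn satisfies_all(reqs: &[Version], candidate: Version) -> bool {
    reqs.iter().all(|&req| caret_matches(req, candidate))
}
```

With the requirements `4.8` and `4.9` written as `(4, 8, 0)` and `(4, 9, 0)`, the candidate `(4, 9, 0)` satisfies both, while any `4.8.x` candidate fails the second requirement and `5.0.0` fails both.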

### WorkflowOrchestrator Constructor Change

```rust
// Before
WorkflowOrchestrator::new(config_path, swarm, kg, nats)

// After (`db` is a `Surreal<Client>`)
WorkflowOrchestrator::new(config_path, swarm, kg, nats, db)
```

`db` is the existing backend connection — the store does not open its own connection.

---

## Data Flow

```text
start_workflow()
  → WorkflowInstance::new()
  → store.save()                      ← persistence
  → execute_current_stage()
  → cedar.authorize()                 ← auth (if configured)
  → swarm.assign_task()

on_task_completed()
  → task.mark_completed()
  → store.save()                      ← persistence

on_task_failed(can_retry=false)
  → mark_current_task_failed()        ← stage transition
  → saga.compensate(stages, ctx)      ← saga (reverse-order dispatch)
  → instance.fail()
  → store.save()                      ← persistence

startup crash recovery
  → store.load_active()               ← restores active_workflows DashMap
```
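
The failure and recovery paths in the diagram can be sketched with an in-memory stand-in for the store. This is a minimal illustration under stated assumptions: `MemoryStore`, `Status`, the `current_stage` field, and the rule that only `Running` instances count as active are all hypothetical, and the real `SurrealWorkflowStore` and `SagaCompensator` are presumably asynchronous and backed by SurrealDB. The `compensate` callback stands in for `saga.compensate(stages, ctx)`; as in the diagram, it runs before the instance is marked failed and saved, and recovery rebuilds the active map from whatever the store reports as active.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub enum Status {
    Running,
    Completed,
    Failed,
}

/// Simplified, hypothetical view of a workflow instance.
#[derive(Debug, Clone, PartialEq)]
pub struct WorkflowInstance {
    pub id: String,
    pub status: Status,
    pub current_stage: usize,
}

/// Hypothetical in-memory stand-in for `SurrealWorkflowStore`.
#[derive(Default)]
pub struct MemoryStore {
    rows: HashMap<String, WorkflowInstance>,
}

impl MemoryStore {
    /// Upsert the instance under its id.
    pub fn save(&mut self, instance: &WorkflowInstance) {
        self.rows.insert(instance.id.clone(), instance.clone());
    }

    /// Instances still in flight (assumed here to mean `Running`).
    pub fn load_active(&self) -> Vec<WorkflowInstance> {
        self.rows
            .values()
            .filter(|w| w.status == Status::Running)
            .cloned()
            .collect()
    }
}

/// Non-retryable failure path: compensate, mark failed, persist.
pub fn on_task_failed<F>(store: &mut MemoryStore, instance: &mut WorkflowInstance, mut compensate: F)
where
    F: FnMut(&WorkflowInstance),
{
    compensate(&*instance);
    instance.status = Status::Failed;
    store.save(instance);
}

/// Startup recovery: rebuild the in-memory map of active workflows.
pub fn recover(store: &MemoryStore) -> HashMap<String, WorkflowInstance> {
    store
        .load_active()
        .into_iter()
        .map(|w| (w.id.clone(), w))
        .collect()
}
```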

---

## Saga Compensation Protocol

Compensation is **best-effort**: errors are logged, never propagated. Stage order is reversed: the last executed stage receives a rollback task first.

Only stages with `compensation_agents` defined in their `StageConfig` receive a compensation task. Stages without the field are silently skipped.

Compensation task payload sent to `SwarmCoordinator`:

```json
{
  "type": "compensation",
  "stage_name": "deploy",
  "workflow_id": "abc-123",
  "original_context": { "…" : "…" },
  "artifacts_to_undo": ["artifact-id-1"]
}
```
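
The selection and ordering rules above can be sketched as follows. This is an illustration, not the engine's code: `StageConfig` here is a simplified stand-in with only the two fields the rules mention; `CompensationTask`, `plan_compensation`, and `compensate` (a free function, not the real `SagaCompensator::compensate`) are hypothetical, as is treating an empty agent list like a missing one. The `dispatch` closure stands in for `SwarmCoordinator`; its errors are collected for logging and never abort the loop, mirroring the best-effort rule.

```rust
/// Simplified, hypothetical stand-in for the engine's `StageConfig`.
#[derive(Debug, Clone, PartialEq)]
pub struct StageConfig {
    pub name: String,
    /// Agents that receive the rollback task; `None` means the stage is skipped.
    pub compensation_agents: Option<Vec<String>>,
}

/// One rollback task to dispatch (context and artifacts omitted for brevity).
#[derive(Debug, Clone, PartialEq)]
pub struct CompensationTask {
    pub workflow_id: String,
    pub stage_name: String,
    pub agents: Vec<String>,
}

/// Build rollback tasks for the executed stages, last executed stage first.
pub fn plan_compensation(workflow_id: &str, executed: &[StageConfig]) -> Vec<CompensationTask> {
    executed
        .iter()
        .rev()
        .filter_map(|stage| {
            // Stages without compensation agents are silently skipped.
            let agents = stage.compensation_agents.as_ref()?;
            if agents.is_empty() {
                return None; // assumption: an empty list behaves like a missing field
            }
            Some(CompensationTask {
                workflow_id: workflow_id.to_string(),
                stage_name: stage.name.clone(),
                agents: agents.clone(),
            })
        })
        .collect()
}

/// Dispatch every task. Failures are returned as log messages and never
/// stop the remaining tasks from being dispatched.
pub fn compensate<F>(tasks: &[CompensationTask], mut dispatch: F) -> Vec<String>
where
    F: FnMut(&CompensationTask) -> Result<(), String>,
{
    let mut errors = Vec::new();
    for task in tasks {
        if let Err(e) = dispatch(task) {
            errors.push(format!("compensation for stage '{}' failed: {}", task.stage_name, e));
        }
    }
    errors
}
```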

---

## Cedar Authorization

`CedarAuthorizer::load_from_dir(path)` reads all `*.cedar` files from the directory and compiles them into a single `PolicySet`. Before each stage dispatch:

```rust
cedar.authorize(
    "vapora-orchestrator",       // principal
    "execute-stage",             // action
    "Stage::\"architecture\"",   // resource
)?;
```

A `Deny` decision returns `WorkflowError::Unauthorized`, halting the workflow without dispatching the stage. If `cedar_policy_dir` is not set in `EngineConfig`, Cedar is disabled and all stages proceed without policy checks.
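
The gate described above can be sketched without the `cedar-policy` crate. The stand-in below replaces the compiled `PolicySet` with a plain allow-list, which mimics Cedar's default-deny behaviour but none of its policy language; `AllowListAuthorizer` and `check_stage` are hypothetical names, and the `WorkflowError` enum is reduced to the one variant the text mentions. An absent authorizer (`None`) models an unset `cedar_policy_dir`, so every stage passes.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
pub enum WorkflowError {
    Unauthorized(String),
}

/// Hypothetical stand-in for `CedarAuthorizer`: a fixed set of permitted
/// (principal, action, resource) triples instead of compiled Cedar policies.
pub struct AllowListAuthorizer {
    permitted: HashSet<(String, String, String)>,
}

impl AllowListAuthorizer {
    pub fn new(rules: &[(&str, &str, &str)]) -> Self {
        let permitted = rules
            .iter()
            .map(|&(p, a, r)| (p.to_string(), a.to_string(), r.to_string()))
            .collect();
        AllowListAuthorizer { permitted }
    }

    /// Anything not explicitly permitted is denied.
    pub fn authorize(&self, principal: &str, action: &str, resource: &str) -> Result<(), WorkflowError> {
        let key = (principal.to_string(), action.to_string(), resource.to_string());
        if self.permitted.contains(&key) {
            Ok(())
        } else {
            Err(WorkflowError::Unauthorized(format!(
                "{} may not {} on {}",
                principal, action, resource
            )))
        }
    }
}

/// Gate applied before a stage is dispatched. `None` means no policy
/// directory was configured, so the check is skipped entirely.
pub fn check_stage(authorizer: Option<&AllowListAuthorizer>, stage: &str) -> Result<(), WorkflowError> {
    match authorizer {
        Some(auth) => auth.authorize(
            "vapora-orchestrator",
            "execute-stage",
            &format!("Stage::\"{}\"", stage),
        ),
        None => Ok(()),
    }
}
```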

---

## Rationale

### Why SurrealDB (not Redis / SQLite)

SurrealDB is already the persistence layer for every other stateful component in vapora. Adding `workflow_instances` as one more table adds no operational footprint (no new service, no new connection pool). `WorkflowInstance` already implements `Serialize`/`Deserialize`; the store serializes via `serde_json::Value` to satisfy the `SurrealValue` trait requirement introduced in surrealdb v3.

### Why Saga Over Two-Phase Commit

Workflows already span multiple async agent executions over NATS. Two-phase commit across these boundaries would require protocol changes in every agent. Saga achieves eventual consistency via compensating transactions that each agent already understands (a task with `type: "compensation"`).

### Why Cedar Over RBAC / Custom Middleware

Cedar policies are already used by the rest of the VAPORA platform (see ADR-0010). Per-stage rules expressed in `.cedar` files can be reviewed outside the codebase and changed without redeployment, although by current design the engine must be restarted to reload them. A custom middleware table would require schema migrations for every policy change.

---

## Consequences

### Positive

- Workflows survive server restarts (crash recovery via `load_active()`)
- Non-retryable stage failure triggers best-effort rollback of completed stages
- Per-stage access control via auditable policy files
- Zero new infrastructure (uses existing SurrealDB connection)
- 31/31 existing tests continue to pass; 5 new tests added (auth × 3, saga × 2)

### Negative

- `WorkflowOrchestrator::new()` signature change requires callers to pass `Surreal<Client>`
- Cedar requires `.cedar` files on disk; missing `cedar_policy_dir` disables auth silently
- Compensation is best-effort — no guarantee of full rollback if compensation agent also fails

### Mitigations

| Risk | Mitigation |
|------|------------|
| Saga partial rollback | Metrics track compensation dispatch; dead-letter queue via NATS for retry |
| Cedar files missing | `cedar_policy_dir = None` → no-auth mode; documented explicitly |
| Signature change | Backend already owns `db: Arc<Surreal<Client>>`; passed at construction |

---

## Verification

```bash
cargo test -p vapora-workflow-engine                    # 31/31 pass
cargo clippy -p vapora-workflow-engine -- -D warnings   # 0 warnings
```

New tests:

- `auth::tests::test_permit_allows`
- `auth::tests::test_deny_returns_unauthorized`
- `auth::tests::test_empty_dir_fails`
- `saga::tests::test_stages_with_compensation_agents_are_included`
- `saga::tests::test_stages_with_no_compensation_agents_are_skipped`

---

## Related ADRs

- [ADR-0028](./0028-workflow-orchestrator.md) — Workflow Orchestrator (original implementation)
- [ADR-0010](./0010-cedar-authorization.md) — Cedar Authorization
- [ADR-0004](./0004-surrealdb-database.md) — SurrealDB as single persistence layer
- [ADR-0018](./0018-swarm-load-balancing.md) — SwarmCoordinator (Saga dispatch target)