Vapora/docs/adrs/0033-stratum-orchestrator-workflow-hardening.md
Jesús Pérez b9e2cee9f7
feat(workflow-engine): add saga, persistence, auth, and NATS-integrated orchestrator hardening
Key changes driving this: new saga.rs, persistence.rs, auth.rs in workflow-engine; SurrealDB migration 009_workflow_state.surql; backend
  services refactored; frontend dist built; ADR-0033 documenting the hardening decision.
2026-02-22 21:44:42 +00:00

# ADR-0033: Workflow Engine Hardening — Persistence, Saga Compensation, Cedar Authorization
**Status**: Implemented
**Date**: 2026-02-21
**Deciders**: VAPORA Team
**Technical Story**: `vapora-workflow-engine` lost all state on restart, had no rollback mechanism on failure, and applied no per-stage access control.
---
## Decision
Harden `vapora-workflow-engine` with three independent layers inspired by the stratum-orchestrator project:
1. **SurrealDB persistence** (`SurrealWorkflowStore`) — crash-recoverable `WorkflowInstance` state
2. **Saga compensation** (`SagaCompensator`) — reverse-order rollback dispatch via `SwarmCoordinator`
3. **Cedar authorization** (`CedarAuthorizer`) — per-stage policy enforcement before task dispatch
All three are implemented natively inside `vapora-workflow-engine` — stratum-orchestrator is **not** a direct dependency.
---
## Context
### Gaps Before This ADR
| Gap | Consequence |
|-----|-------------|
| In-memory `DashMap` only | All running workflows lost on server restart |
| No compensation on failure | A Stage 3 failure left the side effects of Stages 1 and 2 in place |
| No authorization check per stage | Any caller could trigger any stage in any workflow |
### Why Not Import stratum-orchestrator Directly
The plan initially included:
```toml
stratum-orchestrator = { path = "../../../stratumiops/crates/stratum-orchestrator" }
```
This fails because `stratum-orchestrator → platform-nats → nkeys = { workspace = true }`. The `nkeys` dependency is resolved only inside the `stratumiops` workspace; it is not published to crates.io and has no path resolvable from vapora's workspace root. Cargo errors with `failed to select a version for nkeys`.
The `CedarAuthorizer` inside stratum-orchestrator is 88 self-contained lines that depend only on `cedar-policy`. Reimplementing it locally carries negligible duplication risk and avoids the unresolvable cross-workspace dependency.
---
## Implementation
### New Modules
```text
crates/vapora-workflow-engine/src/
├── auth.rs — CedarAuthorizer: loads .cedar policy files, authorize()
├── persistence.rs — SurrealWorkflowStore: save/load/load_active/delete
└── saga.rs — SagaCompensator: compensate(workflow_id, stages, ctx)
```
### New Migration
```text
migrations/009_workflow_state.surql — SCHEMAFULL workflow_instances table
```
### Config Changes
```toml
[engine]
cedar_policy_dir = "/etc/vapora/cedar" # optional; Cedar disabled if absent
[[workflows.stages]]
name = "deploy"
agents = ["devops"]
compensation_agents = ["devops"] # receives rollback task if Saga fires
```
### Dependency Addition
```toml
# crates/vapora-workflow-engine/Cargo.toml
surrealdb = { workspace = true }
cedar-policy = "4.9"
```
`cedar-policy` becomes a direct dependency; previously it was only pulled in transitively via `secretumvault` (4.8). Cargo resolves the whole workspace to 4.9, which is semver-compatible because both versions share the same major.
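
The compatibility argument can be checked with a small sketch of Cargo's default (caret) requirement matching. This is an illustration only, not Cargo's resolver: `caret_matches` is a hypothetical helper, it covers only two-component requirements with a major version of at least 1, and it assumes that the `4.8` mentioned for `secretumvault` is its declared requirement.

```rust
/// Caret matching for a two-component requirement such as "4.9".
/// Valid only for major >= 1; 0.x requirements follow stricter rules.
fn caret_matches(req: (u64, u64), version: (u64, u64, u64)) -> bool {
    let (req_major, req_minor) = req;
    let (major, minor, _patch) = version;
    assert!(req_major >= 1, "0.x requirements are out of scope for this sketch");
    // Same major, and at least the requested minor.
    major == req_major && minor >= req_minor
}

fn main() {
    let candidate = (4, 9, 0);
    // Requirement "4.9" declared by vapora-workflow-engine.
    println!("4.9 accepts 4.9.0: {}", caret_matches((4, 9), candidate));
    // Requirement "4.8", assumed here to be what secretumvault declares.
    println!("4.8 accepts 4.9.0: {}", caret_matches((4, 8), candidate));
    // A 5.x release would satisfy neither requirement.
    println!("4.9 accepts 5.0.0: {}", caret_matches((4, 9), (5, 0, 0)));
}
```

Under these assumptions a single 4.9.x release satisfies both requirements, so the workspace needs only one copy of the crate.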
### WorkflowOrchestrator Constructor Change
```rust
// Before
WorkflowOrchestrator::new(config_path, swarm, kg, nats)
// After
WorkflowOrchestrator::new(config_path, swarm, kg, nats, db) // db: Surreal<Client>
```
`db` is the existing backend connection — the store does not open its own connection.
---
## Data Flow
```text
start_workflow()
  → WorkflowInstance::new()
  → store.save()                    ← persistence
  → execute_current_stage()
  → cedar.authorize()               ← auth (if configured)
  → swarm.assign_task()
on_task_completed()
  → task.mark_completed()
  → store.save()                    ← persistence
on_task_failed(can_retry=false)
  → mark_current_task_failed()      ← stage transition
  → saga.compensate(stages, ctx)    ← saga (reverse-order dispatch)
  → instance.fail()
  → store.save()                    ← persistence
startup crash recovery
  → store.load_active()             ← restores active_workflows DashMap
```
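
The crash-recovery step at the end of this flow can be sketched without a database. The following is a minimal, std-only illustration, not the engine's code: `InMemoryStore` stands in for `SurrealWorkflowStore`, a `HashMap` stands in for the `DashMap`, and the `Status` enum and `recover` helper are hypothetical. The real store talks to SurrealDB and is presumably async.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified status; the real `WorkflowInstance` carries more state.
#[derive(Clone, Debug, PartialEq)]
enum Status {
    Running,
    Completed,
    Failed,
}

#[derive(Clone, Debug, PartialEq)]
struct WorkflowInstance {
    id: String,
    status: Status,
}

/// Stand-in for `SurrealWorkflowStore`: method names follow the ADR,
/// but the backing storage is a HashMap instead of SurrealDB.
#[derive(Default)]
struct InMemoryStore {
    rows: HashMap<String, WorkflowInstance>,
}

impl InMemoryStore {
    /// Upsert the instance so that every state transition is recorded.
    fn save(&mut self, instance: &WorkflowInstance) {
        self.rows.insert(instance.id.clone(), instance.clone());
    }

    /// Return only the instances that were still running when last saved.
    fn load_active(&self) -> Vec<WorkflowInstance> {
        self.rows
            .values()
            .filter(|w| w.status == Status::Running)
            .cloned()
            .collect()
    }
}

/// Rebuild the in-memory map of active workflows at startup.
fn recover(store: &InMemoryStore) -> HashMap<String, WorkflowInstance> {
    store
        .load_active()
        .into_iter()
        .map(|w| (w.id.clone(), w))
        .collect()
}

fn main() {
    let mut store = InMemoryStore::default();
    store.save(&WorkflowInstance { id: "wf-1".into(), status: Status::Running });
    store.save(&WorkflowInstance { id: "wf-2".into(), status: Status::Failed });
    store.save(&WorkflowInstance { id: "wf-3".into(), status: Status::Completed });

    // Simulate a restart: the in-memory map is gone, the store is not.
    let active = recover(&store);
    println!("recovered {} workflow(s)", active.len()); // prints "recovered 1 workflow(s)"
}
```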
---
## Saga Compensation Protocol
Compensation is **best-effort**: errors are logged, never propagated. Stage order is reversed: the last executed stage receives a rollback task first.
Only stages with `compensation_agents` defined in their `StageConfig` receive a compensation task. Stages without the field are silently skipped.
Compensation task payload sent to `SwarmCoordinator`:
```json
{
"type": "compensation",
"stage_name": "deploy",
"workflow_id": "abc-123",
"original_context": { "…" : "…" },
"artifacts_to_undo": ["artifact-id-1"]
}
```
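
The ordering and best-effort rules in this section can be sketched as follows. This is an illustration under stated assumptions rather than the contents of `saga.rs`: `StageConfig` is trimmed to two fields, `CompensationTask`, `plan_compensation`, and the `stage` helper are hypothetical, the dispatcher is a plain closure instead of `SwarmCoordinator`, and an empty agent list is assumed to behave like a missing field.

```rust
/// Hypothetical, trimmed-down stage config; only the fields this sketch needs.
#[derive(Clone, Debug)]
struct StageConfig {
    name: String,
    compensation_agents: Option<Vec<String>>,
}

/// A rollback task addressed to one stage's compensation agents.
#[derive(Debug, PartialEq)]
struct CompensationTask {
    stage_name: String,
    agents: Vec<String>,
}

/// Walk executed stages last-to-first and keep those that declare agents.
fn plan_compensation(executed: &[StageConfig]) -> Vec<CompensationTask> {
    executed
        .iter()
        .rev()
        .filter_map(|stage| {
            let agents = stage.compensation_agents.as_ref()?;
            if agents.is_empty() {
                return None; // assumption: an empty list is treated like a missing field
            }
            Some(CompensationTask {
                stage_name: stage.name.clone(),
                agents: agents.clone(),
            })
        })
        .collect()
}

/// Dispatch every task; failures are logged and counted, never returned as errors.
fn compensate<F>(executed: &[StageConfig], mut dispatch: F) -> usize
where
    F: FnMut(&CompensationTask) -> Result<(), String>,
{
    let mut failed = 0;
    for task in plan_compensation(executed) {
        if let Err(err) = dispatch(&task) {
            eprintln!(
                "compensation for stage '{}' (agents {:?}) failed: {err}",
                task.stage_name, task.agents
            );
            failed += 1;
        }
    }
    failed
}

/// Small constructor to keep the example readable.
fn stage(name: &str, agents: Option<Vec<&str>>) -> StageConfig {
    StageConfig {
        name: name.to_string(),
        compensation_agents: agents.map(|a| a.into_iter().map(String::from).collect()),
    }
}

fn main() {
    let executed = vec![
        stage("architecture", None),
        stage("build", Some(vec!["developer"])),
        stage("deploy", Some(vec!["devops"])),
    ];

    // The dispatcher fails for "deploy" to show that "build" is still attempted.
    let mut order = Vec::new();
    let failed = compensate(&executed, |task| {
        order.push(task.stage_name.clone());
        if task.stage_name == "deploy" { Err("agent unavailable".into()) } else { Ok(()) }
    });
    // prints: dispatch order: ["deploy", "build"], failed dispatches: 1
    println!("dispatch order: {order:?}, failed dispatches: {failed}");
}
```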
---
## Cedar Authorization
`CedarAuthorizer::load_from_dir(path)` reads all `*.cedar` files from the directory and compiles them into a single `PolicySet`. Before each stage dispatch:
```rust
cedar.authorize(
"vapora-orchestrator", // principal
"execute-stage", // action
"Stage::\"architecture\"", // resource
)?;
```
A `Deny` decision returns `WorkflowError::Unauthorized`, halting the workflow without dispatching the stage. If `cedar_policy_dir` is not set in `EngineConfig`, Cedar is disabled and all stages proceed without policy checks.
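
The gate described above, including the disabled mode, can be sketched without the `cedar-policy` crate. This is a simplified model, not `auth.rs`: `AllowList` is a hypothetical stand-in for `CedarAuthorizer` that matches stage names instead of evaluating a `PolicySet`, `check_stage` is a hypothetical helper, and the `String` payload on `WorkflowError::Unauthorized` is an assumption.

```rust
/// Mirrors the error variant named in the ADR; the payload is an assumption.
#[derive(Debug, PartialEq)]
enum WorkflowError {
    Unauthorized(String),
}

/// Hypothetical stand-in for `CedarAuthorizer`; the real one evaluates Cedar policies.
struct AllowList {
    permitted_stages: Vec<String>,
}

impl AllowList {
    /// Returns Ok when the (principal, action, resource) triple is permitted.
    fn authorize(&self, _principal: &str, action: &str, resource: &str) -> Result<(), WorkflowError> {
        let permitted = action == "execute-stage"
            && self
                .permitted_stages
                .iter()
                .any(|s| format!("Stage::\"{s}\"") == resource);
        if permitted {
            Ok(())
        } else {
            Err(WorkflowError::Unauthorized(resource.to_string()))
        }
    }
}

/// Gate run before dispatch. `None` models an unset `cedar_policy_dir`: no checks at all.
fn check_stage(authorizer: Option<&AllowList>, stage: &str) -> Result<(), WorkflowError> {
    match authorizer {
        Some(auth) => auth.authorize(
            "vapora-orchestrator",
            "execute-stage",
            &format!("Stage::\"{stage}\""),
        ),
        None => Ok(()),
    }
}

fn main() {
    let auth = AllowList { permitted_stages: vec!["architecture".to_string()] };
    println!("{:?}", check_stage(Some(&auth), "architecture")); // permitted
    println!("{:?}", check_stage(Some(&auth), "deploy")); // denied, stage is not dispatched
    println!("{:?}", check_stage(None, "deploy")); // Cedar disabled, stage proceeds
}
```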
---
## Rationale
### Why SurrealDB (not Redis / SQLite)
SurrealDB is already the persistence layer for every other stateful component in vapora. Adding `workflow_instances` as one more table adds no operational footprint (no new service, no new connection pool). `WorkflowInstance` already implements `Serialize` and `Deserialize`; the store serializes through `serde_json::Value` to satisfy the `SurrealValue` trait requirement introduced in surrealdb v3.
### Why Saga Over Two-Phase Commit
Workflows already span multiple async agent executions over NATS. Two-phase commit across these boundaries would require protocol changes in every agent. Saga achieves eventual consistency via compensating transactions that each agent already understands (a task with `type: "compensation"`).
### Why Cedar Over RBAC / Custom Middleware
Cedar policies are already used by the rest of the VAPORA platform (see ADR-0010). Per-stage rules expressed in `.cedar` files are reviewable outside the codebase and can be changed without rebuilding or redeploying the engine; by current design, a restart is still required to reload them. A custom middleware table would require schema migrations for every policy change.
---
## Consequences
### Positive
- Workflows survive server restarts (crash recovery via `load_active()`)
- Non-retryable stage failure triggers best-effort rollback of completed stages
- Per-stage access control via auditable policy files
- Zero new infrastructure (uses existing SurrealDB connection)
- 31/31 existing tests continue to pass; 5 new tests added (auth × 3, saga × 2)
### Negative
- `WorkflowOrchestrator::new()` signature change requires callers to pass `Surreal<Client>`
- Cedar requires `.cedar` files on disk; missing `cedar_policy_dir` disables auth silently
- Compensation is best-effort: there is no guarantee of full rollback if a compensation agent also fails
### Mitigations
| Risk | Mitigation |
|------|------------|
| Saga partial rollback | Metrics track compensation dispatch; dead-letter queue via NATS for retry |
| Cedar files missing | `cedar_policy_dir = None` → no-auth mode; documented explicitly |
| Signature change | Backend already owns `db: Arc<Surreal<Client>>`; passed at construction |
---
## Verification
```bash
cargo test -p vapora-workflow-engine # 31/31 pass
cargo clippy -p vapora-workflow-engine -- -D warnings # 0 warnings
```
New tests:
- `auth::tests::test_permit_allows`
- `auth::tests::test_deny_returns_unauthorized`
- `auth::tests::test_empty_dir_fails`
- `saga::tests::test_stages_with_compensation_agents_are_included`
- `saga::tests::test_stages_with_no_compensation_agents_are_skipped`
---
## Related ADRs
- [ADR-0028](./0028-workflow-orchestrator.md) — Workflow Orchestrator (original implementation)
- [ADR-0010](./0010-cedar-authorization.md) — Cedar Authorization
- [ADR-0004](./0004-surrealdb-database.md) — SurrealDB as single persistence layer
- [ADR-0018](./0018-swarm-load-balancing.md) — SwarmCoordinator (Saga dispatch target)