Vapora/docs/guides/workflow-saga-persistence.md

# Workflow Engine: Persistence, Saga Compensation & Cedar Authorization

How to configure and operate the three hardening layers added in v1.3.0:
crash-recoverable state, Saga-based rollback, and per-stage access control.

## Prerequisites

- Running SurrealDB instance (same one used by the backend)
- `vapora-workflow-engine` v1.3.0+
- Migration `009_workflow_state.surql` applied

## 1. Apply the Database Migration

```bash
surreal import \
  --conn  ws://localhost:8000 \
  --user  root \
  --pass  root \
  --ns    vapora \
  --db    vapora \
  migrations/009_workflow_state.surql
```

This creates the `workflow_instances` SCHEMAFULL table with indexes on
`template_name` and `created_at`.

## 2. Wire the Store into the Orchestrator

Pass the existing `Surreal<Client>` when constructing `WorkflowOrchestrator`:

```rust
use std::sync::Arc;
use surrealdb::{Surreal, engine::remote::ws::Client};
use vapora_workflow_engine::WorkflowOrchestrator;

async fn build_orchestrator(
    db: Arc<Surreal<Client>>,
    swarm: Arc<SwarmCoordinator>,
    kg:    Arc<KGPersistence>,
    nats:  Arc<async_nats::Client>,
) -> anyhow::Result<Arc<WorkflowOrchestrator>> {
    let orchestrator = WorkflowOrchestrator::new(
        "config/workflows.toml",
        swarm,
        kg,
        nats,
        (*db).clone(),          // Surreal<Client> (not Arc)
    )
    .await?;

    Ok(Arc::new(orchestrator))
}
```

On startup the orchestrator calls `store.load_active()` and reinserts any
non-terminal instances from SurrealDB into the in-memory `DashMap`. Workflows
in progress before a crash resume from their last persisted state.

## 3. Configure Saga Compensation

Add `compensation_agents` to each stage that should trigger a rollback task
when the workflow fails after that stage has completed:

```toml
# config/workflows.toml

[engine]
max_parallel_tasks   = 10
workflow_timeout     = 3600
approval_gates_enabled = true

[[workflows]]
name    = "deploy_pipeline"
trigger = "manual"

[[workflows.stages]]
name                = "provision"
agents              = ["devops"]
compensation_agents = ["devops"]   # ← Saga: send rollback task to devops

[[workflows.stages]]
name                = "migrate_db"
agents              = ["backend"]
compensation_agents = ["backend"]  # ← Saga: send DB rollback task to backend

[[workflows.stages]]
name    = "smoke_test"
agents  = ["tester"]
# no compensation_agents → skipped in Saga reversal
```

### Compensation Task Payload

When a non-retryable failure occurs at stage N, the `SagaCompensator` iterates
stages `[0..N]` in reverse order. For each stage with `compensation_agents`
defined it dispatches via `SwarmCoordinator`:

```json
{
  "type":             "compensation",
  "stage_name":       "migrate_db",
  "workflow_id":      "abc-123",
  "original_context": { "…": "…" },
  "artifacts_to_undo": ["artifact-id-1"]
}
```

Compensation is **best-effort**: errors are logged but never fail the workflow
state machine. The workflow is already marked `Failed` before Saga fires.

### Agent Implementation

Agents receive compensation tasks on the same NATS subjects as regular tasks.
Distinguish by the `"type": "compensation"` field:

```rust
// In your agent task handler
if task.payload["type"] == "compensation" {
    let stage  = &task.payload["stage_name"];
    let wf_id  = &task.payload["workflow_id"];
    // perform rollback for this stage …
    return Ok(());
}
// normal task handling …
```

## 4. Configure Cedar Authorization

Cedar authorization is **opt-in**: if `cedar_policy_dir` is absent, all stages
execute without policy checks.

### Enable Cedar

```toml
[engine]
max_parallel_tasks   = 10
workflow_timeout     = 3600
approval_gates_enabled = true
cedar_policy_dir     = "/etc/vapora/cedar"  # directory with *.cedar files
```

### Write Policies

Create `.cedar` files in the configured directory. Each file is loaded at
startup and merged into a single `PolicySet`.

**Allow all stages** (permissive default):

```cedar
// /etc/vapora/cedar/allow-all.cedar
permit(
  principal == "vapora-orchestrator",
  action    == Action::"execute-stage",
  resource
);
```

**Restrict deployment stages to approved callers only**:

```cedar
// /etc/vapora/cedar/restrict-deploy.cedar
forbid(
  principal == "vapora-orchestrator",
  action    == Action::"execute-stage",
  resource  == Stage::"deployment"
) unless {
  context.approved == true
};
```

**Allow only specific stages**:

```cedar
// /etc/vapora/cedar/workflow-policy.cedar
permit(
  principal == "vapora-orchestrator",
  action    == Action::"execute-stage",
  resource  in [Stage::"architecture_design", Stage::"implementation", Stage::"testing"]
);

forbid(
  principal == "vapora-orchestrator",
  action    == Action::"execute-stage",
  resource  == Stage::"production_deploy"
);
```

### Authorization Check

Before each stage dispatch the engine calls:

```rust
// principal: "vapora-orchestrator"
// action:    "execute-stage"
// resource:  Stage::"<stage_name>"
cedar.authorize("vapora-orchestrator", "execute-stage", &stage_name)?;
```

A `Deny` decision returns `WorkflowError::Unauthorized` and halts the workflow
immediately. The stage is never dispatched to `SwarmCoordinator`.

## 5. Crash Recovery Behavior

| Scenario | Behavior |
|----------|----------|
| Server restart with active workflows | `load_active()` re-populates `DashMap`; event listener resumes NATS subscriptions; in-flight tasks re-dispatch on next `TaskCompleted`/`TaskFailed` event |
| Workflow in `Completed` or `Failed` state | Filtered out by `load_active()`; not reloaded |
| Workflow in `WaitingApproval` | Reloaded; waits for `approve_stage()` call |
| SurrealDB unavailable at startup | `load_active()` returns error; orchestrator fails to start |

## 6. Observability

Saga dispatch failures appear in logs with level `warn`:

```text
WARN vapora_workflow_engine::saga: compensation dispatch failed
  workflow_id="abc-123" stage="migrate_db" error="…"
```

Cedar denials:

```text
ERROR vapora_workflow_engine::orchestrator: Cedar authorization denied
  workflow_id="abc-123" stage="production_deploy"
```

Existing Prometheus metrics track workflow lifecycle:

```text
vapora_workflows_failed_total    # includes Cedar-denied workflows
vapora_active_workflows          # drops immediately on denial
```

## 7. Full Config Reference

```toml
[engine]
max_parallel_tasks     = 10
workflow_timeout       = 3600
approval_gates_enabled = true
cedar_policy_dir       = "/etc/vapora/cedar"   # optional

[[workflows]]
name    = "my_workflow"
trigger = "manual"

[[workflows.stages]]
name                = "stage_a"
agents              = ["agent_role"]
parallel            = false
max_parallel        = 1
approval_required   = false
compensation_agents = ["agent_role"]   # optional; omit to skip Saga for this stage
```

## Related

- [ADR-0033: Workflow Engine Hardening](../adrs/0033-stratum-orchestrator-workflow-hardening.md)
- [ADR-0028: Workflow Orchestrator](../adrs/0028-workflow-orchestrator.md)
- [ADR-0010: Cedar Authorization](../adrs/0010-cedar-authorization.md)
- [Workflow Orchestrator Feature Reference](../features/workflow-orchestrator.md)
feat(workflow-engine): add saga, persistence, auth, and NATS-integrated orchestrator hardening Key changes driving this: new saga.rs, persistence.rs, auth.rs in workflow-engine; SurrealDB migration 009_workflow_state.surql; backend services refactored; frontend dist built; ADR-0033 documenting the hardening decision. 2026-02-22 21:44:42 +00:00			`# Workflow Engine: Persistence, Saga Compensation & Cedar Authorization`

			`How to configure and operate the three hardening layers added in v1.3.0:`
			`crash-recoverable state, Saga-based rollback, and per-stage access control.`

			`## Prerequisites`

			`- Running SurrealDB instance (same one used by the backend)`
			- `vapora-workflow-engine` v1.3.0+
			- Migration `009_workflow_state.surql` applied

			`## 1. Apply the Database Migration`

			```bash
			`surreal import \`
			`--conn ws://localhost:8000 \`
			`--user root \`
			`--pass root \`
			`--ns vapora \`
			`--db vapora \`
			`migrations/009_workflow_state.surql`
			```

			This creates the `workflow_instances` SCHEMAFULL table with indexes on
			`template_name` and `created_at`.

			`## 2. Wire the Store into the Orchestrator`

			Pass the existing `Surreal<Client>` when constructing `WorkflowOrchestrator`:

			```rust
			`use std::sync::Arc;`
			`use surrealdb::{Surreal, engine::remote::ws::Client};`
			`use vapora_workflow_engine::WorkflowOrchestrator;`

			`async fn build_orchestrator(`
			`db: Arc<Surreal<Client>>,`
			`swarm: Arc<SwarmCoordinator>,`
			`kg: Arc<KGPersistence>,`
			`nats: Arc<async_nats::Client>,`
			`) -> anyhow::Result<Arc<WorkflowOrchestrator>> {`
			`let orchestrator = WorkflowOrchestrator::new(`
			`"config/workflows.toml",`
			`swarm,`
			`kg,`
			`nats,`
			`(*db).clone(), // Surreal<Client> (not Arc)`
			`)`
			`.await?;`

			`Ok(Arc::new(orchestrator))`
			`}`
			```

			On startup the orchestrator calls `store.load_active()` and reinserts any
			non-terminal instances from SurrealDB into the in-memory `DashMap`. Workflows
			`in progress before a crash resume from their last persisted state.`

			`## 3. Configure Saga Compensation`

			Add `compensation_agents` to each stage that should trigger a rollback task
			`when the workflow fails after that stage has completed:`

			```toml
			`# config/workflows.toml`

			`[engine]`
			`max_parallel_tasks = 10`
			`workflow_timeout = 3600`
			`approval_gates_enabled = true`

			`[[workflows]]`
			`name = "deploy_pipeline"`
			`trigger = "manual"`

			`[[workflows.stages]]`
			`name = "provision"`
			`agents = ["devops"]`
			`compensation_agents = ["devops"] # ← Saga: send rollback task to devops`

			`[[workflows.stages]]`
			`name = "migrate_db"`
			`agents = ["backend"]`
			`compensation_agents = ["backend"] # ← Saga: send DB rollback task to backend`

			`[[workflows.stages]]`
			`name = "smoke_test"`
			`agents = ["tester"]`
			`# no compensation_agents → skipped in Saga reversal`
			```

			`### Compensation Task Payload`

			When a non-retryable failure occurs at stage N, the `SagaCompensator` iterates
			stages `[0..N]` in reverse order. For each stage with `compensation_agents`
			defined it dispatches via `SwarmCoordinator`:

			```json
			`{`
			`"type": "compensation",`
			`"stage_name": "migrate_db",`
			`"workflow_id": "abc-123",`
			`"original_context": { "…": "…" },`
			`"artifacts_to_undo": ["artifact-id-1"]`
			`}`
			```

			`Compensation is best-effort: errors are logged but never fail the workflow`
			state machine. The workflow is already marked `Failed` before Saga fires.

			`### Agent Implementation`

			`Agents receive compensation tasks on the same NATS subjects as regular tasks.`
			Distinguish by the `"type": "compensation"` field:

			```rust
			`// In your agent task handler`
			`if task.payload["type"] == "compensation" {`
			`let stage = &task.payload["stage_name"];`
			`let wf_id = &task.payload["workflow_id"];`
			`// perform rollback for this stage …`
			`return Ok(());`
			`}`
			`// normal task handling …`
			```

			`## 4. Configure Cedar Authorization`

			Cedar authorization is opt-in: if `cedar_policy_dir` is absent, all stages
			`execute without policy checks.`

			`### Enable Cedar`

			```toml
			`[engine]`
			`max_parallel_tasks = 10`
			`workflow_timeout = 3600`
			`approval_gates_enabled = true`
			`cedar_policy_dir = "/etc/vapora/cedar" # directory with *.cedar files`
			```

			`### Write Policies`

			Create `.cedar` files in the configured directory. Each file is loaded at
			startup and merged into a single `PolicySet`.

			`Allow all stages (permissive default):`

			```cedar
			`// /etc/vapora/cedar/allow-all.cedar`
			`permit(`
			`principal == "vapora-orchestrator",`
			`action == Action::"execute-stage",`
			`resource`
			`);`
			```

			`Restrict deployment stages to approved callers only:`

			```cedar
			`// /etc/vapora/cedar/restrict-deploy.cedar`
			`forbid(`
			`principal == "vapora-orchestrator",`
			`action == Action::"execute-stage",`
			`resource == Stage::"deployment"`
			`) unless {`
			`context.approved == true`
			`};`
			```

			`Allow only specific stages:`

			```cedar
			`// /etc/vapora/cedar/workflow-policy.cedar`
			`permit(`
			`principal == "vapora-orchestrator",`
			`action == Action::"execute-stage",`
			`resource in [Stage::"architecture_design", Stage::"implementation", Stage::"testing"]`
			`);`

			`forbid(`
			`principal == "vapora-orchestrator",`
			`action == Action::"execute-stage",`
			`resource == Stage::"production_deploy"`
			`);`
			```

			`### Authorization Check`

			`Before each stage dispatch the engine calls:`

			```rust
			`// principal: "vapora-orchestrator"`
			`// action: "execute-stage"`
			`// resource: Stage::"<stage_name>"`
			`cedar.authorize("vapora-orchestrator", "execute-stage", &stage_name)?;`
			```

			A `Deny` decision returns `WorkflowError::Unauthorized` and halts the workflow
			immediately. The stage is never dispatched to `SwarmCoordinator`.

			`## 5. Crash Recovery Behavior`

			`\| Scenario \| Behavior \|`
			`\|----------\|----------\|`
			\| Server restart with active workflows \| `load_active()` re-populates `DashMap`; event listener resumes NATS subscriptions; in-flight tasks re-dispatch on next `TaskCompleted`/`TaskFailed` event \|
			\| Workflow in `Completed` or `Failed` state \| Filtered out by `load_active()`; not reloaded \|
			\| Workflow in `WaitingApproval` \| Reloaded; waits for `approve_stage()` call \|
			\| SurrealDB unavailable at startup \| `load_active()` returns error; orchestrator fails to start \|

			`## 6. Observability`

			Saga dispatch failures appear in logs with level `warn`:

			```text
			`WARN vapora_workflow_engine::saga: compensation dispatch failed`
			`workflow_id="abc-123" stage="migrate_db" error="…"`
			```

			`Cedar denials:`

			```text
			`ERROR vapora_workflow_engine::orchestrator: Cedar authorization denied`
			`workflow_id="abc-123" stage="production_deploy"`
			```

			`Existing Prometheus metrics track workflow lifecycle:`

			```text
			`vapora_workflows_failed_total # includes Cedar-denied workflows`
			`vapora_active_workflows # drops immediately on denial`
			```

			`## 7. Full Config Reference`

			```toml
			`[engine]`
			`max_parallel_tasks = 10`
			`workflow_timeout = 3600`
			`approval_gates_enabled = true`
			`cedar_policy_dir = "/etc/vapora/cedar" # optional`

			`[[workflows]]`
			`name = "my_workflow"`
			`trigger = "manual"`

			`[[workflows.stages]]`
			`name = "stage_a"`
			`agents = ["agent_role"]`
			`parallel = false`
			`max_parallel = 1`
			`approval_required = false`
			`compensation_agents = ["agent_role"] # optional; omit to skip Saga for this stage`
			```

			`## Related`

			`- [ADR-0033: Workflow Engine Hardening](../adrs/0033-stratum-orchestrator-workflow-hardening.md)`
			`- [ADR-0028: Workflow Orchestrator](../adrs/0028-workflow-orchestrator.md)`
			`- [ADR-0010: Cedar Authorization](../adrs/0010-cedar-authorization.md)`
			`- [Workflow Orchestrator Feature Reference](../features/workflow-orchestrator.md)`