Vapora/docs/guides/workflow-saga-persistence.md
Jesús Pérez b9e2cee9f7
feat(workflow-engine): add saga, persistence, auth, and NATS-integrated orchestrator hardening
Key changes driving this: new saga.rs, persistence.rs, auth.rs in workflow-engine; SurrealDB migration 009_workflow_state.surql; backend
  services refactored; frontend dist built; ADR-0033 documenting the hardening decision.
2026-02-22 21:44:42 +00:00

Workflow Engine: Persistence, Saga Compensation & Cedar Authorization

How to configure and operate the three hardening layers added in v1.3.0: crash-recoverable state, Saga-based rollback, and per-stage access control.

Prerequisites

  • Running SurrealDB instance (same one used by the backend)
  • vapora-workflow-engine v1.3.0+
  • Migration 009_workflow_state.surql applied

1. Apply the Database Migration

surreal import \
  --conn  ws://localhost:8000 \
  --user  root \
  --pass  root \
  --ns    vapora \
  --db    vapora \
  migrations/009_workflow_state.surql

This creates the workflow_instances SCHEMAFULL table with indexes on template_name and created_at.

2. Wire the Store into the Orchestrator

Pass the existing Surreal<Client> when constructing WorkflowOrchestrator:

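use std::sync::Arc;
use surrealdb::{Surreal, engine::remote::ws::Client};
use vapora_workflow_engine::WorkflowOrchestrator;
// SwarmCoordinator and KGPersistence must also be in scope; import them from
// the crates that define them in your workspace.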

async fn build_orchestrator(
    db: Arc<Surreal<Client>>,
    swarm: Arc<SwarmCoordinator>,
    kg:    Arc<KGPersistence>,
    nats:  Arc<async_nats::Client>,
) -> anyhow::Result<Arc<WorkflowOrchestrator>> {
    let orchestrator = WorkflowOrchestrator::new(
        "config/workflows.toml",
        swarm,
        kg,
        nats,
        (*db).clone(),          // Surreal<Client> (not Arc)
    )
    .await?;

    Ok(Arc::new(orchestrator))
}

On startup the orchestrator calls store.load_active() and reinserts any non-terminal instances from SurrealDB into the in-memory DashMap. Workflows in progress before a crash resume from their last persisted state.
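
The reload step above can be pictured with a minimal sketch. Everything named here (`WorkflowStatus`, `WorkflowInstance`, `restore_active`) is hypothetical and only illustrates the filtering rule. A plain `HashMap` stands in for the DashMap, and the real store may well filter inside the SurrealDB query instead of in memory.

```rust
use std::collections::HashMap;

/// Hypothetical status enum; the real engine's variants may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum WorkflowStatus {
    Running,
    WaitingApproval,
    Completed,
    Failed,
}

impl WorkflowStatus {
    /// Terminal states are not reloaded after a restart.
    pub fn is_terminal(self) -> bool {
        matches!(self, WorkflowStatus::Completed | WorkflowStatus::Failed)
    }
}

/// Hypothetical persisted record; stands in for a row of `workflow_instances`.
#[derive(Debug, Clone)]
pub struct WorkflowInstance {
    pub id: String,
    pub status: WorkflowStatus,
}

/// Rebuild the in-memory map from persisted rows, keeping only
/// non-terminal instances.
pub fn restore_active(rows: Vec<WorkflowInstance>) -> HashMap<String, WorkflowInstance> {
    rows.into_iter()
        .filter(|w| !w.status.is_terminal())
        .map(|w| (w.id.clone(), w))
        .collect()
}
```

Under these assumptions, a `Running` or `WaitingApproval` instance ends up back in the map, while `Completed` and `Failed` instances are dropped.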

3. Configure Saga Compensation

Add compensation_agents to each stage that should trigger a rollback task when the workflow fails after that stage has completed:

# config/workflows.toml

[engine]
max_parallel_tasks   = 10
workflow_timeout     = 3600
approval_gates_enabled = true

[[workflows]]
name    = "deploy_pipeline"
trigger = "manual"

[[workflows.stages]]
name                = "provision"
agents              = ["devops"]
compensation_agents = ["devops"]   # ← Saga: send rollback task to devops

[[workflows.stages]]
name                = "migrate_db"
agents              = ["backend"]
compensation_agents = ["backend"]  # ← Saga: send DB rollback task to backend

[[workflows.stages]]
name    = "smoke_test"
agents  = ["tester"]
# no compensation_agents → skipped in Saga reversal

Compensation Task Payload

When a non-retryable failure occurs at stage N, the SagaCompensator iterates stages [0..N] in reverse order. For each stage that defines compensation_agents, it dispatches a task with the following payload via SwarmCoordinator:

{
  "type":             "compensation",
  "stage_name":       "migrate_db",
  "workflow_id":      "abc-123",
  "original_context": { "…": "…" },
  "artifacts_to_undo": ["artifact-id-1"]
}

Compensation is best-effort: errors are logged but never fail the workflow state machine. The workflow is already marked Failed before Saga fires.
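
As a rough illustration of the reverse, best-effort walk described above, the sketch below takes the slice of stages to compensate and a dispatch callback. `Stage`, `compensate`, and the callback signature are assumptions made for this example, not the engine's actual SagaCompensator API. The caller decides which stages go into the slice.

```rust
/// Hypothetical stage record mirroring the TOML fields `name` and
/// `compensation_agents`.
#[derive(Debug, Clone)]
pub struct Stage {
    pub name: String,
    pub compensation_agents: Vec<String>,
}

/// Walk the given stages in reverse and dispatch one compensation task per
/// configured agent. Dispatch errors are collected and returned rather than
/// propagated, so one failing rollback cannot abort the remaining ones.
pub fn compensate<F>(stages: &[Stage], mut dispatch: F) -> Vec<String>
where
    F: FnMut(&str, &str) -> Result<(), String>,
{
    let mut errors = Vec::new();
    for stage in stages.iter().rev() {
        // A stage with no compensation_agents yields no iterations: skipped.
        for agent in &stage.compensation_agents {
            if let Err(e) = dispatch(&stage.name, agent) {
                // Best-effort: record the error and keep going.
                errors.push(format!("{}: {}", stage.name, e));
            }
        }
    }
    errors
}
```

Returning the errors instead of a `Result` mirrors the best-effort rule: the caller can log them at warn level without changing the workflow's already-Failed state.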

Agent Implementation

Agents receive compensation tasks on the same NATS subjects as regular tasks. Distinguish by the "type": "compensation" field:

// In your agent task handler
if task.payload["type"] == "compensation" {
    let stage  = &task.payload["stage_name"];
    let wf_id  = &task.payload["workflow_id"];
    // perform rollback for this stage …
    return Ok(());
}
// normal task handling …

4. Configure Cedar Authorization

Cedar authorization is opt-in: if cedar_policy_dir is absent, all stages execute without policy checks.

Enable Cedar

[engine]
max_parallel_tasks   = 10
workflow_timeout     = 3600
approval_gates_enabled = true
cedar_policy_dir     = "/etc/vapora/cedar"  # directory with *.cedar files

Write Policies

Create .cedar files in the configured directory. Each file is loaded at startup and merged into a single PolicySet.
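
One plausible way to picture the load-and-merge step is to read every `*.cedar` file and concatenate the sources before parsing. The helper below is a hypothetical sketch using only the standard library. `read_policy_sources` is not part of the engine, and the guide does not say how the engine merges files internally.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Read every `*.cedar` file in `dir` and concatenate the sources into one
/// string, in file-name order so the result is deterministic.
pub fn read_policy_sources(dir: &Path) -> io::Result<String> {
    let mut paths: Vec<PathBuf> = fs::read_dir(dir)?
        .collect::<Result<Vec<_>, _>>()?
        .into_iter()
        .map(|entry| entry.path())
        // Keep regular files whose extension is exactly "cedar".
        .filter(|p| p.is_file() && p.extension().and_then(|e| e.to_str()) == Some("cedar"))
        .collect();
    paths.sort();

    let mut merged = String::new();
    for path in paths {
        merged.push_str(&fs::read_to_string(&path)?);
        merged.push('\n');
    }
    Ok(merged)
}
```

The merged text could then be handed to a Cedar parser, for example the `cedar-policy` crate's `PolicySet`, which can be parsed from a string. A syntax error in any single file would then make the whole set fail to load, so it is worth validating policy files before a restart.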

Allow all stages (permissive default):

// /etc/vapora/cedar/allow-all.cedar
permit(
  principal == "vapora-orchestrator",
  action    == Action::"execute-stage",
  resource
);

Restrict deployment stages to approved callers only:

// /etc/vapora/cedar/restrict-deploy.cedar
forbid(
  principal == "vapora-orchestrator",
  action    == Action::"execute-stage",
  resource  == Stage::"deployment"
) unless {
  context.approved == true
};

Allow only specific stages:

// /etc/vapora/cedar/workflow-policy.cedar
permit(
  principal == "vapora-orchestrator",
  action    == Action::"execute-stage",
  resource  in [Stage::"architecture_design", Stage::"implementation", Stage::"testing"]
);

forbid(
  principal == "vapora-orchestrator",
  action    == Action::"execute-stage",
  resource  == Stage::"production_deploy"
);

Authorization Check

Before each stage dispatch the engine calls:

// principal: "vapora-orchestrator"
// action:    "execute-stage"
// resource:  Stage::"<stage_name>"
cedar.authorize("vapora-orchestrator", "execute-stage", &stage_name)?;

A Deny decision returns WorkflowError::Unauthorized and halts the workflow immediately. The stage is never dispatched to SwarmCoordinator.
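
The check-then-dispatch ordering, together with the opt-in behavior when cedar_policy_dir is absent, can be sketched as follows. `StageAuthorizer`, `DenyList`, and `dispatch_stage` are hypothetical stand-ins for the Cedar-backed authorizer and the orchestrator's dispatch path. Only the `WorkflowError::Unauthorized` name comes from the guide, and its shape here is assumed.

```rust
/// Hypothetical error type; the guide only names `WorkflowError::Unauthorized`.
#[derive(Debug, PartialEq)]
pub enum WorkflowError {
    Unauthorized { stage: String },
}

/// Stand-in for the Cedar-backed authorizer.
pub trait StageAuthorizer {
    fn is_allowed(&self, principal: &str, action: &str, stage: &str) -> bool;
}

/// Toy authorizer for illustration: denies the listed stages, allows the rest.
pub struct DenyList(pub Vec<String>);

impl StageAuthorizer for DenyList {
    fn is_allowed(&self, _principal: &str, _action: &str, stage: &str) -> bool {
        !self.0.iter().any(|s| s == stage)
    }
}

/// Run the authorization check first, then dispatch. With no authorizer
/// configured (cedar_policy_dir absent) every stage is dispatched.
pub fn dispatch_stage<D>(
    authorizer: Option<&dyn StageAuthorizer>,
    stage: &str,
    dispatch: D,
) -> Result<(), WorkflowError>
where
    D: FnOnce(&str),
{
    if let Some(auth) = authorizer {
        if !auth.is_allowed("vapora-orchestrator", "execute-stage", stage) {
            // Denied: return before the dispatch callback is ever invoked.
            return Err(WorkflowError::Unauthorized { stage: stage.to_string() });
        }
    }
    dispatch(stage);
    Ok(())
}
```

Because the early return happens before `dispatch` is called, a denied stage never reaches the swarm in this sketch, which matches the behavior described above.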

5. Crash Recovery Behavior

| Scenario | Behavior |
|---|---|
| Server restart with active workflows | load_active() re-populates DashMap; event listener resumes NATS subscriptions; in-flight tasks re-dispatch on next TaskCompleted/TaskFailed event |
| Workflow in Completed or Failed state | Filtered out by load_active(); not reloaded |
| Workflow in WaitingApproval | Reloaded; waits for approve_stage() call |
| SurrealDB unavailable at startup | load_active() returns error; orchestrator fails to start |

6. Observability

Saga dispatch failures are logged at WARN level:

WARN vapora_workflow_engine::saga: compensation dispatch failed
  workflow_id="abc-123" stage="migrate_db" error="…"

Cedar denials are logged at ERROR level:

ERROR vapora_workflow_engine::orchestrator: Cedar authorization denied
  workflow_id="abc-123" stage="production_deploy"

Existing Prometheus metrics track workflow lifecycle:

vapora_workflows_failed_total    # includes Cedar-denied workflows
vapora_active_workflows          # drops immediately on denial

7. Full Config Reference

[engine]
max_parallel_tasks     = 10
workflow_timeout       = 3600
approval_gates_enabled = true
cedar_policy_dir       = "/etc/vapora/cedar"   # optional

[[workflows]]
name    = "my_workflow"
trigger = "manual"

[[workflows.stages]]
name                = "stage_a"
agents              = ["agent_role"]
parallel            = false
max_parallel        = 1
approval_required   = false
compensation_agents = ["agent_role"]   # optional; omit to skip Saga for this stage