jesus/Vapora

Fork 0

Jesús Pérez b9e2cee9f7

Documentation Lint & Validation / Markdown Linting (push) Has been cancelled

Details

Documentation Lint & Validation / Validate mdBook Configuration (push) Has been cancelled

Details

Documentation Lint & Validation / Content & Structure Validation (push) Has been cancelled

Details

mdBook Build & Deploy / Build mdBook (push) Has been cancelled

Details

Rust CI / Security Audit (push) Has been cancelled

Details

Rust CI / Check + Test + Lint (nightly) (push) Has been cancelled

Details

Rust CI / Check + Test + Lint (stable) (push) Has been cancelled

Details

Documentation Lint & Validation / Lint & Validation Summary (push) Has been cancelled

Details

mdBook Build & Deploy / Documentation Quality Check (push) Has been cancelled

Details

mdBook Build & Deploy / Deploy to GitHub Pages (push) Has been cancelled

Details

mdBook Build & Deploy / Notification (push) Has been cancelled

Details

feat(workflow-engine): add saga, persistence, auth, and NATS-integrated orchestrator hardening

Key changes driving this: new saga.rs, persistence.rs, auth.rs in workflow-engine; SurrealDB migration 009_workflow_state.surql; backend
  services refactored; frontend dist built; ADR-0033 documenting the hardening decision.

2026-02-22 21:44:42 +00:00

7.0 KiB

Raw Blame History

Workflow Engine: Persistence, Saga Compensation & Cedar Authorization

How to configure and operate the three hardening layers added in v1.3.0: crash-recoverable state, Saga-based rollback, and per-stage access control.

Prerequisites

Running SurrealDB instance (same one used by the backend)
vapora-workflow-engine v1.3.0+
Migration 009_workflow_state.surql applied

1. Apply the Database Migration

surreal import \
  --conn  ws://localhost:8000 \
  --user  root \
  --pass  root \
  --ns    vapora \
  --db    vapora \
  migrations/009_workflow_state.surql

This creates the workflow_instances SCHEMAFULL table with indexes on template_name and created_at.

2. Wire the Store into the Orchestrator

Pass the existing Surreal<Client> when constructing WorkflowOrchestrator:

use std::sync::Arc;
use surrealdb::{Surreal, engine::remote::ws::Client};
use vapora_workflow_engine::WorkflowOrchestrator;

async fn build_orchestrator(
    db: Arc<Surreal<Client>>,
    swarm: Arc<SwarmCoordinator>,
    kg:    Arc<KGPersistence>,
    nats:  Arc<async_nats::Client>,
) -> anyhow::Result<Arc<WorkflowOrchestrator>> {
    let orchestrator = WorkflowOrchestrator::new(
        "config/workflows.toml",
        swarm,
        kg,
        nats,
        (*db).clone(),          // Surreal<Client> (not Arc)
    )
    .await?;

    Ok(Arc::new(orchestrator))
}

On startup the orchestrator calls store.load_active() and reinserts any non-terminal instances from SurrealDB into the in-memory DashMap. Workflows in progress before a crash resume from their last persisted state.

3. Configure Saga Compensation

Add compensation_agents to each stage that should trigger a rollback task when the workflow fails after that stage has completed:

# config/workflows.toml

[engine]
max_parallel_tasks   = 10
workflow_timeout     = 3600
approval_gates_enabled = true

[[workflows]]
name    = "deploy_pipeline"
trigger = "manual"

[[workflows.stages]]
name                = "provision"
agents              = ["devops"]
compensation_agents = ["devops"]   # ← Saga: send rollback task to devops

[[workflows.stages]]
name                = "migrate_db"
agents              = ["backend"]
compensation_agents = ["backend"]  # ← Saga: send DB rollback task to backend

[[workflows.stages]]
name    = "smoke_test"
agents  = ["tester"]
# no compensation_agents → skipped in Saga reversal

Compensation Task Payload

When a non-retryable failure occurs at stage N, the SagaCompensator iterates stages [0..N] in reverse order. For each stage with compensation_agents defined it dispatches via SwarmCoordinator:

{
  "type":             "compensation",
  "stage_name":       "migrate_db",
  "workflow_id":      "abc-123",
  "original_context": { "…": "…" },
  "artifacts_to_undo": ["artifact-id-1"]
}

Compensation is best-effort: errors are logged but never fail the workflow state machine. The workflow is already marked Failed before Saga fires.

Agent Implementation

Agents receive compensation tasks on the same NATS subjects as regular tasks. Distinguish by the "type": "compensation" field:

// In your agent task handler
if task.payload["type"] == "compensation" {
    let stage  = &task.payload["stage_name"];
    let wf_id  = &task.payload["workflow_id"];
    // perform rollback for this stage …
    return Ok(());
}
// normal task handling …

4. Configure Cedar Authorization

Cedar authorization is opt-in: if cedar_policy_dir is absent, all stages execute without policy checks.

Enable Cedar

[engine]
max_parallel_tasks   = 10
workflow_timeout     = 3600
approval_gates_enabled = true
cedar_policy_dir     = "/etc/vapora/cedar"  # directory with *.cedar files

Write Policies

Create .cedar files in the configured directory. Each file is loaded at startup and merged into a single PolicySet.

Allow all stages (permissive default):

// /etc/vapora/cedar/allow-all.cedar
permit(
  principal == "vapora-orchestrator",
  action    == Action::"execute-stage",
  resource
);

Restrict deployment stages to approved callers only:

// /etc/vapora/cedar/restrict-deploy.cedar
forbid(
  principal == "vapora-orchestrator",
  action    == Action::"execute-stage",
  resource  == Stage::"deployment"
) unless {
  context.approved == true
};

Allow only specific stages:

// /etc/vapora/cedar/workflow-policy.cedar
permit(
  principal == "vapora-orchestrator",
  action    == Action::"execute-stage",
  resource  in [Stage::"architecture_design", Stage::"implementation", Stage::"testing"]
);

forbid(
  principal == "vapora-orchestrator",
  action    == Action::"execute-stage",
  resource  == Stage::"production_deploy"
);

Authorization Check

Before each stage dispatch the engine calls:

// principal: "vapora-orchestrator"
// action:    "execute-stage"
// resource:  Stage::"<stage_name>"
cedar.authorize("vapora-orchestrator", "execute-stage", &stage_name)?;

A Deny decision returns WorkflowError::Unauthorized and halts the workflow immediately. The stage is never dispatched to SwarmCoordinator.

5. Crash Recovery Behavior

Scenario	Behavior
Server restart with active workflows	`load_active()` re-populates `DashMap`; event listener resumes NATS subscriptions; in-flight tasks re-dispatch on next `TaskCompleted`/`TaskFailed` event
Workflow in `Completed` or `Failed` state	Filtered out by `load_active()`; not reloaded
Workflow in `WaitingApproval`	Reloaded; waits for `approve_stage()` call
SurrealDB unavailable at startup	`load_active()` returns error; orchestrator fails to start

6. Observability

Saga dispatch failures appear in logs with level warn:

WARN vapora_workflow_engine::saga: compensation dispatch failed
  workflow_id="abc-123" stage="migrate_db" error="…"

Cedar denials:

ERROR vapora_workflow_engine::orchestrator: Cedar authorization denied
  workflow_id="abc-123" stage="production_deploy"

Existing Prometheus metrics track workflow lifecycle:

vapora_workflows_failed_total    # includes Cedar-denied workflows
vapora_active_workflows          # drops immediately on denial

7. Full Config Reference

[engine]
max_parallel_tasks     = 10
workflow_timeout       = 3600
approval_gates_enabled = true
cedar_policy_dir       = "/etc/vapora/cedar"   # optional

[[workflows]]
name    = "my_workflow"
trigger = "manual"

[[workflows.stages]]
name                = "stage_a"
agents              = ["agent_role"]
parallel            = false
max_parallel        = 1
approval_required   = false
compensation_agents = ["agent_role"]   # optional; omit to skip Saga for this stage

7.0 KiB Raw Blame History

Workflow Engine: Persistence, Saga Compensation & Cedar Authorization

Prerequisites

1. Apply the Database Migration

2. Wire the Store into the Orchestrator

3. Configure Saga Compensation

Compensation Task Payload

Agent Implementation

4. Configure Cedar Authorization

Enable Cedar

Write Policies

Authorization Check

5. Crash Recovery Behavior

6. Observability

7. Full Config Reference

Related

7.0 KiB

Raw Blame History