Vapora/docs/adrs/0039-merkle-audit-trail.md
Jesús Pérez 847523e4d4
2026-02-27 00:00:02 +00:00


ADR-0039: Tamper-Evident Audit Trail — Merkle Hash Chain

Status: Implemented

Date: 2026-02-26

Deciders: VAPORA Team

Technical Story: Competitive analysis against enterprise orchestration platforms (OpenFang included) revealed that VAPORA's audit.rs was a simple append-only log: any direct database modification (for example, an unauthorized UPDATE audit_entries ...) was undetectable. Enterprise compliance frameworks (SOC 2, ISO 27001, HIPAA) require tamper-evident logs where post-hoc modification is provably detectable.


Decision

Replace the append-only audit log in vapora-backend/src/audit/mod.rs with a Merkle hash chain in which each entry stores the hash of its predecessor and therefore cryptographically commits to every entry before it.


Context

Why Append-Only Is Insufficient

An append-only log prevents deletion (assuming no DELETE privilege) but does not prevent modification. An attacker with write access to audit_entries can silently rewrite the event_type, actor, or details fields of any existing row without leaving any trace detectable by the application.

The previous implementation stored seq, entry_id, timestamp, workflow_id, event_type, actor, and details, but no integrity metadata, so nothing in a row could reveal that it had been changed after it was written.

Merkle Hash Chain Model

Each audit entry stores two additional fields:

  • prev_hash: the block_hash of the immediately preceding entry (the genesis entry uses GENESIS_HASH, a string of 64 zeros)
  • block_hash: the hex-encoded SHA-256 digest of the preimage prev_hash|seq|entry_id|timestamp_rfc3339|workflow_id|event_type|actor|details_json

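Modifying any covered field of entry N means the stored block_hash of entry N no longer matches the hash recomputed from its fields. If the attacker also recomputes block_hash for entry N, then prev_hash in entry N+1 no longer matches its predecessor, and so on through the rest of the chain: hiding a single change requires rewriting every later entry.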
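The two-stage propagation can be illustrated with a minimal, std-only sketch. Everything here is hypothetical and simplified: `Entry` carries only `seq` and `event_type`, sequence numbers are assumed to start at 0, and `toy_hash` is a non-cryptographic FNV-1a stand-in used only so the example compiles without external crates. The actual implementation hashes with SHA-256.

```rust
/// Stand-in hash (64-bit FNV-1a). NOT cryptographic; a placeholder for SHA-256.
fn toy_hash(input: &str) -> String {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in input.bytes() {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    format!("{h:016x}")
}

/// 64 zeros, mirroring the GENESIS_HASH described above.
fn genesis_hash() -> String {
    "0".repeat(64)
}

/// Simplified entry: only two covered fields instead of seven.
#[derive(Clone, Debug)]
struct Entry {
    seq: i64,
    event_type: String,
    prev_hash: String,
    block_hash: String,
}

fn hash_entry(prev_hash: &str, seq: i64, event_type: &str) -> String {
    toy_hash(&format!("{prev_hash}|{seq}|{event_type}"))
}

/// Builds a chain in which each entry stores its predecessor's block_hash.
fn build_chain(events: &[&str]) -> Vec<Entry> {
    let mut chain: Vec<Entry> = Vec::new();
    for (i, ev) in events.iter().enumerate() {
        let prev_hash = chain
            .last()
            .map(|e| e.block_hash.clone())
            .unwrap_or_else(genesis_hash);
        let seq = i as i64;
        let block_hash = hash_entry(&prev_hash, seq, ev);
        chain.push(Entry { seq, event_type: ev.to_string(), prev_hash, block_hash });
    }
    chain
}

/// Check 1: does the stored block_hash match the hash recomputed from the fields?
fn stored_hash_matches(e: &Entry) -> bool {
    e.block_hash == hash_entry(&e.prev_hash, e.seq, &e.event_type)
}

/// Check 2: does `next` point at the block_hash currently stored in `prev`?
fn link_matches(prev: &Entry, next: &Entry) -> bool {
    next.prev_hash == prev.block_hash
}
```

Editing a field of entry 1 fails check 1 for that entry. If the editor then recomputes entry 1's `block_hash`, check 1 passes again but check 2 fails between entries 1 and 2, which is why a forgery has to rewrite the whole suffix.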

Write Serialization

Fetching the previous hash and appending the new entry must be atomic with respect to other concurrent appends. A write_lock: Arc<Mutex<()>> serializes all append calls within the process. This is sufficient because VAPORA's backend is a single process; multi-node deployments would require a distributed lock (e.g., a SurrealDB UPDATE ... IF locked IS NONE CAS operation, as used by the scheduler).
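As a rough illustration of why the lock matters, the sketch below serializes "read the tail, then append" behind a single mutex. It is an assumption-laden stand-in: `AuditLog`, `Row`, and `demo_concurrent_appends` are hypothetical names, an in-memory `Vec` replaces the audit_entries table, `std::sync::Mutex` and OS threads replace whatever async primitives the backend actually uses, and `toy_hash` is a non-cryptographic placeholder for SHA-256.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Stand-in hash (64-bit FNV-1a). NOT cryptographic; a placeholder for SHA-256.
fn toy_hash(input: &str) -> String {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in input.bytes() {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    format!("{h:016x}")
}

#[derive(Clone, Debug)]
struct Row {
    seq: i64,
    prev_hash: String,
    block_hash: String,
}

#[derive(Clone)]
struct AuditLog {
    /// Serializes the read-tail-then-insert sequence, as write_lock does in the ADR.
    write_lock: Arc<Mutex<()>>,
    /// In-memory stand-in for the audit_entries table.
    store: Arc<Mutex<Vec<Row>>>,
}

impl AuditLog {
    fn new() -> Self {
        AuditLog {
            write_lock: Arc::new(Mutex::new(())),
            store: Arc::new(Mutex::new(Vec::new())),
        }
    }

    fn append(&self, event_type: &str) {
        // Held until the end of this function, so no other append can interleave.
        let _guard = self.write_lock.lock().unwrap();

        // Step 1: read the current tail (a separate "query" against the store).
        let rows = self.store.lock().unwrap();
        let (seq, prev_hash) = match rows.last() {
            Some(last) => (last.seq + 1, last.block_hash.clone()),
            None => (0, "0".repeat(64)), // assumed: seq starts at 0, genesis is 64 zeros
        };
        drop(rows);

        // Step 2: hash and insert (a second, separate "query").
        let block_hash = toy_hash(&format!("{prev_hash}|{seq}|{event_type}"));
        self.store.lock().unwrap().push(Row { seq, prev_hash, block_hash });
    }
}

/// Appends from several threads at once and returns the resulting rows.
fn demo_concurrent_appends(threads: usize, per_thread: usize) -> Vec<Row> {
    let log = AuditLog::new();
    let handles: Vec<_> = (0..threads)
        .map(|t| {
            let log = log.clone();
            thread::spawn(move || {
                for i in 0..per_thread {
                    log.append(&format!("event-{t}-{i}"));
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let rows = log.store.lock().unwrap().clone();
    rows
}
```

Without `write_lock`, two threads could both read the same tail between step 1 and step 2 and insert two entries with the same `seq` and `prev_hash`, forking the chain.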


Implementation

AuditEntry struct additions

use chrono::{DateTime, Utc};

pub struct AuditEntry {
    pub seq: i64,
    pub entry_id: String,
    pub timestamp: DateTime<Utc>,
    pub workflow_id: String,
    pub event_type: String,
    pub actor: String,
    pub details: serde_json::Value,
    /// block_hash of the preceding entry; GENESIS_HASH for the genesis entry.
    pub prev_hash: String,
    /// Hex-encoded SHA-256 over prev_hash and the seven fields above.
    pub block_hash: String,
}

Hash function

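use chrono::{DateTime, Utc};
use sha2::{Digest, Sha256};

/// Computes the hex-encoded SHA-256 block hash for one audit entry.
fn compute_block_hash(
    prev_hash: &str,
    seq: i64,
    entry_id: &str,
    timestamp: &DateTime<Utc>,
    workflow_id: &str,
    event_type: &str,
    actor: &str,
    details: &serde_json::Value,
) -> String {
    // Compact JSON text; key order follows serde_json's map representation.
    let details = details.to_string();
    let ts = timestamp.to_rfc3339();
    // Fields are joined with '|' in a fixed order, starting with the predecessor's hash.
    let preimage = format!(
        "{prev_hash}|{seq}|{entry_id}|{ts}|{workflow_id}|{event_type}|{actor}|{details}"
    );
    let digest = Sha256::digest(preimage.as_bytes());
    hex::encode(digest)
}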
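One property of a delimiter-joined preimage is worth noting: if any field other than the last can itself contain `|`, two different entries can produce the same preimage. Whether that can happen in VAPORA depends on how entry_id, workflow_id, event_type, and actor are validated, which this ADR does not specify. The sketch below shows the ambiguity and one possible hardening (length-prefixing each field). `joined` and `length_prefixed` are hypothetical helpers, not part of the implementation.

```rust
/// Delimiter-joined preimage, in the style of compute_block_hash.
fn joined(fields: &[&str]) -> String {
    fields.join("|")
}

/// Hypothetical alternative: prefix every field with its byte length,
/// so field boundaries cannot be shifted by embedding the delimiter.
fn length_prefixed(fields: &[&str]) -> String {
    let mut out = String::new();
    for f in fields {
        out.push_str(&f.len().to_string());
        out.push(':');
        out.push_str(f);
    }
    out
}
```

With `event_type = "a|b", actor = "c"` and `event_type = "a", actor = "b|c"`, `joined` yields the same string (`a|b|c`) for both, while `length_prefixed` yields `3:a|b1:c` and `1:a3:b|c`. Changing the preimage format would invalidate every previously stored block_hash, so this is a design consideration rather than a drop-in change.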

Integrity verification

pub async fn verify_integrity(&self, workflow_id: &str) -> Result<IntegrityReport> {
    // Fetch all entries for workflow ordered by seq
    // Re-derive each block_hash from stored fields
    // Compare against stored block_hash
    // Check prev_hash == previous entry's block_hash
    // Return IntegrityReport { valid, total_entries, first_tampered_seq }
}

IntegrityReport reports the first sequence number that fails verification. This marks the earliest detectable modification point; every entry after it must also be treated as unverified, because its hash depends on the entries before it.
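The verification steps outlined in the comments can be sketched over an in-memory slice. This is not the actual verify_integrity body: the database fetch is omitted, `Entry` is reduced to two covered fields, the `IntegrityReport` field types are assumptions (the ADR gives only the field names), `sample_chain` is a hypothetical helper, and `toy_hash` is a non-cryptographic placeholder for SHA-256.

```rust
/// Stand-in hash (64-bit FNV-1a). NOT cryptographic; a placeholder for SHA-256.
fn toy_hash(input: &str) -> String {
    let mut h: u64 = 0xcbf29ce484222325;
    for b in input.bytes() {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    format!("{h:016x}")
}

#[derive(Clone, Debug)]
struct Entry {
    seq: i64,
    event_type: String,
    actor: String,
    prev_hash: String,
    block_hash: String,
}

/// Field names follow the ADR; the types are assumed.
#[derive(Debug, PartialEq)]
struct IntegrityReport {
    valid: bool,
    total_entries: usize,
    first_tampered_seq: Option<i64>,
}

fn hash_entry(prev_hash: &str, seq: i64, event_type: &str, actor: &str) -> String {
    toy_hash(&format!("{prev_hash}|{seq}|{event_type}|{actor}"))
}

/// Walks entries (already ordered by seq) and stops at the first failure.
fn verify_chain(entries: &[Entry]) -> IntegrityReport {
    let mut expected_prev = "0".repeat(64); // genesis hash: 64 zeros
    let mut first_tampered_seq = None;
    for e in entries {
        let recomputed = hash_entry(&e.prev_hash, e.seq, &e.event_type, &e.actor);
        // Fail if the link to the predecessor or the entry's own hash is wrong.
        if e.prev_hash != expected_prev || e.block_hash != recomputed {
            first_tampered_seq = Some(e.seq);
            break;
        }
        expected_prev = e.block_hash.clone();
    }
    IntegrityReport {
        valid: first_tampered_seq.is_none(),
        total_entries: entries.len(),
        first_tampered_seq,
    }
}

/// Hypothetical helper: builds a valid chain from (event_type, actor) pairs.
fn sample_chain(events: &[(&str, &str)]) -> Vec<Entry> {
    let mut chain: Vec<Entry> = Vec::new();
    for (i, (event_type, actor)) in events.iter().enumerate() {
        let prev_hash = chain
            .last()
            .map(|e| e.block_hash.clone())
            .unwrap_or_else(|| "0".repeat(64));
        let seq = i as i64; // assumed: seq starts at 0
        let block_hash = hash_entry(&prev_hash, seq, event_type, actor);
        chain.push(Entry {
            seq,
            event_type: event_type.to_string(),
            actor: actor.to_string(),
            prev_hash,
            block_hash,
        });
    }
    chain
}
```

An untouched chain yields `valid: true` with `first_tampered_seq: None`; changing `actor` on the entry with `seq` 1 yields `valid: false` with `first_tampered_seq: Some(1)`.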


Consequences

What Becomes Possible

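  • Tamper detection: A direct UPDATE audit_entries SET event_type = ... in SurrealDB is detected by the next verify_integrity call, unless the attacker also recomputes the hashes of the modified entry and every later entry (see limitation 3 below).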
  • Compliance evidence: The chain can be presented as evidence that audit records have not been modified since creation.
  • API exposure: GET /api/v1/workflows/:id/audit returns the full chain; clients can independently verify hashes.

Limitations and Known Gaps

  1. No protection against log truncation: Removing entries from the tail of the chain, or deleting a workflow's entries entirely (DELETE audit_entries WHERE workflow_id = ...), leaves a shorter chain that still verifies, because the chain cannot prove the absence of entries. Deleting entries from the middle does break the prev_hash linkage and is detected. A separate monotonic counter or an external timestamp anchor would address truncation.
  2. Single-process write lock: The Arc<Mutex<()>> is sufficient for a single backend process. Multi-node deployments need a distributed lock or a database-level sequence generator with compare-and-swap semantics.
  3. Unkeyed SHA-256: The hash is deterministic and involves no secret. This is correct for tamper detection (verification must be reproducible), but it means the hash does not serve as a MAC: an attacker with write access who rewrites a row can also recompute a valid hash chain from that row onward. For full WORM guarantees, the chain would have to be anchored to an external append-only service (e.g., a transparency log).
  4. Key rotation not addressed: There is no HMAC key; SHA-256 (via sha2) is used purely for commitment, not authentication. Adding a server-side HMAC key would prevent an attacker with DB write access from forging a valid chain, but requires key management.
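Limitations 1 and 3 both point toward some form of external anchor. As a hedged sketch of the idea only: if the latest `(seq, block_hash)` pair were periodically recorded somewhere the database writer cannot modify, a later check could compare the stored chain against it. `Anchor`, `AnchorCheck`, and `check_against_anchor` are hypothetical names; nothing like this exists in the current implementation, and where the anchor would live is left open.

```rust
/// A (seq, block_hash) pair recorded outside the audit database (hypothetical).
struct Anchor {
    seq: i64,
    block_hash: String,
}

#[derive(Debug, PartialEq)]
enum AnchorCheck {
    /// The anchored entry is present with the anchored hash.
    Consistent,
    /// The anchored entry is missing: the chain was truncated or deleted.
    Truncated { anchored_seq: i64 },
    /// The anchored entry exists but its hash differs: the chain was rewritten.
    HeadMismatch,
}

/// `entries` holds (seq, block_hash) pairs read back from storage.
fn check_against_anchor(entries: &[(i64, String)], anchor: &Anchor) -> AnchorCheck {
    match entries.iter().find(|(seq, _)| *seq == anchor.seq) {
        None => AnchorCheck::Truncated { anchored_seq: anchor.seq },
        Some((_, hash)) if *hash == anchor.block_hash => AnchorCheck::Consistent,
        Some(_) => AnchorCheck::HeadMismatch,
    }
}
```

This check complements chain verification rather than replacing it: the anchor pins one entry's hash, and the chain linkage then extends that guarantee to every entry before it. Entries appended after the anchor was taken are not covered until the next anchor.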

Alternatives Considered

Database-Level Audit Triggers

SurrealDB (v3) does not expose write triggers that could hash entries at the storage level. A pure DB-level solution is not available.

External Append-Only Log (NATS JetStream with MaxMsgs and no delete)

This would require a separate NATS stream per workflow and cross-referencing two storage systems. Deferred: the Merkle chain provides sufficient tamper evidence for current compliance requirements without external dependencies.

HMAC-based Authentication

Adds server-side secret management (rotation, distribution across nodes). Deferred until multi-node deployment requires it.