Vapora/docs/adrs/0039-merkle-audit-trail.md
Jesús Pérez 847523e4d4
fix: eliminate stub implementations across 6 integration points
- WorkflowOrchestrator and WorkflowService wired in main.rs (non-fatal)
  - try_fallback_with_budget actually calls fallback providers
  - vapora-tracking persistence: real TrackingEntry + NatsPublisher
  - vapora-doc-lifecycle: workspace + classify/consolidate/rag/NATS stubs
  - Merkle hash chain audit trail (tamper-evident, verify_integrity)
  - /api/v1/workflows/* routes operational; get_workflow_audit Result fix
  - ADR-0039, CHANGELOG, workflow-orchestrator docs updated
2026-02-27 00:00:02 +00:00

# ADR-0039: Tamper-Evident Audit Trail — Merkle Hash Chain
**Status**: Implemented
**Date**: 2026-02-26
**Deciders**: VAPORA Team
**Technical Story**: Competitive analysis against enterprise orchestration platforms (OpenFang included) revealed that VAPORA's `audit.rs` was a simple append-only log: any direct database modification (unauthorized `UPDATE audit_entries ...`) was undetectable. Enterprise compliance frameworks (SOC 2, ISO 27001, HIPAA) require tamper-evident logs where post-hoc modification is provably detectable.
---
## Decision
Replace the append-only audit log in `vapora-backend/src/audit/mod.rs` with a Merkle hash chain in which each entry cryptographically commits to every entry before it.
---
## Context
### Why Append-Only Is Insufficient
An append-only log prevents deletion (assuming no `DELETE` privilege) but does not prevent modification. An attacker with write access to `audit_entries` can silently rewrite the `event_type`, `actor`, or `details` fields of any existing row without leaving any trace detectable by the application.
The previous implementation stored `seq`, `entry_id`, `timestamp`, `workflow_id`, `event_type`, `actor`, and `details` — but no integrity metadata. Any row could be updated without detection.
### Merkle Hash Chain Model
Each audit entry stores two additional fields:
- `prev_hash` — the `block_hash` of the immediately preceding entry (genesis entry uses `GENESIS_HASH = "00...00"` / 64 zeros)
- `block_hash` — SHA-256 of the concatenation: `prev_hash|seq|entry_id|timestamp_rfc3339|workflow_id|event_type|actor|details_json`
Modifying *any* covered field of entry N makes its recomputed hash differ from the stored `block_hash`. If the attacker also rewrites `block_hash` of entry N to match, `prev_hash` in entry N+1 no longer equals its predecessor's hash, so hiding the change requires rewriting every entry in the suffix of the chain.
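The linking rule can be illustrated with a minimal, dependency-free sketch. It is not the VAPORA implementation: `Entry`, `toy_hash`, `link_hash`, `append`, and `rewrite_entry` are hypothetical names, the covered fields are collapsed into a single `payload` string, sequence numbers are assumed to start at 0, and the standard library's non-cryptographic `DefaultHasher` stands in for SHA-256 so the example compiles without external crates.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Shortened stand-in for the 64-zero GENESIS_HASH described above.
const GENESIS_HASH: &str = "0000000000000000";

#[derive(Debug, Clone)]
struct Entry {
    seq: i64,
    payload: String, // stands in for entry_id, timestamp, workflow_id, ...
    prev_hash: String,
    block_hash: String,
}

// NOT cryptographic: a placeholder for SHA-256 (assumption for this sketch).
fn toy_hash(preimage: &str) -> String {
    let mut hasher = DefaultHasher::new();
    preimage.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

// Hash of the pipe-delimited preimage, with prev_hash as the first field.
fn link_hash(prev_hash: &str, seq: i64, payload: &str) -> String {
    toy_hash(&format!("{prev_hash}|{seq}|{payload}"))
}

// Appends an entry whose prev_hash is the block_hash of the current tail.
fn append(chain: &mut Vec<Entry>, payload: &str) {
    let prev_hash = chain
        .last()
        .map(|e| e.block_hash.clone())
        .unwrap_or_else(|| GENESIS_HASH.to_string());
    let seq = chain.len() as i64;
    let block_hash = link_hash(&prev_hash, seq, payload);
    chain.push(Entry { seq, payload: payload.to_string(), prev_hash, block_hash });
}

// Simulates an attacker who edits one entry and fixes up only that entry's hash.
fn rewrite_entry(chain: &mut [Entry], index: usize, new_payload: &str) {
    let e = &mut chain[index];
    e.payload = new_payload.to_string();
    e.block_hash = link_hash(&e.prev_hash, e.seq, &e.payload);
}
```

After `rewrite_entry(&mut chain, 1, ...)`, entry 1 is self-consistent again, but entry 2 still carries the old hash in `prev_hash`, which is the mismatch a verifier looks for.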
### Write Serialization
Fetching the previous hash and appending the new entry must be atomic with respect to other concurrent appends. A `write_lock: Arc<Mutex<()>>` serializes all `append` calls within the process. This is sufficient because VAPORA's backend is a single process; multi-node deployments would require a distributed lock (e.g., a SurrealDB `UPDATE ... IF locked IS NONE` CAS operation, as used by the scheduler).
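A rough sketch of the serialization pattern follows, under several assumptions: `AuditTrail` and its `rows` vector are hypothetical stand-ins for the real service and the `audit_entries` table, the standard library `Mutex` replaces whatever async-aware mutex the backend uses, sequence numbers start at 0, and `toy_hash` is a non-cryptographic placeholder for SHA-256.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::{Arc, Mutex};

const GENESIS_HASH: &str = "0000000000000000"; // shortened stand-in

#[derive(Debug, Clone)]
struct Entry {
    seq: i64,
    event: String,
    prev_hash: String,
    block_hash: String,
}

// NOT cryptographic: a placeholder for SHA-256 (assumption for this sketch).
fn toy_hash(preimage: &str) -> String {
    let mut hasher = DefaultHasher::new();
    preimage.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

// Hypothetical in-memory trail; `rows` plays the role of the database table.
struct AuditTrail {
    write_lock: Arc<Mutex<()>>,
    rows: Mutex<Vec<Entry>>,
}

impl AuditTrail {
    fn new() -> Self {
        AuditTrail { write_lock: Arc::new(Mutex::new(())), rows: Mutex::new(Vec::new()) }
    }

    fn append(&self, event: &str) -> Entry {
        // Held across "read tail" and "insert" so no other append can interleave.
        let _guard = self.write_lock.lock().unwrap();

        // Step 1: fetch the tail, as a separate store round trip would.
        let (seq, prev_hash) = {
            let rows = self.rows.lock().unwrap();
            match rows.last() {
                Some(last) => (last.seq + 1, last.block_hash.clone()),
                None => (0, GENESIS_HASH.to_string()),
            }
        };

        // Step 2: hash and insert. Without write_lock, two threads could both
        // read the same tail here and produce a forked chain.
        let block_hash = toy_hash(&format!("{prev_hash}|{seq}|{event}"));
        let entry = Entry { seq, event: event.to_string(), prev_hash, block_hash };
        self.rows.lock().unwrap().push(entry.clone());
        entry
    }
}
```

The separate `write_lock` mirrors the `Arc<Mutex<()>>` described above: it guards the two-step read-then-write sequence as a whole, which a lock on each individual store operation would not.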
---
## Implementation
### `AuditEntry` struct additions
```rust
pub struct AuditEntry {
    pub seq: i64,
    pub entry_id: String,
    pub timestamp: DateTime<Utc>,
    pub workflow_id: String,
    pub event_type: String,
    pub actor: String,
    pub details: serde_json::Value,
    pub prev_hash: String,  // hash of predecessor
    pub block_hash: String, // SHA-256 over all fields above
}
```
### Hash function
```rust
use chrono::{DateTime, Utc};
use sha2::{Digest, Sha256};

/// Hashes the pipe-delimited preimage of one entry and returns the hex digest.
fn compute_block_hash(
    prev_hash: &str,
    seq: i64,
    entry_id: &str,
    timestamp: &DateTime<Utc>,
    workflow_id: &str,
    event_type: &str,
    actor: &str,
    details: &serde_json::Value,
) -> String {
    // Compact JSON serialization of the details payload.
    let details = details.to_string();
    let ts = timestamp.to_rfc3339();
    let preimage = format!(
        "{prev_hash}|{seq}|{entry_id}|{ts}|{workflow_id}|{event_type}|{actor}|{details}"
    );
    let digest = Sha256::digest(preimage.as_bytes());
    hex::encode(digest)
}
```
### Integrity verification
```rust
pub async fn verify_integrity(&self, workflow_id: &str) -> Result<IntegrityReport> {
    // Fetch all entries for workflow ordered by seq
    // Re-derive each block_hash from stored fields
    // Compare against stored block_hash
    // Check prev_hash == previous entry's block_hash
    // Return IntegrityReport { valid, total_entries, first_tampered_seq }
}
```
`IntegrityReport` indicates the first tampered sequence number, allowing forensic identification of the modification point and every invalidated subsequent entry.
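The verification steps outlined above could look roughly like the following in-memory sketch. It operates on an already-fetched, `seq`-ordered slice instead of querying SurrealDB. The `IntegrityReport` field types, the reduced `Entry` struct, and the helpers `recompute`, `seal`, and `verify_chain` are assumptions made for illustration, and `toy_hash` is a non-cryptographic placeholder for SHA-256.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const GENESIS_HASH: &str = "0000000000000000"; // shortened stand-in

#[derive(Debug, Clone)]
struct Entry {
    seq: i64,
    event_type: String,
    actor: String,
    prev_hash: String,
    block_hash: String,
}

// Field names follow the ADR; the field types are assumptions.
#[derive(Debug, PartialEq)]
struct IntegrityReport {
    valid: bool,
    total_entries: usize,
    first_tampered_seq: Option<i64>,
}

// NOT cryptographic: a placeholder for SHA-256 (assumption for this sketch).
fn toy_hash(preimage: &str) -> String {
    let mut hasher = DefaultHasher::new();
    preimage.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

// Re-derives block_hash from the stored fields of one entry.
fn recompute(e: &Entry) -> String {
    toy_hash(&format!("{}|{}|{}|{}", e.prev_hash, e.seq, e.event_type, e.actor))
}

// Hypothetical helper that builds a correctly hashed entry.
fn seal(prev_hash: &str, seq: i64, event_type: &str, actor: &str) -> Entry {
    let mut e = Entry {
        seq,
        event_type: event_type.to_string(),
        actor: actor.to_string(),
        prev_hash: prev_hash.to_string(),
        block_hash: String::new(),
    };
    e.block_hash = recompute(&e);
    e
}

// Walks the chain once; stops at the first entry that fails either check.
fn verify_chain(entries: &[Entry]) -> IntegrityReport {
    let mut expected_prev = GENESIS_HASH.to_string();
    for e in entries {
        let link_ok = e.prev_hash == expected_prev;
        let hash_ok = e.block_hash == recompute(e);
        if !(link_ok && hash_ok) {
            return IntegrityReport {
                valid: false,
                total_entries: entries.len(),
                first_tampered_seq: Some(e.seq),
            };
        }
        expected_prev = e.block_hash.clone();
    }
    IntegrityReport { valid: true, total_entries: entries.len(), first_tampered_seq: None }
}
```

For example, changing `actor` on the entry with `seq == 1` in a three-entry chain yields a report with `valid: false` and `first_tampered_seq: Some(1)`, since the stored `block_hash` no longer matches the recomputed one.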
---
## Consequences
### What Becomes Possible
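- **Tamper detection**: A direct `UPDATE audit_entries SET event_type = ...` in SurrealDB that does not also recompute the hashes of the modified entry and every entry after it is detected on the next `verify_integrity` call.
- **Compliance evidence**: A chain that verifies can be presented as evidence that audit records have not been modified since creation, subject to the limitations listed below.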
- **API exposure**: `GET /api/v1/workflows/:id/audit` returns the full chain; clients can independently verify hashes.
### Limitations and Known Gaps
1. **No protection against log truncation**: Deleting the most recent entries, or an entire chain with `DELETE audit_entries WHERE workflow_id = ...`, is not detectable by the chain alone, because the remaining prefix still verifies (you cannot prove absence of entries). A separate monotonic counter or external timestamp anchor would address this.
2. **Single-process write lock**: The `Arc<Mutex<()>>` is sufficient for a single backend process. Multi-node deployments need a distributed lock or a database-level sequence generator with compare-and-swap semantics.
3. **SHA-256 without salting**: The hash is deterministic given the inputs. This is correct for tamper detection (you want reproducibility) but means the hash does not serve as a MAC: an attacker with write access who rewrites a row can also recompute the hash of that row and of every later row, producing a valid chain. For full WORM guarantees, chain anchoring to an external append-only service (e.g., a transparency log) would be required.
4. **Key rotation not addressed**: There is no HMAC key — `sha2` is used purely for commitment, not authentication. Adding a server-side HMAC key would prevent an attacker with DB write access from forging a valid chain, but requires key management.
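Gaps 1 and 3 share a cause: the chain only proves internal consistency. The sketch below illustrates the external-anchor idea mentioned above, where the entry count and head hash are recorded somewhere the database writer cannot modify. Everything here is hypothetical (`Anchor`, `anchor_of`, `matches_anchor`, `build_chain`, `internally_consistent`), nothing like it exists in the current implementation, and `toy_hash` is a non-cryptographic placeholder for SHA-256.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const GENESIS_HASH: &str = "0000000000000000"; // shortened stand-in

#[derive(Debug, Clone)]
struct Entry {
    seq: i64,
    event: String,
    prev_hash: String,
    block_hash: String,
}

// Hypothetical record kept outside the audit database.
struct Anchor {
    entry_count: usize,
    head_hash: String,
}

// NOT cryptographic: a placeholder for SHA-256 (assumption for this sketch).
fn toy_hash(preimage: &str) -> String {
    let mut hasher = DefaultHasher::new();
    preimage.hash(&mut hasher);
    format!("{:016x}", hasher.finish())
}

// Builds a valid chain; an attacker with write access can do exactly the same.
fn build_chain(events: &[&str]) -> Vec<Entry> {
    let mut chain: Vec<Entry> = Vec::new();
    for (i, event) in events.iter().enumerate() {
        let prev_hash = chain
            .last()
            .map(|e| e.block_hash.clone())
            .unwrap_or_else(|| GENESIS_HASH.to_string());
        let seq = i as i64;
        let block_hash = toy_hash(&format!("{prev_hash}|{seq}|{event}"));
        chain.push(Entry { seq, event: event.to_string(), prev_hash, block_hash });
    }
    chain
}

// The check the chain can perform on its own.
fn internally_consistent(chain: &[Entry]) -> bool {
    let mut expected_prev = GENESIS_HASH.to_string();
    for e in chain {
        let recomputed = toy_hash(&format!("{}|{}|{}", e.prev_hash, e.seq, e.event));
        if e.prev_hash != expected_prev || e.block_hash != recomputed {
            return false;
        }
        expected_prev = e.block_hash.clone();
    }
    true
}

fn anchor_of(chain: &[Entry]) -> Anchor {
    let head_hash = chain
        .last()
        .map(|e| e.block_hash.clone())
        .unwrap_or_else(|| GENESIS_HASH.to_string());
    Anchor { entry_count: chain.len(), head_hash }
}

// Fails if the chain is shorter than anchored, or the anchored head changed.
fn matches_anchor(chain: &[Entry], anchor: &Anchor) -> bool {
    if chain.len() < anchor.entry_count {
        return false; // truncation
    }
    let head = match anchor.entry_count {
        0 => GENESIS_HASH,
        n => chain[n - 1].block_hash.as_str(),
    };
    head == anchor.head_hash.as_str()
}
```

A fully recomputed forgery and a truncated prefix both pass `internally_consistent`, yet neither passes `matches_anchor` against an anchor taken from the original chain. The anchor is only as trustworthy as the place it is stored.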
---
## Alternatives Considered
### Database-Level Audit Triggers
SurrealDB (v3) does not expose write triggers that could hash entries at the storage level. A pure DB-level solution is not available.
### External Append-Only Log (NATS JetStream with `MaxMsgs` and no delete)
Would require a separate NATS stream per workflow and cross-referencing two storage systems. Deferred — the Merkle chain provides sufficient tamper evidence for current compliance requirements without external dependencies.
### HMAC-based Authentication
Adds server-side secret management (rotation, distribution across nodes). Deferred until multi-node deployment requires it.
---
## Related
- [ADR-0038: SSRF Protection and Prompt Injection Scanning](0038-security-ssrf-prompt-injection.md)
- [Workflow Orchestrator feature reference](../features/workflow-orchestrator.md)