Vapora/docs/adrs/0036-kg-hybrid-search.md

ADR-0036: Knowledge Graph Hybrid Search — HNSW + BM25 + RRF

Status: Implemented
Date: 2026-02-26
Deciders: VAPORA Team
Technical Story: find_similar_executions was a stub returning recent records; find_similar_rlm_tasks ignored embeddings entirely. A missing schema migration caused all kg_executions reads to silently fail deserialization.


Decision

Replace the stub similarity functions in KGPersistence with a hybrid retrieval pipeline combining:

  1. HNSW (SurrealDB 3 native) — approximate nearest-neighbor vector search over embedding field
  2. BM25 (SurrealDB 3 native full-text search) — lexical scoring over task_description field
  3. Reciprocal Rank Fusion (RRF, k=60) — scale-invariant score fusion

Add migration 012_kg_hybrid_search.surql that fixes a pre-existing schema bug (three fields missing from the SCHEMAFULL table) and defines the required indexes.


Context

The Stub Problem

find_similar_executions in persistence.rs discarded its embedding: &[f32] argument entirely and returned the N most-recent successful executions, ordered by timestamp. Any caller relying on semantic proximity was silently receiving chronological results — a correctness bug, not a performance issue.

The Silent Schema Bug

kg_executions was declared SCHEMAFULL in migration 005 but three fields used by PersistedExecution (agent_role, provider, cost_cents) were absent from the schema. SurrealDB drops undefined fields on INSERT in SCHEMAFULL tables. All subsequent SELECT queries returned records that failed serde_json::from_value deserialization, which was swallowed by .filter_map(|v| v.ok()). The persistence layer appeared to work (no errors) while returning empty results for every query.
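The failure mode above can be reproduced in miniature. The following is a minimal sketch, not the code in persistence.rs: `Execution`, `decode`, `decode_lossy`, and `decode_strict` are hypothetical names, and `decode` stands in for `serde_json::from_value` so the example needs only the standard library. It contrasts the lossy `filter_map(... .ok())` pattern with a strict collect that surfaces the first error.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for `PersistedExecution`; only the three fields
/// named in this ADR are modeled.
#[derive(Debug, PartialEq)]
struct Execution {
    agent_role: Option<String>,
    provider: String,
    cost_cents: i64,
}

/// Hypothetical row decoder standing in for `serde_json::from_value`.
/// Fails when a required field is missing, as a strict deserializer would.
fn decode(row: &HashMap<&str, &str>) -> Result<Execution, String> {
    let provider = row
        .get("provider")
        .ok_or_else(|| "missing field `provider`".to_string())?;
    let cost_cents = row
        .get("cost_cents")
        .ok_or_else(|| "missing field `cost_cents`".to_string())?
        .parse::<i64>()
        .map_err(|e| format!("invalid `cost_cents`: {e}"))?;
    Ok(Execution {
        agent_role: row.get("agent_role").map(|s| s.to_string()),
        provider: provider.to_string(),
        cost_cents,
    })
}

/// Lossy pattern: every decode error is discarded, so a schema mismatch
/// is indistinguishable from an empty result set.
fn decode_lossy(rows: &[HashMap<&str, &str>]) -> Vec<Execution> {
    rows.iter().filter_map(|r| decode(r).ok()).collect()
}

/// Strict pattern: the first decode error aborts the read and is returned.
fn decode_strict(rows: &[HashMap<&str, &str>]) -> Result<Vec<Execution>, String> {
    rows.iter().map(decode).collect()
}
```

With a row that lacks `provider` and `cost_cents`, `decode_lossy` returns an empty vector while `decode_strict` returns an error naming the missing field. That difference is what would have made the schema bug visible.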

Why Not stratum-embeddings SurrealDbStore

stratumiops/crates/stratum-embeddings/src/store/surrealdb.rs implements vector search as a brute-force full-scan: it loads all records into memory and computes cosine similarity in-process. This works for document chunk retrieval (bounded dataset per document), but is unsuitable for the knowledge graph which accumulates unbounded execution records across all agents and tasks over time.

Why Hybrid Over Pure Semantic

Embedding-only retrieval misses exact keyword matches: an agent searching for "cargo clippy warnings" may not find a record titled "clippy deny warnings fix" if the embedding model compresses the phrase differently than the query. BM25 handles exact token overlap that embeddings smooth over.
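To make the lexical signal concrete, here is a minimal BM25 scorer, offered as an illustration rather than a description of SurrealDB's internals. The function names, the whitespace/punctuation tokenizer, and the parameters k1 = 1.2 and b = 0.75 are assumptions. Unlike the analyzer defined in migration 012, this sketch does no stemming.

```rust
use std::collections::HashMap;

/// Lowercases and splits on non-alphanumeric characters (no stemming).
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Returns one BM25 score per document for the given query.
fn bm25_scores(docs: &[&str], query: &str) -> Vec<f64> {
    const K1: f64 = 1.2; // term-frequency saturation (assumed value)
    const B: f64 = 0.75; // length normalization strength (assumed value)

    let tokenized: Vec<Vec<String>> = docs.iter().map(|d| tokenize(d)).collect();
    let n = tokenized.len() as f64;
    let total_len: f64 = tokenized.iter().map(|d| d.len() as f64).sum();
    // Guard against division by zero when there are no tokens at all.
    let avg_len = if total_len > 0.0 { total_len / n } else { 1.0 };
    let query_terms = tokenize(query);

    // Document frequency: number of documents containing each query term.
    let mut df: HashMap<&str, f64> = HashMap::new();
    for term in &query_terms {
        let count = tokenized.iter().filter(|d| d.contains(term)).count() as f64;
        df.insert(term.as_str(), count);
    }

    tokenized
        .iter()
        .map(|doc| {
            let len = doc.len() as f64;
            query_terms
                .iter()
                .map(|term| {
                    let tf = doc.iter().filter(|t| *t == term).count() as f64;
                    let n_t = df[term.as_str()];
                    let idf = ((n - n_t + 0.5) / (n_t + 0.5) + 1.0).ln();
                    idf * tf * (K1 + 1.0) / (tf + K1 * (1.0 - B + B * len / avg_len))
                })
                .sum::<f64>()
        })
        .collect()
}
```

For the query "cargo clippy warnings", a document titled "clippy deny warnings fix" receives a positive score from the two shared tokens, while documents sharing no tokens score exactly zero regardless of how an embedding model would place them.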


Alternatives Considered

Pure HNSW semantic search only

  • Misses exact keyword matches (e.g., specific error codes, crate names)
  • Embedding quality varies across providers; degrades if provider changes

Pure BM25 lexical search only

  • Misses paraphrases and semantic variants ("task failed" vs "execution error")
  • No relevance for structurally similar tasks with different wording

Tantivy / external FTS engine

  • Adds a new process dependency for a capability SurrealDB 3 provides natively
  • Requires synchronizing two stores; adds operational complexity

SurrealDB 3 HNSW + BM25 + RRF (chosen)

Single data store, two native index types, no new dependencies, no sync complexity.


Implementation

Migration 012

-- Fix missing fields causing silent deserialization failure
DEFINE FIELD agent_role ON TABLE kg_executions TYPE option<string>;
DEFINE FIELD provider   ON TABLE kg_executions TYPE string DEFAULT 'unknown';
DEFINE FIELD cost_cents ON TABLE kg_executions TYPE int    DEFAULT 0;

-- BM25 full-text index on task_description
DEFINE ANALYZER kg_text_analyzer
    TOKENIZERS class
    FILTERS lowercase, snowball(english);

DEFINE INDEX idx_kg_executions_ft
    ON TABLE kg_executions
    FIELDS task_description
    SEARCH ANALYZER kg_text_analyzer BM25;

-- HNSW ANN index on embedding (1536-dim, cosine, float32)
DEFINE INDEX idx_kg_executions_hnsw
    ON TABLE kg_executions
    FIELDS embedding
    HNSW DIMENSION 1536 DIST COSINE TYPE F32 M 16 EF_CONSTRUCTION 200;

HNSW parameters: M 16 (up to 16 links per node per layer, a common choice for high-dimensional embeddings); EF_CONSTRUCTION 200 (size of the candidate list during index build, trading build quality against insert speed; 200 is a widely used value).

Query Patterns

HNSW semantic search (<|100,64|> = 100 candidates, ef=64 at query time):

SELECT *, vector::similarity::cosine(embedding, $q) AS cosine_score
FROM kg_executions
WHERE embedding <|100,64|> $q
ORDER BY cosine_score DESC
LIMIT 20

BM25 lexical search (@1@ assigns predicate ID 1; paired with search::score(1)):

SELECT *, search::score(1) AS bm25_score
FROM kg_executions
WHERE task_description @1@ $text
ORDER BY bm25_score DESC
LIMIT 100

RRF Fusion

Cosine similarity is bounded in [-1.0, 1.0]; BM25 is unbounded [0, ∞). Linear blending of the two therefore requires per-corpus normalization. RRF uses only ranks, so it is scale-invariant:

hybrid_score(id) = 1 / (60 + rank_semantic) + 1 / (60 + rank_lexical)

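k=60 is the constant proposed in the original RRF paper (Cormack, Clarke & Büttcher, 2009). IDs absent from one ranked list receive rank 0, contributing 1/60, never 0, so a result found by only one method is not suppressed. Note that with 1-based ranks, 1/60 exceeds the 1/61 contributed by a first-place result, so absence from a list scores higher than presence at rank 1.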
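The fusion step can be sketched in a few lines. This is an illustrative sketch, not the code in persistence.rs. `rrf_fuse` and `RRF_K` are hypothetical names, ranks are assumed to be 1-based, and the sketch uses the standard RRF formulation in which a list that does not contain an ID contributes nothing. The rank-0 convention described above would instead add 1/60 for the missing list.

```rust
use std::collections::HashMap;

/// RRF constant k from the formula above.
const RRF_K: f64 = 60.0;

/// Fuses two ranked ID lists (best first) into `(id, hybrid_score)` pairs,
/// sorted by descending score with ties broken by ID.
fn rrf_fuse(semantic: &[&str], lexical: &[&str]) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in [semantic, lexical] {
        for (i, id) in list.iter().enumerate() {
            let rank = (i + 1) as f64; // 1-based rank (assumption)
            *scores.entry(id.to_string()).or_insert(0.0) += 1.0 / (RRF_K + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

Given semantic results `["a", "b"]` and lexical results `["b", "c"]`, `b` scores 1/62 + 1/61 and ranks first because both methods returned it, followed by `a` (1/61) and `c` (1/62).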

RLM Executions

rlm_executions is SCHEMALESS with a nullable query_embedding field. HNSW indexes require a SCHEMAFULL table with a non-nullable typed field. find_similar_rlm_tasks therefore uses in-memory cosine similarity: it loads candidate records, keeps those with non-empty embeddings, and sorts them by cosine score. This is acceptable because the RLM dataset is bounded per document.
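The in-memory path can be sketched as follows. This is an illustration under assumptions, not the body of find_similar_rlm_tasks: `RlmRecord`, `cosine`, and `top_similar` are hypothetical names, and the record is reduced to an ID plus its optional embedding.

```rust
/// Cosine similarity. Returns `None` for empty, length-mismatched,
/// or zero-norm inputs, where the measure is undefined.
fn cosine(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.is_empty() || a.len() != b.len() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Hypothetical minimal view of an `rlm_executions` record.
struct RlmRecord {
    id: String,
    query_embedding: Option<Vec<f32>>,
}

/// Scores every record that has a usable embedding and returns the
/// `limit` best `(id, cosine_score)` pairs, highest score first.
fn top_similar<'a>(
    records: &'a [RlmRecord],
    query: &[f32],
    limit: usize,
) -> Vec<(&'a str, f32)> {
    let mut scored: Vec<(&str, f32)> = records
        .iter()
        .filter_map(|r| {
            // Skip records with a null embedding.
            let emb = r.query_embedding.as_deref()?;
            // Skip empty or mismatched embeddings (cosine returns None).
            cosine(emb, query).map(|score| (r.id.as_str(), score))
        })
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(limit);
    scored
}
```

The scan is linear in the number of candidate records, which is why it suits a dataset bounded per document and not the unbounded kg_executions table.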

New Public API

impl KGPersistence {
    // Was stub (returned recent records). Now uses HNSW ANN query.
    pub async fn find_similar_executions(
        &self,
        embedding: &[f32],
        limit: usize,
    ) -> anyhow::Result<Vec<PersistedExecution>>;

    // New. HNSW + BM25 + RRF. Either argument may be empty (degrades gracefully).
    pub async fn hybrid_search(
        &self,
        embedding: &[f32],
        text_query: &str,
        limit: usize,
    ) -> anyhow::Result<Vec<HybridSearchResult>>;
}

HybridSearchResult exposes semantic_score, lexical_score, hybrid_score, semantic_rank, lexical_rank — callers can inspect individual signal contributions.
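As an illustration of inspecting individual signals, a caller-side filter might look like the sketch below. The field names come from this ADR; the field types, the `id` field, and the `high_confidence` helper are assumptions made for the example.

```rust
/// Hypothetical shape of `HybridSearchResult`. Field names follow the ADR;
/// the types and the `id` field are assumptions.
#[derive(Debug, Clone, PartialEq)]
struct HybridSearchResult {
    id: String,
    semantic_score: f64,
    lexical_score: f64,
    hybrid_score: f64,
    semantic_rank: usize,
    lexical_rank: usize,
}

/// Keeps only results whose semantic signal meets the threshold,
/// preserving the incoming (hybrid) order.
fn high_confidence(
    results: Vec<HybridSearchResult>,
    min_semantic: f64,
) -> Vec<HybridSearchResult> {
    results
        .into_iter()
        .filter(|r| r.semantic_score >= min_semantic)
        .collect()
}
```

A caller wanting only strong semantic matches could pass a threshold such as 0.7 and still read `lexical_score` and `lexical_rank` on the surviving results.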


Consequences

Positive

  • find_similar_executions returns semantically similar past executions, not recent ones. The correctness bug is fixed.
  • hybrid_search exposes both signals; callers can filter by semantic_score ≥ 0.7 for high-confidence-only retrieval.
  • No new dependencies. The two indexes are defined in a migration; no Rust dependency change.
  • The schema bug fix means all existing kg_executions records round-trip correctly after migration 012 is applied.

Negative / Trade-offs

  • HNSW index build is O(n log n) in SurrealDB; large existing datasets will cause migration 012 to take longer than typical DDL migrations. No data migration is needed — only index creation.
  • BM25 requires the task_description field to be populated at insert time. Records inserted before this migration with empty or null descriptions will not appear in lexical results.
  • rlm_executions similarity search remains in-memory and embedding-only. A future migration converting rlm_executions to SCHEMAFULL would enable native HNSW for that table too.

Supersedes

  • The stub implementation of find_similar_executions (existed since persistence.rs was written).
  • Extends ADR-0013 (KG temporal design) with the retrieval layer decision.