Jesús Pérez 27a290b369

feat(kg,channels): hybrid search + agent-inactive notifications

- KG: HNSW + BM25 + RRF(k=60) hybrid search via SurrealDB 3 native indexes
  - Fix schema bug: kg_executions missing agent_role/provider/cost_cents (silent empty reads)
  - channels: on_agent_inactive hook (AgentStatus::Inactive → Message::error)
  - migration 012: adds missing fields + HNSW + BM25 indexes
  - docs: ADR-0036, update ADR-0035 + notification-channels feature doc

2026-02-26 15:32:44 +00:00

ADR-0036: Knowledge Graph Hybrid Search — HNSW + BM25 + RRF

Status: Implemented
Date: 2026-02-26
Deciders: VAPORA Team
Technical Story: find_similar_executions was a stub returning recent records; find_similar_rlm_tasks ignored embeddings entirely. A missing schema migration caused all kg_executions reads to silently fail deserialization.

Decision

Replace the stub similarity functions in KGPersistence with a hybrid retrieval pipeline combining:

HNSW (SurrealDB 3 native) — approximate nearest-neighbor vector search over embedding field
BM25 (SurrealDB 3 native full-text search) — lexical scoring over task_description field
Reciprocal Rank Fusion (RRF, k=60) — scale-invariant score fusion

Add migration 012_kg_hybrid_search.surql that fixes a pre-existing schema bug (three fields missing from the SCHEMAFULL table) and defines the required indexes.

Context

The Stub Problem

find_similar_executions in persistence.rs discarded its embedding: &[f32] argument entirely and returned the N most-recent successful executions, ordered by timestamp. Any caller relying on semantic proximity was silently receiving chronological results — a correctness bug, not a performance issue.

The Silent Schema Bug

kg_executions was declared SCHEMAFULL in migration 005, but three fields used by PersistedExecution (agent_role, provider, cost_cents) were absent from the schema. SurrealDB drops undefined fields on INSERT in SCHEMAFULL tables, so records were stored without them. Every subsequent SELECT returned records that failed serde_json::from_value deserialization, and the errors were swallowed by .filter_map(|v| v.ok()). The persistence layer appeared to work (no errors) while returning empty results for every query.
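The failure mode described above can be reproduced in a few lines. The following is a minimal sketch, not the project's code: `Execution`, `Row`, `field`, `decode`, `decode_lossy` and `decode_strict` are hypothetical names, and a list of key/value pairs stands in for the JSON value so the example needs only the standard library.

```rust
// Hypothetical stand-in for PersistedExecution; only fields named in the text.
#[derive(Debug, PartialEq)]
struct Execution {
    task_description: String,
    provider: String,
    cost_cents: i64,
}

// A raw row as (field, value) pairs, standing in for a JSON object.
type Row = Vec<(&'static str, &'static str)>;

// Look up a required field, reporting which one is missing.
fn field<'a>(row: &'a Row, name: &str) -> Result<&'a str, String> {
    row.iter()
        .find(|(k, _)| *k == name)
        .map(|(_, v)| *v)
        .ok_or_else(|| format!("missing field `{name}`"))
}

// Decode one row; fails if any required field is absent or malformed.
fn decode(row: &Row) -> Result<Execution, String> {
    Ok(Execution {
        task_description: field(row, "task_description")?.to_string(),
        provider: field(row, "provider")?.to_string(),
        cost_cents: field(row, "cost_cents")?
            .parse::<i64>()
            .map_err(|e| format!("bad cost_cents: {e}"))?,
    })
}

// Lossy: mirrors `.filter_map(|v| v.ok())`, so decode failures vanish.
fn decode_lossy(rows: &[Row]) -> Vec<Execution> {
    rows.iter().map(decode).filter_map(|r| r.ok()).collect()
}

// Strict: the first decode failure is returned to the caller.
fn decode_strict(rows: &[Row]) -> Result<Vec<Execution>, String> {
    rows.iter().map(decode).collect()
}
```

Given a row that holds only task_description (as stored once the undefined fields were dropped), `decode_lossy` returns an empty vector with no error, while `decode_strict` returns an `Err` naming the missing field. Collecting into a `Result` is one way to make this class of bug visible; logging the dropped errors would be another.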

Why Not `stratum-embeddings` SurrealDbStore

stratumiops/crates/stratum-embeddings/src/store/surrealdb.rs implements vector search as a brute-force full-scan: it loads all records into memory and computes cosine similarity in-process. This works for document chunk retrieval (bounded dataset per document), but is unsuitable for the knowledge graph which accumulates unbounded execution records across all agents and tasks over time.

Why Hybrid Over Pure Semantic

Embedding-only retrieval misses exact keyword matches: an agent searching for "cargo clippy warnings" may not find a record titled "clippy deny warnings fix" if the embedding model compresses the phrase differently than the query. BM25 handles exact token overlap that embeddings smooth over.

Alternatives Considered

❌ Pure HNSW semantic search only

Misses exact keyword matches (e.g., specific error codes, crate names)
Embedding quality varies across providers; degrades if provider changes

❌ Pure BM25 lexical search only

Misses paraphrases and semantic variants ("task failed" vs "execution error")
No relevance for structurally similar tasks with different wording

❌ Tantivy / external FTS engine

Adds a new process dependency for a capability SurrealDB 3 provides natively
Requires synchronizing two stores; adds operational complexity

✅ SurrealDB 3 HNSW + BM25 + RRF (chosen)

Single data store, two native index types, no new dependencies, no sync complexity.

Implementation

Migration 012

-- Fix missing fields causing silent deserialization failure
DEFINE FIELD agent_role ON TABLE kg_executions TYPE option<string>;
DEFINE FIELD provider   ON TABLE kg_executions TYPE string DEFAULT 'unknown';
DEFINE FIELD cost_cents ON TABLE kg_executions TYPE int    DEFAULT 0;

-- BM25 full-text index on task_description
DEFINE ANALYZER kg_text_analyzer
    TOKENIZERS class
    FILTERS lowercase, snowball(english);

DEFINE INDEX idx_kg_executions_ft
    ON TABLE kg_executions
    FIELDS task_description
    SEARCH ANALYZER kg_text_analyzer BM25;

-- HNSW ANN index on embedding (1536-dim, cosine, float32)
DEFINE INDEX idx_kg_executions_hnsw
    ON TABLE kg_executions
    FIELDS embedding
    HNSW DIMENSION 1536 DIST COSINE TYPE F32 M 16 EF_CONSTRUCTION 200;

HNSW parameters: M 16 (up to 16 links per node in each graph layer, a common general-purpose value); EF_CONSTRUCTION 200 (size of the candidate list used while building the index, trading index quality against insert speed; 200 is a widely used value).

Query Patterns

HNSW semantic search (<|100,64|> = 100 candidates, ef=64 at query time):

SELECT *, vector::similarity::cosine(embedding, $q) AS cosine_score
FROM kg_executions
WHERE embedding <|100,64|> $q
ORDER BY cosine_score DESC
LIMIT 20

BM25 lexical search (@1@ assigns predicate ID 1; paired with search::score(1)):

SELECT *, search::score(1) AS bm25_score
FROM kg_executions
WHERE task_description @1@ $text
ORDER BY bm25_score DESC
LIMIT 100

RRF Fusion

Cosine similarity is bounded in [-1.0, 1.0] (and mostly falls in [0.0, 1.0] for text embeddings); BM25 is unbounded [0, ∞). Linear blending of the two requires per-corpus normalization. RRF uses only ranks, so it is scale-invariant:

hybrid_score(id) = 1 / (60 + rank_semantic) + 1 / (60 + rank_lexical)

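k=60 is the constant proposed in the original RRF paper (Cormack, Clarke & Büttcher, 2009). IDs absent from one ranked list receive rank 0 and contribute 1/60 rather than 0, which prevents complete suppression of single-method results. Note that if present IDs are ranked from 1, this absent-list term (1/60) exceeds the first-place term (1/61); common RRF implementations instead add nothing for a list that does not contain the ID.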
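The fusion step can be sketched as follows. This is an illustration under stated assumptions, not the code in persistence.rs: `rrf_fuse`, `term` and `RRF_K` are hypothetical names, ranks are 1-based, and an ID missing from a list contributes nothing (the common convention), which differs from the rank-0 handling described in the text. Changing the `None` arm of `term` is enough to switch conventions.

```rust
use std::collections::HashMap;

const RRF_K: f64 = 60.0;

// Contribution of one ranked list to an ID's fused score.
// ASSUMPTION: an absent ID contributes 0.0; the ADR text describes 1/60 here.
fn term(rank: Option<usize>) -> f64 {
    match rank {
        Some(r) => 1.0 / (RRF_K + r as f64),
        None => 0.0,
    }
}

// Fuse two ranked ID lists (best first, IDs unique within each list).
// Returns (id, hybrid_score) pairs sorted by descending score.
fn rrf_fuse(semantic: &[&str], lexical: &[&str]) -> Vec<(String, f64)> {
    // id -> (semantic_rank, lexical_rank), both 1-based when present.
    let mut ranks: HashMap<&str, (Option<usize>, Option<usize>)> = HashMap::new();
    for (i, id) in semantic.iter().enumerate() {
        ranks.entry(*id).or_insert((None, None)).0 = Some(i + 1);
    }
    for (i, id) in lexical.iter().enumerate() {
        ranks.entry(*id).or_insert((None, None)).1 = Some(i + 1);
    }

    let mut fused: Vec<(String, f64)> = ranks
        .into_iter()
        .map(|(id, (s, l))| (id.to_string(), term(s) + term(l)))
        .collect();

    // Highest score first; tie-break on ID so the output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

For example, fusing the semantic list `["a", "b", "c"]` with the lexical list `["b", "d"]` places `b` first, since it is the only ID present in both lists and scores 1/62 + 1/61. Because only ranks enter the formula, the raw cosine and BM25 values never need to be normalized against each other.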

RLM Executions

rlm_executions is SCHEMALESS with a nullable query_embedding field. HNSW indexes require a SCHEMAFULL table with a non-nullable typed field. find_similar_rlm_tasks therefore uses in-memory cosine similarity: it loads candidate records, keeps those with non-empty embeddings, and sorts them by cosine score. This is acceptable because the RLM dataset is bounded per document.
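The in-memory path can be sketched like this. It is a minimal illustration, assuming a hypothetical `Candidate` record with an optional `query_embedding`; `cosine` and `rank_by_cosine` are illustrative names and not the signatures used in the codebase.

```rust
// Cosine similarity of two vectors.
// Returns None for empty, mismatched-length, or zero-norm inputs.
fn cosine(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.is_empty() || a.len() != b.len() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

// Hypothetical candidate record: an ID plus a nullable embedding.
struct Candidate {
    id: String,
    query_embedding: Option<Vec<f32>>,
}

// Score every candidate that has a usable embedding, sort by descending
// cosine score, and keep the top `limit`.
fn rank_by_cosine(query: &[f32], candidates: &[Candidate], limit: usize) -> Vec<(String, f32)> {
    let mut scored: Vec<(String, f32)> = candidates
        .iter()
        .filter_map(|c| {
            let emb = c.query_embedding.as_deref()?; // skip null embeddings
            cosine(query, emb).map(|s| (c.id.clone(), s)) // skip empty or invalid ones
        })
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(limit);
    scored
}
```

The cost is linear in the number of candidates per query, which is why this approach fits a bounded dataset and not the unbounded kg_executions table.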

New Public API

impl KGPersistence {
    // Was stub (returned recent records). Now uses HNSW ANN query.
    pub async fn find_similar_executions(
        &self,
        embedding: &[f32],
        limit: usize,
    ) -> anyhow::Result<Vec<PersistedExecution>>;

    // New. HNSW + BM25 + RRF. Either argument may be empty (degrades gracefully).
    pub async fn hybrid_search(
        &self,
        embedding: &[f32],
        text_query: &str,
        limit: usize,
    ) -> anyhow::Result<Vec<HybridSearchResult>>;
}

HybridSearchResult exposes semantic_score, lexical_score, hybrid_score, semantic_rank, lexical_rank — callers can inspect individual signal contributions.
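As a rough illustration of how a caller might use these fields, the sketch below filters results by semantic score. The field names come from this ADR; the field types, the `id` field, and the `high_confidence` helper are assumptions made for the example and may not match the real struct.

```rust
// Hypothetical shape: field names from the ADR, types and `id` assumed.
#[derive(Debug, Clone, PartialEq)]
struct HybridSearchResult {
    id: String,
    semantic_score: f64,
    lexical_score: f64,
    hybrid_score: f64,
    semantic_rank: usize,
    lexical_rank: usize,
}

// Keep only results whose semantic score meets the threshold,
// preserving the incoming (hybrid-score) order.
fn high_confidence(
    results: Vec<HybridSearchResult>,
    min_semantic: f64,
) -> Vec<HybridSearchResult> {
    results
        .into_iter()
        .filter(|r| r.semantic_score >= min_semantic)
        .collect()
}
```

Calling `high_confidence(results, 0.7)` on the output of hybrid_search would drop matches that ranked well only on lexical overlap, at the cost of losing exact-keyword hits whose embeddings are far from the query.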

Consequences

Positive

find_similar_executions returns semantically similar past executions, not recent ones. The correctness bug is fixed.
hybrid_search exposes both signals; callers can filter by semantic_score ≥ 0.7 for high-confidence-only retrieval.
No new dependencies. The two indexes are defined in a migration; no Rust dependency change.
The schema bug fix means all existing kg_executions records round-trip correctly after migration 012 is applied.

Negative / Trade-offs

HNSW index build is O(n log n) in SurrealDB; large existing datasets will cause migration 012 to take longer than typical DDL migrations. No data migration is needed — only index creation.
BM25 requires the task_description field to be populated at insert time. Records inserted before this migration with empty or null descriptions will not appear in lexical results.
rlm_executions hybrid search remains in-memory. A future migration converting rlm_executions to SCHEMAFULL would enable native HNSW for that table too.

Supersedes

The stub implementation of find_similar_executions, which had existed since persistence.rs was first written.
Extends ADR-0013 (KG temporal design) with the retrieval layer decision.

Related Decisions

ADR-0013: Knowledge Graph Temporal — original KG design
ADR-0029: RLM Recursive Language Models — RLM hybrid search (different use case: document chunks, not execution records)
ADR-0004: SurrealDB — database foundation
