# ADR-0036: Knowledge Graph Hybrid Search — HNSW + BM25 + RRF

**Status**: Implemented

**Date**: 2026-02-26

**Deciders**: VAPORA Team

**Technical Story**: `find_similar_executions` was a stub returning recent records; `find_similar_rlm_tasks` ignored embeddings entirely. A missing schema migration caused all `kg_executions` reads to silently fail deserialization.

---

## Decision

Replace the stub similarity functions in `KGPersistence` with a **hybrid retrieval pipeline** combining:

1. **HNSW** (SurrealDB 3 native): approximate nearest-neighbor vector search over the `embedding` field
2. **BM25** (SurrealDB 3 native full-text search): lexical scoring over the `task_description` field
3. **Reciprocal Rank Fusion (RRF, k=60)**: scale-invariant score fusion

Add migration `012_kg_hybrid_search.surql`, which fixes a pre-existing schema bug (three fields missing from the `SCHEMAFULL` table) and defines the required indexes.

---

## Context

### The Stub Problem

`find_similar_executions` in `persistence.rs` discarded its `embedding: &[f32]` argument entirely and returned the N most-recent successful executions, ordered by timestamp. Any caller relying on semantic proximity was silently receiving chronological results: a correctness bug, not a performance issue.

### The Silent Schema Bug

`kg_executions` was declared `SCHEMAFULL` in migration 005, but three fields used by `PersistedExecution` (`agent_role`, `provider`, `cost_cents`) were absent from the schema. SurrealDB drops undefined fields on `INSERT` in SCHEMAFULL tables. All subsequent `SELECT` queries returned records that failed `serde_json::from_value` deserialization, and the failure was swallowed by `.filter_map(|v| v.ok())`. The persistence layer appeared to work (no errors) while returning empty results for every query.

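
The failure mode above is easy to reproduce in isolation. The following is a minimal sketch, not the code in `persistence.rs`: `decode_row`, `decode_lenient`, and `decode_strict` are hypothetical names, and integer parsing stands in for `serde_json::from_value` so the example needs only the standard library. It contrasts the lenient `filter_map(.. .ok())` pattern with collecting into a `Result`, which surfaces the first decoding error.

```rust
use std::num::ParseIntError;

/// Stand-in for deserializing one row (assumption: the real code would call
/// a serde deserializer here; integer parsing keeps the sketch std-only).
fn decode_row(raw: &str) -> Result<i64, ParseIntError> {
    raw.trim().parse::<i64>()
}

/// Lenient decoding: rows that fail to decode are dropped without any signal.
fn decode_lenient(rows: &[&str]) -> Vec<i64> {
    rows.iter().filter_map(|r| decode_row(r).ok()).collect()
}

/// Strict decoding: the first row that fails to decode is returned as an error.
fn decode_strict(rows: &[&str]) -> Result<Vec<i64>, ParseIntError> {
    rows.iter().map(|r| decode_row(r)).collect()
}
```

If every row is malformed, `decode_lenient` returns an empty vector and no error, which matches the symptom described above; `decode_strict` on the same input returns `Err`.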
### Why Not `stratum-embeddings` SurrealDbStore

`stratumiops/crates/stratum-embeddings/src/store/surrealdb.rs` implements vector search as a brute-force full scan: it loads all records into memory and computes cosine similarity in-process. This works for document chunk retrieval (bounded dataset per document), but is unsuitable for the knowledge graph, which accumulates unbounded execution records across all agents and tasks over time.

### Why Hybrid Over Pure Semantic

Embedding-only retrieval misses exact keyword matches: an agent searching for "cargo clippy warnings" may not find a record titled "clippy deny warnings fix" if the embedding model compresses the phrase differently than the query. BM25 handles exact token overlap that embeddings smooth over.

---

## Alternatives Considered

### ❌ Pure HNSW semantic search only

- Misses exact keyword matches (e.g., specific error codes, crate names)
- Embedding quality varies across providers; degrades if provider changes

### ❌ Pure BM25 lexical search only

- Misses paraphrases and semantic variants ("task failed" vs "execution error")
- No relevance for structurally similar tasks with different wording

### ❌ Tantivy / external FTS engine

- Adds a new process dependency for a capability SurrealDB 3 provides natively
- Requires synchronizing two stores; adds operational complexity

### ✅ SurrealDB 3 HNSW + BM25 + RRF (chosen)

Single data store, two native index types, no new dependencies, no sync complexity.

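
For readers unfamiliar with the fusion step in the chosen option, here is a minimal sketch of textbook Reciprocal Rank Fusion with `k = 60`. It is illustrative only and not the `KGPersistence` implementation: `rrf_fuse` and its signature are hypothetical, ranks are assumed to be 1-based, and an ID missing from one list simply contributes nothing from that list, which differs from the rank-0 convention described under RRF Fusion.

```rust
use std::collections::HashMap;

/// RRF constant, matching the k=60 named in this ADR.
const K: f64 = 60.0;

/// Fuse two ranked ID lists (best match first) into `(id, score)` pairs,
/// sorted by descending fused score. Hypothetical helper for illustration.
fn rrf_fuse(semantic: &[&str], lexical: &[&str]) -> Vec<(String, f64)> {
    let mut scores: HashMap<&str, f64> = HashMap::new();
    for list in [semantic, lexical] {
        for (i, id) in list.iter().enumerate() {
            // enumerate() is 0-based; RRF ranks start at 1.
            *scores.entry(*id).or_insert(0.0) += 1.0 / (K + (i + 1) as f64);
        }
    }
    let mut fused: Vec<(String, f64)> = scores
        .into_iter()
        .map(|(id, score)| (id.to_string(), score))
        .collect();
    // Highest score first; break ties on the ID so output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}
```

Calling `rrf_fuse(&["a", "b", "c"], &["b", "d"])` places `b` first, because it is the only ID that collects a contribution from both lists. Only ranks enter the computation, so the raw cosine and BM25 scores never need to be normalized against each other.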
---

## Implementation

### Migration 012

```sql
-- Fix missing fields causing silent deserialization failure
DEFINE FIELD agent_role ON TABLE kg_executions TYPE option<string>;
DEFINE FIELD provider ON TABLE kg_executions TYPE string DEFAULT 'unknown';
DEFINE FIELD cost_cents ON TABLE kg_executions TYPE int DEFAULT 0;

-- BM25 full-text index on task_description
DEFINE ANALYZER kg_text_analyzer
    TOKENIZERS class
    FILTERS lowercase, snowball(english);
DEFINE INDEX idx_kg_executions_ft
    ON TABLE kg_executions FIELDS task_description
    SEARCH ANALYZER kg_text_analyzer BM25;

-- HNSW ANN index on embedding (1536-dim, cosine, float32)
DEFINE INDEX idx_kg_executions_hnsw
    ON TABLE kg_executions FIELDS embedding
    HNSW DIMENSION 1536 DIST COSINE TYPE F32 M 16 EF_CONSTRUCTION 200;
```

HNSW parameters: `M 16` (up to 16 edges per node, a common choice for 1536-dim embeddings); `EF_CONSTRUCTION 200` (trades index build quality against insert speed; 200 is a common default).

### Query Patterns

**HNSW semantic search** (`<|100,64|>` = 100 candidates, ef=64 at query time):

```surql
SELECT *, vector::similarity::cosine(embedding, $q) AS cosine_score
FROM kg_executions
WHERE embedding <|100,64|> $q
ORDER BY cosine_score DESC
LIMIT 20
```

**BM25 lexical search** (`@1@` assigns predicate ID 1; paired with `search::score(1)`):

```surql
SELECT *, search::score(1) AS bm25_score
FROM kg_executions
WHERE task_description @1@ $text
ORDER BY bm25_score DESC
LIMIT 100
```

### RRF Fusion

Cosine similarity is bounded (`[-1.0, 1.0]`); BM25 is unbounded (`[0, ∞)`). Linear blending of the two would require per-corpus normalization. RRF uses only ranks, so it is scale-invariant:

```
hybrid_score(id) = 1 / (60 + rank_semantic) + 1 / (60 + rank_lexical)
```

`k=60` is the standard constant (Cormack, Clarke & Büttcher, 2009). IDs absent from one ranked list receive rank 0, contributing `1/60` rather than 0, which prevents complete suppression of single-method results.

### RLM Executions

`rlm_executions` is `SCHEMALESS` with a nullable `query_embedding` field.
HNSW indexes require a `SCHEMAFULL` table with a non-nullable typed field. `find_similar_rlm_tasks` therefore uses in-memory cosine similarity: it loads candidate records, keeps those with non-empty embeddings, and sorts them by cosine score. This is acceptable because the RLM dataset is bounded per document.

### New Public API

```rust
impl KGPersistence {
    // Was a stub (returned recent records). Now uses an HNSW ANN query.
    pub async fn find_similar_executions(
        &self,
        embedding: &[f32],
        limit: usize,
    ) -> anyhow::Result<Vec<PersistedExecution>>;

    // New. HNSW + BM25 + RRF. Either argument may be empty (degrades gracefully).
    pub async fn hybrid_search(
        &self,
        embedding: &[f32],
        text_query: &str,
        limit: usize,
    ) -> anyhow::Result<Vec<HybridSearchResult>>;
}
```

`HybridSearchResult` exposes `semantic_score`, `lexical_score`, `hybrid_score`, `semantic_rank`, and `lexical_rank`, so callers can inspect individual signal contributions.

---

## Consequences

### Positive

- `find_similar_executions` returns semantically similar past executions, not recent ones. The correctness bug is fixed.
- `hybrid_search` exposes both signals; callers can filter by `semantic_score ≥ 0.7` for high-confidence-only retrieval.
- No new dependencies. The two indexes are defined in a migration; no Rust dependency change.
- The schema bug fix means all existing `kg_executions` records round-trip correctly after migration 012 is applied.

### Negative / Trade-offs

- HNSW index build is `O(n log n)` in SurrealDB; large existing datasets will cause migration 012 to take longer than typical DDL migrations. No data migration is needed, only index creation.
- BM25 requires the `task_description` field to be populated at insert time. Records inserted before this migration with empty or null descriptions will not appear in lexical results.
- `rlm_executions` hybrid search remains in-memory. A future migration converting `rlm_executions` to SCHEMAFULL would enable native HNSW for that table too.

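
The in-memory path that `rlm_executions` keeps can be sketched as follows. This is a rough illustration of the load, filter, and sort steps described under RLM Executions, not the actual `find_similar_rlm_tasks` code: the `Candidate` struct, `cosine`, and `rank_in_memory` are hypothetical names, and loading records from SurrealDB is left out.

```rust
/// Cosine similarity; returns 0.0 for mismatched lengths or zero vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() || a.is_empty() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Hypothetical record shape: an ID plus a nullable embedding.
struct Candidate {
    id: String,
    embedding: Option<Vec<f32>>,
}

/// Score every candidate that has a non-empty embedding, then keep the
/// `limit` best by descending cosine similarity.
fn rank_in_memory(query: &[f32], candidates: &[Candidate], limit: usize) -> Vec<(String, f32)> {
    let mut scored: Vec<(String, f32)> = candidates
        .iter()
        .filter_map(|c| {
            let e = c.embedding.as_deref()?; // skip null embeddings
            if e.is_empty() { None } else { Some((c.id.clone(), cosine(query, e))) }
        })
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(limit);
    scored
}
```

Each query scores every loaded candidate, so the cost grows linearly with the number of records. That is the reason this approach is reserved for the bounded RLM dataset and not used for `kg_executions`.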
### Supersedes

- The stub implementation of `find_similar_executions` (existed since `persistence.rs` was written).
- Extends ADR-0013 (KG temporal design) with the retrieval layer decision.

---

## Related

- [ADR-0013: Knowledge Graph Temporal](./0013-knowledge-graph.md): original KG design
- [ADR-0029: RLM Recursive Language Models](./0029-rlm-recursive-language-models.md): RLM hybrid search (different use case: document chunks, not execution records)
- [ADR-0004: SurrealDB](./0004-surrealdb-database.md): database foundation