# ADR-0036: Knowledge Graph Hybrid Search — HNSW + BM25 + RRF

**Status**: Implemented
**Date**: 2026-02-26
**Deciders**: VAPORA Team
**Technical Story**: `find_similar_executions` was a stub returning recent records; `find_similar_rlm_tasks` ignored embeddings entirely. A missing schema migration caused all `kg_executions` reads to silently fail deserialization.

---

## Decision

Replace the stub similarity functions in `KGPersistence` with a **hybrid retrieval pipeline** combining:

1. **HNSW** (SurrealDB 3 native): approximate nearest-neighbor vector search over the `embedding` field
2. **BM25** (SurrealDB 3 native full-text search): lexical scoring over the `task_description` field
3. **Reciprocal Rank Fusion (RRF, k=60)**: rank-based, scale-invariant score fusion

Add migration `012_kg_hybrid_search.surql` that fixes a pre-existing schema bug (three fields missing from the `SCHEMAFULL` table) and defines the required indexes.

---

## Context

### The Stub Problem

`find_similar_executions` in `persistence.rs` discarded its `embedding: &[f32]` argument entirely and returned the N most-recent successful executions, ordered by timestamp. Any caller relying on semantic proximity was silently receiving chronological results — a correctness bug, not a performance issue.

### The Silent Schema Bug

`kg_executions` was declared `SCHEMAFULL` in migration 005, but three fields used by `PersistedExecution` (`agent_role`, `provider`, `cost_cents`) were absent from the schema. Because SurrealDB drops undefined fields on `INSERT` into a `SCHEMAFULL` table, every stored record lacked those fields. Subsequent `SELECT` queries returned records that failed `serde_json::from_value` deserialization, and `.filter_map(|v| v.ok())` swallowed the resulting errors. The persistence layer appeared to work (no errors) while returning empty results for every query.
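
The failure mode can be reproduced without a database. The following is a minimal sketch, not the actual `persistence.rs` code: `Execution`, `decode`, `decode_lossy`, and `decode_strict` are hypothetical names, and a `HashMap` stands in for the JSON value returned by SurrealDB. It contrasts the lossy `filter_map(.ok())` pattern with a strict variant that surfaces the first decode error.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for `PersistedExecution`; only two fields for brevity.
#[derive(Debug, PartialEq)]
struct Execution {
    task_description: String,
    provider: String,
}

// Hypothetical stand-in for serde deserialization: fails if a required field is missing.
fn decode(row: &HashMap<&str, &str>) -> Result<Execution, String> {
    let field = |name: &str| {
        row.get(name)
            .map(|v| v.to_string())
            .ok_or_else(|| format!("missing field `{name}`"))
    };
    Ok(Execution {
        task_description: field("task_description")?,
        provider: field("provider")?,
    })
}

// Lossy: rows that fail to decode disappear without any error being reported.
fn decode_lossy(rows: &[HashMap<&str, &str>]) -> Vec<Execution> {
    rows.iter().filter_map(|r| decode(r).ok()).collect()
}

// Strict: the first decode failure is returned to the caller.
fn decode_strict(rows: &[HashMap<&str, &str>]) -> Result<Vec<Execution>, String> {
    rows.iter().map(|r| decode(r)).collect()
}

fn main() {
    // The schema dropped `provider` at insert time, so the stored row lacks it.
    let rows = vec![HashMap::from([("task_description", "fix clippy warnings")])];
    println!("lossy:  {:?}", decode_lossy(&rows));
    println!("strict: {:?}", decode_strict(&rows));
}
```

With the lossy variant the caller cannot distinguish "no matching records" from "every record failed to decode", which is why the bug stayed hidden.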

### Why Not `stratum-embeddings` SurrealDbStore

`stratumiops/crates/stratum-embeddings/src/store/surrealdb.rs` implements vector search as a brute-force full-scan: it loads all records into memory and computes cosine similarity in-process. This works for document chunk retrieval (bounded dataset per document), but is unsuitable for the knowledge graph which accumulates unbounded execution records across all agents and tasks over time.

### Why Hybrid Over Pure Semantic

Embedding-only retrieval misses exact keyword matches: an agent searching for "cargo clippy warnings" may not find a record titled "clippy deny warnings fix" if the embedding model maps the two phrases to distant vectors. BM25 scores the exact token overlap ("clippy", "warnings") that embeddings smooth over.

---

## Alternatives Considered

### ❌ Pure HNSW semantic search only

- Misses exact keyword matches (e.g., specific error codes, crate names)
- Embedding quality varies across providers; degrades if provider changes

### ❌ Pure BM25 lexical search only

- Misses paraphrases and semantic variants ("task failed" vs "execution error")
- No relevance for structurally similar tasks with different wording

### ❌ Tantivy / external FTS engine

- Adds a new process dependency for a capability SurrealDB 3 provides natively
- Requires synchronizing two stores; adds operational complexity

### ✅ SurrealDB 3 HNSW + BM25 + RRF (chosen)

Single data store, two native index types, no new dependencies, no sync complexity.

---

## Implementation

### Migration 012

```sql
-- Fix missing fields causing silent deserialization failure
DEFINE FIELD agent_role ON TABLE kg_executions TYPE option<string>;
DEFINE FIELD provider   ON TABLE kg_executions TYPE string DEFAULT 'unknown';
DEFINE FIELD cost_cents ON TABLE kg_executions TYPE int    DEFAULT 0;

-- BM25 full-text index on task_description
DEFINE ANALYZER kg_text_analyzer
    TOKENIZERS class
    FILTERS lowercase, snowball(english);

DEFINE INDEX idx_kg_executions_ft
    ON TABLE kg_executions
    FIELDS task_description
    SEARCH ANALYZER kg_text_analyzer BM25;

-- HNSW ANN index on embedding (1536-dim, cosine, float32)
DEFINE INDEX idx_kg_executions_hnsw
    ON TABLE kg_executions
    FIELDS embedding
    HNSW DIMENSION 1536 DIST COSINE TYPE F32 M 16 EF_CONSTRUCTION 200;
```

HNSW parameters: `M 16` (up to 16 edges per node, a common choice for high-dimensional embeddings); `EF_CONSTRUCTION 200` (trades index build quality against insert speed; 200 is a widely used value).

### Query Patterns

**HNSW semantic search** (`<|100,64|>` = 100 candidates, ef=64 at query time):

```surql
SELECT *, vector::similarity::cosine(embedding, $q) AS cosine_score
FROM kg_executions
WHERE embedding <|100,64|> $q
ORDER BY cosine_score DESC
LIMIT 20
```

**BM25 lexical search** (`@1@` assigns predicate ID 1; paired with `search::score(1)`):

```surql
SELECT *, search::score(1) AS bm25_score
FROM kg_executions
WHERE task_description @1@ $text
ORDER BY bm25_score DESC
LIMIT 100
```

### RRF Fusion

Cosine similarity is bounded in `[-1.0, 1.0]`; BM25 is unbounded in `[0, ∞)`. Linear blending of the two would require per-corpus normalization. RRF uses only each result's rank, never its raw score, so it is scale-invariant:

```
hybrid_score(id) = 1 / (60 + rank_semantic) + 1 / (60 + rank_lexical)
```

`k=60` is the constant proposed in the original RRF paper (Cormack, Clarke & Büttcher, 2009). IDs absent from one ranked list receive rank 0, contributing `1/60` rather than 0, so a result found by only one method is never completely suppressed.
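
The fusion step can be sketched in a few lines of Rust. This is an illustrative sketch, not the code in `persistence.rs`: `rrf_fuse` is a hypothetical helper operating on plain ID strings. It assumes 1-based ranks and follows the conventional RRF formulation, in which an ID missing from a list contributes nothing for that list. That differs from the rank-0 convention described above, under which a missing entry would contribute `1/60`, slightly more than a rank-1 hit (`1/61`).

```rust
use std::collections::HashMap;

/// RRF constant from the text.
const K: f64 = 60.0;

/// Fuse two ranked ID lists (best first) into `(id, hybrid_score)` pairs, best first.
/// Assumption: ranks are 1-based and a missing ID contributes 0 for that list.
fn rrf_fuse(semantic: &[&str], lexical: &[&str]) -> Vec<(String, f64)> {
    let mut scores: HashMap<String, f64> = HashMap::new();
    for list in [semantic, lexical] {
        for (i, id) in list.iter().enumerate() {
            let rank = (i + 1) as f64;
            *scores.entry(id.to_string()).or_insert(0.0) += 1.0 / (K + rank);
        }
    }
    let mut fused: Vec<(String, f64)> = scores.into_iter().collect();
    // Sort by score descending; break ties by ID so the output is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    fused
}

fn main() {
    // Hypothetical record IDs, as they might come back from the two queries.
    let semantic = ["exec:a", "exec:b", "exec:c"];
    let lexical = ["exec:b", "exec:d"];
    for (id, score) in rrf_fuse(&semantic, &lexical) {
        println!("{id}: {score:.5}");
    }
}
```

In this example `exec:b` comes first because it appears in both lists (`1/62 + 1/61`), ahead of `exec:a`, which is the top semantic hit but has no lexical match. Passing an empty slice for one list reduces the function to ranking by the other list alone.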

### RLM Executions

`rlm_executions` is `SCHEMALESS` with a nullable `query_embedding` field. HNSW indexes require a `SCHEMAFULL` table with a non-nullable typed field. `find_similar_rlm_tasks` therefore uses in-memory cosine similarity: it loads candidate records, keeps those with non-empty embeddings, and sorts them by cosine score. This is acceptable because the RLM dataset is bounded per document.
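
A minimal sketch of that in-memory path follows. The names `RlmRecord`, `cosine`, and `rank_by_cosine` are hypothetical and the record shape is reduced to an ID plus an optional embedding; the real `find_similar_rlm_tasks` may differ in how it loads and filters candidates.

```rust
/// Cosine similarity; `None` for empty input, mismatched lengths, or a zero-norm vector.
fn cosine(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.is_empty() || a.len() != b.len() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Hypothetical record shape: an ID plus a nullable embedding.
struct RlmRecord {
    id: String,
    query_embedding: Option<Vec<f32>>,
}

/// Rank records by cosine similarity to `query`, skipping records without a usable embedding.
fn rank_by_cosine<'a>(
    records: &'a [RlmRecord],
    query: &[f32],
    limit: usize,
) -> Vec<(&'a str, f32)> {
    let mut scored: Vec<(&str, f32)> = records
        .iter()
        .filter_map(|r| {
            let emb = r.query_embedding.as_deref()?;
            cosine(emb, query).map(|s| (r.id.as_str(), s))
        })
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1)); // highest similarity first
    scored.truncate(limit);
    scored
}

fn main() {
    let records = vec![
        RlmRecord { id: "rlm:1".into(), query_embedding: Some(vec![1.0, 0.0]) },
        RlmRecord { id: "rlm:2".into(), query_embedding: Some(vec![0.6, 0.8]) },
        RlmRecord { id: "rlm:3".into(), query_embedding: None },
        RlmRecord { id: "rlm:4".into(), query_embedding: Some(vec![]) },
    ];
    for (id, score) in rank_by_cosine(&records, &[1.0, 0.0], 10) {
        println!("{id}: {score:.3}");
    }
}
```

The scan is linear in the number of candidates, which is why it is only appropriate while the candidate set stays bounded.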

### New Public API

```rust
impl KGPersistence {
    // Was stub (returned recent records). Now uses HNSW ANN query.
    pub async fn find_similar_executions(
        &self,
        embedding: &[f32],
        limit: usize,
    ) -> anyhow::Result<Vec<PersistedExecution>>;

    // New. HNSW + BM25 + RRF. Either argument may be empty (degrades gracefully).
    pub async fn hybrid_search(
        &self,
        embedding: &[f32],
        text_query: &str,
        limit: usize,
    ) -> anyhow::Result<Vec<HybridSearchResult>>;
}
```

`HybridSearchResult` exposes `semantic_score`, `lexical_score`, `hybrid_score`, `semantic_rank`, `lexical_rank` — callers can inspect individual signal contributions.
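
As an illustration of how a caller might use those fields, the sketch below defines a stand-in `HybridSearchResult`. The field names come from the list above, but the field types, the `id` field, and the helpers `high_confidence` and `signal_source` are assumptions made for this example. Rank 0 is treated as "absent from that list", matching the convention described in the RRF section.

```rust
/// Hypothetical shape; field names follow the text, types and `id` are assumed.
#[derive(Debug, Clone)]
struct HybridSearchResult {
    id: String,
    semantic_score: f64,
    lexical_score: f64,
    hybrid_score: f64,
    semantic_rank: usize,
    lexical_rank: usize,
}

/// Keep only results whose semantic score is at least `min_semantic`.
fn high_confidence(
    results: Vec<HybridSearchResult>,
    min_semantic: f64,
) -> Vec<HybridSearchResult> {
    results
        .into_iter()
        .filter(|r| r.semantic_score >= min_semantic)
        .collect()
}

/// Report which signal(s) produced a result; rank 0 means "not in that list".
fn signal_source(r: &HybridSearchResult) -> &'static str {
    match (r.semantic_rank > 0, r.lexical_rank > 0) {
        (true, true) => "both",
        (true, false) => "semantic only",
        (false, true) => "lexical only",
        (false, false) => "none",
    }
}

fn main() {
    // Illustrative values only; not measured output.
    let results = vec![
        HybridSearchResult {
            id: "exec:a".into(),
            semantic_score: 0.91,
            lexical_score: 4.2,
            hybrid_score: 0.0325,
            semantic_rank: 1,
            lexical_rank: 2,
        },
        HybridSearchResult {
            id: "exec:b".into(),
            semantic_score: 0.0,
            lexical_score: 7.8,
            hybrid_score: 0.0164,
            semantic_rank: 0,
            lexical_rank: 1,
        },
    ];
    for r in &results {
        println!("{} <- {}", r.id, signal_source(r));
    }
    println!("kept: {:?}", high_confidence(results, 0.7));
}
```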

---

## Consequences

### Positive

- `find_similar_executions` returns semantically similar past executions, not recent ones. The correctness bug is fixed.
- `hybrid_search` exposes both signals; callers can filter by `semantic_score ≥ 0.7` for high-confidence-only retrieval.
- No new dependencies. The two indexes are defined in a migration; no Rust dependency change.
- The schema bug fix means all existing `kg_executions` records round-trip correctly after migration 012 is applied.

### Negative / Trade-offs

- HNSW index build is `O(n log n)` in SurrealDB; large existing datasets will cause migration 012 to take longer than typical DDL migrations. No data migration is needed — only index creation.
- BM25 requires the `task_description` field to be populated at insert time. Records inserted before this migration with empty or null descriptions will not appear in lexical results.
- `rlm_executions` hybrid search remains in-memory. A future migration converting `rlm_executions` to SCHEMAFULL would enable native HNSW for that table too.

### Supersedes

- The stub implementation of `find_similar_executions` (existed since persistence.rs was written).
- Extends ADR-0013 (KG temporal design) with the retrieval layer decision.

---

## Related

- [ADR-0013: Knowledge Graph Temporal](./0013-knowledge-graph.md) — original KG design
- [ADR-0029: RLM Recursive Language Models](./0029-rlm-recursive-language-models.md) — RLM hybrid search (different use case: document chunks, not execution records)
- [ADR-0004: SurrealDB](./0004-surrealdb-database.md) — database foundation