Vapora/docs/adrs/0036-kg-hybrid-search.md

# ADR-0036: Knowledge Graph Hybrid Search — HNSW + BM25 + RRF

**Status**: Implemented
**Date**: 2026-02-26
**Deciders**: VAPORA Team
**Technical Story**: `find_similar_executions` was a stub returning recent records; `find_similar_rlm_tasks` ignored embeddings entirely. A missing schema migration caused all `kg_executions` reads to silently fail deserialization.

---

## Decision

Replace the stub similarity functions in `KGPersistence` with a **hybrid retrieval pipeline** combining:

1. **HNSW** (SurrealDB 3 native) — approximate nearest-neighbor vector search over `embedding` field
2. **BM25** (SurrealDB 3 native full-text search) — lexical scoring over `task_description` field
3. **Reciprocal Rank Fusion (RRF, k=60)** — scale-invariant score fusion

Add migration `012_kg_hybrid_search.surql` that fixes a pre-existing schema bug (three fields missing from the `SCHEMAFULL` table) and defines the required indexes.

---

## Context

### The Stub Problem

`find_similar_executions` in `persistence.rs` discarded its `embedding: &[f32]` argument entirely and returned the N most-recent successful executions, ordered by timestamp. Any caller relying on semantic proximity was silently receiving chronological results — a correctness bug, not a performance issue.

### The Silent Schema Bug

`kg_executions` was declared `SCHEMAFULL` in migration 005, but three fields used by `PersistedExecution` (`agent_role`, `provider`, `cost_cents`) were absent from the schema. SurrealDB drops undefined fields on `INSERT` in SCHEMAFULL tables, so stored records lacked them. Every subsequent `SELECT` returned records that failed `serde_json::from_value` deserialization, and the errors were swallowed by `.filter_map(|v| v.ok())`. The persistence layer appeared to work (no errors) while returning empty results for every query.
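The failure mode is easy to reproduce in isolation. The sketch below is a std-only stand-in: it uses `str::parse` instead of `serde_json::from_value`, and the function names `parse_lossy` and `parse_reporting` are hypothetical, not taken from `persistence.rs`. It contrasts the error-swallowing pattern with one that keeps the failures visible.

```rust
/// Parse every input, silently discarding failures.
/// This mirrors the `.filter_map(|v| v.ok())` pattern described above.
pub fn parse_lossy(raw: &[&str]) -> Vec<i64> {
    raw.iter()
        .map(|s| s.parse::<i64>())
        .filter_map(|r| r.ok())
        .collect()
}

/// Parse every input, but keep the failures so the caller can log or return them.
pub fn parse_reporting(raw: &[&str]) -> (Vec<i64>, Vec<String>) {
    let mut ok = Vec::new();
    let mut errors = Vec::new();
    for s in raw {
        match s.parse::<i64>() {
            Ok(v) => ok.push(v),
            // Record which input failed and why, instead of dropping it.
            Err(e) => errors.push(format!("{:?}: {}", s, e)),
        }
    }
    (ok, errors)
}
```

When every input is malformed, `parse_lossy` returns an empty vector with no indication that anything went wrong, which is the same symptom the persistence layer showed.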

### Why Not `stratum-embeddings` SurrealDbStore

`stratumiops/crates/stratum-embeddings/src/store/surrealdb.rs` implements vector search as a brute-force full scan: it loads all records into memory and computes cosine similarity in-process. This works for document chunk retrieval (bounded dataset per document), but is unsuitable for the knowledge graph, which accumulates unbounded execution records across all agents and tasks over time.

### Why Hybrid Over Pure Semantic

Embedding-only retrieval misses exact keyword matches: an agent searching for "cargo clippy warnings" may not find a record titled "clippy deny warnings fix" if the embedding model compresses the phrase differently than the query. BM25 handles exact token overlap that embeddings smooth over.

---

## Alternatives Considered

### ❌ Pure HNSW semantic search only

- Misses exact keyword matches (e.g., specific error codes, crate names)
- Embedding quality varies across providers; degrades if provider changes

### ❌ Pure BM25 lexical search only

- Misses paraphrases and semantic variants ("task failed" vs "execution error")
- No relevance for structurally similar tasks with different wording

### ❌ Tantivy / external FTS engine

- Adds a new process dependency for a capability SurrealDB 3 provides natively
- Requires synchronizing two stores; adds operational complexity

### ✅ SurrealDB 3 HNSW + BM25 + RRF (chosen)

Single data store, two native index types, no new dependencies, no sync complexity.

---

## Implementation

### Migration 012

```surql
-- Fix missing fields causing silent deserialization failure
DEFINE FIELD agent_role ON TABLE kg_executions TYPE option<string>;
DEFINE FIELD provider   ON TABLE kg_executions TYPE string DEFAULT 'unknown';
DEFINE FIELD cost_cents ON TABLE kg_executions TYPE int    DEFAULT 0;

-- BM25 full-text index on task_description
DEFINE ANALYZER kg_text_analyzer
    TOKENIZERS class
    FILTERS lowercase, snowball(english);

DEFINE INDEX idx_kg_executions_ft
    ON TABLE kg_executions
    FIELDS task_description
    SEARCH ANALYZER kg_text_analyzer BM25;

-- HNSW ANN index on embedding (1536-dim, cosine, float32)
DEFINE INDEX idx_kg_executions_hnsw
    ON TABLE kg_executions
    FIELDS embedding
    HNSW DIMENSION 1536 DIST COSINE TYPE F32 M 16 EF_CONSTRUCTION 200;
```

HNSW parameters: `M 16` (up to 16 edges per node, a common choice for high-dimensional embeddings such as 1536-dim); `EF_CONSTRUCTION 200` (trades index build quality against insert speed; 200 is a widely used value).

### Query Patterns

**HNSW semantic search** (`<|100,64|>` = up to 100 nearest-neighbor candidates, with ef=64 at query time):

```surql
SELECT *, vector::similarity::cosine(embedding, $q) AS cosine_score
FROM kg_executions
WHERE embedding <|100,64|> $q
ORDER BY cosine_score DESC
LIMIT 20
```

**BM25 lexical search** (`@1@` assigns predicate ID 1; paired with `search::score(1)`):

```surql
SELECT *, search::score(1) AS bm25_score
FROM kg_executions
WHERE task_description @1@ $text
ORDER BY bm25_score DESC
LIMIT 100
```

### RRF Fusion

Cosine similarity is bounded (`[-1.0, 1.0]`), while BM25 scores are unbounded (`[0, ∞)`), so blending them linearly would require per-corpus normalization. RRF depends only on ranks, so it is scale-invariant:

```
hybrid_score(id) = 1 / (60 + rank_semantic) + 1 / (60 + rank_lexical)
```

`k=60` is the standard constant (Cormack, Clarke & Büttcher, 2009). IDs absent from one ranked list receive rank 0, contributing `1/60` rather than 0, which prevents complete suppression of single-method results.
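The fusion step can be sketched as follows. This is a minimal illustration, not the code in `persistence.rs`: the `Fused` struct and `rrf_fuse` function are hypothetical names, and the sketch assumes ranks are 1-based with 0 as the "absent" sentinel, following the convention described above.

```rust
use std::collections::HashMap;

/// RRF constant from this ADR.
const RRF_K: f64 = 60.0;

/// Fused entry for one record ID (hypothetical type, not `HybridSearchResult`).
#[derive(Debug, Clone, PartialEq)]
pub struct Fused {
    pub id: String,
    pub semantic_rank: usize, // 1-based; 0 means absent from the semantic list
    pub lexical_rank: usize,  // 1-based; 0 means absent from the lexical list
    pub hybrid_score: f64,
}

/// Fuse two ranked ID lists (best first) with Reciprocal Rank Fusion.
pub fn rrf_fuse(semantic: &[&str], lexical: &[&str]) -> Vec<Fused> {
    // id -> (semantic_rank, lexical_rank), with 0 as the "absent" sentinel.
    let mut ranks: HashMap<&str, (usize, usize)> = HashMap::new();
    for (i, id) in semantic.iter().enumerate() {
        let entry = ranks.entry(*id).or_insert((0, 0));
        if entry.0 == 0 {
            entry.0 = i + 1; // keep the best rank if an ID repeats
        }
    }
    for (i, id) in lexical.iter().enumerate() {
        let entry = ranks.entry(*id).or_insert((0, 0));
        if entry.1 == 0 {
            entry.1 = i + 1;
        }
    }

    let mut fused: Vec<Fused> = ranks
        .into_iter()
        .map(|(id, (s, l))| Fused {
            id: id.to_string(),
            semantic_rank: s,
            lexical_rank: l,
            // An absent list has rank 0 and therefore contributes 1/60.
            hybrid_score: 1.0 / (RRF_K + s as f64) + 1.0 / (RRF_K + l as f64),
        })
        .collect();

    // Highest score first; ties broken by ID so the output is deterministic.
    fused.sort_by(|a, b| {
        b.hybrid_score
            .total_cmp(&a.hybrid_score)
            .then_with(|| a.id.cmp(&b.id))
    });
    fused
}
```

Under this convention, and assuming 1-based ranks, the absent-list term `1/60` is slightly larger than the `1/61` that a first-place hit earns, which is worth keeping in mind when comparing records found by one method against records found by both.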

### RLM Executions

`rlm_executions` is `SCHEMALESS` with a nullable `query_embedding` field. HNSW indexes require a `SCHEMAFULL` table with a non-nullable typed field, so `find_similar_rlm_tasks` uses in-memory cosine similarity instead: it loads candidate records, keeps those with non-empty embeddings, and sorts them by cosine score. This is acceptable because the RLM dataset is bounded per document.
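The in-memory path can be sketched roughly as below. The `RlmRecord` struct and the `cosine` and `rank_by_cosine` helpers are hypothetical stand-ins for illustration; the real record type and function bodies are not shown in this ADR.

```rust
/// Cosine similarity of two equal-length vectors.
/// Returns `None` if the lengths differ, either vector is empty, or either has zero norm.
pub fn cosine(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.is_empty() || a.len() != b.len() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Hypothetical stand-in for a row of `rlm_executions`.
pub struct RlmRecord {
    pub id: String,
    pub query_embedding: Option<Vec<f32>>, // nullable, as in the schemaless table
}

/// Score every record that has a usable embedding and return the top `limit`.
pub fn rank_by_cosine<'a>(
    query: &[f32],
    records: &'a [RlmRecord],
    limit: usize,
) -> Vec<(&'a str, f32)> {
    let mut scored: Vec<(&str, f32)> = records
        .iter()
        .filter_map(|r| {
            // Skip records with a null embedding; `cosine` rejects empty ones.
            let emb = r.query_embedding.as_deref()?;
            cosine(query, emb).map(|score| (r.id.as_str(), score))
        })
        .collect();
    // Highest similarity first.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(limit);
    scored
}
```

The cost is linear in the number of candidate records, which is why this approach is reserved for the bounded RLM dataset rather than `kg_executions`.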

### New Public API

```rust
impl KGPersistence {
    // Was stub (returned recent records). Now uses HNSW ANN query.
    pub async fn find_similar_executions(
        &self,
        embedding: &[f32],
        limit: usize,
    ) -> anyhow::Result<Vec<PersistedExecution>>;

    // New. HNSW + BM25 + RRF. Either argument may be empty (degrades gracefully).
    pub async fn hybrid_search(
        &self,
        embedding: &[f32],
        text_query: &str,
        limit: usize,
    ) -> anyhow::Result<Vec<HybridSearchResult>>;
}
```

`HybridSearchResult` exposes `semantic_score`, `lexical_score`, `hybrid_score`, `semantic_rank`, `lexical_rank` — callers can inspect individual signal contributions.
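As an illustration of inspecting the individual signals, the sketch below filters results by semantic score. `HybridHit` is a hypothetical mirror of `HybridSearchResult`: the field names come from the list above, but the `id` field and all field types are assumptions, and `high_confidence` is not part of the public API.

```rust
/// Hypothetical mirror of `HybridSearchResult`; field types are assumed.
#[derive(Debug, Clone)]
pub struct HybridHit {
    pub id: String, // assumed identifier field
    pub semantic_score: f64,
    pub lexical_score: f64,
    pub hybrid_score: f64,
    pub semantic_rank: usize,
    pub lexical_rank: usize,
}

/// Keep only hits whose semantic score meets the threshold,
/// preserving the order produced by the hybrid ranking.
pub fn high_confidence(hits: Vec<HybridHit>, min_semantic: f64) -> Vec<HybridHit> {
    hits.into_iter()
        .filter(|h| h.semantic_score >= min_semantic)
        .collect()
}
```

A caller could pass `0.7` as `min_semantic` to keep only high-confidence semantic matches while still ordering them by `hybrid_score`.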

---

## Consequences

### Positive

- `find_similar_executions` returns semantically similar past executions, not recent ones. The correctness bug is fixed.
- `hybrid_search` exposes both signals; callers can filter by `semantic_score ≥ 0.7` for high-confidence-only retrieval.
- No new dependencies. The two indexes are defined in a migration; no Rust dependency change.
- The schema bug fix means all existing `kg_executions` records round-trip correctly after migration 012 is applied.

### Negative / Trade-offs

- HNSW index build is `O(n log n)` in SurrealDB; large existing datasets will cause migration 012 to take longer than typical DDL migrations. No data migration is needed — only index creation.
- BM25 requires the `task_description` field to be populated at insert time. Records inserted before this migration with empty or null descriptions will not appear in lexical results.
- `rlm_executions` hybrid search remains in-memory. A future migration converting `rlm_executions` to SCHEMAFULL would enable native HNSW for that table too.

### Supersedes

- The stub implementation of `find_similar_executions` (existed since persistence.rs was written).
- Extends ADR-0013 (KG temporal design) with the retrieval layer decision.

---

## Related

- [ADR-0013: Knowledge Graph Temporal](./0013-knowledge-graph.md) — original KG design
- [ADR-0029: RLM Recursive Language Models](./0029-rlm-recursive-language-models.md) — RLM hybrid search (different use case: document chunks, not execution records)
- [ADR-0004: SurrealDB](./0004-surrealdb-database.md) — database foundation