# ADR-0036: Knowledge Graph Hybrid Search — HNSW + BM25 + RRF

**Status**: Implemented
**Date**: 2026-02-26
**Deciders**: VAPORA Team
**Technical Story**: `find_similar_executions` was a stub returning recent records; `find_similar_rlm_tasks` ignored embeddings entirely. A missing schema migration caused all `kg_executions` reads to silently fail deserialization.

---

## Decision

Replace the stub similarity functions in `KGPersistence` with a **hybrid retrieval pipeline** combining:

1. **HNSW** (SurrealDB 3 native): approximate nearest-neighbor vector search over the `embedding` field
2. **BM25** (SurrealDB 3 native full-text search): lexical scoring over the `task_description` field
3. **Reciprocal Rank Fusion (RRF, k=60)**: scale-invariant score fusion

Add migration `012_kg_hybrid_search.surql`, which fixes a pre-existing schema bug (three fields missing from the `SCHEMAFULL` table) and defines the required indexes.

---

## Context

### The Stub Problem

`find_similar_executions` in `persistence.rs` discarded its `embedding: &[f32]` argument entirely and returned the N most recent successful executions, ordered by timestamp. Any caller relying on semantic proximity was silently receiving chronological results: a correctness bug, not a performance issue.

### The Silent Schema Bug

`kg_executions` was declared `SCHEMAFULL` in migration 005, but three fields used by `PersistedExecution` (`agent_role`, `provider`, `cost_cents`) were absent from the schema. SurrealDB drops undefined fields on `INSERT` in `SCHEMAFULL` tables, so stored records lacked those fields. Every subsequent `SELECT` returned records that failed `serde_json::from_value` deserialization, and `.filter_map(|v| v.ok())` swallowed the errors. The persistence layer appeared to work (no errors) while returning empty results for every query.
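
The failure mode above can be reproduced without a database. The following is a minimal sketch, not the project's code: `Row`, `parse_row`, `load_lossy`, and `load_strict` are hypothetical names, and a comma-separated string stands in for the JSON value that `serde_json::from_value` would receive. It contrasts the error-swallowing `filter_map` with a strict `collect` into `Result`.

```rust
/// Hypothetical stand-in for a `PersistedExecution` row; the real struct has more fields.
#[derive(Debug, PartialEq)]
struct Row {
    provider: String,
    cost_cents: i64,
}

/// Stand-in for deserialization: parses "provider,cost_cents" and fails
/// when a field is missing, the way a strict deserializer would.
fn parse_row(raw: &str) -> Result<Row, String> {
    let mut parts = raw.split(',');
    let provider = parts
        .next()
        .filter(|p| !p.is_empty())
        .ok_or_else(|| format!("missing provider in {raw:?}"))?;
    let cost = parts
        .next()
        .ok_or_else(|| format!("missing cost_cents in {raw:?}"))?
        .trim()
        .parse::<i64>()
        .map_err(|e| format!("bad cost_cents in {raw:?}: {e}"))?;
    Ok(Row { provider: provider.to_string(), cost_cents: cost })
}

/// The pattern described above: every deserialization error is discarded.
fn load_lossy(raws: &[&str]) -> Vec<Row> {
    raws.iter().map(|r| parse_row(r)).filter_map(|v| v.ok()).collect()
}

/// Strict variant: the first error aborts the load and reaches the caller.
fn load_strict(raws: &[&str]) -> Result<Vec<Row>, String> {
    raws.iter().map(|r| parse_row(r)).collect()
}
```

With input where every row lacks a field, `load_lossy` returns an empty vector and no error, which matches the "appeared to work" symptom. `load_strict` returns the first error instead; logging the discarded errors would be another way to surface the problem.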

### Why Not `stratum-embeddings` SurrealDbStore

`stratumiops/crates/stratum-embeddings/src/store/surrealdb.rs` implements vector search as a brute-force full scan: it loads all records into memory and computes cosine similarity in-process. This works for document chunk retrieval (bounded dataset per document) but is unsuitable for the knowledge graph, which accumulates unbounded execution records across all agents and tasks over time.

### Why Hybrid Over Pure Semantic

Embedding-only retrieval can miss exact keyword matches: an agent searching for "cargo clippy warnings" may not find a record titled "clippy deny warnings fix" if the embedding model compresses the phrase differently than the query. BM25 handles exact token overlap that embeddings smooth over.

---

## Alternatives Considered

### ❌ Pure HNSW semantic search only

- Misses exact keyword matches (e.g., specific error codes, crate names)
- Embedding quality varies across providers; degrades if provider changes

### ❌ Pure BM25 lexical search only

- Misses paraphrases and semantic variants ("task failed" vs "execution error")
- No relevance for structurally similar tasks with different wording

### ❌ Tantivy / external FTS engine

- Adds a new process dependency for a capability SurrealDB 3 provides natively
- Requires synchronizing two stores; adds operational complexity

### ✅ SurrealDB 3 HNSW + BM25 + RRF (chosen)

Single data store, two native index types, no new dependencies, no sync complexity.

---

## Implementation

### Migration 012

```sql
-- Fix missing fields causing silent deserialization failure
DEFINE FIELD agent_role ON TABLE kg_executions TYPE option<string>;
DEFINE FIELD provider ON TABLE kg_executions TYPE string DEFAULT 'unknown';
DEFINE FIELD cost_cents ON TABLE kg_executions TYPE int DEFAULT 0;

-- BM25 full-text index on task_description
DEFINE ANALYZER kg_text_analyzer
    TOKENIZERS class
    FILTERS lowercase, snowball(english);

DEFINE INDEX idx_kg_executions_ft
    ON TABLE kg_executions
    FIELDS task_description
    SEARCH ANALYZER kg_text_analyzer BM25;

-- HNSW ANN index on embedding (1536-dim, cosine, float32)
DEFINE INDEX idx_kg_executions_hnsw
    ON TABLE kg_executions
    FIELDS embedding
    HNSW DIMENSION 1536 DIST COSINE TYPE F32 M 16 EF_CONSTRUCTION 200;
```

HNSW parameters: `M 16` caps each node at 16 links per layer, a common general-purpose setting; `EF_CONSTRUCTION 200` sets the candidate-list size during index build, trading insert speed for graph quality (200 is a widely used value).

### Query Patterns

**HNSW semantic search** (`<|100,64|>` requests the 100 nearest neighbors with `ef = 64` at query time):

```surql
SELECT *, vector::similarity::cosine(embedding, $q) AS cosine_score
FROM kg_executions
WHERE embedding <|100,64|> $q
ORDER BY cosine_score DESC
LIMIT 20
```

**BM25 lexical search** (`@1@` tags the match predicate with reference 1, which `search::score(1)` uses to look up the BM25 score):

```surql
SELECT *, search::score(1) AS bm25_score
FROM kg_executions
WHERE task_description @1@ $text
ORDER BY bm25_score DESC
LIMIT 100
```

### RRF Fusion

Cosine similarity is bounded to `[-1.0, 1.0]`; BM25 is unbounded `[0, ∞)`. Linear blending would require per-corpus normalization. RRF uses only ranks, so it is scale-invariant:

```
hybrid_score(id) = 1 / (60 + rank_semantic) + 1 / (60 + rank_lexical)
```

`k=60` is the standard constant from the original RRF paper (Cormack, Clarke & Büttcher, 2009). IDs absent from one ranked list receive rank 0 and contribute `1/60` rather than 0, which prevents complete suppression of single-method results.

### RLM Executions

`rlm_executions` is `SCHEMALESS` with a nullable `query_embedding` field. HNSW indexes require a `SCHEMAFULL` table with a non-nullable typed field, so `find_similar_rlm_tasks` uses in-memory cosine similarity instead: it loads candidate records, keeps those with non-empty embeddings, and sorts by cosine score. This is acceptable because the RLM dataset is bounded per document.
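
The in-memory path described above (load, filter, score, sort) might look roughly like the following. It is a sketch only: `Candidate`, `cosine`, and `top_similar` are hypothetical names, and the real `find_similar_rlm_tasks` works on database records rather than this simplified struct.

```rust
/// Cosine similarity of two vectors; returns None for empty, mismatched,
/// or zero-norm inputs so callers can skip them.
fn cosine(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.is_empty() || a.len() != b.len() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        return None;
    }
    Some(dot / (na * nb))
}

/// Hypothetical candidate record: an ID plus an optional embedding.
struct Candidate {
    id: String,
    query_embedding: Option<Vec<f32>>,
}

/// Scores candidates against `query`, drops those without a usable
/// embedding, and returns the top `limit` (id, score) pairs, best first.
fn top_similar(query: &[f32], candidates: &[Candidate], limit: usize) -> Vec<(String, f32)> {
    let mut scored: Vec<(String, f32)> = candidates
        .iter()
        .filter_map(|c| {
            let emb = c.query_embedding.as_deref()?;
            cosine(query, emb).map(|s| (c.id.clone(), s))
        })
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(limit);
    scored
}
```

The cost is linear in the number of candidates per query, which is why this approach fits a dataset bounded per document but not the unbounded `kg_executions` table.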

### New Public API

```rust
// Signatures only; method bodies are omitted.
impl KGPersistence {
    // Was a stub (returned recent records). Now uses an HNSW ANN query.
    pub async fn find_similar_executions(
        &self,
        embedding: &[f32],
        limit: usize,
    ) -> anyhow::Result<Vec<PersistedExecution>>;

    // New. HNSW + BM25 + RRF. Either argument may be empty (degrades gracefully).
    pub async fn hybrid_search(
        &self,
        embedding: &[f32],
        text_query: &str,
        limit: usize,
    ) -> anyhow::Result<Vec<HybridSearchResult>>;
}
```

`HybridSearchResult` exposes `semantic_score`, `lexical_score`, `hybrid_score`, `semantic_rank`, and `lexical_rank`, so callers can inspect individual signal contributions.

---

## Consequences

### Positive

- `find_similar_executions` returns semantically similar past executions, not recent ones. The correctness bug is fixed.
- `hybrid_search` exposes both signals; callers can filter by `semantic_score ≥ 0.7` for high-confidence-only retrieval.
- No new dependencies. The two indexes are defined in a migration; no Rust dependency change.
- The schema bug fix means all existing `kg_executions` records round-trip correctly after migration 012 is applied.
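
The threshold filter mentioned in the list above could be applied on the caller side along these lines. This is a sketch with assumed types: the struct mirrors the five fields this ADR names for `HybridSearchResult`, but the `id` field, the numeric types, and the `high_confidence` helper are guesses for illustration and may not match the real definition.

```rust
/// Assumed shape of `HybridSearchResult`: the five fields named in this ADR.
/// The `id` field and all types here are guesses for illustration.
#[derive(Debug, Clone)]
struct HybridSearchResult {
    id: String,
    semantic_score: f64,
    lexical_score: f64,
    hybrid_score: f64,
    semantic_rank: usize,
    lexical_rank: usize,
}

/// Keeps only results whose semantic score meets `min_semantic`,
/// preserving the incoming (hybrid-score) order.
fn high_confidence(
    results: Vec<HybridSearchResult>,
    min_semantic: f64,
) -> Vec<HybridSearchResult> {
    results
        .into_iter()
        .filter(|r| r.semantic_score >= min_semantic)
        .collect()
}
```

Filtering after fusion keeps the hybrid ordering intact while discarding hits that ranked well on lexical overlap alone.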

### Negative / Trade-offs

- HNSW index build is `O(n log n)` in SurrealDB; on large existing datasets, migration 012 will take longer than typical DDL migrations. No data migration is needed, only index creation.
- BM25 requires the `task_description` field to be populated at insert time. Records inserted before this migration with empty or null descriptions will not appear in lexical results.
- `rlm_executions` hybrid search remains in-memory. A future migration converting `rlm_executions` to `SCHEMAFULL` would enable native HNSW for that table too.

### Supersedes

- The stub implementation of `find_similar_executions` (in place since `persistence.rs` was written).
- Extends ADR-0013 (KG temporal design) with the retrieval layer decision.

---

## Related

- [ADR-0013: Knowledge Graph Temporal](./0013-knowledge-graph.md): original KG design
- [ADR-0029: RLM Recursive Language Models](./0029-rlm-recursive-language-models.md): RLM hybrid search (different use case: document chunks, not execution records)
- [ADR-0004: SurrealDB](./0004-surrealdb-database.md): database foundation