# ADR-001: Stratum-Embeddings - Unified Embedding Library

## Status

**Proposed**

## Context

### Current State: Fragmented Implementations

The ecosystem has three independent embedding implementations:

| Project      | Location                              | Providers                     | Caching |
| ------------ | ------------------------------------- | ----------------------------- | ------- |
| Kogral       | `kogral-core/src/embeddings/`         | fastembed, rig-core (partial) | No      |
| Provisioning | `provisioning-rag/src/embeddings.rs`  | OpenAI direct                 | No      |
| Vapora       | `vapora-llm-router/src/embeddings.rs` | OpenAI, HuggingFace, Ollama   | No      |

### Identified Problems

#### 1. Duplicated Code

Each project reimplements:

- HTTP client for OpenAI embeddings
- JSON response parsing
- Error handling
- Token estimation

**Impact**: ~400 duplicated lines, inconsistent error handling.

#### 2. No Caching

Embeddings are regenerated every time:

```text
"What is Rust?" → OpenAI → 1536 dims → $0.00002
"What is Rust?" → OpenAI → 1536 dims → $0.00002 (same result)
"What is Rust?" → OpenAI → 1536 dims → $0.00002 (same result)
```

**Impact**: Unnecessary costs, additional latency, more frequent rate limits.

#### 3. No Fallback

If OpenAI fails, everything fails. There is no fallback to local alternatives (fastembed, Ollama).

**Impact**: Reduced availability, total dependency on one provider.

#### 4. Silent Dimension Mismatch

Different providers produce different dimensions:

| Provider  | Model                  | Dimensions |
| --------- | ---------------------- | ---------- |
| fastembed | bge-small-en           | 384        |
| fastembed | bge-large-en           | 1024       |
| OpenAI    | text-embedding-3-small | 1536       |
| OpenAI    | text-embedding-3-large | 3072       |
| Ollama    | nomic-embed-text       | 768        |

**Impact**: Corrupt vector indices if the provider changes.

#### 5. No Metrics

No visibility into usage, cache hit rate, latency per provider, or accumulated costs.

## Decision

Create `stratum-embeddings` as a unified crate that:

1. **Unifies** implementations from Kogral, Provisioning, and Vapora
2. **Adds caching** to avoid recomputing identical embeddings
3. **Implements fallback** between providers (cloud → local)
4. **Clearly documents** dimensions and limitations per provider
5. **Exposes metrics** for observability
6. **Provides a VectorStore trait** with LanceDB and SurrealDB backends based on project needs

### Storage Backend Decision

Each project chooses its vector storage backend based on its priority:

| Project      | Backend   | Priority       | Justification                                      |
| ------------ | --------- | -------------- | -------------------------------------------------- |
| Kogral       | SurrealDB | Graph richness | Knowledge Graph needs unified graph+vector queries |
| Provisioning | LanceDB   | Vector scale   | RAG with millions of document chunks               |
| Vapora       | LanceDB   | Vector scale   | Execution traces, pattern matching at scale        |

#### Why SurrealDB for Kogral

Kogral is a Knowledge Graph where relationships are the primary value. With a hybrid architecture (LanceDB vectors + SurrealDB graph), a typical query would require three steps:

1. LanceDB: vector search → candidate_ids
2. SurrealDB: graph filter on candidates → results
3. App layer: merge, re-rank, deduplication

With SurrealDB as the single store, the same query runs as one unified graph+vector query, with no merge step in the application layer.

**Accepted trade-off**: SurrealDB has worse pure vector performance than LanceDB, but Kogral's scale is limited by human curation of knowledge (typically 10K-100K concepts).

#### Why LanceDB for Provisioning and Vapora

| Aspect          | SurrealDB  | LanceDB              |
| --------------- | ---------- | -------------------- |
| Storage format  | Row-based  | Columnar (Lance)     |
| Vector index    | HNSW (RAM) | IVF-PQ (disk-native) |
| Practical scale | Millions   | Billions             |
| Compression     | ~1x        | ~32x (PQ)            |
| Zero-copy read  | No         | Yes                  |

### Architecture

```text
┌──────────────────────────────────────────────────────────────┐
│                      stratum-embeddings                      │
├──────────────────────────────────────────────────────────────┤
│ EmbeddingProvider trait                                      │
│ ├─ embed(text) → Vec<f32>                                    │
│ ├─ embed_batch(texts) → Vec<Vec<f32>>                        │
│ ├─ dimensions() → usize                                      │
│ └─ is_local() → bool                                         │
│                                                              │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐                    │
│ │ FastEmbed │ │  OpenAI   │ │  Ollama   │                    │
│ │  (local)  │ │  (cloud)  │ │  (local)  │                    │
│ └───────────┘ └───────────┘ └───────────┘                    │
│        └────────────┬────────────┘                           │
│                     ▼                                        │
│       EmbeddingCache (memory/disk)                           │
│                     │                                        │
│                     ▼                                        │
│              EmbeddingService                                │
│                     │                                        │
│                     ▼                                        │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ VectorStore trait                                        │ │
│ │ ├─ upsert(id, embedding, metadata)                       │ │
│ │ ├─ search(embedding, limit, filter) → Vec<SearchResult>  │ │
│ │ └─ delete(id)                                            │ │
│ └──────────────────────────────────────────────────────────┘ │
│           │                            │                     │
│           ▼                            ▼                     │
│  ┌─────────────────┐        ┌─────────────────┐              │
│  │ SurrealDbStore  │        │  LanceDbStore   │              │
│  │    (Kogral)     │        │  (Prov/Vapora)  │              │
│  └─────────────────┘        └─────────────────┘              │
└──────────────────────────────────────────────────────────────┘
```
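The trait signatures in the diagram are indicative; the exact API will be settled during implementation. A minimal Rust sketch of how the two traits could look, assuming `async-trait` and `thiserror`, with `EmbeddingError` and `SearchResult` as illustrative names rather than final API:

```rust
use async_trait::async_trait;
use serde_json::Value;
use thiserror::Error;

/// Placeholder error type; the real crate would define richer variants.
#[derive(Debug, Error)]
pub enum EmbeddingError {
    #[error("provider failure: {0}")]
    Provider(String),
    #[error("store failure: {0}")]
    Store(String),
}

/// One row returned by a vector search (illustrative shape).
#[derive(Debug)]
pub struct SearchResult {
    pub id: String,
    pub score: f32,
    pub metadata: Value,
}

/// Implemented by the FastEmbed, OpenAI, and Ollama backends.
#[async_trait]
pub trait EmbeddingProvider: Send + Sync {
    async fn embed(&self, text: &str) -> Result<Vec<f32>, EmbeddingError>;
    async fn embed_batch(&self, texts: &[String]) -> Result<Vec<Vec<f32>>, EmbeddingError>;
    /// Output dimensionality; see the Dimension Compatibility Matrix below.
    fn dimensions(&self) -> usize;
    /// True for fastembed/Ollama; used to order fallback chains (cloud → local).
    fn is_local(&self) -> bool;
}

/// Implemented by SurrealDbStore (Kogral) and LanceDbStore (Provisioning/Vapora).
#[async_trait]
pub trait VectorStore: Send + Sync {
    async fn upsert(&self, id: &str, embedding: &[f32], metadata: Value) -> Result<(), EmbeddingError>;
    async fn search(&self, embedding: &[f32], limit: usize, filter: Option<Value>) -> Result<Vec<SearchResult>, EmbeddingError>;
    async fn delete(&self, id: &str) -> Result<(), EmbeddingError>;
}
```

`is_local()` lets the service order fallback chains cloud → local, while `dimensions()` feeds the compatibility checks discussed below.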
## Rationale

### Why Caching is Critical

For a typical RAG system (10,000 chunks):

- **Without cache**: Re-indexing and repeated queries multiply costs
- **With cache**: The first indexing pays full price; the rest are cache hits

**Estimated savings**: 60-80% of embedding costs. For example, at ~$0.00002 per embedding, indexing 10K chunks costs ~$0.20; with a ~70% hit rate on subsequent runs, the recurring cost drops to ~$0.06.

### Why Fallback is Important

| Scenario          | Without Fallback | With Fallback       |
| ----------------- | ---------------- | ------------------- |
| OpenAI rate limit | ERROR            | → fastembed (local) |
| OpenAI downtime   | ERROR            | → Ollama (local)    |
| No internet       | ERROR            | → fastembed (local) |
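Putting the two rationales together, a hedged sketch of the `EmbeddingService` hot path, building on the trait sketch above: check the cache first, then walk the provider chain in priority order. The cache key scheme and the `MemoryCache` type are assumptions, not final API:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};
use std::sync::Mutex;

/// In-memory cache keyed by a hash of (model id, text). Illustrative only.
#[derive(Default)]
pub struct MemoryCache {
    entries: Mutex<HashMap<u64, Vec<f32>>>,
}

impl MemoryCache {
    fn get(&self, key: u64) -> Option<Vec<f32>> {
        self.entries.lock().unwrap().get(&key).cloned()
    }
    fn put(&self, key: u64, vector: Vec<f32>) {
        self.entries.lock().unwrap().insert(key, vector);
    }
}

pub struct EmbeddingService {
    model_id: String,
    /// Primary provider first, fallbacks after (cloud → local).
    providers: Vec<Box<dyn EmbeddingProvider>>,
    cache: MemoryCache,
}

impl EmbeddingService {
    pub async fn embed(&self, text: &str) -> Result<Vec<f32>, EmbeddingError> {
        // Cache lookup: identical (model, text) pairs never hit the API twice.
        let key = cache_key(&self.model_id, text);
        if let Some(hit) = self.cache.get(key) {
            return Ok(hit);
        }
        // Fallback chain: on rate limit, downtime, or no network,
        // move on to the next provider.
        let mut last_err = EmbeddingError::Provider("no providers configured".into());
        for provider in &self.providers {
            match provider.embed(text).await {
                Ok(vector) => {
                    self.cache.put(key, vector.clone());
                    return Ok(vector);
                }
                Err(e) => last_err = e,
            }
        }
        Err(last_err)
    }
}

/// Key on (model id, text) so the same text embedded by different models
/// never collides in the cache.
fn cache_key(model_id: &str, text: &str) -> u64 {
    let mut h = DefaultHasher::new();
    model_id.hash(&mut h);
    text.hash(&mut h);
    h.finish()
}
```

A production cache would add the configurable TTL and bypass option listed under Mitigations below, plus an optional disk tier.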
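Fallback also interacts with Problem 4 (silent dimension mismatch): a fallback provider emitting a different dimensionality than the one already in the index must not write into it silently. A sketch of the "warn on provider change" mitigation listed below, as a check the service could run before indexing; the function name and shape are illustrative:

```rust
/// Refuse to mix dimensionalities silently: compare the active provider's
/// output size with what the vector index already holds.
fn guard_dimensions(provider_dims: usize, index_dims: Option<usize>) -> Result<(), EmbeddingError> {
    match index_dims {
        // Empty index: the first write fixes the dimensionality.
        None => Ok(()),
        Some(d) if d == provider_dims => Ok(()),
        Some(d) => Err(EmbeddingError::Provider(format!(
            "provider emits {provider_dims}-dim vectors, index holds {d}-dim vectors; re-index before switching models"
        ))),
    }
}
```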
### Why Local Providers First

For development: fastembed loads a local model (~100MB), requires no API keys, costs nothing, and works offline.

For production: OpenAI for quality, fastembed as fallback.

## Consequences

### Positive

1. Single source of truth for the entire ecosystem
2. 60-80% fewer embedding API calls (caching)
3. High availability with local providers (fallback)
4. Usage and cost metrics
5. Feature-gated: only compile what you need
6. Storage flexibility: the VectorStore trait allows choosing a backend per project

### Negative

1. **Dimension lock-in**: Changing provider requires re-indexing
2. **Cache invalidation**: Updated content may serve stale embeddings
3. **Model download**: fastembed downloads ~100MB on first use
4. **Storage lock-in per project**: Kogral is tied to SurrealDB, the others to LanceDB

### Mitigations

| Negative          | Mitigation                                     |
| ----------------- | ---------------------------------------------- |
| Dimension lock-in | Document clearly, warn on provider change      |
| Stale cache       | Configurable TTL, bypass option                |
| Model download    | Show progress, cache in `~/.cache/fastembed`   |
| Storage lock-in   | Conscious decision based on project priorities |

## Success Metrics

| Metric                    | Current | Target |
| ------------------------- | ------- | ------ |
| Duplicate implementations | 3       | 1      |
| Cache hit rate            | 0%      | >60%   |
| Fallback availability     | 0%      | 100%   |
| Cost per 10K embeddings   | ~$0.20  | ~$0.05 |

## Provider Selection Guide

### Development

```rust
// Local, free, offline
let service = EmbeddingService::builder()
    .with_provider(FastEmbedProvider::small()?) // 384 dims
    .with_memory_cache()
    .build()?;
```

### Production (Quality)

```rust
// OpenAI with local fallback
let service = EmbeddingService::builder()
    .with_provider(OpenAiEmbeddingProvider::large()?) // 3072 dims
    .with_provider(FastEmbedProvider::large()?)       // Fallback
    .with_memory_cache()
    .build()?;
```

### Production (Cost-Optimized)

```rust
// OpenAI small with fallback
let service = EmbeddingService::builder()
    .with_provider(OpenAiEmbeddingProvider::small()?) // 1536 dims
    .with_provider(OllamaEmbeddingProvider::nomic())  // Fallback
    .with_memory_cache()
    .build()?;
```

## Dimension Compatibility Matrix

| If using...            | Can switch to...            | CANNOT switch to... |
| ---------------------- | --------------------------- | ------------------- |
| fastembed small (384)  | fastembed small, all-minilm | Any other           |
| fastembed large (1024) | fastembed large             | Any other           |
| OpenAI small (1536)    | OpenAI small, ada-002       | Any other           |
| OpenAI large (3072)    | OpenAI large                | Any other           |

**Rule**: Only switch between models with the SAME dimensions.

## Implementation Priority

| Order | Feature                 | Reason                    |
| ----- | ----------------------- | ------------------------- |
| 1     | EmbeddingProvider trait | Foundation for everything |
| 2     | FastEmbed provider      | Works without API keys    |
| 3     | Memory cache            | Biggest cost impact       |
| 4     | VectorStore trait       | Storage abstraction       |
| 5     | SurrealDbStore          | Kogral needs graph+vector |
| 6     | LanceDbStore            | Provisioning/Vapora scale |
| 7     | OpenAI provider         | Production                |
| 8     | Ollama provider         | Local fallback            |
| 9     | Batch processing        | Efficiency                |
| 10    | Metrics                 | Observability             |

## References

**Existing Implementations**:

- Kogral: `kogral-core/src/embeddings/`
- Vapora: `vapora-llm-router/src/embeddings.rs`
- Provisioning: `provisioning/platform/crates/rag/src/embeddings.rs`

**Target Location**: `stratumiops/crates/stratum-embeddings/`