ADR-001: Stratum-Embeddings - Unified Embedding Library
Status
Proposed
Context
Current State: Fragmented Implementations
The ecosystem has 3 independent embedding implementations:
| Project | Location | Providers | Caching |
|---|---|---|---|
| Kogral | kogral-core/src/embeddings/ | fastembed, rig-core (partial) | No |
| Provisioning | provisioning-rag/src/embeddings.rs | OpenAI direct | No |
| Vapora | vapora-llm-router/src/embeddings.rs | OpenAI, HuggingFace, Ollama | No |
Identified Problems
1. Duplicated Code
Each project reimplements:
- HTTP client for OpenAI embeddings
- JSON response parsing
- Error handling
- Token estimation
Impact: ~400 duplicated lines, inconsistent error handling.
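To make the duplication concrete, here is a hedged sketch of the kind of OpenAI embedding call each project re-implements in some form. It is illustrative only and not taken from any of the three codebases; it assumes the `reqwest` (with the `json` feature) and `serde_json` crates and an `OPENAI_API_KEY` environment variable.

```rust
// Illustrative sketch of the per-project boilerplate: HTTP call,
// JSON parsing, and error handling for OpenAI embeddings.
async fn embed_openai(text: &str) -> Result<Vec<f32>, Box<dyn std::error::Error>> {
    let api_key = std::env::var("OPENAI_API_KEY")?;
    let response: serde_json::Value = reqwest::Client::new()
        .post("https://api.openai.com/v1/embeddings")
        .bearer_auth(api_key)
        .json(&serde_json::json!({
            "model": "text-embedding-3-small",
            "input": text,
        }))
        .send()
        .await?
        .error_for_status()?
        .json()
        .await?;

    // Parse `data[0].embedding` into a Vec<f32>; every project repeats
    // this parsing and its error handling with small variations.
    let embedding = response["data"][0]["embedding"]
        .as_array()
        .ok_or("missing embedding in response")?
        .iter()
        .filter_map(|v| v.as_f64().map(|f| f as f32))
        .collect();

    Ok(embedding)
}
```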
2. No Caching
Embeddings regenerated every time:
"What is Rust?" → OpenAI → 1536 dims → $0.00002
"What is Rust?" → OpenAI → 1536 dims → $0.00002 (same result)
"What is Rust?" → OpenAI → 1536 dims → $0.00002 (same result)
Impact: Unnecessary costs, additional latency, more frequent rate limits.
3. No Fallback
If OpenAI fails, everything fails. No fallback to local alternatives (fastembed, Ollama).
Impact: Reduced availability, total dependency on one provider.
4. Silent Dimension Mismatch
Different providers produce different dimensions:
| Provider | Model | Dimensions |
|---|---|---|
| fastembed | bge-small-en | 384 |
| fastembed | bge-large-en | 1024 |
| OpenAI | text-embedding-3-small | 1536 |
| OpenAI | text-embedding-3-large | 3072 |
| Ollama | nomic-embed-text | 768 |
Impact: Corrupt vector indices if provider changes.
5. No Metrics
No visibility into usage, cache hit rate, latency per provider, or accumulated costs.
Decision
Create stratum-embeddings as a unified crate that:
- Unifies implementations from Kogral, Provisioning, and Vapora
- Adds caching to avoid recomputing identical embeddings
- Implements fallback between providers (cloud → local)
- Clearly documents dimensions and limitations per provider
- Exposes metrics for observability
- Provides VectorStore trait with LanceDB and SurrealDB backends based on project needs
Storage Backend Decision
Each project chooses its vector storage backend based on priority:
| Project | Backend | Priority | Justification |
|---|---|---|---|
| Kogral | SurrealDB | Graph richness | Knowledge Graph needs unified graph+vector queries |
| Provisioning | LanceDB | Vector scale | RAG with millions of document chunks |
| Vapora | LanceDB | Vector scale | Execution traces, pattern matching at scale |
Why SurrealDB for Kogral
Kogral is a Knowledge Graph where relationships are the primary value. With a hybrid architecture (LanceDB for vectors + SurrealDB for the graph), a typical query would require:
- LanceDB: vector search → candidate_ids
- SurrealDB: graph filter on candidates → results
- App layer: merge, re-rank, deduplication
Accepted trade-off: SurrealDB has worse pure vector-search performance than LanceDB, but Kogral's scale is bounded by human curation of knowledge (typically 10K-100K concepts), where the difference is negligible.
Why LanceDB for Provisioning and Vapora
| Aspect | SurrealDB | LanceDB |
|---|---|---|
| Storage format | Row-based | Columnar (Lance) |
| Vector index | HNSW (RAM) | IVF-PQ (disk-native) |
| Practical scale | Millions | Billions |
| Compression | ~1x | ~32x (PQ) |
| Zero-copy read | No | Yes |
Architecture
┌─────────────────────────────────────────────────────────────────┐
│ stratum-embeddings │
├─────────────────────────────────────────────────────────────────┤
│ EmbeddingProvider trait │
│ ├─ embed(text) → Vec<f32> │
│ ├─ embed_batch(texts) → Vec<Vec<f32>> │
│ ├─ dimensions() → usize │
│ └─ is_local() → bool │
│ │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
│ │ FastEmbed │ │ OpenAI │ │ Ollama │ │
│ │ (local) │ │ (cloud) │ │ (local) │ │
│ └───────────┘ └───────────┘ └───────────┘ │
│ └────────────┬────────────┘ │
│ ▼ │
│ EmbeddingCache (memory/disk) │
│ │ │
│ ▼ │
│ EmbeddingService │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ VectorStore trait │ │
│ │ ├─ upsert(id, embedding, metadata) │ │
│ │ ├─ search(embedding, limit, filter) → Vec<Match> │ │
│ │ └─ delete(id) │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ SurrealDbStore │ │ LanceDbStore │ │
│ │ (Kogral) │ │ (Prov/Vapora) │ │
│ └─────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
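One possible Rust shape for the two traits in the diagram is sketched below. Only the method names and signatures shown in the diagram come from this ADR; `EmbeddingError`, the `Match` fields, and the use of `async_trait`, `thiserror`, and `serde_json` are assumptions made for the sketch.

```rust
use async_trait::async_trait;

// Hypothetical error type for the sketch.
#[derive(Debug, thiserror::Error)]
pub enum EmbeddingError {
    #[error("provider error: {0}")]
    Provider(String),
    #[error("store error: {0}")]
    Store(String),
}

/// Provider abstraction from the diagram: FastEmbed, OpenAI, and Ollama
/// would each implement this trait.
#[async_trait]
pub trait EmbeddingProvider: Send + Sync {
    async fn embed(&self, text: &str) -> Result<Vec<f32>, EmbeddingError>;
    async fn embed_batch(&self, texts: &[String]) -> Result<Vec<Vec<f32>>, EmbeddingError>;
    /// Output dimensionality (384, 768, 1024, 1536, 3072, ...).
    fn dimensions(&self) -> usize;
    /// True for fastembed/Ollama, false for cloud providers.
    fn is_local(&self) -> bool;
}

/// A search hit returned by a vector store backend.
pub struct Match {
    pub id: String,
    pub score: f32,
    pub metadata: serde_json::Value,
}

/// Storage abstraction implemented by SurrealDbStore and LanceDbStore.
#[async_trait]
pub trait VectorStore: Send + Sync {
    async fn upsert(
        &self,
        id: &str,
        embedding: &[f32],
        metadata: serde_json::Value,
    ) -> Result<(), EmbeddingError>;
    async fn search(
        &self,
        embedding: &[f32],
        limit: usize,
        filter: Option<serde_json::Value>,
    ) -> Result<Vec<Match>, EmbeddingError>;
    async fn delete(&self, id: &str) -> Result<(), EmbeddingError>;
}
```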
Rationale
Why Caching is Critical
For a typical RAG system (10,000 chunks):
- Without cache: every re-index and every repeated query pays the provider again
- With cache: only the first indexing pass pays; repeated texts are cache hits
Estimated savings: 60-80% in embedding costs.
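At the ~$0.00002 per embedding shown in the Context section, embedding 10,000 chunks costs roughly 10,000 × $0.00002 = $0.20 per pass; without a cache, every re-index pays that again. Below is a minimal sketch of the cache layer, assuming a content hash of model + text as the key; the `sha2` and `hex` crates and all names are assumptions, not the final API.

```rust
use sha2::{Digest, Sha256};
use std::collections::HashMap;

/// Cache key: the same text embedded with a different model must never
/// collide, so the model identifier is hashed together with the text.
fn cache_key(model: &str, text: &str) -> String {
    let mut hasher = Sha256::new();
    hasher.update(model.as_bytes());
    hasher.update(b"\x00"); // separator so ("ab", "c") != ("a", "bc")
    hasher.update(text.as_bytes());
    hex::encode(hasher.finalize())
}

/// Minimal in-memory cache: look up before calling the provider,
/// insert after a miss.
struct MemoryCache {
    entries: HashMap<String, Vec<f32>>,
}

impl MemoryCache {
    fn get_or_insert_with(
        &mut self,
        model: &str,
        text: &str,
        compute: impl FnOnce() -> Vec<f32>,
    ) -> &Vec<f32> {
        let key = cache_key(model, text);
        self.entries.entry(key).or_insert_with(compute)
    }
}
```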
Why Fallback is Important
| Scenario | Without Fallback | With Fallback |
|---|---|---|
| OpenAI rate limit | ERROR | → fastembed (local) |
| OpenAI downtime | ERROR | → Ollama (local) |
| No internet | ERROR | → fastembed (local) |
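A minimal sketch of the fallback loop, reusing the hypothetical `EmbeddingProvider` trait and `EmbeddingError` from the Architecture sketch above. Note that a fallback provider with different dimensions runs into the dimension-mismatch problem from the Context section, so its results must not be written into an index built by the primary provider.

```rust
// Providers are tried in registration order (cloud first, local last);
// the first success wins.
async fn embed_with_fallback(
    providers: &[Box<dyn EmbeddingProvider>],
    text: &str,
) -> Result<Vec<f32>, EmbeddingError> {
    let mut last_err = EmbeddingError::Provider("no providers configured".into());
    for provider in providers {
        match provider.embed(text).await {
            Ok(vector) => return Ok(vector),
            Err(err) => {
                // Rate limit, downtime, no network: fall through to the
                // next (typically local) provider.
                last_err = err;
            }
        }
    }
    Err(last_err)
}
```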
Why Local Providers First
For development: fastembed loads a local model (~100MB), requires no API keys, costs nothing, and works offline.
For production: OpenAI for quality, fastembed as fallback.
Consequences
Positive
- Single source of truth for the entire ecosystem
- 60-80% fewer embedding API calls (caching)
- High availability with local providers (fallback)
- Usage and cost metrics
- Feature-gated: only compile what you need
- Storage flexibility: VectorStore trait allows choosing backend per project
Negative
- Dimension lock-in: Changing provider requires re-indexing
- Cache invalidation: Updated content may serve stale embeddings
- Model download: fastembed downloads ~100MB on first use
- Storage lock-in per project: Kogral tied to SurrealDB, others to LanceDB
Mitigations
| Negative | Mitigation |
|---|---|
| Dimension lock-in | Document clearly, warn on provider change |
| Stale cache | Configurable TTL, bypass option |
| Model download | Show progress, cache in ~/.cache/fastembed |
| Storage lock-in | Conscious decision based on project priorities |
Success Metrics
| Metric | Current | Target |
|---|---|---|
| Duplicate implementations | 3 | 1 |
| Cache hit rate | 0% | >60% |
| Fallback availability | 0% | 100% |
| Cost per 10K embeddings | ~$0.20 | ~$0.05 |
Provider Selection Guide
Development
```rust
// Local, free, offline
let service = EmbeddingService::builder()
    .with_provider(FastEmbedProvider::small()?) // 384 dims
    .with_memory_cache()
    .build()?;
```
Production (Quality)
```rust
// OpenAI with local fallback
let service = EmbeddingService::builder()
    .with_provider(OpenAiEmbeddingProvider::large()?) // 3072 dims
    .with_provider(FastEmbedProvider::large()?)       // Fallback
    .with_memory_cache()
    .build()?;
```
Production (Cost-Optimized)
```rust
// OpenAI small with fallback
let service = EmbeddingService::builder()
    .with_provider(OpenAiEmbeddingProvider::small()?) // 1536 dims
    .with_provider(OllamaEmbeddingProvider::nomic())  // Fallback
    .with_memory_cache()
    .build()?;
```
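Whichever configuration is chosen, call sites stay the same. A hedged usage sketch continuing the builder examples above (the `embed` and `dimensions` method names are assumptions):

```rust
// Repeated calls with the same text hit the cache instead of the provider.
let first = service.embed("What is Rust?").await?;  // provider call
let second = service.embed("What is Rust?").await?; // cache hit
assert_eq!(first, second);
assert_eq!(first.len(), service.dimensions());
```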
Dimension Compatibility Matrix
| If using... | Can switch to... | CANNOT switch to... |
|---|---|---|
| fastembed small (384) | fastembed small, all-minilm | Any other |
| fastembed large (1024) | fastembed large | Any other |
| OpenAI small (1536) | OpenAI small, ada-002 | Any other |
| OpenAI large (3072) | OpenAI large | Any other |
Rule: Only switch between models with the SAME dimensions.
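A small guard like the following sketch (all names hypothetical) can enforce the rule at upsert time instead of corrupting the index silently:

```rust
// Refuse to write into an index built with a different dimensionality.
// `expected_dims` would be recorded when the index is first created.
fn check_dimensions(expected_dims: usize, embedding: &[f32]) -> Result<(), String> {
    if embedding.len() != expected_dims {
        return Err(format!(
            "dimension mismatch: index expects {expected_dims}, provider produced {} \
             (switching providers requires re-indexing)",
            embedding.len()
        ));
    }
    Ok(())
}
```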
Implementation Priority
| Order | Feature | Reason |
|---|---|---|
| 1 | EmbeddingProvider trait | Foundation for everything |
| 2 | FastEmbed provider | Works without API keys |
| 3 | Memory cache | Biggest cost impact |
| 4 | VectorStore trait | Storage abstraction |
| 5 | SurrealDbStore | Kogral needs graph+vector |
| 6 | LanceDbStore | Provisioning/Vapora scale |
| 7 | OpenAI provider | Production |
| 8 | Ollama provider | Local fallback |
| 9 | Batch processing | Efficiency |
| 10 | Metrics | Observability |
References
Existing Implementations:
- Kogral: kogral-core/src/embeddings/
- Vapora: vapora-llm-router/src/embeddings.rs
- Provisioning: provisioning/platform/crates/rag/src/embeddings.rs
Target Location: stratumiops/crates/stratum-embeddings/