# ADR-001: Stratum-Embeddings - Unified Embedding Library

## Status

**Proposed**

## Context

### Current State: Fragmented Implementations

The ecosystem has 3 independent embedding implementations:

| Project      | Location                              | Providers                     | Caching |
| ------------ | ------------------------------------- | ----------------------------- | ------- |
| Kogral       | `kogral-core/src/embeddings/`         | fastembed, rig-core (partial) | No      |
| Provisioning | `provisioning-rag/src/embeddings.rs`  | OpenAI direct                 | No      |
| Vapora       | `vapora-llm-router/src/embeddings.rs` | OpenAI, HuggingFace, Ollama   | No      |

### Identified Problems

#### 1. Duplicated Code

Each project reimplements:

- HTTP client for OpenAI embeddings
- JSON response parsing
- Error handling
- Token estimation

**Impact**: ~400 duplicated lines, inconsistent error handling.

#### 2. No Caching

Embeddings are regenerated every time:

```text
"What is Rust?" → OpenAI → 1536 dims → $0.00002
"What is Rust?" → OpenAI → 1536 dims → $0.00002 (same result)
"What is Rust?" → OpenAI → 1536 dims → $0.00002 (same result)
```

**Impact**: Unnecessary costs, additional latency, and more frequent rate-limit errors.

#### 3. No Fallback

If OpenAI fails, everything fails: there is no fallback to local alternatives (fastembed, Ollama).

**Impact**: Reduced availability; total dependency on a single provider.

#### 4. Silent Dimension Mismatch

Different providers produce embeddings with different dimensions:

| Provider  | Model                  | Dimensions |
| --------- | ---------------------- | ---------- |
| fastembed | bge-small-en           | 384        |
| fastembed | bge-large-en           | 1024       |
| OpenAI    | text-embedding-3-small | 1536       |
| OpenAI    | text-embedding-3-large | 3072       |
| Ollama    | nomic-embed-text       | 768        |

**Impact**: Corrupt vector indices if the provider changes.
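
A start-up dimension guard turns this silent corruption into a loud error. A minimal sketch; the function name and `String` error type are illustrative, not part of a decided API:

```rust
/// Refuse to serve queries when the active provider's dimensions do not
/// match the dimensions the vector index was created with.
fn check_dimensions(provider_dims: usize, index_dims: usize) -> Result<(), String> {
    if provider_dims != index_dims {
        return Err(format!(
            "dimension mismatch: provider emits {provider_dims}-dim vectors, \
             index stores {index_dims}-dim vectors; re-index before switching"
        ));
    }
    Ok(())
}
```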

#### 5. No Metrics

No visibility into usage, cache hit rate, latency per provider, or accumulated costs.

## Decision

Create `stratum-embeddings` as a unified crate that:

1. **Unifies** implementations from Kogral, Provisioning, and Vapora
2. **Adds caching** to avoid recomputing identical embeddings
3. **Implements fallback** between providers (cloud → local)
4. **Clearly documents** dimensions and limitations per provider
5. **Exposes metrics** for observability (see the sketch below)
6. **Provides VectorStore trait** with LanceDB and SurrealDB backends based on project needs
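
As an illustration of item 5, the counter set below covers the visibility gaps named in problem #5. A minimal sketch using plain atomics; the field names and units are assumptions, not a finalized schema:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Counters for the gaps in problem #5: usage, cache hit rate, and cost.
/// Per-provider latency histograms would sit alongside these.
#[derive(Default)]
pub struct EmbeddingMetrics {
    pub requests: AtomicU64,
    pub cache_hits: AtomicU64,
    pub provider_errors: AtomicU64,
    /// Micro-cents to stay integral; divide by 1e8 for dollars.
    pub cost_microcents: AtomicU64,
}

impl EmbeddingMetrics {
    pub fn hit_rate(&self) -> f64 {
        let requests = self.requests.load(Ordering::Relaxed);
        if requests == 0 {
            return 0.0;
        }
        self.cache_hits.load(Ordering::Relaxed) as f64 / requests as f64
    }
}
```
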
### Storage Backend Decision

Each project chooses its vector storage backend based on priority:

| Project      | Backend   | Priority       | Justification                                      |
| ------------ | --------- | -------------- | -------------------------------------------------- |
| Kogral       | SurrealDB | Graph richness | Knowledge Graph needs unified graph+vector queries |
| Provisioning | LanceDB   | Vector scale   | RAG with millions of document chunks               |
| Vapora       | LanceDB   | Vector scale   | Execution traces, pattern matching at scale        |

#### Why SurrealDB for Kogral

Kogral is a Knowledge Graph where relationships are the primary value.
With a hybrid architecture (LanceDB vectors + SurrealDB graph), a typical query would require three hops (sketched below):

1. LanceDB: vector search → candidate IDs
2. SurrealDB: graph filter on candidates → results
3. App layer: merge, re-rank, deduplication

**Accepted trade-off**: SurrealDB has worse pure-vector performance than LanceDB,
but Kogral's scale is limited by human curation of knowledge (typically 10K-100K concepts).
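
To make the overhead concrete, here is a purely illustrative sketch of those three hops; `lancedb_search` and `surrealdb_graph_filter` are hypothetical stubs, not real client APIs:

```rust
use std::collections::HashSet;

// Stubs standing in for the two databases; in the real hybrid design these
// would be network calls to LanceDB and SurrealDB respectively.
async fn lancedb_search(_query: &[f32], _limit: usize) -> Vec<(String, f32)> {
    vec![("concept:rust".into(), 0.92), ("concept:go".into(), 0.81)]
}
async fn surrealdb_graph_filter(ids: &[String]) -> HashSet<String> {
    ids.iter().cloned().collect() // pretend every candidate passes the graph predicate
}

/// The three hops every Kogral query would need under the rejected hybrid design.
async fn hybrid_lookup(query_vec: &[f32]) -> Vec<(String, f32)> {
    // Hop 1: vector search → scored candidate ids
    let candidates = lancedb_search(query_vec, 100).await;
    // Hop 2: graph filter on the candidate ids
    let ids: Vec<String> = candidates.iter().map(|(id, _)| id.clone()).collect();
    let surviving = surrealdb_graph_filter(&ids).await;
    // Hop 3: merge, re-rank, deduplicate in the application layer
    let mut results: Vec<(String, f32)> = candidates
        .into_iter()
        .filter(|(id, _)| surviving.contains(id))
        .collect();
    results.sort_by(|a, b| b.1.total_cmp(&a.1)); // re-rank by score
    results.dedup_by(|a, b| a.0 == b.0);         // deduplicate by id
    results
}
```
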
#### Why LanceDB for Provisioning and Vapora

| Aspect          | SurrealDB  | LanceDB              |
| --------------- | ---------- | -------------------- |
| Storage format  | Row-based  | Columnar (Lance)     |
| Vector index    | HNSW (RAM) | IVF-PQ (disk-native) |
| Practical scale | Millions   | Billions             |
| Compression     | ~1x        | ~32x (PQ)            |
| Zero-copy read  | No         | Yes                  |

### Architecture

```text
┌─────────────────────────────────────────────────────────────────┐
│                       stratum-embeddings                        │
├─────────────────────────────────────────────────────────────────┤
│  EmbeddingProvider trait                                        │
│  ├─ embed(text) → Vec<f32>                                      │
│  ├─ embed_batch(texts) → Vec<Vec<f32>>                          │
│  ├─ dimensions() → usize                                        │
│  └─ is_local() → bool                                           │
│                                                                 │
│  ┌───────────┐   ┌───────────┐   ┌───────────┐                  │
│  │ FastEmbed │   │  OpenAI   │   │  Ollama   │                  │
│  │  (local)  │   │  (cloud)  │   │  (local)  │                  │
│  └───────────┘   └───────────┘   └───────────┘                  │
│        └───────────────┬───────────────┘                        │
│                        ▼                                        │
│          EmbeddingCache (memory/disk)                           │
│                        │                                        │
│                        ▼                                        │
│                EmbeddingService                                 │
│                        │                                        │
│                        ▼                                        │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │ VectorStore trait                                        │   │
│  │ ├─ upsert(id, embedding, metadata)                       │   │
│  │ ├─ search(embedding, limit, filter) → Vec<Match>         │   │
│  │ └─ delete(id)                                            │   │
│  └──────────────────────────────────────────────────────────┘   │
│           │                              │                      │
│           ▼                              ▼                      │
│  ┌─────────────────┐            ┌─────────────────┐             │
│  │ SurrealDbStore  │            │  LanceDbStore   │             │
│  │    (Kogral)     │            │  (Prov/Vapora)  │             │
│  └─────────────────┘            └─────────────────┘             │
└─────────────────────────────────────────────────────────────────┘
```
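
In Rust, the two traits in the diagram could look roughly as follows. A minimal sketch assuming the `async-trait` crate; the error aliases, the `Match` fields, and the `filter` parameter type are illustrative assumptions, not a finalized API:

```rust
use async_trait::async_trait;

/// Hypothetical error aliases; a real crate would define proper error enums.
pub type EmbeddingError = Box<dyn std::error::Error + Send + Sync>;
pub type StoreError = Box<dyn std::error::Error + Send + Sync>;

/// A search hit: id, similarity score, and stored metadata.
pub struct Match {
    pub id: String,
    pub score: f32,
    pub metadata: serde_json::Value,
}

/// Provider abstraction from the diagram.
#[async_trait]
pub trait EmbeddingProvider: Send + Sync {
    async fn embed(&self, text: &str) -> Result<Vec<f32>, EmbeddingError>;
    async fn embed_batch(&self, texts: &[String]) -> Result<Vec<Vec<f32>>, EmbeddingError>;
    /// Output dimensions, used to detect index mismatches up front.
    fn dimensions(&self) -> usize;
    /// Local providers cost nothing and work offline, so they make good fallbacks.
    fn is_local(&self) -> bool;
}

/// Storage abstraction from the diagram, implemented by SurrealDbStore and LanceDbStore.
#[async_trait]
pub trait VectorStore: Send + Sync {
    async fn upsert(&self, id: &str, embedding: &[f32], metadata: serde_json::Value) -> Result<(), StoreError>;
    async fn search(&self, embedding: &[f32], limit: usize, filter: Option<&str>) -> Result<Vec<Match>, StoreError>;
    async fn delete(&self, id: &str) -> Result<(), StoreError>;
}
```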

## Rationale

### Why Caching is Critical

For a typical RAG system (10,000 chunks):

- **Without cache**: re-indexing and repeated queries multiply costs
- **With cache**: the first indexing pays; subsequent runs are cache hits

**Estimated savings**: 60-80% in embedding costs.
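
Cache hits depend on a deterministic key. One plausible scheme, sketched here under the assumption of content-addressing with the `sha2` crate, hashes provider, model, and text together (the model must be part of the key, or a model switch would serve wrong-dimension vectors from cache):

```rust
use sha2::{Digest, Sha256};

/// Deterministic cache key: identical (provider, model, text) triples
/// always map to the same cached embedding.
fn cache_key(provider: &str, model: &str, text: &str) -> String {
    let mut hasher = Sha256::new();
    for part in [provider, model, text] {
        hasher.update(part.as_bytes());
        hasher.update([0x1f]); // unit separator, prevents ambiguous concatenation
    }
    hasher
        .finalize()
        .iter()
        .map(|b| format!("{b:02x}"))
        .collect()
}
```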

### Why Fallback is Important

| Scenario          | Without Fallback | With Fallback       |
| ----------------- | ---------------- | ------------------- |
| OpenAI rate limit | ERROR            | → fastembed (local) |
| OpenAI downtime   | ERROR            | → Ollama (local)    |
| No internet       | ERROR            | → fastembed (local) |
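
Mechanically, fallback is ordered iteration over the configured providers. A minimal sketch against the `EmbeddingProvider` trait sketched earlier; the error handling is illustrative:

```rust
/// Try providers in configured order (cloud first, local last);
/// return the first successful embedding.
///
/// NOTE: mixing providers with different dimensions corrupts the index;
/// guard with a dimension check before writing fallback vectors.
async fn embed_with_fallback(
    providers: &[Box<dyn EmbeddingProvider>],
    text: &str,
) -> Result<Vec<f32>, EmbeddingError> {
    let mut last_err = None;
    for provider in providers {
        match provider.embed(text).await {
            Ok(embedding) => return Ok(embedding),
            // Rate limit, downtime, no network: remember the error, try the next one.
            Err(err) => last_err = Some(err),
        }
    }
    Err(last_err.unwrap_or_else(|| "no providers configured".into()))
}
```
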
### Why Local Providers First

For development: fastembed loads a local model (~100MB); no API keys required, no costs, works offline.

For production: OpenAI for quality, fastembed as fallback.

## Consequences

### Positive

1. Single source of truth for the entire ecosystem
2. 60-80% fewer embedding API calls (caching)
3. High availability with local providers (fallback)
4. Usage and cost metrics
5. Feature-gated: only compile what you need
6. Storage flexibility: the VectorStore trait allows choosing a backend per project

### Negative

1. **Dimension lock-in**: changing provider requires re-indexing
2. **Cache invalidation**: updated content may serve stale embeddings
3. **Model download**: fastembed downloads ~100MB on first use
4. **Storage lock-in per project**: Kogral is tied to SurrealDB, the others to LanceDB

### Mitigations

| Negative          | Mitigation                                     |
| ----------------- | ---------------------------------------------- |
| Dimension lock-in | Document clearly, warn on provider change      |
| Stale cache       | Configurable TTL, bypass option                |
| Model download    | Show progress, cache in ~/.cache/fastembed     |
| Storage lock-in   | Conscious decision based on project priorities |

## Success Metrics

| Metric                    | Current | Target |
| ------------------------- | ------- | ------ |
| Duplicate implementations | 3       | 1      |
| Cache hit rate            | 0%      | >60%   |
| Fallback availability     | 0%      | 100%   |
| Cost per 10K embeddings   | ~$0.20  | ~$0.05 |

The cost target follows from the figures above: at ~$0.00002 per embedding, 10K embeddings cost ~$0.20, and the 60-80% savings expected from caching bring that toward ~$0.05.

## Provider Selection Guide

### Development

```rust
// Local, free, offline
let service = EmbeddingService::builder()
    .with_provider(FastEmbedProvider::small()?) // 384 dims
    .with_memory_cache()
    .build()?;
```

### Production (Quality)

```rust
// OpenAI with local fallback
let service = EmbeddingService::builder()
    .with_provider(OpenAiEmbeddingProvider::large()?) // 3072 dims
    // Fallback (1024 dims): dimensions differ from the primary; see the
    // Dimension Compatibility Matrix before mixing results in one index
    .with_provider(FastEmbedProvider::large()?)
    .with_memory_cache()
    .build()?;
```

### Production (Cost-Optimized)

```rust
// OpenAI small with fallback
let service = EmbeddingService::builder()
    .with_provider(OpenAiEmbeddingProvider::small()?) // 1536 dims
    // Fallback (768 dims): dimensions differ from the primary; see the
    // Dimension Compatibility Matrix before mixing results in one index
    .with_provider(OllamaEmbeddingProvider::nomic())
    .with_memory_cache()
    .build()?;
```

## Dimension Compatibility Matrix

| If using...            | Can switch to...            | CANNOT switch to... |
| ---------------------- | --------------------------- | ------------------- |
| fastembed small (384)  | fastembed small, all-minilm | Any other           |
| fastembed large (1024) | fastembed large             | Any other           |
| OpenAI small (1536)    | OpenAI small, ada-002       | Any other           |
| OpenAI large (3072)    | OpenAI large                | Any other           |

**Rule**: Only switch between models with the SAME dimensions. Even then, treat a model switch as a re-indexing event: matching dimensions keep the index schema valid, but vectors from different models occupy different embedding spaces and are not directly comparable.

## Implementation Priority

| Order | Feature                 | Reason                    |
| ----- | ----------------------- | ------------------------- |
| 1     | EmbeddingProvider trait | Foundation for everything |
| 2     | FastEmbed provider      | Works without API keys    |
| 3     | Memory cache            | Biggest cost impact       |
| 4     | VectorStore trait       | Storage abstraction       |
| 5     | SurrealDbStore          | Kogral needs graph+vector |
| 6     | LanceDbStore            | Provisioning/Vapora scale |
| 7     | OpenAI provider         | Production                |
| 8     | Ollama provider         | Local fallback            |
| 9     | Batch processing        | Efficiency                |
| 10    | Metrics                 | Observability             |

## References

**Existing Implementations**:

- Kogral: `kogral-core/src/embeddings/`
- Vapora: `vapora-llm-router/src/embeddings.rs`
- Provisioning: `provisioning/platform/crates/rag/src/embeddings.rs`

**Target Location**: `stratumiops/crates/stratum-embeddings/`