# ADR-001: Stratum-Embeddings - Unified Embedding Library
## Status
**Proposed**
## Context
### Current State: Fragmented Implementations
The ecosystem has 3 independent embedding implementations:
| Project | Location | Providers | Caching |
| ------------ | ------------------------------------- | ----------------------------- | ------- |
| Kogral | `kogral-core/src/embeddings/` | fastembed, rig-core (partial) | No |
| Provisioning | `provisioning-rag/src/embeddings.rs` | OpenAI direct | No |
| Vapora | `vapora-llm-router/src/embeddings.rs` | OpenAI, HuggingFace, Ollama | No |
### Identified Problems
#### 1. Duplicated Code
Each project reimplements:
- HTTP client for OpenAI embeddings
- JSON response parsing
- Error handling
- Token estimation
**Impact**: ~400 duplicated lines, inconsistent error handling.
#### 2. No Caching
Embeddings are regenerated on every call:
```text
"What is Rust?" → OpenAI → 1536 dims → $0.00002
"What is Rust?" → OpenAI → 1536 dims → $0.00002 (same result)
"What is Rust?" → OpenAI → 1536 dims → $0.00002 (same result)
```
**Impact**: Unnecessary costs, additional latency, more frequent rate limits.
#### 3. No Fallback
If OpenAI fails, everything fails. No fallback to local alternatives (fastembed, Ollama).
**Impact**: Reduced availability, total dependency on one provider.
#### 4. Silent Dimension Mismatch
Different providers produce different dimensions:
| Provider | Model | Dimensions |
| --------- | ---------------------- | ---------- |
| fastembed | bge-small-en | 384 |
| fastembed | bge-large-en | 1024 |
| OpenAI | text-embedding-3-small | 1536 |
| OpenAI | text-embedding-3-large | 3072 |
| Ollama | nomic-embed-text | 768 |
**Impact**: Corrupt vector indices if provider changes.
#### 5. No Metrics
No visibility into usage, cache hit rate, latency per provider, or accumulated costs.
## Decision
Create `stratum-embeddings` as a unified crate that:
1. **Unifies** implementations from Kogral, Provisioning, and Vapora
2. **Adds caching** to avoid recomputing identical embeddings
3. **Implements fallback** between providers (cloud → local)
4. **Clearly documents** dimensions and limitations per provider
5. **Exposes metrics** for observability
6. **Provides VectorStore trait** with LanceDB and SurrealDB backends based on project needs
### Storage Backend Decision
Each project chooses its vector storage backend based on priority:
| Project | Backend | Priority | Justification |
| ------------ | --------- | -------------- | -------------------------------------------------- |
| Kogral | SurrealDB | Graph richness | Knowledge Graph needs unified graph+vector queries |
| Provisioning | LanceDB | Vector scale | RAG with millions of document chunks |
| Vapora | LanceDB | Vector scale | Execution traces, pattern matching at scale |
#### Why SurrealDB for Kogral
Kogral is a Knowledge Graph where relationships are the primary value.
With a hybrid architecture (LanceDB for vectors + SurrealDB for the graph), a typical query would require three steps:
1. LanceDB: vector search → candidate_ids
2. SurrealDB: graph filter on candidates → results
3. App layer: merge, re-rank, deduplicate
With SurrealDB alone, the same query runs as a single unified graph+vector statement.
**Accepted trade-off**: SurrealDB has worse pure vector performance than LanceDB,
but Kogral's scale is limited by human curation of knowledge (typically 10K-100K concepts), where the difference is negligible.
#### Why LanceDB for Provisioning and Vapora
| Aspect | SurrealDB | LanceDB |
| --------------- | ---------- | -------------------- |
| Storage format | Row-based | Columnar (Lance) |
| Vector index | HNSW (RAM) | IVF-PQ (disk-native) |
| Practical scale | Millions | Billions |
| Compression | ~1x | ~32x (PQ) |
| Zero-copy read | No | Yes |
### Architecture
```text
┌─────────────────────────────────────────────────────────────────┐
│ stratum-embeddings │
├─────────────────────────────────────────────────────────────────┤
│ EmbeddingProvider trait │
│   ├─ embed(text) → Vec<f32>                                     │
│ ├─ embed_batch(texts) → Vec<Vec<f32>> │
│ ├─ dimensions() → usize │
│ └─ is_local() → bool │
│ │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
│ │ FastEmbed │ │ OpenAI │ │ Ollama │ │
│ │ (local) │ │ (cloud) │ │ (local) │ │
│ └───────────┘ └───────────┘ └───────────┘ │
│ └────────────┬────────────┘ │
│ ▼ │
│ EmbeddingCache (memory/disk) │
│ │ │
│ ▼ │
│ EmbeddingService │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ VectorStore trait │ │
│ │ ├─ upsert(id, embedding, metadata) │ │
│ │ ├─ search(embedding, limit, filter) → Vec<Match> │ │
│ │ └─ delete(id) │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ SurrealDbStore │ │ LanceDbStore │ │
│ │ (Kogral) │ │ (Prov/Vapora) │ │
│ └─────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
```
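In Rust, the two traits from the diagram translate roughly as follows. This is a minimal sketch of the drawn signatures; the `async_trait` usage, the error type, and the `Match`/metadata shapes are assumptions, not the final API:
```rust
use async_trait::async_trait;

/// Single-variant error to keep the sketch small; the real crate would
/// define a richer error enum.
#[derive(Debug)]
pub struct EmbeddingError(pub String);

/// Provider trait, mirroring the diagram above.
#[async_trait]
pub trait EmbeddingProvider: Send + Sync {
    /// Embed one text into a fixed-size vector.
    async fn embed(&self, text: &str) -> Result<Vec<f32>, EmbeddingError>;
    /// Embed many texts in one round trip where the backend supports it.
    async fn embed_batch(&self, texts: &[String]) -> Result<Vec<Vec<f32>>, EmbeddingError>;
    /// Output dimensionality (e.g. 384 for bge-small-en, 1536 for text-embedding-3-small).
    fn dimensions(&self) -> usize;
    /// True for providers that need no network access (fastembed, Ollama).
    fn is_local(&self) -> bool;
}

/// One search hit; the metadata shape is an assumption.
pub struct Match {
    pub id: String,
    pub score: f32,
    pub metadata: serde_json::Value,
}

/// Storage trait implemented by SurrealDbStore and LanceDbStore.
#[async_trait]
pub trait VectorStore: Send + Sync {
    async fn upsert(&self, id: &str, embedding: &[f32], metadata: serde_json::Value)
        -> Result<(), EmbeddingError>;
    async fn search(&self, embedding: &[f32], limit: usize, filter: Option<serde_json::Value>)
        -> Result<Vec<Match>, EmbeddingError>;
    async fn delete(&self, id: &str) -> Result<(), EmbeddingError>;
}
```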
## Rationale
### Why Caching is Critical
For a typical RAG system (10,000 chunks):
- **Without cache**: every re-index and every repeated query pays the provider again
- **With cache**: only the first indexing pass pays; re-indexes and repeated queries are cache hits
**Estimated savings**: 60-80% of embedding costs.
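A minimal sketch of the memory cache, assuming entries are keyed by (model, text) so that vectors produced by different models never collide; all names are illustrative:
```rust
use std::collections::{hash_map::DefaultHasher, HashMap};
use std::hash::{Hash, Hasher};

/// Derive the cache key from (model, text): the same text embedded by a
/// different model must never produce a hit.
fn cache_key(model: &str, text: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    (model, text).hash(&mut hasher);
    hasher.finish()
}

/// In-memory cache: "What is Rust?" pays the provider once, then hits.
struct MemoryCache {
    entries: HashMap<u64, Vec<f32>>,
}

impl MemoryCache {
    fn get(&self, model: &str, text: &str) -> Option<&[f32]> {
        self.entries.get(&cache_key(model, text)).map(|v| v.as_slice())
    }

    fn put(&mut self, model: &str, text: &str, embedding: Vec<f32>) {
        self.entries.insert(cache_key(model, text), embedding);
    }
}
```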
### Why Fallback is Important
| Scenario | Without Fallback | With Fallback |
| ----------------- | ---------------- | -------------------- |
| OpenAI rate limit | ERROR | → fastembed (local) |
| OpenAI downtime | ERROR | → Ollama (local) |
| No internet | ERROR | → fastembed (local) |
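The chain itself can be as simple as trying providers in registration order; a sketch assuming the `EmbeddingProvider` trait and error type sketched under Architecture. Note that vectors from a fallback provider with different dimensions must not be written into the same index (see the compatibility matrix below):
```rust
/// Try each configured provider in order; the first success wins.
async fn embed_with_fallback(
    providers: &[Box<dyn EmbeddingProvider>],
    text: &str,
) -> Result<Vec<f32>, EmbeddingError> {
    let mut last_err = EmbeddingError("no providers configured".into());
    for provider in providers {
        match provider.embed(text).await {
            Ok(embedding) => return Ok(embedding),
            // Rate limit, downtime, or no internet: fall through to the
            // next provider (typically a local one).
            Err(err) => last_err = err,
        }
    }
    Err(last_err)
}
```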
### Why Local Providers First
For development: fastembed loads a local model (~100 MB), requires no API keys, costs nothing, and works offline.
For production: OpenAI for quality, with fastembed as fallback.
## Consequences
### Positive
1. Single source of truth for the entire ecosystem
2. 60-80% fewer embedding API calls (caching)
3. High availability with local providers (fallback)
4. Usage and cost metrics
5. Feature-gated: only compile what you need
6. Storage flexibility: VectorStore trait allows choosing backend per project
### Negative
1. **Dimension lock-in**: Changing provider requires re-indexing
2. **Cache invalidation**: Updated content may serve stale embeddings
3. **Model download**: fastembed downloads ~100MB on first use
4. **Storage lock-in per project**: Kogral tied to SurrealDB, others to LanceDB
### Mitigations
| Negative | Mitigation |
| ----------------- | ---------------------------------------------- |
| Dimension lock-in | Document clearly, warn on provider change |
| Stale cache | Configurable TTL, bypass option |
| Model download | Show progress, cache in ~/.cache/fastembed |
| Storage lock-in | Conscious decision based on project priorities |
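For the stale-cache mitigation, the TTL can live on each cache entry; a sketch with illustrative names:
```rust
use std::time::{Duration, Instant};

/// Cache entry that records its insertion time so lookups can enforce a TTL.
struct CacheEntry {
    embedding: Vec<f32>,
    inserted_at: Instant,
}

impl CacheEntry {
    /// Entries older than the configured TTL count as misses, forcing a
    /// fresh embedding; a bypass option would skip the cache entirely.
    fn is_fresh(&self, ttl: Duration) -> bool {
        self.inserted_at.elapsed() < ttl
    }
}
```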
## Success Metrics
| Metric | Current | Target |
| ------------------------- | ------- | ------ |
| Duplicate implementations | 3 | 1 |
| Cache hit rate | 0% | >60% |
| Fallback availability | 0% | 100% |
| Cost per 10K embeddings | ~$0.20 | ~$0.05 |
## Provider Selection Guide
### Development
```rust
// Local, free, offline
let service = EmbeddingService::builder()
    .with_provider(FastEmbedProvider::small()?) // 384 dims
    .with_memory_cache()
    .build()?;
```
### Production (Quality)
```rust
// OpenAI with local fallback
let service = EmbeddingService::builder()
    .with_provider(OpenAiEmbeddingProvider::large()?) // 3072 dims
    .with_provider(FastEmbedProvider::large()?)       // Fallback (1024 dims)
    .with_memory_cache()
    .build()?;
```
### Production (Cost-Optimized)
```rust
// OpenAI small with fallback
let service = EmbeddingService::builder()
    .with_provider(OpenAiEmbeddingProvider::small()?) // 1536 dims
    .with_provider(OllamaEmbeddingProvider::nomic())  // Fallback (768 dims)
    .with_memory_cache()
    .build()?;
```
## Dimension Compatibility Matrix
| If using... | Can switch to... | CANNOT switch to... |
| ---------------------- | --------------------------- | ------------------- |
| fastembed small (384) | fastembed small, all-minilm | Any other |
| fastembed large (1024) | fastembed large | Any other |
| OpenAI small (1536) | OpenAI small, ada-002 | Any other |
| OpenAI large (3072) | OpenAI large | Any other |
**Rule**: Only switch between models with the SAME dimensions.
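The rule is cheap to enforce at startup; a sketch assuming the store records the dimension it was created with (`index_dimensions` is a hypothetical value, not an existing API):
```rust
/// Refuse to start when the active provider's output dimension does not
/// match the dimension the vector index was built with.
fn check_dimensions(
    provider: &dyn EmbeddingProvider,
    index_dimensions: usize,
) -> Result<(), EmbeddingError> {
    if provider.dimensions() != index_dimensions {
        return Err(EmbeddingError(format!(
            "provider emits {}-dim vectors but index was built with {}; re-indexing required",
            provider.dimensions(),
            index_dimensions
        )));
    }
    Ok(())
}
```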
## Implementation Priority
| Order | Feature | Reason |
| ----- | ----------------------- | -------------------------- |
| 1 | EmbeddingProvider trait | Foundation for everything |
| 2 | FastEmbed provider | Works without API keys |
| 3 | Memory cache | Biggest cost impact |
| 4 | VectorStore trait | Storage abstraction |
| 5 | SurrealDbStore | Kogral needs graph+vector |
| 6 | LanceDbStore | Provisioning/Vapora scale |
| 7 | OpenAI provider | Production |
| 8 | Ollama provider | Local fallback |
| 9 | Batch processing | Efficiency |
| 10 | Metrics | Observability |
## References
**Existing Implementations**:
- Kogral: `kogral-core/src/embeddings/`
- Vapora: `vapora-llm-router/src/embeddings.rs`
- Provisioning: `provisioning/platform/crates/rag/src/embeddings.rs`
**Target Location**: `stratumiops/crates/stratum-embeddings/`