# ADR-001: Stratum-Embeddings - Unified Embedding Library

## Status

**Proposed**

## Context

### Current State: Fragmented Implementations

The ecosystem has 3 independent embedding implementations:

| Project      | Location                              | Providers                     | Caching |
| ------------ | ------------------------------------- | ----------------------------- | ------- |
| Kogral       | `kogral-core/src/embeddings/`         | fastembed, rig-core (partial) | No      |
| Provisioning | `provisioning-rag/src/embeddings.rs`  | OpenAI direct                 | No      |
| Vapora       | `vapora-llm-router/src/embeddings.rs` | OpenAI, HuggingFace, Ollama   | No      |

### Identified Problems

#### 1. Duplicated Code

Each project reimplements:

- HTTP client for OpenAI embeddings
- JSON response parsing
- Error handling
- Token estimation

**Impact**: ~400 duplicated lines, inconsistent error handling.

#### 2. No Caching

Embeddings are regenerated every time:

```text
"What is Rust?" → OpenAI → 1536 dims → $0.00002
"What is Rust?" → OpenAI → 1536 dims → $0.00002 (same result)
"What is Rust?" → OpenAI → 1536 dims → $0.00002 (same result)
```

**Impact**: Unnecessary costs, additional latency, and more frequent rate-limit errors.

#### 3. No Fallback

If OpenAI fails, everything fails: there is no fallback to local alternatives (fastembed, Ollama).

**Impact**: Reduced availability; total dependency on a single provider.

#### 4. Silent Dimension Mismatch

Different providers produce embeddings with different dimensions:

| Provider  | Model                  | Dimensions |
| --------- | ---------------------- | ---------- |
| fastembed | bge-small-en           | 384        |
| fastembed | bge-large-en           | 1024       |
| OpenAI    | text-embedding-3-small | 1536       |
| OpenAI    | text-embedding-3-large | 3072       |
| Ollama    | nomic-embed-text       | 768        |

**Impact**: Corrupt vector indices if the provider changes.
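
A start-up dimension guard turns this silent corruption into a loud error. A minimal sketch; the function name and `String` error type are illustrative, not part of a decided API:

```rust
/// Refuse to serve queries when the active provider's dimensions do not
/// match the dimensions the vector index was created with.
fn check_dimensions(provider_dims: usize, index_dims: usize) -> Result<(), String> {
    if provider_dims != index_dims {
        return Err(format!(
            "dimension mismatch: provider emits {provider_dims}-dim vectors, \
             index stores {index_dims}-dim vectors; re-index before switching"
        ));
    }
    Ok(())
}
```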

#### 5. No Metrics

No visibility into usage, cache hit rate, latency per provider, or accumulated costs.

## Decision

Create `stratum-embeddings` as a unified crate that:

1. **Unifies** implementations from Kogral, Provisioning, and Vapora
2. **Adds caching** to avoid recomputing identical embeddings
3. **Implements fallback** between providers (cloud → local)
4. **Clearly documents** dimensions and limitations per provider
5. **Exposes metrics** for observability (see the sketch below)
6. **Provides VectorStore trait** with LanceDB and SurrealDB backends based on project needs
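
As an illustration of item 5, the counter set below covers the visibility gaps named in problem #5. A minimal sketch using plain atomics; the field names and units are assumptions, not a finalized schema:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Counters for the gaps in problem #5: usage, cache hit rate, and cost.
/// Per-provider latency histograms would sit alongside these.
#[derive(Default)]
pub struct EmbeddingMetrics {
    pub requests: AtomicU64,
    pub cache_hits: AtomicU64,
    pub provider_errors: AtomicU64,
    /// Micro-cents to stay integral; divide by 1e8 for dollars.
    pub cost_microcents: AtomicU64,
}

impl EmbeddingMetrics {
    pub fn hit_rate(&self) -> f64 {
        let requests = self.requests.load(Ordering::Relaxed);
        if requests == 0 {
            return 0.0;
        }
        self.cache_hits.load(Ordering::Relaxed) as f64 / requests as f64
    }
}
```
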
### Storage Backend Decision

Each project chooses its vector storage backend based on priority:

| Project      | Backend   | Priority       | Justification                                      |
| ------------ | --------- | -------------- | -------------------------------------------------- |
| Kogral       | SurrealDB | Graph richness | Knowledge Graph needs unified graph+vector queries |
| Provisioning | LanceDB   | Vector scale   | RAG with millions of document chunks               |
| Vapora       | LanceDB   | Vector scale   | Execution traces, pattern matching at scale        |

#### Why SurrealDB for Kogral

Kogral is a Knowledge Graph where relationships are the primary value.
With a hybrid architecture (LanceDB vectors + SurrealDB graph), a typical query would require three hops (sketched below):

1. LanceDB: vector search → candidate IDs
2. SurrealDB: graph filter on candidates → results
3. App layer: merge, re-rank, deduplication

**Accepted trade-off**: SurrealDB has worse pure-vector performance than LanceDB,
but Kogral's scale is limited by human curation of knowledge (typically 10K-100K concepts).
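
To make the overhead concrete, here is a purely illustrative sketch of those three hops; `lancedb_search` and `surrealdb_graph_filter` are hypothetical stubs, not real client APIs:

```rust
use std::collections::HashSet;

// Stubs standing in for the two databases; in the real hybrid design these
// would be network calls to LanceDB and SurrealDB respectively.
async fn lancedb_search(_query: &[f32], _limit: usize) -> Vec<(String, f32)> {
    vec![("concept:rust".into(), 0.92), ("concept:go".into(), 0.81)]
}
async fn surrealdb_graph_filter(ids: &[String]) -> HashSet<String> {
    ids.iter().cloned().collect() // pretend every candidate passes the graph predicate
}

/// The three hops every Kogral query would need under the rejected hybrid design.
async fn hybrid_lookup(query_vec: &[f32]) -> Vec<(String, f32)> {
    // Hop 1: vector search → scored candidate ids
    let candidates = lancedb_search(query_vec, 100).await;
    // Hop 2: graph filter on the candidate ids
    let ids: Vec<String> = candidates.iter().map(|(id, _)| id.clone()).collect();
    let surviving = surrealdb_graph_filter(&ids).await;
    // Hop 3: merge, re-rank, deduplicate in the application layer
    let mut results: Vec<(String, f32)> = candidates
        .into_iter()
        .filter(|(id, _)| surviving.contains(id))
        .collect();
    results.sort_by(|a, b| b.1.total_cmp(&a.1)); // re-rank by score
    results.dedup_by(|a, b| a.0 == b.0);         // deduplicate by id
    results
}
```
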
#### Why LanceDB for Provisioning and Vapora

| Aspect          | SurrealDB  | LanceDB              |
| --------------- | ---------- | -------------------- |
| Storage format  | Row-based  | Columnar (Lance)     |
| Vector index    | HNSW (RAM) | IVF-PQ (disk-native) |
| Practical scale | Millions   | Billions             |
| Compression     | ~1x        | ~32x (PQ)            |
| Zero-copy read  | No         | Yes                  |

### Architecture

```text
┌─────────────────────────────────────────────────────────────────┐
│                       stratum-embeddings                        │
├─────────────────────────────────────────────────────────────────┤
│  EmbeddingProvider trait                                        │
│  ├─ embed(text) → Vec<f32>                                      │
│  ├─ embed_batch(texts) → Vec<Vec<f32>>                          │
│  ├─ dimensions() → usize                                        │
│  └─ is_local() → bool                                           │
│                                                                 │
│  ┌───────────┐   ┌───────────┐   ┌───────────┐                  │
│  │ FastEmbed │   │  OpenAI   │   │  Ollama   │                  │
│  │  (local)  │   │  (cloud)  │   │  (local)  │                  │
│  └───────────┘   └───────────┘   └───────────┘                  │
│        └───────────────┬───────────────┘                        │
│                        ▼                                        │
│          EmbeddingCache (memory/disk)                           │
│                        │                                        │
│                        ▼                                        │
│                EmbeddingService                                 │
│                        │                                        │
│                        ▼                                        │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │ VectorStore trait                                        │   │
│  │ ├─ upsert(id, embedding, metadata)                       │   │
│  │ ├─ search(embedding, limit, filter) → Vec<Match>         │   │
│  │ └─ delete(id)                                            │   │
│  └──────────────────────────────────────────────────────────┘   │
│           │                              │                      │
│           ▼                              ▼                      │
│  ┌─────────────────┐            ┌─────────────────┐             │
│  │ SurrealDbStore  │            │  LanceDbStore   │             │
│  │    (Kogral)     │            │  (Prov/Vapora)  │             │
│  └─────────────────┘            └─────────────────┘             │
└─────────────────────────────────────────────────────────────────┘
```
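
In Rust, the two traits in the diagram could look roughly as follows. A minimal sketch assuming the `async-trait` crate; the error aliases, the `Match` fields, and the `filter` parameter type are illustrative assumptions, not a finalized API:

```rust
use async_trait::async_trait;

/// Hypothetical error aliases; a real crate would define proper error enums.
pub type EmbeddingError = Box<dyn std::error::Error + Send + Sync>;
pub type StoreError = Box<dyn std::error::Error + Send + Sync>;

/// A search hit: id, similarity score, and stored metadata.
pub struct Match {
    pub id: String,
    pub score: f32,
    pub metadata: serde_json::Value,
}

/// Provider abstraction from the diagram.
#[async_trait]
pub trait EmbeddingProvider: Send + Sync {
    async fn embed(&self, text: &str) -> Result<Vec<f32>, EmbeddingError>;
    async fn embed_batch(&self, texts: &[String]) -> Result<Vec<Vec<f32>>, EmbeddingError>;
    /// Output dimensions, used to detect index mismatches up front.
    fn dimensions(&self) -> usize;
    /// Local providers cost nothing and work offline, so they make good fallbacks.
    fn is_local(&self) -> bool;
}

/// Storage abstraction from the diagram, implemented by SurrealDbStore and LanceDbStore.
#[async_trait]
pub trait VectorStore: Send + Sync {
    async fn upsert(&self, id: &str, embedding: &[f32], metadata: serde_json::Value) -> Result<(), StoreError>;
    async fn search(&self, embedding: &[f32], limit: usize, filter: Option<&str>) -> Result<Vec<Match>, StoreError>;
    async fn delete(&self, id: &str) -> Result<(), StoreError>;
}
```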

## Rationale

### Why Caching is Critical

For a typical RAG system (10,000 chunks):

- **Without cache**: re-indexing and repeated queries multiply costs
- **With cache**: the first indexing pays; subsequent runs are cache hits

**Estimated savings**: 60-80% in embedding costs.
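
Cache hits depend on a deterministic key. One plausible scheme, sketched here under the assumption of content-addressing with the `sha2` crate, hashes provider, model, and text together (the model must be part of the key, or a model switch would serve wrong-dimension vectors from cache):

```rust
use sha2::{Digest, Sha256};

/// Deterministic cache key: identical (provider, model, text) triples
/// always map to the same cached embedding.
fn cache_key(provider: &str, model: &str, text: &str) -> String {
    let mut hasher = Sha256::new();
    for part in [provider, model, text] {
        hasher.update(part.as_bytes());
        hasher.update([0x1f]); // unit separator, prevents ambiguous concatenation
    }
    hasher
        .finalize()
        .iter()
        .map(|b| format!("{b:02x}"))
        .collect()
}
```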

### Why Fallback is Important

| Scenario          | Without Fallback | With Fallback       |
| ----------------- | ---------------- | ------------------- |
| OpenAI rate limit | ERROR            | → fastembed (local) |
| OpenAI downtime   | ERROR            | → Ollama (local)    |
| No internet       | ERROR            | → fastembed (local) |
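
Mechanically, fallback is ordered iteration over the configured providers. A minimal sketch against the `EmbeddingProvider` trait sketched earlier; the error handling is illustrative:

```rust
/// Try providers in configured order (cloud first, local last);
/// return the first successful embedding.
///
/// NOTE: mixing providers with different dimensions corrupts the index;
/// guard with a dimension check before writing fallback vectors.
async fn embed_with_fallback(
    providers: &[Box<dyn EmbeddingProvider>],
    text: &str,
) -> Result<Vec<f32>, EmbeddingError> {
    let mut last_err = None;
    for provider in providers {
        match provider.embed(text).await {
            Ok(embedding) => return Ok(embedding),
            // Rate limit, downtime, no network: remember the error, try the next one.
            Err(err) => last_err = Some(err),
        }
    }
    Err(last_err.unwrap_or_else(|| "no providers configured".into()))
}
```
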
### Why Local Providers First

For development: fastembed loads a local model (~100MB); no API keys required, no costs, works offline.

For production: OpenAI for quality, fastembed as fallback.

## Consequences

### Positive

1. Single source of truth for the entire ecosystem
2. 60-80% fewer embedding API calls (caching)
3. High availability with local providers (fallback)
4. Usage and cost metrics
5. Feature-gated: only compile what you need
6. Storage flexibility: the VectorStore trait allows choosing a backend per project

### Negative

1. **Dimension lock-in**: changing provider requires re-indexing
2. **Cache invalidation**: updated content may serve stale embeddings
3. **Model download**: fastembed downloads ~100MB on first use
4. **Storage lock-in per project**: Kogral is tied to SurrealDB, the others to LanceDB

### Mitigations

| Negative          | Mitigation                                     |
| ----------------- | ---------------------------------------------- |
| Dimension lock-in | Document clearly, warn on provider change      |
| Stale cache       | Configurable TTL, bypass option                |
| Model download    | Show progress, cache in ~/.cache/fastembed     |
| Storage lock-in   | Conscious decision based on project priorities |

## Success Metrics

| Metric                    | Current | Target |
| ------------------------- | ------- | ------ |
| Duplicate implementations | 3       | 1      |
| Cache hit rate            | 0%      | >60%   |
| Fallback availability     | 0%      | 100%   |
| Cost per 10K embeddings   | ~$0.20  | ~$0.05 |

The cost target follows from the figures above: at ~$0.00002 per embedding, 10K embeddings cost ~$0.20, and the 60-80% savings expected from caching bring that toward ~$0.05.

## Provider Selection Guide

### Development

```rust
// Local, free, offline
let service = EmbeddingService::builder()
    .with_provider(FastEmbedProvider::small()?) // 384 dims
    .with_memory_cache()
    .build()?;
```

### Production (Quality)

```rust
// OpenAI with local fallback
let service = EmbeddingService::builder()
    .with_provider(OpenAiEmbeddingProvider::large()?) // 3072 dims
    // Fallback (1024 dims): dimensions differ from the primary; see the
    // Dimension Compatibility Matrix before mixing results in one index
    .with_provider(FastEmbedProvider::large()?)
    .with_memory_cache()
    .build()?;
```

### Production (Cost-Optimized)

```rust
// OpenAI small with fallback
let service = EmbeddingService::builder()
    .with_provider(OpenAiEmbeddingProvider::small()?) // 1536 dims
    // Fallback (768 dims): dimensions differ from the primary; see the
    // Dimension Compatibility Matrix before mixing results in one index
    .with_provider(OllamaEmbeddingProvider::nomic())
    .with_memory_cache()
    .build()?;
```

## Dimension Compatibility Matrix

| If using...            | Can switch to...            | CANNOT switch to... |
| ---------------------- | --------------------------- | ------------------- |
| fastembed small (384)  | fastembed small, all-minilm | Any other           |
| fastembed large (1024) | fastembed large             | Any other           |
| OpenAI small (1536)    | OpenAI small, ada-002       | Any other           |
| OpenAI large (3072)    | OpenAI large                | Any other           |

**Rule**: Only switch between models with the SAME dimensions. Even then, treat a model switch as a re-indexing event: matching dimensions keep the index schema valid, but vectors from different models occupy different embedding spaces and are not directly comparable.

## Implementation Priority

| Order | Feature                 | Reason                    |
| ----- | ----------------------- | ------------------------- |
| 1     | EmbeddingProvider trait | Foundation for everything |
| 2     | FastEmbed provider      | Works without API keys    |
| 3     | Memory cache            | Biggest cost impact       |
| 4     | VectorStore trait       | Storage abstraction       |
| 5     | SurrealDbStore          | Kogral needs graph+vector |
| 6     | LanceDbStore            | Provisioning/Vapora scale |
| 7     | OpenAI provider         | Production                |
| 8     | Ollama provider         | Local fallback            |
| 9     | Batch processing        | Efficiency                |
| 10    | Metrics                 | Observability             |

## References

**Existing Implementations**:

- Kogral: `kogral-core/src/embeddings/`
- Vapora: `vapora-llm-router/src/embeddings.rs`
- Provisioning: `provisioning/platform/crates/rag/src/embeddings.rs`

**Target Location**: `stratumiops/crates/stratum-embeddings/`