# ADR-001: Stratum-Embeddings - Unified Embedding Library
## Status
**Proposed**
## Context
### Current State: Fragmented Implementations
The ecosystem has 3 independent embedding implementations:
| Project | Location | Providers | Caching |
| ------------ | ------------------------------------- | ----------------------------- | ------- |
| Kogral | `kogral-core/src/embeddings/` | fastembed, rig-core (partial) | No |
| Provisioning | `provisioning-rag/src/embeddings.rs` | OpenAI direct | No |
| Vapora | `vapora-llm-router/src/embeddings.rs` | OpenAI, HuggingFace, Ollama | No |
### Identified Problems
#### 1. Duplicated Code
Each project reimplements:
- HTTP client for OpenAI embeddings
- JSON response parsing
- Error handling
- Token estimation
**Impact**: ~400 duplicated lines, inconsistent error handling.
#### 2. No Caching
Embeddings are regenerated on every call:
```text
"What is Rust?" → OpenAI → 1536 dims → $0.00002
"What is Rust?" → OpenAI → 1536 dims → $0.00002 (same result)
"What is Rust?" → OpenAI → 1536 dims → $0.00002 (same result)
```
**Impact**: Unnecessary costs, additional latency, more frequent rate limits.
#### 3. No Fallback
If OpenAI fails, everything fails. No fallback to local alternatives (fastembed, Ollama).
**Impact**: Reduced availability, total dependency on one provider.
#### 4. Silent Dimension Mismatch
Different providers produce different dimensions:
| Provider | Model | Dimensions |
| --------- | ---------------------- | ---------- |
| fastembed | bge-small-en | 384 |
| fastembed | bge-large-en | 1024 |
| OpenAI | text-embedding-3-small | 1536 |
| OpenAI | text-embedding-3-large | 3072 |
| Ollama | nomic-embed-text | 768 |
**Impact**: Corrupt vector indices if provider changes.
#### 5. No Metrics
No visibility into usage, cache hit rate, latency per provider, or accumulated costs.
## Decision
Create `stratum-embeddings` as a unified crate that:
1. **Unifies** implementations from Kogral, Provisioning, and Vapora
2. **Adds caching** to avoid recomputing identical embeddings
3. **Implements fallback** between providers (cloud → local)
4. **Clearly documents** dimensions and limitations per provider
5. **Exposes metrics** for observability
6. **Provides VectorStore trait** with LanceDB and SurrealDB backends based on project needs
### Storage Backend Decision
Each project chooses its vector storage backend based on priority:
| Project | Backend | Priority | Justification |
| ------------ | --------- | -------------- | -------------------------------------------------- |
| Kogral | SurrealDB | Graph richness | Knowledge Graph needs unified graph+vector queries |
| Provisioning | LanceDB | Vector scale | RAG with millions of document chunks |
| Vapora | LanceDB | Vector scale | Execution traces, pattern matching at scale |
#### Why SurrealDB for Kogral
Kogral is a Knowledge Graph where relationships are the primary value.
With a hybrid architecture (LanceDB for vectors + SurrealDB for the graph), a typical query would require three steps:
1. LanceDB: vector search → candidate_ids
2. SurrealDB: graph filter on candidates → results
3. App layer: merge, re-rank, deduplicate
With SurrealDB alone, the same query runs as a single unified graph+vector statement.
**Accepted trade-off**: SurrealDB has worse pure vector performance than LanceDB,
but Kogral's scale is limited by human curation of knowledge (typically 10K-100K concepts), where the difference is negligible.
#### Why LanceDB for Provisioning and Vapora
| Aspect | SurrealDB | LanceDB |
| --------------- | ---------- | -------------------- |
| Storage format | Row-based | Columnar (Lance) |
| Vector index | HNSW (RAM) | IVF-PQ (disk-native) |
| Practical scale | Millions | Billions |
| Compression | ~1x | ~32x (PQ) |
| Zero-copy read | No | Yes |
### Architecture
```text
┌─────────────────────────────────────────────────────────────────┐
│ stratum-embeddings │
├─────────────────────────────────────────────────────────────────┤
│ EmbeddingProvider trait │
│   ├─ embed(text) → Vec<f32>                                     │
│ ├─ embed_batch(texts) → Vec<Vec<f32>> │
│ ├─ dimensions() → usize │
│ └─ is_local() → bool │
│ │
│ ┌───────────┐ ┌───────────┐ ┌───────────┐ │
│ │ FastEmbed │ │ OpenAI │ │ Ollama │ │
│ │ (local) │ │ (cloud) │ │ (local) │ │
│ └───────────┘ └───────────┘ └───────────┘ │
│ └────────────┬────────────┘ │
│ ▼ │
│ EmbeddingCache (memory/disk) │
│ │ │
│ ▼ │
│ EmbeddingService │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────┐ │
│ │ VectorStore trait │ │
│ │ ├─ upsert(id, embedding, metadata) │ │
│ │ ├─ search(embedding, limit, filter) → Vec<Match> │ │
│ │ └─ delete(id) │ │
│ └──────────────────────────────────────────────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ SurrealDbStore │ │ LanceDbStore │ │
│ │ (Kogral) │ │ (Prov/Vapora) │ │
│ └─────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
```
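In Rust, the two traits from the diagram translate roughly as follows. This is a minimal sketch of the drawn signatures; the `async_trait` usage, the error type, and the `Match`/metadata shapes are assumptions, not the final API:
```rust
use async_trait::async_trait;

/// Single-variant error to keep the sketch small; the real crate would
/// define a richer error enum.
#[derive(Debug)]
pub struct EmbeddingError(pub String);

/// Provider trait, mirroring the diagram above.
#[async_trait]
pub trait EmbeddingProvider: Send + Sync {
    /// Embed one text into a fixed-size vector.
    async fn embed(&self, text: &str) -> Result<Vec<f32>, EmbeddingError>;
    /// Embed many texts in one round trip where the backend supports it.
    async fn embed_batch(&self, texts: &[String]) -> Result<Vec<Vec<f32>>, EmbeddingError>;
    /// Output dimensionality (e.g. 384 for bge-small-en, 1536 for text-embedding-3-small).
    fn dimensions(&self) -> usize;
    /// True for providers that need no network access (fastembed, Ollama).
    fn is_local(&self) -> bool;
}

/// One search hit; the metadata shape is an assumption.
pub struct Match {
    pub id: String,
    pub score: f32,
    pub metadata: serde_json::Value,
}

/// Storage trait implemented by SurrealDbStore and LanceDbStore.
#[async_trait]
pub trait VectorStore: Send + Sync {
    async fn upsert(&self, id: &str, embedding: &[f32], metadata: serde_json::Value)
        -> Result<(), EmbeddingError>;
    async fn search(&self, embedding: &[f32], limit: usize, filter: Option<serde_json::Value>)
        -> Result<Vec<Match>, EmbeddingError>;
    async fn delete(&self, id: &str) -> Result<(), EmbeddingError>;
}
```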
## Rationale
### Why Caching is Critical
For a typical RAG system (10,000 chunks):
- **Without cache**: every re-index and every repeated query pays the provider again
- **With cache**: only the first indexing pass pays; re-indexes and repeated queries are cache hits
**Estimated savings**: 60-80% of embedding costs.
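A minimal sketch of the memory cache, assuming entries are keyed by (model, text) so that vectors produced by different models never collide; all names are illustrative:
```rust
use std::collections::{hash_map::DefaultHasher, HashMap};
use std::hash::{Hash, Hasher};

/// Derive the cache key from (model, text): the same text embedded by a
/// different model must never produce a hit.
fn cache_key(model: &str, text: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    (model, text).hash(&mut hasher);
    hasher.finish()
}

/// In-memory cache: "What is Rust?" pays the provider once, then hits.
struct MemoryCache {
    entries: HashMap<u64, Vec<f32>>,
}

impl MemoryCache {
    fn get(&self, model: &str, text: &str) -> Option<&[f32]> {
        self.entries.get(&cache_key(model, text)).map(|v| v.as_slice())
    }

    fn put(&mut self, model: &str, text: &str, embedding: Vec<f32>) {
        self.entries.insert(cache_key(model, text), embedding);
    }
}
```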
### Why Fallback is Important
| Scenario | Without Fallback | With Fallback |
| ----------------- | ---------------- | -------------------- |
| OpenAI rate limit | ERROR | → fastembed (local) |
| OpenAI downtime | ERROR | → Ollama (local) |
| No internet | ERROR | → fastembed (local) |
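The chain itself can be as simple as trying providers in registration order; a sketch assuming the `EmbeddingProvider` trait and error type sketched under Architecture. Note that vectors from a fallback provider with different dimensions must not be written into the same index (see the compatibility matrix below):
```rust
/// Try each configured provider in order; the first success wins.
async fn embed_with_fallback(
    providers: &[Box<dyn EmbeddingProvider>],
    text: &str,
) -> Result<Vec<f32>, EmbeddingError> {
    let mut last_err = EmbeddingError("no providers configured".into());
    for provider in providers {
        match provider.embed(text).await {
            Ok(embedding) => return Ok(embedding),
            // Rate limit, downtime, or no internet: fall through to the
            // next provider (typically a local one).
            Err(err) => last_err = err,
        }
    }
    Err(last_err)
}
```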
### Why Local Providers First
For development: fastembed loads a local model (~100 MB), requires no API keys, costs nothing, and works offline.
For production: OpenAI for quality, with fastembed as fallback.
## Consequences
### Positive
1. Single source of truth for the entire ecosystem
2. 60-80% fewer embedding API calls (caching)
3. High availability with local providers (fallback)
4. Usage and cost metrics
5. Feature-gated: only compile what you need
6. Storage flexibility: VectorStore trait allows choosing backend per project
### Negative
1. **Dimension lock-in**: Changing provider requires re-indexing
2. **Cache invalidation**: Updated content may serve stale embeddings
3. **Model download**: fastembed downloads ~100MB on first use
4. **Storage lock-in per project**: Kogral tied to SurrealDB, others to LanceDB
### Mitigations
| Negative | Mitigation |
| ----------------- | ---------------------------------------------- |
| Dimension lock-in | Document clearly, warn on provider change |
| Stale cache | Configurable TTL, bypass option |
| Model download | Show progress, cache in ~/.cache/fastembed |
| Storage lock-in | Conscious decision based on project priorities |
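For the stale-cache mitigation, the TTL can live on each cache entry; a sketch with illustrative names:
```rust
use std::time::{Duration, Instant};

/// Cache entry that records its insertion time so lookups can enforce a TTL.
struct CacheEntry {
    embedding: Vec<f32>,
    inserted_at: Instant,
}

impl CacheEntry {
    /// Entries older than the configured TTL count as misses, forcing a
    /// fresh embedding; a bypass option would skip the cache entirely.
    fn is_fresh(&self, ttl: Duration) -> bool {
        self.inserted_at.elapsed() < ttl
    }
}
```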
## Success Metrics
| Metric | Current | Target |
| ------------------------- | ------- | ------ |
| Duplicate implementations | 3 | 1 |
| Cache hit rate | 0% | >60% |
| Fallback availability | 0% | 100% |
| Cost per 10K embeddings | ~$0.20 | ~$0.05 |
## Provider Selection Guide
### Development
```rust
// Local, free, offline
let service = EmbeddingService::builder()
    .with_provider(FastEmbedProvider::small()?) // 384 dims
    .with_memory_cache()
    .build()?;
```
### Production (Quality)
```rust
// OpenAI with local fallback
let service = EmbeddingService::builder()
    .with_provider(OpenAiEmbeddingProvider::large()?) // 3072 dims
    .with_provider(FastEmbedProvider::large()?)       // Fallback (1024 dims)
    .with_memory_cache()
    .build()?;
```
### Production (Cost-Optimized)
```rust
// OpenAI small with fallback
let service = EmbeddingService::builder()
    .with_provider(OpenAiEmbeddingProvider::small()?) // 1536 dims
    .with_provider(OllamaEmbeddingProvider::nomic())  // Fallback (768 dims)
    .with_memory_cache()
    .build()?;
```
## Dimension Compatibility Matrix
| If using... | Can switch to... | CANNOT switch to... |
| ---------------------- | --------------------------- | ------------------- |
| fastembed small (384) | fastembed small, all-minilm | Any other |
| fastembed large (1024) | fastembed large | Any other |
| OpenAI small (1536) | OpenAI small, ada-002 | Any other |
| OpenAI large (3072) | OpenAI large | Any other |
**Rule**: Only switch between models with the SAME dimensions.
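The rule is cheap to enforce at startup; a sketch assuming the store records the dimension it was created with (`index_dimensions` is a hypothetical value, not an existing API):
```rust
/// Refuse to start when the active provider's output dimension does not
/// match the dimension the vector index was built with.
fn check_dimensions(
    provider: &dyn EmbeddingProvider,
    index_dimensions: usize,
) -> Result<(), EmbeddingError> {
    if provider.dimensions() != index_dimensions {
        return Err(EmbeddingError(format!(
            "provider emits {}-dim vectors but index was built with {}; re-indexing required",
            provider.dimensions(),
            index_dimensions
        )));
    }
    Ok(())
}
```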
## Implementation Priority
| Order | Feature | Reason |
| ----- | ----------------------- | -------------------------- |
| 1 | EmbeddingProvider trait | Foundation for everything |
| 2 | FastEmbed provider | Works without API keys |
| 3 | Memory cache | Biggest cost impact |
| 4 | VectorStore trait | Storage abstraction |
| 5 | SurrealDbStore | Kogral needs graph+vector |
| 6 | LanceDbStore | Provisioning/Vapora scale |
| 7 | OpenAI provider | Production |
| 8 | Ollama provider | Local fallback |
| 9 | Batch processing | Efficiency |
| 10 | Metrics | Observability |
## References
**Existing Implementations**:
- Kogral: `kogral-core/src/embeddings/`
- Vapora: `vapora-llm-router/src/embeddings.rs`
- Provisioning: `provisioning/platform/crates/rag/src/embeddings.rs`
**Target Location**: `stratumiops/crates/stratum-embeddings/`