ADR-001: Stratum-Embeddings - Unified Embedding Library

Status

Proposed

Context

Current State: Fragmented Implementations

The ecosystem has 3 independent embedding implementations:

| Project | Location | Providers | Caching |
|---|---|---|---|
| Kogral | kogral-core/src/embeddings/ | fastembed, rig-core (partial) | No |
| Provisioning | provisioning-rag/src/embeddings.rs | OpenAI direct | No |
| Vapora | vapora-llm-router/src/embeddings.rs | OpenAI, HuggingFace, Ollama | No |

Identified Problems

1. Duplicated Code

Each project reimplements:

  • HTTP client for OpenAI embeddings
  • JSON response parsing
  • Error handling
  • Token estimation

Impact: ~400 duplicated lines, inconsistent error handling.
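
For context, the block each project currently duplicates looks roughly like the following (a minimal sketch using reqwest, serde_json, and anyhow; the request and response fields follow the public OpenAI embeddings API, while the helper name and error handling are illustrative only):

```rust
use serde_json::json;

/// Illustrative sketch of the HTTP + parsing code each project reimplements.
/// Endpoint and JSON field names follow the public OpenAI embeddings API;
/// retries, token estimation, and detailed error types are omitted.
async fn openai_embed(api_key: &str, text: &str) -> anyhow::Result<Vec<f32>> {
    let resp: serde_json::Value = reqwest::Client::new()
        .post("https://api.openai.com/v1/embeddings")
        .bearer_auth(api_key)
        .json(&json!({ "model": "text-embedding-3-small", "input": text }))
        .send()
        .await?
        .error_for_status()?
        .json()
        .await?;

    // Extract `data[0].embedding` as a Vec<f32>.
    resp["data"][0]["embedding"]
        .as_array()
        .ok_or_else(|| anyhow::anyhow!("unexpected response shape"))?
        .iter()
        .map(|v| v.as_f64().map(|f| f as f32))
        .collect::<Option<Vec<f32>>>()
        .ok_or_else(|| anyhow::anyhow!("non-numeric embedding value"))
}
```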

2. No Caching

Embeddings are regenerated on every request:

"What is Rust?" → OpenAI → 1536 dims → $0.00002
"What is Rust?" → OpenAI → 1536 dims → $0.00002 (same result)
"What is Rust?" → OpenAI → 1536 dims → $0.00002 (same result)

Impact: Unnecessary costs, additional latency, more frequent rate limits.
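
Conceptually, a cache keyed on the model plus the text content removes the repeats (a minimal sketch; a real implementation would likely use a stable content hash such as SHA-256 and a persistent store rather than std's DefaultHasher and an in-memory HashMap):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Key the cache on (model, text) so identical requests map to one stored vector.
fn cache_key(model: &str, text: &str) -> u64 {
    let mut h = DefaultHasher::new();
    (model, text).hash(&mut h);
    h.finish()
}

fn embed_cached(
    cache: &mut HashMap<u64, Vec<f32>>,
    model: &str,
    text: &str,
    embed_remote: impl Fn(&str) -> Vec<f32>, // stand-in for the real API call
) -> Vec<f32> {
    cache
        .entry(cache_key(model, text))
        .or_insert_with(|| embed_remote(text)) // only the first call pays
        .clone()
}
```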

3. No Fallback

If OpenAI fails, everything fails. No fallback to local alternatives (fastembed, Ollama).

Impact: Reduced availability, total dependency on one provider.

4. Silent Dimension Mismatch

Different providers produce different dimensions:

| Provider | Model | Dimensions |
|---|---|---|
| fastembed | bge-small-en | 384 |
| fastembed | bge-large-en | 1024 |
| OpenAI | text-embedding-3-small | 1536 |
| OpenAI | text-embedding-3-large | 3072 |
| Ollama | nomic-embed-text | 768 |

Impact: Switching providers silently corrupts vector indices, since stored and query vectors no longer share the same dimensionality.

5. No Metrics

No visibility into usage, cache hit rate, latency per provider, or accumulated costs.
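
For concreteness, the observability surface the unified crate could expose to close this gap might look like the following (field names are illustrative only, not part of this decision):

```rust
use std::collections::HashMap;

/// Illustrative metrics snapshot; nothing here is final.
#[derive(Debug, Default, Clone)]
pub struct EmbeddingMetrics {
    pub requests_total: u64,
    pub cache_hits: u64,
    pub cache_misses: u64,
    pub fallbacks_triggered: u64,
    pub latency_ms_by_provider: HashMap<String, f64>,
    pub estimated_cost_usd: f64,
}

impl EmbeddingMetrics {
    /// Cache hit rate in [0, 1]; 0 when no lookups have happened yet.
    pub fn cache_hit_rate(&self) -> f64 {
        let total = (self.cache_hits + self.cache_misses) as f64;
        if total == 0.0 { 0.0 } else { self.cache_hits as f64 / total }
    }
}
```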

Decision

Create stratum-embeddings as a unified crate that:

  1. Unifies implementations from Kogral, Provisioning, and Vapora
  2. Adds caching to avoid recomputing identical embeddings
  3. Implements fallback between providers (cloud → local)
  4. Clearly documents dimensions and limitations per provider
  5. Exposes metrics for observability
  6. Provides a VectorStore trait with LanceDB and SurrealDB backends, chosen per project's needs

Storage Backend Decision

Each project chooses its vector storage backend based on priority:

| Project | Backend | Priority | Justification |
|---|---|---|---|
| Kogral | SurrealDB | Graph richness | Knowledge Graph needs unified graph+vector queries |
| Provisioning | LanceDB | Vector scale | RAG with millions of document chunks |
| Vapora | LanceDB | Vector scale | Execution traces, pattern matching at scale |

Why SurrealDB for Kogral

Kogral is a Knowledge Graph where relationships are the primary value. With a hybrid architecture (vectors in LanceDB, graph in SurrealDB), a typical query would require:

  1. LanceDB: vector search → candidate_ids
  2. SurrealDB: graph filter on candidates → results
  3. App layer: merge, re-rank, deduplication

Accepted trade-off: SurrealDB has worse pure vector performance than LanceDB, but Kogral's scale is limited by human curation of knowledge (typically 10K-100K concepts).
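
By contrast, keeping both the vectors and the graph in SurrealDB lets a single query combine KNN search with graph traversal. A rough sketch with the surrealdb Rust crate (assumes the ws feature, an existing vector index on the embedding field, and a hypothetical concept table with related_to edges; authentication is omitted):

```rust
use surrealdb::engine::remote::ws::Ws;
use surrealdb::Surreal;

// Rough sketch: one SurrealQL query does vector KNN + graph traversal,
// with no app-side merge/re-rank step. Schema names are hypothetical.
async fn related_concepts(query_vec: Vec<f32>) -> anyhow::Result<Vec<serde_json::Value>> {
    let db = Surreal::new::<Ws>("127.0.0.1:8000").await?;
    db.use_ns("kogral").use_db("kg").await?;

    let mut res = db
        .query(
            "SELECT id, title, ->related_to->concept.title AS neighbors \
             FROM concept WHERE embedding <|10|> $vec",
        )
        .bind(("vec", query_vec))
        .await?;

    Ok(res.take(0)?)
}
```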

Why LanceDB for Provisioning and Vapora

| Aspect | SurrealDB | LanceDB |
|---|---|---|
| Storage format | Row-based | Columnar (Lance) |
| Vector index | HNSW (RAM) | IVF-PQ (disk-native) |
| Practical scale | Millions | Billions |
| Compression | ~1x | ~32x (PQ) |
| Zero-copy read | No | Yes |

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                      stratum-embeddings                          │
├─────────────────────────────────────────────────────────────────┤
│  EmbeddingProvider trait                                         │
│  ├─ embed(text) → Vec<f32>                                      │
│  ├─ embed_batch(texts) → Vec<Vec<f32>>                          │
│  ├─ dimensions() → usize                                        │
│  └─ is_local() → bool                                           │
│                                                                  │
│  ┌───────────┐ ┌───────────┐ ┌───────────┐                      │
│  │ FastEmbed │ │  OpenAI   │ │  Ollama   │                      │
│  │  (local)  │ │  (cloud)  │ │  (local)  │                      │
│  └───────────┘ └───────────┘ └───────────┘                      │
│         └────────────┬────────────┘                              │
│                      ▼                                           │
│              EmbeddingCache (memory/disk)                        │
│                      │                                           │
│                      ▼                                           │
│             EmbeddingService                                     │
│                      │                                           │
│                      ▼                                           │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                   VectorStore trait                       │   │
│  │  ├─ upsert(id, embedding, metadata)                      │   │
│  │  ├─ search(embedding, limit, filter) → Vec<Match>        │   │
│  │  └─ delete(id)                                           │   │
│  └──────────────────────────────────────────────────────────┘   │
│         │                                    │                   │
│         ▼                                    ▼                   │
│  ┌─────────────────┐              ┌─────────────────┐           │
│  │  SurrealDbStore │              │   LanceDbStore  │           │
│  │  (Kogral)       │              │  (Prov/Vapora)  │           │
│  └─────────────────┘              └─────────────────┘           │
└─────────────────────────────────────────────────────────────────┘
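
In Rust terms, the two traits in the diagram could be sketched roughly as follows (error, filter, and metadata types are placeholders; the real signatures are to be settled during implementation):

```rust
use async_trait::async_trait;

#[async_trait]
pub trait EmbeddingProvider: Send + Sync {
    async fn embed(&self, text: &str) -> Result<Vec<f32>, EmbeddingError>;
    async fn embed_batch(&self, texts: &[String]) -> Result<Vec<Vec<f32>>, EmbeddingError>;
    fn dimensions(&self) -> usize;
    fn is_local(&self) -> bool;
}

#[async_trait]
pub trait VectorStore: Send + Sync {
    async fn upsert(
        &self,
        id: &str,
        embedding: &[f32],
        metadata: serde_json::Value,
    ) -> Result<(), EmbeddingError>;
    async fn search(
        &self,
        embedding: &[f32],
        limit: usize,
        filter: Option<Filter>,
    ) -> Result<Vec<Match>, EmbeddingError>;
    async fn delete(&self, id: &str) -> Result<(), EmbeddingError>;
}

// Placeholder types so the sketch stands alone; real definitions TBD.
#[derive(Debug, thiserror::Error)]
#[error("{0}")]
pub struct EmbeddingError(pub String);

pub struct Filter; // e.g. metadata predicates

pub struct Match {
    pub id: String,
    pub score: f32,
    pub metadata: serde_json::Value,
}
```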

Rationale

Why Caching is Critical

For a typical RAG system (10,000 chunks):

  • Without cache: every re-indexing pass and every repeated query pays the full API cost again
  • With cache: only the first indexing pass pays; identical texts afterwards are cache hits

Estimated savings: 60-80% in embedding costs.
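
As a rough worked example using the per-request figure from the Context section: 10,000 chunks × ~$0.00002 per embedding ≈ $0.20 for a full indexing pass; with a 60-80% cache hit rate on subsequent runs and queries, the recurring cost drops to roughly $0.04-$0.08 per 10,000 requests, which is the range behind the ~$0.05 target in Success Metrics below.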

Why Fallback is Important

| Scenario | Without Fallback | With Fallback |
|---|---|---|
| OpenAI rate limit | ERROR | → fastembed (local) |
| OpenAI downtime | ERROR | → Ollama (local) |
| No internet | ERROR | → fastembed (local) |
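
The fallback itself can be as simple as trying providers in registration order and returning the first success (a minimal sketch reusing the hypothetical EmbeddingProvider trait and EmbeddingError type from the Architecture section):

```rust
// Try providers in order (e.g. cloud first, local last); return the first success.
async fn embed_with_fallback(
    providers: &[Box<dyn EmbeddingProvider>],
    text: &str,
) -> Result<Vec<f32>, EmbeddingError> {
    let mut last_err = EmbeddingError("no providers configured".into());
    for provider in providers {
        match provider.embed(text).await {
            Ok(vec) => return Ok(vec),
            // Rate limit, downtime, or no network: remember the error and fall through.
            Err(e) => last_err = e,
        }
    }
    Err(last_err)
}
```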

Why Local Providers First

For development: fastembed loads a local model (~100 MB), requires no API keys, incurs no cost, and works offline.

For production: OpenAI for quality, fastembed as fallback.

Consequences

Positive

  1. Single source of truth for the entire ecosystem
  2. 60-80% fewer embedding API calls (caching)
  3. High availability with local providers (fallback)
  4. Usage and cost metrics
  5. Feature-gated: only compile what you need
  6. Storage flexibility: VectorStore trait allows choosing backend per project

Negative

  1. Dimension lock-in: Changing provider requires re-indexing
  2. Cache invalidation: Updated content may serve stale embeddings
  3. Model download: fastembed downloads ~100MB on first use
  4. Storage lock-in per project: Kogral tied to SurrealDB, others to LanceDB

Mitigations

| Negative | Mitigation |
|---|---|
| Dimension lock-in | Document clearly, warn on provider change |
| Stale cache | Configurable TTL, bypass option |
| Model download | Show progress, cache in ~/.cache/fastembed |
| Storage lock-in | Conscious decision based on project priorities |

Success Metrics

| Metric | Current | Target |
|---|---|---|
| Duplicate implementations | 3 | 1 |
| Cache hit rate | 0% | >60% |
| Fallback availability | 0% | 100% |
| Cost per 10K embeddings | ~$0.20 | ~$0.05 |

Provider Selection Guide

Development

// Local, free, offline
let service = EmbeddingService::builder()
    .with_provider(FastEmbedProvider::small()?)  // 384 dims
    .with_memory_cache()
    .build()?;

Production (Quality)

// OpenAI with local fallback
let service = EmbeddingService::builder()
    .with_provider(OpenAiEmbeddingProvider::large()?)  // 3072 dims
    .with_provider(FastEmbedProvider::large()?)        // Fallback
    .with_memory_cache()
    .build()?;

Production (Cost-Optimized)

// OpenAI small with fallback
let service = EmbeddingService::builder()
    .with_provider(OpenAiEmbeddingProvider::small()?)  // 1536 dims
    .with_provider(OllamaEmbeddingProvider::nomic())   // Fallback
    .with_memory_cache()
    .build()?;
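
Whichever configuration is chosen, usage would look the same. A hypothetical call site, assuming the service mirrors the embed/embed_batch methods of the EmbeddingProvider trait (the exact service API is not fixed by this ADR):

```rust
// Hypothetical usage once a service is built.
let embedding: Vec<f32> = service.embed("What is Rust?").await?;
let batch: Vec<Vec<f32>> = service
    .embed_batch(&["chunk one".into(), "chunk two".into()])
    .await?;
```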

Dimension Compatibility Matrix

| If using... | Can switch to... | CANNOT switch to... |
|---|---|---|
| fastembed small (384) | fastembed small, all-minilm | Any other |
| fastembed large (1024) | fastembed large | Any other |
| OpenAI small (1536) | OpenAI small, ada-002 | Any other |
| OpenAI large (3072) | OpenAI large | Any other |

Rule: Only switch between models with the SAME dimensions.
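
A cheap way to enforce this rule (and the "warn on provider change" mitigation above) is a startup check comparing the configured provider against the existing index; a minimal sketch:

```rust
// Refuse to reuse an index whose stored dimension differs from the provider's.
fn check_dimensions(provider_dims: usize, index_dims: usize) -> Result<(), String> {
    if provider_dims != index_dims {
        return Err(format!(
            "provider produces {provider_dims}-dim vectors, but the index stores \
             {index_dims}-dim vectors; re-indexing is required"
        ));
    }
    Ok(())
}
```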

Implementation Priority

| Order | Feature | Reason |
|---|---|---|
| 1 | EmbeddingProvider trait | Foundation for everything |
| 2 | FastEmbed provider | Works without API keys |
| 3 | Memory cache | Biggest cost impact |
| 4 | VectorStore trait | Storage abstraction |
| 5 | SurrealDbStore | Kogral needs graph+vector |
| 6 | LanceDbStore | Provisioning/Vapora scale |
| 7 | OpenAI provider | Production |
| 8 | Ollama provider | Local fallback |
| 9 | Batch processing | Efficiency |
| 10 | Metrics | Observability |

References

Existing Implementations:

  • Kogral: kogral-core/src/embeddings/
  • Vapora: vapora-llm-router/src/embeddings.rs
  • Provisioning: provisioning/platform/crates/rag/src/embeddings.rs

Target Location: stratumiops/crates/stratum-embeddings/