ADR-001: Stratum-Embeddings - Unified Embedding Library

Status

Proposed

Context

Current State: Fragmented Implementations

The ecosystem has 3 independent embedding implementations:

| Project | Location | Providers | Caching |
|---|---|---|---|
| Kogral | kogral-core/src/embeddings/ | fastembed, rig-core (partial) | No |
| Provisioning | provisioning-rag/src/embeddings.rs | OpenAI direct | No |
| Vapora | vapora-llm-router/src/embeddings.rs | OpenAI, HuggingFace, Ollama | No |

Identified Problems

1. Duplicated Code

Each project reimplements:

  • HTTP client for OpenAI embeddings
  • JSON response parsing
  • Error handling
  • Token estimation

Impact: ~400 duplicated lines, inconsistent error handling.
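
For context, the block each project currently duplicates looks roughly like the following (a minimal sketch using reqwest, serde_json, and anyhow; the request and response fields follow the public OpenAI embeddings API, while the helper name and error handling are illustrative only):

```rust
use serde_json::json;

/// Illustrative sketch of the HTTP + parsing code each project reimplements.
/// Endpoint and JSON field names follow the public OpenAI embeddings API;
/// retries, token estimation, and detailed error types are omitted.
async fn openai_embed(api_key: &str, text: &str) -> anyhow::Result<Vec<f32>> {
    let resp: serde_json::Value = reqwest::Client::new()
        .post("https://api.openai.com/v1/embeddings")
        .bearer_auth(api_key)
        .json(&json!({ "model": "text-embedding-3-small", "input": text }))
        .send()
        .await?
        .error_for_status()?
        .json()
        .await?;

    // Extract `data[0].embedding` as a Vec<f32>.
    resp["data"][0]["embedding"]
        .as_array()
        .ok_or_else(|| anyhow::anyhow!("unexpected response shape"))?
        .iter()
        .map(|v| v.as_f64().map(|f| f as f32))
        .collect::<Option<Vec<f32>>>()
        .ok_or_else(|| anyhow::anyhow!("non-numeric embedding value"))
}
```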

2. No Caching

Embeddings are regenerated on every request:

"What is Rust?" → OpenAI → 1536 dims → $0.00002
"What is Rust?" → OpenAI → 1536 dims → $0.00002 (same result)
"What is Rust?" → OpenAI → 1536 dims → $0.00002 (same result)

Impact: Unnecessary costs, additional latency, more frequent rate limits.
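
Conceptually, a cache keyed on the model plus the text content removes the repeats (a minimal sketch; a real implementation would likely use a stable content hash such as SHA-256 and a persistent store rather than std's DefaultHasher and an in-memory HashMap):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Key the cache on (model, text) so identical requests map to one stored vector.
fn cache_key(model: &str, text: &str) -> u64 {
    let mut h = DefaultHasher::new();
    (model, text).hash(&mut h);
    h.finish()
}

fn embed_cached(
    cache: &mut HashMap<u64, Vec<f32>>,
    model: &str,
    text: &str,
    embed_remote: impl Fn(&str) -> Vec<f32>, // stand-in for the real API call
) -> Vec<f32> {
    cache
        .entry(cache_key(model, text))
        .or_insert_with(|| embed_remote(text)) // only the first call pays
        .clone()
}
```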

3. No Fallback

If OpenAI fails, everything fails. No fallback to local alternatives (fastembed, Ollama).

Impact: Reduced availability, total dependency on one provider.

4. Silent Dimension Mismatch

Different providers produce different dimensions:

| Provider | Model | Dimensions |
|---|---|---|
| fastembed | bge-small-en | 384 |
| fastembed | bge-large-en | 1024 |
| OpenAI | text-embedding-3-small | 1536 |
| OpenAI | text-embedding-3-large | 3072 |
| Ollama | nomic-embed-text | 768 |

Impact: Switching providers silently corrupts vector indices, since stored and query vectors no longer share the same dimensionality.

5. No Metrics

No visibility into usage, cache hit rate, latency per provider, or accumulated costs.
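
For concreteness, the observability surface the unified crate could expose to close this gap might look like the following (field names are illustrative only, not part of this decision):

```rust
use std::collections::HashMap;

/// Illustrative metrics snapshot; nothing here is final.
#[derive(Debug, Default, Clone)]
pub struct EmbeddingMetrics {
    pub requests_total: u64,
    pub cache_hits: u64,
    pub cache_misses: u64,
    pub fallbacks_triggered: u64,
    pub latency_ms_by_provider: HashMap<String, f64>,
    pub estimated_cost_usd: f64,
}

impl EmbeddingMetrics {
    /// Cache hit rate in [0, 1]; 0 when no lookups have happened yet.
    pub fn cache_hit_rate(&self) -> f64 {
        let total = (self.cache_hits + self.cache_misses) as f64;
        if total == 0.0 { 0.0 } else { self.cache_hits as f64 / total }
    }
}
```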

Decision

Create stratum-embeddings as a unified crate that:

  1. Unifies implementations from Kogral, Provisioning, and Vapora
  2. Adds caching to avoid recomputing identical embeddings
  3. Implements fallback between providers (cloud → local)
  4. Clearly documents dimensions and limitations per provider
  5. Exposes metrics for observability
  6. Provides a VectorStore trait with LanceDB and SurrealDB backends, chosen per project's needs

Storage Backend Decision

Each project chooses its vector storage backend based on priority:

| Project | Backend | Priority | Justification |
|---|---|---|---|
| Kogral | SurrealDB | Graph richness | Knowledge Graph needs unified graph+vector queries |
| Provisioning | LanceDB | Vector scale | RAG with millions of document chunks |
| Vapora | LanceDB | Vector scale | Execution traces, pattern matching at scale |

Why SurrealDB for Kogral

Kogral is a Knowledge Graph where relationships are the primary value. With a hybrid architecture (vectors in LanceDB, graph in SurrealDB), a typical query would require:

  1. LanceDB: vector search → candidate_ids
  2. SurrealDB: graph filter on candidates → results
  3. App layer: merge, re-rank, deduplication

Accepted trade-off: SurrealDB has worse pure vector performance than LanceDB, but Kogral's scale is limited by human curation of knowledge (typically 10K-100K concepts).
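
By contrast, keeping both the vectors and the graph in SurrealDB lets a single query combine KNN search with graph traversal. A rough sketch with the surrealdb Rust crate (assumes the ws feature, an existing vector index on the embedding field, and a hypothetical concept table with related_to edges; authentication is omitted):

```rust
use surrealdb::engine::remote::ws::Ws;
use surrealdb::Surreal;

// Rough sketch: one SurrealQL query does vector KNN + graph traversal,
// with no app-side merge/re-rank step. Schema names are hypothetical.
async fn related_concepts(query_vec: Vec<f32>) -> anyhow::Result<Vec<serde_json::Value>> {
    let db = Surreal::new::<Ws>("127.0.0.1:8000").await?;
    db.use_ns("kogral").use_db("kg").await?;

    let mut res = db
        .query(
            "SELECT id, title, ->related_to->concept.title AS neighbors \
             FROM concept WHERE embedding <|10|> $vec",
        )
        .bind(("vec", query_vec))
        .await?;

    Ok(res.take(0)?)
}
```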

Why LanceDB for Provisioning and Vapora

| Aspect | SurrealDB | LanceDB |
|---|---|---|
| Storage format | Row-based | Columnar (Lance) |
| Vector index | HNSW (RAM) | IVF-PQ (disk-native) |
| Practical scale | Millions | Billions |
| Compression | ~1x | ~32x (PQ) |
| Zero-copy read | No | Yes |

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                      stratum-embeddings                          │
├─────────────────────────────────────────────────────────────────┤
│  EmbeddingProvider trait                                         │
│  ├─ embed(text) → Vec<f32>                                      │
│  ├─ embed_batch(texts) → Vec<Vec<f32>>                          │
│  ├─ dimensions() → usize                                        │
│  └─ is_local() → bool                                           │
│                                                                  │
│  ┌───────────┐ ┌───────────┐ ┌───────────┐                      │
│  │ FastEmbed │ │  OpenAI   │ │  Ollama   │                      │
│  │  (local)  │ │  (cloud)  │ │  (local)  │                      │
│  └───────────┘ └───────────┘ └───────────┘                      │
│         └────────────┬────────────┘                              │
│                      ▼                                           │
│              EmbeddingCache (memory/disk)                        │
│                      │                                           │
│                      ▼                                           │
│             EmbeddingService                                     │
│                      │                                           │
│                      ▼                                           │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                   VectorStore trait                       │   │
│  │  ├─ upsert(id, embedding, metadata)                      │   │
│  │  ├─ search(embedding, limit, filter) → Vec<Match>        │   │
│  │  └─ delete(id)                                           │   │
│  └──────────────────────────────────────────────────────────┘   │
│         │                                    │                   │
│         ▼                                    ▼                   │
│  ┌─────────────────┐              ┌─────────────────┐           │
│  │  SurrealDbStore │              │   LanceDbStore  │           │
│  │  (Kogral)       │              │  (Prov/Vapora)  │           │
│  └─────────────────┘              └─────────────────┘           │
└─────────────────────────────────────────────────────────────────┘
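
In Rust terms, the two traits in the diagram could be sketched roughly as follows (error, filter, and metadata types are placeholders; the real signatures are to be settled during implementation):

```rust
use async_trait::async_trait;

#[async_trait]
pub trait EmbeddingProvider: Send + Sync {
    async fn embed(&self, text: &str) -> Result<Vec<f32>, EmbeddingError>;
    async fn embed_batch(&self, texts: &[String]) -> Result<Vec<Vec<f32>>, EmbeddingError>;
    fn dimensions(&self) -> usize;
    fn is_local(&self) -> bool;
}

#[async_trait]
pub trait VectorStore: Send + Sync {
    async fn upsert(
        &self,
        id: &str,
        embedding: &[f32],
        metadata: serde_json::Value,
    ) -> Result<(), EmbeddingError>;
    async fn search(
        &self,
        embedding: &[f32],
        limit: usize,
        filter: Option<Filter>,
    ) -> Result<Vec<Match>, EmbeddingError>;
    async fn delete(&self, id: &str) -> Result<(), EmbeddingError>;
}

// Placeholder types so the sketch stands alone; real definitions TBD.
#[derive(Debug, thiserror::Error)]
#[error("{0}")]
pub struct EmbeddingError(pub String);

pub struct Filter; // e.g. metadata predicates

pub struct Match {
    pub id: String,
    pub score: f32,
    pub metadata: serde_json::Value,
}
```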

Rationale

Why Caching is Critical

For a typical RAG system (10,000 chunks):

  • Without cache: every re-indexing pass and every repeated query pays the full API cost again
  • With cache: only the first indexing pass pays; identical texts afterwards are cache hits

Estimated savings: 60-80% in embedding costs.
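
As a rough worked example using the per-request figure from the Context section: 10,000 chunks × ~$0.00002 per embedding ≈ $0.20 for a full indexing pass; with a 60-80% cache hit rate on subsequent runs and queries, the recurring cost drops to roughly $0.04-$0.08 per 10,000 requests, which is the range behind the ~$0.05 target in Success Metrics below.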

Why Fallback is Important

| Scenario | Without Fallback | With Fallback |
|---|---|---|
| OpenAI rate limit | ERROR | → fastembed (local) |
| OpenAI downtime | ERROR | → Ollama (local) |
| No internet | ERROR | → fastembed (local) |
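
The fallback itself can be as simple as trying providers in registration order and returning the first success (a minimal sketch reusing the hypothetical EmbeddingProvider trait and EmbeddingError type from the Architecture section):

```rust
// Try providers in order (e.g. cloud first, local last); return the first success.
async fn embed_with_fallback(
    providers: &[Box<dyn EmbeddingProvider>],
    text: &str,
) -> Result<Vec<f32>, EmbeddingError> {
    let mut last_err = EmbeddingError("no providers configured".into());
    for provider in providers {
        match provider.embed(text).await {
            Ok(vec) => return Ok(vec),
            // Rate limit, downtime, or no network: remember the error and fall through.
            Err(e) => last_err = e,
        }
    }
    Err(last_err)
}
```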

Why Local Providers First

For development: fastembed loads a local model (~100 MB), requires no API keys, incurs no cost, and works offline.

For production: OpenAI for quality, fastembed as fallback.

Consequences

Positive

  1. Single source of truth for the entire ecosystem
  2. 60-80% fewer embedding API calls (caching)
  3. High availability with local providers (fallback)
  4. Usage and cost metrics
  5. Feature-gated: only compile what you need
  6. Storage flexibility: VectorStore trait allows choosing backend per project

Negative

  1. Dimension lock-in: Changing provider requires re-indexing
  2. Cache invalidation: Updated content may serve stale embeddings
  3. Model download: fastembed downloads ~100MB on first use
  4. Storage lock-in per project: Kogral tied to SurrealDB, others to LanceDB

Mitigations

| Negative | Mitigation |
|---|---|
| Dimension lock-in | Document clearly, warn on provider change |
| Stale cache | Configurable TTL, bypass option |
| Model download | Show progress, cache in ~/.cache/fastembed |
| Storage lock-in | Conscious decision based on project priorities |

Success Metrics

| Metric | Current | Target |
|---|---|---|
| Duplicate implementations | 3 | 1 |
| Cache hit rate | 0% | >60% |
| Fallback availability | 0% | 100% |
| Cost per 10K embeddings | ~$0.20 | ~$0.05 |

Provider Selection Guide

Development

// Local, free, offline
let service = EmbeddingService::builder()
    .with_provider(FastEmbedProvider::small()?)  // 384 dims
    .with_memory_cache()
    .build()?;

Production (Quality)

// OpenAI with local fallback
let service = EmbeddingService::builder()
    .with_provider(OpenAiEmbeddingProvider::large()?)  // 3072 dims
    .with_provider(FastEmbedProvider::large()?)        // Fallback
    .with_memory_cache()
    .build()?;

Production (Cost-Optimized)

// OpenAI small with fallback
let service = EmbeddingService::builder()
    .with_provider(OpenAiEmbeddingProvider::small()?)  // 1536 dims
    .with_provider(OllamaEmbeddingProvider::nomic())   // Fallback
    .with_memory_cache()
    .build()?;
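
Whichever configuration is chosen, usage would look the same. A hypothetical call site, assuming the service mirrors the embed/embed_batch methods of the EmbeddingProvider trait (the exact service API is not fixed by this ADR):

```rust
// Hypothetical usage once a service is built.
let embedding: Vec<f32> = service.embed("What is Rust?").await?;
let batch: Vec<Vec<f32>> = service
    .embed_batch(&["chunk one".into(), "chunk two".into()])
    .await?;
```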

Dimension Compatibility Matrix

| If using... | Can switch to... | CANNOT switch to... |
|---|---|---|
| fastembed small (384) | fastembed small, all-minilm | Any other |
| fastembed large (1024) | fastembed large | Any other |
| OpenAI small (1536) | OpenAI small, ada-002 | Any other |
| OpenAI large (3072) | OpenAI large | Any other |

Rule: Only switch between models with the SAME dimensions.
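
A cheap way to enforce this rule (and the "warn on provider change" mitigation above) is a startup check comparing the configured provider against the existing index; a minimal sketch:

```rust
// Refuse to reuse an index whose stored dimension differs from the provider's.
fn check_dimensions(provider_dims: usize, index_dims: usize) -> Result<(), String> {
    if provider_dims != index_dims {
        return Err(format!(
            "provider produces {provider_dims}-dim vectors, but the index stores \
             {index_dims}-dim vectors; re-indexing is required"
        ));
    }
    Ok(())
}
```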

Implementation Priority

| Order | Feature | Reason |
|---|---|---|
| 1 | EmbeddingProvider trait | Foundation for everything |
| 2 | FastEmbed provider | Works without API keys |
| 3 | Memory cache | Biggest cost impact |
| 4 | VectorStore trait | Storage abstraction |
| 5 | SurrealDbStore | Kogral needs graph+vector |
| 6 | LanceDbStore | Provisioning/Vapora scale |
| 7 | OpenAI provider | Production |
| 8 | Ollama provider | Local fallback |
| 9 | Batch processing | Efficiency |
| 10 | Metrics | Observability |

References

Existing Implementations:

  • Kogral: kogral-core/src/embeddings/
  • Vapora: vapora-llm-router/src/embeddings.rs
  • Provisioning: provisioning/platform/crates/rag/src/embeddings.rs

Target Location: stratumiops/crates/stratum-embeddings/