chore: create stratum-embeddings and stratum-llm crates, docs
Commit 0ae853c2fa (parent b0d039d22d)
.gitignore (vendored, 1 line changed)

@@ -1,6 +1,7 @@
CLAUDE.md
.claude
utils/save*sh
.fastembed_cache
COMMIT_MESSAGE.md
.wrks
nushell
CHANGELOG.md (new file, 37 lines)

@@ -0,0 +1,37 @@
# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

### Added

- **Architecture Documentation**: New `docs/*/architecture/` section with ADRs
  - ADR-001: stratum-embeddings - Unified embedding library with caching, fallback,
    and VectorStore trait (SurrealDB for Kogral, LanceDB for Provisioning/Vapora)
  - ADR-002: stratum-llm - Unified LLM provider library with CLI credential detection,
    circuit breaker, caching, and Kogral integration
- **Bilingual ADRs**: Full English and Spanish versions of all architecture documents
- **README updates**: Added Stratum Crates section and updated documentation structure

### Changed

- Documentation structure now includes `architecture/adrs/` subdirectory in both
  language directories (en/es)

## [0.1.0] - 2026-01-22

### Added

- Initial repository setup
- Main documentation structure (bilingual en/es)
- Branding assets (logos, icons, social variants)
- CI/CD configuration (GitHub Actions, Woodpecker)
- Language guidelines (Rust, Nickel, Nushell, Bash)
- Pre-commit hooks configuration

[Unreleased]: https://repo.jesusperez.pro/jesus/stratumiops/compare/v0.1.0...HEAD
[0.1.0]: https://repo.jesusperez.pro/jesus/stratumiops/releases/tag/v0.1.0
Cargo.lock (generated, new file, 10658 lines)

File diff suppressed because it is too large.
Cargo.toml (new file, 60 lines)

@@ -0,0 +1,60 @@
[workspace]
members = ["crates/*"]
resolver = "2"

[workspace.package]
edition = "2021"
license = "MIT OR Apache-2.0"

[workspace.dependencies]
# Async runtime
tokio = { version = "1.49", features = ["full"] }
async-trait = "0.1"
futures = "0.3"

# HTTP client
reqwest = { version = "0.13", features = ["json"] }

# Serialization
serde = { version = "1.0", features = ["derive"] }
serde_json = "1.0"
serde_yaml = "0.9"
humantime-serde = "1.1"

# Caching
moka = { version = "0.12", features = ["future"] }
sled = "0.34"

# Embeddings
fastembed = "5.8"

# Vector storage
lancedb = "0.23"
surrealdb = { version = "2.5", features = ["kv-mem"] }
# LOCKED: Arrow 56.x required for LanceDB 0.23 compatibility
# LanceDB 0.23 uses Arrow 56.2.0 internally - Arrow 57 breaks API compatibility
# DO NOT upgrade to Arrow 57 until LanceDB supports it
arrow = "=56"

# Error handling
thiserror = "2.0"
anyhow = "1.0"

# Logging and tracing
tracing = "0.1"
tracing-subscriber = "0.3"

# Metrics
prometheus = "0.14"

# Utilities
xxhash-rust = { version = "0.8", features = ["xxh3"] }
dirs = "6.0"
chrono = "0.4"
uuid = "1.19"
which = "8.0"

# Testing
tokio-test = "0.4"
approx = "0.5"
tempfile = "3.24"
README.md (17 lines changed)

@@ -131,16 +131,29 @@ StratumIOps is not a single project. It's the **orchestration layer** that coord
- **Integration Patterns**: How projects work together
- **Shared Standards**: Language guidelines (Rust, Nickel, Nushell, Bash)

### Stratum Crates

Shared infrastructure libraries for the ecosystem:

| Crate | Description | Status |
| ----- | ----------- | ------ |
| **stratum-embeddings** | Unified embedding providers with caching, fallback, and VectorStore trait | Proposed |
| **stratum-llm** | Unified LLM providers with CLI detection, circuit breaker, and caching | Proposed |

See [Architecture ADRs](docs/en/architecture/adrs/) for detailed design decisions.

### Documentation Structure

```text
docs/
├── en/                  # English documentation
│   ├── ia/              # AI/Development track
│   ├── ops/             # Ops/DevOps track
│   └── architecture/    # Architecture decisions (ADRs)
└── es/                  # Spanish documentation
    ├── ia/              # AI/Development track
    ├── ops/             # Ops/DevOps track
    └── architecture/    # Architecture decisions (ADRs)
```

### Branding Assets
crates/stratum-embeddings/Cargo.toml (new file, 113 lines)

@@ -0,0 +1,113 @@
[package]
name = "stratum-embeddings"
version = "0.1.0"
edition.workspace = true
description = "Unified embedding providers with caching, batch processing, and vector storage"
license.workspace = true

[dependencies]
# Async runtime
tokio = { workspace = true }
async-trait = { workspace = true }
futures = { workspace = true }

# HTTP client (for cloud providers)
reqwest = { workspace = true, optional = true }

# Serialization
serde = { workspace = true }
serde_json = { workspace = true }
humantime-serde = { workspace = true }

# Caching
moka = { workspace = true }

# Persistent cache (optional)
sled = { workspace = true, optional = true }

# Local embeddings
fastembed = { workspace = true, optional = true }

# Vector storage backends
lancedb = { workspace = true, optional = true }
surrealdb = { workspace = true, optional = true }
arrow = { workspace = true, optional = true }

# Error handling
thiserror = { workspace = true }

# Logging
tracing = { workspace = true }

# Metrics
prometheus = { workspace = true, optional = true }

# Utilities
xxhash-rust = { workspace = true }

[features]
default = ["fastembed-provider", "memory-cache"]

# Providers
fastembed-provider = ["fastembed"]
openai-provider = ["reqwest"]
ollama-provider = ["reqwest"]
cohere-provider = ["reqwest"]
voyage-provider = ["reqwest"]
huggingface-provider = ["reqwest"]
all-providers = [
    "fastembed-provider",
    "openai-provider",
    "ollama-provider",
    "cohere-provider",
    "voyage-provider",
    "huggingface-provider",
]

# Cache backends
memory-cache = []
persistent-cache = ["sled"]
all-cache = ["memory-cache", "persistent-cache"]

# Vector storage backends
lancedb-store = ["lancedb", "arrow"]
surrealdb-store = ["surrealdb"]
all-stores = ["lancedb-store", "surrealdb-store"]

# Observability
metrics = ["prometheus"]

# Project-specific presets
kogral = ["fastembed-provider", "memory-cache", "surrealdb-store"]
provisioning = ["openai-provider", "memory-cache", "lancedb-store"]
vapora = ["all-providers", "memory-cache", "lancedb-store"] # Includes huggingface-provider

# Full feature set
full = ["all-providers", "all-cache", "all-stores", "metrics"]

[dev-dependencies]
tokio-test = { workspace = true }
approx = { workspace = true }
tempfile = { workspace = true }
tracing-subscriber = { workspace = true }

# Example-specific feature requirements
[[example]]
name = "basic_usage"
required-features = ["fastembed-provider"]

[[example]]
name = "fallback_demo"
required-features = ["ollama-provider", "fastembed-provider"]

[[example]]
name = "lancedb_usage"
required-features = ["lancedb-store", "fastembed-provider"]

[[example]]
name = "surrealdb_usage"
required-features = ["surrealdb-store", "fastembed-provider"]

[[example]]
name = "huggingface_usage"
required-features = ["huggingface-provider"]
crates/stratum-embeddings/README.md (new file, 180 lines)

@@ -0,0 +1,180 @@
# stratum-embeddings

Unified embedding providers with caching, batch processing, and vector storage for the StratumIOps ecosystem.

## Features

- **Multiple Providers**: FastEmbed (local), OpenAI, Ollama
- **Smart Caching**: In-memory caching with configurable TTL
- **Batch Processing**: Efficient batch embedding with automatic chunking
- **Vector Storage**: LanceDB (scale-first) and SurrealDB (graph-first)
- **Fallback Support**: Automatic failover between providers
- **Feature Flags**: Modular compilation for minimal dependencies

## Architecture

```text
┌─────────────────────────────────────────┐
│            EmbeddingService             │
│    (facade with caching + fallback)     │
└─────────────┬───────────────────────────┘
              │
    ┌─────────┴─────────┐
    ▼                   ▼
┌─────────────┐   ┌─────────────┐
│  Providers  │   │    Cache    │
│             │   │             │
│ • FastEmbed │   │ • Memory    │
│ • OpenAI    │   │ • (Sled)    │
│ • Ollama    │   │             │
└─────────────┘   └─────────────┘
```

## Quick Start

### Basic Usage

```rust
use stratum_embeddings::{
    EmbeddingService, FastEmbedProvider, MemoryCache, EmbeddingOptions
};
use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let provider = FastEmbedProvider::small()?;
    let cache = MemoryCache::new(1000, Duration::from_secs(300));
    let service = EmbeddingService::new(provider).with_cache(cache);

    let options = EmbeddingOptions::default_with_cache();
    let embedding = service.embed("Hello world", &options).await?;

    println!("Generated {} dimensions", embedding.len());
    Ok(())
}
```

### Batch Processing

```rust
let texts = vec![
    "Text 1".to_string(),
    "Text 2".to_string(),
    "Text 3".to_string(),
];

let result = service.embed_batch(texts, &options).await?;
println!("Embeddings: {}, Cached: {}",
    result.embeddings.len(),
    result.cached_count
);
```

### Vector Storage

#### LanceDB (Provisioning, Vapora)

```rust
use stratum_embeddings::{LanceDbStore, VectorStore, VectorStoreConfig};

let config = VectorStoreConfig::new(384);
let store = LanceDbStore::new("./data", "embeddings", config).await?;

store.upsert("doc1", &embedding, metadata).await?;
let results = store.search(&query_embedding, 10, None).await?;
```

#### SurrealDB (Kogral)

```rust
use stratum_embeddings::{SurrealDbStore, VectorStore, VectorStoreConfig};

let config = VectorStoreConfig::new(384);
let store = SurrealDbStore::new_memory("concepts", config).await?;

store.upsert("concept1", &embedding, metadata).await?;
let results = store.search(&query_embedding, 10, None).await?;
```

## Feature Flags

### Providers

- `fastembed-provider` (default) - Local embeddings via fastembed
- `openai-provider` - OpenAI API embeddings
- `ollama-provider` - Ollama local server embeddings
- `all-providers` - All embedding providers

### Cache

- `memory-cache` (default) - In-memory caching with moka
- `persistent-cache` - Persistent cache with sled
- `all-cache` - All cache backends

### Vector Storage

- `lancedb-store` - LanceDB vector storage (columnar, disk-native)
- `surrealdb-store` - SurrealDB vector storage (graph + vector)
- `all-stores` - All storage backends

### Project Presets

- `kogral` - fastembed + memory + surrealdb (see the snippet below)
- `provisioning` - openai + memory + lancedb
- `vapora` - all-providers + memory + lancedb
- `full` - Everything enabled
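To pull one of these presets into a downstream crate, the manifest entry might look like the following sketch; the version and the `default-features = false` choice are assumptions, not taken from this commit:

```toml
# Hypothetical consumer Cargo.toml entry — version and default-features are assumptions.
[dependencies]
stratum-embeddings = { version = "0.1", default-features = false, features = ["kogral"] }
```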
## Examples

Run examples with:

```bash
cargo run --example basic_usage --features=default
cargo run --example fallback_demo --features=fastembed-provider,ollama-provider
cargo run --example lancedb_usage --features=lancedb-store
cargo run --example surrealdb_usage --features=surrealdb-store
```

## Provider Comparison

| Provider | Type | Cost | Dimensions | Use Case |
|----------|------|------|------------|----------|
| FastEmbed | Local | Free | 384-1024 | Dev, privacy-first |
| OpenAI | Cloud | $0.02-0.13/1M tokens | 1536-3072 | Production RAG |
| Ollama | Local | Free | 384-1024 | Self-hosted |

## Storage Backend Comparison

| Backend | Best For | Strength | Scale |
|---------|----------|----------|-------|
| LanceDB | RAG, traces | Columnar, IVF-PQ index | Billions |
| SurrealDB | Knowledge graphs | Unified graph+vector queries | Millions |

## Configuration

Environment variables:

```bash
# FastEmbed
FASTEMBED_MODEL=bge-small-en

# OpenAI
OPENAI_API_KEY=sk-...
OPENAI_MODEL=text-embedding-3-small

# Ollama
OLLAMA_MODEL=nomic-embed-text
OLLAMA_BASE_URL=http://localhost:11434
```
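As an illustration, a caller could select a provider based on which of these variables are set. This sketch reuses constructors shown elsewhere in this crate (`OllamaProvider::default_model`, `FastEmbedProvider::small`); the selection policy itself is an assumption, not crate API:

```rust
use std::sync::Arc;
use stratum_embeddings::{EmbeddingProvider, FastEmbedProvider, OllamaProvider};

// Sketch: prefer Ollama when OLLAMA_BASE_URL is set, else use local FastEmbed.
// The policy is illustrative; only the constructors come from this crate.
let provider: Arc<dyn EmbeddingProvider> = if std::env::var("OLLAMA_BASE_URL").is_ok() {
    Arc::new(OllamaProvider::default_model()?)
} else {
    Arc::new(FastEmbedProvider::small()?)
};
```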
## Development

```bash
cargo check -p stratum-embeddings --all-features
cargo test -p stratum-embeddings --all-features
cargo clippy -p stratum-embeddings --all-features -- -D warnings
```

## License

MIT OR Apache-2.0
crates/stratum-embeddings/docs/huggingface-provider.md (new file, 346 lines)

@@ -0,0 +1,346 @@
# HuggingFace Embedding Provider

Provider for HuggingFace Inference API embeddings, with support for popular sentence-transformers and BGE models.

## Overview

The HuggingFace provider uses the free Inference API to generate embeddings. It supports:

- **Public Models**: Free access to popular embedding models
- **Custom Models**: Support for any HuggingFace model with a feature-extraction pipeline
- **Automatic Caching**: Built-in memory cache reduces API calls
- **Response Normalization**: Optional L2 normalization for similarity search (see the sketch below)
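As a refresher, L2 normalization rescales a vector to unit length so that the dot product of two normalized vectors equals their cosine similarity. A minimal sketch of the operation (illustrative only, not the provider's internal code):

```rust
/// Rescale `v` in place to unit Euclidean length (no-op for the zero vector).
fn l2_normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}
```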
## Features

- ✅ Zero cost for public models (free Inference API)
- ✅ Support for 5+ popular models out of the box
- ✅ Custom model support with configurable dimensions
- ✅ Automatic retry with exponential backoff
- ✅ Rate limit handling
- ✅ Integration with the stratum-embeddings caching layer

## Supported Models

### Predefined Models

| Model | Dimensions | Use Case | Constructor |
|-------|------------|----------|-------------|
| **BAAI/bge-small-en-v1.5** | 384 | General-purpose, efficient | `HuggingFaceProvider::bge_small()` |
| **BAAI/bge-base-en-v1.5** | 768 | Balanced performance | `HuggingFaceProvider::bge_base()` |
| **BAAI/bge-large-en-v1.5** | 1024 | High quality | `HuggingFaceProvider::bge_large()` |
| **sentence-transformers/all-MiniLM-L6-v2** | 384 | Fast, lightweight | `HuggingFaceProvider::all_minilm()` |
| **sentence-transformers/all-mpnet-base-v2** | 768 | Strong baseline | - |

### Custom Models

```rust
let model = HuggingFaceModel::Custom(
    "sentence-transformers/paraphrase-MiniLM-L6-v2".to_string(),
    384,
);
let provider = HuggingFaceProvider::new(api_key, model)?;
```

## API Rate Limits

### Free Inference API

The HuggingFace Inference API has the following rate limits:

| Tier | Requests/Hour | Requests/Day | Max Concurrent |
|------|---------------|--------------|----------------|
| **Anonymous** | 1,000 | 10,000 | 1 |
| **Free Account** | 3,000 | 30,000 | 3 |
| **PRO ($9/mo)** | 10,000 | 100,000 | 10 |
| **Enterprise** | Custom | Custom | Custom |

**Rate Limit Headers**:

```text
X-RateLimit-Limit: 3000
X-RateLimit-Remaining: 2999
X-RateLimit-Reset: 1234567890
```

### Rate Limit Handling

The provider handles rate limits automatically with:

1. **Exponential Backoff**: Retries with increasing delays (1s, 2s, 4s, 8s)
2. **Max Retries**: Default of 3 retries before failing
3. **Circuit Breaker**: Automatically pauses requests if rate limited repeatedly
4. **Cache Integration**: Reduces API calls by 70-90% for repeated queries

**Configuration**:

```rust
// Default retry config (built-in)
let provider = HuggingFaceProvider::new(api_key, model)?;

// With custom retry (future enhancement)
let provider = HuggingFaceProvider::new(api_key, model)?
    .with_retry_config(RetryConfig {
        max_retries: 5,
        initial_delay: Duration::from_secs(2),
        max_delay: Duration::from_secs(30),
    });
```

### Best Practices for Rate Limits

1. **Enable Caching**: Use `EmbeddingOptions::default_with_cache()`

   ```rust
   let options = EmbeddingOptions::default_with_cache();
   let embedding = provider.embed(text, &options).await?;
   ```

2. **Batch Requests Carefully**: The HuggingFace Inference API processes requests sequentially

   ```rust
   // This makes N API calls sequentially
   let texts = vec!["text1", "text2", "text3"];
   let result = provider.embed_batch(&texts, &options).await?;
   ```

3. **Use a PRO Account for Production**: The free tier is suitable for development only

4. **Monitor Rate Limits**: Check response headers

   ```rust
   // Future enhancement - rate limit monitoring
   let stats = provider.rate_limit_stats();
   println!("Remaining: {}/{}", stats.remaining, stats.limit);
   ```

## Authentication

### Environment Variables

The provider checks for API keys in this order:

1. `HUGGINGFACE_API_KEY`
2. `HF_TOKEN` (alternative name)

```bash
export HUGGINGFACE_API_KEY="hf_xxxxxxxxxxxxxxxxxxxx"
```
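This precedence mirrors how `examples/huggingface_usage.rs` in this crate resolves credentials:

```rust
// Check HUGGINGFACE_API_KEY first, then fall back to HF_TOKEN.
let api_key = std::env::var("HUGGINGFACE_API_KEY")
    .or_else(|_| std::env::var("HF_TOKEN"))
    .expect("Set HUGGINGFACE_API_KEY or HF_TOKEN");
```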
### Getting an API Token

1. Go to [HuggingFace Settings](https://huggingface.co/settings/tokens)
2. Click "New token"
3. Select "Read" access (sufficient for the Inference API)
4. Copy the token starting with `hf_`

## Usage Examples

### Basic Usage

```rust
use stratum_embeddings::{HuggingFaceProvider, EmbeddingOptions};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Using a predefined model
    let provider = HuggingFaceProvider::bge_small()?;

    let options = EmbeddingOptions::default_with_cache();
    let embedding = provider.embed("Hello world", &options).await?;

    println!("Dimensions: {}", embedding.len()); // 384
    Ok(())
}
```

### With EmbeddingService (Recommended)

```rust
use std::time::Duration;
use stratum_embeddings::{
    HuggingFaceProvider, EmbeddingService, MemoryCache, EmbeddingOptions
};

let provider = HuggingFaceProvider::bge_small()?;
let cache = MemoryCache::new(1000, Duration::from_secs(3600));

let service = EmbeddingService::new(provider)
    .with_cache(cache);

let options = EmbeddingOptions::default_with_cache();
let embedding = service.embed("Cached embeddings", &options).await?;
```

### Semantic Similarity Search

```rust
use stratum_embeddings::{HuggingFaceProvider, EmbeddingOptions, cosine_similarity};

let provider = HuggingFaceProvider::bge_small()?;
let options = EmbeddingOptions {
    normalize: true, // Important for cosine similarity
    truncate: true,
    use_cache: true,
};

let query = "machine learning";
let doc1 = "deep learning and neural networks";
let doc2 = "cooking recipes";

let query_emb = provider.embed(query, &options).await?;
let doc1_emb = provider.embed(doc1, &options).await?;
let doc2_emb = provider.embed(doc2, &options).await?;

let sim1 = cosine_similarity(&query_emb, &doc1_emb);
let sim2 = cosine_similarity(&query_emb, &doc2_emb);

println!("Similarity with doc1: {:.4}", sim1); // ~0.85
println!("Similarity with doc2: {:.4}", sim2); // ~0.15
```

### Custom Model

```rust
use stratum_embeddings::{HuggingFaceProvider, HuggingFaceModel};

let api_key = std::env::var("HUGGINGFACE_API_KEY")?;
let model = HuggingFaceModel::Custom(
    "intfloat/multilingual-e5-large".to_string(),
    1024, // Specify dimensions
);

let provider = HuggingFaceProvider::new(api_key, model)?;
```

## Error Handling

### Common Errors

| Error | Cause | Solution |
|-------|-------|----------|
| `ConfigError: API key is empty` | Missing credentials | Set `HUGGINGFACE_API_KEY` |
| `ApiError: HTTP 401` | Invalid API token | Check token validity |
| `ApiError: HTTP 429` | Rate limit exceeded | Wait or upgrade tier |
| `ApiError: HTTP 503` | Model loading | Retry after ~20s |
| `DimensionMismatch` | Wrong model dimensions | Update `Custom` model dims |

### Retry Example

```rust
use tokio::time::sleep;
use std::time::Duration;

let mut retries = 0;
let max_retries = 3;

loop {
    match provider.embed(text, &options).await {
        Ok(embedding) => break Ok(embedding),
        Err(e) if e.to_string().contains("429") && retries < max_retries => {
            retries += 1;
            let delay = Duration::from_secs(2u64.pow(retries));
            eprintln!("Rate limited, retrying in {:?}...", delay);
            sleep(delay).await;
        }
        Err(e) => break Err(e),
    }
}
```

## Performance Characteristics

### Latency

| Operation | Latency | Notes |
|-----------|---------|-------|
| **Single embed** | 200-500ms | Depends on model size and region |
| **Batch (N items)** | N × 200-500ms | Sequential processing |
| **Cache hit** | <1ms | In-memory lookup |
| **Cold start** | +5-20s | First request loads the model |

### Throughput

| Tier | Max RPS | Daily Limit |
|------|---------|-------------|
| Free | ~0.8 | 30,000 |
| PRO | ~2.8 | 100,000 |

**With Caching** (80% hit rate), only cache misses consume the API budget, so effective throughput is roughly the raw rate divided by the 20% miss rate:

- Free tier: ~4 effective RPS
- PRO tier: ~14 effective RPS

## Cost Comparison

| Provider | Cost/1M Tokens | Free Tier | Notes |
|----------|----------------|-----------|-------|
| **HuggingFace** | $0.00 | 30k req/day | Free for public models |
| OpenAI | $0.02-0.13 | $5 credit | Pay per token |
| Cohere | $0.10 | 100 req/month | Limited free tier |
| Voyage | $0.12 | None | No free tier |

## Limitations

1. **No True Batching**: The Inference API processes one request at a time
2. **Cold Starts**: Models need ~20s to load on the first request
3. **Rate Limits**: The free tier is suitable for development only
4. **Regional Latency**: Single region (US/EU), no edge locations
5. **Model Loading**: Popular models are cached; custom models may be slow

## Advanced Configuration

### Model Loading Timeout

```rust
// Future enhancement
let provider = HuggingFaceProvider::new(api_key, model)?
    .with_timeout(Duration::from_secs(120)); // Wait longer for cold starts
```

### Dedicated Inference Endpoints

For production workloads, consider [Dedicated Endpoints](https://huggingface.co/inference-endpoints):

- True batch processing
- Guaranteed uptime
- No rate limits
- Custom regions
- ~$60-500/month

## Migration Guide

### From the vapora Custom Implementation

**Before**:

```rust
let hf = HuggingFaceEmbedding::new(api_key, "BAAI/bge-small-en-v1.5".to_string());
let embedding = hf.embed(text).await?;
```

**After**:

```rust
let provider = HuggingFaceProvider::bge_small()?;
let options = EmbeddingOptions::default_with_cache();
let embedding = provider.embed(text, &options).await?;
```

### From OpenAI

```rust
// OpenAI (paid)
let provider = OpenAiProvider::new(api_key, OpenAiModel::TextEmbedding3Small)?;

// HuggingFace (free, similar quality)
let provider = HuggingFaceProvider::bge_small()?;
```

## Running the Example

```bash
export HUGGINGFACE_API_KEY="hf_xxxxxxxxxxxxxxxxxxxx"

cargo run --example huggingface_usage \
    --features huggingface-provider
```

## References

- [HuggingFace Inference API Docs](https://huggingface.co/docs/api-inference/index)
- [BGE Embedding Models](https://huggingface.co/BAAI)
- [Sentence Transformers](https://www.sbert.net/)
- [Rate Limits Documentation](https://huggingface.co/docs/api-inference/rate-limits)
crates/stratum-embeddings/examples/basic_usage.rs (new file, 50 lines)

@@ -0,0 +1,50 @@
use std::time::Duration;

use stratum_embeddings::{EmbeddingOptions, EmbeddingService, FastEmbedProvider, MemoryCache};
use tracing::info;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    tracing_subscriber::fmt::init();

    info!("Initializing FastEmbed provider...");
    let provider = FastEmbedProvider::small()?;

    let cache = MemoryCache::new(1000, Duration::from_secs(300));
    let service = EmbeddingService::new(provider).with_cache(cache);

    info!("Service ready: {:?}", service.provider_info());

    let options = EmbeddingOptions::default_with_cache();

    info!("Embedding single text...");
    let text = "Stratum embeddings is a unified embedding library";
    let embedding = service.embed(text, &options).await?;

    info!("Generated embedding with {} dimensions", embedding.len());

    info!("Embedding same text again (should be cached)...");
    let embedding2 = service.embed(text, &options).await?;
    assert_eq!(embedding, embedding2);
    info!("Cache hit confirmed!");

    info!("Embedding batch of texts...");
    let texts = vec![
        "Rust is a systems programming language".to_string(),
        "Knowledge graphs connect concepts".to_string(),
        "Vector databases enable semantic search".to_string(),
    ];

    let result = service.embed_batch(texts, &options).await?;

    info!(
        "Batch complete: {} embeddings generated",
        result.embeddings.len()
    );
    info!("Model: {}, Dimensions: {}", result.model, result.dimensions);
    info!("Cached count: {}", result.cached_count);

    info!("Cache size: {}", service.cache_size());

    Ok(())
}
crates/stratum-embeddings/examples/fallback_demo.rs (new file, 44 lines)

@@ -0,0 +1,44 @@
use std::{sync::Arc, time::Duration};

use stratum_embeddings::{
    EmbeddingOptions, EmbeddingService, FastEmbedProvider, MemoryCache, OllamaProvider,
};
use tracing::{info, warn};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    tracing_subscriber::fmt::init();

    info!("Setting up primary provider (Ollama)...");
    let primary = OllamaProvider::default_model()?;

    info!("Setting up fallback provider (FastEmbed)...");
    let fallback =
        Arc::new(FastEmbedProvider::small()?) as Arc<dyn stratum_embeddings::EmbeddingProvider>;

    let cache = MemoryCache::new(1000, Duration::from_secs(300));
    let service = EmbeddingService::new(primary)
        .with_cache(cache)
        .with_fallback(fallback);

    let options = EmbeddingOptions::default_with_cache();

    info!("Checking if Ollama is available...");
    if service.is_ready().await {
        info!("Ollama is available, using as primary");
    } else {
        warn!("Ollama not available, will fall back to FastEmbed");
    }

    info!("Embedding text (will use available provider)...");
    let text = "This demonstrates fallback strategy in action";
    let embedding = service.embed(text, &options).await?;

    info!(
        "Successfully generated embedding with {} dimensions",
        embedding.len()
    );
    info!("Cache size: {}", service.cache_size());

    Ok(())
}
crates/stratum-embeddings/examples/huggingface_usage.rs (new file, 125 lines)

@@ -0,0 +1,125 @@
use std::time::Duration;

use stratum_embeddings::{
    EmbeddingOptions, HuggingFaceModel, HuggingFaceProvider, MemoryCache,
};
use tracing::info;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    tracing_subscriber::fmt::init();

    info!("=== HuggingFace Embedding Provider Demo ===");

    // Example 1: Using a predefined model (bge-small)
    info!("\n1. Using predefined BGE-small model (384 dimensions)");
    let provider = HuggingFaceProvider::bge_small()?;
    let options = EmbeddingOptions::default_with_cache();

    let text = "HuggingFace provides free inference API for embedding models";
    let embedding = provider.embed(text, &options).await?;

    info!(
        "Generated embedding with {} dimensions from BGE-small",
        embedding.len()
    );
    info!("First 5 values: {:?}", &embedding[..5]);

    // Example 2: Using a different model size
    info!("\n2. Using BGE-base model (768 dimensions)");
    let provider = HuggingFaceProvider::bge_base()?;
    let embedding = provider.embed(text, &options).await?;

    info!(
        "Generated embedding with {} dimensions from BGE-base",
        embedding.len()
    );

    // Example 3: Using a custom model
    info!("\n3. Using custom model");
    let api_key = std::env::var("HUGGINGFACE_API_KEY")
        .or_else(|_| std::env::var("HF_TOKEN"))
        .expect("Set HUGGINGFACE_API_KEY or HF_TOKEN");

    let custom_model = HuggingFaceModel::Custom(
        "sentence-transformers/paraphrase-MiniLM-L6-v2".to_string(),
        384,
    );
    let provider = HuggingFaceProvider::new(api_key, custom_model)?;
    let embedding = provider.embed(text, &options).await?;

    info!(
        "Custom model embedding: {} dimensions",
        embedding.len()
    );

    // Example 4: Batch embeddings (sequential requests to the HF API)
    info!("\n4. Batch embedding (sequential API calls)");
    let provider = HuggingFaceProvider::all_minilm()?;

    let texts = vec![
        "First document about embeddings",
        "Second document about transformers",
        "Third document about NLP",
    ];

    // `texts` is already a Vec<&str>, so it can be passed to embed_batch directly.
    let result = provider.embed_batch(&texts, &options).await?;

    info!("Embedded {} texts", result.embeddings.len());
    for (i, emb) in result.embeddings.iter().enumerate() {
        info!("  Text {}: {} dimensions", i + 1, emb.len());
    }

    // Example 5: Using with a cache
    info!("\n5. Demonstrating cache effectiveness");
    let cache = MemoryCache::new(1000, Duration::from_secs(300));
    let service = stratum_embeddings::EmbeddingService::new(
        HuggingFaceProvider::bge_small()?
    ).with_cache(cache);

    let cached_options = EmbeddingOptions::default_with_cache();

    // First call - cache miss
    let start = std::time::Instant::now();
    let _ = service.embed(text, &cached_options).await?;
    let first_duration = start.elapsed();
    info!("First call (cache miss): {:?}", first_duration);

    // Second call - cache hit
    let start = std::time::Instant::now();
    let _ = service.embed(text, &cached_options).await?;
    let second_duration = start.elapsed();
    info!("Second call (cache hit): {:?}", second_duration);
    info!("Speedup: {:.2}x", first_duration.as_secs_f64() / second_duration.as_secs_f64());
    info!("Cache size: {}", service.cache_size());

    // Example 6: Normalized embeddings for similarity search
    info!("\n6. Normalized embeddings for similarity");
    let provider = HuggingFaceProvider::bge_small()?;
    let normalize_options = EmbeddingOptions {
        normalize: true,
        truncate: true,
        use_cache: true,
    };

    let query = "machine learning embeddings";
    let doc1 = "neural network embeddings for NLP";
    let doc2 = "cooking recipes and ingredients";

    let query_emb = provider.embed(query, &normalize_options).await?;
    let doc1_emb = provider.embed(doc1, &normalize_options).await?;
    let doc2_emb = provider.embed(doc2, &normalize_options).await?;

    let sim1 = stratum_embeddings::cosine_similarity(&query_emb, &doc1_emb);
    let sim2 = stratum_embeddings::cosine_similarity(&query_emb, &doc2_emb);

    info!("Query: '{}'", query);
    info!("Similarity with doc1 ('{}'): {:.4}", doc1, sim1);
    info!("Similarity with doc2 ('{}'): {:.4}", doc2, sim2);
    info!("Most similar: {}", if sim1 > sim2 { "doc1" } else { "doc2" });

    info!("\n=== Demo Complete ===");

    Ok(())
}
crates/stratum-embeddings/examples/lancedb_usage.rs (new file, 67 lines)

@@ -0,0 +1,67 @@
use std::time::Duration;

use stratum_embeddings::{
    EmbeddingOptions, EmbeddingService, FastEmbedProvider, LanceDbStore, MemoryCache, VectorStore,
    VectorStoreConfig,
};
use tempfile::tempdir;
use tracing::info;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    tracing_subscriber::fmt::init();

    info!("Initializing embedding service...");
    let provider = FastEmbedProvider::small()?;
    let cache = MemoryCache::new(1000, Duration::from_secs(300));
    let service = EmbeddingService::new(provider).with_cache(cache);

    let dir = tempdir()?;
    let db_path = dir.path().to_str().unwrap();

    info!("Creating LanceDB store at: {}", db_path);
    let config = VectorStoreConfig::new(384);
    let store = LanceDbStore::new(db_path, "embeddings", config).await?;

    let documents = vec![
        (
            "doc1",
            "Rust provides memory safety without garbage collection",
        ),
        ("doc2", "Knowledge graphs represent structured information"),
        ("doc3", "Vector databases enable semantic similarity search"),
        ("doc4", "Machine learning models learn from data patterns"),
        ("doc5", "Embeddings capture semantic meaning in vectors"),
    ];

    info!("Embedding and storing {} documents...", documents.len());
    let options = EmbeddingOptions::default_with_cache();

    for (id, text) in &documents {
        let embedding = service.embed(text, &options).await?;
        let metadata = serde_json::json!({
            "text": text,
            "source": "demo"
        });
        store.upsert(id, &embedding, metadata).await?;
    }

    info!("Documents stored successfully");

    info!("Performing semantic search...");
    let query = "How do databases support similarity matching?";
    let query_embedding = service.embed(query, &options).await?;

    let results = store.search(&query_embedding, 3, None).await?;

    info!("Search results for: '{}'", query);
    for (i, result) in results.iter().enumerate() {
        let text = result.metadata["text"].as_str().unwrap_or("N/A");
        info!("  {}. [score: {:.4}] {}", i + 1, result.score, text);
    }

    let count = store.count().await?;
    info!("Total documents in store: {}", count);

    Ok(())
}
crates/stratum-embeddings/examples/surrealdb_usage.rs (new file, 66 lines)

@@ -0,0 +1,66 @@
use std::time::Duration;

use stratum_embeddings::{
    EmbeddingOptions, EmbeddingService, FastEmbedProvider, MemoryCache, SurrealDbStore,
    VectorStore, VectorStoreConfig,
};
use tracing::info;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    tracing_subscriber::fmt::init();

    info!("Initializing embedding service...");
    let provider = FastEmbedProvider::small()?;
    let cache = MemoryCache::new(1000, Duration::from_secs(300));
    let service = EmbeddingService::new(provider).with_cache(cache);

    info!("Creating SurrealDB in-memory store...");
    let config = VectorStoreConfig::new(384);
    let store = SurrealDbStore::new_memory("concepts", config).await?;

    let concepts = vec![
        ("ownership", "Rust's ownership system prevents memory leaks"),
        (
            "borrowing",
            "Borrowing allows references without ownership transfer",
        ),
        ("lifetimes", "Lifetimes ensure references remain valid"),
        ("traits", "Traits define shared behavior across types"),
        ("generics", "Generics enable code reuse with type safety"),
    ];

    info!("Embedding and storing {} concepts...", concepts.len());
    let options = EmbeddingOptions::default_with_cache();

    for (id, description) in &concepts {
        let embedding = service.embed(description, &options).await?;
        let metadata = serde_json::json!({
            "concept": id,
            "description": description,
            "language": "rust"
        });
        store.upsert(id, &embedding, metadata).await?;
    }

    info!("Concepts stored successfully");

    info!("Performing knowledge graph search...");
    let query = "How does Rust manage memory?";
    let query_embedding = service.embed(query, &options).await?;

    let results = store.search(&query_embedding, 3, None).await?;

    info!("Most relevant concepts for: '{}'", query);
    for (i, result) in results.iter().enumerate() {
        let concept = result.metadata["concept"].as_str().unwrap_or("N/A");
        let description = result.metadata["description"].as_str().unwrap_or("N/A");
        info!("  {}. {} [score: {:.4}]", i + 1, concept, result.score);
        info!("     {}", description);
    }

    let count = store.count().await?;
    info!("Total concepts in graph: {}", count);

    Ok(())
}
crates/stratum-embeddings/src/batch.rs (new file, 312 lines)

@@ -0,0 +1,312 @@
use std::sync::Arc;

use futures::stream::{self, StreamExt};
use tracing::{debug, info};

use crate::{
    cache::{cache_key, EmbeddingCache},
    error::EmbeddingError,
    traits::{Embedding, EmbeddingOptions, EmbeddingProvider, EmbeddingResult},
};

/// Chunked, cache-aware batch embedding on top of any provider/cache pair.
pub struct BatchProcessor<P: EmbeddingProvider, C: EmbeddingCache> {
    provider: Arc<P>,
    cache: Option<Arc<C>>,
    max_concurrent: usize,
}

impl<P: EmbeddingProvider, C: EmbeddingCache> BatchProcessor<P, C> {
    pub fn new(provider: Arc<P>, cache: Option<Arc<C>>) -> Self {
        Self {
            provider,
            cache,
            max_concurrent: 10,
        }
    }

    pub fn with_concurrency(mut self, max_concurrent: usize) -> Self {
        self.max_concurrent = max_concurrent;
        self
    }

    /// Embeds `texts` in provider-sized chunks, serving cache hits from the
    /// cache and only sending cache misses to the provider.
    pub async fn process_batch(
        &self,
        texts: Vec<String>,
        options: &EmbeddingOptions,
    ) -> Result<EmbeddingResult, EmbeddingError> {
        if texts.is_empty() {
            return Err(EmbeddingError::InvalidInput(
                "Texts cannot be empty".to_string(),
            ));
        }

        let provider_batch_size = self.provider.max_batch_size();
        let cache_enabled = options.use_cache && self.cache.is_some();

        let mut all_embeddings = Vec::with_capacity(texts.len());
        let mut total_tokens = 0u32;
        let mut cached_count = 0usize;

        for chunk in texts.chunks(provider_batch_size) {
            let (cache_hits, cache_misses) = if cache_enabled {
                self.check_cache(chunk, self.provider.name(), self.provider.model())
                    .await
            } else {
                (vec![None; chunk.len()], (0..chunk.len()).collect())
            };

            let mut chunk_embeddings = cache_hits;
            cached_count += chunk_embeddings.iter().filter(|e| e.is_some()).count();

            if !cache_misses.is_empty() {
                let texts_to_embed: Vec<&str> = cache_misses
                    .iter()
                    .map(|&idx| chunk[idx].as_str())
                    .collect();

                debug!(
                    "Embedding {} texts (cached: {}, new: {})",
                    chunk.len(),
                    cached_count,
                    texts_to_embed.len()
                );

                let result = self.provider.embed_batch(&texts_to_embed, options).await?;

                if let Some(tokens) = result.total_tokens {
                    total_tokens += tokens;
                }

                if let Some(cache) = cache_enabled.then_some(self.cache.as_ref()).flatten() {
                    let cache_items = Self::build_cache_items(
                        self.provider.name(),
                        self.provider.model(),
                        chunk,
                        &cache_misses,
                        &result.embeddings,
                    );
                    cache.insert_batch(cache_items).await;
                }

                for (miss_idx, embedding) in cache_misses.iter().zip(result.embeddings.into_iter())
                {
                    chunk_embeddings[*miss_idx] = Some(embedding);
                }
            }

            all_embeddings.extend(
                chunk_embeddings
                    .into_iter()
                    .map(|e| e.expect("Missing embedding")),
            );
        }

        info!(
            "Batch complete: {} embeddings ({} cached, {} new)",
            texts.len(),
            cached_count,
            texts.len() - cached_count
        );

        Ok(EmbeddingResult {
            embeddings: all_embeddings,
            model: self.provider.model().to_string(),
            dimensions: self.provider.dimensions(),
            total_tokens: if total_tokens > 0 {
                Some(total_tokens)
            } else {
                None
            },
            cached_count,
        })
    }

    /// Embeds texts concurrently, with at most `max_concurrent` requests in
    /// flight, consulting the cache for each text individually.
    pub async fn process_stream(
        &self,
        texts: Vec<String>,
        options: &EmbeddingOptions,
    ) -> Result<Vec<Embedding>, EmbeddingError> {
        let provider = Arc::clone(&self.provider);
        let cache = self.cache.clone();
        let provider_name = provider.name().to_string();
        let provider_model = provider.model().to_string();
        let opts = options.clone();

        let embeddings: Vec<Embedding> = stream::iter(texts)
            .map(move |text| {
                let provider = Arc::clone(&provider);
                let cache = cache.clone();
                let provider_name = provider_name.clone();
                let provider_model = provider_model.clone();
                let opts = opts.clone();

                Self::embed_with_cache(provider, cache, text, provider_name, provider_model, opts)
            })
            // `buffered` (rather than `buffer_unordered`) preserves input order,
            // so the returned embeddings line up with the input texts.
            .buffered(self.max_concurrent)
            .collect::<Vec<Result<Embedding, EmbeddingError>>>()
            .await
            .into_iter()
            .collect::<Result<Vec<_>, _>>()?;

        Ok(embeddings)
    }

    fn build_cache_items(
        provider_name: &str,
        provider_model: &str,
        chunk: &[String],
        cache_misses: &[usize],
        embeddings: &[Embedding],
    ) -> Vec<(String, Embedding)> {
        cache_misses
            .iter()
            .zip(embeddings.iter())
            .map(|(&idx, emb)| {
                (
                    cache_key(provider_name, provider_model, &chunk[idx]),
                    emb.clone(),
                )
            })
            .collect()
    }

    async fn embed_with_cache(
        provider: Arc<P>,
        cache: Option<Arc<C>>,
        text: String,
        provider_name: String,
        provider_model: String,
        opts: EmbeddingOptions,
    ) -> Result<Embedding, EmbeddingError> {
        let key = cache_key(&provider_name, &provider_model, &text);

        if opts.use_cache {
            if let Some(cached) = Self::try_get_cached(&cache, &key).await {
                return Ok(cached);
            }
        }

        let embedding = provider.embed(&text, &opts).await?;

        if opts.use_cache {
            Self::cache_insert(&cache, &key, embedding.clone()).await;
        }

        Ok(embedding)
    }

    async fn try_get_cached(cache: &Option<Arc<C>>, key: &str) -> Option<Embedding> {
        match cache {
            Some(c) => c.get(key).await,
            None => None,
        }
    }

    async fn cache_insert(cache: &Option<Arc<C>>, key: &str, embedding: Embedding) {
        if let Some(c) = cache {
            c.insert(key, embedding).await;
        }
    }

    async fn check_cache(
        &self,
        texts: &[String],
        provider_name: &str,
        model_name: &str,
    ) -> (Vec<Option<Embedding>>, Vec<usize>) {
        let cache = match &self.cache {
            Some(c) => c,
            None => return (vec![None; texts.len()], (0..texts.len()).collect()),
        };

        let keys: Vec<String> = texts
            .iter()
            .map(|text| cache_key(provider_name, model_name, text))
            .collect();

        let cached = cache.get_batch(&keys).await;
        let misses: Vec<usize> = cached
            .iter()
            .enumerate()
            .filter(|(_, e)| e.is_none())
            .map(|(i, _)| i)
            .collect();

        (cached, misses)
    }
}

#[cfg(test)]
mod tests {
    use std::time::Duration;

    use super::*;
    use crate::{cache::MemoryCache, providers::FastEmbedProvider};

    #[tokio::test]
    async fn test_batch_processor_no_cache() {
        let provider = Arc::new(FastEmbedProvider::small().expect("Failed to init"));
        let processor: BatchProcessor<_, MemoryCache> = BatchProcessor::new(provider, None);

        let texts = vec!["Hello world".to_string(), "Goodbye world".to_string()];

        let options = EmbeddingOptions::no_cache();
        let result = processor
            .process_batch(texts, &options)
            .await
            .expect("Failed to process");

        assert_eq!(result.embeddings.len(), 2);
        assert_eq!(result.cached_count, 0);
    }

    #[tokio::test]
    async fn test_batch_processor_with_cache() {
        let provider = Arc::new(FastEmbedProvider::small().expect("Failed to init"));
        let cache = Arc::new(MemoryCache::new(100, Duration::from_secs(60)));
        let processor = BatchProcessor::new(provider, Some(cache));

        let texts = vec!["Hello world".to_string(), "Goodbye world".to_string()];

        let options = EmbeddingOptions::default_with_cache();
        let result1 = processor
            .process_batch(texts.clone(), &options)
            .await
            .expect("Failed first batch");

        assert_eq!(result1.embeddings.len(), 2);
        assert_eq!(result1.cached_count, 0);

        let result2 = processor
            .process_batch(texts, &options)
            .await
            .expect("Failed second batch");

        assert_eq!(result2.embeddings.len(), 2);
        assert_eq!(result2.cached_count, 2);

        assert_eq!(result1.embeddings, result2.embeddings);
    }

    #[tokio::test]
    async fn test_batch_processor_stream() {
        let provider = Arc::new(FastEmbedProvider::small().expect("Failed to init"));
        let cache = Arc::new(MemoryCache::with_defaults());
        let processor = BatchProcessor::new(provider, Some(cache)).with_concurrency(2);

        let texts = vec![
            "Text 1".to_string(),
            "Text 2".to_string(),
            "Text 3".to_string(),
            "Text 4".to_string(),
        ];

        let options = EmbeddingOptions::default_with_cache();
        let embeddings = processor
            .process_stream(texts, &options)
            .await
            .expect("Failed stream");

        assert_eq!(embeddings.len(), 4);
    }
}
crates/stratum-embeddings/src/cache/memory.rs (vendored, new file, 167 lines)

@@ -0,0 +1,167 @@
#[cfg(feature = "memory-cache")]
use std::time::Duration;

#[cfg(feature = "memory-cache")]
use async_trait::async_trait;
#[cfg(feature = "memory-cache")]
use moka::future::Cache;

#[cfg(feature = "memory-cache")]
use crate::{cache::EmbeddingCache, traits::Embedding};

pub struct MemoryCache {
    cache: Cache<String, Embedding>,
}

impl MemoryCache {
    pub fn new(max_capacity: u64, ttl: Duration) -> Self {
        let cache = Cache::builder()
            .max_capacity(max_capacity)
            .time_to_live(ttl)
            .build();

        Self { cache }
    }

    pub fn with_defaults() -> Self {
        Self::new(10_000, Duration::from_secs(3600))
    }

    pub fn unlimited(ttl: Duration) -> Self {
        let cache = Cache::builder().time_to_live(ttl).build();

        Self { cache }
    }
}

impl Default for MemoryCache {
    fn default() -> Self {
        Self::with_defaults()
    }
}

#[async_trait]
impl EmbeddingCache for MemoryCache {
    async fn get(&self, key: &str) -> Option<Embedding> {
        self.cache.get(key).await
    }

    async fn insert(&self, key: &str, embedding: Embedding) {
        self.cache.insert(key.to_string(), embedding).await;
        self.cache.run_pending_tasks().await;
    }

    async fn get_batch(&self, keys: &[String]) -> Vec<Option<Embedding>> {
        let mut results = Vec::with_capacity(keys.len());
        for key in keys {
            results.push(self.cache.get(key).await);
        }
        results
    }

    async fn insert_batch(&self, items: Vec<(String, Embedding)>) {
        for (key, embedding) in items {
            self.cache.insert(key, embedding).await;
        }
        self.cache.run_pending_tasks().await;
    }

    async fn invalidate(&self, key: &str) {
        self.cache.invalidate(key).await;
        self.cache.run_pending_tasks().await;
    }

    async fn clear(&self) {
        self.cache.invalidate_all();
        self.cache.run_pending_tasks().await;
    }

    fn size(&self) -> usize {
        self.cache.entry_count() as usize
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[tokio::test]
    async fn test_memory_cache_basic() {
        let cache = MemoryCache::with_defaults();

        let embedding = vec![1.0, 2.0, 3.0];
        cache.insert("test_key", embedding.clone()).await;

        let retrieved = cache.get("test_key").await;
        assert_eq!(retrieved, Some(embedding));

        let missing = cache.get("missing_key").await;
        assert_eq!(missing, None);
    }

    #[tokio::test]
    async fn test_memory_cache_batch() {
        let cache = MemoryCache::with_defaults();

        let items = vec![
            ("key1".to_string(), vec![1.0, 2.0]),
            ("key2".to_string(), vec![3.0, 4.0]),
            ("key3".to_string(), vec![5.0, 6.0]),
        ];

        cache.insert_batch(items.clone()).await;

        let keys = vec![
            "key1".to_string(),
            "key2".to_string(),
            "missing".to_string(),
        ];
        let results = cache.get_batch(&keys).await;

        assert_eq!(results.len(), 3);
        assert_eq!(results[0], Some(vec![1.0, 2.0]));
        assert_eq!(results[1], Some(vec![3.0, 4.0]));
        assert_eq!(results[2], None);
    }

    #[tokio::test]
    async fn test_memory_cache_invalidate() {
        let cache = MemoryCache::with_defaults();

        cache.insert("key1", vec![1.0, 2.0]).await;
        cache.insert("key2", vec![3.0, 4.0]).await;

        assert!(cache.get("key1").await.is_some());
        assert!(cache.get("key2").await.is_some());

        cache.invalidate("key1").await;

        assert!(cache.get("key1").await.is_none());
        assert!(cache.get("key2").await.is_some());
    }

    #[tokio::test]
    async fn test_memory_cache_clear() {
        let cache = MemoryCache::with_defaults();

        cache.insert("key1", vec![1.0]).await;
        cache.insert("key2", vec![2.0]).await;

        cache.clear().await;

        assert!(cache.get("key1").await.is_none());
        assert!(cache.get("key2").await.is_none());
    }

    #[tokio::test]
    async fn test_memory_cache_ttl() {
        let cache = MemoryCache::new(1000, Duration::from_millis(100));

        cache.insert("key1", vec![1.0, 2.0]).await;
        assert!(cache.get("key1").await.is_some());

        tokio::time::sleep(Duration::from_millis(150)).await;

        assert!(cache.get("key1").await.is_none());
    }
}
47
crates/stratum-embeddings/src/cache/mod.rs
vendored
Normal file
@ -0,0 +1,47 @@
#[cfg(feature = "memory-cache")]
pub mod memory;
#[cfg(feature = "persistent-cache")]
pub mod persistent;

use async_trait::async_trait;
#[cfg(feature = "memory-cache")]
pub use memory::MemoryCache;
#[cfg(feature = "persistent-cache")]
pub use persistent::PersistentCache;

use crate::traits::Embedding;

#[async_trait]
pub trait EmbeddingCache: Send + Sync {
    async fn get(&self, key: &str) -> Option<Embedding>;
    async fn insert(&self, key: &str, embedding: Embedding);
    async fn get_batch(&self, keys: &[String]) -> Vec<Option<Embedding>>;
    async fn insert_batch(&self, items: Vec<(String, Embedding)>);
    async fn invalidate(&self, key: &str);
    async fn clear(&self);
    fn size(&self) -> usize;
}

pub fn cache_key(provider: &str, model: &str, text: &str) -> String {
    use xxhash_rust::xxh3::xxh3_64;
    let hash = xxh3_64(format!("{}:{}:{}", provider, model, text).as_bytes());
    format!("{}:{}:{:x}", provider, model, hash)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_cache_key_consistency() {
        let key1 = cache_key("fastembed", "bge-small", "hello world");
        let key2 = cache_key("fastembed", "bge-small", "hello world");
        assert_eq!(key1, key2);

        let key3 = cache_key("fastembed", "bge-small", "hello world!");
        assert_ne!(key1, key3);

        let key4 = cache_key("openai", "bge-small", "hello world");
        assert_ne!(key1, key4);
    }
}
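The cache_key scheme prefixes the xxHash3 digest with provider and model, so the same text embedded by two different models never collides. A hypothetical helper (not part of the crate) wiring the key function, the cache trait, and a provider together:

use stratum_embeddings::cache::{cache_key, EmbeddingCache};
use stratum_embeddings::{EmbeddingOptions, EmbeddingProvider, Result};

// Hypothetical: check the cache first, fall back to the provider, then
// store the fresh embedding under the provider/model-scoped key.
async fn get_or_embed(
    cache: &dyn EmbeddingCache,
    provider: &dyn EmbeddingProvider,
    text: &str,
) -> Result<Vec<f32>> {
    let key = cache_key(provider.name(), provider.model(), text);
    if let Some(hit) = cache.get(&key).await {
        return Ok(hit);
    }
    let embedding = provider
        .embed(text, &EmbeddingOptions::default_with_cache())
        .await?;
    cache.insert(&key, embedding.clone()).await;
    Ok(embedding)
}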
152
crates/stratum-embeddings/src/cache/persistent.rs
vendored
Normal file
@ -0,0 +1,152 @@
#[cfg(feature = "persistent-cache")]
use std::path::Path;

#[cfg(feature = "persistent-cache")]
use async_trait::async_trait;
#[cfg(feature = "persistent-cache")]
use sled::Db;

#[cfg(feature = "persistent-cache")]
use crate::{cache::EmbeddingCache, error::EmbeddingError, traits::Embedding};

pub struct PersistentCache {
    db: Db,
}

impl PersistentCache {
    pub fn new<P: AsRef<Path>>(path: P) -> Result<Self, EmbeddingError> {
        let db = sled::open(path)
            .map_err(|e| EmbeddingError::CacheError(format!("Failed to open sled db: {}", e)))?;

        Ok(Self { db })
    }

    pub fn in_memory() -> Result<Self, EmbeddingError> {
        let db = sled::Config::new().temporary(true).open().map_err(|e| {
            EmbeddingError::CacheError(format!("Failed to create in-memory sled db: {}", e))
        })?;

        Ok(Self { db })
    }

    fn serialize_embedding(embedding: &Embedding) -> Result<Vec<u8>, EmbeddingError> {
        serde_json::to_vec(embedding)
            .map_err(|e| EmbeddingError::SerializationError(format!("Embedding serialize: {}", e)))
    }

    fn deserialize_embedding(data: &[u8]) -> Result<Embedding, EmbeddingError> {
        serde_json::from_slice(data).map_err(|e| {
            EmbeddingError::SerializationError(format!("Embedding deserialize: {}", e))
        })
    }
}

#[async_trait]
impl EmbeddingCache for PersistentCache {
    async fn get(&self, key: &str) -> Option<Embedding> {
        self.db
            .get(key)
            .ok()
            .flatten()
            .and_then(|bytes| Self::deserialize_embedding(&bytes).ok())
    }

    async fn insert(&self, key: &str, embedding: Embedding) {
        if let Ok(bytes) = Self::serialize_embedding(&embedding) {
            let _ = self.db.insert(key, bytes);
            let _ = self.db.flush();
        }
    }

    async fn get_batch(&self, keys: &[String]) -> Vec<Option<Embedding>> {
        keys.iter()
            .map(|key| {
                self.db
                    .get(key)
                    .ok()
                    .flatten()
                    .and_then(|bytes| Self::deserialize_embedding(&bytes).ok())
            })
            .collect()
    }

    async fn insert_batch(&self, items: Vec<(String, Embedding)>) {
        for (key, embedding) in items {
            if let Ok(bytes) = Self::serialize_embedding(&embedding) {
                let _ = self.db.insert(key, bytes);
            }
        }
        let _ = self.db.flush();
    }

    async fn invalidate(&self, key: &str) {
        let _ = self.db.remove(key);
        let _ = self.db.flush();
    }

    async fn clear(&self) {
        let _ = self.db.clear();
        let _ = self.db.flush();
    }

    fn size(&self) -> usize {
        self.db.len()
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[tokio::test]
    async fn test_persistent_cache_in_memory() {
        let cache = PersistentCache::in_memory().expect("Failed to create cache");

        let embedding = vec![1.0, 2.0, 3.0];
        cache.insert("test_key", embedding.clone()).await;

        let retrieved = cache.get("test_key").await;
        assert_eq!(retrieved, Some(embedding));
    }

    #[tokio::test]
    async fn test_persistent_cache_batch() {
        let cache = PersistentCache::in_memory().expect("Failed to create cache");

        let items = vec![
            ("key1".to_string(), vec![1.0, 2.0]),
            ("key2".to_string(), vec![3.0, 4.0]),
        ];

        cache.insert_batch(items).await;

        let keys = vec!["key1".to_string(), "key2".to_string()];
        let results = cache.get_batch(&keys).await;

        assert_eq!(results[0], Some(vec![1.0, 2.0]));
        assert_eq!(results[1], Some(vec![3.0, 4.0]));
    }

    #[tokio::test]
    async fn test_persistent_cache_invalidate() {
        let cache = PersistentCache::in_memory().expect("Failed to create cache");

        cache.insert("key1", vec![1.0]).await;
        assert!(cache.get("key1").await.is_some());

        cache.invalidate("key1").await;
        assert!(cache.get("key1").await.is_none());
    }

    #[tokio::test]
    async fn test_persistent_cache_clear() {
        let cache = PersistentCache::in_memory().expect("Failed to create cache");

        cache.insert("key1", vec![1.0]).await;
        cache.insert("key2", vec![2.0]).await;
        assert_eq!(cache.size(), 2);

        cache.clear().await;
        assert_eq!(cache.size(), 0);
    }
}
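Unlike the in-memory variant, the sled tree survives restarts; in_memory() exists only for tests. A minimal sketch of a disk-backed cache across runs (the ./embed_cache path is illustrative and must be writable):

use stratum_embeddings::cache::EmbeddingCache;
use stratum_embeddings::PersistentCache;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // First run writes; later runs reopen the same sled tree on disk.
    let cache = PersistentCache::new("./embed_cache")?;

    if cache.get("corpus:intro").await.is_none() {
        cache.insert("corpus:intro", vec![0.5, 0.25, 0.125]).await;
    }
    println!("entries on disk: {}", cache.size());
    Ok(())
}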
153
crates/stratum-embeddings/src/config.rs
Normal file
@ -0,0 +1,153 @@
use std::time::Duration;

use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct EmbeddingConfig {
    pub provider: ProviderConfig,
    pub cache: CacheConfig,
    #[serde(default)]
    pub batch: BatchConfig,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(tag = "type", rename_all = "lowercase")]
pub enum ProviderConfig {
    FastEmbed {
        model: String,
    },
    OpenAI {
        api_key: String,
        model: String,
        base_url: Option<String>,
    },
    Ollama {
        model: String,
        base_url: Option<String>,
    },
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct CacheConfig {
    pub enabled: bool,
    pub max_capacity: u64,
    #[serde(with = "humantime_serde")]
    pub ttl: Duration,
}

impl Default for CacheConfig {
    fn default() -> Self {
        Self {
            enabled: true,
            max_capacity: 10_000,
            ttl: Duration::from_secs(3600),
        }
    }
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct BatchConfig {
    pub max_concurrent: usize,
    pub chunk_size: Option<usize>,
}

impl Default for BatchConfig {
    fn default() -> Self {
        Self {
            max_concurrent: 10,
            chunk_size: None,
        }
    }
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct VectorStoreSettings {
    pub dimensions: usize,
    pub metric: String,
}

impl EmbeddingConfig {
    pub fn from_env(provider_type: &str) -> Result<Self, crate::error::EmbeddingError> {
        let provider = match provider_type {
            "fastembed" => {
                let model =
                    std::env::var("FASTEMBED_MODEL").unwrap_or_else(|_| "bge-small-en".to_string());
                ProviderConfig::FastEmbed { model }
            }
            "openai" => {
                let api_key = std::env::var("OPENAI_API_KEY").map_err(|_| {
                    crate::error::EmbeddingError::ConfigError("OPENAI_API_KEY not set".to_string())
                })?;
                let model = std::env::var("OPENAI_MODEL")
                    .unwrap_or_else(|_| "text-embedding-3-small".to_string());
                let base_url = std::env::var("OPENAI_BASE_URL").ok();
                ProviderConfig::OpenAI {
                    api_key,
                    model,
                    base_url,
                }
            }
            "ollama" => {
                let model = std::env::var("OLLAMA_MODEL")
                    .unwrap_or_else(|_| "nomic-embed-text".to_string());
                let base_url = std::env::var("OLLAMA_BASE_URL").ok();
                ProviderConfig::Ollama { model, base_url }
            }
            _ => {
                return Err(crate::error::EmbeddingError::ConfigError(format!(
                    "Unknown provider type: {}",
                    provider_type
                )))
            }
        };

        Ok(Self {
            provider,
            cache: CacheConfig::default(),
            batch: BatchConfig::default(),
        })
    }

    pub fn with_cache(mut self, config: CacheConfig) -> Self {
        self.cache = config;
        self
    }

    pub fn with_batch(mut self, config: BatchConfig) -> Self {
        self.batch = config;
        self
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_config_serialization() {
        let config = EmbeddingConfig {
            provider: ProviderConfig::FastEmbed {
                model: "bge-small-en".to_string(),
            },
            cache: CacheConfig::default(),
            batch: BatchConfig::default(),
        };

        let json = serde_json::to_string(&config).expect("Failed to serialize");
        let deserialized: EmbeddingConfig =
            serde_json::from_str(&json).expect("Failed to deserialize");

        match deserialized.provider {
            ProviderConfig::FastEmbed { model } => assert_eq!(model, "bge-small-en"),
            _ => panic!("Wrong provider type"),
        }
    }

    #[test]
    fn test_cache_config_defaults() {
        let config = CacheConfig::default();
        assert!(config.enabled);
        assert_eq!(config.max_capacity, 10_000);
        assert_eq!(config.ttl, Duration::from_secs(3600));
    }
}
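The serde attributes determine the on-disk shape: the type tag selects the provider variant, rename_all lowercases the variant name, and humantime_serde accepts durations like "1h" or "30m". A sketch of parsing such a config from YAML (values illustrative; serde_yaml is already a workspace dependency):

use stratum_embeddings::EmbeddingConfig;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let yaml = r#"
provider:
  type: ollama
  model: nomic-embed-text
  base_url: http://localhost:11434
cache:
  enabled: true
  max_capacity: 10000
  ttl: 1h
batch:
  max_concurrent: 10
  chunk_size: 256
"#;
    // "type: ollama" picks ProviderConfig::Ollama; "1h" parses via humantime.
    let config: EmbeddingConfig = serde_yaml::from_str(yaml)?;
    println!("{:?}", config.provider);
    Ok(())
}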
76
crates/stratum-embeddings/src/error.rs
Normal file
@ -0,0 +1,76 @@
use thiserror::Error;

#[derive(Error, Debug)]
pub enum EmbeddingError {
    #[error("Provider initialization failed: {0}")]
    Initialization(String),

    #[error("Provider not available: {0}")]
    ProviderUnavailable(String),

    #[error("API request failed: {0}")]
    ApiError(String),

    #[error("Invalid input: {0}")]
    InvalidInput(String),

    #[error("Dimension mismatch: expected {expected}, got {actual}")]
    DimensionMismatch { expected: usize, actual: usize },

    #[error("Batch size {size} exceeds maximum {max}")]
    BatchSizeExceeded { size: usize, max: usize },

    #[error("Cache error: {0}")]
    CacheError(String),

    #[error("Store error: {0}")]
    StoreError(String),

    #[error("Serialization error: {0}")]
    SerializationError(String),

    #[error("Rate limit exceeded: {0}")]
    RateLimitExceeded(String),

    #[error("Timeout: {0}")]
    Timeout(String),

    #[error("Configuration error: {0}")]
    ConfigError(String),

    #[error("IO error: {0}")]
    IoError(String),

    #[error("HTTP error: {0}")]
    HttpError(String),

    #[error(transparent)]
    Other(#[from] Box<dyn std::error::Error + Send + Sync>),
}

impl From<std::io::Error> for EmbeddingError {
    fn from(err: std::io::Error) -> Self {
        Self::IoError(err.to_string())
    }
}

impl From<serde_json::Error> for EmbeddingError {
    fn from(err: serde_json::Error) -> Self {
        Self::SerializationError(err.to_string())
    }
}

#[cfg(feature = "reqwest")]
impl From<reqwest::Error> for EmbeddingError {
    fn from(err: reqwest::Error) -> Self {
        if err.is_timeout() {
            Self::Timeout(err.to_string())
        } else if err.is_status() {
            Self::HttpError(format!("HTTP {}: {}", err.status().unwrap(), err))
        } else {
            Self::ApiError(err.to_string())
        }
    }
}

pub type Result<T> = std::result::Result<T, EmbeddingError>;
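Because the variants distinguish transient from permanent failures, callers can branch on them for retry decisions. A hedged sketch of such a classification helper (hypothetical, not part of the crate):

use stratum_embeddings::EmbeddingError;

// Hypothetical: transient errors are worth retrying with backoff,
// everything else should surface to the caller immediately.
fn is_transient(err: &EmbeddingError) -> bool {
    matches!(
        err,
        EmbeddingError::Timeout(_)
            | EmbeddingError::RateLimitExceeded(_)
            | EmbeddingError::ProviderUnavailable(_)
    )
}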
34
crates/stratum-embeddings/src/lib.rs
Normal file
@ -0,0 +1,34 @@
pub mod batch;
pub mod cache;
pub mod config;
pub mod error;
pub mod metrics;
pub mod providers;
pub mod service;
pub mod store;
pub mod traits;

#[cfg(feature = "memory-cache")]
pub use cache::MemoryCache;
#[cfg(feature = "persistent-cache")]
pub use cache::PersistentCache;
pub use config::{BatchConfig, CacheConfig, EmbeddingConfig, ProviderConfig};
pub use error::{EmbeddingError, Result};
#[cfg(feature = "cohere-provider")]
pub use providers::cohere::{CohereModel, CohereProvider};
#[cfg(feature = "fastembed-provider")]
pub use providers::fastembed::{FastEmbedModel, FastEmbedProvider};
#[cfg(feature = "huggingface-provider")]
pub use providers::huggingface::{HuggingFaceModel, HuggingFaceProvider};
#[cfg(feature = "ollama-provider")]
pub use providers::ollama::{OllamaModel, OllamaProvider};
#[cfg(feature = "openai-provider")]
pub use providers::openai::{OpenAiModel, OpenAiProvider};
#[cfg(feature = "voyage-provider")]
pub use providers::voyage::{VoyageModel, VoyageProvider};
pub use service::EmbeddingService;
pub use store::*;
pub use traits::{
    cosine_similarity, euclidean_distance, normalize_embedding, Embedding, EmbeddingOptions,
    EmbeddingProvider, EmbeddingResult, ProviderInfo,
};
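The flat re-exports mean downstream crates need a single use line. A small sketch exercising the similarity helpers, assuming the conventional signatures (normalize in place, slice-based cosine returning f32):

use stratum_embeddings::{cosine_similarity, normalize_embedding};

fn main() {
    let mut a = vec![3.0_f32, 4.0];
    let b = vec![4.0_f32, 3.0];

    normalize_embedding(&mut a); // unit length in place
    let sim = cosine_similarity(&a, &b);
    assert!(sim > 0.9); // same quadrant, similar direction
    println!("cosine similarity: {sim}");
}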
195
crates/stratum-embeddings/src/metrics.rs
Normal file
@ -0,0 +1,195 @@
#[cfg(feature = "metrics")]
use std::sync::OnceLock;

#[cfg(feature = "metrics")]
use prometheus::{
    register_histogram_vec, register_int_counter_vec, HistogramOpts, HistogramVec, IntCounterVec,
    Opts,
};

#[cfg(feature = "metrics")]
static EMBEDDING_REQUESTS: OnceLock<IntCounterVec> = OnceLock::new();
#[cfg(feature = "metrics")]
static EMBEDDING_ERRORS: OnceLock<IntCounterVec> = OnceLock::new();
#[cfg(feature = "metrics")]
static EMBEDDING_DURATION: OnceLock<HistogramVec> = OnceLock::new();
#[cfg(feature = "metrics")]
static CACHE_HITS: OnceLock<IntCounterVec> = OnceLock::new();
#[cfg(feature = "metrics")]
static CACHE_MISSES: OnceLock<IntCounterVec> = OnceLock::new();
#[cfg(feature = "metrics")]
static TOKENS_PROCESSED: OnceLock<IntCounterVec> = OnceLock::new();

#[cfg(feature = "metrics")]
pub fn init_metrics() -> Result<(), Box<dyn std::error::Error>> {
    EMBEDDING_REQUESTS.get_or_init(|| {
        register_int_counter_vec!(
            Opts::new(
                "embedding_requests_total",
                "Total number of embedding requests"
            ),
            &["provider", "model"]
        )
        .expect("Failed to register embedding_requests_total")
    });

    EMBEDDING_ERRORS.get_or_init(|| {
        register_int_counter_vec!(
            Opts::new("embedding_errors_total", "Total number of embedding errors"),
            &["provider", "model", "error_type"]
        )
        .expect("Failed to register embedding_errors_total")
    });

    EMBEDDING_DURATION.get_or_init(|| {
        register_histogram_vec!(
            HistogramOpts::new("embedding_duration_seconds", "Embedding request duration")
                .buckets(vec![0.001, 0.01, 0.1, 0.5, 1.0, 5.0, 10.0]),
            &["provider", "model"]
        )
        .expect("Failed to register embedding_duration_seconds")
    });

    CACHE_HITS.get_or_init(|| {
        register_int_counter_vec!(
            Opts::new("embedding_cache_hits_total", "Total cache hits"),
            &["provider", "model"]
        )
        .expect("Failed to register embedding_cache_hits_total")
    });

    CACHE_MISSES.get_or_init(|| {
        register_int_counter_vec!(
            Opts::new("embedding_cache_misses_total", "Total cache misses"),
            &["provider", "model"]
        )
        .expect("Failed to register embedding_cache_misses_total")
    });

    TOKENS_PROCESSED.get_or_init(|| {
        register_int_counter_vec!(
            Opts::new("embedding_tokens_processed_total", "Total tokens processed"),
            &["provider", "model"]
        )
        .expect("Failed to register embedding_tokens_processed_total")
    });

    Ok(())
}

#[cfg(feature = "metrics")]
pub fn record_request(provider: &str, model: &str) {
    if let Some(counter) = EMBEDDING_REQUESTS.get() {
        counter.with_label_values(&[provider, model]).inc();
    }
}

#[cfg(feature = "metrics")]
pub fn record_error(provider: &str, model: &str, error_type: &str) {
    if let Some(counter) = EMBEDDING_ERRORS.get() {
        counter
            .with_label_values(&[provider, model, error_type])
            .inc();
    }
}

#[cfg(feature = "metrics")]
pub fn record_duration(provider: &str, model: &str, duration_secs: f64) {
    if let Some(histogram) = EMBEDDING_DURATION.get() {
        histogram
            .with_label_values(&[provider, model])
            .observe(duration_secs);
    }
}

#[cfg(feature = "metrics")]
pub fn record_cache_hit(provider: &str, model: &str) {
    if let Some(counter) = CACHE_HITS.get() {
        counter.with_label_values(&[provider, model]).inc();
    }
}

#[cfg(feature = "metrics")]
pub fn record_cache_miss(provider: &str, model: &str) {
    if let Some(counter) = CACHE_MISSES.get() {
        counter.with_label_values(&[provider, model]).inc();
    }
}

#[cfg(feature = "metrics")]
pub fn record_tokens(provider: &str, model: &str, tokens: u64) {
    if let Some(counter) = TOKENS_PROCESSED.get() {
        counter.with_label_values(&[provider, model]).inc_by(tokens);
    }
}

#[cfg(feature = "metrics")]
pub struct MetricsGuard {
    provider: String,
    model: String,
    start: std::time::Instant,
}

#[cfg(feature = "metrics")]
impl MetricsGuard {
    pub fn new(provider: &str, model: &str) -> Self {
        record_request(provider, model);
        Self {
            provider: provider.to_string(),
            model: model.to_string(),
            start: std::time::Instant::now(),
        }
    }

    pub fn record_success(self, tokens: Option<u32>) {
        let duration = self.start.elapsed().as_secs_f64();
        record_duration(&self.provider, &self.model, duration);

        if let Some(token_count) = tokens {
            record_tokens(&self.provider, &self.model, token_count as u64);
        }
    }

    pub fn record_error(self, error_type: &str) {
        record_error(&self.provider, &self.model, error_type);
        let duration = self.start.elapsed().as_secs_f64();
        record_duration(&self.provider, &self.model, duration);
    }
}

#[cfg(not(feature = "metrics"))]
pub fn init_metrics() -> Result<(), Box<dyn std::error::Error>> {
    Ok(())
}

#[cfg(test)]
#[cfg(feature = "metrics")]
mod tests {
    use super::*;

    #[test]
    fn test_metrics_initialization() {
        let result = init_metrics();
        assert!(result.is_ok());
    }

    #[test]
    fn test_record_request() {
        init_metrics().unwrap();
        record_request("test-provider", "test-model");
    }

    #[test]
    fn test_metrics_guard() {
        init_metrics().unwrap();
        let guard = MetricsGuard::new("test", "model");
        guard.record_success(Some(100));
    }

    #[test]
    fn test_cache_metrics() {
        init_metrics().unwrap();
        record_cache_hit("test", "model");
        record_cache_miss("test", "model");
    }
}
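The guard is consumed by value on either outcome, so each request is timed exactly once. A sketch of wrapping a provider call with it (requires the metrics feature; the error-type label here is illustrative):

use stratum_embeddings::metrics::MetricsGuard;
use stratum_embeddings::{EmbeddingOptions, EmbeddingProvider, Result};

async fn timed_embed(provider: &dyn EmbeddingProvider, text: &str) -> Result<Vec<f32>> {
    let guard = MetricsGuard::new(provider.name(), provider.model());
    match provider
        .embed(text, &EmbeddingOptions::default_with_cache())
        .await
    {
        Ok(embedding) => {
            guard.record_success(None); // token count unknown for local providers
            Ok(embedding)
        }
        Err(err) => {
            guard.record_error("embed_failed"); // illustrative label
            Err(err)
        }
    }
}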
251
crates/stratum-embeddings/src/providers/cohere.rs
Normal file
@ -0,0 +1,251 @@
#[cfg(feature = "cohere-provider")]
use async_trait::async_trait;
#[cfg(feature = "cohere-provider")]
use reqwest::Client;
#[cfg(feature = "cohere-provider")]
use serde::{Deserialize, Serialize};

#[cfg(feature = "cohere-provider")]
use crate::{
    error::EmbeddingError,
    traits::{Embedding, EmbeddingOptions, EmbeddingProvider, EmbeddingResult},
};

#[derive(Debug, Clone, Copy, PartialEq, Default)]
pub enum CohereModel {
    #[default]
    EmbedEnglishV3,
    EmbedMultilingualV3,
    EmbedEnglishLightV3,
    EmbedMultilingualLightV3,
    EmbedEnglishV2,
    EmbedMultilingualV2,
}

impl CohereModel {
    pub fn model_name(&self) -> &'static str {
        match self {
            Self::EmbedEnglishV3 => "embed-english-v3.0",
            Self::EmbedMultilingualV3 => "embed-multilingual-v3.0",
            Self::EmbedEnglishLightV3 => "embed-english-light-v3.0",
            Self::EmbedMultilingualLightV3 => "embed-multilingual-light-v3.0",
            Self::EmbedEnglishV2 => "embed-english-v2.0",
            Self::EmbedMultilingualV2 => "embed-multilingual-v2.0",
        }
    }

    pub fn dimensions(&self) -> usize {
        match self {
            Self::EmbedEnglishV3 | Self::EmbedMultilingualV3 => 1024,
            Self::EmbedEnglishLightV3 | Self::EmbedMultilingualLightV3 => 384,
            Self::EmbedEnglishV2 | Self::EmbedMultilingualV2 => 4096,
        }
    }
}

#[derive(Debug, Serialize)]
struct CohereEmbedRequest {
    model: String,
    texts: Vec<String>,
    input_type: String,
    #[serde(skip_serializing_if = "Option::is_none")]
    truncate: Option<String>,
}

#[derive(Debug, Deserialize)]
struct CohereEmbedResponse {
    embeddings: Vec<Vec<f32>>,
    meta: CohereMeta,
}

#[derive(Debug, Deserialize)]
struct CohereMeta {
    billed_units: Option<BilledUnits>,
}

#[derive(Debug, Deserialize)]
struct BilledUnits {
    input_tokens: Option<u32>,
}

pub struct CohereProvider {
    client: Client,
    api_key: String,
    model: CohereModel,
    base_url: String,
}

impl CohereProvider {
    pub fn new(api_key: String, model: CohereModel) -> Self {
        Self {
            client: Client::new(),
            api_key,
            model,
            base_url: "https://api.cohere.ai/v1".to_string(),
        }
    }

    pub fn with_base_url(mut self, base_url: String) -> Self {
        self.base_url = base_url;
        self
    }

    pub fn embed_english_v3(api_key: String) -> Self {
        Self::new(api_key, CohereModel::EmbedEnglishV3)
    }

    pub fn embed_multilingual_v3(api_key: String) -> Self {
        Self::new(api_key, CohereModel::EmbedMultilingualV3)
    }

    async fn embed_batch_internal(
        &self,
        texts: &[&str],
        _options: &EmbeddingOptions,
    ) -> Result<EmbeddingResult, EmbeddingError> {
        let request = CohereEmbedRequest {
            model: self.model.model_name().to_string(),
            texts: texts.iter().map(|s| s.to_string()).collect(),
            input_type: "search_document".to_string(),
            truncate: Some("END".to_string()),
        };

        let response = self
            .client
            .post(format!("{}/embed", self.base_url))
            .header("Authorization", format!("Bearer {}", self.api_key))
            .header("Content-Type", "application/json")
            .json(&request)
            .send()
            .await
            .map_err(|e| EmbeddingError::ApiError(format!("Cohere API request failed: {}", e)))?;

        if !response.status().is_success() {
            let status = response.status();
            let error_text = response.text().await.unwrap_or_default();
            return Err(EmbeddingError::ApiError(format!(
                "Cohere API error {}: {}",
                status, error_text
            )));
        }

        let result: CohereEmbedResponse = response.json().await.map_err(|e| {
            EmbeddingError::ApiError(format!("Failed to parse Cohere response: {}", e))
        })?;

        let total_tokens = result.meta.billed_units.and_then(|b| b.input_tokens);

        Ok(EmbeddingResult {
            embeddings: result.embeddings,
            model: self.model.model_name().to_string(),
            dimensions: self.model.dimensions(),
            total_tokens,
            cached_count: 0,
        })
    }
}

#[async_trait]
impl EmbeddingProvider for CohereProvider {
    fn name(&self) -> &str {
        "cohere"
    }

    fn model(&self) -> &str {
        self.model.model_name()
    }

    fn dimensions(&self) -> usize {
        self.model.dimensions()
    }

    fn is_local(&self) -> bool {
        false
    }

    fn max_tokens(&self) -> usize {
        match self.model {
            CohereModel::EmbedEnglishV3
            | CohereModel::EmbedMultilingualV3
            | CohereModel::EmbedEnglishLightV3
            | CohereModel::EmbedMultilingualLightV3 => 512,
            CohereModel::EmbedEnglishV2 | CohereModel::EmbedMultilingualV2 => 512,
        }
    }

    fn max_batch_size(&self) -> usize {
        96
    }

    fn cost_per_1m_tokens(&self) -> f64 {
        match self.model {
            CohereModel::EmbedEnglishV3 | CohereModel::EmbedMultilingualV3 => 0.10,
            CohereModel::EmbedEnglishLightV3 | CohereModel::EmbedMultilingualLightV3 => 0.10,
            CohereModel::EmbedEnglishV2 | CohereModel::EmbedMultilingualV2 => 0.10,
        }
    }

    async fn is_available(&self) -> bool {
        !self.api_key.is_empty()
    }

    async fn embed(
        &self,
        text: &str,
        options: &EmbeddingOptions,
    ) -> Result<Embedding, EmbeddingError> {
        let result = self.embed_batch_internal(&[text], options).await?;
        result
            .embeddings
            .into_iter()
            .next()
            .ok_or_else(|| EmbeddingError::ApiError("No embedding returned".to_string()))
    }

    async fn embed_batch(
        &self,
        texts: &[&str],
        options: &EmbeddingOptions,
    ) -> Result<EmbeddingResult, EmbeddingError> {
        if texts.is_empty() {
            return Err(EmbeddingError::InvalidInput(
                "Cannot embed empty text list".to_string(),
            ));
        }

        if texts.len() > self.max_batch_size() {
            return Err(EmbeddingError::InvalidInput(format!(
                "Batch size {} exceeds maximum {}",
                texts.len(),
                self.max_batch_size()
            )));
        }

        self.embed_batch_internal(texts, options).await
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_cohere_model_names() {
        assert_eq!(
            CohereModel::EmbedEnglishV3.model_name(),
            "embed-english-v3.0"
        );
        assert_eq!(CohereModel::EmbedMultilingualV3.dimensions(), 1024);
        assert_eq!(CohereModel::EmbedEnglishLightV3.dimensions(), 384);
        assert_eq!(CohereModel::EmbedEnglishV2.dimensions(), 4096);
    }

    #[tokio::test]
    async fn test_cohere_provider_creation() {
        let provider = CohereProvider::new("test-key".to_string(), CohereModel::EmbedEnglishV3);
        assert_eq!(provider.name(), "cohere");
        assert_eq!(provider.model(), "embed-english-v3.0");
        assert_eq!(provider.dimensions(), 1024);
        assert_eq!(provider.max_batch_size(), 96);
    }
}
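A minimal sketch of driving the provider end to end (needs a real COHERE_API_KEY at runtime; the env-var lookup and texts are illustrative):

use stratum_embeddings::{CohereModel, CohereProvider, EmbeddingOptions, EmbeddingProvider};

#[tokio::main]
async fn main() -> stratum_embeddings::Result<()> {
    let api_key = std::env::var("COHERE_API_KEY").expect("COHERE_API_KEY not set");
    let provider = CohereProvider::new(api_key, CohereModel::EmbedEnglishLightV3);

    let result = provider
        .embed_batch(
            &["first doc", "second doc"],
            &EmbeddingOptions::default_with_cache(),
        )
        .await?;
    assert_eq!(result.dimensions, 384);
    println!("billed tokens: {:?}", result.total_tokens);
    Ok(())
}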
247
crates/stratum-embeddings/src/providers/fastembed.rs
Normal file
@ -0,0 +1,247 @@
#[cfg(feature = "fastembed-provider")]
use std::sync::{Arc, Mutex};

#[cfg(feature = "fastembed-provider")]
use async_trait::async_trait;
#[cfg(feature = "fastembed-provider")]
use fastembed::{EmbeddingModel, InitOptions, TextEmbedding};

#[cfg(feature = "fastembed-provider")]
use crate::{
    error::EmbeddingError,
    traits::{
        normalize_embedding, Embedding, EmbeddingOptions, EmbeddingProvider, EmbeddingResult,
    },
};

#[derive(Debug, Clone, Copy, PartialEq, Default)]
pub enum FastEmbedModel {
    #[default]
    BgeSmallEn,
    BgeBaseEn,
    BgeLargeEn,
    AllMiniLmL6V2,
    MultilingualE5Small,
    MultilingualE5Base,
}

impl FastEmbedModel {
    pub fn dimensions(&self) -> usize {
        match self {
            Self::BgeSmallEn | Self::AllMiniLmL6V2 | Self::MultilingualE5Small => 384,
            Self::BgeBaseEn | Self::MultilingualE5Base => 768,
            Self::BgeLargeEn => 1024,
        }
    }

    pub fn model_name(&self) -> &'static str {
        match self {
            Self::BgeSmallEn => "BAAI/bge-small-en-v1.5",
            Self::BgeBaseEn => "BAAI/bge-base-en-v1.5",
            Self::BgeLargeEn => "BAAI/bge-large-en-v1.5",
            Self::AllMiniLmL6V2 => "sentence-transformers/all-MiniLM-L6-v2",
            Self::MultilingualE5Small => "intfloat/multilingual-e5-small",
            Self::MultilingualE5Base => "intfloat/multilingual-e5-base",
        }
    }

    fn to_fastembed_model(self) -> EmbeddingModel {
        match self {
            Self::BgeSmallEn => EmbeddingModel::BGESmallENV15,
            Self::BgeBaseEn => EmbeddingModel::BGEBaseENV15,
            Self::BgeLargeEn => EmbeddingModel::BGELargeENV15,
            Self::AllMiniLmL6V2 => EmbeddingModel::AllMiniLML6V2,
            Self::MultilingualE5Small => EmbeddingModel::MultilingualE5Small,
            Self::MultilingualE5Base => EmbeddingModel::MultilingualE5Base,
        }
    }
}

pub struct FastEmbedProvider {
    model: Arc<Mutex<TextEmbedding>>,
    model_type: FastEmbedModel,
}

impl FastEmbedProvider {
    pub fn new(model_type: FastEmbedModel) -> Result<Self, EmbeddingError> {
        let options =
            InitOptions::new(model_type.to_fastembed_model()).with_show_download_progress(true);

        let model = TextEmbedding::try_new(options)
            .map_err(|e| EmbeddingError::Initialization(e.to_string()))?;

        Ok(Self {
            model: Arc::new(Mutex::new(model)),
            model_type,
        })
    }

    pub fn default_model() -> Result<Self, EmbeddingError> {
        Self::new(FastEmbedModel::default())
    }

    pub fn small() -> Result<Self, EmbeddingError> {
        Self::new(FastEmbedModel::BgeSmallEn)
    }

    pub fn base() -> Result<Self, EmbeddingError> {
        Self::new(FastEmbedModel::BgeBaseEn)
    }

    pub fn large() -> Result<Self, EmbeddingError> {
        Self::new(FastEmbedModel::BgeLargeEn)
    }

    pub fn multilingual() -> Result<Self, EmbeddingError> {
        Self::new(FastEmbedModel::MultilingualE5Base)
    }
}

#[async_trait]
impl EmbeddingProvider for FastEmbedProvider {
    fn name(&self) -> &str {
        "fastembed"
    }

    fn model(&self) -> &str {
        self.model_type.model_name()
    }

    fn dimensions(&self) -> usize {
        self.model_type.dimensions()
    }

    fn is_local(&self) -> bool {
        true
    }

    fn max_tokens(&self) -> usize {
        512
    }

    fn max_batch_size(&self) -> usize {
        256
    }

    fn cost_per_1m_tokens(&self) -> f64 {
        0.0
    }

    async fn is_available(&self) -> bool {
        true
    }

    async fn embed(
        &self,
        text: &str,
        options: &EmbeddingOptions,
    ) -> Result<Embedding, EmbeddingError> {
        let text = text.to_string();
        let model = Arc::clone(&self.model);

        let embeddings = tokio::task::spawn_blocking(move || {
            let mut model_guard = model.lock().expect("Failed to acquire model lock");
            model_guard.embed(vec![text], None)
        })
        .await
        .map_err(|e| EmbeddingError::ApiError(format!("Task join error: {}", e)))?
        .map_err(|e| EmbeddingError::ApiError(e.to_string()))?;

        let mut embedding = embeddings.into_iter().next().ok_or_else(|| {
            EmbeddingError::ApiError("FastEmbed returned no embeddings".to_string())
        })?;

        if options.normalize {
            normalize_embedding(&mut embedding);
        }

        Ok(embedding)
    }

    async fn embed_batch(
        &self,
        texts: &[&str],
        options: &EmbeddingOptions,
    ) -> Result<EmbeddingResult, EmbeddingError> {
        if texts.len() > self.max_batch_size() {
            return Err(EmbeddingError::BatchSizeExceeded {
                size: texts.len(),
                max: self.max_batch_size(),
            });
        }

        let texts_owned: Vec<String> = texts.iter().map(|s| s.to_string()).collect();
        let model = Arc::clone(&self.model);

        let mut embeddings = tokio::task::spawn_blocking(move || {
            let mut model_guard = model.lock().expect("Failed to acquire model lock");
            model_guard.embed(texts_owned, None)
        })
        .await
        .map_err(|e| EmbeddingError::ApiError(format!("Task join error: {}", e)))?
        .map_err(|e| EmbeddingError::ApiError(e.to_string()))?;

        if options.normalize {
            embeddings.iter_mut().for_each(normalize_embedding);
        }

        Ok(EmbeddingResult {
            embeddings,
            model: self.model().to_string(),
            dimensions: self.dimensions(),
            total_tokens: None,
            cached_count: 0,
        })
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[tokio::test]
    async fn test_fastembed_provider() {
        let provider = FastEmbedProvider::small().expect("Failed to initialize FastEmbed");

        assert_eq!(provider.name(), "fastembed");
        assert_eq!(provider.dimensions(), 384);
        assert!(provider.is_local());
        assert_eq!(provider.cost_per_1m_tokens(), 0.0);

        let options = EmbeddingOptions::default_with_cache();
        let embedding = provider
            .embed("Hello world", &options)
            .await
            .expect("Failed to embed");

        assert_eq!(embedding.len(), 384);

        let magnitude: f32 = embedding.iter().map(|x| x * x).sum::<f32>().sqrt();
        assert!(
            (magnitude - 1.0).abs() < 0.01,
            "Embedding should be normalized"
        );
    }

    #[tokio::test]
    async fn test_fastembed_batch() {
        let provider = FastEmbedProvider::small().expect("Failed to initialize FastEmbed");

        let texts = vec!["Hello world", "Goodbye world", "Machine learning"];
        let options = EmbeddingOptions::default_with_cache();

        let result = provider
            .embed_batch(&texts, &options)
            .await
            .expect("Failed to embed batch");

        assert_eq!(result.embeddings.len(), 3);
        assert_eq!(result.dimensions, 384);

        for embedding in &result.embeddings {
            assert_eq!(embedding.len(), 384);
            let magnitude: f32 = embedding.iter().map(|x| x * x).sum::<f32>().sqrt();
            assert!((magnitude - 1.0).abs() < 0.01);
        }
    }
}
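ONNX inference is CPU-bound, so both embed paths hand the work to Tokio's blocking pool instead of stalling the async executor, while Arc<Mutex<TextEmbedding>> keeps the provider Send + Sync and serializes model access. The same pattern in isolation (the Model type here is an illustrative stand-in, not crate code):

use std::sync::{Arc, Mutex};

// Stand-in for a CPU-bound model; the real provider wraps
// fastembed::TextEmbedding the same way.
struct Model;
impl Model {
    fn infer(&mut self, input: &str) -> usize {
        input.len() // placeholder for real inference
    }
}

#[tokio::main]
async fn main() {
    let model = Arc::new(Mutex::new(Model));
    let handle = Arc::clone(&model);

    // Heavy work runs on the blocking pool; the async executor stays free.
    let out = tokio::task::spawn_blocking(move || {
        let mut guard = handle.lock().expect("model lock poisoned");
        guard.infer("hello world")
    })
    .await
    .expect("blocking task panicked");

    assert_eq!(out, 11);
}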
344
crates/stratum-embeddings/src/providers/huggingface.rs
Normal file
@ -0,0 +1,344 @@
#[cfg(feature = "huggingface-provider")]
use std::time::Duration;

#[cfg(feature = "huggingface-provider")]
use async_trait::async_trait;
#[cfg(feature = "huggingface-provider")]
use reqwest::Client;
#[cfg(feature = "huggingface-provider")]
use serde::{Deserialize, Serialize};

#[cfg(feature = "huggingface-provider")]
use crate::{
    error::EmbeddingError,
    traits::{
        normalize_embedding, Embedding, EmbeddingOptions, EmbeddingProvider, EmbeddingResult,
    },
};

#[derive(Debug, Clone, PartialEq)]
pub enum HuggingFaceModel {
    /// BAAI/bge-small-en-v1.5 - 384 dimensions, efficient general-purpose
    BgeSmall,
    /// BAAI/bge-base-en-v1.5 - 768 dimensions, balanced performance
    BgeBase,
    /// BAAI/bge-large-en-v1.5 - 1024 dimensions, high quality
    BgeLarge,
    /// sentence-transformers/all-MiniLM-L6-v2 - 384 dimensions, fast
    AllMiniLm,
    /// sentence-transformers/all-mpnet-base-v2 - 768 dimensions, strong baseline
    AllMpnet,
    /// Custom model with model ID and dimensions
    Custom(String, usize),
}

impl Default for HuggingFaceModel {
    fn default() -> Self {
        Self::BgeSmall
    }
}

impl HuggingFaceModel {
    pub fn model_id(&self) -> &str {
        match self {
            Self::BgeSmall => "BAAI/bge-small-en-v1.5",
            Self::BgeBase => "BAAI/bge-base-en-v1.5",
            Self::BgeLarge => "BAAI/bge-large-en-v1.5",
            Self::AllMiniLm => "sentence-transformers/all-MiniLM-L6-v2",
            Self::AllMpnet => "sentence-transformers/all-mpnet-base-v2",
            Self::Custom(id, _) => id.as_str(),
        }
    }

    pub fn dimensions(&self) -> usize {
        match self {
            Self::BgeSmall => 384,
            Self::BgeBase => 768,
            Self::BgeLarge => 1024,
            Self::AllMiniLm => 384,
            Self::AllMpnet => 768,
            Self::Custom(_, dims) => *dims,
        }
    }

    pub fn from_model_id(id: &str, dimensions: Option<usize>) -> Self {
        match id {
            "BAAI/bge-small-en-v1.5" => Self::BgeSmall,
            "BAAI/bge-base-en-v1.5" => Self::BgeBase,
            "BAAI/bge-large-en-v1.5" => Self::BgeLarge,
            "sentence-transformers/all-MiniLM-L6-v2" => Self::AllMiniLm,
            "sentence-transformers/all-mpnet-base-v2" => Self::AllMpnet,
            _ => Self::Custom(
                id.to_string(),
                dimensions.unwrap_or(384), // Default to 384 if unknown
            ),
        }
    }
}

#[cfg(feature = "huggingface-provider")]
pub struct HuggingFaceProvider {
    client: Client,
    api_key: String,
    model: HuggingFaceModel,
}

#[cfg(feature = "huggingface-provider")]
#[derive(Debug, Serialize)]
struct HFRequest {
    inputs: String,
}

#[cfg(feature = "huggingface-provider")]
#[derive(Debug, Deserialize)]
#[serde(untagged)]
enum HFResponse {
    /// Single text embedding response
    Single(Vec<f32>),
    /// Batch embedding response
    Multiple(Vec<Vec<f32>>),
}

#[cfg(feature = "huggingface-provider")]
impl HuggingFaceProvider {
    const BASE_URL: &'static str =
        "https://api-inference.huggingface.co/pipeline/feature-extraction";

    pub fn new(api_key: impl Into<String>, model: HuggingFaceModel) -> Result<Self, EmbeddingError> {
        let api_key = api_key.into();
        if api_key.is_empty() {
            return Err(EmbeddingError::ConfigError(
                "HuggingFace API key is empty".to_string(),
            ));
        }

        let client = Client::builder()
            .timeout(Duration::from_secs(120))
            .build()
            .map_err(|e| EmbeddingError::Initialization(e.to_string()))?;

        Ok(Self {
            client,
            api_key,
            model,
        })
    }

    pub fn from_env(model: HuggingFaceModel) -> Result<Self, EmbeddingError> {
        let api_key = std::env::var("HUGGINGFACE_API_KEY")
            .or_else(|_| std::env::var("HF_TOKEN"))
            .map_err(|_| {
                EmbeddingError::ConfigError(
                    "HUGGINGFACE_API_KEY or HF_TOKEN environment variable not set".to_string(),
                )
            })?;
        Self::new(api_key, model)
    }

    pub fn bge_small() -> Result<Self, EmbeddingError> {
        Self::from_env(HuggingFaceModel::BgeSmall)
    }

    pub fn bge_base() -> Result<Self, EmbeddingError> {
        Self::from_env(HuggingFaceModel::BgeBase)
    }

    pub fn all_minilm() -> Result<Self, EmbeddingError> {
        Self::from_env(HuggingFaceModel::AllMiniLm)
    }
}

#[cfg(feature = "huggingface-provider")]
#[async_trait]
impl EmbeddingProvider for HuggingFaceProvider {
    fn name(&self) -> &str {
        "huggingface"
    }

    fn model(&self) -> &str {
        self.model.model_id()
    }

    fn dimensions(&self) -> usize {
        self.model.dimensions()
    }

    fn is_local(&self) -> bool {
        false
    }

    fn max_tokens(&self) -> usize {
        // HuggingFace doesn't specify a hard limit, but most models handle ~512 tokens
        512
    }

    fn max_batch_size(&self) -> usize {
        // HuggingFace Inference API doesn't support batch requests
        // Each request is individual
        1
    }

    fn cost_per_1m_tokens(&self) -> f64 {
        // HuggingFace Inference API is free for public models
        // For dedicated endpoints, costs vary
        0.0
    }

    async fn is_available(&self) -> bool {
        // HuggingFace Inference API is always available (with rate limits)
        true
    }

    async fn embed(
        &self,
        text: &str,
        options: &EmbeddingOptions,
    ) -> Result<Embedding, EmbeddingError> {
        if text.is_empty() {
            return Err(EmbeddingError::InvalidInput(
                "Text cannot be empty".to_string(),
            ));
        }

        let url = format!("{}/{}", Self::BASE_URL, self.model.model_id());
        let request = HFRequest {
            inputs: text.to_string(),
        };

        let response = self
            .client
            .post(&url)
            .header("Authorization", format!("Bearer {}", self.api_key))
            .json(&request)
            .send()
            .await
            .map_err(|e| {
                EmbeddingError::ApiError(format!("HuggingFace API request failed: {}", e))
            })?;

        if !response.status().is_success() {
            let status = response.status();
            let error_text = response.text().await.unwrap_or_default();
            return Err(EmbeddingError::ApiError(format!(
                "HuggingFace API error {}: {}",
                status, error_text
            )));
        }

        let hf_response: HFResponse = response.json().await.map_err(|e| {
            EmbeddingError::ApiError(format!("Failed to parse HuggingFace response: {}", e))
        })?;

        let mut embedding = match hf_response {
            HFResponse::Single(emb) => emb,
            HFResponse::Multiple(embs) => {
                if embs.is_empty() {
                    return Err(EmbeddingError::ApiError(
                        "Empty embeddings response from HuggingFace".to_string(),
                    ));
                }
                embs[0].clone()
            }
        };

        // Validate dimensions
        if embedding.len() != self.dimensions() {
            return Err(EmbeddingError::DimensionMismatch {
                expected: self.dimensions(),
                actual: embedding.len(),
            });
        }

        // Normalize if requested
        if options.normalize {
            normalize_embedding(&mut embedding);
        }

        Ok(embedding)
    }

    async fn embed_batch(
        &self,
        texts: &[&str],
        options: &EmbeddingOptions,
    ) -> Result<EmbeddingResult, EmbeddingError> {
        if texts.is_empty() {
            return Err(EmbeddingError::InvalidInput(
                "Texts cannot be empty".to_string(),
            ));
        }

        // HuggingFace Inference API doesn't support true batch requests
        // We need to send individual requests
        let mut embeddings = Vec::with_capacity(texts.len());

        for text in texts {
            let embedding = self.embed(text, options).await?;
            embeddings.push(embedding);
        }

        Ok(EmbeddingResult {
            embeddings,
            model: self.model().to_string(),
            dimensions: self.dimensions(),
            total_tokens: None,
            cached_count: 0,
        })
    }
}

#[cfg(all(test, feature = "huggingface-provider"))]
mod tests {
    use super::*;

    #[test]
    fn test_model_id_mapping() {
        assert_eq!(
            HuggingFaceModel::BgeSmall.model_id(),
            "BAAI/bge-small-en-v1.5"
        );
        assert_eq!(HuggingFaceModel::BgeSmall.dimensions(), 384);

        assert_eq!(
            HuggingFaceModel::BgeBase.model_id(),
            "BAAI/bge-base-en-v1.5"
        );
        assert_eq!(HuggingFaceModel::BgeBase.dimensions(), 768);
    }

    #[test]
    fn test_custom_model() {
        let custom = HuggingFaceModel::Custom("my-model".to_string(), 512);
        assert_eq!(custom.model_id(), "my-model");
        assert_eq!(custom.dimensions(), 512);
    }

    #[test]
    fn test_from_model_id() {
        let model = HuggingFaceModel::from_model_id("BAAI/bge-small-en-v1.5", None);
        assert_eq!(model, HuggingFaceModel::BgeSmall);

        let custom = HuggingFaceModel::from_model_id("unknown-model", Some(256));
        assert!(matches!(custom, HuggingFaceModel::Custom(_, 256)));
    }

    #[test]
    fn test_provider_creation() {
        let provider = HuggingFaceProvider::new("test-key", HuggingFaceModel::BgeSmall);
        assert!(provider.is_ok());

        let provider = provider.unwrap();
        assert_eq!(provider.name(), "huggingface");
        assert_eq!(provider.model(), "BAAI/bge-small-en-v1.5");
        assert_eq!(provider.dimensions(), 384);
    }

    #[test]
    fn test_empty_api_key() {
        let provider = HuggingFaceProvider::new("", HuggingFaceModel::BgeSmall);
        assert!(provider.is_err());
        assert!(matches!(
            provider.unwrap_err(),
            EmbeddingError::ConfigError(_)
        ));
    }
}
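Since max_batch_size() is 1, embed_batch above degrades to a sequential loop of single-input requests. A hypothetical concurrent variant using the futures crate, bounded so rate limits are respected (not part of the crate):

use futures::stream::{self, StreamExt};
use stratum_embeddings::{EmbeddingOptions, EmbeddingProvider, Result};

// Hypothetical helper: fan out individual requests, at most 4 in flight,
// preserving input order in the output.
async fn embed_concurrent(
    provider: &dyn EmbeddingProvider,
    texts: &[&str],
    options: &EmbeddingOptions,
) -> Result<Vec<Vec<f32>>> {
    stream::iter(texts)
        .map(|text| provider.embed(text, options))
        .buffered(4)
        .collect::<Vec<_>>()
        .await
        .into_iter()
        .collect()
}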
25
crates/stratum-embeddings/src/providers/mod.rs
Normal file
@ -0,0 +1,25 @@
#[cfg(feature = "cohere-provider")]
pub mod cohere;
#[cfg(feature = "fastembed-provider")]
pub mod fastembed;
#[cfg(feature = "huggingface-provider")]
pub mod huggingface;
#[cfg(feature = "ollama-provider")]
pub mod ollama;
#[cfg(feature = "openai-provider")]
pub mod openai;
#[cfg(feature = "voyage-provider")]
pub mod voyage;

#[cfg(feature = "cohere-provider")]
pub use cohere::{CohereModel, CohereProvider};
#[cfg(feature = "fastembed-provider")]
pub use fastembed::{FastEmbedModel, FastEmbedProvider};
#[cfg(feature = "huggingface-provider")]
pub use huggingface::{HuggingFaceModel, HuggingFaceProvider};
#[cfg(feature = "ollama-provider")]
pub use ollama::{OllamaModel, OllamaProvider};
#[cfg(feature = "openai-provider")]
pub use openai::{OpenAiModel, OpenAiProvider};
#[cfg(feature = "voyage-provider")]
pub use voyage::{VoyageModel, VoyageProvider};
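Every provider implements the same EmbeddingProvider trait, so callers can select one at runtime behind a trait object. A sketch assuming the fastembed-provider and ollama-provider features are enabled (the fallback policy is hypothetical):

use stratum_embeddings::{
    EmbeddingProvider, FastEmbedProvider, OllamaModel, OllamaProvider, Result,
};

// Hypothetical factory: prefer a running Ollama daemon, otherwise fall
// back to the bundled local FastEmbed model.
async fn pick_provider() -> Result<Box<dyn EmbeddingProvider>> {
    let ollama = OllamaProvider::new(OllamaModel::NomicEmbed)?;
    if ollama.is_available().await {
        return Ok(Box::new(ollama));
    }
    Ok(Box::new(FastEmbedProvider::small()?))
}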
272
crates/stratum-embeddings/src/providers/ollama.rs
Normal file
@ -0,0 +1,272 @@
#[cfg(feature = "ollama-provider")]
use std::time::Duration;

#[cfg(feature = "ollama-provider")]
use async_trait::async_trait;
#[cfg(feature = "ollama-provider")]
use reqwest::Client;
#[cfg(feature = "ollama-provider")]
use serde::{Deserialize, Serialize};

#[cfg(feature = "ollama-provider")]
use crate::{
    error::EmbeddingError,
    traits::{
        normalize_embedding, Embedding, EmbeddingOptions, EmbeddingProvider, EmbeddingResult,
    },
};

#[derive(Debug, Clone, PartialEq, Default)]
pub enum OllamaModel {
    #[default]
    NomicEmbed,
    MxbaiEmbed,
    AllMiniLm,
    Custom(String, usize),
}

impl OllamaModel {
    pub fn model_name(&self) -> &str {
        match self {
            Self::NomicEmbed => "nomic-embed-text",
            Self::MxbaiEmbed => "mxbai-embed-large",
            Self::AllMiniLm => "all-minilm",
            Self::Custom(name, _) => name,
        }
    }

    pub fn dimensions(&self) -> usize {
        match self {
            Self::NomicEmbed => 768,
            Self::MxbaiEmbed => 1024,
            Self::AllMiniLm => 384,
            Self::Custom(_, dims) => *dims,
        }
    }
}

#[derive(Serialize)]
struct OllamaRequest {
    model: String,
    input: OllamaInput,
    #[serde(skip_serializing_if = "Option::is_none")]
    options: Option<serde_json::Value>,
}

#[derive(Serialize)]
#[serde(untagged)]
enum OllamaInput {
    Single(String),
    Batch(Vec<String>),
}

#[derive(Deserialize)]
struct OllamaResponse {
    #[serde(default)]
    embeddings: Vec<Vec<f32>>,
    #[serde(default)]
    embedding: Option<Vec<f32>>,
}

pub struct OllamaProvider {
    client: Client,
    model: OllamaModel,
    base_url: String,
}

impl OllamaProvider {
    pub fn new(model: OllamaModel) -> Result<Self, EmbeddingError> {
        let client = Client::builder()
            .timeout(Duration::from_secs(120))
            .build()
            .map_err(|e| EmbeddingError::Initialization(e.to_string()))?;

        Ok(Self {
            client,
            model,
            base_url: "http://localhost:11434".to_string(),
        })
    }

    pub fn default_model() -> Result<Self, EmbeddingError> {
        Self::new(OllamaModel::default())
    }

    pub fn with_base_url(mut self, base_url: impl Into<String>) -> Self {
        self.base_url = base_url.into();
        self
    }

    pub fn custom_model(
        name: impl Into<String>,
        dimensions: usize,
    ) -> Result<Self, EmbeddingError> {
        Self::new(OllamaModel::Custom(name.into(), dimensions))
    }

    async fn call_api(&self, input: OllamaInput) -> Result<OllamaResponse, EmbeddingError> {
        let request = OllamaRequest {
            model: self.model.model_name().to_string(),
            input,
            options: None,
        };

        let response = self
            .client
            .post(format!("{}/api/embed", self.base_url))
            .json(&request)
            .send()
            .await
            .map_err(|e| {
                if e.is_connect() {
                    EmbeddingError::ProviderUnavailable(format!(
                        "Cannot connect to Ollama at {}",
                        self.base_url
                    ))
                } else {
                    e.into()
                }
            })?;

        if !response.status().is_success() {
            let status = response.status();
            let error_text = response
                .text()
                .await
                .unwrap_or_else(|_| "Unknown error".to_string());
            return Err(EmbeddingError::ApiError(format!(
                "Ollama API error {}: {}",
                status, error_text
            )));
        }

        let api_response: OllamaResponse = response.json().await?;
        Ok(api_response)
    }
}

#[async_trait]
impl EmbeddingProvider for OllamaProvider {
    fn name(&self) -> &str {
        "ollama"
    }

    fn model(&self) -> &str {
        self.model.model_name()
    }

    fn dimensions(&self) -> usize {
        self.model.dimensions()
    }

    fn is_local(&self) -> bool {
        true
    }

    fn max_tokens(&self) -> usize {
        2048
    }

    fn max_batch_size(&self) -> usize {
        128
    }

    fn cost_per_1m_tokens(&self) -> f64 {
        0.0
    }

    async fn is_available(&self) -> bool {
        self.client
            .get(format!("{}/api/tags", self.base_url))
            .send()
            .await
            .map(|r| r.status().is_success())
            .unwrap_or(false)
    }

    async fn embed(
        &self,
        text: &str,
        options: &EmbeddingOptions,
    ) -> Result<Embedding, EmbeddingError> {
        if text.is_empty() {
            return Err(EmbeddingError::InvalidInput(
                "Text cannot be empty".to_string(),
            ));
        }

        let response = self.call_api(OllamaInput::Single(text.to_string())).await?;

        let mut embedding = if let Some(emb) = response.embedding {
            emb
        } else {
            response.embeddings.into_iter().next().ok_or_else(|| {
                EmbeddingError::ApiError("Ollama returned no embeddings".to_string())
            })?
        };

        if options.normalize {
            normalize_embedding(&mut embedding);
        }

        Ok(embedding)
    }

    async fn embed_batch(
        &self,
        texts: &[&str],
        options: &EmbeddingOptions,
    ) -> Result<EmbeddingResult, EmbeddingError> {
        if texts.is_empty() {
            return Err(EmbeddingError::InvalidInput(
                "Texts cannot be empty".to_string(),
            ));
        }

        if texts.len() > self.max_batch_size() {
            return Err(EmbeddingError::BatchSizeExceeded {
                size: texts.len(),
                max: self.max_batch_size(),
            });
        }

        let texts_owned: Vec<String> = texts.iter().map(|s| s.to_string()).collect();
        let response = self.call_api(OllamaInput::Batch(texts_owned)).await?;

        let mut embeddings = response.embeddings;

        if embeddings.is_empty() {
            return Err(EmbeddingError::ApiError(
                "Ollama returned no embeddings".to_string(),
            ));
        }

        if options.normalize {
            embeddings.iter_mut().for_each(normalize_embedding);
        }

        Ok(EmbeddingResult {
            embeddings,
            model: self.model().to_string(),
            dimensions: self.dimensions(),
            total_tokens: None,
            cached_count: 0,
        })
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_ollama_model_metadata() {
        assert_eq!(OllamaModel::NomicEmbed.dimensions(), 768);
        assert_eq!(OllamaModel::MxbaiEmbed.dimensions(), 1024);
        assert_eq!(OllamaModel::AllMiniLm.dimensions(), 384);

        let custom = OllamaModel::Custom("my-model".to_string(), 512);
        assert_eq!(custom.model_name(), "my-model");
        assert_eq!(custom.dimensions(), 512);
    }
}
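OllamaResponse accepts both the plural embeddings field returned by /api/embed and the legacy singular embedding, which is why embed() checks the Option first. A self-contained check of that dual shape (mirrors the provider's private struct; not crate code):

use serde::Deserialize;

// Newer Ollama builds return "embeddings": [[...]]; older ones returned
// a single "embedding": [...]. Both deserialize into one struct.
#[derive(Deserialize)]
struct Response {
    #[serde(default)]
    embeddings: Vec<Vec<f32>>,
    #[serde(default)]
    embedding: Option<Vec<f32>>,
}

fn main() {
    let new_style: Response =
        serde_json::from_str(r#"{"embeddings": [[0.1, 0.2]]}"#).unwrap();
    assert_eq!(new_style.embeddings.len(), 1);

    let old_style: Response =
        serde_json::from_str(r#"{"embedding": [0.1, 0.2]}"#).unwrap();
    assert!(old_style.embedding.is_some());
}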
286
crates/stratum-embeddings/src/providers/openai.rs
Normal file
286
crates/stratum-embeddings/src/providers/openai.rs
Normal file
@ -0,0 +1,286 @@
|
||||
#[cfg(feature = "openai-provider")]
use std::time::Duration;

#[cfg(feature = "openai-provider")]
use async_trait::async_trait;
#[cfg(feature = "openai-provider")]
use reqwest::Client;
#[cfg(feature = "openai-provider")]
use serde::{Deserialize, Serialize};

#[cfg(feature = "openai-provider")]
use crate::{
    error::EmbeddingError,
    traits::{
        normalize_embedding, Embedding, EmbeddingOptions, EmbeddingProvider, EmbeddingResult,
    },
};

#[derive(Debug, Clone, Copy, PartialEq, Default)]
pub enum OpenAiModel {
    #[default]
    TextEmbedding3Small,
    TextEmbedding3Large,
    TextEmbeddingAda002,
}

impl OpenAiModel {
    pub fn model_name(&self) -> &'static str {
        match self {
            Self::TextEmbedding3Small => "text-embedding-3-small",
            Self::TextEmbedding3Large => "text-embedding-3-large",
            Self::TextEmbeddingAda002 => "text-embedding-ada-002",
        }
    }

    pub fn dimensions(&self) -> usize {
        match self {
            Self::TextEmbedding3Small => 1536,
            Self::TextEmbedding3Large => 3072,
            Self::TextEmbeddingAda002 => 1536,
        }
    }

    pub fn cost_per_1m(&self) -> f64 {
        match self {
            Self::TextEmbedding3Small => 0.02,
            Self::TextEmbedding3Large => 0.13,
            Self::TextEmbeddingAda002 => 0.10,
        }
    }
}

#[derive(Serialize)]
struct OpenAiRequest {
    input: OpenAiInput,
    model: String,
    #[serde(skip_serializing_if = "Option::is_none")]
    encoding_format: Option<String>,
}

#[derive(Serialize)]
#[serde(untagged)]
enum OpenAiInput {
    Single(String),
    Batch(Vec<String>),
}

#[derive(Deserialize)]
struct OpenAiResponse {
    data: Vec<OpenAiEmbedding>,
    usage: OpenAiUsage,
}

#[derive(Deserialize)]
struct OpenAiEmbedding {
    embedding: Vec<f32>,
    index: usize,
}

#[derive(Deserialize)]
struct OpenAiUsage {
    total_tokens: u32,
}

pub struct OpenAiProvider {
    client: Client,
    api_key: String,
    model: OpenAiModel,
    base_url: String,
}

impl OpenAiProvider {
    pub fn new(api_key: impl Into<String>, model: OpenAiModel) -> Result<Self, EmbeddingError> {
        let api_key = api_key.into();
        if api_key.is_empty() {
            return Err(EmbeddingError::ConfigError(
                "OpenAI API key is empty".to_string(),
            ));
        }

        let client = Client::builder()
            .timeout(Duration::from_secs(60))
            .build()
            .map_err(|e| EmbeddingError::Initialization(e.to_string()))?;

        Ok(Self {
            client,
            api_key,
            model,
            base_url: "https://api.openai.com/v1".to_string(),
        })
    }

    pub fn from_env(model: OpenAiModel) -> Result<Self, EmbeddingError> {
        let api_key = std::env::var("OPENAI_API_KEY")
            .map_err(|_| EmbeddingError::ConfigError("OPENAI_API_KEY not set".to_string()))?;
        Self::new(api_key, model)
    }

    pub fn with_base_url(mut self, base_url: impl Into<String>) -> Self {
        self.base_url = base_url.into();
        self
    }

    async fn call_api(&self, input: OpenAiInput) -> Result<OpenAiResponse, EmbeddingError> {
        let request = OpenAiRequest {
            input,
            model: self.model.model_name().to_string(),
            encoding_format: None,
        };

        let response = self
            .client
            .post(format!("{}/embeddings", self.base_url))
            .header("Authorization", format!("Bearer {}", self.api_key))
            .header("Content-Type", "application/json")
            .json(&request)
            .send()
            .await?;

        if !response.status().is_success() {
            let status = response.status();
            let error_text = response
                .text()
                .await
                .unwrap_or_else(|_| "Unknown error".to_string());
            return Err(EmbeddingError::ApiError(format!(
                "OpenAI API error {}: {}",
                status, error_text
            )));
        }

        let api_response: OpenAiResponse = response.json().await?;
        Ok(api_response)
    }
}

#[async_trait]
impl EmbeddingProvider for OpenAiProvider {
    fn name(&self) -> &str {
        "openai"
    }

    fn model(&self) -> &str {
        self.model.model_name()
    }

    fn dimensions(&self) -> usize {
        self.model.dimensions()
    }

    fn is_local(&self) -> bool {
        false
    }

    fn max_tokens(&self) -> usize {
        8191
    }

    fn max_batch_size(&self) -> usize {
        2048
    }

    fn cost_per_1m_tokens(&self) -> f64 {
        self.model.cost_per_1m()
    }

    async fn is_available(&self) -> bool {
        self.client
            .get(format!("{}/models", self.base_url))
            .header("Authorization", format!("Bearer {}", self.api_key))
            .send()
            .await
            .map(|r| r.status().is_success())
            .unwrap_or(false)
    }

    async fn embed(
        &self,
        text: &str,
        options: &EmbeddingOptions,
    ) -> Result<Embedding, EmbeddingError> {
        if text.is_empty() {
            return Err(EmbeddingError::InvalidInput(
                "Text cannot be empty".to_string(),
            ));
        }

        let response = self.call_api(OpenAiInput::Single(text.to_string())).await?;

        let mut embedding = response
            .data
            .into_iter()
            .next()
            .ok_or_else(|| EmbeddingError::ApiError("OpenAI returned no embeddings".to_string()))?
            .embedding;

        if options.normalize {
            normalize_embedding(&mut embedding);
        }

        Ok(embedding)
    }

    async fn embed_batch(
        &self,
        texts: &[&str],
        options: &EmbeddingOptions,
    ) -> Result<EmbeddingResult, EmbeddingError> {
        if texts.is_empty() {
            return Err(EmbeddingError::InvalidInput(
                "Texts cannot be empty".to_string(),
            ));
        }

        if texts.len() > self.max_batch_size() {
            return Err(EmbeddingError::BatchSizeExceeded {
                size: texts.len(),
                max: self.max_batch_size(),
            });
        }

        let texts_owned: Vec<String> = texts.iter().map(|s| s.to_string()).collect();
        let response = self.call_api(OpenAiInput::Batch(texts_owned)).await?;

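        // The API may return items out of order; collect them with their
        // `index` field so the original input order can be restored below.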
        let mut embeddings_with_index: Vec<_> = response
            .data
            .into_iter()
            .map(|e| (e.index, e.embedding))
            .collect();

        embeddings_with_index.sort_by_key(|(idx, _)| *idx);

        let mut embeddings: Vec<Embedding> = embeddings_with_index
            .into_iter()
            .map(|(_, emb)| emb)
            .collect();

        if options.normalize {
            embeddings.iter_mut().for_each(normalize_embedding);
        }

        Ok(EmbeddingResult {
            embeddings,
            model: self.model().to_string(),
            dimensions: self.dimensions(),
            total_tokens: Some(response.usage.total_tokens),
            cached_count: 0,
        })
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_openai_model_metadata() {
        assert_eq!(OpenAiModel::TextEmbedding3Small.dimensions(), 1536);
        assert_eq!(OpenAiModel::TextEmbedding3Large.dimensions(), 3072);
        assert_eq!(OpenAiModel::TextEmbeddingAda002.dimensions(), 1536);

        assert_eq!(OpenAiModel::TextEmbedding3Small.cost_per_1m(), 0.02);
        assert_eq!(OpenAiModel::TextEmbedding3Large.cost_per_1m(), 0.13);
    }
}
259
crates/stratum-embeddings/src/providers/voyage.rs
Normal file
@ -0,0 +1,259 @@
#[cfg(feature = "voyage-provider")]
use async_trait::async_trait;
#[cfg(feature = "voyage-provider")]
use reqwest::Client;
#[cfg(feature = "voyage-provider")]
use serde::{Deserialize, Serialize};

#[cfg(feature = "voyage-provider")]
use crate::{
    error::EmbeddingError,
    traits::{Embedding, EmbeddingOptions, EmbeddingProvider, EmbeddingResult},
};

#[derive(Debug, Clone, Copy, PartialEq, Default)]
pub enum VoyageModel {
    #[default]
    Voyage2,
    VoyageLarge2,
    VoyageCode2,
    VoyageLite02Instruct,
}

impl VoyageModel {
    pub fn model_name(&self) -> &'static str {
        match self {
            Self::Voyage2 => "voyage-2",
            Self::VoyageLarge2 => "voyage-large-2",
            Self::VoyageCode2 => "voyage-code-2",
            Self::VoyageLite02Instruct => "voyage-lite-02-instruct",
        }
    }

    pub fn dimensions(&self) -> usize {
        match self {
            Self::Voyage2 => 1024,
            Self::VoyageLarge2 => 1536,
            Self::VoyageCode2 => 1536,
            Self::VoyageLite02Instruct => 1024,
        }
    }

    pub fn max_tokens(&self) -> usize {
        match self {
            Self::Voyage2 | Self::VoyageLarge2 | Self::VoyageCode2 => 16000,
            Self::VoyageLite02Instruct => 4000,
        }
    }
}

#[derive(Debug, Serialize)]
struct VoyageEmbedRequest {
    input: Vec<String>,
    model: String,
    #[serde(skip_serializing_if = "Option::is_none")]
    input_type: Option<String>,
    #[serde(skip_serializing_if = "Option::is_none")]
    truncation: Option<bool>,
}

#[derive(Debug, Deserialize)]
struct VoyageEmbedResponse {
    data: Vec<VoyageEmbeddingData>,
    usage: VoyageUsage,
}

#[derive(Debug, Deserialize)]
struct VoyageEmbeddingData {
    embedding: Vec<f32>,
}

#[derive(Debug, Deserialize)]
struct VoyageUsage {
    total_tokens: u32,
}

pub struct VoyageProvider {
    client: Client,
    api_key: String,
    model: VoyageModel,
    base_url: String,
}

impl VoyageProvider {
    pub fn new(api_key: String, model: VoyageModel) -> Self {
        Self {
            client: Client::new(),
            api_key,
            model,
            base_url: "https://api.voyageai.com/v1".to_string(),
        }
    }

    pub fn with_base_url(mut self, base_url: String) -> Self {
        self.base_url = base_url;
        self
    }

    pub fn voyage_2(api_key: String) -> Self {
        Self::new(api_key, VoyageModel::Voyage2)
    }

    pub fn voyage_large_2(api_key: String) -> Self {
        Self::new(api_key, VoyageModel::VoyageLarge2)
    }

    pub fn voyage_code_2(api_key: String) -> Self {
        Self::new(api_key, VoyageModel::VoyageCode2)
    }

    async fn embed_batch_internal(
        &self,
        texts: &[&str],
        _options: &EmbeddingOptions,
    ) -> Result<EmbeddingResult, EmbeddingError> {
        let request = VoyageEmbedRequest {
            input: texts.iter().map(|s| s.to_string()).collect(),
            model: self.model.model_name().to_string(),
            input_type: Some("document".to_string()),
            truncation: Some(true),
        };

        let response = self
            .client
            .post(format!("{}/embeddings", self.base_url))
            .header("Authorization", format!("Bearer {}", self.api_key))
            .header("Content-Type", "application/json")
            .json(&request)
            .send()
            .await
            .map_err(|e| EmbeddingError::ApiError(format!("Voyage API request failed: {}", e)))?;

        if !response.status().is_success() {
            let status = response.status();
            let error_text = response.text().await.unwrap_or_default();
            return Err(EmbeddingError::ApiError(format!(
                "Voyage API error {}: {}",
                status, error_text
            )));
        }

        let result: VoyageEmbedResponse = response.json().await.map_err(|e| {
            EmbeddingError::ApiError(format!("Failed to parse Voyage response: {}", e))
        })?;

        let embeddings: Vec<Embedding> = result.data.into_iter().map(|d| d.embedding).collect();

        Ok(EmbeddingResult {
            embeddings,
            model: self.model.model_name().to_string(),
            dimensions: self.model.dimensions(),
            total_tokens: Some(result.usage.total_tokens),
            cached_count: 0,
        })
    }
}

#[async_trait]
impl EmbeddingProvider for VoyageProvider {
    fn name(&self) -> &str {
        "voyage"
    }

    fn model(&self) -> &str {
        self.model.model_name()
    }

    fn dimensions(&self) -> usize {
        self.model.dimensions()
    }

    fn is_local(&self) -> bool {
        false
    }

    fn max_tokens(&self) -> usize {
        self.model.max_tokens()
    }

    fn max_batch_size(&self) -> usize {
        128
    }

    fn cost_per_1m_tokens(&self) -> f64 {
        match self.model {
            VoyageModel::Voyage2 => 0.10,
            VoyageModel::VoyageLarge2 => 0.12,
            VoyageModel::VoyageCode2 => 0.12,
            VoyageModel::VoyageLite02Instruct => 0.06,
        }
    }

    async fn is_available(&self) -> bool {
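        // Cheap readiness check: only verifies a key is configured;
        // no network round-trip is made here.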
        !self.api_key.is_empty()
    }

    async fn embed(
        &self,
        text: &str,
        options: &EmbeddingOptions,
    ) -> Result<Embedding, EmbeddingError> {
        let result = self.embed_batch_internal(&[text], options).await?;
        result
            .embeddings
            .into_iter()
            .next()
            .ok_or_else(|| EmbeddingError::ApiError("No embedding returned".to_string()))
    }

    async fn embed_batch(
        &self,
        texts: &[&str],
        options: &EmbeddingOptions,
    ) -> Result<EmbeddingResult, EmbeddingError> {
        if texts.is_empty() {
            return Err(EmbeddingError::InvalidInput(
                "Cannot embed empty text list".to_string(),
            ));
        }

        if texts.len() > self.max_batch_size() {
            return Err(EmbeddingError::InvalidInput(format!(
                "Batch size {} exceeds maximum {}",
                texts.len(),
                self.max_batch_size()
            )));
        }

        self.embed_batch_internal(texts, options).await
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_voyage_model_names() {
        assert_eq!(VoyageModel::Voyage2.model_name(), "voyage-2");
        assert_eq!(VoyageModel::Voyage2.dimensions(), 1024);
        assert_eq!(VoyageModel::VoyageLarge2.dimensions(), 1536);
        assert_eq!(VoyageModel::VoyageCode2.dimensions(), 1536);
        assert_eq!(VoyageModel::VoyageLite02Instruct.dimensions(), 1024);
    }

    #[test]
    fn test_voyage_max_tokens() {
        assert_eq!(VoyageModel::Voyage2.max_tokens(), 16000);
        assert_eq!(VoyageModel::VoyageLite02Instruct.max_tokens(), 4000);
    }

    #[tokio::test]
    async fn test_voyage_provider_creation() {
        let provider = VoyageProvider::new("test-key".to_string(), VoyageModel::Voyage2);
        assert_eq!(provider.name(), "voyage");
        assert_eq!(provider.model(), "voyage-2");
        assert_eq!(provider.dimensions(), 1024);
        assert_eq!(provider.max_batch_size(), 128);
    }
}
283
crates/stratum-embeddings/src/service.rs
Normal file
@ -0,0 +1,283 @@
use std::sync::Arc;

use tracing::{debug, info, warn};

use crate::{
    batch::BatchProcessor,
    cache::EmbeddingCache,
    error::EmbeddingError,
    traits::{Embedding, EmbeddingOptions, EmbeddingProvider, EmbeddingResult, ProviderInfo},
};

pub struct EmbeddingService<P: EmbeddingProvider, C: EmbeddingCache> {
    provider: Arc<P>,
    cache: Option<Arc<C>>,
    fallback_providers: Vec<Arc<dyn EmbeddingProvider>>,
    batch_processor: BatchProcessor<P, C>,
}

impl<P: EmbeddingProvider + 'static, C: EmbeddingCache + 'static> EmbeddingService<P, C> {
    pub fn new(provider: P) -> Self {
        let provider = Arc::new(provider);
        let batch_processor = BatchProcessor::new(Arc::clone(&provider), None);

        Self {
            provider,
            cache: None,
            fallback_providers: Vec::new(),
            batch_processor,
        }
    }

    pub fn with_cache(mut self, cache: C) -> Self {
        let cache = Arc::new(cache);
        self.cache = Some(Arc::clone(&cache));
        self.batch_processor = BatchProcessor::new(Arc::clone(&self.provider), Some(cache));
        self
    }

    pub fn with_fallback(mut self, fallback: Arc<dyn EmbeddingProvider>) -> Self {
        self.fallback_providers.push(fallback);
        self
    }

    pub fn with_batch_concurrency(mut self, max_concurrent: usize) -> Self {
        self.batch_processor = self.batch_processor.with_concurrency(max_concurrent);
        self
    }

    pub fn provider_info(&self) -> ProviderInfo {
        ProviderInfo::from(self.provider.as_ref())
    }

    pub async fn is_ready(&self) -> bool {
        self.provider.is_available().await
    }

    pub async fn embed(
        &self,
        text: &str,
        options: &EmbeddingOptions,
    ) -> Result<Embedding, EmbeddingError> {
        if text.is_empty() {
            return Err(EmbeddingError::InvalidInput(
                "Text cannot be empty".to_string(),
            ));
        }

        if options.use_cache {
            if let Some(cache) = &self.cache {
                let key =
                    crate::cache::cache_key(self.provider.name(), self.provider.model(), text);
                if let Some(cached) = cache.get(&key).await {
                    debug!("Cache hit for text (len={})", text.len());
                    return Ok(cached);
                }
            }
        }

        let result = self.embed_with_fallback(text, options).await?;

        if options.use_cache {
            if let Some(cache) = &self.cache {
                let key =
                    crate::cache::cache_key(self.provider.name(), self.provider.model(), text);
                cache.insert(&key, result.clone()).await;
            }
        }

        Ok(result)
    }

    pub async fn embed_batch(
        &self,
        texts: Vec<String>,
        options: &EmbeddingOptions,
    ) -> Result<EmbeddingResult, EmbeddingError> {
        if texts.is_empty() {
            return Err(EmbeddingError::InvalidInput(
                "Texts cannot be empty".to_string(),
            ));
        }

        info!("Processing batch of {} texts", texts.len());
        self.batch_processor.process_batch(texts, options).await
    }

    pub async fn embed_stream(
        &self,
        texts: Vec<String>,
        options: &EmbeddingOptions,
    ) -> Result<Vec<Embedding>, EmbeddingError> {
        self.batch_processor.process_stream(texts, options).await
    }

    async fn embed_with_fallback(
        &self,
        text: &str,
        options: &EmbeddingOptions,
    ) -> Result<Embedding, EmbeddingError> {
        match self.provider.embed(text, options).await {
            Ok(embedding) => Ok(embedding),
            Err(e) => {
                warn!("Primary provider failed: {}", e);

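                // Walk the fallback chain in registration order, skipping
                // providers that report themselves unavailable.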
                for (idx, fallback) in self.fallback_providers.iter().enumerate() {
                    debug!("Trying fallback provider {}", idx);

                    if !fallback.is_available().await {
                        warn!("Fallback provider {} not available", idx);
                        continue;
                    }

                    match fallback.embed(text, options).await {
                        Ok(embedding) => {
                            info!("Fallback provider {} succeeded", idx);
                            return Ok(embedding);
                        }
                        Err(fallback_err) => {
                            warn!("Fallback provider {} failed: {}", idx, fallback_err);
                        }
                    }
                }

                Err(e)
            }
        }
    }

    pub async fn invalidate_cache(&self, text: &str) -> Result<(), EmbeddingError> {
        if let Some(cache) = &self.cache {
            let key = crate::cache::cache_key(self.provider.name(), self.provider.model(), text);
            cache.invalidate(&key).await;
            Ok(())
        } else {
            Err(EmbeddingError::CacheError(
                "No cache configured".to_string(),
            ))
        }
    }

    pub async fn clear_cache(&self) -> Result<(), EmbeddingError> {
        if let Some(cache) = &self.cache {
            cache.clear().await;
            Ok(())
        } else {
            Err(EmbeddingError::CacheError(
                "No cache configured".to_string(),
            ))
        }
    }

    pub fn cache_size(&self) -> usize {
        self.cache.as_ref().map(|c| c.size()).unwrap_or(0)
    }
}

#[cfg(test)]
mod tests {
    use std::time::Duration;

    use super::*;
    use crate::{cache::MemoryCache, providers::FastEmbedProvider};

    #[tokio::test]
    async fn test_service_basic() {
        let provider = FastEmbedProvider::small().expect("Failed to init");
        let service: EmbeddingService<_, MemoryCache> = EmbeddingService::new(provider);

        let options = EmbeddingOptions::no_cache();
        let embedding = service
            .embed("Hello world", &options)
            .await
            .expect("Failed to embed");

        assert_eq!(embedding.len(), 384);
    }

    #[tokio::test]
    async fn test_service_with_cache() {
        let provider = FastEmbedProvider::small().expect("Failed to init");
        let cache = MemoryCache::new(100, Duration::from_secs(60));
        let service = EmbeddingService::new(provider).with_cache(cache);

        let options = EmbeddingOptions::default_with_cache();

        let embedding1 = service
            .embed("Hello world", &options)
            .await
            .expect("Failed first embed");
        assert_eq!(service.cache_size(), 1);

        let embedding2 = service
            .embed("Hello world", &options)
            .await
            .expect("Failed second embed");
        assert_eq!(embedding1, embedding2);
    }

    #[tokio::test]
    async fn test_service_batch() {
        let provider = FastEmbedProvider::small().expect("Failed to init");
        let cache = MemoryCache::with_defaults();
        let service = EmbeddingService::new(provider).with_cache(cache);

        let texts = vec![
            "Text 1".to_string(),
            "Text 2".to_string(),
            "Text 3".to_string(),
        ];

        let options = EmbeddingOptions::default_with_cache();
        let result = service
            .embed_batch(texts.clone(), &options)
            .await
            .expect("Failed batch");

        assert_eq!(result.embeddings.len(), 3);
        assert_eq!(result.cached_count, 0);

        let result2 = service
            .embed_batch(texts, &options)
            .await
            .expect("Failed second batch");
        assert_eq!(result2.cached_count, 3);
    }

    #[tokio::test]
    async fn test_service_cache_invalidation() {
        let provider = FastEmbedProvider::small().expect("Failed to init");
        let cache = MemoryCache::with_defaults();
        let service = EmbeddingService::new(provider).with_cache(cache);

        let options = EmbeddingOptions::default_with_cache();

        service
            .embed("Test text", &options)
            .await
            .expect("Failed embed");
        assert_eq!(service.cache_size(), 1);

        service
            .invalidate_cache("Test text")
            .await
            .expect("Failed to invalidate");
        assert_eq!(service.cache_size(), 0);
    }

    #[tokio::test]
    async fn test_service_clear_cache() {
        let provider = FastEmbedProvider::small().expect("Failed to init");
        let cache = MemoryCache::with_defaults();
        let service = EmbeddingService::new(provider).with_cache(cache);

        let options = EmbeddingOptions::default_with_cache();

        service.embed("Test 1", &options).await.expect("Failed");
        service.embed("Test 2", &options).await.expect("Failed");
        assert_eq!(service.cache_size(), 2);

        service.clear_cache().await.expect("Failed to clear");
        assert_eq!(service.cache_size(), 0);
    }
}
430
crates/stratum-embeddings/src/store/lancedb.rs
Normal file
@ -0,0 +1,430 @@
#[cfg(feature = "lancedb-store")]
use std::sync::Arc;

#[cfg(feature = "lancedb-store")]
use arrow::{
    array::{
        AsArray, FixedSizeListArray, Float32Array, RecordBatch, RecordBatchIterator, StringArray,
    },
    datatypes::{DataType, Field, Float32Type, Schema},
};
#[cfg(feature = "lancedb-store")]
use async_trait::async_trait;
#[cfg(feature = "lancedb-store")]
use futures::TryStreamExt;
#[cfg(feature = "lancedb-store")]
use lancedb::{query::ExecutableQuery, query::QueryBase, Connection, DistanceType, Table};

#[cfg(feature = "lancedb-store")]
use crate::{
    error::EmbeddingError,
    store::{SearchFilter, SearchResult, VectorStore, VectorStoreConfig},
    traits::Embedding,
};

pub struct LanceDbStore {
    #[allow(dead_code)]
    connection: Connection,
    table: Table,
    dimensions: usize,
}

impl LanceDbStore {
    pub async fn new(
        path: &str,
        table_name: &str,
        config: VectorStoreConfig,
    ) -> Result<Self, EmbeddingError> {
        let connection = lancedb::connect(path).execute().await.map_err(|e| {
            EmbeddingError::Initialization(format!("LanceDB connection failed: {}", e))
        })?;

        let table = match connection.open_table(table_name).execute().await {
            Ok(t) => t,
            Err(_) => {
                let schema = Self::create_schema(config.dimensions);
                let empty_batch = RecordBatch::new_empty(Arc::clone(&schema));
                let batches = RecordBatchIterator::new(
                    vec![Ok(empty_batch)].into_iter(),
                    Arc::clone(&schema),
                );

                connection
                    .create_table(table_name, batches)
                    .execute()
                    .await
                    .map_err(|e| {
                        EmbeddingError::Initialization(format!("Failed to create table: {}", e))
                    })?
            }
        };

        Ok(Self {
            connection,
            table,
            dimensions: config.dimensions,
        })
    }

    fn create_schema(dimensions: usize) -> Arc<Schema> {
        Arc::new(Schema::new(vec![
            Field::new("id", DataType::Utf8, false),
            Field::new(
                "vector",
                DataType::FixedSizeList(
                    Arc::new(Field::new("item", DataType::Float32, true)),
                    dimensions as i32,
                ),
                false,
            ),
            Field::new("metadata", DataType::Utf8, true),
        ]))
    }

    fn create_record_batch(
        ids: Vec<String>,
        embeddings: Vec<Embedding>,
        metadata: Vec<String>,
        dimensions: usize,
    ) -> Result<RecordBatch, EmbeddingError> {
        let id_array = Arc::new(StringArray::from(ids));
        let metadata_array = Arc::new(StringArray::from(metadata));

        let flat_values: Vec<f32> = embeddings.into_iter().flatten().collect();
        let values_array = Arc::new(Float32Array::from(flat_values));
        let vector_array = Arc::new(FixedSizeListArray::new(
            Arc::new(Field::new("item", DataType::Float32, true)),
            dimensions as i32,
            values_array,
            None,
        ));

        let schema = Self::create_schema(dimensions);

        RecordBatch::try_new(schema, vec![id_array, vector_array, metadata_array]).map_err(|e| {
            EmbeddingError::StoreError(format!("Failed to create record batch: {}", e))
        })
    }
}

#[async_trait]
impl VectorStore for LanceDbStore {
    fn name(&self) -> &str {
        "lancedb"
    }

    fn dimensions(&self) -> usize {
        self.dimensions
    }

    async fn upsert(
        &self,
        id: &str,
        embedding: &Embedding,
        metadata: serde_json::Value,
    ) -> Result<(), EmbeddingError> {
        if embedding.len() != self.dimensions {
            return Err(EmbeddingError::DimensionMismatch {
                expected: self.dimensions,
                actual: embedding.len(),
            });
        }

        let metadata_str = serde_json::to_string(&metadata)
            .map_err(|e| EmbeddingError::SerializationError(e.to_string()))?;

        let batch = Self::create_record_batch(
            vec![id.to_string()],
            vec![embedding.clone()],
            vec![metadata_str],
            self.dimensions,
        )?;

        let schema = Self::create_schema(self.dimensions);
        let batches = RecordBatchIterator::new(vec![Ok(batch)].into_iter(), schema);

        self.table
            .add(batches)
            .execute()
            .await
            .map_err(|e| EmbeddingError::StoreError(format!("Failed to upsert: {}", e)))?;

        Ok(())
    }

    async fn upsert_batch(
        &self,
        items: Vec<(String, Embedding, serde_json::Value)>,
    ) -> Result<(), EmbeddingError> {
        if items.is_empty() {
            return Ok(());
        }

        for (_, embedding, _) in &items {
            if embedding.len() != self.dimensions {
                return Err(EmbeddingError::DimensionMismatch {
                    expected: self.dimensions,
                    actual: embedding.len(),
                });
            }
        }

        let (ids, embeddings, metadata): (Vec<_>, Vec<_>, Vec<_>) = items.into_iter().fold(
            (Vec::new(), Vec::new(), Vec::new()),
            |(mut ids, mut embeddings, mut metadata), (id, embedding, meta)| {
                ids.push(id);
                embeddings.push(embedding);
                metadata.push(serde_json::to_string(&meta).unwrap_or_default());
                (ids, embeddings, metadata)
            },
        );

        let batch = Self::create_record_batch(ids, embeddings, metadata, self.dimensions)?;

        let schema = Self::create_schema(self.dimensions);
        let batches = RecordBatchIterator::new(vec![Ok(batch)].into_iter(), schema);

        self.table
            .add(batches)
            .execute()
            .await
            .map_err(|e| EmbeddingError::StoreError(format!("Failed to batch upsert: {}", e)))?;

        Ok(())
    }

    async fn search(
        &self,
        embedding: &Embedding,
        limit: usize,
        filter: Option<SearchFilter>,
    ) -> Result<Vec<SearchResult>, EmbeddingError> {
        if embedding.len() != self.dimensions {
            return Err(EmbeddingError::DimensionMismatch {
                expected: self.dimensions,
                actual: embedding.len(),
            });
        }

        let mut query = self
            .table
            .vector_search(embedding.as_slice())
            .map_err(|e| EmbeddingError::StoreError(format!("Query setup failed: {}", e)))?
            .distance_type(DistanceType::Cosine);

        if let Some(ref f) = filter {
            if f.min_score.is_some() {
                query = query.postfilter().refine_factor(10);
            }
        }

        let stream =
            query.limit(limit).execute().await.map_err(|e| {
                EmbeddingError::StoreError(format!("Query execution failed: {}", e))
            })?;

        let batches: Vec<RecordBatch> = stream
            .try_collect()
            .await
            .map_err(|e| EmbeddingError::StoreError(format!("Failed to collect results: {}", e)))?;

        let mut search_results = Vec::new();

        for batch in batches.iter() {
            let ids: &Arc<dyn arrow::array::Array> =
                batch.column_by_name("id").ok_or_else(|| {
                    EmbeddingError::StoreError("Missing 'id' column in results".to_string())
                })?;

            let metadata_col: &Arc<dyn arrow::array::Array> =
                batch.column_by_name("metadata").ok_or_else(|| {
                    EmbeddingError::StoreError("Missing 'metadata' column in results".to_string())
                })?;

            let distance_col: Option<&Arc<dyn arrow::array::Array>> =
                batch.column_by_name("_distance");

            let id_array = ids.as_string::<i32>();
            let metadata_array = metadata_col.as_string::<i32>();

            let num_rows: usize = batch.num_rows();
            for i in 0..num_rows {
                let id = id_array.value(i).to_string();

                let metadata_str = metadata_array.value(i);
                let metadata: serde_json::Value =
                    serde_json::from_str(metadata_str).unwrap_or(serde_json::json!({}));

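                // LanceDB reports cosine *distance* in the `_distance` column;
                // convert it to a similarity-style score via 1 - distance.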
                let score = if let Some(dist_col) = distance_col {
                    let dist_array = dist_col.as_primitive::<Float32Type>();
                    1.0 - dist_array.value(i)
                } else {
                    0.0
                };

                search_results.push(SearchResult {
                    id,
                    score,
                    embedding: None,
                    metadata,
                });
            }
        }

        if let Some(f) = filter {
            if let Some(min_score) = f.min_score {
                search_results.retain(|r| r.score >= min_score);
            }
        }

        Ok(search_results)
    }

    async fn get(&self, id: &str) -> Result<Option<SearchResult>, EmbeddingError> {
        let stream = self
            .table
            .query()
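            // Note: `id` is interpolated directly into the filter expression;
            // ids containing single quotes would need escaping.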
            .only_if(format!("id = '{}'", id))
            .limit(1)
            .execute()
            .await
            .map_err(|e| EmbeddingError::StoreError(format!("Query execution failed: {}", e)))?;

        let results: Vec<RecordBatch> = stream
            .try_collect()
            .await
            .map_err(|e| EmbeddingError::StoreError(format!("Failed to collect results: {}", e)))?;

        for batch in results.iter() {
            let num_rows: usize = batch.num_rows();
            if num_rows == 0 {
                continue;
            }

            let ids: &Arc<dyn arrow::array::Array> =
                batch.column_by_name("id").ok_or_else(|| {
                    EmbeddingError::StoreError("Missing 'id' column in results".to_string())
                })?;

            let metadata_col: &Arc<dyn arrow::array::Array> =
                batch.column_by_name("metadata").ok_or_else(|| {
                    EmbeddingError::StoreError("Missing 'metadata' column in results".to_string())
                })?;

            let id_array = ids.as_string::<i32>();
            let metadata_array = metadata_col.as_string::<i32>();

            let result_id = id_array.value(0).to_string();
            let metadata_str = metadata_array.value(0);
            let metadata: serde_json::Value =
                serde_json::from_str(metadata_str).unwrap_or(serde_json::json!({}));

            return Ok(Some(SearchResult {
                id: result_id,
                score: 1.0,
                embedding: None,
                metadata,
            }));
        }

        Ok(None)
    }

    async fn delete(&self, id: &str) -> Result<bool, EmbeddingError> {
        self.table
            .delete(&format!("id = '{}'", id))
            .await
            .map_err(|e| EmbeddingError::StoreError(format!("Delete failed: {}", e)))?;

        Ok(true)
    }

    async fn flush(&self) -> Result<(), EmbeddingError> {
        Ok(())
    }

    async fn count(&self) -> Result<usize, EmbeddingError> {
        let count = self
            .table
            .count_rows(None)
            .await
            .map_err(|e| EmbeddingError::StoreError(format!("Count failed: {}", e)))?;

        Ok(count)
    }
}

#[cfg(test)]
mod tests {
    use tempfile::tempdir;

    use super::*;

    #[tokio::test]
    async fn test_lancedb_store_basic() {
        let dir = tempdir().expect("Failed to create temp dir");
        let path = dir.path().to_str().unwrap();

        let config = VectorStoreConfig::new(384);
        let store = LanceDbStore::new(path, "test_embeddings", config)
            .await
            .expect("Failed to create store");

        let embedding = vec![0.1; 384];
        let metadata = serde_json::json!({"text": "hello world"});

        store
            .upsert("test_id", &embedding, metadata.clone())
            .await
            .expect("Failed to upsert");

        let result = store.get("test_id").await.expect("Failed to get");
        assert!(result.is_some());

        let search_results = store
            .search(&embedding, 5, None)
            .await
            .expect("Failed to search");
        assert!(!search_results.is_empty());

        let count = store.count().await.expect("Failed to count");
        assert_eq!(count, 1);
    }

    #[tokio::test]
    async fn test_lancedb_store_batch() {
        let dir = tempdir().expect("Failed to create temp dir");
        let path = dir.path().to_str().unwrap();

        let config = VectorStoreConfig::new(384);
        let store = LanceDbStore::new(path, "test_batch", config)
            .await
            .expect("Failed to create store");

        let items = vec![
            (
                "id1".to_string(),
                vec![0.1; 384],
                serde_json::json!({"idx": 1}),
            ),
            (
                "id2".to_string(),
                vec![0.2; 384],
                serde_json::json!({"idx": 2}),
            ),
            (
                "id3".to_string(),
                vec![0.3; 384],
                serde_json::json!({"idx": 3}),
            ),
        ];

        store
            .upsert_batch(items)
            .await
            .expect("Failed to batch upsert");

        let count = store.count().await.expect("Failed to count");
        assert_eq!(count, 3);
    }
}
14
crates/stratum-embeddings/src/store/mod.rs
Normal file
@ -0,0 +1,14 @@
pub mod traits;

#[cfg(feature = "lancedb-store")]
pub mod lancedb;

#[cfg(feature = "surrealdb-store")]
pub mod surrealdb;

pub use traits::*;

#[cfg(feature = "lancedb-store")]
pub use self::lancedb::LanceDbStore;
#[cfg(feature = "surrealdb-store")]
pub use self::surrealdb::SurrealDbStore;
298
crates/stratum-embeddings/src/store/surrealdb.rs
Normal file
@ -0,0 +1,298 @@
#[cfg(feature = "surrealdb-store")]
use async_trait::async_trait;
#[cfg(feature = "surrealdb-store")]
use serde::{Deserialize, Serialize};
#[cfg(feature = "surrealdb-store")]
use surrealdb::{
    engine::local::{Db, Mem},
    sql::Thing,
    Surreal,
};

#[cfg(feature = "surrealdb-store")]
use crate::{
    error::EmbeddingError,
    store::{SearchFilter, SearchResult, VectorStore, VectorStoreConfig},
    traits::Embedding,
};

#[derive(Debug, Serialize, Deserialize)]
struct EmbeddingRecord {
    id: Option<Thing>,
    vector: Vec<f32>,
    metadata: serde_json::Value,
}

pub struct SurrealDbStore {
    db: Surreal<Db>,
    table: String,
    dimensions: usize,
}

impl SurrealDbStore {
    pub async fn new(
        connection: Surreal<Db>,
        table_name: &str,
        config: VectorStoreConfig,
    ) -> Result<Self, EmbeddingError> {
        Ok(Self {
            db: connection,
            table: table_name.to_string(),
            dimensions: config.dimensions,
        })
    }

    pub async fn new_memory(
        table_name: &str,
        config: VectorStoreConfig,
    ) -> Result<Self, EmbeddingError> {
        let db = Surreal::new::<Mem>(()).await.map_err(|e| {
            EmbeddingError::Initialization(format!("SurrealDB connection failed: {}", e))
        })?;

        db.use_ns("embeddings")
            .use_db("embeddings")
            .await
            .map_err(|e| {
                EmbeddingError::Initialization(format!("Failed to set namespace: {}", e))
            })?;

        Self::new(db, table_name, config).await
    }

    fn compute_cosine_similarity(&self, a: &[f32], b: &[f32]) -> f32 {
        if a.len() != b.len() {
            return 0.0;
        }

        let dot: f32 = a.iter().zip(b.iter()).map(|(x, y)| x * y).sum();
        let mag_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
        let mag_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();

        if mag_a > 0.0 && mag_b > 0.0 {
            dot / (mag_a * mag_b)
        } else {
            0.0
        }
    }
}

#[async_trait]
impl VectorStore for SurrealDbStore {
    fn name(&self) -> &str {
        "surrealdb"
    }

    fn dimensions(&self) -> usize {
        self.dimensions
    }

    async fn upsert(
        &self,
        id: &str,
        embedding: &Embedding,
        metadata: serde_json::Value,
    ) -> Result<(), EmbeddingError> {
        if embedding.len() != self.dimensions {
            return Err(EmbeddingError::DimensionMismatch {
                expected: self.dimensions,
                actual: embedding.len(),
            });
        }

        let record = EmbeddingRecord {
            id: None,
            vector: embedding.clone(),
            metadata,
        };

        let _: Option<EmbeddingRecord> = self
            .db
            .update((self.table.as_str(), id))
            .content(record)
            .await
            .map_err(|e| EmbeddingError::StoreError(format!("Upsert failed: {}", e)))?;

        Ok(())
    }

    async fn search(
        &self,
        embedding: &Embedding,
        limit: usize,
        filter: Option<SearchFilter>,
    ) -> Result<Vec<SearchResult>, EmbeddingError> {
        if embedding.len() != self.dimensions {
            return Err(EmbeddingError::DimensionMismatch {
                expected: self.dimensions,
                actual: embedding.len(),
            });
        }

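        // Brute-force search: loads every record in the table and scores it
        // in memory. Fine for the embedded/in-memory engine; large tables
        // would need a native vector index instead.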
        let all_records: Vec<EmbeddingRecord> = self
            .db
            .select(&self.table)
            .await
            .map_err(|e| EmbeddingError::StoreError(format!("Search failed: {}", e)))?;

        let mut scored_results: Vec<(String, f32, serde_json::Value)> = all_records
            .into_iter()
            .filter_map(|record| {
                let id = record.id?.id.to_string();
                let score = self.compute_cosine_similarity(embedding, &record.vector);
                Some((id, score, record.metadata))
            })
            .collect();

        scored_results.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));

        if let Some(f) = filter {
            if let Some(min_score) = f.min_score {
                scored_results.retain(|(_, score, _)| *score >= min_score);
            }
        }

        let results = scored_results
            .into_iter()
            .take(limit)
            .map(|(id, score, metadata)| SearchResult {
                id,
                score,
                embedding: None,
                metadata,
            })
            .collect();

        Ok(results)
    }

    async fn get(&self, id: &str) -> Result<Option<SearchResult>, EmbeddingError> {
        let record: Option<EmbeddingRecord> = self
            .db
            .select((self.table.as_str(), id))
            .await
            .map_err(|e| EmbeddingError::StoreError(format!("Get failed: {}", e)))?;

        Ok(record.map(|r| SearchResult {
            id: id.to_string(),
            score: 1.0,
            embedding: Some(r.vector),
            metadata: r.metadata,
        }))
    }

    async fn delete(&self, id: &str) -> Result<bool, EmbeddingError> {
        let result: Option<EmbeddingRecord> = self
            .db
            .delete((self.table.as_str(), id))
            .await
            .map_err(|e| EmbeddingError::StoreError(format!("Delete failed: {}", e)))?;

        Ok(result.is_some())
    }

    async fn flush(&self) -> Result<(), EmbeddingError> {
        Ok(())
    }

    async fn count(&self) -> Result<usize, EmbeddingError> {
        let records: Vec<EmbeddingRecord> = self
            .db
            .select(&self.table)
            .await
            .map_err(|e| EmbeddingError::StoreError(format!("Count failed: {}", e)))?;

        Ok(records.len())
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[tokio::test]
    async fn test_surrealdb_store_basic() {
        let config = VectorStoreConfig::new(384);
        let store = SurrealDbStore::new_memory("test_embeddings", config)
            .await
            .expect("Failed to create store");

        let embedding = vec![0.1; 384];
        let metadata = serde_json::json!({"text": "hello world"});

        store
            .upsert("test_id", &embedding, metadata.clone())
            .await
            .expect("Failed to upsert");

        let result = store.get("test_id").await.expect("Failed to get");
        assert!(result.is_some());
        assert_eq!(result.as_ref().unwrap().id, "test_id");
        assert_eq!(result.as_ref().unwrap().metadata, metadata);

        let search_results = store
            .search(&embedding, 5, None)
            .await
            .expect("Failed to search");
        assert_eq!(search_results.len(), 1);
        assert!(search_results[0].score > 0.99);

        let count = store.count().await.expect("Failed to count");
        assert_eq!(count, 1);
    }

    #[tokio::test]
    async fn test_surrealdb_store_search() {
        let config = VectorStoreConfig::new(3);
        let store = SurrealDbStore::new_memory("test_search", config)
            .await
            .expect("Failed to create store");

        let embeddings = [
            vec![1.0, 0.0, 0.0],
            vec![0.0, 1.0, 0.0],
            vec![0.5, 0.5, 0.0],
        ];

        for (i, embedding) in embeddings.iter().enumerate() {
            store
                .upsert(
                    &format!("id_{}", i),
                    embedding,
                    serde_json::json!({"idx": i}),
                )
                .await
                .expect("Failed to upsert");
        }

        let query = vec![1.0, 0.0, 0.0];
        let results = store
            .search(&query, 2, None)
            .await
            .expect("Failed to search");

        assert_eq!(results.len(), 2);
        assert_eq!(results[0].id, "id_0");
        assert!(results[0].score > 0.99);
    }

    #[tokio::test]
    async fn test_surrealdb_store_delete() {
        let config = VectorStoreConfig::new(384);
        let store = SurrealDbStore::new_memory("test_delete", config)
            .await
            .expect("Failed to create store");

        let embedding = vec![0.1; 384];
        store
            .upsert("test_id", &embedding, serde_json::json!({}))
            .await
            .expect("Failed to upsert");

        let deleted = store.delete("test_id").await.expect("Failed to delete");
        assert!(deleted);

        let result = store.get("test_id").await.expect("Failed to get");
        assert!(result.is_none());
    }
}
102
crates/stratum-embeddings/src/store/traits.rs
Normal file
@ -0,0 +1,102 @@
use async_trait::async_trait;
use serde::{Deserialize, Serialize};

use crate::{error::EmbeddingError, traits::Embedding};

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct SearchResult {
    pub id: String,
    pub score: f32,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub embedding: Option<Embedding>,
    pub metadata: serde_json::Value,
}

#[derive(Debug, Clone, Default)]
pub struct SearchFilter {
    pub metadata: Option<serde_json::Value>,
    pub min_score: Option<f32>,
}

#[derive(Debug, Clone)]
pub struct VectorStoreConfig {
    pub dimensions: usize,
    pub metric: DistanceMetric,
    pub options: serde_json::Value,
}

impl VectorStoreConfig {
    pub fn new(dimensions: usize) -> Self {
        Self {
            dimensions,
            metric: DistanceMetric::default(),
            options: serde_json::json!({}),
        }
    }

    pub fn with_metric(mut self, metric: DistanceMetric) -> Self {
        self.metric = metric;
        self
    }

    pub fn with_options(mut self, options: serde_json::Value) -> Self {
        self.options = options;
        self
    }
}

#[derive(Debug, Clone, Copy, Default)]
pub enum DistanceMetric {
    #[default]
    Cosine,
    Euclidean,
    DotProduct,
}

#[async_trait]
pub trait VectorStore: Send + Sync {
    fn name(&self) -> &str;
    fn dimensions(&self) -> usize;

    async fn upsert(
        &self,
        id: &str,
        embedding: &Embedding,
        metadata: serde_json::Value,
    ) -> Result<(), EmbeddingError>;

    async fn upsert_batch(
        &self,
        items: Vec<(String, Embedding, serde_json::Value)>,
    ) -> Result<(), EmbeddingError> {
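        // Default implementation: sequential single upserts. Stores with a
        // bulk write path (e.g. LanceDbStore) override this.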
        for (id, embedding, metadata) in items {
            self.upsert(&id, &embedding, metadata).await?;
        }
        Ok(())
    }

    async fn search(
        &self,
        embedding: &Embedding,
        limit: usize,
        filter: Option<SearchFilter>,
    ) -> Result<Vec<SearchResult>, EmbeddingError>;

    async fn get(&self, id: &str) -> Result<Option<SearchResult>, EmbeddingError>;

    async fn delete(&self, id: &str) -> Result<bool, EmbeddingError>;

    async fn delete_batch(&self, ids: &[&str]) -> Result<usize, EmbeddingError> {
        let mut count = 0;
        for id in ids {
            if self.delete(id).await? {
                count += 1;
            }
        }
        Ok(count)
    }

    async fn flush(&self) -> Result<(), EmbeddingError>;

    async fn count(&self) -> Result<usize, EmbeddingError>;
}
162
crates/stratum-embeddings/src/traits.rs
Normal file
@ -0,0 +1,162 @@
use async_trait::async_trait;

pub type Embedding = Vec<f32>;

#[derive(Debug, Clone)]
pub struct EmbeddingResult {
    pub embeddings: Vec<Embedding>,
    pub model: String,
    pub dimensions: usize,
    pub total_tokens: Option<u32>,
    pub cached_count: usize,
}

#[derive(Debug, Clone, Default)]
pub struct EmbeddingOptions {
    pub normalize: bool,
    pub truncate: bool,
    pub use_cache: bool,
}

impl EmbeddingOptions {
    pub fn default_with_cache() -> Self {
        Self {
            normalize: true,
            truncate: true,
            use_cache: true,
        }
    }

    pub fn no_cache() -> Self {
        Self {
            normalize: true,
            truncate: true,
            use_cache: false,
        }
    }
}

#[async_trait]
pub trait EmbeddingProvider: Send + Sync {
    fn name(&self) -> &str;
    fn model(&self) -> &str;
    fn dimensions(&self) -> usize;
    fn is_local(&self) -> bool;
    fn max_tokens(&self) -> usize;
    fn max_batch_size(&self) -> usize;
    fn cost_per_1m_tokens(&self) -> f64;

    async fn is_available(&self) -> bool;

    async fn embed(
        &self,
        text: &str,
        options: &EmbeddingOptions,
    ) -> Result<Embedding, crate::error::EmbeddingError>;

    async fn embed_batch(
        &self,
        texts: &[&str],
        options: &EmbeddingOptions,
    ) -> Result<EmbeddingResult, crate::error::EmbeddingError>;
}

#[derive(Debug, Clone)]
pub struct ProviderInfo {
    pub name: String,
    pub model: String,
    pub dimensions: usize,
    pub is_local: bool,
    pub cost_per_1m: f64,
    pub max_batch_size: usize,
}

impl<T: EmbeddingProvider> From<&T> for ProviderInfo {
    fn from(provider: &T) -> Self {
        Self {
            name: provider.name().to_string(),
            model: provider.model().to_string(),
            dimensions: provider.dimensions(),
            is_local: provider.is_local(),
            cost_per_1m: provider.cost_per_1m_tokens(),
            max_batch_size: provider.max_batch_size(),
        }
    }
}

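/// In-place L2 normalization: after this call a nonzero vector has unit
/// magnitude, so dot product equals cosine similarity.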
pub fn normalize_embedding(embedding: &mut Embedding) {
    let magnitude: f32 = embedding.iter().map(|x| x * x).sum::<f32>().sqrt();
    if magnitude > 0.0 {
        embedding.iter_mut().for_each(|x| *x /= magnitude);
    }
}

pub fn cosine_similarity(a: &Embedding, b: &Embedding) -> f32 {
    if a.len() != b.len() {
        return 0.0;
    }

    let dot_product: f32 = a.iter().zip(b.iter()).map(|(x, y)| x * y).sum();
    let magnitude_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let magnitude_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();

    if magnitude_a > 0.0 && magnitude_b > 0.0 {
        dot_product / (magnitude_a * magnitude_b)
    } else {
        0.0
    }
}

pub fn euclidean_distance(a: &Embedding, b: &Embedding) -> f32 {
    if a.len() != b.len() {
        return f32::MAX;
    }

    a.iter()
        .zip(b.iter())
        .map(|(x, y)| (x - y).powi(2))
        .sum::<f32>()
        .sqrt()
}

#[cfg(test)]
mod tests {
    use approx::assert_relative_eq;

    use super::*;

    #[test]
    fn test_normalize_embedding() {
        let mut embedding = vec![3.0, 4.0];
        normalize_embedding(&mut embedding);
        assert_relative_eq!(embedding[0], 0.6, epsilon = 0.0001);
        assert_relative_eq!(embedding[1], 0.8, epsilon = 0.0001);

        let magnitude: f32 = embedding.iter().map(|x| x * x).sum::<f32>().sqrt();
        assert_relative_eq!(magnitude, 1.0, epsilon = 0.0001);
    }

    #[test]
    fn test_cosine_similarity() {
        let a = vec![1.0, 0.0, 0.0];
        let b = vec![1.0, 0.0, 0.0];
        assert_relative_eq!(cosine_similarity(&a, &b), 1.0, epsilon = 0.0001);

        let c = vec![0.0, 1.0, 0.0];
        assert_relative_eq!(cosine_similarity(&a, &c), 0.0, epsilon = 0.0001);

        let d = vec![-1.0, 0.0, 0.0];
        assert_relative_eq!(cosine_similarity(&a, &d), -1.0, epsilon = 0.0001);
    }

    #[test]
    fn test_euclidean_distance() {
        let a = vec![0.0, 0.0];
        let b = vec![3.0, 4.0];
        assert_relative_eq!(euclidean_distance(&a, &b), 5.0, epsilon = 0.0001);

        let c = vec![1.0, 1.0];
        let d = vec![1.0, 1.0];
        assert_relative_eq!(euclidean_distance(&c, &d), 0.0, epsilon = 0.0001);
    }
}
65
crates/stratum-llm/Cargo.toml
Normal file
@ -0,0 +1,65 @@
[package]
name = "stratum-llm"
version = "0.1.0"
edition.workspace = true
description = "Unified LLM abstraction with CLI detection, fallback, and caching"
license.workspace = true

[dependencies]
# Async runtime
tokio = { workspace = true }
async-trait = { workspace = true }
futures = { workspace = true }

# HTTP client
reqwest = { workspace = true, features = ["stream"] }

# Serialization
serde = { workspace = true }
serde_json = { workspace = true }
serde_yaml = { workspace = true, optional = true }

# Caching
moka = { workspace = true }

# Error handling
thiserror = { workspace = true }

# Logging
tracing = { workspace = true }

# Metrics
prometheus = { workspace = true, optional = true }

# Utilities
dirs = { workspace = true }
chrono = { workspace = true }
uuid = { workspace = true, features = ["v4"] }
which = { workspace = true, optional = true }

# Hashing for cache keys
xxhash-rust = { workspace = true }

[features]
default = ["anthropic", "openai", "ollama"]
anthropic = []
openai = []
deepseek = []
ollama = []
claude-cli = []
openai-cli = []
kogral = ["serde_yaml", "which"]
metrics = ["prometheus"]
all = [
    "anthropic",
    "openai",
    "deepseek",
    "ollama",
    "claude-cli",
    "openai-cli",
    "kogral",
    "metrics",
]

[dev-dependencies]
tokio-test = { workspace = true }
131
crates/stratum-llm/README.md
Normal file
@ -0,0 +1,131 @@
# stratum-llm

Unified LLM abstraction for the stratumiops ecosystem with automatic provider detection, fallback chains, and smart caching.

## Features

- **Credential Auto-detection**: Automatically finds CLI credentials (Claude, OpenAI) and API keys
- **Provider Fallback**: Circuit breaker pattern with automatic failover across providers
- **Smart Caching**: xxHash-based request deduplication reduces duplicate API calls (sketched after this list)
- **Kogral Integration**: Inject project context from knowledge base (optional)
- **Cost Tracking**: Transparent cost estimation across all providers
- **Multiple Providers**: Anthropic Claude, OpenAI, DeepSeek, Ollama
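
How the deduplication works, as a minimal sketch. The crate's real key layout is internal; `cache_key` is a hypothetical helper, and `xxh3_64` assumes the `xxh3` feature of `xxhash-rust`:

```rust
use xxhash_rust::xxh3::xxh3_64;

// Hash model + prompt together so identical requests collapse to the
// same cache key and are served from the cache instead of the API.
fn cache_key(model: &str, prompt: &str) -> u64 {
    let mut buf = Vec::with_capacity(model.len() + prompt.len() + 1);
    buf.extend_from_slice(model.as_bytes());
    buf.push(0); // separator avoids ambiguous concatenations
    buf.extend_from_slice(prompt.as_bytes());
    xxh3_64(&buf)
}
```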

## Quick Start

```rust
use stratum_llm::{UnifiedClient, Message, Role};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = UnifiedClient::auto()?;

    let messages = vec![
        Message {
            role: Role::User,
            content: "What is Rust?".to_string(),
        }
    ];

    let response = client.generate(&messages, None).await?;
    println!("{}", response.content);

    Ok(())
}
```

## Provider Priority

1. **CLI credentials** (subscription-based, no per-token cost) - preferred
2. **API keys** from environment variables
3. **Local models** (Ollama)

The client automatically detects available credentials and builds a fallback chain.
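
To inspect the chain the client built, mirroring `examples/basic_usage.rs` (the `providers()` accessor and its fields are taken from that example):

```rust
let client = UnifiedClient::auto()?;
for provider in client.providers() {
    // Ordered by the priority above: CLI credentials, then API keys, then Ollama.
    println!(
        "{} ({}): subscription={}",
        provider.name, provider.model, provider.is_subscription
    );
}
```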
|
||||
## Features
|
||||
|
||||
### Default Features
|
||||
|
||||
```toml
|
||||
[dependencies]
|
||||
stratum-llm = "0.1"
|
||||
```
|
||||
|
||||
Includes: Anthropic, OpenAI, Ollama
|
||||
|
||||
### All Features
|
||||
|
||||
```toml
|
||||
[dependencies]
|
||||
stratum-llm = { version = "0.1", features = ["all"] }
|
||||
```
|
||||
|
||||
Includes: All providers, CLI detection, Kogral integration, Prometheus metrics
|
||||
|
||||
### Custom Feature Set
|
||||
|
||||
```toml
|
||||
[dependencies]
|
||||
stratum-llm = { version = "0.1", features = ["anthropic", "deepseek", "kogral"] }
|
||||
```
|
||||
|
||||
Available features:
|
||||
|
||||
- `anthropic` - Anthropic Claude API
|
||||
- `openai` - OpenAI API
|
||||
- `deepseek` - DeepSeek API
|
||||
- `ollama` - Ollama local models
|
||||
- `claude-cli` - Claude CLI credential detection
|
||||
- `kogral` - Kogral knowledge base integration
|
||||
- `metrics` - Prometheus metrics
|
||||
|
||||
## Advanced Usage
|
||||
|
||||
### With Kogral Context
|
||||
|
||||
```rust
|
||||
let client = UnifiedClient::builder()
|
||||
.auto_detect()?
|
||||
.with_kogral()
|
||||
.build()?;
|
||||
|
||||
let response = client
|
||||
.generate_with_kogral(&messages, None, Some("rust"), None)
|
||||
.await?;
|
||||
```
|
||||
|
||||
### Custom Fallback Strategy
|
||||
|
||||
```rust
|
||||
use stratum_llm::{FallbackStrategy, ProviderChain};
|
||||
|
||||
let chain = ProviderChain::from_detected()?
|
||||
.with_strategy(FallbackStrategy::OnRateLimitOrUnavailable);
|
||||
|
||||
let client = UnifiedClient::builder()
|
||||
.with_chain(chain)
|
||||
.build()?;
|
||||
```
|
||||
|
||||
### Cost Budget
|
||||
|
||||
```rust
|
||||
let chain = ProviderChain::from_detected()?
|
||||
.with_strategy(FallbackStrategy::OnBudgetExceeded {
|
||||
budget_cents: 10.0,
|
||||
});
|
||||
```
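
Pass the chain to the client builder as in the previous section.

### Caching

Responses are deduplicated by default. A minimal sketch of tuning the cache via `CacheConfig` (the field values here are illustrative):

```rust
use std::time::Duration;
use stratum_llm::{CacheConfig, UnifiedClient};

let client = UnifiedClient::builder()
    .auto_detect()?
    .with_cache(CacheConfig {
        enabled: true,
        max_entries: 500,              // bounded in-memory entries (moka)
        ttl: Duration::from_secs(600), // entries expire after 10 minutes
    })
    .build()?;
```

Use `.without_cache()` on the builder to disable deduplication entirely.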

## Examples

Run examples with:

```bash
cargo run --example basic_usage
cargo run --example with_kogral --features kogral
cargo run --example fallback_demo
```

## License

MIT OR Apache-2.0
43
crates/stratum-llm/examples/basic_usage.rs
Normal file
@ -0,0 +1,43 @@
use stratum_llm::{Message, Role, UnifiedClient};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    tracing_subscriber::fmt::init();

    println!("Creating UnifiedClient with auto-detected providers...");
    let client = UnifiedClient::auto()?;

    println!("\nAvailable providers:");
    for provider in client.providers() {
        println!(
            " - {} ({}): circuit={:?}, subscription={}",
            provider.name, provider.model, provider.circuit_state, provider.is_subscription
        );
    }

    let messages = vec![Message {
        role: Role::User,
        content: "What is the capital of France? Answer in one word.".to_string(),
    }];

    println!("\nSending request...");
    match client.generate(&messages, None).await {
        Ok(response) => {
            println!("\n✓ Success!");
            println!("Provider: {}", response.provider);
            println!("Model: {}", response.model);
            println!("Response: {}", response.content);
            println!(
                "Tokens: {} in, {} out",
                response.input_tokens, response.output_tokens
            );
            println!("Cost: ${:.4}", response.cost_cents / 100.0);
            println!("Latency: {}ms", response.latency_ms);
        }
        Err(e) => {
            eprintln!("\n✗ Error: {}", e);
        }
    }

    Ok(())
}
54
crates/stratum-llm/examples/fallback_demo.rs
Normal file
@ -0,0 +1,54 @@
use stratum_llm::{FallbackStrategy, Message, ProviderChain, Role, UnifiedClient};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    tracing_subscriber::fmt::init();

    println!("Creating UnifiedClient with budget-based fallback strategy...");

    // Build the chain explicitly so the budget-based strategy is actually applied.
    let chain = ProviderChain::from_detected()?
        .with_strategy(FallbackStrategy::OnBudgetExceeded { budget_cents: 10.0 });
    let client = UnifiedClient::builder().with_chain(chain).build()?;

    let messages = vec![Message {
        role: Role::User,
        content: "Explain quantum computing in simple terms.".to_string(),
    }];

    println!("\nProvider chain:");
    for (idx, provider) in client.providers().iter().enumerate() {
        println!(
            " {}. {} ({}) - circuit: {:?}",
            idx + 1,
            provider.name,
            provider.model,
            provider.circuit_state
        );
    }

    println!("\nSending multiple requests to test fallback...");

    for i in 1..=3 {
        println!("\n--- Request {} ---", i);
        match client.generate(&messages, None).await {
            Ok(response) => {
                println!(
                    "✓ Provider: {} | Model: {} | Cost: ${:.4} | Latency: {}ms",
                    response.provider,
                    response.model,
                    response.cost_cents / 100.0,
                    response.latency_ms
                );
                // Truncate on a character boundary to avoid panicking on multi-byte UTF-8.
                println!(
                    "Response preview: {}...",
                    response.content.chars().take(100).collect::<String>()
                );
            }
            Err(e) => {
                eprintln!("✗ All providers failed: {}", e);
            }
        }

        tokio::time::sleep(tokio::time::Duration::from_secs(1)).await;
    }

    Ok(())
}
45
crates/stratum-llm/examples/with_kogral.rs
Normal file
@ -0,0 +1,45 @@
#[cfg(feature = "kogral")]
use stratum_llm::{Message, Role, UnifiedClient};

#[cfg(feature = "kogral")]
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    tracing_subscriber::fmt::init();

    println!("Creating UnifiedClient with Kogral integration...");
    let client = UnifiedClient::builder()
        .auto_detect()?
        .with_kogral()
        .build()?;

    let messages = vec![Message {
        role: Role::User,
        content: "Write a simple Rust function to add two numbers.".to_string(),
    }];

    println!("\nSending request with Rust guidelines from Kogral...");
    match client
        .generate_with_kogral(&messages, None, Some("rust"), None)
        .await
    {
        Ok(response) => {
            println!("\n✓ Success!");
            println!("Provider: {}", response.provider);
            println!("Model: {}", response.model);
            println!("Response:\n{}", response.content);
            println!("\nCost: ${:.4}", response.cost_cents / 100.0);
            println!("Latency: {}ms", response.latency_ms);
        }
        Err(e) => {
            eprintln!("\n✗ Error: {}", e);
        }
    }

    Ok(())
}

#[cfg(not(feature = "kogral"))]
fn main() {
    eprintln!("This example requires the 'kogral' feature.");
    eprintln!("Run with: cargo run --example with_kogral --features kogral");
}
3
crates/stratum-llm/src/cache/mod.rs
vendored
Normal file
@ -0,0 +1,3 @@
pub mod request_cache;

pub use request_cache::{CacheConfig, CacheStats, CachedResponse, RequestCache};
151
crates/stratum-llm/src/cache/request_cache.rs
vendored
Normal file
@ -0,0 +1,151 @@
use std::time::Duration;

use moka::future::Cache;
use xxhash_rust::xxh3::xxh3_64;

#[derive(Clone)]
pub struct CachedResponse {
    pub content: String,
    pub model: String,
    pub provider: String,
    pub cached_at: chrono::DateTime<chrono::Utc>,
}

pub struct RequestCache {
    cache: Cache<u64, CachedResponse>,
    enabled: bool,
}

#[derive(Debug, Clone)]
pub struct CacheConfig {
    pub enabled: bool,
    pub max_entries: u64,
    pub ttl: Duration,
}

impl Default for CacheConfig {
    fn default() -> Self {
        Self {
            enabled: true,
            max_entries: 1000,
            ttl: Duration::from_secs(3600),
        }
    }
}

impl RequestCache {
    pub fn new(config: CacheConfig) -> Self {
        let cache = Cache::builder()
            .max_capacity(config.max_entries)
            .time_to_live(config.ttl)
            .build();

        Self {
            cache,
            enabled: config.enabled,
        }
    }

    fn compute_key(
        &self,
        messages: &[crate::providers::Message],
        options: &crate::providers::GenerationOptions,
    ) -> u64 {
        let mut hasher_input = String::new();

        for msg in messages {
            hasher_input.push_str(&format!("{:?}:{}\n", msg.role, msg.content));
        }

        hasher_input.push_str(&format!(
            "temp:{:?}|max:{:?}|top_p:{:?}",
            options.temperature, options.max_tokens, options.top_p,
        ));

        xxh3_64(hasher_input.as_bytes())
    }

    pub async fn get(
        &self,
        messages: &[crate::providers::Message],
        options: &crate::providers::GenerationOptions,
    ) -> Option<CachedResponse> {
        if !self.enabled {
            return None;
        }

        let key = self.compute_key(messages, options);
        self.cache.get(&key).await
    }

    pub async fn put(
        &self,
        messages: &[crate::providers::Message],
        options: &crate::providers::GenerationOptions,
        response: &crate::providers::GenerationResponse,
    ) {
        if !self.enabled {
            return;
        }

        let key = self.compute_key(messages, options);
        let cached = CachedResponse {
            content: response.content.clone(),
            model: response.model.clone(),
            provider: response.provider.clone(),
            cached_at: chrono::Utc::now(),
        };

        self.cache.insert(key, cached).await;
    }

    pub async fn get_or_generate<F, Fut>(
        &self,
        messages: &[crate::providers::Message],
        options: &crate::providers::GenerationOptions,
        generate: F,
    ) -> Result<crate::providers::GenerationResponse, crate::error::LlmError>
    where
        F: FnOnce() -> Fut,
        Fut: std::future::Future<
            Output = Result<crate::providers::GenerationResponse, crate::error::LlmError>,
        >,
    {
        if let Some(cached) = self.get(messages, options).await {
            tracing::debug!("Cache hit");
            return Ok(crate::providers::GenerationResponse {
                content: cached.content,
                model: cached.model,
                provider: cached.provider,
                input_tokens: 0,
                output_tokens: 0,
                cost_cents: 0.0,
                latency_ms: 0,
            });
        }

        let response = generate().await?;
        self.put(messages, options, &response).await;

        Ok(response)
    }

    pub fn stats(&self) -> CacheStats {
        CacheStats {
            entry_count: self.cache.entry_count(),
            // Hit/miss counters are not tracked yet; only the entry count is live.
            hit_count: 0,
            miss_count: 0,
        }
    }

    pub fn clear(&self) {
        self.cache.invalidate_all();
    }
}

#[derive(Debug)]
pub struct CacheStats {
    pub entry_count: u64,
    pub hit_count: u64,
    pub miss_count: u64,
}
146
crates/stratum-llm/src/chain/circuit_breaker.rs
Normal file
@ -0,0 +1,146 @@
use std::sync::atomic::{AtomicU32, AtomicU8, Ordering};
use std::sync::RwLock;
use std::time::{Duration, Instant};

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum CircuitState {
    Closed = 0,
    Open = 1,
    HalfOpen = 2,
}

pub struct CircuitBreaker {
    state: AtomicU8,
    failure_count: AtomicU32,
    success_count: AtomicU32,
    config: CircuitBreakerConfig,
    last_failure_time: RwLock<Option<Instant>>,
    last_success_time: RwLock<Option<Instant>>,
}

#[derive(Debug, Clone)]
pub struct CircuitBreakerConfig {
    pub failure_threshold: u32,
    pub success_threshold: u32,
    pub reset_timeout: Duration,
    pub request_timeout: Duration,
}

impl Default for CircuitBreakerConfig {
    fn default() -> Self {
        Self {
            failure_threshold: 5,
            success_threshold: 3,
            reset_timeout: Duration::from_secs(30),
            request_timeout: Duration::from_secs(60),
        }
    }
}

impl CircuitBreaker {
    pub fn new(config: CircuitBreakerConfig) -> Self {
        Self {
            state: AtomicU8::new(CircuitState::Closed as u8),
            failure_count: AtomicU32::new(0),
            success_count: AtomicU32::new(0),
            config,
            last_failure_time: RwLock::new(None),
            last_success_time: RwLock::new(None),
        }
    }

    pub fn state(&self) -> CircuitState {
        match self.state.load(Ordering::SeqCst) {
            0 => CircuitState::Closed,
            1 => CircuitState::Open,
            2 => CircuitState::HalfOpen,
            _ => CircuitState::Closed,
        }
    }

    pub fn should_allow(&self) -> bool {
        match self.state() {
            CircuitState::Closed => true,
            CircuitState::Open => {
                if let Some(last_failure) = *self.last_failure_time.read().unwrap() {
                    if last_failure.elapsed() >= self.config.reset_timeout {
                        self.state
                            .store(CircuitState::HalfOpen as u8, Ordering::SeqCst);
                        self.success_count.store(0, Ordering::SeqCst);
                        return true;
                    }
                }
                false
            }
            CircuitState::HalfOpen => true,
        }
    }

    pub fn record_success(&self) {
        *self.last_success_time.write().unwrap() = Some(Instant::now());
        self.failure_count.store(0, Ordering::SeqCst);

        if self.state() == CircuitState::HalfOpen {
            let count = self.success_count.fetch_add(1, Ordering::SeqCst) + 1;
            if count >= self.config.success_threshold {
                self.state
                    .store(CircuitState::Closed as u8, Ordering::SeqCst);
                tracing::info!("Circuit breaker closed (recovered)");
            }
        }
    }

    pub fn record_failure(&self) {
        *self.last_failure_time.write().unwrap() = Some(Instant::now());

        match self.state() {
            CircuitState::Closed => {
                let count = self.failure_count.fetch_add(1, Ordering::SeqCst) + 1;
                if count >= self.config.failure_threshold {
                    self.state.store(CircuitState::Open as u8, Ordering::SeqCst);
                    tracing::warn!(
                        failures = count,
                        "Circuit breaker opened (too many failures)"
                    );
                }
            }
            CircuitState::HalfOpen => {
                self.state.store(CircuitState::Open as u8, Ordering::SeqCst);
                self.success_count.store(0, Ordering::SeqCst);
                tracing::warn!("Circuit breaker reopened (half-open test failed)");
            }
            CircuitState::Open => {}
        }
    }

    pub async fn call<F, T, E>(&self, f: F) -> Result<T, CircuitError<E>>
    where
        F: std::future::Future<Output = Result<T, E>>,
    {
        if !self.should_allow() {
            return Err(CircuitError::Open);
        }

        match tokio::time::timeout(self.config.request_timeout, f).await {
            Ok(Ok(result)) => {
                self.record_success();
                Ok(result)
            }
            Ok(Err(e)) => {
                self.record_failure();
                Err(CircuitError::Inner(e))
            }
            Err(_) => {
                self.record_failure();
                Err(CircuitError::Timeout)
            }
        }
    }
}

#[derive(Debug)]
pub enum CircuitError<E> {
    Open,
    Timeout,
    Inner(E),
}
5
crates/stratum-llm/src/chain/mod.rs
Normal file
@ -0,0 +1,5 @@
pub mod circuit_breaker;
pub mod provider_chain;

pub use circuit_breaker::{CircuitBreaker, CircuitBreakerConfig, CircuitError, CircuitState};
pub use provider_chain::{FallbackStrategy, ProviderChain, ProviderInfo};
208
crates/stratum-llm/src/chain/provider_chain.rs
Normal file
@ -0,0 +1,208 @@
use crate::chain::circuit_breaker::{CircuitBreaker, CircuitBreakerConfig, CircuitError};
use crate::credentials::CredentialDetector;
use crate::error::LlmError;
use crate::providers::{
    ConfiguredProvider, GenerationOptions, GenerationResponse, LlmProvider, Message,
};

pub struct ProviderChain {
    providers: Vec<ProviderWithCircuit>,
    strategy: FallbackStrategy,
}

struct ProviderWithCircuit {
    provider: Box<dyn LlmProvider>,
    circuit: CircuitBreaker,
    is_subscription: bool,
    priority: u32,
}

#[derive(Clone)]
pub enum FallbackStrategy {
    Sequential,
    OnRateLimitOrUnavailable,
    OnBudgetExceeded { budget_cents: f64 },
}

impl ProviderChain {
    pub fn from_detected() -> Result<Self, LlmError> {
        let detector = CredentialDetector::new();
        let credentials = detector.detect_all();

        if credentials.is_empty() {
            return Err(LlmError::NoProvidersAvailable);
        }

        let mut providers = Vec::new();

        for cred in credentials {
            let provider: Option<Box<dyn LlmProvider>> = match cred.provider.as_str() {
                #[cfg(feature = "anthropic")]
                "anthropic" => Some(Box::new(
                    crate::providers::AnthropicProvider::sonnet()
                        .map_err(|_| LlmError::NoProvidersAvailable)?,
                )),
                #[cfg(feature = "openai")]
                "openai" => Some(Box::new(
                    crate::providers::OpenAiProvider::gpt4o()
                        .map_err(|_| LlmError::NoProvidersAvailable)?,
                )),
                #[cfg(feature = "deepseek")]
                "deepseek" => Some(Box::new(
                    crate::providers::DeepSeekProvider::coder()
                        .map_err(|_| LlmError::NoProvidersAvailable)?,
                )),
                #[cfg(feature = "ollama")]
                "ollama" => Some(Box::new(crate::providers::OllamaProvider::default())),
                _ => None,
            };

            if let Some(provider) = provider {
                providers.push(ProviderWithCircuit {
                    provider,
                    circuit: CircuitBreaker::new(CircuitBreakerConfig::default()),
                    is_subscription: cred.is_subscription,
                    priority: if cred.is_subscription { 0 } else { 10 },
                });
            }
        }

        if providers.is_empty() {
            return Err(LlmError::NoProvidersAvailable);
        }

        providers.sort_by_key(|p| p.priority);

        Ok(Self {
            providers,
            strategy: FallbackStrategy::Sequential,
        })
    }

    pub fn with_providers(providers: Vec<ConfiguredProvider>) -> Self {
        let providers = providers
            .into_iter()
            .map(|p| ProviderWithCircuit {
                provider: p.provider,
                circuit: CircuitBreaker::new(CircuitBreakerConfig::default()),
                is_subscription: matches!(
                    p.credential_source,
                    crate::providers::CredentialSource::Cli { .. }
                ),
                priority: p.priority,
            })
            .collect();

        Self {
            providers,
            strategy: FallbackStrategy::Sequential,
        }
    }

    pub fn with_strategy(mut self, strategy: FallbackStrategy) -> Self {
        self.strategy = strategy;
        self
    }

    pub async fn generate(
        &self,
        messages: &[Message],
        options: &GenerationOptions,
    ) -> Result<GenerationResponse, LlmError> {
        let mut last_error: Option<LlmError> = None;

        for pwc in &self.providers {
            if !pwc.circuit.should_allow() {
                tracing::debug!(provider = pwc.provider.name(), "Circuit open, skipping");
                continue;
            }

            if let FallbackStrategy::OnBudgetExceeded { budget_cents } = &self.strategy {
                let estimated_cost = pwc.provider.estimate_cost(
                    estimate_tokens(messages),
                    options.max_tokens.unwrap_or(1000),
                );
                if estimated_cost > *budget_cents {
                    tracing::debug!(
                        provider = pwc.provider.name(),
                        estimated_cost,
                        budget = budget_cents,
                        "Would exceed budget, trying next"
                    );
                    continue;
                }
            }

            match pwc
                .circuit
                .call(pwc.provider.generate(messages, options))
                .await
            {
                Ok(response) => {
                    tracing::info!(
                        provider = pwc.provider.name(),
                        model = pwc.provider.model(),
                        cost_cents = response.cost_cents,
                        latency_ms = response.latency_ms,
                        "Request successful"
                    );
                    return Ok(response);
                }
                Err(CircuitError::Open) => {
                    tracing::debug!(provider = pwc.provider.name(), "Circuit open");
                    continue;
                }
                Err(CircuitError::Timeout) => {
                    tracing::warn!(provider = pwc.provider.name(), "Request timed out");
                    last_error = Some(LlmError::Timeout);
                    continue;
                }
                Err(CircuitError::Inner(e)) => {
                    tracing::warn!(provider = pwc.provider.name(), error = %e, "Request failed");

                    let should_fallback = match &self.strategy {
                        FallbackStrategy::Sequential => true,
                        FallbackStrategy::OnRateLimitOrUnavailable => {
                            matches!(e, LlmError::RateLimit(_) | LlmError::Unavailable(_))
                        }
                        FallbackStrategy::OnBudgetExceeded { .. } => true,
                    };

                    if should_fallback {
                        last_error = Some(e);
                        continue;
                    } else {
                        return Err(e);
                    }
                }
            }
        }

        Err(last_error.unwrap_or(LlmError::NoProvidersAvailable))
    }

    pub fn provider_info(&self) -> Vec<ProviderInfo> {
        self.providers
            .iter()
            .map(|p| ProviderInfo {
                name: p.provider.name().to_string(),
                model: p.provider.model().to_string(),
                is_subscription: p.is_subscription,
                circuit_state: p.circuit.state(),
            })
            .collect()
    }
}

#[derive(Debug)]
pub struct ProviderInfo {
    pub name: String,
    pub model: String,
    pub is_subscription: bool,
    pub circuit_state: crate::chain::circuit_breaker::CircuitState,
}

fn estimate_tokens(messages: &[Message]) -> u32 {
    // Rough heuristic: roughly four characters per token.
    let total_chars: usize = messages.iter().map(|m| m.content.len()).sum();
    (total_chars / 4) as u32
}
167
crates/stratum-llm/src/client.rs
Normal file
@ -0,0 +1,167 @@
use crate::cache::{CacheConfig, RequestCache};
use crate::chain::ProviderChain;
use crate::error::LlmError;
#[cfg(feature = "kogral")]
use crate::kogral::KogralIntegration;
use crate::providers::{GenerationOptions, GenerationResponse, Message};

pub struct UnifiedClient {
    chain: ProviderChain,
    cache: RequestCache,
    #[cfg(feature = "kogral")]
    kogral: Option<KogralIntegration>,
    default_options: GenerationOptions,
}

pub struct UnifiedClientBuilder {
    chain: Option<ProviderChain>,
    cache_config: CacheConfig,
    #[cfg(feature = "kogral")]
    kogral: Option<KogralIntegration>,
    default_options: GenerationOptions,
}

impl UnifiedClientBuilder {
    pub fn new() -> Self {
        Self {
            chain: None,
            cache_config: CacheConfig::default(),
            #[cfg(feature = "kogral")]
            kogral: None,
            default_options: GenerationOptions::default(),
        }
    }

    pub fn auto_detect(mut self) -> Result<Self, LlmError> {
        self.chain = Some(ProviderChain::from_detected()?);
        Ok(self)
    }

    pub fn with_chain(mut self, chain: ProviderChain) -> Self {
        self.chain = Some(chain);
        self
    }

    pub fn with_cache(mut self, config: CacheConfig) -> Self {
        self.cache_config = config;
        self
    }

    pub fn without_cache(mut self) -> Self {
        self.cache_config.enabled = false;
        self
    }

    #[cfg(feature = "kogral")]
    pub fn with_kogral(mut self) -> Self {
        self.kogral = KogralIntegration::new();
        self
    }

    pub fn with_defaults(mut self, options: GenerationOptions) -> Self {
        self.default_options = options;
        self
    }

    pub fn build(self) -> Result<UnifiedClient, LlmError> {
        let chain = self.chain.ok_or(LlmError::NoProvidersAvailable)?;

        Ok(UnifiedClient {
            chain,
            cache: RequestCache::new(self.cache_config),
            #[cfg(feature = "kogral")]
            kogral: self.kogral,
            default_options: self.default_options,
        })
    }
}

impl Default for UnifiedClientBuilder {
    fn default() -> Self {
        Self::new()
    }
}

impl UnifiedClient {
    pub fn auto() -> Result<Self, LlmError> {
        UnifiedClientBuilder::new().auto_detect()?.build()
    }

    pub fn builder() -> UnifiedClientBuilder {
        UnifiedClientBuilder::new()
    }

    pub async fn generate(
        &self,
        messages: &[Message],
        options: Option<&GenerationOptions>,
    ) -> Result<GenerationResponse, LlmError> {
        let opts = options.unwrap_or(&self.default_options);

        self.cache
            .get_or_generate(messages, opts, || self.chain.generate(messages, opts))
            .await
    }

    #[cfg(feature = "kogral")]
    pub async fn generate_with_kogral(
        &self,
        messages: &[Message],
        options: Option<&GenerationOptions>,
        language: Option<&str>,
        domain: Option<&str>,
    ) -> Result<GenerationResponse, LlmError> {
        let opts = options.unwrap_or(&self.default_options);

        let enriched_messages = if let Some(kogral) = &self.kogral {
            let mut ctx = serde_json::json!({});
            kogral
                .enrich_context(&mut ctx, language, domain)
                .await
                .map_err(|e| LlmError::Context(e.to_string()))?;

            self.inject_kogral_context(messages, &ctx)
        } else {
            messages.to_vec()
        };

        self.generate(&enriched_messages, Some(opts)).await
    }

    #[cfg(feature = "kogral")]
    fn inject_kogral_context(
        &self,
        messages: &[Message],
        kogral_ctx: &serde_json::Value,
    ) -> Vec<Message> {
        let mut result = Vec::with_capacity(messages.len());
        let mut system_found = false;

        for msg in messages {
            if matches!(msg.role, crate::providers::Role::System) && !system_found {
                let enhanced_content = format!(
                    "{}\n\n## Project Context (from Kogral)\n{}",
                    msg.content,
                    serde_json::to_string_pretty(kogral_ctx).unwrap_or_default()
                );
                result.push(Message {
                    role: msg.role,
                    content: enhanced_content,
                });
                system_found = true;
            } else {
                result.push(msg.clone());
            }
        }

        result
    }

    pub fn providers(&self) -> Vec<crate::chain::ProviderInfo> {
        self.chain.provider_info()
    }

    pub fn clear_cache(&self) {
        self.cache.clear();
    }
}
68
crates/stratum-llm/src/credentials/claude_cli.rs
Normal file
@ -0,0 +1,68 @@
#[cfg(feature = "claude-cli")]
use crate::credentials::detector::DetectedCredential;
#[cfg(feature = "claude-cli")]
use crate::providers::CredentialSource;

#[cfg(feature = "claude-cli")]
impl crate::credentials::CredentialDetector {
    pub fn detect_claude_cli(&self) -> Option<DetectedCredential> {
        let config_dir = dirs::config_dir()?;

        let possible_paths = [
            config_dir.join("claude").join("credentials.json"),
            config_dir.join("claude-cli").join("auth.json"),
            config_dir.join("anthropic").join("credentials.json"),
        ];

        for path in &possible_paths {
            if let Some(cred) = Self::try_read_claude_credentials(path) {
                return Some(cred);
            }
        }

        None
    }

    fn try_read_claude_credentials(path: &std::path::Path) -> Option<DetectedCredential> {
        if !path.exists() {
            return None;
        }

        let content = std::fs::read_to_string(path).ok()?;
        let json: serde_json::Value = serde_json::from_str(&content).ok()?;

        let token = json.get("access_token")?;
        if !token.is_string() {
            return None;
        }

        if Self::is_token_expired(&json) {
            tracing::debug!("Claude CLI token expired");
            return None;
        }

        Some(DetectedCredential {
            provider: "anthropic".to_string(),
            source: CredentialSource::Cli {
                path: path.to_path_buf(),
            },
            is_subscription: true,
        })
    }

    fn is_token_expired(json: &serde_json::Value) -> bool {
        let Some(expires) = json.get("expires_at") else {
            return false;
        };

        let Some(exp_str) = expires.as_str() else {
            return false;
        };

        let Ok(exp) = chrono::DateTime::parse_from_rfc3339(exp_str) else {
            return false;
        };

        exp < chrono::Utc::now()
    }
}
130
crates/stratum-llm/src/credentials/detector.rs
Normal file
@ -0,0 +1,130 @@
use crate::providers::CredentialSource;

#[derive(Debug, Clone)]
pub struct DetectedCredential {
    pub provider: String,
    pub source: CredentialSource,
    pub is_subscription: bool,
}

pub struct CredentialDetector {
    check_cli: bool,
    check_env: bool,
}

impl CredentialDetector {
    pub fn new() -> Self {
        Self {
            check_cli: true,
            check_env: true,
        }
    }

    pub fn without_cli(mut self) -> Self {
        self.check_cli = false;
        self
    }

    pub fn detect_all(&self) -> Vec<DetectedCredential> {
        let mut credentials = Vec::new();

        if self.check_cli {
            #[cfg(feature = "claude-cli")]
            if let Some(cred) = self.detect_claude_cli() {
                credentials.push(cred);
            }
        }

        if self.check_env {
            if let Some(cred) = self.detect_anthropic_env() {
                credentials.push(cred);
            }
            if let Some(cred) = self.detect_openai_env() {
                credentials.push(cred);
            }
            if let Some(cred) = self.detect_deepseek_env() {
                credentials.push(cred);
            }
        }

        if let Some(cred) = self.detect_ollama() {
            credentials.push(cred);
        }

        credentials
    }

    pub fn detect_for_provider(&self, provider: &str) -> Option<DetectedCredential> {
        match provider {
            "anthropic" | "claude" => {
                #[cfg(feature = "claude-cli")]
                if self.check_cli {
                    if let Some(cred) = self.detect_claude_cli() {
                        return Some(cred);
                    }
                }
                self.detect_anthropic_env()
            }
            "openai" => self.detect_openai_env(),
            "deepseek" => self.detect_deepseek_env(),
            "ollama" => self.detect_ollama(),
            _ => None,
        }
    }

    pub fn detect_anthropic_env(&self) -> Option<DetectedCredential> {
        if std::env::var("ANTHROPIC_API_KEY").is_ok() {
            Some(DetectedCredential {
                provider: "anthropic".to_string(),
                source: CredentialSource::EnvVar {
                    name: "ANTHROPIC_API_KEY".to_string(),
                },
                is_subscription: false,
            })
        } else {
            None
        }
    }

    pub fn detect_openai_env(&self) -> Option<DetectedCredential> {
        if std::env::var("OPENAI_API_KEY").is_ok() {
            Some(DetectedCredential {
                provider: "openai".to_string(),
                source: CredentialSource::EnvVar {
                    name: "OPENAI_API_KEY".to_string(),
                },
                is_subscription: false,
            })
        } else {
            None
        }
    }

    pub fn detect_deepseek_env(&self) -> Option<DetectedCredential> {
        if std::env::var("DEEPSEEK_API_KEY").is_ok() {
            Some(DetectedCredential {
                provider: "deepseek".to_string(),
                source: CredentialSource::EnvVar {
                    name: "DEEPSEEK_API_KEY".to_string(),
                },
                is_subscription: false,
            })
        } else {
            None
        }
    }

    pub fn detect_ollama(&self) -> Option<DetectedCredential> {
        // Ollama runs locally and needs no credentials; it is always offered as a fallback.
        Some(DetectedCredential {
            provider: "ollama".to_string(),
            source: CredentialSource::None,
            is_subscription: false,
        })
    }
}

impl Default for CredentialDetector {
    fn default() -> Self {
        Self::new()
    }
}
6
crates/stratum-llm/src/credentials/mod.rs
Normal file
@ -0,0 +1,6 @@
pub mod detector;

#[cfg(feature = "claude-cli")]
pub mod claude_cli;

pub use detector::{CredentialDetector, DetectedCredential};
47
crates/stratum-llm/src/error.rs
Normal file
@ -0,0 +1,47 @@
use thiserror::Error;

#[derive(Debug, Error)]
pub enum LlmError {
    #[error("No providers available")]
    NoProvidersAvailable,

    #[error("Missing credential: {0}")]
    MissingCredential(String),

    #[error("Network error: {0}")]
    Network(String),

    #[error("API error: {0}")]
    Api(String),

    #[error("Rate limited: {0}")]
    RateLimit(String),

    #[error("Provider unavailable: {0}")]
    Unavailable(String),

    #[error("Request timeout")]
    Timeout,

    #[error("Parse error: {0}")]
    Parse(String),

    #[error("Context error: {0}")]
    Context(String),

    #[error("Circuit breaker open for provider")]
    CircuitOpen,
}

impl LlmError {
    pub fn is_rate_limit(&self) -> bool {
        matches!(self, Self::RateLimit(_))
    }

    pub fn is_retriable(&self) -> bool {
        matches!(
            self,
            Self::Network(_) | Self::RateLimit(_) | Self::Timeout | Self::Unavailable(_)
        )
    }
}
216
crates/stratum-llm/src/kogral/integration.rs
Normal file
@ -0,0 +1,216 @@
#[cfg(feature = "kogral")]
use std::path::PathBuf;

#[cfg(feature = "kogral")]
pub struct KogralIntegration {
    kogral_path: PathBuf,
}

#[cfg(feature = "kogral")]
impl KogralIntegration {
    pub fn new() -> Option<Self> {
        let possible_paths = [
            dirs::home_dir()?.join(".kogral"),
            PathBuf::from("/Users/Akasha/Development/kogral/.kogral"),
        ];

        for path in &possible_paths {
            if path.exists() {
                return Some(Self {
                    kogral_path: path.clone(),
                });
            }
        }

        None
    }

    pub fn with_path(path: impl Into<PathBuf>) -> Self {
        Self {
            kogral_path: path.into(),
        }
    }

    pub async fn get_guidelines(&self, language: &str) -> Result<Vec<Guideline>, KogralError> {
        let guidelines_dir = self.kogral_path.join("default");
        let mut guidelines = Vec::new();

        if let Ok(entries) = std::fs::read_dir(&guidelines_dir) {
            for entry in entries.flatten() {
                let path = entry.path();
                if path.extension().is_none_or(|e| e != "md") {
                    continue;
                }

                let Ok(content) = std::fs::read_to_string(&path) else {
                    continue;
                };

                if let Some(guideline) = Self::parse_guideline(&content, language) {
                    guidelines.push(guideline);
                }
            }
        }

        Ok(guidelines)
    }

    pub async fn get_patterns(&self, domain: &str) -> Result<Vec<Pattern>, KogralError> {
        let patterns_dir = self.kogral_path.join("default");
        let mut patterns = Vec::new();

        if let Ok(entries) = std::fs::read_dir(&patterns_dir) {
            for entry in entries.flatten() {
                let path = entry.path();
                if path.extension().is_none_or(|e| e != "md") {
                    continue;
                }

                let Ok(content) = std::fs::read_to_string(&path) else {
                    continue;
                };

                if let Some(pattern) = Self::parse_pattern(&content, domain) {
                    patterns.push(pattern);
                }
            }
        }

        Ok(patterns)
    }

    pub async fn enrich_context(
        &self,
        context: &mut serde_json::Value,
        language: Option<&str>,
        domain: Option<&str>,
    ) -> Result<(), KogralError> {
        let mut kogral_context = serde_json::json!({});

        if let Some(lang) = language {
            let guidelines = self.get_guidelines(lang).await?;
            if !guidelines.is_empty() {
                kogral_context["guidelines"] = serde_json::to_value(&guidelines)?;
            }
        }

        if let Some(dom) = domain {
            let patterns = self.get_patterns(dom).await?;
            if !patterns.is_empty() {
                kogral_context["patterns"] = serde_json::to_value(&patterns)?;
            }
        }

        if let Some(obj) = context.as_object_mut() {
            obj.insert("kogral".to_string(), kogral_context);
        }

        Ok(())
    }

    fn parse_guideline(content: &str, language: &str) -> Option<Guideline> {
        let (frontmatter, body) = Self::split_frontmatter(content)?;
        let meta: serde_yaml::Value = serde_yaml::from_str(&frontmatter).ok()?;

        let node_type = meta.get("node_type")?.as_str()?;
        if node_type != "guideline" {
            return None;
        }

        let tags: Vec<String> = meta
            .get("tags")?
            .as_sequence()?
            .iter()
            .filter_map(|v| v.as_str().map(String::from))
            .collect();

        if !tags.iter().any(|t| t.eq_ignore_ascii_case(language)) {
            return None;
        }

        Some(Guideline {
            title: meta.get("title")?.as_str()?.to_string(),
            content: body.to_string(),
            tags,
        })
    }

    fn parse_pattern(content: &str, domain: &str) -> Option<Pattern> {
        let (frontmatter, body) = Self::split_frontmatter(content)?;
        let meta: serde_yaml::Value = serde_yaml::from_str(&frontmatter).ok()?;

        let node_type = meta.get("node_type")?.as_str()?;
        if node_type != "pattern" {
            return None;
        }

        let tags: Vec<String> = meta
            .get("tags")?
            .as_sequence()?
            .iter()
            .filter_map(|v| v.as_str().map(String::from))
            .collect();

        if !tags.iter().any(|t| t.eq_ignore_ascii_case(domain)) {
            return None;
        }

        Some(Pattern {
            title: meta.get("title")?.as_str()?.to_string(),
            content: body.to_string(),
            tags,
        })
    }

    fn split_frontmatter(content: &str) -> Option<(String, String)> {
        let content = content.trim();
        if !content.starts_with("---") {
            return None;
        }

        let after_first = &content[3..];
        let end = after_first.find("---")?;
        let frontmatter = after_first[..end].trim().to_string();
        let body = after_first[end + 3..].trim().to_string();

        Some((frontmatter, body))
    }
}

#[cfg(feature = "kogral")]
impl Default for KogralIntegration {
    fn default() -> Self {
        Self::new().unwrap_or_else(|| Self {
            kogral_path: PathBuf::from(".kogral"),
        })
    }
}

#[cfg(feature = "kogral")]
#[derive(Debug, Clone, serde::Serialize)]
pub struct Guideline {
    pub title: String,
    pub content: String,
    pub tags: Vec<String>,
}

#[cfg(feature = "kogral")]
#[derive(Debug, Clone, serde::Serialize)]
pub struct Pattern {
    pub title: String,
    pub content: String,
    pub tags: Vec<String>,
}

#[cfg(feature = "kogral")]
#[derive(Debug, thiserror::Error)]
pub enum KogralError {
    #[error("Kogral not found")]
    NotFound,
    #[error("IO error: {0}")]
    Io(#[from] std::io::Error),
    #[error("Parse error: {0}")]
    Parse(String),
    #[error("Serialization error: {0}")]
    Serialization(#[from] serde_json::Error),
}
5
crates/stratum-llm/src/kogral/mod.rs
Normal file
@ -0,0 +1,5 @@
#[cfg(feature = "kogral")]
pub mod integration;

#[cfg(feature = "kogral")]
pub use integration::{Guideline, KogralError, KogralIntegration, Pattern};
64
crates/stratum-llm/src/lib.rs
Normal file
@ -0,0 +1,64 @@
//! Unified LLM abstraction with CLI detection, fallback, and caching
//!
//! # Features
//!
//! - **Credential auto-detection**: Finds CLI credentials (Claude, OpenAI) and
//!   API keys
//! - **Provider fallback**: Automatic failover with circuit breaker pattern
//! - **Smart caching**: xxHash-based deduplication reduces duplicate API calls
//! - **Kogral integration**: Inject project context from knowledge base
//! - **Cost tracking**: Transparent cost estimation across providers
//!
//! # Quick Start
//!
//! ```no_run
//! use stratum_llm::{UnifiedClient, Message, Role};
//!
//! #[tokio::main]
//! async fn main() -> Result<(), Box<dyn std::error::Error>> {
//!     let client = UnifiedClient::auto()?;
//!
//!     let messages = vec![
//!         Message {
//!             role: Role::User,
//!             content: "What is Rust?".to_string(),
//!         }
//!     ];
//!
//!     let response = client.generate(&messages, None).await?;
//!     println!("{}", response.content);
//!
//!     Ok(())
//! }
//! ```

pub mod cache;
pub mod chain;
pub mod client;
pub mod credentials;
pub mod error;
pub mod kogral;
pub mod metrics;
pub mod providers;

pub use cache::{CacheConfig, RequestCache};
pub use chain::{CircuitBreakerConfig, FallbackStrategy, ProviderChain};
pub use client::{UnifiedClient, UnifiedClientBuilder};
pub use credentials::{CredentialDetector, DetectedCredential};
pub use error::LlmError;
#[cfg(feature = "kogral")]
pub use kogral::{Guideline, KogralIntegration, Pattern};
#[cfg(feature = "metrics")]
pub use metrics::LlmMetrics;
#[cfg(feature = "anthropic")]
pub use providers::AnthropicProvider;
#[cfg(feature = "deepseek")]
pub use providers::DeepSeekProvider;
#[cfg(feature = "ollama")]
pub use providers::OllamaProvider;
#[cfg(feature = "openai")]
pub use providers::OpenAiProvider;
pub use providers::{
    ConfiguredProvider, CredentialSource, GenerationOptions, GenerationResponse, LlmProvider,
    Message, Role,
};
74
crates/stratum-llm/src/metrics.rs
Normal file
@ -0,0 +1,74 @@
#[cfg(feature = "metrics")]
use prometheus::{Counter, Histogram, IntGauge, Registry};

#[cfg(feature = "metrics")]
pub struct LlmMetrics {
    pub requests_total: Counter,
    pub requests_success: Counter,
    pub requests_failed: Counter,
    pub cache_hits: Counter,
    pub cache_misses: Counter,
    pub circuit_opens: Counter,
    pub fallbacks: Counter,
    pub latency_seconds: Histogram,
    pub cost_cents: Counter,
    pub active_circuits_open: IntGauge,
}

#[cfg(feature = "metrics")]
impl LlmMetrics {
    pub fn new() -> Self {
        Self {
            requests_total: Counter::new("stratum_llm_requests_total", "Total LLM requests")
                .unwrap(),
            requests_success: Counter::new(
                "stratum_llm_requests_success_total",
                "Successful LLM requests",
            )
            .unwrap(),
            requests_failed: Counter::new(
                "stratum_llm_requests_failed_total",
                "Failed LLM requests",
            )
            .unwrap(),
            cache_hits: Counter::new("stratum_llm_cache_hits_total", "Cache hits").unwrap(),
            cache_misses: Counter::new("stratum_llm_cache_misses_total", "Cache misses").unwrap(),
            circuit_opens: Counter::new("stratum_llm_circuit_opens_total", "Circuit breaker opens")
                .unwrap(),
            fallbacks: Counter::new("stratum_llm_fallbacks_total", "Provider fallbacks").unwrap(),
            latency_seconds: Histogram::with_opts(
                prometheus::HistogramOpts::new("stratum_llm_latency_seconds", "Request latency")
                    .buckets(vec![0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0]),
            )
            .unwrap(),
            cost_cents: Counter::new("stratum_llm_cost_cents_total", "Total cost in cents")
                .unwrap(),
            active_circuits_open: IntGauge::new(
                "stratum_llm_circuits_open",
                "Currently open circuit breakers",
            )
            .unwrap(),
        }
    }

    pub fn register(&self, registry: &Registry) -> Result<(), prometheus::Error> {
        registry.register(Box::new(self.requests_total.clone()))?;
        registry.register(Box::new(self.requests_success.clone()))?;
        registry.register(Box::new(self.requests_failed.clone()))?;
        registry.register(Box::new(self.cache_hits.clone()))?;
        registry.register(Box::new(self.cache_misses.clone()))?;
        registry.register(Box::new(self.circuit_opens.clone()))?;
        registry.register(Box::new(self.fallbacks.clone()))?;
        registry.register(Box::new(self.latency_seconds.clone()))?;
        registry.register(Box::new(self.cost_cents.clone()))?;
        registry.register(Box::new(self.active_circuits_open.clone()))?;
        Ok(())
    }
}

#[cfg(feature = "metrics")]
impl Default for LlmMetrics {
    fn default() -> Self {
        Self::new()
    }
}
181
crates/stratum-llm/src/providers/anthropic.rs
Normal file
@ -0,0 +1,181 @@
use async_trait::async_trait;

use crate::error::LlmError;
use crate::providers::{
    GenerationOptions, GenerationResponse, LlmProvider, Message, StreamResponse,
};

#[cfg(feature = "anthropic")]
pub struct AnthropicProvider {
    client: reqwest::Client,
    api_key: String,
    model: String,
}

#[cfg(feature = "anthropic")]
impl AnthropicProvider {
    const BASE_URL: &'static str = "https://api.anthropic.com/v1";
    const API_VERSION: &'static str = "2023-06-01";

    pub fn new(api_key: impl Into<String>, model: impl Into<String>) -> Self {
        Self {
            client: reqwest::Client::new(),
            api_key: api_key.into(),
            model: model.into(),
        }
    }

    pub fn from_env(model: impl Into<String>) -> Result<Self, LlmError> {
        let api_key = std::env::var("ANTHROPIC_API_KEY")
            .map_err(|_| LlmError::MissingCredential("ANTHROPIC_API_KEY".to_string()))?;
        Ok(Self::new(api_key, model))
    }

    pub fn sonnet() -> Result<Self, LlmError> {
        Self::from_env("claude-sonnet-4-5-20250929")
    }

    pub fn opus() -> Result<Self, LlmError> {
        Self::from_env("claude-opus-4-5-20251101")
    }

    pub fn haiku() -> Result<Self, LlmError> {
        Self::from_env("claude-haiku-4-5-20251001")
    }
}

#[cfg(feature = "anthropic")]
#[async_trait]
impl LlmProvider for AnthropicProvider {
    fn name(&self) -> &str {
        "anthropic"
    }

    fn model(&self) -> &str {
        &self.model
    }

    async fn is_available(&self) -> bool {
        true
    }

    async fn generate(
        &self,
        messages: &[Message],
        options: &GenerationOptions,
    ) -> Result<GenerationResponse, LlmError> {
        let start = std::time::Instant::now();

        let (system, user_messages): (Vec<_>, Vec<_>) = messages
            .iter()
            .partition(|m| matches!(m.role, crate::providers::Role::System));

        let system_content = system.first().map(|m| m.content.as_str());

        let mut body = serde_json::json!({
            "model": self.model,
            "messages": user_messages.iter().map(|m| {
                serde_json::json!({
                    "role": match m.role {
                        crate::providers::Role::User => "user",
                        crate::providers::Role::Assistant => "assistant",
                        crate::providers::Role::System => "user",
                    },
                    "content": m.content,
                })
            }).collect::<Vec<_>>(),
            "max_tokens": options.max_tokens.unwrap_or(4096),
        });

        if let Some(sys) = system_content {
            body["system"] = serde_json::json!(sys);
        }
        if let Some(temp) = options.temperature {
            body["temperature"] = serde_json::json!(temp);
        }
        if let Some(top_p) = options.top_p {
            body["top_p"] = serde_json::json!(top_p);
        }
        if !options.stop_sequences.is_empty() {
            body["stop_sequences"] = serde_json::json!(options.stop_sequences);
        }

        let response = self
            .client
            .post(format!("{}/messages", Self::BASE_URL))
            .header("x-api-key", &self.api_key)
            .header("anthropic-version", Self::API_VERSION)
            .header("content-type", "application/json")
            .json(&body)
            .send()
            .await
            .map_err(|e| LlmError::Network(e.to_string()))?;

        if !response.status().is_success() {
            let status = response.status();
            let text = response.text().await.unwrap_or_default();

            if status.as_u16() == 429 {
                return Err(LlmError::RateLimit(text));
            }
            return Err(LlmError::Api(format!("{}: {}", status, text)));
        }

        let json: serde_json::Value = response
            .json()
            .await
            .map_err(|e| LlmError::Parse(e.to_string()))?;

        let content = json["content"][0]["text"]
            .as_str()
            .unwrap_or("")
            .to_string();

        let input_tokens = json["usage"]["input_tokens"].as_u64().unwrap_or(0) as u32;
        let output_tokens = json["usage"]["output_tokens"].as_u64().unwrap_or(0) as u32;

        Ok(GenerationResponse {
            content,
            model: self.model.clone(),
            provider: "anthropic".to_string(),
            input_tokens,
            output_tokens,
            cost_cents: self.estimate_cost(input_tokens, output_tokens),
            latency_ms: start.elapsed().as_millis() as u64,
        })
    }

    async fn stream(
        &self,
        _messages: &[Message],
        _options: &GenerationOptions,
    ) -> Result<StreamResponse, LlmError> {
        Err(LlmError::Unavailable(
            "Streaming not yet implemented".to_string(),
        ))
    }

    fn estimate_cost(&self, input_tokens: u32, output_tokens: u32) -> f64 {
        let input_cost = (input_tokens as f64 / 1_000_000.0) * self.cost_per_1m_input();
        let output_cost = (output_tokens as f64 / 1_000_000.0) * self.cost_per_1m_output();
        (input_cost + output_cost) * 100.0
    }

    fn cost_per_1m_input(&self) -> f64 {
        match self.model.as_str() {
            m if m.contains("opus") => 15.0,
            m if m.contains("sonnet") => 3.0,
            m if m.contains("haiku") => 1.0,
            _ => 3.0,
        }
    }

    fn cost_per_1m_output(&self) -> f64 {
        match self.model.as_str() {
            m if m.contains("opus") => 75.0,
            m if m.contains("sonnet") => 15.0,
            m if m.contains("haiku") => 5.0,
            _ => 15.0,
        }
    }
}
147
crates/stratum-llm/src/providers/deepseek.rs
Normal file
@ -0,0 +1,147 @@
use async_trait::async_trait;

use crate::error::LlmError;
use crate::providers::{
    GenerationOptions, GenerationResponse, LlmProvider, Message, StreamResponse,
};

#[cfg(feature = "deepseek")]
pub struct DeepSeekProvider {
    client: reqwest::Client,
    api_key: String,
    model: String,
}

#[cfg(feature = "deepseek")]
impl DeepSeekProvider {
    const BASE_URL: &'static str = "https://api.deepseek.com/v1";

    pub fn new(api_key: impl Into<String>, model: impl Into<String>) -> Self {
        Self {
            client: reqwest::Client::new(),
            api_key: api_key.into(),
            model: model.into(),
        }
    }

    pub fn from_env(model: impl Into<String>) -> Result<Self, LlmError> {
        let api_key = std::env::var("DEEPSEEK_API_KEY")
            .map_err(|_| LlmError::MissingCredential("DEEPSEEK_API_KEY".to_string()))?;
        Ok(Self::new(api_key, model))
    }

    pub fn coder() -> Result<Self, LlmError> {
        Self::from_env("deepseek-coder")
    }

    pub fn chat() -> Result<Self, LlmError> {
        Self::from_env("deepseek-chat")
    }
}

#[cfg(feature = "deepseek")]
#[async_trait]
impl LlmProvider for DeepSeekProvider {
    fn name(&self) -> &str {
        "deepseek"
    }

    fn model(&self) -> &str {
        &self.model
    }

    async fn is_available(&self) -> bool {
        true
    }

    async fn generate(
        &self,
        messages: &[Message],
        options: &GenerationOptions,
    ) -> Result<GenerationResponse, LlmError> {
        let start = std::time::Instant::now();

        let body = serde_json::json!({
            "model": self.model,
            "messages": messages.iter().map(|m| {
                serde_json::json!({
                    "role": match m.role {
                        crate::providers::Role::System => "system",
                        crate::providers::Role::User => "user",
                        crate::providers::Role::Assistant => "assistant",
                    },
                    "content": m.content,
                })
            }).collect::<Vec<_>>(),
            "max_tokens": options.max_tokens.unwrap_or(4096),
            "temperature": options.temperature.unwrap_or(0.7),
        });

        let response = self
            .client
            .post(format!("{}/chat/completions", Self::BASE_URL))
            .header("Authorization", format!("Bearer {}", self.api_key))
            .header("Content-Type", "application/json")
            .json(&body)
            .send()
            .await
            .map_err(|e| LlmError::Network(e.to_string()))?;

        if !response.status().is_success() {
            let status = response.status();
            let text = response.text().await.unwrap_or_default();

            if status.as_u16() == 429 {
                return Err(LlmError::RateLimit(text));
            }
            return Err(LlmError::Api(format!("{}: {}", status, text)));
        }

        let json: serde_json::Value = response
            .json()
            .await
            .map_err(|e| LlmError::Parse(e.to_string()))?;

        let content = json["choices"][0]["message"]["content"]
            .as_str()
            .unwrap_or("")
            .to_string();

        let input_tokens = json["usage"]["prompt_tokens"].as_u64().unwrap_or(0) as u32;
        let output_tokens = json["usage"]["completion_tokens"].as_u64().unwrap_or(0) as u32;

        Ok(GenerationResponse {
            content,
            model: self.model.clone(),
            provider: "deepseek".to_string(),
            input_tokens,
            output_tokens,
            cost_cents: self.estimate_cost(input_tokens, output_tokens),
            latency_ms: start.elapsed().as_millis() as u64,
        })
    }

    async fn stream(
        &self,
        _messages: &[Message],
        _options: &GenerationOptions,
    ) -> Result<StreamResponse, LlmError> {
        Err(LlmError::Unavailable(
            "Streaming not yet implemented".to_string(),
        ))
    }

    fn estimate_cost(&self, input_tokens: u32, output_tokens: u32) -> f64 {
        let input_cost = (input_tokens as f64 / 1_000_000.0) * self.cost_per_1m_input();
        let output_cost = (output_tokens as f64 / 1_000_000.0) * self.cost_per_1m_output();
        (input_cost + output_cost) * 100.0
    }

    fn cost_per_1m_input(&self) -> f64 {
        0.14
    }

    fn cost_per_1m_output(&self) -> f64 {
        0.28
    }
}
23
crates/stratum-llm/src/providers/mod.rs
Normal file
@ -0,0 +1,23 @@
pub mod traits;

#[cfg(feature = "anthropic")]
pub mod anthropic;
#[cfg(feature = "deepseek")]
pub mod deepseek;
#[cfg(feature = "ollama")]
pub mod ollama;
#[cfg(feature = "openai")]
pub mod openai;

#[cfg(feature = "anthropic")]
pub use anthropic::AnthropicProvider;
#[cfg(feature = "deepseek")]
pub use deepseek::DeepSeekProvider;
#[cfg(feature = "ollama")]
pub use ollama::OllamaProvider;
#[cfg(feature = "openai")]
pub use openai::OpenAiProvider;
pub use traits::{
    ConfiguredProvider, CredentialSource, GenerationOptions, GenerationResponse, LlmProvider,
    Message, Role, StreamChunk, StreamResponse,
};
159
crates/stratum-llm/src/providers/ollama.rs
Normal file
@ -0,0 +1,159 @@
use async_trait::async_trait;

use crate::error::LlmError;
use crate::providers::{
    GenerationOptions, GenerationResponse, LlmProvider, Message, StreamResponse,
};

#[cfg(feature = "ollama")]
pub struct OllamaProvider {
    client: reqwest::Client,
    base_url: String,
    model: String,
}

#[cfg(feature = "ollama")]
impl OllamaProvider {
    pub fn new(base_url: impl Into<String>, model: impl Into<String>) -> Self {
        Self {
            client: reqwest::Client::new(),
            base_url: base_url.into(),
            model: model.into(),
        }
    }

    pub fn from_env(model: impl Into<String>) -> Self {
        let base_url =
            std::env::var("OLLAMA_HOST").unwrap_or_else(|_| "http://localhost:11434".to_string());
        Self::new(base_url, model)
    }

    pub fn llama3() -> Self {
        Self::from_env("llama3")
    }

    pub fn codellama() -> Self {
        Self::from_env("codellama")
    }

    pub fn mistral() -> Self {
        Self::from_env("mistral")
    }
}

#[cfg(feature = "ollama")]
impl Default for OllamaProvider {
    fn default() -> Self {
        Self::llama3()
    }
}

#[cfg(feature = "ollama")]
#[async_trait]
impl LlmProvider for OllamaProvider {
    fn name(&self) -> &str {
        "ollama"
    }

    fn model(&self) -> &str {
        &self.model
    }

    async fn is_available(&self) -> bool {
        self.client
            .get(format!("{}/api/tags", self.base_url))
            .send()
            .await
            .is_ok()
    }

    async fn generate(
        &self,
        messages: &[Message],
        options: &GenerationOptions,
    ) -> Result<GenerationResponse, LlmError> {
        let start = std::time::Instant::now();

        let prompt = messages
            .iter()
            .map(|m| match m.role {
                crate::providers::Role::System => format!("System: {}", m.content),
                crate::providers::Role::User => format!("User: {}", m.content),
                crate::providers::Role::Assistant => format!("Assistant: {}", m.content),
            })
            .collect::<Vec<_>>()
            .join("\n\n");

        let mut body = serde_json::json!({
            "model": self.model,
            "prompt": prompt,
            "stream": false,
        });

        // Ollama expects sampling parameters nested under "options",
        // not at the top level of the request body.
        if let Some(temp) = options.temperature {
            body["options"]["temperature"] = serde_json::json!(temp);
        }
        if let Some(top_p) = options.top_p {
            body["options"]["top_p"] = serde_json::json!(top_p);
        }
        if !options.stop_sequences.is_empty() {
            body["options"]["stop"] = serde_json::json!(options.stop_sequences);
        }

        let response = self
            .client
            .post(format!("{}/api/generate", self.base_url))
            .json(&body)
            .send()
            .await
            .map_err(|e| LlmError::Network(e.to_string()))?;

        if !response.status().is_success() {
            let status = response.status();
            let text = response.text().await.unwrap_or_default();
            return Err(LlmError::Api(format!("{}: {}", status, text)));
        }

        let json: serde_json::Value = response
            .json()
            .await
            .map_err(|e| LlmError::Parse(e.to_string()))?;

        let content = json["response"].as_str().unwrap_or("").to_string();

        // Rough token estimate: ~4 characters per token.
        let input_tokens = prompt.len() as u32 / 4;
        let output_tokens = content.len() as u32 / 4;

        Ok(GenerationResponse {
            content,
            model: self.model.clone(),
            provider: "ollama".to_string(),
            input_tokens,
            output_tokens,
            cost_cents: 0.0,
            latency_ms: start.elapsed().as_millis() as u64,
        })
    }

    async fn stream(
        &self,
        _messages: &[Message],
        _options: &GenerationOptions,
    ) -> Result<StreamResponse, LlmError> {
        Err(LlmError::Unavailable(
            "Streaming not yet implemented".to_string(),
        ))
    }

    fn estimate_cost(&self, _input_tokens: u32, _output_tokens: u32) -> f64 {
        0.0
    }

    fn cost_per_1m_input(&self) -> f64 {
        0.0
    }

    fn cost_per_1m_output(&self) -> f64 {
        0.0
    }
}
161
crates/stratum-llm/src/providers/openai.rs
Normal file
@ -0,0 +1,161 @@
use async_trait::async_trait;

use crate::error::LlmError;
use crate::providers::{
    GenerationOptions, GenerationResponse, LlmProvider, Message, StreamResponse,
};

#[cfg(feature = "openai")]
pub struct OpenAiProvider {
    client: reqwest::Client,
    api_key: String,
    model: String,
}

#[cfg(feature = "openai")]
impl OpenAiProvider {
    const BASE_URL: &'static str = "https://api.openai.com/v1";

    pub fn new(api_key: impl Into<String>, model: impl Into<String>) -> Self {
        Self {
            client: reqwest::Client::new(),
            api_key: api_key.into(),
            model: model.into(),
        }
    }

    pub fn from_env(model: impl Into<String>) -> Result<Self, LlmError> {
        let api_key = std::env::var("OPENAI_API_KEY")
            .map_err(|_| LlmError::MissingCredential("OPENAI_API_KEY".to_string()))?;
        Ok(Self::new(api_key, model))
    }

    pub fn gpt4o() -> Result<Self, LlmError> {
        Self::from_env("gpt-4o")
    }

    pub fn gpt4_turbo() -> Result<Self, LlmError> {
        Self::from_env("gpt-4-turbo")
    }

    pub fn gpt35_turbo() -> Result<Self, LlmError> {
        Self::from_env("gpt-3.5-turbo")
    }
}

#[cfg(feature = "openai")]
#[async_trait]
impl LlmProvider for OpenAiProvider {
    fn name(&self) -> &str {
        "openai"
    }

    fn model(&self) -> &str {
        &self.model
    }

    async fn is_available(&self) -> bool {
        true
    }

    async fn generate(
        &self,
        messages: &[Message],
        options: &GenerationOptions,
    ) -> Result<GenerationResponse, LlmError> {
        let start = std::time::Instant::now();

        let body = serde_json::json!({
            "model": self.model,
            "messages": messages.iter().map(|m| {
                serde_json::json!({
                    "role": match m.role {
                        crate::providers::Role::System => "system",
                        crate::providers::Role::User => "user",
                        crate::providers::Role::Assistant => "assistant",
                    },
                    "content": m.content,
                })
            }).collect::<Vec<_>>(),
            "max_tokens": options.max_tokens.unwrap_or(4096),
            "temperature": options.temperature.unwrap_or(0.7),
        });

        let response = self
            .client
            .post(format!("{}/chat/completions", Self::BASE_URL))
            .header("Authorization", format!("Bearer {}", self.api_key))
            .header("Content-Type", "application/json")
            .json(&body)
            .send()
            .await
            .map_err(|e| LlmError::Network(e.to_string()))?;

        if !response.status().is_success() {
            let status = response.status();
            let text = response.text().await.unwrap_or_default();

            if status.as_u16() == 429 {
                return Err(LlmError::RateLimit(text));
            }
            return Err(LlmError::Api(format!("{}: {}", status, text)));
        }

        let json: serde_json::Value = response
            .json()
            .await
            .map_err(|e| LlmError::Parse(e.to_string()))?;

        let content = json["choices"][0]["message"]["content"]
            .as_str()
            .unwrap_or("")
            .to_string();

        let input_tokens = json["usage"]["prompt_tokens"].as_u64().unwrap_or(0) as u32;
        let output_tokens = json["usage"]["completion_tokens"].as_u64().unwrap_or(0) as u32;

        Ok(GenerationResponse {
            content,
            model: self.model.clone(),
            provider: "openai".to_string(),
            input_tokens,
            output_tokens,
            cost_cents: self.estimate_cost(input_tokens, output_tokens),
            latency_ms: start.elapsed().as_millis() as u64,
        })
    }

    async fn stream(
        &self,
        _messages: &[Message],
        _options: &GenerationOptions,
    ) -> Result<StreamResponse, LlmError> {
        Err(LlmError::Unavailable(
            "Streaming not yet implemented".to_string(),
        ))
    }

    fn estimate_cost(&self, input_tokens: u32, output_tokens: u32) -> f64 {
        let input_cost = (input_tokens as f64 / 1_000_000.0) * self.cost_per_1m_input();
        let output_cost = (output_tokens as f64 / 1_000_000.0) * self.cost_per_1m_output();
        (input_cost + output_cost) * 100.0
    }

    fn cost_per_1m_input(&self) -> f64 {
        match self.model.as_str() {
            "gpt-4o" => 5.0,
            "gpt-4-turbo" => 10.0,
            "gpt-3.5-turbo" => 0.5,
            _ => 5.0,
        }
    }

    fn cost_per_1m_output(&self) -> f64 {
        match self.model.as_str() {
            "gpt-4o" => 15.0,
            "gpt-4-turbo" => 30.0,
            "gpt-3.5-turbo" => 1.5,
            _ => 15.0,
        }
    }
}
95
crates/stratum-llm/src/providers/traits.rs
Normal file
@ -0,0 +1,95 @@
use async_trait::async_trait;
use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Message {
    pub role: Role,
    pub content: String,
}

#[derive(Debug, Clone, Copy, Serialize, Deserialize)]
#[serde(rename_all = "lowercase")]
pub enum Role {
    System,
    User,
    Assistant,
}

#[derive(Debug, Clone, Default)]
pub struct GenerationOptions {
    pub temperature: Option<f32>,
    pub max_tokens: Option<u32>,
    pub top_p: Option<f32>,
    pub stop_sequences: Vec<String>,
}

#[derive(Debug, Clone)]
pub struct GenerationResponse {
    pub content: String,
    pub model: String,
    pub provider: String,
    pub input_tokens: u32,
    pub output_tokens: u32,
    pub cost_cents: f64,
    pub latency_ms: u64,
}

pub type StreamChunk = String;
pub type StreamResponse = std::pin::Pin<
    Box<dyn futures::Stream<Item = Result<StreamChunk, crate::error::LlmError>> + Send>,
>;

#[async_trait]
pub trait LlmProvider: Send + Sync {
    /// Provider name (e.g., "anthropic", "openai", "ollama")
    fn name(&self) -> &str;

    /// Model identifier
    fn model(&self) -> &str;

    /// Check if provider is available and configured
    async fn is_available(&self) -> bool;

    /// Generate a completion
    async fn generate(
        &self,
        messages: &[Message],
        options: &GenerationOptions,
    ) -> Result<GenerationResponse, crate::error::LlmError>;

    /// Stream a completion (future work)
    async fn stream(
        &self,
        messages: &[Message],
        options: &GenerationOptions,
    ) -> Result<StreamResponse, crate::error::LlmError>;

    /// Estimate cost in cents for a request (before sending)
    fn estimate_cost(&self, input_tokens: u32, output_tokens: u32) -> f64;

    /// Cost per 1M input tokens in US dollars (converted to cents by `estimate_cost`)
    fn cost_per_1m_input(&self) -> f64;

    /// Cost per 1M output tokens in US dollars (converted to cents by `estimate_cost`)
    fn cost_per_1m_output(&self) -> f64;
}

/// Credential source for a provider
#[derive(Debug, Clone)]
pub enum CredentialSource {
    /// CLI tool credentials (subscription-based, no per-token cost)
    Cli { path: std::path::PathBuf },
    /// API key from environment variable
    EnvVar { name: String },
    /// API key from config file
    ConfigFile { path: std::path::PathBuf },
    /// No credentials needed (local provider)
    None,
}

/// Provider with credential metadata
pub struct ConfiguredProvider {
    pub provider: Box<dyn LlmProvider>,
    pub credential_source: CredentialSource,
    pub priority: u32,
}
@ -32,6 +32,15 @@ Infrastructure automation and deployment tools.

See [Operations Portfolio Docs](en/ops/) for technical details.

### Architecture

Cross-cutting architectural decisions documented as ADRs.

- [ADR-001: Stratum-Embeddings](en/architecture/adrs/001-stratum-embeddings.md) - Unified embedding library
- [ADR-002: Stratum-LLM](en/architecture/adrs/002-stratum-llm.md) - Unified LLM provider library

See [Architecture Docs](en/architecture/) for all ADRs.

## Quick Start

1. Choose your language: [English](en/) | [Español](es/)
@ -47,3 +56,4 @@ Each language directory contains:
- `stratiumiops-technical-specs.md` - Technical specifications
- `ia/` - AI portfolio documentation
- `ops/` - Operations portfolio documentation
- `architecture/` - Architecture documentation and ADRs
@ -34,6 +34,16 @@ Infrastructure automation and deployment tools.

See [ops/](ops/) directory for full operations portfolio documentation.

### Architecture

Architectural decisions and ecosystem design.

- [**ADRs**](architecture/adrs/) - Architecture Decision Records
  - [ADR-001: Stratum-Embeddings](architecture/adrs/001-stratum-embeddings.md) - Unified embedding library
  - [ADR-002: Stratum-LLM](architecture/adrs/002-stratum-llm.md) - Unified LLM provider library

See [architecture/](architecture/) directory for full architecture documentation.

## Navigation

- [Back to root documentation](../)
30
docs/en/architecture/README.md
Normal file
@ -0,0 +1,30 @@
# Architecture

Architecture documentation for the STRATUMIOPS ecosystem.

## Contents

### ADRs (Architecture Decision Records)

Documented architectural decisions following the ADR format:

- [**ADR-001: Stratum-Embeddings**](adrs/001-stratum-embeddings.md) - Unified embedding library
- [**ADR-002: Stratum-LLM**](adrs/002-stratum-llm.md) - Unified LLM provider library

## ADR Format

Each ADR follows this structure:

| Section         | Description                                |
| --------------- | ------------------------------------------ |
| Status          | Proposed, Accepted, Deprecated, Superseded |
| Context         | Problem and current state                  |
| Decision        | Chosen solution                            |
| Rationale       | Why this solution                          |
| Consequences    | Positive, negative, mitigations            |
| Success Metrics | How to measure the outcome                 |

## Navigation

- [Back to main documentation](../)
- [Spanish version](../../es/architecture/)
279
docs/en/architecture/adrs/001-stratum-embeddings.md
Normal file
@ -0,0 +1,279 @@
# ADR-001: Stratum-Embeddings - Unified Embedding Library

## Status

**Proposed**

## Context

### Current State: Fragmented Implementations

The ecosystem has 3 independent embedding implementations:

| Project      | Location                              | Providers                     | Caching |
| ------------ | ------------------------------------- | ----------------------------- | ------- |
| Kogral       | `kogral-core/src/embeddings/`         | fastembed, rig-core (partial) | No      |
| Provisioning | `provisioning-rag/src/embeddings.rs`  | OpenAI direct                 | No      |
| Vapora       | `vapora-llm-router/src/embeddings.rs` | OpenAI, HuggingFace, Ollama   | No      |

### Identified Problems

#### 1. Duplicated Code

Each project reimplements:

- HTTP client for OpenAI embeddings
- JSON response parsing
- Error handling
- Token estimation

**Impact**: ~400 duplicated lines, inconsistent error handling.

#### 2. No Caching

Embeddings are regenerated on every call:

```text
"What is Rust?" → OpenAI → 1536 dims → $0.00002
"What is Rust?" → OpenAI → 1536 dims → $0.00002 (same result)
"What is Rust?" → OpenAI → 1536 dims → $0.00002 (same result)
```

**Impact**: Unnecessary costs, additional latency, more frequent rate limits.

#### 3. No Fallback

If OpenAI fails, everything fails. There is no fallback to local alternatives (fastembed, Ollama).

**Impact**: Reduced availability, total dependency on one provider.

#### 4. Silent Dimension Mismatch

Different providers produce different dimensions:

| Provider  | Model                  | Dimensions |
| --------- | ---------------------- | ---------- |
| fastembed | bge-small-en           | 384        |
| fastembed | bge-large-en           | 1024       |
| OpenAI    | text-embedding-3-small | 1536       |
| OpenAI    | text-embedding-3-large | 3072       |
| Ollama    | nomic-embed-text       | 768        |

**Impact**: Corrupt vector indices if the provider changes.

#### 5. No Metrics

No visibility into usage, cache hit rate, latency per provider, or accumulated costs.

## Decision

Create `stratum-embeddings` as a unified crate that:

1. **Unifies** implementations from Kogral, Provisioning, and Vapora
2. **Adds caching** to avoid recomputing identical embeddings
3. **Implements fallback** between providers (cloud → local)
4. **Clearly documents** dimensions and limitations per provider
5. **Exposes metrics** for observability
6. **Provides a VectorStore trait** with LanceDB and SurrealDB backends based on project needs

### Storage Backend Decision

Each project chooses its vector storage backend based on priority:

| Project      | Backend   | Priority       | Justification                                      |
| ------------ | --------- | -------------- | -------------------------------------------------- |
| Kogral       | SurrealDB | Graph richness | Knowledge Graph needs unified graph+vector queries |
| Provisioning | LanceDB   | Vector scale   | RAG with millions of document chunks               |
| Vapora       | LanceDB   | Vector scale   | Execution traces, pattern matching at scale        |

#### Why SurrealDB for Kogral

Kogral is a Knowledge Graph where relationships are the primary value. With a hybrid architecture (LanceDB vectors + SurrealDB graph), a typical query would require:

1. LanceDB: vector search → candidate_ids
2. SurrealDB: graph filter on candidates → results
3. App layer: merge, re-rank, deduplication

**Accepted trade-off**: SurrealDB has worse pure vector performance than LanceDB, but Kogral's scale is limited by human curation of knowledge (typically 10K-100K concepts).

#### Why LanceDB for Provisioning and Vapora

| Aspect          | SurrealDB  | LanceDB              |
| --------------- | ---------- | -------------------- |
| Storage format  | Row-based  | Columnar (Lance)     |
| Vector index    | HNSW (RAM) | IVF-PQ (disk-native) |
| Practical scale | Millions   | Billions             |
| Compression     | ~1x        | ~32x (PQ)            |
| Zero-copy read  | No         | Yes                  |

### Architecture

```text
┌─────────────────────────────────────────────────────────────────┐
│                       stratum-embeddings                         │
├─────────────────────────────────────────────────────────────────┤
│  EmbeddingProvider trait                                         │
│  ├─ embed(text) → Vec<f32>                                       │
│  ├─ embed_batch(texts) → Vec<Vec<f32>>                           │
│  ├─ dimensions() → usize                                         │
│  └─ is_local() → bool                                            │
│                                                                  │
│  ┌───────────┐  ┌───────────┐  ┌───────────┐                     │
│  │ FastEmbed │  │  OpenAI   │  │  Ollama   │                     │
│  │  (local)  │  │  (cloud)  │  │  (local)  │                     │
│  └───────────┘  └───────────┘  └───────────┘                     │
│        └────────────┬────────────┘                               │
│                     ▼                                            │
│             EmbeddingCache (memory/disk)                         │
│                     │                                            │
│                     ▼                                            │
│              EmbeddingService                                    │
│                     │                                            │
│                     ▼                                            │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                    VectorStore trait                      │   │
│  │  ├─ upsert(id, embedding, metadata)                       │   │
│  │  ├─ search(embedding, limit, filter) → Vec<Match>         │   │
│  │  └─ delete(id)                                            │   │
│  └──────────────────────────────────────────────────────────┘   │
│            │                          │                          │
│            ▼                          ▼                          │
│  ┌─────────────────┐        ┌─────────────────┐                  │
│  │  SurrealDbStore │        │   LanceDbStore  │                  │
│  │    (Kogral)     │        │  (Prov/Vapora)  │                  │
│  └─────────────────┘        └─────────────────┘                  │
└─────────────────────────────────────────────────────────────────┘
```
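
As a rough Rust sketch, the trait surface drawn above could look like the following (the `EmbeddingError`, `StoreError`, and `Match` types here are illustrative placeholders, not the crate's final API):

```rust
use async_trait::async_trait;

#[derive(Debug)]
pub struct EmbeddingError(pub String);

#[derive(Debug)]
pub struct StoreError(pub String);

/// One search result returned by a vector store backend.
#[derive(Debug)]
pub struct Match {
    pub id: String,
    pub score: f32,
    pub metadata: serde_json::Value,
}

#[async_trait]
pub trait EmbeddingProvider: Send + Sync {
    async fn embed(&self, text: &str) -> Result<Vec<f32>, EmbeddingError>;
    async fn embed_batch(&self, texts: &[String]) -> Result<Vec<Vec<f32>>, EmbeddingError>;
    /// Output dimensionality; must match the index this provider feeds.
    fn dimensions(&self) -> usize;
    /// True for providers that work without network access.
    fn is_local(&self) -> bool;
}

#[async_trait]
pub trait VectorStore: Send + Sync {
    async fn upsert(
        &self,
        id: &str,
        embedding: &[f32],
        metadata: serde_json::Value,
    ) -> Result<(), StoreError>;
    async fn search(
        &self,
        embedding: &[f32],
        limit: usize,
        filter: Option<serde_json::Value>,
    ) -> Result<Vec<Match>, StoreError>;
    async fn delete(&self, id: &str) -> Result<(), StoreError>;
}
```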
## Rationale

### Why Caching is Critical

For a typical RAG system (10,000 chunks):

- **Without cache**: Re-indexing and repeated queries multiply costs
- **With cache**: The first indexing pays; the rest are cache hits

**Estimated savings**: 60-80% in embedding costs.
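
To make the cache concrete: a key derived from (provider, model, text) is enough for identical inputs to become cache hits. This is a minimal sketch; `DefaultHasher` is not stable across Rust releases, so a persistent disk cache would want a fixed algorithm such as SHA-256:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative cache key: identical (provider, model, text) triples map to
/// the same key, so repeated embedding requests are served from the cache.
fn embedding_cache_key(provider: &str, model: &str, text: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    (provider, model, text).hash(&mut hasher);
    hasher.finish()
}
```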
### Why Fallback is Important

| Scenario          | Without Fallback | With Fallback       |
| ----------------- | ---------------- | ------------------- |
| OpenAI rate limit | ERROR            | → fastembed (local) |
| OpenAI downtime   | ERROR            | → Ollama (local)    |
| No internet       | ERROR            | → fastembed (local) |
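
Reusing the provider trait sketched earlier, the fallback behavior in the table can amount to a simple priority loop (a sketch, not the crate's final API):

```rust
/// Try each provider in priority order (e.g., cloud first, local last) and
/// return the first successful embedding.
async fn embed_with_fallback(
    providers: &[Box<dyn EmbeddingProvider>],
    text: &str,
) -> Result<Vec<f32>, EmbeddingError> {
    let mut last_err = EmbeddingError("no providers configured".to_string());
    for provider in providers {
        match provider.embed(text).await {
            Ok(vector) => return Ok(vector),
            // Rate limit, downtime, or no network: fall through to the next
            // provider in the chain.
            Err(e) => last_err = e,
        }
    }
    Err(last_err)
}
```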
### Why Local Providers First

For development: fastembed loads a local model (~100MB), requires no API keys, costs nothing, and works offline.

For production: OpenAI for quality, fastembed as fallback.

## Consequences

### Positive

1. Single source of truth for the entire ecosystem
2. 60-80% fewer embedding API calls (caching)
3. High availability with local providers (fallback)
4. Usage and cost metrics
5. Feature-gated: only compile what you need
6. Storage flexibility: the VectorStore trait allows choosing a backend per project

### Negative

1. **Dimension lock-in**: Changing provider requires re-indexing
2. **Cache invalidation**: Updated content may serve stale embeddings
3. **Model download**: fastembed downloads ~100MB on first use
4. **Storage lock-in per project**: Kogral tied to SurrealDB, others to LanceDB

### Mitigations

| Negative          | Mitigation                                     |
| ----------------- | ---------------------------------------------- |
| Dimension lock-in | Document clearly, warn on provider change      |
| Stale cache       | Configurable TTL, bypass option                |
| Model download    | Show progress, cache in ~/.cache/fastembed     |
| Storage lock-in   | Conscious decision based on project priorities |

## Success Metrics

| Metric                    | Current | Target |
| ------------------------- | ------- | ------ |
| Duplicate implementations | 3       | 1      |
| Cache hit rate            | 0%      | >60%   |
| Fallback availability     | 0%      | 100%   |
| Cost per 10K embeddings   | ~$0.20  | ~$0.05 |

## Provider Selection Guide

### Development

```rust
// Local, free, offline
let service = EmbeddingService::builder()
    .with_provider(FastEmbedProvider::small()?) // 384 dims
    .with_memory_cache()
    .build()?;
```

### Production (Quality)

```rust
// OpenAI with local fallback
let service = EmbeddingService::builder()
    .with_provider(OpenAiEmbeddingProvider::large()?) // 3072 dims
    .with_provider(FastEmbedProvider::large()?)       // Fallback
    .with_memory_cache()
    .build()?;
```

### Production (Cost-Optimized)

```rust
// OpenAI small with fallback
let service = EmbeddingService::builder()
    .with_provider(OpenAiEmbeddingProvider::small()?) // 1536 dims
    .with_provider(OllamaEmbeddingProvider::nomic())  // Fallback
    .with_memory_cache()
    .build()?;
```

## Dimension Compatibility Matrix

| If using...            | Can switch to...            | CANNOT switch to... |
| ---------------------- | --------------------------- | ------------------- |
| fastembed small (384)  | fastembed small, all-minilm | Any other           |
| fastembed large (1024) | fastembed large             | Any other           |
| OpenAI small (1536)    | OpenAI small, ada-002       | Any other           |
| OpenAI large (3072)    | OpenAI large                | Any other           |

**Rule**: Only switch between models with the SAME dimensions.
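
A small guard (again reusing the sketched trait) turns the silent mismatch into an explicit error at startup:

```rust
/// Refuse to pair a provider with an index built at a different dimensionality.
fn check_dimensions(
    index_dims: usize,
    provider: &dyn EmbeddingProvider,
) -> Result<(), EmbeddingError> {
    if provider.dimensions() != index_dims {
        return Err(EmbeddingError(format!(
            "dimension mismatch: index has {} dims, provider produces {}",
            index_dims,
            provider.dimensions()
        )));
    }
    Ok(())
}
```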
## Implementation Priority

| Order | Feature                 | Reason                    |
| ----- | ----------------------- | ------------------------- |
| 1     | EmbeddingProvider trait | Foundation for everything |
| 2     | FastEmbed provider      | Works without API keys    |
| 3     | Memory cache            | Biggest cost impact       |
| 4     | VectorStore trait       | Storage abstraction       |
| 5     | SurrealDbStore          | Kogral needs graph+vector |
| 6     | LanceDbStore            | Provisioning/Vapora scale |
| 7     | OpenAI provider         | Production                |
| 8     | Ollama provider         | Local fallback            |
| 9     | Batch processing        | Efficiency                |
| 10    | Metrics                 | Observability             |

## References

**Existing Implementations**:

- Kogral: `kogral-core/src/embeddings/`
- Vapora: `vapora-llm-router/src/embeddings.rs`
- Provisioning: `provisioning/platform/crates/rag/src/embeddings.rs`

**Target Location**: `stratumiops/crates/stratum-embeddings/`
279
docs/en/architecture/adrs/002-stratum-llm.md
Normal file
@ -0,0 +1,279 @@
# ADR-002: Stratum-LLM - Unified LLM Provider Library

## Status

**Proposed**

## Context

### Current State: Fragmented LLM Connections

The stratumiops ecosystem has 4 projects with AI functionality, each with its own implementation:

| Project      | Implementation             | Providers              | Duplication         |
| ------------ | -------------------------- | ---------------------- | ------------------- |
| Vapora       | `typedialog-ai` (path dep) | Claude, OpenAI, Ollama | Shared base         |
| TypeDialog   | `typedialog-ai` (local)    | Claude, OpenAI, Ollama | Defines abstraction |
| Provisioning | Custom `LlmClient`         | Claude, OpenAI         | 100% duplicated     |
| Kogral       | `rig-core`                 | Embeddings only        | Different stack     |

### Identified Problems

#### 1. Code Duplication

Provisioning reimplements what TypeDialog already has:

- reqwest HTTP client
- Headers: x-api-key, anthropic-version
- JSON body formatting
- Response parsing
- Error handling

**Impact**: ~500 duplicated lines; bugs fixed in one place don't propagate.

#### 2. API Keys Only, No CLI Detection

No project detects credentials from official CLIs:

```text
Claude CLI: ~/.config/claude/credentials.json
OpenAI CLI: ~/.config/openai/credentials.json
```

**Impact**: Users with Claude Pro/Max ($20-100/month) pay for API tokens when they could use their subscription.

#### 3. No Automatic Fallback

When a provider fails (rate limit, timeout), the request fails completely:

```text
Current: Request → Claude API → Rate Limit → ERROR
Desired: Request → Claude API → Rate Limit → OpenAI → Success
```

#### 4. No Circuit Breaker

If the Claude API is down, each request attempts to connect, fails, and propagates the error:

```text
Request 1 → Claude → Timeout (30s) → Error
Request 2 → Claude → Timeout (30s) → Error
Request 3 → Claude → Timeout (30s) → Error
```

**Impact**: Accumulated latency, degraded UX.

#### 5. No Caching

Identical requests always go to the API:

```text
"Explain this Rust error" → Claude → $0.003
"Explain this Rust error" → Claude → $0.003 (same result)
```

**Impact**: Unnecessary costs, especially in development/testing.

#### 6. Kogral Not Integrated

Kogral has guidelines and patterns that could enrich LLM context, but there is no integration.

## Decision

Create `stratum-llm` as a unified crate that:

1. **Consolidates** existing implementations from typedialog-ai and provisioning
2. **Detects** CLI credentials and subscriptions before using API keys
3. **Implements** automatic fallback with circuit breaker
4. **Adds** request caching to reduce costs
5. **Integrates** Kogral for context enrichment
6. **Is used** by all ecosystem projects

### Architecture

```text
┌─────────────────────────────────────────────────────────┐
│                       stratum-llm                       │
├─────────────────────────────────────────────────────────┤
│  CredentialDetector                                     │
│  ├─ Claude CLI → ~/.config/claude/ (subscription)       │
│  ├─ OpenAI CLI → ~/.config/openai/                      │
│  ├─ Env vars → *_API_KEY                                │
│  └─ Ollama → localhost:11434 (free)                     │
│                     │                                   │
│                     ▼                                   │
│  ProviderChain (ordered by priority)                    │
│  [CLI/Sub] → [API] → [DeepSeek] → [Ollama]              │
│       │        │         │            │                 │
│       └────────┴─────────┴────────────┘                 │
│                     │                                   │
│         CircuitBreaker per provider                     │
│                     │                                   │
│               RequestCache                              │
│                     │                                   │
│             KogralIntegration                           │
│                     │                                   │
│               UnifiedClient                             │
│                                                         │
└─────────────────────────────────────────────────────────┘
```
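
As an illustration of the detection order in the diagram, a probe for Claude credentials could look like this sketch, reusing the `CredentialSource` enum from `providers/traits.rs`. The file path follows the ADR's example; real CLI credential locations and formats may vary by version, and `ANTHROPIC_API_KEY` is an assumed variable name:

```rust
use std::path::PathBuf;

use stratum_llm::providers::CredentialSource;

/// Probe order: CLI credentials (subscription, no per-token cost) first,
/// then an API key from the environment; `None` lets callers fall further
/// down the chain (e.g., to Ollama).
fn detect_claude_credentials() -> Option<CredentialSource> {
    if let Ok(home) = std::env::var("HOME") {
        let path = PathBuf::from(home).join(".config/claude/credentials.json");
        if path.exists() {
            return Some(CredentialSource::Cli { path });
        }
    }
    if std::env::var("ANTHROPIC_API_KEY").is_ok() {
        return Some(CredentialSource::EnvVar {
            name: "ANTHROPIC_API_KEY".to_string(),
        });
    }
    None
}
```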
## Rationale

### Why Not Use Another External Crate

| Alternative    | Why Not                                    |
| -------------- | ------------------------------------------ |
| kaccy-ai       | Oriented toward blockchain/fraud detection |
| llm (crate)    | Very basic, no circuit breaker or caching  |
| langchain-rust | Python port, not idiomatic Rust            |
| rig-core       | Embeddings/RAG only, no chat completion    |

**Best option**: Build on typedialog-ai and add the missing features.

### Why CLI Detection is Important

Cost analysis for a typical user:

| Scenario                  | Monthly Cost        |
| ------------------------- | ------------------- |
| API only (current)        | ~$840               |
| Claude Pro + API overflow | ~$20 + ~$200 = $220 |
| Claude Max + API overflow | ~$100 + ~$50 = $150 |

**Potential savings**: 70-80% by detecting and using subscriptions first.
### Why Circuit Breaker

Without a circuit breaker, a downed provider causes:

- N requests × 30s timeout = N×30s total latency
- All resources occupied waiting for timeouts

With a circuit breaker:

- The first failure opens the circuit
- Subsequent requests fail immediately (fast fail)
- Fallback to another provider without waiting
- The circuit resets after a cooldown (see the sketch below)
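
A minimal sketch of that open/half-open cycle (real implementations usually add a failure threshold; this only illustrates the fast-fail and cooldown behavior):

```rust
use std::time::{Duration, Instant};

struct CircuitBreaker {
    opened_at: Option<Instant>,
    cooldown: Duration,
}

impl CircuitBreaker {
    fn new(cooldown: Duration) -> Self {
        Self { opened_at: None, cooldown }
    }

    /// Returns true if a request may be attempted.
    fn allow(&mut self) -> bool {
        match self.opened_at {
            None => true,
            Some(opened) if opened.elapsed() >= self.cooldown => {
                self.opened_at = None; // half-open: allow one probe request
                true
            }
            Some(_) => false, // open: fail fast so the chain tries elsewhere
        }
    }

    fn record_failure(&mut self) {
        self.opened_at = Some(Instant::now());
    }

    fn record_success(&mut self) {
        self.opened_at = None;
    }
}
```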
### Why Caching

For typical development:

- The same questions repeat while iterating
- Tests execute the same prompts multiple times

Estimated cache hit rate: 15-30% in active development.
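
A sketch of the request cache on top of `moka` (already a workspace dependency); `call_provider` is a stand-in for the real provider chain, and the builder options shown are illustrative:

```rust
use std::time::Duration;

use moka::future::Cache;

async fn call_provider(_model: &str, _prompt: &str) -> String {
    "...".to_string() // placeholder for the real ProviderChain call
}

/// Identical (model, prompt) pairs are served from the cache; the TTL bounds
/// staleness, matching the mitigations table below.
async fn cached_generate(cache: &Cache<String, String>, model: &str, prompt: &str) -> String {
    let key = format!("{model}:{prompt}");
    if let Some(hit) = cache.get(&key).await {
        return hit; // cache hit: no API call, no cost
    }
    let response = call_provider(model, prompt).await;
    cache.insert(key, response.clone()).await;
    response
}

#[tokio::main]
async fn main() {
    let cache: Cache<String, String> = Cache::builder()
        .max_capacity(10_000)
        .time_to_live(Duration::from_secs(3600))
        .build();
    let first = cached_generate(&cache, "gpt-4o", "Explain this Rust error").await;
    let second = cached_generate(&cache, "gpt-4o", "Explain this Rust error").await;
    assert_eq!(first, second); // the second call was a cache hit
}
```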
### Why Kogral Integration

Kogral has language guidelines, domain patterns, and ADRs. Without integration the LLM generates generic code; with integration it generates code that follows project conventions.

## Consequences

### Positive

1. Single source of truth for LLM logic
2. CLI detection reduces costs 70-80%
3. Circuit breaker + fallback = high availability
4. 15-30% fewer requests in development (caching)
5. Kogral improves generation quality
6. Feature-gated: each feature is optional

### Negative

1. **Migration effort**: Refactor Vapora, TypeDialog, Provisioning
2. **New dependency**: Projects depend on stratumiops
3. **CLI auth complexity**: Different credential formats per version
4. **Cache invalidation**: Stale responses if not managed well

### Mitigations

| Negative            | Mitigation                                  |
| ------------------- | ------------------------------------------- |
| Migration effort    | Re-export compatible API from typedialog-ai |
| New dependency      | Local path dependency, not crates.io        |
| CLI auth complexity | Version detection, fallback to API if fails |
| Cache invalidation  | Configurable TTL, bypass option             |

## Success Metrics

| Metric                   | Current | Target          |
| ------------------------ | ------- | --------------- |
| Duplicated lines of code | ~500    | 0               |
| CLI credential detection | 0%      | 100%            |
| Fallback success rate    | 0%      | >90%            |
| Cache hit rate           | 0%      | 15-30%          |
| Latency (provider down)  | 30s+    | <1s (fast fail) |

## Cost Impact Analysis

Based on real usage data ($840/month):

| Scenario                   | Savings            |
| -------------------------- | ------------------ |
| CLI detection (Claude Max) | ~$700/month        |
| Caching (15% hit rate)     | ~$50/month         |
| DeepSeek fallback for code | ~$100/month        |
| **Total potential**        | **$500-700/month** |

## Migration Strategy

### Migration Phases

1. Create stratum-llm with an API compatible with typedialog-ai
2. typedialog-ai re-exports stratum-llm (backward compatible; see the sketch below)
3. Vapora migrates to stratum-llm directly
4. Provisioning migrates its LlmClient to stratum-llm
5. Deprecate typedialog-ai, consolidate in stratum-llm
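
Phase 2 could be as small as a re-export facade in typedialog-ai (illustrative; the exact module paths are an assumption):

```rust
// typedialog-ai/src/lib.rs: backward-compatible facade over stratum-llm.
// Downstream code keeps compiling while the implementation moves.
pub use stratum_llm::providers::{
    GenerationOptions, GenerationResponse, LlmProvider, Message, Role,
};
```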
### Feature Adoption

| Feature         | Adoption                                |
| --------------- | --------------------------------------- |
| Basic providers | Immediate (direct replacement)          |
| CLI detection   | Optional, feature flag                  |
| Circuit breaker | Default on                              |
| Caching         | Default on, configurable TTL            |
| Kogral          | Feature flag, requires Kogral installed |

## Alternatives Considered

### Alternative 1: Improve typedialog-ai In-Place

**Pros**: No new crate required

**Cons**: TypeDialog is a specific project, not shared infrastructure

**Decision**: stratum-llm in stratumiops is the better location for cross-project infrastructure.

### Alternative 2: Use LiteLLM (Python) as a Proxy

**Pros**: Very complete, 100+ providers

**Cons**: Python dependency, proxy latency, not Rust-native

**Decision**: Keep a pure Rust stack.

### Alternative 3: Each Project Maintains Its Own Implementation

**Pros**: Independence

**Cons**: Duplication, inconsistency, bug fixes not shared

**Decision**: Consolidation is better long-term.

## References

**Existing Implementations**:

- TypeDialog: `typedialog/crates/typedialog-ai/`
- Vapora: `vapora/crates/vapora-llm-router/`
- Provisioning: `provisioning/platform/crates/rag/`

**Kogral**: `kogral/`

**Target Location**: `stratumiops/crates/stratum-llm/`
22
docs/en/architecture/adrs/README.md
Normal file
@ -0,0 +1,22 @@
# ADRs - Architecture Decision Records

Architecture decision records for the STRATUMIOPS ecosystem.

## Active ADRs

| ID                               | Title                                         | Status   |
| -------------------------------- | --------------------------------------------- | -------- |
| [001](001-stratum-embeddings.md) | Stratum-Embeddings: Unified Embedding Library | Proposed |
| [002](002-stratum-llm.md)        | Stratum-LLM: Unified LLM Provider Library     | Proposed |

## Statuses

- **Proposed**: Under review, pending implementation
- **Accepted**: Approved and implemented
- **Deprecated**: Replaced by another ADR
- **Superseded**: Obsolete, see replacement ADR

## Navigation

- [Back to architecture](../)
- [Spanish version](../../../es/architecture/adrs/)
@ -34,6 +34,16 @@ Herramientas de automatización de infraestructura y despliegue.

Ver directorio [ops/](ops/) para documentación completa del portfolio de operaciones.

### Arquitectura

Decisiones arquitecturales y diseño del ecosistema.

- [**ADRs**](architecture/adrs/) - Architecture Decision Records
  - [ADR-001: Stratum-Embeddings](architecture/adrs/001-stratum-embeddings.md) - Biblioteca unificada de embeddings
  - [ADR-002: Stratum-LLM](architecture/adrs/002-stratum-llm.md) - Biblioteca unificada de providers LLM

Ver directorio [architecture/](architecture/) para documentación completa de arquitectura.

## Navegación

- [Volver a documentación raíz](../)
30
docs/es/architecture/README.md
Normal file
@ -0,0 +1,30 @@
# Arquitectura

Documentación de arquitectura del ecosistema STRATUMIOPS.

## Contenido

### ADRs (Architecture Decision Records)

Decisiones arquitecturales documentadas siguiendo el formato ADR:

- [**ADR-001: Stratum-Embeddings**](adrs/001-stratum-embeddings.md) - Biblioteca unificada de embeddings
- [**ADR-002: Stratum-LLM**](adrs/002-stratum-llm.md) - Biblioteca unificada de providers LLM

## Formato ADR

Cada ADR sigue la estructura:

| Sección           | Descripción                                |
| ----------------- | ------------------------------------------ |
| Estado            | Propuesto, Aceptado, Deprecado, Superseded |
| Contexto          | Problema y estado actual                   |
| Decisión          | Solución elegida                           |
| Justificación     | Por qué esta solución                      |
| Consecuencias     | Positivas, negativas, mitigaciones         |
| Métricas de Éxito | Cómo medir el resultado                    |

## Navegación

- [Volver a documentación principal](../)
- [English version](../../en/architecture/)
280
docs/es/architecture/adrs/001-stratum-embeddings.md
Normal file
@ -0,0 +1,280 @@
# ADR-001: Stratum-Embeddings - Biblioteca Unificada de Embeddings

## Estado

**Propuesto**

## Contexto

### Estado Actual: Implementaciones Fragmentadas

El ecosistema tiene 3 implementaciones independientes de embeddings:

| Proyecto     | Ubicación                             | Providers                     | Caching |
| ------------ | ------------------------------------- | ----------------------------- | ------- |
| Kogral       | `kogral-core/src/embeddings/`         | fastembed, rig-core (parcial) | No      |
| Provisioning | `provisioning-rag/src/embeddings.rs`  | OpenAI directo                | No      |
| Vapora       | `vapora-llm-router/src/embeddings.rs` | OpenAI, HuggingFace, Ollama   | No      |

### Problemas Identificados

#### 1. Código Duplicado

Cada proyecto reimplementa:

- HTTP client para OpenAI embeddings
- Parsing de respuestas JSON
- Manejo de errores
- Token estimation

**Impacto**: ~400 líneas duplicadas, inconsistencias en el manejo de errores.

#### 2. Sin Caching

Los embeddings se regeneran en cada llamada:

```text
"What is Rust?" → OpenAI → 1536 dims → $0.00002
"What is Rust?" → OpenAI → 1536 dims → $0.00002 (mismo resultado)
"What is Rust?" → OpenAI → 1536 dims → $0.00002 (mismo resultado)
```

**Impacto**: Costos innecesarios, latencia adicional, rate limits más frecuentes.

#### 3. No Hay Fallback

Si OpenAI falla, todo falla. No hay fallback a alternativas locales (fastembed, Ollama).

**Impacto**: Disponibilidad reducida, dependencia total de un provider.

#### 4. Dimension Mismatch Silencioso

Diferentes providers producen diferentes dimensiones:

| Provider  | Modelo                 | Dimensiones |
| --------- | ---------------------- | ----------- |
| fastembed | bge-small-en           | 384         |
| fastembed | bge-large-en           | 1024        |
| OpenAI    | text-embedding-3-small | 1536        |
| OpenAI    | text-embedding-3-large | 3072        |
| Ollama    | nomic-embed-text       | 768         |

**Impacto**: Índices vectoriales corruptos si se cambia de provider.

#### 5. Sin Métricas

No hay visibilidad de uso, hit rate de cache, latencia por provider, ni costos acumulados.

## Decisión

Crear `stratum-embeddings` como crate unificado que:

1. **Unifique** las implementaciones de Kogral, Provisioning y Vapora
2. **Añada caching** para evitar re-computar embeddings idénticos
3. **Implemente fallback** entre providers (cloud → local)
4. **Documente claramente** las dimensiones y limitaciones por provider
5. **Exponga métricas** para observabilidad
6. **Provea un VectorStore trait** con backends LanceDB y SurrealDB según la necesidad del proyecto

### Decisión de Backend de Storage

Cada proyecto elige su backend de vector storage según su prioridad:

| Proyecto     | Backend   | Prioridad         | Justificación                                             |
| ------------ | --------- | ----------------- | --------------------------------------------------------- |
| Kogral       | SurrealDB | Riqueza del grafo | Knowledge Graph necesita queries unificados graph+vector  |
| Provisioning | LanceDB   | Escala vectorial  | RAG con millones de chunks documentales                   |
| Vapora       | LanceDB   | Escala vectorial  | Traces de ejecución, pattern matching a escala            |

#### Por qué SurrealDB para Kogral

Kogral es un Knowledge Graph donde las relaciones son el valor principal. Con una arquitectura híbrida (vectores en LanceDB + grafo en SurrealDB), un query típico requeriría:

1. LanceDB: búsqueda vectorial → candidate_ids
2. SurrealDB: filtro de grafo sobre candidates → results
3. App layer: merge, re-rank, deduplicación

**Trade-off aceptado**: SurrealDB tiene peor rendimiento vectorial puro que LanceDB, pero la escala de Kogral está limitada por la curación humana del conocimiento (típicamente 10K-100K conceptos).

#### Por qué LanceDB para Provisioning y Vapora

| Aspecto         | SurrealDB  | LanceDB              |
| --------------- | ---------- | -------------------- |
| Storage format  | Row-based  | Columnar (Lance)     |
| Vector index    | HNSW (RAM) | IVF-PQ (disk-native) |
| Escala práctica | Millones   | Miles de millones    |
| Compresión      | ~1x        | ~32x (PQ)            |
| Zero-copy read  | No         | Sí                   |

### Arquitectura

```text
┌─────────────────────────────────────────────────────────────────┐
│                       stratum-embeddings                         │
├─────────────────────────────────────────────────────────────────┤
│  EmbeddingProvider trait                                         │
│  ├─ embed(text) → Vec<f32>                                       │
│  ├─ embed_batch(texts) → Vec<Vec<f32>>                           │
│  ├─ dimensions() → usize                                         │
│  └─ is_local() → bool                                            │
│                                                                  │
│  ┌───────────┐  ┌───────────┐  ┌───────────┐                     │
│  │ FastEmbed │  │  OpenAI   │  │  Ollama   │                     │
│  │  (local)  │  │  (cloud)  │  │  (local)  │                     │
│  └───────────┘  └───────────┘  └───────────┘                     │
│        └────────────┬────────────┘                               │
│                     ▼                                            │
│             EmbeddingCache (memory/disk)                         │
│                     │                                            │
│                     ▼                                            │
│              EmbeddingService                                    │
│                     │                                            │
│                     ▼                                            │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                    VectorStore trait                      │   │
│  │  ├─ upsert(id, embedding, metadata)                       │   │
│  │  ├─ search(embedding, limit, filter) → Vec<Match>         │   │
│  │  └─ delete(id)                                            │   │
│  └──────────────────────────────────────────────────────────┘   │
│            │                          │                          │
│            ▼                          ▼                          │
│  ┌─────────────────┐        ┌─────────────────┐                  │
│  │  SurrealDbStore │        │   LanceDbStore  │                  │
│  │    (Kogral)     │        │  (Prov/Vapora)  │                  │
│  └─────────────────┘        └─────────────────┘                  │
└─────────────────────────────────────────────────────────────────┘
```
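
Como boceto ilustrativo en Rust de los traits del diagrama anterior (los tipos `EmbeddingError`, `StoreError` y `Match` son placeholders, no la API final del crate):

```rust
use async_trait::async_trait;

#[derive(Debug)]
pub struct EmbeddingError(pub String);

#[derive(Debug)]
pub struct StoreError(pub String);

/// Un resultado de búsqueda devuelto por el backend de vector storage.
#[derive(Debug)]
pub struct Match {
    pub id: String,
    pub score: f32,
    pub metadata: serde_json::Value,
}

#[async_trait]
pub trait EmbeddingProvider: Send + Sync {
    async fn embed(&self, text: &str) -> Result<Vec<f32>, EmbeddingError>;
    async fn embed_batch(&self, texts: &[String]) -> Result<Vec<Vec<f32>>, EmbeddingError>;
    /// Dimensionalidad de salida; debe coincidir con el índice que alimenta.
    fn dimensions(&self) -> usize;
    /// True para providers que funcionan sin acceso a red.
    fn is_local(&self) -> bool;
}

#[async_trait]
pub trait VectorStore: Send + Sync {
    async fn upsert(
        &self,
        id: &str,
        embedding: &[f32],
        metadata: serde_json::Value,
    ) -> Result<(), StoreError>;
    async fn search(
        &self,
        embedding: &[f32],
        limit: usize,
        filter: Option<serde_json::Value>,
    ) -> Result<Vec<Match>, StoreError>;
    async fn delete(&self, id: &str) -> Result<(), StoreError>;
}
```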
## Justificación

### Por Qué Caching es Crítico

Para un sistema RAG típico (10,000 chunks):

- **Sin cache**: Re-indexaciones y queries repetidas multiplican costos
- **Con cache**: La primera indexación paga; el resto son cache hits

**Ahorro estimado**: 60-80% en costos de embeddings.

### Por Qué Fallback es Importante

| Escenario         | Sin Fallback | Con Fallback        |
| ----------------- | ------------ | ------------------- |
| OpenAI rate limit | ERROR        | → fastembed (local) |
| OpenAI downtime   | ERROR        | → Ollama (local)    |
| Sin internet      | ERROR        | → fastembed (local) |

### Por Qué Providers Locales Primero

Para desarrollo: fastembed carga un modelo local (~100MB), no requiere API keys, no tiene costos y funciona offline.

Para producción: OpenAI para calidad, fastembed como fallback.

## Consecuencias

### Positivas

1. Single source of truth para todo el ecosistema
2. 60-80% menos llamadas a APIs de embeddings (caching)
3. Alta disponibilidad con providers locales (fallback)
4. Métricas de uso y costos
5. Feature-gated: solo compila lo necesario
6. Flexibilidad de storage: el VectorStore trait permite elegir backend por proyecto

### Negativas

1. **Dimension lock-in**: Cambiar de provider requiere re-indexar
2. **Cache invalidation**: Contenido actualizado puede servir embeddings stale
3. **Model download**: fastembed descarga ~100MB en el primer uso
4. **Storage lock-in por proyecto**: Kogral atado a SurrealDB, otros a LanceDB

### Mitigaciones

| Negativo          | Mitigación                                              |
| ----------------- | ------------------------------------------------------- |
| Dimension lock-in | Documentar claramente, warn en cambio de provider       |
| Cache stale       | TTL configurable, opción de bypass                      |
| Model download    | Mostrar progreso, cache en ~/.cache/fastembed           |
| Storage lock-in   | Decisión consciente basada en prioridades del proyecto  |

## Métricas de Éxito

| Métrica                     | Actual | Objetivo |
| --------------------------- | ------ | -------- |
| Implementaciones duplicadas | 3      | 1        |
| Cache hit rate              | 0%     | >60%     |
| Fallback availability       | 0%     | 100%     |
| Cost per 10K embeddings     | ~$0.20 | ~$0.05   |

## Guía de Selección de Provider

### Desarrollo

```rust
// Local, gratis, offline
let service = EmbeddingService::builder()
    .with_provider(FastEmbedProvider::small()?) // 384 dims
    .with_memory_cache()
    .build()?;
```

### Producción (Calidad)

```rust
// OpenAI con fallback local
let service = EmbeddingService::builder()
    .with_provider(OpenAiEmbeddingProvider::large()?) // 3072 dims
    .with_provider(FastEmbedProvider::large()?)       // Fallback
    .with_memory_cache()
    .build()?;
```

### Producción (Costo-Optimizado)

```rust
// OpenAI small con fallback
let service = EmbeddingService::builder()
    .with_provider(OpenAiEmbeddingProvider::small()?) // 1536 dims
    .with_provider(OllamaEmbeddingProvider::nomic())  // Fallback
    .with_memory_cache()
    .build()?;
```

## Matriz de Compatibilidad de Dimensiones

| Si usas...             | Puedes cambiar a...         | NO puedes cambiar a... |
| ---------------------- | --------------------------- | ---------------------- |
| fastembed small (384)  | fastembed small, all-minilm | Cualquier otro         |
| fastembed large (1024) | fastembed large             | Cualquier otro         |
| OpenAI small (1536)    | OpenAI small, ada-002       | Cualquier otro         |
| OpenAI large (3072)    | OpenAI large                | Cualquier otro         |

**Regla**: Solo puedes cambiar entre modelos con las MISMAS dimensiones.
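
Un guard ilustrativo (reutilizando el trait bocetado antes) convierte el mismatch silencioso en un error explícito:

```rust
/// Rechaza emparejar un provider con un índice construido a otra dimensionalidad.
fn check_dimensions(
    index_dims: usize,
    provider: &dyn EmbeddingProvider,
) -> Result<(), EmbeddingError> {
    if provider.dimensions() != index_dims {
        return Err(EmbeddingError(format!(
            "dimension mismatch: el índice tiene {} dims, el provider produce {}",
            index_dims,
            provider.dimensions()
        )));
    }
    Ok(())
}
```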
## Prioridad de Implementación

| Orden | Feature                 | Razón                        |
| ----- | ----------------------- | ---------------------------- |
| 1     | EmbeddingProvider trait | Base para todo               |
| 2     | FastEmbed provider      | Funciona sin API keys        |
| 3     | Memory cache            | Mayor impacto en costos      |
| 4     | VectorStore trait       | Abstracción de storage       |
| 5     | SurrealDbStore          | Kogral necesita graph+vector |
| 6     | LanceDbStore            | Provisioning/Vapora escala   |
| 7     | OpenAI provider         | Producción                   |
| 8     | Ollama provider         | Fallback local               |
| 9     | Batch processing        | Eficiencia                   |
| 10    | Metrics                 | Observabilidad               |

## Referencias

**Implementaciones Existentes**:

- Kogral: `kogral-core/src/embeddings/`
- Vapora: `vapora-llm-router/src/embeddings.rs`
- Provisioning: `provisioning/platform/crates/rag/src/embeddings.rs`

**Ubicación Objetivo**: `stratumiops/crates/stratum-embeddings/`
279
docs/es/architecture/adrs/002-stratum-llm.md
Normal file
@ -0,0 +1,279 @@
# ADR-002: Stratum-LLM - Biblioteca Unificada de Providers LLM

## Estado

**Propuesto**

## Contexto

### Estado Actual: Conexiones LLM Fragmentadas

El ecosistema stratumiops tiene 4 proyectos con funcionalidad IA, cada uno con su propia implementación:

| Proyecto     | Implementación             | Providers              | Duplicación           |
| ------------ | -------------------------- | ---------------------- | --------------------- |
| Vapora       | `typedialog-ai` (path dep) | Claude, OpenAI, Ollama | Base compartida       |
| TypeDialog   | `typedialog-ai` (local)    | Claude, OpenAI, Ollama | Define la abstracción |
| Provisioning | Custom `LlmClient`         | Claude, OpenAI         | 100% duplicado        |
| Kogral       | `rig-core`                 | Solo embeddings        | Diferente stack       |

### Problemas Identificados

#### 1. Duplicación de Código

Provisioning reimplementa lo que TypeDialog ya tiene:

- reqwest HTTP client
- Headers: x-api-key, anthropic-version
- JSON body formatting
- Response parsing
- Error handling

**Impacto**: ~500 líneas duplicadas; los bugs arreglados en un lugar no se propagan.

#### 2. Solo API Keys, Sin CLI Detection

Ningún proyecto detecta credenciales de los CLIs oficiales:

```text
Claude CLI: ~/.config/claude/credentials.json
OpenAI CLI: ~/.config/openai/credentials.json
```

**Impacto**: Usuarios con Claude Pro/Max ($20-100/mes) pagan API tokens cuando podrían usar su suscripción.

#### 3. Sin Fallback Automático

Cuando un provider falla (rate limit, timeout), la request falla completamente:

```text
Actual:  Request → Claude API → Rate Limit → ERROR
Deseado: Request → Claude API → Rate Limit → OpenAI → Success
```

#### 4. Sin Circuit Breaker

Si la API de Claude está caída, cada request intenta conectar, falla y propaga el error:

```text
Request 1 → Claude → Timeout (30s) → Error
Request 2 → Claude → Timeout (30s) → Error
Request 3 → Claude → Timeout (30s) → Error
```

**Impacto**: Latencia acumulada, UX degradada.

#### 5. Sin Caching

Las requests idénticas van siempre a la API:

```text
"Explain this Rust error" → Claude → $0.003
"Explain this Rust error" → Claude → $0.003 (mismo resultado)
```

**Impacto**: Costos innecesarios, especialmente en desarrollo/testing.

#### 6. Kogral No Integrado

Kogral tiene guidelines y patterns que podrían enriquecer el contexto del LLM, pero no hay integración.

## Decisión

Crear `stratum-llm` como crate unificado que:

1. **Consolide** las implementaciones existentes de typedialog-ai y provisioning
2. **Detecte** credenciales CLI y suscripciones antes de usar API keys
3. **Implemente** fallback automático con circuit breaker
4. **Añada** caching de requests para reducir costos
5. **Integre** Kogral para enriquecer el contexto
6. **Sea usado** por todos los proyectos del ecosistema

### Arquitectura

```text
┌─────────────────────────────────────────────────────────┐
│                       stratum-llm                       │
├─────────────────────────────────────────────────────────┤
│  CredentialDetector                                     │
│  ├─ Claude CLI → ~/.config/claude/ (subscription)       │
│  ├─ OpenAI CLI → ~/.config/openai/                      │
│  ├─ Env vars → *_API_KEY                                │
│  └─ Ollama → localhost:11434 (free)                     │
│                     │                                   │
│                     ▼                                   │
│  ProviderChain (ordered by priority)                    │
│  [CLI/Sub] → [API] → [DeepSeek] → [Ollama]              │
│       │        │         │            │                 │
│       └────────┴─────────┴────────────┘                 │
│                     │                                   │
│         CircuitBreaker per provider                     │
│                     │                                   │
│               RequestCache                              │
│                     │                                   │
│             KogralIntegration                           │
│                     │                                   │
│               UnifiedClient                             │
│                                                         │
└─────────────────────────────────────────────────────────┘
```

## Justificación

### Por Qué No Usar Otra Crate Externa

| Alternativa    | Por Qué No                                 |
| -------------- | ------------------------------------------ |
| kaccy-ai       | Orientada a blockchain/fraud detection     |
| llm (crate)    | Muy básica, sin circuit breaker ni caching |
| langchain-rust | Port de Python, no idiomático Rust         |
| rig-core       | Solo embeddings/RAG, no chat completion    |

**Mejor opción**: Construir sobre typedialog-ai y añadir las features faltantes.

### Por Qué CLI Detection es Importante

Análisis de costos para un usuario típico:

| Escenario                 | Costo Mensual       |
| ------------------------- | ------------------- |
| Solo API (actual)         | ~$840               |
| Claude Pro + API overflow | ~$20 + ~$200 = $220 |
| Claude Max + API overflow | ~$100 + ~$50 = $150 |

**Ahorro potencial**: 70-80% detectando y usando suscripciones primero.

### Por Qué Circuit Breaker

Sin circuit breaker, un provider caído causa:

- N requests × 30s timeout = N×30s de latencia total
- Todos los recursos ocupados esperando timeouts

Con circuit breaker:

- La primera falla abre el circuito
- Las siguientes requests fallan inmediatamente (fast fail)
- Fallback a otro provider sin esperar
- El circuito se resetea después del cooldown (ver el boceto más abajo)
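
Boceto mínimo de ese ciclo open/half-open (las implementaciones reales suelen añadir un umbral de fallos; esto solo ilustra el comportamiento de fast-fail y cooldown):

```rust
use std::time::{Duration, Instant};

struct CircuitBreaker {
    opened_at: Option<Instant>,
    cooldown: Duration,
}

impl CircuitBreaker {
    fn new(cooldown: Duration) -> Self {
        Self { opened_at: None, cooldown }
    }

    /// Devuelve true si se puede intentar la request.
    fn allow(&mut self) -> bool {
        match self.opened_at {
            None => true,
            Some(opened) if opened.elapsed() >= self.cooldown => {
                self.opened_at = None; // half-open: permite una request de prueba
                true
            }
            Some(_) => false, // abierto: fast fail, la cadena prueba otro provider
        }
    }

    fn record_failure(&mut self) {
        self.opened_at = Some(Instant::now());
    }

    fn record_success(&mut self) {
        self.opened_at = None;
    }
}
```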
### Por Qué Caching
|
||||
|
||||
Para desarrollo típico:
|
||||
|
||||
- Mismas preguntas repetidas mientras se itera
|
||||
- Testing ejecuta mismos prompts múltiples veces
|
||||
|
||||
Cache hit rate estimado: 15-30% en desarrollo activo.
|
||||
|
||||
### Why Kogral Integration

Kogral has per-language guidelines, per-domain patterns, and ADRs.
Without integration the LLM generates generic code;
with integration it generates code that follows project conventions.
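A sketch of what the enrichment could look like; `KogralStore` and `guidelines_for` are hypothetical names standing in for the real Kogral lookup:

```rust
/// Hypothetical handle to a local Kogral knowledge base; the real
/// integration would query Kogral's store instead of this stub.
struct KogralStore;

impl KogralStore {
    async fn guidelines_for(&self, language: &str) -> Option<String> {
        // Stub: a real implementation would look up kogral/ content.
        match language {
            "rust" => Some("Use thiserror for library errors; avoid unwrap().".into()),
            _ => None,
        }
    }
}

/// Prepend project conventions to the user prompt before sending it to
/// the LLM, so generated code follows Kogral guidelines.
async fn enrich_prompt(kogral: &KogralStore, language: &str, user_prompt: &str) -> String {
    match kogral.guidelines_for(language).await {
        Some(g) => format!("Follow these project conventions:\n{g}\n\n---\n{user_prompt}"),
        None => user_prompt.to_string(),
    }
}
```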
## Consequences

### Positive

1. Single source of truth for LLM logic
2. CLI detection cuts costs 70-80%
3. Circuit breaker + fallback = high availability
4. 15-30% fewer requests during development (caching)
5. Kogral improves generation quality
6. Feature-gated: every feature is optional
### Negative

1. **Migration effort**: Refactor Vapora, TypeDialog, Provisioning
2. **New dependency**: Projects come to depend on stratumiops
3. **CLI auth complexity**: Credential formats differ across CLI versions
4. **Cache invalidation**: Stale responses if not managed carefully
### Mitigations

| Negative            | Mitigation                                      |
| ------------------- | ----------------------------------------------- |
| Migration effort    | Compatible API re-exported from typedialog-ai   |
| New dependency      | Local path dependency, not crates.io            |
| CLI auth complexity | Version detection, fall back to API on failure  |
| Cache invalidation  | Configurable TTL, bypass option                 |
## Success Metrics

| Metric                   | Current | Target          |
| ------------------------ | ------- | --------------- |
| Duplicated lines of code | ~500    | 0               |
| CLI credential detection | 0%      | 100%            |
| Fallback success rate    | 0%      | >90%            |
| Cache hit rate           | 0%      | 15-30%          |
| Latency (provider down)  | 30s+    | <1s (fast fail) |
## Cost Impact Analysis

Based on real usage data ($840/month):

| Scenario                   | Savings            |
| -------------------------- | ------------------ |
| CLI detection (Claude Max) | ~$700/month        |
| Caching (15% hit rate)     | ~$50/month         |
| DeepSeek fallback for code | ~$100/month        |
| **Total potential**        | **$500-700/month** |
## Migration Strategy

### Migration Phases

1. Create stratum-llm with an API compatible with typedialog-ai
2. typedialog-ai re-exports stratum-llm (backward compatible; see the sketch after this list)
3. Vapora migrates to stratum-llm directly
4. Provisioning migrates its LlmClient to stratum-llm
5. Deprecate typedialog-ai, consolidate into stratum-llm
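A sketch of the phase-2 re-export, reusing the hypothetical `UnifiedClient` name from earlier; the `AiClient` alias is likewise illustrative:

```rust
// typedialog-ai/src/lib.rs, phase 2 (sketch): the crate becomes a thin
// facade over stratum-llm so existing callers keep compiling unchanged.
// `UnifiedClient` and `AiClient` are hypothetical names.

// Re-export the whole public API under the old crate name.
pub use stratum_llm::*;

// Keep renamed items reachable at their old paths while nudging callers.
#[deprecated(note = "use stratum_llm::UnifiedClient instead")]
pub type AiClient = stratum_llm::UnifiedClient;
```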
### Feature Adoption

| Feature         | Adoption                                |
| --------------- | --------------------------------------- |
| Basic providers | Immediate (drop-in replacement)         |
| CLI detection   | Optional, feature flag                  |
| Circuit breaker | Default on                              |
| Caching         | Default on, configurable TTL            |
| Kogral          | Feature flag, requires Kogral installed |
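A sketch of how these optional features could be gated in code; the feature name `cli-detect` and the builder shape are assumptions:

```rust
/// Sketch of feature gating in the crate root; the "cli-detect"
/// feature name and this builder are illustrative, not final.
pub struct ClientBuilder {
    use_cache: bool, // caching defaults to on, per the table above
}

impl ClientBuilder {
    pub fn new() -> Self {
        Self { use_cache: true }
    }

    /// Cache bypass for callers that need fresh responses
    /// (the "cache invalidation" mitigation).
    pub fn without_cache(mut self) -> Self {
        self.use_cache = false;
        self
    }

    /// Compiled in only when the optional "cli-detect" feature is on;
    /// without it, the chain starts directly at API keys.
    #[cfg(feature = "cli-detect")]
    pub fn detect_credentials(self) -> Self {
        // ...probe subscription CLIs, then env vars, then Ollama...
        self
    }
}
```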
## Alternatives Considered

### Alternative 1: Improve typedialog-ai In-Place

**Pros**: No new crate required

**Cons**: TypeDialog is a specific project, not shared infrastructure

**Decision**: stratum-llm in stratumiops is the better home for cross-project infrastructure.

### Alternative 2: Use LiteLLM (Python) as a Proxy

**Pros**: Very complete, 100+ providers

**Cons**: Python dependency, proxy latency, not Rust-native

**Decision**: Keep the stack pure Rust.

### Alternative 3: Each Project Keeps Its Own Implementation

**Pros**: Independence

**Cons**: Duplication, inconsistency, unshared bug fixes

**Decision**: Consolidation is better long-term.
## References

**Existing Implementations**:

- TypeDialog: `typedialog/crates/typedialog-ai/`
- Vapora: `vapora/crates/vapora-llm-router/`
- Provisioning: `provisioning/platform/crates/rag/`

**Kogral**: `kogral/`

**Target Location**: `stratumiops/crates/stratum-llm/`
22
docs/es/architecture/adrs/README.md
Normal file
@ -0,0 +1,22 @@
# ADRs - Architecture Decision Records

Record of architectural decisions for the STRATUMIOPS ecosystem.

## Active ADRs

| ID                               | Title                                         | Status   |
| -------------------------------- | --------------------------------------------- | -------- |
| [001](001-stratum-embeddings.md) | Stratum-Embeddings: Unified Embedding Library | Proposed |
| [002](002-stratum-llm.md)        | Stratum-LLM: Unified LLM Provider Library     | Proposed |

## Statuses

- **Proposed**: Under review, pending implementation
- **Accepted**: Approved and implemented
- **Deprecated**: Replaced by another ADR
- **Superseded**: Obsolete, see the replacement ADR

## Navigation

- [Back to architecture](../)
- [English version](../../../en/architecture/adrs/)