# ADR-002: FastEmbed + AI Providers for Embeddings

- **Status:** Accepted
- **Date:** 2026-01-17
- **Deciders:** Architecture Team
- **Context:** Embedding Strategy for Semantic Search

## Context
KOGRAL requires embedding generation for its semantic search capabilities. Embeddings convert text into numerical vectors that capture semantic meaning, enabling "find concepts" rather than just "find keywords".
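For illustration only (not KOGRAL code), a minimal sketch of the comparison this enables: two texts are considered related when their embedding vectors point in a similar direction, typically scored with cosine similarity.

```rust
/// Illustrative sketch: semantic search ranks stored notes against a query by
/// comparing embedding vectors, so conceptually related texts score high even
/// when they share no keywords.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

fn main() {
    // Toy 3-dimensional vectors standing in for real 384/1536-dimensional embeddings.
    let query = [0.9_f32, 0.1, 0.0];
    let related = [0.8_f32, 0.2, 0.1];
    let unrelated = [0.0_f32, 0.1, 0.9];
    assert!(cosine_similarity(&query, &related) > cosine_similarity(&query, &unrelated));
}
```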
**Requirements:**

- **Local-First Option**: Must work offline without external API dependencies
- **Production Scalability**: Support cloud AI providers for large-scale deployments
- **Multiple Providers**: Flexibility to choose based on cost, quality, and privacy
- **Cost-Effective Development**: Free local embeddings for development and testing
- **Quality**: Embeddings good enough to find related concepts
## Options Evaluated

### Option 1: Local Embeddings Only (fastembed)
**Pros:**

- No API costs
- Works offline
- Privacy-preserving (no data leaves the machine)
- Fast (local GPU acceleration possible)

**Cons:**

- Limited model quality compared to cloud providers
- Resource-intensive (requires downloading ~100 MB models)
- Single-provider lock-in (fastembed library)

**Example:**
```rust
use fastembed::{TextEmbedding, InitOptions};

let model = TextEmbedding::try_new(InitOptions {
    model_name: "BAAI/bge-small-en-v1.5".into(),
    ..Default::default()
})?;
let embeddings = model.embed(vec!["Hello world"], None)?;
// Output: Vec<Vec<f32>> with 384 dimensions
```
### Option 2: Cloud AI Providers Only (OpenAI, Claude, etc.)

**Pros:**

- State-of-the-art embedding quality
- No local resource usage
- Latest models available
- Scalable to millions of documents

**Cons:**

- Requires API keys (cost per embedding)
- Network dependency (no offline mode)
- Privacy concerns (data sent to third parties)
- Vendor lock-in risk

**Example:**
```rust
use rig::providers::openai;

let client = openai::Client::new("sk-...");
let embeddings = client
    .embeddings("text-embedding-3-small")
    .embed_documents(vec!["Hello world"])
    .await?;
// Output: Vec<Vec<f32>> with 1536 dimensions
```
### Option 3: Hybrid Strategy (fastembed + AI providers via rig-core)

**Pros:**

- ✅ Best of both worlds: local dev, cloud production
- ✅ User choice: privacy-first or quality-first
- ✅ Cost flexibility: free for small projects, paid for scale
- ✅ Unified interface via the `rig-core` library
- ✅ Easy provider switching (config-driven)

**Cons:**

- ❌ More complex implementation (multiple providers)
- ❌ Dimension mismatch between providers (384 vs 1536)
- ❌ Additional dependencies (`rig-core`, `fastembed`)
## Decision

We will use a hybrid strategy: `fastembed` (local) + AI providers (via `rig-core`).

**Implementation:**

- **Default**: `fastembed` with `BAAI/bge-small-en-v1.5` (384 dimensions)
- **Optional**: OpenAI, Claude, Ollama via `rig-core` (configurable)
- **Interface**: `EmbeddingProvider` trait abstracts provider details
- **Config-Driven**: Provider selection via Nickel configuration

**Architecture:**
```rust
#[async_trait]
pub trait EmbeddingProvider: Send + Sync {
    async fn embed(&self, texts: Vec<String>) -> Result<Vec<Vec<f32>>>;
    fn dimensions(&self) -> usize;
    fn model_name(&self) -> &str;
}

// Local implementation
pub struct FastEmbedProvider {
    model: TextEmbedding,
}

impl FastEmbedProvider {
    pub fn new(model_name: &str) -> Result<Self> {
        let model = TextEmbedding::try_new(InitOptions {
            model_name: model_name.into(),
            ..Default::default()
        })?;
        Ok(Self { model })
    }
}

#[async_trait]
impl EmbeddingProvider for FastEmbedProvider {
    async fn embed(&self, texts: Vec<String>) -> Result<Vec<Vec<f32>>> {
        Ok(self.model.embed(texts, None)?)
    }

    fn dimensions(&self) -> usize {
        384
    }

    fn model_name(&self) -> &str {
        "BAAI/bge-small-en-v1.5"
    }
}

// Cloud provider implementation (via rig-core)
pub struct RigEmbeddingProvider {
    client: rig::Client,
    model: String,
    dimensions: usize,
}

#[async_trait]
impl EmbeddingProvider for RigEmbeddingProvider {
    async fn embed(&self, texts: Vec<String>) -> Result<Vec<Vec<f32>>> {
        let embeddings = self
            .client
            .embeddings(&self.model)
            .embed_documents(texts)
            .await?;
        Ok(embeddings)
    }

    fn dimensions(&self) -> usize {
        self.dimensions
    }

    fn model_name(&self) -> &str {
        &self.model
    }
}
```
**Configuration (Nickel):**

```nickel
# Local development (default)
{
  embeddings = {
    enabled = true,
    provider = 'fastembed,
    model = "BAAI/bge-small-en-v1.5",
    dimensions = 384,
  },
}
```

```nickel
# Production with OpenAI
{
  embeddings = {
    enabled = true,
    provider = 'openai,
    model = "text-embedding-3-small",
    dimensions = 1536,
    api_key_env = "OPENAI_API_KEY",
  },
}
```

```nickel
# Self-hosted with Ollama
{
  embeddings = {
    enabled = true,
    provider = 'ollama,
    model = "nomic-embed-text",
    dimensions = 768,
  },
}
```
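For concreteness, a hedged sketch of the Rust config types these Nickel blocks might map to; the serde attributes and exact field handling are assumptions, not the final schema:

```rust
use serde::Deserialize;

/// Sketch of the config the Nickel examples above could deserialize into.
#[derive(Debug, Clone, Deserialize)]
pub struct EmbeddingConfig {
    pub enabled: bool,
    pub provider: EmbeddingProviderType,
    pub model: String,
    pub dimensions: usize,
    /// Name of the environment variable holding the API key (cloud providers only).
    #[serde(default)]
    pub api_key_env: String,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Deserialize)]
#[serde(rename_all = "lowercase")]
pub enum EmbeddingProviderType {
    FastEmbed,
    OpenAI,
    Claude,
    Ollama,
}
```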
**Provider Selection** (`kb-core/src/embeddings/mod.rs`):
```rust
pub fn create_provider(config: &EmbeddingConfig) -> Result<Box<dyn EmbeddingProvider>> {
    match config.provider {
        EmbeddingProviderType::FastEmbed => {
            Ok(Box::new(FastEmbedProvider::new(&config.model)?))
        }
        EmbeddingProviderType::OpenAI => {
            let api_key = std::env::var(&config.api_key_env)?;
            Ok(Box::new(RigEmbeddingProvider::new_openai(api_key, &config.model)?))
        }
        EmbeddingProviderType::Claude => {
            let api_key = std::env::var(&config.api_key_env)?;
            Ok(Box::new(RigEmbeddingProvider::new_claude(api_key, &config.model)?))
        }
        EmbeddingProviderType::Ollama => {
            Ok(Box::new(RigEmbeddingProvider::new_ollama(&config.model)?))
        }
    }
}
```
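To illustrate the unified interface, a minimal sketch of how calling code might consume this; the `embed_notes` helper and the `anyhow::Result` error type are assumptions:

```rust
use anyhow::Result;

/// Hypothetical caller: it only sees the EmbeddingProvider trait, so whether
/// vectors come from fastembed, OpenAI, Claude, or Ollama is purely a config choice.
pub async fn embed_notes(config: &EmbeddingConfig, notes: Vec<String>) -> Result<Vec<Vec<f32>>> {
    let provider = create_provider(config)?;
    let vectors = provider.embed(notes).await?;
    // Sanity check: every vector matches the provider's declared dimensionality.
    debug_assert!(vectors.iter().all(|v| v.len() == provider.dimensions()));
    Ok(vectors)
}
```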
## Consequences

### Positive

**✅ Development Flexibility:**

- Developers can use `fastembed` without API keys
- Fast feedback loop (local embeddings, no network calls)
- Works offline (train trips, flights)
**✅ Production Quality:**
- Production deployments can use OpenAI/Claude for better quality
- Latest embedding models available
- Scalable to millions of documents
**✅ Privacy Control:**
- Privacy-sensitive projects use local embeddings
- Public projects can use cloud providers
- User choice via configuration
**✅ Cost Optimization:**
- Small projects: free (fastembed)
- Large projects: pay for quality (cloud providers)
- Hybrid: important docs via cloud, bulk via local
**✅ Unified Interface:**

- `EmbeddingProvider` trait abstracts provider details
- Query code doesn't know or care about the provider
- Easy to add new providers
### Negative
**❌ Dimension Mismatch:**
- fastembed: 384 dimensions
- OpenAI: 1536 dimensions
- Cannot mix in same index
**Mitigation:**
- Store provider + dimensions in node metadata
- Rebuild the index when changing providers (see the sketch below)
- Document dimension constraints
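A minimal sketch of the guard this mitigation implies, assuming the index records the embedding model name and dimensions as metadata; the `IndexMetadata` type and function name are hypothetical:

```rust
/// Hypothetical metadata recorded when the index is built.
pub struct IndexMetadata {
    pub model_name: String,
    pub dimensions: usize,
}

/// Refuse to mix embeddings from incompatible providers: a changed provider
/// means rebuilding the index rather than silently storing mismatched vectors.
pub fn check_index_compatibility(
    meta: &IndexMetadata,
    provider: &dyn EmbeddingProvider,
) -> Result<(), String> {
    if meta.dimensions != provider.dimensions() || meta.model_name != provider.model_name() {
        return Err(format!(
            "index was built with {} ({} dims), but the configured provider is {} ({} dims); rebuild the index",
            meta.model_name,
            meta.dimensions,
            provider.model_name(),
            provider.dimensions(),
        ));
    }
    Ok(())
}
```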
**❌ Model Download:**

- First use of fastembed downloads a ~100 MB model
- Slow initial startup
**Mitigation:**
- Pre-download the model in Docker images (see the warm-up sketch below)
- Document model download in setup guide
- Cache models in `~/.cache/fastembed`
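One way to realize the pre-download mitigation, sketched under the assumption that constructing the provider triggers the one-time model download; the `warm_up_embeddings` helper is hypothetical:

```rust
/// Hypothetical warm-up step: run at Docker build time or service startup so
/// the ~100 MB model download happens once, not on a user's first query.
pub async fn warm_up_embeddings(config: &EmbeddingConfig) -> anyhow::Result<()> {
    let provider = create_provider(config)?;
    // A throwaway embedding confirms the model files are downloaded and loadable.
    let _ = provider.embed(vec!["warm-up".to_string()]).await?;
    Ok(())
}
```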
**❌ Complex Configuration:**
- Multiple provider options may confuse users
**Mitigation:**
- Sane default (fastembed)
- Clear examples for each provider
- Validation errors explain misconfigurations
### Neutral

**⚪ Dependency Trade-off:**

- `fastembed` adds ~5 MB to the binary
- `rig-core` adds ~2 MB
- Total: ~7 MB overhead

Not a concern for the CLI/MCP server use case.
## Provider Comparison
| Provider | Dimensions | Quality | Cost | Privacy | Offline |
|---|---|---|---|---|---|
| fastembed | 384 | Good | Free | ✅ Local | ✅ Yes |
| OpenAI | 1536 | Excellent | $0.0001/1K | ❌ Cloud | ❌ No |
| Claude | 1024 | Excellent | $0.00025/1K | ❌ Cloud | ❌ No |
| Ollama | 768 | Very Good | Free | ✅ Local | ✅ Yes |
**Recommendation by Use Case:**
- Development: fastembed (fast, free, offline)
- Small Teams: fastembed or Ollama (privacy, no costs)
- Enterprise: OpenAI or Claude (best quality, scalable)
- Self-Hosted: Ollama (good quality, local control)
## Implementation Timeline

- ✅ Define `EmbeddingProvider` trait
- ✅ Implement `FastEmbedProvider` (stub, feature-gated)
- ✅ Implement `RigEmbeddingProvider` (stub, feature-gated)
- ⏳ Complete FastEmbed integration with model download
- ⏳ Complete rig-core integration (OpenAI, Claude, Ollama)
- ⏳ Add query engine with similarity search
- ⏳ Document provider selection and trade-offs
## Monitoring

**Success Criteria:**
- Users can switch providers via config change
- Local embeddings work without API keys
- Production deployments use cloud providers successfully
- Query quality acceptable for both local and cloud embeddings
**Metrics:**
- Embedding generation latency (local vs cloud)
- Query accuracy (precision@10 for semantic search)
- API costs (cloud providers)
- User satisfaction (feedback on search quality)
## References
- fastembed Documentation
- rig-core Documentation
- OpenAI Embeddings API
- BAAI/bge Models
- Ollama Embeddings
## Revision History
| Date | Author | Change |
|---|---|---|
| 2026-01-17 | Architecture Team | Initial decision |
Previous ADR: ADR-001: Nickel vs TOML

Next ADR: ADR-003: Hybrid Storage Strategy