
ADR-002: FastEmbed via AI Providers for Embeddings

Status: Accepted

Date: 2026-01-17

Deciders: Architecture Team

Context: Embedding Strategy for Semantic Search


Context

KOGRAL requires embedding generation for its semantic search capabilities. Embeddings convert text into numerical vectors that capture semantic meaning, enabling "find concepts" rather than just "find keywords".
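
As an illustration of the difference, a minimal sketch of how a semantic query matches stored embeddings via cosine similarity (the helper and values are illustrative only, not part of KOGRAL's API):

// Cosine similarity: close to 1.0 means the vectors point the same way,
// close to 0.0 means they are unrelated. "car" and "automobile" share no
// keywords, but their embeddings are close, so one can retrieve the other.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}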

Requirements:

  1. Local-First Option: Must work offline without external API dependencies
  2. Production Scalability: Support cloud AI providers for large-scale deployments
  3. Multiple Providers: Flexibility to choose based on cost, quality, privacy
  4. Cost-Effective Development: Free local embeddings for development and testing
  5. Quality: Good enough embeddings for finding related concepts

Options Evaluated:

Option 1: Only Local Embeddings (fastembed)

Pros:

  • No API costs
  • Works offline
  • Privacy-preserving (no data leaves machine)
  • Fast (local GPU acceleration possible)

Cons:

  • Limited model quality compared to cloud providers
  • Resource-intensive (requires downloading ~100 MB of model files)
  • Single provider lock-in (fastembed library)

Example:

use fastembed::{EmbeddingModel, InitOptions, TextEmbedding};

// fastembed selects models via its EmbeddingModel enum (BGESmallENV15
// corresponds to BAAI/bge-small-en-v1.5); the exact InitOptions shape
// varies between fastembed versions.
let model = TextEmbedding::try_new(InitOptions {
    model_name: EmbeddingModel::BGESmallENV15,
    ..Default::default()
})?;

let embeddings = model.embed(vec!["Hello world"], None)?;
// Output: Vec<Vec<f32>> with 384 dimensions

Option 2: Only Cloud AI Providers (OpenAI, Claude, etc.)

Pros:

  • State-of-the-art embedding quality
  • No local resource usage
  • Latest models available
  • Scalable to millions of documents

Cons:

  • Requires API keys (cost per embedding)
  • Network dependency (no offline mode)
  • Privacy concerns (data sent to third parties)
  • Vendor lock-in risk

Example:

use rig::providers::openai;

// Shown schematically: the exact embedding call shape depends on the
// rig-core version in use.
let client = openai::Client::new("sk-...");
let embeddings = client.embeddings("text-embedding-3-small")
    .embed_documents(vec!["Hello world"]).await?;
// Output: Vec<Vec<f32>> with 1536 dimensions

Option 3: Hybrid Strategy (fastembed + AI providers via rig-core)

Pros:

  • Best of both worlds: local dev, cloud production
  • User choice: privacy-first or quality-first
  • Cost flexibility: free for small projects, paid for scale
  • Unified interface via rig-core library
  • Easy provider switching (config-driven)

Cons:

  • More complex implementation (multiple providers)
  • Dimension mismatch between providers (384 vs 1536)
  • Additional dependencies (rig-core, fastembed)

Decision

We will use a hybrid strategy: fastembed (local) + AI providers (via rig-core).

Implementation:

  1. Default: fastembed with BAAI/bge-small-en-v1.5 (384 dimensions)
  2. Optional: OpenAI, Claude, Ollama via rig-core (configurable)
  3. Interface: EmbeddingProvider trait abstracts provider details
  4. Config-Driven: Provider selection via Nickel configuration

Architecture:

use async_trait::async_trait;
use fastembed::{InitOptions, TextEmbedding};

#[async_trait]
pub trait EmbeddingProvider: Send + Sync {
    async fn embed(&self, texts: Vec<String>) -> Result<Vec<Vec<f32>>>;
    fn dimensions(&self) -> usize;
    fn model_name(&self) -> &str;
}

// Local implementation
pub struct FastEmbedProvider {
    model: TextEmbedding,
}

impl FastEmbedProvider {
    pub fn new(model_name: &str) -> Result<Self> {
        // Note: fastembed identifies models via its EmbeddingModel enum, so
        // the configured string must be mapped to a variant; the conversion
        // is shown schematically here.
        let model = TextEmbedding::try_new(InitOptions {
            model_name: model_name.into(),
            ..Default::default()
        })?;
        Ok(Self { model })
    }
}

#[async_trait]
impl EmbeddingProvider for FastEmbedProvider {
    async fn embed(&self, texts: Vec<String>) -> Result<Vec<Vec<f32>>> {
        Ok(self.model.embed(texts, None)?)
    }

    fn dimensions(&self) -> usize { 384 }
    fn model_name(&self) -> &str { "BAAI/bge-small-en-v1.5" }
}

// Cloud provider implementation (via rig-core)
// Note: rig-core exposes one client type per provider; `rig::Client` and the
// embedding call below are shown schematically.
pub struct RigEmbeddingProvider {
    client: rig::Client,
    model: String,
    dimensions: usize,
}

#[async_trait]
impl EmbeddingProvider for RigEmbeddingProvider {
    async fn embed(&self, texts: Vec<String>) -> Result<Vec<Vec<f32>>> {
        let embeddings = self.client
            .embeddings(&self.model)
            .embed_documents(texts)
            .await?;
        Ok(embeddings)
    }

    fn dimensions(&self) -> usize { self.dimensions }
    fn model_name(&self) -> &str { &self.model }
}
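
Calling code depends only on the trait, so switching providers never touches query logic. A minimal usage sketch (index_node and its return handling are hypothetical, not existing KOGRAL functions):

// Works identically for FastEmbedProvider and RigEmbeddingProvider.
async fn index_node(provider: &dyn EmbeddingProvider, text: String) -> Result<Vec<f32>> {
    let mut vectors = provider.embed(vec![text]).await?;
    // One input text yields one vector with provider.dimensions() entries.
    Ok(vectors.remove(0))
}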

Configuration (Nickel):

# Local development (default)
{
  embeddings = {
    enabled = true,
    provider = 'fastembed,
    model = "BAAI/bge-small-en-v1.5",
    dimensions = 384,
  },
}

# Production with OpenAI
{
  embeddings = {
    enabled = true,
    provider = 'openai,
    model = "text-embedding-3-small",
    dimensions = 1536,
    api_key_env = "OPENAI_API_KEY",
  },
}

# Self-hosted with Ollama
{
  embeddings = {
    enabled = true,
    provider = 'ollama,
    model = "nomic-embed-text",
    dimensions = 768,
  },
}

Provider Selection (kogral-core/src/embeddings/mod.rs):

pub fn create_provider(config: &EmbeddingConfig) -> Result<Box<dyn EmbeddingProvider>> {
    match config.provider {
        EmbeddingProviderType::FastEmbed => {
            Ok(Box::new(FastEmbedProvider::new(&config.model)?))
        }
        EmbeddingProviderType::OpenAI => {
            let api_key = std::env::var(&config.api_key_env)?;
            Ok(Box::new(RigEmbeddingProvider::new_openai(api_key, &config.model)?))
        }
        EmbeddingProviderType::Claude => {
            let api_key = std::env::var(&config.api_key_env)?;
            Ok(Box::new(RigEmbeddingProvider::new_claude(api_key, &config.model)?))
        }
        EmbeddingProviderType::Ollama => {
            Ok(Box::new(RigEmbeddingProvider::new_ollama(&config.model)?))
        }
    }
}
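
The EmbeddingConfig and EmbeddingProviderType referenced above are not defined in this ADR; a plausible shape, assuming the evaluated Nickel config deserializes into these Rust types (field names mirror the config keys, everything else is an assumption):

pub enum EmbeddingProviderType {
    FastEmbed,
    OpenAI,
    Claude,
    Ollama,
}

pub struct EmbeddingConfig {
    pub enabled: bool,
    pub provider: EmbeddingProviderType,
    pub model: String,
    pub dimensions: usize,
    // Name of the environment variable holding the API key (cloud providers only).
    pub api_key_env: String,
}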

Consequences

Positive

Development Flexibility:

  • Developers can use fastembed without API keys
  • Fast feedback loop (local embeddings, no network calls)
  • Works offline (train trips, flights)

Production Quality:

  • Production deployments can use OpenAI/Claude for better quality
  • Latest embedding models available
  • Scalable to millions of documents

Privacy Control:

  • Privacy-sensitive projects use local embeddings
  • Public projects can use cloud providers
  • User choice via configuration

Cost Optimization:

  • Small projects: free (fastembed)
  • Large projects: pay for quality (cloud providers)
  • Hybrid: important docs via cloud, bulk via local

Unified Interface:

  • EmbeddingProvider trait abstracts provider details
  • Query code doesn't know/care about provider
  • Easy to add new providers

Negative

Dimension Mismatch:

  • fastembed: 384 dimensions
  • OpenAI: 1536 dimensions
  • Cannot mix in same index

Mitigation:

  • Store provider + dimensions in node metadata (see the sketch after this list)
  • Rebuild index when changing providers
  • Document dimension constraints
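
A sketch of the guard this mitigation implies, assuming the index records the model and dimension count it was built with (IndexMetadata and the error type are hypothetical):

pub struct IndexMetadata {
    pub model_name: String,
    pub dimensions: usize,
}

// Refuse to mix vectors from an incompatible provider into an existing index.
fn check_compatibility(meta: &IndexMetadata, provider: &dyn EmbeddingProvider) -> Result<(), String> {
    if meta.dimensions != provider.dimensions() || meta.model_name != provider.model_name() {
        return Err(format!(
            "index built with {} ({} dims), configured provider is {} ({} dims); rebuild the index",
            meta.model_name, meta.dimensions,
            provider.model_name(), provider.dimensions()
        ));
    }
    Ok(())
}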

Model Download:

  • First use of fastembed downloads a ~100 MB model
  • Slow initial startup

Mitigation:

  • Pre-download in Docker images
  • Document model download in setup guide
  • Cache models in ~/.cache/fastembed

Complex Configuration:

  • Multiple provider options may confuse users

Mitigation:

  • Sane default (fastembed)
  • Clear examples for each provider
  • Validation errors explain misconfigurations

Neutral

Dependency Trade-off:

  • fastembed adds ~5MB to binary
  • rig-core adds ~2MB
  • Total: ~7MB overhead

Not a concern for CLI/MCP server use case.


Provider Comparison

| Provider  | Dimensions | Quality   | Cost                 | Privacy | Offline |
|-----------|------------|-----------|----------------------|---------|---------|
| fastembed | 384        | Good      | Free                 | Local   | Yes     |
| OpenAI    | 1536       | Excellent | $0.0001 / 1K tokens  | Cloud   | No      |
| Claude    | 1024       | Excellent | $0.00025 / 1K tokens | Cloud   | No      |
| Ollama    | 768        | Very Good | Free                 | Local   | Yes     |

Recommendation by Use Case:

  • Development: fastembed (fast, free, offline)
  • Small Teams: fastembed or Ollama (privacy, no costs)
  • Enterprise: OpenAI or Claude (best quality, scalable)
  • Self-Hosted: Ollama (good quality, local control)

Implementation Timeline

  1. Define EmbeddingProvider trait
  2. Implement FastEmbedProvider (stub, feature-gated)
  3. Implement RigEmbeddingProvider (stub, feature-gated)
  4. Complete FastEmbed integration with model download
  5. Complete rig-core integration (OpenAI, Claude, Ollama)
  6. Add query engine with similarity search
  7. Document provider selection and trade-offs

Monitoring

Success Criteria:

  • Users can switch providers via config change
  • Local embeddings work without API keys
  • Production deployments use cloud providers successfully
  • Query quality acceptable for both local and cloud embeddings

Metrics:

  • Embedding generation latency (local vs cloud)
  • Query accuracy (precision@10 for semantic search; see the sketch after this list)
  • API costs (cloud providers)
  • User satisfaction (feedback on search quality)
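
For precision@10, a minimal sketch of the computation (the evaluation harness and relevance judgments are outside the scope of this ADR):

use std::collections::HashSet;

// Fraction of the top-k retrieved node IDs that are in the relevant set.
fn precision_at_k(retrieved: &[String], relevant: &HashSet<String>, k: usize) -> f64 {
    let top_k = &retrieved[..retrieved.len().min(k)];
    let hits = top_k.iter().filter(|id| relevant.contains(*id)).count();
    hits as f64 / k as f64
}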

References


Revision History

| Date       | Author            | Change           |
|------------|-------------------|------------------|
| 2026-01-17 | Architecture Team | Initial decision |

Previous ADR: ADR-001: Nickel vs TOML
Next ADR: ADR-003: Hybrid Storage Strategy