# HuggingFace Embedding Provider

Provider for HuggingFace Inference API embeddings with support for popular sentence-transformers and BGE models.

## Overview

The HuggingFace provider uses the free Inference API to generate embeddings. It supports:

- **Public Models**: Free access to popular embedding models
- **Custom Models**: Support for any HuggingFace model with a feature-extraction pipeline
- **Automatic Caching**: Built-in memory cache reduces API calls
- **Response Normalization**: Optional L2 normalization for similarity search

## Features

- ✅ Zero cost for public models (free Inference API)
- ✅ Support for 5+ popular models out of the box
- ✅ Custom model support with configurable dimensions
- ✅ Automatic retry with exponential backoff
- ✅ Rate limit handling
- ✅ Integration with the stratum-embeddings caching layer

## Supported Models

### Predefined Models

| Model | Dimensions | Use Case | Constructor |
|-------|------------|----------|-------------|
| **BAAI/bge-small-en-v1.5** | 384 | General-purpose, efficient | `HuggingFaceProvider::bge_small()` |
| **BAAI/bge-base-en-v1.5** | 768 | Balanced performance | `HuggingFaceProvider::bge_base()` |
| **BAAI/bge-large-en-v1.5** | 1024 | High quality | `HuggingFaceProvider::bge_large()` |
| **sentence-transformers/all-MiniLM-L6-v2** | 384 | Fast, lightweight | `HuggingFaceProvider::all_minilm()` |
| **sentence-transformers/all-mpnet-base-v2** | 768 | Strong baseline | - |

### Custom Models

```rust
let model = HuggingFaceModel::Custom(
    "sentence-transformers/paraphrase-MiniLM-L6-v2".to_string(),
    384,
);
let provider = HuggingFaceProvider::new(api_key, model)?;
```

## API Rate Limits

### Free Inference API

The HuggingFace Inference API has the following rate limits:

| Tier | Requests/Hour | Requests/Day | Max Concurrent |
|------|---------------|--------------|----------------|
| **Anonymous** | 1,000 | 10,000 | 1 |
| **Free Account** | 3,000 | 30,000 | 3 |
| **PRO ($9/mo)** | 10,000 | 100,000 | 10 |
| **Enterprise** | Custom | Custom | Custom |

**Rate Limit Headers**:

```
X-RateLimit-Limit: 3000
X-RateLimit-Remaining: 2999
X-RateLimit-Reset: 1234567890
```

### Rate Limit Handling

The provider automatically handles rate limits with:

1. **Exponential Backoff**: Retries with increasing delays (1s, 2s, 4s, 8s)
2. **Max Retries**: Default 3 retries before failing
3. **Circuit Breaker**: Automatically pauses requests if rate limited repeatedly
4. **Cache Integration**: Reduces API calls by 70-90% for repeated queries

**Configuration**:

```rust
// Default retry config (built-in)
let provider = HuggingFaceProvider::new(api_key, model)?;

// With custom retry (future enhancement)
let provider = HuggingFaceProvider::new(api_key, model)?
    .with_retry_config(RetryConfig {
        max_retries: 5,
        initial_delay: Duration::from_secs(2),
        max_delay: Duration::from_secs(30),
    });
```

### Best Practices for Rate Limits

1. **Enable Caching**: Use `EmbeddingOptions::default_with_cache()`

   ```rust
   let options = EmbeddingOptions::default_with_cache();
   let embedding = provider.embed(text, &options).await?;
   ```

2. **Batch Requests Carefully**: The HuggingFace Inference API processes requests sequentially (see the throttling sketch after this list)

   ```rust
   // This makes N API calls sequentially
   let texts = vec!["text1", "text2", "text3"];
   let result = provider.embed_batch(&texts, &options).await?;
   ```

3. **Use a PRO Account for Production**: The free tier is suitable for development only

4. **Monitor Rate Limits**: Check response headers (see the header-reading sketch after this list)

   ```rust
   // Future enhancement - rate limit monitoring
   let stats = provider.rate_limit_stats();
   println!("Remaining: {}/{}", stats.remaining, stats.limit);
   ```
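Until `rate_limit_stats()` ships, the headers shown above can be read by calling the Inference API directly. A minimal sketch, assuming `reqwest` (with its `json` feature) plus `serde_json`, and the public `api-inference.huggingface.co` endpoint; `check_rate_limit` is a hypothetical helper, not part of stratum-embeddings:

```rust
use reqwest::Client;

/// Hypothetical helper: issue one small request and print the
/// X-RateLimit-* headers from the response.
async fn check_rate_limit(token: &str) -> Result<(), reqwest::Error> {
    let resp = Client::new()
        .post("https://api-inference.huggingface.co/models/BAAI/bge-small-en-v1.5")
        .bearer_auth(token)
        .json(&serde_json::json!({ "inputs": ["ping"] }))
        .send()
        .await?;

    // The headers may be absent on some responses, so treat them as optional.
    for name in ["X-RateLimit-Limit", "X-RateLimit-Remaining", "X-RateLimit-Reset"] {
        if let Some(value) = resp.headers().get(name).and_then(|v| v.to_str().ok()) {
            println!("{name}: {value}");
        }
    }
    Ok(())
}
```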
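For large uncached batches, client-side throttling keeps you under the free tier's ~0.8 requests/second ceiling (see the throughput table below). A minimal sketch using only the provider API from this README; the 1300 ms interval is an assumption derived from the 3,000 requests/hour limit, not a crate default:

```rust
use std::time::Duration;
use tokio::time::sleep;
use stratum_embeddings::{EmbeddingOptions, HuggingFaceProvider};

let provider = HuggingFaceProvider::bge_small()?;
let options = EmbeddingOptions::default_with_cache();

let texts = vec!["text1", "text2", "text3"];
let mut embeddings = Vec::with_capacity(texts.len());

for text in &texts {
    // Cache hits return immediately and cost no API call; the sleep only
    // matters after a miss, but sleeping unconditionally keeps this simple.
    embeddings.push(provider.embed(text, &options).await?);
    sleep(Duration::from_millis(1300)).await; // ~0.77 req/s, under 3,000/hour
}
```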
## Authentication

### Environment Variables

The provider checks for API keys in this order:

1. `HUGGINGFACE_API_KEY`
2. `HF_TOKEN` (alternative name)

```bash
export HUGGINGFACE_API_KEY="hf_xxxxxxxxxxxxxxxxxxxx"
```

### Getting an API Token

1. Go to [HuggingFace Settings](https://huggingface.co/settings/tokens)
2. Click "New token"
3. Select "Read" access (sufficient for the Inference API)
4. Copy the token starting with `hf_`

## Usage Examples

### Basic Usage

```rust
use stratum_embeddings::{HuggingFaceProvider, EmbeddingOptions};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Using a predefined model
    let provider = HuggingFaceProvider::bge_small()?;
    let options = EmbeddingOptions::default_with_cache();

    let embedding = provider.embed("Hello world", &options).await?;
    println!("Dimensions: {}", embedding.len()); // 384

    Ok(())
}
```

### With EmbeddingService (Recommended)

```rust
use std::time::Duration;
use stratum_embeddings::{
    HuggingFaceProvider, EmbeddingService, MemoryCache, EmbeddingOptions
};

let provider = HuggingFaceProvider::bge_small()?;
let cache = MemoryCache::new(1000, Duration::from_secs(3600));

let service = EmbeddingService::new(provider)
    .with_cache(cache);

let options = EmbeddingOptions::default_with_cache();
let embedding = service.embed("Cached embeddings", &options).await?;
```

### Semantic Similarity Search

With `normalize: true` the returned vectors are L2-normalized, so cosine similarity reduces to a plain dot product.

```rust
use stratum_embeddings::{HuggingFaceProvider, EmbeddingOptions, cosine_similarity};

let provider = HuggingFaceProvider::bge_small()?;
let options = EmbeddingOptions {
    normalize: true, // Important for cosine similarity
    truncate: true,
    use_cache: true,
};

let query = "machine learning";
let doc1 = "deep learning and neural networks";
let doc2 = "cooking recipes";

let query_emb = provider.embed(query, &options).await?;
let doc1_emb = provider.embed(doc1, &options).await?;
let doc2_emb = provider.embed(doc2, &options).await?;

let sim1 = cosine_similarity(&query_emb, &doc1_emb);
let sim2 = cosine_similarity(&query_emb, &doc2_emb);

println!("Similarity with doc1: {:.4}", sim1); // ~0.85
println!("Similarity with doc2: {:.4}", sim2); // ~0.15
```

### Custom Model

```rust
use stratum_embeddings::{HuggingFaceProvider, HuggingFaceModel};

let api_key = std::env::var("HUGGINGFACE_API_KEY")?;
let model = HuggingFaceModel::Custom(
    "intfloat/multilingual-e5-large".to_string(),
    1024, // Specify dimensions
);
let provider = HuggingFaceProvider::new(api_key, model)?;
```

## Error Handling

### Common Errors

| Error | Cause | Solution |
|-------|-------|----------|
| `ConfigError: API key is empty` | Missing credentials | Set `HUGGINGFACE_API_KEY` |
| `ApiError: HTTP 401` | Invalid API token | Check token validity |
| `ApiError: HTTP 429` | Rate limit exceeded | Wait or upgrade tier |
| `ApiError: HTTP 503` | Model loading | Retry after ~20s |
| `DimensionMismatch` | Wrong model dimensions | Update `Custom` model dims |

### Retry Example

```rust
use tokio::time::sleep;
use std::time::Duration;

let mut retries: u32 = 0;
let max_retries = 3;

let embedding = loop {
    match provider.embed(text, &options).await {
        Ok(embedding) => break Ok(embedding),
        Err(e) if e.to_string().contains("429") && retries < max_retries => {
            retries += 1;
            // Exponential backoff: 2s, 4s, 8s
            let delay = Duration::from_secs(2u64.pow(retries));
            eprintln!("Rate limited, retrying in {:?}...", delay);
            sleep(delay).await;
        }
        Err(e) => break Err(e),
    }
}?;
```
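The same pattern can be lifted into a reusable helper that also retries `HTTP 503` (model loading), which the table above lists as transient. A minimal sketch; `embed_with_retry` is hypothetical, and the `Vec<f32>` return type and `EmbeddingError` name are assumptions about the crate's signatures:

```rust
use std::time::Duration;
use tokio::time::sleep;
use stratum_embeddings::{EmbeddingError, EmbeddingOptions, HuggingFaceProvider};

/// Hypothetical helper: retry 429 (rate limit) and 503 (model loading)
/// with exponential backoff; fail fast on everything else.
async fn embed_with_retry(
    provider: &HuggingFaceProvider,
    text: &str,
    options: &EmbeddingOptions,
    max_retries: u32,
) -> Result<Vec<f32>, EmbeddingError> {
    let mut attempt: u32 = 0;
    loop {
        match provider.embed(text, options).await {
            Ok(embedding) => return Ok(embedding),
            Err(e) => {
                let msg = e.to_string();
                let transient = msg.contains("429") || msg.contains("503");
                if !transient || attempt >= max_retries {
                    return Err(e);
                }
                attempt += 1;
                // 2s, 4s, 8s, ... as in the loop above
                sleep(Duration::from_secs(2u64.pow(attempt))).await;
            }
        }
    }
}
```

Callers then replace `provider.embed(text, &options)` with `embed_with_retry(&provider, text, &options, 3)`.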
## Performance Characteristics

### Latency

| Operation | Latency | Notes |
|-----------|---------|-------|
| **Single embed** | 200-500ms | Depends on model size and region |
| **Batch (N items)** | N × 200-500ms | Sequential processing |
| **Cache hit** | <1ms | In-memory lookup |
| **Cold start** | +5-20s | First request loads the model |

### Throughput

| Tier | Max RPS | Daily Limit |
|------|---------|-------------|
| Free | ~0.8 | 30,000 |
| PRO | ~2.8 | 100,000 |

**With Caching** (80% hit rate):

- Free tier: ~4 effective RPS
- PRO tier: ~14 effective RPS

## Cost Comparison

| Provider | Cost/1M Tokens | Free Tier | Notes |
|----------|----------------|-----------|-------|
| **HuggingFace** | $0.00 | 30k req/day | Free for public models |
| OpenAI | $0.02-0.13 | $5 credit | Pay per token |
| Cohere | $0.10 | 100 req/month | Limited free tier |
| Voyage | $0.12 | None | No free tier |

## Limitations

1. **No True Batching**: The Inference API processes one request at a time
2. **Cold Starts**: Models need ~20s to load on the first request
3. **Rate Limits**: The free tier is suitable for development only
4. **Regional Latency**: Single region (US/EU), no edge locations
5. **Model Loading**: Popular models stay cached; custom models may be slow

## Advanced Configuration

### Model Loading Timeout

```rust
// Future enhancement
let provider = HuggingFaceProvider::new(api_key, model)?
    .with_timeout(Duration::from_secs(120)); // Wait longer for cold starts
```

### Dedicated Inference Endpoints

For production workloads, consider [Dedicated Endpoints](https://huggingface.co/inference-endpoints):

- True batch processing
- Guaranteed uptime
- No rate limits
- Custom regions
- ~$60-500/month

## Migration Guide

### From vapora Custom Implementation

**Before**:

```rust
let hf = HuggingFaceEmbedding::new(api_key, "BAAI/bge-small-en-v1.5".to_string());
let embedding = hf.embed(text).await?;
```

**After**:

```rust
let provider = HuggingFaceProvider::bge_small()?;
let options = EmbeddingOptions::default_with_cache();
let embedding = provider.embed(text, &options).await?;
```

### From OpenAI

```rust
// OpenAI (paid)
let provider = OpenAiProvider::new(api_key, OpenAiModel::TextEmbedding3Small)?;

// HuggingFace (free, similar quality)
let provider = HuggingFaceProvider::bge_small()?;
```

## Running the Example

```bash
export HUGGINGFACE_API_KEY="hf_xxxxxxxxxxxxxxxxxxxx"

cargo run --example huggingface_usage \
    --features huggingface-provider
```

## References

- [HuggingFace Inference API Docs](https://huggingface.co/docs/api-inference/index)
- [BGE Embedding Models](https://huggingface.co/BAAI)
- [Sentence Transformers](https://www.sbert.net/)
- [Rate Limits Documentation](https://huggingface.co/docs/api-inference/rate-limits)