# HuggingFace Embedding Provider

Provider for HuggingFace Inference API embeddings, with support for popular sentence-transformers and BGE models.
## Overview
The HuggingFace provider uses the free Inference API to generate embeddings. It supports:
- Public Models: Free access to popular embedding models
- Custom Models: Support for any HuggingFace model with feature-extraction pipeline
- Automatic Caching: Built-in memory cache reduces API calls
- Response Normalization: Optional L2 normalization for similarity search
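For intuition, L2 normalization rescales a vector to unit length, so that dot products between normalized vectors are cosine similarities. A minimal illustrative sketch (not the crate's internal code):

```rust
/// Scale a vector to unit L2 norm in place.
fn l2_normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        v.iter_mut().for_each(|x| *x /= norm);
    }
}
```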
## Features
- ✅ Zero cost for public models (free Inference API)
- ✅ Support for 5+ popular models out of the box
- ✅ Custom model support with configurable dimensions
- ✅ Automatic retry with exponential backoff
- ✅ Rate limit handling
- ✅ Integration with stratum-embeddings caching layer
## Supported Models

### Predefined Models
| Model | Dimensions | Use Case | Constructor |
|---|---|---|---|
| `BAAI/bge-small-en-v1.5` | 384 | General-purpose, efficient | `HuggingFaceProvider::bge_small()` |
| `BAAI/bge-base-en-v1.5` | 768 | Balanced performance | `HuggingFaceProvider::bge_base()` |
| `BAAI/bge-large-en-v1.5` | 1024 | High quality | `HuggingFaceProvider::bge_large()` |
| `sentence-transformers/all-MiniLM-L6-v2` | 384 | Fast, lightweight | `HuggingFaceProvider::all_minilm()` |
| `sentence-transformers/all-mpnet-base-v2` | 768 | Strong baseline | - |
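Four of the models above have dedicated constructors that wire up the model ID and dimensions for you; `all-mpnet-base-v2` can still be used through the `Custom` variant shown below:

```rust
let small = HuggingFaceProvider::bge_small()?;   // 384 dims
let base  = HuggingFaceProvider::bge_base()?;    // 768 dims
let large = HuggingFaceProvider::bge_large()?;   // 1024 dims
let mini  = HuggingFaceProvider::all_minilm()?;  // 384 dims
```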
### Custom Models

```rust
let model = HuggingFaceModel::Custom(
    "sentence-transformers/paraphrase-MiniLM-L6-v2".to_string(),
    384, // embedding dimensions
);
let provider = HuggingFaceProvider::new(api_key, model)?;
```
## API Rate Limits

### Free Inference API

The HuggingFace Inference API has the following rate limits:
| Tier | Requests/Hour | Requests/Day | Max Concurrent |
|---|---|---|---|
| Anonymous | 1,000 | 10,000 | 1 |
| Free Account | 3,000 | 30,000 | 3 |
| PRO ($9/mo) | 10,000 | 100,000 | 10 |
| Enterprise | Custom | Custom | Custom |
**Rate Limit Headers:**

```text
X-RateLimit-Limit: 3000
X-RateLimit-Remaining: 2999
X-RateLimit-Reset: 1234567890
```
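Until built-in monitoring lands (listed as a future enhancement under Best Practices below), these headers can be read off any HTTP response. A minimal sketch with `reqwest`, using the header names from the listing above:

```rust
use reqwest::Response;

/// Parse the remaining request budget from a response, if the header is present.
fn rate_limit_remaining(resp: &Response) -> Option<u64> {
    resp.headers()
        .get("X-RateLimit-Remaining")?
        .to_str()
        .ok()?
        .parse()
        .ok()
}
```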
### Rate Limit Handling
The provider automatically handles rate limits with:
- Exponential Backoff: Retries with increasing delays (1s, 2s, 4s, 8s)
- Max Retries: Default 3 retries before failing
- Circuit Breaker: Automatically pauses requests if rate limited repeatedly
- Cache Integration: Reduces API calls by 70-90% for repeated queries
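The doubling schedule in the first bullet corresponds to `initial_delay * 2^attempt`, capped at `max_delay`. A minimal sketch of that calculation:

```rust
use std::time::Duration;

/// Backoff delay for retry attempt `attempt`: 1s, 2s, 4s, 8s, ... capped at `max`.
fn backoff_delay(attempt: u32, initial: Duration, max: Duration) -> Duration {
    initial
        .checked_mul(2u32.saturating_pow(attempt))
        .map_or(max, |d| d.min(max))
}
```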
**Configuration:**

```rust
// Default retry config (built-in)
let provider = HuggingFaceProvider::new(api_key, model)?;

// With custom retry (future enhancement)
let provider = HuggingFaceProvider::new(api_key, model)?
    .with_retry_config(RetryConfig {
        max_retries: 5,
        initial_delay: Duration::from_secs(2),
        max_delay: Duration::from_secs(30),
    });
```
### Best Practices for Rate Limits

1. **Enable Caching**: Use `EmbeddingOptions::default_with_cache()`:

   ```rust
   let options = EmbeddingOptions::default_with_cache();
   let embedding = provider.embed(text, &options).await?;
   ```

2. **Batch Requests Carefully**: The HuggingFace Inference API processes requests sequentially:

   ```rust
   // This makes N API calls sequentially
   let texts = vec!["text1", "text2", "text3"];
   let result = provider.embed_batch(&texts, &options).await?;
   ```

3. **Use a PRO Account for Production**: The free tier is suitable for development only.

4. **Monitor Rate Limits**: Check response headers:

   ```rust
   // Future enhancement - rate limit monitoring
   let stats = provider.rate_limit_stats();
   println!("Remaining: {}/{}", stats.remaining, stats.limit);
   ```
## Authentication

### Environment Variables

The provider checks for API keys in this order:

1. `HUGGINGFACE_API_KEY`
2. `HF_TOKEN` (alternative name)

```bash
export HUGGINGFACE_API_KEY="hf_xxxxxxxxxxxxxxxxxxxx"
```
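If you need to resolve the token yourself rather than relying on the provider, a sketch mirroring the same lookup order:

```rust
use std::env;

// HUGGINGFACE_API_KEY takes precedence, HF_TOKEN is the fallback.
let api_key = env::var("HUGGINGFACE_API_KEY")
    .or_else(|_| env::var("HF_TOKEN"))
    .expect("set HUGGINGFACE_API_KEY or HF_TOKEN");
```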
### Getting an API Token

1. Go to [HuggingFace Settings](https://huggingface.co/settings/tokens)
2. Click "New token"
3. Select "Read" access (sufficient for the Inference API)
4. Copy the token starting with `hf_`
## Usage Examples

### Basic Usage

```rust
use stratum_embeddings::{HuggingFaceProvider, EmbeddingOptions};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Using a predefined model
    let provider = HuggingFaceProvider::bge_small()?;

    let options = EmbeddingOptions::default_with_cache();
    let embedding = provider.embed("Hello world", &options).await?;

    println!("Dimensions: {}", embedding.len()); // 384
    Ok(())
}
```
### With EmbeddingService (Recommended)

```rust
use std::time::Duration;
use stratum_embeddings::{
    HuggingFaceProvider, EmbeddingService, MemoryCache, EmbeddingOptions,
};

let provider = HuggingFaceProvider::bge_small()?;
let cache = MemoryCache::new(1000, Duration::from_secs(3600));
let service = EmbeddingService::new(provider).with_cache(cache);

let options = EmbeddingOptions::default_with_cache();
let embedding = service.embed("Cached embeddings", &options).await?;
```
### Semantic Similarity Search

```rust
use stratum_embeddings::{HuggingFaceProvider, EmbeddingOptions, cosine_similarity};

let provider = HuggingFaceProvider::bge_small()?;
let options = EmbeddingOptions {
    normalize: true, // Important for cosine similarity
    truncate: true,
    use_cache: true,
};

let query = "machine learning";
let doc1 = "deep learning and neural networks";
let doc2 = "cooking recipes";

let query_emb = provider.embed(query, &options).await?;
let doc1_emb = provider.embed(doc1, &options).await?;
let doc2_emb = provider.embed(doc2, &options).await?;

let sim1 = cosine_similarity(&query_emb, &doc1_emb);
let sim2 = cosine_similarity(&query_emb, &doc2_emb);

println!("Similarity with doc1: {:.4}", sim1); // ~0.85
println!("Similarity with doc2: {:.4}", sim2); // ~0.15
```
### Custom Model

```rust
use stratum_embeddings::{HuggingFaceProvider, HuggingFaceModel};

let api_key = std::env::var("HUGGINGFACE_API_KEY")?;
let model = HuggingFaceModel::Custom(
    "intfloat/multilingual-e5-large".to_string(),
    1024, // specify the embedding dimensions
);
let provider = HuggingFaceProvider::new(api_key, model)?;
```
## Error Handling

### Common Errors

| Error | Cause | Solution |
|---|---|---|
| `ConfigError: API key is empty` | Missing credentials | Set `HUGGINGFACE_API_KEY` |
| `ApiError: HTTP 401` | Invalid API token | Check token validity |
| `ApiError: HTTP 429` | Rate limit exceeded | Wait or upgrade tier |
| `ApiError: HTTP 503` | Model loading | Retry after ~20s |
| `DimensionMismatch` | Wrong model dimensions | Update `Custom` model dims |
### Retry Example

```rust
use std::time::Duration;
use tokio::time::sleep;

let mut retries = 0;
let max_retries = 3;

// Bind the loop's result so the retried embedding can be used afterwards.
let embedding = loop {
    match provider.embed(text, &options).await {
        Ok(embedding) => break Ok(embedding),
        Err(e) if e.to_string().contains("429") && retries < max_retries => {
            retries += 1;
            let delay = Duration::from_secs(2u64.pow(retries));
            eprintln!("Rate limited, retrying in {:?}...", delay);
            sleep(delay).await;
        }
        Err(e) => break Err(e),
    }
}?;
```
## Performance Characteristics

### Latency
| Operation | Latency | Notes |
|---|---|---|
| Single embed | 200-500ms | Depends on model size and region |
| Batch (N items) | N × 200-500ms | Sequential processing |
| Cache hit | <1ms | In-memory lookup |
| Cold start | +5-20s | First request loads model |
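The cold-start penalty can sometimes be absorbed server-side: the Inference API accepts a `wait_for_model` option that blocks until the model is loaded instead of returning HTTP 503. A hedged sketch calling the HTTP API directly with `reqwest` and `serde_json` (the provider itself may not expose this option):

```rust
use reqwest::Client;
use serde_json::{json, Value};

let client = Client::new();
let resp = client
    .post("https://api-inference.huggingface.co/models/BAAI/bge-small-en-v1.5")
    .bearer_auth(&api_key)
    .json(&json!({
        "inputs": "Hello world",
        "options": { "wait_for_model": true } // block during cold start instead of 503
    }))
    .send()
    .await?;
let body: Value = resp.json().await?; // response shape varies by model/pipeline
```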
### Throughput

| Tier | Max RPS | Daily Limit |
|---|---|---|
| Free | ~0.8 | 30,000 |
| PRO | ~2.8 | 100,000 |

**With Caching (80% hit rate):** effective throughput is roughly raw RPS / (1 - hit rate):

- Free tier: 0.8 / 0.2 ≈ 4 effective RPS
- PRO tier: 2.8 / 0.2 ≈ 14 effective RPS
## Cost Comparison
| Provider | Cost/1M Tokens | Free Tier | Notes |
|---|---|---|---|
| HuggingFace | $0.00 | 30k req/day | Free for public models |
| OpenAI | $0.02-0.13 | $5 credit | Pay per token |
| Cohere | $0.10 | 100 req/month | Limited free tier |
| Voyage | $0.12 | None | No free tier |
## Limitations

- **No True Batching**: The Inference API processes one request at a time (see the concurrency sketch after this list)
- **Cold Starts**: Models need ~20s to load on first request
- **Rate Limits**: The free tier is suitable for development only
- **Regional Latency**: Single region (US/EU), no edge locations
- **Model Loading**: Popular models are cached; custom models may be slow to load
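Since true batching isn't available, a client-side workaround is to keep several requests in flight, up to your tier's concurrency cap. A sketch using the `futures` crate, assuming `embed` takes `&self` as in the examples above:

```rust
use futures::stream::{self, StreamExt};

// Keep up to 3 requests in flight (the free-account concurrency cap from the
// rate-limit table), preserving input order in the output.
let texts = vec!["text1", "text2", "text3", "text4"];
let results: Vec<_> = stream::iter(&texts)
    .map(|text| provider.embed(text, &options))
    .buffered(3)
    .collect()
    .await;
```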
## Advanced Configuration

### Model Loading Timeout

```rust
// Future enhancement
let provider = HuggingFaceProvider::new(api_key, model)?
    .with_timeout(Duration::from_secs(120)); // wait longer for cold starts
```
### Dedicated Inference Endpoints
For production workloads, consider Dedicated Endpoints:
- True batch processing
- Guaranteed uptime
- No rate limits
- Custom regions
- ~$60-500/month
## Migration Guide

### From vapora Custom Implementation

**Before:**

```rust
let hf = HuggingFaceEmbedding::new(api_key, "BAAI/bge-small-en-v1.5".to_string());
let embedding = hf.embed(text).await?;
```

**After:**

```rust
let provider = HuggingFaceProvider::bge_small()?;
let options = EmbeddingOptions::default_with_cache();
let embedding = provider.embed(text, &options).await?;
```
### From OpenAI

```rust
// OpenAI (paid)
let provider = OpenAiProvider::new(api_key, OpenAiModel::TextEmbedding3Small)?;

// HuggingFace (free, similar quality)
let provider = HuggingFaceProvider::bge_small()?;
```
## Running the Example

```bash
export HUGGINGFACE_API_KEY="hf_xxxxxxxxxxxxxxxxxxxx"

cargo run --example huggingface_usage \
    --features huggingface-provider
```