# HuggingFace Embedding Provider

Provider for HuggingFace Inference API embeddings, with support for popular sentence-transformers and BGE models.

## Overview

The HuggingFace provider uses the free Inference API to generate embeddings. It supports:

- **Public Models**: Free access to popular embedding models
- **Custom Models**: Support for any HuggingFace model with a feature-extraction pipeline
- **Automatic Caching**: Built-in memory cache reduces API calls
- **Response Normalization**: Optional L2 normalization for similarity search

## Features

- ✅ Zero cost for public models (free Inference API)
- ✅ Five popular models predefined out of the box
- ✅ Custom model support with configurable dimensions
- ✅ Automatic retry with exponential backoff
- ✅ Rate limit handling
- ✅ Integration with the stratum-embeddings caching layer

## Supported Models

### Predefined Models

| Model | Dimensions | Use Case | Constructor |
|-------|------------|----------|-------------|
| **BAAI/bge-small-en-v1.5** | 384 | General-purpose, efficient | `HuggingFaceProvider::bge_small()` |
| **BAAI/bge-base-en-v1.5** | 768 | Balanced performance | `HuggingFaceProvider::bge_base()` |
| **BAAI/bge-large-en-v1.5** | 1024 | High quality | `HuggingFaceProvider::bge_large()` |
| **sentence-transformers/all-MiniLM-L6-v2** | 384 | Fast, lightweight | `HuggingFaceProvider::all_minilm()` |
| **sentence-transformers/all-mpnet-base-v2** | 768 | Strong baseline | - |

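The last model in the table has no dedicated constructor, but it can still be used through the `Custom` variant described in the next subsection. A minimal sketch (the 768 dimensions come from the table above):

```rust
use stratum_embeddings::{HuggingFaceProvider, HuggingFaceModel};

// all-mpnet-base-v2 has no dedicated constructor; wire it up through
// the Custom variant with its 768 dimensions from the table.
let api_key = std::env::var("HUGGINGFACE_API_KEY")?;
let model = HuggingFaceModel::Custom(
    "sentence-transformers/all-mpnet-base-v2".to_string(),
    768,
);
let provider = HuggingFaceProvider::new(api_key, model)?;
```
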
### Custom Models

```rust
use stratum_embeddings::{HuggingFaceProvider, HuggingFaceModel};

let api_key = std::env::var("HUGGINGFACE_API_KEY")?;
let model = HuggingFaceModel::Custom(
    "sentence-transformers/paraphrase-MiniLM-L6-v2".to_string(),
    384, // embedding dimensions for this model
);
let provider = HuggingFaceProvider::new(api_key, model)?;
```

## API Rate Limits

### Free Inference API

The HuggingFace Inference API has the following rate limits:

| Tier | Requests/Hour | Requests/Day | Max Concurrent |
|------|---------------|--------------|----------------|
| **Anonymous** | 1,000 | 10,000 | 1 |
| **Free Account** | 3,000 | 30,000 | 3 |
| **PRO ($9/mo)** | 10,000 | 100,000 | 10 |
| **Enterprise** | Custom | Custom | Custom |

**Rate Limit Headers**:

```
X-RateLimit-Limit: 3000
X-RateLimit-Remaining: 2999
X-RateLimit-Reset: 1234567890
```

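The provider tracks these headers internally, but they are also easy to inspect by hand. A minimal sketch with plain `reqwest` against the standard `api-inference.huggingface.co` endpoint (illustrative only, not part of the provider's API):

```rust
use reqwest::Client;

// Issue one raw Inference API call and read the rate limit headers
// off the response.
let client = Client::new();
let resp = client
    .post("https://api-inference.huggingface.co/models/BAAI/bge-small-en-v1.5")
    .bearer_auth(std::env::var("HUGGINGFACE_API_KEY")?)
    .json(&serde_json::json!({ "inputs": "Hello world" }))
    .send()
    .await?;

if let Some(remaining) = resp.headers().get("X-RateLimit-Remaining") {
    println!("Requests remaining: {}", remaining.to_str()?);
}
```
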
### Rate Limit Handling

The provider automatically handles rate limits with:

1. **Exponential Backoff**: Retries with increasing delays (1s, 2s, 4s, 8s; see the sketch after this list)
2. **Max Retries**: Default of 3 retries before failing
3. **Circuit Breaker**: Automatically pauses requests if rate limited repeatedly
4. **Cache Integration**: Reduces API calls by 70-90% for repeated queries

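The schedule above amounts to doubling an initial delay per attempt, capped at a maximum. A minimal sketch of that calculation (illustrative; the provider implements this internally):

```rust
use std::time::Duration;

// Delay for a given retry attempt: initial * 2^attempt, capped at max.
// With initial = 1s this yields the 1s, 2s, 4s, 8s schedule above.
fn backoff_delay(attempt: u32, initial: Duration, max: Duration) -> Duration {
    initial.saturating_mul(2u32.saturating_pow(attempt)).min(max)
}
```
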
**Configuration**:

```rust
// Default retry config (built-in)
let provider = HuggingFaceProvider::new(api_key, model)?;

// With custom retry (future enhancement)
let provider = HuggingFaceProvider::new(api_key, model)?
    .with_retry_config(RetryConfig {
        max_retries: 5,
        initial_delay: Duration::from_secs(2),
        max_delay: Duration::from_secs(30),
    });
```

### Best Practices for Rate Limits

1. **Enable Caching**: Use `EmbeddingOptions::default_with_cache()`

   ```rust
   let options = EmbeddingOptions::default_with_cache();
   let embedding = provider.embed(text, &options).await?;
   ```

2. **Batch Requests Carefully**: The Inference API processes requests sequentially, so a batch turns into one API call per item (see the pacing sketch after this list)

   ```rust
   // This makes N API calls, one per text, executed sequentially
   let texts = vec!["text1", "text2", "text3"];
   let result = provider.embed_batch(&texts, &options).await?;
   ```

3. **Use a PRO Account for Production**: The free tier is suitable for development only

4. **Monitor Rate Limits**: Check the response headers

   ```rust
   // Future enhancement - rate limit monitoring
   let stats = provider.rate_limit_stats();
   println!("Remaining: {}/{}", stats.remaining, stats.limit);
   ```

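Since a batch fans out into sequential requests, one way to stay under an hourly budget is to pace the calls explicitly. A minimal sketch (not part of the provider's API), reusing the `provider` and `options` from the examples above:

```rust
use std::time::Duration;
use tokio::time::sleep;

// Pace sequential embed calls to stay under a requests-per-hour budget:
// 3,000/hour on the free tier works out to roughly one call every 1.2s.
let texts = vec!["text1", "text2", "text3"];
let mut embeddings = Vec::with_capacity(texts.len());

for text in &texts {
    embeddings.push(provider.embed(text, &options).await?);
    sleep(Duration::from_millis(1200)).await;
}
```
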
## Authentication

### Environment Variables

The provider checks for API keys in this order:

1. `HUGGINGFACE_API_KEY`
2. `HF_TOKEN` (alternative name)

```bash
export HUGGINGFACE_API_KEY="hf_xxxxxxxxxxxxxxxxxxxx"
```

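If you need the token outside the provider, the same lookup order is easy to reproduce (a minimal sketch using only the two variable names listed above):

```rust
use std::env;

// Mirror the provider's lookup order: HUGGINGFACE_API_KEY first,
// then HF_TOKEN as a fallback.
let api_key = env::var("HUGGINGFACE_API_KEY")
    .or_else(|_| env::var("HF_TOKEN"))
    .expect("set HUGGINGFACE_API_KEY or HF_TOKEN");
```
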
### Getting an API Token

1. Go to [HuggingFace Settings](https://huggingface.co/settings/tokens)
2. Click "New token"
3. Select "Read" access (sufficient for the Inference API)
4. Copy the token starting with `hf_`

## Usage Examples

### Basic Usage

```rust
use stratum_embeddings::{HuggingFaceProvider, EmbeddingOptions};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Using a predefined model
    let provider = HuggingFaceProvider::bge_small()?;

    let options = EmbeddingOptions::default_with_cache();
    let embedding = provider.embed("Hello world", &options).await?;

    println!("Dimensions: {}", embedding.len()); // 384
    Ok(())
}
```

### With EmbeddingService (Recommended)

```rust
use std::time::Duration;
use stratum_embeddings::{
    HuggingFaceProvider, EmbeddingService, MemoryCache, EmbeddingOptions,
};

let provider = HuggingFaceProvider::bge_small()?;
let cache = MemoryCache::new(1000, Duration::from_secs(3600)); // 1000 entries, 1h TTL

let service = EmbeddingService::new(provider)
    .with_cache(cache);

let options = EmbeddingOptions::default_with_cache();
let embedding = service.embed("Cached embeddings", &options).await?;
```

### Semantic Similarity Search

```rust
use stratum_embeddings::{HuggingFaceProvider, EmbeddingOptions, cosine_similarity};

let provider = HuggingFaceProvider::bge_small()?;
let options = EmbeddingOptions {
    normalize: true, // important for cosine similarity
    truncate: true,
    use_cache: true,
};

let query = "machine learning";
let doc1 = "deep learning and neural networks";
let doc2 = "cooking recipes";

let query_emb = provider.embed(query, &options).await?;
let doc1_emb = provider.embed(doc1, &options).await?;
let doc2_emb = provider.embed(doc2, &options).await?;

let sim1 = cosine_similarity(&query_emb, &doc1_emb);
let sim2 = cosine_similarity(&query_emb, &doc2_emb);

println!("Similarity with doc1: {:.4}", sim1); // ~0.85
println!("Similarity with doc2: {:.4}", sim2); // ~0.15
```

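For reference, cosine similarity is the dot product of two vectors divided by the product of their magnitudes; with `normalize: true` the embeddings are already unit-length, so it reduces to a plain dot product. A minimal sketch of the math (not the crate's implementation):

```rust
// Cosine similarity: dot(a, b) / (|a| * |b|).
// For L2-normalized vectors both denominators are 1, leaving dot(a, b).
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}
```
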
### Custom Model

```rust
use stratum_embeddings::{HuggingFaceProvider, HuggingFaceModel};

let api_key = std::env::var("HUGGINGFACE_API_KEY")?;
let model = HuggingFaceModel::Custom(
    "intfloat/multilingual-e5-large".to_string(),
    1024, // specify the embedding dimensions
);

let provider = HuggingFaceProvider::new(api_key, model)?;
```

## Error Handling

### Common Errors

| Error | Cause | Solution |
|-------|-------|----------|
| `ConfigError: API key is empty` | Missing credentials | Set `HUGGINGFACE_API_KEY` |
| `ApiError: HTTP 401` | Invalid API token | Check token validity |
| `ApiError: HTTP 429` | Rate limit exceeded | Wait or upgrade tier |
| `ApiError: HTTP 503` | Model loading | Retry after ~20s |
| `DimensionMismatch` | Wrong model dimensions | Update `Custom` model dims |

### Retry Example

```rust
use std::time::Duration;
use tokio::time::sleep;

let mut retries = 0;
let max_retries = 3;

// The loop's break value becomes the Result we bind here.
let result = loop {
    match provider.embed(text, &options).await {
        Ok(embedding) => break Ok(embedding),
        Err(e) if e.to_string().contains("429") && retries < max_retries => {
            retries += 1;
            let delay = Duration::from_secs(2u64.pow(retries));
            eprintln!("Rate limited, retrying in {:?}...", delay);
            sleep(delay).await;
        }
        Err(e) => break Err(e),
    }
};
let embedding = result?;
```

## Performance Characteristics

### Latency

| Operation | Latency | Notes |
|-----------|---------|-------|
| **Single embed** | 200-500ms | Depends on model size and region |
| **Batch (N items)** | N × 200-500ms | Sequential processing |
| **Cache hit** | <1ms | In-memory lookup |
| **Cold start** | +5-20s | First request loads the model |

### Throughput

| Tier | Max RPS | Daily Limit |
|------|---------|-------------|
| Free | ~0.8 | 30,000 |
| PRO | ~2.8 | 100,000 |

**With Caching** (80% hit rate): only cache misses reach the API, so effective RPS ≈ raw RPS / (1 − hit rate):

- Free tier: ~4 effective RPS (0.8 / 0.2)
- PRO tier: ~14 effective RPS (2.8 / 0.2)

## Cost Comparison

| Provider | Cost/1M Tokens | Free Tier | Notes |
|----------|----------------|-----------|-------|
| **HuggingFace** | $0.00 | 30k req/day | Free for public models |
| OpenAI | $0.02-0.13 | $5 credit | Pay per token |
| Cohere | $0.10 | 100 req/month | Limited free tier |
| Voyage | $0.12 | None | No free tier |

## Limitations

1. **No True Batching**: The Inference API processes one request at a time
2. **Cold Starts**: Models need ~20s to load on the first request (see the warm-up sketch below)
3. **Rate Limits**: The free tier is suitable for development only
4. **Regional Latency**: Single region (US/EU), no edge locations
5. **Model Loading**: Popular models stay cached; custom models may be slow

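One practical way to absorb the cold start is a throwaway request at startup, so the ~20s load happens before real traffic arrives. A minimal sketch, reusing `provider` and `options` from the usage examples:

```rust
// Fire a cheap warm-up request at startup; a 503 here just means the
// model is still loading, so the cold-start cost is paid up front.
if let Err(e) = provider.embed("warmup", &options).await {
    eprintln!("Warm-up request failed (model may still be loading): {e}");
}
```
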
## Advanced Configuration

### Model Loading Timeout

```rust
// Future enhancement
let provider = HuggingFaceProvider::new(api_key, model)?
    .with_timeout(Duration::from_secs(120)); // Wait longer for cold starts
```

### Dedicated Inference Endpoints

For production workloads, consider [Dedicated Endpoints](https://huggingface.co/inference-endpoints):

- True batch processing
- Guaranteed uptime
- No rate limits
- Custom regions
- ~$60-500/month

## Migration Guide

### From vapora Custom Implementation

**Before**:

```rust
let hf = HuggingFaceEmbedding::new(api_key, "BAAI/bge-small-en-v1.5".to_string());
let embedding = hf.embed(text).await?;
```

**After**:

```rust
let provider = HuggingFaceProvider::bge_small()?;
let options = EmbeddingOptions::default_with_cache();
let embedding = provider.embed(text, &options).await?;
```

### From OpenAI

```rust
// OpenAI (paid)
let provider = OpenAiProvider::new(api_key, OpenAiModel::TextEmbedding3Small)?;

// HuggingFace (free, similar quality)
let provider = HuggingFaceProvider::bge_small()?;
```

## Running the Example

```bash
export HUGGINGFACE_API_KEY="hf_xxxxxxxxxxxxxxxxxxxx"

cargo run --example huggingface_usage \
    --features huggingface-provider
```

## References

- [HuggingFace Inference API Docs](https://huggingface.co/docs/api-inference/index)
- [BGE Embedding Models](https://huggingface.co/BAAI)
- [Sentence Transformers](https://www.sbert.net/)
- [Rate Limits Documentation](https://huggingface.co/docs/api-inference/rate-limits)