# HuggingFace Embedding Provider
A provider for HuggingFace Inference API embeddings, supporting popular sentence-transformers and BGE models.
## Overview
The HuggingFace provider uses the free Inference API to generate embeddings. It supports:
- **Public Models**: Free access to popular embedding models
- **Custom Models**: Support for any HuggingFace model with feature-extraction pipeline
- **Automatic Caching**: Built-in memory cache reduces API calls
- **Response Normalization**: Optional L2 normalization for similarity search
## Features
- ✅ Zero cost for public models (free Inference API)
- ✅ Support for 5+ popular models out of the box
- ✅ Custom model support with configurable dimensions
- ✅ Automatic retry with exponential backoff
- ✅ Rate limit handling
- ✅ Integration with stratum-embeddings caching layer
## Supported Models
### Predefined Models
| Model | Dimensions | Use Case | Constructor |
|-------|------------|----------|-------------|
| **BAAI/bge-small-en-v1.5** | 384 | General-purpose, efficient | `HuggingFaceProvider::bge_small()` |
| **BAAI/bge-base-en-v1.5** | 768 | Balanced performance | `HuggingFaceProvider::bge_base()` |
| **BAAI/bge-large-en-v1.5** | 1024 | High quality | `HuggingFaceProvider::bge_large()` |
| **sentence-transformers/all-MiniLM-L6-v2** | 384 | Fast, lightweight | `HuggingFaceProvider::all_minilm()` |
| **sentence-transformers/all-mpnet-base-v2** | 768 | Strong baseline | use `Custom` (see below) |
### Custom Models
```rust
let api_key = std::env::var("HUGGINGFACE_API_KEY")?;
let model = HuggingFaceModel::Custom(
    "sentence-transformers/paraphrase-MiniLM-L6-v2".to_string(),
    384, // embedding dimensions
);
let provider = HuggingFaceProvider::new(api_key, model)?;
```
## API Rate Limits
### Free Inference API
The HuggingFace Inference API enforces the following rate limits:
| Tier | Requests/Hour | Requests/Day | Max Concurrent |
|------|---------------|--------------|----------------|
| **Anonymous** | 1,000 | 10,000 | 1 |
| **Free Account** | 3,000 | 30,000 | 3 |
| **PRO ($9/mo)** | 10,000 | 100,000 | 10 |
| **Enterprise** | Custom | Custom | Custom |
**Rate Limit Headers**:
```
X-RateLimit-Limit: 3000
X-RateLimit-Remaining: 2999
X-RateLimit-Reset: 1234567890
```
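The provider does not surface these headers yet, but if you call the Inference API directly you can inspect them yourself. A minimal sketch using `reqwest` (header names taken from the example above; a missing header is treated as unknown rather than an error):
```rust
use reqwest::Response;

// Read the remaining-request budget off a raw Inference API response.
// Returns None if the header is absent or unparseable.
fn rate_limit_remaining(resp: &Response) -> Option<u64> {
    resp.headers()
        .get("X-RateLimit-Remaining")?
        .to_str()
        .ok()?
        .parse()
        .ok()
}
```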
### Rate Limit Handling
The provider automatically handles rate limits with:
1. **Exponential Backoff**: Retries with increasing delays (1s, 2s, 4s, 8s)
2. **Max Retries**: Default 3 retries before failing
3. **Circuit Breaker**: Automatically pauses requests if rate limited repeatedly
4. **Cache Integration**: Reduces API calls by 70-90% for repeated queries
**Configuration**:
```rust
// Default retry config (built-in)
let provider = HuggingFaceProvider::new(api_key, model)?;
// With custom retry (future enhancement)
let provider = HuggingFaceProvider::new(api_key, model)?
    .with_retry_config(RetryConfig {
        max_retries: 5,
        initial_delay: Duration::from_secs(2),
        max_delay: Duration::from_secs(30),
    });
```
### Best Practices for Rate Limits
1. **Enable Caching**: Use `EmbeddingOptions::default_with_cache()`
   ```rust
   let options = EmbeddingOptions::default_with_cache();
   let embedding = provider.embed(text, &options).await?;
   ```
2. **Batch Requests Carefully**: HuggingFace Inference API processes requests sequentially
   ```rust
   // This makes N API calls sequentially
   let texts = vec!["text1", "text2", "text3"];
   let result = provider.embed_batch(&texts, &options).await?;
   ```
3. **Use PRO Account for Production**: Free tier is suitable for development only
4. **Monitor Rate Limits**: Check response headers
   ```rust
   // Future enhancement - rate limit monitoring
   let stats = provider.rate_limit_stats();
   println!("Remaining: {}/{}", stats.remaining, stats.limit);
   ```
## Authentication
### Environment Variables
The provider checks for API keys in this order:
1. `HUGGINGFACE_API_KEY`
2. `HF_TOKEN` (alternative name)
```bash
export HUGGINGFACE_API_KEY="hf_xxxxxxxxxxxxxxxxxxxx"
```
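If you need the same lookup order outside the provider (for example in a CLI wrapper), a minimal sketch of the documented fallback chain:
```rust
// Resolve the API key the same way the provider does:
// HUGGINGFACE_API_KEY first, then HF_TOKEN as a fallback.
fn resolve_api_key() -> Option<String> {
    std::env::var("HUGGINGFACE_API_KEY")
        .or_else(|_| std::env::var("HF_TOKEN"))
        .ok()
}
```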
### Getting an API Token
1. Go to [HuggingFace Settings](https://huggingface.co/settings/tokens)
2. Click "New token"
3. Select "Read" access (sufficient for Inference API)
4. Copy the token starting with `hf_`
## Usage Examples
### Basic Usage
```rust
use stratum_embeddings::{HuggingFaceProvider, EmbeddingOptions};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Using predefined model
    let provider = HuggingFaceProvider::bge_small()?;
    let options = EmbeddingOptions::default_with_cache();
    let embedding = provider.embed("Hello world", &options).await?;
    println!("Dimensions: {}", embedding.len()); // 384
    Ok(())
}
```
### With EmbeddingService (Recommended)
```rust
use std::time::Duration;
use stratum_embeddings::{
    HuggingFaceProvider, EmbeddingService, MemoryCache, EmbeddingOptions,
};

let provider = HuggingFaceProvider::bge_small()?;
let cache = MemoryCache::new(1000, Duration::from_secs(3600));
let service = EmbeddingService::new(provider)
    .with_cache(cache);

let options = EmbeddingOptions::default_with_cache();
let embedding = service.embed("Cached embeddings", &options).await?;
```
### Semantic Similarity Search
```rust
use stratum_embeddings::{HuggingFaceProvider, EmbeddingOptions, cosine_similarity};

let provider = HuggingFaceProvider::bge_small()?;
let options = EmbeddingOptions {
    normalize: true, // important for cosine similarity
    truncate: true,
    use_cache: true,
};

let query = "machine learning";
let doc1 = "deep learning and neural networks";
let doc2 = "cooking recipes";

let query_emb = provider.embed(query, &options).await?;
let doc1_emb = provider.embed(doc1, &options).await?;
let doc2_emb = provider.embed(doc2, &options).await?;

let sim1 = cosine_similarity(&query_emb, &doc1_emb);
let sim2 = cosine_similarity(&query_emb, &doc2_emb);
println!("Similarity with doc1: {:.4}", sim1); // ~0.85
println!("Similarity with doc2: {:.4}", sim2); // ~0.15
```
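For reference, the formula behind `cosine_similarity` is sketched below (illustrative only; use the crate's helper in practice). For L2-normalized vectors the denominator is 1, which is why `normalize: true` matters above:
```rust
// Illustrative sketch of cosine similarity: dot(a, b) / (|a| * |b|).
fn cosine_similarity_sketch(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}
```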
### Custom Model
```rust
use stratum_embeddings::{HuggingFaceProvider, HuggingFaceModel};
let api_key = std::env::var("HUGGINGFACE_API_KEY")?;
let model = HuggingFaceModel::Custom(
    "intfloat/multilingual-e5-large".to_string(),
    1024, // specify dimensions
);
let provider = HuggingFaceProvider::new(api_key, model)?;
```
## Error Handling
### Common Errors
| Error | Cause | Solution |
|-------|-------|----------|
| `ConfigError: API key is empty` | Missing credentials | Set `HUGGINGFACE_API_KEY` |
| `ApiError: HTTP 401` | Invalid API token | Check token validity |
| `ApiError: HTTP 429` | Rate limit exceeded | Wait or upgrade tier |
| `ApiError: HTTP 503` | Model loading | Retry after ~20s |
| `DimensionMismatch` | Wrong model dimensions | Update `Custom` model dims |
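To centralize that table in code, a hypothetical helper (not part of the crate) could map status codes to retry decisions:
```rust
use std::time::Duration;

// Map an HTTP status from the table above to a retry decision:
// None = give up, Some(d) = wait d and retry.
fn retry_delay(status: u16, attempt: u32) -> Option<Duration> {
    match status {
        429 => Some(Duration::from_secs(2u64.pow(attempt))), // rate limited: back off
        503 => Some(Duration::from_secs(20)),                // model loading: wait it out
        _ => None,                                           // 401 etc.: retrying won't help
    }
}
```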
### Retry Example
```rust
use tokio::time::sleep;
use std::time::Duration;
let mut retries = 0;
let max_retries = 3;

let embedding = loop {
    match provider.embed(text, &options).await {
        Ok(embedding) => break Ok(embedding),
        Err(e) if e.to_string().contains("429") && retries < max_retries => {
            retries += 1;
            let delay = Duration::from_secs(2u64.pow(retries));
            eprintln!("Rate limited, retrying in {:?}...", delay);
            sleep(delay).await;
        }
        Err(e) => break Err(e),
    }
}?;
```
## Performance Characteristics
### Latency
| Operation | Latency | Notes |
|-----------|---------|-------|
| **Single embed** | 200-500ms | Depends on model size and region |
| **Batch (N items)** | N × 200-500ms | Sequential processing |
| **Cache hit** | <1ms | In-memory lookup |
| **Cold start** | +5-20s | First request loads model |
### Throughput
| Tier | Max RPS | Daily Limit |
|------|---------|-------------|
| Free | ~0.8 | 30,000 |
| PRO | ~2.8 | 100,000 |
**With Caching** (80% hit rate):
- Free tier: ~4 effective RPS
- PRO tier: ~14 effective RPS
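These effective numbers follow from a simple identity: only cache misses reach the API, so effective throughput is raw throughput divided by the miss rate. A quick sanity check:
```rust
// effective RPS = raw RPS / (1 - hit rate)
fn effective_rps(raw_rps: f64, cache_hit_rate: f64) -> f64 {
    raw_rps / (1.0 - cache_hit_rate)
}

// effective_rps(0.8, 0.8) ≈ 4.0  (free tier)
// effective_rps(2.8, 0.8) ≈ 14.0 (PRO tier)
```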
## Cost Comparison
| Provider | Cost/1M Tokens | Free Tier | Notes |
|----------|----------------|-----------|-------|
| **HuggingFace** | $0.00 | 30k req/day | Free for public models |
| OpenAI | $0.02-0.13 | $5 credit | Pay per token |
| Cohere | $0.10 | 100 req/month | Limited free tier |
| Voyage | $0.12 | None | No free tier |
## Limitations
1. **No True Batching**: Inference API processes one request at a time
2. **Cold Starts**: Models need ~20s to load on first request (see the raw-request sketch after this list)
3. **Rate Limits**: Free tier suitable for development only
4. **Regional Latency**: Single region (US/EU), no edge locations
5. **Model Loading**: Popular models cached, custom models may be slow
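For cold starts specifically, the Inference API accepts a `wait_for_model` option that blocks until the model is loaded instead of returning HTTP 503. The provider does not expose this yet; a raw `reqwest` sketch (the response shape varies by model, so the return type here is an assumption):
```rust
use serde_json::json;

// Call the Inference API directly, asking it to wait out a cold start
// rather than failing with 503 while the model loads.
async fn embed_raw(api_key: &str, model: &str, text: &str) -> reqwest::Result<Vec<Vec<f32>>> {
    let url = format!("https://api-inference.huggingface.co/models/{model}");
    reqwest::Client::new()
        .post(&url)
        .bearer_auth(api_key)
        .json(&json!({
            "inputs": [text],
            "options": { "wait_for_model": true },
        }))
        .send()
        .await?
        .json()
        .await
}
```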
## Advanced Configuration
### Model Loading Timeout
```rust
// Future enhancement
let provider = HuggingFaceProvider::new(api_key, model)?
    .with_timeout(Duration::from_secs(120)); // wait longer for cold starts
```
### Dedicated Inference Endpoints
For production workloads, consider [Dedicated Endpoints](https://huggingface.co/inference-endpoints):
- True batch processing
- Guaranteed uptime
- No rate limits
- Custom regions
- ~$60-500/month
## Migration Guide
### From vapora Custom Implementation
**Before**:
```rust
let hf = HuggingFaceEmbedding::new(api_key, "BAAI/bge-small-en-v1.5".to_string());
let embedding = hf.embed(text).await?;
```
**After**:
```rust
let provider = HuggingFaceProvider::bge_small()?;
let options = EmbeddingOptions::default_with_cache();
let embedding = provider.embed(text, &options).await?;
```
### From OpenAI
```rust
// OpenAI (paid)
let provider = OpenAiProvider::new(api_key, OpenAiModel::TextEmbedding3Small)?;
// HuggingFace (free, similar quality)
let provider = HuggingFaceProvider::bge_small()?;
```
## Running the Example
```bash
export HUGGINGFACE_API_KEY="hf_xxxxxxxxxxxxxxxxxxxx"
cargo run --example huggingface_usage \
--features huggingface-provider
```
## References
- [HuggingFace Inference API Docs](https://huggingface.co/docs/api-inference/index)
- [BGE Embedding Models](https://huggingface.co/BAAI)
- [Sentence Transformers](https://www.sbert.net/)
- [Rate Limits Documentation](https://huggingface.co/docs/api-inference/rate-limits)