# HuggingFace Embedding Provider
A provider for HuggingFace Inference API embeddings, supporting popular sentence-transformers and BGE models.
## Overview
The HuggingFace provider uses the free Inference API to generate embeddings. It supports:
- **Public Models**: Free access to popular embedding models
- **Custom Models**: Support for any HuggingFace model with feature-extraction pipeline
- **Automatic Caching**: Built-in memory cache reduces API calls
- **Response Normalization**: Optional L2 normalization for similarity search
## Features
- ✅ Zero cost for public models (free Inference API)
- ✅ Five popular models predefined out of the box
- ✅ Custom model support with configurable dimensions
- ✅ Automatic retry with exponential backoff
- ✅ Rate limit handling
- ✅ Integration with stratum-embeddings caching layer
## Supported Models
### Predefined Models
| Model | Dimensions | Use Case | Constructor |
|-------|------------|----------|-------------|
| **BAAI/bge-small-en-v1.5** | 384 | General-purpose, efficient | `HuggingFaceProvider::bge_small()` |
| **BAAI/bge-base-en-v1.5** | 768 | Balanced performance | `HuggingFaceProvider::bge_base()` |
| **BAAI/bge-large-en-v1.5** | 1024 | High quality | `HuggingFaceProvider::bge_large()` |
| **sentence-transformers/all-MiniLM-L6-v2** | 384 | Fast, lightweight | `HuggingFaceProvider::all_minilm()` |
| **sentence-transformers/all-mpnet-base-v2** | 768 | Strong baseline | Use `HuggingFaceModel::Custom` |
### Custom Models
```rust
use stratum_embeddings::{HuggingFaceProvider, HuggingFaceModel};

let api_key = std::env::var("HUGGINGFACE_API_KEY")?;
let model = HuggingFaceModel::Custom(
    "sentence-transformers/paraphrase-MiniLM-L6-v2".to_string(),
    384, // Embedding dimensions for this model
);
let provider = HuggingFaceProvider::new(api_key, model)?;
```
## API Rate Limits
### Free Inference API
The HuggingFace Inference API enforces the following rate limits:
| Tier | Requests/Hour | Requests/Day | Max Concurrent |
|------|---------------|--------------|----------------|
| **Anonymous** | 1,000 | 10,000 | 1 |
| **Free Account** | 3,000 | 30,000 | 3 |
| **PRO ($9/mo)** | 10,000 | 100,000 | 10 |
| **Enterprise** | Custom | Custom | Custom |
**Rate Limit Headers**:
```
X-RateLimit-Limit: 3000
X-RateLimit-Remaining: 2999
X-RateLimit-Reset: 1234567890
```
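To inspect these headers yourself, you can call the Inference API directly with `reqwest`; this is a sketch outside the provider (which does not currently expose the headers), using the documented `https://api-inference.huggingface.co/models/<model>` endpoint:
```rust
use reqwest::Client;

// Direct request to the feature-extraction endpoint, just to read
// the rate-limit headers from the response.
let resp = Client::new()
    .post("https://api-inference.huggingface.co/models/BAAI/bge-small-en-v1.5")
    .bearer_auth(std::env::var("HUGGINGFACE_API_KEY")?)
    .json(&serde_json::json!({ "inputs": "Hello world" }))
    .send()
    .await?;

if let Some(remaining) = resp.headers().get("X-RateLimit-Remaining") {
    println!("Remaining requests: {}", remaining.to_str()?);
}
```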
### Rate Limit Handling
The provider automatically handles rate limits with:
1. **Exponential Backoff**: Retries with increasing delays (1s, 2s, 4s, 8s)
2. **Max Retries**: Default 3 retries before failing
3. **Circuit Breaker**: Automatically pauses requests if rate limited repeatedly
4. **Cache Integration**: Reduces API calls by 70-90% for repeated queries
**Configuration**:
```rust
// Default retry config (built-in)
let provider = HuggingFaceProvider::new(api_key, model)?;

// With custom retry (future enhancement)
let provider = HuggingFaceProvider::new(api_key, model)?
    .with_retry_config(RetryConfig {
        max_retries: 5,
        initial_delay: Duration::from_secs(2),
        max_delay: Duration::from_secs(30),
    });
```
### Best Practices for Rate Limits
1. **Enable Caching**: Use `EmbeddingOptions::default_with_cache()`
   ```rust
   let options = EmbeddingOptions::default_with_cache();
   let embedding = provider.embed(text, &options).await?;
   ```
2. **Batch Requests Carefully**: The Inference API processes requests sequentially, so a batch of N texts costs N API calls (see the throttling sketch after this list)
   ```rust
   // This makes N API calls, one per text
   let texts = vec!["text1", "text2", "text3"];
   let result = provider.embed_batch(&texts, &options).await?;
   ```
3. **Use a PRO Account for Production**: The free tier is suitable for development only
4. **Monitor Rate Limits**: Check the response headers
   ```rust
   // Future enhancement - rate limit monitoring
   let stats = provider.rate_limit_stats();
   println!("Remaining: {}/{}", stats.remaining, stats.limit);
   ```
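Because each batch item is a separate API call, large batches can exhaust the hourly budget quickly. Below is a minimal client-side throttle, assuming the `embed` signature shown above:
```rust
use std::time::Duration;
use tokio::time::sleep;

// Space sequential calls ~1.25s apart to stay under the free tier's
// ~0.8 requests/second budget.
let texts = vec!["text1", "text2", "text3"];
let mut embeddings = Vec::with_capacity(texts.len());
for text in &texts {
    embeddings.push(provider.embed(text, &options).await?);
    sleep(Duration::from_millis(1250)).await;
}
```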
## Authentication
### Environment Variables
The provider checks for API keys in this order:
1. `HUGGINGFACE_API_KEY`
2. `HF_TOKEN` (alternative name)
```bash
export HUGGINGFACE_API_KEY="hf_xxxxxxxxxxxxxxxxxxxx"
```
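A minimal sketch of that lookup order using only the standard library (the provider's internal resolution may differ):
```rust
// Prefer HUGGINGFACE_API_KEY, fall back to HF_TOKEN.
let api_key = std::env::var("HUGGINGFACE_API_KEY")
    .or_else(|_| std::env::var("HF_TOKEN"))
    .expect("set HUGGINGFACE_API_KEY or HF_TOKEN");
```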
### Getting an API Token
1. Go to [HuggingFace Settings](https://huggingface.co/settings/tokens)
2. Click "New token"
3. Select "Read" access (sufficient for Inference API)
4. Copy the token starting with `hf_`
## Usage Examples
### Basic Usage
```rust
use stratum_embeddings::{HuggingFaceProvider, EmbeddingOptions};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Using a predefined model
    let provider = HuggingFaceProvider::bge_small()?;
    let options = EmbeddingOptions::default_with_cache();

    let embedding = provider.embed("Hello world", &options).await?;
    println!("Dimensions: {}", embedding.len()); // 384

    Ok(())
}
```
### With EmbeddingService (Recommended)
```rust
use std::time::Duration;
use stratum_embeddings::{
    HuggingFaceProvider, EmbeddingService, MemoryCache, EmbeddingOptions,
};

let provider = HuggingFaceProvider::bge_small()?;
let cache = MemoryCache::new(1000, Duration::from_secs(3600));
let service = EmbeddingService::new(provider).with_cache(cache);

let options = EmbeddingOptions::default_with_cache();
let embedding = service.embed("Cached embeddings", &options).await?;
```
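Here `MemoryCache::new(1000, Duration::from_secs(3600))` is read as a capacity of 1,000 entries with a one-hour TTL (assuming those are the constructor's two parameters); size the capacity to the number of distinct texts you expect to re-embed within that window.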
### Semantic Similarity Search
```rust
use stratum_embeddings::{HuggingFaceProvider, EmbeddingOptions, cosine_similarity};

let provider = HuggingFaceProvider::bge_small()?;
let options = EmbeddingOptions {
    normalize: true, // Important for cosine similarity
    truncate: true,
    use_cache: true,
};

let query = "machine learning";
let doc1 = "deep learning and neural networks";
let doc2 = "cooking recipes";

let query_emb = provider.embed(query, &options).await?;
let doc1_emb = provider.embed(doc1, &options).await?;
let doc2_emb = provider.embed(doc2, &options).await?;

let sim1 = cosine_similarity(&query_emb, &doc1_emb);
let sim2 = cosine_similarity(&query_emb, &doc2_emb);
println!("Similarity with doc1: {:.4}", sim1); // ~0.85
println!("Similarity with doc2: {:.4}", sim2); // ~0.15
```
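For reference, `cosine_similarity` computes the standard dot-product-over-norms formula; with `normalize: true` the embeddings are already unit-length, so it reduces to a plain dot product. A minimal sketch (the crate's own implementation may differ):
```rust
fn cosine_similarity_sketch(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}
```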
### Custom Model
```rust
use stratum_embeddings::{HuggingFaceProvider, HuggingFaceModel};

let api_key = std::env::var("HUGGINGFACE_API_KEY")?;
let model = HuggingFaceModel::Custom(
    "intfloat/multilingual-e5-large".to_string(),
    1024, // Specify dimensions
);
let provider = HuggingFaceProvider::new(api_key, model)?;
```
## Error Handling
### Common Errors
| Error | Cause | Solution |
|-------|-------|----------|
| `ConfigError: API key is empty` | Missing credentials | Set `HUGGINGFACE_API_KEY` |
| `ApiError: HTTP 401` | Invalid API token | Check token validity |
| `ApiError: HTTP 429` | Rate limit exceeded | Wait or upgrade tier |
| `ApiError: HTTP 503` | Model loading | Retry after ~20s |
| `DimensionMismatch` | Wrong model dimensions | Update `Custom` model dims |
### Retry Example
```rust
use std::time::Duration;
use tokio::time::sleep;

let mut retries = 0;
let max_retries = 3;
// Bind the loop result; each `break` returns the final Result.
let result = loop {
    match provider.embed(text, &options).await {
        Ok(embedding) => break Ok(embedding),
        Err(e) if e.to_string().contains("429") && retries < max_retries => {
            retries += 1;
            let delay = Duration::from_secs(2u64.pow(retries));
            eprintln!("Rate limited, retrying in {:?}...", delay);
            sleep(delay).await;
        }
        Err(e) => break Err(e),
    }
};
```
## Performance Characteristics
### Latency
| Operation | Latency | Notes |
|-----------|---------|-------|
| **Single embed** | 200-500ms | Depends on model size and region |
| **Batch (N items)** | N × 200-500ms | Sequential processing |
| **Cache hit** | <1ms | In-memory lookup |
| **Cold start** | +5-20s | First request loads model |
### Throughput
| Tier | Max RPS | Daily Limit |
|------|---------|-------------|
| Free | ~0.8 | 30,000 |
| PRO | ~2.8 | 100,000 |
**With Caching** (80% hit rate):
- Free tier: ~4 effective RPS
- PRO tier: ~14 effective RPS
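These figures follow from a simple identity: if a fraction `h` of requests hits the cache, only `1 - h` of them reach the API, so the visible rate is the base rate divided by `1 - h`. A quick check of the numbers above:
```rust
// Effective request rate under cache hit rate `h`: only (1 - h) of
// requests reach the API, so the visible rate is base_rps / (1 - h).
fn effective_rps(base_rps: f64, hit_rate: f64) -> f64 {
    base_rps / (1.0 - hit_rate)
}

fn main() {
    println!("{:.1}", effective_rps(0.8, 0.8)); // Free tier: 4.0
    println!("{:.1}", effective_rps(2.8, 0.8)); // PRO tier: 14.0
}
```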
## Cost Comparison
| Provider | Cost/1M Tokens | Free Tier | Notes |
|----------|----------------|-----------|-------|
| **HuggingFace** | $0.00 | 30k req/day | Free for public models |
| OpenAI | $0.02-0.13 | $5 credit | Pay per token |
| Cohere | $0.10 | 100 req/month | Limited free tier |
| Voyage | $0.12 | None | No free tier |
## Limitations
1. **No True Batching**: Inference API processes one request at a time
2. **Cold Starts**: Models need ~20s to load on first request
3. **Rate Limits**: Free tier suitable for development only
4. **Regional Latency**: Single region (US/EU), no edge locations
5. **Model Loading**: Popular models cached, custom models may be slow
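Cold starts surface as HTTP 503 (see the error table above). When calling the raw Inference API, the documented `wait_for_model` option asks the server to hold the request until the model is loaded instead of returning 503; whether the provider sets this internally is an implementation detail:
```rust
// Hypothetical raw request body for the feature-extraction pipeline;
// `wait_for_model: true` holds the request through a cold start
// instead of failing with HTTP 503.
let body = serde_json::json!({
    "inputs": "Hello world",
    "options": { "wait_for_model": true }
});
```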
## Advanced Configuration
### Model Loading Timeout
```rust
// Future enhancement
let provider = HuggingFaceProvider::new(api_key, model)?
    .with_timeout(Duration::from_secs(120)); // Wait longer for cold starts
```
### Dedicated Inference Endpoints
For production workloads, consider [Dedicated Endpoints](https://huggingface.co/inference-endpoints):
- True batch processing
- Guaranteed uptime
- No rate limits
- Custom regions
- ~$60-500/month
## Migration Guide
### From vapora Custom Implementation
**Before**:
```rust
let hf = HuggingFaceEmbedding::new(api_key, "BAAI/bge-small-en-v1.5".to_string());
let embedding = hf.embed(text).await?;
```
**After**:
```rust
let provider = HuggingFaceProvider::bge_small()?;
let options = EmbeddingOptions::default_with_cache();
let embedding = provider.embed(text, &options).await?;
```
### From OpenAI
```rust
// OpenAI (paid)
let provider = OpenAiProvider::new(api_key, OpenAiModel::TextEmbedding3Small)?;

// HuggingFace (free, similar quality)
let provider = HuggingFaceProvider::bge_small()?;
```
## Running the Example
```bash
export HUGGINGFACE_API_KEY="hf_xxxxxxxxxxxxxxxxxxxx"
cargo run --example huggingface_usage \
    --features huggingface-provider
```
## References
- [HuggingFace Inference API Docs](https://huggingface.co/docs/api-inference/index)
- [BGE Embedding Models](https://huggingface.co/BAAI)
- [Sentence Transformers](https://www.sbert.net/)
- [Rate Limits Documentation](https://huggingface.co/docs/api-inference/rate-limits)