# ADR-007: Multi-Provider LLM Support (Claude, OpenAI, Gemini, Ollama)

**Status**: Accepted | Implemented

**Date**: 2024-11-01

**Deciders**: LLM Architecture Team

**Technical Story**: Enabling fallback across multiple LLM providers with cost optimization

---

## Decision

Support **4 providers (Claude, OpenAI, Gemini, Ollama)** via an `LLMClient` trait abstraction with an automatic fallback chain.

---

## Rationale

1. **Cost Optimization**: Cheap (Ollama) → Fast (Gemini) → Reliable (Claude/GPT-4)
2. **Resilience**: If one provider fails, automatic fallback to the next
3. **Task-Specific Selection**:
   - Architecture → Claude Opus (best reasoning)
   - Code generation → GPT-4 (best code)
   - Quick queries → Gemini Flash (fastest)
   - Development/testing → Ollama (free)
4. **Avoid Vendor Lock-in**: Multiple providers prevent dependence on any single vendor

---

## Alternatives Considered

### ❌ Single Provider Only (Claude)
- **Pros**: Simplicity
- **Cons**: Vendor lock-in, no fallback if the service goes down, high cost

### ❌ Custom API Abstraction (DIY)
- **Pros**: Full control
- **Cons**: Heavy maintenance; streaming, error handling, and token counting must be re-implemented for every provider

### ✅ Multiple Providers with Fallback (CHOSEN)
- Flexible, resilient, cost-optimized

---

## Trade-offs

**Pros**:
- ✅ Automatic fallback when the primary provider is unavailable
- ✅ Cost efficiency: Ollama is free, Gemini is cheap, Claude is premium
- ✅ Resilience: no single point of failure
- ✅ Task-specific selection: the best tool for each job
- ✅ No vendor lock-in

**Cons**:
- ⚠️ Multiple API keys to manage (secrets management)
- ⚠️ More complex testing (mocks for multiple providers)
- ⚠️ Latency variance (different speeds across providers)

---

## Implementation

**Provider Trait Abstraction**:

```rust
// crates/vapora-llm-router/src/providers.rs
use futures::stream::BoxStream;

// `async fn` in a trait is not dyn-compatible on its own, so the trait
// goes through async_trait; clients are stored as boxed trait objects.
#[async_trait::async_trait]
pub trait LLMClient: Send + Sync {
    async fn complete(&self, prompt: &str) -> Result<String>;
    async fn stream_complete(&self, prompt: &str) -> Result<BoxStream<'static, String>>;
    fn provider_name(&self) -> &str;
    fn cost_per_token(&self) -> f64;
}

// Implementations (each impl block also carries #[async_trait::async_trait])
impl LLMClient for ClaudeClient { /* ... */ }
impl LLMClient for OpenAIClient { /* ... */ }
impl LLMClient for GeminiClient { /* ... */ }
impl LLMClient for OllamaClient { /* ... */ }
```
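
For illustration, a minimal sketch of what one concrete client could look like, here against Ollama's `/api/generate` endpoint. The `OllamaClient` fields and the use of `reqwest`/`serde_json` are assumptions for this sketch, not the crate's actual implementation:

```rust
// Hypothetical sketch of a non-streaming Ollama client; field names
// and HTTP plumbing are illustrative.
pub struct OllamaClient {
    http: reqwest::Client,
    url: String,   // e.g. "http://localhost:11434"
    model: String, // e.g. "llama2"
}

#[async_trait::async_trait]
impl LLMClient for OllamaClient {
    async fn complete(&self, prompt: &str) -> Result<String> {
        // Ollama's generate endpoint; `stream: false` returns a single JSON object.
        let body = serde_json::json!({
            "model": self.model,
            "prompt": prompt,
            "stream": false,
        });
        let resp: serde_json::Value = self
            .http
            .post(format!("{}/api/generate", self.url))
            .json(&body)
            .send()
            .await?
            .json()
            .await?;
        // The generated text comes back in the "response" field.
        Ok(resp["response"].as_str().unwrap_or_default().to_string())
    }

    async fn stream_complete(&self, _prompt: &str) -> Result<BoxStream<'static, String>> {
        todo!() // streaming variant omitted from this sketch
    }

    fn provider_name(&self) -> &str {
        "ollama"
    }

    fn cost_per_token(&self) -> f64 {
        0.0 // local model, no per-token cost
    }
}
```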

**Fallback Chain Router**:

```rust
// crates/vapora-llm-router/src/router.rs
impl LLMRouter {
    pub async fn route_task(&self, task: &Task) -> Result<String> {
        let providers = vec![
            select_primary_provider(task), // Task-specific: Claude/GPT-4/Gemini
            "gemini".to_string(),          // Fallback: Gemini
            "openai".to_string(),          // Fallback: OpenAI
            "ollama".to_string(),          // Last resort: local
        ];

        let prompt = task.prompt(); // prompt construction elided here

        for provider_name in &providers {
            // Skip providers that are not configured in this deployment.
            let Some(client) = self.clients.get(provider_name) else {
                continue;
            };
            match client.complete(&prompt).await {
                Ok(response) => {
                    metrics::increment_provider_success(provider_name);
                    return Ok(response);
                }
                Err(e) => {
                    tracing::warn!("Provider {} failed: {:?}, trying next", provider_name, e);
                    metrics::increment_provider_failure(provider_name);
                }
            }
        }

        Err(VaporaError::AllProvidersFailed)
    }
}
```
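
The `select_primary_provider` helper is not shown in this ADR; a plausible sketch, assuming tasks carry a kind that matches the routing rules in the configuration below (the `TaskKind` enum and `Task::kind` accessor are hypothetical):

```rust
// Hypothetical sketch; TaskKind and Task::kind are illustrative names.
fn select_primary_provider(task: &Task) -> String {
    match task.kind() {
        TaskKind::Architecture => "claude".to_string(), // best reasoning
        TaskKind::CodeGeneration => "openai".to_string(), // best code
        TaskKind::QuickQuery => "gemini".to_string(),   // fastest
        _ => "ollama".to_string(), // free default for development/testing
    }
}
```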

**Configuration**:

```toml
# config/llm-routing.toml
[[providers]]
name = "claude"
model = "claude-3-opus-20240229"
api_key_env = "ANTHROPIC_API_KEY"
priority = 1
cost_per_1k_tokens = 0.015

[[providers]]
name = "openai"
model = "gpt-4"
api_key_env = "OPENAI_API_KEY"
priority = 2
cost_per_1k_tokens = 0.03

[[providers]]
name = "gemini"
model = "gemini-2.0-flash"
api_key_env = "GOOGLE_API_KEY"
priority = 3
cost_per_1k_tokens = 0.005

[[providers]]
name = "ollama"
url = "http://localhost:11434"
model = "llama2"
priority = 4
cost_per_1k_tokens = 0.0

[[routing_rules]]
pattern = "architecture"
provider = "claude"

[[routing_rules]]
pattern = "code_generation"
provider = "openai"

[[routing_rules]]
pattern = "quick_query"
provider = "gemini"
```
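
A minimal sketch of loading this file with `serde` and the `toml` crate; the struct and field names mirror the TOML above, but the real definitions in the crate may differ:

```rust
// Hypothetical config structs mirroring config/llm-routing.toml.
use serde::Deserialize;

#[derive(Debug, Deserialize)]
pub struct RoutingConfig {
    pub providers: Vec<ProviderConfig>,
    pub routing_rules: Vec<RoutingRule>,
}

#[derive(Debug, Deserialize)]
pub struct ProviderConfig {
    pub name: String,
    pub model: String,
    pub api_key_env: Option<String>, // absent for local providers like Ollama
    pub url: Option<String>,         // only set for Ollama
    pub priority: u8,
    pub cost_per_1k_tokens: f64,
}

#[derive(Debug, Deserialize)]
pub struct RoutingRule {
    pub pattern: String,
    pub provider: String,
}

pub fn load_config(path: &str) -> anyhow::Result<RoutingConfig> {
    let raw = std::fs::read_to_string(path)?;
    Ok(toml::from_str(&raw)?)
}
```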

**Key Files**:
- `/crates/vapora-llm-router/src/providers.rs` (trait implementations)
- `/crates/vapora-llm-router/src/router.rs` (routing logic + fallback)
- `/crates/vapora-llm-router/src/cost_tracker.rs` (token counting per provider)
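
The cost tracker itself is not shown in this ADR; a minimal sketch of per-provider token accounting, using the per-1k-token rates from the configuration (names are illustrative, the actual `cost_tracker.rs` may differ):

```rust
// Hypothetical sketch of per-provider cost accounting.
use std::collections::HashMap;

#[derive(Default)]
pub struct CostTracker {
    tokens: HashMap<String, u64>, // provider name -> total tokens consumed
}

impl CostTracker {
    pub fn record(&mut self, provider: &str, tokens: u64) {
        *self.tokens.entry(provider.to_string()).or_insert(0) += tokens;
    }

    /// Total spend for one provider, given its cost per 1k tokens.
    pub fn cost_for(&self, provider: &str, cost_per_1k_tokens: f64) -> f64 {
        let used = *self.tokens.get(provider).unwrap_or(&0);
        (used as f64 / 1000.0) * cost_per_1k_tokens
    }
}
```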

---

## Verification

```bash
# Test each provider individually
cargo test -p vapora-llm-router test_claude_provider
cargo test -p vapora-llm-router test_openai_provider
cargo test -p vapora-llm-router test_gemini_provider
cargo test -p vapora-llm-router test_ollama_provider

# Test fallback chain
cargo test -p vapora-llm-router test_fallback_chain

# Benchmark costs and latencies
cargo run -p vapora-llm-router --bin benchmark -- --providers all --samples 100

# Test task routing
cargo test -p vapora-llm-router test_task_routing
```
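
For context, a sketch of how `test_fallback_chain` could be structured with one always-failing mock and one stub fallback; the mock clients, error construction, and the `LLMRouter::with_clients`/`Task::quick_query` constructors are illustrative assumptions:

```rust
// Hypothetical test sketch; not the crate's real test code.
struct FailingClient;
struct StubClient(&'static str);

#[async_trait::async_trait]
impl LLMClient for FailingClient {
    async fn complete(&self, _p: &str) -> Result<String> {
        Err(anyhow::anyhow!("provider down")) // simulate an outage
    }
    async fn stream_complete(&self, _p: &str) -> Result<BoxStream<'static, String>> {
        Err(anyhow::anyhow!("provider down"))
    }
    fn provider_name(&self) -> &str { "failing" }
    fn cost_per_token(&self) -> f64 { 0.0 }
}

#[async_trait::async_trait]
impl LLMClient for StubClient {
    async fn complete(&self, _p: &str) -> Result<String> {
        Ok(self.0.to_string())
    }
    async fn stream_complete(&self, _p: &str) -> Result<BoxStream<'static, String>> {
        Ok(Box::pin(futures::stream::once(async { String::new() })))
    }
    fn provider_name(&self) -> &str { "stub" }
    fn cost_per_token(&self) -> f64 { 0.0 }
}

#[tokio::test]
async fn test_fallback_chain() {
    let router = LLMRouter::with_clients(vec![
        ("claude".to_string(), Box::new(FailingClient) as Box<dyn LLMClient>),
        ("ollama".to_string(), Box::new(StubClient("hello"))),
    ]);
    let response = router.route_task(&Task::quick_query("ping")).await.unwrap();
    assert_eq!(response, "hello"); // primary failed, fallback answered
}
```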

**Expected Output**:
- All 4 providers respond correctly when available
- Fallback triggers when the primary provider fails
- Cost tracking is accurate per provider
- Task routing selects the appropriate provider (Claude for architecture, GPT-4 for code, etc.)

---

## Consequences

### Operational
- 4 API keys required (managed via secrets)
- Cost monitoring per provider (see ADR-015, Budget Enforcement)
- Provider status pages monitored for incidents

### Metrics & Monitoring
- Track success rate per provider
- Track latency per provider
- Alert if the primary provider consistently fails
- Report costs broken down by provider

### Development
- Mock-based tests for each provider
- Integration tests against real providers (limited, to control costs)
- Provider selection logic is well documented

---

## References

- [Claude API Documentation](https://docs.anthropic.com/claude)
- [OpenAI API Documentation](https://platform.openai.com/docs)
- [Google Gemini API](https://ai.google.dev/)
- [Ollama Documentation](https://ollama.ai/)
- `/crates/vapora-llm-router/src/providers.rs` (provider implementations)
- `/crates/vapora-llm-router/src/cost_tracker.rs` (token tracking)
- ADR-012 (Three-Tier LLM Routing)
- ADR-015 (Budget Enforcement)

---

**Related ADRs**: ADR-006 (Rig Framework), ADR-012 (Routing Tiers), ADR-015 (Budget)