# ADR-002: Stratum-LLM - Unified LLM Provider Library
## Status
**Proposed**
## Context
### Current State: Fragmented LLM Connections
The stratumiops ecosystem has 4 projects with AI functionality, each with its own implementation:
| Project | Implementation | Providers | Duplication |
| ------------ | -------------------------- | ---------------------- | ------------------- |
| Vapora | `typedialog-ai` (path dep) | Claude, OpenAI, Ollama | Shared base |
| TypeDialog | `typedialog-ai` (local) | Claude, OpenAI, Ollama | Defines abstraction |
| Provisioning | Custom `LlmClient` | Claude, OpenAI | 100% duplicated |
| Kogral | `rig-core` | Embeddings only | Different stack |
### Identified Problems
#### 1. Code Duplication
Provisioning reimplements what TypeDialog already has:
- reqwest HTTP client
- Headers: x-api-key, anthropic-version
- JSON body formatting
- Response parsing
- Error handling
**Impact**: ~500 duplicated lines; a bug fixed in one copy does not propagate to the other.
#### 2. API Keys Only, No CLI Detection
No project detects credentials from official CLIs:
```text
Claude CLI: ~/.config/claude/credentials.json
OpenAI CLI: ~/.config/openai/credentials.json
```
**Impact**: Users with Claude Pro/Max ($20-100/month) pay for API tokens when they could use their subscription.
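A minimal sketch of the intended detection order, assuming the credential paths above; the `dirs` crate, the `CredentialSource` enum, and the env var names are illustrative, not a committed interface:
```rust
use std::path::PathBuf;

/// Where a usable credential was found, in priority order (illustrative).
#[derive(Debug)]
enum CredentialSource {
    ClaudeCli(PathBuf),
    OpenAiCli(PathBuf),
    EnvVar(&'static str),
}

/// Probe official CLI credential files first, then fall back to env vars.
fn detect_credentials() -> Option<CredentialSource> {
    let home = dirs::home_dir()?;
    let claude = home.join(".config/claude/credentials.json");
    if claude.exists() {
        return Some(CredentialSource::ClaudeCli(claude));
    }
    let openai = home.join(".config/openai/credentials.json");
    if openai.exists() {
        return Some(CredentialSource::OpenAiCli(openai));
    }
    // Env var names assumed here; the ADR only specifies the *_API_KEY pattern.
    for var in ["ANTHROPIC_API_KEY", "OPENAI_API_KEY"] {
        if std::env::var(var).is_ok() {
            return Some(CredentialSource::EnvVar(var));
        }
    }
    None
}
```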
#### 3. No Automatic Fallback
When a provider fails (rate limit, timeout), the request fails completely:
```text
Current: Request → Claude API → Rate Limit → ERROR
Desired: Request → Claude API → Rate Limit → OpenAI → Success
```
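A sketch of the desired fallback loop; the `LlmProvider` trait, the error type, and the `async_trait` usage are illustrative, not the final API:
```rust
#[derive(Debug)]
struct ProviderError(String);

/// Hypothetical provider abstraction; names are illustrative.
#[async_trait::async_trait]
trait LlmProvider {
    fn name(&self) -> &str;
    async fn complete(&self, prompt: &str) -> Result<String, ProviderError>;
}

/// Try each provider in priority order; return the first success.
async fn complete_with_fallback(
    providers: &[Box<dyn LlmProvider + Send + Sync>],
    prompt: &str,
) -> Result<String, ProviderError> {
    let mut last_err = ProviderError("no providers configured".into());
    for p in providers {
        match p.complete(prompt).await {
            Ok(text) => return Ok(text),
            // Rate limit, timeout, etc.: record and try the next provider.
            Err(e) => last_err = e,
        }
    }
    Err(last_err)
}
```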
#### 4. No Circuit Breaker
If the Claude API is down, each request attempts to connect, fails, and propagates the error:
```text
Request 1 → Claude → Timeout (30s) → Error
Request 2 → Claude → Timeout (30s) → Error
Request 3 → Claude → Timeout (30s) → Error
```
**Impact**: Accumulated latency, degraded UX.
#### 5. No Caching
Identical requests always go to the API:
```text
"Explain this Rust error" → Claude → $0.003
"Explain this Rust error" → Claude → $0.003 (same result)
```
**Impact**: Unnecessary costs, especially in development/testing.
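A sketch of the intended cache, keyed on provider, model, and prompt, with the configurable TTL called for in the mitigations below; the in-memory `HashMap` store is illustrative:
```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Illustrative in-memory cache; a persistent store could be swapped in.
struct RequestCache {
    ttl: Duration,
    entries: HashMap<u64, (Instant, String)>,
}

impl RequestCache {
    /// Cache key over everything that changes the response.
    fn key(provider: &str, model: &str, prompt: &str) -> u64 {
        use std::hash::{Hash, Hasher};
        let mut h = std::collections::hash_map::DefaultHasher::new();
        (provider, model, prompt).hash(&mut h);
        h.finish()
    }

    /// Hit only if the entry is younger than the TTL (stale entries miss).
    fn get(&self, key: u64) -> Option<&str> {
        self.entries
            .get(&key)
            .filter(|(at, _)| at.elapsed() < self.ttl)
            .map(|(_, text)| text.as_str())
    }

    fn put(&mut self, key: u64, text: String) {
        self.entries.insert(key, (Instant::now(), text));
    }
}
```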
#### 6. Kogral Not Integrated
Kogral has guidelines and patterns that could enrich LLM context, but there's no integration.
## Decision
Create `stratum-llm` as a unified crate that:
1. **Consolidates** existing implementations from typedialog-ai and provisioning
2. **Detects** CLI credentials and subscriptions before using API keys
3. **Implements** automatic fallback with circuit breaker
4. **Adds** request caching to reduce costs
5. **Integrates** Kogral for context enrichment
6. **Is used** by all ecosystem projects
### Architecture
```text
┌─────────────────────────────────────────────────────────┐
│ stratum-llm │
├─────────────────────────────────────────────────────────┤
│ CredentialDetector │
│ ├─ Claude CLI → ~/.config/claude/ (subscription) │
│ ├─ OpenAI CLI → ~/.config/openai/ │
│ ├─ Env vars → *_API_KEY │
│ └─ Ollama → localhost:11434 (free) │
│ │ │
│ ▼ │
│ ProviderChain (ordered by priority) │
│ [CLI/Sub] → [API] → [DeepSeek] → [Ollama] │
│ │ │ │ │ │
│ └──────────┴─────────┴───────────┘ │
│ │ │
│ CircuitBreaker per provider │
│ │ │
│ RequestCache │
│ │ │
│ KogralIntegration │
│ │ │
│ UnifiedClient │
│ │
└─────────────────────────────────────────────────────────┘
```
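A hypothetical caller's view of this stack, with the builder ordered to mirror the layers in the diagram; every method name below is illustrative, not a committed interface:
```rust
use std::time::Duration;

// Entirely hypothetical surface API for the layered design above.
async fn example() -> Result<(), Box<dyn std::error::Error>> {
    let client = stratum_llm::UnifiedClient::builder()
        .detect_cli_credentials(true) // CredentialDetector layer
        .provider_chain(&["claude", "openai", "deepseek", "ollama"])
        .circuit_breaker_cooldown(Duration::from_secs(60))
        .cache_ttl(Duration::from_secs(3600)) // RequestCache layer
        .with_kogral(true) // context enrichment
        .build()?;
    let answer = client.complete("Explain this Rust error").await?;
    println!("{answer}");
    Ok(())
}
```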
## Rationale
### Why Not Use an Existing External Crate
| Alternative | Why Not |
| -------------- | ------------------------------------------ |
| kaccy-ai | Oriented toward blockchain/fraud detection |
| llm (crate) | Very basic, no circuit breaker or caching |
| langchain-rust | Python port, not idiomatic Rust |
| rig-core | Embeddings/RAG only, no chat completion |
**Best option**: Build on typedialog-ai and add missing features.
### Why CLI Detection is Important
Cost analysis for typical user:
| Scenario | Monthly Cost |
| ------------------------- | -------------------- |
| API only (current) | ~$840 |
| Claude Pro + API overflow | ~$20 + ~$200 = $220 |
| Claude Max + API overflow | ~$100 + ~$50 = $150 |
**Potential savings**: 70-80% by detecting and using subscriptions first.
### Why Circuit Breaker
Without a circuit breaker, a downed provider causes:
- N requests × 30s timeout = N×30s total latency
- All resources occupied waiting for timeouts
With a circuit breaker (sketched after this list):
- First failure opens the circuit
- Subsequent requests fail immediately (fast fail)
- Fallback to another provider without waiting
- Circuit resets after cooldown
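A sketch of that state machine; the ADR's open-on-first-failure policy corresponds to `failure_threshold = 1`, and all names here are illustrative:
```rust
use std::time::{Duration, Instant};

/// Classic three-state breaker; threshold and cooldown are illustrative.
#[derive(Clone, Copy)]
enum BreakerState {
    Closed { failures: u32 },
    Open { since: Instant },
    HalfOpen,
}

struct CircuitBreaker {
    state: BreakerState,
    failure_threshold: u32,
    cooldown: Duration,
}

impl CircuitBreaker {
    /// Fast-fail gate checked before any request is attempted.
    fn allow_request(&mut self) -> bool {
        match self.state {
            BreakerState::Closed { .. } | BreakerState::HalfOpen => true,
            BreakerState::Open { since } => {
                if since.elapsed() >= self.cooldown {
                    // Cooldown elapsed: let one probe request through.
                    self.state = BreakerState::HalfOpen;
                    true
                } else {
                    false // fast fail; caller falls back to the next provider
                }
            }
        }
    }

    fn record_failure(&mut self) {
        let failures = match self.state {
            BreakerState::Closed { failures } => failures + 1,
            _ => self.failure_threshold, // a failed probe reopens immediately
        };
        self.state = if failures >= self.failure_threshold {
            BreakerState::Open { since: Instant::now() }
        } else {
            BreakerState::Closed { failures }
        };
    }

    fn record_success(&mut self) {
        self.state = BreakerState::Closed { failures: 0 };
    }
}
```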
### Why Caching
For typical development:
- Same questions repeated while iterating
- Test suites execute the same prompts multiple times
Estimated cache hit rate: 15-30% in active development.
### Why Kogral Integration
Kogral has language guidelines, domain patterns, and ADRs.
Without integration the LLM generates generic code;
with integration it generates code following project conventions.
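A minimal sketch of the enrichment step; how stratum-llm actually queries Kogral for guidelines is left open by this ADR:
```rust
/// Prepend project guidelines to the user prompt so generated code follows
/// local conventions. The guideline lookup itself is out of scope here.
fn enrich_prompt(guidelines: &str, user_prompt: &str) -> String {
    format!("Project conventions to follow:\n{guidelines}\n\n---\n\n{user_prompt}")
}
```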
## Consequences
### Positive
1. Single source of truth for LLM logic
2. CLI detection reduces costs 70-80%
3. Circuit breaker + fallback = high availability
4. 15-30% fewer requests in development (caching)
5. Kogral improves generation quality
6. Feature-gated: each feature is optional
### Negative
1. **Migration effort**: Refactor Vapora, TypeDialog, Provisioning
2. **New dependency**: Projects depend on stratumiops
3. **CLI auth complexity**: Different credential formats per version
4. **Cache invalidation**: Stale responses if not managed well
### Mitigations
| Negative | Mitigation |
| ------------------- | ------------------------------------------- |
| Migration effort | Re-export compatible API from typedialog-ai |
| New dependency | Local path dependency, not crates.io |
| CLI auth complexity | Version detection; fall back to API keys on failure |
| Cache invalidation | Configurable TTL, bypass option |
## Success Metrics
| Metric | Current | Target |
| ------------------------ | ------- | --------------- |
| Duplicated lines of code | ~500 | 0 |
| CLI credential detection | 0% | 100% |
| Fallback success rate | 0% | >90% |
| Cache hit rate | 0% | 15-30% |
| Latency (provider down) | 30s+ | <1s (fast fail) |
## Cost Impact Analysis
Based on real usage data ($840/month):
| Scenario | Savings |
| -------------------------- | ------------------ |
| CLI detection (Claude Max) | ~$700/month |
| Caching (15% hit rate) | ~$50/month |
| DeepSeek fallback for code | ~$100/month |
| **Total potential** | **$500-700/month** |
The line items overlap (a request served via subscription is neither cached nor rerouted), so the realistic total is below their simple sum.
## Migration Strategy
### Migration Phases
1. Create stratum-llm with API compatible with typedialog-ai
2. typedialog-ai re-exports stratum-llm (backward compatible; see the shim sketch after this list)
3. Vapora migrates to stratum-llm directly
4. Provisioning migrates its LlmClient to stratum-llm
5. Deprecate typedialog-ai, consolidate in stratum-llm
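A sketch of the phase-2 shim mentioned above; `UnifiedClient` is a hypothetical type name:
```rust
// typedialog-ai/src/lib.rs during phase 2: a thin compatibility shim.
// Existing `use typedialog_ai::...` call sites keep compiling unchanged.
pub use stratum_llm::*;

// Deprecation notices can nudge callers toward the new crate over time.
#[deprecated(note = "use stratum_llm::UnifiedClient directly")]
pub type Client = stratum_llm::UnifiedClient;
```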
### Feature Adoption
| Feature | Adoption |
| --------------- | ----------------------------------------- |
| Basic providers | Immediate (direct replacement) |
| CLI detection | Optional, feature flag |
| Circuit breaker | Default on |
| Caching | Default on, configurable TTL |
| Kogral | Feature flag, requires Kogral installed |
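A sketch of how the adoption table could map to feature gates in the crate's `lib.rs`; module and feature names are illustrative:
```rust
// Feature-gated modules mirroring the adoption table above.
#[cfg(feature = "cli-detect")]
pub mod credential_detector; // optional CLI credential detection

#[cfg(feature = "cache")]
pub mod request_cache; // enabled via the crate's default features

#[cfg(feature = "kogral")]
pub mod kogral; // requires a local Kogral installation

pub mod circuit_breaker; // always compiled: default on per the table
```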
## Alternatives Considered
### Alternative 1: Improve typedialog-ai In-Place
**Pros**: No new crate required
**Cons**: TypeDialog is a specific project, not shared infrastructure
**Decision**: stratum-llm in stratumiops is a better location for cross-project infrastructure.
### Alternative 2: Use LiteLLM (Python) as Proxy
**Pros**: Very complete, 100+ providers
**Cons**: Python dependency, proxy latency, not Rust-native
**Decision**: Keep pure Rust stack.
### Alternative 3: Each Project Maintains Its Own Implementation
**Pros**: Independence
**Cons**: Duplication, inconsistency, bugs not shared
**Decision**: Consolidation is better long-term.
## References
**Existing Implementations**:
- TypeDialog: `typedialog/crates/typedialog-ai/`
- Vapora: `vapora/crates/vapora-llm-router/`
- Provisioning: `provisioning/platform/crates/rag/`
**Kogral**: `kogral/`
**Target Location**: `stratumiops/crates/stratum-llm/`