# ADR-002: Stratum-LLM - Unified LLM Provider Library

## Status

**Proposed**

## Context

### Current State: Fragmented LLM Connections

The stratumiops ecosystem has 4 projects with AI functionality, each with its own implementation:

| Project      | Implementation             | Providers              | Duplication         |
| ------------ | -------------------------- | ---------------------- | ------------------- |
| Vapora       | `typedialog-ai` (path dep) | Claude, OpenAI, Ollama | Shared base         |
| TypeDialog   | `typedialog-ai` (local)    | Claude, OpenAI, Ollama | Defines abstraction |
| Provisioning | Custom `LlmClient`         | Claude, OpenAI         | 100% duplicated     |
| Kogral       | `rig-core`                 | Embeddings only        | Different stack     |

### Identified Problems

#### 1. Code Duplication

Provisioning reimplements what TypeDialog already has:

- reqwest HTTP client
- Headers: `x-api-key`, `anthropic-version`
- JSON body formatting
- Response parsing
- Error handling

**Impact**: ~500 duplicated lines; bugs fixed in one place don't propagate to the others.

#### 2. API Keys Only, No CLI Detection

No project detects credentials from the official CLIs:

```text
Claude CLI: ~/.config/claude/credentials.json
OpenAI CLI: ~/.config/openai/credentials.json
```

**Impact**: Users with Claude Pro/Max ($20-100/month) pay for API tokens when they could use their subscription.

#### 3. No Automatic Fallback

When a provider fails (rate limit, timeout), the request fails completely:

```text
Current: Request → Claude API → Rate Limit → ERROR
Desired: Request → Claude API → Rate Limit → OpenAI → Success
```

#### 4. No Circuit Breaker

If the Claude API is down, every request attempts to connect, fails, and propagates the error:

```text
Request 1 → Claude → Timeout (30s) → Error
Request 2 → Claude → Timeout (30s) → Error
Request 3 → Claude → Timeout (30s) → Error
```

**Impact**: Accumulated latency, degraded UX.

#### 5. No Caching

Identical requests always go to the API:

```text
"Explain this Rust error" → Claude → $0.003
"Explain this Rust error" → Claude → $0.003 (same result)
```

**Impact**: Unnecessary costs, especially in development/testing.

#### 6. Kogral Not Integrated

Kogral has guidelines and patterns that could enrich LLM context, but there is no integration.

## Decision

Create `stratum-llm` as a unified crate that:

1. **Consolidates** existing implementations from typedialog-ai and provisioning
2. **Detects** CLI credentials and subscriptions before using API keys (see the sketch below)
3. **Implements** automatic fallback with circuit breaker
4. **Adds** request caching to reduce costs
5. **Integrates** Kogral for context enrichment
6. **Is used** by all ecosystem projects
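Decision point 2 is where most of the cost savings come from, so it is worth making concrete. The following is a minimal sketch of what credential detection could look like, using the CLI paths listed in the Context section; the `Credential` type, the function names, and the `ANTHROPIC_API_KEY`/`OPENAI_API_KEY`/`DEEPSEEK_API_KEY` environment variables are illustrative assumptions, not the final stratum-llm API.

```rust
use std::env;
use std::path::PathBuf;

/// Credential sources, ordered from cheapest to most expensive to use.
#[derive(Debug)]
enum Credential {
    ClaudeCli(PathBuf),                             // Claude Pro/Max subscription via CLI login
    OpenAiCli(PathBuf),                             // OpenAI CLI login
    ApiKey { provider: &'static str, key: String }, // pay-per-token API key
    OllamaLocal,                                    // localhost:11434, free
}

fn home() -> Option<PathBuf> {
    env::var_os("HOME").map(PathBuf::from)
}

/// Collect candidate credentials in priority order:
/// CLI subscriptions first, API keys next, local Ollama as the final fallback.
fn detect_credentials() -> Vec<Credential> {
    let mut found = Vec::new();

    if let Some(home) = home() {
        let claude = home.join(".config/claude/credentials.json");
        if claude.exists() {
            found.push(Credential::ClaudeCli(claude));
        }
        let openai = home.join(".config/openai/credentials.json");
        if openai.exists() {
            found.push(Credential::OpenAiCli(openai));
        }
    }

    for (provider, var) in [
        ("claude", "ANTHROPIC_API_KEY"),
        ("openai", "OPENAI_API_KEY"),
        ("deepseek", "DEEPSEEK_API_KEY"),
    ] {
        if let Ok(key) = env::var(var) {
            found.push(Credential::ApiKey { provider, key });
        }
    }

    // A real implementation would probe http://localhost:11434 instead of
    // assuming Ollama is running.
    found.push(Credential::OllamaLocal);
    found
}

fn main() {
    let chain = detect_credentials();
    println!("{} credential source(s) detected", chain.len());
    for credential in &chain {
        match credential {
            Credential::ApiKey { provider, key } => {
                println!("- API key for {provider} ({} chars)", key.len());
            }
            other => println!("- {other:?}"),
        }
    }
}
```

The priority order returned here is what the `ProviderChain` in the architecture below would consume: CLI subscriptions first, API keys next, and the free local Ollama endpoint as the final fallback.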
### Architecture

```text
┌──────────────────────────────────────────────────┐
│                   stratum-llm                    │
├──────────────────────────────────────────────────┤
│  CredentialDetector                              │
│  ├─ Claude CLI → ~/.config/claude/ (subscription)│
│  ├─ OpenAI CLI → ~/.config/openai/               │
│  ├─ Env vars   → *_API_KEY                       │
│  └─ Ollama     → localhost:11434 (free)          │
│          │                                       │
│          ▼                                       │
│  ProviderChain (ordered by priority)             │
│  [CLI/Sub] → [API] → [DeepSeek] → [Ollama]       │
│      │         │         │           │           │
│      └─────────┴─────────┴───────────┘           │
│                      │                           │
│         CircuitBreaker per provider              │
│                      │                           │
│                RequestCache                      │
│                      │                           │
│              KogralIntegration                   │
│                      │                           │
│               UnifiedClient                      │
│                                                  │
└──────────────────────────────────────────────────┘
```

## Rationale

### Why Not Use Another External Crate

| Alternative    | Why Not                                    |
| -------------- | ------------------------------------------ |
| kaccy-ai       | Oriented toward blockchain/fraud detection |
| llm (crate)    | Very basic, no circuit breaker or caching  |
| langchain-rust | Python port, not idiomatic Rust            |
| rig-core       | Embeddings/RAG only, no chat completion    |

**Best option**: Build on typedialog-ai and add the missing features.

### Why CLI Detection is Important

Cost analysis for a typical user:

| Scenario                  | Monthly Cost        |
| ------------------------- | ------------------- |
| API only (current)        | ~$840               |
| Claude Pro + API overflow | ~$20 + ~$200 = $220 |
| Claude Max + API overflow | ~$100 + ~$50 = $150 |

**Potential savings**: 70-80% by detecting and using subscriptions first.

### Why Circuit Breaker

Without a circuit breaker, a downed provider causes:

- N requests × 30s timeout = N×30s total latency
- All resources occupied waiting for timeouts

With a circuit breaker (sketched below):

- First failure opens the circuit
- Following requests fail immediately (fast fail)
- Fallback to another provider without waiting
- Circuit resets after a cooldown
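A minimal sketch of that behavior, assuming a consecutive-failure threshold and a fixed cooldown; the `CircuitBreaker` type and its parameters are illustrative, not the final stratum-llm API.

```rust
use std::time::{Duration, Instant};

/// Circuit state: Closed lets requests through, Open fails fast until a cooldown passes.
enum State {
    Closed { consecutive_failures: u32 },
    Open { since: Instant },
}

struct CircuitBreaker {
    state: State,
    failure_threshold: u32,
    cooldown: Duration,
}

impl CircuitBreaker {
    fn new(failure_threshold: u32, cooldown: Duration) -> Self {
        Self {
            state: State::Closed { consecutive_failures: 0 },
            failure_threshold,
            cooldown,
        }
    }

    /// Returns false while the circuit is open, so the caller can skip this
    /// provider immediately and try the next one in the chain.
    fn allow_request(&mut self) -> bool {
        match self.state {
            State::Closed { .. } => true,
            State::Open { since } if since.elapsed() >= self.cooldown => {
                // Cooldown elapsed: close the circuit and let a probe request through.
                self.state = State::Closed { consecutive_failures: 0 };
                true
            }
            State::Open { .. } => false,
        }
    }

    fn record_success(&mut self) {
        self.state = State::Closed { consecutive_failures: 0 };
    }

    fn record_failure(&mut self) {
        let should_open = match &mut self.state {
            State::Closed { consecutive_failures } => {
                *consecutive_failures += 1;
                *consecutive_failures >= self.failure_threshold
            }
            State::Open { .. } => false,
        };
        if should_open {
            self.state = State::Open { since: Instant::now() };
        }
    }
}

fn main() {
    // Open after 3 consecutive failures, retry the provider after 60 seconds.
    let mut claude = CircuitBreaker::new(3, Duration::from_secs(60));

    for _ in 0..3 {
        claude.record_failure(); // e.g. three 30s timeouts in a row
    }
    // The circuit is now open: requests fail fast instead of waiting 30s each.
    assert!(!claude.allow_request());

    claude.record_success(); // a later successful probe would reset the circuit like this
}
```

Per the architecture above, the UnifiedClient would keep one breaker per provider and skip any provider whose circuit is open, falling straight through to the next entry in the ProviderChain.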
### Why Caching

For typical development:

- The same questions are repeated while iterating
- Testing executes the same prompts multiple times

Estimated cache hit rate: 15-30% in active development.

### Why Kogral Integration

Kogral has language guidelines, domain patterns, and ADRs. Without integration the LLM generates generic code; with integration it generates code that follows project conventions.

## Consequences

### Positive

1. Single source of truth for LLM logic
2. CLI detection reduces costs 70-80%
3. Circuit breaker + fallback = high availability
4. 15-30% fewer requests in development (caching)
5. Kogral improves generation quality
6. Feature-gated: each feature is optional

### Negative

1. **Migration effort**: Refactor Vapora, TypeDialog, Provisioning
2. **New dependency**: Projects depend on stratumiops
3. **CLI auth complexity**: Different credential formats per version
4. **Cache invalidation**: Stale responses if not managed well

### Mitigations

| Negative            | Mitigation                                                   |
| ------------------- | ------------------------------------------------------------ |
| Migration effort    | Re-export a compatible API from typedialog-ai                |
| New dependency      | Local path dependency, not crates.io                         |
| CLI auth complexity | Version detection, fall back to API keys if detection fails  |
| Cache invalidation  | Configurable TTL, bypass option                              |

## Success Metrics

| Metric                   | Current | Target          |
| ------------------------ | ------- | --------------- |
| Duplicated lines of code | ~500    | 0               |
| CLI credential detection | 0%      | 100%            |
| Fallback success rate    | 0%      | >90%            |
| Cache hit rate           | 0%      | 15-30%          |
| Latency (provider down)  | 30s+    | <1s (fast fail) |

## Cost Impact Analysis

Based on real usage data ($840/month); the individual savings overlap, so they are not simply additive:

| Scenario                   | Savings            |
| -------------------------- | ------------------ |
| CLI detection (Claude Max) | ~$700/month        |
| Caching (15% hit rate)     | ~$50/month         |
| DeepSeek fallback for code | ~$100/month        |
| **Total potential**        | **$500-700/month** |

## Migration Strategy

### Migration Phases

1. Create stratum-llm with an API compatible with typedialog-ai
2. typedialog-ai re-exports stratum-llm (backward compatible)
3. Vapora migrates to stratum-llm directly
4. Provisioning migrates its LlmClient to stratum-llm
5. Deprecate typedialog-ai, consolidate in stratum-llm

### Feature Adoption

| Feature         | Adoption                                 |
| --------------- | ---------------------------------------- |
| Basic providers | Immediate (direct replacement)           |
| CLI detection   | Optional, feature flag                   |
| Circuit breaker | Default on                               |
| Caching         | Default on, configurable TTL             |
| Kogral          | Feature flag, requires Kogral installed  |

## Alternatives Considered

### Alternative 1: Improve typedialog-ai In-Place

**Pros**: No new crate required

**Cons**: TypeDialog is a specific project, not shared infrastructure

**Decision**: stratum-llm in stratumiops is a better location for cross-project infrastructure.

### Alternative 2: Use LiteLLM (Python) as Proxy

**Pros**: Very complete, 100+ providers

**Cons**: Python dependency, proxy latency, not Rust-native

**Decision**: Keep a pure Rust stack.

### Alternative 3: Each Project Maintains Its Own Implementation

**Pros**: Independence

**Cons**: Duplication, inconsistency, bugs not shared

**Decision**: Consolidation is better long-term.

## References

**Existing Implementations**:

- TypeDialog: `typedialog/crates/typedialog-ai/`
- Vapora: `vapora/crates/vapora-llm-router/`
- Provisioning: `provisioning/platform/crates/rag/`

**Kogral**: `kogral/`

**Target Location**: `stratumiops/crates/stratum-llm/`