# ADR-015: AI Integration Architecture for Intelligent Infrastructure Provisioning
## Status
**Accepted** - 2025-01-08
## Context
The provisioning platform has evolved to include complex workflows for infrastructure configuration, deployment, and management.
Current interaction patterns require deep technical knowledge of Nickel schemas, cloud provider APIs, networking concepts, and security best practices.
This creates barriers to entry and slows down infrastructure provisioning for operators who are not infrastructure experts.
### The Infrastructure Complexity Problem
**Current state challenges**:
1. **Knowledge Barrier**: Deep Nickel, cloud, and networking expertise required
- Understanding Nickel type system and contracts
- Knowing cloud provider resource relationships
- Configuring security policies correctly
- Debugging deployment failures
2. **Manual Configuration**: All configs hand-written
- Repetitive boilerplate for common patterns
- Easy to make mistakes (typos, missing fields)
- No intelligent suggestions or autocomplete
- Trial-and-error debugging
3. **Limited Assistance**: No contextual help
- Documentation is separate from workflow
- No explanation of validation errors
- No suggestions for fixing issues
- No learning from past deployments
4. **Troubleshooting Difficulty**: Manual log analysis
- Deployment failures require expert analysis
- No automated root cause detection
- No suggested fixes based on similar issues
- Long time-to-resolution
### AI Integration Opportunities
1. **Natural Language to Configuration**:
- User: "Create a production PostgreSQL cluster with encryption and daily backups"
- AI: Generates validated Nickel configuration
2. **AI-Assisted Form Filling**:
- User starts typing in typdialog web form
- AI suggests values based on context
- AI explains validation errors in plain language
3. **Intelligent Troubleshooting**:
- Deployment fails
- AI analyzes logs and suggests fixes
- AI generates corrected configuration
4. **Configuration Optimization**:
- AI analyzes workload patterns
- AI suggests performance improvements
- AI detects security misconfigurations
5. **Learning from Operations**:
- AI indexes past deployments
- AI suggests configurations based on similar workloads
- AI predicts potential issues
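For illustration, the natural-language opportunity above might yield a fragment along these lines. The field names are hypothetical, not taken from the platform's actual schemas:

```nickel
# Hypothetical output for: "Create a production PostgreSQL cluster
# with encryption and daily backups" (illustrative field names)
{
  database = {
    engine = 'postgres,
    version = "16",
    storage_gb = 100,
    encryption_enabled = true,
    backup_schedule = "daily",
    backup_retention_days = 7,
  }
}
```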
### AI Components Overview
The system integrates multiple AI components:
1. **typdialog-ai**: AI-assisted form interactions
2. **typdialog-ag**: AI agents for autonomous operations
3. **typdialog-prov-gen**: AI-powered configuration generation
4. **platform/crates/ai-service**: Core AI service backend
5. **platform/crates/mcp-server**: Model Context Protocol server
6. **platform/crates/rag**: Retrieval-Augmented Generation system
### Requirements for AI Integration
- **Natural Language Understanding**: Parse user intent from free-form text
- **Schema-Aware Generation**: Generate valid Nickel configurations
- **Context Retrieval**: Access documentation, schemas, past deployments
- **Security Enforcement**: Cedar policies control AI access
- **Human-in-the-Loop**: All AI actions require human approval
- **Audit Trail**: Complete logging of AI operations
- **Multi-Provider Support**: OpenAI, Anthropic, local models
- **Cost Control**: Rate limiting and budget management
- **Observability**: Trace AI decisions and reasoning
## Decision
Integrate a **comprehensive AI system** consisting of:
1. **AI-Assisted Interfaces** (typdialog-ai)
2. **Autonomous AI Agents** (typdialog-ag)
3. **AI Configuration Generator** (typdialog-prov-gen)
4. **Core AI Infrastructure** (ai-service, mcp-server, rag)
All AI components are **schema-aware**, **security-enforced**, and **human-supervised**.
### Architecture Diagram
```text
┌─────────────────────────────────────────────────────────────────┐
│ User Interfaces │
│ │
│ Natural Language: "Create production K8s cluster in AWS" │
│ Typdialog Forms: AI-assisted field suggestions │
│ CLI: provisioning ai generate-config "description" │
└────────────┬────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ AI Frontend Layer │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ typdialog-ai (AI-Assisted Forms) │ │
│ │ - Natural language form filling │ │
│ │ - Real-time AI suggestions │ │
│ │ - Validation error explanations │ │
│ │ - Context-aware autocomplete │ │
│ ├───────────────────────────────────────────────────────┤ │
│ │ typdialog-ag (AI Agents) │ │
│ │ - Autonomous task execution │ │
│ │ - Multi-step workflow automation │ │
│ │ - Learning from feedback │ │
│ │ - Agent collaboration │ │
│ ├───────────────────────────────────────────────────────┤ │
│ │ typdialog-prov-gen (Config Generator) │ │
│ │ - Natural language → Nickel config │ │
│ │ - Template-based generation │ │
│ │ - Best practice injection │ │
│ │ - Validation and refinement │ │
│ └───────────────────────────────────────────────────────┘ │
└────────────┬────────────────────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────┐
│ Core AI Infrastructure (platform/crates/) │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ ai-service (Central AI Service) │ │
│ │ │ │
│ │ - Request routing and orchestration │ │
│ │ - Authentication and authorization (Cedar) │ │
│ │ - Rate limiting and cost control │ │
│ │ - Caching and optimization │ │
│ │ - Audit logging and observability │ │
│ │ - Multi-provider abstraction │ │
│ └─────────────┬─────────────────────┬───────────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ mcp-server │ │ rag │ │
│ │ (Model Context │ │ (Retrieval-Aug Gen) │ │
│ │ Protocol) │ │ │ │
│ │ │ │ ┌─────────────────┐ │ │
│ │ - LLM integration │ │ │ Vector Store │ │ │
│ │ - Tool calling │ │ │ (Qdrant/Milvus) │ │ │
│ │ - Context mgmt │ │ └─────────────────┘ │ │
│ │ - Multi-provider │ │ ┌─────────────────┐ │ │
│ │ (OpenAI, │ │ │ Embeddings │ │ │
│ │ Anthropic, │ │ │ (text-embed) │ │ │
│ │ Local models) │ │ └─────────────────┘ │ │
│ │ │ │ ┌─────────────────┐ │ │
│ │ Tools: │ │ │ Index: │ │ │
│ │ - nickel_validate │ │ │ - Nickel schemas│ │ │
│ │ - schema_query │ │ │ - Documentation │ │ │
│ │ - config_generate │ │ │ - Past deploys │ │ │
│ │ - cedar_check │ │ │ - Best practices│ │ │
│ └─────────────────────┘ │ └─────────────────┘ │ │
│ │ │ │
│ │ Query: "How to │ │
│ │ configure Postgres │ │
│ │ with encryption?" │ │
│ │ │ │
│ │ Retrieval: Relevant │ │
│ │ docs + examples │ │
│ └─────────────────────┘ │
└────────────┬───────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Integration Points │
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌─────────────────────┐ │
│ │ Nickel │ │ SecretumVault│ │ Cedar Authorization │ │
│ │ Validation │ │ (Secrets) │ │ (AI Policies) │ │
│ └─────────────┘ └──────────────┘ └─────────────────────┘ │
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌─────────────────────┐ │
│ │ Orchestrator│ │ Typdialog │ │ Audit Logging │ │
│ │ (Deploy) │ │ (Forms) │ │ (All AI Ops) │ │
│ └─────────────┘ └──────────────┘ └─────────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────┐
│ Output: Validated Nickel Configuration │
│ │
│ ✅ Schema-validated │
│ ✅ Security-checked (Cedar policies) │
│ ✅ Human-approved │
│ ✅ Audit-logged │
│ ✅ Ready for deployment │
└─────────────────────────────────────────────────────────────────┘
```
### Component Responsibilities
**typdialog-ai** (AI-Assisted Forms):
- Real-time form field suggestions based on context
- Natural language form filling
- Validation error explanations in plain English
- Context-aware autocomplete for configuration values
- Integration with typdialog web UI
**typdialog-ag** (AI Agents):
- Autonomous task execution (multi-step workflows)
- Agent collaboration (multiple agents working together)
- Learning from user feedback and past operations
- Goal-oriented behavior (achieve outcome, not just execute steps)
- Safety boundaries (cannot deploy without approval)
**typdialog-prov-gen** (Config Generator):
- Natural language → Nickel configuration
- Template-based generation with customization
- Best practice injection (security, performance, HA)
- Iterative refinement based on validation feedback
- Integration with Nickel schema system
**ai-service** (Core AI Service):
- Central request router for all AI operations
- Authentication and authorization (Cedar policies)
- Rate limiting and cost control
- Caching (reduce LLM API calls)
- Audit logging (all AI operations)
- Multi-provider abstraction (OpenAI, Anthropic, local)
**mcp-server** (Model Context Protocol):
- LLM integration (OpenAI, Anthropic, local models)
- Tool calling framework (nickel_validate, schema_query, etc.)
- Context management (conversation history, schemas)
- Streaming responses for real-time feedback
- Error handling and retries
**rag** (Retrieval-Augmented Generation):
- Vector store (Qdrant/Milvus) for embeddings
- Document indexing (Nickel schemas, docs, deployments)
- Semantic search (find relevant context)
- Embedding generation (text-embedding-3-large)
- Query expansion and reranking
## Rationale
### Why AI Integration Is Essential
| Aspect | Manual Config | AI-Assisted (chosen) |
| -------- | --------------- | ---------------------- |
| **Learning Curve** | 🔴 Steep | 🟢 Gentle |
| **Time to Deploy** | 🔴 Hours | 🟢 Minutes |
| **Error Rate** | 🔴 High | 🟢 Low (validated) |
| **Documentation Access** | 🔴 Separate | 🟢 Contextual |
| **Troubleshooting** | 🔴 Manual | 🟢 AI-assisted |
| **Best Practices** | ⚠️ Manual enforcement | ✅ Auto-injected |
| **Consistency** | ⚠️ Varies by operator | ✅ Standardized |
| **Scalability** | 🔴 Limited by expertise | 🟢 AI scales knowledge |
### Why Schema-Aware AI Is Critical
Traditional AI code generation fails for infrastructure because:
```text
Generic AI (like GitHub Copilot):
❌ Generates syntactically correct but semantically wrong configs
❌ Doesn't understand cloud provider constraints
❌ No validation against schemas
❌ No security policy enforcement
❌ Hallucinated resource names/IDs
```
**Schema-aware AI** (our approach):
```nickel
# Nickel schema provides ground truth
{
  Database = {
    engine | [| 'postgres, 'mysql, 'mongodb |],
    version | String,
    storage_gb | Number,
    backup_retention_days | Number,
  }
}

# AI generates ONLY valid configs. AI knows:
# - Valid engine values ('postgres', not 'postgresql')
# - Required fields (all listed above)
# - Type constraints (storage_gb is Number, not String)
# - Nickel contracts (if defined)
```
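A minimal sketch of what "cannot generate invalid configs" means in practice: candidate values are checked against the schema's enum before a config is accepted. The checker below is illustrative only — real enforcement happens through Nickel validation, not hand-rolled Rust:

```rust
// Illustrative checker mirroring the Nickel enum [| 'postgres, 'mysql, 'mongodb |].
// Names here are assumptions, not the platform's real API.
const VALID_ENGINES: [&str; 3] = ["postgres", "mysql", "mongodb"];

/// Returns Ok(()) only for engine values the schema admits.
pub fn check_engine(engine: &str) -> Result<(), String> {
    if VALID_ENGINES.contains(&engine) {
        Ok(())
    } else {
        Err(format!(
            "invalid engine '{engine}'; schema allows one of {VALID_ENGINES:?}"
        ))
    }
}
```

With this gate in place, the classic hallucination `"postgresql"` is rejected before it ever reaches a deployment plan.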
**Result**: AI cannot generate invalid configs.
### Why RAG (Retrieval-Augmented Generation) Is Essential
LLMs alone have limitations:
```text
Pure LLM:
❌ Knowledge cutoff (no recent updates)
❌ Hallucinations (invents plausible-sounding configs)
❌ No project-specific knowledge
❌ No access to past deployments
```
**RAG-enhanced LLM**:
```text
Query: "How to configure Postgres with encryption?"
RAG retrieves:
- Nickel schema: provisioning/schemas/database.ncl
- Documentation: docs/user/database-encryption.md
- Past deployment: workspaces/prod/postgres-encrypted.ncl
- Best practice: .claude/patterns/secure-database.md
LLM generates answer WITH retrieved context:
✅ Accurate (based on actual schemas)
✅ Project-specific (uses our patterns)
✅ Proven (learned from past deployments)
✅ Secure (follows our security guidelines)
```
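The retrieval step itself reduces to nearest-neighbor search over embedding vectors. A toy in-memory version of that search (the real system uses an embedding model plus Qdrant/Milvus; `top_k` is an illustrative stand-in, not the rag crate's API):

```rust
// Cosine similarity between two equal-length embedding vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

/// Return indices of the k documents most similar to the query embedding.
pub fn top_k(query: &[f32], docs: &[Vec<f32>], k: usize) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = docs
        .iter()
        .enumerate()
        .map(|(i, d)| (i, cosine(query, d)))
        .collect();
    // Sort descending by similarity; embeddings are finite, so unwrap is safe here.
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.into_iter().take(k).map(|(i, _)| i).collect()
}
```

The retrieved documents are then concatenated into the LLM prompt as context, which is what grounds the answer in actual schemas and past deployments.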
### Why Human-in-the-Loop Is Non-Negotiable
AI-generated infrastructure configs require human approval:
```rust
// All AI operations require approval (sketch; supporting types elided)
pub async fn ai_generate_config(request: GenerateRequest) -> Result<Config> {
    let ai_generated = ai_service.generate(request).await?;

    // Validate against Nickel schema
    let validation = nickel_validate(&ai_generated)?;
    if !validation.is_valid() {
        return Err(AIError::InvalidGeneration(validation.errors));
    }

    // Check Cedar policies
    let authorized = cedar_authorize(&current_user(), "approve_ai_config", &ai_generated)?;
    if !authorized {
        return Err(AIError::PermissionDenied);
    }

    // Require explicit human approval
    let approval = prompt_user_approval(&ai_generated).await?;
    if !approval.approved {
        audit_log("AI config rejected by user", &ai_generated);
        return Err(AIError::RejectedByUser);
    }

    audit_log("AI config approved by user", &ai_generated);
    Ok(ai_generated)
}
```
**Why**:
- Infrastructure changes have real-world cost and security impact
- AI can make mistakes (hallucinations, misunderstandings)
- Compliance requires human accountability
- Learning opportunity (human reviews teach AI)
### Why Multi-Provider Support Matters
No single LLM provider is best for all tasks:
| Provider | Best For | Considerations |
| ---------- | ---------- | ---------------- |
| **Anthropic (Claude)** | Long context, accuracy | ✅ Best for complex configs |
| **OpenAI (GPT-4)** | Tool calling, speed | ✅ Best for quick suggestions |
| **Local (Llama, Mistral)** | Privacy, cost | ✅ Best for air-gapped envs |
**Strategy**:
- Complex config generation → Claude (long context)
- Real-time form suggestions → GPT-4 (fast)
- Air-gapped deployments → Local models (privacy)
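The routing strategy above can be expressed as a simple dispatch. The enum and function names are illustrative, not the ai-service's real types:

```rust
// Hypothetical provider-routing sketch for the strategy table above.
#[derive(Debug, PartialEq)]
pub enum Provider {
    Claude,
    Gpt4,
    Local,
}

pub enum Task {
    ComplexConfigGeneration,
    RealtimeSuggestion,
    AirGapped,
}

/// Pick a provider per task type, following the strategy table.
pub fn route(task: &Task) -> Provider {
    match task {
        Task::ComplexConfigGeneration => Provider::Claude, // long context, accuracy
        Task::RealtimeSuggestion => Provider::Gpt4,        // low latency
        Task::AirGapped => Provider::Local,                // privacy, no egress
    }
}
```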
## Consequences
### Positive
- **Accessibility**: Non-experts can provision infrastructure
- **Productivity**: 10x faster configuration creation
- **Quality**: AI injects best practices automatically
- **Consistency**: Standardized configurations across teams
- **Learning**: Users learn from AI explanations
- **Troubleshooting**: AI-assisted debugging reduces MTTR
- **Documentation**: Contextual help embedded in workflow
- **Safety**: Schema validation prevents invalid configs
- **Security**: Cedar policies control AI access
- **Auditability**: Complete trail of AI operations
### Negative
- **Dependency**: Requires LLM API access (or local models)
- **Cost**: LLM API calls have per-token cost
- **Latency**: AI responses take 1-5 seconds
- **Accuracy**: AI can still make mistakes (needs validation)
- **Trust**: Users must understand AI limitations
- **Complexity**: Additional infrastructure to operate
- **Privacy**: Configs sent to LLM providers (unless local)
### Mitigation Strategies
**Cost Control**:
```toml
[ai.rate_limiting]
requests_per_minute = 60
tokens_per_day = 1000000
cost_limit_per_day = "100.00" # USD
[ai.caching]
enabled = true
ttl = "1h"
# Cache similar queries to reduce API calls
```
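The `cost_limit_per_day` setting implies a budget gate checked before each LLM call. A minimal sketch, with hypothetical names:

```rust
// Hypothetical daily-budget gate; the real ai-service tracks spend per user.
pub struct Budget {
    pub spent_today_usd: f64,
    pub limit_usd: f64,
}

impl Budget {
    /// Would this request's estimated cost exceed the daily cap?
    pub fn allows(&self, estimated_cost_usd: f64) -> bool {
        self.spent_today_usd + estimated_cost_usd <= self.limit_usd
    }
}
```

A request whose estimated token cost would push the day's spend past the cap is rejected (or queued) rather than sent to the provider.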
**Latency Optimization**:
```rust
// Streaming responses for real-time feedback
pub async fn ai_generate_stream(request: GenerateRequest) -> impl Stream<Item = String> {
    ai_service
        .generate_stream(request)
        .await
        .map(|chunk| chunk.text)
}
```
**Privacy (Local Models)**:
```toml
[ai]
provider = "local"
model_path = "/opt/provisioning/models/llama-3-70b"
# No data leaves the network
```
**Validation (Defense in Depth)**:
```text
AI generates config
        ↓
Nickel schema validation (syntax, types, contracts)
        ↓
Cedar policy check (security, compliance)
        ↓
Human approval (final gate)
        ↓
Deployment
```
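The same gate sequence as a sketch, where any stage short-circuits the pipeline (type names are illustrative):

```rust
/// Which gate rejected the config, if any. Illustrative types only.
#[derive(Debug, PartialEq)]
pub enum Gate {
    Schema,
    Policy,
    Human,
}

/// Run the defense-in-depth gates in order; the first failure wins.
pub fn run_gates(schema_ok: bool, policy_ok: bool, human_approved: bool) -> Result<(), Gate> {
    if !schema_ok {
        return Err(Gate::Schema);
    }
    if !policy_ok {
        return Err(Gate::Policy);
    }
    if !human_approved {
        return Err(Gate::Human);
    }
    Ok(()) // all gates passed: ready for deployment
}
```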
**Observability**:
```toml
[ai.observability]
trace_all_requests = true
store_conversations = true
conversation_retention = "30d"
# Every AI operation logged:
# - Input prompt
# - Retrieved context (RAG)
# - Generated output
# - Validation results
# - Human approval decision
```
## Alternatives Considered
### Alternative 1: No AI Integration
**Pros**: Simpler, no LLM dependencies
**Cons**: Steep learning curve, slow provisioning, manual troubleshooting
**Decision**: REJECTED - Poor user experience (10x slower provisioning, high error rate)
### Alternative 2: Generic AI Code Generation (GitHub Copilot approach)
**Pros**: Existing tools, well-known UX
**Cons**: Not schema-aware, generates invalid configs, no validation
**Decision**: REJECTED - Inadequate for infrastructure (correctness critical)
### Alternative 3: AI Only for Documentation/Search
**Pros**: Lower risk (AI doesn't generate configs)
**Cons**: Missed opportunity for 10x productivity gains
**Decision**: REJECTED - Too conservative
### Alternative 4: Fully Autonomous AI (No Human Approval)
**Pros**: Maximum automation
**Cons**: Unacceptable risk for infrastructure changes
**Decision**: REJECTED - Safety and compliance requirements
### Alternative 5: Single LLM Provider Lock-in
**Pros**: Simpler integration
**Cons**: Vendor lock-in, no flexibility for different use cases
**Decision**: REJECTED - Multi-provider abstraction provides flexibility
## Implementation Details
### AI Service API
```rust
// platform/crates/ai-service/src/lib.rs
#[async_trait]
pub trait AIService {
    async fn generate_config(
        &self,
        prompt: &str,
        schema: &NickelSchema,
        context: Option<RAGContext>,
    ) -> Result<GeneratedConfig>;

    async fn suggest_field_value(
        &self,
        field: &FieldDefinition,
        partial_input: &str,
        form_context: &FormContext,
    ) -> Result<Vec<Suggestion>>;

    async fn explain_validation_error(
        &self,
        error: &ValidationError,
        config: &Config,
    ) -> Result<Explanation>;

    async fn troubleshoot_deployment(
        &self,
        deployment_id: &str,
        logs: &DeploymentLogs,
    ) -> Result<TroubleshootingReport>;
}

pub struct AIServiceImpl {
    mcp_client: MCPClient,
    rag: RAGService,
    cedar: CedarEngine,
    audit: AuditLogger,
    rate_limiter: RateLimiter,
    cache: Cache,
}

#[async_trait]
impl AIService for AIServiceImpl {
    async fn generate_config(
        &self,
        prompt: &str,
        schema: &NickelSchema,
        context: Option<RAGContext>,
    ) -> Result<GeneratedConfig> {
        // Check authorization
        self.cedar
            .authorize(current_user(), "ai:generate_config", schema)?;

        // Rate limiting
        self.rate_limiter.check(current_user()).await?;

        // Retrieve relevant context via RAG
        let rag_context = match context {
            Some(ctx) => ctx,
            None => self.rag.retrieve(prompt, schema).await?,
        };

        // Generate config via MCP
        let generated = self
            .mcp_client
            .generate(prompt, schema, rag_context, &["nickel_validate", "schema_query"])
            .await?;

        // Validate generated config
        let validation = nickel_validate(&generated.config)?;
        if !validation.is_valid() {
            return Err(AIError::InvalidGeneration(validation.errors));
        }

        // Audit log
        self.audit.log(AIOperation::GenerateConfig {
            user: current_user(),
            prompt: prompt.to_string(),
            schema: schema.name(),
            generated: generated.config.clone(),
            validation: validation.clone(),
        });

        Ok(GeneratedConfig {
            config: generated.config,
            explanation: generated.explanation,
            confidence: generated.confidence,
            validation,
        })
    }
}
```
### MCP Server Integration
```rust
// platform/crates/mcp-server/src/lib.rs
pub struct MCPClient {
    provider: Box<dyn LLMProvider>,
    tools: ToolRegistry,
}

#[async_trait]
pub trait LLMProvider {
    async fn generate(&self, request: GenerateRequest) -> Result<GenerateResponse>;
    // Boxed stream: `impl Trait` is not allowed in trait method returns.
    async fn generate_stream(&self, request: GenerateRequest)
        -> Result<BoxStream<'static, String>>;
}

// Tool definitions for LLM
pub struct ToolRegistry {
    tools: HashMap<String, Tool>,
}

impl ToolRegistry {
    pub fn new() -> Self {
        let mut tools = HashMap::new();

        tools.insert("nickel_validate".to_string(), Tool {
            name: "nickel_validate",
            description: "Validate Nickel configuration against schema",
            parameters: json!({
                "type": "object",
                "properties": {
                    "config": {"type": "string"},
                    "schema_path": {"type": "string"},
                },
                "required": ["config", "schema_path"],
            }),
            handler: Box::new(|params| Box::pin(async move {
                let config = params["config"].as_str().unwrap();
                let schema = params["schema_path"].as_str().unwrap();
                nickel_validate_tool(config, schema).await
            })),
        });

        tools.insert("schema_query".to_string(), Tool {
            name: "schema_query",
            description: "Query Nickel schema for field information",
            parameters: json!({
                "type": "object",
                "properties": {
                    "schema_path": {"type": "string"},
                    "query": {"type": "string"},
                },
                "required": ["schema_path"],
            }),
            handler: Box::new(|params| Box::pin(async move {
                let schema = params["schema_path"].as_str().unwrap();
                let query = params.get("query").and_then(|v| v.as_str());
                schema_query_tool(schema, query).await
            })),
        });

        Self { tools }
    }
}
```
### RAG System Implementation
```rust
// platform/crates/rag/src/lib.rs
pub struct RAGService {
    vector_store: Box<dyn VectorStore>,
    embeddings: EmbeddingModel,
    indexer: DocumentIndexer,
}

impl RAGService {
    pub async fn index_all(&self) -> Result<()> {
        // Index Nickel schemas
        self.index_schemas("provisioning/schemas").await?;
        // Index documentation
        self.index_docs("docs").await?;
        // Index past deployments
        self.index_deployments("workspaces").await?;
        // Index best practices
        self.index_patterns(".claude/patterns").await?;
        Ok(())
    }

    pub async fn retrieve(&self, query: &str, schema: &NickelSchema) -> Result<RAGContext> {
        // Generate query embedding
        let query_embedding = self.embeddings.embed(query).await?;

        // Search vector store
        let results = self
            .vector_store
            .search(query_embedding, 10, Some(json!({ "schema": schema.name() })))
            .await?;

        // Rerank results
        let reranked = self.rerank(query, results).await?;

        // Build context
        Ok(RAGContext {
            query: query.to_string(),
            schema_definition: schema.to_string(),
            relevant_docs: reranked.iter().take(5).map(|r| r.content.clone()).collect(),
            similar_configs: self.find_similar_configs(schema).await?,
            best_practices: self.find_best_practices(schema).await?,
        })
    }
}

#[async_trait]
pub trait VectorStore {
    async fn insert(&self, id: &str, embedding: Vec<f32>, metadata: Value) -> Result<()>;
    async fn search(
        &self,
        embedding: Vec<f32>,
        top_k: usize,
        filter: Option<Value>,
    ) -> Result<Vec<SearchResult>>;
}

// Qdrant implementation
pub struct QdrantStore {
    client: qdrant::QdrantClient,
    collection: String,
}
```
### typdialog-ai Integration
```rust
// typdialog-ai/src/form_assistant.rs
pub struct FormAssistant {
    ai_service: Arc<dyn AIService>,
}

impl FormAssistant {
    pub async fn suggest_field_value(
        &self,
        field: &FieldDefinition,
        partial_input: &str,
        form_context: &FormContext,
    ) -> Result<Vec<Suggestion>> {
        self.ai_service
            .suggest_field_value(field, partial_input, form_context)
            .await
    }

    pub async fn explain_error(
        &self,
        error: &ValidationError,
        config: &Config,
    ) -> Result<String> {
        let explanation = self
            .ai_service
            .explain_validation_error(error, config)
            .await?;
        Ok(format!(
            "Error: {}\nExplanation: {}\nSuggested fix: {}",
            error.message, explanation.plain_english, explanation.suggested_fix,
        ))
    }

    pub async fn fill_from_natural_language(
        &self,
        description: &str,
        form_schema: &FormSchema,
    ) -> Result<HashMap<String, Value>> {
        let prompt = format!(
            "User wants to: {}\nForm schema: {}\nGenerate field values:",
            description,
            serde_json::to_string_pretty(form_schema)?,
        );
        let generated = self
            .ai_service
            .generate_config(&prompt, &form_schema.nickel_schema, None)
            .await?;
        Ok(generated.field_values)
    }
}
```
### typdialog-ag Agents
```rust
// typdialog-ag/src/agent.rs
pub struct ProvisioningAgent {
    ai_service: Arc<dyn AIService>,
    orchestrator: Arc<OrchestratorClient>,
    max_iterations: usize,
}

impl ProvisioningAgent {
    pub async fn execute_goal(&self, goal: &str) -> Result<AgentResult> {
        let mut state = AgentState::new(goal);

        for _iteration in 0..self.max_iterations {
            // AI determines next action
            let action = self.ai_service.agent_next_action(&state).await?;

            // Execute action (with human approval for critical operations)
            let result = self.execute_action(&action, &state).await?;

            // Update state
            state.update(action, result);

            // Check if goal achieved
            if state.goal_achieved() {
                return Ok(AgentResult::Success(state));
            }
        }

        Err(AgentError::MaxIterationsReached)
    }

    async fn execute_action(
        &self,
        action: &AgentAction,
        state: &AgentState,
    ) -> Result<ActionResult> {
        match action {
            AgentAction::GenerateConfig { description } => {
                let config = self
                    .ai_service
                    .generate_config(description, &state.target_schema, Some(state.context.clone()))
                    .await?;
                Ok(ActionResult::ConfigGenerated(config))
            }
            AgentAction::Deploy { config } => {
                // Require human approval for deployment
                let approval =
                    prompt_user_approval("Agent wants to deploy. Approve?", config).await?;
                if !approval.approved {
                    return Ok(ActionResult::DeploymentRejected);
                }
                let deployment = self.orchestrator.deploy(config).await?;
                Ok(ActionResult::Deployed(deployment))
            }
            AgentAction::Troubleshoot { deployment_id } => {
                let logs = self.orchestrator.get_logs(deployment_id).await?;
                let report = self
                    .ai_service
                    .troubleshoot_deployment(deployment_id, &logs)
                    .await?;
                Ok(ActionResult::TroubleshootingReport(report))
            }
        }
    }
}
```
### Cedar Policies for AI
```cedar
// AI cannot access secrets without explicit permission
forbid(
  principal == Service::"ai-service",
  action == Action::"read",
  resource in Secret::"*"
);

// AI can generate configs for non-production environments without approval
permit(
  principal == Service::"ai-service",
  action == Action::"generate_config",
  resource in Schema::"*"
) when {
  resource.environment in ["dev", "staging"]
};

// AI config generation for production requires senior engineer approval
permit(
  principal in Group::"senior-engineers",
  action == Action::"approve_ai_config",
  resource in Config::"*"
) when {
  resource.environment == "production" &&
  resource.generated_by == "ai-service"
};

// AI agents cannot deploy without human approval
forbid(
  principal == Service::"ai-agent",
  action == Action::"deploy",
  resource == Infrastructure::"*"
) unless {
  context.human_approved == true
};
```
## Testing Strategy
**Unit Tests**:
```rust
#[tokio::test]
async fn test_ai_config_generation_validates() {
    let ai_service = mock_ai_service();

    let generated = ai_service
        .generate_config("Create a PostgreSQL database with encryption", &postgres_schema(), None)
        .await
        .unwrap();

    // Must validate against schema
    assert!(generated.validation.is_valid());
    assert_eq!(generated.config["engine"], "postgres");
    assert_eq!(generated.config["encryption_enabled"], true);
}

#[tokio::test]
async fn test_ai_cannot_access_secrets() {
    let ai_service = ai_service_with_cedar();

    let result = ai_service.get_secret("database/password").await;

    assert!(result.is_err());
    assert_eq!(result.unwrap_err(), AIError::PermissionDenied);
}
```
**Integration Tests**:
```rust
#[tokio::test]
async fn test_end_to_end_ai_config_generation() {
    // User provides natural language
    let description = "Create a production Kubernetes cluster in AWS with 5 nodes";

    // AI generates config
    let generated = ai_service.generate_config(description).await.unwrap();

    // Nickel validation
    let validation = nickel_validate(&generated.config).await.unwrap();
    assert!(validation.is_valid());

    // Human approval
    let approval = Approval {
        user: "senior-engineer@example.com",
        approved: true,
        timestamp: Utc::now(),
    };

    // Deploy
    let deployment = orchestrator
        .deploy_with_approval(generated.config, approval)
        .await
        .unwrap();
    assert_eq!(deployment.status, DeploymentStatus::Success);
}
```
**RAG Quality Tests**:
```rust
#[tokio::test]
async fn test_rag_retrieval_accuracy() {
    let rag = rag_service();

    // Index test documents
    rag.index_all().await.unwrap();

    // Query
    let context = rag
        .retrieve("How to configure PostgreSQL with encryption?", &postgres_schema())
        .await
        .unwrap();

    // Should retrieve relevant docs
    assert!(context
        .relevant_docs
        .iter()
        .any(|doc| doc.contains("encryption") && doc.contains("postgres")));

    // Should retrieve similar configs
    assert!(!context.similar_configs.is_empty());
}
```
## Security Considerations
**AI Access Control**:
```text
AI Service Permissions (enforced by Cedar):
✅ CAN: Read Nickel schemas
✅ CAN: Generate configurations
✅ CAN: Query documentation
✅ CAN: Analyze deployment logs (sanitized)
❌ CANNOT: Access secrets directly
❌ CANNOT: Deploy without approval
❌ CANNOT: Modify Cedar policies
❌ CANNOT: Access user credentials
```
**Data Privacy**:
```toml
[ai.privacy]
# Sanitize before sending to LLM
sanitize_secrets = true
sanitize_pii = true
sanitize_credentials = true
# What gets sent to LLM:
# ✅ Nickel schemas (public)
# ✅ Documentation (public)
# ✅ Error messages (sanitized)
# ❌ Secret values (never)
# ❌ Passwords (never)
# ❌ API keys (never)
```
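The `sanitize_*` settings imply a redaction pass over any config text before it reaches an LLM provider. A hedged sketch of that idea — the key list and function name are illustrative; a real implementation would use vetted patterns and structured parsing rather than line matching:

```rust
// Illustrative secret-bearing keys; a real sanitizer would use vetted patterns.
const SECRET_KEYS: [&str; 3] = ["password", "api_key", "token"];

/// Replace the value of any known secret key in "key = value" lines with [REDACTED].
pub fn sanitize(config: &str) -> String {
    config
        .lines()
        .map(|line| {
            let lower = line.to_lowercase();
            if SECRET_KEYS.iter().any(|k| lower.trim_start().starts_with(k)) {
                match line.split_once('=') {
                    Some((key, _)) => format!("{}= [REDACTED]", key),
                    None => line.to_string(),
                }
            } else {
                line.to_string()
            }
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```

Non-secret fields (engine, version, storage) pass through untouched, so the LLM still sees enough context to help.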
**Audit Trail**:
```text
// Every AI operation logged
pub struct AIAuditLog {
timestamp: DateTime<Utc>,
user: UserId,
operation: AIOperation,
input_prompt: String,
generated_output: String,
validation_result: ValidationResult,
human_approval: Option<Approval>,
deployment_outcome: Option<DeploymentResult>,
}
```
## Cost Analysis
**Estimated Costs** (per month, based on typical usage):
```text
Assumptions:
- 100 active users
- 10 AI config generations per user per day
- Average prompt: 2000 tokens
- Average response: 1000 tokens
Provider: Anthropic Claude Sonnet
Cost: $3 per 1M input tokens, $15 per 1M output tokens
Monthly cost:
= 100 users × 10 generations × 30 days × (2000 input + 1000 output tokens)
= 100 × 10 × 30 × 3000 tokens
= 90M tokens
= (60M input × $3/1M) + (30M output × $15/1M)
= $180 + $450
= $630/month
With caching (50% hit rate):
= $315/month
```
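The arithmetic above, written as a reusable function (prices in USD per million tokens):

```rust
/// Monthly LLM spend for a uniform usage profile.
pub fn monthly_cost(
    users: f64,
    gens_per_user_per_day: f64,
    days: f64,
    input_tokens_per_gen: f64,
    output_tokens_per_gen: f64,
    input_price_per_m: f64,
    output_price_per_m: f64,
) -> f64 {
    let generations = users * gens_per_user_per_day * days;
    let input_cost = generations * input_tokens_per_gen / 1e6 * input_price_per_m;
    let output_cost = generations * output_tokens_per_gen / 1e6 * output_price_per_m;
    input_cost + output_cost
}
```

Plugging in the assumptions above (100 users, 10 generations/day, 30 days, 2000/1000 tokens, $3/$15 per 1M) reproduces the $630/month figure; a 50% cache hit rate halves it to $315.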
**Cost optimization strategies**:
- Caching (50-80% cost reduction)
- Streaming (lower latency, same cost)
- Local models for non-critical operations (zero marginal cost)
- Rate limiting (prevent runaway costs)
## References
- [Model Context Protocol (MCP)](https://modelcontextprotocol.io/)
- [Anthropic Claude API](https://docs.anthropic.com/claude/reference/getting-started)
- [OpenAI GPT-4 API](https://platform.openai.com/docs/api-reference)
- [Qdrant Vector Database](https://qdrant.tech/)
- [RAG Survey Paper](https://arxiv.org/abs/2312.10997)
- ADR-008: Cedar Authorization (AI access control)
- ADR-011: Nickel Migration (schema-driven AI)
- ADR-013: Typdialog Web UI Backend (AI-assisted forms)
- ADR-014: SecretumVault Integration (AI-secret isolation)
---
**Status**: Accepted
**Last Updated**: 2025-01-08
**Implementation**: Planned (High Priority)
**Estimated Complexity**: Very Complex
**Dependencies**: ADR-008, ADR-011, ADR-013, ADR-014