provisioning/docs/src/architecture/adr/adr-015-ai-integration-architecture.md

ADR-015: AI Integration Architecture for Intelligent Infrastructure Provisioning

Status

Accepted - 2025-01-08

Context

The provisioning platform has evolved to include complex workflows for infrastructure configuration, deployment, and management. Current interaction patterns require deep technical knowledge of Nickel schemas, cloud provider APIs, networking concepts, and security best practices. This creates barriers to entry and slows down infrastructure provisioning for operators who are not infrastructure experts.

The Infrastructure Complexity Problem

Current state challenges:

  1. Knowledge Barrier: Deep Nickel, cloud, and networking expertise required

    • Understanding Nickel type system and contracts
    • Knowing cloud provider resource relationships
    • Configuring security policies correctly
    • Debugging deployment failures
  2. Manual Configuration: All configs hand-written

    • Repetitive boilerplate for common patterns
    • Easy to make mistakes (typos, missing fields)
    • No intelligent suggestions or autocomplete
    • Trial-and-error debugging
  3. Limited Assistance: No contextual help

    • Documentation is separate from workflow
    • No explanation of validation errors
    • No suggestions for fixing issues
    • No learning from past deployments
  4. Troubleshooting Difficulty: Manual log analysis

    • Deployment failures require expert analysis
    • No automated root cause detection
    • No suggested fixes based on similar issues
    • Long time-to-resolution

AI Integration Opportunities

  1. Natural Language to Configuration:

    • User: "Create a production PostgreSQL cluster with encryption and daily backups"
    • AI: Generates validated Nickel configuration
  2. AI-Assisted Form Filling:

    • User starts typing in typdialog web form
    • AI suggests values based on context
    • AI explains validation errors in plain language
  3. Intelligent Troubleshooting:

    • Deployment fails
    • AI analyzes logs and suggests fixes
    • AI generates corrected configuration
  4. Configuration Optimization:

    • AI analyzes workload patterns
    • AI suggests performance improvements
    • AI detects security misconfigurations
  5. Learning from Operations:

    • AI indexes past deployments
    • AI suggests configurations based on similar workloads
    • AI predicts potential issues

AI Components Overview

The system integrates multiple AI components:

  1. typdialog-ai: AI-assisted form interactions
  2. typdialog-ag: AI agents for autonomous operations
  3. typdialog-prov-gen: AI-powered configuration generation
  4. platform/crates/ai-service: Core AI service backend
  5. platform/crates/mcp-server: Model Context Protocol server
  6. platform/crates/rag: Retrieval-Augmented Generation system

Requirements for AI Integration

  • Natural Language Understanding: Parse user intent from free-form text
  • Schema-Aware Generation: Generate valid Nickel configurations
  • Context Retrieval: Access documentation, schemas, past deployments
  • Security Enforcement: Cedar policies control AI access
  • Human-in-the-Loop: All AI actions require human approval
  • Audit Trail: Complete logging of AI operations
  • Multi-Provider Support: OpenAI, Anthropic, local models
  • Cost Control: Rate limiting and budget management
  • Observability: Trace AI decisions and reasoning

Decision

Integrate a comprehensive AI system consisting of:

  1. AI-Assisted Interfaces (typdialog-ai)
  2. Autonomous AI Agents (typdialog-ag)
  3. AI Configuration Generator (typdialog-prov-gen)
  4. Core AI Infrastructure (ai-service, mcp-server, rag)

All AI components are schema-aware, security-enforced, and human-supervised.

Architecture Diagram

┌─────────────────────────────────────────────────────────────────┐
│   User Interfaces                                               │
│                                                                 │
│   Natural Language: "Create production K8s cluster in AWS"     │
│   Typdialog Forms: AI-assisted field suggestions               │
│   CLI: provisioning ai generate-config "description"           │
└────────────┬────────────────────────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────────────────────────┐
│   AI Frontend Layer                                             │
│    ┌───────────────────────────────────────────────────────┐    │
│    │ typdialog-ai (AI-Assisted Forms)                      │    │
│    │ - Natural language form filling                       │    │
│    │ - Real-time AI suggestions                            │    │
│    │ - Validation error explanations                       │    │
│    │ - Context-aware autocomplete                          │    │
│    ├───────────────────────────────────────────────────────┤    │
│    │ typdialog-ag (AI Agents)                              │    │
│    │ - Autonomous task execution                           │    │
│    │ - Multi-step workflow automation                      │    │
│    │ - Learning from feedback                              │    │
│    │ - Agent collaboration                                 │    │
│    ├───────────────────────────────────────────────────────┤    │
│    │ typdialog-prov-gen (Config Generator)                 │    │
│    │ - Natural language → Nickel config                    │    │
│    │ - Template-based generation                           │    │
│    │ - Best practice injection                             │    │
│    │ - Validation and refinement                           │    │
│    └───────────────────────────────────────────────────────┘    │
└────────────┬────────────────────────────────────────────────────┘
             │
             ▼
┌────────────────────────────────────────────────────────────────┐
│   Core AI Infrastructure (platform/crates/)                    │
│   ┌───────────────────────────────────────────────────────┐    │
│   │ ai-service (Central AI Service)                       │    │
│   │                                                       │    │
│   │ - Request routing and orchestration                   │    │
│   │ - Authentication and authorization (Cedar)            │    │
│   │ - Rate limiting and cost control                      │    │
│   │ - Caching and optimization                            │    │
│   │ - Audit logging and observability                     │    │
│   │ - Multi-provider abstraction                          │    │
│   └─────────────┬─────────────────────┬───────────────────┘    │
│                 │                     │                        │
│                 ▼                     ▼                        │
│     ┌─────────────────────┐   ┌─────────────────────┐          │
│     │ mcp-server          │   │ rag                 │          │
│     │ (Model Context      │   │ (Retrieval-Aug Gen) │          │
│     │  Protocol)          │   │                     │          │
│     │                     │   │ ┌─────────────────┐ │          │
│     │ - LLM integration   │   │ │ Vector Store    │ │          │
│     │ - Tool calling      │   │ │ (Qdrant/Milvus) │ │          │
│     │ - Context mgmt      │   │ └─────────────────┘ │          │
│     │ - Multi-provider    │   │ ┌─────────────────┐ │          │
│     │   (OpenAI,          │   │ │ Embeddings      │ │          │
│     │    Anthropic,       │   │ │ (text-embed)    │ │          │
│     │    Local models)    │   │ └─────────────────┘ │          │
│     │                     │   │ ┌─────────────────┐ │          │
│     │ Tools:              │   │ │ Index:          │ │          │
│     │ - nickel_validate   │   │ │ - Nickel schemas│ │          │
│     │ - schema_query      │   │ │ - Documentation │ │          │
│     │ - config_generate   │   │ │ - Past deploys  │ │          │
│     │ - cedar_check       │   │ │ - Best practices│ │          │
│     └─────────────────────┘   │ └─────────────────┘ │          │
│                               │                     │          │
│                               │ Query: "How to      │          │
│                               │ configure Postgres  │          │
│                               │ with encryption?"   │          │
│                               │                     │          │
│                               │ Retrieval: Relevant │          │
│                               │ docs + examples     │          │
│                               └─────────────────────┘          │
└────────────┬───────────────────────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────────────────────────┐
│   Integration Points                                            │
│                                                                 │
│     ┌─────────────┐  ┌──────────────┐  ┌─────────────────────┐  │
│     │ Nickel      │  │ SecretumVault│  │ Cedar Authorization │  │
│     │ Validation  │  │ (Secrets)    │  │ (AI Policies)       │  │
│     └─────────────┘  └──────────────┘  └─────────────────────┘  │
│                                                                 │
│     ┌─────────────┐  ┌──────────────┐  ┌─────────────────────┐  │
│     │ Orchestrator│  │ Typdialog    │  │ Audit Logging       │  │
│     │ (Deploy)    │  │ (Forms)      │  │ (All AI Ops)        │  │
│     └─────────────┘  └──────────────┘  └─────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────────────────────────┐
│   Output: Validated Nickel Configuration                        │
│                                                                 │
│   ✅ Schema-validated                                           │
│   ✅ Security-checked (Cedar policies)                          │
│   ✅ Human-approved                                             │
│   ✅ Audit-logged                                               │
│   ✅ Ready for deployment                                       │
└─────────────────────────────────────────────────────────────────┘

Component Responsibilities

typdialog-ai (AI-Assisted Forms):

  • Real-time form field suggestions based on context
  • Natural language form filling
  • Validation error explanations in plain English
  • Context-aware autocomplete for configuration values
  • Integration with typdialog web UI

typdialog-ag (AI Agents):

  • Autonomous task execution (multi-step workflows)
  • Agent collaboration (multiple agents working together)
  • Learning from user feedback and past operations
  • Goal-oriented behavior (achieve outcome, not just execute steps)
  • Safety boundaries (cannot deploy without approval)

typdialog-prov-gen (Config Generator):

  • Natural language → Nickel configuration
  • Template-based generation with customization
  • Best practice injection (security, performance, HA)
  • Iterative refinement based on validation feedback
  • Integration with Nickel schema system

ai-service (Core AI Service):

  • Central request router for all AI operations
  • Authentication and authorization (Cedar policies)
  • Rate limiting and cost control
  • Caching (reduce LLM API calls)
  • Audit logging (all AI operations)
  • Multi-provider abstraction (OpenAI, Anthropic, local)

mcp-server (Model Context Protocol):

  • LLM integration (OpenAI, Anthropic, local models)
  • Tool calling framework (nickel_validate, schema_query, etc.)
  • Context management (conversation history, schemas)
  • Streaming responses for real-time feedback
  • Error handling and retries

rag (Retrieval-Augmented Generation):

  • Vector store (Qdrant/Milvus) for embeddings
  • Document indexing (Nickel schemas, docs, deployments)
  • Semantic search (find relevant context)
  • Embedding generation (text-embedding-3-large)
  • Query expansion and reranking
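At its core, the retrieval step in `rag` is nearest-neighbour search over embeddings. A minimal self-contained sketch of that ranking step, stdlib only; in the real system this is delegated to the vector store (Qdrant/Milvus), so the functions here are illustrative, not the platform's API:

```rust
// Minimal sketch of semantic search: rank stored document embeddings by
// cosine similarity to a query embedding and return the top-k indices.

fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Return indices of the `k` documents most similar to `query`.
fn top_k(query: &[f32], docs: &[Vec<f32>], k: usize) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = docs
        .iter()
        .enumerate()
        .map(|(i, d)| (i, cosine_similarity(query, d)))
        .collect();
    // Sort by similarity, highest first (no NaNs expected from unit inputs).
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.into_iter().take(k).map(|(i, _)| i).collect()
}

fn main() {
    let docs = vec![
        vec![1.0, 0.0], // doc 0: aligned with the query
        vec![0.0, 1.0], // doc 1: orthogonal
        vec![0.7, 0.7], // doc 2: diagonal
    ];
    let hits = top_k(&[1.0, 0.0], &docs, 2);
    println!("{:?}", hits); // doc 0 first, then doc 2
}
```

Reranking and query expansion then operate on these candidates before the context is handed to the LLM.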

Rationale

Why AI Integration Is Essential

| Aspect | Manual Config | AI-Assisted (chosen) |
|---|---|---|
| Learning Curve | 🔴 Steep | 🟢 Gentle |
| Time to Deploy | 🔴 Hours | 🟢 Minutes |
| Error Rate | 🔴 High | 🟢 Low (validated) |
| Documentation Access | 🔴 Separate | 🟢 Contextual |
| Troubleshooting | 🔴 Manual | 🟢 AI-assisted |
| Best Practices | ⚠️ Manual enforcement | 🟢 Auto-injected |
| Consistency | ⚠️ Varies by operator | 🟢 Standardized |
| Scalability | 🔴 Limited by expertise | 🟢 AI scales knowledge |

Why Schema-Aware AI Is Critical

Traditional AI code generation fails for infrastructure because:

Generic AI (like GitHub Copilot):
❌ Generates syntactically correct but semantically wrong configs
❌ Doesn't understand cloud provider constraints
❌ No validation against schemas
❌ No security policy enforcement
❌ Hallucinated resource names/IDs

Schema-aware AI (our approach):

# Nickel schema provides ground truth
{
  Database = {
    engine | [| 'postgres, 'mysql, 'mongodb |],
    version | String,
    storage_gb | Number,
    backup_retention_days | Number,
  }
}

# AI generates ONLY valid configs
# AI knows:
# - Valid engine values ('postgres', not 'postgresql')
# - Required fields (all listed above)
# - Type constraints (storage_gb is Number, not String)
# - Nickel contracts (if defined)

Result: invalid configs cannot get through — generation is schema-constrained, and every output is validated before use.
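The "schema as ground truth" check can be illustrated with a toy validator for the Database schema above. This stands in for the real `nickel_validate` tool: the allowed engine values and required fields are hard-coded here from the schema fragment, purely for illustration.

```rust
// Toy stand-in for schema validation: the generated `engine` value must be
// one of the enum tags the Nickel schema declares, and every required field
// must be present. Not the real Nickel toolchain.
use std::collections::HashMap;

const ALLOWED_ENGINES: &[&str] = &["postgres", "mysql", "mongodb"];
const REQUIRED_FIELDS: &[&str] =
    &["engine", "version", "storage_gb", "backup_retention_days"];

fn validate(config: &HashMap<&str, &str>) -> Result<(), String> {
    for field in REQUIRED_FIELDS {
        if !config.contains_key(field) {
            return Err(format!("missing required field: {field}"));
        }
    }
    let engine = config["engine"];
    if !ALLOWED_ENGINES.contains(&engine) {
        // catches the classic hallucination: 'postgresql' instead of 'postgres'
        return Err(format!("invalid engine: {engine}"));
    }
    Ok(())
}

fn main() {
    let mut config = HashMap::from([
        ("engine", "postgresql"), // hallucinated engine name
        ("version", "16"),
        ("storage_gb", "100"),
        ("backup_retention_days", "7"),
    ]);
    assert!(validate(&config).is_err());
    config.insert("engine", "postgres");
    assert!(validate(&config).is_ok());
}
```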

Why RAG (Retrieval-Augmented Generation) Is Essential

LLMs alone have limitations:

Pure LLM:
❌ Knowledge cutoff (no recent updates)
❌ Hallucinations (invents plausible-sounding configs)
❌ No project-specific knowledge
❌ No access to past deployments

RAG-enhanced LLM:

Query: "How to configure Postgres with encryption?"

RAG retrieves:
- Nickel schema: provisioning/schemas/database.ncl
- Documentation: docs/user/database-encryption.md
- Past deployment: workspaces/prod/postgres-encrypted.ncl
- Best practice: .claude/patterns/secure-database.md

LLM generates answer WITH retrieved context:
✅ Accurate (based on actual schemas)
✅ Project-specific (uses our patterns)
✅ Proven (learned from past deployments)
✅ Secure (follows our security guidelines)
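Operationally, the retrieved snippets are spliced into the prompt before the LLM call, so the model answers grounded in schema text and retrieved documents rather than from memory alone. A sketch of that assembly step; the template wording is hypothetical, not the service's actual prompt:

```rust
// Sketch of RAG prompt assembly: schema text and retrieved documents are
// embedded in the prompt ahead of the user's request. Template is illustrative.

fn build_prompt(query: &str, schema: &str, retrieved: &[&str]) -> String {
    let mut prompt = String::new();
    prompt.push_str("You are generating Nickel configuration.\n\n");
    prompt.push_str(&format!("Schema (ground truth):\n{schema}\n\n"));
    prompt.push_str("Retrieved context:\n");
    for (i, doc) in retrieved.iter().enumerate() {
        prompt.push_str(&format!("[{}] {}\n", i + 1, doc));
    }
    prompt.push_str(&format!("\nUser request: {query}\n"));
    prompt
}

fn main() {
    let prompt = build_prompt(
        "How to configure Postgres with encryption?",
        "Database = { engine | [| 'postgres, 'mysql, 'mongodb |], ... }",
        &["docs/user/database-encryption.md: enable encrypted storage ..."],
    );
    assert!(prompt.contains("Retrieved context"));
    println!("{prompt}");
}
```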

Why Human-in-the-Loop Is Non-Negotiable

AI-generated infrastructure configs require human approval:

// All AI operations require approval
pub async fn ai_generate_config(request: GenerateRequest) -> Result<Config> {
    let ai_generated = ai_service.generate(request).await?;

    // Validate against Nickel schema
    let validation = nickel_validate(&ai_generated)?;
    if !validation.is_valid() {
        return Err("AI generated invalid config".into());
    }

    // Check Cedar policies
    let authorized = cedar_authorize(
        &user,
        "approve_ai_config",
        &ai_generated,
    )?;
    if !authorized {
        return Err("User not authorized to approve AI config".into());
    }

    // Require explicit human approval
    let approval = prompt_user_approval(&ai_generated).await?;
    if !approval.approved {
        audit_log("AI config rejected by user", &ai_generated);
        return Err("User rejected AI-generated config".into());
    }

    audit_log("AI config approved by user", &ai_generated);
    Ok(ai_generated)
}

Why:

  • Infrastructure changes have real-world cost and security impact
  • AI can make mistakes (hallucinations, misunderstandings)
  • Compliance requires human accountability
  • Learning opportunity (human reviews teach AI)

Why Multi-Provider Support Matters

No single LLM provider is best for all tasks:

| Provider | Best For | Considerations |
|---|---|---|
| Anthropic (Claude) | Long context, accuracy | Best for complex configs |
| OpenAI (GPT-4) | Tool calling, speed | Best for quick suggestions |
| Local (Llama, Mistral) | Privacy, cost | Best for air-gapped envs |

Strategy:

  • Complex config generation → Claude (long context)
  • Real-time form suggestions → GPT-4 (fast)
  • Air-gapped deployments → Local models (privacy)
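The routing strategy above is simple enough to state as code. The mapping below mirrors the table; the enum names are illustrative, not actual platform types:

```rust
// Sketch of per-task provider routing. The match arms encode the strategy:
// long-context work to Claude, low-latency suggestions to GPT-4, air-gapped
// operation to local models.

#[derive(Debug, PartialEq)]
enum Provider { Anthropic, OpenAI, Local }

enum Task { ComplexConfigGeneration, RealtimeFormSuggestion, AirGappedOperation }

fn route(task: &Task) -> Provider {
    match task {
        Task::ComplexConfigGeneration => Provider::Anthropic, // long context
        Task::RealtimeFormSuggestion => Provider::OpenAI,     // low latency
        Task::AirGappedOperation => Provider::Local,          // privacy
    }
}

fn main() {
    assert_eq!(route(&Task::ComplexConfigGeneration), Provider::Anthropic);
    assert_eq!(route(&Task::AirGappedOperation), Provider::Local);
}
```

Keeping the mapping in one place makes it easy to change the policy (or make it configurable) without touching callers.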

Consequences

Positive

  • Accessibility: Non-experts can provision infrastructure
  • Productivity: 10x faster configuration creation
  • Quality: AI injects best practices automatically
  • Consistency: Standardized configurations across teams
  • Learning: Users learn from AI explanations
  • Troubleshooting: AI-assisted debugging reduces MTTR
  • Documentation: Contextual help embedded in workflow
  • Safety: Schema validation prevents invalid configs
  • Security: Cedar policies control AI access
  • Auditability: Complete trail of AI operations

Negative

  • Dependency: Requires LLM API access (or local models)
  • Cost: LLM API calls have per-token cost
  • Latency: AI responses take 1-5 seconds
  • Accuracy: AI can still make mistakes (needs validation)
  • Trust: Users must understand AI limitations
  • Complexity: Additional infrastructure to operate
  • Privacy: Configs sent to LLM providers (unless local)

Mitigation Strategies

Cost Control:

[ai.rate_limiting]
requests_per_minute = 60
tokens_per_day = 1000000
cost_limit_per_day = "100.00"  # USD

[ai.caching]
enabled = true
ttl = "1h"
# Cache similar queries to reduce API calls
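One way the cache could recognize "similar queries" is by keying on a normalized prompt plus the target schema, so repeated or trivially reworded requests reuse a cached completion instead of triggering a new LLM call. A sketch with the stdlib hasher; the production service would presumably use a proper cache layer:

```rust
// Sketch of a cache key for AI responses: hash the normalized prompt together
// with the schema name. Stdlib DefaultHasher used purely for illustration.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn cache_key(prompt: &str, schema_name: &str) -> u64 {
    let normalized = prompt.trim().to_lowercase();
    let mut hasher = DefaultHasher::new();
    normalized.hash(&mut hasher);
    schema_name.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    // Case and surrounding whitespace do not break cache hits...
    assert_eq!(
        cache_key("Create a Postgres cluster", "database"),
        cache_key("  create a postgres cluster ", "database"),
    );
    // ...but a different target schema is a different entry.
    assert_ne!(
        cache_key("Create a Postgres cluster", "database"),
        cache_key("Create a Postgres cluster", "kubernetes"),
    );
}
```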

Latency Optimization:

// Streaming responses for real-time feedback
pub async fn ai_generate_stream(request: GenerateRequest) -> impl Stream<Item = String> {
    ai_service
        .generate_stream(request)
        .await
        .map(|chunk| chunk.text)
}

Privacy (Local Models):

[ai]
provider = "local"
model_path = "/opt/provisioning/models/llama-3-70b"

# No data leaves the network

Validation (Defense in Depth):

AI generates config
  ↓
Nickel schema validation (syntax, types, contracts)
  ↓
Cedar policy check (security, compliance)
  ↓
Human approval (final gate)
  ↓
Deployment
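The gate sequence above can be sketched as a pipeline of fallible steps: each gate may reject the config, and deployment is reachable only after every gate passes. The gate functions here are hypothetical stand-ins for the real validators:

```rust
// Defense-in-depth as a pipeline: run each named gate in order and stop at
// the first rejection. Gate bodies are toy stand-ins for Nickel validation,
// Cedar checks, and the human-approval prompt.

type Gate = fn(&str) -> Result<(), String>;

fn run_pipeline(config: &str, gates: &[(&str, Gate)]) -> Result<(), String> {
    for (name, gate) in gates {
        gate(config).map_err(|e| format!("{name} rejected config: {e}"))?;
    }
    Ok(()) // all gates passed; config may proceed to deployment
}

fn schema_gate(config: &str) -> Result<(), String> {
    if config.contains("engine") { Ok(()) } else { Err("missing engine".into()) }
}

fn policy_gate(_config: &str) -> Result<(), String> { Ok(()) }

fn approval_gate(config: &str) -> Result<(), String> {
    // stands in for the interactive human-approval prompt
    if config.contains("approved") { Ok(()) } else { Err("not approved".into()) }
}

fn main() {
    let gates: &[(&str, Gate)] = &[
        ("nickel", schema_gate),
        ("cedar", policy_gate),
        ("human", approval_gate),
    ];
    assert!(run_pipeline("engine = postgres # approved", gates).is_ok());
    assert!(run_pipeline("engine = postgres", gates).is_err());
}
```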

Observability:

[ai.observability]
trace_all_requests = true
store_conversations = true
conversation_retention = "30d"

# Every AI operation logged:
# - Input prompt
# - Retrieved context (RAG)
# - Generated output
# - Validation results
# - Human approval decision

Alternatives Considered

Alternative 1: No AI Integration

Pros: Simpler, no LLM dependencies
Cons: Steep learning curve, slow provisioning, manual troubleshooting
Decision: REJECTED - Poor user experience (10x slower provisioning, high error rate)

Alternative 2: Generic AI Code Generation (GitHub Copilot approach)

Pros: Existing tools, well-known UX
Cons: Not schema-aware, generates invalid configs, no validation
Decision: REJECTED - Inadequate for infrastructure (correctness critical)

Alternative 3: AI Only for Documentation/Search

Pros: Lower risk (AI doesn't generate configs)
Cons: Missed opportunity for 10x productivity gains
Decision: REJECTED - Too conservative

Alternative 4: Fully Autonomous AI (No Human Approval)

Pros: Maximum automation
Cons: Unacceptable risk for infrastructure changes
Decision: REJECTED - Safety and compliance requirements

Alternative 5: Single LLM Provider Lock-in

Pros: Simpler integration
Cons: Vendor lock-in, no flexibility for different use cases
Decision: REJECTED - Multi-provider abstraction provides flexibility

Implementation Details

AI Service API

// platform/crates/ai-service/src/lib.rs

#[async_trait]
pub trait AIService {
    async fn generate_config(
        &self,
        prompt: &str,
        schema: &NickelSchema,
        context: Option<RAGContext>,
    ) -> Result<GeneratedConfig>;

    async fn suggest_field_value(
        &self,
        field: &FieldDefinition,
        partial_input: &str,
        form_context: &FormContext,
    ) -> Result<Vec<Suggestion>>;

    async fn explain_validation_error(
        &self,
        error: &ValidationError,
        config: &Config,
    ) -> Result<Explanation>;

    async fn troubleshoot_deployment(
        &self,
        deployment_id: &str,
        logs: &DeploymentLogs,
    ) -> Result<TroubleshootingReport>;
}

pub struct AIServiceImpl {
    mcp_client: MCPClient,
    rag: RAGService,
    cedar: CedarEngine,
    audit: AuditLogger,
    rate_limiter: RateLimiter,
    cache: Cache,
}

#[async_trait]
impl AIService for AIServiceImpl {
    async fn generate_config(
        &self,
        prompt: &str,
        schema: &NickelSchema,
        context: Option<RAGContext>,
    ) -> Result<GeneratedConfig> {
        // Check authorization
        self.cedar.authorize(
            &current_user(),
            "ai:generate_config",
            schema,
        )?;

        // Rate limiting
        self.rate_limiter.check(current_user()).await?;

        // Retrieve relevant context via RAG
        let rag_context = match context {
            Some(ctx) => ctx,
            None => self.rag.retrieve(prompt, schema).await?,
        };

        // Generate config via MCP
        let generated = self.mcp_client.generate(
            prompt,
            schema,
            rag_context,
            &["nickel_validate", "schema_query"],
        ).await?;

        // Validate generated config
        let validation = nickel_validate(&generated.config)?;
        if !validation.is_valid() {
            return Err(AIError::InvalidGeneration(validation.errors));
        }

        // Audit log
        self.audit.log(AIOperation::GenerateConfig {
            user: current_user(),
            prompt: prompt,
            schema: schema.name(),
            generated: &generated.config,
            validation: validation,
        });

        Ok(GeneratedConfig {
            config: generated.config,
            explanation: generated.explanation,
            confidence: generated.confidence,
            validation: validation,
        })
    }
}

MCP Server Integration

// platform/crates/mcp-server/src/lib.rs

pub struct MCPClient {
    provider: Box<dyn LLMProvider>,
    tools: ToolRegistry,
}

#[async_trait]
pub trait LLMProvider {
    async fn generate(&self, request: GenerateRequest) -> Result<GenerateResponse>;
    async fn generate_stream(&self, request: GenerateRequest) -> Result<Pin<Box<dyn Stream<Item = String> + Send>>>;
}

// Tool definitions for LLM
pub struct ToolRegistry {
    tools: HashMap<String, Tool>,
}

impl ToolRegistry {
    pub fn new() -> Self {
        let mut tools = HashMap::new();

        tools.insert("nickel_validate".to_string(), Tool {
            name: "nickel_validate",
            description: "Validate Nickel configuration against schema",
            parameters: json!({
                "type": "object",
                "properties": {
                    "config": {"type": "string"},
                    "schema_path": {"type": "string"},
                },
                "required": ["config", "schema_path"],
            }),
            handler: Box::new(|params| async {
                let config = params["config"].as_str().unwrap();
                let schema = params["schema_path"].as_str().unwrap();
                nickel_validate_tool(config, schema).await
            }),
        });

        tools.insert("schema_query".to_string(), Tool {
            name: "schema_query",
            description: "Query Nickel schema for field information",
            parameters: json!({
                "type": "object",
                "properties": {
                    "schema_path": {"type": "string"},
                    "query": {"type": "string"},
                },
                "required": ["schema_path"],
            }),
            handler: Box::new(|params| async {
                let schema = params["schema_path"].as_str().unwrap();
                let query = params.get("query").and_then(|v| v.as_str());
                schema_query_tool(schema, query).await
            }),
        });

        Self { tools }
    }
}

RAG System Implementation

// platform/crates/rag/src/lib.rs

pub struct RAGService {
    vector_store: Box<dyn VectorStore>,
    embeddings: EmbeddingModel,
    indexer: DocumentIndexer,
}

impl RAGService {
    pub async fn index_all(&self) -> Result<()> {
        // Index Nickel schemas
        self.index_schemas("provisioning/schemas").await?;

        // Index documentation
        self.index_docs("docs").await?;

        // Index past deployments
        self.index_deployments("workspaces").await?;

        // Index best practices
        self.index_patterns(".claude/patterns").await?;

        Ok(())
    }

    pub async fn retrieve(
        &self,
        query: &str,
        schema: &NickelSchema,
    ) -> Result<RAGContext> {
        // Generate query embedding
        let query_embedding = self.embeddings.embed(query).await?;

        // Search vector store
        let results = self.vector_store.search(
            query_embedding,
            10,
            Some(json!({ "schema": schema.name() })),
        ).await?;

        // Rerank results
        let reranked = self.rerank(query, results).await?;

        // Build context
        Ok(RAGContext {
            query: query.to_string(),
            schema_definition: schema.to_string(),
            relevant_docs: reranked.iter()
                .take(5)
                .map(|r| r.content.clone())
                .collect(),
            similar_configs: self.find_similar_configs(schema).await?,
            best_practices: self.find_best_practices(schema).await?,
        })
    }
}

#[async_trait]
pub trait VectorStore {
    async fn insert(&self, id: &str, embedding: Vec<f32>, metadata: Value) -> Result<()>;
    async fn search(&self, embedding: Vec<f32>, top_k: usize, filter: Option<Value>) -> Result<Vec<SearchResult>>;
}

// Qdrant implementation
pub struct QdrantStore {
    client: qdrant::QdrantClient,
    collection: String,
}

typdialog-ai Integration

// typdialog-ai/src/form_assistant.rs

pub struct FormAssistant {
    ai_service: Arc<dyn AIService>,
}

impl FormAssistant {
    pub async fn suggest_field_value(
        &self,
        field: &FieldDefinition,
        partial_input: &str,
        form_context: &FormContext,
    ) -> Result<Vec<Suggestion>> {
        self.ai_service.suggest_field_value(
            field,
            partial_input,
            form_context,
        ).await
    }

    pub async fn explain_error(
        &self,
        error: &ValidationError,
        field_value: &str,
    ) -> Result<String> {
        let explanation = self.ai_service.explain_validation_error(
            error,
            field_value,
        ).await?;

        Ok(format!(
            "Error: {}

Explanation: {}

Suggested fix: {}",
            error.message,
            explanation.plain_english,
            explanation.suggested_fix,
        ))
    }

    pub async fn fill_from_natural_language(
        &self,
        description: &str,
        form_schema: &FormSchema,
    ) -> Result<HashMap<String, Value>> {
        let prompt = format!(
            "User wants to: {}

Form schema: {}

Generate field values:",
            description,
            serde_json::to_string_pretty(form_schema)?,
        );

        let generated = self.ai_service.generate_config(
            &prompt,
            &form_schema.nickel_schema,
            None,
        ).await?;

        Ok(generated.field_values)
    }
}

typdialog-ag Agents

// typdialog-ag/src/agent.rs

pub struct ProvisioningAgent {
    ai_service: Arc<dyn AIService>,
    orchestrator: Arc<OrchestratorClient>,
    max_iterations: usize,
}

impl ProvisioningAgent {
    pub async fn execute_goal(&self, goal: &str) -> Result<AgentResult> {
        let mut state = AgentState::new(goal);

        for iteration in 0..self.max_iterations {
            // AI determines next action
            let action = self.ai_service.agent_next_action(&state).await?;

            // Execute action (with human approval for critical operations)
            let result = self.execute_action(&action, &state).await?;

            // Update state
            state.update(action, result);

            // Check if goal achieved
            if state.goal_achieved() {
                return Ok(AgentResult::Success(state));
            }
        }

        Err(AgentError::MaxIterationsReached)
    }

    async fn execute_action(
        &self,
        action: &AgentAction,
        state: &AgentState,
    ) -> Result<ActionResult> {
        match action {
            AgentAction::GenerateConfig { description } => {
                let config = self.ai_service.generate_config(
                    description,
                    &state.target_schema,
                    Some(state.context.clone()),
                ).await?;

                Ok(ActionResult::ConfigGenerated(config))
            },

            AgentAction::Deploy { config } => {
                // Require human approval for deployment
                let approval = prompt_user_approval(
                    "Agent wants to deploy. Approve?",
                    config,
                ).await?;

                if !approval.approved {
                    return Ok(ActionResult::DeploymentRejected);
                }

                let deployment = self.orchestrator.deploy(config).await?;
                Ok(ActionResult::Deployed(deployment))
            },

            AgentAction::Troubleshoot { deployment_id } => {
                let report = self.ai_service.troubleshoot_deployment(
                    deployment_id,
                    &self.orchestrator.get_logs(deployment_id).await?,
                ).await?;

                Ok(ActionResult::TroubleshootingReport(report))
            },
        }
    }
}

Cedar Policies for AI

// AI cannot access secrets without explicit permission
forbid(
  principal == Service::"ai-service",
  action == Action::"read",
  resource in Secret::"*"
);

// AI can generate configs for non-production environments without approval
permit(
  principal == Service::"ai-service",
  action == Action::"generate_config",
  resource in Schema::"*"
) when {
  ["dev", "staging"].contains(resource.environment)
};

// AI config generation for production requires senior engineer approval
permit(
  principal in Group::"senior-engineers",
  action == Action::"approve_ai_config",
  resource in Config::"*"
) when {
  resource.environment == "production" &&
  resource.generated_by == "ai-service"
};

// AI agents cannot deploy without human approval
forbid(
  principal == Service::"ai-agent",
  action == Action::"deploy",
  resource == Infrastructure::"*"
) unless {
  context.human_approved == true
};

Testing Strategy

Unit Tests:

#[tokio::test]
async fn test_ai_config_generation_validates() {
    let ai_service = mock_ai_service();

    let generated = ai_service.generate_config(
        "Create a PostgreSQL database with encryption",
        &postgres_schema(),
        None,
    ).await.unwrap();

    // Must validate against schema
    assert!(generated.validation.is_valid());
    assert_eq!(generated.config["engine"], "postgres");
    assert_eq!(generated.config["encryption_enabled"], true);
}

#[tokio::test]
async fn test_ai_cannot_access_secrets() {
    let ai_service = ai_service_with_cedar();

    let result = ai_service.get_secret("database/password").await;

    assert!(result.is_err());
    assert_eq!(result.unwrap_err(), AIError::PermissionDenied);
}

Integration Tests:

#[tokio::test]
async fn test_end_to_end_ai_config_generation() {
    // User provides natural language
    let description = "Create a production Kubernetes cluster in AWS with 5 nodes";

    // AI generates config
    let generated = ai_service.generate_config(description).await.unwrap();

    // Nickel validation
    let validation = nickel_validate(&generated.config).await.unwrap();
    assert!(validation.is_valid());

    // Human approval
    let approval = Approval {
        user: "senior-engineer@example.com",
        approved: true,
        timestamp: Utc::now(),
    };

    // Deploy
    let deployment = orchestrator.deploy_with_approval(
        generated.config,
        approval,
    ).await.unwrap();

    assert_eq!(deployment.status, DeploymentStatus::Success);
}

RAG Quality Tests:

#[tokio::test]
async fn test_rag_retrieval_accuracy() {
    let rag = rag_service();

    // Index test documents
    rag.index_all().await.unwrap();

    // Query
    let context = rag.retrieve(
        "How to configure PostgreSQL with encryption?",
        &postgres_schema(),
    ).await.unwrap();

    // Should retrieve relevant docs
    assert!(context.relevant_docs.iter().any(|doc| {
        doc.contains("encryption") && doc.contains("postgres")
    }));

    // Should retrieve similar configs
    assert!(!context.similar_configs.is_empty());
}

Security Considerations

AI Access Control:

AI Service Permissions (enforced by Cedar):
✅ CAN: Read Nickel schemas
✅ CAN: Generate configurations
✅ CAN: Query documentation
✅ CAN: Analyze deployment logs (sanitized)
❌ CANNOT: Access secrets directly
❌ CANNOT: Deploy without approval
❌ CANNOT: Modify Cedar policies
❌ CANNOT: Access user credentials

Data Privacy:

[ai.privacy]
# Sanitize before sending to LLM
sanitize_secrets = true
sanitize_pii = true
sanitize_credentials = true

# What gets sent to LLM:
# ✅ Nickel schemas (public)
# ✅ Documentation (public)
# ✅ Error messages (sanitized)
# ❌ Secret values (never)
# ❌ Passwords (never)
# ❌ API keys (never)
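The sanitize flags above imply a redaction pass before any text leaves the service. A minimal sketch of that pass over `key = value` lines; the key list and the simple substring match are illustrative, and a production sanitizer would use richer detection:

```rust
// Sketch of the sanitization pass: redact the values of secret-bearing keys
// in "key = value" lines before text is sent to an external LLM provider.
// Key list and matching rule are illustrative only.

const SECRET_KEYS: &[&str] = &["password", "api_key", "token", "secret"];

fn sanitize(text: &str) -> String {
    text.lines()
        .map(|line| {
            if let Some((key, _value)) = line.split_once('=') {
                let k = key.trim().to_lowercase();
                if SECRET_KEYS.iter().any(|s| k.contains(s)) {
                    return format!("{}= [REDACTED]", key);
                }
            }
            line.to_string()
        })
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let config = "engine = postgres\nadmin_password = hunter2";
    let clean = sanitize(config);
    assert!(clean.contains("postgres"));  // non-secret content preserved
    assert!(!clean.contains("hunter2"));  // secret value removed
    assert!(clean.contains("[REDACTED]"));
}
```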

Audit Trail:

// Every AI operation logged
pub struct AIAuditLog {
    timestamp: DateTime<Utc>,
    user: UserId,
    operation: AIOperation,
    input_prompt: String,
    generated_output: String,
    validation_result: ValidationResult,
    human_approval: Option<Approval>,
    deployment_outcome: Option<DeploymentResult>,
}

Cost Analysis

Estimated Costs (per month, based on typical usage):

Assumptions:
- 100 active users
- 10 AI config generations per user per day
- Average prompt: 2000 tokens
- Average response: 1000 tokens

Provider: Anthropic Claude Sonnet
Cost: $3 per 1M input tokens, $15 per 1M output tokens

Monthly cost:
= 100 users × 10 generations × 30 days × (2000 input + 1000 output tokens)
= 100 × 10 × 30 × 3000 tokens
= 90M tokens
= (60M input × $3/1M) + (30M output × $15/1M)
= $180 + $450
= $630/month

With caching (50% hit rate):
= $315/month
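The arithmetic above, checked in code with the stated assumptions (rates from the Claude Sonnet pricing quoted in this section):

```rust
// The monthly-cost calculation above as a function: total input and output
// tokens priced at $3 and $15 per 1M tokens respectively.

fn monthly_cost_usd(users: u64, gens_per_day: u64, days: u64,
                    input_tokens: u64, output_tokens: u64) -> f64 {
    let generations = users * gens_per_day * days;   // 100 × 10 × 30 = 30,000
    let input_total = generations * input_tokens;    // 60M tokens
    let output_total = generations * output_tokens;  // 30M tokens
    (input_total as f64 / 1e6) * 3.0 + (output_total as f64 / 1e6) * 15.0
}

fn main() {
    let cost = monthly_cost_usd(100, 10, 30, 2000, 1000);
    assert_eq!(cost, 630.0);
    // A 50% cache hit rate halves the LLM spend.
    assert_eq!(cost * 0.5, 315.0);
}
```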

Cost optimization strategies:

  • Caching (50-80% cost reduction)
  • Streaming (lower latency, same cost)
  • Local models for non-critical operations (zero marginal cost)
  • Rate limiting (prevent runaway costs)

Status: Accepted
Last Updated: 2025-01-08
Implementation: Planned (High Priority)
Estimated Complexity: Very Complex
Dependencies: ADR-008, ADR-011, ADR-013, ADR-014