# AI-Assisted Troubleshooting and Debugging

**Status**: ✅ Production-Ready (AI troubleshooting analysis, log parsing)

The AI troubleshooting system provides intelligent debugging assistance for infrastructure failures. The system analyzes deployment logs, identifies root causes, suggests fixes, and generates corrected configurations based on failure patterns.

## Feature Overview

### What It Does

Transform deployment failures into actionable insights:

```bash
Deployment Fails with Error
    ↓
AI analyzes logs:
  - Identifies failure phase (networking, database, k8s, etc.)
  - Detects root cause (resource limits, configuration, timeout)
  - Correlates with similar past failures
  - Reviews deployment configuration
    ↓
AI generates report:
  - Root cause explanation in plain English
  - Configuration issues identified
  - Suggested fixes with rationale
  - Alternative solutions
  - Links to relevant documentation
    ↓
Developer reviews and accepts:
  - Understands what went wrong
  - Knows how to fix it
  - Can implement fix with confidence
```

## Troubleshooting Workflow

### Automatic Detection and Analysis

```bash
┌──────────────────────────────────────────┐
│ Deployment Monitoring                    │
│  - Watches deployment for failures       │
│  - Captures logs in real-time            │
│  - Detects failure events                │
└──────────────┬───────────────────────────┘
               ↓
┌──────────────────────────────────────────┐
│ Log Collection                           │
│  - Gather all relevant logs              │
│  - Include stack traces                  │
│  - Capture metrics at failure time       │
│  - Get resource usage data               │
└──────────────┬───────────────────────────┘
               ↓
┌──────────────────────────────────────────┐
│ Context Retrieval (RAG)                  │
│  - Find similar past failures            │
│  - Retrieve troubleshooting guides       │
│  - Get schema constraints                │
│  - Find best practices                   │
└──────────────┬───────────────────────────┘
               ↓
┌──────────────────────────────────────────┐
│ AI Analysis                              │
│  - Identify failure pattern              │
│  - Determine root cause                  │
│  - Generate hypotheses                   │
│  - Score likely causes                   │
└──────────────┬───────────────────────────┘
               ↓
┌──────────────────────────────────────────┐
│ Solution Generation                      │
│  - Create fixed configuration            │
│  - Generate step-by-step fix guide       │
│  - Suggest preventative measures         │
│  - Provide alternative approaches        │
└──────────────┬───────────────────────────┘
               ↓
┌──────────────────────────────────────────┐
│ Report and Recommendations               │
│  - Explain what went wrong               │
│  - Show how to fix it                    │
│  - Provide corrected configuration       │
│  - Link to prevention strategies         │
└──────────────────────────────────────────┘
```
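The stages above map onto a simple pipeline: collect logs, retrieve related context, analyze, and report. The Python sketch below only illustrates that flow under assumed names — `collect_logs`, `retrieve_context`, `analyze`, and `FailureReport` are hypothetical and are not part of the provisioning API.

```python
# Illustrative sketch of the troubleshooting pipeline above.
# All names are hypothetical -- they are NOT the provisioning API.
from dataclasses import dataclass, field


@dataclass
class FailureReport:
    deployment_id: str
    phase: str
    root_cause: str
    suggested_fixes: list[str] = field(default_factory=list)


def collect_logs(deployment_id: str) -> list[str]:
    """Stage 2: gather logs, stack traces, and metrics at failure time."""
    return [f"{deployment_id}: connection timeout after 30s"]


def retrieve_context(logs: list[str]) -> list[str]:
    """Stage 3: RAG lookup of similar past failures and guides."""
    return ["docs/troubleshooting/database-connectivity.md"]


def analyze(deployment_id: str, logs: list[str], context: list[str]) -> FailureReport:
    """Stages 4-5: identify the failure pattern and propose fixes."""
    return FailureReport(
        deployment_id=deployment_id,
        phase="database_migration",
        root_cause="security group blocks the migration job",
        suggested_fixes=["allow port 5432 from the migration security group"],
    )


if __name__ == "__main__":
    dep = "deploy-2025-01-13-001"
    logs = collect_logs(dep)
    report = analyze(dep, logs, retrieve_context(logs))
    print(report.root_cause, report.suggested_fixes)
```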
## Usage Examples

### Example 1: Database Connection Timeout

**Failure**:

```bash
Deployment: deploy-2025-01-13-001
Status: FAILED at phase database_migration
Error: connection timeout after 30s connecting to postgres://...
```

**Run Troubleshooting**:

```bash
$ provisioning ai troubleshoot deploy-2025-01-13-001

Analyzing deployment failure...

╔════════════════════════════════════════════════════════════════╗
║ Root Cause Analysis: Database Connection Timeout               ║
╠════════════════════════════════════════════════════════════════╣
║                                                                ║
║ Phase: database_migration (occurred during migration job)      ║
║ Error: Timeout after 30 seconds connecting to database         ║
║                                                                ║
║ Most Likely Causes (confidence):                               ║
║   1. Database security group blocks migration job (85%)        ║
║   2. Database instance not fully initialized yet (60%)         ║
║   3. Network connectivity issue (40%)                          ║
║                                                                ║
║ Analysis:                                                      ║
║   - Database was created only 2 seconds before connection      ║
║   - Migration job started immediately (no wait time)           ║
║   - Security group: allows 5432 only from default SG           ║
║   - Migration pod uses different security group                ║
║                                                                ║
╠════════════════════════════════════════════════════════════════╣
║ Recommended Fix                                                ║
╠════════════════════════════════════════════════════════════════╣
║                                                                ║
║ Issue: Migration security group not in database's inbound      ║
║                                                                ║
║ Solution: Add migration pod security group to DB inbound       ║
║                                                                ║
║   database.security_group.ingress = [                         ║
║     {                                                          ║
║       from_port = 5432,                                        ║
║       to_port = 5432,                                          ║
║       source_security_group = "migration-pods-sg"              ║
║     }                                                          ║
║   ]                                                            ║
║                                                                ║
║ Alternative: Add 30-second wait after database creation        ║
║                                                                ║
║   deployment.phases.database.post_actions = [                  ║
║     {action = "wait_for_database", timeout_seconds = 30}       ║
║   ]                                                            ║
║                                                                ║
╠════════════════════════════════════════════════════════════════╣
║ Prevention                                                     ║
╠════════════════════════════════════════════════════════════════╣
║                                                                ║
║ To prevent this in future deployments:                         ║
║                                                                ║
║   1. Always verify security group rules before migration       ║
║   2. Add health check: `SELECT 1` before starting migration    ║
║   3. Increase initial timeout: database can be slow to start   ║
║   4. Use RDS wait condition instead of time-based wait         ║
║                                                                ║
║ See: docs/troubleshooting/database-connectivity.md             ║
║      docs/guides/database-migrations.md                        ║
║                                                                ║
╚════════════════════════════════════════════════════════════════╝

Generate corrected configuration? [yes/no]: yes

Configuration generated and saved to:
  workspaces/prod/database.ncl.fixed

Changes made:
  ✓ Added migration security group to database inbound
  ✓ Added health check before migration
  ✓ Increased connection timeout to 60s

Ready to redeploy with corrected configuration? [yes/no]: yes
```
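Prevention item 2 in the report above recommends a `SELECT 1` health check before the migration starts. Below is a minimal sketch of such a check, assuming Python with `psycopg2` and a `DATABASE_URL` environment variable; the `wait_for_database` helper is illustrative, not a built-in provisioning action.

```python
# Minimal health-check sketch: poll the database with `SELECT 1` before
# running migrations. Assumes psycopg2 is installed and DATABASE_URL is set;
# neither is provided by the provisioning platform itself.
import os
import time

import psycopg2


def wait_for_database(dsn: str, timeout_s: int = 60, interval_s: int = 5) -> None:
    """Block until `SELECT 1` succeeds or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            conn = psycopg2.connect(dsn, connect_timeout=5)
            try:
                with conn.cursor() as cur:
                    cur.execute("SELECT 1")
                    cur.fetchone()
                return  # database is reachable and answering queries
            finally:
                conn.close()
        except psycopg2.OperationalError as exc:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"database not ready after {timeout_s}s") from exc
            time.sleep(interval_s)


if __name__ == "__main__":
    wait_for_database(os.environ["DATABASE_URL"])
    print("database ready -- safe to run migrations")
```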
### Example 2: Kubernetes Deployment Error

**Failure**:

```bash
Deployment: deploy-2025-01-13-002
Status: FAILED at phase kubernetes_workload
Error: failed to create deployment app: Pod exceeded capacity
```

**Troubleshooting**:

```bash
$ provisioning ai troubleshoot deploy-2025-01-13-002 --detailed

╔════════════════════════════════════════════════════════════════╗
║ Root Cause: Pod Exceeded Node Capacity                         ║
╠════════════════════════════════════════════════════════════════╣
║                                                                ║
║ Failure Analysis:                                              ║
║                                                                ║
║ Error: Pod requests 4CPU/8GB, but largest node has 2CPU/4GB    ║
║ Cluster: 3 nodes, each t3.medium (2CPU/4GB)                    ║
║ Pod requirements vs node capacity:                             ║
║   - CPU: 4 requested, but each node offers only 2 vCPU         ║
║   - Memory: 8Gi requested, but each node offers only 4GiB      ║
║                                                                ║
║ Why this happened:                                             ║
║   Pod spec updated to 4CPU/8GB but node group wasn't           ║
║   Node group still has t3.medium (too small)                   ║
║   No autoscaling configured (won't scale up automatically)     ║
║                                                                ║
║ Solution Options:                                              ║
║   1. Reduce pod resource requests to fit existing nodes        ║
║   2. Scale up node group to t3.xlarge (4x cost, safer)         ║
║   3. Use both: larger nodes + reduced pod requests             ║
║                                                                ║
╠════════════════════════════════════════════════════════════════╣
║ Recommended: Option 2 (Scale up nodes)                         ║
╠════════════════════════════════════════════════════════════════╣
║                                                                ║
║ Reason: Pod requests are reasonable for production app         ║
║   Better to scale infrastructure than reduce resources         ║
║                                                                ║
║ Changes needed:                                                ║
║                                                                ║
║   kubernetes.node_group = {                                    ║
║     instance_type = "t3.xlarge"   # was t3.medium              ║
║     min_size = 3                                               ║
║     max_size = 10                                              ║
║                                                                ║
║     auto_scaling = {                                           ║
║       enabled = true                                           ║
║       target_cpu_percent = 70                                  ║
║     }                                                          ║
║   }                                                            ║
║                                                                ║
║ Cost Impact:                                                   ║
║   Current:  3 × t3.medium = ~$90/month                         ║
║   Proposed: 3 × t3.xlarge = ~$360/month                        ║
║   With autoscaling, average: ~$300/month (some scale-down)     ║
║                                                                ║
╚════════════════════════════════════════════════════════════════╝
```

## CLI Commands

### Basic Troubleshooting

```bash
# Troubleshoot recent deployment
provisioning ai troubleshoot deploy-2025-01-13-001

# Get detailed analysis
provisioning ai troubleshoot deploy-2025-01-13-001 --detailed

# Analyze with specific focus
provisioning ai troubleshoot deploy-2025-01-13-001 --focus networking

# Get alternative solutions
provisioning ai troubleshoot deploy-2025-01-13-001 --alternatives
```

### Working with Logs

```bash
# Troubleshoot from custom logs
provisioning ai troubleshoot \
  --logs "$(journalctl -u provisioning --no-pager | tail -100)"

# Troubleshoot from file
provisioning ai troubleshoot --log-file /var/log/deployment.log

# Troubleshoot from cloud provider
provisioning ai troubleshoot --cloud-logs aws-deployment-123 --region us-east-1
```

### Generate Reports

```bash
# Generate detailed troubleshooting report
provisioning ai troubleshoot deploy-123 --report --output troubleshooting-report.md

# Generate with suggestions
provisioning ai troubleshoot deploy-123 --report --include-suggestions --output report-with-fixes.md

# Generate compliance report (PCI-DSS, HIPAA)
provisioning ai troubleshoot deploy-123 --report --compliance pci-dss --output compliance-report.pdf
```

## Analysis Depth

### Shallow Analysis (Fast)

```bash
provisioning ai troubleshoot deploy-123 --depth shallow

Analyzes:
  - First error message
  - Last few log lines
  - Basic pattern matching
  - Returns in 30-60 seconds
```

### Deep Analysis (Thorough)

```bash
provisioning ai troubleshoot deploy-123 --depth deep

Analyzes:
  - Full log context
  - Correlates multiple errors
  - Checks resource metrics
  - Compares to past failures
  - Generates alternative hypotheses
  - Returns in 5-10 minutes
```

## Integration with Monitoring

### Automatic Troubleshooting

```bash
# Enable auto-troubleshoot on failures
provisioning config set ai.troubleshooting.auto_analyze true

# Failed deployments are analyzed automatically
# Reports available in provisioning dashboard
# Alerts sent to on-call engineer with analysis
```

### WebUI Integration

```bash
Deployment Dashboard
├─ deployment-123 [FAILED]
│  └─ AI Analysis
│     ├─ Root Cause: Database timeout
│     ├─ Suggested Fix: ✓ View
│     ├─ Corrected Config: ✓ Download
│     └─ Alternative Solutions: 3 options
```

## Learning from Failures

### Pattern Recognition

The system learns common failure patterns:

```bash
Collected Patterns:
├─ Database Timeouts (25% of failures)
│  └─ Usually: Security group, connection pool, slow startup
├─ Kubernetes Pod Failures (20%)
│  └─ Usually: Insufficient resources, bad config
├─ Network Connectivity (15%)
│  └─ Usually: Security groups, routing, DNS
└─ Other (40%)
   └─ Various causes, each analyzed individually
```
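To make the idea concrete, the sketch below classifies a failure log into one of the buckets above using plain regular expressions. The category names and regexes are examples only, not the pattern set the system ships with; the actual trigger strings live in `watch_patterns` under the Failure Detection configuration later in this document.

```python
# Illustrative pattern-based classification of a failure log.
# Categories and regexes are examples, not the shipped pattern set.
import re

FAILURE_PATTERNS = {
    "database_timeout": re.compile(r"connection timeout|could not connect to .*:5432", re.I),
    "kubernetes_pod": re.compile(r"exceeded capacity|insufficient (cpu|memory)|crashloopbackoff", re.I),
    "network": re.compile(r"connection refused|no route to host|dns.*(fail|timeout)", re.I),
}


def classify_failure(log_lines: list[str]) -> str:
    """Return the first matching failure category, or 'other'."""
    for line in log_lines:
        for category, pattern in FAILURE_PATTERNS.items():
            if pattern.search(line):
                return category
    return "other"


if __name__ == "__main__":
    logs = ["ERROR: connection timeout after 30s connecting to postgres://db:5432"]
    print(classify_failure(logs))  # -> database_timeout
```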
### Improvement Tracking

```bash
# See patterns in your deployments
provisioning ai analytics failures --period month

Month Summary:
  Total deployments: 50
  Failed: 5 (10% failure rate)

  Common causes:
    1. Security group rules (3 failures, 60%)
    2. Resource limits (1 failure, 20%)
    3. Configuration error (1 failure, 20%)

  Improvement opportunities:
    - Pre-check security groups before deployment
    - Add health checks for resource sizing
    - Add configuration validation
```

## Configuration

### Troubleshooting Settings

```toml
[ai.troubleshooting]
enabled = true

# Analysis depth
default_depth = "deep"            # or "shallow" for speed
max_analysis_time_seconds = 30

# Features
auto_analyze_failed_deployments = true
generate_corrected_config = true
suggest_prevention = true

# Learning
track_failure_patterns = true
learn_from_similar_failures = true
improve_suggestions_over_time = true

# Reporting
auto_send_report = false          # Email report to user
report_format = "markdown"        # or "json", "pdf"
include_alternatives = true

# Cost impact analysis
estimate_fix_cost = true
estimate_alternative_costs = true
```

### Failure Detection

```toml
[ai.troubleshooting.detection]
# Monitor logs for these patterns
watch_patterns = [
    "error",
    "timeout",
    "failed",
    "unable to",
    "refused",
    "denied",
    "exceeded",
    "quota",
]

# Minimum log lines before analyzing
min_log_lines = 10

# Time window for log collection
log_window_seconds = 300
```

## Best Practices

### For Effective Troubleshooting

1. **Keep Detailed Logs**: Enable verbose logging in deployments
2. **Include Context**: Share full logs, not just the error snippet
3. **Check Suggestions**: Review AI suggestions even when the cause seems obvious
4. **Learn Patterns**: Track recurring failures and address the root cause
5. **Update Configs**: Use corrected configs from AI, and validate them before applying

### For Prevention

1. **Use Health Checks**: Add database/service health checks
2. **Test Before Deploy**: Use dry-run to catch issues early
3. **Monitor Metrics**: Watch CPU/memory before failures occur
4. **Review Policies**: Ensure security groups are correct
5. **Document Changes**: When updating configs, note the change

## Limitations

### What AI Can Troubleshoot

✅ Configuration errors
✅ Resource limit problems
✅ Networking/security group issues
✅ Database connectivity problems
✅ Deployment ordering issues
✅ Common application errors
✅ Performance problems

### What Requires Human Review

⚠️ Data corruption scenarios
⚠️ Multi-failure cascades
⚠️ Unclear error messages
⚠️ Custom application code failures
⚠️ Third-party service issues
⚠️ Physical infrastructure failures

## Examples and Guides

### Common Issues - Quick Links

- [Database Connectivity](../troubleshooting/database-connectivity.md)
- [Kubernetes Pod Failures](../troubleshooting/kubernetes-pods.md)
- [Network Configuration](../troubleshooting/networking.md)
- [Performance Issues](../troubleshooting/performance.md)
- [Resource Limits](../troubleshooting/resource-limits.md)

## Related Documentation

- [Architecture](architecture.md) - AI system overview
- [RAG System](rag-system.md) - Context retrieval for troubleshooting
- [Configuration](configuration.md) - Setup guide
- [Security Policies](security-policies.md) - Safe log handling
- [ADR-015](../architecture/adr/adr-015-ai-integration-architecture.md) - Design decisions

---

**Last Updated**: 2025-01-13
**Status**: ✅ Production-Ready
**Success Rate**: 85-95% accuracy in root cause identification
**Supported**: All deployment types (infrastructure, Kubernetes, database)