2026-01-14 04:53:21 +00:00

# AI-Assisted Troubleshooting and Debugging

**Status**: ✅ Production-Ready (AI troubleshooting analysis, log parsing)

The AI troubleshooting system provides intelligent debugging assistance for infrastructure failures. The system analyzes deployment logs, identifies root causes, suggests fixes, and generates corrected configurations based on failure patterns.

## Feature Overview

### What It Does

Transform deployment failures into actionable insights:

```text
Deployment Fails with Error
        ↓
AI analyzes logs:
  - Identifies failure phase (networking, database, k8s, etc.)
  - Detects root cause (resource limits, configuration, timeout)
  - Correlates with similar past failures
  - Reviews deployment configuration
        ↓
AI generates report:
  - Root cause explanation in plain English
  - Configuration issues identified
  - Suggested fixes with rationale
  - Alternative solutions
  - Links to relevant documentation
        ↓
Developer reviews and accepts:
  - Understands what went wrong
  - Knows how to fix it
  - Can implement fix with confidence
```

## Troubleshooting Workflow

### Automatic Detection and Analysis

```text
┌──────────────────────────────────────────┐
│  Deployment Monitoring                   │
│  - Watches deployment for failures       │
│  - Captures logs in real-time            │
│  - Detects failure events                │
└──────────────┬───────────────────────────┘
               ↓
┌──────────────────────────────────────────┐
│  Log Collection                          │
│  - Gather all relevant logs              │
│  - Include stack traces                  │
│  - Capture metrics at failure time       │
│  - Get resource usage data               │
└──────────────┬───────────────────────────┘
               ↓
┌──────────────────────────────────────────┐
│  Context Retrieval (RAG)                 │
│  - Find similar past failures            │
│  - Retrieve troubleshooting guides       │
│  - Get schema constraints                │
│  - Find best practices                   │
└──────────────┬───────────────────────────┘
               ↓
┌──────────────────────────────────────────┐
│  AI Analysis                             │
│  - Identify failure pattern              │
│  - Determine root cause                  │
│  - Generate hypotheses                   │
│  - Score likely causes                   │
└──────────────┬───────────────────────────┘
               ↓
┌──────────────────────────────────────────┐
│  Solution Generation                     │
│  - Create fixed configuration            │
│  - Generate step-by-step fix guide       │
│  - Suggest preventative measures         │
│  - Provide alternative approaches        │
└──────────────┬───────────────────────────┘
               ↓
┌──────────────────────────────────────────┐
│  Report and Recommendations              │
│  - Explain what went wrong               │
│  - Show how to fix it                    │
│  - Provide corrected configuration       │
│  - Link to prevention strategies         │
└──────────────────────────────────────────┘
```
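
The stages in the diagram can be read as a chain of functions that each enrich a shared state object. The sketch below is only an illustration of that shape: the `Analysis` class, the stage functions, and the hard-coded sample data are all hypothetical stand-ins, not the system's actual internals.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Analysis:
    """State handed from one stage to the next for a single failed deployment."""
    deployment_id: str
    logs: list[str] = field(default_factory=list)
    context: list[str] = field(default_factory=list)
    hypotheses: list[tuple[str, float]] = field(default_factory=list)  # (cause, score)
    fixes: list[str] = field(default_factory=list)


def collect_logs(a: Analysis) -> Analysis:
    # Stand-in for real log collection: one fixed sample line.
    a.logs.append("connection timeout after 30s connecting to database")
    return a


def retrieve_context(a: Analysis) -> Analysis:
    # Stand-in for the RAG lookup of similar past failures.
    if any("timeout" in line for line in a.logs):
        a.context.append("past failure: security group blocked port 5432")
    return a


def analyze(a: Analysis) -> Analysis:
    # Score candidate causes (illustrative values) and rank the most likely first.
    if a.context:
        a.hypotheses.append(("security group blocks migration job", 0.85))
    a.hypotheses.append(("database not fully initialized", 0.60))
    a.hypotheses.sort(key=lambda h: h[1], reverse=True)
    return a


def generate_solutions(a: Analysis) -> Analysis:
    # Propose a fix for the top-ranked hypothesis only.
    top_cause = a.hypotheses[0][0]
    a.fixes.append(f"address: {top_cause}")
    return a


PIPELINE: list[Callable[[Analysis], Analysis]] = [
    collect_logs, retrieve_context, analyze, generate_solutions,
]


def run_pipeline(deployment_id: str) -> Analysis:
    state = Analysis(deployment_id)
    for stage in PIPELINE:
        state = stage(state)
    return state


result = run_pipeline("deploy-2025-01-13-001")
print(result.hypotheses[0], result.fixes)
```

Keeping each stage as a plain function over one state object makes it easy to test a stage in isolation or to insert an extra step without touching the others.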

## Usage Examples

### Example 1: Database Connection Timeout

**Failure**:

```text
Deployment: deploy-2025-01-13-001
Status: FAILED at phase database_migration
Error: connection timeout after 30s connecting to postgres://...
```

**Run Troubleshooting**:

```bash
$ provisioning ai troubleshoot deploy-2025-01-13-001

Analyzing deployment failure...

╔════════════════════════════════════════════════════════════════╗
║  Root Cause Analysis: Database Connection Timeout              ║
╠════════════════════════════════════════════════════════════════╣
║                                                                ║
║  Phase: database_migration (occurred during migration job)     ║
║  Error: Timeout after 30 seconds connecting to database        ║
║                                                                ║
║  Most Likely Causes (confidence):                              ║
║  1. Database security group blocks migration job (85%)         ║
║  2. Database instance not fully initialized yet (60%)          ║
║  3. Network connectivity issue (40%)                           ║
║                                                                ║
║  Analysis:                                                     ║
║  - Database was created only 2 seconds before connection       ║
║  - Migration job started immediately (no wait time)            ║
║  - Security group: allows 5432 only from default SG            ║
║  - Migration pod uses different security group                 ║
║                                                                ║
╠════════════════════════════════════════════════════════════════╣
║  Recommended Fix                                               ║
╠════════════════════════════════════════════════════════════════╣
║                                                                ║
║  Issue: Migration security group not in database's inbound     ║
║                                                                ║
║  Solution: Add migration pod security group to DB inbound      ║
║                                                                ║
║  database.security_group.ingress = [                           ║
║    {                                                           ║
║      from_port = 5432,                                         ║
║      to_port = 5432,                                           ║
║      source_security_group = "migration-pods-sg"               ║
║    }                                                           ║
║  ]                                                             ║
║                                                                ║
║  Alternative: Add 30-second wait after database creation       ║
║                                                                ║
║  deployment.phases.database.post_actions = [                   ║
║    {action = "wait_for_database", timeout_seconds = 30}        ║
║  ]                                                             ║
║                                                                ║
╠════════════════════════════════════════════════════════════════╣
║  Prevention                                                    ║
╠════════════════════════════════════════════════════════════════╣
║                                                                ║
║  To prevent this in future deployments:                        ║
║                                                                ║
║  1. Always verify security group rules before migration        ║
║  2. Add health check: `SELECT 1` before starting migration     ║
║  3. Increase initial timeout: database can be slow to start    ║
║  4. Use RDS wait condition instead of time-based wait          ║
║                                                                ║
║  See: docs/troubleshooting/database-connectivity.md            ║
║       docs/guides/database-migrations.md                       ║
║                                                                ║
╚════════════════════════════════════════════════════════════════╝

Generate corrected configuration? [yes/no]: yes

Configuration generated and saved to:
  workspaces/prod/database.ncl.fixed

Changes made:
  ✓ Added migration security group to database inbound
  ✓ Added health check before migration
  ✓ Increased connection timeout to 60s

Ready to redeploy with corrected configuration? [yes/no]: yes
```
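
The prevention advice in this report (run a `SELECT 1` health check and wait on a readiness condition rather than a fixed delay) comes down to polling a probe until it succeeds or a deadline passes. The following is a minimal sketch of that idea, not the tool's `wait_for_database` action: the function `wait_until_ready`, its parameters, and the fake probe are hypothetical, and a real probe would open a connection and run `SELECT 1`.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_seconds: float = 60.0,
    interval_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll `probe` until it returns True or `timeout_seconds` elapses."""
    deadline = clock() + timeout_seconds
    while True:
        try:
            if probe():
                return True
        except OSError:
            # Connection refused or timed out: treat as "not ready yet".
            pass
        if clock() >= deadline:
            return False
        sleep(interval_seconds)


# Demo with a fake probe that only succeeds on the third attempt.
attempts = {"count": 0}


def fake_probe() -> bool:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionRefusedError("database not accepting connections yet")
    return True


ready = wait_until_ready(fake_probe, timeout_seconds=5, sleep=lambda _s: None)
print(ready, attempts["count"])
```

Passing `sleep` and `clock` as parameters is a design choice for testability: the demo swaps in a no-op sleep so the loop runs instantly.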

### Example 2: Kubernetes Deployment Error

**Failure**:

```text
Deployment: deploy-2025-01-13-002
Status: FAILED at phase kubernetes_workload
Error: failed to create deployment app: Pod exceeded capacity
```

**Troubleshooting**:

```bash
$ provisioning ai troubleshoot deploy-2025-01-13-002 --detailed

╔════════════════════════════════════════════════════════════════╗
║  Root Cause: Pod Exceeded Node Capacity                        ║
╠════════════════════════════════════════════════════════════════╣
║                                                                ║
║  Failure Analysis:                                             ║
║                                                                ║
║  Error: Pod requests 4CPU/8GB, but largest node has 2CPU/4GB   ║
║  Cluster: 3 nodes, each t3.medium (2CPU/4GB)                   ║
║  Pod requirements:                                             ║
║  - CPU: 4 (requested) + 2 (reserved system) = 6 needed         ║
║  - Memory: 8Gi (requested) + 1Gi (system) = 9Gi needed         ║
║                                                                ║
║  Why this happened:                                            ║
║  Pod spec updated to 4CPU/8GB but node group wasn't            ║
║  Node group still has t3.medium (too small)                    ║
║  No autoscaling configured (won't scale up automatically)      ║
║                                                                ║
║  Solution Options:                                             ║
║  1. Reduce pod resource requests to 2CPU/4GB (simpler)         ║
║  2. Scale up node group to t3.2xlarge (8x cost, safer)         ║
║  3. Use both: larger nodes + reduce pod requests               ║
║                                                                ║
╠════════════════════════════════════════════════════════════════╣
║  Recommended: Option 2 (Scale up nodes)                        ║
╠════════════════════════════════════════════════════════════════╣
║                                                                ║
║  Reason: Pod requests are reasonable for production app        ║
║  Better to scale infrastructure than reduce resources          ║
║                                                                ║
║  Changes needed:                                               ║
║                                                                ║
║  kubernetes.node_group = {                                     ║
║    instance_type = "t3.2xlarge",  # was t3.medium              ║
║    min_size = 3,                                               ║
║    max_size = 10,                                              ║
║                                                                ║
║    auto_scaling = {                                            ║
║      enabled = true,                                           ║
║      target_cpu_percent = 70,                                  ║
║    },                                                          ║
║  }                                                             ║
║                                                                ║
║  Cost Impact:                                                  ║
║  Current: 3 × t3.medium = ~$90/month                           ║
║  Proposed: 3 × t3.2xlarge = ~$720/month                        ║
║  Autoscaling (min 3, max 10) adds nodes only under load        ║
║                                                                ║
╚════════════════════════════════════════════════════════════════╝
```
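
The capacity arithmetic in this report can be checked mechanically: add the system reservation to the pod request and compare the total against what a node offers. The sketch below uses the pod, reservation, and current-node figures quoted in the report. The `Resources` type, the helper functions, and the candidate node sizes named `small`, `medium`, and `large` are hypothetical and are not real instance types.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Resources:
    cpu: float         # CPU cores
    memory_gib: float  # memory in GiB


def required(pod: Resources, reserved: Resources) -> Resources:
    """Total a node must offer: pod requests plus the system reservation."""
    return Resources(pod.cpu + reserved.cpu, pod.memory_gib + reserved.memory_gib)


def fits(need: Resources, node: Resources) -> bool:
    """A pod fits only if both CPU and memory are within the node's capacity."""
    return need.cpu <= node.cpu and need.memory_gib <= node.memory_gib


def smallest_fitting(need: Resources, candidates: dict[str, Resources]) -> Optional[str]:
    """Name of the smallest candidate (by CPU, then memory) that fits, if any."""
    viable = [(r.cpu, r.memory_gib, name) for name, r in candidates.items() if fits(need, r)]
    return min(viable)[2] if viable else None


# Figures quoted in the report.
pod = Resources(cpu=4, memory_gib=8)
reserved = Resources(cpu=2, memory_gib=1)
current_node = Resources(cpu=2, memory_gib=4)

need = required(pod, reserved)
print(need, fits(need, current_node))

# Hypothetical node sizes, for illustration only.
candidates = {
    "small": Resources(2, 4),
    "medium": Resources(4, 16),
    "large": Resources(8, 32),
}
print(smallest_fitting(need, candidates))
```

Running the same check against a proposed node size before redeploying is a cheap way to confirm that a suggested fix actually resolves the capacity error.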

## CLI Commands

### Basic Troubleshooting

```bash
# Troubleshoot recent deployment
provisioning ai troubleshoot deploy-2025-01-13-001

# Get detailed analysis
provisioning ai troubleshoot deploy-2025-01-13-001 --detailed

# Analyze with specific focus
provisioning ai troubleshoot deploy-2025-01-13-001 --focus networking

# Get alternative solutions
provisioning ai troubleshoot deploy-2025-01-13-001 --alternatives
```

### Working with Logs

```bash
# Troubleshoot from custom logs
provisioning ai troubleshoot \
  --logs "$(journalctl -u provisioning --no-pager | tail -n 100)"

# Troubleshoot from file
provisioning ai troubleshoot --log-file /var/log/deployment.log

# Troubleshoot from cloud provider
provisioning ai troubleshoot \
  --cloud-logs aws-deployment-123 \
  --region us-east-1
```

### Generate Reports

```bash
# Generate detailed troubleshooting report
provisioning ai troubleshoot deploy-123 \
  --report \
  --output troubleshooting-report.md

# Generate with suggestions
provisioning ai troubleshoot deploy-123 \
  --report \
  --include-suggestions \
  --output report-with-fixes.md

# Generate compliance report (PCI-DSS, HIPAA)
provisioning ai troubleshoot deploy-123 \
  --report \
  --compliance pci-dss \
  --output compliance-report.pdf
```

## Analysis Depth

### Shallow Analysis (Fast)

```bash
provisioning ai troubleshoot deploy-123 --depth shallow

# Analyzes:
# - First error message
# - Last few log lines
# - Basic pattern matching
# - Returns in 5-10 seconds
```

### Deep Analysis (Thorough)

```bash
provisioning ai troubleshoot deploy-123 --depth deep

# Analyzes:
# - Full log context
# - Correlates multiple errors
# - Checks resource metrics
# - Compares to past failures
# - Generates alternative hypotheses
# - Returns in 30-60 seconds
```
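
To make the difference in depth concrete, here is a rough sketch of what a shallow pass could look like: find the first error line, keep the last few lines, and record which keywords matched. It is an assumption-laden illustration rather than the tool's implementation. The function `shallow_summary`, the `tail` default, and the keyword list are hypothetical choices.

```python
import re

# Illustrative keyword set; a real analyzer would use its configured patterns.
ERROR_PATTERN = re.compile(r"error|timeout|failed|refused|denied|exceeded", re.IGNORECASE)


def shallow_summary(log_lines: list[str], tail: int = 3) -> dict:
    """Summarize a log using only cheap checks: first error, tail, keyword hits."""
    first_error = next((line for line in log_lines if ERROR_PATTERN.search(line)), None)
    matched = sorted(
        {m.group(0).lower() for line in log_lines for m in ERROR_PATTERN.finditer(line)}
    )
    return {
        "first_error": first_error,
        "last_lines": log_lines[-tail:],
        "matched_keywords": matched,
    }


sample = [
    "phase networking: ok",
    "phase database_migration: starting",
    "Error: connection timeout after 30s connecting to database",
    "phase database_migration: FAILED",
]
summary = shallow_summary(sample)
print(summary["first_error"])
print(summary["matched_keywords"])
```

A deep pass would add the steps this sketch skips: correlating several errors, checking metrics, and comparing against past failures.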

## Integration with Monitoring

### Automatic Troubleshooting

```bash
# Enable auto-troubleshoot on failures
provisioning config set ai.troubleshooting.auto_analyze true

# Failed deployments are then analyzed automatically
# Reports are available in the provisioning dashboard
# Alerts with the analysis are sent to the on-call engineer
```

### WebUI Integration

```text
Deployment Dashboard
  ├─ deployment-123 [FAILED]
  │   └─ AI Analysis
  │       ├─ Root Cause: Database timeout
  │       ├─ Suggested Fix: ✓ View
  │       ├─ Corrected Config: ✓ Download
  │       └─ Alternative Solutions: 3 options
```

## Learning from Failures

### Pattern Recognition

The system learns common failure patterns:

```text
Collected Patterns:
├─ Database Timeouts (25% of failures)
│   └─ Usually: Security group, connection pool, slow startup
├─ Kubernetes Pod Failures (20%)
│   └─ Usually: Insufficient resources, bad config
├─ Network Connectivity (15%)
│   └─ Usually: Security groups, routing, DNS
└─ Other (40%)
    └─ Various causes, each analyzed individually
```
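
One simple way to picture how failures end up in buckets like these is keyword-based categorization with a catch-all. The sketch below is a deliberately naive illustration, and the system itself may classify failures quite differently. The `RULES` table, its keywords, and the sample messages are assumptions made for the example.

```python
from collections import Counter

# Illustrative keyword rules; the first matching category wins.
RULES = [
    ("Database Timeouts", ("postgres", "database", "connection pool")),
    ("Kubernetes Pod Failures", ("pod", "node capacity", "crashloop")),
    ("Network Connectivity", ("dns", "route", "security group")),
]


def categorize(message: str) -> str:
    """Map one failure message to a category, or "Other" if nothing matches."""
    text = message.lower()
    for category, keywords in RULES:
        if any(keyword in text for keyword in keywords):
            return category
    return "Other"


def pattern_shares(messages: list[str]) -> dict[str, float]:
    """Percentage of failures per category."""
    counts = Counter(categorize(m) for m in messages)
    return {category: 100 * n / len(messages) for category, n in counts.items()}


messages = [
    "connection timeout after 30s connecting to postgres",
    "failed to create deployment app: Pod exceeded capacity",
    "DNS lookup failed for api.internal",
    "quota exceeded for volume snapshots",
]
shares = pattern_shares(messages)
print(shares)
```

Rule order matters in this scheme: a message mentioning both a database and a security group lands in the first category that matches.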

### Improvement Tracking

```bash
# See patterns in your deployments
provisioning ai analytics failures --period month

Month Summary:
  Total deployments: 50
  Failed: 5 (10% failure rate)

Common causes:
  1. Security group rules (3 failures, 60%)
  2. Resource limits (1 failure, 20%)
  3. Configuration error (1 failure, 20%)

Improvement opportunities:
  - Pre-check security groups before deployment
  - Add health checks for resource sizing
  - Add configuration validation
```
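
The percentages in this summary follow from two divisions: failed deployments over total deployments, and each cause's count over the number of failures. A short sketch reproduces the figures above. The function name `failure_summary` and its return shape are hypothetical.

```python
from collections import Counter


def failure_summary(total_deployments: int, failure_causes: list[str]) -> dict:
    """Compute the overall failure rate and each cause's share of failures."""
    if total_deployments <= 0:
        raise ValueError("total_deployments must be positive")
    failed = len(failure_causes)
    counts = Counter(failure_causes)
    return {
        "failed": failed,
        "failure_rate_percent": 100 * failed / total_deployments,
        # most_common() lists the most frequent cause first.
        "cause_share_percent": {
            cause: 100 * n / failed for cause, n in counts.most_common()
        },
    }


causes = ["security group rules"] * 3 + ["resource limits", "configuration error"]
summary = failure_summary(50, causes)
print(summary)
```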

## Configuration

### Troubleshooting Settings

```toml
[ai.troubleshooting]
enabled = true

# Analysis depth
default_depth = "deep"  # or "shallow" for speed
max_analysis_time_seconds = 30

# Features
auto_analyze_failed_deployments = true
generate_corrected_config = true
suggest_prevention = true

# Learning
track_failure_patterns = true
learn_from_similar_failures = true
improve_suggestions_over_time = true

# Reporting
auto_send_report = false  # Email report to user
report_format = "markdown"  # or "json", "pdf"
include_alternatives = true

# Cost impact analysis
estimate_fix_cost = true
estimate_alternative_costs = true
```
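
Some of these settings accept only a fixed set of values, as the inline comments indicate (`"deep"` or `"shallow"`; `"markdown"`, `"json"`, or `"pdf"`). A small validation pass over the parsed settings can catch typos before a deployment relies on them. The sketch below assumes the `[ai.troubleshooting]` table has already been parsed into a dictionary. The function `validate_settings` and its messages are hypothetical and not part of the tool.

```python
ALLOWED_DEPTHS = {"shallow", "deep"}
ALLOWED_FORMATS = {"markdown", "json", "pdf"}


def validate_settings(settings: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []

    depth = settings.get("default_depth")
    if depth not in ALLOWED_DEPTHS:
        problems.append(f"default_depth must be one of {sorted(ALLOWED_DEPTHS)}, got {depth!r}")

    report_format = settings.get("report_format")
    if report_format not in ALLOWED_FORMATS:
        problems.append(
            f"report_format must be one of {sorted(ALLOWED_FORMATS)}, got {report_format!r}"
        )

    limit = settings.get("max_analysis_time_seconds")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(limit, int) or isinstance(limit, bool) or limit <= 0:
        problems.append(f"max_analysis_time_seconds must be a positive integer, got {limit!r}")

    return problems


good = {"default_depth": "deep", "report_format": "markdown", "max_analysis_time_seconds": 30}
bad = {"default_depth": "medium", "report_format": "html", "max_analysis_time_seconds": 0}
print(validate_settings(good))
print(len(validate_settings(bad)))
```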

### Failure Detection

```toml
[ai.troubleshooting.detection]
# Monitor logs for these patterns
watch_patterns = [
  "error",
  "timeout",
  "failed",
  "unable to",
  "refused",
  "denied",
  "exceeded",
  "quota",
]

# Minimum log lines before analyzing
min_log_lines = 10

# Time window for log collection
log_window_seconds = 300
```
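
One plausible reading of how these three settings interact: keep only log lines inside the time window, require at least the minimum number of lines, then trigger analysis if any watch pattern appears. The sketch below encodes that reading. The exact semantics (case-insensitive substring matching, and counting lines before matching) are assumptions, and `should_analyze` is a hypothetical helper.

```python
from datetime import datetime, timedelta

# Values copied from the configuration above.
WATCH_PATTERNS = ["error", "timeout", "failed", "unable to", "refused", "denied", "exceeded", "quota"]
MIN_LOG_LINES = 10
LOG_WINDOW_SECONDS = 300


def should_analyze(entries: list[tuple[datetime, str]], now: datetime) -> bool:
    """Decide whether the recent log window warrants an analysis run."""
    cutoff = now - timedelta(seconds=LOG_WINDOW_SECONDS)
    recent = [line for timestamp, line in entries if cutoff <= timestamp <= now]
    if len(recent) < MIN_LOG_LINES:
        return False  # not enough context to analyze yet
    return any(pattern in line.lower() for line in recent for pattern in WATCH_PATTERNS)


now = datetime(2025, 1, 13, 12, 0, 0)
# Eleven routine lines, ten seconds apart, all inside the 300-second window.
quiet = [(now - timedelta(seconds=10 * i), f"step {i} ok") for i in range(1, 12)]
# The same lines plus one that matches the "refused" pattern.
failing = quiet + [(now, "connection refused by database")]

print(should_analyze(failing, now), should_analyze(quiet, now))
```

Plain substring matching on words like `error` is broad and can flag harmless lines, so treating a match as a trigger for analysis rather than as a verdict keeps false positives cheap.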

## Best Practices

### For Effective Troubleshooting

1. **Keep Detailed Logs**: Enable verbose logging in deployments
2. **Include Context**: Share full logs, not just the error snippet
3. **Check Suggestions**: Review AI suggestions even when the cause seems obvious
4. **Learn Patterns**: Track recurring failures and address the root cause
5. **Update Configs**: Use corrected configs from the AI, and validate them before applying

### For Prevention

1. **Use Health Checks**: Add database/service health checks
2. **Test Before Deploy**: Use dry-run to catch issues early
3. **Monitor Metrics**: Watch CPU/memory before failures occur
4. **Review Policies**: Ensure security groups are correct
5. **Document Changes**: When updating configs, note the change

## Limitations

### What AI Can Troubleshoot

- ✅ Configuration errors
- ✅ Resource limit problems
- ✅ Networking/security group issues
- ✅ Database connectivity problems
- ✅ Deployment ordering issues
- ✅ Common application errors
- ✅ Performance problems

### What Requires Human Review

- ⚠️ Data corruption scenarios
- ⚠️ Multi-failure cascades
- ⚠️ Unclear error messages
- ⚠️ Custom application code failures
- ⚠️ Third-party service issues
- ⚠️ Physical infrastructure failures

## Examples and Guides

### Common Issues - Quick Links

- [Database Connectivity](../troubleshooting/database-connectivity.md)
- [Kubernetes Pod Failures](../troubleshooting/kubernetes-pods.md)
- [Network Configuration](../troubleshooting/networking.md)
- [Performance Issues](../troubleshooting/performance.md)
- [Resource Limits](../troubleshooting/resource-limits.md)

## Related Documentation

- [Architecture](architecture.md) - AI system overview
- [RAG System](rag-system.md) - Context retrieval for troubleshooting
- [Configuration](configuration.md) - Setup guide
- [Security Policies](security-policies.md) - Safe log handling
- [ADR-015](../architecture/adr/adr-015-ai-integration-architecture.md) - Design decisions

---

**Last Updated**: 2025-01-13

**Status**: ✅ Production-Ready

**Success Rate**: 85-95% accuracy in root cause identification

**Supported**: All deployment types (infrastructure, Kubernetes, database)