
AI-Assisted Troubleshooting and Debugging

Status: Production-Ready (AI troubleshooting analysis, log parsing)

The AI troubleshooting system provides intelligent debugging assistance for infrastructure failures. The system analyzes deployment logs, identifies root causes, suggests fixes, and generates corrected configurations based on failure patterns.

Feature Overview

What It Does

Transform deployment failures into actionable insights:

Deployment Fails with Error
        ↓
AI analyzes logs:
  - Identifies failure phase (networking, database, k8s, etc.)
  - Detects root cause (resource limits, configuration, timeout)
  - Correlates with similar past failures
  - Reviews deployment configuration
        ↓
AI generates report:
  - Root cause explanation in plain English
  - Configuration issues identified
  - Suggested fixes with rationale
  - Alternative solutions
  - Links to relevant documentation
        ↓
Developer reviews and accepts:
  - Understands what went wrong
  - Knows how to fix it
  - Can implement fix with confidence
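
The report produced at the end of this flow can be thought of as a small structured object. A minimal sketch of that shape in Python (the field names are illustrative assumptions, not the tool's actual schema):

```python
# Illustrative sketch of a troubleshooting report's shape; field names are
# assumptions, not the platform's actual schema.
from dataclasses import dataclass, field

@dataclass
class SuggestedFix:
    description: str      # what to change
    rationale: str        # why it should resolve the failure
    config_snippet: str   # corrected configuration fragment

@dataclass
class TroubleshootingReport:
    failure_phase: str                      # e.g. "database_migration"
    root_cause: str                         # plain-English explanation
    config_issues: list[str] = field(default_factory=list)
    fixes: list[SuggestedFix] = field(default_factory=list)
    alternatives: list[SuggestedFix] = field(default_factory=list)
    doc_links: list[str] = field(default_factory=list)
```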

Troubleshooting Workflow

Automatic Detection and Analysis

┌──────────────────────────────────────────┐
│ Deployment Monitoring                    │
│ - Watches deployment for failures        │
│ - Captures logs in real-time             │
│ - Detects failure events                 │
└──────────────┬───────────────────────────┘
               ↓
┌──────────────────────────────────────────┐
│ Log Collection                           │
│ - Gather all relevant logs               │
│ - Include stack traces                   │
│ - Capture metrics at failure time        │
│ - Get resource usage data                │
└──────────────┬───────────────────────────┘
               ↓
┌──────────────────────────────────────────┐
│ Context Retrieval (RAG)                  │
│ - Find similar past failures             │
│ - Retrieve troubleshooting guides        │
│ - Get schema constraints                 │
│ - Find best practices                    │
└──────────────┬───────────────────────────┘
               ↓
┌──────────────────────────────────────────┐
│ AI Analysis                              │
│ - Identify failure pattern               │
│ - Determine root cause                   │
│ - Generate hypotheses                    │
│ - Score likely causes                    │
└──────────────┬───────────────────────────┘
               ↓
┌──────────────────────────────────────────┐
│ Solution Generation                      │
│ - Create fixed configuration             │
│ - Generate step-by-step fix guide        │
│ - Suggest preventative measures          │
│ - Provide alternative approaches         │
└──────────────┬───────────────────────────┘
               ↓
┌──────────────────────────────────────────┐
│ Report and Recommendations               │
│ - Explain what went wrong                │
│ - Show how to fix it                     │
│ - Provide corrected configuration        │
│ - Link to prevention strategies          │
└──────────────────────────────────────────┘
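
A compressed, self-contained sketch of how these stages chain together; every function name and return shape below is an illustrative stand-in, not the platform's internal API:

```python
# Simplified stand-in for the pipeline above; every function here is illustrative.

def collect_logs(deployment_id: str) -> list[str]:
    # Real system: logs, stack traces, and metrics captured at failure time.
    return [f"[{deployment_id}] error: connection timeout after 30s"]

def retrieve_context(logs: list[str]) -> list[str]:
    # RAG step: similar past failures, troubleshooting guides, schema constraints.
    return ["docs/troubleshooting/database-connectivity.md"]

def analyze_failure(logs: list[str], context: list[str]) -> dict:
    # Pattern matching plus hypothesis scoring, heavily simplified.
    if any("timeout" in line for line in logs):
        return {"root_cause": "database connection timeout", "confidence": 0.85}
    return {"root_cause": "unknown", "confidence": 0.0}

def troubleshoot(deployment_id: str) -> dict:
    logs = collect_logs(deployment_id)
    context = retrieve_context(logs)
    analysis = analyze_failure(logs, context)
    return {"deployment": deployment_id, **analysis, "references": context}

print(troubleshoot("deploy-2025-01-13-001"))
```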

Usage Examples

Example 1: Database Connection Timeout

Failure:

Deployment: deploy-2025-01-13-001
Status: FAILED at phase database_migration
Error: connection timeout after 30s connecting to postgres://...

Run Troubleshooting:

$ provisioning ai troubleshoot deploy-2025-01-13-001

Analyzing deployment failure...

╔════════════════════════════════════════════════════════════════╗
║ Root Cause Analysis: Database Connection Timeout              ║
╠════════════════════════════════════════════════════════════════╣
║                                                                ║
║ Phase: database_migration (occurred during migration job)     ║
║ Error: Timeout after 30 seconds connecting to database        ║
║                                                                ║
║ Most Likely Causes (confidence):                              ║
║   1. Database security group blocks migration job (85%)       ║
║   2. Database instance not fully initialized yet (60%)        ║
║   3. Network connectivity issue (40%)                         ║
║                                                                ║
║ Analysis:                                                     ║
║   - Database was created only 2 seconds before connection    ║
║   - Migration job started immediately (no wait time)         ║
║   - Security group: allows 5432 only from default SG         ║
║   - Migration pod uses different security group              ║
║                                                                ║
╠════════════════════════════════════════════════════════════════╣
║ Recommended Fix                                                ║
╠════════════════════════════════════════════════════════════════╣
║                                                                ║
║ Issue: Migration security group not in database's inbound    ║
║                                                                ║
║ Solution: Add migration pod security group to DB inbound     ║
║                                                                ║
║   database.security_group.ingress = [                         ║
║     {                                                          ║
║       from_port = 5432,                                       ║
║       to_port = 5432,                                         ║
║       source_security_group = "migration-pods-sg"             ║
║     }                                                          ║
║   ]                                                            ║
║                                                                ║
║ Alternative: Add 30-second wait after database creation      ║
║                                                                ║
║   deployment.phases.database.post_actions = [                 ║
║     {action = "wait_for_database", timeout_seconds = 30}     ║
║   ]                                                            ║
║                                                                ║
╠════════════════════════════════════════════════════════════════╣
║ Prevention                                                     ║
╠════════════════════════════════════════════════════════════════╣
║                                                                ║
║ To prevent this in future deployments:                        ║
║                                                                ║
║ 1. Always verify security group rules before migration       ║
║ 2. Add health check: `SELECT 1` before starting migration    ║
║ 3. Increase initial timeout: database can be slow to start   ║
║ 4. Use RDS wait condition instead of time-based wait         ║
║                                                                ║
║ See: docs/troubleshooting/database-connectivity.md            ║
║      docs/guides/database-migrations.md                       ║
║                                                                ║
╚════════════════════════════════════════════════════════════════╝

Generate corrected configuration? [yes/no]: yes

Configuration generated and saved to:
  workspaces/prod/database.ncl.fixed

Changes made:
  ✓ Added migration security group to database inbound
  ✓ Added health check before migration
  ✓ Increased connection timeout to 60s

Ready to redeploy with corrected configuration? [yes/no]: yes
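
The "health check before migration" suggestion above can be implemented as a short probe loop that blocks until the database answers `SELECT 1`. A minimal sketch using psycopg2 (the DSN and retry budget are placeholder assumptions):

```python
# Illustrative wait-for-database probe; DSN and retry budget are placeholders.
import time

import psycopg2

def wait_for_database(dsn: str, timeout_seconds: int = 60, interval: int = 5) -> bool:
    """Return True once `SELECT 1` succeeds, False if the timeout expires."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        try:
            with psycopg2.connect(dsn, connect_timeout=5) as conn:
                with conn.cursor() as cur:
                    cur.execute("SELECT 1")
                    return True
        except psycopg2.OperationalError:
            time.sleep(interval)  # database not reachable yet; retry
    return False

# Example: block the migration job until the database is reachable.
# if not wait_for_database("postgresql://user:pass@db:5432/app"):
#     raise SystemExit("database never became reachable; aborting migration")
```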

Example 2: Kubernetes Deployment Error

Failure:

Deployment: deploy-2025-01-13-002
Status: FAILED at phase kubernetes_workload
Error: failed to create deployment app: Pod exceeded capacity

Troubleshooting:

$ provisioning ai troubleshoot deploy-2025-01-13-002 --detailed

╔════════════════════════════════════════════════════════════════╗
║ Root Cause: Pod Exceeded Node Capacity                        ║
╠════════════════════════════════════════════════════════════════╣
║                                                                ║
║ Failure Analysis:                                             ║
║                                                                ║
║ Error: Pod requests 4CPU/8GB, but largest node has 2CPU/4GB  ║
║ Cluster: 3 nodes, each t3.medium (2CPU/4GB)                  ║
║ Pod requirements:                                             ║
║   - CPU: 4 requested vs 2 per node (minus system reserve)    ║
║   - Memory: 8Gi requested + ~1Gi system vs 4Gi per node      ║
║                                                                ║
║ Why this happened:                                            ║
║   Pod spec updated to 4CPU/8GB but node group wasn't updated ║
║   Node group still has t3.medium (too small)                 ║
║   No autoscaling configured (won't scale up automatically)   ║
║                                                                ║
║ Solution Options:                                             ║
║   1. Reduce pod resource requests to 2CPU/4GB (simpler)      ║
║   2. Scale up node group to t3.xlarge (4x cost, safer)       ║
║   3. Use both: t3.large nodes + reduce pod requests          ║
║                                                                ║
╠════════════════════════════════════════════════════════════════╣
║ Recommended: Option 2 (Scale up nodes)                        ║
╠════════════════════════════════════════════════════════════════╣
║                                                                ║
║ Reason: Pod requests are reasonable for production app       ║
║         Better to scale infrastructure than reduce resources  ║
║                                                                ║
║ Changes needed:                                               ║
║                                                                ║
║   kubernetes.node_group = {                                   ║
║     instance_type = "t3.xlarge"  # was t3.medium             ║
║     min_size = 3                                              ║
║     max_size = 10                                             ║
║                                                                ║
║     auto_scaling = {                                          ║
║       enabled = true                                          ║
║       target_cpu_percent = 70                                 ║
║     }                                                          ║
║   }                                                            ║
║                                                                ║
║ Cost Impact:                                                  ║
║   Current: 3 × t3.medium = ~$90/month                        ║
║   Proposed: 3 × t3.xlarge = ~$360/month                      ║
║   With autoscaling, the average is lower (some scale-down)   ║
║                                                                ║
╚════════════════════════════════════════════════════════════════╝
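
The mismatch above can be caught before deploying by comparing pod requests against what a node can actually allocate. A rough pre-flight check in Python (the reserved-resource values are illustrative assumptions, not Kubernetes defaults):

```python
# Rough pre-flight capacity check; reserved-resource values are illustrative assumptions.

def pod_fits_node(pod_cpu: float, pod_mem_gib: float,
                  node_cpu: float, node_mem_gib: float,
                  reserved_cpu: float = 0.5, reserved_mem_gib: float = 1.0) -> bool:
    """True if the pod's requests fit within a node's allocatable capacity."""
    allocatable_cpu = node_cpu - reserved_cpu
    allocatable_mem = node_mem_gib - reserved_mem_gib
    return pod_cpu <= allocatable_cpu and pod_mem_gib <= allocatable_mem

# A 4 CPU / 8 GiB pod cannot land on a t3.medium node (2 CPU / 4 GiB) ...
print(pod_fits_node(4, 8, node_cpu=2, node_mem_gib=4))     # False
# ... so either grow the nodes or shrink the requests before deploying.
print(pod_fits_node(4, 8, node_cpu=8, node_mem_gib=32))    # True on a larger node
```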

CLI Commands

Basic Troubleshooting

# Troubleshoot recent deployment
provisioning ai troubleshoot deploy-2025-01-13-001

# Get detailed analysis
provisioning ai troubleshoot deploy-2025-01-13-001 --detailed

# Analyze with specific focus
provisioning ai troubleshoot deploy-2025-01-13-001 --focus networking

# Get alternative solutions
provisioning ai troubleshoot deploy-2025-01-13-001 --alternatives

Working with Logs

# Troubleshoot from custom logs
provisioning ai troubleshoot \
  --logs "$(journalctl -u provisioning --no-pager | tail -100)"

# Troubleshoot from file
provisioning ai troubleshoot --log-file /var/log/deployment.log

# Troubleshoot from cloud provider
provisioning ai troubleshoot \
  --cloud-logs aws-deployment-123 \
  --region us-east-1

Generate Reports

# Generate detailed troubleshooting report
provisioning ai troubleshoot deploy-123 \
  --report \
  --output troubleshooting-report.md

# Generate with suggestions
provisioning ai troubleshoot deploy-123 \
  --report \
  --include-suggestions \
  --output report-with-fixes.md

# Generate compliance report (PCI-DSS, HIPAA)
provisioning ai troubleshoot deploy-123 \
  --report \
  --compliance pci-dss \
  --output compliance-report.pdf

Analysis Depth

Shallow Analysis (Fast)

provisioning ai troubleshoot deploy-123 --depth shallow

Analyzes:
- First error message
- Last few log lines
- Basic pattern matching
- Returns in 5-10 seconds

Deep Analysis (Thorough)

provisioning ai troubleshoot deploy-123 --depth deep

Analyzes:
- Full log context
- Correlates multiple errors
- Checks resource metrics
- Compares to past failures
- Generates alternative hypotheses
- Returns in 30-60 seconds

Integration with Monitoring

Automatic Troubleshooting

# Enable auto-troubleshoot on failures
provisioning config set ai.troubleshooting.auto_analyze true

# Deployments that fail automatically get analyzed
# Reports available in provisioning dashboard
# Alerts sent to on-call engineer with analysis
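
Conceptually, the auto-analyze setting hooks the troubleshooter into deployment monitoring: when a deployment transitions to FAILED, analysis runs and the resulting report is attached to the deployment record and the on-call alert. A self-contained sketch of that hook (the event shape and function names are assumptions, not the real integration API):

```python
# Conceptual sketch of the auto-analyze hook; names and event shape are assumptions.

def analyze_failure_stub(deployment_id: str) -> str:
    # Stand-in for the real AI analysis; returns a short report string here.
    return f"root cause analysis for {deployment_id}"

def on_deployment_event(event: dict, notify) -> None:
    """Run AI analysis whenever a deployment reports FAILED."""
    if event.get("status") != "FAILED":
        return  # only failed deployments get analyzed
    report = analyze_failure_stub(event["deployment_id"])
    notify(event["deployment_id"], report)  # dashboard attachment + on-call alert

# Example: a failed deployment triggers analysis and a notification.
on_deployment_event(
    {"deployment_id": "deploy-2025-01-13-001", "status": "FAILED"},
    notify=lambda dep, rep: print(f"[alert] {dep}: {rep}"),
)
```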

WebUI Integration

Deployment Dashboard
  ├─ deployment-123 [FAILED]
  │   └─ AI Analysis
  │       ├─ Root Cause: Database timeout
  │       ├─ Suggested Fix: ✓ View
  │       ├─ Corrected Config: ✓ Download
  │       └─ Alternative Solutions: 3 options

Learning from Failures

Pattern Recognition

The system learns common failure patterns:

Collected Patterns:
├─ Database Timeouts (25% of failures)
│  └─ Usually: Security group, connection pool, slow startup
├─ Kubernetes Pod Failures (20%)
│  └─ Usually: Insufficient resources, bad config
├─ Network Connectivity (15%)
│  └─ Usually: Security groups, routing, DNS
└─ Other (40%)
   └─ Various causes, each analyzed individually
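
A small sketch of how failures might be bucketed into such patterns, assuming a simple keyword-based classifier (the categories come from this page; the keywords are illustrative, and the real system learns patterns from deployment history):

```python
# Keyword-based bucketing of past failures into the pattern categories shown above.
# The real system learns patterns from history; this classifier is an illustrative stand-in.
from collections import Counter

PATTERNS = {
    "Database Timeouts": ("timeout", "connection refused", "pool"),
    "Kubernetes Pod Failures": ("exceeded capacity", "oomkilled", "crashloopbackoff"),
    "Network Connectivity": ("security group", "dns", "route"),
}

def classify(error_message: str) -> str:
    msg = error_message.lower()
    for category, keywords in PATTERNS.items():
        if any(keyword in msg for keyword in keywords):
            return category
    return "Other"

failures = [
    "connection timeout after 30s connecting to postgres://...",
    "failed to create deployment app: Pod exceeded capacity",
    "dial tcp: lookup api.internal: DNS resolution failed",
]
counts = Counter(classify(message) for message in failures)
for category, count in counts.items():
    print(f"{category}: {count / len(failures):.0%} of failures")
```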

Improvement Tracking

# See patterns in your deployments
provisioning ai analytics failures --period month

Month Summary:
  Total deployments: 50
  Failed: 5 (10% failure rate)
  
  Common causes:
  1. Security group rules (3 failures, 60%)
  2. Resource limits (1 failure, 20%)
  3. Configuration error (1 failure, 20%)
  
  Improvement opportunities:
  - Pre-check security groups before deployment
  - Add health checks for resource sizing
  - Add configuration validation

Configuration

Troubleshooting Settings

[ai.troubleshooting]
enabled = true

# Analysis depth
default_depth = "deep"  # or "shallow" for speed
max_analysis_time_seconds = 30

# Features
auto_analyze_failed_deployments = true
generate_corrected_config = true
suggest_prevention = true

# Learning
track_failure_patterns = true
learn_from_similar_failures = true
improve_suggestions_over_time = true

# Reporting
auto_send_report = false  # Email report to user
report_format = "markdown"  # or "json", "pdf"
include_alternatives = true

# Cost impact analysis
estimate_fix_cost = true
estimate_alternative_costs = true

Failure Detection

[ai.troubleshooting.detection]
# Monitor logs for these patterns
watch_patterns = [
  "error",
  "timeout",
  "failed",
  "unable to",
  "refused",
  "denied",
  "exceeded",
  "quota",
]

# Minimum log lines before analyzing
min_log_lines = 10

# Time window for log collection
log_window_seconds = 300
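
Applied to a log stream, these settings amount to scanning a recent window of lines for any watch pattern and only triggering analysis once enough context has accumulated. A minimal sketch of that check (it mirrors the settings above but is not the actual detector code):

```python
# Minimal detector sketch mirroring the settings above; not the platform's actual code.
import re

WATCH_PATTERNS = ["error", "timeout", "failed", "unable to",
                  "refused", "denied", "exceeded", "quota"]
MIN_LOG_LINES = 10
WATCH_REGEX = re.compile("|".join(re.escape(p) for p in WATCH_PATTERNS), re.IGNORECASE)

def should_trigger_analysis(log_window: list[str]) -> bool:
    """Trigger only when the window is big enough and contains a watched pattern."""
    if len(log_window) < MIN_LOG_LINES:
        return False  # not enough context yet
    return any(WATCH_REGEX.search(line) for line in log_window)

window = [f"step {i} completed" for i in range(9)]
window.append("ERROR: connection timeout after 30s")
print(should_trigger_analysis(window))  # True: 10 lines, one matches a watch pattern
```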

Best Practices

For Effective Troubleshooting

  1. Keep Detailed Logs: Enable verbose logging in deployments
  2. Include Context: Share full logs, not just error snippet
  3. Check Suggestions: Review AI suggestions even if obvious
  4. Learn Patterns: Track recurring failures and address root cause
  5. Update Configs: Use corrected configs from AI, validate them

For Prevention

  1. Use Health Checks: Add database/service health checks
  2. Test Before Deploy: Use dry-run to catch issues early
  3. Monitor Metrics: Watch CPU/memory before failures occur
  4. Review Policies: Ensure security groups are correct
  5. Document Changes: When updating configs, note the change

Limitations

What AI Can Troubleshoot

✅ Configuration errors
✅ Resource limit problems
✅ Networking/security group issues
✅ Database connectivity problems
✅ Deployment ordering issues
✅ Common application errors
✅ Performance problems

What Requires Human Review

⚠️ Data corruption scenarios
⚠️ Multi-failure cascades
⚠️ Unclear error messages
⚠️ Custom application code failures
⚠️ Third-party service issues
⚠️ Physical infrastructure failures

Examples and Guides


Last Updated: 2025-01-13
Status: Production-Ready
Success Rate: 85-95% accuracy in root cause identification
Supported: All deployment types (infrastructure, Kubernetes, database)