Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Troubleshooting Guide

This comprehensive troubleshooting guide helps you diagnose and resolve common issues with Infrastructure Automation.

What You’ll Learn

  • Common issues and their solutions
  • Diagnostic commands and techniques
  • Error message interpretation
  • Performance optimization
  • Recovery procedures
  • Prevention strategies

General Troubleshooting Approach

1. Identify the Problem

# Check overall system status
provisioning env
provisioning validate config

# Check specific component status
provisioning show servers --infra my-infra
provisioning taskserv list --infra my-infra --installed

2. Gather Information

# Enable debug mode for detailed output
provisioning --debug <command>

# Check logs and errors
provisioning show logs --infra my-infra

3. Use Diagnostic Commands

# Validate configuration
provisioning validate config --detailed

# Test connectivity
provisioning provider test aws
provisioning network test --infra my-infra

Installation and Setup Issues

Issue: Installation Fails

Symptoms:

  • Installation script errors
  • Missing dependencies
  • Permission denied errors

Diagnosis:

# Check system requirements
uname -a
df -h
whoami

# Check permissions
ls -la /usr/local/
sudo -l

Solutions:

Permission Issues

# Run installer with sudo
sudo ./install-provisioning

# Or install to user directory
./install-provisioning --prefix=$HOME/provisioning
export PATH="$HOME/provisioning/bin:$PATH"

Missing Dependencies

# Ubuntu/Debian
sudo apt update
sudo apt install -y curl wget tar build-essential

# RHEL/CentOS
sudo dnf install -y curl wget tar gcc make

Architecture Issues

# Check architecture
uname -m

# Download correct architecture package
# x86_64: Intel/AMD 64-bit
# arm64: ARM 64-bit (Apple Silicon)
wget https://releases.example.com/provisioning-linux-x86_64.tar.gz

Issue: Command Not Found

Symptoms:

bash: provisioning: command not found

Diagnosis:

# Check if provisioning is installed
which provisioning
ls -la /usr/local/bin/provisioning

# Check PATH
echo $PATH

Solutions:

# Add to PATH
export PATH="/usr/local/bin:$PATH"

# Make permanent (add to shell profile)
echo 'export PATH="/usr/local/bin:$PATH"' >> ~/.bashrc
source ~/.bashrc

# Create symlink if missing
sudo ln -sf /usr/local/provisioning/core/nulib/provisioning /usr/local/bin/provisioning

Issue: Nushell Plugin Errors

Symptoms:

Plugin not found: nu_plugin_kcl
Plugin registration failed

Diagnosis:

# Check Nushell version
nu --version

# Check KCL installation (required for nu_plugin_kcl)
kcl version

# Check plugin registration
nu -c "version | get installed_plugins"

Solutions:

# Install KCL CLI (required for nu_plugin_kcl)
# Download from: https://github.com/kcl-lang/cli/releases

# Re-register plugins
nu -c "plugin add /usr/local/provisioning/plugins/nu_plugin_kcl"
nu -c "plugin add /usr/local/provisioning/plugins/nu_plugin_tera"

# Restart Nushell after plugin registration

Configuration Issues

Issue: Configuration Not Found

Symptoms:

Configuration file not found
Failed to load configuration

Diagnosis:

# Check configuration file locations
provisioning env | grep config

# Check if files exist
ls -la ~/.config/provisioning/
ls -la /usr/local/provisioning/config.defaults.toml

Solutions:

# Initialize user configuration
provisioning init config

# Create missing directories
mkdir -p ~/.config/provisioning

# Copy template
cp /usr/local/provisioning/config-examples/config.user.toml ~/.config/provisioning/config.toml

# Verify configuration
provisioning validate config

Issue: Configuration Validation Errors

Symptoms:

Configuration validation failed
Invalid configuration value
Missing required field

Diagnosis:

# Detailed validation
provisioning validate config --detailed

# Check specific sections
provisioning config show --section paths
provisioning config show --section providers

Solutions:

Path Configuration Issues

# Check base path exists
ls -la /path/to/provisioning

# Update configuration
nano ~/.config/provisioning/config.toml

# Fix paths section
[paths]
base = "/correct/path/to/provisioning"

Provider Configuration Issues

# Test provider connectivity
provisioning provider test aws

# Check credentials
aws configure list  # For AWS
upcloud-cli config  # For UpCloud

# Update provider configuration
[providers.aws]
interface = "CLI"  # or "API"

Issue: Interpolation Failures

Symptoms:

Interpolation pattern not resolved: {{env.VARIABLE}}
Template rendering failed

Diagnosis:

# Test interpolation
provisioning validate interpolation test

# Check environment variables
env | grep VARIABLE

# Debug interpolation
provisioning --debug validate interpolation validate

Solutions:

# Set missing environment variables
export MISSING_VARIABLE="value"

# Use fallback values in configuration
config_value = "{{env.VARIABLE || 'default_value'}}"

# Check interpolation syntax
# Correct: {{env.HOME}}
# Incorrect: ${HOME} or $HOME

Server Management Issues

Issue: Server Creation Fails

Symptoms:

Failed to create server
Provider API error
Insufficient quota

Diagnosis:

# Check provider status
provisioning provider status aws

# Test connectivity
ping api.provider.com
curl -I https://api.provider.com

# Check quota
provisioning provider quota --infra my-infra

# Debug server creation
provisioning --debug server create web-01 --infra my-infra --check

Solutions:

API Authentication Issues

# AWS
aws configure list
aws sts get-caller-identity

# UpCloud
upcloud-cli account show

# Update credentials
aws configure  # For AWS
export UPCLOUD_USERNAME="your-username"
export UPCLOUD_PASSWORD="your-password"

Quota/Limit Issues

# Check current usage
provisioning show costs --infra my-infra

# Request quota increase from provider
# Or reduce resource requirements

# Use smaller instance types
# Reduce number of servers

Network/Connectivity Issues

# Test network connectivity
curl -v https://api.aws.amazon.com
curl -v https://api.upcloud.com

# Check DNS resolution
nslookup api.aws.amazon.com

# Check firewall rules
# Ensure outbound HTTPS (port 443) is allowed

Issue: SSH Access Fails

Symptoms:

Connection refused
Permission denied
Host key verification failed

Diagnosis:

# Check server status
provisioning server list --infra my-infra

# Test SSH manually
ssh -v user@server-ip

# Check SSH configuration
provisioning show servers web-01 --infra my-infra

Solutions:

Connection Issues

# Wait for server to be fully ready
provisioning server list --infra my-infra --status

# Check security groups/firewall
# Ensure SSH (port 22) is allowed

# Use correct IP address
provisioning show servers web-01 --infra my-infra | grep ip

Authentication Issues

# Check SSH key
ls -la ~/.ssh/
ssh-add -l

# Generate new key if needed
ssh-keygen -t ed25519 -f ~/.ssh/provisioning_key

# Use specific key
provisioning server ssh web-01 --key ~/.ssh/provisioning_key --infra my-infra

Host Key Issues

# Remove old host key
ssh-keygen -R server-ip

# Accept new host key
ssh -o StrictHostKeyChecking=accept-new user@server-ip

Task Service Issues

Issue: Service Installation Fails

Symptoms:

Service installation failed
Package not found
Dependency conflicts

Diagnosis:

# Check service prerequisites
provisioning taskserv check kubernetes --infra my-infra

# Debug installation
provisioning --debug taskserv create kubernetes --infra my-infra --check

# Check server resources
provisioning server ssh web-01 --command "free -h && df -h" --infra my-infra

Solutions:

Resource Issues

# Check available resources
provisioning server ssh web-01 --command "
    echo 'Memory:' && free -h
    echo 'Disk:' && df -h
    echo 'CPU:' && nproc
" --infra my-infra

# Upgrade server if needed
provisioning server resize web-01 --plan larger-plan --infra my-infra

Package Repository Issues

# Update package lists
provisioning server ssh web-01 --command "
    sudo apt update && sudo apt upgrade -y
" --infra my-infra

# Check repository connectivity
provisioning server ssh web-01 --command "
    curl -I https://download.docker.com/linux/ubuntu/
" --infra my-infra

Dependency Issues

# Install missing dependencies
provisioning taskserv create containerd --infra my-infra

# Then install dependent service
provisioning taskserv create kubernetes --infra my-infra

Issue: Service Not Running

Symptoms:

Service status: failed
Service not responding
Health check failures

Diagnosis:

# Check service status
provisioning taskserv status kubernetes --infra my-infra

# Check service logs
provisioning taskserv logs kubernetes --infra my-infra

# SSH and check manually
provisioning server ssh web-01 --command "
    sudo systemctl status kubernetes
    sudo journalctl -u kubernetes --no-pager -n 50
" --infra my-infra

Solutions:

Configuration Issues

# Reconfigure service
provisioning taskserv configure kubernetes --infra my-infra

# Reset to defaults
provisioning taskserv reset kubernetes --infra my-infra

Port Conflicts

# Check port usage
provisioning server ssh web-01 --command "
    sudo netstat -tulpn | grep :6443
    sudo ss -tulpn | grep :6443
" --infra my-infra

# Change port configuration or stop conflicting service

Permission Issues

# Fix permissions
provisioning server ssh web-01 --command "
    sudo chown -R kubernetes:kubernetes /var/lib/kubernetes
    sudo chmod 600 /etc/kubernetes/admin.conf
" --infra my-infra

Cluster Management Issues

Issue: Cluster Deployment Fails

Symptoms:

Cluster deployment failed
Pod creation errors
Service unavailable

Diagnosis:

# Check cluster status
provisioning cluster status web-cluster --infra my-infra

# Check Kubernetes cluster
provisioning server ssh master-01 --command "
    kubectl get nodes
    kubectl get pods --all-namespaces
" --infra my-infra

# Check cluster logs
provisioning cluster logs web-cluster --infra my-infra

Solutions:

Node Issues

# Check node status
provisioning server ssh master-01 --command "
    kubectl describe nodes
" --infra my-infra

# Drain and rejoin problematic nodes
provisioning server ssh master-01 --command "
    kubectl drain worker-01 --ignore-daemonsets
    kubectl delete node worker-01
" --infra my-infra

# Rejoin node
provisioning taskserv configure kubernetes --infra my-infra --servers worker-01

Resource Constraints

# Check resource usage
provisioning server ssh master-01 --command "
    kubectl top nodes
    kubectl top pods --all-namespaces
" --infra my-infra

# Scale down or add more nodes
provisioning cluster scale web-cluster --replicas 3 --infra my-infra
provisioning server create worker-04 --infra my-infra

Network Issues

# Check network plugin
provisioning server ssh master-01 --command "
    kubectl get pods -n kube-system | grep cilium
" --infra my-infra

# Restart network plugin
provisioning taskserv restart cilium --infra my-infra

Performance Issues

Issue: Slow Operations

Symptoms:

  • Commands take very long to complete
  • Timeouts during operations
  • High CPU/memory usage

Diagnosis:

# Check system resources
top
htop
free -h
df -h

# Check network latency
ping api.aws.amazon.com
traceroute api.aws.amazon.com

# Profile command execution
time provisioning server list --infra my-infra

Solutions:

Local System Issues

# Close unnecessary applications
# Upgrade system resources
# Use SSD storage if available

# Increase timeout values
export PROVISIONING_TIMEOUT=600  # 10 minutes

Network Issues

# Use region closer to your location
[providers.aws]
region = "us-west-1"  # Closer region

# Enable connection pooling/caching
[cache]
enabled = true

Large Infrastructure Issues

# Use parallel operations
provisioning server create --infra my-infra --parallel 4

# Filter results
provisioning server list --infra my-infra --filter "status == 'running'"

Issue: High Memory Usage

Symptoms:

  • System becomes unresponsive
  • Out of memory errors
  • Swap usage high

Diagnosis:

# Check memory usage
free -h
ps aux --sort=-%mem | head

# Check for memory leaks
valgrind provisioning server list --infra my-infra

Solutions:

# Increase system memory
# Close other applications
# Use streaming operations for large datasets

# Enable garbage collection
export PROVISIONING_GC_ENABLED=true

# Reduce concurrent operations
export PROVISIONING_MAX_PARALLEL=2

Network and Connectivity Issues

Issue: API Connectivity Problems

Symptoms:

Connection timeout
DNS resolution failed
SSL certificate errors

Diagnosis:

# Test basic connectivity
ping 8.8.8.8
curl -I https://api.aws.amazon.com
nslookup api.upcloud.com

# Check SSL certificates
openssl s_client -connect api.aws.amazon.com:443 -servername api.aws.amazon.com

Solutions:

DNS Issues

# Use alternative DNS
echo 'nameserver 8.8.8.8' | sudo tee /etc/resolv.conf

# Clear DNS cache
sudo systemctl restart systemd-resolved  # Ubuntu
sudo dscacheutil -flushcache             # macOS

Proxy/Firewall Issues

# Configure proxy if needed
export HTTP_PROXY=http://proxy.company.com:9090
export HTTPS_PROXY=http://proxy.company.com:9090

# Check firewall rules
sudo ufw status  # Ubuntu
sudo firewall-cmd --list-all  # RHEL/CentOS

Certificate Issues

# Update CA certificates
sudo apt update && sudo apt install ca-certificates  # Ubuntu
brew install ca-certificates                         # macOS

# Skip SSL verification (temporary)
export PROVISIONING_SKIP_SSL_VERIFY=true

Security and Encryption Issues

Issue: SOPS Decryption Fails

Symptoms:

SOPS decryption failed
Age key not found
Invalid key format

Diagnosis:

# Check SOPS configuration
provisioning sops config

# Test SOPS manually
sops -d encrypted-file.k

# Check Age keys
ls -la ~/.config/sops/age/keys.txt
age-keygen -y ~/.config/sops/age/keys.txt

Solutions:

Missing Keys

# Generate new Age key
age-keygen -o ~/.config/sops/age/keys.txt

# Update SOPS configuration
provisioning sops config --key-file ~/.config/sops/age/keys.txt

Key Permissions

# Fix key file permissions
chmod 600 ~/.config/sops/age/keys.txt
chown $(whoami) ~/.config/sops/age/keys.txt

Configuration Issues

# Update SOPS configuration in ~/.config/provisioning/config.toml
[sops]
use_sops = true
key_search_paths = [
    "~/.config/sops/age/keys.txt",
    "/path/to/your/key.txt"
]

Issue: Access Denied Errors

Symptoms:

Permission denied
Access denied
Insufficient privileges

Diagnosis:

# Check user permissions
id
groups

# Check file permissions
ls -la ~/.config/provisioning/
ls -la /usr/local/provisioning/

# Test with sudo
sudo provisioning env

Solutions:

# Fix file ownership
sudo chown -R $(whoami):$(whoami) ~/.config/provisioning/

# Fix permissions
chmod -R 755 ~/.config/provisioning/
chmod 600 ~/.config/provisioning/config.toml

# Add user to required groups
sudo usermod -a -G docker $(whoami)  # For Docker access

Data and Storage Issues

Issue: Disk Space Problems

Symptoms:

No space left on device
Write failed
Disk full

Diagnosis:

# Check disk usage
df -h
du -sh ~/.config/provisioning/
du -sh /usr/local/provisioning/

# Find large files
find /usr/local/provisioning -type f -size +100M

Solutions:

# Clean up cache files
rm -rf ~/.config/provisioning/cache/*
rm -rf /usr/local/provisioning/.cache/*

# Clean up logs
find /usr/local/provisioning -name "*.log" -mtime +30 -delete

# Clean up temporary files
rm -rf /tmp/provisioning-*

# Compress old backups
gzip ~/.config/provisioning/backups/*.yaml

Recovery Procedures

Configuration Recovery

# Restore from backup
provisioning config restore --backup latest

# Reset to defaults
provisioning config reset

# Recreate configuration
provisioning init config --force

Infrastructure Recovery

# Check infrastructure status
provisioning show servers --infra my-infra

# Recover failed servers
provisioning server create failed-server --infra my-infra

# Restore from backup
provisioning restore --backup latest --infra my-infra

Service Recovery

# Restart failed services
provisioning taskserv restart kubernetes --infra my-infra

# Reinstall corrupted services
provisioning taskserv delete kubernetes --infra my-infra
provisioning taskserv create kubernetes --infra my-infra

Prevention Strategies

Regular Maintenance

# Weekly maintenance script
#!/bin/bash

# Update system
provisioning update --check

# Validate configuration
provisioning validate config

# Check for service updates
provisioning taskserv check-updates

# Clean up old files
provisioning cleanup --older-than 30d

# Create backup
provisioning backup create --name "weekly-$(date +%Y%m%d)"

Monitoring Setup

# Set up health monitoring
#!/bin/bash

# Check system health every hour
0 * * * * /usr/local/bin/provisioning health check || echo "Health check failed" | mail -s "Provisioning Alert" admin@company.com

# Weekly cost reports
0 9 * * 1 /usr/local/bin/provisioning show costs --all | mail -s "Weekly Cost Report" finance@company.com

Best Practices

  1. Configuration Management

    • Version control all configuration files
    • Use check mode before applying changes
    • Regular validation and testing
  2. Security

    • Regular key rotation
    • Principle of least privilege
    • Audit logs review
  3. Backup Strategy

    • Automated daily backups
    • Test restore procedures
    • Off-site backup storage
  4. Documentation

    • Document custom configurations
    • Keep troubleshooting logs
    • Share knowledge with team

Getting Additional Help

Debug Information Collection

#!/bin/bash
# Collect debug information

echo "Collecting provisioning debug information..."

mkdir -p /tmp/provisioning-debug
cd /tmp/provisioning-debug

# System information
uname -a > system-info.txt
free -h >> system-info.txt
df -h >> system-info.txt

# Provisioning information
provisioning --version > provisioning-info.txt
provisioning env >> provisioning-info.txt
provisioning validate config --detailed > config-validation.txt 2>&1

# Configuration files
cp ~/.config/provisioning/config.toml user-config.toml 2>/dev/null || echo "No user config" > user-config.toml

# Logs
provisioning show logs > system-logs.txt 2>&1

# Create archive
cd /tmp
tar czf provisioning-debug-$(date +%Y%m%d_%H%M%S).tar.gz provisioning-debug/

echo "Debug information collected in: provisioning-debug-*.tar.gz"

Support Channels

  1. Built-in Help

    provisioning help
    provisioning help <command>
    
  2. Documentation

    • User guides in docs/user/
    • CLI reference: docs/user/cli-reference.md
    • Configuration guide: docs/user/configuration.md
  3. Community Resources

    • Project repository issues
    • Community forums
    • Documentation wiki
  4. Enterprise Support

    • Professional services
    • Priority support
    • Custom development

Remember: When reporting issues, always include the debug information collected above and specific error messages.