# Etcd Task Service

## Overview

The Etcd task service installs and configures [etcd](https://etcd.io/), a distributed, reliable key-value store for the most critical data of a distributed system. Etcd is the primary datastore of Kubernetes and is used by many other distributed systems for configuration management, service discovery, and distributed coordination.

## Features

### Core Capabilities

- **Distributed Key-Value Store** - Consistent, reliable data storage across multiple nodes
- **Raft Consensus Algorithm** - Strong consistency guarantees with leader election
- **MVCC (Multi-Version Concurrency Control)** - Point-in-time snapshots and watch functionality
- **Transactional Operations** - Atomic multi-key operations with compare-and-swap
- **Hierarchical Key Layout** - Flat key space organized by convention into directory-like prefixes

### High Availability & Clustering

- **Multi-Node Clusters** - Support for clusters of 3, 5, 7, or more nodes
- **Automatic Leader Election** - Built-in leader election with automatic failover
- **Cluster Membership Management** - Dynamic cluster membership changes
- **Split-Brain Protection** - Quorum-based decision making
- **Rolling Updates** - Zero-downtime cluster updates

### Security Features

- **TLS Encryption** - Encryption in transit for client and peer communication
- **Certificate-Based Authentication** - X.509 certificate authentication
- **Role-Based Access Control (RBAC)** - Fine-grained permission management
- **User Authentication** - User-based authentication with password and certificate support
- **Network Security** - Separate certificates and ports for peer and client traffic

### Operational Features

- **Backup & Restore** - Built-in snapshot and restoration capabilities
- **Monitoring & Metrics** - Prometheus metrics integration
- **Health Checking** - Comprehensive health check endpoints
- **Performance Tuning** - Configurable performance parameters
- **Maintenance Operations** - Compaction, defragmentation, and member management

## Configuration

### Basic Single-Node Configuration

```kcl
etcd: ETCD = {
    name: "etcd"
    version: "v3.5.10"
    etcd_name: "etcd-single"
    ssl_mode: "openssl"
    ssl_sign: "RSA"
    ca_sign: "RSA"
    ssl_curve: "prime256v1"
    long_sign: 4096
    cipher: "-aes256"
    ca_sign_days: 1460
    sign_days: 730
    sign_sha: 256
    etcd_protocol: "https"
    source_url: "github"
    cluster_name: "etcd-cluster"
    hostname: "etcd-node-1"
    cn: "etcd-node-1"
    c: "US"
    data_dir: "/var/lib/etcd"
    conf_path: "/etc/etcd/config.yaml"
    log_level: "warn"
    log_out: "stderr"
    cli_ip: "127.0.0.1"
    cli_port: 2379
    peer_ip: "127.0.0.1"
    peer_port: 2380
    cluster_list: "etcd-single=https://127.0.0.1:2380"
    token: "etcd-cluster-1"
    certs_path: "/etc/ssl/etcd"
    use_localhost: true
    use_dns: false
}
```

### Production Multi-Node Cluster

```kcl
etcd: ETCD = {
    name: "etcd"
    version: "v3.5.10"
    etcd_name: "etcd-prod-1"
    ssl_mode: "openssl"
    ssl_sign: "RSA"
    ca_sign: "RSA"
    ssl_curve: "prime256v1"
    long_sign: 4096
    cipher: "-aes256"
    ca_sign_days: 1460
    sign_days: 730
    sign_sha: 256
    etcd_protocol: "https"
    source_url: "github"
    cluster_name: "production-cluster"
    hostname: "etcd-prod-1"
    cn: "etcd-prod-1.company.com"
    c: "US"
    data_dir: "/var/lib/etcd"
    conf_path: "/etc/etcd/config.yaml"
    log_level: "warn"
    log_out: "stderr"
    cli_ip: "10.0.1.10"
    cli_port: 2379
    peer_ip: "10.0.1.10"
    peer_port: 2380
    cluster_list: "etcd-prod-1=https://10.0.1.10:2380,etcd-prod-2=https://10.0.1.11:2380,etcd-prod-3=https://10.0.1.12:2380"
    token: "production-etcd-cluster"
    certs_path: "/etc/ssl/etcd"
    prov_path: "etcdcerts"
    listen_peers: "https://10.0.1.10:2380"
    adv_listen_peers: "https://10.0.1.10:2380"
    initial_peers: "etcd-prod-1=https://10.0.1.10:2380,etcd-prod-2=https://10.0.1.11:2380,etcd-prod-3=https://10.0.1.12:2380"
    listen_clients: "https://10.0.1.10:2379,https://127.0.0.1:2379"
    adv_listen_clients: "https://10.0.1.10:2379"
    use_localhost: false
    domain_name: "company.com"
    use_dns: true
}
```

### High-Performance Configuration

```kcl
etcd: ETCD = {
    name: "etcd"
    version: "v3.5.10"
    # ... base configuration

    performance: {
        snapshot_count: 100000
        heartbeat_interval: 100
        election_timeout: 1000
        max_snapshots: 5
        max_wals: 5
        max_txn_ops: 128
        max_request_bytes: 1572864
        grpc_keepalive_min_time: 5
        grpc_keepalive_interval: 2
        grpc_keepalive_timeout: 6
    }

    storage: {
        backend_batch_limit: 10000
        backend_batch_interval: 100
        backend_bbolt_freelist_type: "map"
        quota_backend_bytes: 8589934592  # 8 GiB
    }

    security: {
        auto_compaction_mode: "periodic"
        auto_compaction_retention: "1h"
    }
}
```

### Kubernetes Cluster Configuration

```kcl
etcd: ETCD = {
    name: "etcd"
    version: "v3.5.10"
    etcd_name: "k8s-etcd-1"
    # ... base configuration

    cluster_name: "kubernetes-cluster"
    hostname: "k8s-master-1"
    cn: "k8s-master-1.cluster.local"
    data_dir: "/var/lib/etcd"
    cli_ip: "10.0.1.100"
    cli_port: 2379
    peer_ip: "10.0.1.100"
    peer_port: 2380
    cluster_list: "k8s-etcd-1=https://10.0.1.100:2380,k8s-etcd-2=https://10.0.1.101:2380,k8s-etcd-3=https://10.0.1.102:2380"
    token: "k8s-etcd-cluster"

    kubernetes_integration: {
        enabled: true
        namespace_prefix: "/registry"
        compaction_interval: "5m"
        defrag_threshold: 100
        health_check_interval: "10s"
    }

    backup: {
        enabled: true
        interval: "6h"
        retention: "30d"
        s3_bucket: "k8s-etcd-backups"
        encryption: true
    }
}
```

### DNS-Based Discovery Configuration

```kcl
etcd: ETCD = {
    name: "etcd"
    version: "v3.5.10"
    # ... base configuration

    dns_domain_path: "_etcd-server-ssl._tcp.company.com"
    domain_name: "company.com"
    discovery_srv: "_etcd-server-ssl._tcp.company.com"
    use_dns: true

    dns_discovery: {
        enabled: true
        service: "_etcd-server-ssl._tcp"
        domain: "company.com"
        srv_records: [
            {
                name: "etcd-1"
                port: 2380
                weight: 10
                priority: 0
                target: "etcd-1.company.com"
            },
            {
                name: "etcd-2"
                port: 2380
                weight: 10
                priority: 0
                target: "etcd-2.company.com"
            },
            {
                name: "etcd-3"
                port: 2380
                weight: 10
                priority: 0
                target: "etcd-3.company.com"
            }
        ]
    }
}
```

## Usage

### Deploy Etcd

```bash
./core/nulib/provisioning taskserv create etcd --infra
```

### List Available Task Services

```bash
./core/nulib/provisioning taskserv list
```

### SSH to Etcd Server

```bash
./core/nulib/provisioning server ssh
```

### Service Management

```bash
# Check etcd status
systemctl status etcd

# Start/stop etcd
systemctl start etcd
systemctl stop etcd
systemctl restart etcd

# View etcd logs
journalctl -u etcd -f

# Check etcd version
etcd --version
etcdctl version
```

### Cluster Operations

```bash
# Check cluster health
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ca.pem \
  --cert=/etc/ssl/etcd/etcd.pem \
  --key=/etc/ssl/etcd/etcd-key.pem \
  endpoint health

# List cluster members
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ca.pem \
  --cert=/etc/ssl/etcd/etcd.pem \
  --key=/etc/ssl/etcd/etcd-key.pem \
  member list

# Check cluster status
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ca.pem \
  --cert=/etc/ssl/etcd/etcd.pem \
  --key=/etc/ssl/etcd/etcd-key.pem \
  endpoint status --write-out=table
```

### Data Operations

```bash
# Put a key-value pair
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ca.pem \
  --cert=/etc/ssl/etcd/etcd.pem \
  --key=/etc/ssl/etcd/etcd-key.pem \
  put /config/app/database "postgresql://localhost:5432/app"

# Get a value
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ca.pem \
  --cert=/etc/ssl/etcd/etcd.pem \
  --key=/etc/ssl/etcd/etcd-key.pem \
  get /config/app/database

# Get all keys with prefix
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ca.pem \
  --cert=/etc/ssl/etcd/etcd.pem \
  --key=/etc/ssl/etcd/etcd-key.pem \
  get /config/ --prefix

# Watch for changes
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ca.pem \
  --cert=/etc/ssl/etcd/etcd.pem \
  --key=/etc/ssl/etcd/etcd-key.pem \
  watch /config/ --prefix
```

### Backup and Restore

```bash
# Create snapshot backup
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ca.pem \
  --cert=/etc/ssl/etcd/etcd.pem \
  --key=/etc/ssl/etcd/etcd-key.pem \
  snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db

# Check snapshot status
etcdctl --write-out=table snapshot status /backup/etcd-snapshot.db

# Restore from snapshot
# (etcd v3.5 deprecates this form in favor of `etcdutl snapshot restore`)
etcdctl snapshot restore /backup/etcd-snapshot.db \
  --name etcd-restored \
  --initial-cluster etcd-restored=https://127.0.0.1:2380 \
  --initial-cluster-token etcd-cluster-restored \
  --initial-advertise-peer-urls https://127.0.0.1:2380
```

### Maintenance Operations

```bash
# Compact etcd database up to the current revision
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ca.pem \
  --cert=/etc/ssl/etcd/etcd.pem \
  --key=/etc/ssl/etcd/etcd-key.pem \
  compact $(etcdctl --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/ssl/etcd/ca.pem \
    --cert=/etc/ssl/etcd/etcd.pem \
    --key=/etc/ssl/etcd/etcd-key.pem \
    endpoint status --write-out="json" | jq -r '.[0].Status.header.revision')

# Defragment etcd database
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ca.pem \
  --cert=/etc/ssl/etcd/etcd.pem \
  --key=/etc/ssl/etcd/etcd-key.pem \
  defrag

# Check database size
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ca.pem \
  --cert=/etc/ssl/etcd/etcd.pem \
  --key=/etc/ssl/etcd/etcd-key.pem \
  endpoint status --write-out=table
```

## Architecture

### System Architecture

```
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Applications  │────│   Etcd Cluster   │────│  Data Storage   │
│                 │    │                  │    │                 │
│ • Kubernetes    │    │ • Leader Node    │    │ • Raft Log      │
│ • Config Mgmt   │────│ • Follower Nodes │────│ • Key-Value DB  │
│ • Service Disc. │    │ • Client API     │    │ • Snapshots     │
│ • Coordination  │    │ • Peer Protocol  │    │ • WAL Files     │
└─────────────────┘    └──────────────────┘    └─────────────────┘
```

### Cluster Architecture

```
┌────────────────────────────────────────────────────────────┐
│                   Etcd Cluster (3 Nodes)                   │
├────────────────────────────────────────────────────────────┤
│ Leader Node        │ Follower Node 1   │ Follower Node 2   │
│ (etcd-1)           │ (etcd-2)          │ (etcd-3)          │
│                    │                   │                   │
│ • Write Operations │ • Read Operations │ • Read Operations │
│ • Log Replication  │ • Log Reception   │ • Log Reception   │
│ • Heartbeat Sender │ • Heartbeat Reply │ • Heartbeat Reply │
│ • Decision Making  │ • Vote Casting    │ • Vote Casting    │
├────────────────────────────────────────────────────────────┤
│                    Raft Consensus Layer                    │
├────────────────────────────────────────────────────────────┤
│                       Network Layer                        │
│ Client Port: 2379  │ Peer Port: 2380   │ Metrics: 2381     │
└────────────────────────────────────────────────────────────┘
```

### Data Flow Architecture

```
Client Request → API Gateway   → Raft Protocol   → Storage Engine
      ↓                ↓                ↓               ↓
Authentication → Authorization → Consensus       → Persistence
      ↓                ↓                ↓               ↓
TLS Validation → RBAC Check    → Leader Election → Disk Write
```

### File Structure

```
/var/lib/etcd/              # Data directory
├── member/                 # Member data
│   ├── snap/               # Snapshots
│   └── wal/                # Write-ahead logs
└── proxy/                  # Proxy data (if enabled)

/etc/etcd/                  # Configuration
├── config.yaml             # Main configuration
├── env                     # Environment variables
├── etcdctl.sh              # Client script
└── cert-show.sh            # Certificate inspection

/etc/ssl/etcd/              # SSL certificates
├── ca.pem                  # Certificate Authority
├── etcd.pem                # Server certificate
├── etcd-key.pem            # Server private key
├── peer.pem                # Peer certificate
└── peer-key.pem            # Peer private key

/var/log/etcd/              # Log files
├── etcd.log                # Main log file
└── audit.log               # Audit log (if enabled)
```

## Supported Operating Systems

- Ubuntu 20.04+ / Debian 11+
- CentOS 8+ / RHEL 8+ / Fedora 35+
- Amazon Linux 2+
- SUSE Linux Enterprise 15+

## System Requirements

### Minimum Requirements

- **RAM**: 2GB (4GB+ recommended)
- **Storage**: 20GB SSD (50GB+ for production)
- **CPU**: 2 cores (4+ cores recommended)
- **Network**: Low latency between cluster members

### Production Requirements

- **RAM**: 8GB+ (16GB+ for large clusters)
- **Storage**: 100GB+ NVMe SSD (high IOPS)
- **CPU**: 4+ cores (8+ cores for high load)
- **Network**: Dedicated network with <10ms latency between nodes

### Performance Requirements

- **Disk IOPS**: 1000+ IOPS for WAL directory
- **Network Bandwidth**: 1Gbps+ between cluster members
- **Latency**: <10ms between cluster members
- **Filesystem**: ext4 or xfs; mounting with `barrier=0` improves write speed but risks data loss on power failure, so use it only with battery-backed storage

## Troubleshooting

### Cluster Health Issues

```bash
# Check cluster health
etcdctl endpoint health --cluster

# Check member status
etcdctl member list -w table

# Check leader election
etcdctl endpoint status --cluster -w table

# Check the HTTP health endpoint (served on the client port, not the peer port)
curl -k https://etcd-node:2379/health
```

### Performance Issues

```bash
# Check database size
etcdctl endpoint status --write-out=table

# Defragment all cluster members to reclaim fragmented space
etcdctl defrag --cluster

# Monitor metrics
curl http://localhost:2381/metrics

# Check for slow requests in the logs
journalctl -u etcd | grep "slow"
```

### Certificate Issues

```bash
# Check certificate validity
openssl x509 -in /etc/ssl/etcd/etcd.pem -text -noout

# Verify certificate chain
openssl verify -CAfile /etc/ssl/etcd/ca.pem /etc/ssl/etcd/etcd.pem

# Check certificate expiration
/etc/etcd/cert-show.sh

# Test TLS connection
openssl s_client -connect localhost:2379 -cert /etc/ssl/etcd/etcd.pem -key /etc/ssl/etcd/etcd-key.pem
```

### Data Corruption Issues

```bash
# Check data consistency by comparing KV hashes across members
etcdctl endpoint hashkv --cluster

# Restore the database from a snapshot
etcdutl snapshot restore /backup/snapshot.db

# Check WAL files
ls -la /var/lib/etcd/member/wal/

# Verify snapshot integrity
etcdctl snapshot status /backup/snapshot.db
```

### Network Issues

```bash
# Check port connectivity
telnet etcd-node 2379
telnet etcd-node 2380

# Check firewall rules
iptables -L | grep -E "(2379|2380)"

# Test peer connectivity
curl -k https://etcd-peer:2380/version

# Monitor network traffic
tcpdump -i eth0 port 2379 or port 2380
```

## Security Considerations

### Transport Security

- **TLS Encryption** - Enable TLS for all client and peer communication
- **Certificate Management** - Use proper CA-signed certificates
- **Key Rotation** - Rotate certificates and keys on a regular schedule
- **Cipher Suites** - Use strong cipher suites

### Authentication & Authorization

- **Client Authentication** - Certificate-based client authentication
- **RBAC** - Role-based access control for fine-grained permissions
- **User Management** - Separate users for different applications
- **Audit Logging** - Comprehensive audit trail

### Network Security

- **Firewall Rules** - Restrict access to etcd ports
- **Network Segmentation** - Isolate etcd traffic
- **VPN/Private Networks** - Use private networks for cluster communication
- **DDoS Protection** - Implement rate limiting and connection limits

### Data Security

- **Encryption at Rest** - Encrypt etcd data directory
- **Backup Security** - Encrypt and secure backups
- **Secret Management** - Proper handling of sensitive data
- **Access Logs** - Monitor all data access

## Performance Optimization

### Hardware Optimization

- **Storage** - Use NVMe SSDs with high IOPS
- **Memory** - Sufficient RAM for database caching
- **CPU** - High-frequency CPUs for consensus operations
- **Network** - Low-latency, high-bandwidth network

### Configuration Tuning

- **Snapshot Count** - Optimize snapshot frequency
- **Heartbeat Interval** - Tune for network conditions
- **Election Timeout** - Balance failover speed against spurious leader elections
- **Batch Limits** - Optimize batch processing

### Operational Optimization

- **Regular Compaction** - Automated database compaction
- **Defragmentation** - Regular defragmentation schedule
- **Monitoring** - Comprehensive performance monitoring
- **Capacity Planning** - Proper cluster sizing

## Resources

- **Official Documentation**: [etcd.io](https://etcd.io/)
- **GitHub Repository**: [etcd-io/etcd](https://github.com/etcd-io/etcd)
- **Kubernetes Integration**: [kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd](https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/)
- **Raft Consensus**: [raft.github.io](https://raft.github.io/)
- **Community**: [etcd.io/community](https://etcd.io/community/)
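
The odd cluster sizes listed under High Availability & Clustering (3, 5, 7 nodes) follow from Raft's majority rule: a cluster of N members needs floor(N/2) + 1 of them to agree, so it can lose the remainder and keep accepting writes. The following is a minimal shell sketch of that arithmetic; the function name `quorum_for` is illustrative and is not part of etcd or of the provisioning tooling.

```shell
#!/bin/sh
# Illustrative helper (hypothetical name): print the majority quorum and the
# number of member failures a cluster of N members can tolerate.
#   quorum    = floor(N / 2) + 1
#   tolerated = N - quorum
quorum_for() {
    n=$1
    quorum=$(( n / 2 + 1 ))
    tolerated=$(( n - quorum ))
    printf 'members=%d quorum=%d tolerated_failures=%d\n' "$n" "$quorum" "$tolerated"
}

# Compare the common cluster sizes, plus one even size for contrast.
for size in 1 3 4 5 7; do
    quorum_for "$size"
done
```

A 4-member cluster tolerates one failure, the same as a 3-member cluster, which is why an even member count adds replication cost without adding fault tolerance.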