Etcd Task Service
Overview
The Etcd task service provides a complete installation and configuration of etcd, a distributed, reliable key-value store for the most critical data of a distributed system. Etcd is the primary datastore of Kubernetes and is used by many other distributed systems for configuration management, service discovery, and distributed coordination.
Features
Core Capabilities
- Distributed Key-Value Store - Consistent, reliable data storage across multiple nodes
- Raft Consensus Algorithm - Strong consistency guarantees with leader election
- MVCC (Multi-Version Concurrency Control) - Point-in-time snapshots and watch functionality
- Transactional Operations - Atomic multi-key operations with compare-and-swap
- Flat Key Space with Prefix Queries - Keys live in a flat binary key space (etcd v3); directory-like paths such as /config/app are a naming convention queried by prefix
High Availability & Clustering
- Multi-Node Clusters - Support for 3, 5, 7+ node clusters
- Automatic Leader Election - Built-in leader election with automatic failover
- Cluster Membership Management - Dynamic cluster membership changes
- Split-Brain Protection - Quorum-based decision making
- Rolling Updates - Zero-downtime cluster updates
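The preference for odd cluster sizes follows from quorum arithmetic: a cluster of N voting members needs floor(N/2)+1 of them to agree, so it tolerates floor((N-1)/2) failures. The sketch below only illustrates that arithmetic; the function names are made up for this example and are not part of the task service.

```shell
#!/usr/bin/env bash
# Quorum arithmetic for a Raft cluster of N voting members.
# quorum and fault_tolerance are illustrative helper names.

quorum() {
  echo $(( $1 / 2 + 1 ))
}

fault_tolerance() {
  echo $(( ($1 - 1) / 2 ))
}

for n in 1 3 4 5 7; do
  printf 'members=%d quorum=%d tolerated_failures=%d\n' \
    "$n" "$(quorum "$n")" "$(fault_tolerance "$n")"
done
```

A four-member cluster tolerates one failure, the same as a three-member cluster, which is why even sizes add cost without adding resilience.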
Security Features
- TLS Encryption - Transport encryption for client and peer communication
- Certificate-Based Authentication - X.509 certificate authentication
- Role-Based Access Control (RBAC) - Fine-grained permission management
- User Authentication - User-based authentication with password and certificate support
- Network Security - Peer and client communication security
Operational Features
- Backup & Restore - Built-in snapshot and restoration capabilities
- Monitoring & Metrics - Prometheus metrics integration
- Health Checking - Comprehensive health check endpoints
- Performance Tuning - Configurable performance parameters
- Maintenance Operations - Compaction, defragmentation, and member management
Configuration
Basic Single-Node Configuration
etcd: ETCD = {
name: "etcd"
version: "v3.5.10"
etcd_name: "etcd-single"
ssl_mode: "openssl"
ssl_sign: "RSA"
ca_sign: "RSA"
ssl_curve: "prime256v1"
long_sign: 4096
cipher: "-aes256"
ca_sign_days: 1460
sign_days: 730
sign_sha: 256
etcd_protocol: "https"
source_url: "github"
cluster_name: "etcd-cluster"
hostname: "etcd-node-1"
cn: "etcd-node-1"
c: "US"
data_dir: "/var/lib/etcd"
conf_path: "/etc/etcd/config.yaml"
log_level: "warn"
log_out: "stderr"
cli_ip: "127.0.0.1"
cli_port: 2379
peer_ip: "127.0.0.1"
peer_port: 2380
cluster_list: "etcd-single=https://127.0.0.1:2380"
token: "etcd-cluster-1"
certs_path: "/etc/ssl/etcd"
use_localhost: true
use_dns: false
}
Production Multi-Node Cluster
etcd: ETCD = {
name: "etcd"
version: "v3.5.10"
etcd_name: "etcd-prod-1"
ssl_mode: "openssl"
ssl_sign: "RSA"
ca_sign: "RSA"
ssl_curve: "prime256v1"
long_sign: 4096
cipher: "-aes256"
ca_sign_days: 1460
sign_days: 730
sign_sha: 256
etcd_protocol: "https"
source_url: "github"
cluster_name: "production-cluster"
hostname: "etcd-prod-1"
cn: "etcd-prod-1.company.com"
c: "US"
data_dir: "/var/lib/etcd"
conf_path: "/etc/etcd/config.yaml"
log_level: "warn"
log_out: "stderr"
cli_ip: "10.0.1.10"
cli_port: 2379
peer_ip: "10.0.1.10"
peer_port: 2380
cluster_list: "etcd-prod-1=https://10.0.1.10:2380,etcd-prod-2=https://10.0.1.11:2380,etcd-prod-3=https://10.0.1.12:2380"
token: "production-etcd-cluster"
certs_path: "/etc/ssl/etcd"
prov_path: "etcdcerts"
listen_peers: "https://10.0.1.10:2380"
adv_listen_peers: "https://10.0.1.10:2380"
initial_peers: "etcd-prod-1=https://10.0.1.10:2380,etcd-prod-2=https://10.0.1.11:2380,etcd-prod-3=https://10.0.1.12:2380"
listen_clients: "https://10.0.1.10:2379,https://127.0.0.1:2379"
adv_listen_clients: "https://10.0.1.10:2379"
use_localhost: false
domain_name: "company.com"
use_dns: true
}
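The cluster_list and initial_peers values above follow a fixed "name=url" pattern, so they can be generated rather than typed by hand. The helper below is a hypothetical sketch (build_cluster_list is not provided by the task service) that assumes members are named with a shared prefix and a 1-based index.

```shell
#!/usr/bin/env bash
# Build a "name=url,name=url,..." peer list from a name prefix, a peer port,
# and a list of peer IPs. build_cluster_list is a hypothetical helper.

build_cluster_list() {
  local prefix=$1 port=$2
  shift 2
  local i=1 out="" ip
  for ip in "$@"; do
    # Prepend a comma only when the list already has an entry.
    out="${out:+$out,}${prefix}-${i}=https://${ip}:${port}"
    i=$((i + 1))
  done
  echo "$out"
}

build_cluster_list etcd-prod 2380 10.0.1.10 10.0.1.11 10.0.1.12
```

Every member must be configured with the same list; only etcd_name, the IPs, and the listen/advertise URLs differ per node.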
High-Performance Configuration
etcd: ETCD = {
name: "etcd"
version: "v3.5.10"
# ... base configuration
performance: {
snapshot_count: 100000
heartbeat_interval: 100
election_timeout: 1000
max_snapshots: 5
max_wals: 5
max_txn_ops: 128
max_request_bytes: 1572864
grpc_keepalive_min_time: 5
grpc_keepalive_interval: 2
grpc_keepalive_timeout: 6
}
storage: {
backend_batch_limit: 10000
backend_batch_interval: 100
backend_bbolt_freelist_type: "map"
quota_backend_bytes: 8589934592 # 8GB
}
security: {
auto_compaction_mode: "periodic"
auto_compaction_retention: "1h"
}
}
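Two of the values above are easier to review with a little arithmetic. etcd's tuning guidance ties election_timeout to heartbeat_interval (both in milliseconds), and the byte-valued limits are easier to read in MiB. The sketch below uses hypothetical helper names; the default factor of 5 is an assumption about the minimum ratio etcd accepts, and the tuning guide suggests a larger margin in practice.

```shell
#!/usr/bin/env bash
# Sanity checks for the tuning values above. Helper names are hypothetical.

# Succeeds when election_timeout is at least `factor` times heartbeat_interval.
# The default factor of 5 is an assumed lower bound, not a recommendation.
timing_ok() {
  local heartbeat_ms=$1 election_ms=$2 factor=${3:-5}
  [ "$election_ms" -ge $(( heartbeat_ms * factor )) ]
}

# Convert a byte count to whole MiB.
bytes_to_mib() {
  echo $(( $1 / 1024 / 1024 ))
}

if timing_ok 100 1000; then
  echo "heartbeat=100ms election=1000ms: ratio ok"
fi
echo "quota_backend_bytes 8589934592 = $(bytes_to_mib 8589934592) MiB"
echo "max_request_bytes 1572864 = $(( 1572864 / 1024 )) KiB"
```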
Kubernetes Cluster Configuration
etcd: ETCD = {
name: "etcd"
version: "v3.5.10"
etcd_name: "k8s-etcd-1"
# ... base configuration
cluster_name: "kubernetes-cluster"
hostname: "k8s-master-1"
cn: "k8s-master-1.cluster.local"
data_dir: "/var/lib/etcd"
cli_ip: "10.0.1.100"
cli_port: 2379
peer_ip: "10.0.1.100"
peer_port: 2380
cluster_list: "k8s-etcd-1=https://10.0.1.100:2380,k8s-etcd-2=https://10.0.1.101:2380,k8s-etcd-3=https://10.0.1.102:2380"
token: "k8s-etcd-cluster"
kubernetes_integration: {
enabled: true
namespace_prefix: "/registry"
compaction_interval: "5m"
defrag_threshold: 100
health_check_interval: "10s"
}
backup: {
enabled: true
interval: "6h"
retention: "30d"
s3_bucket: "k8s-etcd-backups"
encryption: true
}
}
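The backup interval and retention above determine how many snapshots accumulate in the bucket, which helps with storage sizing. The sketch below estimates that count; to_seconds and retained_snapshots are hypothetical helpers that accept only single-unit durations such as "10s", "5m", "6h", or "30d".

```shell
#!/usr/bin/env bash
# Estimate how many snapshots a backup schedule keeps at steady state.
# to_seconds and retained_snapshots are hypothetical helpers.

to_seconds() {
  local value=${1%?}          # everything except the last character
  local unit=${1#"$value"}    # the last character
  case $unit in
    s) echo "$value" ;;
    m) echo $(( value * 60 )) ;;
    h) echo $(( value * 3600 )) ;;
    d) echo $(( value * 86400 )) ;;
    *) echo "unsupported duration: $1" >&2; return 1 ;;
  esac
}

retained_snapshots() {
  local interval retention
  interval=$(to_seconds "$1") || return 1
  retention=$(to_seconds "$2") || return 1
  echo $(( retention / interval ))
}

retained_snapshots 6h 30d
```

Multiplying the result by the typical snapshot size gives a rough upper bound on bucket usage.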
DNS-Based Discovery Configuration
etcd: ETCD = {
name: "etcd"
version: "v3.5.10"
# ... base configuration
dns_domain_path: "_etcd-server-ssl._tcp.company.com"
domain_name: "company.com"
discovery_srv: "_etcd-server-ssl._tcp.company.com"
use_dns: true
dns_discovery: {
enabled: true
service: "_etcd-server-ssl._tcp"
domain: "company.com"
srv_records: [
{
name: "etcd-1"
port: 2380
weight: 10
priority: 0
target: "etcd-1.company.com"
},
{
name: "etcd-2"
port: 2380
weight: 10
priority: 0
target: "etcd-2.company.com"
},
{
name: "etcd-3"
port: 2380
weight: 10
priority: 0
target: "etcd-3.company.com"
}
]
}
}
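The srv_records entries above correspond to ordinary DNS SRV records. As a rough illustration, the sketch below prints them in zone-file form; srv_record is a hypothetical helper and the TTL of 300 seconds is an arbitrary example value.

```shell
#!/usr/bin/env bash
# Print DNS SRV records in zone-file form for etcd peer discovery.
# srv_record is a hypothetical helper; the TTL default is an example value.

srv_record() {
  local service=$1 domain=$2 priority=$3 weight=$4 port=$5 target=$6 ttl=${7:-300}
  printf '%s.%s. %d IN SRV %d %d %d %s.\n' \
    "$service" "$domain" "$ttl" "$priority" "$weight" "$port" "$target"
}

for n in 1 2 3; do
  srv_record _etcd-server-ssl._tcp company.com 0 10 2380 "etcd-${n}.company.com"
done
```

Each target name also needs an address record, and with TLS the peer certificates must be valid for those names.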
Usage
Deploy Etcd
./core/nulib/provisioning taskserv create etcd --infra <infrastructure-name>
List Available Task Services
./core/nulib/provisioning taskserv list
SSH to Etcd Server
./core/nulib/provisioning server ssh <etcd-server>
Service Management
# Check etcd status
systemctl status etcd
# Start, stop, or restart etcd
systemctl start etcd
systemctl stop etcd
systemctl restart etcd
# View etcd logs
journalctl -u etcd -f
# Check etcd version
etcd --version
etcdctl version
Cluster Operations
# Check cluster health
etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/ssl/etcd/ca.pem \
--cert=/etc/ssl/etcd/etcd.pem \
--key=/etc/ssl/etcd/etcd-key.pem \
endpoint health
# List cluster members
etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/ssl/etcd/ca.pem \
--cert=/etc/ssl/etcd/etcd.pem \
--key=/etc/ssl/etcd/etcd-key.pem \
member list
# Check cluster status
etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/ssl/etcd/ca.pem \
--cert=/etc/ssl/etcd/etcd.pem \
--key=/etc/ssl/etcd/etcd-key.pem \
endpoint status --write-out=table
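The endpoint and certificate flags repeat in every command. etcdctl also reads its global flags from ETCDCTL_-prefixed environment variables, so one option is to export them once per shell session. This is a sketch using the paths from this document; the installed /etc/etcd/env file may already serve the same purpose.

```shell
#!/usr/bin/env bash
# Export the connection settings once so later etcdctl calls can omit the flags.
# Paths and endpoint are the ones used throughout this document.

export ETCDCTL_ENDPOINTS="https://127.0.0.1:2379"
export ETCDCTL_CACERT="/etc/ssl/etcd/ca.pem"
export ETCDCTL_CERT="/etc/ssl/etcd/etcd.pem"
export ETCDCTL_KEY="/etc/ssl/etcd/etcd-key.pem"

# With these set, the commands above shorten to, for example:
#   etcdctl endpoint health
#   etcdctl member list
#   etcdctl endpoint status --write-out=table
echo "etcdctl will target ${ETCDCTL_ENDPOINTS}"
```

Avoid passing the same setting both as a flag and as an environment variable in one invocation, since some etcdctl versions reject the conflict.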
Data Operations
# Put a key-value pair
etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/ssl/etcd/ca.pem \
--cert=/etc/ssl/etcd/etcd.pem \
--key=/etc/ssl/etcd/etcd-key.pem \
put /config/app/database "postgresql://localhost:5432/app"
# Get a value
etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/ssl/etcd/ca.pem \
--cert=/etc/ssl/etcd/etcd.pem \
--key=/etc/ssl/etcd/etcd-key.pem \
get /config/app/database
# Get all keys with prefix
etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/ssl/etcd/ca.pem \
--cert=/etc/ssl/etcd/etcd.pem \
--key=/etc/ssl/etcd/etcd-key.pem \
get /config/ --prefix
# Watch for changes
etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/ssl/etcd/ca.pem \
--cert=/etc/ssl/etcd/etcd.pem \
--key=/etc/ssl/etcd/etcd-key.pem \
watch /config/ --prefix
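A prefix query is a range query in disguise: --prefix asks for every key from the prefix up to, but not including, the prefix with its last byte incremented. The sketch below computes that range end; prefix_range_end is a hypothetical helper and assumes a non-empty ASCII prefix whose last byte is below 0x7f.

```shell
#!/usr/bin/env bash
# Compute the range end that corresponds to a key prefix: the prefix with its
# last byte incremented. prefix_range_end is a hypothetical helper and assumes
# a non-empty ASCII prefix.

prefix_range_end() {
  local prefix=$1
  local head=${prefix%?}        # prefix without its last character
  local last=${prefix#"$head"}  # the last character
  local code
  code=$(printf '%d' "'$last")  # numeric value of the last character
  # %b expands the \0NNN octal escape into the incremented character.
  printf '%s%b\n' "$head" "\\0$(printf '%o' $(( code + 1 )))"
}

prefix_range_end /config/
```

With that result, `get /config/ /config0` selects the same keys as `get /config/ --prefix`, because "0" is the character that follows "/" in ASCII.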
Backup and Restore
# Create snapshot backup
etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/ssl/etcd/ca.pem \
--cert=/etc/ssl/etcd/etcd.pem \
--key=/etc/ssl/etcd/etcd-key.pem \
snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db
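# Check snapshot status (use the file name created by the save command)
# Note: in v3.5 the etcdctl snapshot status/restore subcommands are deprecated in favor of etcdutl
etcdutl --write-out=table snapshot status /backup/etcd-snapshot.db
# Restore from snapshot into a new, empty data directory (example path)
etcdutl snapshot restore /backup/etcd-snapshot.db \
--name etcd-restored \
--data-dir /var/lib/etcd-restored \
--initial-cluster etcd-restored=https://127.0.0.1:2380 \
--initial-cluster-token etcd-cluster-restored \
--initial-advertise-peer-urls https://127.0.0.1:2380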
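Timestamped snapshots accumulate, so a backup job usually prunes old files. The sketch below keeps only the newest N snapshots; it relies on the etcd-snapshot-YYYYmmdd-HHMMSS.db naming used by the save command, where lexical order equals chronological order. prune_snapshots is a hypothetical helper, and the demonstration runs against a temporary directory rather than /backup.

```shell
#!/usr/bin/env bash
# Keep only the newest N snapshot files in a directory.
# prune_snapshots is a hypothetical helper.

prune_snapshots() {
  local dir=$1 keep=$2
  # Globs expand in sorted order, so the oldest snapshots come first.
  set -- "$dir"/etcd-snapshot-*.db
  # An unmatched glob stays literal; nothing to prune in that case.
  [ -e "$1" ] || return 0
  while [ "$#" -gt "$keep" ]; do
    rm -f -- "$1"
    shift
  done
}

# Demonstration against a throwaway directory.
demo_dir=$(mktemp -d)
for stamp in 20240101-000000 20240102-000000 20240103-000000 20240104-000000; do
  : > "$demo_dir/etcd-snapshot-$stamp.db"
done
prune_snapshots "$demo_dir" 2
ls "$demo_dir"
rm -rf "$demo_dir"
```

Verify a snapshot before deleting older ones, so that a corrupt latest file never becomes the only copy.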
Maintenance Operations
# Compact etcd database
etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/ssl/etcd/ca.pem \
--cert=/etc/ssl/etcd/etcd.pem \
--key=/etc/ssl/etcd/etcd-key.pem \
compact $(etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/ssl/etcd/ca.pem \
--cert=/etc/ssl/etcd/etcd.pem \
--key=/etc/ssl/etcd/etcd-key.pem \
endpoint status --write-out="json" | jq -r '.[0].Status.header.revision')
# Defragment etcd database
etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/ssl/etcd/ca.pem \
--cert=/etc/ssl/etcd/etcd.pem \
--key=/etc/ssl/etcd/etcd-key.pem \
defrag
# Check database size
etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/ssl/etcd/ca.pem \
--cert=/etc/ssl/etcd/etcd.pem \
--key=/etc/ssl/etcd/etcd-key.pem \
endpoint status --write-out=table
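The database size reported by endpoint status matters mostly in relation to the configured backend quota: once the quota is exceeded, etcd raises a NOSPACE alarm and rejects further writes until space is reclaimed. The sketch below turns the two numbers into a percentage; quota_used_pct is a hypothetical helper and the 80% warning threshold is an arbitrary example.

```shell
#!/usr/bin/env bash
# Report how much of the backend quota the database occupies.
# quota_used_pct is a hypothetical helper; pass the DB size in bytes and the
# configured quota_backend_bytes.

quota_used_pct() {
  local db_bytes=$1 quota_bytes=$2
  echo $(( db_bytes * 100 / quota_bytes ))
}

# Example: a 7 GiB database against the 8 GiB quota shown earlier.
pct=$(quota_used_pct 7516192768 8589934592)
if [ "$pct" -ge 80 ]; then
  echo "database at ${pct}% of quota: compact and defragment soon"
else
  echo "database at ${pct}% of quota"
fi
```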
Architecture
System Architecture
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐
│ Applications │────│ Etcd Cluster │────│ Data Storage │
│ │ │ │ │ │
│ • Kubernetes │ │ • Leader Node │ │ • Raft Log │
│ • Config Mgmt │────│ • Follower Nodes │────│ • Key-Value DB │
│ • Service Disc. │ │ • Client API │ │ • Snapshots │
│ • Coordination │ │ • Peer Protocol │ │ • WAL Files │
└─────────────────┘ └──────────────────┘ └─────────────────┘
Cluster Architecture
┌─────────────────────────────────────────────────────────────┐
│ Etcd Cluster (3 Nodes) │
├─────────────────────────────────────────────────────────────┤
│ Leader Node │ Follower Node 1 │ Follower Node 2 │
│ (etcd-1) │ (etcd-2) │ (etcd-3) │
│ │ │ │
│ • Write Operations │ • Read Operations │ • Read Operations │
│ • Log Replication │ • Log Reception │ • Log Reception │
│ • Heartbeat Sender │ • Heartbeat Reply │ • Heartbeat Reply │
│ • Decision Making │ • Vote Casting │ • Vote Casting │
├─────────────────────────────────────────────────────────────┤
│ Raft Consensus Layer │
├─────────────────────────────────────────────────────────────┤
│ Network Layer │
│ Client Port: 2379 │ Peer Port: 2380 │ Metrics: 2381 │
└─────────────────────────────────────────────────────────────┘
Data Flow Architecture
Client Request → API Gateway → Raft Protocol → Storage Engine
↓ ↓ ↓ ↓
Authentication → Authorization → Consensus → Persistence
↓ ↓ ↓ ↓
TLS Validation → RBAC Check → Leader Election → Disk Write
File Structure
/var/lib/etcd/ # Data directory
├── member/ # Member data
│ ├── snap/ # Snapshots
│ └── wal/ # Write-ahead logs
└── proxy/ # Proxy data (if enabled)
/etc/etcd/ # Configuration
├── config.yaml # Main configuration
├── env # Environment variables
├── etcdctl.sh # Client script
└── cert-show.sh # Certificate inspection
/etc/ssl/etcd/ # SSL certificates
├── ca.pem # Certificate Authority
├── etcd.pem # Server certificate
├── etcd-key.pem # Server private key
├── peer.pem # Peer certificate
└── peer-key.pem # Peer private key
/var/log/etcd/ # Log files
├── etcd.log # Main log file
└── audit.log # Audit log (if enabled)
Supported Operating Systems
- Ubuntu 20.04+ / Debian 11+
- CentOS 8+ / RHEL 8+ / Fedora 35+
- Amazon Linux 2+
- SUSE Linux Enterprise 15+
System Requirements
Minimum Requirements
- RAM: 2GB (4GB+ recommended)
- Storage: 20GB SSD (50GB+ for production)
- CPU: 2 cores (4+ cores recommended)
- Network: Low latency between cluster members
Production Requirements
- RAM: 8GB+ (16GB+ for large clusters)
- Storage: 100GB+ NVMe SSD (high IOPS)
- CPU: 4+ cores (8+ cores for high load)
- Network: Dedicated network with <10ms latency between nodes
Performance Requirements
- Disk IOPS: 1000+ IOPS for WAL directory
- Network Bandwidth: 1Gbps+ between cluster members
- Latency: <10ms between cluster members
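- Filesystem: ext4 or xfs. Keep write barriers enabled: mounting with barrier=0 (or nobarrier) trades durability for speed and risks corrupting the WAL on power loss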
Troubleshooting
Cluster Health Issues
# Check cluster health
etcdctl endpoint health --cluster
# Check member status
etcdctl member list -w table
# Check leader election
etcdctl endpoint status --cluster -w table
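# Check the health endpoint (served on the client port 2379, not the peer port 2380)
curl --cacert /etc/ssl/etcd/ca.pem \
--cert /etc/ssl/etcd/etcd.pem \
--key /etc/ssl/etcd/etcd-key.pem \
https://etcd-node:2379/health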
Performance Issues
# Check database size
etcdctl endpoint status --write-out=table
# Defragment all members to reclaim free space (each member blocks requests while it runs)
etcdctl defrag --cluster
# Monitor metrics
curl http://localhost:2381/metrics
# Check for slow requests and slow disk syncs
journalctl -u etcd | grep -E "slow|took too long"
Certificate Issues
# Check certificate validity
openssl x509 -in /etc/ssl/etcd/etcd.pem -text -noout
# Verify certificate chain
openssl verify -CAfile /etc/ssl/etcd/ca.pem /etc/ssl/etcd/etcd.pem
# Check certificate expiration
/etc/etcd/cert-show.sh
# Test TLS connection (verify the server against the cluster CA)
openssl s_client -connect localhost:2379 -CAfile /etc/ssl/etcd/ca.pem -cert /etc/ssl/etcd/etcd.pem -key /etc/ssl/etcd/etcd-key.pem
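Because the certificates in this setup have a fixed lifetime (sign_days in the configuration), an expiry check is worth automating. The sketch below wraps `openssl x509 -checkend`, which exits non-zero when the certificate expires within the given number of seconds. cert_expires_within is a hypothetical helper and the 30-day window is an example value.

```shell
#!/usr/bin/env bash
# Succeeds when the certificate expires within the given number of days.
# cert_expires_within is a hypothetical helper built on `openssl x509 -checkend`.
# Note: it also succeeds when the file cannot be parsed, so check readability first.

cert_expires_within() {
  local cert=$1 days=$2
  ! openssl x509 -in "$cert" -noout -checkend $(( days * 86400 )) >/dev/null 2>&1
}

cert=/etc/ssl/etcd/etcd.pem
if [ -r "$cert" ] && cert_expires_within "$cert" 30; then
  echo "WARNING: $cert expires within 30 days"
fi
```

The same check applies to ca.pem and peer.pem; running it from a daily timer gives early warning before rotation is due.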
Data Corruption Issues
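# Check data consistency by comparing the KV hash across members
etcdctl endpoint hashkv --cluster
# Rebuild a data directory from a known-good snapshot (restore does not repair in place; example path)
etcdutl snapshot restore /backup/snapshot.db --data-dir /var/lib/etcd-restored
# Check WAL files
ls -la /var/lib/etcd/member/wal/
# Verify snapshot integrity
etcdutl snapshot status /backup/snapshot.db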
Network Issues
# Check port connectivity
telnet etcd-node 2379
telnet etcd-node 2380
# Check firewall rules (-n prints numeric ports; without it the ports may appear as service names)
iptables -L -n | grep -E "(2379|2380)"
# Test peer connectivity
curl -k https://etcd-peer:2380/version
# Monitor network traffic
tcpdump -i eth0 port 2379 or port 2380
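telnet is often not installed on minimal server images. As an alternative, the sketch below probes a TCP port using bash's /dev/tcp redirection and the coreutils timeout command. port_open is a hypothetical helper; replace 127.0.0.1 with the member you want to test.

```shell
#!/usr/bin/env bash
# Check whether a TCP port accepts connections, without telnet.
# port_open is a hypothetical helper; it relies on bash's /dev/tcp redirection
# and the coreutils `timeout` command.

port_open() {
  local host=$1 port=$2
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" >/dev/null 2>&1
}

for port in 2379 2380; do
  if port_open 127.0.0.1 "$port"; then
    echo "port $port: open"
  else
    echo "port $port: closed or filtered"
  fi
done
```

An open port only proves TCP reachability; TLS and certificate problems still need the openssl and curl checks.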
Security Considerations
Transport Security
- TLS Encryption - Enable TLS for all client and peer communication
- Certificate Management - Use proper CA-signed certificates
- Key Rotation - Regular certificate rotation
- Cipher Suites - Use strong cipher suites
Authentication & Authorization
- Client Authentication - Certificate-based client authentication
- RBAC - Role-based access control for fine-grained permissions
- User Management - Separate users for different applications
- Audit Logging - Comprehensive audit trail
Network Security
- Firewall Rules - Restrict access to etcd ports
- Network Segmentation - Isolate etcd traffic
- VPN/Private Networks - Use private networks for cluster communication
- DDoS Protection - Implement rate limiting and connection limits
Data Security
- Encryption at Rest - etcd does not encrypt its data directory itself; use disk or filesystem encryption for it
- Backup Security - Encrypt and secure backups
- Secret Management - Proper handling of sensitive data
- Access Logs - Monitor all data access
Performance Optimization
Hardware Optimization
- Storage - Use NVMe SSDs with high IOPS
- Memory - Sufficient RAM for database caching
- CPU - High-frequency CPUs for consensus operations
- Network - Low-latency, high-bandwidth network
Configuration Tuning
- Snapshot Count - Optimize snapshot frequency
- Heartbeat Interval - Tune for network conditions
- Election Timeout - Balance failover speed against spurious leader elections
- Batch Limits - Optimize batch processing
Operational Optimization
- Regular Compaction - Automated database compaction
- Defragmentation - Regular defragmentation schedule
- Monitoring - Comprehensive performance monitoring
- Capacity Planning - Proper cluster sizing
Resources
- Official Documentation: etcd.io
- GitHub Repository: etcd-io/etcd
- Kubernetes Integration: kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd
- Raft Consensus: raft.github.io
- Community: etcd.io/community