# Etcd Task Service

## Overview

The Etcd task service provides a complete installation and configuration of [etcd](https://etcd.io/), a distributed, reliable key-value store for the most critical data of a distributed system. Etcd is the primary datastore of Kubernetes and is used by many other distributed systems for configuration management, service discovery, and distributed coordination.

## Features

### Core Capabilities
- **Distributed Key-Value Store** - Consistent, reliable data storage across multiple nodes
- **RAFT Consensus Algorithm** - Strong consistency guarantees with leader election
- **MVCC (Multi-Version Concurrency Control)** - Point-in-time snapshots and watch functionality
- **Transactional Operations** - Atomic multi-key operations with compare-and-swap
- **Prefix-Organized Key Space** - Flat key space (etcd v3) in which directory-like paths are expressed as key prefixes

### High Availability & Clustering
- **Multi-Node Clusters** - Support for odd-sized clusters of 3, 5, or 7 nodes
- **Automatic Leader Election** - Built-in leader election with automatic failover
- **Cluster Membership Management** - Dynamic cluster membership changes
- **Split-Brain Protection** - Quorum-based decision making
- **Rolling Updates** - Zero-downtime cluster updates
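
The quorum-based behaviour listed above comes down to simple arithmetic: a cluster of N members needs a majority of `N / 2 + 1` votes, and it tolerates the loss of the remaining members. A minimal sketch (the function names `quorum` and `fault_tolerance` are illustrative, not part of this task service):

```bash
# quorum N: votes needed for a majority in an N-member cluster.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# fault_tolerance N: members that can fail while a majority remains.
fault_tolerance() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

# Compare common cluster sizes.
for n in 1 3 4 5 7; do
    echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(fault_tolerance "$n")"
done
```

A 4-node cluster tolerates one failure, the same as a 3-node cluster, which is why odd member counts are the usual choice.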

### Security Features
- **TLS Encryption** - Transport encryption for client and peer communication
- **Certificate-Based Authentication** - X.509 certificate authentication
- **Role-Based Access Control (RBAC)** - Fine-grained permission management
- **User Authentication** - User-based authentication with password and certificate support
- **Network Security** - Peer and client communication security

### Operational Features
- **Backup & Restore** - Built-in snapshot and restoration capabilities
- **Monitoring & Metrics** - Prometheus metrics integration
- **Health Checking** - Comprehensive health check endpoints
- **Performance Tuning** - Configurable performance parameters
- **Maintenance Operations** - Compaction, defragmentation, and member management

## Configuration

### Basic Single-Node Configuration
```kcl
etcd: ETCD = {
    name: "etcd"
    version: "v3.5.10"
    etcd_name: "etcd-single"
    ssl_mode: "openssl"
    ssl_sign: "RSA"
    ca_sign: "RSA"
    ssl_curve: "prime256v1"
    long_sign: 4096
    cipher: "-aes256"
    ca_sign_days: 1460
    sign_days: 730
    sign_sha: 256
    etcd_protocol: "https"
    source_url: "github"
    cluster_name: "etcd-cluster"
    hostname: "etcd-node-1"
    cn: "etcd-node-1"
    c: "US"
    data_dir: "/var/lib/etcd"
    conf_path: "/etc/etcd/config.yaml"
    log_level: "warn"
    log_out: "stderr"
    cli_ip: "127.0.0.1"
    cli_port: 2379
    peer_ip: "127.0.0.1"
    peer_port: 2380
    cluster_list: "etcd-single=https://127.0.0.1:2380"
    token: "etcd-cluster-1"
    certs_path: "/etc/ssl/etcd"
    use_localhost: true
    use_dns: false
}
```

### Production Multi-Node Cluster
```kcl
etcd: ETCD = {
    name: "etcd"
    version: "v3.5.10"
    etcd_name: "etcd-prod-1"
    ssl_mode: "openssl"
    ssl_sign: "RSA"
    ca_sign: "RSA"
    ssl_curve: "prime256v1"
    long_sign: 4096
    cipher: "-aes256"
    ca_sign_days: 1460
    sign_days: 730
    sign_sha: 256
    etcd_protocol: "https"
    source_url: "github"
    cluster_name: "production-cluster"
    hostname: "etcd-prod-1"
    cn: "etcd-prod-1.company.com"
    c: "US"
    data_dir: "/var/lib/etcd"
    conf_path: "/etc/etcd/config.yaml"
    log_level: "warn"
    log_out: "stderr"
    cli_ip: "10.0.1.10"
    cli_port: 2379
    peer_ip: "10.0.1.10"
    peer_port: 2380
    cluster_list: "etcd-prod-1=https://10.0.1.10:2380,etcd-prod-2=https://10.0.1.11:2380,etcd-prod-3=https://10.0.1.12:2380"
    token: "production-etcd-cluster"
    certs_path: "/etc/ssl/etcd"
    prov_path: "etcdcerts"
    listen_peers: "https://10.0.1.10:2380"
    adv_listen_peers: "https://10.0.1.10:2380"
    initial_peers: "etcd-prod-1=https://10.0.1.10:2380,etcd-prod-2=https://10.0.1.11:2380,etcd-prod-3=https://10.0.1.12:2380"
    listen_clients: "https://10.0.1.10:2379,https://127.0.0.1:2379"
    adv_listen_clients: "https://10.0.1.10:2379"
    use_localhost: false
    domain_name: "company.com"
    use_dns: true
}
```
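
The `cluster_list` and `initial_peers` values above must list every member in the same `name=https://ip:port` form on every node, so generating the string is less error-prone than typing it. A small sketch, assuming a hypothetical helper named `build_cluster_list`:

```bash
# build_cluster_list PORT NAME=IP...
# Joins members into the "name=https://ip:port,..." form used by
# cluster_list and initial_peers in the configuration above.
build_cluster_list() {
    port=$1
    shift
    list=""
    for member in "$@"; do
        name=${member%%=*}   # text before the first "="
        ip=${member#*=}      # text after the first "="
        list="${list:+$list,}$name=https://$ip:$port"
    done
    echo "$list"
}

build_cluster_list 2380 \
    etcd-prod-1=10.0.1.10 etcd-prod-2=10.0.1.11 etcd-prod-3=10.0.1.12
```

The output can be pasted into both `cluster_list` and `initial_peers`, which keeps the two values identical.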

### High-Performance Configuration
```kcl
etcd: ETCD = {
    name: "etcd"
    version: "v3.5.10"
    # ... base configuration
    performance: {
        snapshot_count: 100000
        heartbeat_interval: 100
        election_timeout: 1000
        max_snapshots: 5
        max_wals: 5
        max_txn_ops: 128
        max_request_bytes: 1572864
        grpc_keepalive_min_time: 5
        grpc_keepalive_interval: 2
        grpc_keepalive_timeout: 6
    }
    storage: {
        backend_batch_limit: 10000
        backend_batch_interval: 100
        backend_bbolt_freelist_type: "map"
        quota_backend_bytes: 8589934592  # 8GB
    }
    security: {
        auto_compaction_mode: "periodic"
        auto_compaction_retention: "1h"
    }
}
```
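
The `heartbeat_interval` and `election_timeout` values above (both in milliseconds) are related: the election timeout should be a comfortable multiple of the heartbeat interval so that a few delayed heartbeats do not trigger an election. The sketch below checks that relationship before a change is rolled out. The default factor of 10 mirrors the 100/1000 pair used above and is an assumption, as is the helper name `check_raft_timing`; adjust the factor to your own latency measurements.

```bash
# check_raft_timing HEARTBEAT_MS ELECTION_MS [MIN_RATIO]
# Succeeds when the election timeout is at least MIN_RATIO times the
# heartbeat interval (MIN_RATIO defaults to 10, an assumed rule of thumb).
check_raft_timing() {
    heartbeat=$1
    election=$2
    min_ratio=${3:-10}
    if [ "$election" -ge $(( heartbeat * min_ratio )) ]; then
        echo "ok: election_timeout=${election}ms >= ${min_ratio} x heartbeat_interval=${heartbeat}ms"
        return 0
    fi
    echo "warn: election_timeout=${election}ms < ${min_ratio} x heartbeat_interval=${heartbeat}ms"
    return 1
}

check_raft_timing 100 1000
```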

### Kubernetes Cluster Configuration
```kcl
etcd: ETCD = {
    name: "etcd"
    version: "v3.5.10"
    etcd_name: "k8s-etcd-1"
    # ... base configuration
    cluster_name: "kubernetes-cluster"
    hostname: "k8s-master-1"
    cn: "k8s-master-1.cluster.local"
    data_dir: "/var/lib/etcd"
    cli_ip: "10.0.1.100"
    cli_port: 2379
    peer_ip: "10.0.1.100"
    peer_port: 2380
    cluster_list: "k8s-etcd-1=https://10.0.1.100:2380,k8s-etcd-2=https://10.0.1.101:2380,k8s-etcd-3=https://10.0.1.102:2380"
    token: "k8s-etcd-cluster"
    kubernetes_integration: {
        enabled: true
        namespace_prefix: "/registry"
        compaction_interval: "5m"
        defrag_threshold: 100
        health_check_interval: "10s"
    }
    backup: {
        enabled: true
        interval: "6h"
        retention: "30d"
        s3_bucket: "k8s-etcd-backups"
        encryption: true
    }
}
```

### DNS-Based Discovery Configuration
```kcl
etcd: ETCD = {
    name: "etcd"
    version: "v3.5.10"
    # ... base configuration
    dns_domain_path: "_etcd-server-ssl._tcp.company.com"
    domain_name: "company.com"
    discovery_srv: "_etcd-server-ssl._tcp.company.com"
    use_dns: true
    dns_discovery: {
        enabled: true
        service: "_etcd-server-ssl._tcp"
        domain: "company.com"
        srv_records: [
            {
                name: "etcd-1"
                port: 2380
                weight: 10
                priority: 0
                target: "etcd-1.company.com"
            },
            {
                name: "etcd-2"
                port: 2380
                weight: 10
                priority: 0
                target: "etcd-2.company.com"
            },
            {
                name: "etcd-3"
                port: 2380
                weight: 10
                priority: 0
                target: "etcd-3.company.com"
            }
        ]
    }
}
```
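
Each entry in `srv_records` corresponds to one DNS SRV record, whose data fields are priority, weight, port, and target, in that order. The sketch below prints the records in zone-file syntax so they can be compared with what the DNS server actually publishes. The helper name `srv_record` and the 300-second TTL are assumptions for illustration.

```bash
# srv_record SERVICE DOMAIN PRIORITY WEIGHT PORT TARGET [TTL]
# Prints one SRV record in zone-file syntax (TTL defaults to 300, assumed).
srv_record() {
    echo "$1.$2. ${7:-300} IN SRV $3 $4 $5 $6."
}

for host in etcd-1 etcd-2 etcd-3; do
    srv_record _etcd-server-ssl._tcp company.com 0 10 2380 "$host.company.com"
done
```

Once the records are published, `dig +short SRV _etcd-server-ssl._tcp.company.com` should return the same priority, weight, port, and target values.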

## Usage

### Deploy Etcd
```bash
./core/nulib/provisioning taskserv create etcd --infra <infrastructure-name>
```

### List Available Task Services
```bash
./core/nulib/provisioning taskserv list
```

### SSH to Etcd Server
```bash
./core/nulib/provisioning server ssh <etcd-server>
```

### Service Management
```bash
# Check etcd status
systemctl status etcd

# Start/stop etcd
systemctl start etcd
systemctl stop etcd
systemctl restart etcd

# View etcd logs
journalctl -u etcd -f

# Check etcd version
etcd --version
etcdctl version
```

### Cluster Operations
```bash
# Check cluster health
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ca.pem \
  --cert=/etc/ssl/etcd/etcd.pem \
  --key=/etc/ssl/etcd/etcd-key.pem \
  endpoint health

# List cluster members
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ca.pem \
  --cert=/etc/ssl/etcd/etcd.pem \
  --key=/etc/ssl/etcd/etcd-key.pem \
  member list

# Check cluster status
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ca.pem \
  --cert=/etc/ssl/etcd/etcd.pem \
  --key=/etc/ssl/etcd/etcd-key.pem \
  endpoint status --write-out=table
```
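
The endpoint and TLS flags repeated in the commands above can also be supplied through environment variables, because `etcdctl` reads any flag from a variable named `ETCDCTL_` plus the upper-cased flag name. A minimal sketch, assuming the certificate paths used throughout this document:

```bash
# Equivalent to passing --endpoints, --cacert, --cert and --key on every call.
export ETCDCTL_ENDPOINTS="https://127.0.0.1:2379"
export ETCDCTL_CACERT="/etc/ssl/etcd/ca.pem"
export ETCDCTL_CERT="/etc/ssl/etcd/etcd.pem"
export ETCDCTL_KEY="/etc/ssl/etcd/etcd-key.pem"

# With these set, the commands shorten to, for example:
#   etcdctl endpoint health
#   etcdctl member list
#   etcdctl endpoint status --write-out=table
```

Do not combine a variable with the matching command-line flag in the same call; `etcdctl` rejects the conflicting settings.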

### Data Operations
```bash
# Put a key-value pair
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ca.pem \
  --cert=/etc/ssl/etcd/etcd.pem \
  --key=/etc/ssl/etcd/etcd-key.pem \
  put /config/app/database "postgresql://localhost:5432/app"

# Get a value
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ca.pem \
  --cert=/etc/ssl/etcd/etcd.pem \
  --key=/etc/ssl/etcd/etcd-key.pem \
  get /config/app/database

# Get all keys with prefix
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ca.pem \
  --cert=/etc/ssl/etcd/etcd.pem \
  --key=/etc/ssl/etcd/etcd-key.pem \
  get /config/ --prefix

# Watch for changes
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ca.pem \
  --cert=/etc/ssl/etcd/etcd.pem \
  --key=/etc/ssl/etcd/etcd-key.pem \
  watch /config/ --prefix
```
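
The compare-and-swap behaviour listed under Core Capabilities is available through `etcdctl txn`, which reads three blank-line-separated sections from standard input: the comparisons, the operations to run if they all hold, and the operations to run otherwise. The sketch below only builds that input; `cas_txn_body` is a hypothetical helper, and it assumes the key and values contain no double quotes.

```bash
# cas_txn_body KEY EXPECTED NEW
# Prints etcdctl txn input: if KEY currently equals EXPECTED, write NEW,
# otherwise read back the current value.
cas_txn_body() {
    printf 'value("%s") = "%s"\n\nput %s "%s"\n\nget %s\n\n' \
        "$1" "$2" "$1" "$3" "$1"
}

cas_txn_body /config/app/database \
    "postgresql://localhost:5432/app" \
    "postgresql://db.internal:5432/app"
```

Piping the output into `etcdctl txn` (with the same endpoint and TLS flags as above) applies the update atomically. The hostname `db.internal` is a placeholder.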

### Backup and Restore
```bash
# Create snapshot backup
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ca.pem \
  --cert=/etc/ssl/etcd/etcd.pem \
  --key=/etc/ssl/etcd/etcd-key.pem \
  snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db

# Check snapshot status (etcdutl replaces the etcdctl subcommand deprecated in v3.5)
etcdutl --write-out=table snapshot status /backup/etcd-snapshot.db

# Restore from snapshot (offline operation; without --data-dir the data
# is written to ./etcd-restored.etcd in the current directory)
etcdutl snapshot restore /backup/etcd-snapshot.db \
  --name etcd-restored \
  --initial-cluster etcd-restored=https://127.0.0.1:2380 \
  --initial-cluster-token etcd-cluster-restored \
  --initial-advertise-peer-urls https://127.0.0.1:2380
```
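
Timestamped snapshots accumulate quickly, so a scheduled backup usually needs a pruning step. Because the `date +%Y%m%d-%H%M%S` suffix used above sorts lexicographically in time order, the newest files can be selected by name alone. A sketch, with `prune_snapshots` as a hypothetical helper:

```bash
# prune_snapshots DIR KEEP
# Deletes all but the KEEP newest etcd-snapshot-<timestamp>.db files in DIR.
prune_snapshots() {
    dir=$1
    keep=$2
    ls -1 "$dir"/etcd-snapshot-*.db 2>/dev/null \
        | sort -r \
        | tail -n +"$(( keep + 1 ))" \
        | while IFS= read -r old; do
            rm -f -- "$old"
            echo "removed $old"
        done
}

# Example: keep the seven most recent snapshots.
# prune_snapshots /backup 7
```

Run the pruning step only after the new snapshot has been saved and verified, so a failed backup never removes the last good copy.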

### Maintenance Operations
```bash
# Compact etcd database
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ca.pem \
  --cert=/etc/ssl/etcd/etcd.pem \
  --key=/etc/ssl/etcd/etcd-key.pem \
  compact $(etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ca.pem \
  --cert=/etc/ssl/etcd/etcd.pem \
  --key=/etc/ssl/etcd/etcd-key.pem \
  endpoint status --write-out="json" | jq -r '.[0].Status.header.revision')

# Defragment etcd database
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ca.pem \
  --cert=/etc/ssl/etcd/etcd.pem \
  --key=/etc/ssl/etcd/etcd-key.pem \
  defrag

# Check database size
etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/ssl/etcd/ca.pem \
  --cert=/etc/ssl/etcd/etcd.pem \
  --key=/etc/ssl/etcd/etcd-key.pem \
  endpoint status --write-out=table
```
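
Defragmentation blocks the member while it runs, so it helps to estimate how much space it would reclaim first. The sketch below computes the share of the database file that is free after compaction. It assumes the JSON status output exposes `dbSize` and `dbSizeInUse` under `.Status`; check the field names against your etcd version. `frag_percent` is an illustrative name.

```bash
# frag_percent DB_SIZE IN_USE
# Percentage of the database file that is allocated but no longer in use.
frag_percent() {
    db_size=$1
    in_use=$2
    if [ "$db_size" -le 0 ]; then
        echo 0
        return 0
    fi
    echo $(( (db_size - in_use) * 100 / db_size ))
}

# Example with values read from the status output (field names assumed):
#   status=$(etcdctl endpoint status --write-out=json)
#   frag_percent "$(echo "$status" | jq -r '.[0].Status.dbSize')" \
#                "$(echo "$status" | jq -r '.[0].Status.dbSizeInUse')"
frag_percent 1000000 600000
```

A low percentage means `defrag` would reclaim little space and can be deferred.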

## Architecture

### System Architecture
```
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Applications  │────│   Etcd Cluster   │────│   Data Storage  │
│                 │    │                  │    │                 │
│ • Kubernetes    │    │ • Leader Node    │    │ • Raft Log      │
│ • Config Mgmt   │────│ • Follower Nodes │────│ • Key-Value DB  │
│ • Service Disc. │    │ • Client API     │    │ • Snapshots     │
│ • Coordination  │    │ • Peer Protocol  │    │ • WAL Files     │
└─────────────────┘    └──────────────────┘    └─────────────────┘
```

### Cluster Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                    Etcd Cluster (3 Nodes)                  │
├─────────────────────────────────────────────────────────────┤
│  Leader Node       │  Follower Node 1   │  Follower Node 2  │
│  (etcd-1)          │  (etcd-2)          │  (etcd-3)         │
│                    │                    │                   │
│ • Write Operations │ • Read Operations  │ • Read Operations │
│ • Log Replication  │ • Log Reception    │ • Log Reception   │
│ • Heartbeat Sender │ • Heartbeat Reply  │ • Heartbeat Reply │
│ • Decision Making  │ • Vote Casting     │ • Vote Casting    │
├─────────────────────────────────────────────────────────────┤
│                    RAFT Consensus Layer                    │
├─────────────────────────────────────────────────────────────┤
│                      Network Layer                         │
│  Client Port: 2379  │  Peer Port: 2380   │  Metrics: 2381  │
└─────────────────────────────────────────────────────────────┘
```

### Data Flow Architecture
```
Client Request → API Gateway → RAFT Protocol → Storage Engine
     ↓               ↓             ↓              ↓
Authentication → Authorization → Consensus → Persistence
     ↓               ↓             ↓              ↓
TLS Validation → RBAC Check → Leader Election → Disk Write
```

### File Structure
```
/var/lib/etcd/                # Data directory
├── member/                   # Member data
│   ├── snap/                # Snapshots
│   └── wal/                 # Write-ahead logs
└── proxy/                   # Proxy data (if enabled)

/etc/etcd/                   # Configuration
├── config.yaml             # Main configuration
├── env                     # Environment variables  
├── etcdctl.sh             # Client script
└── cert-show.sh           # Certificate inspection

/etc/ssl/etcd/              # SSL certificates
├── ca.pem                 # Certificate Authority
├── etcd.pem              # Server certificate
├── etcd-key.pem          # Server private key
├── peer.pem              # Peer certificate
└── peer-key.pem          # Peer private key

/var/log/etcd/             # Log files
├── etcd.log              # Main log file
└── audit.log             # Audit log (if enabled)
```

## Supported Operating Systems

- Ubuntu 20.04+ / Debian 11+
- CentOS 8+ / RHEL 8+ / Fedora 35+
- Amazon Linux 2+
- SUSE Linux Enterprise 15+

## System Requirements

### Minimum Requirements
- **RAM**: 2GB (4GB+ recommended)
- **Storage**: 20GB SSD (50GB+ for production)
- **CPU**: 2 cores (4+ cores recommended)
- **Network**: Low latency between cluster members

### Production Requirements
- **RAM**: 8GB+ (16GB+ for large clusters)
- **Storage**: 100GB+ NVMe SSD (high IOPS)
- **CPU**: 4+ cores (8+ cores for high load)
- **Network**: Dedicated network with <10ms latency between nodes

### Performance Requirements
- **Disk IOPS**: 1000+ IOPS for WAL directory
- **Network Bandwidth**: 1Gbps+ between cluster members
- **Latency**: <10ms between cluster members
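- **Filesystem**: ext4 or xfs; keep write barriers enabled, because etcd relies on fsync durability for its WAL and disabling them risks data loss on power failure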

## Troubleshooting

### Cluster Health Issues
```bash
# Check cluster health
etcdctl endpoint health --cluster

# Check member status
etcdctl member list -w table

# Check leader election
etcdctl endpoint status --cluster -w table

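# Check the health endpoint (served on the client port 2379, not the peer port)
curl --cacert /etc/ssl/etcd/ca.pem \
  --cert /etc/ssl/etcd/etcd.pem \
  --key /etc/ssl/etcd/etcd-key.pem \
  https://etcd-node:2379/health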
```

### Performance Issues
```bash
# Check database size
etcdctl endpoint status --write-out=table

# Defragment all members (this reclaims space; it does not report fragmentation)
etcdctl defrag --cluster

# Monitor metrics
curl http://localhost:2381/metrics

# Check for slow requests and slow disk syncs
journalctl -u etcd | grep -E "took too long|slow"
```

### Certificate Issues
```bash
# Check certificate validity
openssl x509 -in /etc/ssl/etcd/etcd.pem -text -noout

# Verify certificate chain
openssl verify -CAfile /etc/ssl/etcd/ca.pem /etc/ssl/etcd/etcd.pem

# Check certificate expiration
/etc/etcd/cert-show.sh

# Test TLS connection
openssl s_client -connect localhost:2379 -cert /etc/ssl/etcd/etcd.pem -key /etc/ssl/etcd/etcd-key.pem
```
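
Expiry checks are easier to automate with `openssl x509 -checkend`, which exits non-zero when a certificate will expire within the given number of seconds. A sketch, with `cert_expires_within` as a hypothetical helper and the 30-day window as an arbitrary example:

```bash
# cert_expires_within CERT DAYS
# Succeeds (exit 0) when CERT expires within DAYS days. An unreadable
# file also counts as "expiring", so a missing cert is never ignored.
cert_expires_within() {
    cert=$1
    days=$2
    if openssl x509 -checkend "$(( days * 86400 ))" -noout -in "$cert" >/dev/null 2>&1; then
        return 1   # still valid after the window
    fi
    return 0
}

# Example: warn about any etcd certificate expiring in the next 30 days.
for cert in /etc/ssl/etcd/ca.pem /etc/ssl/etcd/etcd.pem /etc/ssl/etcd/peer.pem; do
    if cert_expires_within "$cert" 30; then
        echo "WARNING: $cert expires within 30 days or cannot be read"
    fi
done
```

With `sign_days: 730` as configured earlier, a check like this gives advance notice before server and peer certificates need to be reissued.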

### Data Corruption Issues
```bash
# Check data consistency (members at the same revision should report the same hash)
etcdctl endpoint hashkv --cluster

# Rebuild a data directory from a known-good snapshot
etcdutl snapshot restore /backup/snapshot.db

# Check WAL files
ls -la /var/lib/etcd/member/wal/

# Verify snapshot integrity
etcdutl snapshot status /backup/snapshot.db
```

### Network Issues
```bash
# Check port connectivity
telnet etcd-node 2379
telnet etcd-node 2380

# Check firewall rules
iptables -L | grep -E "(2379|2380)"

# Test peer connectivity
curl -k https://etcd-peer:2380/version

# Monitor network traffic
tcpdump -i eth0 port 2379 or port 2380
```

## Security Considerations

### Transport Security
- **TLS Encryption** - Enable TLS for all client and peer communication
- **Certificate Management** - Use proper CA-signed certificates
- **Key Rotation** - Regular certificate rotation
- **Cipher Suites** - Use strong cipher suites

### Authentication & Authorization
- **Client Authentication** - Certificate-based client authentication
- **RBAC** - Role-based access control for fine-grained permissions
- **User Management** - Separate users for different applications
- **Audit Logging** - Comprehensive audit trail
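
The per-application users and roles described above map onto a short sequence of `etcdctl` commands. The sketch below grants one user read/write access to a single key prefix. It assumes certificate-based authentication, where the user name matches the client certificate's CN (hence `--no-password`); the function name and the `ETCD_CLI` variable are illustrative.

```bash
# Command used to reach etcd; override it to dry-run (e.g. ETCD_CLI=echo).
ETCD_CLI=${ETCD_CLI:-etcdctl}

# grant_prefix_access USER ROLE PREFIX
# Creates ROLE with read/write access to keys under PREFIX, creates USER
# without a password (certificate CN authentication), and binds the two.
grant_prefix_access() {
    "$ETCD_CLI" role add "$2" &&
    "$ETCD_CLI" role grant-permission --prefix=true "$2" readwrite "$3" &&
    "$ETCD_CLI" user add "$1" --no-password &&
    "$ETCD_CLI" user grant-role "$1" "$2"
}

# Example (endpoint and TLS flags or ETCDCTL_* variables must be set):
# grant_prefix_access app app-rw /config/app/
```

Permissions are only enforced after a `root` user exists and `etcdctl auth enable` has been run.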

### Network Security
- **Firewall Rules** - Restrict access to etcd ports
- **Network Segmentation** - Isolate etcd traffic
- **VPN/Private Networks** - Use private networks for cluster communication
- **DDoS Protection** - Implement rate limiting and connection limits

### Data Security
- **Encryption at Rest** - Encrypt etcd data directory
- **Backup Security** - Encrypt and secure backups
- **Secret Management** - Proper handling of sensitive data
- **Access Logs** - Monitor all data access

## Performance Optimization

### Hardware Optimization
- **Storage** - Use NVMe SSDs with high IOPS
- **Memory** - Sufficient RAM for database caching
- **CPU** - High-frequency CPUs for consensus operations
- **Network** - Low-latency, high-bandwidth network

### Configuration Tuning
- **Snapshot Count** - Optimize snapshot frequency
- **Heartbeat Interval** - Tune for network conditions
- **Election Timeout** - Balance failover speed against spurious leader elections
- **Batch Limits** - Optimize batch processing

### Operational Optimization
- **Regular Compaction** - Automated database compaction
- **Defragmentation** - Regular defragmentation schedule
- **Monitoring** - Comprehensive performance monitoring
- **Capacity Planning** - Proper cluster sizing

## Resources

- **Official Documentation**: [etcd.io](https://etcd.io/)
- **GitHub Repository**: [etcd-io/etcd](https://github.com/etcd-io/etcd)
- **Kubernetes Integration**: [kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd](https://kubernetes.io/docs/tasks/administer-cluster/configure-upgrade-etcd/)
- **RAFT Consensus**: [raft.github.io](https://raft.github.io/)
- **Community**: [etcd.io/community](https://etcd.io/community/)