# Rook-Ceph Task Service ## Overview The Rook-Ceph task service provides a complete installation and configuration of [Rook](https://rook.io/) with [Ceph](https://ceph.io/), a cloud-native storage orchestrator that automates deployment, bootstrapping, configuration, provisioning, scaling, upgrading, migration, disaster recovery, monitoring, and resource management of Ceph storage in Kubernetes environments. ## Features ### Core Storage Features - **Block Storage (RBD)** - High-performance block storage for databases and applications - **Object Storage (RGW)** - S3-compatible object storage for applications and backups - **File System (CephFS)** - POSIX-compliant shared file system for multiple consumers - **Multi-Site Replication** - Geographic data replication and disaster recovery - **Erasure Coding** - Space-efficient data protection with configurable redundancy ### Kubernetes Native Features - **Operator-Based Management** - Kubernetes-native lifecycle management - **Dynamic Provisioning** - Automatic PV provisioning with StorageClasses - **CSI Driver Integration** - Container Storage Interface for advanced features - **Snapshots & Cloning** - Volume snapshots and cloning capabilities - **Resizing** - Online volume expansion and shrinking ### High Availability & Scalability - **Distributed Architecture** - No single point of failure - **Automatic Recovery** - Self-healing and data rebalancing - **Horizontal Scaling** - Add/remove storage nodes dynamically - **Multi-Zone Deployment** - Cross-zone and cross-region deployment - **Load Balancing** - Automatic load distribution across OSDs ### Management & Monitoring - **Ceph Dashboard** - Web-based management interface - **Prometheus Integration** - Comprehensive metrics and monitoring - **Health Monitoring** - Continuous cluster health monitoring - **Performance Monitoring** - Real-time performance metrics - **Alerting System** - Configurable alerts for various conditions ### Security Features - **Encryption at Rest** - Transparent data encryption - **Encryption in Transit** - Network traffic encryption - **Authentication** - CephX authentication system - **Access Control** - Fine-grained access control policies - **Security Hardening** - Security best practices implementation ## Configuration ### Basic Ceph Cluster ```kcl rook_ceph: RookCeph = { name: "rook-ceph" namespace: "rook-ceph" clustername: "rook-ceph" ceph_image: "quay.io/ceph/ceph:v18.2.4" rookCeph_image: "rook/ceph:v1.14.2" dataDirHostPath: "/var/lib/rook" object_user: "ceph-user" object_storename: "ceph-objectstore" object_displayname: "Ceph Object Store" storage_fsName: "cephfs" storage_pool: "cephfs-replicated" nodes: [ { name: "node1" devices: ["/dev/sdb", "/dev/sdc"] }, { name: "node2" devices: ["/dev/sdb", "/dev/sdc"] }, { name: "node3" devices: ["/dev/sdb", "/dev/sdc"] } ] } ``` ### Production Ceph Cluster ```kcl rook_ceph: RookCeph = { name: "rook-ceph-prod" namespace: "rook-ceph" clustername: "production-ceph" ceph_image: "quay.io/ceph/ceph:v18.2.4" rookCeph_image: "rook/ceph:v1.14.2" dataDirHostPath: "/var/lib/rook" object_user: "prod-user" object_storename: "production-objectstore" object_displayname: "Production Object Store" storage_fsName: "production-fs" storage_pool: "replicated-pool" nodes: [ { name: "storage-01" devices: ["/dev/nvme0n1", "/dev/nvme1n1", "/dev/nvme2n1"] }, { name: "storage-02" devices: ["/dev/nvme0n1", "/dev/nvme1n1", "/dev/nvme2n1"] }, { name: "storage-03" devices: ["/dev/nvme0n1", "/dev/nvme1n1", "/dev/nvme2n1"] }, { name: "storage-04" devices: ["/dev/nvme0n1", "/dev/nvme1n1", "/dev/nvme2n1"] }, { name: "storage-05" devices: ["/dev/nvme0n1", "/dev/nvme1n1", "/dev/nvme2n1"] } ] cluster_config: { mon_count: 5 mon_allow_multiple_per_node: false mgr_count: 2 dashboard_enabled: true dashboard_ssl: true monitoring_enabled: true crash_collector_disable: false log_collector_disable: false cleanup_confirm_i_really_mean_it: false } storage_config: { use_all_nodes: false use_all_devices: false store_type: "bluestore" database_size_mb: 1024 journal_size_mb: 1024 osds_per_device: 1 encrypt_device: true } } ``` ### Multi-Zone High Availability Setup ```kcl rook_ceph: RookCeph = { name: "rook-ceph-ha" namespace: "rook-ceph-system" clustername: "ha-ceph-cluster" ceph_image: "quay.io/ceph/ceph:v18.2.4" rookCeph_image: "rook/ceph:v1.14.2" dataDirHostPath: "/var/lib/rook" # ... base configuration placement: { mon: { node_affinity: { required_during_scheduling_ignored_during_execution: { node_selector_terms: [ { match_expressions: [ { key: "rook.io/has-disk" operator: "In" values: ["true"] } ] } ] } } pod_anti_affinity: { preferred_during_scheduling_ignored_during_execution: [ { weight: 100 pod_affinity_term: { label_selector: { match_expressions: [ { key: "app" operator: "In" values: ["rook-ceph-mon"] } ] } topology_key: "kubernetes.io/hostname" } } ] } tolerations: [ { key: "storage-node" operator: "Exists" } ] } mgr: { # Similar placement rules for managers } osd: { # Similar placement rules for OSDs } } disaster_recovery: { enabled: true backup_schedule: "0 2 * * *" # Daily at 2 AM retention_days: 30 remote_site: { enabled: true endpoint: "https://backup-site.company.com" access_key: "backup-access-key" secret_key: "backup-secret-key" bucket: "ceph-dr-backups" } } } ``` ### Object Storage Configuration ```kcl rook_ceph: RookCeph = { name: "rook-ceph-object" # ... base configuration object_stores: [ { name: "production-s3" namespace: "rook-ceph" spec: { metadata_pool: { replicated: { size: 3 require_safe_replica_size: true } } data_pool: { erasure_coded: { data_chunks: 4 coding_chunks: 2 } } preserve_pools_on_delete: false gateway: { port: 80 secure_port: 443 instances: 3 placement: { node_affinity: { required_during_scheduling_ignored_during_execution: { node_selector_terms: [ { match_expressions: [ { key: "node-type" operator: "In" values: ["storage"] } ] } ] } } } resources: { limits: { cpu: "2000m" memory: "4Gi" } requests: { cpu: "1000m" memory: "2Gi" } } } health_check: { bucket: { enabled: true interval: "60s" } } } } ] object_store_users: [ { name: "app-user" namespace: "rook-ceph" display_name: "Application User" capabilities: { user: "read, write" bucket: "read, write, delete" metadata: "read, write" usage: "read" zone: "read" } quota: { max_objects: 10000 max_size: "100Gi" } } ] } ``` ### File System Configuration ```kcl rook_ceph: RookCeph = { name: "rook-ceph-fs" # ... base configuration filesystems: [ { name: "shared-fs" namespace: "rook-ceph" spec: { metadata_pool: { replicated: { size: 3 require_safe_replica_size: true } } data_pools: [ { name: "replicated-pool" replicated: { size: 3 require_safe_replica_size: true } }, { name: "erasure-coded-pool" erasure_coded: { data_chunks: 6 coding_chunks: 2 } } ] preserve_filesystem_on_delete: false metadata_server: { active_count: 2 active_standby: true placement: { node_affinity: { required_during_scheduling_ignored_during_execution: { node_selector_terms: [ { match_expressions: [ { key: "node-type" operator: "In" values: ["storage"] } ] } ] } } } resources: { limits: { cpu: "3000m" memory: "8Gi" } requests: { cpu: "1000m" memory: "4Gi" } } priority_class_name: "system-cluster-critical" } } } ] } ``` ### Monitoring and Alerting Configuration ```kcl rook_ceph: RookCeph = { name: "rook-ceph-monitoring" # ... base configuration monitoring: { enabled: true prometheus: { enabled: true service_monitor: true external_mgr_endpoints: [] external_mgr_prometheus_port: 9283 } grafana: { enabled: true service_type: "ClusterIP" ingress: { enabled: true host: "grafana.ceph.company.com" annotations: { "kubernetes.io/ingress.class": "nginx" "cert-manager.io/cluster-issuer": "letsencrypt-prod" } tls: [ { hosts: ["grafana.ceph.company.com"] secret_name: "grafana-tls" } ] } } alertmanager: { enabled: true config: { global: { smtp_smarthost: "smtp.company.com:587" smtp_from: "alerts@company.com" } route: { group_by: ["alertname"] group_wait: "10s" group_interval: "10s" repeat_interval: "1h" receiver: "web.hook" } receivers: [ { name: "web.hook" email_configs: [ { to: "storage-team@company.com" subject: "Ceph Alert: {{ .GroupLabels.alertname }}" body: "{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}" } ] } ] } } } } ``` ## Usage ### Deploy Rook-Ceph ```bash ./core/nulib/provisioning taskserv create rook-ceph --infra ``` ### List Available Task Services ```bash ./core/nulib/provisioning taskserv list ``` ### SSH to Storage Node ```bash ./core/nulib/provisioning server ssh ``` ### Cluster Management ```bash # Check cluster status kubectl -n rook-ceph get cephcluster # View cluster details kubectl -n rook-ceph describe cephcluster rook-ceph # Check operator status kubectl -n rook-ceph get pods -l app=rook-ceph-operator # View operator logs kubectl -n rook-ceph logs -l app=rook-ceph-operator -f ``` ### Storage Operations ```bash # Check OSD status kubectl -n rook-ceph get pods -l app=rook-ceph-osd # View storage classes kubectl get storageclass # Create test PVC kubectl apply -f - < ``` ### Network Issues ```bash # Check network connectivity kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph mon stat # Test inter-node connectivity kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ping # Check service endpoints kubectl -n rook-ceph get endpoints # Monitor network performance kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- rados bench -p rbd 10 write ``` ## Security Considerations ### Cluster Security - **Network Isolation** - Separate storage network from application traffic - **Encryption** - Enable encryption at rest and in transit - **Authentication** - Use CephX authentication for all components - **Access Control** - Implement proper RBAC for Kubernetes resources ### Data Security - **Encryption at Rest** - Enable OSD encryption for sensitive data - **Backup Security** - Encrypt backups and use secure storage - **Network Security** - Use dedicated networks and VLANs - **Key Management** - Secure key storage and rotation ### Kubernetes Security - **RBAC** - Implement proper RBAC for Rook resources - **Network Policies** - Use Kubernetes network policies - **Pod Security** - Configure pod security policies - **Secret Management** - Secure handling of Ceph secrets ### Operational Security - **Regular Updates** - Keep Rook, Ceph, and Kubernetes updated - **Security Monitoring** - Monitor for security events - **Access Auditing** - Audit administrative access - **Compliance** - Follow security compliance requirements ## Performance Optimization ### Hardware Optimization - **NVMe Storage** - Use NVMe SSDs for OSDs and journals - **Network Bandwidth** - Use 10Gbps+ networks for storage traffic - **CPU Performance** - High-frequency CPUs for better IOPS - **Memory** - Adequate RAM for buffer caches ### Ceph Configuration - **OSD Configuration** - Optimize OSD threads and memory - **Pool Configuration** - Choose appropriate replication/erasure coding - **PG Configuration** - Optimize placement group counts - **Network Configuration** - Separate cluster and public networks ### Kubernetes Optimization - **Node Selection** - Use dedicated storage nodes - **Resource Limits** - Set appropriate CPU and memory limits - **Scheduling** - Use anti-affinity for high availability - **Storage Classes** - Optimize storage class parameters ### Application Optimization - **Access Patterns** - Understand application I/O patterns - **Volume Sizing** - Right-size volumes for applications - **Snapshot Management** - Efficient snapshot scheduling - **Monitoring** - Continuous performance monitoring ## Resources - **Official Documentation**: [rook.io/docs](https://rook.io/docs/) - **Ceph Documentation**: [docs.ceph.com](https://docs.ceph.com/) - **GitHub Repository**: [rook/rook](https://github.com/rook/rook) - **Community Support**: [rook.io/community](https://rook.io/community/) - **CNCF Project**: [cncf.io/projects/rook](https://www.cncf.io/projects/rook/)