Rook-Ceph Task Service
Overview
The Rook-Ceph task service provides a complete installation and configuration of Rook with Ceph. Rook is a cloud-native storage orchestrator that automates the deployment, bootstrapping, configuration, provisioning, scaling, upgrading, migration, disaster recovery, monitoring, and resource management of Ceph storage in Kubernetes environments.
Features
Core Storage Features
- Block Storage (RBD) - High-performance block storage for databases and applications
- Object Storage (RGW) - S3-compatible object storage for applications and backups
- File System (CephFS) - POSIX-compliant shared file system for multiple consumers
- Multi-Site Replication - Geographic data replication and disaster recovery
- Erasure Coding - Space-efficient data protection with configurable redundancy
Kubernetes Native Features
- Operator-Based Management - Kubernetes-native lifecycle management
- Dynamic Provisioning - Automatic PV provisioning with StorageClasses
- CSI Driver Integration - Container Storage Interface for advanced features
- Snapshots & Cloning - Volume snapshots and cloning capabilities
- Resizing - Online volume expansion (Kubernetes does not support shrinking volumes); see the StorageClass sketch below
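Dynamic provisioning and online expansion come together in an RBD StorageClass; a minimal sketch, assuming the default rook-ceph namespace used throughout this document and a CephBlockPool named replicapool (an assumption; replace with a pool that exists in your cluster):

kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-ceph-block
provisioner: rook-ceph.rbd.csi.ceph.com
parameters:
  clusterID: rook-ceph
  # Assumed CephBlockPool name
  pool: replicapool
  imageFormat: "2"
  imageFeatures: layering
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-rbd-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-rbd-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
  csi.storage.k8s.io/fstype: ext4
reclaimPolicy: Delete
# Enables the online expansion described above
allowVolumeExpansion: true
EOF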
High Availability & Scalability
- Distributed Architecture - No single point of failure
- Automatic Recovery - Self-healing and data rebalancing
- Horizontal Scaling - Add/remove storage nodes dynamically
- Multi-Zone Deployment - Cross-zone and cross-region deployment
- Load Balancing - Automatic load distribution across OSDs
Management & Monitoring
- Ceph Dashboard - Web-based management interface
- Prometheus Integration - Comprehensive metrics and monitoring
- Health Monitoring - Continuous cluster health monitoring
- Performance Monitoring - Real-time performance metrics
- Alerting System - Configurable alerts for various conditions
Security Features
- Encryption at Rest - Transparent data encryption
- Encryption in Transit - Network traffic encryption
- Authentication - CephX authentication system
- Access Control - Fine-grained access control policies
- Security Hardening - Security best practices implementation
Configuration
Basic Ceph Cluster
rook_ceph: RookCeph = {
  name: "rook-ceph"
  namespace: "rook-ceph"
  clustername: "rook-ceph"
  ceph_image: "quay.io/ceph/ceph:v18.2.4"
  rookCeph_image: "rook/ceph:v1.14.2"
  dataDirHostPath: "/var/lib/rook"
  object_user: "ceph-user"
  object_storename: "ceph-objectstore"
  object_displayname: "Ceph Object Store"
  storage_fsName: "cephfs"
  storage_pool: "cephfs-replicated"
  nodes: [
    {
      name: "node1"
      devices: ["/dev/sdb", "/dev/sdc"]
    },
    {
      name: "node2"
      devices: ["/dev/sdb", "/dev/sdc"]
    },
    {
      name: "node3"
      devices: ["/dev/sdb", "/dev/sdc"]
    }
  ]
}
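After deploying a configuration like this, the operator needs several minutes to bootstrap mons, the mgr, and OSDs; one way to verify convergence (assuming the rook-ceph namespace above):

# Watch the CephCluster until its phase reports Ready (Ctrl-C to stop)
kubectl -n rook-ceph get cephcluster rook-ceph -w
# HEALTH_OK (or HEALTH_WARN while data rebalances) indicates a functioning cluster
kubectl -n rook-ceph get cephcluster rook-ceph -o jsonpath='{.status.ceph.health}'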
Production Ceph Cluster
rook_ceph: RookCeph = {
  name: "rook-ceph-prod"
  namespace: "rook-ceph"
  clustername: "production-ceph"
  ceph_image: "quay.io/ceph/ceph:v18.2.4"
  rookCeph_image: "rook/ceph:v1.14.2"
  dataDirHostPath: "/var/lib/rook"
  object_user: "prod-user"
  object_storename: "production-objectstore"
  object_displayname: "Production Object Store"
  storage_fsName: "production-fs"
  storage_pool: "replicated-pool"
  nodes: [
    {
      name: "storage-01"
      devices: ["/dev/nvme0n1", "/dev/nvme1n1", "/dev/nvme2n1"]
    },
    {
      name: "storage-02"
      devices: ["/dev/nvme0n1", "/dev/nvme1n1", "/dev/nvme2n1"]
    },
    {
      name: "storage-03"
      devices: ["/dev/nvme0n1", "/dev/nvme1n1", "/dev/nvme2n1"]
    },
    {
      name: "storage-04"
      devices: ["/dev/nvme0n1", "/dev/nvme1n1", "/dev/nvme2n1"]
    },
    {
      name: "storage-05"
      devices: ["/dev/nvme0n1", "/dev/nvme1n1", "/dev/nvme2n1"]
    }
  ]
  cluster_config: {
    mon_count: 5
    mon_allow_multiple_per_node: false
    mgr_count: 2
    dashboard_enabled: true
    dashboard_ssl: true
    monitoring_enabled: true
    crash_collector_disable: false
    log_collector_disable: false
    cleanup_confirm_i_really_mean_it: false
  }
  storage_config: {
    use_all_nodes: false
    use_all_devices: false
    store_type: "bluestore"
    database_size_mb: 1024
    journal_size_mb: 1024
    osds_per_device: 1
    encrypt_device: true
  }
}
Multi-Zone High Availability Setup
rook_ceph: RookCeph = {
  name: "rook-ceph-ha"
  namespace: "rook-ceph-system"
  clustername: "ha-ceph-cluster"
  ceph_image: "quay.io/ceph/ceph:v18.2.4"
  rookCeph_image: "rook/ceph:v1.14.2"
  dataDirHostPath: "/var/lib/rook"
  # ... base configuration
  placement: {
    mon: {
      node_affinity: {
        required_during_scheduling_ignored_during_execution: {
          node_selector_terms: [
            {
              match_expressions: [
                {
                  key: "rook.io/has-disk"
                  operator: "In"
                  values: ["true"]
                }
              ]
            }
          ]
        }
      }
      pod_anti_affinity: {
        preferred_during_scheduling_ignored_during_execution: [
          {
            weight: 100
            pod_affinity_term: {
              label_selector: {
                match_expressions: [
                  {
                    key: "app"
                    operator: "In"
                    values: ["rook-ceph-mon"]
                  }
                ]
              }
              topology_key: "kubernetes.io/hostname"
            }
          }
        ]
      }
      tolerations: [
        {
          key: "storage-node"
          operator: "Exists"
        }
      ]
    }
    mgr: {
      # Similar placement rules for managers
    }
    osd: {
      # Similar placement rules for OSDs
    }
  }
  disaster_recovery: {
    enabled: true
    backup_schedule: "0 2 * * *"  # Daily at 2 AM
    retention_days: 30
    remote_site: {
      enabled: true
      endpoint: "https://backup-site.company.com"
      access_key: "backup-access-key"
      secret_key: "backup-secret-key"
      bucket: "ceph-dr-backups"
    }
  }
}
Object Storage Configuration
rook_ceph: RookCeph = {
  name: "rook-ceph-object"
  # ... base configuration
  object_stores: [
    {
      name: "production-s3"
      namespace: "rook-ceph"
      spec: {
        metadata_pool: {
          replicated: {
            size: 3
            require_safe_replica_size: true
          }
        }
        data_pool: {
          erasure_coded: {
            data_chunks: 4
            coding_chunks: 2
          }
        }
        preserve_pools_on_delete: false
        gateway: {
          port: 80
          secure_port: 443
          instances: 3
          placement: {
            node_affinity: {
              required_during_scheduling_ignored_during_execution: {
                node_selector_terms: [
                  {
                    match_expressions: [
                      {
                        key: "node-type"
                        operator: "In"
                        values: ["storage"]
                      }
                    ]
                  }
                ]
              }
            }
          }
          resources: {
            limits: {
              cpu: "2000m"
              memory: "4Gi"
            }
            requests: {
              cpu: "1000m"
              memory: "2Gi"
            }
          }
        }
        health_check: {
          bucket: {
            enabled: true
            interval: "60s"
          }
        }
      }
    }
  ]
  object_store_users: [
    {
      name: "app-user"
      namespace: "rook-ceph"
      display_name: "Application User"
      capabilities: {
        user: "read, write"
        bucket: "read, write, delete"
        metadata: "read, write"
        usage: "read"
        zone: "read"
      }
      quota: {
        max_objects: 10000
        max_size: "100Gi"
      }
    }
  ]
}
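With data_chunks: 4 and coding_chunks: 2, each object is written as 4 data chunks plus 2 coding chunks, so raw usage is (4+2)/4 = 1.5x the stored data while surviving the loss of any two chunks; size-3 replication, by comparison, costs 3x raw capacity. Besides explicit users like app-user above, applications can request buckets declaratively through Rook's ObjectBucketClaim; a sketch, assuming a bucket StorageClass named rook-ceph-bucket has been created for this store:

kubectl apply -f - <<EOF
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: app-bucket
  namespace: rook-ceph
spec:
  # Prefix for the generated bucket name
  generateBucketName: app-bucket
  # Assumed bucket StorageClass backed by the object store
  storageClassName: rook-ceph-bucket
EOF
# Rook creates a ConfigMap and Secret of the same name with endpoint and credentials
kubectl -n rook-ceph get configmap,secret app-bucket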
File System Configuration
rook_ceph: RookCeph = {
  name: "rook-ceph-fs"
  # ... base configuration
  filesystems: [
    {
      name: "shared-fs"
      namespace: "rook-ceph"
      spec: {
        metadata_pool: {
          replicated: {
            size: 3
            require_safe_replica_size: true
          }
        }
        data_pools: [
          {
            name: "replicated-pool"
            replicated: {
              size: 3
              require_safe_replica_size: true
            }
          },
          {
            name: "erasure-coded-pool"
            erasure_coded: {
              data_chunks: 6
              coding_chunks: 2
            }
          }
        ]
        preserve_filesystem_on_delete: false
        metadata_server: {
          active_count: 2
          active_standby: true
          placement: {
            node_affinity: {
              required_during_scheduling_ignored_during_execution: {
                node_selector_terms: [
                  {
                    match_expressions: [
                      {
                        key: "node-type"
                        operator: "In"
                        values: ["storage"]
                      }
                    ]
                  }
                ]
              }
            }
          }
          resources: {
            limits: {
              cpu: "3000m"
              memory: "8Gi"
            }
            requests: {
              cpu: "1000m"
              memory: "4Gi"
            }
          }
          priority_class_name: "system-cluster-critical"
        }
      }
    }
  ]
}
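With active_count: 2 and active_standby: true, the filesystem runs two active MDS ranks, each backed by a standby-replay daemon; the toolbox shows the resulting layout:

# Show MDS ranks, standby daemons, and per-pool usage for the filesystem
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph fs status shared-fs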
Monitoring and Alerting Configuration
rook_ceph: RookCeph = {
  name: "rook-ceph-monitoring"
  # ... base configuration
  monitoring: {
    enabled: true
    prometheus: {
      enabled: true
      service_monitor: true
      external_mgr_endpoints: []
      external_mgr_prometheus_port: 9283
    }
    grafana: {
      enabled: true
      service_type: "ClusterIP"
      ingress: {
        enabled: true
        host: "grafana.ceph.company.com"
        annotations: {
          "kubernetes.io/ingress.class": "nginx"
          "cert-manager.io/cluster-issuer": "letsencrypt-prod"
        }
        tls: [
          {
            hosts: ["grafana.ceph.company.com"]
            secret_name: "grafana-tls"
          }
        ]
      }
    }
    alertmanager: {
      enabled: true
      config: {
        global: {
          smtp_smarthost: "smtp.company.com:587"
          smtp_from: "alerts@company.com"
        }
        route: {
          group_by: ["alertname"]
          group_wait: "10s"
          group_interval: "10s"
          repeat_interval: "1h"
          receiver: "web.hook"
        }
        receivers: [
          {
            name: "web.hook"
            email_configs: [
              {
                to: "storage-team@company.com"
                subject: "Ceph Alert: {{ .GroupLabels.alertname }}"
                body: "{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}"
              }
            ]
          }
        ]
      }
    }
  }
}
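In addition to the Alertmanager email route above, the mgr Prometheus module exports ceph_health_status (0 = OK, 1 = WARN, 2 = ERR), which supports a simple cluster-level alert; a sketch, assuming the Prometheus Operator CRDs are present:

kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ceph-health
  namespace: rook-ceph
spec:
  groups:
    - name: ceph.health
      rules:
        - alert: CephClusterUnhealthy
          # ceph_health_status: 0 = HEALTH_OK, 1 = HEALTH_WARN, 2 = HEALTH_ERR
          expr: ceph_health_status >= 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Ceph cluster is in HEALTH_WARN or HEALTH_ERR"
EOF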
Usage
Deploy Rook-Ceph
./core/nulib/provisioning taskserv create rook-ceph --infra <infrastructure-name>
List Available Task Services
./core/nulib/provisioning taskserv list
SSH to Storage Node
./core/nulib/provisioning server ssh <storage-node>
Cluster Management
# Check cluster status
kubectl -n rook-ceph get cephcluster
# View cluster details
kubectl -n rook-ceph describe cephcluster rook-ceph
# Check operator status
kubectl -n rook-ceph get pods -l app=rook-ceph-operator
# View operator logs
kubectl -n rook-ceph logs -l app=rook-ceph-operator -f
Storage Operations
# Check OSD status
kubectl -n rook-ceph get pods -l app=rook-ceph-osd
# View storage classes
kubectl get storageclass
# Create test PVC
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: test-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: rook-ceph-block
EOF
# Check PVC status
kubectl get pvc test-pvc
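To exercise the volume end to end, mount the claim in a throwaway pod (a sketch; the busybox image choice is arbitrary):

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
    - name: app
      image: busybox
      command: ["sh", "-c", "echo hello > /data/hello && sleep 3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: test-pvc
EOF
# The PVC binds on first use if the StorageClass uses WaitForFirstConsumer
kubectl get pod/test-pod pvc/test-pvc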
Ceph Dashboard Access
# Get dashboard service
kubectl -n rook-ceph get service rook-ceph-mgr-dashboard
# Port forward to dashboard
kubectl -n rook-ceph port-forward service/rook-ceph-mgr-dashboard 8443:8443
# Get admin password
kubectl -n rook-ceph get secret rook-ceph-dashboard-password -o jsonpath="{['data']['password']}" | base64 --decode
# Access dashboard at https://localhost:8443
Ceph Commands via Toolbox
# Deploy Ceph toolbox
kubectl apply -f https://raw.githubusercontent.com/rook/rook/master/deploy/examples/toolbox.yaml
# Access Ceph toolbox
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash
# Inside toolbox - check cluster status
ceph status
ceph health detail
ceph df
# Check OSD status
ceph osd status
ceph osd tree
ceph osd pool ls
# Check mon status
ceph mon stat
ceph quorum_status
# Check placement groups
ceph pg stat
ceph pg dump
Object Storage Operations
# Create object store user
kubectl apply -f - <<EOF
apiVersion: ceph.rook.io/v1
kind: CephObjectStoreUser
metadata:
  name: test-user
  namespace: rook-ceph
spec:
  store: my-store
  displayName: "Test User"
EOF
# Get user credentials
kubectl -n rook-ceph get secret rook-ceph-object-user-my-store-test-user -o yaml
# Test S3 access
export AWS_HOST=$(kubectl -n rook-ceph get svc rook-ceph-rgw-my-store -o jsonpath='{.spec.clusterIP}')
export AWS_ENDPOINT=http://$AWS_HOST:80
export AWS_ACCESS_KEY_ID=$(kubectl -n rook-ceph get secret rook-ceph-object-user-my-store-test-user -o jsonpath='{.data.AccessKey}' | base64 --decode)
export AWS_SECRET_ACCESS_KEY=$(kubectl -n rook-ceph get secret rook-ceph-object-user-my-store-test-user -o jsonpath='{.data.SecretKey}' | base64 --decode)
# Create bucket and upload file
aws --endpoint-url $AWS_ENDPOINT s3 mb s3://test-bucket
aws --endpoint-url $AWS_ENDPOINT s3 cp /etc/hosts s3://test-bucket/
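Listing the bucket back and cleaning up confirms round-trip access:

# Verify the upload
aws --endpoint-url $AWS_ENDPOINT s3 ls s3://test-bucket
# Remove the bucket and its contents when done
aws --endpoint-url $AWS_ENDPOINT s3 rb s3://test-bucket --force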
File System Operations
# Create CephFS storage class
kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rook-cephfs
provisioner: rook-ceph.cephfs.csi.ceph.com
parameters:
  clusterID: rook-ceph
  fsName: myfs
  pool: myfs-replicated
  csi.storage.k8s.io/provisioner-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/provisioner-secret-namespace: rook-ceph
  csi.storage.k8s.io/controller-expand-secret-name: rook-csi-cephfs-provisioner
  csi.storage.k8s.io/controller-expand-secret-namespace: rook-ceph
  csi.storage.k8s.io/node-stage-secret-name: rook-csi-cephfs-node
  csi.storage.k8s.io/node-stage-secret-namespace: rook-ceph
reclaimPolicy: Delete
allowVolumeExpansion: true
EOF
# Create shared PVC
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cephfs-pvc
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Gi
  storageClassName: rook-cephfs
EOF
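Because the PVC is ReadWriteMany, several pods can mount it concurrently; a sketch with two replicas sharing the volume (image and paths are illustrative):

kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: shared-writer
spec:
  replicas: 2
  selector:
    matchLabels:
      app: shared-writer
  template:
    metadata:
      labels:
        app: shared-writer
    spec:
      containers:
        - name: app
          image: busybox
          # \$ keeps the expansion inside the pod rather than the local shell
          command: ["sh", "-c", "touch /shared/\$(hostname) && sleep 3600"]
          volumeMounts:
            - name: shared
              mountPath: /shared
      volumes:
        - name: shared
          persistentVolumeClaim:
            claimName: cephfs-pvc
EOF
# Each replica should see the file written by the other
kubectl exec deploy/shared-writer -- ls /shared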
Monitoring and Metrics
# Check monitoring pods
kubectl -n rook-ceph get pods -l app=rook-ceph-exporter
# Access Prometheus metrics
kubectl -n rook-ceph port-forward service/rook-prometheus 9090:9090
# Check Grafana dashboards
kubectl -n rook-ceph port-forward service/rook-grafana 3000:3000
# Dump a full cluster report (health, mons, OSDs, pools, usage)
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph report
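The raw metrics scraped by Prometheus come from the mgr on port 9283 (matching external_mgr_prometheus_port above) and can be inspected directly:

# Port-forward the mgr metrics endpoint (run in a separate terminal)
kubectl -n rook-ceph port-forward svc/rook-ceph-mgr 9283:9283
# Sample a few well-known Ceph metrics
curl -s http://localhost:9283/metrics | grep -E '^ceph_(health_status|osd_up|cluster_total_bytes)'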
Architecture
System Architecture
┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│  Applications   │─────│    Kubernetes    │─────│  Ceph Cluster   │
│                 │     │     Storage      │     │                 │
│ • Databases     │     │                  │     │ • MON Nodes     │
│ • File Shares   │─────│ • CSI Drivers    │─────│ • MGR Nodes     │
│ • Object Store  │     │ • Storage Classes│     │ • OSD Nodes     │
│ • Backups       │     │ • PV/PVC         │     │ • MDS Nodes     │
└─────────────────┘     └──────────────────┘     └─────────────────┘
Ceph Cluster Architecture
┌─────────────────────────────────────────────────────────────┐
│                        Ceph Cluster                         │
├─────────────────────────────────────────────────────────────┤
│     MON Nodes      │     MGR Nodes      │     OSD Nodes     │
│    (Monitors)      │    (Managers)      │    (Storage)      │
│                    │                    │                   │
│ • Cluster Map      │ • Dashboard        │ • Data Storage    │
│ • Consensus        │ • Metrics          │ • Replication     │
│ • Authentication   │ • Load Balancing   │ • Recovery        │
│ • Authorization    │ • Plugin System    │ • Scrubbing       │
├─────────────────────────────────────────────────────────────┤
│                      Storage Services                       │
├─────────────────────────────────────────────────────────────┤
│   RBD (Blocks)     │   CephFS (Files)   │   RGW (Objects)   │
│                    │                    │                   │
│ • Block Devices    │ • MDS Nodes        │ • S3/Swift API    │
│ • Snapshots        │ • POSIX FS         │ • Multi-Site      │
│ • Cloning          │ • Multi-Mount      │ • Bucket Policy   │
│ • Thin Provisioning│ • Quotas           │ • Load Balancer   │
└─────────────────────────────────────────────────────────────┘
Rook Operator Architecture
┌─────────────────────────────────────────────────────────────┐
│                     Kubernetes Cluster                      │
├─────────────────────────────────────────────────────────────┤
│   Rook Operator    │    CSI Drivers     │  Storage Classes  │
│                    │                    │                   │
│ • CRD Management   │ • RBD CSI          │ • Block Storage   │
│ • Lifecycle Mgmt   │ • CephFS CSI       │ • File Storage    │
│ • Health Monitor   │ • Volume Mgmt      │ • Dynamic Prov    │
│ • Upgrades         │ • Snapshots        │ • Resizing        │
├─────────────────────────────────────────────────────────────┤
│                        Ceph Cluster                         │
├─────────────────────────────────────────────────────────────┤
│  Physical Storage  │      Network       │     Security      │
│                    │                    │                   │
│ • Raw Devices      │ • Cluster Network  │ • Encryption      │
│ • Host Paths       │ • Public Network   │ • Authentication  │
│ • Node Selection   │ • Load Balancing   │ • Authorization   │
└─────────────────────────────────────────────────────────────┘
File Structure
/var/lib/rook/               # Rook data directory
├── rook-ceph/               # Cluster-specific data
│   ├── crash/               # Crash dumps
│   ├── log/                 # Ceph logs
│   ├── mon-a/               # Monitor data
│   ├── mgr-a/               # Manager data
│   ├── osd-0/               # OSD data
│   └── config/              # Ceph configuration

/etc/ceph/                   # Ceph configuration
├── ceph.conf                # Main configuration
├── keyring                  # Authentication keys
└── rbdmap                   # RBD device mapping

Kubernetes Resources:
├── CephCluster              # Main cluster definition
├── CephBlockPool            # Block storage pools
├── CephObjectStore          # Object storage definition
├── CephFilesystem           # File system definition
├── StorageClass             # Kubernetes storage classes
└── VolumeSnapshotClass      # Snapshot classes
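All of these resources are CRDs installed by the operator; to see what is available on a running cluster:

# List the Ceph- and bucket-related CRDs managed by Rook
kubectl get crds | grep -E 'ceph.rook.io|objectbucket.io'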
Supported Operating Systems
- Ubuntu 20.04+ / Debian 11+
- CentOS 8+ / RHEL 8+ / Fedora 35+
- Amazon Linux 2+
- SUSE Linux Enterprise 15+
System Requirements
Minimum Requirements (Development)
- Kubernetes: v1.22+ with 3 nodes
- RAM: 8GB per node (16GB+ recommended)
- Storage: 100GB raw storage per OSD
- CPU: 4 cores per node (8+ cores recommended)
- Network: 1Gbps between nodes
Production Requirements
- Kubernetes: v1.22+ with 5+ nodes
- RAM: 32GB+ per storage node
- Storage: 1TB+ NVMe SSD per OSD (multiple OSDs per node)
- CPU: 16+ cores per storage node
- Network: 10Gbps+ dedicated storage network
Hardware Recommendations
- CPU: High-frequency processors for better performance
- Memory: 4GB RAM per TB of raw storage (minimum)
- Storage: NVMe SSDs for best performance, separate devices for OSDs
- Network: Dedicated networks for cluster and public traffic
Troubleshooting
Cluster Health Issues
# Check cluster status
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status
# Check health details
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph health detail
# Check cluster logs
kubectl -n rook-ceph logs -l app=rook-ceph-operator
# Check monitor status
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph mon stat
OSD Issues
# Check OSD status
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd status
# Check OSD tree
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd tree
# Check failed OSDs
kubectl -n rook-ceph get pods -l app=rook-ceph-osd
# Check OSD logs
kubectl -n rook-ceph logs rook-ceph-osd-0-xxx
Storage Issues
# Check pool status
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd pool ls detail
# Check PG status
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph pg stat
# Check storage utilization
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph df
# Check slow operations
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd perf
CSI Driver Issues
# Check CSI pods
kubectl -n rook-ceph get pods -l app=csi-rbdplugin
kubectl -n rook-ceph get pods -l app=csi-cephfsplugin
# Check CSI logs
kubectl -n rook-ceph logs -l app=csi-rbdplugin-provisioner
kubectl -n rook-ceph logs -l app=csi-cephfsplugin-provisioner
# Check volume attachments
kubectl get volumeattachments
# Debug PVC issues
kubectl describe pvc <pvc-name>
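Provisioning failures usually surface as events on the PVC or in the provisioner logs; sorting recent events by time is often the fastest signal:

# Show recent events in the application namespace, newest last
kubectl get events --sort-by=.lastTimestamp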
Network Issues
# Check network connectivity
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph mon stat
# Test inter-node connectivity
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ping <node-ip>
# Check service endpoints
kubectl -n rook-ceph get endpoints
# Monitor network performance
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- rados bench -p rbd 10 write
Security Considerations
Cluster Security
- Network Isolation - Separate storage network from application traffic
- Encryption - Enable encryption at rest and in transit
- Authentication - Use CephX authentication for all components
- Access Control - Implement proper RBAC for Kubernetes resources
Data Security
- Encryption at Rest - Enable OSD encryption for sensitive data
- Backup Security - Encrypt backups and use secure storage
- Network Security - Use dedicated networks and VLANs
- Key Management - Secure key storage and rotation
Kubernetes Security
- RBAC - Implement proper RBAC for Rook resources
- Network Policies - Use Kubernetes network policies (see the default-deny sketch after this list)
- Pod Security - Configure pod security policies
- Secret Management - Secure handling of Ceph secrets
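As a starting point for the network policies mentioned above, a default-deny ingress policy on the rook-ceph namespace (a sketch; you must then explicitly allow the CSI, operator, and client traffic your environment needs):

kubectl apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: rook-ceph
spec:
  # An empty selector matches every pod in the namespace
  podSelector: {}
  policyTypes:
    - Ingress
EOF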
Operational Security
- Regular Updates - Keep Rook, Ceph, and Kubernetes updated
- Security Monitoring - Monitor for security events
- Access Auditing - Audit administrative access
- Compliance - Follow security compliance requirements
Performance Optimization
Hardware Optimization
- NVMe Storage - Use NVMe SSDs for OSDs and journals
- Network Bandwidth - Use 10Gbps+ networks for storage traffic
- CPU Performance - High-frequency CPUs for better IOPS
- Memory - Adequate RAM for buffer caches
Ceph Configuration
- OSD Configuration - Optimize OSD threads and memory
- Pool Configuration - Choose appropriate replication/erasure coding
- PG Configuration - Optimize placement group counts (see the autoscaler sketch after this list)
- Network Configuration - Separate cluster and public networks
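For the PG tuning mentioned above, current Ceph releases can manage placement group counts automatically; enabling the autoscaler per pool from the toolbox (the pool name replicapool is an assumption):

# Let Ceph size placement groups automatically for a pool
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd pool set replicapool pg_autoscale_mode on
# Review current vs. suggested PG counts
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph osd pool autoscale-status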
Kubernetes Optimization
- Node Selection - Use dedicated storage nodes
- Resource Limits - Set appropriate CPU and memory limits
- Scheduling - Use anti-affinity for high availability
- Storage Classes - Optimize storage class parameters
Application Optimization
- Access Patterns - Understand application I/O patterns
- Volume Sizing - Right-size volumes for applications
- Snapshot Management - Efficient snapshot scheduling
- Monitoring - Continuous performance monitoring
Resources
- Official Documentation: rook.io/docs
- Ceph Documentation: docs.ceph.com
- GitHub Repository: rook/rook
- Community Support: rook.io/community
- CNCF Project: cncf.io/projects/rook