Ceph (A Distributed Object Store)


1. Basic Ceph cluster installation

1.1 Prerequisites for Ceph cluster

  • The basic installation uses 5 nodes: 2 for OSDs, 1 for the monitor, 1 for Ceph admin and 1 for the Ceph client.
  • If the nodes are behind an HTTP proxy, add the entries below to /etc/apt/apt.conf on all nodes:

Acquire::http::proxy "http://<proxy>:<port>/";

Acquire::https::proxy "https://<proxy>:<port>/";

  • Every host must be reachable by its hostname (e.g., you can modify /etc/hosts if necessary).
  • Add the Ceph packages and release key to our repository.

sudo -i    (then set the HTTP proxy environment variables, if required)

wget -q --no-check-certificate -O- 'https://ceph.com/git/?p=ceph.git;a=blob_plain;f=keys/release.asc' | apt-key add -

echo deb http://ceph.com/debian-emperor/ $(lsb_release -sc) main | sudo tee /etc/apt/sources.list.d/ceph.list

1.2 Installation 

  • Install ceph-deploy on “admin node”

sudo apt-get update && sudo apt-get install ceph-deploy

  • Use ceph-deploy to create the SSH key and copy it to the initial monitor nodes automatically when you create the new cluster

ceph-deploy new node1 (monitor node)

  • For other Ceph Nodes perform the following steps:
  1. Create a user on each Ceph Node.

ssh user@ceph-server

sudo useradd -d /home/ceph -m ceph

sudo passwd ceph

  1. Add root privileges for the user on each Ceph Node.

echo "ceph ALL = (root) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/ceph

sudo chmod 0440 /etc/sudoers.d/ceph

  1. Install an SSH server (if necessary) on each Ceph Node:

sudo apt-get install openssh-server    (Debian/Ubuntu)

sudo yum install openssh-server    (CentOS/RHEL)

  1. Configure your ceph-deploy admin node with password-less SSH access to each Ceph Node. When configuring SSH access, do not use sudo or the root user. Generate a key pair and leave the passphrase empty:

ssh-keygen


  1. Copy the key to each Ceph Node.

ssh-copy-id ceph@node1

ssh-copy-id ceph@node2

ssh-copy-id ceph@node3

  1. Modify the ~/.ssh/config file of your ceph-deploy admin node so that it logs in to Ceph Nodes as the user you created (e.g., ceph).
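
As a rough sketch of that last step, the helper below prints one Host stanza per node. The function name print_ssh_config is hypothetical, and the user ceph and hosts node1 to node3 are simply the ones used in this guide; adjust them to your setup.

```shell
#!/usr/bin/env bash
# Print one ~/.ssh/config stanza per Ceph node so that ssh (and therefore
# ceph-deploy) logs in as the given user by default.
# print_ssh_config is a hypothetical helper, not part of Ceph.
print_ssh_config() {
  local user=$1 node
  shift
  for node in "$@"; do
    printf 'Host %s\n    Hostname %s\n    User %s\n' "$node" "$node" "$user"
  done
}

# Preview the stanzas for the three nodes used in this guide.
print_ssh_config ceph node1 node2 node3

# To apply them, append the output to the admin node's SSH config:
#   print_ssh_config ceph node1 node2 node3 >> ~/.ssh/config
```

The script only prints to the terminal; nothing is written until you run the commented append line yourself.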
  • Create a directory on your admin node for maintaining the configuration that ceph-deploy generates for your cluster. Run all admin commands from this directory. Do not use sudo for any ceph-deploy command.

mkdir my-cluster

cd my-cluster

  • Create the base Ceph cluster from the admin node. From the directory you created to hold your configuration file, perform the following steps using ceph-deploy.
  1. Create the cluster.

ceph-deploy new {initial-monitor-node(s)}    (e.g., ceph-deploy new node1)

** Check the output of ceph-deploy with ls and cat in the current directory. You should see a Ceph configuration file, a monitor secret keyring, and a log file for the new cluster. See ceph-deploy new -h for additional details.

** If you have more than one network interface, add the public network setting under the [global] section of your Ceph configuration file. See the Network Configuration Reference for details.

public network = {ip-address}/{netmask}

N.B. To get the network in CIDR format, run “ip route list”.

  1. Install Ceph.

ceph-deploy install --no-adjust-repos node1 node2 node3

  1. Add the initial monitor(s) and gather the keys

ceph-deploy mon create-initial

** Once you complete the process, your local directory should have the following keyrings:

ceph.client.admin.keyring

ceph.bootstrap-osd.keyring

ceph.bootstrap-mds.keyring

  1. Add two OSDs. For fast setup, this quick start uses a directory rather than an entire disk per Ceph OSD Daemon.

ssh node2

sudo mkdir /var/local/osd0


ssh node3

sudo mkdir /var/local/osd1


  1. From admin node, use ceph-deploy to prepare the OSDs.

ceph-deploy osd prepare node2:/var/local/osd0 node3:/var/local/osd1

  1. Activate the OSDs.

ceph-deploy osd activate node2:/var/local/osd0 node3:/var/local/osd1

  1. Use ceph-deploy to copy the configuration file and admin key to your admin node and your Ceph Nodes so that you can use the ceph CLI without having to specify the monitor address and ceph.client.admin.keyring each time you execute a command.

ceph-deploy admin node1 node2 node3 admin-node

  1. Ensure that you have the correct permissions for the ceph.client.admin.keyring.

sudo chmod +r /etc/ceph/ceph.client.admin.keyring

  1. Check your cluster’s health.

ceph health

** Your cluster should report an active+clean state when it has finished peering.

2. Create Ceph Block device using QEMU

  • QEMU/KVM can interact with Ceph Block Devices via librbd.
  • Install virtualization stack for Ceph Block storage on client node

sudo apt-get install qemu

sudo apt-get update && sudo apt-get install libvirt-bin

  • To configure Ceph for use with libvirt, perform the following steps:
  1. Create a pool (or use the default). The following example uses the pool name libvirt-pool with 128 placement groups.

ceph osd pool create libvirt-pool 128 128

  1. Verify the pool exists.

ceph osd lspools

  1. Verify that client.admin appears in the output of the “ceph auth list” command.

** libvirt will access Ceph using the ID libvirt, not the Ceph name client.libvirt. See Cephx Commandline for detailed explanation of the difference between ID and name.

  1. Use QEMU to create a “block device image” in your RBD pool.

qemu-img create -f rbd rbd:libvirt-pool/new-libvirt-image 2G

  1. Verify the image exists.

rbd -p libvirt-pool ls

** You can also use rbd create to create an image, but we recommend ensuring that QEMU is working properly.

** This Block device image could directly be used in VMs (http://ceph.com/docs/next/rbd/libvirt/)


  • To configure a Block device image
  1. On the client node, load the rbd client module.

sudo modprobe rbd

  1. On the client node, map the image to a block device.

sudo rbd map new-libvirt-image --pool libvirt-pool --name client.admin [-m {monitor-IP}] [-k /path/to/ceph.client.admin.keyring]

** The image was created in libvirt-pool above, so it must be mapped from that pool. The device appears under /dev/rbd/{pool-name}/{image-name}.

  1. Use the block device by creating a file system on the client node.

sudo mkfs.ext4 -m0 /dev/rbd/libvirt-pool/new-libvirt-image

** This may take a few moments.

  1. Mount the file system on the client node.

sudo mkdir /mnt/ceph-block-device

sudo mount /dev/rbd/libvirt-pool/new-libvirt-image /mnt/ceph-block-device

cd /mnt/ceph-block-device

3. Use Ceph Block device from Cinder





  • When Glance and Cinder are both using Ceph block devices, the image is a copy-on-write clone, so volume creation is very fast.

4. Extend the cluster





6. CRUSH (Controlled Replication Under Scalable Hashing)



  • By using the CRUSH algorithm to store and retrieve data, we can avoid a single point of failure and scale easily.
  • The data placement strategy in Ceph has two parts: placement groups and the CRUSH map.
  • Each object must belong to some placement group.
  • Ceph Clients (RADOS / librados) and Ceph OSD Daemons both use the CRUSH algorithm to efficiently compute information about data containers on demand, instead of having to depend on a central broker.
  • Ceph OSD Daemons create object replicas on other Ceph Nodes to ensure data safety and high availability. This replication is synchronous, such that a new or updated object guarantees its availability before an application is notified that the write has completed.
  • Ceph OSD Daemons know the cluster topology through the cluster map. Cluster map = CRUSH map + monitor map + OSD map + PG map + MDS map.

7. Architecture:

7.1 Logical placement

  • Pools are logical partitions for storing objects. Each pool has a certain number of placement groups (PGs). Placement groups are just collections of mappings to OSDs. Each PG has a primary OSD and a number of secondary ones, based on the replication level you set when you create the pool. When an object is written to the cluster, CRUSH determines which PG the data should be sent to. The data first hits the primary OSD and is then replicated out to the other OSDs in the same placement group. Ceph Clients retrieve the latest cluster map from a Ceph Monitor and write objects to pools.
  • Currently reads always come from the primary OSD in the placement group rather than a secondary even if the secondary is closer to the client. In many cases spreading reads out over all of the OSDs in the cluster is better than trying to optimize reads to only hit local OSDs.


  1. This could become a bottleneck if many clients want to read the same file: all requests land on the same OSD while the other replica OSDs sit idle.
  • The only input required by the client is the object ID and the pool.

7.2 Physical placement

  • CRUSH algorithm maps each object to a placement group and then maps each placement group to one or more Ceph OSD Daemons.
  • Ceph client uses the CRUSH algorithm to compute where to store an object, maps the object to a pool and placement group, then looks at the CRUSH map to identify the primary OSD for the placement group.

o    Total PGs = (OSDs * 100) / Replicas    {rounded up to the nearest power of 2}

o    The client inputs the pool name and the object ID (e.g., pool = “liverpool” and object-id = “john”).

o    CRUSH takes the object ID and hashes it.

o    CRUSH calculates the hash modulo the number of PGs (e.g., 0x58) to get a PG ID.

o    CRUSH gets the pool ID given the pool name (e.g., “liverpool” = 4).

o    CRUSH prepends the pool ID to the PG ID (e.g., 4.0x58).

  • With a copy of the cluster map and the CRUSH algorithm, the client can compute exactly which OSD to use when reading or writing a particular object.
  • Replication is always executed at the PG level: All objects of a placement group are replicated between different OSDs in the RADOS cluster.
  • An object ID is unique across the entire cluster, not just the local file system.
  • The client writes the object to the identified placement group in the primary OSD. Then, the primary OSD with its own copy of the CRUSH map identifies the secondary and tertiary OSDs for replication purposes, and replicates the object to the appropriate placement groups in the secondary and tertiary OSDs (as many OSDs as additional replicas), and responds to the client once it has confirmed the object was stored successfully.
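
The PG count formula and the PG ID steps listed in this section can be tried out with a small sketch. The function names total_pgs and pg_id are hypothetical, and cksum's CRC-32 is only a stand-in for the hash Ceph really uses, so the PG numbers printed here will not match a live cluster. The PG ID is printed as pool-id.pg-in-hex, the form the ceph CLI shows.

```shell
#!/usr/bin/env bash
# Total PGs = (OSDs * 100) / replicas, rounded up to the next power of 2.
total_pgs() {
  local osds=$1 replicas=$2
  local raw=$(( (osds * 100 + replicas - 1) / replicas ))  # ceiling division
  local pgs=1
  while [ "$pgs" -lt "$raw" ]; do
    pgs=$(( pgs * 2 ))
  done
  echo "$pgs"
}

# PG ID = <pool id>.<hash(object id) mod pg_num>, PG part printed in hex.
# ASSUMPTION: cksum (CRC-32) replaces Ceph's real object hash for illustration.
pg_id() {
  local pool_id=$1 object_id=$2 pg_num=$3
  local hash
  hash=$(printf '%s' "$object_id" | cksum | cut -d ' ' -f 1)
  printf '%d.%x\n' "$pool_id" $(( hash % pg_num ))
}

# 2 OSDs with 2 replicas: 200 / 2 = 100, rounded up to 128.
total_pgs 2 2

# Object "john" in pool ID 4 ("liverpool" in the example) with 128 PGs.
pg_id 4 john 128
```

The first call reproduces the 128 placement groups used for libvirt-pool earlier in this guide. The same object ID always maps to the same PG, which is why clients need no lookup table.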

7. Common cluster commands/library/Tips

# ceph -s

# ceph osd dump

# ceph mon dump

# ceph osd pool get {pool-name} {field}  (Get properties of a pool)

# ceph pg dump -o pg.txt  (Get all Placement Groups details)


  • Get crushmap file for cluster

# ceph osd getcrushmap -o {compiled-crushmap-filename} 

  • Decompile the crushmap file in plain format

# crushtool -d {compiled-crushmap-filename} -o {decompiled-crushmap-filename} 

  • To see mapped block devices

# sudo rbd showmapped

# rbd list

  • To see block device image information

qemu-img info -f rbd rbd:{pool-name}/{image-name}

  • To see all the ceph running services

# sudo initctl list | grep ceph

  • To start all the ceph services

# sudo start ceph-all

  • To stop all the ceph services

# sudo stop ceph-all

  • Pools: A pool differs from CRUSH’s location-based buckets in that a pool doesn’t have a single physical location, and a pool provides you with some additional functionality, including replicas, placement groups, CRUSH rules, snapshots and setting ownership.
  • Copies of an object are always placed on different OSDs.

8. Troubleshooting Logs/Configurations/Tips


  • Understand data placement



  • To get the CIDR format of the network

ip route list
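
If only a dotted netmask is at hand, the prefix length for the “public network” setting can be derived by counting the set bits. This is a minimal sketch: mask_to_prefix is a hypothetical helper, and 192.168.1.0 with mask 255.255.255.0 is an example network, not one taken from this cluster.

```shell
#!/usr/bin/env bash
# Convert a dotted netmask (e.g. 255.255.255.0) to a CIDR prefix length
# by counting the 1 bits in each octet.
# Note: the mask is assumed to be valid (contiguous); it is not checked.
mask_to_prefix() {
  local mask=$1 prefix=0 octet
  local IFS=.
  for octet in $mask; do
    while [ "$octet" -gt 0 ]; do
      prefix=$(( prefix + (octet & 1) ))
      octet=$(( octet >> 1 ))
    done
  done
  echo "$prefix"
}

# Build the line for the [global] section of the Ceph configuration file.
printf 'public network = %s/%s\n' 192.168.1.0 "$(mask_to_prefix 255.255.255.0)"
```

For the example values this prints “public network = 192.168.1.0/24”.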

  • Common Logs


  • Common Configuration files


  • Can we disable journal?

Ans: With btrfs, yes. Otherwise no. The journal is needed for consistency of the file system; Ceph relies on write-ahead journaling. It can’t be turned off, though we can use an SSD or ramdisk for the journal. Loss of the journal will kill any OSDs using that journal.



  • So should I have 1 SSD per storage node for journaling?
    Ans: Not necessarily. It depends on a number of factors. In some cases that may be sufficient, and in others the SSD can become a bottleneck and rapidly wear out. Different applications will have a different ideal ratio of SSD journals to spinning disks, taking into account the rate of write IO and bandwidth requirements for the node.
  • What happens in the case of a big file (for example, 100MB) with multiple chunks? Is Ceph smart enough to read multiple chunks from multiple servers simultaneously, or will the whole file be served by just one OSD?

Ans: RADOS is the underlying storage cluster, but the access methods (block, object, and file) stripe their data across many RADOS objects, which CRUSH very effectively distributes across all of the servers.  A 100MB read or write turns into dozens of parallel operations to servers all over the cluster.

The problem with reading from random/multiple replicas by default is cache efficiency.  If every reader picks a random replica, then there are effectively N locations that may have an object cached in RAM (instead of on disk), and the caches for each OSD will be about 1/Nth as effective.  The only time it makes sense to read from replicas is when you are CPU or network limited; the rest of the time it is better to read from the primary’s cache than a replica’s disk.
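
To put a number on “dozens of parallel operations”, here is a small sketch of the striping arithmetic. It assumes a 4 MB object size, which is the usual default for RBD images; object_count is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Number of RADOS objects needed to hold a file of the given size:
# ceil(file size / object size), both in MB.
object_count() {
  local file_mb=$1 object_mb=$2
  echo $(( (file_mb + object_mb - 1) / object_mb ))
}

# A 100 MB file striped into 4 MB objects (assumed default object size).
object_count 100 4
```

This prints 25, so a 100MB read or write can fan out to up to 25 objects, each placed independently by CRUSH.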


  • How are writes performed?

OSDs use a write-ahead mode for local operations: a write hits the journal first, and from there it is copied into the backing filestore.

This, in turn, leads to a common design principle for Ceph clusters that are both fast and cost-effective:

  1. Put your filestores on slow, cheap drives (such as SATA spinners),
  2. put your journals on fast drives (SSDs, DDR drives, Fusion-IO cards, whatever you can afford).

Another common design principle is that you create one OSD per spinning disk that you have in the system. Many contemporary systems come with only two SSD slots, and then as many spinners as you want. That is not a problem for journal capacity — a single OSD’s journal is usually no larger than about 6 GB, so even for a 16-spinner system (approx. 96GB journal space) appropriate SSDs are available at reasonable expense.
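
The journal sizing estimate above is simple arithmetic. The sketch below repeats it and also splits the journals over the two SSD slots mentioned in the text. The helper names are hypothetical, and 6 GB per journal is the rough upper bound quoted above, not a measured value.

```shell
#!/usr/bin/env bash
# Total journal space: one journal per OSD (one OSD per spinning disk).
journal_space_gb() {
  local osds=$1 journal_gb=$2
  echo $(( osds * journal_gb ))
}

# Journals each SSD must host when they are spread evenly (rounded up).
journals_per_ssd() {
  local osds=$1 ssds=$2
  echo $(( (osds + ssds - 1) / ssds ))
}

journal_space_gb 16 6    # 16 spinners at about 6 GB each
journals_per_ssd 16 2    # the same 16 journals over 2 SSD slots
```

For the 16-spinner example this gives 96 GB in total and 8 journals (about 48 GB) per SSD. Keep in mind that losing one of those SSDs takes down every OSD whose journal it holds.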


  • What happens if an OSD fails – TBD
  • What happens if a MON fails – TBD
  • What happens if a journal SSD/disk fails – TBD