Ceph

Overview

  • http://www.youtube.com/watch?v=OyH1C0C4HzM

  • A monitor knows the state of the cluster and keeps it in its cluster maps (monitor, OSD, PG and CRUSH map)

  • You can have multiple monitors but should have a small, odd number (e.g. 3 or 5) so a majority quorum is always possible

  • MDS are the metadata servers (they store the hierarchy of the Ceph FS plus owner, timestamps, permissions etc.)

  • MDS is only needed for ceph fs

  • An OSD is a storage node that contains and serves the real data, replicates and rebalances it

  • The OSDs form a p2p network, detect when a node is down and automatically restore the lost replicas on other nodes

  • The client itself computes where data is located using the CRUSH algorithm (no need to ask a central server)

  • RADOS is the object storage interface

  • RBD (RADOS Block Device) provides a block device stored as RADOS objects

  • Placement groups (pgs) combine objects into groups. Replication is done on pgs and pools, not on files or dirs. A common rule of thumb is about 100 pgs per OSD (counting replicas)

  • A pool is a separate storage container with its own placement groups and objects (think of it like a mountpoint)
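
  • A quick illustration of pools, pgs and CRUSH-based placement (assumes a running cluster; "mypool" and "test-object" are made-up names)

ceph osd pool create mypool 128
rados put test-object /etc/hosts --pool=mypool
ceph osd map mypool test-object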

Status

  • degraded == not enough replicas

  • stuck inactive - The placement group has not been active for too long (i.e., it hasn’t been able to service read/write requests).

  • stuck unclean - The placement group has not been clean for too long (i.e., it hasn’t been able to completely recover from a previous failure).

  • stuck stale - The placement group status has not been updated by a ceph-osd, indicating that all nodes storing this placement group may be down.
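
  • List the pgs that are stuck in one of these states

ceph pg dump_stuck inactive
ceph pg dump_stuck unclean
ceph pg dump_stuck stale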

Manual installation

  • Set up a monitor

uuidgen    # the output becomes the fsid of the cluster
  • Edit /etc/ceph/ceph.conf

fsid = <uuid>
mon initial members = <short_hostname>
mon host = <ip_address>
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd pool default size = 2
  • Generate keys for the monitor and admin user and add the monitor to the monitor map

ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *'
ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring --gen-key -n client.admin --set-uid=0 --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow'
ceph-authtool /tmp/ceph.mon.keyring --import-keyring /etc/ceph/ceph.client.admin.keyring
monmaptool --create --add <short_hostname> <ip_address> --fsid <uuid> /tmp/monmap
  • Create the monitor's data directory, initialize it, start the monitor and check that it created the default pools and is running

mkdir -p /var/lib/ceph/mon/ceph-<short_hostname>
ceph-mon --mkfs -i <short_hostname> --monmap /tmp/monmap --keyring /tmp/ceph.mon.keyring
ceph-mon -i <short_hostname>
ceph osd lspools
ceph -s
  • Set up an OSD (note: the command ceph osd create returns the osd id to use!)

uuidgen
ceph osd create <uuid>
mkdir -p /var/lib/ceph/osd/ceph-<osd_id>
fdisk /dev/sda
ceph-disk prepare /dev/sda1
mount /dev/sda1 /var/lib/ceph/osd/ceph-<osd_id>/
ceph-osd -i <osd_id> --mkfs --mkkey
ceph auth add osd.<osd_id> osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-<osd_id>/keyring
ceph-disk activate /dev/sda1 --activate-key /var/lib/ceph/osd/ceph-<osd_id>/keyring
ceph-osd -i <osd_id>
ceph status
  • Add another OSD for replication (repeat the steps above with a fresh uuid; ceph osd create returns the next free osd id)

  • Set up a metadata server (only needed when using CephFS)

mkdir -p /var/lib/ceph/mds/mds.<mds_id>
ceph auth get-or-create mds.<mds_id> mds 'allow' osd 'allow *' mon 'allow rwx' > /var/lib/ceph/mds/mds.<mds_id>/mds.<mds_id>.keyring
ceph-mds -i <mds_id>
ceph status
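  • Depending on the release you may also have to create the filesystem explicitly (the pool names here are just a common convention)

ceph osd pool create cephfs_data <num_pgs>
ceph osd pool create cephfs_metadata <num_pgs>
ceph fs new cephfs cephfs_metadata cephfs_data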

Adding OSDs the easy way

  • With ceph-deploy

ceph-deploy osd prepare node1:/path
ceph-deploy osd activate node1:/path
  • Manually (ssh to the new osd node)

ceph-disk prepare --cluster ceph --cluster-uuid <fsid> --fs-type xfs /dev/sda
ceph-disk-prepare --fs-type xfs /dev/sda    # older releases ship the standalone ceph-disk-prepare instead
ceph-disk activate /dev/sda1

Complete setup of new node

  • On new node

useradd -d /home/ceph -m ceph
passwd ceph
echo "ceph ALL = (root) NOPASSWD:ALL" | tee /etc/sudoers.d/ceph
mkdir /local/osd<id>
  • On ceph-deploy node

su - ceph
ssh-copy-id ceph@<hostname_of_new_node>
ceph-deploy install <hostname_of_new_node>
ceph-deploy osd prepare <hostname_of_new_node>:/local/osd<id>
ceph-deploy osd activate <hostname_of_new_node>:/local/osd<id>
ceph-deploy mon create <hostname_of_new_node>

Configure replication

  • Edit ceph.conf

osd pool default size = 2
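  • This default only applies to newly created pools; for an existing pool set the size directly

ceph osd pool set <pool_name> size 2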

Access storage

  • CEPH FUSE (filesystem access comparable to NFS)

ceph-fuse -m <monitor>:6789 /mountpoint
  • FUSE via fstab

id=admin                /mnt  fuse.ceph defaults 0 0
  • CEPH FS kernel client
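
  • A minimal mount sketch (monitor address and the path of the secret file are placeholders)

mount -t ceph <monitor>:6789:/ /mountpoint -o name=admin,secretfile=/etc/ceph/admin.secret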

  • RADOS API for object storage

rados put test-object /path/to/some_file --pool=data
rados -p data ls
ceph osd map data test-object
rados rm test-object --pool=data
  • RADOS FUSE

  • Virtual Block device via kernel driver (needs kernel >= 3.4.20)

rbd create rbd/myrbd --size=1024        # size is given in MB
echo "rbd/myrbd" >> /etc/ceph/rbdmap    # let the rbdmap service map the device
service rbdmap reload
rbd showmapped
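  • Once mapped you can use it like any other block device (the device path assumes the default udev naming)

mkfs.ext4 /dev/rbd/rbd/myrbd
mount /dev/rbd/rbd/myrbd /mnt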
  • iSCSI interface under development

  • Code your own client with librados

Check size

  • Of the filesystem

ceph df
  • Of a file

rbd -p <pool> info <file>

File snapshots

rbd -p <pool> snap create --snap <snap_name> <file>
rbd -p <pool> snap ls <file>
rbd -p <pool> snap rollback --snap <snap_name> <file>
rbd -p <pool> snap rm --snap <snap_name> <file>

Check health

ceph health detail
  • Get continuous information

ceph -w

Check osd status

ceph osd stat
ceph osd tree
ceph osd dump

Check server status

/etc/init.d/ceph status

Pools

  • Create

ceph osd lspools
ceph osd pool create <pool_name> <num_pgs>
  • Change number of pgs

ceph osd pool get <name> pg_num
ceph osd pool set <name> pg_num <nr>
ceph osd pool set <name> pgp_num <nr>    # pgp_num must follow pg_num or rebalancing will not start
  • Create a snapshot

ceph osd pool mksnap <name> <snap_name>
  • Find out the number of replicas of a pool

ceph osd dump | grep <pool>
  • Change the number of replicas of a pool

ceph osd pool set <name> size 3

Placement groups

  • Overview

ceph pg dump
ceph pg stat
  • What does the status XXX mean?

inactive - The placement group has not been active for too long (i.e., it hasn’t been able to service read/write requests).
unclean - The placement group has not been clean for too long (i.e., it hasn’t been able to completely recover from a previous failure).
stale - The placement group status has not been updated by a ceph-osd, indicating that all nodes storing this placement group may be down.
  • Why is a pg in such a state?

ceph pg <pg_id> query
  • Where to find an object / file?

ceph osd map <pool_name> <object_name>
  • “Fsck” a placement group

ceph pg scrub <pg_id>

Editing the CRUSH map

  • The CRUSH map defines buckets (think storage groups) to map placement groups to OSDs across failure domains (e.g. copy 1 is in rack 1 and copy 2 in rack 2, so that a power outage in one rack cannot destroy all copies)

  • An OSD with a higher weight will get more data than one with a lower weight

ceph osd getcrushmap -o crushmap
crushtool -d crushmap -o mymap
emacs mymap
crushtool -c mymap -o newmap
ceph osd setcrushmap -i newmap
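  • A decompiled map lists devices, buckets and rules; a bucket entry might look like this (names, ids and weights are illustrative)

host node1 {
        id -2
        alg straw
        hash 0  # rjenkins1
        item osd.0 weight 1.000
}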

Maintenance

  • To stop the cluster from marking OSDs out (and thus rebalancing) during maintenance

ceph osd set noout
ceph osd unset noout    # turn automatic handling back on after maintenance

Troubleshooting general

  • Remove everything (not recommended for production use!)

ceph-deploy purge host1 host2
ceph-deploy purgedata host1 host2
ceph-deploy forgetkeys

Troubleshooting sudo

  • Make sure that visiblepw is disabled

Defaults   !visiblepw
  • Is the /etc/sudoers.d directory really included?
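
grep includedir /etc/sudoers    # the stock file on most distributions contains: #includedir /etc/sudoers.d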

Troubleshooting network

  • The name of an osd / mon must be the official hostname of the machine, no aliases!

  • Make sure you have a public network defined in your ceph.conf

public network = 1.2.3.0/24

Repair monitor

  • The id can be found by looking into /var/lib/ceph/mon/

  • Run the monitor in debug mode

ceph-mon -i <myid> -d
  • Reformat monitor data store

rm -rf /var/lib/ceph/mon/ceph-<myid>    # destroys this monitor's local data store!
ceph-mon --mkfs -i <myid> --keyring /etc/ceph/ceph.client.admin.keyring

Cluster is full

  • The easiest way is of course to add new OSDs, but if that's not possible

  • Try to reweight automatically

ceph osd reweight-by-utilization
  • Manually give more weight to OSDs that still have free space

ceph osd tree
ceph osd crush reweight osd.<nr> <new_weight>
  • Reconfigure the full_ratio value and delete objects (DON'T FORGET TO CHANGE IT BACK!)

ceph pg set_full_ratio 0.99

Cannot delete a file

  • Check that the cluster is not full, otherwise see above

ceph health detail
  • Purge all snapshots

rbd -p <pool> snap purge <file>
  • Check that the file is not locked and maybe remove the lock

rbd -p <pool> lock list <file>
rbd -p <pool> lock remove <file> <id> <locker>
  • I still cannot remove the file! (That's the not-so-nice and possibly destructive way)

rados -p <pool> rm rbd_id.<file_id>
rbd -p <pool> rm <file>