Ceph¶
Overview¶
A monitor knows the status of the network and keeps it in its monitor map
You can have multiple monitors but should have a small, odd number
MDS are the metadata servers (stores hirarchy of ceph fs + owner, timestamps, permissions etc)
MDS is only needed for ceph fs
An OSD is a storage node that contains and servers the real data, replicates and rebalances it
The OSDs form a p2p network, recognize if one node is out and automatically restore the lost data to other nodes
The client computes the localization of storage by using the CRUSH algorythm (no need to ask a central server)
RADOS is the object storage interface
RBD (RADOS Block Device) creates a block device as RADOS object
Placement groups (pgs) combine objects into group. Replication is done on pgs or pools not files or dirs. You should have 100 pgs / OSD
Pool is a seperate storage container that contains its own placement groups and objects (think of mountpoint)
Status¶
degraded == not enough replicas
stuck inactive - The placement group has not been active for too long (i.e., it hasn’t been able to service read/write requests).
stuck unclean - The placement group has not been clean for too long (i.e., it hasn’t been able to completely recover from a previous failure).
stuck stale - The placement group status has not been updated by a ceph-osd, indicating that all nodes storing this placement group may be down.
Manual installation¶
Setup a monitor
uuidgen
Edit
/etc/ceph/ceph.conf
fsid = <uuid>
mon initial members = <short_hostname>
mon host = <ip_address>
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd pool default size = 2
Generate keys for the monitor and admin user and add the monitor to the monitor map
ceph-authtool --create-keyring /tmp/ceph.mon.keyring --gen-key -n mon. --cap mon 'allow *'
ceph-authtool --create-keyring /etc/ceph/ceph.client.admin.keyring --gen-key -n client.admin --set-uid=0 --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow'
ceph-authtool /tmp/ceph.mon.keyring --import-keyring /etc/ceph/ceph.client.admin.keyring
monmaptool --create --add <short_hostname> <ip_address> --fsid <uuid> /tmp/monmap
Create the monitor cache filesystem, start the monitor and see if it created the default pools and is running
mkdir -p /var/lib/ceph/mon/ceph-<short_hostname>
ceph-mon --mkfs -i <short_hostname> --monmap /tmp/monmap --keyring /tmp/ceph.mon.keyring
ceph-mon -i <short_hostname>
ceph osd lspools
ceph -s
Setup an OSD (note the command ceph osd create returns the osd id to use!)
uuidgen
ceph osd create <uuid>
mkdir -p /var/lib/ceph/osd/ceph-<osd_id>
fdisk /dev/sda
ceph-disk prepare /dev/sda1
mount /dev/sda /var/lib/ceph/osd/ceph-<osd_id>/
ceph-osd -i <osd_id> --mkfs --mkkey
ceph auth add osd.<osd_id> osd 'allow *' mon 'allow rwx' -i /var/lib/ceph/osd/ceph-<osd_id>/keyring
ceph-disk activate /dev/sda1 --activate-key /var/lib/ceph/osd/ceph-<osd_id>/keyring
ceph-osd -i <osd_id>
ceph status
Add another OSD for replication
Setup a metadata server (only needed when using CephFS)
mkdir -p /var/lib/ceph/mds/mds.<mds_id>
ceph auth get-or-create mds.<mds_id> mds 'allow ' osd 'allow *' mon 'allow rwx' > /var/lib/ceph/mds/mds.<mds_id>/mds.<mds_id>.keyring
ceph-mds -i <mds_id>
ceph status
Adding OSDs the easy way¶
With ceph-deploy
ceph-deploy osd prepare node1:/path
ceph-deploy osd activate node1:/path
Manually (ssh to new osd node)
ceph-disk prepare --cluster ceph --cluster-uuid <fsid> --fs-type xfs /dev/sda
ceph-disk-prepare --fs-type xfs /dev/sda
Complete setup of new node¶
On new node
useradd -d /home/ceph -m ceph
passwd ceph
echo "ceph ALL = (root) NOPASSWD:ALL" | tee /etc/sudoers.d/ceph
mkdir /local/osd<id>
On ceph-deploy node
su - ceph
ssh-copyid ceph@<hostname_of_new_node>
ceph-deploy install <hostname_of_new_node>
ceph-deploy osd prepare <hostname_of_new_node>:/local/osd<id>
ceph-deploy osd activate <hostname_of_new_node>:/local/osd<id>
ceph-deploy mon create <hostname_of_new_node>
Configure replication¶
Edit ceph.conf
osd pool default size = 2
Access storage¶
CEPH FUSE (filesystem access comparable to NFS)
ceph-fuse -m <monitor>:6789 /mountpoint
FUSE via fstab
id=admin /mnt fuse.ceph defaults 0 0
CEPH FS kernel client
RADOS API for object storage
rados put test-object /path/to/some_file --pool=data
rados -p data ls
ceph osd map data test-object
rados rm test-object --pool=data
RADOS FUSE
Virtual Block device via kernel driver (needs kernel >= 3.4.20)
rbd create rbd/myrbd --size=1024
echo "rbd/myrbd" >> /etc/ceph/rbdmap
service rbdmap reload
rbd showmapped
iSCSI interface under development
Code your own client with librados
Check size¶
Of the filesystem
ceph df
Of a file
rbd -p <pool> info <file>
File snapshots¶
rbd -p <pool> snap create <file>
rbd -p <pool> snap ls <file>
rbd -p <pool> snap rollback <file>
rbd -p <pool> snap rm <file>
Check health¶
ceph health detail
get continuos information
ceph -w
Check osd status¶
ceph osd stat
ceph osd tree
ceph osd dump
Check server status¶
/etc/init.d/ceph status
Pools¶
Create
ceph osd lspools
ceph osd pool create <pool_name> <num_pgs>
Change number of pgs
ceph osd pool get <name> pg_num
ceph osd pool set <name> pg_num <nr>
Create a snapshot
ceph osd pool mksnap <name>
Find out nr of replicas per pool
ceph osd dump | grep <pool>
Change nr of replicas per pool
ceph osd pool set <name> size 3
Placement groups¶
Overview
ceph pg dump
ceph pg stat
What does the status XXX mean?
inactive - The placement group has not been active for too long (i.e., it hasn’t been able to service read/write requests).
unclean - The placement group has not been clean for too long (i.e., it hasn’t been able to completely recover from a previous failure).
stale - The placement group status has not been updated by a ceph-osd, indicating that all nodes storing this placement group may be down.
Why is a pg in such a state?
ceph pg <pg_num> query
Where to find an object / file?
ceph osd map <pg_name> <object-name>
“Fsck” a placement group
ceph pg scrub <pg_num>
Editing the CRUSH map¶
The CRUSH map defines
buckets
(think storage groups) to map placement groups tp OSDs across a failure domain (e.g. copy 1 is in rack 1 and copy 2 in rack 2 to avoid power outage of one rack to destroy all copies)A higher weight will get more load than a lower weight
ceph osd getcrushmap -o crushmap
crushtool -d crushmap -o mymap
emacs mymap
crushtool -c mymap -o newmap
ceph osd setcrushmap -i newmap
Maintanance¶
To stop CRUSH from automatically balance load of the cluster
ceph osd set noout
Troubleshooting general¶
Remove everything (not recommended for production use!)
ceph-deploy purge host1 host2
ceph-deploy purgedata host1 host2
ceph-deploy gatherkeys
Troubleshooting sudo¶
Make sure that visiblepw is disabled
Defaults !visiblepw
Is the /etc/sudoers.d directory really included?
Troubleshooting network¶
The name of a osd / mon must be the official name of the host no aliases!
Make sure you have a
public network = 1.2.3.4/24
in your ceph.conf
Repair monitor¶
the id can be found by looking into
/var/lib/ceph/mon/
run monitor in debug mode
ceph-mon -i <myid> -d
Reformat monitor data store
rm -rf /var/lib/ceph/mon/ceph-<myid>
ceph-mon --mkfs -i <myid> --keyring /etc/ceph/ceph.client.admin.keyring
Cluster is full¶
The easiest way is of course to add new OSDs, but if thats not possible
Try to reweight automatically
ceph osd reweight-by-utilization
Reweight manually free OSDs
ceph osd tree
ceph osd crush reweight osd.<nr> <new_weight>
Reconfigure
full_ratio
value and delete objects (DONT FORGET TO CHANGE IT BACK!)
ceph pg set_full_ratio 0.99
Cannot delete a file¶
Check that the cluster is not full otherwise see above
ceph health detail
Purge all snapshots
rbd -p <pool> snap purge <file>
Check that the file is not locked and maybe remove the lock
rbd -p <pool> lock list <file>
rbd -p <pool> lock remove <file> <id> <locker>
I can still not remove the file! (Thats the not so nice and maybe destructive way)
rados -p <pool> rm rbd_id.<file_id>
rbd -p <pool> rm <file>