Most people use the ceph-disk utility to prepare a disk for a Ceph OSD. But sometimes ceph-disk starts behaving strangely, or decides you're doing something "unacceptable". For the moments when you're stuck like this, here is a short manual on how to create an OSD without ceph-disk.
The most common reason to deviate from the default Ceph behavior is the "journal on SSD, data on HDD" setup that is widely used to improve the performance of HDD-backed OSDs. With bluestore this setup becomes "data on HDD, DB and WAL on SSD". In both cases you create several partitions on one SSD, which means the partition table of the in-use SSD becomes locked: you can't add, remove, or change partitions (i.e. you can't create partitions for a new OSD device without a reboot). The workaround is to use LVM, but then you can run into an access error.
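To make the failure mode concrete, here is a rough sketch of what usually happens (device name and size are hypothetical, the error text is paraphrased):

# /dev/sdb is an SSD that already hosts in-use journal/DB partitions
sgdisk --new=6:0:+8G /dev/sdb   # the new entry is written to the on-disk GPT just fine
partprobe /dev/sdb              # but the kernel may refuse to re-read the table while
                                # sibling partitions are open, complaining it was
                                # "unable to inform the kernel of the change" -- the
                                # new partition stays invisible until reboot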
In this post I'll try to show how these problems can be avoided.
P.S.: yes, we completely stopped using ceph-disk in our projects because of its dictatorial nature.
Here is our training partitioning for a Ceph disk. We have FIVE partitions:
[root@control0 ~]# gdisk -l /dev/sdc
GPT fdisk (gdisk) version 0.8.6

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/sdc: 50331648 sectors, 24.0 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): B7AB6C94-607F-4BAE-A62F-9FD00AE6CAD1
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 50331614
Partitions will be aligned on 2048-sector boundaries
Total free space is 2014 sectors (1007.0 KiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048          264191    128.0 MiB  8300  Linux filesystem
   2          264192        16852991    7.9 GiB    8300  Linux filesystem
   3        16852992        17115135    128.0 MiB  8300  Linux filesystem
   4        17115136        33703935    7.9 GiB    8300  Linux filesystem
   5        33703936        50331614    7.9 GiB    8E00  Linux LVM
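For reference, a layout like this can be produced with sgdisk; a minimal sketch, assuming an empty /dev/sdc (sizes are rounded here, the listing above uses ~7.9 GiB partitions to fit the 24 GiB disk):

sgdisk --new=1:0:+128M /dev/sdc                 # future filestore journal
sgdisk --new=2:0:+8G   /dev/sdc                 # future filestore data
sgdisk --new=3:0:+128M /dev/sdc                 # future bluestore metadata
sgdisk --new=4:0:+8G   /dev/sdc                 # future bluestore block
sgdisk --new=5:0:0 --typecode=5:8E00 /dev/sdc   # the rest goes to LVM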
Let's zero the future Ceph partitions. This step is strongly recommended, because otherwise you can get a "device busy" error on them later. Since this sample is a VirtualBox virtual machine, we use dd. On a real installation, you can use the blkdiscard utility for SSDs and dd for HDDs, as we do next.
[root@control0 ~]# for n in 1 2 3 4 ; do dd if=/dev/zero of=/dev/sdc$n bs=1024k count=10 ; done
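On a real SSD, the equivalent with blkdiscard (assuming the drive supports discard/TRIM) would look like:

# discard whole partitions instead of writing zeroes -- near-instant on most SSDs
for n in 1 2 3 4 ; do blkdiscard /dev/sdc$n ; done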
The next step is setting valid type GUIDs on the partitions (did I mention we MUST use GPT? I didn't, but I guess you knew).
[root@control0 ~]# sgdisk --change-name=1:"ceph journal" --typecode=1:45B0969E-9B03-4F30-B4C6-B4B80CEFF106 /dev/sdc
[root@control0 ~]# sgdisk --change-name=2:"ceph data" --typecode=2:4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D /dev/sdc
[root@control0 ~]# sgdisk --change-name=3:"ceph data" --typecode=3:4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D /dev/sdc
[root@control0 ~]# sgdisk --change-name=4:"ceph block" --typecode=4:CAFECAFE-9B03-4F30-B4C6-B4B80CEFF106 /dev/sdc
Handling LVM is trivial; the only thing worth noting is that we have two LVs (meta and blocks) in a single VG named "kosd":
[root@control0 ~]# lvscan | grep kosd
  ACTIVE            '/dev/kosd/meta' [100,00 MiB] inherit
  ACTIVE            '/dev/kosd/blocks' [<7,83 GiB] inherit
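If you are reproducing this setup from scratch, the VG can be built with something like this (a sketch; /dev/sdc5 is the LVM partition from the table above):

pvcreate /dev/sdc5                    # turn the partition into an LVM physical volume
vgcreate kosd /dev/sdc5               # dedicated volume group for the OSD
lvcreate -n meta   -L 100M     kosd   # small LV for the OSD metadata filesystem
lvcreate -n blocks -l 100%FREE kosd   # everything else becomes bluestore storage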
When creating a Ceph OSD by hand, we perform several basic tasks, shown step by step below.
In this sample I use a patched Ceph version that allows passing the cluster name into systemd units. With "vanilla" Ceph you can define the cluster name in /etc/sysconfig/ceph if your cluster name differs from the default "ceph".
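A minimal sketch of that file, assuming the cluster name "kitana" used throughout this post:

# /etc/sysconfig/ceph -- sourced by the ceph systemd units as an EnvironmentFile
CLUSTER=kitana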
Create filesystem
[root@control0 ~]# mkfs.xfs /dev/sdc2
meta-data=/dev/sdc2              isize=512    agcount=4, agsize=518400 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0, sparse=0
data     =                       bsize=4096   blocks=2073600, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
Create OSD in a cluster
[root@control0 ~]# ceph osd create --cluster=kitana
0
Mount filesystem
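The OSD directory must exist before mounting; a trivial preparatory step that ceph-disk normally does for you (the same applies to kitana-1 and kitana-2 later):

mkdir -p /var/lib/ceph/osd/kitana-0   # mount point for the OSD data filesystem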
[root@control0 ~]# mount -t xfs /dev/sdc2 /var/lib/ceph/osd/kitana-0
Find the future journal partition (in this sample we use sdc1 as the journal for the sdc2 filestore)
[root@control0 ~]# ls -l /dev/disk/by-partuuid | grep sdc1
lrwxrwxrwx 1 root root 10 Oct 11 17:36 5f3a4bc4-c60a-43c4-9afb-8ee00dfc0148 -> ../../sdc1
Declare that this is a "filestore" OSD, not a bluestore one
[root@control0 ~]# echo filestore > /var/lib/ceph/osd/kitana-0/type
Create journal symlink
[root@control0 ~]# ln -s /dev/disk/by-partuuid/5f3a4bc4-c60a-43c4-9afb-8ee00dfc0148 /var/lib/ceph/osd/kitana-0/journal
Initialize OSD
[root@control0 ~]# ceph-osd --mkfs --cluster kitana --id 0 --mkjournal
2017-10-11 18:19:07.918666 7ff96e2b8d80 -1 journal check: ondisk fsid ba21afec-5473-4880-8273-d61b56352e55 doesn't match expected d44b8d02-0729-45b7-90d5-7ca9c359cd0d, invalid (someone else's?) journal
2017-10-11 18:19:07.933174 7ff96e2b8d80 -1 read_settings error reading settings: (2) No such file or directory
2017-10-11 18:19:07.952380 7ff96e2b8d80 -1 created object store /var/lib/ceph/osd/kitana-0 for osd.0 fsid d9e5334f-e188-481f-b908-9b09978b4317
Create CephX keys
[root@control0 ~]# ceph --cluster=kitana auth get-or-create osd.0 \
    mon 'allow profile osd' \
    mgr 'allow profile osd' \
    osd 'allow *' > /var/lib/ceph/osd/kitana-0/keyring
Change owner
[root@control0 ~]# chown -R ceph:ceph --dereference -H /var/lib/ceph/osd/kitana-0
Test that the OSD is usable: start it in the foreground on the first TTY, fix any problems you find, and press Ctrl+C once the check is complete. Use a second TTY to watch the Ceph cluster status.
TTY.1:
[root@control0 ~]# ceph-osd -f --cluster kitana --id 0 --setuser ceph --setgroup ceph
starting osd.0 at - osd_data /var/lib/ceph/osd/kitana-0 /var/lib/ceph/osd/kitana-0/journal
2017-10-11 18:21:40.153852 7feeb2356d80 -1 osd.0 0 log_to_monitors {default=true}
2017-10-11 18:21:40.160117 7fee964a9700 -1 osd.0 0 waiting for initial osdmap
TTY.2:
[root@control0 ~]# ceph --cluster=kitana -s
  cluster:
    id:     d9e5334f-e188-481f-b908-9b09978b4317
    health: HEALTH_WARN
            norecover flag(s) set
            all OSDs are running luminous or later but require_osd_release < luminous

  services:
    mon: 1 daemons, quorum control0
    mgr: control0(active)
    osd: 1 osds: 1 up, 1 in
         flags norecover

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 bytes
    usage:   43688 kB used, 8047 MB / 8090 MB avail
    pgs:
Start the OSD the native way
[root@control0 ~]# umount /var/lib/ceph/osd/kitana-0/
[root@control0 ~]# udevadm trigger --action=add --name-match=sdc2
With udevadm we re-trigger the "add" event exactly as it would be emitted on system startup. udev reads the partition type GUID (the --typecode we set with sgdisk), and when it finds one of the special Ceph GUIDs, the proper action is performed (or at least should be performed...).
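For reference, the stock rules shipped by Ceph in /usr/lib/udev/rules.d/95-ceph-osd.rules look roughly like this (paraphrased from a Luminous install; check your own copy for the exact text):

# "ceph data" partitions: hand them to ceph-disk so the OSD gets mounted and started
ACTION=="add", SUBSYSTEM=="block", ENV{DEVTYPE}=="partition", \
  ENV{ID_PART_ENTRY_TYPE}=="4fbd7e29-9d25-41b8-afd0-062c0ceff05d", \
  OWNER:="ceph", GROUP:="ceph", MODE:="660", \
  RUN+="/usr/sbin/ceph-disk --log-stdout -v trigger /dev/$name"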
Bluestore looks pretty much like filestore, except for two notable changes. First, we mkfs the small partition, not the big one! It looks like the partitions swapped roles: the metadata goes where the journal used to be, and the big partition we formatted before is now used raw as the bluestore primary storage. Second, we create the symlink under a different name and declare a different objectstore type.
Create filesystem
[root@control0 ~]# mkfs.xfs /dev/sdc3
meta-data=/dev/sdc3              isize=512    agcount=4, agsize=8192 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0, sparse=0
data     =                       bsize=4096   blocks=32768, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=855, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
Create OSD in a cluster
[root@control0 ~]# ceph osd create --cluster=kitana
1
Mount filesystem
[root@control0 ~]# mount -t xfs /dev/sdc3 /var/lib/ceph/osd/kitana-1
Find the stable udev symlink of the future bluestore partition (in this sample sdc4 holds the bluestore data and sdc3 the metadata)
[root@control0 ~]# ls -l /dev/disk/by-partuuid | grep sdc4
lrwxrwxrwx 1 root root 10 Oct 11 18:15 89cdf38f-96b5-4d0d-8b06-2e5b2bfeb825 -> ../../sdc4
Declare that this is a "bluestore" OSD, not a filestore one
[root@control0 ~]# echo bluestore > /var/lib/ceph/osd/kitana-1/type
Create primary storage symlink
[root@control0 ~]# ln -s /dev/disk/by-partuuid/89cdf38f-96b5-4d0d-8b06-2e5b2bfeb825 /var/lib/ceph/osd/kitana-1/block
Initialize OSD
[root@control0 ~]# ceph-osd --mkfs --cluster kitana --id 1 --setuser ceph --setgroup ceph
2017-10-11 18:28:55.380706 7fd592c0ad80 -1 bluestore(/var/lib/ceph/osd/kitana-1) _read_fsid unparsable uuid
Create CephX keys
[root@control0 ~]# ceph --cluster=kitana auth get-or-create osd.1 \
    mon 'allow profile osd' \
    mgr 'allow profile osd' \
    osd 'allow *' > /var/lib/ceph/osd/kitana-1/keyring
Change owner
[root@control0 ~]# chown -R ceph:ceph --dereference -H /var/lib/ceph/osd/kitana-1
Test that the OSD is usable, the same way as before: start it in the foreground on the first TTY, fix any problems you find, and press Ctrl+C once the check is complete. Use a second TTY to watch the Ceph cluster status.
TTY.1:
[root@control0 ~]# ceph-osd -f --cluster kitana --id 1 --setuser ceph --setgroup ceph
starting osd.1 at - osd_data /var/lib/ceph/osd/kitana-1 /var/lib/ceph/osd/kitana-1/journal
2017-10-11 18:31:54.618297 7f02cfe31d80 -1 osd.1 0 log_to_monitors {default=true}
2017-10-11 18:31:54.626327 7f02b88ce700 -1 osd.1 0 waiting for initial osdmap
TTY.2:
[root@control0 ~]# ceph --cluster=kitana -s
  cluster:
    id:     d9e5334f-e188-481f-b908-9b09978b4317
    health: HEALTH_WARN
            norecover flag(s) set
            all OSDs are running luminous or later but require_osd_release < luminous

  services:
    mon: 1 daemons, quorum control0
    mgr: control0(active)
    osd: 2 osds: 2 up, 2 in
         flags norecover

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 bytes
    usage:   1164 MB used, 15025 MB / 16190 MB avail
    pgs:
Start the OSD the native way
[root@control0 ~]# umount /var/lib/ceph/osd/kitana-1/
[root@control0 ~]# udevadm trigger --action=add --name-match=sdc3
At this point we should have a running OSD. If you don't, it's time for a deep dive into the sea of Ceph (and that dive is outside the scope of this study).
Bluestore on LVM looks like an ordinary bluestore OSD and differs only in the post-creation steps:
Create the bluestore the same way as before, changing only the device names:
[root@control0 ~]# lvscan
  ACTIVE            '/dev/build/rpmbuild' [56,00 GiB] inherit
  ACTIVE            '/dev/build/swap' [<8,00 GiB] inherit
  ACTIVE            '/dev/centos/root' [<15,00 GiB] inherit
  ACTIVE            '/dev/kosd/meta' [100,00 MiB] inherit
  ACTIVE            '/dev/kosd/blocks' [<7,83 GiB] inherit
[root@control0 ~]# mkfs.xfs /dev/kosd/meta
meta-data=/dev/kosd/meta         isize=512    agcount=4, agsize=6400 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0, sparse=0
data     =                       bsize=4096   blocks=25600, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=855, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
[root@control0 ~]# ceph --cluster=kitana osd create
2
[root@control0 ~]# mount -t xfs /dev/kosd/meta /var/lib/ceph/osd/kitana-2
[root@control0 ~]# ln -s /dev/kosd/blocks /var/lib/ceph/osd/kitana-2/block
[root@control0 ~]# echo bluestore > /var/lib/ceph/osd/kitana-2/type
[root@control0 ~]# ceph-osd --mkfs --cluster kitana --id 2
2017-10-11 18:36:47.882226 7f814cbc4d80 -1 bluestore(/var/lib/ceph/osd/kitana-2) _read_fsid unparsable uuid
2017-10-11 18:36:49.446280 7f814cbc4d80 -1 created object store /var/lib/ceph/osd/kitana-2 for osd.2 fsid d9e5334f-e188-481f-b908-9b09978b4317
[root@control0 ~]# ceph --cluster=kitana auth get-or-create osd.2 \
    mon 'allow profile osd' \
    mgr 'allow profile osd' \
    osd 'allow *' > /var/lib/ceph/osd/kitana-2/keyring
[root@control0 ~]# chown -R ceph:ceph --dereference -H /var/lib/ceph/osd/kitana-2
[root@control0 ~]# ceph-osd -f --cluster kitana --id 2 --setuser ceph --setgroup ceph
starting osd.2 at - osd_data /var/lib/ceph/osd/kitana-2 /var/lib/ceph/osd/kitana-2/journal
2017-10-11 18:37:22.141564 7f1de4e8ed80 -1 osd.2 0 log_to_monitors {default=true}
2017-10-11 18:37:22.147909 7f1dcd92b700 -1 osd.2 0 waiting for initial osdmap
TTY.2:
[root@control0 ~]# ceph --cluster=kitana -s
  cluster:
    id:     d9e5334f-e188-481f-b908-9b09978b4317
    health: HEALTH_WARN
            norecover flag(s) set
            all OSDs are running luminous or later but require_osd_release < luminous

  services:
    mon: 1 daemons, quorum control0
    mgr: control0(active)
    osd: 3 osds: 3 up, 3 in
         flags norecover

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 bytes
    usage:   1164 MB used, 15025 MB / 16190 MB avail
    pgs:
[root@control0 ~]# umount /var/lib/ceph/osd/kitana-2
[root@control0 ~]# echo "/dev/kosd/meta /var/lib/ceph/osd/kitana-2 xfs rw,noatime,nodiratime,attr2,discard,inode64,noquota 0 0" >> /etc/fstab
[root@control0 ~]# mount -a
[root@control0 ~]# mount | grep kitana
/dev/sdc2 on /var/lib/ceph/osd/kitana-0 type xfs (rw,noatime,nodiratime,attr2,discard,inode64,noquota)
/dev/sdc3 on /var/lib/ceph/osd/kitana-1 type xfs (rw,noatime,nodiratime,attr2,discard,inode64,noquota)
/dev/mapper/kosd-meta on /var/lib/ceph/osd/kitana-2 type xfs (rw,noatime,nodiratime,attr2,discard,inode64,noquota)
If the mount succeeds and we see our LVM-based OSD filesystem mounted, we need to enable the service.
[root@control0 ~]# systemctl enable ceph-osd@2
Created symlink from /etc/systemd/system/ceph-osd.target.wants/ceph-osd@2.service to /usr/lib/systemd/system/ceph-osd@.service.
[root@control0 ~]# systemctl start ceph-osd@2
But there is one caveat in our setup that prevents the OSD from starting on the next reboot: the device file ownership.
The default Ceph setup doesn't even consider that block/journal files can live anywhere except raw partitions, so if you use LVM you get a startup problem with chown (and if you don't use LVM, you get the partition-creation problem and "device busy until reboot" on partprobe). Both are bad, but I personally prefer to fight with chown rather than get a knife in my back from "device busy". The suggested solution is to create a separate udev rule that changes the LV ownership on startup.
We'll use custom udev rules to change the owners of the LV device files. In this study case, the volumes that need to be chown'ed belong to the VG named "kosd". First, let's examine the current udev devices.
What are our devices' real names?
[root@control0 ~]# ls -l /dev/mapper
total 0
lrwxrwxrwx 1 root root       7 Oct 12 05:28 build-rpmbuild -> ../dm-1
lrwxrwxrwx 1 root root       7 Oct 12 05:28 build-swap -> ../dm-2
lrwxrwxrwx 1 root root       7 Oct 12 05:28 centos-root -> ../dm-0
crw------- 1 root root 10, 236 Oct 12 05:28 control
lrwxrwxrwx 1 root root       7 Oct 12 05:39 kosd-blocks -> ../dm-4
lrwxrwxrwx 1 root root       7 Oct 12 05:38 kosd-meta -> ../dm-3
What are our LV device attributes?
[root@control0 ~]# udevadm info -n dm-4
P: /devices/virtual/block/dm-4
N: dm-4
S: disk/by-id/dm-name-kosd-blocks
S: disk/by-id/dm-uuid-LVM-9UI9alw3iskHNNvKZfQSCo4bAlADH7XaPQY6Y9R9TUIF3y2OjtUuSD1w60yS3ixT
S: kosd/blocks
S: mapper/kosd-blocks
E: DEVLINKS=/dev/disk/by-id/dm-name-kosd-blocks /dev/disk/by-id/dm-uuid-LVM-9UI9alw3iskHNNvKZfQSCo4bAlADH7XaPQY6Y9R9TUIF3y2OjtUuSD1w60yS3ixT /dev/kosd/blocks /dev/mapper/kosd-blocks
E: DEVNAME=/dev/dm-4
E: DEVPATH=/devices/virtual/block/dm-4
E: DEVTYPE=disk
E: DM_LV_NAME=blocks
E: DM_NAME=kosd-blocks
E: DM_SUSPENDED=0
E: DM_UDEV_DISABLE_LIBRARY_FALLBACK_FLAG=1
E: DM_UDEV_PRIMARY_SOURCE_FLAG=1
E: DM_UDEV_RULES_VSN=2
E: DM_UUID=LVM-9UI9alw3iskHNNvKZfQSCo4bAlADH7XaPQY6Y9R9TUIF3y2OjtUuSD1w60yS3ixT
E: DM_VG_NAME=kosd
E: MAJOR=253
E: MINOR=4
E: SUBSYSTEM=block
E: TAGS=:systemd:
E: USEC_INITIALIZED=29234
As udevadm shows, there is a DM_VG_NAME attribute containing the VG name. This attribute allows us to write a custom udev rule. We'll put the rule into a separate file to avoid conflicts with OS updates via the package manager. The file will be /etc/udev/rules.d/99-lvmceph.rules:
[root@control0 ~]# cat /etc/udev/rules.d/99-lvmceph.rules
# Custom LVM VG owner
ACTION=="add", SUBSYSTEM=="block", \
    ENV{DM_VG_NAME}=="kosd", \
    OWNER:="ceph", GROUP:="ceph", MODE:="660"
[root@control0 ~]#
The only step left is to test our rule. The device path to test can be found as the P: attribute in the udevadm info output above.
[root@control0 ~]# udevadm test /devices/virtual/block/dm-4
calling: test
version 219
This program is for debugging only, it does not run any program
specified by a RUN key. It may show incorrect results, because
some values may be different, or not available at a simulation run.

=== trie on-disk ===
tool version:          219
file size:         7477022 bytes
header size             80 bytes
strings            1947422 bytes
nodes              5529520 bytes
Load module index
Created link configuration context.
timestamp of '/etc/udev/rules.d' changed
... lots of output skipped ...
Reading rules file: /etc/udev/rules.d/99-lvmceph.rules
... lots of output skipped ...
OWNER 167 /etc/udev/rules.d/99-lvmceph.rules:4
GROUP 167 /etc/udev/rules.d/99-lvmceph.rules:4
MODE 0660 /etc/udev/rules.d/99-lvmceph.rules:4
... lots of output skipped ...
Unload module index
Unloaded link configuration context.
And reload the udev rules we just tested:
[root@control0 ~]# udevadm control -R
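To apply the new ownership immediately without rebooting, you can re-trigger the add event for the LVs and verify the result (a quick sketch):

udevadm trigger --action=add --name-match=dm-3 --name-match=dm-4
ls -l /dev/dm-3 /dev/dm-4   # both should now be owned by ceph:ceph with mode 0660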
Artemy Kapitula, software developer at Mail.RU Cloud Solutions
artemy.kapitula@gmail.com