Ceph without ceph-disk

Preface

Most people use the ceph-disk utility to prepare a Ceph disk for an OSD. But sometimes ceph-disk starts behaving strangely, or decides you're doing something "unacceptable". For when you're stuck like that, here is a short manual on how to create an OSD without ceph-disk.

The most common reason to change the default Ceph behavior is the "journal on the SSD, data on the HDD" setup that is widely used to improve the performance of HDD-backed OSDs. With bluestore this setup becomes "data on an HDD, DB and WAL on an SSD". With these setups you create several partitions on one SSD, which means the partition table becomes locked, so you can't add/remove/change partitions (i.e. you can't add new partitions for a new OSD device without a reboot). The workaround is to use LVM, but then you can run into a device access (ownership) error.
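
A hypothetical illustration of the locked-table problem (the device name here is made up, not from the setup below): you add a journal partition for a new OSD on an SSD whose existing partitions are already in use, and the kernel refuses to pick the new partition up until reboot:

sgdisk --new=6:0:+8G /dev/sdb
partprobe /dev/sdb    # or partx -a /dev/sdb; this is where the "device busy" problem shows up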

In this post I'll try to show how these problems can be avoided.

P.S.: yes, we completely refused to use ceph-disk in our projects because of its dictatorial nature.

The training homework setup

Here is our training partitioning for a Ceph disk. We have FIVE partitions:

[root@control0 ~]# gdisk -l /dev/sdc
GPT fdisk (gdisk) version 0.8.6

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.
Disk /dev/sdc: 50331648 sectors, 24.0 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): B7AB6C94-607F-4BAE-A62F-9FD00AE6CAD1
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 50331614
Partitions will be aligned on 2048-sector boundaries
Total free space is 2014 sectors (1007.0 KiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048          264191   128.0 MiB   8300  Linux filesystem
   2          264192        16852991   7.9 GiB     8300  Linux filesystem
   3        16852992        17115135   128.0 MiB   8300  Linux filesystem
   4        17115136        33703935   7.9 GiB     8300  Linux filesystem
   5        33703936        50331614   7.9 GiB     8E00  Linux LVM

Let's zero the future ceph partitions. It's strongly recommended to do this, because otherwise you can get a "device busy" error on them later. Because the sample is a VirtualBox virtual machine, we use dd. On a real install you can use the blkdiscard utility for SSDs and dd for HDDs, as we'll do next.

[root@control0 ~]# for n in 1 2 3 4 ; do dd if=/dev/zero of=/dev/sdc$n bs=1024k count=10 ; done
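
A sketch of what the blkdiscard variant mentioned above would look like on a real SSD (not run in this VM sample):

for n in 1 2 3 4 ; do blkdiscard /dev/sdc$n ; done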

The next step is to set valid type GUIDs on the partitions (did I tell you we MUST use GPT? I didn't, but I guess you already knew).

[root@control0 ~]# sgdisk --change-name=1:"ceph journal" --typecode=1:45B0969E-9B03-4F30-B4C6-B4B80CEFF106 /dev/sdc
[root@control0 ~]# sgdisk --change-name=2:"ceph data" --typecode=2:4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D /dev/sdc
[root@control0 ~]# sgdisk --change-name=3:"ceph data" --typecode=3:4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D /dev/sdc
[root@control0 ~]# sgdisk --change-name=4:"ceph block" --typecode=4:CAFECAFE-9B03-4F30-B4C6-B4B80CEFF106 /dev/sdc
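
To verify the type GUIDs actually landed on disk you can inspect a partition afterwards, e.g. (an extra check, not part of the original session):

sgdisk --info=2 /dev/sdc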

Handling the LVM part is trivial; the only thing worth noting is that we have two LVs (meta and blocks) in a single VG named "kosd".

[root@control0 ~]# lvscan | grep kosd
  ACTIVE            '/dev/kosd/meta' [100,00 MiB] inherit
  ACTIVE            '/dev/kosd/blocks' [<7,83 GiB] inherit
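
If you need to build such a layout from scratch, a minimal sketch (assuming the LVM partition /dev/sdc5 from the table above is used as the physical volume) would be:

pvcreate /dev/sdc5
vgcreate kosd /dev/sdc5
lvcreate -L 100M -n meta kosd
lvcreate -l 100%FREE -n blocks kosd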

Creating a Ceph OSD by hand boils down to a few basic tasks: create a filesystem for the OSD metadata, register the OSD in the cluster, mount the filesystem, declare the object store type, link the journal (or block) device, initialize the store, create the CephX key, fix the ownership, and start the daemon. We'll walk through them for filestore, bluestore and bluestore-on-LVM.

In this sample I use a patched Ceph version that allows passing the cluster name into the systemd units. With "vanilla" Ceph you can define the cluster name in /etc/sysconfig/ceph if your cluster name differs from the default "ceph".
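
For example, with the stock RPM unit files (which read EnvironmentFile=-/etc/sysconfig/ceph and start ceph-osd with --cluster ${CLUSTER}) setting the variable there should be enough; a sketch, assuming those stock units:

# /etc/sysconfig/ceph
CLUSTER=kitana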

Basic filestore

Create filesystem

[root@control0 ~]# mkfs.xfs /dev/sdc2
meta-data=/dev/sdc2              isize=512    agcount=4, agsize=518400 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0, sparse=0
data     =                       bsize=4096   blocks=2073600, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Create OSD in a cluster

[root@control0 ~]# ceph osd create --cluster=kitana
0

Mount filesystem

[root@control0 ~]# mount -t xfs /dev/sdc2 /var/lib/ceph/osd/kitana-0
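
The mount point is assumed to exist already; if it doesn't, create it first (a step not shown in the transcript):

mkdir -p /var/lib/ceph/osd/kitana-0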

Find the future journal partition (in this sample we use sdc1 as the journal for the sdc2 filestore)

[root@control0 ~]# ls -l /dev/disk/by-partuuid | grep sdc1
lrwxrwxrwx 1 root root 10 окт 11 17:36 5f3a4bc4-c60a-43c4-9afb-8ee00dfc0148 -> ../../sdc1

Declare this is a "filestore" OSD, not a bluestore

[root@control0 ~]# echo filestore > /var/lib/ceph/osd/kitana-0/type

Create journal symlink

[root@control0 ~]# ln -s /dev/disk/by-partuuid/5f3a4bc4-c60a-43c4-9afb-8ee00dfc0148 /var/lib/ceph/osd/kitana-0/journal

Initialize OSD

[root@control0 ~]# ceph-osd --mkfs  --cluster kitana --id 0 --mkjournal
2017-10-11 18:19:07.918666 7ff96e2b8d80 -1 journal check: ondisk fsid ba21afec-5473-4880-8273-d61b56352e55 doesn't match expected d44b8d02-0729-45b7-90d5-7ca9c359cd0d, invalid (someone else's?) journal
2017-10-11 18:19:07.933174 7ff96e2b8d80 -1 read_settings error reading settings: (2) No such file or directory
2017-10-11 18:19:07.952380 7ff96e2b8d80 -1 created object store /var/lib/ceph/osd/kitana-0 for osd.0 fsid d9e5334f-e188-481f-b908-9b09978b4317

Create CephX keys

[root@control0 ~]# ceph --cluster=kitana auth get-or-create osd.0 \
    mon 'allow profile osd' \
    mgr 'allow profile osd' \
    osd 'allow *' > /var/lib/ceph/osd/kitana-0/keyring

Change owner

[root@control0 ~]# chown -R ceph:ceph --dereference -H /var/lib/ceph/osd/kitana-0

Test that the OSD is usable: start the OSD in the foreground on the first TTY, fix any problems found, and press Ctrl+C after the check is complete. Use the second TTY to watch the Ceph cluster status

TTY.1:
[root@control0 ~]# ceph-osd -f --cluster kitana --id 0 --setuser ceph --setgroup ceph
starting osd.0 at - osd_data /var/lib/ceph/osd/kitana-0 /var/lib/ceph/osd/kitana-0/journal
2017-10-11 18:21:40.153852 7feeb2356d80 -1 osd.0 0 log_to_monitors {default=true}
2017-10-11 18:21:40.160117 7fee964a9700 -1 osd.0 0 waiting for initial osdmap
TTY.2:
[root@control0 ~]# ceph --cluster=kitana -s
  cluster:
    id:     d9e5334f-e188-481f-b908-9b09978b4317
    health: HEALTH_WARN
            norecover flag(s) set
            all OSDs are running luminous or later but require_osd_release < luminous

  services:
    mon: 1 daemons, quorum control0
    mgr: control0(active)
    osd: 1 osds: 1 up, 1 in
         flags norecover

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 bytes
    usage:   43688 kB used, 8047 MB / 8090 MB avail
    pgs:

Start the OSD the native way

[root@control0 ~]# umount /var/lib/ceph/osd/kitana-0/
[root@control0 ~]# udevadm trigger --action=add --name-match=sdc2

With udevadm we re-trigger the "add" event exactly as it would be emitted at system startup. UDEV looks at the partition type GUID (--typecode in gdisk/sgdisk) and, when it finds one of the special Ceph GUIDs, the proper action is performed (or at least should be performed...)
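
If udev does not pick the OSD up, one extra check (not part of the original walkthrough) is to look at which type GUID the partition actually reports:

blkid -p -o udev /dev/sdc2 | grep ID_PART_ENTRY_TYPE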

Basic bluestore

Bluestore looks pretty much like filestore, except for two notable changes. First, we mkfs the small partition, not the big one! It looks like we swap the partitions: the metadata goes where the journal used to be, and the big partition that we used to mkfs is now used as the bluestore primary storage. Second, we create a symlink with a different name and declare a different object store type.

Create filesystem

[root@control0 ~]# mkfs.xfs /dev/sdc3
meta-data=/dev/sdc3              isize=512    agcount=4, agsize=8192 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0, sparse=0
data     =                       bsize=4096   blocks=32768, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=855, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Create OSD in a cluster

[root@control0 ~]# ceph osd create --cluster=kitana
1

Mount filesystem

[root@control0 ~]# mount -t xfs /dev/sdc3 /var/lib/ceph/osd/kitana-1

Find the stable udev symlink of the future bluestore partition (in this sample we use sdc4 as the bluestore block device and sdc3 for the metadata)

[root@control0 ~]# ls -l /dev/disk/by-partuuid | grep sdc4
lrwxrwxrwx 1 root root 10 окт 11 18:15 /dev/disk/by-partuuid/89cdf38f-96b5-4d0d-8b06-2e5b2bfeb825 -> ../../sdc4

Declare this is a "bluestore" OSD, not a filestore

[root@control0 ~]# echo bluestore > /var/lib/ceph/osd/kitana-1/type

Create primary storage symlink

[root@control0 ~]# ln -s /dev/disk/by-partuuid/89cdf38f-96b5-4d0d-8b06-2e5b2bfeb825 /var/lib/ceph/osd/kitana-1/block

Initialize OSD

[root@control0 ~]# ceph-osd --mkfs  --cluster kitana --id 1 --setuser ceph --setgroup ceph
2017-10-11 18:28:55.380706 7fd592c0ad80 -1 bluestore(/var/lib/ceph/osd/kitana-1) _read_fsid unparsable uuid

Create CephX keys

[root@control0 ~]# ceph --cluster=kitana auth get-or-create osd.1 \
    mon 'allow profile osd' \
    mgr 'allow profile osd' \
    osd 'allow *' > /var/lib/ceph/osd/kitana-1/keyring

Change owner

[root@control0 ~]# chown -R ceph:ceph --dereference -H /var/lib/ceph/osd/kitana-1

Test that the OSD is usable: start the OSD in the foreground on the first TTY, fix any problems found, and press Ctrl+C after the check is complete. Use the second TTY to watch the Ceph cluster status

TTY.1:
[root@control0 ~]# ceph-osd -f --cluster kitana --id 1 --setuser ceph --setgroup ceph
starting osd.1 at - osd_data /var/lib/ceph/osd/kitana-1 /var/lib/ceph/osd/kitana-1/journal
2017-10-11 18:31:54.618297 7f02cfe31d80 -1 osd.1 0 log_to_monitors {default=true}
2017-10-11 18:31:54.626327 7f02b88ce700 -1 osd.1 0 waiting for initial osdmap
TTY.2:
[root@control0 ~]# ceph --cluster=kitana -s
  cluster:
    id:     d9e5334f-e188-481f-b908-9b09978b4317
    health: HEALTH_WARN
            norecover flag(s) set
            all OSDs are running luminous or later but require_osd_release < luminous

  services:
    mon: 1 daemons, quorum control0
    mgr: control0(active)
    osd: 2 osds: 2 up, 2 in
         flags norecover

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 bytes
    usage:   1164 MB used, 15025 MB / 16190 MB avail
    pgs:

Start the OSD the native way

[root@control0 ~]# umount /var/lib/ceph/osd/kitana-1/
[root@control0 ~]# udevadm trigger --action=add --name-match=sdc3

At this point the OSDs should be up and running. If they aren't, it's time for a deep dive into the sea of Ceph (and that dive is out of scope for this study).
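
A quick way to check (an extra command, not from the original session) is to look at the OSD tree; both OSDs should be reported as up:

ceph --cluster=kitana osd tree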

Bluestore OSD on LVM

Bluestore on LVM looks like ordinary bluestore and differs only in the post-create steps:

Create the bluestore the same way, only with different device names

[root@control0 ~]# lvscan
  ACTIVE            '/dev/build/rpmbuild' [56,00 GiB] inherit
  ACTIVE            '/dev/build/swap' [<8,00 GiB] inherit
  ACTIVE            '/dev/centos/root' [<15,00 GiB] inherit
  ACTIVE            '/dev/kosd/meta' [100,00 MiB] inherit
  ACTIVE            '/dev/kosd/blocks' [<7,83 GiB] inherit

[root@control0 ~]# mkfs.xfs /dev/kosd/meta
meta-data=/dev/kosd/meta         isize=512    agcount=4, agsize=6400 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=0, sparse=0
data     =                       bsize=4096   blocks=25600, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=855, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

[root@control0 ~]# ceph --cluster=kitana osd create
2

[root@control0 ~]# mount -t xfs /dev/kosd/meta /var/lib/ceph/osd/kitana-2

[root@control0 ~]# ln -s /dev/kosd/blocks /var/lib/ceph/osd/kitana-2/block

[root@control0 ~]# echo bluestore > /var/lib/ceph/osd/kitana-2/type

[root@control0 ~]# ceph-osd --mkfs --cluster kitana --id 2
2017-10-11 18:36:47.882226 7f814cbc4d80 -1 bluestore(/var/lib/ceph/osd/kitana-2) _read_fsid unparsable uuid
2017-10-11 18:36:49.446280 7f814cbc4d80 -1 created object store /var/lib/ceph/osd/kitana-2 for osd.2 fsid d9e5334f-e188-481f-b908-9b09978b4317

[root@control0 ~]# ceph --cluster=kitana auth get-or-create osd.2 \
    mon 'allow profile osd' \
    mgr 'allow profile osd' \
    osd 'allow *' > /var/lib/ceph/osd/kitana-2/keyring

[root@control0 ~]# chown -R ceph:ceph --dereference -H /var/lib/ceph/osd/kitana-2

TTY.1:
[root@control0 ~]# ceph-osd -f --cluster kitana --id 2 --setuser ceph --setgroup ceph
starting osd.2 at - osd_data /var/lib/ceph/osd/kitana-2 /var/lib/ceph/osd/kitana-2/journal
2017-10-11 18:37:22.141564 7f1de4e8ed80 -1 osd.2 0 log_to_monitors {default=true}
2017-10-11 18:37:22.147909 7f1dcd92b700 -1 osd.2 0 waiting for initial osdmap
 
TTY.2:
[root@control0 ~]# ceph --cluster=kitana -s
  cluster:
    id:     d9e5334f-e188-481f-b908-9b09978b4317
    health: HEALTH_WARN
            norecover flag(s) set
            all OSDs are running luminous or later but require_osd_release < luminous

  services:
    mon: 1 daemons, quorum control0
    mgr: control0(active)
    osd: 3 osds: 3 up, 3 in
         flags norecover

  data:
    pools:   0 pools, 0 pgs
    objects: 0 objects, 0 bytes
    usage:   1164 MB used, 15025 MB / 16190 MB avail
    pgs:

Starting the OSD the native way differs from the partition-based setup. We need:

[root@control0 ~]# umount /var/lib/ceph/osd/kitana-2
[root@control0 ~]# echo "/dev/kosd/meta	/var/lib/ceph/osd/kitana-2 xfs rw,noatime,nodiratime,attr2,discard,inode64,noquota 0 0" >> /etc/fstab
[root@control0 ~]# mount -a
[root@control0 ~]# mount | grep kitana
/dev/sdc2 on /var/lib/ceph/osd/kitana-0 type xfs (rw,noatime,nodiratime,attr2,discard,inode64,noquota)
/dev/sdc3 on /var/lib/ceph/osd/kitana-1 type xfs (rw,noatime,nodiratime,attr2,discard,inode64,noquota)
/dev/mapper/kosd-meta on /var/lib/ceph/osd/kitana-2 type xfs (rw,noatime,nodiratime,attr2,discard,inode64,noquota)

If the mount succeeds and we see our LVM-based OSD filesystem mounted, we need to enable the service.

[root@control0 ~]# systemctl enable ceph-osd@2
Created symlink from /etc/systemd/system/ceph-osd.target.wants/ceph-osd@2.service to /usr/lib/systemd/system/ceph-osd@.service.
[root@control0 ~]# systemctl start ceph-osd@2
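
It is worth verifying that the unit actually came up, for example:

systemctl status ceph-osd@2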

But there is one caveat in our setup that prevents the OSD from starting on the next reboot: the device file ownership.

Fixing VG/LV ownership with UDEV

The default Ceph setup doesn't even allow for block/journal files living anywhere except raw partitions, so if you use LVM you get a startup problem with chown (and if you don't use LVM you get problems creating partitions and the "device busy until reboot" issue on partprobe). Both options are bad, but I personally prefer to fight with chown rather than get a knife in my back from "device busy". The suggested solution is to create a separate UDEV rule that changes the LV ownership on startup.

We'll use a custom UDEV rule to change the owner of the LV device files. In this study case the volumes that need to be chown'ed belong to a VG named "kosd". First, let's examine the current udev devices.

What are our devices' real names?

[root@control0 ~]# ls -l /dev/mapper
total 0
lrwxrwxrwx 1 root root       7 Oct 12 05:28 build-rpmbuild -> ../dm-1
lrwxrwxrwx 1 root root       7 Oct 12 05:28 build-swap -> ../dm-2
lrwxrwxrwx 1 root root       7 Oct 12 05:28 centos-root -> ../dm-0
crw------- 1 root root 10, 236 Oct 12 05:28 control
lrwxrwxrwx 1 root root       7 Oct 12 05:39 kosd-blocks -> ../dm-4
lrwxrwxrwx 1 root root       7 Oct 12 05:38 kosd-meta -> ../dm-3

What are our LV device attributes?

[root@control0 ~]# udevadm info -n dm-4
P: /devices/virtual/block/dm-4
N: dm-4
S: disk/by-id/dm-name-kosd-blocks
S: disk/by-id/dm-uuid-LVM-9UI9alw3iskHNNvKZfQSCo4bAlADH7XaPQY6Y9R9TUIF3y2OjtUuSD1w60yS3ixT
S: kosd/blocks
S: mapper/kosd-blocks
E: DEVLINKS=/dev/disk/by-id/dm-name-kosd-blocks /dev/disk/by-id/dm-uuid-LVM-9UI9alw3iskHNNvKZfQSCo4bAlADH7XaPQY6Y9R9TUIF3y2OjtUuSD1w60yS3ixT /dev/kosd/blocks /dev/mapper/kosd-blocks
E: DEVNAME=/dev/dm-4
E: DEVPATH=/devices/virtual/block/dm-4
E: DEVTYPE=disk
E: DM_LV_NAME=blocks
E: DM_NAME=kosd-blocks
E: DM_SUSPENDED=0
E: DM_UDEV_DISABLE_LIBRARY_FALLBACK_FLAG=1
E: DM_UDEV_PRIMARY_SOURCE_FLAG=1
E: DM_UDEV_RULES_VSN=2
E: DM_UUID=LVM-9UI9alw3iskHNNvKZfQSCo4bAlADH7XaPQY6Y9R9TUIF3y2OjtUuSD1w60yS3ixT
E: DM_VG_NAME=kosd
E: MAJOR=253
E: MINOR=4
E: SUBSYSTEM=block
E: TAGS=:systemd:
E: USEC_INITIALIZED=29234

As udevadm shows, there is an attribute DM_VG_NAME containing the VG name. This attribute allows us to define a custom udev rule. We'll write the rule into a separate file to avoid conflicts with OS updates via the package manager. The file will be /etc/udev/rules.d/99-lvmceph.rules:

[root@control0 ~]# cat /etc/udev/rules.d/99-lvmceph.rules

# Custom LVM VG owner
ACTION=="add", SUBSYSTEM=="block", \
  ENV{DM_VG_NAME}=="kosd", \
  OWNER:="ceph", GROUP:="ceph", MODE:="660"

[root@control0 ~]#

The only remaining step is to test our rule. The device path to test can be found in the P: attribute of the udevadm info ... output (see the first line of the udevadm info ... snippet).

[root@control0 ~]# udevadm test /devices/virtual/block/dm-4
calling: test
version 219
This program is for debugging only, it does not run any program
specified by a RUN key. It may show incorrect results, because
some values may be different, or not available at a simulation run.

=== trie on-disk ===
tool version:          219
file size:         7477022 bytes
header size             80 bytes
strings            1947422 bytes
nodes              5529520 bytes
Load module index
Created link configuration context.
timestamp of '/etc/udev/rules.d' changed
... lots of output skipped ...
Reading rules file: /etc/udev/rules.d/99-lvmceph.rules
... lots of output skipped ...
OWNER 167 /etc/udev/rules.d/99-lvmceph.rules:4
GROUP 167 /etc/udev/rules.d/99-lvmceph.rules:4
MODE 0660 /etc/udev/rules.d/99-lvmceph.rules:4
... lots of output skipped ...
Unload module index
Unloaded link configuration context.

And reload the udev rules we've just tested

[root@control0 ~]# udevadm control -R
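
Note that the rule as written only fires on the next "add" event, i.e. on boot. To apply the ownership right away without rebooting, you can re-trigger the LV devices the same way we triggered the partitions earlier (dm-3 and dm-4 are our kosd LVs, see the /dev/mapper listing above):

udevadm trigger --action=add --name-match=dm-3
udevadm trigger --action=add --name-match=dm-4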

Artemy Kapitula, software developer at Mail.RU Cloud Solutions
artemy.kapitula@gmail.com