Ceph RBD performance issues

Preface

RBD (RADOS Block Device) is one of the most frequently used features of the Ceph object store. RBD lets you access Ceph storage objects as block devices. There are plenty of success stories about RBD used as a storage backend for virtualization, but there are also many caveats, which we try to describe here.

First of all, the root of most issues is the object size and the internal algorithms Ceph uses while the cluster operates in a non-clean state, such as startup, shutdown or maintenance. To understand where and why Ceph may become slow, we need to know how operations are performed.

The object size defines how many chunks your image consists of, and it affects performance a great deal. Every object has a set of metadata that the OSD keeps in memory, so the more objects you have, the more RAM you need. This metadata must also be kept on disk (for now, on the filestore) and is updated every time you write an object, so the on-disk metadata size also grows as the object count increases.

Increasing the object size significantly increases performance. In our Ceph tests we saw roughly a 30% improvement when growing the object size and roughly a 30% loss when shrinking it. But object size is a double-edged sword: on the other edge lie recovery, snapshots, startup and peering.
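To make the trade-off concrete, here is a back-of-envelope calculation plus a minimal sketch of creating an image with a non-default object size via the librbd Python binding. The pool name 'rbd', the image name 'test-image' and the 10GB size are placeholders, not recommendations.

    import rados, rbd

    # Back-of-envelope: how the object size changes the object count of one image.
    image_size = 1 * 1024**4                    # 1 TiB image
    for object_mib in (1, 4, 32):
        objects = image_size // (object_mib * 1024**2)
        print(f"{object_mib:>2} MiB objects -> {objects:>9,} objects per 1 TiB image")

    # Minimal sketch: create an image with a non-default object size.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    ioctx = cluster.open_ioctx('rbd')
    # 'order' is log2 of the object size in bytes: 22 -> 4 MiB (default), 25 -> 32 MiB
    rbd.RBD().create(ioctx, 'test-image', 10 * 1024**3, order=25)
    ioctx.close()
    cluster.shutdown()

A 1 TiB image holds 262,144 objects at the default 4 MiB object size, but only 32,768 objects at 32 MiB - an eightfold difference in metadata held by the OSDs.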

Objects are grouped into pools and sharded within a pool into groups called placement groups (PGs). A placement group may reside on several OSDs, and all of those OSDs hold the same data for the PG. Every operation is served by one of them - the primary OSD - which replicates the operation to the other OSDs, called peers.

Replication allows Ceph to keep several copies of the data (3 by default). Every data-modifying operation arrives at the primary OSD daemon, and the primary OSD resends it to the peers. To operate correctly and guarantee data integrity, Ceph always applies changes in the same way on all peered OSDs. "The same way" means that before an operation starts, Ceph does some work to ensure the state of the modified objects is coherent across all online OSDs, so the modification keeps the copies (replicas) in a coherent state.
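The following toy sketch only illustrates the idea of the object-to-PG-to-OSD mapping; real Ceph hashes object names with rjenkins and maps PGs to OSDs with CRUSH, not with this code.

    import hashlib

    # Deliberately simplified illustration of placement, not Ceph's real algorithm.
    def object_to_pg(object_name, pg_num):
        return int(hashlib.md5(object_name.encode()).hexdigest(), 16) % pg_num

    def pg_to_osds(pg_id, osds, replicas=3):
        # Toy "CRUSH": pick `replicas` distinct OSDs; the first one is the primary
        # that serves the I/O and forwards every write to the remaining peers.
        start = pg_id % len(osds)
        return [osds[(start + i) % len(osds)] for i in range(replicas)]

    osds = list(range(12))                                   # toy cluster: 12 OSDs
    pg = object_to_pg('rbd_data.abc123.0000000000000042', pg_num=128)
    print(pg, pg_to_osds(pg, osds))                          # PG id and its acting set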

Why might you need to change the default object size for RBD images?

Startup and peering

Every time an OSD goes down while the other OSDs keep operating, all PGs that rely on the failed OSD become unclean and their objects go into a degraded state. When the OSD comes back up, it starts a peering process. While peering, the OSD must build a complete picture of the versions and states of all objects. This information is used to perform object recovery on the first write (and sometimes even read) of an object. This data (called the authoritative history) is held in RAM, and when an OSD starts from a cold state it must also be read from disk. So the more objects you have, the more RAM you need for peering and the more time you need to start an OSD.

Startup and peering time matter because a PG will not accept any operation until its state is ACTIVE, so while a PG is peering your data is inaccessible and your clients are waiting. If you have a large number of objects and PGs, you can get stuck in peering. On a node with several OSDs, peering with a large object count may end in an out-of-memory kill: the OSD shuts down again and peering restarts. If the OOM hits an OSD that is already acting, its PGs drop from ACTIVE to PEERED and peering restarts.

Sometimes, if two failure domains go down, you end up with PGs stuck in the PEERED state. In this state you need to bring the missing OSDs back A.S.A.P. to regain access to your data (or you may decrease the pool's min_size to 1), and the OSD startup time - while the OSD reads all PG and object metadata - becomes the bottleneck.

Increasing the object size decreases memory and CPU requirements and shortens startup and peering time. Decreasing the object size increases startup and peering time and slows down your OSDs.

Recovery

After an OSD starts and joins a previously active+unclean PG, every write that lands in an object that was modified while the joining OSD was inaccessible triggers a recovery. In Ceph, recovery is performed by replicating the whole object from an OSD that has the newest copy to the OSD that has an outdated version. This is not a problem if the write size is close to the object size (say, a 3MB write with a 4MB object size), but it leads to a severe performance loss (some people call it catastrophic, and they are not far wrong) if the write is small (say, a 64KB write with a 4MB object size).
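A quick illustration of the resulting write amplification (illustrative arithmetic, not a benchmark):

    # Write amplification during recovery: a small client write to a degraded
    # object forces the whole object to be copied to the outdated replica.
    object_size = 4 * 1024**2                          # 4 MiB objects
    for client_write in (3 * 1024**2, 64 * 1024):      # ~3 MiB vs 64 KiB writes
        print(f"{client_write // 1024:>5} KiB write -> "
              f"{object_size / client_write:.1f}x the payload moved for recovery")
    # 3072 KiB write -> 1.3x; 64 KiB write -> 64.0x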

Where do we run into trouble?

Disk performance during recovery

First of all, there is raw disk performance. If the OSD journal and data device share one disk, every write hits the disk twice: once for the journal and once for the filestore. If your disk performs at 600MB/sec, that means 300MB/sec of payload bandwidth, or 75 full-object (4MB) writes per second. Yes, 75 operations per second per disk. For reference, a typical SSD-backed Ceph OSD delivers about 1000 4K ops/sec. So if you have an all-flash cluster of 3 nodes with 4 disks each, you drop from 12K 4K writes/sec to 300 4K writes/sec during recovery. Shit happens, and you MUST be ready for it.
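The arithmetic behind these numbers, spelled out; the division by 3 assumes the default three-replica pool, which is how the 300 figure above works out.

    # The arithmetic behind the numbers above.
    disk_bw_mb = 600                  # raw device bandwidth, MB/s
    payload_mb = disk_bw_mb / 2       # journal + filestore: every byte is written twice
    object_mb = 4
    per_disk = payload_mb / object_mb            # 75 full-object writes/s per disk
    cluster_writes = 12 * per_disk / 3           # 12 OSDs, 3 replicas per client write
    print(per_disk, cluster_writes)              # 75.0 300.0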

Network performance during recovery

In Ceph, every write comes to the primary OSD and ends up on the secondary OSDs (with replicated pools, typically one or two of them). That means every primary OSD sends one or two times the data size to its peers. With a 10Gbit cluster (replication) network you have about 1.25GB/sec of bandwidth, or roughly 300 4MB transfers per second. With a three-replica pool (the default for a Ceph replicated pool), an OSD node can therefore perform about 150 replicated 4MB object writes per second.
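Spelled out with the same illustrative figures:

    # Replication fan-out on the cluster network, using the figures above.
    net_mb = 10_000 / 8               # 10 Gbit/s ~= 1250 MB/s
    object_mb = 4
    replicas = 3                      # primary keeps one copy, ships two to peers
    print(net_mb / object_mb)                      # ~312 object transfers/s
    print(net_mb / (object_mb * (replicas - 1)))   # ~156 replicated client writes/s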

Snapshots

Ceph provides a snapshot feature for objects. With snapshots in place, when a write hits an object that has snapshots, the OSD performs an object clone. The clone implementation depends on the objectstore backend. On XFS or ext4 it is a full object read and write (the object is read, a copy is written, then the object is modified and the transaction is committed). With large objects we hit the same disk bottleneck: the OSD reads, then writes. BTRFS uses a more sophisticated technique, block mapping via the CLONE_RANGE ioctl: the source object's blocks are assigned to the snapshot object as well, the write goes to newly allocated space, and finally the filesystem marks the old blocks as belonging only to the snapshot object.

With this scheme we get filesystem data fragmentation that eventually shows up as performance degradation. The auto-defragmentation feature masks the effect, but under a write-intensive workload autodefrag itself slows the OSD down significantly. The other side is that we avoid the bandwidth bottleneck but increase metadata operations, which moves us towards an IOPS bottleneck - HDDs are slow at random I/O, and even a metadata-only update eats a sizeable share of the device's IOPS.

And from a third side, Ceph tries to use all of the BTRFS features, such as filesystem snapshots and transactions, which degrades performance in some cases. BTRFS is also still under development and should be used with extreme care.
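To put a number on the XFS/ext4 copy-on-write path, here is a rough, illustrative calculation of what the first small write to a snapshotted 4MB object costs on disk.

    # Rough cost of the first small write to a snapshotted 4 MiB object on an
    # XFS/ext4 filestore: the OSD reads the whole object and writes a full copy
    # before applying the client write. Illustrative arithmetic only.
    object_mib = 4
    client_write_kib = 64
    moved_kib = object_mib * 1024 * 2 + client_write_kib   # read + copy + client write
    print(moved_kib / client_write_kib)   # ~129x the client payload touches the disk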

Decreasing the object size may increase or decrease performance under specific conditions such as recovery or snapshots.

What can we do?

Avoid small-size operations

Modern operating systems read device capabilities and align I/O operations to comply with device limits. Clients that access the Ceph cluster's RBD objects directly usually read the RBD image metadata and declare it to the guest OS. In that case the guest OS aligns I/O to object sizes and boundaries, so the write size ends up close to the object size and the overhead stays low. For now, there is only one production-quality hypervisor that accesses RBD directly. Yes, it's QEMU-KVM.

If your client can't access the RBD device directly (e.g. you want a virtual disk holding a filesystem shared via NFS or SMB), or you can only reach the Ceph cluster through the host (e.g. you use the Xen hypervisor), you can use the kernel RBD client. Some features are not supported by krbd, but it lets you access RBD images as ordinary block devices. It also has a bug: with object sizes above 4MB it detects the device limits incorrectly, so using a large object size on default kernels ends up in lots of 512-byte operations, even if the client sends a large enough write request.
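If you suspect this is hitting you, the limits the kernel advertises for a mapped device can be read from sysfs; a small sketch, assuming the image is mapped as /dev/rbd0:

    from pathlib import Path

    # Check the I/O limits the kernel advertises for a mapped krbd device.
    queue = Path('/sys/block/rbd0/queue')
    for attr in ('logical_block_size', 'minimum_io_size',
                 'optimal_io_size', 'max_sectors_kb'):
        print(attr, (queue / attr).read_text().strip())
    # If optimal_io_size does not match the image object size, or max_sectors_kb is
    # tiny, large client writes will be split into many small requests.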

If you use other, non-RBD-capable hypervisors such as Hyper-V or ESXi, you will need a gateway that re-exports your RBD images to the hypervisor over iSCSI/FC/IB SCSI transports. The gateway software or the hypervisor may add extra limitations that push you back down to small I/Os. In practice you need to increase the maximum transfer sizes at the very least - if that is possible at all. The coup de grâce is clustered filesystems such as VMFS, which may allocate additional space in small blocks and lead to lots of small I/Os, again.

Avoid "proxies"

Adding a "proxy" like an iSCSI-to-RBD gateways you always add a latency. Lots of this solutions are wrong when passing up RBD device characteristics like recommended IO size and others to SCSI levels. It also will result in performance degradation and latency increase. With HDD-backed OSDs the RBD latency is increased small enough relatively to device latency, but with SSD it increased twice relatively SSD latency.

Wherever possible, pass the RBD device or an RBD-backed SCSI target directly into the guest VM or other client. If your SCSI stack is good enough to pass the device characteristics through to the client level, there is a high chance the client will align and merge operations properly, so you will not see a significant performance loss.

Avoid high-density OSD nodes

A high-density node is a node with a large number of OSD devices. A high-density OSD node is a "headshot" to OSD performance during recovery (see the "Network performance during recovery" section). In the ideal case, your node's incoming network bandwidth should match the aggregate bandwidth of its OSD data devices. That is, with 6 HDDs at 150MB/sec each you need 10Gbit; with 8 SSDs at 500MB/sec each you need 40Gbit, otherwise the network becomes the bottleneck and you are stuck at its limits.
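A small helper for that matching exercise, using the same illustrative device figures:

    # Match node network bandwidth to the aggregate bandwidth of its data devices.
    def needed_gbit(devices, mb_per_sec_each):
        return devices * mb_per_sec_each * 8 / 1000       # MB/s -> Gbit/s

    print(needed_gbit(6, 150))    # 6 HDDs -> 7.2 Gbit/s: a 10 Gbit link fits
    print(needed_gbit(8, 500))    # 8 SSDs -> 32 Gbit/s: you need 40 Gbit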

Physically separate journals on HDD-backed OSDs

Separate the journal and data devices. If the journal and data share the same disk, your recovery performance is cut in half. One enterprise-class SSD should be enough to serve up to 6 HDD-backed OSDs, and you may place the OS and swap on this SSD too.

Plan your nodes, plan your ops

When buying hardware, keep in mind that the more OSDs are offline at once, the greater the performance loss. Choose hardware that does not require a node shutdown to change a disk. When starting a node, remember that you can start the OSDs one by one. Starting OSDs one by one reduces the performance degradation during recovery: a degraded PG (one OSD missing) operates at normal speed, while an unclean PG (all OSDs online, but one holding outdated object versions) operates 5 to 60 times slower.

Provide enough memory. Average memory consumption during recovery and peering is close to 2GB per OSD plus 1GB per 1TB of data. Because Ceph is software-defined storage, you also need enough CPU: for an SSD-backed OSD about 4 core-GHz (e.g. 2 cores at 2GHz each), for an HDD-backed OSD about 2 core-GHz.

With that in mind, the recommendation is a 4-8 core CPU (with HT) and 4 SSDs per node for all-flash OSD nodes, or a 4-8 core CPU with 1 SSD for system and journals for an 8xHDD OSD node, with dual-port 10/40GBit Ethernet in an active-active bond in both cases.
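A rough sizing sketch built from the rules of thumb above; the constants are starting points rather than guarantees, and the node configurations in the example are hypothetical:

    # Rough sizing helper based on the rules of thumb above.
    def node_requirements(osds, tb_per_osd, ssd_backed=True):
        ram_gb = osds * 2 + osds * tb_per_osd          # 2 GB/OSD + 1 GB per TB of data
        core_ghz = osds * (4 if ssd_backed else 2)     # CPU budget during recovery
        return ram_gb, core_ghz

    print(node_requirements(4, 2, ssd_backed=True))    # 4x 2TB SSD -> (16 GB, 16 core-GHz)
    print(node_requirements(8, 6, ssd_backed=False))   # 8x 6TB HDD -> (64 GB, 16 core-GHz)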

Use cluster flags

Sometimes it is better to temporarily "disable" cluster operations for a short time than to throw the cluster into heavy recovery. When you set the nodown flag, Ceph assumes all OSDs are still up, so writes sent to offline OSDs simply hang until those OSDs come back. This way you can shut down and start OSDs without objects becoming outdated and without falling into heavy recovery.
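For reference, the same flag can be toggled from the librados Python binding instead of the ceph osd set / ceph osd unset CLI; a minimal sketch, assuming the usual /etc/ceph/ceph.conf:

    import json
    import rados

    # Set or clear a cluster-wide OSD flag via a monitor command.
    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()

    def set_flag(flag, enable=True):
        prefix = 'osd set' if enable else 'osd unset'
        ret, outbuf, outs = cluster.mon_command(
            json.dumps({'prefix': prefix, 'key': flag}), b'')
        print(ret, outs)

    set_flag('nodown')                 # do the maintenance, then:
    # set_flag('nodown', enable=False)
    cluster.shutdown()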

Upgrade your network

Use network bonding in an active-active configuration - it will double your replication speed.

Separate the public and cluster networks, preferably physically. Try to give your replication network at least 4 times the bandwidth of the client network. Sometimes it is even better to use 1Gbit for public and 2x10Gbit for cluster, because public bandwidth is rarely the bottleneck even under normal operation, while the cluster network during recovery is always at the heart of the problem.
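The split itself is configured in ceph.conf; a minimal example, with placeholder subnets:

    [global]
        # client-facing traffic
        public network  = 192.168.1.0/24
        # replication / recovery traffic, ideally on separate and faster links
        cluster network = 10.10.10.0/24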

Adapt to your profile

Know your clients' payload profile. When using Ceph with ESXi or Windows, keep in mind that they limit the I/O size to 128 and 64 kilobytes respectively. That means that if your image uses 4MB objects, every write that triggers recovery will run 30 to 60 times slower than the same operation on a clean cluster. You can't change an image's object size on the fly, so plan a smaller object size before creating the image. And keep in mind that smaller objects increase RAM consumption.

Scale out

Scale out! Ceph is a scale-out system, so it is better to use 10 devices operating at X ops/sec each than 5 devices operating at 2X. And of course it also helps with recovery, since it roughly halves the latency while recovering.

Experimental features

Most of what we have discussed relates to recovery rather than to snapshots. When using snapshots, keep in mind that you run into disk bandwidth and disk latency limits. In that case, refer back to "Avoid small-size operations" and, by extension, "Avoid proxies", and don't forget the disk performance limits when planning your QoS.

Artemy Kapitula, software developer at Mail.RU Cloud Solutions
artemy.kapitula@gmail.com