DRBD: For When it Absolutely, Positively, Has to be in Sync

by Ostatic Staff - Aug. 25, 2010

DRBD is an acronym that stands for Distributed Replicated Block Device, and as the name implies, it is used to replicate a block device between two servers. DRBD was designed for use in “High Availability” (HA) clusters, and is conceptually similar to a RAID 1 (mirroring) setup.
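To make the picture concrete, here is a minimal sketch of what a DRBD resource definition might look like. The hostnames, addresses, and disk names are hypothetical examples, not a drop-in configuration:

    # /etc/drbd.conf (sketch) -- one replicated resource, "r0"
    resource r0 {
      protocol C;               # fully synchronous: a write completes only
                                # once both nodes have it on disk
      device    /dev/drbd0;     # the block device applications actually use
      disk      /dev/sdb1;      # hypothetical backing disk on each node
      meta-disk internal;       # keep DRBD metadata on the backing disk

      on alpha {                # hypothetical node names and addresses
        address 10.0.0.1:7788;
      }
      on bravo {
        address 10.0.0.2:7788;
      }
    }

Protocol C is what gives DRBD its “absolutely, positively in sync” character: the primary does not acknowledge a write until the secondary has it too.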

Let’s say you have two servers and one service, a MySQL database for instance, that you want to keep up even if the server it is on crashes. Without some form of HA, if the server hosting MySQL goes away, so does the MySQL database. To provide HA, DRBD inserts itself into the IO stack and proxies all block-level operations, writing the data to the local disk and to the disk on the second server simultaneously. So, when the time comes to fail over to the standby server, a script moves the IP address over from the primary, promotes the standby to the primary DRBD role, mounts the DRBD-backed filesystem, and starts MySQL. Since everything is replicated, no data is lost during the failover.
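A bare-bones failover script for the standby node might look something like this sketch; the resource name, mount point, and floating IP are assumptions carried over from the example above, and a real cluster manager would add fencing and error handling:

    #!/bin/sh
    # Hypothetical manual failover, run on the standby node.
    drbdadm primary r0                   # take over the primary DRBD role
    mount /dev/drbd0 /var/lib/mysql      # mount the replicated filesystem
    ip addr add 10.0.0.100/24 dev eth0   # bring up the floating service IP
    /etc/init.d/mysql start              # start the service on this node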

I honestly think that DRBD is ingenious. The replication happens in real time, and in my experience causes little to no performance degradation. DRBD, when combined with a floating IP address and any available heartbeat monitoring script, can make nearly any application highly available, as long as that application can recover cleanly from a crash.

DRBD is a loadable kernel module, and as of kernel 2.6.33 it is included in the mainline kernel. If you are running a Linux server on kernel 2.6.33 or later, you already have DRBD available, and at most will need to install the userspace management tools. If your kernel is older than that, you will need to download both the tools and the loadable module to get it running.
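On a recent kernel, loading the module and checking that it registered is typically as simple as the following; the package name for the tools varies by distribution (drbd8-utils on Debian and Ubuntu of this era, for example):

    modprobe drbd                 # load the in-kernel module (2.6.33+)
    cat /proc/drbd                # shows the module version and resource status
    apt-get install drbd8-utils   # Debian/Ubuntu example; package names vary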

There are a few important things to know that DRBD will not help with. If there is corruption in the filesystem, DRBD will happily replicate the corruption between nodes. This is because DRBD has no knowledge of what is happening farther up the IO stack, and therefore has no way of detecting such corruption. Further, DRBD cannot provide instant failover between nodes. Failover is fast, but it’s not on the same level as MySQL Cluster. If you are using tools like Heartbeat and Mon, or Pacemaker, these systems must first detect that a failure has occurred, then move the IP address, take over the primary DRBD role, mount the filesystem, and start the service. These steps take time; not a lot, but it might be noticeable, depending on the sensitivity of your environment.
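For example, with classic Heartbeat that whole takeover chain can be described in a single haresources line; the node name, floating IP, and mount point here are the same hypothetical values used above:

    # /etc/ha.d/haresources (sketch): on failover, Heartbeat brings up the
    # floating IP, promotes DRBD via the drbddisk script, mounts the
    # replicated device, and then starts MySQL -- in that order.
    alpha 10.0.0.100 drbddisk::r0 Filesystem::/dev/drbd0::/var/lib/mysql::ext4 mysql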

If you are responsible for a service that is important enough to warrant some form of high availability, DRBD may well be worth a look. DRBD is open source, licensed under the GPL, and commercial support is provided by Linbit. Do you have any experience with DRBD, good or bad? How has it changed how you look at high availability and disaster recovery? I’d love to hear your stories in the comments!