Interview: Mesosphere's Ben Hindman on the Need for a Data Center OS

by OStatic Staff - Jan. 19, 2015

One of the most interesting new companies leveraging an open source Apache project has to be Mesosphere, which OStatic covered in a recent post. The company offers a “data center operating system” (DCOS) built on the open source Apache Mesos project, and has announced a recent round of $36M in Series B funding. New investor Khosla Ventures led the round, with additional investments from Andreessen Horowitz, Fuel Capital, SV Angel and others.

According to Mesosphere’s leaders, the tech industry now needs a new type of operating system to automate the various tools used in the agile IT era. They argue that developers and operators don’t need to focus on individual virtual or physical machines, but can instead easily build and deploy applications and services that span entire datacenters.

OStatic caught up with former Twitter lead engineer and Apache Mesos co-creator Ben Hindman, who is now leading the design of Mesosphere’s DCOS, for an in-depth interview. Here are his thoughts.

What advantages can organizations get from a data center operating system?

The biggest advantages come from automating manual operations. The number of machines that most enterprises are working with is growing, and so is the variety of services and frameworks they’re trying to run. Organizations are under immense pressure to deliver software faster, with more “agility”. Together, these pressures have made static partitioning and human-scale management of machines and applications impractical.

Humans will always have a role in the datacenter, but things should be more automated with common services. Automation enables us to be smarter about scheduling and resource allocation, helping us drive up utilization (which drives down costs) and better handle machine and hardware failures.

Higher utilization is a key advantage of a datacenter operating system. If you’re in the cloud, you might be buying 8-core machines but only using 2 cores. Your cloud provider is really the one benefiting from virtualized resources, not you! The datacenter operating system enables you to more fully utilize your machines by automating the placement of your applications across your machines, using as many resources as it can per machine.
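To make the utilization point concrete, here is a minimal sketch of automated placement. It is not how Mesos schedules work; it is a simple first-fit packer, and the machine sizes, app names and the `Machine`, `place` and `utilization` helpers are all hypothetical. It compares running one 2-core app per 8-core machine with packing the same apps as tightly as possible.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    """A machine with a fixed number of cores (hypothetical model)."""
    name: str
    cores: int
    used: int = 0
    tasks: list = field(default_factory=list)

    def free(self) -> int:
        return self.cores - self.used


def place(tasks, machines):
    """First-fit: put each (name, cores) task on the first machine with room."""
    unplaced = []
    for task_name, need in tasks:
        for m in machines:
            if m.free() >= need:
                m.used += need
                m.tasks.append(task_name)
                break
        else:
            # No machine had enough free cores for this task.
            unplaced.append(task_name)
    return unplaced


def utilization(machines):
    """Fraction of cores in use across the given machines."""
    return sum(m.used for m in machines) / sum(m.cores for m in machines)


apps = [("app-a", 2), ("app-b", 2), ("app-c", 2), ("app-d", 2)]

# Manual approach: one app per 8-core machine, so four machines are paid for.
static = [Machine(f"m{i}", 8, used=2, tasks=[apps[i][0]]) for i in range(4)]

# Automated approach: pack the same apps onto as few machines as possible.
packed = [Machine(f"m{i}", 8) for i in range(4)]
unplaced = place(apps, packed)
busy = [m for m in packed if m.used > 0]

print(f"static: {len(static)} machines, {utilization(static):.0%} utilized")
print(f"packed: {len(busy)} machine(s), {utilization(busy):.0%} utilized")
```

With these assumed numbers, the packed layout needs one machine instead of four. A real scheduler would also weigh memory, disk, ports and failure domains, none of which are modelled here.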

Dealing with failures gets much easier with a datacenter operating system too. When you are running 2-3 machines, dealing with failures is a pain, but you can usually track down and fix any issues in a short time. But when you begin to scale to tens, then hundreds, then thousands of machines, dealing with failures becomes an expensive manual operation.

Finally, a datacenter operating system enables developers, who traditionally have had to interface with humans for access to machines, to develop and run their applications directly against datacenter resources via an API. Whether they’re claiming resources for existing applications or building new frameworks, the abstraction layer of the datacenter operating system makes it easier to build applications and share those applications across organizations.

Obviously Mesosphere's platform is based on Apache Mesos, but it's more complex than just Mesos. Tell us about the guts of the platform and how it was developed.  

The guts really are Mesos, which acts as the kernel for the distributed operating system. It provides the basic primitives of the DCOS, including starting and stopping applications - and the bridge between applications and the hardware.

What was built around Mesos and packaged into the Mesosphere DCOS are the other components that you would expect of an operating system. For example, the DCOS includes Marathon which acts as the distributed “init” system. Marathon uses the Mesos kernel to automatically launch and scale specific system and user applications. In addition to Marathon, the Mesosphere DCOS includes Chronos which provides distributed cron, i.e., the ability to launch applications on a regularly scheduled basis. The Mesosphere DCOS includes a Service Discovery component as well - a way of naming and finding specific applications as they are moved around in your datacenter or cloud.
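As a rough illustration of the “init” and “cron” roles described above, the sketch below builds the kind of JSON payloads that Marathon and Chronos accept. The field names reflect the Marathon app definition and Chronos job format as commonly documented, but treat them as assumptions to check against the documentation for your version. The app name, commands and owner address are hypothetical.

```python
import json

# Hypothetical long-running service: Marathon is asked to keep three
# instances alive, each with a quarter of a CPU and 64 MB of memory.
marathon_app = {
    "id": "/hello-web",
    "cmd": "python3 -m http.server 8080",
    "cpus": 0.25,
    "mem": 64,
    "instances": 3,
}

# Hypothetical scheduled job: Chronos is asked to run a script once a day.
# The schedule is an ISO 8601 repeating interval: repeat forever ("R"),
# starting at the given time, every 24 hours ("PT24H").
chronos_job = {
    "name": "nightly-cleanup",
    "command": "/usr/local/bin/cleanup.sh",
    "schedule": "R/2015-01-20T02:00:00Z/PT24H",
    "epsilon": "PT30M",  # assumed: how late the job may still start
    "owner": "ops@example.com",
}

# Each payload would be submitted to the corresponding scheduler's REST API;
# here they are only printed, so the sketch runs without a cluster.
print(json.dumps(marathon_app, indent=2))
print(json.dumps(chronos_job, indent=2))
```

The point of the comparison is the division of labor: the operator declares what should run and how much it needs, and the scheduler decides where it runs on top of Mesos.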

There are a number of other components we’ve built in related to storage, managing containers, and other functionality that we view as key for running the next generation of distributed applications. And as with any other successful operating system, a huge focus for its evolution will be expanding the library of applications and frameworks that are natively supported.

In 2015, what do you think are the major trends we'll see in data centers?

Operators will stop thinking in terms of individual servers and start reasoning across pools of resources and running distributed applications.

Some particularly interesting distributed applications will fall under the domain of “stateful services”, which are challenging to run in the cloud today and are ripe for innovation in the next few years.

There will be a lot of interesting work using machine learning to better automate and manage applications as well. Humans are notoriously bad at figuring out how many resources they need, and that sizing will ultimately be handled entirely by software.

From the hardware side of things, I think we’ll start to hear more about concepts like disaggregated racks - where racks become like a big single computer. But we also see a trend towards the completely disaggregated datacenter. There are a number of scenarios where transporting the compute makes little sense, where you want to instead do local processing. Cell towers might have a mini datacenter, so you don’t have to send the data back to the cloud, for example.

I've heard you talk about some data centers needing to do things like run multiple instances of Hadoop, and other tools. Why would such needs arise?

Primarily, you want to run multiple instances because you have different organizations in your company and you want to create isolation. Most organizations that run multiple instances do so by creating a whole separate cluster, which they have to set up and then run independently. The problem here, however, is that you’ll often have large pools of idle resources in one cluster while another cluster might be completely overloaded. Using something like Mesos lets you run those two instances of Hadoop on the same hardware!
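The idle-versus-overloaded problem can be shown with a little arithmetic. The sketch below uses made-up slot counts for two hypothetical teams and compares two separate clusters with one shared pool of the same total size. It does not model isolation between the two instances, which a real shared cluster would still need to enforce.

```python
def unserved(capacity, demand):
    """Demand (in task slots) that the given capacity cannot serve."""
    return max(0, demand - capacity)


def idle(capacity, demand):
    """Capacity (in task slots) left unused by the given demand."""
    return max(0, capacity - demand)


# Hypothetical numbers: each team owns 100 slots, but demand is uneven.
capacity = {"analytics": 100, "ads": 100}
demand = {"analytics": 30, "ads": 150}

# Separate clusters: each team can only use its own slots.
static_unserved = sum(unserved(capacity[t], demand[t]) for t in capacity)
static_idle = sum(idle(capacity[t], demand[t]) for t in capacity)

# Shared pool: both Hadoop instances draw from the same 200 slots.
total_capacity = sum(capacity.values())
total_demand = sum(demand.values())
shared_unserved = unserved(total_capacity, total_demand)
shared_idle = idle(total_capacity, total_demand)

print(f"separate clusters: {static_unserved} slots unserved, {static_idle} idle")
print(f"shared pool:       {shared_unserved} slots unserved, {shared_idle} idle")
```

With these assumed figures, the separate clusters leave 50 slots of work waiting while 70 slots sit idle. The shared pool serves all of the work and leaves only 20 slots idle.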

Another reason organizations will have multiple instances of Hadoop is when they want to upgrade from one version of Hadoop to another, which usually is performed in a completely new cluster. This is an expensive way to upgrade a Hadoop cluster, but there aren’t many other options out there!

Can you provide some anecdotal detail about a particular organization that is benefiting from Mesosphere's DCOS? How are efficiencies being captured there?

The Mesosphere DCOS was just launched, and we’ll be sharing lighthouse customer success stories in 2015. But I think a really good example of a compelling Mesos story is how eBay was able to pool its Jenkins instances. That’s an example of an organization that had to run multiple instances of a framework (Jenkins) and leveraged Mesos to colocate those instances on a single cluster.