Eucalyptus Cloud Originator Rich Wolski on the Cloud and Big Data
All the way back in early 2008, OStatic broke the news about Eucalyptus, an open source infrastructure for cloud computing on clusters that duplicated the functionality of Amazon's EC2 and worked with Amazon's command-line tools. The project resided at the University of California, Santa Barbara, and was driven and overseen by Rich Wolski, a professor there.
Eucalyptus, of course, was one of the earliest open cloud platforms to take seriously, and it gave rise to Eucalyptus Systems, where Wolski also found a home working on the commercial arm of the Eucalyptus effort. After a productive run on that front, he has returned to his role as a Professor of Computer Science at U.C. Santa Barbara, where he quietly carries one of the deepest wells of personal experience with the open cloud to be found anywhere.
OStatic caught up with Rich once more for a few of his thoughts on the cloud scene and what emerging projects he finds interesting. Here are his thoughts.
The Current Scene
Was Wolski surprised by the rapid rise of commercial open cloud efforts? “I can’t say that I’m surprised at what we’ve seen with open source cloud adoption,” he said. “It really seemed like the next step, and solved so many problems.”
He acknowledges that challenges remain in making popular open cloud platforms robust, but feels that in areas such as security, solid efforts are being made. “If you’re going to build a giant medium for shared information, security has got to be part of it,” said Wolski. “And it’s a challenge to do it right. However, I don’t actually think cloud security is any worse than any other type of platform security.”
In fact, Wolski added that the Eucalyptus cloud platform has undergone availability testing that showed it failing less often than Linux itself.
Analytics and the Ecosystem
Wolski not only has extensive experience with the open cloud, but he teaches courses on operating systems and has a lot to say on the topics of data analytics, machine learning and the Internet of Things (IoT). He said some interesting work in these areas is going on at UCSB.
“Certainly, cloud platforms including Eucalyptus, OpenStack and CloudStack remain interesting, but some of the most compelling work is going on above and around them,” he said. “Hadoop and Spark remain very interesting, and there are lots of integration tools surrounding them. There are also a lot of interesting tools that can be compared to or improve on MapReduce.”
“I think there are good questions to ask about MapReduce, and there are good questions to ask about what is referred to as ‘batch’ processing,” Wolski said. “Hadoop takes a large, stable set of data and runs calculations over it, and that’s not going away. The need to do that is not necessarily replaced by streaming or anything else.”
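The batch model Wolski describes, running a calculation over a large, stable dataset, is the classic MapReduce pattern. As a rough illustration (a minimal in-process sketch, not actual Hadoop code), a word count splits into a map phase that emits key/value pairs and a reduce phase that aggregates them:

```python
from collections import defaultdict

def map_phase(documents):
    # Emit (word, 1) pairs, as a mapper would for each input record.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Group by key and sum the counts, as reducers do after the shuffle.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["the cloud", "the open cloud"]
print(reduce_phase(map_phase(docs)))  # {'the': 2, 'cloud': 2, 'open': 1}
```

In a real Hadoop job the map and reduce phases run in parallel across the cluster, with a shuffle stage routing each key to one reducer; the single-process version above only shows the data flow.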
“But I think the question for the Hadoop community is whether that’s the right model or whether they want to go with other ways to do things, such as leveraging a graph,” he added. “There are ways to do the computation and analytics that are graphical, that use a different kind of internal representation that basically leverages a mathematical graph. Some people in the analytics community may get a lot out of these.”
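A canonical example of the graph-based computation Wolski mentions is PageRank, where the internal representation is literally a mathematical graph and the analytic is an iteration over its edges. The sketch below is a toy single-machine version (the edge data, dangling-node handling, and iteration count are illustrative assumptions, not any particular framework's API):

```python
def pagerank(edges, damping=0.85, iters=20):
    # edges: dict mapping each node to its list of outgoing neighbors.
    nodes = set(edges) | {n for outs in edges.values() for n in outs}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for src, outs in edges.items():
            if outs:
                # Each node shares its rank equally among its out-edges.
                share = damping * rank[src] / len(outs)
                for dst in outs:
                    new[dst] += share
            else:
                # Dangling node: spread its rank evenly (one common convention).
                for n in nodes:
                    new[n] += damping * rank[src] / len(nodes)
        rank = new
    return rank

ranks = pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]})
```

Graph-parallel systems distribute exactly this kind of per-edge update across a cluster, which is why they can outperform a MapReduce formulation of the same algorithm.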
He points to much interest in but also unanswered questions in the area of streaming analytics. “Lots of people don’t really know yet exactly what they can do with a streaming, distributed analytics model,” he said. “People haven’t worked out the precise way to reason about what you’re getting with the streaming model. Enterprises will ask about how reliable these are.”
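Part of what makes the streaming model harder to reason about is that results are incremental: at any moment you have an answer over the events seen so far, not over a fixed dataset. A basic building block is a sliding-window aggregate, sketched here as a toy running mean (the class name and window size are illustrative, not from any streaming framework):

```python
from collections import deque

class SlidingWindowMean:
    # Maintain the mean over the most recent `size` events --
    # a simple example of a streaming, incrementally updated analytic.
    def __init__(self, size):
        self.window = deque(maxlen=size)
        self.total = 0.0

    def update(self, value):
        if len(self.window) == self.window.maxlen:
            # The deque will evict the oldest value; subtract it first.
            self.total -= self.window[0]
        self.window.append(value)
        self.total += value
        return self.total / len(self.window)

w = SlidingWindowMean(3)
for v in [1, 2, 3, 4]:
    m = w.update(v)
print(m)  # 3.0, the mean of the last three events [2, 3, 4]
```

The enterprise reliability questions Wolski raises come in when windows span machines: a distributed system must decide what the window means under late, duplicated, or lost events.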
Wolski also cited in-memory versus out-of-memory tools and NoSQL databases as interesting. In particular, he took note of the Tachyon project. Tachyon is a Hadoop-compatible, memory-centric distributed file system that enables reliable file sharing at memory-speed across cluster frameworks, such as Spark and MapReduce.
Tachyon achieves high performance by leveraging lineage information and using memory aggressively. It caches working set files in memory, thereby avoiding going to disk to load datasets that are frequently read. This enables different jobs and queries to access cached files at memory speed.
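The caching idea behind that speedup can be shown with a toy read-through cache: the first read of a file hits disk, and every later read is served from memory. This is only a sketch of the concept, not Tachyon's actual API, and the class and counter names are invented for illustration:

```python
import os
import tempfile

class MemoryCachedStore:
    # Toy read-through cache: keep files read once in memory so
    # repeated reads of a working set never touch the disk again.
    def __init__(self):
        self.cache = {}
        self.disk_reads = 0

    def read(self, path):
        if path not in self.cache:
            with open(path, "rb") as f:
                self.cache[path] = f.read()
            self.disk_reads += 1  # only cache misses reach the disk
        return self.cache[path]

# Write a small "dataset" to a temporary file, then read it twice.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"dataset")
    path = f.name

store = MemoryCachedStore()
store.read(path)
store.read(path)
print(store.disk_reads)  # 1: the second read was served from memory
os.remove(path)
```

Tachyon additionally uses lineage (a record of how each file was computed) so that cached data lost from memory can be recomputed rather than persisted synchronously, which is what lets it stay both fast and reliable.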
Wolski also pointed out Grappa, a project that effectively makes an entire cluster look like a single, powerful, shared-memory machine. Unlike classic distributed shared memory (DSM) systems, Grappa does not require spatial locality or data reuse to perform well.
For more on our conversation with Wolski, see part two of our interview with him in an upcoming post here on OStatic. And, for more in our running series of interviews with project leaders working on the cloud and Big Data, see our talks with Ben Hindman from Mesosphere, Tomer Shiran of the Apache Drill project, and Mirantis co-founder Boris Renski.