Interview: The Team Behind Grappa Discusses Next-Gen Big Data Analytics
There are a lot of folks out there working on new ways to glean meaningful insights from data stores, and many of them are working with data that lives on clusters of commodity hardware. That puts a premium on affordable data-centric approaches that can improve on the performance and functionality of MapReduce, Spark, and many other tools.
One of the most interesting new tools in this area is the open source Grappa project, which scales data-intensive applications on commodity clusters and offers a new type of abstraction that can beat classic distributed shared memory (DSM) systems. In fact, Rich Wolski, founder of the Eucalyptus cloud project, enthusiastically pointed to Grappa as a very interesting project in our recent interview with him.
We caught up with some of the leaders behind Grappa, who are based at the University of Washington, for an interview. They are Luis Ceze, Jacob Nelson, and Mark Oskin (see the photos at the bottom of this post). Here are their thoughts.
How did Grappa come to be?
Univ. Washington Team: One of our colleagues here worked at Cray for a while, and about three years ago he worked closely with a large-scale shared memory machine. What that machine was capable of was interesting, but it was also very expensive. So we asked whether we could take some of the ideas that the Cray machine raised, and the analytics it was capable of, and make them work on off-the-shelf commodity hardware. That's how Grappa started.
What does Grappa do, and what kind of organization can benefit from it?
Univ. Washington Team: Grappa helps accelerate in-memory analytics computation. Specifically, we're exploring techniques that improve worst-case performance for these applications. Applications like graph analytics involve a lot of low-locality access patterns. If you just apply standard analytics approaches to commodity clusters, you incur a lot of overhead at each step of the computation. Grappa tries to handle the random access that goes on there better. It also provides an easier programming model for distributed memory machines.
How else does it differ from traditional distributed shared memory systems?
Univ. Washington Team: We’re providing some of the same abstractions that you see in distributed shared memory systems, but taking a very different approach. Those systems took the standard technique of exploiting locality for performance and depended on it. We take the opposite approach: rather than caching data, we’ve reduced the cost of migrating operations around the cluster.
In the 1990s there was a lot of research on shared memory systems. The way those worked was that you would hook into the processor’s default data-handling mechanisms and then move pages around in software, leveraging a software-based cache. On a cluster, for those systems to have any kind of performance, your applications had to have very high locality. Grappa’s philosophy is to avoid hooking into the processor in that way and instead use modern language abstractions. When a task goes to access remote memory, we can intelligently context switch to another task while the request is in flight.
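The two ideas in that answer — shipping the operation to the node that owns the data rather than caching the data locally, and switching to another task while a remote request is outstanding — can be illustrated with a minimal conceptual sketch. This is not Grappa's actual C++ API; the names (`delegate_call`, `owner`, the node/shard layout) are hypothetical stand-ins, and Python's `asyncio` plays the role of Grappa's lightweight task scheduler.

```python
# Conceptual sketch only (Grappa itself is a C++ runtime; these names are
# invented for illustration). Instead of pulling remote data to the caller,
# we "delegate" the operation to the node that owns the data, and the await
# point is where the runtime context switches to another ready task.
import asyncio

NODES = 4
SHARD = 1000
# Each simulated "node" owns a contiguous shard of one global array.
memory = {n: [0] * SHARD for n in range(NODES)}

def owner(addr):
    """Map a global address to (owning node, local offset)."""
    return addr // SHARD, addr % SHARD

async def delegate_call(addr, fn):
    """Run fn on the cell at addr, on the node that owns it.
    The await models network latency; while one task waits here,
    the scheduler runs other tasks instead of stalling."""
    node, offset = owner(addr)
    await asyncio.sleep(0)  # stand-in for the message round trip
    memory[node][offset] = fn(memory[node][offset])
    return memory[node][offset]

async def main():
    # Many small concurrent tasks hide remote-access latency: while one
    # waits on the "network," the others make progress.
    tasks = [delegate_call(a, lambda x: x + 1)
             for a in range(0, NODES * SHARD, 7)]
    return await asyncio.gather(*tasks)

results = asyncio.run(main())
print(results[:5])  # each touched cell was incremented exactly once
```

The key design point is that only a small message (the operation) crosses the network, not a cache line or page, so low-locality workloads avoid the coherence traffic that sank 1990s software DSM systems.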
Do you think people should be thinking beyond MapReduce, Spark and other powerful tools being used in the data analytics space?
Univ. Washington Team: What we’ve observed when we look at systems like Hadoop and Spark is that they end up building their own optimized stack all the way from the operating system up to whatever the user programming abstraction is. We’ve spent some time observing how Grappa can be used as a kind of optimized substrate for these programming models.
Specifically, we’ve looked at implementing a subset of MapReduce on top of Grappa and a subset of Spark, focusing on the query-processing platform. Basically, we’ve looked at providing familiar programming abstractions on top of Grappa, making it easier to work with new approaches to analytics.
Grappa is a square peg, but there are a lot of round holes out there. When software developers hit the limits of MapReduce, they can end up implementing workarounds that start to crumble. What we can do is give them a MapReduce abstraction that lets them keep their familiar models while leveraging some of the powerful results that Grappa can help them get.
Editor's Note: This story is the latest in a series of interview pieces with project leaders working on the cloud, Big Data, and the Internet of Things. The series has included talks with Rich Wolski, founder of the Eucalyptus cloud project; Ben Hindman from Mesosphere; Sam Ramji from the Cloud Foundry Foundation; Tomer Shiran of the Apache Drill project; Philip DesAutels, who oversees the AllSeen Alliance; Tomer Shiran on MapR and Hadoop; and Boris Renski, co-founder of Mirantis.