Apache Spark Gets Billed as the Next Big Data Thing

by Ostatic Staff - Jul. 31, 2014

People in the Big Data and Hadoop communities are becoming increasingly interested in Apache Spark, an open source cluster computing framework for data analytics originally developed in the AMPLab at UC Berkeley. According to Apache, Spark can run programs up to 100 times faster than Hadoop MapReduce in memory, or ten times faster on disk. When crunching large data sets, those are big performance differences.

Among vendors making moves around Spark, Cloudera has made several notable announcements recently. The Hadoop-focused company announced Apache Spark training "to prepare developers and software engineers to build complete, unified applications that combine batch, streaming, and interactive analytics."

"Broadly embraced by the open source community, Big Data vendors, and data-intensive enterprises for its stream processing capabilities and its support for complex, iterative algorithms, Spark offers performance gains that enable applications to run on the data in a Hadoop cluster at speeds up to 100 times faster than traditional MapReduce programs," Cloudera claims.

Cloudera already offers commercial support for Spark as part of its Cloudera Enterprise subscription, and the company recently announced a collaboration with Databricks, IBM, Intel, and MapR to broaden support for Spark as the standard data processing engine for the Hadoop ecosystem.

"Spark offers clear benefits for realizing sophisticated analytics and is quickly becoming the future of data processing on Hadoop," said Sarah Sproehnle, vice president, Education Services, Cloudera, in a statement. "With Spark, customers can realize immediate business advantages. For example, Spark Streaming enables businesses to process live data as it arrives in the enterprise data hub, rather than having to wait to batch-process it later. The fact that the same codebase can be used for streaming data and data-at-rest significantly reduces development time for Big Data applications, speeding up time-to-insight by several orders of magnitude and decreasing the need for expensive specialized systems. This is just one case where the benefits of Spark have a direct impact on a company's bottom line." 

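The shared-codebase point in that statement can be made concrete with a minimal Python sketch. This is not Spark code and does not use Spark's API; the function names (count_words, run_batch, micro_batches, run_streaming) are hypothetical and exist only for illustration. The idea is simply that one piece of logic is written once and then applied both to a complete data set and to small chunks of data as they arrive.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def count_words(lines: Iterable[str]) -> Counter:
    """Shared logic: count whitespace-separated words in a collection of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def run_batch(lines: List[str]) -> Counter:
    """Data at rest: process the whole data set in one pass."""
    return count_words(lines)


def micro_batches(lines: Iterable[str], size: int) -> Iterator[List[str]]:
    """Hypothetical stand-in for a live feed: yield small chunks as they 'arrive'."""
    chunk: List[str] = []
    for line in lines:
        chunk.append(line)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        # Emit whatever is left over as a final, smaller chunk.
        yield chunk


def run_streaming(lines: Iterable[str], size: int = 2) -> Counter:
    """Streaming: apply the same count_words to each chunk and merge the results."""
    running = Counter()
    for chunk in micro_batches(lines, size):
        running.update(count_words(chunk))
    return running


if __name__ == "__main__":
    data = ["spark is fast", "hadoop is big", "spark streaming", "big data"]
    # Both paths reuse count_words, so the totals should agree.
    print(run_batch(data) == run_streaming(data))
```

In this sketch, only the thin wrappers differ between the batch and streaming paths, which is the kind of reuse the Cloudera statement describes. How Spark itself achieves this is not shown here.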
Some are actually calling Apache Spark "the next big thing in Big Data." According to a post by John Furrier:

"What is the next big thing in #bigdata?  It’s called Spark. Spark is a fast data analysis engine. Think Hadoop MapReduce, but 100x faster and still fully interoperable with the wider Hadoop ecosystem. Spark has the largest open-source development community in the Big Data space, after Hadoop MapReduce, with over 90 developers from 25 companies contributing code."

More information about Spark is available from the Apache project, including release notes on a brand new version that arrived a week ago.

We also previously covered Cloudera's work with Intel and partners to deliver Hadoop appliances leveraging Apache Spark. In an announcement, Cloudera, Dell and Intel said they are launching a dedicated Dell In-Memory Appliance for Cloudera Enterprise, to be known as Dell Engineered Systems for Cloudera Enterprise. It is essentially an integrated appliance meant to make advanced Hadoop-driven analytics easy to implement in data centers, with Spark integration supplying the performance.