Cloudera and Others Rally Behind Hadoop Challenger Spark

by Ostatic Staff - Dec. 08, 2014

Folks in the Big Data and Hadoop communities are becoming increasingly interested in Apache Spark, an open source data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley. We've covered Spark before, and some reports are characterizing it as a tool that could supplant Hadoop in many enterprises.

According to Apache, Spark can run programs up to 100 times faster than Hadoop MapReduce in memory, and up to 10 times faster on disk. When crunching large data sets, those are big performance differences.

Among the vendors that have recently made moves around Spark, Cloudera has made a number of notable announcements. The company, which focuses on Hadoop, announced Apache Spark training "to prepare developers and software engineers to build complete, unified applications that combine batch, streaming, and interactive analytics."

As IDG News Service reports:

"Spark is an engine for analyzing data stored across a cluster of computers. Like Hadoop, Spark can be used to examine data sets that are too large to fit into a traditional data warehouse or a relational database. Also like Hadoop, Spark can work on unstructured data, such as event logs, that hasn't been formatted into database tables. Spark, however, goes beyond what Hadoop can easily do, in that it can analyze streaming data as it is coming off the wire."

Spark's ability to work with unstructured data is particularly notable. Many enterprises haven't been able to fully structure their data sources and need tools flexible enough to work with unstructured archives.

"Broadly embraced by the open source community, Big Data vendors, and data-intensive enterprises for its stream processing capabilities and its support for complex, iterative algorithms, Spark offers performance gains that enable applications to run on the data in a Hadoop cluster at speeds up to 100 times faster than traditional MapReduce programs," Cloudera claims.

Spotify uses Spark, as do a number of other enterprises.

You can find out more about Spark here. We also covered Cloudera's work with Intel and partners to deliver Hadoop appliances leveraging Apache Spark here. In an announcement, Cloudera, Dell and Intel said they are launching a dedicated Dell In-Memory Appliance for Cloudera Enterprise, to be known as Dell Engineered Systems for Cloudera Enterprise. It is essentially an integrated appliance intended to make advanced Hadoop-driven analytics easy to implement in data centers, with Spark integration supplying the processing power.

Spark is shaping up as one of the bigger open source stories for 2015.