Apache Spark Heavyweights Make Their Next Moves
Two of the most prominent companies advancing the Apache Spark Big Data toolset are out with new releases. MapR has announced the immediate availability of Apache Spark 1.6.1 on the MapR Converged Data Platform. The company also noted that the free, online Spark On Demand Training (ODT) courses via MapR Academy have achieved the highest course enrollment rate since the ODT program’s initial launch.
“We have seen a significant customer adoption of Spark for building data pipelines and advanced analytics,” said Anoop Dawar, vice president of product management, Spark and Hadoop, MapR Technologies. “MapR has fully supported the Spark stack for two years – more than any other vendor in this industry. Based on customer feedback MapR provides early preview releases so data scientists and developers can try cutting edge features and then follows it up with a GA release for production deployments.”
Spark continues to attract significant interest from developers, and MapR claims that 30% of its course registrants have already become MapR Certified Spark Developers.
Apache Spark version 1.6.1 from MapR includes:
Performance gains in the core Spark engine. With automatic memory management in Spark 1.6.1, the split between execution memory and storage memory adjusts dynamically based on workload characteristics. Execution memory can now borrow available memory from the storage region, and vice versa.
Persistence of machine learning pipelines. Spark 1.6.1 extends machine learning persistence beyond models to the entire pipeline, including transformers and estimators. A complete workflow can be saved and reloaded without writing custom code for exporting or importing.
Dataset API. Spark 1.6.1 introduces a new experimental interface, the Dataset API, which extends the DataFrames API. Datasets rely on encoders and can be used from both Scala and Java, with Python support to be added in future releases. The biggest benefit of the new Dataset API is reduced memory usage, since it can create a more efficient layout in memory when caching datasets.
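To make the memory borrowing described in the first item concrete, here is a minimal toy model in Python. It is not Spark's implementation: the class name UnifiedMemoryPool, its methods, and the numbers are hypothetical, chosen only to illustrate the idea that storage can grow into free memory while execution can reclaim cached space down to a protected floor (a role roughly comparable to the spark.memory.storageFraction setting).

```python
class UnifiedMemoryPool:
    """Toy model of one memory region shared by execution and storage.

    Hypothetical illustration only; names and sizes are assumptions,
    not Spark internals.
    """

    def __init__(self, total, storage_floor):
        self.total = total
        # Cached data at or below this amount is never evicted by execution.
        self.storage_floor = storage_floor
        self.execution_used = 0
        self.storage_used = 0

    def free(self):
        return self.total - self.execution_used - self.storage_used

    def acquire_storage(self, amount):
        """Storage may take any free memory but never evicts execution."""
        granted = min(amount, self.free())
        self.storage_used += granted
        return granted

    def acquire_execution(self, amount):
        """Execution takes free memory first, then evicts cached data
        that sits above the protected storage floor."""
        shortfall = amount - self.free()
        if shortfall > 0:
            evictable = max(0, self.storage_used - self.storage_floor)
            self.storage_used -= min(shortfall, evictable)
        granted = min(amount, self.free())
        self.execution_used += granted
        return granted


pool = UnifiedMemoryPool(total=1000, storage_floor=500)
cached = pool.acquire_storage(800)       # storage borrows past its floor
granted = pool.acquire_execution(400)    # execution evicts 200 units of cache
print(cached, granted, pool.storage_used, pool.execution_used)
```

In this sketch a cache-heavy workload can temporarily occupy most of the region, and a later shuffle or aggregation claws space back without any manual tuning, which is the behavior the release notes describe in general terms.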
Meanwhile, it’s been two years since Apache Spark 1.0 was released, and Databricks is providing a preview of what is to come in version 2.0:
"We are happy to announce the availability of the Apache Spark 2.0 technical preview in Databricks Community Edition today. This preview package is built using the upstream branch-2.0....
According to our 2015 Spark Survey, 91% of users consider performance as the most important aspect of Spark. As a result, performance optimizations have always been a focus in our Spark development. Before we started planning for Spark 2.0, we asked ourselves a question: Spark is already pretty fast, but can we push the boundary and make Spark 10X faster?
This question led us to fundamentally rethink the way we build Spark’s physical execution layer. When you look into a modern data engine (e.g. Spark or other MPP databases), majority of the CPU cycles are spent in useless work, such as making virtual function calls or reading/writing intermediate data to CPU cache or memory. Optimizing performance by reducing the amount of CPU cycles wasted in these useless work has been a long time focus of modern compilers. Spark 2.0 ships with the second generation Tungsten engine."
Spark 2.0’s Structured Streaming APIs also promise to attract developers.
Databricks also has a webinar called Spark 2.0: Easier, Faster, and Smarter, which you can register for and watch.