As Spark Advances, IBM's Commitment Does Too
Last year was a big one for Apache Spark, now one of the most talked-about data analytics tools around. Not only did the project graduate to top-level status at Apache, but IBM also made a huge financial commitment to it. Late last year, Databricks, which was founded by Spark’s creators, revealed that a growing number of users are choosing to complement or replace Hadoop tasks with Spark processes.
Now Spark 1.6 has arrived, and IBM's commitment to Spark looks set to deepen this year. Databricks announced the release:
"Today we are happy to announce the availability of Apache Spark 1.6! With this release, Spark hit a major milestone in terms of community development: the number of people that have contributed code to Spark has crossed 1000, doubling the 500 number we saw at the end of 2014."
"According to our 2015 Spark Survey, 91% of users consider performance as the most important aspect of Spark. As a result, performance optimizations have always been a focus in our Spark development."
Databricks is highlighting the following three additions:
"Parquet performance: Parquet has been one of the most commonly used data formats with Spark, and Parquet scan performance has pretty big impact on many large applications. In the past, Spark’s Parquet reader relies on parquet-mr to read and decode Parquet files. When we profile Spark applications, often many cycles are spent in “record assembly”, a process that reconstructs records from Parquet columns. In Spark 1.6. we introduce a new Parquet reader that bypasses parquert-mr’s record assembly and uses a more optimized code path for flat schemas."
"Automatic memory management: Another area of performance gains in Spark 1.6 comes from better memory management. Before Spark 1.6, Spark statically divided the available memory into two regions: execution memory and cache memory. Execution memory is the region that is used in sorting, hashing, and shuffling, while cache memory is used to cache hot data. Spark 1.6 introduces a new memory manager that automatically tunes the size of different memory regions."
"While the above two improvements apply transparently without any application code change, the following improvement is an example of a new API that enables better performance."
"10X speedup for streaming state management: State management is an important function in streaming applications, often used to maintain aggregations or session information. Having worked with many users, we have redesigned the state management API in Spark Streaming and introduced a new mapWithState API that scales linearly to the number of updates rather than the total number of records. That is to say, it has an efficient implementation that tracks “deltas”, rather than always requiring full scans over data."
Meanwhile, IBM is embedding Spark into its Analytics and Commerce platforms, and offering Spark as a service on IBM Cloud. IBM will also put more than 3,500 IBM researchers and developers to work on Spark-related projects at more than a dozen labs worldwide; donate its IBM SystemML machine learning technology to the Spark open source ecosystem; and educate more than one million data scientists and data engineers on Spark.
"IBM has been a decades long leader in open source innovation. We believe strongly in the power of open source as the basis to build value for clients, and are fully committed to Spark as a foundational technology platform for accelerating innovation and driving analytics across every business in a fundamental way," said Beth Smith, General Manager, Analytics Platform, IBM Analytics. "Our clients will benefit as we help them embrace Spark to advance their own data strategies to drive business transformation and competitive differentiation."