Arrow is the Latest Major Big Data Tool Advanced by Apache

by Ostatic Staff - Feb. 22, 2016

As we've been reporting, The Apache Software Foundation, which incubates more than 350 open source projects and initiatives, has squarely turned its focus to Big Data tools in 2015. The foundation has also made clear that you can expect more on this front in 2016, as a number of incubated projects graduate to Top-Level Status at Apache, which helps them get both advanced stewardship and certainly far more contributions.

Now, Apache has announced Arrow as a new Top-Level Project, and it is billed as an "open source Big Data in-memory columnar layer that accelerates analytical processing and interchange by more than 100x." 

According to Apache, Arrow is a descendent of Apache Drill:

"A high-performance cross-system data layer for columnar in-memory analytics, Apache Arrow provides the following benefits for Big Data workloads:

- Accelerates the performance of analytical workloads by more than 100x in some cases

- Enables multi-system workloads by eliminating cross-system communication overhead

"Initially seeded by code from the Apache Drill project, Apache Arrow was built on top of a number of Open Source collaborations, and establishes a de-facto standard for columnar in-memory processing and interchange."

 "The open source community has joined forces on Apache Arrow," said Jacques Nadeau, Vice President of Apache Arrow and Vice President Apache Drill. "Developers from 13 major Open Source Big Data projects are already on board --by introducing a new era of columnar in-memory analytics, we anticipate the majority of the world's data will be processed through Arrow within the next few years."

Code committers to Apache Arrow include developers from Apache Big Data projects Calcite, Cassandra, Drill, Hadoop, HBase, Impala, Kudu (incubating), Parquet, Phoenix, Spark, and Storm as well as established and emerging open source projects such as Pandas and Ibis.

"Arrow's cross platform and cross system strengths will enable Python and R to become first-class languages across the entire Big Data stack," said Wes McKinney, creator of Pandas.

Apache Arrow accelerates analytical processing by providing a high performance columnar in-memory representation. A number of processing algorithms benefit greatly from this memory design.

"A columnar in-memory data layer enables systems and applications to process data at full hardware speeds," said Todd Lipcon, original Apache Kudu creator and member of the Apache Arrow Project Management Committee. "Modern CPUs are designed to exploit data-level parallelism via vectorized operations and SIMD instructions. Arrow facilitates such processing."

According to Apache:

"In many workloads, 70-80% of CPU cycles are spent serializing and deserializing data. Arrow solves this problem by enabling data to be shared between systems and processes with no serialization, deserialization or memory copies."

 Here are some other Apache Big Data projects that are either in incubation stage now or have already graduated to Top-Level Status:

Brooklyn. The foundation announced that Apache Brooklyn is now a Top-Level Project (TLP), "signifying that the project's community and products have been well-governed under the ASF's meritocratic process and principles."  Brooklyn is an application blueprint and management platform used for integrating services across multiple data centers as well as and a wide range of software in the cloud.

According to the Brooklyn announcement:

"With modern applications being composed of many components, and increasing interest in micro-services architecture, the deployment and ongoing evolution of deployed apps is an increasingly difficult problem. Apache Brooklyn’s blueprints provide a clear, concise way to model an application, its components and their configuration, and the relationships between components, before deploying to public Cloud or private infrastructure. Policy-based management, built on the foundation of autonomic computing theory, continually evaluates the running application and makes modifications to it to keep it healthy and optimize for metrics such as cost and responsiveness."

Brooklyn is in use at some notable organizations. Cloud service providers Canopy and Virtustream have created product offerings built on Brooklyn. IBM has also made extensive use of Apache Brooklyn in order to migrate large workloads from AWS to IBM Softlayer.

Kylin. Meanwhile, the foundation has also just announced that Apache Kylin, an open source big data project born at eBay, has graduated to Top-Level status. Kylin is an open source Distributed Analytics Engine designed to provide an SQL interface and multi-dimensional analysis (OLAP) on Apache Hadoop, supporting extremely large datasets. It is widely used at eBay and at a few other organizations.

"Apache Kylin's incubation journey has demonstrated the value of Open Source governance at ASF and the power of building an open-source community and ecosystem around the project," said Luke Han, Vice President of Apache Kylin. "Our community is engaging the world's biggest local developer community in alignment with the Apache Way."

As an OLAP-on-Hadoop solution, Apache Kylin aims to fill the gap between Big Data exploration and human use, "enabling interactive analysis on massive datasets with sub-second latency for analysts, end users, developers, and data enthusiasts," according to developers. "Apache Kylin brings back business intelligence (BI) to Apache Hadoop to unleash the value of Big Data," they added.

Lens. Apache recently announced that Apache Lens, an open source Big Data and analytics tool, has graduated from the Apache Incubator to become a Top-Level Project (TLP).

According to the announcement:

"Apache Lens is a Unified Analytics platform. It provides an optimal execution environment for analytical queries in the unified view. Apache Lens aims to cut the Data Analytics silos by providing a single view of data across multiple tiered data stores."

"By providing an online analytical processing (OLAP) model on top of data, Lens seamlessly integrates Apache Hadoop with traditional data warehouses to appear as one. It also provides query history and statistics for queries running in the system along with query life cycle management."

 "Incubating Apache Lens has been an amazing experience at the ASF," said Amareshwari Sriramadasu, Vice President of Apache Lens. "Apache Lens solves a very critical problem in Big Data analytics space with respect to end users. It enables business users, analysts, data scientists, developers and other users to do complex analysis with ease, without knowing the underlying data layout."

Ignite. The ASF has announced that Apache Ignite is to become a top-level project. It's an open source effort to build an in-memory data fabric that was driven by GridGain Systems and WANdisco.

Apache Ignite is a high-performance, integrated and distributed In-Memory Data Fabric for computing and transacting on large-scale data sets in real-time, "orders of magnitude faster than possible with traditional disk-based or flash technologies," according to Apache. It is designed to easily power both existing and new applications in a distributed, massively parallel architecture on affordable, industry-standard hardware.

Tajo. Apache Tajo v0.11.0, an advanced open source data warehousing system in Apache Hadoop, is another new Top-Level project. Apache claims that Tajo provides the ability to rapidly extract more intelligence fro  Hadoop deployments, third party databases, and commercial business  intelligence tools.

And of course, Spark and other previously announced Big Data tools overseen by Apache are flourishing. Look for many of these projects to advance data analytics in 2016.