Google Wants Apache's Attention to Evolve Cloud Dataflow Tool

by Ostatic Staff - Jan. 25, 2016

The Apache Software Foundation, which incubates more than 350 open source projects and initiatives, has elevated a slew of big data and cloud computing projects to Top-Level status recently. With that designation, projects get more attention from the development community and other perks.

Now, Google is making a big open source-focused move by offering its Dataflow technology to the Apache Software Foundation (ASF) as an incubator project.  Cloudera, Data Artisans GmbH, PayPal Holdings and Talend are all backing the move.

“We believe this proposal is a step towards the ability to define one data pipeline for multiple processing needs, without tradeoffs, which can be run in a number of runtimes, on-premise, in the cloud, or locally,” wrote Google Software Engineer Frances Perry and Product Manager James Malone in a blog post.

Google Cloud Dataflow will remain as a “no-ops” managed service to execute Dataflow pipelines quickly and cost-effectively in Google Cloud Platform, but attention from Apache can help it evolve.

According to Google's post:

"It wasn't long ago that Apache Hadoop MapReduce was the obvious engine for all things big data, then Apache Spark came along, and more recently Apache Flink, a streaming-native engine. Unlike upgrading hardware, adopting these more modern engines has generally required rewriting pipelines to adopt engine-specific APIs, often with different implementations for streaming and batch scenarios. This can mean throwing away user code that had just been weathered enough to be considered (mostly) bug-free, and replacing it with immature new code. All of this just because the data pipelines needed to scale better, or have lower latency, or run more cheaply, or complete faster.

Adjusting such aspects should not require throwing away well-tested business logic. You should be able to move your application or data pipeline to the appropriate engine, or to the appropriate environment (e.g., from on-prem to cloud) while keeping the business logic intact. But, to do this, two conditions need to be met. First, you need a portable SDK, which can produce programs that can execute on one of many pluggable execution environments. Second, that SDK has to expose a programming model whose semantics are focused on your workload and not on the capabilities of the underlying engine. For example, MapReduce as a programming model doesn’t meet the bill (even though MapReduce as an execution method might be appropriate in some cases) because it cannot productively express low-latency computations.

Google designed Dataflow specifically to address both of these issues. The Dataflow Java SDK has been architected to support pluggable “runners” to connect to execution engines, of which four currently exist: data Artisans created one for Apache Flink, Cloudera did it for Apache Spark, and Google implemented a single-node local execution runner as well as one for Google’s hosted Cloud Dataflow service."

 Indeed, the world of data analytics is quickly moving beyond MapReduce. Spark and other tools are moving to the forefront. 

Regarding its pitch to Apache, Google noted: "We sent a proposal for Dataflow to become an Apache Software Foundation (ASF) Incubator project. In this proposal the Dataflow model, Java SDK, and runners will be bundled into one incubating project with the Python SDK joining the project in the future. We believe this proposal is a step towards the ability to define one data pipeline for multiple processing needs, without tradeoffs, which can be run in a number of runtimes, on-premise, in the cloud, or locally."