Google Open Sources "Cloud Dataflow" SDK, Built to Trump MapReduce

by Ostatic Staff - Dec. 19, 2014

Back in June, at Google I/O, Google pronounced the venerable MapReduce data-crunching scheme "tired" and launched a service dubbed Cloud Dataflow that analyzes pipelines with "arbitrarily large datasets." Dataflow was a much-talked-about star among the cloud services discussed at Google I/O, and Google officials even confirmed that Dataflow had replaced MapReduce at Google. MapReduce, of course, is a programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster.

Now, in an effort to spur the use and development of Dataflow, Google has released a Java software development kit (SDK) for Cloud Dataflow under an open-source license.

According to the announcement of the SDK:

"Today, we are announcing availability of the Cloud Dataflow SDK as open-source. This will make it easier for developers to integrate with our managed service while also forming the basis for porting Cloud Dataflow to other languages and execution environments. We’ve learned a lot about how to turn data into intelligence as the original FlumeJava programming models (basis for Cloud Dataflow) have continued to evolve internally at Google."

"Interested in adding to the Cloud Dataflow conversation? You can: apply for access to Cloud Dataflow's managed service, learn more through the documentation, and take part in the conversation at StackOverflow [tag: google-cloud-dataflow]"

"As Storm, Spark, and the greater Hadoop family continue to mature - developers are challenged with bifurcated programming models. We hope to relieve developer fatigue and enable choice in deployment platforms by supporting execution and service portability ... We are currently building a Python 3 version of the SDK, to give developers even more choice and to make dataflow accessible to more applications."

As VentureBeat notes:

"The open-source move could result in more developers coming around to the approach Google has thrown its weight behind: setting up pipelines to process data as it comes in, instead of or in addition to doing batch processing jobs that take a while. What’s more, the open-sourcing strategy could increase the usage of Cloud Dataflow on the Google Cloud Platform, which competes with other big and growing public clouds, like Amazon Web Services and Microsoft Azure."

Reusable programming patterns have become key to developers. According to Google, the Cloud Dataflow SDK introduces a unified model for batch and stream data processing. It supports time-based aggregations through a rich set of windowing primitives, so the same computations can be used with batch or stream-based data sources.
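To make the windowing idea concrete, here is a minimal sketch in plain Java. It does not use the Cloud Dataflow SDK or its API; the class FixedWindowSketch, the Event type, and the sumByWindow method are hypothetical names invented for illustration. The sketch assigns each timestamped event to a fixed-size time window and sums the values per window, and the same method works whether the events come from a finished batch or are read one at a time from a stream.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: plain Java, not the Cloud Dataflow SDK.
public class FixedWindowSketch {

    /** A timestamped value (hypothetical type for this sketch). */
    public static final class Event {
        final long timestampMillis;
        final long value;

        public Event(long timestampMillis, long value) {
            this.timestampMillis = timestampMillis;
            this.value = value;
        }
    }

    /**
     * Sums event values per fixed window. Each key in the result is the
     * start time (in milliseconds) of the window the events fell into.
     */
    public static Map<Long, Long> sumByWindow(Iterable<Event> events, long windowMillis) {
        Map<Long, Long> sums = new TreeMap<>();
        for (Event e : events) {
            // Round the timestamp down to the start of its window.
            long windowStart = Math.floorDiv(e.timestampMillis, windowMillis) * windowMillis;
            sums.merge(windowStart, e.value, Long::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
                new Event(1_000, 5),
                new Event(59_000, 7),
                new Event(61_000, 1),
                new Event(125_000, 2));
        // One-minute windows; prints {0=12, 60000=1, 120000=2}
        System.out.println(sumByWindow(events, 60_000));
    }
}
```

Because the aggregation is keyed on each event's own timestamp rather than on when it is processed, the per-window totals are the same for a stored data set and for a live feed of the same events, which is the property the unified model is described as providing.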

You can find out much more about the new SDK here.