At Hadoop Summit, Yahoo! Announces its Tested Distribution

by Ostatic Staff - Jun. 10, 2009

At today's Hadoop Summit in Silicon Valley, Yahoo! announced the availability of the Yahoo! Distribution of Hadoop, a source-only version of Apache Hadoop that Yahoo! uses within its own search engine. Hadoop is an open source software framework for processing very large data sets; it is widely used in large-scale data mining applications and in search tools at sites such as Facebook. For developers and users interested in Hadoop, it's worth noting that the Yahoo! Distribution of Hadoop has been developed and tested at Yahoo! for years, as Eric Baldeschwieler, VP of grid computing at Yahoo!, described in detail here.

According to an advisory sent from Yahoo!:

"Yahoo! pioneered much of the Apache Hadoop technology and remains dedicated to making the Hadoop ecosystem stronger, working closely with key collaborators in the Hadoop community and helping to drive more users and contributors to Apache Hadoop. In response to frequent requests from the community, Yahoo! is opening up its investment in Hadoop quality engineering via the Yahoo! Distribution of Hadoop."

The advisory added this about the extensive testing Yahoo!'s distribution has been through:

"The Yahoo! Distribution of Hadoop has been tested and deployed at Yahoo! on the largest Hadoop clusters in the world and is based entirely on code available from Apache Hadoop, an open source project of the Apache Software Foundation which develops distributed file system and parallel execution environment that enables its users to process massive amounts of data."

You can find a blog post on the announcement here, and learn more about Yahoo!'s distribution here. I suspect that many developers interested in Hadoop will be attracted to this new distribution because of the testing, bug-fixing and ongoing development that it's been through. Yahoo! is also emphasizing that the new distribution raises its "commitment to cloud computing."

Mike Olson, CEO of Cloudera, a well-funded startup that provides commercial support for Hadoop, also weighed in on how Cloudera will make use of Yahoo!'s distribution:

"The general availability of Yahoo!'s Hadoop source code helps make Cloudera's Distribution for Hadoop even more robust and scalable, and we will continue to collaborate with Yahoo! to include their tested source code in our commercial distribution," said Mike Olson, CEO of Cloudera. "Cloudera's Distribution for Hadoop is a complete, enterprise-ready distribution inclusive of key tools, utilities, and full service and support, to help enterprises deploy and manage the Hadoop platform for large-scale data processing and management."

Cloudera's inclusion of Yahoo!'s tested source code in its own distribution speaks volumes about how trusted Yahoo!'s Hadoop distribution is. I've spoken with the Hadoop folks at Yahoo!, and it's unlikely that any company has made more extensive use of this open source tool than Yahoo! has within its search engine.