If you read this blog regularly, you're familiar with an ultra-important open source project, sponsored by the Apache Software Foundation: Hadoop. Hadoop is a software framework able to take advantage of huge clusters of computers to produce fast results for queries and more. Thursday night, in San Francisco, GigaOm held a meet-and-greet with stars from the Hadoop project, and stars who benefit: Doug Cutting, head of Hadoop; Eric Baldeschwieler, VP of grid computing at Yahoo; Larry Heck, VP of search and advertising sciences at Yahoo; and Chad Walters, director of engineering at Powerset. OStatic staff live blogged from the event from 6pm on, Pacific time. Check out what the Hadoop stars had to say.
LIVE BLOG TRANSCRIPT:
6:30 -- Doug Cutting, head of Hadoop, is giving an aerial overview to the group. How Hadoop got started: Around 2004 Google published its Google File System and MapReduce papers. These answered many questions about scaling issues when trying to use the resources of thousands of computers in tandem. In 2006, Yahoo came along and hired Doug and took over his project called Nutch, which Hadoop grew out of. "Lots of people are using Hadoop now," he says, "and lots of people are helping develop it."
6:45 -- Eric Baldeschwieler, in charge of grid computing at Yahoo, is talking (Yahoo uses Hadoop to leverage the power of thousands of computers for various types of search tasks). "We have 500 million unique users per month. We need ways to leverage commodity hardware to process everything we process. Yahoo uses an open source stack, and Hadoop is the primary resource. It has three components that are key. The Distributed File System lets you use thousands of computers at once, and helps keep processes going when one node fails. The MapReduce Framework is a programming model that allocates and reallocates resources as appropriate. The last component is HOD, a Dynamic Cluster Manager."
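For readers new to the MapReduce model Eric mentions, here is a minimal single-machine sketch of the idea in Python. It is not Hadoop's API and not anything shown at the event: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) and the sample documents are made up for illustration. In a real cluster, the framework would run the map and reduce steps on many nodes and handle the grouping between them.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group emitted values by key, the step a framework performs between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def run_job(documents):
    """Run map, shuffle, and reduce in sequence on one machine."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical input documents, chosen only to show the flow.
counts = run_job(["hadoop scales out", "hadoop runs on commodity hardware"])
print(counts)
```

Because each map call sees only one document and each reduce call sees only one key, the work can in principle be split across as many machines as are available.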
7:00 -- More from Eric: "We're using Hadoop to make search better, to make advertising better and more. We're using Hadoop with thousands of nodes. We're also focusing on various collaborations using Hadoop. We have a collaboration with Carnegie Mellon to use clusters for researchers there. We're hoping to see more community collaboration, and hope to see people posting Hadoop tools and participating. We're hoping we can hire some of the people producing these."
"The team at Yahoo is the primary team contributing to Hadoop (Eric shows a graph of this), but other teams are doing more and more patches, etc. Some core work is coming from outside and we want to see more of that."
Yahoo has a PoweredBy Hadoop wiki page. Eric says people interested in participating should register there. This address--http://developer.yahoo.com/blogs/hadoop--is a starting point.
7:05 -- Larry Heck, in search and advertising at Yahoo, is talking. Yahoo employs hundreds of PhD-level scientists and researchers, and has an R&D lab. Hadoop is an essential tool for experimentation at Yahoo.
"It's common in research to pick an algorithm, start experimenting with parameters to find optimized ones, etc. There is a lot of computing involved with doing this well. Let the massive compute grid go through all the possible computations, and the data starts to speak."
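The kind of parameter search Larry describes can be sketched roughly as follows. This is an illustrative sketch only: the `score` function, the parameter names `alpha` and `beta`, and the candidate values are all invented, and the loop runs on one machine. On a grid, each combination could be evaluated as an independent task.

```python
from itertools import product

def score(alpha, beta):
    """Hypothetical objective standing in for a real experiment's quality metric."""
    return -((alpha - 0.3) ** 2) - ((beta - 5) ** 2)

def sweep(alphas, betas):
    """Evaluate every parameter combination and return the best-scoring one."""
    return max(product(alphas, betas), key=lambda params: score(*params))

# Made-up candidate values; a real sweep would cover far more combinations.
best = sweep([0.1, 0.3, 0.5], [1, 5, 10])
print(best)
```

Since no combination depends on another, the number of evaluations that can run at once is limited only by the size of the cluster.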
Search Assist's database was built using Hadoop. Search Assist is the group of suggested search terms you get back when you type a search term into Yahoo's search bar. The suggestions are to assist the user in search. This database requires a grid to produce the best possible suggestions for searches.
Query Speller is another Hadoop application at Yahoo. This is when Yahoo says "Did you mean?..." when you are searching. The application includes very deep user behavior logs that require a lot of computing power to go through.
7:20 -- Chad Walters, director of engineering at Powerset, is talking. Powerset is applying deep natural language processing in the consumer search space, using technology licensed from Xerox PARC. This kind of search requires 100 times more processing than simple keyword searching and indexing (about one second per sentence is required for processing). Wikipedia searching is handled by Powerset's natural language search.
The company uses a distributed database system called HBase. HBase is targeted at offline processing using distributed scanning. It's used at Powerset as a content repository. It stores tons of data from documents, and metadata. Coral is the name of Powerset's Document Processing System--a robust Java framework for large scale document processing.
Coral uses Hadoop as its job control machine. Powerset uses 92 8-core machines to do processing. Eight cores! Each machine handles eight tasks at a time--one per core--as documents and language are processed.
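A back-of-the-envelope calculation shows what those numbers imply. The machine count, cores per machine, and roughly one second per sentence come from Chad's talk; everything else is an assumption, including perfect parallelism with no scheduling or I/O overhead, and the function name `hours_to_process` is ours.

```python
MACHINES = 92               # from the talk
CORES_PER_MACHINE = 8       # from the talk, one task per core
SECONDS_PER_SENTENCE = 1.0  # "about one second per sentence"

# Upper bound on tasks running at the same moment.
concurrent_tasks = MACHINES * CORES_PER_MACHINE

# Idealized throughput, assuming every core stays busy with no overhead.
sentences_per_second = concurrent_tasks / SECONDS_PER_SENTENCE

def hours_to_process(sentence_count):
    """Estimate wall-clock hours for a corpus under the idealized assumptions."""
    return sentence_count / sentences_per_second / 3600

print(concurrent_tasks)
print(hours_to_process(100_000_000))  # hypothetical corpus size
```

Real throughput would be lower than this estimate, since the sketch ignores job startup, data movement, and uneven sentence lengths.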
Q&A SESSION
Question: How efficient is Hadoop in general? How well does it use cluster resources?
Doug answers, it's getting steadily faster and we try to make sure that the Hadoop software framework doesn't have excessive overhead. Eric adds, the HOD component of Hadoop is not completely efficient, and we can see some places where it can do a better job. Overall, though, most of the teams working with Hadoop are very pleased.
Question: Is there any chance Hadoop will fade in importance?
Eric answers, Hadoop is an Apache project. It has a lot of diversity of users. Google uses Hadoop extensively for teaching at universities all over the country, to encourage interest. It's here to stay.
Question: Is Hadoop mainly for document-based processing?
Larry answers: Any pattern recognition-based process is great for Hadoop. The pattern-recognition techniques can be linear or non-linear. Chad adds: Log processing is another area where people are getting good use from Hadoop. Facebook does a lot of this kind of thing on Hadoop.
Eric summarizes: Seismology and many other kinds of data collection-intensive pursuits are prime areas where people can get good use from Hadoop.