What Does Hadoop Mean to You?

by Ostatic Staff - Mar. 26, 2008

MapReduce is Google's secret weapon: a way of breaking complicated problems apart and spreading the pieces across many computers. Hadoop is an open source implementation of MapReduce that lets you do the same thing on your own computers. How does Hadoop work, and how might you best use it?

Especially if you followed the recent news involving Yahoo and Hadoop, or if you're interested in cloud computing, it's worth finding out.

Why is Google so successful? One answer is that they have a smarter search algorithm. You could also say that they have a winning advertising system, based on their search algorithm. And yet another possibility is that Google's enormous profits have enabled it to hire the best and the brightest computer scientists, who further refine the search algorithm, as well as Google's various applications and services.

But another reason for Google's success is that they have learned how to take advantage of large clusters of computers, much better and faster than their rivals. A key component to this success is something known as MapReduce, which allows Google's engineers to define a problem in such a way that it is broken up, allocated to a large number of computers for processing, reassembled, and then used.

MapReduce is based on "functional programming," a style common in languages such as Lisp, and available in high-level languages such as Perl, Python, and Ruby. In functional programming, data is rarely assigned to a variable; rather, it is transformed as it passes through a series of functions. The "map" in MapReduce takes its name from the built-in function of that name, which invokes the same operation on each element of a list. So in Ruby, we can say:

[1, 2, 3].map{|i| i * 10}

The above code takes the three-element array [1, 2, 3] and multiplies each element by 10. The original array is not modified; instead, "map" returns a new array whose value is [10, 20, 30]. Of course, you can pass "map" far more sophisticated operations, ranging from image processing to text searches.
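
For instance, here is a small and entirely hypothetical example that maps over a list of file names, counting how often a search term appears in each one (the file names and the term are placeholders, not anything from a real project):

# Hypothetical example: the file names and search term are placeholders.
files = ["chapter1.txt", "chapter2.txt", "chapter3.txt"]
files.map{|filename| [filename, File.read(filename).scan(/hadoop/i).length]}
# => one [filename, match-count] pair per file; the original list is untouched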

After "map" has applied the function to each list element, MapReduce applies a second function, called (big surprise!) "reduce." The job of "reduce" is to transform the output from "map". This function might do nothing at all, or it might sum the resulting list of numbers, or it might remove those list elements that fail a particular test.

The key thing to remember with "map" is that it doesn't matter how many elements are in the list to which it is applied: Each element is handled independently of every other element. This means that "map" can process the elements in the order they are given, or backwards, or even -- and here's where it gets amazing -- on multiple computers, with each computer processing one or more elements from map's input list.

So if you have 10 elements in your list, you can invoke the mapping function on all 10 elements on one computer. Or you can divide the list into two parts, and invoke the mapping function on five elements on each of two computers. Or you can divide the list into 10 parts, and invoke the mapping function once on each of 10 computers. The results will be the same, and will be combined intelligently by the "reduce" function.
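
Here is a toy sketch of that idea in plain Ruby -- no real cluster, just the list split into two halves that could, in principle, be handled by separate machines:

list = (1..10).to_a

# Pretend each half is handled by a different machine.
halves = [list[0, 5], list[5, 5]]
partial_results = halves.map{|half| half.map{|i| i * 10}}

# "Reduce": recombine the partial results and sum them.
partial_results.flatten.inject(0){|sum, i| sum + i}   # => 550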

Or if you're Google, you can invoke the mapping function once on each of your hundreds of thousands of computers. A Google employee doesn't need to specify which computers will invoke the function, or how many items each computer will process. However, that employee will discover that thanks to the magic of a huge server farm and MapReduce, operations execute very, very quickly.

Now, all of this is great news for Google employees, and bad news for those of us with only a few computers at our disposal. But it turns out that the Apache Software Foundation sponsors an open source implementation of MapReduce, known as Hadoop, named for a stuffed toy elephant belonging to the primary author's son. Hadoop allows you to specify that one or more computers are ready and willing to perform a processing operation; when you invoke MapReduce on one computer, the others are drafted into service.
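
To give a flavor of what this looks like in practice, here is a rough sketch of a word-count job written for Hadoop's "streaming" interface, which lets the map and reduce steps be ordinary scripts that read standard input and write standard output. The file names are my own, and the details of submitting the job vary by installation:

#!/usr/bin/env ruby
# mapper.rb -- reads lines of text on standard input, emits "word<TAB>1" for every word
STDIN.each_line do |line|
  line.split.each{|word| puts "#{word.downcase}\t1"}
end

#!/usr/bin/env ruby
# reducer.rb -- Hadoop sorts the mapper's output by key, so identical words arrive together
current_word, count = nil, 0
STDIN.each_line do |line|
  word, n = line.chomp.split("\t")
  if word == current_word
    count += n.to_i
  else
    puts "#{current_word}\t#{count}" if current_word
    current_word, count = word, n.to_i
  end
end
puts "#{current_word}\t#{count}" if current_word

With a cluster configured, the two scripts are handed to the "hadoop" command along with the streaming jar that ships with the distribution; Hadoop splits the input, runs the mapper on each piece, sorts the intermediate output, and feeds it to the reducer.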

Hadoop isn't the sort of thing that you're likely to use every day. At least, not as far as we can tell; the idea that we can split a task apart automatically, and take advantage of thousands of computers, is something that very few people have been able to try so far. This computing paradigm, known as "cloud computing," has drawn increasing attention from academic and corporate circles, because of its potential to substantially increase the execution speed of many operations.

That's why it was big news several days ago, when Yahoo announced that it had partnered with Computational Research Laboratories (CRL), a subsidiary of the Indian company Tata Sons, Ltd. Under the partnership, CRL's supercomputer, known as EKA, will offer researchers a huge cluster of Hadoop-equipped computers. The announcement implied that the cluster would also be available on a commercial basis.

If cost weren't an issue, then what would you do with a MapReduce- or Hadoop-equipped supercomputer? I have come up with a list of several possibilities, but I'm curious to hear other suggestions:

  • Translation of large texts from one language into another; we could break each document into paragraphs, and send each paragraph to a different node for translation.
  • Applying a common process to a large number of files. (The New York Times did just this, and described it on their blog, in an impressive combination of Hadoop and Amazon's EC2.)
  • Travel planning. Find the optimal price and speed for travel between two points, given many different transportation options. Each node can calculate the time, distance, and expense for a different combination of travel methods.
  • Massive compilations. If you need to compile several thousand files, each of which takes several minutes but doesn't depend on the others, you can farm the compilation out to various nodes with Hadoop.
  • Price calculations. If you have thousands of items in an online product catalog, and the prices need to change depending on your competitors' prices, you could use Hadoop to reprice your products in parallel.
  • Stock history analysis, as well as automatic trading. Instead of looping over a list of stocks and looking up the history of each one, create a Hadoop cluster and have each node look at the history of a different stock. Use the "reduce" function to find those stocks that have gone up or down substantially, and use that data to either buy or sell. (See the sketch after this list.)
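
As a rough sketch of that last idea in plain Ruby: the ticker symbols, the price_history helper, and the 10-percent threshold below are all invented for illustration, and no cluster is involved yet.

# Hypothetical sketch: price_history(symbol) is assumed to return an array of
# closing prices for that stock; it is not a real library call.
symbols = ["AAPL", "GOOG", "YHOO", "IBM"]

# "Map": compute the percentage change for each stock independently.
changes = symbols.map do |symbol|
  prices = price_history(symbol)
  [symbol, (prices.last - prices.first) * 100.0 / prices.first]
end

# "Reduce": keep only the stocks that moved more than 10 percent either way.
big_movers = changes.select{|symbol, change| change.abs > 10}

On a Hadoop cluster, the body of the "map" block would run on whichever node was handed that symbol, and only the small [symbol, change] pairs would travel back for the "reduce" step.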

What would you use such a cluster for?