A Quick Overview of Hadoop
Building web scale applications means building systems that can survive any number of horrible things happening to your hardware or software. It means eliminating single points of failure, scaling horizontally as well as vertically, and responding to influxes of traffic without buckling under the weight. Increasingly, building web scale applications also means handling terabytes, or even petabytes, of data; this is where the Apache Hadoop project comes in.
What is Hadoop?
Hadoop is a big system with a lot of moving parts, and an in-depth understanding of the entire system would take more space than we have here. Instead, I’d like to give a high-level overview of the project, its parts, and why someone might choose Hadoop for their project.
According to Apache:
The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.
That explanation alone does not say much, but I believe it is intentionally vague. Hadoop is actually a collection of tools, and an ecosystem built on top of the tools. The problem Hadoop solves is how to store and process big data. When you need to store and process petabytes of information, the monolithic approach to computing no longer makes sense. Single disks are too slow, and the larger single machines needed to process the data effectively are prohibitively expensive to most companies building web scale applications. Hadoop uses commodity hardware, and of course open source software, to distribute the data across multiple machines, using their combined power and storage to overcome monolithic bottlenecks.
The two main subprojects of Hadoop are HDFS and MapReduce, along with a collection of supporting tools known as Hadoop Common. Let’s start with HDFS.
HDFS is the Hadoop Distributed File System, and oddly enough, it is not actually a filesystem in the same way that ext4 or ZFS is. HDFS is a distributed filesystem that sits on top of the native filesystem of each server. When an HDFS client saves a file to HDFS, the file is broken up into blocks (64MB each by default) and sent to several different HDFS data nodes. HDFS is built to withstand single servers failing; in fact, it expects that they will. So, when a file is broken into blocks, each block is also replicated three times, to three different nodes. A separate server, called the Name Node, keeps track of the location of every block. Since HDFS lives on top of the native filesystem of the data node, the blocks of data are viewable using standard Unix utilities like cat.
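To make the block-and-replica idea concrete, here is a toy sketch in Python. This is not the actual HDFS code; the node names and the round-robin placement policy are invented for illustration (real HDFS uses rack-aware placement), but the shape of the mapping is what the Name Node keeps in memory.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64MB, the default block size
REPLICATION = 3                # each block is stored on three nodes

# Hypothetical cluster of data nodes, for illustration only.
data_nodes = ["node1", "node2", "node3", "node4"]

def place_blocks(file_size):
    """Split a file into blocks and assign each block to three nodes.
    Returns the kind of block-to-locations map the Name Node tracks."""
    num_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
    block_map = {}
    for block_id in range(num_blocks):
        # Round-robin stands in for HDFS's real rack-aware policy.
        replicas = [data_nodes[(block_id + r) % len(data_nodes)]
                    for r in range(REPLICATION)]
        block_map[block_id] = replicas
    return block_map

# A 200MB file becomes four 64MB blocks (the last one partial),
# each replicated to three different data nodes.
print(place_blocks(200 * 1024 * 1024))
```

If any one of those hypothetical nodes dies, every block it held still exists on two other nodes, which is what lets HDFS shrug off single-server failures.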
MapReduce is the next part of Hadoop. MapReduce is a key-value programming model that allows distributed processing of data. Explaining the details of MapReduce is a rabbit hole I’m not willing to dive down right now, but suffice it to say that the “map” phase transforms input records into key-value pairs, and the “reduce” phase aggregates all of the values that share a key, for example by counting the number of occurrences of each one.
The best explanation of MapReduce that I have seen is in the form of a Unix command line. The “Map” portion of the process can be thought of as similar to grepping for a pattern in a file. An intermediate process, known as “shuffle and sort”, is best thought of as piping the output of grep to the sort utility. The final “Reduce” portion of the process is conceptually similar to piping the output of sort to uniq. Now, just run that command against terabytes of data spread across multiple machines and you have MapReduce.
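The pipeline analogy above can be sketched in a few lines of plain Python. This is only the shape of the computation, not how you would write a real Hadoop job (those are typically Mapper and Reducer classes in Java); the sample documents are made up.

```python
from itertools import groupby

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# Map: emit a (key, value) pair for every word,
# like grep pulling matches out of a file.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle and sort: bring all pairs with the same key together,
# like piping the output of grep to sort.
mapped.sort(key=lambda kv: kv[0])

# Reduce: collapse each group of pairs into a single count,
# like piping the output of sort to uniq -c.
counts = {key: sum(v for _, v in group)
          for key, group in groupby(mapped, key=lambda kv: kv[0])}

print(counts)
```

The whole point of Hadoop is that the map step runs in parallel on the nodes that already hold the data blocks, and only the shuffled intermediate pairs move across the network.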
OK, both the HDFS and MapReduce explanations are gross oversimplifications, but again, we are working at a high level here.
Hadoop also provides a web interface, command line tools, and processes that control job execution, failure recovery, and management functions, known collectively as Hadoop Common.
What Hadoop Does Not Do
Hadoop is almost a web scale application. It scales horizontally, and vertically with larger hardware, but it also introduces a single point of failure: the Name Node. The Name Node stores three types of information about the cluster: which blocks of data exist (in the fsimage file), which changes to those blocks have occurred (in an edits file), and where each block is located. The knowledge of which blocks exist and which changes have occurred is stored on disk, but the location of each block is not. That vital information is kept only in memory, which complicates making the Name Node highly available. I said that the Name Node represents a single point of failure, but in the latest release of Hadoop, 2.x, the developers have come up with an optional scheme for keeping a standby node available for HA. Unfortunately, it’s not what I would consider a good HA setup.
In the new HA setup, the standby Name Node receives notices from the cluster on the location of all blocks of data, same as the primary, and likewise keeps them in memory. But to keep the data stored on disk in sync, the standby also needs access to the edits and fsimage files, so the recommendation is to set up an NFS server that stores them. Hadoop has not so much solved the HA problem in this release as passed the buck: the single point of failure has moved from the Name Node to the NFS server. Fortunately, making an NFS server HA is a solved problem. So, it’s an OK solution, maybe a good solution, but not a great one.
Hadoop supports a vibrant ecosystem of other projects that build on its distributed computing capability. Cassandra, HBase, and Hive are large-scale databases and data warehouses. Avro and Chukwa are projects aimed at data serialization and collection. Other projects like ZooKeeper assist in coordinating clusters. Data storage and processing at web scale is more important than ever, so development on these and other related projects is happening rapidly.
If you prefer to have specialized support and streamlined installation for Hadoop, Cloudera offers the Cloudera Distribution Including Apache Hadoop (CDH) as an open source download with an enterprise support package available. The latest release, CDH4, is built on Hadoop 2, and includes a number of helpful scripts to simplify installing and managing Hadoop.
Once again, Hadoop is a big project, and as such, is a big topic to cover in a blog post. If you have experience with Hadoop in your data center, I’d love to hear about it in the comments.