Opinion: Shakeups Ahead for Yahoo!, EMC and Hadoop

by Guest Editor - Apr. 23, 2008

By Raj Bala

Trends in data storage and server computing are changing rapidly, and there are some unexpected shakeups to come, with open source implications. Hardware continues to get cheaper. Bandwidth isn’t quite free yet, but it’s hardly expensive. High-quality open source software now abounds in enterprise data centers, but grid computing solutions remain half-baked and hardly commoditized.

These trends are why I think Yahoo! and EMC are headed for a technology collision, given their respective philosophies. They are also why I see Hadoop, sponsored by the Apache Software Foundation and supported by Yahoo!, gaining much more prominence.

EMC sells what amounts to proprietary high-end storage solutions that are primarily centralized in nature. These types of storage solutions are created with the notion that they “can’t go down; if they do, we could be screwed.”

Yahoo’s support of the open source Hadoop project aims to commoditize grid computing and distributed storage. Hadoop allows you to specify that one or more computers are ready and willing to perform a processing operation, then draft them into service in tandem.

Today, disk storage is cheap, but moving large amounts of data around is still expensive. The Hadoop answer: Move the computations, not the data, and then process everything in parallel. This type of solution is designed with the notion that components “will go down, so just work around it better.”

Part of Hadoop is a distributed file system. Multiple copies of the data are replicated across the nodes in the cluster. Rather than moving large data sets around to satisfy requests, much smaller computations are moved to the nodes housing the data. It’s a beautifully simple yet effective paradigm that also drives Google’s infrastructure. But Hadoop is open source, and thus allows all application developers to deploy Google-grade systems. This “divide and conquer” paradigm is called MapReduce. It’s just a fancy way to describe splitting up requests and then putting the results back together again.
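To make the split-then-recombine idea concrete, here is a minimal sketch in plain Python. It does not use Hadoop or its API; the node names, the sample text, and every function name are made up for illustration. Each block of text is imagined as living on a different node, the map step runs against each block where it sits, and the reduce step adds up the partial results.

```python
from collections import defaultdict

# Hypothetical data blocks, each imagined as stored on a different node.
BLOCKS = {
    "node-a": "hadoop moves the computation",
    "node-b": "not the data",
    "node-c": "the data stays put",
}

def map_words(text):
    """Map step: emit a (word, 1) pair for every word in one block."""
    return [(word, 1) for word in text.split()]

def shuffle(pairs):
    """Group the emitted values by key so each word is reduced once."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_counts(word, counts):
    """Reduce step: combine the partial counts for one word."""
    return word, sum(counts)

def word_count(blocks):
    # Each block is mapped independently. On a real cluster these calls
    # would run in parallel on the nodes that hold the blocks; here they
    # simply run one after another in a single process.
    emitted = []
    for node in sorted(blocks):
        emitted.extend(map_words(blocks[node]))
    return dict(reduce_counts(w, c) for w, c in shuffle(emitted).items())

if __name__ == "__main__":
    print(word_count(BLOCKS))
```

Only the small `map_words` function would need to travel to each node, and only the compact (word, count) pairs would travel back, which is the point of moving computations rather than data.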

Back when EMC acquired VMware in 2004, lots of people questioned the synergy between data storage and x86 virtualization software. Was the acquisition simply a matter of EMC trying to increase profit margins by expanding its software business, or was there genuinely a compelling reason to acquire the virtualization startup?

Maybe it was both. Since 2004, VMware has grown to 5,000 employees and over $400 million in quarterly revenue. The company went public on the NYSE in the summer of 2007, and its market value nearly matched that of its parent before settling at just over $20 billion today.

EMC obviously hit a home run on the business side, but would the technical romance go beyond press-release glory and actually translate into value for customers? The answer, for EMC at least, is a huge “maybe.” VMware’s technology and EMC’s hardware haven’t yet ushered in many of the expected applications that can run on virtual machines. Today you can’t automatically add VMware instances to scale an application any more easily than you can add physical x86 hardware, though eventually you will be able to.

Hadoop is already there: Add nodes to the cluster, draft them into use in tandem, and applications can start using distributed storage and computing resources easily. A node goes down in a Hadoop cluster? No worries — the data is replicated and inherently fault-tolerant.
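Here is a toy illustration, in plain Python, of why losing a node need not mean losing data. It is a sketch under stated assumptions, not Hadoop's actual placement policy: the round-robin placement, the node and block names, and both function names are hypothetical.

```python
def place_replicas(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    A simplified stand-in for a real placement policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for start, block in enumerate(block_ids):
        placement[block] = [
            nodes[(start + i) % len(nodes)] for i in range(replication)
        ]
    return placement

def readable_blocks(placement, failed):
    """Return the blocks that still have a replica on a live node."""
    return {
        block
        for block, holders in placement.items()
        if any(node not in failed for node in holders)
    }

if __name__ == "__main__":
    nodes = ["n1", "n2", "n3", "n4"]
    placement = place_replicas(["b0", "b1", "b2"], nodes, replication=2)
    # With two copies of each block, any single node can fail safely.
    print(readable_blocks(placement, failed={"n2"}))
```

With two replicas per block, a block becomes unreadable only when both nodes holding it fail at once; raising the replication count trades disk space for tolerance of more simultaneous failures.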

Hadoop has a huge future in the enterprise. It’s already making a dent with consumer-oriented application developers, which is where technology trends typically gain momentum. Hadoop has tons of room to grow before enterprises feel comfortable with it, but that translates into opportunity for open source developers to innovate with the technology.

Disclosure: I worked for EMC’s Documentum organization (not really involved with VMware or EMC’s hardware business) for many years and was extremely happy. I still have lots of friends there, respect what they do, and hope that they invest resources in Hadoop to support the project. They only stand to gain by adopting it.

 






2 Comments
 

I see the value in Hadoop, but it is by no means a cure-all for distributed computing. I mean, your applications need to be able to parallelize their computation tasks. For a lot of tasks, VMware and the like are going to be much better suited for seamless scale.

For data-serving applications, Hadoop is a better choice. For processor-serving applications, VMware is a better choice. There is no conflict between the two.
