Tools and Training Resources Arrive for Hadoop
This has been a big month for news on the Hadoop front, as the open platform for crunching data and helping to yield unseen insights begins to explore some new frontiers. Cloudera, which has focused on supporting and advancing Hadoop-based analytics, and Udacity, a provider of online higher education, announced a partnership to deliver Hadoop and Data Science training via Udacity's online education portal. Meanwhile, Facebook open sourced Presto, an SQL engine that it claims is about 10 times faster than Hive for running queries across large data sets with Hadoop.
Hadoop is a very hot topic in enterprises as the amount of data being generated and stored continues to increase. There is also big demand in enterprises for people with Hadoop skills. Cloudera founded Cloudera University in 2009, and in April 2013, the company announced the Cloudera Academic Partnership (CAP) to extend its course curriculum and training program. These moves were intended to help train a new generation of Hadoop-savvy workers.
Through its new partnership with Udacity, Cloudera is seeking to put high-quality Big Data training within reach of anyone who has an Internet connection. Udacity's online platform will leverage Cloudera's background in Hadoop to deliver online coursework that students can go through at their own pace. Upon completion of courses, students also have the option to take Cloudera's full suite of multi-day, live professional training courses and earn accredited professional certifications. You can find out more about the program here.
Facebook has also open sourced Presto, the interactive SQL-on-Hadoop engine that is reputed to offer impressive performance. Cloudera itself offers a similar tool called Impala, which I covered here. There is a need for fast, real-time queries on Hadoop, and these tools run SQL queries directly on Hadoop clusters, usually in real time, without having to load data into a database.
As GigaOM notes:
"Technologically, Presto and other query engines of its ilk can be viewed as faster versions of Hive, the data warehouse framework for Hadoop that Facebook created several years ago. Facebook and many other Hadoop users still rely heavily on Hive for batch-processing jobs such as regular reporting, but there has been a demand for something letting users perform ad hoc, exploratory queries on Hadoop data similar to how they might do them using a massively parallel relational database."
Facebook has Presto running in several of its data centers and will presumably make continuing contributions to the project.
Hadoop has emerged as a mighty and unique tool for yielding insights from large data sets, but it will be used in conjunction with other data-crunching tools over time. These tools for quick real-time queries are only going to proliferate and make using Hadoop easier.