Databricks Delivers Online Courses Focused on Apache Spark

by Ostatic Staff - Dec. 22, 2014

Databricks, a company founded by the creators of the popular open-source Big Data processing engine Apache Spark, may not have made much news in 2014, but you will hear from it throughout 2015. The company has healthy venture funding of $47 million, and Andreessen Horowitz is one of the investors, with Ben Horowitz on board.

Folks in the Big Data and Hadoop communities are becoming increasingly interested in Apache Spark, an open source processing engine for Hadoop data built for speed and advanced analytics. It was developed in 2009 in UC Berkeley’s AMPLab, and open sourced in 2010. Now, Databricks has announced the launch of two massive open online courses (MOOCs) focused on distributed analytics using Spark. The courses will be made available in Spring 2015 via BerkeleyX, in collaboration with the MOOC provider and online learning platform, edX.

The two five-week courses are designed to augment Databricks' efforts to grow the Spark community. They provide students with hands-on experience with Spark's analytics and real-time capabilities to deliver insights into data. The launch of these courses comes on the heels of a series of Apache Spark training offerings from Databricks, including the Spark Certification Program for System Integrators and the Spark Certification Program for Developers.

"Spark is the most active open source project in the Big Data ecosystem, and continues to be deployed by enterprises across multiple verticals due to its speed and efficiency, ease of use, and single unified system for the complete data analytics pipelines," said Matei Zaharia, co-founder and CTO at Databricks. "As we continue to foster and grow the Spark community to meet that demand, we are excited to launch these two MOOCs, making hands-on, practical courses available to a community that will advance Spark's adoption with greater ease."

Both courses will use the Python interface to Spark, making them widely accessible to data scientists and developers. The courses include:

  • Introduction to Big Data with Apache Spark - Students will learn how to apply data science techniques using parallel programming in Spark to explore big (and small) data. The course will identify the most common responsibilities of data scientists and teach students how to use Spark to deliver against these expectations.

    When: February 23 - March 27, 2015
    Professor: Anthony D. Joseph, Professor in Electrical Engineering and Computer Science at UC Berkeley and Technical Advisor at Databricks

  • Scalable Machine Learning - The course will present the underlying statistical and algorithmic principles required to develop scalable machine learning pipelines and provide hands-on experience using Apache Spark. Students will use Spark to implement scalable algorithms for fundamental statistical models while tackling key real-world problems from various domains.

    When: April 14 - May 18, 2015
    Professor: Ameet Talwalkar, Assistant Professor of Computer Science at UCLA and Technical Advisor at Databricks

Both courses are available to the public for free and are now open for enrollment on the edX website. edX Verified Certificates are also available for a fee. For more information, visit: https://www.edx.org/

Cloudera is also rallying behind Spark. The company has announced Apache Spark training "to prepare developers and software engineers to build complete, unified applications that combine batch, streaming, and interactive analytics."

Spark's ability to work with unstructured data is particularly notable. Many enterprises haven't been able to fully structure their data sources and need tools flexible enough to work with unstructured archives.

"Broadly embraced by the open source community, Big Data vendors, and data-intensive enterprises for its stream processing capabilities and its support for complex, iterative algorithms, Spark offers performance gains that enable applications to run on the data in a Hadoop cluster at speeds up to 100 times faster than traditional MapReduce programs," Cloudera claims.

Spotify uses Spark, as do a number of other enterprises. In 2015, Spark promises to become very big news on the Big Data front, and Databricks and Cloudera are likely to be among the top players focused on equipping enterprises with the knowledge and tools to leverage the platform.