Marrying Apache Spark and R for Next-Gen Data Science

by Ostatic Staff - Oct. 07, 2016

Recently, we caught up with Kavitha Mariappan, who is Vice President of Marketing at Databricks, for a guest post on open source tools and data science. In this arena, she took special note of The R Project (“R”), which is a popular open source language and runtime environment for advanced analytics. She also highlighted Apache Spark and its distributed in-memory data processing, which is fueling next-generation data science.

Now, R users can use the popular dplyr package to query and work with Apache Spark data. Via the sparklyr package, a dplyr interface for Spark, users can filter and aggregate Spark datasets, then bring them into R for analysis and visualization, according to an RStudio blog post.

According to the post:

Over the past couple of years we’ve heard time and time again that people want a native dplyr interface to Spark, so we built one! sparklyr also provides interfaces to Spark’s distributed machine learning algorithms and much more. Highlights include:

Interactively manipulate Spark data using both dplyr and SQL (via DBI).

Filter and aggregate Spark datasets then bring them into R for analysis and visualization.

Orchestrate distributed machine learning from R using either Spark MLlib or H2O Sparkling Water.

Create extensions that call the full Spark API and provide interfaces to Spark packages.

Integrated support for establishing Spark connections and browsing Spark data frames within the RStudio IDE.
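The "filter and aggregate, then bring the result in" workflow from the highlights can be illustrated with a rough sketch. This is not sparklyr or Spark code: as an assumption for illustration, Python's built-in sqlite3 module stands in for the remote engine, and the `flights` table, its columns, and the `mean_delay_by_carrier` helper are hypothetical. The point is only the pattern, in which the heavy filtering and grouping happen inside the engine and just a small summary is pulled into local memory.

```python
import sqlite3

# Hypothetical sample data. In the article's setting this would be a large
# Spark dataset; here an in-memory SQLite table stands in for it.
rows = [
    ("AA", 12.0), ("AA", -3.0), ("AA", 30.0),
    ("DL", 5.0), ("DL", 45.0),
    ("UA", -10.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (carrier TEXT, dep_delay REAL)")
conn.executemany("INSERT INTO flights VALUES (?, ?)", rows)


def mean_delay_by_carrier(conn, min_delay):
    """Filter and aggregate inside the engine; return only the summary rows."""
    query = """
        SELECT carrier, COUNT(*) AS n, AVG(dep_delay) AS mean_delay
        FROM flights
        WHERE dep_delay > ?
        GROUP BY carrier
        ORDER BY carrier
    """
    # fetchall() is the "bring it into the local session" step: only the
    # aggregated rows cross the boundary, not the full table.
    return conn.execute(query, (min_delay,)).fetchall()


summary = mean_delay_by_carrier(conn, 0.0)
for carrier, n, mean_delay in summary:
    print(carrier, n, mean_delay)
```

In the workflow the post describes, the query would be written with dplyr verbs or SQL via DBI, and the summary would be collected into R for plotting or further analysis.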

Meanwhile, IBM is incorporating sparklyr into its Data Science Experience, Cloudera is working to ensure that sparklyr meets the requirements of its enterprise customers, and H2O has provided an integration between sparklyr and H2O Sparkling Water. If you're unfamiliar with the power of Sparkling Water, see our post on it.

“With our latest contributions to Apache Spark and the release of sparklyr, we continue to emphasize R as a primary data science language within the Spark community. Additionally, we are making plans to include sparklyr in Data Science Experience to provide the tools data scientists are comfortable with to help them bring business-changing insights to their companies faster,” said Ritika Gunnar, vice president of Offering Management, IBM Analytics.

“At Cloudera, data science is one of the most popular use cases we see for Apache Spark as a core part of the Apache Hadoop ecosystem, yet the lack of a compelling R experience has limited data scientists’ access to available data and compute,” said Charles Zedlewski, vice president, Products at Cloudera. “We are excited to partner with RStudio to help bring sparklyr to the enterprise, so that data scientists and IT teams alike can get more value from their existing skills and infrastructure, all with the security, governance, and management our customers expect.”

“At H2O.ai, we’ve been focused on bringing the best of breed open source machine learning to data scientists working in R & Python. However, the lack of robust tooling in the R ecosystem for interfacing with Apache Spark has made it difficult for the R community to take advantage of the distributed data processing capabilities of Apache Spark.