Spark Update Leverages the Super Powerful R Statistical Language

by Ostatic Staff - Jun. 12, 2015

Folks in the Big Data and Hadoop communities are becoming increasingly interested in Apache Spark, an open source data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley. We've covered Spark before, including the momentum surrounding it and backing for it from players like Cloudera.

The latest release, Spark 1.4, not only supports Python 3 but also adds support for R, one of the most powerful statistical programming languages, which could usher in much more advanced statistical analysis in Big Data scenarios.

R is an open source language and environment for statistical computing and graphics. We covered it here.

SparkR uses the parallel engine within Spark to run operations across multiple cores or multiple machines. These operations can thus be applied to much larger and more complex data sets than users of R are accustomed to handling.

SparkR is the name of the API (application programming interface) that permits R-based analysis with Spark.

According to a Databricks post:

"Spark 1.4 introduces SparkR, an R API for Spark and Spark’s first new language API since PySpark was added in 2012. SparkR is based on Spark’s parallel DataFrame abstraction. Users can create SparkR DataFrames from “local” R data frames, or from any Spark data source such as Hive, HDFS, Parquet or JSON. SparkR DataFrames support all Spark DataFrame operations including aggregation, filtering, grouping, summary statistics, and other analytical functions. They also supports mixing-in SQL queries, and converting query results to and from DataFrames. Because SparkR uses the Spark’s parallel engine underneath, operations take advantage of multiple cores or multiple machines, and can scale to data sizes much larger than standalone R programs."
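The grouping and aggregation described in the quote scale because each slice of the data can be summarized independently and the partial results merged afterward. The sketch below illustrates that general pattern in plain Python; it is not SparkR or Spark code. The function names, the `group` and `value` field names, and the use of a thread pool as a stand-in for Spark's workers are all assumptions made for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Deal rows round-robin into num_partitions lists (hypothetical helper)."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


def partial_aggregate(partition):
    """Per-partition step: running sum and count of 'value' for each 'group'."""
    acc = defaultdict(lambda: [0.0, 0])
    for row in partition:
        entry = acc[row["group"]]
        entry[0] += row["value"]
        entry[1] += 1
    return dict(acc)


def grouped_mean(rows, num_partitions=4):
    """Compute the mean of 'value' per 'group' by merging per-partition partials."""
    partitions = split_into_partitions(rows, num_partitions)
    # Threads stand in for workers here; a real engine would place
    # partitions on separate cores or machines.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(partial_aggregate, partitions))
    # Merge step: sums and counts combine by simple addition.
    merged = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for key, (total, count) in partial.items():
            merged[key][0] += total
            merged[key][1] += count
    return {key: total / count for key, (total, count) in merged.items()}


if __name__ == "__main__":
    sample = [
        {"group": "a", "value": 1.0},
        {"group": "a", "value": 3.0},
        {"group": "b", "value": 10.0},
    ]
    print(grouped_mean(sample, num_partitions=2))
```

The sketch carries sums and counts rather than per-partition means because sums and counts can be added together across partitions, while means cannot. That property is what lets this kind of aggregation be spread over many workers.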