Databricks Survey Shows Massive Commitments to Spark
People in the Big Data and Hadoop communities are becoming increasingly interested in Apache Spark, an open source data analytics cluster computing framework originally developed in the AMPLab at UC Berkeley. According to Apache, Spark can run programs up to 100 times faster than Hadoop MapReduce in memory, and ten times faster on disk. When crunching large data sets, those are big performance differences.
Databricks, the company founded by the creators of Apache Spark, today released the findings of a survey of more than 1,400 respondents from the Spark community to identify how organizations and users are utilizing the data analytics and processing engine. The 2015 Spark User Survey found that standalone deployments of Spark now outnumber those on YARN as more users run Spark independently of Hadoop: 48 percent of respondents run Spark in standalone mode, compared with 40 percent who run it on YARN. The survey also found that 51 percent of respondents run Spark on a public cloud.
With more than 600 contributors in the last 12 months (up from 315 in the prior 12 months), Spark is billed as the most active open source project in Big Data. Additionally, more than 200 organizations contribute code to Spark, making it one of the largest communities of engaged developers to date. IBM has pledged massive investments in Spark.
Key findings from the survey include:
Spark is outgrowing Hadoop: The most common Spark deployments reported by the community are standalone (48 percent), YARN within Hadoop (40 percent) and Apache Mesos (11 percent). The number of Spark users who do not use any Hadoop components more than doubled from 2014 to 2015.
Streaming and advanced analytics uses rising: Spark is being used for an increasingly diverse set of applications, particularly by data scientists for machine learning, streaming and graph analysis use cases. In 2015, there are 56 percent more Spark Streaming users than in 2014. The production use of advanced analytics, like MLlib for machine learning and GraphX for graph processing, increased from 11 percent in 2014 to 15 percent in 2015. Seventy-five percent of Spark users are also using two or more Spark components (51 percent of Spark users are using three or more Spark components).
Spark users are becoming more diverse: Spark is breaking down technology barriers between data scientists and engineers, who are working collaboratively to solve data problems. Of those surveyed, 41 percent identified themselves as Data Engineers, while 22 percent of respondents identified themselves as Data Scientists. Spark users are solving a variety of problems in different languages -- Scala (71 percent), Python (58 percent), SQL (36 percent), Java (31 percent) and R (18 percent) -- and all within the same framework.
Spark’s most popular use cases come to light: Fifty-two percent use Spark for data warehousing, 68 percent use it for business intelligence, 40 percent for processing application and system logs, 48 percent to build recommendation engines, 36 percent for user-facing services and 29 percent for fraud detection and security.
Spark is increasing access to big data: Spark adoption is growing so quickly because users are finding Spark easy to use and deploy, reliably fast, and well positioned for future growth in real-time and advanced analytics. Ninety-one percent of those surveyed cite performance as their reason for adoption, while 77 percent cite ease of programming, 71 percent cite ease of deployment, 64 percent cite advanced analytics capabilities and 52 percent cite real-time streaming capabilities.
“The continued growth of Spark has been highly encouraging, as companies are going into production to obtain real business value, and they are doing so in a wide range of environments beyond Hadoop clusters,” said Matei Zaharia, creator of Apache Spark and CTO of Databricks. “Databricks and our partners are 100 percent committed to the long-term growth of Spark and we’ll continue to make improvements based on this survey data and our ongoing community feedback, to make the most complete big data analytics toolkit accessible to all businesses.”
“The enthusiasm for big data is matched only by the pace of innovation. Many organizations are shifting to a ‘Spark-first’ strategy, recognizing its advantages of analytics versatility, development familiarity, superior performance, range of data sources supported, and deployment flexibility. The market will no doubt continue to evolve, but Spark has established considerable momentum today,” said Nik Rouda, Senior Analyst at Enterprise Strategy Group.
The results of the survey reflect the answers and opinions of 1,417 respondents representing 842 organizations.