Databricks' Kavitha Mariappan on Open Source Tools and Data Science
Databricks, a company founded by the creators of the popular open-source Big Data processing engine Apache Spark, has gained much momentum as Spark has gathered big backers and widespread development. Spark is one of the most active open source projects in the Big Data ecosystem, and there are increasing efforts among data scientists to leverage it and other open source tools.
We caught up with Kavitha Mariappan, who is Vice President of Marketing at Databricks, for a guest post on open source tools and data science. Here are her thoughts.
On Open Source Tools for Data Science
By Kavitha Mariappan
Open source software provides data scientists with the flexibility and power they need to do their work. In fact, recent surveys confirm data scientists’ preference for open source software:
- In a recent poll of data scientists by KDnuggets.com, eight of the top ten most frequently used tools are open source.
- Open source tools are the overwhelming choice of respondents to the 2015 O’Reilly Data Science Salary Survey.
- Among analytics professionals surveyed recently by recruiting firm Burtch Works, 62% prefer open source tools.
In this article, I will profile the leading open-source software tools for data science. R and Python are general-purpose tools that run on single machines. For scalable analytics, I discuss the options available to data scientists in Apache Hadoop and close with a discussion of Apache Spark.
R For Data Science
The R Project (“R”) is a popular open source language and runtime environment for advanced analytics. Data science website KDnuggets.com reports that 49% of readers surveyed said they had used R for an analytics project in the past twelve months. In a recent survey of quantitative professionals in the job market, recruiting firm Burtch Works found that 42% prefer to use R. Academics, researchers, and statisticians use R extensively; some call it the lingua franca of advanced analytics.
Users deploy R on Windows, MacOS or Linux. The core R distribution includes a command line interface (CLI) for interactive use; however, many users prefer an integrated development environment or IDE, such as RStudio or Microsoft R Tools for Visual Studio.
In most cases, R users import data into structured tables called data frames. As the first step in an analysis pipeline, the user imports data from its source to a data frame; the data must fit into memory on the machine where R is running. This limitation is not a problem with small datasets, but with larger datasets, it poses a severe bottleneck. Some packages support in-database processing, but they offer limited functionality.
R comprehensively supports data science, with capabilities for data import, export and storage; content analytics; matrix manipulation; exploratory analysis; visualization; statistical modeling; machine learning; and many other tasks. As of May 2016, there are more than 8,000 packages in the Comprehensive R Archive Network (CRAN), the most widely used R archive. While software vendors (such as Oracle and SAP) contribute packages that support integration with their commercial products, R users contribute most of the packages. Due to the broad developer community and minimal barriers to contribution, the breadth of functionality available in R far exceeds that of commercial analytic software.
Microsoft and Oracle offer commercially supported distributions of R. Both vendors offer free enhanced distributions (Microsoft R Open and Oracle R Distribution), as well as commercially licensed versions with additional enhancements.
Python for Data Science
Python is a programming language with a syntax that enables programmers to write efficient and concise code. While not as feature-rich for analytics as R, Python's capabilities for scientific computing are expanding rapidly. Among working analysts, Python is quickly gaining on R; in the KDnuggets.com poll, 46% said they use Python, second only to R.
Like R, Python works on all modern operating systems. Python users work with several different IDEs and notebooks. Jupyter is a popular choice; it offers an architecture for interactive computing, including an interactive shell, browser-based notebook, visualization support, embeddable Python interpreters and a framework for parallel computing.
For data manipulation and analysis, many Python users work with pandas, a package designed to handle structured data. Pandas supports SQL-like operations, such as join, merge, group, insert and delete; it also handles more complex operations, such as missing value treatment and time series functionality.
As a general-purpose language, Python supports core capabilities needed in an analytic language, such as data import and program control. For advanced analytics, two packages (NumPy and SciPy) provide foundation functions. For machine learning in Python, scikit-learn is the richest package; it includes algorithms for dimension reduction, clustering, classification, and regression. The package also includes tools for pre-processing, model evaluation, and model selection.
Continuum Analytics distributes Anaconda, a free Python distribution that includes some enhancements for scientific computing and predictive analytics. These include pre-selected Python packages for science, math, engineering, and data analysis; an online repository with the most up-to-date version of each package; and a set of plug-ins for Microsoft Excel.
Beyond R and Python
R and Python are excellent tools as far as they go. But sometimes, data scientists need tools that can scale out, or distribute the workload over many machines. These circumstances include:
- The data required for analysis is too large to fit on a single server.
- It takes too much time to extract and move data to an external server.
- The source data is streaming and requires a real-time approach.
- The analysis is computationally complex and takes too long on a single machine.
- The analysis requires many experiments, and serial execution takes too long.
Apache Hadoop is a popular distributed computing environment for analytics that started in 2006. Hadoop itself consists of just three components: HDFS, a distributed file system; MapReduce, a distributed programming framework; and YARN, a resource management system. However, most organizations use products from distributors who bundle Hadoop together with many other components.
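To make the MapReduce programming model concrete, here is a minimal single-process sketch of its three phases (map, shuffle, reduce) applied to a word count. It does not use Hadoop's API; the function names `map_phase`, `shuffle_phase`, and `reduce_phase` are hypothetical and only illustrate the flow of key-value pairs that a real cluster would spread across many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Group all emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    # Map: each document is processed independently, so this step parallelizes.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    # Shuffle: bring all values for the same key together.
    grouped = shuffle_phase(mapped)
    # Reduce: each key is reduced independently, so this step parallelizes too.
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = word_count(["spark and hadoop", "spark and python"])
```

Because the map and reduce steps each operate on independent pieces of data, a framework can run them on separate machines and only needs to coordinate during the shuffle.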
For analytics in Hadoop, data scientists can choose from many options. For example, Apache Hive, Apache Impala, Apache Drill and several other projects support SQL operations; for stream processing, data scientists can use Apache Storm; Apache Giraph supports graph-parallel operations; and so forth.
However, Hadoop poses two challenges for data scientists. First, in most projects data scientists build "pipelines" that use many different types of processing. Creating a workflow across multiple software components is challenging; diverse components add complexity to the infrastructure, and they may be difficult to integrate.
The second challenge is performance. By design, MapReduce writes to disk after each pass through the data. Machine learning algorithms tend to require many iterations. Thus, an analysis that runs on MapReduce (or other Hadoop components that depend on MapReduce) can take a long time to run. That means longer project cycles and reduced productivity.
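The cost of going back to disk on every pass can be sketched with a toy iterative algorithm. The example below is an illustration under simplifying assumptions, not a model of Hadoop internals: it estimates the mean of a small dataset by gradient descent, once re-reading the data from a file on every iteration and once reading it a single time and iterating in memory. The helper names (`load`, `step`, `fit_disk_per_pass`, `fit_in_memory`) and the read counter are hypothetical.

```python
import json
import os
import tempfile

def load(path, counter):
    """Read the dataset from disk and record that a disk read happened."""
    counter["reads"] += 1
    with open(path) as f:
        return json.load(f)

def step(estimate, data, rate=0.5):
    """One gradient-descent step toward the mean of the data."""
    gradient = sum(estimate - x for x in data) / len(data)
    return estimate - rate * gradient

def fit_disk_per_pass(path, iterations, counter):
    # Stand-in for a chain of disk-based jobs: every pass starts from disk.
    estimate = 0.0
    for _ in range(iterations):
        estimate = step(estimate, load(path, counter))
    return estimate

def fit_in_memory(path, iterations, counter):
    # Stand-in for an in-memory engine: read once, then iterate on cached data.
    data = load(path, counter)
    estimate = 0.0
    for _ in range(iterations):
        estimate = step(estimate, data)
    return estimate

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.json")
    with open(path, "w") as f:
        json.dump([1.0, 2.0, 3.0, 6.0], f)

    disk_counter, memory_counter = {"reads": 0}, {"reads": 0}
    disk_result = fit_disk_per_pass(path, 20, disk_counter)
    memory_result = fit_in_memory(path, 20, memory_counter)
```

Both versions arrive at the same estimate, but the first touches the disk once per iteration while the second touches it once in total. With real datasets and dozens of iterations, that difference dominates the running time.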
Data scientists need an integrated platform that supports a range of computing tasks, including SQL, streaming, machine learning and graph operations. And, for good performance, data scientists need a distributed in-memory framework.
Next Generation Data Science with Apache Spark
Apache Spark, a project of the Apache Software Foundation, is an open source platform for distributed in-memory data processing. Databricks, a startup based in San Francisco, was founded by the team that created Apache Spark. It is the largest contributor to the open source Apache Spark project, providing ten times more code than any other company.
MapReduce is more than ten years old, and it is no longer rapidly evolving. Spark does much more than MapReduce, and it replaces MapReduce for most workloads. Spark is significantly faster than MapReduce for large and small data, and it is more versatile than MapReduce: it supports tasks like machine learning and streaming that were impossible or inefficient on MapReduce. These capabilities drive rapid Spark adoption. In the KDnuggets.com poll, 22% of surveyed data scientists said they had used Spark in the past year, up from 11% in 2015.
Spark provides a fault-tolerant runtime environment for fast execution. Users can deploy the software on a single machine, in a free-standing cluster, in Hadoop on YARN, on Apache Mesos, on cloud platforms, and in containers. Users interact with Spark through programming interfaces for Scala, Java, Python and R. PySpark provides a familiar interface for Python users, and SparkR supports R users.
Spark supports complete data science pipelines with libraries that run on the Spark engine, including Spark SQL, Spark Streaming, Spark MLlib and GraphX. Spark SQL supports operations with structured data, such as queries, filters, joins, and selects. In Spark 2.0, released in July 2016, Spark SQL comprehensively supports the SQL 2003 standard, so users with experience working with SQL on relational databases can learn how to work with Spark quickly.
Instead of a native file system, Spark works with a very broad array of storage platforms: HDFS, relational databases, NoSQL data stores, cloud data stores, streaming sources, in-memory file systems, search engines and many other data sources. Users work with Spark SQL to create structured tables, or DataFrames, which are available for subsequent operations in the user’s data science pipeline.
Spark Streaming enables the user to work with streaming data sources, such as Apache Kafka or Amazon Kinesis. Structured Streaming, introduced in Spark 2.0, combines stream data processing with the Spark SQL interface; users query streams in the same way that they query static tables.
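The idea of querying a stream the same way as a static table can be sketched in plain Python. This is a conceptual illustration only and does not use Spark's API: the function `count_by_user`, the helper `micro_batches`, and the sample events are all hypothetical. The same query runs once over a complete table and then incrementally over small batches, and both paths produce the same answer.

```python
from collections import Counter

def count_by_user(rows, state=None):
    """Count events per user; pass `state` to continue from earlier batches."""
    totals = Counter() if state is None else state
    for row in rows:
        totals[row["user"]] += 1
    return totals

def micro_batches(rows, size):
    """Split rows into consecutive batches, standing in for arriving data."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

events = [
    {"user": "ana", "action": "click"},
    {"user": "bo", "action": "view"},
    {"user": "ana", "action": "view"},
    {"user": "ana", "action": "click"},
]

# Static table: run the query once over all rows.
static_result = count_by_user(events)

# Stream: run the same query on each batch, updating a running result.
streaming_result = None
for batch in micro_batches(events, size=2):
    streaming_result = count_by_user(batch, state=streaming_result)
```

The design point is that the query logic is written once; only the way rows are fed to it differs between the static and streaming cases.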
MLlib, Spark’s machine learning library, supports feature engineering, classification, regression, clustering and collaborative filtering. The library also includes tools for model selection and tuning. Spark uniquely supports machine learning algorithms that work with streaming data. GraphX supports graph-parallel processing for applications like network and link analysis.
The Spark Packages page includes more than 200 additional packages contributed by third-party developers. Spark Packages support other data sources, machine learning features (including Deep Learning), streaming capabilities and utilities.
Every major Hadoop distributor includes Apache Spark in its offering today, as do both leading and emerging vendors in Big Data. With over a thousand contributing developers and submissions from third-party developers, the Spark team adds enhancements faster than many other open source projects. One important reason for the growth of Apache Spark is the broad grassroots community interest and support to share, teach, and learn with one another. Today, there are more than two hundred thousand Spark meetup members worldwide, making it the largest open source community in big data. Moreover, with commercial support available from virtually every major vendor in Big Data, Spark is an excellent choice for the enterprise.
R and Python are good tools for general data science when you do not need a scalable engine. Each has its loyal advocates: R users cite its rich graphics capabilities, while Python users value its broad application development potential.
Data scientists need scalable computing environments for power and speed. While Hadoop is the leading distributed framework for general-purpose computing, Spark goes beyond that to deliver integrated tooling for data science, using in-memory operations for superior performance.
About Kavitha Mariappan, Vice President of Marketing, Databricks
Kavitha heads up Databricks’ end-to-end global marketing efforts. She brings more than 20 years of extensive industry experience in product and outbound marketing, product management, and business development to Databricks. Prior to Databricks, Kavitha was the VP of Marketing at Maginatics (acquired by EMC), where she built and led the team responsible for all aspects of marketing and communications. Her previous professional experience includes leadership roles at Riverbed Technology and Cisco Systems, Inc. Kavitha has a Bachelor of Engineering in Communication Engineering from the Royal Melbourne Institute of Technology, Australia.