More on Open Source Tools for Data Science

by Ostatic Staff - Aug. 26, 2016

Open source tools are having a transformative impact on the world of data science. In a recent guest post here on OStatic, Kavitha Mariappan, Vice President of Marketing at Databricks, discussed some of the most powerful open source solutions for use in the data science arena. Databricks was founded by the creators of the popular open source Big Data processing engine Apache Spark, which is itself transforming data science.

Here are some other open source tools in this arena to know about.

As Mariappan wrote: "Apache Spark, a project of the Apache Software Foundation, is an open source platform for distributed in-memory data processing. Spark supports complete data science pipelines with libraries that run on the Spark engine, including Spark SQL, Spark Streaming, Spark MLlib and GraphX. Spark SQL supports operations with structured data, such as queries, filters, joins, and selects. In Spark 2.0, released in July 2016, Spark SQL comprehensively supports the SQL 2003 standard, so users with experience working with SQL on relational databases can learn how to work with Spark quickly."

Indeed, Spark is in wide use for in-memory tasks in data science, but Grappa is a tool that takes a different and very interesting approach. The open source Grappa project scales data-intensive applications on commodity clusters and offers a new type of abstraction that can beat classic distributed shared memory (DSM) systems. In fact, Rich Wolski, founder of the Eucalyptus cloud project, enthusiastically pointed to Grappa in an interview with us as a very interesting project in the data science arena.

We followed up with an interview with the Grappa team. They told us:

"We’re providing some of the same abstractions that you see in distributed shared memory systems, but taking a very different approach. They have taken the standard technique of exploiting locality for performance, and depended on that. We take the opposite approach. Rather than simple approaches to caching data, we’ve reduced the cost of migrating operations around the cluster."

"Grappa helps accelerate in-memory analytics computation. Specifically, we’re exploring techniques that make worst-case performance for these applications. Applications like graph analytics have a lot of low locality access needs. If you just use standard analytics approaches to commodity clusters, you’ll end up needing a lot of overhead for each step you take as you perform analytics. Grappa tries to find opportunities in the random-access performance that goes on there better. It also provides an easier programming model for distributed memory machines."

As Mariappan noted in her guest post, Apache Hadoop is definitely transforming data science. She wrote: "For analytics in Hadoop, data scientists can choose from many options. For example, Apache Hive, Apache Impala, Apache Drill and several other projects support SQL operations; for stream processing, data scientists can use Apache Storm; Apache Giraph supports graph-parallel operations; and so forth."

We also rounded up some of the other important open source machine learning and data science tools in this post. Google, Facebook and other tech giants are open sourcing key tools in this area. Meanwhile, startups like H2O.ai, formerly known as 0xdata, have steadily been carving out a niche with open source software for big data analysis and machine learning. We have interviewed H2O.ai's leaders several times.

You can find much more on this front in Kavitha Mariappan's guest post.