AtScale Delivers Findings on BI-Plus-Hadoop
Business intelligence is the dominant use case for IT organizations implementing Hadoop, according to a report from the folks at AtScale. The benchmark study also shows which tools in the Hadoop ecosystem are best for particular types of BI queries.
As we've reported before, tools that demystify the open source Hadoop project and serve as useful front ends and connectors for it are much in demand. AtScale, billed as “the first company to allow business users to do business intelligence on Hadoop,” focused its study on the strengths and weaknesses of the industry’s most popular analytical engines for Hadoop – Impala, Spark SQL, Hive and Presto.
You can get a complimentary copy of the full report at www.atscale.com/benchmark.
Some of the findings that surfaced include:
There is rapid innovation in the open source space, as reflected by Spark SQL improvements, even from 1.6 to 2.0: The open source community continues to drive significant and rapid improvements across the board. All engines tested showed performance gains of between 2x and 4x in the six months between the first and second editions of the benchmarks. The study shows significant performance improvements between Spark 1.6 and Spark 2.0. Cloudera’s recent decision to donate Impala to the Apache Software Foundation will benefit the community, Cloudera, and any enterprise connecting business users to Hadoop. This is great news for enterprises deploying BI workloads to Hadoop.
Different engines perform well for different types of queries: For large data sets, Hive, Impala, Presto, and Spark SQL were all able to effectively complete a range of queries on more than 6 billion rows of data. There is no single “winning engine” for all query types. Depending on raw data size, query complexity, and the target number of end users, enterprises will find that each engine has its own ‘sweet spot’.
Presto and Impala scale better than Hive and Spark for concurrent dashboard queries: Production enterprise BI user bases may number in the hundreds or thousands of users. As such, support for concurrent query workloads is critical. Our benchmarks showed that Presto and Impala performed best – that is, showed the least query degradation – as concurrent query workload increased. Presto, new to this edition of the benchmark, showed the best results in our user concurrency testing.
“As enterprises adopt Hadoop more broadly, business intelligence (BI) and analytical use cases on Hadoop have expanded from strong, but limited, adoption among data scientists,” says John L. Myers, Managing Research Director at Enterprise Management Associates (EMA). “Now, organizations need to make the data within their Hadoop clusters available and ‘business critical’ to a wider business stakeholder audience. BI on Hadoop is a logical use case to help them accomplish that growth in adoption and acceptance.”
As indicated in the latest Hadoop Maturity Survey, Business Intelligence is now a top workload for Hadoop, ahead of Data Science and ETL. We have written about this trend quite a bit over the past year.
The AtScale study also produced other key findings:
SQL-on-Hadoop engines are well suited for Business Intelligence (BI): All tested engines – Hive, Impala, Presto, and Spark SQL – successfully executed all of the queries in our benchmark suite and are stable enough to support business intelligence workloads.
Small vs. Big Data: Impala and Spark SQL continue to shine for small data queries (queries against the AtScale Adaptive Cache). The latest release of Hive LLAP (Live Long and Process) shows suitable “small data” query response times. Presto also shows promise on small, interactive queries.