HPCC: An Open Source Big Data Competitor to Hadoop

by OStatic Staff - Dec. 27, 2012

We've written many times before about Hadoop, an open source software framework for highly scalable queries and data-intensive distributed applications. The ecosystem of companies and organizations using Hadoop has grown dramatically in recent years, and as the Big Data trend grows, Hadoop training and support offerings are proliferating.

Hadoop is far from the only tool designed to cull insights from large data sets, though. HPCC (High-Performance Computing Cluster) from LexisNexis is a competing open source platform, released under the Apache 2.0 license and available in a free community edition and an enterprise edition.

HPCC processes large quantities of data, and it can do so across disparate data sources, functioning as both a processing engine and a distributed data storage environment. HPCC Systems offers quite a lot of online material for a deep-dive comparison of HPCC and Hadoop, and you can find its list of four key differentiators between the two platforms here.

Of the four differentiators, one of the biggest is that HPCC doesn't rely on MapReduce, the programming model behind Hadoop's distributed data processing. In that model, a map phase transforms input records into key/value pairs, the framework then shuffles all pairs with the same key to the same node, and a reduce phase aggregates each key's values, for example by counting how many times each value occurs. A rough sketch of the pattern follows below.
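To make the model concrete, here is a minimal, single-process sketch of the map/shuffle/reduce pattern in Python. It is purely illustrative: Hadoop expresses these phases through its Java APIs and runs them as distributed jobs across a cluster, not as local function calls, and the sample records are made up.

```python
from collections import defaultdict

# Illustrative sketch of the MapReduce pattern, not Hadoop's actual API.

def map_phase(records):
    """Map: emit a (key, value) pair for each word in each input record."""
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle: group all values emitted for the same key together."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: collapse each key's values -- here, by counting occurrences."""
    return {key: sum(values) for key, values in grouped.items()}

records = ["big data big cluster", "big data tools"]
counts = reduce_phase(shuffle_phase(map_phase(records)))
print(counts)  # {'big': 3, 'data': 2, 'cluster': 1, 'tools': 1}
```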

According to a missive from HPCC Systems sent to OStatic:

"HPCC was designed under a different paradigm to provide for a comprehensive and consistent high-level and concise declarative dataflow oriented programming model. One of the significant limitations of the strict MapReduce model utilized by Hadoop, is the fact that internode communication is left to the Shuffle phase. This makes iterative algorithms that require frequent internode data exchange hard to code and slow to execute (as they need to go through multiple phases of Map, Shuffle and Reduce, each representing a barrier operation that forces serialization of the long tails of execution).

"In contrast, the HPCC Systems platform provide for direct inter-node communication at all times, which is leveraged by many of the high level ECL primitives. Another disadvantage for Hadoop is the use of Java as the programming language for the entire platform, including the HDFS distributed filesystem, which adds for overhead from the JVM; in contrast, HPCC and ECL are compiled into C++, which executes natively on top of the Operating System, lending to more predictable latencies and overall faster execution (we have seen anywhere between 3 and 10 times faster execution on HPCC, compared to Hadoop, on the exact same hardware)."

If you want some real-world evidence about how the HPCC platform works for various types of organizations, you can find Big Data case studies here. Many organizations can benefit from using more than one tool to crunch large data sets, and it may be worth looking into how the HPCC platform can be used in conjunction with Hadoop.