Apache Parquet Now a Top-Level Project, with Hadoop Appeal

by Ostatic Staff - Apr. 29, 2015

Never underestimate how much the Apache Software Foundation (ASF) is doing to drive the Hadoop community forward. This week, I reported on how the foundation and the Open Data Platform partners are pushing standardization on the Hadoop scene.

Now Apache Parquet, an open source project that provides columnar storage in Hadoop, has been promoted to a top-level ASF project, a clear sign that it is on track to become entrenched in many Hadoop deployments.

The ASF has remained focused on pushing common standards across the Big Data scene, including for Hadoop and related tools.

"The incubation process at Apache has been fantastic and really the last step of making Parquet a community driven standard fully integrated within the greater Hadoop ecosystem," said Julien Le Dem, Vice President of Apache Parquet.

Apache Parquet is an open source columnar storage format for the Apache Hadoop ecosystem, built to work across programming languages as well as the following (a short example follows the list):

- processing frameworks (MapReduce, Apache Spark, Scalding, Cascading, Crunch, Kite)

- data models (Apache Avro, Apache Thrift, Protocol Buffers, POJOs)

- query engines (Apache Hive, Impala, HAWQ, Apache Drill, Apache Tajo, Apache Pig, Presto, Apache Spark SQL)
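To make the columnar-storage idea concrete, here is a minimal sketch, not taken from the Parquet documentation, of writing and reading Parquet data through Apache Spark SQL, one of the query engines listed above. It assumes a local Spark installation with the pyspark package available; the file path, column names, and values are purely illustrative.

```python
# Minimal sketch: writing and reading Parquet via Apache Spark SQL (PySpark).
# Assumes pyspark is installed; paths and data below are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-example").getOrCreate()

# Build a small DataFrame; column names and values are hypothetical.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 29)],
    ["name", "age"],
)

# Write the rows out in Parquet's columnar format.
df.write.mode("overwrite").parquet("/tmp/users.parquet")

# Read them back; selecting a single column lets the engine project only
# that column from the columnar file rather than scanning every field.
names = spark.read.parquet("/tmp/users.parquet").select("name")
names.show()

spark.stop()
```

Because the data is stored column by column, queries that touch only a subset of columns can skip the rest entirely, which is the storage and scan-time benefit the testimonials below describe.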

"At Twitter, Parquet has helped us scale our big data usage by in some cases reducing storage requirements by one third on large datasets as well as scan and deserialization time. This translated into hardware savings as well as reduced latency for accessing the data. Furthermore, Parquet being integrated with so many tools creates opportunities and flexibility regarding query engines," said Chris Aniszczyk, Head of Open Source at Twitter. "Finally, it's just fantastic to see it graduate to a top-level project and we look forward to further collaborating with the Apache Parquet community to continually improve performance."

"Parquet's integration with other object models, like Avro and Thrift, has been a key feature for our customers," said Ryan Blue, Software Engineer at Cloudera. "They can take advantage of columnar storage without changing the classes they already use in their production applications."

"At Netflix, Parquet is the primary storage format for data warehousing. More than 7 petabytes of our 10+ Petabyte warehouse is Parquet formatted data that we query across a wide range of tools including Apache Hive, Apache Pig, Apache Spark, PigPen, Presto, and native MapReduce. The performance benefit of columnar projection and statistics is a game changer for our big data platform," said Daniel Weeks, Software Engineer at Netflix. "We look forward to working with the Apache community to advance the state of big data storage with Parquet and are excited to see the project graduate to full Apache status."

Apache Parquet will be demonstrated at the Hadoop Summit, taking place June 9 to 11 in San Jose, California. The Apache Parquet community offers more information at http://parquet.apache.org/community/