Apache Parquet

Apache Parquet is an open-source columnar storage format for use in the Hadoop ecosystem. It was developed by Twitter and Cloudera to improve the performance of the existing Hadoop stack. Two of the founding engineers working on the project “described” the benefits as follows: 1) Limits the IO of only the data that is needed 2) Saves space, the columnar layout compresses better 3) Enables betters scans, loads only the columns that need to be accessed and 4) Enables vectorized execution engine. 

Project Background

  • Storage: Apache Parquet 
  • Author: Twitter and Cloudera
  • Released: March 2013
  • Type: N/A
  • License: Apache License 2.0
  • Language: Java
  • GitHub: apache/parquet-format
  • Runs on: Microsoft Windows, macOS, Linux
  • GitHub Stars: 1k
  • GitHub Contributors: 49

Applications

  • Type-specific encoding
  • Hive integration
  • Pig integration
  • Complex structure support
  • Run-length encoding (RLE)
  • Bit Packing
  • Adaptive dictionary encoding
  • Predicate pushdown

Summary

  •  Apache Parquet gives support to compression and encoding schemes.
  • It makes the advantages of compressed, efficient columnar data representation available in the Hadoop ecosystem.
  • Its per-column level feature is future-proofed that allows adding more encodings as they are invented and implemented.
Scroll to Top