Apache Iceberg

Apache Iceberg is an open table format designed for large, petabyte-scale tables. At its core, the table format helps users organize, manage, and track the files that make up a table. Netflix developed the project to improve on their “cobbled together data warehouse architecture” comprising Hadoop, Presto, Hive, and other legacy tools.

Project Background

  • Platform: Apache Iceberg 
  • Author: Netflix
  • Released: N/A
  • Type: Open table format for storing huge tabular datasets
  • License: Apache-2.0 
  • Languages: Java, Python, and Scala 
  • GitHub: apache/iceberg
  • GitHub Stars: 2.5k stars
  • GitHub Contributors: 236

Applications

  • Run several engines (such as Spark, Trino, Flink, and Hive) against the same tables
  • Evolve table schemas and partition layouts without rewriting existing data
  • Query earlier snapshots of a table (time travel)
  • Roll back a table after a job writes bad data
  • Commit changes atomically so readers never see partial writes

Summary

  • It allows readers to use a consistent snapshot of the table without needing to hold a lock.
  • Table snapshots are kept as history, so a table can be rolled back if a job produces bad data.
  • In Hive-style tables, files within partitions are discovered by listing partition paths. Listing partitions to plan a read is expensive, especially on S3.
  • Hive-style tables also track data in two places: a central metastore for partitions and the file system for individual files. Iceberg instead tracks individual data files in table metadata.
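
The snapshot behavior summarized above can be illustrated with a small toy model. This is only a sketch of the idea, not the Iceberg API: the `Snapshot` and `ToyTable` classes, their methods, and the file paths are hypothetical names invented for this example. It assumes a table is just a pointer to an immutable list of data files, so a reader that pins a snapshot keeps a consistent view without a lock, and a rollback is a pointer move.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Immutable list of the data files that make up the table at one point in time."""
    snapshot_id: int
    files: Tuple[str, ...]


@dataclass
class ToyTable:
    """Hypothetical stand-in for a table; this is NOT the Iceberg API."""
    snapshots: Dict[int, Snapshot] = field(default_factory=dict)
    current_id: Optional[int] = None
    next_id: int = 1

    def commit(self, added_files: List[str]) -> int:
        """Create a new snapshot holding the current files plus the added ones."""
        base = self.snapshots[self.current_id].files if self.current_id is not None else ()
        snap = Snapshot(self.next_id, base + tuple(added_files))
        self.snapshots[snap.snapshot_id] = snap
        self.current_id = snap.snapshot_id  # one pointer swap publishes the commit
        self.next_id += 1
        return snap.snapshot_id

    def current(self) -> Snapshot:
        """Return the current snapshot; later commits never mutate it."""
        if self.current_id is None:
            raise ValueError("table has no snapshots yet")
        return self.snapshots[self.current_id]

    def rollback(self, snapshot_id: int) -> None:
        """Point the table back at an earlier snapshot kept in history."""
        if snapshot_id not in self.snapshots:
            raise KeyError(f"unknown snapshot {snapshot_id}")
        self.current_id = snapshot_id


table = ToyTable()
good = table.commit(["data/a.parquet"])   # first job writes good data
pinned = table.current()                  # a reader pins this snapshot
bad = table.commit(["data/bad.parquet"])  # a later job writes bad data

reader_view = pinned.files                # unchanged by the later commit
table.rollback(good)                      # undo the bad job
after_rollback = table.current().files
```

The bad snapshot stays in history after the rollback, so nothing is deleted; only the table's current pointer changes.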