Apache Iceberg
Apache Iceberg is an open table format designed for large, petabyte-scale tables. At its core, the table format helps users organize, manage, and track the files that make up a table. Netflix developed the project to improve on their “cobbled together data warehouse architecture” comprising Hadoop, Presto, Hive, and other legacy tools.
Project Background
- Platform: Apache Iceberg
- Author: Netflix
- Released: N/A
- Type: Open table format for storing huge tabular data
- License: Apache-2.0
- Languages: Java, Python, and Scala
- GitHub: apache/iceberg
- GitHub Stars: 2.5k stars
- GitHub Contributors: 236
Applications
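- Manage large analytic tables shared by multiple query engines such as Spark, Trino, Flink, Presto, and Hive
- Schema evolution: add, drop, rename, or reorder columns without rewriting the table
- Hidden partitioning and partition evolution
- Time travel queries against earlier table snapshots
- Rollback to a previous snapshot after a bad write
- Concurrent writes with serializable isolation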
Summary
- Readers work from a consistent snapshot of the table without needing to hold a lock.
- Table snapshots are kept as history, so a table can be rolled back if a job produces bad data.
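- In Hive-style tables, files within partitions are discovered by listing partition paths. Listing partitions to plan a read is expensive, especially on S3. Iceberg instead tracks individual data files in table metadata.
- Hive-style tables also track data in two places: a central metastore for partitions and the file system for files. Iceberg keeps a complete list of data files in each snapshot.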
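
The snapshot and rollback behavior described in the summary can be illustrated with a small toy model. This is a sketch only and does not use the Iceberg API: the names `Snapshot`, `ToyTable`, `commit`, and `rollback_to`, and the file paths, are hypothetical. It assumes each snapshot holds a complete, immutable list of data files, and that the table is just a pointer to the current snapshot.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Immutable table version (hypothetical, not an Iceberg class)."""
    snapshot_id: int
    files: Tuple[str, ...]  # complete list of data files in this version


@dataclass
class ToyTable:
    """Toy table: a history of snapshots plus a pointer to the current one."""
    snapshots: List[Snapshot] = field(default_factory=list)
    current_id: int = -1  # -1 means nothing has been committed yet

    def current(self) -> Snapshot:
        if self.current_id < 0:
            raise ValueError("table has no snapshots yet")
        return self.snapshots[self.current_id]

    def commit(self, added_files: List[str]) -> Snapshot:
        # A commit never mutates an old snapshot; it appends a new one
        # built from the current file list plus the added files.
        base = self.current().files if self.current_id >= 0 else ()
        snap = Snapshot(len(self.snapshots), base + tuple(added_files))
        self.snapshots.append(snap)
        self.current_id = snap.snapshot_id
        return snap

    def rollback_to(self, snapshot_id: int) -> None:
        # Rollback only moves the pointer; history is kept intact.
        if not 0 <= snapshot_id < len(self.snapshots):
            raise ValueError(f"unknown snapshot id: {snapshot_id}")
        self.current_id = snapshot_id


table = ToyTable()
good = table.commit(["data/a.parquet"])

# A reader pins the snapshot it started with; no lock is held.
reader_view = table.current()

# A later job writes bad data while the reader is still running.
table.commit(["data/bad.parquet"])

# The reader's view is unchanged because snapshots are immutable.
files_seen_by_reader = reader_view.files

# The bad commit is undone by pointing the table back at the good snapshot.
table.rollback_to(good.snapshot_id)
files_after_rollback = table.current().files
```

Because a reader plans its scan from the file list in a pinned snapshot, it never needs to list partition directories, which is the expensive step the summary mentions for S3.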