Apache Hudi
Apache Hudi is an open-source storage abstraction framework that helps users build and manage petabyte-scale data lakes. Hudi (Hadoop Upserts Deletes and Incrementals) has built-in stream processing capabilities, which bring real-time processing to data lakes. The project was started at Uber, where it now manages hundreds of petabytes of storage, and it’s widely used by other prominent companies such as Walmart, TikTok, Apple, and Amazon.
The founders of Hudi describe it as having “brought data warehouse and database like functionality to data lakes, making new things like minute level data freshness or optimized storage or self-managing tables, possible directly on data lakes.”
Project Background
- Platform: Apache Hudi
- Author: Uber
- Released: 2016
- Type: Transactional Data Lake
- License: Apache-2.0
- Languages: Java and Scala
- GitHub: apache/hudi
- GitHub Stars: 2.9k stars
- GitHub Contributors: 251
Features
- Brings record-level updates, deletes, and change streams to data lakes
- Provides ACID semantics on a data lake
- Snapshot isolation
- Incremental queries return the records inserted or updated since a given point in time
- Performs SQL reads and writes from Hive, Spark, and other platforms
- Built-in metadata tracking
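
To make the ideas of record-level upserts, deletes, snapshot reads, and incremental queries more concrete, here is a small in-memory sketch in Python. It is a toy model only and does not use or reflect Hudi's actual API or storage format; the `ToyTable` class, its method names, and the integer commit times are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional, Tuple

Row = Dict[str, Any]


@dataclass
class ToyTable:
    """Toy commit log keyed by record id. Illustrative only, NOT Hudi's API."""

    # Each commit is (commit_time, {record_key: row}); a row of None marks a delete.
    commits: List[Tuple[int, Dict[str, Optional[Row]]]] = field(default_factory=list)

    def _commit(self, changes: Dict[str, Optional[Row]]) -> int:
        commit_time = len(self.commits) + 1  # simple increasing commit id
        self.commits.append((commit_time, changes))
        return commit_time

    def upsert(self, rows: List[Row]) -> int:
        """Insert new records or overwrite existing ones, matched on 'id'."""
        return self._commit({row["id"]: row for row in rows})

    def delete(self, keys: List[str]) -> int:
        """Remove individual records by key."""
        return self._commit({key: None for key in keys})

    def _replay(self, after: int, up_to: Optional[int]) -> Dict[str, Row]:
        """Apply commits with after < commit_time <= up_to, in order."""
        state: Dict[str, Row] = {}
        for commit_time, changes in self.commits:
            if commit_time <= after:
                continue
            if up_to is not None and commit_time > up_to:
                break
            for key, row in changes.items():
                if row is None:
                    state.pop(key, None)
                else:
                    state[key] = row
        return state

    def snapshot(self, as_of: Optional[int] = None) -> Dict[str, Row]:
        """Table contents as of a commit (latest if as_of is None)."""
        return self._replay(after=0, up_to=as_of)

    def incremental(self, since: int) -> Dict[str, Row]:
        """Records inserted or updated after commit `since`."""
        return self._replay(after=since, up_to=None)


table = ToyTable()
c1 = table.upsert([{"id": "a", "fare": 10}, {"id": "b", "fare": 20}])
c2 = table.upsert([{"id": "b", "fare": 25}, {"id": "c", "fare": 30}])
c3 = table.delete(["a"])

latest = table.snapshot()             # current view: "b" (updated) and "c"
as_of_c1 = table.snapshot(as_of=c1)   # earlier view: "a" and the original "b"
changed = table.incremental(since=c1)  # only what changed after c1
```

In this sketch a reader that asks for a snapshot at `c1` keeps seeing the same data regardless of later commits, while an incremental read starting from `c1` returns only the records touched afterwards. A real Hudi table implements these ideas on top of files in distributed storage and exposes them through engines such as Spark and Hive.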