Apache Hudi

Apache Hudi is an open-source storage abstraction framework that helps users build and manage petabyte-scale data lakes. Hudi (Hadoop Upserts Deletes and Incrementals) has built-in stream processing capabilities, which bring real-time processing to data lakes. The project was started at Uber, where it now manages hundreds of petabytes of storage, and it is widely used by other prominent companies such as Walmart, TikTok, Apple, and Amazon.

The founders of Hudi describe it as having “brought data warehouse and database like functionality to data lakes, making new things like minute level data freshness or optimized storage or self-managing tables, possible directly on data lakes.”       

Project Background

  • Platform: Apache Hudi
  • Author: Uber 
  • Released: 2016
  • Type: Transactional Data Lake
  • License: Apache-2.0 
  • Languages: Java and Scala 
  • GitHub: apache/hudi
  • GitHub Stars: 2.9k
  • GitHub Contributors: 251

Features

  • Brings record-level updates, deletes, and change streams to data lakes
  • Provides ACID semantics on a data lake
  • Provides snapshot isolation
  • Incremental queries return the records inserted or updated since a given point in time
  • Supports SQL reads and writes from Hive, Spark, and other platforms
  • Built-in metadata tracking
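
The record-level upserts and incremental queries listed above are typically driven through options passed to the Spark datasource. The following is a minimal sketch, not a definitive recipe: the option keys are the ones used in the Hudi 0.x Spark quickstart and may differ in other releases, while the table name, column names, instant time, and helper function names are hypothetical and chosen only for illustration.

```python
# Sketch only: builds the option maps a Spark job would pass to the Hudi
# datasource. Option keys follow the Hudi 0.x Spark quickstart (verify them
# against your Hudi version); all table and column names are hypothetical.


def upsert_options(table, record_key, precombine, partition_path):
    """Options for a record-level upsert into a Hudi table."""
    return {
        "hoodie.table.name": table,
        "hoodie.datasource.write.operation": "upsert",
        # Column that uniquely identifies a record.
        "hoodie.datasource.write.recordkey.field": record_key,
        # Column used to pick the latest version when keys collide.
        "hoodie.datasource.write.precombine.field": precombine,
        "hoodie.datasource.write.partitionpath.field": partition_path,
    }


def incremental_read_options(begin_instant):
    """Options for reading only records changed after a commit instant."""
    return {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }


if __name__ == "__main__":
    write_opts = upsert_options("trips", "uuid", "ts", "city")
    read_opts = incremental_read_options("20240101000000")
    for key, value in {**write_opts, **read_opts}.items():
        print(f"{key} = {value}")

    # With a SparkSession `spark`, a DataFrame `df`, and a table path
    # `base_path` (all assumed, not defined here), usage would look like:
    #   df.write.format("hudi").options(**write_opts).mode("append").save(base_path)
    #   spark.read.format("hudi").options(**read_opts).load(base_path)
```

Keeping the options in plain dictionaries makes it easy to reuse the same record key and precombine settings across jobs; the Spark calls themselves are left as comments because they require a running Spark session with the Hudi bundle on the classpath.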