Apache Kudu

Apache Kudu is an open-source columnar storage manager designed for the Hadoop ecosystem. It was developed by Todd Lipcon at Cloudera. He describes it as a storage system, not a streaming or processing engine: it doesn’t do SQL on its own; it just stores bytes.

Kudu was developed to plug a gap in the Hadoop ecosystem. At the storage layer, there are HDFS and HBase. HDFS is ideal for storing large volumes of data and scanning that data quickly. However, since it is a file system, it does not work well for workloads that need to insert or update individual records in structured data. At the other end of the spectrum, HBase is a NoSQL store well suited to random access, allowing users to write to individual rows, update rows, and so on, but it is less efficient at large analytical scans.

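However, a problem arises when one workload needs fast analytical scans and row-level inserts and updates on the same structured data. This is where Kudu comes in. It sits between HDFS and HBase and can serve SQL-style analytical workloads and NoSQL-style random-access workloads from a single storage system.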

  • Filesystem: HDFS  
  • Columnar Store: Kudu
  • NoSQL: HBase
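
To make the two access patterns concrete, here is a minimal in-memory sketch in Python. It is not Kudu's API or its implementation: the class `ToyColumnarTable` and its methods `upsert`, `get`, and `scan` are hypothetical names invented for this illustration. It only shows the general idea of keeping values column by column (cheap scans of one column) while a primary-key index still allows updates to individual rows.

```python
from typing import Any, Dict, List


class ToyColumnarTable:
    """Hypothetical toy table: columnar layout plus a primary-key index."""

    def __init__(self, key_column: str, columns: List[str]) -> None:
        if key_column not in columns:
            raise ValueError("key_column must be one of columns")
        self.key_column = key_column
        # One list per column, so all values of a column sit together.
        self.columns: Dict[str, List[Any]] = {name: [] for name in columns}
        # Maps a primary-key value to its row position.
        self.index: Dict[Any, int] = {}

    def upsert(self, row: Dict[str, Any]) -> None:
        """Insert a new row, or overwrite fields of the row with the same key."""
        unknown = set(row) - set(self.columns)
        if unknown:
            raise KeyError(f"unknown columns: {sorted(unknown)}")
        key = row[self.key_column]
        pos = self.index.get(key)
        if pos is None:
            # New key: append one value to every column (None if missing).
            self.index[key] = len(self.columns[self.key_column])
            for name, values in self.columns.items():
                values.append(row.get(name))
        else:
            # Existing key: update only the supplied fields in place.
            for name, value in row.items():
                self.columns[name][pos] = value

    def get(self, key: Any) -> Dict[str, Any]:
        """Random access: rebuild a single row from the column lists."""
        pos = self.index[key]
        return {name: values[pos] for name, values in self.columns.items()}

    def scan(self, column: str) -> List[Any]:
        """Analytical access: read one column without touching the others."""
        return list(self.columns[column])


# Time-series-style usage: per-host CPU readings.
table = ToyColumnarTable("host", ["host", "cpu"])
table.upsert({"host": "a", "cpu": 10})
table.upsert({"host": "b", "cpu": 30})
table.upsert({"host": "a", "cpu": 20})  # row-level update of host "a"
total_cpu = sum(table.scan("cpu"))      # column scan over the current values
```

A real deployment would use a Kudu client or a query engine such as Impala or Spark rather than anything like this class. The sketch only mirrors the scan-plus-update combination described above.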

Project Background

  • Platform: Apache Kudu 
  • Author: Cloudera
  • Released: 2016
  • License: Apache-2.0 
  • Languages: C++
  • GitHub: apache/kudu
  • GitHub Stars: 1.5k stars
  • GitHub Contributors: 120

Features

  • Integrates with MapReduce and Spark
  • Complements HDFS and HBase by filling the gap between them
  • Integrates with Apache Impala
  • Designed for OLAP tasks
  • Provides high throughput and low latency
  • Incorporates database-like semantics
  • Supports SQL and NoSQL workloads
  • Works on disk and in RAM
  • Scales to thousands of nodes and tens of PBs
  • Supports millions of read/write operations per second per cluster
  • Ideal for time series and online reporting use cases