Hadoop Ecosystem

The Apache Hadoop is a framework and collection of modules that help distribute, process and store large datasets across clusters of bare metal hardware. Doug Cutting, one of the developers created it while working at Yahoo and name it after his son’s toy elephant. 

At its core, Hadoop is a big data ecosystem capable of storing petabytes of data, structured and unstructured, across thousands of computers. The three primary modules are HDFS, MapReduce, and YARN. HDFS is the file system and storage that provides fault tolerance by storing multiple copies of data across several systems. The data is stored in blocks across several nodes.  

MapReduce is the processing engine by breaks data into parts and processes them individually on various nodes. And YARN is the module that is comprised of a resource manager, node manager, application master, and containers. Other popular modules are Hive, Pig, and HBase.

Project Background

  • Framework: Apache Hadoop
  • Author: Doug Cutting and Mike Cafarella
  • Released: April 1, 2006;
  • Type: Open-source software
  • License: Apache License 2.0
  • Language: Java
  • GitHub: apache/hadoop
  • Runs on: Multi-platform
  • GitHub Stars: 12.4k
  • GitHub Contributors: 433

Features

  • HDFS: file system and storage
  • MapReduce: Processing engine
  • YARN: Resource manager
Scroll to Top