Apache Beam

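Beam is an open source, unified programming model for defining and executing batch and streaming data pipelines. Pipelines run on distributed processing engines, called runners, such as Apache Spark, Apache Flink, Apache Samza, and Google Cloud Dataflow. A Beam pipeline is a directed acyclic graph (DAG) in which the nodes are transforms (computation steps) and the edges are the datasets passed between them.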

Project Background

  • Tool: Apache Beam 
  • Author: Google (donated to the Apache Software Foundation)
  • Released: June 2016
  • Type: Open source unified programming model for batch and streaming data processing
  • License: Apache License 2.0
  • Supported runners: Apache Flink, Apache Spark, Apache Samza, and Google Cloud Dataflow
  • GitHub: apache/beam

Applications

  • Batch-data parallel-processing pipelines
  • Streaming-data parallel-processing pipelines
  • Large-scale data processing
  • Distributed processing
  • Parallel processing
  • Runner-independent pipeline abstraction

Summary

  • Use a single programming model for both batch and streaming use cases.
  • Run the same pipeline on different runners.
  • Choose your language: Python, Java, or Go.
  • A pipeline is built from three core abstractions: Pipeline (the whole job), PCollection (a distributed dataset), and PTransform (a processing step).
  • Extend Beam by writing and sharing new SDKs, I/O connectors, and transform libraries.