Apache Spark

Apache Spark is a unified analytics engine built for large-scale data processing. It provides high-level APIs in Java, Scala, Python, and R, and supports the general computation graphs required for data analysis. According to the project, thousands of organizations, including roughly 80% of the Fortune 500, use Apache Spark for scalable computing, and it has more than 2,000 contributors from industry and academia.

Project Background

  • Framework: Apache Spark
  • Author: Matei Zaharia
  • Released: May 26, 2014
  • Type: Open-source unified analytics engine for large-scale data processing
  • License: Apache License 2.0
  • Components: Spark Core, Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph processing)
  • Language: Scala, Java, SQL, Python, R, C#, F#
  • GitHub: apache/spark

Topics

  • Big data
  • Machine learning

Key Features

  • Unifies batch and streaming data processing using your preferred programming language
  • Helps with Exploratory Data Analysis (EDA) on petabyte-scale data. No downsampling!
  • Fast execution of distributed ANSI SQL queries
  • Supports training machine learning algorithms at scale via the MLlib library
  • Supports a pseudo-distributed local mode for development or testing
  • Runs standalone or on Hadoop YARN, Apache Mesos, or Kubernetes