PySpark is the Python interface to Apache Spark. It lets you develop Spark applications using Python APIs and provides the PySpark shell for interactively analyzing data in a distributed environment.

The resilient distributed dataset (RDD), a read-only multiset of data items partitioned across a cluster of machines and maintained in a fault-tolerant manner, is the cornerstone of Apache Spark's architecture. The DataFrame API was later released as a higher-level abstraction over the RDD, followed by the Dataset API.

Project Background

  • Project: PySpark
  • Author: Matei Zaharia
  • Initial Release: 2014
  • Type: Data analytics and machine-learning framework
  • License: Apache License 2.0
  • Contains: Spark SQL, DataFrame, Streaming, MLlib (Machine Learning), and Spark Core.
  • Language: Python, Scala, Java, SQL, R
  • GitHub: apache/spark
  • Runs On: Windows, Linux, macOS
  • Twitter: None


Key Features

  • Spark Core: distributed task dispatching, scheduling, and basic input/output functionality
  • Spark SQL: support for structured and semi-structured data
  • Spark Streaming: streaming analytics
  • MLlib: distributed machine-learning framework
  • GraphX: distributed graph-processing framework
  • Language support: Python, Scala, Java, SQL, and R