Apache Spark provides a Python interface called PySpark. It includes the PySpark shell for interactively examining data in a distributed environment, as well as Python APIs for developing Spark applications.
The resilient distributed dataset (RDD), a read-only multiset of data items distributed over a cluster of machines and maintained in a fault-tolerant way, is the cornerstone of Apache Spark's architecture. The DataFrame API was released as an abstraction on top of the RDD, followed by the Dataset API.
- Project: PySpark
- Author: Matei Zaharia
- Initial Release: 2014
- Type: Data analytics and machine-learning framework
- License: Apache License 2.0
- Contains: Spark SQL, DataFrame, Streaming, MLlib (Machine Learning), and Spark Core.
- Language: Python, Scala, Java, SQL, R, C#, F#
- GitHub: apache/spark (PySpark lives in the `python/` directory)
- Runs On: Windows, Linux, macOS
- Twitter: None
- Spark Core: distributed task dispatching, scheduling, and basic input/output functionality
- Spark SQL: support for structured and semi-structured data
- Spark Streaming: streaming analytics
- MLlib: distributed machine-learning framework
- GraphX: distributed graph-processing framework
- Language support across multiple APIs (Python, Scala, Java, SQL, R)