Spark MLlib

Spark MLlib is Apache Spark's open-source machine learning library. It fits into Spark's APIs and interoperates with NumPy in Python. According to the maintainers, its algorithms can run up to 100x faster than MapReduce. The speedup comes from Spark's strength at iterative computation: MLlib's algorithms make repeated passes over the data, which can yield better results than the one-pass approximations sometimes used on MapReduce.
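To make the point about iteration concrete, here is a minimal sketch in plain Python (not Spark code). It fits a single weight by gradient descent and compares one pass over the data with many passes. The toy data, the learning rate, and the function names `gradient_step`, `fit`, and `mse` are all assumptions made for illustration; MLlib's own optimizers are more sophisticated.

```python
# Toy data following y = 3x exactly; kept in memory and reused on every pass.
data = [(k / 10.0, 3.0 * k / 10.0) for k in range(1, 11)]


def gradient_step(w, points, lr=0.5):
    """One full pass over the data: average the squared-error gradient, then update w."""
    grad = sum(2.0 * (w * x - y) * x for x, y in points) / len(points)
    return w - lr * grad


def fit(points, passes):
    """Run the given number of passes, starting from w = 0."""
    w = 0.0
    for _ in range(passes):
        w = gradient_step(w, points)
    return w


def mse(w, points):
    """Mean squared error of the model y = w * x on the given points."""
    return sum((w * x - y) ** 2 for x, y in points) / len(points)


w_one = fit(data, passes=1)    # a single pass leaves w far from the true value 3.0
w_many = fit(data, passes=50)  # repeated passes converge toward 3.0

print("one pass:  w=%.3f  mse=%.4f" % (w_one, mse(w_one, data)))
print("50 passes: w=%.3f  mse=%.4f" % (w_many, mse(w_many, data)))
```

Each extra pass rereads the same dataset, so a system that keeps the data available between passes makes this style of algorithm far cheaper to run.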

MLlib runs wherever Spark runs: on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud, against diverse data sources. Spark can be launched in its standalone cluster mode, on Hadoop YARN, on Mesos, on Kubernetes, or on EC2. It can read data from HDFS, Apache Hive, Apache Cassandra, Apache HBase, and hundreds of other data sources.

MLlib contains a wide array of algorithms and utilities for common machine learning tasks. Because it is developed as part of the Apache Spark project, it is updated and tested with each Spark release.

Project background

  • Library: Apache Spark MLlib
  • Author: Matei Zaharia
  • Initial Release: May 26, 2014
  • Type: Apache Spark’s scalable machine learning library
  • License: Apache License, Version 2.0
  • Contains: ML Algorithms, featurization, pipelines, persistence, utilities
  • Languages: Java, Scala, Python, and R
  • GitHub: apache/spark/tree/master/mllib
  • Runs On: Hadoop, standalone, Kubernetes, Apache Mesos, in the cloud
  • Stack Overflow tag: apache-spark-mllib


Algorithms and utilities

  • Classification
  • Regression
  • Clustering
  • Decision trees
  • Recommendation
  • Topic modeling
  • Collaborative filtering
  • Feature extraction and transformation
  • Optimization
  • Frequent pattern mining
  • Evaluation metrics
  • PMML model export
