Spark MLlib

Spark MLlib is Apache Spark's open-source machine learning library. In Python, it interoperates with NumPy. According to the maintainers, its algorithms can run up to 100x faster than MapReduce. Spark is built for iterative computation, so MLlib can make repeated passes over the data, which can yield better results than the one-pass approximations sometimes used on MapReduce.
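To make the iterative-versus-one-pass point concrete, here is a minimal sketch in plain Python. It does not use the MLlib API. It assumes batch gradient descent as a stand-in for an iterative algorithm, and the toy dataset, learning rate, and function names are all hypothetical, chosen only for illustration.

```python
# Illustrative only: plain Python, not the MLlib API.
# Fits y = w*x + b by batch gradient descent on a tiny in-memory dataset.

def mse(data, w, b):
    """Mean squared error of the line y = w*x + b on the dataset."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def gradient_descent(data, passes, lr=0.05):
    """Run `passes` full sweeps over the data, updating w and b once per sweep."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(passes):
        # Each sweep re-reads the whole dataset, which is why keeping the
        # data in memory between sweeps helps iterative algorithms.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical toy data generated from y = 2x + 1.
data = [(float(x), 2.0 * x + 1.0) for x in range(5)]

one_pass = gradient_descent(data, passes=1)
many_passes = gradient_descent(data, passes=500)

print("one pass:  ", one_pass, "error:", mse(data, *one_pass))
print("500 passes:", many_passes, "error:", mse(data, *many_passes))
```

On this toy data, the single sweep leaves a rough estimate of the line, while the repeated sweeps settle close to w = 2 and b = 1. A distributed engine has to re-read the dataset on every sweep, so the cost of each pass largely determines how many passes are practical.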

MLlib runs wherever Spark runs: on Hadoop, Kubernetes, Apache Mesos, or in the cloud, against diverse data sources. You can run it using Spark's standalone cluster mode, on Hadoop YARN, or on EC2. It can also access data in Apache Hive, Apache Cassandra, HDFS, Apache HBase, and hundreds of other data sources.

MLlib contains a wide array of algorithms and utilities for performing different machine learning tasks. It's updated and tested with each Spark release.

Project background

  • Library: Apache Spark MLlib
  • Author: Matei Zaharia
  • Initial Release: May 26, 2014
  • Type: Apache Spark’s scalable machine learning library
  • License: Apache License, Version 2.0
  • Contains: ML Algorithms, featurization, pipelines, persistence, utilities
  • Language: Java, Scala, and Python
  • GitHub: apache/spark/tree/master/mllib
  • Runs On: Hadoop, standalone, Kubernetes, Apache Mesos, in the cloud
  • Stack Overflow: apache-spark-mllib

Applications

  • Classification
  • Regression
  • Clustering
  • Decision trees
  • Recommendation
  • Topic modeling
  • Collaborative filtering
  • Feature extraction and transformation
  • Optimization
  • Frequent pattern mining
  • Evaluation metrics
  • PMML model export
