
MLlib: Main Guide - Spark 4.1.0 Documentation
As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package.
MLlib | Apache Spark
MLlib is Apache Spark's scalable machine learning library, with APIs in Java, Scala, Python, and R.
ML Pipelines - Spark 4.1.0 Documentation
Machine learning can be applied to a wide variety of data types, such as vectors, text, images, and structured data. This API adopts the DataFrame from Spark SQL in order to support a variety of data …
Classification and regression - Spark 4.1.0 Documentation
The spark.ml implementation supports decision trees for binary and multiclass classification and for regression, using both continuous and categorical features.
MLlib (DataFrame-based) — PySpark 4.1.0 documentation - Apache …
Note: From Apache Spark 4.0.0, all built-in algorithms support Spark Connect.
Apache Spark™ - Unified Engine for large-scale data analytics
Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
MLlib: RDD-based API - Spark 4.1.0 Documentation
This page documents sections of the MLlib guide for the RDD-based API (the spark.mllib package). Please see the MLlib Main Guide for the DataFrame-based API (the spark.ml package), which is now …
Clustering - Spark 4.1.0 Documentation
The spark.ml implementation uses the expectation-maximization algorithm to induce the maximum-likelihood model given a set of samples. GaussianMixture is implemented as an Estimator and …
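To make the expectation-maximization idea in the snippet above concrete, here is a minimal pure-Python sketch of EM for a one-dimensional Gaussian mixture. It is not the spark.ml implementation and does not use Spark. The function names (gaussian_pdf, fit_gmm_1d), the quantile-based initialization, the variance floor, and the fixed iteration count are all assumptions made for illustration.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(samples, k=2, iterations=50):
    """Fit a k-component 1-D Gaussian mixture with plain EM (illustrative only).

    Returns (weights, means, variances), each a list of length k.
    """
    n = len(samples)
    ordered = sorted(samples)

    # Assumed initialization: spread the starting means over evenly spaced quantiles.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    overall_mean = sum(samples) / n
    overall_var = sum((x - overall_mean) ** 2 for x in samples) / n
    variances = [max(overall_var, 1e-6)] * k
    weights = [1.0 / k] * k

    for _ in range(iterations):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in samples:
            dens = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            if total == 0.0:
                # Numerical underflow: fall back to uniform responsibilities.
                resp.append([1.0 / k] * k)
            else:
                resp.append([d / total for d in dens])

        # M-step: re-estimate weights, means, and variances from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj < 1e-12:
                continue  # Component received no mass; leave it unchanged.
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, samples)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, samples)) / nj
            variances[j] = max(var_j, 1e-6)  # Floor keeps the density well-defined.

    return weights, means, variances
```

Calling fit_gmm_1d on a list of floats drawn from two well-separated clusters should return mixing weights that sum to 1 and means close to the cluster centers. A production implementation would normally also track the log-likelihood and stop once it converges, rather than running a fixed number of iterations.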
PySpark Overview — PySpark 4.1.0 documentation - Apache Spark
Dec 11, 2025 · Built on top of Spark, MLlib is a scalable machine learning library that provides a uniform set of high-level APIs that help users create and tune practical machine learning pipelines.
Overview - Spark 4.0.0 Documentation
It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, pandas API on Spark for pandas workloads, MLlib for machine learning, GraphX for …