Use MLlib, a Spark-based library, for distributed machine learning tasks and large-scale datasets.
Similar to scikit-learn, MLlib provides the following tools:
- ML Algorithms: Classification, regression, clustering, and collaborative filtering
- Featurization: Feature extraction, transformation, dimensionality reduction, and selection
- Pipelines: Construction, evaluation, and tuning of ML Pipelines