MLlib (DataFrame-based)
Pipeline APIs
| Abstract class for transformers that transform one dataset into another. | |
| Abstract class for transformers that take one input column, apply transformation, and output the result as a new column. | |
| Abstract class for estimators that fit models to data. | |
| 
 | Abstract class for models that are fitted by estimators. | 
| Estimator for prediction tasks (regression and classification). | |
| Model for prediction tasks (regression and classification). | |
| 
 | A simple pipeline, which acts as an estimator. | 
| 
 | Represents a compiled pipeline with transformers and fitted models. | 
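The relationship between transformers, estimators, models, and pipelines described above can be sketched in plain Python. This is a toy illustration only: every class name below (`AddOne`, `Center`, `CenterModel`, `ToyPipeline`, `ToyPipelineModel`) is hypothetical and operates on Python lists, not on the pyspark.ml classes or on DataFrames.

```python
# Toy stand-ins for the pipeline abstractions; NOT the pyspark.ml classes.

class AddOne:
    """A transformer: maps one dataset to another with no fitting step."""
    def transform(self, rows):
        return [x + 1 for x in rows]


class CenterModel:
    """A model: a transformer that holds state learned by an estimator."""
    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):
        return [x - self.mean for x in rows]


class Center:
    """An estimator: learns from the data and returns a fitted model."""
    def fit(self, rows):
        return CenterModel(sum(rows) / len(rows))


class ToyPipelineModel:
    """A fitted pipeline: only transformers and fitted models remain."""
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class ToyPipeline:
    """A pipeline acts as an estimator: fit() returns a ToyPipelineModel."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):      # estimator: fit it on the current data
                stage = stage.fit(rows)
            fitted.append(stage)
            rows = stage.transform(rows)   # feed the output to the next stage
        return ToyPipelineModel(fitted)


model = ToyPipeline([AddOne(), Center()]).fit([1.0, 2.0, 3.0])
result = model.transform([1.0, 2.0, 3.0])
```

Fitting runs each stage in order, so the `Center` estimator sees the output of `AddOne`, and the fitted pipeline replays the same chain on new data.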
Parameters
| 
 | A param with self-contained documentation. | 
| 
 | Components that take parameters. | 
| Factory methods for common type conversion functions for Param.typeConverter. | 
Feature
| 
 | Binarize a column of continuous features given a threshold. | 
| 
 | LSH class for Euclidean distance metrics. | 
| 
 | Model fitted by  | 
| 
 | Maps a column of continuous features to a column of feature buckets. | 
| 
 | Chi-Squared feature selection, which selects categorical features to use for predicting a categorical label. | 
| 
 | Model fitted by  | 
| 
 | Extracts a vocabulary from document collections and generates a  | 
| 
 | Model fitted by  | 
| 
 | A feature transformer that takes the 1D discrete cosine transform of a real vector. | 
| 
 | Outputs the Hadamard product (i.e., the element-wise product) of each input vector with a provided “weight” vector. | 
| 
 | Feature hashing projects a set of categorical or numerical features into a feature vector of specified dimension (typically substantially smaller than that of the original feature space). | 
| 
 | Maps a sequence of terms to their term frequencies using the hashing trick. | 
| 
 | Compute the Inverse Document Frequency (IDF) given a collection of documents. | 
| 
 | Model fitted by  | 
| 
 | Imputation estimator for completing missing values, using the mean, median or mode of the columns in which the missing values are located. | 
| 
 | Model fitted by  | 
| 
 | A  | 
| 
 | Implements the feature interaction transform. | 
| 
 | Rescale each feature individually to range [-1, 1] by dividing through the largest maximum absolute value in each feature. | 
| 
 | Model fitted by  | 
| 
 | LSH class for Jaccard distance. | 
| 
 | Model produced by  | 
| 
 | Rescale each feature individually to a common range [min, max] linearly using column summary statistics, which is also known as min-max normalization or rescaling. | 
| 
 | Model fitted by  | 
| 
 | A feature transformer that converts the input array of strings into an array of n-grams. | 
| 
 | Normalize a vector to have unit norm using the given p-norm. | 
| 
 | A one-hot encoder that maps a column of category indices to a column of binary vectors, with at most a single one-value per row that indicates the input category index. | 
| 
 | Model fitted by  | 
| 
 | PCA trains a model to project vectors to a lower dimensional space of the top  | 
| 
 | Model fitted by  | 
| 
 | Perform feature expansion in a polynomial space. | 
| 
 | 
 | 
| 
 | RobustScaler removes the median and scales the data according to the quantile range. | 
| 
 | Model fitted by  | 
| 
 | A regex based tokenizer that extracts tokens either by using the provided regex pattern (in Java dialect) to split the text (default) or repeatedly matching the regex (if gaps is false). | 
| 
 | Implements the transforms required for fitting a dataset against an R model formula. | 
| 
 | Model fitted by  | 
| 
 | Implements the transforms which are defined by a SQL statement. | 
| 
 | Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set. | 
| 
 | Model fitted by  | 
| 
 | A feature transformer that filters out stop words from input. | 
| 
 | A label indexer that maps a string column of labels to an ML column of label indices. | 
| 
 | Model fitted by  | 
| 
 | A tokenizer that converts the input string to lowercase and then splits it by white spaces. | 
| 
 | Feature selector based on univariate statistical tests against labels. | 
| 
 | Model fitted by  | 
| 
 | Feature selector that removes all low-variance features. | 
| 
 | Model fitted by  | 
| 
 | A feature transformer that merges multiple columns into a vector column. | 
| 
 | Class for indexing categorical feature columns in a dataset of Vector. | 
| 
 | Model fitted by  | 
| 
 | A feature transformer that adds size information to the metadata of a vector column. | 
| 
 | This class takes a feature vector and outputs a new feature vector with a subarray of the original features. | 
| 
 | Word2Vec trains a model of Map(String, Vector), i.e. | 
| 
 | Model fitted by  | 
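Three of the transformations listed above (thresholding, n-gram generation, and min-max rescaling) are simple enough to sketch on plain Python lists. The function names here are hypothetical helpers, not Spark APIs, and the edge-case choices (strictly-greater-than threshold, space-joined n-grams, midpoint for a constant column) are assumptions of this sketch.

```python
def binarize(values, threshold):
    """Map continuous values to 0.0/1.0; assumes values strictly above the threshold become 1.0."""
    return [1.0 if v > threshold else 0.0 for v in values]


def ngrams(tokens, n):
    """Turn a list of string tokens into space-joined n-grams."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def min_max_scale(column, lo=0.0, hi=1.0):
    """Linearly rescale one feature column to the range [lo, hi]."""
    cmin, cmax = min(column), max(column)
    if cmax == cmin:
        # Assumption: a constant column maps to the midpoint of the target range.
        return [0.5 * (lo + hi) for _ in column]
    return [(v - cmin) / (cmax - cmin) * (hi - lo) + lo for v in column]


flags = binarize([0.1, 0.5, 0.9], threshold=0.5)
bigrams = ngrams(["a", "b", "c"], 2)
scaled = min_max_scale([1.0, 2.0, 3.0])
```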
Classification
| 
 | This binary classifier optimizes the Hinge Loss using the OWLQN optimizer. | 
| 
 | Model fitted by LinearSVC. | 
| 
 | Abstraction for LinearSVC Results for a given model. | 
| 
 | Abstraction for LinearSVC Training results. | 
| 
 | Logistic regression. | 
| 
 | Model fitted by LogisticRegression. | 
| 
 | Abstraction for Logistic Regression Results for a given model. | 
| 
 | Abstraction for multinomial Logistic Regression Training results. | 
| 
 | Binary Logistic regression results for a given model. | 
| Binary Logistic regression training results for a given model. | |
| 
 | Decision tree learning algorithm for classification. It supports both binary and multiclass labels, as well as both continuous and categorical features. | 
| 
 | Model fitted by DecisionTreeClassifier. | 
| 
 | Gradient-Boosted Trees (GBTs) learning algorithm for classification. It supports binary labels, as well as both continuous and categorical features. | 
| 
 | Model fitted by GBTClassifier. | 
| 
 | Random Forest learning algorithm for classification. It supports both binary and multiclass labels, as well as both continuous and categorical features. | 
| 
 | Model fitted by RandomForestClassifier. | 
| 
 | Abstraction for RandomForestClassification Results for a given model. | 
| Abstraction for RandomForestClassification training results. | |
| BinaryRandomForestClassification results for a given model. | |
| BinaryRandomForestClassification training results for a given model. | |
| 
 | Naive Bayes Classifiers. | 
| 
 | Model fitted by NaiveBayes. | 
| 
 | Classifier trainer based on the Multilayer Perceptron. | 
| Model fitted by MultilayerPerceptronClassifier. | |
| Abstraction for MultilayerPerceptronClassifier Results for a given model. | |
| Abstraction for MultilayerPerceptronClassifier Training results. | |
| 
 | Reduction of Multiclass Classification to Binary Classification. | 
| 
 | Model fitted by OneVsRest. | 
| 
 | Factorization Machines learning algorithm for classification. | 
| 
 | Model fitted by  | 
| 
 | Abstraction for FMClassifier Results for a given model. | 
| 
 | Abstraction for FMClassifier Training results. | 
Clustering
| 
 | A bisecting k-means algorithm based on the paper “A comparison of document clustering techniques” by Steinbach, Karypis, and Kumar, with modification to fit Spark. | 
| 
 | Model fitted by BisectingKMeans. | 
| 
 | Bisecting KMeans clustering results for a given model. | 
| 
 | K-means clustering with a k-means++ like initialization mode (the k-means|| algorithm by Bahmani et al.). | 
| 
 | Model fitted by KMeans. | 
| 
 | Summary of KMeans. | 
| 
 | GaussianMixture clustering. | 
| 
 | Model fitted by GaussianMixture. | 
| 
 | Gaussian mixture clustering results for a given model. | 
| 
 | Latent Dirichlet Allocation (LDA), a topic model designed for text documents. | 
| 
 | Latent Dirichlet Allocation (LDA) model. | 
| 
 | Local (non-distributed) model fitted by  | 
| 
 | Distributed model fitted by  | 
| 
 | Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by Lin and Cohen. From the abstract: PIC finds a very low-dimensional embedding of a dataset using truncated power iteration on a normalized pair-wise similarity matrix of the data. | 
Functions
| 
 | Converts a column of arrays of numeric type into a column of pyspark.ml.linalg.DenseVector instances. | 
| 
 | Converts a column of MLlib sparse/dense vectors into a column of dense arrays. | 
| 
 | Given a function which loads a model and returns a predict function for inference over a batch of numpy inputs, returns a Pandas UDF wrapper for inference over a Spark DataFrame. | 
Vector and Matrix
| 
 | A dense vector represented by a value array. | 
| 
 | A simple sparse vector class for passing data to MLlib. | 
| Factory methods for working with vectors. | |
| 
 | |
| 
 | Column-major dense matrix. | 
| 
 | Sparse Matrix stored in CSC format. | 
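The two storage layouts named above, column-major dense and CSC (compressed sparse column), can be illustrated with plain lists. This is a minimal sketch of the layouts themselves; the helper names and argument order are hypothetical and do not mirror the constructors of the classes listed in the table.

```python
def dense_entry(values, num_rows, i, j):
    """Read entry (i, j) from a column-major value array."""
    return values[j * num_rows + i]


def csc_to_rows(num_rows, num_cols, col_ptrs, row_indices, values):
    """Expand a CSC matrix into a list of dense rows.

    col_ptrs[j]:col_ptrs[j + 1] is the slice of row_indices/values
    that belongs to column j.
    """
    rows = [[0.0] * num_cols for _ in range(num_rows)]
    for j in range(num_cols):
        for k in range(col_ptrs[j], col_ptrs[j + 1]):
            rows[row_indices[k]][j] = values[k]
    return rows


# The 2x2 matrix [[1, 2], [3, 4]] stored column by column.
column_major = [1.0, 3.0, 2.0, 4.0]
top_right = dense_entry(column_major, num_rows=2, i=0, j=1)

# The 3x2 matrix [[1, 0], [0, 2], [3, 0]] in CSC form.
expanded = csc_to_rows(3, 2, col_ptrs=[0, 2, 3],
                       row_indices=[0, 2, 1], values=[1.0, 3.0, 2.0])
```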
Recommendation
| 
 | Alternating Least Squares (ALS) matrix factorization. | 
| 
 | Model fitted by ALS. | 
Regression
| 
 | Accelerated Failure Time (AFT) Model Survival Regression. | 
| 
 | Model fitted by  | 
| 
 | Decision tree learning algorithm for regression. It supports both continuous and categorical features. | 
| 
 | Model fitted by  | 
| 
 | Gradient-Boosted Trees (GBTs) learning algorithm for regression. It supports both continuous and categorical features. | 
| 
 | Model fitted by  | 
| 
 | Generalized Linear Regression. | 
| 
 | Model fitted by  | 
| 
 | Generalized linear regression results evaluated on a dataset. | 
| Generalized linear regression training results. | |
| 
 | Currently implemented using parallelized pool adjacent violators algorithm. | 
| 
 | Model fitted by  | 
| 
 | Linear regression. | 
| 
 | Model fitted by  | 
| 
 | Linear regression results evaluated on a dataset. | 
| 
 | Linear regression training results. | 
| 
 | Random Forest learning algorithm for regression. It supports both continuous and categorical features. | 
| 
 | Model fitted by  | 
| 
 | Factorization Machines learning algorithm for regression. | 
| 
 | Model fitted by  | 
Statistics
| Conduct Pearson’s independence test for every feature against the label. | |
| Compute the correlation matrix for the input dataset of Vectors using the specified method. | |
| Conduct the two-sided Kolmogorov Smirnov (KS) test for data sampled from a continuous distribution. | |
| 
 | Represents a (mean, cov) tuple. | 
| Tools for vectorized statistics on MLlib Vectors. | |
| 
 | A builder object that provides summary statistics about a given column. | 
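As an illustration of what a correlation matrix over a dataset of vectors contains, here is a small Pearson-correlation sketch on plain tuples. It assumes the Pearson method and non-constant columns; the function names are hypothetical and this is not how Spark computes the result.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (assumes neither is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def correlation_matrix(rows):
    """Pairwise correlations between the columns of a list of equal-length vectors."""
    cols = list(zip(*rows))
    return [[pearson(a, b) for b in cols] for a in cols]


# Each row is one vector; column 1 doubles column 0, column 2 decreases as column 0 grows.
corr = correlation_matrix([(1.0, 2.0, 3.0), (2.0, 4.0, 2.0), (3.0, 6.0, 1.0)])
```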
Tuning
| Builder for a param grid used in grid search-based model selection. | |
| 
 | K-fold cross validation performs model selection by splitting the dataset into a set of non-overlapping randomly partitioned folds which are used as separate training and test datasets. For example, with k=3 folds, K-fold cross validation will generate 3 (training, test) dataset pairs, each of which uses 2/3 of the data for training and 1/3 for testing. | 
| 
 | CrossValidatorModel contains the model with the highest average cross-validation metric across folds and uses this model to transform input data. | 
| 
 | Validation for hyper-parameter tuning. | 
| 
 | Model from train validation split. | 
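The two ideas behind grid-search tuning, expanding a parameter grid and generating k-fold (training, test) pairs, can be sketched with the standard library. The helpers `param_grid` and `k_fold_pairs` are hypothetical, the parameter names in the grid are illustrative keys only, and the fold assignment here (shuffle, then stride) is an assumption of this sketch rather than Spark's partitioning scheme.

```python
import random
from itertools import product


def param_grid(grid):
    """Expand {name: [values]} into one dict per parameter combination."""
    names = list(grid)
    return [dict(zip(names, combo))
            for combo in product(*(grid[name] for name in names))]


def k_fold_pairs(rows, k, seed=0):
    """Split rows into k non-overlapping folds and return k (train, test) pairs."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    pairs = []
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        pairs.append((train, test))
    return pairs


# Illustrative parameter names; 2 x 3 = 6 candidate settings.
candidates = param_grid({"regParam": [0.1, 0.01], "maxIter": [10, 50, 100]})

# With k=3 and 9 rows, each pair trains on 6 rows and tests on 3.
pairs = k_fold_pairs(range(9), k=3)
```

A full grid search would then train one model per (candidate, pair) combination and keep the candidate with the best average metric across folds.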
Evaluation
| Base class for evaluators that compute metrics from predictions. | |
| 
 | Evaluator for binary classification, which expects input columns rawPrediction, label and an optional weight column. | 
| 
 | Evaluator for Regression, which expects input columns prediction, label and an optional weight column. | 
| Evaluator for Multiclass Classification, which expects input columns: prediction, label, weight (optional) and probabilityCol (only for logLoss). | |
| Evaluator for Multilabel Classification, which expects two input columns: prediction and label. | |
| 
 | Evaluator for Clustering results, which expects two input columns: prediction and features. | 
| 
 | Evaluator for Ranking, which expects two input columns: prediction and label. | 
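Each evaluator above reduces a set of (prediction, label) rows to a single metric. A minimal sketch of that reduction for two common metrics, root mean squared error and accuracy, is shown below; the functions are hypothetical helpers over plain tuples, and which metrics a given evaluator supports or uses by default is not stated in this listing.

```python
import math


def rmse(rows):
    """Root mean squared error over (prediction, label) pairs."""
    return math.sqrt(sum((p - l) ** 2 for p, l in rows) / len(rows))


def accuracy(rows):
    """Fraction of (prediction, label) pairs where the two agree."""
    return sum(1 for p, l in rows if p == l) / len(rows)


regression_rows = [(2.0, 0.0), (0.0, 2.0)]
classification_rows = [(1, 1), (0, 1), (2, 2), (0, 0)]

regression_error = rmse(regression_rows)
classification_accuracy = accuracy(classification_rows)
```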
Frequency Pattern Mining
| 
 | A parallel FP-growth algorithm to mine frequent itemsets. | 
| 
 | Model fitted by FPGrowth. | 
| 
 | A parallel PrefixSpan algorithm to mine frequent sequential patterns. | 
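To make "frequent itemsets" concrete, the sketch below counts every itemset by brute force and keeps those meeting a minimum support. This is deliberately not FP-growth (it enumerates all subsets and only suits tiny inputs); the function name and the convention that support is a fraction of transactions are assumptions of this example.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for itemsets appearing in at least
    min_support * len(transactions) transactions (brute force)."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    min_count = min_support * len(transactions)
    return {itemset: c for itemset, c in counts.items() if c >= min_count}


baskets = [["a", "b"], ["a", "c"], ["a", "b", "c"]]
frequent = frequent_itemsets(baskets, min_support=0.5)
```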
Image
| Internal class for pyspark.ml.image.ImageSchema attribute. | |
| Internal class for pyspark.ml.image.ImageSchema attribute. | 
Distributor
| 
 | A class to support distributed training on PyTorch and PyTorch Lightning using PySpark. | 
| 
 | 
Utilities
| Base class for MLWriter and MLReader. | |
| Helper trait for making simple  | |
| 
 | Specialization of  | 
| Helper trait for making simple  | |
| 
 | Specialization of  | 
| Utility class that can save ML instances in different formats. | |
| Base class for models that provides Training summary. | |
| Object with a unique ID. | |
| Mixin for instances that provide  | |
| 
 | Utility class that can load ML instances. | 
| Mixin for ML instances that provide  | |
| 
 | Utility class that can save ML instances. |