This project generalizes the Spark MLlib batch K-means clusterer (v1.1.0) and the Spark MLlib streaming K-means clusterer (v1.2.0). Most practical variants of K-means clustering are implemented, or can be implemented, with this package, including:
- clustering using general distance functions (Bregman divergences)
- clustering large numbers of points using mini-batches
- clustering high dimensional Euclidean data
- clustering high dimensional time series data
- clustering using symmetrized Bregman divergences
- clustering via bisection
- clustering with near-optimality
- clustering streaming data
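
To make the first item concrete, here is a minimal, single-machine sketch of K-means over a pluggable Bregman divergence. It is written in plain Python for illustration only and does not use this package's API or Spark; every name in it (`bregman_kmeans`, `generalized_kl`, and so on) is hypothetical. It relies on a standard property of Bregman divergences: when the cluster center is the second argument, the arithmetic mean of a cluster minimizes the total divergence, so the usual Lloyd-style update still applies.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Sequence[float]
Divergence = Callable[[Point, Point], float]


def squared_euclidean(x: Point, c: Point) -> float:
    """Bregman divergence generated by the squared Euclidean norm."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def generalized_kl(x: Point, c: Point) -> float:
    """Generalized KL divergence; assumes strictly positive coordinates."""
    return sum(xi * math.log(xi / ci) - xi + ci for xi, ci in zip(x, c))


def mean(points: List[Point]) -> List[float]:
    """Coordinate-wise arithmetic mean of a non-empty list of points."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def assign(points: List[Point], centers: List[Point],
           divergence: Divergence) -> List[int]:
    """Index of the closest center for each point under the divergence."""
    return [
        min(range(len(centers)), key=lambda j: divergence(p, centers[j]))
        for p in points
    ]


def bregman_kmeans(points: List[Point], initial_centers: List[Point],
                   divergence: Divergence,
                   max_iterations: int = 20) -> Tuple[List[List[float]], List[int]]:
    """Lloyd-style iteration: assign by divergence, then move centers to means."""
    centers = [list(c) for c in initial_centers]
    for _ in range(max_iterations):
        labels = assign(points, centers, divergence)
        new_centers = []
        for j, old_center in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == j]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(mean(members) if members else old_center)
        if new_centers == centers:
            break  # Centers stopped moving, so assignments are stable.
        centers = new_centers
    return centers, assign(points, centers, divergence)
```

Swapping `squared_euclidean` for `generalized_kl` changes only the assignment step; the center update is unchanged. A distributed implementation would additionally need to aggregate per-cluster sums and counts across partitions, which this sketch leaves out.
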
If you find a novel variant of K-means clustering that is provably superior in some respect, implement it using this package and send a pull request along with the paper that analyzes the variant!
This code has been tested on data sets of tens of millions of points in a 700+ dimensional space using a variety of distance functions. Thanks to the excellent core Spark implementation, it rocks!