05/02/2022 KevinZonda
Centroid-based: describe each cluster by its mean
Goal: assign data to K clusters
Algorithm's objective: minimise the within-cluster variances of all clusters
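
One common way to write this objective is sketched below. It assumes squared Euclidean distance; $C_k$ (the set of data points assigned to cluster $k$), $m_k$ (the mean of cluster $k$) and $J$ are labels introduced here for illustration.

$$
J = \sum_{k=1}^{K} \sum_{x_n \in C_k} \lVert x_n - m_k \rVert^2
$$

Each inner sum is the (unnormalised) variance of one cluster around its own mean, so minimising $J$ makes every cluster as tight as possible.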
Assume
- Choose 2 random points as the initial cluster points.
- Calculate the distance between each data point and each cluster point, then assign each data point to the nearest cluster.
- Move each cluster point to the centroid (mean) of the data points assigned to it.

In the last step we moved the cluster points to the centroids, so the distances between the data points and the cluster points have changed. Therefore, we need to reassign the data points.
After reassigning the data points, the means may need to be adjusted again, as in the previous step. So we simply repeat these steps.
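
The steps above can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: it assumes Euclidean distance, picks the initial cluster points from the data itself, and the function name `kmeans`, its parameters and the toy data are all hypothetical.

```python
import numpy as np

def kmeans(x, k, seed=0, max_iter=100):
    """Minimal K-means sketch. x has shape (N, D); returns (means, labels)."""
    rng = np.random.default_rng(seed)
    # Assumption: use k distinct data points as the initial cluster points.
    means = x[rng.choice(len(x), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Distance from every data point to every cluster point: shape (N, k).
        dists = np.linalg.norm(x[:, None, :] - means[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop once no data point changes cluster.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Move each cluster point to the mean of its assigned data points.
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old point if a cluster is empty
                means[j] = members.mean(axis=0)
    return means, labels

# Hypothetical toy data: two well-separated groups in 2-D.
x = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
means, labels = kmeans(x, k=2)
print(means)
print(labels)
```

On this toy data the first three points end up in one cluster and the last three in the other; which cluster gets index 0 depends on the random start.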
No | Initial Clusters | Final Clusters |
---|---|---|
1 | | |
2 | | |
Different starting points can lead to different clustering results. Therefore, we say:
- K-means is a non-deterministic method
- K-means finds a local optimum (not necessarily the global one, so multiple restarts are often necessary)
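
A small sketch of why restarts matter: the same data is clustered from two different starting points and the result with the lower within-cluster sum of squares is kept. The helper `kmeans_from`, the four data points and both sets of starting means are hypothetical, chosen so that one start gets stuck in a poor local optimum.

```python
import numpy as np

def assign(x, means):
    """Index of the closest mean for every data point."""
    dists = np.linalg.norm(x[:, None, :] - means[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans_from(x, start, max_iter=100):
    """Run K-means from the given starting means; return (means, labels, cost)."""
    means = np.array(start, dtype=float)
    for _ in range(max_iter):
        labels = assign(x, means)
        new_means = means.copy()
        for j in range(len(means)):
            if np.any(labels == j):
                new_means[j] = x[labels == j].mean(axis=0)
        if np.allclose(new_means, means):  # means stopped moving
            break
        means = new_means
    labels = assign(x, means)
    # Within-cluster sum of squared distances to the assigned mean.
    cost = float(((x - means[labels]) ** 2).sum())
    return means, labels, cost

# Four corners of a wide rectangle: the natural split is left vs right.
x = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])
starts = [
    [[0.0, 0.5], [4.0, 0.5]],  # leads to the left/right split
    [[2.0, 0.0], [2.0, 1.0]],  # gets stuck on a bottom/top split
]
results = [kmeans_from(x, s) for s in starts]
costs = [cost for _, _, cost in results]
best_means, best_labels, best_cost = min(results, key=lambda r: r[2])
print(costs, best_cost)
```

Here the first start reaches a cost of 1.0 while the second stays at 16.0, even though neither run can improve further by itself. Keeping the lowest-cost run over several starts is the usual remedy.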
Data: $x_{1:N}$
Choose initial cluster means $m_{1:K}$
Should have the same dimension as the data
Assign each data point to its closest mean
$\arg \min$ gives the value of the variable at which the expression that follows it reaches its minimum.
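
Written out, the assignment step could look like the following. This assumes Euclidean distance; $z_n$ is a label introduced here for the index of the cluster that data point $x_n$ is assigned to.

$$
z_n = \arg\min_{k \in \{1, \dots, K\}} \lVert x_n - m_k \rVert
$$

That is, $z_n$ is not the smallest distance itself but the value of $k$ for which the distance between $x_n$ and $m_k$ is smallest.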
To
Compute each cluster mean as the coordinate-wise average of the data points assigned to that cluster.
i.e. in each dimension
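
As a sketch of this update, let $C_k$ (a label introduced here) be the set of data points currently assigned to cluster $k$, and let $d$ index the dimensions. Then, for every dimension $d$:

$$
m_{k,d} = \frac{1}{|C_k|} \sum_{x_n \in C_k} x_{n,d}
$$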
Example:
We use two 3-D vectors
To calculate their mean, we calculate the mean of each dimension:
Therefore, the mean vector is
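
The same calculation can be sketched in NumPy. The two vectors below are hypothetical values chosen purely for illustration.

```python
import numpy as np

# Hypothetical 3-D vectors, chosen only to illustrate the calculation.
a = np.array([1.0, 2.0, 3.0])
b = np.array([3.0, 6.0, 9.0])

# Stack the vectors as rows and average down each column (dimension).
mean = np.stack([a, b]).mean(axis=0)
print(mean)  # [2. 4. 6.]
```

Each entry of the result is the average of the corresponding entries: (1 + 3) / 2, (2 + 6) / 2 and (3 + 9) / 2.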
Until assignments do not change
Consider that
In this case, the best point might be 4.