K-Means
The K-Means algorithm is capable of clustering a dataset very quickly and efficiently, often in just a few iterations. We have to specify the number of clusters k that the algorithm must find. Each instance is assigned to one of the clusters. In the context of clustering, an instance’s label is the index of the cluster that this instance gets assigned to by the algorithm. The fitted model keeps a copy of the labels of the instances it was trained on, available via the labels_ instance variable. We can easily assign new instances to the cluster whose centroid is closest. The vast majority of the instances are clearly assigned to the appropriate cluster, but a few instances near the cluster boundaries were probably mislabeled.
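As a concrete illustration, here is a minimal scikit-learn sketch; the blob dataset and the choice of k = 5 are illustrative assumptions, not part of the text:

```python
# Minimal K-Means sketch; the dataset and k=5 are illustrative assumptions.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=5, random_state=42)

kmeans = KMeans(n_clusters=5, random_state=42)
y_pred = kmeans.fit_predict(X)      # cluster index assigned to each instance

print(kmeans.labels_[:10])          # labels of the training instances
print(kmeans.cluster_centers_)      # the 5 centroids found by the algorithm

X_new = [[0, 2], [3, 2]]            # new instances go to the closest centroid
print(kmeans.predict(X_new))
```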
The K-Means algorithm
We start by placing the centroids randomly (for example, by picking k instances at random and using their locations as centroids). Then we label the instances, update the centroids, label the instances again, and so on until the centroids stop moving. The algorithm is guaranteed to converge in a finite number of steps; it will not oscillate forever, since each step can only decrease the total squared distance between the instances and their closest centroids, and there are only finitely many possible assignments. Although the algorithm is guaranteed to converge, it may not converge to the right solution, i.e., it may converge to a local optimum. Whether it does or not depends on the centroid initialization.
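To make the label/update loop concrete, here is a bare-bones NumPy sketch of the algorithm; the initialization scheme, the convergence test, and all names are illustrative, not library code:

```python
# Bare-bones K-Means loop: label, update, repeat until centroids stop moving.
import numpy as np

def kmeans_naive(X, k, n_iters=100, seed=42):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]  # random initialization
    for _ in range(n_iters):
        # Label step: assign each instance to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its instances
        # (assumes no cluster becomes empty along the way).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # centroids stopped moving
            break
        centroids = new_centroids
    return centroids, labels
```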
To select the best solution, we need a performance metric. We can use the model’s inertia, which is the sum of the squared distances between each instance and its closest centroid. The score() method returns the negative inertia, because scikit-learn follows the convention that greater is better.
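Continuing the earlier sketch, we can check this convention directly on the fitted model:

```python
# score() is just the negative inertia (greater-is-better convention).
print(kmeans.inertia_)   # sum of squared distances to the closest centroids
print(kmeans.score(X))   # equals -kmeans.inertia_
```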
In general, it will not be so easy to know how to set k, and the result might be quite bad if we set it to the wrong value. The inertia is not a good performance metric when trying to choose k because it keeps getting lower as we increase k. Indeed, the more clusters there are, the closer each instance will be to its closest centroid, and therefore the lower the inertia will be.
It is important to scale the input features before we run K-Means, or the clusters may be very stretched and K-Means will perform poorly. Scaling the features does not guarantee that all the clusters will be nice and spherical, but it generally improves things.
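One convenient way to do this is to chain a scaler and the clusterer; the Pipeline wiring below is the usual scikit-learn pattern, with k = 5 again an illustrative choice:

```python
# Scale the features before clustering; k=5 is an illustrative choice.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

scaled_kmeans = make_pipeline(StandardScaler(),
                              KMeans(n_clusters=5, random_state=42))
scaled_kmeans.fit(X)
```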
Silhouette Score
When plotting the inertia as a function of the number of clusters k, the curve often has roughly the shape of an arm, with an inflection point called the elbow. The k at the elbow is generally a good choice, since increasing k beyond it brings little improvement. However, this technique for choosing the number of clusters is rather coarse.
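A sketch of this elbow heuristic, assuming the same dataset X as above; the range of candidate k values is an arbitrary choice:

```python
# Elbow heuristic: fit one model per candidate k and plot the inertias.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

inertias = [KMeans(n_clusters=k, random_state=42).fit(X).inertia_
            for k in range(1, 10)]

plt.plot(range(1, 10), inertias, "o-")
plt.xlabel("Number of clusters k")
plt.ylabel("Inertia")
plt.show()
```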
A more precise approach is to use the silhouette score, which is the mean silhouette coefficient over all the instances. An instance’s silhouette coefficient is equal to (b - a) / max(a, b), where a is the mean distance to the other instances in the same cluster (the mean intra-cluster distance), and b is the mean nearest-cluster distance, i.e., the mean distance to the instances of the next closest cluster (defined as the cluster, other than the instance’s own, that minimizes b). The silhouette coefficient varies between -1 and +1. A coefficient close to +1 means that the instance is well inside its own cluster and far from other clusters; a coefficient close to 0 means that it is close to a cluster boundary; and a coefficient close to -1 means that the instance may have been assigned to the wrong cluster.
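scikit-learn computes this score directly via silhouette_score; the loop below is a sketch that scores a few candidate values of k (it must start at k = 2, since the score needs at least two clusters):

```python
# Silhouette score for a range of candidate k values (needs k >= 2).
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(2, 10):
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)
    print(k, silhouette_score(X, labels))
```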
An even more informative visualization is obtained when we plot every instance’s silhouette coefficient, sorted by the cluster each instance is assigned to and, within each cluster, by the value of the coefficient. This is called a silhouette diagram.
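A rough way to build such a diagram from the per-instance coefficients (silhouette_samples) is sketched below; the layout details, such as the gaps between clusters and the dashed line at the mean score, are illustrative choices:

```python
# Silhouette diagram sketch: one sorted band of coefficients per cluster.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

k = 5
labels = KMeans(n_clusters=k, random_state=42).fit_predict(X)
coeffs = silhouette_samples(X, labels)

pos = 0
for j in range(k):
    cluster_coeffs = np.sort(coeffs[labels == j])          # one band per cluster
    plt.fill_betweenx(np.arange(pos, pos + len(cluster_coeffs)),
                      0, cluster_coeffs)
    pos += len(cluster_coeffs) + 20                        # gap between clusters

plt.axvline(silhouette_score(X, labels),                   # mean silhouette score
            color="red", linestyle="--")
plt.xlabel("Silhouette coefficient")
plt.ylabel("Instances (grouped by cluster)")
plt.show()
```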