Clustering — an explanation in layman's terms
In this blog, I've tried to put the clustering concept in layman's terms for a shorter reading time.
Topics:
- What is clustering?
- Application of clustering.
- Geometric intuition of clustering.
- Cost function of clustering.
- Types of clustering.
- Lloyd's algorithm.
- K-means++.
- K-medoids.
What is clustering?
As the name suggests, clustering is the technique of grouping similar data points into clusters in an unsupervised dataset (a dataset that does not have labels Yi).
Application of clustering:
1. Suppose we are working with e-commerce data: we can group customers by the types of items they buy.
2. Grouping similar pixels in image processing (especially in object detection).
Clustering may also be used as a preprocessing technique.
Geometric intuition of clustering:
Consider two clusters, C1 and C2. Ideally, both clusters should be separated as far as possible so that we can distinguish between them.
So, basically, the distance between the two clusters should be large in the space.
From this we can conclude: the farther apart the clusters, the better the performance.
Cost function of clustering:
Here x denotes the points in a set and Ci is the centroid of that set. The ideal condition in clustering is that the distance between a centroid and the other points in its set should be as small as possible, so that the clusters can be distinguished well.
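Written out in standard notation (the usual textbook formulation, since the original formula is not shown as text), the idea above corresponds to the K-means cost function:

```latex
J = \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - C_i \rVert^{2}
```

where S_i is the i-th set (cluster) and C_i is its centroid; the clustering algorithm tries to make J as small as possible.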
Types of clustering:
There are various types of clustering.
1. K-means clustering
2. Hierarchical clustering
3. Density-based clustering, and so on…
Here we will look briefly at K-means clustering and its geometric intuition.
K-means clustering:
K in K-means denotes the number of clusters in the space. In the image below, K = 3.
Each cluster is called a set (S1, S2, S3) and has a centroid (C1, C2, C3).
The centroids are what determine the clusters. For example, consider a new data point: if that point is closest to C1, then it belongs to S1 (cluster 1).
So the crux here is finding the centroid points that form the clusters. There is a method called "Lloyd's algorithm" for computing centroid points, which we'll look at next.
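The nearest-centroid rule above can be sketched in a few lines of NumPy (a minimal illustration; the centroid values and the new point are made up):

```python
import numpy as np

# Hypothetical centroids of three clusters C1, C2, C3
centroids = np.array([[0.0, 0.0],
                      [5.0, 5.0],
                      [10.0, 0.0]])

# A new data point
point = np.array([1.0, 0.5])

# Distance from the new point to each centroid
distances = np.linalg.norm(centroids - point, axis=1)

# The point belongs to the set whose centroid is nearest (index 0 -> S1)
cluster = int(np.argmin(distances))
print("belongs to S%d" % (cluster + 1))
```

Here the new point is closest to the first centroid, so it joins S1.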
Lloyd's algorithm:
1. First, take a few random points from the dataset and consider them as centroids.
2. Find the points closest to each centroid and form groups.
3. Recompute each centroid as the mean of the group formed in the previous step.
4. Repeat steps 2 and 3 until the distance between each previous centroid and its new centroid becomes a small value.
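The four steps above can be sketched as a minimal NumPy implementation (the function name, parameters, and stopping tolerance here are my own choices, not from the original post):

```python
import numpy as np

def lloyds(X, k, n_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its group
        # (keep the old centroid if a group happens to be empty)
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        # Step 4: stop once the centroids barely move
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated blobs of points
X = np.array([[0., 0.], [0., 1.], [1., 0.],
              [10., 10.], [10., 11.], [11., 10.]])
centroids, labels = lloyds(X, k=2)
```

On this toy data the algorithm separates the two blobs into the two clusters regardless of which random points it starts from.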
K-means++:
There is an advanced method in K-means called K-means++, which is almost the same as Lloyd's algorithm but with a slight change.
In K-means++, instead of taking all the centroid points at random in the first step, each subsequent centroid is chosen probabilistically: a point's chance of being picked is proportional to its (squared) distance from the centroids already chosen, so far-away points are favoured.
All the remaining steps are the same as in K-means.
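The probabilistic first step can be sketched like this (a minimal illustration of standard K-means++ seeding, using squared distances as the selection weights; the function name and parameters are my own):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # First centroid: a uniformly random data point
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance of each point to its nearest chosen centroid
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # Far-away points get a proportionally higher chance of being picked
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

# Two far-apart pairs of points
X = np.array([[0., 0.], [0.2, 0.], [10., 10.], [10.2, 10.]])
init = kmeans_pp_init(X, k=2)
```

Because an already-chosen point has zero distance to itself, it can never be picked again, and far-away points dominate the selection probability.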
K-medoids:
K-medoids is another variation, evolved from K-means++, which differs in step 3.
Algorithm:
1. Initialize centroids in the same way as in K-means++.
2. Find the points closest to each centroid, as in step 2 of K-means.
3. Update:
Here is the change in technique: instead of taking the mean, the centroid (called a medoid here) is replaced by other data points in the cluster, the cost function is computed for each candidate, and this step is repeated until the minimum cost is obtained.
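The medoid-update step can be sketched as follows (a minimal illustration, not the full PAM algorithm: it tries every member of a cluster as the medoid and keeps the one with the lowest total distance to the other members):

```python
import numpy as np

def update_medoids(X, labels, medoid_idx):
    # For each cluster, try every member as the medoid and keep the
    # one that minimizes the total distance to the other members.
    for j in range(len(medoid_idx)):
        members = np.where(labels == j)[0]
        costs = [np.linalg.norm(X[members] - X[m], axis=1).sum()
                 for m in members]
        medoid_idx[j] = int(members[np.argmin(costs)])
    return medoid_idx

# One cluster of three points on a line; the middle point is the best medoid
X = np.array([[0., 0.], [1., 0.], [2., 0.]])
labels = np.zeros(3, dtype=int)
medoids = update_medoids(X, labels, [0])
```

Unlike the mean in K-means, a medoid is always an actual data point, which makes K-medoids less sensitive to outliers.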
That's all for this post. We'll see the next topics in upcoming blogs.
Thanks for reading :)
Reference:
1. Applied AI