Ever have the feeling that there are too many possible ways of dealing with a situation? Let’s take college workload as an example. OK, so you’re coming into your last week of the semester and you have three separate modules, all of which have assignment deadlines coming thick and fast, plus you have to begin studying for your exams! What is the optimal approach to dealing with this? Do you try to get all of the assignments done by bundling them by deadline? Or maybe you should sort them by the marks each assignment is worth? Or why not order them from favourite subject to least favourite? So many possibilities but so little time. Wouldn’t it be easier if there were some way to bundle all of your options into an optimal number of groups, or clusters? Well, that is the principle the K-means clustering algorithm is based on.
K-means clustering is a form of unsupervised learning, designed to help the user sort a set of data into a number of clusters. It is not to be confused with K-nearest neighbours (KNN), which will also be dealt with in this post. While I will discuss these two topics one after another, I feel it may be beneficial to lay out some basic differences between the two before I go any further, as initially I had assumed the K’s were related in both of the algorithms. This is not the case: in K-means, K is the number of clusters the data is split into, while in KNN, k is the number of nearest neighbours consulted when making a prediction. K-means is an unsupervised clustering algorithm, whereas KNN is a form of supervised learning and is a classification algorithm used for predictions.
A real-life example of K-means might be, as stated above, the assignment scenario: how can I break these assignments up into clusters? In the case of KNN, it’s more a case of finding patterns in historical data to try to predict future trends. In a college example, it might be reading through past papers to see if you can identify patterns that may help you work out what will be on your upcoming exam. Hopefully this is starting to make some sense now before we look at the data. If you want to look into either of the algorithms further then you can do so here:
The data set for this blog is based upon the physical attributes of basketball players, football players and jockeys. The data is all continuous as needed for both algorithms and there are no missing values within the data set.
To determine the optimal number of clusters the data set should be split into, there are two common methods that can be used in RStudio. The first is a visual plot of the within-group sum of squares (SSW) versus the number of clusters. The important thing about this plot is that you want to identify the point where the curve starts to level out (the “elbow”):
Figure 1: SSW vs Number of Clusters
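To make the idea behind this plot concrete, here is a minimal sketch of how the within-group sum of squares can be computed for a range of cluster counts. It is written in plain Python rather than the R used for the analysis in this post, the `points` are made-up (height, weight) values and not the actual data set, and the helper names (`kmeans`, `within_ss`, `best_within_ss`) are hypothetical. The clustering itself is a basic Lloyd's-style K-means with several random restarts, which is an assumption about the setup rather than a description of what R does internally.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Basic Lloyd's-style K-means; returns the final centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        updated = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else centroids[c]
            for c, members in enumerate(clusters)
        ]
        if updated == centroids:  # converged
            break
        centroids = updated
    return centroids


def within_ss(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)


def best_within_ss(points, k, restarts=25):
    """Lowest within-group sum of squares over several random starts."""
    return min(within_ss(points, kmeans(points, k, seed=s)) for s in range(restarts))


# Made-up (height_cm, weight_kg) values, purely for illustration.
points = [
    (158, 52), (160, 54), (162, 53), (159, 51),
    (178, 75), (180, 78), (182, 77), (179, 76),
    (200, 98), (203, 100), (205, 102), (201, 99),
]

wss = {k: best_within_ss(points, k) for k in range(1, 7)}
for k, value in wss.items():
    print(k, round(value, 1))
```

Plotting the printed values against k gives the same kind of curve as Figure 1: the sum of squares falls steeply while extra clusters still separate real groups, then flattens once they only split groups that already belong together.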
The graphic above suggests that 3 is a good number of clusters to choose. Let’s also run the NbClust() function to double check:
Table 1: NbClust table
The table above confirms what the graph has told us and we can also plot this information using a bar chart:
Figure 2: Bar chart of Number of Criteria vs Clusters
We shall select 3 clusters for our analysis.
Cross tabulation of the K-means algorithm reveals where each of the players from each sport was placed:
Table 2: Confusion matrix of K-means analysis
The table shows that the jockeys were easily placed in one cluster, as were nearly all of the footballers, with a small amount of crossover among the basketball players. We can view how well the groups were divided by plotting the cluster assignments alongside the actual distribution of the players:
Figure 3: Cluster distribution vs Actual distribution
The left-hand side shows how the players were distributed by the clustering, with some crossover. The right-hand side shows the actual distribution.
Finally, the RandIndex() function in RStudio tells us the level of agreement between the cluster assignments and the players’ actual sports. A score of 1 represents perfect agreement; for the adjusted version of the index, a score around 0 is what random assignment would give, and negative values indicate agreement worse than chance. The score here is 0.9, which indicates a high level of agreement between the clusters and the actual groups.
Figure 4: ARI (adjusted Rand index) score
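For anyone curious what sits behind a score like this, below is a rough Python sketch of the adjusted Rand index, which compares two groupings of the same items by counting pairs that are grouped together in both. This assumes the adjusted form of the index is the one being reported; the function name `adjusted_rand_index` and the label lists are hypothetical toy values, not the post's data.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(labels_a)
    # Pairs of items that share a group in both labelings (from the cross table).
    together_in_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs of items that share a group within each labeling separately.
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)  # value expected by chance
    maximum = (together_a + together_b) / 2
    if maximum == expected:  # degenerate case, e.g. everything in one group
        return 1.0
    return (together_in_both - expected) / (maximum - expected)


# Toy labels: 4 players per sport, purely for illustration.
sports = ["jockey"] * 4 + ["football"] * 4 + ["basketball"] * 4
perfect_clusters = [1] * 4 + [2] * 4 + [3] * 4
one_misplaced = [1] * 4 + [2] * 4 + [3, 3, 3, 2]  # one basketball player lands in cluster 2

ari_perfect = adjusted_rand_index(sports, perfect_clusters)
ari_one_off = adjusted_rand_index(sports, one_misplaced)
print(round(ari_perfect, 3), round(ari_one_off, 3))
```

Note that the cluster numbers themselves do not need to match the sport names; only which players end up grouped together matters.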
Nearest Neighbour Analysis
The data set is the same as was used for the cluster analysis. Just by visualising the data we can see a small positive correlation between height and weight across the sports, with jockeys being the lightest and shortest and basketball players the tallest and heaviest.
Figure 5: Scatter Plot of Weight(kg) vs Height(cm)
As was said previously, KNN is a prediction model, so the data is separated into a training set and a test set to see how accurate the predictions are. When choosing k for the KNN algorithm there are two rules of thumb you should consider:
- Make k odd, which rules out a tied vote when there are two classes and makes one less likely when there are more
- Try to make k not equal to, or a multiple of, the number of classes in your training data set, as this may also result in a tie
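The voting behind these rules can be sketched in a few lines of Python (again, not the R used for the analysis). The function name `knn_predict`, the alphabetical tie-break and all of the data points are assumptions made for illustration. The second example shows that with three classes a tie is still possible even when k is odd, which is why these are rules of thumb rather than guarantees.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k):
    """Majority vote among the k training points nearest to the query.

    Returns (predicted_label, votes, tied). Ties are broken alphabetically
    here, which is an arbitrary choice made for this sketch.
    """
    neighbours = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], query),  # Euclidean distance
    )[:k]
    votes = Counter(label for _, label in neighbours)
    top_count = max(votes.values())
    leaders = sorted(label for label, count in votes.items() if count == top_count)
    return leaders[0], votes, len(leaders) > 1


# Made-up (height_cm, weight_kg) training points, purely for illustration.
train_points = [
    (158, 52), (160, 54), (162, 53),
    (178, 75), (180, 78), (182, 77),
    (200, 98), (203, 100), (205, 102),
]
train_labels = ["jockey"] * 3 + ["football"] * 3 + ["basketball"] * 3

label, votes, tied = knn_predict(train_points, train_labels, (181, 76), k=5)
print(label, dict(votes), tied)

# With three classes, an odd k can still produce a 2-2-1 split.
small_points = [(158, 52), (160, 54), (178, 75), (180, 78), (200, 98)]
small_labels = ["jockey", "jockey", "football", "football", "basketball"]
tie_label, tie_votes, tie_found = knn_predict(small_points, small_labels, (170, 65), k=5)
print(tie_label, dict(tie_votes), tie_found)
```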
For this data set k was set to 5. The results are displayed below:
Table 3: KNN Cross Table
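A cross table like this is simply a count of (actual, predicted) pairs over the test set. Here is a small Python sketch of how one can be built and turned into an accuracy figure; the `actual` and `predicted` lists are invented for illustration and are not the results from this analysis, and the helper names are hypothetical.

```python
from collections import Counter


def cross_table(actual, predicted):
    """Count how often each actual class was predicted as each class."""
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Share of test items whose predicted class matches the actual class."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


# Invented test-set labels: one basketball player is predicted as a footballer.
actual = ["basketball"] * 4 + ["football"] * 4 + ["jockey"] * 4
predicted = ["basketball"] * 3 + ["football"] + ["football"] * 4 + ["jockey"] * 4

table = cross_table(actual, predicted)
acc = accuracy(actual, predicted)

# Print rows as actual classes and columns as predicted classes.
labels = sorted(set(actual) | set(predicted))
print("actual / predicted".ljust(20) + "".join(name.rjust(12) for name in labels))
for a in labels:
    print(a.ljust(20) + "".join(str(table[(a, p)]).rjust(12) for p in labels))
print("accuracy:", round(acc, 3))
```

Counts on the diagonal are correct predictions; anything off the diagonal shows which class a group was mistaken for.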
The cross table displays a high level of accuracy in predicting players’ sports. The only group that wasn’t predicted 100% accurately was the basketball players. However, it should be noted that the basketball players were the youngest of the three groups and had the highest standard deviation, meaning that some of the players were still young teens; while they may have been tall, their weight may have led to them being misclassified.
To summarise, both of these algorithms are effective in their own right and can be used well on data sets containing continuous attributes. This particular data set worked very nicely to demonstrate both the clustering and classification algorithms, K-means and KNN.