Cluster analysis is a statistical tool used to group variables into homogeneous and distinct groups. It is used for exploratory data analysis and serves as a method of discovery by solving classification issues.
The grouping of the questions by means of cluster analysis helps to identify redundant questions and reduce their number, thus improving the chances of a good response rate to the final version of the questionnaire.
A simple numerical example will help explain these objectives.
The daily expenditures on food (X1) and clothing (X2) of five persons are shown in Table 15.1.
Person X1 X2
a 2 4
b 8 2
c 9 3
d 1 5
e 8.5 1
These numbers are fictitious and not all realistic. This example will help us explain the essential features of cluster analysis as simply as possible. The data of table are plotted below
Inspection of above figure shows that the five observations form two clusters. The first consists of persons a and d, and the second of b, c and e. It can be noted that the observations in each cluster are similar to one another with respect to expenditures on food and clothing and that the two clusters are quite distinct from each other.
These conclusions concerning the number of clusters and their membership were reached through a visual inspection. This inspection was possible because only two variables were involved in grouping the observations.
Applications of Cluster Analysis
- Clustering analysis is broadly used in many applications such as market research, pattern recognition, data analysis and image processing.
- Clustering can also help marketers discover distinct groups in their customer base and they can characterize their customer groups based on the purchasing patterns.
- Clustering helps in identification of areas of similar land use in an earth observation database. It also helps in the identification of groups of houses in a city according to house type, value and geographic location.
- Clustering helps in classifying documents on the web for information discovery.
- Clustering is used in outlier detection applications such as detection of credit card fraud.
- As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of data to observe characteristics of each cluster.
Clustering methods can be classified into the following categories −
- Partitioning Method
- Hierarchical Method
- Density-based Method
- Grid-Based Method
- Model-Based Method
- Constraint-based Method
Here, we discussed important K-means clustering technique.
This final method that we would like to examine is a non-hierarchical approach. This method was presented by Macqueen (1967) in the Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability.
One of the advantages of this method is that we do not have to calculate the distance measures between all pairs of subjects. Therefore, this procedure seems much more efficient or practical when you have very large datasets.
Under this procedure you need to pre-specify how many clusters you want to consider. The clusters in this procedure do not form a tree. There are two approaches to carrying out the K-Means procedure. The approaches vary as to how the procedure begins the partitioning. The first approach is to do this randomly, to start out with a random partitioning of subjects into groups and go from there. The alternative is to start with an additional set of starting points. These would form the centers of our clusters. The random nature of the first approach will avoid bias.
Once this decision has been made, here is an overview of the process:
Step 1 – Partition the items into K initial clusters.
Step 2 – Scan through the list of n items, assigning each item to the cluster whose centroid (mean) is closest. Each time an item is reassigned we will recalculate the cluster mean or centroid for the cluster receiving that item and the cluster losing that item.
Step 3 – Repeat Step 2 over and over again until no more reassignments are made.
Let’s look at a simple example in order to see how this works. Here is an example where we have four items and only two variables:
Suppose that items are initially decided to partition the items into two clusters (A, B) and (C, D). Then calculating the cluster centroids, or the mean of all the variables within the cluster, we would obtain:
For example, the mean of the first variable for cluster (A, B) is 5.
Next we calculate the distances between the item A and the centroids of clusters (A, B) and (C, D).
Here, we get a Euclidean distance between A and each of these cluster centroids. We see that item A is closer to cluster (A, B) than cluster (C, D). Therefore, we are going to leave item A in cluster (A, B) and no change is made at this point.
Next, we will look at the distance between item B and the centroids of clusters (A, B) and (C, D).
Here, we see that item B is closer to cluster (C, D) than cluster (A, B). Therefore, item B will be reassigned, resulting in the new clusters (A) and (B, C, D).
The centroids of the new clusters now changed are calculated as:
Next, we will calculate the distance between the items and each of the clusters (A) and (B, C, D).
It turns out that since all four items are closer to their current cluster centroids, no further reassignments are required.
We must note however, that the results of the K-means procedure can be sensitive to the initial assignment of clusters.
For example, suppose the items had initially been assigned to the clusters (A, C) and (B, D). Then the cluster centroids would be calculated as follows:
From here we can find that the distances between the items and the cluster centroids are:
Note that each item is closer to its cluster centroid than the opposite centroid. So, the initial cluster assignment is retained.
If this is the case, the which result should be used as our summary?
We can compute the sum of squared distances between the items and their cluster centroid. For our first clustering scheme for clusters (A) and (B, C, D), we had the following distances to cluster centroids:
So, the sum of squared distances is:
For clusters (A, C) and (B, D), we had the following distances to cluster centroids:
So, the sum of squared distances is: 18.25+6.25+18.25+6.25=49.0
We would conclude that since 26.6¯<49.0, this would suggest that the first clustering scheme is better and we would partition the items into the clusters (A) and (B, C, D).