Clustering with R

This page contains the clustering code and visualizations done in R for record data. The dataset has been downloaded from UCI Machine learning repository. The link for the source can be found here This dataset consists of various columns like age, workclass, education, marital status, relationship, race, sex, hours per week. The label column contains the data of whether the person is rich (salary greater than or equal to 50k) or poor(salary less than $50k). The encoding for this dataset can be found here

R_Data_Cluster.jpg
View
Download csv file

Finding the optimal number of clusters

To start with KMeans, the first thing that is required is the number of clusters. The Elbow method and the Silhouette method were used to obtain the optimal value of k.

Elbow.jpeg       Silhoutte.jpeg

From the Elbow and Silhouette method it can be inferred that the k should be taken as 2,4,8.

The code for finding optimal number of cluster can be found here


K Means Clustering

K means clustering is performed on the data, with the assumption that k is 2 (can be seen from the Elbow and Silhouette Method).

k_2.jpeg

K means clustering is performed on the data, with the assumption that k is 4 (can be seen from the Elbow and Silhouette Method).

Cluster_plot.jpeg

K means clustering is performed on the data, with the assumption that k is 8 (can be seen from the Elbow and Silhouette Method).

r_8.jpeg

The code for K Means Clustering can be found here

The clusters for this data are overlapping, this can still be improved.


Calculating the distances

Different methods are used to plot the dendogram in R

Dendogram.jpeg

The links for different dendograms can be found below:

1. Euclidean Distance

2. Manhattan Distance

3. Cosine Similarity Distance

4. Minkowski Distance

The code for calculating the distances can be found here


Predicting new category

Three new vectors are created, to predict which cluster that vector belongs to,distance of the vector from the cluster centers is calculated. It belongs to the cluster with the minimum distance, the distance method used is Euclidean. For this the vectors are created using the runif function of R. The values in the vector should be 1 and 2. In this vector, it predicts the cluster to be 4.

The code for predicting the clusters for the new vector can be found here

Summary

The dataset consists of just 8 columns and 198 rows. The label column is whether the person is rich or poor based on the various factors (other columns). The aim is to look for connections between different people based on their income.Clustering the dataset provided insights into that. The value of k = 2,4,8 was set for defining the number of clusters which was calculated using the Elbow method and Silhouette method. But from the K means clustering it can be seen that the clusters are overlapping and can be improved.