10 Prove: Assignment
Clustering the States
Objective
Understand the basics of clustering using k-means and agglomerative hierarchical clustering.
Instructions
This week you will look at different ways to cluster data regarding the United States. Once again, we will not be implementing any algorithms this week, but will be experimenting with them in R.
The dataset is included in the datasets package:
library(datasets)
myData = state.x77
This dataset is a numeric matrix with the following attributes (columns) for each of the 50 states:
Population
Income
Illiteracy
Life Exp
Murder
HS Grad
Frost
Area
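Before clustering, it can help to look at the shape and scale of the data. The following is a minimal sketch that assumes only the datasets package and the myData name used above; none of it is a required step.

```r
library(datasets)

# state.x77 is a numeric matrix: one row per state, one column per attribute
myData <- state.x77

dim(myData)        # number of rows (states) and columns (attributes)
colnames(myData)   # attribute names; "Life Exp" and "HS Grad" contain a space
head(myData, 3)    # first few rows, with state names as row names
summary(myData)    # per-column ranges; the attributes are on very different scales
```

Because the columns are on very different scales (compare Area with Illiteracy), distance-based methods will tend to be dominated by the largest-valued columns unless the data is normalized.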
Experiment with the kmeans function and the hclust function to learn more about this dataset and about the clustering process. Follow the steps in the Experiment Guidelines, and then look for ways to do additional analysis on this and other datasets.
Please refer to "A few helpful hints for doing clustering in R" in I-Learn for some basic syntax to help get you started.
Experiment Guidelines
To help get you started, please follow the prescribed steps below:
Agglomerative Hierarchical Clustering
Load the dataset
Use hierarchical clustering to cluster the data on all attributes and produce a dendrogram
Repeat the previous item with a normalized dataset and note any differences
Remove "Area" from the attributes and re-cluster (and note any differences)
Cluster only on the Frost attribute and observe the results
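The hierarchical clustering steps above can be sketched roughly as follows. This is one possible approach, not the required one: the variable names (hcRaw, scaledData, noArea, frostOnly) are made up for illustration, normalization is assumed to mean scale() (z-scores), hclust is left at its default "complete" linkage, and the Area and Frost steps are assumed to use the normalized data.

```r
library(datasets)
myData <- state.x77

# hclust takes a distance matrix, so compute pairwise Euclidean distances first
hcRaw <- hclust(dist(myData))
plot(hcRaw, main = "All attributes, unscaled")

# Normalize: scale() centers each column to mean 0 and rescales it to sd 1
scaledData <- scale(myData)
hcScaled <- hclust(dist(scaledData))
plot(hcScaled, main = "All attributes, scaled")

# Drop the Area column and re-cluster
noArea <- scaledData[, colnames(scaledData) != "Area"]
hcNoArea <- hclust(dist(noArea))
plot(hcNoArea, main = "Scaled, without Area")

# Cluster on Frost only; drop = FALSE keeps a one-column matrix
# so the state names survive as row names in the dendrogram
frostOnly <- scaledData[, "Frost", drop = FALSE]
hcFrost <- hclust(dist(frostOnly))
plot(hcFrost, main = "Frost only")

# Optional: cut a tree into a fixed number of groups to compare runs
table(cutree(hcScaled, k = 4))
```

Comparing the unscaled and scaled dendrograms side by side is a quick way to see how much a single large-valued attribute can drive the result.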
Using k-means
Make sure to use a normalized version of the dataset.
Using k-means, cluster the data into 3 clusters. Note the size of each cluster and the mean values. Do you have any insight into why they were divided this way?
Using a for loop, repeat the clustering process for k = 1 to 25, and plot the total within-cluster sum of squares (the tot.withinss value returned by kmeans) for each k-value.
Evaluate the plot from the previous item, and choose an appropriate k-value using the "elbow method" mentioned in your reading. Then re-cluster a single time using that k-value. Use this clustering for the remaining questions.
List the states in each cluster.
Use "clusplot" (from the cluster package) to plot a 2D representation of the clustering.
Analyze the centers of each of these clusters. Do they give you any insight into this clustering?
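The k-means steps above might look something like the sketch below. Several things here are assumptions rather than requirements: the cluster package is installed, normalization means scale(), nstart = 25 and the seed value are arbitrary choices, and chosenK = 4 is only a placeholder for whatever value your own elbow plot suggests.

```r
library(datasets)
library(cluster)   # provides clusplot

scaledData <- scale(state.x77)

# k-means starts from random centers; a seed makes the run repeatable
set.seed(42)

# Three clusters: sizes and centers (centers are in scaled units)
km3 <- kmeans(scaledData, centers = 3, nstart = 25)
km3$size
km3$centers

# Elbow plot: total within-cluster sum of squares for k = 1..25
ks <- 1:25
wss <- numeric(length(ks))
for (k in ks) {
  wss[k] <- kmeans(scaledData, centers = k, nstart = 25)$tot.withinss
}
plot(ks, wss, type = "b", xlab = "k",
     ylab = "Total within-cluster sum of squares")

# Re-cluster once with the chosen k (placeholder value, not a recommendation)
chosenK <- 4
km <- kmeans(scaledData, centers = chosenK, nstart = 25)

# List the states in each cluster
split(rownames(scaledData), km$cluster)

# 2D view of the clustering on the first two principal components
clusplot(scaledData, km$cluster, color = TRUE, shade = TRUE,
         labels = 2, lines = 0)

# Cluster means in the original units, which are easier to interpret
aggregate(as.data.frame(state.x77),
          by = list(cluster = km$cluster), FUN = mean)
```

The centers stored in km$centers are z-scores, so the aggregate call at the end recomputes the per-cluster means on the unscaled data for the discussion of what each cluster represents.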
After going through these steps, you are encouraged to continue on to analyze this and other datasets further to see if you can find other interesting things.
Submission
Prepare a PDF document with graphs and discussion for each of the points above. Then, list anything additional you have done (above and beyond these requirements).
Finally, please state which category you feel best describes your assignment and give a 1-2 sentence justification for your choice:
1 - Some attempt was made
2 - Developing, but significantly deficient
3 - Slightly deficient, but still mostly adequate
4 - Meets requirements
5 - Shows creativity and excels above and beyond requirements
Upload this document to I-Learn in the space provided.