Advanced Analytical Theory and Methods: Clustering - Data Science and Big Data Analytics

Using R to Perform a K-means Analysis

To illustrate how to use the within sum of squares (WSS) to determine an appropriate number, k, of clusters, the following example uses R to perform a k-means analysis. The task is to group 620 high school seniors based on their grades in three subject areas: English, mathematics, and science. Each grade is averaged over the student's high school career and ranges from 0 to 100. The following R code loads the necessary R libraries and imports the CSV file containing the grades.

library(plyr)
library(ggplot2)
library(cluster)
library(lattice)
library(graphics)
library(grid)
library(gridExtra)

#import the student grades
grade_input = as.data.frame(read.csv("c:/data/grades_km_input.csv"))
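Before clustering, it can be worth confirming that the imported data matches the description above: one student ID column plus three grade columns with values from 0 to 100. The following is a minimal sketch of such a check, not part of the original example. The helper name check_grades and the small stand-in data frame demo_input are hypothetical; the stand-in replaces the CSV file only so that the sketch runs on its own.

```r
# Hypothetical helper (an assumption, not from the original text): verify that
# a grades data frame has the expected columns and that every grade is a
# non-missing number in the range [0, 100].
check_grades <- function(df, subjects = c("English", "Math", "Science")) {
  missing_cols <- setdiff(c("Student", subjects), names(df))
  if (length(missing_cols) > 0) {
    stop("missing columns: ", paste(missing_cols, collapse = ", "))
  }
  grades <- as.matrix(df[, subjects])
  if (!is.numeric(grades) || anyNA(grades)) {
    stop("grades must be numeric with no missing values")
  }
  if (any(grades < 0 | grades > 100)) {
    stop("grades must lie between 0 and 100")
  }
  invisible(TRUE)
}

# Stand-in data with the same column layout as the CSV file (illustrative only)
demo_input <- data.frame(Student = 1:3,
                         English = c(99, 98, 95),
                         Math    = c(96, 97, 100),
                         Science = c(97, 97, 95))
check_grades(demo_input)
```

With the real file, the same call would be check_grades(grade_input), assuming the column names are Student, English, Math, and Science as used in the code that follows.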

The following R code formats the grades for processing. The data file contains four columns. The first column holds a student identification (ID) number, and the other three columns are for the grades in the three subject areas. Because the student ID is not used in the clustering analysis, it is excluded from the k-means input matrix, kmdata.

kmdata_orig = as.matrix(grade_input[,c("Student","English",
                                       "Math","Science")])
kmdata <- kmdata_orig[,2:4]
kmdata[1:10,]
     English Math Science
[1,]      99   96      97
[2,]      99   96      97
[3,]      98   97      97
[4,]      95  100      95
[5,]      95   96      96
[6,]      96   97      96
[7,]     100   96      97
[8,]      95   98      98
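With the grades in matrix form, the WSS for a range of candidate values of k can be obtained by running kmeans() once per value and summing the per-cluster within-cluster sums of squares it reports. The following is a minimal sketch under stated assumptions rather than a definitive procedure: it uses a small synthetic matrix (kmdata_demo, hypothetical) in place of the grades file so that it runs on its own, and the choices of 15 candidate values and nstart = 25 are illustrative.

```r
set.seed(42)  # k-means starts from random centers; fix the seed for repeatability

# Synthetic stand-in for kmdata: three loose groups of grades
# (an assumption for illustration, not the data from the CSV file)
make_group <- function(n, center) {
  raw <- rnorm(n * 3, mean = center, sd = 3)
  matrix(pmin(pmax(raw, 0), 100), ncol = 3)  # keep grades inside [0, 100]
}
kmdata_demo <- rbind(make_group(40, 90), make_group(40, 75), make_group(40, 60))
colnames(kmdata_demo) <- c("English", "Math", "Science")

# WSS for k = 1..max_k: total of the within-cluster sums of squares
max_k <- 15
wss <- numeric(max_k)
for (k in 1:max_k) {
  wss[k] <- sum(kmeans(kmdata_demo, centers = k, nstart = 25)$withinss)
}

# Plot WSS against k; a reasonable k is where the curve stops dropping sharply
plot(1:max_k, wss, type = "b",
     xlab = "Number of Clusters", ylab = "Within Sum of Squares")
```

To apply the same loop to the student grades, replace kmdata_demo with kmdata. The nstart argument asks kmeans() to try several random starting configurations and keep the best one, which makes the WSS values less sensitive to an unlucky start.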
