Historical Perspective: k-means
Wait, didn't we just describe the algorithm? It turns out there's more
than one way to go after k-means clustering.
The standard k-means algorithm is attributed to separate work by
Hugo Steinhaus and Stuart Lloyd in 1957, but it wasn't called
“k-means” then. The first person to use that term was James MacQueen
in 1967. Lloyd's paper, written at Bell Labs, wasn't published outside
the Labs until 1982.
Newer variants of the algorithm include Hartigan-Wong, Lloyd, and
Forgy, named for their inventors and developed throughout the '60s
and '70s. The algorithm we described is Hartigan-Wong, the default
in R's kmeans() function.
It's fine to use the default.
As history keeps marching on, it's worth checking out the more recent
k-means++, developed in 2007 by David Arthur and Sergei Vassilvitskii
(now at Google), which helps k-means avoid poor local optima by
choosing well-spread initial seeds.
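The k-means++ seeding idea can be sketched in a few lines of R. This is a hedged illustration, not a production implementation (packages such as ClusterR and flexclust provide tested versions); the function name kmeanspp_seeds is our own, and the result is simply handed to the standard kmeans() function:

```r
# Sketch of k-means++ seeding: pick the first center uniformly at
# random, then pick each subsequent center with probability
# proportional to its squared distance from the nearest chosen center.
kmeanspp_seeds <- function(x, k) {
  x <- as.matrix(x)
  n <- nrow(x)
  centers <- x[sample(n, 1), , drop = FALSE]  # first seed: uniform
  for (i in 2:k) {
    # squared distance from each point to its nearest chosen seed
    d2 <- apply(x, 1, function(p) min(colSums((t(centers) - p)^2)))
    # next seed: sampled with probability proportional to d2
    centers <- rbind(centers, x[sample(n, 1, prob = d2), , drop = FALSE])
  }
  centers
}

# Hand the seeds to the standard algorithm:
# km <- kmeans(x, centers = kmeanspp_seeds(x, 3))
```

Because the seeds start far apart, the subsequent Lloyd-style iterations tend to converge faster and to a better clustering than purely random initialization.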
Exercise: Basic Machine Learning Algorithms
Continue with the NYC (Manhattan) Housing dataset you worked
with in the preceding chapter: http://abt.cm/1g3A12P.
• Analyze sales using regression with any predictors you feel are
relevant. Justify why regression was appropriate to use.
• Visualize the coefficients and fitted model.
• Predict the neighborhood using a k-NN classifier. Be sure to
withhold a subset of the data for testing. Find the variables and the k
that give you the lowest prediction error.
• Report and visualize your findings.
• Describe any decisions that could be made or actions that could
be taken from this analysis.
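The k-NN step above can be sketched in R with the class package. This is a minimal outline under stated assumptions: it presumes a cleaned data frame called housing with columns gross.square.feet, sale.price, and neighborhood, all of which are hypothetical names you should swap for the ones in your own cleaned dataset:

```r
library(class)  # provides knn()

set.seed(1234)
# Withhold roughly 20% of the rows for testing
test.idx <- sample(nrow(housing), round(nrow(housing) / 5))

# Scale the predictors so no single variable dominates the distances
feats <- scale(housing[, c("gross.square.feet", "sale.price")])
train <- feats[-test.idx, ]
test  <- feats[test.idx, ]

# Try a range of k and record the misclassification rate on the test set
errors <- sapply(1:20, function(k) {
  pred <- knn(train, test, housing$neighborhood[-test.idx], k = k)
  mean(pred != housing$neighborhood[test.idx])
})
best.k <- which.min(errors)  # the k with the lowest prediction error
```

Re-running with different predictor columns in feats is how you search for the variable set that minimizes the error, as the exercise asks.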
Solutions
In the preceding chapter, we showed how to explore and clean this
dataset, so you'll want to do that first before you build your regression
model. Following are two pieces of R code. The first shows how you