Historical Perspective: k-means
Wait, didn't we just describe the algorithm? It turns out there's more
than one way to go after k-means clustering.
The standard k-means algorithm is attributed to separate work by
Hugo Steinhaus and Stuart Lloyd in 1957, but it wasn't called
“k-means” then. The first person to use that term was James MacQueen
in 1967. Lloyd's paper, written at Bell Labs, wasn't published outside
the Labs until 1982.
Newer variants of the algorithm include Hartigan-Wong, Lloyd, and
Forgy, named for their inventors and developed throughout the '60s
and '70s. The algorithm we described is Hartigan-Wong, the default
in R's kmeans() function.
It's fine to use the default.
As history keeps marching on, it's worth checking out the more recent
k-means++, developed in 2007 by David Arthur and Sergei Vassilvitskii
(now at Google), which helps k-means avoid poor local optima by
choosing well-spread initial seeds.
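The k-means++ seeding idea can be sketched in a few lines of R. This is a hedged illustration, not a production implementation (packages such as ClusterR and flexclust provide tested versions); the function name kmeanspp_seeds is our own, and the result is simply handed to the standard kmeans() function:

```r
# Sketch of k-means++ seeding: pick the first center uniformly at
# random, then pick each subsequent center with probability
# proportional to its squared distance from the nearest chosen center.
kmeanspp_seeds <- function(x, k) {
  x <- as.matrix(x)
  n <- nrow(x)
  centers <- x[sample(n, 1), , drop = FALSE]  # first seed: uniform
  for (i in 2:k) {
    # squared distance from each point to its nearest chosen seed
    d2 <- apply(x, 1, function(p) min(colSums((t(centers) - p)^2)))
    # next seed: sampled with probability proportional to d2
    centers <- rbind(centers, x[sample(n, 1, prob = d2), , drop = FALSE])
  }
  centers
}

# Hand the seeds to the standard algorithm:
# km <- kmeans(x, centers = kmeanspp_seeds(x, 3))
```

Because the seeds start far apart, the subsequent Lloyd-style iterations tend to converge faster and to a better clustering than purely random initialization.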
Exercise: Basic Machine Learning Algorithms
Continue with the NYC (Manhattan) Housing dataset you worked
with in the preceding chapter: http://abt.cm/1g3A12P.
• Analyze sales using regression with any predictors you feel are
relevant. Justify why regression was appropriate to use.
• Visualize the coefficients and fitted model.
• Predict the neighborhood using a k-NN classifier. Be sure to
withhold a subset of the data for testing. Find the variables and the k
that give you the lowest prediction error.
• Report and visualize your findings.
• Describe any decisions that could be made or actions that could
be taken from this analysis.
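The k-NN step above can be sketched in R with the class package. This is a minimal outline under stated assumptions: it presumes a cleaned data frame called housing with columns gross.square.feet, sale.price, and neighborhood, all of which are hypothetical names you should swap for the ones in your own cleaned dataset:

```r
library(class)  # provides knn()

set.seed(1234)
# Withhold roughly 20% of the rows for testing
test.idx <- sample(nrow(housing), round(nrow(housing) / 5))

# Scale the predictors so no single variable dominates the distances
feats <- scale(housing[, c("gross.square.feet", "sale.price")])
train <- feats[-test.idx, ]
test  <- feats[test.idx, ]

# Try a range of k and record the misclassification rate on the test set
errors <- sapply(1:20, function(k) {
  pred <- knn(train, test, housing$neighborhood[-test.idx], k = k)
  mean(pred != housing$neighborhood[test.idx])
})
best.k <- which.min(errors)  # the k with the lowest prediction error
```

Re-running with different predictor columns in feats is how you search for the variable set that minimizes the error, as the exercise asks.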
Solutions
In the preceding chapter, we showed how to explore and clean this
dataset, so you'll want to do that first before you build your regression
model. Following are two pieces of R code. The first shows how you