This is the same query from the Predicting Shakespeare example in the
section on R; it relies on the same intermediate table,
ch13.shakespeare_tfidf, which contains relative frequencies for all
the words in all of Shakespeare's plays. You can run the query using the pandas
gbq.read_gbq() function and save the result in a data frame. A pandas
data frame is similar to an R data frame; it is a bit like a matrix but can have
additional metadata, such as column and row names.
The first column of the data frame is word; this contains nearly every
word that is used somewhere in Shakespeare. The subsequent columns are
the normalized frequencies of the corresponding word's usage in each of
Shakespeare's plays. The column names (other than word) are the names of
the plays. You don't actually use the word in the clustering step. The learning
process doesn't know anything about words; it just cares about the relative
frequencies that make up the feature matrix. Because you don't need the
actual words, you can drop the word column from the data frame.
>>> del data_frame['word']
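To see the effect of dropping the column on a small scale, here is a minimal sketch that builds a toy data frame by hand. The words, play names, and frequencies are invented for illustration and are not values from ch13.shakespeare_tfidf.

```python
import pandas as pd

# Hypothetical stand-in for the query result: one row per word,
# one column per play (all values invented for illustration).
data_frame = pd.DataFrame({
    "word": ["crown", "love", "sword"],
    "hamlet": [0.10, 0.30, 0.20],
    "macbeth": [0.40, 0.10, 0.30],
})

# Drop the label column so only numeric frequencies remain.
del data_frame["word"]

remaining_columns = list(data_frame.columns)
print(remaining_columns)
```

After the deletion, every remaining column is numeric, which is what the clustering step needs.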
To run k-means clustering, you need to turn the data frame into an array
where each row is a vector describing the sample. That is, you want each play
to represent one row, while the word frequencies are columns. To coerce
the data into this format, you can create a numpy array containing the
transposed results:
>>> features = asarray(data_frame.T)
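The transpose step can be checked on a toy frame. This sketch assumes that asarray is numpy's asarray; the play names and frequencies are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical frequencies: rows are words, columns are plays.
data_frame = pd.DataFrame({
    "hamlet": [0.10, 0.30, 0.20],
    "macbeth": [0.40, 0.10, 0.30],
})

# Transpose so that each play becomes one row (one sample),
# then convert the result to a plain numpy array.
features = np.asarray(data_frame.T)

print(features.shape)  # (number of plays, number of words)
```

The first row of features now holds all the word frequencies for the first play, which is the one-row-per-sample layout described above.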
After you create the features matrix, pass it to the clustering function:
>>> codes, _ = kmeans(features, 2)
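The call above can be tried on a small made-up matrix. This sketch assumes that kmeans here is scipy.cluster.vq.kmeans, which this passage does not state explicitly. The vq() call is an addition for illustration, showing one way to map each sample to its nearest centroid.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

# Hypothetical feature matrix: 4 samples (rows) with 3 features each,
# forming two well-separated groups.
features = np.array([
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [5.0, 5.1, 5.0],
    [5.1, 5.0, 5.1],
])

# kmeans returns the centroids and the mean distortion.
codes, distortion = kmeans(features, 2)

# vq assigns each sample to the index of its nearest centroid.
labels, _ = vq(features, codes)

print(codes.shape)
print(labels)
```

The order of the centroids depends on the random initialization, so which group is labeled 0 and which is labeled 1 can vary from run to run.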
K is set to 2, which means you're just trying to find two clusters. Another way
of looking at it is that you're creating a hyperplane dividing the Shakespeare
word frequency matrix into two parts, where the hyperplane is defined as
all the points that are equidistant from the two cluster centroids. If that
sounds confusing, don't worry; it was just an excuse to get to write the word
“hyperplane.”
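For readers who do want to see the hyperplane, here is a minimal sketch with two hypothetical centroids in three dimensions. It checks that the sign of a single dot product agrees with a direct nearest-centroid comparison; the centroids and test points are invented for the example.

```python
import numpy as np

# Hypothetical centroids in a 3-dimensional feature space.
c0 = np.array([0.0, 0.0, 0.0])
c1 = np.array([4.0, 2.0, 0.0])

normal = c1 - c0            # direction perpendicular to the hyperplane
midpoint = (c0 + c1) / 2.0  # a point that lies on the hyperplane

def side(x):
    """Positive if x is nearer c1, negative if nearer c0, zero on the plane."""
    return float(np.dot(normal, x - midpoint))

def nearest(x):
    """Index of the strictly nearer centroid (ties go to centroid 0)."""
    return int(np.linalg.norm(x - c1) < np.linalg.norm(x - c0))

points = np.array([[1.0, 0.0, 3.0], [3.0, 2.0, -1.0], [0.0, 5.0, 1.0]])
agreement = [(side(p) > 0) == bool(nearest(p)) for p in points]
print(agreement)
```

The third point is chosen to be equidistant from both centroids, so side() returns zero for it: it sits exactly on the dividing hyperplane.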
The first result from the kmeans() function is the “code book”; this is a k by
N matrix (where k is the number of clusters and N is the number of features in each sample)