data into a fixed (k) number of clusters. K-means clustering is provided in
the scipy.cluster.vq package.
We'll use a k of 2, which means we'll divide the plays into two buckets.
We don't have a lot of data (there are only 38 plays that we know about),
so dividing the plays into many clusters may not be particularly
instructive. After clustering, we'll see whether this binary division
makes any intuitive sense.
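To make the idea concrete before touching the real data, here is a minimal sketch of k-means with a k of 2, using the same scipy.cluster.vq functions that the script imports. The six two-dimensional points are made-up toy values, not Shakespeare data, and the variable names are only illustrative.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

# Toy observations (made up for illustration): two obvious groups.
points = np.array([
    [1.0, 1.1], [0.9, 1.0], [1.1, 0.9],
    [5.0, 5.1], [4.9, 5.0], [5.1, 4.9],
])

# whiten() rescales each column to unit variance so no feature dominates.
scaled = whiten(points)

# kmeans() returns the centroids (the "code book") and the mean distortion.
centroids, distortion = kmeans(scaled, 2)

# vq() assigns each observation to its nearest centroid.
labels, distances = vq(scaled, centroids)
print(labels)
```

Which group ends up labeled 0 and which ends up labeled 1 depends on the random starting centroids, so only the grouping is meaningful, not the label values themselves.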
The Python file clustering_shakespeare.py has the entire script for
computing the clusters, but we will walk you through the individual pieces.
Start with the import statements:
>>> from numpy import array
>>> from numpy import asarray
>>> from pandas import DataFrame
>>> from pandas.io import gbq
>>> from scipy.cluster.vq import vq, kmeans, whiten
Note that numpy, pandas, and scipy must be installed. If any of them is
missing or broken, you'll see errors either here or when you first try to
use it. A bit of forewarning: if something isn't installed correctly, the
errors can be cryptic, because the module that fails to load often isn't
the one that is missing. If module A imports module B and B is missing,
you might see module A fail to load without an error message telling you
why.
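One way to narrow down such failures is to import each dependency separately and print the underlying error. The following is a small sketch using only the standard library; check_imports is a hypothetical helper name, not part of the book's script.

```python
import importlib

def check_imports(module_names):
    """Try to import each module; map its name to None or an error string."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = None
        except ImportError as exc:
            # exc.name, when set, is the module whose import failed,
            # which may differ from the module we asked for.
            results[name] = f"{exc} (failing module: {exc.name})"
    return results

for name, error in check_imports(["numpy", "scipy.cluster.vq", "pandas"]).items():
    print(name, "OK" if error is None else "FAILED: " + error)
```

If a package fails with a "failing module" that is different from the package itself, that inner module is usually the one to reinstall first.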
Next, after you verify that you have all the libraries that you need, run a
BigQuery query to get the data you need:
>>> query = """
SELECT word,
  SUM(if(corpus == '1kinghenryiv', tfidf, 0)) as onekinghenryiv,
  …
  SUM(if(corpus == 'winterstale', tfidf, 0)) as winterstale,
FROM [ch13.shakespeare_tfidf]
GROUP BY word
"""
>>> data_frame = gbq.read_gbq(query)
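The SUM(if(...)) columns pivot the table: the result has one row per word and one column per play, with 0 wherever a word does not appear in that play. As a rough local illustration of that reshaping, the sketch below does the equivalent in pandas on a few made-up rows. The words and tfidf values are invented for the example and do not come from ch13.shakespeare_tfidf.

```python
import pandas as pd

# Made-up long-format rows (word, corpus, tfidf), for illustration only.
long_rows = pd.DataFrame({
    "word":   ["crown", "crown", "bear", "sheep"],
    "corpus": ["1kinghenryiv", "winterstale", "winterstale", "winterstale"],
    "tfidf":  [0.5, 0.25, 0.75, 0.125],
})

# One row per word, one column per play; absent pairs become 0,
# mirroring SUM(if(corpus == '<play>', tfidf, 0)) ... GROUP BY word.
wide = long_rows.pivot_table(
    index="word", columns="corpus", values="tfidf",
    aggfunc="sum", fill_value=0,
)
print(wide)
```

Each column of the wide table plays the role of one play's feature vector over words, which is the shape the clustering step works with.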