Using BigQuery from Third-Party Tools - Google BigQuery Analytics

data into a fixed (k) number of clusters. K-means clustering is provided in

the scipy.cluster.vq package.

We'll use a k of 2, which means we're going to be dividing up the plays into

two different buckets. We don't have a lot of data (there are only 38 plays

that we know about), so dividing up the plays into a lot of clusters may not

be particularly instructive. After clustering, we'll see if this binary division

makes any intuitive sense.
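To make the idea of a k of 2 concrete before turning to the real data, here is a minimal sketch that runs k-means on a tiny made-up data set. The six observations are hypothetical and chosen to form two obvious groups; only the whiten, kmeans, and vq functions come from scipy.cluster.vq.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

# Hypothetical toy data (not the Shakespeare data): six observations with
# two features each, forming two well-separated groups.
observations = np.array([
    [1.0, 1.1],
    [0.9, 1.0],
    [1.1, 0.9],
    [8.0, 8.2],
    [8.1, 7.9],
    [7.9, 8.0],
])

# Rescale each feature to unit variance so no single feature dominates.
whitened = whiten(observations)

# Ask for k = 2 clusters; kmeans returns the centroids and the mean distortion.
centroids, distortion = kmeans(whitened, 2)

# Assign every observation to its nearest centroid.
labels, distances = vq(whitened, centroids)
print(labels)
```

The label numbers themselves are arbitrary (which group is called 0 depends on the random starting centroids), so what matters is which observations share a label.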

The Python file clustering_shakespeare.py has the entire script for

computing the clusters, but we will walk you through the individual pieces.

Start with the import statements:

>>> from numpy import array
>>> from numpy import asarray
>>> from pandas import DataFrame
>>> from pandas.io import gbq
>>> from scipy.cluster.vq import vq, kmeans, whiten

Note that numpy, pandas, and scipy must be installed. If any of them is missing

or broken, you'll see errors either here or when you first try to use

them. A bit of forewarning: if you don't have everything installed correctly,

the errors can be a bit cryptic because the thing that fails to load often isn't

the thing that is missing. If module A imports module B, you might see

module A fail to load, but module B might be the missing one, and you may

not get an error message telling you why.
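One way to narrow down a cryptic failure is to import each library separately and print what went wrong. The following is a rough sketch rather than a complete diagnostic; check_imports is a hypothetical helper name, and it relies on the name attribute that Python 3 attaches to ImportError when it knows which module could not be found.

```python
import importlib


def check_imports(module_names):
    """Try to import each module; map its name to None or an error string."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = None
        except Exception as err:
            # A broken install can raise errors other than ImportError, so
            # catch broadly. When err.name is set, it names the module that
            # could not be found, which may be a dependency of `name`.
            culprit = getattr(err, "name", None) or name
            results[name] = "problem loading %s: %s" % (culprit, err)
    return results


report = check_imports(["numpy", "pandas", "scipy.cluster.vq"])
for name, problem in report.items():
    print("%s: %s" % (name, "OK" if problem is None else problem))
```

If module A fails only because module B is missing, the message for A should then mention B rather than leaving you to guess.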

Next, after you verify that you have all the libraries that you need, run a

BigQuery query to get the data you need:

>>> query = """
SELECT word ,
SUM(if (corpus == '1kinghenryiv', tfidf, 0)) as onekinghenryiv ,
…
SUM(if (corpus == 'winterstale', tfidf, 0)) as winterstale ,
FROM [ch13.shakespeare_tfidf]
GROUP BY word
"""

>>> data_frame = gbq.read_gbq(query)
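The query repeats one SUM(if (...)) column per play, which turns each corpus value into its own column. If you would rather not type every column by hand, a query of the same shape can be generated from a list of corpus names. This is only a sketch: build_pivot_query, column_alias, and DIGIT_WORDS are hypothetical names, the digit-to-word rule is an assumption inferred from the onekinghenryiv alias, and the list below contains just the two corpus names that appear above.

```python
# Assumed rule: spell out a leading digit so the alias is a valid column
# name, mirroring '1kinghenryiv' -> onekinghenryiv in the query above.
DIGIT_WORDS = {"1": "one", "2": "two", "3": "three"}


def column_alias(corpus):
    """Return a column alias for a corpus name (hypothetical helper)."""
    if corpus[0] in DIGIT_WORDS:
        return DIGIT_WORDS[corpus[0]] + corpus[1:]
    return corpus


def build_pivot_query(corpora, table="[ch13.shakespeare_tfidf]"):
    """Build one SUM(if (...)) column per corpus, grouped by word."""
    columns = [
        "  SUM(if (corpus == '%s', tfidf, 0)) as %s" % (name, column_alias(name))
        for name in corpora
    ]
    return "SELECT word,\n%s\nFROM %s\nGROUP BY word" % (
        ",\n".join(columns), table)


# Only the two corpus names shown in the text; the rest are elided there.
query = build_pivot_query(["1kinghenryiv", "winterstale"])
print(query)
```

Each row of the result then holds one word, with that word's tfidf value for each play in a separate column.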
