Using BigQuery from Third-Party Tools - Google BigQuery Analytics

that defines the centroids of the clusters. In this case, the code book contains

two columns, one for each cluster. The rows contain the expected frequency

of each word in the play for the corresponding cluster.

After you find the two clusters, assign each Shakespeare play to the cluster whose centroid is closest. You can use the vq() (short for “vector quantization”) function for this.

>>> assignments, _ = vq(features, codes)
>>> results = {
...     'play': array(data_frame.columns.values),
...     'cluster': assignments}

The assignments will be an array of cluster indexes indicating which cluster

each sample was closest to. That is, there will be a value of 0 or 1 for each

Shakespeare play that says whether that play was in the first or second

cluster.
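As a rough illustration of the assignment step, here is a minimal sketch that runs kmeans() and vq() from scipy.cluster.vq on synthetic data. The array contents, group sizes, and variable names such as group_a and group_b are assumptions made up for this example; they stand in for the word-frequency features and are not the Shakespeare data.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

# Synthetic stand-in for the real features (an assumption for illustration):
# six samples with three features each, drawn around two distant centers.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.1, size=(3, 3))
group_b = rng.normal(loc=5.0, scale=0.1, size=(3, 3))

# Scale each feature to unit variance before clustering.
features = whiten(np.vstack([group_a, group_b]))

# Find two centroids, then assign each sample to its nearest centroid.
codes, _ = kmeans(features, 2)
assignments, distances = vq(features, codes)

# One cluster index (0 or 1) per sample, plus the distance to that centroid.
print(assignments)
print(distances)
```

Which group receives index 0 and which receives index 1 depends on the random initialization, so only the grouping itself is meaningful, not the label values.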

A DataFrame is pandas' counterpart to R's data frame: a matrix of values with some additional metadata. You can match each cluster with its play name with a little bit of DataFrame magic: combine the column names from the result of the BigQuery query (which are the play names) with the cluster assignments, and then sort by cluster. Sorting groups together the plays that landed in the same cluster, which gives you a good picture of which plays were assigned to which cluster.

>>> result_frame = DataFrame.from_dict(results).sort_values(
...     ['cluster', 'play'])
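The combine-and-sort step can be tried on its own with a small, self-contained sketch. The play names and cluster values below are hypothetical placeholders, not results from the query. It uses sort_values(), which is the sorting method in current pandas releases (the older DataFrame.sort() has been removed).

```python
import numpy as np
import pandas as pd

# Hypothetical assignments for four plays (illustrative values only).
results = {
    'play': np.array(['hamlet', 'asyoulikeit', 'macbeth', 'coriolanus']),
    'cluster': np.array([1, 0, 1, 0]),
}

# Sort by cluster first, then alphabetically by play within each cluster.
result_frame = pd.DataFrame.from_dict(results).sort_values(['cluster', 'play'])
print(result_frame)
```

The row labels in the left-hand column of the printed frame are the original positions of each play, which is why they no longer run in order after sorting.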

The assignment matrix is reproduced here:

    cluster                  play
5         0  allswellthatendswell
6         0    antonyandcleopatra
7         0           asyoulikeit
8         0        comedyoferrors
9         0            coriolanus
10        0             cymbeline
