that defines the centroids of the clusters. In this case, the code book contains
two rows, one for each cluster. Each column holds the expected frequency
of one word for the corresponding cluster.
After you find the two clusters, assign each Shakespeare play to the cluster
whose centroid it is closest to. You can use the vq() function (short for
“vector quantization”) for this.
>>> from numpy import array
>>> from scipy.cluster.vq import vq
>>> assignments, _ = vq(features, codes)
>>> results = {
...     'play': array(data_frame.columns.values),
...     'cluster': assignments}
The assignments will be an array of cluster indexes indicating which cluster
each sample was closest to. That is, there will be a value of 0 or 1 for each
Shakespeare play that says whether that play was in the first or second
cluster.
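To make the assignment step concrete, here is a minimal plain-Python sketch of the nearest-centroid lookup that vq() performs. It is an illustration under assumptions, not SciPy's implementation: the helper name nearest_code and the toy numbers are made up, and the code book is laid out with one centroid per row.

```python
import math

def nearest_code(features, codes):
    """For each feature row, return the index of the closest centroid
    and the Euclidean distance to it (hypothetical helper)."""
    assignments, distances = [], []
    for row in features:
        dists = [math.dist(row, code) for code in codes]
        best = min(range(len(codes)), key=dists.__getitem__)
        assignments.append(best)
        distances.append(dists[best])
    return assignments, distances

# Made-up data: three "plays", each described by two word frequencies.
toy_features = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.1]]
# Made-up code book: one centroid per row.
toy_codes = [[1.0, 0.0], [0.0, 1.0]]

toy_assignments, toy_distances = nearest_code(toy_features, toy_codes)
print(toy_assignments)  # [0, 0, 1]
```

The first two toy rows sit near the first centroid and the third sits near the second, so the assignments are 0, 0, and 1.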
A DataFrame is pandas' counterpart to R's data frame: a matrix of values
with some additional metadata, such as row and column labels. You can match
each cluster with its play name using a little bit of DataFrame magic: combine
the column names from the result of the BigQuery query (which are the play
names) with the cluster assignments, and then sort by cluster. Sorting groups
together all the plays that landed in the same cluster, which gives you a good
picture of which plays were assigned to which cluster.
>>> from pandas import DataFrame
>>> result_frame = DataFrame.from_dict(results).sort_values(
...     ['cluster', 'play'])
The assignment matrix is reproduced here:
    cluster  play
5         0  allswellthatendswell
6         0  antonyandcleopatra
7         0  asyoulikeit
8         0  comedyoferrors
9         0  coriolanus
10        0  cymbeline
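The sorting step can be tried on a small stand-in data set. The sketch below assumes a recent pandas release (where sort_values is the sorting method) and uses placeholder play names and made-up cluster values rather than the real query results.

```python
import pandas as pd

# Placeholder names and made-up cluster indexes, not real results.
toy_results = {
    'play': ['play_d', 'play_a', 'play_c', 'play_b'],
    'cluster': [1, 0, 1, 0],
}

# Sort by cluster first, then alphabetically by play within each cluster.
toy_frame = pd.DataFrame(toy_results).sort_values(['cluster', 'play'])
print(toy_frame)

# Optional: collect the play names of each cluster into a plain dict.
plays_by_cluster = toy_frame.groupby('cluster')['play'].apply(list).to_dict()
print(plays_by_cluster)
```

Sorting keeps each row's original index label, which is likely why the leftmost column of the assignment matrix above is not a simple 0, 1, 2 sequence.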