installed bq or have not authorized your Google account using it, you should
follow the instructions in Chapter 3, “Getting Started with BigQuery.”
Assuming you have authenticated with bq or the Google Cloud SDK, it is
easy to run a query in pandas:
$ python
>>> from pandas.io import gbq
>>> data_frame = gbq.read_gbq(
...     'SELECT COUNT(*) FROM [publicdata:samples.shakespeare]')
Waiting on bqjob_r2f6dcee956cff5bd_0000014460593881_1 … (0s) Current status: DONE
>>> print "%s" % (data_frame,)
      f0_
0  164656

[1 rows x 1 columns]
If the gbq.read_gbq() command works without returning an error, then
you're all set to begin using BigQuery from pandas.
Pandas Example: Clustering Shakespeare
In the R example, we tried to classify Shakespeare texts into genres:
tragedies, histories, or comedies. To do so, we needed to know the genre
of some of the plays in advance in order to train our machine learning
model. Wouldn't it be nice if we could classify the plays without knowing
any genres in advance? We could divide the plays into their “natural”
buckets and then go back and see whether those buckets have any real-world
meaning. That is, instead of providing classifications at the beginning,
we can provide them after we've already sorted the plays into buckets. Of
course, the buckets may not correspond to genre; they might instead reveal
some hidden similarity between the plays, or align in other ways, such as
early versus late plays, or even plays that were written by Shakespeare's
evil twin brother.
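The bucket-first, label-later idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the book's code: the points are made-up two-dimensional feature vectors (imagine two word-frequency scores per play), the function name kmeans and its parameters are hypothetical, and the grouping uses a small hand-written k-means-style loop with fixed starting centers rather than any library.

```python
# Illustrative sketch only: a tiny k-means-style grouping of made-up 2-D points.
# The data, the function name `kmeans`, and the starting centers are all
# hypothetical; none of them come from the book or from BigQuery.

def kmeans(points, centers, iterations=10):
    """Group `points` around `centers` by repeated assign-and-average steps."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center
        # (squared Euclidean distance).
        labels = [
            min(range(len(centers)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centers[k])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(
                    tuple(sum(dim) / float(len(members)) for dim in zip(*members)))
            else:
                new_centers.append(old)  # keep an empty bucket where it was
        if new_centers == centers:
            break  # centers stopped moving, so the buckets are stable
        centers = new_centers
    return labels, centers


# Hypothetical feature vectors: no genre labels are supplied anywhere.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centers = kmeans(points, centers=[points[0], points[3]])
print(labels)
```

With these made-up points, the first three land in bucket 0 and the last three in bucket 1. The buckets carry no names; deciding what each one means (a genre, a period, or nothing at all) is the step that happens afterwards.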
This type of analysis is called unsupervised learning, and there are a lot
of different algorithms you can use to approach the problem. One of the
standard algorithms is called k-means clustering, which groups unlabeled