Using BigQuery from Third-Party Tools - Google BigQuery Analytics

Database Reference

In-Depth Information

SELECT word, word_count, corpus

FROM [publicdata:samples.shakespeare]

WHERE word IN (

SELECT word

FROM (

SELECT word, COUNT(*) as corpus_count

FROM [publicdata:samples.shakespeare]

GROUP BY word

HAVING corpus_count > 1 and corpus_count < 36

))

This can serve as the core of your query. There is a technique in Information

Retrieval called TF-IDF, which means that you compute the term frequency

(TF) and multiply by the inverse document frequency (IDF). In other words,

for each word in the corpus, you divide the number of times that word

occurs by the number of total words in the document. This mechanism lets

you compare relative frequencies for documents that are different sizes. You

can apply TF-IDF to your Shakespeare results by dividing each word count

by the number of words in the corresponding play:

$ QUERY="

SELECT s1.word AS word ,

10000 * s1.word_count/s2.total_words AS tfidf ,

s1.corpus as corpus

FROM (

SELECT word, word_count, corpus

FROM [publicdata:samples.shakespeare]

WHERE word IN (

SELECT word

FROM (

SELECT word, COUNT(*) AS corpus_count

FROM [publicdata:samples.shakespeare]

GROUP BY word

HAVING corpus_count > 1 AND corpus_count < 36

))) s1

JOIN (

SELECT corpus, SUM(word_count) AS total_words

FROM [publicdata:samples.shakespeare]

GROUP BY corpus

Search WWH ::

Custom Search

Home