Database Reference
In-Depth Information
SELECT word, word_count, corpus
FROM [publicdata:samples.shakespeare]
WHERE word IN (
SELECT word
FROM (
SELECT word, COUNT(*) as corpus_count
FROM [publicdata:samples.shakespeare]
GROUP BY word
HAVING corpus_count > 1 and corpus_count < 36
))
This can serve as the core of your query. There is a technique in Information
Retrieval called TF-IDF, which means that you compute the term frequency
(TF) and multiply by the inverse document frequency (IDF). In other words,
for each word in the corpus, you divide the number of times that word
occurs by the number of total words in the document. This mechanism lets
you compare relative frequencies for documents that are different sizes. You
can apply TF-IDF to your Shakespeare results by dividing each word count
by the number of words in the corresponding play:
$ QUERY="
SELECT s1.word AS word ,
10000 * s1.word_count/s2.total_words AS tfidf ,
s1.corpus as corpus
FROM (
SELECT word, word_count, corpus
FROM [publicdata:samples.shakespeare]
WHERE word IN (
SELECT word
FROM (
SELECT word, COUNT(*) AS corpus_count
FROM [publicdata:samples.shakespeare]
GROUP BY word
HAVING corpus_count > 1 AND corpus_count < 36
))) s1
JOIN (
SELECT corpus, SUM(word_count) AS total_words
FROM [publicdata:samples.shakespeare]
GROUP BY corpus
Search WWH ::




Custom Search