multidimensional (play versus word count versus date), it can be tricky
to grok the patterns from a table of results by itself. You can get this
information in a single numerical result using a correlation query:
SELECT CORR(count, date)
FROM (
  SELECT SUM(word_count) AS count, MIN(corpus_date) AS date
  FROM publicdata:samples.shakespeare
  GROUP BY corpus)
This returns 0.41, a moderately strong positive correlation. Sounds like maybe
Shakespeare stopped listening to his editor telling him to “wrap it up” after
he got a little bit of fame. Not quite, however. If you look at the extreme
values of the data, there are two corpora with a corpus_date of 0:
sonnets and various. Because they were written over a period of time,
no single date makes sense for them, so whoever created the dataset
set the date to 0. The collected sonnets are shorter than any play, and the
various other writings are only one-fifth as long as the shortest play. These
two values completely skew the correlation calculation; if you remove them
by adding the filter WHERE corpus_date > 0 to the inner query, you see that
the correlation coefficient is actually negative, although quite small:
-0.21.
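The way two placeholder dates can flip the sign of a correlation is easy to reproduce outside BigQuery. The following is a minimal Python sketch, assuming made-up (word_count, corpus_date) pairs rather than the real Shakespeare figures; the pearson and corr helpers and the sample rows are hypothetical and exist only for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def corr(rows):
    """Correlate word count against date for (word_count, corpus_date) rows."""
    counts = [count for count, _ in rows]
    dates = [date for _, date in rows]
    return pearson(counts, dates)


# Hypothetical data, NOT the real dataset: dated works whose word counts
# drift slightly downward over time.
dated = [(30000, 1590), (28000, 1595), (29000, 1600), (26000, 1605), (25000, 1610)]
# Two short undated works stored with a placeholder date of 0.
undated = [(17000, 0), (3000, 0)]

all_rows = dated + undated
# Same idea as adding WHERE corpus_date > 0 to the inner query.
filtered_rows = [row for row in all_rows if row[1] > 0]

with_outliers = corr(all_rows)
without_outliers = corr(filtered_rows)

print(f"with placeholder dates:    {with_outliers:+.2f}")
print(f"without placeholder dates: {without_outliers:+.2f}")
```

In this toy data the two rows with a date of 0 are also the shortest, so they pull the coefficient positive; once they are filtered out, the remaining rows show a negative relationship.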
This mistake highlights why data visualization is so powerful; if you rely on
the raw numbers without looking closely at them, outliers or unexpected data
can lead you to the wrong conclusion. Data visualization generally makes
these types of issues much more obvious. To see this, walk
through how you could visualize this relationship in Tableau.
To start out, drag the corpus dimension to the Columns box. This indicates
that you want to plot something against the Shakespeare corpus name (for
example, Hamlet or Merchant of Venice). To generate a nice bar chart, all
you need to do is drag word_count from the Measures pane to the Rows
box. Tableau automatically selects SUM as the aggregation and plots the corpus
name as the bar label versus the total word count of the play. From this, it
is easy to see that Hamlet and Richard III have a lot of words, whereas The
Comedy of Errors is significantly shorter (which may be, in fact, how you
like your Shakespeare). This graph, displayed in Figure 13.6, is the equivalent
of running the query:
 