multidimensional (play versus word count versus date), it can be tricky
to grok the patterns from a table of results by itself. You can get this
information in a single numerical result using a correlation query:
SELECT CORR(count, date)
FROM (
SELECT SUM(word_count) AS count, MIN(corpus_date) AS date
FROM publicdata:samples.shakespeare
GROUP BY corpus)
This returns 0.41, a moderately strong positive correlation. Sounds like maybe
Shakespeare stopped listening to his editor telling him to “wrap it up” after
he got a little bit of fame. Not quite, however. If you look at the extreme
values of the data, there are two corpora with a corpus_date of 0: sonnets
and various. Because they were written over a period of time, there is no
one date that makes the most sense, so whoever created the dataset set the
date to 0. The collected sonnets are shorter than any play, and the
various other writings are only one-fifth as long as the shortest play. These
two values completely skew the correlation calculation; if you remove them
by adding the filter WHERE corpus_date > 0, you see that the correlation
coefficient is actually negative, although quite small: -0.21.
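As a sketch of what the filtered query might look like, the version below adds the WHERE clause to the inner query of the correlation query shown earlier. Placing the filter in the inner query, before GROUP BY, is an assumption; the text only says the filter is added, not where.

```sql
-- Sketch: the earlier correlation query with the outlier filter applied.
-- Assumption: the filter belongs in the inner query, so the two corpora
-- with corpus_date = 0 (sonnets and various) are dropped before grouping.
SELECT CORR(count, date)
FROM (
  SELECT SUM(word_count) AS count, MIN(corpus_date) AS date
  FROM publicdata:samples.shakespeare
  WHERE corpus_date > 0
  GROUP BY corpus)
```

With the two undated corpora excluded, this should yield the small negative coefficient described above.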
This mistake highlights why data visualization is so powerful; if you rely on
the raw numbers without looking closely at them, you run the risk of making
mistakes if you have outliers or data you don't expect. Data visualization, in
general, makes these types of issues much more obvious. To see this, walk
through how you could visualize this relationship in Tableau.
To start out, drag the corpus dimension to the Columns box. This indicates
that you want to plot something against the Shakespeare corpus name (for
example, Hamlet or Merchant of Venice). To generate a nice bar chart, all
you need to do is drag word_count from the Measures pane to the Rows box.
Tableau automatically selects SUM as the aggregation and plots the corpus
name as the bar label versus the total word count of the play. From this, it
is easy to see that Hamlet and Richard III have a lot of words, whereas The
Comedy of Errors is significantly shorter (which may be, in fact, how you
like your Shakespeare). This graph, displayed in Figure 13.6, is the
equivalent of running the query: