Database Reference
In-Depth Information
lowercase. The
LOWER()
function operates only on a single row at a time. In
Query #2, however, you have
COUNT()
and
SUM()
. These are aggregation
functions, which operate over a field in multiple rows at once.
If you take a query without a
GROUP BY
and add an aggregation function
to the
SELECT
line, you'll get only a single result; that result will be the
value of the aggregation function applied over the entire table. Note there
are rules on mixing aggregation functions with nonaggregations: You either
need aggregations everywhere or nowhere. If you think about this, it makes
sense: If you compute the sum over a particular field, you want one result,
but if you return the field as well, you want multiple results. Because you
can't satisfy both of these constraints at once, this situation is disallowed.
You may notice, however, that we did have a mixture of aggregation
functions and raw fields in the query; we are selecting
word
as well as
SUM()
and
COUNT()
. Why is this legal? You can legally perform this
operation because of the
GROUP BY
clause:
GROUP BY word
. This causes
the query engine to partition the table into buckets—one for each value of
the
word
field. The aggregation functions then get applied to each of the
buckets, outputting one row per distinct
word
value.
To give a concrete example, consider the word “bagpipe,” which occurs twice
in
Merchant of Venice
, once in
King Henry IV part 1
and once in
Winter's
Tale
. When you group by
word
, you get a single bucket for “bagpipe.” The
COUNT()
operation returns “3” because there are three plays in which it
appears. The
SUM(word_count)
operation returns 4 since it appears four
times across the three plays. In the query results, then, you'd get one row for
“bagpipe” in the results:
{word: bagpipe, corpora: 3, total: 4}
.
There is one more line we haven't mentioned yet and doesn't look familiar
from query #1:
HAVING corpora >=2 AND corpora <> total
A
HAVING
clause is a lot like a
WHERE
clause; this one filters out any word
that appears only in one corpus or once per corpus. The difference between
a
WHERE
and
HAVING
clause is when the filtering is applied. A
WHERE
clause
filters values in the original table; it gets applied before any aggregation is
done. However, a
HAVING
clause applies filters after any aggregation, so you
can use the aggregated fields. This
HAVING
clause uses the corpora and total