Database Reference
In-Depth Information
could have been written using string quoting; I used backticks there to avoid mixing
alias quoting styles within a given query.
How Repetitive Is a Set of Values?
To assess how much repetition is present in a set of values, use the ratio of
COUNT(DIS
TINCT)
and
COUNT()
. If all values are unique, both counts are the same and the ratio is
1. This is the case for the
t
values in the
mail
table and the
pop
values in the
states
table:
mysql>
SELECT COUNT(DISTINCT t) / COUNT(t) FROM mail;
+------------------------------+
| COUNT(DISTINCT t) / COUNT(t) |
+------------------------------+
| 1.0000 |
+------------------------------+
mysql>
SELECT COUNT(DISTINCT pop) / COUNT(pop) FROM states;
+----------------------------------+
| COUNT(DISTINCT pop) / COUNT(pop) |
+----------------------------------+
| 1.0000 |
+----------------------------------+
For a more repetitive set of values,
COUNT(DISTINCT)
is less than
COUNT()
, and the ratio
is smaller:
mysql>
SELECT COUNT(DISTINCT name) / COUNT(name) FROM driver_log;
+------------------------------------+
| COUNT(DISTINCT name) / COUNT(name) |
+------------------------------------+
| 0.3000 |
+------------------------------------+
What's the practical use for this ratio? A result close to zero indicates a high degree of
repetition, which means the values will group into a small number of categories natu‐
rally. A result of 1 or close to it indicates many unique values, with the consequence that
GROUP
BY
won't be very efficient for grouping the values into categories. (That is, there
will be a lot of categories, relative to the number of values.) This tells you that, to generate
a summary, you'll probably find it necessary to impose an artificial categorization on
the values, using the techniques described in this recipe.
8.11. Finding Smallest or Largest Summary Values
Problem
You want to compute per-group summary values but display only the smallest or largest
of them.