Database Reference
In-Depth Information
divided by the right side. So WHERE HASH(title) % 10 == 0 is just a
fancy way of taking 1 out of every 10 distinct hash values.
Note that this is different than just taking 10 percent of the data; applying
the HASH filter means you keep 10 percent of titles, not necessarily 10
percent of the underlying data. If you want to know the number of distinct
titles, multiplying by 10 gives you a good estimate. If, however, you had just
kept 10 percent of the underlying rows, you likely wouldn't know how that
would affect the distinct title count.
The other advantage of having the HASH function return stable results is
that you can use it to compute different slices of your underlying table.
You already got the first 10 percent of the titles; you can get the second 10
percent by changing the value that you're comparing against:
SELECT title, COUNT(*)
FROM [publicdata:samples.wikipedia]
WHERE ABS( HASH(title) % 10 ) == 1
GROUP BY title
To get the next slice of the table, increment the value you're matching
against; this is the remainder when divided by 10. You can get each of the 10
slices in turn by running once with each number from 0 up to 9. If you look
closely at the query, however, you might notice that there was one additional
change: the addition of an absolute value ( ABS() ) call to the hashed value.
The modulo operator doesn't behave predictably when given a negative
number. Some programming languages define the modulo of a negative
number to be negative. Others say it is always positive. Others say it is
always the same sign as the right side (10 in this case). BigQuery SQL defines
the output of the modulo operator as the same sign as the left side of the
operation. Because HASH can return negative values, taking the modulo of
the result can return a negative value. So if you filter by HASH values where
the modulo is equal to 1, you'll miss out on the ones where the modulo is
-1. This is fairly confusing, and it confuses a lot of people who are surprised
when their HASH filters don't work the way they expect.
There is one drawback to using HASH filtering: You have to run your queries
over the whole table. If you run queries against each of 10 slices, it means
you have to run 10 queries against the whole table, which can get expensive.
Search WWH ::




Custom Search