Data Sampling
Sometimes, it would be nice to operate on only a small portion of your data.
Maybe you're testing out a query over a giant table and don't want to incur
large costs while you're getting your query right. Or maybe you are running
into “resources exceeded” errors, so you need to break up your table into
pieces. In a BigQuery query, there are two commonly used ways to break up
your data: HASH sampling and partition decorators. A word of caution: you
may not want to try all the queries in this section, because they will consume
a fair amount of your quota.
HASH Sampling
HASH sampling is a mechanism for selecting a small portion of your table. It
assigns a HASH value to each row in the table and filters out only those rows
that match a certain pattern. This mechanism is similar to how BigQuery
performs the shuffle operation.
For example, say you want to select 10 percent of the unique titles from the
Wikipedia sample dataset. You could run the following query:
SELECT title, COUNT(*)
FROM [publicdata:samples.wikipedia]
WHERE HASH(title) % 10 == 0
GROUP BY title
This applies a HASH function to each title and keeps only the rows whose
hash value is evenly divisible by 10. The HASH function can be thought of as
a kind of random number generator, but it isn't truly random: HASH applies
a complex function that makes the output for a given input as random as
possible, yet running it on the same input multiple times always produces
the same answer. This determinism has a nice property: all rows with the
same source value HASH to the same result, so filtering on the HASH value
filters out a fixed portion of the distinct source values rather than a random
subset of rows.
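The deterministic-bucketing property described above can be sketched outside of BigQuery. In this illustration, Python's `zlib.crc32` stands in for BigQuery's HASH function (an assumption; BigQuery uses its own internal hash), but the key behavior is the same: equal inputs always land in the same bucket, so every occurrence of a kept title is kept and every occurrence of a dropped title is dropped.

```python
import zlib

def bucket(value: str, num_buckets: int = 10) -> int:
    """Deterministically assign a value to one of num_buckets buckets.

    zlib.crc32 is a stand-in for BigQuery's HASH function: it is not
    cryptographically random, but it is deterministic and spreads
    inputs roughly evenly across buckets.
    """
    return zlib.crc32(value.encode("utf-8")) % num_buckets

titles = ["Hamlet", "Macbeth", "Othello", "Hamlet", "King Lear"]

# Keep roughly 1/10 of the *distinct* titles: those hashing to bucket 0.
# This mirrors WHERE HASH(title) % 10 == 0 in the query above.
sample = [t for t in titles if bucket(t) == 0]

# Determinism: the same title always maps to the same bucket, so the
# sample is consistent across rows and across repeated runs.
assert bucket("Hamlet") == bucket("Hamlet")
```

Note that this selects about 10 percent of distinct values, not 10 percent of rows; a title that appears a million times is either entirely in or entirely out of the sample.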
The next interesting part of the WHERE clause is the % 10 == 0 part. The %
symbol is the modulo operator, which returns the remainder of the left side