Data Sampling
Sometimes, it would be nice to operate on only a small portion of your data.
Maybe you're testing out a query over a giant table and don't want to incur
large costs while you're getting your query right. Or maybe you are running
into “resources exceeded” errors, so you need to break up your table into
pieces. In a BigQuery query, there are two commonly used ways to break up
your data: HASH sampling and partition decorators. A word of caution: you
may not want to try all the queries in this section, because they will consume
a fair amount of your quota.
HASH Sampling
HASH sampling is a mechanism for selecting a small portion of your table. It
assigns a HASH value to each row in the table and filters out only those rows
that match a certain pattern. This mechanism is similar to how BigQuery
performs the shuffle operation.
For example, say you want to select 10 percent of the unique titles from the
Wikipedia sample dataset. You could run the following query:
SELECT title, COUNT(*)
FROM [publicdata:samples.wikipedia]
WHERE HASH(title) % 10 == 0
GROUP BY title
This applies a HASH function to each title and keeps only the rows whose
hash value is evenly divisible by 10. The HASH function can be thought of as
a kind of random number generator, but it isn't truly random: HASH applies
a complex function that makes the output for a given input as random as
possible, yet running it on the same input multiple times always produces
the same answer. This determinism has a nice property: all rows with the
same source value HASH to the same result, so filtering on the HASH value
filters out a fixed portion of the distinct source values rather than a random
subset of rows.
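The deterministic-bucketing property described above can be sketched outside of BigQuery. In this illustration, Python's `zlib.crc32` stands in for BigQuery's HASH function (an assumption; BigQuery uses its own internal hash), but the key behavior is the same: equal inputs always land in the same bucket, so every occurrence of a kept title is kept and every occurrence of a dropped title is dropped.

```python
import zlib

def bucket(value: str, num_buckets: int = 10) -> int:
    """Deterministically assign a value to one of num_buckets buckets.

    zlib.crc32 is a stand-in for BigQuery's HASH function: it is not
    cryptographically random, but it is deterministic and spreads
    inputs roughly evenly across buckets.
    """
    return zlib.crc32(value.encode("utf-8")) % num_buckets

titles = ["Hamlet", "Macbeth", "Othello", "Hamlet", "King Lear"]

# Keep roughly 1/10 of the *distinct* titles: those hashing to bucket 0.
# This mirrors WHERE HASH(title) % 10 == 0 in the query above.
sample = [t for t in titles if bucket(t) == 0]

# Determinism: the same title always maps to the same bucket, so the
# sample is consistent across rows and across repeated runs.
assert bucket("Hamlet") == bucket("Hamlet")
```

Note that this selects about 10 percent of distinct values, not 10 percent of rows; a title that appears a million times is either entirely in or entirely out of the sample.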
The next interesting part of the WHERE clause is the % 10 == 0 part. The %
symbol is the modulo operator, which returns the remainder of the left side