Advanced Queries - Google BigQuery Analytics

Database Reference

In-Depth Information

It has been previously discussed that BigQuery uses a columnar storage

scheme to deliver better performance. It is clear how records with a simple

schema consisting of only required or optional fields with primitive types

could be laid out in storage so the data in a single column appears

sequentially. Optional fields would require some care because you would

need to have some way to indicate that a particular field was absent in a

given record, for example, a reserved value that represented NULL. Even

nested or record type fields without repetition are easy to handle because

you could just treat all the fields as top-level fields using the fully qualified

names. However, after you combine nested and repeated fields, the

columnar representation becomes more interesting. The scheme used in

BigQuery is explained in the Dremel paper, and variants have been adopted

in related open source formats (Parquet). These formats are designed to

support complex schemas without sacrificing the advantages of columnar

storage for data analysis.

Storing these complex records is only half the story; it would be quite useless

if we cannot also query over them effectively. Since standard SQL operates

on flat relational tables, it is not always convenient for working with

complex schemas. BigQuery SQL has added support for some non-standard

extensions to simplify working with these schemas. This section explains

how to use these features.

Pre-Joined Layout

The catch with respect to queries is that you need to operate over these

structured records using SQL. However, relational databases don't

encourage schemas with nested and repeated fields, and as a result, SQL

is not well suited to manipulating these records with internal structure.

Fortunately, repeated fields bear a striking resemblance to foreign key

relationships between relational tables. When you recognize the similarities,

it can feel quite natural to operate on repeated fields using SQL. An example

from the sample application can illustrate this point.

Each log record in the sample application consists of fields with one value

per record such as the timestamp and location information. It also has fields

that are lists, such as the set of applications that have recently been used

on the phone. The full schema was described Chapter 8; here the portion

relevant to the examples is reproduced.

Search WWH ::

Custom Search

Home