Database Reference
In-Depth Information
It has been previously discussed that BigQuery uses a columnar storage
scheme to deliver better performance. It is clear how records with a simple
schema consisting of only required or optional fields with primitive types
could be laid out in storage so the data in a single column appears
sequentially. Optional fields would require some care because you would
need to have some way to indicate that a particular field was absent in a
given record, for example, a reserved value that represented NULL. Even
nested or record type fields without repetition are easy to handle because
you could just treat all the fields as top-level fields using the fully qualified
names. However, after you combine nested and repeated fields, the
columnar representation becomes more interesting. The scheme used in
BigQuery is explained in the Dremel paper, and variants have been adopted
in related open source formats (Parquet). These formats are designed to
support complex schemas without sacrificing the advantages of columnar
storage for data analysis.
Storing these complex records is only half the story; it would be quite useless
if we cannot also query over them effectively. Since standard SQL operates
on flat relational tables, it is not always convenient for working with
complex schemas. BigQuery SQL has added support for some non-standard
extensions to simplify working with these schemas. This section explains
how to use these features.
Pre-Joined Layout
The catch with respect to queries is that you need to operate over these
structured records using SQL. However, relational databases don't
encourage schemas with nested and repeated fields, and as a result, SQL
is not well suited to manipulating these records with internal structure.
Fortunately, repeated fields bear a striking resemblance to foreign key
relationships between relational tables. When you recognize the similarities,
it can feel quite natural to operate on repeated fields using SQL. An example
from the sample application can illustrate this point.
Each log record in the sample application consists of fields with one value
per record such as the timestamp and location information. It also has fields
that are lists, such as the set of applications that have recently been used
on the phone. The full schema was described Chapter 8; here the portion
relevant to the examples is reproduced.
Search WWH ::




Custom Search