Database Reference
In-Depth Information
parent (the left side of the join) table. This is advantageous when most
queries end up accessing both tables because it reduces the disk seeks.
However, queries that scan just the parent table are degraded because all the
records from the child table must be skipped over. With columnar storage,
scans are not impacted; a query that references only fields in the outer
record incurs no penalty from fields in the nested record. However, any
join operation in a query is relatively costly in BigQuery because it does
not maintain indexes. To avoid joins in your query, you should store nested
records whenever possible. In effect, this data is pre-joined , which makes
it more efficient to query. There are caveats, which are discussed next, but
a good rule of thumb is that you can use your data collection process as
a guide for how to structure records. Another clue is if you find you need
to update two different tables consistently, for example, if you have an
Orders table and an ItemsInOrder table and you need to insert entries
in both tables atomically for each order processed. In a database you will
likely use a transaction. In BigQuery you could achieve the same result by
nesting ItemsInOrder in Orders because record insertions are atomic. In
general, all the information that is recorded in a single operation or event
should be packaged into a record with a suitable schema.
When working with nested and repeated schemas, special attention needs to
be paid to the use of COUNT . Take for example this query that counts various
fields in our logs.
SELECT COUNT(running.name) / COUNT(ts) AS
apps_per_record
FROM [bigquery-e2e:ch10.sample_data]
Note the use of COUNT(ts) rather than COUNT(*) . In BigQuery,
COUNT(*) returns the total number of top-level records. It is a good practice
to use COUNT( field ) unless you are certain you want COUNT(*) . For
a repeated field, COUNT( field ) returns the total repetitions of the field
across all records. This behavior is in contrast to the behavior you would
see in a relational database. Assuming no NULL values are present in the
fields, then a COUNT of any field in a natural join query would return the
identical value, matching the total number of rows generated by the join.
This difference is an indication that our analogy between multiple relational
tables and repeated fields, while useful, is only approximate.
Search WWH ::




Custom Search