Database Reference
In-Depth Information
a field in the LEFT table matches a field in the RIGHT table. Some other
SQL systems allow the filter to be specified in a WHERE clause (for example,
WHERE table1.foo = table2.bar ), but in BigQuery, only the ON clause
version is supported.
The difference between INNER and OUTER is how they handle cases when
there are no matches between the value in the LEFT table and the value in
the RIGHT . An INNER JOIN drops any rows where there is not a match,
whereas an OUTER JOIN keeps the row and fills the side that doesn't
match with null values. The difference between LEFT OUTER and RIGHT
OUTER JOIN is which table gets precedence—the LEFT or the RIGHT . It
isn't important to understand all the possible JOIN options now, but some
familiarity with JOIN semantics helps to understand the BigQuery JOIN
implementation.
Broadcast JOIN
A “small” JOIN is often called a “broadcast” JOIN because it relies on
broadcasting the smaller table to all the nodes in the computation tree.
A JOIN operation needs to generate all combinations of rows where the
fields specified in the ON clause match. To process the JOIN operation, you
can walk all the rows in the LEFT table and figure out whether there is a
matching row in the RIGHT table. If the RIGHT table is small, you can just
put all the values into a hash table. The hash table has keys that are the JOIN
keys from the ON clause, and the values are the rest of the row (or the fields
needed from the row in the query). Then, for each row on the left, you can
do a quick lookup by key to find the matching row from the RIGHT table.
This algorithm is parallelizable, as long as you replicate the hash table to
each parallel worker. This, in essence, is how the broadcast JOIN works.
The smaller table is required to be on the RIGHT side of the query, and it
is broadcast to each worker node. Then the worker can perform the JOIN
operation by doing a simple key lookup. Figure 9.3 shows how a broadcast
JOIN operation works in practice. The small boxes to the left of the shards
and mixers are the RIGHT table from the query, which gets distributed to
every node in the tree.
 
Search WWH ::




Custom Search