Database Reference
In-Depth Information
a field in the
LEFT
table matches a field in the
RIGHT
table. Some other
SQL systems allow the filter to be specified in a
WHERE
clause (for example,
WHERE table1.foo = table2.bar
), but in BigQuery, only the
ON
clause
version is supported.
The difference between
INNER
and
OUTER
is how they handle cases when
there are no matches between the value in the
LEFT
table and the value in
the
RIGHT
. An
INNER JOIN
drops any rows where there is not a match,
whereas an
OUTER JOIN
keeps the row and fills the side that doesn't
match with
null
values. The difference between
LEFT OUTER
and
RIGHT
OUTER JOIN
is which table gets precedence—the
LEFT
or the
RIGHT
. It
isn't important to understand all the possible
JOIN
options now, but some
familiarity with
JOIN
semantics helps to understand the BigQuery
JOIN
implementation.
Broadcast JOIN
A “small”
JOIN
is often called a “broadcast”
JOIN
because it relies on
broadcasting the smaller table to all the nodes in the computation tree.
A
JOIN
operation needs to generate all combinations of rows where the
fields specified in the
ON
clause match. To process the
JOIN
operation, you
can walk all the rows in the
LEFT
table and figure out whether there is a
matching row in the
RIGHT
table. If the
RIGHT
table is small, you can just
put all the values into a hash table. The hash table has keys that are the
JOIN
keys from the
ON
clause, and the values are the rest of the row (or the fields
needed from the row in the query). Then, for each row on the left, you can
do a quick lookup by key to find the matching row from the
RIGHT
table.
This algorithm is parallelizable, as long as you replicate the hash table to
each parallel worker. This, in essence, is how the broadcast
JOIN
works.
The smaller table is required to be on the
RIGHT
side of the query, and it
is broadcast to each worker node. Then the worker can perform the
JOIN
operation by doing a simple key lookup.
Figure 9.3
shows how a broadcast
JOIN
operation works in practice. The small boxes to the left of the shards
and mixers are the
RIGHT
table from the query, which gets distributed to
every node in the tree.