Database Reference
In-Depth Information
NOTE
Hive (like MySQL and Oracle) allows you to list the join tables in the
FROM
clause and specify the join
condition in the
WHERE
clause of a
SELECT
statement. For example, the following is another way of ex-
pressing the query we just saw:
SELECT sales.*, things.*
FROM sales, things
WHERE sales.id = things.id;
A single join is implemented as a single MapReduce job, but multiple joins can be per-
formed in less than one MapReduce job per join if the same column is used in the join
query by prefixing it with the
EXPLAIN
keyword:
EXPLAIN
SELECT sales.*, things.*
FROM sales JOIN things ON (sales.id = things.id);
The
EXPLAIN
output includes many details about the execution plan for the query, in-
cluding the abstract syntax tree, the dependency graph for the stages that Hive will ex-
ecute, and information about each stage. Stages may be MapReduce jobs or operations
such as file moves. For even more detail, prefix the query with
EXPLAIN EXTENDED
.
Hive currently uses a rule-based query optimizer for determining how to execute a query,
but a cost-based optimizer is available from Hive 0.14.0.
Outer joins
Outer joins allow you to find nonmatches in the tables being joined. In the current ex-
ample, when we performed an inner join, the row for Ali did not appear in the output, be-
cause the ID of the item she purchased was not present in the
things
table. If we change
the join type to
LEFT OUTER JOIN
, the query will return a row for every row in the left
table (
sales
), even if there is no corresponding row in the table it is being joined to
(
things
):
hive>
SELECT sales.*, things.*
>
FROM sales LEFT OUTER JOIN things ON (sales.id = things.id);
Joe 2 2 Tie
Hank 4 4 Coat
Ali 0 NULL NULL
Eve 3 3 Hat
Hank 2 2 Tie