Database Reference
In-Depth Information
distributed data sets is extremely resource intensive, and performance
would often be unreasonably poor. For the same reason,
OR
s cannot be used
in
JOIN
clauses.
INNER
,
LEFT OUTER
,
RIGHT OUTER
, and
FULL OUTER JOIN
s are
supported. These function like their SQL equivalents. When processing a
SELECT
with both
JOIN
and
WHERE
clauses, Hive evaluates the
JOIN
first,
then the
WHERE
clause is applied on the joined results.
During join operations, Hive makes the assumption that the largest table
appears last in the
FROM
clause. Therefore, it attempts to process the other
tables first, and then streams the content of the last table. If you keep this in
mind when writing your Hive queries, you will get better performance. You
can use a query hint to indicate which table should be streamed, too:
SELECT /*+ STREAMTABLE(bt) */ bt.name,
bt.transactionAmount, c.state
FROM bigTable bt JOIN customer c ON bt.postalCode =
c.PostalCode
When you are using
ORDER BY
, be aware that this requires ordering of the
whole data set. Because this operation cannot be distributed across multiple
nodes, it can be quite slow. Hive offers the alternative
SORT BY
. Instead of
sorting the entire data set,
SORT BY
lets each node that is processing results
sort its results locally. The overall data set won't be ordered, but the results
from each node will be sorted:
SELECT c.name, c.city, c.state, c.postalCode,
c.purchases
FROM MsBigData.customer c
SORT BY c.postalCode;
You can use
SORT BY
in conjunction with
DISTRIBUTE BY
to send related
data to the same nodes for processing so that there is less overlap from
sorting on multiple nodes. In the following example, the data is distributed
to nodes, based on the state, and then the postal codes are sorted per state
for each node:
SELECT c.name, c.city, c.state, c.postalCode,
c.purchases