Database Reference
In-Depth Information
distributed data sets is extremely resource intensive, and performance
would often be unreasonably poor. For the same reason, OR s cannot be used
in JOIN clauses.
INNER , LEFT OUTER , RIGHT OUTER , and FULL OUTER JOIN s are
supported. These function like their SQL equivalents. When processing a
SELECT with both JOIN and WHERE clauses, Hive evaluates the JOIN first,
then the WHERE clause is applied on the joined results.
During join operations, Hive makes the assumption that the largest table
appears last in the FROM clause. Therefore, it attempts to process the other
tables first, and then streams the content of the last table. If you keep this in
mind when writing your Hive queries, you will get better performance. You
can use a query hint to indicate which table should be streamed, too:
SELECT /*+ STREAMTABLE(bt) */ bt.name,
bt.transactionAmount, c.state
FROM bigTable bt JOIN customer c ON bt.postalCode =
c.PostalCode
When you are using ORDER BY , be aware that this requires ordering of the
whole data set. Because this operation cannot be distributed across multiple
nodes, it can be quite slow. Hive offers the alternative SORT BY . Instead of
sorting the entire data set, SORT BY lets each node that is processing results
sort its results locally. The overall data set won't be ordered, but the results
from each node will be sorted:
SELECT c.name, c.city, c.state, c.postalCode,
c.purchases
FROM MsBigData.customer c
SORT BY c.postalCode;
You can use SORT BY in conjunction with DISTRIBUTE BY to send related
data to the same nodes for processing so that there is less overlap from
sorting on multiple nodes. In the following example, the data is distributed
to nodes, based on the state, and then the postal codes are sorted per state
for each node:
SELECT c.name, c.city, c.state, c.postalCode,
c.purchases
Search WWH ::




Custom Search