Adding Structure with Hive - Microsoft Big Data Solutions

Database Reference

In-Depth Information

distributed data sets is extremely resource intensive, and performance

would often be unreasonably poor. For the same reason, OR s cannot be used

in JOIN clauses.

INNER , LEFT OUTER , RIGHT OUTER , and FULL OUTER JOIN s are

supported. These function like their SQL equivalents. When processing a

SELECT with both JOIN and WHERE clauses, Hive evaluates the JOIN first,

then the WHERE clause is applied on the joined results.

During join operations, Hive makes the assumption that the largest table

appears last in the FROM clause. Therefore, it attempts to process the other

tables first, and then streams the content of the last table. If you keep this in

mind when writing your Hive queries, you will get better performance. You

can use a query hint to indicate which table should be streamed, too:

SELECT /*+ STREAMTABLE(bt) */ bt.name,

bt.transactionAmount, c.state

FROM bigTable bt JOIN customer c ON bt.postalCode =

c.PostalCode

When you are using ORDER BY , be aware that this requires ordering of the

whole data set. Because this operation cannot be distributed across multiple

nodes, it can be quite slow. Hive offers the alternative SORT BY . Instead of

sorting the entire data set, SORT BY lets each node that is processing results

sort its results locally. The overall data set won't be ordered, but the results

from each node will be sorted:

SELECT c.name, c.city, c.state, c.postalCode,

c.purchases

FROM MsBigData.customer c

SORT BY c.postalCode;

You can use SORT BY in conjunction with DISTRIBUTE BY to send related

data to the same nodes for processing so that there is less overlap from

sorting on multiple nodes. In the following example, the data is distributed

to nodes, based on the state, and then the postal codes are sorted per state

for each node:

SELECT c.name, c.city, c.state, c.postalCode,

c.purchases

Search WWH ::

Custom Search

Home