Database Reference
In-Depth Information
Querying Data
This section discusses how to use various forms of the SELECT statement to retrieve data
from Hive.
Sorting and Aggregating
Sorting data in Hive can be achieved by using a standard ORDER BY clause. ORDER BY
performs a parallel total sort of the input (like that described in Total Sort ). When a glob-
ally sorted result is not required — and in many cases it isn't — you can use Hive's non-
standard extension, SORT BY , instead. SORT BY produces a sorted file per reducer.
In some cases, you want to control which reducer a particular row goes to — typically so
you can perform some subsequent aggregation. This is what Hive's DISTRIBUTE BY
clause does. Here's an example to sort the weather dataset by year and temperature, in such
a way as to ensure that all the rows for a given year end up in the same reducer parti-
tion: [ 116 ]
hive> FROM records2
> SELECT year, temperature
> DISTRIBUTE BY year
> SORT BY year ASC, temperature DESC;
1949 111
1949 78
1950 22
1950 0
1950 -11
A follow-on query (or a query that nests this query as a subquery; see Subqueries ) would
be able to use the fact that each year's temperatures were grouped and sorted (in descend-
ing order) in the same file.
If the columns for SORT BY and DISTRIBUTE BY are the same, you can use CLUSTER
BY as a shorthand for specifying both.
MapReduce Scripts
Using an approach like Hadoop Streaming, the TRANSFORM , MAP , and REDUCE clauses
make it possible to invoke an external script or program from Hive. Suppose we want to
use a script to filter out rows that don't meet some condition, such as the script in
Example 17-1 , which removes poor-quality readings.
Example 17-1. Python script to filter out poor-quality weather records
Search WWH ::




Custom Search