Hive - Hadoop: The Definitive Guide

Database Reference

In-Depth Information

Querying Data

This section discusses how to use various forms of the SELECT statement to retrieve data

from Hive.

Sorting and Aggregating

Sorting data in Hive can be achieved by using a standard ORDER BY clause. ORDER BY

performs a parallel total sort of the input (like that described in Total Sort ). When a glob-

ally sorted result is not required — and in many cases it isn't — you can use Hive's non-

standard extension, SORT BY , instead. SORT BY produces a sorted file per reducer.

In some cases, you want to control which reducer a particular row goes to — typically so

you can perform some subsequent aggregation. This is what Hive's DISTRIBUTE BY

clause does. Here's an example to sort the weather dataset by year and temperature, in such

a way as to ensure that all the rows for a given year end up in the same reducer parti-

tion: [ 116 ]

hive> FROM records2

> SELECT year, temperature

> DISTRIBUTE BY year

> SORT BY year ASC, temperature DESC;

1949 111

1949 78

1950 22

1950 0

1950 -11

A follow-on query (or a query that nests this query as a subquery; see Subqueries ) would

be able to use the fact that each year's temperatures were grouped and sorted (in descend-

ing order) in the same file.

If the columns for SORT BY and DISTRIBUTE BY are the same, you can use CLUSTER

BY as a shorthand for specifying both.

MapReduce Scripts

Using an approach like Hadoop Streaming, the TRANSFORM , MAP , and REDUCE clauses

make it possible to invoke an external script or program from Hive. Suppose we want to

use a script to filter out rows that don't meet some condition, such as the script in

Example 17-1 , which removes poor-quality readings.

Example 17-1. Python script to filter out poor-quality weather records

Search WWH ::

Custom Search

Home