Hive - Hadoop: The Definitive Guide


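#!/usr/bin/env python

import re
import sys

# Read whitespace-separated (year, temperature, quality) records from stdin
for line in sys.stdin:
    (year, temp, q) = line.strip().split()
    # Skip missing readings (9999) and readings whose quality code is not 0, 1, 4, 5, or 9
    if (temp != "9999" and re.match("[01459]", q)):
        print("%s\t%s" % (year, temp))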

We can use the script as follows:

hive> ADD FILE /Users/tom/book-workspace/hadoop-book/ch17-hive/src/main/python/is_good_quality.py;
hive> FROM records2
    > SELECT TRANSFORM(year, temperature, quality)
    > USING 'is_good_quality.py'
    > AS year, temperature;
1950    0
1950    22
1950    -11
1949    111
1949    78

Before running the query, we need to register the script with Hive. This is so Hive knows
to ship the file to the Hadoop cluster (see Distributed Cache).

The query itself streams the year, temperature, and quality fields as a tab-separated
line to the is_good_quality.py script, and parses the tab-separated output into year
and temperature fields to form the output of the query.
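The round trip described above can be tried locally without Hive. The following is a minimal sketch, not part of the original example: the helper names is_good_quality and simulate_transform and the sample rows are hypothetical, introduced only for illustration. It joins each row into a tab-separated line, applies the same filter as the script, and splits the surviving output lines back into fields.

```python
import re


def is_good_quality(line):
    """Return a 'year<TAB>temp' line for a good reading, or None to drop it."""
    year, temp, q = line.strip().split()
    if temp != "9999" and re.match("[01459]", q):
        return "%s\t%s" % (year, temp)
    return None


def simulate_transform(rows):
    """Mimic the streaming round trip: rows -> tab-separated lines -> filter -> rows."""
    output = []
    for row in rows:
        # Serialize the row as one tab-separated line, as the script would see it
        line = "\t".join(row) + "\n"
        result = is_good_quality(line)
        if result is not None:
            # Parse the script's tab-separated output back into fields
            output.append(tuple(result.split("\t")))
    return output


# Hypothetical (year, temperature, quality) rows, made up for illustration
sample_rows = [
    ("1950", "0", "1"),
    ("1950", "9999", "1"),  # missing reading, dropped
    ("1949", "111", "2"),   # quality code 2 is not accepted, dropped
    ("1949", "78", "1"),
]

filtered = simulate_transform(sample_rows)
print(filtered)
```

Keeping the filter in a function makes it easy to check the logic on a few lines before registering the script with ADD FILE.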

This example has no reducers. If we use a nested form for the query, we can specify both a
map and a reduce function. This time we use the MAP and REDUCE keywords, but using SELECT
TRANSFORM in both cases would have the same result (Example 2-10 includes the
source for the max_temperature_reduce.py script):

FROM (
  FROM records2
  MAP year, temperature, quality
  USING 'is_good_quality.py'
  AS year, temperature) map_output
REDUCE year, temperature
USING 'max_temperature_reduce.py'
AS year, temperature;
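The source for max_temperature_reduce.py is given in Example 2-10 and is not reproduced here. As a rough illustration of what a reduce script of this kind does, here is a minimal sketch under stated assumptions: the function name max_temperature is hypothetical, and the code assumes that all lines for a given year arrive consecutively. The sample input reuses the year and temperature pairs printed by the earlier query.

```python
from itertools import groupby


def max_temperature(lines):
    """Yield 'year<TAB>max_temp' for each run of consecutive lines sharing a year."""
    # Split each non-empty line into [year, temperature]
    parsed = (line.strip().split("\t") for line in lines if line.strip())
    # groupby only merges adjacent items, so the input must be grouped by year
    for year, group in groupby(parsed, key=lambda fields: fields[0]):
        yield "%s\t%d" % (year, max(int(temp) for _, temp in group))


# The tab-separated (year, temperature) lines output by the first query
map_output = ["1950\t0\n", "1950\t22\n", "1950\t-11\n", "1949\t111\n", "1949\t78\n"]

reduced = list(max_temperature(map_output))
for line in reduced:
    print(line)
```

A script version would pass sys.stdin to the function and print each yielded line, so that Hive can parse the tab-separated output into the year and temperature columns.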
