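#!/usr/bin/env python
import re
import sys

for line in sys.stdin:
    # Each input line holds the year, temperature, and quality code.
    (year, temp, q) = line.strip().split()
    # Keep readings that are present (not 9999) and have a good quality code.
    if temp != "9999" and re.match("[01459]", q):
        print("%s\t%s" % (year, temp))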
We can use the script as follows:
hive> ADD FILE /Users/tom/book-workspace/hadoop-book/ch17-hive/src/main/python/is_good_quality.py;
hive> FROM records2
> SELECT TRANSFORM(year, temperature, quality)
> USING 'is_good_quality.py'
> AS year, temperature;
1950 0
1950 22
1950 -11
1949 111
1949 78
Before running the query, we need to register the script with Hive (the ADD FILE command above), so that Hive knows
to ship the file to the Hadoop cluster (see Distributed Cache).
The query itself streams the year, temperature, and quality fields as a tab-separated
line to the is_good_quality.py script, and parses the tab-separated output into year
and temperature fields to form the output of the query.
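To see what the script does with the rows Hive streams to it, the same filter can be exercised locally without a cluster. The following is a minimal sketch rather than the script itself: the function name good_quality_rows and the sample rows are hypothetical, and it assumes each input line carries year, temperature, and quality separated by tabs.

```python
import re


def good_quality_rows(lines):
    """Yield 'year<TAB>temp' for rows that pass the quality filter.

    Hypothetical helper mirroring the logic of is_good_quality.py; each
    input line is assumed to be 'year<TAB>temperature<TAB>quality'.
    """
    for line in lines:
        year, temp, q = line.strip().split()
        # Skip readings equal to "9999" and keep only rows whose quality
        # code starts with 0, 1, 4, 5, or 9 (same test as the script).
        if temp != "9999" and re.match("[01459]", q):
            yield "%s\t%s" % (year, temp)


# Illustrative rows made up for this sketch, shaped as Hive would stream them.
sample = ["1950\t0\t1\n", "1950\t9999\t1\n", "1949\t111\t2\n", "1949\t78\t1\n"]
for row in good_quality_rows(sample):
    print(row)
```

Wrapping the filter in a generator keeps the logic testable on in-memory lists; the real script reads sys.stdin and prints to standard output, which is all Hive requires of a streaming script.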
This example has no reducers. If we use a nested form for the query, we can specify a map
and a reduce function. This time we use the MAP and REDUCE keywords, although SELECT
TRANSFORM in both cases would give the same result (Example 2-10 includes the
source for the max_temperature_reduce.py script):
FROM (
  FROM records2
  MAP year, temperature, quality
  USING 'is_good_quality.py'
  AS year, temperature) map_output
REDUCE year, temperature
USING 'max_temperature_reduce.py'
AS year, temperature;
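The reduce step can be pictured with a short sketch. This is not the source of Example 2-10; it is a hypothetical stand-in that takes year and temperature as tab-separated lines and emits the maximum temperature per year. The function name max_temperature_by_year is an assumption, and the code assumes that all rows for a given year arrive next to one another.

```python
def max_temperature_by_year(lines):
    """Yield 'year<TAB>max_temp' for each run of consecutive rows sharing a year.

    Hypothetical sketch; assumes rows for the same year are adjacent and
    that each line is 'year<TAB>temperature'.
    """
    last_year, max_temp = None, None
    for line in lines:
        year, temp = line.strip().split("\t")
        temp = int(temp)
        if year != last_year:
            # A new year begins: emit the finished year, then start over.
            if last_year is not None:
                yield "%s\t%s" % (last_year, max_temp)
            last_year, max_temp = year, temp
        else:
            max_temp = max(max_temp, temp)
    # Emit the final year, if any input was seen.
    if last_year is not None:
        yield "%s\t%s" % (last_year, max_temp)


# Rows taken from the output of the map-only query shown earlier.
mapped = ["1950\t0", "1950\t22", "1950\t-11", "1949\t111", "1949\t78"]
for row in max_temperature_by_year(mapped):
    print(row)
```

Because the sketch only compares each row with the previous one, it holds a single running maximum in memory, but it gives correct per-year results only when the input is grouped by year.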