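#!/usr/bin/env python
import re
import sys

# Read whitespace-separated (year, temperature, quality) records from stdin.
for line in sys.stdin:
  (year, temp, q) = line.strip().split()
  # Keep the record only if the reading is present and the quality code is good.
  if (temp != "9999" and re.match("[01459]", q)):
    print("%s\t%s" % (year, temp))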
We can use the script as follows:
hive> ADD FILE /Users/tom/book-workspace/hadoop-book/ch17-hive/src/main/python/is_good_quality.py;
hive> FROM records2
    > SELECT TRANSFORM(year, temperature, quality)
    > USING 'is_good_quality.py'
    > AS year, temperature;
1950    0
1950    22
1950    -11
1949    111
1949    78
Before running the query, we need to register the script with Hive. This is so Hive knows to ship the file to the Hadoop cluster (see Distributed Cache).
The query itself streams the year, temperature, and quality fields as a tab-separated line to the is_good_quality.py script, and parses the tab-separated output into year and temperature fields to form the output of the query.
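To make the streaming contract concrete, the following is a minimal local sketch of the round trip described above: each row is serialized as a tab-separated line, passed through the same test that is_good_quality.py applies, and the tab-separated result is parsed back into fields. It does not call Hive. The function names is_good and transform and the sample rows are hypothetical, introduced only for illustration.

```python
import re

def is_good(temp, quality):
    """Same test as is_good_quality.py: reading present, quality code in [01459]."""
    return temp != "9999" and re.match("[01459]", quality) is not None

def transform(rows):
    """Imitate the TRANSFORM round trip for a list of (year, temp, quality) rows."""
    out = []
    for row in rows:
        # What the script would receive on stdin: one tab-separated line per row.
        line = "\t".join(str(field) for field in row)
        year, temp, q = line.strip().split()
        if is_good(temp, q):
            # What the script would print, and how it is split back into columns.
            result_line = "%s\t%s" % (year, temp)
            out.append(tuple(result_line.split("\t")))
    return out

# Hypothetical sample rows (year, temperature, quality), for illustration only.
sample = [
    ("1950", "0", "1"),
    ("1950", "9999", "1"),   # missing reading, dropped
    ("1949", "111", "2"),    # quality code not in [01459], dropped
    ("1949", "78", "1"),
]

print(transform(sample))
```

Note that every value crosses the script boundary as text, so the script compares the temperature against the string "9999" rather than a number.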
This example has no reducers. If we use a nested form for the query, we can specify a map and a reduce function. This time we use the MAP and REDUCE keywords, but SELECT TRANSFORM in both cases would have the same result. (Example 2-10 includes the source for the max_temperature_reduce.py script.)
FROM (
  FROM records2
  MAP year, temperature, quality
  USING 'is_good_quality.py'
  AS year, temperature) map_output
REDUCE year, temperature
USING 'max_temperature_reduce.py'
AS year, temperature;
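As a rough illustration of what the reduce step computes, here is a small sketch that takes tab-separated year and temperature lines and emits the maximum temperature per year. This is not the source of max_temperature_reduce.py from Example 2-10. The function name max_temperature is hypothetical, and the sketch assumes that all lines for a given year arrive next to each other. The input lines are the five rows shown in the earlier query output.

```python
from itertools import groupby

def max_temperature(lines):
    """Yield (year, max_temp) from tab-separated lines that are grouped by year."""
    parsed = (line.strip().split("\t") for line in lines)
    # groupby only merges adjacent equal keys, hence the grouping assumption.
    for year, group in groupby(parsed, key=lambda fields: fields[0]):
        yield year, max(int(temp) for _, temp in group)

# The rows produced by the map step in the earlier example.
map_output = ["1950\t0", "1950\t22", "1950\t-11", "1949\t111", "1949\t78"]

for year, temp in max_temperature(map_output):
    print("%s\t%s" % (year, temp))
```

For these five rows the sketch prints 22 for 1950 and 111 for 1949. If the rows for a year were not contiguous, groupby would emit that year more than once, so a real script has to rely on the grouping that the framework provides.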