Database Reference
In-Depth Information
%
cat input/ncdc/sample.txt | \
ch02-mr-intro/src/main/python/max_temperature_map.py | \
sort | ch02-mr-intro/src/main/python/max_temperature_reduce.py
1949 111
1950 22
[
20
]
Functions with this property are called
commutative
and
associative
. They are also sometimes referred
to as
distributive
, such as by Jim Gray et al.'s
“Data Cube: A Relational Aggregation Operator Generalizing
Group-By, Cross-Tab, and Sub-Totals,”
February1995.
[
21
]
This is a factor of seven faster than the serial run on one machine using
awk
. The main reason it wasn't
proportionately faster is because the input data wasn't evenly partitioned. For convenience, the input files
were gzipped by year, resulting in large files for later years in the dataset, when the number of weather re-
cords was much higher.
[
22
]
Hadoop Pipes is an alternative to Streaming for C++ programmers. It uses sockets to communicate with
the process running the C++ map or reduce function.
[
23
]
Alternatively, you could use “pull”-style processing in the new MapReduce API; see
Appendix D
.
[
24
]
As an alternative to Streaming, Python programmers should consider
Dumbo
,
which makes the Stream-
ing MapReduce interface more Pythonic and easier to use.