def main():
    data = read_stdin_generator(sys.stdin)
    for record in data:
        # Extract the year and month string from each line
        year = record[14:18]
        month = record[18:20]
        # Print year-month (our key) along with a count of 1,
        # tab-separated, to stdout
        print('%s-%s\t%d' % (year, month, 1))

if __name__ == "__main__":
    main()
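To see what the two slices in the mapper pull out, the short sketch below applies them to a made-up fixed-width record. The record contents and the helper name extract_key are hypothetical and used only for illustration; only the offsets 14:18 and 18:20 come from the mapper. It uses the parenthesized print form so that it runs under Python 3.

```python
def extract_key(record):
    """Return 'YYYY-MM' using the same slice offsets as the mapper."""
    year = record[14:18]
    month = record[18:20]
    return '%s-%s' % (year, month)

# Hypothetical record: 14 filler characters, then the year and month digits.
sample_record = 'X' * 14 + '201006' + 'rest of record'

# Emit the key and a count of 1, tab-separated, as the mapper does.
print('%s\t%d' % (extract_key(sample_record), 1))
```

Every input record yields one such line, so a month with many births appears many times in the mapper output, each time with the value 1.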
Counting Births per Month: The Reducer Phase
In a MapReduce job, the reducer step aggregates the values that the mapper step emitted
for each key. Before the reducer step can run, some form of sorting is required so that
all of the key-value pairs sharing a key are brought together and passed to the same
reducer function. With millions or even billions of individual key-value pairs, this
shuffling of data can be quite resource intensive. Much of the “special sauce” of the
MapReduce framework takes place in this sorting step.
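The effect of the sorting step can be pictured on a single machine with a few in-memory pairs. This is only an illustration under stated assumptions, not how a MapReduce framework implements its shuffle: the sample pairs and the helper name count_groups are hypothetical. Python's itertools.groupby merges only consecutive items with equal keys, which makes the need for sorting easy to see.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical mapper output: (key, count) pairs in arrival order.
pairs = [('2010-06', 1), ('2010-05', 1), ('2010-06', 1), ('2010-05', 1)]

def count_groups(records):
    """Return (key, run length) for each run of consecutive equal keys."""
    return [(key, len(list(group)))
            for key, group in groupby(records, itemgetter(0))]

# Without sorting, each key is split across several one-item runs.
unsorted_groups = count_groups(pairs)

# After sorting by key, every key forms exactly one run.
sorted_groups = count_groups(sorted(pairs, key=itemgetter(0)))

print(unsorted_groups)
print(sorted_groups)
```

On unsorted input the same key shows up in several separate groups; once the pairs are sorted by key, each key appears exactly once, which is the property a reducer relies on.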
In our simple example, the reducer phase is defined by the reducer.py script, which
accepts the output of the mapper.py script and aggregates the result. The reducer.py script
takes each of the key-value pairs generated in the previous step and groups them by
key. Next, the births for each distinct month are summed. Recall that the value emitted
for every year-month key was 1, so this reducer step simply sums the values for each key,
which amounts to counting how many times that key occurred. See Listing 8.6 for the
reducer code.
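The group-and-sum logic can also be tried on a few in-memory lines without piping anything through standard input. The following is a minimal sketch, assuming the input lines are already sorted by key; the function name reduce_counts and the sample lines are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def reduce_counts(lines):
    """Sum the counts per key for sorted, tab-separated key/count lines."""
    records = (line.rstrip().split('\t') for line in lines)
    return [(key, sum(int(count) for _, count in group))
            for key, group in groupby(records, itemgetter(0))]

# Hypothetical mapper output, already sorted by key.
sample_lines = ['2010-05\t1\n', '2010-05\t1\n', '2010-06\t1\n']

for key, births in reduce_counts(sample_lines):
    print('%s:%d' % (key, births))
```

Because every value is 1, the sum for each key equals the number of lines that carried that key.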
Listing 8.6 A reducer.py script featuring Python iterators for counting values
#!/usr/bin/python
import sys
from itertools import groupby
from operator import itemgetter

def read_stdin_generator(data):
    # Yield each tab-separated input line as a [key, count] list
    for record in data:
        yield record.rstrip().split('\t')

def main():
    birth_data = read_stdin_generator(sys.stdin)
    # groupby merges only consecutive records with the same key,
    # so the input must already be sorted by key
    for month, group in groupby(birth_data, itemgetter(0)):
        births = sum(int(count) for _, count in group)
        print("%s:%d" % (month, births))
 
 