Listing 8.4 A single NVSS birth record
# Note the year and month of the birth. Other information
# is more obfuscated.
# Easy to process... but hideous.
S 201001 7 2 2
30105 2 011 06 1 123
3405 1 06 01 2 2 0321
1006 314 2000 2 222
2 2 2 2 2 122222
11 3 094 1
M 04 200940 39072 3941
083 22 2
2 2 2
110 110 00 0000000 00
000000000 000000 000 000000000000000000011 101
1 111 1 0 1 1 1
111111 11 1
1 1 1
Extracting Relevant Information from Raw NVSS Data: Map Phase
The first step in constructing our MapReduce job is to break down our source data into smaller chunks that can then be processed independently. This is the job of our map phase. Our mapper function will be applied to every individual birth record in parallel, meaning that multiple birth records will be processed at the same time, with the degree of parallelism depending on the size of our compute cluster.
The result of our map function will be a key-value pair, representing a shard of processed data. In this case, we will simply read each record (one per line), determine the month and year of the birth, and then assign a count of “1” to the record. The key will be a string representing the month and year, and the value will be 1. At the end, the map phase will have produced thousands of key-value pairs that share the same month-and-year key.
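To make this concrete, every birth recorded in January 2010 would contribute an identical pair, as in the following sketch (the “YYYYMM” key format is an illustrative choice on our part, not something dictated by the data):

# Illustrative only: three January-2010 births each yield the same key,
# "201001", paired with a count of 1. The reduce phase can later sum the
# counts for each distinct month-and-year key.
pairs = [("201001", 1), ("201001", 1), ("201001", 1)]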
Our map phase is provided by our mapper.py script, as shown in Listing 8.5. The
resulting processed key-value pairs will be emitted as strings, with the key and value
separated by a tab character. By default, Hadoop's streaming API will treat anything
up to the first tab character as the key and the rest of the data on each line of standard
input as the value.
Listing 8.5 A mapper that outputs the month and a count of “1” for each birth record: mapper.py
#!/usr/bin/python
import sys
def read_stdin_generator(file):
    # Yield raw birth records from the given file object, one line at a time.
    for record in file:
        yield record
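The remainder of Listing 8.5 is not reproduced in this excerpt. As a rough sketch of the behavior described above (and not the book’s actual code), the mapper could continue along the following lines, building on the generator just defined. The column offsets used to slice out the “YYYYMM” field are assumptions based on the sample record in Listing 8.4; the real NVSS fixed-width layout may place the field elsewhere.

# Hypothetical continuation (not the original listing): walk the raw records
# on standard input and emit one "YYYYMM<tab>1" pair per birth.
def main():
    for record in read_stdin_generator(sys.stdin):
        # Assumption: the six-character year-and-month field appears to start
        # at column 2 in the sample record ("S 201001 ..."); adjust the slice
        # to match the documented NVSS layout.
        birth_month = record[2:8]
        print("%s\t%d" % (birth_month, 1))

if __name__ == "__main__":
    main()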
 
 