Listing 8.4 A single NVSS birth record
# Note the year and month of the birth. Other information
# is more obfuscated.
# Easy to process... but hideous.
S 201001 7 2 2
30105 2 011 06 1 123
3405 1 06 01 2 2 0321
1006 314 2000 2 222
2 2 2 2 2 122222
11 3 094 1
M 04 200940 39072 3941
083 22 2
2 2 2
110 110 00 0000000 00
000000000 000000 000 000000000000000000011 101
1 111 1 0 1 1 1
111111 11 1
1 1 1
Extracting Relevant Information from Raw NVSS Data: Map Phase
The first step in constructing our MapReduce job is to break down our source data into smaller chunks that can then be processed independently. This is the job of our map phase. Our mapper function will be applied to every individual birth record in parallel, meaning that multiple birth records will be processed at the same time, with the degree of parallelism depending on the size of our compute cluster.
The result of our map function will be a key-value pair, representing a shard of processed data. In this case, we will simply read each record (one per line), determine the month and year of the birth, and then assign a count of “1” to the record. The key will be a string representing the month and year, and the value will be 1. At the end, the map phase will have produced thousands of key-value pairs that share the same month-and-year key.
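To make this concrete, every birth recorded in January 2010 would contribute an identical pair, as in the following sketch (the “YYYYMM” key format is an illustrative choice on our part, not something dictated by the data):

# Illustrative only: three January-2010 births each yield the same key,
# "201001", paired with a count of 1. The reduce phase can later sum the
# counts for each distinct month-and-year key.
pairs = [("201001", 1), ("201001", 1), ("201001", 1)]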
Our map phase is provided by our mapper.py script, as shown in Listing 8.5. The
resulting processed key-value pairs will be emitted as strings, with the key and value
separated by a tab character. By default, Hadoop's streaming API will treat anything
up to the first tab character as the key and the rest of the data on each line of standard
input as the value.
Listing 8.5 A mapper that outputs the month and a count of “1” for each birth record: mapper.py
#!/usr/bin/python
import sys
def read_stdin_generator(file):
    # Yield raw birth records from the given file object, one line at a time.
    for record in file:
        yield record
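The remainder of Listing 8.5 is not reproduced in this excerpt. As a rough sketch of the behavior described above (and not the book’s actual code), the mapper could continue along the following lines, building on the generator just defined. The column offsets used to slice out the “YYYYMM” field are assumptions based on the sample record in Listing 8.4; the real NVSS fixed-width layout may place the field elsewhere.

# Hypothetical continuation (not the original listing): walk the raw records
# on standard input and emit one "YYYYMM<tab>1" pair per birth.
def main():
    for record in read_stdin_generator(sys.stdin):
        # Assumption: the six-character year-and-month field appears to start
        # at column 2 in the sample record ("S 201001 ..."); adjust the slice
        # to match the documented NVSS layout.
        birth_month = record[2:8]
        print("%s\t%d" % (birth_month, 1))

if __name__ == "__main__":
    main()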
 
 