Putting It Together: MapReduce Data Pipelines - Data Just Right: Introduction to Large-Scale Data and Analytics


Hadoop, we are not going to go into detail here about how to install and administer a multinode cluster. However, even if you don't currently have access to a cluster, it is possible to run streaming applications on a single machine using Hadoop in single-node, or local, mode. It is even possible to simulate a distributed environment in which the various Hadoop daemons run in separate Java processes (this is called pseudo-distributed mode). One of my favorite things about the Hadoop streaming framework is that you can test your scripts on small sets of data, much like we did previously: by piping the output of the mapper into the reducer, you can sanity check your application before scaling up.
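The local sanity check described above can be sketched in a few lines of Python. This is only an illustration of the map, sort, and reduce steps run in a single process; the names `run_local`, `length_mapper`, and `sum_reducer` are hypothetical and not part of Hadoop.

```python
from itertools import groupby
from operator import itemgetter


def run_local(mapper, reducer, lines):
    """Simulate a mapper -> sort -> reducer pipeline in one process."""
    # Map phase: each input line may yield zero or more (key, value) pairs.
    pairs = [pair for line in lines for pair in mapper(line)]
    # Shuffle phase: sort by key so that equal keys sit next to each other.
    pairs.sort(key=itemgetter(0))
    # Reduce phase: hand each key and its grouped values to the reducer.
    return [reducer(key, [value for _, value in group])
            for key, group in groupby(pairs, key=itemgetter(0))]


def length_mapper(line):
    # Hypothetical mapper: key each line by its length, with a count of 1.
    yield len(line), 1


def sum_reducer(key, values):
    # Hypothetical reducer: total the counts seen for one key.
    return key, sum(values)


sample = ["aa", "bbb", "cc", "d"]
result = run_local(length_mapper, sum_reducer, sample)
print(result)
```

With streaming scripts, the same check is usually done in the shell by piping a small input file through the mapper, then `sort`, then the reducer. The sort step matters because the reducer expects all values for a key to arrive together.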

Programming tutorials often introduce new languages by demonstrating how to display “hello world.” This usually takes just a few lines of code, if that, and is often quite useless in the overall scheme of things.

The “hello world” of MapReduce implementations is definitely the word count, an example that takes a collection of input documents and produces an overall word count for each unique word in the corpus.
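As a rough illustration of the word count idea, here is a minimal single-process sketch in Python. The function names and the tokenizing rule (lowercase runs of letters and apostrophes) are assumptions made for this example, not a definitive implementation.

```python
import re
from collections import Counter


def map_words(document):
    # Map step: emit a (word, 1) pair for every word in the document.
    # Assumption: a "word" is a lowercase run of letters or apostrophes.
    for word in re.findall(r"[a-z']+", document.lower()):
        yield word, 1


def reduce_counts(pairs):
    # Reduce step: sum the counts emitted for each distinct word.
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(documents):
    # Chain the two steps over the whole corpus.
    pairs = (pair for doc in documents for pair in map_words(doc))
    return reduce_counts(pairs)


docs = ["the quick brown fox", "the lazy dog", "The fox"]
counts = word_count(docs)
print(counts)
```

On a real cluster the map and reduce steps run on different machines and the framework groups the pairs by key between them; here a `Counter` stands in for that grouping.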

Instead of counting words, let's do something very similar but perhaps more interesting. We will use publicly available American birth statistics data to count the number of births in each month of a given year. The United States requires that information about all births be recorded. This task falls to the U.S. National Vital Statistics System (NVSS), which makes basic information about every birth recorded in the United States available. Using raw data provided by the NVSS, we will run a simple MapReduce job that counts the total number of births per month for a single year.

This task will take some processing because our NVSS source data files are a bit raw. The Centers for Disease Control and Prevention (CDC) provides each year's worth of NVSS data as a huge text file, with information about one birth on each line. The uncompressed data file containing information for babies born in 2010 is nearly three gigabytes and contains information for over four million births.

Let's take a look at a single year: 2010. According to the user guide for the 2010 NVSS dataset,¹ each birth is recorded as a single, ugly, 755-character line (see Listing 8.4 for an example of a single NVSS birth record). An individual record contains all kinds of coded information about the birth, such as birthday, weight, and whether or not the child was a twin. It also contains risk factors involved in the pregnancy, such as information about maternal smoking. Although some of the information requires the user guide to decipher, one piece of information that is easy to pick out of the raw file is the year and month of the birth. Let's start by extracting this value from each birth record as part of our MapReduce job.
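To make the extraction step concrete, here is a hedged Python sketch that slices a year and month out of fixed-width records and totals births per month. The offsets `YEAR_START` and `MONTH_START` are placeholders chosen for illustration; the real column positions must be taken from the NVSS user guide. The helper `fake_record` builds synthetic 755-character lines and does not reproduce real NVSS data.

```python
from collections import Counter

# ASSUMPTION: zero-based column offsets are placeholders for illustration.
# Look up the real positions in the NVSS user guide before using this.
YEAR_START = 14
MONTH_START = 18


def map_record(record, year_start=YEAR_START, month_start=MONTH_START):
    # Map step: slice the 4-digit year and 2-digit month out of one line.
    year = record[year_start:year_start + 4]
    month = record[month_start:month_start + 2]
    # Skip truncated or malformed lines instead of emitting a bad key.
    if len(year) == 4 and len(month) == 2 and year.isdigit() and month.isdigit():
        yield f"{year}-{month}", 1


def count_births(records):
    # Reduce step: total the emitted counts for each year-month key.
    totals = Counter()
    for record in records:
        for key, value in map_record(record):
            totals[key] += value
    return dict(sorted(totals.items()))


def fake_record(year, month, width=755):
    # Hypothetical helper: a synthetic fixed-width line for testing only.
    body = " " * YEAR_START + f"{year:04d}{month:02d}"
    return body.ljust(width)


records = [fake_record(2010, 1), fake_record(2010, 1),
           fake_record(2010, 12), "too short"]
monthly = count_births(records)
print(monthly)
```

Keeping the offsets as parameters makes it easy to correct them once the user guide has been checked, and the validity test keeps a short or corrupted line from polluting the counts.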

1. ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Dataset_Documentation/DVS/natality/UserGuide2010.pdf
