Hadoop, we are not going to go into detail here about how to install and administer
a multinode cluster. However, even if you don't currently have access to a cluster, it is
possible to run versions of streaming applications on a single machine using Hadoop
in single-node, or local, mode. It is even possible to simulate a distributed environment
in which various Hadoop daemons are run in separate Java processes (this is called
pseudo-distributed mode). One of my favorite things about the Hadoop streaming
framework is that you can test your scripts on small sets of data much like we did
previously; by piping the output of the mapper into the reducer, you can sanity check
your application before scaling up.
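The local sanity check described above can also be sketched in plain Python. This is only an illustration under stated assumptions, not part of Hadoop: the names simulate_streaming, length_mapper, and sum_reducer are hypothetical, and the call to sorted() stands in for the sort that Hadoop performs between the map and reduce phases.

```python
from itertools import groupby


def length_mapper(lines):
    """Hypothetical mapper: emit (line length, 1) for each input line."""
    for line in lines:
        yield len(line.rstrip("\n")), 1


def sum_reducer(pairs):
    """Hypothetical reducer: sum the values of each run of equal keys."""
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield key, sum(value for _, value in group)


def simulate_streaming(mapper, reducer, lines):
    """Mimic `cat input | mapper | sort | reducer` on one machine."""
    mapped = mapper(lines)
    shuffled = sorted(mapped)  # stands in for Hadoop's sort between phases
    return list(reducer(shuffled))


if __name__ == "__main__":
    # Tiny in-memory sample instead of a real input file.
    sample = ["aa\n", "b\n", "cc\n", "d\n", "eee\n"]
    for key, total in simulate_streaming(length_mapper, sum_reducer, sample):
        print(f"{key}\t{total}")
```

Because the reducer only ever sees sorted input, the same two functions behave the same way on a five-line sample as they would on a full dataset, which is what makes this kind of small-scale test useful.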
A One-Step MapReduce Transformation
Programming tutorials often introduce new languages by demonstrating how to display
“hello world.” This usually takes just a few lines of code, if that, and is often quite
useless in the overall scheme of things.
The “hello world” of MapReduce implementations is definitely the word count,
an example that takes a collection of input documents and produces an overall
count for each unique word in the corpus.
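For readers who have not seen it, a word count in the streaming style might look roughly like the sketch below. It is a simplified illustration rather than a canonical implementation: the function names map_words and reduce_counts are hypothetical, the tokenization is a naive whitespace split, and the tab-separated key and value follow the usual streaming convention.

```python
from itertools import groupby


def map_words(lines):
    """Emit one 'word<TAB>1' line for every word in the input."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"


def reduce_counts(sorted_lines):
    """Sum the counts for each word; input must already be sorted by word."""
    parsed = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda fields: fields[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


if __name__ == "__main__":
    # Local check only: on a cluster the mapper and reducer would be
    # separate scripts, with Hadoop sorting the mapper output in between.
    documents = ["the cat\n", "The dog and the cat\n"]
    for result in reduce_counts(sorted(map_words(documents))):
        print(result)
```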
Instead of counting words, let's do something very similar but perhaps more interesting.
We will use publicly available American birth statistics data to count the number
of births in any given year. The United States requires that information about all
births be recorded. This task falls to the U.S. National Vital Statistics System (NVSS),
which makes basic information about every birth recorded in the United States available.
Using raw data provided by the NVSS, we will run a simple MapReduce job that
counts the total number of births per month for a single year.
This task will take a little processing because our NVSS source data files are
fairly raw. The Centers for Disease Control and Prevention (CDC) provides each year's
worth of NVSS data as a huge text file, with information about one birth on each line.
The uncompressed data file containing information for babies born in 2010 is nearly
three gigabytes and contains information for over four million births.
Let's take a look at a single year: 2010. According to the user guide for the 2010
NVSS dataset,[1] each birth is recorded as a single, ugly, 755-character line (see Listing
8.4 for an example of a single NVSS birth record). An individual record contains all
kinds of coded information about the birth, such as birthday, weight, and whether or
not the child was a twin. It also contains risk factors involved in the pregnancy, such as
information about maternal smoking. Although some of the information requires the
user guide to decipher, one piece of information that is easy to pick out of the raw file
is the year and month of the birth. Let's start by extracting this value out of each birth
record as part of our MapReduce job.
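The extraction step can be sketched as a mapper that slices fixed character positions out of each record. The offsets used here are placeholders chosen for a synthetic record, not values taken from the NVSS user guide, and the names extract_year_month and map_births are hypothetical; the real column positions would need to be looked up in the guide.

```python
def extract_year_month(record, year_offset, month_offset):
    """Slice a 4-digit year and 2-digit month out of a fixed-width record.

    Both offsets are zero-based and must be supplied by the caller.
    """
    year = record[year_offset:year_offset + 4]
    month = record[month_offset:month_offset + 2]
    return f"{year}-{month}"


def map_births(records, year_offset, month_offset):
    """Emit 'YYYY-MM<TAB>1' for every birth record."""
    for record in records:
        key = extract_year_month(record, year_offset, month_offset)
        yield f"{key}\t1"


if __name__ == "__main__":
    # ASSUMPTION: offsets 10 and 14 are made up for this demonstration;
    # they are not the documented positions in the NVSS file layout.
    YEAR_OFFSET, MONTH_OFFSET = 10, 14

    # Synthetic records built for this example; they are not real NVSS data.
    fake_records = [
        "x" * 10 + "201007" + "x" * 5,
        "x" * 10 + "201012" + "x" * 5,
    ]
    for line in map_births(fake_records, YEAR_OFFSET, MONTH_OFFSET):
        print(line)
```

Emitting the year and month together as the key means a summing reducer produces one total per month without any further changes.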
1. ftp://ftp.cdc.gov/pub/Health_Statistics/NCHS/Dataset_Documentation/DVS/natality/UserGuide2010.pdf
 
 