Writing basic MapReduce programs - Hadoop in Action


We assume you already have a basic grasp of Hadoop: you can set it up, and you have compiled and run an example program, such as the word count from chapter 1. Let's now work through examples that use a real-world data set.

4.1 Getting the patent data set

To do anything meaningful with Hadoop we need data. Many of our examples will use two patent data sets, both available from the National Bureau of Economic Research (NBER) at http://www.nber.org/patents/ . The data sets were originally compiled for the paper “The NBER Patent Citation Data File: Lessons, Insights and Methodological Tools.” 1 We use the citation data set cite75_99.txt and the patent description data set apat63_99.txt .

NOTE The data sets are approximately 250 MB each, small enough to make our examples runnable in Hadoop's standalone or pseudo-distributed mode. You can practice writing MapReduce programs with them even when you don't have access to a live cluster. The best part of Hadoop is that you can be fairly sure your MapReduce program will run on clusters of machines processing data sets 100 or 1,000 times larger with virtually no code changes.

A popular development tactic is to create a smaller, sampled subset of your large production data and call it the development data set. This development data set may be only several hundred megabytes. You develop your program in standalone or pseudo-distributed mode with the development data set. This gives your development process a fast turnaround time, the convenience of running on your own machine, and an isolated environment for debugging.
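One way to build such a development data set is to keep each input line with a fixed probability. The sketch below is only an illustration, not a method taken from the text: the function names `sample_lines` and `sample_file`, the `fraction` and `seed` parameters, and the assumption that the first line is a header worth keeping are all hypothetical choices.

```python
import random


def sample_lines(lines, fraction, seed=0, keep_header=True):
    """Yield roughly `fraction` of the lines, each chosen independently.

    Assumption: when keep_header is True, the first line is a header row
    and is always kept so the sample stays self-describing.
    """
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    for index, line in enumerate(lines):
        if keep_header and index == 0:
            yield line
        elif rng.random() < fraction:
            yield line


def sample_file(source_path, target_path, fraction, seed=0):
    """Stream source_path into target_path, keeping a random subset of lines."""
    with open(source_path, "r", encoding="utf-8") as source, \
         open(target_path, "w", encoding="utf-8") as target:
        # Lines read from a file keep their trailing newline, so they can be
        # written back unchanged.
        target.writelines(sample_lines(source, fraction, seed))
```

Because the input is streamed line by line, the production file never has to fit in memory. A call such as `sample_file("production.txt", "development.txt", 0.01)` (both file names are placeholders) would keep about 1 percent of the lines.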

We have chosen these two data sets for our example programs because they're similar to most data types you'll encounter. First of all, the citation data encodes a graph, in the same vein that web links and social networks are also graphs. Patents are published in chronological order; some of their properties resemble time series. Each patent is linked with a person (inventor) and a location (country of inventor). You can view them as personal or geographical data. Finally, you can look at the data as generic database relations with well-defined schemas, in a simple comma-separated format. 2
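To make the graph view concrete, the following sketch reads comma-separated pairs into an adjacency list. It rests on assumptions the text does not spell out: that each row holds exactly two numeric patent IDs (citing patent first, cited patent second) and that any non-numeric row, such as a header, can be skipped. The function name `build_citation_graph` and the sample rows are made up for illustration and are not real patent numbers.

```python
import csv
from collections import defaultdict


def build_citation_graph(lines):
    """Map each citing patent ID to the list of patent IDs it cites.

    Assumption: rows look like `citing,cited`. Rows that do not contain
    exactly two numeric fields (for example a header row) are skipped.
    """
    graph = defaultdict(list)
    for row in csv.reader(lines):
        if len(row) != 2:
            continue
        citing, cited = (field.strip() for field in row)
        if not (citing.isdigit() and cited.isdigit()):
            continue
        graph[citing].append(cited)
    return dict(graph)


# Made-up rows for illustration only.
sample_rows = ['"CITING","CITED"', "1,2", "1,3", "2,3"]
citation_graph = build_citation_graph(sample_rows)
```

Each key is a node and each list holds its outgoing edges, which is the same shape you would use for web links or a social network's "follows" relation.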

4.1.1 The patent citation data

The patent citation data set contains citations from U.S. patents issued between 1975 and 1999. It has more than 16 million rows, and the first few lines resemble the following:

1 NBER Working Paper 8498, by Hall, B. H., A. B. Jaffe, and M. Trajtenberg (2001).

2 There are more common data types than two data sets can possibly represent. An important one that's missing here is text, but you've already seen text used in the word count example. Other missing types include XML, image, and geolocation (the lat-long variety). Mathematical matrices are not represented in general, although the citation graph can be interpreted as a sparse 0/1 matrix.
