We assume you already have a basic grasp of Hadoop. You can set up Hadoop,
and you have compiled and run an example program, such as the word count
example from chapter 1. Let's work through examples that use a real-world data set.
4.1 Getting the patent data set
To do anything meaningful with Hadoop we need data. Many of our examples will use
two patent data sets, both of which are available from the National Bureau of Economic
Research (NBER) at http://www.nber.org/patents/ . The data sets were originally
compiled for the paper “The NBER Patent Citation Data File: Lessons, Insights and
Methodological Tools.” 1 We use the citation data set cite75_99.txt and the patent
description data set apat63_99.txt .
NOTE The data sets are approximately 250 MB each, which is small enough
to make our examples runnable in Hadoop's standalone or pseudo-distributed
mode. You can practice writing MapReduce programs using them even when
you don't have access to a live cluster. The best part of Hadoop is that you
can be fairly sure your MapReduce program will run on clusters of machines
processing data sets 100 or 1,000 times larger with virtually no code changes.
A popular development tactic is to create a smaller, sampled subset of your
large production data and call it the development data set. This development
data set may be only several hundred megabytes. You develop your
program in standalone or pseudo-distributed mode with the development
data set. This gives your development process a fast turnaround time, the
convenience of running on your own machine, and an isolated environment
for debugging.
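The sampling tactic in the note can be tried outside Hadoop with a few lines of script. The following is a minimal sketch in Python, not part of the book's own code: the function name `sample_lines`, its parameters, and the file names in the usage comment are all hypothetical. It assumes the production data is a line-oriented text file whose first line may be a header worth keeping.

```python
import random


def sample_lines(src_path, dst_path, rate, seed=0, keep_header=True):
    """Copy roughly a `rate` fraction of the lines in src_path to dst_path.

    Hypothetical helper for building a development data set. Each line is
    kept independently with probability `rate`; the first line is always
    kept when keep_header is True. Returns the number of lines written.
    """
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    kept = 0
    with open(src_path, "r", encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for i, line in enumerate(src):
            # Stream line by line so the full file is never held in memory.
            if (keep_header and i == 0) or rng.random() < rate:
                dst.write(line)
                kept += 1
    return kept


# Example usage (file names are placeholders, not real paths):
# sample_lines("production.txt", "development.txt", rate=0.01)
```

Because each line is kept independently, the output size is only approximately `rate` times the input size. Note also that sampling rows of a citation file this way drops edges at random, so graph-level statistics computed on the sample will not match the full data.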
We have chosen these two data sets for our example programs because they're
similar to most data types you'll encounter. First of all, the citation data encodes a
graph in the same vein that web links and social networks are also graphs. Patents
are published in chronological order; some of their properties resemble time series.
Each patent is linked with a person (inventor) and a location (country of inventor).
You can view them as personal or geographical data. Finally, you can look at the
data as generic database relations with well-defined schemas, in a simple
comma-separated format. 2
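To make the graph view concrete, here is a minimal sketch in Python that reads comma-separated pairs and builds an adjacency list. It assumes each row holds a citing patent followed by a cited patent and that the first row is a header. The function names, the header text, and the sample rows are made up for illustration and are not taken from the actual data files.

```python
import csv
from collections import defaultdict


def build_citation_graph(lines, has_header=True):
    """Return {citing: [cited, ...]} from comma-separated pairs.

    Assumes two columns per row (citing, cited); rows with any other
    number of columns are skipped.
    """
    graph = defaultdict(list)
    for i, row in enumerate(csv.reader(lines)):
        if has_header and i == 0:
            continue  # skip the assumed header row
        if len(row) != 2:
            continue  # ignore malformed rows
        citing, cited = (field.strip() for field in row)
        graph[citing].append(cited)
    return dict(graph)


def citation_counts(graph):
    """Count how many times each patent is cited (its in-degree)."""
    counts = defaultdict(int)
    for cited_list in graph.values():
        for cited in cited_list:
            counts[cited] += 1
    return dict(counts)


# Hypothetical rows for illustration only; not real patent numbers.
sample = ["citing,cited", "100,200", "100,300", "400,200"]
graph = build_citation_graph(sample)
counts = citation_counts(graph)
```

Each row is one directed edge, so the same file can be read either as a two-column relation or as an edge list. Counting in-degree, as `citation_counts` does, is the kind of per-key aggregation that maps naturally onto a MapReduce job.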
4.1.1 The patent citation data
The patent citation data set contains citations from U.S. patents issued between 1975 and
1999. It has more than 16 million rows and the first few lines resemble the following:
1 NBER Working Paper 8498, by Hall, B. H., A. B. Jaffe, and M. Trajtenberg (2001).
2 There are more common data types than two data sets can possibly represent. An important one that's
missing here is text, but you've already seen text used in the word count example. Other missing types
include XML, image, and geolocation (the lat-long variety). Mathematical matrices are not represented in
general, although the citation graph can be interpreted as a sparse 0/1 matrix.
 