Table 1.1 Words with a count higher than 4 in the 2002 State of the Union Address (continued)

camps (8)      in (79)    opportunity (5)   this (28)        you (12)
can (7)        is (44)    or (8)            thousands (5)
children (6)   it (21)    our (78)          time (7)
We see that 128 words have a frequency count greater than 4. Many of these words
appear frequently in almost any English text, for example a (69), and (210),
i (29), in (79), the (184), and many others. We also see words that summarize the issues
facing the United States at that time: terror (13), terrorist (12), terrorists (10), security
(19), weapons (12), destruction (5), afghanistan (10), freedom (10), jobs (11), budget (7),
and many others.
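Counts like these are easy to produce for a single document. As a point of reference, here is a minimal word-frequency sketch in plain Java; it is not a listing from this book, and it assumes the address has been saved to a local file, hypothetically named speech.txt. It lowercases the text, splits on runs of non-letters, and prints every word that appears more than 4 times:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

public class WordFreq {
    public static void main(String[] args) throws IOException {
        // Read the whole speech into one string ("speech.txt" is an assumed name).
        String text = new String(
                Files.readAllBytes(Paths.get("speech.txt")), StandardCharsets.UTF_8);

        // Lowercase, split on runs of non-letter characters (punctuation and
        // apostrophes act as separators), and tally each word. A TreeMap keeps
        // the output in alphabetical order, as in table 1.1.
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : text.toLowerCase().split("[^a-z]+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }

        // Print only the words with a count higher than 4.
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > 4) {
                System.out.println(e.getKey() + " (" + e.getValue() + ")");
            }
        }
    }
}

A single speech is trivial to count this way on one machine; the challenge taken up in the history that follows is doing this kind of processing at the scale of billions of web pages.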
1.7 History of Hadoop
Hadoop started out as a subproject of Nutch, which in turn was a subproject of Apache
Lucene. Doug Cutting founded all three projects, and each project was a logical
progression of the previous one.
Lucene is a full-featured text indexing and searching library. Given a text collection,
a developer can easily add search capability to the documents using the Lucene engine.
Desktop search, enterprise search, and many domain-specific search engines have been
built using Lucene. Nutch is the most ambitious extension of Lucene. It tries to build
a complete web search engine using Lucene as its core component. Nutch has parsers
for HTML, a web crawler, a link-graph database, and the other components necessary
for a web search engine. Doug Cutting envisioned Nutch as an open, democratic
alternative to the proprietary technologies in commercial offerings such as Google's.
Besides added components such as a crawler and a parser, a web search engine
differs from a basic document search engine in terms of scale. Whereas Lucene is
targeted at indexing millions of documents, Nutch should be able to handle billions of
web pages without becoming exorbitantly expensive to operate. Nutch will have to run
on a distributed cluster of commodity hardware. The challenge for the Nutch team
is to address scalability issues in software. Nutch needs a layer to handle distributed
processing, redundancy, automatic failover, and load balancing. These challenges are
by no means trivial.
Around 2004, Google published two papers describing the Google File System (GFS)
and the MapReduce framework. Google claimed to use these two technologies for
scaling its own search system. Doug Cutting immediately saw the applicability of these
technologies to Nutch, and his team implemented the new framework and ported
Nutch to it. The new implementation immediately boosted Nutch's scalability. It started
to handle several hundred million web pages and could run on clusters of dozens of
nodes. Doug realized that a dedicated project to flesh out the two technologies was
needed to get to web scale, and Hadoop was born. Yahoo! hired Doug in January 2006.