Table 1.1 Words with a count higher than 4 in the 2002 State of the Union Address (continued)

camps (8)      in (79)    opportunity (5)   this (28)        you (12)
can (7)        is (44)    or (8)            thousands (5)
children (6)   it (21)    our (78)          time (7)
We see that 128 words have a frequency count greater than 4. Many of these words
appear frequently in almost any English text, for example a (69), and (210),
i (29), in (79), the (184), and many others. We also see words that summarize the issues
facing the United States at that time: terror (13), terrorist (12), terrorists (10), security
(19), weapons (12), destruction (5), afghanistan (10), freedom (10), jobs (11), budget (7),
and many others.
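Counts like these are easy to produce for a single document. As a point of reference, here is a minimal word-frequency sketch in plain Java; it is not a listing from this book, and it assumes the address has been saved to a local file, hypothetically named speech.txt. It lowercases the text, splits on runs of non-letters, and prints every word that appears more than 4 times:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

public class WordFreq {
    public static void main(String[] args) throws IOException {
        // Read the whole speech into one string ("speech.txt" is an assumed name).
        String text = new String(
                Files.readAllBytes(Paths.get("speech.txt")), StandardCharsets.UTF_8);

        // Lowercase, split on runs of non-letter characters (punctuation and
        // apostrophes act as separators), and tally each word. A TreeMap keeps
        // the output in alphabetical order, as in table 1.1.
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : text.toLowerCase().split("[^a-z]+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }

        // Print only the words with a count higher than 4.
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > 4) {
                System.out.println(e.getKey() + " (" + e.getValue() + ")");
            }
        }
    }
}

A single speech is trivial to count this way on one machine; the challenge taken up in the history that follows is doing this kind of processing at the scale of billions of web pages.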
1.7 History of Hadoop
Hadoop started out as a subproject of Nutch, which in turn was a subproject of Apache
Lucene. Doug Cutting founded all three projects, and each project was a logical
progression of the previous one.
Lucene is a full-featured text indexing and searching library. Given a text collection,
a developer can easily add search capability to the documents using the Lucene engine.
Desktop search, enterprise search, and many domain-specific search engines have been
built using Lucene. Nutch is the most ambitious extension of Lucene. It tries to build
a complete web search engine using Lucene as its core component. Nutch has parsers
for HTML, a web crawler, a link-graph database, and the other components necessary
for a web search engine. Doug Cutting envisioned Nutch as an open, democratic
alternative to the proprietary technologies in commercial offerings such as Google's.
Besides added components such as a crawler and a parser, a web search engine
differs from a basic document search engine in terms of scale. Whereas Lucene is
targeted at indexing millions of documents, Nutch should be able to handle billions of
web pages without becoming exorbitantly expensive to operate. Nutch will have to run
on a distributed cluster of commodity hardware. The challenge for the Nutch team
is to address scalability issues in software. Nutch needs a layer to handle distributed
processing, redundancy, automatic failover, and load balancing. These challenges are
by no means trivial.
Around 2004, Google published two papers describing the Google File System (GFS)
and the MapReduce framework. Google claimed to use these two technologies for
scaling its own search system. Doug Cutting immediately saw the applicability of these
technologies to Nutch, and his team implemented the new framework and ported
Nutch to it. The new implementation immediately boosted Nutch's scalability. It started
to handle several hundred million web pages and could run on clusters of dozens of
nodes. Doug realized that a dedicated project to flesh out the two technologies was
needed to get to web scale, and Hadoop was born. Yahoo! hired Doug in January 2006.