stage. Jaql queries are used to bring together the results of different local analyses and
invoke global analysis. Jaql is also used to invoke the variant generation and indexing
workflow using the outputs of local and global analyses. The indexes are periodically
copied to a different set of machines that serve user queries.
Although not part of the main workflow, ES2 periodically executes several mining
and classification tasks. Examples of this include algorithms to automatically produce
acronym libraries, regular expression libraries [6], and geo-classification rules.
12.4.2 ES2 crawler
ES2 uses Nutch version 0.9. A primary data structure in Nutch is the CrawlDB: a key-value set where the keys are the URLs known to Nutch and the value is the status of the URL. The status contains metadata about the URL, such as the time of discovery, whether it has been fetched, and so on.
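To make this concrete, the following is a minimal Java sketch of what one CrawlDB entry conceptually holds. The field names and status values here are our own illustrative choices; Nutch itself keeps this metadata in its CrawlDatum class.

import java.util.HashMap;
import java.util.Map;

// Illustrative model of a CrawlDB entry: the key is the URL, the value
// is its crawl status plus metadata. Field names are assumptions, not
// Nutch's actual representation.
public class CrawlDbEntry {
    enum Status { UNFETCHED, FETCHED, GONE }  // simplified status set

    Status status = Status.UNFETCHED;
    long discoveredAt;       // time of discovery (epoch millis)
    long lastFetchedAt;      // 0 if never fetched
    float score;             // used when selecting the top-k fetch list

    CrawlDbEntry(long discoveredAt) {
        this.discoveredAt = discoveredAt;
    }

    public static void main(String[] args) {
        // The CrawlDB is conceptually a URL -> status map.
        Map<String, CrawlDbEntry> crawlDb = new HashMap<>();
        crawlDb.put("http://example.com/",
                    new CrawlDbEntry(System.currentTimeMillis()));
        System.out.println(crawlDb.keySet());
    }
}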
Nutch is architected as a sequence of three MapReduce jobs:
Generate —In this phase, a fetch list is generated by scanning the input key-value pairs (from the CrawlDB) for URLs that have been discovered but not fetched. A common choice in generating this list is to select the top k unfetched URLs using an appropriate scoring mechanism (k is a configuration parameter in Nutch).
Fetch —In this phase, the pages associated with the URLs in the input fetch list are fetched and parsed. The output consists of the URL and the parsed representation of the page.
Update —The update phase collects all the URLs that have been discovered by
parsing the contents of the pages in the fetch phase and merges them with the
CrawlDB.
The pages fetched in each cycle of generate-fetch-update are referred to as a segment.
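The self-contained Java sketch below simulates one generate-fetch-update cycle over an in-memory CrawlDB. It illustrates the control flow only; in Nutch 0.9 each phase runs as a separate MapReduce job, and the scoring, parsing, and merge logic are far richer than shown here.

import java.util.*;
import java.util.stream.Collectors;

// Simulates Nutch's generate-fetch-update cycle over an in-memory CrawlDB.
// Purely illustrative: real Nutch runs each phase as a MapReduce job.
public class CrawlCycle {
    // CrawlDB: URL -> fetched? (real entries carry much more metadata)
    static Map<String, Boolean> crawlDb = new HashMap<>();

    // Generate: pick up to k URLs that are known but not yet fetched.
    static List<String> generate(int k) {
        return crawlDb.entrySet().stream()
                .filter(e -> !e.getValue())
                .map(Map.Entry::getKey)
                .limit(k)                       // stand-in for top-k scoring
                .collect(Collectors.toList());
    }

    // Fetch: "download" each page and return the out-links found by parsing.
    static List<String> fetch(List<String> fetchList) {
        List<String> discovered = new ArrayList<>();
        for (String url : fetchList) {
            crawlDb.put(url, true);             // mark as fetched
            discovered.add(url + "/child");     // fake out-link from parsing
        }
        return discovered;                      // this batch is one "segment"
    }

    // Update: merge newly discovered URLs back into the CrawlDB.
    static void update(List<String> discovered) {
        for (String url : discovered) {
            crawlDb.putIfAbsent(url, false);    // new URLs start unfetched
        }
    }

    public static void main(String[] args) {
        crawlDb.put("http://example.com", false);   // seed URL
        for (int cycle = 0; cycle < 3; cycle++) {
            List<String> fetchList = generate(10);
            List<String> segment = fetch(fetchList);
            update(segment);
            System.out.println("cycle " + cycle + ": db size = " + crawlDb.size());
        }
    }
}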
Out of the box, the first problem we encountered was crawl speed. Nutch's crawl rate was under three pages per second, far below what the network bandwidth available to the cluster could sustain. A deeper problem, which we encountered after a sample crawl of 80 million pages, was that the quality of discovered pages was surprisingly low. In this section, we identify the underlying reasons for both problems and describe the enhancements made to Nutch to adapt it to the IBM intranet.
MODIFICATIONS FOR PERFORMANCE
Nutch's design was aimed at web crawling. When using it to crawl the IBM intranet,
we observed multiple performance bottlenecks. We discovered that the reason for the
bottlenecks was that the enterprise intranet contains far fewer hosts than the web, and
some of the design choices made in Nutch assume a large number of distinct hosts. We
describe two ways in which this problem manifests itself, and the approach used in ES2
to adapt Nutch's design for the enterprise.
A major performance bottleneck in the fetch phase, called the long tail problem, exhibits the following behavior. The crawl rate in the early part of the fetch phase is relatively high (typically dozens of pages a second), but it deteriorates quickly to less than a page per second, where it remains until the segment completes.
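Because the intranet concentrates many pages on few hosts, and a polite fetcher waits between successive requests to the same host, a back-of-the-envelope bound suggests how such a tail can arise. The sketch below computes this bound; the delay value and host distribution are illustrative assumptions, not measurements from ES2.

// Back-of-the-envelope bound on fetch time when one host dominates a segment.
// Assumes a polite fetcher that waits a fixed delay between requests to the
// same host; the numbers below are illustrative, not ES2 measurements.
public class LongTailBound {
    public static void main(String[] args) {
        double delaySeconds = 5.0;       // assumed per-host politeness delay
        int pagesOnBiggestHost = 50_000; // assumed pages from one intranet host

        // Requests to one host are serialized by the politeness delay, so the
        // segment can finish no sooner than this, however many threads run:
        double minSeconds = pagesOnBiggestHost * delaySeconds;
        System.out.printf("lower bound on segment time: %.1f hours%n",
                          minSeconds / 3600);

        // Once the well-spread pages are done, the crawl degenerates to the
        // dominant host's serialized rate: the "long tail".
        System.out.printf("tail crawl rate: %.2f pages/second%n",
                          1 / delaySeconds);
    }
}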
 