A quick examination revealed that this behavior is heavily influenced by the host with
the largest number of URLs in the fetch list. You can understand this by observing
that the fetch rate in Nutch is controlled by two parameters: the number of distinct
hosts in the fetch list that Nutch can concurrently crawl from, and the duration
for which Nutch waits before making consecutive requests to the same host. A
straightforward solution to the long tail problem is to restrict the number of URLs
for a particular host in the fetch list. Unfortunately, this is not sufficient because not
all host servers are identical, and the time required to fetch the same number of
pages from different hosts can be dramatically different. As an engineering fix, we
added a time-shutoff parameter that terminates the fetcher after a fixed amount of
time. While this terminates the fetch phase early (and fewer pages
are retrieved in total in the segment), by avoiding the slow tail phase, we sustain a
higher average crawl rate. In practice, we observed that by appropriately setting this
shutoff parameter, the average crawl rate could be improved to nearly three times the
original crawl rate. Ideally, the current fetch rate should determine such a shutoff;
this unfortunately requires pooling information across map tasks and can't easily be
performed in Hadoop today.
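The shutoff described above can be sketched as a simple wall-clock check inside the fetch loop. This is a minimal illustration with hypothetical names, not Nutch's actual fetcher code: once the elapsed time exceeds the configured shutoff, the loop exits and the remaining (slow-tail) URLs in the segment are skipped.

```java
import java.util.List;
import java.util.concurrent.TimeUnit;

// Sketch of a time-limited fetch loop (hypothetical names, not Nutch's API).
// Trading segment completeness for a higher average crawl rate: stop early
// rather than spend most of the phase on a few slow hosts.
class TimeLimitedFetcher {
    private final long shutoffMillis;

    TimeLimitedFetcher(long shutoffMinutes) {
        this.shutoffMillis = TimeUnit.MINUTES.toMillis(shutoffMinutes);
    }

    /** Returns the number of URLs fetched before the shutoff fired. */
    int run(List<String> fetchList) {
        long start = System.currentTimeMillis();
        int fetched = 0;
        for (String url : fetchList) {
            if (System.currentTimeMillis() - start >= shutoffMillis) {
                break; // terminate the fetch phase early: skip the slow tail
            }
            fetch(url);
            fetched++;
        }
        return fetched;
    }

    private void fetch(String url) {
        // placeholder for the actual HTTP fetch
    }
}
```

Ideally, as the text notes, the shutoff would be driven by the observed fetch rate rather than a fixed duration, but that requires coordination across map tasks.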
A main-memory data structure in the fetcher causes a different performance
bottleneck. The fetcher works by first creating a set of queues where each queue stores
URLs for a particular host—we call this data structure FetchQueues. A fixed amount of
memory is allocated to FetchQueues to be shared across the individual queues. The
fetcher reads the URLs to be fetched from its input and inserts them into FetchQueues
until it exhausts the allocated memory. Worker threads assigned to each queue in
FetchQueues concurrently fetch pages from different hosts as long as their queues are
non-empty. The bottleneck arises because URLs in the input are ordered by host (this
is an artifact of the generate phase) and the fetcher exhausts the memory allocated to
FetchQueues with URLs from very few hosts. Such a design is appropriate for crawling
a large number of hosts on the web as each host in the fetch list would then have only
a few URLs. In the enterprise, host diversity is limited to a few thousand at best. As a
result, few worker threads are actively fetching from FetchQueues, leading to severe
under-utilization of resources. We address this problem by replacing FetchQueues with
a disk-based data structure without any limits on the total size. This allows the fetcher to
populate FetchQueues with all the URLs in the input, thereby keeping the maximum
possible number of worker threads active. This simple change improved the fetch rate
several fold.
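The per-host queue layout can be sketched as follows. This is a simplified illustration (hypothetical method names): with no cap on total size, every host in the input gets a populated queue, so the number of concurrently active worker threads is bounded by host diversity rather than by how many hosts happen to fit in a fixed memory budget. A production version would spill queue contents to disk rather than hold everything in memory.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch of unbounded per-host fetch queues (hypothetical names).
// Each host maps to its own queue; worker threads drain one queue each,
// so the count of non-empty queues is the ceiling on active workers.
class FetchQueues {
    private final Map<String, Queue<String>> queuesByHost = new ConcurrentHashMap<>();

    void add(String host, String url) {
        queuesByHost.computeIfAbsent(host, h -> new ConcurrentLinkedQueue<>()).add(url);
    }

    /** Hosts with pending URLs == maximum concurrently active workers. */
    long activeQueues() {
        return queuesByHost.values().stream().filter(q -> !q.isEmpty()).count();
    }

    String poll(String host) {
        Queue<String> q = queuesByHost.get(host);
        return q == null ? null : q.poll();
    }
}
```

Under the original memory-capped design, host-ordered input meant only the first few hosts' queues were ever populated; removing the size limit keeps every host's queue (and therefore every worker) active.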
12.4.3 ES2 analytics
Much of the complexity and power in ES2 lies in its analytics. In this section, we briefly
describe the different algorithms, paying special attention to the design choices made
in mapping these algorithms onto Hadoop.
LOCAL ANALYSIS
In local analysis, each page is individually analyzed to extract clues that help decide
whether that page is a candidate navigational page. In ES2, five different local analysis