A quick examination revealed that this behavior is heavily influenced by the host with
the largest number of URLs in the fetch list. You can understand this by observing
that the fetch rate in Nutch is controlled by two parameters: the number of distinct
hosts in the fetch list that Nutch can concurrently crawl from, and the duration
for which Nutch waits before making consecutive requests to the same host. A
straightforward solution to the long tail problem is to restrict the number of URLs
for a particular host in the fetch list. Unfortunately, this is not sufficient because not
all host servers are identical, and the time required to fetch the same number of
pages from different hosts can be dramatically different. As an engineering fix, we
added a time-shutoff parameter that terminates the fetcher after a fixed amount of
time. While this terminates the fetch phase early (and fewer pages
are retrieved in total in the segment), by avoiding the slow tail phase, we sustain a
higher average crawl rate. In practice, we observed that by appropriately setting this
shutoff parameter, the average crawl rate could be improved to nearly three times the
original crawl rate. Ideally, the current fetch rate should determine such a shutoff;
this unfortunately requires pooling information across map tasks and can't easily be
performed in Hadoop today.
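The shutoff described above can be sketched as a simple wall-clock check inside the fetch loop. This is a minimal illustration with hypothetical names, not Nutch's actual fetcher code: once the elapsed time exceeds the configured shutoff, the loop exits and the remaining (slow-tail) URLs in the segment are skipped.

```java
import java.util.List;
import java.util.concurrent.TimeUnit;

// Sketch of a time-limited fetch loop (hypothetical names, not Nutch's API).
// Trading segment completeness for a higher average crawl rate: stop early
// rather than spend most of the phase on a few slow hosts.
class TimeLimitedFetcher {
    private final long shutoffMillis;

    TimeLimitedFetcher(long shutoffMinutes) {
        this.shutoffMillis = TimeUnit.MINUTES.toMillis(shutoffMinutes);
    }

    /** Returns the number of URLs fetched before the shutoff fired. */
    int run(List<String> fetchList) {
        long start = System.currentTimeMillis();
        int fetched = 0;
        for (String url : fetchList) {
            if (System.currentTimeMillis() - start >= shutoffMillis) {
                break; // terminate the fetch phase early: skip the slow tail
            }
            fetch(url);
            fetched++;
        }
        return fetched;
    }

    private void fetch(String url) {
        // placeholder for the actual HTTP fetch
    }
}
```

Ideally, as the text notes, the shutoff would be driven by the observed fetch rate rather than a fixed duration, but that requires coordination across map tasks.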
A main-memory data structure in the fetcher causes a different performance
bottleneck. The fetcher works by first creating a set of queues where each queue stores
URLs for a particular host—we call this data structure FetchQueues. A fixed amount of
memory is allocated to FetchQueues to be shared across the individual queues. The
fetcher reads the URLs to be fetched from its input and inserts them into FetchQueues
until it exhausts the allocated memory. Worker threads assigned to each queue in
FetchQueues concurrently fetch pages from different hosts as long as their queues are
non-empty. The bottleneck arises because URLs in the input are ordered by host (this
is an artifact of the generate phase) and the fetcher exhausts the memory allocated to
FetchQueues with URLs from very few hosts. Such a design is appropriate for crawling
a large number of hosts on the web as each host in the fetch list would then have only
a few URLs. In the enterprise, host diversity is limited to a few thousand at best. As a
result, few worker threads are actively fetching from FetchQueues, leading to severe
under-utilization of resources. We address this problem by replacing FetchQueues with
a disk-based data structure without any limits on the total size. This allows the fetcher to
populate FetchQueues with all the URLs in the input, thereby keeping the maximum
possible number of worker threads active. This simple change improved the fetch rate
several fold.
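The per-host queue layout can be sketched as follows. This is a simplified illustration (hypothetical method names): with no cap on total size, every host in the input gets a populated queue, so the number of concurrently active worker threads is bounded by host diversity rather than by how many hosts happen to fit in a fixed memory budget. A production version would spill queue contents to disk rather than hold everything in memory.

```java
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Sketch of unbounded per-host fetch queues (hypothetical names).
// Each host maps to its own queue; worker threads drain one queue each,
// so the count of non-empty queues is the ceiling on active workers.
class FetchQueues {
    private final Map<String, Queue<String>> queuesByHost = new ConcurrentHashMap<>();

    void add(String host, String url) {
        queuesByHost.computeIfAbsent(host, h -> new ConcurrentLinkedQueue<>()).add(url);
    }

    /** Hosts with pending URLs == maximum concurrently active workers. */
    long activeQueues() {
        return queuesByHost.values().stream().filter(q -> !q.isEmpty()).count();
    }

    String poll(String host) {
        Queue<String> q = queuesByHost.get(host);
        return q == null ? null : q.poll();
    }
}
```

Under the original memory-capped design, host-ordered input meant only the first few hosts' queues were ever populated; removing the size limit keeps every host's queue (and therefore every worker) active.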
12.4.3 ES2 analytics
Much of the complexity and power in ES2 lies in its analytics. In this section, we briefly
describe the different algorithms, paying special attention to the design choices made
in mapping these algorithms onto Hadoop.
LOCAL ANALYSIS
In local analysis, each page is individually analyzed to extract clues that help decide
whether that page is a candidate navigational page. In ES2, five different local analysis