stage. Jaql queries are used to bring together the results of different local analyses and
invoke global analysis. Jaql is also used to invoke the variant generation and indexing
workflow using the outputs of local and global analyses. The indexes are periodically
copied to a different set of machines that serve user queries.
Although not part of the main workflow, ES2 periodically executes several mining
and classification tasks. Examples of this include algorithms to automatically produce
acronym libraries, regular expression libraries [6], and geo-classification rules.
12.4.2 ES2 crawler
ES2 uses Nutch version 0.9. A primary data structure in Nutch is the CrawlDB: a key-value set where the keys are the URLs known to Nutch and the value is the status of the URL. The status contains metadata about the URL, such as the time of discovery, whether it has been fetched, and so on.
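To make this concrete, the following is a minimal Java sketch of what one CrawlDB entry conceptually holds. The field names and status values here are our own illustrative choices; Nutch itself keeps this metadata in its CrawlDatum class.

import java.util.HashMap;
import java.util.Map;

// Illustrative model of a CrawlDB entry: the key is the URL, the value
// is its crawl status plus metadata. Field names are assumptions, not
// Nutch's actual representation.
public class CrawlDbEntry {
    enum Status { UNFETCHED, FETCHED, GONE }  // simplified status set

    Status status = Status.UNFETCHED;
    long discoveredAt;       // time of discovery (epoch millis)
    long lastFetchedAt;      // 0 if never fetched
    float score;             // used when selecting the top-k fetch list

    CrawlDbEntry(long discoveredAt) {
        this.discoveredAt = discoveredAt;
    }

    public static void main(String[] args) {
        // The CrawlDB is conceptually a URL -> status map.
        Map<String, CrawlDbEntry> crawlDb = new HashMap<>();
        crawlDb.put("http://example.com/",
                    new CrawlDbEntry(System.currentTimeMillis()));
        System.out.println(crawlDb.keySet());
    }
}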
Nutch is architected as a sequence of three MapReduce jobs:
Generate —In this phase, a fetch list is generated by scanning the input key-value pairs (from the CrawlDB) for URLs that have been discovered but not fetched. A common choice in generating this list is to select the top k unfetched URLs using an appropriate scoring mechanism (k is a configuration parameter in Nutch).
Fetch —In this phase, the pages associated with the URLs in the input fetch list are fetched and parsed. The output consists of the URL and the parsed representation of the page.
Update —The update phase collects all the URLs that have been discovered by
parsing the contents of the pages in the fetch phase and merges them with the
CrawlDB.
The pages fetched in each cycle of generate-fetch-update are referred to as a segment.
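The self-contained Java sketch below simulates one generate-fetch-update cycle over an in-memory CrawlDB. It illustrates the control flow only; in Nutch 0.9 each phase runs as a separate MapReduce job, and the scoring, parsing, and merge logic are far richer than shown here.

import java.util.*;
import java.util.stream.Collectors;

// Simulates Nutch's generate-fetch-update cycle over an in-memory CrawlDB.
// Purely illustrative: real Nutch runs each phase as a MapReduce job.
public class CrawlCycle {
    // CrawlDB: URL -> fetched? (real entries carry much more metadata)
    static Map<String, Boolean> crawlDb = new HashMap<>();

    // Generate: pick up to k URLs that are known but not yet fetched.
    static List<String> generate(int k) {
        return crawlDb.entrySet().stream()
                .filter(e -> !e.getValue())
                .map(Map.Entry::getKey)
                .limit(k)                       // stand-in for top-k scoring
                .collect(Collectors.toList());
    }

    // Fetch: "download" each page and return the out-links found by parsing.
    static List<String> fetch(List<String> fetchList) {
        List<String> discovered = new ArrayList<>();
        for (String url : fetchList) {
            crawlDb.put(url, true);             // mark as fetched
            discovered.add(url + "/child");     // fake out-link from parsing
        }
        return discovered;                      // this batch is one "segment"
    }

    // Update: merge newly discovered URLs back into the CrawlDB.
    static void update(List<String> discovered) {
        for (String url : discovered) {
            crawlDb.putIfAbsent(url, false);    // new URLs start unfetched
        }
    }

    public static void main(String[] args) {
        crawlDb.put("http://example.com", false);   // seed URL
        for (int cycle = 0; cycle < 3; cycle++) {
            List<String> fetchList = generate(10);
            List<String> segment = fetch(fetchList);
            update(segment);
            System.out.println("cycle " + cycle + ": db size = " + crawlDb.size());
        }
    }
}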
Out of the box, the first problem we encountered was crawl speed. Nutch's crawl rate was under three pages per second, far below what the network bandwidth available to the cluster could sustain. A deeper problem, which we encountered after a sample crawl of 80 million pages, was that the quality of discovered pages was surprisingly low. In this section, we identify the underlying reasons for both problems and describe the enhancements made to Nutch to adapt it to the IBM intranet.
MODIFICATIONS FOR PERFORMANCE
Nutch's design was aimed at web crawling. When using it to crawl the IBM intranet,
we observed multiple performance bottlenecks. We discovered that the reason for the
bottlenecks was that the enterprise intranet contains far fewer hosts than the web, and
some of the design choices made in Nutch assume a large number of distinct hosts. We
describe two ways in which this problem manifests itself, and the approach used in ES2
to adapt Nutch's design for the enterprise.
A major performance bottleneck in the fetch phase, called the long tail problem, exhibits the following behavior. The crawl rate in the early part of the fetch phase is relatively high (typically dozens of pages a second), but it deteriorates quickly to less than a page per second, where it remains until the segment completes.
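Because the intranet concentrates many pages on few hosts, and a polite fetcher waits between successive requests to the same host, a back-of-the-envelope bound suggests how such a tail can arise. The sketch below computes this bound; the delay value and host distribution are illustrative assumptions, not measurements from ES2.

// Back-of-the-envelope bound on fetch time when one host dominates a segment.
// Assumes a polite fetcher that waits a fixed delay between requests to the
// same host; the numbers below are illustrative, not ES2 measurements.
public class LongTailBound {
    public static void main(String[] args) {
        double delaySeconds = 5.0;       // assumed per-host politeness delay
        int pagesOnBiggestHost = 50_000; // assumed pages from one intranet host

        // Requests to one host are serialized by the politeness delay, so the
        // segment can finish no sooner than this, however many threads run:
        double minSeconds = pagesOnBiggestHost * delaySeconds;
        System.out.printf("lower bound on segment time: %.1f hours%n",
                          minSeconds / 3600);

        // Once the well-spread pages are done, the crawl degenerates to the
        // dominant host's serialized rate: the "long tail".
        System.out.printf("tail crawl rate: %.2f pages/second%n",
                          1 / delaySeconds);
    }
}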
 