1. The first step is to load the URL State database with an initial set of URLs. These can be a broad set of top-level domains, such as the 1.7 million web sites with the highest US-based traffic, or the results of selective searches against another index, or manually selected URLs that point to specific, high-quality pages.
2. Once the URL State database has been loaded with some initial URLs, the first
loop in the focused crawl can begin. The first step in each loop is to extract all of
the unprocessed URLs, and sort them by their link score.
3. Next comes one of the two critical steps in the workflow. A decision is made about how many of the top-scoring URLs to process in this loop. The fewer the number, the “tighter” the focus of the crawl. There are many options for deciding how many URLs to accept: for example, a fixed minimum score, a fixed percentage of all URLs, or a maximum count. More sophisticated approaches include picking a cutoff score that represents the transition point (elbow) in a power curve (see the first sketch following this list).
4. Once the set of accepted URLs has been created, the standard fetch process begins. This includes all of the usual steps required for polite and efficient fetching, such as robots.txt processing. Pages that are successfully fetched can then be parsed.
5. Typically, fetched pages are also saved into the Fetched Pages database.
6. Now comes the second of the two critical steps. The parsed page content is
given to the page scorer, which returns a value representing how closely the page
matches the focus of the crawl. Typically this is a value from 0.0 to 1.0, with
higher scores being better.
7. Once the page has been scored, each outlink found in the parse is extracted.
8. The score for the page is divided among all of its outlinks (see the second sketch following this list).
9. Finally, the URL State database is updated with the results of the fetch attempts (succeeded or failed), all newly discovered URLs are added, and existing URLs have their link scores increased by the scores of any matching outlinks extracted during this loop.
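To make the cutoff decision in step 3 concrete, here is a minimal Java sketch of the strategies mentioned above. The class and method names (UrlCutoff, ScoredUrl, and so on) are illustrative only; they are not part of Nutch or of the crawler described here, and the elbow detection shown is just one simple way to find the transition point in a score curve.

import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch of step 3: deciding how many of the top-scoring
 * unprocessed URLs to accept in the current loop. The input list is
 * assumed to be sorted by link score, highest first.
 */
public class UrlCutoff {

    /** Keep every URL whose link score meets a fixed minimum. */
    public static List<ScoredUrl> byMinScore(List<ScoredUrl> sorted, double minScore) {
        List<ScoredUrl> accepted = new ArrayList<>();
        for (ScoredUrl u : sorted) {
            if (u.linkScore >= minScore) {
                accepted.add(u);
            }
        }
        return accepted;
    }

    /** Keep a fixed fraction (e.g. the top 10%) of all unprocessed URLs. */
    public static List<ScoredUrl> byPercentage(List<ScoredUrl> sorted, double fraction) {
        int count = (int) Math.ceil(sorted.size() * fraction);
        return new ArrayList<>(sorted.subList(0, Math.min(count, sorted.size())));
    }

    /** Keep at most a fixed number of URLs. */
    public static List<ScoredUrl> byMaxCount(List<ScoredUrl> sorted, int maxCount) {
        return new ArrayList<>(sorted.subList(0, Math.min(maxCount, sorted.size())));
    }

    /**
     * Keep URLs above the "elbow" of the score curve: the point with the
     * greatest gap below the straight line joining the highest and lowest
     * scores. A simple stand-in for more sophisticated knee detection.
     */
    public static List<ScoredUrl> byElbow(List<ScoredUrl> sorted) {
        int n = sorted.size();
        if (n < 3) {
            return new ArrayList<>(sorted);
        }
        double first = sorted.get(0).linkScore;
        double last = sorted.get(n - 1).linkScore;
        int elbow = 0;
        double maxGap = -1.0;
        for (int i = 0; i < n; i++) {
            // Score the straight line would have at position i.
            double lineValue = first + (last - first) * i / (n - 1);
            double gap = lineValue - sorted.get(i).linkScore;
            if (gap > maxGap) {
                maxGap = gap;
                elbow = i;
            }
        }
        return new ArrayList<>(sorted.subList(0, elbow + 1));
    }

    /** Minimal URL-plus-score record; a real URL State entry holds more. */
    public static class ScoredUrl {
        final String url;
        final double linkScore;

        ScoredUrl(String url, double linkScore) {
            this.url = url;
            this.linkScore = linkScore;
        }
    }
}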
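Steps 6 through 9 can be sketched in the same spirit: the page scorer produces a value (typically 0.0 to 1.0), that value is divided among the page's outlinks, and the shares are folded back into the per-URL link scores. The class below uses an in-memory map as a stand-in for the URL State database; the names are hypothetical and the real crawl also records fetch status alongside the scores.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Illustrative sketch of steps 6-9: spreading a page's score across its
 * outlinks and accumulating the shares into per-URL link scores.
 */
public class OutlinkScoring {

    // url -> accumulated link score (kept in the URL State database
    // in the real crawl, together with fetch status).
    private final Map<String, Double> linkScores = new HashMap<>();

    /**
     * @param pageScore value returned by the page scorer, typically 0.0 to 1.0
     * @param outlinks  URLs extracted from the parsed page
     */
    public void distribute(double pageScore, List<String> outlinks) {
        if (outlinks.isEmpty()) {
            return;
        }
        // The page's score is divided among all of its outlinks.
        double share = pageScore / outlinks.size();
        for (String outlink : outlinks) {
            // Newly discovered URLs start at the share; existing URLs have
            // their link score increased by it.
            linkScores.merge(outlink, share, Double::sum);
        }
    }

    public double linkScoreOf(String url) {
        return linkScores.getOrDefault(url, 0.0);
    }
}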
At this point the focused crawl can terminate, if enough pages of sufficiently high quality (score) have been found, or the next loop can begin.
The crawl thus proceeds in a depth-first manner, focusing on the areas of the web graph where the most high-scoring pages are found.
In the end we wound up with about 50 million pages, and a “crawlDB” that contained around 250 million URLs, about half of which were scored high enough that we would eventually want to crawl them.
6.3.2 Web Page Processing
Once we had fetched a web page (or a document such as a PDF), we would parse it to extract the title and text. Again, we leveraged the parsing support that was already there in Nutch.
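Nutch's parsing support is built on Apache Tika, so a small standalone Tika sketch gives the flavor of this step. This is not the Nutch plugin code itself; the file-based main method and the class name PageParser are just for illustration.

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

/** Extracts the title and plain text from a fetched page or document. */
public class PageParser {

    public static void main(String[] args) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        // -1 disables the default write limit so large pages are not truncated.
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();

        // args[0] is a path to a fetched page or document (HTML, PDF, ...).
        try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
            parser.parse(stream, handler, metadata);
        }

        String title = metadata.get(TikaCoreProperties.TITLE);
        String text = handler.toString();
        System.out.println("Title: " + title);
        System.out.println("Text length: " + text.length());
    }
}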