algorithms, namely TitleHomePage, PersonalHomePage, URLHomePage, AnchorHome, and
NavLink are used. These algorithms use rules based on regular expression patterns,
dictionaries, and information extraction tools [7] to identify candidate navigational
pages. For instance, using a regular expression like “\A\W*(.+)\s<Home>” (Java
regular expression syntax), the PersonalHomePage algorithm can detect that a page
titled “G. J. Chaitin's Home” is the home page of G. J. Chaitin. The
algorithm outputs the name of a feature (“Personal Home Page”) and associates a
value with this feature (“G. J. Chaitin”). The next section describes the impact of
redirections on local analysis and discusses a solution.
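The title rule above can be sketched in a few lines. This is only an illustration, not the system's implementation: it is written in Python (whose `re` module accepts the same `\A`, `\W`, and `\s` constructs), it assumes that `<Home>` stands for a dictionary of "home" words and substitutes a small hypothetical list, and it assumes that a trailing possessive is stripped from the captured name. The names `HOME_WORDS` and `personal_home_page` are invented for this sketch.

```python
import re

# Assumption: "<Home>" in the pattern refers to a dictionary of "home" words.
# This short list is hypothetical; longer entries come first so they win.
HOME_WORDS = ["Home Page", "Homepage", "Home"]

# The pattern from the text, with <Home> expanded into an alternation.
TITLE_PATTERN = re.compile(
    r"\A\W*(.+)\s(?:" + "|".join(re.escape(w) for w in HOME_WORDS) + r")"
)


def personal_home_page(title):
    """Return (feature name, value) for a matching title, else None."""
    match = TITLE_PATTERN.match(title)
    if match is None:
        return None
    # Assumption: drop a trailing possessive ("Chaitin's" -> "Chaitin").
    name = re.sub(r"'s\Z", "", match.group(1)).strip()
    return ("Personal Home Page", name)


print(personal_home_page("G. J. Chaitin's Home"))
```

A production rule would presumably also check the captured text against a dictionary of person names, which this sketch omits.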
REDIRECTION RESOLUTION
Many sites in IBM's intranet employ redirection for updating, load balancing,
upgrading, and handling internal reorganizations. Unfortunately, redirections can cause
complications in the local analysis algorithms. For instance, URLHomePage uses the
text of the URL to detect a candidate navigational page. After redirection, the target
URL may not contain the same features as the original URL. As an illustrative
example, consider the URL http://w3.can.ibm.com/hr/erbp. Local analysis algorithms
can correctly identify this URL as the home page for the Employee Referral Bonus
Program (ERBP) using clues from the URL. But this URL gets redirected to a Lotus
Domino server at http://w3-03.ibm.com/hr/hrc.nsf/3f31db8c0ff0ac90852568f7006d51ea/ac3f2f04ba60a6d585256d05004cef97?OpenDocument, where a Lotus Domino
database serves information about the Employee Referral Bonus Program. The clues
in the source URL are no longer available in the target, and the local analysis
algorithm can no longer identify this page as navigational. To prevent this, ES2 resolves all
redirections, collects the set of URLs that lead to the target page through redirections,
and provides local analysis with the appropriate URLs.
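As a rough illustration of the resolution step, the sketch below follows each redirect chain to its final target and collects every URL that leads there. It is a minimal sketch under stated assumptions, not the ES2 code: the input is assumed to be a simple mapping from a source URL to the URL it redirects to, and the function names are hypothetical.

```python
from collections import defaultdict


def final_target(url, redirects):
    """Follow redirects from url; return the last URL, or None on a loop."""
    seen = {url}
    while url in redirects:
        url = redirects[url]
        if url in seen:  # redirect loop: no final target exists
            return None
        seen.add(url)
    return url


def resolve_redirections(redirects):
    """Map each final target to the sorted list of URLs that redirect to it.

    redirects: dict of source URL -> URL it redirects to (assumed input shape).
    """
    sources_by_target = defaultdict(set)
    for source in redirects:
        target = final_target(source, redirects)
        if target is not None:
            sources_by_target[target].add(source)
    return {target: sorted(srcs) for target, srcs in sources_by_target.items()}


# Chain from figure 12.9: A redirects to B, and B redirects to C.
print(resolve_redirections({"A": "B", "B": "C"}))
```

With the source URLs attached to the target page, a URL-based rule such as URLHomePage can be run against http://w3.can.ibm.com/hr/erbp even though the crawler only fetched content from the Domino URL.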
To track redirections, we modified Nutch to tag every page that was a target
of redirection with the source URL. Consider figure 12.9. The crawler follows
redirections from a page A to page B, and from page B to arrive at page C. We
track these redirections by tagging pages B and C with the source URL, A. This
tag is stored as a metadata field in the segment file. A segment file is a key/value set
Figure 12.9 Resolving redirections. Page A redirects to page B, which redirects to page C; pages B and C are each tagged with source A. After redirection resolution, page C is associated with sources A and B.
 