Case Studies - Hadoop in Action


algorithms, namely TitleHomePage, PersonalHomePage, URLHomePage, AnchorHome, and NavLink are used. These algorithms use rules based on regular expression patterns, dictionaries, and information extraction tools [7] to identify candidate navigational pages. For instance, using a regular expression like “\A\W*(.+)\s<Home>” (Java regular expression syntax), the PersonalHomePage algorithm can detect that a page titled “G. J. Chaitin's Home” is the home page of G. J. Chaitin. The algorithm outputs the name of a feature (“Personal Home Page”) and associates a value with this feature (“G. J. Chaitin”). The next section describes the impact of redirections on local analysis and discusses a solution.
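
As a minimal sketch of the title rule described above (not the ES2 implementation), the following Java snippet applies a pattern of the same shape to a page title. It assumes that <Home> in the pattern stands for a dictionary of home-page words, approximated here by a short alternation, and that a trailing possessive is dropped from the captured name. The class and method names are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PersonalHomePageSketch {
    // Assumption: "<Home>" in the source pattern is a dictionary placeholder;
    // the alternation below is a stand-in for that dictionary.
    // Assumption: a possessive "'s" before the home-page word is not part of the name.
    private static final Pattern TITLE = Pattern.compile(
        "\\A\\W*(.+?)(?:'s)?\\s(?:Home\\s?Page|Home)\\W*\\z",
        Pattern.CASE_INSENSITIVE);

    /** Returns the feature value (the owner's name) if the title looks like a personal home page. */
    static Optional<String> extractOwner(String title) {
        Matcher m = TITLE.matcher(title);
        return m.matches() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Feature name "Personal Home Page", feature value taken from the title.
        System.out.println(extractOwner("G. J. Chaitin's Home")); // Optional[G. J. Chaitin]
        System.out.println(extractOwner("Welcome to IBM"));       // Optional.empty
    }
}
```

The lazy group `(.+?)` lets the optional possessive and the home-page word claim the end of the title, so the captured value is the name alone.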

REDIRECTION RESOLUTION

Many sites in IBM's intranet employ redirection for updating, load balancing, upgrading, and handling internal reorganizations. Unfortunately, redirections can cause complications in the local analysis algorithms. For instance, URLHomePage uses the text of the URL to detect a candidate navigational page. After redirection, the target URL may not contain the same features as the original URL. As an illustrative example, consider the URL http://w3.can.ibm.com/hr/erbp. Local analysis algorithms can correctly identify this URL as the home page for the Employee Referral Bonus Program (ERBP) using clues from the URL. But this URL gets redirected to a target URL ending in 51ea/ac3f2f04ba60a6d585256d05004cef97?OpenDocument, where a Lotus Domino database serves information about the Employee Referral Bonus Program. The clues in the source URL are no longer available in the target, and the local analysis algorithm can no longer identify this page as navigational. To prevent this, ES2 resolves all redirections, collects the set of URLs that lead to the target page through redirections, and provides local analysis with the appropriate URLs.
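
The resolution step described above can be sketched in plain Java as follows. This is an in-memory illustration under stated assumptions, not the ES2 or Nutch code: redirects are given as a hypothetical map from a source URL to the URL it redirects to, the page names A, B, and C are placeholders, and the class and method names are invented for this example.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

class RedirectResolutionSketch {
    /** Follows redirect hops from url until a page that does not redirect; stops if a cycle is hit. */
    static String finalTarget(String url, Map<String, String> redirects) {
        Set<String> seen = new LinkedHashSet<>();
        String current = url;
        // seen.add returns false on a repeated URL, which ends the loop on a redirect cycle.
        while (redirects.containsKey(current) && seen.add(current)) {
            current = redirects.get(current);
        }
        return current;
    }

    /** Maps each final target page to the set of URLs that reach it through redirections. */
    static Map<String, Set<String>> sourcesByTarget(Map<String, String> redirects) {
        Map<String, Set<String>> result = new LinkedHashMap<>();
        for (String source : redirects.keySet()) {
            String target = finalTarget(source, redirects);
            if (!target.equals(source)) {
                result.computeIfAbsent(target, k -> new LinkedHashSet<>()).add(source);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> redirects = new LinkedHashMap<>();
        redirects.put("A", "B"); // A redirects to B
        redirects.put("B", "C"); // B redirects to C
        System.out.println(sourcesByTarget(redirects)); // {C=[A, B]}
    }
}
```

With the source URLs grouped under their final target, a URL-based rule can be run against every URL in the set rather than only the target URL.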

To track redirections, we modified Nutch to tag every page that was a target of redirection with the source URL. Consider figure 12.9. The crawler follows redirections from page A to page B, and from page B to arrive at page C. We track these redirections by tagging pages B and C with the source URL, A. This tag is stored as a metadata field in the segment file. A segment file is a key/value set

[Figure 12.9 Resolving redirections. Before resolution: URL A redirects to URL B, which redirects to URL C (redirect source: A). After redirection resolution: URL C, sources A and B.]
