The two algorithms of global analysis are:
Site root analysis—This algorithm is used to group candidate pages and identify a set of representative pages. The collection of candidate pages is first partitioned by the feature of interest, for example, PersonalHomePage. For each group, a forest of pages is constructed where each URL is a node in the forest, relating two URLs A and B as parent and child if A is the longest prefix of B. (Shorter prefixes are higher ancestors.) The forest is pruned using some complex logic that may involve inputs from other local analysis algorithms, the details of which are beyond the scope of this case study. We use site root analysis not only for the output from PersonalHomePage, but also for TitleHomePage (e.g., pages titled “Working at Almaden Research Center” or “IT Help Central”).
Anchor text analysis—This algorithm collects all the anchor text for each page by examining all the pages that point to it. The aggregated anchor text is processed to pick a set of representative terms for that URL. For further details on this algorithm, see [5].
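This excerpt shows neither the data layout nor the term-selection logic for anchor text analysis, but its aggregation step fits the same pipe pattern Jaql uses for the other global analysis tasks. The following sketch is purely illustrative: the file paths, the url and text fields, and the RepresentativeTerms user-defined function are assumptions, not the chapter's actual code.

// Illustrative sketch only. Paths, the url and text fields, and the
// RepresentativeTerms UDF are assumptions, not code from the chapter.
$anchors = read(hdfs("anchors.dat"));
$terms = hdfs("anchorTerms.dat");
$anchors
-> filter not isnull($.text)           // ignore links carrying no anchor text
-> group by $u = $.url                 // gather all anchor text per target URL
   into RepresentativeTerms($u, $)     // UDF picks representative terms
-> write($terms);

As in the site root analysis query discussed below, the heavy lifting stays inside the user-defined function; Jaql supplies only the grouping and data movement around it.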
HADOOP IMPLEMENTATION
In global analysis, first, a merge step joins together the results of local analysis on the main crawl and the tags for the URLs collected from Dogear. This is followed by a deduplication step where duplicate pages are eliminated. Each global analysis task then involves some standard data manipulation (e.g., partitioning, filtering, joining) in conjunction with some task-specific user-defined function, such as URL forest generation and pruning. Jaql is used to specify these tasks at a high level, and execute them in parallel using Hadoop.
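The chapter doesn't show code for the merge or deduplication steps. As a hedged illustration of the "standard data manipulation plus user-defined function" pattern, a deduplication pass could be phrased as a Jaql pipe that groups pages on a content fingerprint and keeps one page per group; the paths and the contentHash field are assumptions, not the chapter's actual code.

// Illustrative sketch only. Paths and the contentHash field are assumptions.
$merged = read(hdfs("mergedDocs.dat"));
$deduped = hdfs("dedupedDocs.dat");
$merged
-> group by $h = $.contentHash   // pages with identical fingerprints group together
   into $[0]                     // keep one representative page per group
-> write($deduped);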
Consider the Jaql query in listing 12.4, used for the global analysis of PersonalHomePage data. The first two lines specify the input and output files. The input is assumed to be a JSON array—in this case, an array of records—with each record representing a page and its associated results from local analysis. The third line is the start of a Jaql pipe: pages flow from the input file, referred to by $allDocs, to subsequent operators. The connection between pipe operators is denoted by ->. Following the input, the filter operator passes through only those values for which its predicate evaluates to true. In the example, only pages that have a local analysis (LA) field, a PersonalHomePage field, and a non-null name are output to the next operator. The $ is a variable that refers to the current value in the pipe. The filtered pages are partitioned according to name. For each partition, the user-defined function SiteRootAnalysis is evaluated. The function takes as input the partitioning field $t (a variable for name) and all pages in the partition ($). Finally, the annotated pages are written to the $results output file.
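Listing 12.4 itself is not reproduced in this excerpt. Based solely on the description above, its shape would be roughly as follows; the file paths are hypothetical, while $allDocs, $results, the LA.PersonalHomePage.name path, and the SiteRootAnalysis function are taken from the text.

// Hedged reconstruction of the shape of listing 12.4; file paths are hypothetical.
$allDocs = read(hdfs("docs.dat"));    // input: pages plus local analysis results
$results = hdfs("phpResults.dat");    // output file for the annotated pages
$allDocs
-> filter not isnull($.LA)
      and not isnull($.LA.PersonalHomePage)
      and not isnull($.LA.PersonalHomePage.name)  // keep pages with a non-null name
-> group by $t = $.LA.PersonalHomePage.name       // partition the pages by name
   into SiteRootAnalysis($t, $)                   // UDF gets key $t, partition $
-> write($results);

Only the shape is asserted here; the exact predicates and paths in the real listing may differ.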
Jaql evaluates the query in listing 12.4 by translating it into a MapReduce job and submitting that job to Hadoop for evaluation. In this example, the map stage filters pages and extracts the partitioning key. The reduce stage evaluates the SiteRootAnalysis function per partition and writes the output to a file. In general,
 