The two algorithms of global analysis are:
Site root analysis—This algorithm is used to group candidate pages and identify a set of representative pages. The collection of candidate pages is first partitioned by the feature of interest, for example, PersonalHomePage. For each group, a forest of pages is constructed where each URL is a node in the forest, relating two URLs A and B as parent and child if A is the longest prefix of B. (Shorter prefixes are higher ancestors.) The forest is pruned using some complex logic that may involve inputs from other local analysis algorithms, the details of which are beyond the scope of this case study. We use site root analysis not only for the output from PersonalHomePage, but also for TitleHomePage (e.g., pages titled “Working at Almaden Research Center” or “IT Help Central”).
Anchor text analysis—This algorithm collects all the anchor text for each page by examining all the pages that point to it. The aggregated anchor text is processed to pick a set of representative terms for that URL. For further details on this algorithm, see [5].
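This excerpt shows neither the data layout nor the term-selection logic for anchor text analysis, but its aggregation step fits the same pipe pattern Jaql uses for the other global analysis tasks. The following sketch is purely illustrative: the file paths, the url and text fields, and the RepresentativeTerms user-defined function are assumptions, not the chapter's actual code.

// Illustrative sketch only. Paths, the url and text fields, and the
// RepresentativeTerms UDF are assumptions, not code from the chapter.
$anchors = read(hdfs("anchors.dat"));
$terms = hdfs("anchorTerms.dat");
$anchors
-> filter not isnull($.text)           // ignore links carrying no anchor text
-> group by $u = $.url                 // gather all anchor text per target URL
   into RepresentativeTerms($u, $)     // UDF picks representative terms
-> write($terms);

As in the site root analysis query discussed below, the heavy lifting stays inside the user-defined function; Jaql supplies only the grouping and data movement around it.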
HADOOP IMPLEMENTATION
In global analysis, first, a merge step joins together the results of local analysis on the main crawl and the tags for the URLs collected from Dogear. This is followed by a deduplication step where duplicate pages are eliminated. Each global analysis task then involves some standard data manipulation (e.g., partitioning, filtering, joining) in conjunction with some task-specific user-defined function, such as URL forest generation and pruning. Jaql is used to specify these tasks at a high level, and execute them in parallel using Hadoop.
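The chapter doesn't show code for the merge or deduplication steps. As a hedged illustration of the "standard data manipulation plus user-defined function" pattern, a deduplication pass could be phrased as a Jaql pipe that groups pages on a content fingerprint and keeps one page per group; the paths and the contentHash field are assumptions, not the chapter's actual code.

// Illustrative sketch only. Paths and the contentHash field are assumptions.
$merged = read(hdfs("mergedDocs.dat"));
$deduped = hdfs("dedupedDocs.dat");
$merged
-> group by $h = $.contentHash   // pages with identical fingerprints group together
   into $[0]                     // keep one representative page per group
-> write($deduped);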
Consider the Jaql query in listing 12.4, used for the global analysis of PersonalHomePage data. The first two lines specify the input and output files. The input is assumed to be a JSON array—in this case, an array of records—with each record representing a page and its associated results from local analysis. The third line is the start of a Jaql pipe: pages flow from the input file, referred to by $allDocs, to subsequent operators. The connection between pipe operators is denoted by ->. Following the input, the filter operator passes through only those values for which its predicate evaluates to true. In the example, only pages that have a local analysis (LA) field, a PersonalHomePage field, and a non-null name are output to the next operator. The $ is a variable that refers to the current value in the pipe. The filtered pages are partitioned according to name. For each partition, the user-defined function SiteRootAnalysis is evaluated. The function takes as input the partitioning field $t (a variable for name) and all pages in the partition ($). Finally, the annotated pages are written to the $results output file.
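Listing 12.4 itself is not reproduced in this excerpt. Based solely on the description above, its shape would be roughly as follows; the file paths are hypothetical, while $allDocs, $results, the LA.PersonalHomePage.name path, and the SiteRootAnalysis function are taken from the text.

// Hedged reconstruction of the shape of listing 12.4; file paths are hypothetical.
$allDocs = read(hdfs("docs.dat"));    // input: pages plus local analysis results
$results = hdfs("phpResults.dat");    // output file for the annotated pages
$allDocs
-> filter not isnull($.LA)
      and not isnull($.LA.PersonalHomePage)
      and not isnull($.LA.PersonalHomePage.name)  // keep pages with a non-null name
-> group by $t = $.LA.PersonalHomePage.name       // partition the pages by name
   into SiteRootAnalysis($t, $)                   // UDF gets key $t, partition $
-> write($results);

Only the shape is asserted here; the exact predicates and paths in the real listing may differ.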
Jaql evaluates the query in listing 12.4 by translating it into a MapReduce job and submitting that job to Hadoop for evaluation. In this example, the map stage filters pages and extracts the partitioning key. The reduce stage evaluates the SiteRootAnalysis function per partition and writes the output to a file. In general,
 