where the key is the URL of a page and the value is the contents of the page (along with
additional metadata fields).
Listing 12.2 (called ResolveSimple) outlines the map and reduce functions that are
used to resolve redirections on a segment and invoke local analysis. The map phase
outputs the source URL and the page contents. The reduce phase brings all the
pages with the same source URL into a single group. In the preceding example of
figure 12.9, the common source URL for pages A, B, and C is A. The target page
in this group (C) is then passed to local analysis along with the other URLs in the
group—A and B.
Listing 12.2 ResolveSimple
Map (Key: URL, Value: PageData)
    if PageData.SourceURL exists then
        Output [PageData.SourceURL, PageData]
    else
        Output [URL, PageData]
    end if
End
Reduce (Key: URL, Values: Pageset)
    Let URLset = Set of all URLs in Pageset
    Let page = Target of redirection in Pageset
    result = LocalAnalysis(page, URLset)
    Output [page.URL, result]
End
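The listing above can be sketched in plain Python as a small in-memory simulation. This is only an illustration under stated assumptions, not the authors' Hadoop code: the `PageData` fields, the `redirects_to` field used to identify the target of redirection, the `local_analysis` placeholder, and the `run` driver that stands in for Hadoop's shuffle are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageData:
    url: str
    content: str
    source_url: Optional[str] = None    # first URL of the redirect chain, if known
    redirects_to: Optional[str] = None  # hypothetical field: next hop; None for the target


def map_page(url, page):
    """Emit the page keyed by its source URL, or by its own URL if it has none."""
    key = page.source_url if page.source_url is not None else url
    yield key, page


def local_analysis(page, urls):
    """Placeholder for local analysis; it only records what it was given."""
    return {"content_length": len(page.content), "urls": sorted(urls)}


def reduce_group(key, pages):
    """Find the redirection target in the group and analyze it with all group URLs."""
    url_set = {p.url for p in pages}
    # Assumption: the target is the one page that does not redirect further.
    target = next(p for p in pages if p.redirects_to is None)
    yield target.url, local_analysis(target, url_set)


def run(segment):
    groups = defaultdict(list)  # stands in for Hadoop's shuffle and sort
    for url, page in segment.items():
        for key, value in map_page(url, page):
            groups[key].append(value)
    results = {}
    for key, pages in groups.items():
        for out_key, out_value in reduce_group(key, pages):
            results[out_key] = out_value
    return results


# Pages A, B, and C share source URL A; D is an ordinary page with no redirection.
segment = {
    "A": PageData("A", "", source_url="A", redirects_to="B"),
    "B": PageData("B", "", source_url="A", redirects_to="C"),
    "C": PageData("C", "<html>target</html>", source_url="A"),
    "D": PageData("D", "<html>plain</html>"),
}
results = run(segment)
```

Note that every `PageData` object, content included, passes through `groups` here; that is the data movement between map and reduce that the next section sets out to avoid.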
HADOOP IMPLEMENTATION
In ResolveSimple, local analysis is invoked in the reducer. This requires Hadoop to
pass along the contents of each page from the map phase to the reduce phase. This
involves sorting and moving a large amount of data across the network. To avoid this,
we modify ResolveSimple (listing 12.2) and separate the tasks of redirection resolution
and local analysis so that the algorithms in local analysis are run in the map phase.
This allows the local analysis computation to be colocated with the data, and therefore
results in a significant performance improvement.
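To make the cost argument concrete, the following sketch compares the approximate volume of data sent from map to reduce when full pages are emitted versus metadata only. The page sizes are made-up illustrative values, and JSON serialization is merely an assumed stand-in for whatever serialization Hadoop actually uses.

```python
import json


def shuffle_bytes(records):
    """Approximate shuffle volume as the size of JSON-serialized (key, value) pairs."""
    return sum(len(json.dumps([key, value]).encode("utf-8")) for key, value in records)


# Hypothetical segment: three pages in one redirect chain, each with 50,000 characters of content.
pages = {
    "A": {"source_url": "A", "content": "x" * 50_000},
    "B": {"source_url": "A", "content": "x" * 50_000},
    "C": {"source_url": "A", "content": "x" * 50_000},
}

# Map output when the whole page travels to the reducer.
full_records = [(page["source_url"], page) for page in pages.values()]

# Map output when the content is projected out and only metadata is emitted.
metadata_records = [(page["source_url"], {"url": url}) for url, page in pages.items()]

full_size = shuffle_bytes(full_records)
metadata_size = shuffle_bytes(metadata_records)
```

With these assumed sizes, `metadata_size` is a small fraction of `full_size`, because the content dominates each record.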
We have outlined the modified algorithm, called Resolve2Step, in listing 12.3. In the
map phase of this algorithm, we pass along only the metadata; the page content
(which accounts for the majority of the data volume) is projected out. In the reduce
phase of Resolve2Step, we output a table with two columns: the first column is the
URL of the target page in the group of pages, and the second column is the set of
URLs to be associated with the page when it's submitted to local analysis.
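The redirection-resolution step described above might look as follows in plain Python. This is a minimal sketch under assumptions: pages are represented as dictionaries, the `redirects_to` key used to identify the target page is hypothetical, and `resolve_redirections` is a hypothetical driver that simulates the shuffle in memory. The later step that runs local analysis is not shown.

```python
from collections import defaultdict


def map_metadata(url, page):
    """Emit only metadata; the page content is projected out before the shuffle."""
    metadata = {"url": url, "redirects_to": page.get("redirects_to")}
    key = page.get("source_url") or url
    yield key, metadata


def reduce_urls(key, metadata_list):
    """Emit one table row: (target URL, all URLs in the redirect group)."""
    url_set = sorted(m["url"] for m in metadata_list)
    # Assumption: the target is the page that does not redirect further.
    target = next(m["url"] for m in metadata_list if m["redirects_to"] is None)
    yield target, url_set


def resolve_redirections(segment):
    groups = defaultdict(list)  # stands in for Hadoop's shuffle and sort
    for url, page in segment.items():
        for key, metadata in map_metadata(url, page):
            groups[key].append(metadata)
    table = {}
    for key, metadata_list in groups.items():
        for target, urls in reduce_urls(key, metadata_list):
            table[target] = urls
    return table


segment = {
    "A": {"source_url": "A", "redirects_to": "B", "content": ""},
    "B": {"source_url": "A", "redirects_to": "C", "content": ""},
    "C": {"source_url": "A", "content": "<html>target</html>"},
    "D": {"content": "<html>plain</html>"},
}
table = resolve_redirections(segment)
```

The resulting `table` maps each target URL to the URLs that should accompany it, and no page content ever enters `groups`.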
Listing 12.3 Resolve2Step
1: Resolve Redirections
Map (Key: URL, Value: Page)
    if PageData.SourceURL exists then
        Output [PageData.SourceURL, PageData.metadata]
    else
 