Database Reference
In-Depth Information
17.6.2.1 WordCount
WordCount is a sample MapReduce program in the Hadoop package. The map pro-
gram splits the input text into words and the result is locally aggregated by word with
a combiner; the reduce program sums up the local aggregation results 〈 word , count
by words and output the final word counts. Since the number of words is limited,
the amount of output data to the reduce stage and the cost of reduce stage are small,
compared with the data and the processing cost for the map stage. The complexity of
the reduce program, g (), is linear to reduce's input data.
17.6.2.2 Sort
Sort is also a sample MapReduce program in the Hadoop package. It depends on a
custom partitioner that uses a sorted list of N - 1 sampled keys to define the key range
for each reduce process. All keys such that sample[i i - 1] <= key < sample[i i ] are sent
to reduce i . Then, the inherent MergeSort in the Shuffle stage sorts the input data to
the reduce. This guarantees that the output of reduce i are all less than the output of
reduce i  + 1. Both the map program and the reduce program do nothing but simply
pass the input to the output. Therefore, the function g () is also linear to the size of the
input of reduce.
17.6.2.3 PageRank
PageRank is a MapReduce implementation of the well-known Google's PageRank
algorithm [3]. PageRank can be implemented with an iterative algorithm and applied
to a graph data set. Assume each node p i in the graph has a PageRank PR ( p i ). M ( p i )
represents the set of neighboring nodes of p i that have outlinks pointing to p i . L ( p j )
is the total number of outlinks the node p j has. d is the damping factor and N is the
total number of nodes. The following equation calculates the PageRank for each
node p i .
PR p
Lp
()
()
j
PR p
() (
=− +
1
d
)/
Nd
(17.17)
i
j
pMp
j
()
i
PageRank values are updated in multiple rounds until they converge. In one round of
PageRank MapReduce program, all nodes' PageRank values are updated in paral-
lel based on the above equation. Concretely, the map program distributes a share of
each node's PageRank, that is, PR ( p j )/ L ( p j ), to all its outlink neighbors. The reduce
program collects the shares from its neighbors and applies the equation to update
the PageRank. The complexity function g () is also linear to the size of the input of
reduce.
17.6.2.4 Join
Join is a MapReduce program that joins a large file with a small file based on a des-
ignated key attribute, which mimics the Join operation in relational database. The
large files are the text files randomly generated with RandomTextWriter. The small
file consists of 50 randomly generated lines using the same method for generating
Search WWH ::




Custom Search