Toward Optimal Resource Provisioning for Economical and Green MapReduce Computing in the Cloud - Large Scale and Big Data: Processing and Management

Database Reference

In-Depth Information

17.6.2.1 WordCount

WordCount is a sample MapReduce program in the Hadoop package. The map pro-

gram splits the input text into words and the result is locally aggregated by word with

a combiner; the reduce program sums up the local aggregation results 〈 word , count 〉

by words and output the final word counts. Since the number of words is limited,

the amount of output data to the reduce stage and the cost of reduce stage are small,

compared with the data and the processing cost for the map stage. The complexity of

the reduce program, g (), is linear to reduce's input data.

17.6.2.2 Sort

Sort is also a sample MapReduce program in the Hadoop package. It depends on a

custom partitioner that uses a sorted list of N - 1 sampled keys to define the key range

for each reduce process. All keys such that sample[i i - 1] <= key < sample[i i ] are sent

to reduce i . Then, the inherent MergeSort in the Shuffle stage sorts the input data to

the reduce. This guarantees that the output of reduce i are all less than the output of

reduce i + 1. Both the map program and the reduce program do nothing but simply

pass the input to the output. Therefore, the function g () is also linear to the size of the

input of reduce.

17.6.2.3 PageRank

PageRank is a MapReduce implementation of the well-known Google's PageRank

algorithm [3]. PageRank can be implemented with an iterative algorithm and applied

to a graph data set. Assume each node p i in the graph has a PageRank PR ( p i ). M ( p i )

represents the set of neighboring nodes of p i that have outlinks pointing to p i . L ( p j )

is the total number of outlinks the node p j has. d is the damping factor and N is the

total number of nodes. The following equation calculates the PageRank for each

node p i .

PR p

Lp

()

∑

j

PR p

() (

=− +

1

d

)/

Nd

(17.17)

i

j

pMp

j

∈

()

i

PageRank values are updated in multiple rounds until they converge. In one round of

PageRank MapReduce program, all nodes' PageRank values are updated in paral-

lel based on the above equation. Concretely, the map program distributes a share of

each node's PageRank, that is, PR ( p j )/ L ( p j ), to all its outlink neighbors. The reduce

program collects the shares from its neighbors and applies the equation to update

the PageRank. The complexity function g () is also linear to the size of the input of

reduce.

17.6.2.4 Join

Join is a MapReduce program that joins a large file with a small file based on a des-

ignated key attribute, which mimics the Join operation in relational database. The

large files are the text files randomly generated with RandomTextWriter. The small

file consists of 50 randomly generated lines using the same method for generating

Large Scale and Big Data: Processing and Management

Search WWH ::

Custom Search

Home