Assuming that communication cost is the dominant cost, we might still ask why we
count only input size, and not output size. The answer to this question involves two points:
(1) If the output of one task τ is input to another task, then the size of τ's output will be
accounted for when measuring the input size of the receiving task. Thus, there is no
reason to count the size of any output except for those tasks whose output forms the
result of the entire algorithm.
(2) In practice, the output of an algorithm is rarely large compared with the input or the
intermediate data the algorithm produces. The reason is that massive outputs cannot
be used unless they are summarized or aggregated in some way. For example, although
we talked in Example 2.6 of computing the entire transitive closure of a graph, in practice
we would want something much simpler, such as the count of the nodes
reachable from each node, or the set of nodes reachable from a single node.
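To make the contrast concrete, here is a minimal sketch (our own illustration, not from the text) of the kind of summary just described: counting the nodes reachable from each node via breadth-first search. The result is one integer per node, far smaller than the full transitive closure, which can contain up to n² pairs.

```python
from collections import defaultdict

def reachable_counts(edges):
    """Count the nodes reachable from each node via BFS.

    A much smaller summary than the full transitive closure:
    one integer per node instead of up to n^2 reachable pairs.
    """
    adj = defaultdict(list)
    nodes = set()
    for u, v in edges:
        adj[u].append(v)
        nodes.update((u, v))

    counts = {}
    for start in nodes:
        seen = {start}
        frontier = [start]
        while frontier:
            nxt = []
            for u in frontier:
                for v in adj[u]:
                    if v not in seen:
                        seen.add(v)
                        nxt.append(v)
            frontier = nxt
        counts[start] = len(seen) - 1  # exclude the start node itself
    return counts

# A small cyclic graph: 1 -> 2 -> 3 -> 1, plus 3 -> 4
edges = [(1, 2), (2, 3), (3, 1), (3, 4)]
print(reachable_counts(edges))  # {1: 3, 2: 3, 3: 3, 4: 0}
```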
EXAMPLE 2.8 Let us evaluate the communication cost for the join algorithm from Section
2.3.7. Suppose we are joining R(A, B) ⋈ S(B, C), and the sizes of relations R and S are r
and s, respectively. Each chunk of the files holding R and S is fed to one Map task, so the
sum of the communication costs for all the Map tasks is r + s. Note that in a typical execution,
each Map task will be executed at a compute node holding a copy of the chunk to
which it applies. Thus, no internode communication is needed for the Map tasks, but they
still must read their data from disk. Since all the Map tasks do is make a simple transformation
of each input tuple into a key-value pair, we expect that the computation cost will be
small compared with the communication cost, regardless of whether the input is local to
the task or must be transported to its compute node.
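The simple transformation performed by the Map tasks can be sketched as follows. This is an illustrative sketch under our own naming conventions, not the book's code: each tuple becomes exactly one key-value pair keyed by the join attribute B, tagged with the relation it came from.

```python
def map_join(relation_name, tuple_):
    """Map function for the join R(A, B) ⋈ S(B, C).

    Each input tuple becomes a single key-value pair keyed by the
    join attribute B, so Map output is about as large as Map input.
    The tag records which relation the tuple came from, for use by
    the reducer.
    """
    if relation_name == 'R':
        a, b = tuple_
        return (b, ('R', a))
    else:  # relation 'S'
        b, c = tuple_
        return (b, ('S', c))

# One tuple from each relation, joining on B = 7
print(map_join('R', (1, 7)))  # (7, ('R', 1))
print(map_join('S', (7, 9)))  # (7, ('S', 9))
```

Because each input tuple yields exactly one key-value pair of about the same size, the total Map output is O(r + s), matching the communication-cost analysis above.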
The sum of the outputs of the Map tasks is roughly as large as their inputs. Each output
key-value pair is sent to exactly one Reduce task, and it is unlikely that this Reduce task
will execute at the same compute node. Therefore, communication from Map tasks to Reduce
tasks is likely to be across the interconnect of the cluster, rather than a local memory-to-disk
transfer within one node. This communication is O(r + s), so the communication cost of the join
algorithm is O(r + s).
The Reduce tasks execute the reducer (application of the Reduce function to a key and
its associated value list) for one or more values of attribute B. Each reducer takes the inputs
it receives and divides them between tuples that came from R and those that came from S.
Each tuple from R pairs with each tuple from S to produce one output. The output size for
the join can be either larger or smaller than r + s, depending on how likely it is that a given
R-tuple joins with a given S-tuple. For example, if there are many different B-values, we
would expect the output to be small, while if there are few B-values, a large output is likely.
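The reducer just described can be sketched as below (our own illustration, continuing the tagging convention from the Map phase): it separates the values for a key b by relation of origin, then pairs every R-side value with every S-side value.

```python
def reduce_join(b, values):
    """Reducer for join key b.

    Splits the value list by relation of origin, then forms the
    cross product: every R-tuple with this B-value pairs with every
    S-tuple with this B-value. Output size per key is therefore
    (#R-values) * (#S-values), which is why few distinct B-values
    can make the join output very large.
    """
    r_side = [a for tag, a in values if tag == 'R']
    s_side = [c for tag, c in values if tag == 'S']
    return [(a, b, c) for a in r_side for c in s_side]

# Two R-tuples and one S-tuple share B = 7, yielding two joined tuples
print(reduce_join(7, [('R', 1), ('R', 2), ('S', 9)]))
# [(1, 7, 9), (2, 7, 9)]
```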
If the output is large, then the computation cost of generating all the outputs from a reducer
could be much larger than O(r + s). However, we shall rely on our supposition that if