1. If the output of one task τ is input to another task, then the size of τ's output will be accounted for when measuring the input size for the receiving task. Thus, there is no reason to count the size of any output except for those tasks whose output forms the result of the entire algorithm.
2. In practice, the output of a job is rarely large compared with the input
or the intermediate data produced by the job. The reason is that massive
outputs cannot be used unless they are summarized or aggregated in some
way. For example, although we talked in Example 2.6 of computing the
entire transitive closure of a graph, in practice we would want something
much simpler, such as the count of the number of nodes reachable from
each node, or the set of nodes reachable from a single node.
Example 2.8 : Let us evaluate the communication cost for the join algorithm
from Section 2.3.7. Suppose we are joining R(A, B) ⋈ S(B, C), and the sizes
of relations R and S are r and s, respectively. Each chunk of the files holding
R and S is fed to one Map task, so the sum of the communication costs for all
the Map tasks is r + s. Note that in a typical execution, each Map task will be executed at a compute node holding a copy of the chunk to which it applies. Thus, no internode communication is needed for the Map tasks, but
they still must read their data from disk. Since all the Map tasks do is make a
simple transformation of each input tuple into a key-value pair, we expect that
the computation cost will be small compared with the communication cost,
regardless of whether the input is local to the task or must be transported to
its compute node.
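To make the Map phase concrete, here is a minimal sketch in Python of the transformation just described; the tuple layout and the "R"/"S" origin tags are illustrative assumptions for exposition, not the API of any particular MapReduce system.

```python
# Sketch of the Map functions for R(A, B) joined with S(B, C).
# Each input tuple becomes a single key-value pair keyed on the
# join attribute B; the "R"/"S" tag records the relation of
# origin so the Reduce side can tell the two apart.

def map_r(a, b):
    # An R-tuple (a, b) maps to key b with value ("R", a).
    return (b, ("R", a))

def map_s(b, c):
    # An S-tuple (b, c) maps to key b with value ("S", c).
    return (b, ("S", c))
```

Each input tuple yields exactly one pair of about the same size, which is why the total Map output, and hence the Map-to-Reduce communication, is roughly r + s.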
The sum of the outputs of the Map tasks is roughly as large as their inputs. Each output key-value pair is sent to exactly one Reduce task, and it is unlikely that this Reduce task will execute at the same compute node. Therefore, communication from Map tasks to Reduce tasks is likely to be across the interconnect of the cluster, rather than memory-to-disk. This communication is O(r + s), so the communication cost of the join algorithm is O(r + s).
Observe that the Reduce tasks can use a hash join of the tuples received.
This process involves hashing each of the tuples received on their B-values,
using a different hash function from the one that divided the tuples among
Reduce tasks. The local hash join takes time that is linear in the number of
tuples received, and thus is also O(r + s). We do not count this execution
time in the communication-cost model, but it is comforting to know that the
computation cost is surely not the dominant factor for this algorithm.
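A sketch of this local hash join follows, assuming a Reduce task sees its input as a list of (B-value, tagged-tuple) pairs in the format produced above; the bucket count and the salted hash are illustrative choices standing in for "a different hash function".

```python
from collections import defaultdict

def local_hash_join(received_pairs, num_buckets=1024):
    """One Reduce task's hash join over the pairs it received.

    received_pairs: iterable of (b, (tag, value)) pairs, where tag
    is "R" or "S". num_buckets is an arbitrary illustrative choice.
    """
    # Hash each tuple on its B-value with a hash function independent
    # of the one that routed keys to Reduce tasks (modeled by salting).
    buckets = defaultdict(list)
    for b, tagged in received_pairs:
        buckets[hash(("salt", b)) % num_buckets].append((b, tagged))

    # Within each bucket, pair every R-tuple with every S-tuple that
    # shares its B-value; the work is linear in the tuples received
    # plus the number of joined tuples produced.
    for bucket in buckets.values():
        r_vals, s_vals = defaultdict(list), defaultdict(list)
        for b, (tag, value) in bucket:
            (r_vals if tag == "R" else s_vals)[b].append(value)
        for b, a_list in r_vals.items():
            for c in s_vals.get(b, []):
                for a in a_list:
                    yield (a, b, c)
```

Since bucket assignment depends only on the B-value, all R-tuples and S-tuples that can join land in the same bucket, so no joined pair is missed.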
The output size for the join can be either larger or smaller than r + s,
depending on how likely it is that a given R-tuple joins with a given S-tuple.
For example, if there are many different B-values, we would expect the output
to be small, while if there are few B-values, a large output is likely. However,
we shall rely on our supposition that if the output of the join is large, then
there is probably some aggregation being done to reduce the size of the output.
This aggregation typically can be executed by the Reduce tasks as they produce
their output.
□
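For instance, if only the number of join results per B-value is wanted, a Reduce task can aggregate as it joins rather than emitting every tuple. The counting reducer below is our own illustration of this idea, reusing the tagged-pair format assumed above.

```python
def reduce_count(b, tagged_values):
    # tagged_values: the ("R", a) and ("S", c) pairs received for
    # key b. The join for this key would produce r_count * s_count
    # tuples, so the count can be emitted without materializing them.
    r_count = s_count = 0
    for tag, _ in tagged_values:
        if tag == "R":
            r_count += 1
        else:
            s_count += 1
    return (b, r_count * s_count)
```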