Assuming that communication cost is the dominant cost, we might still ask why we
count only input size, and not output size. The answer to this question involves two points:
(1) If the output of one task τ is input to another task, then the size of τ's output will be
accounted for when measuring the input size of the receiving task. Thus, there is no
reason to count the size of any output except for those tasks whose output forms the
result of the entire algorithm.
(2) In practice, the output of an algorithm is rarely large compared with the input or the
intermediate data the algorithm produces. The reason is that massive outputs cannot
be used unless they are summarized or aggregated in some way. For example, although
we talked in Example 2.6 of computing the entire transitive closure of a graph, in practice
we would want something much simpler, such as the count of the nodes
reachable from each node, or the set of nodes reachable from a single node.
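To make the contrast concrete, here is a minimal sketch (our own illustration, not from the text) of the kind of summary just described: counting the nodes reachable from each node via breadth-first search. The result is one integer per node, far smaller than the full transitive closure, which can contain up to n² pairs.

```python
from collections import defaultdict

def reachable_counts(edges):
    """Count the nodes reachable from each node via BFS.

    A much smaller summary than the full transitive closure:
    one integer per node instead of up to n^2 reachable pairs.
    """
    adj = defaultdict(list)
    nodes = set()
    for u, v in edges:
        adj[u].append(v)
        nodes.update((u, v))

    counts = {}
    for start in nodes:
        seen = {start}
        frontier = [start]
        while frontier:
            nxt = []
            for u in frontier:
                for v in adj[u]:
                    if v not in seen:
                        seen.add(v)
                        nxt.append(v)
            frontier = nxt
        counts[start] = len(seen) - 1  # exclude the start node itself
    return counts

# A small cyclic graph: 1 -> 2 -> 3 -> 1, plus 3 -> 4
edges = [(1, 2), (2, 3), (3, 1), (3, 4)]
print(reachable_counts(edges))  # {1: 3, 2: 3, 3: 3, 4: 0}
```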
EXAMPLE 2.8 Let us evaluate the communication cost for the join algorithm from Section
2.3.7. Suppose we are joining R(A, B) ⋈ S(B, C), and the sizes of relations R and S are r
and s, respectively. Each chunk of the files holding R and S is fed to one Map task, so the
sum of the communication costs for all the Map tasks is r + s. Note that in a typical execution,
each Map task will be executed at a compute node holding a copy of the chunk to
which it applies. Thus, no internode communication is needed for the Map tasks, but they
still must read their data from disk. Since all the Map tasks do is make a simple transformation
of each input tuple into a key-value pair, we expect that the computation cost will be
small compared with the communication cost, regardless of whether the input is local to
the task or must be transported to its compute node.
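The simple transformation performed by the Map tasks can be sketched as follows. This is an illustrative sketch under our own naming conventions, not the book's code: each tuple becomes exactly one key-value pair keyed by the join attribute B, tagged with the relation it came from.

```python
def map_join(relation_name, tuple_):
    """Map function for the join R(A, B) ⋈ S(B, C).

    Each input tuple becomes a single key-value pair keyed by the
    join attribute B, so Map output is about as large as Map input.
    The tag records which relation the tuple came from, for use by
    the reducer.
    """
    if relation_name == 'R':
        a, b = tuple_
        return (b, ('R', a))
    else:  # relation 'S'
        b, c = tuple_
        return (b, ('S', c))

# One tuple from each relation, joining on B = 7
print(map_join('R', (1, 7)))  # (7, ('R', 1))
print(map_join('S', (7, 9)))  # (7, ('S', 9))
```

Because each input tuple yields exactly one key-value pair of about the same size, the total Map output is O(r + s), matching the communication-cost analysis above.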
The sum of the outputs of the Map tasks is roughly as large as their inputs. Each output
key-value pair is sent to exactly one Reduce task, and it is unlikely that this Reduce task
will execute at the same compute node. Therefore, communication from Map tasks to Reduce
tasks is likely to be across the interconnect of the cluster, rather than a local memory-to-disk
transfer within one node. This communication is O(r + s), so the communication cost of the join
algorithm is O(r + s).
The Reduce tasks execute the reducer (application of the Reduce function to a key and
its associated value list) for one or more values of attribute B. Each reducer takes the inputs
it receives and divides them between tuples that came from R and those that came from S.
Each tuple from R pairs with each tuple from S to produce one output. The output size for
the join can be either larger or smaller than r + s, depending on how likely it is that a given
R-tuple joins with a given S-tuple. For example, if there are many different B-values, we
would expect the output to be small, while if there are few B-values, a large output is likely.
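The reducer just described can be sketched as below (our own illustration, continuing the tagging convention from the Map phase): it separates the values for a key b by relation of origin, then pairs every R-side value with every S-side value.

```python
def reduce_join(b, values):
    """Reducer for join key b.

    Splits the value list by relation of origin, then forms the
    cross product: every R-tuple with this B-value pairs with every
    S-tuple with this B-value. Output size per key is therefore
    (#R-values) * (#S-values), which is why few distinct B-values
    can make the join output very large.
    """
    r_side = [a for tag, a in values if tag == 'R']
    s_side = [c for tag, c in values if tag == 'S']
    return [(a, b, c) for a in r_side for c in s_side]

# Two R-tuples and one S-tuple share B = 7, yielding two joined tuples
print(reduce_join(7, [('R', 1), ('R', 2), ('S', 9)]))
# [(1, 7, 9), (2, 7, 9)]
```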
If the output is large, then the computation cost of generating all the outputs from a reducer
could be much larger than O(r + s). However, we shall rely on our supposition that if