2.4.1 Workflow Systems
Two experimental systems, Clustera from the University of Wisconsin and Hyracks from the University of California at Irvine, extend MapReduce from its simple two-step workflow (the Map function feeds the Reduce function) to any collection of functions connected by an acyclic flow graph. In this graph, an arc a → b represents the fact that function a's output is input to function b. A suggestion of what a workflow might look like is in Fig. 2.6. There, five functions, f through j, pass data from left to right in specific ways; because the flow of data is acyclic, no task needs to deliver output before its own input is available. For instance, function h takes its input from a preexisting file of the distributed file system, and each of h's output elements is passed to at least one of the functions i and j.
Figure 2.6 An example of a workflow that is more complex than Map feeding Reduce
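The acyclic flow graph described above can be sketched in a few lines. The particular arcs below (f feeding g, with g and h feeding both i and j) are an assumption made for illustration; the text only fixes that h reads a preexisting file and feeds i and j.

```python
# A minimal sketch of a workflow as an acyclic flow graph.
# The exact arcs among f..j are assumed for illustration.
from graphlib import TopologicalSorter  # Python 3.9+

# Map each function to the set of functions whose output it consumes.
# An empty set means the function reads a preexisting file, like h.
graph = {
    "f": set(),
    "g": {"f"},
    "h": set(),
    "i": {"g", "h"},
    "j": {"g", "h"},
}

# Because the graph is acyclic, a topological order exists: every
# function appears only after all of the functions that feed it.
order = list(TopologicalSorter(graph).static_order())
```

Any topological order is a valid schedule; a real system would run independent functions (such as f and h) in parallel rather than sequentially.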
In analogy to Map and Reduce functions, each function of a workflow can be executed
by many tasks, each of which is assigned a portion of the input to the function. A master
controller is responsible for dividing the work among the tasks that implement a function,
usually by hashing the input elements to decide on the proper task to receive an element.
Thus, like a Map task, each task implementing a function f has an output file of data destined for each of the tasks that implement the successor function(s) of f. These files are delivered by the master controller at the appropriate time, that is, after the task has completed its work.
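The hash-based division of work can be sketched as follows. The key extraction and task count are assumptions for illustration; the essential point is that hashing sends every element with the same key to the same task.

```python
# A sketch of how a master controller might divide a function's input
# among the tasks implementing that function: hash each element's key
# to pick one of num_tasks task inputs. (Illustrative, not Clustera's
# or Hyracks' actual API.)

def route(elements, num_tasks):
    """Partition (key, value) elements among num_tasks task inputs."""
    partitions = [[] for _ in range(num_tasks)]
    for key, value in elements:
        partitions[hash(key) % num_tasks].append((key, value))
    return partitions

parts = route([("a", 1), ("b", 2), ("a", 3)], num_tasks=4)
# All elements sharing a key land in the same partition, so each task
# sees every element for each of the keys it is responsible for.
```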
The functions of a workflow, and therefore the tasks, share with MapReduce tasks the
important property that they only deliver output after they complete. As a result, if a task
fails, it has not delivered output to any of its successors in the flow graph. A master controller can therefore restart the failed task at another compute node, without worrying that
the output of the restarted task will duplicate output that previously was passed to some
other task.
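The restart logic this property enables can be simulated in a few lines. The buffering and retry scheme below is a sketch under the stated assumption that a task's output is released only on successful completion; the flaky task is a contrived stand-in for a compute-node failure.

```python
# A sketch of the blocking property: a task buffers its output and
# releases it only when it finishes, so a failed task can be restarted
# elsewhere without duplicating output already sent downstream.

def run_task(task_fn, inputs):
    """Run a task over its inputs; release output only on completion."""
    buffered = []
    for x in inputs:
        buffered.append(task_fn(x))  # may raise partway through
    return buffered                  # delivered only if no failure

def run_with_retry(task_fn, inputs, attempts=3):
    """Restart the whole task on failure; no partial output escapes."""
    for _ in range(attempts):
        try:
            return run_task(task_fn, inputs)
        except RuntimeError:
            continue  # nothing was delivered, so a retry is safe
    raise RuntimeError("task failed on all attempts")

# A contrived task that crashes partway through its first attempt only.
state = {"failed_once": False}
def flaky(x):
    if not state["failed_once"] and x == 2:
        state["failed_once"] = True
        raise RuntimeError("simulated compute-node failure")
    return x * 10

out = run_with_retry(flaky, [1, 2, 3])
# out == [10, 20, 30]; the aborted first attempt contributed nothing.
```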
Many applications of workflow systems such as Clustera or Hyracks are cascades of
MapReduce jobs. An example would be the join of three relations, where one MapReduce
job joins the first two relations, and a second MapReduce job joins the third relation with
the result of joining the first two relations. Both jobs would use an algorithm like that of
Section 2.3.7 .
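The shape of such a cascade can be sketched with two in-memory hash joins. The relations R(a, b), S(b, c), and T(c, d) are illustrative assumptions, and a real MapReduce join would partition tuples by join key across tasks as in Section 2.3.7; only the two-stage structure is the point here.

```python
# A sketch of a three-way join as a cascade of two joins, the shape a
# workflow system could schedule as one plan. Relations and attribute
# names are assumptions for illustration.

def hash_join(left, right, key):
    """Join lists of dict-tuples on a shared attribute name."""
    index = {}
    for t in left:
        index.setdefault(t[key], []).append(t)
    out = []
    for u in right:
        for t in index.get(u[key], []):
            out.append({**t, **u})  # merge matching tuples
    return out

R = [{"a": 1, "b": 2}]
S = [{"b": 2, "c": 3}]
T = [{"c": 3, "d": 4}]

RS = hash_join(R, S, "b")    # first job: join R with S on b
RST = hash_join(RS, T, "c")  # second job: join the result with T on c
# RST == [{"a": 1, "b": 2, "c": 3, "d": 4}]
```

Run as a single workflow, the output of the first join can stream directly to the tasks of the second, which is exactly the flexibility the acyclic flow graph provides.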
There is an advantage to implementing such cascades as a single workflow. For example,
the flow of data among tasks, and its replication, can be managed by the master controller,