Large-Scale File Systems and Map-Reduce - Mining of Massive Datasets

Figure 2.3: Overview of the execution of a map-reduce program
[Figure labels: User Program, fork, Master, assign Map, assign Reduce, Worker, Input Data, Intermediate Files, Output File]

file(s), but we may wish to create fewer Reduce tasks. The reason for limiting the number of Reduce tasks is that it is necessary for each Map task to create an intermediate file for each Reduce task, and if there are too many Reduce tasks the number of intermediate files explodes.
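To make the growth concrete: since every Map task creates one intermediate file per Reduce task, the total is the product of the two task counts. The short sketch below only does that arithmetic; the function name and the task counts are hypothetical values chosen for illustration, not figures from the text.

```python
def intermediate_file_count(num_map_tasks: int, num_reduce_tasks: int) -> int:
    """Each Map task writes one intermediate file per Reduce task."""
    return num_map_tasks * num_reduce_tasks


# Hypothetical job: 1000 Map tasks, with a varying number of Reduce tasks.
for num_reduce_tasks in (10, 100, 1000):
    total = intermediate_file_count(1000, num_reduce_tasks)
    print(f"{num_reduce_tasks} Reduce tasks: {total} intermediate files")
```

With the number of Map tasks held fixed, the file count grows linearly in the number of Reduce tasks, which is why the latter is usually kept small.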

The Master keeps track of the status of each Map and Reduce task (idle, executing at a particular Worker, or completed). A Worker process reports to the Master when it finishes a task, and a new task is scheduled by the Master for that Worker process.
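The bookkeeping described above can be sketched as a small in-memory status table. This is a minimal illustration, not the implementation of any real map-reduce system: the class names, the method names (`schedule`, `report_finished`), and the first-come scheduling order are all assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    IDLE = "idle"
    EXECUTING = "executing"
    COMPLETED = "completed"


@dataclass
class Task:
    task_id: str
    status: Status = Status.IDLE
    worker: Optional[str] = None  # the Worker running (or having run) the task


class Master:
    """Hypothetical Master that tracks task status and hands out idle tasks."""

    def __init__(self, task_ids):
        self.tasks = {tid: Task(tid) for tid in task_ids}
        self.idle = deque(task_ids)  # assumption: tasks are assigned in order

    def schedule(self, worker: str) -> Optional[str]:
        """Assign the next idle task to `worker`; return None if none remain."""
        if not self.idle:
            return None
        task = self.tasks[self.idle.popleft()]
        task.status = Status.EXECUTING
        task.worker = worker
        return task.task_id

    def report_finished(self, worker: str, task_id: str) -> Optional[str]:
        """Record a completion report and schedule a new task for that Worker."""
        task = self.tasks[task_id]
        if task.status is not Status.EXECUTING or task.worker != worker:
            raise ValueError(f"{task_id} is not executing at {worker}")
        task.status = Status.COMPLETED
        return self.schedule(worker)


master = Master(["map-0", "map-1", "map-2"])
first = master.schedule("worker-A")
second = master.schedule("worker-B")
next_for_a = master.report_finished("worker-A", first)
next_for_b = master.report_finished("worker-B", second)
```

In this run, worker-A finishes first and receives the one remaining task, while worker-B is told there is nothing left to do.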

Each Map task is assigned one or more chunks of the input file(s) and executes on them the code written by the user. The Map task creates a file for each Reduce task on the local disk of the Worker that executes the Map task. The Master is informed of the location and size of each of these files, and the Reduce task for which each is destined. When a Reduce task is assigned by the Master to a Worker process, that task is given all the files that form its input. The Reduce task executes code written by the user and writes its output to a file that is part of the surrounding distributed file system.
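The data flow in this paragraph can be simulated on a single machine. The sketch below rests on several assumptions that are not in the text: a temporary directory stands in for both the Workers' local disks and the distributed file system, keys are routed to Reduce tasks by a CRC32 hash, files are stored as JSON, and the user code is a word count. All function names are hypothetical.

```python
import json
import os
import tempfile
import zlib
from collections import defaultdict


def partition(key: str, num_reduce_tasks: int) -> int:
    # Assumption: a deterministic hash decides which Reduce task gets a key.
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


def run_map_task(map_id, chunks, num_reduce_tasks, local_dir, user_map):
    """Apply user_map to each chunk and write one file per Reduce task.

    Returns the reports the Master would receive: destination, path, size.
    """
    buckets = [[] for _ in range(num_reduce_tasks)]
    for chunk in chunks:
        for key, value in user_map(chunk):
            buckets[partition(key, num_reduce_tasks)].append((key, value))
    reports = []
    for reduce_id, pairs in enumerate(buckets):
        path = os.path.join(local_dir, f"map{map_id}-reduce{reduce_id}.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(pairs, f)  # written even when the bucket is empty
        reports.append({"reduce_task": reduce_id, "path": path,
                        "size": os.path.getsize(path)})
    return reports


def run_reduce_task(reduce_id, paths, output_dir, user_reduce):
    """Read every intermediate file for this task, group by key, reduce."""
    grouped = defaultdict(list)
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for key, value in json.load(f):
                grouped[key].append(value)
    result = {key: user_reduce(key, values)
              for key, values in sorted(grouped.items())}
    out_path = os.path.join(output_dir, f"output-{reduce_id}.json")
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(result, f)
    return out_path


def word_count_map(chunk):
    # Hypothetical user-written Map code: emit (word, 1) for each word.
    for word in chunk.split():
        yield word, 1


def word_count_reduce(key, values):
    # Hypothetical user-written Reduce code: sum the counts for one word.
    return sum(values)


def run_job(chunks_per_map_task, num_reduce_tasks):
    with tempfile.TemporaryDirectory() as workdir:
        files_for_reduce = defaultdict(list)  # the Master's record of files
        for map_id, chunks in enumerate(chunks_per_map_task):
            for report in run_map_task(map_id, chunks, num_reduce_tasks,
                                       workdir, word_count_map):
                files_for_reduce[report["reduce_task"]].append(report["path"])
        totals = {}
        for reduce_id in range(num_reduce_tasks):
            out_path = run_reduce_task(reduce_id, files_for_reduce[reduce_id],
                                       workdir, word_count_reduce)
            with open(out_path, encoding="utf-8") as f:
                totals.update(json.load(f))
        return totals


counts = run_job([["a b a"], ["b c"]], num_reduce_tasks=2)
```

Because each key is hashed to exactly one Reduce task, the per-task output files never share a key, so merging them into a single dictionary at the end loses nothing.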

2.2.6 Coping With Node Failures

The worst thing that can happen is that the compute node at which the Master is executing fails. In this case, the entire map-reduce job must be restarted. But only this one node can bring the entire process down; other failures will be
