Large-Scale File Systems and Map-Reduce - Mining of Massive Datasets

Databases Reference

In-Depth Information

managed by the Master, and the map-reduce job will complete eventually.

Suppose the compute node at which a Map worker resides fails. This fail-

ure will be detected by the Master, because it periodically pings the Worker

processes. All the Map tasks that were assigned to this Worker will have to

be redone, even if they had completed. The reason for redoing completed Map

tasks is that their output destined for the Reduce tasks resides at that compute

node, and is now unavailable to the Reduce tasks. The Master sets the status

of each of these Map tasks to idle and will schedule them on a Worker when

one becomes available. The Master must also inform each Reduce task that the

location of its input from that Map task has changed.

Dealing with a failure at the node of a Reduce worker is simpler. The Master

simply sets the status of its currently executing Reduce tasks to idle.

These

will be rescheduled on another reduce worker later.

2.3

Algorithms Using Map-Reduce

Map-reduce is not a solution to every problem, not even every problem that

profitably can use many compute nodes operating in parallel. As we mentioned

in Section 2.1.2, the entire distributed-file-system milieu makes sense only when

files are very large and are rarely updated in place. Thus, we would not expect

to use either a DFS or an implementation of map-reduce for managing on-

line retail sales, even though a large on-line retailer such as Amazon.com uses

thousands of compute nodes when processing requests over the Web. The reason

is that the principal operations on Amazon data involve responding to searches

for products, recording sales, and so on, processes that involve relatively little

calculation and that change the database. 2 On the other hand, Amazon might

use map-reduce to perform certain analytic queries on large amounts of data,

such as finding for each user those users whose buying patterns were most

similar.

The original purpose for which the Google implementation of map-reduce

was created was to execute very large matrix-vector multiplications as are

needed in the calculation of PageRank (See Chapter 5). We shall see that

matrix-vector and matrix-matrix calculations fit nicely into the map-reduce

style of computing. Another important class of operations that can use map-

reduce effectively are the relational-algebra operations. We shall examine the

map-reduce execution of these operations as well.

2.3.1

Matrix-Vector Multiplication by Map-Reduce

Suppose we have an n×n matrix M , whose element in row i and column j will

be denoted m ij . Suppose we also have a vector v of length n, whose jth element

is v j . Then the matrix-vector product is the vector x of length n, whose ith

2 Remember that even looking at a product you don't buy causes Amazon to remember

that you looked at it.

Mining of Massive Datasets

Search WWH ::

Custom Search

Home