2.7 Summary of Chapter 2
Cluster Computing : A common architecture for very large-scale applications is a cluster of compute nodes (processor chip, main memory, and disk). Compute nodes are mounted in racks, and the nodes on a rack are connected, typically by gigabit Ethernet. Racks are also connected by a high-speed network or switch.
Distributed File Systems : An architecture for very large-scale file systems has developed recently. Files are composed of chunks of about 64 megabytes, and each chunk is replicated several times, on different compute nodes or racks.
MapReduce : This programming system allows one to exploit parallelism inherent in cluster computing, and manages
the hardware failures that can occur during a long computation on many nodes. Many Map tasks and many Reduce
tasks are managed by a Master process. Tasks on a failed compute node are rerun by the Master.
The Map Function : This function is written by the user. It takes a collection of input objects and turns each into zero
or more key-value pairs. Keys are not necessarily unique.
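The Map function described above can be sketched as follows. This is a minimal word-count illustration, not the book's code; the function name and input type are assumptions.

```python
# A hedged sketch of a user-written Map function for word counting.
# One input object (here, a document as a string) becomes zero or more
# key-value pairs; note the keys (words) need not be unique.
def map_function(document):
    pairs = []
    for word in document.split():
        pairs.append((word, 1))  # emit (key, value) = (word, 1)
    return pairs
```

For example, applying it to the document "a b a" yields the pairs ("a", 1), ("b", 1), ("a", 1), with the key "a" appearing twice.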
The Reduce Function : A MapReduce programming system sorts all the key-value pairs produced by all the Map
tasks, forms all the values associated with a given key into a list and distributes key-list pairs to Reduce tasks. Each
Reduce task combines the elements on each list, by applying the function written by the user. The results produced
by all the Reduce tasks form the output of the MapReduce process.
Reducers : It is often convenient to refer to the application of the Reduce function to a single key and its associated
value list as a “reducer.”
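The sort-group-reduce behavior described in the last two items can be sketched as a small simulation. All names here are hypothetical; a real system would perform the grouping across many machines.

```python
from itertools import groupby
from operator import itemgetter

# A sketch of the grouping phase: sort all key-value pairs, form the
# list of values associated with each key, and apply the user-written
# Reduce function once per key. Each such application is a "reducer".
def shuffle_and_reduce(pairs, reduce_function):
    pairs = sorted(pairs, key=itemgetter(0))
    output = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        values = [v for _, v in group]
        output.append(reduce_function(key, values))  # one reducer
    return output

# A Reduce function for word count: sum the list of counts for a word.
def sum_reducer(key, values):
    return (key, sum(values))
```

For instance, `shuffle_and_reduce([("a", 1), ("b", 1), ("a", 1)], sum_reducer)` groups the two pairs with key "a" into one list and produces `[("a", 2), ("b", 1)]`.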
Hadoop : This programming system is an open-source implementation of a distributed file system (HDFS, the Hadoop Distributed File System) and MapReduce (Hadoop itself). It is available through the Apache Software Foundation.
Managing Compute-Node Failures : MapReduce systems support restart of tasks that fail because their compute node, or the rack containing that node, fails. Because Map and Reduce tasks deliver their output only after they finish, it is possible to restart a failed task without concern for possible repetition of the effects of that task. It is necessary to restart the entire job only if the node at which the Master executes fails.
Applications of MapReduce : While not all parallel algorithms are suitable for implementation in the MapReduce
framework, there are simple implementations of matrix-vector and matrix-matrix multiplication. Also, the principal
operators of relational algebra are easily implemented in MapReduce.
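The matrix-vector multiplication mentioned above can be sketched briefly. This assumes the vector fits in memory at every Map task (the simplest case); the function names are illustrative, not the book's.

```python
# A sketch of matrix-vector multiplication in MapReduce, assuming the
# vector v is available in full at every Map task.
def mv_map(entries, v):
    """Each matrix entry (i, j, m_ij) becomes the pair (i, m_ij * v[j])."""
    return [(i, m_ij * v[j]) for (i, j, m_ij) in entries]

def mv_reduce(i, products):
    """The reducer for key i sums its values, yielding component i
    of the product M v."""
    return (i, sum(products))
```

For the 2-by-2 matrix with rows (1, 2) and (3, 4) and the vector (1, 1), the Map phase emits (0, 1), (0, 2), (1, 3), (1, 4), and the reducers produce the components 3 and 7.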
Workflow Systems : MapReduce has been generalized to systems that support any acyclic collection of functions,
each of which can be instantiated by any number of tasks, each responsible for executing that function on a portion
of the data.
Recursive Workflows : When implementing a recursive collection of functions, it is not always possible to preserve the ability to restart any failed task, because recursive tasks may have produced output that was consumed by another task before the failure. A number of schemes for checkpointing parts of the computation, to allow restart of single tasks or restart of all tasks from a recent point, have been proposed.
Communication Cost : Many applications of MapReduce or similar systems do very simple things for each task. Then, the dominant cost is usually the cost of transporting data from where it is created to where it is used. In these cases, the efficiency of a MapReduce algorithm can be estimated by computing the sum of the sizes of the inputs to all the tasks.
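The estimate just described amounts to a simple sum. The toy figures below are hypothetical: they model a MapReduce join of relations with r and s tuples, where the Map tasks together read r + s tuples and the Reduce tasks together receive about r + s key-value pairs.

```python
# Communication cost estimated as the sum of the input sizes of all
# tasks. The relation sizes below are made-up illustrative values.
def communication_cost(task_input_sizes):
    return sum(task_input_sizes)

r, s = 1000, 2000
map_inputs = [r, s]        # Map tasks read the chunks of R and S
reduce_inputs = [r + s]    # Reduce tasks receive the mapped pairs
total = communication_cost(map_inputs + reduce_inputs)
```

With these numbers the total is 6000, i.e., 2(r + s), so the cost of this join grows linearly in the sizes of the relations.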
Multiway Joins : It is sometimes more efficient to replicate tuples of the relations involved in a join and have the join
of three or more relations computed as a single MapReduce job. The technique of Lagrangean multipliers can be
used to optimize the degree of replication for each of the participating relations.
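As one concrete instance of this optimization, consider a three-way chain join R(A,B) ⋈ S(B,C) ⋈ T(C,D) run as a single MapReduce job with k reducers; this specific setup and the function below are illustrative assumptions, not the book's code. Giving the join attributes B and C "shares" b and c with bc = k replicates each R-tuple c times and each T-tuple b times, while each S-tuple is sent once, for communication cost rc + s + tb; the Lagrangean-multiplier condition at the minimum is rc = tb.

```python
import math

# A sketch of choosing replication shares for the assumed chain join
# R(A,B) |><| S(B,C) |><| T(C,D) with k reducers. Minimizing the
# communication cost r*c + s + t*b subject to b*c = k gives r*c = t*b,
# hence the closed forms below; the optimal cost is 2*sqrt(k*r*t) + s.
def optimal_shares(r, s, t, k):
    b = math.sqrt(k * r / t)   # share for attribute B
    c = math.sqrt(k * t / r)   # share for attribute C
    cost = r * c + s + t * b
    return b, c, cost
```

For example, with r = t = 1000, s = 500, and k = 100, the shares are b = c = 10 and the cost is 20500, versus 10 · 1000 + 500 + 1000 = 11500 for... any unbalanced choice such as b = 1, c = 100, which costs 1000 · 100 + 500 + 1000 · 1 = 101500.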
Star Joins : Analytic queries often involve a very large fact table joined with smaller dimension tables. These joins can always be done efficiently by the multiway-join technique. An alternative is to distribute the fact table and replicate the dimension tables permanently, using the same strategy as would be used if we were taking the multiway join of the fact table and every dimension table.