2.7 Summary of Chapter 2
Cluster Computing : A common architecture for very large-scale applications is a cluster of compute nodes (processor chip, main memory, and disk). Compute nodes are mounted in racks, and the nodes on a rack are connected, typically by gigabit Ethernet. Racks are also connected by a high-speed network or switch.
Distributed File Systems : An architecture for very large-scale file systems has developed recently. Files are composed of chunks of about 64 megabytes, and each chunk is replicated several times, on different compute nodes or racks.
MapReduce : This programming system allows one to exploit parallelism inherent in cluster computing, and manages
the hardware failures that can occur during a long computation on many nodes. Many Map tasks and many Reduce
tasks are managed by a Master process. Tasks on a failed compute node are rerun by the Master.
The Map Function : This function is written by the user. It takes a collection of input objects and turns each into zero
or more key-value pairs. Keys are not necessarily unique.
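The Map function described above can be sketched as follows. This is a minimal word-count illustration, not the book's code; the function name and input type are assumptions.

```python
# A hedged sketch of a user-written Map function for word counting.
# One input object (here, a document as a string) becomes zero or more
# key-value pairs; note the keys (words) need not be unique.
def map_function(document):
    pairs = []
    for word in document.split():
        pairs.append((word, 1))  # emit (key, value) = (word, 1)
    return pairs
```

For example, applying it to the document "a b a" yields the pairs ("a", 1), ("b", 1), ("a", 1), with the key "a" appearing twice.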
The Reduce Function : A MapReduce programming system sorts all the key-value pairs produced by all the Map
tasks, forms all the values associated with a given key into a list and distributes key-list pairs to Reduce tasks. Each
Reduce task combines the elements on each list, by applying the function written by the user. The results produced
by all the Reduce tasks form the output of the MapReduce process.
Reducers : It is often convenient to refer to the application of the Reduce function to a single key and its associated
value list as a “reducer.”
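The sort-group-reduce behavior described in the last two items can be sketched as a small simulation. All names here are hypothetical; a real system would perform the grouping across many machines.

```python
from itertools import groupby
from operator import itemgetter

# A sketch of the grouping phase: sort all key-value pairs, form the
# list of values associated with each key, and apply the user-written
# Reduce function once per key. Each such application is a "reducer".
def shuffle_and_reduce(pairs, reduce_function):
    pairs = sorted(pairs, key=itemgetter(0))
    output = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        values = [v for _, v in group]
        output.append(reduce_function(key, values))  # one reducer
    return output

# A Reduce function for word count: sum the list of counts for a word.
def sum_reducer(key, values):
    return (key, sum(values))
```

For instance, `shuffle_and_reduce([("a", 1), ("b", 1), ("a", 1)], sum_reducer)` groups the two pairs with key "a" into one list and produces `[("a", 2), ("b", 1)]`.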
Hadoop : This programming system is an open-source implementation of a distributed file system (HDFS, the Hadoop Distributed File System) and MapReduce (Hadoop itself). It is available through the Apache Software Foundation.
Managing Compute-Node Failures : MapReduce systems support restart of tasks that fail because their compute node, or the rack containing that node, fails. Because Map and Reduce tasks deliver their output only after they finish, it is possible to restart a failed task without concern for possible repetition of the effects of that task. It is necessary to restart the entire job only if the node at which the Master executes fails.
Applications of MapReduce : While not all parallel algorithms are suitable for implementation in the MapReduce
framework, there are simple implementations of matrix-vector and matrix-matrix multiplication. Also, the principal
operators of relational algebra are easily implemented in MapReduce.
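The matrix-vector multiplication mentioned above can be sketched briefly. This assumes the vector fits in memory at every Map task (the simplest case); the function names are illustrative, not the book's.

```python
# A sketch of matrix-vector multiplication in MapReduce, assuming the
# vector v is available in full at every Map task.
def mv_map(entries, v):
    """Each matrix entry (i, j, m_ij) becomes the pair (i, m_ij * v[j])."""
    return [(i, m_ij * v[j]) for (i, j, m_ij) in entries]

def mv_reduce(i, products):
    """The reducer for key i sums its values, yielding component i
    of the product M v."""
    return (i, sum(products))
```

For the 2-by-2 matrix with rows (1, 2) and (3, 4) and the vector (1, 1), the Map phase emits (0, 1), (0, 2), (1, 3), (1, 4), and the reducers produce the components 3 and 7.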
Workflow Systems : MapReduce has been generalized to systems that support any acyclic collection of functions,
each of which can be instantiated by any number of tasks, each responsible for executing that function on a portion
of the data.
Recursive Workflows : When implementing a recursive collection of functions, it is not always possible to preserve the ability to restart any failed task, because recursive tasks may have produced output that was consumed by another task before the failure. A number of schemes for checkpointing parts of the computation, to allow restart of single tasks or restart of all tasks from a recent point, have been proposed.
Communication Cost : Many applications of MapReduce or similar systems do very simple things for each task. Then, the dominant cost is usually the cost of transporting data from where it is created to where it is used. In these cases, the efficiency of a MapReduce algorithm can be estimated by computing the sum of the sizes of the inputs to all the tasks.
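The estimate just described amounts to a simple sum. The toy figures below are hypothetical: they model a MapReduce join of relations with r and s tuples, where the Map tasks together read r + s tuples and the Reduce tasks together receive about r + s key-value pairs.

```python
# Communication cost estimated as the sum of the input sizes of all
# tasks. The relation sizes below are made-up illustrative values.
def communication_cost(task_input_sizes):
    return sum(task_input_sizes)

r, s = 1000, 2000
map_inputs = [r, s]        # Map tasks read the chunks of R and S
reduce_inputs = [r + s]    # Reduce tasks receive the mapped pairs
total = communication_cost(map_inputs + reduce_inputs)
```

With these numbers the total is 6000, i.e., 2(r + s), so the cost of this join grows linearly in the sizes of the relations.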
Multiway Joins : It is sometimes more efficient to replicate tuples of the relations involved in a join and have the join
of three or more relations computed as a single MapReduce job. The technique of Lagrangean multipliers can be
used to optimize the degree of replication for each of the participating relations.
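As one concrete instance of this optimization, consider a three-way chain join R(A,B) ⋈ S(B,C) ⋈ T(C,D) run as a single MapReduce job with k reducers; this specific setup and the function below are illustrative assumptions, not the book's code. Giving the join attributes B and C "shares" b and c with bc = k replicates each R-tuple c times and each T-tuple b times, while each S-tuple is sent once, for communication cost rc + s + tb; the Lagrangean-multiplier condition at the minimum is rc = tb.

```python
import math

# A sketch of choosing replication shares for the assumed chain join
# R(A,B) |><| S(B,C) |><| T(C,D) with k reducers. Minimizing the
# communication cost r*c + s + t*b subject to b*c = k gives r*c = t*b,
# hence the closed forms below; the optimal cost is 2*sqrt(k*r*t) + s.
def optimal_shares(r, s, t, k):
    b = math.sqrt(k * r / t)   # share for attribute B
    c = math.sqrt(k * t / r)   # share for attribute C
    cost = r * c + s + t * b
    return b, c, cost
```

For example, with r = t = 1000, s = 500, and k = 100, the shares are b = c = 10 and the cost is 20500, versus 10 · 1000 + 500 + 1000 = 11500 for... any unbalanced choice such as b = 1, c = 100, which costs 1000 · 100 + 500 + 1000 · 1 = 101500.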
Star Joins : Analytic queries often involve a very large fact table joined with smaller dimension tables. These joins can always be done efficiently by the multiway-join technique. An alternative is to distribute the fact table and replicate the dimension tables permanently, using the same strategy as would be used if we were taking the multiway join of the fact table and every dimension table.