tasks, each associated with a vector of buckets, one for each of the key attributes A_1, A_2, . . . , A_m. Suppose the number of buckets into which we hash A_i is a_i. Naturally, a_1 a_2 · · · a_m = k. Finally, suppose each dimension table D_i has size d_i, and the size of the fact table is much larger than any of these sizes. Find the values of the a_i's that minimize the cost of taking the star join as one map-reduce operation.
2.6 Summary of Chapter 2
✦ Cluster Computing: A common architecture for very large-scale applications is a cluster of compute nodes (processor chip, main memory, and disk). Compute nodes are mounted in racks, and the nodes on a rack are connected, typically by gigabit Ethernet. Racks are also connected by a high-speed network or switch.
✦ Distributed File Systems: An architecture for very large-scale file systems has developed recently. Files are composed of chunks of about 64 megabytes, and each chunk is replicated several times, on different compute nodes or racks.
✦ Map-Reduce: This programming system allows one to exploit the parallelism inherent in cluster computing, and it manages the hardware failures that can occur during a long computation on many nodes. Many Map tasks and many Reduce tasks are managed by a Master process. Tasks on a failed compute node are rerun by the Master.
✦ The Map Function: This function is written by the user. It takes a collection of input objects and turns each into zero or more key-value pairs. Key values are not necessarily unique.
✦ The Reduce Function: A map-reduce programming system sorts all the key-value pairs produced by all the Map tasks, forms all the values associated with a given key into a list, and distributes key-list pairs to Reduce tasks. Each Reduce task combines the elements on each list by applying the function written by the user. The results produced by all the Reduce tasks form the output of the map-reduce process. A minimal sketch of both the Map and Reduce functions appears after this list.
✦ Hadoop: This programming system is an open-source implementation of a distributed file system (HDFS, the Hadoop Distributed File System) and map-reduce (Hadoop itself). It is available through the Apache Foundation.
✦ Managing Compute-Node Failures: Map-reduce systems support restart of tasks that fail because their compute node, or the rack containing that node, fails. Because Map and Reduce tasks deliver their output only after they finish, it is possible to restart a failed task without concern for effects of its previous, incomplete execution.
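
To make the Map and Reduce functions concrete, here is a minimal word-count sketch in Python. It is not Hadoop code and not an API from the chapter; it only simulates, in a single process, what a map-reduce system does between the two functions: apply the user's Map function to every input object, sort and group the resulting key-value pairs by key, and apply the user's Reduce function to each key-list pair. The function names map_function, reduce_function, and simulate_map_reduce are hypothetical names chosen for this illustration.

    from itertools import groupby
    from operator import itemgetter

    def map_function(document):
        # User-written Map function: emit (word, 1) for each word.
        # Keys (words) are not necessarily unique within one document.
        for word in document.split():
            yield (word, 1)

    def reduce_function(key, values):
        # User-written Reduce function: combine the list of values for one key.
        return (key, sum(values))

    def simulate_map_reduce(inputs):
        # Stand-in for the system's work between Map and Reduce:
        # sort all key-value pairs by key, group the values for each key
        # into a list, and apply the Reduce function to each key-list pair.
        pairs = [kv for doc in inputs for kv in map_function(doc)]
        pairs.sort(key=itemgetter(0))
        return [reduce_function(key, [v for _, v in group])
                for key, group in groupby(pairs, key=itemgetter(0))]

    if __name__ == "__main__":
        docs = ["the cat sat", "the dog sat"]
        print(simulate_map_reduce(docs))
        # [('cat', 1), ('dog', 1), ('sat', 2), ('the', 2)]

In a real map-reduce system the sorting, grouping, and distribution of key-list pairs are done by the framework across many compute nodes, and the Master reruns Map or Reduce tasks whose nodes fail, as described in the summary points above.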