tasks, each associated with a vector of buckets, one for each of the key attributes A_1, A_2, . . . , A_m. Suppose the number of buckets into which we hash A_i is a_i. Naturally, a_1 a_2 · · · a_m = k. Finally, suppose each dimension table D_i has size d_i, and the size of the fact table is much larger than any of these sizes. Find the values of the a_i's that minimize the cost of taking the star join as one map-reduce operation.
2.6 Summary of Chapter 2
✦ Cluster Computing: A common architecture for very large-scale applications is a cluster of compute nodes (processor chip, main memory, and disk). Compute nodes are mounted in racks, and the nodes on a rack are connected, typically by gigabit Ethernet. Racks are also connected by a high-speed network or switch.
✦ Distributed File Systems: An architecture for very large-scale file systems has developed recently. Files are composed of chunks of about 64 megabytes, and each chunk is replicated several times, on different compute nodes or racks.
✦ Map-Reduce: This programming system allows one to exploit the parallelism inherent in cluster computing, and it manages the hardware failures that can occur during a long computation on many nodes. Many Map tasks and many Reduce tasks are managed by a Master process. Tasks on a failed compute node are rerun by the Master.
✦ The Map Function: This function is written by the user. It takes a collection of input objects and turns each into zero or more key-value pairs. Key values are not necessarily unique.
✦ The Reduce Function: A map-reduce programming system sorts all the key-value pairs produced by all the Map tasks, forms all the values associated with a given key into a list, and distributes key-list pairs to Reduce tasks. Each Reduce task combines the elements on each list by applying the function written by the user. The results produced by all the Reduce tasks form the output of the map-reduce process. A minimal sketch of both the Map and Reduce functions appears after this list.
✦ Hadoop: This programming system is an open-source implementation of a distributed file system (HDFS, the Hadoop Distributed File System) and map-reduce (Hadoop itself). It is available through the Apache Foundation.
✦ Managing Compute-Node Failures: Map-reduce systems support restart of tasks that fail because their compute node, or the rack containing that node, fails. Because Map and Reduce tasks deliver their output only after they finish, it is possible to restart a failed task without concern for effects of its previous, incomplete execution.
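
To make the Map and Reduce functions concrete, here is a minimal word-count sketch in Python. It is not Hadoop code and not an API from the chapter; it only simulates, in a single process, what a map-reduce system does between the two functions: apply the user's Map function to every input object, sort and group the resulting key-value pairs by key, and apply the user's Reduce function to each key-list pair. The function names map_function, reduce_function, and simulate_map_reduce are hypothetical names chosen for this illustration.

    from itertools import groupby
    from operator import itemgetter

    def map_function(document):
        # User-written Map function: emit (word, 1) for each word.
        # Keys (words) are not necessarily unique within one document.
        for word in document.split():
            yield (word, 1)

    def reduce_function(key, values):
        # User-written Reduce function: combine the list of values for one key.
        return (key, sum(values))

    def simulate_map_reduce(inputs):
        # Stand-in for the system's work between Map and Reduce:
        # sort all key-value pairs by key, group the values for each key
        # into a list, and apply the Reduce function to each key-list pair.
        pairs = [kv for doc in inputs for kv in map_function(doc)]
        pairs.sort(key=itemgetter(0))
        return [reduce_function(key, [v for _, v in group])
                for key, group in groupby(pairs, key=itemgetter(0))]

    if __name__ == "__main__":
        docs = ["the cat sat", "the dog sat"]
        print(simulate_map_reduce(docs))
        # [('cat', 1), ('dog', 1), ('sat', 2), ('the', 2)]

In a real map-reduce system the sorting, grouping, and distribution of key-list pairs are done by the framework across many compute nodes, and the Master reruns Map or Reduce tasks whose nodes fail, as described in the summary points above.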