Large-Scale File Systems and Map-Reduce - Mining of Massive Datasets

Figure 2.1: Compute nodes are organized into racks, and racks are interconnected by a switch.

disk crashes, the files would be lost forever. We discuss file management in Section 2.1.2.

2. Computations must be divided into tasks, such that if any one task fails to execute to completion, it can be restarted without affecting other tasks. This strategy is followed by the map-reduce programming system that we introduce in Section 2.2.
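The restart property in point 2 can be illustrated with a minimal sketch. This is not the map-reduce system itself: the names `run_all` and `flaky_square`, the retry limit, and the simulated failure are all hypothetical, chosen only for illustration. Because each task is independent, a failed task is simply queued to run again while results already produced by other tasks are left untouched.

```python
from typing import Callable, Dict, Hashable


def run_all(tasks: Dict[Hashable, Callable[[], object]],
            max_attempts: int = 3) -> Dict[Hashable, object]:
    """Run independent tasks, restarting only the ones that fail."""
    results: Dict[Hashable, object] = {}
    attempts = {task_id: 0 for task_id in tasks}
    pending = list(tasks)
    while pending:
        task_id = pending.pop(0)
        attempts[task_id] += 1
        try:
            results[task_id] = tasks[task_id]()
        except Exception:
            if attempts[task_id] >= max_attempts:
                raise  # give up after too many restarts (assumed policy)
            pending.append(task_id)  # restart later; finished results are kept
    return results


# Hypothetical task that fails on its first attempt, then succeeds.
calls = {"count": 0}


def flaky_square() -> int:
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("simulated node failure")
    return 7 * 7


results = run_all({"a": lambda: 1 + 1, "b": flaky_square, "c": lambda: 3 * 3})
print(sorted(results.items()))  # [('a', 2), ('b', 49), ('c', 9)]
```

Tasks "a" and "c" each run exactly once; only "b" is executed a second time.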

2.1.2 Large-Scale File-System Organization

To exploit cluster computing, files must look and behave somewhat differently from the conventional file systems found on single computers. This new file system, often called a distributed file system or DFS (although this term has had other meanings in the past), is typically used as follows.

• Files can be enormous, possibly a terabyte in size. If you have only small files, there is no point using a DFS for them.

• Files are rarely updated. Rather, they are read as data for some calculation, and possibly additional data is appended to files from time to time. For example, an airline reservation system would not be suitable for a DFS, even if the data were very large, because the data is changed so frequently.

Files are divided into chunks, which are typically 64 megabytes in size. Chunks are replicated, perhaps three times, at three different compute nodes.
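As a rough illustration of these numbers, the sketch below computes how many chunks a file needs and assigns each chunk's replicas to distinct nodes. The helper names, the node names, and the round-robin placement are assumptions made for illustration; a real DFS chooses replica locations with its own policy. Sizes are taken in binary units (1 megabyte = 2^20 bytes), which is also an assumption.

```python
from typing import Dict, List

CHUNK_SIZE = 64 * 2**20  # 64 MB per chunk, as in the text (binary units assumed)
REPLICAS = 3             # "perhaps three times"


def num_chunks(file_size: int, chunk_size: int = CHUNK_SIZE) -> int:
    """Number of chunks needed to hold file_size bytes (integer ceiling)."""
    return (file_size + chunk_size - 1) // chunk_size


def place_replicas(n_chunks: int, nodes: List[str],
                   replicas: int = REPLICAS) -> Dict[int, List[str]]:
    """Assign each chunk to `replicas` distinct nodes (hypothetical round-robin)."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        chunk: [nodes[(chunk + r) % len(nodes)] for r in range(replicas)]
        for chunk in range(n_chunks)
    }


one_terabyte = 2**40
print(num_chunks(one_terabyte))  # 16384

nodes = ["node0", "node1", "node2", "node3", "node4"]  # hypothetical node names
placement = place_replicas(4, nodes)
print(placement[0])  # ['node0', 'node1', 'node2']
```

With three replicas per chunk, a one-terabyte file occupies 3 × 16384 = 49152 chunk copies across the cluster.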

Moreover, the nodes holding copies of one chunk should be located on different
