GFS has been designed with the following assumptions:
• The system is built on top of commodity hardware that often fails.
• The system stores a modest number of large files; multi-GB files are
common and should be handled efficiently. Small files must be
supported, but there is no need to optimize for them.
• The workloads primarily consist of two kinds of reads: large stream-
ing reads and small random reads.
• The workloads also include many large sequential writes that append
data to files.
• High sustained bandwidth is more important than low latency.
The architecture of the file system is organized into a single master, which
contains the metadata of the entire file system, and a collection of chunk
servers, which provide storage space. From a logical point of view, the system
is composed of a collection of software daemons, which implement either the
master server or the chunk server. A file is a collection of chunks, whose
size can be configured at the file system level. Chunks are replicated on
multiple nodes in order to tolerate failures. Clients contact the master
server to identify the specific chunk of a file they want to access; once the
chunk is identified, the interaction happens directly between the client and
the chunk server (a minimal sketch of this read path follows the paragraph).
Applications interact with the file system through a specific interface
supporting the usual operations for file creation, deletion, read, and write.
The interface also supports snapshot and record append operations, which
are frequently performed by applications. GFS has been designed with the
awareness that failures in a large distributed infrastructure are the norm
rather than the exception; therefore, specific attention has been given to
implementing a highly available, lightweight, and fault-tolerant
infrastructure. The potential single point of failure introduced by the
single-master architecture is addressed by allowing the master node to be
replicated on any other node of the infrastructure.
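The read path through this architecture can be illustrated with a short sketch. The following Python fragment is a minimal, in-memory model of the interaction described above; the class and method names (Master, ChunkServer, lookup_chunk, read_chunk), as well as the fixed 64 MB chunk size, are illustrative assumptions rather than the actual GFS interfaces.

CHUNK_SIZE = 64 * 1024 * 1024  # example chunk size; configurable at the file system level


class Master:
    """Holds only metadata: file path -> list of (chunk handle, replica locations)."""

    def __init__(self):
        self.metadata = {}  # e.g. {"/logs/web.log": [("handle-0", ["cs1", "cs7", "cs9"])]}

    def lookup_chunk(self, path, chunk_index):
        handle, replicas = self.metadata[path][chunk_index]
        return handle, replicas


class ChunkServer:
    """Stores the actual chunk data, indexed by chunk handle."""

    def __init__(self):
        self.chunks = {}  # chunk handle -> bytes

    def read_chunk(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]


def client_read(master, chunk_servers, path, byte_offset, length):
    # 1. Translate the byte offset into a chunk index and ask the master
    #    for the chunk handle and the replicas that store it.
    chunk_index = byte_offset // CHUNK_SIZE
    handle, replicas = master.lookup_chunk(path, chunk_index)

    # 2. Read the data directly from one of the replicas; the master is
    #    no longer involved. Reads spanning two chunks are ignored here
    #    to keep the sketch short.
    server = chunk_servers[replicas[0]]
    return server.read_chunk(handle, byte_offset % CHUNK_SIZE, length)

Keeping the master out of the data path in this way is what allows a single metadata server to coordinate a large number of chunk servers without becoming a bandwidth bottleneck.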
17.2.2 Google's BigTable
BigTable is a distributed storage system designed to scale up to petabytes
of data across thousands of servers. BigTable provides storage support for
several Google applications that expose different types of workloads,
ranging from throughput-oriented batch-processing jobs to latency-sensitive
serving of data to end users. BigTable's key design goals are wide
applicability, scalability, high performance, and high availability. To
achieve these goals, BigTable organizes data storage in tables whose rows
are distributed over the underlying distributed file system, which is the
Google File System. From a logical point of view, a table is a
multidimensional sorted map indexed by a key represented by a string of
arbitrary length.
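To make the logical view concrete, the following Python fragment sketches a table as an in-memory sorted map. The key combines a row string, a column name, and a timestamp, in line with BigTable's multidimensional indexing; the class Table and the methods put and scan_row are illustrative names, not the BigTable API, and a sorted list stands in for the distributed, persistent storage that BigTable actually builds on the Google File System.

import bisect
import time


class Table:
    """A table as a sorted map from (row, column, timestamp) to an uninterpreted value."""

    def __init__(self):
        self._keys = []   # sorted list of (row, column, timestamp) keys
        self._cells = {}  # (row, column, timestamp) -> value

    def put(self, row, column, value, timestamp=None):
        key = (row, column, timestamp if timestamp is not None else time.time())
        bisect.insort(self._keys, key)  # keep keys globally sorted
        self._cells[key] = value

    def scan_row(self, row):
        # Because keys are sorted, all cells of one row are contiguous and
        # can be returned with a single range scan.
        lo = bisect.bisect_left(self._keys, (row,))
        hi = bisect.bisect_left(self._keys, (row + "\x00",))
        return [(key, self._cells[key]) for key in self._keys[lo:hi]]


table = Table()
table.put("com.example.www", "contents:html", "<html>...</html>")
table.put("com.example.www", "anchor:referrer.org", "Example")
print(table.scan_row("com.example.www"))

Keeping the map sorted by the row key is what allows BigTable to partition a table into contiguous row ranges that can be distributed across many servers.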