Another important issue to understand in the GFS architecture is that the master node is a single point of failure (SPOF), since it holds all the metadata that tracks the chunks and their state. To mitigate this risk, GFS was designed to have the master keep its metadata in memory for speed, write an operation log to the master's local disk, and replicate that log to remote nodes. This way, if the master node crashes, a shadow master can be up and running almost instantly.
The master stores three types of metadata (modeled concretely in the sketch after this list):
1. File and chunk namespaces.
2. Mapping from files to chunks (i.e., the chunks that make up each file).
3. Locations of each chunk's replicas. Replica locations are held by each chunk server rather than persisted by the master; each server reports the chunks it holds to the master at startup or when it is added to the cluster. Since the master controls chunk placement, it keeps this metadata current as new chunks are written.
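To make these three metadata types concrete, here is a minimal Python sketch of a master's in-memory state; the class, method, and field names are illustrative assumptions rather than GFS's actual implementation:

from collections import defaultdict

class MasterMetadata:
    # Illustrative model of the three metadata types held in the master's memory.

    def __init__(self):
        # 1. File and chunk namespaces: full path -> file attributes.
        self.namespace = {}
        # 2. File-to-chunk mapping: path -> ordered list of chunk handles.
        self.file_chunks = defaultdict(list)
        # 3. Replica locations: chunk handle -> set of chunk-server addresses.
        #    Rebuilt from chunk-server reports at startup rather than persisted.
        self.chunk_locations = defaultdict(set)

    def register_chunkserver(self, server, chunk_handles):
        # At startup, or when a chunk server joins the cluster, the server
        # reports the chunks it holds.
        for handle in chunk_handles:
            self.chunk_locations[handle].add(server)

    def record_new_chunk(self, path, handle, servers):
        # The master controls placement, so it updates the file mapping and
        # the replica locations as each new chunk is written.
        self.file_chunks[path].append(handle)
        self.chunk_locations[handle] = set(servers)

meta = MasterMetadata()
meta.register_chunkserver("server-a", ["chunk-1"])
meta.record_new_chunk("/logs/2004-10.log", "chunk-2", ["server-a", "server-b", "server-c"])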
The master tracks the health of the entire cluster through periodic handshakes with all the chunk servers. Checksums are verified periodically to detect data corruption, because at this volume and scale of processing, data can become corrupt or stale.
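As a rough sketch of this handshake loop (the function names, checksum algorithm, and reporting interface are assumptions for illustration, not GFS's actual protocol), a chunk server might periodically verify its chunks and report the results to the master:

import hashlib
import time

def verify_chunk(data: bytes, stored_checksum: str) -> bool:
    # Detect silent corruption by recomputing the chunk's checksum.
    return hashlib.sha256(data).hexdigest() == stored_checksum

def heartbeat_loop(report, chunks, interval_seconds=5.0, rounds=3):
    # Periodic handshake with the master: report held chunks and any corruption.
    # `report` stands in for the RPC to the master; `chunks` maps chunk handles
    # to (data, stored_checksum) pairs held by this chunk server.
    for _ in range(rounds):
        corrupted = [handle for handle, (data, checksum) in chunks.items()
                     if not verify_chunk(data, checksum)]
        report(held=sorted(chunks), corrupted=corrupted)
        time.sleep(interval_seconds)

# Example: one intact chunk, one with a mismatched (simulated stale) checksum.
chunks = {
    "chunk-1": (b"payload", hashlib.sha256(b"payload").hexdigest()),
    "chunk-2": (b"payload", "stale-or-wrong-checksum"),
}
heartbeat_loop(lambda **kw: print(kw), chunks, interval_seconds=0, rounds=1)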
To simplify recovery from corruption, GFS appends data as it arrives rather than updating an existing data set; this append-only model makes it possible to recover from corruption or failure quickly. When corruption is detected, data is restored from a combination of frequent checkpoints, snapshots, and replicas, with minimal chance of data loss. The architecture can make data unavailable for a short period, but it does not serve corrupted data.
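The following minimal sketch illustrates the append-only idea with per-record checksums and a recovery pass (the record format, checksum algorithm, and function names are illustrative assumptions; GFS's actual log format and checksum granularity differ):

import hashlib
import io
import struct

def append_record(log, payload: bytes):
    # Append-only write: length header + checksum + payload.
    # Existing data is never updated in place.
    checksum = hashlib.sha256(payload).digest()
    log.write(struct.pack(">I", len(payload)))
    log.write(checksum)
    log.write(payload)

def recover(log):
    # Replay the log from the start, keeping every record whose checksum
    # verifies; truncate at the first corrupt or incomplete record.
    records = []
    while True:
        header = log.read(4 + 32)
        if len(header) < 36:
            break
        length = struct.unpack(">I", header[:4])[0]
        checksum = header[4:]
        payload = log.read(length)
        if hashlib.sha256(payload).digest() != checksum:
            break  # a replica or checkpoint would supply the lost tail
        records.append(payload)
    return records

log = io.BytesIO()
append_record(log, b"first record")
append_record(log, b"second record")
log.seek(0)
assert recover(log) == [b"first record", b"second record"]

Because writes only ever extend the log, recovery can simply replay it up to the last verifiable record and fall back to replicas for anything beyond that point.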
The GFS architecture has the following strengths:
Availability:
Triple replication-based redundancy (or more if you choose).
Chunk replication.
Rapid failovers for any master failure.
Automatic replication management (sketched in code after this list).
Performance:
The dominant workload for GFS is large streaming reads over large data sets, which, as the architecture discussion shows, the design handles well.
Writes are mostly appends rather than in-place updates to chunks, which keeps the chunks highly available.
Management:
GFS manages itself through multiple failure modes.
Automatic load balancing.
Storage management and pooling.
Chunk management.
Failover management.
Cost:
Not a constraint, due to the use of commodity hardware and Linux platforms.
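As a minimal illustration of the replication-based redundancy and automatic replication management listed above (the random placement policy here is a deliberate simplification; GFS's real policy also weighs factors such as rack placement and disk utilization):

import random

def place_replicas(servers, replication_factor=3):
    # Choose distinct chunk servers for a new chunk's replicas
    # (GFS defaults to three copies).
    if len(servers) < replication_factor:
        raise ValueError("not enough chunk servers for the requested redundancy")
    return set(random.sample(servers, replication_factor))

def re_replicate(chunk_locations, servers, replication_factor=3):
    # Automatic replication management: after a server failure, restore each
    # chunk that has fallen below the target number of replicas.
    for handle, replicas in chunk_locations.items():
        missing = replication_factor - len(replicas)
        if missing > 0:
            candidates = [s for s in servers if s not in replicas]
            replicas.update(random.sample(candidates, missing))

# Example: server s3 fails, so any chunk it held is re-replicated elsewhere.
servers = ["s1", "s2", "s3", "s4", "s5"]
locations = {"chunk-1": place_replicas(servers)}
locations["chunk-1"].discard("s3")
servers.remove("s3")
re_replicate(locations, servers)
assert len(locations["chunk-1"]) == 3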
Google combined the scalability and processing power of the GFS architecture with the first versions of the MapReduce programming model, which executes on top of the file system. Google has since deployed, and continues to develop, several other proprietary architectural advances, but these are outside the scope of this chapter.