Another important issue to understand in the GFS architecture is that the master node is a single point of failure (SPOF), since it holds all the metadata that tracks the chunks and their state. To mitigate this risk, GFS was designed to have the master keep its metadata in memory for speed, write an operation log to the master's local disk, and replicate that log to remote nodes. This way, if the master node crashes, a shadow master can be up and running almost instantly.
The master stores three types of metadata (modeled concretely in the sketch after this list):
1. File and chunk namespaces.
2. Mapping from files to chunks (i.e., the chunks that make up each file).
3. Locations of each chunk's replicas. Replica locations are held by each chunk server rather than persisted by the master; each server reports the chunks it holds to the master at startup or when it is added to the cluster. Since the master controls chunk placement, it keeps this metadata current as new chunks are written.
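To make these three metadata types concrete, here is a minimal Python sketch of a master's in-memory state; the class, method, and field names are illustrative assumptions rather than GFS's actual implementation:

from collections import defaultdict

class MasterMetadata:
    # Illustrative model of the three metadata types held in the master's memory.

    def __init__(self):
        # 1. File and chunk namespaces: full path -> file attributes.
        self.namespace = {}
        # 2. File-to-chunk mapping: path -> ordered list of chunk handles.
        self.file_chunks = defaultdict(list)
        # 3. Replica locations: chunk handle -> set of chunk-server addresses.
        #    Rebuilt from chunk-server reports at startup rather than persisted.
        self.chunk_locations = defaultdict(set)

    def register_chunkserver(self, server, chunk_handles):
        # At startup, or when a chunk server joins the cluster, the server
        # reports the chunks it holds.
        for handle in chunk_handles:
            self.chunk_locations[handle].add(server)

    def record_new_chunk(self, path, handle, servers):
        # The master controls placement, so it updates the file mapping and
        # the replica locations as each new chunk is written.
        self.file_chunks[path].append(handle)
        self.chunk_locations[handle] = set(servers)

meta = MasterMetadata()
meta.register_chunkserver("server-a", ["chunk-1"])
meta.record_new_chunk("/logs/2004-10.log", "chunk-2", ["server-a", "server-b", "server-c"])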
The master tracks the health of the entire cluster through periodic handshakes with all the chunk servers. Checksums are verified periodically to detect data corruption, because at this volume and scale of processing, data can become corrupt or stale.
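As a rough sketch of this handshake loop (the function names, checksum algorithm, and reporting interface are assumptions for illustration, not GFS's actual protocol), a chunk server might periodically verify its chunks and report the results to the master:

import hashlib
import time

def verify_chunk(data: bytes, stored_checksum: str) -> bool:
    # Detect silent corruption by recomputing the chunk's checksum.
    return hashlib.sha256(data).hexdigest() == stored_checksum

def heartbeat_loop(report, chunks, interval_seconds=5.0, rounds=3):
    # Periodic handshake with the master: report held chunks and any corruption.
    # `report` stands in for the RPC to the master; `chunks` maps chunk handles
    # to (data, stored_checksum) pairs held by this chunk server.
    for _ in range(rounds):
        corrupted = [handle for handle, (data, checksum) in chunks.items()
                     if not verify_chunk(data, checksum)]
        report(held=sorted(chunks), corrupted=corrupted)
        time.sleep(interval_seconds)

# Example: one intact chunk, one with a mismatched (simulated stale) checksum.
chunks = {
    "chunk-1": (b"payload", hashlib.sha256(b"payload").hexdigest()),
    "chunk-2": (b"payload", "stale-or-wrong-checksum"),
}
heartbeat_loop(lambda **kw: print(kw), chunks, interval_seconds=0, rounds=1)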
To simplify recovery from corruption, GFS appends data as it arrives rather than updating an existing data set; this append-only model makes it possible to recover from corruption or failure quickly. When corruption is detected, data is restored from a combination of frequent checkpoints, snapshots, and replicas, with minimal chance of data loss. The architecture can make data unavailable for a short period, but it does not serve corrupted data.
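The following minimal sketch illustrates the append-only idea with per-record checksums and a recovery pass (the record format, checksum algorithm, and function names are illustrative assumptions; GFS's actual log format and checksum granularity differ):

import hashlib
import io
import struct

def append_record(log, payload: bytes):
    # Append-only write: length header + checksum + payload.
    # Existing data is never updated in place.
    checksum = hashlib.sha256(payload).digest()
    log.write(struct.pack(">I", len(payload)))
    log.write(checksum)
    log.write(payload)

def recover(log):
    # Replay the log from the start, keeping every record whose checksum
    # verifies; truncate at the first corrupt or incomplete record.
    records = []
    while True:
        header = log.read(4 + 32)
        if len(header) < 36:
            break
        length = struct.unpack(">I", header[:4])[0]
        checksum = header[4:]
        payload = log.read(length)
        if hashlib.sha256(payload).digest() != checksum:
            break  # a replica or checkpoint would supply the lost tail
        records.append(payload)
    return records

log = io.BytesIO()
append_record(log, b"first record")
append_record(log, b"second record")
log.seek(0)
assert recover(log) == [b"first record", b"second record"]

Because writes only ever extend the log, recovery can simply replay it up to the last verifiable record and fall back to replicas for anything beyond that point.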
The GFS architecture has the following strengths:
Availability:
Triple replication-based redundancy (or more if you choose).
Chunk replication.
Rapid failovers for any master failure.
Automatic replication management (sketched in code after this list).
Performance:
The dominant workload for GFS is large streaming reads over large data sets, which, as the architecture discussion shows, the design handles well.
Writes are mostly appends rather than in-place updates to chunks, which keeps the chunks highly available.
Management:
GFS manages itself through multiple failure modes.
Automatic load balancing.
Storage management and pooling.
Chunk management.
Failover management.
Cost:
Not a constraint, due to the use of commodity hardware and Linux platforms.
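As a minimal illustration of the replication-based redundancy and automatic replication management listed above (the random placement policy here is a deliberate simplification; GFS's real policy also weighs factors such as rack placement and disk utilization):

import random

def place_replicas(servers, replication_factor=3):
    # Choose distinct chunk servers for a new chunk's replicas
    # (GFS defaults to three copies).
    if len(servers) < replication_factor:
        raise ValueError("not enough chunk servers for the requested redundancy")
    return set(random.sample(servers, replication_factor))

def re_replicate(chunk_locations, servers, replication_factor=3):
    # Automatic replication management: after a server failure, restore each
    # chunk that has fallen below the target number of replicas.
    for handle, replicas in chunk_locations.items():
        missing = replication_factor - len(replicas)
        if missing > 0:
            candidates = [s for s in servers if s not in replicas]
            replicas.update(random.sample(candidates, missing))

# Example: server s3 fails, so any chunk it held is re-replicated elsewhere.
servers = ["s1", "s2", "s3", "s4", "s5"]
locations = {"chunk-1": place_replicas(servers)}
locations["chunk-1"].discard("s3")
servers.remove("s3")
re_replicate(locations, servers)
assert len(locations["chunk-1"]) == 3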
Google combined the scalability and processing power of the GFS architecture with the first versions of the MapReduce programming model, which executes on top of the file system. Google has since deployed, and continues to develop, several other proprietary architectural advances, but these are outside the scope of this chapter.