Introduction to Hadoop
Apache Hadoop is an open source implementation of two famous white papers from
Google: Google File System (GFS) (http://research.google.com/archive/gfs.html) and
Google MapReduce (http://research.google.com/archive/mapreduce.html). Vanilla Hadoop
consists of two modules: the Hadoop Distributed File System (HDFS) and MapReduce,
which are implementations of GFS and Google MapReduce, respectively. One may
consider HDFS to be the storage module and MapReduce the processing module.
HDFS
Let's start with an example. Assume you have 1 TB of data to read from a single machine
with a single hard disk. At a disk read rate of 100 MBps, it would take about 2 hours
and 45 minutes to read all the data. If you could split this data over 10 hard disks and read them
all in parallel, the read time would drop by a factor of 10, more or less. From a layman's
perspective, this is what HDFS does; it breaks the data into fixed-size blocks (64 MB by
default) and distributes them over a number of slave machines.
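The back-of-the-envelope numbers above are easy to check with a short sketch; the constants and function names here are purely illustrative:

```python
import math

TB = 10**12           # 1 TB in bytes (decimal units)
MB = 10**6
DISK_RATE = 100 * MB  # 100 MBps, in bytes per second

def read_time_seconds(data_bytes, disks=1):
    """Time to scan `data_bytes` when the read is spread evenly over `disks`."""
    return data_bytes / (DISK_RATE * disks)

single = read_time_seconds(TB)        # ~10,000 s, about 2 h 46 min
parallel = read_time_seconds(TB, 10)  # ~1,000 s, about 16.7 min

BLOCK_SIZE = 64 * 2**20               # the 64 MB default block size
num_blocks = math.ceil(TB / BLOCK_SIZE)

print(f"single disk : {single / 3600:.2f} h")
print(f"10 disks    : {parallel / 60:.1f} min")
print(f"64 MB blocks for 1 TB: {num_blocks}")
```

So 1 TB becomes roughly fifteen thousand 64 MB blocks, which HDFS can spread over the slave machines and scan in parallel.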
HDFS is a filesystem that runs on top of a regular filesystem; production installations
generally have ext3 filesystems running beneath HDFS. By distributing data across several
nodes, the storage layer scales linearly to a very large virtual store. For reliability,
data is stored redundantly: each block is replicated three times by default. HDFS is
architected in such a way that the replicas of a block are placed on different servers and,
if possible, on different racks. This protects the data against disk, server, or complete
rack failure. In the event of a disk or server failure, the affected blocks are re-replicated
to new locations to restore the replication factor. If this reminds you of Cassandra, or any other
distributed system, you are on the right track. As we will see very soon, unlike Cassandra,
HDFS has a single point of failure due to its master-slave design.
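As a rough model of that rack-aware placement (first replica on the writing node, second on a node in a different rack, third on another node in that same remote rack), one might sketch it as follows; the cluster map and function are invented for illustration and are not HDFS's actual code:

```python
import random

def place_replicas(cluster, local_node):
    """Illustrative model of HDFS's default 3-way replica placement.

    cluster: dict mapping rack name -> list of node names.
    Assumes at least two racks, and at least two nodes on the remote rack.
    """
    rack_of = {node: rack for rack, nodes in cluster.items() for node in nodes}
    replicas = [local_node]  # first replica: the node writing the block
    # Second replica: a node on a different rack than the writer's.
    other_racks = [r for r in cluster if r != rack_of[local_node]]
    remote_rack = random.choice(other_racks)
    second = random.choice(cluster[remote_rack])
    replicas.append(second)
    # Third replica: a different node on that same remote rack.
    third = random.choice([n for n in cluster[remote_rack] if n != second])
    replicas.append(third)
    return replicas

cluster = {"rack1": ["node1", "node2"], "rack2": ["node3", "node4"]}
print(place_replicas(cluster, "node1"))
```

With this layout, losing any single disk, server, or even all of rack1 still leaves at least one live copy of the block.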
Despite all these good features, HDFS has a few shortcomings too. They are as follows:
• HDFS is optimized for streaming reads. This means there is no efficient random access
within a file, and a random-access workload may not utilize the maximum data transfer rate.
• The NameNode (discussed later) is a single point of unavailability for HDFS.
• HDFS is better suited to large files than to many small ones.
• The append operation is not supported by default. However, one can change the
configuration to allow appends.
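On the Hadoop releases this description matches (the 1.x line with a 64 MB default block size), enabling appends meant setting a property in hdfs-site.xml; the property name shown is the one documented for those releases, so check your version's documentation before relying on it:

```xml
<!-- hdfs-site.xml: enable the append operation on older (1.x-era) Hadoop -->
<property>
  <name>dfs.support.append</name>
  <value>true</value>
</property>
```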