Introduction to Hadoop
Apache Hadoop is an open source implementation of two famous white papers from
Google: Google File System (GFS) (http://research.google.com/archive/gfs.html) and
Google MapReduce (http://research.google.com/archive/mapreduce.html). Vanilla Hadoop
consists of two modules: the Hadoop Distributed File System (HDFS) and MapReduce,
which are implementations of GFS and Google MapReduce, respectively. One may
consider HDFS to be the storage module and MapReduce the processing module.
HDFS
Let's start with an example. Assume you have 1 TB of data to read from a single machine
with a single hard disk. At a disk read rate of 100 MBps, it would take about 2 hours
and 45 minutes to read all the data. If you could split this data over 10 hard disks and read them
all in parallel, the read time would drop by a factor of 10, more or less. From a layman's
perspective, this is what HDFS does; it breaks the data into fixed-size blocks (64 MB by
default) and distributes them over a number of slave machines.
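The back-of-the-envelope numbers above are easy to check with a short sketch; the constants and function names here are purely illustrative:

```python
import math

TB = 10**12           # 1 TB in bytes (decimal units)
MB = 10**6
DISK_RATE = 100 * MB  # 100 MBps, in bytes per second

def read_time_seconds(data_bytes, disks=1):
    """Time to scan `data_bytes` when the read is spread evenly over `disks`."""
    return data_bytes / (DISK_RATE * disks)

single = read_time_seconds(TB)        # ~10,000 s, about 2 h 46 min
parallel = read_time_seconds(TB, 10)  # ~1,000 s, about 16.7 min

BLOCK_SIZE = 64 * 2**20               # the 64 MB default block size
num_blocks = math.ceil(TB / BLOCK_SIZE)

print(f"single disk : {single / 3600:.2f} h")
print(f"10 disks    : {parallel / 60:.1f} min")
print(f"64 MB blocks for 1 TB: {num_blocks}")
```

So 1 TB becomes roughly fifteen thousand 64 MB blocks, which HDFS can spread over the slave machines and scan in parallel.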
HDFS is a filesystem that runs on top of a regular filesystem; production installations
generally have ext3 filesystems running beneath HDFS. By distributing data across several
nodes, the storage layer scales linearly to a very large virtual store. For reliability,
data is stored redundantly: each block is replicated three times by default. HDFS is
architected in such a way that the replicas of a block are placed on different servers and,
if possible, on different racks. This protects the data against disk, server, or complete
rack failure. In the event of a disk or server failure, the affected blocks are re-replicated
to new locations to restore the replication factor. If this reminds you of Cassandra, or any other
distributed system, you are on the right track. As we will see very soon, unlike Cassandra,
HDFS has a single point of failure due to its master-slave design.
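As a rough model of that rack-aware placement (first replica on the writing node, second on a node in a different rack, third on another node in that same remote rack), one might sketch it as follows; the cluster map and function are invented for illustration and are not HDFS's actual code:

```python
import random

def place_replicas(cluster, local_node):
    """Illustrative model of HDFS's default 3-way replica placement.

    cluster: dict mapping rack name -> list of node names.
    Assumes at least two racks, and at least two nodes on the remote rack.
    """
    rack_of = {node: rack for rack, nodes in cluster.items() for node in nodes}
    replicas = [local_node]  # first replica: the node writing the block
    # Second replica: a node on a different rack than the writer's.
    other_racks = [r for r in cluster if r != rack_of[local_node]]
    remote_rack = random.choice(other_racks)
    second = random.choice(cluster[remote_rack])
    replicas.append(second)
    # Third replica: a different node on that same remote rack.
    third = random.choice([n for n in cluster[remote_rack] if n != second])
    replicas.append(third)
    return replicas

cluster = {"rack1": ["node1", "node2"], "rack2": ["node3", "node4"]}
print(place_replicas(cluster, "node1"))
```

With this layout, losing any single disk, server, or even all of rack1 still leaves at least one live copy of the block.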
Despite all these good features, HDFS has a few shortcomings too. They are as follows:
• HDFS is optimized for streaming reads. This means there is no efficient random access
within a file, and a random-access workload may not utilize the maximum data transfer rate.
• The NameNode (discussed later) is a single point of unavailability for HDFS.
• HDFS is better suited to large files than to many small ones.
• The append operation is not supported by default. However, one can change the
configuration to allow appends.
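On the Hadoop releases this description matches (the 1.x line with a 64 MB default block size), enabling appends meant setting a property in hdfs-site.xml; the property name shown is the one documented for those releases, so check your version's documentation before relying on it:

```xml
<!-- hdfs-site.xml: enable the append operation on older (1.x-era) Hadoop -->
<property>
  <name>dfs.support.append</name>
  <value>true</value>
</property>
```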