impossible to process. They had to find a way to solve this problem, and this led to the creation of the Google File System (GFS) and MapReduce.
GFS, or GoogleFS, is a filesystem created by Google to store its large volumes of data easily across multiple nodes in a cluster. Once the data is stored, Google uses MapReduce, a programming model it developed, to process (or query) the data stored in GFS efficiently. The MapReduce programming model implements a parallel, distributed algorithm on the cluster in which the processing moves to the nodes where the data resides; this produces results faster than waiting for the data to be moved to the processing, which can be a very time-consuming activity. Google found tremendous success with this architecture and released white papers for GFS in 2003 and MapReduce in 2004.
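The model is easiest to see with the classic word-count example. The following is a minimal, single-machine sketch in plain Java (no GFS or Hadoop involved) of the map, shuffle, and reduce steps; the class and method names are illustrative only, and on a real cluster each phase would run in parallel on the nodes that hold the data.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapReduceSketch {

    // Map phase: each input record (a line of text) becomes a set of (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: all values sharing a key are combined into one result.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> input = List.of(
                "big data is big",
                "move processing to the data not data to the processing");

        // Shuffle step: group the mapped pairs by key before reducing.
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String line : input) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }

        grouped.forEach((word, counts) ->
                System.out.println(word + " -> " + reduce(counts)));
    }
}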
Around 2002, Doug Cutting and Mike Cafarella were working on Nutch, an open source web search engine, and faced scalability problems when trying to store the billions of web pages that Nutch crawled every day. In 2004, the Nutch team discovered that the GFS architecture was the solution to their problem and started working on an implementation based on the GFS white paper. They called their filesystem the Nutch Distributed File System (NDFS). In 2005, they also implemented MapReduce for NDFS, based on Google's MapReduce white paper.
In 2006, the Nutch team realized that their implementations, NDFS and MapReduce, could be applied to more areas and could solve the problem of processing large volumes of data. This led to the formation of a project called Hadoop. Under Hadoop, NDFS was renamed the Hadoop Distributed File System (HDFS). After Doug Cutting joined Yahoo! in 2006, Hadoop received a lot of attention within Yahoo! and became a very important system, running successfully on a very large cluster (around 1,000 nodes). In 2008, Hadoop became one of Apache's top-level projects.
So, Apache Hadoop is a framework written in Java that:
• Is used for distributed storage and processing of large volumes of data, runs on top of a cluster, and can scale from a single computer to thousands of computers (see the storage sketch after this list)
• Uses the MapReduce programming model to process data
• Stores and processes data on every worker node (the nodes in the cluster responsible for storing and processing data) and handles hardware failures efficiently, providing high availability
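To make the storage side concrete, the following is a minimal sketch of writing and then reading a file in HDFS through Hadoop's Java FileSystem API. The NameNode URI and file path are placeholder assumptions for illustration; HDFS itself takes care of splitting the file into blocks and replicating them across the worker nodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; in practice this comes from the
        // cluster's core-site.xml configuration.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/example/hello.txt");

        // Write a small file; HDFS replicates its blocks across worker nodes.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("Hello, HDFS");
        }

        // Read the file back from the cluster.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        fs.close();
    }
}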