impossible to process. They had to find a way to solve this problem, and this led to the creation of the Google File System (GFS) and MapReduce.
GFS, or GoogleFS, is a filesystem created by Google to store its large volumes of data easily across multiple nodes in a cluster. Once the data is stored, Google uses MapReduce, a programming model it developed, to process (or query) the data stored in GFS efficiently. The MapReduce programming model implements a parallel, distributed algorithm on the cluster in which the processing moves to the nodes where the data resides; this produces results faster than waiting for the data to be moved to the processing, which can be a very time-consuming activity. Google found tremendous success with this architecture and released white papers for GFS in 2003 and MapReduce in 2004.
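The model is easiest to see with the classic word-count example. The following is a minimal, single-machine sketch in plain Java (no GFS or Hadoop involved) of the map, shuffle, and reduce steps; the class and method names are illustrative only, and on a real cluster each phase would run in parallel on the nodes that hold the data.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MapReduceSketch {

    // Map phase: each input record (a line of text) becomes a set of (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: all values sharing a key are combined into one result.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> input = List.of(
                "big data is big",
                "move processing to the data not data to the processing");

        // Shuffle step: group the mapped pairs by key before reducing.
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String line : input) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }

        grouped.forEach((word, counts) ->
                System.out.println(word + " -> " + reduce(counts)));
    }
}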
Around 2002, Doug Cutting and Mike Cafarella were working on Nutch, an open source web search engine, and faced scalability problems when trying to store the billions of web pages that Nutch crawled every day. In 2004, the Nutch team discovered that the GFS architecture was the solution to their problem and started working on an implementation based on the GFS white paper. They called their filesystem the Nutch Distributed File System (NDFS). In 2005, they also implemented MapReduce for NDFS, based on Google's MapReduce white paper.
In 2006, the Nutch team realized that their implementations, NDFS and MapReduce, could be applied to more areas and could solve the problem of processing large volumes of data. This led to the formation of a project called Hadoop. Under Hadoop, NDFS was renamed the Hadoop Distributed File System (HDFS). After Doug Cutting joined Yahoo! in 2006, Hadoop received a lot of attention within Yahoo! and became a very important system, running successfully on a very large cluster (around 1,000 nodes). In 2008, Hadoop became one of Apache's top-level projects.
So, Apache Hadoop is a framework written in Java that:
• Is used for distributed storage and processing of large volumes of data, runs on top of a cluster, and can scale from a single computer to thousands of computers (see the storage sketch after this list)
• Uses the MapReduce programming model to process data
• Stores and processes data on every worker node (the nodes in the cluster responsible for storing and processing data) and handles hardware failures efficiently, providing high availability
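To make the storage side concrete, the following is a minimal sketch of writing and then reading a file in HDFS through Hadoop's Java FileSystem API. The NameNode URI and file path are placeholder assumptions for illustration; HDFS itself takes care of splitting the file into blocks and replicating them across the worker nodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; in practice this comes from the
        // cluster's core-site.xml configuration.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/example/hello.txt");

        // Write a small file; HDFS replicates its blocks across worker nodes.
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("Hello, HDFS");
        }

        // Read the file back from the cluster.
        try (FSDataInputStream in = fs.open(file)) {
            System.out.println(in.readUTF());
        }

        fs.close();
    }
}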