The next few sections focus on Big Data platforms, including Hadoop and NoSQL. The goal of these discussions is to provide you with a concise perspective on the subject; many other books, whitepapers, and materials are available on these topics if you need deeper details and technical insight.
Hadoop
The most popular word in the industry at the time of this writing, Hadoop has taken the world by storm by providing a solution architecture that addresses Big Data processing on inexpensive commodity platforms, with fast scalability and parallel processing.
Hadoop started as an open-source search engine project called Nutch, begun in 2002 by Mike Cafarella and Doug Cutting. By early 2004, the team had developed an excellent crawler engine but hit a roadblock with the scalability of the search engine. Around the same time, Google published its papers on the Google File System (GFS) and MapReduce for the benefit of the open-source community. The Nutch team developed the Nutch Distributed File System (NDFS), an open-source distributed file system based on the architecture concepts of GFS, which solved the storage and associated scalability issues. In 2005, the Nutch team completed the port of the Nutch algorithms to the MapReduce programming model. The new architecture enabled the processing of large volumes of unstructured data with unsurpassed scalability.
In 2006, Cafarella and Cutting created a subproject under Apache Lucene, called it Hadoop (named after Doug Cutting's son's toy elephant), and released an early version to the open-source community. Yahoo adopted the project, sponsored its continued development, and deployed Hadoop widely within its own infrastructure. In January 2008, Yahoo released the first complete project release of Hadoop under open source.
The first generation of Hadoop consisted of the HDFS distributed file system (modeled after NDFS) and the MapReduce framework, along with a coordinator interface and an interface to write to and read from HDFS (sketched briefly after this paragraph). When Cutting and Cafarella conceived and implemented the first generation of this architecture in 2004, they were able to automate many of the operations involved in crawling and indexing for search, and improved both efficiency and scalability. Within a few months they had scaled the architecture to 20 nodes running Nutch without missing a heartbeat. This success prompted Yahoo to hire Cutting and adopt Hadoop as one of its core platforms. Yahoo kept the platform moving forward with constant innovation and research, and soon many committers and volunteer developers and testers were contributing to the growth of a healthy ecosystem around Hadoop.
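The read/write interface mentioned above survives today as the org.apache.hadoop.fs.FileSystem API. What follows is a minimal sketch of writing a file to HDFS and reading it back; the class name HdfsReadWrite, the path, and the file contents are hypothetical, and the Configuration object is assumed to pick up the cluster settings from core-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    // Reads core-site.xml/hdfs-site.xml from the classpath to locate the cluster.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical path, used only for illustration.
    Path file = new Path("/tmp/hdfs-example.txt");

    // Write a small file to HDFS (overwriting any existing copy).
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("hello, HDFS");
    }

    // Read the same file back.
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF());
    }
  }
}

The same code works unchanged whether the configured default file system is HDFS or the local disk; the Configuration decides which FileSystem implementation is returned.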
At the time of writing (2012), in the last three years we have seen two leading distributors of Hadoop with management tools and professional services emerge: Cloudera and Hortonworks. We have also seen Hadoop-based solutions emerge from IBM, Teradata, Oracle, and Microsoft, and from HP, SAP, and Dell in partnership with other providers and distributors.
The current list at Apache's website for Hadoop names the following top-level stable projects and releases, along with incubated projects that are still evolving:
Hadoop Common—the common utilities that support other Hadoop subprojects.
Hadoop Distributed File System (HDFS™)—a distributed file system that provides high-throughput access to application data.
Hadoop MapReduce—a software framework for distributed processing of large data sets on compute clusters (a minimal sketch of a MapReduce job follows this list).
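To make the MapReduce programming model concrete, here is a minimal sketch of the canonical word-count job written against the org.apache.hadoop.mapreduce API. The class names (WordCount, TokenizerMapper, IntSumReducer) are illustrative, and the input and output paths are assumed to arrive as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // The map step: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // The reduce step: sum the counts the framework has grouped by word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A typical invocation would be hadoop jar wordcount.jar WordCount /input /output. HDFS supplies the input splits to the mappers, and the MapReduce framework handles the shuffle, sort, and grouping between the two phases.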
 