The next few sections focus on Big Data platforms, including Hadoop and NoSQL. The goal of these discussions is to provide you with a concise perspective on the subject; many other books, whitepapers, and materials are available on these topics if you need deeper details and technical insight.
Hadoop
The most popular word in the industry at the time of this writing, Hadoop has taken the world by storm by providing a solution architecture that addresses Big Data processing on inexpensive commodity platforms, with fast scalability and parallel processing.
Hadoop started as an open-source search engine project called Nutch, begun in 2002 by Mike Cafarella and Doug Cutting. By early 2004, the team had developed an excellent crawler engine but hit a roadblock with the scalability of the search engine. Around the same time, Google published its papers on the Google File System (GFS) and MapReduce for the benefit of the open-source community. The Nutch team developed the Nutch Distributed File System (NDFS), an open-source distributed file system based on the architecture concepts of GFS, which solved the storage and associated scalability issues. In 2005, the Nutch team completed the port of the Nutch algorithms to the MapReduce programming model. The new architecture enabled the processing of large volumes of unstructured data with unsurpassed scalability.
In 2006, Cafarella and Cutting created a subproject under Apache Lucene, called it Hadoop (named after Doug Cutting's son's toy elephant), and released an early version to the open-source community. Yahoo adopted the project, sponsored its continued development, and deployed Hadoop widely within its own infrastructure. In January 2008, Yahoo released the first complete project release of Hadoop under open source.
The first generation of Hadoop consisted of the HDFS distributed file system (modeled after NDFS) and the MapReduce framework, along with a coordinator interface and an interface to write to and read from HDFS (sketched briefly after this paragraph). When Cutting and Cafarella conceived and implemented the first generation of this architecture in 2004, they were able to automate many of the operations involved in crawling and indexing for search, and improved both efficiency and scalability. Within a few months they had scaled the architecture to 20 nodes running Nutch without missing a heartbeat. This success prompted Yahoo to hire Cutting and adopt Hadoop as one of its core platforms. Yahoo kept the platform moving forward with constant innovation and research, and soon many committers and volunteer developers and testers were contributing to the growth of a healthy ecosystem around Hadoop.
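The read/write interface mentioned above survives today as the org.apache.hadoop.fs.FileSystem API. What follows is a minimal sketch of writing a file to HDFS and reading it back; the class name HdfsReadWrite, the path, and the file contents are hypothetical, and the Configuration object is assumed to pick up the cluster settings from core-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    // Reads core-site.xml/hdfs-site.xml from the classpath to locate the cluster.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical path, used only for illustration.
    Path file = new Path("/tmp/hdfs-example.txt");

    // Write a small file to HDFS (overwriting any existing copy).
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.writeUTF("hello, HDFS");
    }

    // Read the same file back.
    try (FSDataInputStream in = fs.open(file)) {
      System.out.println(in.readUTF());
    }
  }
}

The same code works unchanged whether the configured default file system is HDFS or the local disk; the Configuration decides which FileSystem implementation is returned.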
At the time of writing (2012), in the last three years we have seen two leading distributors of Hadoop with management tools and professional services emerge: Cloudera and Hortonworks. We have also seen Hadoop-based solutions emerge from IBM, Teradata, Oracle, and Microsoft, and from HP, SAP, and Dell in partnership with other providers and distributors.
The current list at Apache's website for Hadoop names the following top-level stable projects and releases, along with incubated projects that are still evolving:
Hadoop Common—the common utilities that support other Hadoop subprojects.
Hadoop Distributed File System (HDFS™)—a distributed file system that provides high-throughput access to application data.
Hadoop MapReduce—a software framework for distributed processing of large data sets on compute clusters (a minimal sketch of a MapReduce job follows this list).
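To make the MapReduce programming model concrete, here is a minimal sketch of the canonical word-count job written against the org.apache.hadoop.mapreduce API. The class names (WordCount, TokenizerMapper, IntSumReducer) are illustrative, and the input and output paths are assumed to arrive as command-line arguments.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // The map step: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // The reduce step: sum the counts the framework has grouped by word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A typical invocation would be hadoop jar wordcount.jar WordCount /input /output. HDFS supplies the input splits to the mappers, and the MapReduce framework handles the shuffle, sort, and grouping between the two phases.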
 