NDFS and the MapReduce implementation in Nutch were applicable beyond the realm of
search, and in February 2006 they moved out of Nutch to form an independent subproject
of Lucene called Hadoop. At around the same time, Doug Cutting joined Yahoo!, which
provided a dedicated team and the resources to turn Hadoop into a system that ran at web
scale (see the following sidebar). This was demonstrated in February 2008 when Yahoo!
announced that its production search index was being generated by a 10,000-core Hadoop
cluster. [14]
HADOOP AT YAHOO!
Building Internet-scale search engines requires huge amounts of data and therefore large numbers of machines to process it. Yahoo! Search consists of four primary components: the Crawler, which downloads pages from web servers; the WebMap, which builds a graph of the known Web; the Indexer, which builds a reverse index to the best pages; and the Runtime, which answers users' queries. The WebMap is a graph that consists of roughly 1 trillion (10^12) edges, each representing a web link, and 100 billion (10^11) nodes, each representing a distinct URL. Creating and analyzing such a large graph requires a large number of computers running for many days. In early 2005, the infrastructure for the WebMap, named Dreadnaught, needed to be redesigned to scale up to more nodes. Dreadnaught had successfully scaled from 20 to 600 nodes, but required a complete redesign to scale out further. Dreadnaught is similar to MapReduce in many ways, but provides more flexibility and less structure. In particular, each fragment in a Dreadnaught job could send output to each of the fragments in the next stage of the job, but the sort was all done in library code. In practice, most of the WebMap phases were pairs that corresponded to MapReduce. Therefore, the WebMap applications would not require extensive refactoring to fit into MapReduce.
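To make that correspondence concrete, here is a minimal, hypothetical sketch (not actual WebMap code, and using the modern Hadoop API rather than anything available in 2006) of one such phase as a map/reduce pair: inverting a link graph so that each URL is grouped with the pages that link to it. The class name and the tab-separated "source target" input format are illustrative assumptions.

// Hypothetical illustration, not actual WebMap code: a link-inversion
// phase written as a Hadoop map/reduce pair. Input lines are assumed
// to be "sourceURL<TAB>targetURL"; the job groups each target URL
// with the sources that link to it.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvertLinks {

  // Map: emit (target, source) so the framework's sort and shuffle
  // bring together all links pointing at the same page.
  public static class InvertMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] edge = line.toString().split("\t");
      if (edge.length == 2) {
        context.write(new Text(edge[1]), new Text(edge[0]));
      }
    }
  }

  // Reduce: concatenate the sources for each target into one record.
  public static class CollectReducer
      extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text target, Iterable<Text> sources,
        Context context) throws IOException, InterruptedException {
      StringBuilder inlinks = new StringBuilder();
      for (Text source : sources) {
        if (inlinks.length() > 0) {
          inlinks.append(',');
        }
        inlinks.append(source.toString());
      }
      context.write(target, new Text(inlinks.toString()));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "invert links");
    job.setJarByClass(InvertLinks.class);
    job.setMapperClass(InvertMapper.class);
    job.setReducerClass(CollectReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note how the map output key alone determines the grouping, while the framework supplies the sort and shuffle; this is precisely the part that Dreadnaught left to library code, which is why phases structured as such pairs carried over with little refactoring.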
Eric Baldeschwieler (aka Eric14) created a small team, and we started designing and prototyping a new
framework, written in C++ and modeled after GFS and MapReduce, to replace Dreadnaught. Although
the immediate need was for a new framework for WebMap, it was clear that standardization of the batch
platform across Yahoo! Search was critical and that by making the framework general enough to support
other users, we could better leverage investment in the new platform.
At the same time, we were watching Hadoop, which was part of Nutch, and its progress. In January
2006, Yahoo! hired Doug Cutting, and a month later we decided to abandon our prototype and adopt Hadoop. The advantage of Hadoop over our prototype and design was that it was already working with a
real application (Nutch) on 20 nodes. That allowed us to bring up a research cluster two months later and
start helping real customers use the new framework much sooner than we could have otherwise. Another
advantage, of course, was that since Hadoop was already open source, it was easier (although far from
easy!) to get permission from Yahoo!'s legal department to work in open source. So, we set up a
200-node cluster for the researchers in early 2006 and put the WebMap conversion plans on hold while
we supported and improved Hadoop for the research users.
— Owen O'Malley, 2009