NDFS and the MapReduce implementation in Nutch were applicable beyond the realm of
search, and in February 2006 they moved out of Nutch to form an independent subproject
of Lucene called Hadoop. At around the same time, Doug Cutting joined Yahoo!, which
provided a dedicated team and the resources to turn Hadoop into a system that ran at web
scale (see the following sidebar). This was demonstrated in February 2008 when Yahoo!
announced that its production search index was being generated by a 10,000-core Hadoop
cluster. [14]
HADOOP AT YAHOO!
Building Internet-scale search engines requires huge amounts of data and therefore large numbers of machines to process it. Yahoo! Search consists of four primary components: the Crawler, which downloads pages from web servers; the WebMap, which builds a graph of the known Web; the Indexer, which builds a reverse index to the best pages; and the Runtime, which answers users' queries. The WebMap is a graph that consists of roughly 1 trillion (10^12) edges, each representing a web link, and 100 billion (10^11) nodes, each representing a distinct URL. Creating and analyzing such a large graph requires a large number of computers running for many days. In early 2005, the infrastructure for the WebMap, named Dreadnaught, needed to be redesigned to scale up to more nodes. Dreadnaught had successfully scaled from 20 to 600 nodes, but required a complete redesign to scale out further. Dreadnaught is similar to MapReduce in many ways, but provides more flexibility and less structure. In particular, each fragment in a Dreadnaught job could send output to each of the fragments in the next stage of the job, but the sort was all done in library code. In practice, most of the WebMap phases were pairs that corresponded to MapReduce. Therefore, the WebMap applications would not require extensive refactoring to fit into MapReduce.
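To make that correspondence concrete, here is a minimal, hypothetical sketch (not actual WebMap code, and using the modern Hadoop API rather than anything available in 2006) of one such phase as a map/reduce pair: inverting a link graph so that each URL is grouped with the pages that link to it. The class name and the tab-separated "source target" input format are illustrative assumptions.

// Hypothetical illustration, not actual WebMap code: a link-inversion
// phase written as a Hadoop map/reduce pair. Input lines are assumed
// to be "sourceURL<TAB>targetURL"; the job groups each target URL
// with the sources that link to it.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvertLinks {

  // Map: emit (target, source) so the framework's sort and shuffle
  // bring together all links pointing at the same page.
  public static class InvertMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] edge = line.toString().split("\t");
      if (edge.length == 2) {
        context.write(new Text(edge[1]), new Text(edge[0]));
      }
    }
  }

  // Reduce: concatenate the sources for each target into one record.
  public static class CollectReducer
      extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text target, Iterable<Text> sources,
        Context context) throws IOException, InterruptedException {
      StringBuilder inlinks = new StringBuilder();
      for (Text source : sources) {
        if (inlinks.length() > 0) {
          inlinks.append(',');
        }
        inlinks.append(source.toString());
      }
      context.write(target, new Text(inlinks.toString()));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "invert links");
    job.setJarByClass(InvertLinks.class);
    job.setMapperClass(InvertMapper.class);
    job.setReducerClass(CollectReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note how the map output key alone determines the grouping, while the framework supplies the sort and shuffle; this is precisely the part that Dreadnaught left to library code, which is why phases structured as such pairs carried over with little refactoring.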
Eric Baldeschwieler (aka Eric14) created a small team, and we started designing and prototyping a new
framework, written in C++ and modeled after GFS and MapReduce, to replace Dreadnaught. Although
the immediate need was for a new framework for WebMap, it was clear that standardization of the batch
platform across Yahoo! Search was critical and that by making the framework general enough to support
other users, we could better leverage investment in the new platform.
At the same time, we were watching Hadoop, which was part of Nutch, and its progress. In January
2006, Yahoo! hired Doug Cutting, and a month later we decided to abandon our prototype and adopt Hadoop. The advantage of Hadoop over our prototype and design was that it was already working with a
real application (Nutch) on 20 nodes. That allowed us to bring up a research cluster two months later and
start helping real customers use the new framework much sooner than we could have otherwise. Another
advantage, of course, was that since Hadoop was already open source, it was easier (although far from
easy!) to get permission from Yahoo!'s legal department to work in open source. So, we set up a
200-node cluster for the researchers in early 2006 and put the WebMap conversion plans on hold while
we supported and improved Hadoop for the research users.
— Owen O'Malley, 2009