simultaneously during the map step. A key characteristic of MapReduce is that
the processing of one portion of the input can be carried out independently of the
processing of the other portions. Thus, the workload can be easily distributed over a
cluster of machines.
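To make this characteristic concrete, the following sketch counts words in a set of input portions in the MapReduce style. It is a minimal single-machine illustration in Python; the function names and sample chunks are hypothetical, and on a real cluster each map call would run on a different machine.

    from collections import defaultdict
    from itertools import chain

    def map_phase(chunk):
        # Each portion of the input is processed independently:
        # emit an intermediate (word, 1) pair for every word.
        return [(word, 1) for word in chunk.split()]

    def reduce_phase(pairs):
        # Group the intermediate pairs by key and sum the counts.
        counts = defaultdict(int)
        for word, count in pairs:
            counts[word] += count
        return dict(counts)

    # The input is split into portions; because no map call depends on
    # another, the calls could be distributed over a cluster of machines.
    chunks = ["the quick brown fox", "the lazy dog", "the fox"]
    intermediate = chain.from_iterable(map_phase(c) for c in chunks)
    print(reduce_phase(intermediate))
    # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}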
U.S. Navy Rear Admiral Grace Hopper (1906–1992), who was a pioneer in the field
of computers, provided one of the best explanations of the need for using a group
of computers. She commented that during preindustrial times, oxen were used for
heavy pulling, but when one ox couldn't budge a log, people didn't try to raise a
larger ox; they added more oxen. Her point was that as computational problems
grow, instead of building a bigger, more powerful, and more expensive computer, a
better alternative is to build a system of computers to share the workload. Thus, in
the MapReduce context, a large processing task would be distributed across many
computers.
Although the concept of MapReduce has existed for decades, Google led a
resurgence of interest in it and its adoption starting in 2004 with the published
work by Dean and Ghemawat [9]. This paper described Google's approach to crawling
the web and building Google's search engine. As the paper describes, the map and
reduce operations have long been used in functional programming languages such as
Lisp, which takes its name from its facility for processing lists (LISt Processing).
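The map and reduce primitives that inspired the paradigm remain available in most languages today. As a small illustration (shown here in Python rather than Lisp; the variable names are arbitrary), map applies a function to every element of a list, and reduce folds the results into a single value:

    from functools import reduce

    numbers = [1, 2, 3, 4]
    squares = list(map(lambda x: x * x, numbers))       # [1, 4, 9, 16]
    total = reduce(lambda acc, x: acc + x, squares, 0)  # 1 + 4 + 9 + 16 = 30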
In 2007, a well-publicized MapReduce use case was the conversion of 11 million
New York Times newspaper articles from 1851 to 1980 into PDF files. The intent
was to make the PDF files openly available to users on the Internet. After some
development and testing of the MapReduce code on a local machine, the 11 million
PDF files were generated on a 100-node cluster in about 24 hours [10].
The development of the MapReduce code and its execution proceeded easily because
the MapReduce paradigm had already been implemented in Apache Hadoop.
10.1.3 Apache Hadoop
Although MapReduce is a simple paradigm to understand, it is not as easy to
implement, especially in a distributed system. Executing a MapReduce job (the
MapReduce code run against some specified data) requires the management and
coordination of several activities:
• MapReduce jobs need to be scheduled based on the system's workload.