simultaneously during the map step. A key characteristic of MapReduce is that
the processing of one portion of the input can be carried out independently of the
processing of the other portions. Thus, the workload can be easily distributed over a
cluster of machines.
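To make this characteristic concrete, the following sketch counts words in a set of input portions in the MapReduce style. It is a minimal single-machine illustration in Python; the function names and sample chunks are hypothetical, and on a real cluster each map call would run on a different machine.

    from collections import defaultdict
    from itertools import chain

    def map_phase(chunk):
        # Each portion of the input is processed independently:
        # emit an intermediate (word, 1) pair for every word.
        return [(word, 1) for word in chunk.split()]

    def reduce_phase(pairs):
        # Group the intermediate pairs by key and sum the counts.
        counts = defaultdict(int)
        for word, count in pairs:
            counts[word] += count
        return dict(counts)

    # The input is split into portions; because no map call depends on
    # another, the calls could be distributed over a cluster of machines.
    chunks = ["the quick brown fox", "the lazy dog", "the fox"]
    intermediate = chain.from_iterable(map_phase(c) for c in chunks)
    print(reduce_phase(intermediate))
    # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}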
U.S. Navy Rear Admiral Grace Hopper (1906–1992), who was a pioneer in the field
of computers, provided one of the best explanations of the need for using a group
of computers. She commented that during preindustrial times, oxen were used for
heavy pulling, but when one ox couldn't budge a log, people didn't try to raise a
larger ox; they added more oxen. Her point was that as computational problems
grow, instead of building a bigger, more powerful, and more expensive computer, a
better alternative is to build a system of computers to share the workload. Thus, in
the MapReduce context, a large processing task would be distributed across many
computers.
Although the concept of MapReduce has existed for decades, Google led a
resurgence of interest in it and its adoption starting in 2004 with the published
work by Dean and Ghemawat [9]. This paper described Google's approach to crawling
the web and building Google's search engine. As the paper describes, the map and
reduce operations have long been used in functional programming languages such as
Lisp, which takes its name from its facility for processing lists (LISt Processing).
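The map and reduce primitives that inspired the paradigm remain available in most languages today. As a small illustration (shown here in Python rather than Lisp; the variable names are arbitrary), map applies a function to every element of a list, and reduce folds the results into a single value:

    from functools import reduce

    numbers = [1, 2, 3, 4]
    squares = list(map(lambda x: x * x, numbers))       # [1, 4, 9, 16]
    total = reduce(lambda acc, x: acc + x, squares, 0)  # 1 + 4 + 9 + 16 = 30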
In 2007, a well-publicized MapReduce use case was the conversion of 11 million
New York Times newspaper articles from 1851 to 1980 into PDF files. The intent
was to make the PDF files openly available to users on the Internet. After some
development and testing of the MapReduce code on a local machine, the 11 million
PDF files were generated on a 100-node cluster in about 24 hours [10].
The development of the MapReduce code and its execution proceeded easily because
the MapReduce paradigm had already been implemented in Apache Hadoop.
10.1.3 Apache Hadoop
Although MapReduce is a simple paradigm to understand, it is not as easy to
implement, especially in a distributed system. Executing a MapReduce job (the
MapReduce code run against some specified data) requires the management and
coordination of several activities:
• MapReduce jobs need to be scheduled based on the system's workload.