Advanced Analytics—Technology and Tools: MapReduce and Hadoop - Data Science and Big Data Analytics

Yahoo!

As of 2012, Yahoo! has one of the largest publicly announced Hadoop deployments at 42,000 nodes across several clusters utilizing 350 petabytes of raw storage [7].

Yahoo!'s Hadoop applications include the following [8]:

• Search index creation and maintenance

• Web page content optimization

• Web ad placement optimization

• Spam filters

• Ad-hoc analysis and analytic model development

Before Yahoo! deployed Hadoop, processing three years' worth of log data took 26 days. With Hadoop, the processing time was reduced to 20 minutes.

10.1.2 MapReduce

As mentioned earlier, the MapReduce paradigm provides the means to break a large task into smaller tasks, run the tasks in parallel, and consolidate the outputs of the individual tasks into the final output. As its name implies, MapReduce consists of two basic parts, a map step and a reduce step, detailed as follows:

Map:

• Applies an operation to a piece of data

• Provides some intermediate output

Reduce:

• Consolidates the intermediate outputs from the map steps

• Provides the final output

Each step uses key/value pairs, denoted as <key, value>, as input and output. It is useful to think of each key/value pair as a simple ordered pair. However, the pairs can take fairly complex forms. For example, the key could be a filename, and the value could be the entire contents of the file.
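The two steps and their key/value inputs and outputs can be sketched in a few lines of Python. This is a minimal single-process illustration under stated assumptions, not Hadoop's API: the names `run_mapreduce`, `count_lines`, and `sum_values`, the output key `"total_lines"`, and the sample file contents are all hypothetical. Following the example in the text, each input key is a filename and each input value is the entire contents of that file.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run a map step and a reduce step over <key, value> pairs in one process."""
    grouped = defaultdict(list)
    # Map: apply the operation to each piece of data and collect the
    # intermediate <key, value> pairs, grouped by intermediate key.
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            grouped[out_key].append(out_value)
    # Reduce: consolidate the intermediate values for each key into the final output.
    return {key: reducer(key, values) for key, values in grouped.items()}


def count_lines(filename, contents):
    # Hypothetical mapper: <filename, contents> -> <"total_lines", line count>
    yield ("total_lines", len(contents.splitlines()))


def sum_values(key, values):
    # Hypothetical reducer: add up all intermediate values for one key.
    return sum(values)


# Made-up sample input: the key is a filename, the value is the file's contents.
files = [
    ("a.log", "GET /index\nGET /about\n"),
    ("b.log", "GET /index\n"),
]

result = run_mapreduce(files, count_lines, sum_values)
print(result)
```

In a real cluster the map calls would run in parallel on different nodes; here they run in a simple loop so that the data flow is easy to follow.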

The simplest illustration of MapReduce is a word count example in which the task is simply to count the number of times each word appears in a collection of documents. In practice, the objective of such an exercise is to establish a list of words and their frequencies, either for search or for establishing the relative importance of certain words. Chapter 9 provides more details on text analytics.
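The word count task can be sketched as follows. This is a minimal in-memory illustration rather than a Hadoop job: the function names (`map_words`, `group_by_key`, `reduce_counts`), the simple tokenizing rule, and the two sample documents are assumptions made for the example. The map step emits a <word, 1> pair for every word, the pairs are grouped by word, and the reduce step sums the ones for each word.

```python
import re
from collections import defaultdict


def map_words(filename, contents):
    """Map step: emit <word, 1> for every word in one document."""
    # Assumed tokenizer: lowercase runs of letters and apostrophes.
    for word in re.findall(r"[a-z']+", contents.lower()):
        yield (word, 1)


def group_by_key(pairs):
    """Group intermediate values by key so each word is reduced once."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word, counts):
    """Reduce step: consolidate the intermediate counts for one word."""
    return (word, sum(counts))


# Made-up sample collection: <filename, contents> pairs.
documents = [
    ("doc1.txt", "the quick brown fox"),
    ("doc2.txt", "the lazy dog and the fox"),
]

intermediate = [
    pair for name, text in documents for pair in map_words(name, text)
]
word_counts = dict(
    reduce_counts(word, counts)
    for word, counts in group_by_key(intermediate).items()
)
print(word_counts)
```

Because each <word, 1> pair depends only on its own document, the map calls could be distributed across many machines, and only the grouped counts would need to be brought together for the reduce step.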
