Yahoo!
As of 2012, Yahoo! has one of the largest publicly announced Hadoop deployments,
with 42,000 nodes across several clusters and 350 petabytes of raw storage [7].
Yahoo!'s Hadoop applications include the following [8]:
• Search index creation and maintenance
• Web page content optimization
• Web ad placement optimization
• Spam filters
• Ad-hoc analysis and analytic model development
Before Yahoo! deployed Hadoop, processing three years' worth of log data took 26 days.
With Hadoop, the processing time dropped to 20 minutes.
10.1.2 MapReduce
As mentioned earlier, the MapReduce paradigm provides the means to break a
large task into smaller tasks, run the tasks in parallel, and consolidate the outputs
of the individual tasks into the final output. As its name implies, MapReduce
consists of two basic parts—a map step and a reduce step—detailed as follows:
Map:
• Applies an operation to a piece of data
• Provides some intermediate output
Reduce:
• Consolidates the intermediate outputs from the map steps
• Provides the final output
Each step uses key/value pairs, denoted as <key, value>, as input and output.
It is useful to think of a key/value pair as a simple ordered pair. However, the
pairs can take fairly complex forms. For example, the key could be a filename, and
the value could be the entire contents of the file.
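To make the key/value idea concrete, the following is a minimal single-process sketch in Python, not Hadoop code. It assumes input pairs of the form <filename, file contents>, as in the example above. The function names (map_lines_by_extension, reduce_total), the sample filenames, and the choice to count lines per file extension are all hypothetical and chosen only for illustration. Note that the intermediate key (the extension) differs from the input key (the filename).

```python
import os
from collections import defaultdict

def map_lines_by_extension(filename, contents):
    """Map: take one <filename, contents> pair and emit <extension, line count>."""
    ext = os.path.splitext(filename)[1] or "(none)"
    yield ext, len(contents.splitlines())

def reduce_total(ext, counts):
    """Reduce: consolidate all line counts that share one extension."""
    return ext, sum(counts)

# Hypothetical input pairs: key = filename, value = entire file contents.
inputs = [
    ("a.txt", "x\ny\n"),
    ("b.txt", "z"),
    ("c.log", "1\n2\n3"),
]

# Group intermediate values by key. A real framework would do this
# between the map and reduce steps, across many machines.
grouped = defaultdict(list)
for filename, contents in inputs:
    for key, value in map_lines_by_extension(filename, contents):
        grouped[key].append(value)

result = dict(reduce_total(key, values) for key, values in grouped.items())
print(result)
```

Here every stage consumes and produces plain pairs, which is what allows a framework to split the map calls across machines and route each intermediate key to a single reduce call.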
The simplest illustration of MapReduce is a word count example in which the
task is to count the number of times each word appears in a collection
of documents. In practice, the objective of such an exercise is to establish a list
of words and their frequency for purposes of search or establishing the relative
importance of certain words. Chapter 9 provides more details on text analytics.
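The word count task can be sketched in a few lines of Python. This is an in-memory illustration of the map and reduce steps under stated assumptions, not a Hadoop job: the names map_words, reduce_counts, and word_count are hypothetical, the two sample documents are made up, and the tokenization rule (lowercase runs of letters and apostrophes) is one simple choice among many.

```python
import re
from collections import defaultdict

def map_words(doc_id, text):
    """Map: emit <word, 1> for every word in one document."""
    for word in re.findall(r"[a-z']+", text.lower()):
        yield word, 1

def reduce_counts(word, ones):
    """Reduce: sum the 1s emitted for a single word."""
    return word, sum(ones)

def word_count(documents):
    """Run map, group the intermediate pairs by key, then reduce."""
    grouped = defaultdict(list)
    for doc_id, text in documents.items():
        for word, one in map_words(doc_id, text):
            grouped[word].append(one)
    return dict(reduce_counts(word, ones) for word, ones in grouped.items())

# Hypothetical two-document collection.
docs = {"doc1": "the cat sat", "doc2": "The cat ran"}
counts = word_count(docs)
print(counts)
```

Because each map call looks at only one document and each reduce call at only one word, both steps can in principle run in parallel; the grouping in the middle is the only point where data from different documents has to meet.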