Databases Reference
In-Depth Information
MapReduce is an ideal tool to use when creating reverse indexes due to its ability
to scale horizontally. Creating reverse indexes was the primary driver behind the
Google MapReduce project, and the reason the Hadoop framework was created. Let's
take a step-by-step look at how you can use MapReduce to create reverse indexes.
To design a MapReduce job, you must break the problem into multiple steps. The
first step is to write a map function that takes your inputs (the source documents) and
returns a set of key-value pairs. The second step is to write a reduce function that will
return your results. In this case, the results will be the reverse index files. For each key-
word, the reverse index lists what documents contain that word.
You may recall that the interface between the map and reduce phases must be a set
of key-value pairs. The next question to answer is what to return for the key. The most
logical key would be the word itself. The “value” of the key-value pair would be a list of
all the document identifiers that contain that word.
Figure 7.6 shows the detailed steps in this process. You can see from this figure that
before you process the inputs, you remove uppercase letters and small stop words
such as the , and , or , and to , since it's unlikely they'll be used as keywords. You then cre-
ate a list of key-value pairs for each word where the document ID is the “value” part of
the key-value pair. The MapReduce infrastructure then performs the “shuffle and
sort” steps and pass the output to the final reduce phase that collapses each of the
word-document pairs into a word-document list item, which is the format of the
reverse indexes.
In our next two sections we'll look at case studies to see how search can be used to
solve specific business problems.
Map
Reduce
Key-value
pairs
Sort
Input
documents
Final reverse
index
cat d2
cat d3
Normalization
sue d1
likes d1
cats d1
d1
Sue likes cats.
sue likes cats
cat: d2,d3
cats d1
cats d2
cats: d1, d2
cats d2
like d2
cat d2
food d2
food: d2
food d2
d2
Cats like cat food.
cats like cat food
like: d2, d3
like d2
like d3
likes: d1
play: d3
likes d1
cats d3
like d3
play d3
d3
Cats like to play.
cats like play
sue: d1
play d3
sue d1
Figure 7.6 Using the MapReduce algorithm to create a reverse index. The
normalization step removes punctuation and stop words and converts all words to
lowercase. The output of the map phase must be a set of key-value pairs. The reduce
function groups the keyword documents to form the final reverse index.
Search WWH ::




Custom Search