Again, this uses the same input from “Example 1: Simplest Possible App in Cascading,” but now we expect all stop words to be removed from the output stream. Common words such as a, an, as, etc., have been filtered out.
You can verify the entire output text in the output/wc partition file, where the first 10
lines (including the header) should look like this:
$ head output/wc/part-00000
token count
air 1
area 4
australia 1
broken 1
california's 1
cause 1
cloudcover 1
death 1
deserts 1
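The filtering step that produces output like this can be sketched in plain Java, outside the Cascading API. This is an illustrative stand-in, not the book's app: the stop word list here is a tiny hypothetical sample, and the tokenizer is a simple regex split.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class StopWordFilter {
    // A tiny illustrative stop word list; a real list would be much larger.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "as", "of", "the");

    // Tokenize, drop stop words, and count the remaining tokens.
    static Map<String, Long> wordCount(String text) {
        return Arrays.stream(text.toLowerCase().split("[^a-z']+"))
                .filter(tok -> !tok.isEmpty() && !STOP_WORDS.contains(tok))
                .collect(Collectors.groupingBy(t -> t, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = wordCount("The air of the deserts and the area of the area");
        // Print in the same token/count layout as the partition file.
        counts.forEach((token, count) -> System.out.println(token + "\t" + count));
    }
}
```

The TreeMap keeps tokens in sorted order, mirroring the sorted output of the reduce phase.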
The flow diagram will be in the dot/ subdirectory after the app runs. For those keeping
score, the resulting physical plan in Apache Hadoop uses one map and one reduce.
Again, a GitHub gist shows building and running this example. If your run looks terribly
different, something is probably not set up correctly. Ask the developer community for
advice.
Stop Words and Replicated Joins
Let's consider why we would want to use a stop words list. This approach originated in
work by Hans Peter Luhn at IBM Research, during the dawn of computing. The reasons
for it are twofold. On one hand, consider that the most common words in any given
natural language are generally not useful for text analytics. For example, in English, words such as “and,” “of,” and “the” are probably not what you want to search for and probably not interesting for Word Count metrics. They represent high frequency and low semantic value within the token distribution. They also represent the bulk of the processing required. Natural languages tend to have on the order of 10^5 words, so the potential size of any stop words list is nicely bounded. Filtering those high-frequency
words out of the token stream dramatically reduces the amount of processing required
later in the workflow.
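Because the stop word list is small and bounded, a full copy can be held in memory by every parallel task, so the filter behaves like a map-side (replicated) join: the token stream is the large side, the stop word set is the small replicated side, and no shuffle is needed. A minimal sketch of that framing, with a hypothetical stop word list and shards standing in for map task inputs:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class ReplicatedJoinSketch {
    // The small "right side" of the join: a stop word table that fits in memory
    // and is replicated read-only to every task (hypothetical sample list).
    static final Set<String> STOP = Set.of("a", "an", "and", "as", "of", "the");

    // Each shard is filtered independently against the shared in-memory set,
    // which is what makes the join replicated rather than shuffled.
    static List<String> filterShard(List<String> shard) {
        return shard.stream()
                .filter(tok -> !STOP.contains(tok))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<String>> shards = List.of(
                List.of("the", "air", "of", "the", "deserts"),
                List.of("a", "cause", "and", "an", "area"));
        // Process shards in parallel, as separate map tasks would.
        List<String> kept = shards.parallelStream()
                .flatMap(s -> filterShard(s).stream())
                .collect(Collectors.toList());
        System.out.println(kept);
    }
}
```

Membership checks against a hash set are constant time, so the cost of the filter is dominated by reading the token stream itself.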
On the other hand, you may also want to remove some words explicitly from the token
stream. This almost always comes up in practice, especially when working with public
discussions such as social network comments.