Again, this uses the same input from “Example 1: Simplest Possible App in Cascading,” but now we expect all stop words to be removed from the output stream. Common words such as a, an, as, etc., have been filtered out.
You can verify the entire output text in the output/wc partition file, where the first 10
lines (including the header) should look like this:
$ head output/wc/part-00000
token count
air 1
area 4
australia 1
broken 1
california's 1
cause 1
cloudcover 1
death 1
deserts 1
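The filtering step that produces output like this can be sketched in plain Java, outside the Cascading API. This is an illustrative stand-in, not the book's app: the stop word list here is a tiny hypothetical sample, and the tokenizer is a simple regex split.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class StopWordFilter {
    // A tiny illustrative stop word list; a real list would be much larger.
    static final Set<String> STOP_WORDS = Set.of("a", "an", "and", "as", "of", "the");

    // Tokenize, drop stop words, and count the remaining tokens.
    static Map<String, Long> wordCount(String text) {
        return Arrays.stream(text.toLowerCase().split("[^a-z']+"))
                .filter(tok -> !tok.isEmpty() && !STOP_WORDS.contains(tok))
                .collect(Collectors.groupingBy(t -> t, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = wordCount("The air of the deserts and the area of the area");
        // Print in the same token/count layout as the partition file.
        counts.forEach((token, count) -> System.out.println(token + "\t" + count));
    }
}
```

The TreeMap keeps tokens in sorted order, mirroring the sorted output of the reduce phase.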
The flow diagram will be in the dot/ subdirectory after the app runs. For those keeping
score, the resulting physical plan in Apache Hadoop uses one map and one reduce.
Again, a GitHub gist shows building and running this example. If your run looks terribly
different, something is probably not set up correctly. Ask the developer community for
advice.
Stop Words and Replicated Joins
Let's consider why we would want to use a stop words list. This approach originated in
work by Hans Peter Luhn at IBM Research, during the dawn of computing. The reasons
for it are twofold. On one hand, consider that the most common words in any given
natural language are generally not useful for text analytics. For example, in English, words such as “and,” “of,” and “the” are probably not what you want to search for and probably not interesting for Word Count metrics. They represent high frequency and low semantic value within the token distribution. They also represent the bulk of the processing required. Natural languages tend to have on the order of 10^5 words, so the potential size of any stop words list is nicely bounded. Filtering those high-frequency
words out of the token stream dramatically reduces the amount of processing required
later in the workflow.
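Because the stop word list is small and bounded, a full copy can be held in memory by every parallel task, so the filter behaves like a map-side (replicated) join: the token stream is the large side, the stop word set is the small replicated side, and no shuffle is needed. A minimal sketch of that framing, with a hypothetical stop word list and shards standing in for map task inputs:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class ReplicatedJoinSketch {
    // The small "right side" of the join: a stop word table that fits in memory
    // and is replicated read-only to every task (hypothetical sample list).
    static final Set<String> STOP = Set.of("a", "an", "and", "as", "of", "the");

    // Each shard is filtered independently against the shared in-memory set,
    // which is what makes the join replicated rather than shuffled.
    static List<String> filterShard(List<String> shard) {
        return shard.stream()
                .filter(tok -> !STOP.contains(tok))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<String>> shards = List.of(
                List.of("the", "air", "of", "the", "deserts"),
                List.of("a", "cause", "and", "an", "area"));
        // Process shards in parallel, as separate map tasks would.
        List<String> kept = shards.parallelStream()
                .flatMap(s -> filterShard(s).stream())
                .collect(Collectors.toList());
        System.out.println(kept);
    }
}
```

Membership checks against a hash set are constant time, so the cost of the filter is dominated by reading the token stream itself.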
On the other hand, you may also want to remove some words explicitly from the token
stream. This almost always comes up in practice, especially when working with public
discussions such as social network comments.