We create a new Scheme that writes simple text files and expects a Tuple with any
number of fields/values. If there is more than one value, they will be tab-delimited in
the output file.
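As a rough sketch of this step, and of the fragments that follow, here is how it might look against the Cascading 1.x API in Java, with the bundled TextLine class standing in for the Scheme described above (class and field names are assumptions, not the original listing):

    // cascading.scheme.TextLine, cascading.tuple.Fields
    Scheme sourceScheme = new TextLine( new Fields( "line" ) ); // each input line arrives in a "line" field
    Scheme sinkScheme = new TextLine();                         // tuples are written out as tab-delimited text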
We create source and sink Tap instances that reference the input file and output directory, respectively. The sink Tap will overwrite any file that may already exist.
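This step might look as follows, assuming Hfs taps and hypothetical inputPath and outputPath strings:

    // cascading.tap.Hfs, cascading.tap.SinkMode
    Tap source = new Hfs( sourceScheme, inputPath );                // hypothetical path to the input file
    Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE ); // REPLACE overwrites existing output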
We construct the head of our pipe assembly and name it “wordcount.” This name is used to bind the source and sink Taps to the assembly. Multiple heads or tails would require unique names.
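Continuing the sketch, the head of the assembly is a single named Pipe:

    // cascading.pipe.Pipe
    Pipe assembly = new Pipe( "wordcount" ); // the name later binds the source and sink Taps to this assembly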
We construct an Each pipe with a function that will parse the “line” field into a new
Tuple for each word encountered.
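One plausible version of this step uses a RegexGenerator, which emits one Tuple per match of a pattern; the pattern shown here is only illustrative:

    // cascading.pipe.Each, cascading.operation.regex.RegexGenerator
    String wordRegex = "\\b[\\w']+\\b"; // illustrative word-matching pattern
    Function function = new RegexGenerator( new Fields( "word" ), wordRegex );
    assembly = new Each( assembly, new Fields( "line" ), function ); // one "word" Tuple per match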
We construct a GroupBy pipe that will create a new Tuple grouping for each unique
value in the field “word.”
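In the sketch, this grouping is a single GroupBy pipe:

    // cascading.pipe.GroupBy
    assembly = new GroupBy( assembly, new Fields( "word" ) ); // one group per unique word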
We construct an Every pipe with an Aggregator that will count the number of Tuples in every unique word group. The result is stored in a field named “count.”
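The built-in Count aggregator is one way to express this step (again a sketch, with the output field named "count" as described above):

    // cascading.pipe.Every, cascading.operation.aggregator.Count
    Aggregator count = new Count( new Fields( "count" ) ); // counts the Tuples in each word group
    assembly = new Every( assembly, count );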
We construct a GroupBy pipe that will create a new Tuple grouping for each unique
value in the field “count” and secondary sort each value in the field “word.” The result
will be a list of “count” and “word” values with “count” sorted in increasing order.
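A sketch of this final grouping uses the GroupBy constructor that takes both grouping fields and sort fields:

    assembly = new GroupBy( assembly, new Fields( "count" ), new Fields( "word" ) ); // group on "count", secondary sort on "word"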
We connect the pipe assembly to its sources and sinks in a Flow, and then execute the Flow on the cluster.
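Connecting and running the Flow might look like this, assuming the Cascading 1.x FlowConnector for Hadoop:

    // cascading.flow.FlowConnector, cascading.flow.Flow
    FlowConnector flowConnector = new FlowConnector();
    Flow flow = flowConnector.connect( "word-count", source, sink, assembly );
    flow.complete(); // submits the Flow to the cluster and blocks until it finishes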
In the example, we count the words encountered in the input document, and we sort the
counts in their natural order (ascending). If some words have the same “count” value,
these words are sorted in their natural order (alphabetical).
One obvious problem with this example is that some words might have uppercase letters in some instances (for example, “the” versus “The” when the word comes at the beginning of a sentence). We might consider inserting a new operation to force all the words to lowercase.