We create a new Scheme that writes simple text files and expects a Tuple with any
number of fields/values. If there is more than one value, they will be tab-delimited in
the output file.
We create source and sink Tap instances that reference the input file and output
directory, respectively. The sink Tap will overwrite any file that may already exist.
We construct the head of our pipe assembly and name it “wordcount.” This name is
used to bind the source and sink Taps to the assembly. Multiple heads or tails would
require unique names.
We construct an Each pipe with a function that will parse the “line” field into a new
Tuple for each word encountered.
We construct a GroupBy pipe that will create a new Tuple grouping for each unique
value in the field “word.”
We construct an Every pipe with an Aggregator that will count the number of Tuples
in every unique word group. The result is stored in a field named “count.”
We construct a GroupBy pipe that will create a new Tuple grouping for each unique
value in the field “count” and secondary sort each value in the field “word.” The result
will be a list of “count” and “word” values with “count” sorted in increasing order.
We connect the pipe assembly to its sources and sinks in a Flow, and then execute the
Flow on the cluster.
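To make the semantics of the Each, GroupBy, and Every steps concrete, the following
plain-Java sketch mimics them on an in-memory list of lines. It does not use the
Cascading API. The class name, the method name, and the rule of splitting on whitespace
are assumptions made for illustration; how the actual function splits a line into words
is not shown here.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: plain Java, not the Cascading API.
class WordCountSketch {

    // Emits one word per token in each line (the role of the Each step),
    // then groups equal words and counts the members of each group
    // (the roles of the GroupBy and Every steps).
    static Map<String, Long> countWords(List<String> lines) {
        Map<String, Long> counts = new TreeMap<>();
        for (String line : lines) {
            // Assumption: words are separated by runs of whitespace.
            for (String word : line.split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // a leading separator yields an empty first token
                }
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords(List.of("the cat the", "The end")));
    }
}
```

Note that “the” and “The” land in different groups here, because the grouping compares
the words exactly as they appear in the input.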
In the example, we count the words encountered in the input document, and we sort the
counts in their natural order (ascending). If some words have the same “count” value,
these words are sorted in their natural order (alphabetical).
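The ordering described above (ascending “count,” with ties broken by the natural order
of “word”) can be sketched in plain Java with a two-level comparator. This is only an
illustration and not the Cascading API; the class and method names are hypothetical, and
writing each result as a tab-delimited “count” and “word” pair is an assumption based on
the output format described earlier.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Illustration only: plain Java, not the Cascading API.
class CountOrderSketch {

    // Returns one "count<TAB>word" line per entry, ordered by count
    // (ascending) and then by word (natural String order).
    static List<String> sortedLines(Map<String, Long> counts) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());

        // Primary key: the count. Secondary key: the word.
        Comparator<Map.Entry<String, Long>> byCountThenWord =
            Comparator.comparing((Map.Entry<String, Long> e) -> e.getValue())
                      .thenComparing(e -> e.getKey());
        entries.sort(byCountThenWord);

        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Long> e : entries) {
            lines.add(e.getValue() + "\t" + e.getKey());
        }
        return lines;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = Map.of("the", 2L, "cat", 1L, "end", 1L);
        sortedLines(counts).forEach(System.out::println);
    }
}
```

In Java, the natural order of String compares character codes, so uppercase letters
sort before lowercase ones; it matches alphabetical order only when the words share
the same case.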
One obvious problem with this example is that some words might have uppercase letters
in some instances — for example, “the” and “The” when the word comes at the beginning
of a sentence. We might consider inserting a new operation to force all the words to