Cascading - Hadoop: The Definitive Guide

We create a new Scheme that writes simple text files and expects a Tuple with any

number of fields/values. If there is more than one value, they will be tab-delimited in

the output file.
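The output format described above can be illustrated with a small plain-Java sketch. It does not use the Cascading API; the class `TabDelimitedLine` and its `format` method are hypothetical names chosen for illustration, and the sketch only assumes that each value is rendered with its string form.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for illustration only; not part of the Cascading API.
class TabDelimitedLine {
    // Joins any number of values into one output line, separated by tabs.
    static String format(List<?> values) {
        return values.stream()
                .map(String::valueOf)
                .collect(Collectors.joining("\t"));
    }

    public static void main(String[] args) {
        // Two values: a single tab separates them.
        System.out.println(format(Arrays.asList("hadoop", 3)));
        // One value: no delimiter is written.
        System.out.println(format(Arrays.asList("single")));
    }
}
```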

We create source and sink Tap instances that reference the input file and output directory, respectively. The sink Tap will overwrite any file that may already exist.

We construct the head of our pipe assembly and name it “wordcount.” This name is used to bind the source and sink Taps to the assembly. Multiple heads or tails would require unique names.

We construct an Each pipe with a function that will parse the “line” field into a new

Tuple for each word encountered.
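As a rough plain-Java analogue of this step, the sketch below turns one line into one entry per word. It is not the function used in the example: the class `LineToWords` is a hypothetical name, and splitting on whitespace is an assumption made here for simplicity, since the actual parsing rule is not shown in this section.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the parsing function; not part of the Cascading API.
class LineToWords {
    // Emits one entry per word found in the line.
    // Assumption: words are separated by runs of whitespace.
    static List<String> parse(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        for (String word : parse("The quick fox saw the dog")) {
            System.out.println(word);
        }
    }
}
```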

We construct a GroupBy pipe that will create a new Tuple grouping for each unique

value in the field “word.”

We construct an Every pipe with an Aggregator that will count the number of Tuples in every unique word group. The result is stored in a field named “count.”
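The combined effect of grouping on “word” and counting each group can be sketched in plain Java with the standard collections library. This is only an in-memory illustration of the semantics, not how Cascading executes the step on a cluster; `WordGroupCounter` is a hypothetical name.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical in-memory illustration; not part of the Cascading API.
class WordGroupCounter {
    // Creates one group per unique word and counts the entries in each group.
    static Map<String, Long> count(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(
                        Function.identity(),    // group key: the word itself
                        TreeMap::new,           // keep groups ordered by word
                        Collectors.counting())); // aggregate: entries per group
    }

    public static void main(String[] args) {
        Map<String, Long> counts = count(Arrays.asList("the", "fox", "the", "dog"));
        counts.forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```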

We construct a GroupBy pipe that will create a new Tuple grouping for each unique value in the field “count” and apply a secondary sort to the values in the field “word.” The result will be a list of “count” and “word” values with “count” sorted in increasing order.

We connect the pipe assembly to its sources and sinks in a Flow, and then execute the Flow on the cluster.

In the example, we count the words encountered in the input document, and we sort the

counts in their natural order (ascending). If some words have the same “count” value,

these words are sorted in their natural order (alphabetical).
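The resulting order (ascending count, then ascending word for ties) can be sketched with a two-level comparator in plain Java. This is an in-memory illustration under the assumption that counts are held in a map; `CountThenWordSort` and `sortedLines` are hypothetical names, not part of the Cascading API. Note that the natural order of Java strings compares character codes, so uppercase letters sort before lowercase ones.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory illustration; not part of the Cascading API.
class CountThenWordSort {
    // Returns "count<TAB>word" lines ordered by count, then by word for equal counts.
    static List<String> sortedLines(Map<String, Long> counts) {
        List<Map.Entry<String, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Long>comparingByValue()           // primary: count
                .thenComparing(Map.Entry.<String, Long>comparingByKey())); // secondary: word
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Long> entry : entries) {
            lines.add(entry.getValue() + "\t" + entry.getKey());
        }
        return lines;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new HashMap<>();
        counts.put("the", 2L);
        counts.put("fox", 1L);
        counts.put("dog", 1L);
        sortedLines(counts).forEach(System.out::println);
    }
}
```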

One obvious problem with this example is that some words might have uppercase letters in some instances (for example, “the” and “The” when the word comes at the beginning of a sentence). We might consider inserting a new operation to force all the words to
