Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// returns only "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
Out of that pipe, we get a tuple stream of token values. One benefit of using a regex is that it's simple to change. We can handle more complex cases of splitting tokens without having to rewrite the generator.
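To make the point about changing the regex concrete, here is a minimal sketch in plain Java, without Cascading. It applies the same delimiter class to a line of text using java.util.regex, then swaps in a second pattern. The class name SplitSketch, the EXTENDED pattern, and the sample line are assumptions made up for illustration; they are not part of the book's code. Note that splitting this way yields an empty string between two adjacent delimiters; how RegexSplitGenerator treats that case is not shown here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class SplitSketch {
    // Delimiter class from the text: space, square brackets, parentheses, comma, period.
    static final String BASIC = "[ \\[\\]\\(\\),.]";
    // Hypothetical variant (not from the source): also split on semicolons and colons.
    static final String EXTENDED = "[ \\[\\]\\(\\),.;:]";

    // Split one line of text on the given delimiter regex.
    static List<String> tokens(String regex, String line) {
        return Arrays.asList(Pattern.compile(regex).split(line));
    }

    public static void main(String[] args) {
        String line = "rain shadow: dry area";
        // Only the pattern string changes between the two calls.
        System.out.println(tokens(BASIC, line));
        System.out.println(tokens(EXTENDED, line));
    }
}
```

With BASIC, the colon stays attached to "shadow:"; with EXTENDED it becomes a delimiter. The splitting code itself is unchanged, which is the benefit the text describes.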
Next, we use a GroupBy to count the occurrences of each token:
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );
Notice that we've used Each and Every to perform operations within the pipe assembly. The difference between these two is that an Each operates on individual tuples, so it takes Function operations. An Every operates on groups of tuples, so it takes Aggregator or Buffer operations. In this case, the GroupBy groups the tuples by token, and the Count aggregator then runs on each group.
The different ways of inserting operations serve to categorize the different built-in operations in Cascading. They also illustrate how the pattern language syntax guides the development of workflows.
From that wcPipe we get a resulting tuple stream of token and count for the output.
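As a rough analogy for the per-tuple versus per-group distinction, the following sketch uses plain Java streams rather than Cascading. The first method works on one line at a time, the way a Function inside an Each does; the second groups equal tokens and counts each group, loosely mirroring GroupBy followed by an Every with Count. The class and method names are hypothetical, and dropping empty tokens is an assumption made here for readability, not a statement about Cascading's behavior.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class EachEverySketch {
    // Per-tuple step (the role Each plays): one line in, zero or more tokens out.
    static List<String> splitEach(List<String> lines) {
        return lines.stream()
            .flatMap(line -> Arrays.stream(line.split("[ \\[\\]\\(\\),.]")))
            .filter(tok -> !tok.isEmpty())   // assumption: drop empty tokens
            .collect(Collectors.toList());
    }

    // Per-group step (the roles GroupBy and Every play): group equal tokens,
    // then count the members of each group. TreeMap keeps the keys sorted.
    static Map<String, Long> countEvery(List<String> tokens) {
        return tokens.stream()
            .collect(Collectors.groupingBy(tok -> tok, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("rain rain shadow", "dry shadow");
        // prints {dry=1, rain=2, shadow=2}
        System.out.println(countEvery(splitEach(lines)));
    }
}
```

The result is a set of token and count pairs, the same shape as the tuple stream that comes out of wcPipe.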
Again, we connect the plumbing with a FlowDef:
FlowDef flowDef = FlowDef.flowDef()
  .setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );
Finally, we generate a DOT file to depict the Cascading flow graphically. You can load the DOT file into OmniGraffle or Visio. Those diagrams are really helpful for troubleshooting workflows in Cascading:
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();
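For readers who have not seen the DOT format, it is the plain-text graph description language used by Graphviz: nodes and directed edges written as text. The sketch below, in plain Java with no Cascading dependency, builds a tiny DOT digraph that chains a few stage names. The stage labels and the class name DotSketch are hypothetical; the file that writeDOT actually produces will contain different and more detailed content.

```java
import java.util.Arrays;
import java.util.List;

public class DotSketch {
    // Build a DOT digraph that chains the given stage names in order.
    static String toDot(String graphName, List<String> stages) {
        StringBuilder sb = new StringBuilder("digraph " + graphName + " {\n");
        for (int i = 0; i + 1 < stages.size(); i++) {
            // One edge per adjacent pair of stages, e.g. "a" -> "b";
            sb.append("  \"").append(stages.get(i)).append("\" -> \"")
              .append(stages.get(i + 1)).append("\";\n");
        }
        return sb.append("}\n").toString();
    }

    public static void main(String[] args) {
        // Hypothetical stage labels, chosen only to echo the names used in the text.
        System.out.print(toDot("wc", Arrays.asList("docTap", "Each", "GroupBy", "Every", "wcTap")));
    }
}
```

Because the format is just text describing edges, any tool that reads DOT can render the flow as a diagram.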
This code is already in the part2/src/main/java/impatient/ directory, in the Main.java file. To build it:
$ gradle clean jar
Then to run it:
$ rm -rf output
$ hadoop jar ./build/libs/impatient.jar data/rain.txt output/wc