Operations
As mentioned earlier, Cascading departs from MapReduce by introducing alternative operations that are applied either to individual tuples or to groups of tuples (Figure 24-5):
Function
A Function operates on individual input tuples and may return zero or more output tuples for each input tuple. Functions are applied by the Each pipe.
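The "zero or more output tuples per input" contract can be illustrated with a minimal sketch. It is written in plain Python rather than against Cascading's Java API; the names split_words and each are hypothetical stand-ins for a Function and for the Each pipe, and tuples are modeled as ordinary Python tuples.

```python
from typing import Callable, Iterable, Iterator, Tuple

Row = Tuple[str, ...]  # assumption: a tuple is modeled as a plain Python tuple


def split_words(row: Row) -> Iterator[Row]:
    """Function-style operation: emit one output tuple per word in the line."""
    (line,) = row
    for word in line.split():
        yield (word,)  # an empty line yields nothing, i.e. zero output tuples


def each(rows: Iterable[Row], function: Callable[[Row], Iterable[Row]]) -> Iterator[Row]:
    """Hypothetical stand-in for an Each pipe: apply the function to every tuple."""
    for row in rows:
        yield from function(row)


lines = [("the quick fox",), ("",), ("jumps",)]
words = list(each(lines, split_words))
```

Here the first input tuple produces three output tuples, the empty line produces none, and the last produces one.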
Filter
A Filter is a special kind of function that returns a Boolean value indicating whether the current input tuple should be removed from the tuple stream. A Function could serve this purpose, but the Filter is optimized for this case, and many filters can be grouped by “logical” filters such as AND, OR, XOR, and NOT, rapidly creating more complex filtering operations.
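As a rough sketch of how logical filters compose, the following plain Python models a filter as a predicate that returns True when a tuple should be removed. The helper names (and_, or_, xor_, not_, apply_filter) and the (name, age) tuple layout are assumptions made for illustration, not Cascading's classes.

```python
from typing import Callable, Iterable, List, Tuple

Row = Tuple[str, int]           # hypothetical (name, age) tuples
Filter = Callable[[Row], bool]  # True means "remove this tuple from the stream"


def is_minor(row: Row) -> bool:
    return row[1] < 18


def has_empty_name(row: Row) -> bool:
    return row[0] == ""


def and_(*filters: Filter) -> Filter:
    return lambda row: all(f(row) for f in filters)


def or_(*filters: Filter) -> Filter:
    return lambda row: any(f(row) for f in filters)


def xor_(a: Filter, b: Filter) -> Filter:
    return lambda row: a(row) != b(row)


def not_(f: Filter) -> Filter:
    return lambda row: not f(row)


def apply_filter(rows: Iterable[Row], should_remove: Filter) -> List[Row]:
    """Keep only the tuples the filter does not ask to remove."""
    return [row for row in rows if not should_remove(row)]


people = [("ann", 34), ("", 40), ("bob", 12), ("", 9)]
kept = apply_filter(people, or_(is_minor, has_empty_name))
```

Because each combinator returns another filter, they nest freely; for example, not_(and_(is_minor, has_empty_name)) removes every tuple except the minors with an empty name.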
Aggregator
An Aggregator performs some operation against a group of tuples, where the tuples are grouped by a common set of field values (for example, all tuples having the same “last-name” value). Common Aggregator implementations would be Sum, Count, Average, Max, and Min.
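The grouping behavior can be sketched in plain Python as follows. The start/aggregate/complete method names, the aggregate_groups helper, and the (last_name, amount) tuple layout are illustrative assumptions; the sketch groups in memory and is not how Cascading executes on a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple

Row = Tuple[str, int]  # hypothetical (last_name, amount) tuples


class Count:
    def start(self) -> None:
        self.n = 0

    def aggregate(self, row: Row) -> None:
        self.n += 1

    def complete(self) -> int:
        return self.n


class Sum:
    def start(self) -> None:
        self.total = 0

    def aggregate(self, row: Row) -> None:
        self.total += row[1]

    def complete(self) -> int:
        return self.total


class Max:
    def start(self) -> None:
        self.best = None

    def aggregate(self, row: Row) -> None:
        if self.best is None or row[1] > self.best:
            self.best = row[1]

    def complete(self) -> int:
        return self.best


def aggregate_groups(rows: Iterable[Row], aggregators: Sequence) -> Dict[str, tuple]:
    """Group tuples by their first field, then feed each group to every aggregator."""
    groups: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)

    results = {}
    for key, members in groups.items():
        for agg in aggregators:
            agg.start()                 # reset state for the new group
        for row in members:             # one pass over the group serves all aggregators
            for agg in aggregators:
                agg.aggregate(row)
        results[key] = tuple(agg.complete() for agg in aggregators)
    return results


sales = [("smith", 10), ("jones", 5), ("smith", 7)]
totals = aggregate_groups(sales, [Count(), Sum(), Max()])
```

Each aggregator only sees one tuple at a time and keeps a small running state, which is why several of them can share a single pass over the same group.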
Buffer
A Buffer is similar to an Aggregator, except it is optimized to act as a “sliding window” across all the tuples in a unique grouping. This is useful when the developer needs to efficiently insert missing values in an ordered set of tuples (such as a missing date or duration) or create a running average. Usually Aggregator is the operation of choice when working with groups of tuples, since many Aggregators can be chained together very efficiently, but sometimes a Buffer is the best tool for the job.
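To make the "sliding window" idea concrete, here is a minimal plain-Python sketch of a buffer-style operation that receives the whole ordered group and emits one output tuple per input. The moving_average name, the window parameter, and the (day, value) tuple layout are assumptions for illustration and do not reflect Cascading's Buffer interface.

```python
from collections import deque
from typing import Iterable, Iterator, Tuple

Row = Tuple[int, float]  # hypothetical (day, value) tuples, already sorted by day


def moving_average(group: Iterable[Row], window: int = 3) -> Iterator[Row]:
    """Buffer-style operation: walk the group in order, averaging the last `window` values."""
    recent = deque(maxlen=window)  # oldest value is dropped automatically when full
    for day, value in group:
        recent.append(value)
        yield (day, sum(recent) / len(recent))


readings = [(1, 10.0), (2, 20.0), (3, 30.0), (4, 40.0)]
smoothed = list(moving_average(readings, window=3))
```

Unlike the per-tuple aggregation style, this operation controls the iteration over the group itself, so it can look back at earlier tuples and emit as many results as it needs.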
Figure 24-5. Operation types
Operations are bound to pipes when the pipe assembly is created (Figure 24-6).