Operations
As mentioned earlier, Cascading departs from MapReduce by introducing alternative
operations that are applied either to individual tuples or to groups of tuples (Figure 24-5):
Function
A Function operates on individual input tuples and may return zero or more output
tuples for every one input. Functions are applied by the Each pipe.
Filter
A Filter is a special kind of function that returns a Boolean value indicating whether
the current input tuple should be removed from the tuple stream. A Function could
serve this purpose, but the Filter is optimized for this case, and many filters can be
grouped by “logical” filters such as AND, OR, XOR, and NOT, rapidly creating more
complex filtering operations.
Aggregator
An Aggregator performs some operation against a group of tuples, where the
tuples are grouped by a common set of field values (for example, all tuples having the
same “last-name” value). Common Aggregator implementations would be Sum,
Count, Average, Max, and Min.
Buffer
A Buffer is similar to an Aggregator, except it is optimized to act as a “sliding
window” across all the tuples in a unique grouping. This is useful when the developer
needs to efficiently insert missing values in an ordered set of tuples (such as a missing
date or duration) or create a running average. Usually Aggregator is the operation of
choice when working with groups of tuples, since many Aggregators can be chained
together very efficiently, but sometimes a Buffer is the best tool for the job.
Figure 24-5. Operation types
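The four operation types can be illustrated with a minimal sketch in plain Python. This is not the Cascading API: tuples are modeled as dicts, and every name here (split_words, is_stop_word, any_of, each, each_filter, group_by, count, running_average) is a hypothetical stand-in chosen only to show how each kind of operation consumes and produces tuples.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical stand-ins for the four operation kinds; tuples are plain dicts.

def split_words(tup):
    """Function: one input tuple in, zero or more output tuples out."""
    for word in tup["line"].split():
        yield {"word": word}

def is_stop_word(tup):
    """Filter: return True when the tuple should be REMOVED from the stream."""
    return tup["word"] in {"the", "a"}

def any_of(*filters):
    """A 'logical' OR filter: remove the tuple if any wrapped filter says so."""
    return lambda tup: any(f(tup) for f in filters)

def each(stream, function):
    """Apply a Function to every tuple, flattening its outputs."""
    for tup in stream:
        yield from function(tup)

def each_filter(stream, filter_):
    """Apply a Filter: keep only tuples for which it returns False."""
    return (tup for tup in stream if not filter_(tup))

def group_by(stream, field):
    """Group tuples sharing a field value (sorted first, as groupby requires)."""
    ordered = sorted(stream, key=itemgetter(field))
    for key, members in groupby(ordered, key=itemgetter(field)):
        yield key, list(members)

def count(group):
    """Aggregator: reduce one group of tuples to a single value."""
    return len(group)

def running_average(group, field):
    """Buffer: walk a whole group in order, emitting one tuple per input."""
    total = 0
    for i, tup in enumerate(group, start=1):
        total += tup[field]
        yield {**tup, "running_avg": total / i}

# Function, then Filter, then group and aggregate.
lines = [{"line": "the cat saw the hat a cat"}]
is_long = lambda tup: len(tup["word"]) > 10
words = each_filter(each(lines, split_words), any_of(is_stop_word, is_long))
counts = {key: count(group) for key, group in group_by(words, "word")}

# Buffer over a single ordered group of readings.
readings = [{"day": 1, "v": 10}, {"day": 2, "v": 20}, {"day": 3, "v": 30}]
averages = [t["running_avg"] for t in running_average(readings, "v")]
```

Note the asymmetry the text describes: count sees a group and returns one value, whereas running_average sees the same kind of group but can emit a tuple for every position in it, which is what makes the "sliding window" use cases possible.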
Operations are bound to pipes when the pipe assembly is created (Figure 24-6).
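As a rough illustration of binding at assembly-creation time, the sketch below models a pipe that receives its operation in its constructor, so the assembly is fully described before any data flows through it. The Pipe and Each classes here are hypothetical plain-Python models named after the Each pipe mentioned above; they are not Cascading's classes, and the two functions are invented for the example.

```python
class Pipe:
    """Hypothetical head of an assembly: passes tuples through unchanged."""

    def __init__(self, name):
        self.name = name

    def run(self, stream):
        return iter(stream)


class Each(Pipe):
    """Hypothetical pipe that applies its bound Function to every tuple."""

    def __init__(self, previous, function):
        super().__init__(previous.name)
        self.previous = previous
        self.function = function  # bound here, when the assembly is built

    def run(self, stream):
        # Pull tuples from the upstream pipe and flatten the Function's output.
        for tup in self.previous.run(stream):
            yield from self.function(tup)


def split_words(tup):
    """Function: emit one tuple per word in the input line."""
    for word in tup["line"].split():
        yield {"word": word}


def upper_case(tup):
    """Function: emit exactly one transformed tuple per input."""
    yield {"word": tup["word"].upper()}


# Building the assembly binds each operation to its pipe; no data moves yet.
assembly = Each(Each(Pipe("words"), split_words), upper_case)

# Data is supplied only when the assembly is run.
result = list(assembly.run([{"line": "hello big data"}]))
```

Because the operations are fixed when the pipes are constructed, the same assembly object can be run against different input streams without being rebuilt.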