Cascading - Hadoop: The Definitive Guide

Operations

As mentioned earlier, Cascading departs from MapReduce by introducing alternative operations that are applied either to individual tuples or to groups of tuples (Figure 24-5):

Function

A Function operates on individual input tuples and may return zero or more output tuples for each input tuple. Functions are applied by the Each pipe.

Filter

A Filter is a special kind of function that returns a Boolean value indicating whether the current input tuple should be removed from the tuple stream. A Function could serve this purpose, but the Filter is optimized for this case, and many filters can be combined using “logical” filters such as AND, OR, XOR, and NOT, rapidly creating more complex filtering operations.
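
The two per-tuple operations described so far can be illustrated with a minimal sketch in plain Java. It does not use the Cascading API: the class TupleOpsSketch and the methods splitWords, isShort, isNumeric, and each are hypothetical names chosen for illustration, and tuples are reduced to single strings. It only models the behavior described above, namely a function that emits zero or more outputs per input and a filter whose true result means "remove".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Plain-Java model of per-tuple operations; this is NOT the Cascading API.
class TupleOpsSketch {

    // Models a Function: one input tuple in, zero or more output tuples out.
    static List<String> splitWords(String line) {
        List<String> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(word);
            }
        }
        return out;
    }

    // Model Filters: each returns true when the tuple should be REMOVED.
    static boolean isShort(String word) {
        return word.length() < 3;
    }

    static boolean isNumeric(String word) {
        return word.matches("\\d+");
    }

    // Models an Each pipe: applies the function to every input tuple, then
    // drops each output tuple for which the remove-filter returns true.
    static List<String> each(List<String> input,
                             Function<String, List<String>> function,
                             Predicate<String> removeIf) {
        List<String> out = new ArrayList<>();
        for (String tuple : input) {
            for (String result : function.apply(tuple)) {
                if (!removeIf.test(result)) {
                    out.add(result);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Combine two filters with a logical OR.
        Predicate<String> shortWord = TupleOpsSketch::isShort;
        Predicate<String> removeIf = shortWord.or(TupleOpsSketch::isNumeric);

        List<String> lines = Arrays.asList("the quick brown fox", "", "ran 1234 km");
        System.out.println(each(lines, TupleOpsSketch::splitWords, removeIf));
    }
}
```

In this model the empty line yields no output tuples at all, and the OR-combined filter removes both the short word and the numeric token from the third line.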

Aggregator

An Aggregator performs some operation against a group of tuples, where the tuples are grouped by a common set of field values (for example, all tuples having the same “last-name” value). Common Aggregator implementations include Sum, Count, Average, Max, and Min.

Buffer

A Buffer is similar to an Aggregator, except it is optimized to act as a “sliding window” across all the tuples in a unique grouping. This is useful when the developer needs to efficiently insert missing values in an ordered set of tuples (such as a missing date or duration) or create a running average. Usually Aggregator is the operation of choice when working with groups of tuples, since many Aggregators can be chained together very efficiently, but sometimes a Buffer is the best tool for the job.
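
The difference between the two per-group operations can be sketched in plain Java as follows. This is again a model, not the Cascading API: GroupOpsSketch, tuple, groupByKey, sum, max, and runningAverage are hypothetical names, and each tuple is reduced to a (last-name, value) pair. The assumption being illustrated is that an aggregator keeps a small running state and emits one result per group, while a buffer walks the whole ordered group and may emit a result for every tuple it sees.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of per-group operations; this is NOT the Cascading API.
class GroupOpsSketch {

    // Hypothetical helper: builds a (last-name, value) tuple.
    static Map.Entry<String, Integer> tuple(String lastName, int value) {
        return new AbstractMap.SimpleEntry<>(lastName, value);
    }

    // Groups tuples by key, keeping arrival order inside each group.
    static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> tuples) {
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> t : tuples) {
            groups.computeIfAbsent(t.getKey(), k -> new ArrayList<>()).add(t.getValue());
        }
        return groups;
    }

    // Models an Aggregator: one running total, one result per group.
    static int sum(List<Integer> group) {
        int total = 0;
        for (int v : group) {
            total += v;
        }
        return total;
    }

    // A second model Aggregator that can run over the same group.
    static int max(List<Integer> group) {
        int best = Integer.MIN_VALUE;
        for (int v : group) {
            best = Math.max(best, v);
        }
        return best;
    }

    // Models a Buffer: walks the ordered group and emits a result per tuple,
    // here the running average up to and including that tuple.
    static List<Double> runningAverage(List<Integer> group) {
        List<Double> out = new ArrayList<>();
        double total = 0;
        int count = 0;
        for (int v : group) {
            total += v;
            count++;
            out.add(total / count);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> tuples = new ArrayList<>();
        tuples.add(tuple("smith", 10));
        tuples.add(tuple("jones", 5));
        tuples.add(tuple("smith", 20));
        tuples.add(tuple("smith", 60));

        for (Map.Entry<String, List<Integer>> g : groupByKey(tuples).entrySet()) {
            List<Integer> values = g.getValue();
            System.out.println(g.getKey() + " sum=" + sum(values) + " max=" + max(values)
                    + " runningAverage=" + runningAverage(values));
        }
    }
}
```

Running sum and max over the same group mirrors the idea of chaining several aggregators, whereas runningAverage needs to see the group in order and produces as many outputs as there are tuples, which is the kind of work the text assigns to a Buffer.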

Figure 24-5. Operation types

Operations are bound to pipes when the pipe assembly is created (Figure 24-6).
