User-Defined Functions
Pig's designers realized that the ability to plug in custom code is crucial for all but the most
trivial data processing jobs. For this reason, they made it easy to define and use user-
defined functions. We only cover Java UDFs in this section, but be aware that you can also
write UDFs in Python, JavaScript, Ruby, or Groovy, all of which are run using the Java
Scripting API.
A Filter UDF
Let's demonstrate by writing a filter function that removes weather records whose
temperature quality reading is not satisfactory (or better). The idea is to change this
line:
filtered_records = FILTER records BY temperature != 9999 AND
  quality IN (0, 1, 4, 5, 9);
to:
filtered_records = FILTER records BY temperature != 9999 AND
  isGood(quality);
This achieves two things: it makes the Pig script a little more concise, and it encapsulates
the logic in one place so that it can be easily reused in other scripts. If we were just writing
an ad hoc query, we probably wouldn't bother to write a UDF. It's when you start doing the
same kind of processing over and over again that you see opportunities for reusable UDFs.
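The predicate that isGood() is meant to encapsulate can be sketched in plain Java,
independent of Pig. This is only an illustration: the class name QualityCheck and its
method are hypothetical, and the quality codes are simply those from the IN list above.

```java
import java.util.Set;

// Hypothetical helper, not part of Pig: mirrors `quality IN (0, 1, 4, 5, 9)`.
class QualityCheck {
    // Quality codes accepted by the IN list in the original script.
    private static final Set<Integer> GOOD_CODES = Set.of(0, 1, 4, 5, 9);

    // Returns true when the code is one of the accepted values.
    static boolean isGood(int quality) {
        return GOOD_CODES.contains(quality);
    }

    public static void main(String[] args) {
        System.out.println(isGood(1)); // prints "true"
        System.out.println(isGood(2)); // prints "false"
    }
}
```

Keeping the accepted codes in one place like this is what makes the check reusable: a
change to the list is made once rather than in every script that filters on quality.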
Filter UDFs are all subclasses of FilterFunc, which itself is a subclass of EvalFunc.
We'll look at EvalFunc in more detail later, but for the moment just note that, in essence,
EvalFunc looks like the following class:
public abstract class EvalFunc<T> {
  public abstract T exec(Tuple input) throws IOException;
}
EvalFunc's only abstract method, exec(), takes a tuple and returns a single value of the
(parameterized) type T. The fields in the input tuple consist of the expressions passed to the
function (in this case, a single integer). For FilterFunc, T is Boolean, so the method
should return true only for those tuples that should not be filtered out.
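To make the contract concrete without depending on the Pig libraries, here is a minimal
sketch that models it with stand-in types. Everything in it is an assumption made for
illustration: EvalFuncSketch and FilterFuncSketch are simplified substitutes for Pig's
classes, a List<Object> plays the role of Tuple, and IsNot9999 is a hypothetical filter
that mirrors the temperature != 9999 condition from the script.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class FilterContractSketch {
    // Stand-in for Pig's EvalFunc<T>; a List<Object> substitutes for Tuple.
    abstract static class EvalFuncSketch<T> {
        abstract T exec(List<Object> input) throws IOException;
    }

    // Stand-in for FilterFunc: the same contract with T fixed to Boolean.
    abstract static class FilterFuncSketch extends EvalFuncSketch<Boolean> {}

    // Hypothetical filter: true when the first field is an integer other than 9999.
    static class IsNot9999 extends FilterFuncSketch {
        @Override
        Boolean exec(List<Object> input) {
            Object field = input.get(0); // the single argument passed to the function
            return field instanceof Integer && (Integer) field != 9999;
        }
    }

    // Keeps only the tuples for which the filter returns true.
    static List<List<Object>> keep(FilterFuncSketch f, List<List<Object>> tuples)
            throws IOException {
        List<List<Object>> kept = new ArrayList<>();
        for (List<Object> tuple : tuples) {
            if (f.exec(tuple)) {
                kept.add(tuple);
            }
        }
        return kept;
    }

    public static void main(String[] args) throws IOException {
        List<List<Object>> tuples = List.of(
                List.<Object>of(111), List.<Object>of(9999), List.<Object>of(-11));
        System.out.println(keep(new IsNot9999(), tuples));
    }
}
```

The point of the sketch is the direction of the Boolean: a tuple survives the filter when
exec() returns true, and is dropped otherwise.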
For the quality filter, we write a class, IsGoodQuality, that extends FilterFunc and
implements the exec() method (see Example 16-1). The Tuple class is essentially a list
of objects with associated types. Here we are concerned only with the first field (since the
function only has a single argument), which we extract by index using the get() method