Programming with Pig - Hadoop in Action


The object input belongs to the Tuple class, which has two methods for retrieving its content.

List<Object> getAll();

Object get(int fieldNum) throws ExecException;

The getAll() method returns all fields in the tuple as an ordered list. UPPER instead
uses the get() method to request a specific field (at position 0). This method
throws an ExecException if the requested field number is beyond the last field
in the tuple. In UPPER the retrieved field is cast to a Java String, which usually
works but causes a ClassCastException if the field holds an incompatible data type.
We'll see later how to use Pig to ensure that our casting works. In any case, the
try/catch block catches and handles any exception. If everything works, UPPER's
exec() method returns a String with its characters uppercased. In addition, most
UDFs should implement the default behavior that the output is null when the input
tuple is null.
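The null check, the positional get(0), and the cast described above can be sketched in plain Java. This is a minimal sketch, not the PiggyBank source: SimpleTuple and FieldAccessException are hypothetical stand-ins for Pig's Tuple and ExecException so that the code compiles without Pig on the classpath, and returning null from the catch block is an assumed way of handling the exception.

```java
import java.util.Arrays;
import java.util.List;

public class UpperSketch {
    /** Hypothetical stand-in for Pig's ExecException. */
    static class FieldAccessException extends Exception {
        FieldAccessException(String message) { super(message); }
    }

    /** Hypothetical stand-in for Pig's Tuple, with only the two accessors discussed. */
    static class SimpleTuple {
        private final List<Object> fields;

        SimpleTuple(Object... values) { this.fields = Arrays.asList(values); }

        /** Returns all fields as an ordered list. */
        List<Object> getAll() { return fields; }

        /** Returns one field by zero-based position, or throws if there is no such field. */
        Object get(int fieldNum) throws FieldAccessException {
            if (fieldNum < 0 || fieldNum >= fields.size()) {
                throw new FieldAccessException("No field at position " + fieldNum);
            }
            return fields.get(fieldNum);
        }
    }

    /** Follows the shape of exec(): null in, null out; otherwise uppercase field 0. */
    static String exec(SimpleTuple input) {
        if (input == null) {
            return null;
        }
        try {
            String str = (String) input.get(0); // may throw ClassCastException
            return str.toUpperCase();
        } catch (Exception e) {
            // Assumption: a missing field, a bad cast, or a null field all yield null.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(exec(new SimpleTuple("pig")));  // a chararray-like field
        System.out.println(exec(new SimpleTuple(42)));     // an incompatible field type
        System.out.println(exec(new SimpleTuple()));       // no field at position 0
    }
}
```

Catching the broad Exception type keeps one bad record from stopping the whole job, at the cost of silently turning failures into nulls.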

In addition to implementing exec(), UPPER also overrides a couple of methods from
EvalFunc, one of which is getArgToFuncMapping:

@Override
public List<FuncSpec> getArgToFuncMapping() throws FrontendException {
    List<FuncSpec> funcList = new ArrayList<FuncSpec>();
    funcList.add(new FuncSpec(this.getClass().getName(),
        new Schema(new Schema.FieldSchema(null, DataType.CHARARRAY))));
    return funcList;
}

The getArgToFuncMapping() method returns a List of FuncSpec objects representing
the schema of each field in the input tuple. Pig handles typecasting for you by
converting the types of all fields in a tuple to conform to this schema before
passing it to exec(). Fields that can't be converted to the desired type are
passed as null.
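To make the convert-or-null rule concrete, here is a toy illustration in plain Java. It is not Pig's casting code: the class ConformSketch, the helper asChararray, and the rule that only strings and numbers are convertible to text are all assumptions made for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ConformSketch {
    /**
     * Toy conversion rule (an assumption, not Pig's): strings pass through,
     * numbers are rendered as text, and anything else is unconvertible.
     */
    static String asChararray(Object field) {
        if (field instanceof String) {
            return (String) field;
        }
        if (field instanceof Number) {
            return field.toString();
        }
        return null; // an unconvertible field is handed on as null
    }

    /** Conforms every field of a stand-in tuple to text before exec() would see it. */
    static List<String> conform(List<Object> tuple) {
        List<String> out = new ArrayList<>();
        for (Object field : tuple) {
            out.add(asChararray(field));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Object> raw = Arrays.<Object>asList("pig", 7, new Object());
        System.out.println(conform(raw)); // prints [pig, 7, null]
    }
}
```

The practical consequence for a UDF author is that exec() has to tolerate null fields even when the input tuple itself is not null.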

UPPER only cares about the type of the first field, so it adds only one FuncSpec to
the list, and this FuncSpec states that the field must be of type chararray,
represented as DataType.CHARARRAY. The instantiation of FuncSpec is quite convoluted
because Pig must be able to handle complex nested types. Fortunately, unless you work
with unusually complicated types, you'll probably find a FuncSpec instantiation for
the type you want already in one of PiggyBank's UDFs. Reuse that in your code. You can
even reuse the entire getArgToFuncMapping() function if you have the same tuple
schema as another UDF.

Besides telling Pig the input schema, you can also tell Pig the schema of your
output. You may not need to do this if the output of your UDF is a simple scalar, as
Pig will use Java's reflection mechanism to infer the schema automatically. But if
your UDF returns a tuple or a bag, reflection cannot work out the complete schema.
In that case you should specify it so that Pig can propagate the schema correctly.
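The limit of reflection mentioned above can be seen in a small plain-Java experiment. This is only an illustration of the general idea, not Pig's inference code: Func, Upper, and Swap are hypothetical stand-ins for EvalFunc and two UDFs, and a List<Object> stands in for a tuple.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.Arrays;
import java.util.List;

public class ReflectionSketch {
    /** Hypothetical stand-in for a UDF base class parameterized by its return type. */
    abstract static class Func<T> {
        abstract T exec(List<Object> input);
    }

    /** Returns a scalar: the declared type says everything there is to know. */
    static class Upper extends Func<String> {
        String exec(List<Object> input) {
            return ((String) input.get(0)).toUpperCase();
        }
    }

    /** Returns a tuple-like container: the declared type says nothing about its fields. */
    static class Swap extends Func<List<Object>> {
        List<Object> exec(List<Object> input) {
            return Arrays.asList(input.get(1), input.get(0));
        }
    }

    /** Reads the declared return type T from the generic superclass via reflection. */
    static Type declaredReturnType(Class<?> udfClass) {
        ParameterizedType parent = (ParameterizedType) udfClass.getGenericSuperclass();
        return parent.getActualTypeArguments()[0];
    }

    public static void main(String[] args) {
        // A scalar return type is fully described by its class.
        System.out.println(declaredReturnType(Upper.class));
        // A container return type reveals only "a list of objects":
        // how many fields it holds, and their names and types, are not visible.
        System.out.println(declaredReturnType(Swap.class));
    }
}
```

For Upper, reflection recovers String, which maps directly to a scalar type. For Swap, it recovers only a container of Object, so the field-level schema has to be supplied by the UDF author.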
