Pig - Hadoop: The Definitive Guide


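package com.hadoopbook.pig;

import org.apache.pig.PrimitiveEvalFunc;

// Eval function that strips leading and trailing whitespace from a chararray
public class Trim extends PrimitiveEvalFunc<String, String> {
  @Override
  public String exec(String input) {
    return input.trim();
  }
}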

In this case, we have taken advantage of PrimitiveEvalFunc, which is a specialization of EvalFunc for when the input is a single primitive (atomic) type. For the Trim UDF, the input and output types are both String. [102]
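The relationship between the two classes can be pictured with a minimal, self-contained sketch. It does not use Pig's real classes: EvalFuncSketch, PrimitiveEvalFuncSketch, TrimSketch, and trimViaTuple are hypothetical stand-ins, and a List<Object> stands in for a Pig tuple. The assumption being illustrated is that the specialization simply unwraps the single input field and hands it to the typed exec() method.

```java
import java.util.ArrayList;
import java.util.List;

public class PrimitiveEvalSketch {

    // Stand-in for the general eval function: it receives a whole tuple of fields.
    public abstract static class EvalFuncSketch<OUT> {
        public abstract OUT exec(List<Object> tuple);
    }

    // Stand-in for the specialization: it expects exactly one field,
    // unwraps it, and delegates to a typed single-argument exec().
    public abstract static class PrimitiveEvalFuncSketch<IN, OUT>
            extends EvalFuncSketch<OUT> {
        @Override
        @SuppressWarnings("unchecked")
        public OUT exec(List<Object> tuple) {
            return exec((IN) tuple.get(0));
        }

        public abstract OUT exec(IN input);
    }

    // The UDF author only writes the typed method.
    public static class TrimSketch extends PrimitiveEvalFuncSketch<String, String> {
        @Override
        public String exec(String input) {
            return input.trim();
        }
    }

    // Calls the UDF the way a runtime would: through the tuple-level method.
    public static String trimViaTuple(String field) {
        List<Object> tuple = new ArrayList<>();
        tuple.add(field);
        EvalFuncSketch<String> udf = new TrimSketch();
        return udf.exec(tuple);
    }

    public static void main(String[] args) {
        System.out.println(trimViaTuple("  pomegranate")); // prints "pomegranate"
    }
}
```

The point of the design is that the tuple handling lives in one place, so a UDF over a single atomic value reduces to a one-line method.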

In general, when you write an eval function, you need to consider what the output's schema looks like. In the following statement, the schema of B is determined by the function udf:

B = FOREACH A GENERATE udf($0);

If udf creates tuples with scalar fields, then Pig can determine B's schema through reflection. For complex types such as bags, tuples, or maps, Pig needs more help, and you should implement the outputSchema() method to give Pig the information about the output schema.
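To make the reflection step concrete, here is a simplified model of how a scalar return type could be discovered from a UDF class. It is a sketch under stated assumptions, not Pig's implementation: EvalFuncSketch, TrimSketch, PairSketch, and inferPigType are hypothetical names, and the lookup table covers only a few of the Java-to-Pig scalar type mappings.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ReturnTypeSketch {

    // Hypothetical stand-in for an eval function base class.
    public abstract static class EvalFuncSketch<T> {
        public abstract T exec(Object input);
    }

    // Scalar result: the type argument is a plain class (String).
    public static class TrimSketch extends EvalFuncSketch<String> {
        @Override
        public String exec(Object input) {
            return input.toString().trim();
        }
    }

    // Stand-in for a complex result: the type argument is itself generic.
    public static class PairSketch extends EvalFuncSketch<List<String>> {
        @Override
        public List<String> exec(Object input) {
            return List.of(input.toString(), input.toString());
        }
    }

    // A few scalar Java types and the Pig type names they correspond to.
    private static final Map<Class<?>, String> PIG_TYPES = new HashMap<>();
    static {
        PIG_TYPES.put(String.class, "chararray");
        PIG_TYPES.put(Integer.class, "int");
        PIG_TYPES.put(Long.class, "long");
        PIG_TYPES.put(Float.class, "float");
        PIG_TYPES.put(Double.class, "double");
    }

    // Reads the type argument of the generic superclass and maps it to a
    // Pig type name. Returns null when the type is not a known scalar, which
    // in this sketch is the case where an explicit schema would be needed.
    public static String inferPigType(Class<?> udfClass) {
        Type superType = udfClass.getGenericSuperclass();
        if (superType instanceof ParameterizedType) {
            Type arg = ((ParameterizedType) superType).getActualTypeArguments()[0];
            if (arg instanceof Class<?>) {
                return PIG_TYPES.get(arg);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(inferPigType(TrimSketch.class)); // prints "chararray"
    }
}
```

For PairSketch the same call returns null, because the declared return type alone does not describe the fields inside the collection. That gap is what outputSchema() is there to fill.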

The Trim UDF returns a string, which Pig translates as a chararray, as can be seen from the following session:

grunt> DUMP A;
( pomegranate)
(banana )
(apple)
( lychee )
grunt> DESCRIBE A;
A: {fruit: chararray}
grunt> B = FOREACH A GENERATE com.hadoopbook.pig.Trim(fruit);
grunt> DUMP B;
(pomegranate)
(banana)
(apple)
(lychee)
grunt> DESCRIBE B;
B: {chararray}

A has chararray fields that have leading and trailing spaces. We create B from A by applying the Trim function to the first field in A (named fruit). B's fields are correctly inferred to be of type chararray.
