Pig - Hadoop: The Definitive Guide


... number of fields, which is just the number of Range objects, and each field is populated with the substring of the line that the corresponding Range object selects.

We need to think about what to do when the line is shorter than the range asked for. One option is to throw an exception and stop further processing. This is appropriate if your application cannot tolerate incomplete or corrupt records. In many cases, it is better to return a tuple with null fields and let the Pig script handle the incomplete data as it sees fit. This is the approach we take here: by exiting the for loop if the range end is past the end of the line, we leave the current field and any subsequent fields in the tuple with their default values of null.
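The behavior described above can be sketched without any Pig dependencies. This is a minimal illustration, not the book's CutLoadFunc: the class name CutSketch, the nested Range class (assumed here to be 1-based with an inclusive end), and the use of a String array in place of a Pig tuple are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;

/** Stand-alone sketch: a String[] plays the role of the Pig tuple. */
final class CutSketch {

    /** Hypothetical stand-in for the Range class: 1-based start, inclusive end. */
    static final class Range {
        final int start;
        final int end;

        Range(int start, int end) {
            this.start = start;
            this.end = end;
        }

        String substring(String line) {
            return line.substring(start - 1, end);
        }
    }

    /** Cuts one field per range; fields that cannot be filled stay null. */
    static String[] cut(String line, List<Range> ranges) {
        String[] fields = new String[ranges.size()]; // every slot starts as null
        for (int i = 0; i < ranges.size(); i++) {
            Range range = ranges.get(i);
            if (range.end > line.length()) {
                // The line is too short: stop here, leaving this field and
                // all later fields at their default value of null.
                break;
            }
            fields[i] = range.substring(line);
        }
        return fields;
    }

    public static void main(String[] args) {
        List<Range> ranges =
            Arrays.asList(new Range(1, 4), new Range(5, 8), new Range(9, 12));
        System.out.println(Arrays.toString(cut("195000010022", ranges)));
        System.out.println(Arrays.toString(cut("19500001", ranges)));
    }
}
```

With the second, truncated line, the first two fields are filled and the third is left as null, so a Pig script could filter or default that field as it sees fit.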

Using a schema

Let's now consider the types of the fields being loaded. If the user has specified a schema, then the fields need to be converted to the relevant types. However, this is performed lazily by Pig, so the loader should always construct tuples of type bytearray, using the DataByteArray type. The load function still has the opportunity to do the conversion, however, by overriding getLoadCaster() to return a custom implementation of the LoadCaster interface, which provides a collection of conversion methods for this purpose.

CutLoadFunc doesn't override getLoadCaster() because the default implementation returns Utf8StorageConverter, which provides standard conversions between UTF-8-encoded data and Pig data types.
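To make the idea of a caster concrete, here is a rough, dependency-free sketch of the kind of conversion such a class performs: decoding raw UTF-8 bytes into a typed value only when one is needed. The class Utf8CastSketch is hypothetical and does not implement Pig's LoadCaster interface; the method names merely echo its naming style, and returning null for unparseable input is an assumption of this sketch rather than a statement about Pig's converter.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration of UTF-8 bytes-to-type conversion. */
final class Utf8CastSketch {

    /** Decodes the bytes as UTF-8 text. */
    static String bytesToCharArray(byte[] bytes) {
        if (bytes == null) {
            return null;
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Decodes the bytes as UTF-8 text and parses the text as an int. */
    static Integer bytesToInteger(byte[] bytes) {
        if (bytes == null) {
            return null;
        }
        try {
            return Integer.valueOf(new String(bytes, StandardCharsets.UTF_8));
        } catch (NumberFormatException e) {
            // Assumption of this sketch: bad input becomes null, not an error.
            return null;
        }
    }
}
```

Because the loader hands over raw bytes, a field that is never used as an int never pays the parsing cost.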

In some cases, the load function itself can determine the schema. For example, if we were loading self-describing data such as XML or JSON, we could create a schema for Pig by looking at the data. Alternatively, the load function may determine the schema in another way, such as from an external file, or by being passed information in its constructor. To support such cases, the load function should implement the LoadMetadata interface (in addition to the LoadFunc interface) so it can supply a schema to the Pig runtime. Note, however, that if a user supplies a schema in the AS clause of LOAD, then it takes precedence over the schema specified through the LoadMetadata interface.
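The precedence rule in the last sentence can be pictured with a tiny, Pig-free sketch. Everything here is hypothetical: the class SchemaChoiceSketch, the representation of a schema as a list of "name:type" strings, and the convention that a null list means no AS clause was given. It models the rule only and says nothing about how Pig implements it.

```java
import java.util.List;

/** Hypothetical model of which schema wins: the user's or the loader's. */
final class SchemaChoiceSketch {

    /**
     * Returns the schema from the AS clause when the user gave one
     * (non-null), and otherwise falls back to the loader's schema.
     */
    static List<String> effectiveSchema(List<String> asClauseSchema,
                                        List<String> loaderSchema) {
        return asClauseSchema != null ? asClauseSchema : loaderSchema;
    }
}
```

In this model, a loader that reports "year:int" and "temperature:int" is overridden if the script declares both fields as chararray in its AS clause.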

A load function may additionally implement the LoadPushDown interface as a means for finding out which columns the query is asking for. This can be a useful optimization for column-oriented storage, so that the loader loads only the columns that are needed by the query. There is no obvious way for CutLoadFunc to load only a subset of columns, because it reads the whole line for each tuple, so we don't use this optimization.
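For contrast, the following sketch shows why the optimization pays off when data is stored by column. It is an illustration under stated assumptions, not Pig code: ProjectionSketch and its in-memory two-dimensional array are hypothetical, and the array of required column indexes simply stands in for whatever column list Pig hands to the loader.

```java
/** Hypothetical column store: columns[c][r] holds column c of row r. */
final class ProjectionSketch {

    /** Builds a row from the requested columns only; others are never read. */
    static String[] readRow(String[][] columns, int row, int[] requiredColumns) {
        String[] fields = new String[requiredColumns.length];
        for (int i = 0; i < requiredColumns.length; i++) {
            fields[i] = columns[requiredColumns[i]][row];
        }
        return fields;
    }
}
```

A line-oriented loader such as CutLoadFunc gains nothing from this, since the whole line has already been read by the time any field is cut from it.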
