ber of fields, which is just the number of Range objects, and the fields are populated using substrings of the line, which are determined by the Range objects.
We need to think about what to do when the line is shorter than the range asked for. One option is to throw an exception and stop further processing. This is appropriate if your application cannot tolerate incomplete or corrupt records. In many cases, it is better to return a tuple with null fields and let the Pig script handle the incomplete data as it sees fit. This is the approach we take here; by exiting the for loop if the range end is past the end of the line, we leave the current field and any subsequent fields in the tuple with their default values of null.
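The null-field approach can be illustrated with a small standalone sketch. It does not use Pig's Tuple or DataByteArray types: RangeSketch is a hypothetical stand-in for the Range class (assumed here to be a half-open interval of character positions), and a plain String array stands in for the tuple. It shows only the idea of leaving fields null once a range runs past the end of the line, not the actual CutLoadFunc code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the Range type described in the text:
// a half-open character interval [start, end) within a line.
final class RangeSketch {
    final int start;
    final int end;

    RangeSketch(int start, int end) {
        this.start = start;
        this.end = end;
    }
}

final class CutFieldsSketch {
    // Returns one field per range. A field whose range ends past the end of
    // the line, and every field after it, keeps its default value of null.
    static String[] extractFields(String line, List<RangeSketch> ranges) {
        String[] fields = new String[ranges.size()]; // all null initially
        for (int i = 0; i < ranges.size(); i++) {
            RangeSketch range = ranges.get(i);
            if (range.end > line.length()) {
                break; // line too short: leave this and later fields null
            }
            fields[i] = line.substring(range.start, range.end);
        }
        return fields;
    }

    public static void main(String[] args) {
        List<RangeSketch> ranges = Arrays.asList(
                new RangeSketch(0, 4), new RangeSketch(4, 6), new RangeSketch(6, 10));
        // The line has only 6 characters, so the third range cannot be filled.
        System.out.println(Arrays.toString(extractFields("1950AB", ranges)));
        // prints [1950, AB, null]
    }
}
```

A stricter loader could throw an exception at the point where this sketch breaks out of the loop, which corresponds to the first option described above.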
Using a schema
Let's now consider the types of the fields being loaded. If the user has specified a schema, then the fields need to be converted to the relevant types. However, this is performed lazily by Pig, so the loader should always construct tuples whose fields are of type bytearray, using the DataByteArray type. The load function still has the opportunity to do the conversion, however, by overriding getLoadCaster() to return a custom implementation of the LoadCaster interface, which provides a collection of conversion methods for this purpose.
CutLoadFunc doesn't override getLoadCaster() because the default implementation returns Utf8StorageConverter, which provides standard conversions between UTF-8-encoded data and Pig data types.
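To make the split between loading and conversion concrete, here is a rough standalone sketch of the kind of conversion a caster performs: the loader hands over raw bytes, and a separate method turns them into a typed value only when one is needed. The class and method below are hypothetical illustrations written for this sketch; they do not use Pig's LoadCaster or Utf8StorageConverter classes, and returning null for malformed input is an assumption made here to match the null-field handling described earlier.

```java
import java.nio.charset.StandardCharsets;

final class LazyCastSketch {
    // Hypothetical conversion helper: interprets raw bytes as UTF-8 text and
    // parses them as an integer. Returns null if the bytes are missing or do
    // not form a valid integer (an assumption of this sketch).
    static Integer bytesToInteger(byte[] raw) {
        if (raw == null) {
            return null;
        }
        String text = new String(raw, StandardCharsets.UTF_8);
        try {
            return Integer.valueOf(text);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // The "loader" side keeps only the raw bytes of each field.
        byte[] good = "1950".getBytes(StandardCharsets.UTF_8);
        byte[] bad = "19x0".getBytes(StandardCharsets.UTF_8);

        // Conversion is deferred until a typed value is actually requested.
        System.out.println(bytesToInteger(good)); // prints 1950
        System.out.println(bytesToInteger(bad));  // prints null
    }
}
```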
In some cases, the load function itself can determine the schema. For example, if we were loading self-describing data such as XML or JSON, we could create a schema for Pig by looking at the data. Alternatively, the load function may determine the schema in another way, such as from an external file, or by being passed information in its constructor. To support such cases, the load function should implement the LoadMetadata interface (in addition to the LoadFunc interface) so it can supply a schema to the Pig runtime. Note, however, that if a user supplies a schema in the AS clause of LOAD, then it takes precedence over the schema specified through the LoadMetadata interface.
A load function may additionally implement the LoadPushDown interface as a means for finding out which columns the query is asking for. This can be a useful optimization for column-oriented storage, so that the loader loads only the columns that are needed by the query. There is no obvious way for CutLoadFunc to load only a subset of columns, because it reads the whole line for each tuple, so we don't use this optimization.