ber of fields, which is just the number of Range classes, and the fields are populated us-
ing substrings of the line, which are determined by the Range objects.
We need to think about what to do when the line is shorter than the range asked for. One
option is to throw an exception and stop further processing. This is appropriate if your ap-
plication cannot tolerate incomplete or corrupt records. In many cases, it is better to return
a tuple with null fields and let the Pig script handle the incomplete data as it sees fit.
This is the approach we take here; by exiting the for loop if the range end is past the end
of the line, we leave the current field and any subsequent fields in the tuple with their
default values of null.
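The two policies can be sketched without any Pig dependencies. The following is a minimal illustration, not the loader's actual code: FieldRange and CutFieldsSketch are hypothetical stand-ins, a plain list of strings stands in for the Pig tuple, and the strict flag is an assumption added to show the exception-throwing alternative.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a range helper: start is inclusive, end is exclusive.
final class FieldRange {
    final int start;
    final int end;

    FieldRange(int start, int end) {
        this.start = start;
        this.end = end;
    }
}

final class CutFieldsSketch {
    // Returns one entry per range. In lenient mode, the first range that runs
    // past the end of the line stops processing, so that field and all later
    // fields keep their initial value of null. In strict mode, it throws.
    static List<String> cut(String line, List<FieldRange> ranges, boolean strict) {
        // Pre-fill with nulls, like a tuple created with a fixed number of fields.
        List<String> fields =
            new ArrayList<>(Collections.<String>nCopies(ranges.size(), null));
        for (int i = 0; i < ranges.size(); i++) {
            FieldRange range = ranges.get(i);
            if (range.end > line.length()) {
                if (strict) {
                    throw new IllegalArgumentException(
                        "Line too short for range ending at " + range.end);
                }
                break; // lenient: leave this field and the rest as null
            }
            fields.set(i, line.substring(range.start, range.end));
        }
        return fields;
    }
}
```

With the line "abcdef" and ranges (0, 2), (2, 8), and (4, 6), the lenient call returns "ab" followed by two nulls: the second range is too long, and the third is never examined because the loop has already exited.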
Using a schema
Let's now consider the types of the fields being loaded. If the user has specified a schema,
then the fields need to be converted to the relevant types. However, this is performed
lazily by Pig, so the loader should always construct tuples of type bytearray, using
the DataByteArray type. The load function still has the opportunity to do the conver-
sion, however, by overriding getLoadCaster() to return a custom implementation of
the LoadCaster interface, which provides a collection of conversion methods for this
purpose.
CutLoadFunc doesn't override getLoadCaster() because the default implementation
returns Utf8StorageConverter, which provides standard conversions between
UTF-8-encoded data and Pig data types.
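To make the idea of lazy conversion concrete, here is a rough sketch of what a caster does: it receives the raw bytes the loader stored and produces a typed value only when one is needed. BytesCaster and Utf8CasterSketch are hypothetical, cut-down stand-ins rather than Pig's LoadCaster and Utf8StorageConverter, and returning null for unparsable input is an assumption made for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical, trimmed-down stand-in for a load caster: one method per target type.
interface BytesCaster {
    String toChararray(byte[] raw);
    Integer toInteger(byte[] raw);
}

final class Utf8CasterSketch implements BytesCaster {
    // Decodes the raw bytes as UTF-8 text; null input stays null.
    @Override
    public String toChararray(byte[] raw) {
        return raw == null ? null : new String(raw, StandardCharsets.UTF_8);
    }

    // Decodes the bytes as UTF-8 text and parses the text as a decimal integer.
    // Assumption for this sketch: unparsable input yields null, not an exception.
    @Override
    public Integer toInteger(byte[] raw) {
        if (raw == null) {
            return null;
        }
        try {
            return Integer.valueOf(toChararray(raw));
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
```

The loader itself never calls these methods. It hands over the bytes unchanged, and the conversion runs only for fields whose declared type requires it.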
In some cases, the load function itself can determine the schema. For example, if we were
loading self-describing data such as XML or JSON, we could create a schema for Pig by
looking at the data. Alternatively, the load function may determine the schema in another
way, such as from an external file, or by being passed information in its constructor. To
support such cases, the load function should implement the LoadMetadata interface (in
addition to the LoadFunc interface) so it can supply a schema to the Pig runtime. Note,
however, that if a user supplies a schema in the AS clause of LOAD, then it takes
precedence over the schema specified through the LoadMetadata interface.
A load function may additionally implement the LoadPushDown interface as a means
for finding out which columns the query is asking for. This can be a useful optimization
for column-oriented storage, so that the loader loads only the columns that are needed by
the query. There is no obvious way for CutLoadFunc to load only a subset of columns,
because it reads the whole line for each tuple, so we don't use this optimization.
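For contrast, the sketch below shows why the optimization pays off for column-oriented storage, where each column can be read independently. It is an illustration only: ColumnProjectionSketch and its in-memory map are hypothetical and do not use the LoadPushDown interface itself.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory column store: each column name maps to that column's
// values, one per row.
final class ColumnProjectionSketch {
    private final Map<String, List<String>> columns = new LinkedHashMap<>();

    void addColumn(String name, List<String> values) {
        columns.put(name, values);
    }

    // Builds one row from only the requested columns, in the requested order.
    // Columns that the query did not ask for are never touched.
    List<String> readRow(int rowIndex, List<String> requiredColumns) {
        List<String> row = new ArrayList<>();
        for (String name : requiredColumns) {
            row.add(columns.get(name).get(rowIndex));
        }
        return row;
    }
}
```

A line-oriented loader has no equivalent shortcut, because it must read every byte of the line before it can find any one field.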