Database Reference
In-Depth Information
tuple . set ( i , new DataByteArray ( range . getSubstring ( line )));
}
return tuple ;
} catch ( InterruptedException e ) {
throw new ExecException ( e );
}
}
}
In Pig, like in Hadoop, data loading takes place before the mapper runs, so it is important
that the input can be split into portions that are handled independently by each mapper
(see Input Splits and Records for background). A LoadFunc will typically use an exist-
ing underlying Hadoop InputFormat to create records, with the LoadFunc providing
the logic for turning the records into Pig tuples.
CutLoadFunc is constructed with a string that specifies the column ranges to use for
each field. The logic for parsing this string and creating a list of internal Range objects
that encapsulates these ranges is contained in the Range class, and is not shown here (it is
available in the example code that accompanies this topic).
Pig calls setLocation() on a LoadFunc to pass the input location to the loader.
Since CutLoadFunc uses a TextInputFormat to break the input into lines, we just
pass the location to set the input path using a static method on FileInputFormat .
NOTE
Pig uses the new MapReduce API, so we use the input and output formats and associated classes from
the org.apache.hadoop.mapreduce package.
Next, Pig calls the getInputFormat() method to create a RecordReader for each
split, just like in MapReduce. Pig passes each RecordReader to the pre-
pareToRead() method of CutLoadFunc , which we store a reference to, so we can
use it in the getNext() method for iterating through the records.
The Pig runtime calls getNext() repeatedly, and the load function reads tuples from the
reader until the reader reaches the last record in its split. At this point, it returns null to
signal that there are no more tuples to be read.
It is the responsibility of the getNext() implementation to turn lines of the input file
into Tuple objects. It does this by means of a TupleFactory , a Pig class for creating
Tuple instances. The newTuple() method creates a new tuple with the required num-
Search WWH ::




Custom Search