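// ...tail of CutLoadFunc's getNext(): set each field from its column range,
// return the completed tuple, and rethrow interrupts as an ExecException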
      tuple.set(i, new DataByteArray(range.getSubstring(line)));
    }
    return tuple;
  } catch (InterruptedException e) {
    throw new ExecException(e);
  }
}
In Pig, as in Hadoop, data loading takes place before the mapper runs, so it is important that the input can be split into portions that are handled independently by each mapper. A LoadFunc will typically use an existing underlying Hadoop InputFormat to create records, with the LoadFunc providing the logic for turning the records into Pig tuples.
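To make that division of labor concrete, here is a sketch of the class declaration and its state; the field names (ranges, tupleFactory, reader) and the Range.parse() helper are assumptions chosen to be consistent with the getNext() fragment above and with the method sketches below:

import java.util.List;

import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.pig.LoadFunc;
import org.apache.pig.data.TupleFactory;

// A sketch, not the verbatim listing: the state CutLoadFunc carries
// across the LoadFunc callbacks
public class CutLoadFunc extends LoadFunc {

  private List<Range> ranges;          // one column range per output field
  private final TupleFactory tupleFactory = TupleFactory.getInstance();
  private RecordReader reader;         // stored by prepareToRead()

  public CutLoadFunc(String cutPattern) {
    // Range.parse() is an assumed name for the parsing logic
    // described in the next paragraph
    ranges = Range.parse(cutPattern);  // e.g. "16-19,88-92,93-93"
  }

  // setLocation(), getInputFormat(), prepareToRead(), and getNext()
  // complete the class; they are shown in the sketches that follow
}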
CutLoadFunc is constructed with a string that specifies the column ranges to use for each field. The logic for parsing this string and creating a list of internal Range objects that encapsulate these ranges is contained in the Range class, and is not shown here (it is available in the example code that accompanies this topic).
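A minimal sketch of what Range might look like, assuming 1-based, inclusive ranges in the style of the Unix cut command (the real class in the example code may differ):

import java.util.ArrayList;
import java.util.List;

public class Range {
  private final int start;
  private final int end;

  public Range(int start, int end) {
    this.start = start;
    this.end = end;
  }

  public int getEnd() {
    return end;
  }

  // Extract this range's columns from a line (1-based, inclusive)
  public String getSubstring(String line) {
    return line.substring(start - 1, end);
  }

  // Parse a spec such as "16-19,88-92,93-93" into a list of ranges
  public static List<Range> parse(String spec) {
    List<Range> ranges = new ArrayList<Range>();
    for (String field : spec.split(",")) {
      String[] bounds = field.split("-");
      ranges.add(new Range(Integer.parseInt(bounds[0]),
          Integer.parseInt(bounds[1])));
    }
    return ranges;
  }
}

In a Pig script, the range specification is the constructor argument, given in the USING clause of a LOAD statement.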
Pig calls setLocation() on a LoadFunc to pass the input location to the loader. Since CutLoadFunc uses a TextInputFormat to break the input into lines, we just pass the location to set the input path using a static method on FileInputFormat.
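In code, the override amounts to a one-liner; a sketch consistent with the description above, assuming the usual org.apache.hadoop.mapreduce imports:

// Hand the location straight to Hadoop; the underlying input format
// will read whatever paths are set on the job
@Override
public void setLocation(String location, Job job) throws IOException {
  FileInputFormat.setInputPaths(job, location);
}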
NOTE
Pig uses the new MapReduce API, so we use the input and output formats and associated classes from the org.apache.hadoop.mapreduce package.
Next, Pig calls the getInputFormat() method to create a RecordReader for each split, just as in MapReduce. Pig passes each RecordReader to the prepareToRead() method of CutLoadFunc, which stores a reference to it so that it can be used in the getNext() method for iterating through the records.
The Pig runtime calls getNext() repeatedly, and the load function reads tuples from the reader until the reader reaches the last record in its split. At this point, it returns null to signal that there are no more tuples to be read.
It is the responsibility of the getNext() implementation to turn lines of the input file into Tuple objects. It does this by means of a TupleFactory, a Pig class for creating Tuple instances. The newTuple() method creates a new tuple with the required number of fields, which is just the number of Range objects.
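Combined with the fragment at the start of this section, the complete method might look like the following sketch, assuming the usual Hadoop and Pig imports; treating a line that is too short for a range by skipping that field (leaving it null) is an assumption:

@Override
public Tuple getNext() throws IOException {
  try {
    if (!reader.nextKeyValue()) {
      return null;                     // end of split: no more tuples
    }
    Text value = (Text) reader.getCurrentValue();
    String line = value.toString();
    Tuple tuple = tupleFactory.newTuple(ranges.size());
    for (int i = 0; i < ranges.size(); i++) {
      Range range = ranges.get(i);
      if (range.getEnd() > line.length()) {
        continue;                      // line too short: leave this field null
      }
      tuple.set(i, new DataByteArray(range.getSubstring(line)));
    }
    return tuple;
  } catch (InterruptedException e) {
    throw new ExecException(e);        // wrap interrupts for the Pig runtime
  }
}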