XML
Most XML parsers operate on whole XML documents, so if a large XML document is made up of multiple input splits, it is a challenge to parse these individually. Of course, you can process the entire XML document in one mapper (if it is not too large) using the technique in Processing a whole file as a record.
Large XML documents that are composed of a series of "records" (XML document fragments) can be broken into these records using simple string or regular-expression matching to find the start and end tags of records. This alleviates the problem when the document is split by the framework, because the next start tag of a record is easy to find by simply scanning from the start of the split, just as TextInputFormat finds newline boundaries.
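To make the idea concrete, here is a small, self-contained Java sketch (purely illustrative, not Hadoop's actual record reader) that scans forward from an arbitrary offset for complete records delimited by hypothetical <record> and </record> tags:

// Illustrative only: find complete <record>...</record> fragments starting
// from an arbitrary offset, such as the beginning of an input split.
public class XmlRecordScanner {
  public static void main(String[] args) {
    String split = "</record><record>alpha</record><record>beta</record><rec";
    String startTag = "<record>";
    String endTag = "</record>";
    int pos = 0; // offset where this split begins
    while (true) {
      int start = split.indexOf(startTag, pos); // scan forward for the next record start
      if (start < 0) {
        break; // no more records start in this buffer
      }
      int end = split.indexOf(endTag, start); // find the matching end tag
      if (end < 0) {
        break; // the record continues past this buffer
      }
      String record = split.substring(start, end + endTag.length());
      System.out.println(record); // this fragment would become one map record
      pos = end + endTag.length();
    }
  }
}

A real record reader also has to handle records that straddle split boundaries, typically by reading past the end of its split until the closing tag is found.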
Hadoop comes with a class for this purpose called StreamXmlRecordReader (which is in the org.apache.hadoop.streaming.mapreduce package, although it can be used outside of Streaming). You can use it by setting your input format to StreamInputFormat and setting the stream.recordreader.class property to org.apache.hadoop.streaming.mapreduce.StreamXmlRecordReader.
The reader is configured by setting job configuration properties to tell it the patterns for the start and end tags (see the class documentation for details). [58]
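For example, a driver might be configured along the following lines. This is only a sketch: the stream.recordreader.begin and stream.recordreader.end property names and the StreamInputFormat import are assumed to match the streaming package named above, so check the class documentation for your Hadoop release before relying on the details.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.streaming.mapreduce.StreamInputFormat;

// Driver sketch: hand <record>...</record> fragments to the mapper as whole records.
public class XmlRecordDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("stream.recordreader.class",
        "org.apache.hadoop.streaming.mapreduce.StreamXmlRecordReader");
    conf.set("stream.recordreader.begin", "<record>"); // start tag of a record
    conf.set("stream.recordreader.end", "</record>");  // end tag of a record

    Job job = Job.getInstance(conf, "xml records");
    job.setJarByClass(XmlRecordDriver.class);
    job.setInputFormatClass(StreamInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}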
To take an example, Wikipedia provides dumps of its content in XML form, which are appropriate for processing in parallel with MapReduce using this approach. The data is contained in one large XML wrapper document, which contains a series of elements, such as page elements that contain a page's content and associated metadata. Using StreamXmlRecordReader, the page elements can be interpreted as records for processing by a mapper.
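As a simple illustration, a mapper could pull each article's title out of the page fragment with a pattern match. This sketch assumes the whole <page> fragment arrives as the map input key (swap key and value if your record reader delivers it the other way around), and a production job would use a real XML parser on the fragment rather than a regular expression.

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper: emit (title, 1) for each Wikipedia <page> fragment.
public class WikipediaTitleMapper extends Mapper<Text, Text, Text, IntWritable> {
  private static final Pattern TITLE = Pattern.compile("<title>(.*?)</title>");
  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void map(Text key, Text value, Context context)
      throws IOException, InterruptedException {
    // The whole page fragment is assumed to be passed in as the key.
    Matcher m = TITLE.matcher(key.toString());
    if (m.find()) {
      context.write(new Text(m.group(1)), ONE); // one record per page title
    }
  }
}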
Binary Input
Hadoop MapReduce is not restricted to processing textual data. It has support for binary
formats, too.
SequenceFileInputFormat
Hadoop's sequence file format stores sequences of binary key-value pairs. Sequence files
are well suited as a format for MapReduce data because they are splittable (they have sync
points so that readers can synchronize with record boundaries from an arbitrary point in
the file, such as the start of a split), they support compression as a part of the format, and they can store arbitrary types.
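As a rough sketch (with placeholder paths and a hypothetical mapper class), switching a job to sequence file input is largely a matter of changing the input format and making sure the mapper's input key and value types match the types stored in the file:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver sketch: the only change from a text-based job is the input format;
// the mapper's input key/value types must match those the sequence file was written with.
public class SequenceFileJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setJarByClass(SequenceFileJob.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    // job.setMapperClass(MyMapper.class); // hypothetical Mapper<IntWritable, Text, ...>
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}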