XML
Most XML parsers operate on whole XML documents, so if a large XML document is made up of multiple input splits, it is a challenge to parse these individually. Of course, you can process the entire XML document in one mapper (if it is not too large) using the technique in Processing a whole file as a record.
Large XML documents that are composed of a series of "records" (XML document fragments) can be broken into these records using simple string or regular-expression matching to find the start and end tags of records. This alleviates the problem when the document is split by the framework, because the next start tag of a record is easy to find by simply scanning from the start of the split, just as TextInputFormat finds newline boundaries.
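To make the idea concrete, here is a small, self-contained Java sketch (purely illustrative, not Hadoop's actual record reader) that scans forward from an arbitrary offset for complete records delimited by hypothetical <record> and </record> tags:

// Illustrative only: find complete <record>...</record> fragments starting
// from an arbitrary offset, such as the beginning of an input split.
public class XmlRecordScanner {
  public static void main(String[] args) {
    String split = "</record><record>alpha</record><record>beta</record><rec";
    String startTag = "<record>";
    String endTag = "</record>";
    int pos = 0; // offset where this split begins
    while (true) {
      int start = split.indexOf(startTag, pos); // scan forward for the next record start
      if (start < 0) {
        break; // no more records start in this buffer
      }
      int end = split.indexOf(endTag, start); // find the matching end tag
      if (end < 0) {
        break; // the record continues past this buffer
      }
      String record = split.substring(start, end + endTag.length());
      System.out.println(record); // this fragment would become one map record
      pos = end + endTag.length();
    }
  }
}

A real record reader also has to handle records that straddle split boundaries, typically by reading past the end of its split until the closing tag is found.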
Hadoop comes with a class for this purpose called StreamXmlRecordReader (which is in the org.apache.hadoop.streaming.mapreduce package, although it can be used outside of Streaming). You can use it by setting your input format to StreamInputFormat and setting the stream.recordreader.class property to org.apache.hadoop.streaming.mapreduce.StreamXmlRecordReader.
The reader is configured by setting job configuration properties to tell it the patterns for the start and end tags (see the class documentation for details). [58]
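For example, a driver might be configured along the following lines. This is only a sketch: the stream.recordreader.begin and stream.recordreader.end property names and the StreamInputFormat import are assumed to match the streaming package named above, so check the class documentation for your Hadoop release before relying on the details.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.streaming.mapreduce.StreamInputFormat;

// Driver sketch: hand <record>...</record> fragments to the mapper as whole records.
public class XmlRecordDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("stream.recordreader.class",
        "org.apache.hadoop.streaming.mapreduce.StreamXmlRecordReader");
    conf.set("stream.recordreader.begin", "<record>"); // start tag of a record
    conf.set("stream.recordreader.end", "</record>");  // end tag of a record

    Job job = Job.getInstance(conf, "xml records");
    job.setJarByClass(XmlRecordDriver.class);
    job.setInputFormatClass(StreamInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}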
To take an example, Wikipedia provides dumps of its content in XML form, which are appropriate for processing in parallel with MapReduce using this approach. The data is contained in one large XML wrapper document, which contains a series of elements, such as page elements that contain a page's content and associated metadata. Using StreamXmlRecordReader, the page elements can be interpreted as records for processing by a mapper.
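As a simple illustration, a mapper could pull each article's title out of the page fragment with a pattern match. This sketch assumes the whole <page> fragment arrives as the map input key (swap key and value if your record reader delivers it the other way around), and a production job would use a real XML parser on the fragment rather than a regular expression.

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper: emit (title, 1) for each Wikipedia <page> fragment.
public class WikipediaTitleMapper extends Mapper<Text, Text, Text, IntWritable> {
  private static final Pattern TITLE = Pattern.compile("<title>(.*?)</title>");
  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void map(Text key, Text value, Context context)
      throws IOException, InterruptedException {
    // The whole page fragment is assumed to be passed in as the key.
    Matcher m = TITLE.matcher(key.toString());
    if (m.find()) {
      context.write(new Text(m.group(1)), ONE); // one record per page title
    }
  }
}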
Binary Input
Hadoop MapReduce is not restricted to processing textual data. It has support for binary
formats, too.
SequenceFileInputFormat
Hadoop's sequence file format stores sequences of binary key-value pairs. Sequence files
are well suited as a format for MapReduce data because they are splittable (they have sync
points so that readers can synchronize with record boundaries from an arbitrary point in
the file, such as the start of a split), they support compression as a part of the format, and they can store arbitrary types.
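As a rough sketch (with placeholder paths and a hypothetical mapper class), switching a job to sequence file input is largely a matter of changing the input format and making sure the mapper's input key and value types match the types stored in the file:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver sketch: the only change from a text-based job is the input format;
// the mapper's input key/value types must match those the sequence file was written with.
public class SequenceFileJob {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setJarByClass(SequenceFileJob.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    // job.setMapperClass(MyMapper.class); // hypothetical Mapper<IntWritable, Text, ...>
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}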