Hadoop Input and Output Formats
In addition to the formats Spark has wrappers for, we can also interact with any
Hadoop-supported format. Spark supports both the "old" and "new" Hadoop file
APIs, providing a great deal of flexibility.[4]
Loading with other Hadoop input formats
To read in a file using the new Hadoop API, we need to tell Spark a few things. The
newAPIHadoopFile function takes a path and three classes: the first is the "format"
class, which represents our input format; the next is the class of our key; and the
final one is the class of our value. If we need to specify additional Hadoop
configuration properties, we can also pass in a conf object. A similar function,
hadoopFile(), exists for working with input formats implemented against the older API.
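To make the argument order concrete, here is a minimal sketch (not one of the book's numbered examples) that reads tab-separated text with the new-API version of KeyValueTextInputFormat; the freshly created Configuration object simply stands in for whatever extra Hadoop properties we might want to set:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat

val conf = new Configuration() // extra Hadoop properties could be set here
val pairs = sc.newAPIHadoopFile(
  inputFile,
  classOf[KeyValueTextInputFormat], // the input format class
  classOf[Text],                    // the key class
  classOf[Text],                    // the value class
  conf)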
One of the simplest Hadoop input formats is the KeyValueTextInputFormat, which
can be used for reading in key/value data from text files (see Example 5-24). Each line
is processed individually, with the key and value separated by a tab character. This
format ships with Hadoop, so we don't have to add any extra dependencies to our
project to use it.
Example 5-24. Loading KeyValueTextInputFormat() with old-style API in Scala
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.KeyValueTextInputFormat

val input = sc.hadoopFile[Text, Text, KeyValueTextInputFormat](inputFile).map {
  case (x, y) => (x.toString, y.toString) // convert the Writables to plain Strings
}
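To make the result concrete: a line like "pandas\tlike bamboo" (the \t standing for a tab) would come back as the pair ("pandas", "like bamboo") in the resulting RDD.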
Earlier we loaded JSON data by reading it in as a text file and parsing it ourselves,
but we can also load JSON data using a custom Hadoop input format. This example
requires setting up some extra bits for compression, so feel free to skip it. Twitter's
Elephant Bird package supports a large number of data formats, including JSON,
Lucene, Protocol Buffer-related formats, and others. The package also works with
both the new and old Hadoop file APIs. To illustrate how to work with the new-style
Hadoop APIs from Spark, we'll look at loading LZO-compressed JSON data with
LzoJsonInputFormat in Example 5-25.
Example 5-25. Loading LZO-compressed JSON with Elephant Bird in Scala
import org.apache.hadoop.io.{LongWritable, MapWritable}
import com.twitter.elephantbird.mapreduce.input.LzoJsonInputFormat

val input = sc.newAPIHadoopFile(inputFile, classOf[LzoJsonInputFormat],
  classOf[LongWritable], classOf[MapWritable], conf)
// Each MapWritable in "input" represents a JSON object
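Once loaded, each record's value is a MapWritable keyed by Text field names. As a rough sketch of pulling one field out of each record (the field name "text" is purely illustrative, not something the source data guarantees):

import org.apache.hadoop.io.Text

// Hypothetical field name; get() returns null for missing keys, hence the Option.
val fields = input.map { case (_, record) =>
  Option(record.get(new Text("text"))).map(_.toString)
}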
[4] Hadoop added a new MapReduce API early in its lifetime, but some libraries still use the old one.
 