Hadoop Input and Output Formats
In addition to the formats Spark has wrappers for, we can also interact with any
Hadoop-supported format. Spark supports both the "old" and "new" Hadoop file
APIs, providing a great deal of flexibility.[4]
Loading with other Hadoop input formats
To read in a file using the new Hadoop API, we need to tell Spark a few things. The
newAPIHadoopFile function takes a path and three classes: the first is the "format"
class, which represents our input format; the next is the class of our key; and the
final one is the class of our value. If we need to specify additional Hadoop
configuration properties, we can also pass in a conf object. A similar function,
hadoopFile(), exists for working with input formats implemented against the older API.
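To make the argument order concrete, here is a minimal sketch (not one of the book's numbered examples) that reads tab-separated text with the new-API version of KeyValueTextInputFormat; the freshly created Configuration object simply stands in for whatever extra Hadoop properties we might want to set:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat

val conf = new Configuration() // extra Hadoop properties could be set here
val pairs = sc.newAPIHadoopFile(
  inputFile,
  classOf[KeyValueTextInputFormat], // the input format class
  classOf[Text],                    // the key class
  classOf[Text],                    // the value class
  conf)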
One of the simplest Hadoop input formats is the KeyValueTextInputFormat, which
can be used for reading in key/value data from text files (see Example 5-24). Each line
is processed individually, with the key and value separated by a tab character. This
format ships with Hadoop, so we don't have to add any extra dependencies to our
project to use it.
Example 5-24. Loading KeyValueTextInputFormat() with old-style API in Scala
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.KeyValueTextInputFormat

val input = sc.hadoopFile[Text, Text, KeyValueTextInputFormat](inputFile).map {
  case (x, y) => (x.toString, y.toString) // convert the Writables to plain Strings
}
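To make the result concrete: a line like "pandas\tlike bamboo" (the \t standing for a tab) would come back as the pair ("pandas", "like bamboo") in the resulting RDD.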
Earlier we loaded JSON data by reading it in as a text file and parsing it ourselves,
but we can also load JSON data using a custom Hadoop input format. This example
requires setting up some extra bits for compression, so feel free to skip it. Twitter's
Elephant Bird package supports a large number of data formats, including JSON,
Lucene, Protocol Buffer-related formats, and others. The package also works with
both the new and old Hadoop file APIs. To illustrate how to work with the new-style
Hadoop APIs from Spark, we'll look at loading LZO-compressed JSON data with
LzoJsonInputFormat in Example 5-25.
Example 5-25. Loading LZO-compressed JSON with Elephant Bird in Scala
import org.apache.hadoop.io.{LongWritable, MapWritable}
import com.twitter.elephantbird.mapreduce.input.LzoJsonInputFormat

val input = sc.newAPIHadoopFile(inputFile, classOf[LzoJsonInputFormat],
  classOf[LongWritable], classOf[MapWritable], conf)
// Each MapWritable in "input" represents a JSON object
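Once loaded, each record's value is a MapWritable keyed by Text field names. As a rough sketch of pulling one field out of each record (the field name "text" is purely illustrative, not something the source data guarantees):

import org.apache.hadoop.io.Text

// Hypothetical field name; get() returns null for missing keys, hence the Option.
val fields = input.map { case (_, record) =>
  Option(record.get(new Text("text"))).map(_.toString)
}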
[4] Hadoop added a new MapReduce API early in its lifetime, but some libraries still use the old one.
 