LZO support requires you to install the hadoop-lzo package and point Spark to its native libraries. If you install the Debian package, adding --driver-library-path /usr/lib/hadoop/lib/native/ --driver-class-path /usr/lib/hadoop/lib/ to your spark-submit invocation should do the trick.
Reading a file using the old Hadoop API is pretty much the same from a usage point of view, except we provide an old-style InputFormat class. Many of Spark's built-in convenience functions (like sequenceFile()) are implemented using the old-style Hadoop API.
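For example, a minimal sketch of reading a SequenceFile of (Text, IntWritable) pairs through the old-style API might look like the following; the path inFile is a placeholder, and sc is assumed to be an existing JavaSparkContext.
// A sketch (not from the book's numbered examples): reading with the
// old-style (org.apache.hadoop.mapred) API. "inFile" is a placeholder path.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.spark.api.java.JavaPairRDD;

JavaPairRDD<Text, IntWritable> pairs = sc.hadoopFile(
    inFile,                          // path to read
    SequenceFileInputFormat.class,   // old-style InputFormat class
    Text.class,                      // key class
    IntWritable.class);              // value class
The call may produce an unchecked-conversion warning because the raw input format class is matched against the generic key and value types, but it behaves the same way as sequenceFile().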
Saving with Hadoop output formats
We already examined SequenceFiles to some extent, but in Java we don't have the same convenience function for saving from a pair RDD. We will use this as a way to illustrate how to use the old Hadoop format APIs (see Example 5-26); the call for the new one (saveAsNewAPIHadoopFile) is similar.
Example 5-26. Saving a SequenceFile in Java
// Convert native Java types to the Hadoop Writable types we want to save.
public static class ConvertToWritableTypes implements
    PairFunction<Tuple2<String, Integer>, Text, IntWritable> {
  public Tuple2<Text, IntWritable> call(Tuple2<String, Integer> record) {
    return new Tuple2<>(new Text(record._1), new IntWritable(record._2));
  }
}

JavaPairRDD<String, Integer> rdd = sc.parallelizePairs(input);
JavaPairRDD<Text, IntWritable> result = rdd.mapToPair(new ConvertToWritableTypes());
result.saveAsHadoopFile(fileName, Text.class, IntWritable.class,
    SequenceFileOutputFormat.class);
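For comparison, a minimal sketch of the equivalent new-API call (reusing result and fileName from Example 5-26) differs only in the method name and in taking the output format from the org.apache.hadoop.mapreduce package rather than org.apache.hadoop.mapred:
// Equivalent save through the new Hadoop API; note the mapreduce-package
// SequenceFileOutputFormat rather than the mapred-package one.
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

result.saveAsNewAPIHadoopFile(fileName, Text.class, IntWritable.class,
    SequenceFileOutputFormat.class);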
Non-filesystem data sources
In addition to the hadoopFile() and saveAsHadoopFile() family of functions, you can use hadoopDataset/saveAsHadoopDataset and newAPIHadoopDataset/saveAsNewAPIHadoopDataset to access Hadoop-supported storage formats that are not filesystems. For example, many key/value stores, such as HBase and MongoDB, provide Hadoop input formats that read directly from the key/value store. You can easily use any such format in Spark.
The hadoopDataset() family of functions just take a Configuration object on which you set the Hadoop properties needed to access your data source. You do the configuration the same way as you would configure a Hadoop MapReduce job, so you can follow the instructions for accessing one of these data sources in MapReduce and then pass the object to Spark.
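As an illustration, the sketch below shows roughly how a Configuration-driven read from HBase might look using the new-API entry point on JavaSparkContext (newAPIHadoopRDD). This is not one of the book's numbered examples; the table name "users" is a placeholder, and the exact classes depend on your HBase version.
// A sketch: reading an HBase table through its Hadoop input format.
// The table name "users" is a placeholder.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.spark.api.java.JavaPairRDD;

Configuration conf = HBaseConfiguration.create();
conf.set(TableInputFormat.INPUT_TABLE, "users");   // which HBase table to scan

JavaPairRDD<ImmutableBytesWritable, Result> rows = sc.newAPIHadoopRDD(
    conf,
    TableInputFormat.class,          // new-API (mapreduce) input format from HBase
    ImmutableBytesWritable.class,    // row key
    Result.class);                   // row contents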