LZO support requires you to install the hadoop-lzo package and point Spark to its native libraries. If you install the Debian package, adding --driver-library-path /usr/lib/hadoop/lib/native/ --driver-class-path /usr/lib/hadoop/lib/ to your spark-submit invocation should do the trick.
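For instance, a full invocation might look like the following sketch, where the application class and JAR names are placeholders:

spark-submit \
  --driver-library-path /usr/lib/hadoop/lib/native/ \
  --driver-class-path /usr/lib/hadoop/lib/ \
  --class com.example.MyApp \
  my-app.jar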
Reading a file using the old Hadoop API is pretty much the same from a usage point of view, except we provide an old-style InputFormat class. Many of Spark's built-in convenience functions (like sequenceFile()) are implemented using the old-style Hadoop API.
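For instance, reading tab-delimited text with the old-style KeyValueTextInputFormat might look like the following sketch, where inputFile is a placeholder path:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.spark.api.java.JavaPairRDD;

// Read key/value text pairs with an old-style (org.apache.hadoop.mapred) InputFormat.
JavaPairRDD<Text, Text> input = sc.hadoopFile(
    inputFile, KeyValueTextInputFormat.class, Text.class, Text.class);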
Saving with Hadoop output formats
We already examined SequenceFiles to some extent, but in Java we don't have the same convenience function for saving from a pair RDD. We will use this as a way to illustrate how to use the old Hadoop format APIs (see Example 5-26); the call for the new one (saveAsNewAPIHadoopFile) is similar.
Example 5-26. Saving a SequenceFile in Java
import scala.Tuple2;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.PairFunction;

// Convert native Java types to the Writable types a SequenceFile expects.
public static class ConvertToWritableTypes implements
  PairFunction<Tuple2<String, Integer>, Text, IntWritable> {
  public Tuple2<Text, IntWritable> call(Tuple2<String, Integer> record) {
    return new Tuple2<>(new Text(record._1), new IntWritable(record._2));
  }
}

JavaPairRDD<String, Integer> rdd = sc.parallelizePairs(input);
JavaPairRDD<Text, IntWritable> result = rdd.mapToPair(new ConvertToWritableTypes());
// The old-style API takes the key, value, and output format classes explicitly.
result.saveAsHadoopFile(fileName,
  Text.class, IntWritable.class, SequenceFileOutputFormat.class);
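For comparison, here is a sketch of the equivalent call with the new API; the only changes are the method name and the output format's package (org.apache.hadoop.mapreduce.lib.output rather than org.apache.hadoop.mapred):

import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

// Same key and value classes; the output format class comes from the new API.
result.saveAsNewAPIHadoopFile(fileName,
  Text.class, IntWritable.class, SequenceFileOutputFormat.class);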
Non-filesystem data sources
In addition to the hadoopFile() and saveAsHadoopFile() family of functions, you can use hadoopDataset/saveAsHadoopDataset and newAPIHadoopDataset/saveAsNewAPIHadoopDataset to access Hadoop-supported storage formats that are not filesystems. For example, many key/value stores, such as HBase and MongoDB, provide Hadoop input formats that read directly from the key/value store. You can easily use any such format in Spark.
The hadoopDataset() family of functions just takes a Configuration object on which you set the Hadoop properties needed to access your data source. You do the configuration the same way as you would configure a Hadoop MapReduce job, so you can follow the instructions for accessing one of these data sources in MapReduce and then pass the resulting Configuration object to Spark.
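As a sketch of the reading side, the following uses HBase's TableInputFormat with newAPIHadoopRDD(); it assumes the HBase client jars are on the classpath and that a table named "tablename" exists:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.spark.api.java.JavaPairRDD;

// Configure the job exactly as a MapReduce job would be configured...
Configuration conf = HBaseConfiguration.create();
conf.set(TableInputFormat.INPUT_TABLE, "tablename");

// ...then hand the Configuration to Spark along with the format and key/value classes.
JavaPairRDD<ImmutableBytesWritable, Result> rows = sc.newAPIHadoopRDD(
    conf, TableInputFormat.class, ImmutableBytesWritable.class, Result.class);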