Class would be Text, and our valueClass would be IntWritable or VIntWritable, but for simplicity we'll work with IntWritable in Examples 5-20 through 5-22.
Example 5-20. Loading a SequenceFile in Python
data = sc.sequenceFile(inFile, "org.apache.hadoop.io.Text", "org.apache.hadoop.io.IntWritable")
Example 5-21. Loading a SequenceFile in Scala
val data = sc.sequenceFile(inFile, classOf[Text], classOf[IntWritable]).
  map{case (x, y) => (x.toString, y.get())}
Example 5-22. Loading a SequenceFile in Java
public static class ConvertToNativeTypes implements
  PairFunction<Tuple2<Text, IntWritable>, String, Integer> {
  public Tuple2<String, Integer> call(Tuple2<Text, IntWritable> record) {
    // Unwrap the Hadoop Writables into plain Java types
    return new Tuple2<String, Integer>(record._1.toString(), record._2.get());
  }
}

JavaPairRDD<Text, IntWritable> input = sc.sequenceFile(fileName, Text.class,
  IntWritable.class);
JavaPairRDD<String, Integer> result = input.mapToPair(
  new ConvertToNativeTypes());
In Scala there is a convenience function that can automatically convert Writables to their corresponding Scala type. Instead of specifying the keyClass and valueClass, we can call sequenceFile[Key, Value](path, minPartitions) and get back an RDD of native Scala types.
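As a rough illustration of the convenience function, the following sketch loads a SequenceFile of Text keys and IntWritable values directly as an RDD of (String, Int) pairs. The object name, the local master setting, and the command-line input path are assumptions made for this example only, and older Spark releases may also need import org.apache.spark.SparkContext._ for the implicit converters to be found.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical driver object, used only for this sketch.
object LoadNativeSequenceFile {
  def main(args: Array[String]): Unit = {
    // Assumption: the SequenceFile path is passed as the first argument.
    val inFile = args(0)

    // Assumption: run locally; a real job would normally get its master from spark-submit.
    val conf = new SparkConf().setAppName("LoadNativeSequenceFile").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // The type parameters name the Scala types we want back; Spark picks the
      // matching Writable classes (Text and IntWritable here) and converts each record.
      val data: RDD[(String, Int)] = sc.sequenceFile[String, Int](inFile)

      // Print a few records to confirm they arrive as native Scala values.
      data.take(5).foreach { case (key, value) => println(s"$key -> $value") }
    } finally {
      sc.stop()
    }
  }
}
```

Compared with Example 5-21, no explicit map over the Writables is needed, because the conversion happens inside sequenceFile itself.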
Saving SequenceFiles
Writing the data out to a SequenceFile is fairly similar in Scala. First, because SequenceFiles are key/value pairs, we need a PairRDD with types that our SequenceFile can write out. Implicit conversions between Scala types and Hadoop Writables exist for many native types, so if you are writing out a native type you can just save your PairRDD by calling saveAsSequenceFile(path), and it will write out the data for you. If there isn't an automatic conversion from our key and value to Writable, or we want to use variable-length types (e.g., VIntWritable), we can just map over the data and convert it before saving. Let's consider writing out the data that we loaded in the previous example (people and how many pandas they have seen), as shown in Example 5-23.
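The explicit-conversion route mentioned above (mapping to Writables before saving) could look roughly like the sketch below. The object name, the output path argument, the local master setting, and the sample records are all made up for illustration; only the map-then-saveAsSequenceFile pattern comes from the text.

```scala
import org.apache.hadoop.io.{Text, VIntWritable}
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver object, used only for this sketch.
object SaveWithExplicitWritables {
  def main(args: Array[String]): Unit = {
    // Assumption: the output directory is passed as the first argument
    // and does not exist yet.
    val outputFile = args(0)

    // Assumption: run locally for the sake of the example.
    val conf = new SparkConf().setAppName("SaveWithExplicitWritables").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Illustrative sample data: people and how many pandas they have seen.
      val data = sc.parallelize(List(("Panda", 3), ("Kay", 6), ("Snail", 2)))

      // Convert each pair to the Writable types we want on disk.
      // VIntWritable stores the count with a variable-length encoding.
      val writables = data.map { case (name, count) =>
        (new Text(name), new VIntWritable(count))
      }

      // Keys and values are already Writables, so they are written as-is.
      writables.saveAsSequenceFile(outputFile)
    } finally {
      sc.stop()
    }
  }
}
```

The conversion is done immediately before saving because Hadoop Writables are generally not Java-serializable, so it is safer not to shuffle or cache an RDD that holds them.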