Table 5-2. Corresponding Hadoop Writable types

| Scala type  | Java type | Hadoop Writable                 |
|-------------|-----------|---------------------------------|
| Int         | Integer   | IntWritable or VIntWritable²    |
| Long        | Long      | LongWritable or VLongWritable²  |
| Float       | Float     | FloatWritable                   |
| Double      | Double    | DoubleWritable                  |
| Boolean     | Boolean   | BooleanWritable                 |
| Array[Byte] | byte[]    | BytesWritable                   |
| String      | String    | Text                            |
| Array[T]    | T[]       | ArrayWritable<TW>³              |
| List[T]     | List<T>   | ArrayWritable<TW>³              |
| Map[A, B]   | Map<A, B> | MapWritable<AW, BW>³            |
In Spark 1.0 and earlier, SequenceFiles were available only in Java and Scala, but Spark 1.1 added the ability to load and save them in Python as well. Note, however, that defining custom Writable types still requires Java or Scala. The Python Spark API knows only how to convert the basic Writables available in Hadoop to Python, and makes a best effort for other classes based on their available getter methods.
Loading SequenceFiles

Spark has a specialized API for reading in SequenceFiles. On the SparkContext we can call `sequenceFile(path, keyClass, valueClass, minPartitions)`. As mentioned earlier, SequenceFiles work with Writable classes, so our `keyClass` and `valueClass` will both have to be the correct Writable class. Let's consider loading people and the number of pandas they have seen from a SequenceFile. In this case our `key`
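A minimal Scala sketch of such a load follows; `sc` is assumed to be an existing SparkContext, and the path "pandas.seq" is hypothetical:

```scala
import org.apache.hadoop.io.{IntWritable, Text}

// Sketch: load (person, pandas-seen) pairs from a SequenceFile.
// Assumes an existing SparkContext `sc`; "pandas.seq" is a hypothetical path.
val counts = sc.sequenceFile("pandas.seq", classOf[Text], classOf[IntWritable])
  .map { case (name, count) => (name.toString, count.get()) }
```

Mapping to plain `String` and `Int` right away is a deliberate choice: Hadoop's record reader reuses the same Writable objects across records, so holding on to the raw `Text` and `IntWritable` values (for example, by caching or collecting them) can produce surprising results.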
² Ints and longs are often stored as a fixed size: storing the number 12 takes the same amount of space as storing the number 2**30. If you might have a large number of small numbers, use the variable-sized types, VIntWritable and VLongWritable, which use fewer bits to store smaller numbers.
³ The templated type must also be a Writable type.
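The space savings behind VIntWritable and VLongWritable can be checked directly with Hadoop's WritableUtils, which provides the variable-length encoding. This sketch assumes hadoop-common is on the classpath:

```scala
import java.io.{ByteArrayOutputStream, DataOutputStream}
import org.apache.hadoop.io.WritableUtils

// Serialize an Int with Hadoop's variable-length encoding and
// report how many bytes it occupied.
def vintSize(v: Int): Int = {
  val bytes = new ByteArrayOutputStream()
  WritableUtils.writeVInt(new DataOutputStream(bytes), v)
  bytes.size()
}

// A small value like 12 fits in a single byte, whereas a large value
// like 2**30 needs a length prefix plus its magnitude bytes; a fixed-size
// IntWritable always occupies four bytes either way.
```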