Table 5-2. Corresponding Hadoop Writable types

Scala type     Java type     Hadoop Writable
Int            Integer       IntWritable or VIntWritable 2
Long           Long          LongWritable or VLongWritable 2
Float          Float         FloatWritable
Double         Double        DoubleWritable
Boolean        Boolean       BooleanWritable
Array[Byte]    byte[]        BytesWritable
String         String        Text
Array[T]       T[]           ArrayWritable<TW> 3
List[T]        List<T>       ArrayWritable<TW> 3
Map[A, B]      Map<A, B>     MapWritable<AW, BW> 3
In Spark 1.0 and earlier, SequenceFiles were available only in Java and Scala, but Spark 1.1 added the ability to load and save them in Python as well. Note, however, that you will still need Java or Scala to define custom Writable types: the Python Spark API knows how to convert only the basic Writables available in Hadoop to Python, and makes a best effort for other classes based on their available getter methods.
Loading SequenceFiles
Spark has a specialized API for reading in SequenceFiles. On the SparkContext we can call sequenceFile(path, keyClass, valueClass, minPartitions). As mentioned earlier, SequenceFiles work with Writable classes, so our keyClass and valueClass will both have to be the correct Writable class. Let's consider loading people and the number of pandas they have seen from a SequenceFile. In this case our key
2 Ints and longs are often stored as a fixed size. Storing the number 12 takes the same amount of space as storing the number 2**30. If you might have a large number of small numbers, use the variable-sized types, VIntWritable and VLongWritable, which will use fewer bits to store smaller numbers.
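The space saving described in footnote 2 can be sketched with a generic variable-length integer encoding (a LEB128-style varint; this is illustrative of the idea, not Hadoop's exact VIntWritable wire format):

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative int, 7 bits per byte; high bit marks continuation."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)         # final byte
            return bytes(out)

def decode_varint(data: bytes) -> int:
    """Reassemble the 7-bit groups back into an int."""
    n = 0
    for shift, b in enumerate(data):
        n |= (b & 0x7F) << (7 * shift)
    return n

# A fixed-size int always occupies 4 bytes; a varint shrinks with the value.
print(len(encode_varint(12)))       # 1 byte
print(len(encode_varint(2 ** 30)))  # 5 bytes
```

Note the trade-off the footnote implies: small values like 12 fit in a single byte, while a value as large as 2**30 can actually cost one byte more than a fixed 4-byte encoding, which is why the variable-sized types pay off only when small numbers dominate.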
3 The templated type must also be a Writable type.
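The sequenceFile call described above looks like the following on the Python side (a minimal sketch, assuming a PySpark installation, a local SparkContext, and a hypothetical file of (Text, IntWritable) pairs; treat it as illustrative rather than runnable as-is):

```python
from pyspark import SparkContext

sc = SparkContext("local", "SequenceFileExample")

# Basic Writables such as Text and IntWritable are converted to Python
# str/int automatically; "pandas.seq" is a hypothetical path.
rdd = sc.sequenceFile("pandas.seq",
                      "org.apache.hadoop.io.Text",
                      "org.apache.hadoop.io.IntWritable")
print(rdd.collect())
```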