Table 5-2. Corresponding Hadoop Writable types

| Scala type  | Java type | Hadoop Writable                 |
|-------------|-----------|---------------------------------|
| Int         | Integer   | IntWritable or VIntWritable²    |
| Long        | Long      | LongWritable or VLongWritable²  |
| Float       | Float     | FloatWritable                   |
| Double      | Double    | DoubleWritable                  |
| Boolean     | Boolean   | BooleanWritable                 |
| Array[Byte] | byte[]    | BytesWritable                   |
| String      | String    | Text                            |
| Array[T]    | T[]       | ArrayWritable<TW>³              |
| List[T]     | List<T>   | ArrayWritable<TW>³              |
| Map[A, B]   | Map<A, B> | MapWritable<AW, BW>³            |
In Spark 1.0 and earlier, SequenceFiles were available only in Java and Scala, but Spark 1.1 added the ability to load and save them in Python as well. Note, however, that defining custom Writable types still requires Java or Scala. The Python Spark API knows only how to convert the basic Writables available in Hadoop to Python, and makes a best effort for other classes based on their available getter methods.
Loading SequenceFiles

Spark has a specialized API for reading in SequenceFiles. On the SparkContext we can call `sequenceFile(path, keyClass, valueClass, minPartitions)`. As mentioned earlier, SequenceFiles work with Writable classes, so our `keyClass` and `valueClass` will both have to be the correct Writable class. Let's consider loading people and the number of pandas they have seen from a SequenceFile. In this case our `key`
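A minimal Scala sketch of such a load follows; `sc` is assumed to be an existing SparkContext, and the path "pandas.seq" is hypothetical:

```scala
import org.apache.hadoop.io.{IntWritable, Text}

// Sketch: load (person, pandas-seen) pairs from a SequenceFile.
// Assumes an existing SparkContext `sc`; "pandas.seq" is a hypothetical path.
val counts = sc.sequenceFile("pandas.seq", classOf[Text], classOf[IntWritable])
  .map { case (name, count) => (name.toString, count.get()) }
```

Mapping to plain `String` and `Int` right away is a deliberate choice: Hadoop's record reader reuses the same Writable objects across records, so holding on to the raw `Text` and `IntWritable` values (for example, by caching or collecting them) can produce surprising results.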
² Ints and longs are often stored as a fixed size: storing the number 12 takes the same amount of space as storing the number 2**30. If you might have a large number of small numbers, use the variable-sized types, VIntWritable and VLongWritable, which use fewer bits to store smaller numbers.
³ The templated type must also be a Writable type.
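The space savings behind VIntWritable and VLongWritable can be checked directly with Hadoop's WritableUtils, which provides the variable-length encoding. This sketch assumes hadoop-common is on the classpath:

```scala
import java.io.{ByteArrayOutputStream, DataOutputStream}
import org.apache.hadoop.io.WritableUtils

// Serialize an Int with Hadoop's variable-length encoding and
// report how many bytes it occupied.
def vintSize(v: Int): Int = {
  val bytes = new ByteArrayOutputStream()
  WritableUtils.writeVInt(new DataOutputStream(bytes), v)
  bytes.size()
}

// A small value like 12 fits in a single byte, whereas a large value
// like 2**30 needs a length prefix plus its magnitude bytes; a fixed-size
// IntWritable always occupies four bytes either way.
```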