partitions. If a node that has data persisted on it fails, Spark will recompute the lost
partitions of the data when needed. We can also replicate our data on multiple nodes
if we want to be able to handle node failure without slowdown.
Spark has many levels of persistence to choose from based on what our goals are, as
you can see in Table 3-6 . In Scala ( Example 3-40 ) and Java, the default persist() will
store the data in the JVM heap as unserialized objects. In Python, we always serialize
the data that persist stores, so the default is instead stored in the JVM heap as pickled
objects. When we write data out to disk or off-heap storage, that data is also always
serialized.
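To make the defaults concrete, here is a short Scala sketch (the nums RDD and the SparkContext sc are assumed purely for illustration): calling persist() with no argument is the same as asking for MEMORY_ONLY, while the serialized in-memory representation has to be requested explicitly.

import org.apache.spark.storage.StorageLevel

// Hypothetical RDD used only for illustration
val nums = sc.parallelize(1 to 1000)

// Default persist() in Scala/Java stores unserialized objects in the JVM heap,
// equivalent to StorageLevel.MEMORY_ONLY
nums.persist()

// The serialized in-memory representation must be requested explicitly
val squares = nums.map(x => x * x)
squares.persist(StorageLevel.MEMORY_ONLY_SER)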
Table 3-6. Persistence levels from org.apache.spark.storage.StorageLevel and
pyspark.StorageLevel; if desired we can replicate the data on two machines by adding _2 to
the end of the storage level

Level                Space used  CPU time  In memory  On disk  Comments
MEMORY_ONLY          High        Low       Y          N
MEMORY_ONLY_SER      Low         High      Y          N
MEMORY_AND_DISK      High        Medium    Some       Some     Spills to disk if there is too much data to fit in memory.
MEMORY_AND_DISK_SER  Low         High      Some       Some     Spills to disk if there is too much data to fit in memory. Stores serialized representation in memory.
DISK_ONLY            Low         High      N          Y
Off-heap caching is experimental and uses Tachyon. If you are interested in off-heap
caching with Spark, take a look at the Running Spark on Tachyon guide.
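For instance, to keep the MEMORY_AND_DISK behavior but also replicate each cached partition on a second node, we could ask for the _2 variant (a minimal sketch; the rdd value is assumed to already exist):

import org.apache.spark.storage.StorageLevel

// Each cached partition is stored on two nodes, so a single node failure
// does not force recomputation of the lost partitions
rdd.persist(StorageLevel.MEMORY_AND_DISK_2)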
Example 3-40. persist() in Scala
import org.apache.spark.storage.StorageLevel

val result = input.map(x => x * x)
result.persist(StorageLevel.DISK_ONLY)
println(result.count())
println(result.collect().mkString(","))
Notice that we called persist() on the RDD before the first action. The persist()
call on its own doesn't force evaluation.
 