partitions. If a node that has data persisted on it fails, Spark will recompute the lost
partitions of the data when needed. We can also replicate our data on multiple nodes
if we want to be able to handle node failure without slowdown.
Spark has many levels of persistence to choose from based on what our goals are, as
you can see in Table 3-6 . In Scala ( Example 3-40 ) and Java, the default persist() will
store the data in the JVM heap as unserialized objects. In Python, we always serialize
the data that persist stores, so the default is instead stored in the JVM heap as pickled
objects. When we write data out to disk or off-heap storage, that data is also always
serialized.
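To make the defaults concrete, here is a short Scala sketch (the nums RDD and the SparkContext sc are assumed purely for illustration): calling persist() with no argument is the same as asking for MEMORY_ONLY, while the serialized in-memory representation has to be requested explicitly.

import org.apache.spark.storage.StorageLevel

// Hypothetical RDD used only for illustration
val nums = sc.parallelize(1 to 1000)

// Default persist() in Scala/Java stores unserialized objects in the JVM heap,
// equivalent to StorageLevel.MEMORY_ONLY
nums.persist()

// The serialized in-memory representation must be requested explicitly
val squares = nums.map(x => x * x)
squares.persist(StorageLevel.MEMORY_ONLY_SER)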
Table 3-6. Persistence levels from org.apache.spark.storage.StorageLevel and
pyspark.StorageLevel; if desired we can replicate the data on two machines by adding _2 to
the end of the storage level

Level                Space used  CPU time  In memory  On disk  Comments
MEMORY_ONLY          High        Low       Y          N
MEMORY_ONLY_SER      Low         High      Y          N
MEMORY_AND_DISK      High        Medium    Some       Some     Spills to disk if there is too much data to fit in memory.
MEMORY_AND_DISK_SER  Low         High      Some       Some     Spills to disk if there is too much data to fit in memory. Stores serialized representation in memory.
DISK_ONLY            Low         High      N          Y
Off-heap caching is experimental and uses Tachyon. If you are interested in off-heap
caching with Spark, take a look at the Running Spark on Tachyon guide.
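For instance, to keep the MEMORY_AND_DISK behavior but also replicate each cached partition on a second node, we could ask for the _2 variant (a minimal sketch; the rdd value is assumed to already exist):

import org.apache.spark.storage.StorageLevel

// Each cached partition is stored on two nodes, so a single node failure
// does not force recomputation of the lost partitions
rdd.persist(StorageLevel.MEMORY_AND_DISK_2)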
Example 3-40. persist() in Scala
import org.apache.spark.storage.StorageLevel

val result = input.map(x => x * x)
result.persist(StorageLevel.DISK_ONLY)
println(result.count())
println(result.collect().mkString(","))
Notice that we called persist() on the RDD before the first action. The persist()
call on its own doesn't force evaluation.
 