Persistence levels
Calling cache() will persist each partition of the RDD in the executor's memory. If an executor does not have enough memory to store the RDD partition, the computation will not fail, but instead the partition will be recomputed as needed. For complex programs with lots of transformations, this may be expensive, so Spark offers different types of persistence behavior that may be selected by calling persist() with an argument to specify the StorageLevel.
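As a minimal sketch (the application name and input path are placeholders), cache() is simply shorthand for persist(StorageLevel.MEMORY_ONLY), and either call marks the RDD to be kept in memory after the first action computes it:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    object PersistExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("persist-example"))
        val words = sc.textFile("input.txt").flatMap(_.split(" "))
        // equivalent to words.cache()
        words.persist(StorageLevel.MEMORY_ONLY)
        println(words.count())            // first action computes and caches the partitions
        println(words.distinct().count()) // subsequent actions reuse the cached partitions
        sc.stop()
      }
    }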
By default, the level is MEMORY_ONLY, which uses the regular in-memory representation of objects. A more compact representation can be used by serializing the elements in a partition as a byte array. This level is MEMORY_ONLY_SER; it incurs CPU overhead compared to MEMORY_ONLY, but is worth it if the resulting serialized RDD partition fits in memory when the regular in-memory representation doesn't. MEMORY_ONLY_SER also reduces garbage collection pressure, since each RDD partition is stored as one byte array rather than as many objects.
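Selecting the serialized level is just a matter of passing the corresponding constant to persist(); here words stands in for any RDD:

    import org.apache.spark.storage.StorageLevel

    // store each partition as a single serialized byte array rather than as
    // deserialized objects: costs CPU on access, but is more compact
    words.persist(StorageLevel.MEMORY_ONLY_SER)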
NOTE
You can see if an RDD partition doesn't fit in memory by inspecting the driver logfile for messages from the BlockManager. Also, the driver's SparkContext runs an HTTP server (on port 4040) that provides useful information about its environment and the jobs it is running, including information about cached RDD partitions.
By default, regular Java serialization is used to serialize RDD partitions, but Kryo serialization (covered in the next section) is normally a better choice, both in terms of size and speed. Further space savings can be achieved (again at the expense of CPU) by compressing the serialized partitions: set the spark.rdd.compress property to true, and optionally set spark.io.compression.codec.
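A sketch of the relevant configuration, applied to the SparkConf before the context is created (the codec name shown is only one of the built-in choices and is illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("compressed-persistence")
      // use Kryo rather than regular Java serialization
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // compress serialized RDD partitions, trading CPU for space
      .set("spark.rdd.compress", "true")
      // optionally choose the compression codec
      .set("spark.io.compression.codec", "snappy")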
If recomputing a dataset is expensive, then either MEMORY_AND_DISK (spill to disk if the dataset doesn't fit in memory) or MEMORY_AND_DISK_SER (spill to disk if the serialized dataset doesn't fit in memory) is appropriate.
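A minimal sketch, assuming expensiveRdd is the result of a long chain of transformations that would be costly to recompute:

    import org.apache.spark.storage.StorageLevel

    // partitions that don't fit in memory are spilled to disk and read back
    // later, rather than being recomputed from the lineage
    expensiveRdd.persist(StorageLevel.MEMORY_AND_DISK)
    // or, to spill once the serialized representation no longer fits:
    // expensiveRdd.persist(StorageLevel.MEMORY_AND_DISK_SER)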
There are also some more advanced and experimental persistence levels for replicating
partitions on more than one node in the cluster, or using off-heap memory — see the
Spark documentation for details.