Persistence levels
Calling cache() will persist each partition of the RDD in the executor's memory. If an executor does not have enough memory to store the RDD partition, the computation will not fail, but instead the partition will be recomputed as needed. For complex programs with lots of transformations, this may be expensive, so Spark offers different types of persistence behavior that may be selected by calling persist() with an argument to specify the StorageLevel.
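As a minimal sketch (the application name and input path are placeholders), cache() is simply shorthand for persist(StorageLevel.MEMORY_ONLY), and either call marks the RDD to be kept in memory after the first action computes it:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    object PersistExample {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("persist-example"))
        val words = sc.textFile("input.txt").flatMap(_.split(" "))
        // equivalent to words.cache()
        words.persist(StorageLevel.MEMORY_ONLY)
        println(words.count())            // first action computes and caches the partitions
        println(words.distinct().count()) // subsequent actions reuse the cached partitions
        sc.stop()
      }
    }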
By default, the level is MEMORY_ONLY, which uses the regular in-memory representation of objects. A more compact representation can be used by serializing the elements in a partition as a byte array. This level is MEMORY_ONLY_SER; it incurs CPU overhead compared to MEMORY_ONLY, but is worth it if the resulting serialized RDD partition fits in memory when the regular in-memory representation doesn't. MEMORY_ONLY_SER also reduces garbage collection pressure, since each RDD partition is stored as one byte array rather than as many objects.
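Selecting the serialized level is just a matter of passing the corresponding constant to persist(); here words stands in for any RDD:

    import org.apache.spark.storage.StorageLevel

    // store each partition as a single serialized byte array rather than as
    // deserialized objects: costs CPU on access, but is more compact
    words.persist(StorageLevel.MEMORY_ONLY_SER)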
NOTE
You can see if an RDD partition doesn't fit in memory by inspecting the driver logfile for messages from the BlockManager. Also, the driver's SparkContext runs an HTTP server (on port 4040) that provides useful information about its environment and the jobs it is running, including information about cached RDD partitions.
By default, regular Java serialization is used to serialize RDD partitions, but Kryo serialization (covered in the next section) is normally a better choice, both in terms of size and speed. Further space savings can be achieved (again at the expense of CPU) by compressing the serialized partitions: set the spark.rdd.compress property to true, and optionally set spark.io.compression.codec.
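A sketch of the relevant configuration, applied to the SparkConf before the context is created (the codec name shown is only one of the built-in choices and is illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("compressed-persistence")
      // use Kryo rather than regular Java serialization
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // compress serialized RDD partitions, trading CPU for space
      .set("spark.rdd.compress", "true")
      // optionally choose the compression codec
      .set("spark.io.compression.codec", "snappy")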
If recomputing a dataset is expensive, then either MEMORY_AND_DISK (spill to disk if the dataset doesn't fit in memory) or MEMORY_AND_DISK_SER (spill to disk if the serialized dataset doesn't fit in memory) is appropriate.
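A minimal sketch, assuming expensiveRdd is the result of a long chain of transformations that would be costly to recompute:

    import org.apache.spark.storage.StorageLevel

    // partitions that don't fit in memory are spilled to disk and read back
    // later, rather than being recomputed from the lineage
    expensiveRdd.persist(StorageLevel.MEMORY_AND_DISK)
    // or, to spill once the serialized representation no longer fits:
    // expensiveRdd.persist(StorageLevel.MEMORY_AND_DISK_SER)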
There are also some more advanced and experimental persistence levels for replicating
partitions on more than one node in the cluster, or using off-heap memory — see the
Spark documentation for details.