If you attempt to cache more data than fits in memory, Spark will automatically evict old partitions using a Least Recently Used (LRU) cache policy. For the memory-only storage levels, it will recompute these partitions the next time they are accessed, while for the memory-and-disk ones, it will write them out to disk. In either case, this means you don't have to worry about your job breaking if you ask Spark to cache too much data. However, caching unnecessary data can lead to eviction of useful data and to more recomputation time.
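As a rough sketch of how these storage levels are chosen in practice (the RDD contents and names below are illustrative, not taken from the text), you might persist a derived RDD with an explicit StorageLevel:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PersistSketch").setMaster("local[*]"))

    // An illustrative derived RDD that is expensive enough to be worth caching.
    val squares = sc.parallelize(1 to 1000000).map(x => x.toLong * x)

    // MEMORY_ONLY: partitions evicted by the LRU policy are recomputed from
    // lineage the next time they are needed.
    // squares.persist(StorageLevel.MEMORY_ONLY)

    // MEMORY_AND_DISK: partitions that do not fit in memory are written to
    // disk and read back instead of being recomputed.
    squares.persist(StorageLevel.MEMORY_AND_DISK)

    println(squares.count())       // first action populates the cache
    println(squares.reduce(_ + _)) // later actions reuse cached partitions

    sc.stop()
  }
}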
Finally, RDDs come with a method called unpersist() that lets you manually
remove them from the cache.
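A minimal sketch of that, again with illustrative names and assuming an existing SparkContext:

import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

// Cache an RDD, use it, then release the cached partitions explicitly
// rather than waiting for LRU eviction.
def sumOfSquares(sc: SparkContext): Long = {
  val squares = sc.parallelize(1 to 1000).map(x => x.toLong * x)
  squares.persist(StorageLevel.MEMORY_ONLY)
  val total = squares.reduce(_ + _) // materializes and caches the partitions
  squares.unpersist()               // manually removes the RDD from the cache
  total
}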
Conclusion
In this chapter, we have covered the RDD execution model and a large number of
common operations on RDDs. If you have gotten here, congratulations—you've
learned all the core concepts of working in Spark. In the next chapter, we'll cover a
special set of operations available on RDDs of key/value pairs, which are the most
common way to aggregate or group together data in parallel. After that, we discuss
input and output from a variety of data sources, and more advanced topics in working with SparkContext.