If you attempt to cache more data than fits in memory, Spark will automatically evict old partitions using a Least Recently Used (LRU) cache policy. For the memory-only storage levels, it will recompute these partitions the next time they are accessed, while for the memory-and-disk ones, it will write them out to disk. In either case, this means you don't have to worry about your job breaking if you ask Spark to cache too much data. However, caching unnecessary data can lead to eviction of useful data and to more recomputation time.
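As a rough sketch of how these storage levels are chosen in practice (the RDD contents and names below are illustrative, not taken from the text), you might persist a derived RDD with an explicit StorageLevel:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PersistSketch").setMaster("local[*]"))

    // An illustrative derived RDD that is expensive enough to be worth caching.
    val squares = sc.parallelize(1 to 1000000).map(x => x.toLong * x)

    // MEMORY_ONLY: partitions evicted by the LRU policy are recomputed from
    // lineage the next time they are needed.
    // squares.persist(StorageLevel.MEMORY_ONLY)

    // MEMORY_AND_DISK: partitions that do not fit in memory are written to
    // disk and read back instead of being recomputed.
    squares.persist(StorageLevel.MEMORY_AND_DISK)

    println(squares.count())       // first action populates the cache
    println(squares.reduce(_ + _)) // later actions reuse cached partitions

    sc.stop()
  }
}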
Finally, RDDs come with a method called unpersist() that lets you manually
remove them from the cache.
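A minimal sketch of that, again with illustrative names and assuming an existing SparkContext:

import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

// Cache an RDD, use it, then release the cached partitions explicitly
// rather than waiting for LRU eviction.
def sumOfSquares(sc: SparkContext): Long = {
  val squares = sc.parallelize(1 to 1000).map(x => x.toLong * x)
  squares.persist(StorageLevel.MEMORY_ONLY)
  val total = squares.reduce(_ + _) // materializes and caches the partitions
  squares.unpersist()               // manually removes the RDD from the cache
  total
}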
Conclusion
In this chapter, we have covered the RDD execution model and a large number of
common operations on RDDs. If you have gotten here, congratulations—you've
learned all the core concepts of working in Spark. In the next chapter, we'll cover a
special set of operations available on RDDs of key/value pairs, which are the most
common way to aggregate or group together data in parallel. After that, we discuss
input and output from a variety of data sources, and more advanced topics in working with SparkContext.