Further Reading
This chapter only covered the basics of Spark. For more detail, see Learning Spark by
Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia (O'Reilly, 2014). The
Apache Spark website also has up-to-date documentation about the latest Spark release.
[128] See Matei Zaharia et al., “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing,” in NSDI '12: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, 2012.
[ 129 ] The Java version is much more compact when written using Java 8 lambda expressions.
[130] This is like performing a parameter sweep using NLineInputFormat in MapReduce, as described in NLineInputFormat.
[131] Note that countByKey() performs its final aggregation locally on the driver rather than using a second shuffle step. This is unlike the equivalent Crunch program in Example 18-3, which uses a second MapReduce job for the count.
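As a rough sketch of the difference (assuming a hypothetical JavaPairRDD<String, Integer> called pairs):

import java.util.Map;
import org.apache.spark.api.java.JavaPairRDD;

// countByKey() merges the final per-key totals locally on the driver
// and returns a regular java.util.Map, not an RDD (no second shuffle step)
Map<String, Long> localCounts = pairs.countByKey();

// A fully distributed alternative: one shuffle via reduceByKey(),
// with the counts remaining an RDD on the cluster
JavaPairRDD<String, Long> rddCounts =
    pairs.mapValues(v -> 1L).reduceByKey((a, b) -> a + b);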
[132] There is scope for tuning the performance of the shuffle through configuration. Note also that Spark uses its own custom implementation for the shuffle, and does not share any code with the MapReduce shuffle implementation.
[133] Speculative tasks are duplicates of existing tasks, which the scheduler may run as a backup if a task is running more slowly than expected. See Speculative Execution.
[134] This is not true for Mesos fine-grained mode, where each task runs as a separate process. See the following section for details.
[135] The preferred locations API is not stable (in Spark 1.2.0, the latest release as of this writing) and may change in a later release.