Further Reading
This chapter only covered the basics of Spark. For more detail, see Learning Spark by
Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia (O'Reilly, 2014). The
Apache Spark website also has up-to-date documentation about the latest Spark release.
[128] See Matei Zaharia et al., “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing,” in NSDI '12: Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, 2012.
[ 129 ] The Java version is much more compact when written using Java 8 lambda expressions.
[130] This is like performing a parameter sweep using NLineInputFormat in MapReduce, as described in NLineInputFormat.
[131] Note that countByKey() performs its final aggregation locally on the driver rather than using a second shuffle step. This is unlike the equivalent Crunch program in Example 18-3, which uses a second MapReduce job for the count.
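As a rough sketch of the difference (assuming a hypothetical JavaPairRDD<String, Integer> called pairs):

import java.util.Map;
import org.apache.spark.api.java.JavaPairRDD;

// countByKey() merges the final per-key totals locally on the driver
// and returns a regular java.util.Map, not an RDD (no second shuffle step)
Map<String, Long> localCounts = pairs.countByKey();

// A fully distributed alternative: one shuffle via reduceByKey(),
// with the counts remaining an RDD on the cluster
JavaPairRDD<String, Long> rddCounts =
    pairs.mapValues(v -> 1L).reduceByKey((a, b) -> a + b);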
[132] There is scope for tuning the performance of the shuffle through configuration. Note also that Spark uses its own custom implementation for the shuffle, and does not share any code with the MapReduce shuffle implementation.
[133] Speculative tasks are duplicates of existing tasks, which the scheduler may run as a backup if a task is running more slowly than expected. See Speculative Execution.
[134] This is not true for Mesos fine-grained mode, where each task runs as a separate process. See the following section for details.
[135] The preferred locations API is not stable (in Spark 1.2.0, the latest release as of this writing) and may change in a later release.