Figure 19-4. How Spark executors are started in YARN cluster mode
In both YARN modes, the executors are launched before there is any data locality
information available, so it could be that they end up not being co-located on the
datanodes hosting the files that the jobs access. For interactive sessions, this may be
acceptable, particularly as it may not be known which datasets are going to be accessed
before the session starts. This is less true of production jobs, however, so Spark
provides a way to give placement hints to improve data locality when running in YARN
cluster mode.
The SparkContext constructor can take a second argument of preferred locations,
computed from the input format and path using the InputFormatInfo helper class.
For example, for text files, we use TextInputFormat:
val preferredLocations = InputFormatInfo.computePreferredLocations(
  Seq(new InputFormatInfo(new Configuration(), classOf[TextInputFormat],