Key Performance Considerations
At this point you know a bit about how Spark works internally, how to follow the
progress of a running Spark application, and where to go for metrics and log
information. This section takes the next step and discusses common performance
issues you might encounter in Spark applications along with tips for tuning your
application to get the best possible performance. The first three subsections cover
code-level changes you can make in order to improve performance, while the last
subsection discusses tuning the cluster and environment in which Spark runs.
Level of Parallelism
The logical representation of an RDD is a single collection of objects. During physical
execution, as discussed earlier in this book, an RDD is divided into a set
of partitions with each partition containing some subset of the total data. When
Spark schedules and runs tasks, it creates a single task for data stored in one partition,
and that task will require, by default, a single core in the cluster to execute. Out of the
box, Spark will infer what it thinks is a good degree of parallelism for RDDs, and this
is sufficient for many use cases. Input RDDs typically choose parallelism based on the
underlying storage systems. For example, HDFS input RDDs have one partition for
each block of the underlying HDFS file. RDDs that are derived from shuffling other
RDDs will have parallelism set based on the size of their parent RDDs.
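To see what Spark has chosen, you can ask an RDD for its partition count with getNumPartitions(). The short sketch below is illustrative only; it assumes an existing SparkContext named sc and a hypothetical HDFS path.

# Assumes an existing SparkContext `sc`; the path below is hypothetical.
# An HDFS-backed input RDD gets one partition per HDFS block by default.
input_rdd = sc.textFile("hdfs:///data/logs/2014/*.log")
print(input_rdd.getNumPartitions())

# An RDD produced by shuffling inherits its parallelism from its parent
# unless you ask for something different.
pairs = input_rdd.map(lambda line: (line.split(" ")[0], 1))
counts = pairs.reduceByKey(lambda a, b: a + b)
print(counts.getNumPartitions())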
The degree of parallelism can affect performance in two ways. First, if there is too
little parallelism, Spark might leave resources idle. For example, if your application has
1,000 cores allocated to it, and you are running a stage with only 30 tasks, you might
be able to increase the level of parallelism to utilize more cores. If there is too much
parallelism, small overheads associated with each partition can add up and become
significant. A sign of this is that you have tasks that complete almost instantly—in a
few milliseconds—or tasks that do not read or write any data.
Spark offers two ways to tune the degree of parallelism for operations. The first is
that, during operations that shuffle data, you can always give a degree of parallelism
for the produced RDD as a parameter. The second is that any existing RDD can be
redistributed to have more or fewer partitions. The repartition() operator will randomly
shuffle an RDD into the desired number of partitions. If you know you are
shrinking the RDD, you can use the coalesce() operator; this is more efficient than
repartition() since it avoids a shuffle operation. If you think you have too much or
too little parallelism, it can help to redistribute your data with these operators.
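To make these two options concrete, here is a minimal PySpark sketch; the variable names and the specific partition counts are arbitrary choices for illustration, and an existing SparkContext sc is assumed.

# 1. Pass a degree of parallelism directly to an operation that shuffles data.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
summed = pairs.reduceByKey(lambda a, b: a + b, 10)  # result has 10 partitions

# 2. Redistribute an existing RDD after the fact.
more_parts = summed.repartition(50)    # full shuffle into 50 partitions
fewer_parts = more_parts.coalesce(5)   # shrink to 5 partitions without a full shuffle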
As an example, let's say we are reading a large amount of data from S3, but then
immediately performing a filter() operation that is likely to exclude all but a tiny
fraction of the dataset. By default the RDD returned by filter() will have the same
size as its parent and might have many empty or small partitions. In this case you can
improve performance by coalescing the filtered RDD down to a much smaller number of partitions.
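A sketch of that scenario in PySpark might look like the following; the S3 path and the filter condition are hypothetical, and an existing SparkContext sc is assumed.

# Read a large dataset; Spark picks many partitions based on the input.
lines = sc.textFile("s3n://log-files/2014/*.log")  # hypothetical bucket
print(lines.getNumPartitions())

# The filter keeps only a tiny fraction of the data, but the result still
# has as many partitions as its parent, most of them now nearly empty.
errors = lines.filter(lambda line: line.startswith("ERROR"))
print(errors.getNumPartitions())

# Coalesce down to far fewer partitions before doing further work with the data.
errors = errors.coalesce(5).cache()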