Key Performance Considerations
At this point you know a bit about how Spark works internally, how to follow the
progress of a running Spark application, and where to go for metrics and log
information. This section takes the next step and discusses common performance
issues you might encounter in Spark applications along with tips for tuning your
application to get the best possible performance. The first three subsections cover
code-level changes you can make in order to improve performance, while the last
subsection discusses tuning the cluster and environment in which Spark runs.
Level of Parallelism
The logical representation of an RDD is a single collection of objects. During physical
execution, as discussed earlier in this book, an RDD is divided into a set
of partitions with each partition containing some subset of the total data. When
Spark schedules and runs tasks, it creates a single task for data stored in one partition,
and that task will require, by default, a single core in the cluster to execute. Out of the
box, Spark will infer what it thinks is a good degree of parallelism for RDDs, and this
is sufficient for many use cases. Input RDDs typically choose parallelism based on the
underlying storage systems. For example, HDFS input RDDs have one partition for
each block of the underlying HDFS file. RDDs that are derived from shuffling other
RDDs will have parallelism set based on the size of their parent RDDs.
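To see what Spark has chosen, you can ask an RDD for its partition count with getNumPartitions(). The short sketch below is illustrative only; it assumes an existing SparkContext named sc and a hypothetical HDFS path.

# Assumes an existing SparkContext `sc`; the path below is hypothetical.
# An HDFS-backed input RDD gets one partition per HDFS block by default.
input_rdd = sc.textFile("hdfs:///data/logs/2014/*.log")
print(input_rdd.getNumPartitions())

# An RDD produced by shuffling inherits its parallelism from its parent
# unless you ask for something different.
pairs = input_rdd.map(lambda line: (line.split(" ")[0], 1))
counts = pairs.reduceByKey(lambda a, b: a + b)
print(counts.getNumPartitions())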
The degree of parallelism can affect performance in two ways. First, if there is too
little parallelism, Spark might leave resources idle. For example, if your application has
1,000 cores allocated to it, and you are running a stage with only 30 tasks, you might
be able to increase the level of parallelism to utilize more cores. If there is too much
parallelism, small overheads associated with each partition can add up and become
significant. A sign of this is that you have tasks that complete almost instantly—in a
few milliseconds—or tasks that do not read or write any data.
Spark offers two ways to tune the degree of parallelism for operations. The first is
that, during operations that shuffle data, you can always give a degree of parallelism
for the produced RDD as a parameter. The second is that any existing RDD can be
redistributed to have more or fewer partitions. The repartition() operator will randomly
shuffle an RDD into the desired number of partitions. If you know you are
shrinking the RDD, you can use the coalesce() operator; this is more efficient than
repartition() since it avoids a shuffle operation. If you think you have too much or
too little parallelism, it can help to redistribute your data with these operators.
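To make these two options concrete, here is a minimal PySpark sketch; the variable names and the specific partition counts are arbitrary choices for illustration, and an existing SparkContext sc is assumed.

# 1. Pass a degree of parallelism directly to an operation that shuffles data.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
summed = pairs.reduceByKey(lambda a, b: a + b, 10)  # result has 10 partitions

# 2. Redistribute an existing RDD after the fact.
more_parts = summed.repartition(50)    # full shuffle into 50 partitions
fewer_parts = more_parts.coalesce(5)   # shrink to 5 partitions without a full shuffle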
As an example, let's say we are reading a large amount of data from S3, but then
immediately performing a filter() operation that is likely to exclude all but a tiny
fraction of the dataset. By default the RDD returned by filter() will have the same
size as its parent and might have many empty or small partitions. In this case you can
improve performance by coalescing the filtered RDD down to a much smaller number of partitions.
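A sketch of that scenario in PySpark might look like the following; the S3 path and the filter condition are hypothetical, and an existing SparkContext sc is assumed.

# Read a large dataset; Spark picks many partitions based on the input.
lines = sc.textFile("s3n://log-files/2014/*.log")  # hypothetical bucket
print(lines.getNumPartitions())

# The filter keeps only a tiny fraction of the data, but the result still
# has as many partitions as its parent, most of them now nearly empty.
errors = lines.filter(lambda line: line.startswith("ERROR"))
print(errors.getNumPartitions())

# Coalesce down to far fewer partitions before doing further work with the data.
errors = errors.coalesce(5).cache()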