Figure 4-3. combineByKey() sample data flow
There are many options for combining our data by key. Most of them are implemented on top of combineByKey() but provide a simpler interface. In any case, using one of the specialized aggregation functions in Spark can be much faster than the naive approach of grouping our data and then reducing it.
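As a minimal sketch of the difference (assuming an existing SparkContext named sc; the variable names are illustrative), compare a specialized aggregation against the naive grouping approach:

    # Assumes an existing SparkContext named `sc`.
    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

    # Specialized aggregation: sums values for each key on every partition
    # before shuffling, so far less data crosses the network.
    sums = pairs.reduceByKey(lambda x, y: x + y)

    # Naive approach: ships every value across the network, then reduces.
    naive_sums = pairs.groupByKey().mapValues(sum)

    print(sums.collect())  # [('a', 4), ('b', 2)]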
Tuning the level of parallelism
So far we have talked about how all of our transformations are distributed, but we have not really looked at how Spark decides how to split up the work. Every RDD has a fixed number of partitions that determines the degree of parallelism to use when executing operations on the RDD.
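As a quick illustration (a sketch; the context and RDD names are assumed), you can inspect the partition count of any RDD:

    # Assumes an existing SparkContext named `sc`.
    rdd = sc.parallelize(range(1000), 8)  # explicitly request 8 partitions
    print(rdd.getNumPartitions())         # prints 8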
When performing aggregations or grouping operations, we can ask Spark to use a
specific number of partitions. Spark will always try to infer a sensible default value
based on the size of your cluster, but in some cases you will want to tune the level of
parallelism for better performance.
Most of the operators discussed in this chapter accept a second parameter giving the number of partitions to use when creating the grouped or aggregated RDD, as shown in Examples 4-15 and 4-16.
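Those examples are not reproduced here, but a minimal sketch of the pattern (again assuming an existing SparkContext named sc) looks like this:

    # Assumes an existing SparkContext named `sc`.
    data = [("a", 3), ("b", 4), ("a", 1)]
    sc.parallelize(data).reduceByKey(lambda x, y: x + y)      # default parallelism
    sc.parallelize(data).reduceByKey(lambda x, y: x + y, 10)  # use 10 partitions

The optional second argument to reduceByKey() controls how many partitions the resulting RDD has, overriding the default Spark would otherwise infer.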
 