improve the application's performance by coalescing down to a smaller RDD, as
shown in Example 8-11.
Example 8-11. Coalescing a large RDD in the PySpark shell
# Wildcard input that may match thousands of files
>>> input = sc.textFile("s3n://log-files/2014/*.log")
>>> input.getNumPartitions()
35154
# A filter that excludes almost all data
>>> lines = input.filter(lambda line: line.startswith("2014-10-17"))
>>> lines.getNumPartitions()
35154
# We coalesce the lines RDD before caching
>>> lines = lines.coalesce(5).cache()
>>> lines.getNumPartitions()
4
# Subsequent analysis can operate on the coalesced RDD...
>>> lines.count()
Serialization Format
When Spark is transferring data over the network or spilling data to disk, it needs to
serialize objects into a binary format. This comes into play during shuffle operations,
where potentially large amounts of data are transferred. By default Spark will use
Java's built-in serializer. Spark also supports the use of Kryo, a third-party
serialization library that improves on Java's serialization by offering both faster
serialization times and a more compact binary representation, but cannot serialize all
types of objects "out of the box." Almost all applications will benefit from shifting
to Kryo for serialization.
To use Kryo serialization, you can set the spark.serializer setting to
org.apache.spark.serializer.KryoSerializer. For best performance, you'll also
want to register classes with Kryo that you plan to serialize, as shown in
Example 8-12. Registering a class allows Kryo to avoid writing full class names with
individual objects, a space savings that can add up over thousands or millions of
serialized records. If you want to force this type of registration, you can set
spark.kryo.registrationRequired to true, and Kryo will throw errors if it
encounters an unregistered class.
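As a sketch of what this configuration might look like from PySpark (since the shell examples above use Python): the serializer settings are ordinary Spark configuration properties, and because Kryo registration happens on the JVM side, the classes to register are named as strings via spark.kryo.classesToRegister. The class name com.example.MyClass below is a hypothetical placeholder, not a real class.

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf()
# Switch from Java's built-in serializer to Kryo
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
# Register the JVM classes we plan to serialize; "com.example.MyClass" is a
# placeholder -- substitute your own. Unregistered classes still work, but
# their full names are written alongside every serialized record.
conf.set("spark.kryo.classesToRegister", "com.example.MyClass")
# Optionally fail fast if an unregistered class is encountered
conf.set("spark.kryo.registrationRequired", "true")

sc = SparkContext(conf=conf)
```

The same properties can equally be passed on the command line, e.g. with spark-submit --conf spark.serializer=org.apache.spark.serializer.KryoSerializer.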