improve the application's performance by coalescing down to a smaller RDD, as
shown in Example 8-11.
Example 8-11. Coalescing a large RDD in the PySpark shell
# Wildcard input that may match thousands of files
>>> input = sc.textFile("s3n://log-files/2014/*.log")
>>> input.getNumPartitions()
35154
# A filter that excludes almost all data
>>> lines = input.filter(lambda line: line.startswith("2014-10-17"))
>>> lines.getNumPartitions()
35154
# We coalesce the lines RDD before caching
>>> lines = lines.coalesce(5).cache()
>>> lines.getNumPartitions()
4
# Subsequent analysis can operate on the coalesced RDD...
>>> lines.count()
Serialization Format
When Spark is transferring data over the network or spilling data to disk, it needs to
serialize objects into a binary format. This comes into play during shuffle operations,
where potentially large amounts of data are transferred. By default Spark will use
Java's built-in serializer. Spark also supports the use of Kryo, a third-party
serialization library that improves on Java's serialization by offering both faster
serialization times and a more compact binary representation, but cannot serialize all
types of objects "out of the box." Almost all applications will benefit from shifting
to Kryo for serialization.
To use Kryo serialization, you can set the spark.serializer setting to
org.apache.spark.serializer.KryoSerializer. For best performance, you'll also
want to register classes with Kryo that you plan to serialize, as shown in
Example 8-12. Registering a class allows Kryo to avoid writing full class names with
individual objects, a space savings that can add up over thousands or millions of
serialized records. If you want to force this type of registration, you can set
spark.kryo.registrationRequired to true, and Kryo will throw errors if it
encounters an unregistered class.
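As a sketch of what this configuration might look like from PySpark (since the shell examples above use Python): the serializer settings are ordinary Spark configuration properties, and because Kryo registration happens on the JVM side, the classes to register are named as strings via spark.kryo.classesToRegister. The class name com.example.MyClass below is a hypothetical placeholder, not a real class.

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf()
# Switch from Java's built-in serializer to Kryo
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
# Register the JVM classes we plan to serialize; "com.example.MyClass" is a
# placeholder -- substitute your own. Unregistered classes still work, but
# their full names are written alongside every serialized record.
conf.set("spark.kryo.classesToRegister", "com.example.MyClass")
# Optionally fail fast if an unregistered class is encountered
conf.set("spark.kryo.registrationRequired", "true")

sc = SparkContext(conf=conf)
```

The same properties can equally be passed on the command line, e.g. with spark-submit --conf spark.serializer=org.apache.spark.serializer.KryoSerializer.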