Example 7-9. Packaging a Spark application built with sbt
$ sbt assembly
# In the target directory, we'll see an assembly JAR
$ ls target/scala-2.10/
my-project-assembly.jar
# Listing the assembly JAR will reveal classes from dependency libraries
$ jar tf target/scala-2.10/my-project-assembly.jar
...
joptsimple/HelpFormatter.class
...
org/joda/time/tz/UTCProvider.class
...
# An assembly JAR can be passed directly to spark-submit
$ /path/to/spark/bin/spark-submit --master local ... \
    target/scala-2.10/my-project-assembly.jar
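For reference, the sbt build behind such an assembly JAR can be as small as the following sketch. It assumes the sbt-assembly plugin is registered in project/plugins.sbt; the project name, versions, and Spark artifact here are illustrative.

// build.sbt -- a minimal sketch; assumes the sbt-assembly plugin is
// registered in project/plugins.sbt. Name and versions are illustrative.
name := "my-project"
version := "1.0"
scalaVersion := "2.10.4"

// Spark is marked "provided" so it is not bundled into the assembly JAR;
// spark-submit supplies Spark's own classes at runtime.
libraryDependencies +=
  "org.apache.spark" %% "spark-core" % "1.2.0" % "provided"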
Dependency Conflicts
One occasionally disruptive issue is dealing with dependency conflicts in cases where
a user application and Spark itself both depend on the same library. This comes up
relatively rarely, but when it does, it can be vexing for users. Typically, this will manifest itself when a NoSuchMethodError, a ClassNotFoundException, or some other JVM exception related to class loading is thrown during the execution of a Spark job.
There are two solutions to this problem. The first is to modify your application to
depend on the same version of the third-party library that Spark does. The second is
to modify the packaging of your application using a procedure that is often called
“shading.” The Maven build tool supports shading through advanced configuration of the plug-in shown in Example 7-5 (in fact, the shading capability is why the plug-in is named maven-shade-plugin). Shading allows you to make a second copy of the conflicting package under a different namespace and to rewrite your application's code to use the renamed version. This somewhat brute-force technique is quite effective at resolving runtime dependency conflicts. For specific instructions on how to shade dependencies, see the documentation for your build tool.
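If you build with sbt instead, the sbt-assembly plugin offers equivalent shade rules. The sketch below renames the Joda-Time packages seen in Example 7-9 into a private namespace; the pattern and target namespace are illustrative, and the exact setting key varies with the plugin version.

// build.sbt -- a sketch of shading with the sbt-assembly plugin.
// The rename rule copies classes matching the pattern into a new package
// and rewrites references in your bytecode to use the renamed copy.
// On newer plugin versions the key is `assembly / assemblyShadeRules`.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("org.joda.time.**" -> "shaded.org.joda.time.@1").inAll
)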
Scheduling Within and Between Spark Applications
The example we just walked through involves a single user submitting a job to a cluster. In reality, many clusters are shared between multiple users. Shared environments have the challenge of scheduling: what happens if two users both launch Spark applications that each want to use the entire cluster's worth of resources? Scheduling policies help ensure that resources are not overwhelmed and allow for prioritization of workloads.
For scheduling in multitenant clusters, Spark primarily relies on the cluster manager to share resources between Spark applications. When a Spark application asks for resources, the cluster manager decides how to allocate them among the competing applications.
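As a sketch of what such a request looks like, an application can declare its resource demands through its SparkConf before the context is created. The property values below are illustrative, and which properties apply depends on the cluster manager in use.

import org.apache.spark.{SparkConf, SparkContext}

// A sketch of an application declaring its resource demands. The cluster
// manager, not the application, decides how much of the request is granted.
val conf = new SparkConf()
  .setAppName("MyApp")                  // illustrative application name
  .set("spark.executor.memory", "2g")   // memory requested per executor
  .set("spark.cores.max", "20")         // total-core cap (Standalone/Mesos)
val sc = new SparkContext(conf)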