Example 7-9. Packaging a Spark application built with sbt
$ sbt assembly
# In the target directory, we'll see an assembly JAR
$ ls target/scala-2.10/
my-project-assembly.jar
# Listing the assembly JAR will reveal classes from dependency libraries
$ jar tf target/scala-2.10/my-project-assembly.jar
...
joptsimple/HelpFormatter.class
...
org/joda/time/tz/UTCProvider.class
...
# An assembly JAR can be passed directly to spark-submit
$ /path/to/spark/bin/spark-submit --master local ... \
    target/scala-2.10/my-project-assembly.jar
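For reference, the sbt build behind such an assembly JAR can be as small as the following sketch. It assumes the sbt-assembly plugin is registered in project/plugins.sbt; the project name, versions, and Spark artifact here are illustrative.

// build.sbt -- a minimal sketch; assumes the sbt-assembly plugin is
// registered in project/plugins.sbt. Name and versions are illustrative.
name := "my-project"
version := "1.0"
scalaVersion := "2.10.4"

// Spark is marked "provided" so it is not bundled into the assembly JAR;
// spark-submit supplies Spark's own classes at runtime.
libraryDependencies +=
  "org.apache.spark" %% "spark-core" % "1.2.0" % "provided"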
Dependency Conflicts
One occasionally disruptive issue is dealing with dependency conflicts in cases where
a user application and Spark itself both depend on the same library. This comes up
relatively rarely, but when it does, it can be vexing for users. Typically, this will manifest itself when a NoSuchMethodError, a ClassNotFoundException, or some other JVM exception related to class loading is thrown during the execution of a Spark job.
There are two solutions to this problem. The first is to modify your application to
depend on the same version of the third-party library that Spark does. The second is
to modify the packaging of your application using a procedure that is often called
“shading.” The Maven build tool supports shading through advanced configuration of the plug-in shown in Example 7-5 (in fact, the shading capability is why the plug-in is named maven-shade-plugin). Shading allows you to make a second copy of the conflicting package under a different namespace and to rewrite your application's code to use the renamed version. This somewhat brute-force technique is quite effective at resolving runtime dependency conflicts. For specific instructions on how to shade dependencies, see the documentation for your build tool.
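If you build with sbt instead, the sbt-assembly plugin offers equivalent shade rules. The sketch below renames the Joda-Time packages seen in Example 7-9 into a private namespace; the pattern and target namespace are illustrative, and the exact setting key varies with the plugin version.

// build.sbt -- a sketch of shading with the sbt-assembly plugin.
// The rename rule copies classes matching the pattern into a new package
// and rewrites references in your bytecode to use the renamed copy.
// On newer plugin versions the key is `assembly / assemblyShadeRules`.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("org.joda.time.**" -> "shaded.org.joda.time.@1").inAll
)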
Scheduling Within and Between Spark Applications
The example we just walked through involves a single user submitting a job to a cluster. In reality, many clusters are shared between multiple users. Shared environments have the challenge of scheduling: what happens if two users both launch Spark applications that each want to use the entire cluster's worth of resources? Scheduling policies help ensure that resources are not overwhelmed and allow for prioritization of workloads.
For scheduling in multitenant clusters, Spark primarily relies on the cluster manager to share resources between Spark applications. When a Spark application asks for resources, the cluster manager decides how to allocate them among the competing applications.
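As a sketch of what such a request looks like, an application can declare its resource demands through its SparkConf before the context is created. The property values below are illustrative, and which properties apply depends on the cluster manager in use.

import org.apache.spark.{SparkConf, SparkContext}

// A sketch of an application declaring its resource demands. The cluster
// manager, not the application, decides how much of the request is granted.
val conf = new SparkConf()
  .setAppName("MyApp")                  // illustrative application name
  .set("spark.executor.memory", "2g")   // memory requested per executor
  .set("spark.cores.max", "20")         // total-core cap (Standalone/Mesos)
val sc = new SparkContext(conf)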