ensure that all your dependencies are present at the runtime of your Spark
application.
For Python users, there are a few ways to install third-party libraries. Since PySpark
uses the existing Python installation on worker machines, you can install dependency
libraries directly on the cluster machines using standard Python package managers
(such as pip or easy_install), or via a manual installation into the site-packages/
directory of your Python installation. Alternatively, you can submit individual libraries
using the --py-files argument to spark-submit, and they will be added to the
Python interpreter's path. Submitting libraries this way is more convenient if you do
not have access to install packages on the cluster, but do keep in mind potential
conflicts with existing packages already installed on the machines.
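For example, from the command line the two approaches might look roughly like the
following (the package, file, and cluster-manager names here are placeholders, not
part of the original example):

    # Option 1: install a library on each worker node with pip
    pip install numpy

    # Option 2: ship the libraries with the application itself; files
    # passed to --py-files are added to the executors' Python path
    spark-submit \
      --master yarn \
      --py-files helper_lib.py,deps.zip \
      my_script.py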
What About Spark Itself?
When you are bundling an application, you should never include
Spark itself in the list of submitted dependencies. spark-submit
automatically ensures that Spark is present in the path of your
program.
For Java and Scala users, it is also possible to submit individual JAR files using the
--jars flag to spark-submit. This can work well if you have a very simple dependency
on one or two libraries and they themselves don't have any other dependencies. It is
more common, however, for users to have Java or Scala projects that depend on
several libraries. When you submit an application to Spark, it must ship with its entire
transitive dependency graph to the cluster. This includes not only the libraries you
directly depend on, but also their dependencies, their dependencies' dependencies,
and so on. Manually tracking and submitting this set of JAR files would be extremely
cumbersome. Instead, it's common practice to rely on a build tool to produce a single
large JAR containing the entire transitive dependency graph of an application. This is
often called an uber JAR or an assembly JAR, and most Java or Scala build tools can
produce this type of artifact.
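As a rough sketch, the difference looks like this on the command line (the class name
and JAR file names below are placeholders):

    # Shipping one or two simple dependencies directly with --jars
    spark-submit \
      --class com.example.MyApp \
      --jars joptsimple.jar,commons-lang.jar \
      my-app.jar

    # Submitting an uber/assembly JAR: the transitive dependencies are
    # already bundled inside the application JAR, so --jars is not needed
    spark-submit \
      --class com.example.MyApp \
      my-app-assembly.jar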
The most popular build tools for Java and Scala are Maven and sbt (Scala build tool).
Either tool can be used with either language, but Maven is more often used for Java
projects and sbt for Scala projects. Here, we'll give examples of Spark application
builds using both tools. You can use these as templates for your own Spark projects.
A Java Spark Application Built with Maven
Let's look at an example Java project with multiple dependencies that produces an
uber JAR. Example 7-5 provides a Maven pom.xml file containing a build definition.
This example doesn't show the actual Java code or project directory structure, but