ensure that all your dependencies are present at the runtime of your Spark
application.
For Python users, there are a few ways to install third-party libraries. Since PySpark
uses the existing Python installation on worker machines, you can install dependency
libraries directly on the cluster machines using standard Python package managers
(such as pip or easy_install), or via a manual installation into the site-packages/
directory of your Python installation. Alternatively, you can submit individual libraries
using the --py-files argument to spark-submit, and they will be added to the
Python interpreter's path. Submitting libraries this way is more convenient if you do
not have access to install packages on the cluster, but do keep in mind potential
conflicts with existing packages already installed on the machines.
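For example, from the command line the two approaches might look roughly like the
following (the package, file, and cluster-manager names here are placeholders, not
part of the original example):

    # Option 1: install a library on each worker node with pip
    pip install numpy

    # Option 2: ship the libraries with the application itself; files
    # passed to --py-files are added to the executors' Python path
    spark-submit \
      --master yarn \
      --py-files helper_lib.py,deps.zip \
      my_script.py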
What About Spark Itself?
When you are bundling an application, you should never include
Spark itself in the list of submitted dependencies. spark-submit
automatically ensures that Spark is present in the path of your
program.
For Java and Scala users, it is also possible to submit individual JAR files using the
--jars flag to spark-submit. This can work well if you have a very simple dependency
on one or two libraries and they themselves don't have any other dependencies. It is
more common, however, for users to have Java or Scala projects that depend on
several libraries. When you submit an application to Spark, it must ship with its entire
transitive dependency graph to the cluster. This includes not only the libraries you
directly depend on, but also their dependencies, their dependencies' dependencies,
and so on. Manually tracking and submitting this set of JAR files would be extremely
cumbersome. Instead, it's common practice to rely on a build tool to produce a single
large JAR containing the entire transitive dependency graph of an application. This is
often called an uber JAR or an assembly JAR, and most Java or Scala build tools can
produce this type of artifact.
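As a rough sketch, the difference looks like this on the command line (the class name
and JAR file names below are placeholders):

    # Shipping one or two simple dependencies directly with --jars
    spark-submit \
      --class com.example.MyApp \
      --jars joptsimple.jar,commons-lang.jar \
      my-app.jar

    # Submitting an uber/assembly JAR: the transitive dependencies are
    # already bundled inside the application JAR, so --jars is not needed
    spark-submit \
      --class com.example.MyApp \
      my-app-assembly.jar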
The most popular build tools for Java and Scala are Maven and sbt (Scala build tool).
Either tool can be used with either language, but Maven is more often used for Java
projects and sbt for Scala projects. Here, we'll give examples of Spark application
builds using both tools. You can use these as templates for your own Spark projects.
A Java Spark Application Built with Maven
Let's look at an example Java project with multiple dependencies that produces an
uber JAR. Example 7-5 provides a Maven pom.xml file containing a build definition.
This example doesn't show the actual Java code or project directory structure, but