Figure 2-2. The PySpark shell with less logging output
Using IPython
IPython is an enhanced Python shell that many Python users prefer, offering features such as tab completion. You can find instructions for installing it at http://ipython.org. You can use IPython with Spark by setting the IPYTHON environment variable to 1:

IPYTHON=1 ./bin/pyspark
To use the IPython Notebook, which is a web-browser-based version of IPython, use:

IPYTHON_OPTS="notebook" ./bin/pyspark
On Windows, set the variable and run the shell as follows:

set IPYTHON=1
bin\pyspark
In Spark, we express our computation through operations on distributed collections that are automatically parallelized across the cluster. These collections are called resilient distributed datasets, or RDDs. RDDs are Spark's fundamental abstraction for distributed data and computation.
Before we say more about RDDs, let's create one in the shell from a local text file and do some very simple ad hoc analysis by following Example 2-1 for Python or Example 2-2 for Scala.
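As a preview of that kind of ad hoc analysis, a session in the PySpark shell might look like the sketch below. It assumes the shell has already provided the SparkContext as the variable sc, and the filename README.md is only an illustrative choice; any local text file would do.

```
>>> lines = sc.textFile("README.md")  # create an RDD of the file's lines
>>> lines.count()                     # count the number of lines in the RDD
>>> lines.first()                     # return the first line of the file
```

Here textFile() creates the RDD lazily, while count() and first() are actions that trigger the actual computation across the cluster.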