Figure 2-2. The PySpark shell with less logging output
Using IPython
IPython is an enhanced Python shell that many Python users prefer, offering features such as tab completion. You can find instructions for installing it at http://ipython.org. You can use IPython with Spark by setting the IPYTHON environment variable to 1:

IPYTHON=1 ./bin/pyspark
To use the IPython Notebook, which is a web-browser-based version of IPython, use:

IPYTHON_OPTS="notebook" ./bin/pyspark
On Windows, set the variable and run the shell as follows:

set IPYTHON=1
bin\pyspark
In Spark, we express our computation through operations on distributed collections that are automatically parallelized across the cluster. These collections are called resilient distributed datasets, or RDDs. RDDs are Spark's fundamental abstraction for distributed data and computation.
Before we say more about RDDs, let's create one in the shell from a local text file and do some very simple ad hoc analysis by following Example 2-1 for Python or Example 2-2 for Scala.
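As a preview of that kind of ad hoc analysis, a session in the PySpark shell might look like the sketch below. It assumes the shell has already provided the SparkContext as the variable sc, and the filename README.md is only an illustrative choice; any local text file would do.

```
>>> lines = sc.textFile("README.md")  # create an RDD of the file's lines
>>> lines.count()                     # count the number of lines in the RDD
>>> lines.first()                     # return the first line of the file
```

Here textFile() creates the RDD lazily, while count() and first() are actions that trigger the actual computation across the cluster.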