Figure 2-2. The PySpark shell with less logging output
Using IPython
IPython is an enhanced Python shell that many Python users prefer, offering features such as tab completion. You can find instructions for installing it at http://ipython.org. You can use IPython with Spark by setting the IPYTHON environment variable to 1:
IPYTHON=1 ./bin/pyspark
To use the IPython Notebook, which is a web-browser-based version of IPython, use:
IPYTHON_OPTS="notebook" ./bin/pyspark
On Windows, set the variable and run the shell as follows:
set IPYTHON=1
bin\pyspark
In Spark, we express our computation through operations on distributed collections that are automatically parallelized across the cluster. These collections are called resilient distributed datasets, or RDDs. RDDs are Spark's fundamental abstraction for distributed data and computation.
Before we say more about RDDs, let's create one in the shell from a local text file and
do some very simple ad hoc analysis by following Example 2-1 for Python or
Example 2-2 for Scala.
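As a sketch of this kind of ad hoc analysis (assuming a local README.md file and the `sc` SparkContext that the PySpark shell creates for you), a Python session might look like the following; the file name is illustrative:

```python
>>> lines = sc.textFile("README.md")  # create an RDD from a local text file
>>> lines.count()                     # count the number of lines in the RDD
>>> lines.first()                     # return the first line of the file
>>> pythonLines = lines.filter(lambda line: "Python" in line)
>>> pythonLines.count()               # count only lines mentioning "Python"
```

Note that `filter` is a transformation, so it returns a new RDD without touching the data; the work happens only when an action such as `count()` or `first()` is called.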