Exploring and visualizing your data
Now that we have our data available, let's fire up an interactive Spark console and explore
it! For this section, we will use Python and the PySpark shell, as we are going to use the
IPython interactive console and the matplotlib plotting library to process and visualize our
data.
Note
IPython is an advanced, interactive shell for Python. It includes a useful feature set called pylab, which bundles NumPy and SciPy for numerical computing and matplotlib for interactive plotting and visualization.
We recommend that you use the latest version of IPython (2.3.1 at the time of writing). To install IPython for your platform, follow the instructions available at http://ipython.org/install.html. If this is the first time you are using IPython, you can find a tutorial at http://ipython.org/ipython-doc/stable/interactive/tutorial.html.
You will need to install all the packages listed earlier in order to work through the code in
this chapter. Instructions to install the packages can be found in the code bundle. If you are
starting out with Python or are unfamiliar with the process of installing these packages, we
strongly recommend that you use a prebuilt scientific Python installation such as Anaconda
(available at http://continuum.io/downloads) or Enthought (available at
https://store.enthought.com/downloads/). These make the installation process much easier
and include everything you will need to follow the example code.
The PySpark console lets us choose which Python executable is used to run the shell. We can
therefore launch PySpark with IPython instead of the standard Python shell, and we can pass
additional options to IPython, including telling it to start with the pylab functionality
enabled.
We can do this by running the following command from the Spark home directory (that is,
the same directory that we used previously to explore the Spark interactive console):
>IPYTHON=1 IPYTHON_OPTS="--pylab" ./bin/pyspark
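Note that the IPYTHON and IPYTHON_OPTS environment variables apply to Spark 1.x. On more recent Spark releases they have been replaced by PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS, so the equivalent launch command is usually along the following lines (check the documentation for your Spark version):
>PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="--pylab" ./bin/pyspark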
You will see the PySpark console start up, showing output similar to the following screenshot:
[Screenshot: the PySpark console starting up]
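To give a taste of what this shell makes possible, the following is a minimal sketch of interactive exploration and plotting from within PySpark. The file path data/users.txt and the pipe-delimited layout with an integer age in the second field are purely illustrative assumptions; substitute the dataset you prepared earlier. The sc variable is the SparkContext that the PySpark shell creates for you.
# Load a (hypothetical) pipe-delimited text file as an RDD and peek at the first record.
user_data = sc.textFile("data/users.txt")
print(user_data.first())
# Parse the assumed age field and collect the values back to the driver.
ages = user_data.map(lambda line: int(line.split("|")[1])).collect()
# With pylab enabled, matplotlib is ready for interactive plotting.
import matplotlib.pyplot as plt
plt.hist(ages, bins=20, color="lightblue")
plt.title("Age distribution")
plt.show()
Because pylab pulls matplotlib into the session, the histogram window appears directly from the interactive console.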