exploration to parallel computing, publication, and education.” 6 In fact, iPython has
seen quite a lot of growth among scientific users, and as a result the project has also
been awarded a grant by the Sloan Foundation to help drive development of more
collaborative and interactive visualization tools.
iPython adds an important tooling layer to the standard Python shell, including
features such as autocomplete and the ability to access interactive help. It's very easy to
incorporate existing Python scripts into iPython's interactive workflow. iPython also
has an excellent notebook mode that provides iPython's features through an interactive
Web application. When starting up iPython with the notebook command, a Web
server will be launched directly on the workstation, and a browser-based interface
becomes available on a local URL. Python commands and output can be run directly
in the browser window, and best of all, these notebooks can be saved, exported, and
shared with others.
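As a quick illustration, the short session below sketches a few of these conveniences. The script name analyze.py is just a placeholder, and the exact prompts may differ slightly between iPython versions.

# Append a question mark to any object to open its interactive help.
In [1]: sum?

# Run an existing Python script inside the session with the %run magic;
# analyze.py is a hypothetical script used here only for illustration.
In [2]: %run analyze.py

# Tab completion works on modules, objects, and file paths.
In [3]: import numpy as np

# To launch the browser-based notebook interface instead, start iPython
# from the system shell with: ipython notebook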
Parallelizing iPython Using a Cluster
As we've mentioned before, one of the advantages of distributed-processing frameworks
such as Hadoop is the ability to wrangle multiple machines to help solve large
data problems quickly. For many people, Hadoop is the de facto method of running
such tasks, but it's not always the best fit for the job. Although Hadoop is becoming
more and more automated, there's often quite a lot of administrative overhead when
initializing and running a Hadoop cluster, not to mention a great deal of work in
writing the workflow code (see Chapter 9, “Building Data Transformation Workflows
with Pig and Cascading,” for more on Hadoop workflow tools). Often, all we want to
do is simply farm a task out to a number of machines or even a set of processors on a
multicore machine with as little effort as possible.
iPython makes it easy to run tasks in parallel by coordinating the execution of Python
commands across a distributed network of machines (which iPython calls engines).
iPython takes advantage of the very fast message-passing library called ØMQ (a.k.a.
ZeroMQ) to coordinate messaging between multiple machines. Even if you don't have
a cluster of machines available, you can observe some of the advantages of parallel
computing on a multicore local machine. As with Hadoop, it's possible to test iPython
scripts locally before extending them to run across a cluster of machines. Even better,
iPython enables you to run these scripts interactively.
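To give a feel for the moving parts, here is a minimal sketch of connecting to a set of locally running engines. It assumes the IPython.parallel module shipped with the iPython releases current at the time of writing, and a cluster of four engines started beforehand; the function being mapped is purely illustrative.

# Start four local engines first, from the system shell:
#     ipcluster start -n 4

from IPython.parallel import Client

# Connect to the running engines and create a view over all of them.
client = Client()
view = client[:]

# Farm a simple function out to every engine and collect the results.
results = view.map_sync(lambda x: x ** 2, range(10))
print(results)

The map_sync call blocks until every engine has returned its piece of the work, which keeps the example easy to follow in an interactive session.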
Let's look at a simple example meant to tax the CPU a bit. In Listing 12.7, we use
the NumPy random package to generate a list of 1,000 integers between 1,000,000 and
20,000,000, and we'll check each one for primality by brute force. Essentially, we will
divide (using a modulo operation) each number by every integer from two up to the
square root of the number itself. If any of these divisions leaves a remainder of zero, the
number is reported as not prime. This approach requires thousands of large division
operations per candidate number. Our first try will be a simple nondistributed solution.
Next, we will demonstrate a solution using iPython's parallel library.
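As a rough sketch of the nondistributed approach (not the book's actual Listing 12.7, and with placeholder names), the brute-force check might look something like this:

import numpy as np

def check_prime(n):
    # Trial division: test every integer from 2 up to the square root of n.
    # Any divisor that leaves a remainder of zero means n is not prime.
    for divisor in range(2, int(np.sqrt(n)) + 1):
        if n % divisor == 0:
            return False
    return True

# Generate 1,000 random integers between 1,000,000 and 20,000,000.
candidates = np.random.randint(1000000, 20000000, 1000)

# Check each candidate one at a time, on a single core.
primes = [n for n in candidates if check_prime(n)]
print(len(primes))

Checking all 1,000 candidates one at a time on a single core can take a noticeable amount of time, which is exactly the kind of workload we'd like to split across engines.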
6. www.fsf.org/news/2012-free-software-award-winners-announced-2
 