Building Analytics Workflows Using Python and Pandas - Data Just Right: Introduction to Large-Scale Data and Analytics


# Returns a message about number being checked as prime or not
def find_primes(number):
    # Echo the number being checked
    print(number)
    return '%d is prime? %r' % (number, prime_check(number))

# Add our functions to the namespace of our running engines
dview.push({'find_primes': find_primes})
dview.push({'prime_check': prime_check})

# Generate some random large integers
# (random_integers is deprecated in newer NumPy releases; randint with an
# exclusive upper bound is its replacement)
np.random.seed(seed=12345)
possible_primes = np.random.random_integers(1000000, 20000000, 10).tolist()

# Run the functions on our cluster
results = dview.map(find_primes, possible_primes)

# Print the results to std out
for result in results.get():
    print(result)

# time ipython prime_finder.py
# Result:
# 17645405 is prime? False
# ...
# 1667625154 is prime? False
# time output:
# real 0m1.711s

On my multicore laptop, the parallelized version using six engines took just over 1.7 seconds, a significant speed improvement. If you have access to a cluster of multicore machines, you could speed up this type of brute-force application even more with some additional configuration work. Remember that at some point the problem becomes IO-bound, and network latency may cause performance issues.
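The listing above depends on a running IPython cluster and on a prime_check function that is not shown in this section. To experiment with the same map-style pattern on a single process, a minimal sketch might look like the following. The trial-division body of prime_check here is an assumption made for illustration, not necessarily the implementation used for the timings above, and the built-in map stands in for dview.map.

```python
import math


def prime_check(number):
    """Brute-force primality test by trial division (assumed implementation)."""
    if number < 2:
        return False
    if number % 2 == 0:
        return number == 2
    # Only odd divisors up to the integer square root need to be tried
    for divisor in range(3, math.isqrt(number) + 1, 2):
        if number % divisor == 0:
            return False
    return True


def find_primes(number):
    # Same message format as the cluster listing
    return '%d is prime? %r' % (number, prime_check(number))


# Serial stand-in for dview.map(find_primes, possible_primes); the two
# inputs are the values that appear in the sample output of the listing
for result in map(find_primes, [17645405, 1667625154]):
    print(result)
```

Because dview.map takes the same arguments as the built-in map, a serial version like this can be developed and checked first, then moved to the cluster by swapping the map call and pushing the functions to the engines.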

Summary

R's functional programming model and massive collection of libraries have made it the de facto open-source science and statistics language. At the same time, Python has come of age as a productive programming language for memory-intensive data applications. The sheer number of Python developers and the ease of development give Python a unique advantage over other methods of building CPU-bound data applications. Python can often be the easiest way to solve a wide variety of data challenges in the shortest amount of time. Building an application using a more general-purpose
