12
Building Analytics Workflows Using Python and Pandas
A central theme of this book is accessibility. New, powerful open-source software has
given a growing number of developers and analysts access to the tools they need to
solve their data challenges. The open-source movement brings more than the software
itself; it also brings the momentum of the community of developers who work with those
tools. For example, R, a leading programming environment for statistics and
mathematical computing, is not only a language but also a huge community of people
who contribute code, modules, and other tools.
The R community is large and vibrant, but its focus has always been statistics and
scientific computing. In practice, this means that a module for dealing with almost
every common type of mathematical computation likely already exists. R is great for
interactive, exploratory data work. To build fully functional applications, however,
it makes sense to take advantage of software that already has a great deal of built-in
functionality. Despite the popularity of specialized languages such as R,
more and more scientists, statisticians, and data-application developers are turning to a
language that at first glance may not seem like a natural fit for high-performance data
processing: Python.
In other chapters, we've already looked at examples of using Python to simplify
interacting with data processing tools such as the Hadoop Streaming API. This chapter
will cover how to use Python for more applied, CPU-intensive tasks as well as how to
work with Python as part of an interactive workflow.
The Snakes Are Loose in the Data Zoo
A lot of attention has been paid to the open-source software commonly associated
with large-scale data processing over multiple machines. As discussed in previous
chapters, Hadoop is a tool used to distribute processing tasks over a number of
machines using a computational strategy called MapReduce. This type of processing
works by breaking larger problems into smaller ones and then farming out the small