Building Analytics Workflows Using Python and Pandas - Data Just Right: Introduction to Large-Scale Data and Analytics

additional challenges to address. Common processing tasks usually require an even

higher-level approach to data modeling. For example, what if we need to make radical

changes to our data's structure? How do we combine datasets of different shapes?

What's the best way to work with records with missing data? How do we label data

in rows and columns? These real-world data challenges come up time and time again.

Fortunately, Pandas (the Python Data Analysis Library, whose name derives from “panel data” rather than

the adorable animal Ailuropoda melanoleuca) augments the utility of NumPy and SciPy by

addressing some of these real-world data processing tasks.

Note

The first thing I did after I learned about Pandas was to type “pandas” into a search

engine to learn more. It dawned on me at the time that, because Pandas shares a

name with a popular animal (among other things), I would always have to add something

about data or statistics to find the project homepage. That's no longer the case;

judging by recent search-engine rankings, questions on Stack Overflow, and mentions

in blog posts, the Pandas library has become quite popular, making information about

it pretty easy to find.

Pandas Data Types

Like SciPy, Pandas is based heavily on the data structures provided by the underlying

NumPy library. Pandas provides a variety of useful data types, including a list-like type

called a Series. The bread-and-butter Pandas data type is reminiscent of one of the

most powerful types in R: the data frame. A Pandas DataFrame is a two-dimensional

data structure in which each column may contain different types of data. An example

might be a table featuring three columns in which the first column is a person's name,

the second is the person's age, and the third might represent the person's height.
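
As a rough illustration of the two types described above, the following sketch builds a Series and a three-column DataFrame. The column names and values are made up for this example and do not come from the original text.

```python
import pandas as pd

# A Series is a one-dimensional, labeled sequence of values.
ages = pd.Series([34, 28, 45], name="age")

# A DataFrame holds columns of different types side by side.
# The names, ages, and heights below are invented for illustration.
people = pd.DataFrame(
    {
        "name": ["Ada", "Grace", "Linus"],
        "age": [34, 28, 45],
        "height_cm": [165.0, 170.5, 180.2],
    }
)

print(ages)
print(people.dtypes)  # each column reports its own data type
```

Printing the dtypes attribute is a quick way to confirm that each column really does carry its own type.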

A Pandas DataFrame can be created by passing input from various sources, such as

a NumPy array or even a Python dictionary containing lists. One of the differences

between a DataFrame and a NumPy array is that a DataFrame can also contain labels

for rows and columns.
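
A minimal sketch of both construction routes follows, assuming small made-up inputs. The row and column labels ("row_a", "x", and so on) are hypothetical names chosen only for the example.

```python
import numpy as np
import pandas as pd

# From a 2-D NumPy array: the array carries no labels, so we supply them.
readings = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
from_array = pd.DataFrame(
    readings,
    index=["row_a", "row_b", "row_c"],  # hypothetical row labels
    columns=["x", "y"],                 # hypothetical column labels
)

# From a dictionary of lists: the keys become the column labels.
from_dict = pd.DataFrame(
    {"x": [1.0, 3.0, 5.0], "y": [2.0, 4.0, 6.0]},
    index=["row_a", "row_b", "row_c"],
)

# Labels let us select a value by name instead of by position.
value = from_array.loc["row_b", "y"]
print(value)
```

Both routes produce the same table here; the difference is only in where the labels come from.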

In my professional data work, I am always amazed at how often I worry about the

index or schema of the data I am working with. One of the aha moments I had with

the Pandas library was when I discovered the excellent indexing and alignment methods

available for DataFrame objects. It's also great to be able to inspect data quickly.

Something I often do when working with data is to take a look at the first few rows

of a file or print out the end of a file to sanity-check whether my application worked

properly. For Pandas DataFrames, use the head or tail methods to output rows from

the top or bottom of your data.
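
The sketch below shows quick inspection with head and tail, plus one simple case of label-based alignment. The data is invented for the example, and the alignment shown (adding two Series with partly different labels) is only one of several alignment behaviors Pandas offers.

```python
import pandas as pd

# Ten rows of made-up data, just to have something to inspect.
df = pd.DataFrame({"n": range(10), "square": [i * i for i in range(10)]})

first_rows = df.head(3)  # first three rows (head() returns five by default)
last_rows = df.tail(2)   # last two rows

# Alignment: arithmetic matches values by index label, not by position.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])
total = a + b            # labels present on only one side become NaN

print(first_rows)
print(last_rows)
print(total)
```

Only the labels "y" and "z" appear in both Series, so only those two positions in the result hold numbers.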

Time Series Data

Pandas was originally developed as a tool for looking at financial data. Wall Street

quants and econometricians are heavy users of time series data, which is essentially
