additional challenges to address. Common processing tasks usually require an even
higher-level approach to data modeling. For example, what if we need to make radical
changes to our data's structure? How do we combine datasets of different shapes?
What's the best way to work with records with missing data? How do we label data
in rows and columns? These real-world data challenges come up time and time again.
Fortunately, Pandas (the name derives from "panel data" and is also a play on "Python
data analysis"; it is not a reference to the adorable animal Ailuropoda melanoleuca)
augments the utility of NumPy and SciPy by addressing some of these real-world data
processing tasks.
Note
The first thing I did after I learned about Pandas was to type “pandas” into a search
engine to learn more. It dawned on me at the time that, because Pandas shares a
name with a popular animal (among other things), I would always have to add something
about data or statistics to find the project homepage. That's no longer the case;
judging by recent search-engine rankings, questions on Stack Overflow, and mentions
in blog posts, the Pandas library has become quite popular, making information about
it pretty easy to find.
Pandas Data Types
Like SciPy, Pandas is based heavily on the data structures provided by the underlying
NumPy library. Pandas provides a variety of useful data types, including a
one-dimensional, list-like type called a Series. The bread-and-butter Pandas data type
is reminiscent of one of the most powerful types in R: the data frame. A Pandas
DataFrame is a two-dimensional
data structure in which each column may contain different types of data. An example
might be a table featuring three columns in which the first column is a person's name,
the second is the person's age, and the third might represent the person's height.
A Pandas DataFrame can be created by passing input from various sources, such as
a NumPy array or even a Python dictionary containing lists. One of the differences
between a DataFrame and a NumPy array is that a DataFrame can also contain labels
for rows and columns.
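As a minimal sketch of the ideas above, the following builds a DataFrame from a Python dictionary of lists and another from a NumPy array. The names, ages, heights, and row labels are hypothetical values invented for illustration; only the pandas and NumPy calls themselves are real.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: every value below is made up for illustration.
people = pd.DataFrame(
    {
        "name": ["Alice", "Bob", "Carol"],
        "age": [34, 27, 45],
        "height_cm": [165.0, 180.5, 172.3],
    },
    index=["a", "b", "c"],  # row labels
)

# Each column keeps its own data type (text, integer, float).
print(people.dtypes)

# A DataFrame can also be built from a NumPy array,
# with row and column labels supplied explicitly.
grid = pd.DataFrame(
    np.arange(6).reshape(3, 2),
    index=["r1", "r2", "r3"],
    columns=["x", "y"],
)
print(grid)

# Selecting a single column returns a Series.
ages = people["age"]
print(ages)
```

Labels let you address a value by name rather than by position; for instance, people.loc["b", "age"] looks up Bob's age.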
In my professional data work, I am always amazed how often I worry about the
index or schema of the data I am working with. One of the aha moments I had with
the Pandas library was when I discovered the excellent indexing and alignment methods
available for DataFrame objects. It's also great to be able to inspect data quickly.
Something I often do when working with data is to take a look at the first few rows
of a file or print out the end of a file to sanity-check whether my application worked
properly. For Pandas DataFrames, use the head or tail methods to output rows from
the top or bottom of the DataFrame (five rows by default).
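The inspection and alignment behavior described here can be sketched roughly as follows. The data and the variable names (readings, q1, q2) are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical data: ten numbered readings, made up for illustration.
readings = pd.DataFrame(
    {
        "reading": list(range(10)),
        "squared": [n * n for n in range(10)],
    }
)

first_rows = readings.head()   # first 5 rows by default
last_rows = readings.tail(3)   # last 3 rows

print(first_rows)
print(last_rows)

# Label alignment: arithmetic matches values by index label, not by position.
q1 = pd.Series([10, 20, 30], index=["a", "b", "c"])
q2 = pd.Series([1, 2, 3], index=["c", "b", "a"])
total = q1 + q2
print(total)
```

Even though q1 and q2 list their labels in opposite order, the sum pairs "a" with "a" and "c" with "c", which is the kind of alignment that makes labeled data less error-prone than bare arrays.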
Time Series Data
Pandas was originally developed as a tool for looking at financial data. Wall Street
quants and econometricians are heavy users of time series data, which is essentially
 