a collection of data indexed over regular time intervals. Time series data is often subjected
to regression analysis, which can help explain the connection between different
variables over time.
Time series data can be tricky to deal with. Time zones are always a pain, even for
automated systems. How do you compare data series from two different time zones?
Another issue is the sampling rate of data. If the data you have is the result of a reading
taken once an hour, is it possible to fill in or extrapolate values when you compare this
data to points taken every minute? Pandas makes a great effort to help abstract all of
these problems away.
Let's take a look at some basic time series manipulations using daily historic stock
values for a company that has been around for a while: IBM. The Yahoo! Finance [4]
Web site provides a useful Web interface for downloading historical stock data in CSV
format. The raw data used in this example is a simple CSV file containing the date
as a string (in YYYY-MM-DD format) along with a collection of values for various
aspects of the stock price. The labels for these columns are contained in the first row
of the file. See Listing 12.4 for an example.
The read_csv function tells Pandas to take the CSV file and import it as a
DataFrame. We also direct Pandas to set the index of the DataFrame to the date and to
parse the strings found in the Date column into timestamps.
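As a minimal sketch of that import step (not a reproduction of Listing 12.4), the snippet below reads a small in-memory CSV instead of a downloaded file. The column names and every value in it are hypothetical placeholders, not real IBM prices; only read_csv and its index_col and parse_dates parameters come from Pandas itself.

```python
import io

import pandas as pd

# Hypothetical sample rows; the values are placeholders, not real IBM prices.
csv_text = """Date,Open,High,Low,Close,Volume
2013-01-02,10.0,11.0,9.5,10.5,1000
2013-01-03,10.5,12.0,10.0,11.5,1200
2013-01-04,11.5,11.8,11.0,11.2,900
"""

# index_col makes the Date column the index of the DataFrame;
# parse_dates=True converts the index strings into timestamps.
stock_data = pd.read_csv(io.StringIO(csv_text), index_col="Date", parse_dates=True)

print(type(stock_data.index).__name__)
print(stock_data.head())
```

With a real download, the io.StringIO object would simply be replaced by the path to the CSV file.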
In my experience, dealing with data from multiple time zones is a common challenge,
especially with data from computer logs. When faced with the problem of
comparing datasets from two different time zones, it's often a good idea to normalize
values to Coordinated Universal Time, or UTC. Pandas's time series methods make
it easy to both set and convert a DataFrame's original data to a particular time zone.
Because our data about IBM's valuation comes from a U.S. stock market listing, let's
assume that these dates are measured in the U.S. Eastern time zone. We can set this
time zone using the tz_localize method and, if necessary, convert the result to UTC
using tz_convert.
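The two steps might look like the sketch below. It assumes a small hypothetical DataFrame with placeholder values in place of the real stock data, and it uses the IANA zone name "America/New_York" for U.S. Eastern time, which is an assumption about how the zone would be spelled rather than something stated in the text.

```python
import pandas as pd

# Hypothetical daily index standing in for the stock DataFrame's dates.
dates = pd.date_range("2013-01-02", periods=3, freq="D")
stock_data = pd.DataFrame({"Close": [10.5, 11.5, 11.2]}, index=dates)

# Attach a time zone to the naive timestamps (assumed to be U.S. Eastern).
eastern = stock_data.tz_localize("America/New_York")

# Express the same instants in UTC.
utc = eastern.tz_convert("UTC")

print(eastern.index[0])
print(utc.index[0])
```

Note that tz_localize only labels the existing timestamps with a zone, while tz_convert shifts the wall-clock values so that they describe the same instants in the target zone.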
Finally, let's resample the data. Our original data is sampled per day, meaning that
we have a single set of data points for each day. We can up- or downsample the granularity
of our data using the Pandas resample method, effectively changing the number
of time stamps we have for our data points. If we resample this data to reflect values
taken every five minutes, we won't automatically have the data for that granularity
level. Using the raw data we have here, it's impossible to tell what IBM's stock price
was in the five-minute period between 10:00 a.m. and 10:05 a.m. However, it is possible
to interpolate, or fill, those values based on the data we do have. We could, for
example, tell Pandas to fill in all of those missing values with the value we have for the
whole day. In the example ahead, we go the other way and downsample our data to
weekly granularity. Pandas can aggregate the values in each week in a number of ways; in
the example in Listing 12.4, we've asked for the maximum of the daily values to be used.
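Both directions of resampling can be sketched as follows. This is an illustration under stated assumptions rather than the book's listing: the DataFrame holds ten hypothetical daily values, "W" is the Pandas alias for weekly bins (ending on Sunday by default), and forward fill is used as one possible way to carry each day's value into the five-minute slots.

```python
import pandas as pd

# Hypothetical daily closes covering a bit more than one calendar week.
dates = pd.date_range("2013-01-07", periods=10, freq="D")  # 2013-01-07 is a Monday
stock_data = pd.DataFrame(
    {"Close": [1.0, 3.0, 2.0, 5.0, 4.0, 4.5, 3.5, 6.0, 9.0, 7.0]},
    index=dates,
)

# Downsample: one row per week, keeping the maximum daily value in each week.
weekly_max = stock_data.resample("W").max()

# Upsample: one row every five minutes, carrying each day's value forward.
five_min = stock_data.resample("5min").ffill()

print(weekly_max)
print(five_min.loc["2013-01-07 10:00":"2013-01-07 10:05"])
```

The forward-filled rows are not observations; every five-minute slot simply repeats the single value recorded for that day.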
4. http://finance.yahoo.com/q/hp?s=IBM
 