Building Analytics Workflows Using Python and Pandas - Data Just Right: Introduction to Large-Scale Data and Analytics

a collection of data indexed over regular time intervals. Time series data is often subjected to regression analysis, which can help explain the connection between different variables over time.

Time series data can be tricky to deal with. Time zones are always a pain, even for automated systems. How do you compare data series from two different time zones? Another issue is the sampling rate of data. If the data you have is the result of a reading taken once an hour, is it possible to fill in or interpolate values when you compare this data to points taken every minute? Pandas makes a great effort to help abstract all of these problems away.
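As a rough illustration of the sampling-rate question, the sketch below puts an hourly series onto a per-minute index so the two can be compared point by point. The readings are made-up numbers, and the variable names (`hourly`, `minute_index`, `filled`, `interpolated`) are hypothetical. Forward-filling and time-based interpolation are only two of several possible strategies, and neither recovers what actually happened between readings.

```python
import pandas as pd

# Hypothetical readings taken once an hour (made-up values).
hourly = pd.Series(
    [10.0, 12.0, 11.0],
    index=pd.date_range("2024-01-01 09:00", periods=3, freq="60min"),
)

# The per-minute index we want to compare against.
minute_index = pd.date_range("2024-01-01 09:00", "2024-01-01 11:00", freq="min")

# Option 1: carry the most recent hourly reading forward to every minute.
filled = hourly.reindex(minute_index, method="ffill")

# Option 2: estimate in-between values linearly, weighted by elapsed time.
interpolated = hourly.reindex(minute_index).interpolate(method="time")

print(filled.head())
print(interpolated.head())
```

Forward-filling keeps every value an actual observation, while interpolation produces smoother but invented intermediate values; which is appropriate depends on what the readings represent.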

Let's take a look at some basic time series manipulations using daily historical stock values for a company that has been around for a while: IBM. The Yahoo! Finance Web site provides a useful Web interface for downloading historical stock data in CSV format. The raw data used in this example is a simple CSV file containing the date as a string (in YYYY-MM-DD format) along with a collection of values for various aspects of the stock price. The labels for these columns are contained in the first row of the file. See Listing 12.4 for an example.

The read_csv function tells Pandas to take the CSV file and import it as a DataFrame. We also direct Pandas to set the index of the DataFrame to the Date column and to parse the strings found in that column into timestamps.
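A minimal sketch of that import step might look like the following. To keep it self-contained, it reads from an in-memory string rather than a downloaded file. The column names other than Date and all of the numbers are placeholders invented for illustration, not real IBM prices.

```python
import io
import pandas as pd

# Placeholder rows standing in for a downloaded CSV file. The column
# names (other than Date) and the values are assumptions, NOT real data.
csv_text = """Date,Open,High,Low,Close,Volume
2013-01-02,10.0,11.0,9.5,10.5,1000
2013-01-03,10.5,12.0,10.0,11.5,1500
2013-01-04,11.5,11.8,11.0,11.2,1200
"""

# index_col makes the Date column the index; parse_dates=True converts
# the index strings into timestamps, producing a DatetimeIndex.
stock_data = pd.read_csv(io.StringIO(csv_text), index_col="Date", parse_dates=True)

print(type(stock_data.index))
print(stock_data.head())
```

With a real file on disk, the first argument would simply be the file's path instead of the `io.StringIO` wrapper.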

In my experience, dealing with data from multiple time zones is a common challenge, especially with data from computer logs. When faced with the problem of comparing datasets from two different time zones, it's often a good idea to normalize values to Coordinated Universal Time, or UTC. Pandas's time series methods make it easy to both set and convert a DataFrame's original data to a particular time zone. Because our data about IBM's valuation comes from a U.S. stock market listing, let's assume that these dates are measured in the U.S. Eastern time zone. We can set this time zone using the tz_localize method and if necessary convert the result to UTC using tz_convert.
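The two steps can be sketched as follows, using a tiny DataFrame with made-up values. The zone name "America/New_York" is an assumption chosen here to represent U.S. Eastern time, and the variable names are hypothetical.

```python
import pandas as pd

# Three daily timestamps with made-up values; the index starts out
# timezone-naive, as it would after a plain CSV import.
prices = pd.DataFrame(
    {"Close": [10.5, 11.5, 11.2]},
    index=pd.to_datetime(["2013-01-02", "2013-01-03", "2013-01-04"]),
)

# tz_localize attaches a time zone to naive timestamps without
# shifting the clock time.
eastern = prices.tz_localize("America/New_York")

# tz_convert re-expresses the same instants in another time zone.
utc = eastern.tz_convert("UTC")

print(eastern.index[0])
print(utc.index[0])
```

Note the division of labor: tz_localize declares what zone the existing timestamps are in, whereas tz_convert changes how already-localized timestamps are displayed. The underlying values in the Close column are untouched by either call.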

Finally, let's resample the data. Our original data is sampled per day, meaning that we have a single set of data points for each day. We can up- or downsample the granularity of our data using the Pandas resample method, effectively changing the number of time stamps we have for our data points. If we resample this data to reflect values taken every five minutes, we won't automatically have the data for that granularity level. Using the raw data we have here, it's impossible to tell what IBM's stock price was in the five-minute period between 10:00 a.m. and 10:05 a.m. However, it is possible to interpolate, or fill, those values based on the data we do have. We could, for example, tell Pandas to fill in all of those missing values with the value we have for the whole day. In the example ahead, we resample our data in the other direction, down to weekly granularity. We can also tell Pandas how to combine the daily values within each week, using any of a number of methods; in the example in Listing 12.4, we've asked for the maximum of the daily values to be used.
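Both directions of resampling can be sketched on a small made-up dataset. This is an illustration under stated assumptions rather than a reproduction of the book's listing: the values are invented, and the column name and variable names are hypothetical.

```python
import pandas as pd

# Ten consecutive business days of made-up closing values (not real
# prices), starting on Monday, 2013-01-07.
daily = pd.DataFrame(
    {"Close": [10.0, 10.5, 11.0, 10.8, 11.2, 11.5, 11.1, 11.9, 12.0, 11.7]},
    index=pd.date_range("2013-01-07", periods=10, freq="B"),
)

# Downsample: one row per week, keeping the maximum daily value.
weekly_max = daily.resample("W").max()

# Upsample: five-minute timestamps between the first two days, each
# filled forward with the most recent daily value.
five_min = daily.iloc[:2].resample("5min").ffill()

print(weekly_max)
print(five_min.head())
```

Downsampling needs an aggregation such as max because several daily rows collapse into one weekly row. Upsampling needs a fill strategy such as ffill because the new timestamps have no observations of their own.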
