Building Analytics Workflows Using Python and Pandas - Data Just Right: Introduction to Large-Scale Data and Analytics


              Open    High     Low   Close   Volume  Adj Close
Date
2013-02-22  199.23  201.09  198.84  201.09  3107900     201.09
2013-02-21  198.63  199.07  198.11  198.33  3922900     198.33
2013-02-20  200.62  201.72  198.86  199.31  3715400     199.31

One advantage of using Python for data analysis is access to an enormous
number of general-purpose libraries, along with Python's brevity and clean
syntax. Because you can build an application entirely in Python and pull
other popular libraries into your data workflows, the language is well suited
to turning exploratory scripts into full production applications. To
demonstrate, let's look at an example that might be a bit cumbersome in R
but is very easy in Python.

Twitter provides APIs for interacting with public tweets. The Twitter
Streaming API can be a lot of fun for people learning how to work with data
streams, as it simply unleashes a nonstop barrage of randomly sampled public
tweets, each conveniently packaged as a JSON object along with its metadata.
In this example, we use the Python Twitter Tools module [5] to read tweets
from the Twitter public sample stream API and extract the hashtags they
contain. Once running, our script reads new tweets from the sample stream
until it has seen 1,000 public tweets. As we collect tweets, we store the
relevant information in a Python dictionary. We then convert the dictionary
into a Pandas DataFrame, which gives us access to the Pandas methods for
analysis.
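The collect-then-convert pattern described above can be sketched as follows. This is only an illustration under stated assumptions, not the book's listing: the sample tweets are invented stand-ins for decoded stream messages, the helper name collect_hashtags is hypothetical, and the code assumes each tweet carries its hashtags under entities -> hashtags -> text, as tweet JSON objects do.

```python
import pandas as pd

# Invented stand-ins for decoded tweet JSON objects; a live stream
# would yield dictionaries shaped like these.
sample_tweets = [
    {"text": "Learning #pandas today",
     "entities": {"hashtags": [{"text": "pandas"}]}},
    {"text": "#Python and #pandas",
     "entities": {"hashtags": [{"text": "Python"}, {"text": "pandas"}]}},
    {"text": "No tags here", "entities": {"hashtags": []}},
    {"delete": {"status": {"id": 1}}},  # streams also emit non-tweet messages
]


def collect_hashtags(tweets, limit=1000):
    """Gather one row per hashtag from at most `limit` tweets (hypothetical helper)."""
    rows = {"hashtag": [], "tweet_text": []}
    seen = 0
    for tweet in tweets:
        if "text" not in tweet:
            continue  # skip deletion notices and other non-tweet messages
        seen += 1
        for tag in tweet.get("entities", {}).get("hashtags", []):
            rows["hashtag"].append(tag["text"].lower())
            rows["tweet_text"].append(tweet["text"])
        if seen >= limit:
            break
    # A dictionary of equal-length lists converts directly into a DataFrame.
    return pd.DataFrame(rows)


frame = collect_hashtags(sample_tweets)
# value_counts() tallies how often each hashtag appeared.
counts = frame["hashtag"].value_counts()
print(counts)
```

Storing one row per hashtag, rather than one row per tweet, keeps the later counting step to a single value_counts() call.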

In addition to the Twitter API and Python Twitter Tools, we will incorporate
one of my favorite Python libraries, the Natural Language Toolkit (or NLTK).
NLTK is a popular and well-supported library for problems in the
natural-language processing domain. One such problem is the study of
n-grams, which are phrases containing a fixed number of terms. For example,
a phrase of three separate words is called a 3-gram, and one of five words a
5-gram. NLTK is great for generating collections of n-grams from otherwise
unstructured text blobs (such as tweets). It also provides excellent tools
for parsing text with regular expressions, word stemming, and much more.
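To make the idea of an n-gram concrete, here is a minimal sketch that uses only the Python standard library. The function names and the sample sentence are made up for illustration. NLTK's nltk.util.ngrams helper yields the same kind of tuples, and its RegexpTokenizer plays the role of the simple tokenizer here.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens using a regular expression."""
    return re.findall(r"\w+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Made-up sample text standing in for the body of a tweet.
text = "the quick brown fox jumps over the quick brown dog"
tokens = tokenize(text)
trigrams = ngrams(tokens, 3)

# Counting the tuples shows which 3-grams repeat.
trigram_counts = Counter(trigrams)
print(trigram_counts.most_common(1))
```

A token list shorter than n produces no n-grams, so very short tweets simply contribute nothing to the counts.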

Listing 12.5 provides an example of Twitter Streaming API statistics.

Listing 12.5 Twitter Streaming API statistics example

import json
from pandas import Series
from nltk.tokenize import RegexpTokenizer
from twitter import OAuth, TwitterStream
