              Open    High     Low   Close   Volume  Adj Close
Date
2013-02-22  199.23  201.09  198.84  201.09  3107900     201.09
2013-02-21  198.63  199.07  198.11  198.33  3922900     198.33
2013-02-20  200.62  201.72  198.86  199.31  3715400     199.31
Building More Complex Workflows
One of the advantages of using Python for data analysis is that, along with Python's
brevity and clean syntax, we have access to an enormous number of general-purpose
libraries. The ability to build applications entirely in Python, incorporating
other popular libraries into your data workflows, makes it well suited to turning
exploratory scripts into full production applications. To demonstrate, let's take a
look at an example of something that might be a bit cumbersome to do with R but is
very easy to do in Python.
Twitter provides APIs for interacting with public tweets. The Twitter Streaming
API can be a lot of fun for people learning how to work with data streams, as it simply
unleashes a nonstop barrage of randomly sampled public tweets along with metadata
about each tweet, conveniently packaged as JSON objects. In this example, we use
the Python Twitter Tools module (see footnote 5) to read tweets from the Twitter
public sample stream API and extract the hashtags found in them. Once running, our
script will read new tweets from the sample stream until we hit 1,000 total public
tweets. As we collect tweets, the relevant information we want will be stored in a
Python dictionary. We then convert the dictionary into a Pandas DataFrame and gain
access to the available Pandas methods for analysis.
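The collection step described above can be sketched without a live connection. The following is a minimal illustration, not the book's listing: the sample lines, the `count_hashtags` helper, and the `max_tweets` parameter are hypothetical, and the sketch assumes each tweet carries its hashtags under `entities` and `hashtags`, with the stream occasionally emitting non-tweet messages such as deletion notices.

```python
import json
from collections import Counter

# Hypothetical stand-in for raw stream lines; a real stream would deliver
# one JSON document per message.
SAMPLE_LINES = [
    '{"text": "Learning #Python and #pandas", '
    '"entities": {"hashtags": [{"text": "Python"}, {"text": "pandas"}]}}',
    '{"text": "No tags here", "entities": {"hashtags": []}}',
    '{"delete": {"status": {"id": 1}}}',
    '{"text": "More #python", "entities": {"hashtags": [{"text": "python"}]}}',
]


def count_hashtags(lines, max_tweets=1000):
    """Count hashtags (case-insensitively) in up to max_tweets tweets."""
    counts = Counter()
    seen = 0
    for line in lines:
        message = json.loads(line)
        # Skip messages that are not tweets (for example, deletion notices).
        if "text" not in message:
            continue
        for tag in message.get("entities", {}).get("hashtags", []):
            counts[tag["text"].lower()] += 1
        seen += 1
        # Stop once the requested number of tweets has been processed.
        if seen >= max_tweets:
            break
    return dict(counts)


hashtag_counts = count_hashtags(SAMPLE_LINES)
print(hashtag_counts)  # {'python': 2, 'pandas': 1}
```

A plain dictionary like `hashtag_counts` can be passed directly to `pandas.Series`, which is one way to move from collection to analysis.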
In addition to the Twitter API and Python Twitter Tools, we will incorporate one of
my favorite Python libraries, the Natural Language Toolkit (NLTK). NLTK is a popular
and well-supported library for problems in the natural-language processing domain.
One of these problems is the study of n-grams, which are sequences of a fixed number
of consecutive terms. For example, a phrase with three separate words is called a
3-gram, and one with five terms would be called a 5-gram. NLTK is great for generating
collections of n-grams from otherwise unstructured text blobs (such as tweets). It
also provides excellent tools for parsing text with regular expressions, word
stemming, and much more.
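To make the idea of an n-gram concrete, here is a small plain-Python sketch. It is not NLTK's implementation: the `tokenize` and `ngrams` helpers and the regular expression are illustrative assumptions. NLTK offers `nltk.util.ngrams` for the same purpose.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a list of tuples."""
    # Zipping the token list against shifted copies of itself yields
    # each window of n adjacent tokens.
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = tokenize("The quick brown fox jumps")
print(ngrams(tokens, 3))
# [('the', 'quick', 'brown'), ('quick', 'brown', 'fox'), ('brown', 'fox', 'jumps')]
```

A five-word phrase therefore contains three 3-grams and exactly one 5-gram.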
Listing 12.5 provides an example of Twitter Streaming API statistics.
Listing 12.5 Twitter Streaming API statistics example
import json
from pandas import Series
from nltk.tokenize import RegexpTokenizer
from twitter import OAuth, TwitterStream
5. http://github.com/sixohsix/twitter
 
 
 