9.3 Collecting Raw Text
Recall that in the Data Analytics Lifecycle seen in Chapter 2, “Data Analytics
Lifecycle,” discovery is the first phase. In it, the Data Science team investigates
the problem, understands the necessary data sources, and formulates initial
hypotheses. Correspondingly, for text analysis, data must be collected before
anything can happen. The Data Science team starts by actively monitoring various
websites for user-generated content. The content being collected
could be related articles from news portals and blogs, comments on ACME's
products from online shops or review sites, or social media posts that contain
the keywords bPhone or bEbook. Regardless of where the data comes from, it's likely
that the team would deal with semi-structured data such as HTML web pages,
Really Simple Syndication (RSS) feeds, XML, or JavaScript Object Notation (JSON)
files. Enough structure needs to be imposed to find the part of the raw text that the
team really cares about. In the brand management example, ACME is interested in
what the reviews say about bPhone or bEbook and when the reviews are posted.
Therefore, the team will actively collect such information.
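To illustrate what imposing structure on a semi-structured source might look like, the following is a minimal Python sketch that pulls the review text and posting time out of an RSS feed. The feed content, the KEYWORDS tuple, and the extract_reviews helper are assumptions made up for this illustration; they are not ACME's actual data or the team's actual tooling.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical RSS 2.0 feed from a review site; all items are invented.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example review feed</title>
    <item>
      <title>bPhone review</title>
      <description>The bPhone battery lasts all day.</description>
      <pubDate>Thu, 15 Aug 2013 20:06:48 +0000</pubDate>
    </item>
    <item>
      <title>Unrelated post</title>
      <description>Nothing about ACME products here.</description>
      <pubDate>Fri, 16 Aug 2013 09:00:00 +0000</pubDate>
    </item>
  </channel>
</rss>"""

# Keywords of interest, compared case-insensitively.
KEYWORDS = ("bphone", "bebook")


def extract_reviews(rss_text):
    """Return the text and posting time of items that mention a keyword."""
    root = ET.fromstring(rss_text)
    reviews = []
    for item in root.iter("item"):
        text = item.findtext("description", default="")
        posted = item.findtext("pubDate")
        if any(keyword in text.lower() for keyword in KEYWORDS):
            reviews.append({
                "text": text,
                # RSS 2.0 dates use the RFC 822 style, which this helper parses.
                "posted_at": parsedate_to_datetime(posted) if posted else None,
            })
    return reviews


reviews = extract_reviews(SAMPLE_RSS)
for review in reviews:
    print(review["posted_at"], review["text"])
```

Only the two pieces of information the team cares about, the review text and the posting time, survive the extraction; everything else in the feed is discarded.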
Many websites and services offer public APIs [4, 5] for third-party developers to
access their data. For example, the Twitter API [6] allows developers to choose from
the Streaming API or the REST API to retrieve public Twitter posts that contain the
keywords bPhone or bEbook. Developers can also read tweets in real time from a
specific user or tweets posted near a specific venue. The fetched tweets are in the
JSON format.
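Because the tweets arrive as JSON, a standard JSON parser is enough to reach the fields of interest. The following is a minimal Python sketch, assuming a tweet has already been fetched and is held as a string. The summarize_tweet helper is hypothetical, and the value of the text field is a placeholder invented for the example; no call to the Twitter API itself is shown.

```python
import json
from datetime import datetime

# A shortened tweet held as a JSON string. The "text" value is a placeholder
# invented for this sketch, not real tweet content.
RAW_TWEET = """{
  "created_at": "Thu Aug 15 20:06:48 +0000 2013",
  "coordinates": {
    "type": "Point",
    "coordinates": [-157.81538521787621, 21.3002578885766]
  },
  "favorite_count": 0,
  "id": 368101488276824010,
  "text": "Placeholder text mentioning bPhone"
}"""


def summarize_tweet(raw):
    """Parse one tweet and keep the fields needed for text analysis."""
    tweet = json.loads(raw)
    # created_at looks like "Thu Aug 15 20:06:48 +0000 2013".
    # %a and %b assume English day and month names (the default C locale).
    posted_at = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
    # The coordinates field may be null; GeoJSON points list longitude first.
    point = tweet.get("coordinates") or {}
    longitude, latitude = point.get("coordinates", (None, None))
    return {
        "id": tweet["id"],
        "posted_at": posted_at,
        "text": tweet.get("text", ""),
        "longitude": longitude,
        "latitude": latitude,
    }


summary = summarize_tweet(RAW_TWEET)
print(summary["posted_at"].isoformat(), summary["text"])
```

Converting created_at to a timezone-aware datetime early makes it easier to order or bucket posts by time in later steps.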
As an example, a sample tweet that contains the keyword bPhone fetched using the
Twitter Streaming API version 1.1 is shown next.
{
  "created_at": "Thu Aug 15 20:06:48 +0000 2013",
  "coordinates": {
    "type": "Point",
    "coordinates": [
      -157.81538521787621,
      21.3002578885766
    ]
  },
  "favorite_count": 0,
  "id": 368101488276824010,