9.2 A Text Analysis Example
To further describe the three text analysis steps, consider the fictitious company
ACME, maker of two products: bPhone and bEbook. ACME is in strong
competition with other companies that manufacture and sell similar products. To
succeed, ACME needs to produce excellent phones and eBook readers and increase
sales.
One of the ways the company does this is to monitor what is being said about
ACME products in social media. In other words, what is the buzz on its products?
ACME wants to search all that is said about ACME products in social media sites,
such as Twitter and Facebook, and popular review sites, such as Amazon and
ConsumerReports. It wants to answer questions such as these:
• Are people mentioning its products?
• What is being said? Are the products seen as good or bad? If people think
an ACME product is bad, why? For example, are they complaining about
the battery life of the bPhone, or the response time in their bEbook?
ACME can monitor the social media buzz using a simple process based on the three
steps outlined in Section 9.1. This process is illustrated in Figure 9.1, and it includes
the modules in the next list.
1. Collect raw text (Section 9.3). This corresponds to Phase 1 and Phase 2 of
the Data Analytic Lifecycle. In this step, the Data Science team at ACME
monitors websites for references to specific products. The websites may
include social media and review sites. The team could interact with social
network application programming interfaces (APIs), process data feeds, or
scrape pages and use product names as keywords to get the raw data.
Regular expressions are commonly used in this case to identify text that
matches certain patterns. Additional filters can be applied to the raw data
for a more focused study. For example, only retrieving the reviews
originating in New York instead of the entire United States would allow
ACME to conduct regional studies on its products. Generally, it is a good
practice to apply filters during the data collection phase. They can reduce
I/O workloads and minimize the storage requirements.
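The keyword matching and regional filtering described in this step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (a dictionary with "text" and "region" fields), the function name collect_mentions, and the sample posts are all hypothetical and stand in for whatever an API, data feed, or scraper would actually return.

```python
import re

# Product names from the ACME example, matched as whole words, case-insensitively.
PRODUCT_PATTERN = re.compile(r"\b(bPhone|bEbook)\b", re.IGNORECASE)


def collect_mentions(posts, region=None):
    """Keep posts that mention a product, optionally limited to one region.

    `posts` is assumed to be an iterable of dicts with "text" and "region"
    keys (a hypothetical layout chosen for this sketch).
    """
    matches = []
    for post in posts:
        # Apply the regional filter first so unwanted posts are never stored.
        if region is not None and post.get("region") != region:
            continue
        found = PRODUCT_PATTERN.findall(post["text"])
        if found:
            products = sorted({name.lower() for name in found})
            matches.append(
                {"text": post["text"], "region": post.get("region"), "products": products}
            )
    return matches


# Made-up sample data for illustration only.
sample_posts = [
    {"text": "My bPhone battery dies by noon.", "region": "New York"},
    {"text": "Loving the new bEbook, but page turns are slow.", "region": "California"},
    {"text": "Great weather today!", "region": "New York"},
    {"text": "Switched from a bphone to a bEbook for reading.", "region": "New York"},
]

ny_mentions = collect_mentions(sample_posts, region="New York")
for mention in ny_mentions:
    print(mention["products"], "-", mention["text"])
```

Filtering by region before matching mirrors the advice above: posts outside the study area are discarded early, so they never need to be stored or processed further.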
2. Represent text (Section 9.4). Convert each review into a suitable document
representation with proper indices, and build a corpus based on these