Advanced Analytical Theory and Methods: Text Analysis - Data Science and Big Data Analytics


09 <link>http://www.phones.com/link.htm</link>

10 <guid isPermaLink="false">1102345</guid>

11 <pubDate>Tue, 29 Aug 2011 09:00:00 -0400</pubDate>

12 </item>

13 </channel>

The content from the title (line 7), the description (line 8), and the published

date (pubDate, line 11) is what ACME is interested in.
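
As a rough illustration of pulling those three fields out of a feed, the sketch below parses a small RSS document with Python's standard library. The sample feed is hypothetical: the link, guid, and pubDate values mirror the listing above, but the title and description text are invented for the example, and `extract_items` is an illustrative helper name rather than anything defined in the text.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical feed loosely modeled on the listing above. The title and
# description values are invented for illustration only.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>bPhone review</title>
      <description>Example description text.</description>
      <link>http://www.phones.com/link.htm</link>
      <guid isPermaLink="false">1102345</guid>
      <pubDate>Tue, 29 Aug 2011 09:00:00 -0400</pubDate>
    </item>
  </channel>
</rss>"""


def extract_items(feed_xml):
    """Return the title, description, and publication date of each <item>."""
    root = ET.fromstring(feed_xml)
    records = []
    for item in root.iter("item"):
        pub_date = item.findtext("pubDate")
        records.append({
            "title": item.findtext("title", default=""),
            "description": item.findtext("description", default=""),
            # RSS dates use the RFC 822 format, which email.utils can parse.
            "published": parsedate_to_datetime(pub_date) if pub_date else None,
        })
    return records


if __name__ == "__main__":
    for record in extract_items(SAMPLE_FEED):
        print(record["published"], record["title"], record["description"])
```

Keeping the parsed date as a timezone-aware `datetime` rather than a string makes later filtering by time window simpler, though a project could just as well store the raw text.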

If the plan is to collect user comments on ACME's products from online shops

and review sites where APIs or data feeds are not provided, the team may have

to write web scrapers to parse web pages and automatically extract the interesting

data from those HTML files. A web scraper is a software program (bot) that

systematically browses the World Wide Web, downloads web pages, extracts useful

information, and stores it somewhere for further study.

Unfortunately, it is nearly impossible to write a one-size-fits-all web scraper. This

is because websites like online shops and review sites have different structures. It is

common to customize a web scraper for a specific website. In addition, the website

formats can change over time, which requires the web scraper to be updated

periodically. To build a web scraper for a specific website, one must study the

HTML source code of its web pages to find patterns before extracting any useful

content. For example, the team may find out that each user comment in the HTML

is enclosed by a DIV element inside another DIV with the ID usrcommt, or it might

be enclosed by a DIV element with the CLASS commtcls.
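
To make the two example patterns concrete, here is a minimal sketch that collects comment text using Python's built-in `html.parser`. It assumes the page uses exactly the ID usrcommt and the CLASS commtcls mentioned above; the `CommentExtractor` class and the sample page are hypothetical, and a real site would need its own rules.

```python
from html.parser import HTMLParser


class CommentExtractor(HTMLParser):
    """Collect text from DIVs directly inside DIV#usrcommt or with class commtcls."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self._stack = []     # one entry per open <div>: "container", "comment", "other"
        self._buffer = None  # text pieces of the comment currently being read

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        parent = self._stack[-1] if self._stack else None
        if self._buffer is not None:
            kind = "other"  # a div nested inside a comment is part of that comment
        elif attrs.get("id") == "usrcommt":
            kind = "container"
        elif parent == "container" or "commtcls" in classes:
            kind = "comment"
            self._buffer = []
        else:
            kind = "other"
        self._stack.append(kind)

    def handle_endtag(self, tag):
        if tag != "div" or not self._stack:
            return
        if self._stack.pop() == "comment":
            # Join the pieces and collapse runs of whitespace.
            self.comments.append(" ".join("".join(self._buffer).split()))
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


# Hypothetical page fragment showing both patterns.
SAMPLE_PAGE = """
<html><body>
<div id="usrcommt">
  <div>Love my <b>bPhone</b>!</div>
  <div>Battery could be better.</div>
</div>
<div class="commtcls review">The bEbook is great.</div>
<div class="footer">Contact ACME</div>
</body></html>
"""

if __name__ == "__main__":
    extractor = CommentExtractor()
    extractor.feed(SAMPLE_PAGE)
    for comment in extractor.comments:
        print(comment)
```

An event-driven parser is used here because it tolerates the loosely formed HTML that real pages often contain; the footer DIV is skipped because it matches neither pattern.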

The team can then construct the web scraper based on the identified patterns. The

scraper can use the curl tool [7] to fetch HTML source code given specific URLs,

use XPath [8] and regular expressions to select and extract the data that match the

patterns, and write them into a data store.
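
The fetch, select, and store steps might be wired together roughly as follows. This is only a sketch under several assumptions: curl is installed and on the PATH, the page is well-formed enough for `xml.etree.ElementTree` (which supports only a subset of XPath; messy real-world HTML would likely need a dedicated HTML parser), and SQLite stands in for whatever data store the team chooses. The function names and the table layout are hypothetical.

```python
import sqlite3
import subprocess
import xml.etree.ElementTree as ET


def fetch(url):
    """Download a page by shelling out to curl (-s: silent, -L: follow redirects)."""
    result = subprocess.run(
        ["curl", "-sL", url], capture_output=True, text=True, check=True
    )
    return result.stdout


def select_comments(markup):
    """Select comment DIVs using the XPath subset that ElementTree supports."""
    root = ET.fromstring(markup)
    nodes = root.findall(".//div[@id='usrcommt']/div")
    # itertext() gathers text from nested tags; split/join collapses whitespace.
    return [" ".join("".join(node.itertext()).split()) for node in nodes]


def store(comments, connection):
    """Write the extracted comments into a SQLite table (assumed schema)."""
    connection.execute("CREATE TABLE IF NOT EXISTS comments (body TEXT)")
    connection.executemany(
        "INSERT INTO comments (body) VALUES (?)", [(c,) for c in comments]
    )
    connection.commit()


# Hypothetical well-formed page used in place of a live download.
SAMPLE_MARKUP = (
    "<html><body><div id='usrcommt'>"
    "<div>Love my <b>bPhone</b>!</div>"
    "<div>Too pricey.</div>"
    "</div></body></html>"
)

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # With a real URL this would be: store(select_comments(fetch(url)), conn)
    store(select_comments(SAMPLE_MARKUP), conn)
    for (body,) in conn.execute("SELECT body FROM comments"):
        print(body)
```

Keeping the three steps as separate functions means that when a site changes its layout, only the selection step should need to be rewritten.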

Regular expressions can find words and strings that match particular patterns

in the text effectively and efficiently. Table 9.3 shows some regular expressions.

The general idea is that once text from the fields of interest is obtained, regular

expressions can help identify if the text is really interesting for the project. In

this case, do those fields mention bPhone, bEbook, or ACME? When matching

the text, regular expressions can also take into account capitalization, common

misspellings, common abbreviations, and special formats for e-mail addresses,

dates, and telephone numbers.
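
A small sketch of such a relevance check is shown below. The pattern is illustrative and not taken from Table 9.3: the optional hyphens and the tolerated "bEbok" spelling are guesses at how users might write the product names, and `mentions_acme` is a hypothetical helper.

```python
import re

# Matches bPhone, bEbook, or ACME as whole words, ignoring capitalization.
# "-?" allows spellings such as "b-phone"; "boo?k" also accepts the
# misspelling "bEbok". These variants are assumptions for illustration.
PRODUCT_PATTERN = re.compile(r"\b(b-?phone|b-?e-?boo?k|acme)\b", re.IGNORECASE)


def mentions_acme(text):
    """Return True if the text mentions one of the names of interest."""
    return PRODUCT_PATTERN.search(text) is not None


if __name__ == "__main__":
    samples = [
        "Just bought a BPHONE",
        "the b-ebook reader from Acme",
        "my bEbok froze",
        "I prefer a paper book",
    ]
    for sample in samples:
        print(mentions_acme(sample), sample)
```

The word boundaries (`\b`) keep the pattern from firing on longer words that merely contain one of the names.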
