09 <link>http://www.phones.com/link.htm</link>
10 <guid isPermaLink="false">1102345</guid>
11 <pubDate>Tue, 29 Aug 2011 09:00:00 -0400</pubDate>
12 </item>
13 </channel>
The content from the title (line 7), the description (line 8), and the published
date (pubDate, line 11) is what ACME is interested in.
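As a minimal sketch of how those three fields might be pulled out of such a feed, the
Python snippet below parses an RSS item with the standard library. The link, guid, and
pubDate values come from the listing above; the title and description text, the
SAMPLE_FEED wrapper, and the extract_items helper are hypothetical and are there only
to make the example self-contained.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Hypothetical feed: only link, guid, and pubDate are taken from the listing;
# the title and description text are invented placeholders.
SAMPLE_FEED = """<rss version="2.0"><channel>
<item>
<title>Hypothetical review title</title>
<description>Hypothetical review text about the bPhone.</description>
<link>http://www.phones.com/link.htm</link>
<guid isPermaLink="false">1102345</guid>
<pubDate>Tue, 29 Aug 2011 09:00:00 -0400</pubDate>
</item>
</channel></rss>"""


def extract_items(feed_xml):
    """Return title, description, and published date for each <item> in the feed."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "description": item.findtext("description", default=""),
            # pubDate uses the RFC 822 date format, which email.utils can parse.
            "published": parsedate_to_datetime(item.findtext("pubDate")),
        })
    return items


items = extract_items(SAMPLE_FEED)
for entry in items:
    print(entry["published"].isoformat(), entry["title"])
```

Parsing the date into a datetime object, rather than keeping it as a string, makes it
easier to sort or filter the collected items by time later on.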
If the plan is to collect user comments on ACME's products from online shops
and review sites where APIs or data feeds are not provided, the team may have
to write web scrapers to parse web pages and automatically extract the interesting
data from those HTML files. A web scraper is a software program (bot) that
systematically browses the World Wide Web, downloads web pages, extracts useful
information, and stores it somewhere for further study.
Unfortunately, it is nearly impossible to write a one-size-fits-all web scraper,
because websites such as online shops and review sites are structured differently. It is
common to customize a web scraper for a specific website. In addition, website
formats can change over time, so the web scraper must be updated periodically.
To build a web scraper for a specific website, one must study the
HTML source code of its web pages to find patterns before extracting any useful
content. For example, the team may find out that each user comment in the HTML
is enclosed by a DIV element inside another DIV with the ID usrcommt, or it might
be enclosed by a DIV element with the CLASS commtcls.
The team can then construct the web scraper based on the identified patterns. The
scraper can use the curl tool [7] to fetch HTML source code given specific URLs,
use XPath [8] and regular expressions to select and extract the data that match the
patterns, and write them into a data store.
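The extract-and-store steps might look roughly like the Python sketch below. It is an
illustration under stated assumptions, not the team's actual scraper: the fetch step
(for example with curl) is omitted and replaced by a hypothetical well-formed page
fragment, the names SAMPLE_HTML, extract_comments, and store_comments are invented,
and an in-memory SQLite table stands in for the data store. The standard library's
ElementTree supports only a subset of XPath and requires well-formed markup, so real
pages would likely need a more tolerant HTML parser.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical page fragment; a real scraper would first download the page.
SAMPLE_HTML = """
<html><body>
  <div id="usrcommt">
    <div>Love my bPhone!</div>
    <div>The bEbook screen is too dim.</div>
  </div>
  <div class="other">Unrelated text</div>
</body></html>
"""


def extract_comments(page):
    """Return the text of each DIV directly inside the DIV with ID usrcommt."""
    root = ET.fromstring(page.strip())
    nodes = root.findall(".//div[@id='usrcommt']/div")
    return ["".join(node.itertext()).strip() for node in nodes]


def store_comments(conn, comments):
    """Write the extracted comments into a simple one-column table."""
    conn.execute("CREATE TABLE IF NOT EXISTS comments (body TEXT)")
    conn.executemany("INSERT INTO comments (body) VALUES (?)",
                     [(c,) for c in comments])
    conn.commit()


comments = extract_comments(SAMPLE_HTML)
conn = sqlite3.connect(":memory:")  # assumed stand-in for the real data store
store_comments(conn, comments)
print(comments)
```

If the site marks comments with the CLASS commtcls instead, the path expression would
change to something like `.//div[@class='commtcls']`, which is why each scraper tends
to be tied to one site's layout.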
Regular expressions can find words and strings that match particular patterns
in the text effectively and efficiently. Table 9.3 shows some regular expressions.
The general idea is that once text from the fields of interest is obtained, regular
expressions can help identify whether the text is relevant to the project. In
this case, do those fields mention bPhone, bEbook, or ACME? When matching
the text, regular expressions can also take into account capitalization, common
misspellings, common abbreviations, and special formats for e-mail addresses,
dates, and telephone numbers.
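A relevance check of this kind can be sketched in Python as follows. The pattern is an
assumption made for illustration: it ignores case, allows an optional space or hyphen
inside the product names, and tolerates one guessed misspelling (a single "o" in
bEbook). The names PRODUCT_PATTERN and mentions_acme are hypothetical.

```python
import re

# Hypothetical pattern: case-insensitive, optional space or hyphen inside the
# product names, and "boo?k" to accept the assumed misspelling "bEbok".
PRODUCT_PATTERN = re.compile(
    r"\b(?:b[\s-]?phone|b[\s-]?e[\s-]?boo?k|acme)\b",
    re.IGNORECASE,
)


def mentions_acme(text):
    """Return True if the text mentions bPhone, bEbook, or ACME."""
    return PRODUCT_PATTERN.search(text) is not None


for sample in ["I bought a BPHONE yesterday", "The phone rang twice"]:
    print(sample, "->", mentions_acme(sample))
```

The word boundaries (`\b`) keep the pattern from firing on unrelated words that merely
contain "phone" or "acme" as a substring.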