Lest I advocate scraping a page of a litigious company, I will provide an example using my own Minerva site to
retrieve the most recent story from the news page at http://www.minervahome.net/news.htm.
Begin by loading the page in a web browser to get a feel for the page layout and to see where the target
information is located. Also, review other pages to see whether there's any commonality that can be exploited. You
can do this by reviewing the source (as either a whole page or with a “view source selection” option) or enlisting the
help of Firebug 3 to highlight the tables and subcomponents within the table.
Then look for any “low-hanging fruit.” These are the easily solved parts of a problem, so you might find the
desired text inside a specially named div element or included inside a table with a particular id attribute. Many
professionally designed web sites do this to make redesigns quicker and unwittingly help the scraper.
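As a rough illustration of the low-hanging-fruit case, the sketch below uses HTML::TokeParser (the same module that appears in the loading block later in this section) to pull the text out of an element with a known id. The markup, the id latest-news, and the helper name text_by_id are all made up for the example; a real page will use its own names.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical markup: the id "latest-news" is an assumption for illustration.
my $html = <<'HTML';
<html><body>
<div id="menu">Home | News</div>
<div id="latest-news">Minerva update released</div>
</body></html>
HTML

# Hypothetical helper: return the trimmed text of the first <$tag> whose id
# attribute equals $id, or undef if no such element exists.
sub text_by_id {
    my ($html, $tag, $id) = @_;
    my $stream = HTML::TokeParser->new(\$html);
    while (my $token = $stream->get_tag($tag)) {
        my $attrs = $token->[1];    # hash ref of the tag's attributes
        next unless defined $attrs->{id} && $attrs->{id} eq $id;
        return $stream->get_trimmed_text("/$tag");
    }
    return;
}

my $headline = text_by_id($html, "div", "latest-news");
print defined $headline ? "$headline\n" : "not found\n";
```

Because the lookup keys on the id rather than on position, it should keep working if the site adds or removes unrelated elements around the target.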
If there are no distinguishing features around the text, look to the elements surrounding it . . . and the elements
surrounding those. Work outward until you find something unique enough to be of interest or you reach the root html
node. If you've found nothing unique, then you will have to describe the data by its position, such as “in the first row and
second column of the third table.”
Once you are able to describe the location of the data in human terms, you can start writing the code! The process
involves a mechanized agent that is able to load the web page and traverse links and a stream processor that skips
over the HTML tags. You begin the scraping with a fairly common loading block like this:
#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
use HTML::TokeParser;

# Fetch the page, then hand its HTML to the token stream parser.
my $agent = WWW::Mechanize->new();
$agent->get("http://www.minervahome.net/news.htm");
my $content = $agent->content;
my $stream = HTML::TokeParser->new(\$content);
Given the $stream, you can now skip to the fourth table, for example, by jumping over four of the opening table
tags using the following:
foreach (1..4) {
    $stream->get_tag("table");
}
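Positional scraping of this kind breaks quietly when the page layout changes, because get_tag returns undef once the document runs out of matching tags. One way to guard against that is sketched below. The helper name skip_tags and the two-table stand-in page are assumptions for the example, not part of the Minerva code.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: advance past $count opening <$tag> tags.
# Returns 1 on success, 0 if the document ran out of tags first.
sub skip_tags {
    my ($stream, $tag, $count) = @_;
    for (1 .. $count) {
        return 0 unless defined $stream->get_tag($tag);
    }
    return 1;
}

# Stand-in page containing only two tables (an assumption for illustration).
my $html = "<table><tr><td>first</td></tr></table>"
         . "<table><tr><td>second</td></tr></table>";

my $stream = HTML::TokeParser->new(\$html);
if (skip_tags($stream, "table", 2)) {
    $stream->get_tag("td");                          # step into the first cell
    print $stream->get_trimmed_text("/td"), "\n";
} else {
    warn "Page layout has changed; expected table not found\n";
}
```

Checking the return value lets the script report a layout change instead of carrying on with a stream that has already reached the end of the page.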
Notice that get_tag positions the stream point immediately after the opening tag given, in this case table.
Consequently, the stream point is now inside the fourth table. Because our data is on the first row, you don't need to
worry about skipping the tr tag, so you can jump straight into the second column with this:
$stream->get_tag("td");
$stream->get_tag("td");
Skipping to the td tag automatically skips the preceding tr. The stream is now positioned exactly where you want
it. The HTML structure of this block is as follows:
<a href="url">Main title</a></td>
<td valign="top">
Main story text
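Given a block shaped like this, one plausible way to read the link, the title, and the story text from the stream is sketched here. The fragment is a stand-in: the example URL, the opening td, and the closing td after the story text are assumptions, since only part of the real markup is shown above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TokeParser;

# Stand-in fragment shaped like the block above. The URL, the opening <td>,
# and the final </td> are assumptions for illustration.
my $html = <<'HTML';
<td><a href="http://example.com/story.htm">Main title</a></td>
<td valign="top">
Main story text
</td>
HTML

my $stream = HTML::TokeParser->new(\$html);
$stream->get_tag("td");                        # stand-in for the positioning steps

my $link  = $stream->get_tag("a");             # [tag, attrs, attr order, raw text]
my $url   = $link->[1]{href};                  # attribute hash holds the href
my $title = $stream->get_trimmed_text("/a");   # text up to the closing </a>

$stream->get_tag("td");                        # move into the story cell
my $story = $stream->get_trimmed_text("/td");  # text up to the closing </td>

print "$title ($url)\n$story\n";
```

get_trimmed_text collapses runs of whitespace and trims both ends, which suits story text that is spread over several lines in the source.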
3 Firebug is an extension to Firefox that allows web developers and curious geeks full access to the inner workings of the web pages
that appear in the browser.