Data Sources: Making Homes Smart - Smart Home Automation with Linux and Raspberry Pi


Lest I advocate scraping a page of a litigious company, I will provide an example using my own Minerva site to
retrieve the most recent story from the news page at http://www.minervahome.net/news.htm.

Begin by loading the page in a web browser to get a feel for the page layout and to see where the target
information is located. Also, review other pages to see whether there's any commonality that can be exploited. You
can do this by reviewing the source (as either a whole page or with a “view source selection” option) or enlisting the
help of Firebug³ to highlight the tables and subcomponents within the table.

Then look for any “low-hanging fruit.” These are the easily solved parts of a problem, so you might find the
desired text inside a specially named div element or included inside a table with a particular id attribute. Many
professionally designed web sites do this to make redesigns quicker and unwittingly help the scraper.
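
As a minimal sketch of the low-hanging-fruit case, the following walks a token stream until it finds a div with a particular id and then reads its text. The HTML fragment, the id value "latest-news", and the helper name text_of_div are all hypothetical, invented for illustration; they are not taken from the Minerva site.

```perl
#!/usr/bin/perl -w
use strict;
use HTML::TokeParser;

# Hypothetical page fragment; the id "latest-news" is an assumption for this sketch.
my $html = <<'HTML';
<html><body>
<div id="header">Site banner</div>
<div id="latest-news">Minerva update released</div>
<div id="footer">Contact details</div>
</body></html>
HTML

# Return the trimmed text of the first div whose id attribute matches, or undef.
sub text_of_div {
    my ($source, $wanted_id) = @_;
    my $stream = HTML::TokeParser->new(\$source);
    # For a start tag, get_tag returns [tag, attributes, attribute order, raw text]
    while (my $tag = $stream->get_tag("div")) {
        my $id = $tag->[1]{id};
        next unless defined $id && $id eq $wanted_id;
        return $stream->get_trimmed_text("/div");
    }
    return;
}

my $news = text_of_div($html, "latest-news");
print defined $news ? "$news\n" : "not found\n";
```

Checking the attribute hash rather than counting elements means the lookup should keep working if unrelated div elements are added or removed around the target.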

If there are no distinguishing features around the text, look to the elements surrounding it . . . and the elements
surrounding those. Work outward until you find something unique enough to be of interest or you reach the root html
node. If you've found nothing unique, then you will have to describe the data by its position, such as “in the first
row and second column of the third table.”

Once you are able to describe the location of the data in human terms, you can start writing the code! The process
involves a mechanized agent that is able to load the web page and traverse links and a stream processor that skips
over the HTML tags. You begin the scraping with a fairly common loading block like this:

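#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
use HTML::TokeParser;

# Fetch the news page
my $agent = WWW::Mechanize->new();
$agent->get("http://www.minervahome.net/news.htm");

# Wrap the downloaded HTML in a token stream (TokeParser takes a reference to the text)
my $html = $agent->content;
my $stream = HTML::TokeParser->new(\$html);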

Given the $stream, you can now skip to the fourth table, for example, by jumping over four of the opening table
tags using the following:

foreach (1..4) {
    $stream->get_tag("table");
}
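
The loop above assumes the page really contains four tables. A slightly more defensive sketch, run here against a made-up HTML string rather than the live page, checks the return value of get_tag, which is undef once the stream runs out of matching tags. The helper name skip_to_nth_tag is hypothetical.

```perl
#!/usr/bin/perl -w
use strict;
use HTML::TokeParser;

# Made-up document with exactly four tables, for illustration only.
my $html = join "", map { "<table><tr><td>cell $_</td></tr></table>" } 1 .. 4;

# Advance the stream past the nth opening tag of the given name.
# Returns 1 on success, 0 if the document has fewer than n such tags.
sub skip_to_nth_tag {
    my ($stream, $name, $n) = @_;
    for (1 .. $n) {
        return 0 unless $stream->get_tag($name);
    }
    return 1;
}

my $stream = HTML::TokeParser->new(\$html);
if (skip_to_nth_tag($stream, "table", 4)) {
    $stream->get_tag("td");
    print $stream->get_trimmed_text("/td"), "\n";
} else {
    print "fewer than four tables found\n";
}
```

Failing loudly when a table is missing is useful in practice, because a site redesign would otherwise leave the scraper silently reading from the wrong place.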

Notice that get_tag positions the stream point immediately after the opening tag given, in this case table.
Consequently, the stream point is now inside the fourth table. Because our data is on the first row, you don't need to
worry about skipping the tr tag, so you can jump straight into the second column with this:

$stream->get_tag("td");

as skipping the td tag will automatically skip the preceding tr. The stream is now positioned exactly where you want
it. The HTML structure of this block is as follows:

<a href="url">Main title</a></td>

Main story text
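
With the stream sitting just before the link, the remaining fields can be read with get_tag and get_trimmed_text. The following is only a sketch under stated assumptions: the sample HTML and its URL are invented, and it assumes the story text sits in the td that follows the title cell, which the listing above does not show explicitly.

```perl
#!/usr/bin/perl -w
use strict;
use HTML::TokeParser;

# Invented sample row; the assumption is that the story text lives in the next td.
my $html = <<'HTML';
<table><tr>
<td><a href="http://www.example.com/story.htm">Main title</a></td>
<td>Main   story
text</td>
</tr></table>
HTML

my $stream = HTML::TokeParser->new(\$html);
$stream->get_tag("td");                        # now inside the title cell

my $link  = $stream->get_tag("a");             # [tag, attributes, attribute order, raw text]
my $url   = $link->[1]{href};                  # attributes are held in a hash reference
my $title = $stream->get_trimmed_text("/a");   # text up to the closing </a>

$stream->get_tag("td");                        # move into the following cell
my $story = $stream->get_trimmed_text("/td");  # runs of whitespace collapse to one space

print "$title ($url)\n$story\n";
```

get_trimmed_text is used in preference to get_text here because scraped cells often contain stray line breaks and indentation from the page's own source formatting.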

³ Firebug is an extension to Firefox that allows web developers and curious geeks full access to the inner workings of
the web pages that appear in the browser.
