In-Depth Information
So far, I have been using get_tag to skip elements, but it also sports a return value containing the details of
the tag it found. So, you'd retrieve the information about the anchor with the following, capturing the result in
an array because get_tag can, by its nature, return more than one tag:
my @link = $stream->get_tag("a");
Because you know there is only one anchor in this particular HTML, it is $link[0] that is of interest. Inside this
is another array containing the following:
$link[0][0] # tag
$link[0][1] # attributes
$link[0][2] # attribute sequence
$link[0][3] # text
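To make the structure concrete, dumping the token shows something like the following (the href value here is
invented for illustration):
use Data::Dumper;
print Dumper($link[0]);
# $VAR1 = [
#           'a',                          # tag
#           { 'href' => '/story/1234' },  # attributes
#           [ 'href' ],                   # attribute sequence
#           '<a href="/story/1234">'      # text
#         ];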
Therefore, you can extract the link information with the following:
my $href = $link[0][1]{href};
And because get_tag only retrieves the information about the tag, you must return to the stream to extract all the
data between this <a> and the </a>:
my $storyHeadline = $stream->get_trimmed_text("/a");
From here, you can see that you need to skip the next opening <td> tag and get the story text between it and the
next closing </td> tag:
$stream->get_tag("td");
print $stream->get_trimmed_text("/td");
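Putting these steps together, here is a minimal, self-contained sketch; the HTML fragment is a stand-in for the
real page, which you would normally fetch over HTTP:
use strict;
use warnings;
use HTML::TokeParser;

# A stand-in fragment, shaped like the page described in the text
my $html = <<'HTML';
<table><tr>
<td><a href="/story/1234">Story headline</a></td>
<td>The full story text goes here.</td>
</tr></table>
HTML

my $stream = HTML::TokeParser->new(\$html) or die "Unable to parse HTML";

my @link          = $stream->get_tag("a");             # the anchor token
my $href          = $link[0][1]{href};                 # its href attribute
my $storyHeadline = $stream->get_trimmed_text("/a");   # text up to </a>

$stream->get_tag("td");                                # skip the next opening td
my $storyText = $stream->get_trimmed_text("/td");      # text up to the closing td

print "$storyHeadline ($href): $storyText\n";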
Because you are only getting the first story from the page, your scraping is done. If you wanted to get the first
two stories, for example, you'd need to correctly skip the remainder of this table, or row, before repeating the
parse loop.
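As a sketch, assuming each story in the page repeats the same anchor-then-text-cell pattern (an assumption that
would need checking against the real markup), the loop might look like this:
# Hypothetical loop over the first two stories
for my $n (1 .. 2) {
    my $link = $stream->get_tag("a") or last;   # scalar context: undef when no anchor remains
    my $href     = $link->[1]{href};
    my $headline = $stream->get_trimmed_text("/a");
    $stream->get_tag("td");                     # skip to the story's text cell
    my $text     = $stream->get_trimmed_text("/td");
    print "$n: $headline ($href)\n$text\n\n";
}
This only works if the next anchor in the stream really belongs to the next story; any other links in between
(navigation, adverts, and so on) would have to be skipped explicitly, which is why knowing the page structure
matters.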
Naturally, if this web page changes in any way, the code won't work!
This admittedly simple approach can fail when JavaScript is used to control the web page content or layout.
Most commonly, this occurs when the page uses AJAX calls for pagination. In this case, the next button loads some
data dynamically from the server (with JavaScript) and rewrites the contents of the appropriate <div> element. You
are unlikely to encounter such pages with the data sources that benefit home automation solutions, as none of the
examples presented here do so. However, if you do uncover one (supermarkets, for example, do this a lot), then you
need to upgrade to a headless browser solution, such as CasperJS and PhantomJS, which allows you to programmatically
click buttons on the page and invoke those AJAX requests. A new feature that also aims to simplify this process is
Chrome's "Copy as cURL," covered at https://twitter.com/ChromiumDev/status/317183238026186752.
Fortunately, this game of cat and mouse between the web developers and the screen scrapers often comes to a
pleasant end. For us! Tired of redesigning their sites every week, and in an attempt to connect with the Web 2.0 and
mashup communities on the Web, many companies are providing APIs to access their data. And, like most good APIs,
they remain stable between versions.