In-Depth Information
So far, I have been using get_tag to skip elements, but it also sports a return value containing the details of
the tag it found. So, you'd retrieve the information about the anchor with the following, capturing the result in
an array because get_tag can, by its nature, return more than one tag:
my @link = $stream->get_tag("a");
Because you know there is only one anchor in this particular HTML, it is $link[0] that is of interest. Inside this
is another array containing the following:
$link[0][0] # tag
$link[0][1] # attributes
$link[0][2] # attribute sequence
$link[0][3] # text
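To make the structure concrete, dumping the token shows something like the following (the href value here is
invented for illustration):
use Data::Dumper;
print Dumper($link[0]);
# $VAR1 = [
#           'a',                          # tag
#           { 'href' => '/story/1234' },  # attributes
#           [ 'href' ],                   # attribute sequence
#           '<a href="/story/1234">'      # text
#         ];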
Therefore, you can extract the link information with the following:
my $href = $link[0][1]{href};
And because get_tag only retrieves the information about the tag, you must return to the stream to extract all the
data between this <a> and the </a>:
my $storyHeadline = $stream->get_trimmed_text("/a");
From here, you can see that you need to skip the next opening <td> tag and get the story text between it and the
next closing </td> tag:
$stream->get_tag("td");
print $stream->get_trimmed_text("/td");
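Putting these steps together, here is a minimal, self-contained sketch; the HTML fragment is a stand-in for the
real page, which you would normally fetch over HTTP:
use strict;
use warnings;
use HTML::TokeParser;

# A stand-in fragment, shaped like the page described in the text
my $html = <<'HTML';
<table><tr>
<td><a href="/story/1234">Story headline</a></td>
<td>The full story text goes here.</td>
</tr></table>
HTML

my $stream = HTML::TokeParser->new(\$html) or die "Unable to parse HTML";

my @link          = $stream->get_tag("a");             # the anchor token
my $href          = $link[0][1]{href};                 # its href attribute
my $storyHeadline = $stream->get_trimmed_text("/a");   # text up to </a>

$stream->get_tag("td");                                # skip the next opening td
my $storyText = $stream->get_trimmed_text("/td");      # text up to the closing td

print "$storyHeadline ($href): $storyText\n";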
Because you are only getting the first story from the page, your scraping is done. If you wanted to get the first
two stories, for example, you'd need to correctly skip the remainder of this table, or row, before repeating the
parse loop.
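As a sketch, assuming each story in the page repeats the same anchor-then-text-cell pattern (an assumption that
would need checking against the real markup), the loop might look like this:
# Hypothetical loop over the first two stories
for my $n (1 .. 2) {
    my $link = $stream->get_tag("a") or last;   # scalar context: undef when no anchor remains
    my $href     = $link->[1]{href};
    my $headline = $stream->get_trimmed_text("/a");
    $stream->get_tag("td");                     # skip to the story's text cell
    my $text     = $stream->get_trimmed_text("/td");
    print "$n: $headline ($href)\n$text\n\n";
}
This only works if the next anchor in the stream really belongs to the next story; any other links in between
(navigation, adverts, and so on) would have to be skipped explicitly, which is why knowing the page structure
matters.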
Naturally, if this web page changes in any way, the code won't work!
This admittedly simple approach can fail when JavaScript is used to control the web page content or layout.
Most commonly, this occurs when the page uses AJAX calls for pagination. In this case, the next button loads some
data dynamically from the server (with JavaScript) and rewrites the contents of the appropriate <div> element. You
are unlikely to encounter such pages with the data sources that benefit home automation solutions, as none of the
examples presented here do so. However, if you do uncover one (supermarkets, for example, do this a lot), then you
need to upgrade to a headless browser solution, such as CasperJS and PhantomJS, which allows you to programmatically
click buttons on the page and invoke those AJAX requests. A new feature that also aims to simplify this process is
Chrome's "Copy as cURL," covered at https://twitter.com/ChromiumDev/status/317183238026186752.
Fortunately, this game of cat and mouse between the web developers and the screen scrapers often comes to a
pleasant end. For us! Tired of redesigning their sites every week, and in an attempt to connect with the Web 2.0 and
mashup communities on the Web, many companies are providing APIs to access their data. And, like most good APIs,
they remain stable between versions.