Data Sources: Making Homes Smart - Smart Home Automation with Linux and Raspberry Pi


In some areas it is even legally questionable whether you are allowed to provide tools that improve or change the format of existing (copyrighted) data. Fortunately, most companies turn a blind eye to this area, as they do for the internal distribution of data to members of your household (not that they'd know, or be able to prove it, if you did).

The larger issue has to do with improvements to the data, as most data is either too raw or too complex to be useful. Let's take a web site containing the weather forecast as an example; the raw data might include only the string “rain, 25,” which would need to be parsed into a nice icon and a temperature bar to be user-friendly. A complex report could include a friendly set of graphics on the original site but make the original data set unavailable to anyone else who either tries to load the report from another site through deep linking or tries to reference the source table data used to build the image.
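As a rough illustration of turning raw data into something friendlier, the following Python sketch splits a string such as "rain, 25" into an icon name and a text-mode temperature bar. The function name parse_forecast, the ICONS table, the icon file names, and the 0 to 40 degree scale are all assumptions made for this example; they do not come from any particular weather site.

```python
# Hypothetical mapping from a condition word to an icon file name.
ICONS = {"rain": "rain.png", "sun": "sun.png", "cloud": "cloud.png"}


def parse_forecast(raw, max_temp=40, bar_width=20):
    """Split a raw 'condition, temperature' string into display-ready parts."""
    # Tolerate surrounding whitespace and a trailing comma in the raw string.
    condition, temperature = raw.strip().rstrip(",").split(",")
    degrees = int(temperature.strip())
    # Scale the temperature onto the bar, clamped so it never overflows.
    filled = max(0, min(bar_width, degrees * bar_width // max_temp))
    return {
        "icon": ICONS.get(condition.strip().lower(), "unknown.png"),
        "degrees": degrees,
        "bar": "#" * filled + "-" * (bar_width - filled),
    }


forecast = parse_forecast("rain, 25")
print(forecast["icon"], forecast["degrees"], forecast["bar"])
```

A real display would replace the text bar with a graphic, but the parsing step stays the same.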

Screen Scraping

This is the process whereby a web page is downloaded by a command-line tool, such as wget or cURL, and then processed by an HTML parser so that individual elements can be read and extracted from it. This is the most legally suspect and most troublesome method of processing information.

It is the most suspect because you are downloading copyrighted content from a site in a manner that is typically against the site's terms and conditions; so much so that, until fairly recently, one famous weather site labeled its images as please_dont_scrape_this_use_the_api.gif!

Scraping is troublesome because it is very difficult to accurately parse a web page for content. It is very easy to parse the page on a technical level because the language is computer-based, and parsers already exist. It is also very easy for a user to parse the rendered page for the data, because the human eye will naturally seek out the information it desires. But knowing that the information is in the top-left corner of the screen is a very difficult thing for a machine to assess. Instead, most scrapers work on a principle of blocking: the information is known to exist in a particular block, determined beforehand by a programmer, and the parser blindly copies data from that block. For example, it will go to the web page, find the third table, look in the fifth column and second row, and read the data from the first paragraph tag. The block's location is time-consuming to determine, but once it is known the data is easy to extract. The approach is troublesome because any breakage in the HTML format itself (either introduced intentionally by the developers or introduced accidentally because of changes in advertising [2]) will require the script to be modified or rewritten.
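The blocking principle can be sketched in a few lines. The example below uses Python's standard html.parser module rather than the Perl modules discussed in this chapter, purely so that it runs without extra installation. The names BlockScraper, scrape_block, build_table, and SAMPLE_PAGE are hypothetical, the sample page is invented, and the sketch assumes tables are not nested.

```python
from html.parser import HTMLParser


class BlockScraper(HTMLParser):
    """Copy the first <p> found in one fixed table cell (1-based position)."""

    def __init__(self, table=3, row=2, column=5):
        super().__init__()
        self.target = (table, row, column)
        self.table = self.row = self.column = 0
        self.in_cell = False      # currently inside a <td> or <th>
        self.capturing = False    # currently inside the paragraph being copied
        self.result = None        # stays None if the block is never found
        self._parts = []

    def _finish(self):
        # Close the captured paragraph; later paragraphs are ignored.
        if self.capturing:
            self.capturing = False
            self.result = " ".join("".join(self._parts).split())

    def handle_starttag(self, tag, attrs):
        if tag in ("table", "tr", "td", "th", "p"):
            self._finish()        # tolerate a missing </p>
        if tag == "table":
            self.table += 1
            self.row = self.column = 0
        elif tag == "tr":
            self.row += 1
            self.column = 0
        elif tag in ("td", "th"):
            self.column += 1
            self.in_cell = True
        elif tag == "p" and self.result is None and self.in_cell:
            self.capturing = (self.table, self.row, self.column) == self.target

    def handle_endtag(self, tag):
        if tag in ("p", "td", "th", "tr", "table"):
            self._finish()
        if tag in ("td", "th"):
            self.in_cell = False

    def handle_data(self, data):
        if self.capturing:
            self._parts.append(data)


def scrape_block(html, **position):
    """Return the text of the chosen block, or fail loudly if it is missing."""
    scraper = BlockScraper(**position)
    scraper.feed(html)
    scraper.close()
    scraper._finish()
    if scraper.result is None:
        raise LookupError("block not found: the page layout may have changed")
    return scraper.result


def build_table(rows):
    # Helper that builds an invented sample table, two paragraphs per cell.
    cells = lambda row: "".join(f"<td><p>{c}</p><p>extra</p></td>" for c in row)
    return "<table>" + "".join(f"<tr>{cells(r)}</tr>" for r in rows) + "</table>"


SAMPLE_PAGE = (
    build_table([["ad", "banner"]])
    + build_table([["menu"]])
    + build_table([["Mon", "Tue", "Wed", "Thu", "Fri"],
                   ["sun, 21", "rain, 19", "rain, 18", "cloud, 20", "rain, 25"]])
)

print(scrape_block(SAMPLE_PAGE))
```

Prepending one more table to SAMPLE_PAGE (a new advert, say) shifts the table count, and scrape_block raises LookupError instead of returning the wrong cell. Failing loudly like this is a design choice for the sketch: a scraper that silently copies from the wrong block is harder to notice than one that stops.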

Because of the number of different languages and libraries available to the would-be screen-scraper, and the effectively unlimited number of (as yet undetermined) formats into which you'd like to convert the data, there isn't really a database of known web sites with matching scraping code. To compile such a database would be a massive undertaking. However, if you're unable to program suitable scraping code, it might be best to seek out local groups or those communities based around the web site in question, such as TV fan pages. Any home will generally have a large number of data sources, and trying to maintain scrapers for each source will be time-consuming if you attempt it alone.

The mechanics of scraping are best explained with an example. In this case, I'll use Perl and the WWW::Mechanize and HTML::TokeParser modules. Begin by installing them in any way suitable for your distribution. I personally use the CPAN module, which generally autoconfigures itself on invocation of the cpan command. Additional mirrors can be appended to the URL list from within the cpan shell like this:

o conf urllist push ftp://ftp-mirror.internap.com/pub/CPAN/
o conf commit

This is then followed by the installation of the modules themselves:

perl -MCPAN -e 'install WWW::Mechanize'
perl -MCPAN -e 'install HTML::TokeParser'

[2] And although the Web exists as a free resource for information, someone will be paying for advertising space to offset the production costs.
