In some areas, it is even legally questionable whether you are allowed to provide tools that improve or change the
format of existing (copyrighted) data. Fortunately, most companies turn a blind eye to this, as they do to the internal
distribution of data to members of your household (not that they'd know, or be able to prove it, if you did).
The larger issue has to do with improvements to the data, as most data is either too raw or too complex to be
useful. Let's take a web site containing the weather forecast as an example; the raw data might include only the string
"rain, 25", which would need to be parsed into a nice icon and a temperature bar to be user-friendly. At the other
extreme, a complex report could present a friendly set of graphics on the original site but make the underlying data
set unavailable to anyone who either tries to load the report from another site through deep linking or tries to
reference the source table data used to build the image.
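As a minimal sketch of what "improving" the raw string means, the following Python turns "rain, 25" into the pieces a
display would need. The icon file names, the parse_forecast function, and the one-mark-per-5-degrees text bar are all
assumptions made for illustration; none of them comes from a real weather site.

```python
# Hypothetical icon file names; substitute whatever artwork you have.
ICONS = {"rain": "rain.png", "sun": "sun.png", "cloud": "cloud.png"}


def parse_forecast(raw):
    """Split a raw 'condition, temperature' string into display-ready fields."""
    condition, _, temperature = raw.partition(",")
    condition = condition.strip().lower()
    degrees = int(temperature)  # raises ValueError if the number is missing
    return {
        "condition": condition,
        "icon": ICONS.get(condition, "unknown.png"),
        "temperature": degrees,
        # Crude text "temperature bar": one mark per 5 degrees, never negative.
        "bar": "#" * max(degrees // 5, 0),
    }


print(parse_forecast("rain, 25"))
```

An unrecognized condition falls back to a placeholder icon rather than failing, since the wording of raw data tends to
vary more than its layout.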
Screen Scraping
This is the process whereby a web page is downloaded by a command-line tool, such as wget or cURL, and then
processed by an HTML parser so that individual elements can be read and extracted from it. This is the most legally
suspect and most troublesome method of processing information.
It is the most suspect because you are downloading copyrighted content from a site in a manner that is against
the site's terms and conditions. So much so that, until fairly recently, one famous weather site labeled its images as
please_dont_scrape_this_use_the_api.gif!
Scraping is troublesome because it is very difficult to accurately parse a web page for content. It is very easy to
parse the page on a technical level because HTML is a machine-readable language, and parsers already exist. It is also
very easy for a user to pick the data out of the rendered page, because the human eye will naturally seek out the
information it desires. But knowing that the information is in the top-left corner of the screen is a very difficult
thing for a machine to assess. Instead, most scrapers work on a principle of blocking: the information is known to
exist in a particular block, determined beforehand by a programmer, and the parser blindly copies data from that block.
For example, it will go to the web page, find the third table, look in the fifth column and second row, and read the
data from the first paragraph tag. The location is time-consuming to determine but easy to extract once known. It is
troublesome because any change to the HTML layout (either introduced intentionally by the developers or introduced
accidentally because of changes in advertising²) will require the script to be modified or rewritten.
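The blocking approach can be sketched as follows. This is an illustration only, written in Python with the standard
html.parser module rather than the Perl modules used later in this section, and it assumes a page with no nested
tables. The BlockScraper class, the scrape_block function, and the SAMPLE page are all invented for the example.

```python
from html.parser import HTMLParser


class BlockScraper(HTMLParser):
    """Blindly copy the first <p> found in one fixed table cell (1-based)."""

    def __init__(self, table_no, row_no, col_no):
        super().__init__()
        self.target = (table_no, row_no, col_no)
        self.table = self.row = self.col = 0
        self.in_cell = False      # currently between <td>/<th> and its close
        self.capturing = False    # currently inside the wanted <p>
        self.done = False         # wanted <p> has already been copied
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table += 1
            self.row = self.col = 0
        elif tag == "tr":
            self.row += 1
            self.col = 0
        elif tag in ("td", "th"):
            self.col += 1
            self.in_cell = True
        elif (tag == "p" and self.in_cell and not self.done
              and (self.table, self.row, self.col) == self.target):
            self.capturing = True

    def handle_endtag(self, tag):
        if tag in ("p", "td", "th", "tr", "table") and self.capturing:
            self.capturing = False
            self.done = True
        if tag in ("td", "th", "tr", "table"):
            self.in_cell = False

    def handle_data(self, data):
        if self.capturing:
            self.parts.append(data)


def scrape_block(html, table_no=3, row_no=2, col_no=5):
    """Return the text at the fixed position, or None if it is not there."""
    parser = BlockScraper(table_no, row_no, col_no)
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split()) if parser.done else None


# Invented page: two layout tables, then a forecast table of 2 rows x 5 columns.
SAMPLE = (
    "<table><tr><td><p>advert</p></td></tr></table>"
    "<table><tr><td><p>menu</p></td></tr></table>"
    "<table>"
    "<tr><th>Mon</th><th>Tue</th><th>Wed</th><th>Thu</th><th>Fri</th></tr>"
    "<tr><td><p>sun</p></td><td><p>sun</p></td><td><p>cloud</p></td>"
    "<td><p>cloud</p></td><td><p>rain, 25</p><p>updated 09:00</p></td></tr>"
    "</table>"
)

print(scrape_block(SAMPLE))  # prints: rain, 25
```

The fragility is easy to demonstrate: if the site adds one more table ahead of the forecast, the same call returns
None, and the script has to be told to look in the fourth table instead.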
Because of the number of different languages and libraries available to the would-be screen-scraper and
the infinite number of (as yet undetermined) formats into which you'd like to convert the data, there isn't really
a database of known web sites with matching scraping code. To compile such a database would be a massive
undertaking. However, if you're unable to program suitable scraping code, it might be best to seek out local groups
or those communities based around the web site in question, such as TV fan pages. Any home will generally have a
large number of data sources, and trying to maintain scrapers for each source will be time-consuming if you attempt
it alone.
The mechanics of scraping are best explained with an example. In this case, I'll use Perl and the WWW::Mechanize
and HTML::TokeParser modules. Begin by installing them in any way suitable for your distribution. I personally use
the CPAN module, which generally autoconfigures itself on invocation of the cpan command. Additional mirrors can
be appended to the URL list from within the cpan shell, like this:
o conf urllist push ftp://ftp-mirror.internap.com/pub/CPAN/
o conf commit
This is then followed by the installation of the modules themselves:
perl -MCPAN -e 'install WWW::Mechanize'
perl -MCPAN -e 'install HTML::TokeParser'
² And although the Web exists as a free resource for information, someone will be paying for advertising space to offset the production costs.