requirements. Data collection by scraping requires a large
amount of modification, whereas data structuring is
already partially achieved in the case of RSS feeds and APIs,
since these aim to facilitate exchanges among applications.
4.1.3.1. Web scraping
The mashup model started with Paul Rademacher's
hack (see section 3.1.2), which reverse-engineered the
Google Maps application to extract the map tiles and overlaid
flat-rental advertisements from the Craigslist Website.
The latter information was extracted by Web scraping, which
involves automatically extracting online content with the
help of a script. The script works in two stages: first, it
decomposes the text and identifies the desired element in the
HTML page, a step known as parsing; second, it stores the
extracted elements in a database, which facilitates their processing.
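The two stages described above can be sketched in a few lines of Python using the standard library's HTML parser. The tag and class names here are illustrative assumptions, not taken from any actual Website:

```python
from html.parser import HTMLParser

# Minimal scraping sketch (assumed markup: listings live in <span class="listing">).
# Stage 1: parse the HTML and identify the desired elements.
# Stage 2: collect them into a structured record set for further processing.
class ListingScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_listing = False
        self.records = []  # stage 2: the "database" of extracted items

    def handle_starttag(self, tag, attrs):
        # stage 1: spot the element we want
        if tag == "span" and ("class", "listing") in attrs:
            self.in_listing = True

    def handle_data(self, data):
        if self.in_listing:
            self.records.append({"title": data.strip()})

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_listing = False

html = '<ul><li><span class="listing">2BR flat, Mission District</span></li></ul>'
scraper = ListingScraper()
scraper.feed(html)
print(scraper.records)  # [{'title': '2BR flat, Mission District'}]
```

In practice the HTML would be fetched over the network and the records written to persistent storage, but the parse-then-structure split stays the same.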
Contrary to APIs, which provide a legal framework and data
structuring, scraping is a less "polite" version of data
extraction [MAR 13], as it sometimes bypasses copyright laws
and users' terms and conditions for the Websites in question 13 .
Michael Young gets data for his map by scraping news
briefs of the Associated Press news agency (Table 4.1, map
no. 12). He filters the Website's RSS feeds and extracts the
beginning of each brief, which usually starts by stating the
location from which the information comes:
AP puts a location at the beginning of each story,
for example 'NEW YORK (AP) Millions of New
Yorkers…', which is typically city, state or just a
13 Scraping has numerous legal implications: acceptable use of a Website
(in this case, the number of times a scraper can visit a given Website)
is defined by its terms and conditions. For more information, please
refer to the interview of Dick Hall, manager of Infochimps, by Audrey
Watters for O'Reilly Radar's blog:
http://radar.oreilly.com/2011/05/data-scraping-infochimps.html.
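The dateline extraction Young describes can be sketched with a regular expression. The pattern below is a hypothetical illustration built from the 'NEW YORK (AP)' convention quoted above, not Young's actual code:

```python
import re

# Match an AP dateline: an all-caps location followed by "(AP)",
# e.g. "NEW YORK (AP) ..." or "PARIS, TEXAS (AP) ...".
DATELINE = re.compile(r"^([A-Z][A-Z .,'-]+?)\s*\(AP\)")

def extract_location(brief):
    """Return the location prefix of an AP news brief, or None."""
    m = DATELINE.match(brief)
    return m.group(1).strip() if m else None

print(extract_location("NEW YORK (AP) Millions of New Yorkers..."))
# NEW YORK
```

The extracted string can then be passed to a geocoder to place each brief on the map.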