Tezuka and Tanaka [47] explicitly aim at extracting landmark information
from texts taken from the World Wide Web. They compared several statistical
and linguistic measures for calculating the landmarkness of a geographic object
mentioned in one or several documents of a given corpus. These measures are:
1. Document frequency: In how many documents does a term (a reference to an
object) appear;
2. Regional co-occurrence summation: How often does an object appear with its
surrounding objects in a document;
3. Regional co-occurrence variation: With how many different objects does an
object co-occur in documents;
4. Spatial sentence frequency: Using spatial trigger phrases, how often is an object
used in a spatial sentence;
5. Case frequency: What grammatical structure is used to refer to an object.
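As a rough illustration of the first three measures, the following Python sketch counts them on a toy corpus. It is a simplified reading of the descriptions above, not the authors' implementation: each document is reduced to the set of object names it mentions, and the `neighbors` set (the objects assumed to surround the object of interest) as well as all function names and the sample data are hypothetical.

```python
def document_frequency(term, docs):
    """Number of documents in which the term appears."""
    return sum(1 for doc in docs if term in doc)


def cooccurrence_summation(term, neighbors, docs):
    """Total count of (document, neighbor) pairs in which the term
    appears together with one of its surrounding objects."""
    return sum(1 for doc in docs if term in doc
               for n in neighbors if n in doc)


def cooccurrence_variation(term, neighbors, docs):
    """Number of distinct surrounding objects the term co-occurs with."""
    return len({n for doc in docs if term in doc
                for n in neighbors if n in doc})


# Toy corpus (invented): each document is the set of objects it mentions.
docs = [
    {"station", "cafe", "park"},
    {"station", "cafe"},
    {"station", "museum"},
    {"park"},
]
# Assumed spatial neighbors of "station".
neighbors = {"cafe", "park", "museum"}

df = document_frequency("station", docs)
rcs = cooccurrence_summation("station", neighbors, docs)
rcv = cooccurrence_variation("station", neighbors, docs)
print(df, rcs, rcv)
```

In this toy corpus "station" appears in three documents, co-occurs with a neighbor four times in total, and co-occurs with three distinct neighbors.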
Their results show that the measures with spatial context (2-4) match a
human-judged set of landmarks better than those without spatial context (1, 5). If only
the most prominent landmarks are of interest, regional co-occurrence summation is a
useful measure, as it has high precision (the ratio of correctly retrieved objects to all
retrieved objects) at low recall (the ratio of correctly retrieved objects to all existing
correct objects). If a large set of landmark candidates is desired, spatial sentence frequency
yields the best results, with relatively high precision in high-recall situations, i.e.,
many landmark candidates are retrieved, of which many are correct.
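The two ratios defined above can be made concrete with a small Python sketch. The function name and the example sets are invented for illustration and are not data from the study.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved objects
    against a set of correct (human-judged) objects."""
    retrieved, relevant = set(retrieved), set(relevant)
    correct = retrieved & relevant
    precision = len(correct) / len(retrieved) if retrieved else 0.0
    recall = len(correct) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four candidates retrieved, five true landmarks.
retrieved = {"tower", "station", "kiosk", "bridge"}
relevant = {"tower", "station", "bridge", "castle", "museum"}

precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # 0.75 0.6
```

Here three of the four retrieved objects are correct (precision 0.75), but only three of the five true landmarks were found (recall 0.6).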
Turning to more visual sources, several approaches use tag-based descriptions,
photographs, or a combination of both to determine points of interest (POIs) or specific
relevant regions. Mummidi and Krumm [30] used pushpins on Microsoft's Bing
Maps to find POIs that are not already contained in the underlying database. Each
pushpin has a known position (coordinate) and an associated title and textual
description. Pushpins are clustered based on their latitude and longitude, using
a hierarchical agglomerative clustering technique [10]. This technique
starts out with each pushpin as its own cluster and then iteratively merges the two closest
clusters until only one cluster is left. When two clusters are merged, the position of
the resulting cluster is the centroid of all contained pushpins. Figure 5.5 illustrates
this idea further.
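The clustering procedure described above can be sketched in a few lines of Python. This is a naive illustration under simplifying assumptions, not the authors' code: latitude and longitude are treated as planar coordinates with Euclidean distance, cluster distance is measured between centroids, and the function names and sample pushpins are hypothetical.

```python
import math


def centroid(points):
    """Mean position of a list of (lat, lon) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def agglomerate(pushpins):
    """Start with one cluster per pushpin and repeatedly merge the two
    clusters whose centroids are closest, until one cluster remains.
    Returns the list of merged clusters in the order they were formed."""
    clusters = [[p] for p in pushpins]
    merges = []
    while len(clusters) > 1:
        best = None  # (distance, index_i, index_j) of the closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        merges.append(merged)
    return merges


# Three invented pushpins: two close together, one far away.
pins = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0)]
merges = agglomerate(pins)
print(merges[0], centroid(merges[0]))
```

The two nearby pushpins are merged first, and the resulting cluster is positioned at their centroid (0.0, 0.5). The final merge produces the single cluster containing all pushpins. Each intermediate cluster in `merges` is a candidate that can then be checked for whether it describes a POI.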
To determine whether a given cluster actually describes a POI, the authors
make use of n-grams of the pushpins' descriptions. An n-gram is a phrase with
n words in it. In this case it is a subsequence of n consecutive words from a
description. For example, in the description 'my favorite pizza place' all valid
2-grams (called bigrams) would be 'my favorite', 'favorite pizza', and 'pizza place'
(but not 'my place', because these are not consecutive words in the description).
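A minimal Python sketch of n-gram extraction, assuming that words are simply separated by whitespace (the function name is illustrative):

```python
def ngrams(text, n):
    """Return all subsequences of n consecutive words in text."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


bigrams = ngrams("my favorite pizza place", 2)
print(bigrams)  # ['my favorite', 'favorite pizza', 'pizza place']
```

A description shorter than n words yields an empty list, since it contains no run of n consecutive words.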
The main measure the authors use to identify useful clusters is term frequency
inverse document frequency (TFIDF). This measure compares the number of times a
specific n-gram appears in a cluster ('term frequency') with the number of times the
same n-gram appears in all clusters combined ('document frequency'). Dividing the
 