Computational Aspects: How Landmarks Can Be Observed, Stored, and Analysed - Landmarks: GIScience for Intelligent Services

Tezuka and Tanaka [47] explicitly aim at extracting landmark information

from texts taken from the World Wide Web. They compared several statistical

and linguistic measures to calculate the landmarkness of some geographic object

mentioned in one or several documents of a given corpus. These measures are:

1. Document frequency: In how many documents does a term (a reference to an object) appear?

2. Regional co-occurrence summation: How often does an object appear with its surrounding objects in a document?

3. Regional co-occurrence variation: With how many different objects does an object co-occur in documents?

4. Spatial sentence frequency: Using spatial trigger phrases, how often is an object used in a spatial sentence?

5. Case frequency: What grammatical structure is used to refer to an object?
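
As a rough illustration of measures 1 and 3, the following Python sketch counts document frequency and co-occurrence variation over a toy corpus. It is a simplification under stated assumptions, not Tezuka and Tanaka's implementation: each document is assumed to be already reduced to the list of object names it mentions, the restriction to spatially surrounding objects is omitted, and the function names and the corpus are hypothetical.

```python
from collections import defaultdict


def document_frequency(docs):
    """For each object name, count the documents that mention it."""
    df = defaultdict(int)
    for mentioned in docs:
        for name in set(mentioned):  # count each document at most once
            df[name] += 1
    return dict(df)


def cooccurrence_variation(docs):
    """For each object, count the distinct other objects sharing a document with it."""
    partners = defaultdict(set)
    for mentioned in docs:
        names = set(mentioned)
        for name in names:
            partners[name] |= names - {name}
    return {name: len(others) for name, others in partners.items()}


# Hypothetical toy corpus: one list of mentioned objects per document.
docs = [
    ["station", "tower", "park"],
    ["station", "museum"],
    ["tower", "station"],
]

print(document_frequency(docs))
print(cooccurrence_variation(docs))
```

In this toy corpus, "station" appears in all three documents and co-occurs with three different objects, so it scores highest on both measures.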

Tezuka and Tanaka's results show that the measures with spatial context (2-4) match a

human-judged set of landmarks better than those without spatial context (1, 5). If only

the most prominent landmarks are of interest, regional co-occurrence summation is a

useful measure, as it has high precision (the ratio of correctly retrieved objects to all

retrieved objects) for low recall (the ratio of correctly retrieved objects to all existing

correct objects). If a large set of landmark candidates is desired, spatial sentence frequency

yields the best results, with relatively high precision even at high recall, i.e.,

many landmark candidates are retrieved, of which many are correct.
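
The two ratios defined above can be made concrete with a small Python sketch. The ranked candidate list and the human-judged landmark set are invented for illustration only; `precision_recall` is a hypothetical helper name, and evaluating the top k entries of a ranking is just one common way to trade precision against recall.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a reference set."""
    retrieved, relevant = set(retrieved), set(relevant)
    correct = retrieved & relevant
    precision = len(correct) / len(retrieved) if retrieved else 0.0
    recall = len(correct) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: a human-judged landmark set and a ranking produced by some measure.
human_landmarks = {"station", "tower", "museum", "bridge"}
ranked = ["station", "tower", "cafe", "museum", "kiosk", "bridge"]

for k in (1, 3, 6):
    p, r = precision_recall(ranked[:k], human_landmarks)
    print(f"top {k}: precision={p:.2f} recall={r:.2f}")
```

Keeping only the top candidate gives a precision of 1.0 at a recall of 0.25, whereas keeping all six candidates reaches a recall of 1.0 with lower precision.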

Turning to more visual sources, several approaches use tag-based descriptions,

photographs, or a combination of both to determine points of interest (POIs) or specific

relevant regions. Mummidi and Krumm [30] used pushpins on Microsoft's Bing

Maps to find POIs that are not already contained in the underlying database. Each

pushpin has a known position (coordinate), and an associated title and textual

description. Pushpins are clustered based on their latitude and longitude, using

a hierarchical agglomerative clustering technique [10]. This clustering technique

starts out with each pushpin as its own cluster, and then iteratively combines the two closest

clusters until only one cluster is left. When combining two clusters, the position of

the emerging cluster is the centroid of all contained pushpins. Figure 5.5 illustrates

this idea further.
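
The clustering procedure described above can be sketched in a few lines of Python. This is a minimal, unoptimised illustration rather than the authors' implementation: it assumes that latitude and longitude can be treated as planar coordinates with Euclidean distance, that cluster distance is measured between centroids, and the names `centroid` and `agglomerate` as well as the sample pushpins are hypothetical.

```python
import math


def centroid(points):
    """Mean position of all pushpins in a cluster."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def agglomerate(pushpins):
    """Merge the two closest clusters repeatedly until one cluster is left.

    Returns the merge history as a list of (distance, merged_cluster) pairs.
    """
    clusters = [[p] for p in pushpins]  # every pushpin starts as its own cluster
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        history.append((d, merged))
        # Replace the two merged clusters by their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history


# Hypothetical pushpin coordinates (latitude, longitude).
pushpins = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
for distance, cluster in agglomerate(pushpins):
    print(round(distance, 3), cluster)
```

The merge history corresponds to the cluster hierarchy: cutting it at a chosen distance yields candidate clusters at that spatial scale.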

To figure out whether a given cluster actually describes a POI, the authors

make use of n-grams of the pushpins' descriptions. An n-gram is a phrase with

n words in it. In this case it is a subsequence of n consecutive words from each

description. For example, in the description 'my favorite pizza place' all valid 2-

grams (called bigrams) would be 'my favorite', 'favorite pizza', and 'pizza place'

(but not 'my place', because these are not consecutive words in the description).
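
The n-gram definition above translates directly into code. The following Python sketch assumes simple whitespace tokenisation, which the text does not specify; `ngrams` is a hypothetical helper name.

```python
def ngrams(description, n):
    """Return all subsequences of n consecutive words in a description."""
    words = description.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


print(ngrams("my favorite pizza place", 2))
# ['my favorite', 'favorite pizza', 'pizza place']
```

Because only consecutive words are joined, 'my place' never appears in the output.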

The main measure the authors use to identify useful clusters is term frequency-inverse

document frequency (TFIDF). This measure compares the number of times a

specific n-gram appears in a cluster ('term frequency') with the number of times the

same n-gram appears in all clusters combined ('document frequency'). Dividing the
