5.4 Text Clustering
To understand the observed temporal patterns, we extracted tags from photos taken
in places A, B, C, D, and E and created two model representations. In the first
representation, each photo was treated as a separate document (all tags of a photo
were saved as one document). In the second, the owner of photos in a cluster was
treated as a document, so all unique tags from that owner's photos were collected
and saved as one document. We applied two clustering algorithms to these models
(the operation took about 1 s per algorithm): Lingo (Osiński et al. 2004; Osiński
and Weiss 2004) and suffix tree clustering (STC) with default parameters (both are
part of the Carrot2 workbench, http://project.carrot2.org). These algorithms use
different clustering approaches [a term-document matrix (Lingo) versus a suffix
tree (STC)] and produce different cluster quality [high cluster diversity (Lingo)
versus low cluster diversity (STC)]. Both, however, create overlapping cluster
categories.
This is an advantage over the methods for automatic representative tag and event
extraction proposed in the literature (see Sect. 2.2), since a photo can have
several tags that describe different categories, such as trees, sun, and summer.
Beyond understanding the observed temporal patterns, our goal is to show how
results may differ due to model representations, clustering algorithms, language
differences, or mistakes made during tagging, and to stress the importance of
visual analytics. Tables 2 and 3 present the ten most frequent categories
extracted from regions A and B using the two model representations (owners and
photos) and the two clustering algorithms applied to them (Lingo and STC). The
number of occurrences of each category in the documents is given in parentheses
to the right of the category. The tag syntax is preserved.
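As a minimal sketch of the two model representations described above, the
following Python snippet builds one document per photo and one document per
owner. The record layout, the names photos, photo_documents, and
owner_documents, and the sample tags are assumptions made for illustration; they
are not the data structures or data used in the study.

```python
from collections import defaultdict

# Hypothetical photo records: (photo_id, owner_id, tags).
# The values are illustrative only and are not taken from the study.
photos = [
    ("p1", "alice", ["balloon", "snow", "suisse"]),
    ("p2", "alice", ["balloon", "festival"]),
    ("p3", "bob", ["ballon", "schweiz"]),
]


def photo_documents(records):
    """Photo model: one document per photo, holding all tags of that photo."""
    return {photo_id: " ".join(tags) for photo_id, _owner, tags in records}


def owner_documents(records):
    """Owner model: one document per owner, holding the owner's unique tags."""
    tags_by_owner = defaultdict(list)
    for _photo_id, owner, tags in records:
        for tag in tags:
            # Keep first-seen order and skip tags already collected.
            if tag not in tags_by_owner[owner]:
                tags_by_owner[owner].append(tag)
    return {owner: " ".join(tags) for owner, tags in tags_by_owner.items()}


photo_docs = photo_documents(photos)
owner_docs = owner_documents(photos)
```

In this sketch the owner model merges the repeated tag balloon from two photos
of the same owner into a single entry, which is one reason category counts can
differ between the two representations.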
Let us inspect the obtained cluster categories. A quick look at the categories
suggests that people use four languages to tag their photos (Table 2, Lingo owner):
English (Snow, 9), Spanish (Suiza, 10), French (Suisse Vaud, 1), and German
(Schweiz, 3). At least three different contexts can be extracted from the categories:
places (Vaud, a Swiss canton; Gstaad, a small village; Chateux Doex, a
municipality), events (balloon, Montgolfiere, festival), and season (Snow).
Balloon is the most frequently used term, but variants such as hotairabolloon,
ballon, and ballons also occur and are treated as different entities by the
clustering algorithms.
Similarly, the categories of region B are expressed in different languages
(Table 3, Lingo owner): German (Autofriedhof, 1), English (Carwreck, 8), and French
(Suisse, 5, Lingo photos). Several contexts can be extracted: places (Gürbetal,
Bern, Kaufdorf), cars (Volkswagen Beetle, Ford Zephyr, VW, Fiat), the state of
objects (Abandoned, Cemetery, Carwreck, Rost, Old, Oldtimer, Junkyard, Scrapyard),
and nature (Forest).
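The effect of spelling variants on category counts can be illustrated with a
short Python sketch. It counts, for each tag, the number of documents that
contain it, using exact string matching, so balloon, ballon, and ballons are
counted separately. The sample documents are hypothetical, and the final step,
which flags near-duplicate spellings with the standard-library difflib module,
is an assumption added for illustration; it is not part of the Lingo or STC
algorithms and was not used in the study.

```python
from collections import Counter
from difflib import get_close_matches

# Hypothetical tag documents (one string per document); illustrative only.
documents = [
    "balloon snow",
    "ballon festival",
    "ballons balloon",
    "hotairabolloon",
]

# Document frequency: in how many documents does each tag occur?
doc_frequency = Counter()
for doc in documents:
    doc_frequency.update(set(doc.split()))  # set() counts a tag once per document

# Exact matching keeps every spelling as its own entry.
separate_entries = {tag: n for tag, n in doc_frequency.items() if "ball" in tag}

# Assumed extra step: list other tags whose spelling is close to "balloon".
candidates = [tag for tag in doc_frequency if tag != "balloon"]
variants = get_close_matches("balloon", candidates, n=5, cutoff=0.8)
```

Flagged variants could then be reviewed by hand before merging, since a
similarity cutoff alone cannot tell a misspelling from a legitimate word in
another language (ballon is also the French and German spelling).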