thus not limiting the cluster extension or its shape and, in some cases, putting
together pairs of very dissimilar objects. Also, objects that cannot be linked
to any cluster are labeled as noise and removed.
How does one choose the appropriate clustering method? While no strict rule
exists, a general guideline is to pay attention to some basic characteristics
of the data and the expected characteristics of the output. For instance, if we
expect that our data should form compact clusters of spherical shapes (i.e.,
they should agglomerate around some centers of attraction), then k -means is
a good candidate, especially if the data set is large - k -means is known to
be very efficient. However, the user must know the number k of clusters to
be found in the data, or at least have a reasonable guess. That requirement can
be avoided with hierarchical, agglomerative algorithms, since the dendrograms they produce
synthesize the results that can be obtained for all possible values of k , from 1
to N (the number of input objects). The choice of the most appropriate k can
thus be postponed until after the computation, supported by an examination of
the dendrogram. However, hierarchical clustering is usually expensive (efficient
variants exist, yet these introduce other factors to be evaluated), so it is not a
good option with large data sets. Finally, density-based methods apparently do
not suffer from any of the issues mentioned above, and are also more robust to noisy
data, yet the resulting clusters will usually have an arbitrary shape and size -
a feature that might be unacceptable in some contexts, and extremely useful in
others.
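The trade-off described above can be sketched with a minimal, naive implementation of k-means (Lloyd's algorithm) in Python. This is an illustrative sketch, not the method of any particular library: note that the number of clusters k must be supplied up front, and that the assignment step performed in each iteration is precisely a 1-nearest-neighbor query against the current centers.

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Naive Lloyd's k-means. The user must supply k up front.
    Each iteration assigns every point to its nearest center
    (a 1-nearest-neighbor query) and then recomputes the centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assignment step: 1-NN query against the current centers.
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        for i, cl in enumerate(clusters):
            if cl:  # recompute each center as the mean of its cluster
                centers[i] = tuple(sum(v) / len(cl) for v in zip(*cl))
    return centers, clusters

# Two compact, well-separated blobs: a favorable case for k-means.
blob_a = [(0.0 + dx, 0.0 + dy) for dx in (0, 1) for dy in (0, 1)]
blob_b = [(10.0 + dx, 10.0 + dy) for dx in (0, 1) for dy in (0, 1)]
centers, clusters = kmeans(blob_a + blob_b, k=2)
```

On data like this, which agglomerates around two centers of attraction, the algorithm quickly separates the two blobs; on clusters of arbitrary shape, however, a density-based method would typically be the better fit.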
Depending on the analysis task that the user wants to perform, once the
clustering schema to be adopted has been selected, he or she needs to choose
the most appropriate similarity function, that is, the numerical measure that
quantifies how similar two trajectories are. The range of possible choices
is virtually unlimited. The examples that can be found in the literature include
the following, approximately sorted in increasing order of complexity 6 :
Spatial starts, ends, or both: Two trajectories are compared based only on
their starting points (the origin of the trip), the ending points (the final des-
tination of the trip), or a combination of them. The distance between the
trajectories, then, reduces to the spatial distance between two points. When
both starts and ends are considered, the sum or average of their respective
distances is computed. The output of a clustering based on these distances
will generally put together trajectories that start or end in similar places,
regardless of when they do start/end and what happens in the rest of the
trajectory.
6 Notice that distance computation underlies classical database queries such as range queries
and k -nearest neighbors (see Chapter 3 ). Indeed, k -means involves a 1-nearest neighbor query in
the cluster assignment step, while density-based methods execute a range query to compute the
neighborhood of each point.
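The endpoint-based distance described above can be sketched in a few lines of Python. This is a hypothetical helper (the function name and `mode` parameter are our own), assuming a trajectory is represented as a time-ordered list of (x, y) points.

```python
import math

def start_end_distance(traj_a, traj_b, mode="both"):
    """Compare two trajectories (time-ordered lists of (x, y) points)
    using only their endpoints. With mode="both", the average of the
    start-to-start and end-to-end Euclidean distances is returned."""
    d_start = math.dist(traj_a[0], traj_b[0])
    d_end = math.dist(traj_a[-1], traj_b[-1])
    if mode == "starts":
        return d_start
    if mode == "ends":
        return d_end
    return (d_start + d_end) / 2  # mode == "both"

# Two trips that begin near each other but end far apart: a
# start-based clustering would group them, an end-based one would not.
t1 = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
t2 = [(0.0, 1.0), (5.0, 5.0), (8.0, 2.0)]
```

Note that everything between the endpoints is ignored, which is exactly why this distance groups trajectories by origin and/or destination regardless of their intermediate route or timing.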