thus not limiting the cluster extension or its shape and, in some cases, putting
together pairs of very dissimilar objects. Also, objects that cannot be linked
to any cluster are labeled as noise and removed.
How does one choose the appropriate clustering method? While no strict rule
exists, a general guideline is to pay attention to some basic characteristics
of the data and the expected characteristics of the output. For instance, if we
expect that our data should form compact clusters of spherical shapes (i.e.,
they should agglomerate around some centers of attraction), then k -means is
a good candidate, especially if the data set is large - k -means is known to
be very efficient. However, the user must know the number k of clusters to
be found in the data, or at least have a reasonable guess. That requirement can
be avoided with hierarchical, agglomerative algorithms, since the dendrograms they produce
synthesize the results that can be obtained for all possible values of k , from 1
to N (the number of input objects). The choice of the most appropriate k can
thus be postponed until after the computation, supported by an examination of
the dendrogram. However, hierarchical clustering is usually expensive (efficient
variants exist, yet these introduce other factors to be evaluated), so it is not a
good option with large data sets. Finally, density-based methods apparently do
not suffer from any of the issues mentioned above, and are also more robust to noisy
data, yet the resulting clusters will usually have an arbitrary shape and size -
a feature that might be unacceptable in some contexts, and extremely useful in
others.
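The trade-off described above can be sketched with a minimal, naive implementation of k-means (Lloyd's algorithm) in Python. This is an illustrative sketch, not the method of any particular library: note that the number of clusters k must be supplied up front, and that the assignment step performed in each iteration is precisely a 1-nearest-neighbor query against the current centers.

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Naive Lloyd's k-means. The user must supply k up front.
    Each iteration assigns every point to its nearest center
    (a 1-nearest-neighbor query) and then recomputes the centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assignment step: 1-NN query against the current centers.
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        for i, cl in enumerate(clusters):
            if cl:  # recompute each center as the mean of its cluster
                centers[i] = tuple(sum(v) / len(cl) for v in zip(*cl))
    return centers, clusters

# Two compact, well-separated blobs: a favorable case for k-means.
blob_a = [(0.0 + dx, 0.0 + dy) for dx in (0, 1) for dy in (0, 1)]
blob_b = [(10.0 + dx, 10.0 + dy) for dx in (0, 1) for dy in (0, 1)]
centers, clusters = kmeans(blob_a + blob_b, k=2)
```

On data like this, which agglomerates around two centers of attraction, the algorithm quickly separates the two blobs; on clusters of arbitrary shape, however, a density-based method would typically be the better fit.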
Depending on the analysis task that the user wants to perform, once the
clustering schema to be adopted has been selected, he or she needs to choose
the most appropriate similarity function, that is, the numerical measure that
quantifies how similar two trajectories are. The range of possible choices
is virtually unlimited. The examples that can be found in the literature include
the following, approximately sorted in increasing order of complexity 6 :
Spatial starts, ends, or both: Two trajectories are compared based only on
their starting points (the origin of the trip), the ending points (the final des-
tination of the trip), or a combination of them. The distance between the
trajectories, then, reduces to the spatial distance between two points. When
both starts and ends are considered, the sum or average of their respective
distances is computed. The output of a clustering based on these distances
will generally put together trajectories that start or end in similar places,
regardless of when they do start/end and what happens in the rest of the
trajectory.
6 Notice that distance computation underlies classical database queries such as range queries
and k -nearest neighbors (see Chapter 3 ). Indeed, k -means involves a 1-nearest neighbor query in
the cluster assignment step, while density-based methods execute a range query to compute the
neighborhood of each point.
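The endpoint-based distance described above can be sketched in a few lines of Python. This is a hypothetical helper (the function name and `mode` parameter are our own), assuming a trajectory is represented as a time-ordered list of (x, y) points.

```python
import math

def start_end_distance(traj_a, traj_b, mode="both"):
    """Compare two trajectories (time-ordered lists of (x, y) points)
    using only their endpoints. With mode="both", the average of the
    start-to-start and end-to-end Euclidean distances is returned."""
    d_start = math.dist(traj_a[0], traj_b[0])
    d_end = math.dist(traj_a[-1], traj_b[-1])
    if mode == "starts":
        return d_start
    if mode == "ends":
        return d_end
    return (d_start + d_end) / 2  # mode == "both"

# Two trips that begin near each other but end far apart: a
# start-based clustering would group them, an end-based one would not.
t1 = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
t2 = [(0.0, 1.0), (5.0, 5.0), (8.0, 2.0)]
```

Note that everything between the endpoints is ignored, which is exactly why this distance groups trajectories by origin and/or destination regardless of their intermediate route or timing.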