Clustering Moving Object Trajectories Using Coresets (WiMoA 2011 and ICCSEA 2011)

Abstract

Given a set of moving objects, we show the applicability of using coresets to perform fast approximate clustering. A trajectory coreset is simply a small weighted set of the trajectory segments that approximates the original trajectory. In this paper, we present an efficient algorithm that integrates coreset approximation with k-means clustering. Our methodology to build the trajectory coreset depends on using the trajectory segments movement direction. Using the movement direction feature of the segments we basically select the most influential segments to contribute in our coreset. The main strength of the algorithm is that it can quickly determine a clustering of a dataset for any number of clusters. In addition, to measure the quality of the resulting clustering, we use the silhouette coefficient. Finally, we present experimental results that show the efficiency of our proposed algorithm.

Keywords: Moving Object Database(MOD), clustering moving objects, and approximate clustering via coresets.

Introduction

Today portable devices as mobile phones, laptops, personal digital assistants (PDAs), have become ubiquitous. Along with to the rapid advances in RFID, satellites, sensors, wireless, and video technologies, moving object position information has become easier to acquire. On the hand, the adoption of Global Positioning Systems (GPSs) promotes many new applications. Today, mobility managers embed GPS in cars to better monitor and guide vehicles; and meteorologists use weather satellites and radars to observe hurricanes. Therefore, data mining on such huge data volumes became an essential and crucial task. Data mining techniques and specially clustering techniques play an important role in extracting useful knowledge from moving objects’ location information, and in discovering hidden patterns in their motion behaviors.

Consequently, a variety of disciplines including database research, market research, transport analysis, and animal behavior research show an increasing interest in movement patterns of moving objects [9][1]. For example, in transportation context, movement patterns could be used for traffic jams prediction, and in the case of moving animals movement patterns can be utilized to view spatio-temporal behaviors, e.g. seasonal animal migration.

Clustering is the computational task to group a given input into subsets of similar characteristics (clusters) [8]. The goal is that objects within a group are similar (close) to each other and different from objects in other groups. Clustering has many applications in different areas; it is a widely used technique in computer science with applications to unsupervised learning, classification, data mining, text mining, information retrieval, and pattern recognition. There is no general clustering algorithm that can be applied to all applications because usually clustering is problem dependent. So, there are many different clustering algorithms. K-means clustering algorithm is considered one of the basic proposed clustering algorithms [20,19]. It has been applied to a wide spectrum of research problems and has been used in several applications. Nevertheless, it has a strong weakness namely, the dependence on the initial number of centroids (k). In fact, there might be no definite or unique answer as to what value k should take for an optimal clustering. Thus, both the dependence on the chosen number of clusters (k) and the initial choice of the centroids affects both the performance and accuracy of the algorithm.

In this paper, we present a novel "approximate" clustering for moving objects. Our approach integrates the idea of approximation using coresets along with clustering using k-means technique. The main characteristic of our algorithm is its fast clustering time compared to traditional k-means clustering. The reason behind this feature is that being an approximate approach it does not need to examine all the trajectory segments, only the segments in the generated coreset are the ones that need to be examined. This reduction in the number of candidate segments reduces the overall clustering time. Nevertheless, clustering quality remains good. Using the silhouette coefficient [16] as a clustering quality measure we test the quality of the obtained clusters to ensure the efficiency of the proposed algorithm.

In the paper we use the concept of coresets to perform efficient approximate clustering for moving object trajectories. We define a trajectory coreset as a small weighted set of segments that approximates the original trajectory. This weighted set is the set of the most influential segments that characterize the motion direction of the trajectory over time. Our main contributions are:

- We present an efficient algorithm that builds a coreset for each trajectory using the repeated inter-segments’ directions. The result is a small weighted set of trajectory segments that efficiently approximates the whole original trajectory.

- We integrate coreset approximation approach together with k-means clustering technique to achieve a fast, high quality clustering for moving objects.

- In our proposed clustering algorithm, we present a solution to overcome the weaknesses of the k-means algorithm, namely, the dependency on the chosen number of clusters, and the dependency on the initial choice of centroids.

- We test the efficiency of our resulting clustering using the silhouette coefficient [16]. The silhouette coefficient is one of the proposed techniques for measuring the quality of clustering.

- Finally, we present experimental results that test both the performance of our technique and measure the accuracy (quality) of our resulting clustering.

The remainder of the paper is organized as follows: Section 2 provides an overview of the related literature. Section 3 describes the basic preliminaries, notations, and definitions that we will use throughout the rest of the paper. Section 4 presents our proposed algorithms for approximate clustering of moving object trajectories. Section 5 presents our experimental results. Finally, Section 6 concludes and proposes directions for future work.

Related Work

Applying data mining techniques to moving object databases became a crucial direction for many applications. Traffic jam prediction, animal migration, air flight traffic monitoring and control, or location based services are all new applications that require efficient analysis and mining of location data. Today, data mining is an essential ingredient for efficient data analysis and knowledge discovery. Clustering is considered one of the main data mining techniques. In general, clustering is an unsupervised learning technique that plays an outstanding role in data mining applications such as scientific data exploration, information retrieval and text mining, computational biology, spatio-temporal data analysis and many others. Many clustering algorithms were developed to satisfy the needs of different applications [14]. These algorithms can be generally characterized as hierarchical algorithms (i.e Agglomerative and Divisive algorithms), partitioning algorithms (i.e k-means, k-mediods, and density based algorithms), Grid based, and constraint based algorithms [29]. Hierarchical algorithms build a hierarchy of clusters, i.e. every cluster is subdivided into child clusters, which form a partition of their parent cluster. Depending on how the hierarchy is built we distinguish between agglomerative (bottom-up) and divisible (top-down) clustering algorithms [14]. On the other hand, partitioning algorithms compute a clustering directly. For example, they compute a clustering by iteratively swapping objects or groups of objects between the clusters. The most prominent clustering algorithm is the k-means algorithm or Lloyd’s algorithm [20].However, the generic definition of clustering is usually redefined depending on the type of data to be clustered and the clustering objective. In other words, different clustering techniques use different definitions and evaluation criteria like partitioning (i.e k-means), hierarchical clustering, and density-based methods [13,27,18]. The original version of the k-means clustering problem is Lloyd’s algorithm [19] and MacQueen [23]. Lloyd’s algorithm is very popular in many fields because of its simplicity and flexibility. Nevertheless, it does not specify the initial placement of centers; authors in [6] provided an approach for solving this issue. There are many clustering algorithms that are very close to clustering based on k-means like the Euclidean k-medians [5,17], in which the objective is to minimize the sum of distances to the nearest center. The geometric k-center problem [4] in which the objective is to minimize the maximum distance from every point to its closest center. Recently, a number of very efficient implementations of k-means algorithm have been developed [15,25,12]. There are many variants of this algorithm; authors in [7] have shown how to scale k-means clustering to very large data sets through sampling and pruning. Clustering moving objects was also investigated in a number of research work. In [21], the authors cluster moving object trajectories based on the time interval of each trajectory segment. Another approach was proposed in [24] where the authors employ the direction feature of trajectory segments to group similar segments in the same cluster. On the other hand, authors in [3] provided an overview of coreset constructions. There are several coreset constructions that have been developed for the k-median and k-means clustering problem [10,11]. Besides, there are many applications that support usage of coresets clustering of moving data [26], approximation algorithms [3], and data streaming [10,11]. Inspired by the importance of clustering moving objects, in this paper we propose a novel algorithm that employs the coreset approximation idea together with the k-means algorithm to provide an efficient clustering methodology. The key difference between this work and previous moving object clustering work is that we propose an approximate clustering technique that only examines a subset of the trajectory segments rather than examining all trajectory segments. We employ the trajectory segments angle as proposed in [22] to determine similar segments. However, only candidate segments selected in the coreset are clustered. In the rest of the paper we present the details of our approach.

Preliminaries

In this paper we propose a fast approximate clustering algorithm for clustering moving object trajectories. The main idea of our algorithm is to compute a weighted set of segments which we refer to as the trajectory-coreset. A trajectory-coreset is thus a subset of the original trajectory, such that the selected subset provides a fair approximation of the original trajectory. In general, moving objects are defined as objects that change their location and/or shape over time [28]. Moving objects are usually divided into moving points and moving regions. In this paper we focus on moving points. Moving points are objects whose location is continuously changing over time; like cars, buses, trains, plans, mobile phone users, etc. The motion of a moving point is usually expressed in terms of its trajectory that represents the path that the object follows during its motion along time. A trajectory of a moving object is typically modeled using either a discrete or a continuous model. In the discrete model, the trajectory is modeled as a sequence of consecutive locations in a multidimensional (generally 2D or 3D) Euclidean space.On the other hand, in the continuous model, the trajectory is modeled as a piece-wise linear function of time. In our model, we consider the continuous representation of a trajectory. In the model, we focus on an important characteristic of a trajectory, namely, the trajectory segment’s slope. A segment’s slope is simply expressed by the segment angular value 6. The angular value 6 is a value between [0, 360], so each segment belongs to one of the following four quadrants: Q1 : [0 - 90[; Q2 : [90 - 180[; Q3 : [180 - 270[; Q4 : [270 - 360]. For simplicity we assume that each segment passes only by one quadrant so we calculate its 6 value only once.

Definition 1. A segment angular measure is an angle that expresses the segment’s motion direction. The segment angular measure 6 is measured as:

Definition 2. A moving object trajectory is a finite sequence of time-parameterized line segments. A trajectory segment is a quadruple (A, B, 6, T), where A, B are the 2D motion parameter vectors, 6 is the angular measure of the segment, and T e R is the time interval over which the segment is defined.

Having computed the angular value for all trajectory segments. Our approach for clustering the trajectory then proceeds as presented in the remainder of the paper.

Coreset Clustering Algorithm

We consider a cluster to be a set of similar trajectory segments. To create clusters, simply a trajectory is first partitioned into its composing line segments, then, we create the trajectory coreset for each trajectory in the moving object database (mod). Each coreset consists of a set of weighted line segments selected using the algorithm shown in Fig. 1. Once the coresets are computed, clustering is then performed over those coresets using the algorithm in Fig. 3. This procedure can result in a single trajectory belonging to multiple clusters based on the clustering result for its individual segments. The clustering algorithm is based on the idea that segments within a cluster are close to each other according to a distance measure and are at distance from segments in other clusters. To define similar trajectories; we apply two different criteria to find similar segments; a distant-based measure, and a shape-based measure. For the distance measure we simply employ the Euclidean distance to retrieve segments that are close to each other. On the other hand, for the shape-based measure we employ the angular measure computed in Equation 1 that utilizes the segments slope to define segments’ similarity based on their orientation. Thus, the similarity process proceeds in 2 steps:

1. The Euclidean distance is applied to retrieve spatially close trajectory segments.

2. The angular measure defined in Equation 1 is applied to select from the segments resulting from step (1) those segments that have the same direction (orientation).

This two step procedure thus guarantees that clustered segments are "close" to each other. In the following discussion we elaborate in more details how the clustering algorithm works. The main point is that since we are clustering trajectories, thus, the centroid of a cluster is a line segments rather than a single point.

Definition 3. Given a cluster C with centroid (i.e. segment) c = (A, B, 0c, T), and a trajectory segment s = (C, D, 0s,T). The spatial distance between c and s is defined as:

Where d(At + B,Ct + D) is the Euclidean distance between the spatial locations of c, s at each time instant t e T.

The key input to the clustering algorithm is the coreset for the mod trajectories. In the rest of this section we present our proposed algorithm for picking a small representative subset of the moving object trajectory (trajectory-coreset) that best approximates the trajectory. In addition, we present our technique for using the silhouette coefficient to ensure high clustering quality.

Algorithm:CreateCoreset(T, Delta)

Building the Trajectory Coreset

A mod trajectory is composed of a set of segments. Each segment is assigned a weight that is a real number expressing the influence of this segment on the trajectory motion pattern. A segment of high weight will thus be added to the coreset as it consequently has a high impact on the overall trajectory motion pattern. A segment gets high weight if the angular difference between it and the previous segment’s angle is high. In other words, the segments with little effect in trajectory shape will be ignored. A threshold S is defined to control the angular difference value that results in a segment addition to the coreset. Thus, given 2 consecutive segments si, s2 with angular values 61,62 resp. If 62 – 61 > S, then, s2 is added to the coreset. Thus we can consider the coreset as a representable subset of the original trajectory.

Example 1. Consider the trajectory shown in Fig. 2 (a). The trajectory segments (s2 and s3) and (s4 and s5) have the same motion direction and slope. Thus, segments s3 and s5 will be ignored and only s1, s2, s4 and s6 will be added to the trajectory coreset as shown in Fig. 2 (b).

Definition 4. Given a trajectory t gmod, a trajectory coreset of t is an approximate representation of t. Such that:

The algorithm in Fig. 1 shows how to create the coreset from the original trajectory. The input to the algorithm is the trajectory t and the deviation allowance between segments’ angles S. The algorithm adds segments to the trajectory coreset CT if the absolute difference between its angular value and the previous segment’s angular value is greater than or equal the deviation allowance.

Fig. 2. Trajectory vs. Trajectory Coreset 4.2 The Silhouette Coefficient

The silhouette coefficient is a measure for the clustering quality. In order to be rather independent from features used for clustering and the number of clusters produced as a clustering result, our main evaluation is based on the silhouette coefficient [16]. To evaluate the quality of a clustering we compute the average silhouette coefficient of all segments.

Definition 5. (Silhouette Coefficient):

Given a coreset based clustering result,such that each ci e CS is a cluster of line segments. The distance between a segment sj in cluster ci is given by the Euclidean distance in Equation 2 between sj and the centroid of ci.

And the distance of segment sj to other clusters that do not contain sj is the minimum of Euclidean distance to all other clusters’ centroids.

Then,

The resulting value of the silhouette coefficient of a segment varies between -1 and 1. A value near -1 implies that the segment is clustered badly. While, a value near 1 implies that the segment is well clustered. Using this approach is computing the Silhouette coefficient a remarkable reduction in the computation time for each value of the number of clusters is achieved.

The algorithm shown in Fig. 3 presents our proposed framework to enhance trajectory clustering using the k-means algorithm through optimally choosing the initial number of clusters (k) using silhouette coefficient and speeding up the overall clustering time using coreset.

The idea of building the trajectory coreset relies on ignoring consecutive similar segments. In this stage we consider shape similarity, where as mentioned above, 2 consecutive segments are considered similar if the difference between their slopes is below deviation allowance.

Algorithm:k-means Using Coresets(S, C)

Fig.3. Clustering Phase Procedure:Update_Centroid(C,NewSeg)

Fig. 4. Updating Centroid Procedure Algorithm:Angle_Similarity (si.Angle, s2.Angle, tolerance)

Fig. 5. Angle Similarity Measure

This in turn means that the new segment has minimal or negligible effect on the trajectory shape (motion pattern). For each trajectory we compute the angles of its segments. The angle of each motion is evaluated using Equation 1.

To cluster all trajectories in mod, we apply the algorithm in Fig. 3. The algorithm tries to find the best match between a new motion and all clusters’ centroid. We denote our proposed approach as core-means because we combine the k-means clustering algorithm along with the coreset approach. Thus the clustering algorithm is finally applied on the coreset trajectory segments not the original trajectory data. The complete clustering techniques thus proceeds as follows:

- First, we build the coreset from the input trajectory set in mod.

- Then, we assign segments to clusters by calculating the distance similarity between new segment and cetroids of all cluster using Euclidean distance in Equation 2.

- Having computed the distance similarity, we then apply our second measure, the angular measure.

- The centroid of the cluster is then updated based on the new segment assigned to the cluster through calling UpdateCentroid procedure in Fig. 4.

This step sets the new segment to be the cluster centroid if it is the closest to cluster segments than current centroid and consequently has the most similar shape to other cluster segments.

- Finally, through experiments we run this algorithm many times and calculate silhouette coefficient for each run to find the best approximation to the original dataset and the optimal number of clusters k.

Experimental Evaluation

In this section we present experimental evaluation for our proposed clustering technique. The proposed method is tested using real traffic trajectory set. This dataset represents the movement of 10000 moving object in London during a one month period (July 2007) with a 10 seconds sampling rate [2]. The first experiment compares whole trajectory clustering against our approximated coreset trajectory clustering technique.

Fig. 6. Running time of clustering coreset vs. whole set

Fig. 7. Running time vs. coreset size

Fig. 8. Running time vs. Dataset size

Cores et Size	Time (Sec)	Silhouette Time (Sec)	Silhouette Value
2091	15	0.45	-1.9
4091	11	1.5	O.fiZ
6091	31	3.7	1.1
B091	SO	5	1.9

Fig. 9. Coreset Size vs. Silhouette Coefficient

The results shown in Fig. 6 are the average performance over 5 runs. The key conclusion is that clustering the whole set takes almost triple the time that the coreset clustering requires.

The second experiment works on changing the percentage of the whole set that would be represented by the coreset; we found that when the coreset size gets smaller it takes less time as shown in Fig. 7. Moreover clustering the coreset and whole set takes more time when dataset size increases by increasing either the number of trajectories or the trajectory length (number of segemnts/trajectroy) because it takes more processing time. Nevertheless, coreset clustering still saves about triple the time that whole set clustering takes as shown in Fig. 8. On the other hand, we study the efficiency (quality) of the resulting clustering using the silhouette coefficient as our quality measure [16]. As presented in Fig. 9, silhouette coefficient value gets -ve values when coreset size becomes a very small percentage of whole set. This result is due to the fact that the coreset is initially an approximation of the original trajectories and not an actual representing of the whole trajectory. Yet, for reasonable approximation (larger coreset size/trajectory), +ve values are achieved indicating good clustering.

Conclusions and Future Work

In this paper we propose a new efficient and fast clustering algorithm for moving object trajectories. The proposed algorithm applies the famous k-means clustering algorithm on a representative subset of the moving object trajectory. This subset is obtained through the use of coresets as a technique to reduce the number of trajectory segments that are considered for clustering. Using coresets, influential segments that cause remarkable changes in the trajectory motion behavior are only added to the coreset for further processing. The proposed algorithm is characterized by its fast running time and its quality of clustering for appropriate coreset sizes.

For future work we think that other trajectory clustering approximation techniques can be considered. Also, other data mining techniques can be investigated for better clustering results. We think that further improvements in trajectory approximation and consequently in clustering results can be investigated.