as long as a data point is not added to the micro-cluster. In the
latter case, the statistics need to be updated explicitly, while other
counts can still be maintained implicitly.
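The distinction between implicit and explicit maintenance of decayed counts can be illustrated with a small sketch. The following Python fragment is only illustrative; the class and field names and the decay parameter are assumptions rather than part of any specific published algorithm. Between additions the decay is tracked implicitly; only when a point is added is the accumulated decay factor applied explicitly to the stored statistics.

import math

class DecayedMicroCluster:
    """Micro-cluster whose weighted counts decay exponentially over time.
    The decay is maintained implicitly between additions; only when a new
    point is added are the stored statistics updated explicitly."""

    def __init__(self, dim, decay_rate=0.01):   # decay_rate is an assumed parameter
        self.decay_rate = decay_rate
        self.weight = 0.0                        # decayed number of points
        self.linear_sum = [0.0] * dim            # decayed sum of coordinates
        self.last_update = 0.0                   # time of the last explicit update

    def add(self, point, t):
        # Explicit update: apply the decay accumulated since the last addition,
        # then fold the new point into the statistics.
        factor = math.exp(-self.decay_rate * (t - self.last_update))
        self.weight = self.weight * factor + 1.0
        self.linear_sum = [s * factor + x for s, x in zip(self.linear_sum, point)]
        self.last_update = t

    def current_weight(self, t):
        # Implicit view: no stored state is modified between additions.
        return self.weight * math.exp(-self.decay_rate * (t - self.last_update))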
Uncertain Data: In many cases, such as in sensor networks, the
underlying data may be noisy and uncertain. In such cases, it may
be desirable to incorporate the uncertainty into the clustering pro-
cess. In order to do so, the micro-cluster statistics are appended
with the information about the underlying uncertainty in the data.
This information can be used to make the clustering computations
more robust. The advantages of incorporating uncertainty into the
clustering process are illustrated in [7].
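A minimal sketch of how uncertainty information might be appended to the micro-cluster statistics is shown below. It assumes that each incoming point carries a per-dimension error standard deviation; the class and method names are illustrative, and the distance adjustment is only one possible way to exploit the stored error statistics, not the exact procedure of [7].

class UncertainMicroCluster:
    """Micro-cluster statistics extended with per-dimension error variances."""

    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim           # linear sum of values
        self.ss = [0.0] * dim           # sum of squared values
        self.err = [0.0] * dim          # sum of reported error variances

    def add(self, point, point_err_std):
        # Fold the point and its reported uncertainty into the statistics.
        self.n += 1
        for i, (x, s) in enumerate(zip(point, point_err_std)):
            self.ls[i] += x
            self.ss[i] += x * x
            self.err[i] += s * s

    def expected_sq_distance(self, point, point_err_std):
        """Squared distance to the centroid, inflated by the point's error
        variance and the average error variance stored in the cluster."""
        d = 0.0
        for i, (x, s) in enumerate(zip(point, point_err_std)):
            c = self.ls[i] / max(self.n, 1)
            d += (x - c) ** 2 + s * s + self.err[i] / max(self.n, 1)
        return d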
Text and Categorical Data: A closely related problem is that
of text and categorical data. The main difference with the quan-
titative domain is the nature of the statistics which are stored for
clustering purposes. In this case, we maintain the counts of the
frequencies of the discrete attributes in each cluster. Furthermore,
we also maintain the inter-attribute correlation counts which may
be required in a variety of applications. In [12], an efficient algo-
rithm has been proposed for clustering text and categorical data
streams. This algorithm also allows for a decay-based approach
as in [11]. Text and categorical streams often arise in the context
of social sensors such as micro-blogging sites, or other tagging or
event detection scenarios.
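The following sketch illustrates the kind of summary statistics described above for categorical records: decayed frequency counts of attribute values, together with pairwise co-occurrence counts. The decay rate and the similarity measure are illustrative assumptions rather than the specific design of [12].

import math
from collections import defaultdict

class CategoricalClusterSummary:
    """Decayed attribute-value frequencies and co-occurrence counts for one cluster."""

    def __init__(self, decay_rate=0.001):       # assumed decay parameter
        self.decay_rate = decay_rate
        self.weight = 0.0
        self.freq = defaultdict(float)           # (attribute, value) -> decayed count
        self.cooc = defaultdict(float)           # pair of (attribute, value) -> decayed count
        self.last_update = 0.0

    def _decay(self, t):
        # Bring all counts up to the current time before adding a record.
        factor = math.exp(-self.decay_rate * (t - self.last_update))
        self.weight *= factor
        for k in self.freq:
            self.freq[k] *= factor
        for k in self.cooc:
            self.cooc[k] *= factor
        self.last_update = t

    def add(self, record, t):
        """record: dict mapping attribute name -> categorical value."""
        self._decay(t)
        self.weight += 1.0
        items = sorted(record.items())
        for a, v in items:
            self.freq[(a, v)] += 1.0
        for i in range(len(items)):              # inter-attribute correlation counts
            for j in range(i + 1, len(items)):
                self.cooc[(items[i], items[j])] += 1.0

    def similarity(self, record):
        """Average relative frequency of the record's values within the cluster."""
        if self.weight == 0.0:
            return 0.0
        total = sum(self.freq.get((a, v), 0.0) for a, v in record.items())
        return total / (self.weight * len(record))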
In addition, a number of density-based methods [25, 28] have also been
proposed for the problem of stream clustering.
In the context of sensor networks, the stream data is often available
only in a distributed setting, in which large volumes of data are col-
lected separately at the different sensors. A natural approach for clus-
tering such data is to transmit all of the data to a centralized server.
The clustering can then be performed at the centralized server in or-
der to determine the final results. Unfortunately, such an approach is
extremely expensive in terms of its communication costs. Therefore, it
is important to design a method which can reduce the communication
costs among the different processors. A method proposed in [32] per-
forms local clustering at each node, and merges these different clusters
into a single global clustering at low communication cost. Two different
methods are proposed in this work. The first method determines the
cluster centers by using a furthest point algorithm on the current set of
data points at the local site. In the furthest point algorithm, each new
center is picked as the point furthest from the current set of centers.
For any incoming data point, it is assigned to its closest center, as long
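A generic sketch of the furthest point heuristic at a local site, together with the closest-center assignment of incoming points, is given below. It illustrates the classical furthest point idea rather than the exact procedure of [32]; the function names, the choice of k, and the example data are assumptions.

import math
import random

def _dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def furthest_point_centers(points, k):
    """Farthest-first traversal: repeatedly pick the point that is
    furthest from the centers chosen so far."""
    centers = [points[0]]                        # seed with an arbitrary point
    while len(centers) < k:
        next_center = max(points, key=lambda p: min(_dist(p, c) for c in centers))
        centers.append(next_center)
    return centers

def assign_to_closest(point, centers):
    """Assign an incoming point to the index of its closest current center."""
    return min(range(len(centers)), key=lambda i: _dist(point, centers[i]))

# Example usage at a hypothetical local site:
local_points = [(random.random(), random.random()) for _ in range(100)]
centers = furthest_point_centers(local_points, k=5)
cluster_of_new_point = assign_to_closest((0.3, 0.7), centers)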