as long as a data point is not added to the micro-cluster. In the
latter case, the statistics need to be updated explicitly, while other
counts can still be maintained implicitly.
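The distinction between implicit and explicit maintenance of decayed counts can be illustrated with a small sketch. The following Python fragment is only illustrative; the class and field names and the decay parameter are assumptions rather than part of any specific published algorithm. Between additions the decay is tracked implicitly; only when a point is added is the accumulated decay factor applied explicitly to the stored statistics.

import math

class DecayedMicroCluster:
    """Micro-cluster whose weighted counts decay exponentially over time.
    The decay is maintained implicitly between additions; only when a new
    point is added are the stored statistics updated explicitly."""

    def __init__(self, dim, decay_rate=0.01):   # decay_rate is an assumed parameter
        self.decay_rate = decay_rate
        self.weight = 0.0                        # decayed number of points
        self.linear_sum = [0.0] * dim            # decayed sum of coordinates
        self.last_update = 0.0                   # time of the last explicit update

    def add(self, point, t):
        # Explicit update: apply the decay accumulated since the last addition,
        # then fold the new point into the statistics.
        factor = math.exp(-self.decay_rate * (t - self.last_update))
        self.weight = self.weight * factor + 1.0
        self.linear_sum = [s * factor + x for s, x in zip(self.linear_sum, point)]
        self.last_update = t

    def current_weight(self, t):
        # Implicit view: no stored state is modified between additions.
        return self.weight * math.exp(-self.decay_rate * (t - self.last_update))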
Uncertain Data: In many cases, such as in sensor networks, the
underlying data may be noisy and uncertain. In such cases, it may
be desirable to incorporate the uncertainty into the clustering pro-
cess. In order to do so, the micro-cluster statistics are appended
with the information about the underlying uncertainty in the data.
This information can be used to make the clustering computations
more robust. The advantages of incorporating uncertainty into the
clustering process are illustrated in [7].
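A minimal sketch of how uncertainty information might be appended to the micro-cluster statistics is shown below. It assumes that each incoming point carries a per-dimension error standard deviation; the class and method names are illustrative, and the distance adjustment is only one possible way to exploit the stored error statistics, not the exact procedure of [7].

class UncertainMicroCluster:
    """Micro-cluster statistics extended with per-dimension error variances."""

    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim           # linear sum of values
        self.ss = [0.0] * dim           # sum of squared values
        self.err = [0.0] * dim          # sum of reported error variances

    def add(self, point, point_err_std):
        # Fold the point and its reported uncertainty into the statistics.
        self.n += 1
        for i, (x, s) in enumerate(zip(point, point_err_std)):
            self.ls[i] += x
            self.ss[i] += x * x
            self.err[i] += s * s

    def expected_sq_distance(self, point, point_err_std):
        """Squared distance to the centroid, inflated by the point's error
        variance and the average error variance stored in the cluster."""
        d = 0.0
        for i, (x, s) in enumerate(zip(point, point_err_std)):
            c = self.ls[i] / max(self.n, 1)
            d += (x - c) ** 2 + s * s + self.err[i] / max(self.n, 1)
        return d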
Text and Categorical Data: A closely related problem is that
of text and categorical data. The main difference with the quan-
titative domain is the nature of the statistics which are stored for
clustering purposes. In this case, we maintain the counts of the
frequencies of the discrete attributes in each cluster. Furthermore,
we also maintain the inter-attribute correlation counts which may
be required in a variety of applications. In [12], an efficient algo-
rithm has been proposed for clustering text and categorical data
streams. This algorithm also allows for a decay-based approach
as in [11]. Text and categorical streams often arise in the context
of social sensors such as micro-blogging sites, or other tagging or
event detection scenarios.
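The following sketch illustrates the kind of summary statistics described above for categorical records: decayed frequency counts of attribute values, together with pairwise co-occurrence counts. The decay rate and the similarity measure are illustrative assumptions rather than the specific design of [12].

import math
from collections import defaultdict

class CategoricalClusterSummary:
    """Decayed attribute-value frequencies and co-occurrence counts for one cluster."""

    def __init__(self, decay_rate=0.001):       # assumed decay parameter
        self.decay_rate = decay_rate
        self.weight = 0.0
        self.freq = defaultdict(float)           # (attribute, value) -> decayed count
        self.cooc = defaultdict(float)           # pair of (attribute, value) -> decayed count
        self.last_update = 0.0

    def _decay(self, t):
        # Bring all counts up to the current time before adding a record.
        factor = math.exp(-self.decay_rate * (t - self.last_update))
        self.weight *= factor
        for k in self.freq:
            self.freq[k] *= factor
        for k in self.cooc:
            self.cooc[k] *= factor
        self.last_update = t

    def add(self, record, t):
        """record: dict mapping attribute name -> categorical value."""
        self._decay(t)
        self.weight += 1.0
        items = sorted(record.items())
        for a, v in items:
            self.freq[(a, v)] += 1.0
        for i in range(len(items)):              # inter-attribute correlation counts
            for j in range(i + 1, len(items)):
                self.cooc[(items[i], items[j])] += 1.0

    def similarity(self, record):
        """Average relative frequency of the record's values within the cluster."""
        if self.weight == 0.0:
            return 0.0
        total = sum(self.freq.get((a, v), 0.0) for a, v in record.items())
        return total / (self.weight * len(record))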
In addition, a number of density-based methods [25, 28] have also been
proposed for the problem of stream clustering.
In the context of sensor networks, the stream data is often available
only in a distributed setting, in which large volumes of data are col-
lected separately at the different sensors. A natural approach for clus-
tering such data is to transmit all of the data to a centralized server.
The clustering can then be performed at the centralized server in or-
der to determine the final results. Unfortunately, such an approach is
extremely expensive in terms of its communication costs. Therefore, it
is important to design a method which can reduce the communication
costs among the different processors. A method proposed in [32] per-
forms local clustering at each node, and merges these different clusters
into a single global clustering at low communication cost. Two different
methods are proposed in this work. The first method determines the
cluster centers by using a furthest point algorithm on the current set of
data points at the local site. In the furthest point algorithm, each new
center is picked as the point furthest from the current set of centers.
For any incoming data point, it is assigned to its closest center, as long
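A generic sketch of the furthest point heuristic at a local site, together with the closest-center assignment of incoming points, is given below. It illustrates the classical furthest point idea rather than the exact procedure of [32]; the function names, the choice of k, and the example data are assumptions.

import math
import random

def _dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def furthest_point_centers(points, k):
    """Farthest-first traversal: repeatedly pick the point that is
    furthest from the centers chosen so far."""
    centers = [points[0]]                        # seed with an arbitrary point
    while len(centers) < k:
        next_center = max(points, key=lambda p: min(_dist(p, c) for c in centers))
        centers.append(next_center)
    return centers

def assign_to_closest(point, centers):
    """Assign an incoming point to the index of its closest current center."""
    return min(range(len(centers)), key=lambda i: _dist(point, centers[i]))

# Example usage at a hypothetical local site:
local_points = [(random.random(), random.random()) for _ in range(100)]
centers = furthest_point_centers(local_points, k=5)
cluster_of_new_point = assign_to_closest((0.3, 0.7), centers)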