DEM is an incremental version of the EM algorithm [43]. It repeatedly cycles through all nodes in a network and performs incremental E- and M-steps at each node, using only locally stored data and summary statistics passed on from the previous node. DEM is guaranteed to converge to a local maximum and, as shown empirically, often does so more rapidly than the standard EM algorithm.
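To make the node-local update concrete, the following is a minimal sketch of one incremental E-/M-step at a single node, assuming a one-dimensional Gaussian mixture whose sufficient statistics are the responsibility-weighted counts, sums, and sums of squares; the function name and the statistics layout are illustrative, not taken from [43].

```python
import numpy as np

def node_update(x_local, stats, means, variances, weights, old_resp=None):
    """One DEM-style visit: incremental E- and M-step at a single node,
    using only the local points x_local and the running network-wide
    sufficient statistics stats = (N, S, Q)."""
    N, S, Q = (a.copy() for a in stats)
    # E-step: responsibilities of the K components for the local points.
    dens = np.stack([
        w * np.exp(-(x_local - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)
        for m, v, w in zip(means, variances, weights)
    ])                                    # shape (K, n_local)
    resp = dens / dens.sum(axis=0)
    # Incremental update: swap this node's old contribution for the new one.
    if old_resp is not None:
        N -= old_resp.sum(axis=1)
        S -= old_resp @ x_local
        Q -= old_resp @ x_local ** 2
    N += resp.sum(axis=1)
    S += resp @ x_local
    Q += resp @ x_local ** 2
    # M-step: re-estimate the mixture parameters from the updated statistics.
    means = S / N
    variances = Q / N - means ** 2
    weights = N / N.sum()
    return (N, S, Q), means, variances, weights, resp
```

The node would store `resp` for its next visit and forward the updated statistics and parameters to the next node in the cycle.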
Gu [28] proposes to estimate the global sufficient statistics for the M-step by an average consensus filter, which diffuses the local sufficient statistics over the entire network while communicating only with neighboring nodes. Each node thereby gradually gains global information, until the parameters to be estimated can be accessed from any node in the network. Because the local communication between neighbors is inherently parallel, this approach is more run-time efficient than DEM, which repeatedly cycles over all nodes in the network.
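The consensus step can be sketched as a standard average-consensus iteration, where each node repeatedly moves its estimate toward those of its neighbors; the step size and iteration count below are illustrative assumptions, not the exact filter of [28].

```python
import numpy as np

def average_consensus(local_stats, neighbors, eps=0.2, n_iters=50):
    """local_stats: (n_nodes, d) array, one row of sufficient statistics
    per node; neighbors: dict mapping a node id to its neighbor ids.
    Each iteration exchanges values only along edges of the network."""
    x = np.asarray(local_stats, dtype=float).copy()
    for _ in range(n_iters):
        x_new = x.copy()
        for i, nbrs in neighbors.items():
            # Purely local update: move toward the neighbors' values.
            # eps must stay below 1 / max_degree for the iteration to be stable.
            x_new[i] += eps * sum(x[j] - x[i] for j in nbrs)
        x = x_new
    return x  # every row approaches the network-wide average
```

Since every row converges to the average of the initial local statistics, the M-step can afterwards be carried out at any node in the network.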
Another approach, the DMBC (Distributed Model-Based Clustering) algorithm by Kriegel et al. [37], also assumes a Gaussian mixture distribution. It first estimates the number of Gaussian clusters, their parameters (mean and covariance matrix), and their weights at the local nodes, using the standard EM algorithm. The local parameters and weights are then transferred to a central site, where similar Gaussians are joined into a compact global distribution. Similarity is measured as the mutual support between two clusters C1 and C2, which considers not only their mean vectors but also the variance of the clusters. For high-dimensional data, DMBC assumes attributes to be independent of each other, which reduces the d × d covariance matrices to d-dimensional variance vectors.
For n nodes and a maximum number of local clusters K, the total communication cost is thus bounded by O(nK). It was shown empirically, for varying numbers of clusters and nodes, that the clustering found by DMBC is highly similar to a centralized clustering, as measured by the Rand index.
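The central merging step can be illustrated as follows; the mutual-support measure below (a symmetrized Gaussian overlap of the two means) is a stand-in for the exact definition in [37], and the moment-matched merge is one standard way to join two weighted diagonal Gaussians.

```python
import numpy as np

def diag_gauss_pdf(x, mean, var):
    """Density of a diagonal Gaussian, using a d-dimensional variance vector."""
    return np.prod(np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var))

def mutual_support(c1, c2):
    # Symmetrized overlap: how well each cluster supports the other's mean.
    return min(diag_gauss_pdf(c2["mean"], c1["mean"], c1["var"]),
               diag_gauss_pdf(c1["mean"], c2["mean"], c2["var"]))

def merge(c1, c2):
    # Moment-matched join of two weighted diagonal Gaussians.
    w = c1["w"] + c2["w"]
    mean = (c1["w"] * c1["mean"] + c2["w"] * c2["mean"]) / w
    var = (c1["w"] * (c1["var"] + (c1["mean"] - mean) ** 2)
           + c2["w"] * (c2["var"] + (c2["mean"] - mean) ** 2)) / w
    return {"mean": mean, "var": var, "w": w}
```

At the central site, pairs whose mutual support exceeds a threshold would then be merged greedily until only well-separated Gaussians remain.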
Further algorithms exist, such as distributed versions of density-based clustering [34] and spectral clustering [51], as well as solutions specialized for particular applications, like spatial [41] and time series clustering [62].
Only a few of the algorithms developed so far are truly resource-aware and explicitly consider, for example, the residual energy of nodes or the CPU utilization. An exception is ERA-cluster, proposed by Phung et al. [46], which is based on the concept of microclusters [2]. It can automatically adapt its sampling rate and the number of examined microclusters to the current battery, memory, and CPU utilization as measured by a resource monitor.
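A minimal sketch of what such resource-aware adaptation could look like; the thresholds, scaling factors, and the resource-monitor readings (normalized to [0, 1]) are hypothetical, not the actual policy of ERA-cluster.

```python
def adapt_to_resources(sampling_rate, n_microclusters,
                       battery, memory_used, cpu_load,
                       min_rate=0.05, min_mc=8):
    """Scale back work when the resource monitor reports pressure;
    all readings are assumed to be fractions in [0, 1]."""
    if battery < 0.2 or cpu_load > 0.9:
        # Low battery or a busy CPU: process a smaller sample of the stream.
        sampling_rate = max(min_rate, sampling_rate * 0.5)
    if memory_used > 0.9:
        # Memory pressure: examine fewer microclusters per update.
        n_microclusters = max(min_mc, n_microclusters // 2)
    return sampling_rate, n_microclusters
```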
EDISKCO [31] solves the k-center clustering problem and can also determine outliers. It works incrementally and needs only a single pass over the input observations, without storing them. The local nodes keep a special heap structure that stores their local k centers and z outliers, sorted according to cluster counts. If a new point doesn't
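A minimal sketch of the per-node heap described above, holding k centers plus z outlier slots ordered by cluster count; the radius threshold and the behavior once all slots are taken are assumptions, since [31] defines the exact rules.

```python
import heapq
import itertools

class CenterHeap:
    """k local centers plus z outlier slots, sorted by cluster count."""
    def __init__(self, k, z, radius):
        self.capacity = k + z
        self.radius = radius          # assumed distance threshold
        self.heap = []                # entries: [count, tiebreak, center]
        self._tie = itertools.count()

    def insert(self, point, dist):
        # Single pass, the points themselves are never stored: absorb the
        # point into the first center within the radius threshold.
        for entry in self.heap:
            if dist(point, entry[2]) <= self.radius:
                entry[0] += 1
                heapq.heapify(self.heap)  # restore the count ordering
                return
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, [1, next(self._tie), point])
        # The policy for points that fit no center once all slots are
        # taken is defined in [31] and omitted from this sketch.
```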