DEM is an incremental version of the EM algorithm [43]. It repeatedly cycles through all nodes in a network and performs incremental E- and M-steps at each node, using only locally stored data and summary statistics passed on from the previous node. DEM is guaranteed to converge to a local maximum and, as shown empirically, often does so more rapidly than the standard EM algorithm.
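To make the node-local update concrete, the following is a minimal sketch of one incremental E-/M-step at a single node, assuming a one-dimensional Gaussian mixture whose sufficient statistics are the responsibility-weighted counts, sums, and sums of squares; the function name and the statistics layout are illustrative, not taken from [43].

```python
import numpy as np

def node_update(x_local, stats, means, variances, weights, old_resp=None):
    """One DEM-style visit: incremental E- and M-step at a single node,
    using only the local points x_local and the running network-wide
    sufficient statistics stats = (N, S, Q)."""
    N, S, Q = (a.copy() for a in stats)
    # E-step: responsibilities of the K components for the local points.
    dens = np.stack([
        w * np.exp(-(x_local - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)
        for m, v, w in zip(means, variances, weights)
    ])                                    # shape (K, n_local)
    resp = dens / dens.sum(axis=0)
    # Incremental update: swap this node's old contribution for the new one.
    if old_resp is not None:
        N -= old_resp.sum(axis=1)
        S -= old_resp @ x_local
        Q -= old_resp @ x_local ** 2
    N += resp.sum(axis=1)
    S += resp @ x_local
    Q += resp @ x_local ** 2
    # M-step: re-estimate the mixture parameters from the updated statistics.
    means = S / N
    variances = Q / N - means ** 2
    weights = N / N.sum()
    return (N, S, Q), means, variances, weights, resp
```

The node would store `resp` for its next visit and forward the updated statistics and parameters to the next node in the cycle.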
Gu [28] proposes to estimate the global sufficient statistics for the M-step by an average consensus filter, which diffuses the local sufficient statistics over the entire network while communicating only with neighboring nodes. Each node thereby gradually gains global information, until the parameters to be estimated can be accessed from any node in the network. Because the local communication between neighbors is inherently parallel, this approach is more run-time efficient than DEM, which repeatedly cycles over all nodes in the network.
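The consensus step can be sketched as a standard average-consensus iteration, where each node repeatedly moves its estimate toward those of its neighbors; the step size and iteration count below are illustrative assumptions, not the exact filter of [28].

```python
import numpy as np

def average_consensus(local_stats, neighbors, eps=0.2, n_iters=50):
    """local_stats: (n_nodes, d) array, one row of sufficient statistics
    per node; neighbors: dict mapping a node id to its neighbor ids.
    Each iteration exchanges values only along edges of the network."""
    x = np.asarray(local_stats, dtype=float).copy()
    for _ in range(n_iters):
        x_new = x.copy()
        for i, nbrs in neighbors.items():
            # Purely local update: move toward the neighbors' values.
            # eps must stay below 1 / max_degree for the iteration to be stable.
            x_new[i] += eps * sum(x[j] - x[i] for j in nbrs)
        x = x_new
    return x  # every row approaches the network-wide average
```

Since every row converges to the average of the initial local statistics, the M-step can afterwards be carried out at any node in the network.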
Another approach, the DMBC (Distributed Model-Based Clustering) algorithm by Kriegel et al. [37], also assumes a Gaussian mixture distribution. It first estimates the number of Gaussian clusters, their parameters (mean and covariance matrix), and their weights at the local nodes, using the standard EM algorithm. The local parameters and weights are then transferred to a central site, where similar Gaussians are joined into a compact global distribution. Similarity is measured as the mutual support between two clusters C1 and C2, which considers not only their mean vectors but also the variance of the clusters. For high-dimensional data, DMBC assumes attributes to be independent of each other, which reduces the d × d covariance matrices to d-dimensional variance vectors.
For n nodes and a maximum number of local clusters K, the total communication cost is thus bounded by O(nK). It was shown empirically, for varying numbers of clusters and nodes, that the clustering found by DMBC is highly similar to a centralized clustering, as measured by the Rand index.
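The central merging step can be illustrated as follows; the mutual-support measure below (a symmetrized Gaussian overlap of the two means) is a stand-in for the exact definition in [37], and the moment-matched merge is one standard way to join two weighted diagonal Gaussians.

```python
import numpy as np

def diag_gauss_pdf(x, mean, var):
    """Density of a diagonal Gaussian, using a d-dimensional variance vector."""
    return np.prod(np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var))

def mutual_support(c1, c2):
    # Symmetrized overlap: how well each cluster supports the other's mean.
    return min(diag_gauss_pdf(c2["mean"], c1["mean"], c1["var"]),
               diag_gauss_pdf(c1["mean"], c2["mean"], c2["var"]))

def merge(c1, c2):
    # Moment-matched join of two weighted diagonal Gaussians.
    w = c1["w"] + c2["w"]
    mean = (c1["w"] * c1["mean"] + c2["w"] * c2["mean"]) / w
    var = (c1["w"] * (c1["var"] + (c1["mean"] - mean) ** 2)
           + c2["w"] * (c2["var"] + (c2["mean"] - mean) ** 2)) / w
    return {"mean": mean, "var": var, "w": w}
```

At the central site, pairs whose mutual support exceeds a threshold would then be merged greedily until only well-separated Gaussians remain.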
Further algorithms exist, such as distributed versions of density-based clustering [34] and spectral clustering [51], as well as solutions specialized for particular applications, like spatial [41] and time series clustering [62].
Only a few of the algorithms developed so far are truly resource-aware and explicitly consider, for example, the residual energy of nodes or the CPU utilization. An exception is ERA-cluster, proposed by Phung et al. [46], which is based on the concept of microclusters [2]. It can automatically adapt its sampling rate and the number of examined microclusters to the current battery, memory, and CPU utilization as measured by a resource monitor.
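A minimal sketch of what such resource-aware adaptation could look like; the thresholds, scaling factors, and the resource-monitor readings (normalized to [0, 1]) are hypothetical, not the actual policy of ERA-cluster.

```python
def adapt_to_resources(sampling_rate, n_microclusters,
                       battery, memory_used, cpu_load,
                       min_rate=0.05, min_mc=8):
    """Scale back work when the resource monitor reports pressure;
    all readings are assumed to be fractions in [0, 1]."""
    if battery < 0.2 or cpu_load > 0.9:
        # Low battery or a busy CPU: process a smaller sample of the stream.
        sampling_rate = max(min_rate, sampling_rate * 0.5)
    if memory_used > 0.9:
        # Memory pressure: examine fewer microclusters per update.
        n_microclusters = max(min_mc, n_microclusters // 2)
    return sampling_rate, n_microclusters
```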
EDISKCO [31] solves the k-center clustering problem and can also determine outliers. It works incrementally and needs only a single pass over the input observations, without storing them. The local nodes keep a special heap structure that stores their local k centers and z outliers, sorted according to cluster counts. If a new point doesn't
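A minimal sketch of the per-node heap described above, holding k centers plus z outlier slots ordered by cluster count; the radius threshold and the behavior once all slots are taken are assumptions, since [31] defines the exact rules.

```python
import heapq
import itertools

class CenterHeap:
    """k local centers plus z outlier slots, sorted by cluster count."""
    def __init__(self, k, z, radius):
        self.capacity = k + z
        self.radius = radius          # assumed distance threshold
        self.heap = []                # entries: [count, tiebreak, center]
        self._tie = itertools.count()

    def insert(self, point, dist):
        # Single pass, the points themselves are never stored: absorb the
        # point into the first center within the radius threshold.
        for entry in self.heap:
            if dist(point, entry[2]) <= self.radius:
                entry[0] += 1
                heapq.heapify(self.heap)  # restore the count ordering
                return
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, [1, next(self._tie), point])
        # The policy for points that fit no center once all slots are
        # taken is defined in [31] and omitted from this sketch.
```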