the distance is within a certain factor of an optimally computed radius.
Otherwise, a re-clustering is forced by applying the furthest point
algorithm on the current set of points. After the application of the furthest point
algorithm, the centers are transmitted to the central server, which then
computes a global clustering from these local centers over the different
nodes. These global centers can then be transmitted to the local nodes if
desired. One attractive feature of the method is that an approximation
bound is proposed on the quality of the clustering. A second method for
distributed clustering proposed in [32] is the parallel guessing algorithm.
Another method for distributed sensor stream clustering which reduces
the dimensionality and communication cost by maintaining an online
discretization may be found in [68].
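For concreteness, the following is a minimal sketch of the classical furthest point heuristic that a local node could apply when a re-clustering is forced. The function names and the use of Euclidean distance are illustrative assumptions, not the notation of [32]; the heuristic itself is the standard greedy 2-approximation for the k-center radius.

```python
import math
import random

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def furthest_point_centers(points, k, dist=euclidean):
    """Greedily pick each new center as the point furthest from all
    centers chosen so far (a 2-approximation for the k-center radius)."""
    centers = [random.choice(points)]
    # nearest[i] = distance from points[i] to its closest chosen center
    nearest = [dist(p, centers[0]) for p in points]
    while len(centers) < k:
        i = max(range(len(points)), key=nearest.__getitem__)
        centers.append(points[i])
        nearest = [min(d, dist(p, points[i])) for d, p in zip(nearest, points)]
    return centers
```

In the distributed setting sketched above, each local node would run this routine on its own points and ship only the k resulting centers to the central server.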
3.2 Data Stream Classification
The problem of classification is perhaps one of the most widely stud-
ied in the context of data stream mining. The task is made more
difficult by the evolution of the underlying data stream.
Therefore, effective algorithms need to be designed in order to take tem-
poral locality into account. The concept of stream evolution is sometimes
referred to as concept drift in the stream classification literature. Some
of these algorithms are designed to be purely one-pass adaptations of
conventional classification algorithms [39], whereas others (such as the
methods in [13, 48]) are more effective in accounting for the evolution of
the underlying data stream. The broad methods which are studied for
classification in the data stream scenario are as follows:
VFDT Method: The VFDT (Very Fast Decision Trees) method has
been adapted to create decision trees which are similar to those con-
structed by a conventional learner with the use of sampling based ap-
proximations. The VFDT method splits a tree using the current best
attribute, taking into consideration the fact that the number of exam-
ples used is sufficient to preserve the Hoeffding bound in a way that
the output is similar to that of a conventional learner. The key question
during the construction of the decision tree is the choice of attributes
to be used for splits. Approximate ties are broken using a user-specified
threshold of acceptable error-measure for the output. It can be shown
that for any small value of δ, a particular choice of the split variable is
the correct choice with probability at least 1 − δ, if a sufficient number
of stream records have been processed. This number has been shown
in [39] to increase at a relatively modest rate of ln(1/δ). This bound
can then be extended to the entire decision tree, so as to quantify the
probability that the same decision tree as a conventional learner is created.
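To make the split test concrete, the sketch below states the Hoeffding-bound check in the form usually given for VFDT. The parameter names (value_range for the range of the gain metric, tie_threshold for the user-specified acceptable-error threshold mentioned above) are illustrative choices, not taken from [39].

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Radius eps such that, after n observations, the true mean of a
    random variable with range value_range lies within eps of the
    observed mean with probability at least 1 - delta."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def should_split(g_best, g_second, value_range, delta, n, tie_threshold=0.05):
    """Split on the current best attribute once the observed gain gap
    beats the Hoeffding bound, or break an approximate tie once the
    bound shrinks below the user-specified threshold."""
    eps = hoeffding_bound(value_range, delta, n)
    return (g_best - g_second > eps) or (eps < tie_threshold)
```

Solving the bound for n shows that the number of records required for a confident split grows only as ln(1/δ), which is the modest rate quoted above.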