lihood estimates of the distribution parameters are obtained by maximizing the
expected likelihood of the examples computed in the expectation step. The algorithm be-
gins with random values of the distribution parameters and iterates between the
E and M steps until it converges. The most demanding computational part of the
algorithm is the expectation step, where each sample in the data must be accessed.
However, this is also a task that is trivial to parallelize; each node in a parallel
system can observe some of the data and compute the statistics independently of
other nodes. The total update can then be aggregated at a central location (a mas-
ter node) and the M step executed. Alternatively, stochastic methods have been
suggested for online training of the EM algorithm [31].
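
As a rough illustration of this aggregation scheme, the sketch below simulates the distributed E step for a one-dimensional Gaussian mixture: each data partition stands in for a node, returns only its sufficient statistics, and a master routine sums them and performs the M step. The partition layout, the use of NumPy, and the function names are illustrative assumptions, not details taken from the text.

    import numpy as np

    def e_step_local(x, means, variances, weights):
        # One "node": compute responsibilities for its slice of the data and
        # return only the sufficient statistics needed for the M step.
        dens = np.stack([
            w * np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)
            for m, v, w in zip(means, variances, weights)
        ], axis=1)                                  # shape (n_local, k)
        r = dens / dens.sum(axis=1, keepdims=True)  # responsibilities
        return r.sum(axis=0), r.T @ x, r.T @ (x ** 2)

    def m_step(stats, n_total):
        # Master node: sum the statistics from all partitions, then update parameters.
        n_k = sum(s[0] for s in stats)
        s_x = sum(s[1] for s in stats)
        s_xx = sum(s[2] for s in stats)
        weights = n_k / n_total
        means = s_x / n_k
        variances = s_xx / n_k - means ** 2
        return means, variances, weights

    rng = np.random.default_rng(0)
    data = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)])
    partitions = np.array_split(data, 4)            # stand-ins for four worker nodes

    means = np.array([-1.0, 1.0])
    variances = np.array([1.0, 1.0])
    weights = np.array([0.5, 0.5])
    for _ in range(50):
        stats = [e_step_local(p, means, variances, weights) for p in partitions]
        means, variances, weights = m_step(stats, len(data))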
When the underlying distribution of data cannot be modeled using paramet-
ric distributions, or when these distributions are unknown, one is forced to use
other classification approaches. One of these is to model the density of data for
each of the classes and to classify test points according to the maximum posterior
probability. Frequently, this is done using the Parzen windows estimator [32]. This
method is, however, expensive both computationally and memory-wise. Addition-
ally, a large number of training points is required for an accurate estimation of
density. Therefore, this method is suitable mainly for small to medium datasets.
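
A minimal sketch of this approach is given below, assuming a Gaussian kernel with a single shared bandwidth h (so the normalization constant cancels when comparing classes); the function name and the NumPy-based implementation are illustrative assumptions rather than details from the text.

    import numpy as np

    def parzen_classify(x_test, x_train, y_train, h=0.5):
        # Estimate each class-conditional density with a Gaussian Parzen window
        # and label test points by the maximum posterior (prior times density).
        classes = np.unique(y_train)
        scores = []
        for c in classes:
            xc = x_train[y_train == c]               # all training points of class c
            prior = len(xc) / len(x_train)
            # Every training point of the class is visited for every test point,
            # which is the computational and memory cost mentioned above.
            d2 = ((x_test[:, None, :] - xc[None, :, :]) ** 2).sum(axis=-1)
            density = np.exp(-d2 / (2 * h ** 2)).mean(axis=1)
            scores.append(prior * density)
        return classes[np.argmax(np.stack(scores), axis=0)]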
A different method for classification uses a naive Bayes classifier. This classifier
works under the assumption that each feature is independent of other features.
Under this assumption, using Bayes' theorem, the probability that a sample was
drawn from a given class is simply the prior probability of that class multiplied by
the product, over the features, of the probability that the sample's value for each
feature is drawn from that class.
The probabilities for each feature are approximated with relative frequencies found
in the training set.
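
A compact sketch of such a classifier for categorical features follows; the use of unsmoothed relative frequencies and the function names are simplifying assumptions for illustration.

    from collections import Counter, defaultdict

    def train_naive_bayes(samples, labels):
        # Count class occurrences and, for each (class, feature) pair,
        # the occurrences of each feature value in the training set.
        class_counts = Counter(labels)
        value_counts = defaultdict(Counter)
        for x, y in zip(samples, labels):
            for j, v in enumerate(x):
                value_counts[(y, j)][v] += 1
        return class_counts, value_counts

    def classify(x, class_counts, value_counts, n_train):
        # Score each class by its prior times the product of per-feature
        # relative frequencies, and return the highest-scoring class.
        best_class, best_score = None, -1.0
        for c, n_c in class_counts.items():
            score = n_c / n_train
            for j, v in enumerate(x):
                score *= value_counts[(c, j)][v] / n_c
            if score > best_score:
                best_class, best_score = c, score
        return best_class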
In spite of their naive design and underlying assumptions, naive Bayes classifiers
often work well on real-world problems. This has been shown to have a theoretical
justification [33].
Speeding up a naive Bayes classifier (i.e., computing the frequencies of feature
values) can be done either by sending different features to different processing
nodes or by partitioning the data across the nodes and aggregating the frequencies
returned from each node.
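
The data-partitioning variant can be sketched as follows: each partition plays the role of a node and returns local counters, which the master simply adds together. The use of Python's multiprocessing pool, the toy data, and the function names are illustrative assumptions.

    from collections import Counter
    from multiprocessing import Pool

    def local_frequencies(chunk):
        # Worker node: count class labels and (class, feature, value) triples
        # on its own partition of the training data.
        samples, labels = chunk
        class_counts, value_counts = Counter(), Counter()
        for x, y in zip(samples, labels):
            class_counts[y] += 1
            for j, v in enumerate(x):
                value_counts[(y, j, v)] += 1
        return class_counts, value_counts

    def aggregate(results):
        # Master node: counters from the partitions simply add up.
        class_counts, value_counts = Counter(), Counter()
        for c, v in results:
            class_counts += c
            value_counts += v
        return class_counts, value_counts

    if __name__ == "__main__":
        chunks = [
            ([("sunny", "hot"), ("rainy", "mild")], ["no", "yes"]),
            ([("sunny", "mild"), ("rainy", "hot")], ["yes", "no"]),
        ]
        with Pool(2) as pool:
            class_counts, value_counts = aggregate(pool.map(local_frequencies, chunks))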
Most classification algorithms directly try to construct a boundary between
classes to minimize the test error. The k-nearest neighbor classifier is probably the
simplest classification method. This classifier labels test points according to a ma-
jority vote amongst the k closest training points. This method frequently achieves
remarkably low classification errors, but it suffers from two main drawbacks. The
first is that it is computationally expensive because of the need to find the k closest
training points to each test point. The second drawback is that a large amount of memory is
required for storing the training data. One way to overcome the first drawback is
through efficient data structures such as the KD-tree [34] or, more recently, cover
trees [35].
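
A short sketch combining both remedies mentioned above is shown here: a KD-tree (built with SciPy's cKDTree, used as one possible implementation) answers the nearest-neighbour queries, and a majority vote labels each test point. The function name and the choice of SciPy are assumptions for illustration, not taken from the references.

    import numpy as np
    from collections import Counter
    from scipy.spatial import cKDTree

    def knn_predict(x_train, y_train, x_test, k=5):
        # Build the KD-tree once over the training points, then query the
        # k nearest neighbours of every test point without a full linear scan.
        tree = cKDTree(x_train)
        _, idx = tree.query(x_test, k=k)
        if k == 1:
            idx = idx[:, None]
        y_train = np.asarray(y_train)
        # Majority vote among the k neighbours of each test point.
        return np.array([Counter(y_train[row]).most_common(1)[0][0] for row in idx])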
A different approach to classification uses decision trees. In a decision tree,
each node corresponds to a variable and each arc to a range of possible values of
that variable. The leaves of the tree hold the predicted label values.
Decision trees are built to create leaves that are as pure as possible; that is, they
contain the largest possible fraction of points from a single class. Classification of