closest to it according to some distance measure d(x_i, x_j). While the NN classifier
achieves perfect performance on the training set, generalization to a test set might
not be satisfactory. Generalization typically improves if the classification decision is
based not only on the closest training vector, but on the K nearest neighbors. If the
majority rule is used for the decision, the system is called a KNN classifier.
The KNN classifier stores all examples of the training set. While this requires
no training time, it is frequently advantageous to invest some time for learning in
order to extract the essence of a dataset for later use as a data model. This speeds up
the recall and improves generalization.
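As an illustration, the recall procedure of a KNN classifier can be sketched in a few lines of Python. The class and parameter names below (KNNClassifier, k, fit, predict) are chosen for this sketch only, and the distance measure is assumed to be Euclidean; neither appears in the text above.

```python
# Minimal KNN sketch: stores the training set and classifies a query point
# by majority vote among its K nearest neighbors.
from collections import Counter
import numpy as np

class KNNClassifier:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # "Training" only stores the examples; no data model is extracted.
        self.X = np.asarray(X, dtype=float)
        self.y = np.asarray(y)
        return self

    def predict_one(self, x):
        # Euclidean distance d(x_i, x_j) from the query to every stored vector.
        dists = np.linalg.norm(self.X - np.asarray(x, dtype=float), axis=1)
        nearest = np.argsort(dists)[:self.k]   # indices of the K closest examples
        votes = Counter(self.y[nearest])       # majority rule over their labels
        return votes.most_common(1)[0][0]

    def predict(self, X):
        return np.array([self.predict_one(x) for x in X])
```

With k = 1 this reduces to the plain NN classifier; larger k trades training-set accuracy for better generalization, as discussed above.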
6.1.2 Decision Trees
Several supervised learning techniques have been proposed in the literature. One
example is decision trees. Breiman et al. [36] proposed classification and regression
trees (CART) for supervised learning problems. Here, the input-vector components
represent attributes of the examples. The inner nodes of decision trees use one at-
tribute at a time to recursively split the input space. Depending on the node's deci-
sion, one of its child nodes is visited until a leaf node is reached. The leaves store
the output of the tree. The order in which the individual attributes are queried is
important for the performance of decision trees. Several algorithms have been pro-
posed for the top-down inference of decision trees from a dataset. One example is
the ID3 algorithm, proposed by Quinlan [181] for classification. It first asks the
question that is most informative about the class. Decision trees can learn any consistent
dataset perfectly, but this may not be desirable. For this reason, ID3 has been ex-
tended to the C4.5 algorithm [182], which uses a Shannon entropy criterion for
pruning the trees. Another possibility to limit the tree size is to grow a tree only until ad-
ditional splitting of nodes produces no significant information gain. The preference
for small trees is motivated by the principle of Occam's razor [32], which prefers a
simpler explanation of a dataset over a more complicated one.
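The attribute-selection step can be illustrated with a small Python sketch that computes the information gain of each discrete attribute and queries the most informative one first, in the spirit of ID3. The helper names (entropy, information_gain, best_attribute) and the representation of examples as dictionaries mapping attribute names to discrete values are assumptions made for this sketch only.

```python
# ID3-style attribute selection for discrete attributes: the attribute with the
# largest information gain (reduction of Shannon entropy) is queried first.
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Entropy reduction obtained by splitting on one discrete attribute."""
    n = len(labels)
    partitions = {}
    for example, label in zip(examples, labels):
        partitions.setdefault(example[attribute], []).append(label)
    remainder = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder

def best_attribute(examples, labels, attributes):
    """Attribute to query first: the one most informative about the class."""
    return max(attributes, key=lambda a: information_gain(examples, labels, a))
```

Growing the tree only while best_attribute yields a significant gain corresponds to the early-stopping alternative to pruning mentioned above.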
Decision trees work well if the input attributes are discrete and the dimension-
ality of the input vectors is not too large. However, the class boundaries produced
by decision trees are parallel to the axes of the input space. This may
cause difficulties if the true class boundaries are not axis-aligned.
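To make the recursive splitting concrete, the following sketch shows how an inner node queries a single attribute against a threshold and delegates to one of its children until a leaf is reached, so that the resulting class boundaries are parallel to the input axes. The Node and Leaf classes are illustrative only and are not taken from the text.

```python
# Minimal sketch of decision-tree recall: each inner node tests one attribute
# against a threshold (an axis-parallel split); leaves store the tree's output.
class Leaf:
    def __init__(self, output):
        self.output = output            # class label (or regression value) stored in the leaf

    def decide(self, x):
        return self.output

class Node:
    def __init__(self, attribute, threshold, left, right):
        self.attribute = attribute      # index of the attribute queried at this node
        self.threshold = threshold      # axis-parallel decision boundary
        self.left, self.right = left, right

    def decide(self, x):
        child = self.left if x[self.attribute] <= self.threshold else self.right
        return child.decide(x)

# Example: a depth-2 tree whose class boundaries are parallel to the input axes.
tree = Node(0, 0.5, Leaf("A"), Node(1, 2.0, Leaf("B"), Leaf("C")))
print(tree.decide([0.7, 3.1]))          # -> "C"
```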
6.1.3 Bayesian Classifier
The theoretically optimal classifier is based on Bayes' theorem of conditional prob-
ability. It is described by the Bayesian classification rule:
c(x) = argmax_j p(H_j) · p(x | H_j),
where p ( H j ) is the a-priori probability of class j , p ( x | H j ) describes the distri-
bution of the examples from class j in the input space, and c ( x ) is the classification
produced.
Equivalently, the rule selects the class with the largest posterior probability,

c(x) = argmax_j p(H_j | x) = argmax_j p(H_j) · p(x | H_j) / p(x),

where the evidence p(x) can be omitted because it does not depend on j.
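A minimal implementation of this classification rule could look as follows. Modeling the class-conditional densities p(x | H_j) as Gaussians with diagonal covariance is an assumption made purely for illustration (the text does not prescribe a density model), as is the estimation of the priors p(H_j) from class frequencies.

```python
# Sketch of the Bayesian classification rule c(x) = argmax_j p(H_j) * p(x | H_j).
# Class-conditional densities are modeled as diagonal Gaussians (illustrative choice).
import numpy as np

class GaussianBayesClassifier:
    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y)
        self.classes = np.unique(y)
        self.priors, self.means, self.vars = {}, {}, {}
        for c in self.classes:
            Xc = X[y == c]
            self.priors[c] = len(Xc) / len(X)      # p(H_j), estimated from class frequencies
            self.means[c] = Xc.mean(axis=0)
            self.vars[c] = Xc.var(axis=0) + 1e-9   # small constant avoids division by zero
        return self

    def predict_one(self, x):
        x = np.asarray(x, dtype=float)
        def log_score(c):
            # log p(H_j) + log p(x | H_j) for a diagonal Gaussian density
            log_lik = -0.5 * np.sum(np.log(2 * np.pi * self.vars[c])
                                    + (x - self.means[c]) ** 2 / self.vars[c])
            return np.log(self.priors[c]) + log_lik
        return max(self.classes, key=log_score)
```

Working with log-probabilities leaves the argmax unchanged while avoiding numerical underflow for high-dimensional inputs.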