
22.2.4 K Nearest Neighbor (KNN)

The KNN classifier is an instance-based learning algorithm that is based on a distance function for pairs of observations, such as the Euclidean distance or the Manhattan distance. The basic idea of KNN is very simple. In this classification paradigm, the k nearest neighbors of a test sample are retrieved first. Then the similarities between the test sample and the k nearest neighbors are aggregated according to the class of the neighbors, and the test sample is assigned to the most similar class.

The best choice of k depends upon the data; generally, larger values of k reduce the effect of noise on the classification, but make boundaries between classes less distinct. A good k can be selected by various heuristic techniques, for example, cross-validation. The special case where the class is predicted to be the class of the closest training sample (i.e., when k = 1) is called the nearest neighbor algorithm.
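Selecting k by cross-validation can be sketched in a few lines; the leave-one-out variant below is a minimal illustration in plain Python (the data, the candidate values of k, and the helper names are hypothetical):

```python
from collections import Counter
import math

def knn_predict(train, point, k):
    # Sort training samples by distance and take a majority vote
    # among the k nearest.
    dists = sorted((math.dist(x, point), y) for x, y in train)
    return Counter(y for _, y in dists[:k]).most_common(1)[0][0]

def loocv_error(data, k):
    """Leave-one-out error: predict each sample from all the others."""
    mistakes = sum(
        knn_predict(data[:i] + data[i + 1:], x, k) != y
        for i, (x, y) in enumerate(data)
    )
    return mistakes / len(data)

data = [((0.0,), "A"), ((0.5,), "A"), ((0.9,), "A"),
        ((2.0,), "B"), ((2.4,), "B"), ((3.0,), "B")]
# Pick the candidate k with the lowest leave-one-out error.
best_k = min([1, 3, 5], key=lambda k: loocv_error(data, k))
```

On this toy dataset, k = 5 forces every vote to include more samples from the opposite class, so its leave-one-out error is worse than that of k = 1 or k = 3, illustrating how too large a k blurs the class boundary.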

The accuracy of the KNN algorithm can be severely degraded by the presence

of noisy or irrelevant features, or if the feature scales are not consistent with their

importance. This is mainly because the similarity measure in KNN uses all features

equally. Much research effort has been put into selecting or scaling features to improve classification.
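One common remedy for inconsistent feature scales is to rescale each feature to a common range before computing distances. The min-max scaling sketch below is one such illustration (the helper name and data are hypothetical):

```python
def min_max_scale(rows):
    """Rescale each feature (column) to [0, 1] so that no feature
    dominates the distance merely because of its units."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        tuple((v - l) / (h - l) if h > l else 0.0
              for v, l, h in zip(row, lo, hi))
        for row in rows
    ]

# One feature spans 1-3, the other 1000-3000: unscaled, the second
# feature would swamp a Euclidean distance computation.
rows = [(1.0, 1000.0), (2.0, 3000.0), (3.0, 2000.0)]
scaled = min_max_scale(rows)  # both columns now lie in [0, 1]
```

After scaling, each feature contributes comparably to the distance, which is the precondition for KNN's equal-weight similarity measure to behave sensibly.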

The naive version of the KNN algorithm is easy to implement by computing

the distances from the test sample to all stored training samples, but it is computationally intensive, especially when the size of the training set grows. Many nearest

neighbor search algorithms have been proposed over the years; these generally seek

to reduce the number of distance evaluations actually performed. Using an appropriate nearest neighbor search algorithm makes KNN computationally tractable even

for large datasets.

The nearest neighbor algorithm has some strong consistency results. As the

amount of data approaches infinity, the algorithm is guaranteed to yield an error

rate no worse than twice the Bayes error rate (the minimum achievable error rate

given the distribution of the data).


22.3 Statistical Learning Theory

Statistical learning theory provides a framework for studying the problems of gaining knowledge, making predictions, making decisions, or constructing models from a set of data. Its goals are to formalize these problems, to describe existing learning algorithms more precisely, and to guide the development of new or improved algorithms.

In particular, when we are interested in supervised learning, the task is as follows.

Given a set of training data consisting of label-instance pairs (e.g., the label is either +1 or −1), a model is automatically constructed from the data using a specific algorithm. The model maps instances to labels, and should make few mistakes when predicting the labels of unseen instances. Of course, it is always possible to build a model that exactly fits the training data. However, in the presence of noise, this