Information Technology Reference
22.2.4 K Nearest Neighbor (KNN)
KNN classifier is an instance-based learning algorithm that is based on a distance
function for pairs of observations, such as the Euclidean distance and the Manhattan
distance. The basic idea of KNN is very simple. In this classification paradigm, k
nearest neighbors of a test sample are retrieved first. Then the similarities between
the test sample and the k nearest neighbors are aggregated according to the class of
the neighbors, and the test sample is assigned to the most similar class.
The best choice of k depends upon the data; generally, larger values of k reduce
the effect of noise on the classification, but make boundaries between classes less
distinct. A good k can be selected by various heuristic techniques, for example,
cross-validation. The special case where the class is predicted to be the class of the
closest training sample (i.e. when k
1) is called the nearest neighbor algorithm.
The accuracy of the KNN algorithm can be severely degraded by the presence
of noisy or irrelevant features, or if the feature scales are not consistent with their
importance. This is mainly because the similarity measure in KNN uses all features
equally. Much research effort has been put into selecting or scaling features to im-
The naive version of the KNN algorithm is easy to implement by computing
the distances from the test sample to all stored training samples, but it is computa-
tionally intensive, especially when the size of the training set grows. Many nearest
neighbor search algorithms have been proposed over the years; these generally seek
to reduce the number of distance evaluations actually performed. Using an appropri-
ate nearest neighbor search algorithm makes KNN computationally tractable even
for large datasets.
The nearest neighbor algorithm has some strong consistency results. As the
amount of data approaches infinity, the algorithm is guaranteed to yield an error
rate no worse than twice the Bayes error rate (the minimum achievable error rate
given the distribution of the data).
22.3 Statistical Learning Theory
Statistical learning theory provides a framework for studying the problems of gain-
ing knowledge, making predictions, making decisions, or constructing models from
a set of data. Statistical learning theory is to formalize the above problems, to de-
scribe existing learning algorithms more precisely, and to guide the development of
new or improved algorithms.
In particular, when we are interested in supervised learning, the task is as follows.
Given a set of training data consisting of label-instance pairs (e.g., the label is either
1), a model is automatically constructed from the data using a specific al-
gorithm. The model maps instances to labels, and should make few mistakes when
predicting the labels of unseen instances. Of course, it is always possible to build
a model that exactly fits the training data. However, in the presence of noise, this