6.4.2 Convergence Theorem for the Perceptron Algorithm
Theorem. If the examples of the training set are linearly separable, the perceptron algorithm finds a solution in a finite number of iterations.
In order to prove the theorem, we assume that the weights are initialized to zero, i.e., using the tabula rasa option. Strictly speaking, that assumption is not necessary; it simply makes the proof easier than starting from arbitrary weight values.
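For concreteness, a minimal sketch of the learning rule under discussion is given below in Python. It is illustrative only (the function name perceptron, the data layout, and the unit learning rate are assumptions, not notation from the book), with the weights initialized to zero as just described.

import numpy as np

def perceptron(X, y, max_iter=10_000):
    """Perceptron learning rule with tabula rasa (zero) initialization.

    X: array of shape (M, n), one example per row.
    y: array of shape (M,), class labels in {-1, +1}.
    Returns the weight vector and the number of corrections performed.
    """
    M, n = X.shape
    w = np.zeros(n)                     # tabula rasa initialization
    for updates in range(max_iter):
        z = y * (X @ w)                 # aligned fields z_k = y_k (x_k . w)
        wrong = np.flatnonzero(z <= 0)  # misclassified examples
        if wrong.size == 0:             # every aligned field positive: converged
            return w, updates
        k = wrong[0]                    # pick one misclassified example
        w = w + y[k] * X[k]             # perceptron correction
    raise RuntimeError("no solution found within max_iter corrections")

For linearly separable data, the theorem guarantees that this loop terminates; the bound T discussed below limits the number of corrections it can perform.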
Since the examples of the training set L_M are assumed to be linearly separable, there exists a weight vector w, hereinafter called the reference perceptron, that correctly classifies all the examples. Without any loss of generality, we assume that w is unitary; otherwise, it would suffice to divide its components by its norm. The stabilities of the examples in L_M with respect to the reference hyperplane are all positive. Since w is unitary, they are identical to the corresponding aligned fields,

$$\gamma_k = y_k \, x_k \cdot w = z_k .$$
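For reference, the general definition of the stability of example k with respect to a hyperplane of weight vector w, on which the previous sentence relies, is the aligned field normalized by the norm of the weights,

$$\gamma_k = \frac{y_k \, x_k \cdot w}{\|w\|} = \frac{z_k}{\|w\|},$$

which indeed reduces to z_k when ‖w‖ = 1.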
To prove the theorem, we derive upper and lower bounds on the norm of the weight vector generated by the algorithm. Those bounds (see Sect. 6.8 “Additional material” at the end of the chapter; a compact version of the argument is also sketched below) are increasing functions of the number of iterations t, but they increase at different rates. The lower bound increases linearly with the number of iterations t, whereas the upper bound increases more slowly, as √t (see Fig. 6.8). The bounds would cross each other, which is absurd, after a number of iterations T given by
$$T = \left(\frac{x_{\max}}{\gamma_{\min}}\right)^{2},$$

where x_max is the norm of the example of L_M with the largest norm, and γ_min is the stability of the example of L_M that has the smallest stability
with respect to the reference perceptron hyperplane. Thus, the perceptron
algorithm must converge, since the number of iterations cannot be larger
than T. Notice that the learning time may be very long, especially if there are examples very close to the reference hyperplane (γ_min small relative to x_max). However, the algorithm may converge in times much smaller than that given by the above relation, for two reasons:
first, because the reference hyperplane w is arbitrary, the value of γ_min that appears in the bound may be particularly small, so that T grossly overestimates the actual number of iterations;
second, the learning time is a random variable that depends on the particular sequence of examples selected for the successive updates.
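For completeness, a compact version of the bounding argument may be sketched as follows (the full derivation is in Sect. 6.8; the notation w(t) for the weight vector after t corrections is introduced here for convenience). Each correction increases w · w(t) by at least γ_min, so that, by the Cauchy-Schwarz inequality and ‖w‖ = 1,

$$\|w(t)\| \;\ge\; w \cdot w(t) \;\ge\; t\,\gamma_{\min} \qquad \text{(lower bound)}.$$

Each correction is performed on an example whose aligned field is non-positive, so that

$$\|w(t)\|^{2} \;\le\; t\,x_{\max}^{2}, \qquad\text{i.e.}\qquad \|w(t)\| \;\le\; \sqrt{t}\,x_{\max} \qquad \text{(upper bound)}.$$

Requiring the lower bound not to exceed the upper bound gives

$$t\,\gamma_{\min} \;\le\; \sqrt{t}\,x_{\max} \quad\Longrightarrow\quad t \;\le\; \left(\frac{x_{\max}}{\gamma_{\min}}\right)^{2} = T.$$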
Remark 1. The expression for the number of iterations T has a simple intuitive interpretation. The correction to the weights performed at each iteration is bounded, because its norm cannot be larger than the norm of the example
 