d = 2 we get a decision surface in three-dimensional space, which for a linear activation function is a plane (the linear discriminant of (3.28)) and for a squashing function an S-folded plane; the decision border is always a line.
In the present section we analyze the perceptron as a regression-like machine: the learning algorithm iteratively drives the weights such that the continuous output y_i = ϕ(w^T x_i + w_0) approximates the target value t_i for every x_i. The error r.v., whose instantiations are e_i = t_i − y_i, is therefore a continuous random variable.
In order to derive analytical expressions of the theoretical EEs for the perceptron (and other machines as well), one is compelled to apply transformations of the input distributions. The well-known theorem of univariate r.v. transformation (see e.g., [183]) is enough for our purposes:
Theorem 3.2. Let f(x) be the PDF of the r.v. X. Assume ϕ(x) to be a monotonic and differentiable function. If g(y) is the PDF of Y = ϕ(X) and ϕ'(x) ≠ 0, ∀x ∈ X, then

g(y) = f(ϕ⁻¹(y)) / |ϕ'(ϕ⁻¹(y))|   for inf ϕ(x) < y < sup ϕ(x),
g(y) = 0   otherwise,                                            (3.41)

where x = ϕ⁻¹(y) is the inverse function of y = ϕ(x).
Note that sigmoidal activation functions satisfy the conditions of the theorem.
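As a quick numerical check of Theorem 3.2, and since tanh is the activation used below, the following sketch (with X ~ N(0, 1) as a purely illustrative choice, and NumPy as the only dependency) compares the density g(y) given by (3.41) for Y = tanh(X) with a histogram of transformed samples:

import numpy as np

# Illustration of Theorem 3.2 / (3.41) with phi(x) = tanh(x).
# Assumed example: X ~ N(0, 1). Then phi'(x) = 1 - tanh(x)^2 = 1 - y^2 and
# g(y) = f(arctanh(y)) / (1 - y^2) on (-1, 1), and g(y) = 0 otherwise.

def f(x):
    # PDF of X: standard normal (an assumed, illustrative choice)
    return np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

def g(y):
    # Transformed PDF obtained from (3.41)
    return f(np.arctanh(y)) / (1.0 - y**2)

rng = np.random.default_rng(0)
y_samples = np.tanh(rng.standard_normal(200_000))

hist, edges = np.histogram(y_samples, bins=60, range=(-1.0, 1.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print("max |histogram - g| over bin centers:", np.max(np.abs(hist - g(centers))))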
3.3.1 Motivational Examples
Two examples are now presented which constitute a good illustration of how the Shannon EE-risk perceptron, with ϕ(·) = tanh(·) and T = {−1, 1}, performs when applied to two-class, two-dimensional datasets. They involve artificial data for which the min P_e values are known. We will leave the application to real-world datasets to a later section. We will also use these examples to discuss some convergence issues.
Example 3.7. In this example we use two Gaussian distributed class-conditional PDFs, g(x; μ_t, Σ_t), with μ_1 = [0 0]^T, Σ_1 = I, and μ_{-1} = [1.5 0.5]^T,

Σ_{-1} = [ 1.1  0.3
           0.3  1.5 ].
Let us consider 300-instance training and test datasets (150 instances per class) and apply the (Shannon) MEE gradient descent algorithm with h = 1 and η = 0.001. Note that according to formula (E.19) the optimal bandwidth for n = 150 is h_IMSE = 0.39. We are, therefore, using fat estimation of the error PDF.
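A minimal sketch of how such an experiment can be set up is given below. It assumes the data model of Example 3.7, a Gaussian kernel (Parzen window) estimate of the error PDF with bandwidth h, and plain batch gradient descent on the resubstitution Shannon error entropy with learning rate η; the initialization, number of epochs and stopping rule are arbitrary choices, and the exact update rule used in the text may differ.

import numpy as np

rng = np.random.default_rng(0)

# Data of Example 3.7: 150 instances per class, targets in {-1, +1}.
n = 150
X = np.vstack([
    rng.multivariate_normal([0.0, 0.0], np.eye(2), n),                  # class t = +1
    rng.multivariate_normal([1.5, 0.5], [[1.1, 0.3], [0.3, 1.5]], n),   # class t = -1
])
t = np.hstack([np.ones(n), -np.ones(n)])

def shannon_ee_and_grad(w, b, X, t, h):
    """KDE (Parzen) estimate of the Shannon entropy of the errors
    e_i = t_i - tanh(w.x_i + b) and its gradient w.r.t. (w, b)."""
    y = np.tanh(X @ w + b)
    e = t - y
    d = e[:, None] - e[None, :]                       # pairwise differences e_i - e_j
    G = np.exp(-d**2 / (2 * h**2)) / (h * np.sqrt(2 * np.pi))
    f = G.mean(axis=1)                                # error PDF estimate at each e_i
    H = -np.mean(np.log(f))                           # resubstitution Shannon EE
    dy = 1.0 - y**2                                   # tanh'(z_i)
    de_dw = -dy[:, None] * X                          # de_i/dw
    de_db = -dy                                       # de_i/db
    K = -G * d / h**2                                 # derivative of the Gaussian kernel
    grad_f_w = (K[:, :, None] * (de_dw[:, None, :] - de_dw[None, :, :])).mean(axis=1)
    grad_f_b = (K * (de_db[:, None] - de_db[None, :])).mean(axis=1)
    grad_H_w = -np.mean(grad_f_w / f[:, None], axis=0)
    grad_H_b = -np.mean(grad_f_b / f)
    return H, grad_H_w, grad_H_b

h, eta = 1.0, 0.001                                   # values quoted in the text
w, b = 0.01 * rng.standard_normal(2), 0.0             # arbitrary small initialization

for epoch in range(3000):                             # arbitrary number of epochs
    H, gw, gb = shannon_ee_and_grad(w, b, X, t, h)
    w -= eta * gw
    b -= eta * gb

err = np.mean(np.sign(X @ w + b) != t)
print(f"final entropy estimate: {H:.4f}, training error rate: {err:.3f}")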
 