For $d = 2$ we get a decision surface in three-dimensional space, which for a linear activation function is a plane (the linear discriminant of (3.28)) and for a squashing function an S-folded plane; the decision border is always a line.
In the present section we analyze the perceptron as a regression-like machine: the learning algorithm iteratively drives the weights such that the continuous output $y_i = \varphi(w^T x_i + w_0)$ approximates the target value $t_i$ for every $x_i$. The error r.v., whose instantiations are $e_i = t_i - y_i$, is therefore a continuous random variable.
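For concreteness, a single forward pass and error computation can be written in a few lines. The following is a minimal sketch with arbitrary placeholder weights, taking $\varphi = \tanh$ as one example of a squashing function:

```python
import numpy as np

w, w0 = np.array([0.8, -0.4]), 0.1     # placeholder weights (arbitrary)
x_i, t_i = np.array([1.5, 0.5]), 1.0   # one input instance and its target

y_i = np.tanh(w @ x_i + w0)            # continuous perceptron output
e_i = t_i - y_i                        # one instantiation of the error r.v.
print(y_i, e_i)
```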
In order to derive analytical expressions of the theoretical EEs for the
perceptron (and other machines as well), one is compelled to apply transfor-
mations of the input distributions. The well-known theorem of univariate r.v.
transformation (see e.g., [183]) is enough for our purposes:
Theorem 3.2. Let $f(x)$ be the PDF of the r.v. $X$. Assume $\varphi(x)$ to be a monotonic and differentiable function. If $g(y)$ is the PDF of $Y = \varphi(X)$ and $\varphi'(x) \neq 0$, $\forall x \in X$, then

$$g(y) = \begin{cases} \dfrac{f\left(\varphi^{-1}(y)\right)}{\left|\varphi'\left(\varphi^{-1}(y)\right)\right|}, & \inf \varphi(x) < y < \sup \varphi(x) \\[6pt] 0, & \text{otherwise} \end{cases} \qquad (3.41)$$

where $x = \varphi^{-1}(y)$ is the inverse function of $y = \varphi(x)$.
Note that sigmoidal activation functions satisfy the conditions of the theorem.
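As a quick numerical illustration of (3.41), consider $\varphi = \tanh$ and a standard Gaussian $X$; then $\varphi'(x) = 1 - \tanh^2(x) = 1 - y^2$ and $g(y) = f(\operatorname{arctanh} y)/(1 - y^2)$ on $(-1, 1)$. The following Python sketch (our own illustration, with an arbitrary sample size and bin grid) checks this expression against a histogram of transformed samples:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(500_000)
y = np.tanh(x)                              # Y = phi(X)

def f(x):                                   # PDF of X ~ N(0, 1)
    return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

def g(y):                                   # Theorem 3.2 applied to phi = tanh
    x_inv = np.arctanh(y)                   # phi^{-1}(y)
    return f(x_inv) / (1 - y**2)            # |phi'(phi^{-1}(y))| = 1 - y^2

hist, edges = np.histogram(y, bins=50, range=(-0.99, 0.99), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - g(centers))))    # small: empirical matches analytic
```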
3.3.1 Motivational Examples
Two examples are now presented which constitute a good illustration of how the Shannon EE-risk perceptron, with $\varphi(\cdot) = \tanh(\cdot)$ and $T = \{-1, 1\}$, performs when applied to two-class, two-dimensional datasets. They involve artificial data for which the $\min P_e$ values are known. We leave the application to real-world datasets to a later section. We will also use these examples to discuss some convergence issues.
Example 3.7. In this example we use two Gaussian distributed class-conditional PDFs, $g(x; \mu_t, \Sigma_t)$, with $\mu_{-1} = [0\ \ 0]^T$, $\Sigma_{-1} = I$, and $\mu_1 = [1.5\ \ 0.5]^T$,

$$\Sigma_1 = \begin{bmatrix} 1.1 & 0.3 \\ 0.3 & 1.5 \end{bmatrix}.$$
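Since both class-conditional PDFs are fully specified, the minimum attainable error probability can be estimated numerically. The following Monte Carlo sketch, merely illustrative and not the book's computation, classifies each sample by the larger class-conditional likelihood (assuming equal priors) and counts the misclassifications:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
g_n = multivariate_normal([0.0, 0.0], np.eye(2))                  # class -1
g_p = multivariate_normal([1.5, 0.5], [[1.1, 0.3], [0.3, 1.5]])   # class 1

m = 200_000                                      # Monte Carlo sample size
x_n = g_n.rvs(m, random_state=rng)
x_p = g_p.rvs(m, random_state=rng)

# With equal priors the Bayes rule picks the larger likelihood; an error
# occurs whenever the other class's likelihood wins on a sample.
pe = 0.5 * np.mean(g_p.pdf(x_n) > g_n.pdf(x_n)) \
   + 0.5 * np.mean(g_n.pdf(x_p) > g_p.pdf(x_p))
print(f"estimated min Pe ~ {pe:.4f}")
```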
Let us consider 300-instance training and test datasets (150 instances per class) and apply the (Shannon) MEE gradient descent algorithm with $h = 1$ and $\eta = 0.001$. Note that, according to formula (E.19), the optimal bandwidth for $n = 150$ is $h_{\text{IMSE}} = 0.39$. We are, therefore, using fat estimation of the error PDF.
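A minimal NumPy sketch of this experiment is given below. It performs batch gradient descent on the Parzen-window estimate of the Shannon error entropy, $\hat{H} = -\frac{1}{n}\sum_i \ln \hat{f}(e_i)$, with Gaussian kernels of bandwidth $h = 1$ and $\eta = 0.001$ as above; the epoch count, the weight initialization, and the batch (rather than sample-by-sample) update are our own assumptions, not prescriptions of the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample the two Gaussian classes of Example 3.7 (150 instances per class).
n = 150
X = np.vstack([
    rng.multivariate_normal([0.0, 0.0], np.eye(2), n),
    rng.multivariate_normal([1.5, 0.5], [[1.1, 0.3], [0.3, 1.5]], n),
])
t = np.hstack([-np.ones(n), np.ones(n)])         # targets T = {-1, 1}
Xb = np.hstack([np.ones((2 * n, 1)), X])         # prepend bias input

h, eta, epochs = 1.0, 0.001, 400                 # epochs: assumed value
w = rng.normal(scale=0.1, size=3)                # assumed initialization

for _ in range(epochs):
    y = np.tanh(Xb @ w)
    e = t - y                                    # error instantiations e_i
    de = -(1.0 - y**2)[:, None] * Xb             # d e_i / d w
    D = e[:, None] - e[None, :]                  # pairwise e_i - e_j
    K = np.exp(-D**2 / (2 * h**2)) / (h * np.sqrt(2 * np.pi))
    f_hat = K.mean(axis=1)                       # Parzen estimate of f(e_i)
    A = K * (-D / h**2)                          # Gaussian kernel derivative
    # d f(e_i)/dw = (1/N) sum_j A_ij (de_i/dw - de_j/dw)
    df = (A.sum(axis=1)[:, None] * de - A @ de) / len(e)
    gH = -(df / f_hat[:, None]).mean(axis=0)     # gradient of Shannon EE
    w -= eta * gH                                # minimize the error entropy

train_err = np.mean(np.sign(np.tanh(Xb @ w)) != t)
print(f"training error: {train_err:.3f}")
```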