where $\varepsilon_i$ is an error term that captures either the unmodeled effects or random noise. Since we usually do not know much about this error term, a simple way is to assume that the $\varepsilon_i$ are independently and identically distributed (i.i.d.) according to a Gaussian distribution: $\varepsilon_i \sim N(0, \sigma^2)$. In other words, we assume that the probability density of $\varepsilon_i$ is

$$p(\varepsilon_i) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{\varepsilon_i^2}{2\sigma^2}\right).$$
Accordingly, we have

$$p(y_i \mid x_i; w) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y_i - w^T x_i)^2}{2\sigma^2}\right).$$
Given the above assumptions, we can write the conditional likelihood of the training data as

$$l(w) = \prod_{i=1}^{n} p(y_i \mid x_i; w) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y_i - w^T x_i)^2}{2\sigma^2}\right).$$
The log likelihood can then be written as

$$\log l(w) = n \log\frac{1}{\sqrt{2\pi}\sigma} - \frac{1}{\sigma^2}\cdot\frac{1}{2}\sum_{i=1}^{n}\left(y_i - w^T x_i\right)^2.$$
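As a concrete check of this expression, the log likelihood can be computed directly from a design matrix whose rows are the $x_i$. The sketch below is illustrative only; the names `gaussian_log_likelihood`, `X`, `y`, and `sigma` are our own, not from the text.

```python
import numpy as np

def gaussian_log_likelihood(w, X, y, sigma):
    """Compute log l(w) = n*log(1/(sqrt(2*pi)*sigma))
    - (1/sigma^2) * (1/2) * sum_i (y_i - w^T x_i)^2,
    where the rows of X are the input vectors x_i."""
    n = len(y)
    residuals = y - X @ w
    return (n * np.log(1.0 / (np.sqrt(2.0 * np.pi) * sigma))
            - 0.5 / sigma**2 * residuals @ residuals)
```

With a perfect fit (all residuals zero), the second term vanishes and only the constant $n \log\frac{1}{\sqrt{2\pi}\sigma}$ remains, which matches the formula above.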
Now we maximize this log likelihood in order to obtain the optimal parameter $w$. It is not difficult to see that this is equivalent to minimizing the following least-square loss function:

$$L(w) = \frac{1}{2}\sum_{i=1}^{n}\left(w^T x_i - y_i\right)^2.$$

The above analysis shows that, under certain probabilistic assumptions on the data, least-square regression corresponds to finding the maximum likelihood estimate of $w$.
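This equivalence can be illustrated numerically: the closed-form least-square solution should attain a Gaussian log likelihood at least as high as that of any other parameter vector. The sketch below uses synthetic data, and all names (`true_w`, `w_ls`, `log_likelihood`) are our own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)
n, d = 100, 3
X = rng.normal(size=(n, d))
true_w = np.array([2.0, -1.0, 0.5])
sigma = 0.3
# Data generated according to the assumed model: y_i = w^T x_i + eps_i
y = X @ true_w + sigma * rng.normal(size=n)

# Least-square estimate: minimizes L(w) = 1/2 * sum_i (w^T x_i - y_i)^2
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

def log_likelihood(w):
    """Gaussian log likelihood of w given the data above."""
    r = y - X @ w
    return n * np.log(1.0 / (np.sqrt(2.0 * np.pi) * sigma)) - 0.5 / sigma**2 * r @ r

# The least-square solution maximizes the log likelihood, so random
# perturbations of it can only decrease the likelihood.
for _ in range(5):
    w_pert = w_ls + 0.05 * rng.normal(size=d)
    assert log_likelihood(w_ls) >= log_likelihood(w_pert)
```

Because the loss is a convex quadratic, the least-square minimizer is unique (for a full-rank design matrix) and coincides with the maximum likelihood estimate regardless of the value of $\sigma$, since $\sigma$ only scales and shifts the log likelihood.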
22.2 Classification

The literature on classification is considerably richer than that on regression. Many classification methods have been proposed, with different loss functions and different formulations. In this section, we will take binary classification as an example to illustrate several widely used classification algorithms that are most relevant to this topic.