where $\varepsilon_i$ is an error term that captures either the unmodeled effects or random noise.
Since we usually do not know much about this error term, a simple way is to assume
that the $\varepsilon_i$ are independently and identically distributed (i.i.d.) according to a Gaussian distribution: $\varepsilon_i \sim N(0, \sigma^2)$. In other words, we assume that the probability
density of $\varepsilon_i$ is

$$ p(\varepsilon_i) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{\varepsilon_i^2}{2\sigma^2}\right). $$
Accordingly, since $y_i = w^T x_i + \varepsilon_i$, we have

$$ p(y_i \mid x_i; w) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(y_i - w^T x_i)^2}{2\sigma^2}\right). $$
Given the above assumptions, we can write the conditional likelihood of the training data as

$$ l(w) = \prod_{i=1}^{m} p(y_i \mid x_i; w) = \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(y_i - w^T x_i)^2}{2\sigma^2}\right). $$
The log likelihood can then be written as

$$ \log l(w) = m \log\frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2}\cdot\frac{1}{2}\sum_{i=1}^{m} (y_i - w^T x_i)^2. $$
Now we maximize this log likelihood in order to obtain the optimal parameter $w$. It
is not difficult to see that this is equivalent to minimizing the following least-square
loss function:

$$ L(w) = \frac{1}{2}\sum_{i=1}^{m} (y_i - w^T x_i)^2. $$
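To make this equivalence explicit, observe that the first term of the log likelihood does not depend on $w$ and that $1/\sigma^2$ is a positive constant, so

$$ \arg\max_{w} \log l(w) = \arg\min_{w} \frac{1}{2}\sum_{i=1}^{m} (y_i - w^T x_i)^2 = \arg\min_{w} L(w). $$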
The above analysis shows that under certain probabilistic assumptions on the
data, least-square regression corresponds to finding the maximum likelihood estimate of $w$.
22.2 Classification
The literature on classification is richer than that of regression. Many classification methods have been proposed, with different loss functions and different
formulations. In this section, we take binary classification as an example to illustrate several widely used classification algorithms that are most relevant to
this topic.