Information Technology Reference
Given the training set, in order to learn the hypothesis, a loss function is needed.
In particular, the loss function can be defined as follows.
w T x i − y i 2 .
In order to find the optimal w that can minimize the above loss function, one
can simply use the gradient descent method, since this loss function is convex. By
some simple mathematical tricks, one can get the following update rule for gradient
descent, which we usually call the least mean squares (LMS) update rule:
y i −
w T x i x i,j ,
w j =
w j +
where j is the index for the features, and x i,j means the j th feature of input variable
x i .
The above update rule is quite intuitive. The magnitude of the update is propor-
tional to the error term (y i −
w T x i ) . As a result, if we encounter a training example
on which the hypothesis can predict the target value in a perfect manner, there will
be no need to change the parameters. In contrast, if the prediction made by the hy-
pothesis has a large error, a large change will be made to the parameters.
Note that the above update rule makes use of the entire training set in each step
of the update. There is an alternative approach, in which we only use one training
example in each step. This is usually called a stochastic gradient descent method,
and the original update rule corresponds to the so-called batch gradient descend
method. Often the stochastic gradient method gets the loss function close to the
minimum much faster than the batch gradient descent method, mainly because it
updates the parameter more frequently in the training process. In this regard, es-
pecially when the training set is very large, the stochastic gradient descent method
is usually preferred. However, when using the stochastic gradient descent method,
there is the risk of never converging to the minimum when the parameter w keeps
oscillating around the minimum of L(w) , although in practice most of the values
near the minimum will have been good enough.
22.1.2 Probabilistic Explanation
There are many different ways of explaining the above linear regression algorithm,
among which the probabilistic explanation shows that the least-square linear regres-
sion is a very natural algorithm.
Let us consider the following relationship between the target variable and the
w T x i +
y i =
ε i ,