The primary requirements are that there is no correlation between the
different x variables and that the standard deviation of the y term does
not depend on the value of x (only the mean of y varies with x). The first
requirement is usually fairly easy to achieve by checking the correlations
between the various x variables under consideration and dropping one of
the two correlated variables from the equation. Another approach is to
transform the matrix of x values using an orthogonal transformation, such
as principal components analysis. This produces x values that are, by
definition, uncorrelated. The second requirement is usually assumed more
than it is ensured, but it is easy to check by inspecting the difference between
the observed y values and their predicted values after fitting the model.
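The correlation check on the x variables can be sketched in a few lines of plain Java. This is only an illustration: the class name CorrelationCheck, the method names, and the idea of comparing against a fixed threshold are assumptions made for the example, not something prescribed by the text.

```java
final class CorrelationCheck {
    /** Pearson correlation coefficient between two equally long samples. */
    static double pearson(double[] a, double[] b) {
        int n = a.length;
        double meanA = 0.0, meanB = 0.0;
        for (int i = 0; i < n; i++) {
            meanA += a[i];
            meanB += b[i];
        }
        meanA /= n;
        meanB /= n;
        double cov = 0.0, varA = 0.0, varB = 0.0;
        for (int i = 0; i < n; i++) {
            double da = a[i] - meanA;
            double db = b[i] - meanB;
            cov += da * db;
            varA += da * da;
            varB += db * db;
        }
        return cov / Math.sqrt(varA * varB);
    }

    /** True if the absolute correlation exceeds a caller-chosen (hypothetical) threshold. */
    static boolean tooCorrelated(double[] a, double[] b, double threshold) {
        return Math.abs(pearson(a, b)) > threshold;
    }

    public static void main(String[] args) {
        double[] x1 = {1, 2, 3, 4, 5};
        double[] x2 = {2, 4, 6, 8, 10};   // exact multiple of x1
        double[] x3 = {2, -1, -2, -1, 2}; // uncorrelated with x1
        System.out.println("x1 vs x2: " + pearson(x1, x2));
        System.out.println("x1 vs x3: " + pearson(x1, x3));
    }
}
```

A pair of variables flagged by tooCorrelated would be a candidate for dropping one of the two, as described above. The cutoff itself is a judgment call.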
Assuming these conditions are met and the values of x are placed into a
matrix X with k columns and n rows, where k is the number of different
variables and n is the number of observations, the following expression finds
a B vector that minimizes the mean square error:
B = (X^T X)^{-1} X^T y
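As a rough illustration, the expression for B can be evaluated directly by building X^T X and X^T y and solving the resulting k-by-k system. The sketch below uses plain Java with Gaussian elimination and partial pivoting. The class name NormalEquations and the method fit are hypothetical, and the example data assumes that an intercept is modeled by adding a column of ones to X.

```java
final class NormalEquations {
    /** Solves (X^T X) B = X^T y for B; x has n rows and k columns. */
    static double[] fit(double[][] x, double[] y) {
        int n = x.length;
        int k = x[0].length;
        // Build the augmented k x (k + 1) system [X^T X | X^T y].
        double[][] a = new double[k][k + 1];
        for (int i = 0; i < k; i++) {
            for (int j = 0; j < k; j++) {
                for (int r = 0; r < n; r++) a[i][j] += x[r][i] * x[r][j];
            }
            for (int r = 0; r < n; r++) a[i][k] += x[r][i] * y[r];
        }
        // Forward elimination with partial pivoting.
        for (int col = 0; col < k; col++) {
            int pivot = col;
            for (int r = col + 1; r < k; r++) {
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) pivot = r;
            }
            double[] tmp = a[col];
            a[col] = a[pivot];
            a[pivot] = tmp;
            if (a[col][col] == 0.0) {
                throw new ArithmeticException("X^T X is singular");
            }
            for (int r = col + 1; r < k; r++) {
                double f = a[r][col] / a[col][col];
                for (int c = col; c <= k; c++) a[r][c] -= f * a[col][c];
            }
        }
        // Back substitution on the upper triangular system.
        double[] b = new double[k];
        for (int i = k - 1; i >= 0; i--) {
            double s = a[i][k];
            for (int j = i + 1; j < k; j++) s -= a[i][j] * b[j];
            b[i] = s / a[i][i];
        }
        return b;
    }

    public static void main(String[] args) {
        // Data generated from y = 1 + 2x; the first column of ones models the intercept.
        double[][] x = {{1, 0}, {1, 1}, {1, 2}, {1, 3}};
        double[] y = {1, 3, 5, 7};
        double[] b = fit(x, y);
        System.out.println("intercept = " + b[0] + ", slope = " + b[1]);
    }
}
```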
The expression B = (X^T X)^{-1} X^T y is the solution to a linear system of
equations called the normal equations, (X^T X) B = X^T y. It is possible to
solve these directly using linear algebra libraries such as the Apache
Commons Math library. However, the direct computation can be numerically
unstable, because forming X^T X squares the condition number of X, so most
implementations use other techniques. There are a variety of options, but
one of the most common is the QR factorization. The QR factorization
states that any matrix A can be represented as the product of two matrices
Q and R, where Q is an orthonormal matrix (like those discussed earlier)
and R is an upper triangular matrix (meaning every entry below its diagonal
is zero, so roughly half of the values in a square R are zeroes). An
orthonormal matrix is special because Q^T Q = I, where I is the identity
matrix: a matrix of all zeroes, except its diagonal, which is filled with
ones. Substituting X = QR gives X^T X = R^T Q^T Q R = R^T R, so when X has
full column rank the normal equations reduce to R B = Q^T y, and the
equation for B is then:
B = R^{-1} Q^T y
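To make the QR route concrete, the following sketch factors X with modified Gram-Schmidt and then solves R B = Q^T y by back substitution, so R is never explicitly inverted. It is an illustration under stated assumptions rather than a production routine: the class name QrLeastSquares and the method fit are hypothetical, X is assumed to have full column rank, and Gram-Schmidt is used here only because it is short (library implementations generally prefer Householder reflections).

```java
final class QrLeastSquares {
    /** Least-squares coefficients via a thin QR factorization of x (n rows, k columns). */
    static double[] fit(double[][] x, double[] y) {
        int n = x.length;
        int k = x[0].length;
        double[][] q = new double[n][];
        double[][] r = new double[k][k];
        // q starts as a copy of X and is orthonormalized column by column.
        for (int i = 0; i < n; i++) q[i] = x[i].clone();
        for (int j = 0; j < k; j++) {
            double norm = 0.0;
            for (int i = 0; i < n; i++) norm += q[i][j] * q[i][j];
            norm = Math.sqrt(norm);
            if (norm == 0.0) {
                throw new ArithmeticException("columns of X are linearly dependent");
            }
            r[j][j] = norm;
            for (int i = 0; i < n; i++) q[i][j] /= norm;
            // Remove the component along column j from every later column.
            for (int c = j + 1; c < k; c++) {
                double dot = 0.0;
                for (int i = 0; i < n; i++) dot += q[i][j] * q[i][c];
                r[j][c] = dot;
                for (int i = 0; i < n; i++) q[i][c] -= dot * q[i][j];
            }
        }
        // Solve R B = Q^T y by back substitution (R is upper triangular).
        double[] b = new double[k];
        for (int j = k - 1; j >= 0; j--) {
            double s = 0.0;
            for (int i = 0; i < n; i++) s += q[i][j] * y[i];
            for (int c = j + 1; c < k; c++) s -= r[j][c] * b[c];
            b[j] = s / r[j][j];
        }
        return b;
    }

    public static void main(String[] args) {
        // Data generated from y = 1 + 2x; the first column of ones models the intercept.
        double[][] x = {{1, 0}, {1, 1}, {1, 2}, {1, 3}};
        double[] y = {1, 3, 5, 7};
        double[] b = fit(x, y);
        System.out.println("intercept = " + b[0] + ", slope = " + b[1]);
    }
}
```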
The QR form is much more numerically stable, although it does require the
computation of the QR factorization. Rather than implementing the
factorization by hand, it is usually better to use one of the many
libraries that provide it. For example, in Java there is the Apache
Commons Math library, available through Maven: