quadratic error, known as the coefficient of nondetermination, measures the
contribution of the model: the least expensive and least powerful model is the
model that predicts the output as the average value of the measured output,
irrespective of the input. For that model, the average quadratic error EQMr
is 1.
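This property is easy to verify numerically. The sketch below (variable and function names are mine, not the book's) takes EQMr to be the model's mean squared error divided by the variance of the measured output, so the trivial mean-predictor scores exactly 1:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 * x + rng.normal(0.0, 0.1, size=200)   # measured output

def eqm_r(y_true, y_pred):
    """Relative mean quadratic error: MSE of the model divided by the
    variance of the measured output (the coefficient of nondetermination)."""
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

# Trivial model: always predict the average of the measured output.
trivial = np.full_like(y, y.mean())
print(eqm_r(y, trivial))   # 1.0: the MSE of the mean equals the variance
```

Any model that uses the input at all should score below 1; a perfect model scores 0.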
3.3 Input Dimension Reduction
The design of the model g(x, w) may require a reduction in the dimension of
the input vector x. That is particularly important when the number of variables
is too large to be handled conveniently, or when the variables are assumed not
to be mutually independent. In the latter case, reducing their number simplifies
the design of the model. The resulting model is more robust with respect to the
variability of the data, and less sensitive to overfitting due to
over-parameterization (see Chap. 2).
In order to explore the structure of multidimensional data, the analysis is
based on observing the distribution of the variables in the input space.
When the number of factors is too high for visual analysis or numerical
processing, it must be decreased. In linear statistics, PCA (Principal
Component Analysis) is used to reduce the number of factors. The method
projects the data onto linear combinations of the factors, and provides a
more compact representation of the data.
In this section, we will review the principles of PCA; we will then discuss
CCA (Curvilinear Component Analysis), which may be viewed as a nonlinear
extension of PCA, well suited to representations of more complex data
structures. A parallel will be drawn with self-organizing Kohonen maps, which are
also used for nonlinear data analysis.
3.4 Principal Component Analysis
Principal component analysis is one of the oldest statistical analysis
techniques. It was developed to study samples of individuals described by
several factors. The method is therefore suited to the analysis of
multidimensional data: in general, the separate study of each factor is not
sufficient, since it does not allow for the detection of possible dependencies
between factors.
3.4.1 Principle of PCA
To reduce the number of factors (components), PCA constructs sub-spaces of
the input space (also termed representation space), whose dimensions are
therefore smaller than the number of factors, in which the distribution of
the observations (points) is as similar as possible to their distribution in
representation space. The similarity criterion is the total inertia of the
scatter diagram.
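A minimal sketch of this principle, under the standard construction (not the book's notation): the principal axes are the eigenvectors of the covariance matrix of the centered data, and the total inertia of the scatter diagram equals the sum of the eigenvalues, so the fraction retained by a projection can be read off directly:

```python
import numpy as np

rng = np.random.default_rng(1)
# 200 observations of 3 correlated factors (an arbitrary mixing matrix).
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.5, 0.1],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.2]])
Xc = X - X.mean(axis=0)                 # center the scatter diagram

cov = Xc.T @ Xc / len(Xc)               # covariance matrix of the factors
eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]       # reorder: principal axes first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

Z = Xc @ eigvecs[:, :2]                 # project onto the first two axes

# Total inertia (sum of the variances) equals the sum of the eigenvalues.
assert np.isclose(eigvals.sum(), np.var(Xc, axis=0).sum())
print(eigvals[:2].sum() / eigvals.sum())  # fraction of inertia retained
```

Choosing how many axes to keep then amounts to choosing an acceptable loss of total inertia.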