Multivariate Statistics - MATLAB Recipes for Earth Sciences

even though nature rarely falls into discrete classes. Classification (or

categorization) is useful as it can, for example, help decision makers to take

necessary precautions to reduce risk, to drill an oil well, or to assign fossils

to a particular genus or species. Most classification methods make decisions

based on Boolean logic with two options, true or false; an example is the use

of a threshold value for identifying charcoal in microscope images (Section

8.11). Alternatively, fuzzy logic (which is not explained in this topic) is a

generalization of the binary Boolean logic with respect to many real world

problems in decision-making, where gradual transitions are reasonable

(Zadeh 1965, MathWorks 2014a).
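
The difference between a Boolean decision and a gradual transition can be sketched in a few lines of MATLAB. This is only an illustration under stated assumptions: the gray values are invented, charcoal is assumed to appear dark, and the threshold of 0.5 and the limits grayLow and grayHigh are hypothetical values that are not taken from Section 8.11.

```matlab
% Hypothetical gray values between 0 (dark) and 1 (bright); not real data
gray = [0.05 0.20 0.45 0.55 0.80 0.95];

% Boolean classification: each value is either charcoal (true) or not
% (false). The threshold of 0.5 is an arbitrary choice for illustration.
threshold = 0.5;
isCharcoal = gray < threshold;

% Gradual alternative: a membership value between 0 and 1 that falls
% linearly from 1 at grayLow to 0 at grayHigh (both limits are assumed).
grayLow = 0.3;
grayHigh = 0.7;
membership = min(max((grayHigh - gray) ./ (grayHigh - grayLow), 0), 1);

% Rows: gray value, Boolean class, gradual membership
disp([gray; double(isCharcoal); membership])
```

Values close to the threshold, such as 0.45 and 0.55, fall into opposite Boolean classes but receive similar intermediate membership values, which is the kind of gradual transition described above.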

The following sections introduce the most important techniques of

multivariate statistics: principal component analysis (PCA) and cluster

analysis (CA) in Sections 9.2 and 9.5, and independent component analysis

(ICA), which is a nonlinear extension of PCA, in Section 9.3. Section

9.4 introduces discriminant analysis (DA), which is a popular method

of classification in earth sciences. Section 9.6 introduces multiple linear

regression. These sections first provide an introduction to the theory behind

the various techniques and then demonstrate their use for analyzing earth

sciences data, using MATLAB functions (MathWorks 2014b).

9.2 Principal Component Analysis

Principal component analysis (PCA) detects linear dependencies between

variables and replaces groups of correlated variables with new, uncorrelated

variables referred to as the principal components (PCs). PCA was introduced

by Karl Pearson (1901) and further developed by Harold Hotelling (1931).

The performance of PCA is better illustrated with a bivariate data set than

with a multivariate data set. Figure 9.1 shows a bivariate data set that exhibits

a strong linear correlation between the two variables x and y in an orthogonal

xy coordinate system. The two variables have their individual univariate

means and variances (Chapter 3). The bivariate data set can be described by

the bivariate sample mean and the covariance (Chapter 4). The xy coordinate

system can be replaced by a new orthogonal coordinate system, where the

first axis passes through the long axis of the data scatter and the new origin

is the bivariate mean. This new reference frame has the advantage that the

first axis can be used to describe most of the variance, while the second axis

contributes only a small amount of additional information. Prior to this

transformation two axes were required to describe the data set, but it is now

possible to reduce the dimensions of the data by dropping the second axis

without losing very much information, as shown in Figure 9.1.
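
The change of coordinate system described above can be sketched with base MATLAB functions. This is a minimal illustration, not the data of Figure 9.1: the synthetic variables x and y, the sample size, the slope and the noise level are all assumptions, and the new axes are obtained here from an eigendecomposition of the covariance matrix, which is one common way of computing principal components.

```matlab
rng(0)                                 % fix the random seed for repeatability
n = 200;                               % assumed sample size
x = randn(n,1);                        % synthetic variable x
y = 0.8*x + 0.3*randn(n,1);            % y is strongly correlated with x
data = [x y];

% Move the origin of the coordinate system to the bivariate mean
centered = data - repmat(mean(data,1), n, 1);

% The eigenvectors of the covariance matrix define the new orthogonal axes
C = cov(centered);
[V, D] = eig(C);

% Sort the axes so that the first one carries the largest variance
[variances, order] = sort(diag(D), 'descend');
V = V(:, order);

% Coordinates of the data in the new reference frame
scores = centered * V;

% Percentage of the total variance described by each new axis
explained = 100 * variances / sum(variances);
fprintf('Variance described by axis 1: %.1f %%\n', explained(1))
fprintf('Variance described by axis 2: %.1f %%\n', explained(2))

% Dropping the second axis reduces the data set to one dimension
reduced = scores(:,1);
```

The covariance of the two columns of scores is zero apart from rounding error, so the new variables are uncorrelated. How much information is lost by keeping only reduced depends on the share of variance carried by the second axis.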

This process is now expanded to an arbitrary number of variables and
