Multivariate Statistics - MATLAB Recipes for Earth Sciences

It is important to evaluate the data properties prior to the application of a clustering algorithm. The absolute values of the variables should first be considered. For example, a geochemical sample from volcanic ash might show an SiO₂ content of around 77% and an Na₂O content of only 3.5%, but the Na₂O content may be considered to be of greater importance. In such a case the data need to be transformed so that they have means equal to zero (mean centering). Differences in both the variances and the means are corrected by standardizing, i.e., the data are standardized to means equal to zero and variances equal to one. Artifacts arising from closed data, such as artificial negative correlations, are avoided by using Aitchison's log-ratio transformation (Aitchison 1984, 1986). This ensures data independence and avoids the constant sum normalization constraints. The log-ratio transformation is

$$x_{tr} = \log\left(\frac{x_i}{x_d}\right)$$

where $x_{tr}$ denotes the transformed score ($i = 1, 2, 3, \ldots, d-1$) of some raw data $x_i$. The procedure is invariant under the group of permutations of the variables, and any variable can be used as the divisor $x_d$.
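The three transformations described above can be sketched on a small example. The matrix X below is hypothetical (it is not taken from the book's data sets), the last column is arbitrarily chosen as the divisor, and the arithmetic relies on implicit expansion, which assumes MATLAB R2016b or later.

```matlab
% Hypothetical closed data: three samples (rows), three components
% (columns), each row summing to 100%.
X = [77.0  3.5 19.5
     70.0  5.0 25.0
     60.0  4.0 36.0];

% Mean centering: subtract the column means.
Xc = X - mean(X);

% Standardizing: divide the centered data by the column standard
% deviations, giving means of zero and variances of one.
Xs = Xc ./ std(X);

% Log-ratio transformation, using the last column as the divisor x_d.
% The result has d-1 columns.
Xtr = log(X(:,1:end-1) ./ X(:,end));
```

Choosing a different divisor column changes the individual values of Xtr but not the number of transformed variables, which is always one fewer than in the raw data.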

As an exercise in performing a cluster analysis, the sediment data stored in sediments_3.txt are loaded. This data set contains the percentages of various minerals contained in sediment samples. The sediments are sourced from three rock types: a magmatic rock containing amphibole (amp), pyroxene (pyr) and plagioclase (pla), a hydrothermal vein characterized by the presence of fluorite (flu), sphalerite (sph) and galena (gal), some feldspars (plagioclase and potassium feldspars, ksp) and quartz, and a sandstone unit containing feldspars, quartz and clay minerals (cla). Ten samples were taken from various levels in this sedimentary sequence, each containing varying proportions of these minerals. First, the distances between pairs of samples can be computed. The function pdist provides many different measures of distance, such as the Euclidean or Manhattan (or city block) distance. We use the default setting, which is the Euclidean distance.

clear
data = load('sediments_3.txt');
Y = pdist(data);

The function pdist returns a vector Y containing the distances between each pair of observations in the original data matrix. We can visualize the distances in another pseudocolor plot, after converting the vector into a square distance matrix with squareform.

imagesc(squareform(Y)), colormap(hot)
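To illustrate how the choice of distance measure affects the output of pdist, the following minimal sketch compares the default Euclidean distance with the city block distance. The matrix X is a hypothetical toy example rather than the sediment data, and the code assumes that the Statistics and Machine Learning Toolbox, which provides pdist and squareform, is available.

```matlab
% Hypothetical data: three samples (rows) described by two variables.
X = [0 0
     3 4
     6 8];

% Distances for the sample pairs (1,2), (1,3) and (2,3).
Ye = pdist(X);               % Euclidean distance (the default)
Yc = pdist(X,'cityblock');   % Manhattan or city block distance

% Convert the distance vectors to symmetric square matrices with
% zeros on the diagonal.
De = squareform(Ye);
Dc = squareform(Yc);
```

For n samples the distance vector has n(n-1)/2 elements, so the three samples here yield three distances and the ten sediment samples would yield 45.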
