Environmental Engineering Reference
problem is considered a fundamental problem inherent in studies that use
spatially aggregated data, because the results are always affected by the areal
units used (Openshaw 1984). Its intensity and effects can be essentially
unpredictable in multivariate statistical analysis, where it is therefore a much
greater problem than in univariate or bivariate analysis (Fotheringham and Wong
1991). While the variations in statistical results caused by aggregating smaller
areal units into regions are generally well understood (e.g. Fotheringham and
Wong 1991), the zoning problem is much less well understood (Jelinski and Wu 1996).
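The dependence of results on the areal units can be illustrated with a minimal sketch. The values and the two zonings below are made up for illustration, and zone values are assumed to be simple means of the base units they contain.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def zone_means(values, zones):
    """Aggregate base-unit values into zones by taking the mean per zone."""
    return [sum(values[i] for i in zone) / len(zone) for zone in zones]

# Hypothetical values for eight base units arranged along a line.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 6, 5, 8, 7]

# Two ways of grouping the same base units into four contiguous zones.
zoning_a = [[0, 1], [2, 3], [4, 5], [6, 7]]
zoning_b = [[0], [1, 2, 3], [4, 5, 6], [7]]

r_base = pearson(x, y)
r_a = pearson(zone_means(x, zoning_a), zone_means(y, zoning_a))
r_b = pearson(zone_means(x, zoning_b), zone_means(y, zoning_b))
print(round(r_base, 3), round(r_a, 3), round(r_b, 3))
```

With these made-up numbers the correlation is about 0.90 for the base units, 1.00 under zoning A and about 0.97 under zoning B. The underlying data are identical in all three cases.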
The number of observations, or sample size, is very important for multivariate
statistical analysis. For geographically referenced data, the number of spatial
observation units is normally treated as the sample size from a statistical
perspective. In general, the number of observation units should be 5-10 times
the number of candidate independent variables (Brace et al. 2012). For example,
if a multivariate statistical model is intended to include 5 independent
variables, the number of observation units should be at least 25-50. Too many
or too few units could cause statistical models to over-fit.
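The rule of thumb above amounts to simple arithmetic. The sketch below uses hypothetical function names, and the default ratios of 5 and 10 are taken from the 5-10 range quoted in the text.

```python
def recommended_units(n_variables, low_ratio=5, high_ratio=10):
    """Return the (lower, upper) recommended number of observation units."""
    return n_variables * low_ratio, n_variables * high_ratio

def has_enough_units(n_units, n_variables, ratio=5):
    """Check whether n_units meets the minimum units-per-variable ratio."""
    return n_units >= ratio * n_variables

lower, upper = recommended_units(5)
print(lower, upper)             # 25 50, matching the example in the text
print(has_enough_units(20, 5))  # False: 20 units is below the minimum of 25
```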
Because the total number of observation units is quite limited in many
applications, one should be careful when selecting the candidate independent
variables to include in multivariate statistical analysis. As mentioned before,
many landscape metrics are perfectly or partially correlated with each other,
which can cause information duplication. Therefore, when using landscape metrics
as candidate independent variables to assess their impacts upon specific
ecological processes, it is important to identify a small number of landscape
metrics that do not duplicate each other but still capture the major landscape
properties.
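One possible way to screen out duplicated metrics is a greedy correlation filter, sketched below. This is one technique among several and is not prescribed by the text. The metric values are invented, and the 0.8 cut-off is an assumed threshold.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def select_metrics(metrics, threshold=0.8):
    """Keep a metric only if |r| with every already-kept metric is below threshold."""
    kept = []
    for name, values in metrics.items():
        if all(abs(pearson(values, metrics[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical metric values for six observation units.
metrics = {
    "patch_density": [1, 2, 3, 4, 5, 6],
    "edge_density": [2, 4, 6, 8, 10, 12],   # perfectly correlated with patch_density
    "shannon_diversity": [3, 1, 4, 1, 5, 2],
}
selected = select_metrics(metrics)
print(selected)
```

Because the filter is greedy, the result depends on the order in which the metrics are listed. Placing the metrics of greatest ecological interest first is one reasonable convention.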
A preprocessing procedure should be applied to all dependent and independent
variables. Because multivariate statistical analysis is sensitive to the
variance of samples and to data distribution, one should avoid using the raw
data directly, particularly for variables with a large statistical variance.
For some environmental variables, such as water or air quality, one should use
their average measurements by month, quarter or year. For landscape composition
metrics, one should use the relative proportion rather than the total number.
Before the actual statistical analysis, raw data should be logarithmically
transformed to improve their normality.
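The preprocessing steps described above can be sketched as follows. The function names are hypothetical. The base-10 logarithm and the offset of 1 (added so that zero values remain defined) are assumptions, since the text does not specify either.

```python
import math

def period_mean(measurements):
    """Average repeated measurements (e.g. all samples within one month)."""
    return sum(measurements) / len(measurements)

def to_proportions(class_totals):
    """Convert per-class totals (e.g. area per land-cover class) to proportions."""
    total = sum(class_totals)
    return [value / total for value in class_totals]

def log_transform(values, offset=1.0):
    """Apply log10(value + offset); the offset is an assumption to handle zeros."""
    return [math.log10(value + offset) for value in values]

# Hypothetical inputs for illustration only.
monthly_average = period_mean([4.0, 6.0, 8.0])
proportions = to_proportions([20, 30, 50])
transformed = log_transform([0, 9, 99])
print(monthly_average, proportions, transformed)
```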
Before a statistical model is established, one should check the normality,
multicollinearity, and spatial autocorrelation of the independent variables.
Data normality can be checked with the Kolmogorov-Smirnov test or graphically
using histograms and Q-Q plots. For variables that do not show a clear normal
distribution, one can transform the raw data logarithmically to improve
normality. Any statistical model that shows strong multicollinearity among the
independent variables should be used with caution. Spatial autocorrelation can
be measured using Moran's I or Geary's C. If strong spatial autocorrelation
exists, one should use the strategies suggested by Legendre (1993) to reduce
the spatial dependence.
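The two spatial autocorrelation indices named above can be computed directly from their standard definitions. The sketch below is minimal and assumes a binary contiguity weight matrix for units arranged along a line. The helper `chain_weights` and the sample values are hypothetical.

```python
def morans_i(values, weights):
    """Moran's I: (n / W) * sum_ij w_ij * d_i * d_j / sum_i d_i^2."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    return (n / w_sum) * cross / sum(d * d for d in dev)

def gearys_c(values, weights):
    """Geary's C: ((n - 1) / (2W)) * sum_ij w_ij * (x_i - x_j)^2 / sum_i d_i^2."""
    n = len(values)
    mean = sum(values) / n
    w_sum = sum(sum(row) for row in weights)
    sq_diff = sum(weights[i][j] * (values[i] - values[j]) ** 2
                  for i in range(n) for j in range(n))
    return ((n - 1) / (2 * w_sum)) * sq_diff / sum((v - mean) ** 2 for v in values)

def chain_weights(n):
    """Binary contiguity weights for n units in a row (neighbours share an edge)."""
    return [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]

w = chain_weights(4)
gradient = [1, 2, 3, 4]      # similar values sit next to each other
alternating = [1, 4, 1, 4]   # dissimilar values sit next to each other
print(morans_i(gradient, w), gearys_c(gradient, w))
print(morans_i(alternating, w), gearys_c(alternating, w))
```

Positive Moran's I (and Geary's C below 1) indicates that neighbouring units tend to have similar values, as in the gradient case. Negative Moran's I (and Geary's C above 1) indicates the opposite, as in the alternating case.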
When assessing the performance of different statistical models, one should pay
attention to the number of independent variables included. In general, the