Bolker 2008). Sometimes people log-transform count data (more precisely: y' = log(y + 1)), and find the new y' to be normally distributed.
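As a minimal sketch of this transformation (the counts are simulated here, so all object names are purely illustrative), it can be applied and checked in R as follows:

```r
## simulated, right-skewed count data (illustrative only)
set.seed(1)
y <- rpois(200, lambda = exp(rnorm(200, mean = 1, sd = 0.8)))

## y' = log(y + 1); log1p() computes log(1 + y) accurately for small counts
y_trans <- log1p(y)

## quick visual checks of the transformed response
hist(y_trans, main = "log(y + 1)-transformed counts", xlab = "y'")
qqnorm(y_trans); qqline(y_trans)
```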
Normally (Gaussian) distributed data show a normal distribution in the model
residuals and a straight 1:1 relationship in a QQ-plot of these residuals. Deviations
need to be accounted for, e.g. by transforming the data (any good introductory
textbook, such as Quinn and Keough (2002), will feature a section on transformations, including useful ones such as the Box-Cox 1 transformation).
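As a sketch of these checks (the data frame dat and the model formula are placeholders, not taken from the text), the residual QQ-plot and the Box-Cox profile can be produced like this:

```r
library(MASS)  # provides boxcox()

## hypothetical data frame 'dat' with a positive response y and a predictor x
fit <- lm(y ~ x, data = dat)

## QQ-plot of the model residuals: points should fall on the 1:1 line
qqnorm(resid(fit)); qqline(resid(fit))

## Box-Cox profile likelihood; the lambda at the maximum suggests a power
## transformation (lambda = 0 corresponds to a log-transformation)
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.1))
bc$x[which.max(bc$y)]
```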
When we have presence-only data (i.e. only locations where a species occurs but
no information where it does not), two alternative approaches are available. We
could use purpose-built presence-only methods, or we could use all locations
without a presence and call them absences (pseudo-absences). Both approaches
have their difficulties (Brotons et al. 2004; Pearce and Boyce 2006). The first suffers
from a lack of sound methods (in fact, following Tsoar et al. (2007) and Elith and
Graham (2009), I would currently only recommend MaxEnt 2 in this direction and
hope for the approach of Ward et al. (2009) to become publicly available). The
second approach lacks simulation tests on how to select pseudo-absences and how
to weight them (see Phillips et al. 2009 for the cutting edge in this field), although it
has been argued that the pseudo-absence approach can be as good as or better than the purpose-built presence-only methods (Zuo et al. 2008). In what follows, I only
consider presence-(pseudo)absence data.
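For the pseudo-absence route, a minimal sketch could draw random background locations and label them as absences. The data frames presences and background below are assumptions for illustration (coordinates of species records and of candidate locations, respectively), not objects from the text:

```r
## 'presences'  : data frame of coordinates (x, y) where the species was recorded
## 'background' : data frame of coordinates (x, y) of candidate locations,
##                e.g. grid-cell centres of the study region (assumed to be large)
set.seed(42)
n_pseudo <- 10 * nrow(presences)                 # a commonly used, but debated, ratio
pseudo   <- background[sample(nrow(background), n_pseudo), ]

## assemble a presence-(pseudo)absence data set for the regression-type models
dat <- rbind(
  data.frame(presences, occ = 1),
  data.frame(pseudo,    occ = 0)
)
```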
The Explanatory Variables
Explanatory variables may also require transforming! Consider a relevant explana-
tory variable which is highly skewed (e.g. log-normally distributed), as is commonly
the case for land-use proportions. A few high-value data points may completely dominate the fitted regression. To give a more balanced influence to all data points, we want the values of the predictors to be uniformly distributed over their range.
This will rarely be achievable, and researchers mostly settle for a more or less
symmetric distribution of the predictor. Note, however, that ideally we want most
data points where they help most. For a linear regression, the mean is always best
described, so we would want most data points at the lowest and highest ends of the range. For a non-linear function, for example a Michaelis-Menten-like saturation curve, we want most data points in the steep increase, while little is gained from many points at the high end once the maximum is reached. As a rule of thumb, we need many data points where a curve is changing its slope.
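A brief sketch of such a predictor transformation (the skewed land-use variable is simulated here, purely for illustration):

```r
## simulated, log-normally distributed predictor standing in for a skewed
## land-use variable (values and names are illustrative only)
set.seed(7)
landuse <- rlnorm(300, meanlog = -2, sdlog = 1)

## compare the raw and the log-transformed distribution of the predictor
op <- par(mfrow = c(1, 2))
hist(landuse,      main = "raw predictor",    xlab = "x")
hist(log(landuse), main = "log-transformed",  xlab = "log(x)")
par(op)
```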
Transformation of explanatory variables is particularly needed for regression-
type modelling approaches such as GLM and GAM (see below for explanation).
Regression trees (used, e.g. in Boosted Regression Trees, BRT, or randomForest)
are far less sensitive, if at all (Hastie et al. 2009). It is a good custom to make
1 boxcox in MASS (typewriter and bold are used to refer to a function and its R-package)
2 Phillips et al. (2006b): http://www.cs.princeton.edu/~schapire/maxent/