Bolker 2008). Sometimes people log-transform count data (more precisely: y' = log(y + 1)), and find the new y' to be normally distributed.
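As a minimal sketch of this transformation (the counts are simulated here, so all object names are purely illustrative), it can be applied and checked in R as follows:

```r
## simulated, right-skewed count data (illustrative only)
set.seed(1)
y <- rpois(200, lambda = exp(rnorm(200, mean = 1, sd = 0.8)))

## y' = log(y + 1); log1p() computes log(1 + y) accurately for small counts
y_trans <- log1p(y)

## quick visual checks of the transformed response
hist(y_trans, main = "log(y + 1)-transformed counts", xlab = "y'")
qqnorm(y_trans); qqline(y_trans)
```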
Normally (Gaussian) distributed data show a normal distribution in the model
residuals and a straight 1:1 relationship in a QQ-plot of these residuals. Deviations
need to be accounted for, e.g. by transforming the data (any good introductory
textbook, such as Quinn and Keough (2002), will feature a section on transformations, including useful ones such as the Box-Cox 1 transformation).
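As a sketch of these checks (the data frame dat and the model formula are placeholders, not taken from the text), the residual QQ-plot and the Box-Cox profile can be produced like this:

```r
library(MASS)  # provides boxcox()

## hypothetical data frame 'dat' with a positive response y and a predictor x
fit <- lm(y ~ x, data = dat)

## QQ-plot of the model residuals: points should fall on the 1:1 line
qqnorm(resid(fit)); qqline(resid(fit))

## Box-Cox profile likelihood; the lambda at the maximum suggests a power
## transformation (lambda = 0 corresponds to a log-transformation)
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.1))
bc$x[which.max(bc$y)]
```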
When we have presence-only data (i.e. only locations where a species occurs but
no information where it does not), two alternative approaches are available. We
could use purpose-built presence-only methods, or we could use all locations
without a presence and call them absences (pseudo-absences). Both approaches
have their difficulties (Brotons et al. 2004; Pearce and Boyce 2006). The first suffers
from a lack of sound methods (in fact, following Tsoar et al. (2007) and Elith and
Graham (2009), I would currently only recommend MaxEnt 2 in this direction and
hope for the approach of Ward et al. (2009) to become publicly available). The
second approach lacks simulation tests on how to select pseudo-absences and how
to weight them (see Phillips et al. 2009 for the cutting edge in this field), although it
has been argued that the pseudo-absence approach can be as good as or better than the purpose-built presence-only methods (Zuo et al. 2008). In what follows, I only
consider presence-(pseudo)absence data.
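For the pseudo-absence route, a minimal sketch could draw random background locations and label them as absences. The data frames presences and background below are assumptions for illustration (coordinates of species records and of candidate locations, respectively), not objects from the text:

```r
## 'presences'  : data frame of coordinates (x, y) where the species was recorded
## 'background' : data frame of coordinates (x, y) of candidate locations,
##                e.g. grid-cell centres of the study region (assumed to be large)
set.seed(42)
n_pseudo <- 10 * nrow(presences)                 # a commonly used, but debated, ratio
pseudo   <- background[sample(nrow(background), n_pseudo), ]

## assemble a presence-(pseudo)absence data set for the regression-type models
dat <- rbind(
  data.frame(presences, occ = 1),
  data.frame(pseudo,    occ = 0)
)
```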
The Explanatory Variables
Explanatory variables may also require transforming! Consider a relevant explana-
tory variable which is highly skewed (e.g. log-normally distributed), as is commonly
the case for land-use proportions. A few high-value data points may completely dominate the fitted regression. To give a more balanced influence to all data points, we want the values of the predictors to be uniformly distributed over their range.
This will rarely be achievable, and researchers mostly settle for a more or less
symmetric distribution of the predictor. Note, however, that ideally we want most
data points where they help most. For a linear regression, the mean is always best
described, so we would want most data points at the lowest and highest ends of the range. For a non-linear function, for example a Michaelis-Menten-like saturation curve, we want most data points in the steep increase, while little is gained from many points at the high end once the maximum is reached. As a rule of thumb, we need many data points where a curve is changing its slope.
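A brief sketch of such a predictor transformation (the skewed land-use variable is simulated here, purely for illustration):

```r
## simulated, log-normally distributed predictor standing in for a skewed
## land-use variable (values and names are illustrative only)
set.seed(7)
landuse <- rlnorm(300, meanlog = -2, sdlog = 1)

## compare the raw and the log-transformed distribution of the predictor
op <- par(mfrow = c(1, 2))
hist(landuse,      main = "raw predictor",    xlab = "x")
hist(log(landuse), main = "log-transformed",  xlab = "log(x)")
par(op)
```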
Transformation of explanatory variables is particularly needed for regression-
type modelling approaches such as GLM and GAM (see below for explanation).
Regression trees (used, e.g. in Boosted Regression Trees, BRT, or randomForest)
are far less sensitive, if at all (Hastie et al. 2009). It is a good custom to make
1 boxcox in MASS (typewriter and bold are used to refer to a function and its R-package)
2 Phillips et al. (2006b): http://www.cs.princeton.edu/~schapire/maxent/