species richness. This model proposal is then fitted to the data. In machine learning,
we propose only the set of predictors, but not the model structure. Here, an algorithm
builds a model proposal, fits it to a part of the data set and evaluates its performance
on the other part of the data. It then proposes a modification of the original model and
so forth. Machine-learning algorithms 11 differ in scope, origin, complexity, and
speed, but they all share this validation step which is used to steer the algorithm
towards a better model formulation. There are plenty of studies comparing different
modelling approaches (Guisan et al. 2007; Meynard and Quinn 2007; Pearson et al.
2006; Segurado and Araújo 2004). Rather than adding to them, we shall continue using
GLM and BRT as representatives of the two most commonly used approaches.
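The generic loop described above — propose a model, fit it to one part of the data, score it on the other part, and keep a modification only if the held-out score improves — can be sketched in a few lines. This is a minimal illustration on simulated data using forward predictor selection with ordinary least squares; the data, predictor names, and split are all invented for the example, not taken from any real SDM workflow:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "species richness" data: only predictors 0 and 2 carry signal.
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=200)

train, valid = slice(0, 150), slice(150, 200)  # fitting part / validation part

def valid_mse(cols):
    """Fit by least squares on the training part, score on the held-out part."""
    if not cols:
        pred = np.full(50, y[train].mean())
    else:
        beta, *_ = np.linalg.lstsq(X[train][:, cols], y[train], rcond=None)
        pred = X[valid][:, cols] @ beta
    return float(np.mean((y[valid] - pred) ** 2))

# The algorithm's loop: propose a modified model, refit, keep it only if
# performance on the held-out data improves; stop when nothing improves.
current, best = [], valid_mse([])
improved = True
while improved:
    improved = False
    for j in range(X.shape[1]):
        if j in current:
            continue
        score = valid_mse(current + [j])
        if score < best:
            current, best, improved = current + [j], score, True

print(sorted(current))
```

Real machine-learning algorithms (boosting, random forests, neural networks) differ enormously in how they propose modifications, but the validation-steered loop is the common skeleton.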
The choice of model type has much to do with availability of software, current
fashion and, of course, with the specific aim of the study. Further complications arise
if the survey design requires a mixed-model approach (e.g. due to repeated
measurements or surveys split across observers), if spatial autocorrelation needs to be
addressed, if zero-inflated distributions have to be employed, or if corrections for
detection probability shall be modeled. The more such requirements are imposed
on the model, the more likely GLMs become the only feasible method. 12 Alternatively, you
may want to go for a Bayesian SDM (see Latimer et al. 2006, for a primer).
If your data and model require an unusual combination of steps (say, zero-inflated
data with a nested design and spatial autocorrelation, while predictors are highly
correlated and many values are missing), and you develop a way to cook this dish,
then you should do (at least) two things: first, evaluate your method for its ability
to detect an effect that you know is there (“sensitivity”). Second, evaluate its
specificity, i.e. its ability to not detect effects that you know are absent. Both
evaluations should be amply replicated, should be based on simulated data (so that
you know the truth) and should (finally) confirm that your new method is reliable!
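Such a simulation-based check can be set up in a few lines. The sketch below uses a deliberately simple stand-in for "your method" — an ordinary regression slope test with a hard-coded approximate 5% critical value — purely to show the sensitivity/specificity bookkeeping; the effect size, replicate count, and critical value are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def slope_significant(x, y, t_crit=2.01):
    """Two-sided t-test for a regression slope; 2.01 approximates the
    5% critical value for ~48 degrees of freedom (an assumed constant)."""
    x, y = x - x.mean(), y - y.mean()
    beta = (x @ y) / (x @ x)
    resid = y - beta * x
    se = np.sqrt((resid @ resid) / (len(x) - 2) / (x @ x))
    return abs(beta / se) > t_crit

def detection_rate(effect, n_rep=200, n=50):
    """Simulate data with a known true slope; count how often it is detected."""
    hits = 0
    for _ in range(n_rep):
        x = rng.normal(size=n)
        y = effect * x + rng.normal(size=n)
        hits += slope_significant(x, y)
    return hits / n_rep

sensitivity = detection_rate(effect=0.6)  # the effect is truly there
false_pos = detection_rate(effect=0.0)    # the effect is truly absent
print(sensitivity, false_pos)
```

Because the data are simulated, the truth is known: a reliable method should detect the real effect in most replicates (high sensitivity) while flagging the absent effect only about as often as its nominal error rate allows (high specificity).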
Spatial Autocorrelation
Spatial autocorrelation (SAC) refers to the phenomenon that data points close to each
other in space are more alike than those further apart. For example, species richness in
a given site is likely to be similar to a site nearby, but very different from sites far away.
This is mainly due to the fact that the environment is more similar within a shorter
distance. Hence, SAC in the raw data (species occurrence) is a consequence of SAC in
the environment (topography, climate), something Legendre (1993) termed “spatial
11 http://www.machinelearning.org/ is a good place to start exploring this field.
12 Most of these “complications” can be handled by standard extensions of GLMs (see, e.g. Bolker
2008, and various dedicated R-packages). They will, however, make the model less stable, require
longer run times and still rely on getting the distribution right. There is, of course, the alternative of
Bayesian implementations. Since these are also fundamentally maximum likelihood approaches,
they are similar to sophisticated GLMs. In any case, there is no Bayesian Boosted Regression Tree
(not to speak of a combination with spatial terms and mixed effects). It runs against the Bayesian
philosophy to use boosting or bagging, and there is no efficient implementation either.
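The SAC described in this section is commonly quantified with Moran's I, a spatially weighted analogue of a correlation coefficient: values near +1 indicate that neighbouring sites are alike, values near 0 indicate no spatial structure. A minimal sketch on a simulated one-dimensional transect (the coordinates, the smooth environmental gradient, and the distance-1 neighbour rule are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy transect: 30 sites along a smooth environmental gradient, so
# nearby sites have similar richness -- i.e. the data show SAC.
coords = np.arange(30.0)
env = np.sin(coords / 5.0)
richness = env + rng.normal(scale=0.3, size=30)

# Binary neighbour weights: sites within distance 1 count as neighbours.
d = np.abs(coords[:, None] - coords[None, :])
W = ((d > 0) & (d <= 1.0)).astype(float)

# Moran's I: spatial covariance among neighbours relative to total variance.
z = richness - richness.mean()
I = (len(z) / W.sum()) * (z @ W @ z) / (z @ z)
print(round(I, 2))  # clearly positive: nearby sites are alike
```

Because the richness values inherit their spatial structure from the environmental gradient, the statistic comes out strongly positive here — exactly the situation of SAC in the response being driven by SAC in the environment.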