validating it on a new, independent data set (bias-variance trade-off: Hastie et al.
2009). Smaller models are more robust, i.e. less variable across data sets, at the expense
of greater bias and hence of explaining less of the variation in the data. The way to derive the “optimal” model size is
through cross-validation (CV). For some modelling approaches this is automati-
cally implemented, but the majority of model types require the user to carry out this
step. N-fold cross-validation encompasses a random assignment of data points to
N subsets, with N usually between 3 and 10. Care should be taken to have equal
prevalence in all subsets, e.g. by randomizing the 0s and 1s separately (stratified
randomization). The model is then fitted to N − 1 of the N subsets and evaluated
on (by predicting to) the remaining subset. This is repeated for all N subsets and
evaluations are averaged. Based on these values, we can select the best modelling
strategy (both model complexity and model type). An alternative approach is to
bootstrap the entire model building process and use bootstrapped measures of
model performance. Since a bootstrap requires several thousand runs, and a CV
only a few, CV is far more common.
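The stratified N-fold procedure described above can be sketched in a few lines. Everything here is a hypothetical stand-in: the simulated presence/absence data, the threshold "model" used in place of a real species distribution model, and the function names are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: one environmental predictor, binary (0/1) response.
x = rng.normal(size=200)
y = (x + rng.normal(scale=0.5, size=200) > 0).astype(int)

def stratified_folds(y, n_folds, rng):
    """Assign each observation to one of n_folds subsets, randomizing the
    0s and 1s separately so prevalence is similar in every fold."""
    folds = np.empty(len(y), dtype=int)
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        folds[idx] = np.arange(len(idx)) % n_folds
    return folds

def cv_accuracy(x, y, n_folds=5, rng=rng):
    """Fit on N - 1 folds, predict the held-out fold, average over folds."""
    folds = stratified_folds(y, n_folds, rng)
    scores = []
    for k in range(n_folds):
        train, test = folds != k, folds == k
        # Toy "model": threshold halfway between the class means on the
        # training folds (a placeholder for fitting a real model).
        thresh = (x[train][y[train] == 1].mean()
                  + x[train][y[train] == 0].mean()) / 2
        pred = (x[test] > thresh).astype(int)
        scores.append((pred == y[test]).mean())
    return np.mean(scores)

print(round(cv_accuracy(x, y), 2))
```

The same averaged score, computed for several candidate model sizes or model types, is what the selection in the text is based on.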
Information-theoretical approaches approximate this cross-validation analytically.
Hence Akaike's Information Criterion (AIC) and Schwarz's/Bayesian
Information Criterion (BIC) are implicitly also based on cross-validation. While
it is clear that too large a model will over-fit, and that too small a model will
not capture as much of the variation in the data as it should, the “true” model will
always remain elusive, and our “optimal” model will only be a caricature of the
truth. However, there is much to be learned from this caricature!
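Both criteria have simple closed forms. The sketch below uses their standard definitions (k estimated parameters, n observations, maximized log-likelihood); the log-likelihood values are invented purely for illustration:

```python
import math

def aic(log_lik, k):
    """Akaike's Information Criterion: 2k - 2 * log-likelihood."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Schwarz's/Bayesian Information Criterion: k * ln(n) - 2 * log-likelihood."""
    return k * math.log(n) - 2 * log_lik

# Two hypothetical fits to the same n = 100 observations: the larger model
# gains one unit of log-likelihood but pays a complexity penalty.
small = aic(log_lik=-120.0, k=3)  # 2*3 + 240 = 246
large = aic(log_lik=-119.0, k=6)  # 2*6 + 238 = 250
print(small < large)  # → True: the extra parameters do not pay for themselves
```

In both criteria, lower values indicate a better trade-off between fit and complexity; BIC penalizes extra parameters more heavily than AIC once n exceeds about 8 observations.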
Model Type
At this point we have to choose one (or more) method(s) to do our analysis with.
The good “traditional” approaches comprise Generalised Linear Models (GLM)
and Generalised Additive Models (GAM; Guisan and Zimmermann 2000). Discriminant
Analysis has largely been abandoned, as have Neural Networks and CARTs (Guisan
and Thuiller 2005). “Modern” approaches are often based on either multidimensional
extensions of GAMs (such as MARS and SVM) or machine-learning variations
of CART (such as BRT and randomForest: Hastie et al. 2009). Anyone using a
machine-learning method should first become familiar with it. Most published
comparisons of these methods are carried out on real data sets, where the truth is
unknown and performance must hence be assessed by cross-validation. These comparisons
show, broadly speaking, that model types sometimes differ dramatically in
performance, that each model type can be misused, and that both GLM and BRT are
reliable methods when used properly.
This is not the place to explain the differences between all of them (see Hastie
et al. 2009 for a recent and comprehensive description or Elith and Leathwick
2009a). It must suffice to make clear the main difference between the machine-learning
approach and “traditional” statistical models. In traditional models (e.g. GLM), we
specify the functional relationship between the response and its predictors. For
example, we decide to include precipitation as a non-linear predictor for plant