validating it on a new, independent data set (bias-variance trade-off: Hastie et al.
2009). Smaller models are more robust, i.e. less variable across data sets, at the expense
of greater bias and hence of explaining less of the variation in the data. The way to derive the “optimal” model size is
through cross-validation (CV). For some modelling approaches this is automati-
cally implemented, but the majority of model types require the user to carry out this
step. N-fold cross-validation encompasses a random assignment of data points to
N subsets, with N usually between 3 and 10. Care should be taken to have equal
prevalence in all subsets, e.g. by randomizing the 0s and 1s separately (stratified
randomization). The model is then fitted to N − 1 of the N subsets and evaluated
on (by predicting to) the remaining subset. This is repeated for all N subsets and
evaluations are averaged. Based on these values, we can select the best modelling
strategy (both model complexity and model type). An alternative approach is to
bootstrap the entire model building process and use bootstrapped measures of
model performance. Since a bootstrap requires several thousand runs, and a CV
only a few, CV is far more common.
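The stratified N-fold procedure described above can be sketched in a few lines. Everything here is a hypothetical stand-in: the simulated presence/absence data, the threshold "model" used in place of a real species distribution model, and the function names are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: one environmental predictor, binary (0/1) response.
x = rng.normal(size=200)
y = (x + rng.normal(scale=0.5, size=200) > 0).astype(int)

def stratified_folds(y, n_folds, rng):
    """Assign each observation to one of n_folds subsets, randomizing the
    0s and 1s separately so prevalence is similar in every fold."""
    folds = np.empty(len(y), dtype=int)
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        folds[idx] = np.arange(len(idx)) % n_folds
    return folds

def cv_accuracy(x, y, n_folds=5, rng=rng):
    """Fit on N - 1 folds, predict the held-out fold, average over folds."""
    folds = stratified_folds(y, n_folds, rng)
    scores = []
    for k in range(n_folds):
        train, test = folds != k, folds == k
        # Toy "model": threshold halfway between the class means on the
        # training folds (a placeholder for fitting a real model).
        thresh = (x[train][y[train] == 1].mean()
                  + x[train][y[train] == 0].mean()) / 2
        pred = (x[test] > thresh).astype(int)
        scores.append((pred == y[test]).mean())
    return np.mean(scores)

print(round(cv_accuracy(x, y), 2))
```

The same averaged score, computed for several candidate model sizes or model types, is what the selection in the text is based on.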
Information-theoretical approaches approximate this cross-validation analytically.
Hence Akaike's Information Criterion (AIC) and Schwarz's/Bayesian
Information Criterion (BIC) are implicitly also based on cross-validation. While
it is clear that too large a model will over-fit, and that too small a model will
not capture as much of the variation in the data as it should, the “true” model will
always remain elusive, and our “optimal” model will only be a caricature of the
truth. However, there is much to be learned from this caricature!
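Both criteria have simple closed forms. The sketch below uses their standard definitions (k estimated parameters, n observations, maximized log-likelihood); the log-likelihood values are invented purely for illustration:

```python
import math

def aic(log_lik, k):
    """Akaike's Information Criterion: 2k - 2 * log-likelihood."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Schwarz's/Bayesian Information Criterion: k * ln(n) - 2 * log-likelihood."""
    return k * math.log(n) - 2 * log_lik

# Two hypothetical fits to the same n = 100 observations: the larger model
# gains one unit of log-likelihood but pays a complexity penalty.
small = aic(log_lik=-120.0, k=3)  # 2*3 + 240 = 246
large = aic(log_lik=-119.0, k=6)  # 2*6 + 238 = 250
print(small < large)  # → True: the extra parameters do not pay for themselves
```

In both criteria, lower values indicate a better trade-off between fit and complexity; BIC penalizes extra parameters more heavily than AIC once n exceeds about 8 observations.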
Model Type
At this point we have to choose one (or more) method(s) to do our analysis with.
The good “traditional” approaches comprise Generalised Linear Models (GLM)
and Generalised Additive Models (GAM; Guisan and Zimmermann 2000). Discriminant
Analysis has largely been abandoned, as have Neural Networks and CARTs (Guisan
and Thuiller 2005). “Modern” approaches are often based on either multidimensional
extensions of GAMs (such as MARS and SVM) or machine-learning variations
of CART (such as BRT and randomForest: Hastie et al. 2009). Anyone using a
machine-learning method should first become familiar with it. Most published
comparisons of these methods are carried out on real data sets, where the truth is
unknown and performance must hence be assessed by cross-validation. These comparisons
show, broadly speaking, that model types sometimes differ dramatically in
performance, that each model type can be misused, and that both GLM and BRT are
reliable methods when used properly.
This is not the place to explain the differences between all of them (see Hastie
et al. 2009 for a recent and comprehensive description or Elith and Leathwick
2009a). It must suffice to make clear the main difference between the machine-learning
approach and “traditional” statistical models. In traditional models (e.g. GLM), we
specify the functional relationship between the response and its predictors. For
example, we decide to include precipitation as a non-linear predictor for plant