An alternative is to “filter” the data by importance. We can use a robust and capable technique to tell us which variables are important, and then continue with only those 5 or 12 variables filtered from the initial pool. If variables are uncorrelated, regression-tree-based methods are very useful for this, and I recommend randomForest and Boosted Regression Trees (BRT). If you plan to model your data with BRT anyway, there is little point in reducing the number of explanatory variables beforehand.
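As an illustration of this filtering step, here is a minimal sketch in Python with scikit-learn (rather than the R packages named above). The simulated data, the number of trees, and the cut-off of keeping the top two variables are all assumptions made for the example:

```python
# Sketch: rank candidate explanatory variables by random-forest importance
# and keep only the strongest ones before further modelling.
# (Python/scikit-learn stand-in for R's randomForest; data are simulated.)
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n, p = 300, 10
X = rng.normal(size=(n, p))                                     # 10 candidate variables
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=n)   # only 2 truly matter

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
ranking = np.argsort(rf.feature_importances_)[::-1]             # most important first

n_keep = 2                                                      # arbitrary cut-off for the sketch
keep = sorted(int(i) for i in ranking[:n_keep])
print("variables kept:", keep)
```

One would then refit the model of interest using only the retained columns of X.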
Finally, be aware that any model can only find correlations with the variables
provided. Of course, we know that our hypothetical bird of prey depends on specific
prey. Without this information, we may actually be modelling the niche of the prey,
not of the predator!
Exploratory Data Plotting
Can we finally start? No! It is both good practice and highly advisable to look at the
data by plotting them in any reasonable combination conceivable (see, e.g., Bolker
Plot thematically related explanatory variables as scatterplots (e.g., with the pairs function in R) to detect collinearity. Plot each explanatory variable against the response (henceforth called X and y, respectively) and look for nonlinear effects. Plot the data in parameter space, e.g. as a function of two Xs (Fig. 13.2), with a hull polygon around the data, to see that 40% or so of the parameter space is not covered by your data set. This is the area outside the convex hull in Fig. 13.2. The more variables (and hence dimensions) your data set
has, the more severe this problem becomes. It is so prominent among statisticians
(though not among ecologists) that it is referred to as the “curse of dimensionality”
(Bellman 1957; Hastie et al. 2009). Repeat this plotting for any number of variables.
Getting a feeling for the data is crucial, and many later errors can be avoided. Every
minute invested at this stage saves hours later on.
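Two of these checks can be sketched numerically as well as graphically. The following Python example (NumPy/SciPy; the simulated data, the correlation threshold of 0.7, and the uniform test grid are assumptions for illustration) screens variable pairs for collinearity and estimates, by Monte Carlo, how much of the bounding box lies outside the convex hull of the data:

```python
# Sketch: collinearity screen plus an estimate of the "empty" share of
# parameter space, i.e. the region outside the convex hull of the data.
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))                 # two explanatory variables

# 1) Collinearity screen: flag variable pairs with |r| above a threshold.
r = np.corrcoef(X, rowvar=False)
collinear = [(i, j)
             for i in range(X.shape[1])
             for j in range(i + 1, X.shape[1])
             if abs(r[i, j]) > 0.7]
print("strongly correlated pairs:", collinear)

# 2) Hull coverage: sample the bounding box uniformly and count points
#    outside the hull (Delaunay.find_simplex returns -1 outside).
tri = Delaunay(X)
lo, hi = X.min(axis=0), X.max(axis=0)
grid = rng.uniform(lo, hi, size=(20000, 2))
frac_outside = float(np.mean(tri.find_simplex(grid) < 0))
print(f"share of bounding box outside the data hull: {frac_outside:.0%}")
```

The same check generalises to more variables, where the empty share grows rapidly; this is exactly the curse of dimensionality the text refers to.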
13.2.2 Modelling
Here, we arbitrarily divide the process of deriving a “usable” model into two steps.
The first, model building, selects the variables to be included, the type of non-
linearity and order of interactions considered, and the criteria for selecting the final
complexity of a model. The second step, model parameterisation, uses the data to calculate the best estimates of the variables' effects. It is this parameterised model that we want to use for interpolation, hypothesis testing or extrapolation. Note that in some methods these two steps are taken care of implicitly, so that there is no explicit two-step process (mainly in machine learning, where model selection is done internally through cross-validation in order to prevent models from being “unreasonably”