learn a predictive model (such as a decision tree or a set of rules) that accurately
predicts this property.
Machine learning (and in particular predictive modelling) can be used to automate the construction of certain ecological models, such as models of habitat suitability and models of population dynamics, from measured data. The most popular machine learning techniques used for ecological modelling include decision tree induction (Breiman et al. 1984), rule induction (Clark and Boswell 1991), and neural networks (Lek and Guegan 1999).
This chapter first introduces the task of predictive modelling. It then describes the different types of decision trees (classification, regression and multi-target trees) and presents techniques for learning them. Finally, it gives examples of the use of decision trees in ecological modelling, including examples of both population dynamics and habitat suitability modelling.
14.2 The Machine Learning Task of Predictive Modelling
The input to a machine learning algorithm is most commonly a single flat table
comprising a number of fields (columns) and records (rows). In general, each row
represents an object and each column represents a property (of the object). In machine
learning terminology, rows are called examples and columns are called attributes (or
sometimes features). Attributes that have numeric (real) values are called continuous
attributes. Attributes that have nominal values are called discrete attributes.
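The flat-table input format can be sketched in a few lines of Python. The attribute names below (altitude, soil type, species presence) are hypothetical, chosen only to echo the ecological setting; each dictionary is one example (row) and each key is an attribute (column):

```python
# A minimal sketch of the flat-table input, using hypothetical attributes.
# Each dict is one example (row); each key is an attribute (column).
dataset = [
    {"altitude": 420.0,  "soil_type": "clay", "species_present": "yes"},
    {"altitude": 1310.5, "soil_type": "sand", "species_present": "no"},
    {"altitude": 760.0,  "soil_type": "loam", "species_present": "yes"},
]

# Numeric-valued attributes are continuous; nominal-valued ones are discrete.
continuous = [a for a, v in dataset[0].items() if isinstance(v, float)]
discrete = [a for a, v in dataset[0].items() if isinstance(v, str)]
```

Here "altitude" would be identified as a continuous attribute, while "soil_type" and "species_present" are discrete.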
The tasks of classification and regression are the two most commonly addressed
tasks in machine learning. They deal with predicting the value of one field from the
values of other fields. The target field is called the class (dependent variable in
statistical terminology). The other fields are called attributes (independent variables
in statistical terminology).
If the class is continuous, the task at hand is called regression. If the class is discrete (it has a finite set of nominal values), the task at hand is called classification. In both cases, a set of data (dataset) is taken as input, and a predictive model is generated. This model can then be used to predict values of the class for new data. The common term predictive modelling refers to both classification and regression.
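The distinction between the two tasks follows directly from the type of the class attribute. A minimal sketch, using Python value types as a stand-in for the continuous/discrete distinction (an illustrative heuristic, not part of the original text):

```python
def task_type(class_values):
    """Classification if the class is discrete (nominal values),
    regression if it is continuous (numeric values)."""
    if all(isinstance(v, float) for v in class_values):
        return "regression"
    return "classification"
```

For example, a class column of "yes"/"no" labels yields a classification task, while a column of measured abundances yields a regression task.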
Given a set of data (a table), only a part of it is typically used to generate (induce,
learn) a predictive model. This part is referred to as the training set. The remaining
(hold-out) part is reserved for evaluating the quality of the learned model and is
called the testing set. The testing set is used to estimate the quality of the model
when applied to unseen data, i.e. the predictive performance of the model.
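The train/test split described above can be sketched as follows; the function name and the 30% hold-out fraction are illustrative choices, not prescribed by the text:

```python
import random

def train_test_split(examples, test_fraction=0.3, seed=0):
    """Randomly hold out a fraction of the examples as a testing set;
    the rest form the training set used to induce the model."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # Return (training set, testing set).
    return shuffled[n_test:], shuffled[:n_test]
```

The testing set is then used only for evaluation, never for learning, so the resulting performance estimate reflects behaviour on unseen data.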
More reliable estimates of performance on new data (not seen in the process of
learning) are obtained by using cross-validation (Alpaydin 2010). Cross-validation
partitions the entire set of data into k (with k typically set to 10) subsets of roughly
equal size. Each of these subsets is in turn used as a testing set, with all of the
remaining data used as a training set. The performance figures for each of the testing
sets are averaged to obtain an overall estimate of the performance on unseen data.
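The cross-validation procedure just described can be sketched directly: partition the data into k roughly equal folds, use each fold in turn as the testing set, and average the resulting performance figures. The fold-assignment scheme below (round-robin by index) is one simple choice among several:

```python
def cross_validation_folds(examples, k=10):
    """Partition the examples into k roughly equal folds and yield
    (training set, testing set) pairs, one per fold."""
    folds = [examples[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

Averaging a performance measure (e.g. accuracy or error) over the k testing sets gives the overall estimate of performance on unseen data.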