learn a predictive model (such as a decision tree or a set of rules) that accurately
predicts this property.
Machine learning (and in particular predictive modelling) can be used to automate the construction of certain ecological models, such as models of habitat suitability and models of population dynamics, from measured data. The most popular machine learning techniques used for ecological modelling include decision tree induction (Breiman et al. 1984), rule induction (Clark and Boswell 1991), and neural networks (Lek and Guegan 1999).
This chapter first introduces the task of predictive modelling. It then describes the different types of decision trees (classification, regression and multi-target trees) and presents techniques for learning them. Finally, it gives examples of the use of decision trees in ecological modelling, including examples of both population dynamics and habitat suitability modelling.
14.2 The Machine Learning Task of Predictive Modelling
The input to a machine learning algorithm is most commonly a single flat table
comprising a number of fields (columns) and records (rows). In general, each row
represents an object and each column represents a property (of the object). In machine
learning terminology, rows are called examples and columns are called attributes (or
sometimes features). Attributes that have numeric (real) values are called continuous
attributes. Attributes that have nominal values are called discrete attributes.
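The flat-table input format can be sketched in a few lines of Python. The attribute names below (altitude, soil type, species presence) are hypothetical, chosen only to echo the ecological setting; each dictionary is one example (row) and each key is an attribute (column):

```python
# A minimal sketch of the flat-table input, using hypothetical attributes.
# Each dict is one example (row); each key is an attribute (column).
dataset = [
    {"altitude": 420.0,  "soil_type": "clay", "species_present": "yes"},
    {"altitude": 1310.5, "soil_type": "sand", "species_present": "no"},
    {"altitude": 760.0,  "soil_type": "loam", "species_present": "yes"},
]

# Numeric-valued attributes are continuous; nominal-valued ones are discrete.
continuous = [a for a, v in dataset[0].items() if isinstance(v, float)]
discrete = [a for a, v in dataset[0].items() if isinstance(v, str)]
```

Here "altitude" would be identified as a continuous attribute, while "soil_type" and "species_present" are discrete.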
The tasks of classification and regression are the two most commonly addressed
tasks in machine learning. They deal with predicting the value of one field from the
values of other fields. The target field is called the class (dependent variable in
statistical terminology). The other fields are called attributes (independent variables
in statistical terminology).
If the class is continuous, the task at hand is called regression. If the class is discrete (it has a finite set of nominal values), the task at hand is called classification. In both cases, a set of data (dataset) is taken as input, and a predictive model is generated. This model can then be used to predict values of the class for new data. The common term predictive modelling refers to both classification and regression.
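The distinction between the two tasks follows directly from the type of the class attribute. A minimal sketch, using Python value types as a stand-in for the continuous/discrete distinction (an illustrative heuristic, not part of the original text):

```python
def task_type(class_values):
    """Classification if the class is discrete (nominal values),
    regression if it is continuous (numeric values)."""
    if all(isinstance(v, float) for v in class_values):
        return "regression"
    return "classification"
```

For example, a class column of "yes"/"no" labels yields a classification task, while a column of measured abundances yields a regression task.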
Given a set of data (a table), only a part of it is typically used to generate (induce,
learn) a predictive model. This part is referred to as the training set. The remaining
(hold-out) part is reserved for evaluating the quality of the learned model and is
called the testing set. The testing set is used to estimate the quality of the model
when applied to unseen data, i.e. the predictive performance of the model.
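The train/test split described above can be sketched as follows; the function name and the 30% hold-out fraction are illustrative choices, not prescribed by the text:

```python
import random

def train_test_split(examples, test_fraction=0.3, seed=0):
    """Randomly hold out a fraction of the examples as a testing set;
    the rest form the training set used to induce the model."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # Return (training set, testing set).
    return shuffled[n_test:], shuffled[:n_test]
```

The testing set is then used only for evaluation, never for learning, so the resulting performance estimate reflects behaviour on unseen data.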
More reliable estimates of performance on new data (not seen in the process of
learning) are obtained by using cross-validation (Alpaydin 2010). Cross-validation
partitions the entire set of data into k (with k typically set to 10) subsets of roughly
equal size. Each of these subsets is in turn used as a testing set, with all of the
remaining data used as a training set. The performance figures for each of the testing
sets are averaged to obtain an overall estimate of the performance on unseen data.
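The cross-validation procedure just described can be sketched directly: partition the data into k roughly equal folds, use each fold in turn as the testing set, and average the resulting performance figures. The fold-assignment scheme below (round-robin by index) is one simple choice among several:

```python
def cross_validation_folds(examples, k=10):
    """Partition the examples into k roughly equal folds and yield
    (training set, testing set) pairs, one per fold."""
    folds = [examples[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

Averaging a performance measure (e.g. accuracy or error) over the k testing sets gives the overall estimate of performance on unseen data.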