variables in predicting the discharge response of a catchment might be some measure or measures of
rainfalls in the past and some way of characterising the antecedent state of the catchment.
Unfortunately, it might not be clear how best to define these control variables given the data available
in the training set. However, we can supply a large number of potential control variables to such a
classification algorithm and let it decide which are the most effective in classifying the responses. This is
the approach taken, for example, by Iorgulescu and Beven (2004) using a classification and regression tree
(CART) algorithm proposed by Breiman et al. (1984), which also originated in work on artificial
intelligence. This is one possible classificatory algorithm; different algorithms have different ways of
finding the most important control variable and forming the resulting classification tree. When applied to
this type of dynamic system, it is important that the control variables should reflect the dynamic history of
the catchment in setting up the response at a particular time step. The idea is to turn the dynamic response
into a static prediction problem at each time step. Thus, for a rainfall-runoff modelling problem, the
dynamics depend on rainfalls in the immediate past, rainfalls in an event over the time of concentration
of the catchment and longer term rainfalls and evapotranspiration in setting up the antecedent conditions.
Discharge at the current time and in the immediate past might also be useful in defining the current state
of the system.
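A minimal sketch of this idea of turning the dynamic response into a static prediction problem: from a synthetic rainfall series, candidate control variables are built by summing rainfall over trailing windows of increasing length and adding the discharge at the previous step. The window lengths, data and variable names here are illustrative assumptions, not those of Iorgulescu and Beven (2004).

```python
# Sketch: building candidate control variables for a tree algorithm by
# integrating rainfall over longer and longer trailing windows.
# All data are synthetic and the window lengths are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 500
rainfall = rng.gamma(shape=0.3, scale=5.0, size=n)            # toy hourly rainfall
discharge = np.convolve(rainfall, np.exp(-np.arange(48) / 12.0))[:n]  # toy response

def antecedent_sums(series, windows):
    """For each time step, sum the series over each trailing window length."""
    cols = []
    for w in windows:
        # full convolution truncated to n gives the sum of the last w values
        cols.append(np.convolve(series, np.ones(w))[: len(series)])
    return np.column_stack(cols)

windows = [1, 3, 6, 12, 24, 48, 96]              # increasingly long integration periods
X = antecedent_sums(rainfall, windows)           # candidate control variables
X = np.column_stack([X, np.roll(discharge, 1)])  # discharge at the previous step
y = discharge

# Each row of X is now a static description of the catchment state at one
# time step, among which a tree algorithm can choose the most effective columns.
print(X.shape)  # (500, 8)
```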
Iorgulescu and Beven set up more than 1000 potential control variables, based on integrating rainfalls
(and rainfall-potential evapotranspiration) over longer and longer time steps. The algorithm proceeds in
a top-down order, i.e. the first split at the root node divides the data into two sets based on a threshold
value of one of the control variables; each of these sets is then split into two (perhaps using a different
control variable threshold), and so on until a set of terminal nodes is defined. Various stopping criteria
can be used in the branches of the tree, such as stopping when a node contains a minimum number of
data values (usually no fewer than six) or when all the values assigned to it are identical. At each split,
the threshold value of the control variable is chosen to give the greatest explanation of the variability in
the data being split. This can be done in different ways: least-squares deviation splits are commonly
used, but the minimum sum of absolute deviations in the two sets can also be used; the latter is more
robust with respect to outliers (Breiman et al., 1984).
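The least-squares split criterion can be sketched as follows: for one candidate control variable, every distinct value is tried as a threshold and the one minimising the within-set sum of squared deviations is kept, subject to a minimum node size of six as mentioned above. The data and function names are illustrative assumptions.

```python
# Sketch of choosing a split threshold by the least-squares criterion.
# Replacing sse() with the sum of absolute deviations about the median
# gives the more outlier-robust alternative mentioned in the text.
import numpy as np

def sse(values):
    """Sum of squared deviations about the mean (zero for an empty set)."""
    return float(np.sum((values - values.mean()) ** 2)) if len(values) else 0.0

def best_split(x, y, min_node=6):
    """Return (threshold, cost) of the best least-squares split of y on x."""
    best_t, best_cost = None, np.inf
    for t in np.unique(x)[:-1]:          # every distinct value except the maximum
        left, right = y[x <= t], y[x > t]
        if len(left) < min_node or len(right) < min_node:
            continue                      # stopping rule: no node smaller than six
        cost = sse(left) + sse(right)
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

# Synthetic data with a step change at x = 4: the split should land near it.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = np.where(x < 4.0, 1.0, 5.0) + rng.normal(0, 0.1, 200)
t, cost = best_split(x, y)
print(t)  # threshold recovered near the true step at 4.0
```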
In time series, such as discharges, the values being split are not independent but exhibit a strong
autocorrelation, especially during recession periods. This affects the nature of the splits. Iorgulescu and Beven
(2004) tried to allow for this by considering an additional criterion at each split that maximised the
diversity of the descendent sets in the sense of containing values from different periods of the original
time series. Once each branch of the tree has reached a terminal node, it is normal to consider “pruning”
the tree back to try to reduce the impact of overfitting on the resulting predictions. The decision to prune
is based on the prediction errors. If removing a branch of the tree improves the prediction errors then that
branch can be pruned.
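One way to implement such error-based pruning is the cost-complexity scheme of Breiman et al. (1984), as exposed by scikit-learn: the full tree is grown, candidate pruning strengths are read from its cost-complexity path, and the strength that minimises error on held-out data is kept. This is a sketch under assumed synthetic data and model settings, not the exact procedure of the studies cited.

```python
# Sketch: prune a regression tree back by choosing the cost-complexity
# pruning strength (ccp_alpha) that minimises held-out prediction error.
# Data and model settings are illustrative assumptions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(400, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=400)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Candidate pruning strengths from the full tree's cost-complexity path
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

best_alpha, best_err = 0.0, np.inf
for alpha in path.ccp_alphas:
    tree = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    err = mean_squared_error(y_val, tree.predict(X_val))
    if err < best_err:                 # keep the pruning level with the
        best_alpha, best_err = alpha, err  # smallest validation error

pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=best_alpha).fit(X_tr, y_tr)
print(pruned.get_n_leaves())
```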
Note that the tree serves to directly map the input control variables (as indicative of the hydrological
system) to a set of output values in the terminal nodes of the final (pruned) tree. There are no parameters
or coefficients to be determined, only the threshold values for the control variables that determine the
branches of the tree. The sets of output values can be used to provide predictions in different ways. Most
commonly the median is used to provide a deterministic prediction (e.g. Figure 4.14), but the distribution
of values in that node can be used to provide some measure of the uncertainty associated with the outputs
under those conditions. The approach can also be extended to fit a model to the data values in each
node, which might help in improving predictions beyond the range of the training data (e.g.
Solomatine and Dulal, 2003).
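The use of the terminal-node distributions can be sketched with a fitted scikit-learn tree: each new input is mapped to its leaf, the leaf's median gives the deterministic prediction and its quantiles a simple uncertainty band. The data, quantile levels and helper name are illustrative assumptions.

```python
# Sketch: deterministic prediction from the median of a terminal node,
# with an uncertainty band from the node's quantiles.
# Data and model settings are illustrative assumptions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(300, 2))
y = X[:, 0] ** 2 + rng.normal(0, 2.0, 300)

tree = DecisionTreeRegressor(min_samples_leaf=6, random_state=0).fit(X, y)

# Group the training targets by the terminal node they fall into
leaf_of = tree.apply(X)
leaf_values = {leaf: y[leaf_of == leaf] for leaf in np.unique(leaf_of)}

def predict_with_bounds(x_new):
    """Median prediction plus a 5-95% band from the terminal node's values."""
    leaf = tree.apply(x_new.reshape(1, -1))[0]
    vals = leaf_values[leaf]
    return np.median(vals), np.quantile(vals, 0.05), np.quantile(vals, 0.95)

med, lo, hi = predict_with_bounds(np.array([5.0, 1.0]))
print(med, lo, hi)
```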
It is known that the partitioning of the tree can be affected by errors in the control variables and output
variable or by small changes to the training period. To mitigate this effect, an ensemble technique called
random forests can be used, either bootstrapping the data or randomising the choice of splits
(Breiman, 2001). Iorgulescu and Beven (2004) have applied the regression tree technique to predict the