variables in predicting the discharge response of a catchment might be some measure or measures of
rainfalls in the past and some way of characterising the antecedent state of the catchment.
Unfortunately, it might not be clear how best to define these control variables given the data available
in the training set. However, we can supply a large number of potential control variables to such a
classification algorithm and let it decide which are the most effective in classifying the responses. This is
the approach taken, for example, by Iorgulescu and Beven (2004) using a classification and regression tree
(CART) algorithm proposed by Breiman et al. (1984), which also originated in work on artificial
intelligence. This is one possible classificatory algorithm; different algorithms have different ways of
finding the most important control variable and forming the resulting classification tree. When applied to
this type of dynamic system, it is important that the control variables should reflect the dynamic history of
the catchment in setting up the response at a particular time step. The idea is to turn the dynamic response
into a static prediction problem at each time step. Thus, for a rainfall-runoff modelling problem, the
dynamics depend on rainfalls in the immediate past, rainfalls in an event over the time of concentration
of the catchment and longer term rainfalls and evapotranspiration in setting up the antecedent conditions.
Discharge at the current time and in the immediate past might also be useful in defining the current state
of the system.
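A minimal sketch of this idea of turning the dynamic response into a static prediction problem: from a synthetic rainfall series, candidate control variables are built by summing rainfall over trailing windows of increasing length and adding the discharge at the previous step. The window lengths, data and variable names here are illustrative assumptions, not those of Iorgulescu and Beven (2004).

```python
# Sketch: building candidate control variables for a tree algorithm by
# integrating rainfall over longer and longer trailing windows.
# All data are synthetic and the window lengths are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 500
rainfall = rng.gamma(shape=0.3, scale=5.0, size=n)            # toy hourly rainfall
discharge = np.convolve(rainfall, np.exp(-np.arange(48) / 12.0))[:n]  # toy response

def antecedent_sums(series, windows):
    """For each time step, sum the series over each trailing window length."""
    cols = []
    for w in windows:
        # full convolution truncated to n gives the sum of the last w values
        cols.append(np.convolve(series, np.ones(w))[: len(series)])
    return np.column_stack(cols)

windows = [1, 3, 6, 12, 24, 48, 96]              # increasingly long integration periods
X = antecedent_sums(rainfall, windows)           # candidate control variables
X = np.column_stack([X, np.roll(discharge, 1)])  # discharge at the previous step
y = discharge

# Each row of X is now a static description of the catchment state at one
# time step, among which a tree algorithm can choose the most effective columns.
print(X.shape)  # (500, 8)
```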
Iorgulescu and Beven set up more than 1000 potential control variables, based on integrating rainfalls
(and rainfall-potential evapotranspiration) over longer and longer time steps. The algorithm proceeds in
a top-down order, i.e. the first split at the root node divides the data into two sets based on a threshold
value of one of the control variables; each of these sets is then split into two (perhaps using a different
control variable threshold), and so on until a set of terminal nodes is defined. Various stopping criteria
can be used in the branches of the tree, such as stopping when a node contains a minimum number of
data values (usually no fewer than six) or when all the values assigned to it are identical. At each split,
the threshold value of the control variable is chosen to give the greatest explanation of the variability in
the data being split. This can be done in different ways: least-squares deviation splits are commonly
used, but the minimum sum of absolute deviations in the two sets can also be used; the latter is more
robust with respect to outliers (Breiman et al., 1984).
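The least-squares split criterion can be sketched as follows: for one candidate control variable, every distinct value is tried as a threshold and the one minimising the within-set sum of squared deviations is kept, subject to a minimum node size of six as mentioned above. The data and function names are illustrative assumptions.

```python
# Sketch of choosing a split threshold by the least-squares criterion.
# Replacing sse() with the sum of absolute deviations about the median
# gives the more outlier-robust alternative mentioned in the text.
import numpy as np

def sse(values):
    """Sum of squared deviations about the mean (zero for an empty set)."""
    return float(np.sum((values - values.mean()) ** 2)) if len(values) else 0.0

def best_split(x, y, min_node=6):
    """Return (threshold, cost) of the best least-squares split of y on x."""
    best_t, best_cost = None, np.inf
    for t in np.unique(x)[:-1]:          # every distinct value except the maximum
        left, right = y[x <= t], y[x > t]
        if len(left) < min_node or len(right) < min_node:
            continue                      # stopping rule: no node smaller than six
        cost = sse(left) + sse(right)
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

# Synthetic data with a step change at x = 4: the split should land near it.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 200)
y = np.where(x < 4.0, 1.0, 5.0) + rng.normal(0, 0.1, 200)
t, cost = best_split(x, y)
print(t)  # threshold recovered near the true step at 4.0
```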
In time series, such as discharges, the values being split are not independent but exhibit a strong
autocorrelation, especially during recession periods. This affects the nature of the splits. Iorgulescu and Beven
(2004) tried to allow for this by considering an additional criterion at each split that maximised the
diversity of the descendent sets in the sense of containing values from different periods of the original
time series. Once each branch of the tree has reached a terminal node, it is normal to consider “pruning”
the tree back to try to reduce the impact of overfitting on the resulting predictions. The decision to prune
is based on the prediction errors. If removing a branch of the tree improves the prediction errors then that
branch can be pruned.
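One way to implement such error-based pruning is the cost-complexity scheme of Breiman et al. (1984), as exposed by scikit-learn: the full tree is grown, candidate pruning strengths are read from its cost-complexity path, and the strength that minimises error on held-out data is kept. This is a sketch under assumed synthetic data and model settings, not the exact procedure of the studies cited.

```python
# Sketch: prune a regression tree back by choosing the cost-complexity
# pruning strength (ccp_alpha) that minimises held-out prediction error.
# Data and model settings are illustrative assumptions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(400, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=400)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# Candidate pruning strengths from the full tree's cost-complexity path
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

best_alpha, best_err = 0.0, np.inf
for alpha in path.ccp_alphas:
    tree = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    err = mean_squared_error(y_val, tree.predict(X_val))
    if err < best_err:                 # keep the pruning level with the
        best_alpha, best_err = alpha, err  # smallest validation error

pruned = DecisionTreeRegressor(random_state=0, ccp_alpha=best_alpha).fit(X_tr, y_tr)
print(pruned.get_n_leaves())
```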
Note that the tree serves to directly map the input control variables (as indicative of the hydrological
system) to a set of output values in the terminal nodes of the final (pruned) tree. There are no parameters
or coefficients to be determined, only the threshold values for the control variables that determine the
branches of the tree. The sets of output values can be used to provide predictions in different ways. Most
commonly the median is used to provide a deterministic prediction (e.g. Figure 4.14), but the distribution
of values in that node can be used to provide some measure of the uncertainty associated with the outputs
under those conditions. The approach can also be extended to fit a model to the data values in each
node, which might help in improving predictions beyond the range of the training data (e.g.
Solomatine and Dulal, 2003).
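The use of the terminal-node distributions can be sketched with a fitted scikit-learn tree: each new input is mapped to its leaf, the leaf's median gives the deterministic prediction and its quantiles a simple uncertainty band. The data, quantile levels and helper name are illustrative assumptions.

```python
# Sketch: deterministic prediction from the median of a terminal node,
# with an uncertainty band from the node's quantiles.
# Data and model settings are illustrative assumptions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(300, 2))
y = X[:, 0] ** 2 + rng.normal(0, 2.0, 300)

tree = DecisionTreeRegressor(min_samples_leaf=6, random_state=0).fit(X, y)

# Group the training targets by the terminal node they fall into
leaf_of = tree.apply(X)
leaf_values = {leaf: y[leaf_of == leaf] for leaf in np.unique(leaf_of)}

def predict_with_bounds(x_new):
    """Median prediction plus a 5-95% band from the terminal node's values."""
    leaf = tree.apply(x_new.reshape(1, -1))[0]
    vals = leaf_values[leaf]
    return np.median(vals), np.quantile(vals, 0.05), np.quantile(vals, 0.95)

med, lo, hi = predict_with_bounds(np.array([5.0, 1.0]))
print(med, lo, hi)
```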
It is known that the partitioning of the tree can be affected by errors in the control variables and output
variable or by small changes to the training period. To mitigate this effect, an ensemble technique called
random forests can be used, either bootstrapping the data or randomising the choice of splits
(Breiman, 2001). Iorgulescu and Beven (2004) have applied the regression tree technique to predict the