Model Data Selection and Data Pre-processing Approaches - Hydrological Data Driven Modelling


where σ²(y) is the variance of output y, which allows a judgment to be formed independent of the output range as to how well the output can be modeled by a smooth function. A V ratio close to zero indicates that there is a high degree of predictability of the given output y.

We can also determine the reliability of the Γ statistic by running a series of Gamma Tests for increasing M, to establish the size of data set required to produce a stable asymptote. This is known as the M-test. An M-test result would help to avoid the wasteful attempt of fitting the model beyond the stage where the MSE on the training data is smaller than Var(r), which may lead to overfitting. The M-test also helps to decide how much data might be required to build a model with an MSE which approximates the estimated noise variance. In practice, the Gamma Test can be achieved through the winGamma™ software implementation [21].
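The Γ statistic, the V ratio, and the M-test described above can be illustrated with a short sketch. It assumes the commonly used near-neighbour formulation of the Gamma Test: a linear regression of half the mean squared output differences on the mean squared input distances over the first p nearest neighbours, with the intercept taken as Γ. It is not the winGamma implementation. The function names gamma_test and m_test, the choice p = 10, the brute-force neighbour search, and the synthetic sine data are all assumptions made for illustration.

```python
import numpy as np

def gamma_test(x, y, p=10):
    """Return (Gamma, gradient, V ratio) from a near-neighbour regression."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    # Pairwise squared Euclidean distances (brute force, O(M^2) memory).
    diff = x[:, None, :] - x[None, :, :]
    d2 = np.einsum("ijk,ijk->ij", diff, diff)
    np.fill_diagonal(d2, np.inf)              # a point is not its own neighbour
    idx = np.argsort(d2, axis=1)[:, :p]       # indices of the p nearest neighbours
    # delta(k): mean squared input distance to the k-th nearest neighbour.
    delta = np.take_along_axis(d2, idx, axis=1).mean(axis=0)
    # gamma(k): half the mean squared output difference to that neighbour.
    gamma_k = 0.5 * ((y[idx] - y[:, None]) ** 2).mean(axis=0)
    # Regression line gamma = Gamma + gradient * delta; the intercept is Gamma.
    gradient, gamma = np.polyfit(delta, gamma_k, 1)
    v_ratio = gamma / np.var(y)               # Gamma scaled by the output variance
    return gamma, gradient, v_ratio

def m_test(x, y, sizes, p=10):
    """Gamma statistic computed on the first m points, for each m in sizes."""
    return [(m, gamma_test(x[:m], y[:m], p)[0]) for m in sizes]

# Synthetic example (assumed data): smooth function plus noise of known variance.
rng = np.random.default_rng(0)
M = 1000
noise_var = 0.01
x = rng.uniform(0.0, 2.0 * np.pi, size=(M, 1))
y = np.sin(x[:, 0]) + rng.normal(0.0, np.sqrt(noise_var), size=M)

gamma, gradient, v_ratio = gamma_test(x, y)
print(f"Gamma={gamma:.5f}  gradient={gradient:.3f}  V ratio={v_ratio:.4f}")

# M-test: watch the Gamma estimate settle as more data points are included.
curve = m_test(x, y, sizes=range(100, M + 1, 100))
for m, g in curve:
    print(f"M={m:5d}  Gamma={g:.5f}")
```

In this sketch the Γ estimate should settle near the noise variance used to generate the data as M grows. The value of M beyond which the curve stops changing appreciably is the data set size the M-test is meant to reveal.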

3.1.2 Assumptions in the Gamma Test

There are three principal assumptions associated with the GT, which are clearly stated in [21]. These major assumptions are:

1. The training set inputs are non-sparse in input space (i.e., the first nearest neighbor distances reduce as the number of training data points increases).
2. Each output is determined from the inputs by a deterministic process which is the same for both training and test sets.
3. Each output is subjected to statistical noise, the distribution of which may be different for different outputs but which is the same in both training and test sets for corresponding outputs.

3.1.3 Data Analysis Using Gamma Test

This section gives a brief description of how the GT can be used as a precursor to nonlinear time series modeling to identify the quality of the data used for modeling and to select the best features out of the available input data sets. The GT describes the feature sets (subsets) of available input data as masks. As an example of how the GT could be used for best feature selection, assume a set of input data series (x1, x2, and x3) and three masks [0, 1, 1], [1, 0, 1], and [1, 1, 0]. The mask [0, 1, 1] corresponds to the data set without series x1, and the other two masks correspond to the data sets without x2 and x3, respectively. In the context of feature selection, the 0 and 1 representation describes which data series are used (1) and which are not used (0). The GT analysis on the given masked data subsets would provide information such as the Gamma statistic (a measure of the best achievable MSE), the V ratio (a measure of the degree of predictability), and the Gradient (a measure of the complexity of the model). If the values of the Gamma statistic and V ratio are close to zero, it indicates that there is a
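The mask-based feature selection described in this section can be sketched as follows. The sketch evaluates the three masks from the example on synthetic data and compares their Gamma statistics. It assumes the near-neighbour regression form of the Gamma Test with p = 10 neighbours. The helper names gamma_test and apply_mask, and the synthetic output, which is constructed to depend on x2 and x3 only, are hypothetical choices for illustration.

```python
import numpy as np

def gamma_test(x, y, p=10):
    """Return (Gamma, gradient, V ratio) from a near-neighbour regression."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pairwise squared Euclidean distances (brute force).
    diff = x[:, None, :] - x[None, :, :]
    d2 = np.einsum("ijk,ijk->ij", diff, diff)
    np.fill_diagonal(d2, np.inf)              # exclude each point from its own neighbours
    idx = np.argsort(d2, axis=1)[:, :p]       # p nearest neighbours per point
    delta = np.take_along_axis(d2, idx, axis=1).mean(axis=0)
    gamma_k = 0.5 * ((y[idx] - y[:, None]) ** 2).mean(axis=0)
    gradient, gamma = np.polyfit(delta, gamma_k, 1)   # slope, intercept
    return gamma, gradient, gamma / np.var(y)

def apply_mask(inputs, mask):
    """Keep only the input columns whose mask entry is 1."""
    return inputs[:, np.asarray(mask, dtype=bool)]

# Assumed synthetic data: columns are x1, x2, x3; the output ignores x1.
rng = np.random.default_rng(1)
M = 1000
inputs = rng.uniform(0.0, 2.0 * np.pi, size=(M, 3))
y = np.sin(inputs[:, 1]) + 0.5 * inputs[:, 2] + rng.normal(0.0, 0.1, size=M)

# The three masks from the example: each one leaves out a single series.
masks = [(0, 1, 1), (1, 0, 1), (1, 1, 0)]
results = {mask: gamma_test(apply_mask(inputs, mask), y) for mask in masks}

for mask, (gamma, gradient, v_ratio) in results.items():
    print(f"mask={list(mask)}  Gamma={gamma:.4f}  "
          f"gradient={gradient:.3f}  V ratio={v_ratio:.4f}")

# The mask with the smallest Gamma statistic is the preferred feature subset.
best_mask = min(results, key=lambda m: results[m][0])
print("best mask:", list(best_mask))
```

Because the synthetic output does not depend on x1, the mask that drops x1 should give the smallest Gamma statistic, while dropping x2 or x3 leaves part of the output unexplained and inflates the estimate.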
