where σ²(y) is the variance of output y, which allows a judgment to be formed
independent of the output range as to how well the output can be modeled by a
smooth function. A V ratio close to zero indicates that there is a high degree of
predictability of the given output y.
We can also determine the reliability of the Γ statistic by running a series of Gamma
Tests for increasing M, to establish the size of data set required to produce a stable
asymptote. This is known as the M-test. An M-test result helps to avoid the
wasteful attempt of fitting the model beyond the stage where the MSE on the
training data is smaller than Var(r), which may lead to overfitting. The M-test also
helps to decide how much data might be required to build a model with an MSE
which approximates the estimated noise variance. In practice, the Gamma Test can
be carried out with the winGamma™ software implementation [21].
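
The M-test procedure described above can be sketched in a few lines of Python. This is a minimal illustration that assumes the usual near-neighbour formulation of the Gamma Test (a least-squares line through the pairs (δ(k), γ(k)) for k = 1..p, with the intercept taken as Γ and the slope as the gradient). It is not the winGamma implementation. The names gamma_test and m_test, the choice p = 10, and the synthetic sine data with noise variance 0.01 are all assumptions made for the example.

```python
import math
import random


def gamma_test(xs, ys, p=10):
    """Return (gamma, gradient, v_ratio) for inputs xs (tuples) and outputs ys.

    Assumed formulation: for k = 1..p, delta(k) is the mean squared distance
    to the k-th nearest neighbour in input space, and gamma(k) is half the
    mean squared difference between the corresponding outputs. The intercept
    of the least-squares line gamma = Gamma + A * delta estimates Var(r).
    """
    m = len(xs)
    delta_sum = [0.0] * p
    gamma_sum = [0.0] * p
    for i in range(m):
        # Brute-force neighbour search: (squared distance, index), self excluded.
        neighbours = sorted(
            (sum((a - b) ** 2 for a, b in zip(xs[i], xs[j])), j)
            for j in range(m) if j != i
        )[:p]
        for k, (d2, j) in enumerate(neighbours):
            delta_sum[k] += d2
            gamma_sum[k] += 0.5 * (ys[j] - ys[i]) ** 2
    deltas = [s / m for s in delta_sum]
    gammas = [s / m for s in gamma_sum]

    # Ordinary least-squares fit of gammas on deltas.
    d_mean = sum(deltas) / p
    g_mean = sum(gammas) / p
    sxx = sum((d - d_mean) ** 2 for d in deltas)
    sxy = sum((d - d_mean) * (g - g_mean) for d, g in zip(deltas, gammas))
    gradient = sxy / sxx
    gamma = g_mean - gradient * d_mean

    # V ratio: Gamma scaled by the variance of the output.
    y_mean = sum(ys) / m
    var_y = sum((y - y_mean) ** 2 for y in ys) / m
    return gamma, gradient, gamma / var_y


def m_test(xs, ys, sizes, p=10):
    """Run the Gamma Test on the first M points for each M in sizes."""
    return [(m, gamma_test(xs[:m], ys[:m], p)[0]) for m in sizes]


# Synthetic data (assumption): y = sin(x) + Gaussian noise with variance 0.01.
rng = random.Random(0)
xs = [(rng.uniform(0.0, 2.0 * math.pi),) for _ in range(600)]
ys = [math.sin(x[0]) + rng.gauss(0.0, 0.1) for x in xs]

m_test_results = m_test(xs, ys, sizes=[100, 200, 400, 600])
for m, g in m_test_results:
    print(f"M={m:4d}  Gamma={g:.4f}")
```

If the printed Γ values settle toward a stable level as M grows, that level is the noise-variance estimate, and the M at which it stabilises suggests how much data is needed. The brute-force neighbour search is quadratic in M and is only meant for small illustrative data sets.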
3.1.2 Assumptions in the Gamma Test
There are three principal assumptions associated with the GT, which are clearly
stated in [21]. These major assumptions are:
1. The training set inputs are non-sparse in input space (i.e., the first nearest
neighbor distances reduce as the number of training data points increases)
2. Each output is determined from the inputs by a deterministic process which is
the same for both training and test sets
3. Each output is subjected to statistical noise, the distribution of which may be
different for different outputs but which is the same in both training and test sets
for corresponding outputs.
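
The first assumption can be checked empirically on a data set before running the GT. The sketch below is one possible check, not a procedure taken from [21]: it computes the mean first nearest neighbor distance for growing subsets of the inputs and expects it to shrink. The function name mean_first_nn_distance and the uniformly random two-dimensional inputs are assumptions made for the example.

```python
import math
import random


def mean_first_nn_distance(points):
    """Mean Euclidean distance from each point to its nearest other point."""
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / len(points)


# Synthetic inputs (assumption): points drawn uniformly from the unit square.
rng = random.Random(1)
points = [(rng.random(), rng.random()) for _ in range(800)]

nn_distances = {
    m: mean_first_nn_distance(points[:m]) for m in (100, 200, 400, 800)
}
for m, d in nn_distances.items():
    print(f"M={m:4d}  mean first-NN distance={d:.4f}")
```

A mean distance that fails to decrease as M grows would suggest that the inputs are sparse in input space, in which case the Γ estimate should be treated with caution.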
3.1.3 Data Analysis Using Gamma Test
This section gives a brief description of how the GT can be used as a precursor to
nonlinear time series modeling, both to assess the quality of the data used for
modeling and to select the best features from the available input data sets. The GT
describes the feature sets (subsets) of available input data as masks. As an example
of how the GT can be used for feature selection, assume a set of input data series
(x₁, x₂, and x₃) and three masks [0, 1, 1], [1, 0, 1], and [1, 1, 0]. The mask [0, 1, 1]
corresponds to the data set without series x₁, and the other two masks correspond to
data sets without x₂ and x₃, respectively. In the context of feature selection, 1 marks
a data series that is used and 0 marks one that is not used. The GT analysis of the
masked data subsets provides information such as the Gamma statistic (a measure of
the best achievable MSE), the V ratio (a measure of the degree of predictability), and
the Gradient (a measure of the complexity of the model). If the
values of the Gamma statistic and V ratio are close to zero, it indicates that there is a
 