4.2 Convergence Classification
Convergence classification aims at estimating whether it can be shown that samples from one distribution are better on average than samples from another (e.g., whether a statistical test can show that the mean of one distribution is greater than the mean of the other). The basic idea is to observe how the p values develop as the number of samples increases. We have set up convergence estimation as a classification task: a classifier is trained on a set of positive and negative examples (different distribution vs. identical distribution) and can later be used to classify unseen p value series.
In our current implementation, we extract five straightforward features that are used for classification, together with a target attribute with two possible outcomes; a sketch of how these features can be computed follows the list:
- p_min: The minimal p value observed so far.
- p_max: The maximal p value observed so far.
- p_avg: The average of all observed p values.
- p_last: The most recent p value (computed over all samples seen so far).
- p_grad: The “gradient” of the p value development, relating the first and last known p values to the number of samples.
- class: Different or same distribution (diff/same).
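As an illustration, the following Python sketch computes these features from a series of observed p values. The function name and the exact definition of p_grad (here, the difference between the last and the first p value divided by the number of observed values) are assumptions made for the example and not taken from the original implementation.

```python
def extract_features(p_values):
    """Compute the five classification features from a series of p values.

    `p_values` is the list of p values observed so far, one per added
    sample (or batch of samples).
    """
    n = len(p_values)
    # "Gradient": change from the first to the last known p value,
    # relative to the number of observed values (an assumed definition).
    p_grad = (p_values[-1] - p_values[0]) / n
    return {
        "p_min": min(p_values),
        "p_max": max(p_values),
        "p_avg": sum(p_values) / n,
        "p_last": p_values[-1],
        "p_grad": p_grad,
    }
```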
To train the classifier, we apply the C4.5 algorithm for decision tree learning [13]. In our work, we have integrated the WEKA machine learning toolkit and use its J4.8 implementation of C4.5 [17].
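The actual implementation is based on WEKA's J4.8; the following Python sketch merely illustrates the training step, using scikit-learn's DecisionTreeClassifier (a CART-based learner, not C4.5) as a stand-in and synthetic p value series generated with a two-sample t-test. It relies on the extract_features helper sketched above; all data, names, and parameters are illustrative assumptions.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def p_value_series(shift, n_max=50):
    """Simulate the p value development of a two-sample t-test as samples arrive."""
    a = rng.normal(0.0, 1.0, n_max)
    b = rng.normal(shift, 1.0, n_max)
    return [ttest_ind(a[:n], b[:n]).pvalue for n in range(3, n_max + 1)]

# Positive examples: different distributions; negative examples: identical ones.
series = [(p_value_series(0.8), "diff") for _ in range(30)] + \
         [(p_value_series(0.0), "same") for _ in range(30)]

X = [list(extract_features(s).values()) for s, _ in series]
y = [label for _, label in series]

# Stand-in for WEKA's J4.8: a CART-based decision tree.
clf = DecisionTreeClassifier(max_depth=4).fit(X, y)

# Classify an unseen p value series.
print(clf.predict([list(extract_features(p_value_series(0.8)).values())]))
```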
4.3 Replication Prediction
While significance estimation only aims at classifying whether a significant statistical result is to be expected, the replication prediction task has the goal of estimating the number of replications needed to reach a significant result with a statistical test. Thus, in this case we face a numeric prediction task.
Various prediction methods could be applied to the data, e.g., from the field of time series prediction. For our initial experiments we decided to apply regression to the known series of p values in order to estimate the subsequent development. To this end, we use the R project implementation of the nonlinear least squares method (NLS) [14].
In order to fit a function to the provided data, we let the regression identify the coefficients a and b of the following formula: f(x) = 1 / (a + bx).
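The fit itself is performed with nls in R; purely as an illustration, the following Python sketch carries out the same kind of fit with scipy.optimize.curve_fit on a synthetic p value series. The data and starting values are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import curve_fit

def f(x, a, b):
    """Model of the p value development: f(x) = 1 / (a + b * x)."""
    return 1.0 / (a + b * x)

# Illustrative p value series (synthetic, not from the paper):
# a decaying curve with a small amount of noise.
rng = np.random.default_rng(1)
x = np.arange(1, 31)                                # replication index 1..30
p_values = f(x, 2.0, 0.4) + rng.normal(0, 0.01, x.size)

# Identify the coefficients a and b, analogous to R's nls().
(a, b), _ = curve_fit(f, x, p_values, p0=(1.0, 0.1))
print(f"a = {a:.3f}, b = {b:.3f}")
```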
The prediction of the number of necessary replications is done by computing the intersection point of the curve with the desired significance level α. Setting the function equal to α and solving for x yields the predicted number of replications: x = (1 - aα) / (αb).
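Continuing the sketch above, the predicted number of replications can be read off by intersecting the fitted curve with α. Rounding up to the next integer is a choice made for the example, not prescribed by the method.

```python
import math

alpha = 0.05
# Solving 1 / (a + b * x) = alpha for x gives x = (1 - a * alpha) / (alpha * b).
x_alpha = (1.0 - a * alpha) / (alpha * b)

# Predicted number of replications needed to reach the significance level.
n_replications = math.ceil(x_alpha)
print(n_replications)
```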
Figure 4 shows the development of the p values as well as the regression curve generated from the first 30 p values.