4.2 Convergence Classification
Convergence classification aims at estimating whether it can be shown that samples from one distribution are better on average than samples from another (e.g., whether a statistical test can show that the mean of one distribution is greater than the mean of the other). The basic idea is to observe how the p values develop as the number of samples increases. We have set up convergence estimation as a classification task: a classifier is trained on a set of positive and negative examples (different distribution vs. identical distribution) and can later be used to classify unseen p value series.
In our current implementation, we extract five straightforward features that are used for classification, together with a target attribute with two possible outcomes; a sketch of how these features can be computed follows the list:
- p_min: The minimal p value observed so far.
- p_max: The maximal p value observed so far.
- p_avg: The average of all observed p values.
- p_last: The most recent p value (computed over all samples seen so far).
- p_grad: The “gradient” of the p value development, relating the first and last known p values to the number of samples.
- class: Different or same distribution (diff/same).
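As an illustration, the following Python sketch computes these features from a series of observed p values. The function name and the exact definition of p_grad (here, the difference between the last and the first p value divided by the number of observed values) are assumptions made for the example and not taken from the original implementation.

```python
def extract_features(p_values):
    """Compute the five classification features from a series of p values.

    `p_values` is the list of p values observed so far, one per added
    sample (or batch of samples).
    """
    n = len(p_values)
    # "Gradient": change from the first to the last known p value,
    # relative to the number of observed values (an assumed definition).
    p_grad = (p_values[-1] - p_values[0]) / n
    return {
        "p_min": min(p_values),
        "p_max": max(p_values),
        "p_avg": sum(p_values) / n,
        "p_last": p_values[-1],
        "p_grad": p_grad,
    }
```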
To train the classifier, we apply the C4.5 algorithm for decision tree learning [13]. In our work, we have integrated the WEKA machine learning toolkit and use its J4.8 implementation of C4.5 [17].
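The actual implementation is based on WEKA's J4.8; the following Python sketch merely illustrates the training step, using scikit-learn's DecisionTreeClassifier (a CART-based learner, not C4.5) as a stand-in and synthetic p value series generated with a two-sample t-test. It relies on the extract_features helper sketched above; all data, names, and parameters are illustrative assumptions.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def p_value_series(shift, n_max=50):
    """Simulate the p value development of a two-sample t-test as samples arrive."""
    a = rng.normal(0.0, 1.0, n_max)
    b = rng.normal(shift, 1.0, n_max)
    return [ttest_ind(a[:n], b[:n]).pvalue for n in range(3, n_max + 1)]

# Positive examples: different distributions; negative examples: identical ones.
series = [(p_value_series(0.8), "diff") for _ in range(30)] + \
         [(p_value_series(0.0), "same") for _ in range(30)]

X = [list(extract_features(s).values()) for s, _ in series]
y = [label for _, label in series]

# Stand-in for WEKA's J4.8: a CART-based decision tree.
clf = DecisionTreeClassifier(max_depth=4).fit(X, y)

# Classify an unseen p value series.
print(clf.predict([list(extract_features(p_value_series(0.8)).values())]))
```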
4.3 Replication Prediction
While significance estimation only aims at classifying whether a significant statistical result is to be expected, the replication prediction task has the goal of estimating the number of replications needed to reach a significant result with a statistical test. Thus, in this case we face a numeric prediction task.
Various prediction methods could be applied to the data, e.g., from the field of time series prediction. For our initial experiments we decided to apply regression to the known series of p values in order to estimate the subsequent development. To this end, we use the R project implementation of the nonlinear least squares method (NLS) [14].
In order to fit a function to the provided data, we let the regression identify the coefficients a and b of the following formula: f(x) = 1 / (a + bx).
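The fit itself is performed with nls in R; purely as an illustration, the following Python sketch carries out the same kind of fit with scipy.optimize.curve_fit on a synthetic p value series. The data and starting values are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import curve_fit

def f(x, a, b):
    """Model of the p value development: f(x) = 1 / (a + b * x)."""
    return 1.0 / (a + b * x)

# Illustrative p value series (synthetic, not from the paper):
# a decaying curve with a small amount of noise.
rng = np.random.default_rng(1)
x = np.arange(1, 31)                                # replication index 1..30
p_values = f(x, 2.0, 0.4) + rng.normal(0, 0.01, x.size)

# Identify the coefficients a and b, analogous to R's nls().
(a, b), _ = curve_fit(f, x, p_values, p0=(1.0, 0.1))
print(f"a = {a:.3f}, b = {b:.3f}")
```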
The prediction of the number of necessary replications is done by computing the intersection point of the curve with the desired significance level α. Setting the function equal to α and solving for x yields the predicted number of replications: x = (1 - aα) / (αb).
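Continuing the sketch above, the predicted number of replications can be read off by intersecting the fitted curve with α. Rounding up to the next integer is a choice made for the example, not prescribed by the method.

```python
import math

alpha = 0.05
# Solving 1 / (a + b * x) = alpha for x gives x = (1 - a * alpha) / (alpha * b).
x_alpha = (1.0 - a * alpha) / (alpha * b)

# Predicted number of replications needed to reach the significance level.
n_replications = math.ceil(x_alpha)
print(n_replications)
```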
Figure 4 shows the development of the p values as well as the regression curve generated from the first 30 p values.